This article is a brief introduction to pandas, an open-source library built on top of the Python programming language. Wes McKinney began developing it in 2008. It provides data structures and operations that make data manipulation and analysis easier and more efficient. If you are an aspiring data scientist, you need to be comfortable with data exploration, cleaning, manipulation, and visualization, and pandas supports all of these. So, let’s get started with the basics.
Series
A pandas Series is a one-dimensional labeled array that can hold any data type, such as int, string, or arbitrary Python objects. Its axis labels are collectively known as the index. A Series can be created from a list, NumPy array, dict, or scalar value. The syntax is pd.Series(data=None, index=None, dtype=None, name=None, copy=False).
Let’s see examples to understand this.
import pandas as pd
import numpy as np
marks = [80, 90, 70, 95, 75]
series = pd.Series(
data=marks, index=["Math", "English", "Science", "Geography", "Arts"], name="scores"
)
print(series)
Output
Math         80
English      90
Science      70
Geography    95
Arts         75
Name: scores, dtype: int64
Note that the aliases are not required: you could import the libraries under any name, but by convention people use pd for pandas and np for NumPy. In the above example, we created a series from a list of marks obtained by a student in different subjects. The index holds the subject names, and the name of the series is “scores”. If we don’t provide an index, it defaults to a RangeIndex (0, 1, 2, …, n-1).
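The remaining construction options mentioned earlier (a NumPy array, a scalar, and the default index) can be sketched as follows. This is a minimal illustration; the variable names from_list, from_array, and from_scalar are made up for this example.

```python
import numpy as np
import pandas as pd

# No index given: pandas falls back to a RangeIndex (0, 1, 2, ...).
from_list = pd.Series([80, 90, 70], name="scores")
print(from_list.index)  # RangeIndex(start=0, stop=3, step=1)

# From a NumPy array, with explicit labels.
from_array = pd.Series(np.array([80, 90, 70]), index=["Math", "English", "Science"])
print(from_array["English"])  # 90

# From a scalar: the value is repeated for every label in the index.
from_scalar = pd.Series(50, index=["Math", "English", "Science"])
print(from_scalar.tolist())  # [50, 50, 50]
```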
Let’s now create the same series using dict.
import pandas as pd
import numpy as np
marks = {"Math": 80, "English": 90, "Science": 70, "Geography": 95, "Arts": 75}
series = pd.Series(data=marks, name="scores")
print(series)
Output
Math         80
English      90
Science      70
Geography    95
Arts         75
Name: scores, dtype: int64
The keys of the dict become the index, and the values become the data. If an index is also passed separately, only the dict entries whose keys appear in that index are included in the series, and any index label that is missing from the dict is assigned NaN. For example:
import pandas as pd
import numpy as np
marks = {"Math": 80, "English": 90, "Science": 70, "Geography": 95, "Arts": 75}
series = pd.Series(
data=marks, index=["Math", "Science", "English", "Ethics"], name="scores"
)
print(series)
Output
Math       80.0
Science    70.0
English    90.0
Ethics      NaN
Name: scores, dtype: float64
The following code shows how to access the elements of a series and how to slice it.
import pandas as pd
import numpy as np
marks = [80, 90, 70, 95, 75]
series = pd.Series(
data=marks, index=["Math", "English", "Science", "Geography", "Arts"], name="scores"
)
print(series.index) # finding the index(axis labels) of the series
print()
print(series.iloc[0]) # returns value based on integer positioning. Index starts from 0
print()
print(series.loc["Math"]) # returns value based on label positioning.
print()
print(series.iloc[:3]) # slice the series from 0 to 3(exclusive)
print()
print(series[:3]) # same as above
print()
print(series[-2:]) # slice the last two elements
print()
Output
Index(['Math', 'English', 'Science', 'Geography', 'Arts'], dtype='object')

80

80

Math       80
English    90
Science    70
Name: scores, dtype: int64

Math       80
English    90
Science    70
Name: scores, dtype: int64

Geography    95
Arts         75
Name: scores, dtype: int64
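One difference between the two accessors is easy to miss: a slice with iloc excludes the stop position, while a slice with loc includes the stop label. A small sketch using the same marks data (the names by_position and by_label are only for illustration):

```python
import pandas as pd

series = pd.Series(
    [80, 90, 70, 95, 75],
    index=["Math", "English", "Science", "Geography", "Arts"],
    name="scores",
)

# Position-based slice: the stop position (3) is excluded.
by_position = series.iloc[0:3]

# Label-based slice: the stop label ("Science") is included.
by_label = series.loc["Math":"Science"]

print(by_label.index.tolist())  # ['Math', 'English', 'Science']
print(by_position.equals(by_label))  # True
```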
DataFrame
The pandas DataFrame is a two-dimensional data structure in which data is arranged in rows and columns, like a table. Both the row and column axes are labeled. A DataFrame can hold columns of different data types, and its size is mutable: rows and columns can be added or removed. A DataFrame can be created from an ndarray, dict, Series, constant value, another DataFrame, etc. The syntax is pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False).
Consider the following example.
import pandas as pd
import numpy as np
data = {
"Subject": ["Math", "Science", "English", "Geography", "Arts"],
"Marks": [80, 90, 70, 85, 75],
"Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}
df = pd.DataFrame(data=data)
df
Output
  | Subject | Marks | Remarks |
0 | Math | 80 | Good |
1 | Science | 90 | Excellent |
2 | English | 70 | Average |
3 | Geography | 85 | Good |
4 | Arts | 75 | Average |
The above code creates a DataFrame containing the marks and remarks of a student in different subjects. The keys of the dict become the column labels. As with the row index, if column labels are not provided (for example, when the data is a plain NumPy array), they default to a RangeIndex (0, 1, 2, …, n-1).
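To see the default labels in action, here is a small sketch that builds a DataFrame from a NumPy array, first without labels and then with them. The labels "Term 1" and "Term 2" are invented for this illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[80, 90], [70, 85]])

# Neither index nor columns given: both default to a RangeIndex.
df_default = pd.DataFrame(values)
print(df_default.columns)  # RangeIndex(start=0, stop=2, step=1)

# Explicit row and column labels (illustrative names).
df_labeled = pd.DataFrame(
    values, index=["Term 1", "Term 2"], columns=["Math", "Science"]
)
print(df_labeled.loc["Term 2", "Science"])  # 85
```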
df.dtypes returns a series containing the data type of each column.
import pandas as pd
import numpy as np
data = {
"Subject": ["Math", "Science", "English", "Geography", "Arts"],
"Marks": [80, 90, 70, 85, 75],
"Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}
df = pd.DataFrame(data=data)
df.dtypes
Output
Subject    object
Marks       int64
Remarks    object
dtype: object
df.head(n) returns the first n rows. If no argument is given, it returns the first five rows.
import pandas as pd
import numpy as np
data = {
"Subject": ["Math", "Science", "English", "Geography", "Arts"],
"Marks": [80, 90, 70, 85, 75],
"Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}
df = pd.DataFrame(data=data)
df.head(2)
Output
  | Subject | Marks | Remarks |
0 | Math | 80 | Good |
1 | Science | 90 | Excellent |
Consider the following code, which shows how to access rows and columns and how to slice the DataFrame.
import pandas as pd
import numpy as np
data = {
"Subject": ["Math", "Science", "English", "Geography", "Arts"],
"Marks": [80, 90, 70, 85, 75],
"Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}
df = pd.DataFrame(data=data)
print(df.index) # returns the row labels
print()
print(df.columns) # returns the column labels
print()
print(df.T) # returns the transpose of the DataFrame
print()
print(df.iloc[0]) # returns the first row using integer positioning
print()
print(df.loc[0]) # returns the first row using the label positioning
print()
print(df["Subject"]) # returns the "Subject" column (here, the first column)
print()
print(df.iloc[2]["Subject"]) # returns the subject at position 2
print()
print(df.iloc[0:3]) # slice the first three records
Output
RangeIndex(start=0, stop=5, step=1)

Index(['Subject', 'Marks', 'Remarks'], dtype='object')

            0          1        2          3        4
Subject  Math    Science  English  Geography     Arts
Marks      80         90       70         85       75
Remarks  Good  Excellent  Average       Good  Average

Subject    Math
Marks        80
Remarks    Good
Name: 0, dtype: object

Subject    Math
Marks        80
Remarks    Good
Name: 0, dtype: object

0         Math
1      Science
2      English
3    Geography
4         Arts
Name: Subject, dtype: object

English

   Subject  Marks    Remarks
0     Math     80       Good
1  Science     90  Excellent
2  English     70    Average
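The chained lookup df.iloc[2]["Subject"] can also be written as a single selection that names the row and the column together. The sketch below compares the forms on the same data; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Subject": ["Math", "Science", "English", "Geography", "Arts"],
        "Marks": [80, 90, 70, 85, 75],
        "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
    }
)

# Chained indexing: first pick the row, then the column.
chained = df.iloc[2]["Subject"]

# Single-step selection by label: row label 2, column "Subject".
by_label = df.loc[2, "Subject"]

# Single-step selection by position: row 2, column 0.
by_position = df.iloc[2, 0]

print(chained, by_label, by_position)  # English English English

# Rows and several columns at once; the loc slice 0:2 includes label 2.
subset = df.loc[0:2, ["Subject", "Marks"]]
print(subset.shape)  # (3, 2)
```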
Here are some other operations that you can perform on a DataFrame.
import pandas as pd
import numpy as np
data = {
"Subject": ["Math", "Science", "English", "Geography", "Arts"],
"Marks": [80, 90, 70, 85, 75],
"Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}
df = pd.DataFrame(data=data)
print(df["Marks"].sum()) # calculates the sum of the "Marks" column
sorted_df = df.sort_values(
by="Subject", ascending=True
) # returns a new DataFrame sorted by the "Subject" column in ascending order; df itself is unchanged
print(df["Marks"].mean()) # calculates the mean of the "Marks" column
print(df["Marks"].max()) # returns the max
Output
400
80.0
90
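Because sort_values returns a new DataFrame rather than sorting in place, its result has to be stored or printed to be seen. A short sketch, sorting by "Marks" in descending order this time (the name sorted_by_marks is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Subject": ["Math", "Science", "English", "Geography", "Arts"],
        "Marks": [80, 90, 70, 85, 75],
        "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
    }
)

# sort_values returns a sorted copy; the original df keeps its order.
sorted_by_marks = df.sort_values(by="Marks", ascending=False)

print(sorted_by_marks["Subject"].tolist())
# ['Science', 'Geography', 'Math', 'Arts', 'English']
print(df["Subject"].tolist())
# ['Math', 'Science', 'English', 'Geography', 'Arts']
```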
Import and Export Data
Pandas allows us to import and export data in various file formats, for example, CSV, JSON, Excel, and plain text. The general pattern is pd.read_<type>() to read a file into a DataFrame and df.to_<type>() to write a DataFrame df to a file. Consider the following examples.
import pandas as pd
import numpy as np
df = pd.read_csv("sample.csv")
df.head()
Output
  | Subject | Marks | Remarks |
0 | Math | 80 | Good |
1 | Science | 90 | Excellent |
2 | English | 70 | Average |
3 | Geography | 85 | Good |
4 | Arts | 75 | Average |
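If you don’t have a sample.csv file at hand, the same call can be tried on in-memory text, since pd.read_csv also accepts a file-like object. The CSV content below is a made-up three-row excerpt for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Subject,Marks,Remarks
Math,80,Good
Science,90,Excellent
English,70,Average
"""

# io.StringIO wraps the string in a file-like object that read_csv can parse.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (3, 3)
print(df["Marks"].sum())  # 240
```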
Let’s see how to write to a csv file.
import pandas as pd
import numpy as np
data = {
"Subject": ["Math", "Science", "English", "Geography", "Arts"],
"Marks": [80, 90, 70, 85, 75],
"Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}
df = pd.DataFrame(data=data)
df.to_csv("output.csv")
By default, df.to_csv() also writes the index as an additional column. If you do not want to include it in the CSV file, set index=False, i.e. df.to_csv("output.csv", index=False).
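The effect of the index argument shows up when the file is read back. The sketch below writes to a temporary directory (the file names are placeholders) and, as a second format, round-trips the same data through JSON text.

```python
import io
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Subject": ["Math", "Science"], "Marks": [80, 90]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    df.to_csv(with_index)  # index written as an unnamed first column
    df.to_csv(without_index, index=False)  # index left out

    cols_with_index = pd.read_csv(with_index).columns.tolist()
    cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)  # ['Unnamed: 0', 'Subject', 'Marks']
print(cols_without_index)  # ['Subject', 'Marks']

# Same read/write pattern for JSON: one object per row.
json_text = df.to_json(orient="records")
restored = pd.read_json(io.StringIO(json_text))
print(restored["Marks"].tolist())  # [80, 90]
```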
Pandas offers a lot more than what is shown here, far more than one article can cover. It is one of the most important skills in data science, so check the official documentation to go deeper.