Introduction to Pandas

Today, we are going to take a brief look at pandas, an open-source library built on top of the Python programming language. Wes McKinney began developing it in 2008. It provides data structures and operations that make data manipulation and analysis easier and more efficient. If you are an aspiring data scientist, you need to be comfortable with data exploration, cleaning, manipulation, and visualization, and pandas supports all of these. So, let’s get started with the basics.

Series

A pandas Series is a one-dimensional labeled array that can hold any data type, such as integers, strings, or Python objects. Its axis labels are collectively known as the index. A Series can be created from a list, NumPy array, dict, or scalar value. The syntax is pd.Series(data=None, index=None, dtype=None, name=None, copy=False).

Let’s see examples to understand this.

import pandas as pd
import numpy as np

marks = [80, 90, 70, 95, 75]

series = pd.Series(
    data=marks, index=["Math", "English", "Science", "Geography", "Arts"], name="scores"
)

print(series)

Output

Math         80
English      90
Science      70
Geography    95
Arts         75
Name: scores, dtype: int64

Note that the alias pd is not mandatory, but by convention, people import pandas as pd and NumPy as np. In the above example, we have created a series from a list of marks obtained by a student in different subjects. The index holds the subject names, and the name of the series is “scores”. If we don’t provide an index, it defaults to a RangeIndex (0, 1, 2, …, n-1), where n is the number of elements.
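As a minimal sketch of the other ways to build a Series mentioned earlier, the snippet below creates one from a NumPy array without an index and one from a scalar. The variable names and values are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# From a NumPy array without an explicit index: labels default to 0..n-1
from_array = pd.Series(np.array([80, 90, 70]), name="scores")
print(from_array.index)  # RangeIndex(start=0, stop=3, step=1)

# From a scalar: the value is repeated once for every label in the index
from_scalar = pd.Series(100, index=["Math", "English"], name="full_marks")
print(from_scalar)
```

When the data is a scalar, the index determines how many elements the series has.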

Let’s now create the same series using dict.

import pandas as pd
import numpy as np


marks = {"Math": 80, "English": 90, "Science": 70, "Geography": 95, "Arts": 75}

series = pd.Series(data=marks, name="scores")

print(series)

Output

Math         80
English      90
Science      70
Geography    95
Arts         75
Name: scores, dtype: int64

The keys of the dict become the index, while the values become the data. If an index is provided separately, only the keys that occur in that index are included in the series, in the order given by the index. Index labels that are not in the dict are assigned NaN, which also converts the values to float64. For example:

import pandas as pd
import numpy as np

marks = {"Math": 80, "English": 90, "Science": 70, "Geography": 95, "Arts": 75}

series = pd.Series(
    data=marks, index=["Math", "Science", "English", "Ethics"], name="scores"
)

print(series)

Output

Math       80.0
Science    70.0
English    90.0
Ethics      NaN
Name: scores, dtype: float64

The following code shows how to access the elements of a series, slice it, and so on.

import pandas as pd
import numpy as np

marks = [80, 90, 70, 95, 75]

series = pd.Series(
    data=marks, index=["Math", "English", "Science", "Geography", "Arts"], name="scores"
)

print(series.index)  # finding the index(axis labels) of the series

print()

print(series.iloc[0])  # returns value based on integer positioning. Index starts from 0

print()

print(series.loc["Math"])  # returns value based on label positioning.

print()

print(series.iloc[:3])  # slice the series from 0 to 3(exclusive)

print()

print(series[:3])  # same as above

print()

print(series[-2:])  # slice the last two elements

print()

Output

Index(['Math', 'English', 'Science', 'Geography', 'Arts'], dtype='object')

80

80

Math       80
English    90
Science    70
Name: scores, dtype: int64

Math       80
English    90
Science    70
Name: scores, dtype: int64

Geography    95
Arts         75
Name: scores, dtype: int64
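One detail worth checking when slicing is that position-based slices exclude the stop, while label-based slices with .loc include it. The following is a small sketch of that difference using the same marks; the names by_position, by_label, and picked are illustrative.

```python
import pandas as pd

series = pd.Series(
    data=[80, 90, 70, 95, 75],
    index=["Math", "English", "Science", "Geography", "Arts"],
    name="scores",
)

by_position = series.iloc[0:2]           # positions 0 and 1; the stop is excluded
by_label = series.loc["Math":"Science"]  # Math, English, Science; the stop is included
picked = series.loc[["Math", "Arts"]]    # a list of labels selects exactly those entries

print(len(by_position), len(by_label), len(picked))
```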

DataFrame

The pandas DataFrame is a two-dimensional data structure in which data is arranged in rows and columns, like a table. Both the row and column axes are labeled. It can contain columns of different data types, and it is size-mutable: rows and columns can be added or removed. A DataFrame can be created from an ndarray, dict, Series, constant value, another DataFrame, etc. The syntax is pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False).

Consider the following example.

import pandas as pd

import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

df

Output

	Subject	Marks	Remarks
0	Math	80	Good
1	Science	90	Excellent
2	English	70	Average
3	Geography	85	Good
4	Arts	75	Average

The above code creates a DataFrame containing the marks and remarks of a student in different subjects. The keys of the dict become the column labels. Like the index, if column labels are not provided, they default to a RangeIndex (0, 1, 2, …, n-1).
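To see the default labels in practice, here is a minimal sketch that builds a DataFrame from a NumPy array, first with no labels and then with explicit ones. The array values and the labels "Term 1" and "Term 2" are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 2x3 array with no labels supplied for either axis
unlabeled = pd.DataFrame(np.array([[80, 90, 70], [85, 75, 95]]))
print(unlabeled.index)    # RangeIndex(start=0, stop=2, step=1)
print(unlabeled.columns)  # RangeIndex(start=0, stop=3, step=1)

# Same data with explicit row and column labels (the labels are illustrative)
labeled = pd.DataFrame(
    np.array([[80, 90, 70], [85, 75, 95]]),
    index=["Term 1", "Term 2"],
    columns=["Math", "Science", "English"],
)
print(labeled)
```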

df.dtypes returns a series containing the data type of each column.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

df.dtypes

Output

Subject    object
Marks       int64
Remarks    object
dtype: object

df.head(n) returns the first n rows. If no argument is given, it will return the first five rows.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

df.head(2)

Output

	Subject	Marks	Remarks
0	Math	80	Good
1	Science	90	Excellent
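The default of five rows only becomes visible on a longer table, so here is a short sketch with a hypothetical seven-row DataFrame (the subjects "A" to "G" and their marks are invented for illustration). It also shows df.tail(n), the counterpart that returns the last n rows.

```python
import pandas as pd

# Hypothetical seven-row frame, long enough to see the default of five rows
df = pd.DataFrame({"Subject": list("ABCDEFG"), "Marks": [80, 90, 70, 85, 75, 60, 95]})

print(len(df.head()))   # number of rows returned with no argument
print(len(df.head(2)))  # number of rows returned for n=2
print(df.tail(2))       # the last two rows
```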

Consider the following code that shows how to access rows and columns, slice the DataFrame, etc.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

print(df.index)  # returns the row labels

print()

print(df.columns)  # returns the column labels

print()

print(df.T)  # returns the transpose of the DataFrame

print()

print(df.iloc[0])  # returns the first row using integer positioning

print()

print(df.loc[0])  # returns the first row using the label positioning

print()

print(df["Subject"])  # returns the "Subject" column as a Series

print()

print(df.iloc[2]["Subject"])  # returns the subject at position 2

print()

print(df.iloc[0:3])  # slice the first three records

Output

RangeIndex(start=0, stop=5, step=1)

Index(['Subject', 'Marks', 'Remarks'], dtype='object')

            0          1        2          3        4
Subject  Math    Science  English  Geography     Arts
Marks      80         90       70         85       75
Remarks  Good  Excellent  Average       Good  Average

Subject    Math
Marks        80
Remarks    Good
Name: 0, dtype: object

Subject    Math
Marks        80
Remarks    Good
Name: 0, dtype: object

0         Math
1      Science
2      English
3    Geography
4         Arts
Name: Subject, dtype: object

English

   Subject  Marks    Remarks
0     Math     80       Good
1  Science     90  Excellent
2  English     70    Average
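Rows and columns can also be addressed together in a single .loc or .iloc call, and a list of column names selects several columns at once. The sketch below assumes the same student data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Subject": ["Math", "Science", "English", "Geography", "Arts"],
        "Marks": [80, 90, 70, 85, 75],
        "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
    }
)

subject = df.loc[2, "Subject"]     # row label 2, column "Subject"
mark = df.iloc[2, 1]               # row position 2, column position 1
subset = df[["Subject", "Marks"]]  # a list of column names returns a DataFrame

print(subject, mark)
print(subset.shape)
```

Passing the row and column in one call avoids the chained lookup of df.iloc[2]["Subject"] and reads the value in a single step.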

Here are some other operations that you can perform on a DataFrame.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

print(df["Marks"].sum())  # calculates the sum of the "Marks" column

# sort_values returns a new DataFrame sorted by the "Subject" column in
# ascending order; df itself is left unchanged
sorted_df = df.sort_values(by="Subject", ascending=True)

print(df["Marks"].mean())  # calculates the mean of the "Marks" column

print(df["Marks"].max())  # returns the maximum of the "Marks" column

Output

400

80.0

90
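Sorting produces no printed output unless the result is captured, because sort_values returns a sorted copy by default. As a small sketch (the name by_marks is illustrative), sorting by marks in descending order looks like this:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Subject": ["Math", "Science", "English", "Geography", "Arts"],
        "Marks": [80, 90, 70, 85, 75],
    }
)

by_marks = df.sort_values(by="Marks", ascending=False)

print(by_marks["Subject"].tolist())  # subjects ordered from highest to lowest mark
print(df["Subject"].tolist())        # the original order is untouched
```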

Import and Export Data

Pandas allows us to import and export data in various file formats, for example, CSV, JSON, Excel, and delimited text. The general pattern is pd.read_<type>() to read a file into a DataFrame and df.to_<type>() to write a DataFrame to a file. Consider the following examples.

import pandas as pd
import numpy as np

df = pd.read_csv("sample.csv")

df.head()

Output

	Subject	Marks	Remarks
0	Math	80	Good
1	Science	90	Excellent
2	English	70	Average
3	Geography	85	Good
4	Arts	75	Average
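The same pattern should carry over to other formats. As a hedged sketch, the snippet below round-trips a small DataFrame through JSON, using an in-memory buffer in place of a real file; the data and variable names are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({"Subject": ["Math", "Science"], "Marks": [80, 90]})

# to_json returns the JSON text as a string when no path is given
json_text = df.to_json(orient="records")

# Wrap the text in a file-like buffer so read_json can parse it
restored = pd.read_json(io.StringIO(json_text), orient="records")

print(json_text)
print(restored)
```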

Let’s see how to write to a csv file.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

df.to_csv("output.csv")

By default, df.to_csv() also writes the index as an additional column. If you do not want to include it in the csv file, pass index=False, i.e. df.to_csv("output.csv", index=False).
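The effect of the index argument is easiest to see by reading the result back. The following sketch writes to strings instead of files (to_csv returns the CSV text when no path is given) and uses a small illustrative DataFrame.

```python
import io

import pandas as pd

df = pd.DataFrame({"Subject": ["Math", "Science"], "Marks": [80, 90]})

# With no path argument, to_csv returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the text back shows the extra column created by the saved index
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
```

The first list contains an extra "Unnamed: 0" column holding the old index, while the second contains only the original columns.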

Pandas offers a lot more than what is shown here, and it can’t all be covered in one article. It is one of the most important tools in data science, so check the official documentation to go deeper.
