
Introduction to Pandas

June 22, 2020 PANDAS PYTHON 349

Today, we are going to take a brief look at Pandas. It is an open-source library built on top of the Python programming language. It was developed by Wes McKinney in 2008. It provides data structures and operations that make data manipulation and analysis easier and more efficient. If you are an aspiring data scientist, you need to be comfortable with data exploration, cleaning, manipulation, visualization, etc. Pandas lets you do all that. So, let’s get started with Pandas basics.

Series


A Pandas Series is a one-dimensional labeled array that can hold any data type, such as int, string, Python object, etc. Its axis labels are collectively known as the index. A Series can be created from a list, NumPy array, dict, or scalar value. The syntax is pd.Series(data=None, index=None, dtype=None, name=None, copy=False).

Let’s see examples to understand this.

import pandas as pd
import numpy as np

marks = [80, 90, 70, 95, 75]

series = pd.Series(
    data=marks, index=["Math", "English", "Science", "Geography", "Arts"], name="scores"
)

print(series)

Output

Math         80
English      90
Science      70
Geography    95
Arts         75
Name: scores, dtype: int64

Note that importing pandas as pd is not required, but by convention, people use the aliases pd and np for pandas and NumPy, respectively. In the above example, we have created a series using a list of marks obtained by a student in different subjects. The index represents the subject names, and the name of the series is “scores”. If we don’t provide an index, it defaults to a RangeIndex (0, 1, 2, …, n-1), where n is the number of elements.
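
The example above passes an explicit index. As a minimal sketch of the other cases mentioned earlier, the snippet below builds a series without an index, from a NumPy array, and from a scalar value. The variable names (no_index, from_array, from_scalar) are only illustrative.

```python
import numpy as np
import pandas as pd

marks = [80, 90, 70, 95, 75]

# No index given: pandas falls back to a RangeIndex (positions 0 to 4 here)
no_index = pd.Series(marks, name="scores")
print(no_index.index)

# From a NumPy array, with explicit labels
from_array = pd.Series(np.array([80, 90]), index=["Math", "English"])
print(from_array)

# From a scalar: the value is repeated for every label in the index
from_scalar = pd.Series(50, index=["Math", "English", "Science"])
print(from_scalar)
```

When the data is a scalar, the index decides how long the series is, so it should be supplied explicitly.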

Let’s now create the same series using dict.

import pandas as pd
import numpy as np


marks = {"Math": 80, "English": 90, "Science": 70, "Geography": 95, "Arts": 75}

series = pd.Series(data=marks, name="scores")

print(series)

Output

Math         80
English      90
Science      70
Geography    95
Arts         75
Name: scores, dtype: int64

The keys of the dict become the index, while the values become the data. If an index is provided separately, only the keys that occur in that index are included in the series, in the order given by the index. Labels in the index that are not in the dict are assigned NaN. For example:

import pandas as pd
import numpy as np

marks = {"Math": 80, "English": 90, "Science": 70, "Geography": 95, "Arts": 75}

series = pd.Series(
    data=marks, index=["Math", "Science", "English", "Ethics"], name="scores"
)

print(series)

Output

Math       80.0
Science    70.0
English    90.0
Ethics      NaN
Name: scores, dtype: float64
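
Because NaN is a floating-point value, the dtype in the output above is float64 rather than int64. Here is a small sketch, built on the same series, of how one might find and drop the missing labels with isna() and dropna(). The names missing and cleaned are just for illustration.

```python
import pandas as pd

marks = {"Math": 80, "English": 90, "Science": 70, "Geography": 95, "Arts": 75}

series = pd.Series(
    data=marks, index=["Math", "Science", "English", "Ethics"], name="scores"
)

# Boolean mask that is True wherever the value is missing
missing = series.isna()
print(missing)

# Labels that were requested in the index but are not keys of the dict
print(list(series[missing].index))

# Drop the missing entries, then cast the remaining values back to integers
cleaned = series.dropna().astype("int64")
print(cleaned)
```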

The following code shows how to access elements of series, slice the series, etc.

import pandas as pd
import numpy as np

marks = [80, 90, 70, 95, 75]

series = pd.Series(
    data=marks, index=["Math", "English", "Science", "Geography", "Arts"], name="scores"
)

print(series.index)  # finding the index(axis labels) of the series

print()

print(series.iloc[0])  # returns value based on integer positioning. Index starts from 0

print()

print(series.loc["Math"])  # returns value based on the label.

print()

print(series.iloc[:3])  # slice the series from 0 to 3(exclusive)

print()

print(series[:3])  # same as above

print()

print(series[-2:])  # slice the last two elements

print()

Output

Index(['Math', 'English', 'Science', 'Geography', 'Arts'], dtype='object')

80

80

Math       80
English    90
Science    70
Name: scores, dtype: int64

Math       80
English    90
Science    70
Name: scores, dtype: int64

Geography    95
Arts         75
Name: scores, dtype: int64
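
One detail worth knowing when slicing: iloc excludes the stop position, while a label-based slice with loc includes the stop label. The following is a minimal sketch using the same series. The names by_position and by_label are illustrative.

```python
import pandas as pd

series = pd.Series(
    data=[80, 90, 70, 95, 75],
    index=["Math", "English", "Science", "Geography", "Arts"],
    name="scores",
)

# Position-based slice: the stop position (3) is excluded
by_position = series.iloc[0:3]

# Label-based slice: the stop label ("Science") is included
by_label = series.loc["Math":"Science"]

# Both slices select Math, English and Science
print(by_position.equals(by_label))

# A list of labels selects those entries in the order given
print(series.loc[["Arts", "Math"]])
```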

 

DataFrame


The pandas DataFrame is a two-dimensional data structure in which the data is arranged in rows and columns, like a table. Both the row and column axes are labeled. It can contain columns of different data types, and its size is mutable, so rows and columns can be added or removed. A DataFrame can be created from an ndarray, dict, series, constant value, another DataFrame, etc. The syntax is pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False).

Consider the following example.

import pandas as pd

import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

df

Output

     Subject  Marks    Remarks
0       Math     80       Good
1    Science     90  Excellent
2    English     70    Average
3  Geography     85       Good
4       Arts     75    Average

The above code creates a DataFrame containing marks and remarks of a student in different subjects. The keys of the dict are used as the column labels. As with the index, if the column labels are not provided (for example, when the data is an ndarray), they default to a RangeIndex (0, 1, 2, …, n-1).
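
To illustrate the default labels, here is a short sketch that builds a DataFrame from a NumPy array, first without labels and then with them. The values and the labels "term1" and "term2" are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical marks: one row per term, one column per subject
values = np.array([[80, 90], [70, 85]])

# No labels given: both the rows and the columns get a RangeIndex
unlabeled = pd.DataFrame(values)
print(unlabeled.columns)

# Labels supplied explicitly through index= and columns=
labeled = pd.DataFrame(values, index=["term1", "term2"], columns=["Math", "Science"])
print(labeled)
```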

df.dtypes returns a series containing the datatypes of each column.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

df.dtypes

Output

Subject    object
Marks       int64
Remarks    object
dtype: object

 

df.head(n) returns the first n rows. If no argument is given, it will return the first five rows.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

df.head(2)

Output

   Subject  Marks    Remarks
0     Math     80       Good
1  Science     90  Excellent

Consider the following code that shows how to access rows and columns, slice the DataFrame, etc.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

print(df.index)  # returns the row labels

print()

print(df.columns)  # returns the column labels

print()

print(df.T)  # returns the transpose of the DataFrame

print()

print(df.iloc[0])  # returns the first row using integer positioning

print()

print(df.loc[0])  # returns the first row using the label positioning

print()

print(df["Subject"])  # returns the first column

print()

print(df.iloc[2]["Subject"])  # returns the subject at position 2

print()

print(df.iloc[0:3])  # slice the first three records

Output

RangeIndex(start=0, stop=5, step=1)

Index(['Subject', 'Marks', 'Remarks'], dtype='object')

            0          1        2          3        4
Subject  Math    Science  English  Geography     Arts
Marks      80         90       70         85       75
Remarks  Good  Excellent  Average       Good  Average

Subject    Math
Marks        80
Remarks    Good
Name: 0, dtype: object

Subject    Math
Marks        80
Remarks    Good
Name: 0, dtype: object

0         Math
1      Science
2      English
3    Geography
4         Arts
Name: Subject, dtype: object

English

   Subject  Marks    Remarks
0     Math     80       Good
1  Science     90  Excellent
2  English     70    Average
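
A few more access patterns on the same DataFrame are sketched below: picking a single cell with a row and column in one call, selecting several columns, and filtering rows by a condition. The threshold of 80 and the name high are arbitrary choices for this example.

```python
import pandas as pd

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

# Single cell by row label and column label
print(df.loc[2, "Subject"])

# Same cell by integer positions (row 2, column 0)
print(df.iloc[2, 0])

# Several columns at once: pass a list of column labels
print(df[["Subject", "Marks"]])

# Rows where a condition holds (boolean indexing)
high = df[df["Marks"] >= 80]
print(high)
```

Passing the row and the column together to loc or iloc does the lookup in one step, whereas df.iloc[2]["Subject"] first builds the whole row and then picks a value from it.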

Here are some other operations that you can perform on a DataFrame.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

print(df["Marks"].sum())  # calculates the sum of the "Marks" column

sorted_df = df.sort_values(
    by="Subject", ascending=True
)  # returns a new DataFrame sorted by "Subject" in ascending order; df is unchanged

print(df["Marks"].mean())  # calculates the mean of the "Marks" column

print(df["Marks"].max())  # returns the max

Output

400
80.0
90
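
Note that sort_values() returns a new, sorted DataFrame instead of changing df in place, and the snippet above never displays that result. The sketch below makes the behaviour visible by sorting on "Marks" in descending order. The name by_marks is only illustrative.

```python
import pandas as pd

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

# sort_values returns a new DataFrame; df keeps its original row order
by_marks = df.sort_values(by="Marks", ascending=False)
print(by_marks)

print(df["Subject"].tolist())  # original order
print(by_marks["Subject"].tolist())  # order after sorting by marks

# min() complements the max() call shown earlier
print(df["Marks"].min())
```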

 

Import and Export Data


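Pandas allows us to import and export data in various file formats, for example, csv, json, excel, etc. The general pattern is pd.read_<type>() to read a file and df.to_<type>() to write one, where df is a DataFrame (for example, pd.read_csv() and df.to_csv()). Plain text files are usually read with pd.read_csv() or pd.read_table(). Consider the following examples.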

import pandas as pd
import numpy as np

df = pd.read_csv("sample.csv")

df.head()

Output

     Subject  Marks    Remarks
0       Math     80       Good
1    Science     90  Excellent
2    English     70    Average
3  Geography     85       Good
4       Arts     75    Average

Let’s see how to write to a csv file.

import pandas as pd
import numpy as np

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

df.to_csv("output.csv")

By default, df.to_csv() also writes the index as an additional column. If you do not want to include it in the csv file, pass index=False, i.e. df.to_csv("output.csv", index=False).
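
To see the write and read steps together, here is a rough round-trip sketch. It writes the DataFrame to a temporary folder (an assumption made so the example leaves no files behind) and reads it back with pd.read_csv().

```python
import os
import tempfile

import pandas as pd

data = {
    "Subject": ["Math", "Science", "English", "Geography", "Arts"],
    "Marks": [80, 90, 70, 85, 75],
    "Remarks": ["Good", "Excellent", "Average", "Good", "Average"],
}

df = pd.DataFrame(data=data)

# Use a temporary directory so nothing is left on disk afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.csv")

    df.to_csv(path, index=False)  # skip the index column
    restored = pd.read_csv(path)  # read the file back into a DataFrame

print(restored.columns.tolist())
print(restored)
```

With index=False, the file contains only the three data columns, so the DataFrame read back has the same columns as the original.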

Pandas offers a lot more than what is shown here, and it can’t all be covered in one article. It is one of the most important skills in data science, so check its official documentation to master pandas.

