Introduction to Pandas for Data Analysis in Python

Introduction

Data processing is a very important part of analyzing data. It is because the data available are not always in the desired format.

Before analyzing data processing such as cleaning, merging, etc. is required. NumPy, SciPy, and Pandas are some of the tools available in python for data processing. Working with pandas is fast and easy compared to other tools. Pandas integrate with the matplotlib library which makes a very good tool for analyzing data.

 

Data structures

Pandas provide two useful data structures i.e Series and DataFrame.

Series

A Series is a one-dimensional array that can store various data types. The row label is called the index. Tuple, list, and dictionary can be converted into series.

Converting tuple into series

import pandas as pd
tup = ("Ram", 123, "Kathmandu", 50000)
df = pd.Series(tup)

#type of df
print("Type of df is: ", type(df))

#printing the series
print(df)

Output

Type of df is:  <class 'pandas.core.series.Series'>
0          Ram
1          123
2    Kathmandu
3        50000
dtype: object

Here, the index is set from 0 to 3. We can provide a custom index to this df as

import pandas as pd
tup = ("Ram", 123, "Kathmandu", 50000)
df = pd.Series(tup, index = ["Name", "Id", "Address", "Salary"])

#printing the series
print(df)

Output

Name             Ram
Id               123
Address    Kathmandu
Salary         50000
dtype: object

 

Converting list into series

import pandas as pd
lst = ["Ram", 123, "Kathmandu", 50000]
df = pd.Series(lst)
print(df)

Output

0          Ram
1          123
2    Kathmandu
3        50000
dtype: object

Same as a tuple,  conversion of the list to series, the index is set from 0 to 3. We can also set the custom index to df as

import pandas as pd
lst = ["Ram", 123, "Kathmandu", 50000]
df = pd.Series(lst, index = ["Name", "Id", "Address", "Salary"])
print(df)

Output

Name             Ram
Id               123
Address    Kathmandu
Salary         50000
dtype: object

 

Converting dictionary to series

import pandas as pd
dict = {"Name" : "Ram", "Id": 123,"Address":"Kathmandu", "Salary": 50000}
df = pd.Series(dict)
print(df)

Output

Name             Ram
Id               123
Address    Kathmandu
Salary         50000
dtype: object

 

Elements of series can be accessed using the index name as

print(df[["Name", "Address"]])

Output

Name             Ram
Address    Kathmandu
dtype: object

 

DataFrame

DataFrame is the most widely used data structure for pandas. Series is used to work with dimensional arrays only but DataFrame works with a two-dimensional array.

DataFrame has two indices i.e row index and column index. Text files and spreadsheets are read using DataFrame so it is mostly used as a data structure. A most common method of creating DataFrame is using a dictionary of equal length lists.

import pandas as pd
dict = {
			"Name" : ["Ram", "Shyam", "Gita"],
			"Id" : [1,2,3],
			"Address": ["KTM", "PKR", "BHK"],
			"Salary": [1000, 2000, 3000],
			}
			
df = pd.DataFrame(dict)
print(df)

Output

    Name  Id Address  Salary
0    Ram   1     KTM    1000
1  Shyam   2     PKR    2000
2   Gita   3     BHK    3000

We can see that the column indexes are set to Name, Id, Address, and Salary. But the row index is set from 0 to 2 and we can give a custom name to the row index as

df.index = ["One", "Two", "Three"]
print(df)

Output

        Name  Id Address  Salary
One      Ram   1     KTM    1000
Two    Shyam   2     PKR    2000
Three   Gita   3     BHK    3000

 

It is possible to set any value of DataFrame to row index using the set_index() function

df = df.set_index(["Name"])

Output

Name    Id Address  Salary
Ram     1     KTM    1000
Shyam   2     PKR    2000
Gita    3     BHK    3000

 

We can add value to the previous dataframe as

import pandas as pd
dict = {
			"Name" : ["Ram", "Shyam", "Gita"],
			"Id" : [1,2,3],
			"Address": ["KTM", "PKR", "BHK"],
			"Salary": [1000, 2000, 3000],
			}
			
df = pd.DataFrame(dict)

print("Original dataframe: \n")
print(df)
print("\n")

#adding value to dataframe
df["Post"] = ["Manager", "CEO", "worker"]

print("after addition of value to dataframe: \n")
print(df)

Output

Original dataframe:

    Name  Id Address  Salary
0    Ram   1     KTM    1000
1  Shyam   2     PKR    2000
2   Gita   3     BHK    3000


after addition of value to dataframe:

    Name  Id Address  Salary     Post
0    Ram   1     KTM    1000  Manager
1  Shyam   2     PKR    2000      CEO
2   Gita   3     BHK    3000   worker

 

We can also add common value to DataFrame

#adding common value to dataframe
df["Company"] = "CodeFires"

print("after addition of common value to dataframe: \n")
print(df)

Output

Original dataframe:

    Name  Id Address  Salary
0    Ram   1     KTM    1000
1  Shyam   2     PKR    2000
2   Gita   3     BHK    3000


after addition of common value to dataframe:

    Name  Id Address  Salary     Post    Company
0    Ram   1     KTM    1000  Manager  CodeFires
1  Shyam   2     PKR    2000      CEO  CodeFires
2   Gita   3     BHK    3000   worker  CodeFires

 

Same as Series we can set any value of DataFrame to the row index

print("Before adding row index: \n")
print(df)

#adding value "Name" to row index
df = df.set_index("Name")

print("After setting row index: \n")
print(df)

Output

Before adding row index:

    Name  Id Address  Salary     Post    Company
0    Ram   1     KTM    1000  Manager  CodeFires
1  Shyam   2     PKR    2000      CEO  CodeFires
2   Gita   3     BHK    3000   worker  CodeFires
After setting row index:

       Id Address  Salary     Post    Company
Name
Ram     1     KTM    1000  Manager  CodeFires
Shyam   2     PKR    2000      CEO  CodeFires
Gita    3     BHK    3000   worker  CodeFires

 

We can access data using row index and column index

#using column index
print("Data accessing using column index: \n")
print(df["Salary"])
print("\n")

#using row index
print("Data accessing using row index: \n")
print(df.loc["Gita", : ])

Output

Data accessing using column index:

Name
Ram      1000
Shyam    2000
Gita     3000
Name: Salary, dtype: int64


Data accessing using row index:

Id                 3
Address          BHK
Salary          3000
Post          worker
Company    CodeFires
Name: Gita, dtype: object

 

At first, we use a column index for accessing the data of DataFrame. We’ve used “Salary” as a column index and it printed out the salary of all persons.

For accessing the row index we used the loc attribute of DataFrame. We use “Gita” as a row index which printed out all the information about that specific person.

We can access specific data of DataFrame. Let’s say we want to access the salary of person Gita, so we can do this way

#accessing specific element
print(df.loc["Gita", "Salary"])

# Output
# 3000

 

Deleting data of DataFrame

Using del

#deleting data of dataframe
del df["Company"]
print("New DataFrame after delete: \n")
print(df)

Output

New DataFrame after delete:

       Id Address  Salary     Post
Name
Ram     1     KTM    1000  Manager
Shyam   2     PKR    2000      CEO
Gita    3     BHK    3000   worker

 

Using drop()

#deleting data of dataframe
print("New DataFrame after delete: \n")
print(df.drop("Salary", axis = 1))

Output

New DataFrame after delete:

       Id Address     Post    Company
Name
Ram     1     KTM  Manager  CodeFires
Shyam   2     PKR      CEO  CodeFires
Gita    3     BHK   worker  CodeFires

In the first case, we used del to delete the Company column and in the latter case, we used the drop function to delete the Salary column. Use axis = 1 to specify the column index. If we want to drop row-wise we can use index = 0.

 

Reading Files

We can read various spreadsheets using pandas. For demonstration, we will access CSV files. Also, make sure that the CSV file is located in the current working directory or pass the full path of that CSV file

import pandas as pd
df = pd.read_csv("order.csv")
print(df.head())

Output

   id  groupid  bookid  quantity
0   1        1       6         1
1   2        1       1         4
2   3        2      10         1
3   4        2       7        20
4   5        2       8         1

head() function prints first five data from CSV files.

 

Let’s see the last five data of our dataframe.

print(pd.tail())

Output

   id  groupid  bookid  quantity
3   4        2       7        20
4   5        2       8         1
5   6        3       1         4
6   7        3       6         5
7   8        3       7         2

tail() function prints out last five data from CSV file.

Actually, we can pass the numbers in the head() and tail() functions to tell them what number of data we want to access. Let’s print the first three and last three data from CSV file. Here’s how

print("First 2 data: \n")
#to print first three data from file
print(df.head(3))
print("\n")

print("Last 2 data: \n")
#to print last three data from file
print(df.tail(3))

Output

First 2 data:

   id  groupid  bookid  quantity
0   1        1       6         1
1   2        1       1         4
2   3        2      10         1


Last 2 data:

   id  groupid  bookid  quantity
5   6        3       1         4
6   7        3       6         5
7   8        3       7         2

 

We can get all the names of columns in datasets

#to print all the columns in dataset
print(df.columns.values)

Output

['id' 'groupid' 'bookid' 'quantity']

These are the columns present in this CSV file.

 

We also can print all the elements present in specific columns as

#to print all values in specific column
print(df["quantity"])

Output

0     1
1     4
2     1
3    20
4     1
5     4
6     5
7     2
Name: quantity, dtype: int64

 

Let’s say we want to know how many books have the same bookid. We can do this using value_counts() function

#to check no of books having same bookid
print(df["bookid"].value_counts())

Output

7     2
6     2
1     2
10    1
8     1
Name: bookid, dtype: int64

Observing this data, we can say that two books have bookid 1, 6, 7

Sometimes data could be missing. We can check is there any value corresponds to null or not using isnull() function

print(df["quantity"].isnull())

Output

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
Name: quantity, dtype: bool

Here there are no missing values in our dataset. If there are any missing values we can replace them with the mean of current data.

 

Indexing or slicing Dataframe

For slicing Dataframe we use iloc and loc method. Usually, we pass the number(index) of row and column to iloc method and for loc, we use the name of columns

Using iloc

Syntax

iloc[row_selection, column_selection]

Let’s take a look at how iloc is used to access the data from the dataframe

print(df.iloc[:4, 0:3])

Output

id  groupid  bookid
0   1        1       6
1   2        1       1
2   3        2      10
3   4        2       7

Here we got the output having 4 rows and three columns.

 

Using loc

Syntax

loc[row_selection, ["name of columns"]]

Let’s see how loc is used to access the data from the dataframe

print(df.loc[:, ["bookid"]])

Output

bookid
0       6
1       1
2      10
3       7
4       8
5       1
6       6
7       7

 

We can also access multiple columns as

print(df.loc[:, ["groupid","bookid", "quantity"]])

Output

       groupid  bookid  quantity
0        1       6         1
1        1       1         4
2        2      10         1
3        2       7        20
4        2       8         1
5        3       1         4
6        3       6         5
7        3       7         2

 

Filtering

Let’s say we want to filter the original dataframe in such a way that the quantity of book ordered is greater than 5. So here’s how it can be done

df2 = df[df["quantity"]>4]
print(df2)

Output

    id  groupid  bookid  quantity
3   4        2       7        20
6   7        3       6         5

 

If we take a look at this new dataframe we see that the row index is unordered that is because while filtering those indexes were taken out to the new dataframe along with other data. We can reset those index using reset_index() function

index  id  groupid  bookid  quantity
0      3   4        2       7        20
1      6   7        3       6         5

 

We’ve reset the index. Have we managed to reset the index actually? Let’s print the newly formed dataframe

#to reset index 
print(df2.reset_index())
print("\n")
#lets print df2
print(df2)

Output

index  id  groupid  bookid  quantity
0      3   4        2       7        20
1      6   7        3       6         5


   id  groupid  bookid  quantity
3   4        2       7        20
6   7        3       6         5

While applying the reset_index() function those row indexes got reset but it didn’t really affect the dataframe. So what if we want to actually make a change in dataframe

#to reset index 
df2.reset_index(inplace = True)

#lets print df2
print(df2)

Output

index  id  groupid  bookid  quantity
0      3   4        2       7        20
1      6   7        3       6         5

So by using inplace = True we can make changes in original (modified) dataframe

 

Conclusion

Pandas is a powerful python module for processing data. It is used to read, filter, and analyze spreadsheets. Raw data that is available may or may not be suitable for retrieving the conclusion from it. So they need to be filtered, fill the null values, sorted, grouped, and so on.

Happy Learning 🙂

Leave a Comment