Introduction
Data processing is a very important part of analyzing data. It is because the data available are not always in the desired format.
Before analyzing data processing such as cleaning, merging, etc. is required. NumPy, SciPy, and Pandas are some of the tools available in python for data processing. Working with pandas is fast and easy compared to other tools. Pandas integrate with the matplotlib library which makes a very good tool for analyzing data.
Data structures
Pandas provide two useful data structures i.e Series and DataFrame.
Series
A Series is a one-dimensional array that can store various data types. The row label is called the index. Tuple, list, and dictionary can be converted into series.
Converting tuple into series
import pandas as pd tup = ("Ram", 123, "Kathmandu", 50000) df = pd.Series(tup) #type of df print("Type of df is: ", type(df)) #printing the series print(df)
Output
Type of df is: <class 'pandas.core.series.Series'> 0 Ram 1 123 2 Kathmandu 3 50000 dtype: object
Here, the index is set from 0 to 3. We can provide a custom index to this df as
import pandas as pd tup = ("Ram", 123, "Kathmandu", 50000) df = pd.Series(tup, index = ["Name", "Id", "Address", "Salary"]) #printing the series print(df)
Output
Name Ram Id 123 Address Kathmandu Salary 50000 dtype: object
Converting list into series
import pandas as pd lst = ["Ram", 123, "Kathmandu", 50000] df = pd.Series(lst) print(df)
Output
0 Ram 1 123 2 Kathmandu 3 50000 dtype: object
Same as a tuple, conversion of the list to series, the index is set from 0 to 3. We can also set the custom index to df as
import pandas as pd lst = ["Ram", 123, "Kathmandu", 50000] df = pd.Series(lst, index = ["Name", "Id", "Address", "Salary"]) print(df)
Output
Name Ram Id 123 Address Kathmandu Salary 50000 dtype: object
Converting dictionary to series
import pandas as pd dict = {"Name" : "Ram", "Id": 123,"Address":"Kathmandu", "Salary": 50000} df = pd.Series(dict) print(df)
Output
Name Ram Id 123 Address Kathmandu Salary 50000 dtype: object
Elements of series can be accessed using the index name as
print(df[["Name", "Address"]])
Output
Name Ram Address Kathmandu dtype: object
DataFrame
DataFrame is the most widely used data structure for pandas. Series is used to work with dimensional arrays only but DataFrame works with a two-dimensional array.
DataFrame has two indices i.e row index and column index. Text files and spreadsheets are read using DataFrame so it is mostly used as a data structure. A most common method of creating DataFrame is using a dictionary of equal length lists.
import pandas as pd dict = { "Name" : ["Ram", "Shyam", "Gita"], "Id" : [1,2,3], "Address": ["KTM", "PKR", "BHK"], "Salary": [1000, 2000, 3000], } df = pd.DataFrame(dict) print(df)
Output
Name Id Address Salary 0 Ram 1 KTM 1000 1 Shyam 2 PKR 2000 2 Gita 3 BHK 3000
We can see that the column indexes are set to Name, Id, Address, and Salary. But the row index is set from 0 to 2 and we can give a custom name to the row index as
df.index = ["One", "Two", "Three"] print(df)
Output
Name Id Address Salary One Ram 1 KTM 1000 Two Shyam 2 PKR 2000 Three Gita 3 BHK 3000
It is possible to set any value of DataFrame to row index using the set_index() function
df = df.set_index(["Name"])
Output
Name Id Address Salary Ram 1 KTM 1000 Shyam 2 PKR 2000 Gita 3 BHK 3000
We can add value to the previous dataframe as
import pandas as pd dict = { "Name" : ["Ram", "Shyam", "Gita"], "Id" : [1,2,3], "Address": ["KTM", "PKR", "BHK"], "Salary": [1000, 2000, 3000], } df = pd.DataFrame(dict) print("Original dataframe: \n") print(df) print("\n") #adding value to dataframe df["Post"] = ["Manager", "CEO", "worker"] print("after addition of value to dataframe: \n") print(df)
Output
Original dataframe: Name Id Address Salary 0 Ram 1 KTM 1000 1 Shyam 2 PKR 2000 2 Gita 3 BHK 3000 after addition of value to dataframe: Name Id Address Salary Post 0 Ram 1 KTM 1000 Manager 1 Shyam 2 PKR 2000 CEO 2 Gita 3 BHK 3000 worker
We can also add common value to DataFrame
#adding common value to dataframe df["Company"] = "CodeFires" print("after addition of common value to dataframe: \n") print(df)
Output
Original dataframe: Name Id Address Salary 0 Ram 1 KTM 1000 1 Shyam 2 PKR 2000 2 Gita 3 BHK 3000 after addition of common value to dataframe: Name Id Address Salary Post Company 0 Ram 1 KTM 1000 Manager CodeFires 1 Shyam 2 PKR 2000 CEO CodeFires 2 Gita 3 BHK 3000 worker CodeFires
Same as Series we can set any value of DataFrame to the row index
print("Before adding row index: \n") print(df) #adding value "Name" to row index df = df.set_index("Name") print("After setting row index: \n") print(df)
Output
Before adding row index: Name Id Address Salary Post Company 0 Ram 1 KTM 1000 Manager CodeFires 1 Shyam 2 PKR 2000 CEO CodeFires 2 Gita 3 BHK 3000 worker CodeFires After setting row index: Id Address Salary Post Company Name Ram 1 KTM 1000 Manager CodeFires Shyam 2 PKR 2000 CEO CodeFires Gita 3 BHK 3000 worker CodeFires
We can access data using row index and column index
#using column index print("Data accessing using column index: \n") print(df["Salary"]) print("\n") #using row index print("Data accessing using row index: \n") print(df.loc["Gita", : ])
Output
Data accessing using column index: Name Ram 1000 Shyam 2000 Gita 3000 Name: Salary, dtype: int64 Data accessing using row index: Id 3 Address BHK Salary 3000 Post worker Company CodeFires Name: Gita, dtype: object
At first, we use a column index for accessing the data of DataFrame. We’ve used “Salary” as a column index and it printed out the salary of all persons.
For accessing the row index we used the loc attribute of DataFrame. We use “Gita” as a row index which printed out all the information about that specific person.
We can access specific data of DataFrame. Let’s say we want to access the salary of person Gita, so we can do this way
#accessing specific element print(df.loc["Gita", "Salary"]) # Output # 3000
Deleting data of DataFrame
Using del
#deleting data of dataframe del df["Company"] print("New DataFrame after delete: \n") print(df)
Output
New DataFrame after delete: Id Address Salary Post Name Ram 1 KTM 1000 Manager Shyam 2 PKR 2000 CEO Gita 3 BHK 3000 worker
Using drop()
#deleting data of dataframe print("New DataFrame after delete: \n") print(df.drop("Salary", axis = 1))
Output
New DataFrame after delete: Id Address Post Company Name Ram 1 KTM Manager CodeFires Shyam 2 PKR CEO CodeFires Gita 3 BHK worker CodeFires
In the first case, we used del to delete the Company column and in the latter case, we used the drop function to delete the Salary column. Use axis = 1 to specify the column index. If we want to drop row-wise we can use index = 0.
Reading Files
We can read various spreadsheets using pandas. For demonstration, we will access CSV files. Also, make sure that the CSV file is located in the current working directory or pass the full path of that CSV file
import pandas as pd df = pd.read_csv("order.csv") print(df.head())
Output
id groupid bookid quantity 0 1 1 6 1 1 2 1 1 4 2 3 2 10 1 3 4 2 7 20 4 5 2 8 1
head() function prints first five data from CSV files.
Let’s see the last five data of our dataframe.
print(pd.tail())
Output
id groupid bookid quantity 3 4 2 7 20 4 5 2 8 1 5 6 3 1 4 6 7 3 6 5 7 8 3 7 2
tail() function prints out last five data from CSV file.
Actually, we can pass the numbers in the head() and tail() functions to tell them what number of data we want to access. Let’s print the first three and last three data from CSV file. Here’s how
print("First 2 data: \n") #to print first three data from file print(df.head(3)) print("\n") print("Last 2 data: \n") #to print last three data from file print(df.tail(3))
Output
First 2 data: id groupid bookid quantity 0 1 1 6 1 1 2 1 1 4 2 3 2 10 1 Last 2 data: id groupid bookid quantity 5 6 3 1 4 6 7 3 6 5 7 8 3 7 2
We can get all the names of columns in datasets
#to print all the columns in dataset print(df.columns.values)
Output
['id' 'groupid' 'bookid' 'quantity']
These are the columns present in this CSV file.
We also can print all the elements present in specific columns as
#to print all values in specific column print(df["quantity"])
Output
0 1 1 4 2 1 3 20 4 1 5 4 6 5 7 2 Name: quantity, dtype: int64
Let’s say we want to know how many books have the same bookid. We can do this using value_counts() function
#to check no of books having same bookid print(df["bookid"].value_counts())
Output
7 2 6 2 1 2 10 1 8 1 Name: bookid, dtype: int64
Observing this data, we can say that two books have bookid 1, 6, 7
Sometimes data could be missing. We can check is there any value corresponds to null or not using isnull() function
print(df["quantity"].isnull())
Output
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False Name: quantity, dtype: bool
Here there are no missing values in our dataset. If there are any missing values we can replace them with the mean of current data.
Indexing or slicing Dataframe
For slicing Dataframe we use iloc and loc method. Usually, we pass the number(index) of row and column to iloc method and for loc, we use the name of columns
Using iloc
Syntax
iloc[row_selection, column_selection]
Let’s take a look at how iloc is used to access the data from the dataframe
print(df.iloc[:4, 0:3])
Output
id groupid bookid 0 1 1 6 1 2 1 1 2 3 2 10 3 4 2 7
Here we got the output having 4 rows and three columns.
Using loc
Syntax
loc[row_selection, ["name of columns"]]
Let’s see how loc is used to access the data from the dataframe
print(df.loc[:, ["bookid"]])
Output
bookid 0 6 1 1 2 10 3 7 4 8 5 1 6 6 7 7
We can also access multiple columns as
print(df.loc[:, ["groupid","bookid", "quantity"]])
Output
groupid bookid quantity 0 1 6 1 1 1 1 4 2 2 10 1 3 2 7 20 4 2 8 1 5 3 1 4 6 3 6 5 7 3 7 2
Filtering
Let’s say we want to filter the original dataframe in such a way that the quantity of book ordered is greater than 5. So here’s how it can be done
df2 = df[df["quantity"]>4] print(df2)
Output
id groupid bookid quantity 3 4 2 7 20 6 7 3 6 5
If we take a look at this new dataframe we see that the row index is unordered that is because while filtering those indexes were taken out to the new dataframe along with other data. We can reset those index using reset_index() function
index id groupid bookid quantity 0 3 4 2 7 20 1 6 7 3 6 5
We’ve reset the index. Have we managed to reset the index actually? Let’s print the newly formed dataframe
#to reset index print(df2.reset_index()) print("\n") #lets print df2 print(df2)
Output
index id groupid bookid quantity 0 3 4 2 7 20 1 6 7 3 6 5 id groupid bookid quantity 3 4 2 7 20 6 7 3 6 5
While applying the reset_index() function those row indexes got reset but it didn’t really affect the dataframe. So what if we want to actually make a change in dataframe
#to reset index df2.reset_index(inplace = True) #lets print df2 print(df2)
Output
index id groupid bookid quantity 0 3 4 2 7 20 1 6 7 3 6 5
So by using inplace = True we can make changes in original (modified) dataframe
Conclusion
Pandas is a powerful python module for processing data. It is used to read, filter, and analyze spreadsheets. Raw data that is available may or may not be suitable for retrieving the conclusion from it. So they need to be filtered, fill the null values, sorted, grouped, and so on.
Happy Learning 🙂