Data Analysis With Pandas
If you want to learn about Data Analysis with Pandas and Python and you’re not familiar with Kaggle, check it out!
Time to read article : 5 mins
TLDR;
We show how to use idxmax, apply, and isin with loc in Pandas.
Introduction
Here we will look at some functions in Pandas which will help with ‘EDA’ – exploratory data analysis.
Once you have signed in to Kaggle, you can find the Pandas tutorials and test your understanding by working through the exercises. If you get stuck, each exercise offers hints and a solution.
Pandas ‘idxmax’ example:
One such exercise is shown here, where you are asked:
Which wine is the “best bargain”? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.
The hint tells us to use idxmax()
Here is an example of what idxmax does:
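The sketch below is a minimal illustration on made-up data (the names `scores`, `wine_a` and so on are hypothetical and not taken from the Kaggle dataset). It shows that `idxmax` returns the index label of the largest value rather than the value itself.

```python
import pandas as pd

# Hypothetical scores, indexed by made-up wine names
scores = pd.Series([88, 95, 91], index=["wine_a", "wine_b", "wine_c"])

# idxmax returns the index label of the largest value, not the value itself
best_label = scores.idxmax()
print(best_label)      # wine_b
print(scores.max())    # 95
```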
Solution:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
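The Kaggle solution divides points by price for every row, uses idxmax to get the index label of the row with the highest ratio, and then uses loc to return the ‘title’ of that row.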
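To try the same two steps without the Kaggle dataset, here is a small sketch on a made-up three-row frame. The `reviews` data below is a hypothetical stand-in; only the column names `title`, `points` and `price` come from the exercise.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle `reviews` dataframe
reviews = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [90, 86, 92],
    "price": [30.0, 10.0, 46.0],
})

# Points-to-price ratio per row: 3.0, 8.6, 2.0
ratio = reviews.points / reviews.price

# Index label of the row with the highest ratio
bargain_idx = ratio.idxmax()

# Look up the title stored at that label
bargain_wine = reviews.loc[bargain_idx, "title"]
print(bargain_idx, bargain_wine)  # 1 Wine B
```

Note that the wine with the most points (Wine C) is not the bargain here; the cheaper Wine B wins on ratio.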
In this next example we see how you can “apply” a function to each row in your dataframe:
Pandas ‘apply’ example:
“Find number of dealers by area”
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 150)
df = pd.read_csv("used_car.csv")
df.head(6)
|   | Name | Address | Phone |
|---|---|---|---|
| 0 | PakWheels Karachi | Suit No : 303 Third Floor Tariq Centre Main Tariq Road | 3105703505 |
| 1 | PakWheels Lahore | 37 Commercial Zone, Liberty Market Lahore | 614584545 |
| 2 | PakWheels Islamabad | 37 Commercial Zone, Liberty Market Lahore | 614584545 |
| 3 | Sam Automobiles | 8 Banglore town,7/8 Block near awami markaz shahrah e faisal karachi | 3422804414 |
| 4 | Marvel Motors | 27-E, Ali Plaza, Fazal e Haq Road, Blue Area, Islamabad | 518358006 |
| 5 | Merchants Automobiles | Plot 167 /C Shop#4 Parsa City luxuria PECHS Block 3 at Main khalid bin waleed road | 2134552897 |
# create a function to assign a city based on address details
def area(row):
    if 'lahore' in (str(row.Address)).lower():
        return "Lahore"
    elif "faisalbad" in (str(row.Address)).lower():
        return "Faisalbad"
    elif "karachi" in (str(row.Address)).lower():
        return "Karachi"
    elif "islamabad" in (str(row.Address)).lower():
        return "Islamabad"
    else:
        return "Other"

ans = df.apply(area, axis=1)
ans
0           Other
1          Lahore
2          Lahore
3         Karachi
4       Islamabad
          ...
2332        Other
2333        Other
2334        Other
2335        Other
2336        Other
Length: 2337, dtype: object
# Check how many times each city occurs in the dataframe
ans.value_counts()
Other        1433
Karachi       471
Lahore        331
Islamabad     102
dtype: int64
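If you do not have used_car.csv to hand, the same pattern can be tried on a few made-up rows. This is only a sketch: `df_toy`, `toy_area` and the addresses are hypothetical, and the function is a shortened variant of the one above rather than the original.

```python
import pandas as pd

# Made-up addresses standing in for used_car.csv
df_toy = pd.DataFrame({
    "Name": ["Dealer 1", "Dealer 2", "Dealer 3", "Dealer 4"],
    "Address": [
        "1 Example Road, Lahore",
        "2 Sample Street KARACHI",
        "3 Test Avenue, Lahore",
        "Unit 4, no city given",
    ],
})

def toy_area(row):
    # Lower-case the address once so the match ignores case
    address = str(row.Address).lower()
    for city in ("lahore", "karachi", "islamabad"):
        if city in address:
            return city.title()
    return "Other"

# axis=1 passes each row (not each column) to the function
cities = df_toy.apply(toy_area, axis=1)

# Count how often each label was assigned
counts = cities.value_counts()
print(counts)
```

With `axis=1` the function receives one row at a time, which is why `row.Address` works inside it.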
Next, what if we want to find out if there are other dealerships that use the same phone number?
Pandas ‘isin’ with ‘loc’
df.loc[df.Phone.isin(['614584545'])]
|   | Name | Address | Phone |
|---|---|---|---|
| 1 | PakWheels Lahore | 37 Commercial Zone, Liberty Market Lahore | 614584545 |
| 2 | PakWheels Islamabad | 37 Commercial Zone, Liberty Market Lahore | 614584545 |
| 437 | Rehman Motors | Old Bahawalpur Road,Multan | 614584545 |
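The same lookup can be sketched on made-up data. The `dealers` frame and its phone numbers are hypothetical, and the Phone column is stored as strings here (an assumption), so the list passed to `isin` uses strings too. The second part uses `duplicated(keep=False)`, which is a different technique from the one above: it flags every shared number at once instead of checking one known number.

```python
import pandas as pd

# Made-up dealers; Phone is stored as strings here (an assumption)
dealers = pd.DataFrame({
    "Name": ["Dealer 1", "Dealer 2", "Dealer 3", "Dealer 4"],
    "Phone": ["111", "222", "111", "333"],
})

# isin builds a True/False mask; loc keeps only the rows where it is True
same_phone = dealers.loc[dealers.Phone.isin(["111"])]
print(same_phone)

# Alternative sketch: keep every row whose Phone occurs more than once
shared = dealers.loc[dealers.Phone.duplicated(keep=False)]
print(shared)
```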
Summary
Kaggle is free, and even if you are not pursuing a career in data science, you can still gain valuable Python skills from it.
See the Red and Green – Kaggle example on Kaggle