Data Analysis With Pandas

If you want to learn about Data Analysis with Pandas and Python and you’re not familiar with Kaggle, check it out!

Time to read article: 5 mins

TL;DR

We show how to use idxmax, apply and isin with Pandas.

Introduction

Here we will look at some functions in Pandas which will help with ‘EDA’ – exploratory data analysis.

Once you have signed in to Kaggle, you can find the Pandas tutorials and test your understanding by working through the exercises. If you get stuck, there are hints and also a full solution.

Pandas ‘idxmax’ example:

One such exercise is shown here, where you are asked:

Which wine is the “best bargain”? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

The hint tells us to use idxmax()

Here is an example of what idxmax does:
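As a minimal sketch, idxmax returns the index label of the largest value in a Series, not the value itself. The scores and names below are made up for illustration and are not part of the Kaggle exercise.

```python
import pandas as pd

# Hypothetical scores, indexed by name (illustrative data only)
scores = pd.Series([82, 95, 78], index=["alice", "bob", "carol"])

best_label = scores.idxmax()  # index label of the largest value
best_value = scores.max()     # the largest value itself

print(best_label, best_value)  # bob 95
```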

Solution:

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']

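The Kaggle solution computes the points-to-price ratio for every row, uses idxmax() to get the index label of the row with the highest ratio, and then uses loc to return the ‘title’ of that row.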
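To see that idxmax returns an index label rather than a position or a boolean, the same two lines can be run on a tiny stand-in for the reviews data. The DataFrame below is hypothetical (made-up titles, points, prices and index values) and only assumes the three column names used in the exercise.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle wine reviews data,
# with a non-default index so that labels and positions differ
reviews = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C"],
        "points": [90, 88, 85],
        "price": [45.0, 11.0, 17.0],
    },
    index=[10, 20, 30],
)

ratio = reviews.points / reviews.price            # 2.0, 8.0, 5.0
bargain_idx = ratio.idxmax()                      # index label of the highest ratio
bargain_wine = reviews.loc[bargain_idx, "title"]  # look the title up by that label

print(bargain_idx, bargain_wine)  # 20 Wine B
```

Because the result is a label, it is passed to loc (label-based lookup) rather than iloc (position-based lookup).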

In this next example we see how you can “apply” a function to each row in your dataframe:

Pandas ‘apply’ example:

“Find number of dealers by area”

import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', 150) 
df = pd.read_csv("used_car.csv")
df.head(6)
   Name                   Address                                                                             Phone
0  PakWheels Karachi      Suit No : 303 Third Floor Tariq Centre Main Tariq Road                              3105703505
1  PakWheels Lahore       37 Commercial Zone, Liberty Market Lahore                                           614584545
2  PakWheels Islamabad    37 Commercial Zone, Liberty Market Lahore                                           614584545
3  Sam Automobiles        8 Banglore town,7/8 Block near awami markaz shahrah e faisal karachi                3422804414
4  Marvel Motors          27-E, Ali Plaza, Fazal e Haq Road, Blue Area, Islamabad                             518358006
5  Merchants Automobiles  Plot 167 /C Shop#4 Parsa City luxuria PECHS Block 3 at Main khalid bin waleed road  2134552897

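# Create a function that assigns a city based on the address details.
# The checks run in order, so the first city name found in the address wins.

def area(row):
    # str() also copes with missing addresses (NaN becomes the text "nan")
    address = str(row.Address).lower()
    if "lahore" in address:
        return "Lahore"
    # note: the city is usually spelled "Faisalabad"; this spelling matches no rows in the counts below
    elif "faisalbad" in address:
        return "Faisalbad"
    elif "karachi" in address:
        return "Karachi"
    elif "islamabad" in address:
        return "Islamabad"
    else:
        return "Other"

# axis=1 passes each row to area(), giving one city label per dealer
ans = df.apply(area, axis=1)
ans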
0           Other
1          Lahore
2          Lahore
3         Karachi
4       Islamabad
          ...    
2332        Other
2333        Other
2334        Other
2335        Other
2336        Other
Length: 2337, dtype: object
# Check how many times each city occurs in the dataframe

ans.value_counts()
Other        1433
Karachi       471
Lahore        331
Islamabad     102
dtype: int64
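
As an aside, the same labelling can be done without a row-wise apply. The sketch below is one possible alternative, not the approach used above: it combines Series.str.contains with numpy.select on a small made-up DataFrame (the dealer names and addresses are hypothetical stand-ins for used_car.csv). As with the if/elif chain, the first matching condition wins.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for used_car.csv
df = pd.DataFrame({
    "Name": ["Dealer A", "Dealer B", "Dealer C", "Dealer D"],
    "Address": [
        "12 Main Road, Lahore",
        "Block 5, Karachi",
        "Blue Area, Islamabad",
        None,  # missing address
    ],
})

# Lower-case the addresses once; astype(str) turns missing values into text
address = df["Address"].astype(str).str.lower()

cities = ["lahore", "karachi", "islamabad"]
conditions = [address.str.contains(city, regex=False) for city in cities]
labels = [city.title() for city in cities]

# np.select picks the label of the first condition that is True for each row
df["Area"] = np.select(conditions, labels, default="Other")
counts = df["Area"].value_counts()
print(counts)
```

On a few thousand rows either approach is fine; the vectorised form mainly helps on much larger data.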

Next, what if we want to find out if there are other dealerships that use the same phone number?

Pandas ‘isin’ with ‘loc’

df.loc[df.Phone.isin(['614584545'])]
     Name                 Address                                    Phone
1    PakWheels Lahore     37 Commercial Zone, Liberty Market Lahore  614584545
2    PakWheels Islamabad  37 Commercial Zone, Liberty Market Lahore  614584545
437  Rehman Motors        Old Bahawalpur Road,Multan                 614584545

We’ve found 3 business names and 2 addresses that share the same phone number.
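
The isin and loc pattern can be tried on a small made-up DataFrame. This is only a sketch: the rows are hypothetical, and Phone is assumed to be stored as text, since isin compares values exactly as stored. The nunique calls are an addition here to count the distinct names and addresses among the matching rows.

```python
import pandas as pd

# Hypothetical rows; Phone is assumed to be stored as text
df = pd.DataFrame({
    "Name": ["Dealer A", "Dealer B", "Dealer C", "Dealer D"],
    "Address": ["1 First St", "1 First St", "2 Second St", "3 Third St"],
    "Phone": ["111", "111", "111", "222"],
})

# isin builds a boolean mask; loc keeps only the rows where the mask is True
shared = df.loc[df["Phone"].isin(["111"])]

n_names = shared["Name"].nunique()         # distinct business names
n_addresses = shared["Address"].nunique()  # distinct addresses

print(len(shared), n_names, n_addresses)  # 3 3 2
```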

Summary

Kaggle is free and even if you are not pursuing a career in data science you can still gain valuable Python skills from it.

See the Red and Green – Kaggle example on Kaggle
