Categories
MySQL Pandas Python Code Raspberry Pi web scraping

Add data to a database from a CSV using Python, Pandas, and SQL

Do you have a CSV file and a database to transfer the data into?

The data needs to go into your database and you want to automate this?

For instance: you have been web scraping, extracted publicly available data, and now need to store it in a database table?

This guide will show you a basic example of how to get the data from a CSV, convert it to a pandas dataframe, and insert it into your chosen database table.

# |r|e|d|a|n|d|g|r|e|e|n|.|c|o|.|u|k|

# CSV to PANDAS to SQL example

Import a CSV and convert it into a pandas dataframe

import pandas as pd

df = pd.read_csv(r'claims.csv')

df

   pat_num  claim    status
0   111222    123  accepted
1   111333    234  accepted
2   222444    432  rejected
3  3444555    431  rejected
4  4666777    655   accpted

The example CSV as a pandas dataframe

Connect to existing SQL database and check we can query some data

# Module Imports
import mariadb
import sys

# Connect to MariaDB Platform
try:
    conn = mariadb.connect(
        user="user2",
        password="password2",
        host="192.168.1.9",
        port=3306,
        database="newz"

    )
except mariadb.Error as e:
    print(f"Error connecting to MariaDB Platform: {e}")
    sys.exit(1)

# Get Cursor
cur = conn.cursor()
# Select the first 8 results as a check

cur.execute(
    "SELECT title,story FROM tnewz WHERE publication=? LIMIT 8", ("independent.co.uk",)
)
# Print results selection

for (title, story) in cur:
    print(f"Title: {title}, Story: {story}")
Title: We need to get back to normality, and fast – I appear to have fallen in love with my robot vacuum cleaner, Story: It is absurdly easy to anthropomorphise this creature, which spins around picking up fluff – especially in the absence of a puppy or a baby in the house, writes 
Title: ‘This ends now’: Biden demands Trump make national address to end Capitol siege, Story: President-elect condemns Capitol violence: 'This is not dissent. It's disorder. It's chaos. It borders on sedition’
Title: ‘We love you, you’re all very special’: Trump tells mob to ‘go home’ but repeats false claims about election, Story: 'I know your pain. I know you're hurt’
Title: Senate evacuates after sudden recess while certifying Biden’s win amid reports of shots fired on Capitol Hill, Story: Congress calls surprise recess as protesters storm Capitol building
Title: Capitol Hill riots: Dramatic videos show Trump supporters storming Capitol building, Story: Videos posted to social media from reporters and lawmakers depicted a chaotic, terrifying scene of pro-Trump rioters breaking into the Capitol building
Title: Capitol riots: The Simpsons eerily predicted incident in 1996, Story: Scene in question shows characters running up Capitol steps, firing guns
Title: Prince William tells children of NHS staff sacrifices ‘every day’ during pandemic, Story: 'We're making sure the children understand all of the sacrifices that all of you are making,' says Duke of Cambridge
Title: Sriwijaya Air flight SJ182: Boeing 737 loses contact in Indonesia, Story: The aircraft is a ‘classic’ 737 — two generations older than the Boeing 737 Max

Connect to new database and prepare to insert dataframe

# Connect to MariaDB Platform
try:
    conn = mariadb.connect(
        user="user3",
        password="redandgreen",
        host="192.168.1.9",
        port=3306,
        database="exampledb"

    )
except mariadb.Error as e:
    print(f"Error connecting to MariaDB Platform: {e}")
    sys.exit(1)

# Get Cursor
cursor = conn.cursor()

Create new table (do this just once)

# Create Table
cursor.execute('CREATE TABLE IF NOT EXISTS patients3 (id INT AUTO_INCREMENT PRIMARY KEY, pat_num varchar(50), claim varchar(50), status varchar(50))')

# Insert each row of the dataframe into the table
for row in df.itertuples():
    cursor.execute(
        """
        INSERT INTO patients3
        (pat_num, claim, status)
        VALUES (%s, %s, %s)
        """,
        (row.pat_num, row.claim, row.status),
    )
conn.commit()
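
As an alternative to the explicit INSERT loop, pandas can write a dataframe to a table in one call with DataFrame.to_sql. A minimal sketch, assuming the SQLAlchemy and PyMySQL packages are installed; the credentials and table name mirror the example above:

from sqlalchemy import create_engine

# Build an SQLAlchemy engine for the same exampledb database
engine = create_engine("mysql+pymysql://user3:redandgreen@192.168.1.9:3306/exampledb")

# Append the dataframe rows to patients3 (id is filled in by AUTO_INCREMENT)
df.to_sql("patients3", engine, if_exists="append", index=False)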

Check your database table for new data

SQL output showing the newly inserted rows in the patients3 table

If you have access to your database server then you can create a database and user with these commands:

## Useful MariaDB commands:

# sudo mysql -u root -p
# CREATE DATABASE exampledb;
# CREATE USER 'user3'@'localhost' IDENTIFIED BY 'redandgreen';

# DROP USER IF EXISTS 'user3'@'%';
# CREATE USER 'user3'@'%' IDENTIFIED BY 'redandgreen';
# GRANT ALL PRIVILEGES ON exampledb.* TO 'user3'@'%';
# FLUSH PRIVILEGES;

# The '%' host wildcard allows connections from remote machines

Query the SQL database and print the Patient Numbers

cur = conn.cursor()
cur.execute(
    "SELECT pat_num FROM patients3"
)
# Print results selection from SQL query

# Each row arrives as a 1-tuple, so unpack it with a trailing comma
for (pat_num,) in cur:
    print(f"PatientNumber: {pat_num}")
PatientNumber: 111222
PatientNumber: 111333
PatientNumber: 222444
PatientNumber: 3444555
PatientNumber: 4666777
PatientNumber: 111222
PatientNumber: 111333

Query the SQL database using Pandas and an SQL query

# Create SQL query for Pandas to run
sql = """
SELECT *
FROM patients3
"""
df = pd.read_sql(sql, conn)
df.head()
   id  pat_num  claim    status
0   1   111222    123  accepted
1   2   111333    234  accepted
2   3   222444    432  rejected
3   4  3444555    431  rejected
4   5  4666777    655   accpted
Data retrieved from the SQL table using Pandas “pd.read_sql”

Summary

We have shown how to:

  • Import a CSV into a pandas dataframe using Python
  • Connect to a MariaDB (SQL) database
  • INSERT the contents of the dataframe into a table
  • SELECT values from the database using SQL and also using Pandas

Note: if you don’t have access to the SQL database and table, you will need the database administrator to grant you the necessary permissions.

Categories
Python Code Raspberry Pi Scrapy

Price Tracking Amazon

A common task is to track competitors’ prices and use that information as a guide to the prices you can charge; or, if you are buying, to spot when a product reaches a new lowest price. This article describes how to web scrape Amazon.

Web Scraping Amazon to SQL

Using Python, Scrapy, MySQL, and Matplotlib you can extract large amounts of data, query it, and produce meaningful visualizations.

In the example featured, we wanted to identify which Amazon books related to “web scraping” had been reduced in price over the time we had been running the spider.

If you want to run your spider daily then see the video for instructions on how to schedule a spider in CRON on a Linux server.
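
For reference, a daily CRON entry takes roughly the shape sketched below; the project path and spider name are placeholders, not the ones from the video:

# crontab -e
# 0 2 * * * cd /home/pi/amazon-scraper && /usr/local/bin/scrapy crawl books -o /home/pi/output/books.csv

This would run the spider every day at 02:00 and append the scraped items to the CSV.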

Procedure used for price tracking
Price Tracking Amazon with Python

The query below detects price changes: for each title, the lag() window function pairs every price with the previous price recorded for that title (ordered by the posted date), and the outer WHERE keeps only the rows where the price differs from the previous one.

query = '''SELECT b.*
FROM (SELECT amzbooks2.*,
             LAG(price) OVER (PARTITION BY title ORDER BY posted) AS prev_price
      FROM amzbooks2) b
WHERE b.prev_price <> b.price'''

Visualize the stored data using Python and Matplotlib
Amazon Price Tracker Graph produced from Scrapy Spider output
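
A minimal sketch of how such a chart could be produced with pandas and Matplotlib, reusing the query above; it assumes a database connection conn to the MySQL database holding amzbooks2, and the column names (title, price, posted) are taken from the query, while everything else is illustrative:

import matplotlib.pyplot as plt
import pandas as pd

# Load the price-change rows returned by the lag() query
prices = pd.read_sql(query, conn)

# Draw one line per book title: posted date along x, price along y
for title, group in prices.groupby("title"):
    plt.plot(group["posted"], group["price"], marker="o", label=title[:40])

plt.xlabel("Date posted")
plt.ylabel("Price")
plt.title("Amazon 'web scraping' book prices over time")
plt.legend(fontsize="small")
plt.tight_layout()
plt.show()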
To see how to get to this stage, you may wish to watch the video:
How To Track Prices In Amazon With Scrapy

All of the code is on GitHub

Categories
Python Code Raspberry Pi Scrapy

Configure a Raspberry Pi for web scraping

Introduction

The task was to scrape over 50,000 records from a website while being gentle on the site being scraped. A Raspberry Pi Zero was chosen because speed was not a significant issue; in fact, being slower makes it ideal for web scraping when you want to be kind to the site and not request resources too quickly. This article describes using Scrapy, but BeautifulSoup or Requests would work in the same way.
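
Being gentle on the site is mostly a matter of Scrapy settings; a minimal sketch of the kind of values you might put in settings.py (the numbers are illustrative, not the ones used for this project):

# settings.py -- illustrative throttling values
DOWNLOAD_DELAY = 2                   # wait at least 2 seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain
AUTOTHROTTLE_ENABLED = True          # back off automatically if responses slow down
ROBOTSTXT_OBEY = True                # respect the site's robots.txt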

The main considerations were:

  • Could it run Scrapy without issue?
  • Could it run with a VPN connection?
  • Would it be able to store the results?

A quick test showed that it could collect approximately 50,000 records per day, which meant it was entirely suitable.

I wanted a VPN tunnel from the Pi Zero to my VPN provider. This was an unknown, because I had only previously run it on a Windows PC with a GUI. Now I was attempting to run it from a headless Raspberry Pi!

It took approximately 15 minutes to set up, and was surprisingly easy.
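
The exact steps depend on the provider, but a typical headless OpenVPN setup amounts to something like the commands below, where the .ovpn configuration file is supplied by the provider (the filename here is a placeholder):

# sudo apt install openvpn
# sudo openvpn --config myprovider.ovpn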

The only remaining challenges were:

  • Run the spider without having to leave my PC on as well (closing PuTTY in Windows would have terminated the process on the Pi). That’s where nohup came in handy; see the commands after this list.
  • Transfer the output back to a PC (running Ubuntu inside a VM). This is where rsync came in handy (SCP could also have been used); again, see below.
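
In both cases the commands are short; the paths, spider name, and IP address below are placeholders rather than the ones actually used:

# Start the spider detached from the SSH session so it survives closing PuTTY
# nohup scrapy crawl myspider -o results.csv &

# From the Ubuntu VM, pull the results back over SSH with rsync
# rsync -avz pi@192.168.1.x:/home/pi/scraper/results.csv ~/results/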

See the writing of the Scrapy spider with “Load More”