
How to scrape iframes

If you are scraping a website, you may be met with a pop-up message asking you to agree to accept cookies.

This can prevent your scraper from continuing to the pages you want to scrape.

How can you get past this with your scraper if the pop-up is inside an iframe?

Using Selenium you need to switch to the iframe (which you can identify using browser tools / inspect in Chrome).

Here is a video showing the general concept:

Scrape a site with pop up “accept cookies” message

If you just want the basic idea….

driver.switch_to.frame(iframe)

“click on accept or ok”

Instead of “click on accept or ok” you need to identify the @id or @class name of the button and use that in an XPath so you can send the .click() command with Selenium.

driver.switch_to.default_content()

Once past this you need to return to the default (or parent) content to carry on with the navigation towards the page you want to scrape.

driver = webdriver.Chrome(executable_path=path_chrome)

# find the frame using id, title etc.
# find the frame using id, title etc.
frames = driver.find_elements_by_xpath("//iframe[@title='iframe_to_get']")

# switch the webdriver object to the first matching iframe
driver.switch_to.frame(frames[0])
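Putting it all together – a minimal sketch using the same (older) Selenium API as above; the target URL, iframe title, and button XPath are hypothetical and will differ per site:

from selenium import webdriver

driver = webdriver.Chrome(executable_path=path_chrome)
driver.get("https://www.example.com")  # hypothetical target site

# switch into the cookie-consent iframe (hypothetical title)
frames = driver.find_elements_by_xpath("//iframe[@title='iframe_to_get']")
driver.switch_to.frame(frames[0])

# find and click the accept button (hypothetical id)
driver.find_element_by_xpath("//button[@id='accept']").click()

# return to the main page and carry on scraping
driver.switch_to.default_content()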

See also : https://stackoverflow.com/questions/49825722/scraping-iframe-using-selenium#49825750


Comparing values in SQL against previously scraped data

If you have scraped data on more than one occasion and want to check if a value has changed in a column since the previous scrape you could use this:

select col1, col2
from TABLENAME
group by col1, col2
having count(col2) < 2

This will check whether a value for col2 has changed since the previous scrape: with both scrapes stored in the table, an unchanged (col1, col2) pair appears twice, so any group with a count of less than 2 marks a changed value.

Let’s put some real names into this:

select CLAIMNO, STATUSID
from PREV
group by CLAIMNO, STATUSID
having count(STATUSID) < 2

We can see from our results that from our entire comparison, only 1 claim has had a status change

We now know that since the last time we scraped the site, only one of our claims has been updated by “them”

This has been achieved by using “group by” – where we have many records in our table, we’ve picked out where there are not 2 matching STATUSID values for a given CLAIMNO. The same check can also be done in Python, as sketched below.
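A minimal pandas sketch of the same idea, using made-up CLAIMNO/STATUSID sample data rather than the real table:

import pandas as pd

# two scrapes of the same claims stacked in one dataframe (hypothetical data)
df = pd.DataFrame({
    "CLAIMNO":  [101, 102, 103, 101, 102, 103],
    "STATUSID": [1,   2,   2,   1,   2,   3],
})

# group on both columns; any pair appearing fewer than 2 times has changed
counts = df.groupby(["CLAIMNO", "STATUSID"]).size()
print(counts[counts < 2])   # claim 103 shows up with two different STATUSIDs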

Summary

We have used SQL to identify where a column value has changed compared to the previous time we checked by grouping 2 columns and checking if the count is less than 2.

If you would like more information please contact me via the contact page

See also : https://stackoverflow.com/questions/1786533/find-rows-that-have-the-same-value-on-a-column-in-mysql#1786568


Parsing Scraped Data with Pandas

Once you have some data you’ll need to find the relevant parts and use them for further analysis. Rather than use the proverbial Excel sheet you can write code to automate the task.

Consider the following :

import pandas as pd
df=pd.read_csv('SalesRecords.csv')

The following code will match rows where the “Country” column contains “Djibouti” and return the value in the “Order Priority” column, provided you add “.iloc[0]” or whatever index number you require.

df.loc[df['Country']=='Djibouti',['Order Priority']]
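For instance, a short sketch assuming a hypothetical SalesRecords.csv with “Country” and “Order Priority” columns:

import pandas as pd

df = pd.read_csv('SalesRecords.csv')

# all matching rows, returned as a dataframe
matches = df.loc[df['Country'] == 'Djibouti', ['Order Priority']]

# the first matching value only, returned as a plain value
first_priority = df.loc[df['Country'] == 'Djibouti', 'Order Priority'].iloc[0]
print(first_priority)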

Debugging Python Code

& using Snippets

This article will cover how to achieve the following, using Python :

  • Reduce the need for print statements
  • Reduce the need to type repetitive code

Why not automate the ‘boring stuff’ in your coding? Anything that you type regularly can be saved as a snippet, so you won’t have to write that section of code from scratch each time….

If you find you have print statements everywhere, or are repeating the same code in many projects then this will be of interest.

First, these are the 3 shortcuts you will want to memorise:

(screenshot: the 3 VS Code shortcuts for working with snippets)
Example of a snippet that provides “hello world” via IntelliSense – note you can also add tab-stop variables within the snippet:

"I Do the Code to Print Hello World": {
    "prefix": "hellw",
    "body": [
        "print('Hello World')\n",
        "print('Hello to you', ${1:user_name})\n",
        "$2"
    ],
    "description": "This Snippet Prints Hello World"
}
By typing “hellw”, IntelliSense will insert the snippet of code that you have assigned to that name into the file you are currently writing.
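Snippets are stored per language in a JSON file, which you can open in VS Code via the Command Palette entry “Preferences: Configure User Snippets” (python.json for Python). Accepting the snippet above expands to the following, with the cursor left on the user_name placeholder:

print('Hello World')

print('Hello to you', user_name)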

Scrapy response.meta

capture your start urls in your output with Scrapy response.meta


Every web scraping project has aspects that are different or interesting and worth remembering for future use.

This is a look at a recent real world project, and at how to save more than one start url in the output.

This assumes basic knowledge of web scraping, and identifying selectors. See my other videos if you would like to learn more about selectors (xpath & css)

scraping a real estate site for houses and apartments

We want to fill all of the columns in our client’s master excel sheet.

We could then provide them with a CSV which they can import and do with what they wish.

We want 1500+ properties so we will be using Scrapy and Python

(diagram: Scrapy spider logic)

Considerations

  • One of the required fields requires us to pass the particular start url all the way through to the CSV (use response.meta)
  • Some of the required values are inside text and will require parsing with re (use regular expressions)
  • We don’t care about being fast – edit “settings.py” with conservative values for concurrent connections and download delay

This is a German website so I will use Google Chrome browser and translate to English.

Scrapy response.meta

We will use Scrapy’s Request.meta attribute to achieve the following:

Capture whichever of the multiple start_urls is used – pass it all the way through to the output CSV.


Create a “meta” dictionary in the initial Request in start_requests 

“surl” represents each of our start urls

(we have 2, one for ‘rent’ and one for the ‘buy’ url, we could have many more if required)
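A minimal sketch of the idea – the domain, URL paths, and selector here are hypothetical stand-ins, not the client’s actual site:

import scrapy

class PropertySpider(scrapy.Spider):
    name = "property"

    def start_requests(self):
        # one start url for 'rent' and one for 'buy' (hypothetical domain)
        start_urls = [
            "https://www.example.de/mieten",
            "https://www.example.de/kaufen",
        ]
        for surl in start_urls:
            # stash the originating start url in the request's meta dict
            yield scrapy.Request(surl, callback=self.parse, meta={"surl": surl})

    def parse(self, response):
        # read it back out of response.meta and pass it through to the output
        yield {
            "title": response.xpath("//h1/text()").get(),
            "start_url": response.meta["surl"],
        }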

(screenshots: the start_requests method, response.meta in the parse callback, and the start_url column in the final CSV output)

End result : we have the start url converted to a human readable label that represents the particular URL Scrapy used for each listing.

Dr Pi YouTube videos
For videos on youtube please visit : www.youtube.com/c/DrPiCode


Add data to a database from a CSV using Python, Pandas, and SQL

Do you have a CSV file and a database to transfer the data into?

The data needs to go into your database and you want to automate this?

For instance : you have been web scraping and extracted publicly available data and now need to store it in a database table?

This guide will show you a basic example of how to get the data from a CSV, convert it to a pandas dataframe, and insert it into your chosen database table.

# |r|e|d|a|n|d|g|r|e|e|n|.|c|o|.|u|k|

# CSV to PANDAS to SQL example

Import a CSV and convert it into a pandas dataframe

import pandas as pd

df = pd.read_csv(r'claims.csv')

df

   pat_num  claim    status
0   111222    123  accepted
1   111333    234  accepted
2   222444    432  rejected
3   444555    431  rejected
4   666777    655   accpted

The example csv as a pandas dataframe (note: “accpted” is a typo in the sample data itself)

Connect to existing SQL database and check we can query some data

# Module Imports
import mariadb
import sys

# Connect to MariaDB Platform
try:
    conn = mariadb.connect(
        user="user2",
        password="password2",
        host="192.168.1.9",
        port=3306,
        database="newz"

    )
except mariadb.Error as e:
    print(f"Error connecting to MariaDB Platform: {e}")
    sys.exit(1)

# Get Cursor
cur = conn.cursor()
# Select the first 8 results as a check

cur.execute(
    "SELECT title,story FROM tnewz WHERE publication=? LIMIT 8", ("independent.co.uk",)
)
# Print results selection

for (title, story) in cur:
    print(f"Title: {title}, Story: {story}")
Title: We need to get back to normality, and fast – I appear to have fallen in love with my robot vacuum cleaner, Story: It is absurdly easy to anthropomorphise this creature, which spins around picking up fluff – especially in the absence of a puppy or a baby in the house, writes 
Title: ‘This ends now’: Biden demands Trump make national address to end Capitol siege, Story: President-elect condemns Capitol violence: 'This is not dissent. It's disorder. It's chaos. It borders on sedition’
Title: ‘We love you, you’re all very special’: Trump tells mob to ‘go home’ but repeats false claims about election, Story: 'I know your pain. I know you're hurt’
Title: Senate evacuates after sudden recess while certifying Biden’s win amid reports of shots fired on Capitol Hill, Story: Congress calls surprise recess as protesters storm Capitol building
Title: Capitol Hill riots: Dramatic videos show Trump supporters storming Capitol building, Story: Videos posted to social media from reporters and lawmakers depicted a chaotic, terrifying scene of pro-Trump rioters breaking into the Capitol building
Title: Capitol riots: The Simpsons eerily predicted incident in 1996, Story: Scene in question shows characters running up Capitol steps, firing guns
Title: Prince William tells children of NHS staff sacrifices ‘every day’ during pandemic, Story: 'We're making sure the children understand all of the sacrifices that all of you are making,' says Duke of Cambridge
Title: Sriwijaya Air flight SJ182: Boeing 737 loses contact in Indonesia, Story: The aircraft is a ‘classic’ 737 — two generations older than the Boeing 737 Max

Connect to new database and prepare to insert dataframe

# Connect to MariaDB Platform
try:
    conn = mariadb.connect(
        user="user3",
        password="redandgreen",
        host="192.168.1.9",
        port=3306,
        database="exampledb"

    )
except mariadb.Error as e:
    print(f"Error connecting to MariaDB Platform: {e}")
    sys.exit(1)

# Get Cursor
cursor = conn.cursor()

Create new table (do this just once)

# Create Table
cursor.execute('CREATE TABLE IF NOT EXISTS patients3 (id INT AUTO_INCREMENT PRIMARY KEY, pat_num varchar(50), claim varchar(50), status varchar(50))')

# Insert the dataframe row by row
for row in df.itertuples():
    cursor.execute(
        """
        INSERT INTO patients3
        (pat_num, claim, status)
        VALUES (%s, %s, %s)
        """,
        (row.pat_num, row.claim, row.status),
    )
conn.commit()
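As an aside – not what this post uses – pandas can write a whole dataframe in one call with df.to_sql; a sketch assuming sqlalchemy and a MySQL driver such as pymysql are installed:

from sqlalchemy import create_engine

# same hypothetical credentials as above
engine = create_engine("mysql+pymysql://user3:redandgreen@192.168.1.9:3306/exampledb")
df.to_sql("patients3", engine, if_exists="append", index=False)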

Check your database table for new data


If you have access to your database server then you can create a database and user with these commands:

## Useful mariadb commands :

# sudo mysql -u root -p
# CREATE DATABASE exampledb;
# CREATE USER 'user3'@'localhost' IDENTIFIED BY 'redandgreen';

# DROP USER IF EXISTS 'user3'@'%';
# CREATE USER 'user3'@'%' IDENTIFIED BY 'redandgreen';
# GRANT ALL PRIVILEGES ON exampledb.* TO 'user3'@'%';
# FLUSH PRIVILEGES;

# The % allows remote connections 

Query the SQL database and print the Patient Numbers

cur = conn.cursor()
cur.execute(
    "SELECT pat_num FROM patients3"
)
# Print results selection from SQL query

# note: (pat_num) is not tuple unpacking, so each row prints as a 1-tuple below
# (write "for (pat_num,) in cur:" to get the bare value instead)
for (pat_num) in cur:
    print(f"PatientNumber: {pat_num}")
PatientNumber: ('111222',)
PatientNumber: ('111333',)
PatientNumber: ('222444',)
PatientNumber: ('3444555',)
PatientNumber: ('4666777',)
PatientNumber: ('111222',)
PatientNumber: ('111333',)

Print results selection from SQL using Pandas with an SQL query

# Create SQL query for Pandas to run
sql = """

SELECT *
FROM patients3

"""
df = pd.read_sql(sql, conn)
df.head()
   id  pat_num  claim    status
0   1   111222    123  accepted
1   2   111333    234  accepted
2   3   222444    432  rejected
3   4  3444555    431  rejected
4   5  4666777     65   accpted

Data retrieved from the SQL table using Pandas “pd.read_sql”

Summary

We have shown how to :

  • Import a CSV into a pandas dataframe using Python
  • Connect to a MariaDB (SQL) database
  • INSERT the contents of the dataframe into a table
  • SELECT values from the database using SQL and also using Pandas

Note : if you don’t have access to the SQL database and the table, you will need the database administrator to grant you permissions.


regex examples

Sooner or later you will need to resort to using regular expressions if you are scraping text. This article will show you some useful Python (version 3) examples which you can use over and over again.

We’ll be using “re.search” and “group”

example 1 – parse a paragraph of text for the number of floors, so you need to find the word “floor” or equivalent, and then extract the number(s) directly before it.

re.search("(\d+)(\s[Ee]tage)", property_data)
We look for 1 or more digits followed by a space followed by “Etage “or “etage”
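A quick runnable check, using a made-up German property description (“Etage” = floor):

import re

# hypothetical text scraped from a German property listing
property_data = "Schöne Wohnung, 3 Etage, 80 qm"

m = re.search(r"(\d+)(\s[Ee]tage)", property_data)
if m:
    print(m.group(1))   # -> 3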

example 2 – parse a paragraph of text and extract the vendor details, so locate the text that immediately follows, but no more…

re.search(r'Vendor Name:([A-Z]).+?(?=Payment)', invoice)
re.search(r'Vendor Number:([0-9])+', invoice)
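And a runnable check for the vendor patterns, against a made-up invoice string – note the capture groups here only grab a single character, so group(0) (the whole match) is what carries the details:

import re

# hypothetical invoice text
invoice = "Vendor Name:ACME Supplies Ltd Payment Terms:30 days Vendor Number:12345"

name = re.search(r'Vendor Name:([A-Z]).+?(?=Payment)', invoice)
print(name.group(0))    # -> Vendor Name:ACME Supplies Ltd

number = re.search(r'Vendor Number:([0-9])+', invoice)
print(number.group(0))  # -> Vendor Number:12345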

If you would like to copy the code from GitHub then visit the RGGH page here


Back Up MySQL

Back up your MySQL database (on a Raspberry Pi)

Once in production you will need to ensure you have a copy of your data, for reasons we can all identify with:

  • Hardware failure
  • Data Corruption
  • Human Error
Creating a MySQL Backup can save a headache
Save your data before it gets to this!

So, before getting too far into a web scraping project using Scrapy with MySQL let’s spare a moment to get familiar with some basic back up and restore commands.

This is about as simple as can be, so why not try them for yourself if you have MySQL installed, make a sample database, and create a back up, check it, and then delete the original and restore your back up!

First, inspect the server to check your database name(s) that you want to back up for future restores…

sudo mysql -u root -p

MariaDB [(none)]> show databases;


Next :

MariaDB [(none)]> exit;

Now run this command (using your own username and password that has access privileges to the database)

pi@pi4:~ $ mysqldump -u user1 -p newz > newzdump.sql

Next : Check the backup

pi@pi4:~ $ cat newzdump.sql


-- MySQL dump 10.17  Distrib 10.3.25-MariaDB, for debian-linux-gnueabihf (armv7l)
--
-- Host: localhost    Database: newz
--
-- Server version    10.3.25-MariaDB-0+deb10u1

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

-- Table structure for table `tnewz`

DROP TABLE IF EXISTS `tnewz`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `tnewz` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `publication` varchar(255) DEFAULT NULL,
  `author` varchar(50) DEFAULT NULL,
  `story` varchar(255) DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  `posted` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8mb4;
/*!40101 SET character_set_client = @saved_cs_client */;

-- Dumping data for table `tnewz`

LOCK TABLES `tnewz` WRITE;
/*!40000 ALTER TABLE `tnewz` DISABLE KEYS */;
INSERT INTO `tnewz` VALUES (1,'title',NULL,'author','story','/news/uk/1377686/tier-shake-up-boris-johnson-meeting-tier-5-school-closures-reopening-coronavirus-update','2020-12-29 15:33:34'),(2,'title',NULL,'author','story','/news/politics/1377709/nicola-sturgeon-scotland-lockdown-update-announcement-boris-johnson-tiers-tier-5','2020-12-29 15:33:3

To Restore the backup of your Database

pi@pi4:~ $ mysql -u user1 -p newz < newzdump.sql

Summary

We have looked at how to back up and restore a MySQL database using:

  • mysqldump -u user1 -p newz > newzdump.sql
  • mysql -u user1 -p newz < newzdump.sql

If you want to transfer a copy of your database to a remote location then you can use “rsync” or “scp”, or if you are doing this from a home PC you could always use “WinSCP”.
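To automate the backup, one approach is a nightly cron job on the Pi – the schedule, password, paths, and remote host below are placeholders to adapt:

# run "crontab -e" and add lines like these
# dump the database at 02:00 every night (note: % must be escaped as \% in crontab)
0 2 * * * mysqldump -u user1 -pYOURPASSWORD newz > /home/pi/backups/newz_$(date +\%F).sql

# copy the dumps to another machine at 03:00 with rsync (placeholder host/path)
0 3 * * * rsync -a /home/pi/backups/ backupuser@192.168.1.20:/srv/backups/pi4/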


Get started with Pandas and MySQL

How to make a dataframe from your SQL database and query the data

This assumes you already have a sample database set up on your MySQL server and you have the username and password.

In the example shown we are logging on to a Raspberry Pi running MariaDB and we are executing a query to get 3408 properties into a dataframe.

We then run it in Python and view the results.

The Python code is as follows:

import mysql.connector
import pandas as pd

try:
    cnx = mysql.connector.connect(user='user2', password='password2', host='192.168.1.9', database='immodb')
    query = '''select * from imt1'''
    SQL_Query = pd.read_sql_query(query, cnx)
    df = pd.DataFrame(SQL_Query, columns=['title', 'price'])
    print(df)
except mysql.connector.Error as err:
    print(err)
else:
    cnx.close()

(animation: the minimum required code to connect to the database, query the table and convert the results into a Pandas dataframe for further analysis; screenshot: the dataframe sorted with sort_values)
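The sort shown in that screenshot is a one-liner – here assuming the hypothetical imt1 table’s price column holds numeric values:

# show the ten most expensive properties first
print(df.sort_values(by='price', ascending=False).head(10))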

Send email with Python

(and attachments)

If you have seen tutorials online, note that many use port 465 (implicit SSL); this example uses port 587 with STARTTLS instead, which Gmail also supports. Here we show the code you can use to send email and an attachment with this simple Python code.

It’s simple, as you don’t need to know MIME or “octet-stream” – you just need to include the code and add your own specific username, password, account details for Gmail.

Rather than include your password in the code, you can use “load_dotenv” instead. This is especially important if you save your code on GitHub or similar.

You will need the following libraries available in your Python environment: “smtplib”, “ssl” and “email” (all part of the standard library), plus “python-dotenv” if you use load_dotenv.

(screenshots: the send_mail function, with the send_from and send_to values)

Also: try using load_dotenv

# save user/pw info in .env in same dir

from dotenv import load_dotenv

load_dotenv()
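For reference, a minimal sketch of the whole flow – the EMAIL_USER / EMAIL_PASS variable names, recipient, and attachment filename are placeholders, not necessarily what the linked script uses:

import os
import smtplib
import ssl
from email.message import EmailMessage

from dotenv import load_dotenv

load_dotenv()  # reads EMAIL_USER / EMAIL_PASS from a .env file (placeholder names)

msg = EmailMessage()
msg["Subject"] = "Report attached"
msg["From"] = os.getenv("EMAIL_USER")
msg["To"] = "recipient@example.com"
msg.set_content("Hello from Python - report attached.")

# attach a file as a generic binary payload (octet-stream)
with open("report.pdf", "rb") as f:
    msg.add_attachment(f.read(), maintype="application",
                       subtype="octet-stream", filename="report.pdf")

# port 587 with STARTTLS, as discussed above
context = ssl.create_default_context()
with smtplib.SMTP("smtp.gmail.com", 587) as server:
    server.starttls(context=context)
    server.login(os.getenv("EMAIL_USER"), os.getenv("EMAIL_PASS"))
    server.send_message(msg)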

https://github.com/RGGH/BeautifulSoup/blob/MySQL_version/emailer.py

If you would like to see this in action then you can watch the demonstration on “how to send email and attachments using python” on YouTube : https://youtu.be/yj9y4u04gaM