
regex examples

Sooner or later you will need to resort to using regular expressions if you are scraping text. This article will show you some useful Python (version 3) examples which you can use over and over again.

We’ll be using “re.search” and “group”

Example 1 – parse a paragraph of text for the number of floors: you need to find the word “floor” (or its equivalent) and then extract the number(s) directly before it.

re.search("(\d+)(\s[Ee]tage)", property_data)
We look for 1 or more digits followed by a space followed by “Etage “or “etage”
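
For instance, a quick sketch (the sample sentence below is just an illustration):

import re

# sample text for illustration only
property_data = "Die Wohnung befindet sich in der 4 Etage des Hauses"

match = re.search(r"(\d+)(\s[Ee]tage)", property_data)
if match:
    print(match.group(1))  # -> '4'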

Example 2 – parse a paragraph of text and extract the vendor details: locate the text that immediately follows the label, but no more…

re.search(r'Vendor Name:([A-Z].+?)(?=Payment)', invoice)
re.search(r'Vendor Number:([0-9]+)', invoice)
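
Again as a sketch, with a made-up invoice string:

import re

# sample invoice text for illustration only
invoice = "Vendor Name:Acme Supplies Ltd Payment Terms: 30 days Vendor Number:98765"

name = re.search(r'Vendor Name:([A-Z].+?)(?=Payment)', invoice)
number = re.search(r'Vendor Number:([0-9]+)', invoice)

if name and number:
    print(name.group(1))    # -> 'Acme Supplies Ltd '
    print(number.group(1))  # -> '98765'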

If you would like to copy the code from GitHub then visit the RGGH page here


Back Up MySQL

Back up your MySQL database (on a Raspberry Pi)

Once in production you will need to ensure you have a copy of your data, for reasons we can all identify with:

  • Hardware failure
  • Data Corruption
  • Human Error

So, before getting too far into a web scraping project using Scrapy with MySQL, let’s spare a moment to get familiar with some basic backup and restore commands.

This is about as simple as it gets, so why not try it for yourself: if you have MySQL installed, make a sample database, create a backup, check it, then delete the original and restore your backup!

First, inspect the server to check your database name(s) that you want to back up for future restores…

sudo mysql -u root -p

MariaDB [(none)]> show databases;

checking databases in MySQL

Next :

MariaDB [(none)]> exit;

Now run this command (using your own username and password that has access privileges to the database)

pi@pi4:~ $ mysqldump -u user1 -p newz > newzdump.sql

Next : Check the backup

pi@pi4:~ $ cat newzdump.sql

checking sql dump

-- MySQL dump 10.17  Distrib 10.3.25-MariaDB, for debian-linux-gnueabihf (armv7l)
--
-- Host: localhost    Database: newz
--
-- Server version  10.3.25-MariaDB-0+deb10u1

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

-- Table structure for table tnewz

DROP TABLE IF EXISTS tnewz;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE tnewz (
id int(11) NOT NULL AUTO_INCREMENT,
title varchar(255) NOT NULL,
publication varchar(255) DEFAULT NULL,
author varchar(50) DEFAULT NULL,
story varchar(255) DEFAULT NULL,
url varchar(255) DEFAULT NULL,
posted timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8mb4;
/*!40101 SET character_set_client = @saved_cs_client */;

-- Dumping data for table tnewz

LOCK TABLES tnewz WRITE;
/*!40000 ALTER TABLE tnewz DISABLE KEYS */;
INSERT INTO tnewz VALUES (1,'title',NULL,'author','story','/news/uk/1377686/tier-shake-up-boris-johnson-meeting-tier-5-school-closures-reopening-coronavirus-update','2020-12-29 15:33:34'),(2,'title',NULL,'author','story','/news/politics/1377709/nicola-sturgeon-scotland-lockdown-update-announcement-boris-johnson-tiers-tier-5','2020-12-29 15:33:3

To restore the backup of your database:

pi@pi4:~ $ mysql -u user1 -p newz < newzdump.sql

Summary

We have looked at how to back up and restore a MySQL database using:

  • mysqldump -u user1 -p newz > newzdump.sql
  • mysql -u user1 -p newz < newzdump.sql

If you want to transfer a copy of your database to a remote location you can use “rsync” or “scp”, or if you are doing this from a home PC you could always use “WinSCP”.
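
For example, to copy the dump to another machine on your network you could run something like this (the remote IP address and destination path are just placeholders):

pi@pi4:~ $ scp newzdump.sql pi@192.168.1.10:/home/pi/backups/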


Get started with Pandas and MySQL

How to make a dataframe from your SQL database and query the data

This assumes you already have a sample database set up on your MySQL server and you have the username and password.

In the example shown we are logging on to a Raspberry Pi running MariaDB and we are executing a query to get 3408 properties into a dataframe.

We then run it in Python and view the results.

The Python code is as follows:

import mysql.connector
import pandas as pd

try:
    # connect to the MariaDB/MySQL server running on the Raspberry Pi
    cnx = mysql.connector.connect(user='user2', password='password2',
                                  host='192.168.1.9', database='immodb')
    query = '''select * from imt1'''
    # run the query and load the results
    SQL_Query = pd.read_sql_query(query, cnx)
    # keep just the columns we are interested in
    df = pd.DataFrame(SQL_Query, columns=['title', 'price'])
    print(df)
except mysql.connector.Error as err:
    print(err)
else:
    cnx.close()

Animation: the minimum required code to connect to the database, query the table and convert it into a Pandas dataframe to perform further analysis.
Pandas dataframe from MySQL – using sort_values.
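
For example, once the dataframe exists you can sort it straight away (a quick sketch, assuming the price column is numeric):

# show the most expensive properties first
print(df.sort_values(by='price', ascending=False).head())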

Send email with Python

(and attachments)

If you have seen any tutorials online, beware of the many that use port 465; port 587 is the correct port. Here we show the code you can use to send an email and an attachment with this simple Python code.

It’s simple, as you don’t need to know about MIME or “octet-stream” – you just need to include the code and add your own username, password, and account details for Gmail.

Rather than include your password in the code, you can use “load_dotenv” instead. This is especially important if you save your code on GitHub or similar.

You will need the following in your Python environment: “smtplib”, “ssl” and “email” (all in the standard library), plus “python-dotenv” if you use load_dotenv.

send_mail function
send_from and send_to
Also: Try using load_dotenv

# save user/pw info in .env in same dir

from dotenv import load_dotenv

load_dotenv()

https://github.com/RGGH/BeautifulSoup/blob/MySQL_version/emailer.py
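
For reference, a minimal sketch of the idea (not the exact code from the repo linked above); the addresses, the attachment filename and the .env variable names GMAIL_USER / GMAIL_PASS are placeholders:

import os
import smtplib
import ssl
from email.message import EmailMessage
from dotenv import load_dotenv

load_dotenv()  # reads GMAIL_USER and GMAIL_PASS from .env

msg = EmailMessage()
msg["Subject"] = "Report attached"
msg["From"] = os.getenv("GMAIL_USER")
msg["To"] = "recipient@example.com"
msg.set_content("Please find the report attached.")

# attach a file as a generic binary payload ("octet-stream")
with open("report.pdf", "rb") as f:
    msg.add_attachment(f.read(), maintype="application",
                       subtype="octet-stream", filename="report.pdf")

context = ssl.create_default_context()
with smtplib.SMTP("smtp.gmail.com", 587) as server:
    server.starttls(context=context)   # upgrade the connection on port 587
    server.login(os.getenv("GMAIL_USER"), os.getenv("GMAIL_PASS"))
    server.send_message(msg)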

If you would like to see this in action then you can watch the demonstration on “how to send email and attachments using python” on YouTube : https://youtu.be/yj9y4u04gaM


Web Scraping with bs4

BeautifulSoup

Conditional logic with soup.select

# make a new list "ls_expose" and add hrefs to it if they contain 'expose' in the url
ls_expose = []
for a in soup.select('a[href]'):
    if 'expose' in a['href']:
        ls_expose.append(a['href'])

Get all dropdown values from html

(using page saved locally, for testing)
from bs4 import BeautifulSoup
import requests

my_page = "Wohnung mieten im Umkreis von 50 km von Mannheim - Immob.html"

soup = BeautifulSoup(open(my_page),"lxml")
soup = soup.find('div',class_ = "select-input-wrapper")
items = soup.select('option[value]')
values = [item.get('value') for item in items]
textValues = [item.text for item in items]
print(textValues)

soup.select('option[value]')

how to get dropdown option values with bs4
127 option values as seen in the HTML; we need to be able to get these with our code…

soup.select output
Output from script – this shows the values as text; they need to be converted to integers if you are going to use them in a for loop. See below for the code to do this, using a list comprehension.
from bs4 import BeautifulSoup
import requests

my_page = "Wohnung mieten im Umkreis von 50 km von Mannheim - Immob.html"
soup = BeautifulSoup(open(my_page), "lxml")
soup = soup.find('div', class_="select-input-wrapper")
items = soup.select('option[value]')
values = [item.get('value') for item in items]
x = [int(item) for item in values]
print(x)

Note how I use soup.find to narrow down the search, otherwise I would have matched other dropdowns as well.


Price Tracking Amazon

A common task is to track competitors’ prices and use that information as a guide to the prices you can charge, or, if you are buying, to spot when a product is at a new lowest price. The purpose of this article is to describe how to web scrape Amazon.

Web Scraping Amazon to SQL

Using Python, Scrapy, MySQL, and Matplotlib you can extract large amounts of data, query it, and produce meaningful visualizations.

In the example featured, we wanted to identify which Amazon books related to “web scraping” had been reduced in price over the time we had been running the spider.

If you want to run your spider daily then see the video for instructions on how to schedule a spider in CRON on a Linux server.
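
The video covers scheduling in detail; as a rough sketch, a crontab entry to run the spider each morning could look something like this (the project path and spider name are placeholders, and you may need the full path to scrapy if cron cannot find it):

0 6 * * * cd /home/pi/scrapy_project && scrapy crawl amazon_books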

Procedure used for price tracking
Price Tracking Amazon with Python

query = '''select amzbooks2.* from
    (select amzbooks2.*,
            lag(price) over (partition by title order by posted) as prev_price
     from amzbooks2) amzbooks2
where prev_price <> price'''

The lag() window function looks up the price recorded for the same title on the previous run, so the final where clause keeps only the rows where the price has actually changed.

Visualize the stored data using Python and Matplotlib
Amazon Price Tracker Graph produced from Scrapy Spider output
To see how to get to this stage, you may wish to watch the video:
How To Track Prices In Amazon With Scrapy

All of the code is on GitHub


How To Web Scrape Amazon (successfully)

You may want to scrape Amazon for information about books about web scraping!

We scrape Amazon for web scraping books!

We shorten what would have been a very long selector by using “contains” in our XPath:

response.xpath('//*[contains(@class,"sg-col-20-of-24 s-result-item s-asin")]')

The most important thing when starting to scrape is to establish what you want in your final output.

Here are the data points we want to extract :

  • ‘title’
  • ‘author’
  • ‘star_rating’
  • ‘book_format’
  • ‘price’

Now we can write our parse method, and once done, we can finally add on the “next page” code.

The Amazon pages have white space around the Author name(s), so this will be an example of when to use ‘normalize-space’.
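
As a sketch of the idea (the selector here is purely illustrative, not the exact one used in the spider, and assumes "book" is one of the items returned by the response.xpath selector shown above):

# normalize-space() trims leading/trailing whitespace and collapses repeated spaces
author = book.xpath('normalize-space(.//a[contains(@href, "field-author")]/text())').get()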

We also had to make sure we weren’t splitting the partially parsed response too soon and removing the second Author (if there was one).

Some of the results are perhaps not what you want, but this is due to Amazon returning products which it thinks are in some way related to your search criteria!

By using pipelines in Scrapy, along with the process_item method we were able to filter much of what was irrelevant. The great thing about web scraping to an SQL database is the flexibility it offers once you have the data. SQL, Pandas, Matplotlib and Python are a powerful combination…
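
As a minimal sketch of what such a pipeline can look like (the keyword filter and field name here are illustrative, not the exact rules used in the project):

# pipelines.py
from scrapy.exceptions import DropItem

class RelevanceFilterPipeline:
    def process_item(self, item, spider):
        # drop anything Amazon returned that is not actually about web scraping
        title = item.get('title', '') or ''
        if 'scraping' not in title.lower():
            raise DropItem(f"Not relevant: {title}")
        return item

Remember to enable the pipeline via ITEM_PIPELINES in settings.py.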

If you are unsure about how any of the code works, drop us a comment on the comments section of the Web Scraping Amazon : YouTube video.


Combine Scrapy with Selenium

A major disadvantage of Scrapy is that it cannot handle dynamic websites (e.g. ones that rely on JavaScript).

If you need to get past a login that is proving impossible, usually because the form data keeps changing, you can use Selenium to get through the login screen and then pass the response back into Scrapy.

It may sound like a workaround, and it is, but it’s a good way to get logged in so you can get at the content much more quickly than if you try to use Selenium to do it all.

Selenium is for testing, but sometimes you can combine Selenium and Scrapy to get the job done!

Watch the YouTube video on How to login past javascript & send response back to Scrapy

Combining Scrapy with Selenium
Combining Scrapy with Selenium Solution
Combining Scrapy with Selenium Prerequisites
Combining Scrapy with Selenium - integrating into Scrapy

Below:

The ‘LoginEM’ and ‘LoginPW’ represent the ‘name’ attributes of the input fields (find these by viewing the source in your browser).

self.pw and self.em are the variables that hold your stored email and password – I’ve stored mine as environment variables in .bash_profile on the host computer.

Combining Scrapy with Selenium - Logging in
Combining Scrapy with Selenium - response.replace
Combining Scrapy with Selenium start_urls
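
Putting it together, a minimal sketch of the idea (not the exact spider from the video) looks something like this; the URL, environment variable names and the submit-button selector are placeholders, while the field names 'LoginEM' and 'LoginPW' are the ones described above:

import os

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class LoginSpider(scrapy.Spider):
    name = "login_spider"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # use Selenium to get past the login form
        driver = webdriver.Firefox()
        driver.get(response.url)
        driver.find_element(By.NAME, "LoginEM").send_keys(os.environ["EM"])
        driver.find_element(By.NAME, "LoginPW").send_keys(os.environ["PW"])
        driver.find_element(By.XPATH, "//button[@type='submit']").click()
        # (in practice you may need an explicit wait here for the post-login page)

        # hand the logged-in page back to Scrapy and carry on as normal
        response = response.replace(body=driver.page_source, encoding="utf-8")
        driver.quit()
        yield from self.parse_after_login(response)

    def parse_after_login(self, response):
        # regular Scrapy parsing from here on
        yield {"title": response.css("title::text").get()}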

Xpath for hidden values

This article describes how to form a Scrapy xpath selector to pick out the hidden value that you may need to POST along with a username and password when scraping a site with a login. These hidden values are created dynamically, so you must send them with your form data in your POST request.

Step one

Identify the source in the browser:

Ok, so we want to pass “__VIEWSTATE” as a key:value pair in the POST request.

This is the xpath selector format you will need to use:

response.xpath('//div[@class="aspNetHidden"]/input[@name="__VIEWSTATE"]/@value').get()

This should get what you need… in the Scrapy SHELL!

But… not when you run it in your Spider…

Instead, you need to use :

response.xpath('//*[@id="Form1"]/input[@name="__VIEWSTATE"]/@value').get()
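
Once you have the value, you include it in your form data. A rough sketch (assuming import scrapy at the top of your spider; the username/password field names and values are placeholders):

# inside parse(self, response) - send the hidden value along with the credentials
viewstate = response.xpath('//*[@id="Form1"]/input[@name="__VIEWSTATE"]/@value').get()

yield scrapy.FormRequest(
    url=response.url,
    formdata={
        "__VIEWSTATE": viewstate,
        "username": "your_username",   # placeholder
        "password": "your_password",   # placeholder
    },
    callback=self.after_login,
)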


Scrapy Form Login

Scrapy – how to login to a website

This article will show you how to use Scrapy to log in to sites that have username and password authentication.

The important thing to remember is that there may be additional data that needs to be sent to the login page, in addition to just the username and password…

We are going to cover :

🔴 Identifying what data to POST

🔴 The Scrapy method to login

🔴 How to progress after logging in

gather data to submit
passwords in bash profile
scrapy form data
scrapy formrequest

Once you have seen that the Spider has logged in, you can proceed to scrape what you need. Remember, you are looking for “Logout” because that will mean you are logged in!
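
As a minimal sketch of the approach (the URL, field names and credentials are placeholders, not the site used in the video):

import scrapy


class FormLoginSpider(scrapy.Spider):
    name = "form_login"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # from_response() copies any hidden fields already present in the form
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "your_username", "password": "your_password"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # look for "Logout" - if it is there, we are logged in
        if "Logout" in response.text:
            self.logger.info("Logged in - carry on scraping from here")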

Conclusion: we have looked at how to use FormRequest to POST form data in a Scrapy spider, and we have extracted data from a site that is protected by a login form.

See the video on YouTube :

Scrapy Form Login | How to log in to sites using FormRequest | Web Scraping Tutorial

https://youtu.be/VvU9mR-WNSA