Categories
Python Code Raspberry Pi Scrapy

Price Tracking Amazon

A common task is to track competitors' prices and use that information as a guide to the prices you can charge. If you are buying, you can also spot when a product reaches a new lowest price. This article describes how to web scrape Amazon for that purpose.

Web Scraping Amazon to SQL

Using Python, Scrapy, MySQL, and Matplotlib, you can extract large amounts of data, query it, and produce meaningful visualizations.

In the example featured, we wanted to identify which Amazon books related to “web scraping” had been reduced in price over the time we had been running the spider.

If you want to run your spider daily, see the video for instructions on how to schedule it with cron on a Linux server.

Procedure used for price tracking
[Figure: Price Tracking Amazon with Python]

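# Rows whose price differs from the previous scrape of the same title.
# LAG() needs MySQL 8.0+; use "price < prev_price" to keep only reductions.
query = '''
SELECT t.*
FROM (SELECT amzbooks2.*,
             LAG(price) OVER (PARTITION BY title ORDER BY posted) AS prev_price
      FROM amzbooks2) AS t
WHERE prev_price <> price
'''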
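
To see what the window function does, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database instead of MySQL (SQLite supports LAG() from version 3.25). The table and column names follow the query in the article. The sample rows and the function name find_price_drops are made up for illustration, and the comparison is narrowed to price < prev_price so that only reductions are returned.

```python
import sqlite3

# Hypothetical sample rows: (title, price, posted). Invented for illustration.
ROWS = [
    ("Book A", 30.00, "2021-01-01"),
    ("Book A", 30.00, "2021-01-02"),
    ("Book A", 25.00, "2021-01-03"),
    ("Book B", 40.00, "2021-01-01"),
    ("Book B", 45.00, "2021-01-02"),
]

# Same shape as the article's query, restricted to price reductions.
REDUCED_QUERY = """
SELECT t.title, t.prev_price, t.price, t.posted
FROM (SELECT amzbooks2.*,
             LAG(price) OVER (PARTITION BY title ORDER BY posted) AS prev_price
      FROM amzbooks2) AS t
WHERE t.price < t.prev_price
ORDER BY t.title, t.posted
"""


def find_price_drops(rows):
    """Load rows into a temporary table and return the rows where the price fell."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE amzbooks2 (title TEXT, price REAL, posted TEXT)")
        conn.executemany("INSERT INTO amzbooks2 VALUES (?, ?, ?)", rows)
        return conn.execute(REDUCED_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in find_price_drops(ROWS):
        print(row)
```

The first scrape of each title has no previous row, so prev_price is NULL there and the comparison filters it out. Only the drop from 30.00 to 25.00 for "Book A" is returned.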

Visualize the stored data using Python and Matplotlib
[Figure: Amazon price tracker graph produced from the Scrapy spider output]
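
As a rough idea of how such a graph could be produced, the sketch below groups rows by title and draws one line per book. It is not the code from the repository: the row layout (title, price, posted), the sample data, the function names, and the output file name are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows as they might come back from the database:
# (title, price, posted). Invented for illustration.
ROWS = [
    ("Book A", 25.00, date(2021, 1, 3)),
    ("Book A", 30.00, date(2021, 1, 1)),
    ("Book B", 40.00, date(2021, 1, 1)),
    ("Book B", 45.00, date(2021, 1, 2)),
]


def group_by_title(rows):
    """Return {title: (dates, prices)} with each series sorted by date."""
    series = defaultdict(lambda: ([], []))
    for title, price, posted in sorted(rows, key=lambda r: (r[0], r[2])):
        dates, prices = series[title]
        dates.append(posted)
        prices.append(price)
    return dict(series)


def plot_price_history(rows, path="price_history.png"):
    """Draw one line per title and save the chart to a PNG file."""
    # Imported here so the grouping helper works without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for title, (dates, prices) in group_by_title(rows).items():
        ax.plot(dates, prices, marker="o", label=title)
    ax.set_xlabel("Date scraped")
    ax.set_ylabel("Price")
    ax.legend()
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling plot_price_history(ROWS) writes price_history.png to the current directory. The "Agg" backend is chosen so the script also runs on a server with no screen attached.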
To see how to get to this stage, you may wish to watch the video:
How To Track Prices In Amazon With Scrapy

All of the code is on GitHub


Configure a Raspberry Pi for web scraping

Introduction

The task was to scrape over 50,000 records from a website while being gentle on the site being scraped. A Raspberry Pi Zero was chosen because speed was not a significant issue. In fact, being slower makes it well suited to web scraping when you want to be kind to the site and not request resources too quickly. This article describes using Scrapy, but the same setup would work with Requests and BeautifulSoup.
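
Hardware speed aside, Scrapy has its own settings for limiting how hard a spider hits a site. The excerpt below is a sketch of a project's settings.py using standard Scrapy setting names. The values are illustrative assumptions, not the ones used for this scrape.

```python
# settings.py (excerpt): illustrative values, not the ones used in the article.

ROBOTSTXT_OBEY = True               # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.5                # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # vary each wait between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # only one request in flight per domain

# AutoThrottle adjusts the delay based on how quickly the server responds;
# it never goes below DOWNLOAD_DELAY.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60         # upper bound when the server is slow
```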

The main considerations were:

  • Could it run Scrapy without issue?
  • Could it run with a VPN connection?
  • Would it be able to store the results?

A short test showed that it could collect approximately 50,000 records per day, which meant it was entirely suitable.
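
To put that rate in perspective, a quick calculation shows how much time each record takes on average. It assumes the Pi runs around the clock, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def seconds_per_record(records_per_day):
    """Average time spent on each record if the crawl runs all day."""
    return SECONDS_PER_DAY / records_per_day


print(seconds_per_record(50_000))  # 1.728
```

An average of roughly 1.7 seconds per record is a gentle pace for the site being scraped.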

I wanted a VPN tunnel from the Pi Zero to my VPN provider. This was an unknown, because I had previously only run the VPN client on a Windows PC with a GUI. Now I was attempting to run it from a headless Raspberry Pi!

This took about 15 minutes to set up, which was surprisingly easy.

The only remaining challenges were:

  • Run the spider without having to leave my PC on as well (closing PuTTY in Windows would have terminated the process on the Pi). That's where nohup came in handy.
  • Transfer the output back to a PC (running Ubuntu inside a VM). This is where rsync was handy; scp could also have been used.
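
For reference, the two commands might look something like the ones built below. This is only a sketch: the spider name, file names, host name, and paths are placeholders, and the helper functions are hypothetical. The flags used are real (-o for Scrapy's output file; -a, -v, and -z for rsync).

```python
import shlex


def nohup_crawl_command(spider, output_file, log_file="crawl.log"):
    """Shell line that starts a crawl which keeps running after the SSH session closes."""
    crawl = shlex.join(["nohup", "scrapy", "crawl", spider, "-o", output_file])
    # Send all output to a log file and put the job in the background.
    return f"{crawl} > {shlex.quote(log_file)} 2>&1 &"


def rsync_pull_command(remote, local_dir):
    """Argument list for copying the results from the Pi to the local machine."""
    # -a preserves file attributes, -v is verbose, -z compresses data in transit.
    return ["rsync", "-avz", remote, local_dir]


if __name__ == "__main__":
    # Placeholder names: replace with your own spider, host, and paths.
    print(nohup_crawl_command("myspider", "items.csv"))
    print(shlex.join(rsync_pull_command("pi@raspberrypi.local:project/items.csv", "./")))
```

The first line is run on the Pi over SSH, the second on the PC that should receive the file. Because rsync only transfers what has changed, it can be run repeatedly while the crawl is still in progress.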

See also: writing the Scrapy spider with “Load More”