The task was to scrape more than 50,000 records from a website while being gentle on the site. I chose a Raspberry Pi Zero because speed was not a significant issue; in fact, a slower machine is well suited to web scraping when you want to be kind to the site and not request resources too quickly. This article describes using Scrapy, but Requests with BeautifulSoup would work in the same way.
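Being gentle on a site is mostly a matter of throttling settings. The sketch below is not the configuration used for this project; it only shows the standard Scrapy settings that control request pacing. The dictionary name `POLITE_SETTINGS` and all of the values are illustrative assumptions.

```python
# Hypothetical "polite" settings; the values are illustrative, not the author's.
POLITE_SETTINGS = {
    # Wait about 2 seconds between requests to the same site.
    "DOWNLOAD_DELAY": 2,
    # Only one request in flight per domain at a time.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Let Scrapy lengthen the delay automatically when the server responds slowly.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    # Respect the site's robots.txt rules.
    "ROBOTSTXT_OBEY": True,
}

for name, value in POLITE_SETTINGS.items():
    print(f"{name} = {value}")
```

These keys can be copied into a project's `settings.py` or assigned to a spider's `custom_settings` class attribute.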
The main considerations were:
- Could it run Scrapy without issue?
- Could it run with a VPN connection?
- Would it be able to store the results?
A quick test showed that it could collect approximately 50,000 records per day, which meant it was entirely suitable.
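To put that rate in perspective, here is a rough back-of-the-envelope calculation. It assumes the crawl runs continuously for a full 24 hours, which is my assumption rather than a measured figure; the variable names are purely illustrative.

```python
# Rough arithmetic: how fast is 50,000 records per day?
RECORDS_PER_DAY = 50_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Average gap between records if the crawl runs around the clock (assumption).
seconds_per_record = SECONDS_PER_DAY / RECORDS_PER_DAY
records_per_minute = RECORDS_PER_DAY / (24 * 60)

print(f"About one record every {seconds_per_record:.2f} s")  # 1.73 s
print(f"About {records_per_minute:.1f} records per minute")  # 34.7
```

At roughly one record every couple of seconds, the full data set fits into about a day of crawling without hammering the site.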
I wanted a VPN tunnel from the Pi Zero to my VPN provider. This was an unknown, because I had previously only run the VPN client on a Windows PC with a GUI. Now I was attempting to run it from a headless Raspberry Pi!
This took approximately 15 minutes to set up, which was surprisingly easy.
The only remaining challenges were:
- Run the spider without having to leave my PC on as well (closing PuTTY in Windows would have terminated the process on the Pi). That's where nohup came in handy.
- Transfer the output back to a PC (running Ubuntu inside a VM). This is where rsync was handy (scp could also have been used).
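The two steps above can be sketched as a pair of small Python helpers that build the command lines. This is only an illustration: the spider name, file names, host `pi@raspberrypi.local`, and paths are all hypothetical placeholders, and the helpers only print the commands rather than running them.

```python
import shlex


def nohup_crawl_command(spider, output, log):
    """Shell line that starts a Scrapy crawl which keeps running after the SSH session closes."""
    crawl = shlex.join(["nohup", "scrapy", "crawl", spider, "-o", output])
    # Send stdout and stderr to a log file and put the job in the background.
    return f"{crawl} > {shlex.quote(log)} 2>&1 &"


def rsync_pull_command(remote, remote_path, local_dir):
    """Argument list for copying the results from the Pi to the local machine."""
    # -a keeps file attributes, -v is verbose, -z compresses data in transit.
    return ["rsync", "-avz", f"{remote}:{remote_path}", local_dir]


# Hypothetical names: replace with your own spider, files, and host.
print(nohup_crawl_command("records", "records.csv", "crawl.log"))
print(shlex.join(rsync_pull_command("pi@raspberrypi.local", "scraper/records.csv", "./results/")))
```

The first line is meant to be run on the Pi over SSH, after which PuTTY can be closed; the second is run on the Ubuntu machine to pull the file back. An argument list like the one returned by `rsync_pull_command` can be passed straight to `subprocess.run` if you prefer to script the transfer.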
See also: writing the Scrapy spider with “Load More”.