A common task is to track competitors' prices and use that information as a guide to the prices you can charge; or, if you are buying, to spot when a product reaches a new lowest price. The purpose of this article is to describe how to web scrape Amazon.
Using Python, Scrapy, MySQL, and Matplotlib you can extract large amounts of data, query it, and produce meaningful visualizations.
In the example featured, we wanted to identify which Amazon books related to “web scraping” had been reduced in price over the time we had been running the spider.
If you want to run your spider daily, see the video for instructions on how to schedule a spider with cron on a Linux server.
Procedure used for price tracking
query = '''
select amzbooks2.*
from (
    select amzbooks2.*,
           lag(price) over (partition by title order by posted) as prev_price
    from amzbooks2
) amzbooks2
where prev_price <> price
'''
Visualize the stored data using Python and Matplotlib
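As a rough sketch of this step (assuming a MySQL database named amazon, a SQLAlchemy/pymysql connection, and placeholder credentials, none of which come from the original project), you could load the stored prices into pandas and plot one line per title with Matplotlib:

# minimal sketch: read the stored prices into pandas and plot them
# (database name, credentials, and driver are assumptions, not the original setup)
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/amazon")

df = pd.read_sql("select title, price, posted from amzbooks2 order by title, posted", engine)

# one line per book title, showing how its price moved over time
for title, group in df.groupby("title"):
    plt.plot(group["posted"], group["price"], label=title[:30])

plt.xlabel("Date")
plt.ylabel("Price")
plt.legend(fontsize=6)
plt.show()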
The most important thing when starting to scrape is to establish what you want in your final output.
Here are the data points we want to extract:
‘title’
‘author’
‘star_rating’
‘book_format’
‘price’
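One way to hold these fields is a Scrapy Item; here is a minimal sketch (the class name is illustrative, and the original spider may simply yield plain dicts):

# illustrative Item class holding the fields listed above
import scrapy

class AmazonBookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    star_rating = scrapy.Field()
    book_format = scrapy.Field()
    price = scrapy.Field()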
Now we can write our parse method, and once done, we can finally add on the “next page” code.
The Amazon pages have white space around the author name(s), so this is a good example of when to use 'normalize-space'.
We also had to make sure we weren't splitting the partially parsed response too soon and accidentally removing the second author (if there was one).
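Here is a small, self-contained illustration of what normalize-space does (the HTML is made up; Amazon's real markup is more complex):

# demo of normalize-space with padded text, runnable in a plain Python shell
from scrapy.selector import Selector

html = '<div class="author">\n    A. N. Author   \n</div>'
sel = Selector(text=html)

# without normalize-space you get the raw text, padding included
print(repr(sel.xpath('//div[@class="author"]/text()').get()))

# with normalize-space the leading/trailing white space is stripped
print(sel.xpath('normalize-space(//div[@class="author"])').get())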
Some of the results are perhaps not what you want, but this is due to Amazon returning products which it thinks are in some way related to your search criteria!
By using pipelines in Scrapy, along with the process_item method, we were able to filter out much of what was irrelevant. The great thing about web scraping to an SQL database is the flexibility it offers once you have the data. SQL, Pandas, Matplotlib and Python are a powerful combination…
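A minimal sketch of such a pipeline (the keyword test is an assumption about how the filtering was done; the real pipeline may check other fields):

# sketch of a filtering pipeline; the keyword check is an assumption
from scrapy.exceptions import DropItem

class RelevantBooksPipeline:
    def process_item(self, item, spider):
        title = (item.get('title') or '').lower()
        if 'scraping' not in title and 'python' not in title:
            raise DropItem(f"Not relevant: {title}")
        return item

The pipeline would still need to be enabled via ITEM_PIPELINES in settings.py.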
This article describes how to form a Scrapy XPath selector to pick out the hidden value that you may need to POST along with a username and password when scraping a site with a login. These hidden values are dynamically created, so you must send them with your form data in your POST request.
Step one
Identify the source in the browser:
OK, so we want to pass "__VIEWSTATE" as a key:value pair in the POST request.
This is the xpath selector format you will need to use:
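A selector along these lines would pick out the hidden value (the attribute name follows the standard ASP.NET pattern; here, response is the login page inside your spider's parse method):

# pick the hidden __VIEWSTATE value out of the login page markup
viewstate = response.xpath('//input[@name="__VIEWSTATE"]/@value').get()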
The following article shows you how to use Scrapy to log in to sites that have username and password authentication.
The important thing to remember is that there may be additional data that needs to be sent to the login page, data that is in addition to just username and password…
We are going to cover:
Identifying what data to POST
The Scrapy method to log in
How to progress after logging in
Once you have seen that the Spider has logged in, you can proceed to scrape what you need. Remember, you are looking for “Logout” because that will mean you are logged in!
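Putting it together, here is a hedged sketch of the login flow (the URL, form field names, and the "Logout" check are placeholders rather than the original spider; FormRequest.from_response also copies hidden inputs such as __VIEWSTATE into the POST for you):

# minimal sketch of a login spider; URL and form field names are placeholders
import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = "login_example"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # from_response copies hidden inputs (e.g. __VIEWSTATE) into the POST
        yield FormRequest.from_response(
            response,
            formdata={"username": "your_username", "password": "your_password"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # "Logout" appearing in the page is our signal that the login worked
        if "Logout" in response.text:
            self.logger.info("Logged in - start scraping the protected pages")
            # yield further Requests / items here
        else:
            self.logger.error("Login failed")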
Conclusion: we have looked at how to use FormRequest to POST form data in a Scrapy spider, and we have extracted data from a site that is protected by a login form.
See the video on YouTube :
Scrapy Form Login | How to log in to sites using FormRequest | Web Scraping Tutorial
The task was to scrape over 50,000 records from a website and be gentle on the site being scraped. A Raspberry Pi Zero was chosen to do this as speed was not a significant issue, and in fact, being slower makes it ideal for web scraping when you want to be kind to the site you are scraping and not request resources too quickly. This article describes using Scrapy, but BeautifulSoup or Requests would work in the same way.
The main considerations were:
Could it run Scrapy without issue?
Could it run with a VPN connection?
Would it be able to store the results?
A quick test proved that it could collect approximately 50,000 records per day, which meant it was entirely suitable.
I wanted a VPN tunnel from the Pi Zero to my VPN provider. This was an unknown, because I had only previously run it on a Windows PC with a GUI. Now I was attempting to run it from a headless Raspberry Pi!
This took approx 15 mins to set up. Surprisingly easy.
The only remaining challenges were:
Run the spider without having to leave my PC on as well (closing PuTTY in Windows would have terminated the process on the Pi) – that's where nohup came in handy.
Transfer the output back to a PC (running Ubuntu inside a VM) – this is where rsync came in handy (scp could also have been used).
If you are web scraping using Python and Scrapy, for instance, you may need to extract reviews or comments that are loaded from JavaScript. This means you cannot use your CSS or XPath selectors in the way you can with regular HTML.
Parse
Instead, check in your browser whether you can parse the code: start with Ctrl+F and search for "json" to track down some JSON in the form of a Python dictionary. You 'just' need to isolate it.
view-source to find occurrences of “JSON” in your page
The response is not nice, but you can gradually shrink it down in the Scrapy shell or the Python shell…
Figure 1 – The response
Split, strip, replace
From within Scrapy, or in your own Python code, you can split, strip, and replace with the built-in Python string methods until you have just a dictionary that you can pass to json.loads.
x = response.text.split('JSON.parse')[3].replace("\u0022","\"").replace("\u2019m","'").lstrip("(").split(" ")[0].strip().replace("\"","",1).replace("\");","")
Master replace, strip, and split and you won't need regular expressions!
With response.text now reduced to a JSON-friendly string, you can do this:
import json

q = json.loads(x)
comment = q['doctor']['sample_rating_comment']
comment = comment.replace("\u2019", "'")
print(comment)
The key things to remember when parsing the response text are to use the index to pick out the section you want, and to use the backslash "\" to escape characters when you are working with quotes and actual backslashes in the text you're parsing.
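A self-contained toy example of that idea (the JavaScript string below is made up purely to show the split and the backslash handling):

import json

# a made-up example of JSON embedded in JavaScript, as seen in view-source
js = 'var data = JSON.parse("{\\"doctor\\": {\\"sample_rating_comment\\": \\"Great\\"}}");'

# isolate the JSON string: split on JSON.parse and trim the JavaScript around it
raw = js.split('JSON.parse("')[1].split('");')[0]
# undo the backslash-escaped quotes so json.loads can read it
raw = raw.replace('\\"', '"')

q = json.loads(raw)
print(q["doctor"]["sample_rating_comment"])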
Figure 2 – The parsed response
Conclusion
Rendering to HTML using Splash or Selenium, or using regular expressions, is not always essential. Hopefully this helps illustrate how you can extract values from a Python dictionary, from JSON, from JavaScript!
You may see a mass of text on your screen to begin with, but persevere and you can arrive at the dictionary contained within…
Demo of getting a Python Dictionary from JSON from JavaScript
As an overview of how web scraping works, here is a brief introduction to the process, with the emphasis on using Scrapy to scrape a listing site.
If you would like to know more, or would like us to scrape data from a specific site please get in touch.
*This article also assumes you have some knowledge of Python, and have Scrapy installed. It is recommended to use a virtual environment. Although the focus is on using Scrapy, similar logic applies if you are using Beautiful Soup or Requests (Beautiful Soup does some of the hard work for you with find and select).
Below is a basic representation of the process used to scrape a page / site
Most sites you will want to scrape provide data in a list – so check the number per page, and the total number. Is there a “next page” link? – If so, that will be useful later on…
Identify the div and class name to use in your “selector”
Identifying the div and class name
Using your web browser developer tools, traverse up through the elements (Chrome = Inspect Elements) until you find a 'div' (well, it's usually a div) that contains the entire advert, and go no higher up the DOM.
(advert = typically: the thumbnail + mini description that leads to the detail page)
The code inside the 'div' will be the iterable that you use with the for loop. The "." before the "//" in the XPath means you select all of them, e.g. all 20 on a listings page that has 20 adverts per page.
Now that you have the XPath and have checked it in Scrapy shell, you can use it with a for loop and selectors for each piece of information you want to pick out. If you are using XPath, you can use the class name from the listing and just add "." to the start, as highlighted below.
This "." ensures you will be able to iterate through all 20 adverts at the same node level (i.e. all 20 on the page).
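A tiny, self-contained demonstration of why that leading "." matters (the HTML is invented; a real listings page is obviously larger):

# each iteration stays inside one advert because of the relative "." prefix
from scrapy.selector import Selector

html = """
<div class="advert"><h2>Advert one</h2><span class="price">10</span></div>
<div class="advert"><h2>Advert two</h2><span class="price">12</span></div>
"""
sel = Selector(text=html)

for ad in sel.xpath('//div[@class="advert"]'):
    # ".//h2" searches only within the current advert, not the whole page
    print(ad.xpath('.//h2/text()').get(), ad.xpath('.//span[@class="price"]/text()').get())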
To go to the details page we use yield, but we also have to pass the variables that we have picked out on the main page, so we use 'meta' (or the newer alternative, 'cb_kwargs').
Using 'meta' allows us to pass variables to the next function – in this case it's called "fetch_details" – where they will be added to the rest of the variables collected and sent to the FEEDS export, which makes the output file.
There is also a newer, recommended alternative to 'meta' for passing variables between functions in Scrapy: 'cb_kwargs'.
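A minimal sketch of how the hand-off to fetch_details might look (the spider name, URL, and class names are placeholders; only the fetch_details name comes from the text above):

# sketch of passing listing-page values to the detail-page callback
import scrapy

class ListingSpider(scrapy.Spider):
    name = "listing_example"
    start_urls = ["https://example.com/listings"]  # placeholder URL

    def parse(self, response):
        for ad in response.xpath('//div[@class="advert"]'):
            title = ad.xpath('.//h2/a/text()').get()
            detail_url = ad.xpath('.//h2/a/@href').get()

            # older style: yield response.follow(detail_url, callback=self.fetch_details,
            #                                    meta={"title": title})
            # newer, recommended style:
            yield response.follow(detail_url, callback=self.fetch_details,
                                  cb_kwargs={"title": title})

    def fetch_details(self, response, title):
        # values passed via cb_kwargs arrive as named arguments
        yield {
            "title": title,
            "description": response.xpath('//div[@id="description"]//text()').get(),
        }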
Once you have all the data, it is time to yield it to the FEEDS export.
"FEEDS" is the setting that lets you write output to your chosen file format/destination.
This is the format and destination that you have set for your output file.
*Note: it can also be a database, rather than a JSON or CSV file.
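For example, the FEEDS setting might look like this in settings.py (the filename and options are just one possible choice):

# write the scraped items to a JSON file; filename and options are a choice, not fixed
FEEDS = {
    "output.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,
    },
}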
You may wish to run all of your code from within the script, in which case you can do this:
from scrapy.crawler import CrawlerProcess  # this import needs to be at the top of your script

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(YelpSpider)
    process.start()
Web Scraping – Summary
We have looked at the steps involved and some of the code you’ll find useful when using Scrapy.
Identifying the HTML to iterate through is the key.
Try and find the block of code that has all of the listings / adverts, and then narrow it down to one advert/listing. Once you have done that you can test your code in “scrapy shell” and start building your spider.
(Scrapy shell can be run from your CLI, independent of your spider code.)