How To Web Scrape Amazon (successfully)
You may want to scrape Amazon for information on books about web scraping!
We shorten what would otherwise have been a very long selector by using “contains” in our XPath:
response.xpath('//*[contains(@class,"sg-col-20-of-24 s-result-item s-asin")]')
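To see what “contains” is doing in that selector, here is a rough sketch that emulates the same test using only the Python standard library. The HTML snippet and the `matching_elements` helper are made up for illustration; in the spider itself you would keep the `response.xpath(...)` call as written. The point to notice is that `contains(@class, ...)` is a plain substring test on the whole class attribute, so the class names have to appear in that order, separated by single spaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a search results page.
SAMPLE = """
<div>
  <div class="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12">Book A</div>
  <div class="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12">Book B</div>
  <div class="s-result-item s-widget">Not a product</div>
</div>
"""

NEEDLE = "sg-col-20-of-24 s-result-item s-asin"


def matching_elements(root, needle):
    """Emulate //*[contains(@class, needle)] with a substring test."""
    return [el for el in root.iter() if needle in el.get("class", "")]


root = ET.fromstring(SAMPLE)
results = matching_elements(root, NEEDLE)
print([el.text for el in results])  # ['Book A', 'Book B']
```

The third `div` is skipped because its class attribute does not contain the full substring, which is the behaviour we rely on to pick out only the product rows.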
The most important thing when starting to scrape is to establish what you want in your final output.
Here are the data points we want to extract:
- ‘title’
- ‘author’
- ‘star_rating’
- ‘book_format’
- ‘price’
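One way to pin down that final output before writing any selectors is to declare it as a small container. The sketch below uses a plain dataclass; the name `BookItem` and the placeholder values are assumptions for illustration, and a `scrapy.Item` with `scrapy.Field()` attributes would be the more usual choice inside a Scrapy project.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BookItem:
    """One search result; the field names follow the list above."""
    title: Optional[str] = None
    author: Optional[str] = None
    star_rating: Optional[str] = None
    book_format: Optional[str] = None
    price: Optional[str] = None


# Placeholder values, purely illustrative.
item = BookItem(title="Example Title", author="Example Author")
print(asdict(item))
```

Every field defaults to `None`, so a result that is missing a price or a rating still produces a complete row.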
Now we can write our parse method, and once done, we can finally add on the “next page” code.
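Jumping ahead briefly to the “next page” step: it usually comes down to reading the link behind the “Next” button and turning it into an absolute URL. The helper `next_page_url` and the example URLs below are hypothetical, and this is only a sketch of the idea rather than the code used in the project.

```python
from typing import Optional
from urllib.parse import urljoin


def next_page_url(current_url: str, href: Optional[str]) -> Optional[str]:
    """Return the absolute URL of the next results page, or None on the last page."""
    if not href:
        # No "next" link was found, so there is nothing more to crawl.
        return None
    return urljoin(current_url, href)


# Hypothetical URLs, for illustration only.
print(next_page_url("https://www.example.com/s?k=web+scraping", "/s?k=web+scraping&page=2"))
print(next_page_url("https://www.example.com/s?k=web+scraping&page=7", None))
```

Inside a Scrapy spider, `response.follow(href, callback=self.parse)` resolves relative links for you, so the explicit `urljoin` is mainly useful for seeing what happens under the hood.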
The Amazon pages have white space around the author name(s), so this is a good example of when to use ‘normalize-space’.
We also had to make sure we weren’t splitting the partially parsed response too soon and, as a result, removing the second author (if there was one).
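In XPath you wrap the expression in `normalize-space(...)`, which trims the ends of the string and collapses each run of white space into a single space. The sketch below mimics that behaviour in plain Python so you can see the effect; the `normalize_space` helper and the sample value are made up for illustration.

```python
import re


def normalize_space(text: str) -> str:
    """Mimic XPath normalize-space(): trim the ends, collapse inner whitespace runs."""
    # XPath treats space, tab, carriage return and newline as white space.
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    return collapsed.strip(" ")


raw_author = "\n        Example   Author\n    "  # made-up value
print(repr(normalize_space(raw_author)))  # 'Example Author'
```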
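As a rough illustration of the splitting problem, suppose the cleaned-up byline looks like `by First Author and Second Author | 1 Jan 2020`. That format, and the `split_authors` helper, are assumptions made for this sketch; the real Amazon markup may differ. The idea is to cut off the trailing part first and only then split the names, so a second author is kept.

```python
import re


def split_authors(byline: str) -> list:
    """Split a byline into author names without dropping co-authors."""
    # Keep only the part before the first "|" (assumed to separate authors from the date).
    names_part = byline.split("|", 1)[0]
    # Drop a leading "by", then split on commas or the word "and".
    names_part = re.sub(r"^\s*by\s+", "", names_part)
    parts = re.split(r"\s*,\s*|\s+and\s+", names_part)
    return [p.strip() for p in parts if p.strip()]


print(split_authors("by First Author and Second Author | 1 Jan 2020"))
print(split_authors("by Solo Author"))
```

Splitting on “and” before removing the date, or taking only the first piece of the split, is the kind of mistake that silently loses the second author.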
Some of the results may not be what you want; this is because Amazon returns products that it thinks are in some way related to your search criteria!
By using pipelines in Scrapy, along with the process_item method, we were able to filter out much of what was irrelevant. The great thing about web scraping to an SQL database is the flexibility it offers once you have the data. SQL, Pandas, Matplotlib and Python are a powerful combination…
If you are unsure about how any of the code works, leave us a comment in the comments section of the Web Scraping Amazon YouTube video.
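A minimal sketch of that kind of pipeline is shown below. The class name `RelevantBooksPipeline`, the keyword rule, the table layout, and the in-memory SQLite database are all assumptions for illustration, not the project's actual code; only `process_item`, `open_spider`, `close_spider` and `DropItem` are standard Scrapy pipeline pieces. The `try/except` around the import is there just so the example runs without Scrapy installed.

```python
import sqlite3

try:
    from scrapy.exceptions import DropItem
except ImportError:  # lets the sketch run without Scrapy installed
    class DropItem(Exception):
        pass


class RelevantBooksPipeline:
    """Drop off-topic results and store the rest in SQLite."""

    KEYWORD = "scraping"  # assumed relevance rule, adjust to your search

    def open_spider(self, spider):
        self.conn = sqlite3.connect(":memory:")  # use a file path to persist
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books "
            "(title TEXT, author TEXT, star_rating TEXT, book_format TEXT, price TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        title = item.get("title") or ""
        if self.KEYWORD not in title.lower():
            raise DropItem(f"Unrelated result: {title!r}")
        self.conn.execute(
            "INSERT INTO books VALUES (?, ?, ?, ?, ?)",
            (title, item.get("author"), item.get("star_rating"),
             item.get("book_format"), item.get("price")),
        )
        return item


# Exercise the pipeline by hand with two made-up, dict-like items.
pipeline = RelevantBooksPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Web Scraping Basics", "price": "9.99"}, spider=None)
try:
    pipeline.process_item({"title": "Garden Tools"}, spider=None)
except DropItem as exc:
    print("dropped:", exc)
print(pipeline.conn.execute("SELECT COUNT(*) FROM books").fetchone()[0])  # 1
```

In a real project the pipeline would be switched on through the `ITEM_PIPELINES` setting, and the stored rows can then be queried with SQL or loaded into Pandas for further analysis.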