Web Scraping Articles

  • Price Tracking Amazon - A common task is to track competitors' prices and use that information as a guide to the prices you can charge, or, if you are buying, to spot when a product reaches a new lowest price. The purpose of this article is to describe how to web scrape Amazon. Using Python, Scrapy, MySQL, […]
  • How To Web Scrape Amazon (successfully) - You may want to scrape Amazon for information on books about web scraping! We shorten what would have been a very long selector by using “contains” in our xpath: response.xpath('//*[contains(@class,"sg-col-20-of-24 s-result-item s-asin")]') The most important thing when starting to scrape is to establish what you want in your final output. Here are the […]
  • Combine Scrapy with Selenium - A major disadvantage of Scrapy is that it cannot handle dynamic websites (e.g. ones that use JavaScript). If a login is proving impossible to get past, usually because the form data keeps changing, then you can use Selenium to get past the login screen and then pass the […]
  • Xpath for hidden values - This article describes how to form a Scrapy xpath selector to pick out the hidden value that you may need to POST along with a username and password when scraping a site with a login. These hidden values are dynamically created, so you must send them with your form data in your POST request. […]
  • Scrapy Form Login - This article shows you how to use Scrapy to log in to sites that have username and password authentication. The important thing to remember is that the login page may need additional data beyond just the username and password… […]
  • Web Scraping Tools - We’ve not included JavaScript or Selenium. How others think about it may vary, but time is finite, so concentrate on what gets the biggest benefit. We could have put Python at the centre, but we’ve assumed that’s a given. How do you visualize your own process of web scraping?
  • Fix: Module Not Found - If you use Scrapy and CrawlerProcess, then Python may not search the path that is required for Scrapy to find its “items.py” file. How do you fix this ‘ModuleNotFoundError’? Read on… As we can see, the Scrapy framework has the items.py file in a directory above the spiders directory
  • Scraping a page via CSS style data - The challenge was to scrape a site where the class names for each element were out of order / randomised. So the only way to get the data in the correct sequence was to sort the CSS styles by left and top, and match them to the class names in the divs… The names were meaningless […]
  • Configure a Raspberry Pi for web scraping - The task was to scrape over 50,000 records from a website and be gentle on the site being scraped. A Raspberry Pi Zero was chosen to do this as speed was not a significant issue, and in fact, being slower makes it ideal for web scraping when you want to be kind to the […]
  • Scraping “LOAD MORE” - Do you need to scrape a page that is dynamically loading content as “infinite scroll”? Using self.nxp += 1, the value passed to “pn=” in the URL gets incremented. “pn=” is the query parameter here; in your spider it may be different, and you can always use urllib.parse to split up the URL into its parts. Test […]
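
As a rough illustration of the “LOAD MORE” idea in the last item, here is a minimal sketch of incrementing a page-number query parameter with urllib.parse. The helper name next_page_url and the example.com URL are hypothetical and not taken from the article; “pn” is simply the parameter named in the excerpt, and your site may use a different one.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode


def next_page_url(url, param="pn"):
    """Return `url` with the integer query parameter `param` increased by one.

    Hypothetical helper for illustration only. If `param` is absent,
    it is treated as 0, so the returned URL carries param=1.
    """
    parts = urlsplit(url)                      # scheme, netloc, path, query, fragment
    query = parse_qs(parts.query, keep_blank_values=True)
    current = int(query.get(param, ["0"])[0])  # parse_qs maps each key to a list
    query[param] = [str(current + 1)]
    new_query = urlencode(query, doseq=True)   # flatten the lists back into a query string
    return urlunsplit(parts._replace(query=new_query))


if __name__ == "__main__":
    # Assumed example URL, not a real endpoint.
    print(next_page_url("https://example.com/list?cat=books&pn=3"))
```

Inside a spider, the same helper could build the URL for the next request in place of keeping a separate counter, although a counter such as self.nxp works just as well when the URL pattern is fixed.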