Fix : Module Not Found
If you use Scrapy with CrawlerProcess, Python may not search the path required for Scrapy to find your project's "items.py" file. How do you fix this ‘ModuleNotFoundError‘? Read on… As we can see, the Scrapy framework keeps the items.py file in a directory above the spiders directory.
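A minimal sketch of one common fix, assuming the standard Scrapy project layout (scrapy.cfg at the root, items.py one level above spiders/) and a hypothetical package name "myproject": put the project root on sys.path before importing, so the import resolves when the spider is launched with CrawlerProcess rather than "scrapy crawl".

```python
# Hypothetical layout:
#   project_root/scrapy.cfg
#   project_root/myproject/items.py
#   project_root/myproject/spiders/run_spider.py  <- this file
import os
import sys

# Walk up from spiders/ -> myproject/ (package) -> project root, so the
# "myproject" package is importable when this file is run directly.
project_root = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.insert(0, project_root)

from myproject.items import MyItem  # noqa: E402 - needs sys.path set first
```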
Scraping a page via CSS style data
The challenge was to scrape a site where the class names for each element were out of order / randomised. The only way to get the data in the correct sequence was to sort through the CSS styles by left and top and match them to the class names in the divs… The names were meaningless […]
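As an illustration of the technique (not the post's exact code), the sketch below parses a hypothetical <style> block for each class's left/top offsets and sorts the divs by position to recover reading order:

```python
import re

html = """
<style>
.x9q { left: 120px; top: 40px; }
.a3z { left: 10px;  top: 40px; }
</style>
<div class="x9q">world</div>
<div class="a3z">hello</div>
"""

# Map each randomised class name to its (top, left) coordinates.
styles = {
    cls: (int(top), int(left))
    for cls, left, top in re.findall(
        r"\.(\w+)\s*\{\s*left:\s*(\d+)px;\s*top:\s*(\d+)px;", html
    )
}

# Pair each div's class with its text, then sort by page position.
divs = re.findall(r'<div class="(\w+)">(.*?)</div>', html)
ordered = sorted(divs, key=lambda d: styles[d[0]])
print(" ".join(text for _, text in ordered))  # -> "hello world"
```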
Configure a Raspberry Pi for web scraping
Introduction: The task was to scrape over 50,000 records from a website while being gentle on the site being scraped. A Raspberry Pi Zero was chosen to do this, as speed was not a significant issue; in fact, being slower makes it ideal for web scraping when you want to be kind to the […]
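For the "be gentle" part, a hedged example of polite Scrapy settings suited to a slow, long-running job on a Pi Zero – the values here are assumptions, not the post's actual configuration:

```python
# settings.py - polite-crawl settings (illustrative values)
DOWNLOAD_DELAY = 2.0         # seconds to wait between requests
CONCURRENT_REQUESTS = 1      # never hit the site with parallel requests
AUTOTHROTTLE_ENABLED = True  # back off further if the server slows down
ROBOTSTXT_OBEY = True        # respect the site's robots.txt
```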
Scraping “LOAD MORE”
Do you need to scrape a page that dynamically loads content as "infinite scroll"? Using self.nxp += 1, the value passed to "pn=" in the URL gets incremented. Here "pn=" is the query parameter – in your spider it may be different, and you can always use urllib.parse to split the URL into its parts. Test […]
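A minimal sketch of that increment-the-page-number approach, with a placeholder URL and selectors ("pn" standing in for whatever query parameter your target site uses):

```python
import scrapy


class LoadMoreSpider(scrapy.Spider):
    name = "loadmore"
    start_urls = ["https://example.com/listing?pn=1"]  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.nxp = 1  # current value of the "pn=" query parameter

    def parse(self, response):
        results = response.css("div.result")  # placeholder selector
        for result in results:
            yield {"title": result.css("::text").get()}

        # If this page had results, increment self.nxp and request the
        # next "pn=" page; an empty page ends the crawl naturally.
        if results:
            self.nxp += 1
            yield response.follow(
                f"https://example.com/listing?pn={self.nxp}", self.parse
            )
```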
Extracting JSON from JavaScript in a web page
Why would you want to do that? Well, if you are web scraping with Python and Scrapy, for instance, you may need to extract reviews or comments that are loaded from JavaScript. This means you cannot use your CSS or XPath selectors as you can with regular HTML. Parse: Instead, in your browser, […]
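The general technique looks something like this – locate the JSON blob the page assigns to a JavaScript variable, cut it out with a regex, and hand it to json.loads. The variable name "reviewData" and the HTML are illustrative assumptions:

```python
import json
import re

html = '<script>var reviewData = {"reviews": [{"rating": 5}]};</script>'

# Grab everything between "var reviewData = " and the closing "};".
match = re.search(r"var reviewData = (\{.*?\});", html, re.DOTALL)
if match:
    data = json.loads(match.group(1))    # a normal Python dict from here on
    print(data["reviews"][0]["rating"])  # -> 5
```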
Scrapy : Yield
Yes, you can use "yield" more than once inside a method – we look at how this was useful when scraping a real estate / property section of Craigslist. Put simply, "yield" lets you run another function with Scrapy and then resume from where you "yielded". To demonstrate this, it is best shown with […]
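A minimal sketch of the idea, using a Craigslist-style listing page as in the post – the URL and selectors are placeholders:

```python
import scrapy


class PropertySpider(scrapy.Spider):
    name = "property"
    start_urls = ["https://example.org/search/apa"]  # placeholder URL

    def parse(self, response):
        # First yield: hand each listing's detail page to another callback.
        for href in response.css("li.result-row a::attr(href)").getall():
            yield response.follow(href, self.parse_detail)

        # Second yield: resume here and queue the next results page.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        yield {"title": response.css("h1::text").get()}
```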
Nested Dictionaries
Summary: Read a JSON file, create a dict, loop through it, and get the keys and values from the inner dict using Python. Uses: f.read / json.dumps / json.loads / list comprehension / for loop. This post assumes you already have a nested dictionary saved as a JSON file. We want to do 3 things: […]
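A short sketch of those steps, assuming a hypothetical "data.json" with one inner dict per outer key:

```python
import json

# data.json might look like: {"alice": {"age": 30, "city": "Leeds"}}
with open("data.json") as f:
    nested = json.loads(f.read())

# Loop through the outer dict, then pull keys/values from each inner dict.
for outer_key, inner in nested.items():
    pairs = [f"{k}={v}" for k, v in inner.items()]  # list comprehension
    print(outer_key, ", ".join(pairs))

# json.dumps goes the other way: print(json.dumps(nested, indent=2))
```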
Automating a potentially repetitive task
Write 102 workplans by hand, or write some code to do it? There was a requirement to produce 102 workplans as Word/docx files. The only data unique to each workplan was held in a CSV – one row per workplan – so this is how we automated the creation of 102 Word documents.
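A hedged sketch of the approach using python-docx – one CSV row in, one .docx out; the column names and file paths are illustrative assumptions:

```python
import csv

from docx import Document  # pip install python-docx

with open("workplans.csv", newline="") as f:
    for row in csv.DictReader(f):
        doc = Document()
        doc.add_heading(f"Workplan: {row['project']}", level=1)
        doc.add_paragraph(f"Owner: {row['owner']}")
        doc.add_paragraph(row["summary"])
        # One .docx per CSV row - 102 rows in, 102 documents out.
        doc.save(f"workplan_{row['project']}.docx")
```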