This is a brief introduction to how web scraping works, with the emphasis on using Scrapy to scrape a listing site.
If you would like to know more, or would like us to scrape data from a specific site, please get in touch.
*This article assumes you have some knowledge of Python and have Scrapy installed. Using a virtual environment is recommended. Although the focus is on Scrapy, similar logic applies if you are using Beautiful Soup or Requests (Beautiful Soup does some of the hard work for you with find and select).
Below is a basic representation of the process used to scrape a page / site
Identifying the div and class name
Using your web browser's developer tools (in Chrome, right-click the element and choose Inspect), traverse up through the elements until you find a ‘div’ (well, it’s usually a div) that contains the entire advert, and go no higher up the DOM.
(advert = typically: the thumbnail + mini description that leads to the detail page)
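The ‘div’ elements matched by your xpath are the iterable that you use with the for loop, one element per advert (e.g. 20 elements on a listings page that has 20 adverts per page). Inside the loop, the “.” before the “//” makes each xpath relative to the advert currently being processed, rather than to the whole page.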
Now that you have the xpath and have checked it in scrapy shell, you can use it with a for loop and a selector for each piece of information you want to pick out. If you are using xpath, you can reuse the class names from inside the listing and just add “.” to the start of each expression.
This “.” keeps each selector scoped to the advert currently being processed, so you get one set of values per advert as you iterate through all 20 on the page. Without it, “//” searches the whole page from the root on every pass, and every advert returns the same first match.
To go to the details page we “yield” a new request, but we also have to pass along the variables that we have picked out on the main page. So we use ‘meta’ (or the newer ‘cb_kwargs’).
Using ‘meta’ allows us to pass variables to the next function – in this case it’s called “fetch_details” – where they will be added to the rest of the variables collected and sent to the FEEDS export which makes the output file.
There is also a newer, recommended alternative to “meta” for passing variables between callback functions in Scrapy: “cb_kwargs” (available since Scrapy 1.7).
Once you have all the data, it is time to use “yield” to send it to the FEEDS export.
FEEDS is the setting where you define the format and destination of your output file.
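*Note: the destination can also be a database rather than a JSON or CSV file, although writing to a database is normally done with an item pipeline rather than through FEEDS.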
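To make the FEEDS setting more concrete, here is a minimal sketch of what it can look like. The file names are placeholders; the dictionary can go in your project's settings.py or in a spider's custom_settings.

```python
# Hypothetical feed configuration: the file names are placeholders.
FEEDS = {
    # Each key is an output destination, each value its options.
    "adverts.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,  # replace the file on each run (Scrapy 2.4+)
    },
    "adverts.csv": {"format": "csv"},
}
```

Every item yielded by the spider is written to each destination listed here, so one run can produce both a JSON and a CSV file.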
You may wish to run all of your code from within the script, in which case you can do this:
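# Add this import at the start of the script:
from scrapy.crawler import CrawlerProcess

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(MySpider)  # MySpider is a placeholder for your own spider class
    process.start()  # blocks here until the crawl has finished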
Web Scraping – Summary
We have looked at the steps involved and some of the code you’ll find useful when using Scrapy.
Identifying the html to iterate through is the key
Try to find the block of code that has all of the listings / adverts, and then narrow it down to one advert/listing. Once you have done that you can test your code in “scrapy shell” and start building your spider.
(Scrapy shell can be run from your CLI, independent of your spider code, by typing scrapy shell followed by the URL of the page you want to test.)