Capture your start URLs in your output with Scrapy response.meta
Every web scraping project has aspects that are different or interesting and worth remembering for future use.
This is a look at a recent real-world project, focusing on saving more than one start URL in the output.
This assumes basic knowledge of web scraping and identifying selectors. See my other videos if you would like to learn more about selectors (XPath & CSS).
We want to fill all of the columns in our client’s master Excel sheet.
We can then provide them with a CSV which they can import and use as they wish.
We want 1500+ properties, so we will be using Scrapy and Python.
One of the required fields requires us to pass the particular start URL all the way through to the CSV (use response.meta).
Some of the required values are embedded in text and will require parsing with re (use regular expressions).
We don’t care about being fast – edit “settings.py” with conservative values for concurrent connections and download delay.
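As a rough sketch of the parsing step: the sample listing text below is invented for illustration (the real site’s wording will differ), but it shows how `re.search` can pull numeric values out of free text:

```python
import re

# Hypothetical German listing text -- values like room count and area
# are embedded in free text, so we extract them with regular expressions.
text = "Schöne Wohnung, 3 Zimmer, 85 m², Kaltmiete 950 €"

# Digits followed by the keyword "Zimmer" (rooms)
rooms_match = re.search(r"(\d+)\s*Zimmer", text)
# Digits (optionally with a decimal part) followed by "m²" (area)
area_match = re.search(r"(\d+(?:[.,]\d+)?)\s*m²", text)

rooms = int(rooms_match.group(1)) if rooms_match else None
area = area_match.group(1) if area_match else None

print(rooms)  # 3
print(area)   # 85
```

Guarding each `group(1)` call behind an `if ... else None` keeps the spider from crashing on listings where a value happens to be missing.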
This is a German website, so I will use the Google Chrome browser and translate to English.
We will use Scrapy’s Request.meta attribute to achieve the following:
Capture whichever of the multiple start_urls is used – pass it all the way through to the output CSV.
Create a “meta” dictionary in the initial Request in start_requests
“surl” represents each of our start URLs
(we have two: one for the ‘rent’ URL and one for the ‘buy’ URL; we could have many more if required)