Scrapy response.meta

Capture your start URLs in your output with Scrapy response.meta

scrapy real estate scraping

Every web scraping project has aspects that are different or interesting and worth remembering for future use.

This is a look at a recent real-world project, and at how to save more than one start URL in the output.

This assumes basic knowledge of web scraping and of identifying selectors. See my other videos if you would like to learn more about selectors (XPath and CSS).

Scraping a real estate site for houses and apartments

We want to fill all of the columns in our client’s master Excel sheet.

We could* then provide them with a CSV which they can import and do with what they wish.

We want 1500+ properties, so we will be using Scrapy and Python.

scrapy spider logic

Considerations

One of the required fields means we must pass the particular start URL all the way through to the CSV (use response.meta).

Some of the required values are inside text and will require parsing with re (use regular expressions).

We don’t care about being fast – edit “settings.py” with conservative values for concurrent connections and download delay.
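As a rough idea of what "conservative values" could look like, here is a small settings.py excerpt. The setting names are standard Scrapy settings, but the numbers are illustrative assumptions, not the values used in the project.

```python
# settings.py (excerpt): illustrative values only, tune them for the target site

# Keep very few requests in flight at the same time.
CONCURRENT_REQUESTS = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Wait this many seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 3

# Let AutoThrottle slow the crawl further if the server responds slowly.
AUTOTHROTTLE_ENABLED = True

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True
```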

This is a German website, so I will use the Google Chrome browser to translate it to English.
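To illustrate the regular expression step from the considerations above, here is a minimal sketch that pulls a living area and a price out of a sentence. The sample text, the function names and the patterns are all assumptions made for illustration. The real site's wording will differ, so the patterns would need adjusting.

```python
import re

# Hypothetical listing text; the real site's wording may differ.
text = "Schöne 3-Zimmer-Wohnung, 85,5 m² Wohnfläche, Kaltmiete 1.250 €"


def parse_german_number(raw):
    """Convert a German-formatted number such as '1.250' or '85,5' to a float."""
    # German uses '.' as the thousands separator and ',' as the decimal mark.
    return float(raw.replace(".", "").replace(",", "."))


def extract_area(text):
    """Return the area in square metres, or None if no area is found."""
    match = re.search(r"(\d+(?:,\d+)?)\s*m²", text)
    return parse_german_number(match.group(1)) if match else None


def extract_price(text):
    """Return the price in euros, or None if no price is found."""
    match = re.search(r"(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s*€", text)
    return parse_german_number(match.group(1)) if match else None


area = extract_area(text)
price = extract_price(text)
```

Returning None when nothing matches keeps the spider from crashing on listings that leave a field out. The CSV cell is simply left empty.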

Scrapy response.meta

We will use Scrapy’s Request.meta attribute to achieve the following:

Capture whichever of the multiple start_urls is used – pass it all the way through to the output CSV.

scrapy response documentation

Create a “meta” dictionary in the initial Request in start_requests 

“surl” represents each of our start urls

(We have two: one for the ‘rent’ URL and one for the ‘buy’ URL. We could have many more if required.)

start_requests
response.meta
start_url output
We still have the start_url (converted to a human-readable label)
start_url in final csv output
End result: we have the start URL converted to a human-readable name that represents the particular URL that Scrapy used for the particular listing.

Dr Pi YouTube videos
For videos on YouTube, please visit: www.youtube.com/c/DrPiCode