June 3, 2020 / Last updated : June 3, 2020 admin Python Code

Scrapy : Yield

Yes, you can use “yield” more than once inside a method. Here we look at how this was useful when scraping the real estate / property section of Craigslist.

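Put simply, “yield” hands something (a request or an item) back to Scrapy and pauses the method; when Scrapy asks for the next value, the method resumes from where you “yielded”. Yielding a request with a callback is how you get Scrapy to run another method.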
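Before the full project, the pause-and-resume behaviour can be seen in plain Python, with no Scrapy involved. This is only a minimal sketch: the function name `parse_sketch` and the strings it yields are made up for illustration.

```python
def parse_sketch(log):
    """A generator with two yields; 'log' records what has run so far."""
    log.append("start")
    yield "request for details page"   # first yield: execution pauses here
    log.append("resumed after first yield")
    yield "item"                       # second yield: execution pauses again
    log.append("finished")


log = []
gen = parse_sketch(log)

first = next(gen)    # runs up to the first yield
second = next(gen)   # resumes after the first yield, runs up to the second
```

Each call to `next()` picks up exactly where the previous yield left off, which is why a single method can yield several times.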

To demonstrate this it is best to show it with a working example, and then you’ll see the reason for using it.

Source code for this Scrapy Project

https://github.com/RGGH/Scrapy4/blob/master/realestate_loader_with_geo.py

The difference with this project was that most of the usual ‘details’ were actually on the ‘thumbnails’ / ‘listing’ page, with the exception of the geo data (longitude and latitude).

So you could say this was a back-to-front website. Typically all the details would be extracted from the details page, accessed via a listings page.

Because we want to pass data (“lon” and “lat”) between two methods, we had to initialise these variables:

def __init__(self):
    self.lat = ""
    self.lon = ""

Above: the ‘lat’ and ‘lon’ variables are set in ‘parse_detail’; their values are then passed to the items back inside the ‘parse’ method.

Next comes the typical ‘parse’ code, which identifies all of the ads (adverts) on the page by their class name, “result-info”.

You could use either:

all_ads = response.xpath('//p[@class="result-info"]')

or

all_ads = response.css("p.result-info")

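(XPath or CSS: on this page both select the same elements to loop over. Strictly, the XPath version matches only when the class attribute is exactly “result-info”, while the CSS version also matches elements that carry additional classes.)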
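The two selectors agree as long as each ad has “result-info” as its only class. The sketch below shows where they would differ. It uses the standard library's ElementTree as a stand-in for Scrapy's selectors, and the markup is hypothetical, not copied from Craigslist.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing markup for illustration only.
MARKUP = """
<ul>
  <li><p class="result-info">Flat A</p></li>
  <li><p class="result-info featured">Flat B</p></li>
  <li><p class="other">Not an ad</p></li>
</ul>
""".strip()

root = ET.fromstring(MARKUP)

# Exact attribute match, the behaviour of //p[@class="result-info"]
exact = [p.text for p in root.findall(".//p[@class='result-info']")]

# Class-token match, the behaviour of the CSS selector p.result-info
token = [
    p.text
    for p in root.iter("p")
    if "result-info" in p.get("class", "").split()
]
```

Here `exact` finds only the first ad, while `token` finds both ads, so it is worth checking the page's markup before choosing one style over the other.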

start_requests

We coded this, but it would run even if we hadn’t. It’s the default Scrapy method that requests the first URL and passes the output “response” to the next method: ‘parse’.

parse

This is the method that finds all of the adverts on page 1 and, for each one, requests its details page so that the geo data can be extracted.

Next it fills the Scrapy fields defined in items.py with the data from the thumbnail listing for the property on the listings page, plus the geo data.

So the reason we described this as a back-to-front website is that the majority of the details come from the thumbnails/listing, and only 2 bits of data (“lon” and “lat”) come from the ‘details’ page.
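The flow described above can be sketched without Scrapy. Everything here is an assumption made for illustration: `FakeRequest`, `GeoSpiderSketch`, `run` and the sample data are hypothetical stand-ins, not the project's code or Scrapy's API.

```python
class FakeRequest:
    """Stand-in for a Scrapy request: a URL plus the method to call with the result."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class GeoSpiderSketch:
    def __init__(self):
        # Shared attributes, initialised empty as in the article.
        self.lat = ""
        self.lon = ""

    def parse_detail(self, detail_page):
        # Only the geo data comes from the details page.
        self.lat = detail_page["lat"]
        self.lon = detail_page["lon"]

    def parse(self, listing_page):
        for ad in listing_page:
            # First yield: ask for the details page.
            yield FakeRequest(ad["url"], self.parse_detail)
            # Second yield: build the item from the listing plus the geo data.
            yield {"title": ad["title"], "lat": self.lat, "lon": self.lon}


def run(spider, listing_page, detail_pages):
    """Toy driver: runs each callback immediately, then collects the items."""
    items = []
    for output in spider.parse(listing_page):
        if isinstance(output, FakeRequest):
            output.callback(detail_pages[output.url])
        else:
            items.append(output)
    return items


LISTING = [{"title": "Flat A", "url": "/a"}, {"title": "Flat B", "url": "/b"}]
DETAILS = {"/a": {"lat": "51.5", "lon": "-0.1"}, "/b": {"lat": "53.4", "lon": "-2.2"}}
items = run(GeoSpiderSketch(), LISTING, DETAILS)
```

Note that this toy driver runs the callback straight away, which real Scrapy does not promise: requests are scheduled and their callbacks run later. If the geo values ever arrive out of step with the listing data, the `cb_kwargs` argument on a Scrapy request is one way to hand data to a callback explicitly.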

Above: the Craigslist listings page. Throughout this article we refer to this as the ‘thumbnails/listings’ page.

Above: the Craigslist details page. This is what we typically refer to as the ‘details’ page.

Categories: Python Code and Scrapy
