
Scraping a JSON response with Scrapy

reference:
► https://stackoverflow.com/questions/44939247/scrapy-extract-ldjson#48131898

We can’t get the HTML we need using a normal selector, so having located the ‘script’ section in the browser (Chrome Developer Tools), we can load it into a JSON object to manipulate.

Use json.loads(response.xpath('//script[@type="application/ld+json"]/text()').get()) to get the data from a page containing JavaScript.

Using json.loads

We extracted the output that was not available from just using a normal CSS or XPath selector in Scrapy.
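
As a minimal sketch, a parse method using this approach might look like the following (the keys "name" and "address" are hypothetical, so inspect the JSON for your target site):

import json

def parse(self, response):
    # grab the raw ld+json block that normal selectors can't reach into
    raw = response.xpath('//script[@type="application/ld+json"]/text()').get()
    data = json.loads(raw)
    # "name" and "address" are hypothetical keys, for illustration only
    yield {
        "name": data.get("name"),
        "address": data.get("address"),
    }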

See the “JSON response in Scrapy” video.


Web Scraping Introduction

As an overview of how web scraping works, here is a brief introduction to the process, with the emphasis on using Scrapy to scrape a listing site.

If you would like to know more, or would like us to scrape data from a specific site please get in touch.

*This article also assumes you have some knowledge of Python and have Scrapy installed. It is recommended to use a virtual environment. Although the focus is on using Scrapy, similar logic will apply if you are using Beautiful Soup or Requests. (Beautiful Soup does some of the hard work for you with find and select.)
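
If you need to set up the environment first, a typical sequence is (commands vary slightly by platform):

python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
pip install scrapy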

Below is a basic representation of the process used to scrape a page/site:

(Image: the web scraping process with Scrapy, simplified.)
Most sites you will want to scrape provide data in a list, so check the number of items per page, and the total number. Is there a “next page” link? If so, that will be useful later on…
Identify the div and class name to use in your “selector”

Identifying the div and class name

Using your web browser’s developer tools (in Chrome: Inspect Elements), traverse up through the elements until you find a ‘div’ (well, it’s usually a div) that contains the entire advert, and go no higher up the DOM.

(advert = typically: the thumbnail + mini description that leads to the detail page)

The code inside the ‘div’ will be the iterable that you use with the for loop.
The “.” before the “//” in the XPath keeps each query relative to the current advert node, so every advert is covered,
e.g. all 20, on a listings page that has 20 adverts per page.

Now that you have the XPath and have checked it in scrapy shell, you can proceed to use it with a for loop and selectors for each piece of information you want to pick out. If you are using XPath, you can use the class name from the listing and just add “.” to the start, as shown in the sketch below.

This “.” ensures you will be able to iterate through all 20 adverts at the same node level (i.e. all 20 on the page).
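
A minimal sketch of the loop, inside a scrapy.Spider subclass (the class names "advert" and "title" are hypothetical):

def parse(self, response):
    # the outer query finds every advert block on the page
    for advert in response.xpath('//div[@class="advert"]'):
        # the leading "." keeps each query inside the current advert block
        title = advert.xpath('.//h2[@class="title"]/text()').get()
        link = advert.xpath('.//a/@href').get()
        yield {"title": title, "link": link}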

parse

To go to the details page we use “yield”, but we also have to pass the variables that we have picked out on the main page, so we use ‘meta’ (or the newer version, ‘cb_kwargs’).

yield Request(absolute_url, callback=self.fetch_detail, meta={'link': link, 'logo_url': logo_url, 'lcompanyname': lcompanyname})

Using ‘meta’ allows us to pass variables to the next function – in this case “fetch_detail” – where they will be added to the rest of the variables collected and sent to the FEEDS export, which makes the output file.
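
On the receiving end, a minimal sketch (assuming the variable names above) reads the values back from response.meta:

def fetch_detail(self, response):
    # values passed from parse() via meta come back on response.meta
    link = response.meta.get('link')
    logo_url = response.meta.get('logo_url')
    lcompanyname = response.meta.get('lcompanyname')
    yield {'link': link, 'logo_url': logo_url, 'lcompanyname': lcompanyname}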

There is also a newer, recommended alternative to “meta” for passing variables between functions in Scrapy: “cb_kwargs”.

Once you have all the data, it is time to use “yield” to send it to the FEEDS export.

“FEEDS” is the setting that lets you write output to your chosen file format/destination.

This is the format and destination that you have set for your output file.

*Note it can also be a database, rather than JSON or CSV file.
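
As a minimal sketch (the spider name and filename are hypothetical; the FEEDS setting requires Scrapy 2.1+), you can declare this on the spider itself:

import scrapy

class YelpSpider(scrapy.Spider):
    name = "yelp"
    custom_settings = {
        "FEEDS": {
            # hypothetical filename; "json" is also a valid format
            "output.csv": {"format": "csv"},
        },
    }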

Putting it all together

See the fully working spider code here:

https://github.com/RGGH/Scrapy5/blob/master/yelpspider.py

You may wish to run all of your code from within the script, in which case you can do this:

# Add this import at the start of your script:
from scrapy.crawler import CrawlerProcess

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(YelpSpider)
    process.start()

Web Scraping – Summary

We have looked at the steps involved and some of the code you’ll find useful when using Scrapy.

Identifying the HTML to iterate through is the key.

Try and find the block of code that has all of the listings / adverts, and then narrow it down to one advert/listing. Once you have done that you can test your code in “scrapy shell” and start building your spider.

(Scrapy shell can be run from your CLI, independent of your spider code):
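
For example (the URL and class name are hypothetical):

scrapy shell "https://example.com/listings"
>>> response.xpath('//div[@class="advert"]').getall()
>>> response.css("div.advert h2::text").getall()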

(Image: an XPath “starts-with” example. Scrapy shell is your friend!)

Note: some of this content may be easier to relate to if you have studied and completed the following: https://docs.scrapy.org/en/latest/intro/tutorial.html

If you have any questions please get in touch and we’ll be pleased to help.


Scrapy tips

Passing variables between functions using meta and cb_kwargs

This will cover how to use a callback with “meta” and the newer “cb_kwargs”.

“logo_url” goes from parse to fetch_detail, where “yield” then sends it to the FEED export (the output CSV file).

When using ‘meta’ you need to call meta.get on the ‘response’ in the next function, e.g. response.meta.get('logo_url').

The newer, Scrapy-recommended way is to use “cb_kwargs”.

As you can see, there is no need to use any sort of “get” method in ‘fetch_detail’, so it is simpler to use now, albeit with a slightly longer, less memorable name!
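
A minimal sketch of the cb_kwargs version, inside a scrapy.Spider subclass (the selectors and URL are hypothetical):

import scrapy

def parse(self, response):
    logo_url = response.css("img.logo::attr(src)").get()   # hypothetical selector
    detail_url = response.urljoin(response.css("a::attr(href)").get())
    yield scrapy.Request(detail_url, callback=self.fetch_detail,
                         cb_kwargs={"logo_url": logo_url})

def fetch_detail(self, response, logo_url):
    # logo_url arrives as a named parameter, no response.meta.get needed
    yield {"logo_url": logo_url}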

Watch the YouTube video on cb_kwargs

Scrapy : Yield

Yes, you can use “Yield” more than once inside a method – we look at how this was useful when scraping a real estate / property section of Craigslist.

Put simply, “yield” lets you run another function with Scrapy and then resume from where you “yielded”.

To demonstrate this, it is best to show it with a working example, and then you’ll see the reason for using it.

Source code for this Scrapy Project

https://github.com/RGGH/Scrapy4/blob/master/realestate_loader_with_geo.py

The difference with this project was that most of the usual ‘details’ were actually on the ‘thumbnails’/‘listing’ page, with the exception of the geo data (longitude and latitude).

So you could say this was a back-to-front website. Typically all the details would be extracted from the details page, accessed via a listings page.

Because we want to pass data (“lon” and “lat”) between two functions (methods), we had to initialise these variables:

def __init__(self):
    self.lat = ""
    self.lon = ""
(Image: Scrapy yield explanation. The ‘lat’ and ‘lon’ variables are used in ‘parse_detail’; their values then get passed to items back inside the parse method.)

Next, the typical ‘parse’ code that identifies all of the ads (adverts) on the page – class name = “result-info”.

You could use either:

all_ads = response.xpath('//p[@class="result-info"]')

or

all_ads = response.css("p.result-info")

(XPath or CSS – both get the same result, to use as the selector)

start_requests

We coded this, but it would run even if we hadn’t: it is the default Scrapy method that gets the first URL and passes the output “response” to the next method, ‘parse’.
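
A minimal sketch of an explicit start_requests (the URL is hypothetical):

def start_requests(self):
    # Scrapy supplies an equivalent default if you omit this method
    yield scrapy.Request("https://example.org/search", callback=self.parse)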

parse

This is the method that finds all of the adverts on page 1, and goes off to the details page and extracts the geo data.

Next, it fills the Scrapy fields in items.py with the data from the property’s thumbnail listing on the listings page, plus the geo data.

So the reason we described this as a back-to-front website is that the majority of the details come from the thumbnails/listing, and only 2 bits of data (“lon” and “lat”) come from the ‘details’ page.
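
Putting the pattern together, here is a simplified, illustrative sketch of the two yields (the selectors and attribute names are hypothetical, not the exact project code):

def parse(self, response):
    for ad in response.xpath('//p[@class="result-info"]'):
        detail_url = response.urljoin(ad.xpath('.//a/@href').get())
        # first yield: request the details page so parse_detail can set self.lat / self.lon
        yield scrapy.Request(detail_url, callback=self.parse_detail)
        # second yield: send the item (listing data plus geo data) to the FEEDS export
        yield {
            "title": ad.xpath('.//a/text()').get(),
            "lat": self.lat,
            "lon": self.lon,
        }

def parse_detail(self, response):
    # hypothetical attributes on the map element that holds the geo data
    self.lat = response.xpath('//div[@id="map"]/@data-latitude').get()
    self.lon = response.xpath('//div[@id="map"]/@data-longitude').get()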

(Image: Craigslist Scrapy project, listings page. Throughout this article we refer to this as the “thumbnails/listings” page.)

(Image: Craigslist Scrapy project, details page. This is what we typically refer to as the “details” page.)