Categories
Python Code

Scraping a page via CSS style data

The challenge was to scrape a site where the class names for each element were out of order / randomised. The only way to get the data in the correct sequence was to sort the CSS styles by their left and top values and match them to the class names in the divs…

web-scraping-mixed-up-css
Above: the left and top values also varied slightly, even within the same column – 1055 and 1068 would need to be rounded before they could be treated as the same column.

The class names were meaningless and there was no way of establishing an order; a quick check in developer tools showed the highlighter rectangle jumping all over the page in no particular order when traversing the source code.

One such name, as seen in developer tools:

page_useddemo-gear_fF0iR8Rn6hrUOJh0YwkOA body

So we began the code by identifying the CSS and then parsing it:

The idea is to extract the CSS style data, parse it for left, and top px, and sort through those to match the out of sequence div/classnames in the body of the page.
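Here is a minimal sketch of that idea – not the exact audio1.py code. It assumes the styles sit in a single <style> block and follow a simple ".classname { left: …px; top: …px; }" pattern, so the regex and the rounding choices are illustrative only.

import re
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("http://audioeden.com/useddemo-gear/4525583102")

# grab the raw CSS from the page's <style> block
css_text = r.html.xpath("//style/text()")[0]

# map each randomised class name to its (rounded) left/top position
positions = {}
for name, left, top in re.findall(
        r"\.([\w-]+)\s*\{[^}]*left:\s*(\d+)px[^}]*top:\s*(\d+)px", css_text):
    positions[name] = {"left": round(int(left), -2), "top": round(int(top), -1)}

# the divs can now be looked up by class name and re-ordered by left, then top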

There was NO possibility of sequentially looping through the divs to extract ordered data.

After watching the intro to the challenge set on YouTube, and with a handy hint from CMK, we got to work.

From this article you will learn as much about coding with Python as you will about web scraping specifically. Note that I used requests_html, as it gave me the option to use XPATH.

BeautifulSoup could also have been used.

Identifying columns and rows based on “left px” and “top px”

Python methods used:

round

x = round(1466,-2)
print(x) # 1500

x = round(1526,-2)
print(x) # 1500

x = round(1526,-1)
print(x) # 1530

I needed to use round as the only way to identify a “Column” of text was to cross reference the <style> “left px” and “top px” with the class names used inside the divs. Round was required as there was on occasion a 2 or 3 px variation in the “left” value.

itemgetter

from operator import itemgetter

ls.sort(key=itemgetter('left','top'))

I had to sort the values from the parsed css style in order of “left” to identify 3 columns of data, and then “top” to sort the contents of the 3 columns A, B, C.

zip

zipped = zip(ls_desc,ls_sellp,ls_suggp)

rows = list(zipped)

Zipping the 3 lists – read on to find out the issue with doing this…

So to get the data from the 3 columns, A (ls_desc), B (ls_sellp), and C (ls_suggp), I used zip, but… there were 2 values missing in column C!

A had 77 values,

B had 77 values,

C had 75!

Not only was there no text in 2 of the blanks in column C, there was also NO text or even any CSS.

We only identified this as an issue after running the code – visually the page looked consistent, but the last part of column “C” becomes out of sequence with the data in columns A and B, which are both correct.

Solution?

Go back and check if column “C” has a value at the same top px value as Column “B”. If no value then insert an “x” or spacer into Column C at that top px value.
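A hedged sketch of that check, assuming each column has already been parsed into a list of dicts keyed by the rounded "top" value (the toy data and field names here are made up for illustration):

from operator import itemgetter

# toy data standing in for the parsed CSS - one "Suggested Price" is missing
ls_sellp = [{"top": 100, "text": "£499"}, {"top": 120, "text": "£250"}]
ls_suggp = [{"top": 100, "text": "£650"}]

tops_in_c = {item["top"] for item in ls_suggp}
for item in ls_sellp:
    if item["top"] not in tops_in_c:
        # no value in column C at this row - insert a spacer so zip stays aligned
        ls_suggp.append({"top": item["top"], "text": "x"})

ls_suggp.sort(key=itemgetter("top"))
print(ls_suggp)   # [{'top': 100, 'text': '£650'}, {'top': 120, 'text': 'x'}]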

This will need to be rewritten using dictionaries, creating one dictionary per ROW rather than my initial idea of 1 list per column and zipping them!

Zipping the 3 lists nearly works… but 2 missing values in “Suggested Price” mean that the data in Column C becomes out of sync.

Special thanks to “Code Monkey King” for the idea/challenge!

url = 'http://audioeden.com/useddemo-gear/4525583102'

My initial solution:

https://github.com/RGGH/Experimental-Custom-Scrapers/blob/master/audio1.py

Next :

Rewrite the section for “Column B” to check for presence of text in column “C” on the same row…

webscraping-css-style

1 missing value halfway down column “C” means more error checking is required! – If you just want the “Selling Price” and “Description” then this code is 100% successful! 👍

See the solution, and error on the YouTube Video

Conclusion:

For more robust web scraping where CSS elements may be missing, use dictionaries and enumerate/check each row. It’s the old case of “you don’t know what you don’t know”.

If you can ensure each list has the same number of items, then ZIP is ok to use.
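One related safeguard (not used in the original script) is itertools.zip_longest, which at least makes a length mismatch visible instead of silently dropping rows – though it only pads at the end, so it is no substitute for the “top px” check above:

from itertools import zip_longest

# toy lists for illustration - ls_suggp is one value short
ls_desc  = ["Amp", "Speakers", "Turntable"]
ls_sellp = ["£499", "£250", "£1,100"]
ls_suggp = ["£650", "£300"]

# zip() would silently drop the last row; zip_longest keeps it and flags the gap
for desc, sell, sugg in zip_longest(ls_desc, ls_sellp, ls_suggp, fillvalue="x"):
    print(desc, sell, sugg)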

Categories
Python Code Raspberry Pi Scrapy

Configure a Raspberry Pi for web scraping

Introduction

The task was to scrape over 50,000 records from a website and be gentle on the site being scraped. A Raspberry Pi Zero was chosen to do this as speed was not a significant issue, and in fact, being slower makes it ideal for web scraping when you want to be kind to the site you are scraping and not request resources too quickly. This article describes using Scrapy, but BeautifulSoup or Requests would work in the same way.

The main considerations were:

  • Could it run Scrapy without issue?
  • Could it run with a VPN connection?
  • Would it be able to store the results?

A quick, short test proved that it could collect approx. 50,000 records per day, which meant it was entirely suitable.

I wanted a VPN tunnel from the Pi Zero to my VPN provider. This was an unknown, because I had only previously run the VPN client on a Windows PC with a GUI. Now I was attempting to run it from a headless Raspberry Pi!

This took approx 15 mins to set up. Surprisingly easy.

The only remaining challenges were:

  • Run the spider without having to leave my PC on as well (closing PuTTY in Windows would have terminated the process on the Pi) – that’s where nohup came in handy.
  • Transfer the output back to a PC (running Ubuntu inside a VM) – this is where rsync was handy (scp could also have been used).

See the writing of the Scrapy spider with “Load More”

Categories
Python Code Scrapy

Scraping “LOAD MORE”

Do you need to scrape a page that is dynamically loading content as “infinite scroll”?

Scrapy Load More - infinite scroll - more results
If you need to scrape a site like this then you can increment the URL within your Scrapy code

Using self.nxp += 1, the value passed to “pn=” in the URL gets incremented.
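As a rough sketch (the site, the “pn=” parameter name and the selectors below are only placeholders), the pattern looks something like this:

import scrapy

class LoadMoreSpider(scrapy.Spider):
    name = "loadmore"
    base_url = "https://example.com/results?q=guitars&pn={}"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.nxp = 1                                  # current page number

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(self.nxp), callback=self.parse)

    def parse(self, response):
        results = response.css("div.result")
        for item in results:
            yield {"title": item.css("h2::text").get()}
        if results:                                   # keep going while the page returns results
            self.nxp += 1
            yield scrapy.Request(self.base_url.format(self.nxp), callback=self.parse)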

“pn=” is the query parameter – in your spider it may be different; you can always use urllib.parse to split the URL up into its parts.
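For example, urllib.parse makes it easy to see which query parameters you are dealing with (the URL here is invented):

from urllib.parse import urlparse, parse_qs

url = "https://example.com/results?q=guitars&pn=3"
print(parse_qs(urlparse(url).query))   # {'q': ['guitars'], 'pn': ['3']}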

Test in scrapy shell if you are checking the URL for the next page – see if you get a 200 response, and then check response.text.

What if you don’t know how many pages there are?

One way would be to use try/except – but a more elegant solution is to check the source for “next” or “has_next” and keep going to the next page until “next” is no longer true.

https://github.com/RGGH/Scrapy6/blob/master/AJAX%20example/foodcom.py

If you look at line 51 – you can see how we did that.

if response.xpath("//link/@rel='next'").get() == "1":

See our video where we did just this : https://youtu.be/07FYDHTV73Y

Conclusion

We’ve shown how to deal with “infinite scroll” without resorting to Selenium, Splash, or any JavaScript rendering. Also, check in developer tools, under “Network” and “XHR”, whether you can find any mention of an API in the URL – this may be useful too.

Categories
Python Code

Extracting JSON from JavaScript in a web page

Why would you want to do that?

Well, if you are web scraping using Python, and Scrapy for instance, you may need to extract reviews, or comments that are loaded from JavaScript. This would mean you could not use your css or xpath selectors like you can with regular html.

Parse

Instead, check in your browser whether you can parse the code: start with Ctrl + F, search for “json”, and track down some JSON in a form that can become a Python dictionary. You ‘just’ need to isolate it.

web-scraping javascript pages
view-source to find occurrences of “JSON” in your page

The response is not nice, but you can gradually shrink it down, in Scrapy shell or python shell…

scrapy-shell-response
Figure 1 – The response

Split, strip, replace

From within Scrapy, or your own Python code you can split, strip, and replace, with the built-in python commands until you have just a dictionary that you can use with json.loads.

x = response.text.split('JSON.parse')[3].replace("\u0022","\"").replace("\u2019m","'").lstrip("(").split(" ")[0].strip().replace("\"","",1).replace("\");","")

Master replace, strip, and split and you won’t need regular expressions!
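As a simplified, made-up example of the same idea (the real chain above is obviously site-specific):

import json

# hypothetical page text with JSON buried inside a JavaScript call
page_text = 'var data = JSON.parse("{\\"doctor\\": {\\"sample_rating_comment\\": \\"Great\\"}}");'

raw = page_text.split("JSON.parse(", 1)[1]    # drop everything before the JSON
raw = raw.rsplit(");", 1)[0]                  # drop the trailing ");"
raw = raw.strip('"').replace('\\"', '"')      # unquote and unescape

data = json.loads(raw)
print(data["doctor"]["sample_rating_comment"])   # Great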

With the response.text now trimmed down to a JSON-friendly string, you can do this:

import json
q = json.loads(x)

comment = q['doctor']['sample_rating_comment']
comment = comment.replace("\u2019", "'")
print(comment)

The key things to remember when parsing the response text are to use the index to pick out the section you want, and to make use of the “\” backslash to escape characters when you are working with quotes and actual backslashes in the text you’re parsing.

parsed-response
Figure 2 – The parsed response

Conclusion

Rendering to HTML using Splash or Selenium, or falling back to regular expressions, is not always essential. Hopefully this helps illustrate how you can extract values FROM a Python dictionary FROM JSON FROM JavaScript!

You may see a mass of text on your screen to begin with, but persevere and you can arrive at the dictionary contained within…

Demo of getting a Python Dictionary from JSON from JavaScript

Categories
Python Code Scrapy

Scraping a JSON response with Scrapy

reference:
β–Ί https://stackoverflow.com/questions/44939247/scrapy-extract-ldjson#48131898

We can’t get the HTML we need using a normal selector, so having located the ‘script’ section in the browser (Chrome Developer Tools) we can load it into a JSON object to manipulate.

json.loads(response.xpath('//script[@type="application/ld+json"]//text()').get()) – to get the data from a page containing JavaScript

Using json.loads

We extracted the output which was not available from just using a normal css or xpath selector in Scrapy.
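A hedged sketch of how that fits into a spider – the start URL and the field names are illustrative, since ld+json content varies from site to site:

import json
import scrapy

class LdJsonSpider(scrapy.Spider):
    name = "ldjson"
    start_urls = ["https://example.com/some-product-page"]

    def parse(self, response):
        raw = response.xpath('//script[@type="application/ld+json"]//text()').get()
        data = json.loads(raw)
        yield {
            "name": data.get("name"),
            "rating": data.get("aggregateRating", {}).get("ratingValue"),
        }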

See the JSON response in scrapy video

Categories
Python Code Scrapy

Web Scraping Introduction

As an overview of how web scraping works, here is a brief introduction to the process, with the emphasis on using Scrapy to scrape a listing site.

If you would like to know more, or would like us to scrape data from a specific site please get in touch.

*This article also assumes you have some knowledge of Python, and have Scrapy installed. It is recommended to use a virtual environment. Although the focus is on using Scrapy, similar logic will apply if you are using Beautiful Soup, or Requests. (Beautiful Soup does some of the hard work for you with find, and select).

Below is a basic representation of the process used to scrape a page / site

web scraping with scrapy
Web Scraping Process Simplified

Most sites you will want to scrape provide data in a list – so check the number per page and the total number. Is there a “next page” link? If so, that will be useful later on…

Identify the div and class name to use in your “selector”.

Identifying the div and class name

Using your web browser developer tools, traverse up through the elements (Chrome = Inspect Elements) until you find a ‘div’ (well, it’s usually a div) that contains the entire advert, and go no higher up the DOM.

(advert = typically: the thumbnail + mini description that leads to the detail page)

The code inside the ‘div’ will be the iterable that you use with the for loop. The “.” before the “//” in the XPath means you select all of them – e.g. all 20 adverts on a listings page that shows 20 per page.

Now that you have the XPath and have checked it in scrapy shell, you can proceed to use it with a for loop and selectors for each piece of information you want to pick out. If you are using XPATH, you can use the class name from the listing and just add “.” to the start, as highlighted below.

This “.” ensures you will be able to iterate through all 20 adverts at the same node level. (i.e All 20 on the page).
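Putting that into code, a sketch of the loop might look like this – the div and class names are placeholders for whatever you found in developer tools:

def parse(self, response):
    adverts = response.xpath('//div[@class="listing-advert"]')   # e.g. all 20 on the page
    for ad in adverts:
        # the leading "." keeps each sub-selector relative to this one advert
        title = ad.xpath('.//h3[@class="title"]/text()').get()
        link = ad.xpath('.//a/@href').get()
        yield {"title": title, "link": link}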

parse

To go to the details page we use “yield”, but we also have to pass the variables that we have picked out on the main page. So we use ‘meta’ (or the newer alternative, ‘cb_kwargs’).

yield Request(absolute_url, callback=self.fetch_detail, meta={'link': link, 'logo_url': logo_url, 'lcompanyname':lcompanyname})

Using ‘meta’ allows us to pass variables to the next function – in this case it’s called “fetch_detail” – where they will be added to the rest of the variables collected and sent to the FEEDS export, which makes the output file.

There is also a newer, recommended version of β€œmeta” to pass variables between functions in Scrapy: β€œcb_kwargs”

Once you have all the data it is time to use β€œYield” to send it to the FEEDS export.

The “FEEDS” setting lets you write output to your chosen file format/destination.

This is the format and destination that you have set for your output file.

*Note it can also be a database, rather than JSON or CSV file.
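For example, FEEDS can be set in settings.py or as custom_settings on the spider – the filename and format here are just one possibility:

custom_settings = {
    "FEEDS": {
        "output.csv": {"format": "csv", "overwrite": True},
    }
}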

Putting it all together

See the fully working spider code here :

https://github.com/RGGH/Scrapy5/blob/master/yelpspider.py

You may wish to run all of your code from within the script, in which case you can do this:

# main driver
if __name__ == "__main__":

    process = CrawlerProcess()
    process.crawl(YelpSpider)
    process.start()

# You will also need to add this import at the start of the file:
from scrapy.crawler import CrawlerProcess

Web Scraping – Summary

We have looked at the steps involved and some of the code you’ll find useful when using Scrapy.

Identifying the html to iterate through is the key

Try and find the block of code that has all of the listings / adverts, and then narrow it down to one advert/listing. Once you have done that you can test your code in “scrapy shell” and start building your spider.

(Scrapy shell can be run from your CLI, independent of your spider code):

xpath-starts-with
Scrapy shell is your friend!

Note: some of this content may be easier to relate to if you have studied and completed the official tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html

If you have any questions please get in touch and we’ll be pleased to help.

Categories
Python Code Scrapy

Scrapy tips

Passing variables between functions using meta and cb_kwargs

This will cover how to use callback with “meta” and the newer “cb_kwargs”

The highlighted sections show how “logo_url” goes from parse to fetch_detail, where “yield” then sends it to the FEED export (output CSV file).

When using ‘meta’ you need to use response.meta.get() in the next function.

The newer, Scrapy recommended way is to use “cb_kwargs”

As you can see, there is no need to use any sort of “get” method in ‘fetch_detail’, so it’s simpler to use now, albeit with a slightly longer, less memorable name!
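A side-by-side sketch of the two mechanisms – the spider, URLs and the “logo_url” selector are made up purely to show how the value travels between methods:

import scrapy

class LogoSpider(scrapy.Spider):
    name = "logo_demo"
    start_urls = ["https://example.com/companies"]

    def parse(self, response):
        logo_url = response.css("img.logo::attr(src)").get()
        # old style: stash the value in meta...
        yield scrapy.Request(response.urljoin("/detail/1"),
                             callback=self.fetch_detail_meta,
                             meta={"logo_url": logo_url})
        # newer, recommended style: pass it as a callback keyword argument
        yield scrapy.Request(response.urljoin("/detail/2"),
                             callback=self.fetch_detail_kwargs,
                             cb_kwargs={"logo_url": logo_url})

    def fetch_detail_meta(self, response):
        logo_url = response.meta.get("logo_url")     # read it back off the response
        yield {"logo_url": logo_url}

    def fetch_detail_kwargs(self, response, logo_url):
        yield {"logo_url": logo_url}                 # arrives as a normal argument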

Watch the YouTube video on cb_kwargs
Categories
Python Code Scrapy

Scrapy : Yield

Yes, you can use “Yield” more than once inside a method – we look at how this was useful when scraping a real estate / property section of Craigslist.

Put simply, “yield” lets you run another function with Scrapy and then resume from where you “yielded”.

To demonstrate this, it is best to show it with a working example – then you’ll see the reason for using it.

Source code for this Scrapy Project

https://github.com/RGGH/Scrapy4/blob/master/realestate_loader_with_geo.py

The difference with this project was that most of the usual ‘details‘ were actually on the ‘thumbnails’ / ‘listing’ page, with the exception of the geo data. (Longitude and Latitude).

So you could say this was a back-to-front website. Typically all the details would be extracted from the details page, accessed via a listings page.

Because we want to pass data (“Lon” and “Lat”) between 2 functions (methods) – we had to initialise these variables:

def __init__(self):
    self.lat = ""
    self.lon = ""
Scrapy-Yield-Explanation
The ‘lat’ and ‘lon’ variables are used in ‘parse_detail’ – their values then get passed to items back inside the parse method.

Next, the typical ‘parse’ code that identifies all of the ads (adverts) on the page – class name = “result-info”.

You could use either:

all_ads = response.xpath('//p[@class="result-info"]')

or

all_ads = response.css("p.result-info")

( XPATH or CSS – both get the same result, to use as the Selector )

start_requests

We coded this, but it would run even if we hadn’t – it’s the default Scrapy method that requests the first URL and passes the resulting “response” to the next method: ‘parse’.

parse

This is the method that finds all of the adverts on page 1, and goes off to the details page and extracts the geo data.

Next, it fills the Scrapy fields defined in items.py with the data from the property’s thumbnail listing on the listings page, plus the geo data.

So the reason we described this as a back-to-front website is that the majority of the details come from the thumbnails/listing, and only 2 bits of data (“lon” and “lat”) come from the ‘details’ page.
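A much-simplified sketch of the overall pattern – the selectors, URLs and the use of cb_kwargs here are illustrative, not the exact realestate_loader_with_geo.py code, which uses items.py and the self.lat / self.lon attributes shown above:

import scrapy

class RealEstateSketch(scrapy.Spider):
    name = "realestate_sketch"
    start_urls = ["https://example.org/search/rea"]

    def parse(self, response):
        for ad in response.xpath('//p[@class="result-info"]'):
            item = {
                "title": ad.xpath('.//a/text()').get(),
                "price": ad.xpath('.//span[@class="result-price"]/text()').get(),
            }
            detail_url = ad.xpath('.//a/@href').get()
            # first yield: fetch the details page, carrying the partly-filled item
            yield response.follow(detail_url, callback=self.parse_detail,
                                  cb_kwargs={"item": item})

    def parse_detail(self, response, item):
        # only the geo data lives on the details page
        item["lat"] = response.xpath('//div[@id="map"]/@data-latitude').get()
        item["lon"] = response.xpath('//div[@id="map"]/@data-longitude').get()
        # second yield: send the completed item to the FEEDS export
        yield item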

Craigslist Scrapy Project - Listings page
Above : Throughout this article we’re referring to this as the “thumbnails/listings’ page
Scrapy-Details-Page
Craigslist Scrapy Project
Above : This is what we typically refer to as the “details” page
Categories
Python Code

Nested Dictionaries

Summary : Read a JSON file, create a Dict, loop through and get keys and values from the inner Dict using Python.

Uses: f.read / json.dumps / json.loads / dict comprehension / for loop
Start with a JSON file – which looks like this in Notepad….

This post assumes you already have a nested dictionary saved as a JSON file. If not, you can download my json example.

We want to do 3 things:

  1. Import the JSON into a Dict type object
  2. Display the Dict to check the format
  3. Loop through the Dict to get the values that we need to print out.

Let’s get started,

Import the JSON file

First we need to use “import json” to work with the json format

Next, open the file “ytdict.json” and load it into an object

with open("ytdict.json") as f:
d = json.loads(f.read())

print(type(d)) # will give : <class 'dict'>

Check the contents, sorting the keys as int – otherwise 11 will appear before 2.

Print the contents

print(json.dumps({int(x): d[x] for x in d.keys()}, indent=2, sort_keys=True))

The output should look like this :

Here you can see the outer dict, and then the inner dict with 2 key:value pairs

For loop with Nested Dictionaries

Write a new file, which will be the one we want to use as our actual program; ‘demo_nested_dict.py’

import json
# The 'real' code used in ytapithumbget.py
with open("ytdict.json") as f:
    data = json.loads(f.read())
    # a nested dictionary needs 2 for loops
    for k, v in data.items():
        for v2 in v.items():          # (key, value) pairs of the inner dict
            print(v2[1])
            print("------------")

We loop through the outer dictionary, finding the key:value pairs, and then using the second for loop we iterate through the key:value pairs from the returned values of the outer dictionary.

We’ve picked out the “Title” and “Description” values from the JSON file (It’s from my ‘YouTube-Channel-Downloader’ : ytapithumbget.py Script)

Note the index of [1] – that picks out the value from each key:value pair returned by the inner loop.

If we had used [0] we would have ended up with this:

Iterating through Nested Dictionaries | JSON – watch the YouTube Video

Categories
Python Code

Automating a potentially repetitive task

Write 102 workplans or write some code to do it?

There was a requirement to write 102 workplans as Word/docx files. The only data that was unique to each workplan was held in a CSV – 1 row per workplan – so this is how we automated the creation of 102 Word documents.
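A hedged sketch of the approach using csv and python-docx – the CSV column names and the template file are made up, since the real workplan fields will differ:

import csv
from docx import Document   # pip install python-docx

with open("workplans.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        doc = Document("workplan_template.docx")         # shared boilerplate document
        doc.add_heading(f"Workplan - {row['project']}", level=1)
        doc.add_paragraph(f"Owner: {row['owner']}")
        doc.add_paragraph(f"Start date: {row['start_date']}")
        doc.save(f"workplan_{row['project']}.docx")       # one docx per CSV row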