Categories: Python Code, Scrapy

How To Web Scrape Amazon (successfully)

You may want to scrape Amazon for information about books – in our case, books about web scraping!

We shorten what would otherwise be a very long selector by using "contains" in our XPath:

response.xpath('//*[contains(@class,"sg-col-20-of-24 s-result-item s-asin")]')

The most important thing when starting to scrape is to establish what you want in your final output.

Here are the data points we want to extract:

  • ‘title’
  • ‘author’
  • ‘star_rating’
  • ‘book_format’
  • ‘price’

Now we can write our parse method, and once done, we can finally add on the “next page” code.

The Amazon pages have white space around the author name(s), so this is a good example of when to use ‘normalize-space’.

We also had to make sure we weren’t splitting the partially parsed response too soon and removing the second author (if there was one).
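Here is a minimal sketch of what that parse method might look like. The search URL and the relative XPaths inside each result are assumptions about Amazon’s markup at the time of writing, so treat them as placeholders to verify in Scrapy shell rather than guaranteed selectors:

import scrapy


class AmazonBooksSpider(scrapy.Spider):
    name = "amazon_books"
    start_urls = ["https://www.amazon.co.uk/s?k=web+scraping"]  # assumed search URL

    def parse(self, response):
        # "contains" keeps the selector short despite Amazon's long class strings
        for book in response.xpath('//*[contains(@class,"sg-col-20-of-24 s-result-item s-asin")]'):
            yield {
                "title": book.xpath(".//h2//text()").get(),
                # normalize-space() strips the white space around the author name(s)
                "author": book.xpath("normalize-space(.//div[@class='a-row']//a/text())").get(),
                "star_rating": book.xpath(".//span[@class='a-icon-alt']/text()").get(),
                "book_format": book.xpath(".//a[contains(@class,'a-text-bold')]/text()").get(),
                "price": book.xpath(".//span[@class='a-price']/span[@class='a-offscreen']/text()").get(),
            }

        # the "next page" code, added once the parse method is working
        next_page = response.xpath("//a[contains(@class,'s-pagination-next')]/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)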

Some of the results are perhaps not what you want, but this is due to Amazon returning products which it thinks are in some way related to your search criteria!

By using pipelines in Scrapy, along with the process_item method we were able to filter much of what was irrelevant. The great thing about web scraping to an SQL database is the flexibility it offers once you have the data. SQL, Pandas, Matplotlib and Python are a powerful combination…
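For example, a pipeline along these lines can drop the unrelated products Amazon mixes into the results. The keyword test here is just an assumed stand-in for whatever "relevant" means for your own search:

from scrapy.exceptions import DropItem


class RelevantBooksPipeline:
    """Drop results that are not actually web scraping books."""

    def process_item(self, item, spider):
        title = (item.get("title") or "").lower()
        if "scraping" not in title and "scrapy" not in title:
            raise DropItem(f"Not relevant: {item.get('title')}")
        return item

Remember to enable the pipeline in ITEM_PIPELINES in settings.py.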

If you are unsure about how any of the code works, drop us a comment in the comments section of the Web Scraping Amazon YouTube video.

Categories: Python Code, Scrapy, Selenium

Combine Scrapy with Selenium

A major disadvantage of Scrapy is that it cannot handle dynamic websites on its own (e.g. ones that rely on JavaScript to render their content).

If you need to get past a login that is proving impossible to crack, usually because the form data keeps changing, then you can use Selenium to get through the login screen and then pass the response back into Scrapy.

It may sound like a workaround, and it is, but it’s a good way to get logged in so you can get to the content much quicker than if you tried to use Selenium to do everything.

Selenium is primarily a testing tool, but sometimes you can combine Selenium and Scrapy to get the job done!

Watch the YouTube video on How to login past javascript & send response back to Scrapy

[Screenshots: the Combining Scrapy with Selenium solution, prerequisites, and integrating Selenium into Scrapy]

Below:

The ‘LoginEM’ and ‘LoginPW’ represent the ‘name’ attributes of the input fields (find these by viewing the source in your browser).

self.pw and self.em are the variables which hold your stored email and password – I’ve stored mine as environment variables in .bash_profile on the host computer.

[Screenshots: logging in with Selenium, using response.replace, and the start_urls]
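To give a feel for how the pieces fit together, here is a rough sketch rather than the exact code from the video: the login URL and submit-button selector are placeholders, and the environment variable names are whatever you chose in .bash_profile.

import os

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class SeleniumLoginSpider(scrapy.Spider):
    name = "selenium_login"
    start_urls = ["https://example.com/login"]  # placeholder login URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # email and password stored as environment variables in .bash_profile
        self.em = os.environ.get("SITE_EMAIL")
        self.pw = os.environ.get("SITE_PASSWORD")
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # let Selenium deal with the login form and its changing form data
        self.driver.get(response.url)
        self.driver.find_element(By.NAME, "LoginEM").send_keys(self.em)
        self.driver.find_element(By.NAME, "LoginPW").send_keys(self.pw)
        self.driver.find_element(By.XPATH, "//button[@type='submit']").click()

        # hand the logged-in page back to Scrapy via response.replace
        response = response.replace(body=self.driver.page_source)
        yield from self.parse_after_login(response)

    def parse_after_login(self, response):
        # carry on with normal Scrapy selectors from here
        yield {"page_title": response.xpath("//title/text()").get()}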
Categories: Python Code

Xpath for hidden values

This article describes how to form a Scrapy xpath selector to pick out the hidden value that you may need to POST along with a username and password when scraping a site with a log in. These hidden values are dynamically created so you must send them with your form data in your POST request.

Step one

Identify the source in the browser:

OK, so we want to pass "__VIEWSTATE" as a key:value pair in the POST request.

This is the xpath selector format you will need to use:

response.xpath('//div[@class="aspNetHidden"]/input[@name="__VIEWSTATE"]/@value').get()

This should get what you need… in Scrapy shell!

But… not when you run it in your Spider…

Instead, you need to use:

response.xpath('//*[@id="Form1"]/input[@name="__VIEWSTATE"]/@value').get()
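Once you have the value, send it along with the rest of the form data. A brief hedged sketch, where the URL and the username/password field names are placeholders for whatever the site expects:

import scrapy
from scrapy import FormRequest


class HiddenValueSpider(scrapy.Spider):
    name = "hidden_value"
    start_urls = ["https://example.com/login.aspx"]  # placeholder URL

    def parse(self, response):
        # grab the dynamically generated hidden value first
        viewstate = response.xpath('//*[@id="Form1"]/input[@name="__VIEWSTATE"]/@value').get()
        yield FormRequest(
            url=response.url,
            formdata={
                "__VIEWSTATE": viewstate,
                "username": "your_username",  # placeholder field names
                "password": "your_password",
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        # logged in - scrape what you need from here
        yield {"title": response.xpath("//title/text()").get()}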

Categories: Python Code

Scrapy Form Login

Scrapy – how to log in to a website

The following is an article which will show you how to use Scrapy to log in to sites that have username and password authentication.

The important thing to remember is that there may be additional data that needs to be sent to the login page, data that is in addition to just username and password…

We are going to cover:

🔴 Identifying what data to POST

🔴 The Scrapy method to login

🔴 How to progress after logging in

[Screenshots: gathering the data to submit, storing passwords in .bash_profile, building the Scrapy form data, and the Scrapy FormRequest]

Once you have seen that the Spider has logged in, you can proceed to scrape what you need. Remember, you are looking for “Logout” because that will mean you are logged in!
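Putting it together, a hedged sketch of the flow looks something like this; FormRequest.from_response picks up the hidden fields from the form for you, and the URL, field names, and environment variable names are placeholders:

import os

import scrapy
from scrapy.http import FormRequest


class FormLoginSpider(scrapy.Spider):
    name = "form_login"
    start_urls = ["https://example.com/accounts/login/"]  # placeholder URL

    def parse(self, response):
        # from_response copies the hidden inputs from the login form for us
        yield FormRequest.from_response(
            response,
            formdata={
                "username": os.environ.get("SITE_USER"),  # stored in .bash_profile
                "password": os.environ.get("SITE_PASS"),
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        # "Logout" on the page means we are logged in
        if "Logout" in response.text:
            self.logger.info("Logged in - carry on scraping")
        else:
            self.logger.error("Login failed")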

Conclusion: we have looked at how to use FormRequest to POST form data in a Scrapy spider, and we have extracted data from a site that is protected by a login form.

See the video on YouTube:

Scrapy Form Login | How to log in to sites using FormRequest | Web Scraping Tutorial

https://youtu.be/VvU9mR-WNSA
Categories: Python Code

Web Scraping Tools

[Image: web scraping tools]

We’ve not included JavaScript or Selenium, and others may weigh things differently, but time is finite, so concentrate on what gives the biggest benefit.

We could have put Python at the centre but we’ve assumed that’s a given.

How do you visualize your own process of web scraping?

Categories: Python Code

Fix: Module Not Found

If you use Scrapy with CrawlerProcess, then Python may not search the path that is required for Scrapy to find your "items.py" file.

How do you fix this ‘ModuleNotFoundError’?

Read on…

[Screenshot: project tree showing the items.py location]

As we can see, the Scrapy framework has the items.py file in a directory above the spiders directory.

[Screenshots: the spider location, and using sys.path.append with the project name to add items.py to the search path]
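A minimal sketch of the fix, in the script that calls CrawlerProcess. The project and spider names are placeholders, and the exact path you append depends on where your script sits relative to items.py:

import sys
from pathlib import Path

from scrapy.crawler import CrawlerProcess

# items.py lives in the project package, one directory above the spiders directory,
# so append that directory before the spider tries "from items import ..."
sys.path.append(str(Path(__file__).resolve().parent / "myproject"))  # placeholder project name

from myproject.spiders.my_spider import MySpider  # placeholder spider

process = CrawlerProcess()
process.crawl(MySpider)
process.start()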
Categories: Python Code

Scraping a page via CSS style data

The challenge was to scrape a site where the class names for each element were out of order / randomised. So the only way to get the data in the correct sequence was to sort through the CSS styles by left, top, and match to the class names in the divs…

The px values varied slightly: 710 vs 711, for example.

The class names were meaningless and there was no way of establishing an order; a quick check in developer tools showed the highlighter rectangle jumping all over the page in no particular order when traversing the source code.

[Screenshot: an example of the randomised class names, e.g. page_useddemo-gear_fF0iR8Rn6hrUOJh0YwkOA]

So we began the code by identifying the CSS and then parsing it:

The idea is to extract the CSS style data, parse it for left, and top px, and sort through those to match the out of sequence div/classnames in the body of the page.

There was NO possibility of sequentially looping through the divs to extract ordered data.

After watching the intro to the challenge on YouTube, and with a handy hint from CMK, we got to work.

From this article you will learn as much about coding with Python as you will about web scraping specifically. Note I used requests_html, as it provided me with the option to use XPath.

BeautifulSoup could also have been used.
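As a rough sketch of that first step, the <style> block can be pulled out of the page and parsed for the left/top pixel values of each class name. The regex, and the assumption that each rule lists left before top, are simplifications of the real page’s CSS; plain requests is used here just to keep the sketch short:

import re

import requests

# the challenge page - the randomised class names are styled in an embedded <style> block
page = requests.get("http://audioeden.com/useddemo-gear/4525583102").text

# rules look roughly like:  .fF0iR8Rn6hrUOJh0YwkOA { left: 710px; top: 1466px; ... }
pattern = re.compile(r"\.([\w-]+)\s*{[^}]*?left:\s*(\d+)px[^}]*?top:\s*(\d+)px", re.S)

ls = [
    {"name": name, "left": int(left), "top": int(top)}
    for name, left, top in pattern.findall(page)
]
print(ls[:5])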

Identifying columns and rows based on “left px” and “top px”

Python methods used:

round

x = round(1466,-2)
print(x) # 1500

x = round(1526,-2)
print(x) # 1500

x = round(1526,-1)
print(x) # 1530

I needed to use round as the only way to identify a “Column” of text was to cross reference the <style> “left px” and “top px” with the class names used inside the divs. Round was required as there was on occasion a 2 or 3 px variation in the “left” value.

itemgetter

from operator import itemgetter

ls.sort(key=itemgetter('left','top'))

I had to sort the values from the parsed css style in order of “left” to identify 3 columns of data, and then “top” to sort the contents of the 3 columns A, B, C.

zip

zipped = zip(ls_desc,ls_sellp,ls_suggp)
Zipping the 3 lists – read on to find out the issue with doing this…

rows = list(zipped)

So to get the data from the 3 columns, A (ls_desc), B (ls_sellp), and C (ls_suggp), I used zip, but… there were 2 values missing in column C!

A had 77 values, B had 77 values, but C had only 75!

Not only was there no text in the 2 blanks in column C, there was no CSS for them either.

We only identified this as an issue after running the code – visually the page looked consistent; alas, the last part of column "C" becomes out of sequence with the data in columns A and B, which are both correct.

Solution?

Go back and check if column “C” has a value at the same top px value as Column “B”. If no value then insert an “x” or spacer into Column C at that top px value.

This will need to be rewritten using dictionaries, creating one dictionary per ROW rather than my initial idea of one list per column and zipping them!

Zipping the 3 lists nearly works… but 2 missing values in "Suggested Price" mean that the data in column C become out of sync.
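A hedged sketch of that dictionary-per-row idea, keyed on the rounded "top" value so a missing "Suggested Price" just stays empty instead of shifting column C out of sequence. The "left" values used to decide which column an entry belongs to are placeholders:

# each entry comes from matching the CSS left/top values to the div class names,
# e.g. {"left": 710, "top": 1466, "text": "..."}
rows = {}

for entry in ls:                        # ls = the parsed and matched entries
    row_key = round(entry["top"], -1)   # tolerate the 2-3 px variation in "top"
    col_key = round(entry["left"], -2)  # three distinct "left" values = three columns
    row = rows.setdefault(row_key, {"desc": None, "sell_price": None, "sugg_price": None})
    if col_key == 100:                  # placeholder "left" for column A
        row["desc"] = entry["text"]
    elif col_key == 700:                # placeholder "left" for column B
        row["sell_price"] = entry["text"]
    else:                               # anything else is column C
        row["sugg_price"] = entry["text"]

# a missing value in column C now shows up as None rather than pushing later rows out of sync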

special thanks to “code monkey king” for the idea/challenge!

url = 'http://audioeden.com/useddemo-gear/4525583102'

My initial solution:

https://github.com/RGGH/Experimental-Custom-Scrapers/blob/master/audio1.py

Next:

Rewrite the section for “Column B” to check for presence of text in column “C” on the same row…


1 missing value halfway down column "C" means more error checking is required! – If you just want the "Selling Price" and "Description" then this code is 100% successful! 👍

See the solution, and the error, in the YouTube video.

Conclusion:

For more robust web scraping where CSS elements may be missing, use dictionaries, enumerate each row, and check. It’s the old case of "you don’t know what you don’t know".

If you can ensure each list has the same number of items, then zip is OK to use.

Categories: Python Code, Raspberry Pi, Scrapy

Configure a Raspberry Pi for web scraping

Introduction

The task was to scrape over 50,000 records from a website and be gentle on the site being scraped. A Raspberry Pi Zero was chosen to do this as speed was not a significant issue, and in fact, being slower makes it ideal for web scraping when you want to be kind to the site you are scraping and not request resources too quickly. This article describes using Scrapy, but BeautifulSoup or Requests would work in the same way.

The main considerations were:

  • Could it run Scrapy without issue?
  • Could it run with a VPN connection?
  • Would it be able to store the results?

A quick, short test proved that it could collect approx. 50,000 records per day, which meant it was entirely suitable.

I wanted a VPN tunnel from the Pi Zero to my VPN provider. This was an unknown, because I had only previously run it on a Windows PC with a GUI. Now I was attempting to run it from a headless Raspberry Pi!

This took approx 15 mins to set up. Surprisingly easy.

The only remaining challenges were:

  • run the spider without having to leave my PC on as well (closing PuTTy in Windows would have terminated the process on the Pi) – That’s where nohup came in handy.
  • Transfer the output back to a PC (running Ubuntu – inside a VM ) – this is where rsync was handy. (SCP could also have been used)

See the writing of the Scrapy spider with “Load More”

Categories: Python Code, Scrapy

Scraping “LOAD MORE”

Do you need to scrape a page that is dynamically loading content as “infinite scroll” ?

If you need to scrape a site like this, then you can increment the URL within your Scrapy code.

Using self.nxp += 1, the value passed to "pn=" in the URL gets incremented.

"pn=" is the query parameter – in your spider it may be different; you can always use urllib.parse to split the URL into its parts.

Test in Scrapy shell if you are checking the URL for the next page – see if you get response 200, and then check the response.text.

What if you don’t know how many pages there are?

One way would be to use try/except – but a more elegant solution would be to check the source for "next" or "has_next" and keep going to the next page until "next" is no longer true.

https://github.com/RGGH/Scrapy6/blob/master/AJAX%20example/foodcom.py

If you look at line 51 – you can see how we did that.

if response.xpath("//link/@rel='next'").get() == "1":
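In context, the pattern looks roughly like this; the base URL and the "pn=" query are placeholders for whatever your target site uses:

import scrapy


class LoadMoreSpider(scrapy.Spider):
    name = "load_more"
    base_url = "https://example.com/search?q=recipes&pn={}"  # placeholder URL with a "pn=" query
    nxp = 1

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(self.nxp), callback=self.parse)

    def parse(self, response):
        # ... yield the items from this chunk of results here ...

        # keep going until the "next" hint disappears from the source
        if response.xpath("//link/@rel='next'").get() == "1":
            self.nxp += 1
            yield scrapy.Request(self.base_url.format(self.nxp), callback=self.parse)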

See our video where we did just this : https://youtu.be/07FYDHTV73Y

Conclusion

We’ve shown how to deal with "infinite scroll" without resorting to Selenium, Splash, or any JavaScript rendering. Also, check in developer tools under "Network" and "XHR" to see if you can find any mention of an API in the URL – this may be useful too.

Categories: Python Code

Extracting JSON from JavaScript in a web page

Why would you want to do that?

Well, if you are web scraping using Python and Scrapy, for instance, you may need to extract reviews or comments that are loaded from JavaScript. This would mean you could not use your CSS or XPath selectors as you can with regular HTML.

Parse

Instead, check in your browser whether you can parse the code: start with Ctrl+F and "json" to track down some JSON in the form of a Python dictionary. You ‘just’ need to isolate it.

[Screenshot: view-source, used to find occurrences of "JSON" in your page]

The response is not nice, but you can gradually shrink it down in the Scrapy shell or Python shell…

[Figure 1 – The response]

Split, strip, replace

From within Scrapy, or your own Python code, you can split, strip, and replace with the built-in Python string methods until you have just the dictionary that you can use with json.loads.

x = response.text.split('JSON.parse')[3].replace("\u0022","\"").replace("\u2019m","'").lstrip("(").split(" ")[0].strip().replace("\"","",1).replace("\");","")

Master replace, strip, and split and you won’t need regular expressions!

With the response.text now ready as a JSON friendly dictionary you can do this:

import json

q = json.loads(x)

comment = q['doctor']['sample_rating_comment']
comment = comment.replace("\u2019", "'")  # swap the right single quote for a plain apostrophe
print(comment)

The key things to remember when parsing the response text are to use the index to pick out the section you want, and to make use of the "\" backslash to escape characters when you are working with quotes and actual backslashes in the text you’re parsing.

[Figure 2 – The parsed response]
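As a self-contained illustration of the same idea, with a made-up fragment of page source rather than the real site:

import json

# made-up page source: JSON handed to JavaScript with escaped quotes
html = r'var data = JSON.parse("{\"doctor\": {\"sample_rating_comment\": \"Great!\"}}");'

# isolate the argument to JSON.parse, then turn \" back into plain quotes
raw = html.split('JSON.parse("')[1].split('");')[0].replace('\\"', '"')

q = json.loads(raw)
print(q["doctor"]["sample_rating_comment"])  # Great!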

Conclusion

Rendering to HTML using Splash or Selenium, or falling back on regular expressions, is not always essential. Hopefully this helps illustrate how you can extract values FROM a Python dictionary FROM JSON FROM JavaScript!

You may see a mass of text on your screen to begin with, but persevere and you can arrive at the dictionary contained within…

Demo of getting a Python Dictionary from JSON from JavaScript