
How To Web Scrape Amazon (successfully)

As a worked example, we scrape Amazon for information about books on web scraping!

We shorten what would otherwise be a very long selector by using “contains” in our XPath:

response.xpath('//*[contains(@class,"sg-col-20-of-24 s-result-item s-asin")]')

The most important thing when starting to scrape is to establish what you want in your final output.

Here are the data points we want to extract :

  • ‘title’
  • ‘author’
  • ‘star_rating’
  • ‘book_format’
  • ‘price’
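One way to pin down that output before writing any selectors is a small item class. This is only a sketch: the class name `BookItem` and the choice of plain strings for every field are our assumptions. Scrapy (2.2 and later) accepts dataclass instances as items, so a spider could yield these directly.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BookItem:
    """The five data points we want for each book (hypothetical class name)."""
    title: str
    author: Optional[str] = None
    star_rating: Optional[str] = None
    book_format: Optional[str] = None
    price: Optional[str] = None


# Fields we could not find on the page simply stay as None.
book = BookItem(title="Example Book", author="Example Author")
print(asdict(book))
```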

Now we can write our parse method, and once done, we can finally add on the “next page” code.

The Amazon pages have white space around the author name(s), so this is an example of when to use ‘normalize-space’.

We also had to make sure we weren’t splitting the partially parsed response too soon and losing the second author (if there was one).

Some of the results are perhaps not what you want, but this is due to Amazon returning products which it thinks are in some way related to your search criteria!

By using pipelines in Scrapy, along with the process_item method we were able to filter much of what was irrelevant. The great thing about web scraping to an SQL database is the flexibility it offers once you have the data. SQL, Pandas, Matplotlib and Python are a powerful combination…

If you are unsure about how any of the code works, drop us a comment in the comments section of the “Web Scraping Amazon” YouTube video.


Combine Scrapy with Selenium

A major limitation of Scrapy is that it does not execute JavaScript, so on its own it cannot handle dynamic websites (e.g. ones that build their content with JavaScript).

If a login is proving impossible to get past with Scrapy alone, usually because the form data keeps changing, you can use Selenium to get past the login screen and then pass the response back into Scrapy.

It may sound like a workaround, and it is, but it’s a good way to get logged in so you can get the content much quicker than if you try and use Selenium to do it all.

Selenium is for testing, but sometimes you can combine Selenium and Scrapy to get the job done!

Watch the YouTube video on How to login past javascript & send response back to Scrapy

[Screenshots: Combining Scrapy with Selenium: solution, prerequisites, integrating into Scrapy]

Below:

The ‘LoginEM’ and ‘LoginPW’ represent the ‘name’ of the input field (find these from viewing the source in your browser).

self.pw and self.em are the variables which hold your stored password and email – I’ve stored mine here as environment variables in .bash_profile on the host computer.

[Screenshots: Combining Scrapy with Selenium: logging in, response.replace, start_urls]

Xpath for hidden values

This article describes how to form a Scrapy xpath selector to pick out the hidden value that you may need to POST along with a username and password when scraping a site with a log in. These hidden values are dynamically created so you must send them with your form data in your POST request.

Step one

Identify the source in the browser:

[Screenshot: XPath for hidden values]
Ok, so we want to pass “__VIEWSTATE” as a key : value pair in the POST request

This is the xpath selector format you will need to use:

response.xpath('//div[@class="aspNetHidden"]/input[@name="__VIEWSTATE"]/@value').get()

This should get what you need… in the Scrapy shell!

But… not when you run it in your spider…

Instead, you need to use:

response.xpath('//*[@id="Form1"]/input[@name="__VIEWSTATE"]/@value').get()


Scrapy Form Login

Scrapy – how to log in to a website

This article shows you how to use Scrapy to log in to sites that have username and password authentication.

The important thing to remember is that the login page may need additional data, such as hidden form values, sent along with the username and password…

We are going to cover :

🔴 Identifying what data to POST

🔴 The Scrapy method to login

🔴 How to progress after logging in

[Screenshots: gathering the data to submit, passwords in .bash_profile, Scrapy form data, Scrapy FormRequest]

Once you have confirmed that the spider has logged in, you can proceed to scrape what you need. Remember, you are looking for “Logout” in the response, because that means you are logged in!

Conclusion : we have looked at how to use FormRequest to POST form data in a Scrapy spider, and we have extracted data from a site that is protected by a login form.

See the video on YouTube :

Scrapy Form Login | How to log in to sites using FormRequest | Web Scraping Tutorial

https://youtu.be/VvU9mR-WNSA