
Extract links with Scrapy

Using Scrapy’s LinkExtractor class you can extract the links from every page you crawl.

Link extraction can be achieved very quickly with Scrapy and Python

https://www.programcreek.com/python/example/106165/scrapy.linkextractors.LinkExtractor


https://github.com/scrapy/scrapy/blob/2.5/docs/topics/link-extractors.rst

https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py


https://w3lib.readthedocs.io/en/latest/_modules/w3lib/url.html

What are Link Extractors?

    Link Extractors are objects used for extracting links from web pages via scrapy.http.Response objects.
    “A link extractor is an object that extracts links from responses.” Although Scrapy has a built-in extractor (from scrapy.linkextractors import LinkExtractor), you can write your own link extractor to suit your needs by implementing a simple interface.
    The Scrapy link extractor makes use of w3lib.url.
    Have a look at the source code for w3lib.url : https://w3lib.readthedocs.io/en/latest/_modules/w3lib/url.html

# -*- coding: utf-8 -*-

#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#|r|e|d|a|n|d|g|r|e|e|n|.|c|o|.|u|k|
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

import os

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor


class Ebayspider(Spider):

    name = 'ebayspider'
    allowed_domains = ['ebay.co.uk']
    start_urls = ['https://www.ebay.co.uk/deals']

    # remove any output left over from a previous run
    try:
        os.remove('ebay2.txt')
    except OSError:
        pass

    custom_settings = {
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_DEBUG': True,
        'DOWNLOAD_DELAY': 1
    }

    def __init__(self):
        # only extract links matching the Superdry deals page, deduplicated
        self.link_extractor = LinkExtractor(
            allow="https://www.ebay.co.uk/e/fashion/up-to-50-off-superdry",
            unique=True)

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # append each extracted link to the output file
            with open('ebay2.txt', 'a+') as f:
                f.write(f"\n{str(link)}")

            # follow the link and parse it with this same method
            yield response.follow(url=link, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Ebayspider)
    process.start()

Summary

The above code gets all of the hrefs very quickly and gives you the flexibility to omit or include very specific attributes.

Watch the video Extract Links | how to scrape website urls | Python + Scrapy Link Extractors


Scrapy response.meta

capture your start urls in your output with Scrapy response.meta

scrapy real estate scraping

Every web scraping project has aspects that are different or interesting and worth remembering for future use.

This is a look at a recent real-world project, focusing on saving more than one start URL in the output.

This assumes basic knowledge of web scraping and of identifying selectors. See my other videos if you would like to learn more about selectors (XPath & CSS).

scraping a real estate site for houses and apartments

We want to fill all of the columns in our client’s master Excel sheet.

We could then provide them with a CSV which they can import and use as they wish.

We want 1500+ properties, so we will be using Scrapy and Python.

scrapy spider logic

Considerations

One of the required fields means we must pass the particular start URL all the way through to the CSV (use response.meta)

Some of the required values are embedded in text and will require parsing with re (use regular expressions)

We don’t care about being fast – edit “settings.py” with conservative values for concurrent connections and download delay

This is a German website so I will use Google Chrome browser and translate to English.

Scrapy response.meta

We will use Scrapy’s Request.meta attribute to achieve the following:

Capture whichever of the multiple start_urls is used – pass it all the way through to the output CSV.

scrapy response documentation

Create a “meta” dictionary in the initial Request in start_requests 

“surl” represents each of our start urls

(we have 2, one for ‘rent’ and one for the ‘buy’ url, we could have many more if required)
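A minimal sketch of the idea, with hypothetical ‘rent’ and ‘buy’ start URLs and a hypothetical listing selector (adjust both to your own site):

import scrapy


class PropertySpider(scrapy.Spider):

    name = 'propertyspider'
    start_urls = [
        'https://www.example.de/mieten',   # rent (hypothetical)
        'https://www.example.de/kaufen',   # buy (hypothetical)
    ]

    def start_requests(self):
        for surl in self.start_urls:
            # store the start url in the request's meta dictionary
            yield scrapy.Request(surl, callback=self.parse, meta={'surl': surl})

    def parse(self, response):
        # read it back via response.meta and convert it to a human-readable label
        label = 'rent' if 'mieten' in response.meta['surl'] else 'buy'
        for listing in response.xpath('//div[@class="property"]'):
            yield {
                'start_url': label,
                'title': listing.xpath('.//h2/text()').get(),
            }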

Screenshots: start_requests, response.meta, and the output – we still have the start_url (converted to a human-readable label) in the final CSV output.
End result: we have the start URL converted to a human-readable name that represents the particular URL Scrapy used for each listing.

Dr Pi YouTube videos
For videos on YouTube please visit: www.youtube.com/c/DrPiCode


Price Tracking Amazon

A common task is to track competitors’ prices and use that information as a guide to the prices you can charge; or, if you are buying, to spot when a product is at a new lowest price. The purpose of this article is to describe how to web scrape Amazon.

Web Scraping Amazon to SQL

Using Python, Scrapy, MySQL, and Matplotlib you can extract large amounts of data, query it, and produce meaningful visualizations.

In the example featured, we wanted to identify which Amazon books related to “web scraping” had been reduced in price over the time we had been running the spider.

If you want to run your spider daily, see the video for instructions on how to schedule a spider with cron on a Linux server.

Procedure used for price tracking
Price Tracking Amazon with Python

query = '''
-- compare each row's price with the previous price for the same title
-- (lag over the posted date); keep only the rows where the price changed
select amzbooks2.* from
    (select amzbooks2.*,
            lag(price) over (partition by title order by posted) as prev_price
     from amzbooks2) amzbooks2
where prev_price <> price'''

Visualize the stored data using Python and Matplotlib
Amazon Price Tracker Graph produced from Scrapy Spider output
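Below is a minimal sketch of how the stored prices could be plotted, assuming the amzbooks2 table from the query above (title, price, posted columns) lives in a local MySQL database. The connection details are placeholders, and pymysql/pandas are used here purely for convenience:

import matplotlib.pyplot as plt
import pandas as pd
import pymysql

# placeholder connection details - use your own host, user, and database
conn = pymysql.connect(host='localhost', user='user',
                       password='password', database='books')
df = pd.read_sql('select title, price, posted from amzbooks2', conn)
conn.close()

# one line per book title, plotting price over time
for title, group in df.groupby('title'):
    plt.plot(group['posted'], group['price'], label=title[:30])

plt.xlabel('Date')
plt.ylabel('Price')
plt.legend(fontsize=6)
plt.tight_layout()
plt.show()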
To see how to get to this stage, you may wish to watch the video:
How To Track Prices In Amazon With Scrapy

All of the code is on GitHub


How To Web Scrape Amazon (successfully)

You may want to scrape Amazon for information about books about web scraping!

We scrape Amazon for web scraping books!

We shorten what would have been a very long selector by using “contains” in our XPath:

response.xpath('//*[contains(@class,"sg-col-20-of-24 s-result-item s-asin")]')

The most important thing when starting to scrape is to establish what you want in your final output.

Here are the data points we want to extract :

  • ‘title’
  • ‘author’
  • ‘star_rating’
  • ‘book_format’
  • ‘price’

Now we can write our parse method, and once done, we can finally add on the “next page” code.

The Amazon pages have white space around the author name(s), so this is a good example of when to use ‘normalize-space’.

We also had to make sure we weren’t splitting the partially parsed response too soon and removing the second author (if there was one).
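As a rough illustration, here is a sketch of what such a parse method might look like. The search URL, the field selectors, and the pagination class name are assumptions for illustration, not the exact ones used in the video – only the result-item selector is taken from above:

import scrapy


class AmazonBooksSpider(scrapy.Spider):

    name = 'amazonbooks'
    start_urls = ['https://www.amazon.co.uk/s?k=web+scraping']  # hypothetical search URL

    def parse(self, response):
        books = response.xpath('//*[contains(@class,"sg-col-20-of-24 s-result-item s-asin")]')
        for book in books:
            yield {
                'title': book.xpath('.//h2//span/text()').get(),
                # normalize-space strips the white space around the author name(s)
                'author': book.xpath('normalize-space(.//div[@class="a-row"]//a/text())').get(),
                'star_rating': book.xpath('.//span[@class="a-icon-alt"]/text()').get(),
                'book_format': book.xpath('.//a[contains(@class,"a-text-bold")]/text()').get(),
                'price': book.xpath('.//span[@class="a-price-whole"]/text()').get(),
            }

        # "next page" code - follow the pagination link if there is one
        next_page = response.xpath('//a[contains(@class,"s-pagination-next")]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)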

Some of the results are perhaps not what you want, but this is due to Amazon returning products which it thinks are in some way related to your search criteria!

By using pipelines in Scrapy, along with the process_item method, we were able to filter out much of what was irrelevant. The great thing about web scraping to an SQL database is the flexibility it offers once you have the data. SQL, Pandas, Matplotlib and Python are a powerful combination…
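A minimal sketch of such a pipeline might look like this – the keyword test and class name are illustrative assumptions, and you would enable it via the ITEM_PIPELINES setting in settings.py:

from scrapy.exceptions import DropItem


class RelevantBooksPipeline:
    # enable with e.g. ITEM_PIPELINES = {'myproject.pipelines.RelevantBooksPipeline': 300}
    # (the module path is hypothetical)
    def process_item(self, item, spider):
        # drop anything whose title does not look like a web scraping book
        title = (item.get('title') or '').lower()
        if 'scraping' not in title:
            raise DropItem(f"Not relevant: {item.get('title')}")
        return item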

If you are unsure about how any of the code works, drop us a comment in the comments section of the Web Scraping Amazon YouTube video.


Combine Scrapy with Selenium

A major disadvantage of Scrapy is that it cannot handle dynamic websites (e.g. ones that rely on JavaScript).

If you need to get past a login that is proving impossible to handle, usually because the form data keeps changing, you can use Selenium to get past the login screen and then pass the response back into Scrapy.

It may sound like a workaround, and it is, but it’s a good way to get logged in so you can get the content much quicker than if you try and use Selenium to do it all.

Selenium is for testing, but sometimes you can combine Selenium and Scrapy to get the job done!

Watch the YouTube video on How to login past javascript & send response back to Scrapy

Screenshots: Combining Scrapy with Selenium – the solution, the prerequisites, and integrating it into Scrapy.

Below:

The ‘LoginEM’ and ‘LoginPW’ represent the ‘name’ attributes of the input fields (find these by viewing the page source in your browser).

self.pw and self.em are the variables that hold your stored email and password – I’ve stored mine as environment variables in .bash_profile on the host computer.
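A minimal sketch of the idea, assuming a hypothetical login URL, the ‘LoginEM’/‘LoginPW’ field names mentioned above, and placeholder environment variable names:

import os

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class LoginSpider(scrapy.Spider):

    name = 'loginspider'
    start_urls = ['https://www.example.com/login']  # hypothetical login page

    def __init__(self):
        self.em = os.environ.get('SCRAPE_EMAIL')     # set in .bash_profile
        self.pw = os.environ.get('SCRAPE_PASSWORD')  # set in .bash_profile
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # let Selenium fill in and submit the login form
        self.driver.get(response.url)
        self.driver.find_element(By.NAME, 'LoginEM').send_keys(self.em)
        self.driver.find_element(By.NAME, 'LoginPW').send_keys(self.pw)
        self.driver.find_element(By.NAME, 'LoginPW').submit()

        # hand the logged-in page back to Scrapy
        logged_in = response.replace(body=self.driver.page_source)
        for href in logged_in.css('a::attr(href)').getall():
            yield logged_in.follow(href, callback=self.parse_member_page)

    def parse_member_page(self, response):
        yield {'title': response.css('title::text').get()}

    def closed(self, reason):
        self.driver.quit()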

Screenshots: Combining Scrapy with Selenium – logging in, response.replace, and the start_urls.

Configure a Raspberry Pi for web scraping

Introduction

The task was to scrape over 50,000 records from a website and be gentle on the site being scraped. A Raspberry Pi Zero was chosen to do this as speed was not a significant issue, and in fact, being slower makes it ideal for web scraping when you want to be kind to the site you are scraping and not request resources too quickly. This article describes using Scrapy, but BeautifulSoup or Requests would work in the same way.

The main considerations were:

  • Could it run Scrapy without issue?
  • Could it run with a VPN connection?
  • Would it be able to store the results?

A quick test proved that it could collect approximately 50,000 records per day, which meant it was entirely suitable.

I wanted a VPN tunnel from the Pi Zero to my VPN provider. This was an unknown, because I had only previously run it on a Windows PC with a GUI. Now I was attempting to run it from a headless Raspberry Pi!

This took approx 15 mins to set up. Surprisingly easy.

The only remaining challenges were:

  • Run the spider without having to leave my PC on as well (closing PuTTY on Windows would have terminated the process on the Pi) – that’s where nohup came in handy.
  • Transfer the output back to a PC (running Ubuntu inside a VM) – this is where rsync was handy (SCP could also have been used).

See the writing of the Scrapy spider with “Load More”


Scraping “LOAD MORE”

Do you need to scrape a page that is dynamically loading content as “infinite scroll” ?

Scrapy Load More - infinite scroll - more results
If you need to scrape a site like this, you can increment the URL within your Scrapy code.

Using self.nxp += 1, the value passed to “pn=” in the URL gets incremented.

“pn=” is the query parameter – in your spider it may be different; you can always use urllib.parse to split the URL into its parts.

Test in scrapy shell if you are checking the URL for the next page – see if you get a 200 response and then check response.text.

What if you don’t know how many pages there are?

One way would be to use try/except, but a more elegant solution is to check the source for “next” or “has_next” and keep going to the next page until “next” is no longer true.

https://github.com/RGGH/Scrapy6/blob/master/AJAX%20example/foodcom.py

If you look at line 51 – you can see how we did that.

if response.xpath("//link/@rel='next'").get() == "1":
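Putting the two ideas together, here is a minimal sketch – the base URL, the “pn=” parameter, and the item selector are assumptions for illustration:

import scrapy


class LoadMoreSpider(scrapy.Spider):

    name = 'loadmore'
    base_url = 'https://www.example.com/search?q=recipes&pn={}'  # hypothetical
    start_urls = [base_url.format(1)]

    def __init__(self):
        self.nxp = 1  # current page number

    def parse(self, response):
        for title in response.xpath('//h2/a/text()').getall():
            yield {'title': title}

        # keep requesting pages until the site stops advertising a next page
        if response.xpath("//link/@rel='next'").get() == "1":
            self.nxp += 1
            yield scrapy.Request(self.base_url.format(self.nxp),
                                 callback=self.parse)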

See our video where we did just this : https://youtu.be/07FYDHTV73Y

Conclusion

We’ve shown how to deal with “infinite scroll” without resorting to Selenium, Splash, or any JavaScript rendering. Also, check the “Network” tab (XHR filter) in developer tools to see if you can find any mention of an API in the URL – this may be useful too.


Scraping a JSON response with Scrapy

reference:
► https://stackoverflow.com/questions/44939247/scrapy-extract-ldjson#48131898

We can’t get the HTML we need using a normal selector, so having located the ‘script’ section in the browser (Chrome Developer Tools) we can load it into a JSON object to manipulate.

Use json.loads(response.xpath('//script[@type="application/ld+json"]//text()').get()) to get the data from a page containing JavaScript.

Using json.loads

We extracted the output which was not available from just using a normal css or xpath selector in Scrapy.
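A minimal sketch of the idea, assuming a hypothetical page whose ld+json block contains “name” and “offers”/“price” keys (inspect the JSON on your own target page):

import json

import scrapy


class LdJsonSpider(scrapy.Spider):

    name = 'ldjson'
    start_urls = ['https://www.example.com/some-product']  # hypothetical

    def parse(self, response):
        raw = response.xpath('//script[@type="application/ld+json"]//text()').get()
        if raw:
            data = json.loads(raw)
            yield {
                'name': data.get('name'),
                'price': data.get('offers', {}).get('price'),
            }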

See the JSON response in scrapy video


Web Scraping Introduction

As an overview of how web scraping works, here is a brief introduction to the process, with the emphasis on using Scrapy to scrape a listing site.

If you would like to know more, or would like us to scrape data from a specific site please get in touch.

*This article also assumes you have some knowledge of Python and have Scrapy installed. It is recommended to use a virtual environment. Although the focus is on using Scrapy, similar logic will apply if you are using Beautiful Soup or Requests. (Beautiful Soup does some of the hard work for you with find and select.)

Below is a basic representation of the process used to scrape a page / site

web scraping with scrapy
Web Scraping Process Simplified
Most sites you will want to scrape provide data in a list – so check the number per page, and the total number. Is there a “next page” link? – If so, that will be useful later on…
Identify the div and class name to use in your “selector”

Identifying the div and class name

Using your web browser developer tools, traverse up through the elements (Chrome = Inspect Elements) until you find a ‘div’ (well, it’s usually a div) that contains the entire advert, and go no higher up the DOM.

(advert = typically: the thumbnail + mini description that leads to the detail page)

The code inside the ‘div’ will be the iterable that you use with the for loop.

The “.” before the “//” in the XPath makes the selector relative to each advert, so you select all of them – e.g. all 20 on a listings page that has 20 adverts per page.

Now that you have the XPath and have checked it in scrapy shell, you can proceed to use it with a for loop and selectors for each piece of information you want to pick out. If you are using XPath, you can use the class name from the listing and just add “.” to the start.

This “.” ensures you will be able to iterate through all 20 adverts at the same node level (i.e. all 20 on the page), as in the sketch below.
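A minimal sketch of the pattern, with hypothetical class names and field selectors:

import scrapy


class ListingSpider(scrapy.Spider):

    name = 'listings'
    start_urls = ['https://www.example.com/listings']  # hypothetical

    def parse(self, response):
        # one selector per advert on the listings page
        adverts = response.xpath('//div[contains(@class, "listing-item")]')
        for advert in adverts:
            yield {
                # the leading "." keeps each selector relative to this advert
                'title': advert.xpath('.//h2/a/text()').get(),
                'price': advert.xpath('.//span[@class="price"]/text()').get(),
            }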

parse

To go to the details page we use “yield”, but we also have to pass the variables that we have picked out on the main page. So we use ‘meta’ (or the newer version, “cb_kwargs”).

yield Request(absolute_url, callback=self.fetch_detail, meta={'link': link, 'logo_url': logo_url, 'lcompanyname': lcompanyname})

Using ‘meta’ allows us to pass variables to the next function – in this case it’s called “fetch_detail” – where they will be added to the rest of the variables collected and sent to the FEEDS export, which makes the output file.

There is also a newer, recommended way than “meta” to pass variables between functions in Scrapy: “cb_kwargs”.

Once you have all the data it is time to use “Yield” to send it to the FEEDS export.

The “FEEDS” setting lets you write output to your chosen file format/destination.

This is the format and destination that you have set for your output file.

*Note it can also be a database, rather than JSON or CSV file.
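For example, a minimal FEEDS setting might look like this (the filename and format are just examples – set them to whatever output you want):

# settings.py (or custom_settings in your spider)
FEEDS = {
    'output.csv': {'format': 'csv', 'overwrite': True},
}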

Putting it all together

See the fully working spider code here :

https://github.com/RGGH/Scrapy5/blob/master/yelpspider.py

You may wish to run all of your code from within the script, in which case you can do this:

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(YelpSpider)
    process.start()

# Also, you will need to add this at the start of the script:
from scrapy.crawler import CrawlerProcess

Web Scraping – Summary

We have looked at the steps involved and some of the code you’ll find useful when using Scrapy.

Identifying the HTML to iterate through is the key.

Try and find the block of code that has all of the listings / adverts, and then narrow it down to one advert/listing. Once you have done that you can test your code in “scrapy shell” and start building your spider.

(Scrapy shell can be run from your CLI, independent of your spider code):

Screenshot: using XPath starts-with in scrapy shell – Scrapy shell is your friend!

Some of this content may be easier to relate to if you have studied and completed the official tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html

If you have any questions please get in touch and we’ll be pleased to help.


Scrapy tips

Passing variables between functions using meta and cb_kwargs

This will cover how to use callback with “meta” and the newer “cb_kwargs”

The highlighted sections show how “logo_url” goes from parse to fetch_detail, where “yield” then sends it to the FEED export (output CSV file).

When using ‘meta’ you need to call ‘response.meta.get’ in the next function to read the value back.

The newer, Scrapy recommended way is to use “cb_kwargs”

As you can see, there is no need to use any sort of “get” method in ‘fetch_detail’, so it’s simpler to use now, albeit with a slightly longer, less memorable name!
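Here is a minimal sketch of both approaches, passing a hypothetical “logo_url” from parse to fetch_detail (the URL and selectors are placeholders):

import scrapy


class TipsSpider(scrapy.Spider):

    name = 'tips'
    start_urls = ['https://www.example.com/listings']  # hypothetical

    def parse(self, response):
        for ad in response.xpath('//div[@class="advert"]'):
            logo_url = ad.xpath('.//img/@src').get()
            detail = ad.xpath('.//a/@href').get()

            # older style: pass via meta
            # yield response.follow(detail, callback=self.fetch_detail_meta,
            #                       meta={'logo_url': logo_url})

            # newer, recommended style: pass via cb_kwargs
            yield response.follow(detail, callback=self.fetch_detail,
                                  cb_kwargs={'logo_url': logo_url})

    def fetch_detail_meta(self, response):
        # with meta, read the value back off the response
        logo_url = response.meta.get('logo_url')
        yield {'logo_url': logo_url, 'title': response.css('title::text').get()}

    def fetch_detail(self, response, logo_url):
        # with cb_kwargs, the value arrives as a normal keyword argument
        yield {'logo_url': logo_url, 'title': response.css('title::text').get()}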

Watch the YouTube video on cb_kwargs