
Extract links with Scrapy

Using Scrapy’s LinkExtractor class you can get the links from every page that you desire.

Link extraction can be achieved very quickly with Scrapy and Python.

https://www.programcreek.com/python/example/106165/scrapy.linkextractors.LinkExtractor


https://github.com/scrapy/scrapy/blob/2.5/docs/topics/link-extractors.rst

https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py


https://w3lib.readthedocs.io/en/latest/_modules/w3lib/url.html

What are Link Extractors?

    Link Extractors are objects used for extracting links from web pages, working on scrapy.http.Response objects.
    “A link extractor is an object that extracts links from responses.” Although Scrapy has a built-in extractor, imported with from scrapy.linkextractors import LinkExtractor, you can customise your own link extractor to suit your needs by implementing a simple interface.
    The Scrapy link extractor makes use of w3lib.url
    Have a look at the source code for w3lib.url : https://w3lib.readthedocs.io/en/latest/_modules/w3lib/url.html

# -*- coding: utf-8 -*-

#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#|r|e|d|a|n|d|g|r|e|e|n|.|c|o|.|u|k|
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

import os

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor


class Ebayspider(Spider):

    name = 'ebayspider'
    allowed_domains = ['ebay.co.uk']
    start_urls = ['https://www.ebay.co.uk/deals']

    # Remove any output file left over from a previous run
    try:
        os.remove('ebay2.txt')
    except OSError:
        pass

    custom_settings = {
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_DEBUG': True,
        'DOWNLOAD_DELAY': 1
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Only extract links matching this URL pattern
        self.link_extractor = LinkExtractor(
            allow="https://www.ebay.co.uk/e/fashion/up-to-50-off-superdry",
            unique=True)

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # Append each extracted link to the output file
            with open('ebay2.txt', 'a+') as f:
                f.write(f"\n{str(link)}")

            # Follow the extracted link and parse that page too
            yield response.follow(url=link, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Ebayspider)
    process.start()

Summary

The above code gets all of the hrefs very quickly and gives you the flexibility to omit or include very specific attributes.

Watch the video Extract Links | how to scrape website urls | Python + Scrapy Link Extractors