
PostgreSQL

PostgreSQL is a free, powerful SQL database that is frequently used with Python.

How to connect to postgres

sudo -i -u postgres

run psql

postgres@rag-laptop:~$ psql
psql (12.6 (Ubuntu 12.6-0ubuntu0.20.04.1))
Type "help" for help.

list databases

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 gis       | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | 
 postgres  | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | 
 suppliers | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | 
 template0 | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(5 rows)

connect to a database

postgres=# \c suppliers
You are now connected to database "suppliers" as user "postgres".

create a database

postgres=# CREATE DATABASE jml;
CREATE DATABASE
postgres=# \l
                                   List of databases
    Name     |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-------------+----------+----------+-------------+-------------+-----------------------
 gis         | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | 
 jml         | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 |

create a table

postgres=# \c jml
You are now connected to database "jml" as user "postgres".
jml=# CREATE TABLE PRODUCTS (id serial, product varchar(40), price money);
CREATE TABLE
jml=# 

insert into table

jml=# INSERT INTO PRODUCTS (product,price) VALUES('jug',29.99);
INSERT 0 1
jml=# INSERT INTO PRODUCTS (product,price) VALUES('pot',9.99) RETURNING *;
 id | product | price 
----+---------+-------
  2 | pot     | £9.99
(1 row)

INSERT 0 1

select from a table

jml=# SELECT product FROM products WHERE price < '29.99';
 product 
---------
 pot
(1 row)

store ip addresses in a table

jml=# CREATE TABLE inet_test (  
jml(#     address INET
jml(# );
CREATE TABLE

check the table was created

jml=# \d
               List of relations
 Schema |      Name       |   Type   |  Owner   
--------+-----------------+----------+----------
 public | inet_test       | table    | postgres
 public | products        | table    | postgres
 public | products_id_seq | sequence | postgres
(3 rows)

jml=# INSERT INTO inet_test (address) VALUES ('192.168.1.0/24'); 
INSERT 0 1
jml=# INSERT INTO inet_test (address) VALUES ('172.16.11.0/24') RETURNING *; 
    address     
----------------
 172.16.11.0/24
(1 row)

Use CIDR notation

jml=# CREATE TABLE cidr_test (  
jml(#     address CIDR
jml(# );

jml=# INSERT INTO cidr_test (address) VALUES ('192.168.10/24');  
INSERT 0 1
jml=# INSERT INTO cidr_test (address) VALUES ('192.168.10');  
INSERT 0 1
jml=# INSERT INTO cidr_test (address) VALUES ('192.168.100.128/25');  
INSERT 0 1
jml=# INSERT INTO cidr_test (address) VALUES ('192.168.100.128');
INSERT 0 1
jml=# select * from cidr_test;
      address       
--------------------
 192.168.10.0/24
 192.168.10.0/24
 192.168.100.128/25
 192.168.100.128/32
(4 rows)

Note how the cidr type fills in any missing octets and, when no netmask is given, assumes one just large enough to cover the octets you wrote (/32 for a full four-octet address).

Create postgres user for Python Script

Syntax:

CREATE USER user3 WITH
LOGIN
SUPERUSER
CREATEDB
CREATEROLE
INHERIT
NOREPLICATION
CONNECTION LIMIT -1
VALID UNTIL '2025-04-03T11:50:38+05:30'
PASSWORD 'password3';

Full commands to run to set up the user on a Raspberry Pi (aka 'The Server'):

pi@pi4:~$ sudo -i -u postgres
postgres@pi4:~$ psql
psql (11.11 (Raspbian 11.11-0+deb10u1))
Type "help" for help.
postgres=# CREATE USER user3 WITH
postgres-# LOGIN
postgres-# SUPERUSER
postgres-# CREATEDB
postgres-# CREATEROLE
postgres-# INHERIT
postgres-# NOREPLICATION
postgres-# CONNECTION LIMIT -1
postgres-# VALID UNTIL '2025-04-03T11:50:38+05:30'
postgres-# PASSWORD 'password3';

Check the new user has been added

\du


Install psycopg2 to connect to postgres using Python

Connect to postgres using Python: https://github.com/RGGH/SQL
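
Once psycopg2 is installed (pip install psycopg2-binary), connecting from Python only takes a few lines. A minimal sketch, reusing the jml database, products table and user3 login created above (the host is an assumption):

import psycopg2

# database, table and user come from the examples above;
# the host is an assumption
conn = psycopg2.connect(host="localhost",
                        dbname="jml",
                        user="user3",
                        password="password3")
cur = conn.cursor()
cur.execute("SELECT id, product, price FROM products;")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()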

Extract links with Scrapy

Using Scrapy’s LinkExtractor method you can get the links from every page that you desire.

Link extraction can be achieved very quickly with Scrapy and Python.

https://www.programcreek.com/python/example/106165/scrapy.linkextractors.LinkExtractor


https://github.com/scrapy/scrapy/blob/2.5/docs/topics/link-extractors.rst

https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py


https://w3lib.readthedocs.io/en/latest/_modules/w3lib/url.html

What are Link Extractors?

    Link Extractors are objects used for extracting links from web pages, working on scrapy.http.Response objects.
    “A link extractor is an object that extracts links from responses.” Though Scrapy has a built-in extractor, scrapy.linkextractors.LinkExtractor, you can customise your own link extractor based on your needs by implementing a simple interface.
    The Scrapy link extractor makes use of w3lib.url
    Have a look at the source code for w3lib.url: https://w3lib.readthedocs.io/en/latest/_modules/w3lib/url.html
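
As a quick illustration of what w3lib.url provides, its canonicalize_url function normalises URLs so that duplicates can be spotted (the URL below is made up):

from w3lib.url import canonicalize_url

# query arguments are sorted into a canonical order
print(canonicalize_url("https://www.example.com/do?b=2&a=1"))
# -> https://www.example.com/do?a=1&b=2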

# -*- coding: utf-8 -*-
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#|r|e|d|a|n|d|g|r|e|e|n|.|c|o|.|u|k|
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

import os

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor


class Ebayspider(Spider):

    name = 'ebayspider'
    allowed_domains = ['ebay.co.uk']
    start_urls = ['https://www.ebay.co.uk/deals']

    # start each run with a fresh output file
    try:
        os.remove('ebay2.txt')
    except OSError:
        pass

    custom_settings = {
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_DEBUG': True,
        'DOWNLOAD_DELAY': 1
    }

    def __init__(self):
        super().__init__()
        # only extract links matching the Superdry deals page, deduplicated
        self.link_extractor = LinkExtractor(
            allow="https://www.ebay.co.uk/e/fashion/up-to-50-off-superdry",
            unique=True)

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # record each extracted link, then follow it and repeat
            with open('ebay2.txt', 'a+') as f:
                f.write(f"\n{str(link)}")
            yield response.follow(url=link, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Ebayspider)
    process.start()

Summary

The above code gets all of the hrefs very quickly and gives you the flexibility to omit or include very specific attributes.

Watch the video: Extract Links | how to scrape website urls | Python + Scrapy Link Extractors


Read Scrapy ‘start_urls’ from csv file

How can the start_urls for scrapy be imported from csv?

Using a list comprehension and a csv file you can make Scrapy get specific URLs from a predefined list.

use the .strip() method to remove newline characters

Here you can see line.strip() performing the removal:

[line.strip() for line in file]

Demonstration of how to read a list of URLs from a CSV (and use in Scrapy)

with open('data.csv') as file:
    start_urls = [line.strip() for line in file]

use each entry in start_urls as the url for a request made by the start_requests method

def start_requests(self):
    for url in self.start_urls:
        yield Request(url=url, callback=self.parse)
Full example code – add your own selectors in the parse method!

Get the code on the Red and Green GitHub page https://github.com/RGGH/Scrapy18/blob/main/stackospider.py

This is also an answer to a question on stackoverflow:

https://stackoverflow.com/questions/67166565/how-can-the-start-urls-for-scrapy-be-imported-from-csv/67175549#67175549
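
Putting the two snippets together, a minimal self-contained spider might look like this (a sketch only - the spider in the linked repo is the full version, and data.csv is assumed to contain one URL per line):

import scrapy
from scrapy import Request

class StackoSpider(scrapy.Spider):
    name = "stackospider"

    def start_requests(self):
        # read the predefined list of URLs, stripping newline characters
        with open("data.csv") as file:
            start_urls = [line.strip() for line in file]
        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # add your own selectors here
        yield {"url": response.url}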


How to web scrape iframes with scrapy

Web scraping pages with iframes can be done with Scrapy if you use a separate URL to access the data inside the iframe.

You need to identify the URL of the page inside the iframe and then append that to your base URL to give the Scrapy spider a second URL to visit.

Watch the video here on YouTube

In the video you see how to extract the elements from within the iframe using Scrapy
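
The general shape in code is something like this (a sketch - the start URL, spider name and XPATH are assumptions; the repo below has a worked example):

import scrapy

class IframeSpider(scrapy.Spider):
    name = "iframespider"
    start_urls = ["https://example.com/page-with-iframe"]  # placeholder URL

    def parse(self, response):
        # the iframe's src attribute is the separate, second URL
        iframe_src = response.xpath("//iframe/@src").get()
        if iframe_src:
            yield response.follow(iframe_src, callback=self.parse_iframe)

    def parse_iframe(self, response):
        # the iframe's content is now an ordinary response
        yield {"url": response.url}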

View/fork/copy the code from our GitHub repo: https://github.com/RGGH/iframes

See also our article on scraping iframes using Selenium:

https://redandgreen.co.uk/web-scrape-iframes-how-to-accept-cookies-privacy-pop-up-when-web-scraping-with-python-and-selenium/

Let us know if you would like any more information or would like to discuss a project.

We can provide a quote and initial analysis within 1 working day.

For more details on iframes see: https://www.w3schools.com/html/html_iframe.asp


How to scrape iframes

If you are scraping a website, pop-up messages asking you to agree to accept cookies can prevent your scraper from continuing to the pages you want to scrape.

How can you get past these pop-ups with your scraper if they are in an iframe?

Using Selenium you need to switch to the iframe (which you can identify using browser tools / inspect in Chrome).

Here is a video showing the general concept:

Scrape a site with pop up “accept cookies” message

If you just want the basic idea…

driver.switch_to.frame(iframe)

“click on accept or ok”

Instead of “click on accept or ok” you need to identify the @id or @class name and use that in an XPATH so you can send the .click() command with Selenium.
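
For example (the button’s id here is hypothetical - inspect the page to find the real attribute):

driver.find_element_by_xpath("//button[@id='accept']").click()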

driver.switch_to.default_content()

Once past this you need to return to the default (or parent) content to carry on with the navigation towards the page you want to scrape.

from selenium import webdriver

# path_chrome is the path to your chromedriver executable
driver = webdriver.Chrome(executable_path=path_chrome)

# find the frame using id, title etc.
frames = driver.find_elements_by_xpath("//iframe[@title='iframe_to_get']")

# switch the webdriver object to the first matching iframe.
driver.switch_to.frame(frames[0])

See also: https://stackoverflow.com/questions/49825722/scraping-iframe-using-selenium#49825750