
Remove non-printable characters from text

import string

my_str = "htt😆, moon, boo, f🔥re "

def clean_words(my_str):
    # keep only the words made up entirely of printable (ASCII) characters
    for word in my_str.split():
        if all(char in string.printable for char in word):
            print(word)

clean_words(my_str)
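
As a variation, if you want to keep every word and strip just the non-printable characters themselves, a short sketch:

import string

my_str = "htt😆, moon, boo, f🔥re "

def strip_unprintable(text):
    # rebuild the string from the printable characters only
    return "".join(char for char in text if char in string.printable)

print(strip_unprintable(my_str))   # htt, moon, boo, fre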

map & lambda

Introduction

Using lambda can save you from having to write a separate, named function.

If you’ve not used ‘map’ before, we’ll show you how it can perform the same task as ‘apply’ in an example.

import pandas as pd
pd.set_option('display.max_rows', 10)
import numpy as np
reviews = pd.read_csv("winemag-data-130k-v2.csv",index_col=0)
reviews

next we’ll drop any rows containing NaNs (note that dropna() returns a new DataFrame rather than modifying reviews in place)

reviews.dropna()

now we can calculate the mean price (pandas skips NaN values by default)…

reviews.price.mean()

35.363389129985535

We can now use a lambda expression to run all the way down the price column and show whether each price is more or less than the mean:

reviews_price_mean = reviews.price.mean()
reviews.price.apply(lambda p: p - reviews_price_mean)

What does this do exactly?

lambda p takes the price value from each row as p

p - reviews_price_mean

We subtract the mean review price from the ‘p’ value to give us a positive or negative value compared to the mean price.

By applying it with apply we run it over every value in the column.

We can create a new column, reviews['price_diff'], and set that equal to the result of our lambda function.

0               NaN
1        -20.363389
2        -21.363389
3        -22.363389
4         29.636611
            ...    
129966    -7.363389
129967    39.636611
129968    -5.363389
129969    -3.363389
129970   -14.363389
Name: price, Length: 129971, dtype: float64

The result now shows each price as +/- the mean.
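
Creating the new column described above is then a one-liner (a sketch, using the price_diff name from the text):

reviews['price_diff'] = reviews.price.apply(lambda p: p - reviews_price_mean)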

Summary:

Using map gives the same results:

reviews.price.map(lambda p: p - reviews_price_mean)
For those interested, there is also a chart of the prices vs the points given by the reviewers.

Both of these ways allow you to apply a function without the need for a traditional Python ‘for’ loop.


Data Analysis With Pandas

If you want to learn about Data Analysis with Pandas and Python and you’re not familiar with Kaggle, check it out!

Time to read article : 5 mins

TLDR;

We show how to use idxmax and apply with Pandas

Introduction

Here we will look at some functions in Pandas which will help with ‘EDA’ – exploratory data analysis.

Using Kaggle is much like using a Jupyter Notebook.

Once you have signed in you can locate the Pandas tutorials and test your understanding by running through the exercises. If you get stuck there are hints, and also the full solution.

Pandas ‘idxmax’ example:

One such exercise is shown here, where you are asked:

Which wine is the “best bargain”? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

The hint tells us to use idxmax()

Here is an example of what idxmax does:

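A minimal illustration with made-up numbers:

import pandas as pd

s = pd.Series([2.5, 9.1, 4.0], index=['a', 'b', 'c'])
print(s.idxmax())   # prints 'b' - the index label of the largest value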
idxmax() returns the index label of the (first) maximum value, which we can then use with loc.

Solution:

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']

The Kaggle solution finds the index label of the row with the highest points-to-price ratio, then uses loc to return that row’s ‘title’.

In this next example we see how you can “apply” a function to each row in your dataframe:

Using a function on all rows

Pandas ‘apply’ example:

“Find number of dealers by area”

import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', 150) 
df = pd.read_csv("used_car.csv")
df.head(6)
    Name                   Address                                                                              Phone
0   PakWheels Karachi      Suit No : 303 Third Floor Tariq Centre Main Tariq Road                               3105703505
1   PakWheels Lahore       37 Commercial Zone, Liberty Market Lahore                                            614584545
2   PakWheels Islamabad    37 Commercial Zone, Liberty Market Lahore                                            614584545
3   Sam Automobiles        8 Banglore town,7/8 Block near awami markaz shahrah e faisal karachi                 3422804414
4   Marvel Motors          27-E, Ali Plaza, Fazal e Haq Road, Blue Area, Islamabad                              518358006
5   Merchants Automobiles  Plot 167 /C Shop#4 Parsa City luxuria PECHS Block 3 at Main khalid bin waleed road   2134552897


# create a function to assign a city based on address details

def area(row):
    address = str(row.Address).lower()
    if 'lahore' in address:
        return "Lahore"
    elif 'faisalabad' in address:    # note the spelling - 'faisalbad' would never match
        return "Faisalabad"
    elif 'karachi' in address:
        return "Karachi"
    elif 'islamabad' in address:
        return "Islamabad"
    else:
        return "Other"

# apply the function to every row (axis=1)
ans = df.apply(area, axis=1)
ans
0           Other
1          Lahore
2          Lahore
3         Karachi
4       Islamabad
          ...    
2332        Other
2333        Other
2334        Other
2335        Other
2336        Other
Length: 2337, dtype: object
# Check how many times each city occurs in the dataframe

ans.value_counts()
Other        1433
Karachi       471
Lahore        331
Islamabad     102
dtype: int64

Next, what if we want to find out if there are other dealerships that use the same phone number?

Pandas ‘isin’ with ‘loc’

df.loc[df.Phone.isin(['614584545'])]
     Name                 Address                                     Phone
1    PakWheels Lahore     37 Commercial Zone, Liberty Market Lahore   614584545
2    PakWheels Islamabad  37 Commercial Zone, Liberty Market Lahore   614584545
437  Rehman Motors        Old Bahawalpur Road,Multan                  614584545
We’ve found 3 business names and 2 distinct addresses that share the same phone number.

Summary

Kaggle is free and even if you are not pursuing a career in data science you can still gain valuable Python skills from it.

See the Red and Green – Kaggle example on Kaggle


EBAY API – Python Code

If you have been looking for EBAY API – Python code then this article can help.

Rather than use web scraping techniques you can access live EBAY listings via the API with an SDK for Python.

Register for a free API key to make 5000 calls per day.

Time to read this article about EBAY API – Python Code : 5 mins

TLDR; Watch the ebay api python video on YouTube.

An API is like a bridge between your program and the site you want to get data from – it is already structured, unlike when you are web scraping.

Introduction:

Firstly, install the SDK with pip:

pip install ebaysdk

Add the imports in your code:

from ebaysdk.exception import ConnectionError
from ebaysdk.finding import Connection

Using the EBAY API with Python

Provided you have registered for your own API key with EBAY you are good to go.

Using the production site will give you access to real listings, or alternatively experiment in the sandbox version.

When you create an api object with the connection details and execute a search, it will return a response object.

We have used a variable, ‘st’, as our keyword search term, but you can use a string directly:
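
A minimal sketch, following the pattern in the ebaysdk README (the app id, site id and search term are placeholders):

from ebaysdk.exception import ConnectionError
from ebaysdk.finding import Connection

st = "drone"   # keyword search term

try:
    # appid is the key you registered for; siteid selects the eBay site
    api = Connection(appid="YOUR_APP_ID", config_file=None, siteid="EBAY-GB")
    response = api.execute("findItemsByKeywords", {"keywords": st})
except ConnectionError as e:
    print(e)
    print(e.response.dict())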

With the response object you now have all of the data you need.

Access it with response.reply.searchResult.item

Summary

Don’t use Selenium or Scrapy to access live data – there is no need.

The EBAY API documentation does not always cover Python, so we have shown a useful example which will get you started.

We’ve looked at how to use findItemsByKeywords; you can also find items by category, or use a combination of the two.

Register for an API key, read the docs, and watch the tutorial on how to use the EBAY API with Python

For my EBAY API Python code and more Python tutorials please visit our channel on YouTube

Get the code on GitHub

If you would like to discuss a project you have, please contact us


PostgreSQL

PostgreSQL is a free, powerful SQL database which is frequently used with Python.

How to connect to postgres

sudo -i -u postgres

run psql

postgres@rag-laptop:~$ psql
psql (12.6 (Ubuntu 12.6-0ubuntu0.20.04.1))
Type "help" for help.

list databases

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 gis       | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | 
 postgres  | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | 
 suppliers | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | 
 template0 | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(5 rows)

connect to a database

postgres=# \c suppliers
You are now connected to database "suppliers" as user "postgres".

create a database

postgres=# CREATE DATABASE jml;
CREATE DATABASE
postgres=# \l
                                   List of databases
    Name     |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-------------+----------+----------+-------------+-------------+-----------------------
 gis         | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | 
 jml         | postgres | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 |

create a table

postgres=# \c jml
You are now connected to database "jml" as user "postgres".
jml=# CREATE TABLE PRODUCTS (id serial, product varchar(40), price money);
CREATE TABLE
jml=# 

insert into table

jml=# INSERT INTO PRODUCTS (product,price) VALUES('jug',29.99);
INSERT 0 1
jml=# INSERT INTO PRODUCTS (product,price) VALUES('pot',9.99) RETURNING *;
 id | product | price 
----+---------+-------
  2 | pot     | £9.99
(1 row)

INSERT 0 1

select from a table

jml=# SELECT product FROM products WHERE price < '29.99';
 product 
---------
 pot
(1 row)

store ip addresses in a table

jml=# CREATE TABLE inet_test (  
jml(#     address INET
jml(# );
CREATE TABLE

check the table was created

jml=# \d
               List of relations
 Schema |      Name       |   Type   |  Owner   
--------+-----------------+----------+----------
 public | inet_test       | table    | postgres
 public | products        | table    | postgres
 public | products_id_seq | sequence | postgres
(3 rows)

jml=# INSERT INTO inet_test (address) VALUES ('192.168.1.0/24'); 
INSERT 0 1
jml=# INSERT INTO inet_test (address) VALUES ('172.16.11.0/24') RETURNING *; 
    address     
----------------
 172.16.11.0/24
(1 row)

Use CIDR notation

jml=# CREATE TABLE cidr_test (  
jml(#     address CIDR
jml(# );

jml=# INSERT INTO cidr_test (address) VALUES ('192.168.10/24');  
INSERT 0 1
jml=# INSERT INTO cidr_test (address) VALUES ('192.168.10');  
INSERT 0 1
jml=# INSERT INTO cidr_test (address) VALUES ('192.168.100.128/25');  
INSERT 0 1
jml=# INSERT INTO cidr_test (address) VALUES ('192.168.100.128');
INSERT 0 1
jml=# select * from cidr_test;
      address       
--------------------
 192.168.10.0/24
 192.168.10.0/24
 192.168.100.128/25
 192.168.100.128/32
(4 rows)

Create postgres user for Python Script

Syntax:

CREATE USER user3 WITH
LOGIN
SUPERUSER
CREATEDB
CREATEROLE
INHERIT
NOREPLICATION
CONNECTION LIMIT -1
VALID UNTIL '2025-04-03T11:50:38+05:30'
PASSWORD 'password3';

Full commands to run to set up the user on a Raspberry Pi (aka 'The Server'):

pi@pi4:~$ sudo -i -u postgres
postgres@pi4:~$ psql
psql (11.11 (Raspbian 11.11-0+deb10u1))
Type "help" for help.
postgres=# CREATE USER user3 WITH
postgres-# LOGIN
postgres-# SUPERUSER
postgres-# CREATEDB
postgres-# CREATEROLE
postgres-# INHERIT
postgres-# NOREPLICATION
postgres-# CONNECTION LIMIT -1
postgres-# VALID UNTIL '2025-04-03T11:50:38+05:30'
postgres-# PASSWORD 'password3';

Check the new user has been added

\du


Install psycopg2 to connect to postgres using Python

Connect to postgres using Python : https://github.com/RGGH/SQL
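
A minimal sketch (install with pip install psycopg2-binary; the connection details below are assumptions based on the jml database and user3 account created above, so adjust them to your own setup):

import psycopg2

# connection details are assumptions - match them to your own server
conn = psycopg2.connect(
    host="localhost",
    dbname="jml",
    user="user3",
    password="password3",
)

with conn.cursor() as cur:
    cur.execute("SELECT id, product, price FROM products;")
    for row in cur.fetchall():
        print(row)

conn.close()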

Extract links with Scrapy

Using Scrapy’s LinkExtractor you can get the links from every page that you desire.

Link extraction can be achieved very quickly with Scrapy and Python

https://www.programcreek.com/python/example/106165/scrapy.linkextractors.LinkExtractor

https://github.com/scrapy/scrapy/blob/2.5/docs/topics/link-extractors.rst

https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py

https://w3lib.readthedocs.io/en/latest/_modules/w3lib/url.html

What are Link Extractors?

Link Extractors are the objects used for extracting links from web pages using scrapy.http.Response objects: “a link extractor is an object that extracts links from responses.”

Scrapy has a built-in extractor (from scrapy.linkextractors import LinkExtractor), but you can write your own link extractor to suit your needs by implementing a simple interface.

The Scrapy link extractor makes use of w3lib.url; have a look at the source code for w3lib.url: https://w3lib.readthedocs.io/en/latest/_modules/w3lib/url.html

# -*- coding: utf-8 -*-
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#|r|e|d|a|n|d|g|r|e|e|n|.|c|o|.|u|k|
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

import os

import scrapy
from scrapy import Spider
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor


class Ebayspider(Spider):

    name = 'ebayspider'
    allowed_domains = ['ebay.co.uk']
    start_urls = ['https://www.ebay.co.uk/deals']

    # start each run with a fresh output file
    try:
        os.remove('ebay2.txt')
    except OSError:
        pass

    custom_settings = {
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_DEBUG': True,
        'DOWNLOAD_DELAY': 1
    }

    def __init__(self):
        super().__init__()
        # only extract links matching this pattern, with duplicates removed
        self.link_extractor = LinkExtractor(
            allow="https://www.ebay.co.uk/e/fashion/up-to-50-off-superdry",
            unique=True)

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # record each extracted link, then follow it
            with open('ebay2.txt', 'a+') as f:
                f.write(f"\n{str(link)}")

            yield response.follow(url=link, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Ebayspider)
    process.start()

Summary

The above code gets all of the hrefs very quickly and gives you the flexibility to omit or include very specific attributes.

Watch the video Extract Links | how to scrape website urls | Python + Scrapy Link Extractors


Read Scrapy ‘start_urls’ from csv file

How can the start_urls for scrapy be imported from csv?

Using a list comprehension and a csv file you can make Scrapy get specific URLs from a predefined list

use the .strip() method to remove newline characters

Here you can see the line.strip() is performing the removal:

[line.strip() for line in file]

Demonstration of how to read a list of URLs from a CSV (and use in Scrapy)

with open('data.csv') as file:
    start_urls = [line.strip() for line in file]

Use each URL in start_urls for the requests made by the start_requests method:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url=url, callback=self.parse)
Full example code – add your own selectors in the parse method!
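
A minimal sketch combining the pieces above (the spider name, the data.csv file name and the selector in parse are placeholders):

import scrapy
from scrapy import Request

class CSVSpider(scrapy.Spider):
    name = "csvspider"

    # read the start URLs from the csv file, stripping newline characters
    with open('data.csv') as file:
        start_urls = [line.strip() for line in file]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # add your own selectors here
        yield {"url": response.url, "title": response.css("title::text").get()}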

Get the code on the Red and Green GitHub page https://github.com/RGGH/Scrapy18/blob/main/stackospider.py

This is also an answer to a question on stackoverflow:

https://stackoverflow.com/questions/67166565/how-can-the-start-urls-for-scrapy-be-imported-from-csv/67175549#67175549


How to web scrape iframes with scrapy

Web scraping pages with iframes in them can be done with Scrapy if you use a separate URL to access the data inside the iframe.

You need to identify the name of the page of the iframe and then append that to your base url to provide a 2nd URL for the Scrapy spider to visit.
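
As a brief sketch of the idea (the start URL is a placeholder, and here the iframe's src attribute supplies the 2nd URL):

import scrapy

class IframeSpider(scrapy.Spider):
    name = "iframe_spider"
    start_urls = ["https://example.com/page-with-iframe"]   # placeholder URL

    def parse(self, response):
        # the iframe tag points at a separate page which can be requested directly
        iframe_url = response.xpath("//iframe/@src").get()
        if iframe_url:
            yield response.follow(iframe_url, callback=self.parse_iframe)

    def parse_iframe(self, response):
        # the iframe's contents now behave like a normal page - add your selectors here
        yield {"iframe_url": response.url}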

Watch the video here on YouTube

In the video you see how to extract the elements from within the iframe using Scrapy

View/fork/copy the code from our GitHub repo: https://github.com/RGGH/iframes

See also our article on scraping iframes using Selenium :

https://redandgreen.co.uk/web-scrape-iframes-how-to-accept-cookies-privacy-pop-up-when-web-scraping-with-python-and-selenium/

Let us know if you would like any more information or to discuss a project

We can provide a quote and initial analysis within 1 working day

For more details on iframes see : https://www.w3schools.com/html/html_iframe.asp


How to scrape iframes

If you are scraping a website, pop-up messages asking you to agree to accept cookies can prevent your scraper from continuing to the pages you want to scrape.

How can you get past this with your scraper if the pop-up is in an iframe?

Using Selenium you need to switch to the iframe (which you can identify using browser tools / inspect in Chrome).

Here is a video showing the general concept:

Scrape a site with pop up “accept cookies” message

If you just want the basic idea….

driver.switch_to.frame(iframe)

“click on accept or ok”

Instead of “click on accept or ok” you need to identify the @id or @class name and use that in an XPATH so you can send the .click() command with Selenium.

driver.switch_to.default_content()

Once past this you need to return to the default (or parent) content to carry on with the navigation towards the page you want to scrape.

driver = webdriver.Chrome(executable_path=path_chrome)

# find the frame using id, title etc.
frames = driver.find_elements_by_xpath("//iframe[@title='iframe_to_get']")

# switch the webdriver object to the first matching iframe
driver.switch_to.frame(frames[0])
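
Putting the whole flow together, a minimal sketch (the URL, iframe title and button XPATH are assumptions; use browser tools / inspect to find the real values):

from selenium import webdriver

path_chrome = "/path/to/chromedriver"    # adjust to your setup
driver = webdriver.Chrome(executable_path=path_chrome)
driver.get("https://example.com")        # placeholder URL

# switch into the cookie-consent iframe
frames = driver.find_elements_by_xpath("//iframe[@title='iframe_to_get']")
driver.switch_to.frame(frames[0])

# click the accept button - identify its @id or @class with inspect first
driver.find_element_by_xpath("//button[contains(text(),'Accept')]").click()

# return to the parent content and carry on navigating
driver.switch_to.default_content()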

See also : https://stackoverflow.com/questions/49825722/scraping-iframe-using-selenium#49825750


Comparing values in SQL against previously scraped data

If you have scraped data on more than one occasion and want to check if a value has changed in a column since the previous scrape you could use this:

select col1, col2
from TABLENAME
group by col1, col2
having count(col2) < 2

This will check whether the value of col2 has changed since the previous scrape: a (col1, col2) combination that appears only once can only exist in one of the two scrapes, meaning col2 changed.

Let’s put some real names into this:

select CLAIMNO, STATUSID
from PREV
group by CLAIMNO, STATUSID
having count(STATUSID) < 2

We can see from our results that, across the entire comparison, only 1 claim has had a status change.

We now know that since the last time we scraped the site, only one of our claims has been updated by “them”

This has been achieved by using “group by”: where we have many records in our table, we’ve picked out the cases where there are not 2 matching STATUSID values for a given CLAIMNO.
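
As an illustration with hypothetical data, suppose PREV holds two scrapes of three claims, and only claim 102's status changed between scrapes:

 PREV:                        query result:
 CLAIMNO | STATUSID           CLAIMNO | STATUSID
---------+----------         ---------+----------
 101     | OPEN               102     | OPEN
 101     | OPEN               102     | SETTLED
 102     | OPEN
 102     | SETTLED
 103     | CLOSED
 103     | CLOSED

The (101, OPEN) and (103, CLOSED) pairs each appear twice, so the HAVING clause filters them out; only claim 102's rows appear once each.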

Summary

We have used SQL to identify where a column value has changed compared to the previous time we checked by grouping 2 columns and checking if the count is less than 2.

If you would like more information please contact me via the contact page

See also : https://stackoverflow.com/questions/1786533/find-rows-that-have-the-same-value-on-a-column-in-mysql#1786568