When scraping a website you may be met with a pop-up message asking you to agree to accept cookies.
This can prevent your scraper from continuing to the pages you want to scrape.
How do you get past this with your scraper if the pop-up is inside an iframe?
Using Selenium you need to switch to the iframe (which you can identify using the browser's developer tools / Inspect in Chrome).
Here is a video showing the general concept:
Scrape a site with pop up “accept cookies” message
If you just want the basic idea:
driver.switch_to.frame(iframe)
# click on accept or ok
Instead of "click on accept or ok" you need to identify the @id or @class name of the button and use it in an XPath so you can send the .click() command with Selenium.
driver.switch_to.default_content()
Once past this you need to return to the default (or parent) content to carry on with the navigation towards the page you want to scrape.
driver = webdriver.Chrome(executable_path=path_chrome)
# find the frame using id, title etc.
frames = driver.find_elements_by_xpath("//iframe[@title='iframe_to_get']")
# switch the webdriver object to the first matching iframe
driver.switch_to.frame(frames[0])
# locate and click the accept button (this XPath is site-specific, inspect the page to find yours)
driver.find_element_by_xpath("//button[contains(text(),'Accept')]").click()
# return to the main document to carry on navigating
driver.switch_to.default_content()
Do you have a CSV file and a database to transfer the data into?
The data needs to go into your database and you want to automate this?
For instance: you have been web scraping and extracted publicly available data and now need to store it in a database table?
This guide will show you a basic example of how to get the data from a CSV, convert it to a pandas dataframe, and insert it into your chosen database table.
# redandgreen.co.uk : CSV to pandas to SQL example
Import a CSV and convert it into a pandas dataframe
import pandas as pd
df = pd.read_csv(r'claims.csv')
df
   pat_num  claim    status
0   111222    123  accepted
1   111333    234  accepted
2   222444    432  rejected
3   444555    431  rejected
4   666777    655   accpted

The example CSV as a pandas dataframe
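If you don't have a claims.csv to hand, the same dataframe can be built inline from a string (a minimal sketch; the rows below simply mirror the table above):

```python
import io

import pandas as pd

# inline stand-in for claims.csv (same rows as the example table)
csv_text = """pat_num,claim,status
111222,123,accepted
111333,234,accepted
222444,432,rejected
444555,431,rejected
666777,655,accpted
"""

# read_csv accepts any file-like object, so StringIO works in place of a path
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (5, 3)
```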
Connect to existing SQL database and check we can query some data
# Module Imports
import mariadb
import sys

# Connect to MariaDB Platform
try:
conn = mariadb.connect(
user="user2",
password="password2",
host="192.168.1.9",
port=3306,
database="newz"
)
except mariadb.Error as e:
print(f"Error connecting to MariaDB Platform: {e}")
sys.exit(1)
# Get Cursor
cur = conn.cursor()
# Select the first 8 results as a check
cur.execute(
"SELECT title,story FROM tnewz WHERE publication=? LIMIT 8", ("independent.co.uk",)
)
Title: We need to get back to normality, and fast – I appear to have fallen in love with my robot vacuum cleaner, Story: It is absurdly easy to anthropomorphise this creature, which spins around picking up fluff – especially in the absence of a puppy or a baby in the house, writes
Title: ‘This ends now’: Biden demands Trump make national address to end Capitol siege, Story: President-elect condemns Capitol violence: 'This is not dissent. It's disorder. It's chaos. It borders on sedition’
Title: ‘We love you, you’re all very special’: Trump tells mob to ‘go home’ but repeats false claims about election, Story: 'I know your pain. I know you're hurt’
Title: Senate evacuates after sudden recess while certifying Biden’s win amid reports of shots fired on Capitol Hill, Story: Congress calls surprise recess as protesters storm Capitol building
Title: Capitol Hill riots: Dramatic videos show Trump supporters storming Capitol building, Story: Videos posted to social media from reporters and lawmakers depicted a chaotic, terrifying scene of pro-Trump rioters breaking into the Capitol building
Title: Capitol riots: The Simpsons eerily predicted incident in 1996, Story: Scene in question shows characters running up Capitol steps, firing guns
Title: Prince William tells children of NHS staff sacrifices ‘every day’ during pandemic, Story: 'We're making sure the children understand all of the sacrifices that all of you are making,' says Duke of Cambridge
Title: Sriwijaya Air flight SJ182: Boeing 737 loses contact in Indonesia, Story: The aircraft is a ‘classic’ 737 — two generations older than the Boeing 737 Max
Connect to new database and prepare to insert dataframe
# Connect to MariaDB Platform
try:
conn = mariadb.connect(
user="user3",
password="redandgreen",
host="192.168.1.9",
port=3306,
database="exampledb"
)
except mariadb.Error as e:
print(f"Error connecting to MariaDB Platform: {e}")
sys.exit(1)
# Get Cursor
cursor = conn.cursor()
Create new table (do this just once)
# Create Table
cursor.execute('CREATE TABLE IF NOT EXISTS patients3 (id INT AUTO_INCREMENT PRIMARY KEY, pat_num varchar(50), claim varchar(50), status varchar(50))')
for row in df.itertuples():
cursor.execute("""
INSERT INTO patients3
(pat_num, claim, status)
VALUES
(%s,%s,%s)
""",
(row.pat_num,
row.claim,
row.status,
)
)
conn.commit()
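The itertuples() insert loop above is not specific to MariaDB. Here is the same pattern run against an in-memory SQLite database (a sketch you can test locally without a database server; note that SQLite uses ? placeholders where the mariadb connector example uses %s, and its auto-increment syntax differs slightly):

```python
import sqlite3

import pandas as pd

# a small stand-in dataframe with the same columns as the claims example
df = pd.DataFrame({
    "pat_num": ["111222", "111333"],
    "claim": ["123", "234"],
    "status": ["accepted", "accepted"],
})

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS patients3 "
    "(id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "pat_num varchar(50), claim varchar(50), status varchar(50))"
)
# insert the dataframe row by row, exactly as in the MariaDB version
for row in df.itertuples():
    cursor.execute(
        "INSERT INTO patients3 (pat_num, claim, status) VALUES (?,?,?)",
        (row.pat_num, row.claim, row.status),
    )
conn.commit()
print(cursor.execute("SELECT COUNT(*) FROM patients3").fetchone()[0])  # 2
```

Row-by-row execute is fine for small files; for large CSVs, executemany with a list of tuples reduces round trips.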
Check your database table for new data
If you have access to your database server then you can create a database and user with these commands:
## Useful mariadb commands:
# sudo mysql -u root -p
# CREATE DATABASE exampledb;
# CREATE USER 'user3'@'localhost' IDENTIFIED BY 'redandgreen';
# DROP USER IF EXISTS 'user3'@'%';
# CREATE USER 'user3'@'%' IDENTIFIED BY 'redandgreen';
# GRANT ALL PRIVILEGES ON exampledb.* TO 'user3'@'%';
# FLUSH PRIVILEGES;
# The % allows remote connections
Query the SQL database and print the Patient Numbers
cur = conn.cursor()
cur.execute(
"SELECT pat_num FROM patients3"
)
# Print results from the SQL query (each cursor row is a 1-tuple, so unpack it)
for (pat_num,) in cur:
    print(f"PatientNumber: {pat_num}")
# make a new list "ls_expose" and add hrefs to it if they contain 'expose' in the url
ls_expose = []
for a in soup.select('a[href]'):
    if 'expose' in a['href']:
        ls_expose.append(a['href'])
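A self-contained illustration of that filter, using a made-up snippet of HTML (the urls here are invented for the example):

```python
from bs4 import BeautifulSoup

# hypothetical page fragment: two listing links and one unrelated link
html = """
<a href="/expose/12345">flat 1</a>
<a href="/contact">contact</a>
<a href="/expose/67890">flat 2</a>
"""
soup = BeautifulSoup(html, "html.parser")

# keep only the hrefs whose url contains 'expose'
ls_expose = []
for a in soup.select('a[href]'):
    if 'expose' in a['href']:
        ls_expose.append(a['href'])

print(ls_expose)  # ['/expose/12345', '/expose/67890']
```

Selecting 'a[href]' rather than plain 'a' skips anchors with no href attribute, which would otherwise raise a KeyError.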
Get all dropdown values from html
(using page saved locally, for testing)
from bs4 import BeautifulSoup
import requests
my_page = "Wohnung mieten im Umkreis von 50 km von Mannheim - Immob.html"
soup = BeautifulSoup(open(my_page),"lxml")
soup = soup.find('div',class_ = "select-input-wrapper")
items = soup.select('option[value]')
values = [item.get('value') for item in items]
textValues = [item.text for item in items]
print(textValues)
soup.select('option[value]')
There are 127 option values in the html, and we need to be able to get them with our code. The output from the script shows the values as text; they need to be converted to integers if you are going to use them in a for loop. See below for the code to do this, using a list comprehension.
from bs4 import BeautifulSoup
import requests
my_page = "Wohnung mieten im Umkreis von 50 km von Mannheim - Immob.html"
soup = BeautifulSoup(open(my_page),"lxml")
soup = soup.find('div',class_ = "select-input-wrapper")
items = soup.select('option[value]')
values = [item.get('value') for item in items]
x = [int(item) for item in values]
print(x)
Note how I use soup.find to narrow down the search, otherwise I would have found other dropdowns as well.
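Once the option values are integers you can loop over them, for example to build one search url per dropdown value. A sketch, assuming a made-up url pattern (this is not the real Immobilienscout query string):

```python
# values as they come out of the html: strings
values = ['10', '20', '50']

# convert to integers with a list comprehension
radii = [int(v) for v in values]

# hypothetical url pattern, one request per radius value
base = "https://example.com/search?radius={}"
urls = [base.format(r) for r in radii]
print(urls)
```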