The most important thing when starting to scrape is to establish what you want in your final output.
Here are the data points we want to extract :
Now we can write our parse method, and once done, we can finally add on the “next page” code.
The Amazon pages have white space around the Author name(s) so you this will be an example of when to use ‘normalize-space’.
We also had to make sure we weren’t splitting the partially parsed response too soon, and removing the 2nd Author, (if there was one).
Some of the results are perhaps not what you want, but this is due to Amazon returning products which it thinks are in some way related to your search criteria!
By using pipelines in Scrapy, along with the process_item method we were able to filter much of what was irrelevant. The great thing about web scraping to an SQL database is the flexibility it offers once you have the data. SQL, Pandas, Matplotlib and Python are a powerful combination…
The challenge was to scrape a site where the class names for each element were out of order / randomised. So the only way to get the data in the correct sequence was to sort through the CSS styles by left, top, and match to the class names in the divs…
The names were meaningless and there was no way of establishing an order, a quick check in developer tools shows the highlighter rectangle jumps all over the page in no particular order when traversing the source code.
So we began the code by identifying the CSS and then parsing it:
There was NO possibility of sequentially looping through the divs to extract ordered data.
After watching the intro to the challenge set on YouTube, and a handy hint from CMK we got to work.
From this article you will learn as much about coding with Python as you will about web scraping in specific. Note I used requests_html, as it provided me with the option to use XPATH.
BeautifulSoup could also have been used.
Python methods used:
x = round(1466,-2)
print(x) # 1500
x = round(1526,-2)
print(x) # 1500
x = round(1526,-1)
print(x) # 1530
I needed to use round as the only way to identify a “Column” of text was to cross reference the <style> “left px” and “top px” with the class names used inside the divs. Round was required as there was on occasion a 2 or 3 px variation in the “left” value.
Item getterfrom operator import itemgetter
I had to sort the values from the parsed css style in order of “left” to identify 3 columns of data, and then “top” to sort the contents of the 3 columns A, B, C.
zipped = zip(ls_desc,ls_sellp,ls_suggp)
Zipping the 3 lists – read on to find out the issue with doing this…
rows = list(zipped)
So to get the data from the 3 columns, A (ls_desc), B (ls_sellp), and C (ls_suggp) I used ZIP, but…….the were/are 2 values missing in column C!!
A had 77 values,
B had 77 values
C had 75 !
Not only was there no text in 2 of the blanks in column C, there was also NO text or even any CSS.
We only identified this as an issue after running the code – visually the page looked consistent, alas the last part of column “C” becomes out of sequence with the data in colmumn A and B which are both correct.
Go back and check if column “C” has a value at the same top px value as Column “B”. If no value then insert an “x” or spacer into Column C at that top px value.
This will need to be rewritten using dictionaries, and create one dictionary per ROW rather than my initial idea of 1 list per column and zipping them!
special thanks to “code monkey king” for the idea/challenge!