The following article shows you how to use Scrapy to log in to sites that have username and password authentication.
The important thing to remember is that there may be additional data that needs to be sent to the login page, beyond just the username and password, for example hidden form fields such as a CSRF token…
We are going to cover:
🔴 Identifying what data to POST
🔴 The Scrapy method to login
🔴 How to progress after logging in
Once you have confirmed that the Spider has logged in, you can proceed to scrape what you need. Remember, you are looking for “Logout” in the response, because that means you are logged in!
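Here is a minimal sketch of such a spider. The login URL, field names, and credentials are hypothetical placeholders; FormRequest.from_response picks up any hidden form fields (such as a CSRF token) from the page for you:

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login_demo"
    # Hypothetical login page, for illustration only.
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # from_response pre-fills the hidden inputs in the form,
        # so we only need to supply the username and password.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # "Logout" appearing in the page means the login worked.
        if "Logout" in response.text:
            self.logger.info("Login successful")
            # ...proceed to scrape the protected pages here...
        else:
            self.logger.error("Login failed")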
Conclusion: we have looked at how to use FormRequest to POST form data in a Scrapy spider, and we have extracted data from a site that is protected by a login form.
See the video on YouTube:
Scrapy Form Login | How to log in to sites using FormRequest | Web Scraping Tutorial
The challenge was to scrape a site where the class names for each element were out of order / randomised. The only way to get the data in the correct sequence was to sort through the CSS styles by left and top, and match them to the class names in the divs…
The px values varied slightly: see the 710 and 711 above.
The class names were meaningless and there was no way of establishing an order; a quick check in developer tools shows the highlighter rectangle jumping all over the page in no particular order when traversing the source code.
So we began the code by identifying the CSS and then parsing it.
The idea is to extract the CSS style data, parse it for the left and top px values, and sort through those to match the out-of-sequence div/class names in the body of the page.
There was NO possibility of sequentially looping through the divs to extract ordered data.
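Here is a minimal sketch of that first parsing step (a simple regex stands in for the XPath extraction, and the class names and px values are made up for illustration):

import re

# Made-up sample of the kind of <style> rules found on the page:
css = """
.abqx { position: absolute; left: 710px; top: 96px; }
.kzpt { position: absolute; left: 711px; top: 128px; }
.wmno { position: absolute; left: 1466px; top: 96px; }
"""

# One dict per rule: the class name plus its left/top pixel position.
pattern = re.compile(r"\.(\w+)\s*\{[^}]*left:\s*(\d+)px;[^}]*top:\s*(\d+)px;")
ls = [
    {"name": m.group(1), "left": int(m.group(2)), "top": int(m.group(3))}
    for m in pattern.finditer(css)
]
print(ls)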
After watching the intro to the challenge set on YouTube, and with a handy hint from CMK, we got to work.
From this article you will learn as much about coding with Python as you will about web scraping specifically. Note that I used requests_html, as it provided me with the option to use XPath.
BeautifulSoup could also have been used.
Identifying columns and rows based on “left px” and “top px”
Python methods used:
round
x = round(1466,-2)
print(x) # 1500
x = round(1526,-2)
print(x) # 1500
x = round(1526,-1)
print(x) # 1530
I needed to use round as the only way to identify a “column” of text was to cross-reference the <style> “left px” and “top px” values with the class names used inside the divs. round was required because there was, on occasion, a 2 or 3 px variation in the “left” value.
itemgetter
from operator import itemgetter
ls.sort(key=itemgetter('left','top'))
I had to sort the values from the parsed CSS styles in order of “left” to identify the 3 columns of data, and then by “top” to sort the contents of the 3 columns A, B, and C.
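Putting round and itemgetter together on some made-up entries (hypothetical class names and positions), the column sort works like this:

from operator import itemgetter

# Hypothetical parsed style entries: class name plus position.
ls = [
    {"name": "kzpt", "left": 711, "top": 128},
    {"name": "wmno", "left": 1466, "top": 96},
    {"name": "abqx", "left": 710, "top": 96},
]

# Round "left" to the nearest 100 so 710 and 711 land in the same
# column, then sort by column first and vertical position second.
for d in ls:
    d["left"] = round(d["left"], -2)

ls.sort(key=itemgetter("left", "top"))
print([d["name"] for d in ls])  # ['abqx', 'kzpt', 'wmno']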
zip
zipped = zip(ls_desc,ls_sellp,ls_suggp)
Zipping the 3 lists – read on to find out the issue with doing this…
rows = list(zipped)
So to get the data from the 3 columns, A (ls_desc), B (ls_sellp), and C (ls_suggp), I used zip, but… there were 2 values missing in column C!
A had 77 values,
B had 77 values,
C had only 75!
Not only was there no text in 2 of the blanks in column C, there was no CSS for them either.
We only identified this as an issue after running the code – visually the page looked consistent, but the last part of column C becomes out of sequence with the data in columns A and B, which are both correct.
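To see why, here is a tiny illustration with made-up values. zip stops at the shortest list, so every value after a gap gets paired with the wrong row:

ls_desc = ["apple", "banana", "cherry"]
ls_sellp = ["1.00", "2.00", "3.00"]
ls_suggp = ["1.50", "3.50"]  # the middle value is missing!

rows = list(zip(ls_desc, ls_sellp, ls_suggp))
print(rows)
# [('apple', '1.00', '1.50'), ('banana', '2.00', '3.50')]
# 'banana' gets cherry's suggested price, and 'cherry' is dropped:
# zip silently stops at the shortest list.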
Solution?
Go back and check whether column C has a value at the same “top” px value as column B. If there is no value, insert an “x” or spacer into column C at that “top” px value.
This will need to be rewritten using dictionaries, creating one dictionary per ROW rather than my initial idea of 1 list per column and zipping them!
Zipping the 3 lists nearly works… but 2 missing values in “Suggested Price” mean that the data in column C become out of sync.
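Here is a minimal sketch of that fix, using hypothetical values keyed by rounded “top” px. One dictionary is built per ROW, with a spacer wherever column C has no entry:

# Hypothetical per-column data keyed by rounded "top" px value.
col_a = {96: "Widget", 128: "Gadget", 160: "Gizmo"}  # descriptions
col_b = {96: "9.99", 128: "4.50", 160: "7.25"}       # selling prices
col_c = {96: "12.99", 160: "8.99"}                   # no entry at top=128!

rows = []
for top in sorted(col_a):
    rows.append({
        "description": col_a[top],
        "selling_price": col_b[top],
        "suggested_price": col_c.get(top, "x"),  # "x" marks the gap
    })

for row in rows:
    print(row)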
Special thanks to “Code Monkey King” for the idea/challenge!
Rewrite the section for column B to check for the presence of text in column C on the same row…
1 missing value halfway down column C means more error checking is required! – If you just want the “Selling Price” and “Description”, then this code is 100% successful! 👍
See the solution, and the error, in the YouTube video.
Conclusion:
For more robust web scraping where CSS elements may be missing, use dictionaries and enumerate each row to check. It’s the old case of “you don’t know what you don’t know”.
If you can ensure each list has the same number of items, then zip is fine to use.
Well, if you are web scraping using Python, with Scrapy for instance, you may need to extract reviews or comments that are loaded from JavaScript. This means you cannot use your CSS or XPath selectors as you can with regular HTML.
Parse
Instead, in your browser, check whether you can parse the code: press Ctrl+F, search for “json”, and track down some JSON in the form of a Python dictionary. You ‘just’ need to isolate it.
Use view-source to find occurrences of “JSON” in your page.
The response is not nice, but you can gradually shrink it down, in Scrapy shell or python shell…
Figure 1 – The response
Split, strip, replace
From within Scrapy, or your own Python code, you can split, strip, and replace with the built-in Python string methods until you have just the dictionary-like JSON string that you can pass to json.loads.
x = (
    response.text.split('JSON.parse')[3]
    .replace("\\u0022", "\"")
    .replace("\\u2019m", "'")
    .lstrip("(")
    .split(" ")[0]
    .strip()
    .replace("\"", "", 1)
    .replace("\");", "")
)
Master replace, strip, and split and you won’t need regular expressions!
With the response.text now reduced to a JSON-friendly string, you can do this:
import json

q = json.loads(x)
comment = q['doctor']['sample_rating_comment']
comment = comment.replace("\u2019", "'")
print(comment)
The key things to remember when parsing the response text are to use the index to pick out the section you want, and to make use of the backslash “\” to escape characters when you are working with quotes and actual backslashes in the text you’re parsing.
Figure 2 – The parsed response
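To make the whole flow concrete, here is a small self-contained sketch; the snippet of page source is made up to stand in for response.text:

import json

# Made-up page source: JSON handed to JSON.parse inside a <script>,
# with \u0022 escape sequences in place of the quotes.
text = 'var data = JSON.parse("{\\u0022doctor\\u0022: {\\u0022sample_rating_comment\\u0022: \\u0022Great doctor!\\u0022}}");'

# Isolate the argument to JSON.parse, then turn the \u0022 escapes
# back into real quotes so json.loads sees valid JSON.
raw = text.split('JSON.parse("')[1].split('");')[0]
raw = raw.replace("\\u0022", '"')

q = json.loads(raw)
print(q["doctor"]["sample_rating_comment"])  # Great doctor!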
Conclusion
Rendering to HTML using Splash or Selenium, or using regular expressions, is not always essential. Hope this helps illustrate how you can extract values FROM a Python dictionary FROM JSON FROM JavaScript!
You may see a mass of text on your screen to begin with, but persevere and you can arrive at the dictionary contained within…
Demo of getting a Python Dictionary from JSON from JavaScript