Baby steps to learning browser automation and web scraping 4

March 23, 2018


Find elements by ID, class name, and name

Hello there! In our previous lesson, we looked at how to make a simple Google search with an automation script, and at the end of the lesson we came across the method find_element_by_name().

We can also find HTML elements based on other attributes, perhaps any attribute at all, as we will discover in our next lesson.

Here are some functions and what they do:

find_element_by_id(id): Find an HTML element by its ID
find_element_by_name(name): Find an HTML element by its name attribute
find_element_by_class_name(class_name): Find an HTML element by its class
find_element_by_link_text(text): Find a link element by its visible text
find_element_by_css_selector(css_selector): Find an element using a CSS selector string
find_element_by_xpath(xpath): Find an element using its XPath (more on this later)
find_elements_by_*(): Replace * with any of the endings above (id, name, xpath, etc.). This returns a list containing all matching elements

There are other methods, but these should get you started. You may never use some of these methods anyway.

Calling any of the above methods (except find_elements_by_*) will throw a NoSuchElementException if no matching element is found. Calling the same method with an 's' attached to 'element', however, will return an empty list instead. Know when to use which to your advantage.
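For example, here is a quick sketch of the difference, using a made-up ID ('no-such-id') that is presumably not on the page:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('http://quotes.toscrape.com/')

#find_element_* throws when nothing matches
try:
    driver.find_element_by_id('no-such-id')
except NoSuchElementException:
    print('No element with that ID')

#find_elements_* simply returns an empty list when nothing matches
matches = driver.find_elements_by_id('no-such-id')
print(len(matches))  #prints 0

driver.quit()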

Practice makes perfect

Let’s look at how to make use of some of these functions. Our URL for today is http://quotes.toscrape.com/. Visit it and inspect some of the elements: the quotes, the tags, the quote divs, etc.

What attributes do you see? What are the values of those attributes? Are you ready to code? Talk is cheap, show me the code!

from selenium import webdriver

URL = 'http://quotes.toscrape.com/'
driver = webdriver.Chrome()
driver.get(URL)

With this, the browser opens the URL.

Let’s get all the quote divs into a list. Each quote is in a div that has a class called ‘quote’, so we can use the class name to get them.

#get quote divs
quote_divs = driver.find_elements_by_class_name('quote')

You can print the returned list to see what gibberish it is. Each element in it is a WebElement object.

Now, let’s get each quote’s text, its tags (name and link), and its author (name and link to the about page).

We will have one big list with each quote as a dictionary, having all the needed information in it. Sounds fun?

 
from selenium import webdriver

URL = 'http://quotes.toscrape.com/'
driver = webdriver.Chrome()
driver.get(URL)

big_list = []

#get quote divs
quote_divs = driver.find_elements_by_class_name('quote')

for quote in quote_divs:
    info = {}
    info['text'] = quote.find_element_by_class_name('text').text

    #get author details
    author = {}
    author['name'] = quote.find_element_by_class_name('author').text
    author['about_link'] = quote.find_element_by_link_text('(about)').get_attribute('href')
    #add to info
    info['author'] = author

    #get tags details
    tags = quote.find_elements_by_class_name('tag')
    tags_info = []
    for tag in tags:
        t = {}
        t['name'] = tag.text
        t['link'] = tag.get_attribute('href')
        tags_info.append(t)
    #add to info
    info['tags'] = tags_info

    print("THIS IS THE FULL INFO: {}".format(info))
    big_list.append(info)

Yeah, I know I said baby steps and now I’m running… but I’m not. Relax, take a deep breath, and let’s take it line by line, okay?

So what we have now is a for loop that takes each quote div, picks out the info needed, and saves it into our big-ol-list called big_list. Each entry in big_list is a dictionary.
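To make the structure concrete, each entry in big_list ends up looking roughly like this (the values shown are just placeholders, not real output from the site):

{
    'text': '"...the quote text..."',
    'author': {'name': '...author name...', 'about_link': 'http://quotes.toscrape.com/author/...'},
    'tags': [{'name': '...tag name...', 'link': 'http://quotes.toscrape.com/tag/...'}]
}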

As you might have noticed, we can find an element from a WebElement object by calling these methods on that object, the same way we call them on the driver.

Intro to some new methods

text: This is not a method, though; it’s an attribute of an element. It returns all the text enclosed in that tag.

get_attribute(attr): This returns the value of whichever attribute you want. You can get the class by passing ‘class’, the ID by passing ‘id’, or any other attribute. We use the same method to get the href attribute, which is the link.
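Here’s a small sketch of both on the first quote div (same page and same class names we inspected above):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://quotes.toscrape.com/')

first_quote = driver.find_element_by_class_name('quote')
print(first_quote.text)  #all visible text inside the quote div

about_link = first_quote.find_element_by_link_text('(about)')
print(about_link.get_attribute('href'))  #value of the href attribute, i.e. the link

driver.quit()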

Let’s box on

We get the quote text very easily using the text attribute on the element that has the class ‘text’ inside the quote div.

We get the author too, but here the author isn’t just a string. It’s a dictionary that contains the author’s name and about link. We make use of find_element_by_link_text here.

When it comes to tags, we have a list of them, hence the for loop. We find all tag elements using their class name ‘tag’. We then get the name and link for each tag into a dictionary, which we put in the tags list and add to our info dictionary.

After everything, we append it to our big_list.

Next page

What if we want to go to the next page?

I will leave that for you to do but here are some ideas:

  1. You can use a while loop to increase the page number, build the URL for that page, and load it. Do this until an exception is thrown (see the sketch after this list).

  2. (A bit advanced) Before the scraping code, use a while loop to generate URLs by changing page numbers and request each one using requests, urllib3, or any library of your choice. Do this until an exception is thrown; that tells you the last page. Get that page number and use it for your loop.
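Here is a minimal sketch of idea 1. It assumes the site paginates with URLs of the form http://quotes.toscrape.com/page/N/ and stops as soon as a page has no quote div:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
page = 1
while True:
    #assumed URL pattern: /page/1/, /page/2/, ...
    driver.get('http://quotes.toscrape.com/page/{}/'.format(page))
    try:
        #throws NoSuchElementException once we go past the last page
        driver.find_element_by_class_name('quote')
    except NoSuchElementException:
        break
    print('Page {} has quotes'.format(page))
    page += 1
driver.quit()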

Experiment and try out other methods and stuff. Don’t miss our next lesson on XPath. It will be one of the most useful lessons for you. Stay creative!
