Web Scraping With Selenium: DIY or Buy?

Gabija Fatenaite

Jul 15, 2020 6 min read

In order to understand the fundamentals of web scraping, it’s important to learn how to leverage different frameworks and request libraries. Developing an understanding of the various HTTP methods (mainly GET and POST) also makes web scraping a lot easier.

For instance, Selenium is one of the better known and often used tools that help automate web browser interactions. By using it together with other technologies (e.g., BeautifulSoup), you can get a better grasp on web scraping basics.

How does Selenium work? It automates the actions written in your script by driving a browser to perform repetitive tasks such as clicking and scrolling. As described on Selenium’s official webpage, it is “primarily for automating web applications for testing purposes, but is certainly not limited to just that.”

In this guide on how to web scrape with Selenium, we will be using Python 3.x as our main language (it is not only the most common scraping language, but also the one we work with most closely).

Setting up Selenium 

Firstly, to download the Selenium package, execute the pip command in your terminal:

pip install selenium 

You will also need to install a Selenium driver, which enables Python to control the browser at the OS level. If you install it manually, the driver executable should be accessible via the PATH variable.

You can download the drivers for Firefox, Chrome, and Edge from here.
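If the driver executable is not on your PATH, you can also point Selenium at it directly when starting the browser. Here is a minimal sketch (the geckodriver path below is just a placeholder, not part of the original setup):

from selenium import webdriver

# Hypothetical location of a manually downloaded geckodriver
browser = webdriver.Firefox(executable_path='/path/to/geckodriver')
browser.quit()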

Quick starting Selenium

Let’s begin the automation by starting up your browser:

  • Open up a new browser window (in this instance, Firefox) 
  • Load the page of your choice (our provided URL)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://oxylabs.io/')

This will launch the browser in headful mode. To run the browser in headless mode instead (for example, on a server), the script should look something like this:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

# DRIVER_PATH should point to your geckodriver executable
driver = webdriver.Firefox(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.oxylabs.io/")
print(driver.page_source)
driver.quit()

Data extraction with Selenium by locating elements

find_element 

Selenium offers a variety of functions to help locate elements on a page: 

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text (find element by using text value)
  • find_element_by_partial_link_text (find element by matching some part of a hyperlink text (anchor tag))
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector (find element by using a CSS selector, e.g., by id or class)

As an example, let’s try to locate the h1 tag on the oxylabs.io homepage with Selenium:

<html>
    <head>
        ... something
    </head>
    <body>
        <h1 class="someclass" id="greatID"> Partner Up With Proxy Experts</h1>
    </body>
</html>

h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_id('greatID')


You can also use the plural find_elements_by_* methods to return a list of elements, e.g.:

all_links = driver.find_elements_by_tag_name('a')

This way, you’ll get all anchor elements on the page.
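For instance, a minimal sketch of looping over that list and printing each link’s href attribute (building on the all_links variable above):

for link in all_links:
    print(link.get_attribute('href'))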

However, some elements are not easily accessible with an ID or a simple class. This is why you will need XPath.

XPath

XPath is a query language that helps locate a specific node in the DOM. XPath finds the node starting from the root element, either through an absolute path or by using a relative path, e.g.:

  • / : Selects a node from the root. /html/body/div[1] will find the first div
  • // : Selects nodes from anywhere in the document, no matter where they are. //form[1] will find the first form element
  • [@attributename='value']: a predicate. It looks for a specific node or a node with a specific value.

Example:

//input[@name='email'] will find the first input element with the name "email".

<html> 
 <body> 
   <div class = "content-login"> 
     <form id="loginForm"> 
         <div> 
            <input type="text" name="email" value="Email Address:"> 
            <input type="password" name="password"value="Password:"> 
         </div> 
        <button type="submit">Submit</button> 
     </form> 
   </div> 
 </body> 
</html>
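Using the markup above, a minimal sketch of locating those nodes with Selenium’s XPath locator could look like this:

# Find the first input whose name attribute equals "email"
email_input = driver.find_element_by_xpath("//input[@name='email']")
# Find the login form by its id through a predicate
login_form = driver.find_element_by_xpath("//form[@id='loginForm']")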

WebElement

WebElement in Selenium represents an HTML element. Here are the most commonly used actions: 

  • element.text (accessing the text of the element)
  • element.click() (clicking on the element) 
  • element.get_attribute('class') (accessing an attribute) 
  • element.send_keys('mypassword') (sending text to an input)
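Putting these actions together on the login form from the XPath example, a minimal sketch (the credentials below are placeholders) could look like this:

email = driver.find_element_by_name('email')
email.send_keys('user@example.com')  # type an email address into the input
password = driver.find_element_by_name('password')
password.send_keys('mypassword')     # type the password
driver.find_element_by_xpath("//button[@type='submit']").click()  # submit the form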

Solutions for slow website rendering

Some websites rely heavily on JavaScript to render content, and they can be tricky to deal with as they make many AJAX calls. There are a few ways to solve this:

  • time.sleep(ARBITRARY_TIME)
  • WebDriverWait()

Example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # Wait up to 10 seconds for the element to appear in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )
finally:
    driver.quit()

This tells Selenium to wait up to 10 seconds for the element to be present before continuing. To dig deeper into this topic, go ahead and check out the official Selenium documentation.

Selenium vs Puppeteer

The biggest reason for Selenium’s popularity and complexity is that it supports writing tests in multiple programming languages. This includes C#, Groovy, Java, Perl, PHP, Python, Ruby, Scala, and even JavaScript. It supports multiple browsers, including Chrome, Firefox, Edge, Internet Explorer, Opera, and Safari. 

However, for web scraping tasks, Selenium is perhaps more complex than it needs to be. Remember that Selenium’s real purpose is functional testing. For effective functional testing, it mimics what a human would do in a browser. Selenium thus needs three different components:

  • A driver for each browser
  • Installation of each browser
  • The package/library depending on the programming language used

Puppeteer, on the other hand, bundles Chromium with the puppeteer node package, so no separate browser or driver installation is needed. This makes setup simpler. It also supports Chrome if that is what you need.

On the other hand, multiple browser support is missing. Firefox support is limited: Google announced Puppeteer for Firefox, but it was soon deprecated, and as of writing this, Firefox support is experimental. So, to sum up, if you need a lightweight and fast headless browser for web scraping, Puppeteer would be the best choice. You can check our Puppeteer tutorial for more information.

Selenium vs scraping tools: Real-Time Crawler

Selenium is great if you want to learn web scraping. We recommend using it together with BeautifulSoup, as well as focusing on learning the HTTP protocol, the methods by which the server and browser exchange data, and how cookies and headers work.
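For instance, a minimal sketch of handing the page rendered by Selenium over to BeautifulSoup (assuming the driver from the earlier examples is still open and the bs4 package is installed):

from bs4 import BeautifulSoup

# Parse the fully rendered page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find('h1').text)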

However, if you’re seeking an easier method for web scraping, there are various tools to help you out with this process. Depending on the scale of your scraping project and targets, implementing a web scraping tool will save you a lot of time and resources.

At Oxylabs, we provide a tool called Real-Time Crawler. It has two main functionalities:

  • Data API – focuses on e-commerce and search engine websites and allows you to receive structured data in JSON
  • HTML Crawler API – allows you to carry out scraping projects for most websites and delivers results in HTML

Real-Time Crawler is also easy to integrate. Here’s an example for Python:

import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://stackoverflow.com/questions/tagged/python',
    'user_agent_type': 'desktop',
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# This will return the JSON response with results.
pprint(response.json())

More integration examples for other languages are available (shell, PHP, curl). 

The main benefits of Real-Time Crawler compared with Selenium are: 

  • All web scraping processes are automated
  • No need for extra coding
  • Easily scalable 
  • Guaranteed 100% success rate, as you pay only per successful request
  • Has a built-in proxy rotation tool

Conclusion

Selenium is a great tool for web scraping, especially when learning the basics. But, depending on your goals, it is sometimes easier to choose an already-built tool that does web scraping for you. Building your own scraper is a long and resource-intensive process that might not be worth the time and effort. 

To learn more about Real-Time Crawler and how to integrate it, check out the Real-Time Crawler quick start guide, or if you have any product-related questions, contact us at [email protected] 


About Gabija Fatenaite

Gabija Fatenaite is a Senior Content Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.


All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.