How to build a web scraper
Iveta Vistorskyte

Jan 13, 2021 20 min read

Today, the most important asset any business can have is data. Quality data yields insights that were impossible to attain until a few years ago.

The most common method to collect the data is web scraping. Web scraping (also called data scraping or screen scraping) involves using a web scraper to gather publicly available data in a structured and automated way.

This article explains what a web scraper is, how to build it, and how to overcome the most common web scraping challenges.

What is a web scraper?

A web scraper is simply a program that collects publicly available data from websites. Let’s assume that you are running an e-commerce store, listing thousands of products. A lot of these products are being sold at other stores as well. If the same product is being sold at multiple online stores, buyers will likely go to the store with the lowest price. The profits for the seller, however, would depend on the margin. The challenge here is to find the ideal price point to attract more buyers while keeping the margins as high as possible.

A web scraper can help you monitor the prices of the competitors. This data can then be used to determine the ideal prices for your store.

How to write a web scraper?

Before getting started on the programming part, let’s take a look at the basics.

How do browsers work?

Let’s first understand how a browser works:

  1. When a URL is entered in the browser, the browser sends an HTTP GET request to the server.
  2. The server returns a response, which contains the response headers and the body. The body primarily contains the HTML.
  3. The browser will now parse (read and understand) this HTML, and load the CSS, JavaScript, and images needed by the page.
  4. The browser now renders (displays) the HTML that you see.

Even though this is a simplified explanation, there is one important point to note here. In most cases, the needed data is available after step 2. Steps 3 and 4 load extra files and execute scripts that are only required to make an easy-to-read and user-friendly web page.

If the site is dynamic, the data is usually loaded from a separate file, typically in JSON format. If you can find that file, you only need to download it and nothing else, whereas a browser downloads everything.

How does a web scraper work?

As explained in the previous section, a web scraper downloads only the required file. In most cases, this is only the HTML file, meaning it skips other JavaScript and CSS files. It also skips the rendering, which is not required as the needed data is already in the HTML file.

This saves a lot of bandwidth, memory, and CPU time, so a web scraper is much faster than loading the page in a browser. Here is how a web scraper works:

  1. Gets HTML.
  2. Converts a web page into an object (also called parsing).
  3. Locates the required data in the object.
  4. Saves the data in a file or database as needed.
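The four steps above can be sketched with nothing but Python's standard library. This is only an illustration: SAMPLE_HTML stands in for a downloaded page, and the names ProductParser, scrape, and to_csv are made up for this example. The tutorial below uses requests and BeautifulSoup instead, which make steps 2 and 3 much shorter.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for step 1: in a real scraper this string would be downloaded.
SAMPLE_HTML = """
<html><body>
  <h2 class="product">Blue mug</h2>
  <h2 class="product">Red mug</h2>
  <p>Unrelated text</p>
</body></html>
"""


class ProductParser(HTMLParser):
    """Steps 2 and 3: walk the HTML and collect the text of <h2 class="product">."""

    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "h2" and ("class", "product") in attrs:
            self.in_product = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_product = False

    def handle_data(self, data):
        if self.in_product and data.strip():
            self.products.append(data.strip())


def scrape(html):
    """Parse the HTML and return the product names found in it."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


def to_csv(products):
    """Step 4: serialize the extracted values as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["product"])
    for name in products:
        writer.writerow([name])
    return buffer.getvalue()


products = scrape(SAMPLE_HTML)
print(to_csv(products))
```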

Best programming language for web scraping

While a lot of programming languages such as C#, Ruby, Java, and R can be used to build a web scraper, the two most popular languages are Python and JavaScript (Node.js).

Python for building a web scraper

Python is the most popular language for web scraping. The biggest advantage is the vast number of libraries available. Python is an easy to learn, general-purpose language. There are libraries such as BeautifulSoup and Requests which make writing a web scraper very easy.

JavaScript for building a web scraper

With the arrival of Node.js, JavaScript has evolved into a very powerful language for web scraping. Node.js is a runtime that executes JavaScript code outside of a browser. Web scraping with JavaScript and Node.js is not only easy, it is also fast, and the learning curve is very low for those who are already familiar with JavaScript.

Prerequisite for building a web scraper

Apart from knowledge of the programming language of your choice, you also need to understand how web pages work. To extract the required data, a good knowledge of CSS selectors is needed as well. Some libraries allow the use of XPath selectors, but for beginners, CSS selectors are far easier to learn.

Basic knowledge of Chrome Developer Tools or Firefox Developer Tools is also required to locate the page that contains the data you need and to build the selectors.

Quick overview of CSS selectors

Since knowledge of CSS selectors is essential to selecting a specific HTML element from a page, it is also essential in learning how to build a web scraper. Let’s go over some examples to reinforce your understanding of CSS selectors.

  • #firstname – selects any element where id="firstname"
  • .redirect – selects any element where class="redirect"
  • p – selects all <p> tags
  • div#firstname – selects div elements where id="firstname"
  • p.link.new – note that there is no space here. This selects <p class="link new"> or <p class="new link">
  • p.link .new – note the space here. This selects any element with class "new" that is inside <p class="link">
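To see these selectors in action, here is a small sketch that runs each of them against a made-up HTML fragment. It assumes the beautifulsoup4 package (installed later in this article) is available. The markup in SAMPLE_HTML is invented purely for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup that contains one match for each selector in the list above
SAMPLE_HTML = """
<div id="firstname">Ada</div>
<p class="link new">both classes</p>
<p class="link"><span class="new">nested</span></p>
<a class="redirect" href="/x">redirect link</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [element.get_text() for element in soup.select(selector)]


print(texts("#firstname"))     # element with id="firstname"
print(texts(".redirect"))      # element with class="redirect"
print(len(soup.select("p")))   # number of <p> tags
print(texts("div#firstname"))  # div with id="firstname"
print(texts("p.link.new"))     # <p> that has both classes
print(texts("p.link .new"))    # class "new" inside <p class="link">
```

The last two calls match different elements, which is the practical effect of the space in the selector.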

Now that we’ve covered these, let’s get started with building a web scraper.

How to build a web scraper in Python

Even though multiple libraries are available in Python for web scraping, the most popular ones are requests and BeautifulSoup. The following sections will walk you through each step of scraping a website.

STEP 1. How to get the HTML?

The first step to building a web scraper is getting the HTML of a page. We will be using the requests library to get the HTML. It allows us to send a request and get a response. This can be installed using pip or pip3, depending on your Python installation.

pip install requests

Now create a new file with extension .py in your favorite editor and open it. Alternatively, you can also use Jupyter Notebooks or even a Python console. This allows the execution of small code snippets and viewing of the result immediately.

If you are using Jupyter Notebooks, enter these lines in a cell, and execute the cell. If you are using a code editor, enter these lines, save the file, and execute it with Python.

import requests

url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)
print(response)

You will see the output like this:

<Response [200]>

This means we received a response object with status code 200 (a successful response).

If we check the type of the response object by calling type(response), we will see that it is an instance of requests.models.Response.

This has many interesting properties, like status_code, encoding, and the most interesting of all — text.
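In practice it helps to check the status before using the text. The following is a minimal sketch, not the only way to do it: fetch_html is a name made up for this example, and the get parameter exists only so that the function can be tried without a network connection. The raise_for_status() method and the timeout argument are part of the requests library.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, or raise if the server reports an error."""
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://en.wikipedia.org/wiki/Python_(programming_language)") would return the same string as response.text in the code above, provided the request succeeds.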

Edit the code file so that response.text is printed:

print(response.text)

You will see that the output will be the entire HTML of the page. Here is the partial output:

<!DOCTYPE html><html class="client-nojs" lang="en" dir="ltr"><head><meta charset="UTF-8"/><title>Python (programming language) - Wikipedia</title>…

Now that we have the HTML ready, it’s time to move on to the next step.

STEP 2. How to parse the HTML?

Now this HTML response, which currently is a string, needs to be parsed into an object. The most important thing here is that we should be able to easily query this object to get the desired data.

We can use parsing libraries directly. However, we will use another library called beautifulsoup4. This sits on top of the parser. The advantage is that we can easily write selectors so that we can query this HTML markup and look for the data that we need.

To install this library, run the following on your terminal:

pip install beautifulsoup4

OR

pip install bs4

Once the installation is complete, add the import statement and create an object of BeautifulSoup. Here is the updated code:

import requests
from bs4 import BeautifulSoup
url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)
soup = BeautifulSoup(response.text,'html.parser')

Note that we are specifying the parser as html.parser, which is bundled with Python. We can, however, use a different parser like lxml, which has to be installed separately.

Now that we have the parsed object, we can extract the data we need.

STEP 3. How to extract data?

BeautifulSoup provides an easy way to navigate the data structure. Here are some examples and the output:

soup.title
# output <title>Python (programming language) - Wikipedia</title>
soup.title.name
# output 'title'
soup.title.text
# output 'Python (programming language) - Wikipedia'
soup.title.parent.name
# output 'head'

If you’re looking for a specific text, you first need to know where exactly that text is located in the HTML. In this example, we will try to extract the items from the table of contents of this Wikipedia page.

Open the URL https://en.wikipedia.org/wiki/Python_(programming_language) in Chrome or Firefox, right-click any item in the table of contents, and click Inspect. This will show that the text that we need is in <div id="toc" class="toc">.

Once we know where the text is located, we have two options:

  1. We can use the find() or find_all() method. 
  2. Alternatively, we can use the select() method.

Using find method with Beautiful Soup

The only difference between the find() and find_all() methods is that the find() method returns the first match, while find_all() returns them all.

Let’s look at a few examples.

If we simply run soup.find("div"), it will return the first div it finds, which is the same as running soup.div. This needs filtering as we need a specific div which contains the table of contents.

In this case, the whole table of contents is in the div that has its id set to toc. This information can be supplied to the find() method as a keyword argument.

soup.find("div",id="toc")

This will return the first div which has its id set to toc, including everything inside it. The first argument does not have to be div: this method accepts any tag name.
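The difference between find() and find_all() is easy to check on a small, made-up fragment. The HTML below is invented for this sketch and only loosely resembles the Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Invented fragment with two divs, one of which has id="toc"
SAMPLE_HTML = """
<div id="header">Site header</div>
<div id="toc" class="toc">
  <ul><li>History</li><li>Syntax</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_div = soup.find("div")           # first match only
all_divs = soup.find_all("div")        # every match, as a list
toc = soup.find("div", id="toc")       # filtered by attribute
missing = soup.find("div", id="nope")  # None when nothing matches

print(first_div["id"])
print(len(all_divs))
print([li.get_text() for li in toc.find_all("li")])
print(missing)
```

Note that find() returns None when there is no match, so it is worth checking the result before accessing attributes on it.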

Let’s take another example. In this page, there is a link with markup like this:

<a href="/wiki/End-of-life_(product)" class="mw-redirect" title="End-of-life (product)">end-of-life</a>

This can be selected using any of these methods:

soup.find('a',title="End-of-life (product)")
soup.find('a',href="/wiki/End-of-life_(product)")

You can even use more than one attribute:

soup.find('a',title="End-of-life (product)",href="/wiki/End-of-life_(product)")

NOTE. Be careful about class attributes. Class is a reserved keyword in Python. It means that you cannot use class in the same fashion:

soup.find('a',class="mw-redirect") # SyntaxError: invalid syntax

The workaround is to suffix class with an underscore:

soup.find('a',class_="mw-redirect") # will return first a tag with this class
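As a sketch of the alternatives, BeautifulSoup also accepts an attrs dictionary, which avoids the keyword problem altogether. The links in SAMPLE_HTML are invented for this example.

```python
from bs4 import BeautifulSoup

# Invented links: two of the three carry the mw-redirect class
SAMPLE_HTML = """
<a href="/wiki/A" class="mw-redirect">first</a>
<a href="/wiki/B" class="external">second</a>
<a href="/wiki/C" class="mw-redirect external">third</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Two equivalent ways to filter by class
with_keyword = soup.find("a", class_="mw-redirect")
with_attrs = soup.find("a", attrs={"class": "mw-redirect"})

# A single class name also matches elements that have additional classes
all_redirects = soup.find_all("a", class_="mw-redirect")

print(with_keyword["href"])
print(with_attrs["href"])
print([a.get_text() for a in all_redirects])
```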

Using CSS selectors with BeautifulSoup

BeautifulSoup also supports the use of CSS selectors. This is arguably a better approach, because CSS selectors are generic and not specific to BeautifulSoup. Chances are that you already know how to build them. If you don’t, they are worth learning, because JavaScript scraping packages work with CSS selectors too.

Note that there are two options – select() and select_one(). The select() method is similar to find_all(). Both return a list of all the matching occurrences. The select_one() method is similar to the find() method, which returns the first matching occurrence.

Let’s look at the same examples. To extract this link:

<a href="/wiki/End-of-life_(product)" class="mw-redirect" title="End-of-life (product)">end-of-life</a>

Either of these methods will work:

soup.select_one('a[title="End-of-life (product)"]')

soup.select_one('a[href="/wiki/End-of-life_(product)"]')

Again, you can use more than one attribute:

soup.select_one('a[title="End-of-life (product)"][href="/wiki/End-of-life_(product)"]')

NOTE. When using more than one attribute, there should not be any space. This is standard CSS syntax and not specific to BeautifulSoup.

When selecting by class, the syntax is much cleaner. A class is represented by a period. Similarly, an id is represented by #.

soup.select_one('a.mw-redirect') # will return the first a tag with mw-redirect
soup.select_one('a#mw-redirect') # will return the first a tag with id mw-redirect

If you want to chain more than one class, write the classes separated with a period, but no space. 

soup.select_one('a.mw-redirect.external') # will return the first a tag with classes mw-redirect and external.

Coming back to the example of the Wikipedia table of contents, the following snippet will return all the span elements with class toctext.

toc = soup.select("span.toctext")
for item in toc:
    print(item)
# OUTPUT
# <span class="toctext">History</span>
# <span class="toctext">Design philosophy and features</span>
# <span class="toctext">Syntax and semantics</span>

This code returns all matching elements. If we check the type of these elements, it will be bs4.element.Tag. Typically, we would need the text inside these elements. This is as simple as getting the .text of the elements.

toc = soup.select("span.toctext")
for item in toc:
    print(item.text)
# OUTPUT
# History
# Design philosophy and features
# Syntax and semantics
# ...

Let’s get one more piece of information from the table of contents – the toc number. For example, the toc number of “Syntax and semantics” is 3, and the toc number for “Statements and control flow” is 3.1.

To get both of these, we can go through the parent elements, and again use the select method on the individual elements.

for item in soup.select('li.toclevel-1'):
    toc_number = item.select_one('span.tocnumber').text
    print(toc_number)
# OUTPUT
# 1
# 2
# 3
# …

The most important point here is that the select method works with beautifulsoup objects, as well as the elements extracted by select methods.

We can create a dictionary inside the for loop and save everything in a list:

# Create an empty list
data = []
# Loop over the outer elements
for item in soup.select('li.toclevel-1'):
    # Get the toc number element and its text
    toc_number = item.select_one('span.tocnumber').text
    # Get the toc text element and its text
    toc_text = item.select_one('span.toctext').text
    # Create a dictionary and add it to the list
    data.append({
        'TOC Number': toc_number,
        'TOC Text': toc_text
    })
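One thing to keep in mind is that select_one() returns None when nothing matches, so calling .text on the result fails if an item lacks one of the spans. A possible way to guard against this is sketched below. The helper names text_or_none and extract_toc, as well as the sample markup, are made up for this example.

```python
from bs4 import BeautifulSoup

# Invented fragment: the second item has no tocnumber span
SAMPLE_HTML = """
<ul>
  <li class="toclevel-1"><span class="tocnumber">1</span> <span class="toctext">History</span></li>
  <li class="toclevel-1"><span class="toctext">Unnumbered entry</span></li>
  <li class="toclevel-1"><span class="tocnumber">2</span> <span class="toctext">Syntax and semantics</span></li>
</ul>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if there is no match."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


def extract_toc(html):
    """Build a list of dictionaries, one per top-level TOC item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("li.toclevel-1"):
        rows.append({
            "TOC Number": text_or_none(item, "span.tocnumber"),
            "TOC Text": text_or_none(item, "span.toctext"),
        })
    return rows


for row in extract_toc(SAMPLE_HTML):
    print(row)
```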

Now we are ready to save this list of dictionaries to a file or a database. To keep things simple, let’s begin with a file.

STEP 4. How to export data to CSV?

Exporting to CSV doesn’t need any installation. The csv module, which is bundled with Python installation, offers this functionality.

Here is the code snippet with each line explained:

# Import the csv module
import csv
# Open a new file in write mode
with open('wiki.csv', 'w', newline='') as csvfile:
    # Specify the column names
    fieldnames = ['TOC Number', 'TOC Text']
    # Create a dictionary writer object
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # Write the header row; the fieldnames become the column headings
    writer.writeheader()
    # Loop over the data
    for item in data:
        # Each item in the list is a dictionary,
        # which is written as one row
        writer.writerow(item)

The data is exported to a CSV file. This is the last step of building a web scraper in Python.
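To check what DictWriter produces without touching the disk, the same logic can be run against an in-memory buffer and read back with DictReader. This is a small sketch with made-up sample rows. Note how the value that contains a comma is quoted automatically.

```python
import csv
import io

# Made-up rows in the same shape as the scraped data
data = [
    {"TOC Number": "1", "TOC Text": "History"},
    {"TOC Number": "2", "TOC Text": "Design philosophy, features"},
]

# Write to an in-memory buffer instead of a file
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["TOC Number", "TOC Text"])
writer.writeheader()
writer.writerows(data)  # writes all rows in one call
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV text back to confirm that nothing was lost
rows_read_back = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(rows_read_back == data)
```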

How to build a web scraper in JavaScript

Building a web scraper in JavaScript follows the same steps:

  1. Get the HTML.
  2. Parse the Response.
  3. Extract desired data.
  4. Save the data.

Preparing the Development Environment

The only software required is Node.js and npm. Once you have Node.js set up, open the terminal and create a new node project:

npm init -y

After that, install these three packages:

npm install axios cheerio json2csv

Now let’s move on to the first step.

STEP 1. How to get the HTML?

The HTML page can be fetched by the package axios.

Create a new file and enter these lines:

// load axios
const axios = require("axios");
const wiki_python = "https://en.wikipedia.org/wiki/Python_(programming_language)";

// create an async function and call it immediately
(async function () {
  // get the response
  const response = await axios.get(wiki_python);
  // prints 200 if the response is successful
  console.log(response.status);
})();

While most of the code is standard Node.js code, the important line is:

const response = await axios.get(wiki_python);

Once this line executes, the response will contain the HTML that we need.

STEP 2. How to parse the HTML?

For parsing, the package that can be used is cheerio.

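Open the same file that we have been working on, load cheerio at the top of the file, and once the response is available, add the following line of code:

// at the top of the file
const cheerio = require("cheerio");
// inside the async function, after the response is available
const $ = cheerio.load(response.data);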

Note here that the HTML is being accessed using the data attribute of the response object created by Axios.

It’s also worth mentioning that the variable is named $. This is only a convention, but it means that the code reads like jQuery and that we can use CSS selectors.

STEP 3. How to extract data?

The desired data can be extracted using CSS selectors. For example, this line will select all the TOC elements:

const TOC = $("li.toclevel-1"); 

Now we can run a loop on all these elements, and select the toc number and the toc text. The extracted data can then be pushed to an array of objects, which is easy to convert to JSON or CSV.

const toc_data = [];
TOC.each(function () {
  // declare with const so that the variables do not leak into the global scope
  const level = $(this).find("span.tocnumber").first().text();
  const text = $(this).find("span.toctext").first().text();
  toc_data.push({ level, text });
});

Now we are ready to save the data to a CSV.

STEP 4. How to export data to CSV?

For exporting data to CSV, we can simply use the package json2csv because we already have the data as an array of objects. This will create the CSV in memory. To write this CSV to disk, we can use the fs module, which is built into Node.js and does not need to be installed separately.

const fs = require("fs");
const j2cp = require("json2csv").Parser;

const parser = new j2cp();
const csv = parser.parse(toc_data);
fs.writeFileSync("./wiki_toc.csv", csv);

Once everything is put together, this is how the entire code file would be:

const fs = require("fs");
const j2cp = require("json2csv").Parser;
const axios = require("axios");
const cheerio = require("cheerio");

const wiki_python =  "https://en.wikipedia.org/wiki/Python_(programming_language)";

async function getWikiTOC(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    const TOC = $("li.toclevel-1");
    const toc_data = [];
    TOC.each(function () {
      const level = $(this).find("span.tocnumber").first().text();
      const text = $(this).find("span.toctext").first().text();
      toc_data.push({ level, text });
    });
    const parser = new j2cp();
    const csv = parser.parse(toc_data);
    fs.writeFileSync("./wiki_toc.csv", csv);
  } catch (err) {
    console.error(err);
  }
}

getWikiTOC(wiki_python);

Save the above code as wiki_toc.js, open the terminal, and run node wiki_toc.js. This will save the extracted data in the wiki_toc.csv file.

Potential challenges and how to avoid them

Now that we know how to write a web scraper, let’s talk about the potential challenges. Building a web scraper and gathering data isn’t always easy, and there will be times when you face difficulties. Fortunately, there are ways to deal with them. Let’s have a look at some of the common challenges.

Difficulty in writing CSS selectors

Creating effective CSS selectors can be challenging at the beginning. The solution is to use tools like SelectorGadget. This allows creating CSS selectors with just a few clicks.

Dynamic sites

Many websites do not contain the data that you need in the HTML. The data is loaded separately using JavaScript. In these cases, you can use Selenium if you are using Python, or Puppeteer if you are using JavaScript to build the web scraper.

Selenium and Puppeteer both load an actual browser. This makes things easier. This also makes the web scraper slower. The alternative is to examine the network tab in the Developer Tools and find the URL that loads the data. Once you know that, you can use the Requests library of Python or Axios package of JavaScript to load only that URL. This makes the scraper run much faster.
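Once the data URL is known, handling its response usually comes down to parsing JSON. Here is a minimal sketch of that step. The payload and its field names (products, name, price) are entirely hypothetical, since every site structures its data differently.

```python
import json

# Hypothetical payload, shaped like what a data URL found in the network tab might return
SAMPLE_PAYLOAD = """
{"products": [
  {"name": "Blue mug", "price": 7.5},
  {"name": "Red mug", "price": 8.0}
]}
"""


def extract_prices(payload_text):
    """Parse the JSON text and return a mapping of product name to price."""
    payload = json.loads(payload_text)
    return {item["name"]: item["price"] for item in payload["products"]}


print(extract_prices(SAMPLE_PAYLOAD))
```

With the Requests library, the download and the parsing can be combined, because a response object offers a json() method that returns the parsed data directly.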

Server restrictions

There is a variety of server restrictions. These can range from a simple header check to presenting a CAPTCHA, or even blocking the IP. Fortunately, there is a solution for everything.

Handling 4xx errors by sending correct headers

If the response code received is in the range 400-499, the server has rejected the request. If the URL is correct and the page opens fine in a browser, most probably you just need to send the correct headers. The most important header is the user-agent. Both the Requests library and the Axios package have an optional parameter, where you can pass the headers as a dictionary (a plain object in JavaScript). If you want to know the actual headers sent by the browser, check the network tab of Developer Tools.
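As a sketch of how this looks with the Requests library, the snippet below prepares a request with custom headers without sending it, so that the headers can be inspected. The header values are examples only and should be replaced with the ones your own browser sends. The names BROWSER_LIKE_HEADERS and build_request are made up for this example.

```python
import requests

# Example values only; copy the real ones from the network tab of your browser
BROWSER_LIKE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=BROWSER_LIKE_HEADERS):
    """Prepare a GET request without sending it, so the headers can be inspected."""
    return requests.Request("GET", url, headers=headers).prepare()


prepared = build_request("https://example.com/")
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

To actually send the request with these headers, pass the same dictionary to requests.get(url, headers=BROWSER_LIKE_HEADERS).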

CAPTCHA(s)

Sometimes websites detect that a request is sent by a bot, and not a real user. It may reply with a CAPTCHA challenge. There are multiple ways to bypass this:

  1. Send all headers that a browser would, not just user-agent. 
  2. Randomize the user-agent between requests. 
  3. Introduce a random delay in-between the requests.

Instead of introducing delays, you can also use proxies so that all the requests go from different IP addresses. This will allow you to scrape at maximum speed.
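The user-agent rotation and the random delay can be sketched in a few lines. Everything here is illustrative: the user-agent strings are examples, the delay range of 1 to 3 seconds is an arbitrary assumption, and crawl expects you to supply your own fetch function (for instance, a thin wrapper around requests.get).

```python
import random
import time

# Example user-agent strings; a real scraper would keep a larger, current list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def next_headers():
    """Pick a random user-agent for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def random_delay(min_seconds=1.0, max_seconds=3.0):
    """Return a random pause length within the given range."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL with fresh headers, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # no need to wait before the very first request
            sleep(random_delay())
        results.append(fetch(url, headers=next_headers()))
    return results
```

The sleep parameter defaults to time.sleep and is only there so that the pauses can be replaced or recorded when trying the function out.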

Server bans and throttling

Sometimes websites may ban your IP address. The solution is similar to the one mentioned in the previous paragraph – use random headers, introduce delays, and better yet, use proxies.
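With the Requests library, proxies are configured as a dictionary that maps the URL scheme to the proxy address. The sketch below sets this on a session so that every request made through it uses the proxy. The proxy address and credentials are placeholders, and make_session is a name invented for this example.

```python
import requests

# Hypothetical proxy endpoint and credentials; replace with your provider's details
PROXY_URL = "http://user:password@proxy.example.com:8080"


def make_session(proxy_url=PROXY_URL):
    """Create a session that routes both HTTP and HTTPS traffic through a proxy."""
    session = requests.Session()
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


session = make_session()
print(session.proxies)
```

Requests made with session.get(url) would then go through the configured proxy instead of connecting directly.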

Note that for efficient web scraping, the quality of the proxy is crucial. Always choose a reliable proxy service.

Conclusion

Knowing how to scrape a website can result in much more informed and quick business decisions, which are essential for every company to succeed. It’s definitely worth trying web scraping to stay competitive in the ever changing market. 

If you’re interested in starting web scraping, we have even more detailed tutorials on scraping a website with Python or JavaScript. Be sure to get all the information you need for the smooth start of web scraping. 

People also ask

Is web scraping legal?

This is a complex question that needs a detailed explanation. We have explored this subject in detail and we highly recommend that you read about it. In short, web scraping activity is legal in cases where it’s done without breaching any laws.

Why build a web scraper?

There are many use cases of how building a web scraper and starting web scraping can help businesses achieve their goals. For example, SEO monitoring, price monitoring, review monitoring, etc. You should check this article for more information.  

What type of proxies is suitable for web scraping?

You should think of what websites you are going to scrape and what data is your target. Also, if you already tried web scraping and know your target websites, you should consider what issues you have encountered. For more information about which proxy type is suitable for web scraping, you should check this article. 

About Iveta Vistorskyte

Iveta Vistorskyte is a Copywriter at Oxylabs. Growing up as a writer and a challenge seeker, she decided to welcome herself to the tech-side, and instantly became interested in this field. When she is not at work, you'll probably find her just chillin' while listening to her favorite music or playing board games with friends.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.