
Adomas Sulcas

Aug 14, 2020 10 min read

Although web scraping in its totality is a complex and nuanced field of knowledge, building your own basic web scraper is not all that difficult. And that’s mostly due to coding languages such as Python. This language makes the process much more straightforward thanks to its relative ease of use and the many useful libraries that it offers. One of these wildly popular libraries is Beautiful Soup, a Python package used for parsing HTML and XML documents. And that’s exactly what we’ll be focusing on in this tutorial. 

If you want to build your first web scraper, I recommend checking out our article that details everything you need to know to get started with Python web scraping. Yet, today we will focus specifically on parsing data using a sample HTML file. 


This tutorial is useful for those seeking to quickly grasp the value that Python and Beautiful Soup v4 offer. After following the provided examples, you should understand the basic principles of how to parse HTML data. The examples will demonstrate traversing a document for HTML tags, printing the full content of those tags, finding elements by ID, extracting text from specified tags, and exporting it to a .csv file.

Before getting to the matter at hand, let’s first take a look at some of the fundamentals.

What is data parsing?

Data parsing is a process during which a piece of data gets converted into a different type of data according to specified criteria. It is an important part of web scraping since it helps transform raw HTML data into a more easily readable format that can be understood and analyzed.

What does a parser do?

A well-built parser will identify the needed HTML string and the relevant information within it. Based on predefined criteria and the rules of the parser, it will filter and combine the needed information into CSV or JSON files.
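To make this concrete, here is a minimal sketch of the idea using only Python's standard library: a tiny parser that picks the text out of every list item and combines the results into JSON. The class name and the output key are, of course, just illustrative choices.

```python
import json
from html.parser import HTMLParser

class ListItemParser(HTMLParser):
    """Collects the text of every <li> element it encounters."""
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_li = False

    def handle_data(self, data):
        # Only keep text that appears inside an <li> element
        if self.in_li:
            self.items.append(data.strip())

parser = ListItemParser()
parser.feed('<ul><li>Residential proxies</li><li>Datacenter proxies</li></ul>')
print(json.dumps({'proxy_types': parser.items}))
```

This is exactly the filter-and-combine step described above, just written by hand; libraries like Beautiful Soup exist so you don't have to do this yourself.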

If you’d like to read up more on data parsing and parsers, check out this text by my colleague Gabija.

If you have more questions about data parsing, book a call with our sales team!

What is Beautiful Soup?

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages based on specific criteria that can be used to extract, navigate, search and modify data from HTML, which is mostly used for web scraping. It is available for Python 2.7 and Python 3. A useful library, it can save programmers loads of time.
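As a tiny taste of what that parse tree looks like in practice, the sketch below parses a one-line document and pulls the title text out:

```python
from bs4 import BeautifulSoup

# Parse a minimal document and access the <title> element directly
soup = BeautifulSoup('<html><head><title>What is a Proxy?</title></head></html>',
                     features="html.parser")
print(soup.title.text)
```

We will walk through each of these pieces step by step in the sections below.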

Installing Beautiful Soup

Before working on this tutorial, you should have a Python programming environment set up on your machine. For this tutorial, we will assume that you are using PyCharm, since it's a convenient choice even for those less experienced with Python and a great starting point. Otherwise, simply use your go-to IDE.

On Windows, make sure to tick the “PATH installation” checkbox when installing Python. PATH installation adds the executables to the default Windows Command Prompt search path, so Windows will recognize commands like “pip” or “python” without your having to point to the executable's directory, which makes things more convenient.

You should also have Beautiful Soup installed on your system. No matter the OS, you can easily install the latest version by running this command in the terminal:

pip install beautifulsoup4

If you are using Windows, it is recommended to run the terminal as administrator to ensure that everything works smoothly.
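To confirm the installation worked, you can check that the package imports cleanly. Note that bs4 is the import name of the beautifulsoup4 package:

```python
# Quick sanity check that Beautiful Soup is importable
import bs4

print(bs4.__version__)
```

If this prints a version number without raising an ImportError, you are good to go.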

Finally, since we will be working with a sample file written in HTML, you should be at least somewhat familiar with HTML structure.

Python code for Beautiful Soup

Getting started

A sample HTML file will help demonstrate the main methods Beautiful Soup uses to parse data. This file is much simpler than your average modern website, but it will be sufficient for the scope of this tutorial.

<!DOCTYPE html>
<html>
    <head>
        <title>What is a Proxy?</title>
        <meta charset="utf-8">
    </head>

    <body>
        <h2>Proxy types</h2>

        <p>
          There are many different ways to categorize proxies. However, two of
          the most popular types are residential and data center proxies. Here
          is a list of the most common types.
        </p>

        <ul id="proxytypes">
            <li>Residential proxies</li>
            <li>Datacenter proxies</li>
            <li>Shared proxies</li>
            <li>Semi-dedicated proxies</li>
            <li>Private proxies</li>
        </ul>

    </body>
</html>

For PyCharm to use this file, simply copy it to any text editor and save it with the .html extension to the directory of your PyCharm project.

Going further, open PyCharm, right-click the project area, and navigate to New -> Python File. Congratulations and welcome to your new playground!

Traversing for HTML tags

First, we can use Beautiful Soup to extract a list of all the tags used in our sample HTML file. For this, we will use the soup.descendants generator.

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

    soup = BeautifulSoup(contents, features="html.parser")

    for child in soup.descendants:

        if child.name:
            print(child.name)

After running this code (right-click the code and select “Run”), you should get the output below:

html
head
title
meta
body
h2
p
ul
li
li
li
li
li

What just happened? Beautiful Soup traversed our HTML file and printed, in order, every HTML tag it found. Let’s take a quick look at what each line did.

from bs4 import BeautifulSoup

This tells Python to use the Beautiful Soup library.

with open('index.html', 'r') as f:
    contents = f.read()

And this code, as you could probably guess, opens our sample HTML file and reads its contents.

    soup = BeautifulSoup(contents, features="html.parser")

This line creates a BeautifulSoup object from the file contents and tells it to use Python’s built-in HTML parser. Other parsers, such as lxml, can also be used, but lxml is a separate external library, and for the purpose of this tutorial the built-in parser will do just fine.

    for child in soup.descendants:

        if child.name:
            print(child.name)

The final piece of code, a loop over the soup.descendants generator, instructs Beautiful Soup to look for HTML tags and print their names in the PyCharm console. The results can also easily be exported to a .csv file, but we will get to that later.
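The same generator can power other summaries as well. For instance, a small variation on the code above tallies how many times each tag appears, using a self-contained HTML string for brevity:

```python
from collections import Counter
from bs4 import BeautifulSoup

html = """<html><body><h2>Proxy types</h2>
<ul><li>A</li><li>B</li><li>C</li></ul></body></html>"""

soup = BeautifulSoup(html, features="html.parser")

# soup.descendants yields every node; Tag objects have a .name,
# while text nodes have name set to None, so we filter those out.
tag_counts = Counter(child.name for child in soup.descendants if child.name)
print(tag_counts)
```

This prints a Counter mapping each tag name to its number of occurrences, with li appearing three times.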

Getting the full content of tags

To get the content of tags, this is what we can do:

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

    soup = BeautifulSoup(contents, features="html.parser")

    print(soup.h2)
    print(soup.p)
    print(soup.li)

This is a simple instruction that outputs the HTML tag with its full content in the specified order. Here’s what the output should look like:

<h2>Proxy types</h2>
<p>
          There are many different ways to categorize proxies. However, two of
          the most popular types are residential and data center proxies. Here
          is a list of the most common types.
        </p>
<li>Residential proxies</li>

You could also remove the HTML tags and print the text only by using, for example:

    print(soup.li.text)

Which in our case will give the following output:

Residential proxies

Note that this only prints the first instance of the specified tag. Let’s continue to see how to find elements by ID or using the find_all method to filter elements by specific criteria.

Using Beautiful Soup to find elements by ID

We can use two similar ways to find elements by ID:

    print(soup.find('ul', attrs={'id': 'proxytypes'}))

or

    print(soup.find('ul', id='proxytypes'))

Both of these will output the same result in the Python Console:

<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
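Beautiful Soup also understands CSS selectors through its select and select_one methods, so a third way to reach the same element would be the following (shown here with a shortened, self-contained HTML string):

```python
from bs4 import BeautifulSoup

html = '<ul id="proxytypes"><li>Residential proxies</li><li>Datacenter proxies</li></ul>'
soup = BeautifulSoup(html, features="html.parser")

# '#proxytypes' is CSS syntax for "the element whose id is proxytypes"
element = soup.select_one('#proxytypes')
print(element.li.text)
```

CSS selectors are handy once your criteria get more complex than a single tag and ID.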

Finding all specified tags and extracting text

The find_all method is a great way to extract specific data from an HTML file. It accepts many criteria, making it a flexible tool that allows us to filter data in convenient ways. For this tutorial, however, we do not need anything more complex. Let’s find all the items of our list and print them as text only:

    for tag in soup.find_all('li'):
        print(tag.text)

This is how the full code should look:

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

    soup = BeautifulSoup(contents, features="html.parser")

    for tag in soup.find_all('li'):
        print(tag.text)

And here’s the output:

Residential proxies
Datacenter proxies
Shared proxies
Semi-dedicated proxies
Private proxies

Congratulations, you should now have a basic understanding of how Beautiful Soup can be used to parse data. Note that this article is introductory material; real-world web scraping with Beautiful Soup and the consequent data parsing is usually much more complicated. For a more in-depth look at Beautiful Soup, you will hardly find a better source than its documentation, so be sure to check it out too.

Exporting data to a .csv file

A very common real-world application would be exporting data to a .csv file for later analysis. Although this is outside the scope of this tutorial, let’s take a quick look at how this might be achieved.

First, you would need to install the pandas library that helps Python create structured data. This can be easily done by using:

pip install pandas

You should also add this line to the beginning of your code to import the library:

import pandas as pd

Going further, let’s add some lines that will export the list we extracted earlier to a .csv file. This is how our full code should look:

from bs4 import BeautifulSoup
import pandas as pd

with open('index.html', 'r') as f:
    contents = f.read()

    soup = BeautifulSoup(contents, features="html.parser")
    results = soup.find_all('li')

    df = pd.DataFrame({'Names': [tag.text for tag in results]})
    df.to_csv('names.csv', index=False, encoding='utf-8')

What happened here? Let’s take a look:

    results = soup.find_all('li')

This line finds all instances of the <li> tag and stores them in the results variable.

    df = pd.DataFrame({'Names': [tag.text for tag in results]})
    df.to_csv('names.csv', index=False, encoding='utf-8')

And here we see the pandas library at work: it stores the text of each result in a table (a DataFrame) and exports it to a .csv file. Note the use of .text, which keeps the raw <li> tags out of the CSV and writes only the item text.

If all went well, a new file titled names.csv should appear in your Python project directory, containing a table with the proxy types list. That’s it! You now not only know how extracting data from an HTML file works but can also programmatically export it to a new file.
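To double-check the export, you could read the file back with pandas. The sketch below is self-contained: it writes a small hard-coded list (standing in for the scraped results) and then reads it back to confirm the round trip preserved the data:

```python
import pandas as pd

# Stand-in for the list extracted with find_all earlier
names = ['Residential proxies', 'Datacenter proxies']
pd.DataFrame({'Names': names}).to_csv('names.csv', index=False, encoding='utf-8')

# Read the file back to confirm the round trip preserved the data
df = pd.read_csv('names.csv')
print(df['Names'].tolist())
```

Seeing the original list printed back means the CSV was written correctly.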

Conclusion

As you can see, Beautiful Soup is a highly useful HTML parser. With a relatively shallow learning curve, you can quickly grasp how to navigate, search, and modify the parse tree. Combined with libraries such as pandas, it becomes a powerful package for a nearly limitless range of data collection and analysis use cases.

And if you’d like to expand your knowledge on Python web scraping in general and get familiar with other libraries used with it, I very much recommend heading over to What is Python used for? and Python Requests.


About Adomas Sulcas

Adomas Sulcas is a Content Manager at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.


All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.