Adomas Sulcas
Although web scraping in its totality is a complex and nuanced field of knowledge, building your own basic web scraper is not all that difficult. And that’s mostly due to coding languages such as Python. This language makes the process much more straightforward thanks to its relative ease of use and the many useful libraries that it offers. One of these wildly popular libraries is Beautiful Soup, a Python package used for parsing HTML and XML documents. And that’s exactly what we’ll be focusing on in this tutorial.
If you want to build your first web scraper, I recommend checking out our article that details everything you need to know to get started with Python web scraping or our video tutorial below. Yet, in this tutorial we will focus specifically on parsing data in Python using a sample HTML file.
This tutorial is useful for those seeking to quickly grasp the value that Python and Beautiful Soup v4 offer. After following the provided examples, you should be able to understand the basic principles of how to parse HTML data. The examples will demonstrate traversing a document for HTML tags, printing the full content of the tags, finding elements by ID, extracting text from specified tags, and exporting it to a .csv file.
Before getting to the matter at hand, let’s first take a look at some of the fundamentals.
Data parsing is a process during which a piece of data gets converted into a different type of data according to specified criteria. It is an important part of web scraping since it helps transform raw HTML data into a more easily readable format that can be understood and analyzed.
A well-built parser will identify the needed HTML string and the relevant information within it. Based on predefined criteria and the rules of the parser, it will filter and combine the needed information into CSV or JSON files.
Our previous article on what parsing is and what data parsers do sums this up nicely. If you have more questions about data parsing, book a call with our sales team!
Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages based on specific criteria that can be used to extract, navigate, search and modify data from HTML, which is mostly used for web scraping. It is available for Python 2.7 and Python 3. A useful library, it can save programmers loads of time.
Before working on this tutorial, you should have a Python programming environment set up on your machine. For this tutorial we will assume that PyCharm is used, since it's a convenient choice even for those less experienced with Python and a great starting point. Otherwise, simply use your go-to IDE.
On Windows, when installing Python make sure to tick the “PATH installation” checkbox. PATH installation adds executables to the default Windows Command Prompt executable search. Windows will then recognize commands like “pip” or “python” without having to point to the directory of the executable which makes things more convenient.
You should also have Beautiful Soup installed on your system. No matter the OS, you can easily do it by using this command on the terminal to install the current latest version of Beautiful Soup:
pip install beautifulsoup4
If you are using Windows, it is recommended to run terminal as administrator to ensure that everything works out smoothly.
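If you'd like to double-check that the installation worked, a quick optional sanity check like the one below should print the installed version:
import bs4
# Printing the package version confirms that Beautiful Soup 4 is importable.
print(bs4.__version__)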
Finally, since we will be working with a sample file written in HTML, you should be at least somewhat familiar with HTML structure.
A sample HTML file will help demonstrate the main methods of how Beautiful Soup parses data. This file is much simpler than your average modern website; however, it will be sufficient for the scope of this tutorial.
<!DOCTYPE html>
<html>
<head>
<title>What is a Proxy?</title>
<meta charset="utf-8">
</head>
<body>
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies. However, two of
the most popular types are residential and data center proxies. Here is a list of the most common types.
</p>
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>
For PyCharm to use this file, simply copy it to any text editor and save it as index.html in the directory of your PyCharm project.
Next, open PyCharm, right-click the project area, and navigate to New -> Python File. Congratulations and welcome to your new playground!
First, we can use Beautiful Soup to extract a list of all the tags used in our sample HTML file. For this, we will use the soup.descendants generator.
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
for child in soup.descendants:
    if child.name:
        print(child.name)
After running this code (right click on code and click “Run”) you should get the below output:
html
head
title
meta
body
h2
p
ul
li
li
li
li
li
What just happened? Beautiful Soup traversed our HTML file and printed, one by one, all the HTML tags it found. Let's take a quick look at what each line did.
from bs4 import BeautifulSoup
This tells Python to use the Beautiful Soup library.
with open('index.html', 'r') as f:
    contents = f.read()
And this code, as you could probably guess, gives an instruction to open our sample HTML file and read its contents.
soup = BeautifulSoup(contents, features="html.parser")
This line creates a BeautifulSoup object from the file contents and tells it to use Python's built-in HTML parser. Other parsers, such as lxml, could also be used, but lxml is a separate external library, and for the purposes of this tutorial the built-in parser will do just fine.
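For illustration only (this tutorial doesn't require it), if you had lxml installed via pip install lxml, swapping parsers would be a one-argument change:
# Optional alternative: parse the same contents with the external lxml parser.
soup = BeautifulSoup(contents, features="lxml")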
for child in soup.descendants:
    if child.name:
        print(child.name)
The final piece of code loops over the soup.descendants generator, checks whether each node is a tag (i.e., has a name), and prints that name to the PyCharm console. The results can also easily be exported to a .csv file, but we will get to that later.
To get the content of tags, this is what we can do:
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
print(soup.h2)
print(soup.p)
print(soup.li)
These simple instructions output each specified HTML tag with its full content, in the given order. Here's what the output should look like:
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies. However, two of the most popular types are residential and data center proxies. Here is a list of the most common types.
</p>
<li>Residential proxies</li>
You could also remove the HTML tags and print the text only by using, for example:
print(soup.li.text)
Which in our case will give the following output:
Residential proxies
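If the extracted text carries extra whitespace or newlines (as the text inside the <p> tag in our sample file does), Beautiful Soup's get_text() method with strip=True can tidy it up; a minimal, optional sketch:
# get_text(strip=True) removes leading and trailing whitespace from the extracted text.
print(soup.p.get_text(strip=True))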
Note that accessing a tag this way (e.g., soup.li) only returns the first instance of the specified tag. Let's continue to see how to find elements by ID or how to use the find_all method to filter elements by specific criteria.
We can use two similar ways to find elements by ID:
print(soup.find('ul', attrs={'id': 'proxytypes'}))
or
print(soup.find('ul', id='proxytypes'))
Both of these will output the same result in the Python Console:
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
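Once the list has been located by its ID, you can keep working with the element that was returned. For instance, here is a minimal sketch (using the same index.html sample) that prints only the text of the items inside that particular list:
# Find the <ul> element by its ID, then print the text of each <li> inside it.
proxy_list = soup.find('ul', id='proxytypes')
for item in proxy_list.find_all('li'):
    print(item.text)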
The find_all method is a great way to extract specific data from an HTML file. It accepts many criteria that make it a flexible tool allowing us to filter data in convenient ways. Yet for this tutorial we do not need anything more complex. Let’s find all items of our list and print them as text only:
for tag in soup.find_all('li'):
    print(tag.text)
This is how the full code should look:
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
for tag in soup.find_all('li'):
    print(tag.text)
And here’s the output:
Residential proxies
Datacenter proxies
Shared proxies
Semi-dedicated proxies
Private proxies
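As a small illustration of the flexibility mentioned above (none of this is needed for the rest of the tutorial), find_all can also take a list of tag names or a limit on the number of matches:
# Match several kinds of elements at once by passing a list of tag names.
for tag in soup.find_all(['h2', 'p']):
    print(tag.name, ':', tag.text.strip())
# Or cap the number of results with the limit argument.
for tag in soup.find_all('li', limit=2):
    print(tag.text)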
Congratulations, you should now have a basic understanding of how Beautiful Soup might be used to parse data. It should be noted that the information presented in this article is useful as introductory material, yet real-world web scraping with Beautiful Soup and the subsequent parsing of data is usually much more complicated than this. For a more in-depth look at Beautiful Soup, you will hardly find a better source than its documentation, so be sure to check it out too.
A very common real-world application would be exporting data to a .csv file for later analysis. Although this is outside the scope of this tutorial, let’s take a quick look at how this might be achieved.
First, you would need to install the pandas library, which provides data structures for working with tabular data in Python. This can be easily done by using:
pip install pandas
You should also add this line to the beginning of your code to import the library:
import pandas as pd
Going further, let’s add some lines that will export the list we extracted earlier to a .csv file. This is how our full code should look:
from bs4 import BeautifulSoup
import pandas as pd
with open('index.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
# Collect the text of each <li> element so the CSV contains clean values.
results = [tag.text for tag in soup.find_all('li')]
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
What happened here? Let’s take a look:
results = [tag.text for tag in soup.find_all('li')]
This line finds all instances of the <li> tag and stores their text content in the results list.
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
And here we see the pandas library at work, storing our results in a table (a DataFrame) and exporting it to a .csv file.
If all went well, a new file titled names.csv should appear in the directory of your Python project, and inside you should see a table with the proxy types list. That’s it! You now not only know how extracting data from HTML works but can also programmatically export it to a new file.
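If you’d like to verify the export without leaving Python, reading the file back with pandas is a quick optional check:
# Read the exported CSV back in and print it to confirm its contents.
print(pd.read_csv('names.csv'))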
As you can see, Beautiful Soup is a highly useful HTML parser. With a relatively low learning curve, you can quickly grasp how to navigate, search, and modify the parse tree. Add libraries such as pandas for further manipulation and analysis, and you have a powerful toolkit for a nearly endless range of data collection and analysis use cases.
And if you’d like to expand your knowledge on Python web scraping in general and get familiar with other libraries used with it, I very much recommend heading over to What is Python used for? and Python Requests.
About the author
Adomas Sulcas
PR Team Lead
Adomas Sulcas is a PR Team Lead at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.