Adomas Sulcas
Although web scraping in its totality is a complex and nuanced field of knowledge, building your own basic web scraper is not all that difficult. And that’s mostly due to coding languages such as Python. This language makes the process much more straightforward thanks to its relative ease of use and the many useful libraries that it offers. One of these wildly popular libraries is Beautiful Soup, a Python package used for parsing HTML and XML documents. And that’s exactly what we’ll be focusing on in this tutorial.
If you want to build your first web scraper, I recommend checking out our article that details everything you need to know to get started with Python web scraping or our video tutorial below. Yet, in this tutorial we will focus specifically on parsing data in Python using a sample HTML file.
This tutorial is useful for those seeking to quickly grasp the value that Python and Beautiful Soup v4 offer. After following the provided examples, you should be able to understand the basic principles of how to parse HTML data. The examples will demonstrate traversing a document for HTML tags, printing the full content of the tags, finding elements by ID, extracting text from specified tags, and exporting it to a .csv file.
Before getting to the matter at hand, let’s first take a look at some of the fundamentals.
Data parsing is a process during which a piece of data gets converted into a different type of data according to specified criteria. It is an important part of web scraping since it helps transform raw HTML data into a more easily readable format that can be understood and analyzed.
A well-built parser will identify the needed HTML string and the relevant information within it. Based on predefined criteria and the rules of the parser, it will filter and combine the needed information into CSV or JSON files.
Our previous article on what parsing is and what data parsers do sums this up nicely. If you have more questions about data parsing, book a call with our sales team!
Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages based on specific criteria that can be used to extract, navigate, search and modify data from HTML, which is mostly used for web scraping. It is available for Python 2.7 and Python 3. A useful library, it can save programmers loads of time.
Before working on this tutorial, you should have a Python programming environment set up on your machine. For this tutorial we will assume that PyCharm is used, since it's a convenient choice even for those less experienced with Python and a great starting point. Otherwise, simply use your go-to IDE.
On Windows, when installing Python make sure to tick the “PATH installation” checkbox. PATH installation adds executables to the default Windows Command Prompt executable search. Windows will then recognize commands like “pip” or “python” without having to point to the directory of the executable which makes things more convenient.
You should also have Beautiful Soup installed on your system. No matter the OS, you can easily do it by using this command on the terminal to install the current latest version of Beautiful Soup:
pip install beautifulsoup4
If you are using Windows, it is recommended to run terminal as administrator to ensure that everything works out smoothly.
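If you'd like to double-check that the installation worked, a quick optional sanity check like the one below should print the installed version:
import bs4
# Printing the package version confirms that Beautiful Soup 4 is importable.
print(bs4.__version__)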
Finally, since we will be working with a sample file written in HTML, you should be at least somewhat familiar with HTML structure.
A sample HTML file will help demonstrate the main methods of how Beautiful Soup parses data. This file is much simpler than your average modern website; however, it will be sufficient for the scope of this tutorial.
<!DOCTYPE html>
<html>
<head>
<title>What is a Proxy?</title>
<meta charset="utf-8">
</head>
<body>
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies. However, two of
the most popular types are residential and data center proxies. Here is a list of the most common types.
</p>
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>
For PyCharm to use this file, simply copy it to any text editor and save it as index.html in the directory of your PyCharm project.
Next, open PyCharm, right-click the project area, and navigate to New -> Python File. Congratulations and welcome to your new playground!
First, we can use Beautiful Soup to extract a list of all the tags used in our sample HTML file. For this, we will use the soup.descendants generator.
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
for child in soup.descendants:
    if child.name:
        print(child.name)
After running this code (right click on code and click “Run”) you should get the below output:
html
head
title
meta
body
h2
p
ul
li
li
li
li
li
What just happened? Beautiful Soup traversed our HTML file and printed, one by one, all the HTML tags it found. Let's take a quick look at what each line did.
from bs4 import BeautifulSoup
This tells Python to use the Beautiful Soup library.
with open('index.html', 'r') as f:
    contents = f.read()
And this code, as you could probably guess, gives an instruction to open our sample HTML file and read its contents.
soup = BeautifulSoup(contents, features="html.parser")
This line creates a BeautifulSoup object from the file contents and tells it to use Python's built-in HTML parser. Other parsers, such as lxml, could also be used, but lxml is a separate external library, and for the purposes of this tutorial the built-in parser will do just fine.
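For illustration only (this tutorial doesn't require it), if you had lxml installed via pip install lxml, swapping parsers would be a one-argument change:
# Optional alternative: parse the same contents with the external lxml parser.
soup = BeautifulSoup(contents, features="lxml")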
for child in soup.descendants:
    if child.name:
        print(child.name)
The final piece of code loops over the soup.descendants generator, checks whether each node is a tag (i.e., has a name), and prints that name to the PyCharm console. The results can also easily be exported to a .csv file, but we will get to that later.
To get the content of tags, this is what we can do:
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
print(soup.h2)
print(soup.p)
print(soup.li)
These simple instructions output each specified HTML tag with its full content, in the given order. Here's what the output should look like:
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies. However, two of the most popular types are residential and data center proxies. Here is a list of the most common types.
</p>
<li>Residential proxies</li>
You could also remove the HTML tags and print the text only by using, for example:
print(soup.li.text)
Which in our case will give the following output:
Residential proxies
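If the extracted text carries extra whitespace or newlines (as the text inside the <p> tag in our sample file does), Beautiful Soup's get_text() method with strip=True can tidy it up; a minimal, optional sketch:
# get_text(strip=True) removes leading and trailing whitespace from the extracted text.
print(soup.p.get_text(strip=True))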
Note that accessing a tag this way (e.g., soup.li) only returns the first instance of the specified tag. Let's continue to see how to find elements by ID or how to use the find_all method to filter elements by specific criteria.
We can use two similar ways to find elements by ID:
print(soup.find('ul', attrs={'id': 'proxytypes'}))
or
print(soup.find('ul', id='proxytypes'))
Both of these will output the same result in the Python Console:
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
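Once the list has been located by its ID, you can keep working with the element that was returned. For instance, here is a minimal sketch (using the same index.html sample) that prints only the text of the items inside that particular list:
# Find the <ul> element by its ID, then print the text of each <li> inside it.
proxy_list = soup.find('ul', id='proxytypes')
for item in proxy_list.find_all('li'):
    print(item.text)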
The find_all method is a great way to extract specific data from an HTML file. It accepts many criteria that make it a flexible tool allowing us to filter data in convenient ways. Yet for this tutorial we do not need anything more complex. Let’s find all items of our list and print them as text only:
for tag in soup.find_all('li'):
    print(tag.text)
This is how the full code should look:
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
for tag in soup.find_all('li'):
    print(tag.text)
And here’s the output:
Residential proxies
Datacenter proxies
Shared proxies
Semi-dedicated proxies
Private proxies
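As a small illustration of the flexibility mentioned above (none of this is needed for the rest of the tutorial), find_all can also take a list of tag names or a limit on the number of matches:
# Match several kinds of elements at once by passing a list of tag names.
for tag in soup.find_all(['h2', 'p']):
    print(tag.name, ':', tag.text.strip())
# Or cap the number of results with the limit argument.
for tag in soup.find_all('li', limit=2):
    print(tag.text)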
Congratulations, you should now have a basic understanding of how Beautiful Soup might be used to parse data. It should be noted that the information presented in this article is useful as introductory material, yet real-world web scraping with Beautiful Soup and the subsequent parsing of data is usually much more complicated than this. For a more in-depth look at Beautiful Soup, you will hardly find a better source than its documentation, so be sure to check it out too.
A very common real-world application would be exporting data to a .csv file for later analysis. Although this is outside the scope of this tutorial, let’s take a quick look at how this might be achieved.
First, you would need to install the pandas library, which provides data structures for working with tabular data in Python. This can be easily done by using:
pip install pandas
You should also add this line to the beginning of your code to import the library:
import pandas as pd
Going further, let’s add some lines that will export the list we extracted earlier to a .csv file. This is how our full code should look:
from bs4 import BeautifulSoup
import pandas as pd
with open('index.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
# Collect the text of each <li> element so the CSV contains clean values.
results = [tag.text for tag in soup.find_all('li')]
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
What happened here? Let’s take a look:
results = [tag.text for tag in soup.find_all('li')]
This line finds all instances of the <li> tag and stores their text content in the results list.
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
And here we see the pandas library at work, storing our results in a table (a DataFrame) and exporting it to a .csv file.
If all went well, a new file titled names.csv should appear in the directory of your Python project, and inside you should see a table with the proxy types list. That’s it! You now not only know how extracting data from HTML works but can also programmatically export it to a new file.
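If you’d like to verify the export without leaving Python, reading the file back with pandas is a quick optional check:
# Read the exported CSV back in and print it to confirm its contents.
print(pd.read_csv('names.csv'))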
As you can see, Beautiful Soup is a highly useful HTML parser. With a relatively low learning curve, you can quickly grasp how to navigate, search, and modify the parse tree. Add libraries such as pandas for further manipulation and analysis, and you have a powerful toolkit for a nearly endless range of data collection and analysis use cases.
And if you’d like to expand your knowledge on Python web scraping in general and get familiar with other libraries used with it, I very much recommend heading over to What is Python used for? and Python Requests.
About the author
Adomas Sulcas
PR Team Lead
Adomas Sulcas is a PR Team Lead at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.