Proxy locations

Europe

North America

South America

Asia

Africa

Oceania

See all locations

Network statusCareers

Back to blog

How to Read HTML Tables With Pandas

Iveta Vistorskyte

Iveta Vistorskyte

2021-09-017 min read
Share

Pandas is one of the most popular Python libraries for data analysis. This library has many useful functions. One such function is pandas read_html. It can convert HTML tables into pandas DataFrame efficiently. 

This tutorial will show you how useful pd read_html can be, especially when combined with other helpful functions.

Getting started with pandas read_html

HTML tables in strings

As mentioned before, the function that we’ll be using is pd read_html. For details on how to read HTML tables from a variety of sources, you can also head over to the official documentation.

First of all, we’re going to install pandas.

Pandas can be installed using the pip command or conda command if you’re using Anaconda.

pip3 install pandas pyarrow
conda install pandas pyarrow
Link to GitHub

You must also install the lxml, html5lib, BeautifulSoup4, and Matplotlib libraries to facilitate reading & parsing the HTML and plotting the information.

Once pandas is installed, you’ll need to create a new Python file. Alternatively, you can also create a Jupyter Notebook.

Let’s start with extracting data from HTML tables in strings.

In the following line of the code, a variable contains HTML. You should note that we’re using Python’s triple quote conventions to store multiline strings in a variable easily.

html = '''
<table>
    <thead>
        <tr>
            <th>Sequence</th>
            <th>Country</th>
            <th>Population</th>
            <th>Updated</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>1</td>
            <td>China</td>
            <td>1,439,323,776</td>
            <td>1-Dec-2020</td>
        </tr>
        <tr>
            <td>2</td>
            <td>India</td>
            <td>1,380,004,385</td>
            <td>1-Dec-2020</td>
        </tr>
        <tr>
            <td>3</td>
            <td>United States</td>
            <td>331,002,651</td>
            <td>1-Dec-2020</td>
        </tr>
    </tbody>
</table>'''
Link to GitHub

The next step is to import pandas and call the pandas read_html function. The variable html that was created above will be used for this. The pandas read_html function will extract data from HTML tables and return a list of all the tables.

import pandas as pd
from io import StringIO

html_wrapped = StringIO(html)
df_list = pd.read_html(html_wrapped)
Link to GitHub

Note that pandas read_html function returns a list of Pandas DataFrame objects.

In this case, there’s only one table. This means that the variable df_list will be a list of pandas DataFrame objects, with a length of one.

This can be verified by checking the length of the df_list variable.

print(len(df_list))
# OUTPUT: 1
Link to GitHub

Let’s check the content of the DataFrame by printing it. Add the following print statement and run this file from the terminal.

print(df_list[0])
Link to GitHub

When you run from the terminal, the data from HTML tables will be extracted and displayed as follows:

$ python3 read_html.py
   Sequence        Country  Population     Updated
0         1          China  1439323776  1-Dec-2020
1         2          India  1380004385  1-Dec-2020
2         3  United States   331002651  1-Dec-2020
Link to GitHub

If you’re using Jupyter Notebook, the output of the same command will have a better appearance.

pandas DataFrame created from a string

pandas DataFrame created from a string

Now, it’s time to perform a basic cleanup.

Cleanup and processing of HTML table data

The first thing that looks a bit off is the index column. 

Whenever a new pandas DataFrame is created, pandas adds the index by default. It’s a numeric index starting with 0. In the HTML table, however, the first column Sequence is the actual index for this table.

This index column can be easily updated by calling the set_index() function of the DataFrame.

It’s intentional that most of the changes to the pandas DataFrame aren’t permanent. Pandas is made for handling a large amount of data. Most functions have an optional parameter inplace. The default value of this parameter is False. Alternatively, this function that sets the index column returns a new pandas DataFrame. This DataFrame can be stored in another variable as follows:

population = df_list[0].set_index('Sequence')
print(population)
Link to GitHub

Once again, let’s take a look at the output from the Jupyter Notebook of this new DataFrame.

pandas DataFrame after updating the index column

pandas DataFrame after updating the index column

As mentioned, pandas can handle a large amount of data. It’s also optimized for numeric and date data types. Therefore, it’s important to ensure that the pandas read_html function has correctly interpreted data from HTML tables.

The data types can be checked by calling the info() function of the DataFrame as follows:

print(population.info())
Link to GitHub

The output will be as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 1 to 3
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     3 non-null      object
 1   Population  3 non-null      int64 
 2   Updated     3 non-null      object
dtypes: int64(1), object(2)
Link to GitHub

Note the Dtype for the column Updated is object. It means that the pandas read_html function didn’t understand that this column is dated. 

There are multiple ways to do this. The easiest of these methods is to use one more parameter of the pandas read_html function. This parameter is parse_dates

Pandas read_html has one mandatory parameter – io. This parameter can send a string, as we’ve done so far. This parameter can also accept the file path and URL. We’ll cover this in the later paragraphs of this tutorial.

What’s more interesting is that there are thirteen optional parameters for the pandas read_html function. Out of these thirteen, let’s focus on one parameter – parse_dates. This parameter can mainly take a list. This list can be the column numbers or column names. Even if you’re parsing only one column from the html_table, this parameter must be a list.

Looking at the HTML table that we’re working with, the Updated column is at index 3, as the indexing in Python begins with 0. Thus, both the following statement will ensure that the Updated column is correctly parsed into a date-time data type:

pd.read_html(html_wrapped, parse_dates=[3])
pd.read_html(html_wrapped, parse_dates=['Updated'])
Link to GitHub

Before we put this together, let’s discuss one more useful parameter. This parameter is useful for setting the correct index. It eliminates the need for calling the set_index function.

In this example, pandas read_html is being sent three parameters.

df_list= pd.read_html(html_wrapped, parse_dates=['Updated'], index_col=0)
population = df_list[0]
print(population)
Link to GitHub

The immediate change that can be seen is that the values in the Updated column now show dates in a different format than the HTML table. This time, if the .info() function is called, the DataFrame will have correct data types:

DataFrame with date-time data type

DataFrame with date-time data type

Invalid and imperfect HTML

The HTML that we used in the previous example is valid. The column headings are in <th>, and the whole row is wrapped in the <thead> tag. Practically, this won’t always be the case.

If the heading in the HTML table is embedded in regular <tr> and <td> tags, the DataFrame will be created with default numeric columns.

Column's headings as rows

Column’s headings as rows

In such cases, you can use another optional parameter of the pandas read_html method as follows:

pd.read_html(html_no_head, header=0)
Link to GitHub

This will set the correct column headings.

If some <td> tags are missing, they will be replaced by NaN. Optionally, you can set the keep_default_na parameter to False, and then these will be blanks in the DataFrame. 

For further cleanup, you can also use the pandas replace function. For details on this, see the documentation

Extracting HTML tables from files

Extracting data from HTML tables that are in HTML files is almost the same as reading from strings. However, you should note the main difference.

Here’s one local file opened in Chrome:

HTML file in local disk

HTML file in local disk

Instead of the HTML string, the pandas read_html needs the file path, relative or absolute.

Assuming that the population.html file contains the HTML table with population information, which is currently located in the tmp folder, you can read the HTML table as follows:

population_file= pd.read_html("/tmp/population.html", parse_dates=['Updated'], index_col=0)
print(population_file[0])
Link to GitHub
HTML file converted to DataFrame

HTML file converted to DataFrame

Now the HTML file has been read, indexes are set correctly, and the dates have been parsed, just like it was done for an HTML table stored in a string.

Extracting HTML tables from URLs

Pandas can directly connect to web URLs and read HTML tables. This functionality can be used for further Python web scraping. In this section, we’ll work on a fictional web scraping project. This web scraping project involves extracting the best books.

Extracting data from HTML tables is the same, whether the source is a string containing HTML, a local HTML file path, or, in this case, any website that contains HTML. In this practical example, the URL will be Top 100 Best Sci-Fi Novels.

The first step is to extract the list of tables using pandas read_html function. Next, we’ll check the length of the tables returned.

import pandas as pd
list_of_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels&oldid=1091082777")
print(len(list_of_df))
# OUTPUT: 7
Link to GitHub

To get to the exact table, there are multiple approaches possible. The first approach is to simply check all the indexes and find which index contains the required table. However, this method isn’t efficient because the number of tables can change, and the web scraping script we’re creating will fail.

There are two better ways to extract the required table. The first method is to use a regular expression. To use regular expressions, first, we need to identify any pattern inside the <table> that we want to scrape. Open the URL in a browser, and right-click the table, and click inspect.

HTML table markup

HTML table markup

When examining the HTML markup of the site, we can see that the <caption> tag contains the text: “The 100 Best Novels.” This <caption> is inside the <table> tag that we need.

The first step is to create a regular expression. In this case, the regular expression can simply be the text “The 100 Best Novels.” Note that this is case-sensitive and space-sensitive.

This regular expression can now be supplied to the optional parameter match of the pandas read_html function.

import pandas as pd
list_of_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels&oldid=1091082777", match='The 100 Best Novels')
print(len(list_of_df))
# OUTPUT: 1
Link to GitHub

Note that in this particular case, the text “The 100 Best Novels” worked. However, the match parameter can take any regular expression. For example:

[\w\s]+(\d){3}[\w\s]+

It means that the following line of code will also extract an example of the same table:

list_of_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels&oldid=1091082777", match=r'[\w\s]+(\d){3}[\w\s]+')

One more way to extract the required table is by using specific attributes. For example, the table that we need has the CSS class wikitable applied to it. This information can be supplied to the attrs parameter of pandas read_html function:

pd.read_html("https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels&oldid=1091082777", attrs={'class':"wikitable"})
Link to GitHub

Now that we have the data, it’s time for analysis.

Analyzing and visualizing scraped data

This web scraping project can be taken further by analyzing data. Let’s find the author who has written the most of the books in this Top 100 list:

df=list_of_df[0]
df.value_counts(subset=['Author'])
Link to GitHub

This will print the following pandas series:

Author               
Philip K. Dick           6
J. G. Ballard            4
Robert A. Heinlein       3
Brian Aldiss             3
Thomas M. Disch          3
                        ..
Link to GitHub

It gives us the information that Philip K. Dick has written 6 books out of the 100 best books. If needed, you can also plot charts to represent the same information.

First, this series can be converted to a pandas DataFrame for further analysis. This can be done by calling the reset_index function and giving the count of books a name. In this case, the name is BookCount:

df = df.value_counts(subset=['Author']).reset_index(name='BookCount')
Link to GitHub

The next step is to make a subset of this DataFrame, where authors have published 3 or more books out of these Top 100:

top_df = df[df['BookCount'] >= 3]
print(top_df)
Link to GitHub

The output will be the following DataFrame:

  Author  BookCount
0      Philip K. Dick          6
1       J. G. Ballard          4
2  Robert A. Heinlein          3
3        Brian Aldiss          3
4     Thomas M. Disch          3
Link to GitHub

Let's install the matplotlib library:

pip install matplotlib

Finally, this data can be plotted as a horizontal bar chart:

top_df.plot.barh(x='Author',y='BookCount',figsize=(12,5))
plt.show()
Link to GitHub
Authors with three or more books on the Top 100 list

Authors with three or more books on the Top 100 list

Click here and check out a repository on GitHub to find the complete code used in this article.

Conclusion

Pandas read_html can be a very powerful and easy tool for web scraping. What makes this tool best is its simplicity and data analysis capabilities. However, no library is perfect, and pandas is also no exception. One limitation of read_html is that the User-Agent and any other header can not be changed. Many websites will reject the request because of this.

Also, the biggest limitation is that it’s only capable of retrieving HTML tables that aren’t loaded with JavaScript. This is quite often the case.

Oxylabs’ web scraper can help retrieve tables that are loaded with JavaScript. These solutions can handle almost any kind of website without having to worry about User-Agent or rendering.

About the author

Iveta Vistorskyte

Iveta Vistorskyte

Lead Content Manager

Iveta Vistorskyte is a Lead Content Manager at Oxylabs. Growing up as a writer and a challenge seeker, she decided to welcome herself to the tech-side, and instantly became interested in this field. When she is not at work, you'll probably find her just chillin' while listening to her favorite music or playing board games with friends.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.

Related articles

Get the latest news from data gathering world

I’m interested

IN THIS ARTICLE:


  • Getting started with pandas read_html


  • Cleanup and processing of HTML table data


  • Invalid and imperfect HTML


  • Extracting HTML tables from files


  • Extracting HTML tables from URLs


  • Analyzing and visualizing scraped data


  • Conclusion

Web Scraper API for smooth scraping

Enjoy efficient web scraping while avoiding CAPTCHA and IP blocks.

Scale up your business with Oxylabs®