How to read HTML tables with pandas

Iveta Vistorskyte

Sep 01, 2021 12 min read

Pandas is one of the most popular Python libraries for data analysis. Among its many useful functions is pandas read_html, which can efficiently convert HTML tables into pandas DataFrame objects.

This tutorial will show you how useful pandas read_html can be, especially when combined with other helpful functions.

Getting started with pandas read_html

HTML tables in strings

As mentioned before, the function that we’ll be using is pandas read_html. For details on how to read HTML tables from a variety of sources, you can also head over to the official documentation.

First of all, we’re going to install pandas.

Pandas can be installed using the pip command or, if you’re using Anaconda, the conda command.

pip3 install pandas
conda install pandas

Once pandas is installed, you’ll need to create a new Python file. Alternatively, you can also create a Jupyter Notebook.

Let’s start with extracting data from HTML tables in strings.

In the following code, a variable contains the HTML. Note that we’re using Python’s triple-quote convention to store a multiline string in a variable easily.

html = '''
<table>
    <thead>
        <tr>
            <th>Sequence</th>
            <th>Country</th>
            <th>Population</th>
            <th>Updated</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>1</td>
            <td>China</td>
            <td>1,439,323,776</td>
            <td>1-Dec-2020</td>
        </tr>
        <tr>
            <td>2</td>
            <td>India</td>
            <td>1,380,004,385</td>
            <td>1-Dec-2020</td>
        </tr>
        <tr>
            <td>3</td>
            <td>United States</td>
            <td>331,002,651</td>
            <td>1-Dec-2020</td>
        </tr>
    </tbody>
</table>'''

The next step is to import pandas and call the pandas read_html function. The variable html that was created above will be used for this. The pandas read_html function will extract data from HTML tables and return a list of all the tables.

import pandas as pd
 
df_list = pd.read_html(html) 

Note that the pandas read_html function returns a list of pandas DataFrame objects.

In this case, there’s only one table. This means that the variable df_list will be a list of pandas DataFrame objects, with a length of one.

This can be verified by checking the length of the df_list variable.

print(len(df_list))
# OUTPUT: 1

Let’s check the content of the DataFrame by printing it. Add the following print statement and run this file from the terminal.

print(df_list[0])

When you run this file from the terminal, the data from the HTML table will be extracted and displayed as follows:

$ python3 read_html.py
   Sequence        Country  Population     Updated
0         1          China  1439323776  1-Dec-2020
1         2          India  1380004385  1-Dec-2020
2         3  United States   331002651  1-Dec-2020

If you’re using Jupyter Notebook, the output of the same command will have a better appearance.

pandas DataFrame created from string

Now, it’s time to perform a basic cleanup.

Cleanup and processing of HTML table data

The first thing that looks a bit off is the index column. 

Whenever a new pandas DataFrame is created, pandas adds the index by default. It’s a numeric index starting with 0. In the HTML table, however, the first column Sequence is the actual index for this table.

This index column can be easily updated by calling the set_index() function of the DataFrame.

By design, most changes to a pandas DataFrame aren’t permanent. Pandas is made for handling large amounts of data, and most of its functions have an optional parameter inplace, whose default value is False. Instead of modifying the DataFrame in place, set_index() returns a new pandas DataFrame. This DataFrame can be stored in another variable as follows:

population = df_list[0].set_index('Sequence')
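To make the distinction concrete, here’s a minimal sketch using a small stand-in DataFrame (not the population table itself):

```python
import pandas as pd

df = pd.DataFrame({'Sequence': [1, 2], 'Country': ['China', 'India']})

# Default behavior: set_index() returns a new DataFrame
# and leaves the original untouched.
indexed = df.set_index('Sequence')
print('Sequence' in df.columns)   # True: df still has the column

# With inplace=True, df itself is modified and set_index() returns None.
df.set_index('Sequence', inplace=True)
print('Sequence' in df.columns)   # False: the column became the index
```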

Once again, let’s take a look at the output from the Jupyter Notebook of this new DataFrame.

pandas DataFrame after updating index column

As mentioned, pandas can handle a large amount of data. It’s also optimized for numeric and date data types. Therefore, it’s important to ensure that pandas read_html function has correctly interpreted data from HTML tables.

The data types can be checked by calling the info() function of the DataFrame as follows:

population.info()

The output will be as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 1 to 3
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     3 non-null      object
 1   Population  3 non-null      int64 
 2   Updated     3 non-null      object
dtypes: int64(1), object(2)

Note that the Dtype for the column Updated is object. It means that the pandas read_html function didn’t recognize that this column contains dates.

There are multiple ways to fix this. The easiest is to use one more parameter of the pandas read_html function: parse_dates.

Pandas read_html has one mandatory parameter: io. This parameter accepts a string, as we’ve done so far, and can also accept a file path or a URL. We’ll cover these in the later sections of this tutorial.

What’s more interesting is that there are thirteen optional parameters for the pandas read_html function. Out of these thirteen, let’s focus on one parameter: parse_dates. This parameter takes a list of column numbers or column names. Even if you’re parsing only one column from the HTML table, this parameter must be a list.

Looking at the HTML table that we’re working with, the Updated column is at index 3, as indexing in Python begins with 0. Thus, both of the following statements will ensure that the Updated column is correctly parsed into a date-time data type:

pd.read_html(html, parse_dates=[3])
pd.read_html(html, parse_dates=['Updated'])

Before we put this together, let’s discuss one more useful parameter: index_col. It sets the correct index column directly and eliminates the need for calling the set_index function.

In this example, pandas read_html is being sent three parameters.

df_list = pd.read_html(html, parse_dates=['Updated'], index_col=0)
population = df_list[0]
population

The immediate change that can be seen is that the values in the Updated column now show dates in a different format than the HTML table. This time, if the .info() function is called, the DataFrame will have correct data types:

DataFrame with date-time data type
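The same check can be reproduced in code. The snippet below is a self-contained sketch with a trimmed-down version of the table:

```python
import io

import pandas as pd

# A trimmed-down version of the population table from earlier.
html = '''<table>
<tr><th>Sequence</th><th>Country</th><th>Updated</th></tr>
<tr><td>1</td><td>China</td><td>1-Dec-2020</td></tr>
<tr><td>2</td><td>India</td><td>1-Dec-2020</td></tr>
</table>'''

# parse_dates converts the Updated column, and index_col=0 makes
# Sequence the index. Wrapping the string in StringIO keeps newer
# pandas versions from warning about literal HTML strings.
population = pd.read_html(io.StringIO(html), parse_dates=['Updated'], index_col=0)[0]
print(population.dtypes['Updated'])  # datetime64[ns]
```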

Invalid and imperfect HTML

The HTML that we used in the previous example is valid. The column headings are in <th> tags, and the whole heading row is wrapped in the <thead> tag. In practice, this won’t always be the case.

If the heading in the HTML table is embedded in regular <tr> and <td> tags, the DataFrame will be created with default numeric columns.

Column's headings as rows

In such cases, you can use another optional parameter of the pandas read_html function as follows:

pd.read_html(html_no_head, header=0)

This will set the correct column headings.
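As a concrete sketch, the html_no_head string below is our own stand-in for such markup, with the headings sitting in ordinary <td> cells:

```python
import io

import pandas as pd

# The headings live in a regular <tr> of <td> cells; there is no
# <thead> or <th> markup at all.
html_no_head = '''<table>
<tr><td>Sequence</td><td>Country</td><td>Population</td></tr>
<tr><td>1</td><td>China</td><td>1439323776</td></tr>
<tr><td>2</td><td>India</td><td>1380004385</td></tr>
</table>'''

# header=0 tells pandas to treat row 0 as the column headings.
df = pd.read_html(io.StringIO(html_no_head), header=0)[0]
print(df.columns.tolist())  # ['Sequence', 'Country', 'Population']
```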

If some <td> tags are missing, the corresponding cells will be filled with NaN. Optionally, you can set the keep_default_na parameter to False, and these cells will be blank in the DataFrame instead.

For further cleanup, you can also use the pandas replace function. For details on this, see the documentation.
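Both behaviors can be sketched with our own small table containing an empty cell:

```python
import io

import pandas as pd

# A table whose second data row has an empty Population cell.
html_gaps = '''<table>
<tr><th>Country</th><th>Population</th></tr>
<tr><td>China</td><td>1439323776</td></tr>
<tr><td>India</td><td></td></tr>
</table>'''

# By default, the empty cell becomes NaN.
df = pd.read_html(io.StringIO(html_gaps))[0]

# With keep_default_na=False, the empty cell stays an empty string,
# which replace() can then swap for a descriptive placeholder.
df_blank = pd.read_html(io.StringIO(html_gaps), keep_default_na=False)[0]
df_clean = df_blank.replace('', 'unknown')
```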

Extracting HTML tables from files

Extracting data from HTML tables stored in HTML files is almost the same as reading from strings, with one main difference.

Here’s one local file opened in Chrome:

HTML file in local disk

Instead of an HTML string, pandas read_html needs the file path, relative or absolute.

population_file = pd.read_html("/tmp/population.html", parse_dates=['Updated'], index_col=0)
population_file[0]
HTML file converted to DataFrame

Now the HTML file has been read, indexes are set correctly, and the dates have been parsed, just like it was done for an HTML table stored in a string.

Extracting HTML tables from URLs

Pandas can directly connect to web URLs and read HTML tables. This functionality can be used for web scraping. In this section, we’ll work on a fictional web scraping project: extracting a list of the best sci-fi novels.

Extracting data from HTML tables is the same, whether the source is a string containing HTML, a local HTML file path, or, in this case, any website that contains HTML. In this practical example, the URL will be the Wikipedia page Science Fiction: The 100 Best Novels.

The first step is to extract the list of tables using pandas read_html function. Next, we’ll check the length of the tables returned.

import pandas as pd 
list_of_df = pd.read_html("https://en.wikipedia.org/wiki/Science_Fiction:_The_100_Best_Novels")
len(list_of_df)
# OUTPUT: 7

To get to the exact table, multiple approaches are possible. The first approach is to simply check all the indexes and find which index contains the required table. However, this method isn’t reliable because the number of tables on the page can change, and the web scraping script we’re creating would then fail.

There are two better ways to extract the required table. The first method is to use a regular expression. To use regular expressions, we first need to identify a pattern inside the <table> that we want to scrape. Open the URL in a browser, right-click the table, and click Inspect.

HTML table markup

When examining the HTML markup of the site, we can see that the <caption> tag contains the text: “The 100 Best Novels.” This <caption> is inside the <table> tag that we need.

The first step is to create a regular expression. In this case, the regular expression can simply be the text “The 100 Best Novels.” Note that this is case-sensitive and space-sensitive.

This regular expression can now be supplied to the optional parameter match of the pandas read_html function.

import pandas as pd
list_of_df = pd.read_html("https://en.wikipedia.org/wiki/Science_Fiction:_The_100_Best_Novels", match='The 100 Best Novels')
len(list_of_df)
# OUTPUT: 1

Note that in this particular case, the text “The 100 Best Novels” worked. However, the match parameter can take any regular expression. For example:

[\w\s]+(\d){3}[\w\s]+

It means that the following line of code will also extract the same table (note the raw string prefix, which keeps Python from interpreting the backslashes):

pd.read_html("https://en.wikipedia.org/wiki/Science_Fiction:_The_100_Best_Novels", match=r'[\w\s]+(\d){3}[\w\s]+')

One more way to extract the required table is by using specific attributes. For example, the table that we need has the CSS class wikitable applied to it. This information can be supplied to the attrs parameter of the pandas read_html function:

pd.read_html("https://en.wikipedia.org/wiki/Science_Fiction:_The_100_Best_Novels", attrs={'class': 'wikitable'})

Now that we have the data, it’s time for analysis.

Analyzing and visualizing scraped data

This web scraping project can be taken further by analyzing the data. Let’s find the author who has written the most books in this Top 100 list:

df = list_of_df[0]
df.value_counts(subset=['Author'])

This will print the following pandas series:

Author               
Philip K. Dick           6
J. G. Ballard            4
Robert A. Heinlein       3
Brian Aldiss             3
Thomas M. Disch          3
                        ..

This tells us that Philip K. Dick has written 6 of these 100 best books. If needed, you can also plot charts to represent the same information.

First, this series can be converted to a pandas DataFrame for further analysis. This can be done by calling the reset_index function and giving the count of books a name. In this case, the name is BookCount:

df = df.value_counts(subset=['Author']).reset_index(name='BookCount')

The next step is to make a subset of this DataFrame, where authors have published 3 or more books out of these Top 100:

top_df = df[df['BookCount'] >= 3]
print(top_df)

The output will be the following DataFrame:

               Author  BookCount
0      Philip K. Dick          6
1       J. G. Ballard          4
2  Robert A. Heinlein          3
3        Brian Aldiss          3
4     Thomas M. Disch          3

And finally, this data can be plotted as a horizontal bar chart:

top_df.plot.barh(x='Author', y='BookCount', figsize=(12, 5))
Authors with three or more books in Top 100 list

Conclusion

Pandas read_html can be a very powerful and easy tool for web scraping. What makes it stand out is its simplicity and data analysis capabilities. However, no library is perfect, and pandas is no exception. One limitation of read_html is that the User-Agent and other request headers cannot be changed, so many websites will reject its requests.
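One common workaround, sketched below, is to fetch the page yourself with custom headers and hand the downloaded HTML to read_html. The helper name and User-Agent string here are our own illustrative choices; the fetch uses Python’s built-in urllib:

```python
import io
import urllib.request

import pandas as pd

def read_html_tables(url, user_agent="Mozilla/5.0 (compatible; my-scraper/1.0)", **kwargs):
    """Fetch url with a custom User-Agent, then parse its tables with pandas."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8")
    # read_html works on the downloaded string just as it does on a URL.
    return pd.read_html(io.StringIO(html), **kwargs)
```

Calling read_html_tables('https://en.wikipedia.org/wiki/Science_Fiction:_The_100_Best_Novels', match='The 100 Best Novels') then behaves like the earlier examples, but with a User-Agent the server is less likely to reject.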

Also, the biggest limitation is that read_html can only retrieve HTML tables that aren’t loaded with JavaScript, and JavaScript-rendered tables are quite common.

Oxylabs’ web scraping solutions can help retrieve tables that are loaded with JavaScript. These solutions can handle almost any kind of website without you having to worry about the User-Agent or rendering.


About Iveta Vistorskyte

Iveta Vistorskyte is a Content Manager at Oxylabs. Growing up as a writer and a challenge seeker, she decided to welcome herself to the tech-side, and instantly became interested in this field. When she is not at work, you'll probably find her just chillin' while listening to her favorite music or playing board games with friends.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
