Pandas is one of the most popular Python libraries for data analysis. Among its many useful functions is pandas read_html, which can efficiently convert HTML tables into pandas DataFrames.
This tutorial will show you how useful pandas read_html can be, especially when combined with other helpful functions.
Getting started with pandas read_html
HTML tables in strings
As mentioned before, the function that we’ll be using is pandas read_html. For details on how to read HTML tables from a variety of sources, you can also head over to the official documentation.
First of all, we’re going to install pandas. Pandas can be installed using the pip command or, if you’re using Anaconda, the conda command:
pip3 install pandas
conda install pandas
Once pandas is installed, you’ll need to create a new Python file. Alternatively, you can also create a Jupyter Notebook.
Let’s start with extracting data from HTML tables in strings.
In the following lines of code, a variable contains HTML. Note that we’re using Python’s triple-quote convention to store a multiline string in a variable easily.
html = '''
<table>
  <thead>
    <tr>
      <th>Sequence</th>
      <th>Country</th>
      <th>Population</th>
      <th>Updated</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>China</td>
      <td>1,439,323,776</td>
      <td>1-Dec-2020</td>
    </tr>
    <tr>
      <td>2</td>
      <td>India</td>
      <td>1,380,004,385</td>
      <td>1-Dec-2020</td>
    </tr>
    <tr>
      <td>3</td>
      <td>United States</td>
      <td>331,002,651</td>
      <td>1-Dec-2020</td>
    </tr>
  </tbody>
</table>
'''
The next step is to import pandas and call the read_html function, passing it the html variable created above. The read_html function extracts the data from all the HTML tables it finds and returns them as a list:
import pandas as pd

df_list = pd.read_html(html)
Note that the pandas read_html function returns a list of pandas DataFrame objects. In this case, there’s only one table, so the variable df_list will be a list of DataFrames with a length of one.
This can be verified by checking the length of the list:
print(len(df_list)) # OUTPUT: 1
Let’s check the content of the DataFrame by printing it. Add the following print statement and run this file from the terminal:
print(df_list[0])
When you run it from the terminal, the data from the HTML table will be extracted and displayed as follows:
$ python3 read_html.py
   Sequence        Country  Population     Updated
0         1          China  1439323776  1-Dec-2020
1         2          India  1380004385  1-Dec-2020
2         3  United States   331002651  1-Dec-2020
If you’re using Jupyter Notebook, the output of the same command will have a better appearance.
Now, it’s time to perform a basic cleanup.
Cleanup and processing of HTML table data
The first thing that looks a bit off is the index column. Whenever a new pandas DataFrame is created, pandas adds a numeric index starting with 0 by default. In this HTML table, however, the first column, Sequence, is the actual index.
This index column can be easily updated by calling the set_index() function of the DataFrame.
It’s intentional that most changes to a pandas DataFrame aren’t permanent. Pandas is made for handling large amounts of data, so most functions have an optional parameter, inplace, whose default value is False. Instead of modifying the DataFrame, set_index() returns a new DataFrame, which can be stored in another variable as follows:
population = df_list[0].set_index('Sequence')
Once again, let’s take a look at the output from the Jupyter Notebook of this new DataFrame.
As mentioned, pandas can handle large amounts of data. It’s also optimized for numeric and date data types. Therefore, it’s important to ensure that the pandas read_html function has correctly interpreted the data from the HTML table.
The data types can be checked by calling the info() function of the DataFrame as follows:
population.info()
The output will be as follows:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 1 to 3
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Country     3 non-null      object
 1   Population  3 non-null      int64
 2   Updated     3 non-null      object
dtypes: int64(1), object(2)
The Dtype for the Updated column is object. It means that the pandas read_html function didn’t recognize that this column contains dates.
There are multiple ways to fix this. The easiest of these is to use one more parameter of the pandas read_html function: parse_dates.
Pandas read_html has one mandatory parameter, io. This parameter can accept a string, as we’ve done so far. It can also accept a file path or a URL. We’ll cover these in the later paragraphs of this tutorial.
What’s more interesting is that there are thirteen optional parameters for the pandas read_html function. Out of these thirteen, let’s focus on one: parse_dates. This parameter takes a list of column numbers or column names. Even if you’re parsing only one column of the table, this parameter must be a list.
Looking at the HTML table that we’re working with, the Updated column is at index 3, as indexing in Python begins with 0. Thus, both of the following statements will ensure that the Updated column is correctly parsed into a date-time data type:
pd.read_html(html, parse_dates=[3])
pd.read_html(html, parse_dates=['Updated'])
Before we put this together, let’s discuss one more useful parameter: index_col. This parameter sets the correct index directly and eliminates the need for calling the set_index() function afterwards.
In this example, pandas read_html is being sent three parameters:
df_list = pd.read_html(html, parse_dates=['Updated'], index_col=0)
population = df_list[0]
population
The immediate change that can be seen is that the values in the Updated column are now displayed in a different format than in the HTML table. This time, if the info() function is called, the DataFrame will have the correct data types.
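Putting the pieces together, here is a minimal, self-contained sketch of the steps above (the html string is repeated so that the snippet runs on its own):

```python
import pandas as pd
from io import StringIO

# The same three-row table used throughout this section
html = '''
<table>
  <thead>
    <tr><th>Sequence</th><th>Country</th><th>Population</th><th>Updated</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>China</td><td>1,439,323,776</td><td>1-Dec-2020</td></tr>
    <tr><td>2</td><td>India</td><td>1,380,004,385</td><td>1-Dec-2020</td></tr>
    <tr><td>3</td><td>United States</td><td>331,002,651</td><td>1-Dec-2020</td></tr>
  </tbody>
</table>
'''

# StringIO avoids the deprecation of passing literal HTML in newer pandas
population = pd.read_html(StringIO(html), parse_dates=['Updated'], index_col=0)[0]

print(population.dtypes)  # Updated is now datetime64 instead of object
```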
Invalid and imperfect HTML
The HTML that we used in the previous example is valid: the column headings are in <th> tags, and the whole heading row is wrapped in a <thead> tag. In practice, this won’t always be the case.
If the headings in the HTML table are embedded in regular <td> tags, the DataFrame will be created with default numeric column names.
In such cases, you can use another optional parameter of the pandas read_html function: header. Setting header=0 tells pandas to treat the first row of the table as the column headings.
If values in some <td> tags are missing, they will be replaced by NaN. Optionally, you can set the keep_default_na parameter to False, and these cells will instead be left blank in the DataFrame.
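Here is a small illustrative sketch; the html_no_head table is hypothetical, with headings in plain <td> cells and one empty cell:

```python
import pandas as pd
from io import StringIO

# Hypothetical table: headings in <td> instead of <th>, plus one empty cell
html_no_head = '''
<table>
  <tr><td>Country</td><td>Capital</td></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Spain</td><td></td></tr>
</table>
'''

# header=0 promotes the first row to column headings
df = pd.read_html(StringIO(html_no_head), header=0)[0]
print(df.columns.tolist())  # ['Country', 'Capital']

# The empty cell is NaN by default; keep_default_na=False keeps it blank
df2 = pd.read_html(StringIO(html_no_head), header=0, keep_default_na=False)[0]
```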
For further cleanup, you can also use the pandas replace function. For details on this, see the documentation.
Extracting HTML tables from files
Extracting data from HTML tables stored in HTML files is almost the same as reading them from strings, with one main difference: what you pass as the first argument.
Here’s one local file opened in Chrome:
Instead of the HTML string, the pandas read_html function needs the file path, relative or absolute:
population_file = pd.read_html("/tmp/population.html", parse_dates=['Updated'], index_col=0)[0]
population_file
Now the HTML file has been read, the index is set correctly, and the dates have been parsed, just as was done for the HTML table stored in a string.
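The whole file workflow can be sketched end to end by first writing a small table to a temporary file, so that no hard-coded path is assumed:

```python
import os
import tempfile
import pandas as pd

html = '''
<table>
  <thead>
    <tr><th>Sequence</th><th>Country</th><th>Population</th><th>Updated</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>China</td><td>1,439,323,776</td><td>1-Dec-2020</td></tr>
    <tr><td>2</td><td>India</td><td>1,380,004,385</td><td>1-Dec-2020</td></tr>
  </tbody>
</table>
'''

# Write the table to a temporary HTML file
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html)
    path = f.name

# read_html accepts a file path just like an HTML string
population_file = pd.read_html(path, parse_dates=['Updated'], index_col=0)[0]
print(population_file)

os.remove(path)
```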
Extracting HTML tables from URLs
Pandas can directly connect to web URLs and read HTML tables. This functionality can be used for web scraping. In this section, we’ll work on a fictional web scraping project: extracting a list of the best science-fiction novels.
Extracting data from HTML tables is the same, whether the source is a string containing HTML, a local HTML file path, or, in this case, any website that contains HTML. In this practical example, the URL will be Top 100 Best Sci Fi Novels.
The first step is to extract the list of tables using the pandas read_html function. Next, we’ll check how many tables were returned:
import pandas as pd

list_of_df = pd.read_html("https://en.wikipedia.org/wiki/Science_Fiction:_The_100_Best_Novels")
len(list_of_df)
# OUTPUT: 7
To get to the exact table, there are multiple possible approaches. The first is to simply check each index and find which one contains the required table. However, this method isn’t reliable: the number of tables on the page can change, and the web scraping script we’re creating would fail.
There are two better ways to extract the required table. The first is to use a regular expression. To do that, we first need to identify a pattern inside the <table> that we want to scrape. Open the URL in a browser, right-click the table, and click Inspect.
When examining the HTML markup of the site, we can see that a <caption> tag contains the text “The 100 Best Novels,” and that this <caption> is inside the <table> tag that we need.
The first step is to create a regular expression. In this case, the regular expression can simply be the text “The 100 Best Novels.” Note that this is case-sensitive and space-sensitive.
This regular expression can now be supplied to the optional match parameter of the pandas read_html function:
import pandas as pd

list_of_df = pd.read_html("https://en.wikipedia.org/wiki/Science_Fiction:_The_100_Best_Novels", match='The 100 Best Novels')
len(list_of_df)
# OUTPUT: 1
Note that in this particular case, the exact text “The 100 Best Novels” worked. However, the match parameter can take any regular expression, so a shorter pattern that still matches the table’s caption would extract the same table.
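The idea can be sketched locally without a network call; the tiny tables below are illustrative stand-ins for the Wikipedia page, and the partial pattern '100 Best' still matches the caption:

```python
import pandas as pd
from io import StringIO

# Two stand-in tables; only the first caption matches the pattern
html = '''
<table>
  <caption>The 100 Best Novels</caption>
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td>Example</td></tr>
</table>
<table>
  <caption>Other data</caption>
  <tr><th>A</th></tr>
  <tr><td>1</td></tr>
</table>
'''

# match accepts any regular expression, so a partial pattern works too
tables = pd.read_html(StringIO(html), match='100 Best')
print(len(tables))  # 1
```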
One more way to extract the required table is by using specific attributes. For example, the table that we need has the CSS class wikitable applied to it. This information can be supplied to the attrs parameter of the pandas read_html function.
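Again as a local sketch with stand-in tables (on the real page, you’d pass the Wikipedia URL instead of the string):

```python
import pandas as pd
from io import StringIO

# Two stand-in tables; only one carries the wikitable CSS class
html = '''
<table class="wikitable">
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td>Example</td></tr>
</table>
<table class="other">
  <tr><th>A</th></tr>
  <tr><td>1</td></tr>
</table>
'''

# attrs filters tables by their HTML attributes
tables = pd.read_html(StringIO(html), attrs={'class': 'wikitable'})
print(len(tables))  # 1
```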
Now that we have the data, it’s time for analysis.
Analyzing and visualizing scraped data
This web scraping project can be taken further by analyzing the data. Let’s find the author who has written the most books in this Top 100 list:
df = list_of_df[0]
df.value_counts(subset=['Author'])
This will print the following pandas series:
Author
Philip K. Dick        6
J. G. Ballard         4
Robert A. Heinlein    3
Brian Aldiss          3
Thomas M. Disch       3
..
This tells us that Philip K. Dick has written 6 of these 100 best books. If needed, you can also plot charts to represent the same information.
First, this series can be converted to a pandas DataFrame for further analysis. This can be done by calling the reset_index function and giving the count of books a name; in this case, the name is BookCount:
df = df.value_counts(subset=['Author']).reset_index(name='BookCount')
The next step is to make a subset of this DataFrame, where authors have published 3 or more books out of these Top 100:
top_df = df[df['BookCount'] >= 3] print(top_df)
The output will be the following DataFrame:
               Author  BookCount
0      Philip K. Dick          6
1       J. G. Ballard          4
2  Robert A. Heinlein          3
3        Brian Aldiss          3
4     Thomas M. Disch          3
And finally, this data can be plotted as a horizontal bar chart:
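As a sketch of such a chart (the top_df values here are copied from the output above; pandas delegates plotting to matplotlib, which must be installed):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script also runs headless
import pandas as pd

# Stand-in for the top_df DataFrame computed above
top_df = pd.DataFrame({
    'Author': ['Philip K. Dick', 'J. G. Ballard', 'Robert A. Heinlein',
               'Brian Aldiss', 'Thomas M. Disch'],
    'BookCount': [6, 4, 3, 3, 3],
})

# Horizontal bar chart, most prolific author on top
ax = top_df.plot.barh(x='Author', y='BookCount', legend=False)
ax.invert_yaxis()
ax.figure.savefig('top_authors.png')
```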
Pandas read_html can be a very powerful yet easy-to-use tool for web scraping. What makes it stand out is its simplicity combined with pandas’ data analysis capabilities. However, no library is perfect, and pandas is no exception. One limitation of read_html is that the User-Agent and other request headers cannot be changed, and many websites will reject the request because of this.
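A common workaround (a sketch, not part of read_html itself) is to fetch the page yourself with Python’s standard library, setting the User-Agent there, and then hand the HTML text to read_html:

```python
import urllib.request
from io import StringIO

import pandas as pd

def read_html_with_headers(url, **kwargs):
    """Fetch url with a custom User-Agent, then parse its tables with pandas.
    pandas read_html alone cannot change the request headers."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode('utf-8', errors='replace')
    return pd.read_html(StringIO(html), **kwargs)

# Usage (performs a live request, so it's commented out here):
# tables = read_html_with_headers(
#     "https://en.wikipedia.org/wiki/Science_Fiction:_The_100_Best_Novels",
#     match='The 100 Best Novels')
```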