Vytenis Kaubrė
Whether it’s driving market analysis or powering advanced AI models like large language models (LLMs), web scraping remains indispensable for efficient data ingestion. Among the various types of data you might encounter, tables are one of the most common and useful formats found on websites. This article focuses specifically on scraping HTML tables, including complex structures, into a format suitable for further manipulation with Python.
HTML tables are commonly used in a variety of contexts to organize and present information in a clear, accessible format, making it easier for users to interpret and compare data. Popular uses of tabular data include financial figures, statistics, e-commerce product listings, and academic research reports.
An HTML table is structured using the <table> tag, which serves as the container for the table's content. Within this table, rows are defined using the <tr> tag (short for "table row"), and within each row, cells are created using the <td> tag (short for "table data") or the <th> tag (short for "table header").
Below is an example of a basic HTML table:
<table>
  <thead>
    <tr>
      <th>Country</th>
      <th>Population (2022)</th>
      <th>GDP - USD (2022)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>UK</td>
      <td>66.97 million</td>
      <td>3.089 trillion</td>
    </tr>
    <tr>
      <td>Canada</td>
      <td>38.93 million</td>
      <td>2.138 trillion</td>
    </tr>
  </tbody>
</table>
When it comes to easily extracting data from tables displayed on a website, you can use the BeautifulSoup and pandas libraries in Python. BeautifulSoup helps extract the data from an HTML document, while pandas helps analyze and manipulate the desired data once it’s extracted.
Other alternatives, such as Scrapy, lxml, and Selenium, are also available for more complex scraping tasks. This tutorial will focus on using BeautifulSoup, pandas, and Selenium to scrape complex tables that are spread over multiple web pages. To get more familiar with the libraries, take a look at these comprehensive tutorials covering how to read HTML tables with pandas and a Beautiful Soup tutorial.
Before getting started, you should also have some preliminary knowledge of HTML data and Python. However, if you’re an absolute beginner with Python or web scraping, this Python web scraping tutorial will get you on track.
The next sections will cover methods for extracting data from different types of tables: simple HTML tables, tables with multiple headers, and tables spread across multiple web pages.
Ensure you have Python 3.8 or above installed on your system. Next, install the requests, BeautifulSoup, pandas, lxml, and Selenium libraries using the following pip command (lxml is included because pandas’ read_html() needs an HTML parser backend):
pip install requests beautifulsoup4 pandas lxml selenium
Let’s use this Wikipedia page to extract data from a basic HTML table:
The table contains three columns and eleven rows. The first row is a header row made up of table header cells (<th>), while the remaining rows contain standard table data cells (<td>). The task is to scrape this table from the source page, then extract, clean, and save its data.
Create a new Python file and start by importing the following libraries:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO
After importing the required modules, send an HTTP request to the target web page to retrieve the HTML content of the page:
try:
    # Step 2: Send a request to the Wikipedia page.
    url = "https://en.wikipedia.org/wiki/List_of_highest-grossing_films"
    response = requests.get(url)
    response.raise_for_status()  # Check if the request was successful.
except requests.exceptions.RequestException as e:
    print(f"Error fetching the webpage: {e}")
Once the response variable is ready, parse it using the BeautifulSoup library and locate the table by its class attribute. Then, convert it to a DataFrame, as done in the last statement of the try block in the following code snippet:
try:
    # Step 3.1: Parse the HTML content with BeautifulSoup.
    soup = BeautifulSoup(response.content, "html.parser")
    # Step 3.2: Select the target table (the second "wikitable" on the page, index 1).
    table = soup.find_all("table", {"class": "wikitable"})[1]
    # Step 3.3: Read the table into a DataFrame using pandas.
    df = pd.read_html(StringIO(str(table)))[0]
except Exception as e:
    print(f"Error parsing the HTML or reading the table: {e}")
The str() function converts the extracted table element into an HTML string, which StringIO then wraps in an in-memory text stream. The read_html() method treats this stream like a file object and converts the table into a DataFrame.
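As a side note, pandas can also fetch and parse the page on its own, without BeautifulSoup. Below is a minimal alternative sketch, assuming the same Wikipedia URL and that the second matching table is the one you want; note that it relies on pandas’ own HTTP handling, which some websites may block:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_highest-grossing_films"
# Fetch the page and return every table whose class attribute matches.
tables = pd.read_html(url, attrs={"class": "wikitable"})
df = tables[1]  # Index 1 is an assumption mirroring the BeautifulSoup approach above.
print(df.head())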
Since the header of this table is a single non-nested row, handling it requires no special treatment. However, cleaning the DataFrame by removing any null values is always a good idea:
try:
    # Step 4: Clean up the DataFrame if necessary.
    # Clean up any unwanted rows or columns.
    df = df.dropna(how="all")  # Drop rows with all NaN values.
except Exception as e:
    print(f"Error cleaning or processing the DataFrame: {e}")
Finally, you can export the DataFrame to a CSV file for easy storage and sharing, or simply print it to view the extracted data. In the complete code below, the previous steps are wrapped in a function whose return value is stored in the films_df variable:
if films_df is not None:
    print("Successfully extracted the table!")
    print(films_df)

    # Save the DataFrame to a CSV file.
    file_name = "highest_grossing_films"
    films_df.to_csv(f"{file_name}.csv", index=False)
    print(f"\nData saved to '{file_name}.csv'.")
Let’s combine all the steps and have a look at the complete code:
# Step 1: Import the modules.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO

def fetch_highest_grossing_films_table():
    try:
        # Step 2: Send a request to the Wikipedia page.
        url = "https://en.wikipedia.org/wiki/List_of_highest-grossing_films"
        response = requests.get(url)
        response.raise_for_status()  # Check if the request was successful.
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the webpage: {e}")
        return None

    try:
        # Step 3.1: Parse the HTML content with BeautifulSoup.
        soup = BeautifulSoup(response.content, "html.parser")
        # Step 3.2: Select the target table (the second "wikitable" on the page, index 1).
        table = soup.find_all("table", {"class": "wikitable"})[1]
        # Step 3.3: Read the table into a DataFrame using pandas.
        df = pd.read_html(StringIO(str(table)))[0]
    except Exception as e:
        print(f"Error parsing the HTML or reading the table: {e}")
        return None

    try:
        # Step 4: Clean up the DataFrame if necessary.
        # Clean up any unwanted rows or columns.
        df = df.dropna(how="all")  # Drop rows with all NaN values.
        return df
    except Exception as e:
        print(f"Error cleaning or processing the DataFrame: {e}")
        return None

# Run the function and get the DataFrame.
films_df = fetch_highest_grossing_films_table()

if films_df is not None:
    print("Successfully extracted the table!")
    print(films_df)
    # Save the DataFrame to a CSV file.
    file_name = "highest_grossing_films"
    films_df.to_csv(f"{file_name}.csv", index=False)
    print(f"\nData saved to '{file_name}.csv'.")
else:
    print("Failed to extract the table.")
The code will yield the following output:
For this section, let’s use another Wikipedia table:
The table is organized with multiple levels of headers, indicating a complex HTML structure:
The first row of the table has headers for the country/territory and organizations like the IMF, World Bank, and United Nations.
The header's second level is in the table's second row. This row includes headers for the forecast, year, and estimate.
Create a new Python file and start by importing the following libraries:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO
As before, after importing the required modules, send a request to the target URL using the following code:
try:
    # Step 2: Send a request to the Wikipedia page.
    url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
    response = requests.get(url)
    response.raise_for_status()  # Check if the request was successful.
except requests.exceptions.RequestException as e:
    print(f"Error fetching the webpage: {e}")
Once the response of the HTTP request is ready in the response variable, parse the content using the BeautifulSoup library. Then, you can use a unique CSS selector to extract the target table data from the parsed soup object:
try:
    # Step 3.1: Parse the HTML content with BeautifulSoup.
    soup = BeautifulSoup(response.content, "html.parser")
    # Step 3.2: Find the specific table (usually the first large table on the page).
    tables = soup.find_all("table", {"class": "wikitable"})
    if not tables:
        print("Error: Could not find the expected table on the page.")
        return None
    table = tables[0]  # First table contains the list of countries by GDP (nominal).
    # Step 4: Read the table into a DataFrame using pandas.
    df = pd.read_html(StringIO(str(table)))[0]
except Exception as e:
    print(f"Error parsing the HTML or reading the table: {e}")
Since this page contains multiple tables, the code targets the first "wikitable" using index 0. If you need a different table from the page, adjust the index accordingly.
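If you’re not sure which index corresponds to the table you want, a quick way to check is to print each matched table’s caption. Here’s a small sketch, assuming the tables carry <caption> elements (not all of them do):
# Print the caption of every "wikitable" to identify the right index.
for i, t in enumerate(soup.find_all("table", {"class": "wikitable"})):
    caption = t.find("caption")
    print(i, caption.get_text(strip=True) if caption else "No caption")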
Since this table has multi-level headers, some of its columns will be multi-indexed (having a hierarchical or nested structure). Directly saving these multi-indexed header columns may make the CSV harder to interpret with standard spreadsheet software, which typically expects a flat column structure. Therefore, before saving it to a CSV document, you can simplify the nested header structure using appropriate flattening:
try:
    # Step 5: Clean up the DataFrame.
    # Flatten multi-level headers if they exist.
    if isinstance(df.columns, pd.MultiIndex):
        # Join the header levels, dropping empty strings left over from blank levels.
        df.columns = [" ".join(filter(None, map(str, col))) for col in df.columns]

    # Reset the index to clean up the DataFrame.
    df = df.reset_index(drop=True)
    return df
except Exception as e:
    print(f"Error cleaning or processing the DataFrame: {e}")
In the above code segment:
isinstance(df.columns, pd.MultiIndex) checks if the DataFrame df has multi-level (hierarchical) column headers.
In pandas, MultiIndex is used when the table has more than one level of column headers.
filter(None, ...) removes falsy values, such as the empty strings left over from blank header levels.
map(str, col) converts every element of the column header tuple to a string.
" ".join(...) joins the remaining strings with a space separator (see the short demonstration after this list).
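To see the flattening in isolation, here’s a small self-contained demonstration with a made-up two-level header (the column names are purely illustrative):
import pandas as pd

# A tiny DataFrame with a two-level column header, mimicking the scraped table.
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", ""),
    ("IMF", "Forecast"),
    ("IMF", "Year"),
])
df = pd.DataFrame([["UK", 3.3, 2024]], columns=columns)

# Apply the same flattening as above.
df.columns = [" ".join(filter(None, map(str, col))) for col in df.columns]
print(df.columns.tolist())
# Output: ['Country/Territory', 'IMF Forecast', 'IMF Year']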
Finally, you can save the scraped table data to a CSV file and print the results to stdout:
if gdp_df is not None:
    print("Successfully extracted the table!")
    print(gdp_df)

    # Save the DataFrame to a CSV file with headers.
    file_name = "gdp_by_country"
    gdp_df.to_csv(f"{file_name}.csv", index=False)
    print(f"Data saved to '{file_name}.csv'.")
Here’s what the complete code should look like after combining all the steps:
# Step 1: Import the modules.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO

def fetch_gdp_table():
    try:
        # Step 2: Send a request to the Wikipedia page.
        url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
        response = requests.get(url)
        response.raise_for_status()  # Check if the request was successful.
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the webpage: {e}")
        return None

    try:
        # Step 3.1: Parse the HTML content with BeautifulSoup.
        soup = BeautifulSoup(response.content, "html.parser")
        # Step 3.2: Find the specific table (usually the first large table on the page).
        tables = soup.find_all("table", {"class": "wikitable"})
        if not tables:
            print("Error: Could not find the expected table on the page.")
            return None
        table = tables[0]  # First table contains the list of countries by GDP (nominal).
        # Step 4: Read the table into a DataFrame using pandas.
        df = pd.read_html(StringIO(str(table)))[0]
    except Exception as e:
        print(f"Error parsing the HTML or reading the table: {e}")
        return None

    try:
        # Step 5: Clean up the DataFrame.
        # Flatten multi-level headers if they exist.
        if isinstance(df.columns, pd.MultiIndex):
            # Join the header levels, dropping empty strings left over from blank levels.
            df.columns = [" ".join(filter(None, map(str, col))) for col in df.columns]

        # Reset the index to clean up the DataFrame.
        df = df.reset_index(drop=True)
        return df
    except Exception as e:
        print(f"Error cleaning or processing the DataFrame: {e}")
        return None

# Run the function and get the DataFrame.
gdp_df = fetch_gdp_table()

if gdp_df is not None:
    print("Successfully extracted the table!")
    print(gdp_df)
    # Save the DataFrame to a CSV file with headers.
    file_name = "gdp_by_country"
    gdp_df.to_csv(f"{file_name}.csv", index=False)
    print(f"Data saved to '{file_name}.csv'.")
else:
    print("Failed to extract the table.")
After executing the code, you should see the following output printed as well as a new CSV file saved in your working directory:
The last type of table to cover is one spread over multiple pages, which requires you to handle pagination. This can be done with browser automation tools like Selenium, which helps extract JavaScript-rendered content and perform actions that require a real browser environment, such as clicking and scrolling. Check this in-depth tutorial on web scraping with Selenium to learn how to use it.
For this section, the target will be this dummy website built for scraping practice. Here’s what the table looks like:
The table spans 24 pages, each containing different rows of data.
Create a new Python file and import the necessary libraries:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd
from io import StringIO
The next step is to initialize WebDriver and send a request to the target website:
# Initialize WebDriver.
try:
    driver = webdriver.Chrome()
    driver.get(url)
except Exception as e:
    print(f"Error initializing WebDriver: {e}")
    return None

all_data = []
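By default, this opens a visible browser window. If you’d rather run the scraper headlessly, you can pass Chrome options when creating the driver — a minimal sketch; the --headless=new flag applies to recent Chrome versions:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # Run Chrome without a visible window.
driver = webdriver.Chrome(options=options)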
Once the page has loaded, the next step is to extract the table using the correct selector:
table_element = driver.find_element(By.TAG_NAME, "table")
table_html = table_element.get_attribute("outerHTML")
# Read the table into a DataFrame.
table_io = StringIO(table_html)
df = pd.read_html(table_io)[0]
all_data.append(df)
The above code snippet extracts the table’s HTML and converts it into a DataFrame, which is then appended to the all_data list that will hold the data collected from every page.
After extracting the table from the first page, locate the button to navigate to the next page:
# Try to find and click the 'Next' button.
try:
    next_button = driver.find_element(By.CSS_SELECTOR, "[aria-label='Next']")
    next_button.click()
    time.sleep(2)  # Adjust sleep time as necessary.
except Exception as e:
    print("Next button not found or click failed, ending pagination.")
    break
This process must continue until you reach the last page, where there’s no Next button. You can achieve this with a simple while True loop:
while True:
    try:
        # Locate the table.
        table_element = driver.find_element(By.TAG_NAME, "table")
        table_html = table_element.get_attribute("outerHTML")

        # Read the table into a DataFrame.
        table_io = StringIO(table_html)
        df = pd.read_html(table_io)[0]
        all_data.append(df)

        # Try to find and click the 'Next' button.
        try:
            next_button = driver.find_element(By.CSS_SELECTOR, "[aria-label='Next']")
            next_button.click()
            time.sleep(2)  # Adjust sleep time as necessary.
        except Exception as e:
            print("Next button not found or click failed, ending pagination.")
            break
    except Exception as e:
        print(f"Error processing the table: {e}")
        break
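A fixed time.sleep(2) works for a demo, but an explicit wait is usually more reliable since it pauses only as long as needed. Here’s a hedged sketch using Selenium’s WebDriverWait to wait for a table to be present after each page change; for stricter checks, you could also wait for the old table element to go stale:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a <table> element to appear instead of sleeping blindly.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "table"))
)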
Finally, you can combine all the data, creating a single DataFrame that can then be saved to a CSV file:
# Concatenate all collected data into a single DataFrame.
try:
    final_df = pd.concat(all_data, ignore_index=True)
    file_name = "paginated_data"
    final_df.to_csv(f"{file_name}.csv", index=False)
    print(f"Data saved to '{file_name}.csv'.")
except Exception as e:
    print(f"Error concatenating data: {e}")
finally:
    driver.quit()
Let’s combine all the steps and have a look at the complete code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd
from io import StringIO

def fetch_data_from_paginated_table(url):
    # Initialize WebDriver.
    try:
        driver = webdriver.Chrome()
        driver.get(url)
    except Exception as e:
        print(f"Error initializing WebDriver: {e}")
        return None

    all_data = []

    while True:
        try:
            # Locate the table.
            table_element = driver.find_element(By.TAG_NAME, "table")
            table_html = table_element.get_attribute("outerHTML")

            # Read the table into a DataFrame.
            table_io = StringIO(table_html)
            df = pd.read_html(table_io)[0]
            all_data.append(df)

            # Try to find and click the 'Next' button.
            try:
                next_button = driver.find_element(By.CSS_SELECTOR, "[aria-label='Next']")
                next_button.click()
                time.sleep(2)  # Adjust sleep time as necessary.
            except Exception as e:
                print("Next button not found or click failed, ending pagination.")
                break
        except Exception as e:
            print(f"Error processing the table: {e}")
            break

    # Concatenate all collected data into a single DataFrame.
    try:
        final_df = pd.concat(all_data, ignore_index=True)
        return final_df
    except Exception as e:
        print(f"Error concatenating data: {e}")
        return None
    finally:
        driver.quit()

# URL of the paginated table.
url = "https://www.scrapethissite.com/pages/forms/"
final_df = fetch_data_from_paginated_table(url)

if final_df is not None:
    print("Successfully extracted the table!")
    print(final_df)
    # Save the DataFrame to a CSV file.
    file_name = "paginated_data"
    final_df.to_csv(f"{file_name}.csv", index=False)
    print(f"Data saved to '{file_name}.csv'.")
else:
    print("Failed to extract the table.")
The output should look like this:
Respect the policies of the website: Be mindful of the website's robots.txt file, which outlines the site's rules for web crawlers. Learn more about the ethical concerns of data collection.
Show courtesy: Limit your request rate so you don’t overwhelm the server. One way to reduce repeat requests is to use cache-enabled web scrapers.
Clean data: Always clean and validate scraped data to ensure correctness.
Handle non-table structures: Many modern websites use <div> elements or other non-standard tags to build layouts that merely look like tables. Be mindful of these structural intricacies to capture the necessary information accurately (see the sketch after this list).
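As an illustration of the last point, here’s a hedged sketch of pulling rows out of a div-based layout with BeautifulSoup; the class names (product-row, product-name, product-price) are hypothetical and would need to match the actual page:
from bs4 import BeautifulSoup

html = """
<div class="product-row"><span class="product-name">Widget</span><span class="product-price">9.99</span></div>
<div class="product-row"><span class="product-name">Gadget</span><span class="product-price">19.99</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for row in soup.select("div.product-row"):
    rows.append({
        "name": row.select_one(".product-name").get_text(strip=True),
        "price": row.select_one(".product-price").get_text(strip=True),
    })
print(rows)  # [{'name': 'Widget', 'price': '9.99'}, {'name': 'Gadget', 'price': '19.99'}]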
By following the step-by-step processes outlined in this guide, you can effectively scrape HTML tables, handle complex table structures, and manage paginated tables. However, it’s crucial to approach web scraping with a strong sense of responsibility. Always respect the website's policies, avoid overloading servers with too many requests, and ensure that the data you scrape is used ethically.
About the author
Vytenis Kaubrė
Technical Copywriter
Vytenis Kaubrė is a Technical Copywriter at Oxylabs. His love for creative writing and a growing interest in technology fuels his daily work, where he crafts technical content and web scrapers with Oxylabs’ solutions. Off duty, you might catch him working on personal projects, coding with Python, or jamming on his electric guitar.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.