Pagination In Web Scraping: How Challenging It May Be
Vejune Tamuliunaite
Tackling pagination can be challenging when building a web scraper. While pagination implementations vary widely, fundamentally they fall into four broad categories. This article covers each category with practical examples, along with Python code to handle the pagination.
For your convenience, this tutorial is also available in video format.
Before understanding how to handle pagination in web scraping, it is important to understand what pagination is in web development.
Most websites contain a huge amount of data, and it is not feasible to display it all on one page. Even with a small dataset, showing every record on a single page makes the page huge; such a page takes longer to load and consumes more memory in the browser. The solution is to show a limited number of records per page and provide access to the rest through pagination.
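As an illustration of the idea, here is a minimal sketch of server-side paging over an in-memory list. The record names and page size are made up for the example:

```python
# A minimal sketch of pagination: return a fixed-size slice
# of the records for a given page number (1-indexed).
records = [f"record {i}" for i in range(1, 101)]  # 100 records in total
PAGE_SIZE = 10

def get_page(page_number, page_size=PAGE_SIZE):
    start = (page_number - 1) * page_size
    return records[start:start + page_size]

print(get_page(1))   # the first 10 records
print(get_page(10))  # the last 10 records
```

A real site does the same thing against its database and renders one slice per page.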
In the case of pagination in web design, a user interface component, often known as a pager, is placed at the bottom of the page. This pager can contain links or buttons to move to the next page, previous page, last page, first page, or a specific page. The actual implementation varies with every site.
Even though each website has its own way of implementing pagination, most implementations fall into one of these four categories:
With Next button
Page Numbers without Next button
Pagination with infinite scroll
Pagination with Load More
In this article, we will examine these scenarios while scraping web data.
Let’s start with a simple example. Head over to the video games store. Scroll down to the bottom of the page and notice the pagination:
This site has the Forward button. If this button is clicked, the browser loads the next page.
Moreover, with every click, the URL changes:
Page 1 – https://sandbox.oxylabs.io/products?page=1
Page 2 – https://sandbox.oxylabs.io/products?page=2
Page 3 – https://sandbox.oxylabs.io/products?page=3
The next step is to inspect the HTML markup of the Forward link. This can be done by pressing F12 or Ctrl+Shift+I, or by right-clicking the Forward link and selecting Inspect.
In the Inspect window, you can see that the Forward button is an anchor element whose href attribute holds the URL of the next page.
Let’s start with writing a basic web scraper.
First, prepare your environment with the required packages. Open the terminal, activate a virtual environment (optional), and execute this command to install requests, beautifulsoup4, and lxml. requests will be used for HTTP requests, beautifulsoup4 will be used to locate the Next button in the HTML, and lxml is the parser back-end for beautifulsoup4.
pip install requests beautifulsoup4 lxml
Start by writing simple code that fetches the first page and prints its footer. We print the footer so that we can keep track of which page is being parsed. In a real-world application, you would replace this with a proper logging and tracking solution, or forgo this visibility for performance reasons.
"""Handling pages with the Next button"""
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
url = 'https://sandbox.oxylabs.io/products'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
page_number = soup.select_one('li.active > a')
print(page_number['aria-label'])
Later, we will navigate from page to page until we reach the last one. On the last page, the Forward button's href value is simply #.
The output of this code will simply be the footer of the first page:
Page 1 is your current page
Process finished with exit code 0
A few points to note here:
The requests library is sending a GET request to the specified URL;
The soup object is queried using a CSS selector. Note that this selector is website-specific.
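If you want to experiment with such a selector without hitting the site, here is a minimal sketch that runs select_one against a hypothetical HTML fragment modeled on the pager markup (the fragment is an assumption for illustration, not the site's exact HTML):

```python
from bs4 import BeautifulSoup

# Hypothetical pager markup, modeled on the sandbox site's footer.
html = """
<ul class="pagination">
  <li class="active"><a aria-label="Page 1 is your current page" href="#">1</a></li>
  <li class="next"><a href="?page=2">Next</a></li>
</ul>
"""
# The stdlib parser is enough here; the article's scrapers use lxml.
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one('li.active > a')['aria-label'])  # Page 1 is your current page
print(soup.select_one('li.next > a').get('href'))      # ?page=2
```

Testing selectors on a small fragment like this is a quick way to verify them before running the full scraper.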
Let’s modify this code to locate the Next button.
next_page_element = soup.select_one('li.next > a')
If the next_page_element is found, we can read its href attribute, which holds the URL of the next page. One important thing to note: the href is often a relative URL. In such cases, we can use the urljoin function from the urllib.parse module to convert it into an absolute URL.
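For instance, here is how urljoin resolves relative hrefs against the current page's URL (the example URLs follow the pattern seen earlier):

```python
from urllib.parse import urljoin

base = 'https://sandbox.oxylabs.io/products?page=1'

# A query-only href replaces the query of the current URL.
print(urljoin(base, '?page=2'))
# A root-relative href replaces the path and query.
print(urljoin(base, '/products?page=2'))
```

Both calls produce the absolute URL https://sandbox.oxylabs.io/products?page=2, which can be passed straight to requests.get.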
By wrapping the single-page scraping code in a while loop whose termination condition is the lack of any more pages, we can reach every page linked by the pagination.
"""Handling pages with the Next button"""
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
url = 'https://sandbox.oxylabs.io/products'
while True:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    page_number = soup.select_one('li.active > a')
    print(page_number['aria-label'])
    # Do more with each page.

    # Find the next page to scrape in the pagination.
    next_page_element = soup.select_one('li.next > a')
    next_page_url = next_page_element.get('href')
    if next_page_url == '#':
        break
    url = urljoin(url, next_page_url)
The output of this code will be the footer of all 20 pages:
Page 1 is your current page
Page 2 is your current page
Page 3 is your current page
Page 4 is your current page
Page 5 is your current page
Page 6 is your current page
Page 7 is your current page
Page 8 is your current page
Page 9 is your current page
Page 10 is your current page
Page 11 is your current page
Page 12 is your current page
Page 13 is your current page
Page 14 is your current page
Page 15 is your current page
Page 16 is your current page
Page 17 is your current page
Page 18 is your current page
Page 19 is your current page
Page 20 is your current page
Some websites do not show a Next button, only page numbers. For example, here is the pagination from https://sandbox.oxylabs.io/products.
If you examine the HTML markup for this page, something interesting can be seen:
The HTML contains links to only some of the pages: a few from the start and a few from the end. From the last of these, we can get the last page number and construct links to all the pages ourselves. The first step is to fetch the first page. Next, we use BeautifulSoup to extract the pagination links. Finally, we write a for loop that scrapes each of the constructed links:
"""Handling pages without the Next button"""
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
# Get the first page.
url = 'https://sandbox.oxylabs.io/products'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
page_link_el = soup.select('.pagination a')
# The second-to-last link points to the last page, e.g. 'products?page=94'.
last_page = page_link_el[-2].get('href')
last_page_number = int(last_page.split('page=')[-1])

# Make links for and process the following pages. Page 1 was already
# fetched above, so start from page 2 and include the last page.
for i in range(2, last_page_number + 1):
    link = urljoin(url, 'products?page=' + str(i))
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'lxml')
    print(response.url)
    # Do more with each page.
This kind of pagination does not show page numbers or the next button.
Let’s take the Quotes to Scrape website as an example. This site shows a limited number of quotes when the page loads. As you scroll down, it dynamically loads more items, a limited number at a time. Another important thing to note here is that the URL does not change as more pages are loaded.
In such cases, websites use an asynchronous call to an API to get more content and show this content on the page using JavaScript. The actual data returned by the API can be HTML or JSON.
Before you load the site, press F12 to open Developer Tools, head over to the Network tab, and select XHR. Now go to http://quotes.toscrape.com/scroll and monitor the traffic. Scroll down to load more content.
You will notice that as you scroll down, more requests are sent to quotes?page=x, where x is the page number.
As the number of pages is not known beforehand, we have to figure out when to stop scraping. This is where the has_next field in the response from quotes?page=x comes in useful.
We can write a while loop as we did in the previous section. This time, there is no need for BeautifulSoup because the response is JSON, which we can parse directly with response.json(). Following is the code for the web scraper:
import requests
url = 'http://quotes.toscrape.com/api/quotes?page={}'
page_number = 1
while True:
    response = requests.get(url.format(page_number))
    # Do more with each page.
    data = response.json()
    print(response.url)
    if data.get('has_next'):
        page_number += 1
    else:
        break
Once we find the information that the browser itself uses to handle pagination, replicating it for web scraping is quite easy.
Now, let’s look at one more example.
In the previous section, we looked at JSON responses to figure out when to stop scraping. The example was fairly simple, as the response had a clear indication of when the last page was reached. Unfortunately, some websites do not provide structured responses and/or indications when there are no more pages to scrape, so one has to do more work to extract meaning from what is available. The next example is a website that requires some creativity to properly handle its pagination.
Open Developer Tools by pressing F12 in your browser, go to the Network tab, and then select XHR. Navigate to https://techinstr.myshopify.com/collections/all. You will notice that initially, 8 products are loaded.
If we scroll down, the next 8 products are loaded. Also, notice the following:
The total number of products is 132.
The URL of the index page is different from the remaining pages.
The response is HTML, with no clear way to identify when to stop.
To handle pagination for this site, we will first load the index page and extract the number of products. We have already observed that 8 products are loaded in one request. With this data, we can now calculate the number of pages as follows:
page_count = 132/8 = 16.5
By using the math.ceil function, we get the last page number, 17. Note that using the round function instead could miss a page in some cases. For example, if there are 132 products and each request loads 5 products, there are 132/5 = 26.4 pages, which in practice means we have to check 27 pages. The ceil function always rounds up: in that case, math.ceil returns 27, while round returns 26.
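The difference between math.ceil and round can be checked directly (the 5-per-request loader is a hypothetical case for comparison):

```python
import math

total_products = 132
per_request = 5  # hypothetical loader returning 5 products per request

pages_exact = total_products / per_request  # 26.4
print(math.ceil(pages_exact))  # 27 -- rounds up, so no page is missed
print(round(pages_exact))      # 26 -- would skip the final partial page
```

This is why the scraper below uses math.ceil when computing the page count.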
In addition to not providing a clear stop condition, this website also requires that every request after the first one carries the relevant session data; otherwise, it redirects back to the first page. To keep using the session data received from the first page, we reuse a single session instead of creating a new one for each page.
The complete code for this web scraper is as follows:
import math
import requests
from bs4 import BeautifulSoup
index_page = 'https://techinstr.myshopify.com/collections/all'
url = 'https://techinstr.myshopify.com/collections/all?page={}'
session = requests.session()
response = session.get(index_page)
soup = BeautifulSoup(response.text, "lxml")
count_element = soup.select_one('.filters-toolbar__product-count')
count_str = count_element.text.replace('products', '')
total_count = int(count_str)
# Do more with the first page.
page_count = math.ceil(total_count/8)
for page_number in range(2, page_count + 1):
    response = session.get(url.format(page_number))
    soup = BeautifulSoup(response.text, "lxml")
    first_product = soup.select_one('.product-card:nth-child(1) > a > span')
    print(first_product.text.strip())
    # Do more with each of the pages.
The way Load More works is very similar to infinite scroll; the only difference is how loading the next page is triggered in the browser. Since we are using a script rather than a browser, the only difference for us lies in analyzing the pagination, not in the scraping itself.
Open https://smarthistory.org/americas-before-1900/ with Developer Tools (F12) and click Load More on the page.
You will see that the response is in JSON format with an attribute remaining. The key observations are as follows:
Each request gets 12 results
The value of the remaining decreases by 12 with every click of Load More
If we set the value page to 1 in the API URL, it gets the first page of the results – https://smarthistory.org/wp-json/smthstapi/v1/objects?tag=938&page=1
In this particular case, the user agent also needs to be set for this to work correctly. The following code handles this kind of pagination in web scraping:
import requests

url = 'https://smarthistory.org/wp-json/smthstapi/v1/objects?tag=938&page={}'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}
page_number = 1

while True:
    response = requests.get(url.format(page_number), headers=headers)
    data = response.json()
    print(response.url)
    # Do more with each page.
    if data.get('remaining') and int(data.get('remaining')) > 0:
        page_number += 1
    else:
        break
In this article, we explored various examples of pagination in web scraping. Websites can display pagination in many ways, and to understand how a particular implementation works, it is important to examine the HTML markup as well as the network traffic using Developer Tools. This tutorial examined four broad types of pagination and how to handle each of them. Even if you encounter something new, you should be able to figure it out based on this article.
If you want to learn more about web scraping or using proxies, check our blog and find more interesting content, from tips on how to crawl a website without getting blocked to an in-depth discussion about the legality of web scraping. Also, don’t hesitate to try the functionality of our own general-purpose web scraper for free.
About the author
Vejune Tamuliunaite
Former Product Content Manager
Vejune Tamuliunaite is a former Product Content Manager at Oxylabs with a passion for testing her limits. After years of working as a scriptwriter, she turned to the tech side and is fascinated by being at the core of creating the future. When not writing in-depth articles, Vejune enjoys spending time in nature and watching classic sci-fi movies. Also, she probably could tell the Star Wars script by heart.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.