Pagination handling

Vejune Tamuliunaite

Jul 06, 2021 12 min read

Handling pagination is one of the most common challenges when building a web scraper. While pagination implementations vary a lot from site to site, fundamentally they fall into four broad categories. This article covers practical examples, along with Python code, for handling each of them.

What is pagination in web design?

Before understanding how to handle pagination in web scraping, it is important to understand what pagination is in web development.

Most websites contain a large amount of data, and displaying all of it on one page is not feasible. Even with a small dataset, rendering every record on a single page makes the page heavy: it takes longer to load and consumes more memory in the browser. The solution is to show a limited number of records per page and provide access to the rest through pagination.
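The idea of serving a limited number of records per page boils down to a slicing operation. The helper below is a hypothetical illustration of the server-side logic, not code from any particular framework:

```python
def paginate(records, page, per_page=20):
    """Return the slice of records shown on the given 1-indexed page."""
    start = (page - 1) * per_page
    return records[start:start + per_page]


# 45 records split into pages of 20: pages 1 and 2 are full, page 3 has 5.
records = list(range(45))
print(len(paginate(records, 1)))  # 20
print(len(paginate(records, 3)))  # 5
```

Whatever the framework, a scraper's job is to walk through these slices until none remain.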

In web design, pagination is handled by a user interface component, often known as a pager, placed at the bottom of the page. The pager contains links or buttons to move to the next page, the previous page, the last page, the first page, or a specific page. The actual implementation varies from site to site.
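The navigation a typical pager exposes amounts to a little arithmetic. The function below is a hypothetical sketch of which page each control leads to:

```python
def pager_targets(current_page, last_page):
    """Map each pager control to the page it navigates to."""
    return {
        'first': 1,
        'previous': max(1, current_page - 1),  # clamp at the first page
        'next': min(last_page, current_page + 1),  # clamp at the last page
        'last': last_page,
    }


print(pager_targets(2, 17))
# {'first': 1, 'previous': 1, 'next': 3, 'last': 17}
```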

Types of pagination

Even though each website implements pagination in its own way, most implementations fall into one of these four categories:

  • With a Next button
  • Page numbers without a Next button
  • Pagination with infinite scroll
  • Pagination with a Load More button

In this article, we will examine these scenarios while scraping web data.

Let’s start with a simple example. Head over to the Books to Scrape web page. Scroll down to the bottom of the page and notice the pagination:

This site has a Next button. When it is clicked, the browser loads the next page.

Note that the site now displays a Previous button along with the Next button. If we keep clicking Next until the last page is reached, this is how it looks:

Moreover, with every click, the URL changes:

  • Page 1 – http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html
  • Page 2 – http://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html
  • Page 3 – http://books.toscrape.com/catalogue/category/books/fantasy_19/page-3.html

The next step is to inspect the HTML markup of the Next link. This can be done by pressing F12 or Ctrl+Shift+I, or by right-clicking the Next link and selecting Inspect.

In the Inspect window, it can be seen that the Next button is an anchor element; the URL of the next page is stored in its href attribute.

Python code to handle pagination

Let’s start with writing a basic web scraper. 

First, prepare your environment with the required packages. Open the terminal, activate a virtual environment (optional), and execute this command to install requests, beautifulsoup4, and lxml. requests will be used for sending HTTP requests, beautifulsoup4 for locating the Next button in the HTML, and lxml as the parser back-end for Beautiful Soup.

pip install requests beautifulsoup4 lxml

Start by writing simple code that fetches the first page and prints the footer. We print the footer so that we can keep track of which page is being parsed. In a real-world application, you would replace this with a proper logging solution, or drop it entirely for performance reasons.

"""Handling pages with the Next button"""
import requests
from bs4 import BeautifulSoup

url = 'http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html'

response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
footer_element = soup.select_one('li.current')
print(footer_element.text.strip())

The output of this code will be simply the footer of the first page:

Page 1 of 3

A few points to note here:

  • The requests library sends a GET request to the specified URL;
  • The soup object is queried using a CSS selector. This selector is website-specific.

Let’s modify this code to locate the Next button.

next_page_element = soup.select_one('li.next > a')

If next_page_element is found, we can read the value of its href attribute, which holds the URL of the next page. One important thing to note is that href will often be a relative URL. In such cases, the urljoin function from the urllib.parse module can be used to convert it into an absolute URL.
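For example, on the Books to Scrape site the Next link's href is a relative URL such as page-2.html, and urljoin resolves it against the current page's URL:

```python
from urllib.parse import urljoin

# Resolve the relative href against the page it was found on.
base = 'http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html'
print(urljoin(base, 'page-2.html'))
# http://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html
```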

By wrapping the single-page scraping code in a while loop, with the termination condition being the absence of a Next link, we can reach every page linked by the pagination.

"""Handling pages with the Next button"""
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html'

while True:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    footer_element = soup.select_one('li.current')
    print(footer_element.text.strip())
    # Do more with each page.

    # Find the next page to scrape in the pagination.
    next_page_element = soup.select_one('li.next > a')
    if next_page_element:
        next_page_url = next_page_element.get('href')
        url = urljoin(url, next_page_url)
    else:
        break

The output of this code will be the footer of all three pages:

Page 1 of 3
Page 2 of 3
Page 3 of 3

Pagination without Next button

Some websites do not show a Next button, only page numbers. Here is an example of such pagination from https://www.gosc.pl/doc/791526.Zaloz-zbroje.

If we examine the HTML markup for this page, something interesting can be seen:

<span class="pgr_nrs">
	<span>1</span>
	<a href="/doc/791526.Zaloz-zbroje/2">2</a>
	<a href="/doc/791526.Zaloz-zbroje/3">3</a>
	<a href="/doc/791526.Zaloz-zbroje/4">4</a>
</span>

The HTML contains links to all the following pages, which makes visiting them easy. The first step is to fetch the first page. Next, we use BeautifulSoup to extract the links to the other pages. Finally, we write a for loop that scrapes each of these links:

"""Handling pages without the Next button"""
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Get the first page.
url = 'https://www.gosc.pl/doc/791526.Zaloz-zbroje'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
page_link_el = soup.select('.pgr_nrs a')
# Do more with the first page.

# Make links for and process the following pages.
for link_el in page_link_el:
    link = urljoin(url, link_el.get('href'))
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'lxml')
    print(response.url)
    # Do more with each page.

Pagination with infinite scroll

This kind of pagination shows neither page numbers nor a Next button.

Let’s take the Quotes to Scrape website as an example. This site shows a limited number of quotes when the page loads. As you scroll down, it dynamically loads more items, a limited number at a time. Another important thing to note here is that the URL does not change as more pages are loaded. 

In such cases, websites use an asynchronous call to an API to get more content and show this content on the page using JavaScript. The actual data returned by the API can be HTML or JSON.

Handling sites with JSON response

Before you load the site, press F12 to open Developer Tools, head over to the Network tab, and select XHR. Now go to http://quotes.toscrape.com/scroll and monitor the traffic. Scroll down to load more content. 

You will notice that as you scroll down, more requests are sent to quotes?page=x, where x is the page number.

As the number of pages is not known beforehand, we have to figure out when to stop scraping. This is where the has_next field in the response from quotes?page=x comes in useful.
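The stop condition can be sketched against a sample of the JSON response. Only the has_next field is confirmed by the traffic above; the other fields in this sample are illustrative assumptions:

```python
# Illustrative shape of a quotes?page=x response; has_next signals more pages.
sample_response = {
    "page": 2,
    "has_next": True,
    "quotes": [{"author": {"name": "Albert Einstein"}, "text": "..."}],
}

# Advance to the next page only while the API says one exists.
if sample_response["has_next"]:
    next_page = sample_response["page"] + 1
else:
    next_page = None

print(next_page)  # 3
```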

We can write a while loop as in the previous section. This time there is no need for BeautifulSoup, because the response is JSON and we can parse it directly with response.json(). Here is the code for the web scraper:

import requests

url = 'http://quotes.toscrape.com/api/quotes?page={}'
page_number = 1
while True:
    response = requests.get(url.format(page_number))

    # Do more with each page.
    data = response.json()
    print(response.url) 
    if data.get('has_next'):
        page_number += 1
    else:
        break

Once we identify the mechanism that the browser itself uses to handle pagination, replicating it in a web scraper is quite easy.

Now let’s look at one more example.

Handling sites with HTML response

In the previous section, we looked at JSON responses to figure out when to stop scraping. The example was fairly simple as the response had a clear indication of when the last page was reached. Unfortunately, some websites do not provide structured responses and/or indications when there are no more pages to scrape, so one has to do more work to extract meaning from what is available. The next example is of a website that requires some creativity to properly handle its pagination.

Open Developer Tools by pressing F12 in your browser, go to the Network tab and then select XHR. Navigate to https://techinstr.myshopify.com/collections/all. You will notice that initially 8 products are loaded.

If we scroll down, the next 8 products are loaded. Also, notice the following:

  • The total number of products is 132.
  • The URL of the index page is different from the remaining pages.
  • The response is HTML, with no clear way to identify when to stop.

To handle pagination for this site, we will first load the index page and extract the number of products. We have already observed that 8 products are loaded in one request. With this data we can now calculate the number of pages as follows:

page_count = 132/8 = 16.5

By applying the math.ceil function, we get the last page number, 17. Note that using the round function instead could miss a page in some cases. For example, if there are 132 products and each request loads 5 of them, there are 132/5 = 26.4 pages, which means 27 pages actually have to be checked. math.ceil(26.4) returns 27, while round(26.4) returns 26, so ceil ensures the page count is always rounded up.
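The difference between the two functions is easy to verify directly:

```python
import math

total_products, per_page = 132, 5
pages = total_products / per_page  # 26.4

print(math.ceil(pages))  # 27 - the partial last page is included
print(round(pages))      # 26 - would silently skip the last page
```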

Besides not providing a clear stop condition, this website also requires that every request after the first carry the session data received from the first page; otherwise, it redirects back to the first page. To keep that session data, we need to reuse a single requests session instead of creating a new connection for each page.

The complete code for this web scraper is as follows:

import math

import requests
from bs4 import BeautifulSoup


index_page = 'https://techinstr.myshopify.com/collections/all'
url = 'https://techinstr.myshopify.com/collections/all?page={}'

session = requests.Session()
response = session.get(index_page)
soup = BeautifulSoup(response.text, "lxml")
count_element = soup.select_one('.filters-toolbar__product-count')
count_str = count_element.text.replace('products', '')
total_count = int(count_str)
# Do more with the first page.

page_count = math.ceil(total_count/8)
for page_number in range(2, page_count+1):
    response = session.get(url.format(page_number))
    soup = BeautifulSoup(response.text, "lxml")
    first_product = soup.select_one('.product-card:nth-child(1) > a > span')
    print(first_product.text.strip())
    # Do more with each of the pages.

Pagination with Load More button

Load More works very similarly to infinite scroll; the only difference is how the browser triggers loading of the next page. Because we are using a script rather than a browser, the only difference for us is in the analysis of the pagination, not in the scraping itself.

Open https://smarthistory.org/americas-before-1900/ with Developer Tools (F12) and click Load More in the page.

You will see that the response is in JSON format with an attribute remaining. The key observations are as follows:

  • Each request returns 12 results.
  • The value of remaining decreases by 12 with every click of Load More.
  • Setting page to 1 in the API URL returns the first page of the results – https://smarthistory.org/wp-json/smthstapi/v1/objects?tag=938&page=1

In this particular case, the User-Agent header also needs to be set for the requests to work correctly. The following code handles this kind of pagination in web scraping:

import requests


url = 'https://smarthistory.org/wp-json/smthstapi/v1/objects?tag=938&page={}'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

page_number = 1
while True:
    response = requests.get(url.format(page_number), headers=headers)
    data = response.json()
    
    print(response.url)
    # Do more with each page.

    if data.get('remaining') and int(data.get('remaining')) > 0:
        page_number += 1
    else:
        break

Conclusion

In this article, we explored various examples of pagination in web scraping. Websites display pagination in many different ways, and to understand how a particular implementation works, it is important to examine both the HTML markup and the network traffic using Developer Tools. This tutorial examined four broad types of pagination and how to handle each of them. Even if you encounter something new, you should be able to figure it out based on these techniques.

If you want to learn more about web scraping or using proxies, check our blog and find more interesting content: from tips on how to crawl a website without getting blocked to an in-depth discussion about the legality of web scraping.


About Vejune Tamuliunaite

Vejune Tamuliunaite is a Product Content Manager at Oxylabs with a passion for testing her limits. After years of working as a scriptwriter, she turned to the tech side and is fascinated by being at the core of creating the future. When not writing in-depth articles, Vejune enjoys spending time in nature and watching classic sci-fi movies. Also, she probably could tell the Star Wars script by heart.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
