How to Make Web Scraping Faster – Python Tutorial

Yelyzaveta Nechytailo

2023-03-29, 6 min read

Efficient business management in today’s competitive environment calls for high-speed public data gathering. You have to access thousands or even millions of pages and do it fast since you want to save as much time as possible to act on this data. But what might prevent you from fetching public information at speed? And how can you make scraping fast?

In this article, we discuss a few useful ways of making public data collection faster and provide sample code that you can use in your own scraping projects. For your convenience, this tutorial is also available in video format.

What slows down web scraping

The network delay is the first obvious bottleneck for any web scraping project. Transmitting a request to the web server takes time. Once the request is received, the web server will send the response, which again causes a delay.

When browsing a website, this delay is negligible because we deal with one page at a time. For instance, if sending a request and receiving the response takes one second, it will seem very fast while browsing a small number of pages. But if you are running web scraping code that has to send requests to ten thousand pages one after another, this adds up to almost three hours, which doesn't seem that quick anymore.
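The arithmetic behind that estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the one-second round trip is the assumption used in the paragraph above, not a measured value, and the variable names are illustrative.

```python
# Back-of-the-envelope estimate for sequential scraping.
# Assumption: every request/response round trip takes exactly one second.
pages = 10_000
seconds_per_request = 1.0

total_seconds = pages * seconds_per_request
hours = total_seconds / 3600

print(f"{total_seconds:.0f} seconds is about {hours:.2f} hours")
```

Real round-trip times vary from request to request, so treat the result as an order-of-magnitude figure.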

The network delay is only one of the factors that can slow down the process. Your web scraping code will not just send requests and receive responses but also process the data. At this point, the scraping can run into I/O-bound or CPU-bound bottlenecks.

I/O bound

An I/O bottleneck relates to the performance of a system's input-output operations and peripherals, such as disk drives and network interfaces. Any program that depends on the input-output system (e.g., reading and writing data, copying files, downloading files) is an I/O-bound program, and the resulting delays are called I/O-bound delays.

CPU bound

The other scenario is when a program is CPU-bound. As the name suggests, in this case, the code execution speed depends on the CPU, which refers to the central processing unit of a computing device. A faster CPU would mean faster code execution.

A classic example of a CPU-bound application is a task that requires a large number of calculations. For instance, High-Performance Computing (HPC) systems combine the processing power of multiple processors to deliver higher computing performance for such tasks.

The distinction between I/O and CPU is essential to understand since the strategy to make the program run faster largely depends on the bottleneck type.
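One rough way to tell the two cases apart is to compare wall-clock time with CPU time for the same piece of work. The sketch below is an illustration, not part of the original tutorial: cpu_fraction, simulated_io, and simulated_cpu are hypothetical helper names, and time.sleep merely stands in for waiting on a network response.

```python
import time


def cpu_fraction(func, *args):
    """Run func once and return (wall_seconds, cpu_seconds / wall_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, (cpu / wall if wall > 0 else 0.0)


def simulated_io():
    # Stand-in for waiting on a server: the process sleeps and uses almost no CPU.
    time.sleep(0.2)


def simulated_cpu():
    # Stand-in for heavy computation: the CPU is busy the whole time.
    return sum(i * i for i in range(1_000_000))


for task in (simulated_io, simulated_cpu):
    wall, fraction = cpu_fraction(task)
    print(f"{task.__name__}: {wall:.2f} s wall time, CPU busy {fraction:.0%}")
```

A fraction close to zero suggests the code mostly waits (I/O-bound), while a fraction close to one suggests it mostly computes (CPU-bound).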

How do you speed up web scraping in Python?

There are a few possible approaches that can help increase the scraping speed:

Multiprocessing
Multithreading
Asyncio

However, let’s first take a look at unoptimized code to make sure the differences between the approaches are clear. If you want, you can also watch this tutorial on our YouTube channel.

Web scraping without optimization

We'll be scraping 1000 books from books.toscrape.com. This website is a dummy book store that is perfect for learning.

Preparation

The first step is to extract all 1000 links to the books and store them in a CSV file. Run the following code to create the links.csv file. You'll need to install the requests and Beautiful Soup packages for this code to work.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


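def fetch_links(url="https://books.toscrape.com/", links=None):
    # Create a fresh list per call; a mutable default ([]) would be shared between calls
    if links is None:
        links = []
    r = requests.get(url)
    print(r.url, flush=True)
    soup = BeautifulSoup(r.text, "html.parser")
    # Each book title on a listing page is an <a> inside an <h3>
    for link in soup.select("h3 a"):
        links.append(urljoin(url, link.get("href")))
    # Follow the "next" pagination link until the last page is reached
    next_page = soup.select_one("li.next a")
    if next_page:
        return fetch_links(urljoin(url, next_page.get("href")), links)
    else:
        return links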


def refresh_links():
    links = fetch_links()
    with open('links.csv', 'w') as f:
        for link in links:
            f.write(link + '\n')

refresh_links()

The fetch_links function retrieves all the links, and refresh_links() stores the output in a file. We skipped sending a user agent as this is a test site. However, you can easily do so with the requests library by passing a headers dictionary to requests.get().

Writing unoptimized web scraper

We'll focus on optimizing the scraping of these 1,000 pages in Python.

First, install the requests library using pip:

pip install requests

To keep things simple, we'll use regular expressions to extract the title element of the page. Note the get_links function, which loads the URLs we saved in the previous step.

import csv
import re
import time
import requests

def get_links():
	# Load the URLs saved in links.csv, one per row
	links = []
	with open("links.csv", "r") as f:
		reader = csv.reader(f)
		for row in reader:
			links.append(row[0])
	return links

def get_response(session, url):
	with session.get(url) as resp:
		print('.', end='', flush=True)
		text = resp.text
		# Match the whole <title>...</title> element, including newlines
		exp = r'(<title>).*(<\/title>)'
		return re.search(exp, text, flags=re.DOTALL).group(0)

def main():
	start_time = time.time()
	# Reuse one session so the connection is kept alive between requests
	with requests.Session() as session:
		for url in get_links():
			result = get_response(session, url)
			print(result)
	print(f"{(time.time() - start_time):.2f} seconds")


main()

The code without optimization took around 126 seconds.
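The title extraction step can be tried on its own, without any network access. The following is a minimal sketch rather than the exact code used above: extract_title and TITLE_RE are hypothetical names, the sample HTML is made up for illustration, and, unlike get_response, this variant returns only the text inside the element and returns None instead of raising an error when no title is found.

```python
import re

# Non-greedy match so that only the first <title>...</title> pair is captured.
TITLE_RE = re.compile(r"<title>(.*?)</title>", flags=re.DOTALL | re.IGNORECASE)


def extract_title(html):
    """Return the stripped text of the <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip()


# Made-up sample page, shaped like a product page with a multi-line title.
sample = (
    "<html><head><title>\n"
    "    A Light in the Attic | Books to Scrape - Sandbox\n"
    "</title></head><body></body></html>"
)

print(extract_title(sample))
print(extract_title("<html><body>No title here</body></html>"))
```

Handling the missing-title case explicitly keeps one unexpected page from stopping a long scraping run.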

Web scraping using multiprocessing

Multiprocessing, as the name suggests, means utilizing more than one processor core. Nowadays, it's hard to find a single-core CPU. You can write code that takes advantage of all cores using the multiprocessing module, which is included in the Python standard library.

For example, if we have an 8-core CPU, we can write code that splits the task into eight different processes, where each process runs on a separate CPU core. Note that this approach is more suitable when the bottleneck is the CPU, that is, when the code is CPU-bound. We'll still see some improvements in our case, though.

The first step is to import Pool and cpu_count from the multiprocessing module:

from multiprocessing import Pool, cpu_count

Changes are also required in both the get_response and main functions. Here, get_response no longer receives a session and calls requests.get directly.

def get_response(url):
	resp = requests.get(url)
	
	print('.', end='', flush=True)
	text = resp.text
	exp = r'(<title>).*(<\/title>)'
	return re.search(exp, text, flags=re.DOTALL).group(0)

def main():
	start_time = time.time()
	links = get_links()
	coresNr = cpu_count()
	with Pool(coresNr) as p:
		results = p.map(get_response, links)
		for result in results:
			print(result)
	print(f"{(time.time() - start_time):.2f} seconds")

if __name__ == '__main__':
	main()

The most critical line of the code is where we create a Pool. Note that we are using cpu_count() function to get the count of CPU cores dynamically. This ensures that this code runs on every machine without any change.

In this example, the execution time was around 49 seconds, compared with around 126 seconds for the unoptimized code. That is roughly 2.5 times faster but, as expected, still a modest improvement. As we mentioned, multiprocessing is more suitable when the code is CPU-bound. Our code is I/O-bound; thus, we can improve its performance further with other methods.

Web scraping using multithreading

Multithreading is a great option for optimizing web scraping code. Simply put, a thread is a separate flow of execution. Operating systems typically run hundreds of threads and switch CPU time among them. The switching is so fast that it creates the illusion of multitasking. Your code cannot control this switching because the operating system's scheduler decides when each thread runs.

Using the concurrent.futures module of Python, we can customize how many threads we create to optimize our code. There's only one huge caveat: managing threads can become messy and error-prone as the code becomes more complex.

To change our code to utilize multithreading, minimal changes are needed.

First, import ThreadPoolExecutor.

from concurrent.futures import ThreadPoolExecutor

Next, instead of creating a Pool, create a ThreadPoolExecutor:

def main():
    start_time = time.time()
    links = get_links()
    with ThreadPoolExecutor(max_workers=100) as p:
        results = p.map(get_response, links)
        for result in results:
            print(result)
    print(f"{(time.time() - start_time):.2f} seconds")

Note the max_workers argument, which sets how many threads can run at once. The right number depends on the complexity of the code and on how much load your machine and the target website can handle. Setting it too high can overload either of them, so you must be careful.

This script execution was completed in 7.02 seconds. For reference, the unoptimized code took around 126 seconds. This is a massive improvement.
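The effect of a thread pool on I/O-bound work can be reproduced without touching a real website. In this sketch, fake_fetch is a hypothetical stand-in that sleeps for 50 ms instead of sending a request, and the example.com URLs are placeholders, so the timings only illustrate the principle and say nothing about real-world performance.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    # Assumption: time.sleep stands in for the time spent waiting on a server.
    time.sleep(delay)
    return f"fetched {url}"


def run_sequential(urls):
    return [fake_fetch(url) for url in urls]


def run_threaded(urls, max_workers):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input URLs.
        return list(executor.map(fake_fetch, urls))


def timed(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


urls = [f"https://example.com/page-{i}" for i in range(20)]

seq_result, seq_time = timed(run_sequential, urls)
thr_result, thr_time = timed(run_threaded, urls, 10)

print(f"sequential: {seq_time:.2f} s")
print(f"10 threads: {thr_time:.2f} s")
```

Because the threads spend nearly all their time waiting, their delays overlap, which is why the threaded run finishes in a fraction of the sequential time. Rerunning the sketch with different max_workers values is a cheap way to see where extra threads stop helping.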

Asyncio for asynchronous programming

Asynchronous coding using the asyncio module resembles threading, except that the code itself controls the context switching: everything runs in a single thread, and a coroutine gives up control only at an await expression. This also makes the code easier to reason about and less error-prone. This approach is particularly well suited to web scraping projects.

This approach requires quite a lot of changes. First, the requests library won't work here because its calls are blocking. Instead, we'll use the aiohttp library for web scraping in Python. This requires a separate installation:

python3 -m pip install aiohttp

Next, import asyncio and aiohttp modules.

import aiohttp 
import asyncio

The get_response() function now needs to change to a coroutine. Also, we'll be using the same session for every execution. Optionally, you can send the user agent if needed.

Note the use of async and await keywords.

async def get_response(session, url):
	async with session.get(url) as resp:
		text = await resp.text()
		exp = r'(<title>).*(<\/title>)'
		return re.search(exp, text, flags=re.DOTALL).group(0)

The most significant changes are in the main() function.

First, it needs to change to a coroutine. Next, we'll use aiohttp.ClientSession to create the session object. Most importantly, we'll need to create tasks for all the links using asyncio.create_task, which schedules them on the event loop. Finally, asyncio.gather waits for all the tasks to finish and returns their results in the same order.

async def main():
	start_time = time.time()

	async with aiohttp.ClientSession() as session:
		tasks = []
		for url in get_links():
			tasks.append(asyncio.create_task(get_response(session, url)))
		results = await asyncio.gather(*tasks)
		for result in results:
			print(result)
	print(f"{(time.time() - start_time):.2f} seconds")


if __name__ == "__main__":
    asyncio.run(main())

Lastly, to run the main() coroutine, we use asyncio.run(main()), as shown at the end of the code above.

In this example, the execution time was 15.61 seconds. The asyncio approach, as expected, also showed great results compared to the unoptimized script. Of course, this approach requires an entirely new way of thinking. If you have experience with async-await in any programming language, it won't be hard for you to use this approach for your web scraping jobs.
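The main() coroutine above creates a task for every link at once. If that puts too much load on the target, one common option is to cap the number of requests in flight with asyncio.Semaphore. The sketch below illustrates this idea under stated assumptions and is not part of the original code: fake_fetch, crawl, and the stats dictionary are hypothetical, and asyncio.sleep stands in for the aiohttp request.

```python
import asyncio


async def fake_fetch(url, semaphore, stats):
    # Only `limit` coroutines can be inside this block at the same time.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        # Assumption: asyncio.sleep stands in for awaiting a real HTTP response.
        await asyncio.sleep(0.01)
        stats["active"] -= 1
        return f"fetched {url}"


async def crawl(urls, limit):
    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    tasks = [asyncio.create_task(fake_fetch(url, semaphore, stats)) for url in urls]
    # gather returns the results in the same order as the tasks.
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]


urls = [f"https://example.com/page-{i}" for i in range(50)]
results, peak = asyncio.run(crawl(urls, limit=5))

print(f"{len(results)} pages fetched, at most {peak} requests in flight")
```

With a real aiohttp session, the same pattern would wrap the session.get call inside the async with semaphore block.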

Conclusion

In an attempt to gather large amounts of public data, it's not rare for businesses to encounter the problem of slow web scraping. They spend long hours collecting the public information they need, thus losing an opportunity to analyze it and make informed decisions ahead of competitors in the market.

This article aimed to explain what contributes to the decreased speed of scraping activities and provide several useful ways to deal with this issue. We looked at such web scraping approaches as multiprocessing, multithreading, and asyncio and compared their execution time so that you could choose the most suitable approach for your specific use case. If you want to learn more tactics to improve your scraper, follow this advanced web scraping with Python guide.

However, all these techniques require a good understanding of programming. Things can get complicated quickly when you decide to start scaling your data gathering process. So, if you are looking for an efficient solution to gather public data at scale, Oxylabs offers advanced web intelligence solutions, such as a general-purpose Web Scraper API. These easy-to-use tools can extract public data fast, even from the most challenging public targets.

For more insights, read this in-depth article on concurrency vs. parallelism in Python.

About the author

Yelyzaveta Nechytailo

Senior Content Manager

Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse herself in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.
