
Asynchronous Web Scraping With Python & AIOHTTP


Yelyzaveta Hayrapetyan

Last updated on

2023-06-27


If your business operates in a competitive and ever-changing market, there’s a high chance you gather large amounts of data to draw actionable insights. Oftentimes, this data cannot be extracted from a website easily. Instead, to make sure it’s complete and of high quality, you have to loop over multiple pages while scraping with different data acquisition tools.

Asynchronous code has become a go-to choice for programmers looking to process a large number of URLs at once. It allows them to execute more tasks and, most importantly, to do it fast. In this tutorial, we describe the asynchronous approach to scraping multiple URLs and, by comparing it to the synchronous one, demonstrate why it can be more beneficial.

You can also check out one of our videos for a visual representation of the same web scraping tutorial.

What is asynchronous web scraping?

Asynchronous web scraping, also referred to as non-blocking or concurrent scraping, is a technique that allows you to begin a potentially lengthy task and still respond to other events, rather than having to wait for that long task to finish.

Additionally, with the asyncio library providing a great number of tools for writing non-blocking code and AIOHTTP delivering more specific functionality for HTTP requests, it is no surprise that asynchronous scraping has gained so much popularity among developers.

Remember that even though this blog post demonstrates the benefits of the asynchronous approach, scraping multiple URLs at once can also be achieved through multiple threads and processes. For more details, check out one of our other step-by-step tutorials.
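As a rough illustration of the thread-based alternative mentioned above, the sketch below uses the standard library's concurrent.futures.ThreadPoolExecutor. The fetch and fetch_all functions are hypothetical names, fetch sleeps instead of making a real HTTP request, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Hypothetical stand-in for a blocking HTTP request (e.g. requests.get).
    time.sleep(0.1)
    return f'response from {url}'


def fetch_all(urls, max_workers=5):
    # Each URL is handled by a worker thread; map() preserves the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))


if __name__ == '__main__':
    # Placeholder URLs, used only for illustration.
    urls = [f'https://example.com/page-{i}' for i in range(5)]
    print(fetch_all(urls))
```

Threads suit blocking libraries such as requests, while the rest of this tutorial focuses on the asyncio-based approach.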

Sync vs. async: What’s the difference?

Compared to async, the synchronous approach to scraping multiple website URLs means running one request at a time. In essence, synchronous code processes one URL and moves on to the next URL only after the previous one has completely finished processing.

Thus, the main difference between the two approaches is that synchronous code stops the next request from running while async code allows you to scrape data from multiple web pages roughly at the same time. This leads us to the main benefit of the asynchronous approach – great time-efficiency. You no longer have to wait for the scraping of one page to finish before starting the other.
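To make the difference concrete, here is a minimal sketch that uses only the standard library. The fake_request coroutine is a hypothetical stand-in that waits with asyncio.sleep instead of contacting a website, so the timings only illustrate the principle and are not a benchmark.

```python
import asyncio
import time


async def fake_request(page, delay=0.1):
    # Stand-in for a network call: awaiting sleep yields control to the event loop.
    await asyncio.sleep(delay)
    return page


async def run_sequentially(pages):
    # Synchronous-style flow: each request starts only after the previous one ends.
    return [await fake_request(page) for page in pages]


async def run_concurrently(pages):
    # Asynchronous flow: all requests are started, then awaited together.
    return await asyncio.gather(*(fake_request(page) for page in pages))


def timed(coroutine):
    # Runs a coroutine to completion and returns (result, elapsed seconds).
    start = time.perf_counter()
    result = asyncio.run(coroutine)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    pages = ['page-1', 'page-2', 'page-3', 'page-4', 'page-5']
    _, sequential_time = timed(run_sequentially(pages))
    _, concurrent_time = timed(run_concurrently(pages))
    print(f'Sequential: {sequential_time:.2f} s, concurrent: {concurrent_time:.2f} s')
```

In the sequential version the waits add up, whereas in the concurrent version they overlap, so the total is roughly the duration of the slowest single request.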

Now that you have a clear understanding of the main difference between sync and async code, we can move on to the tutorial part and create sample Python code for each of the approaches.

Sending asynchronous HTTP requests

Let’s start with the asynchronous approach. For this use case, we will use the aiohttp module.

1. Create an empty Python file with a main function

Note that the main function is marked as asynchronous. We run it on the asyncio event loop, which prevents the script from exiting until the main function completes.


import asyncio


async def main():
    print('Saving the output of extracted information')


# asyncio.run starts the event loop and returns once main() has completed
asyncio.run(main())

It is a good idea to track the performance of your script. For that purpose, let's write code that tracks the script's execution time.

2. Track script execution time

Record the time at the start of the script. Then, add any code that you need to measure (currently a single print statement). Finally, calculate how much time has passed by subtracting the start time from the current time, and print the result rounded to two decimal places.


import asyncio
import time


async def main():
    start_time = time.time()

    print('Saving the output of extracted information')

    time_difference = time.time() - start_time
    print(f'Scraping time: {time_difference:.2f} seconds.')


asyncio.run(main())

Time to read the csv file that contains URLs. The file will contain a single column called url. There, you will see all the URLs that need to be scraped for data.
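If you don't have such a file yet, the sketch below shows one way to produce it and read it back. The URLs are placeholders chosen for illustration, and write_urls_csv and read_urls_csv are hypothetical helper names, not part of the tutorial's script.

```python
import csv

# Placeholder URLs for illustration; replace them with the pages you want to scrape.
example_urls = [
    'https://example.com/book-1',
    'https://example.com/book-2',
]


def write_urls_csv(path, urls):
    # Writes a single-column CSV with a "url" header, one URL per row.
    with open(path, 'w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=['url'])
        writer.writeheader()
        for url in urls:
            writer.writerow({'url': url})


def read_urls_csv(path):
    # Reads the URLs back the same way the tutorial does, via csv.DictReader.
    with open(path, newline='') as file:
        return [row['url'] for row in csv.DictReader(file)]


if __name__ == '__main__':
    write_urls_csv('urls.csv', example_urls)
    print(read_urls_csv('urls.csv'))
```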

3. Create a loop

Next, we open urls.csv, load it using the csv module, and loop over every URL in the file. For now, the loop only prints each URL; afterwards, we will create an async task for every URL we are going to scrape.


import asyncio
import csv
import time


async def main():
    start_time = time.time()

    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            # the url from csv can be found in csv_row['url']
            print(csv_row['url'])

    print('Saving the output of extracted information')

    time_difference = time.time() - start_time
    print(f'Scraping time: {time_difference:.2f} seconds.')


asyncio.run(main())

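Now we create an async task for every URL with asyncio.create_task and, later in the function, wait for all the scraping tasks to complete with asyncio.gather before moving on. The scrape function itself is added in the next step.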


import asyncio
import csv
import time


async def main():
    start_time = time.time()

    tasks = []
    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            task = asyncio.create_task(scrape(csv_row['url']))
            tasks.append(task)

    print('Saving the output of extracted information')
    await asyncio.gather(*tasks)

    time_difference = time.time() - start_time
    print(f'Scraping time: {time_difference:.2f} seconds.')


asyncio.run(main())
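By default, asyncio.gather raises the first exception that any task produces, so a single failing URL can interrupt the whole run. The sketch below shows one possible way to collect failures instead, using the return_exceptions=True argument. The scrape coroutine here is a hypothetical stand-in that fails for one made-up URL, and scrape_all is an illustrative helper name.

```python
import asyncio


async def scrape(url):
    # Hypothetical stand-in for the real scrape function: fails for one URL.
    await asyncio.sleep(0.01)
    if 'broken' in url:
        raise ValueError(f'could not scrape {url}')
    return f'scraped {url}'


async def scrape_all(urls):
    tasks = [asyncio.create_task(scrape(url)) for url in urls]
    # return_exceptions=True returns raised exceptions as results
    # instead of propagating the first one to the caller.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    succeeded = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return succeeded, failed


if __name__ == '__main__':
    ok, errors = asyncio.run(scrape_all(['page-1', 'broken-page', 'page-2']))
    print(f'{len(ok)} succeeded, {len(errors)} failed')
```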

All that's left is scraping! But before doing that, remember to take a look at the data you're scraping.

The title of the book can be extracted from an <h1> tag that is wrapped by a <div> tag with a product_main class.

The product information can be found in a table with a table-striped class.

4. Create a scrape function

The scrape function makes a request to the URL we loaded from the csv file. Once the request is done, it loads the response HTML using the BeautifulSoup module. Then we use the knowledge about where the data is stored in HTML tags to extract the book name into the book_name variable and collect all product information into a product_info dictionary.


import asyncio
import csv
import time
import aiohttp
from bs4 import BeautifulSoup


async def scrape(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            body = await resp.text()
            soup = BeautifulSoup(body, 'html.parser')
            book_name = soup.select_one('.product_main').h1.text
            rows = soup.select('.table.table-striped tr')
            product_info = {row.th.text: row.td.text for row in rows}


async def main():
    start_time = time.time()

    tasks = []
    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            task = asyncio.create_task(scrape(csv_row['url']))
            tasks.append(task)

    print('Saving the output of extracted information')
    await asyncio.gather(*tasks)

    time_difference = time.time() - start_time
    print(f'Scraping time: {time_difference:.2f} seconds.')


asyncio.run(main())

5. Add save_product function

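The URL is scraped; however, no results can be seen yet. For that, you need to add another function, save_product. It takes the book name and the product info dictionary, replaces the spaces in the book name with underscores, and dumps the product info into a JSON file. Make sure a data directory exists in the folder of your script, as this is where the JSON files are saved.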


import asyncio
import csv
import json
import time
import aiohttp
from bs4 import BeautifulSoup


async def save_product(book_name, product_info):
    json_file_name = book_name.replace(' ', '_')
    with open(f'data/{json_file_name}.json', 'w') as book_file:
        json.dump(product_info, book_file)


async def scrape(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            body = await resp.text()
            soup = BeautifulSoup(body, 'html.parser')
            book_name = soup.select_one('.product_main').h1.text
            rows = soup.select('.table.table-striped tr')
            product_info = {row.th.text: row.td.text for row in rows}
            await save_product(book_name, product_info)


async def main():
    start_time = time.time()

    tasks = []
    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            task = asyncio.create_task(scrape(csv_row['url']))
            tasks.append(task)

    print('Saving the output of extracted information')
    await asyncio.gather(*tasks)

    time_difference = time.time() - start_time
    print(f'Scraping time: {time_difference:.2f} seconds.')


asyncio.run(main())

6. Run the script

Lastly, you can run the script and see the data.
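When the URL list grows to hundreds of pages, starting every request at once may put unnecessary load on the target server or on your own connection. One common way to cap the number of simultaneous requests is asyncio.Semaphore. The sketch below is an illustration under stated assumptions rather than part of the tutorial's script: the scrape coroutine sleeps instead of calling aiohttp, the limit of 3 is arbitrary, and the stats dictionary exists only to show the cap being respected.

```python
import asyncio


async def scrape(url, semaphore, stats):
    # The semaphore lets at most `limit` coroutines into this block at once.
    async with semaphore:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)  # stand-in for the real aiohttp request
        stats['active'] -= 1
    return url


async def scrape_all(urls, limit=3):
    semaphore = asyncio.Semaphore(limit)
    stats = {'active': 0, 'peak': 0}
    results = await asyncio.gather(*(scrape(url, semaphore, stats) for url in urls))
    return results, stats['peak']


if __name__ == '__main__':
    # Placeholder URLs, used only for illustration.
    urls = [f'https://example.com/page-{i}' for i in range(10)]
    results, peak = asyncio.run(scrape_all(urls, limit=3))
    print(f'Scraped {len(results)} pages, at most {peak} at a time')
```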

Sending synchronous HTTP requests

In this part of the tutorial, we are going to scrape the URLs defined in urls.csv using a synchronous approach. For this particular use case, the Python requests module is an ideal tool.

1. Create a Python file with a main function


def main():
    print('Saving the output of extracted information')

main()

Tracking the performance of your script is always a good idea. Therefore, the next step is to add code that tracks the script's execution time.

2. Track script execution time

First, record the time at the very start of the script. Then, add any code that needs to be measured; in this case, we are using a single print statement. Finally, calculate how much time has passed by subtracting the start time from the current time. Once we know how much time has passed, we can print it rounded to two decimal places.


import time


def main():
    start_time = time.time()

    print('Saving the output of extracted information')

    time_difference = time.time() - start_time
    print(f'Scraping time: {time_difference:.2f} seconds.')

main()

Now that the preparations are done, it's time to read the csv file that contains URLs. There, you will see a single column called url, which will contain URLs that have to be scraped for data.

3. Create a loop

Next, we have to open up urls.csv. After that, load it using the csv module and loop over each and every URL from the csv file.


import csv
import time


def main():
    start_time = time.time()

    print('Saving the output of extracted information')
    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            # the url from csv can be found in csv_row['url']
            print(csv_row['url'])

    time_difference = time.time() - start_time
    print(f'Scraping time: {time_difference:.2f} seconds.')

main()

At this point, the job is almost done: all that’s left is the scraping itself. Before you do that, look at the data you’re scraping.

The title of the book, “A Light in the Attic”, can be extracted from an <h1> tag that is wrapped by a <div> tag with a product_main class.

As for the product information, it can be found in a table with a table-striped class, which you can inspect in your browser's developer tools.

4. Create a scrape function

Now, let's use what we've learned and create a scrape function.

The scrape function makes a request to the URL we loaded from the csv file. Once the request is done, it loads the response HTML using the BeautifulSoup module. Then, we use the knowledge about where the data is stored in HTML tags to extract the book name into the book_name variable and collect all product information into a product_info dictionary.


import csv
import time
import requests
from bs4 import BeautifulSoup


def scrape(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    book_name = soup.select_one('.product_main').h1.text
    rows = soup.select('.table.table-striped tr')
    product_info = {row.th.text: row.td.text for row in rows}

def main():
    start_time = time.time()

    print('Saving the output of extracted information')
    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            scrape(csv_row['url'])

    time_difference = time.time() - start_time
    print(f'Scraping time: {time_difference:.2f} seconds.')


main()

The URL is scraped; however, no results are seen yet. For that, it’s time to add yet another function - save_product.

5. Add save_product function

save_product takes two parameters: the book name and the product info dictionary. Since the book name contains spaces, we first replace them with underscores. Finally, we create a JSON file and dump all the info we have into it. Make sure you create a data directory in the folder of your script where all the JSON files are going to be saved.


import csv
import json
import time
import requests
from bs4 import BeautifulSoup


def save_product(book_name, product_info):
    json_file_name = book_name.replace(' ', '_')
    with open(f'data/{json_file_name}.json', 'w') as book_file:
        json.dump(product_info, book_file)


def scrape(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    book_name = soup.select_one('.product_main').h1.text
    rows = soup.select('.table.table-striped tr')
    product_info = {row.th.text: row.td.text for row in rows}
    save_product(book_name, product_info)


def main():
    start_time = time.time()

    print('Saving the output of extracted information')
    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            scrape(csv_row['url'])

    time_difference = time.time() - start_time
    print(f'Scraping time: {time_difference:.2f} seconds.')


main()
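The save_product function above expects the data directory to exist and only replaces spaces in the book name. As a hedged variation, the sketch below creates the directory on demand and strips characters that are commonly unsafe in file names. The output_dir parameter and the sanitizing rule are assumptions added for illustration; adjust them to your needs.

```python
import json
import re
from pathlib import Path


def save_product(book_name, product_info, output_dir='data'):
    # Create the output directory if it does not exist yet.
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # Replace whitespace with underscores, then drop characters that are
    # unsafe in file names (an assumed rule, not taken from the tutorial).
    json_file_name = re.sub(r'\s+', '_', book_name.strip())
    json_file_name = re.sub(r'[^\w\-]', '', json_file_name)

    file_path = directory / f'{json_file_name}.json'
    with open(file_path, 'w') as book_file:
        json.dump(product_info, book_file)
    return file_path
```

Returning the path is optional; it simply makes it easier to log or verify where each book was saved.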

6. Run the script

Now, it's time to run the script and see the data. Here, we can also see how much time the scraping took – in this case it’s 17.54 seconds.

Comparing the performance of sync and async

Now that we have carefully gone through the process of making requests with both the synchronous and asynchronous methods, we can run the scripts once again and compare their performance.

The time difference is huge: while the async web scraping code was able to execute all the tasks in around 3 seconds, the synchronous one took almost 16. This shows that scraping asynchronously is indeed more beneficial due to its noticeable time efficiency.

Wrapping up

Collecting high-quality public data often requires scraping dozens or even hundreds of web pages. To help you understand how this can be done efficiently, we compared two approaches to scraping multiple website URLs with Python: synchronous and asynchronous (non-blocking). By exploring their main difference and providing sample code snippets, we demonstrated that the asynchronous approach is more advantageous for sending multiple HTTP requests.

If you have any questions about this particular tutorial or any other topic related to web scraping, feel free to contact us at hello@oxylabs.io. Additionally, take advantage of a free trial of our web intelligence solutions, such as Web Scraper API, and decide whether they fit your data needs.

Frequently Asked Questions

Python is versatile and supports both synchronous and asynchronous programming. The main difference between these two approaches is that synchronous code executes one request at a time while asynchronous code allows non-blocking processing of tasks which significantly reduces the scraping time.

A straightforward approach is to record a timestamp before sending a request and another after reading its response, and then subtract the first from the second. The difference represents the response time.
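A minimal sketch of this measurement, assuming a hypothetical fake_request function that sleeps in place of a real HTTP call, could look as follows. The timed_call helper is an illustrative name, and time.perf_counter is used here because it is intended for measuring short durations.

```python
import time


def timed_call(func, *args, **kwargs):
    # Record a timestamp before the call and another after it returns;
    # the difference is the elapsed (response) time in seconds.
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_request(url):
    # Hypothetical stand-in for a real HTTP request such as requests.get(url).
    time.sleep(0.05)
    return f'response from {url}'


if __name__ == '__main__':
    response, seconds = timed_call(fake_request, 'https://example.com')
    print(f'Response time: {seconds:.2f} seconds')
```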

One of the main differences between HTTPX and AIOHTTP is the scope of their libraries. HTTPX is an HTTP client that supports both synchronous and asynchronous usage, while AIOHTTP is built exclusively for asyncio programs and provides both an HTTP client and a server framework. For more information on the differences between HTTPX and AIOHTTP, check out our blog post.

AIOHTTP is one of the fastest async HTTP clients in Python. By using AIOHTTP, you can significantly speed up your script. Also, AIOHTTP has extensive documentation and several advanced features, such as cookies, sessions, and connection pools.

A prime example of a situation when you would want to use async in Python is any task that requires you to send multiple requests. With Python async requests, you can scrape dozens or even hundreds of web pages quickly and efficiently.


About the author

Yelyzaveta Hayrapetyan

Senior Technical Copywriter

Yelyzaveta Hayrapetyan is a Senior Technical Copywriter at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.


All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.