
Automated Web Scraper With Python AutoScraper [Guide]


Roberta Aukstikalnyte

2022-09-28

If you’re looking for a way to get public web data scraped regularly at a set interval, you’ve come to the right place. This tutorial will show you how to automate your web scraping processes using AutoScraper – one of the several Python web scraping libraries available.

Before getting started, you may want to check out this in-depth guide for building an automated web scraper using various web scraping tools supported by Python.

Now, let’s get into it.

Automated web scraping with Python AutoScraper library

AutoScraper is a web scraping library written in Python 3; it’s known for being lightweight, intelligent, and easy to use – even beginners can use it without an in-depth understanding of web scraping.

AutoScraper accepts the URL or HTML of any website and learns scraping rules from sample data. In other words, it matches the sample data against the relevant web page and scrapes any data that follows similar rules.

Installing AutoScraper

First things first, let’s install the AutoScraper library. There are several ways to install and use this library, but for this tutorial, we’re going to install it from the Python Package Index (PyPI) using the following pip command:

pip install autoscraper

Scraping Books to Scrape with AutoScraper

This section showcases an example of automated web scraping with the AutoScraper module in Python using the Books to Scrape website as a subject. 

The subject website has almost a thousand books in different categories. As the screenshot shows, the links to all the book category pages are available on the page's left section:

Scraping books category URLs

Now, if you want to scrape the links to all the category pages, you can do it with the following trivial code:

from autoscraper import AutoScraper
 
UrlToScrap = "https://books.toscrape.com"
WantedList = ["https://books.toscrape.com/catalogue/category/books/travel_2/index.html"]
 
Scraper = AutoScraper()
ScrapedData = Scraper.build(UrlToScrap, wanted_list=WantedList)
print(ScrapedData)

In the code above, we first import AutoScraper from the autoscraper library. Then, we provide the URL from which we want to scrape the information in the UrlToScrap variable.

The WantedList is assigned sample data that we want to scrape from the given subject URL. To get all the category page links from the target page, we need to give only one example data element to the WantedList. Therefore, we only provide a single link to the Travel category page as a sample data element.

AutoScraper() creates an AutoScraper object that exposes the different functions of the autoscraper library. The Scraper.build() method learns the scraping rules and scrapes data similar to the wanted_list from the target URL.

After executing the Python script above, the ScrapedData list will have all the category page links available at https://books.toscrape.com. The output of the script should look something like this: 

['https://books.toscrape.com/catalogue/category/books/travel_2/index.html', 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html', 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html', 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html', ...]

Scraping book information from a single webpage

So far, we’ve looked at extracting URLs from a page, but we still need to cover scraping specific pieces of data. This section discusses using AutoScraper to scrape information about a book stored on a specific page.

Say that we want to get the title of the book along with its price; we can train and build an AutoScraper model as follows:

from autoscraper import AutoScraper
UrlToScrap="https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
WantedList=["It's Only the Himalayas", "£45.17"]
 
InfoScraper = AutoScraper()
InfoScraper.build(UrlToScrap, wanted_list=WantedList)

The script above feeds a URL of the book page and a sample of required information from that page to the AutoScraper model. The build() method learns the rules to scrape the information and prepares our InfoScraper for future use.

Now, let’s apply this InfoScraper tactic to a different book’s URL and see if it returns the desired information.

another_book_url = 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html'

scraped_data = InfoScraper.get_result_similar(another_book_url)
print(scraped_data)

Output:

['Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond', 'ce60436f52c5ee68', 'Books', '£49.43', '£0.00', 'In stock (15 available)', '0']

The script above applies InfoScraper to another_book_url and prints the scraped_data. Notice that the scraped data has some unnecessary information along with the desired information. This is due to the get_result_similar() method, which returns information similar to the wanted_list.

another_book_url = 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html'

scraped_data = InfoScraper.get_result_exact(another_book_url)
print(scraped_data)

Output:

['Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond', '£49.43']

Here, we used the get_result_exact() method to ensure accurate, in-order retrieval of the book title and price as defined by wanted_list.

Scraping all the books in a specific category

Until now, we’ve learned to extract similar and exact information from a specific webpage, including URLs. Let’s learn how to scrape data from all the books in one specific category. It can be done by using two scrapers: one for scraping URLs of all the books in this category and the other for scraping information from each link. 

Let’s turn this strategy into action using the following Python script:

#BooksByCategoryScraper.py
from autoscraper import AutoScraper
import pandas as pd
 
#BooksUrlScraper section
TravelCategoryLink = 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html'
WantedList=["https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"]
BooksUrlScraper = AutoScraper()
BooksUrlScraper.build(TravelCategoryLink, wanted_list=WantedList)
 
#BookInfoScraper section
BookPageUrl = "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html" 
WantedList=["It's Only the Himalayas", "£45.17"]
 
BookInfoScraper = AutoScraper()
BookInfoScraper.build(BookPageUrl, wanted_list=WantedList)
 
#Scraping info of each book and storing it in an Excel file
BooksUrlList = BooksUrlScraper.get_result_similar(TravelCategoryLink)
BooksInfoList = []
for Url in BooksUrlList:
  book_info = BookInfoScraper.get_result_exact(Url)
  BooksInfoList.append(book_info)
df = pd.DataFrame(BooksInfoList, columns=["Book Title", "Price"])
df.to_excel("BooksInTravelCategory.xlsx")

The script above has three main parts: two sections for building the scrapers and a third that scrapes data from all the books in the Travel category and saves it as an Excel file.

For this step, we’ve built BooksUrlScraper to scrape all the similar book links on the Travel category page. These eleven links are stored in BooksUrlList. Now, for each URL in BooksUrlList, we apply BookInfoScraper and append the scraped information to BooksInfoList. Finally, BooksInfoList is converted to a data frame and exported as an Excel file for future use.


The output file reflects the initial goal – the titles and prices of all eleven books in the Travel category.

Now we know how to use a combination of multiple AutoScraper models to scrape data in bulk. You can extend the script above to scrape all the books from all the categories and save them in separate Excel files for each category.
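As a rough sketch of that extension – the helper names below are our own, and the three scraper objects are assumed to be built exactly as in the earlier scripts (the category-links Scraper, plus BooksUrlScraper and BookInfoScraper):

```python
def category_filename(category_url):
    # Derive an Excel file name from a category URL, e.g.
    # ".../books/travel_2/index.html" -> "BooksInTravel2Category.xlsx"
    slug = category_url.rstrip("/").rsplit("/", 2)[-2]
    name = "".join(part.capitalize() for part in slug.split("_"))
    return f"BooksIn{name}Category.xlsx"

def scrape_all_categories(category_scraper, books_url_scraper, book_info_scraper):
    # For every category link, scrape each book's title and price,
    # then save one Excel file per category.
    import pandas as pd
    category_links = category_scraper.get_result_similar("https://books.toscrape.com")
    for category_link in category_links:
        books = [book_info_scraper.get_result_exact(url)
                 for url in books_url_scraper.get_result_similar(category_link)]
        df = pd.DataFrame(books, columns=["Book Title", "Price"])
        df.to_excel(category_filename(category_link))
```

One caveat: BooksUrlScraper was trained on the Travel category page only, so you may want to verify its rules generalize to the other category pages before trusting the output.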

How to use AutoScraper with proxies

Proxies are an integral part of the web scraping process: acquiring data without them involves various risks, such as the target website blocking your IP address. Let’s take a look at the process of using proxies with AutoScraper.

The build() method of AutoScraper accepts request-related arguments through the request_args parameter.

Here’s what using AutoScraper with proxy IPs looks like in practice:

from autoscraper import AutoScraper
UrlToScrap="https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
WantedList=["It's Only the Himalayas", "£45.17"]
 
proxy = {
    "http": 'PROXY_ENDPOINT_HERE',
    "https": 'PROXY_ENDPOINT_HERE',
}
InfoScraper = AutoScraper()
InfoScraper.build(UrlToScrap, wanted_list=WantedList, request_args={"proxies": proxy})

Here, PROXY_ENDPOINT_HERE stands for the address of a proxy server in the correct format (e.g., http://127.0.0.1:8081). The script above should work fine once proper proxy endpoints are added to the proxy dictionary.
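If you plan to reuse the same proxy settings across several calls, a small helper of our own can keep them in one place; the endpoint below is a placeholder, and, to our knowledge, the get_result_* methods accept request_args as well:

```python
def make_request_args(proxy_endpoint, timeout=30):
    # Build a request_args dict for AutoScraper calls; proxy_endpoint
    # is a placeholder address such as "http://127.0.0.1:8081".
    return {
        "proxies": {"http": proxy_endpoint, "https": proxy_endpoint},
        "timeout": timeout,
    }

# Usage sketch (hypothetical endpoint):
# args = make_request_args("http://127.0.0.1:8081")
# InfoScraper.build(UrlToScrap, wanted_list=WantedList, request_args=args)
# InfoScraper.get_result_exact(another_book_url, request_args=args)
```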

Saving and loading an AutoScraper model

AutoScraper provides the ability to save and load a pre-trained scraper. We can use the following script to save the InfoScraper object to a file:

InfoScraper.save('file_name')

Similarly, we can load a scraper using:

SavedScraper = AutoScraper()
SavedScraper.load('file_name')

Now that we’ve built the automated web scraper, let’s move on to the last part of the tutorial – managing automation mechanisms.

Alternative options for web scraping automation

This section will discuss the alternatives for scheduling Python scripts on macOS, Unix/Linux, and Windows operating systems. 

Say you want your scraper to periodically visit the Travel category page and scrape every new book uploaded – this can be done by scheduling the BooksByCategoryScraper.py script. Whenever executed, this script scrapes data from all the books on the Travel category page and returns the results in an Excel file.

You can schedule a Python script through: 

  • Schedule module in Python: tutorial

  • Adding it to the crontab (cron table): tutorial

  • Creating a daemon or background service through systemd: tutorial

  • Task Scheduler in Windows: tutorial

Crontab and systemd (system daemon) methods are specific to Unix-based operating systems, including Linux and macOS. Meanwhile, Task Scheduler helps schedule a Python script on Windows.
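For instance, a crontab entry along these lines (the interpreter and script paths are placeholders for your own) would run BooksByCategoryScraper.py every day at 9 AM:

```
0 9 * * * /usr/bin/python3 /path/to/BooksByCategoryScraper.py
```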

Frequently Asked Questions

What is the difference between a web crawler and a web scraper?

Simply put, a web scraper is a tool for extracting data from one or more websites; meanwhile, a crawler finds or discovers URLs or links on the web.

Can you manually edit or remove rules for AutoScraper objects?

Definitely, you can use the keep_rules() and remove_rules() methods to keep the required rules and remove the unwanted rules, respectively. More details can be found here.
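As a rough illustration with a stand-in helper of our own (rule IDs are generated internally by AutoScraper and, to our knowledge, can be inspected by passing grouped=True to the get_result_* methods):

```python
def prune_rules(scraper, keep=None, remove=None):
    # Keep or discard learned rules by their IDs. The scraper object is
    # expected to expose AutoScraper's keep_rules()/remove_rules() methods.
    if keep:
        scraper.keep_rules(keep)
    if remove:
        scraper.remove_rules(remove)

# Usage sketch (rule IDs are hypothetical):
# prune_rules(InfoScraper, keep=["rule_io6t"])
# prune_rules(InfoScraper, remove=["rule_pgd1"])
```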

About the author

Roberta Aukstikalnyte

Senior Content Manager

Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
