Web Scraping with Rust

Maryia Stsiopkina

2022-08-24

Rust is rapidly gaining attention as a programming language that offers performance on par with C and C++, which makes it attractive for web scraping. However, unlike Python, which is relatively easy to learn, often at the cost of performance, Rust can be tricky to pick up.

That doesn't mean scraping with Rust is impossible or extremely hard. It is challenging only if you don't know where to begin.

In this practical tutorial, we will be writing a web scraper that can extract product data from an e-commerce store. With this, you will get started with Rust without much effort. 

Installing and running Rust

Let’s start by installing Rust. The steps depend on your operating system.

Installing Rust on Windows

The recommended way to install Rust is with the rustup utility. Head to the https://www.rust-lang.org/tools/install page. This page displays different content based on the operating system you are using. On Windows, the page will be as follows:

Click on the Download RUSTUP-INIT (64-bit) button to download the rustup utility. 

Important: You must download and install the Visual Studio C++ Build Tools before installing the Rust programming language and compiler.

After installing Visual Studio C++ build tools, run the rustup-init executable that you downloaded. This utility opens a command prompt window informing you that Visual Studio C++ build tools should be installed as follows:

Press y to continue with the installation. On the next screen, review the information and press 1 to proceed with the installation.

After the installation is complete, close the command prompt and open it again. Opening a new command prompt ensures that all the environment variable changes are in effect.

You can run the following command to verify the installation:

C:\>rustc --version
rustc 1.62.1 (e092d0b6b 2022-07-16)

Installing Rust on macOS and Linux

Even though it is possible to install Rust on macOS using Homebrew, we recommend using the rustup utility. Using rustup ensures that other required utilities, such as cargo, are installed correctly.

Head to the https://www.rust-lang.org/tools/install page. On macOS and Linux, the page will be as follows:

Copy the cURL command to download and install the rustup utility. Open a terminal and run this command. You will be presented with the confirmation screen:

Review the information and press 1 to proceed with the installation.

After the installation is complete, close the terminal and open it again. Opening a new terminal ensures that all the environment variable changes are in effect.

You can run the following command to verify the installation:

$ rustc --version
rustc 1.63.0 (4b91a6ea7 2022-08-08)

Rust web scraper for scraping book data

To understand how scraping with Rust works, we will create a real-life web scraping project. 

The site we will be scraping is https://books.toscrape.com/, a dummy bookstore for learning web scraping. It has all the essential components of an eCommerce store. 

Setup

The first step is to open the terminal or command prompt and create a Rust project. We will be using the Cargo package manager to build the project structure, download the dependencies, compile, and run the project.

Open the terminal and run the following command to initialize an empty project:

$ cargo new book_scraper

This command will create a folder named book_scraper and populate it with the files and folders required for a Rust project. The important files are Cargo.toml and the main.rs file in the src folder.

Open this folder in a text editor or IDE of your choice. 

If you are using Visual Studio Code, we recommend installing an extension, such as rust-analyzer, to make coding with Rust easier.

Now, open the Cargo.toml file, and add the following lines:

[dependencies]
reqwest = {version = "0.11", features = ["blocking"]} 
scraper = "0.13.0"

These lines declare two dependencies: reqwest and scraper. We’ll cover both later.

Go back to the terminal and run the following command to download the dependencies, compile the code, and run it.

$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.12s
Running `target/debug/book_scraper`
Hello, world!

This command will compile the code to create an executable and run it. The executable is created in the following path:

./target/debug/book_scraper

If your operating system is Windows, the file name will be .\target\debug\book_scraper.exe

Making an HTTP request

To send HTTP requests, such as GET or POST, we need an HTTP client library. One of the most convenient Rust libraries for this is reqwest.

This library exposes two types of HTTP clients: an asynchronous client and a blocking client.

This article aims to give you a basic overview of scraping with Rust, so we use the blocking client to keep the tutorial easy to follow. That's why we specified the blocking feature in Cargo.toml:

reqwest = {version = "0.11", features = ["blocking"]} 

Open main.rs, and in the main() function, enter the following lines of code:

fn main() {
    let url = "https://books.toscrape.com/";
    let response = reqwest::blocking::get(url).expect("Could not load url.");
    let body = response.text().unwrap();
    print!("{}",body);
}

In the first line, we are storing the target URL. 

The next line sends an HTTP GET request to this URL using the blocking HTTP client. The result is stored in the variable response.

After that, the HTML from the response is extracted and stored in the body variable. This variable is then printed.

Save this file and navigate to the terminal. Enter the following command:

$ cargo run

The output of this will be the entire HTML printed on the terminal.

Parsing HTML with scraper

To create a web scraper, we need to use another Rust library. This library is conveniently called scraper. It allows using CSS selectors to extract desired HTML elements.

If you haven't already done so, add the following line to the Cargo.toml file under [dependencies]:

scraper = "0.13.0" 

Open the main.rs file and add the following line at the end of the main() function:

let document = Html::parse_document(&body);

In this line, we call the parse_document function to parse the web page. We pass it the raw HTML that was downloaded with the reqwest library. The result is the parsed document, which is stored in the variable named document.

The parsed HTML document can be queried using CSS selectors to locate the HTML elements that contain the required information. 

We can break this process into three steps:

● Locating products via CSS selectors;

● Extracting product description;

● Extracting product links.

More about them in the following sections.

Locating products via CSS selectors

The first step is to identify the CSS selector that matches the HTML element containing a product's information. In our example, the product is a book.

Open https://books.toscrape.com/ in Chrome and examine the HTML markup of the web page.

<IMAGE: book_container.png / ALT: Container for a product and CSS Selector>

You will notice that the selector article.product_pod matches each book. This means we can loop over all the books and extract information from each one.

First, add the following line at the beginning of the main.rs file:

use scraper::{Html, Selector};

Next, add the following line in the main function:

let book_selector = Selector::parse("article.product_pod").unwrap();

Now the selector is ready to be used. Add the following lines to the main function:

for element in document.select(&book_selector) {
    // more code here
}

We can now apply more CSS selectors to extract information about each book.

Extracting product description

By looping over the HTML elements that act as containers for each product, it is easy to write reusable web scraping code.

In this example, we will retrieve the product name and product price.

First, create two selectors before the for loop as follows:

let book_name_selector = Selector::parse("h3 a").unwrap();
let book_price_selector = Selector::parse(".price_color").unwrap();

In the for loop, use these selectors on an individual book:

for element in document.select(&book_selector) {
    let book_name_element = element.select(&book_name_selector).next().expect("Could not select book name.");
    let book_name = book_name_element.value().attr("title").expect("Could not find title attribute.");
    let price_element = element.select(&book_price_selector).next().expect("Could not find price");
    let price = price_element.text().collect::<String>();
    println!("{:?} - {:?}", book_name, price);
}

Notice two important points:

  • The book name is in the title attribute of the <a> element;

  • The price is in the text of the element.
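
Because the price arrives as text, it often helps to turn it into a number before sorting or comparing. The following is a minimal sketch using only the standard library. It assumes the text looks like "£51.77" (a currency symbol followed by a decimal number); parse_price is a hypothetical helper name, not part of the scraper library.

```rust
/// Hypothetical helper: parses a price string such as "£51.77" into a number.
/// Returns None when the remaining characters do not form a valid number.
fn parse_price(text: &str) -> Option<f64> {
    // Keep only ASCII digits and the decimal point. This drops the currency
    // symbol and any surrounding whitespace.
    let numeric: String = text
        .chars()
        .filter(|c| c.is_ascii_digit() || *c == '.')
        .collect();
    numeric.parse().ok()
}

fn main() {
    // "£51.77" is an assumed sample of the text inside the .price_color element.
    match parse_price("£51.77") {
        Some(value) => println!("Parsed price: {:.2}", value),
        None => println!("Could not parse price"),
    }
}
```

Inside the for loop, this helper could be called as parse_price(&price). Returning an Option instead of panicking lets the caller decide what to do with a malformed price.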

Save the files and run the following from your terminal:

$ cargo run

This should print the book names and prices on the terminal.

Extracting product links

The product links can be extracted in a similar fashion. The link is in the href attribute of the same <a> element that holds the book name. Create a selector outside the for loop as follows:

let book_link_selector = Selector::parse("h3 a").unwrap();

Within the for loop, add the following lines:

let book_link_element = element.select(&book_link_selector).next().expect("Could not find book link element.");
let book_link = book_link_element.value().attr("href").expect("Could not find href attribute");
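
The href values extracted this way may be relative paths rather than full URLs. The sketch below shows one simple way to turn such a value into an absolute URL using only the standard library. The sample href and the helper name absolute_url are assumptions for illustration, and the function handles only plain relative paths (no "../" segments); a dedicated URL library would be a more robust choice.

```rust
/// Hypothetical helper: joins a base URL and a relative href.
/// Simplified on purpose: it does not resolve "../" segments.
fn absolute_url(base: &str, href: &str) -> String {
    // Leave hrefs that are already absolute untouched.
    if href.starts_with("http://") || href.starts_with("https://") {
        return href.to_string();
    }
    // Avoid a doubled or missing slash between the two parts.
    format!(
        "{}/{}",
        base.trim_end_matches('/'),
        href.trim_start_matches('/')
    )
}

fn main() {
    let base = "https://books.toscrape.com/";
    // Assumed example of a relative href found on the page.
    let href = "catalogue/a-light-in-the-attic_1000/index.html";
    println!("{}", absolute_url(base, href));
}
```

In the scraper, the result of absolute_url(url, book_link) could be stored in place of the raw href so that the saved links can be opened directly.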

All the values we have scraped can now be printed to the console. Even better, we can save everything to a CSV file.

Writing scraped data to a CSV file

No web scraping project is complete without creating a file. In this case, we are going to write a CSV file.

We will use the csv Rust library to create the CSV file.

First, add the following to Cargo.toml dependencies:

csv = "1.1"

Next, create a CSV writer before the for loop as follows:

let mut wtr = csv::Writer::from_path("books.csv").unwrap();

Optionally, write the headers before the for loop as follows:

wtr.write_record(&["Book Name", "Price", "Link"]).unwrap();

Within the for loop, write each record as follows:

wtr.write_record([book_name, &price, &book_link]).unwrap();

Finally, flush the writer after the for loop so that all buffered records are written to the file:

wtr.flush().expect("Could not close file");
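
Book titles can contain commas, which is one reason to rely on a CSV writer instead of joining strings by hand. To illustrate what the writer takes care of, here is a simplified sketch of the conventional CSV quoting rule using only the standard library. quote_csv_field is a hypothetical helper and the sample record is made up; this is not how the csv library is implemented, only an approximation of the output format.

```rust
/// Hypothetical helper: quotes a field following the common CSV convention.
/// Fields containing a comma, a double quote, or a line break are wrapped in
/// double quotes, and embedded double quotes are doubled.
fn quote_csv_field(field: &str) -> String {
    let needs_quotes = field.contains(',')
        || field.contains('"')
        || field.contains('\n')
        || field.contains('\r');
    if needs_quotes {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

fn main() {
    // Made-up record: the title contains a comma, so it must be quoted.
    let record = ["Hello, World", "£10.00", "catalogue/hello/index.html"];
    let fields: Vec<String> = record.iter().map(|f| quote_csv_field(f)).collect();
    println!("{}", fields.join(","));
}
```

With write_record, none of this needs to be written manually, which is why the tutorial uses the library.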

Putting everything together, the main.rs file contains the following:

// main.rs
use scraper::{Html, Selector};

fn main() {
    // Download the page and read the response body as text.
    let url = "https://books.toscrape.com/";
    let response = reqwest::blocking::get(url).expect("Could not load url.");
    let body = response.text().expect("No response body found.");

    // Parse the HTML and prepare the CSS selectors.
    let document = Html::parse_document(&body);
    let book_selector = Selector::parse("article.product_pod").expect("Could not create selector.");
    let book_name_selector = Selector::parse("h3 a").expect("Could not create selector.");
    let book_price_selector = Selector::parse(".price_color").expect("Could not create selector.");
    let book_link_selector = Selector::parse("h3 a").expect("Could not create selector.");

    // Create the CSV file and write the header row.
    let mut wtr = csv::Writer::from_path("books.csv").expect("Could not create file.");
    wtr.write_record(&["Book Name", "Price", "Link"]).expect("Could not write header.");

    // Extract the name, price, and link of each book and write them as one record.
    for element in document.select(&book_selector) {
        let book_name_element = element.select(&book_name_selector).next().expect("Could not select book name.");
        let book_name = book_name_element.value().attr("title").expect("Could not find title attribute.");
        let price_element = element.select(&book_price_selector).next().expect("Could not find price");
        let price = price_element.text().collect::<String>();
        let book_link_element = element.select(&book_link_selector).next().expect("Could not find book link element.");
        let book_link = book_link_element.value().attr("href").expect("Could not find href attribute");
        wtr.write_record([book_name, &price, &book_link]).expect("Could not write record.");
    }

    // Write any buffered records to the file.
    wtr.flush().expect("Could not close file");
    println!("Done");
}

Conclusion

This article examined how to write a web scraper using Rust. We discussed how CSS selectors can be used in a web scraper with the help of the scraper Rust library.

If you want to learn more about scraping using other programming languages, check our articles, such as Web Scraping with JavaScript, Web Scraping with Java, Web Scraping with C#, Python Web Scraping Tutorial, and many more.

About the author

Maryia Stsiopkina

Content Manager

Maryia Stsiopkina is a Content Manager at Oxylabs. As her passion for writing was developing, she was writing either creepy detective stories or fairy tales for children at different points in time. Eventually, she found herself in the tech wonderland with numerous hidden corners to explore. In her spare time, she goes birdwatching with the binoculars (some people mistake it for stalking, which is why Maryia finds herself in an awkward situation sometimes), makes flower jewellery, and eats many pickles and green olives.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
