Web Scraping with JavaScript and Node.js

Adelina Kiskyte

2020-08-09, 9 min read

With the arrival of Node.js, JavaScript has evolved into a very powerful language for web scraping. Node.js, sometimes written as Node js or even nodejs, is the runtime that executes JavaScript code outside a browser. Additionally, npm, the package manager that ships with Node.js, offers a massive collection of libraries that make web scraping in Node.js much easier. Web scraping with JavaScript and Node.js is fast, and the learning curve is gentle for those who are already familiar with JavaScript.

This tutorial will explain all you need to know to get started with web scraping using JavaScript, while using a real-life scenario. By the end of this tutorial, you will have a good understanding of how to web scrape with JavaScript and Node.js.


This guide assumes at least a basic understanding of JavaScript. Familiarity with Chrome or Firefox Developer Tools will help, and some knowledge of jQuery or CSS selectors is essential. No prior experience with Node.js or with web scraping in Node.js is expected.

Required software

There are only two pieces of software that will be needed:

  1. Node.js (which comes with npm—the package manager for Node.js)

  2. Any code editor

The only thing that you need to know about Node.js is that it is a runtime environment. This simply means that JavaScript code, which typically runs in a browser, can run without one. Node.js is available for Windows, macOS, and Linux. It can be downloaded at the official download page.

Set up Node.js project

Before writing any code to web scrape using Node.js, create a folder where JavaScript files will be stored. These files will contain all the code required for web scraping.

Once the folder is created, navigate to this folder and run the initialization command:

npm init -y

This will create a package.json file in the directory. This file will contain information about the packages that are installed in this folder. The next step is to install the Node.js packages that are discussed in the next section.

Node.js packages

For Node.js web scraping, we need to use certain packages, also known as libraries. These libraries are prepackaged code that can be reused. The packages can be downloaded and installed using the npm install command, where npm is the package manager that comes with Node.js.

To install any package, simply run npm install <package-name>. For example, to install the package axios, run this on your terminal:

npm install axios

This also supports installing multiple packages. Run the following command to install all the packages used in this tutorial:

npm install axios cheerio json2csv

This command will download the packages into the node_modules directory and add them to the dependencies listed in the package.json file.

Basic steps of web scraping with JavaScript

Almost every web scraping project using Node.js or JavaScript involves three basic steps:

  1. Sending the HTTP request

  2. Parsing the HTTP response and extracting desired data

  3. Saving the data in some persistent storage, e.g. file, database and similar

The following sections will demonstrate how Axios can be used to send HTTP requests, cheerio to parse the response and extract the specific information that is needed, and finally, save the extracted data to CSV using json2csv. Additionally, if you want to replicate a cURL command using Axios, Node.js, or JavaScript, you can quickly do so using these cURL to Node Axios, cURL to Node.js, and cURL to JavaScript converters.

Sending HTTP request

The first step of web scraping with JavaScript is to find a package that can send an HTTP request and return the response. Even though request and request-promise were quite popular in the past, they are now deprecated. You will still find many examples and old code using these packages. With millions of downloads every day, Axios is a good alternative. It fully supports Promise syntax as well as async-await syntax.
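The two syntaxes can be compared side by side in a minimal sketch. To keep it runnable without network access or installed packages, it uses a hypothetical stand-in, fakeGet(), which resolves to an object shaped like an Axios response (status and data); the example.com address is only a placeholder. With Axios installed, axios.get(url) would take the place of fakeGet(url).

```javascript
// Hypothetical stand-in for axios.get(): resolves to an object that
// mimics the shape of an Axios response (status, data).
function fakeGet(url) {
  return Promise.resolve({ status: 200, data: `<h1>Fetched ${url}</h1>` });
}

// Promise syntax: chain .then() to receive the response.
function getWithPromises(url) {
  return fakeGet(url).then((response) => response.data);
}

// async-await syntax: the same logic, written sequentially.
// In a regular Node.js script, await is only valid inside an async function.
async function getWithAsyncAwait(url) {
  const response = await fakeGet(url);
  return response.data;
}

getWithPromises("http://example.com/").then((html) => console.log(html));
getWithAsyncAwait("http://example.com/").then((html) => console.log(html));
```

Both functions return a Promise that resolves to the same HTML string, so the choice between them is mostly a matter of readability.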

Parsing the response – Cheerio

Another useful package available through npm is Cheerio. It converts the raw HTML fetched by Axios into something that can be queried using a jQuery-like syntax.

JavaScript developers are usually familiar with jQuery. This makes Cheerio a very good choice to extract the information from HTML.

JavaScript web scraping – a practical example

One of the most common scenarios of web scraping with JavaScript is to scrape e-commerce stores. A good place to start is a fictional book store http://books.toscrape.com/. This site is very much like a real store, except that this is fictional and is made to learn web scraping.

Creating selectors

The first step before beginning JavaScript web scraping is creating selectors. The purpose of selectors is to identify the specific element to be queried.

Begin by opening the URL http://books.toscrape.com/catalogue/category/books/mystery_3/index.html in Chrome or Firefox. Once the page loads, right-click on the title of the genre, Mystery, and select Inspect. This should open the Developer Tools with <h1>Mystery</h1> selected in the Elements tab.


The simplest way to create a selector is to right-click this h1 tag in the Developer Tools, point to Copy, and then click Copy Selector. This will create a selector like this:

#content_inner > article > div.row > div.col-sm-6.product_main > h1

This selector is valid and works well. The only problem is that this method creates a long selector. This makes it difficult to understand and maintain the code.

After spending some time with the page, it becomes clear that there is only one h1 tag on the page. This makes it very easy to create a very short selector:

h1

Alternatively, a third-party tool like Selector Gadget extension for Chrome can be used to create selectors very quickly. This is a useful tool for web scraping in JavaScript.

Note that while this works most of the time, there will be cases where it doesn’t work. Understanding how CSS selectors work is always a good idea. W3Schools has a good CSS Reference page.

Scraping the genre

The first step is to define the constants that will hold a reference to Axios and Cheerio.

const cheerio = require("cheerio");
const axios = require("axios");

The address of the page that is being scraped is saved in the constant url for readability:

const url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";

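Axios has a method get() that sends an HTTP GET request. Note that this is an asynchronous method that returns a Promise, so the call needs the await prefix, which in a regular Node.js script must appear inside an async function: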

const response = await axios.get(url);

If there is a need to pass additional headers, for example, User-Agent, this can be sent as the second parameter:

const response = await axios.get(url, {
  headers: {
    "User-Agent": "custom-user-agent string",
  },
});

This particular site does not need any special header, which makes it easier to learn.

Axios supports both the Promise pattern and the async-await pattern. This tutorial focuses on the async-await pattern. The response has a few attributes like headers, data, etc. The HTML that we want is in the data attribute. This HTML can be loaded into an object that can be queried, using cheerio.load() method.

const $ = cheerio.load(response.data);

Cheerio’s load() method returns a function bound to the parsed document, which can be stored in a constant. This can have any name. To make our code look and feel more like jQuery web scraping code, a $ can be used instead of a name.

Finding a specific element within the document is as easy as writing $("<css selector>"). In this particular case, it would be $("h1").

The method text() will be used everywhere when writing web scraping code with JavaScript, as it can be used to get the text inside any element. This can be extracted and saved in a local variable.

const genre = $("h1").text();

Finally, console.log() will simply print the variable value on the console.

console.log(genre);

To handle errors, the code will be surrounded by a try-catch block. Note that it is a good practice to use console.error for errors and console.log for other messages.

Here is the complete code put together. Save it as genre.js in the folder created earlier, where the command npm init was run.

const cheerio = require("cheerio");
const axios = require("axios");
const url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";

async function getGenre() {
  try {
    const response = await axios.get(url);
    const document = cheerio.load(response.data);
    const genre = document("h1").text();
    console.log(genre);
  } catch (error) {
    console.error(error);
  }
}
getGenre();

The final step is to run this script using Node.js. Open the terminal and run this command:

node genre.js

The output of this code is going to be the genre name:

Mystery

Congratulations! This was the first program that uses JavaScript and Node.js for web scraping. Time to do more complex things!
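Printing the whole error object, as the catch block above does, can be noisy. As a rough sketch of a more readable alternative, the hypothetical helper below (describeRequestError is not part of Axios) relies on two properties that Axios sets on its errors: response, present when the server replied with an error status, and request, present when the request was sent but no reply arrived. Plain objects stand in for real Axios errors so the example runs without a network connection.

```javascript
// Turns an error caught around axios.get() into a short message.
// Axios attaches `response` when the server replied with an error status
// and `request` when the request was sent but no reply arrived.
function describeRequestError(error) {
  if (error.response) {
    return `Server responded with status ${error.response.status}`;
  }
  if (error.request) {
    return "No response received (network problem or timeout)";
  }
  return `Request could not be sent: ${error.message}`;
}

// Plain objects stand in for real Axios errors here.
console.error(describeRequestError({ response: { status: 404 } }));
console.error(describeRequestError({ request: {} }));
console.error(describeRequestError(new Error("Invalid URL")));
```

Inside getGenre(), the catch block could then call console.error(describeRequestError(error)) instead of printing the raw error.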

Scraping book listings

Let’s try scraping listings. Here is the same page that has a book listing of the Mystery genre – http://books.toscrape.com/catalogue/category/books/mystery_3/index.html

The first step is to analyze the page and understand the HTML structure. Load this page in Chrome, press F12, and examine the elements.

Each book is wrapped in an <article> tag. This means that all the books can be selected at once and a loop can be run to extract the details of each individual book. Since the HTML is parsed with Cheerio, its jQuery-style each() function can be used to run the loop. Let’s start with extracting the titles of all the books. Here is the code:

const books = $("article"); // selector to get all books
books.each(function () {
  // runs once per book; `this` is the current <article> element
  const title = $(this).find("h3 a").text(); // extract the book title
  console.log(title); // print the book title
});

As is evident from the above code, the extracted details are only printed and then lost, so they need to be saved somewhere inside the loop. The best idea is to store these values in an array. In fact, other attributes of each book can be extracted as well and stored together as an object in the array.

Here is the complete code. Create a new file, paste this code, and save it as books.js in the same folder where npm init was run:

const cheerio = require("cheerio");
const axios = require("axios");
const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
const books_data = [];
async function getBooks(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);
 
    const books = $("article");
    books.each(function () {
      const title = $(this).find("h3 a").text();
      const price = $(this).find(".price_color").text();
      const stock = $(this).find(".availability").text().trim();
      books_data.push({ title, price, stock }); //store in array
    });
    console.log(books_data);//print the array
  } catch (err) {
    console.error(err);
  }
}
getBooks(mystery);

Run this file using Node.js from the terminal:

node books.js

This should print the array of books on the console. The only limitation of this JavaScript code is that it is scraping only one page. The next section will cover how pagination can be handled.
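The scraped values are plain strings, so a price cannot be sorted or summed until it is converted. Here is a small sketch of how the values might be cleaned before they are stored. The helper names parsePrice and cleanText are illustrative, and the sample record is made up; it assumes prices look like a currency symbol followed by a decimal number.

```javascript
// Converts a scraped price string such as "£51.77" into a number.
// Everything except digits and the decimal point is stripped first.
function parsePrice(text) {
  const numeric = text.replace(/[^0-9.]/g, "");
  return Number.parseFloat(numeric);
}

// Collapses runs of whitespace (including newlines) into single spaces.
function cleanText(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Sample record; the values are illustrative, not scraped output.
const raw = { title: "Example Book", price: "£51.77", stock: "\n    In stock\n" };
const cleaned = {
  title: cleanText(raw.title),
  price: parsePrice(raw.price),
  stock: cleanText(raw.stock),
};
console.log(cleaned);
```

In books.js, these helpers could wrap the .text() calls inside the each() loop.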

Handling pagination

Listings like this are usually spread over multiple pages. While every site may have its own way of paginating, the most common one is having a next button on every page. The exception is the last page, which will not have a next page link.

The pagination logic for these situations is rather simple. Create a selector for the next page link. If the selector results in a value, take the href attribute value and call getBooks function with this new URL recursively.

Immediately after the books.each() loop, add these lines:

if ($(".next a").length > 0) {
  const next_page = baseUrl + $(".next a").attr("href"); // converting to absolute URL
  getBooks(next_page); // recursive call to the same function with new URL
}

Note that the href returned above is a relative URL. To convert it into an absolute URL, the simplest way is to concatenate a fixed part to it. This fixed part of the URL is stored in the baseUrl constant, which must be defined before the lines above use it:

const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
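Concatenation works here because every next link on this site sits in the same folder. A more general option, shown as a sketch, is the URL class that is built into Node.js: new URL(relative, base) resolves a relative href against the address of the page it was found on. The helper name toAbsoluteUrl and the href value "page-2.html" are assumptions for illustration.

```javascript
// The URL class is built into Node.js and browsers.
// new URL(relative, base) resolves a relative href against the page it was found on.
function toAbsoluteUrl(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

const pageUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
console.log(toAbsoluteUrl("page-2.html", pageUrl));
// http://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html
```

With this approach no baseUrl constant is needed, since the URL of the page currently being scraped serves as the base.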

Once the scraper reaches the last page, the Next button will not be there and the recursive call will stop. At this point, the array will have book information from all the pages. The final step of web scraping with Node.js is to save the data.
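The same pagination logic can also be written as a loop instead of a recursive call. The sketch below is one possible shape, not the approach used in this tutorial's script. It assumes a caller-supplied function, here called fetchPage (hypothetical), that resolves to an object with the items found on a page and the href of the next link, or null on the last page. An in-memory fake site with placeholder example.com addresses keeps it runnable offline.

```javascript
// Follows "next" links in a loop instead of recursing.
// fetchPage is a hypothetical function supplied by the caller: given a URL,
// it resolves to { items, nextHref }, where nextHref is null on the last page.
async function collectAllPages(startUrl, fetchPage) {
  const allItems = [];
  let url = startUrl;
  while (url) {
    const { items, nextHref } = await fetchPage(url);
    allItems.push(...items);
    // Resolve the relative href against the current page, or stop.
    url = nextHref ? new URL(nextHref, url).href : null;
  }
  return allItems;
}

// In-memory stand-in for a two-page site, so the sketch runs offline.
const fakeSite = {
  "http://example.com/list/index.html": { items: ["Book A", "Book B"], nextHref: "page-2.html" },
  "http://example.com/list/page-2.html": { items: ["Book C"], nextHref: null },
};
const fakeFetchPage = async (url) => fakeSite[url];

collectAllPages("http://example.com/list/index.html", fakeFetchPage)
  .then((items) => console.log(items));
```

With Axios and Cheerio, fetchPage would download the page, build the book objects, and return $(".next a").attr("href") or null. Because every page is awaited in turn, the caller knows exactly when all pages have been collected.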

Saving scraped data to CSV

If web scraping with JavaScript is easy, saving data into a CSV file is even easier. It can be done using two packages: fs and json2csv. The fs package gives access to the file system and is built into Node.js, so it needs no installation. json2csv has to be installed with npm; if you ran the combined install command earlier, it is already in place, otherwise run:

npm install json2csv

After the installation, create a constant that will store this package’s Parser.

const j2cp = require("json2csv").Parser;

The access to the file system is needed to write the file on disk. For this, initialize the fs package.

const fs = require("fs");

Find the line in the code where the array with all the scraped data is available, and then insert the following lines of code to create the CSV file.

const parser = new j2cp();
const csv = parser.parse(books_data); // json to CSV in memory
fs.writeFileSync("./books.csv", csv); // CSV is now written to disk

Here is the complete script put together. It can be saved as a .js file in the Node.js project folder. Once it is run using the node command in the terminal, data from all the pages will be available in the books.csv file.

const fs = require("fs");
const j2cp = require("json2csv").Parser;
const axios = require("axios");
const cheerio = require("cheerio");
 
const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
 
const books_data = [];
 
async function getBooks(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);
 
    const books = $("article");
    books.each(function () {
      const title = $(this).find("h3 a").text();
      const price = $(this).find(".price_color").text();
      const stock = $(this).find(".availability").text().trim();
      books_data.push({ title, price, stock });
    });
    // console.log(books_data);
    const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
    if ($(".next a").length > 0) {
      const next = baseUrl + $(".next a").attr("href");
      await getBooks(next);
    } else {
      const parser = new j2cp();
      const csv = parser.parse(books_data);
      fs.writeFileSync("./books.csv", csv);
    }
  } catch (err) {
    console.error(err);
  }
}
 
getBooks(mystery);

Run this file using Node.js from the terminal:

node books.js

We now have a new file books.csv, which will contain all the desired data. This can be viewed using any spreadsheet program such as Microsoft Excel. You may also find it useful to learn how to read JSON files in JavaScript, especially when JSON data is used heavily by websites, making it considerably easier to gather already structured public web data.
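If JSON is preferred over CSV, the same array can be stored and read back with the built-in fs module alone. This is a minimal sketch with a made-up sample record in the same shape as books_data. It writes to the system temp directory, an arbitrary choice made here so that running it does not touch the project folder.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Sample records in the same shape as books_data; values are illustrative.
const sampleBooks = [
  { title: "Example Book", price: "£10.00", stock: "In stock" },
];

// Written to the temp directory here so the sketch leaves the project untouched.
const jsonPath = path.join(os.tmpdir(), "books.json");

// JSON.stringify with an indent of 2 produces a human-readable file.
fs.writeFileSync(jsonPath, JSON.stringify(sampleBooks, null, 2), "utf8");

// Reading it back: parse the file contents into JavaScript objects again.
const loadedBooks = JSON.parse(fs.readFileSync(jsonPath, "utf8"));
console.log(loadedBooks[0].title); // Example Book
```

In the scraper, fs.writeFileSync("./books.json", JSON.stringify(books_data, null, 2)) could sit next to the CSV export.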

Web scraping dynamic web pages with Puppeteer

Puppeteer is a popular open-source library for controlling a browser programmatically, including in headless mode (without a visible GUI), which makes it useful for web scraping and automated testing. It allows developers to simulate user interactions with a website and perform tasks such as filling out forms, clicking links, and extracting data from the page.

As Puppeteer supports headless browsing, it can handle dynamic content and execute JavaScript on web pages. This is particularly useful for scraping modern web applications that rely on JavaScript to load their content.

For example, let’s say you want to scrape all the quotes on a dynamic web page http://quotes.toscrape.com/js/. If you open the page in your browser and look for the source js file from the developer tools, you'll notice that all the quotes dynamically load from the data list in that js file. Moreover, the js file appends quotes as a div element having a quote class. 

Let’s see how we can scrape dynamic content using Puppeteer in this situation. To create a dynamic web page scraper, start by installing Puppeteer using the following command:

npm install puppeteer

Next, define constants holding references to Puppeteer, a new headless browser instance, and a new tab or page.

const puppeteer = require('puppeteer');
const headlessBrowser = await puppeteer.launch({ headless: true });
const newTab = await headlessBrowser.newPage();

Now, set the target URL and navigate your tab to the target web page using the following code:

const url = 'http://quotes.toscrape.com/js/';
await newTab.goto(url);

As the quotes load dynamically through the JavaScript file, you must wait until the script has added the div elements with the quote class to the page. You can use waitForSelector() as demonstrated by the following line:

await newTab.waitForSelector('.quote');

From this point, you can proceed further to scrape all quotes from the current page. Use the following code to do that:

let scrapedQuotes = await newTab.evaluate(() => {
  // this function runs inside the page, so document is the page's DOM
  let allQuoteDivs = document.querySelectorAll(".quote");
  let quotes = "";
  allQuoteDivs.forEach((quote) => {
    let text = quote.querySelector(".text").innerHTML;
    let author = quote.querySelector(".author").innerHTML;
    quotes += `${text} \n ${author} \n\n`;
  });
  return quotes;
});

console.log(scrapedQuotes);
await headlessBrowser.close(); // destroy the headless browser instance

In Puppeteer, the evaluate() method allows you to execute a function in the Document Object Model (DOM) context of the current tab or page. The above code snippet evaluates an anonymous function to scrape and return, in the form of a string, all the quotes from the current DOM context. 

Let’s put everything together with a bit of exception handling:

const puppeteer = require('puppeteer');
const url = 'http://quotes.toscrape.com/js/';
async function QuotesScraping() {
  try {
    const headlessBrowser = await puppeteer.launch({ headless: true });
    const newTab = await headlessBrowser.newPage();
    await newTab.goto(url);
    await newTab.waitForSelector('.quote');
 
    let scrapedQuotes = await newTab.evaluate(() => {
        let allQuoteDivs = document.querySelectorAll(".quote");
        let quotes= "";
        allQuoteDivs.forEach((quote) => {
          let text = quote.querySelector(".text").innerHTML;
          let author = quote.querySelector(".author").innerHTML;
          quotes += `${text} \n ${author} \n\n`;
        });
        return quotes;
      });
    console.log(scrapedQuotes);
    await headlessBrowser.close();
    
  } catch (error) {
    console.error(error)
  }
}
 
QuotesScraping();

If the scraper works, the console shows the text of each quote followed by its author, with a blank line between quotes.
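One long string is awkward to process further, for example when saving the quotes to CSV. A possible variation, sketched here under stated assumptions, returns an array of objects instead. The function name extractQuotes and the fake DOM objects are illustrative. The default parameter lets the same function run inside the page, where document exists, and in plain Node.js with a stand-in object.

```javascript
// Collects each quote as an object instead of building one long string.
// Inside page.evaluate() no argument is passed, so root defaults to the
// page's own document; in a test, any object with the same methods works.
function extractQuotes(root = document) {
  return Array.from(root.querySelectorAll(".quote")).map((quote) => ({
    text: quote.querySelector(".text").textContent.trim(),
    author: quote.querySelector(".author").textContent.trim(),
  }));
}

// Minimal fake DOM (an assumption for offline demonstration only).
const fakeElements = {
  ".text": { textContent: " “Sample quote.” " },
  ".author": { textContent: "Sample Author" },
};
const fakeQuote = { querySelector: (selector) => fakeElements[selector] };
const fakeRoot = { querySelectorAll: () => [fakeQuote] };

console.log(extractQuotes(fakeRoot));
```

In the Puppeteer script, the call would look like const quotes = await newTab.evaluate(extractQuotes); and should yield one object per quote. The sketch reads textContent rather than innerHTML so that the result contains plain text without HTML markup.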

Summary

This whole exercise of web scraping using JavaScript and Node.js can be broken down into three steps: send the request, parse and query the response, and save the data. For all three steps, there are many packages available. In this tutorial, we discussed how to use the Axios, Cheerio, and json2csv packages for these tasks, and how Puppeteer can handle pages that render their content with JavaScript.

If you would like to learn more about web scraping or how JavaScript compares to other languages, read about Python Web Scraping, Web Scraping with Selenium, and JavaScript vs Python for web scraping. Or if you want to learn a different method of scraping via a headless browser, check out our Puppeteer tutorial. To integrate proxies with Puppeteer, we also have an integration guide available. Also, don’t hesitate to try our own general-purpose web scraper for free.

People also ask

What is Node.js?

Node.js is a JavaScript runtime environment built to execute JavaScript code outside a web browser. It’s an open-source tool commonly used for building server-side applications, command-line tools, and desktop applications.

Is Node.js good for web scraping?

Yes, Node.js can be a good choice for JavaScript web scraping tasks. Node.js is a platform that allows you to run JavaScript code outside of a web browser. 

Using Node.js for web scraping has several benefits, including ease of use (especially for those already familiar with JavaScript), fast performance, and a low learning curve. Most importantly, the majority of websites already rely on JavaScript, so you won't need to learn a different language for your scraper.

What are alternatives to Puppeteer?

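Selenium, PhantomJS, Playwright, and SlimerJS are well-known alternatives to Puppeteer in Node.js. They all support headless browsing and can be used for various tasks, including web scraping, testing web applications, and automating web tasks. Note that PhantomJS is no longer actively developed, so Selenium or Playwright is usually the safer choice for new projects.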

What are alternatives to Axios?

In Node.js, you have several alternatives to Axios, including Got, Superagent, and the Fetch API. The Request package was also widely used but, as noted earlier, is now deprecated. These libraries can make HTTP requests for web scraping and API integration tasks.

About the author

Adelina Kiskyte

Former Senior Content Manager

Adelina Kiskyte is a former Senior Content Manager at Oxylabs. She constantly follows tech news and loves trying out new apps, even the most useless. When Adelina is not glued to her phone, she also enjoys reading self-motivation books and biographies of tech-inspired innovators. Who knows, maybe one day she will create a life-changing app of her own!

