puppeteer tutorial for web scraping
avatar

Gabija Fatenaite

Sep 30, 2020 12 min read

Web scraping and automation with JavaScript has evolved a lot in recent years. There are a few methods to accessing and parsing web pages, but in this tutorial we will be covering how to do it with Google Puppeteer. 

Table of contents

What is Puppeteer?

Speaking in broad terms, Google Chrome Puppeteer is an open source library developed by Google. It was built for automating and simplifying frontend tasks and development. Based on Chromium, it can be controlled remotely, allowing developers to write and maintain simple, fully automated tests. However, Puppeteer can also be used for web scraping.

Automating web scraping

Generally, there are two methods of accessing and parsing web pages. The first method uses packages e.g., Axios. It directly sends a get request to the web page and receives HTML content. This can then be parsed using packages like Cheerio. We covered this process in-depth in our JavaScript web scraping tutorial

Though this is a fast method, it has its limitations. The biggest is that it cannot handle dynamic sites – sites that are rendered using JavaScript. The easiest way to manage these sites is to open a browser and load the site. Unfortunately, loading a browser would take a lot of resources because it has to load a lot of other things like the toolbar and buttons. These UI elements are not needed when everything is being controlled with code. Fortunately, there are better solutions – headless browsers.

What is a headless browser?

A headless browser is simply a browser but without a graphical user interface. Think of it as a hidden browser. Headless browsers have complete functionality offered by a browser while being faster and taking up a lot less memory because there is no user interface. Everything is controlled programmatically.

The most commonly used browsers, Chrome and Firefox, support headless mode. There are few more browsers with headless mode supported, for example, Splash, Chromium, etc. Splash is aimed at Python programmers. In this Puppeteer tutorial, we will be focusing on Chromium. 

Chromium is an open-source web browser made by Google. Note that Chromium and Chrome are two different browsers. Chromium is an open-source project. Chrome and is built over Chromium by adding many features. In addition to Chrome, many other browsers are based on Chromium, for example, Microsoft Edge, Opera, Brave, etc.

Now that we know what a headless browser is, it’s time to understand the available options to control the browsers programmatically.

Controlling the browsers programmatically

There are several solutions to control headless browsers. Perhaps the most widely known solution is Selenium.  We have covered what it is in our blog post, but to quickly answer is  Puppeteer better than selenium – if you need a lightweight and fast headless browser for web scraping, Google Puppeteer would be the better choice. 

This Puppeteer tutorial will cover Puppeteer in much detail. Puppeteer, however, is a Node.js package, making it exclusive for JavaScript developers. Python programmers, therefore, have a similar option – Pyppeteer.  

Pyppeteer

Pyppeteer is an unofficial port of Puppeteer for Python. This also bundles Chromium and works smoothly with it. Pyppeteer can work with Chrome as well, similar to Puppeteer. 

The syntax is very similar as it uses the asyncio library for Python, except the syntactical differences between Python and JavaScript. Here are two scripts in JavaScript and Python that load a page and then take a screenshot of it.

Puppeteer example:

const puppeteer = require('puppeteer');
async function main() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://oxylabs.io/');
  await page.screenshot({'path': 'oxylabs_js.png'})
  await browser.close();
}
main();

Pyppeteer Example:

import asyncio
import pyppeteer
async def main():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('https://oxylabs.io/')
    await page.screenshot({'path': 'oxylabs_python.png'})
    await browser.close()
asyncio.get_event_loop().run_until_complete(main())

The code is very similar. For web scraping dynamic websites, Pyppeteer can be an excellent alternative to Selenium for Python developers. But for the sake of making a Puppeteer tutorial, the following sections, we will cover Puppeteer, starting with the installation.

Installation

Before moving on with this Puppeteer tutorial, let’s get the basic tools installed. 

Prerequisite

There are only two pieces of software that will be needed:

  • Node.js (which is bundled with npm—the package manager for Node.js)
  • Any code editor

The only thing that you need to know about Node.js is that it is a runtime framework. This means that JavaScript code, which typically runs in a browser, can run without a browser. 

Node.js is available for Windows, Mac OS, and Linux. It can be downloaded at their official download page.

Create node.js project

Before writing any code to web scrape using node js, create a folder where JavaScript files will be stored. All the code for Puppeteer is written in .js files and is run by Node. 

Once the folder is created, navigate to this folder and run the initialization command:

npm init -y

This will create a package.json file in the directory. This file will contain information about the packages that are installed in this folder. The next step is to install the Node.js Packages in this folder. 

How do you run Puppeteer

Installing Puppeteer is very easy. Just run the npm install command from the terminal. Note that the working directory should be the one which contains package.json:

npm install puppeteer 

Note that Puppeteer is bundled with a full instance of Chromium. When it is installed, it downloads a recent version of Chromium that is guaranteed to work with the version of Puppeteer being installed. 

Getting started with Puppeteer

Puppeteer is a promise-based library, which means it performs asynchronous calls. This Puppeteer tutorial will have all of the examples in async-await syntax.

Simple example of using Puppeteer

Create a new file in your node project directory (the directory that contains package.json and node_modules). Save this file as example1.js and add this code:

const puppeteer = require('puppeteer');
async function main() {
    // Add code here
}
main();

The code above can be simplified by making the function anonymous and calling it on the same line:

const puppeteer = require('puppeteer');
(async () => {
    // Add code here
})();

The required keyword will ensure that the Puppeteer library is available in the file. The rest of the lines are the placeholder where an anonymous, asynchronous function is being created and executed. For the next step, launch the browser. 

const browser = await puppeteer.launch();

Note that by default, the browser is launched in the headless mode. If there is an explicit need for a user interface, the above line can be modified to include an object as a parameter. 

const browser = await puppeteer.launch({headless:false}); // default is true

The next step would be to open a page:

const page = await browser.newPage();

Now that a page or in other words, a tab, is available, any website can be loaded by simply calling the goto() function:

await page.goto('https://oxylabs.io/');

Once the page is loaded, the DOM elements, as well the rendered page is available. This can be verified by taking a quick screenshot:

await page.screenshot({path: 'oxylabs_1080.png'})

This, however, will create only an 800×600 pixel image. The reason is that Puppeteer sets an initial page size to 800×600px. This can be changed by setting the viewport, before taking the screenshot.

  await page.setViewport({
    width: 1920,
    height: 1080,
  });

Finally, remember to close the browser:

await browser.close();

Putting it altogether, here is the complete script. 

const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({
    width: 1920,
    height: 1080,
  });
  await page.goto('https://oxylabs.io/');
  await page.screenshot({path: 'oxylabs_1080.png'})
  await browser.close();
})();

Run this file from the terminal using this command: 

node example1.js

This should create a new file oxylabs_1080.png in the same directory.

Bonus tip: If you need a PDF, you can use the pdf() function:

await page.pdf({path: 'oxylabs.pdf', format: 'A4'});

Scraping an element from a page

Puppeteer loads the complete page in DOM. This means that we can extract any data from the page. The easiest way to do this is to use the function evaluate(). This allows to execute JavaScript functions like document.querySelector(). Consequently, it lets us extract any Element from the DOM.

To understand this, open this link in your preferred browser: https://en.wikipedia.org/wiki/Web_scraping 

Once the page is loaded, right-click the heading of the page, and select Inspect. This should open developer tools with the Elements tab activated. Here it is visible that the page’s heading is in h1 element, with id and class both set to firstHeading.

Now, go to the Console tab in the developer toolbox and write in this line:

document.querySelector('#firstHeading')

You will immediately see that our desired tag is extracted.

Puppeteer Tutorial Wiki example

This returns one element from the page. For this particular element, all we need is text. Text can be easily extracted with this line of code:

document.querySelector('#firstHeading').textContent

The text can now be returned using the return keyword. The next step is to surround this in the evaluate method. This will ensure that this querySelector can be run. 

await page.evaluate(() => {
    return document.querySelector("#firstHeading").textContent;
});

The result of the evaluate() function can be stored in a variable to complete the functionality. Finally, do not forget to close the browser. Here is the complete script:

const puppeteer = require("puppeteer");
 
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://en.wikipedia.org/wiki/Web_scraping");
 
  title = await page.evaluate(() => {
    return document.querySelector("#firstHeading").textContent.trim();
  });
  console.log(title);
  await browser.close();
})();

Scraping multiple elements

Extracting multiple elements would involve three steps:

1. Use of querySelectorAll to get all elements matching the selector:

headings_elements = document.querySelectorAll("h2 .mw-headline");

2. create an array, as heading_elements is of type NodeList. 

headings_array = Array.from(headings_elements);

3. Call the map() function can be called to process each element in the array and return it.

return headings_array.map(heading => heading.textContent);

This of course needs to be surrounded by page.evaluate() function. Putting everything together, this is the complete script. You can save this as wiki_toc.js:

const puppeteer = require("puppeteer");
 
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://en.wikipedia.org/wiki/Web_scraping");
 
  headings = await page.evaluate(() => {
    headings_elements = document.querySelectorAll("h2 .mw-headline");
    headings_array = Array.from(headings_elements);
    return headings_array.map(heading => heading.textContent);
  });
  console.log(headings);
  await browser.close();
})();

This file can now be run from your terminal:

node wiki_toc.js

Bonus tip: Array.from() function can be supplied with a map function directly, without a separate call to map. Depending on the comfort level, the same code can thus be written as:

 headings = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("h2 .mw-headline"),
      heading => heading.innerText.trim());
  });

Scraping a hotel listing page

This section will explain how a typical listing page can be scraped to get a JSON object with all the required information. The concepts presented in this section will be applicable for any listing, whether it is an online store, a directory, or a hotel listing. 

The example that we will take is an Airbnb. Apply some filters so that you reach a page similar to the one in the screenprint. In this particular example, we will be scraping this Airbnb page that lists 20 hotels. To scrape all 20 hotels, the first step is to identify the selector for each hotel section.

root = Array.from(document.querySelectorAll("#FMP-target [itemprop='itemListElement']"));

This returns a NodeList of length 20 and stores in the variable root. Note that so far, text or any attribute has not been extracted. All we have is an array  of elements. This will be done in the map() function.

hotels = root.map(hotel => ({ 
// code here
}));

The URL of the photo of the hotel can be extracted with a code like this:

hotel.querySelector("img").getAttribute("src")

Getting the name of the hotel is a little trickier. The classes used on this page are some random words like _krjbj and _mvzr1f2. These class names appear to be generated dynamically and may change later on. It is better to have selectors which do not rely on these class names. 

The hotel name can be extracted by combining parentElement and nextElementSibling selectors:

hotel.querySelector('ol').parentElement.nextElementSibling.textContent

The most important concept to understand here is that we are concatenating querySelectors. Effectively, the first hotel name is being extracted with this line of code:

document.querySelectorAll("#FMP-target [itemprop='itemListElement']")[0].querySelector('ol').parentElement.nextElementSibling.textContent 
Puppeteer Tutorial Airbnb example

Finally, we can create an object containing both of these values. The syntax to create an object is like this:

Hotel = {
        Name: 'x',
        Photo: 'y'
      }

Putting everything together, here is the final script. Save it as bnb.js.

const puppeteer = require("puppeteer");
(async () => {
  let url = "https://www.airbnb.com/s/homes?refinement_paths%5B%5D=%2Fhomes&search_type=section_navigation&property_type_id%5B%5D=8";
  const browser = await puppeteer.launch(url);
  const page = await browser.newPage();
  await page.goto(url);
  data = await page.evaluate(() => {
    root = Array.from(document.querySelectorAll("#FMP-target [itemprop='itemListElement']"));
    hotels = root.map(hotel => ({
      Name: hotel.querySelector('ol').parentElement.nextElementSibling.textContent,
      Photo: hotel.querySelector("img").getAttribute("src")
    }));
    return hotels;
  });
  console.log(data);
  await browser.close();
})();

Run this file from the terminal using:

node bnb.js

You should be able to see a JSON object printed on the console. 

Summary

In this Puppeteer tutorial, various examples of web scraping have been explained. The examples ranged from extracting one element from a website and moving on to extracting hotel listings from a popular website. We recommend that you look at the official Puppeteer documentation for more detailed information. 

We would also recommend reading JavaScript web scraping tutorial to learn web scraping using Axios and Cheerio, which could be more suitable in other scenarios.

avatar

About Gabija Fatenaite

Gabija Fatenaite is a Senior Content Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.

Related articles

What is a Reverse Proxy?

What is a Reverse Proxy?

Oct 19, 2020

7 min read

CTO’s Checklist: 6 Things to Know About Web Scraping

CTO’s Checklist: 6 Things to Know About Web Scraping

Oct 13, 2020

8 min read

Proxy vs VPN

Proxy vs VPN

Oct 04, 2020

10 min read

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.