Puppeteer Tutorial: Scraping With a Headless Browser

Gabija Fatenaite

Last updated by Yelyzaveta Hayrapetyan

2025-07-30

8 min read

Web scraping and automation with JavaScript have evolved a lot in recent years. There are a few ways to access and parse web pages, and in this tutorial we will cover how to do it with Google's Puppeteer.

Automating web scraping

Generally, there are two methods of accessing and parsing web pages. The first method uses an HTTP client package such as Axios. It sends a GET request directly to the web page and receives the HTML content, which can then be parsed using packages like Cheerio. We covered this process in depth in our JavaScript web scraping tutorial.
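
As a rough illustration of this first method, the sketch below uses Node's built-in fetch (available in Node 18 and later) as a stand-in for Axios, and a naive regular expression as a stand-in for Cheerio. The helper names extractTitle and fetchTitle are hypothetical, and a regular expression is not a substitute for a real HTML parser.

```javascript
// Naive stand-in for an HTML parser: pull the text of the first <title> tag.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return match ? match[1].trim() : null;
}

// Send a GET request and parse the raw HTML that comes back.
// No JavaScript on the page is executed in this approach.
async function fetchTitle(url) {
  const response = await fetch(url);
  const html = await response.text();
  return extractTitle(html);
}

// Usage (needs network access):
// fetchTitle('https://oxylabs.io/').then(console.log);
```

Because only the raw HTML is downloaded, anything the page adds later with JavaScript never reaches extractTitle.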

Though this is a fast method, it has its limitations. The biggest is that it cannot handle dynamic sites, that is, sites rendered using JavaScript. The easiest way to handle these sites is to open a browser and load the site. Unfortunately, a full browser takes a lot of resources because it also has to load UI elements such as the toolbar and buttons, which are not needed when everything is controlled with code. Fortunately, there is a better solution: headless browsers.

What is a headless browser?

A headless browser is simply an actual browser but without a graphical user interface. Think of it as a hidden browser. Headless browsers have complete functionality offered by a browser while being faster and taking up a lot less memory because there is no user interface. Everything is controlled programmatically.

The most commonly used browsers, Chrome and Firefox, support headless mode. A few other browsers support it as well, for example, Splash and Chromium. Splash is aimed at Python programmers. In this Puppeteer JS tutorial, we will be focusing on Chromium.

Chromium is an open-source web browser project maintained by Google. Note that Chromium and Chrome are two different browsers: Chrome is built on top of Chromium by adding many features. In addition to Chrome, many other browsers are based on Chromium, for example, Microsoft Edge, Opera, and Brave.

Now that we know what a headless browser is, it’s time to understand the available options to control the browsers programmatically.

Controlling the browsers programmatically

There are several solutions for controlling headless browsers. Perhaps the most widely known is Selenium. We have covered what it is in our blog post, but to quickly answer whether Puppeteer is better than Selenium: if you need a lightweight and fast headless browser for web scraping, Puppeteer would be the better choice.

This Puppeteer tutorial will cover web scraping with Puppeteer in detail. Puppeteer, however, is a Node.js package, which makes it exclusive to JavaScript developers. Python programmers have a similar option: Pyppeteer.

Pyppeteer

Pyppeteer is an unofficial port of Puppeteer for Python. It also bundles Chromium and works smoothly with it. Like Puppeteer, Pyppeteer can work with Chrome as well.

Apart from the syntactical differences between Python and JavaScript, the syntax is very similar, as Pyppeteer uses Python's asyncio library. Here are two scripts, one in JavaScript and one in Python, that load a page and then take a screenshot of it.

Puppeteer script example:

const puppeteer = require('puppeteer');
async function main() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://oxylabs.io/');
  await page.screenshot({'path': 'oxylabs_js.png'});
  await browser.close();
}
main();

Pyppeteer script example:

import asyncio
import pyppeteer
async def main():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('https://oxylabs.io/')
    await page.screenshot({'path': 'oxylabs_python.png'})
    await browser.close()
asyncio.run(main())

The code is very similar. For web scraping dynamic websites, Pyppeteer can be an excellent alternative to Selenium for Python developers. Since this is a Puppeteer tutorial, however, the following sections cover Puppeteer, starting with the installation.

Installation

Before moving on with this Puppeteer scraping tutorial, let’s get the basic tools installed.

Prerequisites

There are only two pieces of software that will be needed:

Node.js (which comes bundled with npm, the package manager for Node.js)
Any code editor

The only thing that you need to know about Node.js is that it is a JavaScript runtime environment. This means that JavaScript code, which typically runs in a browser, can run without one.

Node.js is available for Windows, macOS, and Linux. It can be downloaded from the official download page.

Create a Node.js project

Before writing any code to scrape the web using Node.js, create a folder where the JavaScript files will be stored. All the code for Puppeteer is written in .js files and is run by Node.js.

Once the folder is created, navigate to this folder and run the initialization command:

npm init -y

This will create a package.json file in the directory. This file will contain information about the packages that are installed in this folder. The next step is to install the Node.js packages in this folder.

How do you run Puppeteer

Installing Puppeteer is very easy. Just run the npm install command from the terminal. Note that the working directory should be the one which contains package.json:

npm install puppeteer

Note that installing Puppeteer also downloads a compatible browser build (Chromium in older Puppeteer releases, Chrome for Testing in recent ones), so the browser is guaranteed to work with the version of Puppeteer being installed.

Getting started with Puppeteer

Puppeteer is a promise-based library, which means it performs asynchronous calls. This Puppeteer tutorial for beginners will have all of the examples in async-await syntax. Additionally, if you want to integrate proxies with Puppeteer, check out our Puppeteer proxy integration guide.
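
To make the difference between promise chaining and async-await concrete, here is a minimal sketch that does not need Puppeteer at all. The fakeLaunch function is a hypothetical stand-in for any asynchronous Puppeteer call; it simply resolves after a short delay.

```javascript
// Hypothetical stand-in for an asynchronous Puppeteer call:
// resolves with a string after a short delay.
function fakeLaunch() {
  return new Promise(resolve => setTimeout(() => resolve('browser ready'), 10));
}

// Promise chaining with then():
function withThen() {
  return fakeLaunch().then(result => result.toUpperCase());
}

// The same logic written with async-await:
async function withAwait() {
  const result = await fakeLaunch();
  return result.toUpperCase();
}

withThen().then(console.log);  // BROWSER READY
withAwait().then(console.log); // BROWSER READY
```

Both functions return a promise that resolves to the same value; async-await only changes how the steps read, which is why the rest of the examples use it.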

Simple example of using Puppeteer

Create a new file in your Node.js project directory (the directory that contains package.json and node_modules). Save this file as example1.js and add this code:

const puppeteer = require('puppeteer');

async function main() {
    // Add code here
}
main();

The code above can be simplified by making the function anonymous and calling it on the same line:

const puppeteer = require('puppeteer');

(async () => {
    // Add code here
})();

The require() call makes the Puppeteer library available in the file. The remaining lines are a placeholder: they create an anonymous, asynchronous function and execute it immediately. The next step is to launch the browser.

const browser = await puppeteer.launch();

Note that by default, the browser is launched in headless mode. If there is an explicit need for a user interface, the above line can be modified to pass an options object as a parameter.

const browser = await puppeteer.launch({headless:false}); // default is true
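
If you switch between headless and visible runs often, one option is to build the options object from a single flag. The sketch below assumes a hypothetical debug flag and helper name; headless and slowMo are regular Puppeteer launch options.

```javascript
// Build launch options from a hypothetical `debug` flag.
function buildLaunchOptions(debug = false) {
  return {
    headless: !debug,        // show the browser window only when debugging
    slowMo: debug ? 250 : 0, // slow each Puppeteer operation by 250 ms when debugging
  };
}

// Usage inside an async function:
// const browser = await puppeteer.launch(buildLaunchOptions(true));
console.log(buildLaunchOptions(true)); // { headless: false, slowMo: 250 }
```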

The next step would be to open a page:

const page = await browser.newPage();

Now that a page (in other words, a tab) is available, any website can be loaded by simply calling the goto() function:

await page.goto('https://oxylabs.io/');

Once the page is loaded, both the DOM elements and the rendered page are available. This can be verified by taking a quick screenshot using Puppeteer:

await page.screenshot({path: 'oxylabs_1080.png'});

This, however, will create only an 800×600 pixel image because Puppeteer sets the initial page size to 800×600 px. This can be changed by setting the viewport before taking the screenshot.

  await page.setViewport({
    width: 1920,
    height: 1080,
  });
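
The viewport and screenshot calls can also be wrapped in one small helper. This is only a sketch: capture is a hypothetical name, and the fullPage option (a standard screenshot option) is included to show how to capture the whole scrollable page rather than just the viewport.

```javascript
// Set the viewport, then capture the page.
// `page` is a Puppeteer Page object; `capture` is a hypothetical helper name.
async function capture(page, path, { width = 1920, height = 1080, fullPage = false } = {}) {
  await page.setViewport({ width, height });
  // fullPage: true captures the entire scrollable page, not only the viewport.
  await page.screenshot({ path, fullPage });
  return path;
}

// Usage inside an async function, after page.goto():
// await capture(page, 'oxylabs_full.png', { fullPage: true });
```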

Finally, remember to close the browser:

await browser.close();
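
If any step before close() throws, the browser process can be left running. One common way to guard against that is a try/finally block, sketched below. The withBrowser helper and its parameter names are hypothetical; puppeteerLib is whatever require('puppeteer') returns.

```javascript
// Run `task` with a fresh browser and always close the browser afterwards.
// `withBrowser` is a hypothetical helper name.
async function withBrowser(puppeteerLib, task) {
  const browser = await puppeteerLib.launch();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // runs whether task succeeded or threw
  }
}

// Usage:
// const puppeteer = require('puppeteer');
// withBrowser(puppeteer, async browser => {
//   const page = await browser.newPage();
//   await page.goto('https://oxylabs.io/');
// });
```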

Putting it all together, here is the complete script:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({width: 1920, height: 1080});
  await page.goto('https://oxylabs.io/');
  await page.screenshot({path: 'oxylabs_1080.png'});
  await browser.close();
})();

Run this file from the terminal using this command:

node example1.js

This should create a new file oxylabs_1080.png in the same directory.

Bonus tip: If you need a PDF, you can use the pdf() function:

await page.pdf({path: 'oxylabs.pdf', format: 'A4'});

Scraping an element from a page

Puppeteer loads the complete page into the DOM. This means that we can extract any data from the page. The easiest way to do this is to use the evaluate() function, which allows us to execute JavaScript functions like document.querySelector() in the context of the page. Consequently, it lets us extract any element from the DOM.

To understand this, open this link in your preferred browser: https://en.wikipedia.org/wiki/Web_scraping

Once the page is loaded, right-click the heading of the page, and select Inspect. This should open developer tools with the Elements tab activated. Here it is visible that the page’s heading is in an h1 element, with id and class both set to firstHeading.

Now, go to the Console tab in the developer tools and enter this line:

document.querySelector('#firstHeading')

You will immediately see that our desired tag is extracted.

This returns one element from the page. For this particular element, all we need is the text, which can be read from the textContent property:

document.querySelector('#firstHeading').textContent

The text can now be returned using the return keyword. The next step is to wrap this in the evaluate() method, which runs the function in the context of the loaded page, where document is available.

await page.evaluate(() => {
    return document.querySelector("#firstHeading").textContent;
});

The result of the evaluate() function can be stored in a variable to complete the functionality. Finally, do not forget to close the browser. Here is the complete script:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://en.wikipedia.org/wiki/Web_scraping");

  // Run the selector inside the page and return the heading text to Node.js
  const title = await page.evaluate(() => {
    return document.querySelector("#firstHeading").textContent.trim();
  });
  console.log(title);
  await browser.close();
})();
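
One detail worth illustrating: the function passed to evaluate() runs inside the page, so variables from the Node.js script are not visible there unless they are passed as extra arguments. The sketch below shows this with a hypothetical getText helper, which also returns null instead of throwing when nothing matches the selector.

```javascript
// Return the trimmed text of the first element matching `selector`,
// or null when nothing matches. `getText` is a hypothetical helper name.
async function getText(page, selector) {
  // The callback runs inside the page, so `selector` must be passed
  // explicitly as an extra argument to evaluate().
  return page.evaluate(sel => {
    const element = document.querySelector(sel);
    return element ? element.textContent.trim() : null;
  }, selector);
}

// Usage inside an async function, after page.goto():
// const title = await getText(page, '#firstHeading');
```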

Scraping multiple elements

Extracting multiple elements would involve three steps:

1. Use querySelectorAll to get all elements matching the selector:

headings_elements = document.querySelectorAll("h2 .mw-headline");

2. Create an array, as headings_elements is of type NodeList:

headings_array = Array.from(headings_elements);

3. Call the map() function to process each element in the array and return the results:

return headings_array.map(heading => heading.textContent);

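This, of course, needs to be wrapped in the page.evaluate() function. Putting everything together, this is the complete script. You can save it as wiki_toc.js. Note that Wikipedia's markup changes over time; if the script prints an empty array, inspect the page and adjust the selector.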

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://en.wikipedia.org/wiki/Web_scraping");

  const headings = await page.evaluate(() => {
    // Select all matching elements, convert the NodeList to an array, map to text
    const headings_elements = document.querySelectorAll("h2 .mw-headline");
    const headings_array = Array.from(headings_elements);
    return headings_array.map(heading => heading.textContent);
  });
  console.log(headings);
  await browser.close();
})();

This file can now be run from your terminal:

node wiki_toc.js

Bonus tip: The Array.from() function can be supplied with a map function directly, without a separate call to map(). Depending on your comfort level, the same code can thus be written as:

const headings = await page.evaluate(() => {
  return Array.from(
    document.querySelectorAll("h2 .mw-headline"),
    heading => heading.innerText.trim()
  );
});
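
Puppeteer also offers page.$$eval(), which combines querySelectorAll and Array.from in a single call. The following is a minimal sketch of the same extraction using it; getTexts is a hypothetical helper name.

```javascript
// Collect the trimmed text of every element matching `selector`.
// `getTexts` is a hypothetical helper name.
async function getTexts(page, selector) {
  // $$eval runs querySelectorAll inside the page and hands the matches
  // to the callback as a regular array.
  return page.$$eval(selector, elements =>
    elements.map(element => element.textContent.trim())
  );
}

// Usage inside an async function, after page.goto():
// const headings = await getTexts(page, 'h2 .mw-headline');
```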

Scraping a hotel listing page

This section will explain how a typical listing page can be scraped to get a JSON object with all the required information. The concepts presented in this section will be applicable for any listing, whether it is an online store, a directory, or a hotel listing.

The example that we will take is an Airbnb search page. Apply some filters so that you reach a page similar to the one in the screenshot. In this particular example, we will be scraping an Airbnb page that lists 20 hotels. If you're interested in scraping Airbnb, check out this post on how to scrape Airbnb listing data on a larger scale. Now, to scrape all 20 hotels, the first step is to identify the selector for each hotel section.

NOTE: Airbnb's page structure changes constantly. Make sure to find appropriate selectors each time.

root = Array.from(document.querySelectorAll('div[data-testid="card-container"]'));

This converts the NodeList of 20 matching elements into an array and stores it in the variable root. Note that so far, no text or attribute has been extracted. All we have is an array of elements. The extraction will be done in the map() function.

hotels = root.map(hotel => ({ 
// code here
}));

The hotel name can be extracted with this line:

hotel.querySelector('div[data-testid="listing-card-title"]').textContent

The most important concept to understand here is that we are chaining querySelector calls. Effectively, the first hotel name is being extracted with this line of code:

document.querySelectorAll('div[data-testid="card-container"]')[0].querySelector('div[data-testid="listing-card-title"]').textContent

The URL of the hotel's photo can be extracted with code like this:

hotel.querySelector("img").getAttribute("src")

Finally, we can create an object containing both of these values. The syntax to create an object is like this:

const hotel = {
  Name: 'x',
  Photo: 'y'
};

Putting everything together, here is the final script. Save it as bnb.js.

const puppeteer = require("puppeteer");

(async () => {
  const url = "https://www.airbnb.com/s/homes?refinement_paths%5B%5D=%2Fhomes&search_type=section_navigation&property_type_id%5B%5D=8";
  // launch() takes an optional options object, not a URL
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  const data = await page.evaluate(() => {
    // One array entry per listing card
    const root = Array.from(document.querySelectorAll('div[data-testid="card-container"]'));
    const hotels = root.map(hotel => ({
      Name: hotel.querySelector('div[data-testid="listing-card-title"]').textContent,
      Photo: hotel.querySelector("img").getAttribute("src")
    }));
    return hotels;
  });
  console.log(data);
  await browser.close();
})();
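
On a dynamic page, the cards may not be in the DOM yet when goto() returns, and some cards may lack a title or an image. The sketch below is one possible way to handle both cases, assuming the same selectors as in the script; scrapeCards is a hypothetical helper name and the 30-second timeout is an arbitrary choice.

```javascript
// Wait for the listing cards to render, then extract them defensively.
// `scrapeCards` is a hypothetical helper; the selectors are the ones used above.
async function scrapeCards(page) {
  const cardSelector = 'div[data-testid="card-container"]';
  // Resolves once at least one card is in the DOM; throws if none appears in 30 seconds.
  await page.waitForSelector(cardSelector, { timeout: 30000 });

  return page.$$eval(cardSelector, cards =>
    cards.map(card => ({
      // Optional chaining avoids a TypeError when a child element is missing;
      // the missing value is reported as null instead.
      Name: card.querySelector('div[data-testid="listing-card-title"]')?.textContent.trim() ?? null,
      Photo: card.querySelector('img')?.getAttribute('src') ?? null,
    }))
  );
}

// Usage inside an async function, after page.goto(url):
// const data = await scrapeCards(page);
```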

Run this file from the terminal using:

node bnb.js

You should see an array of objects, one per listing, printed to the console.
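
If you would rather keep the results than just print them, the array can be written to a file with Node's built-in fs module. This is a small sketch; saveAsJson and the hotels.json file name are hypothetical.

```javascript
const fs = require('fs');

// Write scraped records to disk as pretty-printed JSON.
// `saveAsJson` is a hypothetical helper name.
function saveAsJson(records, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(records, null, 2), 'utf8');
}

// Usage: in bnb.js, call this after page.evaluate() returns:
// saveAsJson(data, 'hotels.json');
```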

How proxies can benefit Puppeteer use cases

When performing Puppeteer web scraping or automation, especially at scale, it’s important to address one common challenge – getting blocked by websites. Modern targets employ anti-bot measures that detect and restrict traffic coming from a single IP address or known data center ranges. That’s where proxies come in.

They act as intermediaries between your Puppeteer browser and the target website, allowing you to rotate IP addresses and mask your original one. This helps you avoid rate limits, CAPTCHAs, and geo-restrictions while scraping.
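
As a rough sketch of how this looks in code, the browser can be pointed at a proxy with Chromium's --proxy-server flag, and a simple round-robin function can hand out a different proxy for each launch. The helper names, hosts, ports, and credentials below are placeholders, not real endpoints.

```javascript
// Build launch options that route browser traffic through a proxy.
// `proxyLaunchOptions` is a hypothetical helper name.
function proxyLaunchOptions(host, port) {
  return { args: [`--proxy-server=${host}:${port}`] };
}

// Return a function that picks the next proxy from a list in round-robin order.
function createRotator(proxies) {
  let index = 0;
  return () => proxies[index++ % proxies.length];
}

// Usage inside an async function (placeholder hosts and credentials):
// const next = createRotator([
//   { host: 'proxy1.example.com', port: 8000 },
//   { host: 'proxy2.example.com', port: 8000 },
// ]);
// const { host, port } = next();
// const browser = await puppeteer.launch(proxyLaunchOptions(host, port));
// const page = await browser.newPage();
// await page.authenticate({ username: 'USER', password: 'PASS' }); // only if the proxy requires credentials
```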

Residential proxies

Residential proxies use real devices with IP addresses assigned by internet service providers (ISPs). These premium paid proxy servers appear to websites as ordinary users, making them highly effective at avoiding detection and bypassing sophisticated anti-bot systems. They are particularly useful when scraping websites that block IP addresses aggressively or when you need to simulate real user behavior across various geographic regions.

Use cases in Puppeteer:

Scraping e-commerce or travel sites with advanced bot protection.
Accessing region-specific content.
Emulating organic browsing behavior for testing or monitoring.

Datacenter proxies

Datacenter proxies are hosted on servers in data centers and offer high speed and availability. They are more cost-effective than residential proxies and work well for less protected websites or when volume and speed matter more than stealth.

Use cases in Puppeteer:

Scraping high-volume public data from less restrictive websites.
Testing applications or websites from different IPs quickly.
Performing SEO audits or price comparisons.

Whether you choose residential or datacenter proxies depends on your target site and scraping goals. You can always try proxies first before committing to a full subscription: there are free proxies available that work perfectly well for testing purposes or small-scale projects.

Summary

In this Puppeteer web scraping tutorial, we explained various examples of web scraping with Puppeteer. The sample code ranged from extracting a single element from a website to extracting hotel listings from a popular website. We recommend that you look at the official Puppeteer documentation for more detailed information.

If you want to save time and effort when web scraping, take advantage of our SERP API (now part of a Web Scraper API solution), which is dedicated to extracting the data you need hassle-free. You can test our Web Scraper API for free, including a built-in feature – Headless Browser.

We would also recommend checking out our blog posts on asynchronous scraping to learn more about other Python libraries and on Playwright vs. Puppeteer, as well as reading our JavaScript web scraping tutorial to learn web scraping using Axios and Cheerio, which could be more suitable in other scenarios.

About the author

Gabija Fatenaite

Former Director of Product & Event Marketing

Gabija Fatenaite was a Director of Product & Event Marketing at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
