How to get text from an element in Puppeteer?

This short tutorial shows how to extract text from page elements with Puppeteer and how to troubleshoot the most common problems along the way.

Best practices

  • Use specific and unique CSS selectors to accurately target the element from which you want to extract text.

  • `innerText` returns text roughly as it is rendered on the page: it leaves out content hidden by CSS (for example `display: none`) and reflects line breaks from the layout, which makes it useful for scraping visible text.

  • For raw text without any HTML tags, use `textContent`. It ignores styling and returns the text of the node and all its descendants, including hidden elements and the contents of `<script>` and `<style>` tags.

  • If you need to capture the HTML content and then remove tags, use `innerHTML` combined with a regular expression to strip out HTML tags, ensuring you get only the textual content.
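The tag-stripping idea from the last bullet can be tried outside the browser. The following is a minimal sketch, not a full HTML sanitizer: `stripTags` is a hypothetical helper name and the sample HTML string is made up for illustration. It also shows one caveat of this approach: `innerHTML` returns entity-encoded text, so entities such as `&amp;` stay encoded after the tags are removed, whereas `textContent` would return the decoded character.

```javascript
// Hypothetical helper: removes anything that looks like an HTML tag.
// Assumption: the input is an HTML string such as the value of innerHTML.
function stripTags(html) {
  return html.replace(/<[^>]*>?/gm, '');
}

// Made-up sample markup for illustration only.
const sampleHtml = '<p>Fish &amp; <b>chips</b></p>';

// Tags are removed, but the entity is NOT decoded.
console.log(stripTags(sampleHtml)); // Fish &amp; chips
```

If decoded text is what you need, reading `textContent` or `innerText` directly is usually simpler than post-processing `innerHTML`.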

// Import puppeteer
const puppeteer = require('puppeteer');

(async () => {
    // Launch browser
    const browser = await puppeteer.launch();
    // Open new page
    const page = await browser.newPage();
    // Navigate to the page
    await page.goto('https://sandbox.oxylabs.io/products');

    // Note: 'selector' is a placeholder; replace it with the CSS selector of your target element

    // Get text using textContent
    const text1 = await page.evaluate(() => document.querySelector('selector').textContent);
    console.log(text1);

    // Get text using innerText
    const text2 = await page.evaluate(() => document.querySelector('selector').innerText);
    console.log(text2);

    // Get text using innerHTML and strip tags
    const text3 = await page.evaluate(() => document.querySelector('selector').innerHTML.replace(/<[^>]*>?/gm, ''));
    console.log(text3);

    // Close browser
    await browser.close();
})();
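The same extraction can be written more compactly with Puppeteer's `page.$eval` and `page.$$eval`, which combine the query and the in-browser callback in one call. The sketch below is one possible way to wrap them; `getText` and `getAllTexts` are hypothetical helper names, and `page` is assumed to be a Puppeteer `Page` like the one created above.

```javascript
// Hypothetical helper: text of the first element matching the selector.
// page.$eval rejects with an error when nothing matches.
async function getText(page, selector) {
  return page.$eval(selector, (el) => el.textContent.trim());
}

// Hypothetical helper: text of every element matching the selector.
// The callback receives an array, so no match yields an empty array.
async function getAllTexts(page, selector) {
  return page.$$eval(selector, (elements) =>
    elements.map((el) => el.textContent.trim())
  );
}

// Usage inside the async function above (the selector is an example):
//   const names = await getAllTexts(page, '.product-name');
//   console.log(names);
```

`getAllTexts` is convenient for lists such as product names, since a single call returns one string per matched element.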

Common issues

  • Ensure that the page has fully loaded before attempting to extract text; otherwise Puppeteer may query elements that are not yet in the DOM.

  • Handle potential null values returned from `querySelector` by checking if the element exists before attempting to access its properties to avoid runtime errors.

  • Consider using `page.waitForSelector` to ensure the element is present and avoid timing issues when trying to retrieve text.

  • Be aware of the differences between `textContent`, `innerText`, and `innerHTML` to choose the most appropriate method based on whether you need to consider the style or just extract raw data.

// Incorrect: Trying to get text without ensuring the element has loaded
const text = await page.evaluate(() => document.querySelector('.product-name').textContent);

// Correct: Wait for the element to appear in the DOM first
await page.waitForSelector('.product-name');
const text = await page.evaluate(() => document.querySelector('.product-name').textContent);

// Incorrect: Not checking if the element exists; throws a TypeError when querySelector returns null
const text = await page.evaluate(() => document.querySelector('.nonexistent').innerText);

// Correct: Check for null inside page.evaluate, because DOM elements cannot be returned to Node.js
const text = await page.evaluate(() => {
    const element = document.querySelector('.nonexistent');
    return element ? element.innerText : 'Element not found';
});

// Incorrect: By default waitForSelector only waits for the element to exist in the DOM, so it may still be hidden
await page.waitForSelector('.details');
const details = await page.evaluate(() => document.querySelector('.details').innerText);

// Correct: Wait until the element is visible; timeout (in ms) replaces the default of 30000 and the call throws if it expires
await page.waitForSelector('.details', { visible: true, timeout: 3000 });
const details = await page.evaluate(() => document.querySelector('.details').innerText);

// Incorrect: Using innerHTML when you need text; the result may include unwanted HTML tags
const content = await page.evaluate(() => document.querySelector('.content').innerHTML);

// Correct: Use textContent or innerText based on need (style consideration or not)
const content = await page.evaluate(() => document.querySelector('.content').innerText);
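The fixes above can be combined into one reusable function. This is a rough sketch under stated assumptions rather than a definitive implementation: `getTextWhenReady`, its `timeout` and `fallback` options, and the 5000 ms default are all hypothetical choices, and the error check assumes that Puppeteer's timeout error carries the name `TimeoutError`.

```javascript
// Hypothetical helper: wait for a visible element, then read its text.
// Returns `fallback` instead of throwing when the element never shows up.
async function getTextWhenReady(page, selector, { timeout = 5000, fallback = null } = {}) {
  let handle;
  try {
    // Resolves with an ElementHandle once the element is present and visible.
    handle = await page.waitForSelector(selector, { visible: true, timeout });
  } catch (err) {
    // Assumption: Puppeteer's timeout error has name 'TimeoutError'.
    if (err && err.name === 'TimeoutError') return fallback;
    throw err; // Anything else is unexpected, so surface it.
  }
  if (!handle) return fallback;

  // Read the text from the handle itself, avoiding a second DOM query.
  const text = await handle.evaluate((el) => el.innerText.trim());
  await handle.dispose();
  return text;
}

// Usage inside an async function with an open page (selector is an example):
//   const details = await getTextWhenReady(page, '.details', { fallback: 'Element not found' });
```

Reading from the returned handle instead of calling `document.querySelector` again means the wait and the read refer to the same element.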
