Best practices

  • Use specific and unique CSS selectors to accurately target the element from which you want to extract text.

  • `innerText` returns the text as it is rendered on the page. It takes CSS into account, so text in descendants hidden with CSS (for example `display: none`) is left out. Use it when you want what a visitor actually sees.

  • `textContent` returns the raw text of the node and all its descendants, without any HTML tags. It ignores styling, so it also includes hidden text and the contents of `<script>` and `<style>` elements.

  • If you need to capture the HTML content and then remove tags, read `innerHTML` and strip the tags with a regular expression. Treat this as a rough fallback: the regex leaves HTML entities such as `&amp;` undecoded and keeps the contents of `<script>` and `<style>` elements.

// Import Puppeteer
const puppeteer = require('puppeteer');

(async () => {
  // Launch the browser
  const browser = await puppeteer.launch();
  // Open a new page
  const page = await browser.newPage();
  // Navigate to the page
  await page.goto('https://sandbox.oxylabs.io/products');

  // 'selector' is a placeholder: replace it with a CSS selector that matches an element on the page

  // Get text using textContent
  const text1 = await page.evaluate(() => document.querySelector('selector').textContent);
  console.log(text1);

  // Get text using innerText
  const text2 = await page.evaluate(() => document.querySelector('selector').innerText);
  console.log(text2);

  // Get text using innerHTML and strip tags
  const text3 = await page.evaluate(() => document.querySelector('selector').innerHTML.replace(/<[^>]*>?/gm, ''));
  console.log(text3);

  // Close the browser
  await browser.close();
})();
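
To see where the regex approach falls short, the same pattern can be tried on plain strings, with no browser involved. This is only a sketch: `stripTags` is a hypothetical helper name, and the sample strings are made up for illustration.

```javascript
// Hypothetical helper: applies the same tag-stripping regex as the example above
// to a plain HTML string.
const stripTags = (html) => html.replace(/<[^>]*>?/gm, '');

// Simple markup: the tags are removed and the text is kept.
console.log(stripTags('<p>Hello <b>world</b></p>')); // Hello world

// HTML entities are NOT decoded by the regex.
console.log(stripTags('<p>Tom &amp; Jerry</p>')); // Tom &amp; Jerry

// The contents of <style> (and <script>) elements stay in the output.
console.log(stripTags('<style>p{color:red}</style><p>Hi</p>')); // p{color:red}Hi
```

If either of the last two cases matters for your data, `textContent` or `innerText` is likely the safer choice.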

Common issues

  • Ensure that the page has fully loaded before attempting to extract text; otherwise Puppeteer might try to access elements that aren't yet in the DOM.

  • Handle potential null values returned from `querySelector` by checking if the element exists before attempting to access its properties to avoid runtime errors.

  • Consider using `page.waitForSelector` to ensure the element is present and avoid timing issues when trying to retrieve text.

  • Be aware of the differences between `textContent`, `innerText`, and `innerHTML` to choose the most appropriate method based on whether you need to consider the style or just extract raw data.

// Incorrect: Trying to get text without ensuring the element has loaded
const text = await page.evaluate(() => document.querySelector('.product-name').textContent);

// Correct: Wait for the element to appear in the DOM before reading it
await page.waitForSelector('.product-name');
const text = await page.evaluate(() => document.querySelector('.product-name').textContent);

// Incorrect: Not checking if the element exists; throws a TypeError if querySelector returns null
const text = await page.evaluate(() => document.querySelector('.nonexistent').innerText);

// Correct: Check that the element exists inside page.evaluate, because DOM nodes cannot be returned to Node.js
const text = await page.evaluate(() => {
  const element = document.querySelector('.nonexistent');
  return element ? element.innerText : 'Element not found';
});

// Incorrect: Waiting only for the element to exist in the DOM; if it is still hidden, innerText may not match what a visitor will see
await page.waitForSelector('.details');
const details = await page.evaluate(() => document.querySelector('.details').innerText);

// Correct: Wait until the element is visible; timeout is in milliseconds (the default is 30000)
await page.waitForSelector('.details', { visible: true, timeout: 3000 });
const details = await page.evaluate(() => document.querySelector('.details').innerText);

// Incorrect: Using innerHTML when you need text; the result might include unwanted HTML tags
const content = await page.evaluate(() => document.querySelector('.content').innerHTML);

// Correct: Use textContent or innerText based on need (style consideration or not)
const content = await page.evaluate(() => document.querySelector('.content').innerText);
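
The waiting and null-handling advice above can be combined into one small helper. The following is a minimal sketch, not part of Puppeteer itself: `getTextOrFallback`, its `fallback` option, and the 5000 ms default are hypothetical choices, and it assumes that Puppeteer's timeout errors carry the name `TimeoutError`.

```javascript
// Hypothetical helper: waits for `selector`, then returns its trimmed textContent.
// Returns `fallback` if the element does not appear within `timeout` milliseconds.
async function getTextOrFallback(page, selector, { timeout = 5000, fallback = null } = {}) {
  let handle;
  try {
    // waitForSelector resolves with an ElementHandle once the element is found
    handle = await page.waitForSelector(selector, { timeout });
  } catch (err) {
    // Assumption: timeout errors are named 'TimeoutError'; anything else is rethrown
    if (err && err.name === 'TimeoutError') return fallback;
    throw err;
  }
  if (!handle) return fallback;
  try {
    // Read the text in the browser context and return only the string
    const text = await handle.evaluate((el) => el.textContent);
    return typeof text === 'string' ? text.trim() : fallback;
  } finally {
    // Release the handle so it does not hold on to the DOM node
    await handle.dispose();
  }
}
```

With a real page, it could be called as `const name = await getTextOrFallback(page, '.product-name', { fallback: 'Element not found' });`. Returning a fallback instead of throwing keeps a scraping loop running when a single element is missing.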
