Best practices

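  • Choose the BeautifulSoup parser deliberately ('html.parser', 'lxml', or 'html5lib') before converting to an lxml etree: each one repairs malformed markup differently, so the choice determines the HTML structure your XPath expressions will see.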

  • BeautifulSoup has no XPath support of its own, so convert the BeautifulSoup object to an lxml etree before running XPath expressions.

  • Be precise with your XPath queries to avoid fetching unintended data, especially in complex HTML structures.

  • Regularly update and test your XPath selectors to adapt to changes in the webpage's HTML structure.
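
To illustrate the point about parser choice, here is a minimal sketch that feeds the same malformed snippet to two parsers and compares the results. The markup and the variable names are made up for illustration, and the sketch assumes both beautifulsoup4 and lxml are installed.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Deliberately malformed markup (illustrative): a stray closing </p> tag.
broken_html = "<a></p>"

# Each parser repairs the markup in its own way.
soup_builtin = BeautifulSoup(broken_html, "html.parser")
soup_lxml = BeautifulSoup(broken_html, "lxml")

print("html.parser:", soup_builtin)
print("lxml:       ", soup_lxml)

# Whatever string is handed to lxml is the tree that XPath will query.
root = etree.fromstring(str(soup_lxml), etree.HTMLParser())
anchors = root.xpath("//a")
print("Anchors found:", len(anchors))
```

Printing both soups side by side is a quick way to check whether a selector problem comes from the XPath expression or from how the parser rebuilt the page.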

from bs4 import BeautifulSoup
import requests
from lxml import etree

# Fetch the webpage
url = ''
response = requests.get(url)

# Parse the content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Convert BeautifulSoup object to an lxml etree
root = etree.fromstring(str(soup), etree.HTMLParser())

# Example 1: Using XPath to find the title
title = root.xpath('//title/text()')
print('Title:', title)

# Example 2: Get the href attribute of every 'a' tag using XPath
links = root.xpath('//a/@href')
print('Links:', links)

# Example 3: Find elements by class name
prices = root.xpath('//div[contains(@class, "price-wrapper")]//text()')
print('Prices:', prices)

# Example 4: Using XPath to get text within a specific product paragraph
description = root.xpath('//p[contains(@class, "description")][1]//text()')
print('Description:', description)
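
On the advice to be precise with XPath queries: contains(@class, ...) is a substring test, so it can match more classes than intended. The sketch below contrasts it with a whole-token match built from the standard XPath 1.0 functions concat() and normalize-space(). The markup and class names are hypothetical and only serve the illustration.

```python
from lxml import etree

# Hypothetical markup for illustration only.
html = """
<html><body>
  <div class="price">10.00</div>
  <div class="price-wrapper">12.50</div>
  <div class="old price">15.00</div>
</body></html>
"""
root = etree.fromstring(html, etree.HTMLParser())

# Substring match: "price" also matches the class "price-wrapper".
loose = root.xpath('//div[contains(@class, "price")]/text()')

# Token match: pad the class list with spaces and look for " price ",
# so only elements carrying the exact class "price" are selected.
strict = root.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " price ")]/text()'
)

print("Loose: ", loose)
print("Strict:", strict)
```

The token form is longer, but it tends to be safer on pages where class names share prefixes.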

Common issues

  • Ensure that the HTML content is properly formatted and complete before parsing it with BeautifulSoup to avoid errors during the conversion to an lxml etree.

  • Verify that all necessary libraries, such as lxml and requests, are installed and up-to-date to prevent runtime errors.

  • Use explicit and correct XPath syntax to ensure accurate selection of elements, avoiding common mistakes like incorrect attribute names or missing brackets.

  • Handle errors gracefully when performing XPath queries: a query that matches nothing returns an empty list (so indexing it with [0] raises IndexError), and an invalid XPath expression raises lxml's etree.XPathEvalError.

from bs4 import BeautifulSoup
from lxml import etree

# Incorrect: Parsing incomplete HTML content
# (illustrative markup: the closing tags are missing)
soup = BeautifulSoup('<html><head><title>Shop</title></head><body><div class="product-card">Widget', 'html.parser')
root = etree.fromstring(str(soup), etree.HTMLParser())

# Correct: Ensure HTML is complete (illustrative markup)
soup = BeautifulSoup('<html><head><title>Shop</title></head><body><div class="product-card">Widget</div></body></html>', 'html.parser')
root = etree.fromstring(str(soup), etree.HTMLParser())

# Ensure all libraries are installed and updated
# pip install lxml requests beautifulsoup4

# Incorrect: Incorrect XPath syntax (missing "@" and closing "]").
# Left commented out because lxml raises etree.XPathEvalError for it.
# products = root.xpath('//div[class="product-card"/text()')

# Correct: Correct XPath syntax
products = root.xpath('//div[@class="product-card"]/text()')

# Incorrect: Not handling exceptions for missing elements
# (raises IndexError when the page has no <title>)
title = root.xpath('//title/text()')[0]

# Correct: Graceful error handling
try:
    title = root.xpath('//title/text()')[0]
except IndexError:
    title = 'Title not found'
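
The error handling advice above can be wrapped in a small helper. This is only a sketch: xpath_first is a hypothetical name, not part of lxml, and it assumes that returning a default value is acceptable both when nothing matches and when the expression itself is invalid.

```python
from lxml import etree


def xpath_first(root, expression, default=None):
    """Return the first XPath result, or `default` if the query
    matches nothing or the expression is invalid."""
    try:
        results = root.xpath(expression)
    except etree.XPathError:
        # Base class of XPathEvalError, raised for malformed expressions.
        return default
    if isinstance(results, list):
        return results[0] if results else default
    # Non-node-set results (e.g. from count() or string()) are returned as-is.
    return results


# Illustrative document without a <title> element.
sample = etree.fromstring(
    "<html><head></head><body><p>Hello</p></body></html>",
    etree.HTMLParser(),
)

print(xpath_first(sample, "//title/text()", "Title not found"))
print(xpath_first(sample, '//div[class="product-card"/text()', "Invalid XPath"))
print(xpath_first(sample, "//p/text()"))
```

Swallowing invalid expressions is convenient in a scraper loop, but during development it may be better to let etree.XPathError propagate so that typos in selectors surface early.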
