How to use XPath in Python?

Discover the essentials of utilizing XPath in Python to navigate and extract data efficiently. This guide provides a concise overview of XPath syntax and methods, tailored for seamless integration into your data extraction workflows.

Best practices

  • Always use the `text()` function in XPath to extract the text content of an element, ensuring you retrieve only the human-readable part.

  • Utilize predicates in XPath expressions to filter and refine selections, enhancing the specificity and accuracy of your data extraction.

  • Use the `contains()` function to match elements based on partial attribute values or text, making your XPath queries more flexible and robust.

  • Regularly update and test your XPath queries to adapt to changes in the webpage structure, ensuring your code remains functional over time.
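The practices above can be tried offline on a small HTML snippet before pointing your queries at a live page (the markup below is illustrative, not taken from a real site):

```python
from lxml import html

# Hypothetical markup mimicking a product listing
snippet = """
<div>
  <h3 class="product-name">Basic Widget</h3>
  <h3 class="product-name">Premium Widget</h3>
  <a class="product-link featured" href="/products/2">Details</a>
</div>
"""
tree = html.fromstring(snippet)

# text() extracts only the human-readable content
names = tree.xpath('//h3[@class="product-name"]/text()')

# A predicate with contains() filters by partial text
premium = tree.xpath('//h3[contains(text(), "Premium")]/text()')

# contains() also matches partial attribute values
links = tree.xpath('//a[contains(@class, "product-link")]/@href')

print(names)    # ['Basic Widget', 'Premium Widget']
print(premium)  # ['Premium Widget']
print(links)    # ['/products/2']
```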

# Importing necessary libraries
from lxml import html
import requests

# Fetching the webpage
response = requests.get('https://sandbox.oxylabs.io/products')
# Parsing the content
tree = html.fromstring(response.content)

# Example 1: Extracting all product names using XPath
product_names = tree.xpath('//h3[@class="product-name"]/text()')
print(product_names)

# Example 2: Extracting first product's price
first_product_price = tree.xpath('(//span[@class="price"])[1]/text()')
print(first_product_price)

# Example 3: Extracting URLs from links
product_links = tree.xpath('//a[contains(@class, "product-link")]/@href')
print(product_links)

# Example 4: Using predicates to filter data
specific_product = tree.xpath('//h3[@class="product-name"][contains(text(), "Premium")]/text()')
print(specific_product)

Common issues

  • Ensure that your XPath expressions are correctly formed to avoid syntax errors, which are common when navigating complex HTML structures.

  • When using XPath with namespaces in XML documents, remember to register and use the namespace prefixes properly to avoid selection issues.

  • Avoid absolute XPath paths in your scripts; instead, use relative paths to make your code more resilient to changes in the webpage layout.

  • Handle exceptions when web pages fail to load or return unexpected content to maintain the robustness of your web scraping scripts.

# Incorrectly formed XPath, missing closing bracket
product_names = tree.xpath('//h3[@class="product-name"/text()')

# Correctly formed XPath
product_names = tree.xpath('//h3[@class="product-name"]/text()')

# Incorrect namespace handling, missing namespace registration
products = tree.xpath('//ns:product')  # raises XPathEvalError: Undefined namespace prefix

# Correct namespace handling
products = tree.xpath('//ns:product', namespaces={'ns': 'http://example.com/ns'})
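Namespace registration matters even when the document uses a default (unprefixed) namespace. A self-contained sketch, using a hypothetical XML document:

```python
from lxml import etree

# Hypothetical XML document with a default namespace
xml = b"""
<catalog xmlns="http://example.com/ns">
  <product>Widget A</product>
  <product>Widget B</product>
</catalog>
"""
root = etree.fromstring(xml)

# Namespaced elements are matched only when the prefix is
# registered via the namespaces argument and used in the expression
names = root.xpath('//ns:product/text()',
                   namespaces={'ns': 'http://example.com/ns'})
print(names)  # ['Widget A', 'Widget B']

# Without the prefix, the same path matches nothing
print(root.xpath('//product'))  # []
```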

# Using absolute XPath, brittle if HTML changes
first_product_price = tree.xpath('/html/body/div[1]/div[2]/span[1]/text()')

# Using relative XPath, more flexible
first_product_price = tree.xpath('(//span[@class="price"])[1]/text()')
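Be aware that `(//span[@class="price"])[1]` and `//span[@class="price"][1]` are not equivalent: the parenthesized form takes the first match in the whole document, while the unparenthesized form takes the first matching span within each parent. A quick illustration on hypothetical markup:

```python
from lxml import html

# Hypothetical markup: two products, each with one price span
doc = html.fromstring("""
<div>
  <div class="product"><span class="price">10</span></div>
  <div class="product"><span class="price">20</span></div>
</div>
""")

# Parenthesized: first price in the entire document
print(doc.xpath('(//span[@class="price"])[1]/text()'))  # ['10']

# Unparenthesized: first price within each parent, so both spans match
print(doc.xpath('//span[@class="price"][1]/text()'))    # ['10', '20']
```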

# No exception handling, may crash if page fails to load
response = requests.get('https://sandbox.oxylabs.io/products')
tree = html.fromstring(response.content)

# With exception handling
try:
    response = requests.get('https://sandbox.oxylabs.io/products', timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx status codes
    tree = html.fromstring(response.content)
except Exception as e:
    print(f"Failed to load page or parse content: {e}")
