How to use XPath in Python?

Discover the essentials of utilizing XPath in Python to navigate and extract data efficiently. This guide provides a concise overview of XPath syntax and methods, tailored for seamless integration into your data extraction workflows.

Best practices

  • Always use the `text()` function in XPath to extract the text content of an element, ensuring you retrieve only the human-readable part.

  • Utilize predicates in XPath expressions to filter and refine selections, enhancing the specificity and accuracy of your data extraction.

  • Use the `contains()` function to match elements based on partial attribute values or text, making your XPath queries more flexible and robust.

  • Regularly update and test your XPath queries to adapt to changes in the webpage structure, ensuring your code remains functional over time.
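The practices above can be tried offline on a small HTML snippet before pointing your queries at a live page (the markup below is illustrative, not taken from a real site):

```python
from lxml import html

# Hypothetical markup mimicking a product listing
snippet = """
<div>
  <h3 class="product-name">Basic Widget</h3>
  <h3 class="product-name">Premium Widget</h3>
  <a class="product-link featured" href="/products/2">Details</a>
</div>
"""
tree = html.fromstring(snippet)

# text() extracts only the human-readable content
names = tree.xpath('//h3[@class="product-name"]/text()')

# A predicate with contains() filters by partial text
premium = tree.xpath('//h3[contains(text(), "Premium")]/text()')

# contains() also matches partial attribute values
links = tree.xpath('//a[contains(@class, "product-link")]/@href')

print(names)    # ['Basic Widget', 'Premium Widget']
print(premium)  # ['Premium Widget']
print(links)    # ['/products/2']
```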

# Importing necessary libraries
from lxml import html
import requests

# Fetching the webpage
response = requests.get('https://sandbox.oxylabs.io/products')
# Parsing the content
tree = html.fromstring(response.content)

# Example 1: Extracting all product names using XPath
product_names = tree.xpath('//h3[@class="product-name"]/text()')
print(product_names)

# Example 2: Extracting first product's price
first_product_price = tree.xpath('(//span[@class="price"])[1]/text()')
print(first_product_price)

# Example 3: Extracting URLs from links
product_links = tree.xpath('//a[contains(@class, "product-link")]/@href')
print(product_links)

# Example 4: Using predicates to filter data
specific_product = tree.xpath('//h3[@class="product-name"][contains(text(), "Premium")]/text()')
print(specific_product)

Common issues

  • Ensure that your XPath expressions are correctly formed to avoid syntax errors, which are common when navigating complex HTML structures.

  • When using XPath with namespaces in XML documents, remember to register and use the namespace prefixes properly to avoid selection issues.

  • Avoid absolute XPath paths in your scripts; instead, use relative paths to make your code more resilient to changes in the webpage layout.

  • Handle exceptions when web pages fail to load or return unexpected content to maintain the robustness of your web scraping scripts.

# Incorrectly formed XPath, missing closing bracket
product_names = tree.xpath('//h3[@class="product-name"/text()')

# Correctly formed XPath
product_names = tree.xpath('//h3[@class="product-name"]/text()')

# Incorrect namespace handling, missing namespace registration
products = tree.xpath('//ns:product')  # raises XPathEvalError: Undefined namespace prefix

# Correct namespace handling
products = tree.xpath('//ns:product', namespaces={'ns': 'http://example.com/ns'})
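Namespace registration matters even when the document uses a default (unprefixed) namespace. A self-contained sketch, using a hypothetical XML document:

```python
from lxml import etree

# Hypothetical XML document with a default namespace
xml = b"""
<catalog xmlns="http://example.com/ns">
  <product>Widget A</product>
  <product>Widget B</product>
</catalog>
"""
root = etree.fromstring(xml)

# Namespaced elements are matched only when the prefix is
# registered via the namespaces argument and used in the expression
names = root.xpath('//ns:product/text()',
                   namespaces={'ns': 'http://example.com/ns'})
print(names)  # ['Widget A', 'Widget B']

# Without the prefix, the same path matches nothing
print(root.xpath('//product'))  # []
```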

# Using absolute XPath, brittle if HTML changes
first_product_price = tree.xpath('/html/body/div[1]/div[2]/span[1]/text()')

# Using relative XPath, more flexible
first_product_price = tree.xpath('(//span[@class="price"])[1]/text()')
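Be aware that `(//span[@class="price"])[1]` and `//span[@class="price"][1]` are not equivalent: the parenthesized form takes the first match in the whole document, while the unparenthesized form takes the first matching span within each parent. A quick illustration on hypothetical markup:

```python
from lxml import html

# Hypothetical markup: two products, each with one price span
doc = html.fromstring("""
<div>
  <div class="product"><span class="price">10</span></div>
  <div class="product"><span class="price">20</span></div>
</div>
""")

# Parenthesized: first price in the entire document
print(doc.xpath('(//span[@class="price"])[1]/text()'))  # ['10']

# Unparenthesized: first price within each parent, so both spans match
print(doc.xpath('//span[@class="price"][1]/text()'))    # ['10', '20']
```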

# No exception handling, may crash if page fails to load
response = requests.get('https://sandbox.oxylabs.io/products')
tree = html.fromstring(response.content)

# With exception handling
try:
    response = requests.get('https://sandbox.oxylabs.io/products', timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx status codes
    tree = html.fromstring(response.content)
except Exception as e:
    print(f"Failed to load page or parse content: {e}")
