Use the `text()` node test in XPath when you want an element's own text nodes rather than the element itself. Note that `text()` returns only direct text children, so text inside nested tags is not included.
Utilize predicates in XPath expressions to filter and refine selections, enhancing the specificity and accuracy of your data extraction.
Use the `contains()` function to match elements based on partial attribute values or text, making your XPath queries more flexible and robust.
Regularly update and test your XPath queries to adapt to changes in the webpage structure, ensuring your code remains functional over time.
# Importing necessary libraries
from lxml import html
import requests

# Fetching the webpage
response = requests.get('https://sandbox.oxylabs.io/products')

# Parsing the content
tree = html.fromstring(response.content)

# Example 1: Extracting all product names using XPath
product_names = tree.xpath('//h3[@class="product-name"]/text()')
print(product_names)

# Example 2: Extracting first product's price
first_product_price = tree.xpath('(//span[@class="price"])[1]/text()')
print(first_product_price)

# Example 3: Extracting URLs from links
product_links = tree.xpath('//a[contains(@class, "product-link")]/@href')
print(product_links)

# Example 4: Using predicates to filter data
specific_product = tree.xpath('//h3[@class="product-name"][contains(text(), "Premium")]/text()')
print(specific_product)
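The tips above can also be tried offline. The following is a minimal sketch that runs the same kinds of expressions against a small inline HTML snippet instead of a live page. The markup, the class names, and the variable names are assumptions made up for illustration and are not taken from the sandbox site.

```python
from lxml import html

# Hypothetical markup, invented for this sketch
SAMPLE_HTML = """
<div class="catalog">
  <div class="card">
    <h3 class="product-name">Premium <b>Gold</b> Plan</h3>
    <span class="price">49.99</span>
    <a class="btn product-link" href="/products/1">View</a>
  </div>
  <div class="card">
    <h3 class="product-name">Basic Plan</h3>
    <span class="price">9.99</span>
    <a class="btn product-link" href="/products/2">View</a>
  </div>
</div>
"""

tree = html.fromstring(SAMPLE_HTML)

# text() returns only direct text nodes, so the nested <b>Gold</b> is skipped
direct_text = tree.xpath('//h3[@class="product-name"]/text()')

# string() returns the full string value of the first matching element
full_name = tree.xpath('string(//h3[@class="product-name"])')

# contains() on an attribute matches a partial, multi-valued class
links = tree.xpath('//a[contains(@class, "product-link")]/@href')

# contains(text(), ...) tests only the first text node; contains(., ...)
# tests the element's whole string value, including nested tags
by_first_text_node = tree.xpath('//h3[contains(text(), "Gold")]')
by_string_value = tree.xpath('//h3[contains(., "Gold")]')

# [1] without parentheses means "first within its parent", so it can match
# once per card; parentheses select the first match in the whole document
first_per_parent = tree.xpath('//span[@class="price"][1]/text()')
first_overall = tree.xpath('(//span[@class="price"])[1]/text()')

print(direct_text, full_name, links)
print(len(by_first_text_node), len(by_string_value))
print(first_per_parent, first_overall)
```

Testing expressions against a fixed snippet like this is one way to check a query before pointing it at a page whose structure may change.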
Ensure that your XPath expressions are correctly formed to avoid syntax errors, which are common when navigating complex HTML structures.
When using XPath with namespaces in XML documents, remember to register and use the namespace prefixes properly to avoid selection issues.
Avoid absolute XPath paths in your scripts; instead, use relative paths to make your code more resilient to changes in the webpage layout.
Handle exceptions when web pages fail to load or return unexpected content to maintain the robustness of your web scraping scripts.
# Incorrectly formed XPath, missing closing bracket
product_names = tree.xpath('//h3[@class="product-name"/text()')
# Correctly formed XPath
product_names = tree.xpath('//h3[@class="product-name"]/text()')

# Incorrect namespace handling, the "ns" prefix is never mapped to a URI
products = tree.xpath('//ns:product')
# Correct namespace handling, the prefix is mapped through the namespaces argument
products = tree.xpath('//ns:product', namespaces={'ns': 'http://example.com/ns'})

# Using absolute XPath, brittle if HTML changes
first_product_price = tree.xpath('/html/body/div[1]/div[2]/span[1]/text()')
# Using relative XPath, more flexible (the parentheses make [1] select the first match in the document)
first_product_price = tree.xpath('(//span[@class="price"])[1]/text()')

# No exception handling, may crash if page fails to load
response = requests.get('https://sandbox.oxylabs.io/products')
tree = html.fromstring(response.content)

# With exception handling (the ParserError branch requires: from lxml import etree)
try:
    response = requests.get('https://sandbox.oxylabs.io/products', timeout=10)
    response.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
    tree = html.fromstring(response.content)
except requests.RequestException as e:
    print(f"Failed to load page: {e}")
except etree.ParserError as e:
    print(f"Failed to parse content: {e}")
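The namespace and syntax issues above can be reproduced without a network request. This is a minimal sketch, assuming a made-up XML catalog that uses the `http://example.com/ns` URI as its default namespace. The `safe_xpath` helper is a hypothetical name introduced here for illustration and is not part of lxml.

```python
from lxml import etree

# Hypothetical XML document with a default namespace, invented for this sketch
SAMPLE_XML = b"""<catalog xmlns="http://example.com/ns">
  <product id="1"><name>Premium Plan</name></product>
  <product id="2"><name>Basic Plan</name></product>
</catalog>"""

root = etree.fromstring(SAMPLE_XML)

# Without a prefix the query looks for elements in no namespace and finds none
unprefixed = root.xpath('//product')

# Mapping a prefix to the namespace URI makes the same elements selectable
NS = {'ns': 'http://example.com/ns'}
names = root.xpath('//ns:product/ns:name/text()', namespaces=NS)


def safe_xpath(node, expression, **kwargs):
    """Hypothetical helper: run an XPath query, returning [] if it is invalid."""
    try:
        return node.xpath(expression, **kwargs)
    except etree.XPathError as exc:
        # XPathError is the base class of lxml's XPath syntax and evaluation errors
        print(f"Invalid XPath {expression!r}: {exc}")
        return []


# Missing closing bracket in the predicate
malformed = safe_xpath(root, '//ns:product[@id="1"/ns:name', namespaces=NS)

# Prefix used in the expression but never mapped to a URI
unbound = safe_xpath(root, '//ns:product')

print(unprefixed, names, malformed, unbound)
```

Returning an empty list on failure keeps the calling code simple, though in a real scraper it may be preferable to log the error or re-raise it so that a broken expression is not mistaken for a page with no matches.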