Introducing Oxy Parser, an Open-Source Data Parsing Tool
Augustas Pelakauskas
We’ve just launched a new open-source initiative – Oxy Parser. Developed by our Senior Python Developer, Tadas Gedgaudas, the tool aims to bring together like-minded developers for a community-driven project.
Oxy Parser is an ongoing effort that seeks ideas for further development. If you have experience with web scraping and parsing and would like to contribute, join the Oxylabs Discord community and check our GitHub.
Oxy Parser is an open-source tool for automating HTML parsing. It uses Pydantic models for data validation and automates XPath writing to simplify data structuring, significantly reducing parser development time.
Oxy Parser supports most LLMs and caching backends.
Pydantic is a data validation and settings management library for Python. It uses Python type annotations to validate and parse data against defined rules, ensuring the data fits the model's structure.
A Pydantic model is a Python class that inherits from pydantic.BaseModel. It’s used to define the structure, constraints, and validation rules for data. Pydantic models leverage Python's type annotations to automatically validate and parse data according to the specified types.
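For illustration, here's a minimal, hypothetical example (the Product model and its fields are ours for demonstration, not part of Oxy Parser) showing how a Pydantic model validates and coerces data:

from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    title: str
    price: float

# Pydantic coerces compatible input: the string "91.99" becomes a float.
product = Product(title="Super Mario Galaxy", price="91.99")
print(product.price)  # 91.99

# Incompatible input raises a ValidationError instead of slipping through.
try:
    Product(title="Broken item", price="not a number")
except ValidationError as error:
    print(error)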
Tadas Gedgaudas specializes in TypeScript, Python, Golang, and Kubernetes. With his focus on web data acquisition, the Python-based Oxy Parser initiative was a no-brainer.
"Writing parsers manually can take a long time depending on the count of data points. With Oxy Parser, it takes 10 minutes, and you already have a big chunk of your parser done".
Tadas Gedgaudas, Senior Python Developer @ Oxylabs
| Feature | Description |
| --- | --- |
| Open source | Encourages developers to contribute, aiming for continuous improvement and collaboration. |
| XPath writing automation | Automated XPath creation drastically reduces parser writing time. |
| Pydantic models for parsed data validation | Pydantic models validate parsed HTML, ensuring data is accurate and structured. |
| Memory, file, and Redis caching support | Lets you reuse previously generated XPaths, improving efficiency for repeated tasks. |
| Support for most LLMs | Uses the LiteLLM package in the background, which supports most LLMs (including the newest Llama). |
| Simple integration and setup | A simple installation process with documentation and examples. |
| Cost-effectiveness | The only expense is LLM usage. |
The general process is as follows:
1. Provide your LLM API key, LLM information, and Oxylabs Web Scraper API credentials (optional). If you’re not using Oxylabs, you have to scrape the target page yourself and provide the HTML file to Oxy Parser.
2. Define the data points you want to extract using a Pydantic model.
3. Oxy Parser examines the scraped HTML page, finds the best selector expressions, and returns the generated XPath selectors.
4. Oxy Parser saves a cache file, which you can reuse for scraping projects.
Install Oxy Parser in your project’s terminal using pip:
pip install oxyparser
For more details about the supported LLMs and other information, please refer to Oxy Parser’s GitHub repository.
Run the following commands in your Windows Command Prompt to save the required environment variables. Remember to replace the values with your LLM API key and Oxylabs Web Scraper API user credentials (optional). Note that setx only applies to new sessions, so open a new Command Prompt window before using the variables:
setx LLM_API_KEY "your_llm_api_key"
setx LLM_MODEL "gpt-4o-mini"
setx OXYLABS_SCRAPER_USER "your_oxylabs_scraper_user"
setx OXYLABS_SCRAPER_PASSWORD "your_oxylabs_scraper_password"
Print the variables to stdout:
echo %LLM_API_KEY% && echo %LLM_MODEL% && echo %OXYLABS_SCRAPER_USER% && echo %OXYLABS_SCRAPER_PASSWORD%
If you’re working with the Z shell (zsh) on a Unix system (macOS, Linux), you can store the required environment variables:
echo "
export LLM_API_KEY=your_llm_api_key
export LLM_MODEL=gpt-4o-mini
export OXYLABS_SCRAPER_USER=your_oxylabs_scraper_user
export OXYLABS_SCRAPER_PASSWORD=your_oxylabs_scraper_password
" >> ~/.zshrc
source ~/.zshrc
Check the saved variables using this line:
echo -e "$LLM_API_KEY\n$LLM_MODEL\n$OXYLABS_SCRAPER_USER\n$OXYLABS_SCRAPER_PASSWORD"
Let’s test out Oxy Parser by parsing this demo e-commerce product page:
from pydantic import BaseModel
from oxyparser.oxyparser import OxyParser


# Define the data points you want to extract.
class ProductItem(BaseModel):
    title: str
    price: str
    developer: str
    platform: str
    description: str


# Your target URL.
URL: str = "https://sandbox.oxylabs.io/products/1"


# Initiate OxyParser and parse the page.
async def main() -> None:
    parser = OxyParser()
    job_item = await parser.parse(URL, ProductItem)
    print(job_item)


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())
Running Oxy Parser
After completion, Oxy Parser will save the generated XPath selectors to a cache file:
{"title": ["//h2[@class='title css-1k75zwy e1pl6npa11']//text()"], "price": ["//div[@class='price css-o7uf8d e1pl6npa6']//text()"], "developer": ["//span[@class='brand developer']//text()"], "platform": ["//span[@class='game-platform css-13htf5s e1pl6npa7']//text()"], "description": ["//p[@class='description css-mkw8pm e1pl6npa0']//text()"]}
You can now use the selectors for your scraping projects with any parsing tool that supports XPath. For example, here’s how you can use Oxylabs’ Web Scraper API and its Custom Parser feature to parse the second product page:
import os, json, requests

with open("_oxyparser_cache_oxylabs", "r") as f:
    # Use the `json` module to load the selectors file.
    selectors = json.load(f)

payload = {
    "source": "universal",
    "url": "https://sandbox.oxylabs.io/products/2",
    "render": "html",
    "parse": True,
    # Generate parsing instructions for each selector in the file.
    "parsing_instructions": {
        key: {
            "_fns": [{"_fn": "xpath", "_args": selectors[key]}]
        } for key in selectors.keys()
    }
}

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=(
        # Get your Oxylabs Web Scraper API `username` and `password` from the env variables.
        os.environ.get("OXYLABS_SCRAPER_USER"),
        os.environ.get("OXYLABS_SCRAPER_PASSWORD")
    ),
    json=payload
)
print(response.json())

with open("parsed_data.json", "w") as f:
    json.dump(response.json()["results"], f, indent=4)
The code will save a parsed_data.json file in your working directory containing the following parsed product page information:
[
    {
        "content": {
            "price": [
                "91,99 \u20ac"
            ],
            "title": [
                "Super Mario Galaxy"
            ],
            "platform": [
                "wii"
            ],
            "developer": [
                "Developer:",
                " Nintendo"
            ],
            "description": [
                "[Metacritic's 2007 Wii Game of the Year] The ultimate Nintendo hero is taking the ultimate step..."
            ],
            "parse_status_code": 12000
        }
    }
]
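The cached selectors also work outside of Oxylabs’ tooling with any XPath-capable parser. Here’s a minimal sketch using lxml, assuming you’ve already fetched the product page HTML yourself and saved it as product_page.html (a hypothetical file name):

import json

from lxml import html

# Load the XPath selectors that Oxy Parser cached earlier.
with open("_oxyparser_cache_oxylabs", "r") as f:
    selectors = json.load(f)

# Parse the locally saved page (product_page.html is an assumed example file).
with open("product_page.html", "r") as f:
    tree = html.fromstring(f.read())

# Apply every cached XPath expression per field and collect the text nodes.
data = {
    field: [text.strip() for xpath in xpaths for text in tree.xpath(xpath)]
    for field, xpaths in selectors.items()
}
print(data)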
Simplifying and automating parsing addresses a universal, ever-present headache for the whole web intelligence industry. With the groundwork for the core functionality set, Tadas continues his work on Oxy Parser.
Contributions are welcome – fork, tweak, or star us on GitHub and join our Discord community.
About the author
Augustas Pelakauskas
Senior Copywriter
Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures, the most recent being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys the sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.