Introducing Oxy Parser, an Open-Source Data Parsing Tool

Augustas Pelakauskas

2024-10-15 · 2 min read

We’ve just launched a new open-source initiative – Oxy Parser. Developed by our Senior Python Developer, Tadas Gedgaudas, the tool aims to bring like-minded developers together around a community-driven project.

Oxy Parser is an ongoing effort that seeks ideas for further development. If you have experience with web scraping and parsing and would like to contribute, join the Oxylabs Discord community and check our GitHub.

What is Oxy Parser?

Oxy Parser is an open-source tool for automating HTML parsing. It uses Pydantic models for data validation and automates XPath writing to simplify data structuring, significantly reducing the time it takes to write a parser.

Oxy Parser supports most LLMs and several caching backends (memory, file, and Redis).

What is Pydantic?

Pydantic is a data validation and settings management library for Python. It uses Python type annotations to validate and parse (check and fix) data based on set rules, ensuring the data fits the model's structure.

A Pydantic model is a Python class that inherits from pydantic.BaseModel. It’s used to define the structure, constraints, and validation rules for data. Pydantic models leverage Python's type annotations to automatically validate and parse data according to the specified types.
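As a quick illustration, here’s a minimal, hypothetical model in action: compatible input is coerced to the declared types, while incompatible input raises a validation error.

from pydantic import BaseModel, ValidationError

# A hypothetical model with two typed fields.
class Product(BaseModel):
    title: str
    price: float

# Compatible input is coerced to the declared type ("19.99" becomes 19.99).
product = Product(title="Super Mario Galaxy", price="19.99")
print(product.price)

# Incompatible input raises a ValidationError.
try:
    Product(title="Broken", price="not a number")
except ValidationError as error:
    print(error)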

A word from Tadas

Tadas Gedgaudas specializes in TypeScript, Python, Golang, and Kubernetes. Since his work focuses on web data acquisition, the Python-based Oxy Parser initiative was a no-brainer.

"Writing parsers manually can take a long time depending on the count of data points. With Oxy Parser, it takes 10 minutes, and you already have a big chunk of your parser done".

Tadas Gedgaudas, Senior Python Developer @ Oxylabs

The main features of Oxy Parser

Open source: Encourages developers to contribute, aiming for continuous improvement and collaboration.
XPath writing automation: Automated XPath creation drastically reduces parser writing time.
Pydantic models for parsed data validation: Pydantic models validate parsed HTML, ensuring data is accurate and structured.
Memory, file, and Redis caching support: Lets you reuse previously generated XPaths, improving efficiency for repeated tasks.
Support for most LLMs: Uses the LiteLLM package in the background, which supports most LLMs (including the newest Llama).
Simple integration and setup: A simple installation process with documentation and examples.
Cost-effectiveness: The only expense is LLM usage.

How does Oxy Parser work?

The general process is as follows:

  1. Provide your LLM API key, LLM information, and, optionally, your Oxylabs Web Scraper API credentials. If you’re not using Oxylabs, you’ll need to scrape the target page yourself and provide the HTML file to Oxy Parser.

  2. Define the data points you want to extract using a Pydantic structure.

  3. Oxy Parser examines the scraped HTML page, finds the best selector expressions, and returns the generated XPath selectors.

  4. Oxy Parser saves a cache file, which you can reuse for scraping projects.

Process demo

1. Install Oxy Parser

Install Oxy Parser in your project’s terminal using pip:

pip install oxyparser
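
If you prefer to keep dependencies isolated, you can first create and activate a virtual environment (a standard Python workflow, not specific to Oxy Parser):

python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install oxyparser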

2. Store API credentials in the env variables

For more details about the supported LLMs and other information, please refer to Oxy Parser’s GitHub repository.

Windows systems

Run the following commands in your Windows Command Prompt to save the required environment variables. Remember to replace the values with your LLM API key and Oxylabs Web Scraper API user credentials (optional):

setx LLM_API_KEY "your_llm_api_key"
setx LLM_MODEL "gpt-4o-mini"
setx OXYLABS_SCRAPER_USER "your_oxylabs_scraper_user"
setx OXYLABS_SCRAPER_PASSWORD "your_oxylabs_scraper_password"

Note that setx only affects new sessions, so open a fresh Command Prompt, then print the variables to stdout:

echo %LLM_API_KEY% && echo %LLM_MODEL% && echo %OXYLABS_SCRAPER_USER% && echo %OXYLABS_SCRAPER_PASSWORD%

Unix systems

If you’re working with the Z shell (zsh) on a Unix system (macOS, Linux), you can store the required environment variables in your ~/.zshrc file:

echo "
export LLM_API_KEY=your_llm_api_key
export LLM_MODEL=gpt-4o-mini
export OXYLABS_SCRAPER_USER=your_oxylabs_scraper_user
export OXYLABS_SCRAPER_PASSWORD=your_oxylabs_scraper_password
" >> ~/.zshrc

source ~/.zshrc

Check the saved variables using this line:

echo -e "$LLM_API_KEY\n$LLM_MODEL\n$OXYLABS_SCRAPER_USER\n$OXYLABS_SCRAPER_PASSWORD"
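
If you use bash instead of zsh, append the same export lines to ~/.bashrc and run source ~/.bashrc.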

3. Run Oxy Parser

Let’s test out Oxy Parser by parsing this demo e-commerce product page:

import asyncio

from pydantic import BaseModel
from oxyparser.oxyparser import OxyParser


# Define the data points you want to extract.
class ProductItem(BaseModel):
    title: str
    price: str
    developer: str
    platform: str
    description: str

# Your target URL.
URL: str = "https://sandbox.oxylabs.io/products/1"

# Initiate OxyParser and parse the page.
async def main() -> None:
    parser = OxyParser()
    job_item = await parser.parse(URL, ProductItem)
    print(job_item)


if __name__ == "__main__":
    asyncio.run(main())

After completion, Oxy Parser will save the generated XPath selectors to a cache file:

{"title": ["//h2[@class='title css-1k75zwy e1pl6npa11']//text()"], "price": ["//div[@class='price css-o7uf8d e1pl6npa6']//text()"], "developer": ["//span[@class='brand developer']//text()"], "platform": ["//span[@class='game-platform css-13htf5s e1pl6npa7']//text()"], "description": ["//p[@class='description css-mkw8pm e1pl6npa0']//text()"]}

4. Use the generated selectors

You can now use the selectors for your scraping projects with any parsing tool that supports XPath. For example, here’s how you can use Oxylabs’ Web Scraper API and its Custom Parser feature to parse the second product page:

import json
import os

import requests


with open("_oxyparser_cache_oxylabs", "r") as f:
    # Use the `json` module to load the selectors file.
    selectors = json.load(f)

payload = {
    "source": "universal",
    "url": "https://sandbox.oxylabs.io/products/2",
    "render": "html",
    "parse": True,
    # Generate parsing instructions for each selector in the file.
    "parsing_instructions": {
        key: {"_fns": [{"_fn": "xpath", "_args": xpaths}]}
        for key, xpaths in selectors.items()
    }
}

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=(
        # Get your Oxylabs Web Scraper API `username` and `password` from the env variables.
        os.environ.get("OXYLABS_SCRAPER_USER"),
        os.environ.get("OXYLABS_SCRAPER_PASSWORD")
    ),
    json=payload
)

print(response.json())
with open("parsed_data.json", "w") as f:
    json.dump(response.json()["results"], f, indent=4)

The code will save a parsed_data.json file in your working directory containing the following parsed product page information:

[
    {
        "content": {
            "price": [
                "91,99 \u20ac"
            ],
            "title": [
                "Super Mario Galaxy"
            ],
            "platform": [
                "wii"
            ],
            "developer": [
                "Developer:",
                " Nintendo"
            ],
            "description": [
                "[Metacritic's 2007 Wii Game of the Year] The ultimate Nintendo hero is taking the ultimate step..."
            ],
            "parse_status_code": 12000
        }
    }
]
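
If you’d rather skip the API and apply the cached selectors locally, the same XPaths work with any XPath-capable parser. Here’s a minimal sketch using lxml, assuming you’ve already scraped and saved the target page’s HTML as page.html (a hypothetical filename):

import json

from lxml import html  # requires: pip install lxml

# Load the XPath selectors that Oxy Parser cached earlier.
with open("_oxyparser_cache_oxylabs", "r") as f:
    selectors = json.load(f)

# Assumes you saved the target page yourself, e.g., as page.html.
with open("page.html", "r") as f:
    tree = html.fromstring(f.read())

# Apply every cached XPath and collect the stripped text values.
parsed = {
    field: [text.strip() for xpath in xpaths for text in tree.xpath(xpath)]
    for field, xpaths in selectors.items()
}
print(parsed)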

Wrap up

Parsing remains a universal, ever-present headache across the web intelligence industry, and simplifying and automating it is an ongoing effort. With the groundwork for the core functionality in place, Tadas continues his work on Oxy Parser.

Contributions are welcome – fork, tweak, or star us on GitHub and join our Discord community.

About the author

Augustas Pelakauskas

Senior Copywriter

Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys the sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.

