
Web Scraping with Claude AI: Python Guide


Agnė Matusevičiūtė

Last updated on 2025-09-16

8 min read

Scraping data from the web can be a challenging task without additional tools. Even though proxies usually solve the problem of IP blocking, you still get raw HTML code that needs to be processed to extract the data you need. However, with the power of LLMs like Anthropic’s Claude, it’s possible to use AI to make the web scraping process even simpler. In this tutorial, we’ll look at how to use Claude AI to scrape websites like Amazon with the help of Python.

Why use Claude for web scraping?

As mentioned before, parsing the received HTML of a webpage can be a challenging task by itself. It’s common to use tools like Python’s BeautifulSoup to help with this. That works fine until the webpage itself changes and you have to adjust your scrapers to match.

However, as in many other fields, LLMs like Claude are more than capable of analyzing text. If you provide Claude with raw HTML, it can define and update CSS/XPath selectors for your scraping process with a single API call. This makes it possible to create a completely automatic scraping solution using just proxies and an LLM.

Step-by-step guide on how to scrape the web with Python and Claude API

Let’s take a look at how to write such a solution using Claude, Oxylabs' Residential Proxies, and Python.

1. Preparing the environment

First, download the latest version of Python from the official website.

Installing libraries

Once you’re set up, you’ll need to install a few tools we’ll use in this tutorial, such as:

  • anthropic, the client library for accessing Claude.

  • instructor, a wrapper for the Anthropic client that lets us strictly define the structure of the LLM’s output.

  • requests, for performing HTTP requests.

  • bs4, for parsing HTML.

  • pydantic, for defining models of expected LLM responses.

Just run these commands to create a virtual environment, activate it, and get everything installed:

python -m venv env
source env/bin/activate
pip install anthropic instructor requests bs4 pydantic

Set up API access for Oxylabs and Anthropic

Since our LLM will be Claude, provided by Anthropic, you’ll need an Anthropic API key. You can find it in the Anthropic console.

We'll use Oxylabs' Residential Proxies, so you'll need your proxy credentials. You can find them in your Oxylabs dashboard under the Residential Proxies tab.
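In the examples below, the credentials are hard-coded as placeholder strings to keep things simple. If you’d rather keep them out of the source code, a minimal sketch for loading them from environment variables could look like this (the variable names are just an example, use whichever ones you export):

import os

# Example environment variable names; export them in your shell beforehand.
CLAUDE_API_KEY = os.environ["ANTHROPIC_API_KEY"]
OXYLABS_USERNAME = os.environ["OXYLABS_USERNAME"]
OXYLABS_PASSWORD = os.environ["OXYLABS_PASSWORD"]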

2. Scraping a basic website with Claude API

To begin, let’s explore how to scrape a small website using Claude AI and Residential Proxies. We’ll be scraping the products page from the Oxylabs scraping sandbox.

Create a Python file in a folder of your choice. Let’s call it main.py. Open it up and add these lines to import the previously installed libraries:

import requests
import anthropic
import instructor
from bs4 import BeautifulSoup
from pydantic import BaseModel

After importing the libraries, define our previously acquired credentials, together with a Claude model, as global variables. For this tutorial, we’ll be using the claude-3-5-haiku-20241022 model, as it’s cost-efficient and will suffice for our HTML parsing needs. The variables should look like this:

CLAUDE_API_KEY = "API_KEY"
CLAUDE_MODEL = "claude-3-5-haiku-20241022"

OXYLABS_USERNAME = "USERNAME"
OXYLABS_PASSWORD = "PASSWORD"

As mentioned before, we installed the instructor library, which makes it possible to use structured outputs together with the Anthropic client. By providing the LLM with a schema of what the response should look like, we can reliably get the same response format every time.

For the schema itself, we’ll be using pydantic models. Here’s an example model we can use for retrieving info about products in an e-commerce website:

class Product(BaseModel):
    title: str
    rating: str
    price: str

class ProductInfo(BaseModel):
    products: list[Product]

We also defined an additional ProductInfo wrapper model that we can pass to the instructor client, so it’s clear that we want to retrieve a list of objects. Here’s what the script should look like so far:

import requests
import anthropic
import instructor
from bs4 import BeautifulSoup
from pydantic import BaseModel

CLAUDE_API_KEY = "API_KEY"
CLAUDE_MODEL = "claude-3-5-haiku-20241022"

OXYLABS_USERNAME = "USERNAME"
OXYLABS_PASSWORD = "PASSWORD"

class Product(BaseModel):
    title: str
    rating: str
    price: str

class ProductInfo(BaseModel):
    products: list[Product]

With the script foundation in place, let’s start scraping a website using Oxylabs' Residential Proxies.

Scraping with Residential Proxies

Let’s implement a function called get_html_content that accepts a URL as a parameter and returns the scraped HTML from a given website. This makes our code more structured and allows us to reuse the function later on. Here’s what it can look like:

def get_html_content(url: str) -> str:
    ...

We can start implementing the function by defining our proxies. For this example, we’ll be using a rotating session residential proxy with a random entry location. The proxy variables should look like this:

proxy_url = f"https://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"

proxies = {
    "http": proxy_url,
    "https": proxy_url
}

To make our request look more like it’s coming from a browser, we should include additional headers, like User-Agent and Accept-Language. Including these headers usually decreases your chances of the website blocking the request. The headers can look like this:

headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    }

With that defined, we can build our HTTP request. A single GET request with requests.get should be enough to scrape data from the provided URL. We’ll also add a response.raise_for_status() line, which raises an exception if the request fails. This will stop the code from executing if the request is not successful, making it easier to see what went wrong immediately. Here’s what the function should look like:

def get_html_content(url: str) -> str:

    proxy_url = f"https://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"

    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    }

    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()

    return response.text

Parsing HTML with Claude API

With the HTML scraping function ready, let’s move on to parsing the HTML into readable data using Claude. When dealing with a smaller website, it’s usually enough to simply ask the LLM to parse out the required data. With the help of instructor, Claude will be aware of the required data format, so a brief prompt with general instructions is enough.

Let’s start by defining another function, called parse_products_with_claude, which accepts an html string as an argument. It should return the previously defined ProductInfo object.

def parse_products_with_claude(html: str) -> ProductInfo:
    ...

Inside the function, we should define the instructor-based Claude client and the prompt for the LLM. Make sure to use the previously defined Claude API key. It should look like this:

def parse_products_with_claude(html: str) -> ProductInfo:
    client = instructor.from_anthropic(anthropic.Anthropic(api_key=CLAUDE_API_KEY))

    prompt = f"Extract the product info from all products in the following HTML: {html}"

Once that’s set up, we can send an API request to Claude AI with a single method call. We can then use the previously defined ProductInfo model and pass it into the request as a response model. Here’s what the full function should look like:

def parse_products_with_claude(html: str) -> ProductInfo:
    client = instructor.from_anthropic(anthropic.Anthropic(api_key=CLAUDE_API_KEY))

    prompt = f"Extract the product info from all products in the following HTML: {html}"
    
    product_info = client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": [{"type": "text", "text": prompt}]}],
        response_model=ProductInfo,
    )

    return product_info

Running the scraper

Once the functions are defined, we can try scraping the Oxylabs sandbox and see what the results look like. At the bottom of your file, add the following lines to invoke your defined functions. Let’s also print out the results that Claude returns:

products_url = "https://sandbox.oxylabs.io/products"
content = get_html_content(products_url)
parsed_products = parse_products_with_claude(content)
print(parsed_products)

If you run the code, you should see something like this in your terminal:

products=[Product(title='The Legend of Zelda: Ocarina of Time', rating='91,99 €', price='91,99 €'), Product(title='Super Mario Galaxy', rating='91,99 €', price='91,99 €'), Product(title='Super Mario Galaxy 2', rating='91,99 €', price='91,99 €'), ...]

We have Pydantic objects that are fully parsed and ready to use inside the code. This demonstrates how Claude AI can serve as a powerful tool for parsing scraped data from the web. It was enough for us to provide a response format and a simple prompt to get structured, ready-to-use data. Here’s the complete code so far:

import requests
import anthropic
import instructor
from bs4 import BeautifulSoup
from pydantic import BaseModel

CLAUDE_API_KEY = "API_KEY"
CLAUDE_MODEL = "claude-3-5-haiku-20241022"

OXYLABS_USERNAME = "USERNAME"
OXYLABS_PASSWORD = "PASSWORD"

class Product(BaseModel):
    title: str
    rating: str
    price: str

class ProductInfo(BaseModel):
    products: list[Product]

def get_html_content(url: str) -> str:
    proxy_url = f"https://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"

    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    }

    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()

    return response.text

def parse_products_with_claude(html: str) -> ProductInfo:
    client = instructor.from_anthropic(anthropic.Anthropic(api_key=CLAUDE_API_KEY))

    prompt = f"Extract the product info from all products in the following HTML: {html}"
    
    product_info = client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": [{"type": "text", "text": prompt}]}],
        response_model=ProductInfo,
    )

    return product_info

products_url = "https://sandbox.oxylabs.io/products"
content = get_html_content(products_url)
parsed_products = parse_products_with_claude(content)

print(parsed_products)
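
Since the result is a regular Pydantic model, it can be used like any other Python object once the script above finishes. As a quick illustration, you can iterate over the products or serialize everything back to JSON:

for product in parsed_products.products:
    print(f"{product.title} | {product.price} | {product.rating}")

# Serialize the whole result to a JSON string for storage or further processing.
print(parsed_products.model_dump_json(indent=2))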

3. Scraping Amazon efficiently with Claude API

We've seen how easy it is to scrape a site using Residential Proxies and Claude. Now, let's try scraping product data from a larger site like Amazon, which may pose extra challenges due to complex HTML, ads, images, and JavaScript content.

The main challenge is managing the LLM’s context window. The Claude model we're using, claude-3-5-haiku-20241022, supports up to 200k tokens – plenty for simple pages like the Oxylabs sandbox. However, Amazon’s HTML is likely to exceed this token limit. To handle this, we can use the following strategies to reduce token count in the HTML while still retaining the essential data:

  • Clean up as much unnecessary HTML as possible: There’s no value in style, script, or meta tags for retrieving product data.

  • Select a portion of the HTML manually and only pass that to Claude: For Amazon, we can select a single product card, instead of the whole page, by using a single CSS class.

  • Ask Claude to return CSS selectors and reuse them: By parsing out the selectors with Claude only once, we can save costs and increase the script's performance. 

By applying these strategies, we should have an efficient and relatively low-maintenance tool for scraping products from Amazon. We can begin by extracting the CSS selectors with Claude.
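Before sending anything to Claude, it can also help to sanity-check how large the cleaned-up HTML actually is. A common rule of thumb is that one token corresponds to roughly four characters of English text; the small helper below uses that heuristic, so treat the numbers as rough estimates rather than exact token counts:

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text and HTML.
    return len(text) // 4


def fits_in_context(text: str, context_limit: int = 200_000, reserve: int = 4096) -> bool:
    # Leave headroom for the prompt itself and the model's response.
    return estimate_tokens(text) + reserve < context_limit

If the estimate comes close to the limit, trim the HTML further or pass in a smaller portion of the page.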

Extracting Amazon product CSS selectors with Claude

Let's start by defining a new Pydantic model to tell Claude what response we expect. Since we'll be scraping CSS selectors, we'll name the model accordingly. For this example, we'll target the title, price, and rating of an Amazon item. Here’s what it can look like:

class ProductSelector(BaseModel):
    title_selector: str
    price_selector: str
    rating_selector: str

With the model defined, implement another function, called parse_amazon_selectors_with_claude. It should also accept an html string as an argument. The rest of the function can look almost the same as the previous one, except for the response model being ProductSelector, and the prompt being slightly adjusted:

def parse_amazon_selectors_with_claude(html: str) -> ProductSelector:
    client = instructor.from_anthropic(anthropic.Anthropic(api_key=CLAUDE_API_KEY))

    prompt = f"Extract CSS selectors for retrieving a product in the following Amazon HTML: {html}"

    product_info = client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": [{"type": "text", "text": prompt}]}],
        response_model=ProductSelector,
    )

    return product_info

We can use this function later on, once we have the other parts of the script defined.

Cleaning up the HTML

Next, let’s make the HTML as concise as possible to avoid exceeding the LLM's context window limits. We can achieve this by removing unnecessary tags, comments, attributes, elements, ads, and promotional content. We should also use a generic selector to get a single product element from the HTML, as the CSS selectors stay the same across multiple elements anyway. We can use the previously installed BeautifulSoup library for this; since the function also strips HTML comments, update the import at the top of the script to from bs4 import BeautifulSoup, Comment. Here’s the complete function:

def clean_html_for_llm(html_content: str) -> str:
    soup = BeautifulSoup(html_content, "html.parser")
    results_list = soup.find_all("div", {"class": "s-card-container"})

    result: BeautifulSoup = results_list[0]

    for tag in result.find_all(
        ["script", "style", "link", "meta", "noscript", "iframe"]
    ):
        tag.decompose()

    # Remove comments
    for comment in result.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Remove unnecessary attributes but keep important ones
    important_attrs = {"class", "href"}
    for tag in result.find_all():
        if tag.attrs:
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in important_attrs}

    # Remove navigation and footer elements
    for selector in ["nav", "footer", "header", ".nav", ".footer", ".header"]:
        for tag in result.select(selector):
            tag.decompose()

    # Remove ads and promotional content
    for selector in [
        ".ad",
        ".advertisement",
        ".promo",
        ".sponsored",
        "[data-ad]",
        "[data-ads]",
    ]:
        for tag in result.select(selector):
            tag.decompose()

    return str(result)

Parsing the HTML with Claude’s provided selectors

Now that almost everything is in place, all that’s left is to use the CSS selectors returned by Claude to extract the necessary values from the HTML. We can do this with BeautifulSoup by passing each selector to select_one on the product elements. Then, we can pack the extracted values into our previously defined Product models.

Let’s write another function for this logic called parse_products_with_selectors. It should accept an html argument, together with a selector object we retrieved from Claude. Here’s how it should look:

def parse_products_with_selectors(
    html: str, selector: ProductSelector
) -> list[Product]:

    soup = BeautifulSoup(html, "html.parser")
    products = soup.find_all("div", {"class": "s-card-container"})
    parsed_products = []

    for product in products:
        title_element = product.select_one(selector.title_selector)
        title = title_element.text if title_element else ""

        price_element = product.select_one(selector.price_selector)
        price = price_element.text if price_element else ""

        rating_element = product.select_one(selector.rating_selector)
        rating = rating_element.text if rating_element else ""

        parsed_product = Product(title=title, price=price, rating=rating)
        parsed_products.append(parsed_product)

    return parsed_products

Combining everything into a working script

Since we have the full flow of functions defined, we can tie everything together into a working tool for scraping Amazon products. Let’s start by picking an Amazon URL we’d like to scrape. For this example, we’ll be using a search page for over-ear headphones, shown in the snippet below.

At the bottom of the script, we can paste the URL into a separate variable and pass it to the previously defined scraper function. It can look like this:

amazon_url = "https://www.amazon.com/b?node=12097479011"
amazon_content = get_html_content(amazon_url)

Once the HTML from the Amazon page is retrieved, we can clean it up and retrieve the selectors using Claude. Let’s add these two lines: 

clean_html = clean_html_for_llm(amazon_content)
amazon_selectors = parse_amazon_selectors_with_claude(clean_html)

And finally, once the selectors are retrieved, we can use them to parse out the final list of results. Let’s also store them in a separate JSON file for later use. You should also import the json module at the top of the script:

import json

products = parse_products_with_selectors(amazon_content, amazon_selectors)
product_dicts = [product.model_dump() for product in products]

with open("products.json", "w") as f:
    json.dump(product_dicts, f, indent=2)

Running the script should create a products.json file in your current directory. If you open it, you should see something like this:

[
  {
    "title": "Soundcore Anker Life Q20 Hybrid Active Noise Cancelling Headphones, Wireless Over Ear Bluetooth Headphones, 60H Playtime, Hi-Res Audio, Deep Bass, Foam Ear Cups, Travel, Office, USB-C Charging",
    "rating": "4.5",
    "price": "29."
  },
  {
    "title": "JBL Tune 720BT - Wireless Over-Ear Headphones with JBL Pure Bass Sound, Bluetooth 5.3, Up to 76H Battery Life and Speed Charge, Lightweight, Comfortable and Foldable Design (Black)",
    "rating": "4.5",
    "price": "69."
  }
]

You can also save the extracted selectors separately and reuse them on later runs to avoid additional API calls to Claude, as shown in the sketch below.
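
Here’s a minimal caching sketch; the file name and helper function are just an illustration, assuming a selectors.json file next to the script:

import os

SELECTOR_CACHE = "selectors.json"

def get_selectors(clean_html: str) -> ProductSelector:
    # Reuse previously extracted selectors if they're already cached on disk.
    if os.path.exists(SELECTOR_CACHE):
        with open(SELECTOR_CACHE) as f:
            return ProductSelector(**json.load(f))

    # Otherwise, ask Claude once and cache the result for future runs.
    selectors = parse_amazon_selectors_with_claude(clean_html)
    with open(SELECTOR_CACHE, "w") as f:
        json.dump(selectors.model_dump(), f, indent=2)
    return selectors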

4. The complete code

Here’s the complete code for scraping Amazon with Oxylabs' Residential Proxies and Anthropic’s Claude AI:

import json
from pydantic import BaseModel
import requests
import anthropic
import instructor
from bs4 import BeautifulSoup, Comment

CLAUDE_API_KEY = "API_KEY"
CLAUDE_MODEL = "claude-3-5-haiku-20241022"

OXYLABS_USERNAME = "USERNAME"
OXYLABS_PASSWORD = "PASSWORD"


class Product(BaseModel):
    title: str
    rating: str
    price: str


class ProductInfo(BaseModel):
    products: list[Product]


class ProductSelector(BaseModel):
    title_selector: str
    price_selector: str
    rating_selector: str


def parse_products_with_claude(html: str) -> ProductInfo:

    client = instructor.from_anthropic(anthropic.Anthropic(api_key=CLAUDE_API_KEY))

    prompt = f"Extract the product info from all products in the following HTML: {html}"

    product_info = client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": [{"type": "text", "text": prompt}]}],
        response_model=ProductInfo,
    )

    return product_info


def parse_amazon_selectors_with_claude(html: str) -> ProductSelector:
    client = instructor.from_anthropic(anthropic.Anthropic(api_key=CLAUDE_API_KEY))

    prompt = f"Extract CSS selectors for retrieving a product in the following Amazon HTML: {html}"

    product_info = client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": [{"type": "text", "text": prompt}]}],
        response_model=ProductSelector,
    )

    return product_info


def get_html_content(url: str) -> str:

    proxy_url = (
        f"https://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"
    )

    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    }

    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()

    return response.text


def clean_html_for_llm(html_content: str) -> str:
    soup = BeautifulSoup(html_content, "html.parser")
    results_list = soup.find_all("div", {"class": "s-card-container"})

    result: BeautifulSoup = results_list[0]

    for tag in result.find_all(
        ["script", "style", "link", "meta", "noscript", "iframe"]
    ):
        tag.decompose()

    # Remove comments
    for comment in result.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Remove unnecessary attributes but keep important ones
    important_attrs = {"class", "href"}
    for tag in result.find_all():
        if tag.attrs:
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in important_attrs}

    # Remove navigation and footer elements
    for selector in ["nav", "footer", "header", ".nav", ".footer", ".header"]:
        for tag in result.select(selector):
            tag.decompose()

    # Remove ads and promotional content
    for selector in [
        ".ad",
        ".advertisement",
        ".promo",
        ".sponsored",
        "[data-ad]",
        "[data-ads]",
    ]:
        for tag in result.select(selector):
            tag.decompose()

    return str(result)


def parse_products_with_selectors(
    html: str, selector: ProductSelector
) -> list[Product]:

    soup = BeautifulSoup(html, "html.parser")
    products = soup.find_all("div", {"class": "s-card-container"})
    parsed_products = []

    for product in products:
        title_element = product.select_one(selector.title_selector)
        title = title_element.text if title_element else ""

        price_element = product.select_one(selector.price_selector)
        price = price_element.text if price_element else ""

        rating_element = product.select_one(selector.rating_selector)
        rating = rating_element.text if rating_element else ""

        parsed_product = Product(title=title, price=price, rating=rating)
        parsed_products.append(parsed_product)

    return parsed_products


amazon_url = "https://www.amazon.com/b?node=12097479011"

amazon_content = get_html_content(amazon_url)
clean_html = clean_html_for_llm(amazon_content)
amazon_selectors = parse_amazon_selectors_with_claude(clean_html)
products = parse_products_with_selectors(amazon_content, amazon_selectors)
product_dicts = [product.model_dump() for product in products]

with open("products.json", "w") as f:
    json.dump(product_dicts, f, indent=2)

Other use cases for scraping with Claude

Just like with Amazon, Claude can be used in the same way to scrape any other major website, with the help of Oxylabs proxies or dedicated scraper APIs, such as Web Scraper API, which you can use with Claude as an alternative to the Residential Proxies used in this tutorial.
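
For instance, swapping the proxy-based get_html_content for a Web Scraper API call only changes how the HTML is fetched; the Claude parsing logic stays the same. The sketch below is based on Oxylabs’ documented realtime endpoint and payload, so treat the exact field names as assumptions and check the current API documentation (and note that API credentials are separate from proxy credentials) before relying on it:

def get_html_via_scraper_api(url: str) -> str:
    # Assumed endpoint and payload based on Oxylabs' Web Scraper API docs;
    # verify against the current documentation before use.
    payload = {"source": "universal", "url": url}
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(OXYLABS_USERNAME, OXYLABS_PASSWORD),
        json=payload,
    )
    response.raise_for_status()
    return response.json()["results"][0]["content"]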

Conclusion

In this tutorial, we explored how to leverage the capabilities of an LLM like Claude to aid in web scraping and automate the process. We used Oxylabs' Residential Proxies to get the HTML of a page, while Claude assisted with parsing out the data we need. We also covered some cost optimization strategies for scraping larger websites, which make the process quicker and more cost-effective.

If you want to dive deeper into related topics, check out our guide on how to use ChatGPT for web scraping, or explore the fundamentals of LLMs (Large Language Models) and Retrieval-Augmented Generation (RAG). For more technical insights, take a look at the 8 main public data sources used in LLM training.


About the author


Agnė Matusevičiūtė

Technical Copywriter

With a background in philology and pedagogy, Agnė focuses on using language and teaching others by making complicated tech simple.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.

Frequently asked questions

Can Claude do web scraping?

Claude itself cannot perform web scraping directly, as it lacks built-in browsing or scraping capabilities. However, it can be paired with external tools like APIs, proxies, or custom scripts to guide and assist in the scraping process – especially for structuring, analyzing, or interpreting the scraped data. This approach is commonly referred to as Claude web scraping, where the model supports the automation pipeline by handling parsing logic, content extraction patterns, or even writing Python scraping code on demand.
