
How to Build an AI Scraper With Crawl4AI and DeepSeek


Yelyzaveta Hayrapetyan

Last updated on

2025-11-03

9 min read

Recent advancements in generative AI, especially continuous improvements in large language models (LLMs), have brought us closer to artificial general intelligence (AGI) than ever before. Instead of depending on rigid scripts built from complex regular expressions, modern scrapers use LLMs to understand a webpage, reading it much like a human would.

In addition, AI can extract insights from raw HTML beyond what basic web scrapers can. For example, an AI-based scraper can generate summaries from raw text, adapt to dynamic web structures, and perform semantic analysis, such as sentiment analysis, classification, and more. 

This guide explains how Crawl4AI and DeepSeek can work together. Crawl4AI handles browser rendering, while DeepSeek interprets the scraped content, converting raw text into structured data ready for analysis, thereby making web scraping more powerful and adaptable.

What is Crawl4AI?

Crawl4AI is an open-source, next-gen AI-ready web scraping framework designed to extract and prepare web data for LLMs. Unlike older libraries such as Scrapy, which deal with static HTML parsing, Crawl4AI uses a Playwright-powered headless browser to render pages as a real user would. This tremendously helps when handling JavaScript-heavy, dynamic, and interactive sites.

Its asynchronous engine supports high-throughput parallel crawling, enabling concurrent scraping of multiple pages. The framework automatically cleans and structures content into Markdown, making it directly compatible with LLMs such as DeepSeek and GPT-based extractors.

Compared to Selenium, Crawl4AI offers lower resource usage, faster execution, and AI-centric output formatting. Selenium was designed primarily for web automation and testing, not large-scale data extraction. Therefore, Selenium is less efficient for AI workflows that require structured, LLM-readable outputs. Crawl4AI, by contrast, is built specifically for the AI web scraping era, integrating seamlessly with proxy and web scraper APIs.

Why combine Crawl4AI with DeepSeek?

DeepSeek serves as the intelligence layer on top of traditional raw data scrapers, transforming scraped data into meaningful information. DeepSeek uses large language models (LLMs) to interpret text much like a human would.

When Crawl4AI collects raw content, DeepSeek analyzes it, identifies patterns, and extracts structured information, such as product details, reviews, and prices, using a straightforward human-language prompt. It remains effective even if the website layout or structure changes.

Traditional scrapers rely on handcrafted parsing rules, such as regular expressions and CSS selectors tied to specific data fields. Despite expert effort, these rules often break when websites update their structure.

AI-powered extraction, however, adapts to content changes. DeepSeek understands context from the user prompt and recognizes data fields such as prices or ratings, even when they appear in different formats or under different CSS selectors.

Combining Crawl4AI and DeepSeek creates a complete pipeline: Crawl4AI collects clean data from modern websites, and DeepSeek converts it into structured, usable information. This approach is faster, more adaptable, and more resilient than traditional scraping methods.

The following illustration depicts the overall workflow of a typical AI scraper:

Overall workflow of a typical AI scraper

Let’s understand that flow with an example. Consider that we need to extract product listings from an e-commerce site with dynamic JavaScript content and inconsistent layouts.

Here’s how the workflow might look:

  1. Crawl4AI scans and collects page data, including structured and unstructured HTML, text, and metadata.

  2. The raw data is passed to DeepSeek, which uses an LLM to interpret patterns, extract entities (like product name, price, and description), and normalize them into a consistent format.

  3. DeepSeek then outputs clean, structured JSON – ready for analysis or integration into dashboards.

Prerequisites & setup

Before we dive into the code, follow these steps to ensure your system is correctly configured to build and run the AI scraper with Crawl4AI and DeepSeek smoothly.

  1. Python installation

You’ll need Python 3.10 or higher installed on your system. You can verify your version by running:

python --version

If you don’t have it yet, download it from python.org.

  2. Virtual environment

It’s recommended to create a virtual environment to isolate your project's dependencies and prevent conflicts with other Python packages.

Open the command terminal and run the following commands to create a new virtual environment named `AI_scraper_env`:

python -m venv AI_scraper_env
AI_scraper_env\Scripts\activate   # For Windows
source AI_scraper_env/bin/activate  # For macOS/Linux
  3. Required libraries

Install all dependencies for both Crawl4AI and DeepSeek. The `requests` library helps us establish communication between Crawl4AI and the DeepSeek API endpoint.

pip install crawl4ai playwright requests
playwright install
  4. API keys

You’ll need two keys:

DeepSeek API key: Get it by signing up at deepseek.com and creating a new key under your account dashboard.

Getting a DeepSeek API key

Creating a new API key

Crawl4AI Key (optional): Crawl4AI is open-source and doesn’t require an API key for local use, but you can integrate external APIs if needed.

Store your API key securely in an `.env` file:

DEEPSEEK_KEY=your_api_key_here
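Note that the variable name must match what the script reads later (`DEEPSEEK_KEY`), and that `os.getenv()` doesn't read `.env` files on its own. You can either export the variable in your shell or load the file with a helper such as python-dotenv; the latter isn't part of this guide's dependency list, so treat the following as an optional sketch:

from dotenv import load_dotenv  # pip install python-dotenv (optional extra)
import os

load_dotenv()  # loads DEEPSEEK_KEY from the .env file into the environment
DEEPSEEK_KEY = os.getenv("DEEPSEEK_KEY")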

Since DeepSeek has open-sourced several of its large language models (LLMs), including DeepSeek-Coder and DeepSeek-V2 (see DeepSeek GitHub repository), developers can self-host these models locally on their infrastructure without relying on the paid API.

Project folder setup: your project folder should look something like the following:

Project folder setup

Choose a target for web scraping: The target website for our AI web scraper demo is Oxylabs’ sandbox. This dummy website lists a range of products and is designed specifically for testing web scrapers.

The following snippet shows the target website's video game listings:

Target website's video game listings

That’s all we need! Let’s start coding the DeepSeek-powered Crawl4AI web scraper step by step.

Step-by-step guide: building your AI scraper 

Once you have installed all the required libraries and the project setup is ready, you can start the actual coding of your AI-powered web scraper using Crawl4AI and DeepSeek.

Step 1: import the required libraries

Import all the required libraries, including the Crawl4AI components.

import asyncio
import os
import json
from pprint import pprint
import requests
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    PruningContentFilter,
    CrawlResult
)

The `asyncio` library is used to run the crawler asynchronously.

Step 2: initialize the DeepSeek key

DEEPSEEK_KEY = os.getenv("DEEPSEEK_KEY")
if not DEEPSEEK_KEY:
    raise ValueError("DeepSeek API key not found in environment variable")
DEEPSEEK_ENDPOINT = "https://api.deepseek.com/chat/completions"

The `DEEPSEEK_ENDPOINT` constant points to DeepSeek's chat completions API, which the extraction function calls later.

Step 3: set browser configurations

In an async `main()` function, initialize the browser configuration object. Crawl4AI will use this object to launch the headless browser session.

browser_config = BrowserConfig(headless=True, verbose=True)
  • `headless=True`: The browser runs invisibly in the background, avoiding the processing overhead and time spent on visual rendering.

  • `verbose=True`: Prints detailed logs that help with debugging.

Web application firewalls often block headless traffic. Setting the `headless` property to `False` can help avoid detection, but it incurs significant scalability issues, slows down the overall scraping process, and can still end with your IP being blacklisted. Fortunately, you can use Oxylabs' Web Scraper API to scrape reliably at scale without getting blocked.

Step 4: launch the crawler

Next, in the main function, initialize and launch the crawler using the `browser_config` in an asynchronous context, and set the necessary runtime attributes:

async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter()
            ),
        )

        url = "https://sandbox.oxylabs.io/products"
        result = await crawler.arun(url=url, config=crawler_config)
  • `markdown_generator`: Tells the crawler to return the scraped content in Markdown. This makes the output cleaner and easier to read.

  • `PruningContentFilter()`: Automatically removes unnecessary elements like sidebars, navigation menus, and scripts to keep only the core content.

  • `crawler.arun()`: Executes the crawling process for the specified URL and applies the configuration. The resulting content is stored in the result variable.

  • `result`: Holds the page content in raw Markdown format, ready for further processing, in our case, sending it to DeepSeek for structured extraction.

This setup ensures the crawler captures a clean snapshot of the webpage’s main content, free of ads, navigation bars, and other clutter. Now, this content is ready for feeding to the DeepSeek API.
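Before feeding this content to DeepSeek, it's also worth confirming that the crawl actually succeeded. A minimal sanity check, assuming `CrawlResult` exposes `success` and `error_message` attributes (as documented in recent Crawl4AI versions):

        # Optional sanity check before extraction
        if not result.success:
            raise RuntimeError(f"Crawl failed for {url}: {result.error_message}")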

Step 5: extract data with DeepSeek

This unstructured data will now be sent to DeepSeek to extract the required fields. DeepSeek will analyze the data and return it in a structured JSON format. 

To send data to DeepSeek for extraction, we need to extract the markdown text from our scraped content. This markdown text will be sent to a function `extract_with_deepseek(markdown_text)`.

        if hasattr(result.markdown_v2, "raw_markdown"):
            markdown_text = result.markdown_v2.raw_markdown
        else:
            markdown_text = str(result.markdown_v2)
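One caveat: `markdown_v2` has been deprecated in newer Crawl4AI releases in favor of `result.markdown`, which also exposes a `raw_markdown` field. If you're on a recent version, a version-tolerant variant of the snippet above might look like this (a sketch based on that assumption, not part of the original code):

        # Works with both the older markdown_v2 and the newer markdown attribute
        md = getattr(result, "markdown_v2", None) or result.markdown
        markdown_text = getattr(md, "raw_markdown", None) or str(md)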

Defining the data extraction schema

Before sending any request to DeepSeek, you need to define the type of data you want to extract from the raw content and the format. This structural definition is known as a data extraction schema.

A schema acts like a blueprint for your output. It helps the LLM understand what fields you need to extract, so you always get consistent results in the given format and structure.

Example:
If we’re scraping game information, our schema could include fields such as title, genre, platform, and rating.

EXTRACTION_SCHEMA = {
    "games": [
        {
            "id": "number",
            "title": "string",
            "url": "string|null",
            "genres": ["string"],
            "description": "string|null",
            "price": "string|null",
            "action": "string|null"
        }
    ]
}

In this schema, we define all the required fields along with their datatypes to get the most accurate results. This ensures DeepSeek knows exactly what structure to follow, even when web content is messy or inconsistent.
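For instance, a response that follows this schema would look something like the example below. The values are purely illustrative placeholders, not actual scraped results:

{
  "games": [
    {
      "id": 1,
      "title": "Example Game Title",
      "url": "https://sandbox.oxylabs.io/products/1",
      "genres": ["Action", "Adventure"],
      "description": "A short blurb taken from the product listing.",
      "price": "59.99",
      "action": "Add to cart"
    }
  ]
}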

Give the schema to DeepSeek in the prompt

Once your schema is ready, you’ll pass it to DeepSeek using a structured prompt.
This prompt instructs the LLM to extract data according to the schema, with no extra text and no noise.

def extract_with_deepseek(markdown_text):
    system_instruction = (
        "You are a structured data extractor. "
        "Extract only game data from the given markdown text. "
        "Return a clean JSON object that matches the provided schema exactly. "
        "Do not include any explanations or markdown formatting. "
        "Use null for missing values."
    )


    schema_msg = {
        "role": "system",
        "content": "Schema:\n" + json.dumps(EXTRACTION_SCHEMA, indent=2)
    }


    user_msg = {"role": "user", "content": markdown_text}

In this code, the `system_instruction` tells DeepSeek how to behave. In LLM terminology, this is called a system prompt. It defines the model's overall “role” or “personality.”

  • It restricts DeepSeek’s response to return only structured data (no extra text or reasoning).

  • The instruction “Use null for missing values” ensures consistency in case some fields are unavailable.

  • It also emphasizes clean JSON, which is important because even small deviations (like extra commentary or markdown symbols) can break parsing later.

It's important to keep system prompts well-defined because poorly defined prompts can lead to inconsistent data structures, irrelevant data, or parsing errors in your pipeline.

The `schema_msg` attaches the defined schema, and the `user_msg` attaches the actual content from which data is to be extracted.

Prepare the payload

Once the schema and prompt are set, prepare the payload structure.

    payload = {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system", "content": system_instruction},
            schema_msg,
            user_msg
        ],
        "temperature": 0.0,
        "max_tokens": 2000
    }
  • `model`: Specifies which DeepSeek model to use (`deepseek-chat`).

  • `messages`: Follows a chat-like format:

  • The `schema_msg` message attaches the data extraction schema defined earlier.

  • The `user_msg` message contains the actual Markdown text to process.

  • `temperature`: Controls creativity/variability. Lower values (0.0) make the model more deterministic.

  • `max_tokens`: Caps the length of the model's response.

Set the HTTP headers

    headers = {
        "Authorization": f"Bearer {DEEPSEEK_KEY}",
        "Content-Type": "application/json"
    }
  • `Authorization`: Sends your API key for authentication.

  • `Content-Type`: Indicates that we're sending JSON data.

Send the request to DeepSeek

        response = requests.post(DEEPSEEK_ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()
        data = response.json()
  • `requests.post()` sends an HTTP POST request to the DeepSeek endpoint.

  • `raise_for_status()` raises an exception if the request fails.

  • `response.json()` converts the response body into a Python dictionary.

Extract the content from DeepSeek’s response

content = data["choices"][0]["message"]["content"]
  • DeepSeek returns a choices array.

  • We take the first choice and grab the message content, which is expected to be a JSON string representing the extracted data.

Since the model often wraps its output in a Markdown code fence (backticks plus a `json` label), you may prefer to clean the content before returning it.

        cleaned = content.strip()
        if cleaned.startswith("```"):
            cleaned = cleaned.strip("`")  # remove all backticks
            cleaned = cleaned.replace("json", "", 1).strip()  # remove 'json' if present

        # Try to parse cleaned content as JSON
        try:
            parsed = json.loads(cleaned)
            print("\n Parsed JSON successfully!\n")
            return parsed
        except json.JSONDecodeError as je:
            print(f"[DeepSeek ERROR] Failed to decode JSON: {je}")
            return None

So the complete DeepSeek extraction function can be written as:

EXTRACTION_SCHEMA = {
    "games": [
        {
            "id": "number",
            "title": "string",
            "url": "string|null",
            "genres": ["string"],
            "description": "string|null",
            "price": "string|null",
            "action": "string|null"
        }
    ]
}
# ----------------------------
# Function to send markdown to DeepSeek
# ----------------------------
def extract_with_deepseek(markdown_text):
    system_instruction = (
        "You are a structured data extractor. "
        "Extract only game data from the given markdown text. "
        "Return a clean JSON object that matches the provided schema exactly. "
        "Do not include any explanations or markdown formatting. "
        "Use null for missing values."
    )


    schema_msg = {
        "role": "system",
        "content": "Schema:\n" + json.dumps(EXTRACTION_SCHEMA, indent=2)
    }


    user_msg = {"role": "user", "content": markdown_text}


    payload = {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system", "content": system_instruction},
            schema_msg,
            user_msg
        ],
        "temperature": 0.0,
        "max_tokens": 2000
    }
    headers = {
        "Authorization": f"Bearer {DEEPSEEK_KEY}",
        "Content-Type": "application/json"
    }
    try:
        response = requests.post(DEEPSEEK_ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()
        data = response.json()
        content = data["choices"][0]["message"]["content"]
         # Clean JSON code block markers if present
        cleaned = content.strip()
        if cleaned.startswith("```"):
            cleaned = cleaned.strip("`")       # remove all backticks
            cleaned = cleaned.replace("json", "", 1).strip()  # remove 'json' if present


        # Try to parse cleaned content as JSON
        try:
            parsed = json.loads(cleaned)
            print("\n✅ Parsed JSON successfully!\n")
            return parsed
        except json.JSONDecodeError as je:
            print(f"[DeepSeek ERROR] Failed to decode JSON: {je}")
            return None
    except Exception as e:
        print(f"[DeepSeek ERROR] {e}")
        return None

This function sends Markdown content to DeepSeek with a simple, plain-English prompt instructing the model to extract structured JSON. For straightforward extraction tasks, we use `deepseek-chat`, a fast and cost-efficient model designed for reliably converting text to JSON. The function limits randomness with `temperature=0.0` for more deterministic results and includes basic error handling to catch network or parsing issues.

Step 6: run the script

Finally, run the script to see the output.

if __name__ == "__main__":
    asyncio.run(main())
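Assuming you saved the script as `ai_scraper.py` (the filename here is just an example), run it from your activated virtual environment:

python ai_scraper.py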

Let’s look at the complete code so far:

import asyncio
import os
import json
from pprint import pprint
import requests
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    PruningContentFilter,
    CrawlResult
)


# ----------------------------
# DeepSeek API Key
# ----------------------------
DEEPSEEK_KEY = os.getenv("DEEPSEEK_KEY")
if not DEEPSEEK_KEY:
    raise ValueError("DeepSeek API key not found in environment variable")
DEEPSEEK_ENDPOINT = "https://api.deepseek.com/chat/completions"
EXTRACTION_SCHEMA = {
    "games": [
        {
            "id": "number",
            "title": "string",
            "url": "string|null",
            "genres": ["string"],
            "description": "string|null",
            "price": "string|null",
            "action": "string|null"
        }
    ]
}
# ----------------------------
# Function to send markdown to DeepSeek
# ----------------------------
def extract_with_deepseek(markdown_text):
    system_instruction = (
        "You are a structured data extractor. "
        "Extract only game data from the given markdown text. "
        "Return a clean JSON object that matches the provided schema exactly. "
        "Do not include any explanations or markdown formatting. "
        "Use null for missing values."
    )


    schema_msg = {
        "role": "system",
        "content": "Schema:\n" + json.dumps(EXTRACTION_SCHEMA, indent=2)
    }


    user_msg = {"role": "user", "content": markdown_text}


    payload = {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system", "content": system_instruction},
            schema_msg,
            user_msg
        ],
        "temperature": 0.0,
        "max_tokens": 2000
    }
    headers = {
        "Authorization": f"Bearer {DEEPSEEK_KEY}",
        "Content-Type": "application/json"
    }
    try:
        response = requests.post(DEEPSEEK_ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()
        data = response.json()
       
        content = data["choices"][0]["message"]["content"]


         # Clean JSON code block markers if present
        cleaned = content.strip()
        if cleaned.startswith("```"):
            cleaned = cleaned.strip("`")           # remove all backticks
            cleaned = cleaned.replace("json", "", 1).strip()  # remove 'json' if present


        # Try to parse cleaned content as JSON
        try:
            parsed = json.loads(cleaned)
            print("\n✅ Parsed JSON successfully!\n")
            return parsed
        except json.JSONDecodeError as je:
            print(f"[DeepSeek ERROR] Failed to decode JSON: {je}")
            return None
    except Exception as e:
        print(f"[DeepSeek ERROR] {e}")
        return None


# ----------------------------
# Main async crawler & extraction
# ----------------------------
async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter()
            ),
        )


        url = "https://sandbox.oxylabs.io/products"
        result = await crawler.arun(url=url, config=crawler_config)


        # ----------------------------
        # Handle both CrawlResult object or string
        # ----------------------------
        if hasattr(result.markdown_v2, "raw_markdown"):
            markdown_text = result.markdown_v2.raw_markdown
        else:
            markdown_text = str(result.markdown_v2)

        # ----------------------------
        # Call DeepSeek to extract structured JSON
        # ----------------------------
        extracted = extract_with_deepseek(markdown_text)
        if extracted:
            print("DeepSeek Extracted Data:")
            print(json.dumps(extracted, indent=2))
            with open("products_deepseek.json", "w", encoding="utf-8") as f:
                json.dump(extracted, f, indent=2)
            print("Data saved to products_deepseek.json")
        else:
            print("DeepSeek extraction failed")


if __name__ == "__main__":
    asyncio.run(main())

Here is what the output looks like. This is a part of the complete output:

Part of the complete output

This output shows the JSON list of extracted products’ data, with each product showing its ID, title, genres, URL, description, and price. 

Scaling a web scraper requires careful request management. Websites with advanced web application firewalls often respond with CAPTCHA challenges and slow-loading pages, and frequently block access by blacklisting the IP. To overcome that, use a layer of reliable proxy servers.

Add proxy support to AI scraper with Oxylabs

Rotating proxies help address this problem. Each request uses a different IP, making the scraper appear as multiple independent users. 

To implement proxies, create a configuration object with your credentials and server address, and integrate it into the browser configuration. This allows the scraper to operate efficiently while minimizing the risk of blocking. For more details on rotation and geo-targeting, see Oxylabs' documentation or check our proxy integration with Crawl4AI.

You can start by creating a proxy URL, which combines your username, password, and the proxy server address.

proxies = {
    "http": "http://customer-USERNAME:PASSWORD@pr.oxylabs.io:7777",
    "https": "http://customer-USERNAME:PASSWORD@pr.oxylabs.io:7777"
}

This `proxies` object can then be passed to the POST request that calls the DeepSeek API:

response = requests.post(DEEPSEEK_ENDPOINT, headers=headers, json=payload, proxies=proxies)
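The `proxies` dictionary above follows the `requests` format. To route the crawler's own browser traffic through the same proxy, as mentioned for the browser configuration, you can pass the proxy URL to `BrowserConfig`. A minimal sketch, assuming your Crawl4AI version supports the `proxy` parameter on `BrowserConfig` (replace USERNAME and PASSWORD with your Oxylabs credentials):

# Route Crawl4AI's headless browser traffic through the Oxylabs proxy
proxy_url = "http://customer-USERNAME:PASSWORD@pr.oxylabs.io:7777"
browser_config = BrowserConfig(headless=True, verbose=True, proxy=proxy_url)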

Best practices for AI-powered web scraping

Building an AI-powered scraper is more than just writing code; it’s about making it scalable, resilient, and compliant.

The following practices will help you build more secure, reliable, and scalable AI web scrapers: 

  • Self-host LLMs for cost efficiency and data control

Open-source models (such as DeepSeek-V2 or DeepSeek-Coder) can be self-hosted locally or on your cloud infrastructure. Doing so significantly reduces costs and ensures you always remain in control of your data. 

  • Use specialized web scrapers for reliable data acquisition

In an AI-powered scraping workflow, no model can perform meaningful analysis if the data collection layer fails.

Consider a scenario where you aim to build a multimodal YouTube dataset that includes videos, transcripts, and user comments for AI training. A conventional scraper would likely fail at the first step due to IP bans or access restrictions.

In contrast, a purpose-built solution such as Oxylabs' Video Scraper API ensures high success rates through managed proxy rotation, CAPTCHA avoidance mechanisms, and scalable infrastructure. This enables your AI pipeline to focus on accurate data interpretation instead of troubleshooting collection errors.

  • Use proxies for block-free crawling

A successful AI scraper depends on uninterrupted access. Using rotating proxies assigns a fresh IP address to each request, preventing IP bans and CAPTCHA challenges. For regional data collection, residential proxies or other paid proxy servers emulate real users from specific locations, ensuring geotargeted and compliant data acquisition across markets.

  • Automate change monitoring for continual updates

Web structures evolve frequently. Using tools for website change monitoring ensures your scraper detects layout or content shifts early, avoiding silent failures. It’s especially useful when maintaining long-running AI pipelines for research or price tracking.

  • Handle data efficiently

Store scraped AI data in structured formats like JSON or CSV. This makes it easier to clean, analyze, or feed into AI workflows. For example, converting raw HTML into JSON objects with fields like name, price, and description simplifies downstream AI processing.
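As a rough illustration of that last point, here's a short sketch that uses only Python's standard library to flatten the `games` list saved earlier into a CSV file. It assumes `products_deepseek.json` follows the EXTRACTION_SCHEMA defined in this guide:

import csv
import json

# Load the JSON produced by the scraper
with open("products_deepseek.json", encoding="utf-8") as f:
    games = json.load(f).get("games", [])

fields = ["id", "title", "url", "genres", "description", "price", "action"]
with open("products_deepseek.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for game in games:
        row = {key: game.get(key) for key in fields}
        row["genres"] = ", ".join(row.get("genres") or [])  # flatten the list field
        writer.writerow(row)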

Conclusion

Crawl4AI works like a real browser, rendering modern, JavaScript-heavy pages and turning them into clean, structured content. DeepSeek then analyzes that content through its generative large language model, automatically finding and extracting the information you need without relying on complex CSS selectors or rigid parsing rules. By integrating rotating proxies and managed scraper APIs, you can maintain reliable and large-scale data pipelines that minimise blocking and preserve accuracy.

For more on similar topics, you can also try our solutions for Perplexity web scraping or scraping ChatGPT.

People also ask

What makes AI-powered web scraping with DeepSeek different from traditional methods?

DeepSeek AI-based scraping interprets content rather than operating on a set of fixed parsing rules. It can retrieve organized, structured data from complex or dynamic pages in a single step. This reduces the need for manual parsing code and makes the process more efficient and reliable than traditional web scrapers.

About the author


Yelyzaveta Hayrapetyan

Former Senior Technical Copywriter

Yelyzaveta Hayrapetyan was a Senior Technical Copywriter at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.

