Recent advancements in generative AI, especially continuous improvements in large language models (LLMs), have brought us closer to artificial general intelligence (AGI) than ever before. Instead of depending on rigid scripts (made of complex regular expressions), modern scrapers use LLMs to understand a webpage, reading it much like a human would.
In addition, AI can extract insights from raw HTML beyond what basic web scrapers can. For example, an AI-based scraper can generate summaries from raw text, adapt to dynamic web structures, and perform semantic analysis, such as sentiment analysis, classification, and more.
This guide explains how Crawl4AI and DeepSeek can work together. Crawl4AI handles browser rendering, while DeepSeek interprets the scraped content, converting raw text into structured data ready for analysis, thereby making web scraping more powerful and adaptable.
Crawl4AI is an open-source, next-gen AI-ready web scraping framework designed to extract and prepare web data for LLMs. Unlike older libraries such as Scrapy, which deal with static HTML parsing, Crawl4AI uses a Playwright-powered headless browser to render pages as a real user would. This tremendously helps when handling JavaScript-heavy, dynamic, and interactive sites.
Its asynchronous engine supports high-throughput parallel crawling, enabling concurrent scraping of multiple pages. The framework automatically cleans and structures content into Markdown, making it directly compatible with LLMs such as DeepSeek and GPT-based extractors.
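As a rough illustration, concurrent crawling can be as simple as gathering several `arun()` calls with `asyncio`. The sketch below uses placeholder URLs, and option names may vary slightly between Crawl4AI versions:

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_many(urls):
    # Reuse one browser session and crawl several pages concurrently
    async with AsyncWebCrawler() as crawler:
        return await asyncio.gather(*(crawler.arun(url=u) for u in urls))

# Example usage with placeholder URLs:
# results = asyncio.run(crawl_many(["https://example.com/a", "https://example.com/b"]))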
Compared to Selenium, Crawl4AI offers lower resource usage, faster execution, and AI-centric output formatting. Selenium was designed primarily for web automation and testing, not large-scale data extraction. Therefore, Selenium is less efficient for AI workflows that require structured, LLM-readable outputs. Crawl4AI, by contrast, is built specifically for the AI web scraping era, integrating seamlessly with proxy and web scraper APIs.
DeepSeek serves as the intelligence layer on top of traditional raw-data scrapers, transforming scraped data into meaningful information. DeepSeek uses large language models (LLMs) to interpret text in a way similar to a human.
When Crawl4AI collects raw content, DeepSeek analyzes it, identifies patterns, and extracts structured information, such as product details, reviews, and prices, using a straightforward human-language prompt. It remains effective even if the website layout or structure changes.
Traditional scrapers rely on carefully crafted CSS selectors, XPath queries, or regular expressions that target the desired data fields. Despite expert effort, these rules often break when websites update their structure.
AI-powered extraction, however, adapts to content changes. DeepSeek understands context from the given user prompt and recognizes data fields such as prices or ratings, even when they appear in different formats or under different CSS selectors.
Combining Crawl4AI and DeepSeek creates a complete pipeline: Crawl4AI collects clean data from modern websites, and DeepSeek converts it into structured, usable information. This approach is faster, more adaptable, and more resilient than traditional scraping methods.
The following illustration depicts the overall workflow of a typical AI scraper:

Overall workflow of a typical AI scraper
Let’s understand that flow with an example. Consider that we need to extract product listings from an e-commerce site with dynamic JavaScript content and inconsistent layouts.
Here’s how the workflow might look:
Crawl4AI scans and collects page data, including structured and unstructured HTML, text, and metadata.
The raw data is passed to DeepSeek, which uses an LLM to interpret patterns, extract entities (like product name, price, and description), and normalize them into a consistent format.
DeepSeek then outputs clean, structured JSON – ready for analysis or integration into dashboards.
Before we dive into the code, follow these steps to ensure your system is correctly configured to build and run the AI scraper with Crawl4AI and DeepSeek smoothly.
Python installation
You’ll need Python 3.10 or higher installed on your system. You can verify your version by running:
python --version

If you don't have it yet, download it from python.org.
Virtual environment
It’s recommended to create a virtual environment to isolate your project's dependencies and prevent conflicts with other Python packages.
Open the command terminal and run the following commands to create a new virtual environment named `AI_scraper_env`:
python -m venv AI_scraper_env
AI_scraper_env\Scripts\activate # For Windows
source AI_scraper_env/bin/activate # For macOS/Linux

Required libraries
Install all dependencies for both Crawl4AI and DeepSeek. The `requests` library is used to send the scraped content to the DeepSeek API endpoint.
pip install crawl4ai playwright requests
playwright install

API keys
You’ll need two keys:
DeepSeek API key: Get it by signing up at deepseek.com and creating a new key under your account dashboard.

Getting a DeepSeek API key

Creating a new API key
Crawl4AI Key (optional): Crawl4AI is open-source and doesn’t require an API key for local use, but you can integrate external APIs if needed.
Store your API key securely in an `.env` file:
DEEPSEEK_KEY=your_api_key_here

Since DeepSeek has open-sourced several of its large language models (LLMs), including DeepSeek-Coder and DeepSeek-V2 (see the DeepSeek GitHub repository), developers can self-host these models on their own infrastructure without relying on the paid API.
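Note that values in an `.env` file aren't loaded into the environment automatically. Either export `DEEPSEEK_KEY` in your shell before running the script, or load the file at startup; a minimal sketch using the `python-dotenv` package (an extra dependency not listed above) looks like this:

# pip install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads key-value pairs from .env into the process environment
DEEPSEEK_KEY = os.getenv("DEEPSEEK_KEY")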
Project folder setup: your project folder should look something like the following:

Project folder setup
Choose a target for web scraping: The target website for our AI web scraper demo is Oxylabs’ sandbox. This dummy website lists a range of products and is designed specifically for testing web scrapers.
The following screenshot shows the target website's video game listings:

Target website of video game listings
That’s all we need! Let’s start coding the DeepSeek-powered Crawl4AI web scraper step by step.
With the required libraries installed and the project set up, you're ready to start coding your AI-powered web scraper using Crawl4AI and DeepSeek.
Import all the required libraries, including the Crawl4AI components.
import asyncio
import os
import json
from pprint import pprint
import requests
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    PruningContentFilter,
    CrawlResult
)

The `asyncio` library is used to run the crawler asynchronously. Next, load the DeepSeek API key from the environment and define the API endpoint:
DEEPSEEK_KEY = os.getenv("DEEPSEEK_KEY")
if not DEEPSEEK_KEY:
raise ValueError("DeepSeek API key not found in environment variable")In an async `main()` function, initialise the browser object. Crawl4AI will use this object to initialize the headless browser session.
browser_config = BrowserConfig(headless=True, verbose=True)

`headless=True`: The browser runs invisibly in the background, saving the processing time that visual rendering would otherwise consume.
`verbose=True`: Prints detailed logs, which help with debugging.
Web application firewalls often detect and block headless traffic. One workaround is to set the `headless` property to `False`, but running visible browser windows doesn't scale, slows down the overall scraping process, and your IP can still end up blacklisted. Fortunately, you can use Oxylabs' Web Scraper API to scrape reliably at scale without getting blocked.
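As a reference point, fetching a rendered page through the Web Scraper API from Python might look roughly like the sketch below. It uses the API's realtime endpoint and `universal` source as documented by Oxylabs; the credentials are placeholders, and available parameters may differ depending on your subscription:

import requests

payload = {
    "source": "universal",                       # general-purpose scraping source
    "url": "https://sandbox.oxylabs.io/products",
    "render": "html",                            # ask the API to render JavaScript
}

# Replace USERNAME and PASSWORD with your Web Scraper API credentials
response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload,
)
response.raise_for_status()
html = response.json()["results"][0]["content"]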
Next, in the main function, launch the crawler with `browser_config` in an asynchronous context and set the necessary runtime options:
async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter()
        ),
    )
    url = "https://sandbox.oxylabs.io/products"
    result = await crawler.arun(url=url, config=crawler_config)

`markdown_generator`: Tells the crawler to return the scraped content in Markdown. This makes the output cleaner and easier to read.
`PruningContentFilter()`: Automatically removes unnecessary elements like sidebars, navigation menus, and scripts to keep only the core content.
`crawler.arun()`: Executes the crawling process for the specified URL and applies the configuration. The resulting content is stored in the result variable.
`result`: Holds the page content in raw Markdown format, ready for further processing, in our case, sending it to DeepSeek for structured extraction.
This setup ensures the crawler captures a clean snapshot of the webpage’s main content, free of ads, navigation bars, and other clutter. Now, this content is ready for feeding to the DeepSeek API.
This unstructured data will now be sent to DeepSeek to extract the required fields. DeepSeek will analyze the data and return it in a structured JSON format.
To send data to DeepSeek for extraction, we need to extract the markdown text from our scraped content. This markdown text will be sent to a function `extract_with_deepseek(markdown_text)`.
if hasattr(result.markdown_v2, "raw_markdown"):
    markdown_text = result.markdown_v2.raw_markdown
else:
    markdown_text = str(result.markdown_v2)

Before sending any request to DeepSeek, you need to define the type of data you want to extract from the raw content and the format it should be returned in. This structural definition is known as a data extraction schema.
A schema acts like a blueprint for your output. It helps the LLM understand what fields you need to extract, so you always get consistent results in the given format and structure.
Example:
If we're scraping game information, our schema could include fields such as title, genre, platform, and rating.
EXTRACTION_SCHEMA = {
    "games": [
        {
            "id": "number",
            "title": "string",
            "url": "string|null",
            "genres": ["string"],
            "description": "string|null",
            "price": "string|null",
            "action": "string|null"
        }
    ]
}

In this schema, we define all the required fields along with their datatypes to get the most accurate results. This ensures DeepSeek knows exactly what structure to follow, even when web content is messy or inconsistent.
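To make the expected shape concrete, a single extracted entry that follows this schema would look roughly like the example below (the values are purely illustrative placeholders, not real scraped data):

{
  "games": [
    {
      "id": 1,
      "title": "Example Game Title",
      "url": "https://sandbox.oxylabs.io/products/1",
      "genres": ["action", "adventure"],
      "description": "A short product description, if one is present on the page.",
      "price": "29.99",
      "action": null
    }
  ]
}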
Once your schema is ready, you’ll pass it to DeepSeek using a structured prompt.
This prompt instructs the LLM to extract data according to the schema — no extra text, no noise.
def extract_with_deepseek(markdown_text):
    system_instruction = (
        "You are a structured data extractor. "
        "Extract only game data from the given markdown text. "
        "Return a clean JSON object that matches the provided schema exactly. "
        "Do not include any explanations or markdown formatting. "
        "Use null for missing values."
    )
    schema_msg = {
        "role": "system",
        "content": "Schema:\n" + json.dumps(EXTRACTION_SCHEMA, indent=2)
    }
    user_msg = {"role": "user", "content": markdown_text}

In this code, the `system_instruction` tells DeepSeek how to behave. In LLM terminology, this is called a system prompt. It defines the model's overall “role” or “personality.”
It restricts DeepSeek’s response to return only structured data (no extra text or reasoning).
The instruction “Use null for missing values” ensures consistency in case some fields are unavailable.
It also emphasizes clean JSON, which is important because even small deviations (like extra commentary or markdown symbols) can break parsing later.
It's important to define system prompts carefully, because poorly defined prompts can lead to inconsistent data structures, irrelevant data, or parsing errors in your pipeline.
The `schema_msg` message attaches the defined schema, and the `user_msg` message contains the actual content from which data is to be extracted.
Once the schema and prompt are set, prepare the payload structure.
payload = {
    "model": "deepseek-chat",
    "messages": [
        {"role": "system", "content": system_instruction},
        schema_msg,
        user_msg
    ],
    "temperature": 0.0,
    "max_tokens": 2000
}

`model`: Specifies which DeepSeek model to use (`deepseek-chat`).
`messages`: Follows a chat-like format:
`schema_msg` message defines the data extraction schema (here: extract clean JSON).
`user_msg` message contains the actual Markdown text to process.
`temperature`: Controls creativity/variability. Lower values (0.0) make the model more deterministic.
`max_tokens`: Caps the length of the model's response (2,000 tokens here).
headers = {
    "Authorization": f"Bearer {DEEPSEEK_KEY}",
    "Content-Type": "application/json"
}

`Authorization`: Sends your API key for authentication.
`Content-Type`: Indicates that we are sending JSON data.
response = requests.post(DEEPSEEK_ENDPOINT, headers=headers, json=payload)
response.raise_for_status()
data = response.json()

`requests.post()` sends an HTTP POST request to the DeepSeek endpoint.
`raise_for_status()` raises an exception if the request fails.
`response.json()` converts the response body into a Python dictionary.
content = data["choices"][0]["message"]["content"]DeepSeek returns a choices array.
We take the first choice and grab the message content, which is expected to be a JSON string representing the extracted data.
Since the model often wraps its response in a Markdown code fence (backticks plus a `json` label), you may prefer to clean the content before parsing it:
cleaned = content.strip()
if cleaned.startswith("```"):
    cleaned = cleaned.strip("`")  # remove surrounding backticks
    cleaned = cleaned.replace("json", "", 1).strip()  # remove the leading 'json' label if present

# Try to parse the cleaned content as JSON
try:
    parsed = json.loads(cleaned)
    print("\n✅ Parsed JSON successfully!\n")
    return parsed
except json.JSONDecodeError as je:
    print(f"[DeepSeek ERROR] Failed to decode JSON: {je}")
    return None

So the complete DeepSeek extraction function can be written as:
EXTRACTION_SCHEMA = {
    "games": [
        {
            "id": "number",
            "title": "string",
            "url": "string|null",
            "genres": ["string"],
            "description": "string|null",
            "price": "string|null",
            "action": "string|null"
        }
    ]
}

# ----------------------------
# Function to send markdown to DeepSeek
# ----------------------------
def extract_with_deepseek(markdown_text):
    system_instruction = (
        "You are a structured data extractor. "
        "Extract only game data from the given markdown text. "
        "Return a clean JSON object that matches the provided schema exactly. "
        "Do not include any explanations or markdown formatting. "
        "Use null for missing values."
    )
    schema_msg = {
        "role": "system",
        "content": "Schema:\n" + json.dumps(EXTRACTION_SCHEMA, indent=2)
    }
    user_msg = {"role": "user", "content": markdown_text}
    payload = {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system", "content": system_instruction},
            schema_msg,
            user_msg
        ],
        "temperature": 0.0,
        "max_tokens": 2000
    }
    headers = {
        "Authorization": f"Bearer {DEEPSEEK_KEY}",
        "Content-Type": "application/json"
    }
    try:
        response = requests.post(DEEPSEEK_ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()
        data = response.json()
        content = data["choices"][0]["message"]["content"]

        # Clean JSON code block markers if present
        cleaned = content.strip()
        if cleaned.startswith("```"):
            cleaned = cleaned.strip("`")  # remove all backticks
            cleaned = cleaned.replace("json", "", 1).strip()  # remove 'json' if present

        # Try to parse cleaned content as JSON
        try:
            parsed = json.loads(cleaned)
            print("\n✅ Parsed JSON successfully!\n")
            return parsed
        except json.JSONDecodeError as je:
            print(f"[DeepSeek ERROR] Failed to decode JSON: {je}")
            return None
    except Exception as e:
        print(f"[DeepSeek ERROR] {e}")
        return None

This function sends Markdown content to DeepSeek with a simple, plain-English prompt instructing the model to extract structured JSON. For straightforward extraction tasks, we use `deepseek-chat`, a fast and cost-efficient model designed for reliably converting text to JSON. The function limits randomness with `temperature=0.0` for more deterministic results and includes basic error handling to catch network or parsing issues.
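If the backtick cleanup above ever proves too brittle (for example, when the model labels the fence differently), a slightly more targeted helper based on Python's built-in `re` module could be swapped in. This is an optional sketch, not part of the original function:

import re

def strip_code_fence(text: str) -> str:
    # Remove an optional opening fence such as ```json (or plain ```), and a trailing ```
    cleaned = text.strip()
    cleaned = re.sub(r"^```[a-zA-Z]*\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return cleaned.strip()

You could then call `strip_code_fence(content)` in place of the manual stripping before `json.loads()`.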
Finally, run the script to see the output:
if __name__ == "__main__":
    asyncio.run(main())

Let's look at the complete code so far:
import asyncio
import os
import json
from pprint import pprint
import requests
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    PruningContentFilter,
    CrawlResult
)

# ----------------------------
# DeepSeek API Key
# ----------------------------
DEEPSEEK_KEY = os.getenv("DEEPSEEK_KEY")
if not DEEPSEEK_KEY:
    raise ValueError("DeepSeek API key not found in environment variable")

DEEPSEEK_ENDPOINT = "https://api.deepseek.com/chat/completions"

EXTRACTION_SCHEMA = {
    "games": [
        {
            "id": "number",
            "title": "string",
            "url": "string|null",
            "genres": ["string"],
            "description": "string|null",
            "price": "string|null",
            "action": "string|null"
        }
    ]
}

# ----------------------------
# Function to send markdown to DeepSeek
# ----------------------------
def extract_with_deepseek(markdown_text):
    system_instruction = (
        "You are a structured data extractor. "
        "Extract only game data from the given markdown text. "
        "Return a clean JSON object that matches the provided schema exactly. "
        "Do not include any explanations or markdown formatting. "
        "Use null for missing values."
    )
    schema_msg = {
        "role": "system",
        "content": "Schema:\n" + json.dumps(EXTRACTION_SCHEMA, indent=2)
    }
    user_msg = {"role": "user", "content": markdown_text}
    payload = {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system", "content": system_instruction},
            schema_msg,
            user_msg
        ],
        "temperature": 0.0,
        "max_tokens": 2000
    }
    headers = {
        "Authorization": f"Bearer {DEEPSEEK_KEY}",
        "Content-Type": "application/json"
    }
    try:
        response = requests.post(DEEPSEEK_ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()
        data = response.json()
        content = data["choices"][0]["message"]["content"]

        # Clean JSON code block markers if present
        cleaned = content.strip()
        if cleaned.startswith("```"):
            cleaned = cleaned.strip("`")  # remove all backticks
            cleaned = cleaned.replace("json", "", 1).strip()  # remove 'json' if present

        # Try to parse cleaned content as JSON
        try:
            parsed = json.loads(cleaned)
            print("\n✅ Parsed JSON successfully!\n")
            return parsed
        except json.JSONDecodeError as je:
            print(f"[DeepSeek ERROR] Failed to decode JSON: {je}")
            return None
    except Exception as e:
        print(f"[DeepSeek ERROR] {e}")
        return None

# ----------------------------
# Main async crawler & extraction
# ----------------------------
async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter()
            ),
        )
        url = "https://sandbox.oxylabs.io/products"
        result = await crawler.arun(url=url, config=crawler_config)

        # ----------------------------
        # Handle both CrawlResult object or string
        # ----------------------------
        if hasattr(result.markdown_v2, "raw_markdown"):
            markdown_text = result.markdown_v2.raw_markdown
        else:
            markdown_text = str(result.markdown_v2)

        # ----------------------------
        # Call DeepSeek to extract structured JSON
        # ----------------------------
        extracted = extract_with_deepseek(markdown_text)
        if extracted:
            print("DeepSeek Extracted Data:")
            print(json.dumps(extracted, indent=2))
            with open("products_deepseek.json", "w", encoding="utf-8") as f:
                json.dump(extracted, f, indent=2)
            print("Data saved to products_deepseek.json")
        else:
            print("DeepSeek extraction failed")

if __name__ == "__main__":
    asyncio.run(main())

Here is what the output looks like. This is a part of the complete output:

Part of the complete output
This output shows the JSON list of extracted products’ data, with each product showing its ID, title, genres, URL, description, and price.
Scaling a web scraper requires careful request management. Websites with advanced web application firewalls often respond with CAPTCHAs and slow-loading pages, and may block access entirely by blacklisting your IP. To overcome this, add a layer of reliable proxy servers.
Rotating proxies help address this problem. Each request uses a different IP, making the scraper appear as multiple independent users.
To implement proxies, create a configuration object with your credentials and server address and integrate it into your scraping setup. This lets the scraper operate efficiently while minimizing the risk of blocking. For more details on rotation and geo-targeting, see Oxylabs' documentation or check our proxy integration with Crawl4AI.
Start by creating a proxy URL that embeds your username and password:
proxies = {
    "http": "http://customer-USERNAME:PASSWORD@pr.oxylabs.io:7777",
    "https": "http://customer-USERNAME:PASSWORD@pr.oxylabs.io:7777"
}

The most important place to apply a proxy is the crawler itself, since that's the traffic the target website sees; a hedged example follows below. The same `proxies` object can also be passed to the `requests` call that contacts the DeepSeek API.
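For the crawler side, recent Crawl4AI versions accept a proxy in the browser configuration. Here's a minimal sketch, assuming the `proxy` parameter of `BrowserConfig` (check your installed version's documentation for the exact option name):

# Route the headless browser's traffic through the proxy (assumed BrowserConfig option)
browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    proxy="http://customer-USERNAME:PASSWORD@pr.oxylabs.io:7777",
)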
To route the DeepSeek API call through the proxy as well, pass the `proxies` dictionary to `requests.post()`:

response = requests.post(DEEPSEEK_ENDPOINT, headers=headers, json=payload, proxies=proxies)

Building an AI-powered scraper is more than just writing code; it's about making it scalable, resilient, and compliant.
The following practices will help you build more secure, reliable, and scalable AI web scrapers:
Self-host LLMs for cost efficiency and data control
Open-source models (such as DeepSeek-V2 or DeepSeek-Coder) can be self-hosted locally or on your cloud infrastructure. Doing so significantly reduces costs and ensures you always remain in control of your data.
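As one possible setup (an assumption for illustration; it isn't covered in this tutorial), you could serve a DeepSeek model locally through a runtime that exposes an OpenAI-compatible chat completions endpoint, such as Ollama, and point the extraction function at it by changing only the endpoint and model name:

# Hypothetical local configuration: a DeepSeek model served by a local runtime
# that exposes an OpenAI-compatible /v1/chat/completions endpoint (e.g., Ollama).
DEEPSEEK_ENDPOINT = "http://localhost:11434/v1/chat/completions"  # local endpoint instead of api.deepseek.com
LOCAL_MODEL = "deepseek-coder"  # whichever DeepSeek model you have pulled locally

payload = {
    "model": LOCAL_MODEL,
    "messages": [...],  # same system, schema, and user messages as before
    "temperature": 0.0,
}

The rest of `extract_with_deepseek()` would stay the same, provided the local runtime mirrors the hosted API's request and response format.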
Use specialized web scrapers for reliable data acquisition
In an AI-powered scraping workflow, no model can perform meaningful analysis if the data collection layer fails.
Consider a scenario where you aim to build a multimodal YouTube dataset that includes videos, transcripts, and user comments for AI training. A conventional scraper would likely fail at the first step due to IP bans or access restrictions.
In contrast, a purpose-built solution such as Oxylabs' Video Scraper API ensures high success rates through managed proxy rotation, CAPTCHA avoidance mechanisms, and scalable infrastructure. This enables your AI pipeline to focus on accurate data interpretation instead of troubleshooting collection errors.
Use proxies for block-free crawling
A successful AI scraper depends on uninterrupted access. Using rotating proxies assigns a fresh IP address to each request, preventing IP bans and CAPTCHA challenges. For regional data collection, residential proxies or other paid proxy servers emulate real users from specific locations, ensuring geotargeted and compliant data acquisition across markets.
Automate change monitoring for continual updates
Web structures evolve frequently. Using tools for website change monitoring ensures your scraper detects layout or content shifts early, avoiding silent failures. It’s especially useful when maintaining long-running AI pipelines for research or price tracking.
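A lightweight way to approximate this (a sketch, not a replacement for a dedicated monitoring tool) is to hash the cleaned Markdown on each run and flag when it changes:

import hashlib
from pathlib import Path

def content_changed(markdown_text: str, state_file: str = "last_hash.txt") -> bool:
    # Hash the cleaned Markdown and compare it with the hash stored from the previous run
    current = hashlib.sha256(markdown_text.encode("utf-8")).hexdigest()
    path = Path(state_file)
    previous = path.read_text().strip() if path.exists() else None
    path.write_text(current)
    return previous is not None and previous != current

Because any change (including routine price updates) flips the hash, treat this as a trigger for review rather than proof that the layout broke.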
Handle data efficiently
Store scraped AI data in structured formats like JSON or CSV. This makes it easier to clean, analyze, or feed into AI workflows. For example, converting raw HTML into JSON objects with fields like name, price, and description simplifies downstream AI processing.
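For instance, here's a small sketch of flattening the extracted JSON into a CSV file with Python's standard library, assuming the `products_deepseek.json` file produced earlier in this tutorial:

import csv
import json

# Load the structured output saved by the scraper
with open("products_deepseek.json", encoding="utf-8") as f:
    data = json.load(f)

games = data.get("games", [])
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["id", "title", "url", "genres", "description", "price", "action"],
        extrasaction="ignore",
    )
    writer.writeheader()
    for game in games:
        row = dict(game)
        # Join the list of genres into a single comma-separated cell
        row["genres"] = ", ".join(row.get("genres") or [])
        writer.writerow(row)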
Crawl4AI works like a real browser, rendering modern, JavaScript-heavy pages and turning them into clean, structured content. DeepSeek then analyzes that content through its generative large language model, automatically finding and extracting the information you need without relying on complex CSS selectors or rigid parsing rules. By integrating rotating proxies and managed scraper APIs, you can maintain reliable, large-scale data pipelines that minimize blocking and preserve accuracy.
For more similar topics, you can check our articles on:
In addition, you can also try our solutions for Perplexity web scraping or scraping ChatGPT.
DeepSeek AI-based scraping interprets content rather than relying on a set of fixed parsing rules, so it can retrieve structured data from complex or dynamic pages in a single step. This eliminates the need for manual rule-writing and makes the process more efficient and reliable than traditional web scrapers.
Use different IP addresses and change your browser's user-agent to look like multiple real visitors. Adding short delays between requests helps avoid detection, and using residential or datacenter proxies makes scraping protected sites more reliable. Alternatively, Oxylabs Web Scraper API can reliably handle all such cases and scrape websites on your behalf at scale without getting blocked.
Token and request limits depend on the specific DeepSeek model and your subscription plan. Limits exist to prevent overload and misuse. Keeping an eye on usage helps avoid exceeding quotas or incurring extra costs.
Yes, this setup works well with multiple sites or pages that change frequently. Crawl4AI handles dynamic content and generates clean Markdown, which DeepSeek can then convert to structured JSON. Proper proxy management and error handling ensure scalability across many sites.
Always review the target website’s terms of service before scraping. Limit request frequency, avoid scraping sensitive or personal data, and respect robots.txt rules. Compliance with the specified rules minimizes legal risks and ensures that data scraping is conducted ethically.
Proxies prevent IP restrictions and CAPTCHAs by masking requests as originating from many users. For large-scale scraping, a scraper API or proxy service offers rotation, geotargeting, and reliability. Choose based on compatibility with your scraping methodology, speed, and stability.
About the author

Yelyzaveta Hayrapetyan
Former Senior Technical Copywriter
Yelyzaveta Hayrapetyan was a Senior Technical Copywriter at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.


