
Web Scraping With LangChain & Oxylabs API


Roberta Aukstikalnyte

Last updated by Vytenis Kaubrė

2025-05-21

6 min read

LangChain is a robust framework designed for building AI applications that integrate Large Language Models (LLMs) with external data sources, workflows, and APIs.
By combining LangChain’s seamless pipeline capabilities with a tool like the Web Scraper API, you can collect public web data, all while avoiding common scraping-related hurdles that can disrupt your processes.

In today’s article, we’ll demonstrate how to integrate the LangChain framework with Web Scraper API and AI Studio. By the end of this article, you’ll be able to retrieve structured web data with minimal development effort. Let’s get started!

Common web scraping challenges 

Any conversation about web data extraction inevitably turns to its challenges. In practice, developers often run into problems that can complicate or disrupt the process – let's take a closer look:

  • IP blocking and rate limiting

Websites often detect and block repeated requests from the same IP address to prevent automated scraping. They may also impose rate limits, capping the number of requests you can make within a specific time frame. Without proper measures, these restrictions can disrupt data collection (see the retry sketch after this list).

  • CAPTCHAs and other anti-scraping mechanisms

Websites implement CAPTCHAs and other anti-bot technologies to distinguish between human users and automated scripts. Bypassing these defenses requires sophisticated tools or external CAPTCHA-solving services, adding complexity and cost.

  • Large-scale scraping

As scraping projects grow, handling large volumes of data efficiently becomes a challenge. This includes managing storage, ensuring fast processing, and maintaining reliable infrastructure to handle numerous concurrent requests.
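To make the rate-limiting challenge concrete, here's a minimal, hypothetical sketch of the kind of workaround developers end up writing by hand: retrying a failed request with exponential backoff. The URL and status-code handling are illustrative assumptions rather than part of this tutorial's code:

import time
import requests


def fetch_with_backoff(url, max_retries=5):
    """Retry a GET request with exponential backoff when rate-limited."""
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:  # 429 = Too Many Requests
            return response
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"Still rate-limited after {max_retries} retries")

Managed tools like Web Scraper API take care of retries, proxy rotation, and CAPTCHA handling for you, which is the approach the rest of this tutorial follows.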

LangChain vs. regular scraping methods

Recently, frameworks like LangChain have introduced a new paradigm, leveraging large language models (LLMs) to interpret and process data in a more dynamic and context-aware manner.

Let’s take a look at a comparison of LangChain and regular web scraping methods across a few dimensions:

1. Purpose and use case 

  • LangChain: Combines web scraping with LLMs for automated data analysis and insights generation. Ideal for workflows that need both data extraction and AI-driven processing.

  • Regular scraping: Web scraping tools like Scrapy, BeautifulSoup, and Puppeteer are focused solely on data collection. Post-processing requires separate tools and scripts.

2. Handling dynamic content 

  • LangChain: When paired with tools like Oxylabs Web Scraper API, it seamlessly handles JavaScript-rendered content and bypasses anti-scraping measures.

  • Regular scraping: Requires additional tools like Selenium or Puppeteer to handle dynamic content, which can increase complexity.

3. Data post-processing

  • LangChain: Built-in LLM integration allows for immediate tasks like summarization, sentiment analysis, and pattern recognition.

  • Regular scraping: Data analysis and transformation require custom scripts or separate libraries, adding more steps to the workflow.

4. Error handling and reliability

  • LangChain: Automatically manages challenges like CAPTCHAs, IP bans, and failed requests via integrated API solutions.

  • Regular scraping: Requires manual error handling with retries, proxy management, or third-party CAPTCHA-solving tools.

5. Scalability and workflow automation

  • LangChain: Scales efficiently, automating the entire pipeline from scraping to actionable insights.

  • Regular scraping: Scalability is possible with frameworks like Scrapy but often requires additional configurations and custom setups.

6. Ease of use

  • LangChain: Simplifies complex workflows, making it easier to integrate advanced features like AI with minimal setup.

  • Regular scraping: Requires more technical knowledge and effort to build and maintain robust scrapers.

All in all...

Choose LangChain for projects that benefit from seamless integration of data scraping and AI-driven analysis. Opt for regular web scraping when you need full control over the scraping process or for simpler, standalone data extraction tasks.

How to use LangChain with Web Scraper API

While the possibilities are endless, let's focus on a simple one for today. We'll write a few scripts that use Web Scraper API to handle the complexities of modern web scraping, including JavaScript rendering, anti-bot bypassing, and automatic data parsing. Web Scraper API gives developers programmatic control over enterprise-grade scraping, while LangChain, LangGraph, and LLMs handle the interpretation of extracted data throughout the pipeline. There are three main ways to integrate Web Scraper API:

  • By using the langchain-oxylabs tool to scrape Google search results

  • By using the Oxylabs MCP server to scrape Google, Amazon, and any other URL

  • By directly calling the API to scrape any URL and passing the data to LangChain

Let’s see all methods in action.


Set up the environment

First off, install the necessary libraries for all three integration methods at once. For this tutorial, we'll use GPT models provided by OpenAI, but you can use any model supported by the LangChain framework.

Inside a new project’s folder, initialize a Python virtual environment and activate it:

python -m venv .venv
source .venv/bin/activate

Then, run the following pip command to install the libraries in the activated .venv:

pip install -U langchain-oxylabs "langchain[openai]" langgraph langchain-mcp-adapters requests python-dotenv

Next, save your Oxylabs credentials and the LLM key as environment variables. You can easily do so by creating a .env file in your project’s directory and storing the authentication details as shown below:

OXYLABS_USERNAME=your-username
OXYLABS_PASSWORD=your-password
OPENAI_API_KEY=your-openai-key

You may also save environment variables system-wide using your terminal and thus skip the dotenv library altogether.
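For example, on macOS or Linux, you can export the same variables in your terminal or add them to your shell profile (the values below are placeholders):

export OXYLABS_USERNAME=your-username
export OXYLABS_PASSWORD=your-password
export OPENAI_API_KEY=your-openai-key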

Integrate via langchain-oxylabs module

The langchain-oxylabs package enables LLMs to scrape Google search results in real time and avoid any blocks along the way. Here’s a basic implementation to get Google results:

import os
from dotenv import load_dotenv
from langchain_oxylabs import OxylabsSearchAPIWrapper, OxylabsSearchRun


load_dotenv()

# Initialize the Google search tool
search = OxylabsSearchRun(
    wrapper=OxylabsSearchAPIWrapper(
        oxylabs_username=os.getenv("OXYLABS_USERNAME"),
        oxylabs_password=os.getenv("OXYLABS_PASSWORD")
    )
)

# Send a request with API parameters
response_results = search.invoke({
    "query": "best restaurants in Los Angeles",
    "geo_location": "Los Angeles,United States"
})

print(response_results)

After getting the results, pass the scraped data to your AI workflow for processing. You may also let the AI use the scraper dynamically without specifying API parameters in the code. 
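To illustrate the first approach, here's a minimal sketch that feeds the response_results string from the previous snippet into an OpenAI chat model for summarization (the prompt wording is just an example):

from langchain_openai import ChatOpenAI

# Pass the scraped Google results to an LLM for a quick summary
llm = ChatOpenAI(model="gpt-4o-mini")
summary = llm.invoke(
    f"Summarize the key takeaways from these search results:\n{response_results}"
)
print(summary.content)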

To keep things simple, let’s use LangGraph, a part of the LangChain ecosystem, to abstract away complexities and build an AI agent that can scrape data. You can also use only LangChain and integrate Oxylabs in a similar way if you prefer to have more control over each step of the AI workflow.

import os
from dotenv import load_dotenv
from langgraph.prebuilt import create_react_agent
from langchain_oxylabs import OxylabsSearchAPIWrapper, OxylabsSearchRun


load_dotenv()

search = OxylabsSearchRun(
    wrapper=OxylabsSearchAPIWrapper(
        oxylabs_username=os.getenv("OXYLABS_USERNAME"),
        oxylabs_password=os.getenv("OXYLABS_PASSWORD")
    )
)

# Create an agent that uses the Google search tool
agent = create_react_agent("openai:gpt-4o-mini", [search])

user_input = "Who were the surprise performers at Coachella 2025?"

# Invoke the agent
result = agent.invoke({"messages": user_input})

# Print the agent's response
print(result["messages"][-1].content)

When executed, the agent will send a request to Web Scraper API with the search term, analyze the scraped data, and then answer the question.
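If you want to see this flow for yourself, you can print the full message trace that LangGraph returns; it includes the user message, the model's call to the Oxylabs search tool, the raw tool output, and the final answer. A minimal sketch, reusing the result object from above:

# Walk the agent's message history to inspect each step
for message in result["messages"]:
    print(f"{type(message).__name__}: {str(message.content)[:200]}")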

Integrate via Oxylabs MCP server

The Oxylabs MCP server enables AI applications to use Oxylabs scraping infrastructure to scrape Google search results, Amazon search and product results, and any other website. Make sure to install the uv package in your environment by following the official installation guide. For example, on macOS, you can use Homebrew:

brew install uv

Again, to keep things simple, let’s use LangGraph to create a basic AI agent that uses the Oxylabs MCP server as a tool. For a deeper understanding of how MCP compares to other AI protocols, see this MCP vs A2A comparison.

import os
import asyncio
from dotenv import load_dotenv
from langchain_mcp_adapters.sessions import create_session
from langchain_mcp_adapters.tools import load_mcp_tools
from langgraph.prebuilt import create_react_agent


load_dotenv()

# Define the MCP server config
config = {
    "transport": "stdio",
    "command": "uvx",
    "args": ["oxylabs-mcp"],
    "env": {
        "OXYLABS_USERNAME": os.getenv("OXYLABS_USERNAME"),
        "OXYLABS_PASSWORD": os.getenv("OXYLABS_PASSWORD"),
    }
}


async def main():
    # Initialize an MCP server session and load the tools
    async with create_session(config) as session:
        await session.initialize()
        tools = await load_mcp_tools(session)

        # Create an AI agent that uses the MCP server tools
        agent = create_react_agent("openai:gpt-4o-mini", tools)

        # A loop to ask questions and get answers
        while True:
            question = input("\nQuestion -> ")
            if question == "exit":
                break
            result = await agent.ainvoke({"messages": question})
            print(f"\n{result['messages'][-1].content}\n")


if __name__ == "__main__":
    asyncio.run(main())

When you run the code and input a question, the agent chooses the best-suited scraping tool and scrapes public data to answer the question.

Integrate via direct API calls

If you want more control over the scraper, you can manually send requests to the API. Once the data is scraped, you can pass it all to an LLM for processing using LangChain.

For this, you can write a simple function that accepts string content as a parameter, then formats that content into a provided prompt template and sends it for interpretation:

import os
import requests
from dotenv import load_dotenv
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate


load_dotenv()

# Define the prompt template
prompt = PromptTemplate.from_template(
    "Analyze the following website content and summarize key points: {content}"
)

# Compose the chain using the prompt and the LLM
chain = prompt | OpenAI()


def scrape_website(url):
    """Scrape the website using Oxylabs Web Scraper API"""
    payload = {
        "source": "universal",
        "url": url,
        "parse": True
    }

    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(os.getenv("OXYLABS_USERNAME"), os.getenv("OXYLABS_PASSWORD")),
        json=payload
    )

    if response.status_code == 200:
        data = response.json()
        content = data["results"][0]["content"]
        print(content)
        return str(content)
    else:
        print(f"Failed to scrape website: {response.text}")
        return None


def process_content(content):
    """Process the scraped content using LangChain"""
    if not content:
        print("No content to process.")
        return None

    result = chain.invoke({"content": content})
    return result


def main(url):
    print("Scraping website...")
    scraped_content = scrape_website(url)

    if scraped_content:
        print("Processing scraped content with LangChain...")
        analysis = process_content(scraped_content)

        print("\nProcessed Analysis:\n", analysis)
    else:
        print("No content scraped.")


# Example URL to scrape
url = "https://sandbox.oxylabs.io/products/1"
main(url)

Note: The script utilizes automatic result parsing provided by the Oxylabs Web Scraper API to make our lives easier. If you want to read more about the capabilities of the API, head over to the Web Scraper API documentation.

When you run the code, you'll see an AI-generated summary of the scraped page.
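As a side note, the parse flag is what switches the API between structured output and raw content. Based on the API's documented behavior, a payload without it should return the page's HTML, which you would then clean or chunk yourself before passing to the LLM – a hedged sketch, so verify against the documentation:

payload = {
    "source": "universal",
    "url": url,
    # Omitting "parse": True returns raw HTML instead of
    # auto-parsed structured data
}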

How to use LangChain with AI Studio

If you prefer a low-code web scraping solution, AI Studio lets you scrape data by describing your needs in plain English. It's ideal for developers building AI agents, product teams validating ideas, or anyone who needs structured web data quickly without managing scraping infrastructure. It offers these five apps: AI-Crawler, AI-Scraper, Browser Agent, AI-Search, and AI-Map. Make sure to create an AI Studio account to get free usage credits.

You can integrate AI Studio with LangChain in two ways: through the Python SDK or through the Oxylabs MCP server. Let's walk through both.

Set up the environment

Create a Python virtual environment if you haven't done that yet:

python -m venv .venv
source .venv/bin/activate

Use pip to install the libraries:

pip install -U oxylabs-ai-studio langchain "langchain[openai]" langchain-mcp-adapters langgraph

Additionally, create a .env file in your project's directory and save your AI Studio API key and OpenAI API key:

OXYLABS_AI_STUDIO_API_KEY=your_ai_studio_key
OPENAI_API_KEY=your_openai_key

Integrate AI Studio via Python SDK

Let's see how you can easily set up LangChain with AI-Crawler. Make sure to visit the AI Studio website or SDK repository to find code samples for other AI Studio apps.

from dotenv import load_dotenv
from oxylabs_ai_studio.apps.ai_crawler import AiCrawler
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage


load_dotenv()

result = AiCrawler().crawl(
    url="https://developers.oxylabs.io/",
    user_prompt="Find all Web Scraper API targets, supported websites, and data sources.",
    output_format="markdown",
    render_javascript=False,
    return_sources_limit=10,
    geo_location="US"
)

results = result.data

llm = ChatOpenAI(model="gpt-5-nano")  # gpt-5 models only support the default temperature

analysis = llm.invoke([
    HumanMessage(content=f"""
    Based on this crawled content from Oxylabs documentation:
    {results}
    Please provide a structured list of all Web Scraper API targets, organized by category 
    (e.g., E-commerce, Search Engines, etc.)
    """)
])

print(analysis.content)

In the above example, AI-Crawler will crawl the Oxylabs documentation to find target pages, while LangChain orchestrates the AI analysis of the scraped data. Running the code will output a structured list of the discovered targets.

Integrate AI Studio via MCP

Another option is to use the Oxylabs MCP server, which allows AI agents to discover AI Studio tools and use them as needed. Make sure you have installed the uv package for the following implementation. You can also use other MCP configurations, such as Smithery and local setup – check out the Oxylabs MCP server repository to learn more.

Let's use LangGraph's prebuilt ReAct (Reasoning and Acting) agent to keep it simple:

import os
import asyncio
from dotenv import load_dotenv
from langchain_mcp_adapters.sessions import create_session
from langchain_mcp_adapters.tools import load_mcp_tools
from langgraph.prebuilt import create_react_agent


load_dotenv()

config = {
    "transport": "stdio",
    "command": "uvx",
    "args": ["oxylabs-mcp"],
    "env": {"OXYLABS_AI_STUDIO_API_KEY": os.getenv("OXYLABS_AI_STUDIO_API_KEY")}
}


async def main():
    async with create_session(config) as session:
        await session.initialize()
        tools = await load_mcp_tools(session)

        agent = create_react_agent("openai:gpt-5-nano", tools)

        result = await agent.ainvoke(
            {
                "messages": [
                    {
                        "role": "user",
                        "content": "Search the web for the latest news on the stock market."
                    }
                ]
            }
        )
        print(f"\n{result['messages'][-1].content}\n")


if __name__ == "__main__":
    asyncio.run(main())

Once executed, the agent will pick the best-suited AI Studio app, search the web for stock market news, and print a summary of what it finds.

Conclusion

Combining Oxylabs' Web Scraper API with LangChain makes an excellent solution for efficient web scraping and AI-driven analysis. The API handles common challenges like IP blocking and CAPTCHAs, while LangChain enables real-time data processing with LLMs. Together, they make an ideal choice for large-scale, hassle-free data acquisition.

If you want to read more about utilizing AI for web scraping, check out the related articles below.


About the author


Roberta Aukstikalnyte

Former Senior Content Manager

Roberta Aukstikalnyte was a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.

Related articles

How to Build a RAG Chatbot with OpenAI and Web Scraping: Step-by-Step Guide
Vytenis Kaubrė
2025-09-19

Web Scraping with Claude AI: Python Guide
Agnė Matusevičiūtė
2025-09-16

How to Scrape Google AI Mode in 2025
Akvilė Lūžaitė
2025-09-08
