Web Scraping With LangChain & Oxylabs API



Roberta Aukstikalnyte
Last updated by Vytenis Kaubrė
2025-04-16
4 min read
LangChain is a robust framework designed for building AI applications that integrate Large Language Models (LLMs) with external data sources, workflows, and APIs.
By combining LangChain’s seamless pipeline capabilities with a tool like the Web Scraper API, you can collect public web data, all while avoiding common scraping-related hurdles that can disrupt your processes.
In today’s article, we’ll demonstrate how to easily integrate the LangChain framework with Web Scraper API. By the end, you’ll be able to retrieve structured web data with minimal development effort. Let’s get started!
Any conversation about web data extraction inevitably turns into a discussion of its challenges. In practice, developers often run into problems that can complicate or disrupt the process – let’s take a closer look:
IP blocking and rate limiting
Websites often detect and block repeated requests from the same IP address to prevent automated scraping. They may also impose rate limits, capping the number of requests you can make within a specific time frame. Without proper measures, these restrictions can disrupt data collection.
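A common mitigation for rate limits is exponential backoff: wait progressively longer between retries when the server starts throttling you. Here’s a minimal sketch of that policy; the `get` parameter stands in for any request callable (such as `requests.get`), and the status codes checked are just the usual throttling/blocking responses:

```python
import time

def fetch_with_backoff(get, url, max_retries=5, base_delay=1.0):
    """Call get(url), retrying with exponential backoff on 429/403 responses.

    `get` is any callable returning an object with a `status_code`
    attribute (e.g. requests.get), which keeps the policy easy to test.
    """
    for attempt in range(max_retries):
        response = get(url)
        if response.status_code not in (429, 403):
            return response
        # Wait base_delay * 1, 2, 4, ... seconds before retrying
        time.sleep(base_delay * (2 ** attempt))
    return None
```

A managed solution like Web Scraper API handles this for you, but the sketch shows what you’d otherwise maintain yourself.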
CAPTCHAs and other anti-scraping mechanisms
Websites implement CAPTCHAs and other anti-bot technologies to distinguish between human users and automated scripts. Bypassing these defenses requires sophisticated tools or external CAPTCHA-solving services, adding complexity and cost.
Large-scale scraping
As scraping projects grow, handling large volumes of data efficiently becomes a challenge. This includes managing storage, ensuring fast processing, and maintaining reliable infrastructure to handle numerous concurrent requests.
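Handling numerous concurrent requests usually means fanning work out over a pool of workers. As a minimal sketch, a thread pool can run a single-URL scrape function over many URLs at once; the `scrape` parameter here is a placeholder for any scraper, such as the `scrape_website` function built later in this article:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_many(urls, scrape, max_workers=8):
    """Apply a single-URL scrape function to many URLs concurrently.

    pool.map preserves input order, so results line up with urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape, urls))
```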
Recently, frameworks like LangChain have introduced a new paradigm, leveraging large language models (LLMs) to interpret and process data in a more dynamic and context-aware manner.
Let’s take a look at a comparison of LangChain and regular web scraping methods across a few dimensions:
Data extraction and processing
LangChain: Combines web scraping with LLMs for automated data analysis and insights generation. Ideal for workflows that need both data extraction and AI-driven processing.
Regular scraping: Web scraping tools like Scrapy, BeautifulSoup, and Puppeteer are focused solely on data collection. Post-processing requires separate tools and scripts.
Dynamic content handling
LangChain: When paired with APIs like Oxylabs, it seamlessly handles JavaScript-rendered content and bypasses anti-scraping measures.
Regular scraping: Requires additional tools like Selenium or Puppeteer to handle dynamic content, which can increase complexity.
Built-in AI capabilities
LangChain: Built-in LLM integration allows for immediate tasks like summarization, sentiment analysis, and pattern recognition.
Regular scraping: Data analysis and transformation require custom scripts or separate libraries, adding more steps to the workflow.
Error handling
LangChain: Automatically manages challenges like CAPTCHAs, IP bans, and failed requests via integrated API solutions.
Regular scraping: Requires manual error handling with retries, proxy management, or third-party CAPTCHA-solving tools.
Scalability
LangChain: Scales efficiently, automating the entire pipeline from scraping to actionable insights.
Regular scraping: Scalability is possible with frameworks like Scrapy but often requires additional configurations and custom setups.
Ease of use
LangChain: Simplifies complex workflows, making it easier to integrate advanced features like AI with minimal setup.
Regular scraping: Requires more technical knowledge and effort to build and maintain robust scrapers.
All in all...
Choose LangChain for projects that benefit from seamless integration of data scraping and AI-driven analysis. Opt for regular web scraping when you need full control over the scraping process or for simpler, standalone data extraction tasks.
While the possibilities are endless, let’s focus on a simple one for today – we’ll write a script that scrapes a page using Web Scraper API and then interprets extracted information with the help of LangChain and LLMs. There are two ways to integrate Web Scraper API:
By using the langchain-oxylabs tool to scrape Google search results
By directly calling the API to scrape any URL and passing the data to LangChain
Let’s see both methods in action.
First off, install the necessary libraries. For the sake of this tutorial, we’ll use GPT models provided by OpenAI, but you can use any model supported by the LangChain framework. Run the following pip command:
pip install -qU langchain-oxylabs langchain-openai langgraph requests python-dotenv
Next, save your Oxylabs credentials and the LLM key as environment variables. You can easily do so by creating a .env file in your project’s directory and storing the authentication details as shown below:
OXYLABS_USERNAME=your-username
OXYLABS_PASSWORD=your-password
OPENAI_API_KEY=your-openai-key
You may also save environment variables system-wide using your terminal and thus skip the dotenv library altogether.
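For instance, on Linux or macOS you can export the variables in your shell (or add these lines to your shell profile to make them persistent), after which `os.getenv()` will pick them up without a .env file:

```shell
# Set the credentials for the current shell session
export OXYLABS_USERNAME=your-username
export OXYLABS_PASSWORD=your-password
export OPENAI_API_KEY=your-openai-key
```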
The langchain-oxylabs package enables LLMs to scrape Google search results in real time and avoid any blocks along the way. Here’s a basic implementation:
import os
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langgraph.prebuilt import create_react_agent
from langchain_oxylabs import OxylabsSearchAPIWrapper, OxylabsSearchRun

load_dotenv()

# Initialize your preferred LLM model
llm = init_chat_model(
    "gpt-4o-mini",
    model_provider="openai",
    api_key=os.getenv("OPENAI_API_KEY")
)

# Initialize the Google search tool
search = OxylabsSearchRun(
    wrapper=OxylabsSearchAPIWrapper(
        oxylabs_username=os.getenv("OXYLABS_USERNAME"),
        oxylabs_password=os.getenv("OXYLABS_PASSWORD")
    )
)

# Create an agent that uses the Google search tool
agent = create_react_agent(llm, [search])

user_input = "When and why did the Maya civilization collapse?"

# Invoke the agent
result = agent.invoke({"messages": user_input})

# Print the agent's response
print(result["messages"][-1].content)
When executed, the AI agent will send a request to Web Scraper API with the search term, analyze the scraped data, and then answer the user’s question.
If you want more control over the scraper and access to other websites, you can manually set up requests to the API. Let’s see a basic function that will connect to the Oxylabs Web Scraper API and scrape information from a requested page:
import os
import requests
from dotenv import load_dotenv
load_dotenv()
def scrape_website(url):
    payload = {
        "source": "universal",
        "url": url,
        "parse": True
    }
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(os.getenv("OXYLABS_USERNAME"), os.getenv("OXYLABS_PASSWORD")),
        json=payload
    )
    if response.status_code == 200:
        data = response.json()
        content = data["results"][0]["content"]
        print(content)
        return str(content)
    else:
        print(f"Failed to scrape website: {response.text}")
        return None

scrape_website("https://sandbox.oxylabs.io/products/1")
Note: The script utilizes automatic result parsing provided by the Oxylabs Web Scraper API to make our lives easier. If you want to read more about the capabilities of the API, head over to the documentation.
When you run the function, you can see the contents of the response printed out:
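Since the API returns parsed content as structured JSON, you may want to pick out just a few fields before handing the data to an LLM, which keeps prompts short and cheap. Here’s a small sketch; the exact keys depend on the target page and parser, so "title", "price", and "description" are illustrative assumptions rather than guaranteed field names:

```python
def summarize_parsed(content):
    """Flatten a few likely fields from parsed JSON content into text.

    Falls back to the raw string when the content isn't a dict or
    none of the expected keys are present.
    """
    if not isinstance(content, dict):
        return str(content)
    parts = []
    for key in ("title", "price", "description"):
        if key in content:
            parts.append(f"{key}: {content[key]}")
    return "\n".join(parts) if parts else str(content)
```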
Now that you have some data scraped, let’s get all the data processed by an LLM; this is something LangChain can help us with.
You can write a simple function that accepts string content as a parameter, then formats that content into a provided prompt template and sends it for interpretation:
import os
import requests
from dotenv import load_dotenv
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
load_dotenv()
prompt = PromptTemplate.from_template(
    "Analyze the following website content and summarize key points: {content}"
)

chain = prompt | OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def process_content(content):
    if not content:
        print("No content to process.")
        return None
    result = chain.invoke({"content": content})
    return result
Feel free to adjust the LLM’s interpretation by modifying the template in the prompt variable, which is passed to the LLM along with the content.
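For example, here’s a hypothetical alternative template that asks for sentiment instead of a summary – you’d swap this string into PromptTemplate.from_template(). The template uses the same {placeholder} syntax as Python’s str.format, which makes it easy to preview what the LLM will receive:

```python
# Hypothetical alternative prompt: sentiment analysis instead of a summary
sentiment_template = (
    "Read the following website content and report the overall sentiment "
    "(positive, neutral, or negative) with a one-line justification: {content}"
)

# Preview the filled-in prompt using plain string formatting
preview = sentiment_template.format(content="Great product, fast shipping!")
```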
Joining up everything into the main() function, your final code should look something like this:
import os
import requests
from dotenv import load_dotenv
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate

load_dotenv()

# Define the prompt template
prompt = PromptTemplate.from_template(
    "Analyze the following website content and summarize key points: {content}"
)

# Compose the chain using the prompt and the LLM
chain = prompt | OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def scrape_website(url):
    """Scrape the website using Oxylabs Web Scraper API"""
    payload = {
        "source": "universal",
        "url": url,
        "parse": True
    }
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(os.getenv("OXYLABS_USERNAME"), os.getenv("OXYLABS_PASSWORD")),
        json=payload
    )
    if response.status_code == 200:
        data = response.json()
        content = data["results"][0]["content"]
        print(content)
        return str(content)
    else:
        print(f"Failed to scrape website: {response.text}")
        return None

def process_content(content):
    """Process the scraped content using LangChain"""
    if not content:
        print("No content to process.")
        return None
    result = chain.invoke({"content": content})
    return result

def main(url):
    print("Scraping website...")
    scraped_content = scrape_website(url)
    if scraped_content:
        print("Processing scraped content with LangChain...")
        analysis = process_content(scraped_content)
        print("\nProcessed Analysis:\n", analysis)
    else:
        print("No content scraped.")

# Example URL to scrape
url = "https://sandbox.oxylabs.io/products/1"
main(url)
When you run the code, you can see the AI-generated summary of the scraped page:
Combining Oxylabs' Web Scraper API with LangChain makes an excellent solution for efficient web scraping and AI-driven analysis. The API handles common challenges like IP blocking and CAPTCHAs, while LangChain enables real-time data processing with LLMs. Together, they make an ideal choice for large-scale, hassle-free data acquisition.
If you want to read more on the topic of utilizing AI for web scraping, check out the related posts on the Oxylabs blog.
About the author
Roberta Aukstikalnyte
Senior Content Manager
Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.