
Web Scraping With LangChain & Oxylabs API

Roberta Aukstikalnyte

2024-11-19 · 4 min read

LangChain is a robust framework designed for building AI applications that integrate Large Language Models (LLMs) with external data sources, workflows, and APIs.
By combining LangChain’s seamless pipeline capabilities with a tool like the Web Scraper API, you can collect public web data while avoiding common scraping-related hurdles that can disrupt your process.

In today’s article, we’ll demonstrate how to integrate the LangChain framework with Web Scraper API with ease. By the end of this article, you’ll be able to retrieve structured web data with minimal development effort. Let’s get started!

Common web scraping challenges 

When talking about web data extraction, its challenges inevitably come up. In practice, developers often run into problems that can complicate or disrupt the process – let’s take a closer look:

  • IP blocking and rate limiting

Websites often detect and block repeated requests from the same IP address to prevent automated scraping. They may also impose rate limits, capping the number of requests you can make within a specific time frame. Without proper measures, these restrictions can disrupt data collection.

  • CAPTCHAs and other anti-scraping mechanisms

Websites implement CAPTCHAs and other anti-bot technologies to distinguish between human users and automated scripts. Bypassing these defenses requires sophisticated tools or external CAPTCHA-solving services, adding complexity and cost.

  • Large-scale scraping

As scraping projects grow, handling large volumes of data efficiently becomes a challenge. This includes managing storage, ensuring fast processing, and maintaining reliable infrastructure to handle numerous concurrent requests.
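To make the rate-limiting challenge concrete, here's a minimal retry sketch with exponential backoff and jitter, a common way to recover from HTTP 429 responses. All names here (`RateLimited`, `fetch_with_backoff`, `base_delay`) are illustrative, not part of any specific library:

```python
import random
import time


class RateLimited(Exception):
    """Raised by `fetch` when the target site responds with HTTP 429."""


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call `fetch()` and retry with exponential backoff on rate-limit errors.

    `fetch` is any zero-argument callable that raises RateLimited when
    the server rejects the request for sending too many requests.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimited:
            # Wait base_delay, 2x, 4x, ... plus random jitter before retrying.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
    raise RuntimeError("Gave up after repeated rate-limit responses")
```

A managed API such as Web Scraper API handles retries, proxy rotation, and throttling for you, which is exactly the boilerplate this sketch exists to illustrate.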

LangChain vs. traditional scraping methods

Recently, frameworks like LangChain have introduced a new paradigm, leveraging large language models (LLMs) to interpret and process data in a more dynamic and context-aware manner.

Let’s take a look at a comparison of LangChain and traditional web scraping across a few dimensions:

1. Purpose and use case 

  • LangChain: Combines web scraping with LLMs for automated data analysis and insights generation. Ideal for workflows that need both data extraction and AI-driven processing.

  • Traditional methods: Web scraping tools like Scrapy, BeautifulSoup, and Puppeteer are focused solely on data collection. Post-processing requires separate tools and scripts.

2. Handling dynamic content 

  • LangChain: When paired with APIs like Oxylabs, it seamlessly handles JavaScript-rendered content and bypasses anti-scraping measures.

  • Traditional methods: Requires additional tools like Selenium or Puppeteer to handle dynamic content, which can increase complexity.

3. Data post-processing

  • LangChain: Built-in LLM integration allows for immediate tasks like summarization, sentiment analysis, and pattern recognition.

  • Traditional methods: Data analysis and transformation require custom scripts or separate libraries, adding more steps to the workflow.

4. Error handling and reliability

  • LangChain: Automatically manages challenges like captchas, IP bans, and failed requests via integrated API solutions.

  • Traditional methods: Requires manual error handling with retries, proxy management, or third-party captcha-solving tools.

5. Scalability and workflow automation

  • LangChain: Scales efficiently, automating the entire pipeline from scraping to actionable insights.

  • Traditional methods: Scalability is possible with frameworks like Scrapy but often requires additional configurations and custom setups.

6. Ease of use

  • LangChain: Simplifies complex workflows, making it easier to integrate advanced features like AI with minimal setup.

  • Traditional methods: Requires more technical knowledge and effort to build and maintain robust scrapers.

All in all...

Choose LangChain for projects that benefit from seamless integration of data scraping and AI-driven analysis. Opt for traditional methods when you need full control over the scraping process or for simpler, standalone data extraction tasks.

Using Oxylabs Web Scraping API and LangChain

While the possibilities are endless, let’s focus on a simple one for today – we’ll write a script that scrapes a page from the Oxylabs Sandbox using Web Scraper API. Then, the script will interpret that information with the help of LangChain and LLMs. Let’s begin! 

Setting up the environment

First off, we’ll need to install a few libraries that we’re going to work with today. We can do that by using pip:

pip install langchain openai requests

Scraping with Oxylabs Web Scraper API

With the libraries installed, we can move on to creating a basic function that connects to the Oxylabs Web Scraper API and scrapes information from a requested page:

import requests

OXYLABS_ENDPOINT = "https://realtime.oxylabs.io/v1/queries"
OXYLABS_AUTH = ('username', 'password')


def scrape_website(url):
    # Request the page through the Web Scraper API with automatic parsing enabled.
    payload = {
        'source': 'universal',
        'url': url,
        'parse': True
    }

    response = requests.post(
        OXYLABS_ENDPOINT,
        auth=OXYLABS_AUTH,
        json=payload,
    )

    if response.status_code == 200:
        data = response.json()
        print(data["results"][0]["content"])
        return str(data["results"][0]["content"])
    else:
        print(f"Failed to scrape website: {response.text}")
        return None

Note: We’ll be utilizing automatic result parsing provided by the Oxylabs Web Scraper API to make our lives easier. If you want to read more about the capabilities of the API, head over to the docs here.

If we run the function, we’ll see the contents of the response printed out:
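The snippet below sketches the shape of the JSON that `scrape_website` drills into. The outer `results` / `content` keys match the code above; the fields inside `content` are purely illustrative, since the parsed output depends on the page and parser:

```python
# Illustrative shape of a Web Scraper API response; the fields inside
# "content" are hypothetical placeholders, not a guaranteed schema.
sample_response = {
    "results": [
        {
            "content": {
                "url": "https://sandbox.oxylabs.io/products/1",
                "title": "...",
                "price": "...",
            },
            "status_code": 200,
        }
    ]
}

# scrape_website() returns the first result's parsed content:
content = sample_response["results"][0]["content"]
print(content["url"])
```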

Utilizing LangChain to interpret data

Now that we have some data scraped, we want it processed by an LLM – something LangChain can help us with.

Of course, we’ll need to connect to an LLM. For this tutorial, we’ll be using GPT models provided by OpenAI, but you can use any model supported by the LangChain framework.

Let’s write a simple function that accepts string content as a parameter, then formats that content into a provided prompt template and sends it for interpretation:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

OPENAI_API_KEY = "your-api-key"
openai_model = OpenAI(api_key=OPENAI_API_KEY)

prompt_template = PromptTemplate(
    input_variables=["content"],
    template="Analyze the following website content and summarize key points: {content}"
)


def process_content(content):
    if not content:
        print("No content to process.")
        return None

    chain = LLMChain(llm=openai_model, prompt=prompt_template)
    result = chain.run(content)

    return result

You can steer the LLM’s interpretation by modifying the template in prompt_template, which is passed to the LLM together with the scraped content.
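For instance, swapping the template changes the task without touching the rest of the pipeline. The sketch below uses plain string formatting to show what PromptTemplate does with the {content} placeholder; the template text itself is just an example:

```python
# An alternative template - same {content} placeholder, different task.
sentiment_template = (
    "Classify the overall sentiment of the following website content "
    "as positive, neutral, or negative: {content}"
)

# PromptTemplate fills the placeholder the same way str.format does:
filled = sentiment_template.format(content="Great product, fast delivery!")
print(filled)
```

Passing `template=sentiment_template` to PromptTemplate would turn the same chain into a sentiment classifier instead of a summarizer.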

Final result

Putting together everything we’ve built so far, we should get something like this:

import requests
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Oxylabs Web Scraper API endpoint and credentials
OXYLABS_ENDPOINT = "https://realtime.oxylabs.io/v1/queries"
OXYLABS_AUTH = ('username', 'password')

# LangChain setup with OpenAI
OPENAI_API_KEY = "your-api-key"
openai_model = OpenAI(api_key=OPENAI_API_KEY)

# Define a LangChain prompt template for processing scraped content
prompt_template = PromptTemplate(
    input_variables=["content"],
    template="Analyze the following website content and summarize key points: {content}"
)


# Function to send a request to the Oxylabs Web Scraper API
def scrape_website(url):
    payload = {
        'source': 'universal',
        'url': url,
        'parse': True
    }

    response = requests.post(
        OXYLABS_ENDPOINT,
        auth=OXYLABS_AUTH,
        json=payload,
    )

    if response.status_code == 200:
        data = response.json()
        print(data["results"][0]["content"])
        return str(data["results"][0]["content"])
    else:
        print(f"Failed to scrape website: {response.text}")
        return None


# Function to process scraped content with LangChain
def process_content(content):
    if not content:
        print("No content to process.")
        return None

    chain = LLMChain(llm=openai_model, prompt=prompt_template)
    result = chain.run(content)

    return result


# Main function to scrape a URL and process it
def main(url):
    print("Scraping website...")
    scraped_content = scrape_website(url)

    if scraped_content:
        print("Processing scraped content with LangChain...")
        analysis = process_content(scraped_content)

        print("\nProcessed Analysis:\n", analysis)
    else:
        print("No content scraped.")


# Example URL to scrape
url = "https://sandbox.oxylabs.io/products/1"
main(url)

If we run the code, we can see the summary:

Conclusion

Combining Oxylabs' Web Scraper API with LangChain makes an excellent solution for efficient web scraping and AI-driven analysis. The API handles common challenges like IP blocking and CAPTCHAs, while LangChain enables real-time data processing with LLMs. Together, they make an ideal choice for large-scale, hassle-free data acquisition.

If you want to read more on the topic of utilizing AI for web scraping, check out our other blog posts on the subject.

About the author

Roberta Aukstikalnyte

Senior Content Manager

Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
