Make sure you have the following before starting:
Web Scraper API credentials
LLM provider API key – we’ll use OpenAI
Python (download here) – this guide uses Python 3.13.2
Our Web Scraper API is available through the llama-index-readers-oxylabs and llama-index-readers-web packages, simplifying the process of connecting web data with LlamaIndex. Run this pip command to install the necessary libraries:
pip install -qU llama-index llama-index-readers-oxylabs llama-index-readers-web
Create a .env file in your Python project’s directory to store your Oxylabs Web Scraper API credentials and OpenAI API key:
OXYLABS_USERNAME=your_API_username
OXYLABS_PASSWORD=your_API_password
OPENAI_API_KEY=your-openai-key
Alternatively, you can set environment variables through the terminal or directly embed your authentication credentials within the code.
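If you go the terminal route, export the variables in your shell session before running your scripts (the names match the .env file above; values are placeholders):

```shell
# Set the credentials for the current shell session (placeholder values)
export OXYLABS_USERNAME=your_API_username
export OXYLABS_PASSWORD=your_API_password
export OPENAI_API_KEY=your-openai-key

# Verify they're visible to child processes such as the Python interpreter
echo "$OXYLABS_USERNAME"
```

Note that variables exported this way only last for the current session; the .env file approach persists across runs.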
There are two ways to access web content via Web Scraper API in LlamaIndex: the Oxylabs Reader (from the llama-index-readers-oxylabs package) or the Oxylabs Web Reader (from the llama-index-readers-web package).
The llama-index-readers-oxylabs module contains specific classes that enable you to scrape data from Google Search, Amazon, and YouTube:
API Data Source | Reader Class
---|---
Google Web Search | OxylabsGoogleSearchReader
Google Search Ads | OxylabsGoogleAdsReader
Amazon Product | OxylabsAmazonProductReader
Amazon Search | OxylabsAmazonSearchReader
Amazon Pricing | OxylabsAmazonPricingReader
Amazon Sellers | OxylabsAmazonSellersReader
Amazon Best Sellers | OxylabsAmazonBestsellersReader
Amazon Reviews | OxylabsAmazonReviewsReader
YouTube Transcript | OxylabsYoutubeTranscriptReader
For example, you can extract Google search results as shown below:
import os
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader

load_dotenv()

reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data({
    'query': 'best pancake recipe',
    'parse': True
})

print(results[0].text)
When executed, this prints the parsed search results to your console.
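The other readers in the table follow the same dict-style payload pattern. As a rough sketch for the Amazon product reader (the field values below are illustrative, not verified against every source; check the Web Scraper API documentation for the exact parameters each target supports):

```python
# Hedged sketch: a payload for OxylabsAmazonProductReader, assuming it takes
# the same dict-style payload as the Google reader. For Amazon product pages,
# 'query' holds the product ASIN.
payload = {
    'query': 'B07FZ8S74R',  # example ASIN, hypothetical
    'domain': 'com',        # Amazon marketplace domain
    'parse': True,          # request structured, parsed data
}

# Hypothetical usage (requires your API credentials):
# from llama_index.readers.oxylabs import OxylabsAmazonProductReader
# reader = OxylabsAmazonProductReader(username, password)
# results = reader.load_data(payload)
print(payload)
```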
With the OxylabsWebReader class available in the llama-index-readers-web module, you can extract data from any URL while bypassing most anti-scraping measures:
import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader

load_dotenv()

reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1',
        'https://sandbox.oxylabs.io/products/2'
    ]
)

for result in results:
    print(result.text + '\n')
Additionally, you can include supported API parameters if needed, such as geolocation, JavaScript rendering, user agent type, and others. This approach returns scraped results in a Markdown format that's ready for AI processing.
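As a rough illustration of those extra parameters (the names follow the Web Scraper API conventions, but support varies per target and the exact way the reader forwards them may differ, so treat this as a sketch):

```python
# Illustrative set of extra Web Scraper API parameters — whether each applies
# depends on the target site; consult the API docs before relying on them.
extra_params = {
    'render': 'html',              # execute JavaScript before returning the page
    'geo_location': 'Germany',     # serve the request from a specific location
    'user_agent_type': 'desktop',  # device profile used for the request
}

# Hypothetical usage with the reader from the snippet above — the actual
# load_data() signature for passing extra parameters may differ:
# results = reader.load_data(urls, extra_params)
print(sorted(extra_params))
```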
There are countless ways to use real-time web data in AI workflows, so let's stick with a simple use case: an AI agent that can search Google and answer questions.
The simplest approach is to let the AI agent call the Oxylabs reader dynamically. To do this, create a web_search() function that lets the LLM supply a search query based on the user's question; the agent can then use this function as a tool to scrape Google search results. Here's the full code example:
import os
import asyncio
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

load_dotenv()

reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

def web_search(query: str) -> str:
    results = reader.load_data({'query': query, 'parse': True})
    return results[0].text

agent = FunctionAgent(
    tools=[web_search],
    llm=OpenAI(model='gpt-4o-mini'),
    max_function_calls=1,
    system_prompt=(
        'Craft a short Google search query to use with the `web_search` tool. '
        'Analyze the most relevant results and answer the question.'
    )
)

async def main():
    response = await agent.run('How did DeepSeek affect the stock market?')
    print(response)

if __name__ == '__main__':
    asyncio.run(main())
Note the max_function_calls parameter, which caps the agent at a single tool call to prevent excessive LLM usage while still producing the same results. Feel free to remove this parameter if you're encountering issues.
Running this code outputs the agent's answer, grounded in fresh Google search results.
Use this guide as your launchpad to develop web-powered AI applications that take advantage of both Oxylabs' scraping capabilities and LlamaIndex's flexibility for practical and real-world solutions. If you have any questions about Oxylabs solutions, don’t hesitate to contact our support team via live chat or email.
Interested in building AI agents but not sure which platform is the right one for you? Check out our overview of the 6 Best AI Agent Frameworks.
Please be aware that this is a third-party tool not owned or controlled by Oxylabs. Each third-party provider is responsible for its own software and services. Consequently, Oxylabs will have no liability or responsibility to you regarding those services. Please carefully review the third party's policies and practices and/or conduct due diligence before accessing or using third-party services.
LlamaIndex is a context augmentation framework that streamlines LLM application development for various use cases, including Q&A bots, chatbots, autonomous agents, multi-modal apps, and others. By connecting LLMs with external data, such as documents, APIs, and databases, you can ensure accurate and contextually rich AI workflows.
For Python, install LlamaIndex via terminal with pip install llama-index. For Node.js, TypeScript, Vite, Next.js, or Cloudflare Workers, use npm i llamaindex. You may also use alternative package managers like pnpm, yarn, or bun.
In short, LlamaIndex supplies LLMs with additional knowledge from various external data sources, while CrewAI manages and coordinates multiple AI agents that work together on complex tasks. Take a look at this CrewAI integration with Web Scraper API if you’re interested in building LLM apps using CrewAI.