Project requirements

Make sure you have the following before starting:

  • Web Scraper API credentials

  • LLM provider API key – we’ll use OpenAI

  • Python – this guide uses Python 3.13.2

Free trial for Web Scraper API

  • 5K results

  • No credit card needed

1. Install libraries

Our Web Scraper API is available through the llama-index-readers-oxylabs and llama-index-readers-web packages, simplifying the process of connecting web data with LlamaIndex. Run this pip command to install the necessary libraries:

pip install -qU llama-index llama-index-readers-oxylabs llama-index-readers-web

2. Set up environment variables

Create a .env file in your Python project’s directory to store your Oxylabs Web Scraper API credentials and OpenAI API key:

OXYLABS_USERNAME=your_API_username
OXYLABS_PASSWORD=your_API_password
OPENAI_API_KEY=your-openai-key

Alternatively, you can set environment variables through the terminal or directly embed your authentication credentials within the code.
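For instance, here's a quick way to set the same variables from within Python itself (the values below are placeholders; avoid hard-coding real credentials in code you share or commit):

```python
import os

# Set the credentials for the current process instead of using a .env file.
# Replace the placeholder values with your actual credentials.
os.environ['OXYLABS_USERNAME'] = 'your_API_username'
os.environ['OXYLABS_PASSWORD'] = 'your_API_password'
os.environ['OPENAI_API_KEY'] = 'your-openai-key'

# Subsequent os.getenv() calls in the same process will pick these up.
print(os.getenv('OXYLABS_USERNAME'))
```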

3. Using Oxylabs Web Scraper API in LlamaIndex

There are two ways to access web content via Web Scraper API in LlamaIndex: using the Oxylabs Reader or the Oxylabs Web Reader.

3.1 Oxylabs Reader

The llama-index-readers-oxylabs module contains specific classes that enable you to scrape data from Google Search, Amazon, and YouTube:

  • Google Web Search – OxylabsGoogleSearchReader

  • Google Search Ads – OxylabsGoogleAdsReader

  • Amazon Product – OxylabsAmazonProductReader

  • Amazon Search – OxylabsAmazonSearchReader

  • Amazon Pricing – OxylabsAmazonPricingReader

  • Amazon Sellers – OxylabsAmazonSellersReader

  • Amazon Best Sellers – OxylabsAmazonBestsellersReader

  • Amazon Reviews – OxylabsAmazonReviewsReader

  • YouTube Transcript – OxylabsYoutubeTranscriptReader

For example, you can extract Google search results as shown below:

import os
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader


load_dotenv()

reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data({
    'query': 'best pancake recipe',
    'parse': True
})

print(results[0].text)

When executed, you should see a similar result in your console:

Scraped Google search results printed in the console

3.2 Oxylabs Web Reader

With the OxylabsWebReader class available in the llama-index-readers-web module, you can extract data from any URL while bypassing most anti-scraping measures:

import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader


load_dotenv()

reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1',
        'https://sandbox.oxylabs.io/products/2'
    ]
)

for result in results:
    print(result.text + '\n')

Additionally, you can pass supported API parameters when needed, such as geolocation, JavaScript rendering, and user agent type. This approach returns the scraped results in Markdown, ready for AI processing, for example:

Scraped product data printed in the console
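As a sketch of the parameter support mentioned above, the payload could look like the dictionary below. The parameter names follow the Oxylabs Web Scraper API documentation, but how a given reader version forwards them (for example, as keyword arguments to load_data) may differ, so check the reader's documentation before relying on this exact shape:

```python
# Extra Web Scraper API parameters to forward with each request.
params = {
    'geo_location': 'Berlin, Germany',  # route requests through a specific location
    'render': 'html',                   # execute JavaScript before returning the page
    'user_agent_type': 'desktop',       # present a desktop browser user agent
}

# With a configured reader and valid credentials, the parameters would
# accompany the URL list, e.g. (assumed keyword passing):
# results = reader.load_data(['https://sandbox.oxylabs.io/products/1'], **params)
```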

4. Building a basic AI search agent

There are countless ways to use real-time web data in AI workflows, so let's stick with a simple use case: an AI agent that can search Google and answer questions.

The most flexible approach is to let the AI agent call the Oxylabs reader dynamically. To do so, create a web_search() function that lets the LLM supply a search query based on the user's question. The AI agent can then use this function as a tool to scrape Google search results. Here's the full code example:

import os
import asyncio

from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI


load_dotenv()

reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

def web_search(query: str) -> str:
    results = reader.load_data({'query': query, 'parse': True})
    return results[0].text

agent = FunctionAgent(
    tools=[web_search],
    llm=OpenAI(model='gpt-4o-mini'),
    max_function_calls=1,
    system_prompt=(
        'Craft a short Google search query to use with the `web_search` tool. '
        'Analyze the most relevant results and answer the question.'
    )
)

async def main():
    response = await agent.run('How did DeepSeek affect the stock market?')
    print(response)


if __name__ == '__main__':
    asyncio.run(main())

Note the max_function_calls parameter, which limits the agent to a single tool call to avoid unnecessary LLM usage. Feel free to remove it if the agent needs more than one search to answer reliably.

Running this code should print the agent's answer, composed from fresh Google search results.

Final thoughts

Use this guide as your launchpad to develop web-powered AI applications that take advantage of both Oxylabs' scraping capabilities and LlamaIndex's flexibility to build practical, real-world solutions. If you have any questions about Oxylabs solutions, don't hesitate to contact our support team via live chat or email.

Interested in building AI agents but not sure which platform is the right one for you? Check out our overview of the 6 Best AI Agent Frameworks.

Please be aware that this is a third-party tool not owned or controlled by Oxylabs. Each third-party provider is responsible for its own software and services. Consequently, Oxylabs will have no liability or responsibility to you regarding those services. Please carefully review the third party's policies and practices and/or conduct due diligence before accessing or using third-party services.

Frequently Asked Questions

What is LlamaIndex?

LlamaIndex is a context augmentation framework that streamlines LLM application development for various use cases, including Q&A bots, chatbots, autonomous agents, multi-modal apps, and others. By connecting LLMs with external data, such as documents, APIs, and databases, you can ensure accurate and contextually rich AI workflows.

How to install LlamaIndex?

For Python, install LlamaIndex via terminal with pip install llama-index. For Node.js, TypeScript, Vite, Next.js, or Cloudflare Workers, use npm i llamaindex. You may also use alternative package managers like pnpm, yarn, or bun.

What are the main differences between CrewAI and LlamaIndex?

In short, LlamaIndex supplies LLMs with additional knowledge from various external data sources, while CrewAI manages and coordinates multiple AI agents that work together on complex tasks. Take a look at this CrewAI integration with Web Scraper API if you’re interested in building LLM apps using CrewAI.
