Project requirements

Make sure you have the following before starting:

  • Web Scraper API credentials

  • LLM provider API key – we’ll use OpenAI

  • Python – this guide uses Python 3.13.2

Free trial for Web Scraper API

  • 5K results

  • No credit card needed

1. Install libraries

Our Web Scraper API is available through the llama-index-readers-oxylabs and llama-index-readers-web packages, simplifying the process of connecting web data with LlamaIndex. Run this pip command to install the necessary libraries:

pip install -qU llama-index llama-index-readers-oxylabs llama-index-readers-web

2. Set up environment variables

Create a .env file in your Python project’s directory to store your Oxylabs Web Scraper API credentials and OpenAI API key:

OXYLABS_USERNAME=your_API_username
OXYLABS_PASSWORD=your_API_password
OPENAI_API_KEY=your-openai-key

Alternatively, you can set environment variables through the terminal or directly embed your authentication credentials within the code.
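For instance, here's a quick way to set the same variables from within Python itself (the values below are placeholders; avoid hard-coding real credentials in code you share or commit):

```python
import os

# Set the credentials for the current process instead of using a .env file.
# Replace the placeholder values with your actual credentials.
os.environ['OXYLABS_USERNAME'] = 'your_API_username'
os.environ['OXYLABS_PASSWORD'] = 'your_API_password'
os.environ['OPENAI_API_KEY'] = 'your-openai-key'

# Subsequent os.getenv() calls in the same process will pick these up.
print(os.getenv('OXYLABS_USERNAME'))
```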

3. Using Oxylabs Web Scraper API in LlamaIndex

There are two ways to access web content via Web Scraper API in LlamaIndex: using the Oxylabs Reader or the Oxylabs Web Reader.

3.1 Oxylabs Reader

The llama-index-readers-oxylabs module contains specific classes that enable you to scrape data from Google Search, Amazon, and YouTube:

  • Google Web Search – OxylabsGoogleSearchReader

  • Google Search Ads – OxylabsGoogleAdsReader

  • Amazon Product – OxylabsAmazonProductReader

  • Amazon Search – OxylabsAmazonSearchReader

  • Amazon Pricing – OxylabsAmazonPricingReader

  • Amazon Sellers – OxylabsAmazonSellersReader

  • Amazon Best Sellers – OxylabsAmazonBestsellersReader

  • Amazon Reviews – OxylabsAmazonReviewsReader

  • YouTube Transcript – OxylabsYoutubeTranscriptReader

For example, you can extract Google search results as shown below:

import os
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader


load_dotenv()

reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data({
    'query': 'best pancake recipe',
    'parse': True
})

print(results[0].text)

When executed, you should see a similar result in your console:

Scraped Google search results printed in the console

3.2 Oxylabs Web Reader

With the OxylabsWebReader class available in the llama-index-readers-web module, you can extract data from any URL while bypassing most anti-scraping measures:

import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader


load_dotenv()

reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1',
        'https://sandbox.oxylabs.io/products/2'
    ]
)

for result in results:
    print(result.text + '\n')

Additionally, you can pass supported API parameters when needed, such as geolocation, JavaScript rendering, and user agent type. This approach returns the scraped results in Markdown, ready for AI processing, for example:

Scraped product data printed in the console
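As a sketch of the parameter support mentioned above, the payload could look like the dictionary below. The parameter names follow the Oxylabs Web Scraper API documentation, but how a given reader version forwards them (for example, as keyword arguments to load_data) may differ, so check the reader's documentation before relying on this exact shape:

```python
# Extra Web Scraper API parameters to forward with each request.
params = {
    'geo_location': 'Berlin, Germany',  # route requests through a specific location
    'render': 'html',                   # execute JavaScript before returning the page
    'user_agent_type': 'desktop',       # present a desktop browser user agent
}

# With a configured reader and valid credentials, the parameters would
# accompany the URL list, e.g. (assumed keyword passing):
# results = reader.load_data(['https://sandbox.oxylabs.io/products/1'], **params)
```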

4. Building a basic AI search agent

There are countless ways to use real-time web data in AI workflows, so let's stick with a simple use case: an AI agent that can search Google and answer questions.

The most flexible approach is to let the AI agent call the Oxylabs reader dynamically. To do so, create a web_search() function that lets the LLM supply a search query based on the user's question. The AI agent can then use this function as a tool to scrape Google search results. Here's the full code example:

import os
import asyncio

from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI


load_dotenv()

reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

def web_search(query: str) -> str:
    results = reader.load_data({'query': query, 'parse': True})
    return results[0].text

agent = FunctionAgent(
    tools=[web_search],
    llm=OpenAI(model='gpt-4o-mini'),
    max_function_calls=1,
    system_prompt=(
        'Craft a short Google search query to use with the `web_search` tool. '
        'Analyze the most relevant results and answer the question.'
    )
)

async def main():
    response = await agent.run('How did DeepSeek affect the stock market?')
    print(response)


if __name__ == '__main__':
    asyncio.run(main())

Note the max_function_calls parameter, which limits the agent to a single tool call to avoid unnecessary LLM usage. Feel free to remove it if the agent needs more than one search to answer reliably.

Running this code should print the agent's answer, composed from fresh Google search results.

Final thoughts

Use this guide as your launchpad to develop web-powered AI applications that take advantage of both Oxylabs' scraping capabilities and LlamaIndex's flexibility to build practical, real-world solutions. If you have any questions about Oxylabs solutions, don't hesitate to contact our support team via live chat or email.

Interested in building AI agents but not sure which platform is the right one for you? Check out our overview of the 6 Best AI Agent Frameworks.

Please be aware that this is a third-party tool not owned or controlled by Oxylabs. Each third-party provider is responsible for its own software and services. Consequently, Oxylabs will have no liability or responsibility to you regarding those services. Please carefully review the third party's policies and practices and/or conduct due diligence before accessing or using third-party services.

Frequently Asked Questions

What is LlamaIndex?

LlamaIndex is a context augmentation framework that streamlines LLM application development for various use cases, including Q&A bots, chatbots, autonomous agents, multi-modal apps, and others. By connecting LLMs with external data, such as documents, APIs, and databases, you can ensure accurate and contextually rich AI workflows.

How to install LlamaIndex?

For Python, install LlamaIndex via terminal with pip install llama-index. For Node.js, TypeScript, Vite, Next.js, or Cloudflare Workers, use npm i llamaindex. You may also use alternative package managers like pnpm, yarn, or bun.

What are the main differences between CrewAI and LlamaIndex?

In short, LlamaIndex supplies LLMs with additional knowledge from various external data sources, while CrewAI manages and coordinates multiple AI agents that work together on complex tasks. Take a look at this CrewAI integration with Web Scraper API if you’re interested in building LLM apps using CrewAI.
