How to Build a RAG Chatbot with OpenAI and Web Scraping: Step-by-Step Guide

Vytenis Kaubrė

2025-09-19

6 min read

Traditional chatbots quickly become outdated when they can't retrieve relevant information beyond their training data. This guide walks you through:

  • The complete RAG chatbot architecture and key components

  • The best tools and strategies available for every step

  • A complete RAG code example that uses internal documents and live web data

Step-by-step guide on building a RAG chatbot

Most Retrieval-Augmented Generation (RAG) systems start with internal knowledge bases containing company documents, product information, and support materials. While this works well for controlled scenarios, the real power of RAG comes from augmenting these static resources with dynamic external knowledge. When users ask about recent events, competitor updates, or industry trends, your chatbot needs to reach beyond its local database.

This is where web scraping becomes essential for modern RAG systems. You have two main approaches: build a custom scraper with proxy infrastructure to handle anti-bot measures, or leverage a ready-to-use solution like Web Scraper API that manages these complexities for you. It comes with dedicated scrapers, such as SERP Scraper API and E-Commerce Scraper API, designed specifically to extract web data from search engine results pages and e-commerce sites.

For those preferring a low-code approach, AI Studio lets you extract web data using natural language prompts, handling anti-scraping measures and parsing challenges automatically.

1. Choose a RAG framework

Building a RAG system from scratch requires orchestrating document processing, embeddings, vector storage, retrieval, and LLM integration. A RAG framework handles this orchestration for you, providing pre-built pipelines and abstractions that connect these pieces seamlessly. Your choice determines how easily you can customize the pipeline, integrate with different tools, and scale your system in production.

Consider these popular RAG frameworks:

  • LangChain: Most popular with extensive integrations for complex multi-step workflows

  • LlamaIndex: Specialized for document search with 160+ data source connectors

  • Haystack: Production-ready modular pipelines with strong enterprise support

  • RAGFlow: Visual interface for easy setup with transparent document processing

  • LightRAG: Lightweight graph-based approach for fast retrieval with low resources

2. Load documents

Document loaders read and parse data from various sources (PDF files, Word documents, databases), forming your chatbot's knowledge base. The loader must handle different formats reliably while preserving important structures like headers, tables, and sections, whether working with local files during development or connecting to enterprise systems in production.

Document loader options include:

  • Unstructured.io: Handles 20+ file types with seamless RAG integration

  • LlamaParse: AI-native, built for RAG, excels at complex tables and figures

  • Docling: Multi-format support with AI-powered layout understanding

  • Apache Tika: Supports 1000+ file formats
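
For instance, here's a minimal sketch using LangChain's community loaders, mirroring the approach in the full example later in this guide (the ./docs folder and glob pattern are placeholders for your own setup):

from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader

# Load every Markdown file under ./docs; Unstructured preserves headers and sections
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader)
documents = loader.load()
print(f"Loaded {len(documents)} documents")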

3. Chunk documents

Large Language Models have limited context windows, so chunking breaks your documents into smaller, semantically meaningful pieces. Your choice of text splitter, chunk size, and chunk overlap is critical here. Get it wrong, and you'll either split related information across chunks or create chunks too large for effective retrieval. Your chunking strategy should respect document structure (keep paragraphs, code blocks, and sections intact) while meeting size constraints.

Common chunking strategies include:

  • Fixed-size (character): Splits at fixed character intervals

  • Recursive text: Hierarchically splits by separators

  • Token-based: Splits documents by token count limits

  • Markdown/code: Respects document structure formatting

  • Semantic: Groups similar meanings together

Note: Chunking directly affects overall RAG accuracy. Test different text splitters, chunk sizes, and overlaps to find what works best for your specific content and use case.
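
To illustrate, here's a rough sketch using LangChain's RecursiveCharacterTextSplitter, continuing from the documents loaded above; the chunk_size and chunk_overlap values are starting points to tune, not recommendations:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on paragraphs first, then sentences, then words, until chunks fit the limit
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk (tune for your content)
    chunk_overlap=100,  # characters shared between neighboring chunks
)
chunks = splitter.split_documents(documents)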

4. Generate embeddings and create a vector store

Searching through raw text every time a user asks a question would be too slow and resource-intensive. That's why RAG systems convert text chunks into embeddings: numerical vectors that capture semantic meaning. These vectors are stored in a vector database optimized for fast similarity searches. When a user submits a question, it's also converted into a vector, then compared against stored document vectors to instantly find the most relevant content. This vector-based approach makes semantic search practical at scale.

Popular embedding model options:

  • OpenAI text-embedding-3-small/large: Industry standard

  • Voyage AI voyage-3: Top benchmark performance

  • BGE models: Open-source, high accuracy

Common vector database choices:

  • Pinecone: Managed cloud solution

  • Weaviate: Open-source with GraphQL

  • ChromaDB: Developer-friendly, easy setup
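
Using OpenAI embeddings with ChromaDB, as the full example below does, indexing your chunks takes only a few lines; the persist directory is an arbitrary local path:

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Embed each chunk and store the vectors in a local Chroma database
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")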

5. Configure a retriever

The retriever searches your vector database to find chunks most semantically similar to user queries. Key configurations include the search type and number of chunks to retrieve (top_k). Most systems retrieve 3-5 chunks, though you should adjust this based on your chunk size and use case requirements.

Popular retrieval strategies:

  • Hybrid search: Combines vector and keyword search

  • Similarity search: Pure semantic vector retrieval

  • Multi-query: Generate multiple relevant query variations
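
Continuing from the vector store created in the previous step, a minimal LangChain sketch looks like this; k=4 is just an example within the typical 3-5 range, and the query string is illustrative:

# Expose the vector store as a retriever returning the top 4 chunks
retriever = vectorstore.as_retriever(
    search_type="similarity",  # pure semantic vector retrieval
    search_kwargs={"k": 4},    # number of chunks to retrieve (top_k)
)
relevant_docs = retriever.invoke("How do I scrape e-commerce sites?")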

6. Integrate web scraping

When you build a RAG chatbot, you'll quickly notice that local knowledge isn't enough. Integrating a web scraping pipeline to pull fresh web data can significantly improve retrieval-augmented generation. Modern websites pose various challenges, however: anti-bot systems block automated requests, JavaScript-heavy pages require browser rendering, and complex HTML structures complicate parsing.

You can build a custom scraper with residential proxies or use a ready-to-use solution like Oxylabs Web Scraper API that handles these complexities automatically.

Popular self-hosted scraping tools include:

  • Playwright: Modern browser automation, multi-browser support

  • Puppeteer: Chrome/Chromium control for JavaScript rendering

  • Scrapy: Python framework for large-scale scraping

  • Selenium: Cross-browser automation, extensive language support

  • Beautiful Soup: Lightweight HTML/XML parsing library
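
As a bare-bones illustration of the custom-scraper route, the sketch below fetches and parses a page with requests and Beautiful Soup; the URL is a placeholder, and real-world targets typically also require proxy rotation, realistic headers, and JavaScript rendering that this toy example omits:

import requests
from bs4 import BeautifulSoup

# Fetch a page and strip it down to plain text (no proxies or JS rendering here)
response = requests.get("https://example.com/article", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
page_text = soup.get_text(separator="\n", strip=True)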

7. Build the RAG chain

This step connects your retriever with a language model to generate contextual responses based on retrieved documents. The chain orchestrates the entire flow: taking user input, retrieving relevant chunks, formatting them into a context block, and prompting the LLM to generate answers grounded in your data, as the sketch after the model lists below shows.

Top proprietary LLM providers to consider:

  • OpenAI: Market-leading models, multimodal, and large context windows

  • Anthropic Claude: Advanced reasoning with large context windows

  • Google Gemini: Multimodal capabilities with competitive pricing

Popular open-source LLM options:

  • Llama 3: Meta's powerful open model

  • Mistral/Mixtral: Efficient performance on consumer hardware

  • Qwen 2.5: Strong multilingual and coding capabilities
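
Wiring the retriever to an LLM with LangChain's pipe syntax can be as short as this sketch; the prompt wording and model name are illustrative, and retriever comes from the earlier step:

from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")

# Retrieve relevant chunks, format them as context, and generate a grounded answer
question = "What is RAG?"
docs = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in docs)
answer = (prompt | llm).invoke({"context": context, "question": question})
print(answer.content)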

8. Add reranking for better accuracy

Initial retrieval often returns many potentially relevant chunks, but not all are equally useful. Reranking models deeply analyze the relationship between the input query and each retrieved chunk, reordering them by relevance. This two-stage approach significantly improves answer quality without sacrificing speed.

Popular reranking solutions:

  • Cohere Rerank: Cloud-based with superior accuracy benchmarks

  • BGE-reranker: Open-source model with strong performance

  • Cross-encoder models: BERT-based for semantic similarity scoring

  • ColBERT: Efficient late interaction for scalable reranking

  • LLM as a reranker: Using GPT or Claude for reranking
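
For the cross-encoder option, here's a hedged sketch using the sentence-transformers library; the model name is one publicly available reranker, and retrieved_texts is a placeholder for the chunk texts returned by your retriever:

from sentence_transformers import CrossEncoder

# Score each (query, chunk) pair jointly, then reorder chunks by relevance
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "How do I avoid IP blocks when scraping?"
scores = reranker.predict([(query, text) for text in retrieved_texts])
reranked = [text for _, text in sorted(zip(scores, retrieved_texts), key=lambda pair: pair[0], reverse=True)]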

9. Generate a response

Transform retrieved context into natural, accurate answers while managing the generation process. Modern RAG systems need to handle various response requirements beyond simple text generation.

Key generation features to implement:

  • Streaming: Real-time token-by-token response display

  • Confidence scores: Measure answer reliability and uncertainty

  • Multi-turn memory: Maintain conversation context across interactions

  • Query routing: Direct queries to appropriate data sources

  • Tool calling: Integrate custom functions, APIs, or databases

  • Citation tracking: Link answers to source documents

  • Fallback strategies: Handle out-of-scope or low-confidence queries
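
Streaming, for example, is built into LangChain chat models; this self-contained sketch prints tokens as they arrive (the model name is illustrative):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# Print the response token by token instead of waiting for the full answer
for token in llm.stream("Explain RAG in one sentence."):
    print(token.content, end="", flush=True)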

10. Create a chat interface

Build a user-friendly interface for interaction. The interface should handle conversation history, display streaming responses, and potentially show source documents or confidence indicators.

Common GUI frameworks:

  • Chainlit: RAG-specific with built-in features, perfect for testing

  • Streamlit: Python-based with minimal code required

  • React/Next.js: Full-featured web applications

  • FastAPI: High-performance API with documentation

11. Test and evaluate your RAG chatbot

Ensure accuracy, relevance, and quality of responses. Systematic evaluation helps identify weaknesses in retrieval, generation, or the overall pipeline. Testing should cover both component-level metrics (retrieval precision, generation quality) and end-to-end performance.

Popular evaluation tools and frameworks:

  • LangSmith: Debug, test, and monitor LangChain applications

  • RAGAS: RAG-specific metrics for faithfulness and relevance

  • TruLens: Track and evaluate LLM application quality

  • Arize Phoenix: ML observability for RAG systems

  • DeepEval: Unit testing framework for LLM outputs

Note: Regular evaluation with real user queries is crucial. Set up A/B testing to compare different configurations and continuously improve your RAG system based on user feedback and metrics.
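
As one example, RAGAS scores an answer against the retrieved context with metrics like faithfulness; this rough sketch assumes RAGAS 0.1-style APIs and a single hand-written test case, so field names may differ across versions:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One test case: the user question, the chatbot's answer, and the retrieved chunks
data = Dataset.from_dict({
    "question": ["What does Web Scraper API do?"],
    "answer": ["It extracts public web data while handling anti-bot measures."],
    "contexts": [["Web Scraper API extracts data from websites at scale..."]],
})
print(evaluate(data, metrics=[faithfulness, answer_relevancy]))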

RAG code example

This minimal RAG chatbot covers the core RAG steps, combining local document knowledge with live web data. LangChain serves as the backend framework, orchestrating data retrieval and chatbot processes, while OpenAI handles embeddings and chat generation through its native LangChain integration. For the frontend, Chainlit provides a clean interface that makes testing and iteration straightforward.

Prerequisites

  • Python 3.9+ (this guide uses 3.13.7)

  • OpenAI API key

  • Oxylabs Web Scraper API credentials – get a free trial by signing up

Environment setup

Create a Python project with a virtual environment and install the required libraries:

pip install langchain langchain-openai langchain-chroma langchain-community langchain-oxylabs openai chromadb python-dotenv "unstructured[md]" chainlit

Additionally, save your credentials as environment variables in a .env file:

OXYLABS_USERNAME=api_username
OXYLABS_PASSWORD=api_password
OPENAI_API_KEY=openai_key

Add a docs folder to your project directory with some Markdown files for testing. Try these examples: e-commerce scraping guide and Google scraping tutorial.

Python code example

The chatbot first searches its local knowledge base, then falls back to web scraping for questions it can't answer locally. While functional for demonstration, a production-ready chatbot requires further optimizations for accuracy, relevance, and performance at scale.

import os
from typing import List

from dotenv import load_dotenv
import chainlit as cl
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.text_splitter import MarkdownTextSplitter
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from langchain_oxylabs import OxylabsSearchAPIWrapper, OxylabsSearchRun, OxylabsLoader
from langchain.prompts import ChatPromptTemplate


load_dotenv()

# Load documents and chunk them
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader)
documents = loader.load()
text_splitter = MarkdownTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

# Create embeddings and a vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Initialize Oxylabs tools 
oxylabs_config = {
    "oxylabs_username": os.getenv("OXYLABS_USERNAME"),
    "oxylabs_password": os.getenv("OXYLABS_PASSWORD")
}
search_tool = OxylabsSearchRun(wrapper=OxylabsSearchAPIWrapper(**oxylabs_config))

def create_web_loader(urls: List[str]) -> OxylabsLoader:
    """Create Oxylabs loader with URLs."""
    return OxylabsLoader(urls=urls, params={"markdown": True}, **oxylabs_config)

# Initialize LLM and prompt
llm = ChatOpenAI(model="gpt-5-nano")  # gpt-5 models may reject non-default temperature values

response_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant. Use the provided context to answer questions accurately.
ALWAYS format code samples using markdown code blocks (```python, etc.).
If the context doesn't fully answer the question, say what you can answer and what information is missing.

Context: {context}

Question: {question}
Answer:""")


@cl.on_chat_start
async def start():
    """Initialize chat session."""
    await cl.Message(content="👋 Hi! Ask me anything about Oxylabs!").send()


@cl.on_message
async def main(message: cl.Message):
    """Process messages and generate responses."""
    
    # Search local knowledge (Chroma returns distance scores, so lower means more similar)
    local_docs = vectorstore.similarity_search_with_score(message.content, k=5)
    relevant_chunks = [doc.page_content for doc, score in local_docs if score < 1.0]
    
    if relevant_chunks:
        context = "\n\n---\n\n".join(relevant_chunks)
        source = "📚 Internal knowledge base"
    else:
        # Search web
        async with cl.Step(name="🔍 Web search") as step:
            search_results = search_tool.run(message.content)
            urls = [
                line.split('URL: ')[-1].strip() 
                for line in search_results.split('\n') 
                if 'URL: ' in line
            ]
            
            if urls:
                # Fetch the first URL
                step.output = f"Fetching {urls[0]}..."
                web_loader = create_web_loader([urls[0]])
                web_docs = web_loader.load()
                
                if web_docs:
                    context = web_docs[0].page_content[:20000]
                    source = f"🌐 From web ({urls[0]})"
                else:
                    context = search_results[:10000]
                    source = "🔍 From web search"
            else:
                context = search_results[:10000]
                source = "🔍 From web search"
    
    # Generate a response
    chain = response_prompt | llm
    response = chain.invoke({
        "context": context,
        "question": message.content
    })
    
    # Send a response
    await cl.Message(content=f"**{source}**\n\n{response.content}").send()


if __name__ == "__main__":
    from chainlit.cli import run_chainlit
    run_chainlit(__file__)

Run the RAG chatbot

Save this Python file as app.py and run python app.py in your terminal. Chainlit will open a web interface at http://localhost:8000 where you can test queries against local knowledge and fresh web data.

Wrap up

A RAG chatbot built only on internal documents creates a knowledge silo that ages quickly. The real power comes from integrating web data to answer questions about current events, market trends, and evolving topics.

For more insights on building AI-powered systems with fresh data, explore the related content at the end of this post.

Frequently asked questions

How to build a RAG-based chatbot?

Start by setting up a vector database (like Pinecone or Chroma) to store your document embeddings. Create a retrieval pipeline that focuses on converting user queries to embeddings and retrieving relevant documents. Connect this to an LLM API (OpenAI, Anthropic, etc.) that generates responses using the retrieved context. Use frameworks like LangChain or LlamaIndex to simplify the integration.

About the author

Vytenis Kaubrė

Technical Content Researcher

Vytenis Kaubrė is a Technical Content Researcher at Oxylabs. Creative writing and a growing interest in technology fuel his daily work, where he researches and crafts technical content, all the while honing his skills in Python. Off duty, you may catch him working on personal projects, learning all things cybersecurity, or relaxing with a book.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.

Related content

Web Scraping with Perplexity AI: Python Guide

Dovydas Vėsa

2025-10-10

Web Scraping With LangChain & Oxylabs API

Roberta Aukstikalnyte

2025-10-03

Web Scraping with Claude AI: Python Guide

Agnė Matusevičiūtė

2025-09-16
