Traditional chatbots quickly become outdated when they can't retrieve relevant information beyond their training data. This guide walks you through:
The complete RAG chatbot architecture and key components
The best tools and strategies available for every step
A complete RAG code example that uses internal documents and live web data
Most Retrieval-Augmented Generation (RAG) systems start with internal knowledge bases containing company documents, product information, and support materials. While this works well for controlled scenarios, the real power of RAG comes from augmenting these static resources with dynamic external knowledge. When users ask about recent events, competitor updates, or industry trends, your chatbot needs to reach beyond its local database.
This is where web scraping becomes essential for modern RAG systems. You have two main approaches: build a custom scraper with proxy infrastructure to handle anti-bot measures, or leverage a ready-to-use solution like Web Scraper API that manages these complexities for you. It comes with dedicated scrapers, such as SERP Scraper API and E-Commerce Scraper API, designed specifically to extract web data from search engine results pages and e-commerce sites. If you're new to web scraping, it's worth reviewing the basics before diving in.
For those preferring a low-code approach, AI Studio lets you extract web data using natural language prompts, handling anti-scraping measures and parsing challenges automatically.
Building a RAG system from scratch requires orchestrating document processing, embeddings, vector storage, retrieval, and LLM integration. A RAG framework handles this orchestration for you, providing pre-built pipelines and abstractions that connect these pieces seamlessly. Your choice determines how easily you can customize the pipeline, integrate with different tools, and scale your system in production.
Consider these popular RAG frameworks:
LangChain: Most popular with extensive integrations for complex multi-step workflows
LlamaIndex: Specialized for document search with 160+ data source connectors
Haystack: Production-ready modular pipelines with strong enterprise support
RAGFlow: Visual interface for easy setup with transparent document processing
LightRAG: Lightweight graph-based approach for fast retrieval with low resources
Document loaders read and parse data from various sources (PDF files, Word documents, databases), forming your chatbot's knowledge base. The loader must handle different formats reliably while preserving important structures like headers, tables, and sections, whether working with local files during development or connecting to enterprise systems in production.
Document loader options include:
Unstructured.io: Handles 20+ file types with seamless RAG integration
LlamaParse: AI-native, built for RAG, excels at complex tables and figures
Docling: Multi-format support with AI-powered layout understanding
Apache Tika: Supports 1000+ file formats
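To illustrate the loading step, here's a minimal sketch using LangChain's community loaders (a sketch assuming your files sit in a local docs folder and the pypdf package is installed for PDF parsing):

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Recursively load every PDF under ./docs; each page becomes a Document
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Source metadata is preserved, which later enables citation tracking
print(documents[0].metadata)  # e.g. {'source': 'docs/guide.pdf', 'page': 0}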
Large Language Models have limited context windows, so chunking breaks your documents into smaller, semantically meaningful pieces. Your choice of text splitter, chunk size, and chunk overlap is essential here. Get this wrong, and you'll either split related information across chunks or create chunks too large for effective retrieval. Your chunking strategy should respect document structure (keep paragraphs, code blocks, and sections intact) while meeting size constraints.
Common chunking strategies include:
Fixed-size (character): Splits at fixed character intervals
Recursive text: Hierarchically splits by separators
Token-based: Splits documents by token count limits
Markdown/code: Respects document structure formatting
Semantic: Groups similar meanings together
Note: Chunking directly affects overall RAG accuracy. Test different text splitters, chunk sizes, and overlaps to find what works best for your specific content and use case.
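As a minimal sketch of the recursive strategy with LangChain (the size and overlap values below are illustrative starting points, not recommendations):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Try to split on paragraph breaks first, then lines, sentences, and words
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # maximum characters per chunk
    chunk_overlap=150,   # characters shared between neighboring chunks
    separators=["\n\n", "\n", ". ", " "],
)
chunks = text_splitter.split_documents(documents)  # documents from the loading step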
Searching through raw text every time a user asks a question would be too slow and resource-intensive. That's why RAG systems convert text chunks into embeddings: numerical vectors that capture semantic meaning. These vectors are stored in a vector database optimized for fast similarity searches. When a user submits a question, it's also converted into a vector, then compared against stored document vectors to instantly find the most relevant content. This vector-based approach makes semantic search practical at scale.
Popular embedding model options:
OpenAI text-embedding-3-small/large: Industry standard
Voyage AI voyage-3: Top benchmark performance
BGE models: Open-source, high accuracy
Common vector database choices:
Pinecone: Managed cloud solution
Weaviate: Open-source with GraphQL
ChromaDB: Developer-friendly, easy setup
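Putting the two together, here's a minimal sketch that embeds the chunks from the previous step and stores them in ChromaDB (the model name is just one of the options above):

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed every chunk once and persist the vectors locally
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# At query time, the question is embedded and compared against stored vectors
results = vectorstore.similarity_search("How do I rotate proxies?", k=3)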
The retriever searches your vector database to find chunks most semantically similar to user queries. Key configurations include the search type and number of chunks to retrieve (top_k). Most systems retrieve 3-5 chunks, though you should adjust this based on your chunk size and use case requirements.
Popular retrieval strategies:
Hybrid search: Combines vector and keyword search
Similarity search: Pure semantic vector retrieval
Multi-query: Generates multiple relevant query variations
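With LangChain, the retriever is a thin wrapper over the vector store from the previous sketch; a minimal example:

# Expose the vector store as a retriever that returns the top 4 chunks
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)
relevant_docs = retriever.invoke("What is a residential proxy?")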
When you build a RAG chatbot, you'll quickly notice that local knowledge isn't enough, so integrating a web scraping pipeline that pulls fresh web data can significantly improve retrieval-augmented generation. Modern websites pose various challenges: anti-bot systems block automated requests, JavaScript-heavy pages require browser rendering, and complex HTML structures complicate parsing.
You can build a custom scraper with residential proxies or use a ready-to-use solution like Oxylabs Web Scraper API that handles these complexities automatically.
Popular self-hosted scraping tools include:
Playwright: Modern browser automation, multi-browser support
Puppeteer: Chrome/Chromium control for JavaScript rendering
Scrapy: Python framework for large-scale scraping
Selenium: Cross-browser automation, extensive language support
Beautiful Soup: Lightweight HTML/XML parsing library
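To illustrate the self-hosted route, here's a Playwright sketch that renders a JavaScript-heavy page and extracts its visible text (the URL is a placeholder, and proxy rotation and anti-bot handling are deliberately left out):

from playwright.sync_api import sync_playwright

# Render the page in a headless browser so JavaScript-driven content loads
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/article")  # placeholder URL
    text = page.inner_text("body")
    browser.close()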
This step connects your retriever with a language model to generate contextual responses based on retrieved documents. The chain orchestrates the entire flow: taking user input, retrieving relevant chunks, formatting them as context, and prompting the LLM to generate answers grounded in your data.
Top proprietary LLM providers to consider:
OpenAI: Market-leading multimodal models with large context windows
Anthropic Claude: Advanced reasoning with large context windows
Google Gemini: Multimodal capabilities with competitive pricing
Popular open-source LLM options:
Llama 3: Meta's powerful open model
Mistral/Mixtral: Efficient performance on consumer hardware
Qwen 2.5: Strong multilingual and coding capabilities
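Continuing the sketches above, here's a minimal chain that feeds retrieved chunks to an LLM (the model choice and prompt wording are illustrative):

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini")  # swap in any provider from the lists above
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

question = "How does a web scraper handle CAPTCHAs?"
docs = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in docs)

# Pipe the formatted prompt into the LLM to get a grounded answer
chain = prompt | llm
answer = chain.invoke({"context": context, "question": question})
print(answer.content)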
Initial retrieval often returns many potentially relevant chunks, but not all are equally useful. Reranking models deeply analyze the relationship between the input query and each retrieved chunk, reordering them by relevance. This two-stage approach significantly improves answer quality without sacrificing speed.
Popular reranking solutions:
Cohere Rerank: Cloud-based with superior accuracy benchmarks
BGE-reranker: Open-source model with strong performance
Cross-encoder models: BERT-based for semantic similarity scoring
ColBERT: Efficient late interaction for scalable reranking
LLM as a reranker: Using GPT or Claude for reranking
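As an open-source illustration, here's a cross-encoder sketch using the sentence-transformers library, reusing the question and docs from the chain sketch above (the model name is one common choice, not a requirement):

from sentence_transformers import CrossEncoder

# Score each (query, chunk) pair jointly, then reorder chunks by relevance
reranker = CrossEncoder("BAAI/bge-reranker-base")
pairs = [(question, doc.page_content) for doc in docs]
scores = reranker.predict(pairs)
reranked_docs = [doc for _, doc in sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)]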
Transform retrieved context into natural, accurate answers while managing the generation process. Modern RAG systems need to handle various response requirements beyond simple text generation.
Key generation features to implement:
Streaming: Real-time token-by-token response display
Confidence scores: Measure answer reliability and uncertainty
Multi-turn memory: Maintain conversation context across interactions
Query routing: Direct queries to appropriate data sources
Tool calling: Integrate custom functions, APIs, or databases
Citation tracking: Link answers to source documents
Fallback strategies: Handle out-of-scope or low-confidence queries
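As one example, streaming with a LangChain chain is a one-method change (continuing the chain sketch above):

# Print tokens as they're generated instead of waiting for the full answer
for token in chain.stream({"context": context, "question": question}):
    print(token.content, end="", flush=True)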
Build a user-friendly interface for interaction. The interface should handle conversation history, display streaming responses, and potentially show source documents or confidence indicators.
Common GUI frameworks:
Chainlit: RAG-specific with built-in features, perfect for testing
Streamlit: Python-based with minimal code required
React/Next.js: Full-featured web applications
FastAPI: High-performance API with documentation
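A minimal Chainlit skeleton, reusing the retriever and chain from the earlier sketches, looks like this (the full example later in this guide builds on the same two hooks):

import chainlit as cl

@cl.on_chat_start
async def start():
    # Greet the user when a new chat session opens
    await cl.Message(content="Hi! Ask me anything.").send()

@cl.on_message
async def main(message: cl.Message):
    # Retrieve context for the question and generate a grounded answer
    docs = retriever.invoke(message.content)
    context = "\n\n".join(doc.page_content for doc in docs)
    answer = chain.invoke({"context": context, "question": message.content})
    await cl.Message(content=answer.content).send()

Run it with chainlit run app.py to get a chat interface in the browser.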
Ensure accuracy, relevance, and quality of responses. Systematic evaluation helps identify weaknesses in retrieval, generation, or the overall pipeline. Testing should cover both component-level metrics (retrieval precision, generation quality) and end-to-end performance.
Popular evaluation tools and frameworks:
LangSmith: Debug, test, and monitor LangChain applications
RAGAS: RAG-specific metrics for faithfulness and relevance
TruLens: Track and evaluate LLM application quality
Arize Phoenix: ML observability for RAG systems
DeepEval: Unit testing framework for LLM outputs
Note: Regular evaluation with real user queries is crucial. Set up A/B testing to compare different configurations and continuously improve your RAG system based on user feedback and metrics.
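As a rough sketch of RAG-specific metrics with RAGAS (the API changes between versions, so treat this as illustrative; the single-record dataset reuses variables from the chain sketch above):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One illustrative record: the question, retrieved contexts, and generated answer
eval_data = Dataset.from_dict({
    "question": [question],
    "contexts": [[context]],
    "answer": [answer.content],
})
scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(scores)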
This minimal RAG chatbot covers the core RAG steps, combining local document knowledge with live web data. LangChain serves as the backend framework, orchestrating data retrieval and chatbot processes, while OpenAI handles embeddings and chat generation through its native LangChain integration. For the frontend, Chainlit provides a clean interface that makes testing and iteration straightforward.
Python 3.9+ (this guide uses 3.13.7)
OpenAI API key
Oxylabs Web Scraper API credentials – get a free trial by signing up
Create a Python project with a virtual environment and install the required libraries:
pip install langchain langchain-openai langchain-chroma langchain-community langchain-oxylabs openai chromadb python-dotenv "unstructured[md]" chainlit
Additionally, save your credentials as environment variables in a .env file:
OXYLABS_USERNAME=api_username
OXYLABS_PASSWORD=api_password
OPENAI_API_KEY=openai_key
Add a docs folder to your project directory with some Markdown files for testing. Try these examples: e-commerce scraping guide and Google scraping tutorial.
The chatbot first searches its local knowledge base, then falls back to web scraping for questions it can't answer locally. While functional for demonstration, a production-ready chatbot requires further optimizations for accuracy, relevance, and performance at scale.
import os
from typing import List

from dotenv import load_dotenv
import chainlit as cl
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.text_splitter import MarkdownTextSplitter
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from langchain_oxylabs import OxylabsSearchAPIWrapper, OxylabsSearchRun, OxylabsLoader
from langchain.prompts import ChatPromptTemplate

load_dotenv()

# Load documents and chunk them
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader)
documents = loader.load()
text_splitter = MarkdownTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

# Create embeddings and a vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Initialize Oxylabs tools
oxylabs_config = {
    "oxylabs_username": os.getenv("OXYLABS_USERNAME"),
    "oxylabs_password": os.getenv("OXYLABS_PASSWORD"),
}
search_tool = OxylabsSearchRun(wrapper=OxylabsSearchAPIWrapper(**oxylabs_config))


def create_web_loader(urls: List[str]) -> OxylabsLoader:
    """Create an Oxylabs loader for the given URLs."""
    return OxylabsLoader(urls=urls, params={"markdown": True}, **oxylabs_config)


# Initialize the LLM and prompt
llm = ChatOpenAI(model="gpt-5-nano", temperature=0)
response_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant. Use the provided context to answer questions accurately.
ALWAYS format code samples using markdown code blocks (```python, etc.).
If the context doesn't fully answer the question, say what you can answer and what information is missing.
Context: {context}
Question: {question}
Answer:""")


@cl.on_chat_start
async def start():
    """Initialize the chat session."""
    await cl.Message(content="👋 Hi! Ask me anything about Oxylabs!").send()


@cl.on_message
async def main(message: cl.Message):
    """Process messages and generate responses."""
    # Search the local knowledge base first
    local_docs = vectorstore.similarity_search_with_score(message.content, k=5)
    # Chroma returns distance scores, so lower means more similar
    relevant_chunks = [doc.page_content for doc, score in local_docs if score < 1.0]

    if relevant_chunks:
        context = "\n\n---\n\n".join(relevant_chunks)
        source = "📚 Internal knowledge base"
    else:
        # Fall back to a web search
        async with cl.Step(name="🔍 Web search") as step:
            search_results = search_tool.run(message.content)
            urls = [
                line.split('URL: ')[-1].strip()
                for line in search_results.split('\n')
                if 'URL: ' in line
            ]
            if urls:
                # Fetch and parse the first result
                step.output = f"Fetching {urls[0]}..."
                web_loader = create_web_loader([urls[0]])
                web_docs = web_loader.load()
                if web_docs:
                    context = web_docs[0].page_content[:20000]
                    source = f"🌐 From web ({urls[0]})"
                else:
                    context = search_results[:10000]
                    source = "🔍 From web search"
            else:
                context = search_results[:10000]
                source = "🔍 From web search"

    # Generate a response
    chain = response_prompt | llm
    response = chain.invoke({
        "context": context,
        "question": message.content
    })

    # Send the response
    await cl.Message(content=f"**{source}**\n\n{response.content}").send()


if __name__ == "__main__":
    from chainlit.cli import run_chainlit
    run_chainlit(__file__)
Save this Python file as app.py and run python app.py in your terminal. Chainlit will open a web interface at http://localhost:8000 where you can test queries against local knowledge and fresh web data.
When you build a RAG chatbot with only internal documents, it creates a knowledge silo that ages quickly. The real power comes from integrating web data to answer questions about current events, market trends, and evolving topics.
For more insights on building AI-powered systems with fresh data, explore the other resources on the Oxylabs blog.
Start by setting up a vector database (like Pinecone or Chroma) to store your document embeddings. Build a retrieval pipeline that converts user queries to embeddings and fetches the most relevant documents. Connect this to an LLM API (OpenAI, Anthropic, etc.) that generates responses using the retrieved context. Frameworks like LangChain or LlamaIndex simplify the integration.
Use OpenAI's API to send chat messages to ChatGPT and receive responses. Implement conversation memory by maintaining a history of messages in your application. Add a simple UI using frameworks like Streamlit or Chainlit for quick prototypes, or build a custom frontend for production apps.
Use OpenAI's Embeddings API to convert your documents into vector representations. Store these vectors in a database and implement similarity search for retrieval. When a user asks a question, embed their query, find relevant documents, and include them as context in your ChatGPT API call to generate informed responses.
About the author
Vytenis Kaubrė
Technical Content Researcher
Vytenis Kaubrė is a Technical Content Researcher at Oxylabs. Creative writing and a growing interest in technology fuel his daily work, where he researches and crafts technical content, all the while honing his skills in Python. Off duty, you may catch him working on personal projects, learning all things cybersecurity, or relaxing with a book.