Web Scraping with Gemini AI: Python Tutorial for Data Extraction


Yelyzaveta Hayrapetyan
2025-10-17
10 min read
Web scraping is entering a new era powered by AI and large language models (LLMs), and Google’s Gemini stands at the forefront of this transformation. Designed as a multimodal AI system that understands text, code, and even HTML structures through natural language, Gemini enables developers to automate complex data extraction tasks faster and smarter than ever before. Gemini can interpret, summarize, and structure web content with contextual understanding, making it a powerful ally for automation and machine learning workflows.
In this step-by-step tutorial, let’s see how to use Gemini for web scraping with Python. Along the way, you’ll find practical code examples as well as real-world use cases and challenges of Gemini AI for web scraping.
Google Gemini is a powerful AI and LLM capable of understanding and generating natural language, code, and HTML. While it can’t directly access websites, developers can use Gemini for web scraping by feeding page content to its API, allowing the model to interpret and extract structured data intelligently from raw HTML.
Let’s take a look at how to scrape the web with Gemini AI, Oxylabs Residential Proxies, and Python.
You can download the latest version of Python from the official website.
Once that’s done, open a folder of your choice and create a file for your script and a virtual environment by running these commands:
python -m venv env
source env/bin/activate
touch main.py
With the virtual environment active, let’s install the necessary libraries we’ll need for this script.
For this tutorial, we’ll be using:
requests, for performing the HTTP requests to the website we’ll scrape.
pydantic, to specify the format we expect as a response from the LLM when using structured outputs.
bs4, for parsing HTML.
google-genai, for accessing the Gemini API.
We’ll also be importing the json module to store our scraped data later on.
Run this command to install the mentioned libraries:
pip install requests pydantic bs4 google-genai
Now that the libraries are installed, you can open up your previously created Python file and import the libraries like this:
import json
from pydantic import BaseModel
import requests
from bs4 import BeautifulSoup
from google import genai
Now that we have the base of our script set up, let’s take a look at how to get access to the services we’ll be using in this tutorial.
For Gemini, you’ll need to get a Gemini API key from Google by logging in to the Google AI Studio with your Google account. You can find the page for generating your API key here.
Once you’ve created a project within Google AI Studio and copied your API key, you’ll need to set an environment variable within your previously created virtual environment to use it within the script.
You can do that by running the following command:
export GEMINI_API_KEY="<your_api_key>"
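With the variable exported, genai.Client() will pick up the key automatically. If you'd rather not rely on shell configuration, you can also pass the key to the client explicitly — a minimal sketch that reads the same GEMINI_API_KEY variable through os.environ:
import os
from google import genai

# genai.Client() reads GEMINI_API_KEY from the environment by default;
# passing api_key explicitly just makes that dependency visible in the code.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])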
As for Oxylabs, you’ll need to log in to the dashboard and retrieve the credentials for your Residential Proxies user.
Once that’s done, you can define them at the top of your script like this:
OXYLABS_USERNAME="USERNAME"
OXYLABS_PASSWORD="PASSWORD"
With your libraries imported and your credentials set, we can move on to scraping a simple website with the help of Gemini and Oxylabs Residential Proxies.
Let’s start by defining a function for scraping a website using Residential Proxies. The purpose of the function should be to accept a URL and return the HTML from the provided website. Let’s call it scrape_website.
def scrape_website(url: str) -> str:
    ...
Next, let’s start implementing the function by defining the proxy we’ll be using: a rotating residential proxy accessed through Oxylabs’ main entry point.
Let’s define the URL of the proxy, along with the dictionary we’ll pass to the HTTP request.
proxy_url = f"http://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"
proxies = {
    "http": proxy_url,
    "https": proxy_url,
}
All that’s left now is to execute the HTTP request. It should be a GET request, with the URL and proxies passed as arguments.
response = requests.get(url, proxies=proxies)
response.raise_for_status()
We should also add the response.raise_for_status() line to make sure the code execution stops if there’s an issue while scraping.
What’s left is to just return the text from the response. Here’s what the full function should look like:
def scrape_website(url: str) -> str:
    proxy_url = f"http://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"
    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    response = requests.get(url, proxies=proxies)
    response.raise_for_status()
    return response.text
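If you want failures from the proxy or the target site to be easier to diagnose, you can optionally wrap the request in error handling and add a timeout. This is only a defensive sketch of the same function, not a required part of the tutorial:
def scrape_website(url: str) -> str:
    proxy_url = f"http://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"
    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    try:
        # A timeout prevents a stalled connection from hanging the script.
        response = requests.get(url, proxies=proxies, timeout=30)
        response.raise_for_status()
    except requests.RequestException as error:
        raise RuntimeError(f"Failed to scrape {url}: {error}") from error
    return response.text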
Now that we have a function to scrape the data, we can utilize Gemini’s capabilities to extract any data we need from the HTML.
Modern LLMs are capable enough to extract the data we need if we simply ask for it in a prompt. Therefore, if a page is small, it’s usually enough to pass the whole HTML in the prompt and ask Gemini to parse out the fields we need.
However, Gemini’s client also provides a feature called structured outputs, which lets us specify exactly what format of data we expect to receive from the LLM.
Since we’ll be scraping an example e-commerce website, let’s define a model using the installed pydantic library called Product, which contains title, price, and description fields. This should be enough information for Gemini to know that these are the fields we expect to be parsed.
The model should look like this:
class Product(BaseModel):
    title: str
    price: str
    description: str
Next, we can define the function we’ll be using to parse out product data from a scraped website’s HTML. Let’s call it extract_products_from_html.
def extract_products_from_html(html: str) -> list[Product]:
    ...
As you can see, the function returns a list of Product objects, making it clear what data will be parsed out.
Next, we can define the Gemini client we’ll use from Google, along with the prompt we’ll be passing to it. As you’ll see, the prompt is short and simple, providing only the basic instructions.
client = genai.Client()
prompt = f"Extract information about all products in the following HTML: {html}."
Now we can execute the request to the LLM. We’ll be using the current fastest and most cost-efficient model that Google provides, gemini-2.5-flash-lite, which has enough capabilities for this use case.
We’ll also be passing a list of previously defined Product models as a response_schema parameter. This signals Gemini to use structured outputs for this generation.
Here’s what it should look like:
response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=prompt,
    config={
        "response_mime_type": "application/json",
        "response_schema": list[Product],
    },
)
The function should simply return response.parsed, which will be a list of Product objects.
Here’s what the whole function should look like:
def extract_products_from_html(html: str) -> list[Product]:
    client = genai.Client()
    prompt = f"Extract information about all products in the following HTML: {html}."
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=prompt,
        config={
            "response_mime_type": "application/json",
            "response_schema": list[Product],
        },
    )
    return response.parsed
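Note that response.parsed can be None when the model’s output doesn’t match the requested schema. If you want the script to fail loudly in that case, you could replace the function’s final return with a small check — a sketch only, not part of the original tutorial code:
    # response.parsed is None when Gemini's output can't be parsed into list[Product].
    if response.parsed is None:
        raise ValueError("Gemini did not return data matching the Product schema.")
    return response.parsed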
Now that both of the functions are implemented, we can try running them to see the results.
Let’s define the URL for the example website we’ll be scraping and call both of the functions at the bottom of the script.
The website we’ll be using is a scraping sandbox, hosted by Oxylabs.
url = "https://sandbox.oxylabs.io/products"
content = scrape_website(url)
products = extract_products_from_html(content)
We can store the scraped products in a JSON file to view the data in a structured format.
products_json = [product.model_dump() for product in products]
with open("products.json", "w") as f:
json.dump(products_json, f, indent=2)
If you run the script, a file called products.json should be created in your current directory.
Inside the file, the data should look something like this:
[
  {
    "title": "Super Mario Odyssey",
    "price": "89,99",
    "description": "New Evolution of Mario Sandbox-Style Gameplay. Mario embarks on a new journey through unknown worlds, running and jumping through huge 3D worlds in the first sandbox-style Mario game since Super Mario 64 and Super Mario Sunshine. Set sail between expansive worlds aboard an airship, and perform all-new actions, such as throwing Mario's cap."
  },
  {
    "title": "Halo: Combat Evolved",
    "price": "87,99",
    "description": "Enter the mysterious world of Halo, an alien planet shaped like a ring. As mankind's super soldier Master Chief, you must uncover the secrets of Halo and fend off the attacking Covenant. During your missions, you'll battle on foot, in vehicles, inside, and outside with alien and human weaponry. Your objectives include attacking enemy outposts, raiding underground labs for advanced technology, rescuing fallen comrades, and sniping enemy forces. Halo also lets you battle three other players via intense split screen combat or fight cooperatively with a friend through the single-player missions."
  },
]
As you can see, Gemini completely handled the parsing process without any explicit instructions. This shows how simple web scraping can be when integrated with LLMs like Gemini, automating the difficult-to-maintain parsing process.
However, if done frequently and on larger websites, LLM calls can get expensive quite quickly. Let’s take a look at how we can optimize this process even more.
Let’s take a look at a different approach to scraping with Gemini. In the previous example, we simply provided the whole HTML to the LLM and asked it to parse out what we needed. While it is capable of doing that, running this kind of operation often enough can be expensive in the long run.
A solution to this could be to use Gemini to extract all the required CSS selectors at once and then use them to parse out the necessary data ourselves.
This would allow us to use the LLM only occasionally, rather than calling the API on every scraping job.
In order to have Gemini parse out the selectors, we can do a similar thing we did before, but with a slightly different prompt and pydantic model.
Let’s start by defining the model for product selectors. It can look like this:
class ProductSelectors(BaseModel):
    title_selector: str
    price_selector: str
    description_selector: str
    product_card_selector: str
We include a selector for every field we wish to parse out, as well as a selector for a single product card. We’ll be using that later on.
Next, let’s define a new function called get_product_selectors. It should also accept an html argument and return a ProductSelectors object.
The function should look almost the same as the extract_products_from_html function we defined before, except for an adjusted prompt and a different response_schema argument.
Here’s how it should look:
def get_product_selectors(html: str) -> ProductSelectors:
    client = genai.Client()
    prompt = f"Extract CSS selectors for retrieving information about all products in the following HTML: {html}"
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=prompt,
        config={
            "response_mime_type": "application/json",
            "response_schema": ProductSelectors,
        },
    )
    return response.parsed
Next up, we’ll need to use the retrieved CSS selectors to parse out the data ourselves this time. We’ll use BeautifulSoup for that, which we installed at the beginning of the tutorial.
Let’s write another function called parse_products_with_selectors that accepts the HTML code and the parsed out product selectors. It should return a list of Product.
def parse_products_with_selectors(html: str, selectors: ProductSelectors) -> list[Product]:
    ...
Inside the function, we’ll simply need to select all product cards in the HTML, iterate over each one, and extract the data we need.
It can look something like this:
def parse_products_with_selectors(html: str, selectors: ProductSelectors) -> list[Product]:
    soup = BeautifulSoup(html, "html.parser")
    products = soup.select(selectors.product_card_selector)
    parsed_products = []
    for product in products:
        title_element = product.select_one(selectors.title_selector)
        title = title_element.text if title_element else ""
        price_element = product.select_one(selectors.price_selector)
        price = price_element.text if price_element else ""
        description_element = product.select_one(selectors.description_selector)
        description = description_element.text if description_element else ""
        parsed_product = Product(title=title, price=price, description=description)
        parsed_products.append(parsed_product)
    return parsed_products
Let’s try to combine it all and run the script once more. We can also store the scraped products in a JSON file, as we did before.
url = "https://sandbox.oxylabs.io/products"
content = scrape_website(url)
selectors = get_product_selectors(content)
products = parse_products_with_selectors(content, selectors)
products_json = [product.model_dump() for product in products]
with open("products.json", "w") as f:
json.dump(products_json, f, indent=2)
If you run the script now, you should see the same products as you did in the first example.
Even though the results are the same, this approach provides the option of calling the LLM once and storing the selectors to be reused, making the scraping process more optimized.
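One way to act on that — a minimal caching sketch, assuming a local selectors.json file is an acceptable place to store the selectors and that the page layout stays stable between runs:
import os

# Hypothetical cache file name; any persistent storage would work.
SELECTORS_FILE = "selectors.json"

def load_or_create_selectors(html: str) -> ProductSelectors:
    # Reuse previously extracted selectors if they already exist on disk.
    if os.path.exists(SELECTORS_FILE):
        with open(SELECTORS_FILE) as f:
            return ProductSelectors(**json.load(f))
    # Otherwise, call Gemini once and cache the result for future runs.
    selectors = get_product_selectors(html)
    with open(SELECTORS_FILE, "w") as f:
        json.dump(selectors.model_dump(), f, indent=2)
    return selectors
If parsing ever starts returning empty fields, deleting the cached file forces a fresh selector extraction on the next run.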
Here’s the complete code with both examples combined:
import json
from pydantic import BaseModel
import requests
from bs4 import BeautifulSoup
from google import genai


OXYLABS_USERNAME = "USERNAME"
OXYLABS_PASSWORD = "PASSWORD"


class ProductSelectors(BaseModel):
    title_selector: str
    price_selector: str
    description_selector: str
    product_card_selector: str


class Product(BaseModel):
    title: str
    price: str
    description: str


def get_product_selectors(html: str) -> ProductSelectors:
    client = genai.Client()
    prompt = f"Extract CSS selectors for retrieving information about all products in the following HTML: {html}"
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=prompt,
        config={
            "response_mime_type": "application/json",
            "response_schema": ProductSelectors,
        },
    )
    return response.parsed


def extract_products_from_html(html: str) -> list[Product]:
    client = genai.Client()
    prompt = f"Extract information about all products in the following HTML: {html}."
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=prompt,
        config={
            "response_mime_type": "application/json",
            "response_schema": list[Product],
        },
    )
    return response.parsed


def scrape_website(url: str) -> str:
    proxy_url = (
        f"http://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"
    )
    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    response = requests.get(url, proxies=proxies)
    response.raise_for_status()
    return response.text


def parse_products_with_selectors(
    html: str, selectors: ProductSelectors
) -> list[Product]:
    soup = BeautifulSoup(html, "html.parser")
    products = soup.select(selectors.product_card_selector)
    parsed_products = []
    for product in products:
        title_element = product.select_one(selectors.title_selector)
        title = title_element.text if title_element else ""
        price_element = product.select_one(selectors.price_selector)
        price = price_element.text if price_element else ""
        description_element = product.select_one(selectors.description_selector)
        description = description_element.text if description_element else ""
        parsed_product = Product(title=title, price=price, description=description)
        parsed_products.append(parsed_product)
    return parsed_products


url = "https://sandbox.oxylabs.io/products"

# First example
content = scrape_website(url)
products = extract_products_from_html(content)

products_json = [product.model_dump() for product in products]
with open("products_1.json", "w") as f:
    json.dump(products_json, f, indent=2)

# Second example
content = scrape_website(url)
selectors = get_product_selectors(content)
products = parse_products_with_selectors(content, selectors)

products_json = [product.model_dump() for product in products]
with open("products_2.json", "w") as f:
    json.dump(products_json, f, indent=2)
By combining AI and web scraping, Google Gemini unlocks new possibilities for extracting, interpreting, and transforming online data. Its LLM-based approach allows developers to automate complex tasks like categorizing, summarizing, or cleaning raw HTML data. Below are several Gemini web scraping use cases.
In e-commerce, accurate and timely product data extraction is essential for tracking competitors and managing dynamic inventories. With Gemini, developers can scrape and interpret data from top marketplaces, such as Amazon, eBay, or Shopify, turning unstructured HTML into structured datasets containing product names, prices, images, and reviews.
Unlike conventional scrapers that break when layouts change, a Gemini AI scraper adapts to online store variations thanks to its AI-driven flexibility. It can automatically detect product metadata, specifications, and ratings across retail and marketplace websites, ensuring consistent output even when catalogs update in real time. This makes it ideal for handling large, dynamic inventories, maintaining real-time data accuracy, and reducing the need for manual parser updates.
Content aggregation is another area where Gemini AI web scraping shines. It can extract and summarize articles from multiple news outlets, media publications, and blog websites without complex selectors or manual rule sets. By parsing headlines, body text, authors, and timestamps, Gemini enables developers to collect consistent article data from RSS feeds or dynamically rendered pages.
Gemini’s natural language processing (NLP) capabilities allow for automatic summarization, identifying key points and categorizing topics for easy curation. Whether you’re building a news monitoring dashboard or a content discovery platform, Gemini helps automate scraping and aggregation pipelines while maintaining the context and relevance of each story, turning scattered online text into a structured content database.
Retrieval-Augmented Generation (RAG) is a powerful approach that combines an LLM with a knowledge base for contextual responses. Gemini web scraping supports RAG applications by providing accurate and real-time data updates from the web.
Developers can scrape websites, APIs, or news portals to feed fresh content into an AI system, ensuring continuous synchronization and data freshness. This enhances AI chatbot accuracy, response quality, and relevance. By connecting web scraping pipelines with Gemini’s contextual understanding, chatbots can go beyond static knowledge, offering dynamic answers powered by live data extraction and intelligent content retrieval.
For business intelligence teams, Gemini AI web scraping delivers a powerful edge in competitor analysis. It automates price monitoring and market research, aggregating competitive pricing, trends, and analytics from multiple sources. Marketers and analysts can extract insights, generate automated reports, and even identify new leads by gathering contact information from public listings.
Unlike traditional scripts that only collect raw data, Gemini can interpret context, classify competitors, and highlight market gaps, turning unstructured data into actionable intelligence. From lead generation to strategic decision-making, Gemini streamlines data-driven analysis, providing a smarter, more adaptive foundation for businesses seeking real-time visibility into their markets.
While Gemini AI introduces a more flexible approach to web scraping, it comes with its own set of challenges and limitations. Developers must carefully balance the model’s AI-driven accuracy with factors like speed, cost, and data reliability. Understanding these constraints is key to designing efficient scraping pipelines that leverage Gemini’s strengths without overextending its capabilities.
When using AI for web scraping, speed and performance are critical factors to evaluate. Unlike traditional HTML data parsers or custom scrapers that process requests in milliseconds, Gemini’s AI-based processing typically takes 2–5 seconds per query, depending on prompt complexity and response size. This delay stems from the ML operations required to structure and format the extracted data intelligently.
While this may be acceptable for small-scale data extraction or analysis tasks, it becomes a bottleneck in large-scale scraping projects. In short, AI-powered scraping sacrifices raw speed for intelligence and accuracy. Gemini excels at understanding ambiguous or inconsistent layouts, but developers should avoid relying on it for real-time scraping where latency directly impacts performance.
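If you do need to process many pages, one common way to offset per-request latency is to run scrapes and extractions concurrently. A sketch using Python’s standard library and the functions defined earlier in this tutorial; the page URLs here are only placeholders:
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of pages to process; replace with the URLs you actually need.
urls = [
    "https://sandbox.oxylabs.io/products?page=1",
    "https://sandbox.oxylabs.io/products?page=2",
]

def scrape_and_extract(url: str) -> list[Product]:
    return extract_products_from_html(scrape_website(url))

# Each page is scraped and sent to Gemini in its own thread.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(scrape_and_extract, urls))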
Running Gemini web scraping at scale introduces unique cost considerations due to its token-based pricing model. Each API request consumes tokens based on the length of your HTML input and Gemini’s output, making large-scale scraping potentially expensive if not optimized. Unlike traditional scrapers that only incur infrastructure or proxy costs, Gemini’s AI processing adds computational overhead that can quickly accumulate when dealing with thousands of pages.
To manage budget effectively, developers should batch requests, summarize pages, or limit extraction to essential data points. In enterprise environments, this often translates into a trade-off between data volume and AI-enhanced quality. Proper cost management ensures that teams can benefit from Gemini’s analytical depth without overspending, aligning scale economics with project objectives.
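One simple optimization along these lines is trimming the HTML before sending it to Gemini, since tags like script and style consume tokens without adding product data. A minimal sketch using the BeautifulSoup library installed for this tutorial; the exact tags worth removing depend on the site:
def trim_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop markup that rarely contains the data we're after but inflates token usage.
    for tag in soup(["script", "style", "svg", "noscript", "head"]):
        tag.decompose()
    return str(soup)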
Despite its strengths, LLM-based extraction with Gemini AI can sometimes struggle with accuracy and data quality. Because the model generates text predictions rather than executing strict parsing rules, it can introduce hallucinations or misinterpret complex layouts that contain nested elements or inconsistent markup.
These issues are particularly noticeable in structured data extraction, where precision matters more than interpretation. While Gemini handles most semantic understanding tasks well, its data extraction results should be validated to ensure quality assurance. Combining Gemini with traditional HTML parsing or schema validation frameworks can help minimize these errors, offering a balance between AI flexibility and data integrity in production-grade scraping workflows.
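As a simple example of such validation, you could tighten the pydantic model so that obviously bad extractions are rejected before they reach your dataset. This is a sketch only; the field names and rules are illustrative, not part of the tutorial’s original code:
from pydantic import BaseModel, field_validator

class ValidatedProduct(BaseModel):
    title: str
    price: str
    description: str

    @field_validator("title", "price")
    @classmethod
    def must_not_be_empty(cls, value: str) -> str:
        # Empty fields usually mean the model missed an element or hallucinated.
        if not value.strip():
            raise ValueError("field must not be empty")
        return value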
This tutorial explored how to use Gemini for web scraping in Python, from setting up your environment and building the scraper to extracting structured data through Gemini’s API. You learned how to combine LLM-powered parsing with Oxylabs Residential Proxies, apply it to real-world use cases, and navigate its key challenges like performance, cost, and data accuracy for smarter, AI-powered scraping.
Interested in discovering similar topics? Check out our blog and dive deeper into ChatGPT web scraping, Google AI overview scraper, Claude web scraping, AI model training, LLM training data, and more.
You’ll need Python, the google-genai library to access the Gemini API, and standard scraping tools like Requests and BeautifulSoup for fetching HTML. Proxies are also useful for bypassing geo-blocks and anti-bot systems when scraping larger websites.
Google Gemini can’t directly crawl websites, but it can analyze and extract data from HTML once the content is fetched using Python or other tools. By sending page data to the Gemini API, developers can perform intelligent AI data collection and structuring tasks.
Use Gemini web scraping when you need AI-assisted data extraction, such as summarizing, categorizing, or structuring messy HTML content. It’s ideal for projects focused on data quality and interpretation, not raw speed or high-volume scraping.
The cost depends on Gemini’s token-based pricing, which varies by model and response length. Each HTML input and output consumes tokens, so large-scale scraping can become costly unless optimized with shorter prompts or batch processing.
Gemini itself doesn’t render JavaScript. To scrape such pages, developers must first use tools like Playwright or Selenium to capture the fully loaded HTML before sending it to Gemini for AI-based extraction.
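As a rough sketch of that workflow, assuming Playwright is installed (pip install playwright, followed by playwright install chromium), you could render the page first and then pass the resulting HTML to Gemini as before:
from playwright.sync_api import sync_playwright

def render_page(url: str) -> str:
    # Launch a headless browser, let the page finish loading, and return its HTML.
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html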
No, Gemini web scraping is slower because of AI processing time (usually 2–5 seconds per query). However, it’s more accurate and flexible, making it a better fit for semantic extraction and complex data interpretation rather than high-speed crawling.
About the author
Yelyzaveta Hayrapetyan
Senior Technical Copywriter
Yelyzaveta Hayrapetyan is a Senior Technical Copywriter at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.