Web scraping faces two core technical hurdles: reliably acquiring data from heavily defended sites and efficiently parsing the resulting HTML into structured data. This guide tackles both by combining proxies for data acquisition with Perplexity AI for intelligent parsing.
In this guide, we’ll demonstrate a scalable, cost-efficient method for automating selector identification, transforming brittle scrapers into resilient web data extraction pipelines using Python.
Traditional web scraping logic is often very fragile, relying on manual maintenance of CSS selectors that break whenever a site updates. Perplexity AI solves this by functioning as an AI-driven HTML parsing engine.
Perplexity's proprietary Sonar model family is optimized for grounded information analysis. By feeding raw HTML, gathered through Residential Proxies, to the Perplexity API, you can use natural language prompts to generate specific yet reusable CSS selectors.
This strategy creates a self-healing scraper that drastically reduces maintenance time and reserves LLM usage for high-level parsing logic.
Let’s take a look at how to scrape websites like Amazon with the help of Perplexity AI, Oxylabs Residential proxies, and Python.
You can download the latest version of Python from the official website.
Once that’s done, open up a terminal window and run the following commands to create a virtual environment, activate it, and install the required libraries (on Windows, activate with env\Scripts\activate instead):
python -m venv env
source env/bin/activate
pip install pydantic beautifulsoup4 requests instructor
As seen above, we’ll be using a few different libraries in this tutorial, including:
pydantic for defining data models that describe the output format we expect from Perplexity.
beautifulsoup4 (BeautifulSoup) for HTML parsing and cleanup.
requests for performing HTTP requests.
instructor, a wrapper around the OpenAI-compatible Perplexity API that returns structured JSON outputs based on the pydantic models.
To be able to scrape and parse the raw HTML of a website, we’ll need to get access to both Perplexity AI and Oxylabs services. For the AI, you’ll first need to obtain an API key from the Perplexity console. Once that’s done, you’ll need to add funds to your account to start using the API.
As for Oxylabs, you’ll need to retrieve the credentials for a Residential Proxies user. You can find them in the Oxylabs dashboard, under the Residential Proxies section. Keep both the proxy credentials and the Perplexity API key at hand; we’ll use them later in the tutorial.
Now that we have the credentials retrieved, let’s take a look at how to scrape and parse HTML from Amazon with Perplexity and Oxylabs. First off, let’s set up the foundation of the script.
To begin, let’s create a file called main.py in the same folder where you previously created a virtual environment. Open it up and add these lines at the start to import the installed libraries.
import json
from pydantic import BaseModel
import requests
import instructor
from bs4 import BeautifulSoup, Comment
Next up, we can define our Oxylabs credentials as global variables at the top of the script. Ensure that you replace the placeholder values with your retrieved credentials.
OXYLABS_USERNAME="USERNAME"
OXYLABS_PASSWORD="PASSWORD"
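If you’d rather not hardcode credentials in the script, you could load them from environment variables instead; a minimal sketch, where the variable names are our own convention rather than something Oxylabs requires:

import os

# Fall back to placeholder values if the environment variables aren't set.
OXYLABS_USERNAME = os.getenv("OXYLABS_USERNAME", "USERNAME")
OXYLABS_PASSWORD = os.getenv("OXYLABS_PASSWORD", "PASSWORD")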
As for the Perplexity API key, we’ll set it as an environment variable, which the instructor client will consume later on. To do that, run the following command in your terminal window:
export PERPLEXITY_API_KEY="YOUR_API_KEY"
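If you’re on Windows, the PowerShell equivalent is:

$env:PERPLEXITY_API_KEY = "YOUR_API_KEY"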
Now that we have the base foundation for our script, we can move to using Oxylabs Residential proxies to scrape a website.
To begin, we’ll implement a separate function that will be used solely for scraping web data from a provided URL, returning the retrieved HTML once the process is complete. This will make our code more structured, readable, and reusable.
Let’s call this function scrape_website. It should accept a url parameter of type str and return the page HTML as a string.
def scrape_website(url: str) -> str:
...
Next, we can define the necessary variables for using the proxy from Oxylabs. We’ll need a URL for the proxy, along with the credentials we retrieved earlier. For this example, we’ll be using the rotating residential entry point. Here’s what it should look like:
proxy_url = f"https://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"
proxies = {
"http": proxy_url,
"https": proxy_url
}
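If you need country-level targeting, Oxylabs supports geo-location flags appended to the proxy username; a sketch assuming the cc flag syntax (double-check the exact format in the Oxylabs documentation):

# Target US exit nodes by adding a country flag to the username.
proxy_url = (
    f"https://customer-{OXYLABS_USERNAME}-cc-US:{OXYLABS_PASSWORD}"
    "@pr.oxylabs.io:7777"
)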
Along with assigning a proxy to our request, let’s also add some additional headers so that our request looks more like it was sent from a browser. A User-Agent along with an Accept-Language header should be enough for now.
headers = {
"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
}
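If you plan to send many requests, you could also rotate between several browser-like User-Agent strings rather than reusing a single one; a minimal sketch, with illustrative UA strings:

import random

# A small pool of browser-like User-Agent strings; keep these current in real use.
USER_AGENTS = [
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),  # pick a random UA per run
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
}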
With our proxy and headers defined, we can build our HTTP request with the previously installed requests library. Here’s how the full function should look:
def scrape_website(url: str) -> str:
proxy_url = (
f"https://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"
)
proxies = {
"http": proxy_url,
"https": proxy_url,
}
headers = {
"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
}
    response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
response.raise_for_status()
return response.text
The response.raise_for_status() call raises an exception whenever the server returns an error status code (4xx or 5xx), so the script fails fast instead of passing a bad response down the pipeline. The timeout=30 argument similarly makes the request fail instead of hanging indefinitely on a stalled connection.
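Even with residential proxies, individual requests can occasionally fail, so in production you may want a simple retry wrapper around this function. A minimal sketch, where scrape_website_with_retries, the retry count, and the backoff are our own illustrative choices:

import time

def scrape_website_with_retries(url: str, max_retries: int = 3) -> str:
    # Naive retry loop with linear backoff around scrape_website().
    for attempt in range(1, max_retries + 1):
        try:
            return scrape_website(url)
        except requests.RequestException:
            if attempt == max_retries:
                raise
            time.sleep(2 * attempt)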
After setting the foundation of scraping the HTML from any website, let’s try to incorporate it together with Perplexity to get product data from Amazon web pages.
Perplexity’s models are fully capable of parsing HTML and extracting the data we need from a plain natural language prompt. However, since LLM calls can be expensive, we’ll use Perplexity only to extract the CSS selectors for the relevant sections of the Amazon page HTML.
This way, we call the LLM only once and reuse the retrieved selectors for all subsequent extraction, relying solely on the proxies from then on. The result is a script that is both more cost-efficient and faster, since most runs never touch the API.
Let’s start by defining a Pydantic model, which will describe what our response from Perplexity should look like. Attribute names should clearly describe what we expect from the LLM, so let’s end each one with _selector. For this example, we’ll extract the title, price, and rating of an item on Amazon. The model can look like this:
class ProductSelector(BaseModel):
title_selector: str
price_selector: str
rating_selector: str
Next, we can start defining a function for retrieving the selectors. Let’s call it get_amazon_selectors. It should accept an html string as an argument and return the ProductSelector model.
def get_amazon_selectors(html: str) -> ProductSelector:
...
Inside the function, let’s define the client we’ll be using to access Perplexity’s API. Since we’ll be using instructor as a wrapper, the client will be initialized like this:
client = instructor.from_provider("perplexity/sonar")
Next, we can define the prompt for Perplexity. The prompt can be a simple instruction to parse out CSS selectors from the provided HTML code, with the actual code injected directly into the prompt. Here’s an example:
prompt = f"Extract CSS selectors for retrieving the title, price and rating of a product in the following Amazon HTML: {html}"
What’s left is to perform the API call to Perplexity with the declared parameters. The full function should look like this:
def get_amazon_selectors(html: str) -> ProductSelector:
client = instructor.from_provider("perplexity/sonar")
prompt = f"Extract CSS selectors for retrieving the title, price and rating of a product in the following Amazon HTML: {html}"
product_info = client.chat.completions.create(
messages=[{"role": "user", "content": [{"type": "text", "text": prompt}]}],
response_model=ProductSelector,
)
return product_info
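Since ProductSelector is a plain Pydantic model, you could also cache the returned selectors on disk, so repeated runs skip the LLM call entirely until a selector stops matching. A minimal sketch, where the get_cached_selectors helper and the selectors.json path are our own illustrative choices:

from pathlib import Path

def get_cached_selectors(html: str, cache_path: str = "selectors.json") -> ProductSelector:
    # Reuse previously extracted selectors if a cache file exists;
    # otherwise, call Perplexity once and store the result for next time.
    cache = Path(cache_path)
    if cache.exists():
        return ProductSelector.model_validate_json(cache.read_text())
    selectors = get_amazon_selectors(html)
    cache.write_text(selectors.model_dump_json())
    return selectors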
If we asked Perplexity to parse CSS selectors out of the entire Amazon HTML page, we’d likely get an error telling us the prompt is too long for a single API request. Even Perplexity's Sonar Pro models, with context windows of up to 200k tokens, become less efficient and more expensive when processing megabytes of raw CSS, JavaScript, and navigation markup.
However, we don’t really need the full HTML content we get from Amazon. For our use case, we can isolate a single item section and pull the title, price, and rating from it, reducing the HTML to a fraction of its original size.
We can also remove any unnecessary elements, like script, style, and link tags that we wouldn’t use anyway. This can be done by utilizing the previously installed BeautifulSoup library, which can find and remove the unused HTML elements from the scraped Amazon page.
Let’s implement a function called clean_html_for_llm, which will select a single product element from an Amazon search results page and strip out the parts of the HTML that aren’t needed for retrieving the selectors. Here’s what it can look like:
def clean_html_for_llm(html_content: str) -> str:
soup = BeautifulSoup(html_content, "html.parser")
    results_list = soup.find_all("div", {"class": "s-card-container"})
    if not results_list:
        raise ValueError("No product cards found in the scraped HTML")
    result = results_list[0]
for tag in result.find_all(
["script", "style", "link", "meta", "noscript", "iframe"]
):
tag.decompose()
# Remove comments
for comment in result.find_all(string=lambda text: isinstance(text, Comment)):
comment.extract()
# Remove unnecessary attributes but keep important ones
important_attrs = {"class", "href"}
for tag in result.find_all():
if tag.attrs:
tag.attrs = {k: v for k, v in tag.attrs.items() if k in important_attrs}
# Remove navigation and footer elements
for selector in ["nav", "footer", "header", ".nav", ".footer", ".header"]:
for tag in result.select(selector):
tag.decompose()
# Remove ads and promotional content
for selector in [
".ad",
".advertisement",
".promo",
".sponsored",
"[data-ad]",
"[data-ads]",
]:
for tag in result.select(selector):
tag.decompose()
return str(result)
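Before wiring everything together, you can sanity-check how much this trims the payload that would otherwise go to the LLM:

raw_html = scrape_website("https://www.amazon.com/b?node=12097479011")
trimmed = clean_html_for_llm(raw_html)
print(f"Raw HTML: {len(raw_html):,} characters, trimmed: {len(trimmed):,} characters")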
The last part we haven’t covered is retrieving the product data from the HTML elements, using the CSS selectors provided by Perplexity. This can be done by passing each selector to BeautifulSoup’s select_one() method and saving the results into a dictionary for further use.
Let’s do that in another function called parse_products_with_selectors. It should accept an html argument of type str and a selector argument of type ProductSelector. Inside the function, we should create a BeautifulSoup object with the provided HTML code and select all product sections, similar to what we did in the clean_html_for_llm function.
All that’s left is to loop through each section and use the CSS selectors provided in the function. Here’s what it should look like:
def parse_products_with_selectors(html: str, selector: ProductSelector) -> list[dict]:
soup = BeautifulSoup(html, "html.parser")
products = soup.find_all("div", {"class": "s-card-container"})
parsed_products = []
for product in products:
title_element = product.select_one(selector.title_selector)
title = title_element.text if title_element else ""
price_element = product.select_one(selector.price_selector)
price = price_element.text if price_element else ""
rating_element = product.select_one(selector.rating_selector)
rating = rating_element.text if rating_element else ""
parsed_product = {
"title": title,
"price": price,
"rating": rating,
}
parsed_products.append(parsed_product)
return parsed_products
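One small caveat: the .text property keeps the element’s original whitespace. If your results come out padded with spaces or newlines, you can swap it for BeautifulSoup’s get_text(strip=True):

title = title_element.get_text(strip=True) if title_element else ""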
Now that we have all the necessary functions and variables defined, let’s combine them into a single working script for web scraping Amazon. Navigate to the bottom of the Python file and define a variable for the Amazon URL we’d like to scrape. Let’s use the results page for over-ear headphones for this example.
amazon_url = "https://www.amazon.com/b?node=12097479011"
We can now use this URL as an argument for the scrape_website function we defined earlier. What’s left is to call each function in sequence to get a list of parsed results for over-ear headphones from Amazon. It should look like this:
amazon_url = "https://www.amazon.com/b?node=12097479011"
amazon_content = scrape_website(amazon_url)
clean_html = clean_html_for_llm(amazon_content)
amazon_selectors = get_amazon_selectors(clean_html)
products = parse_products_with_selectors(amazon_content, amazon_selectors)
We can also store our scraped items into a JSON file like in the following example:
with open("amazon.json", "w") as f:
json.dump(products, f, indent=2)
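If you’d prefer a spreadsheet-friendly format instead, the same list of dictionaries can be written as CSV using Python’s standard library:

import csv

with open("amazon.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
    writer.writeheader()
    writer.writerows(products)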
Once you run the script, a JSON file called amazon.json should appear in your current directory and should contain something like this:
[
{
"title": "Amazon Basics Hybrid Active Noise Cancelling Headphones, 35 Hours Playtime with ANC on and 45 Hours with ANC Off, Wireless, Over Ear Comfortable Fit, Bluetooth, One Size, Black",
"price": "$42.75",
"rating": "4.4 out of 5 stars"
},
{
"title": "KVIDIO Bluetooth Headphones Over Ear, 65 Hours Playtime Wireless Headphones with Microphone, Foldable Lightweight Headset with Deep Bass, HiFi Stereo Sound Low Latency for Travel Work Cellphone",
"price": "$18.26",
"rating": "4.5 out of 5 stars"
},
{
"title": "Soundcore Anker Life Q20 Hybrid Active Noise Cancelling Headphones, Wireless Over Ear Bluetooth Headphones, 60H Playtime, Hi-Res Audio, Deep Bass, Foam Ear Cups, Travel, Office, USB-C Charging",
"price": "$34.99",
"rating": "4.5 out of 5 stars"
},
{
"title": "Sony WH-CH720N Noise Canceling Wireless Headphones Bluetooth Over The Ear Headset with Microphone and Alexa Built-in, Black New",
"price": "$98.00",
"rating": "4.4 out of 5 stars"
}
]
Here’s the complete script for web scraping Amazon with Perplexity and Oxylabs Residential Proxies:
import json
from pydantic import BaseModel
import requests
import instructor
from bs4 import BeautifulSoup, Comment
OXYLABS_USERNAME = "USERNAME"
OXYLABS_PASSWORD = "PASSWORD"
class ProductSelector(BaseModel):
title_selector: str
price_selector: str
rating_selector: str
def get_amazon_selectors(html: str) -> ProductSelector:
client = instructor.from_provider("perplexity/sonar")
prompt = f"Extract CSS selectors for retrieving the title, price and rating of a product in the following Amazon HTML: {html}"
product_info = client.chat.completions.create(
messages=[{"role": "user", "content": [{"type": "text", "text": prompt}]}],
response_model=ProductSelector,
)
return product_info
def scrape_website(url: str) -> str:
proxy_url = (
f"https://customer-{OXYLABS_USERNAME}:{OXYLABS_PASSWORD}@pr.oxylabs.io:7777"
)
proxies = {
"http": proxy_url,
"https": proxy_url,
}
headers = {
"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
}
    response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
response.raise_for_status()
return response.text
def clean_html_for_llm(html_content: str) -> str:
soup = BeautifulSoup(html_content, "html.parser")
    results_list = soup.find_all("div", {"class": "s-card-container"})
    if not results_list:
        raise ValueError("No product cards found in the scraped HTML")
    result = results_list[0]
for tag in result.find_all(
["script", "style", "link", "meta", "noscript", "iframe"]
):
tag.decompose()
# Remove comments
for comment in result.find_all(string=lambda text: isinstance(text, Comment)):
comment.extract()
# Remove unnecessary attributes but keep important ones
important_attrs = {"class", "href"}
for tag in result.find_all():
if tag.attrs:
tag.attrs = {k: v for k, v in tag.attrs.items() if k in important_attrs}
# Remove navigation and footer elements
for selector in ["nav", "footer", "header", ".nav", ".footer", ".header"]:
for tag in result.select(selector):
tag.decompose()
# Remove ads and promotional content
for selector in [
".ad",
".advertisement",
".promo",
".sponsored",
"[data-ad]",
"[data-ads]",
]:
for tag in result.select(selector):
tag.decompose()
return str(result)
def parse_products_with_selectors(html: str, selector: ProductSelector) -> list[dict]:
soup = BeautifulSoup(html, "html.parser")
products = soup.find_all("div", {"class": "s-card-container"})
parsed_products = []
for product in products:
title_element = product.select_one(selector.title_selector)
title = title_element.text if title_element else ""
price_element = product.select_one(selector.price_selector)
price = price_element.text if price_element else ""
rating_element = product.select_one(selector.rating_selector)
rating = rating_element.text if rating_element else ""
parsed_product = {
"title": title,
"price": price,
"rating": rating,
}
parsed_products.append(parsed_product)
return parsed_products
amazon_url = "https://www.amazon.com/b?node=12097479011"
amazon_content = scrape_website(amazon_url)
clean_html = clean_html_for_llm(amazon_content)
amazon_selectors = get_amazon_selectors(clean_html)
products = parse_products_with_selectors(amazon_content, amazon_selectors)
with open("amazon.json", "w") as f:
json.dump(products, f, indent=2)
Just as we used this approach for Amazon, the combination of Perplexity's AI and Oxylabs proxies can be used for structured data collection from virtually any target, be it search engines, market research platforms, or even large language models themselves. You can also level up your Perplexity scraping strategy with a full Web Scraper API, which handles JavaScript rendering, CAPTCHA solving, and proxy selection, so you can focus solely on LLM prompt engineering.
Oxylabs offers several other scraper APIs that could be paired with Perplexity in the same way we used Residential Proxies here.
In this tutorial, we covered how to scrape Amazon with Perplexity and Oxylabs Residential Proxies. We looked not only at how to scrape items from an Amazon page, but also at how to use Perplexity to extract CSS selectors, saving additional calls to the LLM. This workflow shows the clear potential of integrating LLMs with web scraping services, making the process simpler, more cost-efficient, and highly scalable.
If you want to learn more, check out our deep dives into what LLMs are and how additional LLM training data can be used to create advanced RAG models. Also, feel free to check out our Perplexity scraping documentation for a detailed guide on how to scrape Perplexity AI data and responses using Oxylabs Web Scraper API.
About the author
Dovydas Vėsa
Technical Content Researcher
Dovydas Vėsa is a Technical Content Researcher at Oxylabs. He creates in-depth technical content and tutorials for web scraping and data collection solutions, drawing from a background in journalism, cybersecurity, and a lifelong passion for tech, gaming, and all kinds of creative projects.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
Can Perplexity AI scrape websites?
No, Perplexity AI cannot scrape websites by itself, but it can work as a data parsing and extraction engine. External tools, like Residential Proxies, must first acquire the raw HTML content and then feed it to the Perplexity API as context. Perplexity can then use its LLM capabilities to extract structured data (JSON) or parsing logic (selectors).
Why are proxies needed for the data acquisition step?
Residential Proxies are necessary for a smooth data acquisition step. For complex targets like Amazon, a high-reputation, rotating IP address pool is a must to bypass anti-bot measures and IP bans, allowing the scraper to reliably retrieve the complete HTML data that is then passed to the Perplexity API for analysis.
Is LLM-assisted parsing cost-effective?
LLM-assisted parsing to generate selectors is generally more cost-effective for large, dynamic sites with changing layouts. By using Perplexity to generate reusable CSS selectors only once, you avoid manually updating broken parsing logic and, in turn, reduce maintenance expenses and time costs.
How does the Perplexity API return structured output?
The Perplexity API ensures structured output by working with Python libraries like Instructor and Pydantic. Passing a Pydantic model as the response_model in the API request allows you to strictly constrain the LLM's output, forcing it to return clean, structured JSON data ready for any production pipeline.