You’ve likely run into blocks if you’ve done any scraping on Amazon. Naturally, you may ask why this is the case. It’s because Amazon, like many other e-commerce websites, uses CAPTCHA to prevent bots and automated scripts from accessing its content. This means that without a specialized scraping tool that avoids CAPTCHAs, extracting data from Amazon is nearly impossible.
Thankfully, such tools are easily accessible, and below, we show a step-by-step guide on how to extract data with our Amazon Product Data API solution.
You can find the following code on our GitHub.
Let’s begin by setting up a simple Amazon scraper and see if it runs into any CAPTCHAs. For the purpose of this tutorial, we’ll be using Python, but this could be done in almost any other language, too.
import requests

# Browser-like headers sent with the request.
custom_headers = {
    "Accept-language": "en-GB,en;q=0.9",
    # Note: decoding "br" responses requires the brotli package to be installed.
    "Accept-Encoding": "gzip, deflate, br",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "User-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
}

url = "https://www.amazon.com/SAMSUNG-Border-Less-Compatible-Adjustable-LS24AG302NNXZA/dp/B096N2MV3H?ref_=Oct_DLandingS_D_fe3953dd_2"
response = requests.get(url, headers=custom_headers)

# Save the returned HTML so it can be opened and inspected.
with open('with_captcha.html', 'w', encoding='utf-8') as file:
    file.write(response.text)
Here, we have a very simple script that sends a request to Amazon, fetches the HTML of the page, and saves it as a file for inspection. We also create custom headers for our request. Without them, the request would get rejected right away.
If we open up the resulting HTML file, we can see that we ran into the issue we were expecting: a CAPTCHA page instead of the product page.
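To confirm the block without opening the file by hand, a small check on the returned HTML can help. The sketch below is only a heuristic: the helper name `looks_like_captcha` is hypothetical, and the marker strings are assumptions about what Amazon's block page contains, so they may need adjusting as the page changes.

```python
# Marker strings are assumptions about Amazon's CAPTCHA page, not a
# documented contract; update them if the block page changes.
CAPTCHA_MARKERS = (
    "/errors/validateCaptcha",
    "Enter the characters you see below",
    "api-services-support@amazon.com",
)


def looks_like_captcha(html: str) -> bool:
    """Return True if the HTML contains any of the assumed block-page markers."""
    lowered = html.lower()
    return any(marker.lower() in lowered for marker in CAPTCHA_MARKERS)


if __name__ == "__main__":
    # Inline samples stand in for real responses so the sketch runs offline.
    sample_block_page = '<form method="get" action="/errors/validateCaptcha"></form>'
    sample_product_page = '<span id="productTitle">SAMSUNG monitor</span>'
    print(looks_like_captcha(sample_block_page))
    print(looks_like_captcha(sample_product_page))
```

In the scraper above, this could be called as `looks_like_captcha(response.text)` right after the request, before deciding whether the saved file is worth parsing.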
While there are many different ways to approach this issue, let’s use the Oxylabs Amazon Product Data API. This tool is specifically built to avoid Amazon CAPTCHA while scraping. Here’s a short script that’ll help us to utilize the API:
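import requests

# Describe the job: which source to use, which page to fetch,
# and whether to return parsed (structured) data.
payload = {
    'source': 'amazon',
    'url': 'https://www.amazon.com/dp/B096N2MV3H',
    'parse': True
}

# Replace 'username' and 'password' with your API credentials.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('username', 'password'),
    json=payload,
)

# Save the JSON response for inspection.
with open('without_captcha.json', 'w', encoding='utf-8') as file:
    file.write(response.text)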
If we look at our results file, we can see that the page was scraped successfully without any CAPTCHA solving. We even managed to retrieve the information in a structured format.
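Once the JSON is saved, a few fields can be pulled out of it for a quick look. The following is a minimal sketch under stated assumptions: it assumes the parsed data sits under `results[0]["content"]`, and the field names `title`, `price`, and `asin` are illustrative rather than confirmed here, so check them against your own `without_captcha.json`. The helper name `summarize_result` is hypothetical.

```python
import json
from typing import Any


def summarize_result(raw_json: str, fields: tuple = ("title", "price", "asin")) -> dict:
    """Pick selected fields from the first result's parsed content.

    Assumes a layout of {"results": [{"content": {...}}]}; returns an
    empty dict if that layout is not present.
    """
    data: Any = json.loads(raw_json)
    results = data.get("results") or []
    if not results:
        return {}
    content = results[0].get("content")
    if not isinstance(content, dict):
        return {}
    # Missing fields come back as None instead of raising KeyError.
    return {name: content.get(name) for name in fields}


if __name__ == "__main__":
    # Inline sample with made-up values so the sketch runs without an API call.
    sample = json.dumps(
        {"results": [{"content": {"title": "Example monitor", "price": 199.99, "asin": "B096N2MV3H"}}]}
    )
    print(summarize_result(sample))
```

With a real response, the file contents can be passed in directly, for example `summarize_result(open('without_captcha.json', encoding='utf-8').read())`.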
There you have it: using this simple script combined with our Amazon Product Data API will allow you to successfully scrape Amazon without running into CAPTCHA.
As you can see, scraping Amazon data is a relatively straightforward and quick process with a dedicated scraping tool. Other bypassing options you may want to consider include CAPTCHA proxies, using Selenium to handle CAPTCHAs, Playwright to bypass CAPTCHAs, and Puppeteer to overcome CAPTCHA tests. If any questions arise throughout this tutorial, or you’re curious to learn more about our solutions/scraping in general, don’t hesitate to contact us at hello@oxylabs.io.
We also have several tutorials available for gathering different types of Amazon data.
Without CAPTCHA, even the most basic automated scripts would get through to Amazon, significantly affecting the website's stability and worsening the user experience.
Amazon uses a variety of CAPTCHA types. Common ones include text-based CAPTCHA, image-based CAPTCHA, interactive CAPTCHA, and checkbox CAPTCHA. Note that the specific types of CAPTCHA employed by Amazon will likely change over time as its anti-scraping measures are updated, so using dedicated CAPTCHA-solving tools such as CapSolver is recommended.
About the author
Danielius Radavicius
Former Copywriter
Danielius Radavičius was a Copywriter at Oxylabs. Having grown up in films, music, and books and having a keen interest in the defense industry, he decided to move his career toward tech-related subjects and quickly became interested in all things technology. In his free time, you'll probably find Danielius watching films, listening to music, and planning world domination.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
© 2024 oxylabs.io. All Rights Reserved