Scraping Amazon Product Data With Python: A Step-by-Step Tutorial

laptop illustration Scraping Amazon Product Data

Maryia Stsiopkina

Last updated on

2025-01-17

9 min read

This article was checked by Aleksandras Šulženko, Product Owner at Oxylabs.

Amazon is packed with useful e-commerce data, such as product information, reviews, and prices. Extracting this data efficiently and putting it to use is imperative for any modern business. Whether you intend to monitor the performance of your products sold by third-party resellers or track your competitors, you need reliable web scraping services, like Amazon product data scraper, to grasp this data for market analytics.

Amazon scraping, however, has its peculiarities. In this step-by-step guide, we’ll go over every stage on how to scrape Amazon product data.

You can find codes used in this guide on our GitHub.

Get 50% Off Residential Proxy plans – Limited-Time Offer!

Apply promo code OXYLABS50 for your first purchase.

Setting up for scraping

In this tutorial we'll be scraping:

Product name;
Product rating;
Product price;
Product images;
Product description.

To follow along, you will need Python. If you do not have Python 3.8 or above installed, head to python.org and download and install Python.

Next, create a folder to save your code files for web scraping Amazon. Once you have a folder, creating a virtual environment is generally a good practice.

The following commands work on macOS and Linux. These commands will create a virtual environment and activate it:

Copy

python3 -m venv .env
source .env/bin/activate

If you are on Windows, these commands will vary a little as follows:

Copy

python -m venv .env
.env\scripts\activate

The next step is installing the required Python packages.

You will need packages for two broad steps—getting the HTML and parsing the HTML to query relevant data.

Requests is a popular third-party Python library for making HTTP requests. It provides a simple and intuitive interface to make HTTP requests to web servers and receive responses. This library is perhaps the most known library related to web scraping.

The limitation of the Requests library is that it returns the HTML response as a string, which is not easy to query for specific elements such as listing prices while working with web scraping code.

This is where Beautiful Soup steps in. Beautiful Soup is a Python library used for web scraping to pull the data out of HTML and XML files. It allows you to extract data from the page by searching for tags, attributes, or specific text.

To install these two libraries, you can use the following command:

Copy

python3 -m pip install requests beautifulsoup4 lxml pandas

If you are on windows, use Python instead of Python3. The rest of the command remains unchanged:

Copy

python -m pip install requests beautifulsoup4 lxml pandas

Note that we are installing version 4 of the Beautiful Soup library.

It's time to try out the Requests scraping library. Create a new Python file with the name amazon.py and enter the following code:

Copy

import requests


url = (
    'https://www.amazon.com/Bose-QuietComfort-45-'
    'Bluetooth-Canceling-Headphones/dp/B098FKXT8L'
)

response = requests.get(url)

print(response.text)
print(response.status_code)

with open('amazon_page.html', 'w') as f:
    f.write(response.text)

Save the file and run it from the terminal.

Copy

python3 amazon.py

In most cases, you cannot access the desired content. Amazon will block this request, and you will see the following text in the HTML:

Copy

To discuss automated access to Amazon data please contact api-services-support@amazon.com.

You can open the saved amazon_page.html file in your browser to see what you've scraped. If you print the response.status_code, you will see that instead of getting 200, which means success, you may get 503, which means an error.

Amazon knows this request was not using a browser and thus blocks it.

It is a common practice employed by many websites. Amazon will block your requests and return an error code beginning with 500 or sometimes even 400.

The solution is simple in most cases. You can send the headers along with your request that a browser would.

Sometimes, sending only the User-Agent is enough. At other times, you may need to send more headers. A good example is sending the Accept-Language header.

To identify the User-Agent sent by your browser, press F12 and open the Network tab. Reload the page. Select the first request and examine Request Headers.

examinig request headers in dev tools illustration

You can copy this User-Agent and create a dictionary for the headers.

The following example shows a dictionary with the User-Agent and accept-language headers:

Copy

custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                  ' (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36',
    'Accept-Language': 'da, en-gb, en'
}

You can send this dictionary to the optional parameter of the get method as follows:

Copy

response = requests.get(url, headers= custom_headers)

Executing the code with these changes may show the expected HTML with the product details.

Another note is that if you send as many headers as possible, you may have better success. If you need JavaScript rendering to run each script tag in the HTML file, you will require tools like Playwright or Selenium. If the User-Agent and Accept-Language strings still bring you the 503 error, you can try to use the following headers:

Copy

custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                  ' (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36',
    'Accept-Language': 'da, en-gb, en',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;'
              'q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Referer': 'https://www.google.com/'
}

It’s also a good idea to rotate different User-Agent strings and try your requests again to overcome the 503 error.

How to include proxies for Amazon product data scraping

If the 503 error persists or you've received a CAPTCHA, you can have better success by utilizing proxy servers to change your IP address. Head to the Oxylabs dashboard, register a free account, and claim 5 free proxies for lifetime to follow along.

Add the following lines of code to your code and replace the USERNAME and PASSWORD values with the proxy user credentials you've created in the dashboard:

Copy

USERNAME = 'USER'
PASSWORD = 'PASS123'

proxies = {
    'http': f'http://user-{USERNAME}:{PASSWORD}@dc.oxylabs.io:8000',
    'https': f'https://user-{USERNAME}:{PASSWORD}@dc.oxylabs.io:8000'
}

Then, pass the proxies dict to the proxies parameter in the request:

Copy

response = requests.get(url, headers=custom_headers, proxies=proxies)

After running the code with proxies, you should be able to overcome Amazon's 503 error. Note that Datacenter Proxies are easy to detect and can still trigger the 503 error or CAPTCHAs. In that case, it's best to acquire either Dedicated Datacenter Proxies or go a step further and purchase Residential Proxies that will help you to resemble organic web traffic.

Scraping Amazon product data

When web scraping Amazon products, typically, you would work with two categories of pages — the category page and the product details page.

For example, open https://www.amazon.com/b?node=12097479011 or search for Over-Ear Headphones on Amazon. The page that shows the search results is the category page.

The category page displays the product title, product image, product rating, product price, and, most importantly, the product URLs page. If you want more details, such as product descriptions, you will get them only from the product details page.

Let's examine the structure of the product details page.

Open a product URL, such as https://www.amazon.com/Bose-QuietComfort-45-Bluetooth-Canceling-Headphones/dp/B098FKXT8L, in Chrome or any other modern browser, right-click the product title, and select Inspect. You will see that the HTML markup of the product title is highlighted.

highlighted HTML markup of the product title

You will see that it is a span tag with its id attribute set to "productTitle".

Similarly, if you right-click the price and select Inspect, you will see the HTML markup of the price.

You can see that the dollar component of the price is in a span tag with the class "a-price-whole", and the cents component is in another span tag with the class set to "a-price-fraction".

Similarly, you can locate the rating, image, and description.

Once you have this information, add the following lines to the code we have written so far:

1. Send a GET request with custom headers

Copy

import time, random
from bs4 import BeautifulSoup

response = requests.get(url, headers=custom_headers)
soup = BeautifulSoup(response.text, 'lxml')

Beautiful Soup supports a unique way of selecting tags that utilize the find methods. Alternatively, Beautiful Soup also supports CSS selectors. You can use either of these to get the same results. In this guide, we will use CSS selectors, which are universal ways to select elements. CSS selectors work with almost all web scraping tools that can be used for web scraping Amazon product data.

We are now ready to use the Soup object to query for specific information.

2. Locate and scrape product name

The product name or the product title is located in a span element with its id productTitle. It's easy to select elements using the id that is unique.

See the following code for example:

Copy

title_element = soup.select_one('#productTitle')

We send the CSS selector to the select_one method, which returns an element instance.

We can extract information from the text using the text attribute.

Copy

title = title_element.text

Upon printing it, you will see that there are few white spaces. To fix that, add .strip() function call as follows:

Copy

title = title_element.text.strip()

3. Locate and scrape product rating

Scraping Amazon product ratings needs a little more work.

First, let's create a selector for rating:

Copy

#acrPopover

Now, the following statement can select the element that contains the rating.

Copy

rating_element = soup.select_one('#acrPopover')

Note that the rating value is actually in the title attribute:

Copy

rating_text = rating_element.attrs.get('title')
print(rating_text)
# prints '4.6 out of 5 stars'

Lastly, we can use the replace method to get the number:

Copy

rating = rating_text.replace('out of 5 stars','')

4. Locate and scrape product price

The product price is located in two places — below the product title and also on the Buy Now box.

We can use either of these tags to scrape Amazon product prices.

Let's create a CSS selector for the price:

Copy

#corePrice_feature_div span.a-offscreen

This CSS selector can be passed to the select_one method of BeautifulSoup as follows:

Copy

price_element = soup.select_one('#corePrice_feature_div span.a-offscreen')

You can now print the price

Copy

print(price_element.text)

5. Locate and scrape product image

Let's scrape the default image. This image has the CSS selector as #landingImage. With this information, we can write the following lines of code to get the image URL from the src attribute:

Copy

image_element = soup.select_one('#landingImage')
image = image_element.attrs.get('src')

6. Locate and scrape product description

The next step in scraping Amazon product information is scraping the product description.

The methodology remains the same — create a CSS selector and use the select_one method.

The CSS selector for the description is as follows:

Copy

#productDescription

You may want to add another CSS selector as a fallback in case the first one fails. Here's how you can extract the description and "About this item" section:

Copy

description_element = soup.select_one(
    '#productDescription, #feature-bullets > ul'
).text.strip()
print(description_element)

7. Locate and scrape reviews

Scraping Amazon product reviews necessitates a more extended explanation, so we covered it in a separate blog post. Please read it to learn more about locating and extracting data from Amazon review pages.

Handling product listing

So far, we have explored how to scrape Amazon product data.

However, to reach the product information, you will begin with product listing or category pages.

For example, https://www.amazon.com/b?node=12097479011 is the category page for over-ear headphones.

If you examine this page, you will notice that all the products are contained in a div that has a special attribute [data-cy="title-recipe"]. In that div, all the product links are in an a tag.

With this in mind, the CSS Selector would be as follows:

Copy

[data-cy="title-recipe"] > a.a-link-normal

We can read the href attribute of this selector and run a loop. However, note that the links will be relative. You would need to use the urljoin method to parse these links.

Copy

from urllib.parse import urljoin

def parse_listing(listing_url, visited_urls, current_page=1, max_pages=2):
    response = requests.get(
        listing_url, headers=custom_headers, proxies=proxies
    )
    print(response.status_code)
    soup_search = BeautifulSoup(response.text, 'lxml')
    link_elements = soup_search.select(
        '[data-cy="title-recipe"] > a.a-link-normal'
    )

    page_data = []

    for link in link_elements:
        full_url = urljoin(listing_url, link.attrs.get('href'))
        if full_url not in visited_urls:
            visited_urls.add(full_url)
            print(f'Scraping product from {full_url[:100]}', flush=True)
            product_info = get_product_info(full_url)
            if product_info:
                page_data.append(product_info)
                time.sleep(random.uniform(3, 7))

To avoid common issues with extracting href elements, take a look into how to get href elements in BeautifulSoup page in our resources.

Handling pagination

The link to the next page is in a link that contains the text Next. We can look for this link using the contains operator of CSS as follows:

Copy

    next_page_el = soup_search.select_one('a.s-pagination-next')
    if next_page_el and current_page < max_pages:
        next_page_url = next_page_el.attrs.get('href')
        next_page_url = urljoin(listing_url, next_page_url)
        print(
            f'Scraping next page: {next_page_url}'
            f'(Page {current_page+1} of {max_pages})',
            flush=True
        )
        page_data += parse_listing(
            next_page_url, visited_urls, current_page+1, max_pages
        )

    return page_data

8. Export scraped product data to a CSV file

The data we're scraping is being returned as a dictionary. This is intentional. We can create a list that contains all the data from scraped products:

Copy

def main():
    visited_urls = set()
    data = []
    search_url = 'https://www.amazon.com/s?k=bose'
    data = parse_listing(search_url, visited_urls)

This page_data can then be used to create a Pandas DataFrame object:

Copy

    df = pd.DataFrame(data)
    df.to_csv("headphones.csv", index=False)

Reviewing final script

Putting together everything, the following is the final script:

Copy

import time, random
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import pandas as pd


USERNAME = 'USER'
PASSWORD = 'PASS123'

proxies = {
    'http': f'http://user-{USERNAME}:{PASSWORD}@dc.oxylabs.io:8000',
    'https': f'https://user-{USERNAME}:{PASSWORD}@dc.oxylabs.io:8000'
}

custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                  ' (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36',
    'Accept-Language': 'da, en-gb, en',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;'
              'q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Referer': 'https://www.google.com/'
}


def get_product_info(url):
    response = requests.get(url, headers=custom_headers, proxies=proxies)
    if response.status_code != 200:
        print(f'Error in getting webpage: {url}')
        return None

    soup = BeautifulSoup(response.text, 'lxml')

    title_element = soup.select_one('#productTitle')
    title = title_element.text.strip() if title_element else None

    price_element = soup.select_one('#corePrice_feature_div span.a-offscreen')
    price = price_element.text if price_element else None

    rating_element = soup.select_one('#acrPopover')
    rating_text = rating_element.attrs.get('title') if rating_element else None
    rating = rating_text.replace('out of 5 stars', '') if rating_text else None

    image_element = soup.select_one('#landingImage')
    image = image_element.attrs.get('src') if image_element else None

    description_element = soup.select_one(
        '#productDescription, #feature-bullets > ul'
    )
    description = (
        description_element.text.strip() if description_element else None
    )

    return {
        'title': title,
        'price': price,
        'rating': rating,
        'image': image,
        'description': description,
        'url': url
    }


def parse_listing(listing_url, visited_urls, current_page=1, max_pages=2):
    response = requests.get(
        listing_url, headers=custom_headers, proxies=proxies
    )
    print(response.status_code)
    soup_search = BeautifulSoup(response.text, 'lxml')
    link_elements = soup_search.select(
        '[data-cy="title-recipe"] > a.a-link-normal'
    )
    
    page_data = []

    for link in link_elements:
        full_url = urljoin(listing_url, link.attrs.get('href'))
        if full_url not in visited_urls:
            visited_urls.add(full_url)
            print(f'Scraping product from {full_url[:100]}', flush=True)
            product_info = get_product_info(full_url)
            if product_info:
                page_data.append(product_info)
                time.sleep(random.uniform(3, 7))
    
    next_page_el = soup_search.select_one('a.s-pagination-next')
    if next_page_el and current_page < max_pages:
        next_page_url = next_page_el.attrs.get('href')
        next_page_url = urljoin(listing_url, next_page_url)
        print(
            f'Scraping next page: {next_page_url}'
            f'(Page {current_page+1} of {max_pages})',
            flush=True
        )
        page_data += parse_listing(
            next_page_url, visited_urls, current_page+1, max_pages
        )

    return page_data


def main():
    visited_urls = set()
    data = []
    search_url = 'https://www.amazon.com/s?k=bose'
    data = parse_listing(search_url, visited_urls)
    df = pd.DataFrame(data)
    df.to_csv('headphones.csv', index=False)


if __name__ == '__main__':
    main()

Best practices

Scraping Amazon without proxies or dedicated scraping tools is full of obstacles. Just like many other popular scraping targets, Amazon has rate-limiting in place, meaning it can block your IP address if you exceed the established limit. Apart from that, Amazon uses bot-detection algorithms that can check your HTTP headers for any suspicious details. Also, you should be ready to constantly adapt to the different page layouts and various HTML structures.

Considering these factors, it’s recommended to follow some common practices to prevent getting detected and blocked by Amazon. Some of the most useful tips are:

Use a real User-Agent. It’s important to make your User-Agent look as plausible as possible. Here’s the list of the most common user agents.
Set your fingerprint. Many websites use Transmission Control Protocol (TCP) and IP fingerprinting to detect bots. To avoid getting spotted, you need to make sure your fingerprint parameters are always consistent.
Change the crawling pattern. To develop a successful crawling pattern, you should think about how a regular user would behave while exploring a page and add clicks, scrolls, and mouse movements accordingly.

And this is only a small portion of the requirements you should keep in mind when scraping Amazon.

An easier solution to extract Amazon data

Alternatively, you can turn to a ready-made scraping solution designed specifically for scraping Amazon - Amazon Scraper API. With this scraper, you can:

Scrape and parse various Amazon page types, including Search, Product, Offer listing, Questions & Answers, Reviews, Best Sellers, and Sellers.
Target localized product data in 195 locations worldwide;
Retrieve accurate parsed results in JSON format without installing any other library;
Enjoy multiple handy features, such as bulk scraping and automated jobs.

Let's look at Amazon Scraper API in action.

Scraping products from search results

You can search and extract the products from Amazon with this straightforward code example:

Copy

import requests
from pprint import pprint

# API parameters.
payload = {
    'source': 'amazon_search',
    'query': 'bose',  # Search for "bose".
    'start_page': 1,
    'pages': 5,
    'parse': True,
    'context': [
        # Category ID for headphones.
        {'key': 'category_id', 'value': 12097479011}
    ],
}

# Send a request and get a response.
response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',
    auth=('USERNAME', 'PASSWORD'),
    json=payload
)

# Print prettified response to stdout.
pprint(response.json())

Notice how it requests 10 pages beginning with page 1. Also, we limit the search to category ID 12097479011, which is Amazon's category ID for headphones. You’ll get the data returned in JSON format:

Extracting product details

All you need is the product URL — irrespective of the country of the Amazon store. The only change in code is the payload. For example, the following payload extracts details, such as name, price, stock availability, description, and more, for the Bose QC 45 from Amazon.com:

Copy

payload = {
    'source': 'amazon',
    'url': 'https://www.amazon.com/dp/B098FKXT8L',
    'parse': True,
}

Here’s a snippet of the output:

Scraping products by ASIN

Another way to get the information is by the ASIN of the product. Using Amazon ASIN Scraper API, you need to modify the payload:

Copy

payload = {
    'source': 'amazon_product',
    'domain': 'co.uk',
    'query': 'B098FKXT8L',
    'parse': True,
    'context': [
        {'key': 'autoselect_variant', 'value': True}
    ],
}

Note the optional parameter domain. You can use this parameter to get Amazon data from any domain, such as amazon.co.uk.

For your convenience, we also prepared this tutorial in a video format:

As shown in the first part of this tutorial, you can use Oxylabs free proxies as an alternative to scrape Amazon product data without blocks. Head to the free proxies dashboard to register a free Oxylabs account and create your proxy user.

Conclusion

You can write code to scrape Amazon products using the Requests and Beautiful Soup libraries. It may need some effort, but it works. Sending custom headers, rotating user-agents, and proxy rotation can help bypass bans or rate limiting.

However, the easiest solution to scrape Amazon products is using the Amazon Scraper API. If you're specifically interested in extracting book data, check out this guide on the Amazon Books Scraper API. Oxylabs also allows you to gather data from 50 other marketplaces using its E-Commerce Scraper API (now part of a Web Scraper API solution). You can even integrate your Amazon scraper with AI agents – read more about it in our blog post on AI Agent Examples in 2025.

When using your own tools, integrating proxies is key to block-free web scraping. To appear organic on a target website, you can buy proxy server solutions of various types, such as residential proxies or datacenter IPs, each suitable for different data collection scenarios.

If you have any questions, do not hesitate to contact us. Also, learn how to build an Amazon price tracker with Python and Oxylabs' Scraper API in another blog post.

Frequently asked questions

Scraping publicly available data contained within the Amazon website isn’t considered illegal as long as your actions don’t violate its ToS. However, before engaging in any web scraping activity, our legal experts strongly recommend consulting with lawyers knowledgeable in this field.

Yes, scraping can be detected by the anti-bot software that can check your IP address, browser parameters, user agents, and other details. After being detected, the website will throw CAPTCHA, and if not solved, your IP will get blocked. If you'd like to avoid scraping altogether but still need the high-quality data it provides, we recommend checking out our e-commerce datasets.

Yes, Amazon may ban an IP address if it finds it suspicious.

In order to overcome CAPTCHAs, as it's one of the biggest challenges when gathering public data, you should minimize encounters with them as much as possible. Of course, it's important to note that avoiding them can be challenging. Here are some tips on how you can achieve that:

Use reliable proxies and rotate your IP addresses.
Reduce the scraping speed by adding random breaks between requests.
Make sure your fingerprint parameters are consistent, or choose Web Unblocker – an AI-powered proxy solution with dynamic fingerprinting functionality.

For more information, check out this blog post.

You can utilize free web scraping and crawling tools, like Scrapy, that allow crawling websites on a large scale.

The process of scraping product images from Amazon is detailed in Step 5 of this article, so go ahead and check it out.

You can also save images manually. Simply hover your mouse over the image you’d like to save and select Save Image As to download it to your local storage. You can also use Copy Image Address to get the URL or Copy Image to save it to your clipboard temporarily.

About the author

Maryia Stsiopkina

Former Senior Content Manager

Maryia Stsiopkina was a Senior Content Manager at Oxylabs. As her passion for writing was developing, she was writing either creepy detective stories or fairy tales at different points in time. Eventually, she found herself in the tech wonderland with numerous hidden corners to explore. At leisure, she does birdwatching with binoculars (some people mistake it for stalking), makes flower jewelry, and eats pickles.

Learn more about Maryia Stsiopkina Learn more about Maryia Stsiopkina

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.