
How to Scrape Walmart: A Step-by-Step Tutorial

Enrika Pavlovskytė

2023-03-01 · 6 min read

Ranked first among the world's 100 largest companies by revenue, Walmart is a major force in the e-commerce industry. Companies can leverage its extensive pool of public product data, ranging from pricing to reviews, to unlock valuable business insights. Using this data, they can conduct pricing intelligence, product catalog mapping, competitor analysis, and other operations that shape business strategy, often with the help of proxies to ensure smooth and efficient data extraction.

In this blog post, we’ll guide you through the process of scraping publicly available Walmart data using Python. We'll delve into critical aspects of web scraping, including initial setup, strategies to avoid getting blocked, techniques for identifying and extracting desired data, and methods for delivering data in a CSV file.

Setting up

Setting up your Python environment is the first step to scraping Walmart product data. Start by downloading Python from the official website and installing it on your computer. Recent Python versions ship with pip, a package manager that lets you install the Python packages required for web scraping. To install them, run the following command:

python -m pip install requests beautifulsoup4 pandas

This command will install three libraries – Requests, BeautifulSoup 4, and Pandas. Let's quickly review what each of them does:

    • Requests is a Python library that allows you to send HTTP requests. It will be used to make network requests to the Walmart website and retrieve the product page.  

    • BeautifulSoup 4 is also a Python library. It’s used for web scraping purposes, such as pulling the data out of HTML and XML files. It will be especially handy to parse the HTML content and scrape product data. 

    • Pandas is a Python library that is used for data manipulation and analysis. We’ll use this library for storing and exporting the scraped data into CSV format. 
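To confirm the installation worked, you can import each package and print its version (a quick sanity check; the version numbers in the comments are just examples):

# Sanity check: import each package and print its version
import requests
import bs4
import pandas

print(requests.__version__)  # e.g., 2.31.0
print(bs4.__version__)       # e.g., 4.12.2
print(pandas.__version__)    # e.g., 2.1.0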

    With all the necessary packages installed, it’s time to start writing the script.

    1. Fetch Walmart product page

Start by importing the necessary libraries as shown below:

    import requests 
    from bs4 import BeautifulSoup 
    import pandas as pd 

    Next, let’s try and scrape Walmart’s iPhone 14 product page: 
    https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288

    response = requests.get("https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288") 
    print(response.status_code)

Once you run this code, you'll likely see a status code of 200. Let's add a few lines to parse the content of the response and check whether the request actually returned the product page. Use BeautifulSoup to do that:

    soup = BeautifulSoup(response.content, 'html.parser') 
    print(soup.get_text())

In the best-case scenario, you'll get the HTML of the product page. However, it's more likely that you'll see something like this:

    Robot or human? 
    Activate and hold the button to confirm that you’re human. Thank You! 
    Try a different method 
    Terms of Use 
    Privacy Policy 
    Do Not Sell My Personal Information 
    Request My Personal Information 
    ©2024 Walmart Stores, Inc. 

Let's dissect what happened here. Walmart has blocked the script and displayed a CAPTCHA page to prevent automated access to the product page. Nevertheless, that shouldn't stop you, as there are alternative approaches to overcome this challenge, and we'll explore them in the next section.

    Avoiding detection using headers

To prevent detection, you can include a User-Agent header with the request; websites often use this header to determine what kind of device is browsing a particular URL. You can obtain a realistic value from your web browser's developer tools. To access them, follow the steps below:

    1. Open the Chrome browser and navigate to the Walmart product page. 

    2. Right-click on the page and select Inspect to open the developer tools. 

    3. Click on the Network tab. 

    4. Refresh the page (F5 or Ctrl+R). 

    5. Click on the first item in the list of requests that appears in the Network tab.

    In the Headers section of the request, you’ll see the User-Agent header. You should then copy its value and use it like in the example below: 

    walmart_product_url = 'https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288' 
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'} 
    
    response = requests.get(walmart_product_url, headers=headers) 
    soup = BeautifulSoup(response.content, 'html.parser') 
    print(soup.prettify()) 

    Now, once you run this script again, you’ll see the correct product page HTML. 
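To make the script fail loudly instead of silently parsing a CAPTCHA page, you can add a simple guard. This is a minimal sketch, assuming the block page always contains the "Robot or human?" phrase shown earlier:

# Stop early if Walmart served the CAPTCHA page instead of the product page
if "Robot or human?" in soup.get_text():
    raise RuntimeError("Blocked by Walmart's bot detection; try different headers or a proxy")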

    2. Extract Walmart product information with BeautifulSoup

    Before getting to the scraping part, you need to understand how to locate the data you want to extract. For this tutorial, we’ll be scraping the title and price. To find these elements, you can inspect the structure of the page using developer tools like before. 

    Selecting & Scraping Product Title 

To find the title using developer tools, select the title of the product with your cursor, right-click on it, and choose Inspect. You should be able to locate the title as in the example below:

Finding the product title

The product title is enclosed within an h1 tag, which gives a clear indication of how to reference it in the code. You can instruct BeautifulSoup to scrape the title text using the following code:

    title = soup.find("h1").text 
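Keep in mind that soup.find("h1") returns None when the element is missing (for example, on a block page), and .text may carry surrounding whitespace. A slightly more defensive variant looks like this:

# Guard against a missing <h1> and strip stray whitespace
title_element = soup.find("h1")
title = title_element.text.strip() if title_element is not None else ""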

    Selecting & Scraping Product Price 

    If you want to scrape the product price, the approach will be similar. First, locate the price on the website, hover your mouse over it, right-click, and select Inspect.

Finding product price

You'll see that the price is in a span tag. To select this tag, you can use a BeautifulSoup selector as before:

    price_element = soup.find("span", {"itemprop": "price"})
    price = price_element.text if price_element is not None else ""

Notice that an extra dictionary is passed to the find method this time. It tells BeautifulSoup to match the exact span element by its itemprop attribute. Store this data in a list, as in the example below:

    product_data = [{ 
    "title": title, 
    "price": price, 
    }] 
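The scraped price is a plain string, such as "$599.00", so if you plan to run numeric comparisons, you may want to normalize it first. Here's a minimal sketch, assuming the first number in the string is the price:

import re

def parse_price(price_text):
    """Extract a float from a price string such as '$599.00'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text.replace(",", ""))
    return float(match.group()) if match else None

print(parse_price("$599.00"))  # 599.0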

    3. Export data to a CSV file using pandas module

The product_data list can now be used to store the results in a CSV file, which is far more useful than raw HTML since you'll be able to open it in Excel. The Pandas library will help you do that.

    Start by passing the scraped product list to the data frame: 

    df = pd.DataFrame(product_data)

    Finally, store the data frame in a CSV file named result.csv in the current directory using the following code: 

    df.to_csv("result.csv", index=False) 
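To verify the export, you can read the file back into a data frame and print it:

# Read the CSV back to confirm it was written correctly
print(pd.read_csv("result.csv"))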

    Full source code 

    The process of scraping name and price data should be pretty clear by now. However, you might be wondering if you can extract multiple products at once. You can certainly do so by slightly modifying the source code. Simply use a for loop to iterate over the product URLs: 

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    product_urls = [
        "https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288",
        "https://www.walmart.com/ip/Straight-Talk-Apple-iPhone-15-Pro-128GB-Blue-Prepaid-Smartphone/5060213862"
    ]
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
    }
    
    product_data = []
    
    for url in product_urls:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract the title, guarding against a missing <h1> (e.g., a block page)
        title_element = soup.find("h1")
        title = title_element.text.strip() if title_element is not None else ""
        price_element = soup.find("span", {"itemprop": "price"})
        price = price_element.text if price_element is not None else ""
        product_data.append({
            "title": title,
            "price": price,
        })
    
    df = pd.DataFrame(product_data)
    df.to_csv("result.csv", index=False)

As you can see, the code is pretty self-explanatory: it loops over product_urls, which contains the target product links from Walmart's website, parses each product page with BeautifulSoup, and stores the results in the product_data list. Once all products are scraped, a Pandas DataFrame writes the product data to a CSV file.
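One easy improvement when looping over many URLs is to space out the requests, since rapid-fire traffic is a telltale sign of a bot. The snippet below is an illustrative sketch; the 2–5 second range is an arbitrary choice, not a guaranteed-safe value:

import random
import time

for url in product_urls:
    response = requests.get(url, headers=headers)
    # ... parse the page as before ...
    time.sleep(random.uniform(2, 5))  # pause to mimic human browsing pace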

    Scraping Walmart without getting blocked 

The User-Agent method discussed above is a good first step toward avoiding blocks when scraping Walmart. However, for more extensive scraping projects, relying on it alone may not be enough, since Walmart will eventually detect the scraping behavior and block your IP address. Walmart has implemented advanced anti-bot measures that are frequently updated and can substantially affect all scraping activities.

    To overcome this issue, a more advanced and intricate script is required. Such a script should be able to prevent browser fingerprinting, implement proxy rotation, and mimic human browsing patterns. Moreover, this solution may demand frequent upkeep.
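For illustration, basic proxy rotation with the requests library can look like the sketch below. The proxy endpoints are hypothetical placeholders, so you'd substitute addresses and credentials from your own proxy provider:

import random
import requests

# Hypothetical proxy endpoints; replace with real ones from your provider
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]

def fetch_with_rotation(url, headers):
    """Send the request through a randomly chosen proxy."""
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy})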

    An alternative that can potentially save you from the challenges of dealing with Walmart's anti-bot measures is to use a service like Oxylabs' Walmart Scraper API (part of Web Scraper API). This Scraper API significantly simplifies data collection as it takes care of the scraping, proxy management, and parsing. It’s a flexible and efficient alternative that is also easily scalable, allowing you to scrape huge amounts of Walmart product data without getting blocked. It’s also super easy to implement. Let’s take a look at how the code works:

    import requests
    from pprint import pprint
    
    payload = {
        'source': 'universal',
        'url': 'https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288',
        'geo_location': 'United States',
        'parse': True,
    }
    
    response = requests.post(
        'https://realtime.oxylabs.io/v1/queries',
        auth=('USERNAME', 'PASSWORD'),
        json=payload,
    )
    
    product_data = response.json()
    pprint(product_data)

The provided code leverages Oxylabs' Web Scraper API to extract data, requiring only the source, the target URL, a location, and the parse flag set to True. It uses the requests module to send a POST request to Oxylabs' API endpoint with the necessary payload and authentication credentials. The Scraper API handles the complexities of bypassing Walmart's anti-bot measures and parsing the page, returning well-structured JSON that can be further processed or saved as a JSON/CSV file.
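If you'd rather end up with the same result.csv as before, you can flatten the API response with Pandas. The sketch below assumes the parsed fields sit under results[0]['content']; check the API documentation for the exact response schema:

import pandas as pd

# Pull the parsed product content out of the API response
content = product_data["results"][0]["content"]
df = pd.DataFrame([{
    "title": content.get("title"),
    "price": content.get("price"),
}])
df.to_csv("result.csv", index=False)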

    Wrapping up 

    Scraping Walmart for product data can provide valuable insights for businesses and individuals looking to analyze pricing, product descriptions, and other information. However, it's important to be aware of Walmart's anti-bot measures and to take steps to avoid getting blocked when scraping their website. We have outlined two potential solutions – using a Python script or a Walmart Scraper API – for achieving this goal. If you need to scale your operations or improve scraping efficiency, you can buy proxies to help bypass anti-bot measures. In the end, it's essential to carefully assess your project requirements and determine which approach will best suit your needs. 

    If you found this tutorial useful, be sure to check out our blog for content on building a price tracker with Python, Walmart Price Tracker API, e-commerce keyword research, MAP monitoring, and more.

    People also ask

    Is scraping Walmart allowed?

While scraping Walmart product data is allowed, it can be quite challenging. Walmart has security measures in place that protect it from malicious bots and spam. If their system mistakes you for one, you'll get blocked.

However, the Walmart product API is specifically designed to overcome these issues and effortlessly scrape publicly available Walmart product data. If, instead of scraping, you'd prefer ready-to-use datasets, Oxylabs offers those as well.

    Can you get sued for scraping data?

    Walmart product pages and general information are considered to be publicly available data. As such, you can scrape its contents freely. Nevertheless, there might be some local and international laws applicable to your specific situation. 

    Oxylabs strongly advises you to consult with a legal expert before engaging in any kind of web scraping activity. We’ve also written a useful article on the legality of web scraping.

    About the author

    Enrika Pavlovskytė

    Former Copywriter

    Enrika Pavlovskytė was a Copywriter at Oxylabs. With a background in digital heritage research, she became increasingly fascinated with innovative technologies and started transitioning into the tech world. On her days off, you might find her camping in the wilderness and, perhaps, trying to befriend a fox! Even so, she would never pass up a chance to binge-watch old horror movies on the couch.

    All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
