Enrika Pavlovskytė
As the undisputed leader among the world's top 100 largest companies, Walmart is a major force in the e-commerce industry. Companies can leverage its extensive public product data pool, ranging from pricing to reviews, and unlock valuable business insights. Indeed, using this data, companies can conduct pricing intelligence, product catalog mapping, competitor analysis, and other operations that can shape their business strategies.
In this blog post, we’ll guide you through the process of scraping publicly available Walmart data using Python. We'll delve into critical aspects of web scraping, including initial setup, strategies to avoid getting blocked, techniques for identifying and extracting desired data, and methods for delivering data in a CSV file.
Setting up your Python environment is the first step to scraping Walmart product data. Start by downloading Python from the official website and installing it on your computer. Recent Python installers include pip, the package manager you'll use to install the Python packages required for web scraping. To install them, run the following command:
python -m pip install requests bs4 pandas
This command will install three libraries: Requests, BeautifulSoup 4, and Pandas. Here's a quick overview of what each of them does:
Requests is a Python library that allows you to send HTTP requests. It will be used to make network requests to the Walmart website and retrieve the product page.
BeautifulSoup 4 is also a Python library. It’s used for web scraping purposes, such as pulling the data out of HTML and XML files. It will be especially handy to parse the HTML content and scrape product data.
Pandas is a Python library that is used for data manipulation and analysis. We’ll use this library for storing and exporting the scraped data into CSV format.
With all the necessary packages installed, it’s time to start writing the script.
Start by importing the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Next, let’s try and scrape Walmart’s iPhone 14 product page:
https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288
response = requests.get("https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288")
print(response.status_code)
Once you run this code, you'll likely see a status code of 200. To check whether the request actually returned the product page, add a few lines that parse the response content with BeautifulSoup:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.get_text())
In the best-case scenario, you'll get the text of the product page. However, it's more likely that you'll get something like this:
Robot or human?
Activate and hold the button to confirm that you’re human. Thank You!
Try a different method
Terms of Use
Privacy Policy
Do Not Sell My Personal Information
Request My Personal Information
©2024 Walmart Stores, Inc.
Let’s dissect what happened here. It’s apparent that Walmart has blocked the script, and a CAPTCHA page has been displayed to prevent you from accessing the product using a script. Nevertheless, that shouldn’t stop you, as there are alternative approaches to overcome this challenge, and we'll explore them in the subsequent section.
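Since a blocked request can still come back with a 200 status code, it can help to check the page text itself. Below is a minimal sketch of such a check. The function name and marker list are illustrative rather than part of any library, and the phrases are simply copied from the block page shown above, so they may stop matching if Walmart changes the wording.

```python
# Phrases taken from the block page shown above; Walmart may change them at any time.
BLOCK_PAGE_MARKERS = ("Robot or human?", "Activate and hold the button")


def looks_like_block_page(page_text: str) -> bool:
    """Return True if the page text contains any known block-page phrase."""
    lowered = page_text.lower()
    return any(marker.lower() in lowered for marker in BLOCK_PAGE_MARKERS)


# Example with hard-coded text instead of a live response
print(looks_like_block_page("Robot or human? Activate and hold the button"))  # True
print(looks_like_block_page("AT&T iPhone 14 128GB Midnight"))  # False
```

In the script above, you could pass soup.get_text() to this function and skip parsing whenever it returns True.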
To reduce the chance of detection, you can include a User-Agent header with the request, which websites often use to determine what kind of browser and device is requesting a particular URL. You can obtain this header from your web browser's developer tools. To access them, follow the steps below:
Open the Chrome browser and navigate to the Walmart product page.
Right-click on the page and select Inspect to open the developer tools.
Click on the Network tab.
Refresh the page (F5 or Ctrl+R).
Click on the first item in the list of requests that appears in the Network tab.
In the Headers section of the request, you’ll see the User-Agent header. You should then copy its value and use it like in the example below:
walmart_product_url = 'https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'}
response = requests.get(walmart_product_url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
Now, once you run this script again, you should see the correct product page HTML.
Before getting to the scraping part, you need to understand how to locate the data you want to extract. For this tutorial, we’ll be scraping the title and price. To find these elements, you can inspect the structure of the page using developer tools like before.
To find the title using developer tools, select the title of the product with your cursor, right-click on it, and choose Inspect. You should be able to locate the title like in the example below:
Finding the product title
The product title is enclosed within an h1 tag, which provides a clear indication of how to reference it in the code. To extract the product title, we'll utilize BeautifulSoup, a widely-used and user-friendly library for web scraping. You can easily instruct BeautifulSoup to scrape the title text using the following code:
title = soup.find("h1").text
Selecting & Scraping Product Price
If you want to scrape the product price, the approach will be similar. First, locate the price on the website, hover your mouse over it, right-click, and select Inspect.
Finding product price
You'll see that the price is in a span tag. To select this tag, use BeautifulSoup's find method as before:
price_element = soup.find("span", {"itemprop": "price"})
price = price_element.text if price_element is not None else ""
Notice that an extra dictionary object is being passed to the find method this time. This tells BeautifulSoup to match only the span element whose itemprop attribute equals price. Store this data in a list, as in the example below:
product_data = [{
"title": title,
"price": price,
}]
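The scraped price is a text string, which is awkward to sort or compare. The sketch below shows one way to turn it into a number. It assumes the text looks something like "$599.00" or "Now $1,099.99"; the exact format on the live page may differ, and parse_price is an illustrative helper, not part of BeautifulSoup.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(price_text: str) -> Optional[Decimal]:
    """Extract the first number from a price string, or None if there is none."""
    # Digits with optional thousands separators and an optional decimal part
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


print(parse_price("$599.00"))        # 599.00
print(parse_price("Now $1,099.99"))  # 1099.99
print(parse_price(""))               # None
```

Decimal is used instead of float so that currency values keep their exact cents.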
The product_data list can be used to store the results in a CSV file. This will be much more useful than the HTML format, as you'll be able to open it in Excel. The Pandas library will help you do that.
Start by passing the scraped product list to the data frame:
df = pd.DataFrame(product_data)
Finally, store the data frame in a CSV file named result.csv in the current directory using the following code:
df.to_csv("result.csv", index=False)
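For a small result like this one, Python's built-in csv module is an alternative to Pandas. The sketch below is not part of the tutorial's script: rows_to_csv is an illustrative helper and the sample row is made-up data standing in for scraped values.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample row standing in for scraped data
sample = [{"title": "Example phone", "price": "$599.00"}]
print(rows_to_csv(sample, ["title", "price"]))
```

To write a file instead of a string, you could pass a file opened with open("result.csv", "w", newline="") to csv.DictWriter in the same way.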
The process of scraping name and price data should be pretty clear by now. However, you might be wondering if you can extract multiple products at once. You can certainly do so by slightly modifying the source code. Simply use a for loop to iterate over the product URLs:
import requests
from bs4 import BeautifulSoup
import pandas as pd
product_urls = [
"https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288",
"https://www.walmart.com/ip/Straight-Talk-Apple-iPhone-15-Pro-128GB-Blue-Prepaid-Smartphone/5060213862"
]
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'}
product_data = []
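for url in product_urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    # A blocked request may return a page without the expected elements,
    # so fall back to an empty string instead of raising an error
    title_element = soup.find("h1")
    title = title_element.text if title_element is not None else ""
    price_element = soup.find("span", {"itemprop": "price"})
    price = price_element.text if price_element is not None else ""
    product_data.append({
        "title": title,
        "price": price,
    })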
df = pd.DataFrame(product_data)
df.to_csv("result.csv", index=False)
The code loops over product_urls, a list of target product links from Walmart's website. It then parses each product page using BeautifulSoup and stores the result in the product_data list. Once all products are scraped, Pandas converts the list to a data frame and writes it to a CSV file.
The User-Agent method discussed above is a good first step to avoid getting blocked by Walmart when scraping its website. However, for more extensive scraping projects, relying solely on this method may not be adequate since Walmart will eventually detect your scraping behavior and block your IP address. Indeed, Walmart has implemented advanced anti-bot measures, which are frequently updated, and can substantially affect all scraping activities.
To overcome this issue, a more advanced and intricate script is required. Such a script should be able to prevent browser fingerprinting, implement proxy rotation, and mimic human browsing patterns. Moreover, this solution may demand frequent upkeep.
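To give a rough idea of two of these techniques, the sketch below pairs each URL with the next proxy from a pool and a randomized pause. It only builds a request plan and sends nothing. The plan_requests helper, the delay range, and the proxy addresses are all assumptions made for illustration, and the sketch does nothing about browser fingerprinting.

```python
import itertools
import random


def plan_requests(urls, proxy_pool, min_delay=2.0, max_delay=6.0, seed=None):
    """Pair each URL with the next proxy in the pool and a random pause."""
    if not proxy_pool:
        raise ValueError("proxy_pool must contain at least one proxy")
    rng = random.Random(seed)
    proxies = itertools.cycle(proxy_pool)  # round-robin rotation
    plan = []
    for url in urls:
        proxy = next(proxies)
        plan.append({
            "url": url,
            # This dictionary shape is what requests accepts in its `proxies` argument
            "proxies": {"http": proxy, "https": proxy},
            # Seconds to wait before sending this request
            "delay": rng.uniform(min_delay, max_delay),
        })
    return plan


# Placeholder proxy addresses; replace them with proxies you actually control
proxy_pool = ["http://proxy-1.example:8080", "http://proxy-2.example:8080"]
urls = [
    "https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288",
    "https://www.walmart.com/ip/Straight-Talk-Apple-iPhone-15-Pro-128GB-Blue-Prepaid-Smartphone/5060213862",
]
for step in plan_requests(urls, proxy_pool):
    print(step["url"], step["proxies"]["https"], round(step["delay"], 1))
    # In a real run: time.sleep(step["delay"]), then
    # requests.get(step["url"], headers=headers, proxies=step["proxies"])
```

Keeping the plan separate from the network calls makes the rotation logic easy to check before any request is sent.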
An alternative that can potentially save you from the challenges of dealing with Walmart's anti-bot measures is to use a service like Oxylabs' Walmart Scraper API. This Scraper API significantly simplifies data collection as it takes care of the scraping, proxy management, and parsing. It’s a flexible and efficient alternative that is also easily scalable, allowing you to scrape huge amounts of Walmart product data without getting blocked. It’s also super easy to implement. Let’s take a look at how the code works:
import requests
from pprint import pprint
payload = {
'source': 'universal_ecommerce',
'url': 'https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288',
'geo_location': 'United States',
'parse': True,
}
response = requests.post(
'https://realtime.oxylabs.io/v1/queries',
auth=('USERNAME', 'PASSWORD'),
json=payload,
)
product_data = response.json()
pprint(product_data)
The provided code leverages Oxylabs' E-commerce Scraper API to extract data, requiring only the source, the target URL, the location, and a parsing flag set to True. The code uses the requests module to send a POST request to Oxylabs' API endpoint, along with the necessary payload and authentication credentials. The Scraper API handles all the complexities of bypassing Walmart's anti-bot measures and parsing the data, returning a well-structured JSON response that can be further processed or saved as a JSON/CSV file.
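As a starting point for further processing, here is a minimal sketch that pulls the parsed payloads out of the response. It assumes the JSON has a top-level results list whose entries each carry a content field; check the output of pprint above against this assumption. The sample response and the title field inside it are made up for illustration.

```python
def extract_parsed_content(api_response: dict) -> list:
    """Collect the `content` value of every entry under `results`."""
    return [
        result["content"]
        for result in api_response.get("results", [])
        if "content" in result
    ]


# Hypothetical response, trimmed to the assumed shape
sample_response = {
    "results": [{"content": {"title": "Example phone"}, "status_code": 200}]
}
print(extract_parsed_content(sample_response))  # [{'title': 'Example phone'}]
```

With a live response, you would call extract_parsed_content(product_data) and then pick the fields you need from each returned dictionary.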
Scraping Walmart for product data can provide valuable insights for businesses and individuals looking to analyze pricing, product descriptions, and other information. However, it's important to be aware of Walmart's anti-bot measures and to take steps to avoid getting blocked when scraping their website. We have outlined two potential solutions – using a Python script or a Walmart Scraper API – for achieving this goal. In the end, it's essential to carefully assess your project requirements and determine which approach will best suit your needs.
If you found this tutorial useful, be sure to check out our blog for content on building a price tracker with Python, Walmart Price Tracker API, e-commerce keyword research, MAP monitoring, and more.
While scraping Walmart product data is allowed, it can be quite challenging. Walmart has security measures set in place that protect them from malicious bots and spam. If their system mistakes you for one, you’ll get blocked.
However, the Walmart product API is specifically designed to overcome these issues and effortlessly scrape publicly available Walmart product data. If, instead of scraping, you'd like to get ready-to-use datasets, then click this link.
Walmart product pages and general information are considered to be publicly available data. As such, you can scrape its contents freely. Nevertheless, there might be some local and international laws applicable to your specific situation.
Oxylabs strongly advises you to consult with a legal expert before engaging in any kind of web scraping activity. We’ve also written a useful article on the legality of web scraping.
About the author
Enrika Pavlovskytė
Former Copywriter
Enrika Pavlovskytė was a Copywriter at Oxylabs. With a background in digital heritage research, she became increasingly fascinated with innovative technologies and started transitioning into the tech world. On her days off, you might find her camping in the wilderness and, perhaps, trying to befriend a fox! Even so, she would never pass up a chance to binge-watch old horror movies on the couch.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.