Previously, we outlined how to scrape text-based data with Python. We showcased the entire process: from installing Python and the required libraries to coding a basic web scraper and outputting the acquired data into a CSV file. If you're unfamiliar with web scraping or want to learn more, check out our article on what is web scraping and how to scrape data from a website. In this second installment, you'll learn how to extract images from a website and store them in a desired location.
We highly recommend reading our article Python Web Scraping Tutorial: Step-By-Step, before moving forward. Understanding how to build a basic data extraction tool will make it significantly easier to create a Python image data scraper. Additionally, we’ll use parts of the code we had written previously as a foundation to download image links. Finally, we’ll use both Selenium and the requests library for learning purposes.
Before scraping images, please consult with legal professionals to be sure that you aren't breaching third-party rights, including but not limited to intellectual property rights.
There are quite a few libraries you can utilize to extract pictures from a website. In the basic web scraper tutorial, we used Beautiful Soup, Selenium, and pandas to gather and output data into a CSV file. You'll follow all these previous steps to export image URLs as well.
Of course, web scraping image URLs into a list isn’t enough. You’ll need to use several other packages to store the content of the URL into a variable, convert it into an image object, and then save it to a specified location. The Pillow and requests libraries will do this job.
If you missed the previous tutorial, open your terminal and install the following:
pip install beautifulsoup4 selenium pandas pyarrow Pillow requests
The Pillow library will process the images, and the requests library will send HTTP requests. Additionally, the PyArrow library is required as a dependency for pandas.
Additionally, we’ll use built-in Python packages, such as io, pathlib, and hashlib, to download images from a website and store them in a specified folder.
For demonstration, we'll use eBay’s product listing page with the search keyword “laptop”. Here’s what the listing page, along with the HTML source, looks like:
Notice that each image in the page's HTML source is enclosed in a div element with the class s-item__image-wrapper image-treatment. Furthermore, the images themselves are img tags, with the source URL held in the src attribute.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=laptop&_sacat=0&LH_TitleDesc=0&_osacat=0&_odkw=laptop")
results = []
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
The data extraction process begins almost exactly the same – you have to import the Python packages as needed. Then, specify your preferred browser for WebDriver, select the URL from which to scrape image links, and create a list to store them. After the Chrome driver opens the URL, the content variable points to the page source, and then the Beautiful Soup library makes the data ready for parsing.
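Before moving on, you can optionally sanity-check that the page was parsed as expected by counting how many image wrapper elements Beautiful Soup finds. The snippet below is just an optional check, using the class name from the eBay markup shown earlier:

# Optional sanity check: count the image wrapper elements on the parsed page.
wrappers = soup.findAll(attrs={"class": "s-item__image-wrapper image-treatment"})
print(f"Found {len(wrappers)} image wrappers")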
Currently, the code runs the driver in a headful mode. You can run it without a GUI by specifying the headless mode:
from selenium.webdriver import ChromeOptions
options = ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
If you have stored the executable browser driver in a custom location, you should specify the directory by creating a ChromeService instance with the path to your driver and passing it to the webdriver instance as a service:
service = webdriver.ChromeService(executable_path='/path/to/driver')
driver = webdriver.Chrome(service=service)
In the previous tutorial, we performed all actions using built-in and pre-defined functions. While we could get through this tutorial without defining any functions of our own, doing so is extremely useful for just about any project:
# An example of how to define a function
# and select custom arguments for the code inside the function.
def function_name(arguments):
    # Function body goes here.
    pass
So let’s move our URL scraper into a defined function. We’ll reuse the same code used in the Python Web Scraping Tutorial: Step-by-Step article and repurpose it to extract full URLs:
Before
for a in soup.findAll(attrs={'class': 'class'}):
    name = a.find('a')
    if name not in results:
        results.append(name.text)
After
# Picking a name that represents the function will be useful as you expand your code.
def parse_image_urls(classes, location, source):
    for a in soup.findAll(attrs={'class': classes}):
        name = a.find(location)
        if name and name.get(source) not in results:
            results.append(name.get(source))
Note that it now appends in a different manner. Instead of appending the element's text, the code calls another function, get(), and passes it a new parameter, source. The source parameter indicates the attribute where the image links are stored; in our case, that's the src attribute of the img tag. The condition also makes sure that an img tag was actually found and that its URL isn't already in the results list before appending.
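To illustrate the difference between .text and .get(), here's a minimal standalone example built on a hypothetical snippet of HTML (not taken from eBay):

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<div class="s-item__image-wrapper image-treatment"><img src="https://example.com/laptop.jpg"></div>'
tag = BeautifulSoup(html, "html.parser").find("img")

print(tag.text)        # Empty string, since an img tag has no inner text.
print(tag.get("src"))  # https://example.com/laptop.jpg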
From the previous section, we know the classes, locations, and sources of the images we want to scrape. Therefore, you can call your newly defined function as follows:
parse_image_urls("s-item__image-wrapper image-treatment", "img", "src")
Your code should now look something like this:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions
options = ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=laptop&_sacat=0&LH_TitleDesc=0&_osacat=0&_odkw=laptop")
results = []
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
def parse_image_urls(classes, location, source):
for a in soup.findAll(attrs={"class": classes}):
name = a.find(location)
if name not in results:
results.append(name.get(source))
parse_image_urls("s-item__image-wrapper image-treatment", "img", "src")
It’s extremely useful to export the data into a CSV file, so you can always easily check for any possible semantic errors. To achieve this, you can use the pandas library and add the following lines to your code:
df = pd.DataFrame({"links": results})
df.to_csv("links.csv", index=False, encoding="utf-8")
If you run the code right now, you should see the links.csv file created in the current working directory. Here's a partial snippet of the CSV file with the extracted image URLs:
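If you'd rather inspect the results without opening the file manually, you can also read the CSV back with pandas; the exact URLs will vary with eBay's current listings:

# Quick preview of the first few extracted links.
print(pd.read_csv("links.csv").head())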
Assuming that you didn’t run into any issues in the previous section, you can continue with this section to download images from a website.
The requests library can be used to extract the content stored in the image URL. The below code snippet achieves that with a for loop that iterates over the results list:
# Import the requests library to send HTTP requests
import requests
for b in results:
    # Store the content from the URL to a variable
    image_content = requests.get(b).content
Next, use the io library to create a byte object out of image_content and store the data in a new variable:
# The io library manages file-related in/out operations.
import io
# Create a byte object out of image_content and store it in the variable image_file
image_file = io.BytesIO(image_content)
So far, the above image_file is just a Python object. It has to be converted to an RGB image with the Python library called Pillow:
# Use Pillow to convert the Python object to an RGB image
from PIL import Image
image = Image.open(image_file).convert("RGB")
The code is still not complete, as you need to find a place to store the image data. Let’s create a folder called “Test”:
# pathlib lets you point to specific directories. Use it to store the images in a folder.
from pathlib import Path
# hashlib allows you to get hashes. Let's use 'sha1' to name the images.
import hashlib
# Set a file_path variable that points to your directory.
# Create a file based on the sha1 hash of 'image_content'.
# Use .hexdigest to convert it into a string.
file_path = Path("/path/to/test", hashlib.sha1(image_content).hexdigest()[:10] + ".png")
image.save(file_path, "PNG", quality=80)
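One caveat: Pillow won't create the destination folder for you. If the directory doesn't exist yet, you can create it with pathlib before saving; the path below is the same placeholder used above:

# Make sure the target directory exists before saving images into it.
output_dir = Path("/path/to/test")
output_dir.mkdir(parents=True, exist_ok=True)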
Let’s combine all of the previous steps and see how it all works out. Note that the pandas library is grayed out in your code as it isn’t used in the code below. We kept it for the sake of convenience. Use it if you need to see or double-check the outputs.
import hashlib, io, requests, pandas as pd
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from bs4 import BeautifulSoup
from pathlib import Path
from PIL import Image
options = ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=laptop&_sacat=0&LH_TitleDesc=0&_osacat=0&_odkw=laptop")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
driver.quit()
def gets_url(classes, location, source):
    results = []
    for a in soup.findAll(attrs={"class": classes}):
        name = a.find(location)
        if name and name.get(source) not in results:
            results.append(name.get(source))
    return results
if __name__ == "__main__":
    returned_results = gets_url("s-item__image-wrapper image-treatment", "img", "src")
    for b in returned_results:
        image_content = requests.get(b).content
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert("RGB")
        file_path = Path("/path/to/test", hashlib.sha1(image_content).hexdigest()[:10] + ".png")
        image.save(file_path, "PNG", quality=80)
To free up resources, the code quits the WebDriver with driver.quit() as soon as the page source has been retrieved. You no longer need the browser, as everything is stored locally.
Running the code can produce one of two results:
Images are saved to the folder you selected when defining the file_path variable;
Python outputs a 403 Forbidden Error.
Obviously, getting the first result means your code worked as intended, and your image destination folder should look something like this:
You're likely to receive the second result when web scraping a website that has strict anti-bot measures since the 403 Forbidden Error means the target site has denied your request. Fixing the second outcome will take a little bit of time in most cases, although, at times, there can be more difficult scenarios.
Whenever you use the requests library to send a request to the destination server, a default user agent python-requests/version.number is assigned, for instance, python-requests/2.31.0. Some web services might block such web requests as this user agent points to the use of the requests library. Fortunately, the requests library allows you to assign any user agent (or an entire header) you want:
image_content = requests.get(b, headers={'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'}).content
Adding a user agent will be enough for most cases. In more complex cases, servers might also check other parts of the HTTP headers to confirm that the request comes from a genuine user. Refer to our guides on HTTP headers and web scraping practices for more information on how to use them to extract images from websites. We've also made an in-depth video tutorial specifically on how to bypass the 403 Forbidden Error, so be sure to check it out below.
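For those more complex cases, a fuller, browser-like set of headers might look something like the sketch below. The values are illustrative rather than a guaranteed fix for every target:

# Illustrative browser-like headers; adjust the values to mimic a real browser session.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept": "image/avif,image/webp,image/apng,image/*,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.ebay.com/",
}
image_content = requests.get(b, headers=headers).content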
The image extraction process is finished, but the code still looks messy. You can make your application more readable and reusable by putting everything under defined functions:
import hashlib, io, requests, pandas as pd
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from bs4 import BeautifulSoup
from pathlib import Path
from PIL import Image
def get_content_from_url(url):
    options = ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    page_content = driver.page_source
    driver.quit()
    return page_content
def parse_image_urls(content, classes, location, source):
    soup = BeautifulSoup(content, "html.parser")
    results = []
    for a in soup.findAll(attrs={"class": classes}):
        name = a.find(location)
        if name and name.get(source) not in results:
            results.append(name.get(source))
    return results
def save_urls_to_csv(image_urls):
    df = pd.DataFrame({"links": image_urls})
    df.to_csv("links.csv", index=False, encoding="utf-8")
def get_and_save_image_to_file(image_url, output_dir):
    image_content = requests.get(image_url).content
    image_file = io.BytesIO(image_content)
    image = Image.open(image_file).convert("RGB")
    filename = hashlib.sha1(image_content).hexdigest()[:10] + ".png"
    file_path = output_dir / filename
    image.save(file_path, "PNG", quality=80)
def main():
url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=laptop&_sacat=0&LH_TitleDesc=0&_osacat=0&_odkw=laptop"
content = get_content_from_url(url)
image_urls = parse_image_urls(
content=content, classes="s-item__image-wrapper image-treatment", location="img", source="src"
)
save_urls_to_csv(image_urls)
for image_url in image_urls:
get_and_save_image_to_file(
image_url, output_dir=Path("/path/to/test")
)
if __name__ == "__main__":
    main()
    print("Done!")
Everything is now organized into clearly defined functions that can be called individually when the file is imported. Run directly, the script behaves just as it did previously.
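For example, assuming you save the script above as image_scraper.py (the filename is arbitrary), you could reuse its helpers from another script without triggering main():

# Hypothetical reuse of the functions defined above, assuming the file is named image_scraper.py.
from image_scraper import get_content_from_url, parse_image_urls

page = get_content_from_url("https://www.ebay.com/sch/i.html?_nkw=laptop")
urls = parse_image_urls(page, classes="s-item__image-wrapper image-treatment", location="img", source="src")
print(f"Collected {len(urls)} image URLs")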
By using the code outlined above, you should now be able to complete basic image web scraping tasks, such as downloading all images from a website in one go. Upgrading an image scraper can be done in a variety of ways, most of which we outlined in the previous installment. We recommend studying our Python Requests article to get up to speed with the library used in this tutorial. In addition, check out our blog for more details on how to get started with data acquisition.
A pre-built scraper can drastically enhance your operations, so don’t miss out on a free trial of our eBay Scraper API and see whether it meets your data needs firsthand. In case eBay isn’t your primary focus, we offer alternative solutions like our general-purpose web scraper, which is also available with a free trial.
The legality of web scraping is a much-debated topic among everyone who works in the data-gathering field. It’s important to note that web scraping may be legal in cases where it’s done without breaching any laws regarding the source targets or data itself. That being said, we advise you to seek legal consultation before engaging in scraping activities of any kind.
We’ve explored the legality of web scraping in this blog post, so feel free to check it out for a more in-depth explanation.
About the author
Adomas Sulcas
Former PR Team Lead
Adomas Sulcas was a PR Team Lead at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.