Previously we outlined how to scrape text-based data with Python. Throughout that tutorial we went through the entire process: from installing Python and the required libraries to coding a basic web scraper and outputting the acquired data into a .csv file. In this second installment, we will learn how to scrape images from a website and store them in a set location.
We highly recommend reading our article “Python Web Scraping Tutorial: Step-By-Step” before moving forward. Understanding how to build a basic data extraction tool will make creating a Python image scraper significantly easier. Additionally, we will use parts of code we had written previously as a foundation to download image links. Finally, we will use both Selenium and the requests library for learning purposes.
Before scraping images, please consult legal professionals to make sure that you are not breaching third-party rights, including, but not limited to, intellectual property rights.
We will need quite a few libraries in order to extract images from a website. In the basic web scraper tutorial we used BeautifulSoup, Selenium, and pandas to gather and output data into a .csv file. We will repeat all of these steps in order to export our scraped data (i.e., the image URLs).
Of course, scraping image URLs into a list is not enough. We will use several other libraries to store the content of the URL into a variable, convert it into an image object and then save it to a specified location. Our newly acquired libraries are Pillow and requests.
If you missed the previous installment:
pip install beautifulsoup4 selenium pandas
Install these libraries as well:
#install the Pillow library (used for image processing)
pip install Pillow
#install the requests library (used to send HTTP requests)
pip install requests
Additionally, we will use built-in libraries to download images from a website, mostly to store our acquired files in a specified folder.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
Our data extraction process begins almost exactly the same (we will import libraries as needed). We assign our preferred webdriver, select the URL from which we will scrape image links and create a list to store them in. As our Chrome driver arrives at the URL, we use the variable ‘content’ to point to the page source and then “soupify” it with BeautifulSoup.
In the previous tutorial, we performed all actions using built-in and library-defined functions. While we could complete this tutorial without defining any functions of our own, doing so is an extremely useful practice in just about any project:
# Example of how to define a function and select custom arguments for
# the code that goes into it.
def function_name(arguments):
    # Function body goes here.
    pass
We’ll move our URL scraper into a defined function. Additionally, we will reuse the same code we used in the “Python Web Scraping Tutorial: Step-by-Step” article and repurpose it to scrape full URLs.
Before
for a in soup.findAll(attrs={'class': 'class'}):
    name = a.find('a')
    if name not in results:
        results.append(name.text)
After
#picking a name that describes the function will be useful later on
def parse_image_urls(classes, location, source):
    for a in soup.findAll(attrs={'class': classes}):
        name = a.find(location)
        if name not in results:
            results.append(name.get(source))
Note that we now append in a different manner. Instead of appending the text, we use another function, ‘get()’, and add a new parameter, ‘source’, to it. We use ‘source’ to indicate the attribute in which the image links are stored. They will be nested in ‘src’, ‘data-src’, or other similar HTML attributes.
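As a side note, lazy-loaded pages often keep the real image URL in a ‘data-src’ attribute while ‘src’ holds a placeholder. A minimal sketch of a fallback between the two, using a made-up HTML snippet rather than a real page:

```python
from bs4 import BeautifulSoup

#hypothetical HTML: one lazy-loaded image, one regular image
html = """
<a class="blog-card__link"><img src="placeholder.gif" data-src="https://example.com/real.png"></a>
<a class="blog-card__link"><img src="https://example.com/plain.png"></a>
"""
soup = BeautifulSoup(html, "html.parser")

def image_url(tag):
    #prefer 'data-src' when present, fall back to 'src'
    return tag.get("data-src") or tag.get("src")

urls = [image_url(a.find("img")) for a in soup.findAll(attrs={"class": "blog-card__link"})]
print(urls)  # ['https://example.com/real.png', 'https://example.com/plain.png']
```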
Let’s assume that our target URL has image links nested inside elements with the class ‘blog-card__link’, that the image itself is an ‘img’ element, and that the URL is stored in its ‘src’ attribute. We would call our newly defined function as such:
parse_image_urls("blog-card__link", "img", "src")
Our code should now look something like this:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

def parse_image_urls(classes, location, source):
    for a in soup.findAll(attrs={'class': classes}):
        name = a.find(location)
        if name not in results:
            results.append(name.get(source))

parse_image_urls("blog-card__link", "img", "src")
Since we sometimes want to export scraped data and we have already used pandas before, we can check our results by outputting everything into a “.csv” file. This also lets us spot any possible semantic errors.
df = pd.DataFrame({"links": results})
df.to_csv('links.csv', index=False, encoding='utf-8')
If we run our code right now, we should get a “links.csv” file outputted right into the running directory.
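To double-check the export, the file can be read straight back with pandas. A small self-contained sketch, using placeholder links instead of real scraped ones:

```python
import pandas as pd

#placeholder links standing in for real scraped results
df = pd.DataFrame({"links": ["https://example.com/a.png", "https://example.com/b.png"]})
df.to_csv("links.csv", index=False, encoding="utf-8")

#read the file back to confirm the rows and column survived the round trip
check = pd.read_csv("links.csv")
print(len(check), list(check.columns))  # 2 ['links']
```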
Assuming that we didn’t run into any issues at the end of the previous section, we can continue to download images from websites.
#import the requests library to send HTTP requests
import requests

for b in results:
    #add the content of the URL to a variable
    image_content = requests.get(b).content
We will use the requests library to acquire the content stored in the image URL. Our “for” loop above will iterate over our ‘results’ list.
#io manages file-related in/out operations
import io
#create a bytes object out of image_content and point the variable image_file to it
image_file = io.BytesIO(image_content)
We are not done yet. So far the “image” we have above is just a Python object.
#we use Pillow to convert our object to an RGB image
from PIL import Image
image = Image.open(image_file).convert('RGB')
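To see the BytesIO-to-Pillow round trip in isolation, here is a minimal sketch that builds a tiny image in memory instead of downloading one:

```python
import io
from PIL import Image

#create a tiny 2x2 red image and save it into an in-memory buffer
buf = io.BytesIO()
Image.new("RGB", (2, 2), "red").save(buf, "PNG")
buf.seek(0)

#the same open-and-convert step we use on downloaded content
img = Image.open(buf).convert("RGB")
print(img.size, img.mode)  # (2, 2) RGB
```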
We are still not done as we need to find a place to save our images. Creating a folder “Test” for the purposes of this tutorial would be the easiest option.
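The folder does not have to be created by hand: pathlib can create it from the script itself. A short sketch, assuming a hypothetical “test_images” directory name:

```python
import pathlib

#hypothetical location; adjust to your own machine
output_dir = pathlib.Path("test_images")
#parents=True creates any missing intermediate folders;
#exist_ok=True avoids an error when the script is re-run
output_dir.mkdir(parents=True, exist_ok=True)
print(output_dir.is_dir())  # True
```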
#pathlib lets us point to specific locations; we will use it to save our images
import pathlib
#hashlib lets us generate hashes; we will use sha1 to name our images
import hashlib
#set a file_path variable pointing to our directory; the file name is based on
#the sha1 hash of image_content, converted into a string with .hexdigest()
file_path = pathlib.Path('nix/path/to/test', hashlib.sha1(image_content).hexdigest()[:10] + '.png')
image.save(file_path, "PNG", quality=80)
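The sha1-based naming can be tried in isolation. A quick sketch with made-up bytes standing in for real image content; the same input always yields the same file name, so duplicate downloads simply overwrite themselves:

```python
import hashlib

#made-up bytes standing in for downloaded image content
content = b"fake image bytes"
#first 10 hex characters of the sha1 hash, plus the extension
name = hashlib.sha1(content).hexdigest()[:10] + ".png"
print(name)
```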
Let’s combine all of the previous steps without any comments and see how it works out. Note that most IDEs will grey out the pandas import, as we are not extracting data into any tables here. We kept it in for the sake of convenience; use it if you need to see or double-check the outputs.
import hashlib
import io
import pathlib
import pandas as pd
import requests
from bs4 import BeautifulSoup
from PIL import Image
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

def gets_url(classes, location, source):
    results = []
    for a in soup.findAll(attrs={'class': classes}):
        name = a.find(location)
        if name not in results:
            results.append(name.get(source))
    return results

driver.quit()

if __name__ == "__main__":
    returned_results = gets_url("blog-card__link", "img", "src")
    for b in returned_results:
        image_content = requests.get(b).content
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = pathlib.Path('nix/path/to/test', hashlib.sha1(image_content).hexdigest()[:10] + '.png')
        image.save(file_path, "PNG", quality=80)
For efficiency, we quit our webdriver by using “driver.quit()” after retrieving the URL list we need. We no longer need that browser as everything is stored locally.
Running our application will output one of two results:
Images are outputted into the folder we selected by defining the ‘file_path’ variable.
Python outputs a 403 Forbidden HTTP error.
Obviously, getting the first result means we are finished. We would receive the second outcome if we were to scrape our /blog/ page. Fixing the second outcome will take a little bit of time in most cases, although, at times, there can be more difficult scenarios.
Whenever we use the requests library to send a request to the destination server, a default user-agent of “python-requests/version.number” is assigned. Some web services might block these user-agents specifically, as they are guaranteed to be bots. Fortunately, the requests library allows us to assign any user-agent (or an entire header) we want:
image_content = requests.get(b, headers={'User-agent': 'Mozilla/5.0'}).content
Adding a user-agent will be enough for most cases. There are more complex cases where servers might try to check other parts of the HTTP header in order to confirm that it is a genuine user. Refer to our guides on HTTP headers and web scraping practices for more information on how to use them to extract images from websites online.
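For illustration, a fuller browser-like header set might look like the sketch below. The exact values are assumptions rather than requirements, and which headers a server actually checks varies from site to site:

```python
import requests

#illustrative browser-like headers; adjust values to match the target site
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "image/avif,image/webp,image/*,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://your.url/here",
}
#usage with a real image URL would look like:
#image_content = requests.get(image_url, headers=headers).content
```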
Our task is finished but the code is still messy. We can make our application more readable and reusable by putting everything under defined functions:
import io
import pathlib
import hashlib
import pandas as pd
import requests
from bs4 import BeautifulSoup
from PIL import Image
from selenium import webdriver

def get_content_from_url(url):
    driver = webdriver.Chrome()  # add "executable_path=" if the driver is not in the running directory
    driver.get(url)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    page_content = driver.page_source
    driver.quit()  # We do not need the browser instance for further steps.
    return page_content

def parse_image_urls(content, classes, location, source):
    soup = BeautifulSoup(content, 'html.parser')
    results = []
    for a in soup.findAll(attrs={"class": classes}):
        name = a.find(location)
        if name not in results:
            results.append(name.get(source))
    return results

def save_urls_to_csv(image_urls):
    df = pd.DataFrame({"links": image_urls})
    df.to_csv("links.csv", index=False, encoding="utf-8")

def get_and_save_image_to_file(image_url, output_dir):
    response = requests.get(image_url, headers={"User-agent": "Mozilla/5.0"})
    image_content = response.content
    image_file = io.BytesIO(image_content)
    image = Image.open(image_file).convert("RGB")
    filename = hashlib.sha1(image_content).hexdigest()[:10] + ".png"
    file_path = output_dir / filename
    image.save(file_path, "PNG", quality=80)

def main():
    url = "https://your.url/here?yes=brilliant"
    content = get_content_from_url(url)
    image_urls = parse_image_urls(
        content=content, classes="blog-card__link", location="img", source="src",
    )
    save_urls_to_csv(image_urls)
    for image_url in image_urls:
        get_and_save_image_to_file(
            image_url, output_dir=pathlib.Path("nix/path/to/test"),
        )

if __name__ == "__main__":  # only runs when the file is executed directly, not imported
    main()
Everything is now nested under clearly defined functions and can be called when the file is imported. When run directly, the script behaves as it did previously.
By using the code outlined above, you should now be able to complete basic image scraping tasks such as downloading all images from a website in one go. An image scraper can be upgraded in a variety of ways, most of which we outlined in the previous installment. We recommend studying our Python Requests article to get more up to speed with the library used in this tutorial. Check out our blog for more details on how to get started with data acquisition, and take a look at our own general-purpose web scraper.
About the author
Adomas Sulcas
PR Team Lead
Adomas Sulcas is a PR Team Lead at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.