
Adomas Sulcas

Jun 22, 2020 6 min read

Previously we outlined how to scrape text-based data with Python. Throughout the tutorial we went through the entire process: all the way from installing Python, getting the required libraries, setting everything up to coding a basic web scraper and outputting the acquired data into a .csv file. In the second installment, we will learn how to scrape images from a website and store them in a set location.

We highly recommend reading our article “Python Web Scraping Tutorial: Step-By-Step” before moving forward. Understanding how to build a basic data extraction tool will make creating a Python image scraper significantly easier. Additionally, we will use parts of code we had written previously as a foundation to download image links. Finally, we will use both Selenium and the requests library for learning purposes.

Before conducting image scraping please consult with legal professionals to be sure that you are not breaching third party rights, including but not limited to, intellectual property rights.

Libraries: new and old

We will need quite a few libraries in order to extract images from a website. In the basic web scraper tutorial we used BeautifulSoup, Selenium and pandas to gather and output data into a .csv file. We will repeat all of these previous steps in order to export the scraped data (i.e. the image URLs).

Of course, gathering image URLs into a list is not enough. We will use several other libraries to store the content of the URL into a variable, convert it into an image object and then save it to a specified location. Our newly acquired libraries are Pillow and requests.

If you missed the previous installment:

pip install beautifulsoup4 selenium pandas

Install these new libraries as well:

pip install Pillow requests

Additionally, we will use built-in Python libraries, mostly to store our acquired files in a specified folder.

Back to square one

Our data extraction process begins almost exactly the same (we will import libraries as needed). We assign our preferred webdriver, select the URL from which we will scrape image links and create a list to store them in. As our Chrome driver arrives at the URL, we use the variable ‘content’ to point to the page source and then “soupify” it with BeautifulSoup.

In the previous tutorial, we performed all actions using built-in and library-defined functions. While we could complete this tutorial without defining any functions of our own, defining them is an extremely useful practice for just about any project:

We’ll move our URL scraper into a defined function. Additionally, we will reuse the same code we used in the “Python Web Scraping Tutorial: Step-by-Step” article and repurpose it to scrape full URLs.

Before

After

Note that we now append in a different manner. Instead of appending the text, we use the function ‘get()’ and pass it a new parameter, ‘source’. We use ‘source’ to indicate the attribute of the element where the image links are stored. They will be nested in ‘src’, ‘data-src’ or other similar HTML attributes.
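The difference between the two appends can be illustrated with a static snippet; the class name, tag and attribute below are placeholders standing in for a real page source:

```python
from bs4 import BeautifulSoup

# A static snippet stands in for the page source retrieved by Selenium.
html = '<a class="blog-card__link"><img src="https://example.com/pic.png"></a>'
soup = BeautifulSoup(html, "html.parser")

# Before: we appended the text of each matching element.
texts = [element.text
         for element in soup.find_all(attrs={"class": "blog-card__link"})]

# After: we append the value returned by get() for a given source attribute.
links = [element.find("img").get("src")
         for element in soup.find_all(attrs={"class": "blog-card__link"})]

print(links)  # ['https://example.com/pic.png']
```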

Moving forward with defined functions

Let’s assume that our target URL has image links nested in ‘img’ tags inside elements of the class ‘blog-card__link’, and that the URL itself is stored in the ‘src’ attribute. We would call our newly defined function as such:

parse_image_urls("blog-card__link", "img", "src")

Our code should now look something like this:

Since we sometimes want to export the scraped data, and we have already used pandas before, we can verify our results by outputting everything into a “.csv” file. This way, we can always check for possible semantic errors.

df = pd.DataFrame({"links": results})
df.to_csv('links.csv', index=False, encoding='utf-8')

If we run our code right now, we should get a “links.csv” file outputted right into the running directory.

Time to extract images from the website

Assuming that we didn’t run into any issues at the end of the previous section, we can continue to download images from websites.

We will use the requests library to acquire the content stored at each image URL. A “for” loop will iterate over our ‘results’ list and fetch each link in turn.
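The loop itself was shown in a screenshot; a minimal sketch of the download step, with the fetching logic pulled into a hypothetical helper, could look like this:

```python
import requests

def fetch_image_content(url):
    """Request one image URL and return the raw bytes of the response."""
    return requests.get(url).content

# The loop then walks the scraped list:
# for b in results:
#     image_content = fetch_image_content(b)
```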

#io manages file-related in/out operations
import io
#creates a byte object out of image_content and points the variable image_file to it
image_file = io.BytesIO(image_content)

We are not done yet. So far, the ‘image_file’ we have above is just an in-memory Python bytes object, not an image.

#we use Pillow to convert our object to an RGB image
from PIL import Image
image = Image.open(image_file).convert('RGB')

We are still not done as we need to find a place to save our images. Creating a folder “Test” for the purposes of this tutorial would be the easiest option.
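The saving step might be sketched as follows; the folder name and file name are examples, and a small placeholder image is generated here so the snippet runs on its own:

```python
import os
from PIL import Image

os.makedirs("Test", exist_ok=True)  # folder that will hold the saved images
file_path = os.path.join("Test", "image1.jpg")

# 'image' would normally be the Pillow object from the previous step;
# a tiny generated image stands in for it here.
image = Image.new("RGB", (8, 8), "red")
image.save(file_path, "JPEG", quality=80)
```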

Putting it all together

Let’s combine all of the previous steps without any comments and see how it works out. Note that the pandas lines are commented out, as we are not extracting data into any tables here. We kept them in for the sake of convenience; use them if you need to see or double-check the outputs.

For efficiency, we quit our webdriver by using “driver.quit()” after retrieving the URL list we need. We no longer need that browser as everything is stored locally.

Running our application will output one of two results:

  • Images are saved into the folder we selected by defining the ‘file_path’ variable.
  • The server responds with a 403 Forbidden HTTP error.

Obviously, getting the first result means we are finished. We would receive the second outcome if we were to scrape our /blog/ page. Fixing the second outcome will take a little bit of time in most cases, although, at times, there can be more difficult scenarios.

Whenever we use the requests library to send a request to the destination server, a default user-agent “python-requests/version.number” is assigned. Some web services block these user-agents specifically, as they are guaranteed to be bots. Fortunately, the requests library allows us to assign any user-agent (or an entire header) we want:

image_content = requests.get(b, headers={'User-agent': 'Mozilla/5.0'}).content

Adding a user-agent will be enough for most cases. There are more complex cases where servers might try to check other parts of the HTTP header in order to confirm that it is a genuine user. Refer to our guides on HTTP headers and web scraping practices for more information on how to use them to extract images from websites online.

Cleaning up

Our task is finished, but the code is still messy. We can make our application more readable and reusable by putting everything under defined functions:

Everything is now nested under clearly defined functions and can be called when the file is imported. Otherwise, it will run just as it did previously.

Wrapping up

By using the code outlined above, you should now be able to complete basic image scraping tasks, such as downloading all images from a website in one go. Upgrading an image scraper can be done in a variety of ways, most of which we outlined in the previous installment. Check out our blog for more details on how to get started with data acquisition or how to make it more efficient.


About Adomas Sulcas

Adomas Sulcas is a Content Manager at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.

Related articles

Web Scraping With Selenium: DIY or Buy?
Jul 03, 2020 5 min read

Choosing the Right Proxy Service Provider
Jun 22, 2020 4 min read

HTTP Headers Explained
Jun 21, 2020 8 min read

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.