Web Scraping Tutorial Python
avatar

Adomas Sulcas

Sep 22, 2020 18 min read

Getting started in web scraping is simple except when it isn’t which is why you are here. Python is one of the easiest ways to get started as it is an object-oriented language. Python’s classes and objects are significantly easier to use than in any other language. Additionally, many libraries exist that make building a tool for web scraping in Python an absolute breeze.

In this web scraping Python tutorial, we will outline everything needed to get started with a simple application. It will acquire text-based data from page sources, store it into a file and sort the output according to set parameters. Options for more advanced features when using Python for web scraping will be outlined at the very end with suggestions for implementation. By following the steps outlined below you will be able to understand how to do web scraping.

This web scraping tutorial will work for all operating systems. There will be slight differences when installing either Python or development environments but not in anything else.

Building a web scraper: Python prepwork

Throughout this entire web scraping tutorial, Python 3.4+ version will be used. Specifically, we used 3.8.3 but any 3.4+ version should work just fine.

For Windows installations, when installing Python make sure to check “PATH installation”. PATH installation adds executables to the default Windows Command Prompt executable search. Windows will then recognize commands like “pip” or “python” without requiring users to point it to the directory of the executable (e.g. C:/tools/python/…/python.exe). If you have already installed Python but did not mark the checkbox, just rerun the installation and select modify. On the second screen select “Add to environment variables”.

Getting to the libraries

Libraries: Beautiful Soup, Pandas, Selenium
Web scraping with Python is easy due to the many useful libraries available

A barebones installation isn’t enough for web scraping. One of the Python advantages is a large selection of libraries for web scraping. We’ll be using three important libraries – BeautifulSoup v4, Pandas, and Selenium.

To install these libraries, start the terminal of your OS. Type in:

pip install BeautifulSoup4 pandas selenium

Each of these installations take anywhere from a few seconds to a few minutes to install. If your terminal freezes, gets stuck when downloading or extracting the package or any other issue outside of a total meltdown arises, use CTRL+C to abort any running installation. 

Further steps in this web scraping with Python tutorial assume a successful installation of the previously listed libraries. If you receive a “NameError: name * is not defined” it is likely that one of these installations has failed.

WebDrivers and browsers

Every web scraper uses a browser as it needs to connect to the destination URL. For testing purposes we highly recommend using a regular browser (or not a headless one), especially for newcomers. Seeing how written code interacts with the application allows simple troubleshooting and debugging, and grants a better understanding of the entire process.

Headless browsers can be used later on as they are more efficient for complex tasks. Throughout this web scraping tutorial we will be using the Chrome web browser although the entire process is almost identical with Firefox.

To get started, use your preferred search engine to find the “webdriver for Chrome” (or Firefox). Take note of your browser’s current version. Download the webdriver that matches your browser’s version.

If applicable, select the requisite package, download and unzip it. Copy the driver’s executable file to any easily accessible directory. Whether everything was done correctly, we will only be able to find out later on.

Finding a cozy place for our Python web scraper

One final step needs to be taken before we can get to the programming part of this web scraping tutorial: using a good coding environment. There are many options, from a simple text editor, with which simply creating a *.py file and writing the code down directly is enough, to a fully-featured IDE (Integrated Development Environment).

If you already have Visual Studio Code installed, picking this IDE would be the simplest option. Otherwise, I’d highly recommend PyCharm for any newcomer as it has very little barrier to entry and an intuitive UI. We will assume that PyCharm is used for the rest of the web scraping tutorial.

In PyCharm, right click on the project area and “New -> Python File”. Give it a nice name!

Importing and using libraries

Time to put all those pips we installed previously to use:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

PyCharm might display these imports in grey as it automatically marks unused libraries. Don’t accept its suggestion to remove unused libs (at least yet).

We should begin by defining our browser. Depending on the webdriver we picked back in “WebDriver and browsers” we should type in:

driver = webdriver.Chrome(executable_path='c:\path\to\windows\webdriver\executable.exe')

OR

driver = webdriver.Firefox(executable_path='/nix/path/to/webdriver/executable')

Picking a URL

Web scraping relies on nested data
Python web scraping requires looking into the source of websites

Before performing our first test run, choose a URL. As this web scraping tutorial is intended to create an elementary application, we highly recommended picking a simple target URL:

  • Avoid data hidden in Javascript elements. These sometimes need to be triggered by performing specific actions in order to display required data. Scraping data from Javascript elements requires more sophisticated use of Python and its logic.
  • Avoid image scraping. Images can be downloaded directly with Selenium.
  • Before conducting any scraping activities ensure that you are scraping public data, and are in no way breaching third party rights. Also, don’t forget to check robots.txt file for guidance.

Select the landing page you want to visit and input the URL into the driver.get(‘URL’) parameter. Selenium requires that the connection protocol is provided. As such, it is always necessary to attach “http://” or “https://” to the URL.

driver.get('https://your.url/here?yes=brilliant')

Try doing a test run by clicking the green arrow at the bottom left or by right clicking the coding environment and selecting ‘Run’.

Python web scrape
Follow the red pointer

If you receive an error message stating that a file is missing then turn double check if the path provided in the driver “webdriver.*” matches the location of the webdriver executable. If you receive a message that there is a version mismatch redownload the correct webdriver executable.

Defining objects and building lists

Python allows coders to design objects without assigning an exact type. An object can be created by simply typing its title and assigning a value.

# Object is “results”, brackets make the object an empty list.
# We will be storing our data here.
results = []

Lists in Python are ordered, mutable and allow duplicate members. Other collections, such as sets or dictionaries, can be used but lists are the easiest to use. Time to make more objects!

# Add the page source to the variable `content`.
content = driver.page_source
# Load the contents of the page, its source, into BeautifulSoup 
# class, which analyzes the HTML as a nested data structure and allows to select
# its elements by using various selectors.
soup = BeautifulSoup(content)

Before we go on with, let’s recap on how our code should look so far:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content)

Try rerunning the application again. There should be no errors displayed. If any arise, a few possible troubleshooting options were outlined in earlier chapters.

Extracting data with our Python web scraper

We have finally arrived at the fun and difficult part – extracting data out of the HTML file. Since in almost all cases we are taking small sections out of many different parts of the page and we want to store it into a list, we should process every smaller section and then add it to the list:

# Loop over all elements returned by the `findAll` call. It has the filter `attrs` given
# to it in order to limit the data returned to those elements with a given class only.
for element in soup.findAll(attrs={'class': 'list-item'}):
    ...

“soup.findAll” accepts a wide array of arguments. For the purposes of this tutorial we only use “attrs” (attributes). It allows us to narrow down the search by setting up a statement “if attribute is equal to X is true then…”. Classes are easy to find and use therefore we shall use those.

Let’s visit the chosen URL in a real browser before continuing. Open the page source by using CTRL+U (Chrome) or right click and select “View Page Source”. Find the “closest” class where the data is nested. Another option is to press F12 to open DevTools to select Element Picker. For example, it could be nested as:

<h4 class="title">
    <a href="...">This is a Title</a>
</h4>

Our attribute, “class”,  would then be “title”. If you picked a simple target, in most cases data will be nested in a similar way to the example above. Complex targets might require more effort to get the data out. Let’s get back to coding and add the class we found in the source:

# Change ‘list-item’ to ‘title’.
for element in soup.findAll(attrs={'class': 'title'}):
  ...

Our loop will now go through all objects with the class “title” in the page source. We will process each of them:

name = element.find('a')

Let’s take a look at how our loop goes through the HTML:

<h4 class="title">
    <a href="...">This is a Title</a>
</h4>

Our first statement (in the loop itself) finds all elements that match tags, whose “class” attribute contains “title”. We then execute another search within that class. Our next search finds all the <a> tags in the document (<a> is included while partial matches like <span> are not). Finally, the object is assigned to the variable “name”.

We could then assign the object name to our previously created list array “results” but doing this would bring the entire <a href…> tag with the text inside it into one element. In most cases, we would only need the text itself without any additional tags.

# Add the object of “name” to the list “results”.
# `<element>.text` extracts the text in the element, omitting the HTML tags.
results.append(name.text)

Our loop will go through the entire page source, find all the occurrences of the classes listed above, then append the nested data to our list:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content)
for element in soup.findAll(attrs={'class': 'title'}):
    name = element.find('a')
    results.append(name.text)

Note that the two statements after the loop are indented. Loops require indentation to denote nesting. Any consistent indentation will be considered legal. Loops without indentation will output an “IndentationError” with the offending statement pointed out with the “arrow”.

Exporting the data

Python web scraping tutorial
Python web scraping requires constant double-checking of the code

Even if no syntax or runtime errors appear when running our program, there still might be semantic errors. You should check whether we actually get the data assigned to the right object and move to the array correctly.

One of the simplest ways to check if the data you acquired during the previous steps is being collected correctly is to use “print”. Since arrays have many different values, a simple loop is often used to separate each entry to a separate line in the output:

for x in results:
   print(x)

Both “print” and “for” should be self-explanatory at this point. We are only initiating this loop for quick testing and debugging purposes. It is completely viable to print the results directly:

print(results)

So far our code should look like this:

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll(attrs={'class': 'class'}):
    name = a.find('a')
    if name not in results:
        results.append(name.text)
for x in results:
    print(x)

Running our program now should display no errors and display acquired data in the debugger window. While “print” is great for testing purposes, it isn’t all that great for parsing and analyzing data. 

You might have noticed that “import pandas” is still greyed out so far. We will finally get to put the library to good use. I recommend removing the “print” loop for now as we will be doing something similar but moving our data to a csv file.

df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')

Our two new statements rely on the pandas library. Our first statement creates a variable “df” and turns its object into a two-dimensional data table. “Names” is the name of our column while “results” is our list to be printed out. Note that pandas can create multiple columns, we just don’t have enough lists to utilize those parameters (yet).

Our second statement moves the data of variable “df” to a specific file type (in this case “csv”). Our first parameter assigns a name to our soon-to-be file and an extension. Adding an extension is necessary as “pandas” will otherwise output a file without one and it will have to be changed manually. “index” can be used to assign specific starting numbers to columns. “encoding” is used to save data in a specific format. UTF-8 will be enough in almost all cases.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll(attrs={'class': 'class'}):
    name = a.find('a')
    if name not in results:
        results.append(name.text)
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')

No imports should now be greyed out and running our application should output a “names.csv” into our project directory. Note that a “Guessed At Parser” warning remains. We could remove it by installing a third party parser but for the purposes of this Python web scraping tutorial the default HTML option will do just fine.

More lists. More!

Web scraping with python
Python web scraping often requires many data points

Many web scraping operations will need to acquire several sets of data. For example, extracting just the titles of items listed on an e-commerce website will rarely be useful. In order to gather meaningful information and to draw conclusions from it at least two data points are needed.

For the purposes of this tutorial, we will try something slightly different. Since acquiring data from the same class would just mean appending to an additional list, we should attempt to extract data from a different class but, at the same time, maintain the structure of our table.

Obviously, we will need another list to store our data in.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
other_results = []
for b in soup.findAll(attrs={'class': 'otherclass'}):
# Assume that data is nested in ‘span’.
    name2 = b.find('span')
    other_results.append(name.text)

Since we will be extracting an additional data point from a different part of the HTML, we will need an additional loop. If needed we can also add another “if” conditional to control for duplicate entries:

Finally, we need to change how our data table is formed:

df = pd.DataFrame({'Names': results, 'Categories': other_results})

So far the newest iteration of our code should look something like this:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
other_results = []
content = driver.page_source
for a in soup.findAll(attrs={'class': 'class'}):
    name = a.find('a')
    if name not in results:
        results.append(name.text)
for b in soup.findAll(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    other_results.append(name.text)
df = pd.DataFrame({'Names': results, 'Categories': other_results})
df.to_csv('names.csv', index=False, encoding='utf-8')


If you are lucky, running this code will output no error. In some cases “pandas” will output an “ValueError: arrays must all be the same length” message. Simply put, the length of the lists “results” and “other_results” is unequal, therefore pandas cannot create a two-dimensional table.

There are dozens of ways to resolve that error message. From padding the shortest list with “empty” values, to creating dictionaries, to creating two series and listing them out. We shall do the third option:

series1 = pd.Series(results, name = 'Names')
series2 = pd.Series(other_results, name = 'Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')

Note that data will not be matched as the lists are of uneven length but creating two series is the easiest fix if two data points are needed. Our final code should look something like this:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
other_results = []
content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll(attrs={'class': 'class'}):
    name = a.find('a')
    if name not in results:
        results.append(name.text)
for b in soup.findAll(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    other_results.append(name.text)
series1 = pd.Series(results, name = 'Names')
series2 = pd.Series(other_results, name = 'Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')

Running it should create a csv file named “names” with two columns of data.

Web scraping with Python best practices

Our first web scraper should now be fully functional. Of course it is so basic and simplistic that performing any serious data acquisition would require significant upgrades. Before moving on to greener pastures, I highly recommend experimenting with some additional features:

  • Create matched data extraction by creating a loop that would make lists of an even length. 
  • Scrape several URLs in one go. There are many ways to implement such a feature. One of the simplest options is to simply repeat the code above and change URLs each time. That would be quite boring. Build a loop and an array of URLs to visit.
  • Another option is to create several arrays to store different sets of data and output it into one file with different rows. Scraping several different types of information at once is an important part of e-commerce data acquisition.
  • Once a satisfactory web scraper is running, you no longer need to watch the browser perform its actions. Get headless versions of either Chrome or Firefox browsers and use those to reduce load times.
  • Create a scraping pattern. Think of how a regular user would browse the internet and try to automate their actions. New libraries will definitely be needed. Use “import time” and “from random import randint” to create wait times between pages. Add “scrollto()” or use specific key inputs to move around the browser. It’s nearly impossible to list all of the possible options when it comes to creating a scraping pattern.
  • Create a monitoring process. Data on certain websites might be time (or even user) sensitive. Try creating a long-lasting loop that rechecks certain URLs and scrapes data at set intervals. Ensure that your acquired data is always fresh.
  • Make use of the Python Requests library. Requests is a powerful asset in any web scraping toolkit as it allows to optimize HTTP methods sent to servers.
  • Finally, integrate proxies into your web scraper. Using location specific request sources allows you to acquire data that might otherwise be inaccessible.

Conclusion

From here onwards, you are on your own. Building web scrapers, acquiring data and drawing conclusions from large amounts of information is inherently an interesting and complicated process. If you want to find out more about how proxies or advanced data acquisition tools work, check out our blog! We have enough articles for everyone: a more detailed guide on how to avoid blocks when scraping, is web scraping legal, an in-depth walkthrough on what is a proxy and many more!

avatar

About Adomas Sulcas

Adomas Sulcas is a Content Manager at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.

Related articles

lxml Tutorial: XML Processing and Web Scraping With lxml

lxml Tutorial: XML Processing and Web Scraping With lxml

Sep 24, 2020

10 min read

How to Crawl a Website Without Getting Blocked

How to Crawl a Website Without Getting Blocked

Sep 24, 2020

9 min read

What is cURL and What Does It Mean?

What is cURL and What Does It Mean?

Sep 21, 2020

5 min read

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.