
How to Automate Web Scraping With Python and Cron

Danielius Radavicius

2022-07-29

In most cases, the first step to building an automated web scraper is writing a Python web scraping script. The second is the automation itself, which can be done in many ways, yet one stands out as the most straightforward: macOS, Linux, and other Unix-like operating systems ship with a built-in tool, cron, which is specifically suited for continuously repeated tasks.

Therefore, this article will primarily teach how to schedule tasks using cron. For the automation example, a web scraper written in Python was chosen.

Do note that before you start configuring cron, there are certain preparatory guidelines we’d recommend you follow, as this will reduce the chance of errors.

Preparing the Python script

The first tip is to use a virtual environment. It ensures that the correct Python version and all required libraries are available to your Python web scraper alone, rather than depending on whatever is installed system-wide.

The next good practice is to use absolute file paths. Doing so ensures that the script does not break because of missing files if you change your working directory.

Lastly, using logging is highly recommended as it allows you to have a log file you can refer to and troubleshoot if something breaks.

You can configure logging with just a single line of code after importing the logging module:

logging.basicConfig(filename="/Users/upen/scraper.log", level=logging.DEBUG)

After this, you can write in the log file as follows:

logging.info("informational message here")

For more information on logging, see the official documentation.

Example Python script

Since the article’s focus is on providing a realistic example, the following script is made to resemble real-life automated scraping:

from bs4 import BeautifulSoup
import requests

url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'

# Fetch and parse the product page
response = requests.get(url)
response.encoding = "utf-8"
soup = BeautifulSoup(response.text, 'lxml')

# Extract the price and append it to the CSV file
price = soup.select_one('#content_inner .price_color').text
with open(r'/Users/upen/data.csv', 'a') as f:
    f.write(price + "\n")

Every time you run this script, it will append the latest price to the CSV file on a new line.
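Since every run appends just the bare price, it can be hard to tell rows apart later. A small extension adds a timestamp to each row; in this sketch the price value is hard-coded to stand in for the one parsed from the page:

```python
import csv
from datetime import datetime

# In the real script this value comes from soup.select_one(...)
price = "£51.77"

# Append a timestamped row so each cron run can be told apart
with open("data.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([datetime.now().isoformat(timespec="seconds"), price])
```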

What is cron, and how does it work

The cron utility is a program that checks if any tasks are scheduled and runs those tasks if the schedule matches.

An essential part of cron is crontab, short for cron table: the utility for managing the files that cron reads, also known as crontab files.

In this article, we will work with such files directly. If you would rather manage cron jobs from Python code, see the python-crontab library.

When using python-crontab, you can configure cron directly from Python, including removing crontab jobs. In our example, however, we will focus on working with crontab itself.

Crontab Syntax

The first step of building an automated web scraping task is understanding how the crontab utility works.

To view a list of currently configured crontab tasks, use the -l switch as follows:

crontab -l

To edit the crontab file, use the -e switch:

crontab -e

This command will open the default editor, which in most cases is vi. You can change the editor to something more straightforward, such as nano, by running the following command:

export EDITOR=nano

Note that GUI editors, such as Visual Studio Code, may not work out of the box: crontab waits for the editor process to exit before reading the file, and GUI editors typically return immediately. It is safest to stick with vi or nano.

Each crontab entry uses this pattern:

<schedule> <command to run>

Each line contains the schedule and the task to be run.

How to edit the crontab file

To edit the crontab file, open the terminal and enter the following command:

crontab -e

This command will open the default editor for crontab. On some Linux distros, you may be asked which program you want to use to edit this file.

In the editor, enter the task and frequency in each line.

[Image: crontab -e open in the editor, with an entry that uses the virtualenv Python as the binary]

How to set the cron job frequency

Each entry in crontab begins with the cron job frequency. The frequency, or schedule, contains five fields:

  • Minute

  • Hour (in 24-hour format)

  • Day of the month

  • Month

  • Day of the week

Each field can hold * (any value) or a number. Cron also understands ranges (1-5), lists (1,3,5), and steps (*/15), but * and plain numbers cover most cases.

For example, if you want to run a task every hour, the schedule will be as follows:

0 * * * *

Notably, the cron process wakes up every minute and matches the current system time against each entry. In our case, the entry will only match when the system time is at minute 0. All other fields are *, meaning they match any value.

This task will therefore run at 4:00, 5:00, 6:00, and so on: effectively, the schedule creates a job that runs every hour.

Here are a few more examples.

  • To run a task at 10 am on the 1st of every month, use the following:

0 10 1 * *

  • To run a task at 2 pm (14:00) every Monday, type:

0 14 * * 1

Many sites, such as crontab.guru, can help you build and validate a schedule.
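To make the matching rule concrete, here is a toy Python function (an illustration only, not how cron is actually implemented) that checks a time against a five-field schedule made of * and plain numbers:

```python
def matches(schedule: str, minute: int, hour: int, day: int,
            month: int, weekday: int) -> bool:
    """Return True if the time components match a schedule made of
    '*' and plain numbers (ranges, lists, and steps are not handled)."""
    fields = schedule.split()
    values = [minute, hour, day, month, weekday]
    # A field matches if it is '*' or equals the corresponding value
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

# "0 * * * *" matches any time whose minute is 0, i.e. every full hour
print(matches("0 * * * *", minute=0, hour=5, day=12, month=7, weekday=3))    # True
print(matches("0 * * * *", minute=30, hour=5, day=12, month=7, weekday=3))   # False
# "0 14 * * 1" matches 14:00 on Mondays (weekday 1)
print(matches("0 14 * * 1", minute=0, hour=14, day=12, month=7, weekday=1))  # True
```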

How to remove a Python crontab job

To remove all crontab jobs, open the terminal and use this command:

crontab -r

If you want to remove a specific crontab job, you can edit the crontab file as follows:

crontab -e

Once in edit mode, delete the line for that job and save the file. The crontab will be updated with the new contents, effectively deleting the cron job.

How to schedule a Python script in crontab

First, decide what command you want to run. If you are not using a virtual environment, you can run your web scraping script as follows:

python3 /Users/upen/shopping/scraper.py

In some cases, your script will have specific dependencies. If you're following the recommended practices, it’s likely you've created a virtual environment.

Do note that it's often unnecessary to run source venv/bin/activate to use your venv's Python with all its dependencies. For example, .venv/bin/python3 script.py already runs the python3 from the virtual environment.

A further recommendation would be to create a shell script and write the above lines in that script to make it more manageable. If you do that, the command to run your scraper would be:

sh /Users/upen/shopping/run_scraper.sh
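The contents of such a wrapper might look like the sketch below. The .venv location under the project directory is an assumed layout, not something the article specifies; point it at wherever your virtual environment actually lives:

```shell
#!/bin/sh
# run_scraper.sh -- wrapper script so the crontab entry stays short.
# The .venv path below is an assumed layout; adjust to your setup.
/Users/upen/shopping/.venv/bin/python3 /Users/upen/shopping/scraper.py
```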

The second step is to create a schedule. Let’s take an example where the script must run hourly. The cron schedule will be as follows:

0 * * * *

After finalizing these two pieces of information, open the terminal and enter the command:

crontab -e

Next, enter the following line, assuming you are using a shell script.

0 * * * * sh /Users/upen/shopping/run_scraper.sh

Upon saving the file, your operating system may prompt you that your system settings are being modified. Accept the prompt to create the cron job.

[Image: the macOS permissions prompt shown when the crontab is saved]

Common reasons why Python script isn't running from crontab

On macOS, the most common reason is cron’s lack of permissions. To grant them, open System Preferences and click on Security & Privacy. In the Privacy tab, select Full Disk Access on the left and add the path of the cron executable. If you aren’t sure about the location of the cron executable, run the following command in the terminal:

which cron

Another common problem is that the system runs Python 2 instead of Python 3, or vice versa. Remember that macOS and many Linux distros ship with both Python 2 and Python 3. To fix this, find the complete path of the Python executable you want. Open the terminal and run the following:

which python3

Take note of the Python executable that you want to use. Unless you are using a virtual environment, you must specify the complete path of the Python executable in the crontab entry.
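For example, if which python3 printed /usr/local/bin/python3 (a hypothetical location; yours may differ), the full crontab entry for an hourly run would be:

```shell
# Hourly run using the interpreter's absolute path
0 * * * * /usr/local/bin/python3 /Users/upen/shopping/scraper.py
```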

Another common reason for failure is an incorrect script path. As a rule of thumb, always use absolute paths when working with cron.

Cron job vs SystemD vs Windows Task Scheduler vs AutoScraper

Cron is a tool specific to Unix-like operating systems such as macOS and Linux. Tools similar to it are Systemd (read as system-d) and Anacron. However, these are Linux-specific and aren't available on Windows.

For Windows, you can use the dedicated Windows Task Scheduler tool.

AutoScraper, on the other hand, is an open-source Python library that can work with most scenarios. You should note that the library isn’t meant to be an alternative to cron. It can automate the web scraping part, but you still have to write the Python script and use cron or one of the alternatives to run it automatically.

Conclusion

Having covered the crucial aspects of cron, crontab, and cron jobs, we hope you’ve gained a better understanding of how web scraping automation is possible through the practices described above. As always, before automating your web scraping projects, make sure you research which software and languages are most relevant for your project, as both cron and Python have a variety of benefits and limitations compared to the alternatives.

People also ask

What is a crontab in Linux?

The crontab (short for cron table) is the file that lists the programs or scripts to be executed by the cron tool. These files should not be edited directly; adjust them using the command-line tool crontab.

What is a cron job in Python?

Cron is for scheduling jobs in Unix-like operating systems, such as macOS and Linux. 

A job, in this case, is equal to any executable, including Python. If you want to run a Python script, you can schedule a job using crontab, where the executable is Python, and the argument is the script.

If you want to configure cron via Python, see the library python-crontab.

What is the difference between cron and crontab?

Cron is the tool that runs every minute to check the entries in a table and runs the task that matches the schedule.

These entries are stored in crontab files. The tool to manage these files is also called crontab.

The individual tasks defined in crontab are called cron jobs.

About the author

Danielius Radavicius

Junior Copywriter

Danielius Radavicius is a Junior Copywriter at Oxylabs. Having grown up in films, music, books, and a keen interest in the defense industry, he decided to move his career towards tech-related subjects and quickly became interested in all things technology. In his free time, you'll probably find Danielius watching films, listening to music, and planning world domination.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.

