The requests library in Python offers a simple yet feature-rich way to make HTTP requests to web servers in just a few lines of code, as shown below:
import requests
r = requests.get('https://sandbox.oxylabs.io/products')
print(r.text)
While the above code does the job, sometimes, web requests may fail, especially during web scraping projects. To avoid manually rerunning the code after failure, you should level up your Python requests module usage. Follow this article to learn how to set up a Python requests retry mechanism using HTTPAdapter, code your own retry function, and retry failed requests using proxies.
Error code | Causes | Solutions |
---|---|---|
403 Forbidden | Access to the requested resource is denied, possibly due to insufficient permissions. | Retry using proxy servers or a different set of HTTP headers. |
429 Too Many Requests | You’ve sent too many requests in a given amount of time, which is usually a consequence of the target's rate-limiting policies. | Slow down your request rate, retry with an exponential backoff, and respect the Retry-After header if the server returns one. |
500 Internal Server Error | The server encountered an unexpected condition that prevented it from fulfilling the request. | Retry the request after a delay using a backoff strategy. |
502 Bad Gateway | The server you’re trying to access got a bad response from another server it depends on. | Retry the request after a delay using a backoff strategy. |
503 Service Unavailable | The server cannot handle the request, possibly due to it being overloaded or down for maintenance. | Retry after a delay and respect the Retry-After header if the server returns one. |
Before diving head-first into coding, let’s understand how request retries work under the hood. Simply put, the retry strategy automatically reruns a request when an HTTP error occurs. The idea is that you don’t want to retry a request immediately but after a certain delay. This time interval can be a fixed number that always stays the same; however, you should avoid fixed delays as this method can overload the website’s server, which might worsen the issue rather than solve it.
Instead, one of the best practices is to use a backoff strategy, where the delay increases with each consecutive retry, reducing the risk of overwhelming the server and increasing the chances of successful execution. While there are various backoff strategies, such as exponential, Fibonacci, linear, and polynomial, the most common in web scraping is exponential backoff, which we’ll take a deeper look at in this article.
Exponential backoff algorithms come in several forms, each offering a unique set of delays. Take a look at the following exponential backoff formula:
delay = base_delay * (backoff_factor ** current_retry_count)
This algorithm increases the delay by raising the backoff factor to the power of the current retry count and multiplying it by the base delay. With a base delay of 1, different backoff factors produce delay times (in seconds) like these:
# Factor of 2
Delay: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512
# Factor of 3
Delay: 1, 3, 9, 27, 81, 243, 729, 2187, 6561, 19683
# Factor of 5
Delay: 1, 5, 25, 125, 625, 3125, 15625, 78125, 390625, 1953125
Another common backoff algorithm, which is used by Python’s urllib3 library, is fundamentally similar:
delay = backoff_factor * (2 ** retry_count)
This algorithm doubles the delay with each retry and multiplies it by the backoff factor. With larger factors, the delay times grow much more slowly than with the previous backoff algorithm:
# Factor of 2
Delay: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
# Factor of 3
Delay: 3, 6, 12, 24, 48, 96, 192, 384, 768, 1536
# Factor of 5
Delay: 5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560
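If you'd like to compare the two formulas yourself, here’s a minimal sketch (not taken from the examples above, and assuming a base delay of 1 and five retries) that prints both delay sequences for the same factors:

# A minimal sketch comparing the two backoff formulas (assumes base_delay=1 and 5 retries).
def first_formula(base_delay, backoff_factor, retries):
    return [base_delay * (backoff_factor ** n) for n in range(retries)]


def urllib3_style(backoff_factor, retries):
    return [backoff_factor * (2 ** n) for n in range(retries)]


for factor in (2, 3, 5):
    print(f'Factor {factor}:', first_formula(1, factor, 5), urllib3_style(factor, 5))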
The easiest way to configure a Python requests retry strategy is to use available libraries, such as the requests library with its HTTPAdapter class and urllib3 with its Retry class.
Begin by installing the requests library using Python’s package installer. Run the following line in your terminal:
pip install requests
Next, import the required libraries and their classes:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
The HTTPAdapter will manage a pool of connections, giving more control and boosting performance by reusing connections. The Retry class will be the key component of your script, enabling you to control how your script retries failed requests.
Then, create a try-except block and define your Python requests retry logic using the Retry() class:
try:
    retry = Retry(
        total=5,
        backoff_factor=2,
        status_forcelist=[429, 500, 502, 503, 504],
    )
except Exception as e:
    print(e)
Here, the total parameter sets the maximum number of retries to perform, while the status_forcelist specifies which HTTP error codes should trigger request retries.
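The Retry class also accepts a few other parameters you may want to tune. As an optional aside (assuming urllib3 v1.26 or newer, where allowed_methods replaced the older method_whitelist parameter), a configuration like the following restricts retries to idempotent methods and honors the server’s Retry-After header:

from urllib3.util import Retry

# An optional sketch of extra Retry options (assumes urllib3 v1.26+).
retry = Retry(
    total=5,
    backoff_factor=2,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=['GET', 'HEAD'],   # only retry idempotent request methods
    respect_retry_after_header=True,   # wait as long as the server's Retry-After header asks
)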
Next, inside the try statement, create the HTTPAdapter instance and pass the retry object to it. Then, create a Session() object, which reuses the same parameters and configuration across multiple requests, and use the mount() method to attach the adapter for the https:// URL prefix. Once that’s done, send a GET request to the testing site https://httpbin.org/status/502, which always returns a 502 HTTP error, and print the response code if the request is successful:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

try:
    retry = Retry(
        total=5,
        backoff_factor=2,
        status_forcelist=[429, 500, 502, 503, 504],
    )

    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('https://', adapter)

    r = session.get('https://httpbin.org/status/502', timeout=180)
    print(r.status_code)
except Exception as e:
    print(e)
Once you run the code, it’ll retry the request up to 5 times. If the request still fails after the maximum number of retries, the except block will catch the error and print a message that should look like this:
HTTPSConnectionPool(host='httpbin.org', port=443): Max retries exceeded with url: /status/502 (Caused by ResponseError('too many 502 error responses'))
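Rather than catching a bare Exception, you can handle this outcome more precisely. Here’s a minimal sketch, assuming the same session object with the mounted HTTPAdapter from the example above, that catches requests.exceptions.RetryError, the exception requests raises in this scenario (retries exhausted on retryable status codes):

from requests.exceptions import RetryError

# Assumes `session` is the Session with the mounted HTTPAdapter from the example above.
try:
    r = session.get('https://httpbin.org/status/502', timeout=180)
    print(r.status_code)
except RetryError as e:
    print(f'Retries exhausted: {e}')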
You can always write your own retry logic and customize it to your specific needs. Thankfully, it’s not a complicated endeavor.
For this task, let’s use two built-in Python modules: random and time. The random module will help you add some jitter to the delay times so your script becomes less predictable. Additionally, the time.sleep() function from the time module will allow you to pause execution for a specified amount of time.
As mentioned previously, there are two common exponential backoff formulae for calculating the time delay, so let’s take a look at the first one:
import random
import time


def delay(base_delay, backoff_factor, max_retries, max_delay, jitter=False):
    for retry in range(max_retries):
        delay_time = base_delay * (backoff_factor ** retry)
        if jitter:
            delay_time *= random.uniform(1, 1.5)

        effective_delay = min(delay_time, max_delay)
        # Uncomment to enable sleep
        # time.sleep(effective_delay)
        print(f"Attempt {retry + 1}: Delay for {effective_delay} seconds.")


delay(1, 2, 5, 180, jitter=True)
There are several parameters to understand:
base_delay – sets the delay time for the first retry attempt. If it’s set to 1, it’ll output [1, 2, 4, 8, 16], while 2 will output [2, 4, 8, 16, 32];
backoff_factor – determines how much the delay increases with each subsequent retry;
max_retries – the maximum number of retry attempts;
max_delay – the maximum allowed delay time. This caps the exponential growth to prevent excessively long delays.
jitter – an optional parameter that, when set to True, adds randomness to the delay time. You can always adjust the random.uniform(1, 1.5) to control the amount of applied jitter.
Once you run the code, it’ll use 1 for the base_delay, 2 for the backoff_factor, 5 for max_retries, 180 for max_delay, and it’ll add some jitter to the delay times. You should see a similar output to this:
Attempt 1: Delay for 1.1337826730548428 seconds.
Attempt 2: Delay for 2.5009838407289733 seconds.
Attempt 3: Delay for 4.0078030241849945 seconds.
Attempt 4: Delay for 11.36568074005565 seconds.
Attempt 5: Delay for 16.16436422678868 seconds.
Feel free to play around with different parameter values to see the difference.
Another common exponential backoff calculation, used by the urllib3 library, can look like this:
import random
import time


def delay(backoff_factor, max_retries, max_delay, jitter=False):
    for retry in range(max_retries):
        delay_time = backoff_factor * (2 ** retry)
        if jitter:
            delay_time *= random.uniform(1, 1.5)

        effective_delay = min(delay_time, max_delay)
        # Uncomment to enable sleep
        # time.sleep(effective_delay)
        print(f"Attempt {retry + 1}: Delay for {effective_delay} seconds.")


delay(2, 5, 180, jitter=True)
Here, the only difference is that there’s no base_delay parameter. Once you run the code, you should see a similar output:
Attempt 1: Delay for 2.1920377303270024 seconds.
Attempt 2: Delay for 4.27119299385229 seconds.
Attempt 3: Delay for 10.836279682514967 seconds.
Attempt 4: Delay for 16.20166015318422 seconds.
Attempt 5: Delay for 32.97554936428317 seconds.
Let’s use the latter formula to build our own request retry script. Start by modifying the above code so it appends all the calculated delay times to a list:
import random, time, requests


def delay(backoff_factor, max_retries, max_delay, jitter=False):
    delay_times = []
    for retry in range(max_retries):
        delay_time = backoff_factor * (2 ** retry)
        if jitter:
            delay_time *= random.uniform(1, 1.5)

        effective_delay = min(delay_time, max_delay)
        delay_times.append(effective_delay)
    return delay_times
Next, define a new function that attempts to fetch a URL using the GET method, retries upon certain HTTP status codes, and employs delays between retries:
def get(URL, **kwargs):
    success = False
    for delay in backoff:
        r = requests.get(URL, **kwargs)
        status = r.status_code
        if status >= 200 and status < 300:
            print(f'Success! Status: {status}')
            success = True
            break
        elif status in [429, 500, 502, 503, 504]:
            print(f'Received status: {status}. Retrying in {delay} seconds.')
            time.sleep(delay)
        else:
            print(f'Received status: {status}.')
            break

    if not success:
        print("Maximum retries reached.")


backoff = delay(2, 5, 180, jitter=True)
get('https://httpbin.org/status/502', timeout=180)
Here, the success flag is used to define when the fetching is successful. If success remains False after all retries, the code prints a message indicating that maximum retries were reached.
As you can see at the bottom of the code, the delay() function is called, which saves the returned delay times to a backoff variable. So, in the get() function, the code iterates over each delay time in the backoff list. If the received response status code is in the list of [429, 500, 502, 503, 504], the function retries the request, each time using a different delay from the backoff list. Additionally, if the response code isn’t in the range of 200-299 and not in the error code list, it breaks the loop and prints the received status code.
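Note that this function only retries based on HTTP status codes; if the request fails at the network level (a timeout or dropped connection), requests raises an exception instead and the function stops. Here’s a minimal sketch of one possible extension (the helper name is illustrative), reusing the backoff list and imports from the example above, so network errors are retried too:

def get_with_network_retries(URL, **kwargs):
    # A sketch that also retries on network errors, not just HTTP status codes.
    # Reuses the `backoff` list produced by delay() above.
    for wait in backoff:
        try:
            r = requests.get(URL, **kwargs)
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
            print(f'Network error: {e}. Retrying in {wait} seconds.')
            time.sleep(wait)
            continue

        if 200 <= r.status_code < 300:
            print(f'Success! Status: {r.status_code}')
            return r
        elif r.status_code in [429, 500, 502, 503, 504]:
            print(f'Received status: {r.status_code}. Retrying in {wait} seconds.')
            time.sleep(wait)
        else:
            print(f'Received status: {r.status_code}.')
            return r

    print('Maximum retries reached.')
    return None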
When it comes to integrating proxies, it’s quite straightforward since the Python requests library streamlines the proxy integration process. If you’re unsure, learn more about what a proxy is. First, let’s see how you can use proxies with the HTTPAdapter and Retry classes.
Make sure to use your actual proxy credentials for the USERNAME and PASSWORD variables:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

USERNAME = 'PROXY_USERNAME'
PASSWORD = 'PROXY_PASSWORD'

proxies = {
    'http': f'http://{USERNAME}:{PASSWORD}@pr.oxylabs.io:7777',
    'https': f'https://{USERNAME}:{PASSWORD}@pr.oxylabs.io:7777'
}

try:
    retry = Retry(
        total=5,
        backoff_factor=2,
        status_forcelist=[403, 429, 500, 502, 503, 504],
    )

    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('https://', adapter)

    r = session.get('https://ip.oxylabs.io/', proxies=proxies, timeout=180)
    print(r.status_code)
    print(r.text)
except Exception as e:
    print(e)
Once you run the code, it’ll use proxies for all requests and will output the IP address after visiting the https://ip.oxylabs.io/ site. This website is used just for proxy testing purposes, so make sure you use a target website of your choice after successful integration. In case you’re new to configuring proxy servers in Python’s requests library, check out our Python requests proxy integration tutorial.
When using a custom retry script, you can easily add more functionality. For instance, the code below uses proxy servers only when the target website returns a 403 or 429 status code:
import random, time, requests

USERNAME = 'PROXY_USERNAME'
PASSWORD = 'PROXY_PASSWORD'

proxies = {
    'http': f'http://{USERNAME}:{PASSWORD}@pr.oxylabs.io:7777',
    'https': f'https://{USERNAME}:{PASSWORD}@pr.oxylabs.io:7777'
}


def delay(backoff_factor, max_retries, max_delay, jitter=False):
    delay_times = []
    for retry in range(max_retries):
        delay_time = backoff_factor * (2 ** retry)
        if jitter:
            delay_time *= random.uniform(1, 1.5)

        effective_delay = min(delay_time, max_delay)
        delay_times.append(effective_delay)
    return delay_times


def get(URL, **kwargs):
    success = False
    enable_proxies = False
    for delay in backoff:
        if enable_proxies:
            r = requests.get(URL, proxies=proxies, **kwargs)
        else:
            r = requests.get(URL, **kwargs)

        status = r.status_code
        if status >= 200 and status < 300:
            print(f'Success! Status: {status}')
            success = True
            break
        elif status in [500, 502, 503, 504]:
            print(f'Received status: {status}. Retrying in {delay} seconds.')
            time.sleep(delay)
        elif status in [403, 429]:
            print(f'Received status: {status}. Retrying in {delay} seconds with proxies.')
            enable_proxies = True
            time.sleep(delay)
        else:
            print(f'Received status: {status}.')
            break

    if not success:
        print("Maximum retries reached.")


backoff = delay(2, 5, 180, jitter=True)
get('https://httpbin.org/status/429', timeout=180)
In this code snippet, the enable_proxies flag helps to control when the proxies should be used and when they shouldn’t. Since the https://httpbin.org/status/429 website always returns a 429 status code, the code will retry the request five times with different delay times and will exit the loop with a message saying, “Maximum retries reached.”
For an optimized retry strategy that yields the best outcomes, make sure to adhere to these essential best practices:
As mentioned previously, steer clear of using fixed delay times and utilize a backoff algorithm instead;
Create a list of error status codes for which you want to retry failed requests and use the best-suited methods for each. For instance, for requests that return the 403 Forbidden error, you should only retry again using proxy servers or different HTTP header sets;
If a target server returns a Retry-After header in its response, respect this instruction by waiting the specified amount of time before making another request (see the sketch after this list);
If you notice an increase in the server’s response times, it might indicate that the server is under load. Adjust your request rate accordingly to mitigate the impact on the server;
Simplify your setup and ensure efficient operations by using ready-made tools like requests’ HTTPAdapter with urllib3’s Retry, the Tenacity library, and similar.
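To illustrate the Retry-After point above: the header typically contains the number of seconds to wait (it can also be an HTTP date, which this sketch doesn’t handle), and urllib3’s Retry class already respects it by default. A minimal sketch of a manual approach could look like this:

import time

import requests

# A minimal manual sketch; assumes Retry-After holds a number of seconds.
r = requests.get('https://httpbin.org/status/429', timeout=30)
if r.status_code == 429:
    retry_after = r.headers.get('Retry-After')
    if retry_after and retry_after.isdigit():
        print(f'Server asked to wait {retry_after} seconds before retrying.')
        time.sleep(int(retry_after))
        # Retry the request here.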
Exceptional performance is key to any process; hence, you should consider taking advantage of several improvements:
The requests library only supports synchronous requests. You can improve your web scraping speed by using asynchronous libraries like asyncio coupled with aiohttp. See this asynchronous web scraping article to get started. One thing to note is that asyncio and aiohttp don’t have built-in retry mechanisms; thus, you have to code your own retry script or use retry libraries. Also, check out this HTTPX vs. Requests vs. AIOHTTP comparison.
It’s better to keep the maximum number of retries small rather than setting it high. This way, you can avoid excessive retries and potential performance issues, such as increased latency and resource exhaustion;
Always set timeouts for your requests to prevent them from hanging indefinitely. This helps to keep your script running efficiently by ensuring that a slow or stalled request doesn't block your code’s execution. We've explored this topic in our guide to handling Python requests timeouts, so feel free to check it out;
Consider forcing the Keep-Alive header in your requests. It allows the underlying TCP connection to be used for multiple requests to the same server, reducing the overhead of establishing a new connection for each request. By default, the requests.Session object will reuse connections, as shown in the sketch below.
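To tie the last two points together, here’s a minimal sketch, assuming an illustrative target URL, that reuses a single Session (which keeps connections alive by default), sets an explicit Keep-Alive header, and applies a timeout to every request:

import requests

session = requests.Session()
# Sessions reuse underlying TCP connections by default; the header just makes the intent explicit.
session.headers.update({'Connection': 'keep-alive'})

for _ in range(3):
    # The timeout keeps a slow or stalled request from hanging indefinitely.
    r = session.get('https://sandbox.oxylabs.io/products', timeout=10)
    print(r.status_code)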
A retry mechanism offers a smart way to specify which response status codes should trigger a retry, how many attempts to make, and what delays to apply between them. The requests HTTPAdapter coupled with urllib3’s Retry is a great starting point for beginners, but it also isn’t a complex task to code your own script that unlocks more flexibility.
Have questions about Oxylabs solutions? Reach us via email or drop a line via our 24/7 live chat.
About the author
Vytenis Kaubrė
Technical Copywriter
Vytenis Kaubrė is a Technical Copywriter at Oxylabs. His love for creative writing and a growing interest in technology fuels his daily work, where he crafts technical content and web scrapers with Oxylabs’ solutions. Off duty, you might catch him working on personal projects, coding with Python, or jamming on his electric guitar.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.