How to download files from URLs with Python?

Learn how to download files from URLs efficiently using Python. This short guide covers the essential libraries and techniques for retrieving files from online sources.

Best practices

  • Use the `requests` library for downloading files, as it gives you more control over requests and responses than the standard-library `urllib`, including error handling and session management (a session-based sketch follows the examples below).

  • Always check the `status_code` of the response object to ensure the HTTP request was successful before proceeding with file operations.

  • When downloading large files, pass `stream=True` to `requests.get()` so the content is downloaded in chunks rather than loaded into memory all at once.

  • Consider using the `tqdm` library to add a progress bar when downloading files, which improves the user experience by providing visual feedback on the download progress.

import requests # Import requests library

# Download file using requests
url = "https://sandbox.oxylabs.io/products/sample.pdf"
response = requests.get(url)
with open("sample.pdf", "wb") as file:
    file.write(response.content)

import urllib.request # Import urllib library for another method
# Download file using urllib
urllib.request.urlretrieve(url, "sample_urllib.pdf")

try:
    from tqdm import tqdm # Import tqdm for progress bar (optional)

    # Download with progress bar
    response = requests.get(url, stream=True)
    total_size_in_bytes = int(response.headers.get('content-length', 0))
    block_size = 1024 # 1 Kibibyte
    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
    with open("sample_with_progress.pdf", "wb") as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data)
    progress_bar.close()
except ImportError:
    print("tqdm library is not installed. Install it to see the progress bar.")

Common issues

  • Handle exceptions such as `ConnectionError` or `Timeout` when using `requests.get()` so your code stays robust against network failures.

  • Validate the download against the `Content-Length` header to confirm the file size is as expected and catch incomplete downloads.

  • Set a timeout in `requests.get()` to avoid hanging indefinitely if the server does not respond or is too slow.

  • Use `os.path` to dynamically set the file path and name, ensuring compatibility across different operating systems.

# Good Example: Handling exceptions with requests.get()
try:
    response = requests.get(url, timeout=10) # Set timeout
    response.raise_for_status() # Check for HTTP errors
except requests.exceptions.RequestException as e:
    print(f"Error downloading file: {e}")

# Bad Example: No exception handling or timeout
response = requests.get(url)
with open("sample.pdf", "wb") as file:
    file.write(response.content)

# Good Example: Validate 'content-length' before saving
response = requests.get(url, stream=True)
content_length = response.headers.get('content-length')
if content_length and len(response.content) == int(content_length):
    with open("validated_file.pdf", "wb") as file:
        file.write(response.content)
else:
    print("Content length mismatch.")

# Bad Example: Ignoring 'content-length' validation
response = requests.get(url)
with open("unvalidated_file.pdf", "wb") as file:
    file.write(response.content)
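
For large files, reading `response.content` just to compare its length defeats the point of streaming. One alternative is to count bytes as they are written and compare the total against the `Content-Length` header afterwards; the filename below is illustrative, and the header may not match the written size if the server compresses the response.

# Streaming variant: count bytes while writing, then compare to Content-Length
response = requests.get(url, stream=True, timeout=10)
expected = response.headers.get('content-length')
written = 0
with open("validated_stream.pdf", "wb") as file:
    for chunk in response.iter_content(chunk_size=8192):
        written += len(chunk)
        file.write(chunk)
if expected and written != int(expected):
    print("Content length mismatch: the download may be incomplete.")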

# Good Example: Using os.path for file paths
import os

filename = os.path.join(os.getcwd(), "downloaded_file.pdf")
response = requests.get(url)
with open(filename, "wb") as file:
    file.write(response.content)

# Bad Example: Hardcoding file paths
response = requests.get(url)
with open("/absolute/path/downloaded_file.pdf", "wb") as file:
    file.write(response.content)

# Good Example: Setting a timeout in requests.get()
try:
    response = requests.get(url, timeout=5) # Timeout after 5 seconds
    with open("timed_file.pdf", "wb") as file:
        file.write(response.content)
except requests.Timeout:
    print("The request timed out.")

# Bad Example: No timeout set
response = requests.get(url)
with open("no_timeout_file.pdf", "wb") as file:
    file.write(response.content)
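
Putting the common-issue fixes together, a small helper can combine the timeout, status check, streaming, and portable path handling shown above. This is a minimal sketch rather than a production utility; the function name, defaults, and filename derivation are illustrative.

import os
import requests

def download_file(url, directory=None, timeout=10):
    """Download url into directory (defaults to the current working directory)."""
    directory = directory or os.getcwd()
    # Naive filename derivation; adjust for URLs with query strings
    filename = os.path.join(directory, os.path.basename(url) or "download.bin")
    try:
        with requests.get(url, stream=True, timeout=timeout) as response:
            response.raise_for_status()
            with open(filename, "wb") as file:
                for chunk in response.iter_content(chunk_size=8192):
                    file.write(chunk)
    except requests.exceptions.RequestException as e:
        print(f"Error downloading file: {e}")
        return None
    return filename

download_file("https://sandbox.oxylabs.io/products/sample.pdf")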
