Ensure the requests library is up-to-date to avoid any compatibility issues with websites and to use the latest features and security fixes.
Use soup.find_all('a', href=True) to directly filter out <a> tags without an href attribute, making the code more efficient.
Validate and clean the extracted URLs to handle relative paths and potentially malformed URLs properly (see the urljoin sketch after this list).
Consider using requests.Session() when making multiple requests to the same host, as it reuses underlying TCP connections, which is more efficient (see the Session sketch after this list).
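For instance, urljoin() from Python's standard urllib.parse module resolves relative hrefs against the page URL, and a quick urlparse() check can drop non-web schemes. A minimal sketch, using a hypothetical list of extracted hrefs:

from urllib.parse import urljoin, urlparse

base_url = 'https://sandbox.oxylabs.io/products'
hrefs = ['/products/1', 'mailto:sales@example.com', 'https://example.com/page']  # hypothetical sample values

cleaned = []
for href in hrefs:
    absolute = urljoin(base_url, href)  # resolve relative paths against the page URL
    parsed = urlparse(absolute)
    if parsed.scheme in ('http', 'https') and parsed.netloc:  # drop mailto:, javascript:, etc.
        cleaned.append(absolute)

print(cleaned)  # ['https://sandbox.oxylabs.io/products/1', 'https://example.com/page']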
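Similarly, a minimal sketch of connection reuse with requests.Session(); the second request here targets a hypothetical paginated URL on the same host:

import requests

with requests.Session() as session:  # reuses the underlying TCP connection across requests
    first = session.get('https://sandbox.oxylabs.io/products')
    second = session.get('https://sandbox.oxylabs.io/products?page=2')  # hypothetical second page
    print(first.status_code, second.status_code)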
# pip install -U requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
# Define the target URL
url = 'https://sandbox.oxylabs.io/products'
# Send HTTP request and then parse HTML content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract all anchor tags
anchors = soup.find_all('a', href=True)
# Print all href attributes from anchor tags
for link in anchors:
    print(link['href'])  # href is guaranteed to exist because of href=True
# Alternative method using list comprehension
links = [link['href'] for link in anchors]
print(links)

Ensure your script handles exceptions like requests.exceptions.RequestException to manage potential connectivity issues or HTTP errors gracefully.
Set a User-Agent in the header of your request to mimic a browser visit, which can help avoid being blocked by some websites that restrict script access.
Utilize response.raise_for_status() after your requests.get() call to immediately catch HTTP errors and handle them before parsing the HTML.
Incorporate a timeout in your requests.get() call to avoid hanging indefinitely if the server does not respond.
# Bad: Not handling exceptions which might occur during the request
response = requests.get(url)
# Good: Handling exceptions to avoid crashes due to connectivity issues
try:
    response = requests.get(url)
except requests.exceptions.RequestException as e:
    print(f'Error: {e}')
# Bad: Not setting a user-agent, might get blocked by websites
response = requests.get(url)
# Good: Setting a user-agent to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
# Bad: Not checking for HTTP errors which can lead to parsing invalid content
response = requests.get(url)
# Good: Using raise_for_status to catch HTTP errors
response = requests.get(url)
response.raise_for_status()
# Bad: No timeout, request might hang indefinitely
response = requests.get(url)
# Good: Setting a timeout to avoid hanging indefinitely
response = requests.get(url, timeout=10)
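Putting these practices together, here is a minimal sketch of the earlier script hardened with the User-Agent header, timeout, raise_for_status(), and exception handling shown above:

import requests
from bs4 import BeautifulSoup

url = 'https://sandbox.oxylabs.io/products'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.3'
}

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # surface HTTP errors before parsing
except requests.exceptions.RequestException as e:
    print(f'Error: {e}')
else:
    soup = BeautifulSoup(response.text, 'html.parser')
    links = [link['href'] for link in soup.find_all('a', href=True)]
    print(links)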