Keep the requests library up to date to avoid compatibility issues with websites and to benefit from the latest features and security fixes.
Use soup.find_all('a', href=True) to filter out <a> tags without an href attribute at the parsing stage, so the rest of the code doesn't need to check each tag for a missing href.
Validate and clean the extracted URLs so that relative paths are resolved against the page URL and malformed links are discarded (see the URL-cleaning sketch after the code below).
Consider using requests.Session() when making multiple requests to the same host; it reuses the underlying TCP connection, which is more efficient (a session sketch also follows the code below).
# pip install -U requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# Define the target URL
url = 'https://sandbox.oxylabs.io/products'

# Send the HTTP request and parse the HTML content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all anchor tags that have an href attribute
anchors = soup.find_all('a', href=True)

# Print the href attribute of every anchor tag
for link in anchors:
    print(link['href'])

# Alternative method using a list comprehension
links = [link['href'] for link in anchors]
print(links)
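The hrefs printed above are often relative paths (for example, /products/1), so cleaning usually means resolving each href against the page URL and keeping only well-formed http(s) links. The following is a minimal sketch that assumes the url and anchors variables from the script above; the exact filtering rules are illustrative and can be adapted.

from urllib.parse import urljoin, urlparse

# Resolve relative hrefs against the page URL and keep only well-formed http(s) links.
# Assumes `url` and `anchors` come from the script above.
clean_links = []
for link in anchors:
    absolute = urljoin(url, link['href'])  # e.g. '/products/1' -> 'https://sandbox.oxylabs.io/products/1'
    parsed = urlparse(absolute)
    if parsed.scheme in ('http', 'https') and parsed.netloc:
        clean_links.append(absolute)

print(clean_links)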
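When the same host is fetched repeatedly, requests.Session() lets consecutive calls reuse the underlying TCP connection instead of opening a new one each time. A minimal sketch follows; the base_url and paths list are hypothetical examples, while the Session usage itself is the standard requests API.

import requests
from bs4 import BeautifulSoup

# Reuse one session (and its TCP connection) for several requests to the same host.
# The paths below are illustrative; substitute the pages you actually need.
base_url = 'https://sandbox.oxylabs.io'
paths = ['/products', '/products?page=2']

with requests.Session() as session:
    session.headers.update({'User-Agent': 'Mozilla/5.0'})
    for path in paths:
        response = session.get(base_url + path, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        print(path, len(soup.find_all('a', href=True)))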
Ensure your script handles exceptions like requests.exceptions.RequestException to manage potential connectivity issues or HTTP errors gracefully.
Set a User-Agent header on your request to mimic a browser visit; this can help avoid being blocked by websites that restrict access from scripts.
Utilize response.raise_for_status() after your requests.get() call to immediately catch HTTP errors and handle them before parsing the HTML.
Incorporate a timeout in your requests.get() call to avoid hanging indefinitely if the server does not respond.
# Bad: not handling exceptions that might occur during the request
response = requests.get(url)

# Good: handling exceptions to avoid crashes due to connectivity issues
try:
    response = requests.get(url)
except requests.exceptions.RequestException as e:
    print(f'Error: {e}')

# Bad: not setting a User-Agent, so the request might get blocked by some websites
response = requests.get(url)

# Good: setting a User-Agent to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)

# Bad: not checking for HTTP errors, which can lead to parsing invalid content
response = requests.get(url)

# Good: using raise_for_status() to catch HTTP errors
response = requests.get(url)
response.raise_for_status()

# Bad: no timeout, so the request might hang indefinitely
response = requests.get(url)

# Good: setting a timeout to avoid hanging indefinitely
response = requests.get(url, timeout=10)