Ensure the `requests` library is up-to-date to avoid any compatibility issues with websites and to use the latest features and security fixes.
Use `soup.find_all('a', href=True)` to directly filter out `a` tags without an `href` attribute, making the code more concise (see the link-cleaning sketch after the example below).
Validate and clean the extracted URLs so that relative paths and malformed links are handled properly; `urllib.parse.urljoin()` can resolve relative links against the page URL, as shown in the link-cleaning sketch after the example below.
Consider using `requests.Session()` when making multiple requests to the same host, as it can reuse underlying TCP connections, which is more efficient (see the session sketch after the example below).
```python
# Import libraries
import requests
from bs4 import BeautifulSoup

# Define the target URL
url = 'https://sandbox.oxylabs.io/products'

# Send HTTP request
response = requests.get(url)

# Parse HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all anchor tags
anchors = soup.find_all('a')

# Print all href attributes from anchor tags
for link in anchors:
    if link.get('href'):
        print(link.get('href'))

# Alternative method using list comprehension
links = [link.get('href') for link in soup.find_all('a') if link.get('href')]
print(links)
```
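As a refinement of the loop above, `href=True` makes `find_all()` return only anchors that actually carry an `href`, and `urllib.parse.urljoin()` resolves relative paths against the page URL. A minimal link-cleaning sketch, reusing the `url` and `soup` variables from the example above:

```python
from urllib.parse import urljoin, urlparse

# Only anchors that actually have an href attribute are returned
anchors = soup.find_all('a', href=True)

# Resolve relative paths (e.g. '/products/1') against the page URL
absolute_links = [urljoin(url, link['href']) for link in anchors]

# Keep only well-formed http(s) URLs and drop anything else
clean_links = [l for l in absolute_links if urlparse(l).scheme in ('http', 'https')]
print(clean_links)
```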
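When the scraper fetches several pages from the same host, a `requests.Session()` reuses the underlying TCP connection instead of opening a new one per request. A minimal session sketch; the `page` query parameter is a hypothetical pagination example, not taken from the target site:

```python
import requests
from bs4 import BeautifulSoup

base_url = 'https://sandbox.oxylabs.io/products'

# A Session reuses the TCP connection across requests to the same host
with requests.Session() as session:
    for page in range(1, 4):  # hypothetical pagination parameter
        response = session.get(base_url, params={'page': page}, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        print(page, len(soup.find_all('a', href=True)))
```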
Ensure your script handles exceptions like `requests.exceptions.RequestException` to manage potential connectivity issues or HTTP errors gracefully.
Set a user-agent in your request headers to mimic a browser visit, which can help avoid being blocked by some websites that restrict script access.
Utilize `response.raise_for_status()` after your `requests.get()` call to immediately catch HTTP errors and handle them before parsing the HTML.
Incorporate a timeout in your `requests.get()` call to avoid hanging indefinitely if the server does not respond; a sketch combining these four points follows the bad/good snippets below.
```python
# Bad: Not handling exceptions which might occur during the request
response = requests.get(url)

# Good: Handling exceptions to avoid crashes due to connectivity issues
try:
    response = requests.get(url)
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")

# Bad: Not setting a user-agent, might get blocked by websites
response = requests.get(url)

# Good: Setting a user-agent to mimic a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)

# Bad: Not checking for HTTP errors which can lead to parsing invalid content
response = requests.get(url)

# Good: Using raise_for_status to catch HTTP errors
response = requests.get(url)
response.raise_for_status()

# Bad: No timeout, request might hang indefinitely
response = requests.get(url)

# Good: Setting a timeout to avoid hanging indefinitely
response = requests.get(url, timeout=10)
```
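The snippets above show each fix in isolation. Below is a sketch that combines them into a single request helper: exception handling, a browser-like `User-Agent`, `raise_for_status()`, and a timeout. The `fetch_html` name and the 10-second default are arbitrary choices for illustration, not part of the original example:

```python
import requests

# Browser-like headers reused for every request (assumed, not site-specific)
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.3'
}

def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails for any reason."""
    try:
        # Timeout prevents the call from hanging indefinitely
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx responses before parsing them
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # Covers connection errors, timeouts, and HTTP errors alike
        print(f"Error fetching {url}: {e}")
        return None
    return response.text

html = fetch_html('https://sandbox.oxylabs.io/products')
if html is not None:
    print(len(html))
```

Returning `None` keeps the calling code simple; re-raising the exception is an equally reasonable choice if the caller should decide how to recover.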