Best practices

  • Keep the requests library up to date to avoid compatibility issues with websites and to benefit from the latest features and security fixes.

  • Use soup.find_all('a', href=True) to directly filter out <a> tags without an href attribute, making the code more efficient.

  • Validate and clean the extracted URLs so that relative paths and malformed values are handled properly (see the urljoin() sketch after the example below).

  • Consider using requests.Session() when making multiple requests to the same host, as it reuses the underlying TCP connection, which is more efficient (a short sketch also follows the example below).

# pip install -U requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# Define the target URL
url = 'https://sandbox.oxylabs.io/products'

# Send HTTP request and then parse HTML content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all anchor tags
anchors = soup.find_all('a', href=True)

# Print all href attributes from anchor tags
for link in anchors:
    print(link['href'])

# Alternative method using a list comprehension
links = [link['href'] for link in anchors]
print(links)
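
The example above prints each href exactly as it appears in the HTML, so relative links come out without a scheme or host. The following sketch shows one way to apply the URL clean-up suggested above, using urllib.parse.urljoin() from the standard library to resolve every href against the page URL; it assumes the url and anchors variables from the script above.

# Resolve relative hrefs into absolute URLs
from urllib.parse import urljoin

# urljoin() leaves absolute URLs untouched and resolves relative
# paths against the base page URL
absolute_links = [urljoin(url, link['href']) for link in anchors]
print(absolute_links)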
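
And here is a minimal sketch of the requests.Session() suggestion; the second page URL is a hypothetical example added purely to show several requests going to the same host.

# Reuse one TCP connection for several requests to the same host
import requests
from bs4 import BeautifulSoup

urls = [
    'https://sandbox.oxylabs.io/products',
    'https://sandbox.oxylabs.io/products?page=2',  # hypothetical second page
]

with requests.Session() as session:
    for page_url in urls:
        page = session.get(page_url, timeout=10)
        page_soup = BeautifulSoup(page.text, 'html.parser')
        print(page_url, len(page_soup.find_all('a', href=True)), 'links')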

Common issues

  • Ensure your script handles exceptions like requests.exceptions.RequestException to manage potential connectivity issues or HTTP errors gracefully.

  • Set a User-Agent header in your request to mimic a browser visit, which can help you avoid being blocked by websites that restrict script access.

  • Utilize response.raise_for_status() after your requests.get() call to immediately catch HTTP errors and handle them before parsing the HTML.

  • Incorporate a timeout in your requests.get() call to avoid hanging indefinitely if the server does not respond.

# Bad: Not handling exceptions which might occur during the request
response = requests.get(url)

# Good: Handling exceptions to avoid crashes due to connectivity issues
try:
    response = requests.get(url)
except requests.exceptions.RequestException as e:
    print(f'Error: {e}')


# Bad: Not setting a user-agent, might get blocked by websites
response = requests.get(url)

# Good: Setting a user-agent to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)


# Bad: Not checking for HTTP errors which can lead to parsing invalid content
response = requests.get(url)

# Good: Using raise_for_status to catch HTTP errors
response = requests.get(url)
response.raise_for_status()


# Bad: No timeout, request might hang indefinitely
response = requests.get(url)

# Good: Setting a timeout to avoid hanging indefinitely
response = requests.get(url, timeout=10)
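
Putting the four fixes together, a single request could look like the sketch below; it reuses the url, headers, and imports defined earlier and is just one reasonable way to combine them.

# Combined: custom User-Agent, timeout, HTTP error check, exception handling
try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
except requests.exceptions.RequestException as e:
    print(f'Request failed: {e}')
else:
    soup = BeautifulSoup(response.text, 'html.parser')
    print(len(soup.find_all('a', href=True)), 'links found')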


Useful resources

  • Automated Web Scraper With Python AutoScraper [Guide] (Roberta Aukstikalnyte, 2024-05-06)

  • 15 Tips on How to Crawl a Website Without Getting Blocked (Adelina Kiskyte, 2024-03-15)

  • Guide to Using Google Sheets for Basic Web Scraping (Vytenis Kaubrė, 2022-11-08)
