Best practices

  • Choose a parser suited to your HTML, such as 'html.parser' or 'lxml', when constructing BeautifulSoup; depending on the complexity of the markup, the wrong parser can cause parsing issues (see the short sketch after this list).

  • Always check the HTTP response status code to confirm the webpage was actually retrieved before parsing it.

  • Use CSS selectors when you need to target links with specific attributes or within certain elements, as they offer more flexible querying.

  • When extracting href values, validate and sanitize the URLs to avoid security risks or errors in data handling (a sketch follows the code example below).
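
As a quick illustration of the parser advice, here's a minimal sketch that prefers 'lxml' when it's installed and falls back to the built-in 'html.parser'; probing availability with an import attempt is just one convenient approach, not a requirement.

from bs4 import BeautifulSoup

# Prefer the faster, more lenient 'lxml' parser when it's installed;
# otherwise fall back to the standard library's 'html.parser'.
try:
    import lxml  # noqa: F401 - imported only to check availability
    parser = 'lxml'
except ImportError:
    parser = 'html.parser'

soup = BeautifulSoup('<a href="/products">Products</a>', parser)
print(soup.a['href'])  # /products

With a parser chosen, the full example below fetches a page and extracts href values in three different ways.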

from bs4 import BeautifulSoup
import requests


# Fetch the webpage and fail fast if it isn't accessible
response = requests.get('https://sandbox.oxylabs.io/products')
response.raise_for_status()

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Method 1: Find all 'a' tags and extract href
for link in soup.find_all('a'):
    print(link.get('href'))
print()

# Method 2: Use CSS selectors to get hrefs
for link in soup.select('a[href]'):
    print(link['href'])
print()

# Method 3: Extract hrefs from specific class
for link in soup.find_all('a', class_='card-header'):
    print(link.get('href'))
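
Extracted href values are often relative paths, fragments, or javascript: links rather than clean absolute URLs. The sketch below shows one way to apply the validation advice from the best practices: resolve each href against the page URL and keep only http(s) links. The scheme allow-list is an assumption to adapt to your needs.

from urllib.parse import urljoin, urlparse

base_url = 'https://sandbox.oxylabs.io/products'

def clean_href(href):
    # Resolve relative paths against the page URL
    absolute = urljoin(base_url, href)
    # Keep only web URLs; drops mailto:, javascript:, and other schemes
    if urlparse(absolute).scheme in ('http', 'https'):
        return absolute
    return None

for link in soup.select('a[href]'):
    url = clean_href(link['href'])
    if url:
        print(url)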

Common issues

  • Make sure the requests and beautifulsoup4 packages are installed and up to date to avoid import and compatibility issues.

  • Handle exceptions for network errors and invalid URLs so your script doesn't crash on transient failures (see the sketch after this list).

  • If the href attribute is missing from some <a> tags, guard against it before processing: link['href'] raises a KeyError when the attribute is absent, and link.get('href') returns None, which causes an AttributeError if you call string methods on it unchecked.

  • Consider using a session object from requests for better performance when making multiple requests to the same host.
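
Here's a minimal sketch of the exception handling mentioned above; the 10-second timeout and the catch-and-print policy are assumptions, so adjust them to your needs.

import requests
from bs4 import BeautifulSoup

url = 'https://sandbox.oxylabs.io/products'
try:
    # A timeout keeps the request from hanging indefinitely
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as error:
    # Base class covering connection errors, timeouts, invalid URLs,
    # and (via raise_for_status) HTTP error status codes
    print(f'Request failed: {error}')
else:
    soup = BeautifulSoup(response.text, 'html.parser')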

# Incorrect: Assuming every 'a' tag has an href attribute
for link in soup.find_all('a'):
    print(link['href'])

# Correct: Check if the href attribute exists before accessing it
for link in soup.find_all('a'):
    if link.has_attr('href'):
        print(link['href'])
    else:
        print("No href attribute found")

# Slower method: Using multiple individual requests for the same host
for url in [
    "https://sandbox.oxylabs.io/products?page=1",
    "https://sandbox.oxylabs.io/products?page=2"
]:
    response = requests.get(url)

# Faster method: Use a session object for repeated requests to the same host
with requests.Session() as session:
    for url in [
        "https://sandbox.oxylabs.io/products?page=1",
        "https://sandbox.oxylabs.io/products?page=2"
    ]:
        response = session.get(url)
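
The session variant is faster because requests.Session reuses the underlying TCP connection across requests to the same host (connection pooling with keep-alive), rather than opening a new connection for every URL.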
