Best practices

  • Choose a parser suited to your HTML, such as 'html.parser' or 'lxml', when constructing BeautifulSoup; depending on the complexity of the markup, the wrong parser can cause parsing issues (see the short sketch after this list).

  • Always check the HTTP response status code to confirm the webpage was actually retrieved before parsing it.

  • Use CSS selectors when you need to target links with specific attributes or within certain elements, as they offer more flexible querying.

  • When extracting href values, validate and sanitize the URLs to avoid security risks or errors in data handling (a sketch follows the code example below).
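
As a quick illustration of the parser advice, here's a minimal sketch that prefers 'lxml' when it's installed and falls back to the built-in 'html.parser'; probing availability with an import attempt is just one convenient approach, not a requirement.

from bs4 import BeautifulSoup

# Prefer the faster, more lenient 'lxml' parser when it's installed;
# otherwise fall back to the standard library's 'html.parser'.
try:
    import lxml  # noqa: F401 - imported only to check availability
    parser = 'lxml'
except ImportError:
    parser = 'html.parser'

soup = BeautifulSoup('<a href="/products">Products</a>', parser)
print(soup.a['href'])  # /products

With a parser chosen, the full example below fetches a page and extracts href values in three different ways.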

from bs4 import BeautifulSoup
import requests


# Fetch the webpage and fail fast if it isn't accessible
response = requests.get('https://sandbox.oxylabs.io/products')
response.raise_for_status()

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Method 1: Find all 'a' tags and extract href
for link in soup.find_all('a'):
    print(link.get('href'))
print()

# Method 2: Use CSS selectors to get hrefs
for link in soup.select('a[href]'):
    print(link['href'])
print()

# Method 3: Extract hrefs from specific class
for link in soup.find_all('a', class_='card-header'):
    print(link.get('href'))
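
Extracted href values are often relative paths, fragments, or javascript: links rather than clean absolute URLs. The sketch below shows one way to apply the validation advice from the best practices: resolve each href against the page URL and keep only http(s) links. The scheme allow-list is an assumption to adapt to your needs.

from urllib.parse import urljoin, urlparse

base_url = 'https://sandbox.oxylabs.io/products'

def clean_href(href):
    # Resolve relative paths against the page URL
    absolute = urljoin(base_url, href)
    # Keep only web URLs; drops mailto:, javascript:, and other schemes
    if urlparse(absolute).scheme in ('http', 'https'):
        return absolute
    return None

for link in soup.select('a[href]'):
    url = clean_href(link['href'])
    if url:
        print(url)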

Common issues

  • Make sure the requests and beautifulsoup4 packages are installed and up to date to avoid import and compatibility issues.

  • Handle exceptions for network errors and invalid URLs so your script doesn't crash on transient failures (see the sketch after this list).

  • If the href attribute is missing from some <a> tags, guard against it before processing: link['href'] raises a KeyError when the attribute is absent, and link.get('href') returns None, which causes an AttributeError if you call string methods on it unchecked.

  • Consider using a session object from requests for better performance when making multiple requests to the same host.
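
Here's a minimal sketch of the exception handling mentioned above; the 10-second timeout and the catch-and-print policy are assumptions, so adjust them to your needs.

import requests
from bs4 import BeautifulSoup

url = 'https://sandbox.oxylabs.io/products'
try:
    # A timeout keeps the request from hanging indefinitely
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as error:
    # Base class covering connection errors, timeouts, invalid URLs,
    # and (via raise_for_status) HTTP error status codes
    print(f'Request failed: {error}')
else:
    soup = BeautifulSoup(response.text, 'html.parser')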

# Incorrect: Assuming every 'a' tag has an href attribute
for link in soup.find_all('a'):
    print(link['href'])

# Correct: Check if the href attribute exists before accessing it
for link in soup.find_all('a'):
    if link.has_attr('href'):
        print(link['href'])
    else:
        print("No href attribute found")

# Slower method: Using multiple individual requests for the same host
for url in [
    "https://sandbox.oxylabs.io/products?page=1",
    "https://sandbox.oxylabs.io/products?page=2"
]:
    response = requests.get(url)

# Faster method: Use a session object for repeated requests to the same host
with requests.Session() as session:
    for url in [
        "https://sandbox.oxylabs.io/products?page=1",
        "https://sandbox.oxylabs.io/products?page=2"
    ]:
        response = session.get(url)
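
The session variant is faster because requests.Session reuses the underlying TCP connection across requests to the same host (connection pooling with keep-alive), rather than opening a new connection for every URL.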
