Best practices

  • Keep the requests library up to date to avoid compatibility issues and to benefit from the latest features and security fixes.

  • Use soup.find_all('a', href=True) to directly filter out <a> tags without an href attribute, making the code more efficient.

  • Validate and clean the extracted URLs to handle relative paths and possible malformed URLs properly.

  • Consider using requests.Session() when making multiple requests to the same host, as it reuses underlying TCP connections, which is more efficient. The sketch after this list combines these practices.

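As a minimal sketch, the points above might combine like this, assuming the goal is collecting absolute links from a single page (the extract_links name, the 10-second timeout, and the example URL are illustrative choices):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def extract_links(url):
    # A Session reuses the underlying TCP connection across
    # requests to the same host.
    with requests.Session() as session:
        response = session.get(url, timeout=10)
        response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute.
    for tag in soup.find_all("a", href=True):
        # urljoin resolves relative paths against the page URL.
        absolute = urljoin(url, tag["href"])
        # Keep only well-formed http(s) URLs.
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            links.append(absolute)
    return links

print(extract_links("https://example.com"))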

Common issues

  • Ensure your script handles exceptions like requests.exceptions.RequestException to manage potential connectivity issues or HTTP errors gracefully.

  • Set a User-Agent in the header of your request to mimic a browser visit, which can help avoid being blocked by some websites that restrict script access.

  • Call response.raise_for_status() after your requests.get() call to immediately catch HTTP errors and handle them before parsing the HTML.

  • Incorporate a timeout in your requests.get() call to avoid hanging indefinitely if the server does not respond. The sketch after this list folds these safeguards into a single helper.

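A minimal sketch of such a request helper might look like this (the fetch_page name, the User-Agent string, and the 10-second timeout are illustrative choices):

import requests

# An illustrative browser-like User-Agent string; any common one works.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def fetch_page(url):
    try:
        # timeout=10 keeps the call from hanging if the server stalls.
        response = requests.get(url, headers=HEADERS, timeout=10)
        # Raises requests.exceptions.HTTPError for 4xx/5xx responses.
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as exc:
        # RequestException covers connection errors, timeouts,
        # and HTTP errors alike.
        print(f"Request failed: {exc}")
        return None

html = fetch_page("https://example.com")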


