Often, web crawling precedes Python web scraping – both techniques are commonly used together. To retrieve data from the web, you first need to select target websites and their URLs. Using web crawling, you can gather target URLs manually or automatically at scale.
Web scraping techniques are then used to extract data from crawled target URLs. The main difference between web crawling and web scraping:
A Python crawler discovers all relevant product pages across different websites.
The scraper subsequently extracts specific details like product names, prices, and descriptions from the discovered pages.
Web crawlers follow links from page to page, mapping out website structures and the connections between pages. Search engines use web crawlers (also called spiders or bots) to continuously discover and update their web index. The focus is on finding and navigating between web pages.
Let’s create a simple web crawler in Python to extract video game product URLs from an e-commerce marketplace. Below is a comprehensive guide.
Ensure you have Python installed on your system. Let’s use the following libraries:
requests: To perform HTTP requests.
BeautifulSoup: To parse HTML page content.
csv and json: To save data in desired formats.
time: To manage crawl delays.
urllib.parse: To parse URLs into components.
Install requests and beautifulsoup4 using pip; the rest are part of Python's standard library:
pip install requests beautifulsoup4
Import the necessary libraries:
import csv
import json
import time
from urllib.parse import urlparse, urljoin
import requests
from bs4 import BeautifulSoup
To avoid web crawling the same URL multiple times, maintain a set of visited URLs. Before processing a new URL, check if it's already in the set:
visited_urls = set() # Store unique URLs.
Each URL will be stored in visited_urls to ensure you don't process it multiple times.
Ensure the crawler only processes URLs within the target domain, sandbox.oxylabs.io, and skips external links. You can do this by checking the URL's domain before adding it to the crawl queue.
def is_valid_url(url, base_domain="sandbox.oxylabs.io"):
    """Check if the URL belongs to the target domain."""
    parsed_url = urlparse(url)
    return parsed_url.netloc == base_domain
The website has multiple product web pages. To navigate through paginated content, identify the pagination controls and build the URLs for subsequent pages: ?page=1, ?page=2, and so on.
Loop through pages until no more products are found:
BASE_URL = "https://sandbox.oxylabs.io/products?page="
def get_product_links(page):
    """Extract product links from a given page."""
    url = f"{BASE_URL}{page}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/133.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to fetch {url} (Status Code: {response.status_code})")
        return []

    soup = BeautifulSoup(response.text, "html.parser")
    product_links = []

    for link in soup.select("a.card-header"):
        product_url = urljoin("https://sandbox.oxylabs.io", link.get("href"))
        if is_valid_url(product_url) and product_url not in visited_urls:
            visited_urls.add(product_url)
            product_links.append(product_url)

    return product_links
A User-Agent HTTP header is included to resemble a web browser.
To prevent overwhelming the server and to mimic human browsing behavior, let’s introduce delays between HTTP requests:
def crawl_all_pages(start_page=1, max_pages=10, delay=2):
    """Crawl multiple pages and extract product URLs."""
    all_links = []

    for page in range(start_page, max_pages + 1):
        print(f"Crawling page {page}...")
        links = get_product_links(page)
        if not links:
            print("No more products found. Stopping.")
            break

        all_links.extend(links)
        time.sleep(delay)  # Delay to avoid overloading the server.

    return all_links
After extracting the desired data, save it in CSV or JSON format for storage.
def save_to_csv(data, filename="product_urls.csv"):
    """Save URLs to a CSV file."""
    with open(filename, "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["Product URL"])
        for url in data:
            writer.writerow([url])
    print(f"Saved to {filename}")


def save_to_json(data, filename="product_urls.json"):
    """Save URLs to a JSON file."""
    with open(filename, "w") as file:
        json.dump(data, file, indent=4)
    print(f"Saved to {filename}")
Now, execute the script to crawl and save URLs:
if __name__ == "__main__":
    # Modify the start page, default page limit, and delay as needed.
    product_urls = crawl_all_pages(max_pages=5, delay=3)
    save_to_csv(product_urls)
    save_to_json(product_urls)
Here’s the full Python script:
import csv
import json
import time
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://sandbox.oxylabs.io/products?page="

# Store visited URLs.
visited_urls = set()


def is_valid_url(url, base_domain="sandbox.oxylabs.io"):
    """Check if the URL belongs to the target domain."""
    parsed_url = urlparse(url)
    return parsed_url.netloc == base_domain


def get_product_links(page):
    """Extract product links from a given page."""
    url = f"{BASE_URL}{page}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/133.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to fetch {url} (Status Code: {response.status_code})")
        return []

    soup = BeautifulSoup(response.text, "html.parser")
    product_links = []

    for link in soup.select("a.card-header"):
        product_url = urljoin("https://sandbox.oxylabs.io", link.get("href"))
        if is_valid_url(product_url) and product_url not in visited_urls:
            visited_urls.add(product_url)
            product_links.append(product_url)

    return product_links


def crawl_all_pages(start_page=1, max_pages=10, delay=2):
    """Crawl multiple pages and extract product URLs."""
    all_links = []

    for page in range(start_page, max_pages + 1):
        print(f"Crawling page {page}...")
        links = get_product_links(page)
        if not links:
            print("No more products found. Stopping.")
            break

        all_links.extend(links)
        time.sleep(delay)  # Delay to avoid overloading the server.

    return all_links


def save_to_csv(data, filename="product_urls.csv"):
    """Save URLs to a CSV file."""
    with open(filename, "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["Product URL"])
        for url in data:
            writer.writerow([url])
    print(f"Saved to {filename}")


def save_to_json(data, filename="product_urls.json"):
    """Save URLs to a JSON file."""
    with open(filename, "w") as file:
        json.dump(data, file, indent=4)
    print(f"Saved to {filename}")


if __name__ == "__main__":
    # Modify the start page, default page limit, and delay as needed.
    product_urls = crawl_all_pages(max_pages=5, delay=3)
    save_to_csv(product_urls)
    save_to_json(product_urls)
The web page in question is static and doesn’t require JavaScript rendering. Therefore, requests and BeautifulSoup are enough for web crawling. Otherwise, tools like Playwright or Selenium would help with dynamic content extraction.
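If a target does rely on JavaScript, a headless browser can take the place of requests. Here's a minimal sketch using Playwright's sync API, assuming the same sandbox.oxylabs.io listing pages and a.card-header selector as above (install with pip install playwright, then run playwright install):
from playwright.sync_api import sync_playwright

def get_rendered_links(page_number):
    """Render a listing page in a headless browser and collect product links."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(f"https://sandbox.oxylabs.io/products?page={page_number}")
        # Wait until the product cards appear in the rendered DOM.
        page.wait_for_selector("a.card-header")
        links = [
            a.get_attribute("href")
            for a in page.query_selector_all("a.card-header")
        ]
        browser.close()
    return links

print(get_rendered_links(1))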
Automated web crawlers often encounter blocks from websites that employ anti-bot mechanisms. Common blocking methods include IP-based rate limiting, CAPTCHAs, user-agent detection, and behavior pattern recognition.
To bypass these restrictions, rotating proxies are essential. Rotating IPs distribute HTTP requests across different IP addresses, making your web crawling activity appear more like organic user traffic.
You can use services that provide residential and datacenter IPs. When integrating proxies with Python requests, it's important to set appropriate delays between HTTP requests and handle proxy failures in your code.
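As a rough illustration, here's how a proxy could be wired into the requests calls from the tutorial, with a simple retry on failure. The proxy address and credentials are placeholders; substitute the values provided by your proxy service:
import time

import requests

# Placeholder proxy endpoint - replace with your provider's address and credentials.
PROXIES = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

def fetch_with_proxy(url, retries=3, delay=2):
    """Fetch a URL through the proxy, retrying on connection or proxy errors."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, proxies=PROXIES, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as error:
            print(f"Attempt {attempt} failed: {error}")
            time.sleep(delay)  # Back off before retrying.
    return None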
Free proxies can work for smaller tasks. However, maintaining a quality proxy pool is expensive, so the wisest approach is to look for a paid proxy service provider that also offers some free proxies.
Alternatively, to avoid the hassle of setting up elaborate Python logic, third-party tools can help. With only minimal coding experience, you can crawl any website at a large scale without worrying about blocks using Oxylabs' Web Crawler, a feature of our Web Scraper API.
You can test the Web Crawler’s functionality with a one-week free trial of Web Scraper API.
To use it, start by creating an API user through the Oxylabs dashboard and obtain your credentials.
Configure your web crawling job by specifying the starting URL, setting filters to control the crawl scope, and defining scraping parameters like geo-location and JavaScript rendering.
Submit your configuration via a POST request to the job initiation endpoint.
Once the crawl is complete, retrieve data in a preferred format – sitemap, HTML files, or parsed data – either directly or via automatic upload to your cloud storage.
Web Crawler's workflow
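For orientation, submitting a job could look roughly like the snippet below. Treat the endpoint URL and payload fields as illustrative assumptions rather than the exact Web Crawler schema, and check the Oxylabs documentation for the current field names:
import requests

# Illustrative sketch only: the endpoint and payload fields below are assumptions,
# not the documented schema - consult the Oxylabs docs for the exact job format.
payload = {
    "url": "https://sandbox.oxylabs.io/products",  # Starting URL for the crawl.
    "filters": {"crawl": [".*"], "process": [".*"], "max_depth": 1},
    "output": {"type_": "sitemap"},
}

response = requests.post(
    "https://ect.oxylabs.io/v1/jobs",  # Assumed job initiation endpoint.
    auth=("USERNAME", "PASSWORD"),  # Your API user credentials.
    json=payload,
    timeout=30,
)
print(response.json())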
Also, check some of the best web crawlers, including code-free options with visual interfaces suitable for non-technical users.
Many common issues with web crawling can be addressed by using and properly managing proxies. Here are some tips for sending successful web crawling requests.
User-Agent identification: Use a descriptive User-Agent string to identify your crawler.
Rate limiting: Implement delays between requests to prevent overwhelming servers and reduce IP blocking risk.
Concurrency control: Limit the number of simultaneous requests to a domain to maintain server stability.
JavaScript rendering: To process web pages that rely on JavaScript for content loading, use libraries like requests_html or headless web browsers.
URL filtering: Implement mechanisms to detect and avoid infinite loops or excessively deep directory structures that can trap crawlers.
Asynchronous requests: Employ asynchronous libraries like aiohttp to handle multiple requests concurrently (see the sketch below).
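To illustrate rate limiting, concurrency control, and asynchronous requests together, here's a rough aiohttp sketch that fetches several listing pages concurrently while a semaphore caps simultaneous requests to the domain. It reuses the sandbox.oxylabs.io URL pattern from the tutorial; install aiohttp with pip install aiohttp:
import asyncio

import aiohttp

BASE_URL = "https://sandbox.oxylabs.io/products?page="

async def fetch(session, semaphore, page):
    """Fetch one listing page, limited by the shared semaphore."""
    async with semaphore:
        async with session.get(f"{BASE_URL}{page}") as response:
            html = await response.text()
        await asyncio.sleep(1)  # Polite delay even when running concurrently.
        return page, len(html)

async def main():
    semaphore = asyncio.Semaphore(3)  # At most 3 requests to the domain at once.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, page) for page in range(1, 6)]
        for page, size in await asyncio.gather(*tasks):
            print(f"Page {page}: {size} bytes")

asyncio.run(main())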
To implement the best web crawling practices and make your script as durable as possible, you'll inevitably need some advanced Python web scraping techniques. However, for small tasks, you can definitely get by using only the basics.
A Python crawler is a script that browses and downloads web pages, typically to extract data or index content. It works by starting from a seed URL, downloading the web page content, extracting links to other web pages, and then recursively visiting those links according to defined rules.
Web crawlers can be used for data mining, content monitoring, search engine indexing, or archiving websites. Common Python libraries for web crawling are Beautiful Soup, requests, Scrapy, and Selenium.
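As a compact illustration of that seed-URL loop, here's a minimal breadth-first sketch with a depth limit, built on the same requests and Beautiful Soup stack used above. The seed URL, depth limit, and same-domain check are all adjustable assumptions:
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=2):
    """Breadth-first crawl from a seed URL, staying on the seed's domain."""
    domain = urlparse(seed_url).netloc
    visited = set()
    queue = [(seed_url, 0)]  # Each entry is (URL, depth from the seed).

    while queue:
        url, depth = queue.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if urlparse(next_url).netloc == domain:
                queue.append((next_url, depth + 1))

    return visited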
Yes, Python is excellent for web scraping and web crawling. It offers powerful libraries and frameworks like BeautifulSoup and Scrapy that make it easy to extract data from websites.
These tools, combined with Python's simple syntax and extensive request-handling capabilities, make it the most popular choice for web scraping and web crawling. The language also supports different data formats and has built-in features for processing scraped data.
Yes, web crawling is legal if you crawl publicly available data and adhere to regulations. The legality depends on the web crawling methods and intended use. It's recommended to consult legal professionals to ensure compliance with applicable laws.