Often, web crawling precedes Python web scraping – both techniques are commonly used together. To retrieve data from the web, you first need to select target websites and their URLs. Using web crawling, you can gather target URLs manually or automatically at scale.
Web scraping techniques are then used to extract data from crawled target URLs. The main difference between web crawling and web scraping:
A Python crawler discovers all relevant product pages across different websites.
The scraper subsequently extracts specific details like product names, prices, and descriptions from the discovered pages.
Web crawlers follow links from page to page, mapping out website structures and the connections between pages. Search engines use web crawlers (also called spiders or bots) to continuously discover and update their web index. The focus is on finding and navigating between web pages.
Let’s create a simple web crawler in Python to extract video game product URLs from an e-commerce marketplace. Below is a comprehensive guide.
Ensure you have Python installed on your system. Let’s use the following libraries:
requests: To perform HTTP requests.
BeautifulSoup: To parse HTML page content.
csv and json: To save data in desired formats.
time: To manage crawl delays.
urllib.parse: To parse URLs into components.
Install requests and beautifulsoup4 using pip; the rest are part of Python's standard library:
pip install requests beautifulsoup4
Import the necessary libraries:
import csv
import json
import time
from urllib.parse import urlparse, urljoin
import requests
from bs4 import BeautifulSoup
To avoid web crawling the same URL multiple times, maintain a set of visited URLs. Before processing a new URL, check if it's already in the set:
visited_urls = set() # Store unique URLs.
Each URL will be stored in visited_urls to ensure you don't process it multiple times.
Ensure the crawler only processes URLs within the target domain, sandbox.oxylabs.io, and skips external links. You can do this by checking the URL's domain before adding it to the crawl queue.
def is_valid_url(url, base_domain="sandbox.oxylabs.io"):
    """Check if the URL belongs to the target domain."""
    parsed_url = urlparse(url)
    return parsed_url.netloc == base_domain
The website has multiple product web pages. To navigate through paginated content, identify the pagination controls and build the URLs for subsequent pages: ?page=1, ?page=2, and so on.
Loop through pages until no more products are found:
BASE_URL = "https://sandbox.oxylabs.io/products?page="
def get_product_links(page):
    """Extract product links from a given page."""
    url = f"{BASE_URL}{page}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/133.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to fetch {url} (Status Code: {response.status_code})")
        return []

    soup = BeautifulSoup(response.text, "html.parser")
    product_links = []

    for link in soup.select("a.card-header"):
        product_url = urljoin("https://sandbox.oxylabs.io", link.get("href"))
        if is_valid_url(product_url) and product_url not in visited_urls:
            visited_urls.add(product_url)
            product_links.append(product_url)

    return product_links
A User-Agent HTTP header is included to resemble a web browser.
To prevent overwhelming the server and to mimic human browsing behavior, let’s introduce delays between HTTP requests:
def crawl_all_pages(start_page=1, max_pages=10, delay=2):
    """Crawl multiple pages and extract product URLs."""
    all_links = []

    for page in range(start_page, max_pages + 1):
        print(f"Crawling page {page}...")
        links = get_product_links(page)
        if not links:
            print("No more products found. Stopping.")
            break

        all_links.extend(links)
        time.sleep(delay)  # Delay to avoid overloading the server.

    return all_links
After extracting the desired data, save it in CSV or JSON format for storage.
def save_to_csv(data, filename="product_urls.csv"):
    """Save URLs to a CSV file."""
    with open(filename, "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["Product URL"])
        for url in data:
            writer.writerow([url])
    print(f"Saved to {filename}")


def save_to_json(data, filename="product_urls.json"):
    """Save URLs to a JSON file."""
    with open(filename, "w") as file:
        json.dump(data, file, indent=4)
    print(f"Saved to {filename}")
Now, execute the script to crawl and save URLs:
if __name__ == "__main__":
    # Modify the start page, default page limit, and delay as needed.
    product_urls = crawl_all_pages(max_pages=5, delay=3)
    save_to_csv(product_urls)
    save_to_json(product_urls)
Here’s the full Python script:
import csv
import json
import time
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://sandbox.oxylabs.io/products?page="

# Store visited URLs.
visited_urls = set()


def is_valid_url(url, base_domain="sandbox.oxylabs.io"):
    """Check if the URL belongs to the target domain."""
    parsed_url = urlparse(url)
    return parsed_url.netloc == base_domain


def get_product_links(page):
    """Extract product links from a given page."""
    url = f"{BASE_URL}{page}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/133.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to fetch {url} (Status Code: {response.status_code})")
        return []

    soup = BeautifulSoup(response.text, "html.parser")
    product_links = []

    for link in soup.select("a.card-header"):
        product_url = urljoin("https://sandbox.oxylabs.io", link.get("href"))
        if is_valid_url(product_url) and product_url not in visited_urls:
            visited_urls.add(product_url)
            product_links.append(product_url)

    return product_links


def crawl_all_pages(start_page=1, max_pages=10, delay=2):
    """Crawl multiple pages and extract product URLs."""
    all_links = []

    for page in range(start_page, max_pages + 1):
        print(f"Crawling page {page}...")
        links = get_product_links(page)
        if not links:
            print("No more products found. Stopping.")
            break

        all_links.extend(links)
        time.sleep(delay)  # Delay to avoid overloading the server.

    return all_links


def save_to_csv(data, filename="product_urls.csv"):
    """Save URLs to a CSV file."""
    with open(filename, "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["Product URL"])
        for url in data:
            writer.writerow([url])
    print(f"Saved to {filename}")


def save_to_json(data, filename="product_urls.json"):
    """Save URLs to a JSON file."""
    with open(filename, "w") as file:
        json.dump(data, file, indent=4)
    print(f"Saved to {filename}")


if __name__ == "__main__":
    # Modify the start page, default page limit, and delay as needed.
    product_urls = crawl_all_pages(max_pages=5, delay=3)
    save_to_csv(product_urls)
    save_to_json(product_urls)
The web page in question is static and doesn’t require JavaScript rendering. Therefore, requests and BeautifulSoup are enough for web crawling. Otherwise, tools like Playwright or Selenium would help with dynamic content extraction.
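If a target does rely on JavaScript, a headless browser can take the place of requests. Here's a minimal sketch using Playwright's sync API, assuming the same sandbox.oxylabs.io listing pages and a.card-header selector as above (install with pip install playwright, then run playwright install):
from playwright.sync_api import sync_playwright

def get_rendered_links(page_number):
    """Render a listing page in a headless browser and collect product links."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(f"https://sandbox.oxylabs.io/products?page={page_number}")
        # Wait until the product cards appear in the rendered DOM.
        page.wait_for_selector("a.card-header")
        links = [
            a.get_attribute("href")
            for a in page.query_selector_all("a.card-header")
        ]
        browser.close()
    return links

print(get_rendered_links(1))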
Automated web crawlers often encounter blocks from websites that employ anti-bot mechanisms. Common blocking methods include IP-based rate limiting, CAPTCHAs, user-agent detection, and behavior pattern recognition.
To bypass these restrictions, rotating proxies are essential. Rotating IPs distribute HTTP requests across different IP addresses, making your web crawling activity appear more like organic user traffic.
You can use services that provide residential and datacenter IPs. When integrating proxies with Python requests, it's important to set appropriate delays between HTTP requests and handle proxy failures in your code.
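As a rough illustration, here's how a proxy could be wired into the requests calls from the tutorial, with a simple retry on failure. The proxy address and credentials are placeholders; substitute the values provided by your proxy service:
import time

import requests

# Placeholder proxy endpoint - replace with your provider's address and credentials.
PROXIES = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

def fetch_with_proxy(url, retries=3, delay=2):
    """Fetch a URL through the proxy, retrying on connection or proxy errors."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, proxies=PROXIES, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as error:
            print(f"Attempt {attempt} failed: {error}")
            time.sleep(delay)  # Back off before retrying.
    return None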
Free proxies can work for smaller tasks. However, maintaining a quality proxy pool is expensive, so the wisest approach is to look for a paid proxy service provider that also offers some free proxies.
Alternatively, to avoid the hassle of setting up elaborate Python logic, third-party tools can help. With only minimal coding experience, you can crawl any website at a large scale without worrying about blocks using Oxylabs' Web Crawler, a feature of our Web Scraper API.
You can test the Web Crawler’s functionality with a one-week free trial of Web Scraper API.
To use it, start by creating an API user through the Oxylabs dashboard and obtain your credentials.
Configure your web crawling job by specifying the starting URL, setting filters to control the crawl scope, and defining scraping parameters like geo-location and JavaScript rendering.
Submit your configuration via a POST request to the job initiation endpoint.
Once the crawl is complete, retrieve data in a preferred format – sitemap, HTML files, or parsed data – either directly or via automatic upload to your cloud storage.
Web Crawler's workflow
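For orientation, submitting a job could look roughly like the snippet below. Treat the endpoint URL and payload fields as illustrative assumptions rather than the exact Web Crawler schema, and check the Oxylabs documentation for the current field names:
import requests

# Illustrative sketch only: the endpoint and payload fields below are assumptions,
# not the documented schema - consult the Oxylabs docs for the exact job format.
payload = {
    "url": "https://sandbox.oxylabs.io/products",  # Starting URL for the crawl.
    "filters": {"crawl": [".*"], "process": [".*"], "max_depth": 1},
    "output": {"type_": "sitemap"},
}

response = requests.post(
    "https://ect.oxylabs.io/v1/jobs",  # Assumed job initiation endpoint.
    auth=("USERNAME", "PASSWORD"),  # Your API user credentials.
    json=payload,
    timeout=30,
)
print(response.json())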
Also, check some of the best web crawlers, including code-free options with visual interfaces suitable for non-technical users.
Many common issues with web crawling can be addressed by using and properly managing proxies. Here are some tips for sending successful web crawling requests.
User-Agent identification: Use a descriptive User-Agent string to identify your crawler.
Rate limiting: Implement delays between requests to prevent overwhelming servers and reduce IP blocking risk.
Concurrency control: Limit the number of simultaneous requests to a domain to maintain server stability.
JavaScript rendering: To process web pages that rely on JavaScript for content loading, use libraries like requests_html or headless web browsers.
URL filtering: Implement mechanisms to detect and avoid infinite loops or excessively deep directory structures that can trap crawlers.
Asynchronous requests: Employ asynchronous libraries like aiohttp to handle multiple requests concurrently (see the sketch below).
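To illustrate rate limiting, concurrency control, and asynchronous requests together, here's a rough aiohttp sketch that fetches several listing pages concurrently while a semaphore caps simultaneous requests to the domain. It reuses the sandbox.oxylabs.io URL pattern from the tutorial; install aiohttp with pip install aiohttp:
import asyncio

import aiohttp

BASE_URL = "https://sandbox.oxylabs.io/products?page="

async def fetch(session, semaphore, page):
    """Fetch one listing page, limited by the shared semaphore."""
    async with semaphore:
        async with session.get(f"{BASE_URL}{page}") as response:
            html = await response.text()
        await asyncio.sleep(1)  # Polite delay even when running concurrently.
        return page, len(html)

async def main():
    semaphore = asyncio.Semaphore(3)  # At most 3 requests to the domain at once.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, page) for page in range(1, 6)]
        for page, size in await asyncio.gather(*tasks):
            print(f"Page {page}: {size} bytes")

asyncio.run(main())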
To implement the best web crawling practices and make your script as durable as possible, you'll inevitably need some advanced Python web scraping techniques. However, for small tasks, you can definitely get by using only the basics.
A Python crawler is a script that browses and downloads web pages, typically to extract data or index content. It works by starting from a seed URL, downloading the web page content, extracting links to other web pages, and then recursively visiting those links according to defined rules.
Web crawlers can be used for data mining, content monitoring, search engine indexing, or archiving websites. Common Python libraries for web crawling are Beautiful Soup, requests, Scrapy, and Selenium.
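As a compact illustration of that seed-URL loop, here's a minimal breadth-first sketch with a depth limit, built on the same requests and Beautiful Soup stack used above. The seed URL, depth limit, and same-domain check are all adjustable assumptions:
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=2):
    """Breadth-first crawl from a seed URL, staying on the seed's domain."""
    domain = urlparse(seed_url).netloc
    visited = set()
    queue = [(seed_url, 0)]  # Each entry is (URL, depth from the seed).

    while queue:
        url, depth = queue.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if urlparse(next_url).netloc == domain:
                queue.append((next_url, depth + 1))

    return visited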
Yes, Python is excellent for web scraping and web crawling. It offers powerful libraries and frameworks like BeautifulSoup and Scrapy that make it easy to extract data from websites.
These tools, combined with Python's simple syntax and extensive request-handling capabilities, make it the most popular choice for web scraping and web crawling. The language also supports different data formats and has built-in features for processing scraped data.
Yes, web crawling is legal if you crawl publicly available data and adhere to regulations. The legality depends on the web crawling methods and intended use. It's recommended to consult legal professionals to ensure compliance with applicable laws.