Vytenis Kaubrė
Oftentimes, web scraping requires going beyond basic tools like Beautiful Soup or Scrapy. If you want to scale up and scrape smartly, you’ll have better success implementing advanced web scraping techniques to overcome complex challenges. This advanced web scraping Python tutorial will help you level up your processes.
If you’re a beginner, we recommend starting with the basics highlighted in this Python web scraping tutorial and trying to build an automated web scraper with AutoScraper.
Crawling public data without getting blocked requires a multitude of complex web scraping techniques. Since websites use anti-scraping measures that analyze incoming HTTP requests, user actions, and browsing patterns, a web scraper that doesn’t resemble realistic user browsing behavior will get quickly blocked. Consider using the following advanced methods to overcome web blocks:
Use rotating proxy IP addresses, ideally residential proxies or mobile proxies, to spread scraping tasks across different IPs, making your requests look like they’re coming from different residential users.
Simulate realistic mouse movements using algorithms like Bézier curves, ensuring that mouse movements aren’t performed in a straight line but rather follow smooth, human-like trajectories with slight variations in speed and direction.
Rotate the user-agent string of your headless browser with each request (see the sketch after the browser-switching example below).
Switch between different browsers when errors occur. For instance:
from selenium import webdriver
from selenium.webdriver import ChromeOptions, FirefoxOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Options for both Chrome and Firefox.
chrome_options = ChromeOptions()
chrome_options.add_argument("--headless=new")
firefox_options = FirefoxOptions()
firefox_options.add_argument("-headless")

def check_for_error(driver):
    try:
        title = driver.title
        return "Sorry! Something went wrong!" in title
    except Exception:
        return False

use_firefox = False
for i in range(5):
    url = f"https://www.amazon.com/s?k=adidas&page={i}"
    if not use_firefox:
        # Start with Chrome.
        driver = webdriver.Chrome(options=chrome_options)
        print("Using Chrome.")
    else:
        # Continue with Firefox if switched to it.
        driver = webdriver.Firefox(options=firefox_options)
        print("Using Firefox.")
    driver.get(url)

    # Check if there's an error and switch browsers if needed.
    if check_for_error(driver):
        driver.quit()
        use_firefox = True  # Switch to Firefox.
        print("Error detected, switching to Firefox.")
        driver = webdriver.Firefox(options=firefox_options)
        driver.get(url)

    # Wait for the element and capture the entire page.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h2 > a > span"))
    )
    with open(f"amazon_{i + 1}.html", "w") as f:
        f.write(driver.page_source)
    driver.quit()
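For the user-agent rotation tip, here’s a minimal sketch, assuming you maintain your own pool of user-agent strings (the two below are placeholders) and reusing the sandbox URL that appears later in this post:

import random
from selenium import webdriver
from selenium.webdriver import ChromeOptions

# Placeholder user-agent strings; keep an up-to-date pool in practice.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

for page in range(1, 3):
    options = ChromeOptions()
    options.add_argument("--headless=new")
    # Pick a different user-agent string for each new browser session.
    options.add_argument(f"--user-agent={random.choice(user_agents)}")
    driver = webdriver.Chrome(options=options)
    driver.get(f"https://sandbox.oxylabs.io/products?page={page}")
    print(driver.execute_script("return navigator.userAgent"))
    driver.quit()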
Most modern websites load content dynamically, for example when a button is clicked or, more commonly, when a page is scrolled to the bottom. To scrape such dynamic pages, you should stick with headless browsers. More advanced techniques for tackling dynamic web pages include the following:
Handle infinite scroll without executing JavaScript code directly or using keyboard keys. Websites may detect the use of such methods and block dynamic content from loading. A more sophisticated way to scroll infinite pages is to simulate human-like behavior using a mouse wheel or a touchpad. In Selenium, you can achieve this using the Actions API:
import time, random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
# Visit a web page with infinite scroll.
driver.get("https://quotes.toscrape.com/scroll")
time.sleep(2)

while True:
    # Get the vertical position of the page in pixels.
    last_scroll = driver.execute_script("return window.scrollY")
    # Scroll down by a random number of pixels within a realistic range.
    ActionChains(driver).scroll_by_amount(0, random.randint(499, 3699)).perform()
    time.sleep(2)
    # Get the new vertical position of the page in pixels after scrolling.
    new_scroll = driver.execute_script("return window.scrollY")
    # Break the loop if the page has reached its end.
    if new_scroll == last_scroll:
        break
    last_scroll = new_scroll

cards = driver.find_elements(By.CSS_SELECTOR, ".quote")
# Get the number of quote cards you've loaded. It should be 100 cards in total.
print(len(cards))
driver.quit()
Sometimes, you can ditch headless browsers altogether and emulate the same requests that Ajax makes to web servers to fetch additional data. You may need to use exactly the same request headers and URL parameters as those inspected via the Developer Tools > Network tab; otherwise, the site’s API will return an error message or time out. The following example retrieves 100 StackShare job listings for the local area (see how to navigate Dev Tools):
import requests, json
url = "https://km8652f2eg-dsn.algolia.net/1/indexes/Jobs_production/query"
# Request headers.
headers = {
    "Accept": "application/json",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Content-Type": "application/json",  # Send data as JSON instead of 'application/x-www-form-urlencoded'.
}

# 'Query String Parameters' from the Network > Payload tab.
params = {
    "x-algolia-agent": "Algolia for JavaScript (3.33.0); Browser",
    "x-algolia-application-id": "KM8652F2EG",
    "x-algolia-api-key": "YzFhZWIwOGRhOWMyMjdhZTI5Yzc2OWM4OWFkNzc3ZTVjZGFkNDdmMThkZThiNDEzN2Y1NmI3MTQxYjM4MDI3MmZpbHRlcnM9cHJpdmF0ZSUzRDA="
}

# 'Form Data' from the Network > Payload tab.
# Modify the 'length' and 'hitsPerPage' parameters to get more listings.
# This code retrieves a total of 100 listings instead of the default 15.
data = {
    "params": "query=&aroundLatLngViaIP=true&offset=0&length=100&hitsPerPage=100&aroundPrecision=20000"
}

# Send a POST request with JSON payload.
r = requests.post(url, headers=headers, params=params, json=data)

if r.status_code == 200:
    with open("stackshare_jobs.json", "w") as f:
        json.dump(r.json(), f, indent=4)
else:
    print(f"Request failed:\n{r.status_code}\n{r.text}")
Bypassing CAPTCHA tests is all about making your requests look like a human is browsing the web. There are several concrete steps you can take in combination to avoid that unwelcome “Are you a robot?” message:
Use high-quality rotating proxies or a dedicated proxy solution, like Web Unblocker, to bypass complex anti-scraping systems and CAPTCHAs.
Use headless browsers and their stealthier versions, like undetected-chromedriver, nodriver, and selenium-driverless. These modified libraries are specifically designed to address the detectability issues of standard headless browsers (a minimal undetected-chromedriver sketch follows the code example below).
Disable certain features and built-in settings of your headless browser that often reveal the use of automated browsers. For example, here’s what you can try with Selenium running a Chrome browser:
from selenium import webdriver
from selenium.webdriver import ChromeOptions
chrome_options = ChromeOptions()
chrome_options.add_argument("--headless=new")
# Disable Chrome features that reveal the presence of automation.
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
# Hide the "Chrome is being controlled by automated test software" notification bar.
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
# Disable the automation extension in Chrome, which is usually injected by Selenium.
chrome_options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=chrome_options)
# Modify the navigator object to hide the presence of WebDriver.
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
# Visit a website.
driver.get("https://sandbox.oxylabs.io/products")
with open(f"website_html.html", "w") as f:
f.write(driver.page_source)
driver.quit()
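As a quick illustration of the stealthier drivers mentioned earlier, here’s a minimal sketch using undetected-chromedriver (installed with pip install undetected-chromedriver); treat it as a starting point rather than a guaranteed bypass, and note that the output file name is arbitrary:

import undetected_chromedriver as uc

# undetected-chromedriver patches ChromeDriver to reduce common automation fingerprints.
driver = uc.Chrome()
driver.get("https://sandbox.oxylabs.io/products")
with open("website_html_uc.html", "w") as f:
    f.write(driver.page_source)
driver.quit()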
There are quite a lot of parsing libraries to choose from, yet the best option for you depends on your needs. Below, you can find some of the most streamlined parsing libraries in Python:
Beautiful Soup 4 is the most popular library and by far the best to start with if you’re a beginner. While it has extensive documentation and offers simplicity, it only supports CSS selectors, and the biggest trade-off is that it’s quite slow. If you’re new to this Python library, check out this in-depth Beautiful Soup parsing tutorial.
html5lib conforms to the WHATWG HTML specification, meaning it parses pages the same way modern web browsers do. Since it’s built using pure Python, its main disadvantage is the slow parsing speed.
lxml is considered one of the fastest parsers as it's built using the C programming language. Another huge advantage this library offers is its ability to use XPath selectors. You can also use it with the cssselect library to convert your written CSS selectors to XPath. See this lxml parsing tutorial for an easy start.
Selectolax is by far the fastest parsing library for Python. Benchmarks show it’s way faster than Beautiful Soup and a bit faster than lxml. It's very similar to Beautiful Soup, as it exclusively uses CSS selectors and is incredibly user-friendly. Consider the following comparison for parsing this Wikipedia table:
import requests, time
from bs4 import BeautifulSoup

total_time = 0
num_requests = 10

for i in range(num_requests):
    r = requests.get("https://en.wikipedia.org/wiki/List_of_minor_planet_discoverers#discovering_astronomers")
    start_time = time.time()

    soup = BeautifulSoup(r.text, "html.parser")
    table = soup.find('table', {'class': 'wikitable'})
    rows = table.find_all('tr')

    astronomers = []
    discoveries = []
    dob_dod = []
    for row in rows[1:]:
        columns = row.find_all('td')
        if len(columns) > 2:
            astronomers.append(columns[0].get_text(strip=True))
            discoveries.append(columns[1].get_text(strip=True))
            dob_dod.append(columns[2].get_text(strip=True))

    time_taken = time.time() - start_time
    total_time += time_taken
    print(f"Request {i+1} - Time taken: {time_taken} seconds")

average_time = total_time / num_requests
print(f"Average time taken: {average_time} seconds")
The average time taken to parse a part of the table with Beautiful Soup is 0.355 seconds, while with the 'lxml' parser instead of 'html.parser', Beautiful Soup takes around 0.259 seconds on average.
import requests, time
from selectolax.parser import HTMLParser

total_time = 0
num_requests = 10

for i in range(num_requests):
    r = requests.get("https://en.wikipedia.org/wiki/List_of_minor_planet_discoverers#discovering_astronomers")
    start_time = time.time()

    tree = HTMLParser(r.text)
    rows = tree.css("table.wikitable:first-of-type tr")

    astronomers = []
    discoveries = []
    dob_dod = []
    for row in rows[1:]:
        columns = row.css("td")
        if len(columns) > 2:
            astronomers.append(columns[0].text(strip=True))
            discoveries.append(columns[1].text(strip=True))
            dob_dod.append(columns[2].text(strip=True))

    time_taken = time.time() - start_time
    total_time += time_taken
    print(f"Request {i+1} - Time taken: {time_taken} seconds")

average_time = total_time / num_requests
print(f"Average time taken: {average_time} seconds")
In contrast, Selectolax took 0.039 seconds on average. That’s a massive reduction in time compared to Beautiful Soup’s 0.355 seconds and a significant boost for large-scale parsing.
To further advance your data parsing processes, consider utilizing the following tactics:
Combine multiprocessing and multithreading with your chosen parser for concurrent and parallelized data extraction, as sketched after this list.
Use powerful built-in XPath functions like contains(), starts-with(), and substring() instead of relying on complex nested expressions.
Consider using CSS selectors for simpler tasks, since CSS selectors can be faster than XPath in large-scale scenarios. This is because modern browsers have internal CSS engines, and CSS selectors are usually simpler and shorter, making them faster and more efficient to process.
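To illustrate the first tip, here’s a minimal sketch that parses several saved HTML files in parallel with a process pool; the file names and the .product-card h4 selector (borrowed from the sandbox pages used later in this post) are assumptions you’d adapt to your own data:

from concurrent.futures import ProcessPoolExecutor
from selectolax.parser import HTMLParser

def parse_file(path):
    # Each worker process parses one saved HTML file independently.
    with open(path) as f:
        tree = HTMLParser(f.read())
    return [node.text(strip=True) for node in tree.css(".product-card h4")]

if __name__ == "__main__":
    # Assumes these files were saved by your scraper beforehand.
    paths = [f"products_page_{i}.html" for i in range(1, 5)]
    # Files are parsed in separate processes, sidestepping the GIL.
    with ProcessPoolExecutor() as executor:
        for path, titles in zip(paths, executor.map(parse_file, paths)):
            print(path, len(titles), "titles parsed")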
The asyncio and aiohttp libraries are the most popular duo for asynchronous scraping in Python. Below, you can find some useful tips for using both modules in advanced ways, helping you control their asynchronous operations.
Reuse aiohttp sessions to improve performance and reduce the load on the server.
Timeout aiohttp sessions efficiently by controlling factors like overall request time, connection duration, socket read time, and others, for example:
custom_timeout = aiohttp.ClientTimeout(total=180, connect=10, sock_read=10, sock_connect=5)
async with aiohttp.ClientSession(timeout=custom_timeout) as session:
    ...  # Every request made with this session inherits the timeout settings.
If the response data is too large to store entirely in memory, process it incrementally in chunks as it’s received. For instance:
async for chunk in response.content.iter_chunked(1024):
    print(chunk)
Limit the number of concurrent connections using asyncio.Semaphore() to avoid overloading servers, respect rate limiting, and ensure your system runs smoothly without hitting performance bottlenecks.
Use the asyncio.Queue() method for better task management. By queuing URLs along with a fixed number of worker tasks in first-in-first-out (FIFO) order, you can control the level of concurrency and pass items between coroutine tasks efficiently and safely.
Add more control by using asyncio.PriorityQueue() and asyncio.LifoQueue(). PriorityQueue retrieves items in priority order (lowest priority number first), while LifoQueue is a last-in-first-out (LIFO) queue that retrieves the most recently added item first (a short PriorityQueue sketch follows the larger example below).
Here’s a Python script that uses asyncio’s Semaphore() and Queue() functionality:
import asyncio, aiohttp, json
from bs4 import BeautifulSoup

async def generate_urls(queue: asyncio.Queue):
    for number in range(1, 21):
        url = f"https://sandbox.oxylabs.io/products?page={number}"
        # Put the URLs into a queue.
        await queue.put(url)

async def save_products(products):
    with open("products.json", "w") as f:
        json.dump(products, f, indent=4)

async def scrape(
    queue: asyncio.Queue,
    session: aiohttp.ClientSession,
    semaphore: asyncio.Semaphore,
    all_products: list
):
    while not queue.empty():
        url = await queue.get()  # Retrieve the next URL from the queue.
        async with semaphore:
            async with session.get(url) as r:
                content = await r.text()

        soup = BeautifulSoup(content, "html.parser")
        products = []
        for product in soup.select(".product-card"):
            title = product.select_one("h4").text
            link = product.select_one(".card-header").get("href")
            price = product.select_one(".price-wrapper").text
            product_info = {
                "Title": title,
                "Link": "https://sandbox.oxylabs.io" + link,
                "Price": price
            }
            products.append(product_info)
        all_products.extend(products)
        queue.task_done()  # Mark the URL as done.

async def main():
    queue = asyncio.Queue()
    all_products = []
    # Limit to 5 concurrent requests if the website uses rate limiting.
    semaphore = asyncio.Semaphore(5)
    await generate_urls(queue)

    # Create a single aiohttp Session for multiple requests.
    async with aiohttp.ClientSession() as session:
        # Create a single scraping task.
        scrape_task = asyncio.create_task(scrape(queue, session, semaphore, all_products))
        await queue.join()
        await scrape_task
    await save_products(all_products)

if __name__ == "__main__":
    asyncio.run(main())
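As a complement to the Queue() usage above, here’s a minimal asyncio.PriorityQueue() sketch; the priority numbers and URLs are purely illustrative:

import asyncio

async def demo():
    pq = asyncio.PriorityQueue()
    # Lower numbers are retrieved first, so urgent URLs jump the line.
    await pq.put((5, "https://sandbox.oxylabs.io/products?page=2"))
    await pq.put((1, "https://sandbox.oxylabs.io/products?page=1"))
    while not pq.empty():
        priority, url = await pq.get()
        print(priority, url)

asyncio.run(demo())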
If you’re regularly extracting data in large amounts without automation, you could be missing out on significant efficiency gains. Instead of you clicking the “Run” button, your computer should do it at specific time intervals. If you’re new to automation, check out this tutorial on how to set up an automated web scraper using cron and follow these advanced best practices:
Automate your scraper to scrape data during off-peak hours. This will ensure your activities don’t overload the target server while also minimizing IP bans and avoiding timeouts and throttling.
Spread your requests throughout the entire day, reducing the number of requests sent within a short period.
Set up automated scrapers to trigger alerts when specific conditions are met, such as price drops, competitor business actions, or SEO ranking changes (see the sketch after this list). Monitoring like this supports timely insights and proactive decision-making.
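As a simple illustration of the alerting tip, here’s a minimal price-drop check; the file name, threshold, and placeholder price are assumptions, and print() stands in for a real notification channel such as email or a webhook:

import json

PRICE_FILE = "last_price.json"
current_price = 84.99  # Replace with the price your scraper just extracted.

# Load the previously recorded price, if any.
try:
    with open(PRICE_FILE) as f:
        last_price = json.load(f)["price"]
except FileNotFoundError:
    last_price = None

# Trigger an alert on a drop of more than 10%.
if last_price is not None and current_price < last_price * 0.9:
    print(f"Price dropped from {last_price} to {current_price}!")

# Store the current price for the next run.
with open(PRICE_FILE, "w") as f:
    json.dump({"price": current_price}, f)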
If you’re developing on a Unix-like system, the cron tool will be your best friend for automation. As mentioned above, scraping during off-peak hours is the way to go; most servers are least crowded between roughly midnight and 4 AM server time. The following cron schedules should suit most scraping use cases.
Run at 2:00 AM every day of the week:
0 02 * * 0-6
Run at 2:00 AM on Saturdays and Sundays:
0 02 * * 6-7
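For reference, a complete crontab entry pairs a schedule like the ones above with the command to run. A minimal sketch, assuming your scraper lives at /home/user/scraper.py and you want the daily 2:00 AM schedule:

0 2 * * 0-6 /usr/bin/python3 /home/user/scraper.py >> /home/user/scraper.log 2>&1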
Websites commonly change and present different versions of pages to users through A/B testing frameworks. Additionally, websites can purposefully change CSS attribute names, classes, element IDs, and entire site structures specifically to counter web scrapers. All this requires continuous scraper upkeep to adjust CSS or XPath selectors as well as the scraping logic. You can try out the following advanced methods to make your processes easier:
Create selectors using attributes or elements that are less likely to change, such as consistent id attributes, heading tags like <h1> and <h2>, or custom data attributes (see the sketch after the code example below).
Implement a mechanism to detect changes in the structure and adapt the scraping logic automatically. A basic HTML structure change detection mechanism requires removing the text from the HTML file, leaving only tags and classes, as shown below:
# Make sure to have a 'previous_html.html' file saved before running the code.
import requests
from bs4 import BeautifulSoup

url = "https://sandbox.oxylabs.io/products"

def fetch_html(url):
    return requests.get(url).text

# Strip the text, leaving only HTML tags and classes.
def strip_to_tags(html):
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all(string=True):
        element.extract()
    return soup.prettify()

def compare_structure():
    try:
        with open("previous_html.html", "r") as f:
            previous_html = f.read()
    except FileNotFoundError as e:
        print(f"Error: {e}")
        return  # Stop early if there's no baseline file to compare against.

    current_html = fetch_html(url)
    if strip_to_tags(previous_html) != strip_to_tags(current_html):
        with open("previous_html.html", "w") as f:
            f.write(current_html)
        print("HTML structure has changed.")
    else:
        print("No structure changes detected.")

compare_structure()
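To illustrate the tip about stable selectors, here’s a minimal lxml sketch that anchors on an id and a custom data attribute instead of auto-generated class names; the markup is hypothetical:

from lxml import html

# Hypothetical markup: ids and data attributes tend to outlive styling classes.
page = html.fromstring("""
<div id="product-list">
  <div class="css-x93kfa" data-product-id="123"><h2>Wireless Mouse</h2></div>
  <div class="css-a81bd0" data-product-id="456"><h2>Mechanical Keyboard</h2></div>
</div>
""")

# Anchor the selector on the stable id and data attribute.
for card in page.xpath("//div[@id='product-list']/div[@data-product-id]"):
    print(card.get("data-product-id"), card.xpath(".//h2/text()")[0])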
Honeypot traps can lead your scraper to follow links that immediately trigger an anti-scraping detection system. While honeypots are a must for detecting cybercriminals, sometimes good bots like web scrapers may get trapped as well.
The first step you should take is to ensure your public data scraper follows only visible links. You should craft your selector in a way that checks CSS properties such as display: none, visibility: hidden, and opacity: 0:
# XPath selector that selects only visible links.
//a[contains(@class, 'card-header') and not(contains(@style, 'display:none') or contains(@style, 'visibility:hidden') or contains(@style, 'opacity:0'))]/@href
Sometimes, elements can be positioned off-screen using left: -9999px or other CSS styles similar in effect. This way, an element or a link won’t be visible to a real user browsing a page, but a web crawler will see it in the HTML document. Hence, you should also take this into account.
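Building on the XPath above, here’s a minimal Selenium sketch that also checks computed styles and on-screen position; the sandbox URL and the a.card-header selector are reused from earlier examples and would need adjusting for your target page:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://sandbox.oxylabs.io/products")

visible_links = []
for link in driver.find_elements(By.CSS_SELECTOR, "a.card-header"):
    # Inspect computed styles and the bounding box, not just inline styles.
    info = driver.execute_script(
        "const s = window.getComputedStyle(arguments[0]);"
        "const r = arguments[0].getBoundingClientRect();"
        "return {display: s.display, visibility: s.visibility,"
        " opacity: s.opacity, right: r.right, bottom: r.bottom};",
        link,
    )
    if (
        info["display"] != "none"
        and info["visibility"] != "hidden"
        and float(info["opacity"]) > 0
        and info["right"] > 0 and info["bottom"] > 0  # Not pushed off-screen.
    ):
        visible_links.append(link.get_attribute("href"))

print(visible_links)
driver.quit()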
If your goal is to extract data on a large scale, you must build your scraper to handle the scale without losing performance and speed. See the advanced tips listed below to get a better grasp.
Scrape and parse using asynchronous techniques with concurrent or parallel execution. This will speed up your web scraper and allow you to make millions of requests quickly and efficiently.
Perform scraping and parsing separately, ideally on different machines, to optimize resource consumption and save time.
Distribute the scraping workload across multiple machines to handle larger datasets.
Use request retry logic to automatically retry failed requests when a specific response status code is received. For the best results, retry requests through proxies for certain status codes.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

proxies = {
    "http": "http://USERNAME:PASSWORD@pr.oxylabs.io:7777",
    "https": "https://USERNAME:PASSWORD@pr.oxylabs.io:7777"
}

try:
    retry = Retry(
        total=5,
        backoff_factor=3,
        status_forcelist=[403, 429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)

    r = session.get("https://ip.oxylabs.io/", proxies=proxies, timeout=180)
    print(r.status_code)
    print(r.text)
except Exception as e:
    print(e)
Use databases (MongoDB, PostgreSQL, MySQL) or cloud storage (Amazon S3, Google Cloud Storage) to store your scraped data online instead of locally (see the MongoDB sketch after this list).
Proxy servers are a must for large-scale web scraping. Without them, you’ll face significant performance and engineering difficulties, and most likely, you won’t be able to reach the scale you need. For a more streamlined development workflow, consider using a scraping API.
Use headless browser tools like Selenium, Playwright, and Puppeteer only for scraping dynamic content. Switching between a headless browser and a static page scraper will save you a lot of time.
Implement Docker, Kubernetes, Apache Airflow, or similar tools to scale and manage scraping tasks.
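For example, here’s a minimal pymongo sketch for the database tip above; the connection string, database, and collection names are assumptions, and the placeholder list stands in for your scraped items:

from pymongo import MongoClient

# Assumes a MongoDB instance is reachable at this address.
client = MongoClient("mongodb://localhost:27017")
collection = client["scraping"]["products"]

# Placeholder items standing in for your scraped data.
all_products = [
    {"Title": "Example product", "Price": "9.99"},
    {"Title": "Another product", "Price": "19.99"},
]
collection.insert_many(all_products)
print(collection.count_documents({}))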
When scraping public data, it's important to approach it responsibly and ethically. Following best practices not only helps maintain the integrity of the websites you interact with but also ensures compliance with current legal standards. The question of whether web scraping is legal has been raised time and again; therefore, below are some essential guidelines to keep in mind:
Websites may provide an API to access their data, often for a fee. It’s highly recommended to use an API instead of extracting data with custom tools.
Follow the rules highlighted in the website’s robots.txt file. Usually, you can access it by appending /robots.txt to the URL, for example (a programmatic check is sketched after this list):
# Amazon
https://www.amazon.com/robots.txt
# Google
https://www.google.com/robots.txt
Don’t scrape or use copyrighted material.
Before web scraping any personal data, consult with a legal advisor to ensure the legality of your activities. Always adhere to GDPR and other international and local data protection laws and regulations.
Always respect the website. Implement delays between requests, scrape during off-peak hours, and respect rate limits to avoid overloading the website’s servers. In essence, you should develop your scraper in such a way that doesn’t affect the performance and normal working of a website.
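As a small illustration of checking robots.txt programmatically, here’s a minimal sketch using Python’s built-in urllib.robotparser; the user-agent name is a placeholder:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.amazon.com/robots.txt")
rp.read()
# Check whether a given user agent may fetch a specific URL.
print(rp.can_fetch("MyScraperBot", "https://www.amazon.com/s?k=adidas"))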
By implementing the advanced Python scraping techniques discussed in this blog post, you can optimize your workflows and achieve more efficient and accurate results. The key to maximizing success lies in applying these strategies in combination, ensuring a powerful and cohesive approach to web scraping.
For a more streamlined approach, consider using a ready-made scraper API tool to overcome blocks and reach the scale you need with ease.
About the author
Vytenis Kaubrė
Technical Copywriter
Vytenis Kaubrė is a Technical Copywriter at Oxylabs. His love for creative writing and a growing interest in technology fuels his daily work, where he crafts technical content and web scrapers with Oxylabs’ solutions. Off duty, you might catch him working on personal projects, coding with Python, or jamming on his electric guitar.