CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) have become vital to website security. Once the security apparatus of the website becomes suspicious of access (e.g., the access pattern does not follow normal human behavior), it loads a CAPTCHA (e.g., text, sound, and image puzzles), preventing bots from further access.
Bypassing a CAPTCHA challenge once it loads can be extremely difficult. However, knowing how CAPTCHAs work and applying a few known methods can help your script appear more human to the web firewall, which can often prevent the CAPTCHA from loading in the first place. We call this bypassing, or avoiding, a CAPTCHA.
This step-by-step tutorial demonstrates how to use Playwright to bypass CAPTCHA challenges using Python. The tutorial will also discuss the perks of using Oxylabs’ Web Unblocker instead of the `playwright-stealth` library.
Note: Bypassing CAPTCHAs for illegal or malicious motives violates ethical and legal standards. This tutorial is for educational purposes only, and we encourage readers to thoroughly read the Terms of Service of the target website to avoid legal issues.
Playwright provides a robust and user-friendly browser automation tool that can interact with web pages. It allows developers to perform tasks such as clicking elements, filling out forms, and extracting data from dynamic websites. Its support for multiple browsers (like Chromium, Firefox, and WebKit) ensures cross-browser compatibility. Additionally, Playwright's support for headless mode allows for hidden browser interactions, making it suitable for web scraping tasks.
Relying on Playwright alone to avoid CAPTCHAs can be challenging, as websites may detect traffic from automated and headless browsers. Fortunately, the `playwright-stealth` package can help.
Combining the stealth package with Playwright makes headless browser instances look more like regular, human-driven browsers to websites. This reduces the chances of being detected and, in turn, of a CAPTCHA being served.
Let’s demonstrate how to handle CAPTCHA in Playwright by creating a Python script that opens a web link in headless mode. It then captures a screenshot of the target page and saves it to local file storage. The script is successful if the screenshot shows the actual contents of the page instead of a CAPTCHA or reCAPTCHA screen.
Let’s go step by step through setting up the stealth package with Playwright in Python and building this script.
Install the Playwright library and the stealth package, then download the Chromium binary that Playwright drives.
pip install playwright playwright-stealth
playwright install chromium
Use the synchronous version of the Playwright library for a straightforward and linear program flow.
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
Define the `capture_screenshot()` function that encapsulates all the code needed to open a headless browser instance, visit the URL, and capture the screenshot. In this function, create a new `sync_playwright` instance and then use it to launch a headless Chromium browser.
# Define the function to capture the screenshot
def capture_screenshot():
    # Create a playwright instance
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        # Create a new context and page
        context = browser.new_context()
        page = context.new_page()
After creating the browser context, apply the stealth settings to the page using the `playwright-stealth` package. Stealth settings reduce the chances of automated access being detected by masking signs of browser automation, which in turn makes a CAPTCHA less likely to load.
        # Apply the stealth settings
        stealth_sync(page)
Next, specify the target URL and navigate to it using the page's `goto()` method.
        # Navigate to the website
        url = "https://sandbox.oxylabs.io/products"
        page.goto(url)
Wait for the page to load completely, take the screenshot, and close the browser.
        # Wait for the webpage to fully load.
        page.wait_for_load_state("networkidle")
        # Take a screenshot
        screenshot_filename = "screenshot.png"
        page.screenshot(path=screenshot_filename)
        # Close the browser
        browser.close()
        print("Done! You can check the screenshot...")

capture_screenshot()
Here is what our complete code looks like:
# Import the required modules
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

# Define the function to capture the screenshot
def capture_screenshot():
    # Create a playwright instance
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        # Create a new context and page
        context = browser.new_context()
        page = context.new_page()
        # Apply the stealth settings
        stealth_sync(page)
        # Navigate to the website
        url = "https://sandbox.oxylabs.io/products"
        page.goto(url)
        # Wait for the webpage to fully load.
        page.wait_for_load_state("networkidle")
        # Take a screenshot
        screenshot_filename = "screenshot.png"
        page.screenshot(path=screenshot_filename)
        # Close the browser
        browser.close()
        print("Done! You can check the screenshot...")

capture_screenshot()
Executing the code saves the screenshot. Here's what it looks like in our case:
If the screenshot shows the actual content of the page you're trying to access, it means you've just avoided a CAPTCHA from loading on this page.
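Beyond checking the screenshot by eye, it can be useful to look at some of the browser properties that the stealth settings are meant to mask. The following is a minimal sketch, not part of the `playwright-stealth` package: the helper names and the choice of probes are assumptions made for illustration, and real anti-bot systems check far more than this. It relies only on Playwright's `page.evaluate()` method and is meant to be called after `page.goto()`.

```python
# Hypothetical helpers (not part of Playwright or playwright-stealth).
# The probes are an illustrative assumption about what a site might inspect.
FINGERPRINT_PROBES = {
    "webdriver": "() => navigator.webdriver",
    "user_agent": "() => navigator.userAgent",
    "languages": "() => navigator.languages",
    "plugin_count": "() => navigator.plugins.length",
}


def collect_fingerprint(page):
    """Evaluate each probe in the page and return the results as a dict."""
    return {
        name: page.evaluate(expression)
        for name, expression in FINGERPRINT_PROBES.items()
    }


def suspicious_signals(fingerprint):
    """Return the names of values that look typical of an automated browser."""
    signals = []
    if fingerprint.get("webdriver"):
        signals.append("webdriver")
    if "HeadlessChrome" in (fingerprint.get("user_agent") or ""):
        signals.append("user_agent")
    if not fingerprint.get("languages"):
        signals.append("languages")
    return signals
```

Inside `capture_screenshot()`, calling `print(suspicious_signals(collect_fingerprint(page)))` right after `page.goto(url)` would list any of these signals that are still visible. An empty list only means these particular probes look ordinary, not that the browser is undetectable.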
Oxylabs’ Web Unblocker employs advanced AI techniques to help users access publicly available information behind the CAPTCHA. Bypassing CAPTCHAs with our advanced proxy solution is easy. You just need to send a simple query. Web Unblocker will automatically choose the fastest CAPTCHA proxy, attach all essential headers, and return the response HTML bypassing any anti-bots of the target websites.
Here are the steps you must follow to implement a simple web scraping request using Web Unblocker.
You can create an account on the dashboard with a 7-day free trial.
After successfully creating your account, you can set your API key username and password from the dashboard. These API key credentials will be used later in the code.
You need a library for performing HTTP requests. We will use the `requests` library to send HTTP requests to the Web Unblocker API and capture the response.
pip install requests
In your Python file, import the modules using the following import statement:
import requests
Create the proxies dictionary to connect to Web Unblocker and then define the headers dictionary that’ll instruct Web Unblocker to use JavaScript rendering. See the documentation for more details.
# Define proxy dict. Don't forget to pass your Web Unblocker credentials (username and password)
proxies = {
    "http": "http://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
    "https": "https://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
}
headers = {
    "X-Oxylabs-Render": "html"
}
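Hard-coding credentials into the proxy URLs works for a quick test, but it is easy to leak them through version control, and a password containing characters such as `@` or `:` would break the URL. The sketch below shows one possible alternative: reading the credentials from environment variables and percent-encoding them. The variable names `WEB_UNBLOCKER_USERNAME` and `WEB_UNBLOCKER_PASSWORD` and both helper functions are assumptions made for this example, not something Web Unblocker prescribes.

```python
import os
from urllib.parse import quote

# Hypothetical environment variable names; use whatever your setup defines.
USERNAME_VAR = "WEB_UNBLOCKER_USERNAME"
PASSWORD_VAR = "WEB_UNBLOCKER_PASSWORD"


def build_proxies(username, password, host="unblock.oxylabs.io", port=60000):
    """Return a requests-style proxies dict with URL-safe credentials."""
    # Percent-encode so characters like "@" or ":" don't break the proxy URL
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return {
        "http": f"http://{auth}@{host}:{port}",
        "https": f"https://{auth}@{host}:{port}",
    }


def proxies_from_env(environ=os.environ):
    """Read credentials from the environment and fail early if one is missing."""
    try:
        return build_proxies(environ[USERNAME_VAR], environ[PASSWORD_VAR])
    except KeyError as missing:
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable"
        ) from None
```

With both variables exported in the shell, `proxies = proxies_from_env()` yields a dictionary in the same shape as the one defined above.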
Perform your request by specifying the URL, request type, and proxy by using the following code.
response = requests.request(
    "GET",
    "https://sandbox.oxylabs.io/products",
    verify=False,  # Ignore the certificate
    proxies=proxies,
    headers=headers
)
Write the following code to print the response and save it in an HTML file.
# Print result page to stdout
print(response.text)
# Save returned HTML to result.html file (UTF-8 avoids platform-dependent encoding errors)
with open("result.html", "w", encoding="utf-8") as f:
    f.write(response.text)
Execute the code and test the output. If the output HTML file has actual page contents, the script successfully bypassed the CAPTCHA. Here is what our complete code looks like.
# Import the modules
import requests

# Define proxy dict. Don't forget to put your real user and pass here as well.
proxies = {
    "http": "http://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
    "https": "https://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
}
headers = {
    "X-Oxylabs-Render": "html"
}
response = requests.request(
    "GET",
    "https://sandbox.oxylabs.io/products",
    verify=False,  # Ignore the certificate
    proxies=proxies,
    headers=headers
)
# Print result page to stdout
print(response.text)
# Save returned HTML to result.html file (UTF-8 avoids platform-dependent encoding errors)
with open("result.html", "w", encoding="utf-8") as f:
    f.write(response.text)
Here is the snapshot of the output HTML displayed on the screen:
Here's a part of a snapshot of how the browser renders this HTML:
Visit this page on our documentation to see how you can save the PNG file. The above snapshot makes it clear that we accessed the products page without any blocks.
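When running many requests, inspecting each result by eye is impractical, so a rough programmatic check can help. The sketch below is a heuristic only, and an assumption on our part rather than a feature of Web Unblocker: it looks for the widget class names used by reCAPTCHA (`g-recaptcha`) and hCaptcha (`h-captcha`), plus the generic keyword `captcha`, and treats the 403 and 429 status codes as blocks. A legitimate page that merely mentions the word would trigger a false positive, so the marker list may need tuning per target site.

```python
# Heuristic markers (an assumption, not an exhaustive list).
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha", "captcha")


def find_captcha_markers(html):
    """Return the markers that occur in the HTML, compared case-insensitively."""
    lowered = html.lower()
    return [marker for marker in CAPTCHA_MARKERS if marker in lowered]


def looks_blocked(status_code, html):
    """Treat common block status codes or any CAPTCHA marker as a block."""
    return status_code in (403, 429) or bool(find_captcha_markers(html))
```

With the `response` object from the script above, `looks_blocked(response.status_code, response.text)` returning `False` suggests the request went through.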
Playwright, when combined with the `playwright-stealth` package, can effectively be used to scrape content behind the sites with ordinary CAPTCHA protection. Learn more about how to perform web scraping with Playwright, configure Playwright with proxies, and combine Scrapy with Playwright in our blog posts. If you're still wondering which proxies fit your needs best, get a free trial for our premium proxies to make the right decision.
Other known headless browsers can perform better in different situations, so check out our posts about bypassing CAPTCHAs with Selenium and using Puppeteer to bypass CAPTCHAs. If you're interested in scraping Amazon, then see this post on how to bypass Amazon CAPTCHAs.
However, bypassing CAPTCHAs (e.g., reCAPTCHA) on websites with advanced anti-bot systems requires a more sophisticated and intelligent solution. Oxylabs’ Web Unblocker automatically combines the latest AI techniques with bypassing schemes (e.g., proxies and IP rotation, setting realistic fingerprints, and JS rendering) to get past advanced anti-bot systems. Therefore, it is a more secure, convenient, and reliable solution for bypassing CAPTCHAs and scraping data at scale.
Playwright alone can't solve CAPTCHAs. It reduces the chances of triggering a CAPTCHA test by making your network requests look more organic and human-like. Hence, there's always a chance to receive a CAPTCHA when web scraping with Playwright. Nonetheless, people often turn to various CAPTCHA-solving services, such as 2Captcha, to achieve desired results when Playwright isn't enough.
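Since a CAPTCHA can still appear occasionally, one common way to cope is to retry the request after a randomized, growing delay instead of immediately hammering the site. The following is a minimal sketch of that idea under stated assumptions: `fetch` and `is_blocked` are hypothetical callables you supply (for example, a function that loads the page and a function that checks the result for a CAPTCHA), and the delay values are illustrative defaults, not recommendations from Playwright.

```python
import random
import time


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Yield one randomized ("full jitter") delay per retry, capped at `cap` seconds."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_retries(fetch, is_blocked, attempts=4, sleep=time.sleep):
    """Call fetch() until is_blocked(result) is False or attempts run out.

    `fetch` and `is_blocked` are hypothetical callables supplied by the caller.
    Returns the last result, whether or not it is still blocked.
    """
    result = fetch()
    for delay in backoff_delays(attempts - 1):
        if not is_blocked(result):
            break
        sleep(delay)
        result = fetch()
    return result
```

The `sleep` parameter exists so the waiting can be replaced in tests. Retrying does not solve a CAPTCHA. It only gives the next attempt a chance of not triggering one.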
Depending on the website and its anti-scraping system, it can be enough to simply use Playwright to bypass CAPTCHA challenges. In other situations where vanilla Playwright doesn't bypass CAPTCHAs, you can try out a stealth plugin like playwright-stealth in Python. When it comes to more complex websites, a dedicated tool like Web Unblocker can make the entire CAPTCHA bypass process seamless.
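When relying on vanilla Playwright, one simple adjustment is to give the browser context settings that resemble an ordinary desktop browser instead of the defaults. The sketch below builds keyword arguments for Playwright's `browser.new_context()`, which accepts `viewport`, `locale`, and `timezone_id`. The helper name and the specific values are assumptions chosen for illustration, and whether they help depends entirely on the target site.

```python
import random

# Illustrative values only (assumptions); pick ones that match the audience
# your traffic is supposed to resemble.
VIEWPORTS = [(1920, 1080), (1536, 864), (1366, 768)]


def realistic_context_options(locale="en-US", timezone_id="America/New_York", rng=random):
    """Build keyword arguments for Playwright's browser.new_context()."""
    width, height = rng.choice(VIEWPORTS)
    return {
        "viewport": {"width": width, "height": height},
        "locale": locale,
        "timezone_id": timezone_id,
    }


# Usage inside a sync_playwright() block:
# context = browser.new_context(**realistic_context_options())
```

Keeping the locale and time zone consistent with the IP address the request comes from is generally more plausible than mixing unrelated values.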
It's always best to seek legal advice from a professional who can evaluate your use case. As a general rule of thumb, automated public data-gathering activities, including CAPTCHA bypass processes, may not be illegal when performed responsibly and ethically. It's essential to respect the website's rules and ensure your operations don't affect the website's infrastructure and performance.
CAPTCHA is a universal term for tests designed to distinguish between humans and bots. reCAPTCHA, on the other hand, is a specific type of CAPTCHA developed by Google that uses more advanced techniques like machine learning and behavioral analysis. When it comes to avoiding it, standard Playwright reCAPTCHA bypassing techniques may not always work, in which case you need a more sophisticated tool like Web Scraper API or Web Unblocker.
About the author
Yelyzaveta Nechytailo
Senior Content Manager
Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.