How to Bypass CAPTCHA With Playwright

Yelyzaveta Hayrapetyan

Last updated on 2024-10-11

This article was checked by Povilas Kudriavcevas, R&D Engineer at Oxylabs.

CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) have become vital to website security. Once the security apparatus of the website becomes suspicious of access (e.g., the access pattern does not follow normal human behavior), it loads a CAPTCHA (e.g., text, sound, and image puzzles), preventing bots from further access.

Bypassing a CAPTCHA challenge once it loads can be extremely difficult. However, knowing how CAPTCHAs work and applying a few known methods can help your script appear more human to the web firewall. That way, you can often prevent a CAPTCHA from loading at all. We call this bypassing, or avoiding, a CAPTCHA.

This step-by-step tutorial demonstrates how to use Playwright to bypass CAPTCHA challenges using Python. The tutorial will also discuss the perks of using Oxylabs’ Web Unblocker instead of the `playwright-stealth` library.

Note: Bypassing CAPTCHAs for illegal or malicious motives violates ethical and legal standards. This tutorial is for educational purposes only, and we encourage readers to thoroughly read the Terms of Service of the target website to avoid legal issues.

Bypass CAPTCHA with Playwright

Playwright provides a robust and user-friendly browser automation tool that can interact with web pages. It allows developers to perform tasks such as clicking elements, filling out forms, and extracting data from dynamic websites. Its support for multiple browsers (like Chromium, Firefox, and WebKit) ensures cross-browser compatibility. Additionally, Playwright's support for headless mode allows for hidden browser interactions, making it suitable for web scraping tasks.

Relying on Playwright alone to avoid CAPTCHAs can be challenging, as websites may detect traffic from automated and headless scripts. Fortunately, the `playwright-stealth` package can help.

Combining the stealth package with Playwright offers a powerful combo to bypass CAPTCHAs. The stealth package makes Playwright's headless browser instances appear more human to websites, which reduces the chances of being detected and served a CAPTCHA.

Let’s demonstrate how to handle CAPTCHA in Playwright by creating a Python script that opens a web link in headless mode. It then captures a screenshot of the target page and saves it to local file storage. The script is successful if the screenshot shows the actual contents of the page instead of a CAPTCHA or reCAPTCHA screen.

The steps below show how to set up the stealth package with Playwright in Python and build such a script.

1. Install dependencies

Install the Playwright library and the stealth package.

pip install playwright playwright-stealth

2. Import modules

Use the synchronous version of the Playwright library for a straightforward and linear program flow.

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

3. Create a headless browser instance

Define the `capture_screenshot()` function that encapsulates the whole code to open a headless browser instance, visit the URL, and capture the screenshot with Playwright. In this function, create a new `sync_playwright` instance and then use it to launch a headless Chromium browser.

# Define the function to capture the screenshot.
def capture_screenshot():
    # Create a playwright instance.
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)

        # Create a new context and page.
        context = browser.new_context()
        page = context.new_page()

4. Apply the stealth settings

After creating the browser context and page, apply the stealth settings to the page using the `playwright-stealth` package. Stealth settings reduce the chances of automated access being detected by hiding signs of the browser's automated behavior.

        # Apply the stealth settings.
        stealth_sync(page)
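
One way to confirm that the stealth settings took effect is to read `navigator.webdriver`, a browser property that automated Chromium sessions typically report as `true`. The sketch below is separate from the tutorial script: `is_flagged_as_automated()` is a hypothetical helper that accepts any object exposing Playwright's `page.evaluate()` method, and it assumes that `playwright-stealth` masks this property, which may differ between package versions.

```python
def is_flagged_as_automated(page) -> bool:
    """Return True if the page reports navigator.webdriver as truthy.

    `page` is expected to be a Playwright Page (or any object exposing a
    compatible evaluate() method). The helper name is hypothetical.
    """
    # evaluate() runs the JavaScript in the page and returns the result
    # converted to a Python value (True, False, or None for undefined).
    return bool(page.evaluate("() => navigator.webdriver"))
```

Inside `capture_screenshot()`, calling `is_flagged_as_automated(page)` after `stealth_sync(page)` and `page.goto(url)` should ideally return `False`. This covers only one detection signal, so a `False` result does not guarantee that the site sees the browser as human.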

5. Navigate to the page

Next, specify the target URL and navigate to it using the `goto()` page method.

        # Navigate to the website.
        url = "https://sandbox.oxylabs.io/products"
        page.goto(url)

6. Take a screenshot

Wait for the page to load completely, take the screenshot, and close the browser.

        # Wait for the webpage to fully load.
        page.wait_for_load_state("networkidle")

        # Take a screenshot.
        screenshot_filename = "screenshot.png"
        page.screenshot(path=screenshot_filename)

        # Close the browser.
        browser.close()
        print("Done! You can check the screenshot...")


capture_screenshot()

7. Execute and test

Here is what our complete code looks like:

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync


def capture_screenshot():
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()

        stealth_sync(page)

        url = "https://sandbox.oxylabs.io/products"
        page.goto(url)
        page.wait_for_load_state("networkidle")

        screenshot_filename = "screenshot.png"
        page.screenshot(path=screenshot_filename)

        browser.close()
        print("Done! You can check the screenshot...")


capture_screenshot()

Executing the code saves the screenshot. Here's what it looks like in our case:

If the screenshot shows the actual content of the page you're trying to access, you've prevented a CAPTCHA from loading on this page.
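
Checking the screenshot by eye works for a single run, but the same check can be roughly automated by scanning the page's HTML for CAPTCHA widget markers. The following is a minimal heuristic sketch, not a reliable detector: `looks_like_captcha()` and the `CAPTCHA_MARKERS` tuple are hypothetical names, and the marker strings are assumptions based on the class names commonly used by reCAPTCHA and hCaptcha widgets. A site that embeds one of these widgets on an ordinary page would produce a false positive.

```python
# Assumed marker strings; real challenge pages vary by vendor and site.
CAPTCHA_MARKERS = (
    "g-recaptcha",  # Class name commonly used by the reCAPTCHA widget.
    "h-captcha",    # Class name commonly used by the hCaptcha widget.
)


def looks_like_captcha(html: str) -> bool:
    """Heuristically decide whether HTML appears to contain a CAPTCHA."""
    # Compare in lowercase so the check is case-insensitive.
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

In the Playwright script, the helper could be called as `looks_like_captcha(page.content())` right before taking the screenshot.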

Bypass CAPTCHA with Web Unblocker

Oxylabs’ Web Unblocker employs advanced AI techniques to help users access publicly available information behind the CAPTCHA. Bypassing CAPTCHAs with our advanced proxy solution is easy. You just need to send a simple query. Web Unblocker will automatically choose the fastest CAPTCHA proxy, attach all essential headers, and return the response HTML bypassing any anti-bots of the target websites.

Here are the steps you must follow to implement a simple web scraping request using Web Unblocker.

1. Create an account

You can create an account on the dashboard with a 7-day free trial.

2. Create API key

After successfully creating your account, you can set your API key username and password from the dashboard. These API key credentials will be used later in the code.

3. Install the requests module

You need a library that can perform HTTP requests. We will use the `requests` library to send HTTP requests to the Web Unblocker API and capture the response.

pip install requests

4. Import the required modules

In your Python file, import the modules using the following import statement:

import requests

5. Define proxy and headers

Create the proxies dictionary to connect to Web Unblocker and then define the headers dictionary that’ll instruct Web Unblocker to use JavaScript rendering. See the documentation for more details.

# Define proxy dict and pass your Web Unblocker credentials.
proxies = {
    "http": "http://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
    "https": "https://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
}

headers = {
    "X-Oxylabs-Render": "html"
}
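
Hard-coding credentials in the script is fine for a quick test, but they can also be read from the environment. The sketch below builds the same `proxies` dictionary that way. `build_proxies()` and the variable names `OXYLABS_USERNAME` and `OXYLABS_PASSWORD` are hypothetical and chosen only for illustration. The host and port are the ones used in this tutorial.

```python
import os
from urllib.parse import quote


def build_proxies(host: str = "unblock.oxylabs.io", port: int = 60000) -> dict:
    """Build the proxies dict from environment variables.

    OXYLABS_USERNAME and OXYLABS_PASSWORD are hypothetical variable names.
    A KeyError is raised if either variable is not set.
    """
    # Percent-encode the credentials so characters such as "@" or ":"
    # do not break the structure of the proxy URL.
    username = quote(os.environ["OXYLABS_USERNAME"], safe="")
    password = quote(os.environ["OXYLABS_PASSWORD"], safe="")
    endpoint = f"{username}:{password}@{host}:{port}"
    return {
        "http": f"http://{endpoint}",
        "https": f"https://{endpoint}",
    }
```

After exporting both variables in the shell, `proxies = build_proxies()` can replace the hard-coded dictionary.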

6. Make a request

Perform your request by specifying the URL, request type, and proxy by using the following code.

response = requests.get(
    "https://sandbox.oxylabs.io/products",
    verify=False,  # Ignore the certificate.
    proxies=proxies,
    headers=headers
)
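
A single `requests.get()` call has no timeout and no handling for failed responses. The sketch below wraps the same request in a small retry loop. It is an illustration under stated assumptions rather than a recommended configuration: `fetch_with_retries()` is a hypothetical helper, the `attempts`, `timeout`, and `pause` defaults are arbitrary, and treating any status other than 200 as a failure is a simplification. The `session` argument is any object with a `requests`-style `get()` method, such as `requests.Session()`.

```python
import time


def fetch_with_retries(session, url, *, proxies, headers,
                       attempts=3, timeout=180, pause=2.0):
    """Fetch `url` through the proxy, retrying on non-200 responses.

    All default values are illustrative assumptions.
    """
    last_status = None
    for attempt in range(1, attempts + 1):
        response = session.get(
            url,
            proxies=proxies,
            headers=headers,
            verify=False,     # Same certificate handling as the tutorial.
            timeout=timeout,  # Avoid waiting forever on a stalled request.
        )
        if response.status_code == 200:
            return response
        last_status = response.status_code
        # Pause between attempts, but not after the final one.
        if attempt < attempts:
            time.sleep(pause)
    raise RuntimeError(
        f"Request failed after {attempts} attempts (last status: {last_status})"
    )
```

With `requests` installed, calling `fetch_with_retries(requests.Session(), "https://sandbox.oxylabs.io/products", proxies=proxies, headers=headers)` would stand in for the bare `requests.get()` call.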

7. Save the response

Write the following code to print the response and save it in an HTML file.

# Print result page to stdout.
print(response.text)

# Save returned HTML to result.html file.
with open("result.html", "w", encoding="utf-8") as f:
    f.write(response.text)

8. Execute and check

Execute the code and test the output. If the output HTML file has actual page contents, the script successfully bypassed the CAPTCHA. Here is what our complete code looks like.

import requests


proxies = {
    "http": "http://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
    "https": "https://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
}

headers = {
    "X-Oxylabs-Render": "html"
}

response = requests.get(
    "https://sandbox.oxylabs.io/products",
    verify=False,  # Ignore the certificate.
    proxies=proxies,
    headers=headers
)

print(response.text)

with open("result.html", "w", encoding="utf-8") as f:
    f.write(response.text)

Here is the snapshot of the output HTML displayed on the screen:

Here's a part of a snapshot of how the browser renders this HTML:

Visit this page in our documentation to see how you can save the result as a PNG file. The above snapshot makes it clear that we accessed the products page without any blocks.

Conclusion

Playwright, when combined with the `playwright-stealth` package, can be used effectively to scrape content from sites with ordinary CAPTCHA protection. Learn more about how to perform web scraping with Playwright, configure Playwright with proxies, leverage Playwright best practices, and combine Scrapy with Playwright in our blog posts. If you're still wondering which proxies fit your needs best, get a free trial for our premium proxies to make the right decision.

Other known headless browsers can perform better in different situations, so check out our posts about bypassing CAPTCHAs with Selenium and using Puppeteer to bypass CAPTCHAs. If you're interested in scraping Amazon, then see this post on how to bypass Amazon CAPTCHAs.

However, bypassing CAPTCHAs (e.g., reCAPTCHA) on websites with advanced anti-bot systems requires a more sophisticated solution. Oxylabs’ Web Unblocker automatically combines the latest AI techniques with bypassing schemes (e.g., proxies and IP rotation, realistic fingerprints, and JS rendering) to get past advanced anti-bot systems. Therefore, it is a more secure, convenient, and reliable solution for bypassing CAPTCHAs and scraping data at scale.

Frequently Asked Questions

Playwright alone can't solve CAPTCHAs. It reduces the chances of triggering a CAPTCHA test by making your network requests look more organic and human-like. Hence, there's always a chance of receiving a CAPTCHA when web scraping with Playwright. When Playwright isn't enough, people often turn to CAPTCHA-solving services, such as 2Captcha, to achieve the desired results.

Depending on the website and its anti-scraping system, it can be enough to simply use Playwright to bypass CAPTCHA challenges. In other situations where vanilla Playwright doesn't bypass CAPTCHAs, you can try out a stealth plugin like playwright-stealth in Python. When it comes to more complex websites, a dedicated tool like Web Unblocker can make the entire CAPTCHA bypass process seamless.

It's always best to seek legal advice from a professional who can evaluate your use case. As a general rule of thumb, automated public data-gathering activities, including CAPTCHA bypass processes, may not be illegal when performed responsibly and ethically. It's essential to respect the website's rules and ensure your operations don't affect the website's infrastructure and performance.

CAPTCHA is a universal term for tests designed to distinguish between humans and bots. reCAPTCHA, on the other hand, is a specific type of CAPTCHA developed by Google that uses more advanced techniques like machine learning and behavioral analysis. When it comes to avoiding it, standard Playwright reCAPTCHA bypassing techniques may not always work, in which case a more sophisticated tool like Web Scraper API or Web Unblocker is required.

About the author

Yelyzaveta Hayrapetyan

Senior Technical Copywriter

Yelyzaveta Hayrapetyan is a Senior Technical Copywriter at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse herself in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.