How to Bypass CAPTCHA With Playwright

Yelyzaveta Nechytailo

2024-10-11 (5 min read)
CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) have become vital to website security. Once the security apparatus of the website becomes suspicious of access (e.g., the access pattern does not follow normal human behavior), it loads a CAPTCHA (e.g., text, sound, and image puzzles), preventing bots from further access.

Bypassing a CAPTCHA challenge once it loads can be extremely difficult. However, knowing how CAPTCHAs work and applying a few known methods can make your script appear more human to the web firewall, which can often prevent the CAPTCHA from loading in the first place. We call this bypassing, or avoiding, a CAPTCHA.

This step-by-step tutorial demonstrates how to use Playwright to bypass CAPTCHA challenges using Python. The tutorial will also discuss the perks of using Oxylabs’ Web Unblocker instead of the `playwright-stealth` library. 

Note: Bypassing CAPTCHAs for illegal or malicious motives violates ethical and legal standards. This tutorial is for educational purposes only, and we encourage readers to thoroughly read the Terms of Service of the target website to avoid legal issues.

    Bypass CAPTCHA with Playwright

    Playwright provides a robust and user-friendly browser automation tool that can interact with web pages. It allows developers to perform tasks such as clicking elements, filling out forms, and extracting data from dynamic websites. Its support for multiple browsers (like Chromium, Firefox, and WebKit) ensures cross-browser compatibility. Additionally, Playwright's support for headless mode allows for hidden browser interactions, making it suitable for web scraping tasks.

    Relying on Playwright alone to bypass CAPTCHAs can be challenging, as websites may detect traffic from automated and headless scripts. Fortunately, the `playwright-stealth` package can help.

    Combining the stealth package with Playwright makes headless browser instances appear more human to websites, which reduces the chances of being detected and served a CAPTCHA.

    Let’s demonstrate how to handle CAPTCHA in Playwright by creating a Python script that opens a web page in headless mode, captures a screenshot of it, and saves the screenshot to local storage. The script is successful if the screenshot shows the actual contents of the page instead of a CAPTCHA or reCAPTCHA screen.

    The following steps set up the stealth package with Playwright in Python and build the script.

    1. Install dependencies

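    Install the Playwright library and the stealth package, then download the Chromium build that Playwright drives. The code in this tutorial imports `stealth_sync` from `playwright-stealth`; if that import fails with your installed version, check the package's release notes, as the helper may not be available in every release.

    pip install playwright playwright-stealth
    playwright install chromium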

    2. Import modules 

    Use the synchronous version of the Playwright library for a straightforward and linear program flow.

    from playwright.sync_api import sync_playwright
    from playwright_stealth import stealth_sync

    3. Create a headless browser instance

    Define the `capture_screenshot()` function, which encapsulates all the code needed to open a headless browser instance, visit the URL, and capture the screenshot. In this function, enter the `sync_playwright()` context manager and use the resulting instance to launch a headless Chromium browser.

    # Define the function to capture the screenshot
    def capture_screenshot():
        # Create a playwright instance
        with sync_playwright() as pw:
            browser = pw.chromium.launch(headless=True)
    
            # Create a new context and page
            context = browser.new_context()
            page = context.new_page()

    4. Apply the stealth settings

    After creating the page, apply the stealth settings to it with the `playwright-stealth` package. These settings mask browser properties that commonly reveal automation, which reduces the chance that the site detects automated access and serves a CAPTCHA.

            # Apply the stealth settings
            stealth_sync(page)
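To get a feel for what the stealth settings are meant to hide, one option is to read a few browser properties from inside the page and compare them with and without `stealth_sync(page)`. The sketch below is only an illustration: the chosen properties and the rules in `suspicious_signals()` are assumptions made for this example, not a list of what any particular website checks.

```python
# JavaScript evaluated inside the page. The selected properties are an
# illustrative assumption, not an official list of detection signals.
FINGERPRINT_JS = """() => ({
    webdriver: navigator.webdriver,
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    pluginCount: navigator.plugins.length,
})"""


def collect_fingerprint(page):
    """Return the selected navigator properties as a dict.

    `page` is expected to be a Playwright Page; any object exposing an
    `evaluate()` method works, which keeps the helper easy to test.
    """
    return page.evaluate(FINGERPRINT_JS)


def suspicious_signals(fingerprint):
    """List hints (hypothetical rules) that suggest an automated browser."""
    signals = []
    if fingerprint.get("webdriver"):
        signals.append("navigator.webdriver is true")
    if "HeadlessChrome" in fingerprint.get("userAgent", ""):
        signals.append("user agent contains 'HeadlessChrome'")
    if not fingerprint.get("languages"):
        signals.append("navigator.languages is empty")
    if fingerprint.get("pluginCount", 0) == 0:
        signals.append("no browser plugins reported")
    return signals
```

Calling `suspicious_signals(collect_fingerprint(page))` after `page.goto(url)`, once with and once without the stealth settings, gives a rough picture of which properties changed.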

    5. Navigate to the page

    Next, set the target URL and open it with the page's `goto()` method.

            # Navigate to the website
            url = "https://sandbox.oxylabs.io/products"
            page.goto(url)

    6. Take a screenshot

    Wait for the page to load completely, take the screenshot, and close the browser.

            # Wait for the webpage to fully load.
            page.wait_for_load_state("networkidle")
    
            # Take a screenshot
            screenshot_filename = "screenshot.png"
            page.screenshot(path=screenshot_filename)
    
            # Close the browser
            browser.close()
    
            print("Done! You can check the screenshot...")
    
    capture_screenshot()

    7. Execute and test

    Here is what our complete code looks like:

    # Import the required modules
    from playwright.sync_api import sync_playwright
    from playwright_stealth import stealth_sync
    
    # Define the function to capture the screenshot
    def capture_screenshot():
        # Create a playwright instance
        with sync_playwright() as pw:
            browser = pw.chromium.launch(headless=True)
    
            # Create a new context and page
            context = browser.new_context()
            page = context.new_page()
    
            # Apply the stealth settings
            stealth_sync(page)
    
            # Navigate to the website
            url = "https://sandbox.oxylabs.io/products"
            page.goto(url)
    
            # Wait for the webpage to fully load.
            page.wait_for_load_state("networkidle")
    
            # Take a screenshot
            screenshot_filename = "screenshot.png"
            page.screenshot(path=screenshot_filename)
    
            # Close the browser
            browser.close()
    
            print("Done! You can check the screenshot...")
    
    capture_screenshot()

    Executing the code saves the screenshot. Here's what it looks like in our case:

    Screenshot

    If the screenshot shows the actual content of the page you're trying to access, it means you've just avoided a CAPTCHA from loading on this page.
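Checking the screenshot by eye works for a single run, but the same check can be approximated in code by looking at the page's HTML. The sketch below is a simple heuristic: it searches for the class names used by a few common CAPTCHA widgets. The marker list is an assumption chosen for illustration and is far from exhaustive, so a clean result doesn't guarantee that no challenge was shown.

```python
# Class names used in the markup of some common CAPTCHA widgets.
# This list is an illustrative assumption, not a complete catalogue.
CAPTCHA_MARKERS = (
    "g-recaptcha",   # Google reCAPTCHA widget
    "h-captcha",     # hCaptcha widget
    "cf-turnstile",  # Cloudflare Turnstile widget
)


def find_captcha_markers(html):
    """Return the known markers that occur in the given HTML string."""
    lowered = html.lower()
    return [marker for marker in CAPTCHA_MARKERS if marker in lowered]


def looks_like_captcha(html):
    """Return True if the HTML contains at least one known marker."""
    return bool(find_captcha_markers(html))
```

In the script above, this could be called as `looks_like_captcha(page.content())` right before taking the screenshot.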

    Bypass CAPTCHA with Web Unblocker

    Oxylabs’ Web Unblocker employs advanced AI techniques to help users access publicly available information that sits behind a CAPTCHA. Using it is straightforward: you send a regular request through the Web Unblocker proxy endpoint, and it automatically chooses the fastest CAPTCHA proxy, attaches all essential headers, and returns the page HTML while handling the target website's anti-bot measures.

    Follow these steps to implement a simple web scraping request using Web Unblocker.

    1. Create an account

    You can create an account on the dashboard with a 7-day free trial. 

    2. Create API credentials

    After creating your account, set the username and password for your API user in the dashboard. You will use these credentials later in the code.

    3. Install the requests module

    You need a library that can perform HTTP requests. We will use the `requests` library to send HTTP requests through Web Unblocker and capture the response.

    pip install requests

    4. Import the required modules

    In your Python file, import the modules using the following import statement:

    import requests

    5. Define proxy and headers 

    Create the proxies dictionary to connect to Web Unblocker and then define the headers dictionary that’ll instruct Web Unblocker to use JavaScript rendering. See the documentation for more details. 

    # Define proxy dict. Don't forget to pass your Web Unblocker credentials (username and password)
    proxies = {
       "http": "http://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
       "https": "https://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
    }
    
    headers = {
        "X-Oxylabs-Render": "html"
    }
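If you'd rather not hardcode credentials in the script, one possible approach is to read them from environment variables and build the proxies dictionary from those values. In the sketch below, the variable names `WEB_UNBLOCKER_USERNAME` and `WEB_UNBLOCKER_PASSWORD` and both helper functions are hypothetical names chosen for this example; the host and port are the ones used in this tutorial. The credentials are percent-encoded so that characters such as `@` or `:` in a password don't break the proxy URL.

```python
import os
from urllib.parse import quote


def build_proxies(username, password, host="unblock.oxylabs.io", port=60000):
    """Build a requests-style proxies dict with URL-safe credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    endpoint = f"{user}:{pwd}@{host}:{port}"
    return {
        "http": f"http://{endpoint}",
        "https": f"https://{endpoint}",
    }


def proxies_from_env(env=None):
    """Read credentials from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return build_proxies(
        env["WEB_UNBLOCKER_USERNAME"],
        env["WEB_UNBLOCKER_PASSWORD"],
    )
```

With the two variables exported in your shell, `proxies = proxies_from_env()` can stand in for the hardcoded dictionary.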

    6. Make a request

    Perform your request by specifying the URL, request type, and proxy by using the following code.

    response = requests.request(
       "GET",
       "https://sandbox.oxylabs.io/products",
       verify=False,  # Ignore the certificate
       proxies=proxies,
       headers=headers
    )

    7. Save the response

    Write the following code to print the response and save it in an HTML file.

    # Print result page to stdout
    print(response.text)
    
    # Save returned HTML to result.html file
    # An explicit encoding avoids errors on systems whose default is not UTF-8
    with open("result.html", "w", encoding="utf-8") as f:
        f.write(response.text)

    8. Execute and check

    Execute the code and check the output. If the output HTML file contains the actual page contents, the script successfully bypassed the CAPTCHA. Here is what our complete code looks like:

    # Import the modules
    import requests
    
    # Define proxy dict. Don't forget to put your real user and pass here as well.
    proxies = {
       "http": "http://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
       "https": "https://USERNAME:PASSWORD@unblock.oxylabs.io:60000",
    }
    
    headers = {
        "X-Oxylabs-Render": "html"
    }
    
    response = requests.request(
       "GET",
       "https://sandbox.oxylabs.io/products",
       verify=False,  # Ignore the certificate
       proxies=proxies,
       headers=headers
    )
    
    # Print result page to stdout
    print(response.text)
    
    # Save returned HTML to result.html file
    # An explicit encoding avoids errors on systems whose default is not UTF-8
    with open("result.html", "w", encoding="utf-8") as f:
        f.write(response.text)

    Here is the snapshot of the output HTML displayed on the screen:

    output HTML

    Here's a part of a snapshot of how the browser renders this HTML:

    Snapshot PNG of a rendered page

    Visit this page in our documentation to see how you can save the PNG file. The above snapshot shows that we accessed the products page without any blocks.
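Besides opening the saved file in a browser, a quick programmatic check is to read the page title from the returned HTML and see whether it matches the page you expected. The sketch below uses only Python's standard library; `TitleParser` and `extract_title()` are hypothetical helper names introduced for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. inside SVG) are skipped
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped page title, or an empty string if none exists."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()
```

For the script above, `extract_title(response.text)` returns the title of whatever page came back, so a block page or challenge page usually stands out at a glance.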

    Conclusion

    Playwright, combined with the `playwright-stealth` package, can be effective for scraping sites with ordinary CAPTCHA protection. Learn more about how to perform web scraping with Playwright, configure Playwright with proxies, and combine Scrapy with Playwright in our blog posts. If you're still wondering which proxies fit your needs best, get a free trial for our premium proxies to make the right decision.

    Other known headless browsers can perform better in different situations, so check out our posts about bypassing CAPTCHAs with Selenium and using Puppeteer to bypass CAPTCHAs. If you're interested in scraping Amazon, then see this post on how to bypass Amazon CAPTCHAs.

    However, bypassing CAPTCHAs (e.g., reCAPTCHA) on websites with advanced anti-bot systems requires a more sophisticated bypassing solution. Oxylabs’ Web Unblocker automatically combines the latest AI techniques with bypassing schemes (e.g., proxies and IP rotation, realistic fingerprints, and JS rendering) to get past advanced anti-bot systems. Therefore, it is a more secure, convenient, and reliable solution for bypassing CAPTCHAs and scraping data at scale.

    Frequently Asked Questions

    Can Playwright solve CAPTCHA?

    Playwright alone can't solve CAPTCHAs. It reduces the chances of triggering a CAPTCHA test by making your network requests look more organic and human-like. Hence, there's always a chance of receiving a CAPTCHA when web scraping with Playwright. When Playwright isn't enough, people often turn to CAPTCHA-solving services, such as 2Captcha, to achieve the desired results.

    How to get past CAPTCHA with Playwright?

    Depending on the website and its anti-scraping system, it can be enough to simply use Playwright to bypass CAPTCHA challenges. In other situations where vanilla Playwright doesn't bypass CAPTCHAs, you can try out a stealth plugin like playwright-stealth in Python. When it comes to more complex websites, a dedicated tool like Web Unblocker can make the entire CAPTCHA bypass process seamless.
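The escalation described in this answer (plain Playwright first, then a stealth plugin, then a dedicated tool) can be sketched as a small fallback loop. This is only an illustration under stated assumptions: `fetch_with_fallbacks()` is a hypothetical helper, each strategy is passed in as a function that returns the page HTML, and `is_blocked` is whatever check you choose to decide that a response is a CAPTCHA or block page.

```python
def fetch_with_fallbacks(fetchers, is_blocked):
    """Try each (name, fetch) pair in order until one is not blocked.

    Returns (name, html) for the first strategy whose result passes the
    `is_blocked` check, or (None, None) if every strategy failed.
    """
    for name, fetch in fetchers:
        try:
            html = fetch()
        except Exception:
            # A failed strategy simply hands over to the next one
            continue
        if not is_blocked(html):
            return name, html
    return None, None
```

A call might look like `fetch_with_fallbacks([("plain", fetch_plain), ("stealth", fetch_stealth), ("unblocker", fetch_unblocker)], is_blocked)`, where the three fetch functions are placeholders for your own implementations of each approach.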

    Is bypassing CAPTCHA illegal?

    It's always best to seek legal advice from a professional who can evaluate your use case. As a general rule of thumb, automated public data-gathering activities, including CAPTCHA bypass processes, may not be illegal when performed responsibly and ethically. It's essential to respect the website's rules and ensure your operations don't affect the website's infrastructure and performance.

    What is the difference between CAPTCHA and reCAPTCHA?

    CAPTCHA is a universal term for tests designed to distinguish between humans and bots. reCAPTCHA, on the other hand, is a specific type of CAPTCHA developed by Google that uses more advanced techniques like machine learning and behavioral analysis. When it comes to avoiding it, standard Playwright reCAPTCHA bypassing techniques may not always work, so you may need a more sophisticated tool like Web Scraper API or Web Unblocker.

    About the author

    Yelyzaveta Nechytailo

    Senior Content Manager

    Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.

    All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
