
How to Bypass CAPTCHA in Web Scraping Using Python

Yelyzaveta Nechytailo

2024-10-03 · 7 min read

Short for Completely Automated Public Turing test to tell Computers and Humans Apart, CAPTCHA is a test that determines whether the user accessing a website or its data is human. By presenting challenges that are hard for computers to solve, CAPTCHAs quickly identify suspicious users and modern bots, preventing activities such as scraping and crawling.

This article will provide insights into bypassing CAPTCHA challenges in web scraping. We’ll talk about the different types of tests that can be encountered in the modern internet landscape and discuss useful anti-CAPTCHA solutions to implement in your data-gathering operations.

For your convenience, we have also prepared this tutorial in video format. The video tutorial demonstrates the same steps discussed in this article but with a demo website as the target.

What are the different types of CAPTCHAs?

Generally, there are three basic CAPTCHA types: text-based, image-based, and sound-based. In addition, you're likely to encounter dedicated CAPTCHA services, most notably hCAPTCHA and Google reCAPTCHA.

Text-based CAPTCHA

It’s usually a combination of random letters and numbers presented in a hard-to-read format, with characters turned, scaled, and distorted in various ways.

Text-based CAPTCHA example

Image-based CAPTCHA 

Image-based CAPTCHAs usually display several pictures in a grid and ask the user to select a specific type of image. For instance, images with traffic lights.

Image-based CAPTCHA example

Sound-based CAPTCHA

Also known as audio-based CAPTCHAs, these tests present audio clips with a combination of letters or numbers that users have to enter, often accompanied by background noise for added difficulty.

Sound-based CAPTCHA example

hCAPTCHA 

It’s a CAPTCHA service that clients can set up on their website. Serving as an alternative to reCAPTCHA, it offers better privacy and provides more control over the CAPTCHA experience.

hCAPTCHA example

Google reCAPTCHA 

It’s a free CAPTCHA service developed by Google that offers protection for web pages. Just like hCAPTCHA, it uses advanced techniques to catch bot-like activity. For instance, Google reCAPTCHA can now recognize human users without any interaction at all by taking into account the user’s previous interactions with other websites, an approach some consider undesirable due to privacy issues.

Google reCAPTCHA is also widely used in most of the brand’s services and products, such as Google Search, Maps, Play, Shopping, and many more.

reCAPTCHA example

You can find more information about each of these CAPTCHA types and dig deeper into how these tools work in general in our blog post on how CAPTCHAs work.

CAPTCHA challenges during web scraping

It’s no secret that CAPTCHAs are one of the biggest challenges when it comes to public data gathering. They interrupt companies’ scraping activities, delaying data analysis and the decisions that depend on it. Consider this Python code that scrapes Google search results for different keywords (see the library installation steps in the section below):

import requests, json, time, random
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs


num = list(range(0, 50, 10))  # Start offsets 0, 10, ..., 40 cover 5 SERP pages.
keywords = ["shoes", "coats", "jeans", "sunglasses", "hats"]

for keyword in keywords:
    results = []
    for page in num:
        response = requests.get(f"https://www.google.com/search?q={keyword}&start={page}")
        soup = BeautifulSoup(response.text, "html.parser")

        items = soup.select("div.egMi0")
        if not items:
            print(f"No search results for {keyword} found on page {page // 10 + 1}.")
            with open(f"google_{keyword}_{page // 10 + 1}.html", "w") as f:
                f.write(soup.prettify())
            exit()

        time.sleep(random.uniform(4, 10))

        for item in items:
            title = item.select_one("h3 > div")
            link = item.select_one("a[href^='/url?q=']")
            if title and link:
                parsed_url = parse_qs(urlparse(link.get("href")).query).get('q', [None])[0]
                if parsed_url:
                    results.append({
                        "title": title.get_text(),
                        "link": parsed_url
                    })

    with open(f"google_{keyword}.json", "w") as f:
        json.dump(results, f, indent=4)

When you run this web scraping code, it’s highly likely that you’ll trigger a CAPTCHA. If that happens, the code will save the HTML document, which will contain the CAPTCHA response:

Reviewing a CAPTCHA in the scraped HTML document.

You can open this HTML in a browser to see the complete CAPTCHA page:

Loading the scraped HTML document in a browser to see the CAPTCHA.

Utilizing advanced web scraping tactics like User-Agent manipulation and HTTP header rotation may help you bypass CAPTCHAs, but not for long. For large-scale web scraping in particular, you’ll need a more sophisticated approach that includes session and IP rotation.
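For illustration, here’s a minimal sketch of the User-Agent rotation tactic with requests. The User-Agent strings below are arbitrary examples rather than a maintained list, and on their own they won’t defeat modern anti-bot systems:

import random
import requests

# Illustrative User-Agent strings; rotate a larger, up-to-date pool in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),  # A different identity per request.
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get("https://www.google.com/search?q=shoes", headers=headers)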

How to bypass any CAPTCHA with Web Unblocker using Python

When a CAPTCHA challenge is triggered, it blocks access to the desired data until the test is passed. One way to overcome a CAPTCHA is to use a service that solves the tests manually. However, manual solving is slower than using anti-detection techniques that avoid triggering CAPTCHAs altogether, and its costs accumulate quickly on larger-scale projects. Employing anti-detection solutions to avoid the challenge in the first place is therefore the more streamlined and cost-effective approach.

That’s exactly why Web Unblocker was developed. This web scraping solution, powered by artificial intelligence, successfully bypasses advanced anti-bot systems, including complex CAPTCHAs. One of its main features is dynamic browser fingerprinting, which selects the right combination of headers, cookies, and other browser parameters, allowing you to appear as an organic user and easily access the public information you need.

Get a 1-week free trial

Register an account to claim a free trial for testing Web Unblocker.

  • Free 1GB of traffic
  • No credit card is required

Using Web Unblocker is straightforward, as the setup is exactly the same as with proxy servers, so let’s review how to use Web Unblocker in Python.

    1. Install the prerequisites

    Begin by installing the requests library, which will help you send web requests to the target website. The Beautiful Soup package will allow you to navigate the HTML and parse the desired elements. For installation, you can use a Python package installer, such as pip, which comes pre-installed with Python.

    Open up your terminal and enter the following line:

    python -m pip install requests beautifulsoup4

    Depending on your setup, you may want to use the python3 keyword:

    python3 -m pip install requests beautifulsoup4

    2. Inspect your target site

    For demonstration purposes, let’s scrape Google search results pages to get each result’s title and link. The code will scrape 5 SERP pages for each keyword in the list. Depending on whether you render JavaScript or not, the element classes will have different values. We’ll use Web Unblocker to bypass CAPTCHAs, and although we won’t render JavaScript with it, the classes will still appear as if we had.

    In this case, using your browser’s Developer Tools, you should see that each search result item is in a <div> element with its class set to "yuRUbf". The result title text is directly inside the <h3> tag, while the item URL is inside the <a> tag’s href attribute:

    Inspecting the Google Search page via Dev Tools

    3. Set up the Web Unblocker endpoint

    Start by importing the installed Python libraries:

    import requests, json
    from bs4 import BeautifulSoup

    Next, create the web_unblocker proxy dictionary object to connect to the Web Unblocker endpoint via HTTPS and use your sub-user’s credentials for authentication:

    USERNAME, PASSWORD = "YOUR_USERNAME", "YOUR_PASSWORD"
    web_unblocker = {
        "http": f"http://{USERNAME}:{PASSWORD}@unblock.oxylabs.io:60000",
        "https": f"https://{USERNAME}:{PASSWORD}@unblock.oxylabs.io:60000",
    }

    Additionally, create a web_unblocker_headers dictionary to set the geo-location to New York, United States, using headers:

    web_unblocker_headers = {
        # Localize results for New York.
        "x-oxylabs-geo-location": "New York,New York,United States"
    }

    4. Send a request to the target

    The next step is to send a GET request to the target website through Web Unblocker. Let’s also mimic the previous code sample by creating the num list that defines the starting positions for each Google search results page, as well as the keywords list to scrape search results for:

    num = list(range(0, 50, 10))  # Start offsets 0, 10, ..., 40 cover 5 SERP pages.
    keywords = ["shoes", "coats", "jeans", "sunglasses", "hats"]
    
    for keyword in keywords:
        results = []
        for page in num:
            response = requests.get(
                f"https://www.google.com/search?q={keyword}&start={page}",
                verify=False,  # Ignore the SSL certificate.
                proxies=web_unblocker,
                headers=web_unblocker_headers,
                timeout=180
            )

    Web Unblocker requires users to ignore the SSL certificate, which is done by adding verify=False within the GET request. Then, include the proxies argument and pass the web_unblocker object to forward web requests through the Web Unblocker endpoint.
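    Since verify=False makes the requests library emit an InsecureRequestWarning on every call, you can optionally silence the warning near your imports via urllib3. This step is purely cosmetic and unrelated to Web Unblocker itself:

    import urllib3

    # Optional: suppress the InsecureRequestWarning triggered by verify=False.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)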

    5. Parse the desired data

    Here, you can utilize the Beautiful Soup library to extract the content from the target page. First, create the soup object, which will store the HTML content:

            soup = BeautifulSoup(response.text, "html.parser")

    Next, define the search results items selector:

            items = soup.select("div.yuRUbf")

    Then, create a third for loop to extract all the titles and links:

            for item in items:
                title = item.select_one("h3")
                link = item.select_one("a")
                if title and link:
                    results.append({
                        "title": title.get_text(),
                        "link": link.get("href")
                    })

    The soup.select() call uses a CSS selector to grab all the <div> elements with the class "yuRUbf", while item.select_one() picks the first matching <h3> title and <a> link within each item. Since each item’s URL is stored as the value of the href attribute, you can retrieve the complete link using the .get() method. If you’re interested in learning web scraping, check out our in-depth blog posts on Python Web Scraping and Beautiful Soup to get an easy start.

    6. Store the results in a JSON file

    Once all the titles and links are extracted from all the pages of a specific keyword, you can save the results to a JSON file:

        with open(f"google_{keyword}.json", "w") as f:
            json.dump(results, f, indent=4)

    Full Web Unblocker code sample

    The complete code should look like this:

    import requests, json
    from bs4 import BeautifulSoup
    
    
    USERNAME, PASSWORD = "YOUR_USERNAME", "YOUR_PASSWORD"
    web_unblocker = {
        "http": f"http://{USERNAME}:{PASSWORD}@unblock.oxylabs.io:60000",
        "https": f"https://{USERNAME}:{PASSWORD}@unblock.oxylabs.io:60000",
    }
    
    web_unblocker_headers = {
        # Localize results for New York.
        "x-oxylabs-geo-location": "New York,New York,United States"
    }
    
    num = list(range(0, 50, 10))  # Start offsets 0, 10, ..., 40 cover 5 SERP pages.
    keywords = ["shoes", "coats", "jeans", "sunglasses", "hats"]
    
    for keyword in keywords:
        results = []
        for page in num:
            response = requests.get(
                f"https://www.google.com/search?q={keyword}&start={page}",
                verify=False,  # Ignore the SSL certificate.
                proxies=web_unblocker,
                headers=web_unblocker_headers,
                timeout=180
            )
            soup = BeautifulSoup(response.text, "html.parser")
    
            items = soup.select("div.yuRUbf")
    
            for item in items:
                title = item.select_one("h3")
                link = item.select_one("a")
                if title and link:
                    results.append({
                        "title": title.get_text(),
                        "link": link.get("href")
                    })
    
        with open(f"google_{keyword}.json", "w") as f:
            json.dump(results, f, indent=4)

    As you can see, it only takes a few lines of Python code to incorporate Oxylabs’ Web Unblocker. Using the above code, you should expect output similar to the following:

    [
        {
            "title": "Women's, Men's & Kids Shoes from Top Brands | DSW",
            "link": "https://www.dsw.com/"
        },
        {
            "title": "Shoes for Women, Men & Kids, Famous Footwear",
            "link": "https://www.famousfootwear.com/"
        },
        {
            "title": "Rack Room Shoes: Shoes Online with Free Shipping*",
            "link": "https://www.rackroomshoes.com/"
        },
        {
            "title": "Men's Shoes & Sneakers",
            "link": "https://www.nike.com/w/mens-shoes-nik1zy7ok"
        },
        {
            "title": "Shoes + FREE SHIPPING",
            "link": "https://www.zappos.com/shoes"
        },
        {
            "title": "Mens shoes, boots and sandals",
            "link": "https://www.nativeshoes.com/mens?srsltid=AfmBOopqVwHL2WurGKZeX7SE94nMSE7HrgJfMUGRUf3l-Ay1vWW1zYiw"
        },
        {
            "title": "Women's Shoes",
            "link": "https://www.nordstrom.com/browse/women/shoes?srsltid=AfmBOopl7t_F-BKJ7LXMFuCayHTbxbC1Hh3eTYAYhpUbdGUSrYvPcAT8"
        },
        {
            "title": "DC Shoes\u00ae Skate Shoes & Snowboard Boots",
            "link": "https://www.dcshoes.com/"
        }
    ]

    Hopefully, these Python examples helped you see how effortless the integration process of Web Unblocker is. Visit our documentation to learn more about its parameters and general integration steps.

    Developing your own solution

    Of course, it’s always possible to create your own solution that takes care of various CAPTCHAs. While the development stage may take some time, you can tailor it specifically to the kind of requests you wish to send. This can result in higher success rates, allowing you to perform web scraping activities without interruptions. There are a couple of viable Python CAPTCHA bypass tools for this quest:

    Playwright

    Playwright is an excellent web testing and automation tool developed by Microsoft, which can also be used to bypass CAPTCHAs. It supports the most popular programming languages, such as Python, JavaScript, and Java. Playwright can drive Chromium-based, Firefox, and WebKit browsers, giving users more flexibility. We’ve made a detailed blog post specifically on Playwright, so be sure to take a look at this Playwright Scraping Tutorial for more information.
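    As a rough illustration, here’s a minimal Playwright sketch in Python. It doesn’t solve a CAPTCHA once it appears; rather, driving a real browser with an organic-looking configuration makes a challenge less likely to be triggered in the first place (the user agent string is an arbitrary example). Install the dependencies with pip install playwright, followed by playwright install chromium:

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Headful mode tends to look more organic than headless.
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            locale="en-US",
        )
        page = context.new_page()
        page.goto("https://www.google.com/search?q=shoes")
        print(page.title())
        browser.close()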

    Puppeteer

    It’s also a very effective web automation tool that you can use to design a program that avoids CAPTCHAs. While Puppeteer, maintained by Google, officially supports only JavaScript, you can use it in Python through an unofficial library called Pyppeteer. The downside of Puppeteer is that it only supports Chromium-based browsers for interaction. If you’re curious to learn more, check out our in-depth blog post on web scraping with Puppeteer and a tutorial on how to overcome CAPTCHA challenges with Puppeteer.
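    Below is a comparable minimal sketch using Pyppeteer (pip install pyppeteer), the unofficial Python port of Puppeteer’s API. Again, the aim is to look organic rather than to solve a challenge, and the user agent is an arbitrary example:

    import asyncio
    from pyppeteer import launch

    async def main():
        browser = await launch(headless=False)  # Headful Chromium.
        page = await browser.newPage()
        await page.setUserAgent(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        )
        await page.goto("https://www.google.com/search?q=shoes")
        print(await page.title())
        await browser.close()

    asyncio.run(main())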

    Selenium

    Selenium is another popular web automation framework that’s also used for web scraping and overcoming CAPTCHAs. Rather than being a browser itself, it’s a powerful toolkit that automates Chrome, Firefox, Safari, and Edge (including Internet Explorer compatibility mode in Edge), with optional headless operation. Check out this blog post on bypassing CAPTCHAs with Selenium and this comprehensive tutorial on web scraping using Selenium.
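    And here’s the same idea as a minimal Selenium sketch; Selenium 4 resolves a matching chromedriver automatically via Selenium Manager, so pip install selenium is all the setup needed. As before, the user agent string is an arbitrary example:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--window-size=1280,800")
    # Arbitrary example user agent, not a recommendation.
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )

    driver = webdriver.Chrome(options=options)
    driver.get("https://www.google.com/search?q=shoes")
    print(driver.title)
    driver.quit()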

    Keep in mind that developing your own solution will require you to spend time writing code and maintaining it to adapt to constant changes. In cases where this is an issue, the better option is to utilize ready-made web scrapers that avoid CAPTCHAs automatically. It takes considerable effort to build a scalable scraper that moves through the web undetected and uninterrupted, but a pre-built tool can ease the process immensely, saving time and resources. See how both methods differ in this guide to scraping Amazon. If you aim to scrape Amazon, you may find it helpful to follow this guide on how to bypass Amazon CAPTCHAs.

    Final thoughts

    With CAPTCHAs being one of the most common challenges when it comes to public data collection, it’s essential to find a reliable and high-quality solution to bypass them. This article presented a few anti-CAPTCHA solutions you can try implementing in your scraping tasks, and it discussed the different types of CAPTCHAs available today. 

    If you're curious to try out our scraping solutions, you can simply get a free trial and follow our guides for your desired target. Here are some tutorials to get you started: how to scrape Google search results and how to scrape Etsy data.

    If you have any questions about this topic or would like to learn more about Web Unblocker, Oxylabs’ ultimate solution for bypassing CAPTCHAs, feel free to contact us at hello@oxylabs.io or via the live chat.

    Frequently asked questions

    Is there a way to bypass CAPTCHA?

    Yes, there are many services on the market, such as CAPTCHA solvers and proxy solutions, designed specifically for bypassing CAPTCHA tests. If you want to know how to bypass CAPTCHA challenges easily during web scraping, Oxylabs’ Web Unblocker can suit you well. It chooses the right combination of cookies, headers, browser attributes, and other parameters to appear as an organic user and, eventually, overcome all target website blocks.

    Can reCAPTCHA be bypassed?

    While reCAPTCHA is considered to be more sophisticated and harder to bypass than the original CAPTCHA, it’s still possible to bypass it in several different ways. You can either implement a ready-to-use tool or develop your own and tailor it specifically to the kind of requests you wish to send.

    Can a bot bypass CAPTCHA?

    Even though modern CAPTCHAs are advanced and tend to provide a high level of security for websites, sophisticated bots can still bypass them. These tools are usually developed with special features like dynamic browser fingerprinting that let users overcome even the most complex CAPTCHA tests and perform their scraping and crawling activities uninterruptedly.

    Why are CAPTCHAs used?

    On websites, CAPTCHAs are used to separate human users from malicious bots. They act as a safety net to stop bots from engaging in possibly harmful or malicious activities like spamming or fraudulent transactions.

    How can I avoid CAPTCHAs?

    There are several ways to avoid CAPTCHA when gathering web data. If you’re using a DIY scraper, make sure to use proxies and rotate them. Adjusting User-Agent headers to refine your scraper's fingerprint is another useful tactic. Additionally, you might want to consider using automated tools like Web Unblocker, which can effectively solve CAPTCHA challenges for you. Also, it's a good practice to use CAPTCHA proxies.
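    For instance, here’s a minimal sketch of the proxy rotation tactic with requests; the proxy endpoints and credentials are placeholders you’d replace with your own provider’s details:

    import random
    import requests

    # Placeholder proxy endpoints; substitute your own provider's details.
    PROXIES = [
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
    ]

    proxy = random.choice(PROXIES)  # Rotate to a different exit IP per request.
    response = requests.get(
        "https://example.com",
        proxies={"http": proxy, "https": proxy},
    )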

    What does CAPTCHA stand for?

    First patented in 1997, CAPTCHA is an abbreviation for Completely Automated Public Turing test to tell Computers and Humans Apart.

    Is bypassing CAPTCHA illegal?

    In general, it isn’t illegal to bypass a CAPTCHA when you take into account all the ethical considerations, such as scraping at a responsible rate and not affecting the performance of the website. However, it may be illegal to bypass CAPTCHAs to access restricted content or perform malicious actions.

    About the author

    Yelyzaveta Nechytailo

    Senior Content Manager

    Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.

    All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
