
How to Bypass CAPTCHA in Web Scraping Using Python


Yelyzaveta Nechytailo

2023-10-03, 6 min read

CAPTCHA, short for Completely Automated Public Turing test to tell Computers and Humans Apart, is a test that determines whether the user accessing a website or its data is a real person. By presenting challenges that are hard for computers to solve, CAPTCHAs flag suspicious users and bots and block activities such as scraping and crawling.

This article will provide insights into bypassing CAPTCHA challenges in web scraping. We’ll talk about the different types of tests that can be encountered in the modern internet landscape and discuss useful anti-CAPTCHA solutions to implement in your data-gathering operations.


What are the different types of CAPTCHAs?

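Generally, there are three CAPTCHA challenge types: text-based, image-based, and sound-based. In addition, two widely deployed CAPTCHA services, hCAPTCHA and Google reCAPTCHA, are worth knowing about.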

Text-based CAPTCHA

It’s usually a combination of random letters and digits presented in a hard-to-read format, with characters being rotated, scaled, and distorted in various ways.

Text-based CAPTCHA example

Image-based CAPTCHA 

Image-based CAPTCHA challenges usually display several pictures in a grid and ask the user to select a specific type of image. For instance, images with traffic lights.

Image-based CAPTCHA example

Sound-based CAPTCHA

Also known as an audio CAPTCHA, it presents audio clips with a combination of letters or numbers that users have to enter, often accompanied by background noise for added difficulty.

Sound-based CAPTCHA example

hCAPTCHA 

It’s a CAPTCHA service that clients can set up on their website. Serving as an alternative to reCAPTCHA, it offers better privacy and provides more control over the CAPTCHA experience.

hCAPTCHA example

Google reCAPTCHA 

It’s a free CAPTCHA service developed by Google that offers protection for web pages. Just like hCAPTCHA, it uses advanced techniques to catch bot-like activity. For example, Google reCAPTCHA can now recognize human users without any interaction on their part: it takes into account the user’s previous interactions with other websites, an approach that may be undesirable because of privacy concerns.

Google reCAPTCHA is also widely used in most of the brand’s services and products, such as Google Search, Maps, Play, Shopping, and many more.

reCAPTCHA example

You can find more information about each of these CAPTCHA types, and dig deeper into how these tools work in general, in our blog post on how CAPTCHAs work.

How to bypass any CAPTCHA with Web Unblocker using Python

It’s no secret that CAPTCHAs are one of the biggest challenges when it comes to public data gathering. They interrupt companies’ scraping activities, making it hard to allocate enough time for analyzing data and making the right decisions. A CAPTCHA response during web scraping may look like this:

An example of a response with CAPTCHA during web scraping

When a CAPTCHA challenge is triggered, it blocks access to the desired data until the test is passed. One way to overcome a CAPTCHA challenge is to use a service that solves it manually. However, this takes more time than using anti-detection techniques that avoid triggering CAPTCHAs at all, and the costs of manual solving accumulate on larger-scale projects. Employing anti-detection solutions to avoid a CAPTCHA challenge in the first place can therefore offer a more streamlined and cost-effective approach.
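Before choosing a strategy, it helps to know when a scraper has actually hit a challenge. The sketch below is a rough heuristic only: the function name looks_like_captcha, the status codes, and the marker strings are assumptions chosen for illustration, and real block pages vary from site to site.

```python
# Heuristic CAPTCHA detector. The status codes and marker strings are
# illustrative assumptions, not an exhaustive or authoritative list.
SUSPICIOUS_STATUS_CODES = {403, 429}
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha", "verify you are human")


def looks_like_captcha(status_code: int, html: str) -> bool:
    """Return True if a response appears to be a CAPTCHA or block page."""
    body = html.lower()
    has_marker = any(marker in body for marker in CAPTCHA_MARKERS)
    # A blocking status code alone counts as a challenge; on any other
    # status we require an explicit marker in the page body.
    return status_code in SUSPICIOUS_STATUS_CODES or has_marker
```

With the requests library, this could be called as looks_like_captcha(response.status_code, response.text) right after each request.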

That’s exactly why Web Unblocker was developed. This web scraping solution, powered by artificial intelligence, bypasses advanced anti-bot systems, including complex CAPTCHAs. One of its main features is dynamic browser fingerprinting, which selects the right combination of headers, cookies, and other browser parameters, allowing you to appear as an organic user and get access to the public data you need.

Using Web Unblocker is straightforward, as the setup is exactly the same as with proxy servers, so let’s review how to use Web Unblocker in Python. We offer a 1-week free trial for our website unblocker, so head to the Oxylabs dashboard and create a free account to get started.

1. Install the prerequisites

Begin by installing the requests library, which we’ll use to send a web request to the target website. We’ll use the Beautiful Soup package to navigate the HTML and parse the desired elements. For installation, we’ll use pip, a package installer for Python, which should install automatically with Python. 

Open up your terminal and enter the following line:

pip install requests beautifulsoup4

2. Inspect your target site

We’ll target a dummy bookstore website https://books.toscrape.com/ to get all the titles from the first listing page. The book titles are stored in the title attribute within the <a> tag, which is under the <h3> tag:

Using developer tools to inspect the HTML of a target website

This dummy website doesn’t implement CAPTCHAs, so let’s imagine that it does. One option for bypassing CAPTCHA challenges is to use a solution like Web Unblocker that doesn’t trigger them in the first place.

3. Set up the Web Unblocker endpoint

Start by importing the installed Python libraries:

import requests
from bs4 import BeautifulSoup

Next, create the web_unblocker dictionary object and form the URL with your Oxylabs sub-user’s credentials and the Web Unblocker endpoint:

web_unblocker = {
    'http': 'http://USERNAME:PASSWORD@unblock.oxylabs.io:60000',
    'https': 'http://USERNAME:PASSWORD@unblock.oxylabs.io:60000',
}
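If you would rather not hard-code credentials, one possible approach is to read them from environment variables and percent-encode them before building the proxy URL. This is a sketch under assumptions: the helper build_proxies and the variable names WEB_UNBLOCKER_USERNAME and WEB_UNBLOCKER_PASSWORD are hypothetical, and only the endpoint comes from the snippet above.

```python
import os
from urllib.parse import quote

# Endpoint taken from the tutorial snippet; the environment variable names
# used further down are hypothetical and can be anything you choose.
ENDPOINT = "unblock.oxylabs.io:60000"


def build_proxies(username: str, password: str) -> dict:
    """Build a requests-style proxies mapping with URL-safe credentials."""
    # Percent-encode so characters such as '@' or ':' in a password
    # don't break the structure of the proxy URL.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{ENDPOINT}"
    return {"http": proxy_url, "https": proxy_url}


# Fall back to the tutorial placeholders when the variables are not set.
web_unblocker = build_proxies(
    os.environ.get("WEB_UNBLOCKER_USERNAME", "USERNAME"),
    os.environ.get("WEB_UNBLOCKER_PASSWORD", "PASSWORD"),
)
```

The resulting web_unblocker dictionary has the same shape as the hand-written one, so it can be passed to the proxies argument unchanged.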

4. Send a request to the target

The next step is to send a GET request to the target website through Web Unblocker. This can be achieved with the following code snippet:

response = requests.get(
    'https://books.toscrape.com/',
    verify=False,
    proxies=web_unblocker
)

Web Unblocker requires you to skip SSL certificate verification, which is done by adding verify=False to the GET request. Note that requests emits an InsecureRequestWarning for each unverified request. Then, include the proxies argument and pass the web_unblocker object to forward the web request through the Web Unblocker endpoint.

5. Parse the desired data

Here you can utilize the Beautiful Soup library to extract the content from the target page. First, create the soup object, which will store the HTML content:

soup = BeautifulSoup(response.content, "html.parser")

Then, create a for loop to extract all the titles:

for title in soup.select("h3 a"):
    print(title.get("title"))

The soup.select method uses CSS selectors to pick all the <a> tags inside the <h3> tags. Since each title is stored as the value of the title attribute, you can retrieve the complete title names using the .get method. If you’re interested in learning web scraping, check out our in-depth blog posts on Python Web Scraping and Beautiful Soup to get an easy start.
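To see what the "h3 a" lookup does without sending any request, here is a dependency-free sketch that applies the same idea to a hand-written HTML sample. It deliberately swaps Beautiful Soup for html.parser from the Python standard library; the TitleParser class and the sample markup are assumptions made for illustration, not the page's real source.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the title attribute of every <a> nested inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            title = dict(attrs).get("title")
            if title:
                self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False


# Hand-written sample shaped like a book listing; not the real page source.
sample_html = """
<article>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
<article>
  <h3><a href="#" title="Tipping the Velvet">Tipping the Velvet</a></h3>
</article>
<a href="#" title="Not a book">Home</a>
"""

parser = TitleParser()
parser.feed(sample_html)
print(parser.titles)
```

The link outside any <h3> is skipped, which mirrors why the selector targets "h3 a" rather than every <a> tag on the page.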

The complete code should look like this:

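import requests
from bs4 import BeautifulSoup

# Route both HTTP and HTTPS traffic through the Web Unblocker endpoint.
web_unblocker = {
    'http': 'http://USERNAME:PASSWORD@unblock.oxylabs.io:60000',
    'https': 'http://USERNAME:PASSWORD@unblock.oxylabs.io:60000'
}

# verify=False skips SSL certificate verification, as Web Unblocker requires.
response = requests.get(
    'https://books.toscrape.com/',
    verify=False,
    proxies=web_unblocker
)

# Parse the HTML and print the title attribute of every <a> inside an <h3>.
soup = BeautifulSoup(response.content, "html.parser")
for title in soup.select("h3 a"):
    print(title.get("title"))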

As you can see, it only takes a few lines of Python code to incorporate Oxylabs’ Web Unblocker. Using the above code, you should expect the following output:

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas

Hopefully, these Python examples helped you see how effortless the Web Unblocker integration process is. Visit our documentation to learn more about its parameters and general integration steps.
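In case a request still comes back blocked, a common general-purpose pattern is to retry with exponential backoff. The sketch below is one possible shape for that logic: fetch_with_retries and its parameters are hypothetical names, not part of Web Unblocker or requests, and the fetch and is_blocked callables are passed in so the function can be tried out without network access.

```python
import time


def fetch_with_retries(fetch, url, is_blocked, max_attempts=3,
                       base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until is_blocked(response) is False or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if not is_blocked(response):
            return response
        if attempt < max_attempts:
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"Still blocked after {max_attempts} attempts: {url}")
```

With the tutorial code, fetch could be a small wrapper around requests.get that passes proxies=web_unblocker and verify=False, and is_blocked could check the response status code for values such as 403 or 429.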

Developing your own solution

Of course, it’s always possible to create your own solution that takes care of complex CAPTCHAs. While the development stage may take some time, you can tailor it specifically to the kind of requests you wish to send. This can result in higher success rates, allowing you to perform web scraping activities without interruptions. There are a couple of viable tools for this quest:

Playwright

Playwright is an excellent web testing and automation tool developed by Microsoft, which can also be used to bypass CAPTCHAs. It supports the most popular programming languages, such as Python, JavaScript, and Java. Playwright can work with Chromium-based, Firefox, and WebKit browsers, allowing the users more flexibility. We’ve made a detailed blog post specifically on Playwright, so be sure to take a look at this Playwright Scraping Tutorial for more information.

Puppeteer

Puppeteer is another effective web automation tool that you can use to design a program that avoids CAPTCHAs. While Puppeteer, developed by Google, officially supports only JavaScript, you can use it in Python with an unofficial library called Pyppeteer. The downside of Puppeteer is that it is built primarily around Chromium-based browsers. If you’re curious to learn more, check out our in-depth blog post on web scraping with Puppeteer and a tutorial on how to overcome CAPTCHA challenges with Puppeteer.

Keep in mind that developing your own solution will require you to spend time writing code and micromanaging it to adapt to constant changes. In cases where this is an issue, the better option is to utilize ready-made web scrapers that avoid CAPTCHAs automatically. It takes a mountain of effort to build yourself a scalable scraper that sifts through the web undetected and uninterrupted, but a pre-built tool can ease the process immensely, saving time and resources. See how both methods differ in this guide to scraping Amazon.

Final thoughts

With CAPTCHAs being one of the most common challenges in public data collection, it’s essential to find a reliable, high-quality solution to bypass them. This article discussed the different types of CAPTCHA tests in use today and presented a few anti-CAPTCHA solutions you can try implementing in your scraping tasks.

If you're curious to try out our scraping solutions, you can simply get a free trial and follow our guides for your desired target. Here are some tutorials to get you started: how to scrape Google search results and how to scrape Etsy data.

If you have any questions about this topic or would like to learn more about Web Unblocker, Oxylabs’ ultimate solution for bypassing CAPTCHAs, feel free to contact us at hello@oxylabs.io or via the live chat.

Frequently asked questions

Is there a way to bypass CAPTCHA?

Yes, there are many different services, such as a CAPTCHA solver or proxy solutions on the market, specifically designed for the purpose of bypassing a CAPTCHA test. For instance, Oxylabs’ Web Unblocker chooses the right combination of cookies, headers, browser attributes, etc., to appear as an organic user and, eventually, overcome all target website blocks.

Can reCAPTCHA be bypassed?

While Google reCAPTCHA is considered to be more sophisticated and harder to bypass than the original CAPTCHA, it’s still possible to bypass it in several different ways. You can either implement a ready-to-use tool or develop your own and tailor it specifically to the kind of requests you wish to send.

Can a bot bypass CAPTCHA?

Even though modern CAPTCHAs are advanced and tend to provide a high level of security for websites, sophisticated bots can still bypass them. These tools are usually developed with special features like dynamic browser fingerprinting that let users overcome even the most complex CAPTCHA tests and perform their scraping and crawling activities without interruption.

Why are CAPTCHAs used?

On websites, CAPTCHAs are used to separate human users from malicious bots. They act as a safety net to stop bots from engaging in possibly harmful or malicious activities like spamming or fraudulent transactions.

How can I avoid CAPTCHAs?

There are several ways to avoid CAPTCHA when gathering web data. If you’re using a DIY scraper, make sure to use proxies and rotate them. Adjusting User-Agent headers to refine your scraper's fingerprint is another useful tactic. Additionally, you might want to consider using automated tools like Web Unblocker, which can effectively solve CAPTCHA challenges for you. Also, it's a good practice to use CAPTCHA proxies.
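As a rough illustration of rotating proxies and User-Agent headers in a DIY scraper, the sketch below cycles through two small pools. Every value in it is a placeholder assumption: the proxy hosts use the reserved .example domain, the User-Agent strings are made up, and next_request_kwargs is a hypothetical helper name.

```python
import itertools

# Placeholder values: swap in your own proxy endpoints and real browser
# User-Agent strings. The hosts below use the reserved .example domain.
PROXIES = [
    "http://user:pass@proxy1.example:8080",
    "http://user:pass@proxy2.example:8080",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/1.0",
]

# itertools.cycle repeats each pool endlessly in round-robin order.
proxy_pool = itertools.cycle(PROXIES)
user_agent_pool = itertools.cycle(USER_AGENTS)


def next_request_kwargs() -> dict:
    """Return keyword arguments for requests.get with the next proxy and User-Agent."""
    proxy = next(proxy_pool)
    return {
        "proxies": {"http": proxy, "https": proxy},
        "headers": {"User-Agent": next(user_agent_pool)},
        "timeout": 30,
    }
```

Each call returns the next combination, so a scraper could send requests.get(url, **next_request_kwargs()) inside its loop.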

What does CAPTCHA stand for?

First patented in 1997, CAPTCHA is an acronym for Completely Automated Public Turing test to tell Computers and Humans Apart.

About the author

Yelyzaveta Nechytailo

Senior Content Manager

Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
