Yelyzaveta Nechytailo
CAPTCHA, short for Completely Automated Public Turing Test to Tell Computers and Humans Apart, is a test that determines whether a user accessing a website or its data is a real person. By presenting challenges that are hard for computers to solve, CAPTCHAs identify suspicious users and modern bots and block activities such as scraping and crawling.
This article will provide insights into bypassing CAPTCHA challenges in web scraping. We’ll talk about the different types of tests that can be encountered in the modern internet landscape and discuss useful anti-CAPTCHA solutions to implement in your data-gathering operations.
Generally, there are three CAPTCHA types: text-based, image-based, and sound-based.
A text-based CAPTCHA is usually a combination of random letters and other characters presented in a hard-to-read format, with the characters rotated, scaled, and distorted in various ways.
Image-based CAPTCHA challenges usually display several pictures in a grid and ask the user to select a specific type of image, for instance, all images containing traffic lights.
A sound-based CAPTCHA, also known as an audio CAPTCHA, plays audio clips containing a combination of letters or numbers that users have to enter, often with background noise added for extra difficulty.
hCaptcha is a CAPTCHA service that website owners can set up on their sites. Serving as an alternative to reCAPTCHA, it emphasizes user privacy and gives site owners more control over the CAPTCHA experience.
reCAPTCHA is a free CAPTCHA service developed by Google that protects web pages. Just like hCaptcha, it uses advanced techniques to catch bot-like activity. For example, Google reCAPTCHA can now recognize human users without any interaction on their part by taking into account their previous interactions with other websites, an approach that may raise privacy concerns.
Google reCAPTCHA is also used across many of the brand’s services and products, such as Google Search, Maps, Play, Shopping, and many more.
You can find more information about each of these CAPTCHA types as well as dig deeper into how these tools work in general, in our blog post on how CAPTCHAs work.
It’s no secret that CAPTCHAs are one of the biggest challenges in public data gathering. They interrupt companies’ scraping activities, leaving less time for analyzing data and making the right decisions. During web scraping, a CAPTCHA typically appears as a challenge page returned in place of the requested content.
When a CAPTCHA challenge is triggered, it blocks access to the desired data until the test is passed. One way to overcome a CAPTCHA challenge is to use a service that solves it manually. However, this approach takes more time than using anti-detection techniques that avoid triggering CAPTCHAs at all. The costs of manual solving also accumulate in larger-scale projects, so anti-detection solutions that avoid the challenge in the first place can offer a more streamlined and cost-effective approach.
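To make the idea of a triggered CAPTCHA more concrete, here is a minimal sketch of how a scraper might flag a response that looks like a challenge page instead of real content. The function name, the marker list, and the status codes are assumptions made for illustration; g-recaptcha and h-captcha are the class names the reCAPTCHA and hCaptcha widgets commonly use in page markup, but actual challenge pages vary by website and provider, so treat this as a rough heuristic only.

```python
# Class names commonly used by reCAPTCHA and hCaptcha widgets.
# This list is an illustrative assumption, not a complete set of signals.
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha")

# Status codes that often accompany a block (assumption, varies by site).
SUSPICIOUS_STATUS_CODES = (403, 429)


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True if a response resembles a CAPTCHA or block page."""
    body = html.lower()
    has_marker = any(marker in body for marker in CAPTCHA_MARKERS)
    return has_marker or status_code in SUSPICIOUS_STATUS_CODES


challenge_page = '<div class="g-recaptcha" data-sitekey="example"></div>'
normal_page = '<h3><a title="A Light in the Attic">A Light in the ...</a></h3>'

print(looks_blocked(200, challenge_page))  # True
print(looks_blocked(200, normal_page))     # False
print(looks_blocked(429, normal_page))     # True
```

A check like this only detects that something went wrong. It can also produce false positives on pages that legitimately embed a CAPTCHA widget, for example in a login form.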
That’s exactly why Web Unblocker was developed. This web scraping solution, powered by artificial intelligence, bypasses advanced anti-bot systems, including complex CAPTCHAs. One of its main features is dynamic browser fingerprinting, which selects the right combination of headers, cookies, and other browser parameters so that you appear as an organic user and can access the public data you need.
Using Web Unblocker is straightforward, as the setup is exactly the same as with proxy servers, so let’s review how to use Web Unblocker in Python. We offer a 1-week free trial for our website unblocker, so head to the Oxylabs dashboard and create a free account to get started.
Begin by installing the requests library, which we’ll use to send a web request to the target website. We’ll use the Beautiful Soup package to navigate the HTML and parse the desired elements. For installation, we’ll use pip, a package installer for Python, which should install automatically with Python.
Open up your terminal and enter the following line:
pip install requests beautifulsoup4
We’ll target a dummy bookstore website, https://books.toscrape.com/, to get all the titles from the first listing page. Each book title is stored in the title attribute of an <a> tag nested inside an <h3> tag.
This dummy website doesn’t have CAPTCHAs implemented, so let’s imagine that it does. One option for bypassing CAPTCHA challenges is then to use a solution like Web Unblocker, which avoids triggering them in the first place.
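As a quick illustration of that structure, the sketch below parses a small hand-written HTML fragment modeled on a single book entry. The fragment is a simplified assumption, not a verbatim copy of the live page, and it shows why the title attribute is the useful target: the visible link text is shortened, while the attribute holds the full title.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written fragment modeled on one book entry (assumption).
SAMPLE_HTML = """
<article class="product_pod">
  <h3>
    <a href="catalogue/a-light-in-the-attic_1000/index.html"
       title="A Light in the Attic">A Light in the ...</a>
  </h3>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
link = soup.select_one("h3 a")  # first <a> tag nested inside an <h3> tag

visible_text = link.get_text()  # shortened text displayed on the page
full_title = link.get("title")  # complete title stored in the attribute

print(visible_text)  # A Light in the ...
print(full_title)    # A Light in the Attic
```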
Start by importing the installed Python libraries:
import requests
from bs4 import BeautifulSoup
Next, create the web_unblocker dictionary object and form the URL with your Oxylabs sub-user’s credentials and the Web Unblocker endpoint:
web_unblocker = {
    'http': 'http://USERNAME:PASSWORD@unblock.oxylabs.io:60000',
    'https': 'http://USERNAME:PASSWORD@unblock.oxylabs.io:60000',
}
The next step is to send a GET request to the target website through Web Unblocker. This can be achieved with the following code snippet:
response = requests.get(
    'https://books.toscrape.com/',
    verify=False,
    proxies=web_unblocker
)
Web Unblocker requires users to ignore the SSL certificate, which is done by adding verify=False within the GET request. Then, include the proxies argument and pass the web_unblocker object to forward the web request through the Web Unblocker endpoint.
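If you send many requests, it can be convenient to apply the same settings once on a session instead of repeating them on every call. The sketch below is one possible way to do that, not an official integration pattern: the build_session helper is a hypothetical name, the endpoint is the one used in this tutorial, and the credentials are percent-encoded in case they contain characters such as "@" or ":". It also silences the InsecureRequestWarning that urllib3 emits when certificate verification is turned off.

```python
from urllib.parse import quote

import requests
import urllib3

# Endpoint as used in this tutorial.
WEB_UNBLOCKER_ENDPOINT = "unblock.oxylabs.io:60000"


def build_session(username: str, password: str) -> requests.Session:
    """Create a session that routes its requests through Web Unblocker."""
    # Percent-encode credentials so special characters don't break the URL.
    proxy_url = (
        f"http://{quote(username, safe='')}:{quote(password, safe='')}"
        f"@{WEB_UNBLOCKER_ENDPOINT}"
    )
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    # Skip certificate verification for requests made through this session.
    session.verify = False
    # verify=False triggers InsecureRequestWarning; silence only that warning.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return session
```

With valid credentials, a call such as build_session("USERNAME", "PASSWORD").get("https://books.toscrape.com/", timeout=60) would then behave like the requests.get call in this tutorial. The timeout value is an arbitrary example.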
Here you can utilize the Beautiful Soup library to extract the content from the target page. First, create the soup object, which will store the HTML content:
soup = BeautifulSoup(response.content, "html.parser")
Then, create a for loop to extract all the titles:
for title in soup.select("h3 a"):
    print(title.get("title"))
The soup.select method uses CSS selectors to find all the <a> tags inside the <h3> tags. Since each title is stored as the value of the title attribute, you can retrieve the complete title using the .get method. If you’re interested in learning web scraping, check out our in-depth blog posts on Python Web Scraping and Beautiful Soup to get an easy start.
The complete code should look like this:
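import requests
from bs4 import BeautifulSoup

# Route both HTTP and HTTPS traffic through the Web Unblocker endpoint.
web_unblocker = {
    'http': 'http://USERNAME:PASSWORD@unblock.oxylabs.io:60000',
    'https': 'http://USERNAME:PASSWORD@unblock.oxylabs.io:60000'
}

# Web Unblocker requires SSL certificate verification to be disabled.
response = requests.get(
    'https://books.toscrape.com/',
    verify=False,
    proxies=web_unblocker
)

# Parse the HTML and print the full title of every book on the page.
soup = BeautifulSoup(response.content, "html.parser")
for title in soup.select("h3 a"):
    print(title.get("title"))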
As you can see, it only takes a few lines of Python code to incorporate Oxylabs’ Web Unblocker. Using the above code, you should expect the following output:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
Hopefully, these Python examples helped you see how effortless the integration process of Web Unblocker is. Visit our documentation to learn more about its parameters and general integration steps.
Of course, it’s always possible to create your own solution that takes care of complex CAPTCHAs. While the development stage may take some time, you can tailor it specifically to the kind of requests you wish to send. This can result in higher success rates, allowing you to perform web scraping activities without interruptions. There are a couple of viable tools for this quest:
Playwright is an excellent web testing and automation tool developed by Microsoft, which can also be used to bypass CAPTCHAs. It supports the most popular programming languages, such as Python, JavaScript, and Java. Playwright works with Chromium-based, Firefox, and WebKit browsers, giving users more flexibility. We’ve made a detailed blog post specifically on Playwright, so be sure to take a look at this Playwright Scraping Tutorial for more information.
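As a rough starting point, the sketch below loads a page with Playwright's synchronous Python API in headless Chromium, using a browser context with a few explicit settings. The helper names and the context values (viewport, locale, time zone) are illustrative assumptions rather than recommended settings, and nothing here guarantees that a CAPTCHA won't appear. It only shows where such tuning would go. Running it requires installing Playwright and its browsers (pip install playwright, then playwright install).

```python
def build_context_options() -> dict:
    """Browser context settings resembling a regular desktop visitor.

    The values are illustrative assumptions, not recommended settings.
    """
    return {
        "viewport": {"width": 1366, "height": 768},
        "locale": "en-US",
        "timezone_id": "America/New_York",
    }


def fetch_html(url: str) -> str:
    """Load a page in headless Chromium and return its rendered HTML."""
    # Imported here so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(**build_context_options())
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded")
        html = page.content()
        browser.close()
    return html
```

Calling fetch_html("https://books.toscrape.com/") would return the page's HTML, which can then be parsed with Beautiful Soup in the same way as in the Web Unblocker example.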
Puppeteer is also a very effective web automation tool that you can use to design a program that avoids CAPTCHAs. While Puppeteer, maintained by Google, officially supports only JavaScript, you can use it in Python through an unofficial library called Pyppeteer. Puppeteer’s main downside is that it focuses primarily on Chromium-based browsers. If you’re curious to learn more, check out our in-depth blog post on web scraping with Puppeteer and a tutorial on how to overcome CAPTCHA challenges with Puppeteer.
Keep in mind that developing your own solution will require you to spend time writing code and micromanaging it to adapt to constant changes. In cases where this is an issue, the better option is to utilize ready-made web scrapers that avoid CAPTCHAs automatically. It takes a mountain of effort to build yourself a scalable scraper that sifts through the web undetected and uninterrupted, but a pre-built tool can ease the process immensely, saving time and resources. See how both methods differ in this guide to scraping Amazon.
With CAPTCHAs being one of the most common challenges in public data collection, it’s essential to find a reliable and high-quality solution to bypass them. This article discussed the different types of CAPTCHA tests in use today and presented a few anti-CAPTCHA solutions you can try implementing in your scraping tasks.
If you're curious to try out our scraping solutions, you can simply get a free trial and follow our guides for your desired target. Here are some tutorials to get you started: how to scrape Google search results and how to scrape Etsy data.
If you have any questions about this topic or would like to learn more about Web Unblocker, Oxylabs’ ultimate solution for bypassing CAPTCHAs, feel free to contact us at hello@oxylabs.io or via the live chat.
Yes, CAPTCHA tests can be bypassed. Many services on the market, such as CAPTCHA solvers and proxy solutions, are designed specifically for this purpose. For instance, Oxylabs’ Web Unblocker chooses the right combination of cookies, headers, and other browser attributes to appear as an organic user and, ultimately, overcome target website blocks.
While Google reCAPTCHA is considered to be more sophisticated and harder to bypass than the original CAPTCHA, it’s still possible to bypass it in several different ways. You can either implement a ready-to-use tool or develop your own and tailor it specifically to the kind of requests you wish to send.
Even though modern CAPTCHAs are advanced and tend to provide a high level of security for websites, sophisticated bots can still bypass them. Such bots are usually built with special features like dynamic browser fingerprinting, which let users overcome even complex CAPTCHA tests and carry out their scraping and crawling activities without interruption.
On websites, CAPTCHAs are used to separate human users from malicious bots. They act as a safety net to stop bots from engaging in possibly harmful or malicious activities like spamming or fraudulent transactions.
There are several ways to avoid CAPTCHA when gathering web data. If you’re using a DIY scraper, make sure to use proxies and rotate them. Adjusting User-Agent headers to refine your scraper's fingerprint is another useful tactic. Additionally, you might want to consider using automated tools like Web Unblocker, which can effectively solve CAPTCHA challenges for you. Also, it's a good practice to use CAPTCHA proxies.
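For a DIY scraper, proxy rotation and User-Agent adjustment can be sketched roughly as follows. The proxy addresses, the User-Agent strings, and the next_request_settings helper are hypothetical placeholders: the example cycles through the proxies in order and picks a random User-Agent for each request, which is just one simple rotation strategy among many.

```python
import itertools
import random

# Hypothetical proxy addresses; replace with proxies you actually have.
PROXIES = [
    "http://USERNAME:PASSWORD@proxy1.example.com:8000",
    "http://USERNAME:PASSWORD@proxy2.example.com:8000",
    "http://USERNAME:PASSWORD@proxy3.example.com:8000",
]

# Example User-Agent strings (illustrative; keep them current in practice).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

_proxy_cycle = itertools.cycle(PROXIES)


def next_request_settings() -> dict:
    """Return requests keyword arguments with the next proxy and a random User-Agent."""
    proxy = next(_proxy_cycle)
    return {
        "proxies": {"http": proxy, "https": proxy},
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
    }
```

Each request could then be sent as requests.get(url, timeout=30, **next_request_settings()), so that consecutive requests leave through different proxies.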
First patented in 1997, CAPTCHA is an abbreviation for Completely Automated Public Turing Test to tell Computers and Humans Apart.
About the author
Yelyzaveta Nechytailo
Senior Content Manager
Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse herself in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
oxylabs.io © 2024 All Rights Reserved