It would be hard to find a person who has never had to prove to a computer that they are human. Solving odd puzzles involving fire hydrants may seem like a strange way to prove you are conscious, but it will seem less strange after reading this article. You will find out how CAPTCHAs work and how, by solving them, you play an important role in training Artificial Intelligence.
- What does CAPTCHA mean?
- How do CAPTCHAs work?
- Why are CAPTCHAs used?
- What is reCAPTCHA?
- How does reCAPTCHA work?
- Different types of reCAPTCHA
- What triggers CAPTCHAs and reCAPTCHAs?
- CAPTCHAs and Artificial Intelligence
- Can CAPTCHA be bypassed?
What does CAPTCHA mean?
CAPTCHA is an acronym for Completely Automated Public Turing Test to Tell Computers and Humans Apart. It is sometimes also called a Human Interactive Proof (HIP). A CAPTCHA is meant to differentiate humans from bots. A traditional CAPTCHA stretches and distorts letters and/or numbers and asks users to identify the text, something that is usually easy for a human but challenging for a bot.
In 1950 Alan Turing, often called the father of modern computing, introduced the Turing Test. This assessment was designed to show whether machines could think, or appear to think, like humans. During the test, an interrogator asks two participants a series of questions. One participant is a human, while the other is a machine. The interrogator does not know which is which and has to guess based solely on their answers. If the interrogator cannot reliably tell which participant is the machine, the machine passes the test.
As the name suggests, traditional CAPTCHA is based on the Turing Test.
How do CAPTCHAs work?
A CAPTCHA’s goal is to separate humans from bots. To achieve that, a CAPTCHA system presents different images to different users. The database of CAPTCHAs is massive so that it can offer as many variations as possible. If the answer to a CAPTCHA were hidden in the metadata of the image, or if the solution were always the same, it would take no time for a computer to solve it.
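The idea that the answer must never travel with the image, and must differ every time, can be sketched in a few lines. This is a minimal illustration, not how any particular CAPTCHA provider works: the names `SERVER_KEY`, `new_challenge`, and `verify` are hypothetical, and rendering the distorted image is left out.

```python
import hashlib
import hmac
import secrets
import string

# Hypothetical server-side secret; a real deployment would load it from configuration.
SERVER_KEY = secrets.token_bytes(32)
ALPHABET = string.ascii_uppercase + string.digits


def new_challenge(length: int = 6) -> tuple[str, str]:
    """Return (text_to_render, token).

    The text is drawn at random for every request. Only the keyed digest
    (token) is kept alongside the session, so the plain answer is never
    embedded in the image or its metadata.
    """
    text = "".join(secrets.choice(ALPHABET) for _ in range(length))
    token = hmac.new(SERVER_KEY, text.encode(), hashlib.sha256).hexdigest()
    return text, token


def verify(answer: str, token: str) -> bool:
    """Compare the user's answer with the stored digest in constant time."""
    normalized = answer.strip().upper().encode()
    candidate = hmac.new(SERVER_KEY, normalized, hashlib.sha256).hexdigest()
    return hmac.compare_digest(candidate, token)
```

Because every call to `new_challenge` produces a fresh random string, a bot cannot reuse an earlier solution, and because only a keyed digest is stored, reading the token does not reveal the answer.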
While CAPTCHAs are designed to be solved only by humans, that does not mean everyone can solve a CAPTCHA on the first try. Researchers say that humans should be able to solve around 80% of CAPTCHAs, while the success rate for machines should be about 0.01%.
Most traditional CAPTCHAs rely on vision because computers have been less capable than humans at processing visual information. Most people can pick out patterns quickly and make connections between different subjects. The tendency to perceive familiar patterns where none actually exist is called pareidolia. For example, we see familiar shapes in clouds because our brains try to organize information into patterns.
For people with impaired vision, CAPTCHAs are also offered in audio format. The audio normally includes some background noise to stop bots from solving these tests.
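To see what those two success rates imply in practice, here is a small back-of-the-envelope calculation. It assumes each attempt is independent and uses the figures quoted above; the function names are illustrative only.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean number of tries until the first success (geometric distribution)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


def chance_within(success_rate: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries."""
    return 1 - (1 - success_rate) ** attempts


human_rate = 0.80     # 80% per attempt
machine_rate = 0.0001  # 0.01% per attempt

print(expected_attempts(human_rate))    # about 1.25 tries on average
print(expected_attempts(machine_rate))  # about 10,000 tries on average
print(chance_within(machine_rate, 1000))
```

Under these assumptions a person needs just over one try on average, while a bot would need roughly ten thousand, which is why a low per-attempt machine success rate still matters when bots can retry cheaply.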
Why are CAPTCHAs used?
CAPTCHAs are primarily used to protect websites from malicious activity. Many websites do not want to be abused by bots and therefore ask users for CAPTCHA validation. However, CAPTCHAs sometimes stand in the way when people want to collect public data for research or business purposes.
Here are some examples of how CAPTCHAs are used:
- If a free email platform did not use CAPTCHAs, someone could use it to send spam advertisements from many different email addresses. CAPTCHAs help identify bots and stop them before they do any harm.
- Ticket sellers also often use CAPTCHAs. Resellers sometimes employ bots to grab large numbers of tickets to the most popular events mere seconds after their release. They buy out the tickets and later resell them at a higher price. CAPTCHAs help stop these bots.
- Distributed Denial-of-Service (DDoS) attacks are another common threat. Attackers aim to intentionally disrupt services by sending a large number of requests to one target. Websites introduce CAPTCHAs to reduce the risk of attacks that could take their services offline.
- On the other hand, CAPTCHAs may slow down legitimate work. For example, researchers have to go through vast amounts of public information, download documents, and collect data. CAPTCHAs interfere with these tasks and become a burden.
What is reCAPTCHA?
ReCAPTCHA is a service from Google that performs the same function as a regular CAPTCHA. Many websites use it as a free web protection solution. You may have noticed reCAPTCHAs that only ask users to tick a box rather than solve a puzzle. These are called “No CAPTCHA reCAPTCHA”. If the system is still not convinced after the box is ticked, the user will be asked to solve a challenge to prove they are human.
How does reCAPTCHA work?
The first reCAPTCHAs were created by digitizing books, using images of street names, and taking snippets of text from newspapers, then asking users to decipher the words or their combinations. While reading text in an image is not a difficult task for a human, it is challenging for a bot.
Computers are getting more and more sophisticated, but so are reCAPTCHAs. Over time, more types of reCAPTCHAs have been developed and now include image recognition, checkboxes, and general user behavior assessment that does not require any user interaction.
Different types of reCAPTCHA
Image recognition reCAPTCHAs give a user 9 or 16 square images. These images may be related or completely different. The user has to identify the images that include (or do not include) a certain object, such as street signs, fire hydrants, clouds, or anything else. How does the system know if the answer is correct? The response has to match the answers submitted by most of the other users who solved the same test.
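The "match what most other users answered" rule can be sketched as a simple majority vote over tile selections. This is only an illustration of the principle described above: the function names and the 50% cut-off are assumptions, and a production system would almost certainly be more tolerant of near-misses.

```python
from collections import Counter


def consensus_tiles(previous_answers: list[frozenset[int]],
                    min_share: float = 0.5) -> frozenset[int]:
    """Return the tiles selected by more than `min_share` of earlier solvers."""
    if not previous_answers:
        return frozenset()
    votes = Counter(tile for answer in previous_answers for tile in answer)
    needed = min_share * len(previous_answers)
    return frozenset(tile for tile, count in votes.items() if count > needed)


def is_accepted(response: set[int], previous_answers: list[frozenset[int]]) -> bool:
    """Accept a response only if it equals the majority selection."""
    return frozenset(response) == consensus_tiles(previous_answers)


# Tiles are numbered 0-8 in a 3x3 grid; each set is one earlier user's selection.
earlier = [
    frozenset({0, 4, 8}),
    frozenset({0, 4}),
    frozenset({0, 4, 8}),
    frozenset({0, 3, 4, 8}),
]
print(sorted(consensus_tiles(earlier)))  # [0, 4, 8]
print(is_accepted({0, 4, 8}, earlier))   # True
print(is_accepted({0, 4}, earlier))      # False
```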
How do checkbox CAPTCHAs work? Merely ticking a checkbox that says “I’m not a robot” is not the real test. The real test is what happens on the way to the checkbox.
This test considers the movement of the mouse cursor as it approaches the checkbox. Human users are much less predictable than bots. Even the most direct mouse movement performed by a person is not perfectly straight, and bots struggle to mimic the same pattern. ReCAPTCHAs may also inspect HTTP cookies that the browser stores on the device.
As mentioned previously, the user may be presented with an additional challenge if the test cannot determine whether the user is a human or a bot.
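One way to picture the "human paths are never perfectly straight" idea is to compare the distance the cursor actually travelled with the straight-line distance between its start and end points. The sketch below is a toy heuristic under that assumption. The `threshold` value and function names are hypothetical, and nothing here reflects how reCAPTCHA actually evaluates movement.

```python
import math

Point = tuple[float, float]


def straightness(path: list[Point]) -> float:
    """Ratio of straight-line distance to travelled distance.

    A value of 1.0 means the cursor moved in a perfectly straight line;
    lower values mean the path wandered.
    """
    if len(path) < 2:
        return 1.0
    travelled = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(path[0], path[-1]) / travelled


def looks_scripted(path: list[Point], threshold: float = 0.999) -> bool:
    """Flag paths that are suspiciously close to a perfect straight line."""
    return straightness(path) >= threshold


bot_path = [(0.0, 0.0), (50.0, 50.0), (100.0, 100.0)]
human_path = [(0.0, 0.0), (30.0, 45.0), (70.0, 60.0), (100.0, 100.0)]
print(looks_scripted(bot_path))    # True
print(looks_scripted(human_path))  # False
```

A single ratio like this would be easy for a bot to fake by adding jitter, which is one reason real systems are said to combine several signals rather than rely on one.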
The most recent version of reCAPTCHA is able to determine whether the user is a human without any puzzles or checkboxes. The test takes into account the user’s behavior and history of interacting with websites. In most cases, the system can decide whether the user is a bot based on these factors. If this information is not enough, the user is challenged with one of the previously mentioned reCAPTCHAs.
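The escalation described above (decide silently when possible, fall back to a visible test otherwise) can be sketched as a mapping from a risk score to the next step. The score range, the two thresholds, and the `Action` names are all assumptions made for illustration. The article does not specify how the real system weighs its signals.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"                      # no visible test at all
    CHECKBOX = "checkbox"                # show the "I'm not a robot" box
    IMAGE_CHALLENGE = "image_challenge"  # show an image recognition puzzle


def choose_action(score: float,
                  allow_at: float = 0.7,
                  checkbox_at: float = 0.3) -> Action:
    """Map a score in [0, 1] (1.0 = very likely human) to the next step."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= allow_at:
        return Action.ALLOW
    if score >= checkbox_at:
        return Action.CHECKBOX
    return Action.IMAGE_CHALLENGE


for score in (0.9, 0.5, 0.1):
    print(score, choose_action(score).value)
```

Keeping the thresholds as parameters reflects a design choice a site owner would face: a stricter `allow_at` blocks more bots but sends more genuine visitors to a puzzle.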
What triggers CAPTCHAs and reCAPTCHAs?
If the system suspects that the user is a bot, then a CAPTCHA shows up. It can be triggered by, for example, sending too many requests to the same target.
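A "too many requests" trigger is commonly implemented as a sliding-window counter per client. The sketch below shows that general technique. The class name, the limits, and the use of a plain string as client identifier are hypothetical choices, not a description of any specific website's rules.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class RequestMonitor:
    """Track recent requests per client and flag clients that exceed a limit."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of recent requests

    def needs_captcha(self, client_id: str, now: Optional[float] = None) -> bool:
        """Record one request and report whether the client is over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        hits.append(now)
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


monitor = RequestMonitor(max_requests=3, window_seconds=10.0)
for second in range(5):
    print(second, monitor.needs_captcha("203.0.113.7", now=float(second)))
```

In this run the first three requests pass and the fourth and fifth are flagged. Once the client stays quiet long enough for old requests to leave the window, it is no longer challenged.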
ReCAPTCHAs seem to be more sophisticated. While it is not exactly clear what triggers reCAPTCHAs, there are some potential factors:
- Mouse movements
- Tracking cookies
- Browsing history
CAPTCHAs and Artificial Intelligence
CAPTCHAs and reCAPTCHAs are a good example of Artificial Intelligence (AI) training in practice. As mentioned earlier, when the system asks you, for example, to click on every kitten in the images, it decides whether the answer is correct based on other users’ answers. This information also feeds AI systems and helps computers get better at recognizing images.
Image recognition is challenging for computers. For example, unlike humans, computers struggle to recognize the same object when a picture is taken from a different angle. But with recent technologies, computers are becoming more capable, and machine learning keeps improving their performance.
Can CAPTCHA be bypassed?
When a bot manages to bypass a CAPTCHA, it exposes a weakness in the test, and identifying weak points is the first step toward creating better solutions. However, bypassing CAPTCHAs is not an easy task.
Being blocked and being served CAPTCHAs are among the most common challenges in web scraping. These challenges can interrupt large-scale public data gathering operations. Some companies have already found solutions for bypassing CAPTCHAs. For example, Real-Time Crawler helps deliver the requested data without IP bans or CAPTCHAs. Companies can collect information smoothly at scale and focus on data analysis rather than data gathering.
CAPTCHAs are used to protect websites from spam and abuse. The goal of a CAPTCHA is to distinguish human users from bots by giving them a test that only humans should be able to solve. The idea of CAPTCHA is based on the Turing Test.
ReCAPTCHA is a CAPTCHA service provided by Google. There are different types of reCAPTCHA tests, and some of them do not require any user interaction at all. It is not exactly clear what triggers reCAPTCHAs, but some of the factors include tracking cookies, browsing history, and real-time interaction with a website.
Bypassing CAPTCHAs is hard for computers, as the primary purpose of these tests is to be unsolvable for bots. However, some solutions, such as Real-Time Crawler, support web scraping without any CAPTCHAs or IP bans.