We often perceive the term “bot” as negative. However, not all bots are bad. The issue is that good bots can share characteristics with malicious ones, so good bot traffic sometimes gets labeled as bad and blocked.
Bad bots are only getting smarter, which makes it harder for other bots to stay block-free. This creates problems not only for site owners trying to keep their websites performing well but also for the web scraping community.
While we have already covered what a bot is, in this article, we’ll go more in-depth about bot activities, how to detect bot traffic and block it, and how it can affect businesses.
Bot traffic is any non-human traffic to a website. It comes from software applications that run automated, repetitive tasks much faster than humanly possible.
Because bots can perform tasks so quickly, they can be used for both good and bad purposes. In 2022, 47.4% of all online traffic came from bots, with bad bots accounting for 30.2% of all traffic. That’s 2.5 percentage points more than in 2021.
The legality of traffic bots depends on how they’re used. Generally, traffic bots that don’t infringe on any laws, regulations, or third-party rights are unlikely to be considered illegal, as opposed to those that do, such as DDoS botnets.
Good bot traffic is also on the rise: its share grew by 2.7 percentage points compared to 2021. Since even friendly bots can alter web traffic metrics and make analytics unreliable, website owners have been strengthening their security in response to the growth of both good and bad bots. As a result, more good bots get wrongfully caught.
To better understand what good and bad bots are, let's look at some examples:
Search engine bots – these bots crawl, catalog, and index web pages. Such results are used by search engines such as Google to provide their services effectively.
Site monitoring bots – these bots monitor websites to identify possible issues such as long loading times, downtime, etc.
Web scraping bots – if the data being scraped is publicly available, the data can be used for research, identifying and pulling down illegal ads, brand monitoring, and much more.
Spam bots – these bots often create fake user accounts on forums, social media platforms, messaging apps, and so on. The accounts are then used to build a social media presence, generate more clicks on a post, etc.
DDoS attack bots – some malicious bots are created to take down websites. DDoS attacks usually leave just enough bandwidth available to allow other attacks to make their way into the network and pass weakened network security layers undetected to steal sensitive information.
Ad fraud bots – these bots automatically click on ads, siphoning off money from advertising transactions.
So, a “good” bot is a bot that performs useful or helpful tasks that aren’t detrimental to a user’s experience on the internet, whereas a bad bot is the exact opposite and, in most cases, has malicious or even illegal intentions.
Websites use various techniques to detect bots and prevent malicious bot traffic. Here are several of them:
Browser fingerprinting – this refers to information that is gathered about a computing device for identification purposes. Any browser passes specific data points to the connected website’s servers, such as your operating system, language, plugins, fonts, hardware, etc. Learn more about browser fingerprinting in our in-depth blog.
Behavioral inconsistencies – this involves behavioral analysis that looks for signs such as unnaturally linear mouse movements, rapid button and mouse clicks, repetitive patterns, and unusual average page time or average requests per page.
CAPTCHA – a popular anti-bot measure. It’s a challenge-response test that often asks you to enter the correct code or identify objects in pictures. You can read more on how CAPTCHAs work in our blog.
Once a website identifies bot-like behavior, it blocks the bot from further crawling. For more details, Dmitry Babitsky, the co-founder & chief scientist at ForNova, has spoken in-depth on how websites block bots in his presentation at OxyCon.
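To make the behavioral analysis idea more concrete, here is a minimal Python sketch of one possible signal: how straight a recorded mouse path is. Human cursor paths tend to curve and wobble, while simple scripts often move in perfectly straight lines. The function names, the 0.99 threshold, and the sample coordinates are illustrative assumptions and not how any particular anti-bot product works.

```python
import math


def path_straightness(points):
    """Return straight-line distance divided by travelled distance.

    A value of 1.0 means the cursor moved in a perfectly straight line.
    """
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


def looks_scripted(points, threshold=0.99):
    """Flag a mouse path as suspicious if it is almost perfectly straight.

    The threshold is a hypothetical value chosen for illustration only.
    Paths with fewer than three samples are not judged at all.
    """
    if len(points) < 3:
        return False
    return path_straightness(points) >= threshold


# Hypothetical samples of (x, y) cursor positions.
straight_path = [(0, 0), (10, 10), (20, 20)]
curved_path = [(0, 0), (5, 12), (14, 9), (20, 20)]

print(looks_scripted(straight_path))
print(looks_scripted(curved_path))
```

In practice, a single signal like this would be far too weak on its own. Real systems are likely to combine many behavioral features, which is also why more advanced bots that imitate human movement are harder to catch.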
While websites can use anti-bot systems that automatically check incoming traffic and detect bots, other options include tools like Google Analytics. When inspecting web traffic, you may find strange behavior that can help identify bot traffic, such as:
Unusual volume of page views. Bots may generate a high number of page views, so a sudden spike on a website that usually attracts significantly fewer views may point to bot traffic.
Increased or dropped session time. If the time users stay on your website has unexpectedly increased or significantly decreased, there may be bots browsing the website at very slow or inhumanly fast rates.
A sudden increase in bounce rate. When there’s a high bounce rate you can’t explain, it may indicate that bots are used to target a single page.
Unforeseen increase in traffic from one location. An unexpected and unexplained rise of traffic from one specific location may point to a bot network.
Falsified user conversions. Bot traffic may also produce falsified information, such as fake names and surnames, company names, phone numbers, email addresses, physical addresses, and other details entered by form-filling bots.
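Two of the signals above, an unusual volume of page views and a sudden rise in traffic from one location, can be checked with simple statistics on exported analytics data. The sketch below is only an illustration: the function names, the z-score threshold of 3, the 50% share threshold, and the sample numbers are all assumptions, and it doesn't use any Google Analytics API.

```python
from collections import Counter
from statistics import mean, stdev


def is_volume_spike(history, today, z_threshold=3.0):
    """Return True if today's page views sit far above the historical mean.

    `history` is a list of daily page view counts. The z-score threshold
    is a hypothetical value chosen for illustration.
    """
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > z_threshold


def dominant_location(session_locations, share_threshold=0.5):
    """Return a location that exceeds the given share of sessions, else None."""
    if not session_locations:
        return None
    location, count = Counter(session_locations).most_common(1)[0]
    return location if count / len(session_locations) > share_threshold else None


# Hypothetical daily page views and session locations.
daily_views = [1000, 1100, 950, 1050, 1020]
print(is_volume_spike(daily_views, today=5000))
print(is_volume_spike(daily_views, today=1080))
print(dominant_location(["LT"] * 8 + ["US"] * 2))
```

A flagged day or location is only a hint that something deserves a closer look. Marketing campaigns, press coverage, or seasonal traffic can produce the same patterns without any bots involved.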
Distinguishing bot traffic patterns from human behavior online has become a complex task in itself, and the bots on the internet have evolved dramatically over the years. Currently, there are four different generations of bots:
First-generation – these bots are built with basic scripting tools and mainly perform simple automated tasks like scraping, spam, etc.
Third-generation – often used for slow DDoS attacks, identity theft, API abuse, and more. They are relatively difficult to detect based on device and browser characteristics and require proper behavioral and interaction-based analysis to identify.
Fourth-generation – the newest iteration of bots. Such bots can perform human-like interactions, like nonlinear mouse movements. Detecting them requires advanced methods, often involving artificial intelligence (AI) and machine learning (ML) algorithms.
The fourth generation of bots is tough to differentiate from legitimate users, and basic bot detection technologies are no longer sufficient. For such bot traffic to be detected, it will take a lot more than simple tools and behavioral interaction analysis.
On top of using Google Analytics or similar tools to track and investigate traffic, you can also implement the following prevention methods:
Create a robots.txt file for your website. A good starting point might be to provide crawling instructions for bots accessing your website's resources. See these examples of Oxylabs' robots.txt and Google's robots.txt file.
Implement CAPTCHA tests. CAPTCHAs could be your next move as they work well at catching simple bots and may introduce difficulties for more advanced bots. One of the popular free options is Google’s reCAPTCHA, which reliably secures websites from spam and abuse.
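Set request rate limits. Rate limiting caps how many requests a single IP address can make in a given period, which addresses the issue of many bot requests coming from one IP address and can help mitigate DoS attacks. However, this method only works in specific situations and may still allow malicious bots through, for example, when they spread requests across many IP addresses.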
Use honeypot traps. Honeypots are specifically designed to attract unwanted or malicious bots, allowing websites to detect bots and ban their IP addresses.
Set up a web application firewall (WAF). WAFs can be used to filter out suspicious requests and block IP addresses based on various factors. Yet, WAFs are complex and may not be the best option for website owners with limited technical resources.
Use bot detection systems. Implementing an anti-bot system is one of the most reliable ways to detect bots and prevent them from accessing your website. While bot detection tools like PerimeterX have their limits, they combine a variety of modern bot detection methods with AI and machine learning to provide up-to-date and proven defenses against most bots.
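As an illustration of the rate limiting idea from the list above, here is a minimal in-memory sliding-window limiter keyed by IP address. The class name, the limits, and the example IP addresses (taken from reserved documentation ranges) are assumptions made for this sketch. A production setup would more likely rely on a web server, CDN, or WAF feature than on hand-written code.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per IP within `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # IP address -> timestamps of accepted requests

    def allow(self, ip, now=None):
        """Return True if the request should be served, False if it should be rejected."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Hypothetical policy: 3 requests per 10 seconds per IP.
limiter = SlidingWindowRateLimiter(max_requests=3, window_seconds=10)
for second in range(5):
    print(second, limiter.allow("203.0.113.5", now=second))
```

Because the limiter counts requests per IP address, a bot network that spreads its requests across many addresses would stay under the limit. That is one reason rate limiting works best alongside the other methods in the list rather than as a replacement for them.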
If you want a step-by-step guide on how to crawl a website without getting blocked by the anti-bot measures, we have written in great detail on how to do exactly that. In that blog post, we provide you with a list of actions to prevent getting blacklisted while scraping and crawling websites. However, if you would like a faster and less labor-intensive method, you could check out Web Unblocker as a solution.
It's an AI-powered proxy solution that has a dynamic fingerprinting feature. This feature allows Web Unblocker to choose the right combination of headers, cookies, browser attributes, and proxies so that you can appear as an organic user and successfully bypass all target website blocks.
Another great option is using a pre-built scraper that encompasses all the features necessary to overcome anti-bot measures. For example, our SERP Scraper API is feature-rich and designed to get the data from search engine results pages. On top of that, it's specifically adapted to extract data from the major search engines and overcome any sort of anti-bot techniques specific to these engines. See how simple it is to use our API in this guide to scraping Google search results.
Bad bot traffic is predicted to keep increasing each year. As for good bot traffic, the chance of not getting mixed in with the bad crowd is slowly dwindling. Amongst friendly bots, there are a lot of web scrapers that use gathered data for market research, pulling down illegal ads, brand monitoring, etc. All of them may get flagged as bad and blocked. Fortunately, solutions implementing AI and ML technologies are being built to overcome false bot blocks.
Web Unblocker’s functionality offers ML-driven proxy management, dynamic browser fingerprinting, ML-powered response recognition, and an auto-retry system. All these features make web scraping smooth and block-free.
The first step in detecting bots through IP addresses is checking the source of the IP itself. If the IP comes from a data center, it could indicate that the user behind it is deliberately hiding their real IP address. Furthermore, the IP address can be checked against databases of known bot network IPs, allowing you to block bots with more confidence if their IPs are on the list.
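The two checks described above can be sketched with Python's standard `ipaddress` module. The ranges and addresses below are placeholders from reserved documentation blocks, and the names `DATACENTER_RANGES`, `KNOWN_BOT_IPS`, and `classify_ip` are hypothetical. In a real setup, both lists would come from a maintained IP intelligence source.

```python
import ipaddress

# Placeholder data for illustration only (reserved documentation ranges).
DATACENTER_RANGES = [ipaddress.ip_network("198.51.100.0/24")]
KNOWN_BOT_IPS = {ipaddress.ip_address("203.0.113.7")}


def classify_ip(raw_ip):
    """Classify an IP as 'known_bot', 'datacenter', or 'unknown'."""
    ip = ipaddress.ip_address(raw_ip)
    # An exact match in the blocklist is the strongest signal, so check it first.
    if ip in KNOWN_BOT_IPS:
        return "known_bot"
    # A data center origin is only a hint, not proof of bot activity.
    if any(ip in network for network in DATACENTER_RANGES):
        return "datacenter"
    return "unknown"


for address in ("203.0.113.7", "198.51.100.23", "192.0.2.10"):
    print(address, classify_ip(address))
```

Treating a data center IP as a weaker signal than a blocklist match reflects the wording above: a data center origin "could indicate" automation, while a listed IP is a more direct reason to block.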
Another more advanced technique would be to track and analyze user behavior, like mouse movements, page scrolling, navigation, and keystrokes, where abnormalities may suggest bot-like behavior.
With the right bot detection and mitigation techniques in place, it’s easier to detect malicious bots. However, the outcome depends on a number of factors, and advanced bots may evade even the strongest anti-bot systems. As malicious bot attacks are designed to be undetectable, it’s very important to constantly update and adjust bot detection measures to stay ahead of these threats.
While the laws that might be applicable to bot attacks may vary by country, a bot attack will, in most cases, be considered illegal as it’s likely to violate applicable laws, regulations, or third-party rights. For example, think of bots that gain unauthorized access to devices and networks, steal sensitive data, spread malware, or carry out DDoS and DoS attacks. Such malicious bot activities are likely to be designated as criminal offenses and may bear legal consequences.
About the author
Lead Product Marketing Manager
Gabija Fatenaite is a Lead Product Marketing Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.