How to Detect Bad Bots and How Bot Detection Affects Web Scraping

Gabija Fatenaite

Aug 14, 2020 7 min read

Often, we perceive the term “bot” as negative. However, not all bots are bad. The issue is that good bots can share characteristics with malicious ones, so they get labeled as bad and blocked.

Bad bots are only getting smarter, which makes it harder for good bots to stay block-free. This creates issues not only for site owners trying to keep their websites healthy, but for the web scraping community as well.

In this article, we’ll take a closer look at bot traffic: what it is, how websites detect and block bots, and how this affects web scraping in the long run.

What is bot traffic?

Bot traffic is any non-human traffic made to a website. It is generated by software applications that run automated, repetitive tasks much faster than any human could.
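
As a minimal illustration of what such automated, repetitive traffic looks like in practice, here is a short sketch of the simplest possible bot: a script that fetches the same page over and over, far faster than a human visitor would. The target URL is a placeholder.

```python
# A minimal sketch of basic bot traffic: a script issuing automated,
# repetitive requests far faster than a human could.
# "https://example.com" is a placeholder URL, not a real target.
import time

import requests

TARGET_URL = "https://example.com"

def simple_bot(pages: int = 50, delay_seconds: float = 0.1) -> None:
    """Fetch the same page repeatedly with a tiny, machine-like delay."""
    for i in range(pages):
        response = requests.get(TARGET_URL, timeout=10)
        print(f"request {i + 1}: status {response.status_code}")
        time.sleep(delay_seconds)  # far shorter pauses than human browsing

if __name__ == "__main__":
    simple_bot()
```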

Because they can perform tasks so quickly, bots can be used for both bad and good. In 2019, bad bots accounted for 24.1% of all website traffic, an 18.1% increase over 2018.

Bad bot vs. good bot vs. human traffic 2019
Source: imperva.com/blog/bad-bot-report-2020-bad-bots-strike-back

Good bot traffic, meanwhile, is decreasing: compared to 2018, it dropped by 25.1%. With bad bots on the rise and good bots in decline, website owners are forced to strengthen their security, and as a result more good bots get wrongfully caught.

To better understand the difference between good and bad bots, here are some examples:

Good bots

  • Search engine bots – these bots crawl, catalog, and index web pages so that search engines such as Google can provide their services effectively. 
  • Site monitoring bots – monitor websites to identify possible issues such as long loading times, downtime, etc.
  • Web scraping bots – if the data being scraped is publicly available, it can be used for research, identifying and pulling down illegal ads, brand monitoring, and much more. 

Bad bots

  • Spam bots – used to create fake accounts on forums, social media platforms, messaging apps, and so on, typically to build an artificial social media presence, inflate clicks on a post, etc.
  • DDoS attack bots – some bots are created to take down websites. DDoS attacks usually leave just enough bandwidth available for other attacks to make their way into the network, slipping past weakened security layers undetected to steal sensitive information. 
  • Ad fraud bots – these bots automatically click on ads, siphoning money from advertising transactions.

In short, a “good” bot performs useful or helpful tasks that aren’t detrimental to a user’s experience on the internet, whereas a bad bot does the exact opposite and, in most cases, has malicious or even illegal intentions.

Bot detection challenges

Distinguishing bot behavior from human behavior online has become a complex task in itself, as bots have evolved dramatically over the years. Currently, there are four generations of bots:

  • First-generation – built with basic scripting tools, these bots mainly perform simple automated tasks like scraping, spam, etc.
  • Second-generation – these bots operate through website development, hence the name ‘web crawlers.’ They are relatively easy to detect due to characteristic JavaScript firing and iframe tampering.
  • Third-generation – often used for slow DDoS attacks, identity theft, API abuse, and more. They are difficult to detect from device and browser characteristics alone and require proper behavioral and interaction-based analysis to identify.
  • Fourth-generation – the newest iteration of bots. They can perform human-like interactions such as nonlinear mouse movements. Detecting them requires advanced methods, often involving AI and machine learning technologies.

Fourth-generation bots are tough to differentiate from legitimate human users, and basic bot detection technologies are no longer sufficient. Detecting them takes far more than simple tools and basic behavioral interaction analysis.
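
To illustrate why fourth-generation bots are so hard to catch, here is a rough, hypothetical sketch of how such a bot might generate a nonlinear, human-looking mouse trajectory: a curved path with random jitter and uneven pacing instead of a straight, constant-speed line. The function and its parameters are purely illustrative and not taken from any real bot.

```python
# A rough sketch of human-like, nonlinear mouse movement: a curved
# (quadratic Bezier) path with random jitter and uneven pacing.
# All names and parameters here are illustrative.
import random
from typing import List, Tuple

Point = Tuple[float, float]

def human_like_path(start: Point, end: Point, steps: int = 40) -> List[Point]:
    """Return a jittered, curved list of cursor positions from start to end."""
    # A random control point bends the path away from a straight line.
    control = (
        (start[0] + end[0]) / 2 + random.uniform(-100, 100),
        (start[1] + end[1]) / 2 + random.uniform(-100, 100),
    )
    ease = random.uniform(0.8, 1.2)  # uneven but monotonic progress along the curve
    path: List[Point] = []
    for i in range(steps + 1):
        t = (i / steps) ** ease
        # Quadratic Bezier interpolation between start, control, and end.
        x = (1 - t) ** 2 * start[0] + 2 * (1 - t) * t * control[0] + t ** 2 * end[0]
        y = (1 - t) ** 2 * start[1] + 2 * (1 - t) * t * control[1] + t ** 2 * end[1]
        # Small jitter so no two runs produce identical pixel positions.
        path.append((x + random.uniform(-1.5, 1.5), y + random.uniform(-1.5, 1.5)))
    return path

if __name__ == "__main__":
    for point in human_like_path((10, 10), (800, 400))[:5]:
        print(point)
```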

How to detect bot traffic?

To prevent bad bots, websites have developed various bot detection techniques, some of which you are likely to encounter on a daily basis (e.g., CAPTCHA). 

Of course, there are more bot detection techniques to take into account. So how do websites detect bots? Here are some common ways:

  • Browser fingerprinting – the main approach is to check for attributes added by headless browsers and automation tools such as PhantomJS, Nightmare, Puppeteer, Selenium, and others. 
  • Browser consistency – checking for the presence of features that should or should not be in a browser, usually by executing specific JavaScript.
  • Behavioral inconsistencies – nonlinear mouse movements, rapid button and mouse clicks, repetitive patterns, unusual average page time, unusual average requests per page, starting browsing from inner pages without collecting HTTP cookies, and similar bot-like behavior (a simplified scoring sketch combining several of these signals follows this list).
  • CAPTCHA – a popular anti-bot measure: a challenge-response test that asks visitors to enter correct codes or identify objects in pictures.
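
To make these signals a bit more concrete, here is a simplified, hypothetical sketch of how a website might combine a few of them into a bot score on the server side. All field names and thresholds are invented for illustration; real detection systems are far more sophisticated.

```python
# A simplified, hypothetical sketch of server-side bot scoring that combines
# a few of the signals mentioned above. All field names and thresholds are
# made up for illustration; real systems are far more sophisticated.
from dataclasses import dataclass

@dataclass
class VisitorSignals:
    webdriver_flag: bool      # e.g. navigator.webdriver reported by a fingerprinting script
    has_cookies: bool         # did the client send previously set HTTP cookies?
    requests_per_minute: int  # request rate observed for this client
    mouse_events_seen: bool   # any mouse movement reported at all?

def bot_score(signals: VisitorSignals) -> int:
    """Return a rough 0-100 score; higher means more bot-like."""
    score = 0
    if signals.webdriver_flag:
        score += 40  # headless/automation fingerprint attribute present
    if not signals.has_cookies:
        score += 20  # started browsing without collecting cookies
    if signals.requests_per_minute > 60:
        score += 25  # faster than typical human browsing
    if not signals.mouse_events_seen:
        score += 15  # no interaction events at all
    return min(score, 100)

# Example: a headless scraper hitting pages quickly with no cookies or mouse events.
print(bot_score(VisitorSignals(True, False, 120, False)))  # -> 100
```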

Once a website identifies bot-like behavior, it blocks the offending client from further crawling. For more details, Dmitry Babitsky, co-founder and chief scientist at ForNova, has spoken in-depth about how websites block bots in his presentation at OxyCon.

How anti-bot measures affect web scraping

We’ve covered this topic in our blog post on fingerprinting and its impact on web scraping, which summarizes an OxyCon presentation by Allen O’Neill, a full-stack big data cloud engineer. To give a short summary, according to Mr. O’Neill:

Allen O’Neill from DataWorks

The only way (to overcome anti-bot measures in web scraping) will be to build guided bots, i.e., construct personas that will need to have their web-footprint schedules. Just like regular internet users, they will have to show their organic behavior to visited websites. Only then will it be possible to mix within the internet crowd, and consequently, slip under implemented anti-bot detection means.

Overcoming anti-bot measures with improved web scraping tools

O’Neill’s predictions were quite accurate. With bots evolving rapidly, and fourth-generation bots soon to give way to a fifth generation, Oxylabs focused on developing a tool to overcome these challenges.

Next-Gen Residential Proxies

Next-Gen Residential Proxies is an AI- and ML-powered solution that ensures an effortless web data gathering experience. Its most important aspect for bot detection is AI-powered dynamic fingerprinting, designed to imitate a regular user’s behavior and ensure a 100% success rate against bot traffic detection.

Other features include:

  • An Auto-Retry system, meaning the platform will automatically retry data extraction after any unsuccessful attempt.
  • High scalability and customisability, built on Oxylabs’ IP proxy infrastructure.
  • The ability to tailor your requests to retrieve JavaScript-heavy website content at scale (a generic proxy usage sketch follows this list). 
  • Existing users of standard proxies are able to switch to Next-Gen Residential Proxies easily.
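
To illustrate the general idea of routing scraping requests through a proxy endpoint, here is a minimal sketch using Python’s requests library. The hostname, port, and credentials are placeholders rather than real Oxylabs endpoints; the actual integration details are described in the product documentation.

```python
# A generic sketch of routing a scraping request through a proxy endpoint
# with Python's requests library. The hostname, port, and credentials below
# are placeholders, not real Oxylabs endpoints; see the product documentation
# for the actual values and parameters.
import requests

USERNAME = "your_username"                  # placeholder credential
PASSWORD = "your_password"                  # placeholder credential
PROXY_ENDPOINT = "proxy.example.com:60000"  # placeholder host:port

proxies = {
    "http": f"http://{USERNAME}:{PASSWORD}@{PROXY_ENDPOINT}",
    "https": f"http://{USERNAME}:{PASSWORD}@{PROXY_ENDPOINT}",
}

response = requests.get("https://example.com", proxies=proxies, timeout=30)
print(response.status_code)
print(response.text[:200])
```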

To further improve the tool and make better decisions, Oxylabs has also formed a board of advisors who are experts in the fields of data science, engineering, and AI:

  • Pujaa Rajan, Deep Learning Engineer at Node.io, USA Ambassador at Women In AI and Google Developer ML Expert
  • Adi Andrei, Lead, Mentor, Senior Data Scientist, NASA, Unilever, British Gas
  • Jonas Kubilius, Artificial Intelligence Researcher
  • Ali Chaudhry, PhD Researcher, Artificial Intelligence at UCL

Conclusion

Bad bot traffic is predicted to keep increasing each year, and the chance for good bots not to get mixed in with the bad crowd is slowly dwindling. Among good bots are many web scrapers that use the gathered data for research, market analysis, pulling down illegal ads, and much more. All of them may get flagged as bad and blocked.

Fortunately, solutions implementing AI and ML technologies are being built to overcome false bot blocks. To learn more about how these technologies are used to improve scraping, keep reading our blog.


About Gabija Fatenaite

Gabija Fatenaite is a Senior Content Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her – she’ll be more than happy to answer you.


All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.