Illegal Content: How We Built an AI-powered Tool to Detect It Online

Adelina Kiskyte

2020-12-21

For the past few months, the Oxylabs team has worked on a challenge raised by the Communications Regulatory Authority of the Republic of Lithuania (RRT). RRT is a national institution that regulates the electronic communications, postal, and rail markets under European Union directives and the laws of the Republic of Lithuania.

We have already written about GovTech Lab’s challenge and how Oxylabs won the contest to create a tool that automatically detects illegal content on the internet. In this article, we will explain, step by step, how our team of software developers, data scientists, and engineers created a powerful AI-driven tool that will help make the internet cleaner.

Before we dive into the solution, we would like to answer one question that will help explain why this sort of tool is urgently needed.

Why do we want to automatically detect illegal content online?

Last year, the RRT hotline in Lithuania received thousands of notifications reporting prohibited content on the internet. About a third of them were confirmed as illegal content. In 2018, INHOPE, the International Association of Internet Hotlines, received notifications for over 220,000 images with prohibited content.

At Oxylabs, we are all about innovation and automation. Checking every notification by hand is inefficient. As soon as we heard about the challenge, we knew that we had the tools and the knowledge to help RRT. Our team got together to brainstorm how to find illegal content online, and we soon had a plan.

Ten weeks later, we are sharing how we built an AI-powered tool to detect illegal content online.

Solution requirements

The main requirements for the solution were:

  • The solution should operate in the Lithuanian IP address range;

  • The tool must be able to identify websites operating in Lithuania;

  • It has to be able to recognize prohibited visual material;

  • The solution has to send a link (URL) of the detected prohibited visual material to the RRT hotline.

To understand the challenge better, you can find its initial description provided by the GovTech Lab.

Building the solution for detecting illegal web content

One of the main challenges was the limited time our team had to build the solution: ten weeks from the beginning of the challenge until the demo day. The solution will be improved after the demo day, but ten weeks was all the time we had to build a working tool for illegal content detection.

For clarity, we will explain step-by-step how the tool works.

Step one: domain and IP address check

One of the main requirements for the solution was that it has to operate in the Lithuanian IP address range. So in the first step, the tool checks Lithuanian domains and Lithuanian IP addresses.

The tool scans the Lithuanian IP range and checks whether the addresses host websites. It also checks whether Lithuanian domains resolve to Lithuanian IP addresses.
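A minimal Python sketch of this kind of check, using only the standard library. The CIDR blocks below are placeholder documentation ranges, not real Lithuanian allocations, and the function names are hypothetical; the article does not describe how the actual tool obtains its ranges or performs DNS lookups.

```python
import ipaddress
import socket

# Placeholder CIDR blocks (RFC 5737 documentation ranges).
# Assumption: a real deployment would load the country's allocated ranges instead.
EXAMPLE_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def ip_in_ranges(ip, ranges):
    """Return True if the IP address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)


def domain_in_ranges(domain, ranges, resolve=socket.gethostbyname):
    """Resolve a domain and check whether its address is inside the ranges.

    `resolve` is injectable so the check can be exercised without network access.
    """
    try:
        ip = resolve(domain)
    except OSError:  # covers socket.gaierror for names that do not resolve
        return False
    return ip_in_ranges(ip, ranges)
```

Called as `domain_in_ranges("example.lt", EXAMPLE_RANGES)`, the function performs a real DNS lookup; passing a custom `resolve` callable keeps the logic testable offline.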

Once the list is gathered, it is time for step two: scraping content.

Figure: Step one, the tool checks domains and IP addresses

Step two: content scraping

The tool then scrapes images from the websites that were gathered during the first step. These images are saved in a temporary database for further inspection.
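One small part of this step, collecting image URLs from a page's HTML, could look roughly like the sketch below. It assumes the HTML has already been downloaded and only looks at `<img src>` attributes; the class and function names are hypothetical, and the article does not say which scraping stack the actual tool uses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return the image URLs found in an HTML document, in page order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.image_urls
```

The resulting URLs would then be downloaded and handed to the next step for inspection.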

Figure: Step two, the tool scrapes content

Step three: hash checking

All the scraped images are hashed with the MD5 and SHA-1 algorithms. These hashes are then compared against a hash database provided by the police. If the hashes match, the information is passed on to the reporting module, which sends the details to the RRT.

The content that does not match the hash database is stored in a temporary database. From there, it goes through further inspection by an Artificial Intelligence (AI) recognition system.
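The hash comparison can be sketched in Python with the standard `hashlib` module. The set of known hashes stands in for the police database, and `triage` is a hypothetical helper; how the real tool stores and queries the database is not described here.

```python
import hashlib


def image_hashes(data):
    """Return the MD5 and SHA-1 hex digests of the raw image bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def matches_known_hash(data, known_hashes):
    """True if either digest appears in the set of known hashes."""
    return any(digest in known_hashes for digest in image_hashes(data).values())


def triage(images, known_hashes):
    """Split {url: bytes} into URLs to report and URLs needing further inspection."""
    to_report, to_inspect = [], []
    for url, data in images.items():
        if matches_known_hash(data, known_hashes):
            to_report.append(url)
        else:
            to_inspect.append(url)
    return to_report, to_inspect
```

MD5 and SHA-1 digests only match byte-identical files, so a resized or re-encoded copy of a known image produces a different hash. That is one reason unmatched images still need a second check.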

Figure: Step three, checking hashes

Step four: AI check

The images without matches in the police database go through further inspection. They are transferred to the AI recognition system, which runs the content through a library that checks whether the images contain restricted content. If an image scores above a set threshold, it is passed to the reporting module.
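The threshold logic itself is simple and can be sketched independently of the model. In the Python sketch below, `score` is a hypothetical callable standing in for whatever recognition library is used, and the threshold value is purely illustrative; neither comes from the article.

```python
from typing import Callable, Iterable, List, Tuple


def select_for_report(
    images: Iterable[Tuple[str, bytes]],
    score: Callable[[bytes], float],
    threshold: float,
) -> List[str]:
    """Return the URLs of images whose score is strictly above the threshold.

    `score` is an assumed interface: it takes raw image bytes and returns a
    confidence value, with higher meaning more likely to be restricted content.
    """
    return [url for url, data in images if score(data) > threshold]
```

Keeping the scorer as a parameter means the classifier can be swapped or retrained without touching the reporting logic, and the threshold can be tuned to balance missed detections against false reports.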

Figure: Step four, AI checks the content

What is next? 

Today we have a working tool that can automatically detect illegal content online within the Lithuanian IP range. Soon, the tool will be tested and improved accordingly. In the future, it could be used to detect such content globally.

Conclusion

Our solution for detecting restricted content on the internet is now being tested and improved, and it will soon be available for worldwide use by non-profit organizations.

If you have any questions about our products for other use cases, get in touch. We are always ready to help and share our knowledge.

