For the past few months, the Oxylabs team has worked on a challenge raised by the Communications Regulatory Authority of the Republic of Lithuania (RRT). RRT is a national institution that regulates the electronic communications, postal, and rail markets under European Union directives and the laws of the Republic of Lithuania.
We have already written about GovTech Lab’s challenge and how Oxylabs won the contest to create a tool that automatically detects illegal content on the internet. In this article, we will explain, step by step, how our team of software developers, data scientists, and engineers created a powerful AI-driven tool that will make the internet cleaner.
Before we dive into the solution, we would like to answer one question that helps explain why this sort of tool is urgently needed.
Why do we want to automatically detect illegal content online?
Last year, the RRT hotline in Lithuania received thousands of notifications reporting prohibited content on the internet. About a third of them were confirmed as illegal content. In 2018, INHOPE, the International Association of Internet Hotlines, received notifications for over 220,000 images with prohibited content.
At Oxylabs, we are all about innovation and automation. Checking every notification by hand is inefficient. As soon as we heard about the challenge, we knew that we had the tools and the knowledge to help RRT. Our team got together to brainstorm how to find illegal content online, and we soon had a plan.
Ten weeks later, we are sharing how we built an AI-powered tool to detect illegal content online.
The main requirements for the solution were:
- The solution should operate in the Lithuanian IP address range;
- The tool must be able to identify websites operating in Lithuania;
- It has to be able to recognize prohibited visual material;
- The solution has to send a link (URL) of the detected prohibited visual material to the RRT hotline.
To understand the challenge better, you can find its initial description provided by the GovTech Lab.
Building the solution for detecting illegal web content
One of the main challenges was the limited time our team had to build the solution. From the start of the challenge to the demo day, we had ten weeks. The solution will be improved after the demo day, but ten weeks was all the time we had to build a working tool for illegal content detection.
For clarity, we will explain step-by-step how the tool works.
Step one: domain and IP address check
One of the main requirements for the solution was that it had to operate in the Lithuanian IP address range. So in the first step, the tool checks Lithuanian domains and Lithuanian IP addresses.
The tool scans the Lithuanian IP range to see which addresses host websites. It also checks whether Lithuanian domains resolve to Lithuanian IP addresses.
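To make the domain check more concrete, here is a minimal sketch in Python. It is only an illustration, not the code of the actual tool. The names `LT_NETWORKS`, `is_in_range`, `resolve`, and `domains_in_range` are hypothetical, and the two address blocks are documentation placeholders rather than real Lithuanian allocations, so a real list of ranges would have to be supplied.

```python
import ipaddress
import socket

# Placeholder ranges (RFC 5737 documentation blocks), NOT real Lithuanian
# allocations. A real deployment would load the actual ranges here.
LT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_in_range(ip, networks=LT_NETWORKS):
    """Return True if the IP address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def resolve(domain):
    """Return the set of IPv4 addresses a domain resolves to (empty on failure)."""
    try:
        infos = socket.getaddrinfo(domain, None, family=socket.AF_INET)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def domains_in_range(domains, resolver=resolve, networks=LT_NETWORKS):
    """Keep only the domains with at least one address inside the networks."""
    return [
        domain
        for domain in domains
        if any(is_in_range(ip, networks) for ip in resolver(domain))
    ]
```

The resolver is passed in as a parameter so that the filtering logic can be tried out with a fixed mapping of domains to addresses, without any network access.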
Once the list is gathered, it is time for step two: scraping content.
Step two: content scraping
The tool then scrapes images from the websites that were gathered during the first step. These images are saved in a temporary database for further inspection.
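As a rough illustration of the scraping step, the sketch below collects image URLs from a page's HTML using only the Python standard library. It assumes the page has already been downloaded. The names `ImageCollector` and `extract_image_urls` are hypothetical, and the actual tool may gather images differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the URL of the page.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return the image URLs found in an HTML document, in page order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.image_urls
```

Each collected URL could then be downloaded and placed in temporary storage for the checks that follow.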
Step three: hash checking
All the scraped images are hashed using the MD5 and SHA-1 algorithms. These hashes are then compared against a hash database provided by the police. If the hashes match, the information is passed on to the reporting module, which sends the details to the RRT.
The content that does not match the hash database is stored in a temporary database. From there, the content goes through further inspection by an Artificial Intelligence (AI) recognition system.
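The hash check can be sketched in a few lines of Python with the standard `hashlib` module. This is an illustration under simplifying assumptions: the known hashes are held in an in-memory set rather than a database, images are passed in as a mapping from URL to raw bytes, and the names `image_hashes` and `split_by_known_hashes` are hypothetical.

```python
import hashlib


def image_hashes(data: bytes) -> dict:
    """Compute the MD5 and SHA-1 hex digests of an image's raw bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def split_by_known_hashes(images, known_hashes):
    """Split images into those matching a known hash and those that do not.

    images: dict mapping image URL -> raw image bytes.
    known_hashes: set of hex digests (MD5 or SHA-1) of known prohibited images.
    Returns (matched_urls, unmatched_urls).
    """
    matched, unmatched = [], []
    for url, data in images.items():
        digests = image_hashes(data)
        if digests["md5"] in known_hashes or digests["sha1"] in known_hashes:
            matched.append(url)    # would go to the reporting module
        else:
            unmatched.append(url)  # would go on to the AI check
    return matched, unmatched
```

Note that cryptographic hashes such as MD5 and SHA-1 only match files that are byte-for-byte identical, which is one reason images without a match still need a further check.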
Step four: AI check
The images without matches in the police database go through further inspection. They are transferred to the AI recognition system. The system runs the content through a library that checks whether the images contain restricted content. If an image scores above a set threshold, it is passed on to the reporting module.
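The threshold logic itself is simple and might look like the sketch below. The recognition library is not named here, so the classifier is represented by a hypothetical `score_image` callable that returns a score between 0 and 1. The function name `flag_images` and the default threshold of 0.8 are likewise assumptions made for illustration.

```python
from typing import Callable, Dict, List


def flag_images(
    images: Dict[str, bytes],
    score_image: Callable[[bytes], float],
    threshold: float = 0.8,  # assumed value; the real threshold is not stated
) -> List[str]:
    """Return the URLs of images whose score is strictly above the threshold.

    score_image stands in for the AI recognition system: it takes raw image
    bytes and returns a score, where higher means more likely restricted.
    """
    return [url for url, data in images.items() if score_image(data) > threshold]
```

The URLs returned by such a function would be handed to the reporting module, while everything at or below the threshold would be left alone.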
What is next?
Today we have a working tool that can automatically detect illegal content online within the Lithuanian IP range. Soon, the tool will be tested and improved accordingly. In the future, it could be used to detect such content globally.
Our solution for detecting restricted content on the internet is now being tested, improved, and will soon be available for worldwide use by non-profit organizations.
If you have any questions about our products for other use cases, get in touch. We are always ready to help and share our knowledge.