For the past few months, the Oxylabs team has worked on a challenge raised by the Communications Regulatory Authority of the Republic of Lithuania (RRT). RRT is a national institution that regulates the electronic communications, postal, and rail markets under European Union directives and the laws of the Republic of Lithuania.
We have already written about GovTech Lab’s challenge and how Oxylabs won the contest to create a tool that automatically detects illegal content on the internet. In this article, we will explain, step by step, how our team of software developers, data scientists, and engineers created a powerful AI-driven tool that will make the internet cleaner.
Before we dive into the solution, we would like to answer one question that will help understand the urgency of this sort of tool.
Last year, the RRT hotline in Lithuania received thousands of notifications reporting prohibited content on the internet. About a third of them were confirmed as illegal content. In 2018, INHOPE, the International Association of Internet Hotlines, received notifications for over 220,000 images with prohibited content.
At Oxylabs, we are all about innovation and automation. Checking every notification by hand is inefficient. As soon as we heard about the challenge, we knew that we had the tools and the knowledge to help RRT. Our team got together to brainstorm how to find illegal content online, and we soon had a plan.
Ten weeks later, we are sharing how we built an AI-powered tool to detect illegal content online.
The main requirements for the solution were:
The solution should operate in the Lithuanian IP address range;
The tool must be able to identify websites operating in Lithuania;
It has to be able to recognize prohibited visual material;
The solution has to send a link (URL) of the detected prohibited visual material to the RRT hotline.
To understand the challenge better, you can find its initial description provided by the GovTech Lab.
One of the main challenges was the limited time our team had to build the solution. From the beginning of the challenge until the demo day, we had ten weeks. The solution will be improved after the demo day, but those ten weeks were all the time we had to build a working tool for illegal content detection.
For clarity, we will explain step-by-step how the tool works.
One of the main requirements for the solution was that it has to operate in the Lithuanian IP address range. So, in the first step, the tool checks Lithuanian domains and Lithuanian IP addresses.
The tool checks the addresses in the Lithuanian IP range to see whether they host websites. It also checks whether Lithuanian domains resolve to Lithuanian IP addresses.
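The domain check described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the network list uses reserved documentation ranges as stand-ins for real Lithuanian allocations, and the names `EXAMPLE_LT_NETWORKS`, `is_in_networks`, `resolve_ipv4`, and `domains_in_range` are hypothetical.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List, Sequence

# Placeholder networks (reserved documentation ranges), NOT real
# Lithuanian allocations. A real list would come from a registry source.
EXAMPLE_LT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_in_networks(ip: str, networks: Sequence) -> bool:
    """Return True if the IP address falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


def resolve_ipv4(domain: str) -> List[str]:
    """Resolve a domain to its IPv4 addresses; return [] if resolution fails."""
    try:
        _, _, addresses = socket.gethostbyname_ex(domain)
    except OSError:
        return []
    return addresses


def domains_in_range(
    domains: Iterable[str],
    networks: Sequence,
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> List[str]:
    """Keep only the domains that resolve to at least one in-range address."""
    matched = []
    for domain in domains:
        if any(is_in_networks(ip, networks) for ip in resolver(domain)):
            matched.append(domain)
    return matched
```

Passing the resolver as a parameter keeps the range check separate from DNS lookups, so the filtering logic can be exercised without network access.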
Once the list is gathered, it is time for step two: scraping content.
Step one: the tool checks domains and IP addresses
The tool then scrapes images from the websites that were gathered during the first step. These images are saved in a temporary database for further inspection.
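As a rough illustration of the image-gathering step, the sketch below pulls image URLs out of a page's HTML with Python's standard library. It assumes the page has already been downloaded, and it only looks at `<img src>` attributes. The names `ImageCollector` and `extract_image_urls` are hypothetical, and the article does not say which scraping approach the tool actually uses.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.image_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the URL of the page itself.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html: str, base_url: str) -> List[str]:
    """Return the image URLs found in an HTML document, in page order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.image_urls
```

Each returned URL could then be downloaded and stored for the later inspection steps.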
Step two: the tool scrapes content
A hash of every scraped image is computed using the MD5 and SHA-1 algorithms. These hashes are then compared against a hash database provided by the police. If the hashes match, the information is passed on to the reporting module, which sends the details to the RRT.
Content that does not match the hash database is stored in a temporary database. From there, it goes through further inspection by an Artificial Intelligence (AI) recognition system.
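The hash comparison can be sketched in a few lines. This is a simplified illustration under stated assumptions: the known hashes are modelled as in-memory sets of hexadecimal digests, whereas the real database and its format are not described in the article. The function names and the `"report"` / `"ai_review"` labels are hypothetical.

```python
import hashlib
from typing import Dict, Set


def hash_image(data: bytes) -> Dict[str, str]:
    """Compute the MD5 and SHA-1 hex digests of raw image bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def matches_known_hash(data: bytes, known_md5: Set[str], known_sha1: Set[str]) -> bool:
    """Return True if either digest appears in the known-hash sets."""
    digests = hash_image(data)
    return digests["md5"] in known_md5 or digests["sha1"] in known_sha1


def route_image(data: bytes, known_md5: Set[str], known_sha1: Set[str]) -> str:
    """Decide the next stage: report a match, otherwise queue for AI review."""
    if matches_known_hash(data, known_md5, known_sha1):
        return "report"
    return "ai_review"
```

Exact hashes like these only match byte-identical files, which is why images without a match still need the further inspection described above.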
Step three: checking hashes
The images without matches in the police database go through further inspection. They are transferred to the AI recognition system, which runs them through a library that rates how likely each image is to contain restricted content. If an image rates over a set threshold, it is passed to the reporting module.
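The threshold logic can be illustrated with the sketch below. The recognition library is not named in the article, so the classifier is represented by a hypothetical `score_image` callable that returns a score for the raw image bytes. `Report` and `select_for_reporting` are also assumed names, and the threshold value would have to be tuned in practice.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Report:
    """What would be handed to the reporting module for one image."""
    url: str
    score: float


def select_for_reporting(
    images: Iterable[Tuple[str, bytes]],
    score_image: Callable[[bytes], float],
    threshold: float,
) -> List[Report]:
    """Return a report for every image whose score is above the threshold."""
    reports = []
    for url, data in images:
        score = score_image(data)
        if score > threshold:
            reports.append(Report(url=url, score=score))
    return reports
```

Keeping the scorer behind a plain callable means the recognition model can be swapped or retrained without touching the reporting logic.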
Step four: AI checks the content
Today, we have a working tool that can automatically detect illegal content online within the Lithuanian IP range. The tool will soon be tested and improved accordingly. In the future, it could be used to detect such content globally.
Our solution for detecting restricted content on the internet is now being tested and improved, and it will soon be available for worldwide use by non-profit organizations.
If you have any questions about our products for other use cases, get in touch. We are always ready to help and share our knowledge.
About the author
Former Senior Content Manager
Adelina Kiskyte is a former Senior Content Manager at Oxylabs. She constantly follows tech news and loves trying out new apps, even the most useless. When Adelina is not glued to her phone, she also enjoys reading self-motivation books and biographies of tech-inspired innovators. Who knows, maybe one day she will create a life-changing app of her own!
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.