PerimeterX has been safeguarding websites from unwanted visitors for a decade now, yet there’s a chance you haven’t even heard of it. Its sole purpose is to verify that web interactions come from real humans, a service it provides to clients as large as Booking.com, Upwork, Zillow, Wayfair, and many more.
With a network of sophisticated detection techniques in place, bot operators answer with equally complex approaches to overcome these defenses. So how does PerimeterX shield websites from bots, and do these defenses truly hold their ground? Let’s explore the answers.
PerimeterX, now called HUMAN Security, is a cybersecurity company offering a defense platform that helps websites detect fraudulent or automated web activities. These can range from web scraping to account takeover, transaction abuse, ad fraud, malvertising, and more.
One of its essential defenses, the HUMAN Bot Defender, targets bots of all kinds. When deployed on a website, it monitors and collects information from incoming web requests, which are then analyzed according to predefined rules and acted upon if a request looks suspicious. One thing to note is that PerimeterX does let some bots through its detection system, such as Googlebot and the crawlers of various product and service comparison platforms.
The HUMAN Bot Defender is an advanced anti-bot system equipped with modern bot detection techniques and AI-driven analysis methods, but what exactly are these measures? Let’s take a deeper look.
PerimeterX uses behavioral analysis and a predictive detection method, coupled with different fingerprinting techniques that check a variety of factors to determine whether a real user or a robot is accessing website resources.
PerimeterX defenses rely on machine learning algorithms that analyze requests and predict probable bot activities. When suspicious activity is detected, PerimeterX may utilize honeypots and deliver deceptive content to further confirm the request is coming from a robot. In cases where it’s certainly a bot, the system completely blocks the IP address from accessing site resources.
Some of the major PerimeterX defenses can be summed up into four main categories.
IP address monitoring is the foremost detection technique of most anti-bot systems, and PerimeterX is no exception. The IP addresses that visit a website protected by PerimeterX are thoroughly analyzed: the system checks past requests from the same IP, the volume of these requests, and whether the delays between them resemble bot-like behavior.
As IPs expose some critical information about their origin, PerimeterX inspects the IP location and whether it was provided by an Internet Service Provider (ISP) or sourced from a data center.
Additionally, IPs are checked against known bot networks, as well as their reputation history, if available. In the end, each IP is assigned a reputation score that signifies whether it should be trusted or banned.
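To make the idea of a reputation score more concrete, here’s a toy sketch in Python. The signals, weights, and threshold are invented for illustration and don’t reflect PerimeterX’s actual model:

```python
# A toy illustration of IP reputation scoring. All weights and cut-offs are
# made up for the example; they do not reflect PerimeterX's real logic.
def reputation_score(requests_last_hour: int,
                     is_datacenter_ip: bool,
                     on_known_botnet_list: bool,
                     avg_delay_seconds: float) -> int:
    score = 100
    if requests_last_hour > 100:
        score -= 30   # unusually high request volume
    if is_datacenter_ip:
        score -= 25   # data center IPs are trusted less than ISP-issued ones
    if on_known_botnet_list:
        score -= 50   # known bad infrastructure
    if avg_delay_seconds < 0.5:
        score -= 20   # machine-like pacing between requests
    return max(score, 0)

# A high-volume data center IP with sub-second pacing ends up with a low score.
print(reputation_score(250, True, False, 0.2))
```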
Every HTTP request and response includes HTTP headers that describe the request and how it should be handled. Web browsers send a considerable number of headers to identify themselves. While some headers are shared by all of them, Chrome, Firefox, Safari, Opera, and other browsers also have their own header patterns and include specific information that’s unique to each browser.
Hence, PerimeterX uses this information to further analyze and identify bot-like activities. It checks whether unique browser headers are sent or default headers are used, allowing it to identify an HTTP library like Python’s requests. If there are any inconsistencies in the order of browser headers, or some are missing, PerimeterX is quick to deny website access to such requests, as genuine browsers typically send consistent and complete header information.
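The difference is easy to see with Python’s requests library. The snippet below uses httpbin.org purely as an echo service and an illustrative Chrome header set; it compares the library’s default headers with a browser-like set:

```python
import requests

# Default headers sent by Python's requests library immediately reveal an HTTP
# client rather than a real browser (User-Agent: python-requests/x.y.z).
default = requests.get("https://httpbin.org/headers")
print(default.json())

# A browser-like header set (values copied from a typical Chrome session, used
# here only as an example) stands out far less, but the order and completeness
# of headers still have to match what the browser would actually send.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Upgrade-Insecure-Requests": "1",
}
response = requests.get("https://httpbin.org/headers", headers=browser_headers)
print(response.json())
```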
Fingerprinting further complicates bot operations as it adds another strong defensive layer. PerimeterX can combine a handful of fingerprinting techniques to detect bots, including but not limited to:
Browser fingerprinting collects information about the browser and its version, including screen resolution, operating system, installed plugins and fonts, language, time zone, and other settings. All of this can be used to create an identifying fingerprint of a user, which can trace back the user’s requests even when different IPs are used.
HTTP/2 fingerprinting doesn’t directly identify the user, but it contributes to the identification process by providing more details about the request and the requester, where some of it conveys the same information as HTTP headers. For instance, the HTTP/2 protocol additionally sends information like stream dependencies, compressed headers, flow control signals, and settings frames, creating a more comprehensive profile of the requester. PerimeterX can use HTTP/2 fingerprinting to compare if the information matches request headers and whether all the parameters match what a real web browser would send.
TLS (Transport Layer Security) is a cryptographic protocol that encrypts data, yet the first communication step between a client and a server, called a handshake, exchanges information that isn’t encrypted. The handshake carries information about the device, the TLS version, its extensions, and how the data will be encrypted and decrypted. TLS fingerprinting analyzes these parameters, and when a handshake contains unusual or inconsistent values that deviate from what’s expected, it indicates the request may be coming from a robot.
PerimeterX employs its own CAPTCHA, called HUMAN Challenge, to distinguish human users from bots. It’s a lightweight challenge compared to other CAPTCHAs, as it eases the user experience. However, it’s also a sophisticated challenge since it has access to activities and events on the web page, which other CAPTCHAs usually don’t, allowing PerimeterX to gain insights into user behavior and deploy bot-catching techniques if the user looks suspicious.
Another behavioral practice that PerimeterX uses is the monitoring of web element interactions, like mouse movements, clicks, keystroke speeds, and other interactions. This allows the system to detect non-human behavior with a high degree of accuracy, since the complexity and variability of genuine user interactions are challenging to simulate.
Even though it's difficult to bypass PerimeterX, there are still identifiable vulnerabilities that bots leverage using common bypassing methods. This is especially concerning as PerimeterX may, at times, not be adequately equipped to mitigate malicious activities.
One of the common bypassing methods uses a cached version of a website, making it arguably the most ethical option as it doesn’t affect the infrastructure and processes of the target website. Since Google crawls websites to show them on Search Engine Results Pages (SERPs), it also saves a copy of each website, which can be accessed via a Google cache URL. Thus, when a user visits a cached version of a web page, it communicates with Google, not the targeted website. It's important to bear in mind, however, that this strategy might not yield the freshest web data, and some websites may be unavailable.
Google cache can be reached by adding a complete URL of a target website at the end of this URL https://webcache.googleusercontent.com/search?q=cache:. For instance, you can access a cached version of the Oxylabs website by visiting https://webcache.googleusercontent.com/search?q=cache:https://oxylabs.io/.
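For a programmatic example, the short snippet below builds the cache URL and fetches it with Python’s requests. Keep in mind that Google may serve stale copies, have no copy for some pages, or challenge automated requests, so treat this purely as an illustration:

```python
import requests

# Build a Google cache URL by prepending the cache prefix to the target URL.
CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"

def cached_url(target_url: str) -> str:
    return CACHE_PREFIX + target_url

# Fetch the cached copy of the Oxylabs homepage as an example.
response = requests.get(cached_url("https://oxylabs.io/"),
                        headers={"User-Agent": "Mozilla/5.0"})
print(response.status_code)
```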
Proxies are another effective tool that users leverage to bypass PerimeterX bot detection. In their simplest form, proxy servers route a user’s web requests through a proxy IP address, essentially making it impossible for target websites to see the user’s real IP.
As mentioned before, IP monitoring is one of the most important factors that enable anti-bot systems to distinguish bot-like activity. Proxy servers tackle exactly this detection technique, making it almost impossible to trace back the requests to one user as they’ll have different IPs.
When it comes to HTTP header checks, fingerprinting, CAPTCHAs, and behavioral analysis, a proxy server by itself can’t circumvent these detection techniques, since all of them look at what’s beyond the IP address. Thus, a proxy user must set up additional processes, and this is where PerimeterX really starts to struggle. When proxies are combined with IP rotation, proper HTTP header handling, realistic request timing, and other anti-fingerprinting methods, users are able to bypass PerimeterX by not triggering alerts about suspicious IPs.
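A minimal sketch of that combination might look like the following, assuming you already have a pool of proxy endpoints (the proxy addresses, header values, and target URLs below are placeholders):

```python
import random
import time
import requests

# Placeholder proxy endpoints; replace with real credentials and addresses.
proxy_pool = [
    "http://user:pass@203.0.113.10:8080",
    "http://user:pass@203.0.113.11:8080",
    "http://user:pass@203.0.113.12:8080",
]

# Browser-like headers so the requests don't expose a bare HTTP library.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

for url in ["https://example.com/page1", "https://example.com/page2"]:
    proxy = random.choice(proxy_pool)  # rotate the exit IP for each request
    response = requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    print(proxy, response.status_code)
    time.sleep(random.uniform(2, 6))   # human-like delays between requests
```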
Considering PerimeterX’s commitment to stop bots from accessing sites, is there anything else the system could do to better spot and deny web requests coming from proxy servers? On top of checking IPs for being listed in databases of known bot networks, PerimeterX could also monitor incoming requests for IPs using similar subnets. This information could imply that:
The IP addresses are a part of the same network infrastructure;
The requests can be originating from a data center, but further analysis is required;
If the above is true, the requests are possibly using proxy servers, as data centers are a common source of proxy server IPs;
The user is deliberately trying to hide their identity.
While this technique isn’t foolproof and can’t directly confirm the use of proxy servers, it can add an extra layer of defense.
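As a rough illustration of that subnet check, the sketch below groups incoming request IPs by their /24 network and flags ranges that send unusually many requests. The sample IPs and the threshold are made up for the example:

```python
from collections import Counter
from ipaddress import ip_network

# Example request log; in practice these would come from live traffic.
request_ips = [
    "203.0.113.7", "203.0.113.42", "203.0.113.99",
    "198.51.100.23", "203.0.113.150",
]

# Count how many requests arrive from each /24 subnet.
subnet_counts = Counter(
    ip_network(f"{ip}/24", strict=False) for ip in request_ips
)

THRESHOLD = 3  # arbitrary example value
for subnet, count in subnet_counts.items():
    if count >= THRESHOLD:
        print(f"{subnet} sent {count} requests - possible data center or proxy range")
```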
Headless browsers have been a game changer for automated access to websites, as they allow users to programmatically make requests through a web browser without using its user interface. In turn, the targeted websites see requests coming through a genuine browser without any inconsistencies in request headers and web interactions that appear to be made by real users.
When it comes to bypassing PerimeterX, headless browsers continue to be game changers. With popular and free tools like Selenium, Playwright, and Puppeteer, users are able to mimic human-like web interactions, execute JavaScript to pass JavaScript challenges, manipulate the Document Object Model (DOM) as a real user would, and control headers and cookies to resemble organic traffic. In short, when set up properly, headless browsers give PerimeterX the impression that there’s a real user behind the requests, not a robot.
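Here’s a brief sketch using Playwright’s Python sync API. The target URL, user agent string, and interactions are illustrative, and the browser binaries must be installed first with `playwright install`:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # Present a realistic browser profile: user agent, viewport, and locale.
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1366, "height": 768},
        locale="en-US",
    )
    page = context.new_page()
    page.goto("https://example.com")   # placeholder target
    page.mouse.move(200, 300)          # simulate some pointer activity
    page.wait_for_timeout(1500)        # pause like a human reading the page
    print(page.title())
    browser.close()
```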
Headless browsers bypass PerimeterX defenses because the detection system lacks a better way to identify their use. There are a few ways this can be fixed in the future:
CAPTCHAs for headless detection – while PerimeterX offers its own CAPTCHA solution and an ability to set up third-party ones, they aren’t designed to catch requests using headless browsers. An introduction of multi-step or dynamic element CAPTCHAs would make it way harder for users to automate scripts that solve these types of CAPTCHAs.
Browser automation monitoring – another viable approach is to monitor popular automation techniques that include manipulation of browser window properties, emulation of device sensors, and other tactics and patterns that can indicate automation. In the end, vigilant monitoring could diminish the effectiveness of headless browsers.
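To illustrate the kind of signals such monitoring could look for, the snippet below drives a headless browser with Playwright and reads a few properties that client-side detection scripts commonly inspect. This is only a demonstration of the concept, not PerimeterX’s actual detection code:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")   # placeholder target
    # Read properties that a client-side script could use to spot automation.
    signals = page.evaluate(
        """() => ({
            webdriver: navigator.webdriver,    // true in most automated browsers
            plugins: navigator.plugins.length, // historically empty in bare headless sessions
            languages: navigator.languages,
        })"""
    )
    print(signals)
    browser.close()
```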
There’s no debating that PerimeterX is a modern anti-bot system stacked with complex monitoring and analysis techniques. It monitors every IP address, analyzes request headers, uses various fingerprinting methods to identify users, triggers CAPTCHAs, and utilizes machine learning with artificial intelligence to analyze and predict user behavior.
However, these defenses can be penetrated by popular tools and methods. While building a custom PerimeterX bypass mechanism is a time-consuming and complicated quest, people using bots have seen success since PerimeterX can't always tell bots and real users apart.
If you found this article interesting, you might also enjoy reading about bot traffic and a similar article examining Queue-it.
About the author
Vytenis Kaubrė
Technical Copywriter
Vytenis Kaubrė is a Technical Copywriter at Oxylabs. His love for creative writing and a growing interest in technology fuels his daily work, where he crafts technical content and web scrapers with Oxylabs’ solutions. Off duty, you might catch him working on personal projects, coding with Python, or jamming on his electric guitar.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.