On the first day of OxyCon, Allen O’Neill, a full-stack big data cloud engineer, glued everyone to their seats with an in-depth talk on web fingerprinting and its effects on the practice of web scraping. In a genuinely captivating presentation, he introduced numerous ways user data is collected and used to build unique profiles, called fingerprints.
What’s more, Allen emphasized that browser fingerprinting is emerging as the primary weapon for identifying web scraping bots. Hence, for everyone gathered at OxyCon, it was an excellent opportunity to learn about future challenges in the web scraping game. Without further ado, let’s dig deeper into what we learned from his informative presentation.
Mr. O’Neill opened his presentation with a statement: “we all have secrets.” He then explained just how willingly we give away these “secrets” to various third parties, which analyze our data and, consequently, target us as consumers more effectively.
How is it done? For example, by merely having location services enabled on our devices or by granting our installed apps permission to track us. Of course, these days, people are starting to understand that they pay with their data. However, not many of us are aware that our information is being quietly passed on to “carefully selected partners.” To paint a picture, here are the trackers and permissions of what seems like an ordinary, harmless app:
Surely, it’s about time we read those T&Cs! However, as Allen explained, it’s an irresistible trade-off: we exchange our data for products and services of value to the wider community. Frankly, there is no hiding for internet users. It seems that privacy is truly being driven to extinction, and that is just the first step in how our unique online fingerprint is built.
What is a browser fingerprint?
A browser fingerprint refers to information gathered about a computing device for identification purposes. In other words, any browser will pass highly specific data points to the servers of the websites it connects to: the operating system, language, plugins, fonts, and hardware, to name a few. Mr. O’Neill shared a website where everyone could check their browser fingerprint. Here’s how it looks:
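To illustrate the idea, here is a minimal Python sketch of how a handful of such data points can be combined into a single stable identifier. The attribute names and values are hypothetical, chosen only to mirror the categories listed above:

```python
import hashlib
import json

def fingerprint(data_points: dict) -> str:
    """Hash a dictionary of browser attributes into a short, stable ID."""
    # Canonical JSON (sorted keys) so the same attributes always hash alike.
    canonical = json.dumps(data_points, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

browser = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "language": "en-US",
    "timezone": "Europe/Vilnius",
    "screen": "1920x1080x24",
    "fonts": ["Arial", "Calibri", "Verdana"],
    "plugins": ["PDF Viewer"],
}
print(fingerprint(browser))
```

Changing even one attribute (say, the language) yields a completely different hash, which is why so few data points are enough to tell browsers apart.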
If you think these data points seem rather basic and that it would be impossible to identify the actual user behind the browser, think again. Panopticlick’s research found that, in the best-case scenario, only 1 in 286,777 other browsers would share the same fingerprint. That’s how unique each browser fingerprint is, which makes it considerably easier to identify individuals.
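That uniqueness can be expressed in bits of identifying information: a fingerprint shared by only 1 in N browsers carries log2(N) bits, so the 1-in-286,777 figure corresponds to roughly 18 bits, enough to single out one browser among hundreds of thousands:

```python
import math

def bits_of_identifying_info(one_in_n: int) -> float:
    # A fingerprint shared by only 1 in N browsers carries log2(N) bits.
    return math.log2(one_in_n)

print(round(bits_of_identifying_info(286_777), 1))  # → 18.1
```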
For those thinking about employing a proxy, using a VPN service, or relying on old-fashioned incognito mode: unfortunately, none of these stop that information from being passed on to the websites you visit. Even if the user enables do-not-track mode, the data will still be passed on. Worse still, there is no opt-out option.
Mr. O’Neill further explained that cookies are losing relevance as an identification and tracking technique as fingerprinting takes over. Now, a question might pop up: does fingerprinting itself have any advantages? Well, it all depends on the angle. Browser fingerprinting aids in preventing fraud and account hijacking. It also collects the data points mentioned above, helps obtain real-time marketing analytics, and identifies non-human traffic.
Hold on, there’s more
Ultimately, for those who are into web scraping, it’s all about avoiding CAPTCHAs. In other words, web scrapers need to mimic organic traffic to prevent blocks from the desired data sources. According to Mr. O’Neill, not only browser fingerprinting but also WebRTC and Wasm are emerging technologies that will differentiate bots from human beings. These already directly impact web scraping and will only make the process more challenging for the foreseeable future.
What’s more, Mr. O’Neill took a step further and introduced hyper-personalization as the next wave in the web evolution.
What is personalization?
Ecommerce personalization refers to utilizing data from multiple sources to customize the individual shopper experience to increase conversion rates.
It’s important to note that:
- Data sources can be both internal and external
- Data collected can be both explicit and implicit
- Personalization can be both prescriptive and adaptive
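As a rough illustration of the explicit/implicit distinction, stated preferences and observed behavior can be merged into one personalized ranking. The scoring scheme below is invented for the sketch, not something from the talk:

```python
def personalize(explicit: dict, implicit: dict) -> list:
    """Rank items by combining stated preferences with observed behavior."""
    scores: dict = {}
    for item, weight in explicit.items():
        # Explicit data (e.g. a saved preference) is weighted more heavily.
        scores[item] = scores.get(item, 0.0) + 2.0 * weight
    for item, weight in implicit.items():
        # Implicit data (e.g. dwell time, clicks) nudges the ranking.
        scores[item] = scores.get(item, 0.0) + weight
    return sorted(scores, key=scores.get, reverse=True)

# "shoes" scores 2.0 + 0.2 = 2.2, "hats" scores 0.9
ranking = personalize({"shoes": 1.0}, {"shoes": 0.2, "hats": 0.9})
```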
Throw in behavior tracking (think likes and check-ins), and it truly is the next big thing. It will allow tracking internet users in much more sophisticated ways. This also means that separating real human beings from bots could become a piece of cake.
What does this mean for the future of web scraping?
Since web scraping is frequently a cat-and-mouse game, switched-on data sources are likely to implement such anti-bot measures on their end. This means that today’s bots won’t stand a chance of collecting public data successfully. Future web scraper bots will need to adapt to these challenges and leave no margin of error when imitating organic user behavior. So, how can that be achieved?
According to Mr. O’Neill, the only way will be to build guided bots, i.e., to construct personas with their own web-footprint schedules. Just like regular internet users, they will have to display organic behavior to the websites they visit. Only then will it be possible to blend into the internet crowd and, consequently, slip past the anti-bot detection measures in place.
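A guided-bot persona of this kind might be sketched as follows. The class, its fields, and the site names are purely hypothetical, not a description of any real tool mentioned in the talk:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Persona:
    """A hypothetical guided-bot identity with an organic-looking schedule."""
    name: str
    active_hours: range                      # local hours the persona is "awake"
    warmup_sites: list = field(default_factory=list)

    def is_active(self, hour: int) -> bool:
        return hour in self.active_hours

    def next_visit(self, targets: list, hour: int):
        # Stay idle outside normal hours; otherwise mix organic
        # warm-up visits in with the actual scraping targets.
        if not self.is_active(hour):
            return None
        return random.choice(self.warmup_sites + targets)

shopper = Persona("weekday-shopper", range(9, 22),
                  warmup_sites=["news.example", "social.example"])
```

The point of the sketch is the schedule: requests only happen during the persona’s waking hours and are interleaved with ordinary-looking browsing, mimicking the rhythm of a real user.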
It’s a wrap
Huge thanks go to Mr. Allen O’Neill for the engaging and eye-opening presentation he delivered to us all at OxyCon 2019. It’s safe to say that all attendees took away actionable insights from his impressive take on fingerprinting and its effects on the web scraping practice. For now, roll on OxyCon 2020!