On the first day of OxyCon, Allen O’Neill, a full-stack big data cloud engineer, glued everyone to their seats with an in-depth talk on web fingerprinting and its effects on web scraping, including scraping with Python. In a genuinely captivating presentation, he introduced numerous ways in which user data is collected and used to build unique profiles, called fingerprints.
What’s more, Allen emphasized that browser fingerprinting is emerging as the primary weapon for identifying web scraping bots. Hence, for everyone gathered at OxyCon, it was an excellent opportunity to learn about future challenges in the web scraping game. Without further ado, let’s dig deeper into what we learned from his informative presentation.
Mr. O’Neill opened his presentation with a statement: “we all have secrets.” He then explained just how willingly we give away these “secrets” to various third parties. These parties analyze our data and, consequently, target us as consumers more effectively.
How is it done? For example, by merely having location services enabled on our devices or by giving our installed apps permission to track us. Of course, these days, people are starting to understand that they pay with their data. However, not many of us are aware that our information is being smoothly passed on to “carefully selected partners.” To paint a picture, here are the trackers and permissions of what seems like an ordinary and harmless app:
Surely, it’s about time we read those T&Cs! However, as Allen explained, it’s an irresistible trade-off: we give up data in exchange for the value these products and services offer the wider community. Frankly, there is no hiding for internet users. It seems that privacy is truly being driven to extinction, and this is just the first step in how our unique online fingerprint is built.
So, a browser fingerprint refers to information that is gathered about a computing device for identification purposes. In other words, any browser will pass on highly specific data points to the connected website’s servers, such as the operating system, language, plugins, fonts, and hardware, to name a few. Mr. O’Neill shared a website where everyone could check their browser fingerprint. Here’s how it looks:
If you think that these data points seem rather basic and that it would be impossible to identify the actual user behind the browser, think again. Panopticlick found that, in the best-case scenario, only 1 in 286,777 other browsers would have the same fingerprint. That’s how unique each user’s browser fingerprint is, which makes it considerably easier to identify individuals.
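To put that figure in perspective, uniqueness is often expressed as bits of identifying information. The snippet below is a back-of-the-envelope sketch, not something shown in the talk; it assumes the Panopticlick figure of one in 286,777 and simply converts it to bits.

```python
import math

# Assumed input: the Panopticlick figure of one in 286,777 browsers.
ONE_IN = 286_777


def bits_of_entropy(one_in: int) -> float:
    """Convert a '1 in N' uniqueness figure into bits of identifying information."""
    return math.log2(one_in)


bits = bits_of_entropy(ONE_IN)
print(f"1 in {ONE_IN:,} is roughly {bits:.1f} bits of identifying information")
```

The result is a little over 18 bits. Each extra bit halves the crowd a browser can hide in, which is why a handful of seemingly basic data points adds up so quickly.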
For those who are thinking about employing a proxy, using a VPN service, or relying on old-fashioned incognito mode: unfortunately, none of these stops this information from being passed on to the websites you visit. Even if the user enables do-not-track mode, the data will still be passed on. What is worse, there is no opt-out option.
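A simplified way to see why a proxy or VPN does not help: the fingerprint is derived from what the browser reports about itself, not from the IP address. The sketch below is illustrative only. The `fingerprint` helper and all attribute values are hypothetical, and real fingerprinting scripts collect far more signals than this.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Hash browser attributes into a short, stable identifier."""
    # Sorting keys makes the hash independent of dictionary ordering.
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical attribute values, for illustration only.
browser = {
    "os": "Windows 10",
    "language": "en-GB",
    "plugins": ["PDF Viewer"],
    "fonts": ["Arial", "Calibri", "Verdana"],
    "screen": "1920x1080",
}

# The same browser seen from two different IP addresses (e.g. via a proxy).
visit_direct = {"ip": "203.0.113.10", "fingerprint": fingerprint(browser)}
visit_proxy = {"ip": "198.51.100.7", "fingerprint": fingerprint(browser)}

# The IP is not an input to the hash, so both visits carry the same fingerprint.
same_device = visit_direct["fingerprint"] == visit_proxy["fingerprint"]
print("Same fingerprint behind the proxy:", same_device)
```

Changing any single attribute, such as the language or the font list, yields a different hash, which is what makes the combination so identifying.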
Mr. O’Neill further explained that, thanks to fingerprinting, cookies are losing relevance as an identification and tracking technique. This raises the question of whether fingerprinting itself has any advantages. Well, it all depends on the angle. Browser fingerprinting aids in preventing fraud and account hijacking. Beyond collecting the data points already mentioned, it also helps to obtain real-time marketing analytics and to identify non-human traffic.
Ultimately, for those who are into web scraping, it’s all about avoiding captchas. In other words, web scrapers need to resemble organic traffic to avoid being blocked by the desired data sources. According to Mr. O’Neill, not only browser fingerprinting but also WebRTC and Wasm are emerging technologies that will help tell bots apart from human beings. This already directly affects web scraping and will only make it more challenging for the foreseeable future.
What’s more, Mr. O’Neill took a step further and introduced hyper-personalization as the next wave in the web evolution.
Ecommerce personalization refers to utilizing data from multiple sources to customize the individual shopper experience to increase conversion rates.
It’s important to note that:
Data sources can be both internal and external
Data collected can be both explicit and implicit
Personalization can be both prescriptive and adaptive
Throw in behavior tracking (think likes and check-ins), and it truly is the next big thing. It will allow tracking internet users in much more sophisticated ways. This also means that separating real human beings from bots will potentially become a piece of cake.
Because web scraping is frequently a cat-and-mouse game, switched-on data sources are likely to implement such anti-bot measures on their end. This means that today’s bots won’t stand a chance of collecting public data successfully. Future web scraper bots will need to adapt to these challenges and will have no margin for error when trying to imitate organic user behavior. So, how can it be achieved?
According to Mr. O’Neill, the only way will be to build guided bots, i.e., to construct personas that have their own web-footprint schedules. Just like regular internet users, they will have to show organic behavior to the websites they visit. Only then will it be possible to blend into the internet crowd and, consequently, slip past the anti-bot detection measures in place.
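The talk did not prescribe an implementation, but the idea of a persona with its own schedule can be sketched roughly as follows. Everything here is a hypothetical illustration: the `Persona` class, its fields, and the example values are assumptions, not anything presented at OxyCon.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Persona:
    """A hypothetical 'guided bot' profile with its own browsing habits."""
    name: str
    active_hours: range   # hours of the day when the persona is online
    min_pause_s: float    # shortest pause between page views, in seconds
    max_pause_s: float    # longest pause between page views, in seconds
    rng: random.Random = field(default_factory=random.Random)

    def daily_schedule(self, visits: int) -> list[tuple[int, float]]:
        """Return (hour, pause_seconds) pairs for one day, sorted by hour."""
        hours = list(self.active_hours)
        plan = [
            (
                self.rng.choice(hours),
                round(self.rng.uniform(self.min_pause_s, self.max_pause_s), 1),
            )
            for _ in range(visits)
        ]
        return sorted(plan)


# Example persona: someone who browses in the evening at an unhurried pace.
evening_shopper = Persona(
    name="evening_shopper",
    active_hours=range(18, 23),
    min_pause_s=4.0,
    max_pause_s=45.0,
    rng=random.Random(42),  # seeded so the sketch is reproducible
)

schedule = evening_shopper.daily_schedule(visits=5)
for hour, pause in schedule:
    print(f"{hour:02d}:00 - view a page, then pause for {pause}s")
```

The point of the design is that timing and pacing belong to the persona rather than to the scraping job, so each persona leaves a consistent, human-looking footprint over time.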
A huge thank-you goes to Mr. Allen O’Neill for the engaging and eye-opening presentation he delivered to us all at OxyCon 2019. It’s safe to say that all attendees took away actionable insights from his impressive take on fingerprinting and its effects on web scraping practice. For now, roll on OxyCon 2020!
About the author
Head of PR
Vytautas Kirjazovas is Head of PR at Oxylabs, and he places a strong personal interest in technology due to its magnifying potential to make everyday business processes easier and more efficient. Vytautas is fascinated by new digital tools and approaches, in particular, for web data harvesting purposes, so feel free to drop him a message if you have any questions on this topic. He appreciates a tasty meal, enjoys traveling and writing about himself in the third person.