It was a sunny fall morning in Vilnius, Lithuania, where the first annual OxyCon commenced today. OxyCon by Oxylabs is a two-day data extraction industry conference dedicated to sharing knowledge and know-how, with expert speakers from some of the best, market-leading companies. A total of 56 attendees from 28 companies and 12 countries are participating in the event, together with speakers and participants from Oxylabs itself.
We had a bit of everything – good coffee, a national nuclear accident response drill (you read that right), truly in-depth presentations and, of course, some friendly mingling, socializing and entertainment.
Whether you participated in the event yourself or are just curious, we have prepared a handy recap of the first day. Without further ado, let’s get to it.
After an introductory speech by one of the Oxylabs founders, detailing our history, work culture and some beautiful traditions we have here, the mic was passed to Rimgaudas Mazgelis, an Oxylabs data analyst.
Mr. Mazgelis shared some insights into how data analysis evolved here at Oxylabs, seeking to ensure a sustainable operation of our datacenter proxies.
It all started with looking for basic patterns in how websites respond to scraping, at first simply using Microsoft Excel all the way back in 2015, before finally switching to and sticking with R and Python, as it should be.
Here at Oxylabs, quadratic equations are used to help find out the limit of requests a target website can accept before a specific IP gets blocked. Today, ratios are being continuously calculated for all of the most popular targets.
Our in-house software, coupled with a convenient dashboard, helps us see scraping trends for each city, which also makes it easier to optimize the stability of the infrastructure.
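To make the idea of using a quadratic equation to estimate a request limit more concrete, here is a minimal sketch. It is not the model presented in the talk: it simply assumes that the share of blocked requests grows roughly quadratically with the request rate, fits that curve by least squares, and solves the resulting quadratic equation for the rate at which a chosen block rate is reached. The sample measurements, the 5% tolerance and the function names are all hypothetical.

```python
import math

def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # s[k] = sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # t[k] = sum of y*x^k
    # Normal equations as an augmented matrix for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def request_limit(a, b, c, max_block_rate):
    """Smallest positive rate at which the fitted curve reaches max_block_rate."""
    shifted_c = c - max_block_rate
    disc = b * b - 4 * a * shifted_c
    if a == 0 or disc < 0:
        return None  # no quadratic term, or the curve never reaches the threshold
    roots = [(-b + sign * math.sqrt(disc)) / (2 * a) for sign in (1, -1)]
    positive = [r for r in roots if r > 0]
    return min(positive) if positive else None

# Hypothetical measurements for one target: requests per minute from a single IP
# and the fraction of those requests that were blocked.
rates = [10, 20, 30, 40, 50]
blocked = [0.01, 0.04, 0.09, 0.16, 0.25]

a, b, c = fit_quadratic(rates, blocked)
limit = request_limit(a, b, c, max_block_rate=0.05)  # tolerate 5% blocked requests
print(f"estimated limit: {limit:.1f} requests per minute")
```

In practice the curve would have to be refitted per target and over time, since websites change how they respond to scraping.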
Here are a couple of interesting facts from Rimgaudas’ presentation:
There is a strong trend of scrapers taking increasingly more data from their targets.
During the summer, companies scrape less, but when fall starts, the seasonal celebrations (e.g. Black Friday) lead to a dramatic increase in scraping rates.
Allen O’Neill, a big data cloud engineer, discussed the many different ways in which user data is collected and used to build unique profiles, called fingerprints, and how these fingerprints are used to identify bots. Here are some key takeaways from his rich presentation:
Browser fingerprinting is currently emerging as the primary method of identifying bots.
Browser data, such as resolution, supported fonts, languages and much more, help build this fingerprint. In other words, this unique profile helps identify and track individual users across the web.
Cookies are losing relevance as a method of identification/tracking, thanks to fingerprinting.
WebRTC, TCP/IP fingerprinting and Wasm are some of the technologies increasingly leveraged to differentiate bots vs. real users.
Hyper-personalization is the next big thing, requiring huge amounts of data and making separating real users from bots even easier.
Creating realistic personas that act like real web users will eventually become the only way to bypass bot detection systems.
We will end the summary of Mr. O’Neill’s talk with a great quote that perfectly captures his presentation:
“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”
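As a rough illustration of how browser data can be combined into a fingerprint, the sketch below hashes a handful of attributes into a single identifier. This is an assumption-laden simplification, not the method described in the talk: real detection systems use many more signals and fuzzier matching, and the attribute names and values here are hypothetical.

```python
import hashlib
import json

def fingerprint(attributes: dict) -> str:
    """Derive a stable identifier from a set of browser attributes."""
    # Serialize with sorted keys so the same attributes always hash the same way,
    # regardless of the order in which they were collected.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical attribute sets for two visitors.
regular_visitor = {
    "resolution": "1920x1080",
    "languages": ["lt-LT", "en-US"],
    "fonts": ["Arial", "Calibri", "Times New Roman"],
    "timezone": "Europe/Vilnius",
}
headless_visitor = {
    "resolution": "800x600",
    "languages": ["en-US"],
    "fonts": [],
    "timezone": "UTC",
}

print(fingerprint(regular_visitor))
print(fingerprint(headless_visitor))
```

The same attributes always produce the same identifier, which is what makes tracking possible without cookies, while a sparse or unusual attribute set stands out from the crowd.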
The rest of the presentations and a workshop were much more technical and geared towards practical use cases, so to keep this recap within reasonable length, we will only cover some of the main points:
System administrator Karolis Pabijanskas put forward an argument that every successful company using their own server infrastructure will eventually face the challenge of scaling it.
As the infrastructure grows, a certain point can be reached in which your automation solution might not be able to keep up with your needs.
Mr. Pabijanskas suggested migrating from Ansible to the SaltStack IT automation solution, lauding its impressive speed and flexibility.
Another speaker covered the ins and outs of using a browser as a service. He described his preferred solution: cloud-based Chrome browsers for web scraping, run with the help of VNC and XFCE for smooth window management, and combined with rotating proxies to ensure a high scraping success rate.
Paul Felby, CTO at AdThena, described how machine learning can be used efficiently to optimize web scraping. In his own words, “scraping less is an optimisation challenge” and machine learning is a perfect solution.
Mr. Felby detailed the process of optimizing scraping efforts, which includes preparing data the right way, the caveats of training your model, evaluating its accuracy and more.
It all boils down to using linear and decision tree regression models, random forests, gradient boosting and some more statistical wizardry. We would like to get more in-depth, but, as the saying goes, you just had to be there.
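For readers who were not there, the general workflow of preparing data, training a model and evaluating its accuracy can be sketched with the simplest of the model families mentioned above, a linear regression. Everything in this example is an assumption made for illustration: the feature (hours since the last scrape), the target (how much a value changed), the sample numbers and the threshold are hypothetical and do not come from Mr. Felby's talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(slope, intercept, xs, ys):
    """Average absolute gap between predictions and observed values."""
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical history: hours since the last scrape and the observed change.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
change = [0.6, 0.9, 1.6, 1.9, 2.6, 2.9, 3.6, 3.9]

# Hold out the last observations so accuracy is measured on unseen data.
train_x, test_x = hours[:6], hours[6:]
train_y, test_y = change[:6], change[6:]

slope, intercept = fit_line(train_x, train_y)
mae = mean_absolute_error(slope, intercept, test_x, test_y)

# Scrape again only once the predicted change exceeds a threshold worth acting on.
threshold = 2.0
wait_hours = (threshold - intercept) / slope
print(f"held-out MAE: {mae:.3f}, next scrape in about {wait_hours:.1f} hours")
```

Swapping in a decision tree, random forest or gradient boosting model would follow the same split, fit and evaluate pattern, with a library such as scikit-learn doing the heavy lifting.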
The official part of day one concluded with Oxylabs software developer Paulius Stundžia leading a workshop on how to save precious resources with the help of Oxylabs Real-Time Crawler (now known as Web Scraper API) and a callback handler. But the fun was not over yet.
After getting all the knowledge of the day out in the open, OxyCon attendees are currently gathering at the beautiful hotel PACAI for some quality evening entertainment. Tomorrow, another inspiring day of know-how sharing awaits us! If you want to stay updated, be sure to follow us on Twitter.
About the author
Vytautas Kirjazovas
Head of PR
Vytautas Kirjazovas is Head of PR at Oxylabs, and he takes a strong personal interest in technology due to its potential to make everyday business processes easier and more efficient. Vytautas is fascinated by new digital tools and approaches, in particular for web data harvesting purposes, so feel free to drop him a message if you have any questions on this topic. He appreciates a tasty meal, enjoys traveling and writing about himself in the third person.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.