It was a sunny fall morning in Vilnius, Lithuania, where the first annual OxyCon commenced today. OxyCon by Oxylabs is a two-day data extraction industry conference where expert speakers from some of the best, market-leading companies share their knowledge and know-how. A total of 56 attendees from 28 companies and 12 countries are participating in the event, together with speakers and participants from Oxylabs itself.
We had a bit of everything – good coffee, a national nuclear accident response drill (you read that right), truly in-depth presentations and, of course, some friendly mingling, socializing and entertainment.
Whether you participated in the event yourself or are just curious, we have prepared a handy recap of the first day. Without further ado, let’s get to it.
Sustainability of Data Center Proxies in the Scraping Industry
After an introductory speech by one of the Oxylabs founders, detailing our history, work culture and some beautiful traditions we have here, the mic was passed to Rimgaudas Mazgelis, an Oxylabs data analyst.
Mr. Mazgelis shared some insights into how data analysis has evolved here at Oxylabs, with the goal of keeping our data center proxies running sustainably.
It all started back in 2015 with a search for basic patterns in how websites respond to scraping. At first the team simply used Microsoft Excel, before finally switching to and sticking with R and Python, as it should be.
Here at Oxylabs, quadratic equations are used to help estimate how many requests a target website will accept before a specific IP gets blocked. Today, ratios are continuously calculated for all of the most popular targets.
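To give a feel for how a quadratic model could be used to estimate such a limit, here is a minimal sketch in Python. It is not the method presented at OxyCon, which this recap does not spell out. The sample observations, the function names `fit_quadratic` and `max_safe_rate`, and the 95% success threshold are all assumptions made purely for illustration. The sketch fits success rate as a quadratic function of request rate by least squares, then solves the resulting quadratic equation for the rate at which the predicted success falls to the threshold.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    # Normal equations for the basis [x^2, x, 1].
    s = [sum(x ** k for x in xs) for k in range(5)]  # s[k] = sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(n))


def max_safe_rate(a, b, c, threshold=0.95):
    """Smallest positive rate where the fitted curve reaches the threshold."""
    c0 = c - threshold
    if abs(a) < 1e-12:  # degenerate fit: the curve is a straight line
        roots = [-c0 / b] if b else []
    else:
        disc = b * b - 4 * a * c0
        if disc < 0:
            return None  # the curve never reaches the threshold
        roots = [(-b + sign * math.sqrt(disc)) / (2 * a) for sign in (1, -1)]
    positive = [r for r in roots if r > 0]
    return min(positive) if positive else None


# Made-up observations: requests per minute from one IP vs. share of
# requests that succeeded. These numbers are illustrative, not real data.
rates = [0, 10, 20, 30, 40]
success = [1.00, 0.99, 0.96, 0.91, 0.84]

a, b, c = fit_quadratic(rates, success)
limit = max_safe_rate(a, b, c, threshold=0.95)
print(f"Estimated limit: {limit:.1f} requests per minute")
```

With these invented numbers the estimate comes out at roughly 22 requests per minute. In practice the shape of the curve, the threshold and the unit of measurement would all depend on the target website.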
Our in-house software, coupled with a convenient dashboard, helps us see scraping trends for each city, which also makes it easier to optimize the stability of the infrastructure.
Here are a couple of interesting facts from Rimgaudas’ presentation:
- There is a strong trend of scrapers taking more and more data from their targets.
- During the summer, companies scrape less, but once fall starts, seasonal events (e.g. Black Friday) lead to a dramatic increase in scraping rates.
Fingerprinting: the Web’s Dirty Little Secret
Allen O’Neill, a big data cloud engineer, discussed the many different ways user data is collected and used to build unique profiles, called fingerprints, and how these are used to identify bots. Here are some key takeaways from his rich presentation:
- Browser fingerprinting is currently emerging as the primary method of identifying bots.
- Browser data, such as resolution, supported fonts, languages and much more, helps build this fingerprint. In other words, this unique profile helps identify and track individual users across the web.
- Cookies are losing relevance as a method of identification/tracking, thanks to fingerprinting.
- WebRTC, TCP/IP fingerprinting and Wasm are some of the technologies increasingly leveraged to differentiate bots from real users.
- Hyper-personalization is the next big thing. It requires huge amounts of data and makes it even easier to separate real users from bots.
- Creating realistic personas that act like real web users will eventually become the only way to bypass bot detection systems.
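To make the idea of a fingerprint more concrete, here is a minimal sketch in Python of how a handful of browser attributes could be combined into a single identifier. It is an illustration only: the attribute names, the sample values and the `fingerprint` function are assumptions of ours, and real detection systems rely on many more signals and on fuzzier matching than a plain hash.

```python
import hashlib
import json


def fingerprint(attributes):
    """Return a stable hex digest for a dict of browser attributes."""
    # Sort keys so the same attributes always serialize identically.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute sets, invented for this example.
visitor_a = {
    "resolution": "1920x1080",
    "fonts": ["Arial", "Helvetica", "Times New Roman"],
    "languages": ["lt-LT", "en-US"],
}
# Same data as visitor_a, but with the keys in a different order.
same_as_a = dict(reversed(list(visitor_a.items())))
# A visitor that differs only in screen resolution.
visitor_b = dict(visitor_a, resolution="1366x768")

print(fingerprint(visitor_a) == fingerprint(same_as_a))  # True
print(fingerprint(visitor_a) == fingerprint(visitor_b))  # False
```

The same set of attributes always yields the same identifier, with no cookie involved, while a single changed attribute yields a different one. That is also why a scraper whose attributes never vary, or vary in unrealistic ways, is easy to single out.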
We will end the summary of Mr. O’Neill’s talk with a great quote that perfectly captures his presentation:
“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”
Let’s Get Technical
The rest of the presentations and a workshop were much more technical and geared towards practical use cases, so to keep this recap within a reasonable length, we will only cover some of the main points:
- System administrator Karolis Pabijanskas put forward an argument that every successful company using their own server infrastructure will eventually face the challenge of scaling it.
- As the infrastructure grows, it can reach a point at which your automation solution is no longer able to keep up with your needs.
- Mr. Pabijanskas suggested migrating from Ansible to the SaltStack IT automation solution, lauding its impressive speed and flexibility.
- Another speaker detailed the ins and outs of using a browser as a service. His preferred solution runs cloud-based Chrome browsers for web scraping, with VNC and XFCE allowing for smooth window management and rotating proxies ensuring a high scraping success rate.
- Paul Felby, CTO at AdThena, described how machine learning can be used efficiently to optimize web scraping. In his own words, “scraping less is an optimisation challenge” and machine learning is a perfect solution.
- Mr. Felby detailed the process of optimizing scraping efforts, which includes preparing data the right way, the caveats of training your model, evaluating its accuracy and more.
- It all boils down to using linear and decision tree regression models, random forests, gradient boosting and some more statistical wizardry. We would like to get more in-depth, but, as the saying goes, you just had to be there.
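For those who were not there, the simplest of the model families named above can at least be sketched in a few lines. Below is a depth-one decision tree regressor (often called a stump) in plain Python. It is not Mr. Felby's implementation. The feature (hours since a page was last scraped), the target (the fraction of the page that had changed), the sample numbers and the 0.3 cutoff are all hypothetical, chosen only to show how a model's prediction could help decide when scraping is worth the effort.

```python
def fit_stump(xs, ys):
    """Fit a depth-one regression tree: one threshold, one mean per side."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive feature values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors when each side predicts its own mean.
        sse = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict(stump, x):
    """Predict the target for feature value x using a fitted stump."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Invented training data: hours since the last scrape vs. the fraction
# of the page that turned out to have changed.
hours_since_scrape = [1, 2, 3, 4, 12, 18, 24, 36]
fraction_changed = [0.0, 0.0, 0.1, 0.1, 0.5, 0.6, 0.7, 0.7]

stump = fit_stump(hours_since_scrape, fraction_changed)

# Hypothetical policy: only re-scrape when enough change is expected.
for hours in (2, 20):
    worth_it = predict(stump, hours) > 0.3
    print(f"{hours}h since last scrape -> scrape again: {worth_it}")
```

Random forests and gradient boosting combine many such trees, usually deeper ones, which is where a dedicated machine learning library becomes the sensible choice over hand-written code.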
The official part of day one concluded with Oxylabs software developer Paulius Stundžia leading a workshop on how to save precious resources with the help of Oxylabs Real-Time Crawler and a callback handler. But the fun was not over yet.
With all the knowledge of the day out in the open, OxyCon attendees are now gathering at the beautiful Hotel PACAI for some quality evening entertainment. Tomorrow, another inspiring day of know-how sharing awaits us! If you want to stay updated, be sure to follow us on Twitter.