Vytautas Kirjazovas

Oct 01, 2019 5 min read

It was a sunny fall morning in Vilnius, Lithuania, where the first annual OxyCon commenced today. A two-day event, OxyCon by Oxylabs is a data extraction industry conference dedicated to sharing knowledge and know-how from expert speakers at some of the best, market-leading companies. A total of 56 attendees from 28 companies and 12 countries are participating in the event, together with speakers and participants from Oxylabs itself.

We had a bit of everything – good coffee, a national nuclear accident response drill (you read that right), truly in-depth presentations and, of course, some friendly mingling, socializing and entertainment. 

Whether you participated in the event yourself or are just curious, we have prepared a handy recap of the first day. Without further ado, let’s get to it.

OxyCon 2019: The Top Takeaways From Day One #4

Sustainability of Data Center Proxies in the Scraping Industry

After an introductory speech by one of the Oxylabs founders, detailing our history, work culture and some beautiful traditions we have here, the mic was passed to Rimgaudas Mazgelis, an Oxylabs data analyst.

Mr. Mazgelis shared some insights into how data analysis has evolved here at Oxylabs to ensure the sustainable operation of our data center proxies.

It all started with looking for basic patterns in how websites respond to scraping, at first simply using Microsoft Excel all the way back in 2015, before finally switching to and sticking with R and Python, as it should be.

Here at Oxylabs, quadratic equations are used to help estimate how many requests a target website will accept before a specific IP gets blocked. Today, ratios are being continuously calculated for all of the most popular targets.
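To make the idea of a quadratic request limit more concrete, here is a minimal sketch in Python. It is not the model presented at OxyCon: the function names `fit_quadratic` and `request_limit`, the sample measurements and the 5% tolerated block rate are all assumptions made up for illustration. The sketch fits a quadratic curve to observed block rates and then solves for the request rate at which the curve crosses the tolerated level.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    # Power sums needed for the 3x3 normal-equation system.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix, unknowns ordered (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def request_limit(a, b, c, tolerated_block_rate):
    """Largest request rate at which the fitted curve equals the tolerated block rate."""
    disc = b * b - 4 * a * (c - tolerated_block_rate)
    if a == 0 or disc < 0:
        raise ValueError("no real crossing point for this fit")
    roots = [(-b + sign * math.sqrt(disc)) / (2 * a) for sign in (1, -1)]
    return max(roots)


# Hypothetical measurements: requests per minute from one IP vs. share of blocked requests.
requests_per_minute = [10, 20, 30, 40, 50]
block_rate = [0.01, 0.04, 0.09, 0.16, 0.25]

a, b, c = fit_quadratic(requests_per_minute, block_rate)
limit = request_limit(a, b, c, tolerated_block_rate=0.05)
print(f"Estimated limit: {limit:.1f} requests per minute")  # roughly 22.4 for this made-up data
```

In practice the measurements would be noisy and differ per target, so a fit like this would need to be recalculated regularly rather than computed once.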

Our in-house software, coupled with a convenient dashboard, helps us see scraping trends for each city, which also makes it easier to optimize the stability of the infrastructure.

Here are a couple of interesting facts from Rimgaudas’ presentation:

There is a strong trend of scrapers taking increasingly more data from their targets.

During the summer, companies scrape less, but when fall starts, the seasonal celebrations (e.g. Black Friday) lead to a dramatic increase in scraping rates.

Fingerprinting: the Web’s Dirty Little Secret

Allen O’Neill, a big data cloud engineer, discussed the many ways in which user data is collected and used to build unique profiles, called fingerprints, and how these fingerprints are used to identify bots. Here are some key takeaways from his rich presentation:

  1. Browser fingerprinting is currently emerging as the primary method of identifying bots.
  2. Browser data, such as resolution, supported fonts, languages and much more, helps build this fingerprint. In other words, this unique profile helps identify and track individual users across the web.
  3. Cookies are losing relevance as a method of identification/tracking, thanks to fingerprinting.
  4. WebRTC, TCP/IP fingerprinting and Wasm are some of the technologies increasingly leveraged to differentiate bots vs. real users. 
  5. Hyper-personalization is the next big thing, requiring huge amounts of data and making separating real users from bots even easier.
  6. Creating realistic personas that act like real web users will eventually become the only way to bypass bot detection systems.
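As a rough illustration of how browser attributes can be combined into a single identifier, here is a minimal Python sketch. It is an assumption-laden toy, not a method shown in the talk: the `fingerprint` function and the attribute names are hypothetical, and real fingerprinting systems draw on far more signals and typically tolerate small changes rather than relying on an exact hash.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dictionary of browser attributes into one stable identifier."""
    # Sorting keys makes the result independent of the order attributes were collected in.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute sets for two visits.
first_visit = {
    "screen": "1920x1080",
    "fonts": ["Arial", "Verdana"],
    "languages": ["en-US", "lt"],
}
second_visit = {
    "languages": ["en-US", "lt"],
    "screen": "1920x1080",
    "fonts": ["Arial", "Verdana"],
}

# The same attributes produce the same fingerprint, even without any cookie.
print(fingerprint(first_visit) == fingerprint(second_visit))
```

A scraper that sends identical attributes on every request therefore produces one easily recognized fingerprint, which is why realistic, varied personas matter.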

We will end the summary of Mr. O’Neill’s talk with a great quote that perfectly captures his presentation:

“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”

Let’s Get Technical

The rest of the presentations and a workshop were much more technical and geared towards practical use cases, so to keep this recap within reasonable length, we will only cover some of the main points:

  1. System administrator Karolis Pabijanskas put forward an argument that every successful company using their own server infrastructure will eventually face the challenge of scaling it.
  2. As the infrastructure grows, a point may be reached at which your automation solution can no longer keep up with your needs.
  3. Mr. Pabijanskas suggested migrating from Ansible to the SaltStack IT automation solution, lauding its impressive speed and flexibility.
  4. Another speaker covered the ins and outs of using a browser as a service. He detailed his preferred solution: cloud-based Chrome browsers for web scraping, with VNC and XFCE allowing for smooth window management and rotating proxies ensuring a high scraping success rate.
  5. Paul Felby, CTO at AdThena, described how machine learning can be used efficiently to optimize web scraping. In his own words, “scraping less is an optimisation challenge” and machine learning is a perfect solution.
  6. Mr. Felby detailed the process of optimizing scraping efforts, which includes preparing data the right way, the caveats of training your model, evaluating its accuracy and more. 
  7. It all boils down to using linear and decision tree regression models, random forests, gradient boosting and some more statistical wizardry. We would like to get more in-depth, but, as the saying goes, you just had to be there.

The official part of day one was concluded with Oxylabs software developer Paulius Stundžia leading a workshop on how to save precious resources with the help of Oxylabs Real-Time Crawler and a callback handler. But the fun was not over yet. 

After getting all the knowledge of the day out in the open, OxyCon attendees are currently gathering at the beautiful hotel PACAI for some quality evening entertainment. Tomorrow, another inspiring day of know-how sharing awaits us! If you want to stay updated, be sure to follow us on Twitter.

About Vytautas Kirjazovas

Vytautas Kirjazovas is a Content Manager at Oxylabs with a strong personal interest in technology and its potential to make everyday business processes easier and more efficient. Vytautas is fascinated by new digital tools and approaches, in particular for web data harvesting, so feel free to drop him a message if you have any questions on this topic. He appreciates a tasty meal, enjoys travelling and writing about himself in the third person.
