
OxyCon 2022: The Top Takeaways From Day One

Danielius Radavicius

2022-09-07 • 7 min read

We're proud to once again welcome you to the third annual OxyCon 2022 conference. The free, two-day virtual event includes a diverse set of 15 speakers, all of whom are experts in their respective data gathering fields. Don't miss out on this rare opportunity to gain unique and valuable insights from the people who have guided the industry to where it is today and who will shape where it goes in the future.

During day one, the presentations ranged from managing Python scrapers' dependencies and maintaining high-quality data at growing request volumes to scraping for government use cases and discussions of the legal aspects of web scraping. If you're curious about a detailed rundown of the entire day, check out the brief below.

Managing Dozens of Python Scrapers’ Dependencies: The Monorepo Way

Tadas Malinauskas, a Python developer at Oxylabs, opened OxyCon's lineup of talks by focusing on why Oxylabs chose the monorepo approach.

The talk was split into two parts: 1) an introduction to the topic itself and 2) the advantages of the monorepo way. Tadas then outlined those advantages, highlighting factors such as easy scalability, to explain why you might want to consider a monorepo approach.

The talk then moved on to the main differences between a standard and a “wannabe” monorepo, the latter being what Oxylabs currently uses, and finished with a concrete coding example of how to build a “wannabe” monorepo setup.
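The exact example from the talk isn't reproduced in this recap, but a minimal single-file sketch of the kind of code sharing a “wannabe” monorepo enables might look like the following. The layout and all names are illustrative assumptions, not taken from the presentation; in a real repository, the shared helper and each scraper would live in separate packages under one repo.

```python
# Illustrative "wannabe" monorepo layout (assumed, not from the talk):
#
#   scrapers-monorepo/
#   ├── libs/common/         # shared HTTP helpers, parsing utilities
#   ├── scrapers/shop_a/     # each scraper is its own small package
#   ├── scrapers/shop_b/
#   └── requirements.txt     # one pinned dependency set shared by all scrapers
#
# Everything below would normally be split across those packages; it is kept
# in one file here only so the sketch stays self-contained and runnable.
from urllib.request import Request, urlopen


def fetch(url: str, timeout: float = 10.0) -> str:
    """Shared helper (would live in libs/common): fetch a page as text."""
    request = Request(url, headers={"User-Agent": "example-scraper/1.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_shop_a() -> int:
    """Scraper scrapers/shop_a: reuses the shared fetch helper."""
    return len(fetch("https://example.com"))  # placeholder target


def scrape_shop_b() -> int:
    """Scraper scrapers/shop_b: same helper, same pinned dependencies."""
    return len(fetch("https://example.com"))  # placeholder target


if __name__ == "__main__":
    print(scrape_shop_a(), scrape_shop_b())
```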

How to Continuously Yield High-Quality Data While Growing From 100 to 100M Daily Requests

Glen De Cauwsemaecker, a Lead Crawler Engineer at OTA Insight, presented how OTA Insight went from 100 daily requests to 100,000,000 without sacrificing data quality.

Initially, Glen showed the history and goals of OTA Insight, discussing how the company functions and mentioning the role of revenue managers.

Right away, Glen took a retrospective look at the company: the various changes, errors, and solutions the business had run into along the way. The presentation was divided into these talking points:

  1. From the start, Glen wanted to use data extraction to find specific pricing data; a three-step process covering inputs, extraction, and transform steps was introduced to the projects (a minimal sketch of such a pipeline follows this list).

  2. However, around 2013, this approach was very error-prone, labor-intensive, and offered little oversight; therefore, various changes were introduced.

  3. One of these changes was the application of integrities, which allowed for better identification and tracking of data. Later, once enough data was being generated, OTA Insight introduced a further automation process called group integrities, which effectively produced an automatically generated report for the revenue manager.

  4. Later, a scheduler was also added, which allowed for significant automation.

  5. Importantly, because OTA Insight focuses on hotels, whose prices are time-dependent, the concept of Liveshops was introduced: always active/live and featuring automatic resource-based scaling.

  6. A super proxy was added as a middleman to handle a high throughput of requests per second.
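To make the three-step process from the first point above more concrete, here is a minimal input → extraction → transform sketch. It only illustrates the general idea; the names and data are assumptions for the example, not OTA Insight's actual implementation.

```python
# Minimal input -> extraction -> transform pipeline (illustrative only).
from dataclasses import dataclass


@dataclass
class RateRecord:
    hotel: str
    date: str
    price_eur: float


def load_inputs() -> list[dict]:
    """Input step: in practice, the list of hotels and dates to shop for."""
    return [{"hotel": "Hotel A", "date": "2022-09-08"}]


def extract(task: dict) -> dict:
    """Extraction step: fetch the raw response for one task.

    A real extractor would send an HTTP request (possibly through a proxy or
    super proxy); a canned payload keeps this sketch runnable offline.
    """
    return {**task, "raw_price": "129.00 EUR"}


def transform(raw: dict) -> RateRecord:
    """Transform step: parse the raw payload into a clean, typed record."""
    price = float(raw["raw_price"].split()[0])
    return RateRecord(hotel=raw["hotel"], date=raw["date"], price_eur=price)


if __name__ == "__main__":
    for task in load_inputs():
        print(transform(extract(task)))
```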

Lastly, Glen outlined three core priorities for anyone dealing with similar processes: 1) automate (but only when necessary), 2) be proactive, and 3) be pragmatic, meaning keep costs and rewards in mind at all times.

Scraping for Government Use Cases: How to Detect Illegal Content Online?

Ovidijus Balkauskas, a Linux Systems Engineer at Oxylabs, was one of the creators of the automated illegal content detection tool for the Lithuanian Communications Regulatory Authority. Drawing on the unique experience of applying scraping for a public entity, Ovidijus took us through the process of working with governmental agencies on public web data gathering.

The presentation started by outlining three core differences between the public and private sectors when it comes to web scraping:

  1. Experience: private companies often have plenty of experience applying public web data scraping, whilst a public institution may have none.

  2. Experimentation: public institutions may struggle with scraping experimentation as they have significantly less knowledge and experience.

  3. Expectations: in the private sector, some projects are started specifically with scraping in mind, while public institutions may not even know they need, or could use, scraping.

Ovidijus continued by looking at the entire process Oxylabs went through to help prevent illegal and harmful content online. Some issues were difficult to tackle because detection relied on harmful content becoming popular enough for someone to report it. Thankfully, these issues were clear even before development started; therefore, a more proactive approach had to be taken.

The talk then moved toward the crucial question of how to find illegal content online. While, at that point, it hadn't been answered yet, the requirements were already clear: focus on the Lithuanian web space and analyze visual content, specifically pictures.

At this point, Oxylabs started talking to the Communications Regulatory Authority (CRA), providing ideas and raising questions, such as: what defines the Lithuanian web space? A .lt domain does not necessarily constitute a Lithuanian website; therefore, a Lithuanian IP address was added as a requirement for a web page to be considered a Lithuanian site.

Oxylabs ensured that the system was scalable, expandable, and built around clear goals (a simplified sketch of such a pipeline follows this list):

  1. Identify websites to target. List .lt domains and check the IPs.

  2. Collect content via scraping.

  3. Mark and report illegal content. For this step, the team looked at open-source machine learning (ML) models trained to detect illegal content, for example by estimating age from pictures.
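As a rough illustration of how those three goals fit together, the sketch below chains a domain/IP check, content collection, and a placeholder ML check. The helper names, the geo-IP stub, and the classifier stub are assumptions for the example, not the actual CRA tool.

```python
# Illustrative identify -> collect -> flag pipeline (not the actual CRA tool).
import socket
from urllib.request import urlopen


def ip_is_in_lithuania(ip: str) -> bool:
    """Placeholder: a real check would query a geo-IP database."""
    return True  # assumed for the sketch


def is_lithuanian_site(domain: str) -> bool:
    """Goal 1: identify targets. A .lt domain alone isn't enough, so the
    resolved IP is also checked against a (stubbed) geo-IP lookup."""
    if not domain.endswith(".lt"):
        return False
    try:
        ip = socket.gethostbyname(domain)
    except OSError:
        return False
    return ip_is_in_lithuania(ip)


def collect(domain: str) -> bytes:
    """Goal 2: collect content via scraping (pages and, crucially, images)."""
    with urlopen(f"http://{domain}", timeout=10) as response:
        return response.read()


def looks_illegal(content: bytes) -> bool:
    """Goal 3: mark and report. A real system would run open-source ML models
    (e.g. age detection on images); this stub always returns False."""
    return False


if __name__ == "__main__":
    for domain in ["example.lt"]:  # in reality, a full list of .lt domains
        try:
            if is_lithuanian_site(domain) and looks_illegal(collect(domain)):
                print("report:", domain)
        except OSError as error:
            print(f"skipping {domain}: {error}")
```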

The end result was highly successful: in just two months, 288k websites were checked and eight police reports were influenced by the findings.

How to Scrape The Web of Tomorrow

Ondra Urban, COO at Apify, started by introducing his company and outlining the key features of how the latest tech teams approach scraping.

The presentation focused on scraping itself and the techniques you can use. Initially, Ondra looked at the issue of a website blocking requests based on location. Does a proxy server help? After testing, a datacenter proxy returned the same error.

What about residential proxies? The result was no response whatsoever. This led Ondra to conclude that a proxy alone is sometimes not enough to access the website you want. You have to approach it like a human, i.e., a request that doesn't look human will likely get blocked by the website.

Continuing the presentation, Ondra showed how complex it is to mask your request so that it looks human; several factors have to be considered:

  • HTTP/2

  • Headers 

  • TLS versions 

  • Ciphers

  • Signature algorithms 

  • ECDH curves

Fingerprints often give away the fact that you're using a bot, and changing the User-Agent alone isn't enough; a plethora of detailed information needs to be added to avoid blocks.

The last step is injecting this detailed information into the fingerprint. Afterwards, access to the website was gained.
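As a simple illustration of the HTTP-level part of this (browser-like headers over HTTP/2, not the TLS or fingerprint-injection layers), here is a small sketch using the httpx library. The header values and target URL are assumptions; tuning TLS ciphers, signature algorithms, and ECDH curves requires lower-level control than shown here.

```python
# Browser-like request over HTTP/2 (illustration only).
# Requires: pip install "httpx[http2]"
import httpx

BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Upgrade-Insecure-Requests": "1",
}

with httpx.Client(http2=True, headers=BROWSER_HEADERS, follow_redirects=True) as client:
    response = client.get("https://example.com")  # placeholder target
    print(response.http_version, response.status_code)
```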

Ondra continued by emphasizing that web scraping is a bit different from what some users may expect, and that using full browser setups is becoming increasingly essential.

The presentation then moved towards the reputation of web scraping and how it recently changed from being a topic that was, years ago, seen as a gray area to one that is now considered a standard business norm.

At the end of the presentation, Ondra gave some finishing tips, such as making sure you only use the traffic you actually need: residential proxies are expensive, and running them constantly can result in major costs and an overall decrease in traffic speed.

Importantly, Ondra stated that scraping should be done ethically, in a way that doesn’t DDoS a website.

Lastly, the talk finished with usage examples of Crawlee, Apify's open-source web scraping and browser automation library.

How hiQ Labs v. LinkedIn and Cases Following It Have Changed U.S. Law on Scraping

It's challenging to raise the topic of public data web scraping without mentioning legalities. Because of this, Alex Reese, a partner at Farella Braun + Martel, introduced the well-known case of hiQ Labs v. LinkedIn and how the cases following it have changed U.S. law on scraping.

Primarily, the presentation focused on the case mentioned above and how it shaped scraping in the U.S., although there were many other interesting talking points. Overall, the presentation was divided into two parts:

  1. An overview of the Computer Fraud and Abuse Act (CFAA). Its initial goal was to outlaw hacking, i.e., breaking into a network to steal protected property. It imposes criminal liability on anyone who accesses a computer:

  • Without authorization

  • Exceeding authorized access (having some authorization but accessing information they weren't supposed to)

Alex Reese then explained how this act, which otherwise has nothing to do with scraping, has been widely used to try and counter it. Alex also outlined the wide range of legal claims that have been brought against public web data scraping.

  2. How hiQ v. LinkedIn changed the law. Farella Braun + Martel's client scraped public data to provide H.R. analytics, resulting in a cease-and-desist letter from LinkedIn. Yet, plenty of arguments could be made in favor of scraping: mainly, that the CFAA doesn't apply to public data, that the claim violates First Amendment rights, and that a number of antitrust issues were apparent (LinkedIn was trying to offer the same services, so trying to shut down hiQ appears anticompetitive).

Alex then went into detail about the court’s final decision and why hiQ had succeeded in winning the case.

Lawyers Discuss Scraping

A panel of legal experts, hosted by Denas Grybauskas, Head of Legal at Oxylabs, discussed the intricacies of scraping and the critical legal questions one has to ask. For example, what must you consider before starting a public web data scraping project, and what does the current legal landscape look like in terms of scraping?

Initially, a hypothetical situation was proposed:

  • Imagine a new scraping project has been laid out; what are the step-by-step processes you go through to analyze and determine if any legal issues could appear?

During the middle part of the panel discussion, Sanaea Daruwalla, General Counsel at Zyte, stated that the hiQ Labs v. LinkedIn case, while important, will not be the only one; many more are likely to arise from different circumstances. This will lead to varying legal claims, which have to be resolved on a case-by-case basis rather than declaring all public web data scraping "legal" in the US.

That was a short overview of the first day, but don't forget, an equally exciting day two awaits! Head over to our OxyCon page to find out more and, in case you haven't registered yet, do so before it's too late!
