Key takeaways from the second day of OxyCon 2021

Maryia Stsiopkina

Aug 27, 2021 8 min read

OxyCon 2021, a two-day virtual conference, has just come to an end, and we’re excited to share the highlights from the event’s Day Two.

Following the thought-provoking presentations of Day One, we continued our web scraping conference marathon with new speakers who invited us to explore topics that challenge our minds and offer a new perspective.

If you didn’t get the chance to attend the Day Two presentations or simply would like to refresh the key points of the event, here are the top takeaways from Day Two of OxyCon 2021.

Recent Case Law and the Future of Web Scraping

Day Two opened on the right note by raising one of the most heatedly debated topics in the web scraping community: legal matters. Denas Grybauskas, Head of Legal at Oxylabs, shed light on the uncertainties surrounding recent legal cases that have left a significant mark on the scraping industry:

  • Van Buren v. United States
  • hiQ Labs v. LinkedIn
  • Southwest Airlines Co. v. Kiwi.com
  • Facebook against NYU researchers

Building on these cases, Denas shared his vision of how the web scraping industry should act and highlighted three main points:

  1. Ethics. Even though it’s a daunting task in web scraping to please all the parties involved, it’s crucial to adhere to the principles of ethical data access and collection. Denas advised paying close attention to the type of data you intend to scrape and the terms of service of particular websites you want to access.
  2. Transparency. The speaker emphasized the importance of raising awareness about all the struggles we have and discussing them instead of sweeping the problems under the carpet.
  3. Self-regulation. While there aren’t international regulations for scraping, Denas encouraged colleagues to implement self-regulation mechanisms in the industry. He backed up the potential of such a strategy with successful examples from elsewhere, e.g., the advertising industry after GDPR came into force.

Still, many questions remain to be answered, like:

  • What is public data, and what makes it public?
  • Who owns web data?
  • What will trigger the industry into self-regulation?

Active Fingerprinting to Avoid Bot Detection

The next speaker, Paulius Stundžia, Software Engineer at Oxylabs, made sure that his interactive presentation not only provided profound insight on website fingerprinting but could also be put to practical use by all the developer colleagues out there.

Fingerprinting involves gathering information about the browser, user agent, hardware, and other indicators that might hint at a headless browser. According to Paulius, two types of fingerprinting can be distinguished:

  • Passive fingerprinting. This type relies on information such as the IP address, headers, HTTP protocol, and TLS version. It’s characterized by low uniqueness and can be circumvented via proxies, VPNs, or cookie cleanups.
  • Active fingerprinting. Hardware, browser, and operating system information is identified in this case. The level of uniqueness here is high, even across different IPs and sessions.

Paulius also pointed to three types of red flags that are most likely to lead to bans:

  1. Obvious red flags, e.g., the word “headless” in the user agent information.
  2. Suspicious red flags, e.g., when no speakers, microphone, and plugins are detected.
  3. Discrepancies and mutual exclusion, e.g., the IP time zone doesn’t match the browser’s time zone.
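
To make the three categories more concrete, here is a minimal Python sketch of how a detection script might sort collected signals into them. It is not the code from the talk: the `BrowserSignals` class, its fields, and the `find_red_flags` function are hypothetical names invented for illustration, and real detection systems check far more signals.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BrowserSignals:
    """Hypothetical snapshot of signals a fingerprinting script might collect."""
    user_agent: str
    plugins: List[str] = field(default_factory=list)
    has_speakers: bool = True
    has_microphone: bool = True
    ip_timezone: str = "UTC"
    browser_timezone: str = "UTC"


def find_red_flags(signals: BrowserSignals) -> List[str]:
    """Return one message per red-flag category that the signals trigger."""
    flags = []

    # 1. Obvious: the user agent openly says "headless".
    if "headless" in signals.user_agent.lower():
        flags.append("obvious: 'headless' in user agent")

    # 2. Suspicious: no speakers, no microphone, and no plugins at all.
    if not (signals.plugins or signals.has_speakers or signals.has_microphone):
        flags.append("suspicious: no speakers, microphone, or plugins")

    # 3. Discrepancy: the IP address and the browser disagree on the time zone.
    if signals.ip_timezone != signals.browser_timezone:
        flags.append("discrepancy: IP and browser time zones differ")

    return flags


if __name__ == "__main__":
    bot = BrowserSignals(
        user_agent="Mozilla/5.0 HeadlessChrome/92.0",
        has_speakers=False,
        has_microphone=False,
        ip_timezone="Europe/Vilnius",
        browser_timezone="UTC",
    )
    for flag in find_red_flags(bot):
        print(flag)
```

A scraper developer could run the same kind of check against their own browser setup to see which signals give it away before a target site does.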

Following a demonstration on how websites use fingerprinting to identify headless browsers, Paulius matched his words with deeds and performed a live coding session with Python showing the most efficient mocking techniques. You can find the code samples here.


Augmenting Web Scraping With Machine Learning

The future starts now. Jurijus Gorskovas, Machine Learning Engineer at Oxylabs, knows better than anyone how machine learning (ML) is transforming web scraping. In his presentation, Jurijus not only opened up about the struggles his team faced while scraping public data at scale but also gave valuable tips on how to deal with these challenges using ML algorithms. Here are some examples of where the usual web scraping approach can be replaced with more robust ML techniques:

  • Artificial Intelligence (AI) block detection. Machine learning makes it possible to gather as many data points (HTMLs) as possible while minimizing blocks. However, it comes with its own challenges, such as coupling samples per domain, errors in labeling, and multi-language problems.
  • AI scraper optimizer. The usual approach to scraping parameterization has some critical issues. Most of the time, new domains require development resources and existing domains require maintenance, which makes large-scale scraping even more complex. AI adaptive parameterization takes less time to react and can learn from historical data, which makes it more competitive.
  • AI CAPTCHA solvers. While multiple varieties of CAPTCHAs are evolving, AI models are being trained to recognize and solve them. Such in-house solutions reduce the developer resources required and save costs on a large scale.
  • AI e-commerce parser. The most common issues with a static parser are that each domain requires its own parser and every reaction to changes needs manual input. This makes it necessary to maintain multiple parsers, which can be resource-inefficient. AI e-commerce parsers can extract the title, price, and other data points from any e-commerce website and require less maintenance.

Using NLP for Entity Detection in Parsed HTML

An insightful presentation on how Natural Language Processing (NLP) helps us better understand the data we extract was given by Adi Andrei, Founder and CEO at Technosophics. Adi discussed the NLP tools that identify speech patterns and thus detect entities in parsed HTML.

According to Adi, web scraped data, especially text data (natural language), is unstructured in most cases. Computers don’t understand this sort of data and can’t easily process it, since extracting meaning from text is difficult for several reasons:

  • Words that look different can mean the same thing
  • The same words in a different order can mean something completely different
  • A character-level regular expression approach is very limited
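
These three difficulties can be shown with a toy Python sketch. It is not taken from the talk and uses only the standard library rather than an NLP library; the `SYNONYMS` table, the helper names, and the sample sentences are all made up for illustration.

```python
import re
from collections import Counter
from typing import List

# Toy synonym table, invented for this illustration only.
SYNONYMS = {"purchase": "buy", "acquire": "buy"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def same_bag_of_words(a: str, b: str) -> bool:
    """True when two texts contain exactly the same words, ignoring order."""
    return Counter(tokenize(a)) == Counter(tokenize(b))


def normalize(text: str) -> List[str]:
    """Map each token to a canonical form using the toy synonym table."""
    return [SYNONYMS.get(token, token) for token in tokenize(text)]


def regex_mentions(word: str, sentences: List[str]) -> List[str]:
    """Return the sentences where a whole-word, case-insensitive regex matches."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


if __name__ == "__main__":
    # Same words, different order: word counts alone cannot tell these apart.
    print(same_bag_of_words("The dog bit the man", "The man bit the dog"))

    # Different-looking words, same meaning: a match needs linguistic knowledge.
    print(normalize("Customers purchase phones") == normalize("Customers buy phones"))

    # A regex finds the characters but cannot say whether "Apple" is a company or a fruit.
    print(regex_mentions("apple", ["Apple opened a new store", "Add one apple to the pie"]))
```

In the last case the pattern matches both sentences, which is exactly the limitation: telling the company from the fruit requires context, and that is what entity detection tools are designed to use.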

The solution offered by Adi involves using linguistic knowledge and natural language processing (NLP), which help transform unstructured data into structured data.

With the help of Python NLP libraries, the speaker introduced some practical techniques and workflows that can be used in web scraping to extract meaning from text data. Supporting files for this coding session can be found here.

Ethical Data Collection Policies

The presentation by Cornelius (Con) Conlon, the Managing Director of Merit Data & Technology, was a guide to action for companies that favor ethical data collection, as well as for those that are only about to embark on this path.

During the session, Con turned to the legal aspects of data scraping and the benefits of implementing an ethical data collection policy for the company, and offered advice on handling situations when the ethical approach runs counter to customer expectations.

At the beginning of his presentation, Cornelius stressed the positive effect of web scraping in many business areas, including marketing and governance. And while scraped data can be useful in many respects, it can be collected in either a good or a bad way. Cornelius provided several key reasons why ethical data collection should become a priority:

  • It makes sure everyone involved in web scraping activities is on the same page regarding critical do’s and don’ts.
  • It leads to better code and maintenance outcomes since scrapers that violate the policy draw unwanted attention.
  • As a result, it reduces costs.
  • It helps avoid reputation damage and lawsuits.
  • It forces your customer to be more conscious about what they want and how they want it.

According to Con, these are the elements that could be considered in this policy:

  1. Take into account the target sites and the burden you might put upon them.
  2. Collect only the data you need.
  3. Adhere to robots.txt and track changes to it.
  4. Send a user agent string that contains details about you whenever it’s possible.
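
Points 3 and 4 translate fairly directly into code. The sketch below uses Python’s standard `urllib.robotparser` module to check a URL against robots.txt rules and hashes the file so that changes can be noticed between runs. The user agent string, the sample rules, and the helper names are assumptions made for illustration; they do not come from the presentation. The rules are parsed from a string here so the example runs offline, whereas a real scraper would download the target site’s robots.txt first.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string that tells the site owner who is crawling
# and how to reach them.
USER_AGENT = "example-scraper/1.0 (+https://example.com/bot-info; bot@example.com)"

# Made-up rules for illustration; fetch the real file from the target site.
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Check whether our user agent may fetch the given URL."""
    return parser.can_fetch(USER_AGENT, url)


def robots_fingerprint(robots_txt: str) -> str:
    """Hash the file so a later run can detect that the rules have changed."""
    return hashlib.sha256(robots_txt.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    print(is_allowed(parser, "https://example.com/products"))
    print(is_allowed(parser, "https://example.com/private/data"))
    print(parser.crawl_delay(USER_AGENT))  # requested pause between requests
    print(robots_fingerprint(SAMPLE_ROBOTS_TXT))
```

Storing the fingerprint alongside each crawl is one simple way to track changes: if the hash differs on the next run, the rules should be re-read before any further requests are sent. Honoring the crawl delay also addresses point 1, since it limits the burden placed on the target site.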

Cornelius proposed discussing this policy with your tech team and revising it every six months. After all, having an ethical data collection policy will help you in any legal proceedings.

Web Scraping in 2021 and Beyond

The logical culmination of Day Two, as well as the OxyCon 2021 conference, was marked by a panel discussion. It brought together the experts from business, legal, and e-commerce areas to share their thoughts on how the web scraping industry was changing in 2021, especially due to the pandemic.

The key takeaways from the panel discussion can be summarized as follows:

  1. As Gabriele Montvile, Head of Account Management at Oxylabs, noted, many clients grew more sensitive to pricing and started to favor more flexible payment terms. It’s crucial to adjust to those needs in order to maintain trust. At the same time, while some business areas suffered fluctuations, other sectors, such as e-commerce, managed to make even more profit during the pandemic.
  2. Acquiring talent got more challenging, but Allen O’Neill, Founder and CTO at DataWorks, has a positive perspective on the future. He believes that we should augment the human workforce with AI and concentrate on automating processes. Focusing on a company’s core competencies and being more selective about what we work on should become our guiding principle.
  3. Cornelius (Con) Conlon, the Managing Director of Merit Data & Technology, noted that making healthcare, education, and other private data publicly available may bring positive changes in policy making in many sectors.
  4. Since there’s still a lot of uncertainty about legal regulation in the web scraping industry, Marija Markova, Head of IP Team at Oxylabs, emphasized the point previously stated: the industry has to come to an internal agreement and self-regulation.
  5. Juras Jursenas, Chief Operating Officer at Oxylabs, concluded by highlighting the importance of finding and maintaining good partnerships.

After a one-year break during the pandemic, the OxyCon conference came back with even more groundbreaking presentations and topics delivered by the leading experts from all over the world. We truly hope that this web scraping event enriched you with multiple inspiring discoveries and know-how.

See you at OxyCon next year!



About Maryia Stsiopkina

Maryia Stsiopkina is a Junior Copywriter at Oxylabs. As her passion for writing was developing, she was writing either creepy detective stories or fairy tales for children at different points in time. Eventually, she found herself in the tech wonderland with numerous hidden corners to explore. In her spare time, she goes birdwatching with the binoculars (some people mistake it for stalking, which is why Maryia finds herself in an awkward situation sometimes), makes flower jewellery, and eats many pickles and green olives.

