
Melding Machine Learning with Web Scraping: Interview with Andrius Kūkšta

Adomas Sulcas

2023-09-11 · 8 min read

As part of an initiative to explore the intricate facets of the web scraping world, we engaged in a discussion with Andrius Kūkšta, a seasoned Data Engineer at Oxylabs. Our conversation revolved around the challenges and opportunities presented by the integration of machine learning into web scraping operations, as well as the future of data extraction in an age dominated by advanced algorithms and large language models (LLMs).

With a tenure exceeding five years at Oxylabs, Andrius has been instrumental in multiple projects, introducing numerous machine learning augmentations to enhance web scraping processes. At present, he is pioneering the development of a novel data-centric product, showcasing his prowess in both data engineering and the seamless assimilation of machine learning techniques. His profound knowledge of, and unwavering enthusiasm for, the machine learning sphere make for compelling perspectives on optimizing ML within web scraping.

On September 13th, 2023, Andrius shared his expertise at OxyCon, a web scraping convention, in a presentation titled “Leveraging Machine Learning for Web Scraping.”

Andrius, can you share a bit about your journey at Oxylabs and how you transitioned into focusing on the intersection of web scraping and machine learning?

At Oxylabs, my journey started approximately six years ago, marking the inception of a profound learning curve in my professional career. Initially, I took on the role of a technical analyst, a position that presented me with the responsibility of navigating and resolving multifaceted issues experienced by our partners. These challenges spanned a diverse range of products that the company offered at the time.

Over time, my inclination towards technical challenges propelled me to transition to a developer role. I found myself deeply engrossed in the nuances of web scraping, dedicating my efforts to the scraper development team. Our primary objective was to meticulously analyze websites, creating optimal methodologies to acquire data from them.

During this phase of my career, my colleague and I delved deeper into our tasks, and we identified a new area with untapped potential – Artificial Intelligence and Machine Learning (AI & ML) within our existing scraping pipelines. Our passion for machine learning steered us towards a pioneering venture. We embarked on a project to train a machine learning model with the intent to decipher CAPTCHAs on one of our target platforms. To our delight, the endeavor bore fruit.

Recognizing our success and the vast implications of integrating AI & ML into our operations, the OxyBrain team was subsequently established. Our goal within this team was explicit: to serve as torchbearers in an era of AI & ML integration across various product features within Oxylabs. The collaboration within the team enabled us to pioneer solutions that not only enhanced product efficiency but also augmented user experience.

The relationship between web scraping and machine learning seems to be symbiotic, with web scraping feeding data to ML models and ML models enhancing scraping techniques. How do you see this evolving in the future?

In today's rapidly evolving technological landscape, LLMs, epitomized by groundbreaking innovations like ChatGPT, are at the forefront of the hype cycle. These models, in their vastness and complexity, are ravenous for data – specifically high-quality, diverse, and voluminous textual data. It is here that web scraping assumes an even more pivotal role. It acts as a conduit, channeling an unending stream of textual information, which is vital for training, refining, and optimizing these language models to understand and generate human-like text.

On the other hand, websites and web applications are increasingly deploying advanced bot detection mechanisms. These are designed to thwart large volumes of requests, making the task of harvesting public data more challenging. In response, I foresee an intensified incorporation of machine learning within web scraping pipelines. The future of scraping is not just about extracting data but doing so intelligently.

By employing ML algorithms, scraping tools can adapt, evolve, and navigate these sophisticated bot detection measures. Machine learning will empower these tools to extract data more efficiently, bypassing CAPTCHAs, adjusting scraping patterns in real time, and even predicting which parts of a site might be most valuable to scrape.

In essence, the future interplay between web scraping and machine learning will be a dynamic dance of adaptation and innovation. As obstacles in data collection become more formidable, the tools we design, underpinned by ML, will become more adept. This continuous loop of challenge and solution will inevitably drive technological progress in this domain, offering us tools that are not just efficient but also remarkably intelligent.

Oxylabs has been at the forefront of utilizing ML for web scraping. Can you highlight a specific instance where the integration of machine learning significantly transformed or improved a product pipeline?

Certainly, Oxylabs has carved a unique niche for itself in the realm of web scraping, especially with its pioneering integration of machine learning. This blend of data extraction and intelligent algorithms has paved the way for a series of innovations, ensuring that Oxylabs remains a leader in its domain. While there are several shining examples of this symbiosis in action, one instance stands out as a testament to the transformative power of machine learning in enhancing web scraping capabilities.

Let me delve deeper into our internally developed ML model, which we named the "Block Detection tool." Before its inception, our web scraping process, though efficient, had certain pitfalls. One of the recurrent challenges was the ambiguity in understanding website responses. 

At times, even when a scraping request appeared successful on the surface, it would contain subtle indications of failure – typically a discreet message insinuating that the requester had been identified as a robot. This subtle "robot" tag was a minor detail, but it carried significant implications. When such data was relayed to our clients, it inadvertently conveyed misinformation, potentially impacting their decision-making processes based on that data.

The introduction of the Block Detection tool marked a turning point in addressing this challenge. Instead of relying solely on traditional metrics to determine the success or failure of a request, this model delved into the nuances of the returned content. It was trained to identify even the most subtle hints of blocking, like the aforementioned "robot" tags, which would otherwise escape a regular detection system. By recognizing these concealed messages, the tool could accurately pinpoint when a scraping attempt had been thwarted.
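To make the idea concrete, here is a minimal sketch of block detection framed as text classification over response bodies. The library choice, training examples, and model below are illustrative assumptions, not the actual Block Detection implementation, which is not public.

```python
# A minimal sketch of block detection framed as text classification.
# The training examples, features, and model here are illustrative
# assumptions, not the actual Block Detection implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled response bodies: blocked (1) vs. genuine (0).
responses = [
    "Our systems have detected unusual traffic. Are you a robot?",
    "Please verify you are a human to continue.",
    "<h1>Acme Widget 3000</h1><p>Price: $19.99</p>",
    "<h1>Search results</h1><ul><li>Item A</li><li>Item B</li></ul>",
]
labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(responses, labels)

# A 200 OK response can still be a soft block, so the classifier
# inspects the body instead of trusting the status code.
body = "Unusual activity detected. Confirm you are not a robot."
print(model.predict([body])[0])  # 1 -> blocked
```

The key design point is the one Andrius describes: the decision is made on the content of the response, not on traditional success metrics such as the HTTP status code.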

The transformative effect of integrating this machine learning model into our pipeline was enormous. The accuracy of our scraping results surged, leading to a substantial reduction in cases where clients received misleading data. By ensuring that the results provided to our clients were devoid of such pitfalls, we not only elevated the quality of our service but also increased the trust our clients placed in us. 

The Block Detection tool epitomizes the quintessence of what machine learning can achieve when adeptly integrated into web scraping – precision, reliability, and enhanced user satisfaction.

Can you elaborate on how adaptive parsing works at Oxylabs and the benefits it brings to web scraping operations?

Adaptive parsing, as employed by Oxylabs, is an intricate system rooted in machine learning. It operates as a classification-type model that meticulously sifts through the myriad of elements present on an HTML page, primarily those of e-commerce product pages. Most of these elements turn out to be irrelevant; the ones that truly matter, like the price, title, and description, are pinpointed and then used to create structured, parsed data.
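A heavily simplified sketch of that idea follows: extract features from each HTML element, then hand them to a classifier. The features, labels, and HTML below are assumptions for illustration only; the production model behind adaptive parsing is not public.

```python
# An illustrative sketch of parsing as element classification.
from bs4 import BeautifulSoup

LABELS = ["title", "price", "description", "irrelevant"]

def element_features(el):
    """Turn one HTML element into a simple feature dictionary."""
    text = el.get_text(strip=True)
    return {
        "tag": el.name,
        "classes": " ".join(el.get("class", [])),
        "text_length": len(text),
        "has_currency": any(c in text for c in "$€£"),
        "digit_ratio": sum(ch.isdigit() for ch in text) / max(len(text), 1),
    }

html = (
    "<h1 class='product-title'>Acme Widget 3000</h1>"
    "<span class='price'>$19.99</span>"
)
soup = BeautifulSoup(html, "html.parser")
features = [element_features(el) for el in soup.find_all(True)]
print(features)

# Each feature dictionary would then be fed to a trained classifier
# that assigns one of LABELS; elements predicted "irrelevant" are
# discarded, and the rest become the structured output.
```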

The most pronounced advantage we have witnessed is the efficiency it introduces to our software development processes. Prior to this, whenever there was a change in the layout of a website, our developers were confronted with the time-consuming task of modifying parsing templates to adjust to these changes. But with adaptive parsing, that constant template rework has become a thing of the past. This not only streamlines our workflow but also has ripple effects on our clients' operations. They find value in the fact that they no longer have to grapple with the intricacies of parsing on their end, leading to significant savings in both time and financial resources. This positions adaptive parsing as an invaluable tool in modern web scraping operations.

What potential applications of ML in web scraping excite you the most, especially with the rise of LLMs?

The emergence of LLMs in the AI landscape has genuinely piqued my interest, especially when it comes to their integration with web scraping. When we dive deep into their capabilities, these models, given their intrinsic ability to grasp and decipher the context of textual content, offer a myriad of possibilities that could revolutionize the web scraping domain.

Take parsing as an example. Traditional methods heavily rely on classification models, which, although effective, come with their own set of challenges. The foremost among these is the preliminary requirement of amassing a vast volume of data and subsequently labeling it meticulously. 

Anyone familiar with this process would concur that it's a laborious task that demands considerable time, resources, and expertise. However, with LLMs stepping into the picture, the dynamics change dramatically. Given their innate comprehension skills, these models can potentially bypass the exhaustive data gathering and labeling phase, thereby accelerating the parsing process while ensuring precision.
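As a rough illustration of this zero-shot approach, the sketch below asks a general-purpose LLM to parse a product snippet with a plain-language prompt instead of a trained classifier. The model name, prompt wording, and client usage are assumptions for demonstration, not a specific Oxylabs pipeline.

```python
# A sketch of zero-shot parsing with an LLM: no labeled dataset is
# collected beforehand. Model name and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def parse_product(html: str) -> dict:
    prompt = (
        "Extract the product title, price, and description from this "
        "HTML. Respond with a JSON object with the keys "
        "title, price, and description.\n\n" + html
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(parse_product("<h1>Acme Widget 3000</h1><span>$19.99</span>"))
```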

Furthermore, the potential applications extend beyond just parsing. Envision a scenario where LLMs assist in the contextual understanding of the scraped content, enabling not just data extraction but also data interpretation in real time. This could particularly benefit sectors like market analysis, sentiment analysis, and trend forecasting, where contextual understanding is paramount.

Product matching across different e-commerce websites is a promising yet difficult application. Could you explain how it works and how ML can help?

The realm of e-commerce is characterized by its vastness and diversity, where a multitude of platforms each offer a plethora of products. Given this complexity, product matching across different websites becomes both a challenge and an opportunity. 

Product matching is the process of identifying and linking identical or closely related products listed on different e-commerce websites. The aim is to ensure consistency, streamline comparisons, and improve the overall shopping experience for the consumer. For businesses, it offers insights into competitive pricing, product assortment, and other market dynamics.

One of the traditional methods involves leveraging classification models in machine learning. This entails an extensive phase of data gathering, preparation, and labeling. You would essentially be training the model to recognize and classify products based on their features and descriptions. While robust, the approach is time-consuming and demands rigorous maintenance as new products are continually introduced on e-commerce platforms.

Venturing into a less labor-intensive realm, we find pre-trained models like Word2Vec or Sentence2Vec. These models have the inherent ability to convert textual descriptions into mathematical vectors. Once the text, which in this case would be product descriptions, is transformed into vectors, one can employ mathematical formulas to gauge the similarity between two vector representations. By comparing these vectors, we can discern if two product descriptions from different sites are indicative of the same product. The advantage here is the reduction in the need for extensive data preparation and the ability to leverage existing models.
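In code, this approach can be as compact as the sketch below, which uses the sentence-transformers library to embed two descriptions and compare them. The embedding model and the 0.8 threshold are illustrative assumptions; in practice the threshold would be tuned empirically.

```python
# A minimal sketch of vector-based product matching.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

desc_a = "Apple iPhone 14 Pro, 128 GB, Deep Purple, unlocked"
desc_b = "iPhone 14 Pro 128GB (Deep Purple) - SIM-free smartphone"

# Encode both descriptions into vectors and compare them with
# cosine similarity, which ranges from -1 to 1.
emb_a, emb_b = model.encode([desc_a, desc_b], convert_to_tensor=True)
score = util.cos_sim(emb_a, emb_b).item()

# Pairs above an empirically chosen threshold are treated as the
# same product.
print(f"similarity={score:.3f}", "match" if score > 0.8 else "no match")
```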

Lastly, there is an approach that melds simplicity with advanced technology. One could utilize models like ChatGPT or other large language models to assess product similarity. By feeding the textual information of two products into the model, you can pose a straightforward query: "Are these descriptions referring to the same product?" The model, leveraging its vast training data and understanding of language nuances, can then provide a verdict.
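Sketched in code, that amounts to little more than the query itself; the model name and prompt wording below are assumptions for illustration.

```python
# A sketch of delegating the match decision to an LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def same_product(desc_a: str, desc_b: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Are these two descriptions referring to the same "
                f"product? Answer YES or NO.\n\nA: {desc_a}\nB: {desc_b}"
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

print(same_product(
    "Apple iPhone 14 Pro, 128 GB, Deep Purple",
    "iPhone 14 Pro 128GB Deep Purple smartphone",
))
```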

Given your vast experience and continuous interest in the machine learning field, how do you stay updated with the rapidly advancing technologies and ensure Oxylabs remains at the cutting edge of leveraging ML for web scraping?

In the fast-paced world of machine learning, staying updated is both a challenge and a necessity. On platforms like LinkedIn, I have curated a list of pages and thought leaders that frequently share cutting-edge advancements in ML. Their insights often spark ideas and provide a pulse on the current trends.

Post-lunch, with a cup of coffee and some sweet snacks in hand, I enjoy exploring huggingface.co. It's a remarkable platform where enthusiasts and experts share their latest ML models and tools. This daily ritual not only feeds my curiosity but ensures that I'm exposed to practical and novel solutions that might benefit Oxylabs.

Apart from these, I consistently participate in online courses, webinars, and workshops. It's a blend of structured learning and real-world applications. Collaborating with industry peers and researchers also gives Oxylabs an edge, ensuring we're at the forefront of ML-powered web scraping.

About the author

Adomas Sulcas

Former PR Team Lead

Adomas Sulcas was a PR Team Lead at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
