From Digital Marketing to Data Extraction: Navigating the Web Scraping Landscape

Adomas Sulcas

Last updated on 2023-09-15


In our quest to delve into the depths of modern web scraping methodologies, we recently sat down with Alexander Lebedev, a proficient Software Engineer at Hotjar. Our dialogue centered around his transition from digital marketing to web scraping, contributions to open-source projects, insights on efficient data extraction, and the evolving trends in the data domain.

With eight years in software development, Alexander has dedicated six of them to mastering web scraping techniques. His passion for data doesn't stop there; he's also an active contributor to open-source data extraction projects, an endeavor that earned him a notable GitHub badge.

Alexander has shared his thoughts at OxyCon, a premier web scraping conference, where he drew on his experience in a talk about accelerating data-on-demand services with async Python and AWS.

What drew you towards specializing in web scraping, and how did you develop your expertise?

Over the years, I've been deeply involved in the digital marketing domain, dedicating seven years to understanding its intricacies. During this tenure, my curiosity often gravitated toward the potential of automation in the field. The appeal of efficiently crawling through Google Search results was undeniable, and there was a unique challenge in parsing data from Chinese e-commerce platforms, which I found particularly intriguing.

As I delved more into these tasks, I realized the depth and dynamism they offered. The whole process was not just about fetching data or automating a task; it was a blend of problem-solving, innovation, and coding that consistently provided a rush of excitement.

This combination of challenges and the joy of solving them was so enthralling that, eventually, it prompted me to make a pivotal career switch. I took a leap from being primarily focused on digital marketing to immersing myself in the world of programming, dedicating my time and energy to web scraping. The journey was both challenging and fulfilling, and it played a pivotal role in honing my expertise in this domain.

Could you elaborate on your contribution to open-source data extraction that led to your GitHub badge?

While I was at ScrapingHub, now called Zyte, my role involved a significant amount of work with open-source projects. It wasn't just limited to using these projects – I also played an active part in improving them. I spent considerable time working on existing libraries, identifying areas that needed updates or improvements. This involved both debugging current features and sometimes adding new functionalities to make the tools more efficient for users.

In addition to refining current libraries, I also took on the responsibility of developing new ones. The objective was to address gaps or specific needs within the web scraping community that hadn't been met by the existing tools.

A significant portion of my contributions was centered around Scrapy and its associated core libraries. My involvement in this project was extensive, and over time, the enhancements and new pieces of code I introduced became integral to its framework.

Recognizing the value and importance of these contributions, the decision was made to move this code to the Arctic Code Vault. It’s a repository meant to safeguard significant and valuable code for future generations. Due to my involvement in this project and the subsequent archiving of the code, I was recognized with a GitHub badge.

You're set to discuss creating data-on-demand web services. Can you give us a preview of what the attendees should expect?

When attendees join the session on creating data-on-demand web services, they can expect an in-depth dive into the process, crafted from my accumulated experience in this field. My intention is to guide participants through a structured plan, illuminating the roadmap to develop a robust and swift data-on-demand service.

Starting with the very foundation, I'll delve into the details of choosing the right servers. This involves understanding the pros and cons of different server types, ensuring that the infrastructure can handle the anticipated load, and scaling requirements. A well-chosen server can make all the difference in speed and reliability.

From there, we'll transition into the architecture of the service. Architecture is the backbone of any web service, and it's vital to get it right. We'll go over the best practices, the importance of modular design, and how to structure the system for both efficiency and scalability.

Libraries are another focal point of our discussion. The right library can simplify complex tasks, enhance speed, and provide functionalities that can save hours of coding. I'll share my insights on which libraries have proven most valuable in my experience and how they can be integrated seamlessly.

An essential aspect of data-on-demand services is batching. We'll explore the nuances of effective batching strategies, understanding how to group data requests in a manner that optimizes speed without compromising data integrity.
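As an editorial illustration (not taken from Alexander's talk), one simple reading of batching is to split a stream of requested items into fixed-size groups so that each group can be processed together. The sketch below assumes a hypothetical helper name, `make_batches`, and an arbitrary batch size; every item appears exactly once and in its original order, which is the minimum needed to keep the data intact.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def make_batches(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls up to `size` items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


if __name__ == "__main__":
    # Placeholder URLs, used only to show the grouping.
    urls = [f"https://example.com/item/{i}" for i in range(7)]
    for batch in make_batches(urls, 3):
        print(len(batch), batch[0])
```

The right batch size depends on the target and the workload; a real service would likely tune it rather than hard-code it.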

Another topic on our agenda is limiting. This is crucial to prevent system overloads and ensure that the service remains responsive and agile. We'll discuss strategies to set effective limits, considering both the system's capabilities and user needs.
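One common way to apply such a limit in async Python is a semaphore that caps how many requests are in flight at once. The following is a minimal sketch under stated assumptions, not the approach from the talk: `limited_fetch`, `crawl`, and the limit of 5 are illustrative names and values, and `asyncio.sleep` stands in for a real HTTP request.

```python
import asyncio


async def limited_fetch(url: str, semaphore: asyncio.Semaphore, stats: dict) -> str:
    # Only `max_concurrency` coroutines can be inside this block at once.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for a real network call
        stats["active"] -= 1
        return url


async def crawl(urls: list[str], max_concurrency: int = 5) -> tuple[list[str], int]:
    semaphore = asyncio.Semaphore(max_concurrency)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(limited_fetch(url, semaphore, stats) for url in urls)
    )
    # Return the peak so callers can check that the limit was respected.
    return list(results), stats["peak"]


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(20)]
    results, peak = asyncio.run(crawl(urls))
    print(f"fetched {len(results)} pages, peak concurrency {peak}")
```

A semaphore limits concurrency rather than request rate; rate limits are usually handled separately, for example with a token bucket.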

Throughout the session, I'll punctuate each topic with finer details, emphasizing those minor tweaks and adjustments that can significantly elevate the performance of a data-on-demand service. These are the sort of insights that can often make the difference between a service that's 'good' and one that's 'exceptional.'

Your presentation emphasizes the use of async Python libraries for extracting data as quickly as possible. Why is this important, and what are the benefits?

Async is crucial for data extraction because it removes most of the idle waiting. Without async, each request just waits for a second or two until the server responds. With async, you'll be able to process ten to twenty times more requests simultaneously.

Using async Python libraries for data extraction is good because it makes things faster and lets us do multiple tasks at once. It's not just about speed but changing the way we think about pulling data.
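To make the idea concrete, here is a small editorial sketch of overlapping waits with `asyncio`. It is not code from the presentation: `fetch` is a hypothetical stand-in that simulates server latency with `asyncio.sleep` instead of making a real request, and the URLs are placeholders.

```python
import asyncio
import time


async def fetch(url: str, delay: float = 0.1) -> str:
    # Simulated latency; a real crawler would await an HTTP client here.
    await asyncio.sleep(delay)
    return f"response from {url}"


async def fetch_all(urls: list[str]) -> list[str]:
    # gather starts every coroutine up front, so the waits overlap
    # instead of adding up one after another.
    return list(await asyncio.gather(*(fetch(url) for url in urls)))


def run_demo(n: int = 20) -> tuple[list[str], float]:
    urls = [f"https://example.com/page/{i}" for i in range(n)]
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(f"{len(results)} responses in {elapsed:.2f}s")
```

With 20 simulated requests of 0.1 seconds each, a sequential loop would need about 2 seconds, while the concurrent version finishes in roughly the time of a single request. Actual gains depend on the target server and network.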

Can you explain how the technique of stable crawling with limits and token buckets contributes to efficient data extraction?

In the world of data extraction, a technique doesn't gain prominence without reason. The strategy of stable crawling, fortified by the principles of limits and token buckets, is a testament to this.

At its core, limiting is an embodiment of responsibility and foresight. Every server, be it a robust API or a diminutive website on a small server, has its threshold. Push it beyond that limit, and you risk overwhelming it. This isn't just about momentarily stalling a website or API; it can potentially lead to more prolonged downtimes or even cause irreversible damage. Sending too many requests in rapid succession can cripple a server.

Moreover, there's an ethical dimension to consider. The digital landscape, vast as it is, thrives on mutual respect and etiquette. Sending excessive requests isn't just technically unwise; it's also ethically questionable. When you access a site or an API, there's an implicit understanding that you'll respect its boundaries. This is where the importance of ethical crawling comes into the picture. Ethical crawling isn't merely a best practice – it's a commitment to sustaining the digital ecosystem and ensuring that your actions don't inadvertently harm others.

Token buckets serve as a regulated mechanism to control the rate at which requests are sent. Think of it as a reservoir that refills at a set rate. Every outgoing request consumes a token from this reservoir. When the bucket is empty, requests are momentarily halted until more tokens are available. This system acts as a buffer, ensuring that there's a steady, sustainable flow of requests, which neither overburdens the source nor lets valuable crawl time go to waste.
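The mechanism described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the one used in Alexander's service: the class name `TokenBucket`, its parameters `rate` and `capacity`, and the injectable `clock` (added here to make the behavior easy to test) are all hypothetical.

```python
import asyncio
import time


class TokenBucket:
    """Refills at `rate` tokens per second and holds at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._updated = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last update, up to capacity.
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the bucket is empty."""
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

    async def acquire(self) -> None:
        """Wait until a token is available, then consume it."""
        while not self.try_acquire():
            # Sleep roughly as long as it takes for the next token to arrive.
            await asyncio.sleep((1 - self._tokens) / self.rate)


async def main() -> None:
    bucket = TokenBucket(rate=5, capacity=5)
    for i in range(8):
        await bucket.acquire()
        print(f"request {i} sent")  # placeholder for a real request


if __name__ == "__main__":
    asyncio.run(main())
```

In this sketch the first five requests go out immediately because the bucket starts full, and the remaining ones are paced at about five per second. The capacity controls how large a burst is tolerated, while the rate sets the sustained pace.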

In constructing a data-on-demand product, this approach is not just recommended, it's indispensable. If the goal is to maintain consistent, long-term data extraction without disruptions or ethical missteps, then stable crawling using these techniques is the key.

You have mentioned the use of AWS for scaling on-demand extraction. Could you give us an insight into how it works and why you chose it?

Amazon Web Services (AWS) stands out in the cloud computing realm primarily due to its comprehensive educational materials, especially relevant for data-on-demand extraction. While Google Cloud Platform (GCP) and Azure are also formidable choices in the cloud industry, AWS's extensive documentation and community support simplified the learning and implementation process for this specific use case.

What truly amplifies AWS's effectiveness is its seamless integration with Terraform, a tool for infrastructure management. This combination not only ensures compatibility but elevates ease of operations. When you're looking to scale, especially in on-demand extraction, it's pivotal to have precision, predictability, and control. AWS, coupled with Terraform, offers this by making infrastructure provisioning and scaling more strategic and responsive.

In your view, what are the top challenges in accelerating data-on-demand services, and how do you overcome them?

One of the foremost challenges faced in data-on-demand services is dealing with antibots. These are systems implemented by websites to detect and block automated crawlers, hindering smooth data extraction. It becomes a game of cat and mouse, with data extractors continually devising methods to bypass these barriers while still respecting ethical guidelines.

Next on the list is the page size. With the rise of rich media and interactive web content, web pages have become heavier. A larger page size implies more data to download and process, which can naturally impact the speed of data extraction. Navigating this requires optimizing the extraction process to prioritize essential data and minimize the unnecessary load.

Lastly, badly optimized code stands as a silent yet significant impediment. Even with the best infrastructure and strategies in place, inefficient code can act as a bottleneck, slowing down the entire extraction process. Addressing this involves regular code reviews, refactoring, and implementing best coding practices to ensure streamlined operations.

While challenges abound in the realm of data-on-demand services, with a proactive approach and a keen understanding of these impediments, one can devise strategies to navigate and overcome them. I’ll cover all these challenges and their solutions in my upcoming talk.

As a dedicated data enthusiast, what trends or advancements in the data world are you most excited about, and how do you envision their impact on web scraping?

Language models are transformative, especially when considering their potential applications in web scraping. Imagine bypassing the traditional intricacies of data extraction coding. Instead, one could simply provide these models with a data sample, and they could intelligently navigate and extract data from analogous sources.

However, with innovations come challenges. Language models, in their current state, can be unpredictable. They might deviate from strict data formats, sometimes even serving up misleading or inconsistent data. While their prowess might shine in relatively flexible domains like blogs, their application in more structured and critical sectors like e-commerce and SaaS is still a matter of debate.
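One hedged way to guard against that unpredictability is to treat model output as untrusted text and validate it before use. The sketch below is an editorial assumption, not a technique described in the interview: the schema in `REQUIRED_FIELDS` and the function `parse_record` are hypothetical, and no particular model or API is implied.

```python
import json

# Hypothetical schema for an extracted product record.
REQUIRED_FIELDS = {"title": str, "price": (int, float)}


def parse_record(raw: str) -> dict:
    """Parse model output and reject anything that does not match the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("output is not a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} is missing or has the wrong type")
    return data


if __name__ == "__main__":
    print(parse_record('{"title": "Mug", "price": 4.5}'))
```

A check like this catches format drift but not plausible-looking wrong values, so it only addresses part of the concern raised above.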

About the author

Adomas Sulcas

Former PR Team Lead

Adomas Sulcas was a PR Team Lead at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.

