How to Continuously Yield High Quality Data | Interview with Glen De Cauwsemaecker

Glen De Cauwsemaecker

2022-08-25 · 10 min read

As part of the combined efforts of Oxylabs and OTA Insight to shed light on the developments in the web scraping industry, we sat down with Glen De Cauwsemaecker, the Lead Crawler Engineer at OTA Insight, to talk about the challenges of scaling data acquisition solutions.

Glen has been working for a decade as a developer and leader in various industries. For the last three years, he has worked at OTA Insight, where he has been driving the technical roadmap and innovation with a team of four other full-time engineers. Besides researching, developing, and maintaining crawlers, he has also created and maintained projects related to proxy and networking services, backend services, libraries, and much more.

Glen will be presenting his thoughts at OxyCon, a free-to-attend web scraping conference held on September 7-8, 2022, in a presentation called “How to Continuously Yield High Quality Data While Growing From 100 to 100M Daily Requests”.

Glen, your topic for OxyCon is called “How to Continuously Yield High Quality Data While Growing From 100 to 100M Daily Requests”. Could you share what made you choose this topic and why it is important to you?

I chose the topic because at OTA Insight our top priority is data accuracy. It’s not enough to extract as much data as we need; it also has to be correct. In fact, any data present has to be accurate, or it shouldn’t be there at all. False data might cause hotels [the primary clients of OTA Insight] to make assumptions that lead to faulty conclusions.

So, I think the topic strikes a balance of allowing us to share our experience, while keeping any proprietary knowledge confidential. Web scraping, as a field, still has a lot of secrecy, but there are many common patterns that we can and want to share with the community to learn from one another.

Over the years we grew in scale, but our team remained small, so we had to overcome a lot of challenges to maintain our performance and stay true to our main goal of keeping data accurate. We managed to balance a small team against an increasing number of challenges while also working on aspects such as speed and fault tolerance.

In summary, out of all the web scraping topics, scaling is the one where we can share all the lessons we’ve learned, which will be applicable to nearly all companies. We had to learn a lot of things the hard way, but I hope that other people won’t have to go through the same difficulties we experienced.

You mentioned that “no data is better than false data”. What processes do you use to verify the accuracy of your data extraction?

I will cover this in detail during my presentation, but in short: throughout the entire pipeline, we try to keep the crawling process as simple as possible. You still have to make some assumptions, however, and those can cause issues.

For this reason, we split our pipeline into different components right from the start. Data extraction is one of these components. Inputs are required to know what to extract, so generating these inputs is another one. The third component is where we transform the extracted data into a standardized format. The final step concerns the consumption of the cleaned and standardized data, which is handled by other departments within our company.

As an example, think of hotel rates for certain dates in a certain location. For all of our roughly 300 data sources, the data arrives in the same format once the transformation phase in the third step is finished. Our data scientists and backend specialists don’t have to worry about where the data came from, as it is all standardized.
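
To make that concrete, here is a rough sketch of what one such standardized record could look like. This is an illustration only; the field names and types below are assumptions, not OTA Insight’s actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class StandardizedRate:
    """One normalized rate observation, identical in shape for every source."""
    source: str        # which of the ~300 data sources produced it
    hotel_id: str      # internal identifier, not the source's own ID
    check_in: date
    check_out: date
    room_type: str
    currency: str      # ISO 4217 code, e.g. "EUR"
    amount: float      # total price for the stay

# A record from any source ends up in this same shape after transformation.
rate = StandardizedRate(
    source="example-ota",
    hotel_id="hotel-123",
    check_in=date(2022, 9, 7),
    check_out=date(2022, 9, 8),
    room_type="double",
    currency="EUR",
    amount=129.0,
)
```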

It is in the transformation phase that we validate our assumptions. While we can be pretty certain about the assumptions we make, we want to future-proof them. Secondly, because we have all this historic data we can also proactively correct any false assumptions using anomaly detection. We’ll go more in depth on this topic in the presentation.

Finally, since the extraction and transformation aspects are separate, we can reprocess the extracted data. Even if we made a false assumption, as long as we have the necessary context, we can transform the same data once again to fix any inaccuracies that might have appeared.
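
A minimal sketch of that separation, assuming raw payloads are kept on disk; the function names, directory layout, and field names are hypothetical. Extracted data is stored untouched, and the transformation step can be re-run over it whenever an assumption turns out to be wrong.

```python
import json
from pathlib import Path

RAW_DIR = Path("raw")             # extracted payloads, stored as-is
CLEAN_DIR = Path("standardized")  # transformed, standardized records

def store_raw(source: str, request_id: str, payload: dict) -> Path:
    """Persist the extracted payload untouched so it can be re-transformed later."""
    path = RAW_DIR / source / f"{request_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    return path

def transform(source: str, payload: dict) -> dict:
    """Map a raw payload onto the standardized format.

    The fields here are purely illustrative; real transformations encode
    the assumptions made about each source.
    """
    return {
        "source": source,
        "currency": payload.get("currency", "EUR"),
        "amount": float(payload["price"]),
    }

def reprocess(source: str) -> None:
    """Re-run the transformation over already-extracted data, e.g. after a
    false assumption has been corrected, without sending a single new request."""
    for raw_file in (RAW_DIR / source).glob("*.json"):
        record = transform(source, json.loads(raw_file.read_text()))
        out = CLEAN_DIR / source / raw_file.name
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(record))
```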

What are the key challenge areas when facing growth like OTA Insight’s? Where should companies focus their attention first?

I think what I am about to say applies to many contexts, but first of all, you have to make sure that your people aren’t spread too thin. If they have to constantly context switch from one domain to another, they will never get anything done, and will only be reactive instead of proactive. Specialization is key here if you have the resources to do so.

If we think about how we did things in the past, we never had a dedicated crawler team. As a result, we couldn’t resolve in advance the problems we would most likely hit sooner or later; the best we could do was patch and improve our infrastructure in a slow, incremental manner.

While you’re small, you can get away with operating this way, as there is lots of low-hanging fruit. As you grow, the challenges will only get more difficult, and the customers you want to serve will want much more complicated targets. So you want people who can focus on specific tasks, network internals being one example.

Additionally, you want to keep work manageable. When we started OTA Insight, a lot of things were done manually. While we already operated from within the cloud (although, it was more primitive than the technology available today), deployments, scaling, retries of failed workloads, and many other things were performed manually.

So, an important keyword is automation, as it helps keep work manageable. When you start scaling, it’s extremely easy to keep adding targets, but that’s not where it ends. From then on, you will have to maintain each target for the foreseeable future. This has to be taken into account when calculating the work that can be done with the manpower available to you.

There are also quiet and active periods for website and application updates. Companies often start updating their software during the same periods, such as New Year or summer, so in those moments you have to be able to dedicate even more resources to maintenance.

In summary, I think automation and dedication to the role are the two most important aspects when scaling, with alerting being a close third. You should be able to see each piece of the pipeline and its status in order to understand what you’re scraping, whether it’s working, whether everything is running as expected, and so on.

So, there’s the human side and the technical side of the scraping equation. What would be your recommendations when scaling the technical side?

Nowadays, there are a lot of tools and solutions such as Kubernetes (k8s) and Docker that make it easier to manage the technical side of crawling. There’s even lots of tooling for specific cloud platforms, making horizontal scaling, for example, much simpler.

These platforms also allow users to scale based on resource needs. Say you have a certain service, and you request a certain amount of CPU power and memory dedicated to it. What this allows you to do is automatically add extra pods (instances) of that exact same process to take over some of the work once you reach a certain threshold of resource consumption. Automation, again, is an important keyword.
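
As a concrete illustration of this kind of threshold-based scaling, the sketch below creates a Kubernetes Horizontal Pod Autoscaler with the official Python client. The deployment name, namespace, and thresholds are made-up values, not a description of OTA Insight’s setup.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

# Keep between 2 and 20 pods of a hypothetical "crawler-worker" Deployment,
# adding pods once average CPU utilization crosses 70% of what was requested.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="crawler-worker-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="crawler-worker"
        ),
        min_replicas=2,
        max_replicas=20,
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="scraping", body=hpa
)
```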

But you have to understand the problem you’re solving before throwing resources at it. We first try to understand the problem by applying a framework. Over the years we’ve arrived at a framework we call P0-P1-P2.

At P0, all we’re trying to do is understand the question. At P1, we iterate on the problem on a purely theoretical basis, evaluating options and solutions. Only at P2 do we start solving the problem. By taking our time, stepping back, iterating, and even dropping potential avenues, we save time in the long run.

On cloud platforms, it’s extremely easy to keep throwing in more resources. It may cost you money, but the possibility is always there. But we focus on solving the problem first to understand exactly how much we will need. In fact, throwing more resources at it is often not what you really want at all.

As scaling the technical side has become so easy, you have to know when not to scale. Sometimes there might be networking issues, for example, which can cause requests to time out faster, spamming the target, and causing damage to both your services and theirs.

So, you have to be careful with scaling. On our end, we have triggers that automatically scale down when we start receiving too many errors of a certain kind, until the root cause has been resolved. Scaling down is important because, first of all, the data you are receiving might be false and, secondly, since you’re timing out faster, the requests per second might climb even higher, causing even more issues.
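
A minimal sketch of what such a trigger might look like, with invented names and thresholds: keep a sliding window of recent request outcomes per target and scale down (or pause) once one error class dominates.

```python
from collections import deque
from typing import Optional

class ErrorRateGuard:
    """Signal a scale-down when too many recent requests fail with the same
    error class (e.g. timeouts), instead of retrying against the target harder."""

    def __init__(self, window_size: int = 200, max_error_rate: float = 0.3):
        self.window = deque(maxlen=window_size)
        self.max_error_rate = max_error_rate

    def record(self, error_class: Optional[str]) -> None:
        """Call with None on success, or an error class such as 'timeout'."""
        self.window.append(error_class)

    def should_scale_down(self, error_class: str) -> bool:
        if len(self.window) < self.window.maxlen:
            return False  # not enough observations yet
        rate = sum(e == error_class for e in self.window) / len(self.window)
        return rate > self.max_error_rate

guard = ErrorRateGuard()
# After each request, record the outcome; if one error class dominates,
# reduce replicas or pause the target until the root cause is resolved.
guard.record("timeout")
if guard.should_scale_down("timeout"):
    pass  # hypothetical hook: scale down / pause this target
```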

In the end, the true technical, hardware side is much less of an issue nowadays. Cloud platforms make scaling and acquiring resources easier. It’s more of a matter of understanding the software and technologies and focusing on the business problems.

An important aspect of today’s environment is having a DevOps team. At OTA Insight this is a small but amazing team. They focus on enabling all other technical teams, from infrastructure management to education.

So, would you say that hardware is no longer an issue?

Indeed, you don’t have to reinvent the wheel. In fact, these resources are now cheap and available; the building blocks are all there. Today it’s more a matter of understanding how to properly use the variety of technologies and combinations available to you and your team.

They say that “with great power comes great responsibility” and that’s especially true in web scraping. We only scrape and store the data and properties we need. Having storage allows us to recover from mistakes without having to resend requests to target servers.

In the end, as long as you have the financial resources, you can get anything from the cloud. Economically it might not always make sense, but the possibility is there.

Is there anything that should already be considered in the early stages (in terms of infrastructure and its management), when the number of daily requests is still small?

It’s a difficult question because challenges change with scale. At a small scale, it’s hard for people to focus on specific aspects of the pipeline. It would still be good to implement ownership, but you’ll always have some context switching.

What you should do is focus on the problems at hand instead of focusing on what may come. There will be certain challenges you constantly face at any scale, so tooling and automation are often areas full of opportunities that you might want to focus on.

Additionally, I think, from the beginning you should be heavily focused on education about tools, software, and possibilities. Test them out and document everything, even if you don’t end up using them.

Don’t try to reinvent the wheel if you do not have to. For example, there’s no reason to do on-premises server hosting as it will cost you much more to hire employees and manage everything than simply acquiring something from the cloud. Once you start heading in a certain direction, it might be difficult later to switch to another one.

In summary, you will have different problems at different stages. For example, optimization problems are not an issue at a small scale, so you shouldn’t be focusing on them. Business objectives and the accuracy of your data should always be your focus instead.

So, you have likely seen numerous mistakes other companies have made when scaling. What are the common mistakes made by companies moving to larger scales?

A common example is the customer satisfaction score, which tends to get worse as companies get bigger. Ours, however, has remained high throughout all the years we have been in the industry. As we grew, our customer satisfaction score even improved.

The reason for that, I think, is our goal of data accuracy. What most companies do is lower the bar when they start scaling. They do it in three areas: they might start being okay with slower customer support, less accurate data, or longer service downtimes.

We focus on maintaining all three. We try to resolve all support tickets within 24 hours. Even if that is not possible, we are already in contact with the customer and will work towards a solution. Doing this alone already sets you apart from other companies.

Another issue that companies face when they grow is that teams become more isolated and, as a result, forget why they do the things they do. In our company, everybody is product-minded; everyone is aware of the different initiatives and why their work is important. We don’t hand out isolated tasks that have no relation to the big picture.

All teams are aware of the bigger picture and are able to help each other. Teams have to be able to focus on goals, but not become isolated, which I think is a great challenge when companies approach enterprise status.

Finally, teams have to be given creative freedom to solve problems. When companies start growing, they often end up in a transitory phase where they act like enterprises but get none of the benefits. They only have the bureaucracy and management layers that block teams from working effectively.

Teams have to have ownership over their products and projects and be able to own up to victories and mistakes alike. But in some enterprises, you only get handed orders and blamed for mistakes, which isn’t conducive to proper ownership.

Moving on to our final question: any other tips related to maintaining high-quality data?

In short, you want to maintain the present, learn from the past, and prepare for the future. For our team, that means always maintaining and managing our current data sources. Once a customer relies on a source, they will be dissatisfied if it no longer works or provides faulty data. They might be okay with a source not being available yet, but they will never be satisfied if an existing one stops working.

Learning from the past comes through postmortems. You want to learn from your mistakes, even minor ones, as any incident is an opportunity for a lesson, especially if issues keep coming back in different forms. Combining all these things helps you build for the future and become proactive instead of reactive.

Additionally, approach changes carefully. Even if you know there might be a better way to develop your pipeline, park the idea until you truly have the time and manpower to implement and maintain it. Everything you do yourself, you’ll have to maintain, which will drain resources.

Own only things you can and have to own.

Invest in alerting. It’s not enough to know that your resources are being used correctly, that the requests are okay, etc. You also have to know if the data is accurate and if potential issues are coming up. That way you can solve things before they start impacting your services. There’s a lot of subtlety in alerting, but it is a critical part of web scraping.
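
As a small sketch of what a data-level alert could look like (the thresholds and the alert hook are assumptions, not the actual OTA Insight implementation): compare a freshly scraped value against its recent history and hold it back if it deviates too far.

```python
import statistics

def looks_anomalous(history: list[float], new_value: float,
                    max_z_score: float = 4.0) -> bool:
    """Return True if new_value deviates suspiciously from recent history."""
    if len(history) < 10:
        return False  # too little history to judge
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return new_value != mean
    return abs(new_value - mean) / stdev > max_z_score

# Example: the last two weeks of observed rates for one hotel and stay date,
# followed by a new scrape that is clearly off.
history = [120.0, 118.0, 122.0, 119.0, 121.0, 120.0, 117.0,
           123.0, 120.0, 119.0, 122.0, 121.0, 118.0, 120.0]
if looks_anomalous(history, new_value=12.0):
    print("ALERT: rate looks wrong; hold it back and investigate")
```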

Finally, be fault tolerant. Things will go wrong, and when they do, you want the scope to be manageable. Make sure your pipeline is separated into several parts so that you can solve issues while they are still relatively minor. Think of it this way: a pan on fire in your kitchen is better than the entire kitchen being on fire.

This interview was conducted by Adomas Šulcas, Senior PR Manager at Oxylabs.

About the author

Glen De Cauwsemaecker

Lead Crawler Engineer

Glen De Cauwsemaecker is the Lead Crawler Engineer at OTA Insight (around 350 employees, of which 67 are engineers). With over ten years of experience, he drives the technical roadmap and innovation with a team of four other engineers. Besides his active professional life, he is also a full-time father of two energetic children, a role in which he is learning to become an expert in soft skills of all kinds.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
