How We Handle Changes Implemented By Data Sources

Vytautas Kirjazovas

Last updated on

2019-11-25

This blog post sheds light on the incident that took place on November 8th-9th, 2019, and limited Real-Time Crawler’s ability to crawl one of the most prominent online marketplaces at the necessary capacity.

Let’s explore what caused the incident and how our technical team’s workflow allowed us to promptly reinstate Real-Time Crawler’s full working capacity.

Incident timeline

Please note that all times and dates are in the UTC+2 time zone.

2019-11-07 23:20 – 2019-11-08 03:00 – the data source rolled out changes to its bot detection logic, which left Real-Time Crawler operating at 10% of its usual capacity.

2019-11-08 03:00 – 2019-11-08 14:40 – our technical team rolled out the first fix, bringing Real-Time Crawler to 30% of its usual capacity.

2019-11-08 14:40 – 2019-11-08 19:15 – our technical team rolled out the second fix, bringing Real-Time Crawler to 80% of its usual capacity.

2019-11-08 19:15 – our technical team rolled out the third fix, which fully reinstated Real-Time Crawler’s working capacity.
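
As a rough cross-check of the timeline, the length of each phase can be computed directly from the timestamps above. The sketch below is illustrative only: the variable and function names are ours, and the time-weighted average assumes capacity stayed constant within each phase, which the timeline does not state.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# (phase start, phase end, capacity during the phase), copied from the timeline.
phases = [
    ("2019-11-07 23:20", "2019-11-08 03:00", 0.10),
    ("2019-11-08 03:00", "2019-11-08 14:40", 0.30),
    ("2019-11-08 14:40", "2019-11-08 19:15", 0.80),
]


def phase_hours(start: str, end: str) -> float:
    """Length of one phase in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


durations = [phase_hours(start, end) for start, end, _ in phases]
total_hours = sum(durations)

# Assumption: capacity was constant within each phase.
average_capacity = (
    sum(hours * capacity for hours, (_, _, capacity) in zip(durations, phases))
    / total_hours
)

for (start, end, capacity), hours in zip(phases, durations):
    print(f"{start} to {end}: {hours:.2f} h at {capacity:.0%} capacity")
print(f"Total time below full capacity: {total_hours:.2f} h")
print(f"Time-weighted average capacity: {average_capacity:.1%}")
```

The three phases add up to just under 20 hours between the first signs of the change and the final fix.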

What caused the incident?

Before the changes were implemented, the bot detection level was quite low, and extracting data from this data source was a reasonably straightforward task. After the new implementation, however, the data retrieval success rate dropped to only 10%.

On the evening of November 7th, the online marketplace in question rolled out a change across its platform. The change affected the way the data source detects and blocks bot activity: it introduced more sophisticated fingerprinting approaches and tighter scrutiny of request patterns.

The solution

To adapt to this sudden change in bot detection, we ran a series of experiments that let us determine the impact of every part of our scraper logic, as well as of every change we made to that logic.

Some of the variables we tested were related to the contents and the order of the requests we submit, as well as how often and how much we use any particular IP address.
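
One way to picture this kind of experiment is to change a single variable at a time and compare each variant's success rate against an unchanged baseline. The sketch below is a minimal illustration under our own assumptions, not the team's actual tooling: the variant names and the logged outcomes are hypothetical and made up for the example.

```python
from collections import defaultdict

# Hypothetical experiment log: (variant name, did the request succeed?).
# Both the variant names and the outcomes are invented for illustration.
results = [
    ("baseline", False), ("baseline", False),
    ("baseline", False), ("baseline", True),
    ("reordered_requests", True), ("reordered_requests", False),
    ("reordered_requests", True), ("reordered_requests", False),
    ("lower_per_ip_rate", True), ("lower_per_ip_rate", True),
    ("lower_per_ip_rate", True), ("lower_per_ip_rate", False),
]


def success_rates(log):
    """Return {variant: share of successful requests} for an experiment log."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for variant, ok in log:
        totals[variant] += 1
        successes[variant] += int(ok)
    return {variant: successes[variant] / totals[variant] for variant in totals}


rates = success_rates(results)
baseline = rates["baseline"]

# Impact of a change = its success rate minus the baseline success rate.
impact = {v: rate - baseline for v, rate in rates.items() if v != "baseline"}

for variant, gain in sorted(impact.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{variant}: {gain:+.0%} vs. baseline")
```

Changing one variable per variant keeps the comparison readable: any difference from the baseline can be attributed to that single change.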

While working out new approaches, we found that although some of them worked fine at a smaller scale, they were not suited to the volume of work we deal with in our production environment. Because of that, some of our deployed patches didn’t entirely resolve the situation.
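
The post does not say why particular approaches stopped working at production volume. One hypothetical explanation is simple arithmetic: with a fixed pool of IP addresses, the load on each address grows with total traffic. In the sketch below, every number (pool size, per-IP budget, request volumes) is an assumption chosen only to show the effect.

```python
# All values are hypothetical and exist only to illustrate the effect of scale.
POOL_SIZE = 1_000      # assumed number of IP addresses available
PER_IP_BUDGET = 2.0    # assumed max requests per IP per minute before blocks begin


def requests_per_ip_per_minute(total_requests_per_minute: float, pool_size: int) -> float:
    """Average load on each IP when traffic is spread evenly over the pool."""
    return total_requests_per_minute / pool_size


def fits_budget(total_requests_per_minute: float) -> bool:
    """True if the average per-IP load stays within the assumed budget."""
    load = requests_per_ip_per_minute(total_requests_per_minute, POOL_SIZE)
    return load <= PER_IP_BUDGET


small_test = fits_budget(500)       # 0.5 requests per IP per minute
production = fits_budget(50_000)    # 50 requests per IP per minute

print(f"Small-scale test within budget: {small_test}")
print(f"Production volume within budget: {production}")
```

Under these assumed numbers, the same approach passes a small test and exceeds the per-IP budget at production volume, which is why a fix needs to be validated at realistic scale.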

That said, every fix we deployed brought us closer to a working solution, which we assembled from all the variables we tested. By the end of the day, we had fully adapted to the changes and restored Real-Time Crawler’s working capacity to 100%.

Final words

Web scraping is often referred to as a cat-and-mouse game, as the most prominent data sources frequently change their layout structure and implement new anti-bot measures. These changes inevitably have a direct impact on data retrieval success rates.

However, here at Oxylabs, we always react promptly to any potential data-gathering issues and try to find a solution as soon as possible for our valued partners.

If you have any further questions or would like a consultation about your own web scraping project, feel free to contact us.

About the author

Vytautas Kirjazovas

Head of PR

Vytautas Kirjazovas is Head of PR at Oxylabs, and he places a strong personal interest in technology due to its magnifying potential to make everyday business processes easier and more efficient. Vytautas is fascinated by new digital tools and approaches, in particular, for web data harvesting purposes, so feel free to drop him a message if you have any questions on this topic. He appreciates a tasty meal, enjoys traveling and writing about himself in the third person.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.