Vytautas Kirjazovas

Nov 25, 2019 3 min read

This blog post sheds light on the incident that took place on November 7th-8th, 2019, and affected Real-Time Crawler’s ability to crawl one of the most prominent online marketplaces at the necessary capacity.

Let’s explore what caused the incident and our technical team’s workflow that allowed us to reinstate Real-Time Crawler’s full working capacity promptly.

Incident Timeline 

Please note that all times and dates are in the UTC+2 time zone.

  • 2019-11-07 23:20 – 2019-11-08 03:00 – the data source rolled out changes to its bot detection logic, which left Real-Time Crawler operating at 10% of the usual capacity.
  • 2019-11-08 03:00 – 2019-11-08 14:40 – the first fix was rolled out by our technical team, and Real-Time Crawler was now operating at 30% of the usual capacity.
  • 2019-11-08 14:40 – 2019-11-08 19:15 – the second fix was rolled out by our technical team, and Real-Time Crawler was now operating at 80% of the usual capacity.
  • 2019-11-08 19:15 – the third fix was rolled out by our technical team, which fully reinstated Real-Time Crawler’s working capacity. 
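
As a rough illustration of the timeline above, the sketch below computes how long each phase lasted and the time-weighted average capacity over the incident. It assumes capacity stayed constant within each window, which the timeline does not state; the `phases` list and all names are introduced here for illustration only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# (start, end, capacity as a percentage of usual), copied from the timeline.
# Assumption: capacity was constant within each window.
phases = [
    ("2019-11-07 23:20", "2019-11-08 03:00", 10),
    ("2019-11-08 03:00", "2019-11-08 14:40", 30),
    ("2019-11-08 14:40", "2019-11-08 19:15", 80),
]


def minutes_between(start: str, end: str) -> int:
    """Length of a window in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


durations = [minutes_between(start, end) for start, end, _ in phases]
total_minutes = sum(durations)

# Weight each capacity figure by how long it was in effect.
weighted_capacity = (
    sum(d * cap for d, (_, _, cap) in zip(durations, phases)) / total_minutes
)

for (start, end, cap), d in zip(phases, durations):
    print(f"{start} to {end}: {d // 60}h {d % 60:02d}m at {cap}%")
print(f"Total degraded time: {total_minutes // 60}h {total_minutes % 60:02d}m")
print(f"Time-weighted average capacity: {weighted_capacity:.1f}%")
```

Under that assumption, the degraded period lasted 19 hours 55 minutes, at an average of roughly 38% of the usual capacity.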

What caused the incident?

Before the changes were implemented, the bot detection level was quite low, and data extraction from this data source was a reasonably straightforward task. However, the new implementation left us with only 10% of the usual data retrieval success rate.

On the evening of November 7th, the online marketplace in question rolled out a change across its platform. The change affected the way the data source detects and blocks bot activity: it introduced more sophisticated fingerprinting approaches and tighter scrutiny of request patterns.

The solution 

To adapt to this sudden change in bot detection, we ran a series of experiments that let us determine the impact of every part of our scraper logic, as well as of all the changes we made to this logic.
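
As a generic illustration of this kind of experiment (not the actual Real-Time Crawler logic), the sketch below flips one setting at a time and measures the change in success rate against a baseline. The setting names, the `fake_fetch` stand-in and its pass/fail rule are all made up so that the example runs offline.

```python
from typing import Callable, Dict

Config = Dict[str, bool]
Fetch = Callable[[Config, int], bool]


def success_rate(fetch: Fetch, config: Config, trials: int) -> float:
    """Fraction of trial requests that succeed under a given configuration."""
    return sum(fetch(config, i) for i in range(trials)) / trials


def one_at_a_time(fetch: Fetch, baseline: Config, trials: int = 100) -> Dict[str, float]:
    """Flip each setting on its own and report the change in success rate."""
    base = success_rate(fetch, baseline, trials)
    impact = {}
    for name in baseline:
        variant = {**baseline, name: not baseline[name]}
        impact[name] = success_rate(fetch, variant, trials) - base
    return impact


def fake_fetch(config: Config, i: int) -> bool:
    """Toy stand-in for a real request; the scoring rule is invented."""
    score = 1  # with nothing changed, 1 request in 10 passes
    if config["reorder_headers"]:  # hypothetical setting
        score += 2
    if config["limit_ip_reuse"]:  # hypothetical setting
        score += 5
    return i % 10 < score


baseline = {"reorder_headers": False, "limit_ip_reuse": False}
impact = one_at_a_time(fake_fetch, baseline)
for name, delta in impact.items():
    print(f"{name}: {delta:+.2f}")
```

Changing one variable at a time keeps each measurement easy to attribute, although it can miss interactions between settings, which is one reason a fix that looks good in isolation may behave differently once combined with others.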

Some of the variables we tested were related to the contents and the order of the requests we submit, as well as how often and how much we use any particular IP address.
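
To make the IP-related variables above more concrete, here is a minimal sketch of a per-IP usage budget that limits both how often (a minimum gap between uses) and how much (a maximum number of uses) any single address is used. The `IpBudget` class, its parameters and the limits shown are hypothetical and do not describe the values or logic used in production.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class IpBudget:
    """Tracks how often (min gap) and how much (max uses) each IP is used."""

    def __init__(self, ips: List[str], min_gap_s: float, max_uses: int) -> None:
        self.ips = ips
        self.min_gap_s = min_gap_s
        self.max_uses = max_uses
        self.uses: Dict[str, int] = defaultdict(int)
        self.last_used: Dict[str, float] = {}

    def acquire(self, now: float) -> Optional[str]:
        """Return the least-used IP that is allowed at time `now`, or None."""
        allowed = [
            ip
            for ip in self.ips
            if self.uses[ip] < self.max_uses
            and now - self.last_used.get(ip, float("-inf")) >= self.min_gap_s
        ]
        if not allowed:
            return None
        ip = min(allowed, key=lambda candidate: self.uses[candidate])
        self.uses[ip] += 1
        self.last_used[ip] = now
        return ip


# Documentation-range addresses (RFC 5737), not real proxies.
budget = IpBudget(["192.0.2.1", "192.0.2.2"], min_gap_s=5.0, max_uses=2)

# Timestamps are passed in explicitly so the example is deterministic.
picks = [budget.acquire(now=t) for t in (0.0, 1.0, 2.0, 6.0, 7.0, 12.0)]
print(picks)
```

A `None` result means no address is currently within budget, so the caller would have to wait or draw on a larger pool; tuning the two limits is exactly the kind of variable that can be tested experimentally.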

While working out new approaches, we found that some of them worked fine at a smaller scale but were not suited to the load we handle in our production environment. As a result, some of our deployed patches didn’t entirely resolve the situation.

That said, every fix we deployed brought us closer to a working solution, as we were assembling a combined solution from all the variables we tested. By the end of the day, we had managed to completely adapt to the changes that took place and restore 100% of Real-Time Crawler’s working capacity.

Final words

Web scraping is often referred to as a cat-and-mouse game, as the most prominent data sources frequently change their layout structure and implement new anti-bot measures. These changes inevitably have a direct impact on data retrieval success rates.

However, here at Oxylabs, we always react promptly to any potential data-gathering issues and try to find a solution as soon as possible for our valued partners. 

If you have any further questions or would like to get a consultation about your own web scraping project, feel free to drop us a line via live chat or email us at [email protected].

About Vytautas Kirjazovas

Vytautas Kirjazovas is a Content Manager at Oxylabs, and he takes a strong personal interest in technology due to its potential to make everyday business processes easier and more efficient. Vytautas is fascinated by new digital tools and approaches, in particular for web data harvesting purposes, so feel free to drop him a message if you have any questions on this topic. He appreciates a tasty meal, enjoys travelling and writing about himself in the third person.
