The purpose of this blog post entry is to shed light on the incident which took place on November 8th-9th, 2019, and affected Real-Time Crawler’s ability to cope with web crawling one of the most prominent online marketplaces at the necessary capacity.
Let’s explore what caused the incident and our technical team’s workflow that allowed us to reinstate Real-Time Crawler’s full working capacity promptly.
Please note that all times and dates are in UTC+2 time zone.
- 2019-11-07 23:20 – 2019-11-08 03:00 – the data source rolled out changes to bot detection logic, which led to the Real-Time Crawler operating at 10% of the usual capacity.
- 2019-11-08 03:00 – 2019-11-08 14:40 – the first fix was rolled out by our technical team, and Real-Time Crawler was now operating at 30% of the usual capacity.
- 2019-11-08 14:40 – 2019-11-08 19:15:00 – the second fix was rolled out by our technical team, and Real-Time Crawler was now operating at 80% of the usual capacity.
- 2019-11-08 19:15 – the third fix was rolled out by our technical team, which fully reinstated Real-Time Crawler’s working capacity.
What caused the incident?
Before the implemented changes, the bot detection level was quite low, and data extraction from this data source was reasonably a straightforward task. However, a new implementation caused to generate only 10% of the data retrieval success rate.
On the evening of November 7th, the online marketplace in question rolled out a change across its platform. The change was related to the way the data source detects and blocks bot activity i.e., the introduction of more sophisticated fingerprinting approaches and tighter scrutiny of request patterns.
To adapt to this sudden change in bot detection, we took a series of experiments that let us determine the impact of every part of our scraper logic, as well as all the changes to this logic that we made.
Some of the variables we tested were related to the contents and the order of the requests we submit, as well as how often and how much we use any particular IP address.
While working out new approaches, we found that though some of them worked fine at a smaller scale, they were not suited for the level of work we deal with in our production environment. Because of that, some of our deployed patches didn’t entirely resolve the situation.
That said, every fix we deployed brought us closer to a working solution, as we were putting together a complex solution from all the variables we tested. By the end of the day, we had managed to completely adapt to the changes that took place and restore 100% Real-Time Crawler’s working capacity.
Web scraping is regularly being referred to as a cat-and-mouse game, as the most prominent data sources often change their layout structure and implement new anti-bot measures. These changes inevitably have a direct impact on data retrieval success rates.
However, here at Oxylabs, we always react promptly to any potential data-gathering issues and try to find a solution as soon as possible for our valued partners.
If you have any further questions or would like to get a consultation about your own web scraping project, feel free to drop us a line via live chat or email us at [email protected].