During the first day of OxyCon, our guest speaker Slavcho Ivanov, CTO @ Intelligence Node, gave a great presentation on using cloud-based Chrome browsers for web data gathering. He described the method his company has developed and briefly discussed its limitations. In this article, let’s take a look at how Intelligence Node was able to achieve it.
The requirements for this setup are:
- Crawlers need to be able to send HTML requests to browsers
- Browsers need to send back HTML
- The crawlers’ numbers can change
- The browsers’ numbers can change
It is also worth pointing out that, according to Mr. Ivanov, in order to reduce the risk of fingerprinting, cookies should be cleaned periodically.
Some Limitations and Solutions
During the Q&A, some OxyCon participants noted that the Chrome browser has some limitations, such as hogging RAM and only a low-level ability for using proxies.
Addressing the first issue, Mr. Ivanov recommended restarting each browser every 30 minutes. As for proxies, one can either launch a new browser with a different proxy in certain intervals, have one endpoint and do proxy rotation on your own or use a proxy rotation service, offered by the provider.
Mr. Ivanov also noted that this solution is only viable for limited scale scraping. It would not work in the case of thousands of requests per second.
The Intelligence Node CTO also pointed out that although using Selenium and headless browsers could essentially achieve the same, it is easier and more convenient to use cloud-based Chrome browsers.