If you deal with web scraping on a regular basis, you probably know that there are at least two things in this sphere that are nearly impossible to avoid: legal considerations and blocks. The second is closely tied to the first, and the debate on the legality and ethics of web scraping continues to this day. If you would like to dive into this topic, be sure to read our publication The Legal Framework of Data Scraping, which gives a great review of all the court cases related to web scraping.
Yet, the goal of this article is to offer advice on how to configure your scraping software to act in ways that minimize the risk of blocks. So what is there to know? Let’s dive in.
How are websites able to identify and block web scraping bots?
When trying to prevent your proxies from getting blocked or blacklisted, it is useful to understand how websites are able to identify and block them in the first place.
Essentially, websites and those gathering public data from them are playing a cat-and-mouse game that is constantly increasing in complexity and sophistication. As websites find new and better ways to prevent automated activity, those collecting data are consequently improving at hiding their footprints.
The process of separating real users from bots and then blocking the latter is not as straightforward as it may seem at first. It may involve detecting activity deemed suspicious, flagging it, tracking it further and only then, finally, blocking it.
According to Dmitry Babitsky, the co-founder & chief scientist at ForNova, who was also a guest speaker at Oxycon, these are the most common methods that websites utilize to recognize web scraping bots:
- Large numbers of unusual requests and URLs.
- Missing cookies – if you connect without cookies, it looks suspicious. However, if you do have cookies – they can be used to track your activity.
- Mismatches between different request attributes – make sure the location of your IP address matches your language and time zone.
- WebRTC leaking your real IP address.
- Non-human behavior – websites track mouse and keyboard events which are very hard to simulate realistically. Unlike bots, humans are unpredictable.
- Browser performance analysis and comparison with similar configurations.
However, identification is just the first step. Once a website becomes suspicious, it may react in different ways, ranging from the previously mentioned further tracking and evaluation to showing a good old 404 error page or even feeding the scraper fake data.
To learn more about this topic, we invite you to read a summary of Mr. Babitsky’s presentation at Oxycon.
Here’s what you can do to safeguard your proxies from blocks
1. Respect the website
As a general rule, being nice and respecting the website’s crawling policies will ensure the best results. Most websites have a robots.txt file (stored in the root directory) that details things such as what can and cannot be scraped or how often you’re allowed to do it.
Another place you should also look at is the Terms of Service (ToS) of a site. There you will likely find definitions of whether the data on the site is public or copyrighted, and how the target server and the data on it should be accessed.
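As a starting point, you can check a site's crawling rules programmatically before fetching anything. Here is a minimal sketch using Python's standard `urllib.robotparser`; the example rules, user-agent string, and URLs are illustrative placeholders, not rules from any real site:

```python
# Sketch: check whether robots.txt allows fetching a given URL.
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt content and check whether `url` may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example robots.txt that disallows /private/ for all crawlers:
rules = """
User-agent: *
Disallow: /private/
"""

print(is_allowed(rules, "my-scraper", "https://example.com/public/page"))   # True
print(is_allowed(rules, "my-scraper", "https://example.com/private/page"))  # False
```

In a real scraper you would fetch the file from the site's root (e.g. via `parser.set_url(...)` and `parser.read()`) rather than hard-coding the rules.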
2. Multiple real user agents
The user agent HTTP request header passes information to the target server, such as the application type, operating system, and software version, and allows the target to decide whether to serve a mobile or desktop HTML layout.
An empty or unusual user agent might be the first red flag for the target site, so be sure to use popular configurations. Making too many requests with a single user agent is a bad idea too; to get around this, you can simulate multiple organic users by switching your headers.
Check out everything you need to know about configuring your HTTP headers in this blog post.
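Switching headers between requests can be as simple as picking a user agent at random from a pool. The sketch below assumes a small, hand-picked list of popular desktop browser strings; in practice you would keep this list current and much larger:

```python
# Sketch: rotate user-agent headers so consecutive requests
# look like they come from different organic users.
import random

# Illustrative examples of popular desktop browser user agents.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
    "Gecko/20100101 Firefox/121.0",
]

def build_headers() -> dict:
    """Pick a random user agent and pair it with consistent companion headers."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

headers = build_headers()
# Pass to your HTTP client, e.g.: requests.get(url, headers=headers)
```

Remember the mismatch point from earlier: the accompanying headers (language, encoding) should stay consistent with the user agent you choose.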
3. IP address rotation
In the world of web scraping, it is common knowledge that making a huge number of requests from the same IP address is a sure-fire way of getting it blocked. It is for this reason that you need multiple proxies, and the more data you scrape, the more proxies you'll have to use. Using multiple proxies for the same operation requires a proxy rotation solution, such as the one we offer here at Oxylabs.
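A managed rotator handles this for you server-side, but the underlying idea is simple round-robin rotation. Here is a minimal sketch; the proxy addresses are placeholders, and the returned mapping follows the format the popular `requests` library expects:

```python
# Sketch: round-robin rotation over a pool of proxies.
from itertools import cycle

# Placeholder proxy endpoints -- substitute your own.
PROXIES = [
    "http://198.51.100.1:8080",
    "http://198.51.100.2:8080",
    "http://198.51.100.3:8080",
]

proxy_pool = cycle(PROXIES)  # cycles through the list endlessly

def next_proxy() -> dict:
    """Return the next proxy as a mapping usable with requests' `proxies=` argument."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

for url in ["https://example.com/a", "https://example.com/b"]:
    proxies = next_proxy()  # each request goes out through a different IP
    # requests.get(url, proxies=proxies)
```

More sophisticated schemes weight proxies by health or cool them down after errors, but round-robin is the usual starting point.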
4. Slow down and randomize your scraping speed
Web scraping bots can fetch data from a target website many times faster than any human could. This immediately raises suspicion, just like any other on-site behavior that differs from how a human would act. Moreover, too many requests at the same time flood the target server with heavy traffic, which might make it unresponsive.
The solution is to slow down your scraper by configuring it to sleep for a random interval between requests (e.g. 3-10 seconds) and giving it longer breaks after a certain number of pages. Using as few concurrent requests as possible is also a good idea.
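The delay logic above can be sketched in a few lines. The 3-10 second range comes from the text; the batch size and the 60-120 second break are illustrative assumptions:

```python
# Sketch: randomized delays between requests, with a longer
# break after every batch of pages.
import random
import time

def next_delay(pages_done: int, batch_size: int = 50) -> float:
    """Return a randomized delay in seconds; rest longer after each full batch."""
    if pages_done > 0 and pages_done % batch_size == 0:
        return random.uniform(60, 120)  # longer break after a batch (assumed values)
    return random.uniform(3, 10)        # random pause between ordinary requests

# Usage in a scraping loop:
# for i, url in enumerate(urls, start=1):
#     fetch(url)
#     time.sleep(next_delay(i))
```

Keeping the delay function separate from `time.sleep` makes the schedule easy to test and tune without actually waiting.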
5. Change your crawling pattern occasionally
Anti-bot systems embedded in target websites can detect web scrapers by finding patterns in the way they act and navigate a site. As with some other advice here, random is good. To reduce the risk of getting your bot blocked, configure it to perform random actions, such as mouse movements, clicks, or scrolls. The more unpredictably it behaves, the more human it will seem to the target site.
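One way to structure this is to generate a randomized plan of actions and hand it to a browser-automation tool (such as Selenium or Playwright) for execution. The sketch below only builds the plan; the action names, coordinate ranges, and timing values are all illustrative assumptions:

```python
# Sketch: generate a randomized sequence of human-like browser actions.
# Executing these would require a browser-automation tool; here we
# only produce the plan.
import random

ACTIONS = ["move_mouse", "scroll", "pause", "click"]

def random_action_plan(steps: int = 5) -> list:
    """Build a list of (action, params) tuples in a random, human-like order."""
    plan = []
    for _ in range(steps):
        action = random.choice(ACTIONS)
        if action == "move_mouse":
            params = {"x": random.randint(0, 1280), "y": random.randint(0, 720)}
        elif action == "scroll":
            params = {"dy": random.randint(100, 600)}  # pixels to scroll down
        elif action == "pause":
            params = {"seconds": round(random.uniform(0.5, 3.0), 2)}
        else:  # click
            params = {"button": "left"}
        plan.append((action, params))
    return plan

plan = random_action_plan()
```

Because the sequence differs on every run, the resulting navigation pattern is much harder to fingerprint than a fixed click-scroll-click loop.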
The tips outlined in this article are just some of the things that those experienced in data collection do to protect their proxies from getting blacklisted. The fact of the matter is that there’s no single strategy that will work 100% of the time on all of your target websites and ensuring smooth operation requires constant changes and experimentation. Despite this, the advice provided in this article is definitely a great starting point.
Lastly, if you have any further questions or would like to get a consultation about your own web scraping project, feel free to drop us a line via live chat or email us at [email protected]. And don’t forget to follow Oxylabs on LinkedIn and Twitter.