15 Tips on How to Crawl a Website Without Getting Blocked
Adelina Kiskyte
Web crawling and web scraping are essential for public data gathering. If you're new to web scraping, you can check out our detailed guide on what is web scraping and how to scrape data from a website. E-commerce businesses use web scrapers to collect fresh data from various websites. This information is later used to improve business and marketing strategies.
Getting blacklisted while scraping data is a common issue for those who don’t know how to crawl a website without getting blocked. We gathered a list of actions you can take to avoid blocks while scraping and crawling websites.
Web pages detect web crawlers and web scraping tools by checking their IP addresses, user agents, browser parameters, and general behavior. If the website finds your activity suspicious, you first receive CAPTCHAs, and eventually your requests get blocked because your crawler has been detected.
Here are the main tips on how to crawl a website without getting blocked:
Before crawling or scraping any website, make sure your target allows data gathering from their page. Inspect the robots exclusion protocol (robots.txt) file and respect the rules of the website.
Even when the web page allows crawling, be respectful, and don’t harm the page. Follow the rules outlined in the robots exclusion protocol, crawl during off-peak hours, limit requests coming from one IP address, and set a delay between them.
However, even if the website allows web scraping, you may still get blocked, so it’s important to follow other steps, too. For a more in-depth look at the topic, see our web scraping Python tutorial.
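As a minimal sketch of checking robots.txt rules before crawling, Python's standard `urllib.robotparser` module can be used. The robots.txt content, the crawler name `my-crawler`, and the example.com URLs below are assumptions made up for illustration; in practice you would download the file from your target first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
agent = "my-crawler"  # hypothetical crawler name

# Allowed path: no Disallow rule matches it.
print(parser.can_fetch(agent, "https://example.com/products"))   # True
# Disallowed path: it falls under /private/.
print(parser.can_fetch(agent, "https://example.com/private/x"))  # False
# Delay (in seconds) the site asks crawlers to keep between requests.
print(parser.crawl_delay(agent))                                  # 5
```

If `can_fetch` returns `False` for a URL, skip it, and treat the crawl delay as the minimum pause between your requests.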
Web crawling would hardly be possible without proxy servers. Pick a reliable proxy service provider and choose between datacenter and residential proxies, depending on your task.
Using an intermediary between your device and the target website reduces IP address blocks, ensures anonymity, and allows you to access websites that might be unavailable in your region. For example, if you’re based in Germany, you may need to use a US proxy in order to access web content in the United States.
For the best results, choose a proxy provider with a large pool of IPs and a wide set of locations.
When you’re using a proxy pool, it’s essential that you rotate your IP addresses.
If you send too many requests from the same IP address, the target website will soon identify you as a threat and block your IP address. Proxy rotation makes you look like a number of different internet users and reduces your chances of getting blocked.
All Oxylabs Residential Proxies are rotating IPs, but if you’re using Datacenter Proxies, you should use a proxy rotator service. We also rotate IPv6 and IPv4 proxies. If you are interested in the differences between IPv4 vs IPv6, check out the article my colleague Iveta wrote.
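If you manage rotation yourself, the idea can be sketched with Python's standard library alone. This is an illustration under assumptions, not how any particular rotator service works: the proxy addresses are placeholders, and the simple round-robin order is just one possible strategy.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; replace with the addresses from your provider.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def rotating_openers(proxies):
    """Yield (proxy, opener) pairs forever, cycling through the pool in order."""
    for proxy in itertools.cycle(proxies):
        # Route both HTTP and HTTPS traffic through the current proxy.
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)


openers = rotating_openers(PROXY_POOL)

# Take four turns to show that the fourth one wraps around to the first proxy.
used = [next(openers)[0] for _ in range(4)]
print(used)
```

For a real request, take the next pair with `proxy, opener = next(openers)` and call `opener.open(url, timeout=10)`, so each request leaves through a different IP address.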
Most servers that host websites can analyze the headers of the HTTP request that crawling bots make. This HTTP request header, called user agent, contains various information ranging from the operating system and software to application type and its version.
Servers can easily detect suspicious user agents. Real user agents contain popular HTTP request configurations that are submitted by organic visitors. To avoid getting blocked, make sure to customize your user agent to look like an organic one.
Since every request made by a web browser contains a user agent, you should switch the user agent frequently.
It’s also important to use up-to-date and common user agents. If you’re making requests with a 5-year-old user agent from a Firefox version that is no longer supported, it raises a lot of red flags. An empty referrer header (spelled Referer in HTTP) can raise the same flags. Referrers are websites you visited prior to your destination website. So, to seem like an organic user, you need to include a referrer website.
You can find public databases on the internet that show you which user agents are the most popular these days. We also have our own regularly updated database, get in touch with us if you need access to it.
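Here is a rough sketch of switching user agents and setting a referrer with Python's standard `urllib.request`. The user agent strings are illustrative examples that will go stale, and the example.com URLs are placeholders; the helper name `build_request` is our own.

```python
import random
import urllib.request

# Illustrative user agent strings; refresh them from an up-to-date source,
# since browser versions go out of date quickly.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
    "Gecko/20100101 Firefox/125.0",
]


def build_request(url, referer, rng=random):
    """Build a request with a randomly chosen user agent and a Referer header."""
    headers = {
        "User-Agent": rng.choice(USER_AGENTS),
        "Referer": referer,  # HTTP spells this header with a single "r"
    }
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.com/products", "https://example.com/")
# urllib normalizes header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
```

Calling `build_request` for every page gives each request a user agent picked from the pool, while the referrer can point to the page your crawler visited just before.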
Anti-scraping mechanisms are getting more sophisticated and some websites use Transmission Control Protocol (TCP) or IP fingerprinting to detect bots.
When you scrape the web, your TCP connections expose various parameters. These parameters are set by the end user’s operating system or the device. If you’re wondering how to prevent getting blacklisted while scraping, make sure your parameters are consistent. Alternatively, you can use Web Unblocker, an AI-powered proxy solution with dynamic fingerprinting functionality. Web Unblocker combines many fingerprinting variables in such a way that, even when it settles on a single best-working fingerprint, the fingerprints still appear random and pass anti-bot checks.
If you’re interested, learn more about fingerprinting and its impact on web scraping.
Honeypots are links in the HTML code. These links are invisible to organic users, but web scrapers can detect them. Honeypots are used to identify and block web crawlers because only robots would follow that link.
Since setting honeypots requires a relatively large amount of work, this technique is not widely used. However, if your requests are blocked and your crawler is detected, be aware that your target might be using honeypot traps.
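One common way to hide a link from human visitors is inline CSS or the `hidden` attribute, so a crawler can skip such links. The sketch below assumes exactly that and nothing more: it only inspects the link's own attributes, so links hidden through external stylesheets or hidden parent elements would slip through. The class name and sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

# Inline style fragments that make an element invisible (spaces removed).
HIDDEN_STYLE_MARKERS = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect link targets, separating visibly rendered links from hidden ones."""

    def __init__(self):
        super().__init__()
        self.visible_links = []
        self.skipped_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" matches "display:none".
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attr_map or any(
            marker in style for marker in HIDDEN_STYLE_MARKERS
        )
        (self.skipped_links if hidden else self.visible_links).append(href)


SAMPLE_HTML = """
<a href="/products">Products</a>
<a href="/trap-1" style="display: none">secret</a>
<a href="/trap-2" hidden>secret</a>
<a href="/about" style="color: red">About</a>
"""

link_parser = VisibleLinkParser()
link_parser.feed(SAMPLE_HTML)
print(link_parser.visible_links)  # ['/products', '/about']
print(link_parser.skipped_links)  # ['/trap-1', '/trap-2']
```

Your crawler would then follow only `visible_links` and ignore the rest.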
CAPTCHAs are one of the biggest web crawling challenges. Websites ask visitors to solve various puzzles in order to confirm they’re humans. The current CAPTCHAs often include images that are nearly impossible to read for computers.
How to bypass CAPTCHAs when scraping? To work around CAPTCHAs, use dedicated CAPTCHA-solving services or ready-to-use crawling tools. For example, Oxylabs’ data crawling tool solves CAPTCHAs for you and delivers ready-to-use results. What's more, our Web Scraper API is specifically adapted for the most popular targets, taking care of various CAPTCHA techniques. If you want to see our tool in action, check out these guides to scraping Amazon and scraping Best Buy.
The pattern refers to how your crawler is configured to navigate the website. If you constantly use the same basic crawling pattern, it’s only a matter of time before you get blocked.
You can add random clicks, scrolls, and mouse movements to make your crawling seem less predictable. However, the behavior should not be completely random. One of the best practices when developing a crawling pattern is to think of how a regular user would browse the website and then apply those principles to the tool itself. For example, visiting home page first and only then making some requests to inner pages makes a lot of sense.
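To illustrate the "home page first, then inner pages" idea, here is a small sketch that plans a visit order. It is only one simple way to vary a pattern: the function name `plan_visits` and the example.com URLs are assumptions, and real user behavior is of course richer than a shuffled list.

```python
import random


def plan_visits(home_url, inner_urls, rng=random):
    """Start at the home page, then visit inner pages in a shuffled order."""
    inner = list(inner_urls)  # copy so the caller's list is left untouched
    rng.shuffle(inner)
    return [home_url] + inner


plan = plan_visits(
    "https://example.com/",
    [
        "https://example.com/category/shoes",
        "https://example.com/category/bags",
        "https://example.com/product/42",
    ],
)
print(plan)  # home page first, inner pages in a different order on each run
```

The visit order is not completely random: the entry point stays the same, as it would for a regular visitor, while the order of inner pages changes between runs.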
To mitigate the risk of being blocked, you should slow down your scraper speed. For instance, you can add random breaks between requests or initiate wait commands before performing a specific action.
IP address rate limiting means that the target allows only a limited number of actions from one IP address within a certain time. To avoid request throttling, respect the website and reduce your scraping speed.
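Random breaks between requests can be sketched as follows. The 2 to 6 second range is an arbitrary assumption to tune per target, and `polite_pause` is a name we made up; the `sleep` parameter exists only so the demo below can record delays instead of actually waiting.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep, rng=random):
    """Wait for a random time within the given range and return the delay used."""
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Demo: record the delays instead of sleeping so the example finishes instantly.
recorded = []
for url in ["https://example.com/a", "https://example.com/b"]:
    # A real crawler would fetch `url` here before pausing.
    polite_pause(sleep=recorded.append)

print(recorded)  # two random delays between 2 and 6 seconds
```

In a real crawler, call `polite_pause()` with its defaults after each request so the gaps between requests are never identical.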
Most crawlers move through pages significantly faster than an average user as they don’t actually read the content. Thus, a single unrestrained web crawling tool will affect server load more than any regular internet user. In turn, crawling during high-load times might negatively impact user experience due to service slowdowns.
Finding the best time to crawl the website will vary on a case-by-case basis but picking off-peak hours just after midnight (localized to the service) is a good starting point.
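A crawler can check the target's local time before starting a job. In this sketch, the off-peak window of midnight to 5 a.m. and the fixed UTC offset are assumptions; a production setup would more likely use a named time zone and traffic data for the specific service.

```python
from datetime import datetime, time, timedelta, timezone


def is_off_peak(now_utc, utc_offset_hours, start=time(0, 0), end=time(5, 0)):
    """Return True if the target's local time falls inside the off-peak window."""
    target_zone = timezone(timedelta(hours=utc_offset_hours))
    local = now_utc.astimezone(target_zone)
    return start <= local.time() < end


# 06:30 UTC is 01:30 for a service running at UTC-5, so it is off-peak there.
print(is_off_peak(datetime(2024, 1, 15, 6, 30, tzinfo=timezone.utc), -5))  # True
# 18:00 UTC is 13:00 at UTC-5, which is in the middle of the day.
print(is_off_peak(datetime(2024, 1, 15, 18, 0, tzinfo=timezone.utc), -5))  # False
```

Pass `datetime.now(timezone.utc)` as the first argument to check the current moment.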
Images are data-heavy objects that are often copyright protected. Not only do they take additional bandwidth and storage space, but scraping them also carries a higher risk of infringing on someone else’s rights.
Additionally, since images are data-heavy, they are often hidden in JavaScript elements (e.g. behind Lazy loading) which will significantly increase the complexity of the data acquisition process and slow down the web scraper itself. To get images out of JS elements, a more complicated scraping procedure (something that would force the website to load all content) would have to be written and employed.
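A simple way to leave images out of a crawl is to filter URLs by file extension before requesting them. The extension list below is an assumption and not exhaustive, and the check will miss images served from URLs without a file extension.

```python
import posixpath
from urllib.parse import urlparse

# Common image file extensions; extend the set to match your targets.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp"}


def is_image_url(url):
    """Return True if the URL path ends with a known image extension."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS


urls = [
    "https://example.com/products/42",
    "https://example.com/img/photo.JPG?size=large",
    "https://example.com/about.html",
]
to_crawl = [url for url in urls if not is_image_url(url)]
print(to_crawl)  # the photo URL is filtered out
```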
Data nested in JavaScript elements is hard to acquire. Websites use many different JavaScript features to display content based on specific user actions. A common practice is to only display product images in search bars after the user has provided some input.
JavaScript can also cause a host of other issues – memory leaks, application instability or, at times, complete crashes. Dynamic features can often become a burden. Avoid JavaScript unless absolutely necessary.
One of the additional tools for block-free web scraping is a headless browser. It works like any other browser, except a headless browser doesn’t have a graphical user interface (GUI).
A headless browser also allows scraping content that is loaded by rendering JavaScript elements. The most widely-used web browsers, Chrome and Firefox, have headless modes.
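As a rough sketch, headless Chrome can be driven from Python by calling the browser binary with its `--headless` and `--dump-dom` flags, which print the rendered page's HTML. The binary name `google-chrome` is an assumption that depends on your system (it may be `chromium` or a full path), and dedicated automation libraries offer far more control than this.

```python
import subprocess


def headless_dump_command(url, binary="google-chrome"):
    """Build the command that renders a page headlessly and prints its DOM."""
    return [binary, "--headless", "--dump-dom", url]


def render_page(url, binary="google-chrome", timeout=30):
    """Run headless Chrome and return the HTML after JavaScript has executed."""
    result = subprocess.run(
        headless_dump_command(url, binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise an error if the browser exits with a failure code
    )
    return result.stdout


print(headless_dump_command("https://example.com"))
```

Calling `render_page("https://example.com")` requires Chrome to be installed and returns the page source as a string, including content added by JavaScript.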
Another solution to scraping without blocks would be to scrape a website from Google cache instead. Google cache is a backup version of a website that is made when Google crawls websites. This backup version can be used to load a website when it's down. Note that Google retired its public cache in 2024, so this approach may no longer work.
To access a cached version of a website, add your target URL to the end of the following link: http://webcache.googleusercontent.com/search?q=cache:. For example, http://webcache.googleusercontent.com/search?q=cache:oxylabs.io pointed to a backup version of our website.
A Scraper API is often an excellent choice to avoid blocks effortlessly. A Web Scraper API is an automated tool that collects public web data. Most of the time, it takes care of all data extraction aspects, from sending requests to parsing and target unblocking.
For example, Oxylabs Web Scraper API focuses on SERP scraping, E-Commerce data gathering, and other websites. You only need to send a request and the tool delivers public data to you either in HTML or JSON. The tool handles the unblocking part for you so you don’t need to worry about it at all. While this option can be a bit pricier, it saves you tons of time and resources.
Gather public data without worrying about how to prevent getting blacklisted while scraping. Set your browser parameters right, take care of fingerprinting, and beware of honeypot traps. Most importantly, use reliable proxies and scrape websites with respect. Then all your public data gathering jobs will go smoothly and you’ll be able to use fresh information to improve your business.
You can try the functionality of our general-purpose web scraper for free and apply some of the tips described above or you could check out the best web crawlers on the market.
If you still wonder if crawling and scraping a website are legal, check out our blog post Is Web Scraping Legal?. Or, learn more about data extraction tools in our best no-code scraping solutions blog post.
If you don’t want to bother with proxy management, you can turn to a ready-made proxy scraper that will handle proxy management on its end - Web Unblocker.
If you get error messages like “Request Blocked: Crawler Detected” or “Access Denied: Crawler Detected” while attempting to scrape a website, most likely the website administrator discovered your web crawler. Usually, website admins employ the User-Agent field to detect web crawlers.
To avoid web scraping blocks, try some of the techniques mentioned in our article or automated solutions like Web Unblocker or Web Scraper API.
If you keep getting CAPTCHAs or get IP blocks, consider using techniques mentioned in our article or automated unblocking solutions. However, if Terms of Service prohibit scraping, you should refrain from doing it altogether.
The legality of web scraping depends on whether you’re breaking any laws relevant to your use case. In addition, you should always make sure you’re scraping ethically. This means respecting the website’s Terms of Service and technical capacity. To learn more about the topic, read our article on web scraping legality.
To hide your IP address, you can use a proxy, an intermediate server between you and the target website. It also uses its own IP address, thus concealing yours.
About the author
Adelina Kiskyte
Former Senior Content Manager
Adelina Kiskyte is a former Senior Content Manager at Oxylabs. She constantly follows tech news and loves trying out new apps, even the most useless. When Adelina is not glued to her phone, she also enjoys reading self-motivation books and biographies of tech-inspired innovators. Who knows, maybe one day she will create a life-changing app of her own!
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.