Adelina Kiskyte

Jun 12, 2020 8 min read

Web crawling and web scraping are essential for public data gathering. E-commerce businesses use web scrapers to collect fresh data from various websites. This information is later used to improve business and marketing strategies. 

Getting blacklisted while scraping data is a common issue for those who don’t know how to crawl a website without getting blocked. We gathered a list of actions to prevent getting blacklisted while scraping and crawling websites.

How do websites detect web crawlers?

Websites detect web crawlers and web scraping tools by checking their IP addresses, user agents, browser parameters, and general behavior. If a website finds your activity suspicious, it starts serving CAPTCHAs, and eventually your requests get blocked once the crawler is detected.

Check robots exclusion protocol

Before crawling or scraping any website, make sure your target allows data gathering from its pages or that the data can be considered public. In any case, inspect the robots.txt file and respect the website's rules.

Even when the website allows crawling, be respectful and don't harm the page. Follow the rules outlined in the robots exclusion protocol, crawl during off-peak hours, limit requests coming from one IP address, and set a delay between them.
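For example, here's a minimal sketch of checking a site's robots.txt before crawling, using Python's standard urllib.robotparser module; the crawler name and URLs are placeholders:

```python
# A sketch of checking robots.txt before crawling, using Python's
# standard library. "MyCrawler/1.0" and the URLs are placeholders.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the robots.txt file

url = "https://example.com/products/page-1"
if robots.can_fetch("MyCrawler/1.0", url):
    print("Allowed to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)

# Honor Crawl-delay if the site declares one
delay = robots.crawl_delay("MyCrawler/1.0")
if delay:
    print(f"The site asks for a {delay}s delay between requests")
```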

However, even if the website allows web scraping, you may still get blocked, so it’s important to know how to crawl a website without getting blocked.

Use a proxy server

Web crawling would hardly be possible without proxies. Pick a reliable proxy service provider and choose between datacenter and residential IP proxies, depending on your task.

Using an intermediary between your device and the target website reduces IP address blocks, ensures anonymity, and allows you to access websites that might be unavailable in your region. For example, if you’re based in Germany, you may need to use a US proxy in order to access web content in the United States.

For the best results, choose a proxy provider with a large pool of IPs and a wide set of locations. 
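As a rough illustration, here's how a single request might be routed through a proxy using the Python requests library; the proxy host, port, and credentials below are placeholders, not real endpoints:

```python
# A sketch of routing one request through a proxy with the requests
# library. The proxy address and credentials are placeholders.
import requests

proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)
```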

Rotate IP addresses

When you’re using a proxy pool, it’s essential that you rotate your IP addresses. 

If you send too many requests from the same IP address, the target website will soon identify you as a threat and block your IP address. Proxy rotation makes you look like a number of different internet users and reduces your chances of getting blocked.

All Oxylabs Residential Proxies are rotating IPs, but if you’re using Datacenter Proxies, you should use a proxy rotator service.
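If you're building the rotation yourself, a naive sketch looks like this: pick a different proxy from your pool for every request. The addresses below are placeholders, and a managed rotator service would handle this selection for you automatically:

```python
# A naive proxy-rotation sketch: choose a different proxy from a pool
# for every request. The addresses are placeholders; a rotator service
# would handle this selection for you.
import random
import requests

proxy_pool = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

for url in ["https://example.com/page-1", "https://example.com/page-2"]:
    proxy = random.choice(proxy_pool)
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    print(url, "->", response.status_code)
```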

Use real user agents

Most servers that host websites can analyze the headers of the HTTP requests that crawling bots make. One of these headers, the user agent, contains information ranging from the operating system and software to the application type and its version.

Servers can easily detect suspicious user agents. Real user agents contain popular HTTP request configurations that are submitted by organic visitors. To avoid getting blocked, make sure to customize your user agent to look like an organic one. 

Since every request made by a web browser contains a user agent, you should switch the user agent frequently. 

It’s also important to use up-to-date and popular user agents. If you’re making requests with a five-year-old user agent from a Firefox version that is no longer supported, it raises a lot of red flags. You can find public databases on the internet that show which user agents are currently the most popular. We also have our own regularly updated database; get in touch with us if you need access to it.
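A simple way to vary user agents is to keep a small pool of realistic strings and pick one per request. This is only a sketch; the strings below are examples and should be replaced with current, popular user agents:

```python
# A sketch of rotating user agents between requests. The strings below
# are examples only; substitute current, popular user agents.
import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
]

headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.request.headers["User-Agent"])
```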

Set your fingerprint right

Anti-scraping mechanisms are getting more sophisticated, and some websites use Transmission Control Protocol (TCP) or IP fingerprinting to detect bots.

When you scrape the web, your TCP/IP stack leaves various parameters, which are set by the end user’s operating system or device. If you’re wondering how to prevent getting blacklisted while scraping, make sure these parameters stay consistent.

If you’re interested, learn more about fingerprinting and its impact on web scraping.

Beware of honeypot traps

Honeypots are links placed in a page’s HTML code. They are invisible to organic users, but web scrapers still encounter them in the markup. Honeypots are used to identify and block web crawlers because only robots would follow these links.

Since setting honeypots requires a relatively large amount of work, this technique is not widely used. However, if your request is blocked and crawler detected, beware that your target might be using honeypot traps.
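If you want a basic safeguard, you can skip links that are hidden with inline CSS before following them. This is only a rough heuristic sketch; honeypots hidden via external stylesheets or JavaScript would not be caught by this simple check:

```python
# A rough heuristic: skip links hidden with inline CSS, a common
# honeypot pattern. Links hidden via external stylesheets or JavaScript
# would not be caught by this simple check.
from bs4 import BeautifulSoup

html = """
<a href="/products">Products</a>
<a href="/trap" style="display:none">Hidden</a>
<a href="/trap2" style="visibility:hidden">Also hidden</a>
"""

soup = BeautifulSoup(html, "html.parser")
safe_links = []
for link in soup.find_all("a", href=True):
    style = (link.get("style") or "").replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        continue  # likely a honeypot, don't follow it
    safe_links.append(link["href"])

print(safe_links)  # ['/products']
```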

Use CAPTCHA solving services

CAPTCHAs are one of the biggest web crawling challenges. Websites ask visitors to solve various puzzles in order to confirm they’re human. Modern CAPTCHAs often include images that are nearly impossible for computers to read.

To work around CAPTCHAs, use dedicated CAPTCHA solving services or ready-to-use crawling tools. For example, Oxylabs’ data crawling tool solves CAPTCHAs automatically, without you even noticing.

Change the crawling pattern

The pattern refers to how your crawler is configured to navigate the website. If you constantly use the same basic crawling pattern, it’s only a matter of time before you get blocked.

You can add random clicks, scrolls, and mouse movements to make your crawling seem less predictable. However, the behavior should not be completely random. One of the best practices when developing a crawling pattern is to think about how a regular user would browse the website and then apply those principles to the tool itself. For example, visiting the home page first and only then making requests to inner pages makes a lot of sense.
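For instance, a crawl that lands on the home page first and then visits inner pages in a shuffled order with irregular pauses already looks far less mechanical. A minimal sketch, with placeholder URLs:

```python
# A sketch of a less predictable crawl: visit the home page first,
# then inner pages in a shuffled order with randomized pauses.
# All URLs are placeholders.
import random
import time
import requests

session = requests.Session()
session.get("https://example.com/", timeout=10)  # land on the home page first

inner_pages = [
    "https://example.com/category/shoes",
    "https://example.com/category/bags",
    "https://example.com/category/watches",
]
random.shuffle(inner_pages)  # don't always crawl in the same order

for url in inner_pages:
    time.sleep(random.uniform(2, 6))  # irregular, human-like pauses
    response = session.get(url, timeout=10)
    print(url, response.status_code)
```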

Reduce the scraping speed

To mitigate the risk of being blocked, you should slow down your scraper speed. For instance, you can add random breaks between requests or initiate wait commands before performing a specific action.

What if I can’t scrape the URL because it is rate limited?

IP address rate limiting means that the target only allows a limited number of actions from one IP address in a certain period of time. To avoid having your requests throttled, respect the website and reduce your scraping speed.
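If the target does throttle you, backing off and retrying is a reasonable response. Here's a sketch that doubles the wait after every HTTP 429 (Too Many Requests) response; the retry count and delays are arbitrary and should be tuned to the site's limits:

```python
# A sketch of exponential backoff on HTTP 429 (Too Many Requests).
# The retry count and delays are arbitrary; tune them to the target.
import time
import requests

def fetch_with_backoff(url, max_retries=5):
    delay = 1
    for _ in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            return response
        time.sleep(delay)  # wait before trying again
        delay *= 2         # double the wait after every 429
    return response

print(fetch_with_backoff("https://example.com/products").status_code)
```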

Use a headless browser

One of the additional tools for block-free web scraping is a headless browser. It works like any other browser, except that a headless browser doesn’t have a graphical user interface (GUI).

A headless browser also allows scraping content that is loaded by rendering JavaScript elements. The most widely-used web browsers, Chrome and Firefox, have headless modes.
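For example, here's a short sketch of fetching a JavaScript-rendered page with headless Chrome through Selenium; it assumes the selenium package and a matching Chrome/chromedriver installation:

```python
# A sketch of fetching a JavaScript-rendered page with headless Chrome
# via Selenium. Assumes the selenium package and a matching
# Chrome/chromedriver installation.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a GUI
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")
print(driver.title)        # title after JavaScript has executed
html = driver.page_source  # fully rendered HTML for parsing

driver.quit()
```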

Conclusion

Scrape public data without worrying about how to prevent getting blacklisted while scraping. Set your browser parameters right, take care of fingerprinting, and beware of honeypot traps. Most importantly, use reliable proxies and scrape websites with respect. Then all your public data gathering jobs will go smoothly and you’ll be able to use fresh information to improve your business.

Now that you know how to crawl a website without getting blocked, check out our blog and read more about web scraping uses.

About Adelina Kiskyte

Adelina Kiskyte is a Content Manager at Oxylabs. Adelina constantly follows tech news and loves trying out new apps, even the most useless. When she is not glued to her phone, she also enjoys reading self-motivation books and biographies of tech-inspired innovators. Who knows, maybe one day she will create a life-changing app of her own!

Related articles

Setting the Right Approach to Web Scraping

Jun 26, 2020

5 min read

Choosing the Right Proxy Service Provider

Jun 22, 2020

4 min read

Choosing Between Residential and Datacenter Proxies

Jun 19, 2020

3 min read

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.