Large-scale web scraping poses a different set of challenges than smaller projects: building infrastructure, managing resource costs, and overcoming bot detection measures.
This article guides you through large-scale data gathering with an emphasis on e-commerce. Most of the advice applies beyond that, though, as large-scale data acquisition is not limited to e-commerce.
Web scraping infrastructure
Building and managing your web scraping infrastructure is one of the first things you will get your hands on. Of course, we are assuming you have already built a data-gathering method (AKA a scraper).
The general web scraping pipeline has three stages, covered in this order below: scraping your targets through proxies, storing the scraped data, and parsing it.
Put simply, you start off with scraping some targets. For large-scale operations, scraping without proxies would not last long, as websites would be quick to block you. Proxies are a huge part of big-scale data gathering. What type of proxies would be best when scraping for e-commerce in particular? It depends.
The best practice for large-scale data gathering is to have more than one proxy solution, and even more than one provider. Let us start with proxy providers.
Choosing a proxy provider
It is important to choose the right proxy provider, as this will directly impact your scraping procedures. Recently, Proxyway published an in-depth market research paper on global proxy service providers, and the report’s performance evaluation section emphasized that it is essential to check proxies for every data source.
If you choose unreliable proxies, your in-house data retrieval tool will perform poorly. In other words, make sure you fuel your business's data engine with the right resources.
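One way to check proxies for every data source is to record the outcome of each request per provider and per target, then compare success rates. The sketch below is a minimal illustration rather than a definitive implementation; the class name `ProxyScoreboard`, the provider labels, and the target `shop.example` are all hypothetical.

```python
from collections import defaultdict


class ProxyScoreboard:
    """Tracks request outcomes per (provider, target) pair."""

    def __init__(self):
        # (provider, target) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, provider, target, ok):
        """Store the outcome of one request."""
        stats = self._stats[(provider, target)]
        stats[1] += 1
        if ok:
            stats[0] += 1

    def success_rate(self, provider, target):
        """Return the share of successful requests, or 0.0 if none were made."""
        successes, attempts = self._stats.get((provider, target), (0, 0))
        return successes / attempts if attempts else 0.0

    def best_provider(self, target, providers):
        """Pick the provider with the highest success rate for this target."""
        return max(providers, key=lambda p: self.success_rate(p, target))


# Hypothetical outcomes for two providers against the same target.
board = ProxyScoreboard()
for ok in (True, True, False, True):
    board.record("provider_a", "shop.example", ok)
for ok in (True, False, False, False):
    board.record("provider_b", "shop.example", ok)

best = board.best_provider("shop.example", ["provider_a", "provider_b"])
print(best, board.success_rate(best, "shop.example"))
```

Keeping the statistics per target matters because a provider that works well on one website may perform poorly on another.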
Choosing a proxy type
If you are simply looking for a proxy type that works for e-commerce data gathering, check out Residential Proxies. Used correctly, residential proxies generally outperform datacenter proxies for this purpose. Because their traffic appears to come from regular home users, they are less likely to get blocked, and they offer a wide range of locations and a large IP pool.
Overcoming security measures
Another big part of (successful) web scraping is knowing what obstacles you will run into while doing it, and how to overcome them.
Any e-commerce web page (or any web page, for that matter) has some security measures in place to block unwanted bots. Unfortunately, your scraper bots will most likely be flagged as undesirable as well, because good bots can share characteristics with malicious ones. As a result, good bots are often labeled as bad and get blocked. Here are some popular ways websites identify bots:
- IP recognition. This is quite a basic one. The server can see whether the IP is from a datacenter or is residential. In some cases, datacenter proxies will be blocked as they are not considered as coming from human users, unlike residential proxies.
- CAPTCHA. A popular anti-bot measure. It is a challenge-response type of test that often asks you to fill in correct codes or identify objects in pictures. Most bots cannot pass CAPTCHAs, so they get recognized and blocked.
- Cookies. A normal user rarely goes to a specific product page directly. They usually arrive from a search engine, an ad, etc. So to avoid being recognized, your crawling process should mimic a real user's path, as by doing so you will collect the expected cookies.
- Browser fingerprinting. This refers to information that is gathered about a computing device for identification purposes. In other words, any browser will pass on specific data points to the connected website's servers, such as your operating system, language, plugins, fonts, hardware, etc. Your web scraper bots will need to imitate an organic user's data points. Learn more about what browser fingerprinting is in our in-depth blog post.
- Headers. The site can see your geo-location, time zone, language, etc. If they find inconsistencies and/or illogical combinations, you will get blocked. Check out our 5 key HTTP headers blog post on how to use them.
- Behavioral inconsistencies. Unnatural mouse movements, rapid button presses and mouse clicks, repetitive patterns, unusual average time per page or requests per page, and starting to browse from inner pages without collecting HTTP cookies can all get your scraper bot blocked.
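To illustrate the point about header inconsistencies, here is a minimal sketch that derives the `Accept-Language` header from the proxy's exit country, so that language and location do not contradict each other. The mapping `LANGUAGE_BY_COUNTRY`, the function names, and the example URL are assumptions made for this illustration. Real detection systems check far more signals than this.

```python
# Hypothetical mapping from proxy exit country to a plausible Accept-Language value.
LANGUAGE_BY_COUNTRY = {
    "US": "en-US,en;q=0.9",
    "DE": "de-DE,de;q=0.9,en;q=0.8",
    "FR": "fr-FR,fr;q=0.9,en;q=0.8",
}


def build_headers(proxy_country, referer):
    """Build request headers whose language matches the proxy location."""
    if proxy_country not in LANGUAGE_BY_COUNTRY:
        raise ValueError(f"no language profile for country {proxy_country!r}")
    return {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": LANGUAGE_BY_COUNTRY[proxy_country],
        # A referer makes the visit look like it came from a search page
        # instead of landing on a product page directly.
        "Referer": referer,
    }


def is_consistent(headers, proxy_country):
    """Return True if the language header matches the proxy country profile."""
    expected = LANGUAGE_BY_COUNTRY.get(proxy_country)
    return expected is not None and headers.get("Accept-Language") == expected


headers = build_headers("DE", "https://www.example.com/search?q=kettle")
print(is_consistent(headers, "DE"))  # headers built for DE, sent through a DE proxy
print(is_consistent(headers, "US"))  # same headers sent through a US proxy
```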
We have covered how to detect bot traffic and how bot detection affects web scraping in one of our blog posts in much greater detail, so go ahead and check it out. And if you want to learn how to crawl a website without getting blocked, check out the article covering just that.
But before you start scraping, ask yourself “what is my data gathering scope?” More precisely, how many requests will you be making per second? With large-scale data acquisition, the number will be in the hundreds or thousands of requests per second. This is where you will need a good storage strategy, which takes us to the next step of the scraping pipeline.
The delicate art of storing
All of the data you gather will need to be put somewhere and large-scale scraping requires a lot of storage resources. Here are some quick maths to elucidate the situation:
If you are making 2,000 requests per second, and a page weighs around 200KB on average, you will be going through 400MB of data every second. This assumes the pages are compressed. Uncompressed, it would be 1.3GB. (However, it is not common practice to leave them full-sized.)
That is only one second. What if we increase it to a more likely time period:
- 24 hours of scraping = ~35TB of data
- 1 month of scraping = ~1040TB of data
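The arithmetic above can be reproduced in a few lines. This sketch assumes decimal units (1MB = 1,000KB, 1TB = 1,000,000MB) and a 30-day month; the constant names are chosen for this example only.

```python
REQUESTS_PER_SECOND = 2_000
PAGE_SIZE_KB = 200  # average compressed page size used in the text

KB_PER_MB = 1_000
MB_PER_TB = 1_000_000
SECONDS_PER_DAY = 60 * 60 * 24
DAYS_PER_MONTH = 30  # assumption: a 30-day month

mb_per_second = REQUESTS_PER_SECOND * PAGE_SIZE_KB / KB_PER_MB
tb_per_day = mb_per_second * SECONDS_PER_DAY / MB_PER_TB
tb_per_month = tb_per_day * DAYS_PER_MONTH

print(f"{mb_per_second:.0f} MB per second")
print(f"{tb_per_day:.2f} TB per day")
print(f"{tb_per_month:.1f} TB per month")
```

The exact results are 34.56TB per day and 1,036.8TB per month, which round to the ~35TB and ~1040TB figures above.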
So the question is: why do you want to store the data? All of the scraped data will be in raw HTML. What will you do with it? Will you:
- Want to process it and turn it into readable data, like JSON?
- Need raw HTML files together with the processed data?
Before answering these questions, let’s talk a little about where this data will be sent to first. You can send your data directly to a service, or a parser, of course. Or better yet – into a buffer. Why not send it directly? Well, buffers are usually used when there is a difference between the rate at which data is received and the rate at which it can be processed. This leads us to our next topic.
Creating a buffer for data transfer
To explain buffering in layman’s terms, imagine an office. You are sitting in your office, doing work. From time to time someone comes up and puts a task in your work pile. Once you finish the task you are working on, you will move on to the next given task. So the pile is a buffer. If the pile gets too high, it will tumble over, so you have to put a limit on how many pages can be there. That will be the capacity of the buffer – anything more will be overflow.
So in your case, if you are waiting for another service to take the information, you will need a buffer to regulate how much information is being transferred. Your main worry will be not to overflow it, just like the pile of documents. If the buffer overflows, you will need to sacrifice some of your work. There are three things you can do in this instance:
- Get rid of the oldest data stored in the buffer
- Get rid of the most recently added data
- Stop your data gathering process to stop the overflow
If you choose to stop your scraping process, however, everything you put off for later means that more scraping will need to be done once you get back on track.
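The three overflow options can be sketched as a small bounded buffer. This is an illustration under stated assumptions, not a production queue: the names `BoundedBuffer`, `BufferFull`, and the policy labels are invented for this example, and "most recently added data" is interpreted here as the incoming page that no longer fits.

```python
from collections import deque


class BufferFull(Exception):
    """Raised under the "pause" policy so the caller can stop scraping."""


class BoundedBuffer:
    POLICIES = ("drop_oldest", "drop_newest", "pause")

    def __init__(self, capacity, policy="drop_oldest"):
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy!r}")
        self.capacity = capacity
        self.policy = policy
        self.dropped = 0  # number of pages sacrificed so far
        self._items = deque()

    def put(self, item):
        """Add an item; return True if it was stored, False if it was discarded."""
        if len(self._items) < self.capacity:
            self._items.append(item)
            return True
        if self.policy == "drop_oldest":
            self._items.popleft()      # sacrifice the oldest page
            self._items.append(item)
            self.dropped += 1
            return True
        if self.policy == "drop_newest":
            self.dropped += 1          # sacrifice the incoming page
            return False
        raise BufferFull("buffer is full; pause the scraper")

    def get(self):
        """Remove and return the oldest item, or None if the buffer is empty."""
        return self._items.popleft() if self._items else None


buffer = BoundedBuffer(capacity=2, policy="drop_oldest")
for page in ("page-1", "page-2", "page-3"):
    buffer.put(page)
print(buffer.get(), buffer.dropped)  # "page-1" was sacrificed to make room
```

With the "pause" policy nothing is lost, but the scraper has to catch `BufferFull` and retry later, which is the backlog trade-off described above.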
So finally, we can answer what to do with the data you have scraped.
Database storage services
So let us say you want to process your incoming data and turn it into a readable format, like JSON. In that case, you will not need to keep the raw data, which means you can hold the information in short-term storage. And what if you need the HTML files together with the processed data? Then long-term storage would be the best option.
However, because we are talking about large-scale data gathering, our suggestion is to use both: short-term storage as a buffer between the scraper and the parser, and long-term storage for the raw HTML files.
In this case, as the short-term storage works really fast and can handle a large number of requests, it will absorb a big load of data coming in from the scraper. With this solution, you will be able to both send the data into the parser and put raw HTML files into the long-term storage.
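A minimal sketch of this flow is shown below, with in-memory stand-ins for both storage tiers. Everything here is an assumption made for illustration: `queue.Queue` plays the role of the short-term store, a plain dictionary plays the long-term store, and `parse` is a placeholder for a real parser.

```python
import queue

short_term = queue.Queue(maxsize=1000)  # stand-in for a fast short-term store
long_term = {}                          # stand-in for persistent long-term storage
parsed_records = []                     # what the parser produces


def parse(html):
    """Placeholder parser: a real one would extract product fields."""
    return {"html_length": len(html)}


def scraper_emit(url, html):
    """The scraper writes only to the short-term store."""
    short_term.put((url, html))


def drain():
    """Move buffered pages to long-term storage and hand them to the parser."""
    while True:
        try:
            url, html = short_term.get_nowait()
        except queue.Empty:
            break
        long_term[url] = html                               # archive raw HTML
        parsed_records.append({"url": url, **parse(html)})  # processed copy


scraper_emit("https://shop.example/item/1", "<html>first</html>")
scraper_emit("https://shop.example/item/2", "<html>second page</html>")
drain()
print(len(long_term), len(parsed_records))
```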
You can also use only long-term storage as a buffer. However, you would need to put in a lot more resources to make sure all of the processes make it on time.
Short-term and long-term storage services differ in the following ways:
- Long-term storage. These solutions usually save the data persistently (on hard drives, not in memory/RAM). Because information is expected to stay there longer, they come with tools for filtering out the data you need from the entire set. This matters because, from the application's side, it would be too costly to retrieve the whole dataset and only then filter it.
- Short-term storage. These solutions have limited functionality for selecting data and are usually impractical for keeping data long-term. On the other hand, they are built to perform really fast, and their simplified tooling is a reasonable trade-off for the performance needed in large-scale operations.
You could avoid the storing process altogether, of course. One of our tools, Real-Time Crawler, is an advanced scraper customized for heavy-duty data retrieval operations, specializing in scraping e-commerce product pages. One of its perks is that you can forget data storage headaches, as all you need to do is provide it with a URL. Real-Time Crawler does the whole scraping, storing (and processing) on its own and gives you back all the wanted data (either HTML or JSON).
Processing scraped data
Once you figure out your storage needs, you will have to look at processing, AKA parsing. Data parsing is a process of analysing the incoming information and extracting relevant pieces into a format that is useful for later processing. Data parsing is a crucial step in web scraping.
So from your chosen storage you will give all of the data to a parser. When you scrape, you usually receive your data in raw HTML. A parser takes the HTML and transforms it into a more readable data format. In this more readable format, further data engineering can be done for insight extraction, data crunching, and so on.
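As an illustration of turning raw HTML into a more readable format, the sketch below uses Python's standard `html.parser` module to extract two fields and emit JSON. The class names `product-title` and `product-price` and the sample markup are hypothetical; a real target will have its own structure.

```python
import json
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects the text of elements whose class attribute is listed in FIELDS."""

    # Hypothetical CSS class -> output field name.
    FIELDS = {"product-title": "title", "product-price": "price"}

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field waiting for its text, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for css_class in classes:
            if css_class in self.FIELDS:
                self._current = self.FIELDS[css_class]

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        # Stop waiting if the element closed without any text.
        self._current = None


raw_html = (
    '<html><body><h1 class="product-title">Blue Kettle</h1>'
    '<span class="product-price">19.99</span></body></html>'
)

parser = ProductParser()
parser.feed(raw_html)
print(json.dumps(parser.record))
```

Note how tightly the parser depends on the page's class names: if the target renames them, the record comes back empty and the parser has to be updated.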
However, like everything we discussed in this blog post so far, parsing is not that simple. On a small scale, building a parser and maintaining it is quite simple. But when it comes to large-scale web scraping, it gets a lot more complicated.
Large-scale data parsing struggles
- Targets might change their web page layouts – and with layout changes, the HTML structures change as well. This means you will need to update your old parser to support parsing the new layout.
- When using third-party parsers, your process might have to be stopped – while your selected service provider is updating its parser, you may be forced to halt further operations for an unknown period of time.
- If you do use third-party services, you will need several of them – you will want to diversify your solutions to make sure that if one service breaks down or works too slowly, you have a back-up.
- Different services give differently structured datasets – it is unreasonable to expect every provider to deliver data in the same structure as the others. All of that data will need to be standardized to fit your internal system.
- If you are using your own parsers, you will need a lot of them – this means getting more storage and servers, and more resources for maintenance as well.
- When the parser process is halted, your buffer might overflow – because you cannot take any data from the buffer and give it to the parser, it stacks up in the buffer.
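The standardization problem mentioned in the list can be sketched as one small normalizer per provider, each mapping that provider's output onto a single internal schema. The provider names, their field names, and the internal schema below are all hypothetical.

```python
def normalize_provider_a(raw):
    """Hypothetical provider A returns name, price, and currency separately."""
    return {
        "title": raw["name"],
        "price": float(raw["price"]),
        "currency": raw["currency"],
    }


def normalize_provider_b(raw):
    """Hypothetical provider B returns the price as a string like '19.99 EUR'."""
    amount, currency = raw["cost"].split()
    return {
        "title": raw["product_title"],
        "price": float(amount),
        "currency": currency,
    }


NORMALIZERS = {
    "provider_a": normalize_provider_a,
    "provider_b": normalize_provider_b,
}


def normalize(provider, raw):
    """Convert a provider-specific record into the internal schema."""
    try:
        return NORMALIZERS[provider](raw)
    except (KeyError, ValueError) as exc:
        # A missing field or unknown provider often means the upstream
        # structure changed and the normalizer needs an update.
        raise ValueError(f"cannot normalize record from {provider}: {exc!r}") from exc


record_a = normalize("provider_a", {"name": "Blue Kettle", "price": "19.99", "currency": "EUR"})
record_b = normalize("provider_b", {"product_title": "Blue Kettle", "cost": "19.99 EUR"})
print(record_a == record_b)
```

Raising a clear error instead of silently passing malformed records along makes it easier to notice when a provider changes its output.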
So to sum up, you can either build and maintain your own parsers, or get them from a third-party solution. For a large-scale operation, we suggest committing to one approach or the other. Investing in several good third-party solutions (to diversify your services) could ensure smooth web scraping operations.
You can also opt for a solution that offers parsing and scraping at the same time, such as Next-Gen Residential Proxies, an Oxylabs exclusive. This solution is as customizable as any regular proxy, but at the same time guarantees a much higher success rate and offers Adaptive Parsing. This neat little feature adapts to any e-commerce product page and parses all of the HTML code.
Web scraping on a large scale is a long process, and it is not easy. We hope this guide will help whether you are planning a project or are in the middle of one.
If you have any questions regarding proxies, best solutions, scraping best practices, or similar, feel free to contact our Sales team, and they will help you as best as they can.