Web scraping is one of the most widely used data gathering methods. Building a web scraper does require some programming knowledge, but the entire process is much simpler than it might seem at the outset.
Of course, the effectiveness of these projects depends on many factors, such as the difficulty of targets and the anti-bot measures they implement. Using web scraping for professional purposes, such as long-term data acquisition or pricing intelligence, requires constant maintenance and management. In this article, we will only outline the basics of building a web scraper and the most common challenges newcomers might run into.
What is web scraping used for?
Web scrapers are often the main part of the data acquisition process. Generally, they are used as an automated way to retrieve large amounts of important information from the web. Common scraping targets include search engine results pages, e-commerce websites and other internet resources.
Data acquired in this manner can be used for pricing intelligence, stock market analysis, academic research and many other purposes. There are many web scraping ideas out there as this data gathering method can be used in almost limitless ways.
Web scrapers, when used as a data gathering method, contain multiple steps – a scraping path, data extraction script(s), (headless) browsers, proxies and, finally, parsing. Let’s review what goes into each of these steps.
Developing a basic web scraper
Building a scraping path
Building a scraping path is an important part of almost every data gathering method. A scraping path is the library of URLs from which data is to be extracted. While collecting a few dozen URLs may seem very simple at first, building a scraping path in fact requires a lot of care and attention.
Creating a scraping path might, at times, require additional effort, because obtaining the required URLs can itself involve scraping an initial page. For example, e-commerce websites have a separate URL for every product page. Building a scraping path for specific products on an e-commerce website would work like this:
- Scrape the search page.
- Parse product page URLs.
- Scrape these new URLs.
- Parse according to set criteria.
As such, building a scraping path might not be as simple as just creating a collection of easily accessible URLs. Creating a scraping path by developing an automated process ensures that no important URLs are missed.
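The first two steps listed above can be sketched in Python. This is a minimal illustration that uses only the standard library; the sample HTML, the `/product/` URL pattern and the `example.com` addresses are assumptions made for the example and do not describe any real website.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProductLinkParser(HTMLParser):
    """Collects href values that look like product page links."""

    def __init__(self, base_url, path_marker="/product/"):
        super().__init__()
        self.base_url = base_url
        self.path_marker = path_marker  # hypothetical URL pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.path_marker in href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # skip duplicates, keep order
                self.links.append(absolute)


def build_scraping_path(search_html, base_url):
    """Return the de-duplicated product URLs found on a search page."""
    parser = ProductLinkParser(base_url)
    parser.feed(search_html)
    return parser.links


# Stand-in for a downloaded search results page.
SEARCH_PAGE = """
<ul>
  <li><a href="/product/101">Laptop</a></li>
  <li><a href="/product/102">Phone</a></li>
  <li><a href="/about">About us</a></li>
  <li><a href="/product/101">Laptop (again)</a></li>
</ul>
"""

scraping_path = build_scraping_path(SEARCH_PAGE, "https://example.com/search?q=x")
print(scraping_path)
```

The resulting list is the scraping path: each URL in it would then be scraped and parsed according to the criteria set for the project.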
All parsing and analysis efforts will depend upon data acquired from URLs outlined in the scraping path. Insights can only be as good as the data collected. If even a few key sources are missing, the results of dynamic pricing could become inaccurate and thus irrelevant.
Building a scraping path requires some knowledge of the industry at large and of specific competitors. Only when the URLs are collected in a careful and strategic manner can the data acquisition process begin.
Additionally, data is generally stored in two steps – pre-parsed (short-term) and long-term storage. Of course, in order for any data gathering method to be effective, continuous updates are required. Data is only as good as it is fresh.
Data extraction scripts
Building a data extraction script, of course, requires some prior coding knowledge. Most rudimentary data extraction scripts use Python, but there are many more options available. Python is popular among developers engaging in web scraping as it has many useful libraries that make extraction, parsing and analysis significantly easier.
The development of a data extraction script generally goes through several stages:
- Deciding the type of data to be extracted (e.g. pricing or product data)
- Finding where and how the data is nested
- Installing and importing the required libraries (e.g. Beautiful Soup for parsing, JSON or CSV for output)
- Writing a data extraction script
In most cases, the first step will be clear from the get-go. Step two is where things get interesting. Different types of data will be displayed (or encoded) in different ways. In the best-case scenario, data across different URLs will always be stored in the same class and will not require any scripts to be displayed. Classes and tags can be easily found by using the Inspect element feature available in every modern browser. Unfortunately, pricing data is oftentimes slightly more difficult to acquire, because it is frequently loaded by JavaScript rather than included in the initial HTML.
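As a rough sketch of the best-case scenario, the snippet below pulls the text out of every element that carries a given class. It uses Python's built-in `html.parser` as a stand-in for Beautiful Soup so that it runs without extra installs; the class names `product-title` and `price` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested tag inside the matching element
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract_by_class(html, target_class):
    """Return the stripped text of all elements with the given class."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    return [text.strip() for text in parser.results]


# Stand-in for a downloaded product page.
PRODUCT_PAGE = """
<div class="product">
  <h1 class="product-title">Example Laptop</h1>
  <span class="price current">$1,299.00</span>
</div>
"""

titles = extract_by_class(PRODUCT_PAGE, "product-title")
prices = extract_by_class(PRODUCT_PAGE, "price")
print(titles, prices)
```

With a library such as Beautiful Soup the same lookup is considerably shorter; the point here is only that the script depends on knowing which class holds the data.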
Headless browsers
Headless browsers are the main tool used to scrape data that is rendered by JavaScript. Alternatively, web drivers can also be used, as the most widely used browsers have them on offer. Web drivers are significantly slower than headless browsers, as they load pages in a similar manner to regular web browsers. This also means that scraping results might be slightly different in each case. Testing both options and finding the best one for each project might prove to be beneficial.
There are many choices available, as the two most popular browsers both offer headless modes. Chrome and Firefox (68.60% and 8.17% of browser market share respectively at the time of writing) can both be run headless. Outside of the mainstream options, PhantomJS and Zombie.js have been popular choices among web scrapers, although PhantomJS is no longer actively developed. Additionally, headless browsers require automation tools in order to run web scraping scripts. Selenium is one of the most popular frameworks for this purpose.
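A minimal sketch of how this might look with Selenium and headless Chrome follows. It assumes the third-party `selenium` package (version 4 or later) and a local Chrome installation; the helper names `needs_browser` and `fetch_rendered_html`, the `class="price"` marker and the URL are hypothetical. The browser call itself is left commented out so the snippet runs without Selenium installed.

```python
def needs_browser(static_html, marker='class="price"'):
    """Return True when the wanted markup is absent from the plain HTTP
    response, which suggests that it is rendered by JavaScript."""
    return marker not in static_html


def fetch_rendered_html(url):
    """Load a page in headless Chrome and return the rendered HTML.

    Assumes selenium 4+ and Chrome are installed; the import is done
    lazily so the rest of the sketch works without them.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


# Stand-in for a response whose content is filled in by a script.
static_html = '<div id="app"></div><script src="/bundle.js"></script>'

if needs_browser(static_html):
    print("Price not found in static HTML; a headless browser is likely needed.")
    # rendered = fetch_rendered_html("https://example.com/product/101")
```

Checking the static response first keeps the slower browser-based path for the pages that actually need it.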
Data parsing
Data parsing is the process of making the previously acquired data intelligible and usable. Most data gathering methods return results that are hard for humans to read. As such, parsing and creating well-structured results becomes an important part of any data gathering technique.
As mentioned previously, Python is a popular language for pricing intelligence acquisition due to its easily accessible and optimized libraries. Beautiful Soup, lxml and many others are popular options for data parsing.
Parsing allows developers to sort data by searching for specific parts of HTML or XML files. Parsers such as Beautiful Soup come with inbuilt objects and commands to make the process easier. Most parsing libraries make navigating large swaths of data easier by attaching search or print commands to common HTML/XML document elements.
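To illustrate what "well-structured results" can mean in practice, here is a small sketch that turns raw scraped strings into a typed record. The `ProductRecord` fields, the currency table and the sample values are assumptions for the example, and the price cleaner assumes a dot as the decimal separator.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class ProductRecord:
    title: str
    price: Decimal
    currency: str


# Illustrative subset of currency symbols, not a complete table.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}


def parse_price(raw):
    """Split a string such as '$1,299.00' into (amount, currency code).

    Assumes the symbol comes first and '.' is the decimal separator.
    """
    raw = raw.strip()
    currency = CURRENCY_SYMBOLS.get(raw[:1], "UNKNOWN")
    digits = re.sub(r"[^\d.]", "", raw)  # drop symbols and thousands separators
    return Decimal(digits), currency


def to_record(raw_title, raw_price):
    """Normalize whitespace in the title and convert the price."""
    amount, currency = parse_price(raw_price)
    return ProductRecord(" ".join(raw_title.split()), amount, currency)


record = to_record("  Example \n Laptop ", "$1,299.00")
print(record)
```

Storing prices as `Decimal` rather than `float` avoids rounding surprises when prices are later compared or aggregated.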
Data storage
Data storage procedures will generally depend on volume and type. While building a dedicated database is recommended for pricing intelligence (and other continuous projects), for shorter or one-off projects storing everything in a few CSV or JSON files won’t hurt.
Data storage is a rather simple step with few issues, although there is one thing to always keep in mind – cleanliness. Retrieving stored data from incorrectly indexed databases can quickly become a hassle. Starting off on the right foot and following the same guidelines from the get-go will resolve most data storage problems before they even begin.
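Both storage options can be sketched with Python's standard library. The table name, column names and sample rows below are assumptions; an in-memory buffer and an in-memory SQLite database are used only so the example leaves no files behind.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone

# Sample parsed rows: (url, title, price, currency).
rows = [
    ("https://example.com/product/101", "Example Laptop", "1299.00", "USD"),
    ("https://example.com/product/102", "Example Phone", "599.00", "USD"),
]

# One-off project: a CSV file is enough (use open() for a real file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["url", "title", "price", "currency"])
writer.writerows(rows)
csv_text = buffer.getvalue()

# Continuous project: a database with an explicit schema and an index.
conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
conn.execute(
    """CREATE TABLE prices (
           url TEXT NOT NULL,
           title TEXT NOT NULL,
           price TEXT NOT NULL,
           currency TEXT NOT NULL,
           scraped_at TEXT NOT NULL
       )"""
)
# Index the columns used for lookups so retrieval stays fast as data grows.
conn.execute("CREATE INDEX idx_prices_url_time ON prices (url, scraped_at)")

scraped_at = datetime.now(timezone.utc).isoformat()
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?, ?)",
    [row + (scraped_at,) for row in rows],
)
conn.commit()

latest = conn.execute(
    "SELECT title, price FROM prices WHERE url = ? "
    "ORDER BY scraped_at DESC LIMIT 1",
    ("https://example.com/product/101",),
).fetchone()
print(latest)
```

Recording a timestamp with every row is one simple guideline that keeps repeated scrapes of the same URL distinguishable later on.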
Long-term data storage is the final step in the entire acquisition journey. Writing data extraction scripts, finding the required targets, parsing and storing data are the easy parts. Avoiding bot detection algorithms and blocked IP addresses is the real challenge.
Proxies
So far, web scraping might seem quite simple. Create a script, find the suitable libraries and export the acquired data into a CSV or JSON file. Unfortunately, most web page owners aren’t too keen on giving large amounts of data to anyone out there.
Most web pages nowadays can detect bot-like activity and simply block an offending IP address (or an entire network). Data extraction scripts act exactly like bots as they continuously perform a looped process by accessing a list of URLs. Therefore, gathering data through web scraping often leads to blocked IP addresses.
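A scraper usually needs some way to notice that it has been blocked. The heuristic below is only a sketch: HTTP status codes 403 (Forbidden) and 429 (Too Many Requests) are common block signals, while the text markers are illustrative guesses that would have to be tuned for each target.

```python
BLOCK_STATUS_CODES = {403, 429}  # Forbidden, Too Many Requests
# Illustrative phrases only; real block pages differ between websites.
BLOCK_MARKERS = ("captcha", "unusual traffic", "access denied")


def looks_blocked(status_code, body):
    """Guess whether a response is a block page rather than real content."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    text = body.lower()
    return any(marker in text for marker in BLOCK_MARKERS)


print(looks_blocked(429, ""))
print(looks_blocked(200, "<h1>Please complete the CAPTCHA</h1>"))
print(looks_blocked(200, "<h1>Example Laptop</h1>"))
```

A check like this is typically run on every response so that a blocked request can be retried later or through a different IP address.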
Proxies are used to retain continuous access to the same URLs and circumvent IP blocks, making them a critical component of any data acquisition project. Developing a target-specific proxy strategy is essential to the success of the project.
Residential proxies are the type most commonly used in data gathering projects. These proxies allow their users to send requests from regular machines and as such avoid geographical or any other restrictions. Additionally, they maintain the identity of any regular internet user as long as the data gathering script is written in a way that imitates such activity.
Of course, bot detection algorithms work on proxies as well. Acquiring and managing premium proxies is part of any successful data acquisition project. A key component of avoiding IP blocks is address rotation.
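In its simplest form, address rotation means sending each request through the next proxy in a pool. The sketch below shows a round-robin rotation using `itertools.cycle` and `urllib.request.ProxyHandler` from the standard library; the proxy addresses and URLs are placeholders, and the actual network call is left commented out.

```python
import itertools
import urllib.request

# Placeholder proxy endpoints; real ones come from a proxy provider.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]

proxy_pool = itertools.cycle(PROXIES)  # endless round-robin iterator


def build_opener_for(proxy_url):
    """Return a urllib opener that routes HTTP and HTTPS through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def plan_requests(urls):
    """Pair each URL with the next proxy in the rotation."""
    return [(url, next(proxy_pool)) for url in urls]


urls = [f"https://example.com/product/{i}" for i in range(101, 105)]
plan = plan_requests(urls)

for url, proxy in plan:
    opener = build_opener_for(proxy)
    # opener.open(url, timeout=10) would send the request through `proxy`.
    print(url, "->", proxy)
```

Round-robin is only one possible policy; how often to switch addresses depends heavily on the target.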
Unfortunately, proxy rotation issues do not end here. Bot-detection algorithms will vary heavily between targets. Large e-commerce websites or search engines will have sophisticated anti-botting measures requiring different scraping strategies to be utilized.
Proxies against the world
As mentioned previously, rotating proxies is the key to any successful data gathering method, including web scraping. Maintaining an image of a regular internet user is essential if you want to avoid blocked IPs.
Unfortunately, the exact details of how often proxies need to be changed, which type of proxies should be used, etc. are highly dependent on scraping targets, frequency of data extraction and other factors. These complexities make proxy management the most difficult part of web scraping.
While each business case is unique and requires specific solutions, certain guidelines have to be followed in order to use proxies with maximum efficiency. Companies experienced in the data gathering industry acquire a top-tier understanding of bot detection algorithms. Based on their case studies, proxy and data gathering tool providers create guidelines to avoid blocked IP addresses.
As mentioned previously, maintaining the image of a regular internet user is an important part of avoiding an IP block. While there are many different proxy types, none can do this particular task better than residential proxies. Residential proxies are IPs attached to real machines and assigned by Internet Service Providers. Starting off on the right foot and picking residential proxies for any e-commerce data gathering technique makes the entire process significantly easier.
Residential proxies for e-commerce
Residential proxies are used in e-commerce data gathering methods because most of these require maintaining a particular identity. E-commerce businesses often have several algorithms that they use to calculate prices, some of which depend on consumer attributes. Other businesses will actively block or display incorrect information to visitors they deem to be competitors (or bots). Therefore, switching IPs and locations (e.g. going from a Canadian to a German proxy) is of key importance.
Residential proxies are the first line of defense for any e-commerce data gathering tool. As websites implement more sophisticated anti-scraping algorithms and readily detect bot-like activity, these proxies allow web scrapers to reset any suspicions a website gathers about their actions. Unfortunately, there aren’t enough residential proxies out there to keep switching IPs after every request. Therefore, certain strategies need to be implemented in order to use residential proxies effectively.
Proxy rotation basics
Developing a strategy to avoid IP blocks takes time and requires experience. Each target has slightly different parameters on what it considers to be bot-like activity. Therefore, strategies also need to be adjusted accordingly.
There are several basic guidelines for proxy rotation in e-commerce data gathering:
- Default session times (Oxylabs’ residential proxies have it set to 10 minutes) are generally enough.
- Extending session times is recommended if the target is traffic heavy (e.g. the HTML itself weighs 1 MB without any other assets).
- Building a proxy rotator from scratch isn’t required. Third party applications such as FoxyProxy or Proxifier will do the trick for basic data gathering tasks.
- Whenever scraping targets, think about how an average user would browse and act on the website.
- As a default imitation strategy, spending some time on the homepage then on a few (5-10) product pages is a good starting point.
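The guidelines above can be combined into a rough sketch: keep one proxy for the length of a session (10 minutes by default here), and within each session visit the homepage followed by 5 to 10 product pages with pauses in between. The class and function names, the 2 to 8 second pause range and all URLs are assumptions made for illustration.

```python
import itertools
import random
import time


class StickySessionRotator:
    """Keeps one proxy per session and switches when the session expires."""

    def __init__(self, proxies, session_seconds=600, clock=time.monotonic):
        self._pool = itertools.cycle(proxies)
        self._session_seconds = session_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._current = next(self._pool)
        self._started = clock()

    def current_proxy(self):
        if self._clock() - self._started >= self._session_seconds:
            self._current = next(self._pool)
            self._started = self._clock()
        return self._current


def plan_session(homepage, product_urls, rng=random):
    """Homepage first, then 5-10 product pages, each with a pause in seconds."""
    count = min(len(product_urls), rng.randint(5, 10))
    pages = [homepage] + rng.sample(product_urls, count)
    # The 2-8 second pause range is an arbitrary, assumed starting point.
    return [(page, round(rng.uniform(2.0, 8.0), 1)) for page in pages]


# Demonstrate rotation with a fake clock instead of waiting ten minutes.
fake_now = [0.0]
rotator = StickySessionRotator(["proxy-a", "proxy-b"], clock=lambda: fake_now[0])
first = rotator.current_proxy()   # still inside the first session
fake_now[0] = 601.0               # pretend just over ten minutes have passed
second = rotator.current_proxy()  # session expired, so the proxy changes

products = [f"https://example.com/product/{i}" for i in range(1, 21)]
session = plan_session("https://example.com/", products, random.Random(7))
for page, pause in session:
    print(f"visit {page}, then wait {pause}s")  # time.sleep(pause) in a real run
```

For heavier targets, `session_seconds` could be raised in line with the recommendation to extend session times.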
Remember that each target will be different. Generally, the more advanced, larger and important the e-commerce website is, the harder it will be to get through with web scraping. Trial and error is often the only way to create an efficient strategy for web scraping.