Cornelius (Con) Conlon
A smart external data collection policy will not just guard against unethical behavior but will also create other valuable benefits for your B2B company. Besides shielding the company’s reputation, it can greatly improve the effectiveness of your scrapers.
B2B data collection done ethically has a huge number of benefits for the companies that implement it. In the retail space, this includes better data for optimized transactions (price, features, and quality) for both individuals and companies.
The examples where data collection can serve the greater good are endless: educational standards, price comparisons on a range of goods and services, reducing waste in pairing demand to manufacturing output, reducing risk in investment scenarios, and the list goes on.
In terms of the actual collection itself, the characteristics of what’s ethical are easy enough to outline. Ethical data gathering has to be un-intrusive, low impact (low request rates), efficient, thoughtful, and governed by considered practices that are well communicated to development teams.
Meanwhile, unethical data collection occurs, first of all, where data sources are not freely available in the public domain but sit behind paywalls or registered user accounts.
When data is in the public domain, data collection is unethical when it is carried out in a way that:
is indiscriminate: collects everything and does it quickly;
has a disregard for the impact on the site in question;
involves no planning or thought put into how best to gather the information;
demonstrates no effort to consider the impact of various actions on a site in question.
This policy is geared around public domain industry data where no personally identifiable information (PII) is involved.
Distinguishing between ethical and unethical is much easier with an ethical data collection policy in place.
Merit Data and Technology have a team of over two dozen data engineers, developers, and business analysts creating data harvesting tools across all sorts of industry domains. We have seen first-hand how a policy can greatly impact the effectiveness of the scrapers we build. It has helped us flush out examples where database calls were far too frequent, where duplicate data was being constantly gathered, and where poor design decisions led to suboptimal outcomes.
Establishing a common baseline of collection best practices has saved us thousands of pounds in costs and delivers much stronger collection tools across the board.
An ethical data collection policy forces developers to think more deeply about how they will tackle a given source – that pause to think and consider is always a good thing. It will lead to better tech outcomes around each scraper developed – its reliability, efficiency, and design.
It gets all team members on the same page around critical do’s and don’ts in scraper design.
It leads to better code and more easily maintained components – scrapers that violate ethical guidelines are most likely to be ones that are badly designed!
It also forces your customers and data users to think about what exactly they really need.
It can reduce cost dramatically if the scraper has had better design thinking and employs best practices.
It can help avoid lawsuits!
It helps avoid reputational damage and safeguards the value of your company.
Where to start when drafting your first ethical data collection policy? Besides industry specifics, there are some general points that a policy will try to address. As well as strictly ethical elements, companies should also include some “best practice” norms that they wish to see adopted by their developers when they go about building scrapers.
Every source site should have a “design and approach” document filled out before coding begins. This document will examine the specifics of each site and how the site will be approached from both an ethical and an efficiency/reliability point of view.
It is important that the policy is relatively simple and uses quite clear and plain language. This can be a better way of capturing the spirit and the intent of how we should best collect data.
Here are some sample principles that could help you get started:
Don’t gather data that sits behind a paywall or registered user login;
Throttle request rates to the lowest level possible, within the turnaround times required;
Factor in robots.txt instructions and adhere to these as much as possible;
Collect only the data you need; don’t collect everything or overengineer the robots;
Never “add to basket” in e-commerce sites;
Pause collection where frequent 404/505/303 errors arise – examine the scraper to see if it contributes.
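As a rough illustration of how three of these principles (respecting robots.txt, throttling, and pausing when errors become frequent) might look in code, here is a minimal Python sketch. It is not Merit's implementation: the class name `PoliteCollector`, the `example-bot` user agent, the two-second floor, and the rule that any status of 400 or above counts as an error are all assumptions made for the example. Only the standard library's `urllib.robotparser` is used, and no network requests are made.

```python
import time
from collections import deque
from urllib.robotparser import RobotFileParser


class PoliteCollector:
    """Hypothetical helper; names and thresholds are illustrative assumptions."""

    def __init__(self, robots_txt, user_agent="example-bot", min_delay=2.0,
                 error_window=20, error_threshold=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        # Parse robots.txt content that the caller has already downloaded.
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        # Use the slower of our own floor and the site's Crawl-delay, if any.
        site_delay = self.robots.crawl_delay(user_agent) or 0
        self.delay = max(min_delay, site_delay)
        # Rolling record of whether recent responses were errors.
        self.recent = deque(maxlen=error_window)
        self.error_threshold = error_threshold
        self.clock, self.sleep = clock, sleep
        self.last_request = None

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until at least `delay` seconds have passed since the last request."""
        if self.last_request is not None:
            remaining = self.delay - (self.clock() - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request = self.clock()

    def record(self, status):
        """Record a response; statuses of 400 and above count as errors (an assumption)."""
        self.recent.append(status >= 400)

    def should_pause(self):
        """True once the window is full and the error share reaches the threshold."""
        full = len(self.recent) == self.recent.maxlen
        return full and sum(self.recent) / len(self.recent) >= self.error_threshold
```

A scraper would call `allowed(url)` before each request, `wait_turn()` just before sending it, `record(status)` afterwards, and stop to investigate whenever `should_pause()` returns true. The clock and sleep functions are injectable so that the throttle can be tested without real waiting.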
After you have drafted the policy, it’s important for it not to end up in the back of the drawer. Implementation of the policy should be done in a way that is not overly burdensome for the development team. The following approach might make sure the policy is received with goodwill and effectively policed thereafter.
Discuss it with the tech team
Be honest – a policy may make their job harder in the short term. But don’t let up – explain all the benefits that arise from having everybody working to common guidelines and best practices. Push the developers to be creative in meeting the policy – it may make their approach more difficult, but they will find a way.
Create an escalation path
Where the policy cannot be adhered to, ensure that senior staff reviews the developers’ efforts. Sample a percentage of collection scripts each month or each quarter – do a full policy adherence audit on, say, a dozen robots per quarter. Check the Site Evaluation Reports – ensure they are thorough. Check the scripts; check the robots for various aspects of site handling: speed, analytics impact, error handling, etc.
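One simple way to choose which robots to audit is to draw them at random, so that no team can predict which scripts will be reviewed. The sketch below assumes the robots are identified by name; the function name `pick_for_audit` and the default of twelve are illustrative only.

```python
import random


def pick_for_audit(robot_names, sample_size=12, seed=None):
    """Return a random selection of robots for this period's policy audit."""
    rng = random.Random(seed)
    # Deduplicate and sort so the same seed always yields the same selection.
    names = sorted(set(robot_names))
    return rng.sample(names, min(sample_size, len(names)))
```

Passing a fixed `seed` (for example, one recorded in the audit log) makes the selection reproducible if it is ever questioned.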
Motivate with the carrot and the stick
Create consequences for non-adherence – and incentives for really good data collection code. Motivators of both kinds are very important to setting and improving standards.
The above guidelines are universal and suitable for any B2B business that prioritizes ethical data acquisition or is only about to embark on this path. Even though implementing the ethical policy may take some time and resources during the first stages, in the long run, it will be rewarding in many different ways.
I’ve also shared my vision of ethical data collection policies at OxyCon 2021, an annual web scraping conference. The recordings are available on-demand and can be found here.
About the author
Cornelius (Con) Conlon is the Managing Director and Founder of Merit Data and Technology. Merit has been delivering data solutions to clients for over 15 years across various industries, from maritime to construction to fashion and e-commerce. Con comes from a software programming background and has held a number of senior board-level roles in technology businesses over the past 20 years.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.