While countless companies rely on data to drive their strategy, many never implement thorough quality control measures. At least they don't until poor data bites them with flawed insights that lead to incorrect decisions. That is why consistent data quality should be the ultimate goal. As the founder and CTO of The DataWorks, this is the message I want to convey to anyone working in the field of data.
Allen O’Neill, CTO of The DataWorks and Microsoft Regional Director
Our industry gathers data from the web for analysis. We do it to gain insights and get an edge over the competition. The challenge is that great insights need great data. If your data isn't of high enough quality, your insights will be poor; they won't be trustworthy. That's a really big problem.
Someone once told me: if you can scrape data, you can make 1 dollar. If you scrape good quality data and add some value to it, you can make 10 dollars. If you can use that data to predict the future, you can make 100 dollars. It's all about getting from 1 dollar to 100 dollars. This notion underlines why data quality is the topic I want to focus on the most.
I've seen many companies that bought online and offline advertising based on bad data. They targeted the wrong area and lost hundreds of thousands of dollars in revenue.
I've also seen companies being litigated against by their customers when the actions and insights derived from incorrect, low-quality data turned out to be damaging to the customer's business. We think our data quality is great until it isn't; then it bites you, and it becomes your worst nightmare.
Consider the human effort involved in finding the cause of bad data and then fixing its consequences. There's the lost opportunity cost of what those people would have been doing if they weren't trying to fix bad data.
Even if we leave the human cost aside, the computational cost of making bad data good again is considerable. And all of that is before we look at the intangible cost of how bad data quality affects your reputation with customers.
Businesses lose essential contracts, close down, and even go bankrupt because they are unable to get a handle on consistently good data quality. The bottom line is that the cost of not managing the quality of your data is high. Data quality should be treated as a first-class citizen at the highest level of the business.
Most data collection systems are not built with data quality in mind from the very start. That is a huge problem. Data quality can be compromised at many points, starting with something as simple as using an inappropriate IP address to collect the data.
For example, is it appropriate to use an IP based in Europe when you are scraping, say, a website in Canada? Not really. In this case, you might see export pricing, different taxation, and customs information instead of the locally focused information you need.
On the other hand, let's imagine browsing a website in French-speaking Quebec. In this case, it might be appropriate to use a French or Belgian IP, since French is spoken there as well, and there's a lot of interaction and trade between those countries.
This is where a professional proxy partner like Oxylabs really shines. They can advise on the best geolocation to use for a particular website or region, and they can ensure that you’re only using safe, trusted IPs that don’t get blacklisted and are appropriate for the task at hand.
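To make the geolocation point concrete, here is a minimal sketch of what geo-targeted collection might look like in Python. The proxy host, credentials, and the country-in-username convention are illustrative assumptions rather than any specific provider's API; check your proxy partner's documentation for the real syntax.

```python
import requests

# Illustrative placeholders: replace with your proxy provider's real
# endpoint, credentials, and geo-targeting syntax.
PROXY_USER = "username"
PROXY_PASS = "password"
PROXY_HOST = "proxy.example.com:7777"

def fetch_with_geo(url: str, country_code: str) -> requests.Response:
    """Fetch a page through a proxy located in the given country,
    so that pricing, taxes, and language match the local audience."""
    # Many providers encode the target country in the proxy username;
    # the exact format used here is an assumption for illustration only.
    proxy_auth = f"{PROXY_USER}-country-{country_code}:{PROXY_PASS}"
    proxies = {
        "http": f"http://{proxy_auth}@{PROXY_HOST}",
        "https": f"http://{proxy_auth}@{PROXY_HOST}",
    }
    return requests.get(url, proxies=proxies, timeout=30)

# A Canadian retail site is best viewed through a Canadian IP,
# so local pricing and tax information is returned.
response = fetch_with_geo("https://example.ca/product/123", "CA")
print(response.status_code)
```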
Data quality needs to be checked not only at the end of the collection process, as most companies and people do, but at every stage. That is the critical part: you need to measure quality right from the point of data collection all the way through to the endpoint, where it is delivered to the ultimate consumer of the data.
What parameters should you follow? You need to ensure the veracity of the data, confirm that its timeliness meets the needs of the business, and check that it fits the purpose that was intended at the start.
Data quality means different things to different organizations and different people. At its core, it is about measuring whether the data is accurate according to what you consider fit for purpose and determining whether the data is timely.
Most of the data that we collect on the web is very time-constrained. A price, flight availability, a hotel room, a stock value, a news article that has just been posted – all of these things are time-based. Therefore, we need to make sure that all these different parameters are taken into account all the way through data collection.
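As an illustration of checking veracity, fitness for purpose, and timeliness at each stage, here is a minimal sketch in Python. The field names and the freshness threshold are assumptions made for the sake of the example; the point is that the same check can run at collection, transformation, and delivery.

```python
from datetime import datetime, timezone

# Illustrative thresholds and field names; adjust to your own pipeline.
MAX_AGE_SECONDS = 15 * 60          # time-sensitive data (prices, availability)
REQUIRED_FIELDS = {"product_id", "price", "currency", "scraped_at"}

def check_record(record: dict) -> list[str]:
    """Return a list of quality issues for one scraped record.
    Run this at every pipeline stage, not only at delivery."""
    issues = []

    # Fitness for purpose: all fields the consumer needs are present.
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        issues.append(f"missing fields: {sorted(missing)}")

    # Veracity: basic sanity checks on values.
    price = record.get("price")
    if price is not None and (not isinstance(price, (int, float)) or price <= 0):
        issues.append(f"implausible price: {price!r}")

    # Timeliness: the record must be fresh enough for its use case.
    scraped_at = record.get("scraped_at")
    if scraped_at is not None:
        age = (datetime.now(timezone.utc) - scraped_at).total_seconds()
        if age > MAX_AGE_SECONDS:
            issues.append(f"stale record: {age:.0f}s old")

    return issues

record = {"product_id": "A1", "price": 19.99, "currency": "CAD",
          "scraped_at": datetime.now(timezone.utc)}
print(check_record(record) or "record passes all checks")
```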
There's a lot of buzz around artificial intelligence (AI) and machine learning (ML) at the moment. Terms like TensorFlow and deep learning are trending. Many people think these technologies are a magic bullet that will solve all of their problems. Let me tell you something – they're not.
While all of these technologies have their place, sometimes the best bit of "machine learning" when it comes to data quality is simply a piece of well-written database SQL code – old school, plain, and simple.
The basics are not that difficult, and they usually don't require AI or ML. They require that you spend some time understanding what makes your data fit for purpose and tracking it along the entire data pipeline. Ensure that you maintain the same level of quality all the way through.
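To show how far plain SQL can go, here is a minimal sketch using Python's built-in sqlite3 module. The table, columns, and rules are hypothetical; in practice, the same kind of query would run against your own database or warehouse.

```python
import sqlite3

# Hypothetical table of scraped product prices, kept in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (product_id TEXT, price REAL, scraped_at TEXT);
    INSERT INTO prices VALUES
        ('A1', 19.99, '2021-09-01T10:00:00'),
        ('A2', NULL,  '2021-09-01T10:00:00'),   -- missing price
        ('A3', -5.00, '2021-09-01T10:05:00');   -- implausible price
""")

# Old-school quality check: plain SQL that counts records failing basic rules.
quality_query = """
    SELECT
        COUNT(*)                                       AS total_rows,
        SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END) AS missing_price,
        SUM(CASE WHEN price <= 0 THEN 1 ELSE 0 END)    AS implausible_price
    FROM prices;
"""
total, missing, implausible = conn.execute(quality_query).fetchone()
print(f"{total} rows, {missing} missing prices, {implausible} implausible prices")
```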
AI and ML are awesome, but they aren't everything. Get your basics right before you try anything fancy.
The biggest challenges right now are the lack of knowledge, experience, and skills in the industry required to take things to the next level. We need to stop reinventing the wheel at every single step and work together as an industry to share knowledge and improve outcomes for all of our data consumers.
As the industry grows and consumers become more educated in the nuances of web data, they will require higher levels of automation, more sophistication and, most importantly, more added value. The days of saying, "we're going to get that piece of data from that website and present it to you in a CSV file", are gone. Everybody can do that – it is a commodity now. We need to add value on top of that.
Companies that don't constantly innovate and push the envelope, adding value at every point of the data chain, will be left behind. The challenge is how to stop reinventing the wheel, use the industry's best practices, stand on the shoulders of existing industry partners, and let them help you succeed in your particular business.
There are a million uses for web data. There’s enough room for everybody – we don’t need to have one dominant company. If you look at one web page, there could be a hundred different companies that take exactly the same data and pivot it in different ways to give them different insights for different outcomes. There’s no need to fight over it. If we share knowledge, wisdom, and technology – it will be better for everyone.
Data quality is one of those things that you don't pay attention to until it goes wrong – then all hell breaks loose. My colleagues and I have successfully solved some of eCommerce's hardest problems using cutting-edge, AI-driven, patented technology. As an avid technology community contributor and a data world enthusiast, I want to challenge and be challenged by the most difficult web data topics and share the insights I've learned along the way with you.
I shared my thoughts on data quality at OxyCon 2021, an annual web scraping conference. The recordings and presentations are available on demand here.
About the author
Allen O'Neill
CTO of 'The DataWorks' and Microsoft Regional Director
Allen O'Neill is the founder and CTO of 'The DataWorks' and a Microsoft Regional Director. His company, The DataWorks, delivers AI-driven e-commerce web data solutions to top-tier organizations worldwide. He is an avid technical and business community contributor, sharing knowledge whenever anyone will listen. Allen organizes, assists, and speaks at conferences big and small internationally. He has written over 250 articles, and his current readership exceeds 4 million worldwide.