Our world has become a quantified one, with data serving as the objective measure of truth. While the average consumer is happy to tally the health of a food intake in percentages of daily requirement and to count ‘likes’ on social media platforms as a gauge of sociability, the use of quantitative research methodologies has extended beyond the realm of scientific enquiry into the social sciences and traditional humanities disciplines (creating new statistical and data management academic fields along the way), and massive survey data is collected in ever larger volumes to inform decision-making on politics and policy at global, national, regional and now municipal levels. In the business world, data has come to command an authority of its own, promising not only to support interpretation of a corporation’s or market’s past, but also, through predictive analytics, its future, foretold by the new priesthood – the data scientist. More and more – and in more spheres of human activity – we think, analyze, calculate and report using numbers: data presented as invulnerable to the human weakness for bias and, indeed, as scientific fact.
In many ways, our growing reliance on quantifiable truth is a product of new technology capabilities. As online panels, with a little help from the mega social and browser sites, have developed to simplify the laborious and often costly process of data collection, new sensors, including the ubiquitous smartphone, are streaming data in quantities that were formerly unimaginable. And to support this ever-growing volume and variety of data, cheaper storage and new database and storage solutions have evolved to enable more capacity, faster I/O and better management – Hadoop, in-memory, tiering, flash and visualization technologies, to name just a few – of a phenomenon now known generically as ‘Big Data,’ measured in petabytes and exabytes. But does size equal veracity?
While some might argue that data volume covers a multitude of sins, in many businesses that are increasingly inclined to take advantage of insights produced by data, but which lack in-house expertise, the ‘new source of truth’ can be undermined by data quality issues, in addition to storage and access challenges. In other words, the response to a data query is only as good as the data itself. As Vincent Lam, marketing director, Information Builders (IB), put it recently in a webinar entitled “Data Quality Preparation for Better Data Discovery,” while there are a number of tools and dashboards available to help us more readily consume data, “one thing that will render them useless” is ‘dirty data.’
How big is the dirty data quandary? To preface his discussion, Lam introduced some figures that give a good sense of the scope of the problem, which can be introduced through data duplication, confusion of database categories, inconsistent terminology or a slip in data entry. According to Lam, losses from bad data include: $40 billion globally due to bad supply chain data (A.T. Kearney report); an annual average of $8.2 million per corporation surveyed by Gartner, or 15-20% of a company’s operating budget (Larry English), due to data quality issues; and a whopping $3 trillion annually for the US economy as a whole (Hollis Tibbets, Dell).
On the relationship between good data and good decisions, Information Builders may be viewed as a source of good advice: the New York-based firm is a purveyor not only of BI and analytics, but also of data integration and data integrity solutions, with close to 40 years of experience helping clients resolve data issues. As best practice guidance, Lam stressed the importance of creating partnerships between LoB data consumers and IT data managers: “Don't wait till you've already bought software to establish this relationship. Since the interaction of business and IT for data quality is so critical to success, you really want to involve both sides from the onset. Each side will understand their ownership and responsibilities if they're involved from the beginning.”
At a tactical level, Lam also advised adopting a “data quality lifecycle” approach: a simple plan to ensure clean data that covers data discovery, cleansing, remediation and governance. Due to the volume of data typically stored and accessed in corporate repositories, the lifecycle can no longer be managed manually, and is best handled through tools that automate the process. As an example of this automation, Lam demoed Information Builders’ web-based iWay Data Profiler, which evaluates company data to identify duplicates, extremes, numbers that recur when they should not, ‘masks’ (i.e. use of digits to represent a value), and so on. In his example, analysis of a personnel data set uncovered six – as opposed to two – genders, resulting from different policies around data input. The Profiler produces a metric, or report card, showing what percentage of the data “breaks the rules” for individual data sets or aggregates of company data, and can do this over time to generate trend analysis. Ultimately, this information is presented in a dashboard for executives to consider impact and cause analysis and track progress towards better data.
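The kind of discovery the Profiler performs can be illustrated with a minimal sketch: count the distinct values in a field and flag any that fall outside an expected set. This is not IB's iWay Data Profiler, just a toy illustration of the "six genders instead of two" finding, with hypothetical personnel records:

```python
from collections import Counter

def profile_column(rows, field, expected_values=None):
    """Summarize distinct values in one field and flag any outside an expected set."""
    counts = Counter(row[field] for row in rows)
    unexpected = (
        {v: c for v, c in counts.items() if v not in expected_values}
        if expected_values is not None else {}
    )
    return {"distinct": len(counts), "counts": dict(counts), "unexpected": unexpected}

# Hypothetical personnel records with inconsistent input policies for "gender"
people = [
    {"name": "A", "gender": "M"}, {"name": "B", "gender": "F"},
    {"name": "C", "gender": "male"}, {"name": "D", "gender": "Female"},
    {"name": "E", "gender": "m"}, {"name": "F", "gender": "UNKNOWN"},
]
report = profile_column(people, "gender", expected_values={"M", "F"})
print(report["distinct"])          # 6 distinct "genders" instead of 2
print(sorted(report["unexpected"]))
```

A real profiler would run rules like this across every column and roll the violations up into the percentage-based report card described above.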
According to Lam, these metrics offer a quantified percentile of validity (or reliability) for the data that can be used, along with a rules-based engine in IB’s data quality suite, to score data as it is cleansed – the next stage in the process, where data is prioritized and rules are defined specifying what the data should look like, based on standardization and enrichment from other sources. “A score of 0 means the data didn't need to be corrected. Higher scores indicate more corrections or corrections to critical fields. So even when the data is fixed, we can determine how much repair was required. These scores are available for use anywhere and can be presented to users or embedded in underlying automated process logic,” Lam explained. Scores may also be used in conjunction with rules criteria to automate data quality processes in Information Builders’ data quality firewall, which checks information in real time as it is entered, rejecting data that does not meet data quality standards and policies. Lam noted that the firewall is “built for speed”: it is 64-bit optimized and runs as much as possible in memory to avoid the processing delays that can otherwise occur with large data sets. And when the data set is extremely large, as in the case of real-time social feeds, the firewall can scale to run on additional, appropriately sized hardware.
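The scoring-and-rejection pattern Lam describes can be sketched in a few lines. The rules and record below are invented for illustration (they are not IB's rule syntax): each rule tries to normalize a field, reporting a cost of 0 when no repair was needed, a positive cost when it corrected the value, and a rejection when the value is unfixable, which is the firewall behavior:

```python
import re

def normalize_gender(rec):
    """Map common variants to M/F; 0 = clean, 1 = corrected, None = unfixable."""
    mapping = {"m": "M", "male": "M", "f": "F", "female": "F"}
    raw = rec.get("gender", "")
    if raw in ("M", "F"):
        return rec, 0
    if raw.lower() in mapping:
        rec["gender"] = mapping[raw.lower()]
        return rec, 1
    return rec, None

def normalize_zip(rec):
    """Strip non-digits; accept only 5-digit results."""
    original = rec.get("zip", "")
    digits = re.sub(r"\D", "", original)
    if len(digits) == 5:
        rec["zip"] = digits
        return rec, 0 if digits == original else 1
    return rec, None

def cleanse(rec, rules):
    """Score 0 = untouched, higher = more repair; None = firewall rejection."""
    score = 0
    for rule in rules:
        rec, cost = rule(rec)
        if cost is None:
            return None, None   # reject the record outright
        score += cost
    return rec, score

rec, score = cleanse({"gender": "male", "zip": "10121 "},
                     [normalize_gender, normalize_zip])
```

Here both fields needed repair, so the record passes with a score of 2; an unparseable zip would have been rejected instead, mirroring the real-time firewall's accept/repair/reject decision.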
By providing process automation around many of the tasks formerly managed manually by IT – for example, the overnight cleansing of data – Information Builders helps deliver the assurances around data validity that are needed if data is to act as a decision-making tool. At the same time, this automation can help democratize access to and management of information assets. For example, when a value format needs to change, or when different data formats need to be integrated, this can be accomplished via the data quality suite in tandem with IB’s master data and integration capabilities, such as dynamic mapping or data transformation, which are managed within a GUI-based development environment so that custom coding is not required. As Lam explained, both business and IT can use the data quality tools, though LoB typically relies more on the profiler functionality, as this group owns the data, understands how it is to be used, and can identify problems more readily using the tool.
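The idea behind such "no custom coding" transformation is configuration-driven mapping: a declarative spec, of the sort a GUI tool might generate behind the scenes, is applied by a generic engine. The spec and field names below are hypothetical, a sketch of the concept rather than IB's actual format:

```python
# A declarative mapping spec stands in for what a GUI mapping tool might
# generate: each target field names a source field and a transform.
SPEC = {
    "customer_id": {"source": "CustID", "transform": int},
    "joined":      {"source": "JoinDate",
                    # convert DD/MM/YYYY to ISO YYYY-MM-DD
                    "transform": lambda s: "-".join(reversed(s.split("/")))},
}

def apply_mapping(record, spec):
    """Build a target record from a source record using a mapping spec."""
    return {tgt: rule["transform"](record[rule["source"]])
            for tgt, rule in spec.items()}

src = {"CustID": "00042", "JoinDate": "31/12/2019"}
out = apply_mapping(src, SPEC)   # {'customer_id': 42, 'joined': '2019-12-31'}
```

Because the mapping lives in data rather than code, changing a value format means editing the spec (or clicking in the GUI), not rewriting a pipeline, which is what puts these tasks within reach of LoB users.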
Lam’s data quality check list includes:
- Data quality is not just an IT problem; it requires partnership with business users
- Data ownership and standards need to be applied
- Appropriate tools for each party (IT, LoB) should be utilized
- Profiling data will identify the worst “bad data” offenders quickly
- Measure quality with real metrics over time
- Provide visibility to LoB and IT
- Implement a real-time data quality firewall that can handle the volume and variety of data without performance issues
- For an assessment of your information assets, take the Information Builders Data Quality Challenge