These are halcyon days for the analytics/Big Data community. Research from CapGemini/EMC shows that a majority of businesses are planning to increase their investment in Big Data Analytics over the next three years, even though other research, from KPMG/Forrester, shows that only one-third trust the analytics that they receive today. And if these sources seem to be somewhat conflicted, the management gurus’ positioning of Big Data analytics as a “critically important driver of business success” has been consistent across industry leaders, including Bain & Company (the source of the report that used this phrase) and Harvard Business Review, which famously described Data Scientist as “the sexiest job of the 21st century.”
It seems clear that executives will increasingly rely on data analysis to uncover insights that improve segmentation, enhance competitiveness, improve or expand product/service portfolios and generate entirely new revenue streams. But how are these capabilities built and delivered?
In the portrait painted by HBR and others, the data scientist is Yeti-like, a mystical force delivering insights that power analytics-fueled business innovation. But in practice, that delivery is one element in a much longer chain of events. Data scientists (and “InfoApps” that embed their insights into guidance that can be widely deployed and consumed) are positioned at the top of a large pyramid spanning people, processes and technologies that need to align to deliver guidance to corporate activities.
If data scientists are at the top of this hierarchy, it might be fair to position data cleansing at the bottom. Data cleansing is never described as ‘sexy’, and is unlikely to generate gushing speculation from the mandarins at Bain or in the pages of HBR. But data cleansing is the critical foundation for the insight delivered through Big Data analytics. The recommendations of the exalted data scientist are barely different from the pronouncements of soothsayers if the data that informs these recommendations isn’t complete, accurate and trustworthy.
The ‘three I’s’ meet the ‘three V’s’
From a process perspective, data cleansing is the second step in a process-based “three I’s” approach to information management, as Mitchell Ogilvie, solutions architect with Information Builders, noted at a recent seminar at IB’s downtown Toronto offices. The first step, integration, refers to aggregating many different data streams into a coherent body of information. This is essential in Big Data, which by definition (the original “three V’s” proposed by Doug Laney, then of Meta Group, in 2001) includes a variety of structured and unstructured sources – databases, images, video, social, etc. – as an essential input attribute. The other V’s, velocity and volume, also define both Big Data and the challenges in data cleansing. It is relatively easy to schedule cleansing routines for data that is absorbed from batch jobs, but much more difficult to keep pace with real-time feeds; similarly, massive increases in data volumes (from gigabytes to petabytes) force organizations to implement automation and efficient review processes simply to keep pace with Big Data ingestion requirements.
Once diverse data sources have been tied together, the second stage in the three I’s process, integrity, kicks in – and with it, the need for data cleansing. As Ogilvie pointed out, organizations must ensure that “erroneous data doesn’t interfere with and propagate into our intelligence base” – and, as a result, affect the integrity of both the information and the analysis that is based on it.
This seems like a simple dictum, but there are many different considerations that need to be addressed in order to “empower data quality stewards.” Organizations need to establish what is important to data quality in their business context: for example, does the age of a record matter? Is it essential for a record to be complete, or are partial records useful to analysis?
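Questions like these can be written down as explicit checks. The sketch below is illustrative only: the field names, the 365-day age limit and the list of required fields are assumptions made for the example, not standards cited at the seminar or built into any Information Builders product.

```python
from datetime import date

# Hypothetical standards: both values are assumptions for illustration.
MAX_AGE_DAYS = 365
REQUIRED_FIELDS = ("customer_id", "email", "postal_code")


def is_fresh(record, today):
    """Return True if the record was updated within MAX_AGE_DAYS of `today`."""
    return (today - record["updated"]).days <= MAX_AGE_DAYS


def completeness(record):
    """Return the fraction of required fields that hold a non-empty value."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, ""))
    return filled / len(REQUIRED_FIELDS)


record = {
    "customer_id": "C-1001",
    "email": "",                      # missing value
    "postal_code": "M5V 2T6",
    "updated": date(2016, 3, 1),
}
today = date(2016, 11, 4)

print(is_fresh(record, today))         # True: updated less than a year ago
print(round(completeness(record), 2))  # 0.67: two of three required fields
```

Whether a record that is fresh but only two-thirds complete is still useful for analysis is exactly the kind of business decision the standards need to settle.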
Once the standards for data quality are established, data cleansing can be implemented. As Ogilvie explained, data cleansing is not a one-time activity – it is a process that is used to ensure that growing volumes of ingested data maintain the integrity needed to empower the ‘third I’, intelligence, that results from analysis.
Ogilvie began his discussion of data cleansing by positioning it as one of four key elements in the integrity level of the information management spectrum. “Integrity,” in Information Builders’ taxonomy, comprises four major activities: assess, cleanse, remediate and master. Two of these areas, assess and remediate, are best managed by business users who understand the data, its sources and its uses; the other two, cleanse and master, require advanced technical skills and fall into the IT department’s domain. Information Builders’ toolset is designed to provide rich, user-appropriate tools in all four areas, while at the same time enabling the business user/IT collaboration needed to ensure that data is consistent and accurate.
Drilling deeper into the topic, Ogilvie described the data cleansing process as comprising four key steps:
- Data profiling – getting a clear understanding of what data should include; developing the ability “to digest and understand” data that has been aggregated from different sources.
- Rules-based data cleansing – development of data quality rules that can be used to automatically fix common errors that have predictable resolutions. This process provides benefits to both operational and analytical processes.
- Data remediation – data that has issues (because it doesn’t correspond to the data profile) and which can’t be fixed through rules-based data cleansing is then escalated to a data remediation process. In data remediation, data stewards examine inputs that have been flagged for further review and draw upon various sources, such as internal experts or even customers, to correct problems or complete records.
- Real-time data monitoring – as is noted above, data cleansing is a process rather than an event. Real-time data monitoring provides a process-centric view of information, tracking data consistency, validity, accuracy, timeliness and completeness over time, and identifying problem areas that may contribute to avoidable errors.
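The four steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a description of how any Information Builders product works: the profile is a hypothetical Canadian-style postal code pattern, the single cleansing rule and all function names are invented for the example, and the "monitoring" is reduced to one validity metric.

```python
import re

# Data profile (hypothetical): a postal code shaped like "M5V 2T6".
POSTAL_PATTERN = re.compile(r"^[A-Z]\d[A-Z] \d[A-Z]\d$")


def apply_rules(value):
    """Rules-based cleansing: fix errors that have a predictable resolution."""
    value = value.strip().upper()
    # Insert the missing space into a six-character code such as "M5V2T6".
    if len(value) == 6 and " " not in value:
        value = value[:3] + " " + value[3:]
    return value


def cleanse(values):
    """Return (clean, needs_review); values that still fail the profile are escalated."""
    clean, needs_review = [], []
    for raw in values:
        fixed = apply_rules(raw)
        if POSTAL_PATTERN.match(fixed):
            clean.append(fixed)
        else:
            # Data remediation: keep the raw input for a data steward to examine.
            needs_review.append(raw)
    return clean, needs_review


def validity_rate(values):
    """Monitoring metric: share of values that match the profile as received."""
    if not values:
        return 0.0
    return sum(1 for v in values if POSTAL_PATTERN.match(v)) / len(values)


incoming = ["M5V 2T6", " m5v2t6 ", "K1A0B1", "12345", "unknown"]
clean, needs_review = cleanse(incoming)

print(clean)                    # ['M5V 2T6', 'M5V 2T6', 'K1A 0B1']
print(needs_review)             # ['12345', 'unknown']
print(validity_rate(incoming))  # 0.2
```

Tracking a figure such as `validity_rate` for each incoming batch over time is one simple way to see whether a particular source is contributing avoidable errors.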
Quality in practice
At the seminar, Ogilvie demonstrated how Information Builders’ iWay products help businesses to address data quality issues. Using the iWay Data Quality Server, he showed how the system moves beyond data analysis to data repair:
- Pattern-based parsing – e.g., dividing ‘name’ into first and last names.
- Standardization – for example, applying a single code for a common variable (such as gender) across data drawn from different sources.
- Data quality validation – e.g., use of external sources to confirm address and similar information.
- Matching – identification of possible duplicates within a data set. This can use a variety of techniques, including fuzzy logic and sophisticated algorithms.
- De-duplication – identification of the best representation of the target record, and creation of a single “gold record” from it.
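To make these operations concrete, the sketch below shows generic versions of parsing, standardization, matching and de-duplication using only the Python standard library. It does not reproduce the iWay Data Quality Server's behaviour or API: the function names, the gender code table, the 0.85 similarity threshold and the "most complete record wins" rule are all assumptions chosen for illustration, and validation against external sources is left out.

```python
from difflib import SequenceMatcher

# Hypothetical mapping of source-specific gender codes to one standard code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def parse_name(full_name):
    """Pattern-based parsing: split 'First Last' or 'Last, First' into parts."""
    if "," in full_name:
        last, first = [p.strip() for p in full_name.split(",", 1)]
    else:
        first, _, last = full_name.strip().rpartition(" ")
    return first, last


def standardize_gender(code):
    """Standardization: map known variants to one code, 'U' for unknown."""
    return GENDER_CODES.get(code.strip().lower(), "U")


def similarity(a, b):
    """Matching: a simple similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filled_fields(record):
    """Count the non-empty fields in a record."""
    return sum(bool(v) for v in record.values())


def deduplicate(records, threshold=0.85):
    """De-duplication: keep the most complete record among likely duplicates."""
    golden = []
    for rec in records:
        for i, kept in enumerate(golden):
            if similarity(rec["name"], kept["name"]) >= threshold:
                if filled_fields(rec) > filled_fields(kept):
                    golden[i] = rec
                break
        else:
            golden.append(rec)  # no likely duplicate found
    return golden


records = [
    {"name": "Jon Smith", "email": ""},
    {"name": "John Smith", "email": "john@example.com"},
    {"name": "Maria Garcia", "email": "maria@example.com"},
]
print(parse_name("Smith, John"))    # ('John', 'Smith')
print(standardize_gender("Male"))   # M
print([r["name"] for r in deduplicate(records)])  # ['John Smith', 'Maria Garcia']
```

Matching on a name alone is deliberately naive; a production matcher would normally compare several fields and route borderline pairs to a data steward rather than merging them automatically.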
Outputs from these processes provide input to remediation activities, dashboards and processes that flag data for stewards and provide tools for fixing and re-integrating records. The cleansing and remediation results in turn feed into a dashboard-based data governance system that enables monitoring of data quality over time, giving IT and business users a consolidated, detailed view of the state of data used to power business decisions. Separately, the iWay Data Quality Server also integrates with the iWay Data Profiler, a related tool that gives non-technical business users portal-based access to profiles but omits features that require advanced IT skills. This complementary tool further expands the scope of data quality-focused collaboration between IT and business professionals.
In the end, data cleansing, and data itself, can’t really be described as ‘sexy’. Data governance, though, is a component of corporate governance – and corporate governance is an obligation of the Board of Directors, which is responsible both for the business performance that can be enhanced by Big Data analytics and for compliance with management standards that include and rely on data. Data cleansing won’t deliver a sexy, Black Swan “eureka!” moment – but it’s essential to the Yetis who produce these breakthroughs, and to shareholders who care both about data-based operational innovation and about practices that deliver consistent, continuous care for the foundations that underpin these insights and the operations they empower.
Both figures are drawn from the Fast Company article “Why Executives Don’t Trust Their Own Data and Analytics Insights,” November 4, 2016.