Good technology projects begin at the end. Successful implementations are driven by the alignment of a business requirement with available tools; technology that is deployed ‘because we can’ is destined for the rubbish bin of idle IT ventures, no matter how much intrinsic interest the tool may offer. This principle is alive and well in the field of Big Data and AI. Noting at the Big Data and Analytics Summit that AI has been around for 70 years, Next Pathway CEO Chetan Mathur attributed the current surge of interest in advanced analytics to: lower costs for computing, based on cloud investments made by Internet giants such as Google and Amazon; improvements to algorithms; greater availability of analytics talent; and the digital transformation imperative. As preconditions to broader deployment, the first three drivers make a lot of sense; however, in Mathur’s analysis, market uptake of AI is dependent on business need – in this case, enterprise focus on digital transformation as the key to maintaining competitiveness.
But to ensure that AI projects deliver the value that business users anticipate, adopters also need to begin at the beginning. Care must be taken in the early stages of implementation to build trust in data solutions through rigorous attention to key requirements in data management. As Vinay Mathur, chief strategy officer for Next Pathway, explained at the event, “there are lots of shiny new tools and machine learning algorithms out there, but the fundamental thing for these algorithms is they need data to learn. When you are looking to apply these to enterprise-level business questions, you don’t want any bias in the outcomes. Algorithms only know what they get…” In other words, what is fed into an AI system is critical to the validity of its outcomes – ‘garbage in, garbage out’, as the adage goes.
In the rush to value, many organizations today, including analytics users and software providers, place the emphasis on outcomes – BI, analytics, data visualization and reporting tools – without due attention to the organization and integrity of the underlying data resources. The work involved in developing data for analytics projects can be daunting: a commonly cited industry statistic holds that data scientists spend only 20 percent of their time on analysis, while the other 80 percent goes to more prosaic activities, including finding, cleansing and organizing the data. And much of this work is carried out through manual processes – tagging, cleaning, and formatting data for use by the business. As a foundation for digital transformation, which relies on the analysis of massive and growing volumes of data from disparate systems, manual data management offers limited appeal or potential.
Next Pathway is looking to square this circle with “Data as a Service,” a platform and methodology that automates many data management processes to help user organizations achieve clean, secure and standardized data formats, with the ultimate goal of ensuring data governance, reusability and accessibility. That’s a tall order, one Next Pathway aims to fulfill with three solution sets that, in theory, follow platform users as they evolve through the data ‘journey’. Describing this evolution, Vinay outlined the discrete stages that organizations cycle through:
Data collection, which must encompass structured and unstructured data in a timely, secure and affordable manner.
Data preparation, which involves securing the data and applying metadata to align it with specific business groups and capabilities, so that even non-technical people can access the data through self-service.
Data storage, with clean, governable data exposed through REST APIs to ensure that it is reusable across the organization (a minimal sketch of these stages follows below).
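Taken together, the stages describe a pipeline simple enough to sketch in a few lines. The fragment below is a hypothetical illustration of the first two stages, collecting records and tagging them with business metadata so they can be discovered through self-service; none of the names reflect Next Pathway's actual platform, and a governed store fronted by REST APIs (the third stage) would sit behind a catalogue like this one.

```python
# A minimal, hypothetical sketch of the first two stages: collect raw
# records, then prepare them by attaching the business metadata that makes
# self-service discovery possible. All names are illustrative.

from datetime import datetime, timezone

# Stage 1: collect raw records (a hard-coded sample standing in for
# structured or unstructured feeds landing in the data lake).
raw_records = [
    {"cust_id": "C001", "acct_bal": "1520.75"},
    {"cust_id": "C002", "acct_bal": "98.10"},
]

# Stage 2: register the dataset with the metadata consumers will search on.
CATALOG = {}

def tag_dataset(name, records, business_terms, owner, sensitivity):
    """Attach business metadata so non-technical users can find the data."""
    CATALOG[name] = {
        "records": records,
        "business_terms": business_terms,
        "owner": owner,
        "sensitivity": sensitivity,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

tag_dataset(
    "retail_customer_balances",
    raw_records,
    business_terms=["Customer", "Account Balance"],
    owner="Retail Banking",
    sensitivity="confidential",
)

print(CATALOG["retail_customer_balances"]["business_terms"])
```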
According to Vinay, this staged approach helps solve many of the Big Data challenges that organizations face as they look to turn information assets into business value. Next Pathway’s Cornerstone data lake management solution, for example, is a platform designed to support the collection and standardization of data. While first-generation data lakes offered the potential to quickly store massive amounts of data in multiple formats, as Vinay explained, “this is where data lake 1.0 has really failed. Companies were pumping in data, without attaching any metadata. The consumer would have no idea how to access any of it, because they have no idea what is in the data lake.” In contrast, Next Pathway’s Fuse solution can “physicalize an enterprise domain model,” mapping data to business capabilities in an automated metadata solution that uses terminology business users recognize, so they are able to access the data. In this process, industry standards such as BIAN (for banking) are also used to develop consistency and reusability.
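To make the idea of mapping physical data to business capabilities concrete, the toy sketch below renames and types a source record so that it conforms to a canonical model. The model, the source mapping, and the banking-flavoured business terms are assumptions made for illustration, not Fuse's actual implementation.

```python
# Illustrative only: a toy mapping of physical columns to a canonical
# domain model. The model, mapping, and business terms are assumptions.

DOMAIN_MODEL = {
    # business term             -> (canonical column, expected type)
    "Customer Identifier":       ("customer_id", str),
    "Current Account Balance":   ("current_account_balance", float),
}

# How one source system's physical columns map to the business terms.
SOURCE_MAPPING = {
    "cust_id": "Customer Identifier",
    "acct_bal": "Current Account Balance",
}

def conform(record: dict) -> dict:
    """Rename and coerce a source record so it matches the domain model."""
    conformed = {}
    for physical_name, value in record.items():
        term = SOURCE_MAPPING.get(physical_name)
        if term is None:
            continue  # untagged columns are dropped (or quarantined)
        canonical, expected_type = DOMAIN_MODEL[term]
        conformed[canonical] = expected_type(value)
    return conformed

print(conform({"cust_id": "C001", "acct_bal": "1520.75"}))
# {'customer_id': 'C001', 'current_account_balance': 1520.75}
```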
Vinay described “physicalizing the models” as a technical process that changes the structure and the values of data so that it conforms to the model. “Selecting an enterprise business glossary and defining critical data elements is a necessary step that needs to be done irrespective of data collection. Once a glossary tool – Collibra or Alation, for example – is selected, the enterprise has to consult with business partners on what terms are most relevant, defining 50 to 100 critical data elements that will be put into the glossary.” Historically, tagging and mapping data to business capabilities has involved coding: organizations would write ETL code manually, an inefficient approach that Next Pathway has now automated. “Metadata is really the holy grail of automating a lot of the enrichment and preparing of data,” he added.
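The automation Vinay describes can be pictured as code generation from metadata: rather than a developer hand-writing ETL for each source, a standardization statement is produced from the glossary mapping. The sketch below is a hedged illustration of that pattern; the critical data elements, table names and column names are hypothetical.

```python
# A sketch of metadata-driven automation: a standardization view is
# generated from the glossary mapping instead of being hand-coded.
# All element, table, and column names are hypothetical.

CRITICAL_DATA_ELEMENTS = {
    # glossary term             (source column, standardized column)
    "Customer Identifier":      ("cust_id",  "customer_id"),
    "Current Account Balance":  ("acct_bal", "current_account_balance"),
    "Account Open Date":        ("open_dt",  "account_open_date"),
}

def generate_standardization_view(source_table: str, view_name: str) -> str:
    """Emit the SQL a developer would otherwise write by hand."""
    select_clauses = [
        f"  {source_col} AS {standard_col}"
        for source_col, standard_col in CRITICAL_DATA_ELEMENTS.values()
    ]
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        "SELECT\n" + ",\n".join(select_clauses) + f"\nFROM {source_table};"
    )

print(generate_standardization_view("raw.core_banking_extract",
                                    "curated.customer_balances"))
```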
Standardization of metadata aims to address another key data issue for enterprises: information silos. In most enterprises, analytics does not begin with the ingestion of new resources; rather, silos have built up within operations over time, a challenge that the Next Pathway platform addresses with a tool that automates the integration of different data stores. As Vinay explained, when an enterprise is migrating or decommissioning a data warehouse or an old data store, the data can be retrieved by Cornerstone, but there is also a lot of application code – SQL or fit-for-purpose code – in place to manipulate the data on the box. Though this code will not be used in the new repository, the enterprise will still need applications that perform the same transformations in the new data lake. Next Pathway’s Shift product takes the old code and converts it to a language that can run inside Hadoop, for example: “It’s not a 1:1 mapping,” Vinay noted. “Old systems have their own language, their own SQL format – you can’t just lift and shift – you need to translate it, and that’s what Shift does.” Based on a business rules engine, Shift contains an algorithm that performs the translation. “Once you feed it some source grammar that we support, it’s very easy, as long as we also support the target grammar (R, SQL, Spark). In cases where it’s proprietary or custom code, you can apply the business rules that run those exceptions, and the system will learn over time how to manage the technology,” Vinay added.
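As a miniature illustration of what rule-based translation looks like, and emphatically not Shift's actual engine, the sketch below rewrites a few Teradata-flavoured constructs into Spark-SQL-friendly equivalents. A production translator parses the full source grammar rather than pattern-matching, which is why proprietary or custom code needs the exception-handling business rules Vinay mentions.

```python
# Toy rule-based SQL dialect translation (not Next Pathway's Shift engine).
# Source: Teradata-flavoured SQL; target: Spark-SQL-friendly output.

import re

# Each rule rewrites one legacy construct the target engine does not accept.
TRANSLATION_RULES = [
    (re.compile(r"\bSEL\b", re.IGNORECASE), "SELECT"),               # Teradata shorthand
    (re.compile(r"CREATE\s+(MULTISET|SET)\s+TABLE", re.IGNORECASE),
     "CREATE TABLE"),                                                # table-kind keywords
    (re.compile(r"\s+WITH\s+DATA\b", re.IGNORECASE), ""),            # Teradata CTAS suffix
]

def translate(legacy_sql: str) -> str:
    """Apply each rewrite rule in order and return the translated statement."""
    translated = legacy_sql
    for pattern, replacement in TRANSLATION_RULES:
        translated = pattern.sub(replacement, translated)
    return translated

legacy = ("CREATE MULTISET TABLE curated.balances AS "
          "(SEL cust_id, acct_bal FROM staging.balances) WITH DATA")
print(translate(legacy))
# CREATE TABLE curated.balances AS (SELECT cust_id, acct_bal FROM staging.balances)
```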
“To us, metadata is equal to or more important than data,” Vinay concluded. If organizations can standardize the way data is collected and the way metadata is captured, and speed these processes through automation, consumers of data services will know what is available to them and can access the information they need more quickly. Next Pathway has found that while business users may not necessarily understand the importance of issues around data cleansing, they do feel the need to access reports more quickly. With metadata and familiar terms in place, they may, for example, enter a data lake to see whether the data is available and run a search without writing code. Since ETL coding is not needed with the Next Pathway platform in place, as long as business analysts know where their source data is, they can easily access it.
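Self-service access of this kind is typically delivered as a thin service over the catalogue, so the analyst’s “query” is a search by a familiar business term rather than code. The sketch below uses Flask purely as a stand-in to show the idea; the endpoint, catalogue contents and field names are assumptions, not the platform’s actual API.

```python
# A hypothetical self-service lookup: analysts search by business term over
# HTTP instead of writing ETL or SQL. Flask and all names are assumptions.

from flask import Flask, jsonify, request

app = Flask(__name__)

# Toy catalogue: curated datasets and the business terms they are tagged with.
CATALOG = {
    "retail_customer_balances": {
        "business_terms": ["Customer", "Account Balance"],
        "location": "s3://curated/retail/customer_balances/",
        "owner": "Retail Banking",
    },
}

@app.route("/datasets", methods=["GET"])
def search_datasets():
    """Return every dataset tagged with the requested business term."""
    term = request.args.get("term", "")
    matches = {name: meta for name, meta in CATALOG.items()
               if term in meta["business_terms"]}
    return jsonify(matches)

# Example query once the app is running (e.g. via `flask run`):
#   GET /datasets?term=Account%20Balance
```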
According to Next Pathway, people and politics often present a larger hurdle in Big Data deployments than the technology does. While there are tools to manage structured and unstructured data, depending on the maturity of the organization, more tension can emerge around decisions on which solutions will be adopted across the organization, what the processes will be, and what terminology will be used to standardize data. At the event, the importance of governance issues to data transformation was reinforced by Craig Wickett, SVP, international technical services at Scotiabank, who described the bank’s Data as a Service Roadmap. As with many good technology projects, Scotiabank began its data journey with business need – a modernization effort to support Agile methodologies and cloud-native apps, and to leverage existing information assets. But the precursor to investment in a Big Data strategy based on automation was, Wickett noted, a return to the beginning: “the need to get agreement on a lexicon and on models.” For Scotiabank, DaaS is the end-to-end automation of the data process, and key to this is standardization. So far, the bank has completed a great deal of work on data ingestion, the creation of logical data models, and ETL, but has yet to enable its end users to derive value and utility from DaaS. With governance in place and Next Pathway as a partner, it is moving towards its “target state” – the automation of data streams across the organization.