Working with a variety of organizations on consulting engagements and speaking with companies about their analytics initiatives, I have learned firsthand that irrespective of project size or maturity, companies face challenges integrating and managing their data assets. This goes well beyond analytics and BI implementations and applies to any initiative involving the movement and storage of disparate data sources. Organizations understand the value proposition of managing their data effectively but struggle with the steps involved in capturing, structuring, transforming, and maintaining access to valid and reliable information. Consequently, data integration activities demand significant time and effort. Without the proper processes, projects take longer and data architectures end up more complex than they need to be, and maintenance, if not managed efficiently, causes further issues over time. Some of these data challenges include:
- Effectively collecting source data on a regular basis
- Accounting for data structure changes or additions to fields or tables
- Integrating disparate data sources into a centralized database
- Identifying business rules across business units (entities) and ensuring that an accurate version of the truth applies to each
- Managing customer expectations related to delivery, access, and security
Each of these areas only scratches the surface of the complexities involved in data integration projects. A certain level of understanding is required for organizations to overcome potential stumbling blocks while addressing the complexities of data management across a broad array of data assets. The challenges I see on a regular basis include both business and technical roadblocks that make it harder for companies to manage their data integration requirements effectively. Organizations face common challenges in the following seven areas:
Purpose of integration
Integration requirements change depending on the goal of the project, whether it is Business Intelligence, data consolidation, Big Data, ERP, or something else. The first step in any data integration project is to understand the desired outcome. Knowing what an organization wants to achieve supports the design and development of the end solution. Once the purpose is defined, the right architecture can be selected. For instance, an operational intelligence solution needs a platform that can support real-time/right-time data streaming as well as the types of analytics required. This entails understanding where the algorithms will be applied and how data needs to be uploaded and stored. Some database structures support highly granular levels of detail, while others are geared toward storing aggregates or performing analytical functions. Integrating data within a hub to support Big Data storage or transactional solutions will require different architectures, which in turn shape the processes involved to extract, transform, and load (ETL) data. Unfortunately, organizations sometimes expect they can use the same framework for different purposes and end up creating roadblocks to their own success.
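To make the ETL pattern concrete, here is a minimal batch sketch in Python. The source and target tables, the connection objects, and the currency-normalization rule are all hypothetical, chosen only to illustrate the three stages; a real pipeline would depend entirely on the platforms involved.

```python
import sqlite3

# A minimal extract-transform-load (ETL) sketch. Table names, columns,
# and the conversion rates are illustrative assumptions, not a real schema.

def extract(source_conn: sqlite3.Connection):
    # Extract: pull raw rows from the source system.
    return source_conn.execute(
        "SELECT order_id, customer_id, amount, currency FROM orders"
    ).fetchall()

def transform(rows):
    # Transform: apply a business rule -- here, normalizing every
    # amount to a single currency using assumed exchange rates.
    rates = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}
    return [
        (order_id, customer_id, round(amount * rates[currency], 2))
        for order_id, customer_id, amount, currency in rows
    ]

def load(target_conn: sqlite3.Connection, rows):
    # Load: write the conformed rows into the target database.
    target_conn.executemany(
        "INSERT INTO orders_conformed (order_id, customer_id, amount_usd) "
        "VALUES (?, ?, ?)",
        rows,
    )
    target_conn.commit()
```

A batch solution would run these three steps on a schedule; an operational intelligence solution would instead apply the same transform to each record as it streams in, which is exactly why the purpose has to be settled before the architecture is.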
Platform
With increasing cloud adoption, organizations need to decide whether data will be stored on-premises, in the cloud, or in a hybrid of the two. In the past, SaaS (Software-as-a-Service) created the façade of easy data integration without much work required from the organization itself. The truth of that assumption depended mostly on the services being used and on how the integration activities were supported. Cloud storage may require different integration approaches than those needed for on-premises solutions. In many cases, cloud-based solutions are built on open platforms, whereas organizations may have multiple proprietary platforms in-house, creating a different set of steps to achieve data movement.
Proprietary data
The types of data being sourced and where data is stored have a big influence on the complexity of data integration. Although the idea behind open source is to make source code accessible to anyone, and hence integration easier, the opposite is true for many proprietary data sources. In many projects, organizations will be working with a variety of source and target systems and will need to identify the best way to integrate data, as well as which sources will require more work. Assessing this before a project starts helps an organization set realistic expectations about the effort involved. Additionally, there may be limitations on what can be done, on how information should be structured, and on the types of data sources supported. Each of these considerations may affect the tools used to enable broader data integration.
Skill sets
Although not a technical requirement of data integration itself, the right skill sets are needed to create effective data integration processes. Source systems, integration, platform support, existing connectors, and so on are all areas that require expertise. Getting data from a source system into another database requires APIs or data extracts in a format that the target system can read. Some transactional solutions or database technologies have partnerships with each other that make the process easier. If organizations lack expertise with the specific solutions being used, they may have to develop workarounds to get data from one location to another, such as the file-based extract sketched below. Over time, this lack of in-depth knowledge can lead to an efficiency deficit, longer timelines, and project setbacks. Additionally, data might end up structured in a way that is not optimal for the target system, creating performance bottlenecks.
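As a simple illustration of that kind of workaround, the sketch below dumps a source table to CSV, a neutral format that almost any target system can read. The table, columns, and file name are hypothetical.

```python
import csv
import sqlite3

# A sketch of a file-based extract: when no direct connector exists,
# export the source data to CSV so the target system can ingest it.
# The customers table and output path are illustrative assumptions.

def export_to_csv(source_conn: sqlite3.Connection,
                  path: str = "customers_extract.csv"):
    rows = source_conn.execute(
        "SELECT customer_id, name, email FROM customers"
    )
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "name", "email"])  # header row
        writer.writerows(rows)
```

Workarounds like this do move the data, but each one is another hand-built process that someone has to understand and maintain, which is how the efficiency deficit accumulates.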
APIs
Where possible, organizations should take advantage of available APIs. The goal of connectors is to make the job of data integration easier. Without them, organizations may have to copy extracted data manually into a target database or develop their own processes to move data from one location to another, neither of which is effective for automated batch or real-time data loading over time. In addition, many vendor partnerships are built on the premise of solutions working together seamlessly, based on the current customer base and on how the solutions are used in combination. Leveraging APIs also makes the overall implementation process easier because organizations do not have to reinvent the wheel.
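For illustration, here is a hedged sketch of API-based extraction in Python using the widely available requests library. The endpoint, authentication token, pagination scheme, and response shape are assumptions for the example; every vendor API defines its own.

```python
import requests

# A sketch of pulling records from a source system's REST API page by
# page, rather than hand-copying extracts. The /v1/invoices endpoint,
# bearer token, and response envelope are hypothetical.

def fetch_all(base_url: str, token: str):
    records, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url}/v1/invoices",
            headers={"Authorization": f"Bearer {token}"},
            params={"page": page, "per_page": 100},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()["data"]  # assumed response shape
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records
```

Once data arrives this way, the same kind of load step shown earlier can write it to the target, and the whole path can be scheduled or triggered without manual copying.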
Scalability and design
All of these challenges affect the design of the database, whether for Big Data storage, analytics, or an MDM (Master Data Management) hub. How data integration will be managed over time, and how the information will scale, becomes a major determinant of project success. This involves looking not only at data storage but also at how the data is structured. Transferring data from one source to another with transformations has implications for analyzing issues in depth, and getting a holistic view of the organization means developing business rules that take different viewpoints into account to ensure data accuracy. Managing this requires that the IT manager consider not only how the solution is designed but also whether it can scale. New data sources over time, historical data capture, additional analytics, change data capture (CDC), and slowly changing dimensions will not only affect the amount of storage required but may also affect performance over time. Understanding how data integration and data storage will affect future query performance and flexible design requirements is key.
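To ground one of those terms, here is a sketch of a Type 2 slowly changing dimension update in Python: rather than overwriting a changed attribute, the current row is closed out and a new versioned row is inserted, preserving history. The dim_customer layout with valid_from, valid_to, and is_current columns is a common convention, but it is assumed here, not taken from any particular system.

```python
from datetime import date
import sqlite3

# A Type 2 slowly changing dimension (SCD) sketch: history is preserved
# by versioning rows instead of updating them in place. The table layout
# is a hypothetical but conventional one.

def apply_scd2_change(conn: sqlite3.Connection, customer_id,
                      new_address, today=None):
    today = today or date.today().isoformat()
    # Close out the currently active version of this customer.
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (today, customer_id),
    )
    # Insert the new version as the current row.
    conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, address, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_address, today),
    )
    conn.commit()
```

Every change adds a row, so a Type 2 dimension grows with the rate of change in the business, which is precisely the storage and query performance consideration raised above.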
Expectations
Although organizations understand the value their data can bring once it is analyzed, many still struggle with the value proposition of spending money on data integration solutions that act as a conduit between source data and target databases, focusing instead on automating data movement across systems to increase general efficiencies. The reality, however, remains the same: data integration is complex, and to do it effectively, organizations need to evaluate all of their data sources, how those sources are structured, what transformations are required to load data properly for analytics, how data should be stored, when data needs to be delivered, and how to manage data integration on an ongoing basis in a way that supports business needs.