Tackling the ‘data dump’ challenge

In the coming era of AI and machine learning, Information Builders argues that enterprises need the right data management platform to ensure Hadoop becomes a data "lake" rather than a data "swamp."

Big Data initiatives are multiplying at an unprecedented rate. The near-ubiquitous discussion of IoT and the accelerating adoption of AI and machine learning are putting pressure on Big Data and analytics to perform.

But are enterprises able to keep pace? This was the question posed by Brent Bruin, senior systems engineer at Information Builders, at the third annual Big Data & Analytics Summit Canada held recently in Toronto.

Bruin’s presentation, Accelerating Big Data Initiatives, specifically addressed the role of the Hadoop ecosystem in Big Data integration projects; his conclusion was that skills gaps and increased data complexity can hinder rather than speed Big Data solution deployment.

The key to improving the prospects for success, he argued, is to streamline data acquisition so that teams can focus on data science instead. The biggest challenge in achieving that lies on the data modelling side of the equation: "People may or may not understand how their data gets stored."

Herein lies the challenge. As an open source development initiative, Hadoop is a "big bucket distribution platform" that was designed to store both structured and unstructured data, Bruin explained. "It handles unstructured data very well, but if you have a lot of unstructured data, you will want to be able to work with it."

There is no question Hadoop excels as a storage platform, he added. “It’s great if you want to put stuff there and forget it. What it is not out-of-the-box is a data management platform. You have to spend a lot of time on programming, and if you don’t have the skill sets in house, you have to go and hire new resources to do it.”

Hadoop developers are not easy to find, and they are not cheap, Bruin said. "It hearkens back to the Y2K days when experts were in short supply and commanding exorbitant salaries and hourly rates. There was a real gap in the marketplace for programmers and no help from other sources. We can assume the same today with Hadoop. It is complex and has a lot of moving parts. I think about it this way: a Teradata or Oracle implementation is like a 747 flying in the sky. Hadoop, on the other hand, could be 12 million parts flying in close formation; you have to know a lot about those individual parts to be able to talk to and coordinate them."

Despite the challenges, a platform such as Hadoop holds great appeal for organizations from a cost perspective, said Ganesh Iyer, iWay solutions specialist for Information Builders. “With Hadoop you can have a massive computing environment that’s run on commodity hardware using open source technology.”

Ganesh Iyer, pre-sales solution specialist, Enterprise Software, Information Builders

Iyer cited the example of a software-as-a-service company that provisioned a separate database for each customer at onboarding. "Over a period of time, it had amassed thousands of SQL Server-based databases, resulting in a significant challenge in implementing analytics across its customer base. A traditional enterprise data warehouse strategy would involve consolidating the data in SQL Server at astronomical cost. In that case, it would be far better to leave the databases as they were and move the information into a Hadoop cluster at a fraction of the cost."

Having a Hadoop cluster is one thing. Making the best use of it is something else, Iyer said. “Up to 70 percent of the complexity in any data management project is getting data into the right format. Business intelligence is much easier once you do that.”
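To illustrate the kind of formatting work Iyer is describing, here is a minimal Python sketch (the field names and record layout are hypothetical, not drawn from any Information Builders product) that normalizes heterogeneous customer records into one consistent shape before they land in the cluster:

```python
# Hypothetical example: records arriving from per-customer databases often
# drift apart in naming and typing; normalize them into one target layout.

def normalize_record(raw: dict) -> dict:
    """Map differently named source fields onto a single consistent schema."""
    return {
        # Accept either naming convention for the customer identifier.
        "customer_id": str(raw.get("customer_id") or raw.get("CustID") or ""),
        # Trim and lowercase emails so joins and deduplication work downstream.
        "email": (raw.get("email") or raw.get("Email") or "").strip().lower(),
        # Coerce amounts to float regardless of how the source stored them.
        "amount": float(raw.get("amount") or raw.get("Amt") or 0.0),
    }

records = [
    {"CustID": 101, "Email": " Alice@Example.com ", "Amt": "42.50"},
    {"customer_id": "102", "email": "bob@example.com", "amount": 10},
]

normalized = [normalize_record(r) for r in records]
print(normalized[0]["email"])  # alice@example.com
```

Multiply this mapping across thousands of source databases with their own quirks, and the "70 percent of the complexity" figure becomes easy to believe.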

Information Builders’ iWay Big Data Integrator is an example of an offering designed to streamline that process. It “insulates” organizations from the complexity of Hadoop by providing a native approach to data integration and management, Iyer explained. Organizations are able to support any kind of Big Data integration use case by incorporating Sqoop for data replication, capture and export; Flume for streaming and unstructured data acquisition; and iWay Service Manager for ingestion of non-Hadoop data, such as transactions, messages, IoT and other data sources.

The end result is a shorter learning curve as organizations don’t need to know the minutiae. “That’s because it all runs natively in the cluster,” Iyer said. “Optimizing the ingestion process in turn allows the organization to scale faster and achieve total cost of ownership sooner. The most important thing is that nothing we are doing is proprietary. Hadoop code is being generated behind the scenes.”

Another benefit of the Big Data integration approach is that it prevents a tool like Hadoop from becoming a dumping ground for data, he added. “If an organization is not verifying data, it becomes a swamp rather than a lake. Embedded data quality technology with CDI [customer data integration] ensures the data is clean to start with.”
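The "swamp versus lake" distinction comes down to verifying data before it is stored. A minimal Python sketch of that idea (the validation rules and field names are hypothetical, not the embedded data quality technology Iyer refers to): records that pass simple quality checks are ingested, and the rest are quarantined for review.

```python
# Hypothetical validation-on-ingest: keep the lake clean by checking each
# record before it is written, quarantining anything that fails.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid(record: dict) -> bool:
    """Simple quality gate: required fields present and well-formed."""
    if not record.get("customer_id"):
        return False
    if not EMAIL_RE.match(record.get("email", "")):
        return False
    return True

def ingest(records):
    """Split incoming records into clean (store) and quarantined (review)."""
    clean, quarantine = [], []
    for r in records:
        (clean if is_valid(r) else quarantine).append(r)
    return clean, quarantine

clean, quarantine = ingest([
    {"customer_id": "1", "email": "a@example.com"},
    {"customer_id": "", "email": "broken"},
])
print(len(clean), len(quarantine))  # 1 1
```

Without a gate like this at the point of ingestion, bad records accumulate silently, and the cost of cleaning them surfaces only later, at analysis time.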

The timing around this kind of approach is becoming even more critical as AI moves mainstream, Bruin noted. “Microsoft CRM and Salesforce have rolled out announcements that AI will be their decision-making engine. To do that you need data quality tools or services. A tool such as Hadoop, combined with a progressive data management development tool, facilitates that massive collection, aggregation and storage to allow organizations to step into those realms.”

“People have been looking at AI and machine learning for ages,” Iyer agreed. “The problem was there was never enough information and computing power and the platforms were not robust enough to achieve it. Now, with the right data integration approach, they can.”