Beyond Hadoop: distributors prepare to do battle for next-generation data platforms

By Matt Aslett; special to InsightaaS from 451 Research

InsightaaS perspective: 451 Research is one of the world’s leading sources of insight into cutting-edge technologies, especially in areas that are important to InsightaaS and our principals, including cloud, analytics, and sustainable IT. InsightaaS works with 451 Research to bring occasional thought leadership pieces to our readers. This piece, "Beyond Hadoop: distributors prepare to do battle for next-generation data platforms," illustrates why we are committed to this practice. In it, 451 research director Matt Aslett provides insight into how the firms involved in the expanding Hadoop ecosystem (Cloudera, MapR, Pivotal and Hortonworks) are gearing up to move Hadoop into mainstream applications and IT environments. As these organizations move to deliver "next-generation data management platform(s) designed to support multiple data-processing workloads" via PaaS subscription models, IT management, executive management and IT suppliers all need to take stock of how Big Data may impact their current approaches to their businesses, and how they will work with or around these Hadoop offerings.

Note: if you are interested in obtaining a subscription to the 451 Research Data Management & Analytics program, please contact 451 directly, or contact InsightaaS at


In May 2010, we described Apache Hadoop distributor Cloudera as the elephant in the data-warehousing room, explaining how the company was positioning Hadoop as complementary to existing analytic databases but speculating that over time, Hadoop would be seen to provide greater competition for the incumbent data management suppliers. Since then, Cloudera and the other Hadoop distributors that emerged in the interim have been trying to avoid stepping on the toes of the database giants — tiptoeing around them and forming partnerships to help establish Hadoop's presence in enterprise datacenters for storing, processing and analyzing structured and unstructured data as a complement to the data warehouse.

It is significant, therefore, that on June 4, Cloudera swapped its soft-soled shoes for a pair of hobnailed boots and announced its intention to challenge the incumbent data management providers by positioning Hadoop as the focal point of next-generation data management platforms and calling on enterprises to 'unaccept the status quo.' Questionable grammar aside, this positioning is the inevitable consequence of Cloudera expanding its purview beyond simply being seen as a distributor of Hadoop for batch-based data processing. With Cloudera Enterprise, it has assembled what could now best be described as a multi-purpose data-processing and analytics platform.

Beyond Hadoop
The latest addition to Cloudera Enterprise is Cloudera Search, an Apache Solr-based integrated search engine for exploration of data stored in Hadoop and HBase. It joins a growing portfolio of capabilities that already include Cloudera Manager for systems management (which, together with Cloudera's Distribution including Apache Hadoop and support, forms Cloudera Enterprise), as well as Cloudera Navigator for data management and governance; RTD (Real-Time Delivery) for Apache HBase support; RTQ (Real-Time Query) for native SQL processing based on Cloudera Impala; and the self-explanatory BDR (Backup and Disaster Recovery) — all of which are available as subscription add-ons.

Cloudera is by no means the only Hadoop specialist expanding its capabilities beyond the Hadoop distribution. MapR has also added (or is in the process of adding) integrated search and discovery, real-time data processing in HBase and real-time native SQL-based analytics to its MapR Platform for Apache Hadoop. The platform includes a number of differentiating capabilities designed to improve performance and reliability, such as the MapR Control System dashboard, Direct Access NFS to mount the cluster as an NFS volume, and mirroring and high availability. MapR is also in the process of testing support for the open source Storm event-processing project.

Meanwhile, Pivotal is being spun off from EMC and VMware with the aim of launching a PaaS stack that builds on VMware's cloud fabric, as well as a data fabric that incorporates capabilities such as stream processing, in-memory data processing, SQL-based analytics, data warehousing, and data visualization and analytics, all built around a single 'data substrate' based on the Hadoop Distributed File System. The Pivotal HD Hadoop distribution and Pivotal Advanced Database Services powered by HAWQ (a subset of the Greenplum massively parallel analytic database) are core to that data fabric, but it is also clear that the Pivotal Data Fabric is much more than an expanded Hadoop distribution. Much like Cloudera Enterprise and the MapR Platform for Apache Hadoop (in fact, arguably even more so given its broader set of capabilities), it is intended to serve as a next-generation data management platform designed to support multiple data-processing workloads.

An operating system for 'big data'
While Hortonworks can be seen as the least aggressive of the Hadoop distributors in terms of expanding its product plans beyond the Hadoop distribution (thanks in part to its commitment to open source but also, no doubt, to the importance of its partnerships with the likes of Microsoft and Teradata), the company also has a vision of Hadoop evolving from serving as a single-application (MapReduce) system to becoming a multi-application 'operating system' for big data.

Hortonworks' vision is not based on building a stack of data management and processing capabilities around Hadoop but on improving the flexibility of Hadoop itself for handling multiple application workloads via Apache YARN. First discussed by 451 Research back in September 2011 (when it was more often referred to as NextGen MapReduce or MapReduce 2.0), YARN enables multiple versions of MapReduce to run in the same cluster, and allows HDFS to support data-processing frameworks beyond MapReduce.

It is because of Apache YARN that we are convinced that the original creator of what is now Apache Hadoop, Doug Cutting, is correct in his assertion that Hadoop will evolve over time from a batch-processing engine to encompass a set of replaceable components in a wider distributed data-processing ecosystem. It is YARN that will enable Hadoop, over time, to serve as an efficient processing platform for native SQL analytics, graph processing, bulk synchronous parallel computing, the Spark cluster computing framework and multiple other use cases. It is this flexibility that has the potential to ensure that Hadoop can be considered a real challenger to existing database and data-warehousing products as the focal point in next-generation data management platforms.

While the adherence to a fixed schema is critical to the success of a well-designed data warehouse, resulting in highly efficient processing of queries that were known prior to the design of the schema, it also makes data-warehousing deployments highly inflexible to change. It is this inflexibility to change that has prevented many data-warehousing deployments from fulfilling the goal for which they were designed: creating a single version of the truth for enterprise data.

InsightaaS note: Aslett’s analysis goes on to demonstrate the tremendous cost differences between Hadoop solutions and traditional data warehouse approaches. He states that 451 is "aware of a number of large (traditionally quite conservative) enterprises that are looking to create next-generation data platforms building on top of HDFS. The question, therefore, is not so much whether the Hadoop distributors will position themselves as data-platform providers and go head to head with the incumbent data-warehousing providers, but how those incumbent data-warehousing providers will respond." The report goes on to discuss ways in which traditional competitors like IBM, Teradata, Oracle and Microsoft are positioned to meet the needs addressed by Hadoop solutions. Aslett concludes this section by stating that "As always, should the Hadoop distributors cum big-data platform providers begin to make enough waves, we would expect the database incumbents to simply acquire them. In the meantime, we can expect the emerging big-data platform providers to commence some acquisition activity of their own as they snap up some of the emerging complementary platform players in areas such as configuration and deployment (InfoChimps, MetaScale), development (Continuuity, Mortar Data), applications (NGDATA, WibiData), vertical markets (Guavus) and hosted services (Treasure Data, Qubole)."

For more information on the 451 Research Data Management & Analytics program, please contact 451 directly, or contact InsightaaS at