3 keys to the Big Data fast-track

One measure of tech ROI is user adoption: software that is not incorporated into workplace processes is of questionable value. Recognizing this principle, Information Builders (IB), a creator of data integrity, integration and intelligence solutions, has built functionality that extends information – in the right formats – to different user types across the business, democratizing data to enable its incorporation into operational workflows. But beyond technology adaptation, Information Builders also works hard to develop awareness of the full potential of its product portfolio in sessions aimed at helping users understand the data value proposition, particularly in new and emerging areas. A good example of this is the “Accelerating Big Data” workshop presented recently by IB’s Canadian team, which focused on three implementation perspectives: the strategic requirements for setup, the technical aspects of moving from project concept to preparing Big Data for usage, and the actual visualization and use of the information by the business/technical analyst.

Ganesh Iyer, pre-sales solution specialist, Enterprise Software, Information Builders

Promising “we’ll show you how to get it in, and how to get it out,” Ganesh Iyer, pre-sales solution specialist in the Information Builders Enterprise Software group, kicked off the discussion with a high-level definition of the Big Data “4 Vs”: variety, velocity, veracity and volume, each of which presents different challenges for operations staff. Coming from disparate systems, from sensors and interactions in structured and unstructured formats, in real time or with latency, complete or in “snapshots,” and in terabytes or petabytes, Big Data introduces a number of issues that Iyer believes are largely associated with volume. But citing a 12:1 ratio of exploratory to production projects, he argued that the biggest problem for most organizations is finding value in Big Data. And describing a May 2015 Gartner US survey, which found that 57 percent of respondents cite the Hadoop skills gap as a primary obstacle to Big Data deployments, he noted that businesses continue to invest in Hadoop “though they’re not sure why, and they don’t know how to use it.”

Building strategy

To support businesses as they begin the Big Data journey, Iyer offered “5 steps” to deployment as follows:

Align the needs of business and IT managers. Practical drivers for Hadoop implementation are likely to vary. While IT is typically motivated by the need for a database upgrade, historical data archiving, large enterprise data warehouse initiatives (to migrate from legacy Microsoft or Oracle data storage), and cost reduction in maintenance and operations (software license and maintenance fees will drop dramatically), LOB managers have a different set of concerns. Deployment success will depend on mapping IT aspirations to the business user’s interest in analyzing unstructured data in an automated fashion, in tapping social media for new insights, or in continuous data and business process optimization to achieve additional savings.

Understand maturity of the organization. Iyer divided “maturity” into two components: infrastructure maturity, where the organization will have a good understanding of its data and database needs, and skills maturity. “Do I have the ability to access and extract data from the systems?” is a good test question for infrastructure, to which Iyer added “do the business applications need more than traditional infrastructure,” and “do I want to have a data warehouse?” Answers to these questions will help frame requirements for the Big Data project. For the second component, Iyer identified the key questions as: are there business analysts on staff today, are they accessing the data, and do they know what to do with it? “If your business analysts are not in the habit of mining data for insights, Hadoop’s not going to change anything,” Iyer argued.

Estimate budget. Cost factors will include “startup costs,” including use case development, hosted vs. on-premise evaluation, Hadoop selection, initial setup and configuration and training. Beyond that, the organization will have to take into account “resource costs” (software is expensive), “implementation,” and “sustainment” or maintenance costs. On this point, Iyer advised that Big Data implementation be divided into “tactical” and “enterprise” deployments, and that budgeting be aligned with pilot or production needs.

According to Iyer, Hadoop implementation is potentially the most costly item in the Big Data budget – developers command approximately $130/hour – but one way to reduce this expense is to leverage short-term cloud-based offerings (identified by online cost calculators) to test drive the Hadoop solution. Similarly, he recommended that organizations evaluate the “Big Data” enablement tools that currently exist in the marketplace to reduce dependence on Hadoop technical resources.

Test drive the vision. Iyer’s best advice to user organizations is to “keep it simple”: start with Sqoop or Flume for data collection, and sign up for a free course to learn how to work with these technologies. Organizations should next build out a small analytics application – target a report or dashboard that is not currently available due to the report’s data requirements – and use this to prove out the deployment before progressing to more advanced Hadoop applications. And finally, he added, Hadoop should complement existing technologies and ETL skills – it should not be a “rip and replace” proposition.

Incorporate governance. Iyer noted that data governance is often overlooked, and he pointed to the growing incidence of data breaches as one consequence. A Hadoop Centre of Excellence can provide expert support, while a controlled sandbox environment can help users better understand metadata management in Hadoop, including the technical glossary, and technical and operational metadata needs, which contribute to development of a lifecycle approach to Hadoop data management.

Optimizing Big Data management

“Welcome to the 1980s when client/server came around,” was the top-level observation on Hadoop deployment from Brent Bruin, senior systems engineer at Information Builders – with improvements to the paradigm of distributed processing marking client/server’s modern cousin. Essentially the outgrowth of an Apache open source project, Hadoop comes only with native tooling, which means that most organizations require help to move data into the repository, and to extract it for analysis. According to Bruin, vendors such as Information Builders have worked with the 10-year-old technology to build a front end that eases deployment and improves the efficiency of data ingestion and extraction.

To illustrate, Bruin presented a live demo of the process using Information Builders’ iWay Big Data Integrator (BDI) layered on top of a Cloudera Hive (Apache Hive v 1.1) data store in an Eclipse development environment. While there are several ways to collect data, such as FTP, copying or data streaming, in the demo Bruin accessed the Hadoop file system (Cloudera HDFS) as well as an SQL store, and used Flume to ingest log data in real time from 10,000 appliances, with a Flume Wizard to pick the data source, a memory channel and an HDFS sink (this can also be done with Sqoop code). With the data in the repository, Bruin noted, the organization will want to do something with it. To prepare the data, Bruin used a Data Wrangler, which typically maps data from one raw form into another format with the help of automated tools to facilitate data consumption – for example, data aggregation or training a statistical model. In the demo, Bruin used BDI to extract the raw data and wrap metadata around it so that the data was delimited. With this structure imposed, the data was ready to be queried and transformed – brought to the processing software for analysis. According to Bruin, this ELT approach (extract, load, transform, with transform as the third step in the process) is key: if processing software is brought to the whole data store (as with the traditional ETL approach), the value of massively parallel processing is lost. To get the data out, he used an IB Adaptor to talk to Hive – leveraging Flume to listen for the log data, slurp data out of Hive, and use it for reporting.
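The idea of wrapping metadata around raw, delimited log data so that it becomes queryable can be sketched in a few lines of Python. This is only an illustration of the concept, not the BDI tooling shown in the demo: the log layout, the field names, the pipe delimiter and the helper names (apply_schema, events_per_appliance) are all assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical raw log lines as they might land in the store:
# no header row, pipe-delimited (layout assumed for illustration).
RAW_LOG = """\
2015-06-01T10:00:00|appliance-0001|TEMP|71.5
2015-06-01T10:00:05|appliance-0002|TEMP|69.0
2015-06-01T10:00:09|appliance-0001|ERROR|0
"""

# The "metadata": a delimiter plus an ordered list of (field name, type).
SCHEMA = {
    "delimiter": "|",
    "fields": [
        ("timestamp", str),
        ("appliance_id", str),
        ("event", str),
        ("value", float),
    ],
}


def apply_schema(raw_text, schema):
    """Yield one typed dict per raw line, driven only by the schema."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=schema["delimiter"])
    for row in reader:
        if not row:  # skip blank lines
            continue
        yield {
            name: cast(cell)
            for (name, cast), cell in zip(schema["fields"], row)
        }


def events_per_appliance(records):
    """A simple aggregation that is only possible once fields have names."""
    return Counter(record["appliance_id"] for record in records)


records = list(apply_schema(RAW_LOG, SCHEMA))
counts = events_per_appliance(records)
print(counts)
```

The raw text is stored untouched and the schema is applied only when the data is read, which mirrors the load-first, transform-later ordering described above.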

Automation of these processes is critical: BDI tooling automates SQL or Scala scripting, transforms and bash scripts, greatly reducing the time required for data preparation. For the Sqoop job that he demoed, Bruin estimated that manually setting up and provisioning the data would take around a half day of coding; with the IB tools, he set it up with five clicks of the mouse.

The IB solution also features “expression Wizards” that allow the user to apply functions to data in existing tables or new sources, or even drag existing fields to a target table. As Bruin explained, Hadoop is immature in that it’s not quite ready to accept updates, and hence not really sufficient for operational data. However, Information Builders – uniquely – has created “change data capture” – once data has been ingested, it can be updated with an IB tool. For example, a user could add a field to an SQL source table and use a hash algorithm to identify which item was updated or changed, and when.
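As a rough illustration of how a hash can flag changed rows, the Python sketch below compares a digest of each row against the digest stored at the previous load. It is not Information Builders’ change data capture implementation; the function names (row_hash, detect_changes), the choice of SHA-256 and the sample rows are all assumptions made for the example.

```python
import hashlib


def row_hash(row, columns):
    """Return a SHA-256 digest over the chosen columns of one row."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join(str(row[column]) for column in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(previous_hashes, current_rows, key, columns):
    """Compare current rows with stored hashes.

    Returns (inserted keys, updated keys, new hash table).
    """
    inserted, updated = [], []
    current_hashes = {}
    for row in current_rows:
        row_key = row[key]
        digest = row_hash(row, columns)
        current_hashes[row_key] = digest
        if row_key not in previous_hashes:
            inserted.append(row_key)
        elif previous_hashes[row_key] != digest:
            updated.append(row_key)
    return inserted, updated, current_hashes


# Hypothetical source-table snapshots from two successive loads.
COLUMNS = ["name", "status"]
first_load = [
    {"id": 1, "name": "pump", "status": "ok"},
    {"id": 2, "name": "valve", "status": "ok"},
]
second_load = [
    {"id": 1, "name": "pump", "status": "ok"},
    {"id": 2, "name": "valve", "status": "fault"},
    {"id": 3, "name": "sensor", "status": "ok"},
]

_, _, stored = detect_changes({}, first_load, "id", COLUMNS)
inserted, updated, stored = detect_changes(stored, second_load, "id", COLUMNS)
print("inserted:", inserted, "updated:", updated)
```

Storing one digest per row means only the rows whose digest differs need to be re-sent to the target, rather than reloading the whole table. Recording a load timestamp next to each digest would be one way to capture when a change was seen.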

Powering the business analyst

Yash Shreshta, business intelligence consultant, Information Builders

The speed and flexibility that this automation of ingestion/extraction provides is not limited to benefits in data management. As Yash Shreshta, business intelligence consultant at Information Builders, explained, the same metadata that was applied to delimit the data is also used in the presentation/report layer in Information Builders’ BI solution. Showcasing new capabilities in the company’s WebFOCUS Business User Edition, he demonstrated that using the same metadata for BI that was applied in the data management layer offers huge time savings – a feature that is not available in all vendor products.

According to Shreshta, to access required data, business users need only identify the parameters they want to use. In some cases, when data artifacts are accessed, the business user may not know what kind of report he/she wants to push, and so may start with a “visualization” using the metadata that was already created. In addition, the tool will automatically generate some interesting correlations, which can be customized or modified, based on preparation of the data achieved at the infrastructure layer. Shreshta also pointed to other BI features unique to the Information Builders solution that address the business need to derive insight from data, including auto drill, linking capabilities (as content ages, it increases in value as it is linked to additional information sources/objects), and sharing of the data visualization so that people who are geographically dispersed can collaborate on a report, or so that a specific chart can be sent to an employee group on the shop floor. Using a “coordinated field,” a single page can be sliced off and sent, and the report can be distributed via email, either according to a calendar schedule or when a value reaches a predefined threshold based on business logic.
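The two distribution triggers described here, a calendar schedule and a threshold defined by business logic, can be sketched as a simple rule check in Python. This is a minimal illustration only and does not reflect how WebFOCUS implements distribution; the DistributionRule class, the reasons_to_send function, the defect_rate metric and its limit are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DistributionRule:
    metric: str        # name of the measure being watched (hypothetical)
    limit: float       # predefined threshold set by business logic
    send_weekday: int  # scheduled send day, 0 = Monday


def reasons_to_send(rule, metrics, today):
    """Return the list of triggers that fire for this rule today."""
    reasons = []
    if today.weekday() == rule.send_weekday:
        reasons.append("schedule")
    if metrics.get(rule.metric, 0.0) >= rule.limit:
        reasons.append("threshold")
    return reasons


rule = DistributionRule(metric="defect_rate", limit=0.05, send_weekday=0)
monday = date(2015, 6, 1)
tuesday = date(2015, 6, 2)

print(reasons_to_send(rule, {"defect_rate": 0.02}, monday))
print(reasons_to_send(rule, {"defect_rate": 0.07}, tuesday))
```

Returning the list of reasons, rather than a single yes/no, lets the sender note in the email whether the report went out on schedule or because a limit was breached.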

InsightaaS perspective

Information Builders’ three-pronged approach to accelerating Big Data projects mirrors what must occur within the organization to ensure project success: IT and LOB managers must develop strategy around Big Data setup and execution; the database administrator, developer and information management officer must collaborate on the technical aspects of preparing the data for use; and the business analyst must understand how to quickly mine the data, share results, and push information to the appropriate audience. By using the Information Builders solution, organizations can simplify and accelerate key steps along the way. If much of the ‘plumbing’ (data management) can be done by a technically oriented business analyst, then the data scientist can focus on the third piece – crunching numbers to create new business value.