Populating the Barcode of Life

What do crowdsourcing, mobile technology and DNA have in common? If you answered a unique means of gathering information to populate the International Barcode of Life database, you would be on to something. If you added a collaborative project undertaken by SAP and The University of Guelph, you would be familiar with a research initiative launched through Guelph’s Biodiversity Institute of Ontario that aims to combine Big Data and mobility to further our understanding of global species.

The Big Data value proposition is much discussed in the IT industry today. Beyond information collected and stored in traditional systems, massive amounts of data now captured from GPS, social, M2M and sensor technologies are widely touted as the key to the development of new understanding, new services and new business value. But examples of the real-life application of Big Data solutions are less ubiquitous, and insight into the design of technology solutions that can bring Big Data to life is rarer still. The Guelph/SAP initiative not only provides a study in the development of a Big Data repository, but also offers a singular view into the creative process used to harness the power of social and mobile technologies in expanding knowledge.

International Barcode of Life map

The International Barcode of Life has set an ambitious agenda. Its stated mission is to help protect global bio-systems through surveillance of agro-ecosystems, inspection of food quality, identification of endangered species and the tracking of invasive species. Its method is to populate a database with DNA-based barcodes for every species in the world, and its specific target is the identification of 500,000 species by the end of 2015. This knowledge base will serve as a primary resource for real-time analysis and investigation into ecosystem and biodiversity change by institutions across 25 nations, including Canada, the United States, Germany and China, which form part of the Barcode of Life consortium. The Biodiversity Institute of Ontario (at the University of Guelph) is one such institution. Its director, Canada Research Chair Dr. Paul Hebert, noted: “Creating an open database containing DNA reference sequences for every species is critical to their identification and conservation, and to the protection of global ecosystems.”

David Jonker, director, Big Data, SAP Canada

To execute on this goal, the Barcode of Life consortium is working with SAP on development of a platform to enable the collection, storage, analysis and sharing of species data. A first step in this process is data collection, an exercise that represents a primary hurdle for most organizations looking to develop data-enabled intelligence. While the trend towards quantitative analysis in academic and business communities is clear, conclusions drawn from data are only as good as the primary resource. So how does one build a database with DNA codes for 500,000 species within the next two years? To explore this problem, SAP and the Biodiversity Institute have applied “design thinking,” an iterative process that involves looking at a problem with a fresh perspective to better understand what needs to be addressed in the whole environment, rather than relying on conventional organizational solutions. SAP is well versed in this approach, having co-funded two research institutes for design thinking (one at Stanford University and one at The Hasso Plattner Institute in Germany), and has incorporated it into a three-pillar approach to Big Data that encompasses data science, design thinking for the operations, and a technology component. As David Jonker, director of Big Data at SAP, described it, “A lot of people think about Big Data as a storage problem, but that is a very simplistic way to look at it. You need to start at management or strategy level, which is data science. You need to ask, ‘What are our business priorities? How do we use data to drive our business imperatives?’ The next phase involves applying this within the daily operations of the organization.” This application, he added, is the only way to derive value from the data. The final stage entails technology implementation.

Sarah McMullin, project manager on the Guelph initiative, SAP Canada

In the Barcode of Life case, design thinking centered on understanding the end user. According to Sarah McMullin, senior project manager on the Guelph initiative and mobility strategist at SAP Canada, Guelph and SAP mustered a multi-disciplinary team who came together to solve the problem of increasing species “barcodes”: “We had scientists from Guelph, we had technical architects from our team, we had business folks who brainstormed and talked to potential end users. We interviewed a whole variety of people from different groups, gathered their feedback, and derived insight from that.” Much of this insight was then incorporated into the creation of a proof of concept for a mobile crowdsourcing application. According to McMullin, up to that point the scientific community had been largely responsible for populating the Barcode database. However, the consortium’s primary challenge was to increase the number of DNA samples. Through design thinking, project managers came to understand that it was necessary to build a solution that would engage a broader community, and this learning had a large impact on the design of the crowdsourcing application and on decisions around what information to collect and what information to return to the user (essentially an identification and detailed information about the sample organism). “All those decisions were made based on the initial insights we developed into the end user that we are going after with the application,” she explained.

A key piece of this methodology is turning conceptual thinking into practical operation. As David Jonker explained, for SAP, Big Data represents more than a hardware/software problem: “A lot of people working on Big Data tend to be focused on a technology, backend, data warehousing project,” he noted. “At SAP, we believe there are a lot of different elements to what is required to make Big Data a reality. We have a data science organization that works with a lot of the big brands to help them leverage and gain insights from their data, but we also think that applications are a fundamental element of our offering.” In the Guelph/SAP co-innovation project, the partners have created a mobile application that relies on crowdsourcing to enable “citizen scientists” around the world to collect insect and plant DNA samples and related information. According to Jonker, crowdsourcing was the preferred collection method because of the volume of data that still needs to be collected and the urgent need for information on species at risk.

Sample image and barcode, Guelph application screenshot

The crowdsourcing of scientific information is receiving increased attention as a means to solve problems associated with the generation of research data, a process that can be costly and is often difficult to complete. In the medical field, for example, social websites such as MediGuard, Patients Like Me and HD Buzz provide a platform for participants to learn more about their own disease and, by sharing information on their own condition, build a data repository that researchers are now using to overcome issues with the collection of complete data sets through clinical trials. The use of social data is controversial: traditionalists would prefer data collected in more controlled environments, while others believe that if the data set is large enough, it will adequately resemble a universal population. In the Guelph case, certain procedures have been put in place to ensure data reliability. For example, samples are sent to The Biodiversity Institute and researchers are responsible for inputting the data, a form of “quality control” that McMullin believes is reinforced by the fact that the application collects the very same data that was gathered for the Barcode database in the past. The only difference is that the application interface is very user-friendly and accessible to the non-scientific person. An additional control comes from the GPS information associated with a sample submission, which is automatically generated by the user’s mobile device.

Currently in proof of concept, the SAP/Guelph co-innovation project is ultimately intended to move beyond data collection to real-time DNA identification directly from a mobile device, a step that will allow the project to scale worldwide. Other coming platform features include data management systems that can accommodate the massive volume and variety of data that is collected, and the enablement of data mining to provide researchers with access to data, storage and real-time analytics capabilities for problem solving in their respective fields. At this point, SAP is working with the university to architect an environment that will run HANA for specimen analysis. While the database and storage technologies needed to manage this type of data are available in SAP IP, the “data science” needed to analyze species information is more specialized. The Biodiversity Institute brings this specialization to the project, as does SAP, which has developed genomics experience through a number of initiatives, such as a co-project with Stanford University that uses genomics analytics for research into cancer treatment. As McMullin explained, the bioinformatics skill sets that SAP will contribute include the ability to do sequence alignment for individual DNA strands (a process that is very labour intensive using traditional systems), the referencing of information on a particular DNA sample within the larger Barcode reference database, and the ability to display information in formats that can be consumed by researchers as well as “citizen scientists.”

The need for this specialist expertise highlights the complexity of the Big Data challenge, and the imperative to incorporate business or user requirements at the outset into what is essentially a technology solution. SAP’s vision for Big Data entails articulation of the use case up front, and simultaneous creation of a Big Data-enabled application. Big Data is more than an exercise in data warehousing, Jonker stressed, and will become more pervasive only when solutions integrate specific user requirements with analytics, database and storage technologies. “When you take that approach (data science, then design thinking, and then the technology engineering) you very much narrow the scope of the problem and the amount of work [that] needs to be done,” he said. SAP is hoping this structured proposition will resonate with enterprises looking to embark on their own Big Data journey.




