Hadoop in Action: Novartis Taps Hadoop and Apache Spark as Part of Innovative New Workflow System

Editor's note: This blog post is based on the Datanami article "Creating Flexible Big Data Solutions for Drug Discovery."

It's an exciting time for those in pharmaceutical research, now that research organizations can leverage big data to improve their business. One such organization is the Novartis Institutes for BioMedical Research (NIBR), the global pharmaceutical research organization for Novartis. NIBR takes a unique approach to pharmaceutical research: at the earliest stages, patient need and disease understanding determine its research priorities. Its scientists work at nine research institutes around the world to bring innovative medicines to patients. Over 6,000 scientists, physicians and business professionals work in this open, entrepreneurial and innovative culture that encourages true collaboration.

One of NIBR's many interesting drug research areas is Next Generation Sequencing (NGS). NGS research requires a lot of interaction with diverse data from external organizations, including clinical, phenotypical, experimental and other associated data. Integrating all of these heterogeneous datasets is very labor intensive, so the team only wants to do it once. One challenge is that as the cost of sequencing continues to drop exponentially, the amount of data being produced keeps increasing. Because of this, Novartis needed a highly flexible big data infrastructure, so that the latest analytical tools, techniques and databases could be swapped into its platform with minimal effort as NGS technologies and scientific requirements change. The Novartis team chose Hadoop and Apache Spark as part of its system to integrate and analyze this diverse data.


New workflow system based on Hadoop and Spark
To make the most of the latest NGS research, the team needed workflow tools that were robust enough to process vast amounts of raw data, yet flexible enough to keep up with quickly changing research techniques. Although NGS produces the high data volumes that are ideal for Hadoop, a common problem is that researchers rely on many tools that don’t work natively on HDFS. Because these researchers previously couldn’t use systems like Hadoop, they had to maintain complicated "bookkeeping" logic to parallelize their work efficiently on traditional High Performance Computing (HPC) systems. The new workflow system uses Hadoop for its performance and robustness, and to provide the POSIX file access that lets bioinformaticians use their familiar tools. It also uses the researchers’ own metadata to let them write complex workflows that blend the best aspects of Hadoop and traditional HPC.

Spark used for data processing 
The team then uses Apache Spark to integrate the highly diverse datasets. Their unique approach to dealing with heterogeneity was to represent the data as a vast knowledge graph (currently trillions of edges) that is stored in HDFS and manipulated with custom Spark code. This innovative use of a knowledge graph lets Novartis bioinformaticians easily model the complex and changing ways that biological datasets connect to one another, while the use of Spark allows them to perform graph manipulations reliably and at scale.

On the analytics side, researchers can access data directly through a Spark API, or through a number of endpoint databases with schemas tailored to their specific analytic needs. Their tool chain allows entire schemas with hundreds of billions of rows to be created quickly from the knowledge graph and then imported into the analyst’s preferred database technologies.
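To make the knowledge graph idea concrete, here is a minimal sketch in plain Python rather than Spark. It is not Novartis' code: the `Edge` type, the `merge_sources` and `project_table` helpers, and all sample identifiers are hypothetical, chosen only to illustrate how heterogeneous datasets might be merged into one edge list and then projected into a table with a schema tailored to one analysis.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Edge(NamedTuple):
    """One edge of the graph: subject --predicate--> obj (hypothetical layout)."""
    subject: str
    predicate: str
    obj: str


def merge_sources(*sources: Iterable[Edge]) -> List[Edge]:
    """Union edges from several datasets, dropping exact duplicates.

    The order of first appearance is preserved.
    """
    seen = set()
    merged = []
    for source in sources:
        for edge in source:
            if edge not in seen:
                seen.add(edge)
                merged.append(edge)
    return merged


def project_table(edges: Iterable[Edge], columns: Dict[str, str]) -> List[Dict[str, str]]:
    """Build one row per subject that has every requested predicate.

    `columns` maps a predicate name to an output column name. If a subject
    has several edges with the same predicate, the last one seen wins.
    """
    by_subject = defaultdict(dict)
    for edge in edges:
        if edge.predicate in columns:
            by_subject[edge.subject][columns[edge.predicate]] = edge.obj
    rows = []
    for subject in sorted(by_subject):
        values = by_subject[subject]
        if len(values) == len(columns):  # keep only complete rows
            rows.append({"id": subject, **values})
    return rows


# Two made-up source datasets that overlap on one edge.
sequencing = [
    Edge("sample:1", "derived_from", "patient:A"),
    Edge("sample:1", "has_variant", "variant:X"),
    Edge("sample:2", "derived_from", "patient:B"),
]
clinical = [
    Edge("patient:A", "diagnosed_with", "disease:D1"),
    Edge("sample:1", "derived_from", "patient:A"),  # duplicate of an edge above
]

graph = merge_sources(sequencing, clinical)
sample_table = project_table(
    graph, {"derived_from": "patient", "has_variant": "variant"}
)
```

Because every dataset is reduced to the same edge shape, adding a new source only means producing more edges, and a different analysis only needs a different `columns` mapping. At the scale described above, the same grouping and filtering steps would be expressed as distributed operations rather than in-memory loops.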

Accessing results of all public experiments accelerates research

As a result of their efforts, this flexible workflow tool is now being used for a variety of projects across Novartis, including video analysis, proteomics, and metagenomics. A welcome side benefit is that data science infrastructure can be integrated into pipelines built partly from legacy bioinformatics tools in mere days, rather than months. By combining Spark and Hadoop-based workflow and integration layers, Novartis' life science researchers are able to take advantage of the tens of thousands of experiments that public organizations have conducted, which gives them a significant competitive advantage.

How are other companies using Hadoop to give them a competitive advantage? Find out by visiting our Solutions area, which features details on over 50 organizations that are ensuring production success with Hadoop.


