The MapR-based flexible workflow tool is now being used for a variety of different projects across Novartis, including video analysis, proteomics, and meta-genomics. The combined Spark and MapR-based workflow and integration layers allow the company’s life science researchers to meaningfully take advantage of the tens of thousands of experiments that public organizations have conducted, which gives them a significant competitive advantage.
The Novartis Institutes for BioMedical Research (NIBR) is the global pharmaceutical research organization for Novartis, focused on discovering innovative medicines to treat diseases with high unmet medical need. With more than 6,000 scientists, physicians and business professionals around the world, it has an open, entrepreneurial and innovative culture that encourages collaboration.
One of NIBR’s areas of drug research, Next Generation Sequencing (NGS), requires heavy interaction with diverse data from external organizations such as 1000 Genomes, NIH’s GTEx (Genotype-Tissue Expression) and The Cancer Genome Atlas, with particular attention to clinical, phenotypic, experimental and other associated data. Integrating these heterogeneous datasets is labor intensive, so the team wants to do it only once.
David Tester, Application Architect for Novartis Institutes for BioMedical Research, and his team have deployed MapR as part of their system to integrate and analyze diverse data to accelerate drug research.
New Workflow System
To solve the first part of this NGS big data problem, the Novartis team built a workflow system that allows them to process NGS data while being responsive to advances in the scientific literature.
Although NGS produces data volumes that are well suited to Hadoop, a common problem is that researchers rely on many tools that simply don’t work on native HDFS. Because these researchers previously couldn’t use systems like Hadoop, they have had to maintain complicated ‘bookkeeping’ logic to parallelize their jobs efficiently on traditional High Performance Computing (HPC) systems.
This workflow system uses the MapR Distribution for Hadoop for its performance and robustness and to provide the POSIX file access that lets bioinformaticians use their familiar tools. Additionally, it uses the researchers’ own metadata to allow them to write complex workflows that blend the best aspects of Hadoop and traditional HPC.
As a result of their efforts, the flexible workflow tool is now being used for a variety of different projects across Novartis, including video analysis, proteomics, and meta-genomics. An additional benefit is that the integration of data science infrastructure into pipelines built partly from legacy bioinformatics tools can be achieved in days, rather than months.
MapR and Spark Provide Novartis a Significant Competitive Advantage
For the second part of the problem—the “integrating highly diverse public datasets” requirement—the team used Apache Spark, a fast, general engine for large-scale data processing. Their specific approach to dealing with heterogeneity was to represent the data as a vast knowledge graph (currently trillions of edges) that is stored in HDFS and manipulated with custom Spark code.
This use of a knowledge graph lets Novartis bioinformaticians easily model the complex and changing ways that biological datasets connect to one another, while the use of Spark allows them to perform graph manipulations reliably and at scale.
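To make the knowledge-graph idea concrete, here is a minimal sketch in plain Python. It is not Novartis’s code and does not use Spark. The record layouts, the predicate names (`expresses`, `from_tissue`, `has_diagnosis`) and the helper functions are all hypothetical, chosen only to show how two differently shaped datasets can be reduced to the same (subject, predicate, object) edges and merged.

```python
from collections import defaultdict

# Hypothetical records from two sources with different shapes and key names.
expression_rows = [
    {"sample": "S1", "gene": "TP53", "tissue": "lung"},
    {"sample": "S2", "gene": "BRCA1", "tissue": "breast"},
]
clinical_rows = [
    {"sample_id": "S1", "diagnosis": "carcinoma"},
]

def expression_edges(rows):
    """Turn each expression record into (subject, predicate, object) edges."""
    for r in rows:
        yield (r["sample"], "expresses", r["gene"])
        yield (r["sample"], "from_tissue", r["tissue"])

def clinical_edges(rows):
    """Turn each clinical record into edges keyed on the same sample ID."""
    for r in rows:
        yield (r["sample_id"], "has_diagnosis", r["diagnosis"])

def build_graph(*edge_sources):
    """Merge any number of edge streams into one adjacency map."""
    graph = defaultdict(set)
    for source in edge_sources:
        for subj, pred, obj in source:
            graph[subj].add((pred, obj))
    return graph

graph = build_graph(expression_edges(expression_rows),
                    clinical_edges(clinical_rows))
```

Once every source is expressed as edges, adding a new dataset means writing one more small edge generator rather than redesigning a shared table layout. That is one reason a graph model copes well with connections that keep changing.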
On the analytics side, researchers can access data directly through a Spark API or through a number of endpoint databases with schemas tailored to their specific analytic needs. Their tool chain allows entire schemas with 100 billion rows to be created quickly from the knowledge graph and then imported into the analyst’s favorite database technologies.
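As a rough illustration of deriving a tailored schema from graph data, the sketch below pivots a small list of edges into flat rows that could be loaded into a relational table. It is a toy example in plain Python under assumed names: the edge data, the `project` helper and the column choices are hypothetical, and the actual tool chain described here runs on Spark at a vastly larger scale.

```python
# Hypothetical (subject, predicate, object) edges taken from a knowledge graph.
edges = [
    ("S1", "expresses", "TP53"),
    ("S1", "from_tissue", "lung"),
    ("S1", "has_diagnosis", "carcinoma"),
    ("S2", "expresses", "BRCA1"),
    ("S2", "from_tissue", "breast"),
]

def project(edges, columns):
    """Pivot edges into one row per subject, one column per requested predicate.

    Predicates not listed in `columns` are ignored, and missing values stay None.
    If a subject has several objects for one predicate, the last one wins.
    """
    rows = {}
    for subj, pred, obj in edges:
        if pred in columns:
            row = rows.setdefault(subj, {"id": subj, **{c: None for c in columns}})
            row[pred] = obj
    return [rows[key] for key in sorted(rows)]

# A schema for an analyst who only cares about tissue and diagnosis.
table = project(edges, ["from_tissue", "has_diagnosis"])
```

Different analysts can request different column lists from the same edges, which mirrors the idea of generating several endpoint schemas from one shared graph.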
As a result, the combined Spark and MapR-based workflow and integration layers allow the company’s life science researchers to meaningfully take advantage of the tens of thousands of experiments that public organizations have conducted, which gives them a significant competitive advantage.
“In some ways I feel that I’ve come full circle from my days at Oxford by using my training in machine learning, formal logic and semantics to bring together all these computational elements,” says Tester. “This is particularly important because, as the cost of sequencing continues to drop exponentially, the amount of data that’s being produced increases. We will need to design highly flexible infrastructures so that the latest and greatest analytical tools, techniques and databases can be swapped into our platform with minimal effort as NGS technologies and scientific requirements change.”
Designing platforms with this fact in mind eases user resistance to change and can make interactions between computer scientists and life scientists more productive.