He didn’t know it at the time, but when high school student David Tester acquired his first “computer” – a TI-82 graphing calculator from Texas Instruments – he was on a path that would inevitably lead to Big Data.
Tester’s interest in computation continued through undergraduate and graduate school, culminating in a Ph.D. from the University of Oxford.
He followed a rather unusual path, investigating where formal semantics and logic intersect with statistical reasoning. Computers, he notes, are still far worse than people at this kind of problem: although they excel at crunching statistics and at following an efficient chain of logic, they fall short when the two must be combined, using statistical heuristics to guide complex chains of logic.
Since joining the Novartis Institutes for Biomedical Research over two years ago, Tester has been making good use of his academic background. He works as an application architect charged with devising new applications of data science techniques for drug research. His primary focus is on genomic data – specifically Next Generation Sequencing (NGS) data, a classic Big Data application.
In addition to dealing with vast amounts of raw heterogeneous data, one of the major challenges facing Tester and his colleagues is that best practices in NGS research are an actively moving target. Additionally, much of the cutting-edge research requires heavy interaction with diverse data from external organizations. For these reasons, making the most of the latest NGS research in the literature ends up having two major parts.
First, it requires workflow tools that are robust enough to process vast amounts of raw NGS data yet flexible enough to keep up with quickly changing research techniques.
Second, it requires a way to meaningfully integrate data from Novartis with data from large external organizations – such as 1000 Genomes, NIH’s GTEx (Genotype-Tissue Expression) and TCGA (The Cancer Genome Atlas) – paying particular attention to clinical, phenotypic, experimental and other associated data. Integrating these heterogeneous datasets is labor intensive, so the team wants to do it only once. However, researchers have diverse analytical needs that can’t be met by any one database. These seemingly conflicting requirements suggest a need for a moderately complex solution.
To solve the first part of this NGS Big Data problem, Tester and his team built a workflow system that allows them to process NGS data robustly while being responsive to advances in the scientific literature. Although NGS workloads involve data volumes that are ideal for Hadoop, a common problem is that researchers have come to rely on many tools that simply don’t work on native HDFS. Because these researchers previously couldn’t use systems like Hadoop, they have had to maintain complicated ‘bookkeeping’ logic to parallelize their jobs efficiently on traditional HPC.
This workflow system uses Hadoop for performance and robustness and MapR to provide the POSIX file access that lets bioinformaticians use their familiar tools. Additionally, it uses the researchers’ own metadata to allow them to write complex workflows that blend the best aspects of Hadoop and traditional HPC.
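The article doesn’t show the workflow system itself, but the metadata-driven idea is easy to picture. In this hypothetical sketch (all names and commands are invented, not Novartis code), a step declares how it parallelizes via metadata, and the framework expands it, replacing the hand-written per-sample “bookkeeping” logic described above:

```python
# Hypothetical sketch: a workflow step carries metadata describing how it
# scatters across inputs; the framework expands it into concrete commands.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    command: str
    metadata: dict = field(default_factory=dict)

def expand(step, samples):
    """Scatter a step over its declared axis; run it once if no axis is given."""
    if step.metadata.get("scatter") == "sample":
        return [f"{step.command} --sample {s}" for s in samples]
    return [step.command]

align = Step("align", "run-aligner", {"scatter": "sample"})
merge = Step("merge", "merge-results")

assert expand(align, ["s1", "s2"]) == ["run-aligner --sample s1",
                                       "run-aligner --sample s2"]
assert expand(merge, ["s1", "s2"]) == ["merge-results"]
```

Because the parallelization is declared rather than hand-coded, the same step definition can be dispatched to Hadoop or to a traditional HPC queue without rewriting the pipeline.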
As a result of their efforts, the flexible workflow tool is now being used for a variety of different projects across Novartis, including video analysis, proteomics, and metagenomics. An additional benefit is that the integration of data science infrastructure into pipelines built partly from legacy bioinformatics tools can be achieved in days, rather than months.
For the second part of the problem – the “integrating highly diverse public datasets” requirement – the team used Apache Spark, a fast, general-purpose engine for large-scale data processing. Their specific approach to dealing with heterogeneity was to represent the data as a vast knowledge graph (currently trillions of edges) that is stored in HDFS and manipulated with custom Spark code. This use of a knowledge graph lets Novartis bioinformaticians easily model the complex and changing ways that biological datasets connect to one another, while the use of Spark allows them to perform graph manipulations reliably and at scale.
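The knowledge-graph idea itself can be illustrated at toy scale. The sketch below uses plain Python with invented entities and edge types, not the team’s Spark code; at Novartis the same relational pattern is applied to trillions of edges:

```python
# Toy knowledge graph as (subject, predicate, object) edges. All entities
# and edge labels here are made up for illustration.
edges = [
    ("BRCA1", "measured_in", "sample_001"),
    ("TP53", "measured_in", "sample_002"),
    ("sample_001", "drawn_from", "tissue:breast"),
    ("sample_001", "part_of", "GTEx"),
    ("sample_002", "part_of", "TCGA"),
]

def objects(subject, predicate):
    """All objects reachable from `subject` via `predicate`."""
    return {o for s, p, o in edges if s == subject and p == predicate}

# Two-hop traversal linking an internal gene of interest to the external
# dataset its samples came from:
samples = objects("BRCA1", "measured_in")
datasets = {d for smp in samples for d in objects(smp, "part_of")}
assert datasets == {"GTEx"}
```

The flexibility the article describes comes from the representation: connecting a new external dataset, or a new kind of relationship, is just appending edges with a new predicate, with no schema migration required.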
On the analytics side, researchers can access data directly through a Spark API or through a number of endpoint databases with schemas tailored to their specific analytic needs. Their toolchain allows entire schemas with hundreds of billions of rows to be created quickly from the knowledge graph and then imported into the analyst’s favorite database technologies.
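Materializing an analysis-specific schema from such a graph can be pictured as flattening a chosen set of edges into tables and loading them into whatever database the analyst prefers. Here is a toy version using Python’s built-in sqlite3 in place of the production databases; the graph edges, table and column names are all invented:

```python
import sqlite3

# Toy graph edges (subject, predicate, object). In the system described
# above, the graph lives in HDFS and the flattening is done with Spark.
edges = [
    ("BRCA1", "expression", "8.1"),
    ("TP53", "expression", "5.4"),
    ("BRCA1", "tissue", "breast"),
    ("TP53", "tissue", "lung"),
]

# Flatten one predicate into a two-column table tailored to one analysis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene TEXT, value REAL)")
conn.executemany(
    "INSERT INTO expression VALUES (?, ?)",
    [(s, float(o)) for s, p, o in edges if p == "expression"],
)

rows = conn.execute(
    "SELECT gene, value FROM expression ORDER BY gene"
).fetchall()
assert rows == [("BRCA1", 8.1), ("TP53", 5.4)]
```

Because each endpoint schema is derived from the graph rather than maintained by hand, different research groups can get differently shaped databases from the same integration effort.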
As a result, these combined Spark and MapR-based workflow and integration layers allow the company’s life science researchers to meaningfully take advantage of the tens of thousands of experiments that public organizations have conducted – a significant competitive advantage.
“In some ways I feel that I’ve come full circle from my days at Oxford by using my training in machine learning and formal logic and semantics to bring together all these computational elements,” Tester adds. “This is particularly important because, as the cost of sequencing continues to drop exponentially, the amount of data that’s being produced increases. We will need to design highly flexible infrastructures so that the latest and greatest analytical tools, techniques and databases can be swapped into our platform with minimal effort as NGS technologies and scientific requirements change. Designing platforms with this fact in mind eases user resistance to change and can make interactions between computer scientists and life scientists more productive.”
To his counterparts wrestling with the problems and promise of Big Data in other companies, Tester says that if all you want to do is increase scale and bring down costs, Hadoop and Spark are great. But most organizations’ needs are not so simple. For more complex Big Data requirements, the many available tools are best viewed as powerful components from which to fashion a novel solution. The trick is to work with those components creatively, in ways that are sensitive to users’ needs, by drawing upon non-Big Data branches of computer science such as artificial intelligence and formal semantics – while also designing the system for flexibility as things change. He thinks this is ultimately a more productive way to tease out the value of a Big Data implementation.
Originally published in Datanami as “Creating Flexible Big Data Solutions for Drug Discovery,” January 19, 2015.