Big Data All Stars:

Real-World Stories and Wisdom from the Best in Big Data

The NIH Pushes the Boundaries of Health Research with Data Analytics

Few things probably excite a data analyst more than data on a mission, especially when that mission has the potential to literally save lives.

That fact might make the National Institutes of Health the mother lode of gratifying work projects for the data analysts who work there. In fact, the NIH comprises 27 separate Institutes and Centers under one umbrella title, all dedicated to the most advanced biomedical research in the world.

At approximately 20,000 employees strong, including some of the most prestigious experts in their respective fields, the NIH is generating a tremendous amount of data on healthcare research. From studies on cancer, to infectious diseases, to AIDS, to women’s health issues, the NIH probably has more data on each topic than nearly anyone else. Even the agency’s library – the National Library of Medicine – is the largest of its kind in the world.

Data Lake Gives Access to Research Data

‘Big data’ has been a very big thing for the NIH for some time. But this fall the NIH will benefit from a new ability to combine and compare separate institute grant data sets in a single ‘data lake’.

With the help of MapR, the NIH created a five-server cluster – with approximately 150 terabytes of raw storage – that will be able to “accumulate that data, manipulate the data and clean it, and then apply analytics tools against it,” explains Chuck Lynch, a senior IT specialist with the NIH Office of Portfolio Analysis, in the Division of Program Coordination, Planning, and Strategic Initiatives.

If Lynch’s credentials seem long, they actually get longer. Add to the above the Office of the Director, which coordinates the activities of all of the institutes. Each individual institute in turn has its own director, and a separate budget, set by Congress.

“What the NIH does is basically drive the biomedical research in the United States in two ways,” Lynch explains. “There’s an intramural program where we have the scientists here on campus do biomedical research in laboratories. They are highly credentialed and many of them are world famous.”

“Additionally, we have an extramural program where we issue billions of dollars in grants to universities and to scientists around the world to perform biomedical research – both basic and applied – to advance different areas of research that are of concern to the nation and to the world,” Lynch says.

This is all really great stuff, but it just got a lot better. The new cluster enables the office to apply analytical tools to the newly shared data. The hope is that the NIH can now do things with health science data it couldn’t do before, and in the process advance medicine.

Expanding Access to ‘Knowledge Stores’

As Lynch notes, ‘big data’ is not about having volumes of information. It is about the ability to apply analytics to data to find new meaning and value in it. That includes the ability to see new relationships between seemingly unrelated data, and to discover gaps in those relationships. As Lynch describes it, analytics helps you better know what you don’t know. If done well, big data raises as many questions as it provides answers, he says.

The challenge for the NIH was the large number of institutes collecting and managing their own data. Lynch refers to them as “knowledge stores” of the scientific research being done.

“We would tap into these and do research on them, but the problem was that we really needed to have all the information at one location where we could manipulate it without interfering with the [original] system of record,” Lynch says.

“For instance, we have an organization that manages all of the grants, research, and documentation, and we have the Library of Medicine that handles all of the publications in medicine. We need that information, but it’s very difficult to tap into those resources and have it all accumulated to do the analysis that we need to do. So the concept that we came up with was building a data lake,” Lynch recalls.

That was exactly one year ago, and the NIH initially hoped to undertake the project itself.

“We have a system here at NIH that we’re not responsible for called Biowulf, which is a play on Beowulf. It’s a high speed computing environment but it’s not data intensive. It’s really computationally intensive,” Lynch explains. “We first talked to them but we realized that what they had wasn’t going to serve our purposes.”

So the IT staff at NIH worked on a preliminary design, and then engaged vendors to help formulate a more formal design. From that process the NIH chose MapR to help it develop the cluster.

“We used end-of-year funding in September of last year to start the procurement of the equipment and the software,” Lynch says. “That arrived here in the November/December timeframe and we started to coordinate with our office of information technology to build the cluster out. Implementation took place in the April to June timeframe, and the cluster went operational in August.”

Training Mitigates Learning Curve

“What we’re doing is that we’re in the process of testing the system and basically wringing out the bugs,” Lynch notes. “Probably the biggest challenge that we’ve faced is our own learning curve; trying to understand the system. The challenge that we have right now as we begin to put data into the system is how do we want to deploy that data? Some of the data lends itself to the different elements of the MapR ecosystem. What should we be putting it into — not just raw data, but should we be using Pig or Hive or any of the other ecosystem elements?”

Key to the project’s success so far, and going forward, is training.

“Many of the people here are biomedical scientists. The vast majority of them have PhDs in biomedical science or chemistry or something. We want them to be able to use the system directly,” Lynch says. “We had MapR come in and give us training and also give our IT people training on administering the MapR system and using the tools.”

But that is the beginning of the story, not the conclusion. As Lynch notes, “The longer journey is now to use it for true big data analysis; to find tools that we can apply to it; to index; to get metadata; to look at the information that we have there and to start finding things in the data that we had never seen before.”

“Our view is that applying big data analytics to the data that we have will help us discover relationships that we didn’t realize existed,” Lynch continues.

“Success for us is being able to answer the questions being given to us by senior leadership,” Lynch says. “For example, is the research that we’re doing in a particular area productive? Are we getting value out of research? Is there something that we’re missing? Is there overlap in the different types of research or are there gaps in the research? And in what we are funding are we returning value to the public?”

Next Steps

So what is the next step for the NIH?

“To work with experts in the field to find better ways of doing analysis with the data and make it truly a big data environment as opposed to just a data lake,” Lynch says. “That will involve a considerable amount of effort and will take us some time to put that together. I think what we’re interested in doing is comparing and contrasting different methods and analytic techniques. That is the long haul.”

“The knowledge that we’re dealing with is so complex,” Lynch concludes. “In this environment it is a huge step forward and I think it is going to resonate with the biomedical community at large. There are other biomedical organizations that look to NIH to drive methods and approaches and best practices. I think this is the start of a new best practice.”

Originally published in Datanami:
The NIH Pushes the Boundaries of Health Research with Data Analytics
September 21, 2015