Hadoop and Genome Sequencing: A Perfect Match

This is a tremendously exciting time for those who work in clinical genomics. Demand for cutting-edge technologies that deliver fast, accurate genome information has exploded. In 2013, close to 2,000 genome sequencers were in operation. Together they produced a whopping 15 petabytes of sequence data, including the sequences of 300,000 human genomes. By 2018, growth in the installed base of sequencers, combined with growth in data per run, is expected to result in the production of 1 exabyte of sequence data. That’s a lot of data.
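To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above and assumes decimal units (1 exabyte = 1,000 petabytes) and smooth compound growth between 2013 and 2018. Both are simplifying assumptions for illustration, not claims about how sequencing output actually grew.

```python
# Figures quoted in the post. Decimal units (1 EB = 1,000 PB, 1 PB = 1,000 TB)
# are an assumption made for this illustration.
PB_2013 = 15.0            # petabytes of sequence data produced in 2013
SEQUENCERS_2013 = 2000    # approximate sequencers in operation in 2013
PB_2018 = 1000.0          # 1 exabyte, the 2018 projection
YEARS = 2018 - 2013


def per_sequencer_tb(total_pb: float, sequencers: int) -> float:
    """Average output per sequencer, in terabytes."""
    return total_pb * 1000.0 / sequencers


def annual_growth_factor(start: float, end: float, years: int) -> float:
    """Compound yearly multiplier that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years)


avg_tb = per_sequencer_tb(PB_2013, SEQUENCERS_2013)
growth = annual_growth_factor(PB_2013, PB_2018, YEARS)

print(f"Average 2013 output per sequencer: {avg_tb:.1f} TB")
print(f"Implied growth per year, 2013 to 2018: {growth:.2f}x")
```

Under these assumptions, each sequencer averaged several terabytes in 2013, and total output would have to more than double every year to reach an exabyte by 2018.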

What’s fueling this growth? The dramatic decrease in the operating costs of DNA sequencers. In the graph below, notice the sharp drop in price per megabase starting in 2008; this marks the start of the “next generation” sequencing technology era.

Genome Sequencing Costs

Genomics pioneer Craig Venter said it best when commenting on the pace of acceleration for genome sequencing: “We’re going to go from zero to a thousand miles an hour very quickly.”

But with this growth come new challenges. Now that you have this massive amount of sequenced human genome data, how can you manage it cost-effectively? And what is the right system design for analyzing this data quickly and cost-effectively?

Hadoop, it turns out, is the ideal platform for genomics workflow processing. Hadoop can rapidly store and sort this massive amount of information and prepare the data for meaningful analysis. In addition, deploying Hadoop instead of legacy systems brings both data reliability advantages and significant cost savings.

Read our new white paper, “Next Generation Sequencing Using MapR,” for an in-depth look at how scale-out Internet architectures can be applied to clinical genomics.

In the next blog post in this series, we’ll talk about why Hadoop is the ideal platform for genomics data processing.

If you have any questions or comments, please post them in the comments section below.

