At the Strata + Hadoop World 2014 conference in New York, Allen Day, Principal Data Scientist at MapR, gave a fascinating talk titled “Renaissance in Medicine: Next-Generation Big Data Workloads,” in which he showed how ETL and MapReduce can be applied in a clinical setting.
Allen opened his talk with Wilhelm Johannsen, the Danish physiologist and geneticist who coined the word “gene.” Johannsen studied the metabolism of dormancy and germination in seeds, tubers and buds. While studying inbred strains of peas, he discovered that genetically identical peas are not identical in size. He introduced a new concept that separates what is observed, the phenotype, from the genotype (the hidden, causal latent variable upstream). He noticed that the variation in the observed phenotype follows a Gaussian distribution. The phenotype can be described as a function of the genotype plus some environmental component (P ~ G + E). This is the basis of quantitative genetics, and it was the foundation of Allen’s talk at Strata.
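The P ~ G + E relationship can be illustrated with a small simulation. This is a minimal sketch, not something shown in the talk: the function name `simulate_phenotypes`, the genotype value of 400, and the environmental standard deviation of 50 are all hypothetical numbers chosen for illustration.

```python
import random
import statistics

def simulate_phenotypes(genotype_value, env_sd, n, seed=0):
    """Simulate n phenotypes for one genotype: P = G + E, with E ~ Normal(0, env_sd)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [genotype_value + rng.gauss(0.0, env_sd) for _ in range(n)]

# Hypothetical inbred line: every plant shares the same genotype contribution (400),
# and the environment adds Gaussian noise with a standard deviation of 50.
sizes = simulate_phenotypes(genotype_value=400.0, env_sd=50.0, n=10_000)

# The sample mean estimates G; the spread comes entirely from E.
print(round(statistics.mean(sizes), 1), round(statistics.stdev(sizes), 1))
```

Because every simulated individual has the same G, all of the observed variation is environmental, which mirrors the observation that genetically identical plants still differ in size.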
The First Renaissance in Medicine
Allen went on to describe the first renaissance in medicine, which took place in Europe between 1400 and 1700, a period historians treat as part of the larger Renaissance. Several enabling factors allowed new medical breakthroughs to occur: movable type, the compound microscope, and math-driven hypotheses all had an effect on medicine during that era. Specifically, these enabling factors brought about a rapid diffusion of ideas, new data sources (human dissection), precise data (diagrams), and a paradigm shift in reasoning.
The Second Renaissance in Medicine: 1900 –
Enabling factors in the second renaissance in medicine include telecom networks, globalization, the next-gen DNA sequencer, and data-driven hypotheses. The telecom networks also enabled a rapid diffusion of ideas; globalization helped enable new data sources such as GMOs and stem cells. The DNA sequencers enabled dense, precise metrics (such as the ability to measure human genomes), while data-driven hypotheses have enabled a paradigm shift in discovery.
DNA Sequencers Improving Rapidly
Instead of using 1s and 0s (base2), biological software is encoded as A, T, C, and G (base4). DNA sequencers are simply devices for converting information encoded in base4 to base2. Improvements in DNA sequencing technology are happening at a rate that outstrips even Moore’s Law of Computing. As a result, the number of human genomes converted to base2 and uploaded for analysis is rapidly increasing.
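The base4-to-base2 idea can be sketched in a few lines. This is an illustration only: the particular 2-bit code assigned to each base is an assumption (any one-to-one mapping would work), and `encode` and `decode` are hypothetical helper names, not part of any sequencer's software.

```python
# Assumed mapping of the four bases to 2-bit codes; the assignment is arbitrary.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def encode(sequence):
    """Pack a DNA string into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in sequence:
        value = (value << 2) | BASE_TO_BITS[base]
    return value

def decode(value, length):
    """Unpack `length` bases from an integer produced by encode()."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])  # read the lowest 2 bits
        value >>= 2
    return "".join(reversed(bases))

packed = encode("GATTACA")
print(format(packed, "014b"))  # 10001111000100
print(decode(packed, 7))       # GATTACA
```

Each base carries exactly two bits of information, so a sequence of seven bases fits in 14 bits.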
“Next-gen” DNA Sequencer Brings Dense, Precise Metrics
The current “next-gen” sequencer, the Illumina XTen, came out in January 2014. It is a set of 10 sequencers that together produce 6 terabytes of data per day. Scientists need to sequence about 300 billion base pairs per human to build a medical-grade genome.
Moore’s Law describes a long-term trend in the computer hardware industry: compute power doubles roughly every two years. In 2005, the Lynx sequencer was commercialized, which marked the beginning of “Even Moore’s Law,” as researchers gained access to sequencers whose output improved faster than Moore’s Law. The doubling time went from 19 months to 5 months, where a doubling time of 19 months means that you get twice as many megabytes per dollar every 19 months.
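To get a feel for what these doubling times imply, the sketch below compares growth over a five-year window. The 60-month horizon and the function name `growth_factor` are assumptions made for illustration; the 24-, 19-, and 5-month doubling times come from the text.

```python
def growth_factor(months_elapsed, doubling_months):
    """Return how many times more output per dollar you get after months_elapsed."""
    return 2 ** (months_elapsed / doubling_months)

# Compare Moore's Law (24 months) with the two sequencing doubling times over 5 years.
for doubling_months in (24, 19, 5):
    factor = growth_factor(60, doubling_months)
    print(f"doubling every {doubling_months} months -> x{factor:,.1f} in 60 months")

# With a 5-month doubling time, 60 months is 12 doublings, i.e. 2**12 = 4096.
```

The gap between a 19-month and a 5-month doubling time is large: over the same period the first yields under a tenfold improvement, while the second yields more than a thousandfold.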
What does this mean in terms of the impact of XTen on genomic medicine? In the graphic below, you can see that scientists can now sequence 6 trillion base pairs per day at a cost of $20k per day. This translates to sequencing roughly 7,000 humans per year using one of these machines, at a cost of about $1,000 per human. For $4 billion per year, the DNA of all newborn babies could be sequenced, at a capital cost of $5 billion.
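The per-genome figures can be checked with a short calculation. The inputs below are taken from the text; the assumption that the machine runs 365 days per year, and the derived count of systems needed, are mine and are not figures from the talk.

```python
import math

BASE_PAIRS_PER_DAY = 6e12        # from the text: 6 trillion base pairs per day
COST_PER_DAY = 20_000            # from the text: $20k per day
BASE_PAIRS_PER_GENOME = 300e9    # from the text: about 300 billion base pairs per human
ANNUAL_BUDGET = 4e9              # from the text: $4 billion per year
DAYS_PER_YEAR = 365              # assumption: the system runs every day of the year

genomes_per_year = BASE_PAIRS_PER_DAY * DAYS_PER_YEAR / BASE_PAIRS_PER_GENOME
cost_per_genome = COST_PER_DAY * DAYS_PER_YEAR / genomes_per_year
genomes_funded = ANNUAL_BUDGET / cost_per_genome
systems_needed = math.ceil(genomes_funded / genomes_per_year)

print(genomes_per_year)  # 7300.0
print(cost_per_genome)   # 1000.0
print(genomes_funded)    # 4000000.0
print(systems_needed)    # 548
```

Under these assumptions the result is about 7,300 genomes per system per year, which is in line with the rounded figure of 7,000 quoted in the talk, and the $4 billion budget covers about 4 million genomes per year.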
There are two other sequencing technologies that are also available in the market today:
Ion Torrent technology takes an entirely new approach to sequencing. Ion Torrent systems sequence DNA using a semiconductor chip. This method of sequencing is based on the detection of hydrogen ions that are released during the polymerization of DNA, as opposed to the optical methods used in other sequencing systems.
Nanopore technology, offered by both Roche and Oxford Nanopore, is a new generation of nanopore-based electronic systems for analysis of single molecules.
Social Impact of Genetic Sequencing
Several genetically determined diseases could potentially be prevented with pre-conception and pre-natal screening. These diseases include muscular dystrophy, cystic fibrosis, albinism, phenylketonuria, and hemophilia. Allen also mentioned two facts regarding cancer: 1) 10% of all cancers have a hereditary component, and 2) the total annual cancer spending is $50 billion over 1 million people.
Many DNA-Based Applications Are on the Horizon
Many DNA-based applications are coming soon. In 2014, according to a Macquarie Capital report, US companies spent $2 billion on DNA application research, which consisted mostly of chemical costs (demand for DNA sequencers). In 2020, US firms are expected to spend $20 billion on DNA application research (consisting mostly of clinical costs), and this research will be deployed into the healthcare system.
Personalized medicine is now a reality. In this model, diagnostic testing is often employed for selecting appropriate and optimal therapies based on the context of a patient’s genetic content and/or other molecular analysis. In addition, companion diagnostics have been developed to preselect patients for specific treatments based on their own biology, where such targeted therapy may hold promise in personalized treatment of diseases such as cancer.
Clinical Genomics: Information Systems Perspective
You can also think about clinical genomics from an information systems perspective, as outlined in the graphic below. In this scenario, the machine extracts the compressed, structured base4 data and converts it to base2. As it converts the data, it also shreds it, destructuring it so that a conventional computer can read it. This data gets stored and goes into a BI tool for reporting and visualization.
Clinical Genomics: Data Science Process
From a data science perspective, there are four different steps in clinical genomics: 1) experiment design, 2) DNA sequencing, 3) secondary analytics, and 4) downstream analytics.
DNA Sequencing Getting Cheaper so Experiment Design and Analytics Will Increase
The chart below shows how much money is being spent on each of the different phases. In the past, there was a lot of time and money spent on DNA sequencing itself, because it was expensive. As DNA sequencing is getting cheaper, the proportion of effort being spent on experiment design and downstream analytics is going to increase.
Influx of Data is Creating New Use Cases
The figure below illustrates the commoditization of DNA sequencing. The huge influx of inexpensive data will enable a whole new crop of medical and industrial use cases.
Paradigm Shift in Discovery: Personalized Therapeutics
In the near future, all next-gen drugs from pharmaceutical companies will require a “companion diagnostic” as part of your prescription to determine your personal response segment. Personalized therapeutics will become the norm, and will be used to determine which response segment (unsuitable therapy vs. suitable therapy) a patient falls into.
Personal Genome in EHR = Better Therapeutics
According to Allen, it’s possible, even without genetic sequencing, to determine the appropriate therapy by analyzing millions of electronic medical records. Recently, a scientist at Stanford applied NLP (Natural Language Processing) techniques to medical records in order to figure out what entities are being talked about, and when they are being talked about. He was able to discover causal relationships and associations such as drug interactions and side effects.
To do this in a therapeutic context, you would also need to mine through all of the medical records. However, as the data becomes cheaper, a new use case becomes possible. Now, you’ll be able to take the personal genome data and include it in the medical record. Once that is complete, you can co-analyze it with the medical records, or go straight from the genomic data to predicting phenotype.
This means that you can predict a personal outcome, or the probability of an outcome, given the genetic data. You can also administer preventative action, which is something you can do in combination with the medical record data, where you have the time series. You can begin to perform observational studies of populations, such as figuring out what types of healthy behaviors, even within the context of a problematic genotype, can still produce a healthy phenotype.
To summarize, ETL and MapReduce can now be applied in a clinical setting. NoSQL and advanced analytics can be used to “reverse engineer” the genetic causes of disease. Such information can be used to predict and prevent individual suffering, as well as to increase the overall health of a society.
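As a rough illustration of the MapReduce pattern applied to sequence data, the sketch below counts k-mers (length-k substrings) across a handful of reads. It runs in a single Python process and is not the pipeline from the talk: the reads, the choice of k = 3, and the names `map_read` and `reduce_counts` are all hypothetical. On a real cluster, the map and reduce steps would be distributed by a framework such as Hadoop.

```python
from collections import Counter
from functools import reduce

def map_read(read, k=3):
    """Map step: count every k-mer (length-k substring) in one read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

def reduce_counts(left, right):
    """Reduce step: merge two partial k-mer counts into one."""
    return left + right

# Hypothetical short reads; real sequencers emit millions of much longer ones.
reads = ["GATTACA", "TTACAGA", "ACAGATT"]

# Each read is mapped independently, then the partial counts are folded together.
total = reduce(reduce_counts, map(map_read, reads), Counter())
print(total["ACA"], sum(total.values()))  # 3 15
```

Because each read is mapped independently and the merge is associative, the same two functions could be spread across many machines without changing the result.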
Want to learn more?
View Allen’s entire slide deck (120 slides) on Slideshare here.