While genome sequencing -- the act of analyzing the genetic makeup of a person -- is at the forefront of medical research, the volume of data involved is overwhelming from a technology perspective, potentially keeping healthcare professionals from achieving population-scale genomics.
Sequencing genomes "spits out masses and masses of raw data," said Peter White, Ph.D., principal investigator and director of the Biomedical Genomics Core at Nationwide Children's Hospital in Columbus, Ohio. To put that into perspective, sequencing just one human genome produces one terabyte of data.
For years, healthcare professionals have lacked efficient technology not only to handle the data but also to find value in it.
"Technology is certainly a limitation. There's no doubt about it," Thomas Handler, M.D., research vice president at Gartner Inc. who focuses on precision medicine and genomics among other areas of healthcare IT, said.
This lack of technology has slowed advances in analyzing genomes that would make rapid diagnosis, precision medicine, and population-scale genomics possible.
But White and his team at Nationwide Children's have successfully analyzed 2,500 whole genomes and 2,500 exomes -- the part of the genome where the majority of disease-causing mutations occur -- in seven days, White said. He added that it took the 1000 Genomes Project -- a group that conducted the first project to sequence the genomes of a large number of people and to provide a comprehensive resource on human genetic variation -- 18 months to analyze the same data set.
Altogether, Nationwide Children's, in partnership with GenomeNext, LLC, a bioinformatics company, analyzed a total of 5,000 samples and 70 terabytes of data in seven days, said James Hirmas, CEO at GenomeNext.
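As a rough back-of-the-envelope check, the figures cited above can be turned into average rates. The short Python sketch below uses only the numbers quoted in the article (5,000 samples, 70 terabytes, seven days) and assumes decimal units (1 terabyte = 1,000 gigabytes); the variable names are illustrative.

```python
# Figures quoted in the article.
TOTAL_SAMPLES = 5_000   # 2,500 whole genomes + 2,500 exomes
TOTAL_DATA_TB = 70      # terabytes analyzed
ELAPSED_DAYS = 7

# Assumption: decimal units, 1 TB = 1,000 GB.
GB_PER_TB = 1_000

# Averages across all samples; individual genomes and exomes will differ.
avg_gb_per_sample = TOTAL_DATA_TB * GB_PER_TB / TOTAL_SAMPLES
samples_per_day = TOTAL_SAMPLES / ELAPSED_DAYS
tb_per_day = TOTAL_DATA_TB / ELAPSED_DAYS

print(f"Average data per sample: {avg_gb_per_sample:.0f} GB")
print(f"Average samples per day: {samples_per_day:.0f}")
print(f"Average data per day:    {tb_per_day:.0f} TB")
```

Under those assumptions, the run averaged about 14 gigabytes per sample, roughly 714 samples a day and 10 terabytes a day. These are averages over a mix of whole genomes and exomes, not per-genome figures.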
White and his team embarked on this endeavor in November 2014, when they submitted a proposal as part of the "Intel Head in the Clouds Challenge on Amazon Web Services," he said. At the AWS WWPS Government, Education and Nonprofits Symposium in July 2014, Intel had challenged attendees to think about a problem within their agency, organization or industry that they would most like to solve and to submit it to Intel with a proposed solution. White and his team proposed using an algorithm they had developed, called "Churchill," with GenomeNext's software as a service (SaaS) analytical platform, which runs on Amazon Web Services (AWS), to analyze the largest publicly available set of population-scale genomic data, which comes from the 1000 Genomes Project, he said.
"The algorithm is an extremely important piece of the overall platform" that GenomeNext provides, Hirmas said.
Churchill efficiently distributes the analysis process. White and his colleagues found that Churchill has an accuracy of 99.99% and an overall diagnostic effectiveness of 99.66%, as measured against standards set by the National Institute of Standards and Technology, the federal technology agency that works to develop and apply measurements and standards.
"Also what distinguished [Churchill] was that it's 100% reproducible. So every time you run the analysis you get the same results," White added. Churchill is also deterministic, which means "that regardless of whether you run this on your local server or you run it on a supercomputer or you run it up in the cloud, you get exactly the same results back."
The initial goal of White and his team was to use the Churchill algorithm for rapid diagnosis in a pediatric setting. Then the team realized Churchill could scale "really well in the cloud and could be used for population-scale studies," White said.
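The reproducibility White describes is a property that parallel software has to be designed for. One common way to get it is to split the work into fixed regions and merge the pieces in a fixed order, so the output does not depend on how many workers ran or which one finished first. The Python sketch below illustrates that general idea only. It is not Churchill's actual method, and `split_into_regions`, `analyze_region`, `run_analysis` and the toy chromosome lengths are hypothetical placeholders.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end)


def split_into_regions(chrom_lengths: Dict[str, int], region_size: int) -> List[Region]:
    """Cut each chromosome into fixed-size regions, in a stable sorted order."""
    regions: List[Region] = []
    for chrom in sorted(chrom_lengths):
        length = chrom_lengths[chrom]
        for start in range(0, length, region_size):
            regions.append((chrom, start, min(start + region_size, length)))
    return regions


def analyze_region(region: Region) -> str:
    """Hypothetical stand-in for real per-region analysis work."""
    chrom, start, end = region
    return f"{chrom}:{start}-{end}\tlength={end - start}"


def run_analysis(chrom_lengths: Dict[str, int], region_size: int, workers: int) -> str:
    """Analyze regions in parallel and merge results in region order."""
    regions = split_into_regions(chrom_lengths, region_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, not completion order.
        results = list(pool.map(analyze_region, regions))
    return "\n".join(results)


def digest(text: str) -> str:
    """Fingerprint an output so two runs can be compared byte for byte."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Toy chromosome lengths, invented for illustration.
toy_genome = {"chr2": 3_200, "chr1": 2_500}
one_worker = digest(run_analysis(toy_genome, region_size=1_000, workers=1))
many_workers = digest(run_analysis(toy_genome, region_size=1_000, workers=8))
print("Identical output:", one_worker == many_workers)
```

Because the region boundaries and the merge order are fixed up front, the worker count affects only how long the job takes, not what it produces.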
GenomeNext exclusively licensed Churchill for commercial use. The company built the infrastructure, security, and HIPAA compliance around the algorithm and made it fully automated to give customers the ability to upload sequencing results "to a platform running on the cloud that then allows them to do mass computer analysis or genomic analysis on the sample," Hirmas said.
Organizations that use GenomeNext to analyze genomes get free storage and pay only for what they use. "It's like the electric company. You turn on your lights, you pay for it. Once you turn them off, you stop paying. [It's] the same concept," Hirmas said.
Here's how the whole process works: Once a DNA sample has been collected and prepared for sequencing, the sequencer runs for about 40 hours and produces an output called a FASTQ file, White explained. FASTQ is a text-based format that stores sequence reads along with a quality score for each base.
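For readers unfamiliar with FASTQ, each record in the common form of the format takes four lines: an identifier beginning with `@`, the sequence of bases, a separator line beginning with `+`, and a quality string with one character per base. The Python sketch below reads that layout. It is a minimal illustration rather than production tooling, it assumes the four-line layout and the widely used Phred+33 quality encoding, and the names `FastqRecord`, `parse_fastq` and `phred_scores` are invented for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string, assuming Phred+33 encoding by default."""
    return [ord(char) - offset for char in quality]


# Two made-up reads, for illustration only.
sample = "@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!I\n"
for record in parse_fastq(io.StringIO(sample)):
    print(record.read_id, record.sequence, phred_scores(record.quality))
```

Reading the file as a stream, one record at a time, matters at this scale: a single sample's file can be far too large to load into memory at once.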
Once those FASTQ files are generated, GenomeNext's analytical SaaS comes into play: A lab technician uploads the file to AWS, and "Churchill's able to take it through that entire process of what we call secondary data analysis -- that's the alignment, the post-alignment processing, the variant discovery and genotyping," White said.
After that, the tertiary analysis -- annotating the genetic variants and figuring out which may be disease-causing -- takes place, White said.
"The data's uploaded to AWS, and then that entire process of secondary and tertiary analysis is fully automated," White said.
Nationwide Children's Hospital is currently using this technology in its clinical diagnostic laboratory.
"What we're working towards is how do we start to integrate a patient's genomic signature with their electronic health record?" White said.
Handler said integrating genomic data with the EHR will be challenging. "In the same way that right now the EHRs will remind me not to prescribe penicillin to someone who's allergic to penicillin, presumably in the future, when we know that there are classes of drugs that shouldn't be given to individuals with a certain genetic make-up, that decision support role needs to fire. And how that will be done is going to be fairly complex," he said.
And while these technologies are certainly a step in the right direction towards rapid diagnosis and population-scale genomics, Handler believes it's only the beginning and that it'll be about 10 years until these goals are fully reached.
For White, the results he and his team got using GenomeNext and the Churchill algorithm are a hopeful step forward: "By reducing the computational burden and cost, the technology enables any group to perform genomic analysis of [thousands] of individuals using universally available cloud compute resources," White said.