There’s an interesting battle brewing in the genetics world. It’s bred from the hype around cloud computing and the vision that the cloud is a panacea for every vertical. People who know me will tell you that I’m a huge proponent of the cloud in general. However, genetics isn’t a general thing, and I believe we’ll see an emerging hybrid cloud that specializes in handling it: a “biocloud” that acts as a platform for biologists.
There are many contributing factors behind why I believe this is the likely outcome for the genetics market. Personally, many of the accelerated genomics pipeline tools I’m working on are hedging on this outcome because of challenges I’ve seen with computing on the cloud. The reality is that the cloud is designed for general-purpose computing. For the most part, the problems solved by cloud computing are fairly basic and don’t require huge clusters of servers; compared to genomics, the files are smaller and the computations simpler.
For genomics, the datasets are massive, have endless permutations of interactions, and constantly evolve. In a weird way, the data we’re storing has many different dimensions, time being one of them. Doing queries and complex computations against it requires a specific computing environment, both from a software and a hardware perspective.
To back up, an accelerated genomics pipeline looks something like this:
1. Sequenced data is stored for alignment
2. Sequenced genome is aligned against a reference genome
3. Aligned genome and variant file are passed to structured database storage
4. End user queries one or many of the datasets ad hoc to collect data for a hypothesis
5. End user performs complex computations against large collections of datasets for deep understanding (simulation, similarity analysis, etc.)
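To make the flow concrete, here is a minimal sketch of the pipeline stages above. Every function name, path, and table name is hypothetical and purely illustrative; the real stages are heavyweight systems, not Python functions.

```python
# Illustrative-only sketch of the five pipeline stages.
# All names here are made up for the example.

def store_reads(fastq_path):
    """Stage 1: persist raw sequenced reads for alignment."""
    return {"reads": fastq_path}

def align(reads_path, reference="GRCh38"):
    """Stage 2: align short reads against a reference genome,
    producing an aligned genome plus a variant file."""
    return {"bam": "aligned.bam", "vcf": "variants.vcf", "reference": reference}

def load_into_warehouse(aligned):
    """Stage 3: push the aligned genome and variant file into
    structured database storage for stages 4 and 5 (ad-hoc queries
    and large-scale computation)."""
    return {"tables": ["alignments", "variants"], **aligned}

def run_pipeline(fastq_path):
    reads = store_reads(fastq_path)
    aligned = align(reads["reads"])
    return load_into_warehouse(aligned)

result = run_pipeline("sample.fastq")
print(result["tables"])  # ['alignments', 'variants']
```

The split discussed below falls out of this picture: stages 1–2 want raw speed, while stages 3–5 want the elasticity of the cloud.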
While step 2 is the obvious high performance computing example, you could easily say the same for steps 4 and 5. However, the computations being performed are completely different. In step 2, we’re running a tool called “Bowtie,” built on the popular Burrows-Wheeler transform. It aligns short read sequences against a larger genome; in practice this means aligning huge volumes of reads against the roughly 3.1 billion base pairs of the human genome. Doing that at scale takes a specially designed system utilizing sophisticated hardware such as GPUs, InfiniBand, and flash array block storage. We’ve personally tried using AWS for this, and it has either failed to complete or run so slowly that it nets a negative gain. On a custom-designed system, though, we’ve seen close to 10x improvements in speed, reducing the time to align a full human genome from hours or days down to seconds.
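To give a feel for the machinery behind aligners like Bowtie, here is a toy sketch of the Burrows-Wheeler transform and the backward search built on it. Real aligners use a compressed FM-index and tolerate mismatches; this version only counts exact matches, but the core idea is the same.

```python
def bwt(text):
    """Burrows-Wheeler transform: sort all rotations of text + '$'
    and keep the last column."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def count_matches(pattern, genome):
    """Count exact occurrences of pattern in genome via backward
    search on the BWT. With a real FM-index this takes O(len(pattern))
    lookups, independent of genome length -- which is why the approach
    scales to billions of base pairs."""
    last = bwt(genome)
    # C[c]: number of characters in the text lexicographically < c
    counts = {}
    for c in last:
        counts[c] = counts.get(c, 0) + 1
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # Backward search: narrow the [lo, hi) range one pattern
    # character at a time, right to left.
    lo, hi = 0, len(last)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + last[:lo].count(c)
        hi = C[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

print(count_matches("ana", "banana"))  # 2
```

The naive `last[:i].count(c)` scans here would be replaced by precomputed occurrence tables in a real index; that substitution is what makes each lookup constant time.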
That said, this system isn’t designed for storing data in a way that is broadly useful; it’s purely designed for speed. We consider this part of our pipeline the “local compute cluster,” where we stream data coming off the sequencer and align it on the fly, letting us do “genomics” in real time. On the flip side, we want to take advantage of the economies of scale of cloud storage and computing, so that’s where the output of the aligned data should go, since we accrue those benefits over time. We’ve tested passing aligned data to the cloud for storage, analysis, and automation, with positive results. In our tests we’ve used AWS Redshift (a data warehouse derived from PostgreSQL) as our core database for prototyping, and we’ve had great success with very low query times for full disease-to-gene mappings, making it a viable solution for the “cloud” portion of the pipeline. In the future, we plan on using other elastic, scalable resources, such as EC2, for interesting data analysis built on machine learning software.
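The disease-to-gene mapping queries in question are, at their core, joins over variant and annotation tables. The sketch below shows the rough shape of such a query; the schema, table names, and sample rows are entirely hypothetical, and sqlite3 stands in for Redshift purely so the example is self-contained (Redshift accepts the same PostgreSQL-style SQL for a simple join like this).

```python
import sqlite3

# Hypothetical miniature schema standing in for our warehouse tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants (variant_id TEXT, gene TEXT);
    CREATE TABLE disease_links (variant_id TEXT, disease TEXT);
    INSERT INTO variants VALUES ('rs0001', 'BRCA1'), ('rs0002', 'TP53');
    INSERT INTO disease_links VALUES ('rs0001', 'breast cancer'),
                                     ('rs0002', 'li-fraumeni syndrome');
""")

# A disease-to-gene mapping is a join from disease annotations
# back to the genes carrying the linked variants.
rows = conn.execute("""
    SELECT d.disease, v.gene
    FROM disease_links d
    JOIN variants v ON v.variant_id = d.variant_id
    WHERE d.disease = ?
""", ("breast cancer",)).fetchall()

print(rows)  # [('breast cancer', 'BRCA1')]
```

At warehouse scale the same join runs over billions of rows, which is where Redshift’s columnar, distributed execution earns its keep.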
At the end of the day, while many of the major cloud providers have “Life Sciences” vertical offerings, the reality is that they don’t stand up to the real use cases of the commercial environment that will be required to do genetics at scale. They currently cater to the ad-hoc analysis done by researchers, which is incredibly useful. However, once we scale up to medical-grade, commercial workloads, there will need to be a specific pipeline and hybrid platform that gives us the benefits of both speed and complex analysis. There’s a world where an entirely new genomics cloud emerges that, while “cloud” based, isn’t part of one of the major providers: a separate cloud environment designed and tuned specifically for the biology world and for the incredibly complex, massive datasets we haven’t even seen yet. For now, though, I believe the best solution is a combination of local and cloud-based computing to get the full benefits of an optimized system.