Thoughts on Genomics

Learnings on Opening a Lab

Posted by | Thoughts on Genomics, Thoughts on Startups | No Comments

It’s been a long time since I’ve written anything here. However, I have a good excuse! We’ve been busy opening a next generation sequencing lab!

Last year, my Dad and I were looking heavily at the space. It’s been a longtime dream of his to build out a genome sequencing facility, and it’s been a longtime dream of mine to once again blend software and life sciences (I dabbled with this at my first company). Through a series of fortunate events, we had the opportunity to pull the trigger in September this year, so we founded The Sequencing Center.

Since then, we’ve been working hard on the website, building out our marketing and content strategy, finding lab space, and so much more. My sister came on board to help out with everything as well, from social to account management. It’s been an all-hands-on-deck situation. We’re bootstrapping this and funding it from our day jobs. We haven’t quit our primary jobs, so it’s double duty 7 days a week.

Just recently in March we found our new lab space. It’s a 1,400 sqft former surgery facility that met all the specs. A little old in age and a bit dirty but nothing a bit of TLC couldn’t solve. We signed in the middle of March and have been purchasing equipment since then. Here’s a list of just random shit that we learned throughout the process.

Getting used equipment takes forever. Seriously. The vendors in this space take forever to get anything done. Emails usually take 24-48 hours to get a response, the responses are highly unhelpful, and getting the equipment through QC often takes weeks. If you’re building out a similar lab, make sure that you have about 3 months of capital allocated just for the lease while you’re purchasing equipment.

The price you see is never the final price. While this is true for most purchases, it’s especially true in this field. The vendors we worked with would provide insanely high quotes that we would haggle over heavily. As an example, one vendor quoted us $20,000 for a device, which was about $6,000 more than another vendor’s quote. However, they also had another piece of equipment we needed that the other vendor didn’t, so we proposed to purchase that second piece at full price if we could get the $20k unit for $14k (a $6,000 discount). After a bit of added pressure to close the deal, they agreed. Everything is negotiable, and if you’re a small startup on a budget, fight for the discounts.

The laws and regulations are extremely vague. Our lab is technically BSL2 rated, which means we can handle a variety of sensitive organisms. These aren’t likely to kill you, but you have to have a BSL2 lab to handle them. Here’s the kicker: BSL2 labs don’t actually have a certification. This sent us down a rabbit hole because we assumed there was something we needed in order to prove that we could handle these organisms. However, after calling the EPA, CDC, and local health inspector, and reviewing the 438 page document from the CDC on biosafety labs, we couldn’t find anything. Then, when we talked with an equipment certifier about certifying our biosafety cabinet, they said that only BSL3/4 have hard requirements. BSL1/2 have more “guidelines” than anything, which makes it a grey area of operation. As long as you have the proper certified equipment, manuals, processes, etc., you should be fine.

Electric outlets. Make sure that you know the amp and volt limits of your outlets. Then, when you buy big equipment (such as a -80 freezer), make sure that its power requirements match your outlet specifications.

Have all refurbished equipment go through quality control. Make sure that each invoice you sign with a reseller or refurbished-equipment vendor includes the quality control inspection. Oftentimes this just means turning the machine on and making sure the basic operations work (centrifuge spins, freezer holds temperature, etc.). This is really more of a small insurance policy for you than anything else, and it will save you headaches.

Do a full run through the lab before accepting clients. This seems like a no-brainer, and fortunately it’s not something we’ve messed up on (yet!). However, it’s extremely important: if you’ve accepted clients before the lab has completed a full run-through and one of your pieces of equipment fails, it could take weeks or months to fix. That will leave you with some pissed off customers and a hurt reputation.

We’re on the cusp of opening up our lab after a few months of painfully getting all of our equipment. We have 2 more major devices that we need and then we’ll be good to go. It’s been a journey so far and it’s only just starting. Should be fun!

Hybrid Genomics Cloud: Why genetics needs local and cloud computing

Posted by | Thoughts on Genomics, Thoughts on Technology | No Comments

There’s an interesting battle that appears to be brewing in the genetics world. It’s bred from much of the hype around cloud computing, which paints the cloud as a panacea for all verticals. People who know me will say that I’m a huge proponent of the cloud in general. However, genetics isn’t a general thing, and I believe that we’ll see an emerging hybrid cloud that specializes in handling just this: a “biocloud” that acts as a platform for biologists.

There are many contributing factors as to why I believe this is the likely outcome for the genetics market. Personally, many of the accelerated genomics pipeline tools that I’m working on are betting on this outcome because of the challenges I’ve seen with computing on the cloud. The reality is that the cloud is designed for general purpose computing. For the most part, the problems that are solved by cloud computing are fairly basic and don’t require huge clusters of servers. They often involve smaller files and simpler computations when compared to genomics.

For genomics, the data sets are massive, have countless permutations of interactions, and constantly evolve. In a weird way, the data we’re storing has many different dimensions, time being one of them. In order to do queries and complex computations, a specific computing environment is needed from both a software and a hardware perspective.

To back up, an accelerated genomics pipeline looks something like this:

  1. Sequenced data is stored for alignment
  2. Sequenced genome is aligned against reference genome
  3. Aligned genome and variant file are passed to structured database storage
  4. End user queries against one or many of the datasets ad-hoc to collect data for a hypothesis
  5. End user performs complex computations against large collections of datasets for deep understanding of datasets (simulation, similarities, etc.)
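
To make the five steps a bit more concrete, here is a toy, single-machine sketch in Python. Everything in it is an assumption made for illustration and not our actual pipeline: the function names, the tiny made-up reference string, the exact-match "aligner" (a real pipeline would use a proper aligner), and SQLite standing in for the structured database.

```python
import sqlite3

# Hypothetical stand-in for a reference genome (made-up sequence).
REFERENCE = "ACGTACGTTAGCCGATAGCTAGGCTA"


def store_reads(reads):
    """Step 1: hold sequenced reads for alignment (here, just a list)."""
    return list(reads)


def align(reads, reference):
    """Step 2: toy exact-match alignment.

    Returns (read, position) pairs, where position is the 0-based offset
    of the first exact match in the reference, or -1 if there is none.
    """
    return [(read, reference.find(read)) for read in reads]


def load(alignments):
    """Step 3: pass aligned reads to structured storage (in-memory SQLite)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE alignments (read_seq TEXT, position INTEGER)")
    db.executemany("INSERT INTO alignments VALUES (?, ?)", alignments)
    db.commit()
    return db


def query_region(db, start, end):
    """Step 4: ad-hoc query for reads aligned at positions in [start, end)."""
    rows = db.execute(
        "SELECT read_seq, position FROM alignments "
        "WHERE position >= ? AND position < ? ORDER BY position",
        (start, end),
    )
    return rows.fetchall()


def coverage(db, reference_length):
    """Step 5: a simple computation over the whole dataset: per-base depth."""
    depth = [0] * reference_length
    for read_seq, pos in db.execute(
        "SELECT read_seq, position FROM alignments WHERE position >= 0"
    ):
        for i in range(pos, pos + len(read_seq)):
            depth[i] += 1
    return depth
```

Chaining the calls as `load(align(store_reads(reads), REFERENCE))` gives a database handle that steps 4 and 5 can query. The point of the sketch is only the shape of the flow: step 2 is the compute-heavy part, while steps 3 to 5 are storage and query problems.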

While step 2 is considered a high performance computing example, you could easily say the same for steps 4 and 5. However, the computations being performed are completely different. In step 2, we’re running an aligner called “Bowtie”, which is built on the popular Burrows-Wheeler Transform. It aligns short read sequences to a larger genome; oftentimes this means aligning roughly 3.1 billion bases of sequence against a 3.1 billion base reference. This takes a specially designed system to do at scale, utilizing sophisticated hardware architectures such as GPUs, InfiniBand, and flash array block storage. We’ve personally tried using AWS for this, and it has either failed to complete or been so slow that it nets a negative gain. On a custom-designed system, though, we’ve seen close to 10x improvements in speed, reducing the time to align a full human genome to seconds from what was nearly an hour or even days.
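
For readers who haven’t seen the Burrows-Wheeler Transform before, here is a minimal Python sketch of the idea behind this style of alignment: build the transform of the reference once, then locate a read by "backward search" over it. This is a teaching sketch under stated assumptions and not how Bowtie itself is implemented: it only finds exact matches, it builds the suffix array naively (fine for a toy string, hopeless for a real genome), and the function names are made up.

```python
def bwt_index(text):
    """Build a suffix array and the Burrows-Wheeler Transform of text + '$'."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT is the character preceding each sorted suffix (wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt


def fm_tables(bwt):
    """Build the two lookup tables used by backward search.

    first[c]  = number of characters in the text strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt[:i]
    """
    alphabet = sorted(set(bwt))
    first, total = {}, 0
    for ch in alphabet:
        first[ch] = total
        total += bwt.count(ch)
    occ = {ch: [0] for ch in alphabet}
    for seen in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (ch == seen))
    return first, occ


def find_read(read, sa, first, occ):
    """Return sorted 0-based positions where read occurs exactly in the text."""
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for ch in reversed(read):  # extend the match one character to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

The useful property is that the cost of `find_read` depends on the length of the read, not the length of the reference, which is why the index is worth building once and then streaming reads against it.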

That said, this system isn’t designed for storing data in a way that is very useful. It’s purely designed for speed. We would consider this part of our pipeline the “local compute cluster”, where we stream data coming from the sequencer and align it on the fly, allowing us to do “genomics” in real time. On the flip side, we want to take advantage of the economies of scale that come with cloud storage and computing. This is where the output of the aligned data should go, since we get many of these benefits over time. We’ve tested passing aligned data to the cloud for storage, analysis, and automation, with positive results. In our tests, we’ve used Redshift from AWS (a data warehouse based on PostgreSQL) as our core database for prototype purposes. We’ve had great success with very low query times for full disease to genetic mappings, providing a viable solution for the “cloud” portion of our pipeline. In the future, we plan on using different types of elastic and scalable resources, such as EC2, for doing interesting data analysis utilizing machine learning software.

At the end of the day, while many of the major cloud providers have “Life Sciences” focused vertical cloud offerings, the reality is that they don’t stand up to the real use cases of a commercial environment that will be required to do genetics at scale. They currently cater to much of the ad-hoc analysis done by researchers, which is incredibly useful. However, once we start to scale up to a medical grade and commercial scale, there will need to be a specific pipeline and hybrid platform that gives us the benefit of both speed and complex analysis. There’s a world where there may be an entirely new genomics cloud that, while “cloud” based, isn’t part of one of the major cloud providers. Rather, it would be a separate cloud environment designed and tuned specifically for the biology world and for incredibly complex and massive datasets of a kind that we haven’t seen yet. For now though, I believe that the best solution is a combination of both local and cloud based computing for the full benefits of an optimized system.

Federated Queries in Genomics

Posted by | Thoughts on Genomics, Thoughts on Technology | No Comments

In the late 90s and early 2000s there was a trend called “federated search” that gained some ground in how the web was architected. Federated search, more commonly known as “federated querying”, allows a system to query multiple disparate data sources in order to surface a single result from a single query. It never really took hold, however, due to the disorganized state of data architecture at the time as well as limited data storage performance.

One of the other trends within this world was storing a bunch of data into a single database from many different sources which is where the term “data lake” came into play. The term describes exactly what it is: a giant lake of data that you can query against from many different sources.

These were two trends that ended up dying pretty quickly as optimal solutions became apparent for specific industries. Postgres and NoSQL moved onto the scene and became the database solutions of choice for many needs within the high tech world. With NoSQL specifically, its performance was incredible for fast ingestion and really fast query times. However, the unstructured nature of NoSQL makes it problematic in many types of organizations.

In a funny way, the “big data” world is coming back around to federated querying. I think specifically that genomics is going to be a huge user of this type of architecture, given the dramatically different database requirements for different parts of the genomics pipeline. Some systems will want very rigid and structured databases, whereas others will want the freedom of unstructured storage that allows them to scale and bend data.

For example, you might want to store information about genes and all their annotated metadata in a MySQL database with optimized query performance. However, you may want a Postgres build for ingesting human genome variant data. Postgres could be nice for this given its parallel processing nature. In another realm, you may want to store clinical trial data inside of a NoSQL database in order to return large arrays of data extremely fast. This means that you have 3 different databases with query languages that, while similar, are different. A structured federated query language to hit all 3 sources would be beneficial.

An example of a federated query could look something like:

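PREFIX gene: </local_endpoint/genes>
PREFIX diseases: </local_endpoint/diseases>
PREFIX genome_variants: </local_endpoint/genome_variants>
PREFIX clinical_trials: </local_endpoint/clinical_trials>

# All predicate names below (gene:symbol, gene:chromosome, etc.) are hypothetical.
SELECT ?gene ?chromosome ?basepair WHERE {
    SERVICE </local_query_api_endpoint/> {
        ?gene gene:symbol ?symbol .
        ?gene gene:chromosome ?chromosome .
        ?variant genome_variants:gene ?gene .
        ?variant genome_variants:basepair ?basepair .
        ?disease diseases:name ?diseaseName .
        ?disease diseases:associatedGene ?gene .
        FILTER(?chromosome = 11 && ?basepair >= 100000 && ?basepair <= 200000)
        FILTER(str(?diseaseName) = "ADHD" && str(?symbol) = "DRD4")
    }
}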

This isn’t necessarily the prettiest (or most accurate) representation of a federated query, but it provides the structure in which we might join together multiple data sources with specific queries into one table. This table might be persisted, or we may just be doing a floating query, in which case we could store the results in a temp table.
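
The same idea can be sketched in plain Python: one logical query fans out to three separate sources and the results are joined into a single table. This is only a small illustration under loud assumptions. Two in-memory SQLite databases stand in for the MySQL and Postgres stores, a list of dicts stands in for the NoSQL store, and every table name, trial ID, and basepair value is made-up toy data.

```python
import sqlite3

# Source 1: gene annotations (SQLite standing in for MySQL; toy data).
genes_db = sqlite3.connect(":memory:")
genes_db.execute("CREATE TABLE genes (symbol TEXT, chromosome INTEGER)")
genes_db.executemany(
    "INSERT INTO genes VALUES (?, ?)", [("DRD4", 11), ("GENE_X", 2)]
)

# Source 2: genome variants (SQLite standing in for Postgres; toy data).
variants_db = sqlite3.connect(":memory:")
variants_db.execute("CREATE TABLE variants (symbol TEXT, basepair INTEGER)")
variants_db.executemany(
    "INSERT INTO variants VALUES (?, ?)",
    [("DRD4", 150000), ("DRD4", 450000), ("GENE_X", 120000)],
)

# Source 3: clinical trial documents (dicts standing in for NoSQL; toy data).
trials_store = [
    {"trial_id": "T-001", "disease": "ADHD", "genes": ["DRD4"]},
    {"trial_id": "T-002", "disease": "OTHER", "genes": ["GENE_X"]},
]


def federated_query(disease, chromosome, start, end):
    """Fan one logical query out to three sources and join the results."""
    # Document store: which genes are linked to the disease?
    linked = {
        gene
        for doc in trials_store
        if doc["disease"] == disease
        for gene in doc["genes"]
    }
    # Gene database: keep only linked genes on the requested chromosome.
    on_chromosome = [
        symbol
        for (symbol,) in genes_db.execute(
            "SELECT symbol FROM genes WHERE chromosome = ?", (chromosome,)
        )
        if symbol in linked
    ]
    # Variant database: positions inside the requested window (inclusive).
    rows = []
    for symbol in sorted(on_chromosome):
        for (basepair,) in variants_db.execute(
            "SELECT basepair FROM variants WHERE symbol = ? "
            "AND basepair BETWEEN ? AND ? ORDER BY basepair",
            (symbol, start, end),
        ):
            rows.append((symbol, chromosome, basepair))
    return rows
```

Calling `federated_query("ADHD", 11, 100000, 200000)` mirrors the query above. In this sketch the join happens in application code, so each store only needs to answer a simple question in its own query language.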

Apart from their flexibility, federated queries are attractive to genomics primarily because they can provide incredible performance across massive datasets. There are other query languages and implementations in the field that act more as aggregators than federations; however, their purpose is very similar. Facebook, in my opinion, has the most advanced version of this, where they use a node:leaf system paired with GPU based operations for querying against petabytes of data.

Among the other little experiments I’ve been running, I’ll be testing out an implementation of an aggregation service on a small scale from disparate data sources to see if I can create a simple example of this.