Thoughts on Genomics

Visualizing Genome Variant Data

Posted by | Thoughts on Genomics | No Comments

One of the challenges, or more so “ugliness”, of bioinformatics that trips a lot of people up is that most of analysis is done within a terminal command. This means you typically need to know some level of SQL, the common terminal command language, and the specific tools syntax in order to execute on the questions you’re asking. It never seems to be friendly to the 80% of users and provides lots of hurdles for when you want to try and visualize your data. If you’re like most biologists, you don’t have a degree in computer science and are just trying to pick this up as you go along.

I’ve been working on a little side project for a while that allows users to analyze the genome and it’s contents within a visualization tool. In this case, I’ve repurposed a business intelligence tool so that you can drag-n-drop different columns or rows to surface interesting insights. Let’s walk through what had to happen in order to get the basic constructs of this.

Redshift on Amazon Web Services

I wanted to do this all within the cloud so I chose AWS since it’s what I’m “most” familiar with. Most is a vague term here since I’m the furthest thing from a data or software engineer.

Redshift was a simple choice as it is their data warehousing solution. While Redshift scales up to over a petabyte of data, it actually starts to have very large performance issues at around the 100TB range. From one of my past jobs, this was incredibly challenging and required lots of caching aggregated tables in order to keep speed within analyzing the data. For my initial prototype though I was only working with a few megabytes of data so it was a no brainer.

The first thing I did was spin up a Redshift dc1.large cluster. Once this was spun up, it gave me a URL endpoint and a JDBC URL that I could hook a 3rd party SQL tool into. This was pretty critical since I did the data insertion manually.

Getting Hooked Up to Redshift

I needed some level of direct access to the Redshift cluster, so I downloaded SQL Workbench. This is not necessarily the best tool out there (we use OxDBE/DataGrip at my current job) but I wanted to keep all of this isolated from my current work setup.

The next step was to download the Redshift JDBC driver and load it up into SQL Workbench. Once that was done, I created a new connection profile, loaded up the Redshift JDBC URL, then hooked right in. Awesome!

Creating a Variant Table

I wanted to keep the variant table pretty basic since a lot of the data included in variant files feels “excessive”. When I say excessive, I don’t mean that the data is useless. There is a lot of super interesting insights that you can glean from it but for my purposes, I wanted to keep it very basic and solve for an “80%-like” use case.

I kept the table to something very basic:

  • User_id – This is the specific key that we want to attached to an individual whose genome was sequenced. This way, further down the road, we can query multiple types of databases and pull in different sources of data regarding an individual.
  • G_type – This stands for Genome Type. The concept here is that this database can have many different types of genomes. For example, we might house homo sapien and canine genomes. The database could theoretically be agnostic.
  • Chrom – This stands for the Chromosome number that the variant lives on.
  • Pos – This stands for the Position Count within the sequence that the variant is at.
  • Bpid – This stands for Base Pair ID which attaches an identification to the specific base pair. For example, the bpid might look something like “rs6054257”.
  • Ref – This stands for the Reference Base Pair of the genome that the sequence was being aligned to. This is considered the “source of truth” to a degree.
  • Alt – This stands for the Alternative which is what the sequenced genome presented. For example, the Reference might state that Adenine was the correct basepair but the Alternative showed Cytosine.
  • Qual – This is the Quality Score of the sequence read. We attach this number because not all sequence reads come out with the same level of quality which can factor into whether or not we care about look at that specific row. The rule of thumb is that below 17-20 (on a scale of 100) is considered a “fail” and that the quality is too low to consider.
  • Filter – This is attributed to the Quality score and is a boolean input. You either pass or fail, which is represented as True or False.

I created this table by running the following statement:

create table variants (userid integer not null distkey sortkey,
g_type char(16), chrom integer, pos integer, bpid integer, ref char(16),
alt char(16), qual integer, filter boolean);

Tada! A variant table.


Data Cleansing

This is by far the hardest part. The file formats that are created from the sequencing devices are incredibly unfriendly and hard to manipulate easily. The file that I ingest is called a VCF file – Variant Call Format. This file contains the SNPs (single nucleotide polymorphisms – aka. mutations) and is one of the files that is produced by the alignment process to a reference genome. Typically these files are massive, so I used a sample file that contained only ~950 rows of variants.

I had to remove all of the meta data from the file which consists of information like “source, filter description, quality description” and more. While this information can be important, I don’t need it for this prototype. Additionally, I removed the header of the file since I aligned the columns of my table to adhere to the file formats.

From there, I had to get it into a format that could be accepted by the database as an insert. This is where good old Excel came in. This isn’t the sexy way to do it and was super manual, but it got the job done. Typically, you can just write a basic script that can modify the data. This is how the VCF file typically formats it:

20 14370 rs6054257 G A 29 PASS

And here is the format after manipulating the dataset:


Bit of a difference. In the end, this is what the statement looked like:

insert into variants values
('74756', 'human', '1', '4793', '71268704', 'A', 'G', '99', '1'),
('74756', 'human', '1', '5074', '11586607', 'T', 'G', '95', '1'),
('74756', 'human', '1', '5931', '0', 'T', 'C', '99', '1')

Hooking up a BI Tool for Visualization

To bring the prototype full circle, I needed to get some level of visualization. Since I’m not that skilled in programming, I opted to use a pre-baked solution. However, if I had a deeper skill set, I would have gone with a custom library such as genome.js or D3.js for a nice web-based tool. For this prototype I used Tableau on a trial version. The setup was surprisingly easy. All I had to do was download the specific Tableau Redshift driver, install it, insert the OBDC Redshift credentials, and I was up and running.

I’m fairly new to Tableau and I’m finding that it is a nice interface but challenging to grasp some concepts. For example, I’m easily able to figure out the number of variants based on the base pair type (ie. 300 variants for Adenine) as the image below describes. However, I’m having a much harder time slicing the data up to ask more interesting questions such as: Show me all of the variants on Chromosome 1 that have a quality score between 30-60 and have a position between 500-350,000.


This is the first of many different prototypes that I’m working on to build out a robust comparative genomics analysis pipeline. While the methods above are no good in the commercial realm, the results are encouraging because we can easily automate all portions of the pipeline that I’ve encountered. If you’re interested in collaborating or learning more about the project/mission/end goal, feel free to leave a comment or reach out.

Hurdles with Modern Genome Pipelines

Posted by | Thoughts on Genomics | No Comments

It’s no secret in the bioinformatics world that the software, tools, and overall industry is lacking behind modern practices. Much of the reason for this lagging is due to deep rooted infrastructures and practices that rely on these antiquated software methodologies.

As I’ve been working through what it will take to build an accelerated genome alignment and analysis pipeline, I’ve encountered some massive challenges. They’re fun challenges, so I figured I’d write down a stream of thoughts around what we’ve seen so far and some potential ideas on how we’re planning on tackling them.

Size of Genome Files

Each full human genome (30x+ coverage) can range between 100-150gb – depending on the machine, quality of data, and methods. What makes this challenging is that moving that big of a file around becomes a really big chore. You can’t just move it to the cloud because of transfer costs, timeouts, and time. You have to be really smart about it by using different methods, such as chunking, in order to make it relatively efficient. Most people today use GlobusFTP for faster transfers however this is obviously not the most secure method and won’t jive with HIPPA compliance or best practices for handling sensitive data. This forces many organizations to build internal computing infrastructures to get around that which bring in a whole slew of additional challenges (management, maintenance, scalability, costs).

One method I’ve considered is simply moving the sequencing in house to make it full vertical. The samples would be passed to the sequencing center, the sequencing is done in house, then passed to our own local cluster for alignment before it is shipped up to the cloud. Obviously this is still the same problem that the organizations experience but the difference is that we would maintain it entirely where it’s managed by professionals. This reduces the burden significantly for biologists and researchers since we would handle the brunt of the data processing.

File Formats

The most recent challenge I’ve encountered is the abysmal state in which files are formatted. The chief example for this is VCF files coming from the alignment of the genome to a reference genome. VCF files contain the SNPs (single nucleotide polymorphisms) that show where variants or gaps are at within the aligned genome. This file format consists of a very unique meta data structure and header structure which makes it hard to use. Additionally, the data within the VCF file somewhat requires a second data source for interpretation since is assumes that whoever is looking at it will understand (to a degree) what each of the values mean. This is hard for a computer to interpret since we have to reference a separate data source.

The way we’re looking at getting around this is actually rewriting the alignment output file structure. This means that as we align the genome, we restructure the VCF output file to be in a much more friendly and performant format (like a JSON schema). We can then simplify the file and allow it to be passed much faster into a cloud environment where we can run it through a parser for interpretation. Today, you have to use certain tools, such as VTools, to interpret the data. When working on very large files, this takes hours on end on a local machine. By changing the file structure, we can bring it into a more modern light and allow us to ETL the data in many different ways.

Genome File Size

One of the most interest and potentially useful applications of genomics is comparing tons of genomes against each other. For example, I could pull together 1,000 genomes that have all expressed breast cancer and compare the SNPs across each of the chromosomes or even compare the gene pathways. However, with the current file format of SNPs in VCFs today, that is hardly possible. Additionally, the sheer size of the computation as well as file storage is massive as well. You would need 100’s of terrabytes in order to just store the files. The reason for the size is not only because the genome really is that large but also because we’re storing it in ASCII format which is often 8-10x larger than binary.

One method we’ve considered that others have ventured down as well is changing the file storage format to be strictly in binary. For example, you could potentially change the raw alignment from ATCG to binary where A=00, C=01, and so on. I believe the area of study for this is called Object Marshalling. We’ve seen others do this as well where they store the data in binary format and see extraordinary reductions in file sizes. Since binary isn’t great for analyzing however, we’re going to build a lightning fast interpreter as we access data from the database. For example, we might call all Gene BRCA1 SNPs across 1,000 genomes. This would obviously surface up a lot of 0’s and 1’s, so as we fetch these files, we would run it through an extremely fast interpreter that translates the data from 01 to ATCG. Facebook has actually been quite a pioneer in this area with their Audience Insights engine.

We’re fortunately making decent progress not only on our own genome pipeline but in the field in general. There are some great open source initiatives, notably ADAM, where we’re taking modern big data practices and applying them to bioinformatics. We’re just on the tip of the ice berg when it comes to being able to really start doing large scale analysis of populations of genomes. There hasn’t been a time before in history where we’ve been able to due to costs, computation speeds, and storage/memory. One of the beauties that has fallen out of the machine learning revolution is that we’re starting to do really complex analysis on datasets by storing them in-memory with projects like Spark (ADAM does this as well).

We hope to get our entire genome data ingestion pipeline down into the seconds range versus hours it takes today. From there, we’re also working to make progress specifically within comparative genomics. Our mission is to make it so that you can compare 1,000’s of genomes against each other within minutes. To make that happen, we’ll need to transform a lot of the old software to more modern techniques. We’re well on our way and it will be exciting to see more progress in the near future within this field.

What’s After the $1,000 Genome?

Posted by | Thoughts on Genomics | No Comments

The advent of genome sequencing is one of our greatest achievements as humans. The first fully sequenced in 2003 (first formal draft in 2001) was one massive step forward in our quest to understand the machine that makes us who we are. 3.1 billion base pairs later, this step was accomplish. The simple format of Adenine, Thymine, Cytosine, and Guanine (A, T, C, G) uncovers all traits of humans, such as hair color, eye color, personality traits, pre-disposed disorders, and many more. All of the things that go wrong (or right) with human is transcribed in 4 letters. Those 4 letters however are not as simple as we’d like. While they provide the constructs of who we are, they are incredibly complex in what they control, how they interact with one another, and how they can be manipulated.

There was a recent article about how the cost of sequencing a human genome is now down to around $1,000 from $2.7 billion. This is as bit deceptive since this is the unit economics cost of sequencing the genome but fails to account for some of the other hidden costs (fluids, electricity, data storage, lab plates, etc.) as well as state which type of sequence (De novo, resequencing, exome, etc.). However, the cost is still dramatically lower than where we were at as is attainable by most early adopters. This article came from the perspective that the next step in the genome market is to progress further down the path of DNA repairs or manipulation of the genome. I’d argue this is the wrong approach. While the sequencing of the genome helps us significantly in our understanding of how humans work, we still have an incredibly long way in understanding it in full detail.

With the current state of genome affairs, we understand the “state of the union” of humans. However, we don’t fully understand yet how each of these basepairs interact with each other, what happens if we manipulate them, how they interact with massive problems such as cancer, and a whole array of other unknowns.

I’d argue heavily that the next leap forward with genomics is not manipulation of the genome but deeper computational analysis of it. While we have reference genomes that can help us understand portions of who we are, we do not have what I would call “genomic logic” backing them. Meaning, we have these genome sets stored in databases as raw values but the base pairs of the genomes in these databases are dumb. We don’t have any computation logic that backs how the base pairs deeply interact with each other. Much like pharmaceutical market has computational drug analysis for computationally testing new drugs on how they perform, we need this for human genome sets.

The reason we need dramatically need this before we start to manipulate our genome is simple: speed and safety. With logic backing our genome datasets, we can run millions of simulations on the fly in seconds, reduce the trial and error required, and have fast acceleration towards winning solutions. For example, if I want to see what happens if I modify a base pair making up Chromosome 2, I should be able to do so on the computer through a knob, run millions of tests across millions of genomic sets, and get results on what all I impacted down stream, what the different implications are, and provide a concise solution. Doing this also means I’m safely running this in a sandboxed environment on my computer without directly testing or impacting humans. This allows us to get to a safer potential solution faster without the standard clinical trials, trial and error, and long 5-10 year lifecycle development. This should get us to the point where we run a computer simulation against our genomic data sets, have a returned proposed solution with 95%+ confidence, and be able to take that solution to the labs for testing and mass roll out to the market.

By doing the above, we’re increasing our velocity towards a more personalized healthcare system. To me, it’s less about trying to turn all of the different knobs and playing around with our own live production environment and more about creating a safe and fast sandboxed environment. By doing the above, we’re accelerating and increasing our safety towards manipulating, repairing, and creating genomes. We need to continue to push genomics digital in order to reap the benefits of scalable systems such as cloud computing, high performance computing, and machine learning.

In the end, its better knowing specific insights about humans on an individual level versus the macro level. While the macro is important, there’s a simpler and more rapid step in getting there which is through building computational logic behind genomic data sets. Data is often useless if we’re unable to take action on the insights we glean from it. The more computational insight and action we can take, the better the results on practical field solutions.