One of the challenges, or more so the “ugliness”, of bioinformatics that trips a lot of people up is that most analysis happens at the command line. This means you typically need to know some SQL, the common shell commands, and each specific tool’s syntax in order to execute on the questions you’re asking. It never seems friendly to the 80% of users, and it throws up hurdles whenever you want to visualize your data. If you’re like most biologists, you don’t have a degree in computer science and are just picking this up as you go.
I’ve been working on a little side project for a while that allows users to analyze the genome and its contents within a visualization tool. In this case, I’ve repurposed a business intelligence tool so that you can drag and drop different columns or rows to surface interesting insights. Let’s walk through what had to happen to get the basic constructs of this in place.
Redshift on Amazon Web Services
I wanted to do this all in the cloud, so I chose AWS since it’s what I’m “most” familiar with. “Most” is a vague term here since I’m the furthest thing from a data or software engineer.
Redshift was a simple choice as it is AWS’s data warehousing solution. While Redshift scales to over a petabyte of data, it starts to have serious performance issues at around the 100 TB range. At one of my past jobs this was incredibly challenging and required caching lots of aggregated tables to keep analysis queries fast. For my initial prototype, though, I was only working with a few megabytes of data, so it was a no-brainer.
The first thing I did was spin up a Redshift dc1.large cluster. Once it was up, it gave me a URL endpoint and a JDBC URL that I could hook a third-party SQL tool into. This was pretty critical since I did the data insertion manually.
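For reference, the same cluster can also be created programmatically with boto3’s Redshift client. A minimal sketch; the identifier, username, and password below are placeholders, not values from the actual setup:

```python
# Parameters for a single-node dc1.large cluster; all values are placeholders.
cluster_params = {
    "ClusterIdentifier": "variant-prototype",
    "NodeType": "dc1.large",
    "ClusterType": "single-node",
    "MasterUsername": "awsuser",
    "MasterUserPassword": "ChangeMe123",
    "DBName": "dev",
}

# Uncomment to actually create the cluster (requires AWS credentials):
# import boto3
# boto3.client("redshift").create_cluster(**cluster_params)
```

Once the cluster reports itself as available, the console (or `describe_clusters`) exposes the endpoint and JDBC URL mentioned above.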
Getting Hooked Up to Redshift
I needed some level of direct access to the Redshift cluster, so I downloaded SQL Workbench. This is not necessarily the best tool out there (we use 0xDBE/DataGrip at my current job), but I wanted to keep all of this isolated from my current work setup.
The next step was to download the Redshift JDBC driver and load it up into SQL Workbench. Once that was done, I created a new connection profile, loaded up the Redshift JDBC URL, then hooked right in. Awesome!
Creating a Variant Table
I wanted to keep the variant table pretty basic since a lot of the data included in variant files feels “excessive”. When I say excessive, I don’t mean that the data is useless. There are a lot of super interesting insights you can glean from it, but for my purposes I wanted to keep things very basic and solve for an “80%-like” use case.
I kept the table to something very basic:
- User_id – This is the specific key that we want to attach to an individual whose genome was sequenced. This way, further down the road, we can query multiple types of databases and pull in different sources of data regarding an individual.
- G_type – This stands for Genome Type. The concept here is that this database can have many different types of genomes. For example, we might house Homo sapiens and canine genomes. The database could theoretically be agnostic.
- Chrom – This stands for the Chromosome number that the variant lives on.
- Pos – This stands for the Position Count within the sequence that the variant is at.
- Bpid – This stands for Base Pair ID which attaches an identification to the specific base pair. For example, the bpid might look something like “rs6054257”.
- Ref – This stands for the Reference Base Pair of the genome that the sequence was being aligned to. This is considered the “source of truth” to a degree.
- Alt – This stands for the Alternative, which is what the sequenced genome presented. For example, the Reference might state that Adenine was the correct base pair but the Alternative showed Cytosine.
- Qual – This is the Quality Score of the sequence read. We attach this number because not all sequence reads come out with the same level of quality, which can factor into whether or not we care about looking at that specific row. The rule of thumb is that a Phred-scaled score below roughly 17–20 is considered a “fail”, meaning the quality is too low to consider.
- Filter – This is tied to the Quality Score and is a boolean input. You either pass or fail, represented as True or False.
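The quality score in variant files is Phred-scaled, which maps to the probability that a call is wrong as p = 10^(−Q/10). A quick sketch of why scores around 20 make a sensible cutoff:

```python
def error_probability(qual):
    """Phred-scaled quality score -> probability the variant call is wrong."""
    return 10 ** (-qual / 10)

# A score of 20 means roughly a 1-in-100 chance the call is wrong,
# which is why scores much below that are usually treated as failures.
```

So a row with Qual 29, like the example later in this post, carries about a 1-in-800 chance of being a miscall.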
I created this table by running the following statement:
create table variants (userid integer not null distkey sortkey,
g_type char(16), chrom integer, pos integer, bpid integer, ref char(16),
alt char(16), qual integer, filter boolean);
Tada! A variant table.
Manipulating the VCF File
This is by far the hardest part. The file formats produced by sequencing devices are incredibly unfriendly and hard to manipulate easily. The file that I ingest is called a VCF file – Variant Call Format. This file contains the SNPs (single nucleotide polymorphisms – a.k.a. mutations) and is one of the files produced by the process of alignment to a reference genome. Typically these files are massive, so I used a sample file that contained only ~950 rows of variants.
I had to remove all of the metadata from the file, which consists of information like “source, filter description, quality description” and more. While this information can be important, I don’t need it for this prototype. Additionally, I removed the header of the file since I aligned the columns of my table to the file format.
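Stripping the metadata and header can be scripted too. A small sketch, assuming the standard VCF convention that metadata lines start with `##` and the column header with `#` (the sample lines here are made up):

```python
def strip_vcf_header(lines):
    """Keep only the data rows of a VCF file, dropping '##' metadata
    lines and the '#CHROM ...' column header."""
    return [line for line in lines if not line.startswith("#")]

sample = [
    "##fileformat=VCFv4.0",
    "##source=myImputationProgram",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER",
    "20\t14370\trs6054257\tG\tA\t29\tPASS",
]
data_rows = strip_vcf_header(sample)  # only the data row survives
```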
From there, I had to get it into a format that could be accepted by the database as an insert. This is where good old Excel came in. This isn’t the sexy way to do it and it was super manual, but it got the job done. Typically, you would just write a basic script to reshape the data. This is how the VCF file typically formats a row:
20 14370 rs6054257 G A 29 PASS
After manipulating the dataset, the format looks quite different. In the end, this is what the insert statement looked like:
insert into variants values
(74756, 'human', 1, 4793, 71268704, 'A', 'G', 99, true),
(74756, 'human', 1, 5074, 11586607, 'T', 'G', 95, true),
(74756, 'human', 1, 5931, 0, 'T', 'C', 99, true);
Hooking up a BI Tool for Visualization
To bring the prototype full circle, I needed to get some level of visualization. Since I’m not that skilled in programming, I opted to use a pre-baked solution. However, if I had a deeper skill set, I would have gone with a custom library such as genome.js or D3.js for a nice web-based tool. For this prototype I used Tableau on a trial license. The setup was surprisingly easy. All I had to do was download the specific Tableau Redshift driver, install it, enter the ODBC Redshift credentials, and I was up and running.
I’m fairly new to Tableau, and I’m finding that it is a nice interface but challenging to grasp some concepts. For example, I’m easily able to figure out the number of variants based on the base pair type (i.e. 300 variants for Adenine) as the image below describes. However, I’m having a much harder time slicing the data up to ask more interesting questions such as: show me all of the variants on chromosome 1 that have a quality score between 30 and 60 and a position between 500 and 350,000.
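For what it’s worth, that question is a one-line filter in SQL against the variants table defined earlier, and the same predicate is easy to express in code. A sketch; the Python function just mirrors the WHERE clause:

```python
# SQL equivalent of the Tableau question in the text.
QUERY = """
select chrom, pos, bpid, ref, alt, qual
from variants
where chrom = 1
  and qual between 30 and 60
  and pos between 500 and 350000;
"""

def matches(chrom, pos, qual):
    """The same filter expressed over a single (chrom, pos, qual) row."""
    return chrom == 1 and 30 <= qual <= 60 and 500 <= pos <= 350000
```

Running the query through SQL Workbench against the cluster is often the quickest sanity check before trying to rebuild the same slice in the BI tool.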
This is the first of many different prototypes that I’m working on to build out a robust comparative genomics analysis pipeline. While the methods above are no good in the commercial realm, the results are encouraging because we can easily automate all portions of the pipeline that I’ve encountered. If you’re interested in collaborating or learning more about the project/mission/end goal, feel free to leave a comment or reach out.