Finding Diseases that Match Variants

Posted by | January 20, 2016 | Genomics, Healthcare | No Comments

In my last blog post, I talked about the challenges I had when trying to compare an individual genome variant file to the CDC Disease data set from a clinician’s perspective. It’s not a simple task to create a map from the symptom level down to the molecular level as I found out. There are many complications when it comes to scale, speed, table references, etc.

The scenario I was trying to solve was this:

“I’m a clinician and the patient I’m seeing today thinks they have ADHD based on their inability to focus, sit still, etc. I can observe their symptoms and have the ability to sequence their genome. How can I find out whether or not they may have a genetic mutation that could help me make an accurate diagnosis?”

Again, simple right? If we could solve part of this, doctors would have a really powerful utility at their disposal. The answer to this question is actually quite complex and took me a few weeks to work out a solution – albeit a hacked solution. It’s not pretty but it gets the job done. Fortunately, in a commercial world, we’d be able to make this pretty, performant, more simple, and way faster.


What I’ve done is created multiple tables to reference different aspects of the solution that held lead the clinicians down the route of solving their hypothesis based on the initial diagnosis. In the first step, I created a table called “Diagnosis” which houses a unique key specifying “ADHD”. I’ve also included other information such as Disease Class (Psychological), Study, Study Conclusion, and reference genes. The really useful column is “gene_id” which is where we house the primary gene associated with the disease. In the case of ADHD, this is widely considered DRD4. Each disease would have a similar setup with the option of adding many secondary or influential genes that we can surface as well.

At this step, all the clinician is thinking is “I think this kid has ADHD”. Therefore, the first step is to query the diagnosis database for the title ADHD and return the gene_id and chromosome associated with this disease. Chromosome is especially important because later on we’ll be querying against basepair positions which are not unique across the whole genome but unique across a chromosome. The query looks like this:

select gene_id, chrom
into temp table test
from diagnosis
where title = 'ADHD'
intersect
select gene_id, chrom from genes

This returns the result “gene_id = DRD4” and “chrom = 11”. Sweet! We now have our first bridge into the molecular level that allows us to query further.


Since we now have the gene_id as DRD4, we now need to figure out whether or not the kid has a genetic mutation against that specific gene (and how many mutations). I’ve created a separate database for each gene that we can query against which, in this case, the key “gene_id” would be DRD4. However, each gene has a range of positions. For example, DRD4 has 3,414 basepairs. On my last post, I totally messed this up because I ingested the position range into a single row which doesn’t work for querying against for obvious reasons. I needed to blow it out so that I could run an intersect against a whole range of basepairs.

To get around this (probably a hack job way), I created the table DRD4 which has each of the basepair ids. The column header format looks like this:

  • gene_id
  • pos
  • title (a diseases)
  • extended_title
  • chrom
  • species

In a commercial world example, we’d move the title and extended titles into a different table and just use some sort of reference key to pull them in. However, I’m not that awesome at all this and really just wanted to get a small building block created for this prototype – For Science!

So, at this point we want to isolate the gene DRD4 to be able to run an intersect query in the next step. This means that we need to pull the data associated with DRD4 into a temporary table to run the queries on. To do this, I ran the following query:

select *
into temp table test1
from genes
where gene_id = 'DRD4'

Awesome! We now have all of the data associated with DRD4 in a temporary table called “test1”.


Alright, last step. Since we sequenced the kids genome back in the clinicians office, we now have the full variant file consisting of all of this basepair mutations. What we want to do now is run an intersection query against our “test1” table consisting of the gene DRD4 against that variant file. The data that we’re specifically interested in here is the position and chromosome, since both dictate where the gene lives in both tables.

I already have a table setup that consists of a full genome variant file for Craig Venter who we already know has ADHD. Since the variant file is literally just a giant string of basepair numbers and chromosome, an intersection query should run fine against – and fairly fast! Here’s the query:

select pos, chrom
into temp table test2
from venter
intersect
select pos, chrom from test1

This returns the 3 beautiful basepair mutations associated with that gene:

pos chrom
639261 11
637884 11
639579 11

Success! Not the smoothest but it is still awesome to see the results that I’m looking for. I’ve now come full circle with the clinician. Their question was “Does this kid have ADHD?”. Our answer is now “Possibly, the kid has 3 genetic mutations on the gene highly associated with ADHD, which suggest genetic contributions towards the symptoms you’re seeing”.

Please note that there’s A LOT more data that I’m looking to bring in to help make an informed decision. The beauty of this is that we could start looking at additional data points such as blood work, genotypes, variant type, clinical study association, etc. All these additional data points are things that the clinician wouldn’t have at their disposal otherwise in a rapid format. The basics of this tool could help surface that information in a simple format that helps these clinicians make a more data driven diagnosis quicker.

Speaking of quick – total query time for the 3 queries above is ~8 seconds. Sort of fast, but this could be way faster. We could do a much more efficient table format, more simple data storage, query optimization, etc., to dramatically reduce the query time to probably less than 500ms. Imagine that: A clinician could simply type in their hypothesis and nearly instantly have a potential data drive diagnosis less than a second later.

Disclaimer: This is obviously not clinically proven but is an interesting prototype suggesting that this is indeed possible.


Interested in the project? Want to join the fun? Have suggestions to help me out? Comment below!

Leave a Reply

Your email address will not be published.