Finding Diseases that Match Variants

Posted by | Genomics, Healthcare | No Comments

In my last blog post, I talked about the challenges I had when trying to compare an individual genome variant file to the CDC Disease data set from a clinician’s perspective. It’s not a simple task to create a map from the symptom level down to the molecular level as I found out. There are many complications when it comes to scale, speed, table references, etc.

The scenario I was trying to solve was this:

“I’m a clinician and the patient I’m seeing today thinks they have ADHD based on their inability to focus, sit still, etc. I can observe their symptoms and have the ability to sequence their genome. How can I find out whether or not they may have a genetic mutation that could help me make an accurate diagnosis?”

Again, simple right? If we could solve part of this, doctors would have a really powerful utility at their disposal. The answer to this question is actually quite complex and took me a few weeks to work out a solution – albeit a hacked solution. It’s not pretty but it gets the job done. Fortunately, in a commercial world, we’d be able to make this pretty, performant, more simple, and way faster.

What I’ve done is created multiple tables to reference different aspects of the solution that held lead the clinicians down the route of solving their hypothesis based on the initial diagnosis. In the first step, I created a table called “Diagnosis” which houses a unique key specifying “ADHD”. I’ve also included other information such as Disease Class (Psychological), Study, Study Conclusion, and reference genes. The really useful column is “gene_id” which is where we house the primary gene associated with the disease. In the case of ADHD, this is widely considered DRD4. Each disease would have a similar setup with the option of adding many secondary or influential genes that we can surface as well.

At this step, all the clinician is thinking is “I think this kid has ADHD”. Therefore, the first step is to query the diagnosis database for the title ADHD and return the gene_id and chromosome associated with this disease. Chromosome is especially important because later on we’ll be querying against basepair positions which are not unique across the whole genome but unique across a chromosome. The query looks like this:

select gene_id, chrom
into temp table test
from diagnosis
where title = 'ADHD'
select gene_id, chrom from genes

This returns the result “gene_id = DRD4” and “chrom = 11”. Sweet! We now have our first bridge into the molecular level that allows us to query further.

Since we now have the gene_id as DRD4, we now need to figure out whether or not the kid has a genetic mutation against that specific gene (and how many mutations). I’ve created a separate database for each gene that we can query against which, in this case, the key “gene_id” would be DRD4. However, each gene has a range of positions. For example, DRD4 has 3,414 basepairs. On my last post, I totally messed this up because I ingested the position range into a single row which doesn’t work for querying against for obvious reasons. I needed to blow it out so that I could run an intersect against a whole range of basepairs.

To get around this (probably a hack job way), I created the table DRD4 which has each of the basepair ids. The column header format looks like this:

  • gene_id
  • pos
  • title (a diseases)
  • extended_title
  • chrom
  • species

In a commercial world example, we’d move the title and extended titles into a different table and just use some sort of reference key to pull them in. However, I’m not that awesome at all this and really just wanted to get a small building block created for this prototype – For Science!

So, at this point we want to isolate the gene DRD4 to be able to run an intersect query in the next step. This means that we need to pull the data associated with DRD4 into a temporary table to run the queries on. To do this, I ran the following query:

select *
into temp table test1
from genes
where gene_id = 'DRD4'

Awesome! We now have all of the data associated with DRD4 in a temporary table called “test1”.

Alright, last step. Since we sequenced the kids genome back in the clinicians office, we now have the full variant file consisting of all of this basepair mutations. What we want to do now is run an intersection query against our “test1” table consisting of the gene DRD4 against that variant file. The data that we’re specifically interested in here is the position and chromosome, since both dictate where the gene lives in both tables.

I already have a table setup that consists of a full genome variant file for Craig Venter who we already know has ADHD. Since the variant file is literally just a giant string of basepair numbers and chromosome, an intersection query should run fine against – and fairly fast! Here’s the query:

select pos, chrom
into temp table test2
from venter
select pos, chrom from test1

This returns the 3 beautiful basepair mutations associated with that gene:

pos chrom
639261 11
637884 11
639579 11

Success! Not the smoothest but it is still awesome to see the results that I’m looking for. I’ve now come full circle with the clinician. Their question was “Does this kid have ADHD?”. Our answer is now “Possibly, the kid has 3 genetic mutations on the gene highly associated with ADHD, which suggest genetic contributions towards the symptoms you’re seeing”.

Please note that there’s A LOT more data that I’m looking to bring in to help make an informed decision. The beauty of this is that we could start looking at additional data points such as blood work, genotypes, variant type, clinical study association, etc. All these additional data points are things that the clinician wouldn’t have at their disposal otherwise in a rapid format. The basics of this tool could help surface that information in a simple format that helps these clinicians make a more data driven diagnosis quicker.

Speaking of quick – total query time for the 3 queries above is ~8 seconds. Sort of fast, but this could be way faster. We could do a much more efficient table format, more simple data storage, query optimization, etc., to dramatically reduce the query time to probably less than 500ms. Imagine that: A clinician could simply type in their hypothesis and nearly instantly have a potential data drive diagnosis less than a second later.

Disclaimer: This is obviously not clinically proven but is an interesting prototype suggesting that this is indeed possible.

Interested in the project? Want to join the fun? Have suggestions to help me out? Comment below!

CDC Disease Data Compared Against Variant File

Posted by | Genomics, Healthcare | No Comments

Here is the scenario that I’m trying to solve:

“Physician has a patient they’re looking at who is in for lack of sitting still in class. Physician sees ADHD like symptoms but he doesn’t want to jump to conclusion and medicate a kid who might just be a ball of energy. He sequences the kids genome and then asks the system to surface up any issues within his genome that may be related to ADHD. Positive hit. The kid has 5 mutations on DRD4 and 2 on DRD3. There are also multiple medical population based studies with a strong suggestion leaning towards these mutations on these genes correlating with ADHD. The physician agrees that the symptoms match and decides to diagnose the kid.”

Turns out, the CDC has a pretty nifty database of many different diseases. While it was deprecated in 2014 for some reason (I probably just can’t find the most updated version) it still serves it’s purpose as a prototype for automatically alerting physicians of potential diseases based on variants.

The goal? Ingest the entire CDC dataset into a separate table the run an intersection against the genome variant table. Easy right?

No. I was so super wrong.

First up, the file is downloaded in .zip format. Easy – just unzip. It then is decompressed into a .tsv file which isn’t really that big of a deal. Turns out that this specific data set is around 355mb so the next thing to do in my super unsophisticated non programmer was is to loaded it into excel so that I can manipulate the data. This whole process is what I like to call a “poor mans ETL”.

Second, the data ended up being a total shit show. Here’s what the file looks like in command line.

What. The. Hell. Anyways, you can ingest this into Excel which is great. Once in excel, I cleansed a lot of the data where I basically got rid of a bunch of repetitive columns that aren’t necessarily useful for scaling the analysis pipeline.

Cool, data cleansed. I know I’m going to need to insert this into Redshift so I’m going to have to make sure the formatting is correct. That means looking out for special characters, commas in weird places, etc. This took forever. Long story short, ingesting data is a sensitive process and I’m understanding why data architects make the big bucks.

My methods are pretty noobish for getting it into ingestible format. Save as .csv. Run a command line tool called “csvtojson” which outputs a nice JSON format file (except that it adds a comma to the end of each row, which Redshift can’t handle. Delete.)

Now I was ready to make the table. Here are the columns used:

  • id
  • disease (title)
  • disease_class (Psych, Cancer, etc.)
  • gene_id (actual gene name)
  • population (for when there was clinical study data)
  • study_size (yep, see above)
  • control (reference above)
  • title (title of the actual study)
  • pos (start position of basepairs found in disease)
  • pos_end (end position of basepairs found in disease)
  • conclusion (end result of the study – useful for physician diagnoses)
  • env_factor (what type of environment the population lived under)
  • gi_gene_a (gene association/interaction)
  • gi_gene_b (gene association/interaction)
  • gi_gene_c (gene association/interaction)
  • gi_comb_env (no clue, didn’t mean to ingest…)
  • chrom (chromosome number)

Why did I bold “pos” and “pos_end”? You’ll find out soon enough.

Sweet! I now ingested 137,000 documents around diseases that I can query against. Many of the diseases are repeated but there are 12,764 unique gene_ids.

As I alluded to in my last blog post, DRD4 is a gene that is near and dear to my heart. I happen to know that I have a mutation on it and so does Craig Venter. To test the system, I decided to start with prior knowledge and query against Craig’s genome to get the range of basepairs where he has mutations on DRD4.


“Pos” is the column we want. Now, the next step was to ask the question “Do any of these pos variants live within in my shiny new diseases table?”

Back to why I bolded  pos and pos_end above. Wait a second. I ingested the CDC pos and pos_end data. That means a single row contains both the start and end range. DRD4 happens to have 3,413 basepairs on it. I have a start and end. Nothing in between. How do I query a single variant, such as 637,884, against a range of values that I don’t have?



Big miss. Nada bueno. I tried many different queries to see if I could munge it together but no dice. But, there’s a super hack that could be dig me out of this hole. I create a new table called “DRD4” and put the following columns on it:

  • gene_id (the gene name, eg. DRD4)
  • pos (basepair unique position, I took starting basepair value and +1 until I got to the end)
  • title (ADHD for abbreviated clinical name)
  • extended_title (Attention Deficit Hyperactivity Disorder for more “professional” sound)

Easy enough. Ingest the data. Now all we have to do is run the following query:

select pos from venter
where pos between 637293 and 640706
and chrom = 11
select pos from drd4
where pos between 637293 and 640706

Sweet! This returns my 3 known variants, so it worked. But this doesn’t solve my scenario. My physician isn’t going to know what DRD4. He’s going to know what ADHD is. We have the top level data point (“I think he has ADHD“) and the bottom level datapoint (“Variants 637884, 639261, 639579“). How do we find the intersection between both across 3 tables?

Good question. Still trying to figure that out. Going to save that for another blog post unfortunately. In a normal commercial system built by engineers much more brilliant than myself, we would have a query API in place that would allow the user to ask these questions in plan language.

Close, but not quite there yet. Good news is that I can see a path forward. The disease data is missing critical information such as whether the variant was an inversion, duplication, insertion, etc. It’s also missing genotyping which will be a critical role in this whole thing. Until next time…

Analyzing DRD4 on Craig Venter’s Genome

Posted by | Genomics, Healthcare, Technology | No Comments

Craig Venter and I share a similar genetic mutation: we both have ADHD. I was diagnosed at around the age 13 after struggling to pay attention in class, could not stop fidgeting, couldn’t focus, would forget everything, etc. The list goes on and on. Basically, I was a total shit student because I was all over the place mentally. The only time I was able to focus was when adrenaline was kicking in or it was something that really interested me and got me excited.

After learning of my diagnoses, I did what any 13 year old would do and started to go on a vision quest of what exactly it meant to have ADHD. I didn’t quite get it because I would look at other students and long to just be able to sit still and focus. I felt completely out of place. I started by going to the trust Google and searching, searching, searching. I read many articles that I didn’t understand and lots of unclear direction as to what really caused ADHD. I turned up dry with results except for one thing: ADHD was some symptom of genetics.

Years later, while I was in a brief moment of college, I went on the quest again to understand why this happened. I don’t remember where I heard it but someone told me that people with ADHD or entrepreneurs had the “risk taking” gene. After Googling that, I found the results of DRD4. I also found out that the man who initially sequenced the human genome also had it too – Craig Venter. This is 50% where my interest in genomics comes from while the other 50% is cancer and how my family is plagued with it (a post for a different day).

Fast forward to today and I’m now tinkering with massive scalable data warehouses that can hold 1,000’s of genomes to do population-based comparative genomics. The 1st genome on the list that has been ingested into this database was Craig Venter’s as a small tribute to someone I admire. I’ve ingested his variant format file which shows all of the SNPs (single nucleotide polymorphisms) within his genome compared to a reference genome. This netted around 3 million rows inserted.

I’ve also mapped this to a database of diseases where, upon ingestion, an intersection between the variant file and diseases is surfaced. This provides instant insights into diseases that may be present based on the diseases in the database. It’s simple and crude at the moment, and is by no means up to clinical grade. However, it’s one step towards pulling in additional data and running machine learning models to find the propensity of different diseases. Based on current benchmarks, I anticipate that we can do this in less than 1 minute.

This blog is really a reflection on something pretty extraordinary that I’m proud that I’ve built. While its basic, it’s been insanely rewarding to see the results. To bring this post full circle, I hope to explore much further the impacts of base pair mutations on DRD4 which start with a simple query and a simple image:


These are the mutations of Venter’s base pairs within Chromosome 11 at the specific base pair range of DRD4. The next steps I’ll be taking are associating this with genotyping mutations, association with a gene database & annotation, and providing multiple other genomes for comparison.

As a last item, probably the most serendipitous moment in this little adventure so far has been when I ran the first intersection between a variant test file and a database of 750. Again, not clinical grade by any means, but it was still a great moment to see a very small baby step towards a grander vision.


If interested, feel free to reach out to me if you have questions, would like to help, or just want to talk shop.