Here is the scenario that I’m trying to solve:
“Physician has a patient they’re looking at who is in for lack of sitting still in class. Physician sees ADHD like symptoms but he doesn’t want to jump to conclusion and medicate a kid who might just be a ball of energy. He sequences the kids genome and then asks the system to surface up any issues within his genome that may be related to ADHD. Positive hit. The kid has 5 mutations on DRD4 and 2 on DRD3. There are also multiple medical population based studies with a strong suggestion leaning towards these mutations on these genes correlating with ADHD. The physician agrees that the symptoms match and decides to diagnose the kid.”
Turns out, the CDC has a pretty nifty database of many different diseases. While it was deprecated in 2014 for some reason (I probably just can’t find the most updated version) it still serves it’s purpose as a prototype for automatically alerting physicians of potential diseases based on variants.
The goal? Ingest the entire CDC dataset into a separate table the run an intersection against the genome variant table. Easy right?
No. I was so super wrong.
First up, the file is downloaded in .zip format. Easy – just unzip. It then is decompressed into a .tsv file which isn’t really that big of a deal. Turns out that this specific data set is around 355mb so the next thing to do in my super unsophisticated non programmer was is to loaded it into excel so that I can manipulate the data. This whole process is what I like to call a “poor mans ETL”.
Second, the data ended up being a total shit show. Here’s what the file looks like in command line.
What. The. Hell. Anyways, you can ingest this into Excel which is great. Once in excel, I cleansed a lot of the data where I basically got rid of a bunch of repetitive columns that aren’t necessarily useful for scaling the analysis pipeline.
Cool, data cleansed. I know I’m going to need to insert this into Redshift so I’m going to have to make sure the formatting is correct. That means looking out for special characters, commas in weird places, etc. This took forever. Long story short, ingesting data is a sensitive process and I’m understanding why data architects make the big bucks.
My methods are pretty noobish for getting it into ingestible format. Save as .csv. Run a command line tool called “csvtojson” which outputs a nice JSON format file (except that it adds a comma to the end of each row, which Redshift can’t handle. Delete.)
Now I was ready to make the table. Here are the columns used:
- disease (title)
- disease_class (Psych, Cancer, etc.)
- gene_id (actual gene name)
- population (for when there was clinical study data)
- study_size (yep, see above)
- control (reference above)
- title (title of the actual study)
- pos (start position of basepairs found in disease)
- pos_end (end position of basepairs found in disease)
- conclusion (end result of the study – useful for physician diagnoses)
- env_factor (what type of environment the population lived under)
- gi_gene_a (gene association/interaction)
- gi_gene_b (gene association/interaction)
- gi_gene_c (gene association/interaction)
- gi_comb_env (no clue, didn’t mean to ingest…)
- chrom (chromosome number)
Why did I bold “pos” and “pos_end”? You’ll find out soon enough.
Sweet! I now ingested 137,000 documents around diseases that I can query against. Many of the diseases are repeated but there are 12,764 unique gene_ids.
As I alluded to in my last blog post, DRD4 is a gene that is near and dear to my heart. I happen to know that I have a mutation on it and so does Craig Venter. To test the system, I decided to start with prior knowledge and query against Craig’s genome to get the range of basepairs where he has mutations on DRD4.
“Pos” is the column we want. Now, the next step was to ask the question “Do any of these pos variants live within in my shiny new diseases table?”
Back to why I bolded pos and pos_end above. Wait a second. I ingested the CDC pos and pos_end data. That means a single row contains both the start and end range. DRD4 happens to have 3,413 basepairs on it. I have a start and end. Nothing in between. How do I query a single variant, such as 637,884, against a range of values that I don’t have?
Big miss. Nada bueno. I tried many different queries to see if I could munge it together but no dice. But, there’s a super hack that could be dig me out of this hole. I create a new table called “DRD4” and put the following columns on it:
- gene_id (the gene name, eg. DRD4)
- pos (basepair unique position, I took starting basepair value and +1 until I got to the end)
- title (ADHD for abbreviated clinical name)
- extended_title (Attention Deficit Hyperactivity Disorder for more “professional” sound)
Easy enough. Ingest the data. Now all we have to do is run the following query:
select pos from venter
where pos between 637293 and 640706
and chrom = 11
select pos from drd4
where pos between 637293 and 640706
Sweet! This returns my 3 known variants, so it worked. But this doesn’t solve my scenario. My physician isn’t going to know what DRD4. He’s going to know what ADHD is. We have the top level data point (“I think he has ADHD“) and the bottom level datapoint (“Variants 637884, 639261, 639579“). How do we find the intersection between both across 3 tables?
Good question. Still trying to figure that out. Going to save that for another blog post unfortunately. In a normal commercial system built by engineers much more brilliant than myself, we would have a query API in place that would allow the user to ask these questions in plan language.
Close, but not quite there yet. Good news is that I can see a path forward. The disease data is missing critical information such as whether the variant was an inversion, duplication, insertion, etc. It’s also missing genotyping which will be a critical role in this whole thing. Until next time…