Federated Queries in Genomics

Posted by | Genomics, Technology | No Comments

In the late 90s and early 2000s there was a trend called “federated search” that gained some ground in how the web was architected. Federated search, also known as “federated querying”, allows a system to query multiple disparate data sources and surface a single result set from a single query. It never really took off, however, because data architecture was still disorganized at the time and data storage performance was not yet good enough.

Another trend in this world was storing data from many different sources in a single repository, which is where the term “data lake” came into play. The term describes exactly what it is: a giant lake of data, fed from many different sources, that you can query against.

Both trends ended up fading pretty quickly as better-suited solutions became apparent for specific industries. Postgres and NoSQL arrived on the scene and became the databases of choice for many needs within the high tech world. NoSQL in particular offered incredible performance for fast ingestion and really fast queries. However, the unstructured nature of NoSQL makes it problematic for many types of organizations.

In a funny way, the “big data” world is coming back around to federated querying. I think genomics in particular is going to be a huge user of this type of architecture, given the dramatically different database requirements of different parts of a genomics pipeline. Some systems will want very rigid, structured databases, whereas others will want the freedom of unstructured storage that allows them to scale and bend their data.

For example, you might want to store information about genes and all their annotated metadata in a MySQL database with optimized query performance. However, you may want a Postgres build for ingesting human genome variant data. Postgres could be nice for this given its parallel processing nature. In another realm, you may want to store clinical trial data in a NoSQL database in order to return large arrays of data extremely fast. This means that you have 3 different databases whose query languages are similar but not the same. A structured federated query language that hits all 3 sources would be beneficial.

An example of a federated query could look something like:

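# Predicate names below (gene:symbol, genome_variants:position, ...) are hypothetical.
PREFIX gene: </local_endpoint/genes/>
PREFIX diseases: </local_endpoint/diseases/>
PREFIX genome_variants: </local_endpoint/genome_variants/>
PREFIX clinical_trials: </local_endpoint/clinical_trials/>

SELECT ?gene ?chromosome ?basepair WHERE {
    SERVICE </local_query_api_endpoint/> {
        ?variant genome_variants:chromosome ?chromosome ;
                 genome_variants:position ?basepair ;
                 genome_variants:gene ?gene .
        ?gene gene:symbol ?symbol .
        ?disease diseases:sameAs ?gene ;
                 diseases:name ?diseaseName .
        FILTER(?chromosome = 11 && ?basepair >= 100000 && ?basepair <= 200000)
        FILTER(?diseaseName = "ADHD" && ?symbol = "DRD4")
    }
}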

This isn’t necessarily the prettiest (or most accurate) representation of a federated query, but it shows the structure with which we might join multiple data sources, each with its own specific query, into one table. This table might be persisted, or we may just be running an ad hoc query whose results are stored in a temp table.
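To make the idea a bit more concrete, here is a minimal Python sketch of the same fan-out-and-join pattern. It is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the MySQL and Postgres stores, a list of dicts stands in for the NoSQL store, and every table, column, gene, and trial name is made up.

```python
import sqlite3

# Hypothetical stand-ins: two in-memory SQLite databases play the role of the
# MySQL gene store and the Postgres variant store; a list of dicts plays the
# role of the NoSQL clinical trial store. All names and rows are made up.
genes_db = sqlite3.connect(":memory:")
genes_db.execute(
    "CREATE TABLE genes (symbol TEXT, chrom TEXT, start_bp INTEGER, end_bp INTEGER)"
)
genes_db.execute("INSERT INTO genes VALUES ('GENE_A', '11', 100000, 200000)")

variants_db = sqlite3.connect(":memory:")
variants_db.execute(
    "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
)
variants_db.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?)",
    [("11", 150000, "A", "G"), ("11", 250000, "C", "T"), ("2", 150000, "G", "A")],
)

clinical_trials = [
    {"trial_id": "T-1", "gene": "GENE_A"},
    {"trial_id": "T-2", "gene": "GENE_B"},
]


def federated_query(symbol):
    """Fan one logical query out to three sources and join the results."""
    # Source 1: look up the gene's chromosome and base pair range.
    gene = genes_db.execute(
        "SELECT chrom, start_bp, end_bp FROM genes WHERE symbol = ?", (symbol,)
    ).fetchone()
    if gene is None:
        return []
    chrom, start_bp, end_bp = gene

    # Source 2: fetch the variants that fall inside that range.
    variants = variants_db.execute(
        "SELECT pos, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start_bp, end_bp),
    ).fetchall()

    # Source 3: filter the document store for trials that mention the gene.
    trials = [t["trial_id"] for t in clinical_trials if t["gene"] == symbol]

    # Join everything into one table-like list of rows.
    return [
        {"gene": symbol, "chrom": chrom, "pos": pos, "ref": ref, "alt": alt,
         "trials": trials}
        for pos, ref, alt in variants
    ]


rows = federated_query("GENE_A")
```

The caller only ever sees `federated_query`; which store answers which part of the question stays hidden behind it, which is the main appeal of the federated approach.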

Apart from their flexibility, federated queries are attractive to genomics primarily because they can provide incredible performance across massive datasets. There are other query languages and implementations in the field that act more as aggregators than federations; however, their purpose is very similar. Facebook, in my opinion, has the most advanced version of this, where they use a node:leaf system paired with GPU-based operations to query petabytes of data.

Among the other little experiments I’ve been running, I’ll be testing out an implementation of an aggregation service on a small scale from disparate data sources to see if I can create a simple example of this.

Analyzing DRD4 on Craig Venter’s Genome

Posted by | Genomics, Healthcare, Technology | No Comments

Craig Venter and I share a similar genetic mutation: we both have ADHD. I was diagnosed at around age 13 after struggling to pay attention in class: I could not stop fidgeting, couldn’t focus, would forget everything, and so on. The list goes on and on. Basically, I was a total shit student because I was all over the place mentally. The only time I was able to focus was when adrenaline kicked in or when something really interested and excited me.

After learning of my diagnosis, I did what any 13 year old would do and went on a vision quest to find out what exactly it meant to have ADHD. I didn’t quite get it, because I would look at other students and long to just be able to sit still and focus. I felt completely out of place. I started by going to trusty Google and searching, searching, searching. I read many articles that I didn’t understand and found lots of unclear direction as to what really caused ADHD. I came up dry except for one thing: ADHD had something to do with genetics.

Years later, during my brief time in college, I went on the quest again to understand why this happened. I don’t remember where I heard it, but someone told me that people with ADHD or entrepreneurs had the “risk taking” gene. After Googling that, I found DRD4. I also found out that Craig Venter, who led one of the first efforts to sequence the human genome, had it too. This is 50% of where my interest in genomics comes from; the other 50% is cancer and how my family is plagued with it (a post for a different day).

Fast forward to today and I’m now tinkering with massively scalable data warehouses that can hold thousands of genomes for population-based comparative genomics. The first genome ingested into this database was Craig Venter’s, as a small tribute to someone I admire. I’ve ingested his variant file, which lists all of the SNPs (single nucleotide polymorphisms) in his genome compared to a reference genome. This came to around 3 million rows inserted.

I’ve also mapped this to a database of diseases so that, upon ingestion, the intersection between the variant file and the diseases is surfaced. This provides instant insight into diseases that may be present, limited to the diseases in the database. It’s simple and crude at the moment, and is by no means clinical grade. However, it’s one step towards pulling in additional data and running machine learning models to estimate the propensity for different diseases. Based on current benchmarks, I anticipate that we can do this in less than 1 minute.
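As a rough illustration of what that intersection step could look like, here is a small Python sketch. It is not the actual pipeline: the five-column, tab-separated variant layout, the disease table keyed by (chromosome, position, alternate allele), and all of the rows are assumptions made up for the example.

```python
# Hypothetical, simplified variant lines (chrom, pos, id, ref, alt), tab-separated.
# Real variant files carry header lines and more columns; this data is made up.
VARIANT_LINES = """\
#chrom\tpos\tid\tref\talt
11\t1000\trsA\tA\tG
11\t2000\trsB\tC\tT
7\t3000\trsC\tG\tA
"""

# Hypothetical disease table keyed by (chrom, pos, alt allele).
DISEASE_VARIANTS = {
    ("11", 2000, "T"): "example disease 1",
    ("3", 500, "C"): "example disease 2",
}


def parse_variants(text):
    """Yield one dict per variant line, skipping blanks and '#' header lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, var_id, ref, alt = line.split("\t")
        yield {"chrom": chrom, "pos": int(pos), "id": var_id, "ref": ref, "alt": alt}


def intersect(variants, disease_variants):
    """Return (variant id, disease) pairs for variants found in the disease table."""
    hits = []
    for v in variants:
        disease = disease_variants.get((v["chrom"], v["pos"], v["alt"]))
        if disease is not None:
            hits.append((v["id"], disease))
    return hits


hits = intersect(parse_variants(VARIANT_LINES), DISEASE_VARIANTS)
```

Because the disease table is a dictionary lookup, the cost of the intersection grows with the number of variants rather than with the size of the disease database, which is one simple way an ingest-time check like this can stay fast.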

This post is really a reflection on something pretty extraordinary that I’m proud to have built. While it’s basic, it’s been insanely rewarding to see the results. To bring this post full circle, I hope to explore much further the impact of base pair mutations on DRD4, which starts with a simple query and a simple image:


These are the mutations in Venter’s base pairs on Chromosome 11 within the base pair range of DRD4. The next steps I’ll be taking are associating these with genotyped mutations, linking them to a gene database and its annotations, and adding multiple other genomes for comparison.
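For readers who want a feel for the “simple query” part, here is a hedged Python sketch using an in-memory SQLite table. The schema, the sample rows, and especially the region coordinates are placeholders: they are not the real DRD4 locus, which should be taken from a gene annotation source for whichever reference build the variants were called against.

```python
import sqlite3

# Placeholder coordinates, NOT the real DRD4 locus; they only illustrate the
# shape of a region lookup.
GENE_REGION = {"chrom": "11", "start": 1000, "end": 2000}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)")
# A composite index keeps range lookups cheap as the table grows.
conn.execute("CREATE INDEX idx_variants_chrom_pos ON variants (chrom, pos)")
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?)",
    [
        ("11", 900, "G", "A"),   # same chromosome, outside the region
        ("11", 1200, "A", "G"),  # inside the region
        ("11", 1800, "C", "T"),  # inside the region
        ("12", 1500, "T", "C"),  # right position, wrong chromosome
    ],
)


def variants_in_region(connection, region):
    """Return (pos, ref, alt) rows that fall inside the given region."""
    return connection.execute(
        "SELECT pos, ref, alt FROM variants "
        "WHERE chrom = :chrom AND pos BETWEEN :start AND :end "
        "ORDER BY pos",
        region,
    ).fetchall()


region_variants = variants_in_region(conn, GENE_REGION)
```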

As a last item, probably the most serendipitous moment in this little adventure so far has been when I ran the first intersection between a variant test file and a database of 750. Again, not clinical grade by any means, but it was still a great moment to see a very small baby step towards a grander vision.


If you’re interested, feel free to reach out to me with questions, offers to help, or just to talk shop.




Convergence of Mobile and Web: How apps are the apex of it all

Posted by | Technology | No Comments

There’s a great term floating around that I think will eventually become how we think about software: the appification of everything. We’re seeing a lot of trends moving towards this type of thinking because it adds utility to much of the software we’re building. This trend is really composed of 4 foundations, with the forcing function being pressure from the consumer market and its many devices.

1) Seamless Unification of Web, Mobile, and Everything Else

The number of devices per person is averaging around 3 as of 2015. Since web was the first to gain adoption, we typically see the most robust systems here. However, this has changed dramatically with the introduction of apps for mobile devices. This is the “appification” aspect of technology. Until recent years, if you wanted your product to be cross platform, you were required to develop differently for each platform. While much of that is still true today, we’ve seen a massive trend towards what we call “Cards”: a design and functionality pattern that provides a consistent experience across all devices. Cards are like mini-apps that allow developers to retain similar designs across all devices with similar levels of functionality. The key is that they all tie back to the same backend system. This is the first step towards seamless unification across all platforms. Once this happens, we’ll see user profiles and analytics converge, so that we’re able to share information across platforms and devices from one central location.

2) Requirement for Different Presentation Layers yet Same Features and Functionality

With multiple platforms come multiple ways of presenting. This has posed a challenge for businesses, since they often lose brand identity as they move across platforms. They often can’t retain things like functionality, style, fonts, and more that make their brand what it is. With the “appification” of everything, delivered through mechanisms such as cards, this has become a hurdle that can be conquered. Cards provide a contract to the presentation layer that basically says “We’re a card and our outer system will follow what your platform requires.” However, the key is the second part of the conversation: “We’ll present in your format but we’re going to loop in our own functionality that will be contained within our Card.” A perfect example of this is how Google Maps works. Maps is cross platform and varies in the level of data being presented; however, as you interact with and expand the app, it retains the same robust functionality as the full scale app. Since we’re able to retain the same functionality across platforms, we can also pass information across platforms, which is great for analytics.
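One way to picture that two-part contract is the following Python sketch. It is purely illustrative: the `MapCard` class, the `PLATFORM_LAYOUTS` table, and the layout limits are invented for the example and do not correspond to any real card framework.

```python
from dataclasses import dataclass

# Hypothetical presentation rules dictated by each host platform.
PLATFORM_LAYOUTS = {
    "web": {"max_title_chars": 60, "columns": 3},
    "mobile": {"max_title_chars": 24, "columns": 1},
}


@dataclass
class MapCard:
    title: str
    places: list

    def search(self, term):
        """Contained functionality: behaves identically on every platform."""
        return [p for p in self.places if term.lower() in p.lower()]

    def render(self, platform):
        """Outer presentation: follows whatever the host platform requires."""
        layout = PLATFORM_LAYOUTS[platform]
        return {
            "platform": platform,
            "title": self.title[: layout["max_title_chars"]],
            "columns": layout["columns"],
        }


card = MapCard(
    title="Coffee shops near the office downtown",
    places=["Blue Cup Coffee", "Tea House", "Corner Coffee"],
)
web_view = card.render("web")
mobile_view = card.render("mobile")
matches = card.search("coffee")
```

The `render` output differs per platform, while `search` returns the same results wherever the Card is embedded; that split is the contract described above.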

3) Unification of Data and User Profiles across Platforms

Since Cards can be passed through different platforms, we’re able to collect richer data that is unified. In the past, it’s been a nightmare to unify user profile data across platforms to build a comprehensive understanding of a user. With Cards, we now have two things to worry about: the collection schema and the data schema. The collection schema is specific to the environment that the Card is living in, meaning that on the web I collect “A, B, C” data for a user whereas on mobile I collect “G, H, I” data. From there, it’s up to the user profile on the server side to handle the collected data, which means that the user profile schema must be able to accept this information. This is incredibly useful because the server side allows you to do interesting things with cross-platform data, such as recommendations or content personalization based on unified profile data. The nature of Cards allows you to easily add, remove, or swap out different levels of functionality through microservices.
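A minimal sketch of the two schemas might look like this in Python. The field names, the `COLLECTION_SCHEMAS` table, and the profile layout are all hypothetical; the point is only that each platform collects its own fields while one server-side profile accepts them all.

```python
# Hypothetical collection schemas: the fields each platform is allowed to send.
COLLECTION_SCHEMAS = {
    "web": {"referrer", "browser", "page_views"},
    "mobile": {"device_model", "os_version", "push_enabled"},
}


def collect(platform, raw_event):
    """Keep only the fields that the platform's collection schema allows."""
    allowed = COLLECTION_SCHEMAS[platform]
    return {key: value for key, value in raw_event.items() if key in allowed}


def merge_into_profile(profile, platform, raw_event):
    """Server side: fold platform-specific data into one unified user profile."""
    platform_data = profile.setdefault("platforms", {}).setdefault(platform, {})
    platform_data.update(collect(platform, raw_event))
    return profile


profile = {"user_id": "u-123"}
merge_into_profile(profile, "web", {"browser": "Firefox", "page_views": 4, "gps": "dropped"})
merge_into_profile(profile, "mobile", {"device_model": "Phone X", "push_enabled": True})
```

After both calls the profile holds web and mobile data side by side under one `user_id`, and the `gps` field is discarded because it is not part of the web collection schema.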

4) Advances in Microservice Capability for Interoperability

One of the beautiful things about Cards and microservices is their ability to change constantly. With this structure, you’re able to update APIs and functionality on any platform (including native ones) without having to relaunch SDKs. I’ve seen incredible use cases for this, such as delivering content, updating core functionality, or swapping out logic and engines behind the scenes. Additionally, you can attach multiple microservices to a Card to extend its functionality. This plays well with interoperability, since other platforms can subscribe to data updates. One key point here is that you can expand your functionality only within the limits of the Card. Since Cards are isolated from the exterior environment, they provide a great way to insert robust apps into hostile environments while still being able to “land and expand” functionality in the future.
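The “swap out logic behind the scenes” idea can be sketched in a few lines of Python. This is an assumption-laden toy: the `Card` class and both recommendation functions are invented here, and in a real system each handler would be a call to a remote microservice rather than a local function.

```python
class Card:
    """Toy Card that looks its services up by name at call time."""

    def __init__(self):
        self._services = {}

    def register(self, name, handler):
        """Add a service, or replace an existing one, without touching the host app."""
        self._services[name] = handler

    def call(self, name, *args):
        return self._services[name](*args)


# Two hypothetical versions of a recommendation engine.
def recommend_v1(history):
    return sorted(history)[:2]


def recommend_v2(history):
    return sorted(history, reverse=True)[:2]


card = Card()
card.register("recommend", recommend_v1)
before = card.call("recommend", ["c", "a", "b"])

# Swap the engine behind the same name; callers of the Card do not change.
card.register("recommend", recommend_v2)
after = card.call("recommend", ["c", "a", "b"])
```

Because callers only know the service name, the engine behind it can be replaced at any time, which mirrors how a Card can pick up new behavior without a new SDK release.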

In the near future I believe we’ll see a large shift towards everything being app based. We’ve seen this movement with larger corporations such as Google, Pinterest, Facebook, and Apple. I’m betting that we’ll soon see more platforms making it easier to develop this type of technology, with more advanced developer tools built specifically for it, app-focused delivery mechanisms, and all-in-one development solutions that allow apps to be created in one place but deployed everywhere. As the number of devices grows and we become more connected, I believe we’ll see code bases collapse into something like apps so that we can provide the unification that businesses and users are looking for.