Federated Queries in Genomics

January 26, 2016 | Genomics, Technology

In the late 90s and early 2000s there was a trend called “federated search” that gained some ground in how the web was architected. Federated search, also known as “federated querying”, lets a system send a single query to multiple disparate data sources and surface a single combined result. It never really took hold, however, because data architecture was still disorganized at the time and data storage performance was not good enough.

Another trend in this world was pulling data from many different sources into a single store, which is where the term “data lake” came into play. The term describes exactly what it is: a giant lake of data, drawn from many different sources, that you can query in one place.

Both trends faded fairly quickly as better-suited solutions became apparent for specific industries. Postgres and NoSQL came onto the scene and became the databases of choice for many needs within the high tech world. NoSQL in particular offered incredible performance, with fast ingestion and very fast query times. However, its unstructured nature makes it problematic for many types of organizations.

In a funny way, the “big data” world is coming back around to federated querying. I think genomics in particular is going to be a huge user of this type of architecture, because different parts of a genomics pipeline have dramatically different database requirements. Some systems will want very rigid, structured databases, whereas others will want the freedom of unstructured storage that lets them scale and reshape their data.

For example, you might want to store information about genes and all their annotated metadata in a MySQL database tuned for query performance. However, you may want a Postgres build for ingesting human genome variant data; Postgres could be nice for this given its support for parallel query execution. In another realm, you may want to store clinical trial data in a NoSQL database in order to return large arrays of data extremely fast. That leaves you with three databases whose query languages are similar but not the same. A structured federated query language that can hit all three sources would be beneficial.

An example of a federated query could look something like:

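PREFIX gene: </local_endpoint/genes/>
PREFIX diseases: </local_endpoint/diseases/>
PREFIX genome_variants: </local_endpoint/genome_variants/>
PREFIX clinical_trials: </local_endpoint/clinical_trials/>

# Predicate names (gene:chromosome, genome_variants:basepair, ...) are
# hypothetical; substitute whatever vocabulary the endpoint exposes.
SELECT ?gene ?chromosome ?basepair WHERE {
    SERVICE </local_query_api_endpoint/> {
        ?gene gene:symbol ?symbol ;
              gene:chromosome ?chromosome .
        ?variant genome_variants:gene ?gene ;
                 genome_variants:basepair ?basepair .
        ?disease diseases:name "ADHD" ;
                 diseases:associatedGene ?gene .
        FILTER(?chromosome = 11)
        FILTER(?basepair >= 100000 && ?basepair <= 200000)
        FILTER(str(?symbol) = "DRD4")
    }
}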

This isn’t necessarily the prettiest (or most accurate) representation of a federated query, but it shows the structure in which we might join multiple data sources, each with its own specific query, into one table. This table might be persisted, or we may just be running a one-off (“floating”) query, in which case we could store the result in a temp table.
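To make the idea a bit more concrete, here is a minimal Python sketch of the same pattern. It is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the MySQL gene store and the Postgres variant store, a list of dicts stands in for the NoSQL clinical trial store, and every table name, field name, and row is made up. Each source is queried in its own way, the partial results are joined in application code, and the joined rows are written to a temp table.

```python
import sqlite3

# Stand-ins for the three stores. All schemas and rows are hypothetical.
def make_gene_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE genes (symbol TEXT, chromosome INTEGER)")
    db.executemany("INSERT INTO genes VALUES (?, ?)",
                   [("DRD4", 11), ("BRCA1", 17)])
    return db

def make_variant_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE variants (symbol TEXT, basepair INTEGER)")
    db.executemany("INSERT INTO variants VALUES (?, ?)",
                   [("DRD4", 150000), ("DRD4", 950000), ("BRCA1", 120000)])
    return db

# A list of dicts stands in for a document (NoSQL) store.
CLINICAL_TRIALS = [
    {"trial_id": "T-001", "condition": "ADHD", "genes": ["DRD4"]},
    {"trial_id": "T-002", "condition": "Breast cancer", "genes": ["BRCA1"]},
]

def federated_query(gene_db, variant_db, trials, chromosome, lo, hi, condition):
    """Query each source in its own dialect, then join the partial results."""
    genes = {row[0] for row in gene_db.execute(
        "SELECT symbol FROM genes WHERE chromosome = ?", (chromosome,))}
    variants = variant_db.execute(
        "SELECT symbol, basepair FROM variants WHERE basepair BETWEEN ? AND ?",
        (lo, hi)).fetchall()
    trial_genes = {g for t in trials if t["condition"] == condition
                   for g in t["genes"]}
    wanted = genes & trial_genes  # genes present in both filtered sources
    return [(symbol, chromosome, bp) for symbol, bp in variants
            if symbol in wanted]

def persist(rows):
    """Store the joined rows in a temp table so later queries can reuse them."""
    out = sqlite3.connect(":memory:")
    out.execute("CREATE TEMP TABLE federated_result "
                "(gene TEXT, chromosome INTEGER, basepair INTEGER)")
    out.executemany("INSERT INTO federated_result VALUES (?, ?, ?)", rows)
    return out

rows = federated_query(make_gene_db(), make_variant_db(), CLINICAL_TRIALS,
                       chromosome=11, lo=100000, hi=200000, condition="ADHD")
result_db = persist(rows)
print(result_db.execute("SELECT * FROM federated_result").fetchall())
```

In a real setup the join would more likely be pushed down to a query engine than done in application code, but the shape is the same: one logical query, several physical ones, one result table.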

Apart from their flexibility, federated queries are attractive to genomics primarily because they can provide incredible performance across massive datasets. There are other query languages and implementations in the field that act more as aggregators than as federations; however, their purpose is very similar. Facebook, in my opinion, has the most advanced version of this: they use a node:leaf system paired with GPU-based operations for querying against petabytes of data.
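For a rough feel of what an aggregator over leaf nodes does, here is a small scatter-gather sketch in Python. It is not a description of Facebook's system or of any particular product: the shards, the record fields, and the function names are all hypothetical, threads stand in for separate machines, and there is no GPU work involved. Each leaf answers the query for its own shard, and the aggregator merges the partial answers.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "leaf" holds one shard of variant records (hypothetical data).
SHARDS = [
    [{"gene": "DRD4", "basepair": 150000}, {"gene": "BRCA1", "basepair": 120000}],
    [{"gene": "DRD4", "basepair": 180000}],
    [{"gene": "TP53", "basepair": 90000}, {"gene": "DRD4", "basepair": 950000}],
]

def leaf_count(shard, lo, hi):
    """Leaf step: count matching variants per gene within one shard."""
    counts = {}
    for record in shard:
        if lo <= record["basepair"] <= hi:
            counts[record["gene"]] = counts.get(record["gene"], 0) + 1
    return counts

def aggregate(partials):
    """Aggregator step: merge the per-leaf counts into one result."""
    total = {}
    for partial in partials:
        for gene, n in partial.items():
            total[gene] = total.get(gene, 0) + n
    return total

def scatter_gather(shards, lo, hi):
    """Send the same range query to every leaf, then merge the answers."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: leaf_count(s, lo, hi), shards))
    return aggregate(partials)

totals = scatter_gather(SHARDS, 100000, 200000)
print(totals)
```

The performance argument comes from the leaf step: each leaf only scans its own slice of the data, so adding leaves spreads the scan across more workers while the aggregator only handles small partial results.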

Among the other little experiments I’ve been running, I’ll be testing a small-scale implementation of an aggregation service over disparate data sources to see if I can create a simple example of this.
