In its journey to provide analytics of distributed, locally-controlled health data, CanDIG naturally is looking to machine learning as one of its next steps towards improving its platform. Learning in decentralized environments is a somewhat new problem, with some of the first marked progress being made in 2016 with the onset of Federated Averaging. Since then, federated learning – the study of attempting to federate ML models – has received much attention from the ML community, with frameworks such as Tensorflow Federated, NVFlare, and Flower...
Continue...GraphQL is a query language that was released by Facebook in 2015. A lot of big companies such as Twitter, Netflix and PayPal have adopted GraphQL and released GraphQL APIs. However, most use cases for GraphQL are to connect it directly to their user databases. Since the GraphQL interface is usually deployed within one organization and connected with multiple data storage within the same organization, there are not too many security or privacy concerns. At CanDIG, our efforts are focused on developing a large-scale, federated...
Continue...Keycloak Overview and Our Usecase Keycloak is an open-source identity and access management solution. Some of the features that it provides out-of-the-box are the easy deployment of application authentication, single-sign on, identity brokering from various sources like social media applications, user federation by creating a facade for LDAP, Active Directory, etc., and all of this using standard protocols like SAML, OpenID Connect (OIDC), etc. CanDIG uses Keycloak (one deployment on each site in a project), to create a uniform interface for authentication and authorization infrastructure...
Continue...CanDIG, is building the next generation of it’s federated solution for the analysis of privacy-sensitive genomic data across Canada. It will enable clinical researchers from distributed sites to examine and analyze quality data, including data from cancer trials, with no need for a central infrastructure to maintain or secure. This sharing of research-quality data between numerous cancer trials and treatment centers will help generate essential information that will aid in the efficient and effective diagnosis, treatment and follow-up of health conditions. An introduction to mCODE...
Continue...Overview CanDIG has some pretty particular authentication and authorization requirements! For authentication our fully-distributed federation, with each site self-governing, we need to accept authentication credentials, and claims about those identities, from multiple institutions. Similarly for authorization: we support a number of different data projects, and fine-grained access control within each (e.g. a researcher might be able to view somatic variants but not germline). In our federation, authorization decisions are made locally, but informed by platform-wide information. So there are multiple sources of claims of entitlements...
Continue...Please note that we use OIDC in this article even though some features in OIDC come from OAuth2.0. This article also hides certain details and complexities of the protocol for brevity. The problem Say you are solving the problem of understanding a rare disease - Rarivitis. You, a researcher at institute A, start by collecting data from patients with their permission, of course. In this problem, two facts stand out - You need a sufficient amount of data to make a judgment on a specific...
Continue...CanDIG has spent the last two years hypothesizing, prototyping, and validating data federation approaches for its national efforts. For us, data federation means something fairly specific: We are interested in federating queries and analyses across the network. These are read-only operations; thus we’re not worried about consistency of updates which is a complex problem in distributed read/write databases. In CanDIG’s case, the different sites representing different geographical regions and the data subjects are research project participants, with all of their data located at one given...
Continue...I was working on a project for CanDIG which was an implementation of the Htsget API — a simple data-access API for reads and variants. A key component of our application was the PySAM library, which provides an interface to genomic data. Its main use in the service was to parse a desired file and return only a chunk of that file with various filters. The most common way of accessing a file is with a local file path. However, with S3 buckets emerging as...
Continue...The thousand genome project was an early large-scale population genomics project which sequenced 2,504 individuals from a broad range of regions and ancestries, with aligned reads and called variants being made publically available. The resulting populations included some now-classic figures in population genetics, such as those shown to the right. Here we describe our process of reproducing this analysis of important, public, moderately sized data sets with well-understood characteristics, but now across a partitioned, federated set of the data. Our aim is to demonstrate that...
Continue...An important part of our architecture is the ability to run specified bioinformatics tasks against data sets; but this requires the ability to bundle up these tasks and distribute them across potentially quite heterogenous sites. There are many different possibilties for such an approach: Containers (Docker, rkt, LXC, Singularity, Intel Clear Containers) VMs Application packagers (AppImage, Snappy) To weigh these options, we conducted a simple Benchmark; a crude remapping pipeline taking reads from GRCh37 to GRCh38, then measuring coverage in specific exons. This benchmark allows...
Continue...Motivation While de-identification by removing explicit identifiers is the first step that we take towards the privacy protection, we consider the cases when the approach may fail to provide the sufficient level of protection specifically when records associated to the individuals contain characteristics or combinations of characteristics that are rare if not unique and make them identifiable even in the large crowds and even in the cases when data query responses are restricted to aggregate statistical results only. CanDIG provides an additional layer of privacy...
Continue...