General question about the Gen3 data commons data model

Question 1: Why GraphQL?

I'm sorry that it seems like we're making things harder than they need to be! I know it's annoying, and GraphQL is not the easiest to master! But, we think it's worth it for the advantages it provides.

The data we're representing has a hierarchical relationship, and that maps easily onto a graph representation, so a graph-based query interface made sense. If we exposed raw SQL instead of GraphQL, nearly every query would need a long chain of joins. That would take a lot of programming effort, and it could make the system more error-prone and less robust to updates. With a graph model, users can query on relationships between several entities by describing the path to take through the model, which is simpler. GraphQL also provides a more structured way to add new analysis entities (e.g., workflows, data outputs) once the commons exist and begin to grow.
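To make the "describe the path through the model" idea concrete, here is a minimal Python sketch that turns a path of node types into a nested GraphQL query. The node and field names (project, subjects, samples, submitter_id) are hypothetical placeholders, not taken from any particular commons; a real commons defines its own in its data dictionary. The helper function is also just an illustration, not part of Gen3.

```python
def build_nested_query(path, fields):
    """Return a GraphQL query that walks `path` from the first node type to the last.

    Every level requests `fields`; each deeper node type is nested inside its parent.
    """
    if not path:
        raise ValueError("path must contain at least one node type")

    def render(level):
        indent = "  " * (level + 1)
        lines = [f"{indent}{path[level]} {{"]
        lines += [f"{indent}  {field}" for field in fields]
        if level + 1 < len(path):
            # The next hop in the path is simply nested one level deeper.
            lines.append(render(level + 1))
        lines.append(f"{indent}}}")
        return "\n".join(lines)

    return "{\n" + render(0) + "\n}"


# Hypothetical node types; substitute the ones from your own data dictionary.
query = build_nested_query(["project", "subjects", "samples"], ["submitter_id"])
print(query)
```

Each extra hop in the path is one more level of nesting in the query, whereas the equivalent hand-written SQL would typically need one more join per hop.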

As you noted, the backend is a relational database set up in Postgres. Peregrine, our GraphQL service, allows you to construct queries against Postgres. Additionally, our ETL tool (tube) can flatten structured data and store it in Elasticsearch, allowing for faster searches. This flattened data can be searched using Guppy, which allows you to run GraphQL queries on data in Elasticsearch.
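If "flattening" sounds abstract, the toy Python sketch below shows the general idea: a nested record (a parent with child records) is collapsed into one flat document that a search index can filter on directly. This is only an illustration under assumed field names (submitter_id, gender, sample_type are hypothetical); it is not how tube is actually implemented or configured.

```python
def flatten_subject(subject):
    """Collapse one nested subject record into a single flat document."""
    samples = subject.get("samples", [])
    return {
        "subject_id": subject["submitter_id"],
        "gender": subject.get("gender"),
        # Child records become array-valued or aggregated fields on the parent,
        # so a search no longer has to traverse the graph to reach them.
        "sample_ids": [s["submitter_id"] for s in samples],
        "sample_types": sorted({s["sample_type"] for s in samples}),
        "sample_count": len(samples),
    }


# Hypothetical nested record, shaped like the result of a graph query.
nested = {
    "submitter_id": "subject-001",
    "gender": "female",
    "samples": [
        {"submitter_id": "sample-A", "sample_type": "blood"},
        {"submitter_id": "sample-B", "sample_type": "tissue"},
        {"submitter_id": "sample-C", "sample_type": "blood"},
    ],
}

flat = flatten_subject(nested)
print(flat)
```

The trade-off is the usual one: the flat document is fast to search and filter, but it has to be regenerated by the ETL step whenever the underlying graph data changes.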

If you haven't yet seen this resource in the documentation, I encourage you to check out this page: Gen3 - Technical Intro

Question 2: Why not use relational database architecture?

We actually do have a relational database architecture: Postgres is the backend behind the GraphQL layer. One of the neat things about Gen3 is that it does not require the data commons built with it to share a common data model. If you look at the examples on stats.gen3.org, you can click on any of the data commons. You can see each data model by clicking the Data Dictionary button (upper right) and then the Graph View (upper left). If you compare different commons, you'll see a wide variety of data models.

Question 3: What about longitudinal data?

We agree that longitudinal data is definitely more challenging to model, but we have had some success. If you go to this page (Gen3 - Set up Gen3) and scroll down to the section called "Representing Longitudinal Data," you'll get a better sense of how we manage it. An example of a data commons that uses longitudinal EHR data is our VA Data Commons (https://va.data-commons.org/). Although the data dictionary is not available for this commons, the documentation is public and could help you understand how this is addressed. You can find that here: VA Data Commons — va-doc 1.0.0 documentation. That said, it's true that EHR data doesn't work in a very usable way with our current dictionary/Elasticsearch approach.

In this documentation, you'll also see that this data commons uses the ATLAS application. ATLAS is an analytics platform for running analyses across one or more observational databases that have been standardized to the OMOP Common Data Model V5. We integrated ATLAS as a Gen3 app in an iframe; however, any user who is granted access to ATLAS can access all of the data in it. We also built a new service called "cohort-middleware" to run some OMOP queries for our GWAS application in this data commons, but it is not generalizable at the moment.

Finally:

Do you know about the Slack channel that's available for the Gen3 community? You can join it by completing this form: Sign up to join our Gen3-Community on Slack! We hope to see you there!