General question about the Gen3 data commons data model

Chesterguan · September 25, 2022, 3:37am

Hi, I am a developer and very interested in the Gen3 platform, especially data commons.
I have a couple of questions about the data dictionary / data model used in Gen3 data commons.

After reading the data dictionary and data model sections, I think the Gen3 data commons is trying to mimic a graph database, while the backend actually uses a relational database. Are there any specific considerations here? GraphQL is usually more difficult than SQL to learn and work with, especially for beginners, so why did you select GraphQL here?
Why not build the data commons with a relational database architecture? Maybe the graph database model provides more flexibility with customized properties and nodes. But the idea only works if each data commons provides a common data model, right?
From my perspective, it's not easy to create a customized data model for longitudinal EHR data. Could you please share some existing data dictionaries that I could refer to?
OMOP CDM has become quite popular recently. Do you have any plans to integrate with OMOP CDM?

Appreciate your time!

cgmeyer · September 28, 2022, 12:28am

Question 1: Why GraphQL?

I'm sorry that it seems like we're making things harder than they need to be! I know it's annoying, and GraphQL is not the easiest to master! But, we think it's worth it for the advantages it provides.

The data we're representing has a hierarchical relationship, and that's easily mapped to a graph representation. So, having a programmatic representation that is also graph-based made sense. If we were using SQL instead of GraphQL, almost every query would require us to write a large number of joins. This would take a lot of programming effort, and it could make the system more vulnerable to programming errors and less robust to updates. With a graph model, we can set users up to query relationships between several entities in a graph-based manner, which is simpler because we can more easily describe the path the query takes through the model. Also, GraphQL provides a more structured way to add new analysis entities (e.g., workflows, data outputs) once the commons exist and begin to grow.
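To make the join point concrete, here is a minimal sketch, not Gen3's actual schema. The `project`, `subject`, and `sample` tables and all of their fields are hypothetical names chosen for illustration. The SQL version has to spell out every join along the path, while the GraphQL version (shown only as a string and not executed here) describes the same path by nesting.

```python
import sqlite3

# Hypothetical three-level hierarchy: project -> subject -> sample.
# These table and column names are assumptions, not a real Gen3 dictionary.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (id INTEGER PRIMARY KEY, code TEXT);
    CREATE TABLE subject (
        id INTEGER PRIMARY KEY,
        submitter_id TEXT,
        project_id INTEGER REFERENCES project(id)
    );
    CREATE TABLE sample (
        id INTEGER PRIMARY KEY,
        sample_type TEXT,
        subject_id INTEGER REFERENCES subject(id)
    );
    INSERT INTO project VALUES (1, 'P1');
    INSERT INTO subject VALUES (10, 'subj-a', 1), (11, 'subj-b', 1);
    INSERT INTO sample VALUES (100, 'blood', 10), (101, 'tissue', 10), (102, 'blood', 11);
""")

# SQL: one explicit JOIN per hop in the hierarchy.
SQL_QUERY = """
SELECT project.code, subject.submitter_id, sample.sample_type
FROM project
JOIN subject ON subject.project_id = project.id
JOIN sample ON sample.subject_id = subject.id
WHERE project.code = ?
ORDER BY sample.id
"""
rows = conn.execute(SQL_QUERY, ("P1",)).fetchall()

# GraphQL: the same path expressed by nesting (illustrative string only;
# the field names mirror the hypothetical tables above).
GRAPHQL_QUERY = """
{
  project(code: "P1") {
    code
    subjects {
      submitter_id
      samples { sample_type }
    }
  }
}
"""

for row in rows:
    print(row)
```

Each extra level in the hierarchy adds another JOIN clause to the SQL, whereas the nested query only gains one more level of braces.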

As you noted, the backend is a relational database set up in PostgreSQL. Peregrine, our GraphQL service, allows you to construct queries against PostgreSQL. Additionally, our ETL tool (tube) can flatten and store structured data in Elasticsearch, allowing for faster searches. This flattened data can be searched using Guppy, which allows you to use GraphQL queries on data in Elasticsearch.
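As a rough illustration of what "flattening" means here, the sketch below turns one nested record into flat, one-level documents that are easy to filter. It is not tube's implementation, and the record shape and field names are hypothetical.

```python
def flatten(project):
    """Turn one nested project record into one flat document per sample."""
    docs = []
    for subject in project["subjects"]:
        for sample in subject["samples"]:
            docs.append({
                "project_code": project["code"],
                "subject_submitter_id": subject["submitter_id"],
                "sample_type": sample["sample_type"],
            })
    return docs


# Hypothetical nested record, shaped like the result of a graph query.
nested = {
    "code": "P1",
    "subjects": [
        {"submitter_id": "subj-a",
         "samples": [{"sample_type": "blood"}, {"sample_type": "tissue"}]},
        {"submitter_id": "subj-b",
         "samples": [{"sample_type": "blood"}]},
    ],
}

flat_docs = flatten(nested)

# A flat document can be filtered without walking any relationships.
blood = [d for d in flat_docs if d["sample_type"] == "blood"]
print(len(flat_docs), len(blood))
```

The trade-off is duplication: the project and subject fields are repeated in every document, in exchange for searches that never need to traverse the graph.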

If you haven't yet seen this resource in the documentation, I encourage you to check out this page: Gen3 - Technical Intro

Question 2: Why not use relational database architecture?

We actually do have a relational database architecture -- it's the backend for GraphQL (PostgreSQL). One of the neat things about Gen3 is that it does not require the data commons built with it to share a common data model. If you look at the examples on stats.gen3.org, you can click on any of the data commons. You can see the data models by clicking on the Data Dictionary button (upper right) and then the Graph View (upper left). If you compare what's there for different commons, you'll see that there's a wide variety of data models.

Question 3: what about longitudinal data?

We agree that longitudinal data is definitely more challenging to model! But -- we have had some success. If you go to this page (Gen3 - Set up Gen3), then scroll down to the section called "Representing Longitudinal Data," you'll get a better sense of how we manage it. An example of a data commons that uses longitudinal EHR data is our VA Data Commons (https://va.data-commons.org/). Although the data dictionary is not available for this commons, the documentation is public and could help you understand how this is addressed. You can find that here: VA Data Commons — va-doc 1.0.0 documentation. It's true that EHR data doesn't work in a very usable way in our current dictionary/Elasticsearch approach.

In this documentation, you'll also see that this data commons uses the ATLAS application. ATLAS is an analytics platform that can be used to perform analyses across one or more observational databases which have been standardized to the OMOP Common Data Model V5. We integrated ATLAS as a Gen3 app in an iframe; however, a user who is provisioned access to ATLAS would be able to access all data in there. We did build a new service called "cohort-middleware" to do some OMOP queries in our GWAS application in this data commons, but it is not generalizable at the moment.

Finally:

Do you know about our Slack channel that's available for the Gen3 community? You can sign up to join this community Slack channel by completing this form: Sign up to join our Gen3-Community on Slack! We hope to see you there!

Chesterguan · September 28, 2022, 1:18am

Hi Chris, Thanks so much for your response!
Your answers are very professional and clear. I checked the Gen3 documentation a couple of times and finally understand your design at a deeper level. You guys have done an awesome job!

I will try to apply the Gen3 platform to our data, as the VA Data Commons does. I hope we can have a further discussion later. I appreciate your time!
