DCF data dictionary for longitudinal data

yingzhang104 · September 18, 2019, 3:37pm

The Gen 3 Operator "Set up Gen 3" page says: "A clinical node that is not included in the DCF is the Visit or Follow-Up node. The Visit node is used to store longitudinal data that is collected over time and usually has a many to one relationship with its parent node."
How else can I incorporate longitudinal data into the Gen 3 data dictionary?

Thanks.

svburke · September 20, 2019, 8:04pm

Hello @yingzhang104,

At this time, the Gen3 data dictionary best represents longitudinal data with many-to-one relationships with the parent node. A good example of this can be found here: https://github.com/uc-cdis/bhcdictionary/blob/develop/gdcdictionary/schemas/visit.yaml .

While longitudinal data could also be handled by creating multiple nodes for each repeated measurement, numerous nodes that contain identical properties would be problematic overall.

Please let me know if you have any further questions.

Sean

yingzhang104 · September 23, 2019, 6:21pm

Thank you, Sean, for the prompt response and the reference.

I would appreciate some clarification on the differences between the two options you mentioned. So the recommended approach is to use many-to-one relationships, while the alternative approach is to create a separate node for each visit, with each visit node holding all the repeated measures? The latter seems straightforward, but I am not quite sure about the first option. Do you mean putting all the repeated measures in one node and linking each measure to a different visit node? If that's the case, how do you avoid name conflicts? And instead of using &anchor, do you use $ to refer to another yaml file? (I haven't seen any explanation of the usage of the dollar sign in the YAML manual. I wonder if the Gen 3 DCF uses an extended YAML syntax?)

Would you mind explaining how to achieve option 1 using the basic YAML language?

Thank you very much!

svburke · September 23, 2019, 9:01pm

Hello @yingzhang104

For the many-to-one option, the subjects (case_a/b/c) will be connected to multiple entries, representing multiple visits, in the follow_up node. All the entries will have a different submitter_id to avoid name conflicts. An example would be this:

node: follow_up
submitter_id    cases.submitter_id    visit_number    (other columns of information)
case_a_1        case_a                1
case_a_2        case_a                2
case_a_3        case_a                3
case_b_1        case_b                1
case_c_1        case_c                1
case_c_2        case_c                2

In comparison if you take a multiple node approach you will have something like this:

node: follow_up_1
submitter_id    cases.submitter_id    visit_number    (other columns of information)
case_a_1        case_a                1
case_b_1        case_b                1
case_c_1        case_c                1

node: follow_up_2
submitter_id    cases.submitter_id    visit_number    (other columns of information)
case_a_2        case_a                2
case_c_2        case_c                2

node: follow_up_3
submitter_id    cases.submitter_id    visit_number    (other columns of information)
case_a_3        case_a                3

In the multiple node situation, you could remove visit_number and the numeric suffix on the submitter_id, but I left them to display the similarity between the two structures.

Based on these two examples, the many-to-one connection uses fewer nodes and this is helpful when viewing and searching throughout the data model.
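To make the comparison concrete, here is a minimal Python sketch. It is not Gen3 code, and the function names are made up for illustration. It loads the rows of the single follow_up node from the example above and shows that both views (all visits per case, and all records for one visit number) can be recovered from that one node:

```python
from collections import defaultdict

# Rows of the single follow_up node from the example above.
follow_up = [
    {"submitter_id": "case_a_1", "cases.submitter_id": "case_a", "visit_number": 1},
    {"submitter_id": "case_a_2", "cases.submitter_id": "case_a", "visit_number": 2},
    {"submitter_id": "case_a_3", "cases.submitter_id": "case_a", "visit_number": 3},
    {"submitter_id": "case_b_1", "cases.submitter_id": "case_b", "visit_number": 1},
    {"submitter_id": "case_c_1", "cases.submitter_id": "case_c", "visit_number": 1},
    {"submitter_id": "case_c_2", "cases.submitter_id": "case_c", "visit_number": 2},
]


def visits_by_case(records):
    """Group visit numbers under their parent case (the many-to-one link)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["cases.submitter_id"]].append(record["visit_number"])
    return dict(grouped)


def records_for_visit(records, visit_number):
    """Return the submitter_ids that a separate follow_up_<n> node would hold."""
    return [r["submitter_id"] for r in records if r["visit_number"] == visit_number]


print(visits_by_case(follow_up))
print(records_for_visit(follow_up, 2))
```

Because visit_number is just a property on each record, filtering on it gives the same rows that follow_up_2 would contain, without adding a node per visit.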

I have notified our Data Dictionary Expert about your other questions and he will respond to this thread.

Sean

Laron_Hughes · September 24, 2019, 3:04pm

Hello @yingzhang104,

In order to indicate a many to one relationship between the follow-up and case nodes, you would need to update the "links" section of the yaml file. For example in visit.yaml,

links:
  - name: cases
    backref: follow_ups
    label: describes
    target_type: subject
    multiplicity: many_to_one
    required: true

In terms of the $, it is used to refer to items in reference files that begin with an underscore (for example, _terms.yaml). I'm not aware of & being used for this purpose. Please let us know if you have any further questions.
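As a rough illustration of how such a reference can be followed, here is a small Python sketch. It assumes the $ in question is the JSON Schema "$ref" keyword with values of the form "<file>#/<key>". The contents of _terms.yaml below are hypothetical stand-ins held in memory, and this is not the resolver Gen3 itself uses:

```python
# Hypothetical stand-in for a parsed reference file; real files would be loaded from YAML.
reference_files = {
    "_terms.yaml": {
        "visit_number": {"description": "Sequential number of the visit."},
    },
}


def resolve_ref(ref, files):
    """Follow a reference of the form '<file>#/<key>/<subkey>' into the given files."""
    file_name, _, pointer = ref.partition("#")
    target = files[file_name]
    for part in pointer.split("/"):
        if part:  # skip the empty string before the leading slash
            target = target[part]
    return target


def expand(node, files):
    """Return a copy of node in which every '$ref' is replaced by its target."""
    if isinstance(node, dict):
        expanded = {}
        if "$ref" in node:
            target = expand(resolve_ref(node["$ref"], files), files)
            if not isinstance(target, dict):
                return target
            expanded.update(target)
        for key, value in node.items():
            if key != "$ref":
                expanded[key] = expand(value, files)
        return expanded
    if isinstance(node, list):
        return [expand(item, files) for item in node]
    return node


# Hypothetical property definition that points into _terms.yaml.
follow_up_properties = {
    "visit_number": {"term": {"$ref": "_terms.yaml#/visit_number"}, "type": "integer"},
}

print(expand(follow_up_properties, reference_files))
```

The point is only that "$ref" is a pointer to a named item in another file, which is a different mechanism from YAML's own &anchor and *alias syntax.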

yingzhang104 · September 30, 2019, 6:34pm

Hi Sean,

Thank you very much for the example and explanation. Conceptually it makes a lot of sense, just like in a relational database. However, I still wonder how to implement it in practice. For instance, I found the followup json file https://github.com/uc-cdis/bhcdictionary/blob/develop/gdcdictionary/examples/valid/followup.json, in which there seems to be only one visit (visit 5). If I need to add visit 6, which also measured BMI, height and weight, how would I input the data? It would also be very helpful to have a followup json schema, which I can't find in the schema folder.
Thank you very much!

yingzhang104 · September 30, 2019, 6:44pm

Hi Laron,

Thank you very much for your input. After reading about JSON schema, I have a much better understanding of how the complex json schema is built for gen 3 dcf.
I have a few followup questions:

Has gen 3 dcf developed its own json schema keywords, such as "links" and the properties under links (e.g. multiplicity)? Is there a manual on how to use those keywords?
Is it possible to set up an unbounded/expandable many-to-many relation between variables (e.g. BMI) and tags (i.e. one variable may have many tags, while many variables may also share one tag)?

Thank you very much!

Best,
Ying

Viktorija · October 2, 2019, 1:51pm

Hi @yingzhang104!

I think the BPA dictionary you mentioned in your post doesn't currently use a follow-up node; that is why you could not find it in the schema folder. You can find an excellent example of a follow-up node in the NIAID dictionary: https://github.com/uc-cdis/ndhdictionary/blob/master/gdcdictionary/schemas/follow_up.yaml or in the GDC dictionary: https://github.com/NCI-GDC/gdcdictionary/blob/develop/gdcdictionary/schemas/follow_up.yaml

Viktorija · October 2, 2019, 2:28pm

Files in the examples/valid folder demonstrate how you can submit data to a node. Here is a valid example for follow-up visit 11 in the NIAID dictionary: https://github.com/uc-cdis/ndhdictionary/blob/master/gdcdictionary/examples/valid/follow_up.json . For visit 12 you would use the same format, just with different values. Please note that you can submit data in JSON or TSV format, or by using a form on the Gen3 web portal.

Viktorija · October 2, 2019, 2:31pm

You can find more detailed information about data submission here https://gen3.org/resources/user/submit-data/#4-submit-additional-project-metadata

Viktorija · October 2, 2019, 2:49pm

Also you might want to check our webinar about Gen3 data modeling at https://www.youtube.com/watch?v=cVTvzP-li0M

yingzhang104 · October 2, 2019, 3:39pm

Thanks for the reply and resources!
To clarify - for visit 12, I would create another follow-up visit json file, named follow-up_2.json, which has the same format as follow-up.json, but with different values? I thought it's the "multiple-node" approach Sean mentioned as being not efficient. Or do I just duplicate everything between "visit_number" to "submitter_id" with values specific for visit 12 and keep both in the same json file?
Thank you so much for your patience!

Viktorija · October 2, 2019, 4:10pm

Yes, it can be the same file, just make sure it is a list. I attached an example for NIAID follow-up data submission with simulated data of 3 records.

[
    {
        "age_at_visit": 29, 
        "age_at_visit_gt89": "Yes", 
        "bmi": 43.37122781226677, 
        "days_to_follow_up": 47, 
        "drug_used": "No", 
        "ever_transferred": "Transferred", 
        "harmonized_visit_number": 5, 
        "health_insurance": false, 
        "height": 420.1635399517189, 
        "pregnancy_status": true, 
        "subjects": {
            "submitter_id": "subject_73b7f6f7ed"
        }, 
        "submitter_id": "follow_up_610f209ba0", 
        "tint": 13, 
        "type": "follow_up", 
        "version_data": "e34450c1ab", 
        "visit_date": 5, 
        "visit_id": 51, 
        "visit_name": "1f6c27695c", 
        "visit_number": 38, 
        "visit_type": "Follow-up Visit", 
        "weight": 22.142679484048944, 
        "weight_percentage": 52.07787680439634
    }, 
    {
        "age_at_visit": 22, 
        "age_at_visit_gt89": "Yes", 
        "bmi": 60.74102547543777, 
        "days_to_follow_up": 56, 
        "drug_used": "Refusal", 
        "ever_transferred": "Transferred", 
        "harmonized_visit_number": 55, 
        "health_insurance": false, 
        "height": 309.41313537224136, 
        "pregnancy_status": true, 
        "subjects": {
            "submitter_id": "subject_6642452205"
        }, 
        "submitter_id": "follow_up_d988e8876e", 
        "tint": 78, 
        "type": "follow_up", 
        "version_data": "04732a3b0d", 
        "visit_date": 37, 
        "visit_id": 24, 
        "visit_name": "fb9125be12", 
        "visit_number": 97, 
        "visit_type": "Follow-up Visit", 
        "weight": 35.72049775618203, 
        "weight_percentage": 98.93225263842955
    }, 
    {
        "age_at_visit": 33, 
        "age_at_visit_gt89": "No", 
        "bmi": 46.31693893550629, 
        "days_to_follow_up": 49, 
        "drug_used": "Yes", 
        "ever_transferred": "Never transferred", 
        "harmonized_visit_number": 4, 
        "health_insurance": false, 
        "height": 180.08344488242435, 
        "pregnancy_status": false, 
        "subjects": {
            "submitter_id": "subject_7f236a12e7"
        }, 
        "submitter_id": "follow_up_6d3d7e66b9", 
        "tint": 24, 
        "type": "follow_up", 
        "version_data": "13d3b74ca2", 
        "visit_date": 92, 
        "visit_id": 58, 
        "visit_name": "c3605ebafb", 
        "visit_number": 39, 
        "visit_type": "Abbreviated Visit (Record in ABRV file)", 
        "weight": 72.8053275126038, 
        "weight_percentage": 94.45939569093899
    }
]
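If you build such a file programmatically, a minimal Python sketch could look like the one below. The helper name, the submitter_id naming pattern and the measurement values are made up for illustration, and only a handful of properties are included. Which properties are actually required depends on the follow_up schema of your dictionary.

```python
import json


def make_follow_up(subject_id, visit_number, **measurements):
    """Build one follow_up record linked to its parent subject (hypothetical helper)."""
    record = {
        "type": "follow_up",
        # Each record needs its own submitter_id; this naming pattern is an assumption.
        "submitter_id": f"{subject_id}_visit_{visit_number}",
        "subjects": {"submitter_id": subject_id},
        "visit_number": visit_number,
    }
    record.update(measurements)
    return record


# Two visits for the same subject, with made-up measurement values.
records = [
    make_follow_up("subject_73b7f6f7ed", 11, bmi=24.1, height=172.0, weight=71.3),
    make_follow_up("subject_73b7f6f7ed", 12, bmi=24.5, height=172.0, weight=72.5),
]

# The top-level JSON value is a list, with one object per visit.
payload = json.dumps(records, indent=4, sort_keys=True)
print(payload)
```

Both records point at the same subject through "subjects", and they differ in submitter_id and visit_number, so they live side by side in one file and one node.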

Viktorija · October 2, 2019, 7:55pm

FYI, some people find it easier to submit data in a TSV (tab-separated values) file instead of a JSON file. TSV files are easy to edit in spreadsheet programs such as Excel or LibreOffice Calc. Please check https://gen3.org/resources/user/template-tsvs/ for examples.

svburke · October 2, 2019, 8:32pm

Hello @yingzhang104

If you wanted to add a new visit, you would create a new JSON entry and make it visit 6. This new entry would include only the values that were taken at that visit, such as BMI, height and weight. In TSV format, you could have something like this:

submitter_id    cases.submitter_id    visit_number    height (cm)    weight (kg)    bmi
case_a_1        case_a                1               172            63.5           21
case_a_2        case_a                2               172
case_a_3        case_a                3               172            72.5           24

In this example, case_a came in for three visits. On visit 2 they were not weighed, so weight and bmi are left blank. On visit 3, height and weight were both measured, so there is a bmi as well. Each row is an individual entry, like the records in the example JSON.
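For reference, here is a small Python sketch that writes those three visits as tab-separated text using the standard csv module. The column names are assumptions (a real TSV uses the property names from your dictionary, without units in the header), and the missing measurements are simply written as empty cells:

```python
import csv
import io

# Assumed column names; take the real ones from your dictionary's TSV template.
columns = ["type", "submitter_id", "cases.submitter_id", "visit_number", "height", "weight", "bmi"]

visits = [
    {"submitter_id": "case_a_1", "visit_number": 1, "height": 172, "weight": 63.5, "bmi": 21},
    {"submitter_id": "case_a_2", "visit_number": 2, "height": 172},  # not weighed at visit 2
    {"submitter_id": "case_a_3", "visit_number": 3, "height": 172, "weight": 72.5, "bmi": 24},
]

buffer = io.StringIO()
# restval="" writes an empty cell for any column missing from a row.
writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n")
writer.writeheader()
for visit in visits:
    writer.writerow({"type": "follow_up", "cases.submitter_id": "case_a", **visit})

tsv_text = buffer.getvalue()
print(tsv_text)
```

Writing to a file instead of io.StringIO works the same way; open the file with newline="" as the csv documentation recommends.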

Please let me know if you have any further questions.

Sean

yingzhang104 · October 3, 2019, 2:05pm

Thank you both for explaining how to achieve this in both TSV and JSON format. It's much clearer to me now.
