DCF data dictionary for longitudinal data

The Gen 3 Operator "Set up Gen 3" page mentions that "A clinical node that is not included in the DCF is the Visit or Follow-Up node. The Visit node is used to store longitudinal data that is collected over time and usually has a many to one relationship with its parent node."
How can longitudinal data be incorporated into a Gen 3 data dictionary otherwise?

Thanks.

Hello @yingzhang104,

At this time, the Gen3 data dictionary best represents longitudinal data through many-to-one relationships with the parent node. A good example of this can be found here: https://github.com/uc-cdis/bhcdictionary/blob/develop/gdcdictionary/schemas/visit.yaml .

While longitudinal data could also be handled by creating multiple nodes for each repeated measurement, numerous nodes that contain identical properties would be problematic overall.

Please let me know if you have any further questions.

Sean

Thank you, Sean, for the prompt response and the reference.

I would appreciate some clarification on the differences between the two options you mentioned. So the recommended approach is to use many-to-one relationships, while the alternative is to create a separate node for each visit, with each visit node holding all the repeated measures? The latter seems straightforward, but I am not quite sure about the first option. Do you mean putting all the repeated measures in one node and linking each measure to a different visit node? If that's the case, how do you avoid name conflicts? And instead of using &anchor, do you use $ to refer to another YAML file? (I haven't seen any explanation of the dollar sign in the YAML manual. I wonder if the Gen 3 DCF uses an extended YAML syntax?)

Would you mind explaining how to achieve option 1 using basic YAML?

Thank you very much!

Hello @yingzhang104

For the many-to-one option, the subjects (case_a/b/c) will be connected to multiple entries, representing multiple visits, in the follow_up node. All the entries will have a different submitter_id to avoid name conflicts. An example would be this:

node: follow_up
submitter_id    cases.submitter_id    visit_number    (other columns of information)
case_a_1        case_a                1
case_a_2        case_a                2
case_a_3        case_a                3
case_b_1        case_b                1
case_c_1        case_c                1
case_c_2        case_c                2

In comparison, if you take a multiple-node approach, you will have something like this:

node: follow_up_1
submitter_id    cases.submitter_id    visit_number    (other columns of information)
case_a_1        case_a                1
case_b_1        case_b                1
case_c_1        case_c                1

node: follow_up_2
submitter_id    cases.submitter_id    visit_number    (other columns of information)
case_a_2        case_a                2
case_c_2        case_c                2

node: follow_up_3
submitter_id    cases.submitter_id    visit_number    (other columns of information)
case_a_3        case_a                3

In the multiple-node situation, you could remove visit_number and the numeric suffix on the submitter_id, but I left them in to show the similarity between the two structures.

Based on these two examples, the many-to-one connection uses fewer nodes, which is helpful when viewing and searching the data model.
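
As a rough illustration of the many-to-one layout, the sketch below stores every visit in a single list of follow_up records and groups them by parent case. The rows mirror the example table above; the helper names `group_visits_by_case` and `has_unique_submitter_ids` are hypothetical and not part of Gen3.

```python
from collections import defaultdict

# Rows from the single follow_up node in the example above.
follow_up = [
    {"submitter_id": "case_a_1", "cases.submitter_id": "case_a", "visit_number": 1},
    {"submitter_id": "case_a_2", "cases.submitter_id": "case_a", "visit_number": 2},
    {"submitter_id": "case_a_3", "cases.submitter_id": "case_a", "visit_number": 3},
    {"submitter_id": "case_b_1", "cases.submitter_id": "case_b", "visit_number": 1},
    {"submitter_id": "case_c_1", "cases.submitter_id": "case_c", "visit_number": 1},
    {"submitter_id": "case_c_2", "cases.submitter_id": "case_c", "visit_number": 2},
]


def group_visits_by_case(records):
    """Map each parent case to the visit numbers of its follow_up entries."""
    visits = defaultdict(list)
    for record in records:
        visits[record["cases.submitter_id"]].append(record["visit_number"])
    return dict(visits)


def has_unique_submitter_ids(records):
    """Check that no two entries share a submitter_id (no name conflicts)."""
    ids = [record["submitter_id"] for record in records]
    return len(ids) == len(set(ids))


visits_by_case = group_visits_by_case(follow_up)
print(visits_by_case)
print(has_unique_submitter_ids(follow_up))
```

Because every entry lives in one node, adding a visit means adding a row rather than adding a node to the data model.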

I have notified our Data Dictionary Expert about your other questions and he will respond to this thread.

Sean

Hello @yingzhang104,

To indicate a many-to-one relationship between the follow-up and case nodes, you would need to update the "links" section of the yaml file. For example, in visit.yaml:

links:
  - name: cases
    backref: follow_ups
    label: describes
    target_type: subject
    multiplicity: many_to_one
    required: true

As for the $, it is used to refer to items in reference files that begin with an underscore (for example, _terms.yaml). I'm not aware of & being used for this purpose. Please let us know if you have any further questions.
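
To make the $ convention a little more concrete: in the dictionary schemas it usually shows up as a JSON Schema style `$ref` key whose value names a reference file and a path inside it. The sketch below is not Gen3's actual resolver. It resolves such a pointer against an in-memory stand-in for the reference files, and both the file contents and the `resolve_ref` helper are assumptions made for illustration.

```python
# Stand-in for already-loaded reference files whose names begin with an
# underscore. The contents here are made up for illustration.
reference_files = {
    "_terms.yaml": {
        "bmi": {"description": "Body mass index."},
    },
}


def resolve_ref(ref, files):
    """Resolve a '<file>#/<key>/<key>' pointer against loaded files."""
    file_name, _, path = ref.partition("#")
    node = files[file_name]
    for key in path.strip("/").split("/"):
        if key:  # an empty path refers to the whole file
            node = node[key]
    return node


# A property that points at a shared term instead of repeating it.
property_schema = {"term": {"$ref": "_terms.yaml#/bmi"}}

resolved = resolve_ref(property_schema["term"]["$ref"], reference_files)
print(resolved["description"])
```

The `$ref` key is plain YAML as far as the parser is concerned; the lookup is done by whatever tool reads the schema afterwards.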

Hi Sean,

Thank you very much for the example and explanation. Conceptually it makes a lot of sense, just like in a relational database. However, I still wonder how to implement it in practice. For instance, I found the followup json file https://github.com/uc-cdis/bhcdictionary/blob/develop/gdcdictionary/examples/valid/followup.json, in which there seems to be only one visit (visit 5). If I need to add visit 6, which also measured BMI, height and weight, how would I input the data? It would also be very helpful to have a followup json schema, which I can't find in the schema folder.
Thank you very much!

Hi Laron,

Thank you very much for your input. After reading about JSON schema, I have a much better understanding of how the complex json schema is built for gen 3 dcf.
I have a few followup questions:

  1. Has gen 3 dcf developed its own json schema keywords, such as "links" and the properties under links (e.g. multiplicity)? Is there a manual on how to use those keywords?
  2. Is it possible to set up an unbounded/expandable many-to-many relationship between variables (e.g. BMI) and tags (i.e. one variable may have many tags, while many variables may share one tag)?

Thank you very much!

Best,
Ying

Hi @yingzhang104!

I think the BPA dictionary you mentioned in your post doesn't currently use a follow-up node; that is why you could not find it in the schema folder. You can find an excellent example of a follow-up node in the NIAID dictionary: https://github.com/uc-cdis/ndhdictionary/blob/master/gdcdictionary/schemas/follow_up.yaml or the GDC dictionary: https://github.com/NCI-GDC/gdcdictionary/blob/develop/gdcdictionary/schemas/follow_up.yaml

Files in the examples/valid folder demonstrate how you can submit data to the node. Here is a valid example for follow-up visit 11 in the NIAID dictionary: https://github.com/uc-cdis/ndhdictionary/blob/master/gdcdictionary/examples/valid/follow_up.json For visit 12 you would use the same format, just with different values. Please note that you can submit data in JSON or TSV format, or by using a form on the Gen3 web portal.

You can find more detailed information about data submission here https://gen3.org/resources/user/submit-data/#4-submit-additional-project-metadata

Also you might want to check our webinar about Gen3 data modeling at https://www.youtube.com/watch?v=cVTvzP-li0M

Thanks for the reply and resources!
To clarify: for visit 12, would I create another follow-up visit json file, named follow-up_2.json, which has the same format as follow-up.json but with different values? I thought that was the "multiple-node" approach Sean described as less efficient. Or do I just duplicate everything from "visit_number" to "submitter_id" with values specific to visit 12 and keep both in the same json file?
Thank you so much for your patience!

Yes, it can be the same file, just make sure it is a list. I attached an example for NIAID follow-up data submission with simulated data of 3 records.

[
    {
        "age_at_visit": 29, 
        "age_at_visit_gt89": "Yes", 
        "bmi": 43.37122781226677, 
        "days_to_follow_up": 47, 
        "drug_used": "No", 
        "ever_transferred": "Transferred", 
        "harmonized_visit_number": 5, 
        "health_insurance": false, 
        "height": 420.1635399517189, 
        "pregnancy_status": true, 
        "subjects": {
            "submitter_id": "subject_73b7f6f7ed"
        }, 
        "submitter_id": "follow_up_610f209ba0", 
        "tint": 13, 
        "type": "follow_up", 
        "version_data": "e34450c1ab", 
        "visit_date": 5, 
        "visit_id": 51, 
        "visit_name": "1f6c27695c", 
        "visit_number": 38, 
        "visit_type": "Follow-up Visit", 
        "weight": 22.142679484048944, 
        "weight_percentage": 52.07787680439634
    }, 
    {
        "age_at_visit": 22, 
        "age_at_visit_gt89": "Yes", 
        "bmi": 60.74102547543777, 
        "days_to_follow_up": 56, 
        "drug_used": "Refusal", 
        "ever_transferred": "Transferred", 
        "harmonized_visit_number": 55, 
        "health_insurance": false, 
        "height": 309.41313537224136, 
        "pregnancy_status": true, 
        "subjects": {
            "submitter_id": "subject_6642452205"
        }, 
        "submitter_id": "follow_up_d988e8876e", 
        "tint": 78, 
        "type": "follow_up", 
        "version_data": "04732a3b0d", 
        "visit_date": 37, 
        "visit_id": 24, 
        "visit_name": "fb9125be12", 
        "visit_number": 97, 
        "visit_type": "Follow-up Visit", 
        "weight": 35.72049775618203, 
        "weight_percentage": 98.93225263842955
    }, 
    {
        "age_at_visit": 33, 
        "age_at_visit_gt89": "No", 
        "bmi": 46.31693893550629, 
        "days_to_follow_up": 49, 
        "drug_used": "Yes", 
        "ever_transferred": "Never transferred", 
        "harmonized_visit_number": 4, 
        "health_insurance": false, 
        "height": 180.08344488242435, 
        "pregnancy_status": false, 
        "subjects": {
            "submitter_id": "subject_7f236a12e7"
        }, 
        "submitter_id": "follow_up_6d3d7e66b9", 
        "tint": 24, 
        "type": "follow_up", 
        "version_data": "13d3b74ca2", 
        "visit_date": 92, 
        "visit_id": 58, 
        "visit_name": "c3605ebafb", 
        "visit_number": 39, 
        "visit_type": "Abbreviated Visit (Record in ABRV file)", 
        "weight": 72.8053275126038, 
        "weight_percentage": 94.45939569093899
    }
]

FYI, some people find it easier to submit data in a TSV (tab-separated values) file instead of a JSON file. TSV is easier to edit in spreadsheet programs such as Excel or LibreOffice Calc. Please check https://gen3.org/resources/user/template-tsvs/ for examples.
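
For anyone who wants to move between the two formats, here is a minimal sketch of turning a JSON list of records into TSV text. It assumes the link to the parent node is flattened into a `<link>.submitter_id` column, as in the `cases.submitter_id` header used earlier in this thread; the `flatten` and `to_tsv` helpers are hypothetical, and only a few fields from the simulated records are kept for brevity.

```python
import csv
import io

# A few fields from the three simulated follow_up records above.
records = [
    {"type": "follow_up", "submitter_id": "follow_up_610f209ba0",
     "subjects": {"submitter_id": "subject_73b7f6f7ed"}, "visit_number": 38},
    {"type": "follow_up", "submitter_id": "follow_up_d988e8876e",
     "subjects": {"submitter_id": "subject_6642452205"}, "visit_number": 97},
    {"type": "follow_up", "submitter_id": "follow_up_6d3d7e66b9",
     "subjects": {"submitter_id": "subject_7f236a12e7"}, "visit_number": 39},
]


def flatten(record):
    """Turn nested link objects into '<link>.<key>' columns."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def to_tsv(records):
    """Serialize records as tab-separated text, one row per record."""
    rows = [flatten(record) for record in records]
    # Union of column names, in first-seen order.
    columns = list(dict.fromkeys(col for row in rows for col in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(records)
print(tsv_text)
```

Records that lack a column are written with an empty cell, so visits with fewer measurements still fit in the same sheet.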

Hello @yingzhang104

If you wanted to add a new visit, you would create a new json entry and make it visit 6. This new entry would include only the values that were taken at that visit, such as BMI, height and weight. In TSV format, you could have something like this:

submitter_id    cases.submitter_id    visit_number    height (cm)    weight (kg)    bmi
case_a_1        case_a                1               172            63.5           21
case_a_2        case_a                2               172
case_a_3        case_a                3               172            72.5           24

In this example, case_a came in for three visits. On visit 2 they were not weighed, so there is no information for weight or bmi. On visit 3, height and weight were both measured, so there is a bmi as well. Each of these rows is an individual entry, like the entries in the example json.
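
The same three entries can be sketched in JSON form, with unmeasured values simply left out of the entry. This is only an illustration: the `make_follow_up` helper is hypothetical, and the `cases` link name and the `height`, `weight` and `bmi` property names are assumptions that would need to match your own dictionary.

```python
import json


def make_follow_up(case_id, visit_number, **measurements):
    """Build one follow_up entry, keeping only values that were measured."""
    entry = {
        "type": "follow_up",
        "submitter_id": f"{case_id}_{visit_number}",
        "cases": {"submitter_id": case_id},
        "visit_number": visit_number,
    }
    # Drop measurements that were not taken at this visit.
    entry.update({k: v for k, v in measurements.items() if v is not None})
    return entry


visits = [
    make_follow_up("case_a", 1, height=172, weight=63.5, bmi=21),
    make_follow_up("case_a", 2, height=172, weight=None, bmi=None),
    make_follow_up("case_a", 3, height=172, weight=72.5, bmi=24),
]

# All entries go into one list, which can be submitted as one JSON file.
print(json.dumps(visits, indent=4))
```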

Please let me know if you have any further questions.

Sean

Thank you both for explaining how to achieve this in both TSV format and JSON format. It's much clearer to me now.