Understanding the current state of the default data dictionaries (VCFs specifically)

EDIT: I was not allowed to post this with all the links by the forum software, so the original (and full text with the links needed for this to make sense) is here:


Hello Everyone,

I am working on a small demo project that will hopefully include a Gen3 server running from compose-services. The plan is to pull data stored in a VCF from Gen3, do some analysis, and upload the output as a data type that I define in a new dictionary. If all goes well, we would seek funding for a more permanent installation later on.

First, here is the current state of the world around the data dictionaries (DD), as far as I can tell:

  • compose-services is configured by URL to point to a DD stored in s3 [1]. This does not contain the VCF definition. Is this meant to be kept up to date? It has some SHAs in the '_settings', but I don't know whether those identify a git commit, or in which repo.
  • compose-services can optionally use a DD stored as files in the same directory, and there is one checked into the repo [2]. This version also does not contain VCF and is at least 14 months old.
  • [3] shows a DD with a VCF, but is there a URL like [1] that I can configure my instance to pull from? Where are the source yaml files for this version?
  • The uc-cdis/datadictionary repo [4] has no VCF.yaml, but does have some related types like 'submitted_somatic_mutation'. I am surprised that what I see at gen3.datacommons.io/dd does not match these files, but I don't know if they are actually supposed to match.
  • The nci-gdc/gdcdictionary [5] has even more mutation-related types, and they have data_format=VCF, but nothing actually named 'VCF' like what I see in the dictionary browser in [3].
  • I can also see that the uc-cdis and nci-gdc github repos have diverged greatly (the fork is "111 commits ahead, 719 commits behind NCI-GDC:develop").
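
For anyone who wants to run the same check on their own copy of a dictionary, here is a minimal sketch. It assumes the dictionary has already been loaded into a Python dict that maps file-like keys (for example "program.yaml") to node schemas, and that file formats are listed under a "data_format" property with an "enum" list. The function name, the toy dictionary, and the "example_sha" key are all hypothetical, not taken from any of the linked dictionaries.

```python
def find_vcf_nodes(dictionary):
    """Return sorted node ids that are named after VCF or list VCF as a data_format."""
    hits = []
    for key, schema in dictionary.items():
        # Skip helper entries such as _settings.yaml or _definitions.yaml.
        if key.startswith("_") or not isinstance(schema, dict):
            continue
        node_id = schema.get("id", key)
        data_format = schema.get("properties", {}).get("data_format", {})
        formats = data_format.get("enum", []) if isinstance(data_format, dict) else []
        named_vcf = "vcf" in node_id.lower()
        lists_vcf = any("vcf" in str(fmt).lower() for fmt in formats)
        if named_vcf or lists_vcf:
            hits.append(node_id)
    return sorted(hits)


# Toy stand-in for a real dictionary (hypothetical content, for illustration only).
toy_dictionary = {
    "_settings.yaml": {"example_sha": "0000000"},
    "program.yaml": {"id": "program", "properties": {"name": {"type": "string"}}},
    "submitted_somatic_mutation.yaml": {
        "id": "submitted_somatic_mutation",
        "properties": {"data_format": {"enum": ["MAF", "VCF"]}},
    },
}

vcf_nodes = find_vcf_nodes(toy_dictionary)
print(vcf_nodes)
```

With a real dictionary, the dict would come from parsing the JSON behind the configured URL or from loading the yaml files in the repo; the layout assumed here should be checked against the actual files first.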

At this point I'd be interested if there was a simple story of what is important and what isn't in the above list.

Otherwise, my real question is: what's the most up-to-date set of schemas that you would recommend to start with if I need to represent the simplest possible mutations that will originally be in a VCF file?

Thank you for any assistance.

(Links [1] - [5] available in the gist at the top of post)

Hi @pgroves,
welcome to our Forum!

Just to explain a bit, every Gen3 commons has an individual database; all of them are viable, including the generic one (gen3.datacommons.io). If you find the node "VCF" to be useful for your database, feel free to grab the yaml file.

The databases differ because the data stored in each Gen3 data commons requires a different database architecture, depending on the data content. The GDC database is the basis for other Gen3 dictionaries, which is why you see "datadictionary" in the path, but it was and still is edited for each Gen3 commons (hence the different commits in each repo).

There are certain rules for building your own database, which can be found here, and it is important to know that certain nodes cannot change (program, project, and core_metadata_collection).
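
As a rough illustration of that last point, the sketch below checks that the three fixed nodes are at least present in a dictionary before you start editing it. It only tests for presence, not whether their contents were changed, and it assumes the dictionary is a Python dict mapping file-like keys to node schemas that each carry an "id" field. The function name and the toy data are hypothetical.

```python
# Node names taken from the rule above; everything else here is illustrative.
REQUIRED_NODES = ("program", "project", "core_metadata_collection")


def missing_required_nodes(dictionary):
    """Return the required node ids that do not appear in the dictionary."""
    present = set()
    for key, schema in dictionary.items():
        # Helper entries such as _settings.yaml are not nodes.
        if key.startswith("_") or not isinstance(schema, dict):
            continue
        present.add(schema.get("id", key))
    return [name for name in REQUIRED_NODES if name not in present]


# Toy dictionary that forgot core_metadata_collection (hypothetical content).
toy_dictionary = {
    "_settings.yaml": {},
    "program.yaml": {"id": "program"},
    "project.yaml": {"id": "project"},
    "simple_germline_variation.yaml": {"id": "simple_germline_variation"},
}

missing = missing_required_nodes(toy_dictionary)
print(missing)
```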

You will not see a node named Variant Call Format (VCF) very often, because most Gen3 commons prefer node names that tell the user what type of data the node holds, for example a node called "simple_germline_variation". As for your real question, this should give you a hint; feel free to read through the node descriptions to decide whether it is the right node for your data.

TL;DR, you can use any of these databases according to your data. You can poke around different dictionaries (e.g. 1, 2, 3) and see which one fits the best as a basis for your data.
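
If it helps when comparing candidates, here is a small sketch that lists which nodes two dictionaries share and which are unique to each. It assumes each dictionary has been loaded into a Python dict mapping file-like keys to node schemas with an "id" field; the function names and both toy dictionaries are hypothetical and only stand in for real ones.

```python
def node_ids(dictionary):
    """Collect node ids, ignoring helper entries whose key starts with '_'."""
    return {
        schema.get("id", key)
        for key, schema in dictionary.items()
        if not key.startswith("_") and isinstance(schema, dict)
    }


def compare_dictionaries(first, second):
    """Return sorted lists of nodes shared by both and unique to each dictionary."""
    a, b = node_ids(first), node_ids(second)
    return {
        "shared": sorted(a & b),
        "only_in_first": sorted(a - b),
        "only_in_second": sorted(b - a),
    }


# Two toy dictionaries (hypothetical content, for illustration only).
generic = {
    "_settings.yaml": {},
    "program.yaml": {"id": "program"},
    "project.yaml": {"id": "project"},
    "vcf.yaml": {"id": "vcf"},
}
other = {
    "program.yaml": {"id": "program"},
    "project.yaml": {"id": "project"},
    "simple_germline_variation.yaml": {"id": "simple_germline_variation"},
}

report = compare_dictionaries(generic, other)
for label, names in report.items():
    print(label, names)
```

The node lists alone will not tell you which dictionary fits best, but they narrow down which node descriptions are worth reading.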

Hope this helps,