Checking and validating for file completeness upon upload

Hi there - just a general question on whether the Gen3 stack has any functionality for checking and validating file completeness (e.g. a VCF has its header, the reference for a CRAM is specified, etc.).
I can see Sheepdog is used to validate metadata against the data dictionary, ensuring all required fields are present and have appropriate values, but I'm wondering whether there are functions for QA of the files themselves.
Thanks

Hi @jchrist! Welcome to the Forum!

Thank you for your question. I have asked the Gen3 devs about this and will get back to you soon.

Thanks @xritter2 - looking forward to finding out. I have some researchers who are interested in such a file validity/checking capacity so would be great to know if it may be on the horizon for Gen3 if not included now.

Hello!

When we upload data, Gen3 calculates and stores the file's md5sum, so the md5sum can be used to check that a file is complete; e.g. https://gen3.datacommons.io/files/92183610-735e-4e43-afd6-7b15c91f6d10
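
As a rough illustration of that check, here is a minimal Python sketch that computes a local file's md5sum and compares it with the value recorded for the file. The function names `file_md5` and `matches_recorded_md5` are made up for this example and are not part of Gen3; it also assumes you have copied the recorded md5sum by hand (for instance from the file's page), since no Gen3 API call is shown here.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_md5(path, recorded_md5):
    """Compare the local file's md5 with a recorded value (hypothetical helper).

    `recorded_md5` is assumed to be the hex string stored for the file,
    copied manually by the user.
    """
    return file_md5(path) == recorded_md5.strip().lower()
```

A mismatch would suggest the local copy and the uploaded copy differ, e.g. because of a truncated transfer. It says nothing about whether the file is valid in its format.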

However, there is no automatic file format QC when you upload. Gen3 doesn't check the content of uploaded files; it only calculates file_size and md5sum.
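
Since that kind of content check has to happen outside Gen3, one option is to run a small check yourself before uploading. The sketch below is only an illustration for the VCF case mentioned in the question: `looks_like_vcf` is a hypothetical helper, not a Gen3 function, and it only tests that the first line starts with `##fileformat=VCF` and that the `##` meta lines are followed by a `#CHROM` header line. It is not a full validator.

```python
import gzip


def looks_like_vcf(path):
    """Rough pre-upload check that a file has a VCF header (hypothetical helper).

    Assumes plain-text VCF, or gzip/bgzip-compressed VCF when the name
    ends in ".gz". It does not validate the records themselves.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        first_line = handle.readline()
        # The first line of a VCF declares the format version.
        if not first_line.startswith("##fileformat=VCF"):
            return False
        for line in handle:
            if line.startswith("##"):
                continue  # skip remaining meta-information lines
            # The first non-meta line should be the column header.
            return line.startswith("#CHROM")
    # Reached end of file without finding a column header line.
    return False
```

For anything beyond a quick sanity check, a dedicated validator for the format in question would be a better fit than a hand-rolled script like this.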

Perfect. Thanks @xritter2 for finding out. Very useful.
Cheers!
