Using Gen3 for linking multiple datasets

Hi Gen3 community, I work for a research foundation, and we are currently exploring NHGRI AnVIL as an ecosystem for bioinformatics. We have explored Terra and run a few analyses using publicly available datasets and workflows.

Here's the challenge we are facing: our researchers work in different labs developing mouse models, zebrafish models, and cell lines from patient-collected samples. They have WGS, ChIP-Seq, single-cell, and RNA-Seq sequencing data generated from these models. We have also collected patients' genetic WES testing data, as well as survey and natural history data with health records.

But all these datasets are siloed and cannot currently be linked to one another. So we need to de-identify, link, and centralize them.

Can you advise on how a Gen3 data commons can provide this linkability and centralization? Also, how does Gen3 support de-identification? Looking forward to hearing your thoughts.


Hi, Nikhil_Shingte,

Thanks for reaching out! I'm a little confused, so I wanted to clarify some points before I try to answer.

It sounds like you work for a research foundation that has labs working on many different types of human samples and animal models, producing a variety of genetic and clinical data that is currently not de-identified. Your research foundation is interested in finding a cloud-based workspace/data commons where these researchers can share their labs' data with each other (and possibly external users?) and also analyze the genetic data using bioinformatic pipelines.

I think you are looking at the AnVIL ecosystem to provide this analysis space through Terra, but you would first need to prepare your datasets: de-identify, "centralize," and make them "linkable" (I'm not sure what these last two pieces mean; could you explain more?).

It's also possible that you are instead talking about creating your own Gen3 data commons that is like AnVIL, but has all your researchers' data instead of the AnVIL data. (Is that what you mean?)

Thanks for helping me better understand your goals so I can be sure I'm answering the right question.

-- Sara

Hi Sara, that's right, we have multiple datasets, e.g. phenotypic data and genomic assets. We want to use Terra as the analysis platform, and we would like to know how Gen3 fits into the whole ecosystem. What are the benefits of using Gen3 with Terra?

Hi, Nikhil,

Gen3 can provide the infrastructure for ingesting data, indexing file objects, and exposing these data to queries, as well as the user authN/authZ required for interoperating with an analysis platform like Terra. Some of the data commons we support, like BioData Catalyst and AnVIL, interact with Terra, so Terra is definitely familiar with Gen3.
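
As a rough illustration of the "de-identify and link" step discussed earlier in this thread, here is a minimal Python sketch. It is not Gen3 code and does not use any Gen3 API; it only shows one common way to prepare records before ingestion into any commons. It assumes each silo carries a shared patient identifier, replaces that identifier with a keyed hash (HMAC-SHA256), and groups records from different silos under the resulting pseudonym. The field names (`patient_id`, `file`, `age_at_onset`), the dataset names, and the key are all hypothetical, and a keyed hash only covers the identifier itself, not other identifying fields.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical secret; in practice this would be stored securely, not in code.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(subject_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a real identifier to a stable pseudonym using a keyed hash."""
    digest = hmac.new(key, subject_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in this sketch


def link_records(*datasets):
    """Group rows from several named datasets under one pseudonym per subject.

    Each dataset is a (name, rows) pair; each row is a dict that contains
    a 'patient_id' field (an assumed, hypothetical column name).
    """
    linked = defaultdict(dict)
    for name, rows in datasets:
        for row in rows:
            row = dict(row)  # copy so the caller's data is not modified
            pseudonym = pseudonymize(row.pop("patient_id"))  # drop the real ID
            linked[pseudonym].setdefault(name, []).append(row)
    return dict(linked)


# Hypothetical example data from two silos.
wes = [{"patient_id": "P001", "file": "sample_a.vcf.gz"}]
survey = [
    {"patient_id": "P001", "age_at_onset": 12},
    {"patient_id": "P002", "age_at_onset": 30},
]

linked = link_records(("wes", wes), ("survey", survey))
```

After this runs, `linked` has one entry per subject, keyed by pseudonym, with that subject's rows from each silo grouped together and the original `patient_id` removed from every row.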

Actually, this article -- and especially the video -- seems like it would be useful.

Are you planning to stand up your own data commons with Gen3 using your labs' research data? Or are you hoping to bring your data into AnVIL or some other existing Gen3 data commons?

-- Sara