
Finding Key Files of Interest in the GREGoR Dataset

The GREGoR Dataset includes a wide range of data and file formats, as described in the GREGoR Data Model. Analysts working with GREGoR data can find file paths to molecular data files by using the structure of the Data Model together with the search features of an AnVIL workspace, which locate specific information in the workspace data tables. We have also compiled the following guidance on finding certain files of broad interest. Note: if you have not already done so, first follow the instructions in the Getting Started section of the Getting Started on AnVIL with GREGoR webpage to set up your AnVIL account and secure data access permissions.

Locations of key files

Use the steps and the table below to find key files of interest in an AnVIL workspace. Step numbers are annotated in the screenshot and appear in the column headers of the table.

  1. Log in to AnVIL at anvil.terra.bio.
  2. Select “Workspaces” from the menu in the upper left, then navigate to a workspace of interest. We recommend choosing the latest released GREGoR workspace, e.g. AnVIL_GREGoR_R**_GRU or AnVIL_GREGoR_R**_HMB.
  3. Navigate to the “DATA” tab from the menu bar at the top.
  4. Enter the text string specified in the table below into the search box under the “TABLES” section in the left sidebar. This searches across all data tables in the workspace.
  5. Filter your results by selecting the data table specified in the table below from the left sidebar (which you may need to expand).
  6. In the column specified in the table below (Step 6a), identify rows with the matching text string (Step 6b). You may need to expand individual columns in the data table to find the column of interest. Rows that contain the matching text string in the specified column include paths to the key files of interest.
  7. Select the checkboxes for all rows that match the search criteria to open or export the files.
     

[Screenshot of the AnVIL DATA tab showing harmonized CRAMs]

The screenshot above shows the results you may see when searching for harmonized CRAMs by following the steps outlined in row 1 of the table below.

For resources on interacting with and analyzing GREGoR data, refer to the AnVIL Resources webpage.
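Steps 4 through 7 can also be approximated outside the browser. The following is a minimal sketch, not an official GREGoR or AnVIL tool, that assumes a data table has been downloaded as a tab-separated file and filters it the way row 1 of the table below describes for harmonized CRAMs. The function name `rows_with_prefix`, the `file_path` column, and the sample rows are hypothetical and for illustration only; the sketch also assumes that the ID column header may carry an `entity:` prefix in the downloaded file.

```python
import csv
import io


def strip_entity_prefix(header):
    """Drop a leading "entity:" from a column header, if present (assumed export convention)."""
    marker = "entity:"
    return header[len(marker):] if header.startswith(marker) else header


def rows_with_prefix(tsv_text, column, prefix):
    """Return the rows whose value in `column` starts with `prefix`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    matches = []
    for row in reader:
        cleaned = {strip_entity_prefix(key): value for key, value in row.items()}
        # Missing cells are treated as empty strings so they never match.
        if (cleaned.get(column) or "").startswith(prefix):
            matches.append(cleaned)
    return matches


# Made-up rows for illustration; these IDs and paths are not real GREGoR records,
# and "file_path" is a hypothetical column name.
SAMPLE_TSV = (
    "entity:aligned_dna_short_read_id\tfile_path\n"
    "GREGoR_DCC_A1_example1\tgs://example-bucket/example1.cram\n"
    "OTHER_A1_example2\tgs://example-bucket/example2.cram\n"
)

# Step 5: table aligned_dna_short_read; Step 6a: column aligned_dna_short_read_id;
# Step 6b: text string "GREGoR_DCC_A1" used as a prefix.
harmonized = rows_with_prefix(SAMPLE_TSV, "aligned_dna_short_read_id", "GREGoR_DCC_A1")
for row in harmonized:
    print(row["aligned_dna_short_read_id"], row["file_path"])
```

Matching on the start of the ID mirrors the note that the text string is a prefix; a plain substring search would also return rows that merely mention the string elsewhere.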
 

| Key file(s) | Description of the file(s) | Step 4: Enter text string into search box | Step 5: Select data table | Step 6a: Refer to column | Step 6b and 7: Select (check) rows including text string |
| --- | --- | --- | --- | --- | --- |
| Harmonized CRAMs | Alignment files reprocessed by the GREGoR DCC | GREGoR_DCC_A1 | aligned_dna_short_read | aligned_dna_short_read_id | “GREGoR_DCC_A1…” Text string is a prefix. Select all rows. |
| GVCFs | Single sample genomic VCFs generated by the DCC from harmonized CRAMs | gVCF | called_variants_dna_short_read | analysis_details | “gVCF generated with…” Text string indicates that the row contains the correct file. Select all rows. |
| GREGoR joint callset: Unannotated & Annotated VCFs | Unannotated and VEP annotated multi-sample chromosome VCFs generated by the DCC from single sample GVCFs | consortium-wide callset | called_variants_dna_short_read | analysis_details | “VEP annotated consortium-wide callset” indicates annotated VCF. “consortium-wide callset” indicates unannotated VCF. |
| Broad de novo callset: Complete & Confident set | Consortium-wide de novo callset provided by the Broad | GREGoR_DENOVO | called_variants_dna_short_read | aligned_dna_short_read_set_id | “GREGoR_DENOVO…” Text string is a prefix. Refer to the analysis_details column for information on which callset to select. |
| Broad mitochondrial callset | Consortium-wide mitochondrial callset provided by the Broad | Broad_U07_GS_mtDNA_callset | called_variants_dna_short_read | aligned_dna_short_read_set_id | “Broad_U07_GS_mtDNA_callset…” Text string is a prefix. |
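
Four of the five rows in the table above point to the same data table, called_variants_dna_short_read, and differ only in which column and text string identify them. The sketch below shows one way to apply those rules to rows already loaded as dictionaries. It is an illustration under stated assumptions, not part of the GREGoR tooling: the function name `classify_called_variants_row`, the returned labels, and the sample rows are hypothetical, and the sketch assumes that the text strings appear in the analysis_details values exactly as quoted in the table.

```python
def classify_called_variants_row(row):
    """Label one called_variants_dna_short_read row using the table's text strings."""
    set_id = row.get("aligned_dna_short_read_set_id") or ""
    details = row.get("analysis_details") or ""

    # Rows identified by a prefix of aligned_dna_short_read_set_id.
    if set_id.startswith("GREGoR_DENOVO"):
        return "Broad de novo callset"
    if set_id.startswith("Broad_U07_GS_mtDNA_callset"):
        return "Broad mitochondrial callset"

    # Rows identified by analysis_details. The annotated check must come first,
    # because "consortium-wide callset" is also part of the annotated string.
    if "VEP annotated consortium-wide callset" in details:
        return "GREGoR joint callset (annotated VCF)"
    if "consortium-wide callset" in details:
        return "GREGoR joint callset (unannotated VCF)"
    if "gVCF generated with" in details:
        return "gVCF"
    return None


# Made-up rows for illustration; none of these values are real GREGoR records.
sample_rows = [
    {"aligned_dna_short_read_set_id": "set_a", "analysis_details": "gVCF generated with an example pipeline"},
    {"aligned_dna_short_read_set_id": "set_b", "analysis_details": "VEP annotated consortium-wide callset"},
    {"aligned_dna_short_read_set_id": "set_b", "analysis_details": "consortium-wide callset"},
    {"aligned_dna_short_read_set_id": "GREGoR_DENOVO_example", "analysis_details": "example details"},
    {"aligned_dna_short_read_set_id": "Broad_U07_GS_mtDNA_callset_example", "analysis_details": ""},
    {"aligned_dna_short_read_set_id": "set_c", "analysis_details": "something else"},
]

labels = [classify_called_variants_row(row) for row in sample_rows]
for label in labels:
    print(label)
```

The order of the checks matters for the joint callset: a search for “consortium-wide callset” alone returns both annotated and unannotated VCF rows, so the longer annotated string is tested first. For de novo rows, the analysis_details column still needs to be read to decide between the Complete and Confident sets, which this sketch does not attempt.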