Submission File Format
File Structure
- The submission data must be in tab-delimited format.
- Each column corresponds to a data element defined in the DCC Data Element specification.
- Column order and case must match the data elements in the DCC Data Element specification.
- Extra columns are not allowed.
- Required fields cannot be null.
- Each mutation/variant is represented as a row (one mutation per row).
An example file is shown below (note that parts of the lines are omitted for readability):
analysis_id | analyzed_sample_id | mutation_type | chromosome | chromosome_start | chromosome_end | reference_genome_allele | control_genotype | mutated_from_allele | mutated_to_allele | tumour_genotype |
---|---|---|---|---|---|---|---|---|---|---|
m124 | ssm_3396649 | 3 | 20 | 49510011 | 49510012 | GA | GA/GA | GA | - | GA/- |
m124 | ssm_61023021 | 2 | X | 115303927 | 115303927 | - | -/- | - | T | -/T |
m124 | ssm_175270973 | 4 | 15 | 39884779 | 39884787 | ACTCAGACC | ACTCAGACC/ACTCAGACC | ACTCAGACC | TTGT | ACTCAGACC/TTGT |
m124 | ssm_175270973 | 1 | 15 | 39884792 | 39884792 | C | C/C | C | T | C/T |
m124 | ssm_4545634 | 3 | 12 | 23454340 | 23454341 | GA | GA/GA | GA | - | GA/- |
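The structural rules listed above can be checked mechanically. The following is a minimal sketch, not the DCC validator: `check_structure`, `EXPECTED_COLUMNS`, and `REQUIRED` are hypothetical names, the column list is only a subset of the columns in the example above, and the choice of required fields is an assumption made for illustration.

```python
# Hypothetical subset of columns, taken from the example file above.
EXPECTED_COLUMNS = [
    "analysis_id", "analyzed_sample_id", "mutation_type", "chromosome",
    "chromosome_start", "chromosome_end",
]
# Assumption for illustration only: which of these fields are required.
REQUIRED = {"analysis_id", "analyzed_sample_id", "mutation_type"}


def check_structure(text):
    """Return a list of error messages for tab-delimited submission text."""
    lines = text.splitlines()
    header = lines[0].split("\t") if lines else []
    if header != EXPECTED_COLUMNS:
        # Catches wrong order, wrong case, missing columns and extra columns.
        return [f"header mismatch: got {header}"]
    errors = []
    for line_no, line in enumerate(lines[1:], start=2):
        row = line.split("\t")
        if len(row) != len(header):
            errors.append(
                f"line {line_no}: expected {len(header)} columns, got {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            if name in REQUIRED and value.strip() == "":
                errors.append(f"line {line_no}: required field {name} is empty")
    return errors
```

An empty list means that no structural problem was found. A real check would use the full, ordered column list and the required flags from the specification.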
ICGC DCC Data File Specification
ICGC DCC provides a data file specification for each data type, detailing the format required to construct a valid submission file. You can view the current ICGC DCC Data Specification here. The specification describes each data element using the following columns:
Column | Description |
---|---|
Data Element ID | Name of the column that must be included in the submission file |
Name | The descriptive name of the Data Element ID |
Description | Definition of the Data Element ID |
Data Type | The data type required for the given Data Element ID (e.g. integer, text, controlled vocabulary) |
CV Codes | Controlled vocabulary (if applicable to the Data Element ID) |
Required? | Indicates whether the Data Element ID requires a value |
N/A Code Valid | Indicates whether the Data Element ID accepts the reserve codes -777 or -888 |
Controlled Access | Indicates whether the Data Element ID is open or controlled access |
Regexp | A Java regular expression indicating the required format |
Examples | Examples of valid values |
Notes | Additional notes describing requirements/restrictions and cross-field validation checks |
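To illustrate how the Required?, N/A Code Valid and Regexp columns might combine when checking a single value, here is a rough sketch. The `ElementSpec` class and `check_value` function are hypothetical, the order of the checks is an assumption, and Python's `re` module stands in for Java regular expressions (the two agree for simple patterns but are not identical in general).

```python
import re
from dataclasses import dataclass
from typing import Optional

NA_CODES = {"-777", "-888"}  # reserve codes named in the table above


@dataclass
class ElementSpec:
    """Hypothetical in-memory form of one row of the specification."""
    element_id: str
    required: bool
    na_code_valid: bool
    regexp: Optional[str] = None


def check_value(spec, value):
    """Return None if the value is acceptable, otherwise an error message."""
    if value == "":
        if spec.required:
            return f"{spec.element_id}: a value is required"
        return None
    if value in NA_CODES and spec.na_code_valid:
        return None
    # Assumption: the whole value must match the pattern, not just a part of it.
    if spec.regexp and re.fullmatch(spec.regexp, value) is None:
        return f"{spec.element_id}: {value!r} does not match {spec.regexp}"
    return None
```

For example, with the made-up rule `ElementSpec("chromosome_start", required=True, na_code_valid=False, regexp=r"[0-9]+")`, the call `check_value(spec, "49510011")` returns `None`, while an empty string or `"-777"` produces an error message. The pattern here is invented for the example and is not taken from the dictionary.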
Current Dictionary and Codelists
To view the current dictionary, go to the Dictionary Viewer. Green-highlighted rows, such as "donor_id", are identifier data fields (foreign keys) and must be unique for each row.
Alternatively, you can access the DCC Data Specification in JSON format via the REST web service. See the Submission API for details.
File Naming Conventions
Clinical/Experimental Files
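Category | Data type | File type | File name | Description |
---|---|---|---|---|
Core Clinical Files | donor | | donor.txt[.gz\|.bz2] | Donor information |
Core Clinical Files | specimen | | specimen.txt[.gz\|.bz2] | Specimen information |
Core Clinical Files | sample | | sample.txt[.gz\|.bz2] | Analyzed sample information |
Optional Clinical Files | surgery | | surgery[.gz\|.bz2] | Donor surgery information |
Optional Clinical Files | exposure | | exposure[.gz\|.bz2] | Donor environmental exposure |
Optional Clinical Files | family | | family.txt[.gz\|.bz2] | Donor family history |
Optional Clinical Files | biomarker | | biomarker.txt[.gz\|.bz2] | Donor biomarkers |
Optional Clinical Files | therapy | | therapy.txt[.gz\|.bz2] | Donor therapy |
Experimental Files | ssm | metadata | ssm_m.txt[.gz\|.bz2] | Simple somatic mutations including single base substitutions and indels of ≤200 bp |
Experimental Files | ssm | primary | ssm_p.txt[.gz\|.bz2] | Simple somatic mutations including single base substitutions and indels of ≤200 bp |
Experimental Files | sgv | metadata | sgv_m.txt[.gz\|.bz2] | Simple germline variations including single base substitutions and indels of ≤200 bp |
Experimental Files | sgv | primary | sgv_p.txt[.gz\|.bz2] | Simple germline variations including single base substitutions and indels of ≤200 bp |
Experimental Files | cnsm | metadata | cnsm_m.txt[.gz\|.bz2] | Copy number somatic mutations |
Experimental Files | cnsm | primary | cnsm_p.txt[.gz\|.bz2] | Copy number somatic mutations |
Experimental Files | cnsm | secondary | cnsm_s.txt[.gz\|.bz2] | Copy number somatic mutations |
Experimental Files | stsm | metadata | stsm_m.txt[.gz\|.bz2] | Structural somatic mutations |
Experimental Files | stsm | primary | stsm_p.txt[.gz\|.bz2] | Structural somatic mutations |
Experimental Files | stsm | secondary | stsm_s.txt[.gz\|.bz2] | Structural somatic mutations |
Experimental Files | exp | metadata | exp_m.txt[.gz\|.bz2] | Gene expression |
Experimental Files | exp | gene expression | exp_g.txt[.gz\|.bz2] | Gene expression |
Experimental Files | mirna | metadata | mirna_m.txt[.gz\|.bz2] | miRNA expression |
Experimental Files | mirna | primary | mirna_p.txt[.gz\|.bz2] | miRNA expression |
Experimental Files | mirna | secondary | mirna_s.txt[.gz\|.bz2] | miRNA expression |
Experimental Files | jcn | metadata | jcn_m.txt[.gz\|.bz2] | Exon junction |
Experimental Files | jcn | primary | jcn_p.txt[.gz\|.bz2] | Exon junction |
Experimental Files | pexp | metadata | pexp_m.txt[.gz\|.bz2] | Protein expression |
Experimental Files | pexp | primary | pexp_p.txt[.gz\|.bz2] | Protein expression |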
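
A file name can be checked against this convention with a small pattern, and the optional suffix can be used to pick a decompressor. The sketch below is illustrative only: `parse_file_name`, `open_submission`, and `KNOWN_NAMES` are hypothetical, only a few names from the table are included, and the pattern assumes a `.txt` suffix, so entries listed without one are not covered. The DCC's own validation may be stricter or more lenient.

```python
import bz2
import gzip
import os
import re

# A few base names from the table above; extend as needed.
KNOWN_NAMES = {"donor", "specimen", "sample", "ssm_m", "ssm_p", "cnsm_s", "exp_g"}

# <name>.txt, optionally followed by .gz or .bz2
NAME_PATTERN = re.compile(r"(?P<name>[a-z_]+)\.txt(?P<compression>\.gz|\.bz2)?")

# Maps the compression suffix (or None) to a function that opens the file.
OPENERS = {None: open, ".gz": gzip.open, ".bz2": bz2.open}


def parse_file_name(file_name):
    """Return (base name, compression suffix or None), or None if invalid."""
    match = NAME_PATTERN.fullmatch(file_name)
    if match is None or match.group("name") not in KNOWN_NAMES:
        return None
    return match.group("name"), match.group("compression")


def open_submission(path):
    """Open a submission file as text, decompressing it if needed."""
    parsed = parse_file_name(os.path.basename(path))
    if parsed is None:
        raise ValueError(f"unrecognised file name: {path}")
    _, compression = parsed
    return OPENERS[compression](path, "rt", encoding="utf-8")
```

With these definitions, `parse_file_name("ssm_p.txt.gz")` returns `("ssm_p", ".gz")`, and a name such as `donor.csv` returns `None`.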