Terra_2_ENA
Quick Facts
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
---|---|---|---|---|
Public Data Sharing | Bacteria, Viral | vX.X.X | No | Set-level |
Terra_2_ENA_PHB
This workflow utilizes the ENA Webin-CLI Bulk Submission Tool to bulk submit read data to the ENA.
ENA Submissions
Before you can submit data to ENA you must register a Webin submission account. ENA allows submissions via the Webin-CLI program which validates your submissions entirely before you complete them. Submissions made through Webin are represented using a number of different metadata objects. Submissions to ENA result in accession numbers and these accessions can be used to identify each unique part of your submission. See the ENA submission documentation for more information.
Pre-requisites
Before running the `Terra_2_ENA` workflow, make sure you have registered a study and are using the correct study accession number.
- To submit data into ENA, you must first register a study to contain and manage it. Studies (also referred to as projects) can be registered through the Webin Portal. Log in with your Webin credentials and select the ‘Register Study’ button to bring up the interface. Once registration is complete, you will be assigned accession numbers. You may return to the dashboard and select the ‘Studies Report’ button to review registered studies.
- Additionally, before submitting most types of data to ENA, samples must be registered. To register samples, ensure that your Terra data table includes all the samples you intend to submit, along with their raw read data (`FASTQ`, `BAM`, or `CRAM` format) and associated metadata. To meet ENA’s requirements, each sample must include a minimum set of metadata. See below for the mandatory and recommended metadata fields, as well as the default column names used to identify them in your Terra data table.
What needs to be included in your Terra data table?
Read Data Fields
- Mandatory Fields
These columns are required for submission and must be included in the Terra data table. The column names must appear exactly as shown and cannot be substituted or modified using column mappings.
Terra Column Name | Description |
---|---|
`read1` / `read2` / `bam_file` / `cram_file` | The path to two paired-end `FASTQ` files, a `BAM` file, or a `CRAM` file containing sequencing data. |
`experiment_name` | Unique name of the experiment. |
`sequencing_platform` | The platform used to generate the sequence data. See permitted values. |
`sequencing_instrument` | The instrument used to generate the sequence data. See permitted values. |
`library_source` | The source of the library. See permitted values. |
`library_selection` | The method used to select the library. See permitted values. |
`library_strategy` | The strategy used to generate the library. See permitted values. |
- Optional Fields
Terra Column Name | Description |
---|---|
`insert_size` | The insert size for paired reads. |
`library_description` | Free text library description. |
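The mandatory read data columns above can be checked before launching the workflow. The following is a minimal sketch, not part of the workflow itself: it assumes each table row has been loaded as a Python dict keyed by column name, and it assumes a sample needs one complete file layout (both FASTQ mates, or a BAM, or a CRAM). The function name, the example paths, and the example values are hypothetical.

```python
# Mandatory read-data columns, copied from the table above.
REQUIRED_READ_COLUMNS = (
    "experiment_name",
    "sequencing_platform",
    "sequencing_instrument",
    "library_source",
    "library_selection",
    "library_strategy",
)

# Assumption: one complete layout is enough (paired FASTQ, or BAM, or CRAM).
READ_FILE_LAYOUTS = (("read1", "read2"), ("bam_file",), ("cram_file",))


def check_read_fields(row):
    """Return a list of problems found in one table row (dict of column -> value)."""
    problems = []
    for column in REQUIRED_READ_COLUMNS:
        if not row.get(column):
            problems.append(f"missing value for '{column}'")
    has_files = any(
        all(row.get(column) for column in layout) for layout in READ_FILE_LAYOUTS
    )
    if not has_files:
        problems.append("no read data: need read1 and read2, bam_file, or cram_file")
    return problems


# Hypothetical row; the paths and values are illustrative only.
example_row = {
    "read1": "gs://example-bucket/sample_R1.fastq.gz",
    "read2": "gs://example-bucket/sample_R2.fastq.gz",
    "experiment_name": "sample_experiment_1",
    "sequencing_platform": "ILLUMINA",
    "sequencing_instrument": "Illumina MiSeq",
    "library_source": "GENOMIC",
    "library_selection": "RANDOM",
    "library_strategy": "WGS",
}
print(check_read_fields(example_row))  # an empty list means nothing is missing
```

This check only looks for presence of values; it does not verify that a value is one of ENA's permitted terms.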
Sample Metadata Fields
Using Customized Column Names in Terra Tables
In some cases, users may have data tables in Terra with column names that differ from the field names expected by ENA. The `Terra_2_ENA` workflow allows users to supply a custom column mapping file that specifies how their columns map to the expected field names.
To use a custom column mapping file:
- Create a tab-delimited `.tsv` file with the following structure: the first row must be a header containing `terra_column` and `ena_column`. The `terra_column` column should contain the actual column names in your Terra table (e.g., `my_fav_collection_date`), and the `ena_column` column should contain the column names expected by ENA (e.g., `collection date`). The mandatory and recommended metadata fields and their associated column names are described in the tables below.
Example Mapping File (using the example names above):
terra_column | ena_column |
---|---|
my_fav_collection_date | collection date |
- Upload the file to your Terra workspace and reference it by its Google Cloud Storage path in the `column_mappings` input parameter when running the workflow.
Ensure the mapping file includes all columns with custom names. Columns that match the default workflow names do not need to be included. A missing mapping for a renamed column may cause an error during execution if the column is required; if the column is optional, it will simply not be found. The workflow automatically maps the specified column names from your Terra table to the required ENA field names, as long as the mapping file is provided correctly.
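The mapping file described above is a plain two-column TSV, so it can be generated with a short script. The sketch below is illustrative only: the helper names are hypothetical, and `rename_columns` merely shows the effect of a mapping on one row under the assumption that unmapped columns keep their names. It is not the workflow's own implementation.

```python
import csv
import io


def write_column_mappings(mappings, handle):
    """Write {terra_column: ena_column} pairs as a tab-delimited mapping file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["terra_column", "ena_column"])  # header row required by the workflow
    for terra_column, ena_column in mappings.items():
        writer.writerow([terra_column, ena_column])


def read_column_mappings(handle):
    """Read a mapping file back into a {terra_column: ena_column} dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["terra_column"]: row["ena_column"] for row in reader}


def rename_columns(row, mappings):
    """Rename the keys of one table row; columns without a mapping keep their name."""
    return {mappings.get(column, column): value for column, value in row.items()}


# Example taken from the text: a custom Terra column mapped to an ENA field name.
buffer = io.StringIO()
write_column_mappings({"my_fav_collection_date": "collection date"}, buffer)
print(buffer.getvalue())
```

To produce a file for upload, pass an open file handle (for example `open("column_mappings.tsv", "w", newline="")`) instead of the in-memory buffer.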
- Mandatory Fields (prokaryotic pathogen samples)
These fields are required for submission and must be included in the Terra data table or supplied as an input parameter.
If you cannot provide a value for a mandatory field, set the `allow_missing` input parameter to `true` or, alternatively, use one of the INSDC accepted terms for missing value reporting.
Terra Column Name | ENA Field Name | Description |
---|---|---|
`title` | sample_title | Title of the sample. |
`taxon_id` and/or `organism` | tax_id and/or scientific name | Taxonomic identifier (NCBI taxon ID) or scientific name of the organism from which the sample was obtained. |
`collection_date` | collection date | The date the sample was collected with the intention of sequencing, either as an instance (single point in time) or interval. In case no exact time is available, the date/time can be right truncated, i.e. all of these are valid ISO 8601 compliant times: 2008-01-23T19:23:10+00:00; 2008-01-23T19:23:10; 2008-01-23; 2008-01; 2008. |
`geo_loc_name` | geographic location (country and/or sea) | The geographical origin of where the sample was collected from, with the intention of sequencing, as defined by the country or sea name. Country or sea names should be chosen from the INSDC country list. |
`host_health_state` | host health state | Health status of the host at the time of sample collection. Must be one of the following: `diseased`, `healthy`, `missing: control sample`, `missing: data agreement established pre-2023`, `missing: endangered species`, `missing: human-identifiable`, `missing: lab stock`, `missing: sample group`, `missing: synthetic construct`, `missing: third party data`, `not applicable`, `not collected`, `not provided`, `restricted access`. |
`host_scientific_name` | host scientific name | Scientific name of the natural (as opposed to laboratory) host to the organism from which sample was obtained. |
`isolation_source` | isolation_source | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived. |
`isolate` | isolate | Individual isolate from which the sample was obtained. |
- Optional Fields (prokaryotic pathogen samples)
Terra Column Name | ENA Field Name | Description |
---|---|---|
`library_description` | sample_description | Description of the sample. |
`lat_lon` | lat_lon | Geographical coordinates of the location where the specimen was collected. |
`serovar` | serovar | Serological variety of a species (usually a prokaryote) characterized by its antigenic properties. |
`strain` | strain | Name of the strain from which the sample was obtained. |
Reference: ENA prokaryotic pathogen minimal sample checklist
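The right-truncated ISO 8601 forms listed for `collection_date` can be screened with a regular expression before submission. This is a rough sketch under stated assumptions: it checks only the shape of the value (not whether the month or day exists), it assumes an interval is written as two dates separated by `/`, and the function name is hypothetical. ENA's own validation remains authoritative.

```python
import re

# One date/time, right-truncated at any level: YYYY, YYYY-MM, YYYY-MM-DD,
# or YYYY-MM-DDThh:mm:ss with an optional "Z" or "+hh:mm" / "-hh:mm" offset.
_DATE = (
    r"\d{4}(?:-\d{2}(?:-\d{2}"
    r"(?:T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:\d{2})?)?)?)?"
)

# Assumption: an interval is written as "start/end".
_COLLECTION_DATE = re.compile(rf"{_DATE}(?:/{_DATE})?")


def looks_like_collection_date(value):
    """Return True if the value has one of the accepted date shapes."""
    return _COLLECTION_DATE.fullmatch(value) is not None


# The five examples given in the table above.
for example in (
    "2008-01-23T19:23:10+00:00",
    "2008-01-23T19:23:10",
    "2008-01-23",
    "2008-01",
    "2008",
):
    print(example, looks_like_collection_date(example))
```

Values such as `not collected` are intentionally not matched here; missing-value terms would need a separate check.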
- Mandatory Fields (viral samples)
These fields are required for submission and must be included in the Terra data table or supplied as an input parameter.
If you cannot provide a value for a mandatory field, set the `allow_missing` input parameter to `true` or, alternatively, use one of the INSDC accepted terms for missing value reporting.
Terra Column Name | ENA Field Name | Description |
---|---|---|
`title` | sample_title | Title of the sample. |
`taxon_id` and/or `organism` | tax_id and/or scientific name | Taxonomic identifier (NCBI taxon ID) or scientific name of the organism from which the sample was obtained. |
`collection_date` | collection date | The date the sample was collected with the intention of sequencing, either as an instance (single point in time) or interval. In case no exact time is available, the date/time can be right truncated, i.e. all of these are valid ISO 8601 compliant times: 2008-01-23T19:23:10+00:00; 2008-01-23T19:23:10; 2008-01-23; 2008-01; 2008. |
`collecting_institution` | collecting institution | Name of the institution to which the person collecting the specimen belongs. Format: Institute Name, Institute Address |
`collector_name` | collector name | Name of the person who collected the specimen. Example: John Smith |
`geo_loc_name` | geographic location (country and/or sea) | The geographical origin of where the sample was collected from, with the intention of sequencing, as defined by the country or sea name. Country or sea names should be chosen from the INSDC country list. |
`host_common_name` | host common name | Common name of the host, e.g. human |
`host_health_state` | host health state | Health status of the host at the time of sample collection. Must be one of the following: `diseased`, `healthy`, `missing: control sample`, `missing: data agreement established pre-2023`, `missing: endangered species`, `missing: human-identifiable`, `missing: lab stock`, `missing: sample group`, `missing: synthetic construct`, `missing: third party data`, `not applicable`, `not collected`, `not provided`, `restricted access`. |
`host_scientific_name` | host scientific name | Scientific name of the natural (as opposed to laboratory) host to the organism from which sample was obtained. |
`host_sex` | host sex | Gender or sex of the host. Must be one of the following: `female`, `male`, `hermaphrodite`, `neuter`, `not applicable`, `not collected`, `not provided`, `other`, `missing: control sample`, `missing: data agreement established pre-2023`, `missing: endangered species`, `missing: human-identifiable`, `missing: lab stock`, `missing: sample group`, `missing: synthetic construct`, `missing: third party data`. |
`host_subject_id` | host subject id | A unique identifier by which each subject can be referred to, de-identified, e.g. #131 |
`isolation_source` | isolation_source | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived. |
`isolate` | isolate | Individual isolate from which the sample was obtained. |
- Optional Fields (viral samples)
Terra Column Name | ENA Field Name | Description |
---|---|---|
`latitude` | geographic location (latitude) | The geographical origin of the sample as defined by latitude. The values should be reported in decimal degrees and in the WGS84 system. |
`longitude` | geographic location (longitude) | The geographical origin of the sample as defined by longitude. The values should be reported in decimal degrees and in the WGS84 system. |
`region_locality` | geographic location (region and locality) | The geographical origin of the sample as defined by the specific region name followed by the locality name. |
`host_disease_outcome` | host disease outcome | Disease outcome in the host. |
`host_age` | host age | Age of host at the time of sampling; relevant scale depends on species and study, e.g. could be seconds for amoebae or centuries for trees. |
`host_behaviour` | host behaviour | Natural behaviour of the host. |
`host_habitat` | host habitat | Natural habitat of the avian or mammalian host. |
`isolation_source_host` | isolation source host-associated | Name of host tissue or organ sampled for analysis. Example: tracheal tissue |
`isolation_source_non_host` | isolation source non-host-associated | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived. Example: soil |
`receipt_date` | receipt date | Date on which the sample was received. Format: YYYY-MM-DD. Please provide the highest precision possible. If the sample was received by the institution and not collected, the 'receipt date' must be provided instead. |
`sample_capture_status` | sample capture status | Reason for the sample collection. |
`library_description` | sample_description | Description of the sample. |
`serotype` | serotype | Serological variety of a species characterised by its antigenic properties. For Influenza, the HA subtype should be the letter H followed by a number between 1-16 unless a novel subtype is identified, and the NA subtype should be the letter N followed by a number between 1-9 unless a novel subtype is identified. |
`virus_identifier` | virus identifier | Unique laboratory identifier assigned to the virus by the investigator. Strain name is not sufficient since it might not be unique due to various passages of the same virus. Format: up to 50 alphanumeric characters |
Reference: ENA viral minimal sample checklist
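The controlled vocabularies listed for `host_health_state` and `host_sex` lend themselves to a quick pre-submission check. The sketch below copies the permitted values from the tables above; it assumes rows are Python dicts keyed by the default Terra column names and that matching is case-sensitive (an assumption, not something this page states). The function name is hypothetical.

```python
# Missing-value terms shared by both vocabularies, copied from the tables above.
MISSING_VALUE_TERMS = {
    "missing: control sample",
    "missing: data agreement established pre-2023",
    "missing: endangered species",
    "missing: human-identifiable",
    "missing: lab stock",
    "missing: sample group",
    "missing: synthetic construct",
    "missing: third party data",
    "not applicable",
    "not collected",
    "not provided",
}

HOST_HEALTH_STATES = {"diseased", "healthy", "restricted access"} | MISSING_VALUE_TERMS
HOST_SEXES = {"female", "male", "hermaphrodite", "neuter", "other"} | MISSING_VALUE_TERMS

CONTROLLED_FIELDS = {
    "host_health_state": HOST_HEALTH_STATES,
    "host_sex": HOST_SEXES,
}


def invalid_vocabulary_values(row):
    """Return {column: value} for controlled fields whose value is not permitted."""
    return {
        column: row[column]
        for column, allowed in CONTROLLED_FIELDS.items()
        if column in row and row[column] not in allowed
    }


# "Female" is rejected here only because this sketch assumes case-sensitive matching.
print(invalid_vocabulary_values({"host_health_state": "healthy", "host_sex": "Female"}))
```

Columns absent from a row are skipped, so the same function can be applied to prokaryotic pathogen rows, which have no `host_sex` column.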
Workflow Inputs
Note that the `Terra_2_ENA` workflow is designed to run on set-level data tables: it processes all samples within a set together rather than handling each sample individually. The `samples` input variable expects an array of sample IDs, corresponding to a set table. In most cases, set tables are generated automatically when running a workflow. However, if you need to create one manually, refer to this guide on how to create a set table.
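If a set table has to be built by hand, one option is to generate a membership TSV and upload it to the workspace. The sketch below assumes Terra's usual load-file convention of a `membership:<entity>_set_id` header followed by the member entity type; the entity type `sample`, the set name, and the sample IDs are all hypothetical, and the linked guide should be treated as the authoritative description of the format.

```python
import csv
import io


def write_set_membership(set_id, sample_ids, handle, entity_type="sample"):
    """Write a set membership table: one row per (set, member) pair."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumed header convention: "membership:<entity>_set_id" then "<entity>".
    writer.writerow([f"membership:{entity_type}_set_id", entity_type])
    for sample_id in sample_ids:
        writer.writerow([set_id, sample_id])


# Hypothetical set name and sample IDs.
buffer = io.StringIO()
write_set_membership("ena_batch_1", ["sample_A", "sample_B"], buffer)
print(buffer.getvalue())
```

Each sample ID must already exist in the corresponding sample-level table for the membership rows to resolve.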
The `submit_to_production` input parameter is set to `false` by default, so the workflow will not submit data to the production ENA server unless you explicitly set it to `true`. This is useful for testing, because it lets you validate your data without making actual submissions.
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
Terra_2_ENA | ena_password | String | ENA password to authenticate submission | | Required |
Terra_2_ENA | ena_username | String | ENA username to authenticate submission | | Required |
Terra_2_ENA | sample_id_column | String | The column name in the Terra data table containing sample IDs | | Required |
Terra_2_ENA | sample_type | String | Type of sample being submitted ("prokaryotic_pathogen" or "virus_pathogen") | | Required |
Terra_2_ENA | samples | Array[String] | Array of sample IDs to submit | | Required |
Terra_2_ENA | study_accession | String | ENA study accession number to associate submissions with | | Required |
Terra_2_ENA | terra_project_name | String | The Terra project containing the data table | | Required |
Terra_2_ENA | terra_table_name | String | The name of the Terra data table containing sample data | | Required |
Terra_2_ENA | terra_workspace_name | String | The Terra workspace containing the data table | | Required |
Terra_2_ENA | allow_missing | Boolean | Whether to allow missing values in metadata | FALSE | Optional |
Terra_2_ENA | column_mappings | File | TSV file mapping Terra table columns to ENA submission fields | | Optional |
download_terra_table | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
download_terra_table | disk_size | Int | Amount of storage (in GB) to allocate to the task | 10 | Optional |
download_terra_table | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21 | Optional |
download_terra_table | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
register_ena_samples | batch_size | Int | Number of samples to process in each batch | 100 | Optional |
register_ena_samples | center | String | Name of submitting center | | Optional |
register_ena_samples | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
register_ena_samples | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
register_ena_samples | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra_to_ena:0.6 | Optional |
register_ena_samples | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
submit_ena_data | bam_column | String | Column name containing BAM file paths | bam_file | Optional |
submit_ena_data | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
submit_ena_data | cram_column | String | Column name containing CRAM file paths | cram_file | Optional |
submit_ena_data | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
submit_ena_data | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra_to_ena:0.6 | Optional |
submit_ena_data | experiment_name | String | Internal component, do not modify | | Optional |
submit_ena_data | instrument | String | Internal component, do not modify | | Optional |
submit_ena_data | library_selection | String | Internal component, do not modify | RANDOM | Optional |
submit_ena_data | library_source | String | Internal component, do not modify | GENOMIC | Optional |
submit_ena_data | library_strategy | String | Internal component, do not modify | WGS | Optional |
submit_ena_data | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
submit_ena_data | platform | String | Internal component, do not modify | ILLUMINA | Optional |
submit_ena_data | read1_column | String | Column name containing read1 file paths | read1 | Optional |
submit_ena_data | read2_column | String | Column name containing read2 file paths | read2 | Optional |
submit_ena_data | submit_to_production | Boolean | If false, performs a test submission of metadata | FALSE | Optional |
Workflow Outputs
Variable | Type | Description |
---|---|---|
Terra_2_ENA_analysis_date | String | Date the Terra to ENA workflow was run |
Terra_2_ENA_version | String | Version of the Terra to ENA workflow used |
ena_accessions | File | Text file containing the accession numbers generated by ENA submission |
ena_docker_image | String | Docker image used for ENA submission processing |
ena_excluded_samples | File | Text file listing samples that were excluded from ENA submission |
ena_file_paths_json | File | JSON file containing paths to the files submitted to ENA |
ena_metadata_accessions | File | File containing metadata and their corresponding accessions from ENA |
ena_registration_log | File | Log file detailing the ENA registration process |
ena_registration_success | String | String indicating whether the ENA registration was successful |
ena_registration_summary | File | Summary file of the ENA registration results |
ena_submission_manifest_files | Array[File] | Array of manifest files used for ENA submission. Each file corresponds to a sample and contains metadata and file paths |
ena_submission_report_files | Array[File] | Array of report files containing the results of the ENA submission |
ena_webincli_results | File | File containing the cumulative results of the ENA submission |
prepped_ena_data | File | Prepared data formatted for ENA submission |
terra_table | File | Terra table file used for submission to ENA |