Terra_2_ENA
Quick Facts
| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
|---|---|---|---|---|
| Public Data Sharing | Bacteria, Viral | v3.1.0 | No | Set-level |
Terra_2_ENA_PHB
This workflow utilizes the ENA Webin-CLI Bulk Submission Tool to bulk submit read data to the ENA.
ENA Submissions
Before you can submit data to ENA you must register a Webin submission account. ENA allows submissions via the Webin-CLI program which validates your submissions entirely before you complete them. Submissions made through Webin are represented using a number of different metadata objects. Submissions to ENA result in accession numbers and these accessions can be used to identify each unique part of your submission. See the ENA submission documentation for more information.
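To make the validate-before-submit behaviour concrete, here is a minimal sketch that assembles a Webin-CLI command line for a read submission. It only builds the argument list and does not run anything. The jar location, manifest name, and credentials are placeholders, and this is an illustration of Webin-CLI usage in general, not necessarily how the Terra_2_ENA workflow invokes it.

```python
from typing import List


def build_webin_cli_command(
    manifest_path: str,
    username: str,
    password: str,
    submit: bool = False,
    use_test_service: bool = True,
    jar_path: str = "webin-cli.jar",  # hypothetical location of the Webin-CLI jar
) -> List[str]:
    """Assemble (but do not run) a Webin-CLI command for a read submission."""
    command = [
        "java", "-jar", jar_path,
        "-context", "reads",
        "-manifest", manifest_path,
        "-userName", username,
        "-password", password,
        # -validate only checks the submission; -submit validates and then submits.
        "-submit" if submit else "-validate",
    ]
    if use_test_service:
        # -test targets the ENA test service instead of production.
        command.append("-test")
    return command


if __name__ == "__main__":
    # Placeholder credentials; never print a real password.
    print(" ".join(build_webin_cli_command("sample1.manifest", "Webin-XXXXX", "********")))
```

Defaulting to validation against the test service mirrors the cautious approach of checking a submission before committing it.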
Pre-requisites
Before running the Terra_2_ENA workflow, make sure you have registered a study and are using the correct study accession number.
- To submit data to ENA, you must first register a study to contain and manage it. Studies (also referred to as projects) can be registered through the Webin Portal. Log in with your Webin credentials and select the ‘Register Study’ button to bring up the interface. Once registration is complete, you will be assigned accession numbers. You may return to the dashboard and select the ‘Studies Report’ button to review registered studies.
- Additionally, before submitting most types of data to ENA, samples must be registered. To register samples, ensure that your Terra data table includes all the samples you intend to submit, along with their raw read data (FASTQ, BAM, or CRAM format) and associated metadata. To meet ENA’s requirements, each sample must include a minimum set of metadata. See below for the mandatory and recommended metadata fields, as well as the default column names used to identify them in your Terra data table.
What needs to be included in your Terra data table?
Read Data Fields
- Mandatory Fields

  These columns are required for submission and must be included in the Terra data table. The column names must appear exactly as shown and cannot be substituted or modified using column mappings.

  | Terra Column Name | Description |
  |---|---|
  | `read1`/`read2`/`bam_file`/`cram_file` | The path to two paired-end FASTQ files, a BAM file, or a CRAM file containing sequencing data. |
  | `experiment_name` | Unique name of the experiment. |
  | `sequencing_platform` | The platform used to generate the sequence data. See permitted values. |
  | `sequencing_instrument` | The instrument used to generate the sequence data. See permitted values. |
  | `library_source` | The source of the library. See permitted values. |
  | `library_selection` | The method used to select the library. See permitted values. |
  | `library_strategy` | The strategy used to generate the library. See permitted values. |

- Optional Fields

  | Terra Column Name | Description |
  |---|---|
  | `insert_size` | The insert size for paired reads. |
  | `library_description` | Free text library description. |
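As a rough illustration of how these read data columns relate to a Webin-CLI read manifest, the sketch below turns one table row into manifest lines. The manifest field names (`STUDY`, `SAMPLE`, `NAME`, `PLATFORM`, and so on) come from ENA's Webin-CLI read submission format; the mapping from Terra columns to those fields, the function name, and the accessions are assumptions made for this example and may differ from what the workflow does internally.

```python
from typing import Dict, List

# Assumed mapping from the Terra read-data columns listed above to
# Webin-CLI read manifest fields (illustrative, not taken from the workflow).
TERRA_TO_MANIFEST = {
    "experiment_name": "NAME",
    "sequencing_platform": "PLATFORM",
    "sequencing_instrument": "INSTRUMENT",
    "library_source": "LIBRARY_SOURCE",
    "library_selection": "LIBRARY_SELECTION",
    "library_strategy": "LIBRARY_STRATEGY",
    "insert_size": "INSERT_SIZE",
    "library_description": "DESCRIPTION",
}


def build_manifest_lines(
    row: Dict[str, str], study_accession: str, sample_accession: str
) -> List[str]:
    """Build read manifest lines for one sample row."""
    lines = [f"STUDY {study_accession}", f"SAMPLE {sample_accession}"]
    for terra_column, field in TERRA_TO_MANIFEST.items():
        value = row.get(terra_column, "")
        if value:  # skip empty optional fields
            lines.append(f"{field} {value}")
    # Exactly one kind of read data is used: paired FASTQ, BAM, or CRAM.
    if row.get("read1") and row.get("read2"):
        lines += [f"FASTQ {row['read1']}", f"FASTQ {row['read2']}"]
    elif row.get("bam_file"):
        lines.append(f"BAM {row['bam_file']}")
    elif row.get("cram_file"):
        lines.append(f"CRAM {row['cram_file']}")
    else:
        raise ValueError("row needs read1 and read2, bam_file, or cram_file")
    return lines


if __name__ == "__main__":
    example_row = {
        "experiment_name": "exp_sample1",
        "sequencing_platform": "ILLUMINA",
        "read1": "sample1_R1.fastq.gz",
        "read2": "sample1_R2.fastq.gz",
    }
    # Placeholder accessions for illustration only.
    print("\n".join(build_manifest_lines(example_row, "PRJEBXXXXX", "ERSXXXXXXX")))
```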
Sample Metadata Fields
Using Customized Column Names in Terra Tables
In some cases, users may have data tables in Terra with column names that differ from the field names expected by ENA. The Terra_2_ENA workflow allows users to supply a custom column mapping file that specifies how their columns map to the expected field names.
To use a custom column mapping file:
- Create a tab-delimited `.tsv` file with the following structure: the first row should be a header containing `terra_column` and `ena_column`. The `terra_column` column should contain the actual column names in your Terra table (e.g., `my_fav_collection_date`), and the `ena_column` column should contain the column names expected by ENA (e.g., `collection date`). The mandatory and recommended metadata fields and their associated column names are described in the tables below.
- Upload the file to your Terra workspace and reference it in the `column_mappings` input parameter when running the workflow, using its Google Cloud Storage path.

Ensure the mapping file includes all columns with custom names. Columns that match the default workflow names do not need to be included. A missing mapping for a renamed column may cause an error during execution if the column is required; if the column is optional, it will simply not be found. The workflow will automatically map the specified column names from your Terra table to the required ENA field names as long as the mapping file is provided correctly.
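The mapping file structure described above can be sketched as follows. The helper name is hypothetical, and only `my_fav_collection_date` comes from the text; the second custom column (`sampling_country`) is invented for illustration.

```python
import csv
import io
from typing import Dict


def render_column_mappings(mappings: Dict[str, str]) -> str:
    """Return the mapping file content as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # The header row must name the two columns.
    writer.writerow(["terra_column", "ena_column"])
    for terra_column, ena_column in mappings.items():
        writer.writerow([terra_column, ena_column])
    return buffer.getvalue()


if __name__ == "__main__":
    example = {
        "my_fav_collection_date": "collection date",
        # Hypothetical custom column name, for illustration only.
        "sampling_country": "geographic location (country and/or sea)",
    }
    print(render_column_mappings(example), end="")
```

Only renamed columns need a row; columns that already use the default names can be left out.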
- Mandatory Fields (prokaryotic pathogen samples)

  These fields are required for submission and must be included in the Terra data table or supplied as an input parameter.

  If you cannot provide a value for a mandatory field, set the `allow_missing` input parameter to `true` or, alternatively, use one of the INSDC accepted terms for missing value reporting.

  | Terra Column Name | ENA Field Name | Description |
  |---|---|---|
  | `title` | sample_title | Title of the sample. |
  | `taxon_id` and/or `organism` | tax_id and/or scientific name | Taxonomic identifier (NCBI taxon ID) or scientific name of the organism from which the sample was obtained. |
  | `collection_date` | collection date | The date the sample was collected with the intention of sequencing, either as an instance (single point in time) or interval. If no exact time is available, the date/time can be right-truncated, i.e. all of these are valid ISO 8601 compliant times: 2008-01-23T19:23:10+00:00; 2008-01-23T19:23:10; 2008-01-23; 2008-01; 2008. |
  | `geo_loc_name` | geographic location (country and/or sea) | The geographical origin of where the sample was collected from, with the intention of sequencing, as defined by the country or sea name. Country or sea names should be chosen from the INSDC country list. |
  | `host_health_state` | host health state | Health status of the host at the time of sample collection. Must be one of the following: `diseased`, `healthy`, `missing: control sample`, `missing: data agreement established pre-2023`, `missing: endangered species`, `missing: human-identifiable`, `missing: lab stock`, `missing: sample group`, `missing: synthetic construct`, `missing: third party data`, `not applicable`, `not collected`, `not provided`, `restricted access`. |
  | `host_scientific_name` | host scientific name | Scientific name of the natural (as opposed to laboratory) host to the organism from which the sample was obtained. |
  | `isolation_source` | isolation_source | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived. |
  | `isolate` | isolate | Individual isolate from which the sample was obtained. |

- Optional Fields

  | Terra Column Name | ENA Field Name | Description |
  |---|---|---|
  | `library_description` | sample_description | Description of the sample. |
  | `lat_lon` | lat_lon | Geographical coordinates of the location where the specimen was collected. |
  | `serovar` | serovar | Serological variety of a species (usually a prokaryote) characterized by its antigenic properties. |
  | `strain` | strain | Name of the strain from which the sample was obtained. |

Reference: ENA prokaryotic pathogen minimal sample checklist
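Before launching the workflow, it can help to check a row against the mandatory prokaryotic pathogen fields listed above. The following is a minimal sketch of such a check using the default Terra column names; the function name and the shape of the row (a plain dictionary of strings) are assumptions, and the workflow's own validation may behave differently.

```python
from typing import Dict, List

# Mandatory columns that must each carry a value (default Terra names).
MANDATORY_COLUMNS = [
    "title", "collection_date", "geo_loc_name", "host_health_state",
    "host_scientific_name", "isolation_source", "isolate",
]

# Permitted values for host_health_state, as listed in the table above.
HOST_HEALTH_STATES = {
    "diseased", "healthy", "missing: control sample",
    "missing: data agreement established pre-2023",
    "missing: endangered species", "missing: human-identifiable",
    "missing: lab stock", "missing: sample group",
    "missing: synthetic construct", "missing: third party data",
    "not applicable", "not collected", "not provided", "restricted access",
}


def find_metadata_problems(row: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems for one sample row."""
    problems = []
    for column in MANDATORY_COLUMNS:
        if not row.get(column, "").strip():
            problems.append(f"missing value for {column}")
    # Either taxon_id or organism (or both) must be present.
    if not (row.get("taxon_id", "").strip() or row.get("organism", "").strip()):
        problems.append("need taxon_id and/or organism")
    state = row.get("host_health_state", "").strip()
    if state and state not in HOST_HEALTH_STATES:
        problems.append(f"host_health_state not permitted: {state}")
    return problems
```

An empty list means the row passes these basic checks; it does not guarantee that ENA will accept the sample.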
- Mandatory Fields (viral samples)

  These fields are required for submission and must be included in the Terra data table or supplied as an input parameter.

  If you cannot provide a value for a mandatory field, set the `allow_missing` input parameter to `true` or, alternatively, use one of the INSDC accepted terms for missing value reporting.

  | Terra Column Name | ENA Field Name | Description |
  |---|---|---|
  | `title` | sample_title | Title of the sample. |
  | `taxon_id` and/or `organism` | tax_id and/or scientific name | Taxonomic identifier (NCBI taxon ID) or scientific name of the organism from which the sample was obtained. |
  | `collection_date` | collection date | The date the sample was collected with the intention of sequencing, either as an instance (single point in time) or interval. If no exact time is available, the date/time can be right-truncated, i.e. all of these are valid ISO 8601 compliant times: 2008-01-23T19:23:10+00:00; 2008-01-23T19:23:10; 2008-01-23; 2008-01; 2008. |
  | `collecting_institution` | collecting institution | Name of the institution to which the person collecting the specimen belongs. Format: Institute Name, Institute Address |
  | `collector_name` | collector name | Name of the person who collected the specimen. Example: John Smith |
  | `geo_loc_name` | geographic location (country and/or sea) | The geographical origin of where the sample was collected from, with the intention of sequencing, as defined by the country or sea name. Country or sea names should be chosen from the INSDC country list. |
  | `host_common_name` | host common name | Common name of the host, e.g. human |
  | `host_health_state` | host health state | Health status of the host at the time of sample collection. Must be one of the following: `diseased`, `healthy`, `missing: control sample`, `missing: data agreement established pre-2023`, `missing: endangered species`, `missing: human-identifiable`, `missing: lab stock`, `missing: sample group`, `missing: synthetic construct`, `missing: third party data`, `not applicable`, `not collected`, `not provided`, `restricted access`. |
  | `host_scientific_name` | host scientific name | Scientific name of the natural (as opposed to laboratory) host to the organism from which the sample was obtained. |
  | `host_sex` | host sex | Gender or sex of the host. Must be one of the following: `female`, `male`, `hermaphrodite`, `neuter`, `not applicable`, `not collected`, `not provided`, `other`, `missing: control sample`, `missing: data agreement established pre-2023`, `missing: endangered species`, `missing: human-identifiable`, `missing: lab stock`, `missing: sample group`, `missing: synthetic construct`, `missing: third party data`. |
  | `host_subject_id` | host subject id | A unique identifier by which each subject can be referred to, de-identified, e.g. #131 |
  | `isolation_source` | isolation_source | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived. |
  | `isolate` | isolate | Individual isolate from which the sample was obtained. |

- Optional Fields

  | Terra Column Name | ENA Field Name | Description |
  |---|---|---|
  | `latitude` | geographic location (latitude) | The geographical origin of the sample as defined by latitude. The values should be reported in decimal degrees and in the WGS84 system. |
  | `longitude` | geographic location (longitude) | The geographical origin of the sample as defined by longitude. The values should be reported in decimal degrees and in the WGS84 system. |
  | `region_locality` | geographic location (region and locality) | The geographical origin of the sample as defined by the specific region name followed by the locality name. |
  | `host_disease_outcome` | host disease outcome | Disease outcome in the host. |
  | `host_age` | host age | Age of host at the time of sampling; relevant scale depends on species and study, e.g. could be seconds for amoebae or centuries for trees. |
  | `host_behaviour` | host behaviour | Natural behaviour of the host. |
  | `host_habitat` | host habitat | Natural habitat of the avian or mammalian host. |
  | `isolation_source_host` | isolation source host-associated | Name of host tissue or organ sampled for analysis. Example: tracheal tissue |
  | `isolation_source_non_host` | isolation source non-host-associated | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived. Example: soil |
  | `receipt_date` | receipt date | Date on which the sample was received. Format: YYYY-MM-DD. Please provide the highest precision possible. If the sample was received by the institution and not collected, the 'receipt date' must be provided instead. |
  | `sample_capture_status` | sample capture status | Reason for the sample collection. |
  | `library_description` | sample_description | Description of the sample. |
  | `serotype` | serotype | Serological variety of a species characterised by its antigenic properties. For Influenza, the HA subtype should be the letter H followed by a number between 1-16 unless a novel subtype is identified, and the NA subtype should be the letter N followed by a number between 1-9 unless a novel subtype is identified. |
  | `virus_identifier` | virus identifier | Unique laboratory identifier assigned to the virus by the investigator. Strain name is not sufficient since it might not be unique due to various passages of the same virus. Format: up to 50 alphanumeric characters |

Reference: ENA viral minimal sample checklist
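The right-truncated `collection_date` formats shown in the mandatory field tables can be checked with a simple pattern. This is a sketch under stated assumptions: it covers single points in time only (not intervals or the missing-value terms), and it checks the shape of the string rather than calendar validity, so a value such as `2008-13-45` would still pass. The pattern and function name are illustrative and are not taken from the workflow.

```python
import re

# Year, then optionally month, day, time (with optional seconds), and
# an optional timezone offset. Each part may only appear if the part
# before it is present, which gives the right-truncation behaviour.
COLLECTION_DATE_PATTERN = re.compile(
    r"\d{4}"
    r"(-\d{2}"
    r"(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2}:\d{2})?)?"
    r")?"
    r")?"
)


def looks_like_collection_date(value: str) -> bool:
    """Return True if value has the shape of a right-truncated ISO 8601 date."""
    return COLLECTION_DATE_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["2008-01-23T19:23:10+00:00", "2008-01", "23/01/2008"]:
        print(candidate, looks_like_collection_date(candidate))
```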
Workflow Inputs
It's important to note that the Terra_2_ENA workflow is designed to run on set-level data tables. This means that the workflow will process all samples within a set together, rather than handling each sample individually. The samples input variable expects an array of sample IDs, corresponding to a set table. In most cases, set tables are generated automatically when running a workflow. However, if you need to create one manually, refer to this guide on how to create a set table.
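If a set table does need to be created by hand, the general idea is a two-column membership file that pairs a set ID with each member sample ID. The sketch below renders such a file; the `membership:<entity>_set_id` header follows Terra's load-file convention as we understand it, and the entity name `sample` and the IDs are placeholders, so check the linked guide for the exact format your table requires.

```python
from typing import List


def render_set_membership_tsv(
    set_id: str, sample_ids: List[str], entity: str = "sample"
) -> str:
    """Render a tab-delimited set membership file for one set."""
    # Assumed header convention: membership:<entity>_set_id <TAB> <entity>
    lines = [f"membership:{entity}_set_id\t{entity}"]
    # One row per member, repeating the set ID each time.
    lines += [f"{set_id}\t{sample_id}" for sample_id in sample_ids]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_set_membership_tsv("ena_batch_1", ["sample_A", "sample_B"]), end="")
```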
The submit_to_production input parameter is set to false by default. This means that the workflow will not submit data to the production ENA server unless you explicitly set it to true. This is useful for testing purposes, allowing you to validate your data without making actual submissions.
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| Terra_2_ENA | ena_password | String | ENA password to authenticate submission | | Required |
| Terra_2_ENA | ena_username | String | ENA username to authenticate submission | | Required |
| Terra_2_ENA | sample_id_column | String | The column name in the Terra data table containing sample IDs | | Required |
| Terra_2_ENA | sample_type | String | Type of sample being submitted ("prokaryotic_pathogen" or "virus_pathogen") | | Required |
| Terra_2_ENA | samples | Array[String] | Array of sample IDs to submit | | Required |
| Terra_2_ENA | study_accession | String | ENA study accession number to associate submissions with | | Required |
| Terra_2_ENA | terra_project_name | String | The Terra project containing the data table | | Required |
| Terra_2_ENA | terra_table_name | String | The name of the Terra data table containing sample data | | Required |
| Terra_2_ENA | terra_workspace_name | String | The Terra workspace containing the data table | | Required |
| Terra_2_ENA | allow_missing | Boolean | Whether to allow missing values in metadata | FALSE | Optional |
| Terra_2_ENA | column_mappings | File | TSV file mapping Terra table columns to ENA submission fields | | Optional |
| download_terra_table | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| download_terra_table | disk_size | Int | Amount of storage (in GB) to allocate to the task | 10 | Optional |
| download_terra_table | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21 | Optional |
| download_terra_table | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| register_ena_samples | batch_size | Int | Number of samples to process in each batch | 100 | Optional |
| register_ena_samples | center | String | Name of submitting center | | Optional |
| register_ena_samples | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| register_ena_samples | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| register_ena_samples | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra_to_ena:0.6 | Optional |
| register_ena_samples | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| submit_ena_data | bam_column | String | Column name containing BAM file paths | bam_file | Optional |
| submit_ena_data | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| submit_ena_data | cram_column | String | Column name containing CRAM file paths | cram_file | Optional |
| submit_ena_data | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| submit_ena_data | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra_to_ena:0.6 | Optional |
| submit_ena_data | experiment_name | String | Internal component, do not modify | | Optional |
| submit_ena_data | instrument | String | Internal component, do not modify | | Optional |
| submit_ena_data | library_selection | String | Internal component, do not modify | RANDOM | Optional |
| submit_ena_data | library_source | String | Internal component, do not modify | GENOMIC | Optional |
| submit_ena_data | library_strategy | String | Internal component, do not modify | WGS | Optional |
| submit_ena_data | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| submit_ena_data | platform | String | Internal component, do not modify | ILLUMINA | Optional |
| submit_ena_data | read1_column | String | Column name containing read1 file paths | read1 | Optional |
| submit_ena_data | read2_column | String | Column name containing read2 file paths | read2 | Optional |
| submit_ena_data | submit_to_production | Boolean | If false, performs a test submission of metadata | FALSE | Optional |
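As a quick sanity check on the required inputs in the table above, the sketch below verifies that a set of inputs names every required variable and uses a permitted `sample_type`, then renders it as JSON. The `Terra_2_ENA.` key prefix follows the usual WDL inputs convention and is an assumption here, as are the helper names and all placeholder values.

```python
import json
from typing import Dict, List

# Required workflow-level inputs, taken from the table above.
REQUIRED_INPUTS = [
    "ena_password", "ena_username", "sample_id_column", "sample_type",
    "samples", "study_accession", "terra_project_name",
    "terra_table_name", "terra_workspace_name",
]
SAMPLE_TYPES = {"prokaryotic_pathogen", "virus_pathogen"}


def check_inputs(inputs: Dict[str, object]) -> List[str]:
    """Return a list of problems found in the supplied inputs."""
    problems = [
        f"missing required input: {name}"
        for name in REQUIRED_INPUTS
        if name not in inputs
    ]
    sample_type = inputs.get("sample_type")
    if sample_type is not None and sample_type not in SAMPLE_TYPES:
        problems.append(f"unsupported sample_type: {sample_type}")
    return problems


def to_inputs_json(inputs: Dict[str, object], workflow: str = "Terra_2_ENA") -> str:
    """Render inputs with an assumed '<workflow>.<variable>' key prefix."""
    return json.dumps({f"{workflow}.{k}": v for k, v in inputs.items()}, indent=2)


if __name__ == "__main__":
    example = {  # all values are placeholders
        "ena_username": "Webin-XXXXX",
        "ena_password": "********",
        "sample_id_column": "sample_id",
        "sample_type": "prokaryotic_pathogen",
        "samples": ["sample_A", "sample_B"],
        "study_accession": "PRJEBXXXXX",
        "terra_project_name": "my-billing-project",
        "terra_table_name": "sample",
        "terra_workspace_name": "my-workspace",
    }
    print(check_inputs(example))
    print(to_inputs_json(example))
```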
Workflow Outputs
| Variable | Type | Description |
|---|---|---|
| Terra_2_ENA_analysis_date | String | Date the Terra to ENA workflow was run |
| Terra_2_ENA_version | String | Version of the Terra to ENA workflow used |
| ena_accessions | File | Text file containing the accession numbers generated by ENA submission |
| ena_docker_image | String | Docker image used for ENA submission processing |
| ena_excluded_samples | File | Text file listing samples that were excluded from ENA submission |
| ena_file_paths_json | File | JSON file containing paths to the files submitted to ENA |
| ena_metadata_accessions | File | File containing metadata and their corresponding accessions from ENA |
| ena_registration_log | File | Log file detailing the ENA registration process |
| ena_registration_success | String | String indicating whether the ENA registration was successful |
| ena_registration_summary | File | Summary file of the ENA registration results |
| ena_submission_manifest_files | Array[File] | Array of manifest files used for ENA submission. Each file corresponds to a sample and contains metadata and file paths |
| ena_submission_report_files | Array[File] | Array of report files containing the results of the ENA submission |
| ena_webincli_results | File | File containing the cumulative results of the ENA submission |
| prepped_ena_data | File | Prepared data formatted for ENA submission |