Terra_2_ENA

Quick Facts

Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level
Public Data Sharing | Bacteria, Viral | vX.X.X | No | Set-level

Terra_2_ENA_PHB

This workflow utilizes the ENA Webin-CLI Bulk Submission Tool to bulk submit read data to the ENA.

ENA Submissions

Before you can submit data to ENA you must register a Webin submission account. ENA allows submissions via the Webin-CLI program which validates your submissions entirely before you complete them. Submissions made through Webin are represented using a number of different metadata objects. Submissions to ENA result in accession numbers and these accessions can be used to identify each unique part of your submission. See the ENA submission documentation for more information.

Pre-requisites

Before running the Terra_2_ENA workflow, make sure you have registered a study and are using the correct study accession number.

  • To submit data into ENA you must first register a study to contain and manage it. Studies (also referred to as projects) can be registered through the Webin Portal. Log in with your Webin credentials and select the ‘Register Study’ button to bring up the interface. Once registration is complete, you will be assigned accession numbers. You may return to the dashboard and select the ‘Studies Report’ button to review registered studies.

  • Additionally, before submitting most types of data to ENA, samples must be registered. To register samples, ensure that your Terra data table includes all the samples you intend to submit, along with their raw read data (FASTQ, BAM, or CRAM format) and associated metadata. To meet ENA’s requirements, each sample must include a minimum set of metadata. See below for the mandatory and recommended metadata fields, as well as the default column names used to identify them in your Terra data table.

What needs to be included in your Terra data table?

Read Data Fields

  • Mandatory Fields

    These columns are required for submission and must be included in the Terra data table. The column names must appear exactly as shown and cannot be substituted or modified using column mappings.

    Terra Column Name | Description
    read1/read2/bam_file/cram_file | Paths to the two paired-end FASTQ files (read1 and read2), or the path to a BAM or CRAM file, containing the sequencing data.
    experiment_name | Unique name of the experiment.
    sequencing_platform | The platform used to generate the sequence data. See permitted values.
    sequencing_instrument | The instrument used to generate the sequence data. See permitted values.
    library_source | The source of the library. See permitted values.
    library_selection | The method used to select the library. See permitted values.
    library_strategy | The strategy used to generate the library. See permitted values.
  • Optional Fields
    Terra Column Name | Description
    insert_size | The insert size for paired reads.
    library_description | Free text library description.
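
As a rough pre-flight check, the read data requirements above can be sketched in Python. This is an illustration only and not part of the workflow: the function name `check_read_fields` is hypothetical, and the rule that a row needs either both `read1` and `read2`, or a `bam_file`, or a `cram_file` is our reading of the table. The column names are taken from the table; the example values and bucket paths are placeholders.

```python
# Mandatory read-data columns, as listed in the table above.
MANDATORY_READ_COLUMNS = [
    "experiment_name",
    "sequencing_platform",
    "sequencing_instrument",
    "library_source",
    "library_selection",
    "library_strategy",
]


def check_read_fields(row):
    """Return a list of problems found in one table row (dict of column -> value)."""

    def filled(column):
        # Treat None, empty strings, and whitespace-only values as missing.
        return bool(str(row.get(column) or "").strip())

    problems = [f"missing {c}" for c in MANDATORY_READ_COLUMNS if not filled(c)]

    # Assumption: paired FASTQ needs both files; BAM or CRAM needs one file.
    has_fastq_pair = filled("read1") and filled("read2")
    if not (has_fastq_pair or filled("bam_file") or filled("cram_file")):
        problems.append("missing read data (read1 and read2, bam_file, or cram_file)")
    return problems


# Placeholder row; the bucket paths are invented for illustration.
example_row = {
    "experiment_name": "exp_001",
    "sequencing_platform": "ILLUMINA",
    "sequencing_instrument": "Illumina MiSeq",
    "library_source": "GENOMIC",
    "library_selection": "RANDOM",
    "library_strategy": "WGS",
    "read1": "gs://example-bucket/sample1_R1.fastq.gz",
    "read2": "gs://example-bucket/sample1_R2.fastq.gz",
}
print(check_read_fields(example_row))  # prints []
```

A check like this does not verify that values come from the permitted value lists; it only flags empty cells before the workflow is launched.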

Sample Metadata Fields

Using Customized Column Names in Terra Tables

In some cases, users may have data tables in Terra with column names that differ from the field names expected by ENA. The Terra_2_ENA workflow allows users to supply a custom column mapping file that specifies how their columns map to the required field names.

To use a custom column mapping file:

  1. Create a tab-delimited .tsv file with the following structure:

    The first row must be a header containing terra_column and ena_column. The terra_column column should contain the actual column names in your Terra table (e.g., my_fav_collection_date), and the ena_column column should contain the column names expected by ENA (e.g., collection_date). The mandatory and recommended metadata fields and their associated column names are described in the tables below.

    Example Mapping File:

    terra_column            ena_column
    my_sample_title         title
    my_fav_collection_date  collection_date
    my_geo_loc_name         geo_loc_name
    

  2. Upload the file to your Terra workspace and, when running the workflow, provide its Google Cloud Storage path in the column_mappings input parameter.

Ensure the mapping file includes all columns with custom names. Columns that match the default workflow names do not need to be included. If a renamed column has no mapping, the workflow may fail during execution when the column is required; when the column is optional, it will simply not be found. Provided the mapping file is correct, the workflow automatically maps the specified column names from your Terra table to the required ENA field names.
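
The mapping file can also be generated programmatically. The sketch below uses Python's standard `csv` module to write and read a tab-delimited file with the `terra_column` and `ena_column` header described above. The helper names (`write_mapping`, `read_mapping`, `apply_mapping`) are hypothetical, and `apply_mapping` only illustrates the renaming behaviour described in this section; how the workflow applies the mapping internally is not shown here.

```python
import csv
import io


def write_mapping(mapping, handle):
    """Write a {terra_column: ena_column} dict as a tab-delimited mapping file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["terra_column", "ena_column"])
    for terra_name, ena_name in mapping.items():
        writer.writerow([terra_name, ena_name])


def read_mapping(handle):
    """Read a mapping file back into a {terra_column: ena_column} dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {record["terra_column"]: record["ena_column"] for record in reader}


def apply_mapping(row, mapping):
    """Rename mapped columns; columns without a mapping keep their name."""
    return {mapping.get(column, column): value for column, value in row.items()}


mapping = {
    "my_sample_title": "title",
    "my_fav_collection_date": "collection_date",
    "my_geo_loc_name": "geo_loc_name",
}
buffer = io.StringIO()  # stands in for a .tsv file on disk
write_mapping(mapping, buffer)
buffer.seek(0)
loaded = read_mapping(buffer)

row = {"my_sample_title": "Sample A", "my_fav_collection_date": "2008-01-23", "isolate": "iso1"}
print(apply_mapping(row, loaded))
# prints {'title': 'Sample A', 'collection_date': '2008-01-23', 'isolate': 'iso1'}
```

Note that `isolate` already matches a default column name, so it needs no entry in the mapping file and passes through unchanged.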

Prokaryotic Pathogen

  • Mandatory Fields

    These fields are required for submission and must be included in the Terra data table or supplied as an input parameter.

    If you cannot provide a value for a mandatory field, either set the allow_missing input parameter to true or use one of the INSDC accepted terms for missing value reporting.

    Terra Column Name | ENA Field Name | Description
    title | sample_title | Title of the sample.
    taxon_id and/or organism | tax_id and/or scientific name | Taxonomic identifier (NCBI taxon ID) or scientific name of the organism from which the sample was obtained.
    collection_date | collection date | The date the sample was collected with the intention of sequencing, either as an instance (single point in time) or interval. In case no exact time is available, the date/time can be right truncated i.e. all of these are valid ISO8601 compliant times: 2008-01-23T19:23:10+00:00; 2008-01-23T19:23:10; 2008-01-23; 2008-01; 2008.
    geo_loc_name | geographic location (country and/or sea) | The geographical origin of where the sample was collected from, with the intention of sequencing, as defined by the country or sea name. Country or sea names should be chosen from the INSDC country list.
    host_health_state | host health state | Health status of the host at the time of sample collection. Must be one of the following: diseased, healthy, missing: control sample, missing: data agreement established pre-2023, missing: endangered species, missing: human-identifiable, missing: lab stock, missing: sample group, missing: synthetic construct, missing: third party data, not applicable, not collected, not provided, restricted access.
    host_scientific_name | host scientific name | Scientific name of the natural (as opposed to laboratory) host to the organism from which sample was obtained.
    isolation_source | isolation_source | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived.
    isolate | isolate | Individual isolate from which the sample was obtained.
  • Optional Fields
    Terra Column Name | ENA Field Name | Description
    library_description | sample_description | Description of the sample.
    lat_lon | lat_lon | Geographical coordinates of the location where the specimen was collected.
    serovar | serovar | Serological variety of a species (usually a prokaryote) characterized by its antigenic properties.
    strain | strain | Name of the strain from which the sample was obtained.

Reference: ENA prokaryotic pathogen minimal sample checklist
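
Two of the mandatory fields above have formats that are easy to check before submission. The following Python sketch is a hedged illustration, not the workflow's own validation: the function names are hypothetical, the date pattern only covers the single-instant forms listed in the collection_date description (it does not handle intervals, does not check calendar ranges such as month 13, and does not accept missing-value terms), and the permitted host health states are copied from the table.

```python
import re

# Right-truncated ISO 8601 instants: year, then optional month, day, time, offset.
DATE_PATTERN = re.compile(
    r"\d{4}"                      # year
    r"(-\d{2}"                    # optional month
    r"(-\d{2}"                    # optional day
    r"(T\d{2}:\d{2}(:\d{2})?"     # optional time, seconds optional
    r"(Z|[+-]\d{2}:\d{2})?"       # optional UTC designator or offset
    r")?)?)?"
)

# Permitted values copied from the host_health_state row of the table.
HOST_HEALTH_STATES = {
    "diseased",
    "healthy",
    "missing: control sample",
    "missing: data agreement established pre-2023",
    "missing: endangered species",
    "missing: human-identifiable",
    "missing: lab stock",
    "missing: sample group",
    "missing: synthetic construct",
    "missing: third party data",
    "not applicable",
    "not collected",
    "not provided",
    "restricted access",
}


def is_valid_collection_date(value):
    """True if the value looks like a (possibly truncated) ISO 8601 instant."""
    return DATE_PATTERN.fullmatch(value.strip()) is not None


def is_valid_host_health_state(value):
    """True if the value is one of the permitted host health states."""
    return value.strip() in HOST_HEALTH_STATES


print(is_valid_collection_date("2008-01"))      # prints True
print(is_valid_collection_date("23/01/2008"))   # prints False
print(is_valid_host_health_state("healthy"))    # prints True
```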

Viral Pathogen

  • Mandatory Fields

    These fields are required for submission and must be included in the Terra data table or supplied as an input parameter.

    If you cannot provide a value for a mandatory field, either set the allow_missing input parameter to true or use one of the INSDC accepted terms for missing value reporting.

    Terra Column Name | ENA Field Name | Description
    title | sample_title | Title of the sample.
    taxon_id and/or organism | tax_id and/or scientific name | Taxonomic identifier (NCBI taxon ID) or scientific name of the organism from which the sample was obtained.
    collection_date | collection date | The date the sample was collected with the intention of sequencing, either as an instance (single point in time) or interval. In case no exact time is available, the date/time can be right truncated i.e. all of these are valid ISO8601 compliant times: 2008-01-23T19:23:10+00:00; 2008-01-23T19:23:10; 2008-01-23; 2008-01; 2008.
    collecting_institution | collecting institution | Name of the institution to which the person collecting the specimen belongs. Format: Institute Name, Institute Address
    collector_name | collector name | Name of the person who collected the specimen. Example: John Smith
    geo_loc_name | geographic location (country and/or sea) | The geographical origin of where the sample was collected from, with the intention of sequencing, as defined by the country or sea name. Country or sea names should be chosen from the INSDC country list.
    host_common_name | host common name | Common name of the host, e.g. human
    host_health_state | host health state | Health status of the host at the time of sample collection. Must be one of the following: diseased, healthy, missing: control sample, missing: data agreement established pre-2023, missing: endangered species, missing: human-identifiable, missing: lab stock, missing: sample group, missing: synthetic construct, missing: third party data, not applicable, not collected, not provided, restricted access.
    host_scientific_name | host scientific name | Scientific name of the natural (as opposed to laboratory) host to the organism from which sample was obtained.
    host_sex | host sex | Gender or sex of the host. Must be one of the following: female, male, hermaphrodite, neuter, not applicable, not collected, not provided, other, missing: control sample, missing: data agreement established pre-2023, missing: endangered species, missing: human-identifiable, missing: lab stock, missing: sample group, missing: synthetic construct, missing: third party data.
    host_subject_id | host subject id | A unique identifier by which each subject can be referred to, de-identified, e.g. #131
    isolation_source | isolation_source | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived.
    isolate | isolate | Individual isolate from which the sample was obtained.
  • Optional Fields
    Terra Column Name | ENA Field Name | Description
    latitude | geographic location (latitude) | The geographical origin of the sample as defined by latitude. The values should be reported in decimal degrees and in the WGS84 system.
    longitude | geographic location (longitude) | The geographical origin of the sample as defined by longitude. The values should be reported in decimal degrees and in the WGS84 system.
    region_locality | geographic location (region and locality) | The geographical origin of the sample as defined by the specific region name followed by the locality name.
    host_disease_outcome | host disease outcome | Disease outcome in the host.
    host_age | host age | Age of host at the time of sampling; relevant scale depends on species and study, e.g. could be seconds for amoebae or centuries for trees.
    host_behaviour | host behaviour | Natural behaviour of the host.
    host_habitat | host habitat | Natural habitat of the avian or mammalian host.
    isolation_source_host | isolation source host-associated | Name of host tissue or organ sampled for analysis. Example: tracheal tissue
    isolation_source_non_host | isolation source non-host-associated | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived. Example: soil
    receipt_date | receipt date | Date on which the sample was received. Format: YYYY-MM-DD. Please provide the highest precision possible. If the sample was received by the institution and not collected, the 'receipt date' must be provided instead.
    sample_capture_status | sample capture status | Reason for the sample collection.
    library_description | sample_description | Description of the sample.
    serotype | serotype | Serological variety of a species characterised by its antigenic properties. For Influenza, the HA subtype should be the letter H followed by a number between 1 and 16, unless a novel subtype is identified, and the NA subtype should be the letter N followed by a number between 1 and 9, unless a novel subtype is identified.
    virus_identifier | virus identifier | Unique laboratory identifier assigned to the virus by the investigator. Strain name is not sufficient since it might not be unique due to various passages of the same virus. Format: up to 50 alphanumeric characters

Reference: ENA viral minimal sample checklist
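
Some of the optional viral fields carry explicit format constraints that can be checked locally. The sketch below is illustrative only, with hypothetical function names. It assumes that latitude and longitude in decimal degrees fall within -90 to 90 and -180 to 180 respectively, and it reads "up to 50 alphanumeric characters" strictly as ASCII letters and digits with no separators, which may be narrower than what ENA accepts.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only, 1 to 50 of them.
VIRUS_IDENTIFIER = re.compile(r"[A-Za-z0-9]{1,50}")


def _is_number_in_range(value, low, high):
    """True if value parses as a decimal number within [low, high]."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    # NaN fails both comparisons, so it is rejected as well.
    return low <= number <= high


def is_valid_latitude(value):
    return _is_number_in_range(value, -90.0, 90.0)


def is_valid_longitude(value):
    return _is_number_in_range(value, -180.0, 180.0)


def is_valid_virus_identifier(value):
    return VIRUS_IDENTIFIER.fullmatch(value) is not None


print(is_valid_latitude("51.5074"))            # prints True
print(is_valid_longitude("-181"))              # prints False
print(is_valid_virus_identifier("LAB0042"))    # prints True
```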

Workflow Inputs

The Terra_2_ENA workflow is designed to run on set-level data tables: it processes all samples within a set together, rather than handling each sample individually. The samples input variable expects an array of sample IDs, corresponding to a set table. In most cases, set tables are generated automatically when running a workflow. However, if you need to create one manually, refer to this guide on how to create a set table.
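
If a set table has to be created by hand, a membership file can be generated with a few lines of Python. This sketch rests on assumptions that should be checked against the Terra guide: that the root entity type is `sample`, and that Terra accepts a tab-delimited load file whose header is `membership:sample_set_id` followed by `sample`. The function name `write_set_membership` and the set and sample IDs are hypothetical.

```python
import csv
import io


def write_set_membership(set_id, sample_ids, handle, entity_type="sample"):
    """Write one row per sample, assigning each to the given set."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumed Terra load-file header for set membership.
    writer.writerow([f"membership:{entity_type}_set_id", entity_type])
    for sample_id in sample_ids:
        writer.writerow([set_id, sample_id])


buffer = io.StringIO()  # stands in for a .tsv file to upload to Terra
write_set_membership("ena_batch_1", ["sample_A", "sample_B"], buffer)
print(buffer.getvalue(), end="")
```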

The submit_to_production input parameter is set to false by default. This means that the workflow will not submit data to the production ENA server unless you explicitly set it to true. This is useful for testing purposes, allowing you to validate your data without making actual submissions.

Terra Task Name | Variable | Type | Description | Default Value | Terra Status
Terra_2_ENA | ena_password | String | ENA password to authenticate submission | | Required
Terra_2_ENA | ena_username | String | ENA username to authenticate submission | | Required
Terra_2_ENA | sample_id_column | String | The column name in the Terra data table containing sample IDs | | Required
Terra_2_ENA | sample_type | String | Type of sample being submitted ("prokaryotic_pathogen" or "virus_pathogen") | | Required
Terra_2_ENA | samples | Array[String] | Array of sample IDs to submit | | Required
Terra_2_ENA | study_accession | String | ENA study accession number to associate submissions with | | Required
Terra_2_ENA | terra_project_name | String | The Terra project containing the data table | | Required
Terra_2_ENA | terra_table_name | String | The name of the Terra data table containing sample data | | Required
Terra_2_ENA | terra_workspace_name | String | The Terra workspace containing the data table | | Required
Terra_2_ENA | allow_missing | Boolean | Whether to allow missing values in metadata | FALSE | Optional
Terra_2_ENA | column_mappings | File | TSV file mapping Terra table columns to ENA submission fields | | Optional
download_terra_table | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional
download_terra_table | disk_size | Int | Amount of storage (in GB) to allocate to the task | 10 | Optional
download_terra_table | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21 | Optional
download_terra_table | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional
register_ena_samples | batch_size | Int | Number of samples to process in each batch | 100 | Optional
register_ena_samples | center | String | Name of submitting center | | Optional
register_ena_samples | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional
register_ena_samples | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional
register_ena_samples | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra_to_ena:0.6 | Optional
register_ena_samples | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional
submit_ena_data | bam_column | String | Column name containing BAM file paths | bam_file | Optional
submit_ena_data | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional
submit_ena_data | cram_column | String | Column name containing CRAM file paths | cram_file | Optional
submit_ena_data | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional
submit_ena_data | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra_to_ena:0.6 | Optional
submit_ena_data | experiment_name | String | Internal component, do not modify | | Optional
submit_ena_data | instrument | String | Internal component, do not modify | | Optional
submit_ena_data | library_selection | String | Internal component, do not modify | RANDOM | Optional
submit_ena_data | library_source | String | Internal component, do not modify | GENOMIC | Optional
submit_ena_data | library_strategy | String | Internal component, do not modify | WGS | Optional
submit_ena_data | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional
submit_ena_data | platform | String | Internal component, do not modify | ILLUMINA | Optional
submit_ena_data | read1_column | String | Column name containing read1 file paths | read1 | Optional
submit_ena_data | read2_column | String | Column name containing read2 file paths | read2 | Optional
submit_ena_data | submit_to_production | Boolean | If false, performs a test submission of metadata | FALSE | Optional
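
To make the required inputs and the test-submission default concrete, here is a hedged Python sketch that assembles an inputs dictionary and leaves submit_to_production set to false unless the caller opts in. The helper `build_inputs` is hypothetical, and the key prefixes (`Terra_2_ENA.` and `Terra_2_ENA.submit_ena_data.`) are an assumption based on the task names in the table; the keys in an actual inputs JSON may differ. All values shown are placeholders.

```python
import json

# Required inputs, as listed in the table above.
REQUIRED_INPUTS = [
    "ena_password",
    "ena_username",
    "sample_id_column",
    "sample_type",
    "samples",
    "study_accession",
    "terra_project_name",
    "terra_table_name",
    "terra_workspace_name",
]
SAMPLE_TYPES = {"prokaryotic_pathogen", "virus_pathogen"}


def build_inputs(values, submit_to_production=False):
    """Check the required inputs and return an inputs dict (test submission by default)."""
    missing = [name for name in REQUIRED_INPUTS if name not in values]
    if missing:
        raise ValueError(f"missing required inputs: {', '.join(missing)}")
    if values["sample_type"] not in SAMPLE_TYPES:
        raise ValueError(f"unsupported sample_type: {values['sample_type']}")

    # Assumed key format: workflow name, then task name for task-level inputs.
    inputs = {f"Terra_2_ENA.{name}": values[name] for name in REQUIRED_INPUTS}
    inputs["Terra_2_ENA.submit_ena_data.submit_to_production"] = submit_to_production
    return inputs


# Placeholder values for illustration only.
example = build_inputs({
    "ena_password": "<password>",
    "ena_username": "<Webin username>",
    "sample_id_column": "sample_id",
    "sample_type": "prokaryotic_pathogen",
    "samples": ["sample_A", "sample_B"],
    "study_accession": "<study accession>",
    "terra_project_name": "<project>",
    "terra_table_name": "sample",
    "terra_workspace_name": "<workspace>",
})
print(json.dumps(example, indent=2))
```

Keeping the production flag as an explicit keyword argument means a validation run is what happens when nothing is specified, mirroring the workflow default.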

Workflow Outputs

Variable | Type | Description
Terra_2_ENA_analysis_date | String | Date the Terra to ENA workflow was run
Terra_2_ENA_version | String | Version of the Terra to ENA workflow used
ena_accessions | File | Text file containing the accession numbers generated by ENA submission
ena_docker_image | String | Docker image used for ENA submission processing
ena_excluded_samples | File | Text file listing samples that were excluded from ENA submission
ena_file_paths_json | File | JSON file containing paths to the files submitted to ENA
ena_metadata_accessions | File | File containing metadata and their corresponding accessions from ENA
ena_registration_log | File | Log file detailing the ENA registration process
ena_registration_success | String | String indicating whether the ENA registration was successful
ena_registration_summary | File | Summary file of the ENA registration results
ena_submission_manifest_files | Array[File] | Array of manifest files used for ENA submission. Each file corresponds to a sample and contains metadata and file paths
ena_submission_report_files | Array[File] | Array of report files containing the results of the ENA submission
ena_webincli_results | File | File containing the cumulative results of the ENA submission
prepped_ena_data | File | Prepared data formatted for ENA submission
terra_table | File | Terra table file used for submission to ENA

References