Terra_2_NCBI¶

Quick Facts¶

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Public Data Sharing	Bacteria, Mycotics, Viral	v3.0.0	No	Set-level

Terra_2_NCBI_PHB¶

Do not resubmit!

If the Terra_2_NCBI workflow fails, DO NOT resubmit.

Resubmission risks duplicate submissions and future failures.

Contact Theiagen (support@theiagen.com) to determine the reason for failure, and only move forward with Theiagen's guidance.

Key Resources

The Terra_2_NCBI workflow is a programmatic data submission method to share metadata information with NCBI BioSample and paired-end Illumina reads with NCBI SRA directly from Terra without having to use the NCBI portal.

Prerequisites¶

Before running the Terra_2_NCBI workflow

The user must have access to the NCBI FTP. To gain these credentials, we recommend emailing **sra@ncbi.nlm.nih.gov** a variation of the following example, including all the information:
Hello,

We would like to automate submissions to the Submission Portal using XML metadata to accompany our cloud-hosted data files. We would like to upload via FTP and need to create a submission group.

Here is the relevant information:
1. Suggested group abbreviation:
2. Full group name:
3. Institution and department:
4. Contact person (someone likely to remain at the location for an extended time):
5. Contact email:
6. Mailing address (including country and postcode):
We will be using an existing submission pipeline that is known to work and would like to request that the production folder be activated. Thank you for your assistance!
In response, you will need to receive the following from NCBI:
1. an FTP address (it will likely be ftp-private.ncbi.nih.gov)
2. Username (typically the suggested group abbreviation)
3. Password
4. an acknowledgment that the production folder has been activated.
Please confirm that the production folder has been activated, or else the submission pipeline will either fail or only run test submissions and not actually submit to NCBI.
Before you can run the workflow for the first time, we also recommend scheduling a meeting with Theiagen to get additional things set up, including
- adding a correctly-formatted configuration file to your workspace data elements that includes your FTP username and password, laboratory details, and other important information.
- ensuring your proxy account has been given permission to write to the Google bucket where SRA reads are temporarily stored before being transferred to NCBI.
What is the configuration file used for?

The configuration file tells the workflow your username and password so you can access the FTP. It also provides important information about who should be contacted regarding the submission. We recommend contacting a member of Theiagen for help in the creation of this configuration file to ensure that everything is formatted correctly.

Collating BioSample Metadata¶

In order to create BioSamples, you need to choose the correct BioSample package and have the appropriate metadata included in your data table.

Currently, Terra_2_NCBI only supports Pathogen, Virus, Microbe, and SARS-CoV-2 Wastewater Surveillance BioSample packages. Most organisms should be submitted using the Pathogen package unless you have been specifically directed otherwise (either through CDC communications or another reliable source). Definitions of packages supported by Terra_2_NCBI are listed below with more requirements provided via the links:

Pathogen.cl - any clinical or host-associated pathogen
Pathogen.env - environmental, food or other pathogen (no metadata formatter available at this time)
Microbe - bacteria or other unicellular microbes that do not fit under the MIxS, Pathogen, or Virus packages.
Virus - viruses not directly associated with disease
- Viral pathogens should be submitted using the Pathogen: Clinical or host-associated pathogen package.
SARS-CoV-2.wwsurv - SARS-CoV-2 wastewater surveillance samples

Metadata Formatters¶

For each package, we have created a metadata template spreadsheet to help you organize your metadata:

Please note that the pathogen metadata formatter is for the clinical pathogen package, not the environmental pathogen.

We are constantly working on improving these spreadsheets and they will be updated in due course.

Running the Workflow¶

We recommend running a test submission before your first production submission to ensure that all data has been formatted correctly. Please contact Theiagen (support@theiagen.com) to get this set up.

In the test submission, any real BioProject accession numbers you provide will not be recognized. You will have to make a "fake" or "test" BioProject. This cannot be done through the NCBI portal. Theiagen can provide assistance in creating this, as it requires manual command-line work on the NCBI FTP using the account NCBI provided to you.

What's the difference between a test submission and a production submission?

A production submission means that your submission using Terra_2_NCBI will be submitted to NCBI as if you were using the online portal. That means that anything you submit on production will be given to the *real* NCBI servers and appear and become searchable on the NCBI website.

A test submission gives your data to a completely detached replica of the production server. This means that any data you submit as a test will behave exactly like a real submission, but since it's detached, nothing will appear on the NCBI website, and anything returned from the workflow (such as BioSample accession numbers) will be fake. If you search for these test BioSample accession numbers on the NCBI website, either (a) nothing will appear, or (b) it will link to a random sample.

If you want your data to be on NCBI, you must run a production submission. Initially, NCBI locks the production folder so that the user doesn't accidentally submit test data to the main database. You must have requested activation of the production folder prior to your first production submission.

Inputs¶

This workflow runs on set-level data tables.

Production Submissions

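Please note that submit_to_production is listed as an optional Boolean variable, but it must be set to true for a production submission. If it is left at its default (false), the workflow performs a test submission.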

Using Customized Column Names in Terra Tables

In some cases, users may have data tables in Terra with column names that differ from the default expected by the workflow. The Terra_2_NCBI workflow allows users to supply a custom column mapping file, enabling them to specify how their columns map to the required workflow variables.

To use a custom column mapping file:

Create a tab-delimited .tsv file with the following structure:

A header including "Custom" and "Required" should be included in the first row. The "Custom" column should contain the actual column names in your Terra table (e.g., 'collection-date'), and the "Required" column should contain the column names expected by the workflow (e.g., 'collection_date').

Example Mapping File:
```
Custom  Required
Collection-Date collection_date
geo_location    geo_loc_name
bioproject_column   bioproject
sample_id_column    sample_names
```
Upload the file to your Terra workspace and reference it in the column_mapping_file parameter when running the workflow using Google Cloud Storage paths.

Ensure the mapping file includes all columns with custom names. Columns that match the default workflow names do not need to be included. If a renamed column is missing from the mapping file, the workflow may fail during execution when that column is required; when the column is optional, it will simply not be found.

The workflow will automatically map the specified column names from your Terra table to the required workflow variables using the supplied mapping file.
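To make the mapping behaviour concrete, here is a minimal Python sketch of how a two-column mapping file could be applied to table rows. It is an illustration only, not the workflow's actual code: the function names `load_column_mapping` and `rename_columns` are hypothetical, and the sample values are made up.

```python
import csv
import io


def load_column_mapping(tsv_text):
    """Parse a tab-delimited mapping with 'Custom' and 'Required' headers."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["Custom"]: row["Required"] for row in reader}


def rename_columns(rows, mapping):
    """Rename mapped columns; columns not in the mapping keep their names."""
    return [
        {mapping.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Hypothetical mapping file content (tab-delimited).
mapping_tsv = (
    "Custom\tRequired\n"
    "Collection-Date\tcollection_date\n"
    "geo_location\tgeo_loc_name\n"
)
mapping = load_column_mapping(mapping_tsv)

# Hypothetical Terra table row; 'organism' already uses the default name.
table_rows = [
    {
        "Collection-Date": "2024-01-15",
        "geo_location": "USA: Colorado",
        "organism": "Salmonella enterica",
    }
]
renamed = rename_columns(table_rows, mapping)
print(renamed[0])
```

Note that `organism` passes through unchanged, which mirrors the rule that columns already matching the default names do not need an entry in the mapping file.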

To find a list of the expected required and optional column names, please refer to the code blocks that can be found here. The required and optional metadata fields are organized by the BioSample type.

Below, you can find the required metadata fields for the currently supported BioSample types:

Microbe Required Metadata

submission_id
organism
collection_date
geo_loc_name
sample_type

Wastewater Required Metadata

submission_id
organism
collection_date
geo_loc_name
isolation_source
ww_population
ww_sample_duration
ww_sample_matrix
ww_sample_type
ww_surv_target_1
ww_surv_target_1_known_present

Pathogen.cl Required Metadata

submission_id
organism
collected_by
collection_date
geo_loc_name
host
host_disease
isolation_source
lat_lon

Pathogen.env Required Metadata

submission_id
organism
collected_by
collection_date
geo_loc_name
isolation_source
lat_lon

Virus Required Metadata

submission_id
organism
isolate
collection_date
geo_loc_name
isolation_source
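
Before submitting, it can help to check a table for empty required fields. The sketch below encodes the lists above in a Python dictionary and reports which required fields a sample is missing. The dictionary keys are the section labels used on this page and are not necessarily the exact strings accepted by biosample_package; the helper name `missing_required` and the sample values are hypothetical.

```python
# Required fields per BioSample type, copied from the lists above.
REQUIRED_FIELDS = {
    "Microbe": [
        "submission_id", "organism", "collection_date", "geo_loc_name",
        "sample_type",
    ],
    "Wastewater": [
        "submission_id", "organism", "collection_date", "geo_loc_name",
        "isolation_source", "ww_population", "ww_sample_duration",
        "ww_sample_matrix", "ww_sample_type", "ww_surv_target_1",
        "ww_surv_target_1_known_present",
    ],
    "Pathogen.cl": [
        "submission_id", "organism", "collected_by", "collection_date",
        "geo_loc_name", "host", "host_disease", "isolation_source", "lat_lon",
    ],
    "Pathogen.env": [
        "submission_id", "organism", "collected_by", "collection_date",
        "geo_loc_name", "isolation_source", "lat_lon",
    ],
    "Virus": [
        "submission_id", "organism", "isolate", "collection_date",
        "geo_loc_name", "isolation_source",
    ],
}


def missing_required(sample, package):
    """Return required fields that are absent or blank for one sample."""
    missing = []
    for field in REQUIRED_FIELDS[package]:
        value = sample.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical sample row with two required fields left out.
sample = {
    "submission_id": "sample1",
    "organism": "Salmonella enterica",
    "collected_by": "Example Public Health Lab",
    "collection_date": "2024-01-15",
    "geo_loc_name": "USA: Colorado",
    "isolation_source": "stool",
    "lat_lon": "",
}
print(missing_required(sample, "Pathogen.cl"))
print(missing_required(sample, "Pathogen.env"))
```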

For further assistance in setting up a custom column mapping file, please contact Theiagen at support@theiagen.com.

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
Terra_2_NCBI	bioproject	String	BioProject accession that the samples will be submitted to		Required
Terra_2_NCBI	biosample_package	String	The BioSample package that the samples will be submitted under		Required
Terra_2_NCBI	ncbi_config_js	File	Configuration file that contains your username and password for the NCBI FTP		Required
Terra_2_NCBI	project_name	String	The name of your Terra project. You can find this information in the url of the webpage you are on. It is the section right after "#workspaces/"		Required
Terra_2_NCBI	sample_names	Array[String]	The list of samples you want to submit		Required
Terra_2_NCBI	sra_transfer_gcp_bucket	String	Google bucket where your SRA reads will be temporarily stored before transferring to SRA		Required
Terra_2_NCBI	table_name	String	The name of the Terra table where your samples are found		Required
Terra_2_NCBI	workspace_name	String	The name of the workspace where your samples are found		Required
Terra_2_NCBI	submit_to_production	Boolean	Used to indicate whether or not the workflow should submit to NCBI's production environment. If set to true, then a Production submission will occur. Otherwise, by default (false), it will perform a Test submission.	FALSE	Optional, Required
Terra_2_NCBI	input_table	File	Internal component, do not modify		Optional
Terra_2_NCBI	skip_biosample	Boolean	Boolean switch to skip BioSample submission (for example, when BioSample accessions are already provided) and proceed to SRA submission	FALSE	Optional
add_biosample_accessions	cpu	Int	Number of CPUs to allocate to the task	2	Optional
add_biosample_accessions	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
add_biosample_accessions	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/broadinstitute/ncbi-tools:2.10.7.10	Optional
add_biosample_accessions	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
biosample_submit_tsv_ftp_upload	cpu	Int	Number of CPUs to allocate to the task	2	Optional
biosample_submit_tsv_ftp_upload	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
biosample_submit_tsv_ftp_upload	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/broadinstitute/ncbi-tools:2.10.7.10	Optional
biosample_submit_tsv_ftp_upload	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
ncbi_sftp_upload	additional_files	Array[File]	Internal component, do not modify		Optional
ncbi_sftp_upload	cpu	Int	Number of CPUs to allocate to the task	2	Optional
ncbi_sftp_upload	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
ncbi_sftp_upload	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/broadinstitute/ncbi-tools:2.10.7.10	Optional
ncbi_sftp_upload	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
ncbi_sftp_upload	wait_for	String	Internal component, do not modify		Optional
prune_table	cpu	Int	Number of CPUs to allocate to the task	2	Optional
prune_table	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
prune_table	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/broadinstitute/ncbi-tools:2.10.7.10	Optional
prune_table	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
prune_table	read1_column_name	String	The column header of the read1 column		Optional
prune_table	read2_column_name	String	The column header of the read2 column		Optional
sra_tsv_to_xml	cpu	Int	Number of CPUs to allocate to the task	2	Optional
sra_tsv_to_xml	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
sra_tsv_to_xml	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/broadinstitute/ncbi-tools:2.10.7.10	Optional
sra_tsv_to_xml	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
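
As a rough pre-flight check, the required inputs in the table above can be verified before launching. The Python sketch below is only an illustration: the helper names `missing_inputs` and `submission_mode` are hypothetical, and every value in `example_inputs` (accession, bucket, file path, package string) is a placeholder rather than a value known to be accepted by the workflow.

```python
# Inputs marked "Required" in the table above.
REQUIRED_INPUTS = [
    "bioproject",
    "biosample_package",
    "ncbi_config_js",
    "project_name",
    "sample_names",
    "sra_transfer_gcp_bucket",
    "table_name",
    "workspace_name",
]


def missing_inputs(inputs):
    """Return required input names that are absent or empty."""
    return [name for name in REQUIRED_INPUTS if not inputs.get(name)]


def submission_mode(inputs):
    """submit_to_production defaults to false, which means a test submission."""
    if inputs.get("submit_to_production", False) is True:
        return "production"
    return "test"


# Placeholder values for illustration only.
example_inputs = {
    "bioproject": "PRJNA000000",
    "biosample_package": "Pathogen.cl",
    "ncbi_config_js": "gs://example-bucket/ncbi_config.json",
    "project_name": "example-terra-project",
    "sample_names": ["sample1", "sample2"],
    "sra_transfer_gcp_bucket": "gs://example-sra-transfer-bucket",
    "table_name": "sample",
    "workspace_name": "example-workspace",
}

print(missing_inputs(example_inputs))
print(submission_mode(example_inputs))
```

Because submit_to_production is not set in `example_inputs`, the sketch reports a test submission, matching the default described in the table.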

Workflow Tasks¶

The workflow will perform the following tasks, each highlighted as code

prune_table formats all incoming metadata for submission.
If you are submitting BioSamples:
1. biosample_submit_tsv_ftp_upload will
  1. format the BioSample table into XML format
  2. submit BioSamples to NCBI
  3. return all NCBI communications in XML format, and
  4. parse those communications for any and all BioSample accessions.
2. add_biosample_accessions will
  1. add the BioSample accessions to SRA metadata
  2. upload the BioSample accessions to the origin Terra table
  
  If BioSample accessions fail to be generated, this task ends the workflow and users should contact Theiagen for further support. Otherwise, the workflow will continue and outputs are returned to the Terra data table.
If BioSample accessions were generated or if BioSample submission was skipped
1. sra_tsv_to_xml converts the SRA metadata (including any generated or pre-provided BioSample accessions) into XML format.
2. ncbi_sftp_upload
  1. uploads the SRA metadata to NCBI
  2. returns any XML communications from NCBI.
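
The branching described above can be summarized in a short Python sketch that returns the task names in the order they would run. This is a simplified model, not the workflow definition itself: it assumes the skip_biosample input is what selects the "BioSample submission was skipped" path, it omits the version_capture task, and the function name `planned_tasks` is hypothetical.

```python
def planned_tasks(skip_biosample, biosample_accessions_generated=True):
    """Return the task names expected to run, in order, for one submission."""
    tasks = ["prune_table"]

    if not skip_biosample:
        tasks.append("biosample_submit_tsv_ftp_upload")
        tasks.append("add_biosample_accessions")
        if not biosample_accessions_generated:
            # No accessions came back, so the workflow ends here.
            return tasks

    # Reached when accessions were generated or BioSample submission was skipped.
    tasks.append("sra_tsv_to_xml")
    tasks.append("ncbi_sftp_upload")
    return tasks


print(planned_tasks(skip_biosample=False))
print(planned_tasks(skip_biosample=True))
print(planned_tasks(skip_biosample=False, biosample_accessions_generated=False))
```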

Workflow Success¶

If the workflow ends successfully, it returns the outputs to the Terra data table, and the XML communications from NCBI will say that submission is underway. The workflow does not itself confirm that the samples were successfully submitted, because SRA processing can take some time after the workflow finishes. If the submission was successful, the point of contact for the submission will receive the SRA accessions via email from NCBI.

If the workflow ends unsuccessfully, no outputs will be shown on Terra and the biosample_status output variable will indicate that the BioSample submission failed.

Outputs¶

The output files contain information mostly for debugging purposes. Additionally, if your submission is successful, the point of contact for the submission should also receive an email from NCBI notifying them of their submission success.

Variable	Type	Description
biosample_failures	File	Text file listing samples that failed BioSample submission
biosample_metadata	File	Metadata used for BioSample submission in proper BioSample formatting
biosample_report_xmls	Array[File]	One or more XML files that contain the response from NCBI regarding your BioSample submission. These can be pretty cryptic, but often contain information to determine if anything went wrong
biosample_status	String	String showing whether BioSample submission was successful
biosample_submission_xml	File	XML file used to submit your BioSamples to NCBI
excluded_samples	File	Text file listing samples that were excluded from BioSample submission for missing required metadata
generated_accessions	File	Text file mapping the BioSample accession with its sample name.
sra_metadata	File	Metadata used for SRA submission in proper SRA formatting
sra_report_xmls	Array[File]	One or more XML files containing the response from NCBI regarding your SRA submission. These can be pretty cryptic, but often contain information to determine if anything went wrong
sra_submission_xml	File	XML file that was used to submit your SRA reads to NCBI
terra_2_ncbi_analysis_date	String	Date that the workflow was run
terra_2_ncbi_version	String	Version of the PHB repository where the workflow is hosted

An example excluded_samples.tsv file¶

Due to the nature of tsv files, it may be easier to download and open this file in Excel.

example_excluded_samples.tsv

Samples excluded for quality thresholds:
sample_name message 
sample2 VADR skipped due to poor assembly
sample3 VADR number alerts too high: 3 greater than limit of 0
sample4 Number of Ns was too high: 10000 greater than limit of 5000

Samples excluded for missing required metadata (will have empty values in indicated columns):
tablename_id    organism    country library_layout
sample5         paired
sample6 SARS-CoV-2  USA

This example informs the user that samples 2-4 were excluded for quality reasons (the exact reason is listed in the message column), and that samples 5 and 6 were excluded because they were missing required metadata fields (sample5 was missing the organism and country fields, and sample6 was missing the library_layout field).
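
If opening the file in Excel is not convenient, it can also be read with a few lines of Python. The sketch below assumes the layout shown in the example: each section starts with a title line, followed by a tab-delimited header row and data rows, with a blank line between sections. The function name `parse_excluded_samples` is hypothetical, and the embedded text is a shortened copy of the example above.

```python
def parse_excluded_samples(text):
    """Split the file into sections: {title: [row dicts keyed by header]}."""
    sections = {}
    title = None
    header = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current section.
            title, header = None, None
            continue
        if title is None:
            title = line.strip().rstrip(":")
            sections[title] = []
        elif header is None:
            header = [cell.strip() for cell in line.split("\t")]
        else:
            values = [cell.strip() for cell in line.split("\t")]
            # Pad short rows so trailing empty columns are kept.
            values += [""] * (len(header) - len(values))
            sections[title].append(dict(zip(header, values)))
    return sections


example = (
    "Samples excluded for quality thresholds:\n"
    "sample_name\tmessage\n"
    "sample2\tVADR skipped due to poor assembly\n"
    "\n"
    "Samples excluded for missing required metadata "
    "(will have empty values in indicated columns):\n"
    "tablename_id\torganism\tcountry\tlibrary_layout\n"
    "sample5\t\t\tpaired\n"
    "sample6\tSARS-CoV-2\tUSA\t\n"
)

parsed = parse_excluded_samples(example)

# List the empty columns for each sample in the missing-metadata section.
missing_by_sample = {}
for section_title, rows in parsed.items():
    if "missing required metadata" in section_title:
        for row in rows:
            missing_by_sample[row["tablename_id"]] = [
                column for column, value in row.items() if not value
            ]
print(missing_by_sample)
```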

Limitations¶

The maximum number of samples that can be submitted at once appears to be 300. We recommend submitting fewer than 300 samples at a time to avoid errors due to large submission sizes.
A workflow for returning SRA accessions using the generated BioSample accessions is in progress.
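
One way to stay under the batch limit is to split a long sample list into smaller groups and submit each group as its own set. The Python sketch below shows the idea. The function name `batch_samples` is hypothetical, and the default of 299 is simply our reading of "fewer than 300", not a limit confirmed by NCBI.

```python
def batch_samples(sample_names, max_batch_size=299):
    """Split a list of sample names into consecutive batches."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        sample_names[start:start + max_batch_size]
        for start in range(0, len(sample_names), max_batch_size)
    ]


# Hypothetical list of 650 sample names.
samples = [f"sample{i}" for i in range(1, 651)]
batches = batch_samples(samples)
print([len(batch) for batch in batches])
```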

Acknowledgments¶

This workflow would not have been possible without the invaluable contributions of Dr. Danny Park.