Mercury_Prep_N_Batch¶

Quick Facts¶

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Public Data Sharing	Viral	PHB v2.2.0	Yes	Set-level

Mercury_Prep_N_Batch_PHB¶

Mercury prepares and formats metadata and sequencing files located in Google Cloud Platform (GCP) buckets for submission to national & international databases, currently NCBI & GISAID. Mercury was initially developed to ingest read, assembly, and metadata files associated with SARS-CoV-2 amplicon reads from clinical samples and format that data for submission per the Public Health Alliance for Genomic Epidemiology (PH4GE)'s SARS-CoV-2 Contextual Data Specifications.

Currently, Mercury supports submission preparation for SARS-CoV-2, mpox, and influenza. These organisms have different metadata requirements, and are submitted to different repositories; the following table lists the repositories for each organism & what is supported in Mercury:

	BankIt (NCBI)	BioSample (NCBI)	GenBank (NCBI)	GISAID	SRA (NCBI)
`"flu"`		✓			✓
`"mpox"`	✓	✓		✓	✓
`"sars-cov-2"`		✓	✓	✓	✓

Mercury expects data tables made with TheiaCoV

Mercury was designed to work with metadata tables that were partially created after running the TheiaCoV workflows. If you are using a different pipeline, please ensure that the metadata table is formatted correctly. See this file for the hard-coded list of all of the different metadata fields expected for each organism.

Metadata Formatters¶

To help users collect all required metadata, we have created the following Excel spreadsheets that can help you collect the necessary metadata and allow for easy upload of this metadata into your Terra data tables:

For flu

Flu Metadata Formatter

Flu uses the same metadata formatter as the Terra_2_NCBI Pathogen BioSample package.

If neither strain nor isolate are found in the Terra data table, Mercury will automatically generate an isolate, using the following format ABRicate flu type / State / sample name / year (ABRicate flu subtype). Example: A/California/Sample-01/2024 (H1N1)

The ABRicate flu type and subtype (abricate_flu_type and abricate_flu_subtype columns) are extracted from your table, and are required to generate the isolate field if it is not provided.

For mpox

Mpox Metadata Formatter

For sars-cov-2

SARS-CoV-2 Metadata Formatter

Usage on Terra

Usage on Terra¶

A note on the using_clearlabs_data & using_reads_dehosted optional input parameters

The using_clearlabs_data and using_reads_dehosted arguments change the default values for the read1_column_name, assembly_fasta_column_name, and assembly_mean_coverage_column_name metadata columns. The default values are shown in the table below in addition to what they are changed to depending on what arguments are used.

Variable	Default Value	with `using_clearlabs_data`	with `using_reads_dehosted`	with both `using_clearlabs_data` *and* `using_reads_dehosted`
`read1_column_name`	`"read1_dehosted"`	`"clearlabs_fastq_gz"`	`"reads_dehosted"`	`"reads_dehosted"`
`assembly_fasta_column_name`	`"assembly_fasta"`	`"clearlabs_fasta"`	`"assembly_fasta"`	`"clearlabs_fasta"`
`assembly_mean_coverage_column_name`	`"assembly_mean_coverage"`	`"clearlabs_assembly_coverage"`	`"assembly_mean_coverage"`	`"clearlabs_assembly_coverage"`

Inputs¶

This workflow runs on the set-level.

| Terra Task Name | Variable | Type | Description | Default Value | Terra Status | | |---|---|---|---|---|---| | mercury_prep_n_batch | gcp_bucket_uri | String | Google bucket where your SRA reads will be temporarily stored before transferring to SRA. Example: "gs://theiagen_sra_transfer" | | Required | | mercury_prep_n_batch | sample_names | Array[String] | The samples you want to submit | | Required | | mercury_prep_n_batch | terra_project_name | String | The name of your Terra project. You can find this information in the URL of the webpage of your Terra dashboard. For example, if your URL contains #workspaces/example/my_workspace/ then your project name is example | | Required | | mercury_prep_n_batch | terra_table_name | String | The name of the Terra table where your samples can be found. Do not include the entity: prefix or the _id suffix, just the name of the table as listed in the sidebar on lefthand side of the Terra Data tab. | | Required | | mercury_prep_n_batch | terra_workspace_name | String | The name of your Terra workspace where your samples can be found. For example, if your URL contains #workspaces/example/my_workspace/ then your project name is my_workspace | | Required | | download_terra_table | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional | | download_terra_table | disk_size | Int | Amount of storage (in GB) to allocate to the task | 10 | Optional | | download_terra_table | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21 | Optional | | download_terra_table | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 1 | Optional | | mercury | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional | | mercury | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional | | mercury | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/mercury:1.0.7 | Optional | | mercury | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional | | mercury | number_N_threshold | Int | Only for "sars-cov-2" submissions; used to filter out any samples that contain more than the indicated number of Ns in the assembly file | 5000 | Optional | | mercury | single_end | Boolean | Set to true if your data is single-end; this ensures that a read2 column is not included in the metadata | FALSE | Optional | | mercury | skip_county | Boolean | Use if your Terra table contains a county column that you do not want to include in your submission. | FALSE | Optional | | mercury | usa_territory | Boolean | If true, the "state" column will be used in place of the "country" column. For example, if "state" is Puerto Rico, then the GISAID virus name will be hCoV-19/Puerto Rico/<name>/<year>. The NCBI geo_loc_name will be "USA: Puerto Rico". This optional Boolean variable should only be used with clear understanding of what it does. | FALSE | Optional | | mercury | using_clearlabs_data | Boolean | When set to true will change read1_dehosted → clearlabs_fastq_gz; assembly_fasta → clearlabs_fasta; assembly_mean_coverage → clearlabs_assembly_coverage | FALSE | Optional | | mercury | using_reads_dehosted | Boolean | When set to true will only change read1_dehosted → reads_dehosted. Takes priority over the replacement for read1_dehosted made with the using_clearlabs_data Boolean input | FALSE | Optional | | mercury | vadr_alert_limit | Int | Only for "sars-cov-2" submissions; used to filter out any samples that contain more than the indicated number of vadr alerts | 0 | Optional | | mercury_prep_n_batch | authors_sbt | File | Only for "mpox" submissions; a file that contains author information. This file can be created here: https://submit.ncbi.nlm.nih.gov/genbank/template/submission/ | | Optional | | mercury_prep_n_batch | organism | String | The organism that you want submission prepare for — each organism requires different metadata fields so please ensure this field is accurate. Options: "flu", "mpox"" or "sars-cov-2" | sars-cov-2 | Optional | | mercury_prep_n_batch | output_name | String | Free text prefix for all output files | mercury | Optional | | mercury_prep_n_batch | skip_ncbi | Boolean | Set to true if you only want to prepare GISAID submission files | FALSE | Optional | | table2asn | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional | | table2asn | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | | table2asn | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ncbi-table2asn:1.26.678 | Optional | | table2asn | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 1 | Optional | | trim_genbank_fastas | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional | | trim_genbank_fastas | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | | trim_genbank_fastas | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/vadr:1.3 | Optional | | trim_genbank_fastas | max_length | Int | Only for "sars-cov-2" submissions; the maximum genome length for trimming terminal ambiguous nucleotides. If your sample's genome is higher than this value, the workflow will error/fail. | 30000 | Optional | | trim_genbank_fastas | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional | | trim_genbank_fastas | min_length | Int | Only for "sars-cov-2" submissions; the minimum genome length for trimming terminal ambiguous nucleotides. If your sample's genome is lower than this value, the workflow will error/fail. | 50 | Optional | | version_capture | docker | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional | | version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

Outputs¶

Variable	Type	Description
bankit_sqn_to_email	File	Only for mpox submission: the sqn file that you will use to submit mpox assembly files to NCBI via email
biosample_metadata	File	BioSample metadata TSV file for upload to NCBI
excluded_samples	File	A file that contains the names and reasons why a sample was excluded from submission. For SARS-CoV-2, there are two sections: First, a section for any samples that failed to meet pre-determined quality thresholds (`number_N` and `vadr_num_alert`). Second, a section that includes a table that describes any missing required metadata for each sample. This table has the sample name for rows and any columns that have missing metadata as headers. If a sample is missing a piece of required metadata, the corresponding cell will be blank. However, if a different sample does have metadata for that column, the associated value will appear in the corresponding cell. For flu and mpox, only the second section described above exists. Please see the example below for more details.
genbank_fasta	File	Only for SARS-CoV-2 submission: GenBank fasta file for upload
genbank_metadata	File	Only for SARS-CoV-2 submission: GenBank metadata for upload
gisaid_fasta	File	Only for mpox and SARS-CoV-2 submission: GISAID fasta file for upload
gisaid_metadata	File	Only for mpox and SARS-CoV-2 submission: GISAID metadata for upload
mercury_prep_n_batch_analysis_date	String	Date analysis was run
mercury_prep_n_batch_version	String	Version of the PHB repository that hosts this workflow
mercury_script_version	String	Version of the Mercury tool that was used in this workflow
sra_metadata	File	SRA metadata TSV file for upload

An example excluded_samples.tsv file

An example excluded_samples.tsv file¶

Due to the nature of tsv files, it may be easier to download and open this file in Excel.

example_excluded_samples.tsv

Samples excluded for quality thresholds:
sample_name message 
sample2 VADR skipped due to poor assembly
sample3 VADR number alerts too high: 3 greater than limit of 0
sample4 Number of Ns was too high: 10000 greater than limit of 5000

Samples excluded for missing required metadata (will have empty values in indicated columns):
tablename_id    organism    country library_layout
sample5         paired
sample6 SARS-CoV-2  USA

This example informs the user that samples 2-4 were excluded for quality reasons (the exact reason is listed in the message column), and that samples 5 and 6 were excluded because they were missing required metadata fields (sample5 was missing the organism and country fields, and sample6 was missing the library_layout field).

Usage outside of Terra¶

This tool can also be used on the command-line. Please see the Mercury GitHub for more information on how to run Mercury with a Docker image or in your local command-line environment.