TheiaCoV Workflow Series

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Genomic Characterization	Viral	PHB v2.2.0	Yes, some optional features incompatible	Sample-level

TheiaCoV Workflows

The TheiaCoV workflows are for the assembly, quality assessment, and characterization of viral genomes. There are currently five TheiaCoV workflows designed to accommodate different kinds of input data:

Illumina paired-end sequencing (TheiaCoV_Illumina_PE)
Illumina single-end sequencing (TheiaCoV_Illumina_SE)
ONT sequencing (TheiaCoV_ONT)
Genome assemblies (TheiaCoV_FASTA)
ClearLabs sequencing (TheiaCoV_ClearLabs)

Additionally, the TheiaCoV_FASTA_Batch workflow is available to process several hundred SARS-CoV-2 assemblies at the same time.

Key Resources

Reference Materials for SARS-CoV-2

Reference Materials for Mpox

HIV Input JSONs

WNV Input JSONs

Flu Input JSONs

RSV-A Input JSONs

RSV-B Input JSONs

TheiaCoV Workflow Diagram

Supported Organisms

These workflows currently support the following organisms:

SARS-CoV-2 ("sars-cov-2", "SARS-CoV-2") - default organism input
Monkeypox virus ("MPXV", "mpox", "monkeypox", "Monkeypox virus", "Mpox")
Human Immunodeficiency Virus ("HIV")
West Nile Virus ("WNV", "wnv", "West Nile virus")
Influenza ("flu", "influenza", "Flu", "Influenza")
RSV-A ("rsv_a", "rsv-a", "RSV-A", "RSV_A")
RSV-B ("rsv_b", "rsv-b", "RSV-B", "RSV_B")
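
The accepted spellings above can be collapsed to one label per organism before any further check. The following is a minimal Python sketch, not part of the TheiaCoV code base: `ORGANISM_ALIASES` and `normalize_organism` are hypothetical names, and the choice of canonical labels (the short forms used in the Organism column of the inputs table) is an assumption made for illustration.

```python
# Hypothetical lookup table transcribed from the accepted spellings listed above.
# The dictionary keys (canonical labels) are an assumption chosen for this sketch.
ORGANISM_ALIASES = {
    "sars-cov-2": ["sars-cov-2", "SARS-CoV-2"],
    "MPXV": ["MPXV", "mpox", "monkeypox", "Monkeypox virus", "Mpox"],
    "HIV": ["HIV"],
    "WNV": ["WNV", "wnv", "West Nile virus"],
    "flu": ["flu", "influenza", "Flu", "Influenza"],
    "rsv_a": ["rsv_a", "rsv-a", "RSV-A", "RSV_A"],
    "rsv_b": ["rsv_b", "rsv-b", "RSV-B", "RSV_B"],
}

# Invert the table so that each accepted spelling maps to its canonical label.
_ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in ORGANISM_ALIASES.items()
    for alias in aliases
}


def normalize_organism(value: str) -> str:
    """Return the canonical label for an accepted spelling, else raise ValueError."""
    try:
        return _ALIAS_TO_CANONICAL[value]
    except KeyError:
        raise ValueError(f"Unsupported organism string: {value!r}") from None
```

The lookup is deliberately exact and case-sensitive, so only the spellings listed above are accepted.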

The compatibility of each workflow with each pathogen is shown below:

	SARS-CoV-2	Mpox	HIV	WNV	Influenza	RSV-A	RSV-B
Illumina_PE	✅	✅	✅	✅	✅	✅	✅
Illumina_SE	✅	✅	❌	✅	❌	✅	✅
ClearLabs	✅	❌	❌	❌	❌	❌	❌
ONT	✅	✅	✅	❌	✅	✅	✅
FASTA	✅	✅	❌	✅	✅	✅	✅
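
The compatibility table above can also be queried programmatically, for example to decide which workflow to launch for a given organism. This is a small Python sketch that transcribes the table as written; `COMPATIBILITY`, `is_supported`, and `workflows_for` are hypothetical names introduced for illustration.

```python
# Column order of the compatibility table above.
ORGANISMS = ["SARS-CoV-2", "Mpox", "HIV", "WNV", "Influenza", "RSV-A", "RSV-B"]

# One string per table row: "Y" for a check mark, "N" for a cross.
_ROWS = {
    "Illumina_PE": "YYYYYYY",
    "Illumina_SE": "YYNYNYY",
    "ClearLabs": "YNNNNNN",
    "ONT": "YYYNYYY",
    "FASTA": "YYNYYYY",
}

# Nested mapping: workflow -> organism -> supported (True/False).
COMPATIBILITY = {
    workflow: {org: flag == "Y" for org, flag in zip(ORGANISMS, flags)}
    for workflow, flags in _ROWS.items()
}


def is_supported(workflow: str, organism: str) -> bool:
    """Return True if the table marks this workflow/organism pair as compatible."""
    return COMPATIBILITY[workflow][organism]


def workflows_for(organism: str) -> list:
    """List the workflows (in table order) that support the given organism."""
    return [wf for wf, row in COMPATIBILITY.items() if row[organism]]
```

For instance, `workflows_for("HIV")` returns only the Illumina_PE and ONT rows, matching the table.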

We've provided the following information to help you set up the workflow for each organism in the form of input JSONs.
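
For users who prefer to generate an input JSON themselves, the sketch below builds one for TheiaCoV_Illumina_PE in Python. It assumes the usual WDL convention of `<workflow>.<input>` keys (and `<workflow>.<task>.<input>` for task-level inputs); the helper `build_pe_inputs` and the bucket paths are hypothetical. The required input names (`samplename`, `read1`, `read2`) and the optional `pangolin4` / `min_length` pair are taken from the inputs table in this document.

```python
import json


def build_pe_inputs(samplename, read1, read2, overrides=None,
                    workflow="theiacov_illumina_pe"):
    """Assemble an inputs dictionary for the paired-end workflow.

    Keys follow the assumed "<workflow>.<input>" convention; entries in
    `overrides` are prefixed with the workflow name in the same way.
    """
    inputs = {
        f"{workflow}.samplename": samplename,
        f"{workflow}.read1": read1,
        f"{workflow}.read2": read2,
    }
    for name, value in (overrides or {}).items():
        inputs[f"{workflow}.{name}"] = value
    return inputs


if __name__ == "__main__":
    example = build_pe_inputs(
        "sample01",
        "gs://example-bucket/sample01_R1.fastq.gz",  # placeholder path
        "gs://example-bucket/sample01_R2.fastq.gz",  # placeholder path
        overrides={"pangolin4.min_length": 10000},  # optional task-level input
    )
    print(json.dumps(example, indent=2))
```

The organism-specific JSONs listed under Key Resources remain the reference; this sketch only illustrates the key layout.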

Inputs

All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)

TheiaCoV_Illumina_PE Input Read Data

The TheiaCoV_Illumina_PE workflow takes in Illumina paired-end read data. Read file names should end with .fastq or .fq, with the optional addition of .gz. When possible, Theiagen recommends zipping files with gzip before uploading to Terra to minimize data upload time.

By default, the workflow anticipates 2 x 150bp reads (i.e. the input reads were generated using a 300-cycle sequencing kit). Modifications to the optional parameter for trim_minlen may be required to accommodate shorter read data, such as the 2 x 75bp reads generated using a 150-cycle sequencing kit.
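
Before deciding whether trim_minlen needs adjusting, it can help to check the read lengths actually present in a file. The following Python sketch samples the first records of a FASTQ file and reports their median length. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function names are hypothetical, and the sketch does not suggest a particular trim_minlen value.

```python
import gzip
import statistics
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz compression."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_lengths(path, max_reads=1000):
    """Return the sequence lengths of up to `max_reads` records."""
    lengths = []
    with open_fastq(path) as handle:
        # Assumes four lines per record; the sequence is the second line.
        for i, line in enumerate(islice(handle, max_reads * 4)):
            if i % 4 == 1:
                lengths.append(len(line.strip()))
    return lengths


def median_read_length(path, max_reads=1000):
    """Median sequence length over the sampled records."""
    lengths = read_lengths(path, max_reads)
    if not lengths:
        raise ValueError(f"No reads found in {path}")
    return statistics.median(lengths)
```

A median well below 150 for paired-end data indicates shorter reads than the default anticipates, in which case the trim_minlen parameter is worth reviewing.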

TheiaCoV_Illumina_SE Input Read Data

The TheiaCoV_Illumina_SE workflow takes in Illumina single-end read data. Read file names should end with .fastq or .fq, with the optional addition of .gz. Theiagen highly recommends zipping files with gzip before uploading to Terra to minimize data upload time and save on storage costs.

By default, the workflow anticipates 1 x 35 bp reads (i.e. the input reads were generated using a 70-cycle sequencing kit). Modifications to the optional parameter for trim_minlen may be required to accommodate longer read data.
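
The gzip recommendation can be applied with the Python standard library when a command-line gzip is not at hand. A minimal sketch follows; `gzip_fastq` is a hypothetical helper and the upload step itself is outside its scope.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(src, keep_original=True):
    """Compress a FASTQ file to `<name>.gz` and return the new path.

    Files that already end in .gz are returned unchanged.
    """
    src = Path(src)
    if src.suffix == ".gz":
        return src  # already compressed
    dest = src.with_name(src.name + ".gz")
    # Stream the data so that large read files are not loaded into memory.
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if not keep_original:
        src.unlink()
    return dest
```

Calling `gzip_fastq("sample01.fastq")` writes `sample01.fastq.gz` next to the original and leaves the original in place unless `keep_original=False` is passed.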

TheiaCoV_ONT Input Read Data

The TheiaCoV_ONT workflow takes in base-called ONT read data. Read file names should end with .fastq or .fq, with the optional addition of .gz. When possible, Theiagen recommends zipping files with gzip before uploading to Terra to minimize data upload time.

The ONT sequencing kit and base-calling approach can produce substantial variability in the amount and quality of read data. Genome assemblies produced by the TheiaCoV_ONT workflow must be quality assessed before reporting results.
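
The naming rule shared by the three read-based workflows above (file names ending in .fastq or .fq, optionally followed by .gz) is easy to check before upload. A small Python sketch, with hypothetical function names; treating the extensions as lower-case only is an assumption of this sketch, not a documented requirement.

```python
import re

# Matches names ending in .fastq or .fq, optionally followed by .gz.
_FASTQ_NAME = re.compile(r"\.(fastq|fq)(\.gz)?$")


def is_valid_read_filename(name: str) -> bool:
    """Return True if the file name follows the read-file naming rule."""
    return bool(_FASTQ_NAME.search(name))


def is_gzipped_name(name: str) -> bool:
    """Return True if the file name carries the optional .gz suffix."""
    return name.endswith(".gz")
```

Names such as `sample01_R1.fastq.gz` and `sample01.fq` pass the check, while `sample01.fasta` and `sample01.fastq.zip` do not.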

TheiaCoV_FASTA Input Assembly Data

The TheiaCoV_FASTA workflow takes in assembly files in FASTA format.

TheiaCoV_ClearLabs Input Read Data

The TheiaCoV_ClearLabs workflow takes in read data produced by the Clear Dx platform from ClearLabs. However, many users run the TheiaCoV_FASTA workflow on ClearLabs-generated FASTA files instead, because generating assemblies with this workflow has a few known issues that do not occur with those FASTA files.

Terra Task Name	Variable	Type	Description	Default Value	Terra Status	Workflow	Organism
theiacov_clearlabs	primer_bed	File	The bed file containing the primers used when sequencing was performed		Required	CL	sars-cov-2
theiacov_clearlabs	read1	File	Read data produced by the Clear Dx platform from ClearLabs		Required	CL	sars-cov-2
theiacov_fasta	assembly_fasta	File	Input assembly FASTA file		Required	FASTA	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
theiacov_fasta	input_assembly_method	String	Method used to generate the assembly file		Required	FASTA	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
theiacov_illumina_pe	read1	File	Forward Illumina read in FASTQ file format (compression optional)		Required	PE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
theiacov_illumina_pe	read2	File	Reverse Illumina read in FASTQ file format (compression optional)		Required	PE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
theiacov_illumina_se	read1	File	Forward Illumina read in FASTQ file format (compression optional)		Required	SE	MPXV, WNV, sars-cov-2
theiacov_ont	read1	File	Demultiplexed ONT read in FASTQ file format (compression optional)		Required	ONT	HIV, MPXV, WNV, flu, sars-cov-2
workflow name	samplename	String	Name of the sample being analyzed		Required	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	seq_method	String	The sequencing methodology used to generate the input read data		Required	FASTA	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
clean_check_reads	cpu	Int	Number of CPUs to allocate to the task	2	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
clean_check_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
clean_check_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
clean_check_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
consensus	cpu	Int	Number of CPUs to allocate to the task	8	Optional	CL, ONT	sars-cov-2
consensus	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL, ONT	sars-cov-2
consensus	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019-epi2me	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
consensus	medaka_model	String	To obtain the best results, the model must be set to match the sequencer's basecaller model; this string takes the format of {pore}_{device}_{caller variant}_{caller_version}. See also https://github.com/nanoporetech/medaka?tab=readme-ov-file#models.	r941_min_high_g360	Optional	CL, ONT	sars-cov-2
consensus	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional	CL, ONT	sars-cov-2
consensus_qc	cpu	Int	Number of CPUs to allocate to the task	1	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
consensus_qc	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
consensus_qc	docker	String	The Docker container to use for the task	ngolin	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
consensus_qc	genome_length	Int	Internal component, do not modify		Do not modify, Optional	CL, SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
consensus_qc	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
fastq_scan_clean_reads	cpu	Int	Number of CPUs to allocate to the task	1	Optional	CL	sars-cov-2
fastq_scan_clean_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL	sars-cov-2
fastq_scan_clean_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1	Optional	CL	sars-cov-2
fastq_scan_clean_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional	CL	sars-cov-2
fastq_scan_clean_reads	read1_name	Int	Internal component, do not modify		Do not modify, Optional	CL	sars-cov-2
fastq_scan_raw_reads	cpu	Int	Number of CPUs to allocate to the task	1	Optional	CL	sars-cov-2
fastq_scan_raw_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL	sars-cov-2
fastq_scan_raw_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1	Optional	CL	sars-cov-2
fastq_scan_raw_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional	CL	sars-cov-2
fastq_scan_raw_reads	read1_name	Int	Internal component, do not modify		Do not modify, Optional	CL	sars-cov-2
flu_track	abricate_flu_cpu	Int	Number of CPUs to allocate to the task	2	Optional	FASTA, ONT, PE	flu
flu_track	abricate_flu_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	FASTA, ONT, PE	flu
flu_track	abricate_flu_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/abricate:1.0.1-insaflu-220727	Optional	FASTA, ONT, PE	flu
flu_track	abricate_flu_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional	FASTA, ONT, PE	flu
flu_track	abricate_flu_mincov	Int	Minimum DNA % coverage	60	Optional	FASTA, ONT, PE	flu
flu_track	abricate_flu_minid	Int	Minimum DNA % identity	70	Optional	FASTA, ONT, PE	flu
flu_track	antiviral_aa_subs	String	Additional list of antiviral resistance associated amino acid substitutions of interest to be searched against those called on the sample segments. They take the format of protein:substitution, e.g. NA:A26V		Optional	ONT, PE	flu
flu_track	assembly_metrics_cpu	Int	Number of CPUs to allocate to the task	2	Optional	PE	flu
flu_track	assembly_metrics_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	PE	flu
flu_track	assembly_metrics_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15	Optional	PE	flu
flu_track	assembly_metrics_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	PE	flu
flu_track	flu_h1_ha_ref	File	Internal component, do not modify		Do not modify, Optional	ONT, PE	flu
flu_track	flu_h1n1_m2_ref	File	Internal component, do not modify		Do not modify, Optional	ONT, PE	flu
flu_track	flu_h3_ha_ref	File	Internal component, do not modify		Do not modify, Optional	ONT, PE	flu
flu_track	flu_h3n2_m2_ref	File	Internal component, do not modify		Do not modify, Optional	ONT, PE	flu
flu_track	flu_n1_na_ref	File	Internal component, do not modify		Do not modify, Optional	ONT, PE	flu
flu_track	flu_n2_na_ref	File	Internal component, do not modify		Do not modify, Optional	ONT, PE	flu
flu_track	flu_pa_ref	File	Internal component, do not modify		Do not modify, Optional	ONT, PE	flu
flu_track	flu_pb1_ref	File	Internal component, do not modify		Do not modify, Optional	ONT, PE	flu
flu_track	flu_pb2_ref	File	Internal component, do not modify		Do not modify, Optional	ONT, PE	flu
flu_track	flu_subtype	String	The influenza subtype being analyzed. Used for picking nextclade datasets. Options: "Yamagata", "Victoria", "H1N1", "H3N2". Only use to override the subtype call from IRMA and ABRicate.		Optional	CL, ONT, PE, SE	flu
flu_track	genoflu_cpu	Int	Number of CPUs to allocate to the task	1	Optional	FASTA, ONT, PE	flu
flu_track	genoflu_cross_reference	File	An Excel file to cross-reference BLAST findings; useful when novel genotypes are not in the default file used by genoflu.py		Optional	FASTA, ONT, PE
flu_track	genoflu_disk_size	Int	Amount of storage (in GB) to allocate to the task	25	Optional	FASTA, ONT, PE
flu_track	genoflu_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.03	Optional	FASTA, ONT, PE
flu_track	genoflu_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional	FASTA, ONT, PE
flu_track	irma_cpu	Int	Number of CPUs to allocate to the task	4	Optional	ONT, PE	flu
flu_track	irma_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	ONT, PE	flu
flu_track	irma_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/cdcgov/irma:v1.1.5	Optional	ONT, PE	flu
flu_track	irma_keep_ref_deletions	Boolean	True/False variable that determines if sites missed during read gathering should be deleted by ambiguation.	TRUE	Optional	ONT, PE	flu
flu_track	irma_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional	ONT, PE	flu
flu_track	nextclade_cpu	Int	Number of CPUs to allocate to the task	2	Optional	ONT, PE	flu
flu_track	nextclade_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional	ONT, PE	flu
flu_track	nextclade_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1	Optional	ONT, PE	flu
flu_track	nextclade_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional	ONT, PE	flu
flu_track	nextclade_output_parser_cpu	Int	Number of CPUs to allocate to the task	2	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
flu_track	nextclade_output_parser_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
flu_track	nextclade_output_parser_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
flu_track	nextclade_output_parser_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
flu_track	read2	File	Internal component. Do not use.		Optional	ONT	flu
gene_coverage	cpu	Int	Number of CPUs to allocate to the task	2	Optional	CL, ONT, PE, SE	MPXV, sars-cov-2
gene_coverage	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL, ONT, PE, SE	MPXV, sars-cov-2
gene_coverage	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15	Optional	CL, ONT, PE, SE	MPXV, sars-cov-2
gene_coverage	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	CL, ONT, PE, SE	MPXV, sars-cov-2
gene_coverage	min_depth	Int	The minimum depth to determine if a position was covered.	10	Optional	ONT, PE, SE	MPXV, sars-cov-2
gene_coverage	sc2_s_gene_start	Int	Start nucleotide position of the SARS-CoV-2 Spike gene	21563	Optional	CL, ONT, PE, SE	MPXV, sars-cov-2
gene_coverage	sc2_s_gene_stop	Int	End/Last nucleotide position of the SARS-CoV-2 Spike gene	25384	Optional	CL, ONT, PE, SE	MPXV, sars-cov-2
ivar_consensus	read2	File	Internal component, do not modify		Do not modify, Optional	SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
ivar_consensus	skip_N	Boolean	True/False variable that determines if regions with depth less than minimum depth should not be added to the consensus sequence	FALSE	Optional	PE, SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
kraken2_dehosted	cpu	Int	Number of CPUs to allocate to the task	4	Optional	CL	sars-cov-2
kraken2_dehosted	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL	sars-cov-2
kraken2_dehosted	docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv	Optional	CL	sars-cov-2
kraken2_dehosted	kraken2_db	String	The database used to run Kraken2	/kraken2-db	Optional	CL	sars-cov-2
kraken2_dehosted	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	CL	sars-cov-2
kraken2_dehosted	read2	File	Internal component, do not modify		Do not modify, Optional	CL	sars-cov-2
kraken2_raw	cpu	Int	Number of CPUs to allocate to the task	4	Optional	CL	sars-cov-2
kraken2_raw	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL	sars-cov-2
kraken2_raw	docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv	Optional	CL	sars-cov-2
kraken2_raw	kraken2_db	String	The database used to run Kraken2	/kraken2-db	Optional	CL	sars-cov-2
kraken2_raw	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	CL	sars-cov-2
kraken2_raw	read_processing	String	The tool used for trimming of primers from reads. Options are trimmomatic and fastp	trimmomatic	Optional		HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
kraken2_raw	read2	File	Internal component, do not modify		Do not modify, Optional	CL	sars-cov-2
nanoplot_clean	cpu	Int	Number of CPUs to allocate to the task	4	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_clean	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_clean	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_clean	max_length	Int	The maximum length of clean reads, for which reads longer than the length specified will be hidden.	100000	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_clean	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw	cpu	Int	Number of CPUs to allocate to the task	4	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw	max_length	Int	The maximum read length; reads longer than the length specified will be hidden.	100000	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
ncbi_scrub_se	cpu	Int	Number of CPUs to allocate to the task	4	Optional	CL	sars-cov-2
ncbi_scrub_se	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL	sars-cov-2
ncbi_scrub_se	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1	Optional	CL	sars-cov-2
ncbi_scrub_se	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	CL	sars-cov-2
nextclade_output_parser	cpu	Int	Number of CPUs to allocate to the task	2	Optional	ONT, PE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_output_parser	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional	ONT, PE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_output_parser	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim	Optional	ONT, PE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_output_parser	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional	ONT, PE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3	auspice_reference_tree_json	File	An Auspice JSON phylogenetic reference tree which serves as a target for phylogenetic placement.	Inherited from nextclade dataset	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3	cpu	Int	Number of CPUs to allocate to the task	2	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3	gene_annotations_gff	File	A genome annotation to specify how to translate the nucleotide sequence to proteins (genome_annotation.gff3). Specifying this enables codon-informed alignment and protein alignments. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html	Inherited from nextclade dataset	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3	input_ref	File	A nucleotide sequence which serves as a reference for the pairwise alignment of all input sequences. This is also the sequence which defines the coordinate system of the genome annotation. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/02-reference-sequence.html	Inherited from nextclade dataset	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3	nextclade_pathogen_json	File	General dataset configuration file. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/05-pathogen-config.html	Inherited from nextclade dataset	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3	verbosity	String	Verbosity level of the Nextclade console output. Other options are: "off", "error", "info", "debug", and "trace" (highest level of verbosity)	warn	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
organism_parameters	auspice_config	File	Auspice config file used in Augur_PHB workflow. Defaults set for various organisms & Flu segments. A minimal auspice config file is set in cases where organism is not specified and user does not provide an optional input config file.		Optional	Augur, CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
organism_parameters	flu_segment	String	Influenza genome segment being analyzed. Options: "HA" or "NA". Automatically determined. This input is ignored if provided for TheiaCoV_Illumina_SE and TheiaCoV_ClearLabs	N/A	Optional	CL, ONT, PE, SE	flu
organism_parameters	flu_subtype	String	The influenza subtype being analyzed. Options: "Yamagata", "Victoria", "H1N1", "H3N2". Automatically determined. This input is ignored if provided for TheiaCoV_Illumina_SE and TheiaCoV_ClearLabs	N/A	Optional	CL, ONT, PE, SE	flu
organism_parameters	gene_locations_bed_file	File	Use to provide locations of interest where average coverage will be calculated	Default provided for SARS-CoV-2 ("gs://theiagen-public-files-rp/terra/sars-cov-2-files/sc2_gene_locations.bed") and mpox ("gs://theiagen-public-files/terra/mpxv-files/mpox_gene_locations.bed")	Optional	CL, FASTA
organism_parameters	genome_length_input	Int	Use to specify the expected genome length; provided by default for all supported organisms	Default provided for SARS-CoV-2 (29903), mpox (197200), WNV (11000), flu (13000), RSV-A (16000), RSV-B (16000), HIV (primer versions 1 [9181] and 2 [9840])	Optional	CL
organism_parameters	hiv_primer_version	String	The version of HIV primers used. Options are "v1" (https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl#L156) and "v2" (https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl#L164). This input is ignored if provided for TheiaCoV_Illumina_SE and TheiaCoV_ClearLabs	v1	Optional	CL, FASTA, ONT, PE, SE	HIV
organism_parameters	kraken_target_organism_input	String	The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database.	Default provided for mpox (Monkeypox virus), WNV (West Nile virus), and HIV (Human immunodeficiency virus 1)	Optional	FASTA, ONT, SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
organism_parameters	pangolin_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.29	Optional	CL, FASTA
organism_parameters	primer_bed_file	File	The bed file containing the primers used when sequencing was performed	REQUIRED FOR SARS-CoV-2, MPOX, WNV, RSV-A & RSV-B. Provided by default only for HIV primer versions 1 ("gs://theiagen-public-files/terra/hivgc-files/HIV-1_v1.0.primer.hyphen.bed") and 2 ("gs://theiagen-public-files/terra/hivgc-files/HIV-1_v2.0.primer.hyphen400.1.bed")	Optional, Sometimes required	CL, FASTA
organism_parameters	reference_gff_file	File	Reference GFF file for the organism being analyzed	Default provided for mpox ("gs://theiagen-public-files/terra/mpxv-files/Mpox-MT903345.1.reference.gff3") and HIV (primer versions 1 ["gs://theiagen-public-files/terra/hivgc-files/NC_001802.1.gff3"] and 2 ["gs://theiagen-public-files/terra/hivgc-files/AY228557.1.gff3"])	Optional	CL, FASTA, ONT
organism_parameters	vadr_max_length	Int	Maximum length for the fasta-trim-terminal-ambigs.pl VADR script	Default provided for SARS-CoV-2 (30000), mpox (210000), WNV (11000), flu (0), RSV-A (15500) and RSV-B (15500).	Optional	CL
organism_parameters	vadr_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32 (RSV-A and RSV-B) and 8 (all other TheiaCoV organisms)	Optional	CL, ONT, PE, SE	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
organism_parameters	vadr_options	String	Options for the v-annotate.pl VADR script	Default provided for SARS-CoV-2 ("--noseqnamemax --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --out_allfasta"), mpox ("--glsearch -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --out_allfasta --minimap2 --s_overhang 150"), WNV ("--mkey flavi --mdir /opt/vadr/vadr-models-flavi/ --nomisc --noprotid --out_allfasta"), flu (""), RSV-A ("-r --mkey rsv --xnocomp"), and RSV-B ("-r --mkey rsv --xnocomp")	Optional	CL
organism_parameters	vadr_skip_length	Int	Minimum assembly length (unambiguous) to run VADR	10000	Optional	CL	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
pangolin4	analysis_mode	String	Pangolin inference engine for lineage designations (usher or pangolearn). Default is Usher.		Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
pangolin4	cpu	Int	Number of CPUs to allocate to the task	4	Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
pangolin4	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
pangolin4	expanded_lineage	Boolean	True/False that determines if a lineage should be expanded without aliases (e.g., BA.1 → B.1.1.529.1)	TRUE	Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
pangolin4	max_ambig	Float	The maximum proportion of Ns allowed for pangolin to attempt an assignment	0.5	Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
pangolin4	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
pangolin4	min_length	Int	Minimum query length allowed for pangolin to attempt an assignment	10000	Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
pangolin4	pangolin_arguments	String	Optional arguments for pangolin, e.g. "--skip-scorpio"		Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
pangolin4	skip_designation_cache	Boolean	A True/False option that determines if the designation cache should be skipped	FALSE	Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
pangolin4	skip_scorpio	Boolean	A True/False option that determines if scorpio should be skipped.	FALSE	Optional	CL, FASTA, ONT, PE, SE	sars-cov-2
qc_check_task	ani_highest_percent	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	ani_highest_percent_bases_aligned	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	assembly_length	Int	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	assembly_mean_coverage	Int	Internal component, do not modify		Do not modify, Optional	FASTA	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	busco_results	String	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	combined_mean_q_clean	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	combined_mean_q_raw	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	combined_mean_readlength_clean	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	combined_mean_readlength_raw	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	cpu	Int	Number of CPUs to allocate to the task	4	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-03-16	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	est_coverage_clean	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	est_coverage_raw	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	gambit_predicted_taxon	String	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	kraken_human	String	Internal component, do not modify		Do not modify, Optional	FASTA, ONT, SE
qc_check_task	kraken_human_dehosted	String	Internal component, do not modify		Do not modify, Optional	FASTA, ONT, SE
qc_check_task	kraken_sc2	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	kraken_sc2_dehosted	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	kraken_target_organism	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	kraken_target_organism_dehosted	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	midas_secondary_genus_abundance	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	midas_secondary_genus_coverage	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	minbaseq_trim	Int	Internal component, do not modify		Do not modify, Optional	FASTA	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	n50_value	Int	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	num_reads_clean2	Int	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, SE
qc_check_task	num_reads_raw2	Int	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, SE
qc_check_task	number_contigs	Int	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	quast_gc_percent	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	r1_mean_q_clean	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	r1_mean_q_raw	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	r1_mean_readlength_clean	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	r1_mean_readlength_raw	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	r2_mean_q_clean	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	r2_mean_q_raw	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	r2_mean_readlength_clean	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	r2_mean_readlength_raw	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	sc2_s_gene_mean_coverage	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task	sc2_s_gene_percent_coverage	Float	Internal component, do not modify		Do not modify, Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
quasitools_illumina_pe	cpu	Int	Number of CPUs to allocate to the task	2	Optional	PE	HIV
quasitools_illumina_pe	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional	PE	HIV
quasitools_illumina_pe	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/quasitools:0.7.0--pyh864c0ab_1	Optional	PE	HIV
quasitools_illumina_pe	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional	PE	HIV
quasitools_ont	cpu	Int	Number of CPUs to allocate to the task	2	Optional	ONT	HIV
quasitools_ont	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional	ONT	HIV
quasitools_ont	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/quasitools:0.7.0--pyh864c0ab_1	Optional	ONT	HIV
quasitools_ont	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional	ONT	HIV
quasitools_ont	read2	File	Internal component. Do not use.		Do not modify, Optional	ONT	HIV
raw_check_reads	cpu	Int	Number of CPUs to allocate to the task	2	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
raw_check_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
raw_check_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
raw_check_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	bbduk_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	call_kraken	Boolean	True/False variable that determines if the Kraken2 task should be called.	FALSE	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	call_midas	Boolean	True/False variable that determines if the MIDAS task should be called.	TRUE	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	downsampling_coverage	Float	The desired coverage to sub-sample the reads to with RASUSA	150	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	fastp_args	String	Additional fastp task arguments	--detect_adapter_for_pe -g -5 20 -3 20	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	kraken_db	File	The database used to run Kraken2	/kraken2-db	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	kraken_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	kraken_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	midas_db	File	The database used by the MIDAS task	gs://theiagen-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	read_processing	String	The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp"	trimmomatic	Optional	PE, SE
read_QC_trim	read_qc	String	The tool used for quality control (QC) of reads. Options are fastq_scan and fastqc	fastq_scan	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	target_organism	String	Organism to search for in Kraken		Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim	trimmomatic_args	String	Additional arguments to pass to trimmomatic	-phred33	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
set_flu_ha_nextclade_values	reference_gff_file	File	Reference GFF file for flu HA		Do not modify, Optional	ONT	flu
set_flu_na_nextclade_values	reference_gff_file	File	Reference GFF file for flu NA		Do not modify, Optional	ONT	flu
set_flu_na_nextclade_values	vadr_mem	Int	Memory, in GB, allocated to this task	8	Do not modify, Optional	ONT	flu
stats_n_coverage	cpu	Int	Number of CPUs to allocate to the task	2	Optional	CL, ONT
stats_n_coverage	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL, ONT
stats_n_coverage	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15	Optional	CL, ONT
stats_n_coverage	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	CL, ONT
stats_n_coverage_primtrim	cpu	Int	Number of CPUs to allocate to the task	2	Optional	CL, ONT
stats_n_coverage_primtrim	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL, ONT
stats_n_coverage_primtrim	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15	Optional	CL, ONT
stats_n_coverage_primtrim	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional	CL, ONT
vadr	cpu	Int	Number of CPUs to allocate to the task	2	Optional	CL, FASTA, ONT, PE, SE	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional	CL, FASTA, ONT, PE, SE	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/vadr:1.5.1	Optional	CL, FASTA, ONT, PE, SE	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr	max_length	Int	Maximum length of contig allowed to run VADR		Optional	CL	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32 (RSV-A and RSV-B) and 8 (all other TheiaCoV organisms)	Optional	CL	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr	min_length	Int	Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script	50	Optional	CL, FASTA, ONT, PE, SE	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr	skip_length	Int	Minimum assembly length (unambiguous) to run VADR	10000	Optional	CL	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr	vadr_opts	String	Additional options to provide to VADR		Optional	CL	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional	ONT, PE, SE, FASTA, CL	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional	ONT, PE, SE, FASTA, CL	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	adapters	File	File that contains the adapters used	/bbmap/resources/adapters.fa	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	consensus_min_freq	Float	The minimum frequency for a variant to be called a SNP in consensus genome	0.6	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	flu_segment	String	Influenza genome segment being analyzed. Options: "HA" or "NA".	HA	Optional, Required	FASTA
workflow name	flu_subtype	String	The influenza subtype being analyzed. Options: "Yamagata", "Victoria", "H1N1", "H3N2". Automatically determined.		Optional	FASTA
workflow name	genome_length	Int	Use to specify the expected genome length		Optional	FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	max_genome_length	Int	Maximum genome length able to pass read screening	2673870	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	max_length	Int	Maximum length for a read based on the SARS-CoV-2 primer scheme	700	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	medaka_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019:1.3.0-medaka-1.4.3	Optional	CL
workflow name	min_basepairs	Int	Minimum base pairs to pass read screening	34000	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	min_coverage	Int	Minimum coverage to pass read screening	10	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	min_depth	Int	Minimum depth of reads required to call variants and generate a consensus genome. This value is passed to the iVar software.	100	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	min_genome_length	Int	Minimum genome length to pass read screening	1700	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	min_length	Int	Minimum length of a read based on the SARS-CoV-2 primer scheme	400	Optional	ONT	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	min_proportion	Int	Minimum read proportion to pass read screening	40	Optional	PE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	min_reads	Int	Minimum reads to pass read screening	113	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	nextclade_dataset_name	String	Nextclade organism dataset names. However, if organism input is set correctly, this input will be automatically assigned the corresponding dataset name. See organism defaults for more information	Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	nextclade_dataset_tag	String	Nextclade dataset tag. Used for pulling up-to-date reference genomes and associated information specific to nextclade datasets (QC thresholds, organism-specific information like SARS-CoV-2 clade & lineage information, etc.) that is required for running the Nextclade tool.	Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	normalise	Int	Used to normalize the number of reads to the indicated level before variant calling	20000 for CL, 200 for ONT	Optional	CL, ONT
workflow name	organism	String	The organism that is being analyzed. Options: "sars-cov-2", "MPXV", "WNV", "HIV", "flu", "rsv_a", "rsv_b". However, "flu" is not available for TheiaCoV_Illumina_SE	sars-cov-2	Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	pangolin_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.29	Do not modify, Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	phix	File	File that contains the phix used	/bbmap/resources/phix174_ill.ref.fa.gz	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	primer_bed	File	The bed file containing the primers used when sequencing was performed		Optional	ONT, PE, SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
workflow name	qc_check_table	File	A TSV file with optional user input QC values to be compared against the default workflow value		Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	reference_gene_locations_bed	File	Use to provide locations of interest where average coverage will be calculated		Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	reference_genome	File	An optional reference genome used for consensus assembly and QC		Optional	CL, FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	reference_gff	File	The general feature format (gff) of the reference genome.		Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	seq_method	String	The sequencing methodology used to generate the input read data	ILLUMINA	Optional	CL, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	skip_mash	Boolean	A True/False option that determines if mash should be skipped in the screen task.	FALSE	Optional	ONT, SE	HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
workflow name	skip_screen	Boolean	A True/False option that determines if the screen task should be skipped.	FALSE	Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	target_organism	String	The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database.		Optional	CL, ONT, PE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	trim_min_length	Int	The minimum length of each read after trimming	75	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	trim_primers	Boolean	A True/False option that determines if primers should be trimmed.	TRUE	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	trim_quality_min_score	Int	The minimum quality score to keep during trimming	30	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	trim_window_size	Int	Specifies window size for trimming (the number of bases to average the quality across)	4	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	vadr_max_length	Int	Maximum length of contig allowed to run VADR		Optional	FASTA, ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	vadr_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32 (RSV-A and RSV-B) and 8 (all other TheiaCoV organisms)	Optional	FASTA, ONT, PE, SE	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	vadr_options	String	Additional options to provide to VADR		Optional	ONT, PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	vadr_opts	String	Additional options to provide to VADR		Optional	FASTA	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	vadr_skip_length	Int	Minimum assembly length (unambiguous) to run VADR	10000	Optional	FASTA, ONT, PE, SE	MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name	variant_min_freq	Float	Minimum frequency for a variant to be reported in ivar outputs	0.6	Optional	PE, SE	HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2

TheiaCoV_FASTA_Batch_PHB Inputs

TheiaCoV_FASTA_Batch Inputs¶

Input Data

The TheiaCoV_FASTA_Batch workflow takes in a set of assembly files in FASTA format.

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
theiacov_fasta_batch	assembly_fastas	Array[File]	Genome assembly files in fasta format. Example: this.sars-cov-2-samples.assembly_fasta		Required
theiacov_fasta_batch	bucket_name	String	The GCP bucket for the workspace where the TheiaCoV_FASTA_Batch output files are saved. We recommend using a unique GSURI for the bucket associated with your Terra workspace. The root GSURI is accessible in the Dashboard page of your workspace in the "Cloud Information" section. Do not include the prefix gs:// in the string. Example: "fc-c526190d-4332-409b-8086-be7e1af9a0b6/theiacov_fasta_batch-2024-04-15-seq-run-1/"		Required
theiacov_fasta_batch	project_name	String	The name of the Terra project where the data can be found. Example: "my-terra-project"		Required
theiacov_fasta_batch	samplenames	Array[String]	The names of the samples to be analyzed. Example: this.sars-cov-2-samples.sars-cov-2-sample_id		Required
theiacov_fasta_batch	table_name	String	The name of the Terra table where the data can be found. Example: "sars-cov-2-sample"		Required
theiacov_fasta_batch	workspace_name	String	The name of the Terra workspace where the data can be found. Example: "my-terra-workspace"		Required
cat_files_fasta	cpu	Int	Number of CPUs to allocate to the task	2	Optional
cat_files_fasta	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
cat_files_fasta	docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1	Optional
cat_files_fasta	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
nextclade_v3	auspice_reference_tree_json	File	The phylogenetic reference tree which serves as a target for phylogenetic placement	default is inherited from NextClade dataset	Optional
nextclade_v3	cpu	Int	Number of CPUs to allocate to the task	2	Optional
nextclade_v3	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
nextclade_v3	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1	Optional
nextclade_v3	gene_annotations_gff	File	A genome annotation to specify how to translate the nucleotide sequence to proteins (genome_annotation.gff3). Specifying this enables codon-informed alignment and protein alignments. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html	None	Optional
nextclade_v3	input_ref	File	A nucleotide sequence which serves as a reference for the pairwise alignment of all input sequences. This is also the sequence which defines the coordinate system of the genome annotation. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/02-reference-sequence.html	None	Optional
nextclade_v3	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
nextclade_v3	nextclade_pathogen_json	File	General dataset configuration file. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/05-pathogen-config.html	None	Optional
nextclade_v3	verbosity	String	Verbosity of the Nextclade log output. Options besides the default are: "off", "error", "info", "debug", and "trace" (highest level of verbosity)	warn	Optional
organism_parameters	flu_segment	String			Optional
organism_parameters	flu_subtype	String			Optional
organism_parameters	gene_locations_bed_file	File			Optional
organism_parameters	genome_length_input	Int			Optional
organism_parameters	hiv_primer_version	String			Optional
organism_parameters	kraken_target_organism_input	String			Optional
organism_parameters	primer_bed_file	File			Optional
organism_parameters	reference_genome	File			Optional
organism_parameters	reference_gff_file	File			Optional
organism_parameters	vadr_max_length	Int			Optional
organism_parameters	vadr_mem	Int			Optional
organism_parameters	vadr_options	String			Optional
pangolin4	analysis_mode	String	Used to switch between usher and pangolearn analysis modes. Only use usher because pangolearn is no longer supported as of Pangolin v4.3 and higher versions.	None	Optional
pangolin4	cpu	Int	Number of CPUs to allocate to the task	4	Optional
pangolin4	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
pangolin4	expanded_lineage	Boolean	True/False that determines if a lineage should be expanded without aliases (e.g., BA.1 → B.1.1.529.1)	TRUE	Optional
pangolin4	max_ambig	Float	The maximum proportion of Ns allowed for pangolin to attempt an assignment	0.5	Optional
pangolin4	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
pangolin4	skip_designation_cache	Boolean	True/False that determines if the designation cache should be skipped	FALSE	Optional
pangolin4	skip_scorpio	Boolean	True/False that determines if scorpio should be skipped.	FALSE	Optional
sm_theiacov_fasta_wrangling	cpu	Int	Number of CPUs to allocate to the task	8	Optional
sm_theiacov_fasta_wrangling	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
sm_theiacov_fasta_wrangling	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-08-28-v4	Optional
sm_theiacov_fasta_wrangling	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
theiacov_fasta_batch	nextclade_dataset_name	String	Nextclade organism dataset name. Options: "nextstrain/sars-cov-2/wuhan-hu-1/orfs" However, if organism input is set correctly, this input will be automatically assigned the corresponding dataset name.	sars-cov-2	Optional
theiacov_fasta_batch	nextclade_dataset_tag	String	Nextclade dataset tag. Used for pulling up-to-date reference genomes and associated information specific to nextclade datasets (QC thresholds, organism-specific information like SARS-CoV-2 clade & lineage information, etc.) that is required for running the Nextclade tool.	2024-06-13--23-42-47Z	Optional
theiacov_fasta_batch	organism	String	The organism that is being analyzed. Options: "sars-cov-2"	sars-cov-2	Optional
theiacov_fasta_batch	pangolin_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.27	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
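
As an illustration of how the six required inputs in the table above fit together, the following Python sketch assembles them into a single inputs dictionary and prints it as JSON. It is a minimal sketch under stated assumptions, not an official template: the `theiacov_fasta_batch.` key prefix follows the Terra Task Name column, the helper `build_fasta_batch_inputs` is hypothetical, the one-to-one pairing of assembly files and sample names is assumed, and the `gs://example-bucket/...` paths are placeholders.

```python
import json


def build_fasta_batch_inputs(assembly_fastas, samplenames, bucket_name,
                             project_name, workspace_name, table_name):
    """Collect the required TheiaCoV_FASTA_Batch inputs (hypothetical helper)."""
    # Assumption: each assembly file is paired with exactly one sample name.
    if len(assembly_fastas) != len(samplenames):
        raise ValueError("assembly_fastas and samplenames must have the same length")
    # The input table asks for the bucket path without the gs:// prefix.
    if bucket_name.startswith("gs://"):
        raise ValueError("bucket_name must not include the gs:// prefix")
    values = {
        "assembly_fastas": list(assembly_fastas),
        "samplenames": list(samplenames),
        "bucket_name": bucket_name,
        "project_name": project_name,
        "workspace_name": workspace_name,
        "table_name": table_name,
    }
    # Assumption: inputs are namespaced as <workflow name>.<variable>.
    return {f"theiacov_fasta_batch.{name}": value for name, value in values.items()}


inputs = build_fasta_batch_inputs(
    # Placeholder file paths and sample names, for illustration only.
    assembly_fastas=["gs://example-bucket/sample1.fasta",
                     "gs://example-bucket/sample2.fasta"],
    samplenames=["sample1", "sample2"],
    # Example values taken from the descriptions in the table above.
    bucket_name="fc-c526190d-4332-409b-8086-be7e1af9a0b6/theiacov_fasta_batch-2024-04-15-seq-run-1/",
    project_name="my-terra-project",
    workspace_name="my-terra-workspace",
    table_name="sars-cov-2-sample",
)
print(json.dumps(inputs, indent=2))
```

On Terra, the two array inputs would normally be filled from table columns (for example this.sars-cov-2-samples.assembly_fasta) rather than typed by hand; the explicit lists here only make the sketch self-contained.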

Organism-specific parameters and logic¶

The organism_parameters sub-workflow is the first step in all TheiaCoV workflows. This step automatically sets the different parameters needed for each downstream tool to the appropriate value for the user-designated organism ("sars-cov-2" is the default organism).

The following tables include the relevant organism-specific parameters; all of these default values can be overwritten by providing a value for the "Overwrite Variable Name" field.

SARS-CoV-2 Defaults

Overwrite Variable Name	Organism	Default Value
gene_locations_bed_file	sars-cov-2	`"gs://theiagen-public-files-rp/terra/sars-cov-2-files/sc2_gene_locations.bed"`
genome_length_input	sars-cov-2	`29903`
nextclade_dataset_name_input	sars-cov-2	`"nextstrain/sars-cov-2/wuhan-hu-1/orfs"`
nextclade_dataset_tag_input	sars-cov-2	`"2024-07-17--12-57-03Z"`
pangolin_docker_image	sars-cov-2	`"us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.29"`
reference_genome	sars-cov-2	`"gs://theiagen-public-files-rp/terra/augur-sars-cov-2-references/MN908947.fasta"`
vadr_max_length	sars-cov-2	`30000`
vadr_mem	sars-cov-2	`8`
vadr_options	sars-cov-2	`"--noseqnamemax --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --out_allfasta"`

Mpox Defaults

Overwrite Variable Name	Organism	Default Value
gene_locations_bed_file	MPXV	`"gs://theiagen-public-files/terra/mpxv-files/mpox_gene_locations.bed"`
genome_length_input	MPXV	`197200`
kraken_target_organism_input	MPXV	`"Monkeypox virus"`
nextclade_dataset_name_input	MPXV	`"nextstrain/mpox/lineage-b.1"`
nextclade_dataset_tag_input	MPXV	`"2024-04-19--07-50-39Z"`
primer_bed_file	MPXV	`"gs://theiagen-public-files/terra/mpxv-files/MPXV.primer.bed"`
reference_genome	MPXV	`"gs://theiagen-public-files/terra/mpxv-files/MPXV.MT903345.reference.fasta"`
reference_gff_file	MPXV	`"gs://theiagen-public-files/terra/mpxv-files/Mpox-MT903345.1.reference.gff3"`
vadr_max_length	MPXV	`210000`
vadr_mem	MPXV	`8`
vadr_options	MPXV	`"--glsearch -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --out_allfasta --minimap2 --s_overhang 150"`

WNV Defaults

Overwrite Variable Name	Organism	Default Value	Notes
genome_length_input	WNV	`11000`
kraken_target_organism_input	WNV	`"West Nile virus"`
nextclade_dataset_name_input	WNV	`"NA"`	TheiaCoV's Nextclade currently does not support WNV
nextclade_dataset_tag_input	WNV	`"NA"`	TheiaCoV's Nextclade currently does not support WNV
primer_bed_file	WNV	`"gs://theiagen-public-files/terra/theiacov-files/WNV/WNV-L1_primer.bed"`
reference_genome	WNV	`"gs://theiagen-public-files/terra/theiacov-files/WNV/NC_009942.1_wnv_L1.fasta"`
vadr_max_length	WNV	`11000`
vadr_mem	WNV	`8`
vadr_options	WNV	`"--mkey flavi --mdir /opt/vadr/vadr-models-flavi/ --nomisc --noprotid --out_allfasta"`

Flu Defaults

Overwrite Variable Name	Organism	Flu Segment	Flu Subtype	Default Value	Notes
flu_segment	flu	all	all	N/A	TheiaCoV will attempt to automatically assign a flu segment
flu_subtype	flu	all	all	N/A	TheiaCoV will attempt to automatically assign a flu subtype
genome_length_input	flu	all	all	`13500`
vadr_max_length	flu	all	all	`13500`
vadr_mem	flu	all	all	`8`
vadr_options	flu	all	all	`"--atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --mkey flu"`
nextclade_dataset_name_input	flu	ha	h1n1	`"nextstrain/flu/h1n1pdm/ha/MW626062"`
nextclade_dataset_tag_input	flu	ha	h1n1	`"2024-07-03--08-29-55Z"`
reference_genome	flu	ha	h1n1	`"gs://theiagen-public-files-rp/terra/flu-references/reference_h1n1pdm_ha.fasta"`
nextclade_dataset_name_input	flu	ha	h3n2	`"nextstrain/flu/h3n2/ha/EPI1857216"`
nextclade_dataset_tag_input	flu	ha	h3n2	`"2024-08-08--05-08-21Z"`
reference_genome	flu	ha	h3n2	`"gs://theiagen-public-files-rp/terra/flu-references/reference_h3n2_ha.fasta"`
nextclade_dataset_name_input	flu	ha	victoria	`"nextstrain/flu/vic/ha/KX058884"`
nextclade_dataset_tag_input	flu	ha	victoria	`"2024-07-03--08-29-55Z"`
reference_genome	flu	ha	victoria	`"gs://theiagen-public-files-rp/terra/flu-references/reference_vic_ha.fasta"`
nextclade_dataset_name_input	flu	ha	yamagata	`"nextstrain/flu/yam/ha/JN993010"`
nextclade_dataset_tag_input	flu	ha	yamagata	`"2024-01-30--16-34-55Z"`
reference_genome	flu	ha	yamagata	`"gs://theiagen-public-files-rp/terra/flu-references/reference_yam_ha.fasta"`
nextclade_dataset_name_input	flu	na	h1n1	`"nextstrain/flu/h1n1pdm/na/MW626056"`
nextclade_dataset_tag_input	flu	na	h1n1	`"2024-07-03--08-29-55Z"`
reference_genome	flu	na	h1n1	`"gs://theiagen-public-files-rp/terra/flu-references/reference_h1n1pdm_na.fasta"`
nextclade_dataset_name_input	flu	na	h3n2	`"nextstrain/flu/h3n2/na/EPI1857215"`
nextclade_dataset_tag_input	flu	na	h3n2	`"2024-04-19--07-50-39Z"`
reference_genome	flu	na	h3n2	`"gs://theiagen-public-files-rp/terra/flu-references/reference_h3n2_na.fasta"`
nextclade_dataset_name_input	flu	na	victoria	`"nextstrain/flu/vic/na/CY073894"`
nextclade_dataset_tag_input	flu	na	victoria	`"2024-04-19--07-50-39Z"`
reference_genome	flu	na	victoria	`"gs://theiagen-public-files-rp/terra/flu-references/reference_vic_na.fasta"`
nextclade_dataset_name_input	flu	na	yamagata	`"NA"`
nextclade_dataset_tag_input	flu	na	yamagata	`"NA"`
reference_genome	flu	na	yamagata	`"gs://theiagen-public-files-rp/terra/flu-references/reference_yam_na.fasta"`
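
The segment- and subtype-specific rows above amount to a lookup keyed on the pair (flu segment, flu subtype). The Python sketch below illustrates that idea for the Nextclade dataset names only; the dataset names are copied from the table, while the dictionary, the function name `flu_nextclade_dataset`, and the choice to return `None` for the "NA" placeholder are assumptions made for illustration and are not taken from the workflow code.

```python
# Nextclade dataset names copied from the Flu Defaults table,
# keyed by (flu segment, flu subtype).
FLU_NEXTCLADE_DATASETS = {
    ("ha", "h1n1"): "nextstrain/flu/h1n1pdm/ha/MW626062",
    ("ha", "h3n2"): "nextstrain/flu/h3n2/ha/EPI1857216",
    ("ha", "victoria"): "nextstrain/flu/vic/ha/KX058884",
    ("ha", "yamagata"): "nextstrain/flu/yam/ha/JN993010",
    ("na", "h1n1"): "nextstrain/flu/h1n1pdm/na/MW626056",
    ("na", "h3n2"): "nextstrain/flu/h3n2/na/EPI1857215",
    ("na", "victoria"): "nextstrain/flu/vic/na/CY073894",
    ("na", "yamagata"): "NA",  # no dataset listed for this combination
}


def flu_nextclade_dataset(segment, subtype):
    """Return the default dataset name, or None when the table lists "NA"."""
    # Hypothetical normalization: accept upper- or lower-case input.
    key = (segment.lower(), subtype.lower())
    if key not in FLU_NEXTCLADE_DATASETS:
        raise KeyError(f"no default listed for segment={segment!r}, subtype={subtype!r}")
    name = FLU_NEXTCLADE_DATASETS[key]
    return None if name == "NA" else name


print(flu_nextclade_dataset("HA", "H3N2"))
print(flu_nextclade_dataset("na", "yamagata"))
```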

RSV-A Defaults

Overwrite Variable Name	Organism	Default Value
genome_length_input	rsv_a	`16000`
kraken_target_organism_input	rsv_a	`"Respiratory syncytial virus"`
nextclade_dataset_name_input	rsv_a	`"nextstrain/rsv/a/EPI_ISL_412866"`
nextclade_dataset_tag_input	rsv_a	`"2024-08-01--22-31-31Z"`
reference_genome	rsv_a	`"gs://theiagen-public-files-rp/terra/rsv_references/reference_rsv_a.fasta"`
vadr_max_length	rsv_a	`15500`
vadr_mem	rsv_a	`32`
vadr_options	rsv_a	`"-r --mkey rsv --xnocomp"`

RSV-B Defaults

Overwrite Variable Name	Organism	Default Value
genome_length_input	rsv_b	`16000`
kraken_target_organism_input	rsv_b	`"Human orthopneumovirus"`
nextclade_dataset_name_input	rsv_b	`"nextstrain/rsv/b/EPI_ISL_1653999"`
nextclade_dataset_tag_input	rsv_b	`"2024-08-01--22-31-31Z"`
reference_genome	rsv_b	`"gs://theiagen-public-files-rp/terra/rsv_references/reference_rsv_b.fasta"`
vadr_max_length	rsv_b	`15500`
vadr_mem	rsv_b	`32`
vadr_options	rsv_b	`"-r --mkey rsv --xnocomp"`

HIV Defaults

Overwrite Variable Name	Organism	Default Value	Notes
kraken_target_organism_input	HIV	Human immunodeficiency virus 1
genome_length_input	HIV-v1	9181	This version of HIV originates from Oregon
primer_bed_file	HIV-v1	gs://theiagen-public-files/terra/hivgc-files/HIV-1_v1.0.primer.hyphen.bed	This version of HIV originates from Oregon
reference_genome	HIV-v1	gs://theiagen-public-files/terra/hivgc-files/NC_001802.1.fasta	This version of HIV originates from Oregon
reference_gff_file	HIV-v1	gs://theiagen-public-files/terra/hivgc-files/NC_001802.1.gff3	This version of HIV originates from Oregon
genome_length_input	HIV-v2	9840	This version of HIV originates from Southern Africa
primer_bed_file	HIV-v2	gs://theiagen-public-files/terra/hivgc-files/HIV-1_v2.0.primer.hyphen400.1.bed	This version of HIV originates from Southern Africa
reference_genome	HIV-v2	gs://theiagen-public-files/terra/hivgc-files/AY228557.1.headerchanged.fasta	This version of HIV originates from Southern Africa
reference_gff_file	HIV-v2	gs://theiagen-public-files/terra/hivgc-files/AY228557.1.gff3	This version of HIV originates from Southern Africa
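
The override behaviour described in this section (organism defaults apply unless the user provides a value for the matching "Overwrite Variable Name") can be pictured as a dictionary merge. The following Python sketch uses a subset of the SARS-CoV-2 defaults from the table above; the function `resolve_parameters` and the convention that `None` means "not provided" are assumptions for illustration and do not describe how the organism_parameters sub-workflow is implemented.

```python
# Subset of the SARS-CoV-2 defaults listed in the table above.
SARS_COV_2_DEFAULTS = {
    "genome_length_input": 29903,
    "nextclade_dataset_name_input": "nextstrain/sars-cov-2/wuhan-hu-1/orfs",
    "nextclade_dataset_tag_input": "2024-07-17--12-57-03Z",
    "vadr_max_length": 30000,
    "vadr_mem": 8,
}


def resolve_parameters(defaults, user_inputs):
    """Return the defaults with any user-provided values applied on top."""
    # Assumption: None stands for "the user left this optional input empty".
    provided = {name: value for name, value in user_inputs.items() if value is not None}
    unknown = set(provided) - set(defaults)
    if unknown:
        raise KeyError(f"not an overwrite variable in this sketch: {sorted(unknown)}")
    return {**defaults, **provided}


# Hypothetical override: give VADR more memory, leave everything else alone.
resolved = resolve_parameters(
    SARS_COV_2_DEFAULTS,
    {"vadr_mem": 16, "genome_length_input": None},
)
print(resolved["vadr_mem"], resolved["genome_length_input"])
```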

Workflow Tasks¶

All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT, and ClearLabs workflows. These undertake read trimming and assembly appropriate to the input data type. TheiaCoV workflows subsequently launch default genome characterization modules for quality assessment, and additional taxa-specific characterization steps. When setting up the workflow, users may choose to use "optional tasks" as additions or alternatives to tasks run in the workflow by default.

Core tasks¶

These tasks run regardless of organism and carry out read trimming and various quality control steps.

versioning: Version Capture for TheiaCoV

The versioning task captures the workflow version from the GitHub (code repository) version.

Version Capture Technical details

	Links
Task	task_versioning.wdl

screen: Total Raw Read Quantification and Genome Size Estimation

The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples that do not meet these criteria will not be processed further by the workflow:

Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
Proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion percent of the basepairs are in either the read1 or the read2 file.
Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_length or bigger than max_genome_length.
Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than the min_coverage.

Read screening is undertaken on both the raw and cleaned reads. The task may be skipped by setting the skip_screen variable to true.

Default values vary between the PE and SE workflows. The rationale for these default values can be found below.

Variable	Rationale
`skip_screen`	Set to true to prevent the read screen from running. The screen runs by default to avoid wasting compute resources on insufficient data
`min_reads`	Minimum number of base pairs for 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus divided by 300 (longest Illumina read length)
`min_basepairs`	Greater than 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus
`min_genome_size`	Based on the Hepatitis delta (of the Deltavirus genus) genome, the smallest viral genome as of 2024-04-11 (1,700 bp)
`max_genome_size`	Based on the Pandoravirus salinus genome, the biggest viral genome, (2,673,870 bp) with 2 Mbp added
`min_coverage`	A bare-minimum coverage for genome characterization. Higher coverage would be required for high-quality phylogenetics.
`min_proportion`	Greater than 50% reads are in the read1 file; others are in the read2 file
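The pass/fail criteria described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in task_screen.wdl: the function name, the threshold values, the order of the checks, the treatment of min_proportion as a percentage, and the coverage estimate (total basepairs divided by estimated genome size) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScreenThresholds:
    # Hypothetical values for illustration only; not the workflow defaults.
    min_reads: int = 50
    min_basepairs: int = 15_000
    min_genome_size: int = 1_500
    max_genome_size: int = 2_700_000
    min_coverage: float = 10.0
    min_proportion: float = 40.0  # assumed to be a percentage of basepairs


def screen_reads(reads, bp_read1, bp_read2, est_genome_size, t=ScreenThresholds()):
    """Return "PASS" or a "FAIL; <reason>" string for one paired-end sample."""
    total_bp = bp_read1 + bp_read2
    # A sample fails if its read count is less than or equal to min_reads.
    if reads <= t.min_reads:
        return "FAIL; too few reads"
    # Share of basepairs held by the smaller of the two read files.
    smaller_share = 100.0 * min(bp_read1, bp_read2) / total_bp if total_bp else 0.0
    if smaller_share < t.min_proportion:
        return "FAIL; unbalanced read files"
    if total_bp < t.min_basepairs:
        return "FAIL; too few basepairs"
    if not t.min_genome_size <= est_genome_size <= t.max_genome_size:
        return "FAIL; estimated genome size out of range"
    # Assumption: coverage is estimated as total basepairs / estimated genome size.
    if total_bp / est_genome_size < t.min_coverage:
        return "FAIL; coverage too low"
    return "PASS"
```

Calling screen_reads once with raw-read counts and once with clean-read counts mirrors the way the screen is applied to both read sets.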

Screen Technical Details

There is a single WDL task for read screening. The screen task is run twice, once for raw reads and once for clean reads.

	Links
Task	task_screen.wdl

read_QC_trim_pe and read_QC_trim_se: Read Quality Trimming, Host and Adapter Removal, Quantification, and Identification for Illumina workflows

read_QC_trim is a sub-workflow within TheiaCoV that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between TheiaCoV PE and SE in the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.

Host removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.

NCBI-Scrub Technical Details

	Links
Task	task_ncbi_scrub.wdl
Software Source Code	NCBI Scrub on GitHub
Software Documentation	https://github.com/ncbi/sra-human-scrubber/blob/master/README.md

Read quality trimming

Either trimmomatic or fastp can be used for read-quality trimming. Trimmomatic is used by default. Both tools trim low-quality regions of reads with a sliding window (with a window size of trim_window_size), cutting once the average quality within the window falls below trim_quality_trim_score. They will both discard the read if it is trimmed below trim_minlen.
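To make the sliding-window idea concrete, here is a simplified Python sketch. It is not how trimmomatic or fastp are implemented (both differ in exactly where they place the cut); the function name and default values are hypothetical, and the three arguments stand in for trim_window_size, trim_quality_trim_score, and trim_minlen.

```python
def sliding_window_trim(qualities, window_size=4, min_avg_quality=20, min_len=25):
    """Return how many leading bases to keep, or 0 if the read is discarded.

    qualities: list of per-base Phred scores for a single read.
    """
    if not qualities:
        return 0
    keep = len(qualities)
    last_start = max(len(qualities) - window_size + 1, 1)
    for start in range(last_start):
        window = qualities[start:start + window_size]
        # Cut at the first window whose average quality falls below the threshold.
        if sum(window) / len(window) < min_avg_quality:
            keep = start
            break
    # Reads trimmed below the minimum length are dropped entirely.
    return keep if keep >= min_len else 0
```

For a read with 40 high-quality bases followed by 10 low-quality bases, the cut lands where the window average first drops below the threshold, and the read is kept because it is still longer than min_len.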

If fastp is selected for analysis, fastp also implements the additional read-trimming steps indicated below:

Parameter	Explanation
-g	enables polyG tail trimming
-5 20	enables read end-trimming
-3 20	enables read end-trimming
--detect_adapter_for_pe	enables adapter-trimming only for paired-end reads

Adapter removal

The BBDuk task removes adapters from sequence reads. To do this:

Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files.
BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
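The 31-mer matching step can be illustrated with a small Python sketch. This shows only the general k-mer filtering idea using exact matches; it is not BBDuk's implementation, and all function names are hypothetical.

```python
def kmers(seq, k=31):
    """Yield every k-mer of seq as an uppercase string."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_kmer_index(reference, k=31):
    """Collect the k-mers from both strands of a contaminant reference."""
    return set(kmers(reference, k)) | set(kmers(reverse_complement(reference), k))


def is_contaminant(read, index, k=31):
    """A read is flagged if any of its k-mers is present in the index."""
    return any(kmer in index for kmer in kmers(read, k))
```

In this sketch, an index built once from the PhiX genome would be reused for every read, and reads for which is_contaminant returns True would be filtered out.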

What are adapters and why do they need to be removed?

Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.

Read Quantification

There are two methods for read quantification to choose from: fastq-scan (default) or fastqc. Both quantify the forward and reverse reads in FASTQ files. In TheiaCoV_Illumina_PE, they also provide the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads. fastqc also provides a graphical visualization of the read quality.
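As a rough illustration of what read quantification involves, the following Python sketch counts reads and bases in a FASTQ file. It assumes standard four-line FASTQ records and is not how fastq-scan or fastqc are implemented; both function names are hypothetical.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, handling optional .gz compression."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def summarize_fastq(lines):
    """Return (number of reads, number of bases) from an iterable of FASTQ lines."""
    n_reads = n_bases = 0
    for line_number, line in enumerate(lines):
        # In a four-line record, the second line holds the sequence.
        if line_number % 4 == 1:
            n_reads += 1
            n_bases += len(line.strip())
    return n_reads, n_bases
```

Running summarize_fastq on the raw and on the clean read files, for example via `with open_fastq(path) as handle: summarize_fastq(handle)`, gives the two counts that can be compared to confirm that cleaning removed reads.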

Read Identification

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

Kraken2 is run on the set of raw reads, provided as input, as well as the set of clean reads that result from the read_QC_trim workflow.

Database-dependent

TheiaCoV automatically uses a viral-specific Kraken2 database.
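Percentages such as the share of human or SARS-CoV-2 reads can be read from a Kraken2 report. The sketch below assumes the standard six-column, tab-delimited report (percentage, clade reads, directly assigned reads, rank code, NCBI taxonomy ID, name); the function name is hypothetical and this is not the parsing code used in task_kraken2.wdl.

```python
def percent_for_taxid(report_lines, taxid):
    """Return the clade-level read percentage for a taxonomy ID, or 0.0 if absent."""
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, _clade_reads, _direct_reads, _rank, line_taxid, _name = fields[:6]
        if line_taxid.strip() == str(taxid):
            return float(percent)
    return 0.0


# NCBI taxonomy IDs used in the usage example below.
HUMAN_TAXID = 9606
SARS_COV_2_TAXID = 2697049
```

With a report opened as a file handle, percent_for_taxid(handle, HUMAN_TAXID) returns the percentage of reads assigned to Homo sapiens and its descendants.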

Kraken2 Technical Details

	Links
Task	task_kraken2.wdl
Software Source Code	Kraken2 on GitHub
Software Documentation	https://github.com/DerrickWood/kraken2/wiki
Original Publication(s)	Improved metagenomic analysis with Kraken 2

read_QC_trim Technical Details

	Links
Sub-workflow	wf_read_QC_trim.wdl
Tasks	task_fastp.wdl task_trimmomatic.wdl task_bbduk.wdl task_fastq_scan.wdl task_midas.wdl task_kraken2.wdl
Software Source Code	fastp; Trimmomatic; fastq-scan; MIDAS; Kraken2
Software Documentation	fastp; Trimmomatic; BBDuk; fastq-scan; MIDAS; Kraken2
Original Publication(s)	Trimmomatic: a flexible trimmer for Illumina sequence data fastp: an ultra-fast all-in-one FASTQ preprocessor An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography Improved metagenomic analysis with Kraken 2

read_QC_trim_ONT: Read Quality Trimming, Host Removal, and Identification for ONT data

read_QC_trim is a sub-workflow within TheiaCoV that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below.

Host removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.

NCBI-Scrub Technical Details

	Links
Task	task_ncbi_scrub.wdl
Software Source Code	NCBI Scrub on GitHub
Software Documentation	https://github.com/ncbi/sra-human-scrubber/blob/master/README.md

Read quality filtering

Read filtering is performed using artic guppyplex, which performs a quality check by filtering the reads by length to remove chimeric reads.
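The idea behind length-based filtering is that chimeric reads, formed when amplicons are joined together, are longer than a single expected amplicon. A minimal Python sketch follows; the function name and the length bounds in the example are hypothetical, and this is not the guppyplex implementation.

```python
def filter_by_length(reads, min_length, max_length):
    """Keep (name, sequence) pairs whose length lies within [min_length, max_length]."""
    return [(name, seq) for name, seq in reads if min_length <= len(seq) <= max_length]


# Example with made-up bounds for a roughly 400 bp amplicon scheme.
example_reads = [("ok", "A" * 450), ("short", "A" * 120), ("chimera", "A" * 900)]
kept = filter_by_length(example_reads, min_length=400, max_length=700)
```

Here only the read named "ok" survives; the short fragment and the likely chimera are both removed.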

Read Identification

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

Kraken2 is run on the set of raw reads, provided as input, as well as the set of clean reads that result from the read_QC_trim workflow.

Database-dependent

TheiaCoV automatically uses a viral-specific Kraken2 database.

Kraken2 Technical Details

	Links
Task	task_kraken2.wdl
Software Source Code	Kraken2 on GitHub
Software Documentation	https://github.com/DerrickWood/kraken2/wiki
Original Publication(s)	Improved metagenomic analysis with Kraken 2

read_QC_trim Technical Details

Each TheiaCoV workflow calls a sub-workflow listed below, which then calls the individual tasks:

Workflow	TheiaCoV_ONT
Sub-workflow	wf_read_QC_trim_ont.wdl
Tasks	task_ncbi_scrub.wdl (SE subtask) task_artic_guppyplex.wdl task_kraken2.wdl
Software Source Code	NCBI Scrub on GitHub Artic on GitHub Kraken2 on GitHub
Software Documentation	NCBI Scrub Artic pipeline Kraken2
Original Publication(s)	STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions Improved metagenomic analysis with Kraken 2

Assembly tasks¶

One of these tasks is run, depending on the organism and workflow type.

ivar_consensus: Alignment, Consensus, Variant Detection, and Assembly Statistics for non-flu organisms in Illumina workflows

ivar_consensus is a sub-workflow within TheiaCoV that performs reference-based consensus assembly using the iVar tool by Nathan Grubaugh from the Andersen lab.

The following steps are performed as part of this sub-workflow:

Cleaned reads are aligned to the appropriate reference genome (see also the organism-specific parameters and logic section above) using BWA to generate a Binary Alignment Map (BAM) file.
If trim_primers is set to true, primers will be removed using ivar trim.
General statistics about the remaining reads are calculated.
The ivar consensus command is run to generate a consensus assembly.
General statistics about the assembly are calculated.
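The consensus step can be pictured as a per-position vote over the aligned bases. The Python sketch below is a deliberate simplification and not iVar's algorithm: it calls the most frequent base when depth is sufficient and N otherwise, whereas iVar also applies a frequency threshold and can emit ambiguity codes. The function name and the min_depth default are hypothetical.

```python
from collections import Counter


def call_consensus(pileup_columns, min_depth=10):
    """Build a consensus string from per-position strings of observed bases."""
    consensus = []
    for bases in pileup_columns:
        counts = Counter(b.upper() for b in bases if b.upper() in "ACGT")
        depth = sum(counts.values())
        # Positions with too little support are masked as N.
        if depth < max(min_depth, 1):
            consensus.append("N")
        else:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)
```

For example, a position covered by nine A and three C bases is called A at a minimum depth of 10, while a position covered by only five reads is masked.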

iVar Consensus Technical Details

Workflow	TheiaCoV_Illumina_PE & TheiaCoV_Illumina_SE
Sub-workflow	wf_ivar_consensus.wdl
Tasks	task_bwa.wdl task_ivar_primer_trim.wdl task_assembly_metrics.wdl task_ivar_variant_call.wdl task_ivar_consensus.wdl
Software Source Code	BWA on GitHub, iVar on GitHub
Software Documentation	BWA on SourceForge, iVar on GitHub
Original Publication(s)	Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar

artic_consensus: Alignment, Primer Trimming, Variant Detection, and Consensus for non-flu organisms in ONT & ClearLabs workflows

Briefly, input reads are aligned to the appropriate reference with minimap2 to generate a Binary Alignment Map (BAM) file. Primer sequences are then removed from the BAM file and a consensus assembly file is generated using the Artic minion Medaka argument.

Read-trimming is performed on raw read data generated on the ClearLabs instrument and is thus not a required step in the TheiaCoV_ClearLabs workflow.

General statistics about the assembly are generated with the consensus_qc task (task_assembly_metrics.wdl).
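Several of these statistics are simple counts over the consensus sequence. The Python sketch below follows the definitions given for the number_N, number_Degenerate, number_Total, assembly_length_unambiguous, and percent_reference_coverage outputs; the function itself is hypothetical, assumes the sequence contains only IUPAC nucleotide codes, and is not the code in task_assembly_metrics.wdl.

```python
def assembly_metrics(sequence, reference_length):
    """Count basecall categories in a consensus sequence."""
    seq = sequence.upper()
    number_total = len(seq)
    number_n = seq.count("N")
    unambiguous = sum(seq.count(base) for base in "ACGT")
    # Everything that is neither A/C/G/T nor N is treated as a degenerate call.
    number_degenerate = number_total - number_n - unambiguous
    return {
        "number_Total": number_total,
        "number_N": number_n,
        "number_Degenerate": number_degenerate,
        "assembly_length_unambiguous": unambiguous,
        "percent_reference_coverage": 100.0 * unambiguous / reference_length,
    }
```

For a SARS-CoV-2 assembly, reference_length would be 29903, so an assembly with 29,000 unambiguous bases covers roughly 97% of the reference.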

Artic Consensus Technical Details

	Links
Task	task_artic_consensus.wdl
Software Source Code	Artic on GitHub
Software Documentation	Artic pipeline

irma: Assembly and Characterization for flu in TheiaCoV_Illumina_PE & TheiaCoV_ONT

Cleaned reads are assembled using irma, which does not use a reference due to the rapid evolution and high variability of influenza. irma also performs typing and subtyping as part of the assembly process.

General statistics about the assembly are generated with the consensus_qc task (task_assembly_metrics.wdl).

IRMA Technical Details

	Links
Task	task_irma.wdl
Software Documentation	IRMA website
Original Publication(s)	Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler

Organism-specific characterization tasks¶

The following tasks only run for the appropriate organism designation. The following table illustrates which characterization tools are run for the indicated organism.

	SARS-CoV-2	MPXV	HIV	WNV	Influenza	RSV-A	RSV-B
Pangolin	✅	❌	❌	❌	❌	❌	❌
Nextclade	✅	✅	❌	❌	✅	✅	✅
VADR	✅	✅	❌	✅	✅	✅	✅
Quasitools HyDRA	❌	❌	✅	❌	❌	❌	❌
IRMA	❌	❌	❌	❌	✅	❌	❌
Abricate	❌	❌	❌	❌	✅	❌	❌
% Gene Coverage	✅	✅	❌	❌	❌	❌	❌
Antiviral Detection	❌	❌	❌	❌	✅	❌	❌
GenoFLU	❌	❌	❌	❌	✅	❌	❌

pangolin

Pangolin designates SARS-CoV-2 lineage assignments.

Pangolin Technical Details

	Links
Task	task_pangolin.wdl
Software Source Code	Pangolin on GitHub
Software Documentation	Pangolin website

nextclade

"Nextclade is an open-source project for viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement."

Nextclade Technical Details

	Links
Task	task_nextclade.wdl
Software Source Code	https://github.com/nextstrain/nextclade
Software Documentation	Nextclade
Original Publication(s)	Nextclade: clade assignment, mutation calling and quality control for viral genomes.

vadr

VADR annotates and validates completed assembly files.

VADR Technical Details

	Links
Task	task_vadr.wdl
Software Source Code	https://github.com/ncbi/vadr
Software Documentation	https://github.com/ncbi/vadr/wiki
Original Publication(s)	For SARS-CoV-2: Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR For non-SARS-CoV-2: VADR: validation and annotation of virus sequence submissions to GenBank

quasitools

quasitools performs genome characterization for HIV.

Quasitools Technical Details

	Links
Task	task_quasitools.wdl
Software Source Code	https://github.com/phac-nml/quasitools/
Software Documentation	Quasitools HyDRA

irma

IRMA assigns types and subtype/lineages in addition to performing assembly of flu genomes. Please see the section above under "Assembly tasks" to find more information regarding this tool.

IRMA Technical Details

	Links
Task	task_irma.wdl
Software Documentation	IRMA website
Original Publication(s)	Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler

abricate

Abricate assigns types and subtypes/lineages for flu samples.

Abricate Technical Details

	Links
Task	task_abricate.wdl (abricate_flu subtask)
Software Source Code	ABRicate on GitHub
Software Documentation	ABRicate on GitHub

gene_coverage

This task calculates the percent of the gene covered above a minimum depth. By default, it runs for SARS-CoV-2 and MPXV, but if a bed file is provided with regions of interest, this task will be run for other organisms as well.
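The calculation amounts to counting how many positions within a gene reach the minimum depth. A minimal Python sketch, assuming a per-position depth list and 1-based inclusive gene coordinates (BED files use 0-based, half-open coordinates, so a conversion would be needed); the function name and default min_depth are hypothetical.

```python
def percent_gene_coverage(depths, gene_start, gene_end, min_depth=10):
    """Percent of positions in [gene_start, gene_end] with depth >= min_depth.

    depths: per-position read depth, where index 0 is reference position 1.
    """
    region = depths[gene_start - 1:gene_end]
    if not region:
        return 0.0
    covered = sum(1 for depth in region if depth >= min_depth)
    return 100.0 * covered / len(region)
```

A gene spanning four positions with depths 10, 20, 30, and 2 would be reported as 75% covered at a minimum depth of 10.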

Gene Coverage Technical Details

	Links
Task	task_gene_coverage.wdl

flu_antiviral_substitutions

This sub-workflow determines which, if any, antiviral mutations are present in the sample.

The assembled HA, NA, PA, PB1, and PB2 segments are compared against a list of known amino-acid substitutions associated with resistance to the antivirals A_315675, compound_367, Favipiravir, Fludase, L_742_001, Laninamivir, Oseltamivir (Tamiflu), Peramivir, Pimodivir, Xofluza, and Zanamivir. The list of known antiviral amino-acid substitutions can be expanded via the optional user input antiviral_aa_subs in the format "NA:V95A,HA:I97V", i.e. Protein:AAPositionAA.
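The antiviral_aa_subs format can be checked before launching the workflow. The Python sketch below parses a string in the Protein:AAPositionAA format into tuples; the function name and the validation rules (single-letter amino acids, alphanumeric protein names) are assumptions for illustration and are not taken from the workflow code.

```python
import re

# Protein name, reference amino acid, position, substituted amino acid.
SUBSTITUTION = re.compile(r"^([A-Za-z0-9]+):([A-Z])(\d+)([A-Z])$")


def parse_antiviral_aa_subs(value):
    """Parse a string like "NA:V95A,HA:I97V" into (protein, ref, position, alt) tuples."""
    parsed = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        match = SUBSTITUTION.match(item)
        if match is None:
            raise ValueError(f"Not in Protein:AAPositionAA format: {item!r}")
        protein, ref_aa, position, alt_aa = match.groups()
        parsed.append((protein, ref_aa, int(position), alt_aa))
    return parsed
```

Parsing "NA:V95A,HA:I97V" yields one tuple for the NA substitution and one for the HA substitution, while a malformed entry raises a ValueError.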

Antiviral Substitutions Technical Details

	Links
Workflow	wf_influenza_antiviral_substitutions.wdl

genoflu

This sub-workflow determines the whole-genome genotype of an H5N1 flu sample.

GenoFLU Technical Details

	Links
Task	task_genoflu.wdl
Software Source Code	GenoFLU on GitHub

Outputs¶

All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)

Variable	Type	Description	Workflow
abricate_flu_database	String	ABRicate database used for analysis	FASTA, ONT, PE
abricate_flu_results	File	File containing all results from ABRicate	FASTA, ONT, PE
abricate_flu_subtype	String	Flu subtype as determined by ABRicate	FASTA, ONT, PE
abricate_flu_type	String	Flu type as determined by ABRicate	FASTA, ONT, PE
abricate_flu_version	String	Version of ABRicate	FASTA, ONT, PE
aligned_bai	File	Index companion file to the bam file generated during the consensus assembly process	CL, ONT, PE, SE
aligned_bam	File	Primer-trimmed BAM file; generated during consensus assembly process	CL, ONT, PE, SE
artic_docker	String	Docker image utilized for read trimming and consensus genome assembly	CL, ONT
artic_version	String	Version of the Artic software utilized for read trimming and consensus genome assembly	CL, ONT
assembly_fasta	File	Consensus genome assembly; for lower quality flu samples, the output may state "Assembly could not be generated" when there is too little and/or too low quality data for IRMA to produce an assembly	CL, ONT, PE, SE
assembly_length_unambiguous	Int	Number of unambiguous basecalls within the consensus assembly	CL, FASTA, ONT, PE, SE
assembly_mean_coverage	Float	Mean sequencing depth throughout the consensus assembly. Generated after performing primer trimming and calculated using the SAMtools coverage command	CL, ONT, PE, SE
assembly_method	String	Method employed to generate consensus assembly	CL, FASTA, ONT, PE, SE
auspice_json	File	Auspice-compatible JSON output generated from Nextclade analysis that includes the Nextclade default samples for clade-typing and the single sample placed on this tree	CL, FASTA, ONT, PE, SE
auspice_json_flu_ha	File	Auspice-compatible JSON output generated from Nextclade analysis on Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree	ONT, PE
auspice_json_flu_na	File	Auspice-compatible JSON output generated from Nextclade analysis on Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree	ONT, PE
bbduk_docker	String	Docker image used to run BBDuk	PE, SE
bwa_version	String	Version of BWA used to map read data to the reference genome	PE, SE
consensus_flagstat	File	Output from the SAMtools flagstat command to assess quality of the alignment file (BAM)	CL, ONT, PE, SE
consensus_n_variant_min_depth	Int	Minimum read depth to call variants for iVar consensus and iVar variants	PE, SE
consensus_stats	File	Output from the SAMtools stats command to assess quality of the alignment file (BAM)	CL, ONT, PE, SE
est_coverage_clean	Float	Estimated coverage of the clean reads	ONT
est_coverage_raw	Float	Estimated coverage of the raw reads	ONT
est_percent_gene_coverage_tsv	File	Percent coverage for each gene in the organism being analyzed (depending on the organism input)	CL, ONT, PE, SE
fastp_html_report	File	HTML report for fastp	PE, SE
fastp_version	String	Fastp version used	PE, SE
fastq_scan_num_reads_clean_pairs	String	Number of paired reads after filtering as determined by fastq_scan	PE
fastq_scan_num_reads_clean1	Int	Number of forward reads after filtering as determined by fastq_scan	CL, PE, SE
fastq_scan_num_reads_clean2	Int	Number of reverse reads after filtering as determined by fastq_scan	PE
fastq_scan_num_reads_raw_pairs	String	Number of paired reads identified in the input fastq files as determined by fastq_scan	PE
fastq_scan_num_reads_raw1	Int	Number of forward reads identified in the input fastq files as determined by fastq_scan	CL, PE, SE
fastq_scan_num_reads_raw2	Int	Number of reverse reads identified in the input fastq files as determined by fastq_scan	PE
fastq_scan_r1_mean_q_clean	Float	Forward read mean quality value after quality trimming and adapter removal
fastq_scan_r1_mean_q_raw	Float	Forward read mean quality value before quality trimming and adapter removal
fastq_scan_r1_mean_readlength_clean	Float	Forward read mean read length value after quality trimming and adapter removal
fastq_scan_r1_mean_readlength_raw	Float	Forward read mean read length value before quality trimming and adapter removal
fastq_scan_version	String	Version of fastq_scan used for read QC analysis	CL, PE, SE
fastqc_clean1_html	File	Graphical visualization of clean forward read quality from fastqc to open in an internet browser	PE, SE
fastqc_clean2_html	File	Graphical visualization of clean reverse read quality from fastqc to open in an internet browser	PE
fastqc_docker	String	Docker container used for fastqc	PE, SE
fastqc_num_reads_clean_pairs	String	Number of read pairs after cleaning by fastqc	PE
fastqc_num_reads_clean1	Int	Number of forward reads after cleaning by fastqc	PE, SE
fastqc_num_reads_clean2	Int	Number of reverse reads after cleaning by fastqc	PE
fastqc_num_reads_raw_pairs	Int	Number of raw read pairs as computed by fastqc	PE
fastqc_num_reads_raw1	Int	Number of raw forward-facing reads as computed by fastqc	PE, SE
fastqc_num_reads_raw2	Int	Number of raw reverse-facing reads as computed by fastqc	PE
fastqc_raw1_html	File	Graphical visualization of raw forward read quality from fastqc to open in an internet browser	PE, SE
fastqc_raw2_html	File	Graphical visualization of raw reverse read quality from fastqc to open in an internet browser	PE
fastqc_version	String	Version of fastqc software used	PE, SE
flu_A_315675_resistance	String	resistance mutations to A_315675	ONT, PE
flu_amantadine_resistance	String	resistance mutations to amantadine	ONT, PE
flu_compound_367_resistance	String	resistance mutations to compound_367	ONT, PE
flu_favipiravir_resistance	String	resistance mutations to favipiravir	ONT, PE
flu_fludase_resistance	String	resistance mutations to fludase	ONT, PE
flu_L_742_001_resistance	String	resistance mutations to L_742_001	ONT, PE
flu_laninamivir_resistance	String	resistance mutations to laninamivir	ONT, PE
flu_oseltamivir_resistance	String	resistance mutations to oseltamivir (Tamiflu®)	ONT, PE
flu_peramivir_resistance	String	resistance mutations to peramivir (Rapivab®)	ONT, PE
flu_pimodivir_resistance	String	resistance mutations to pimodivir	ONT, PE
flu_rimantadine_resistance	String	resistance mutations to rimantadine	ONT, PE
flu_xofluza_resistance	String	resistance mutations to xofluza (Baloxavir marboxil)	ONT, PE
flu_zanamivir_resistance	String	resistance mutations to zanamivir (Relenza®)	ONT, PE
genoflu_all_segments	String	The genotypes for each individual flu segment	FASTA, ONT, PE
genoflu_genotype	String	The genotype of the whole genome, based off of the individual segments types	FASTA, ONT, PE
genoflu_output_tsv	File	The output file from GenoFLU	FASTA, ONT, PE
genoflu_version	String	The version of GenoFLU used	FASTA, ONT, PE
irma_docker	String	Docker image used to run IRMA	ONT, PE
irma_ha_segment_fasta	File	HA (Haemagglutinin) assembly fasta file	ONT, PE
irma_mp_segment_fasta	File	MP (Matrix Protein) assembly fasta file	ONT, PE
irma_na_segment_fasta	File	NA (Neuraminidase) assembly fasta file	ONT, PE
irma_np_segment_fasta	File	NP (Nucleoprotein) assembly fasta file	ONT, PE
irma_ns_segment_fasta	File	NS (Nonstructural) assembly fasta file	ONT, PE
irma_pa_segment_fasta	File	PA (Polymerase acidic) assembly fasta file	ONT, PE
irma_pb1_segment_fasta	File	PB1 (Polymerase basic 1) assembly fasta file	ONT, PE
irma_pb2_segment_fasta	File	PB2 (Polymerase basic 2) assembly fasta file	ONT, PE
irma_subtype	String	Flu subtype as determined by IRMA	ONT, PE
irma_subtype_notes	String	Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column"	ONT, PE
irma_type	String	Flu type as determined by IRMA	ONT, PE
irma_version	String	Version of IRMA used	ONT, PE
ivar_tsv	File	Variant descriptor file generated by iVar variants	PE, SE
ivar_variant_proportion_intermediate	String	The proportion of variants of intermediate frequency	PE, SE
ivar_variant_version	String	Version of iVar for running the iVar variants command	PE, SE
ivar_vcf	File	iVar tsv output converted to VCF format	PE, SE
ivar_version_consensus	String	Version of iVar for running the iVar consensus command	PE, SE
ivar_version_primtrim	String	Version of iVar for running the iVar trim command	PE, SE
kraken_human	Float	Percent of human read data detected using the Kraken2 software	CL, ONT, PE, SE
kraken_human_dehosted	Float	Percent of human read data detected using the Kraken2 software after host removal	CL, ONT, PE
kraken_report	File	Full Kraken report	CL, ONT, PE, SE
kraken_report_dehosted	File	Full Kraken report after host removal	CL, ONT, PE
kraken_sc2	Float	Percent of SARS-CoV-2 read data detected using the Kraken2 software	CL, ONT, PE, SE
kraken_sc2_dehosted	Float	Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal	CL, ONT, PE
kraken_target_organism	String	Percent of target organism read data detected using the Kraken2 software	CL, ONT, PE, SE
kraken_target_organism_dehosted	String	Percent of target organism read data detected using the Kraken2 software after host removal	CL, ONT, PE
kraken_target_organism_name	String	The name of the target organism; e.g., "Monkeypox" or "Human immunodeficiency virus"	CL, ONT, PE, SE
kraken_version	String	Version of Kraken software used	CL, ONT, PE, SE
meanbaseq_trim	Float	Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming	CL, ONT, PE, SE
meanmapq_trim	Float	Mean quality of the mapped reads to the reference genome after primer trimming	CL, ONT, PE, SE
medaka_reference	String	Reference sequence used in medaka task	CL, ONT
medaka_vcf	File	A VCF file containing the identified variants	ONT
nanoplot_docker	String	Docker image used to run Nanoplot	ONT
nanoplot_html_clean	File	An HTML report describing the clean reads	ONT
nanoplot_html_raw	File	An HTML report describing the raw reads	ONT
nanoplot_num_reads_clean1	Float	Number of clean reads	ONT
nanoplot_num_reads_raw1	Float	Number of raw reads	ONT
nanoplot_r1_est_coverage_clean	Float	Estimated coverage on the clean reads by nanoplot	ONT
nanoplot_r1_est_coverage_raw	Float	Estimated coverage on the raw reads by nanoplot	ONT
nanoplot_r1_mean_q_clean	Float	Mean quality score of clean forward reads	ONT
nanoplot_r1_mean_q_raw	Float	Mean quality score of raw forward reads	ONT
nanoplot_r1_mean_readlength_clean	Float	Mean read length of clean forward reads	ONT
nanoplot_r1_mean_readlength_raw	Float	Mean read length of raw forward reads	ONT
nanoplot_r1_median_q_clean	Float	Median quality score of clean forward reads	ONT
nanoplot_r1_median_q_raw	Float	Median quality score of raw forward reads	ONT
nanoplot_r1_median_readlength_clean	Float	Median read length of clean forward reads	ONT
nanoplot_r1_median_readlength_raw	Float	Median read length of raw forward reads	ONT
nanoplot_r1_n50_clean	Float	N50 of clean forward reads	ONT
nanoplot_r1_n50_raw	Float	N50 of raw forward reads	ONT
nanoplot_r1_stdev_readlength_clean	Float	Standard deviation read length of clean forward reads	ONT
nanoplot_r1_stdev_readlength_raw	Float	Standard deviation read length of raw forward reads	ONT
nanoplot_tsv_clean	File	A TSV report describing the clean reads	ONT
nanoplot_tsv_raw	File	A TSV report describing the raw reads	ONT
nanoplot_version	String	Version of nanoplot tool used	ONT
nextclade_aa_dels	String	Amino-acid deletions as detected by NextClade. Will be blank for Flu	CL, FASTA, ONT, PE, SE
nextclade_aa_dels_flu_ha	String	Amino-acid deletions as detected by NextClade. Specific to Flu; it includes deletions for HA segment	ONT, PE
nextclade_aa_dels_flu_na	String	Amino-acid deletions as detected by NextClade. Specific to Flu; it includes deletions for NA segment	ONT, PE
nextclade_aa_subs	String	Amino-acid substitutions as detected by Nextclade. Will be blank for Flu	CL, FASTA, ONT, PE, SE
nextclade_aa_subs_flu_ha	String	Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for HA segment	ONT, PE
nextclade_aa_subs_flu_na	String	Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for NA segment	ONT, PE
nextclade_clade	String	Nextclade clade designation, will be blank for Flu.	CL, FASTA, ONT, PE, SE
nextclade_clade_flu_ha	String	Nextclade clade designation, specific to Flu HA segment	ONT, PE
nextclade_clade_flu_na	String	Nextclade clade designation, specific to Flu NA segment	ONT, PE
nextclade_docker	String	Docker image used to run Nextclade	CL, FASTA, ONT, PE, SE
nextclade_ds_tag	String	Dataset tag used to run Nextclade. Will be blank for Flu	CL, FASTA, ONT, PE, SE
nextclade_ds_tag_flu_ha	String	Dataset tag used to run Nextclade, specific to Flu HA segment	ONT, PE
nextclade_ds_tag_flu_na	String	Dataset tag used to run Nextclade, specific to Flu NA segment	ONT, PE
nextclade_json	File	Nextclade output in JSON file format. Will be blank for Flu	CL, FASTA, ONT, PE, SE
nextclade_json_flu_ha	File	Nextclade output in JSON file format, specific to Flu HA segment	ONT, PE
nextclade_json_flu_na	File	Nextclade output in JSON file format, specific to Flu NA segment	ONT, PE
nextclade_lineage	String	Nextclade lineage designation	CL, FASTA, ONT, PE, SE
nextclade_qc	String	QC metric as determined by Nextclade. (For Flu, this output will be specific to HA segment)	CL, FASTA, ONT, PE, SE
nextclade_qc_flu_ha	String	QC metric as determined by Nextclade, specific to Flu HA segment	ONT, PE
nextclade_qc_flu_na	String	QC metric as determined by Nextclade, specific to Flu NA segment	ONT, PE
nextclade_tsv	File	Nextclade output in TSV file format. (For Flu, this output will be specific to HA segment)	CL, FASTA, ONT, PE, SE
nextclade_tsv_flu_ha	File	Nextclade output in TSV file format, specific to Flu HA segment	ONT, PE
nextclade_tsv_flu_na	File	Nextclade output in TSV file format, specific to Flu NA segment	ONT, PE
nextclade_version	String	The version of Nextclade software used	CL, FASTA, ONT, PE, SE
number_Degenerate	Int	Number of degenerate basecalls within the consensus assembly	CL, FASTA, ONT, PE, SE
number_N	Int	Number of fully ambiguous basecalls within the consensus assembly	CL, FASTA, ONT, PE, SE
number_Total	Int	Total number of nucleotides within the consensus assembly	CL, FASTA, ONT, PE, SE
pango_lineage	String	Pango lineage as determined by Pangolin	CL, FASTA, ONT, PE, SE
pango_lineage_expanded	String	Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1"	CL, FASTA, ONT, PE, SE
pango_lineage_report	File	Full Pango lineage report generated by Pangolin	CL, FASTA, ONT, PE, SE
pangolin_assignment_version	String	The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment	CL, FASTA, ONT, PE, SE
pangolin_conflicts	String	Number of lineage conflicts as determined by Pangolin	CL, FASTA, ONT, PE, SE
pangolin_docker	String	Docker image used to run Pangolin	CL, FASTA, ONT, PE, SE
pangolin_notes	String	Lineage notes as determined by Pangolin	CL, FASTA, ONT, PE, SE
pangolin_versions	String	All Pangolin software and database versions	CL, FASTA, ONT, PE, SE
percent_reference_coverage	Float	Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of the reference genome (SC2: 29903) x 100	CL, FASTA, ONT, PE, SE
primer_bed_name	String	Name of the primer bed files used for primer trimming	CL, ONT, PE, SE
primer_trimmed_read_percent	Float	Percentage of read data with primers trimmed as determined by iVar trim	PE, SE
qc_check	String	The results of the QC Check task	CL, FASTA, ONT, PE, SE
qc_standard	File	The file used in the QC Check task containing the QC thresholds.	CL, FASTA, ONT, PE, SE
quasitools_coverage_file	File	The coverage report created by Quasitools HyDRA	ONT, PE
quasitools_date	String	Date of Quasitools analysis	ONT, PE
quasitools_dr_report	File	Drug resistance report created by Quasitools HyDRA	ONT, PE
quasitools_hydra_vcf	File	The VCF created by Quasitools HyDRA	ONT, PE
quasitools_mutations_report	File	The mutation report created by Quasitools HyDRA	ONT, PE
quasitools_version	String	Version of Quasitools used	ONT, PE
read_screen_clean	String	A PASS or FAIL flag for input reads after cleaning	ONT, PE, SE
read_screen_raw	String	A PASS or FAIL flag for input reads	ONT, PE, SE
read1_aligned	File	Forward read file of only aligned reads	CL, ONT, PE, SE
read1_clean	File	Forward read file after quality trimming and adapter removal	PE, SE
read1_dehosted	File	Dehosted forward reads; suggested read file for SRA submission	CL, ONT, PE
read1_trimmed	File	Forward read file after quality trimming and adapter removal	ONT
read1_unaligned	File	Forward read file of unaligned reads	PE, SE
read2_aligned	File	Reverse read file of only aligned reads	PE
read2_clean	File	Reverse read file after quality trimming and adapter removal	PE
read2_dehosted	File	Dehosted reverse reads; suggested read file for SRA submission	PE
read2_unaligned	File	Reverse read file of unaligned reads	PE
samtools_version	String	The version of SAMtools used to sort and index the alignment file	ONT, PE, SE
samtools_version_consensus	String	The version of SAMtools used to create the pileup before running iVar consensus	PE, SE
samtools_version_primtrim	String	The version of SAMtools used to create the pileup before running iVar trim	PE, SE
samtools_version_stats	String	The version of SAMtools used to assess the quality of read mapping	CL, PE, SE
sc2_s_gene_mean_coverage	Float	Mean read depth for the S gene in SARS-CoV-2	CL, ONT, PE, SE
sc2_s_gene_percent_coverage	Float	Percent coverage of the S gene in SARS-CoV-2	CL, ONT, PE, SE
seq_platform	String	Description of the sequencing methodology used to generate the input read data	CL, FASTA, ONT, PE, SE
sorted_bam_unaligned	File	A BAM file that only contains reads that did not align to the reference	PE, SE
sorted_bam_unaligned_bai	File	Index companion file to a BAM file that only contains reads that did not align to the reference	PE, SE
theiacov_clearlabs_analysis_date	String	Date of analysis	CL
theiacov_clearlabs_version	String	Version of PHB used for running the workflow	CL
theiacov_fasta_analysis_date	String	Date of analysis	FASTA
theiacov_fasta_version	String	Version of PHB used for running the workflow	FASTA
theiacov_illumina_pe_analysis_date	String	Date of analysis	PE
theiacov_illumina_pe_version	String	Version of PHB used for running the workflow	PE
theiacov_illumina_se_analysis_date	String	Date of analysis	SE
theiacov_illumina_se_version	String	Version of PHB used for running the workflow	SE
theiacov_ont_analysis_date	String	Date of analysis	ONT
theiacov_ont_version	String	Version of PHB used for running the workflow	ONT
trimmomatic_docker	String	Docker container used with trimmomatic	PE, SE
trimmomatic_version	String	The version of Trimmomatic used	PE, SE
vadr_alerts_list	File	A file containing all of the fatal alerts as determined by VADR	CL, FASTA, ONT, PE, SE
vadr_all_outputs_tar_gz	File	A .tar.gz file (gzip-compressed tar archive file) containing all outputs from the VADR command v-annotate.pl. This file must be uncompressed & extracted to see the many files within. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for a more complete description of all files present within the archive. Useful when deeply investigating a sample's genome & annotations.	CL, FASTA, ONT, PE, SE
vadr_classification_summary_file	File	Per-sequence tabular classification file. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#explanation-of-sqc-suffixed-output-files for a more complete description.	CL, FASTA, ONT, PE, SE
vadr_docker	String	Docker image used to run VADR	CL, FASTA, ONT, PE, SE
vadr_fastas_zip_archive	File	Zip archive containing all fasta files created during VADR analysis	CL, FASTA, ONT, PE, SE
vadr_feature_tbl_fail	File	5-column feature table output for failing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for a more complete description.	CL, FASTA, ONT, PE, SE
vadr_feature_tbl_pass	File	5-column feature table output for passing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for a more complete description.	CL, FASTA, ONT, PE, SE
vadr_num_alerts	String	Number of fatal alerts as determined by VADR	CL, FASTA, ONT, PE, SE
variants_from_ref_vcf	File	VCF file of variants relative to the reference genome	CL
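
The percent_reference_coverage description above gives its formula as assembly_length_unambiguous / length of the reference genome x 100. The following is a minimal sketch of that arithmetic in Python, not the workflow's own code. The function name and the constant name are hypothetical, and any rounding the workflow applies to the reported value is not shown here.

```python
# Reference length for SARS-CoV-2, as quoted in the output table (SC2: 29903).
SC2_REFERENCE_LENGTH = 29903


def percent_reference_coverage(assembly_length_unambiguous: int,
                               reference_length: int = SC2_REFERENCE_LENGTH) -> float:
    """Return unambiguous assembly length as a percentage of the reference length.

    Hypothetical helper: mirrors the formula in the table description only.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be a positive integer")
    return assembly_length_unambiguous / reference_length * 100


if __name__ == "__main__":
    # An assembly with 29,000 unambiguous bases against the SARS-CoV-2 reference.
    print(f"{percent_reference_coverage(29000):.2f}")
```

For other organisms, pass the length of the relevant reference genome as reference_length instead of relying on the SARS-CoV-2 default.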

TheiaCoV_FASTA_Batch_PHB Outputs

TheiaCoV_FASTA_Batch Outputs¶

Overwrite Warning

The TheiaCoV_FASTA_Batch_PHB workflow writes results to the set-level data table and also overwrites the Pangolin & Nextclade output columns in the sample-level data table. To see exactly which columns were overwritten in the sample-level data table, open the set-level workflow output TSV file called "Datatable".

Variable	Type	Description
datatable	File	Sample-level data table TSV file that was used to update the original sample-level data table in the last step of the TheiaCoV_FASTA_Batch workflow.
nextclade_json	File	Output Nextclade JSON file that contains results for all samples included in the workflow
nextclade_tsv	File	Output Nextclade TSV file that contains results for all samples included in the workflow
pango_lineage_report	File	Output Pangolin CSV file that contains results for all samples included in the workflow
theiacov_fasta_batch_analysis_date	String	Date that the workflow was run.
theiacov_fasta_batch_version	String	Version of the workflow that was used.
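
As noted in the Overwrite Warning above, the datatable TSV file shows which sample-level columns were overwritten. One way to check this after downloading the file is to read its header row. The sketch below assumes the file is a plain tab-separated table whose first row holds the column names. The function name and the example column names are hypothetical and only illustrate the approach.

```python
import csv
import io


def datatable_columns(tsv_text: str) -> list[str]:
    """Return the column names from the header row of a tab-separated table."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # An empty file yields an empty list instead of raising StopIteration.
    return next(reader, [])


if __name__ == "__main__":
    # Hypothetical content standing in for a downloaded datatable TSV;
    # the real column names depend on the workflow run.
    example_tsv = (
        "entity:sample_id\tpango_lineage\tnextclade_clade\n"
        "sample1\tBA.2\t21L\n"
    )
    for column in datatable_columns(example_tsv):
        print(column)
```

To inspect a real file, read it with open(path, newline="") and pass its contents to the function.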