TheiaViral Workflow Series

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Genomic Characterization	Viral	v3.1.0	No	Sample-level

TheiaViral Workflows

TheiaViral workflows assemble, quality assess, and characterize viral genomes from diverse data sources, including metagenomic samples. They can generate consensus assemblies of recalcitrant viruses, including diverse or recombinant lineages such as rabies virus and norovirus, through a three-step approach: 1) generating an intermediate de novo assembly from taxonomy-filtered reads, 2) selecting the best reference from a database of ~200,000 viral genomes using average nucleotide identity, and 3) producing a final consensus assembly through reference-based read mapping and variant calling. A reference genome can instead be provided directly to TheiaViral to bypass de novo assembly, which enables compatibility with tiled amplicon sequencing data. Targeted viral characterization is under active development and is currently functional for Lyssavirus rabies.
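As a rough illustration of step 2, the sketch below picks the reference with the highest average nucleotide identity (ANI) from a list of hits. It is a simplified model, not the workflow's actual selection logic: the function name, the tuple layout, and the reference IDs are hypothetical, and the real selection may weigh additional metrics.

```python
def select_best_reference(ani_hits):
    """Return the reference ID with the highest ANI, or None if there are no hits.

    ani_hits: iterable of (reference_id, ani_percent) tuples (hypothetical layout).
    """
    best = max(ani_hits, key=lambda hit: hit[1], default=None)
    return None if best is None else best[0]


if __name__ == "__main__":
    # Placeholder reference IDs and ANI values, for illustration only.
    hits = [("ref_a", 91.2), ("ref_b", 97.8), ("ref_c", 88.0)]
    print(select_best_reference(hits))
```

Returning None for an empty hit list makes the "no suitable reference" case explicit, which is the situation where reference selection fails downstream.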

What are the main differences between the TheiaViral and TheiaCov workflows?

TheiaCov Workflows
- For amplicon-derived viral sequencing methods
- Supports a limited number of pathogens
- Uses manually curated, static reference genomes
TheiaViral Workflows
- Designed for a variety of sequencing methods
- Supports relatively diverse and recombinant pathogens
- Dynamically identifies the most similar reference genome for consensus assembly via an intermediate de novo assembly

Segmented viruses

Segmented viruses are accounted for in TheiaViral. The reference genome database excludes segmented viral nucleotide accessions, while including RefSeq assembly accessions that include all viral segments. Consensus assembly modules are constructed to handle multi-segment references.

Workflow Diagram

TheiaViral_Illumina_PE Workflow Diagram

TheiaViral_ONT Workflow Diagram

TheiaViral Workflows for Different Input Types

TheiaViral_Illumina_PE

Illumina_PE Input Read Data

The TheiaViral_Illumina_PE workflow inputs Illumina paired-end read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip before Terra uploads to minimize data upload time and storage costs.
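The file-naming rule above can be checked before upload with a small helper. This is a minimal sketch that only inspects the file name; the function name is hypothetical and it does not validate file contents.

```python
from pathlib import Path

# Extensions named in the text above; ".gz" is an optional trailing suffix.
READ_EXTENSIONS = {".fastq", ".fq"}


def has_valid_read_extension(filename: str) -> bool:
    """Return True if the name ends in .fastq or .fq, optionally followed by .gz."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Drop an optional trailing compression suffix before checking the read extension.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    return bool(suffixes) and suffixes[-1] in READ_EXTENSIONS


if __name__ == "__main__":
    for name in ["sample_R1.fastq.gz", "sample_R2.fq", "sample.bam"]:
        print(name, has_valid_read_extension(name))
```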

Reads shorter than 2 x 150 bp (the length generated by a 300-cycle sequencing kit), such as the 2 x 75 bp reads generated by a 150-cycle sequencing kit, may require lowering the optional trim_min_length parameter so that they are trimmed appropriately.

TheiaViral_ONT

ONT Input Read Data

The TheiaViral_ONT workflow inputs base-called Oxford Nanopore Technology (ONT) read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip before Terra uploads to minimize data upload time and storage costs.

We recommend trimming adapter sequences via dorado basecalling prior to running TheiaViral_ONT, though porechop can optionally be called to trim adapters within the workflow.

The ONT sequencing kit and base-calling approach can produce substantial variability in the amount and quality of read data. Genome assemblies produced by the TheiaViral_ONT workflow must be quality assessed before reporting results. We recommend using the Dorado_Basecalling_PHB workflow if applicable.

Inputs

taxon required input parameter

taxon is the standardized taxonomic name (e.g. "Lyssavirus rabies") or NCBI taxon ID (e.g. "11292") of the desired virus to analyze. Inputs must be represented in the NCBI taxonomy database and do not have to be species-level (see read_extraction_rank below).

host optional input parameter

The host input triggers the Host Decontaminate workflow, which removes reads that map to a reference host genome. This input must be an NCBI Taxonomy-compatible taxon or an NCBI assembly accession. If a taxon is provided, the first genome retrieved for that taxon is used. If an accession is provided, the Host Decontaminate task is_accession (ONT) or Read QC Trim PE host_is_accession (Illumina) boolean must also be set to "true".
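The pairing of a host accession with its boolean flag can be sketched as a helper that assembles an inputs mapping for the Illumina workflow. The fully qualified key names (for example `theiaviral_illumina_pe.read_QC_trim.host_is_accession`) are an assumption based on the task and variable names in the input tables, and the function itself is hypothetical; check the names shown in your workspace before relying on them.

```python
import json


def build_illumina_pe_inputs(read1, read2, samplename, taxon,
                             host=None, host_is_accession=False):
    """Assemble a workflow inputs mapping; key names are assumed, not confirmed."""
    prefix = "theiaviral_illumina_pe"
    inputs = {
        f"{prefix}.read1": read1,
        f"{prefix}.read2": read2,
        f"{prefix}.samplename": samplename,
        f"{prefix}.taxon": taxon,
    }
    if host is not None:
        inputs[f"{prefix}.host"] = host
        if host_is_accession:
            # An accession must be paired with the boolean flag described above.
            inputs[f"{prefix}.read_QC_trim.host_is_accession"] = True
    elif host_is_accession:
        raise ValueError("host_is_accession requires a host accession")
    return inputs


if __name__ == "__main__":
    example = build_illumina_pe_inputs(
        read1="sample_R1.fastq.gz",
        read2="sample_R2.fastq.gz",
        samplename="sample",
        taxon="Lyssavirus rabies",
        host="GCF_000001405.40",  # human GRCh38.p14 RefSeq assembly, example only
        host_is_accession=True,
    )
    print(json.dumps(example, indent=2))
```

Raising an error when the flag is set without a host keeps the two inputs from drifting out of sync.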

extract_unclassified optional input parameter

By default, the extract_unclassified parameter is set to "true", which indicates that reads not classified by Kraken2 (Illumina) or Metabuli (ONT) will be included with reads classified as the input taxon. These classifiers often do not comprehensively classify reads using the default RefSeq databases, so extracting unclassified reads is desirable when host and contaminant reads have been sufficiently removed. Host decontamination occurs in TheiaViral using NCBI sra-human-scrubber, read classification to the human genome, and/or mapping reads to the inputted host. Contaminant viral reads are mostly excluded because they will often be classified against the default RefSeq classification databases. Consider setting extract_unclassified to false if de novo assembly or Skani reference selection is failing.

min_allele_freq, min_depth, and min_map_quality optional input parameters

These parameters directly affect the variants that are ultimately reported in the consensus assembly. min_allele_freq sets the minimum proportion of reads that must support an allelic variant for it to be reported in the consensus assembly. min_depth and min_map_quality affect how "N" is reported in the consensus: positions with depth below min_depth are reported as "N", and reads with mapping quality below min_map_quality are not included in depth calculations.
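The interaction of these three parameters can be illustrated with a toy per-position model. This is a simplified sketch, not the behavior of the variant callers used by the workflow: the function name and the (base, mapping quality) input layout are hypothetical, and the defaults are taken from the Illumina input table.

```python
from collections import Counter


def call_consensus_base(ref_base, observations, min_allele_freq=0.6,
                        min_depth=10, min_map_quality=20, char_unknown="N"):
    """Toy consensus call for one position.

    observations: list of (base, mapping_quality) pairs, one per overlapping read.
    """
    # Reads below min_map_quality do not count toward depth.
    bases = [base for base, mapq in observations if mapq >= min_map_quality]
    depth = len(bases)
    if depth < min_depth:
        return char_unknown
    # Report the most common non-reference allele only if it is frequent enough.
    alt_counts = Counter(base for base in bases if base != ref_base)
    if alt_counts:
        alt, alt_count = alt_counts.most_common(1)[0]
        if alt_count / depth >= min_allele_freq:
            return alt
    return ref_base


if __name__ == "__main__":
    reads = [("T", 60)] * 9 + [("A", 60)] * 3
    print(call_consensus_base("A", reads))
```

In this model, raising min_map_quality can turn a called base into "N" because low-quality alignments stop counting toward min_depth.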

read_extraction_rank optional input parameter

By default, the read_extraction_rank parameter is set to "family", which indicates that reads will be extracted if they are classified as the taxonomic family of the input taxon, including all descendant taxa of the family. Read classification may not resolve to the rank of the input taxon, so some reads may only be classified at higher ranks. For example, some Lyssavirus rabies (species) reads may only be resolved to Lyssavirus (genus), so they would not be extracted if read_extraction_rank is set to "species". Setting read_extraction_rank above the inputted taxon's rank can therefore dramatically increase the number of reads recovered, at the potential cost of including other viruses. This is unlikely to be a problem for scarcely represented lineages, e.g. a sample that is expected to include Lyssavirus rabies is unlikely to contain other viruses of the corresponding family, Rhabdoviridae. However, setting read_extraction_rank far above the input taxon rank can be problematic when multiple representatives of the same viral family are present in similar abundance within the same sample. To further refine read_extraction_rank, please review the classification reports of the respective classification software (Kraken2 for Illumina and Metabuli for ONT).
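The effect of read_extraction_rank and extract_unclassified can be illustrated with a toy lineage table. This is a sketch of the idea only, not how Kraken2 or Metabuli output is parsed in the workflow: the taxonomy dictionary is hand-written, and the function names and the read-to-taxon mapping are hypothetical.

```python
# Toy lineage table: taxon -> (rank, parent). Hand-written for illustration only.
TAXONOMY = {
    "Rhabdoviridae": ("family", None),
    "Lyssavirus": ("genus", "Rhabdoviridae"),
    "Lyssavirus rabies": ("species", "Lyssavirus"),
    "Vesiculovirus": ("genus", "Rhabdoviridae"),
}


def lineage(taxon):
    """Return the taxon followed by its ancestors, nearest first."""
    chain = []
    while taxon is not None:
        chain.append(taxon)
        taxon = TAXONOMY[taxon][1]
    return chain


def ancestor_at_rank(taxon, rank):
    """Return the taxon or ancestor holding the given rank, or None if absent."""
    for name in lineage(taxon):
        if TAXONOMY[name][0] == rank:
            return name
    return None


def extract_reads(classifications, taxon, read_extraction_rank="family",
                  extract_unclassified=True):
    """classifications: dict of read ID -> assigned taxon, or None if unclassified."""
    # Fall back to the input taxon when it sits above the requested rank.
    root = ancestor_at_rank(taxon, read_extraction_rank) or taxon
    kept = []
    for read_id, assigned in classifications.items():
        if assigned is None:
            if extract_unclassified:
                kept.append(read_id)
        elif root in lineage(assigned):
            # Keep reads assigned to the root taxon or any of its descendants.
            kept.append(read_id)
    return kept


if __name__ == "__main__":
    reads = {"r1": "Lyssavirus rabies", "r2": "Lyssavirus",
             "r3": "Vesiculovirus", "r4": None}
    for rank in ("species", "genus", "family"):
        print(rank, extract_reads(reads, "Lyssavirus rabies", rank))
```

In the toy data, the genus-level read is lost at "species" and recovered at "genus", while "family" also pulls in a read from a different genus, which mirrors the trade-off described above.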

TheiaViral_Illumina_PE

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
theiaviral_illumina_pe	read1	File	Illumina forward read file in FASTQ file format (compression optional)		Required
theiaviral_illumina_pe	read2	File	Illumina reverse read file in FASTQ file format (compression optional)		Required
theiaviral_illumina_pe	samplename	String	Name of the sample being analyzed		Required
theiaviral_illumina_pe	taxon	String	Taxon ID or organism name of interest		Required
bwa	cpu	Int	Number of CPUs to allocate to the task	6	Optional
bwa	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
bwa	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan	Optional
bwa	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
checkv_consensus	checkv_db	File	CheckV database file	gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz	Optional
checkv_consensus	cpu	Int	Number of CPUs allocated for the task	2	Optional
checkv_consensus	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
checkv_consensus	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3	Optional
checkv_consensus	memory	Int	Memory allocated for the task (in GB)	8	Optional
checkv_denovo	checkv_db	File	CheckV database file	gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz	Optional
checkv_denovo	cpu	Int	Number of CPUs allocated for the task	2	Optional
checkv_denovo	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
checkv_denovo	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3	Optional
checkv_denovo	memory	Int	Memory allocated for the task (in GB)	8	Optional
clean_check_reads	cpu	Int	Number of CPUs to allocate to the task	1	Optional
clean_check_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
clean_check_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2	Optional
clean_check_reads	max_genome_length	Int	Maximum genome length able to pass read screening	2673870	Optional
clean_check_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
clean_check_reads	min_basepairs	Int	Minimum base pairs to pass read screening	15000	Optional
clean_check_reads	min_coverage	Int	Minimum coverage to pass read screening	10	Optional
clean_check_reads	min_genome_length	Int	Minimum genome length to pass read screening	1500	Optional
clean_check_reads	min_proportion	Int	Minimum read proportion to pass read screening	40	Optional
clean_check_reads	min_reads	Int	Minimum reads to pass read screening	50	Optional
consensus	char_unknown	String	Character used to represent unknown bases in the consensus sequence	N	Optional
consensus	count_orphans	Boolean	True/False that determines if anomalous read pairs are NOT skipped in variant calling. Anomalous read pairs are those marked in the FLAG field as paired in sequencing but without the properly-paired flag set.	TRUE	Optional
consensus	cpu	Int	Number of CPUs to allocate to the task	8	Optional
consensus	disable_baq	Boolean	True/False that determines if base alignment quality (BAQ) computation should be disabled during samtools mpileup before consensus generation	TRUE	Optional
consensus	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
consensus	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019-epi2me	Optional
consensus	max_depth	Int	For a given position, read at maximum INT number of reads per input file during samtools mpileup before consensus generation	600000	Optional
consensus	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
consensus	min_bq	Int	Minimum base quality required for a base to be considered during samtools mpileup before consensus generation	0	Optional
consensus	skip_N	Boolean	True/False that determines if "N" bases should be skipped in the consensus sequence	FALSE	Optional
consensus_qc	cpu	Int	Number of CPUs to allocate to the task	1	Optional
consensus_qc	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
consensus_qc	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1	Optional
consensus_qc	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
ivar_variants	cpu	Int	Number of CPUs allocated for the task	2	Optional
ivar_variants	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
ivar_variants	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan	Optional
ivar_variants	memory	Int	Memory allocated for the task (in GB)	8	Optional
ivar_variants	reference_gff	File	A GFF file in the GFF3 format can be supplied to specify coordinates of open reading frames (ORFs) so iVar can identify codons and translate variants into amino acids		Optional
megahit	cpu	Int	Number of CPUs allocated for the task	4	Optional
megahit	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
megahit	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/theiagen/megahit:1.2.9	Optional
megahit	kmers	String	Comma-separated list of kmer sizes to use for assembly. All must be odd, in the range 15-255, increment <= 28	21,29,39,59,79,99,119,141	Optional
megahit	megahit_opts	String	Additional parameters for MEGAHIT assembler		Optional
megahit	memory	Int	Memory allocated for the task (in GB)	16	Optional
megahit	min_contig_length	Int	Minimum contig length for MEGAHIT assembler	1	Optional
morgana_magic	abricate_flu_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	abricate_flu_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
morgana_magic	abricate_flu_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/abricate:1.0.1-insaflu-220727	Optional
morgana_magic	abricate_flu_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
morgana_magic	abricate_flu_min_percent_coverage	Int	Minimum DNA percent coverage	60	Optional
morgana_magic	abricate_flu_min_percent_identity	Int	Minimum DNA percent identity	70	Optional
morgana_magic	assembly_metrics_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	assembly_metrics_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
morgana_magic	assembly_metrics_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15	Optional
morgana_magic	assembly_metrics_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
morgana_magic	consensus_qc_cpu	Int	Number of CPUs to allocate to the task	1	Optional
morgana_magic	consensus_qc_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
morgana_magic	consensus_qc_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1	Optional
morgana_magic	consensus_qc_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
morgana_magic	genoflu_cpu	Int	Number of CPUs to allocate to the task	1	Optional
morgana_magic	genoflu_cross_reference	File	An Excel file to cross-reference BLAST findings; useful when novel genotypes are not in the default file used by genoflu.py		Optional
morgana_magic	genoflu_disk_size	Int	Amount of storage (in GB) to allocate to the task	25	Optional
morgana_magic	genoflu_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.06	Optional
morgana_magic	genoflu_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
morgana_magic	irma_cpu	Int	Number of CPUs to allocate to the task	4	Optional
morgana_magic	irma_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
morgana_magic	irma_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/irma:1.2.0	Optional
morgana_magic	irma_keep_ref_deletions	Boolean	True/False variable that determines how sites missed during read gathering (i.e. 0 reads for a site in the reference genome) are handled: either ambiguated by inserting N's or deleted from the sequence entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN"	TRUE	Optional
morgana_magic	irma_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
morgana_magic	nextclade_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	nextclade_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
morgana_magic	nextclade_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2	Optional
morgana_magic	nextclade_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
morgana_magic	nextclade_output_parser_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	nextclade_output_parser_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
morgana_magic	nextclade_output_parser_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim	Optional
morgana_magic	nextclade_output_parser_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
morgana_magic	pangolin_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	pangolin_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
morgana_magic	pangolin_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2	Optional
morgana_magic	pangolin_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
ncbi_datasets	cpu	Int	Number of CPUs allocated for the task	1	Optional
ncbi_datasets	disk_size	Int	Disk size allocated for the task (in GB)	50	Optional
ncbi_datasets	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1	Optional
ncbi_datasets	include_gbff	Boolean	True/False to include gbff files in the output	FALSE	Optional
ncbi_datasets	include_gff3	Boolean	True/False to include gff3 files in the output	FALSE	Optional
ncbi_datasets	memory	Int	Memory allocated for the task (in GB)	4	Optional
ncbi_identify	complete	Boolean	Only query genomes labeled complete	TRUE	Optional
ncbi_identify	cpu	Int	Number of CPUs allocated for the task	1	Optional
ncbi_identify	disk_size	Int	Disk size allocated for the task (in GB)	50	Optional
ncbi_identify	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1	Optional
ncbi_identify	memory	Int	Memory allocated for the task (in GB)	4	Optional
ncbi_identify	refseq	Boolean	Only query RefSeq genomes	TRUE	Optional
ncbi_identify	summary_limit	Int	Maximum number of genomes to return in the summary	100	Optional
ncbi_identify	use_ncbi_virus	Boolean	Set to true to download from NCBI Virus Datasets	FALSE	Optional
quast_denovo	cpu	Int	Number of CPUs allocated for the task	2	Optional
quast_denovo	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
quast_denovo	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2	Optional
quast_denovo	memory	Int	Memory allocated for the task (in GB)	2	Optional
rasusa	bases	String	Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored		Optional
rasusa	coverage	Float	The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required	250	Optional
rasusa	cpu	Int	Number of CPUs allocated for the task	4	Optional
rasusa	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
rasusa	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0	Optional
rasusa	frac	Float	Subsample to a fraction of the reads - e.g., 0.5 samples half the reads		Optional
rasusa	memory	Int	Memory allocated for the task (in GB)	8	Optional
rasusa	num	Int	Subsample to a specific number of reads		Optional
rasusa	seed	Int	Random seed for reproducibility		Optional
read_QC_trim	adapters	File	File with adapter sequences to be removed		Optional
read_QC_trim	bbduk_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
read_QC_trim	call_kraken	Boolean	Internal component, do not modify		Optional
read_QC_trim	call_midas	Boolean	Internal component, do not modify		Optional
read_QC_trim	fastp_args	String	Additional arguments to use with fastp	--detect_adapter_for_pe -g -5 20 -3 20	Optional
read_QC_trim	host_complete_only	Boolean	Only download host reference genome labeled "complete"	FALSE	Optional
read_QC_trim	host_decontaminate_mem	Int	Memory allocated for minimap2 (in GB)	32	Optional
read_QC_trim	host_is_accession	Boolean	Inputted "host" is an accession	FALSE	Optional
read_QC_trim	kraken_cpu	Int	Number of CPUs to allocate to the task	4	Optional
read_QC_trim	kraken_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
read_QC_trim	phix	File	A file containing the phix used during Illumina sequencing; used in the BBDuk task		Optional
read_QC_trim	read_processing	String	The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp"	trimmomatic	Optional
read_QC_trim	read_qc	String	The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc"	fastq_scan	Optional
read_QC_trim	target_organism	String	Internal component, do not modify		Optional
read_QC_trim	trim_min_length	Int	Specifies minimum length of each read after trimming to be kept	75	Optional
read_QC_trim	trim_quality_min_score	Int	Specifies the average quality of bases in a sliding window to be kept	30	Optional
read_QC_trim	trim_window_size	Int	Specifies window size for trimming (the number of bases to average the quality across)	4	Optional
read_QC_trim	trimmomatic_args	String	Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data.	-phred33	Optional
read_QC_trim_pe	kraken_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
read_QC_trim_pe	midas_db	File	Internal component, do not modify		Optional
read_mapping_stats	cpu	Int	Number of CPUs allocated for the task	2	Optional
read_mapping_stats	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
read_mapping_stats	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15	Optional
read_mapping_stats	memory	Int	Memory allocated for the task (in GB)	8	Optional
skani	cpu	Int	Number of CPUs allocated for the task	2	Optional
skani	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
skani	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2	Optional
skani	memory	Int	Memory allocated for the task (in GB)	4	Optional
skani	skani_db	File	Skani database file	gs://theiagen-public-resources-rp/reference_data/databases/skani/skani_db_20250606.tar	Optional
spades	cpu	Int	Number of CPUs allocated for the task	4	Optional
spades	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
spades	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/spades:4.1.0	Optional
spades	kmers	String	List of k-mer sizes (must be odd and less than 128)	auto	Optional
spades	memory	Int	Memory allocated for the task (in GB)	16	Optional
spades	phred_offset	Int	PHRED quality offset in the input reads (33 or 64)	33	Optional
spades	spades_opts	String	Additional parameters for Spades assembler		Optional
theiaviral_illumina_pe	call_metaviralspades	Boolean	True/False to call assembly with MetaviralSPAdes and use Megahit as fallback	TRUE	Optional
theiaviral_illumina_pe	extract_unclassified	Boolean	True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads	TRUE	Optional
theiaviral_illumina_pe	genome_length	Int	Expected genome length of taxon of interest		Optional
theiaviral_illumina_pe	host	String	Host taxon/accession to dehost reads, if provided		Optional
theiaviral_illumina_pe	kraken_db	File	Kraken2 database file	gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz	Optional
theiaviral_illumina_pe	min_allele_freq	Float	Minimum allele frequency required for a variant to populate the consensus sequence	0.6	Optional
theiaviral_illumina_pe	min_depth	Int	Minimum read depth required for a variant to populate the consensus sequence	10	Optional
theiaviral_illumina_pe	min_map_quality	Int	Minimum mapping quality required for read alignments	20	Optional
theiaviral_illumina_pe	read_extraction_rank	String	Taxonomic rank to use for read extraction; limits extraction to taxa within the specified rank	family	Optional
theiaviral_illumina_pe	reference_fasta	File	Reference genome in FASTA format		Optional
theiaviral_illumina_pe	skip_rasusa	Boolean	True/False to skip read subsampling with Rasusa	FALSE	Optional
theiaviral_illumina_pe	skip_screen	Boolean	True/False to skip read screening check prior to analysis	FALSE	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional

TheiaViral_ONT

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
theiaviral_ont	read1	File	Base-called ONT read file in FASTQ file format (compression optional)		Required
theiaviral_ont	samplename	String	Name of the sample being analyzed		Required
theiaviral_ont	taxon	String	Taxon ID or organism name of interest		Required
bcftools_consensus	cpu	Int	Number of CPUs allocated for the task	2	Optional
bcftools_consensus	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
bcftools_consensus	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/bcftools:1.20	Optional
bcftools_consensus	memory	Int	Memory allocated for the task (in GB)	4	Optional
checkv_consensus	checkv_db	File	CheckV database file	gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz	Optional
checkv_consensus	cpu	Int	Number of CPUs allocated for the task	2	Optional
checkv_consensus	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
checkv_consensus	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3	Optional
checkv_consensus	memory	Int	Memory allocated for the task (in GB)	8	Optional
checkv_denovo	checkv_db	File	CheckV database file	gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz	Optional
checkv_denovo	cpu	Int	Number of CPUs allocated for the task	2	Optional
checkv_denovo	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
checkv_denovo	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3	Optional
checkv_denovo	memory	Int	Memory allocated for the task (in GB)	8	Optional
clair3	clair3_model	String	Model to be used by Clair3	r1041_e82_400bps_sup_v500	Optional
clair3	cpu	Int	Number of CPUs allocated for the task	4	Optional
clair3	disable_phasing	Boolean	True/False that determines if variants should be called without whatshap phasing in full alignment calling	TRUE	Optional
clair3	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
clair3	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/theiagen/clair3-extra-models:1.0.10	Optional
clair3	enable_gvcf	Boolean	True/False that determines if an additional GVCF output should be generated	FALSE	Optional
clair3	enable_haploid_precise	Boolean	True/False that determines haploid calling mode where only 1/1 is considered as a variant	TRUE	Optional
clair3	include_all_contigs	Boolean	True/False that determines if all contigs should be included in the output	TRUE	Optional
clair3	indel_min_af	Float	Minimum Indel AF required for a candidate variant	0.08	Optional
clair3	memory	Int	Memory allocated for the task (in GB)	8	Optional
clair3	snp_min_af	Float	Minimum SNP AF required for a candidate variant	0.08	Optional
clair3	variant_quality	Int	If set, variants with >$qual will be marked PASS, or LowQual otherwise	2	Optional
clean_check_reads	cpu	Int	Number of CPUs to allocate to the task	1	Optional
clean_check_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
clean_check_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2	Optional
clean_check_reads	max_genome_length	Int	Maximum genome length able to pass read screening	2673870	Optional
clean_check_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
clean_check_reads	min_basepairs	Int	Minimum base pairs to pass read screening	15000	Optional
clean_check_reads	min_coverage	Int	Minimum coverage to pass read screening	10	Optional
clean_check_reads	min_genome_length	Int	Minimum genome length to pass read screening	1500	Optional
clean_check_reads	min_reads	Int	Minimum reads to pass read screening	50	Optional
clean_check_reads	skip_mash	Boolean	If true, skips estimation of genome size and coverage using mash in read screening steps. As a result, providing true also prevents screening using these parameters.	TRUE	Optional
consensus_qc	cpu	Int	Number of CPUs to allocate to the task	1	Optional
consensus_qc	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
consensus_qc	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1	Optional
consensus_qc	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
fasta_utilities	cpu	Int	Number of CPUs allocated for the task	1	Optional
fasta_utilities	disk_size	Int	Disk size allocated for the task (in GB)	10	Optional
fasta_utilities	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/biocontainers/seqkit:2.4.0--h9ee0642_0	Optional
fasta_utilities	memory	Int	Memory allocated for the task (in GB)	2	Optional
flye	additional_parameters	String	Additional parameters for Flye assembler		Optional
flye	asm_coverage	Int	Reduced coverage for initial disjointig assembly		Optional
flye	cpu	Int	Number of CPUs allocated for the task	4	Optional
flye	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
flye	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/flye:2.9.4	Optional
flye	flye_polishing_iterations	Int	Number of polishing iterations	1	Optional
flye	genome_length	Int	Expected genome length for assembly - requires asm_coverage		Optional
flye	keep_haplotypes	Boolean	True/False to prevent collapsing alternative haplotypes	FALSE	Optional
flye	memory	Int	Memory allocated for the task (in GB)	32	Optional
flye	minimum_overlap	Int	Minimum overlap between reads		Optional
flye	no_alt_contigs	Boolean	True/False to disable alternative contig generation	FALSE	Optional
flye	read_error_rate	Float	Expected error rate in reads		Optional
flye	read_type	String	Type of read data for Flye	--nano-hq	Optional
flye	scaffold	Boolean	True/False to enable scaffolding using graph	FALSE	Optional
host_decontaminate	complete_only	Boolean	Only download genomes labeled "complete"	FALSE	Optional
host_decontaminate	is_accession	Boolean	Inputted "host" is an accession	FALSE	Optional
host_decontaminate	minimap2_memory	Int	Memory allocated for minimap2 (in GB)	32	Optional
host_decontaminate	read2	File	Internal component, do not modify		Optional
host_decontaminate	refseq	Boolean	Only download RefSeq genomes	TRUE	Optional
mask_low_coverage	cpu	Int	Number of CPUs allocated for the task	2	Optional
mask_low_coverage	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
mask_low_coverage	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/bedtools:2.31.0	Optional
mask_low_coverage	memory	Int	Memory allocated for the task (in GB)	8	Optional
metabuli	cpu	Int	Number of CPUs allocated for the task	4	Optional
metabuli	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
metabuli	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/theiagen/metabuli:1.1.0	Optional
metabuli	memory	Int	Memory allocated for the task (in GB)	16	Optional
metabuli	metabuli_db	File	Metabuli database file	gs://theiagen-public-resources-rp/reference_data/databases/metabuli/refseq_virus-v223.tar.gz	Optional
metabuli	min_percent_coverage	Float	Minimum query coverage threshold (0.0 - 1.0)	0.0	Optional
metabuli	min_score	Float	Minimum sequence similarity score (0.0 - 1.0)	0.0	Optional
metabuli	min_sp_score	Float	Minimum score for species- or lower-level classification	0.0	Optional
metabuli	taxonomy_path	File	Path to taxonomy file	gs://theiagen-public-resources-rp/reference_data/databases/metabuli/new_taxdump.tar.gz	Optional
minimap2	cpu	Int	Number of CPUs to allocate to the task	2	Optional
minimap2	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
minimap2	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22	Optional
minimap2	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
minimap2	query2	File	Internal component, do not modify		Optional
morgana_magic	abricate_flu_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	abricate_flu_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
morgana_magic	abricate_flu_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/abricate:1.0.1-insaflu-220727	Optional
morgana_magic	abricate_flu_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
morgana_magic	abricate_flu_min_percent_coverage	Int	Minimum DNA percent coverage	60	Optional
morgana_magic	abricate_flu_min_percent_identity	Int	Minimum DNA percent identity	70	Optional
morgana_magic	assembly_metrics_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	assembly_metrics_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
morgana_magic	assembly_metrics_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15	Optional
morgana_magic	assembly_metrics_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
morgana_magic	consensus_qc_cpu	Int	Number of CPUs to allocate to the task	1	Optional
morgana_magic	consensus_qc_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
morgana_magic	consensus_qc_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1	Optional
morgana_magic	consensus_qc_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
morgana_magic	genoflu_cpu	Int	Number of CPUs to allocate to the task	1	Optional
morgana_magic	genoflu_cross_reference	File	An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py		Optional
morgana_magic	genoflu_disk_size	Int	Amount of storage (in GB) to allocate to the task	25	Optional
morgana_magic	genoflu_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.06	Optional
morgana_magic	genoflu_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
morgana_magic	irma_cpu	Int	Number of CPUs to allocate to the task	4	Optional
morgana_magic	irma_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
morgana_magic	irma_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/irma:1.2.0	Optional
morgana_magic	irma_keep_ref_deletions	Boolean	True/False variable that determines whether sites missed during read gathering (i.e., sites in the reference genome with 0 reads) are ambiguated by inserting N's or deleted from the sequence entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN"	TRUE	Optional
morgana_magic	irma_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
morgana_magic	nextclade_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	nextclade_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
morgana_magic	nextclade_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2	Optional
morgana_magic	nextclade_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
morgana_magic	nextclade_output_parser_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	nextclade_output_parser_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
morgana_magic	nextclade_output_parser_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim	Optional
morgana_magic	nextclade_output_parser_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
morgana_magic	pangolin_cpu	Int	Number of CPUs to allocate to the task	2	Optional
morgana_magic	pangolin_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
morgana_magic	pangolin_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2	Optional
morgana_magic	pangolin_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
morgana_magic	read2	File	Internal component, do not modify		Optional
nanoplot_clean	cpu	Int	Number of CPUs to allocate to the task	4	Optional
nanoplot_clean	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
nanoplot_clean	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0	Optional
nanoplot_clean	max_length	Int	The maximum length of clean reads, for which reads longer than the length specified will be hidden.	100000	Optional
nanoplot_clean	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
nanoplot_raw	cpu	Int	Number of CPUs to allocate to the task	4	Optional
nanoplot_raw	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
nanoplot_raw	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0	Optional
nanoplot_raw	max_length	Int	The maximum length of clean reads, for which reads longer than the length specified will be hidden.	100000	Optional
nanoplot_raw	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
nanoq	cpu	Int	Number of CPUs allocated for the task	1	Optional
nanoq	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
nanoq	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/biocontainers/nanoq:0.9.0--hec16e2b_1	Optional
nanoq	max_read_length	Int	Maximum read length to keep	100000	Optional
nanoq	max_read_qual	Int	Maximum read quality to keep	10	Optional
nanoq	memory	Int	Memory allocated for the task (in GB)	2	Optional
nanoq	min_read_length	Int	Minimum read length to keep	500	Optional
nanoq	min_read_qual	Int	Minimum read quality to keep	10	Optional
ncbi_datasets	cpu	Int	Number of CPUs allocated for the task	1	Optional
ncbi_datasets	disk_size	Int	Disk size allocated for the task (in GB)	50	Optional
ncbi_datasets	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1	Optional
ncbi_datasets	include_gbff	Boolean	True/False to include gbff files in the output	FALSE	Optional
ncbi_datasets	include_gff3	Boolean	True/False to include gff3 files in the output	FALSE	Optional
ncbi_datasets	memory	Int	Memory allocated for the task (in GB)	4	Optional
ncbi_identify	complete	Boolean	Only query genomes labeled complete	TRUE	Optional
ncbi_identify	cpu	Int	Number of CPUs allocated for the task	1	Optional
ncbi_identify	disk_size	Int	Disk size allocated for the task (in GB)	50	Optional
ncbi_identify	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1	Optional
ncbi_identify	memory	Int	Memory allocated for the task (in GB)	4	Optional
ncbi_identify	refseq	Boolean	Only query RefSeq genomes	TRUE	Optional
ncbi_identify	summary_limit	Int	Maximum number of genomes to return in the summary	100	Optional
ncbi_identify	use_ncbi_virus	Boolean	Set to true to download from NCBI Virus Datasets	FALSE	Optional
ncbi_scrub_se	cpu	Int	Number of CPUs to allocate to the task	4	Optional
ncbi_scrub_se	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
ncbi_scrub_se	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1	Optional
ncbi_scrub_se	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
parse_mapping	cpu	Int	Number of CPUs allocated for the task	2	Optional
parse_mapping	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
parse_mapping	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17	Optional
parse_mapping	memory	Int	Memory allocated for the task (in GB)	8	Optional
porechop	cpu	Int	Number of CPUs allocated for the task	4	Optional
porechop	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
porechop	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/porechop:0.2.4	Optional
porechop	memory	Int	Memory allocated for the task (in GB)	16	Optional
porechop	trimopts	String	Additional trimming options for Porechop		Optional
quast_denovo	cpu	Int	Number of CPUs allocated for the task	2	Optional
quast_denovo	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
quast_denovo	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2	Optional
quast_denovo	memory	Int	Memory allocated for the task (in GB)	2	Optional
rasusa	bases	String	Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored		Optional
rasusa	coverage	Float	The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required	250	Optional
rasusa	cpu	Int	Number of CPUs allocated for the task	4	Optional
rasusa	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
rasusa	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0	Optional
rasusa	frac	Float	Subsample to a fraction of the reads - e.g., 0.5 samples half the reads		Optional
rasusa	memory	Int	Memory allocated for the task (in GB)	8	Optional
rasusa	num	Int	Subsample to a specific number of reads		Optional
rasusa	read2	File	Internal component, do not modify		Optional
rasusa	seed	Int	Random seed for reproducibility		Optional
raven	cpu	Int	Number of CPUs allocated for the task	4	Optional
raven	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
raven	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/theiagen/raven:1.8.3	Optional
raven	memory	Int	Memory allocated for the task (in GB)	16	Optional
raven	raven_identity	Float	Threshold for overlap between two reads in order to construct an edge between them	0.0	Optional
raven	raven_opts	Int	Additional parameters for Raven assembler		Optional
raven	raven_polishing_iterations	Int	Number of polishing iterations	2	Optional
read_mapping_stats	cpu	Int	Number of CPUs allocated for the task	2	Optional
read_mapping_stats	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
read_mapping_stats	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15	Optional
read_mapping_stats	memory	Int	Memory allocated for the task (in GB)	8	Optional
skani	cpu	Int	Number of CPUs allocated for the task	2	Optional
skani	disk_size	Int	Disk size allocated for the task (in GB)	100	Optional
skani	docker	String	Docker image used for the task	us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2	Optional
skani	memory	Int	Memory allocated for the task (in GB)	4	Optional
skani	skani_db	File	Skani database file	gs://theiagen-public-resources-rp/reference_data/databases/skani/skani_db_20250606.tar	Optional
theiaviral_ont	call_porechop	Boolean	True/False to trim adapters with porechop	FALSE	Optional
theiaviral_ont	call_raven	Boolean	True/False to call assembly with Raven and use Flye as fallback	TRUE	Optional
theiaviral_ont	extract_unclassified	Boolean	True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads	FALSE	Optional
theiaviral_ont	genome_length	Int	Expected genome length of taxon of interest		Optional
theiaviral_ont	host	String	Host taxon/accession to dehost reads, if provided		Optional
theiaviral_ont	min_allele_freq	Float	Minimum allele frequency required for a variant to populate the consensus sequence	0.6	Optional
theiaviral_ont	min_depth	Int	Minimum read depth required for a variant to populate the consensus sequence	10	Optional
theiaviral_ont	min_map_quality	Int	Minimum mapping quality required for read alignments	20	Optional
theiaviral_ont	read_extraction_rank	String	Taxonomic rank to use for read extraction - limits taxa to only those within the specified ranks.	family	Optional
theiaviral_ont	reference_fasta	File	Reference genome in FASTA format		Optional
theiaviral_ont	skip_rasusa	Boolean	True/False to skip read subsampling with Rasusa	FALSE	Optional
theiaviral_ont	skip_screen	Boolean	True/False to skip read screening check prior to analysis	FALSE	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional

All Tasks¶

TheiaViral_Illumina_PETheiaViral_ONT

Versioning

versioning: Version Capture

The versioning task captures the workflow version from the GitHub (code repository) version.

Version Capture Technical details

	Links
Task	task_versioning.wdl

Taxonomic Identification

ncbi_identify

The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank a.k.a read_extraction_rank input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important

The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
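
The rank rule above can be expressed as a comparison of positions in an ordered list of ranks. The following is a minimal Python sketch, not the task's actual code; the `RANKS` list and the `rank_is_valid` helper are hypothetical names introduced only for illustration.

```python
# Ranks ordered from most specific to most general, matching the valid
# options listed above. This ordering is the only input the check needs.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "domain"]


def rank_is_valid(input_taxon_rank: str, requested_rank: str) -> bool:
    """Return True if requested_rank is equal to or above input_taxon_rank."""
    return RANKS.index(requested_rank) >= RANKS.index(input_taxon_rank)


# Lyssavirus rabies is a species, so requesting its family is allowed.
print(rank_is_valid("species", "family"))
# Lyssavirus is a genus, so requesting a species is not allowed.
print(rank_is_valid("genus", "species"))
```

The first call prints True and the second prints False, mirroring the two examples above.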

NCBI Datasets Technical Details

	Links
Task	task_identify_taxon_id.wdl
Software Source Code	NCBI Datasets on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Read Quality Control, Trimming, Filtering, Identification and Extraction

read_QC_trim

read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.

HRRT: Human Host Sequence Removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool and employs a database of Eukaryota k-mers derived from all human RefSeq records, from which any k-mers also found in non-Eukaryota RefSeq records have been subtracted.

NCBI-Scrub Technical Details

	Links
Task	task_ncbi_scrub.wdl
Software Source Code	HRRT on GitHub
Software Documentation	HRRT on NCBI

Read quality trimming

Either trimmomatic or fastp can be used for read-quality trimming. Trimmomatic is used by default. Both tools trim low-quality regions of reads with a sliding window (with a window size of trim_window_size), cutting once the average quality within the window falls below trim_quality_trim_score. They will both discard the read if it is trimmed below trim_minlen.
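
As a rough illustration of sliding-window trimming, the sketch below scans a read's per-base quality scores and cuts at the first window whose average falls below the threshold. It is a simplification and does not reproduce the exact behavior of Trimmomatic or fastp. The function name is hypothetical, its arguments stand in for trim_window_size, trim_quality_trim_score, and trim_minlen, and the numeric values are arbitrary examples rather than the workflow's defaults.

```python
from typing import List, Optional


def sliding_window_trim(
    quals: List[int], window_size: int, min_avg_quality: float, min_len: int
) -> Optional[int]:
    """Return how many leading bases to keep, or None if the read is discarded."""
    keep = len(quals)
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            keep = start  # cut at the start of the first failing window
            break
    # Reads trimmed below the minimum length are dropped entirely.
    return keep if keep >= min_len else None


# Ten high-quality bases followed by six low-quality bases (made-up scores).
example_quals = [38] * 10 + [10] * 6
print(sliding_window_trim(example_quals, window_size=4, min_avg_quality=30, min_len=5))
print(sliding_window_trim(example_quals, window_size=4, min_avg_quality=30, min_len=20))
```

With these made-up scores, the first call keeps 8 bases, while the second discards the read because 8 is below the minimum length of 20.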

read_processing input parameter

This input parameter accepts either trimmomatic or fastp as an input to determine which tool should be used for read quality trimming. This is set to trimmomatic by default.

If the fastp option is selected, see below for table of default parameters.

fastp default read-trimming parameters

Parameter	Explanation
-g	enables polyG tail trimming
-5 20	enables read end-trimming
-3 20	enables read end-trimming
--detect_adapter_for_pe	enables adapter-trimming only for paired-end reads

Additional arguments can be passed using the fastp_args optional parameter.

Trimmomatic and fastp Technical Details

	Links
Task	task_trimmomatic.wdl task_fastp.wdl
Software Source Code	Trimmomatic fastp on GitHub
Software Documentation	Trimmomatic fastp
Original Publication(s)	Trimmomatic: a flexible trimmer for Illumina sequence data fastp: an ultra-fast all-in-one FASTQ preprocessor

Adapter removal

The BBDuk task removes adapters from sequence reads. To do this:

Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files.
BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
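
To make the idea of a 31-mer match concrete, here is a minimal Python sketch of k-mer based read filtering. It is only an illustration of the principle and is not how BBDuk is implemented. The helper names are hypothetical, and the short reference string is a made-up toy sequence, not PhiX.

```python
from typing import List, Set


def kmers(seq: str, k: int = 31) -> Set[str]:
    """Collect every k-length substring of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def revcomp(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def filter_reads(reads: List[str], reference: str, k: int = 31) -> List[str]:
    """Keep only reads that share no k-mer with either strand of reference."""
    ref_kmers = kmers(reference, k) | kmers(revcomp(reference), k)
    return [read for read in reads if not (kmers(read, k) & ref_kmers)]


# Toy 40 bp reference (made up for this example).
toy_reference = "ACGTTGCAAGGCTTAACCGGATCGATTACGCTAGGCTAAC"
toy_reads = [
    toy_reference[5:40],            # forward-strand match, removed
    "T" * 35,                       # unrelated read, kept
    revcomp(toy_reference)[0:33],   # reverse-strand match, removed
]
print(filter_reads(toy_reads, toy_reference))
```

Only the unrelated read survives, because the other two each contain at least one 31-mer found in the toy reference or its reverse complement.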

What are adapters and why do they need to be removed?

Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.

BBDuk Technical Details

	Links
Task	task_bbduk.wdl
Software Source Code	BBTools
Software Documentation	BBDuk

Read Quantification

There are two methods for read quantification to choose from: fastq-scan (default) or fastqc. Both quantify the forward and reverse reads in FASTQ files. For paired-end data, they also provide the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads. fastqc also provides a graphical visualization of the read quality.

read_qc input parameter

This input parameter accepts either "fastq_scan" or "fastqc" as an input to determine which tool should be used for read quantification. This is set to "fastq_scan" by default.

fastq-scan and FastQC Technical Details

	Links
Task	task_fastq_scan.wdl task_fastqc.wdl
Software Source Code	fastq-scan on GitHub fastqc on GitHub
Software Documentation	fastq-scan fastqc

host_decontaminate: Host read decontamination

Host genetic data is frequently incidentally sequenced alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning to a reference host genome acquired on-the-fly. The reference host genome can be acquired via NCBI Taxonomy-compatible taxon input or assembly accession. Host Decontaminate maps inputted reads to the host genome using minimap2, reports mapping statistics to this host genome, and outputs the unaligned dehosted reads.

The detailed steps and tasks are as follows:

Taxonomic Identification

The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank a.k.a read_extraction_rank input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important

The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.

NCBI Datasets Technical Details

	Links
Task	task_identify_taxon_id.wdl
Software Source Code	NCBI Datasets on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Download Accession

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

This task uses the accession ID output from the skani task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.

NCBI Datasets Technical Details

	Links
Task	task_ncbi_datasets.wdl
Software Source Code	NCBI Datasets on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Map Reads to Host

minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.

The mode used in this task is map-ont, which is the default mode for long reads and indicates that long reads with ~10% error rates should be aligned to the reference genome. The output file is in SAM format.

For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.

minimap2 Technical Details

	Links
Task	task_minimap2.wdl
Software Source Code	minimap2 on GitHub
Software Documentation	minimap2
Original Publication(s)	Minimap2: pairwise alignment for nucleotide sequences

Extract Unaligned Reads

The bam_to_unaligned_fastq task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.
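
The sketch below illustrates the principle behind extracting unaligned reads: in the SAM format, bit 0x4 of the FLAG field marks a read as unmapped. This is a minimal Python illustration on SAM text lines, not the task's actual samtools commands, and it does not cover the removal of unpaired reads. The function name and the example records are hypothetical.

```python
from typing import Iterable, Iterator

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unaligned_fastq_records(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield a FASTQ record for every SAM alignment line flagged as unmapped."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & UNMAPPED_FLAG:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


# Two made-up records: read1 aligned to the host, read2 did not align.
example_sam = [
    "@HD\tVN:1.6\n",
    "read1\t0\tchr_host\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\n",
]
print("".join(unaligned_fastq_records(example_sam)), end="")
```

Only read2 is written out, since it is the only record with the unmapped bit set.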

parse_mapping Technical Details

	Links
Task	task_parse_mapping.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools

Host Read Mapping Statistics

The assembly_metrics task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.

assembly_metrics Technical Details

	Links
Task	task_assembly_metrics.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools

Host Decontaminate Technical Details

	Links
Subworkflow File	wf_host_decontaminate.wdl

Read Identification

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

This task runs on cleaned reads passed from the read_QC_trim subworkflow and outputs a Kraken2 report detailing taxonomic classifications. It also separates classified reads from unclassified ones.

Database-dependent

This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.

Kraken2 Technical Details

	Links
Task	task_kraken2.wdl
Software Source Code	Kraken2 on GitHub
Software Documentation	https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
Original Publication(s)	Improved metagenomic analysis with Kraken 2

Read Extraction

The task_krakentools.wdl task extracts reads from the Kraken2 output file. It uses the KrakenTools package to extract reads classified at any user-specified taxon ID.

extract_unclassified input parameter

This parameter determines whether unclassified reads should also be extracted and combined with the taxon-specific extracted reads. By default, this is set to false, meaning that only reads classified to the specified input taxon will be extracted.

Important

This task will extract reads classified to the input taxon and all of its descendant taxa. The rank input parameter controls the extraction of reads classified at the specified rank and all subordinate taxonomic levels. See task ncbi_identify under the Taxonomic Identification section for more details on the rank input parameter.
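
Extraction of a taxon together with its descendants can be pictured as walking each read's assigned taxon up a parent map until the target is reached. The Python sketch below illustrates that logic only and is not the KrakenTools implementation. The function names are hypothetical, and the parent map is a toy tree that links Lyssavirus rabies (11292) directly to Rhabdoviridae (11270), skipping intermediate ranks.

```python
from typing import Dict, Iterable, List, Tuple

UNCLASSIFIED = 0  # Kraken2 reports unclassified reads with taxon ID 0


def in_clade(taxid: int, target: int, parent: Dict[int, int]) -> bool:
    """Walk up the parent map from taxid; True if target is reached."""
    while True:
        if taxid == target:
            return True
        nxt = parent.get(taxid)
        if nxt is None or nxt == taxid:
            return False  # reached the root or left the known tree
        taxid = nxt


def extract_reads(
    assignments: Iterable[Tuple[str, int]],
    target: int,
    parent: Dict[int, int],
    extract_unclassified: bool = False,
) -> List[str]:
    """Return IDs of reads assigned to target or any taxon beneath it."""
    kept = []
    for read_id, taxid in assignments:
        if taxid == UNCLASSIFIED:
            if extract_unclassified:
                kept.append(read_id)
        elif in_clade(taxid, target, parent):
            kept.append(read_id)
    return kept


toy_parent = {11292: 11270, 11270: 1}  # toy tree, intermediate ranks omitted
toy_assignments = [("r1", 11292), ("r2", 11270), ("r3", 0), ("r4", 9606)]
print(extract_reads(toy_assignments, 11270, toy_parent))
print(extract_reads(toy_assignments, 11270, toy_parent, extract_unclassified=True))
```

With the family-level target, reads r1 and r2 are extracted. Setting extract_unclassified to true also adds r3, while the human-assigned read r4 is never extracted.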

KrakenTools Technical Details

	Links
Task	task_krakentools.wdl
Software Source Code	KrakenTools on GitHub
Software Documentation	KrakenTools
Original Publication(s)	Metagenome analysis using the Kraken software suite

rasusa

The rasusa task performs subsampling on the input raw reads. When run, it subsamples reads to a target depth of 250X by default, using the estimated genome length either generated by the ncbi_identify task or provided directly by the user. This task is disabled by default; users can enable it by setting the skip_rasusa variable to false. The target subsampling depth can also be adjusted by modifying the coverage variable.

coverage input parameter

This parameter specifies the target coverage for subsampling. The default value is 250, but users can adjust it as needed.

Non-deterministic output(s)

This task may yield non-deterministic outputs.
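
Coverage-based subsampling amounts to choosing a target number of bases (coverage multiplied by genome length) and randomly keeping reads until that target is met. The sketch below is a simplified Python illustration under that assumption, not Rasusa's actual code. The function name, the read lengths, and the 12,000 bp genome length are made up for the example. It also shows why a fixed seed is needed to reproduce a given subsample.

```python
import random
from typing import List, Optional


def subsample_reads(
    read_lengths: List[int], coverage: float, genome_length: int, seed: Optional[int] = None
) -> List[int]:
    """Return sorted indices of reads kept to reach coverage * genome_length bases."""
    target_bases = coverage * genome_length
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # a fixed seed gives a repeatable order
    kept, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        kept.append(idx)
        total += read_lengths[idx]
    return sorted(kept)


# 10,000 made-up reads of 1,500 bp each; target is 250 * 12,000 = 3,000,000 bases.
lengths = [1500] * 10000
kept = subsample_reads(lengths, coverage=250, genome_length=12000, seed=42)
print(len(kept))
```

In this example 2,000 reads are kept. Which reads are chosen depends on the seed, so running without a seed can select a different subset each time.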

Rasusa Technical Details

	Links
Task	task_rasusa.wdl
Software Source Code	Rasusa on GitHub
Software Documentation	Rasusa on GitHub
Original Publication(s)	Rasusa: Randomly subsample sequencing reads to a specified coverage

clean_check_reads

The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:

Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
The proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion percent of basepairs are in either the read1 or read2 files.
Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than the min_coverage.

Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen variable to true. Default values vary between the ONT and PE workflows. The rationale for these default values can be found below:

Default Thresholds and Rationales

Variable	Description	Default Value	Rationale
`min_reads`	A sample will fail the read screening task if its total number of reads is less than or equal to `min_reads`	50	Minimum number of base pairs for 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus divided by 300 (longest Illumina read length)
`min_basepairs`	A sample will fail the read screening if there are fewer than `min_basepairs` basepairs	15000	Greater than 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus
`min_genome_size`	A sample will fail the read screening if the estimated genome size is smaller than `min_genome_size`	1500	Based on the Hepatitis delta (of the Deltavirus genus) genome, the smallest viral genome as of 2024-04-11 (1,700 bp)
`max_genome_size`	A sample will fail the read screening if the estimated genome size is larger than `max_genome_size`	2673870	Based on the Pandoravirus salinus genome, the biggest viral genome, (2,673,870 bp) with 2 Mbp added
`min_coverage`	A sample will fail the read screening if the estimated genome coverage is less than the `min_coverage`	10	A bare-minimum coverage for genome characterization. Higher coverage would be required for high-quality phylogenetics.
`min_proportion`	A sample will fail the read screening if fewer than `min_proportion` percent of basepairs are in either the read1 or read2 files	40	Greater than 50% reads are in the read1 file; others are in the read2 file. (PE workflow only)
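
The pass/fail checks can be summarized as a set of independent comparisons that are all evaluated before a decision is made. The Python sketch below uses the default values from the table above. It is an illustration of the logic rather than the screen task's actual code: the class and function names are hypothetical, and the estimated genome size and coverage are taken as inputs instead of being computed with mash.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScreenThresholds:
    min_reads: int = 50
    min_basepairs: int = 15000
    min_genome_size: int = 1500
    max_genome_size: int = 2673870
    min_coverage: float = 10
    min_proportion: float = 40  # percent of basepairs per read file (PE only)


def screen_reads(
    total_reads: int,
    read1_bp: int,
    read2_bp: int,
    est_genome_size: int,
    est_coverage: float,
    t: ScreenThresholds = ScreenThresholds(),
) -> List[str]:
    """Run every check and return the names of the thresholds that failed."""
    failures = []
    total_bp = read1_bp + read2_bp
    if total_reads <= t.min_reads:
        failures.append("min_reads")
    if total_bp < t.min_basepairs:
        failures.append("min_basepairs")
    if est_genome_size < t.min_genome_size:
        failures.append("min_genome_size")
    if est_genome_size > t.max_genome_size:
        failures.append("max_genome_size")
    if est_coverage < t.min_coverage:
        failures.append("min_coverage")
    if total_bp > 0:
        smaller_share = 100 * min(read1_bp, read2_bp) / total_bp
        if smaller_share < t.min_proportion:
            failures.append("min_proportion")
    return failures


# Made-up metrics for a sample that passes and one that fails several checks.
print(screen_reads(1000, 150000, 150000, 12000, 25.0))
print(screen_reads(40, 5000, 1000, 1000, 6.0))
```

An empty list means the sample passes. Any non-empty list corresponds to the case where the workflow would terminate after the screen task.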

Screen Technical Details

	Links
Task	task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task)

De novo Assembly and Reference Selection

These tasks are only performed if no reference genome is provided

In this workflow, de novo assembly is primarily used to facilitate the selection of a closely related reference genome, though high quality de novo assemblies can be used for downstream analysis. If the user provides an input reference_fasta, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:

spades
megahit
checkv_denovo
quast_denovo
skani
ncbi_datasets

spades

The spades task is a wrapper for the SPAdes assembler, which is used for de novo assembly of the cleaned reads. It is run with the --metaviral option, which is recommended for viral genomes. The MetaviralSPAdes pipeline consists of three independent steps: ViralAssembly, which finds putative viral subgraphs in a metagenomic assembly graph and generates contigs in these graphs; ViralVerify, which checks whether the resulting contigs have viral origin; and ViralComplete, which checks whether these contigs represent complete viral genomes. For more details, please see the original publication.

MetaviralSPAdes was selected as the default assembler because it produces the most complete viral genomes within TheiaViral, determined by CheckV quality assessment (see task checkv for technical details).

call_metaviralspades input parameter

This parameter controls whether or not the spades task is called by the workflow. By default, call_metaviralspades is set to true because MetaviralSPAdes is used as the primary assembler. MetaviralSPAdes is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with MetaviralSPAdes, they can set the call_metaviralspades variable to false to bypass the spades task and instead de novo assemble using MEGAHIT (see task megahit for details). Additionally, if the spades task fails during execution, the workflow will automatically fall back to using MEGAHIT for de novo assembly.
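The selection and fallback behavior described above can be sketched as follows. This is a minimal illustration rather than the workflow's WDL: choose_assembly, run_primary, and run_fallback are hypothetical names standing in for the decision logic, the spades task, and the megahit task.

```python
from typing import Callable, Optional


def choose_assembly(
    call_primary: bool,
    run_primary: Callable[[], Optional[str]],
    run_fallback: Callable[[], str],
) -> str:
    """Return an assembly path from the primary assembler or the fallback.

    call_primary mirrors call_metaviralspades; the two callables are
    hypothetical stand-ins for the spades and megahit tasks.
    """
    if call_primary:
        try:
            assembly = run_primary()
        except RuntimeError:
            # Treat a crash of the primary assembler as "no assembly".
            assembly = None
        if assembly is not None:
            return assembly
    # Reached when the primary assembler is bypassed or produced nothing.
    return run_fallback()
```

The fallback runs in two situations: when the user bypasses the primary assembler, and when the primary assembler fails to produce an assembly.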

Non-deterministic output(s)

This task may yield non-deterministic outputs.

MetaviralSPAdes Technical Details

	Links
Task	task_spades.wdl
Software Source Code	SPAdes on GitHub
Software Documentation	SPAdes Manual
Original Publication(s)	MetaviralSPAdes: assembly of viruses from metagenomic data

megahit

The megahit task is a wrapper for the MEGAHIT assembler, which is used for de novo metagenomic assembly of the cleaned reads. MEGAHIT is a fast and memory-efficient de novo assembler that can handle large datasets. This task is optional and turned off by default. It can be enabled by setting the call_metaviralspades parameter to false, and it is also used as a fallback option if the spades task fails during execution (see task spades for more details).

Non-deterministic output(s)

This task may yield non-deterministic outputs.

MEGAHIT Technical Details

	Links
Task	task_megahit.wdl
Software Source Code	MEGAHIT on GitHub
Software Documentation	MEGAHIT
Original Publication(s)	MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

skani

The skani task is used to identify and select the most closely related reference genome to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 viral genomes. This database was constructed with the following methodology:

Extracting all complete NCBI viral genomes, excluding RefSeq accessions (redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae). Some complete gene accessions, and not complete genomes, are included because NCBI datasets completeness parameters are susceptible to metadata errors.
Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA
Adding one SARS-CoV-2 genome for each major pangolin lineage
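Choosing the closest reference from a set of ANI results can be sketched as below. This is an illustration under stated assumptions: the AniHit record, its field names, the placeholder accessions, and the tie-breaking rule (highest ANI, then highest aligned fraction) are hypothetical and may differ from how the skani task actually ranks hits.

```python
from typing import NamedTuple, Optional, Sequence


class AniHit(NamedTuple):
    accession: str           # reference accession (placeholder values in examples)
    ani: float               # average nucleotide identity, in percent
    aligned_fraction: float  # percent of the query aligned to the reference


def best_reference(hits: Sequence[AniHit]) -> Optional[AniHit]:
    """Pick the hit with the highest ANI, breaking ties by aligned fraction."""
    if not hits:
        return None  # no reference could be selected
    return max(hits, key=lambda hit: (hit.ani, hit.aligned_fraction))
```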

Skani Technical Details

	Links
Task	task_skani.wdl
Software Source Code	Skani on GitHub
Software Documentation	Skani Documentation
Original Publication(s)	Fast and robust metagenomic sequence comparison through sparse chaining with skani

ncbi_datasets

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

This task uses the accession ID output from the skani task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.

NCBI Datasets Technical Details

	Links
Task	task_ncbi_datasets.wdl
Software Source Code	NCBI Datasets on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Reference Mapping

bwa

The bwa task is a wrapper for the BWA alignment tool. It utilizes the BWA-MEM algorithm to map cleaned reads to the reference genome, either selected by the skani task or provided by the user input reference_fasta. This creates a BAM file which is then sorted using the command samtools sort.

BWA Technical Details

	Links
Task	task_bwa.wdl
Software Source Code	https://github.com/lh3/bwa
Software Documentation	https://bio-bwa.sourceforge.net/
Original Publication(s)	Fast and accurate short read alignment with Burrows-Wheeler transform

read_mapping_stats

The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.

read_mapping_stats Technical Details

	Links
Task	task_assembly_metrics.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools

Variant Calling and Consensus Generation

ivar_variants

The ivar_variants task wraps the iVar tool to call variants from the sorted BAM file produced by the bwa task. It uses the ivar variants command to identify and report variants based on the aligned reads. The ivar_variants task will filter all variant calls based on user-defined parameters, including min_map_quality, min_depth, and min_allele_freq. This task will return a VCF file containing the variant calls, along with the total number of variants, and the proportion of intermediate variant calls.

min_depth input parameter

This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.

min_map_quality input parameter

This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.

min_allele_freq input parameter

This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.

iVar Technical Details

	Links
Task	task_ivar_variant_call.wdl
Software Source Code	Ivar on GitHub
Software Documentation	Ivar Documentation
Original Publication(s)	An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar

ivar consensus

The consensus task wraps the iVar tool to generate a reference-based consensus assembly from the sorted BAM file produced by the bwa task. It uses the ivar consensus command to call variants and generate a consensus sequence based on those mapped reads. The consensus task will filter all variant calls based on user-defined parameters, including min_map_quality, min_depth, and min_allele_freq. This task will return a consensus sequence in FASTA format and the samtools mpileup output.

This task is functional for segmented viruses by iteratively executing iVar on a contig-by-contig basis and concatenating the resulting consensus contigs.

min_depth input parameter

This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.

min_map_quality input parameter

This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.

min_allele_freq input parameter

This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
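To make the roles of min_depth and min_allele_freq concrete, the sketch below calls a single consensus base from the bases observed at one position. It is a deliberate simplification and not iVar's algorithm: it assumes reads were already filtered by min_map_quality, and it masks any position without a sufficiently frequent allele as "N", whereas iVar can emit IUPAC ambiguity codes. The function name is hypothetical.

```python
from collections import Counter


def call_consensus_base(
    bases: str,
    min_depth: int = 10,
    min_allele_freq: float = 0.6,
) -> str:
    """Call one consensus base from the bases observed at a position.

    Simplified assumption: positions below min_depth, or with no allele
    reaching min_allele_freq, are masked as "N".
    """
    depth = len(bases)
    if depth == 0 or depth < min_depth:
        return "N"
    base, count = Counter(bases.upper()).most_common(1)[0]
    if count / depth >= min_allele_freq:
        return base
    return "N"
```

With the default values, a position covered by 12 reads of which 8 support "A" is called as "A" (frequency about 0.67), while a position covered by only 9 reads is masked regardless of agreement.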

iVar Technical Details

	Links
Task	task_ivar_consensus.wdl
Software Source Code	Ivar on GitHub
Software Documentation	Ivar Documentation
Original Publication(s)	An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar

Assembly Evaluation and Consensus Quality Control

quast_denovo

QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.

QUAST Technical Details

	Links
Task	task_quast.wdl
Software Source Code	QUAST on GitHub
Software Documentation	https://quast.sourceforge.net/
Original Publication(s)	QUAST: quality assessment tool for genome assemblies

checkv_denovo & checkv_consensus

CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.

By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports both "weighted_contamination" and "weighted_completeness", which are average percents calculated across the total assembly that are weighted by contig length.
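A length-weighted average of a per-contig percentage can be sketched as follows. This is a minimal illustration of the weighting described above; the function name and the input layout are hypothetical, and contigs for which CheckV reports no value are assumed to have been excluded beforehand.

```python
from typing import Sequence, Tuple


def length_weighted_mean(contigs: Sequence[Tuple[int, float]]) -> float:
    """Average a per-contig percentage, weighting each contig by its length.

    Each item is (contig_length_bp, percent_value).
    """
    total_length = sum(length for length, _ in contigs)
    if total_length == 0:
        return 0.0
    return sum(length * value for length, value in contigs) / total_length
```

For example, a 9,000 bp contig at 100% completeness and a 1,000 bp contig at 50% completeness give a weighted completeness of 95%, compared with 75% for an unweighted mean.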

CheckV Technical Details

	Links
Task	task_checkv.wdl
Software Source Code	CheckV on Bitbucket
Software Documentation	CheckV Documentation
Original Publication(s)	CheckV assesses the quality and completeness of metagenome-assembled viral genomes

consensus_qc

The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, "N" bases, degenerate bases, and an estimate of the percent coverage to the reference genome.
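The statistics listed above can be sketched in a few lines. The dictionary keys and the coverage formula (unambiguous A/C/G/T bases divided by the reference length) are assumptions made for illustration and are not taken from task_consensus_qc.wdl.

```python
from typing import Dict

DEGENERATE = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N


def consensus_stats(sequence: str, reference_length: int) -> Dict[str, float]:
    """Summarize base composition of a consensus sequence (hypothetical keys)."""
    seq = sequence.upper()
    n_acgt = sum(seq.count(base) for base in "ACGT")
    stats = {
        "number_total": len(seq),
        "number_n": seq.count("N"),
        "number_degenerate": sum(1 for base in seq if base in DEGENERATE),
        "percent_reference_coverage": 0.0,
    }
    if reference_length > 0:
        # Assumed definition: unambiguous bases relative to reference length.
        stats["percent_reference_coverage"] = round(
            100.0 * n_acgt / reference_length, 2
        )
    return stats
```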

consensus_qc Technical Details

	Links
Task	task_consensus_qc.wdl
Software Source Docker Image	Theiagen Docker Builds: utility:1.1

Versioning

versioning: Version Capture

The versioning task captures the workflow version from the GitHub (code repository) version.

Version Capture Technical details

	Links
Task	task_versioning.wdl

Taxonomic Identification

ncbi_identify

The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank a.k.a read_extraction_rank input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important

The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
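The rule illustrated by these examples amounts to comparing positions in an ordered list of ranks. The sketch below is illustrative only; RANKS and rank_is_valid are hypothetical names, and the actual task resolves ranks through NCBI Datasets.

```python
# Ranks accepted by the rank parameter, ordered from lowest to highest.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "domain"]


def rank_is_valid(taxon_rank: str, requested_rank: str) -> bool:
    """Return True when requested_rank is equal to or above taxon_rank."""
    return RANKS.index(requested_rank) >= RANKS.index(taxon_rank)
```

Here rank_is_valid("species", "family") is True (the Lyssavirus rabies example), while rank_is_valid("genus", "species") is False (the Lyssavirus example).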

NCBI Datasets Technical Details

	Links
Task	task_identify_taxon_id.wdl
Software Source Code	NCBI Datasets on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Read Quality Control, Trimming, and Filtering

nanoplot_raw & nanoplot_clean

Nanoplot is used for the determination of mean quality scores, read lengths, and number of reads. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.

Nanoplot Technical Details

	Links
Task	task_nanoplot.wdl
Software Source Code	NanoPlot
Software Documentation	NanoPlot Documentation
Original Publication(s)	NanoPack2: population-scale evaluation of long-read sequencing data

porechop

Porechop is a tool for finding and removing adapters from ONT data. Adapters on the ends of reads are trimmed, and when a read has an adapter in the middle, the read is split into two.

The porechop task is optional and is turned off by default. It can be enabled by setting the call_porechop parameter to true.

Porechop Technical Details

	Links
WDL Task	task_porechop.wdl
Software Source Code	Porechop on GitHub
Software Documentation	https://github.com/rrwick/Porechop#porechop

nanoq

Reads are filtered by length and quality using nanoq. By default, reads shorter than 500 basepairs or with a quality score lower than 10 are filtered out to improve assembly accuracy.
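The filtering rule can be sketched as below, where a read is kept only if it meets both thresholds. The function names are hypothetical, and the mean-quality calculation (averaging per-base error probabilities before converting back to a Phred score) is a common convention for long reads that is assumed here rather than taken from the task.

```python
import math
from typing import Sequence


def mean_read_quality(phred_scores: Sequence[int]) -> float:
    """Mean read quality from averaged per-base error probabilities (assumed convention)."""
    if not phred_scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def passes_filter(
    sequence: str,
    phred_scores: Sequence[int],
    min_length: int = 500,
    min_quality: float = 10.0,
) -> bool:
    """Keep a read only if it meets both the length and the quality threshold."""
    return (
        len(sequence) >= min_length
        and mean_read_quality(phred_scores) >= min_quality
    )
```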

Nanoq Technical Details

	Links
Task	task_nanoq.wdl
Software Source Code	Nanoq
Software Documentation	Nanoq Documentation
Original Publication(s)	Nanoq: ultra-fast quality control for nanopore reads

ncbi_scrub_se

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.

NCBI-Scrub Technical Details

	Links
Task	task_ncbi_scrub.wdl
Software Source Code	HRRT on GitHub
Software Documentation	HRRT on NCBI

host_decontaminate

Host genetic data is frequently incidentally sequenced alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning to a reference host genome acquired on-the-fly. The reference host genome can be acquired via NCBI Taxonomy-compatible taxon input or assembly accession. Host Decontaminate maps inputted reads to the host genome using minimap2, reports mapping statistics to this host genome, and outputs the unaligned dehosted reads.

The detailed steps and tasks are as follows:

Taxonomic Identification

The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank a.k.a read_extraction_rank input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important

The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.

NCBI Datasets Technical Details

	Links
Task	task_identify_taxon_id.wdl
Software Source Code	NCBI Datasets on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Download Accession

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

This task uses the accession ID output from the skani task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.

NCBI Datasets Technical Details

	Links
Task	task_ncbi_datasets.wdl
Software Source Code	NCBI Datasets on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Map Reads to Host

minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.

The mode used in this task is map-ont which is the default mode for long reads and indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.

For more information regarding modes and the available options for minimap2, please see the minimap2 manpage

minimap2 Technical Details

	Links
Task	task_minimap2.wdl
Software Source Code	minimap2 on GitHub
Software Documentation	minimap2
Original Publication(s)	Minimap2: pairwise alignment for nucleotide sequences

Extract Unaligned Reads

The bam_to_unaligned_fastq task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.
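In the SAM format, a read that failed to align carries the 0x4 bit in its FLAG field. The sketch below shows how unaligned reads can be recognized from SAM text lines; it is an illustration of the flag test only, with hypothetical function names, and it does not reproduce the task's samtools commands or its handling of unpaired reads.

```python
from typing import Iterable, List

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def is_unmapped(flag: int) -> bool:
    """Return True when the SAM FLAG marks the read as unmapped."""
    return bool(flag & UNMAPPED)


def unaligned_read_names(sam_lines: Iterable[str]) -> List[str]:
    """Collect names of unmapped reads from SAM-formatted text lines."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if is_unmapped(int(fields[1])):
            names.append(fields[0])
    return names
```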

parse_mapping Technical Details

	Links
Task	task_parse_mapping.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools

Host Read Mapping Statistics

The assembly_metrics task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.

assembly_metrics Technical Details

	Links
Task	task_assembly_metrics.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools

Host Decontaminate Technical Details

	Links
Subworkflow File	wf_host_decontaminate.wdl

rasusa

The rasusa task performs subsampling on the input raw reads. When enabled, it subsamples reads to a default target depth of 250X, using the estimated genome length either generated by the ncbi_identify task or provided directly by the user. The task is disabled by default; users can enable it by setting the skip_rasusa variable to false. The target subsampling depth can also be adjusted by modifying the coverage variable.

coverage input parameter

This parameter specifies the target coverage for subsampling. The default value is 250, but users can adjust it as needed.
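Coverage-based subsampling can be sketched as follows: the target number of bases is the coverage multiplied by the genome size, and reads are drawn at random until that target is reached. This is an illustration of the idea and not Rasusa's implementation; the function name and the seed argument are hypothetical.

```python
import random
from typing import List, Sequence


def subsample_reads(
    read_lengths: Sequence[int],
    genome_size: int,
    coverage: float = 250.0,
    seed: int = 0,
) -> List[int]:
    """Return indices of randomly chosen reads totalling about coverage x genome_size bases."""
    target_bases = coverage * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # random draw; a fixed seed makes it repeatable
    kept, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        kept.append(idx)
        total += read_lengths[idx]
    return kept
```

If the input contains fewer bases than the target, every read is kept.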

Non-deterministic output(s)

This task may yield non-deterministic outputs.

Rasusa Technical Details

	Links
Task	task_rasusa.wdl
Software Source Code	Rasusa on GitHub
Software Documentation	Rasusa on GitHub
Original Publication(s)	Rasusa: Randomly subsample sequencing reads to a specified coverage

clean_check_reads

The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:

Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
The proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion basepairs are in either the read1 or read2 files.
Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs
Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than the min_coverage.

Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen variable to true. Default values vary between the ONT and PE workflow. The rationale for these default values can be found below:

Default Thresholds and Rationales

Variable	Description	Default Value	Rationale
`min_reads`	A sample will fail the read screening task if its total number of reads is less than or equal to `min_reads`	50	Minimum number of base pairs for 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus divided by 300 (longest Illumina read length)
`min_basepairs`	A sample will fail the read screening if there are fewer than `min_basepairs` basepairs	15000	Greater than 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus
`min_genome_size`	A sample will fail the read screening if the estimated genome size is smaller than `min_genome_size`	1500	Based on the Hepatitis delta (of the Deltavirus genus) genome, the smallest viral genome as of 2024-04-11 (1,700 bp)
`max_genome_size`	A sample will fail the read screening if the estimated genome size is larger than `max_genome_size`	2673870	Based on the Pandoravirus salinus genome, the biggest viral genome, (2,673,870 bp) with 2 Mbp added
`min_coverage`	A sample will fail the read screening if the estimated genome coverage is less than the `min_coverage`	10	A bare-minimum coverage for genome characterization. Higher coverage would be required for high-quality phylogenetics.
`min_proportion`	A sample will fail the read screening if fewer than `min_proportion` basepairs are in either the read1 or read2 files	40	Greater than 50% reads are in the read1 file; others are in the read2 file. (PE workflow only)
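The threshold checks in this table can be sketched as a single function that evaluates every check and reports all failures, matching the behavior of running each sample through all checks. This is an illustration, not the task's code: the function name and the failure labels are hypothetical, coverage is approximated as basepairs divided by the estimated genome size, and the PE-only min_proportion check is omitted.

```python
from typing import List


def screen_reads(
    n_reads: int,
    n_basepairs: int,
    est_genome_size: int,
    min_reads: int = 50,
    min_basepairs: int = 15000,
    min_genome_size: int = 1500,
    max_genome_size: int = 2673870,
    min_coverage: float = 10.0,
) -> List[str]:
    """Return the list of failed checks; an empty list means the sample passes."""
    failures = []
    if n_reads <= min_reads:
        failures.append("min_reads")
    if n_basepairs < min_basepairs:
        failures.append("min_basepairs")
    if est_genome_size < min_genome_size or est_genome_size > max_genome_size:
        failures.append("genome_size")
    # Assumed approximation of estimated coverage.
    est_coverage = n_basepairs / est_genome_size if est_genome_size else 0.0
    if est_coverage < min_coverage:
        failures.append("min_coverage")
    return failures
```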

Screen Technical Details

	Links
Task	task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task)

Read Classification and Extraction

metabuli

The metabuli task is used to classify and extract reads against a reference database. Metabuli uses a novel k-mer structure, called metamer, to analyze both amino acid (AA) and DNA sequences. It leverages AA conservation for sensitive homology detection and DNA mutations for specific differentiation between closely related taxa.

cpu / memory input parameters

Increasing the memory and cpus allocated to Metabuli can substantially increase throughput.

extract_unclassified input parameter

This parameter determines whether unclassified reads should also be extracted and combined with the taxon-specific extracted reads. By default, this is set to false, meaning that only reads classified to the specified input taxon will be extracted.

Descendant taxa reads are extracted

This task will extract reads classified to the input taxon and all of its descendant taxa. The rank input parameter controls the extraction of reads classified at the specified rank and all subordinate taxonomic levels. See task ncbi_identify under the Taxonomic Identification section above for more details on the rank input parameter.
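Collecting a taxon together with all of its descendants is a simple tree traversal, sketched below. The child map is a hypothetical, hand-written fragment of the taxonomy used only for illustration; Metabuli performs this step internally against its own taxonomy database.

```python
from typing import Dict, List, Set


def descendants(children: Dict[str, List[str]], taxon: str) -> Set[str]:
    """Return the taxon plus every taxon below it in the child map."""
    found: Set[str] = set()
    stack = [taxon]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found


# Hypothetical taxonomy fragment for illustration.
example_children = {
    "Rhabdoviridae": ["Lyssavirus", "Vesiculovirus"],
    "Lyssavirus": ["Lyssavirus rabies"],
}
```

With this fragment, extracting at the family level (Rhabdoviridae) covers reads classified to the family, to both genera, and to Lyssavirus rabies, whereas extracting at Lyssavirus covers only the genus and its species.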

Metabuli Technical Details

	Links
Task	task_metabuli.wdl
Software Source Code	Metabuli on GitHub
Software Documentation	Metabuli Documentation
Original Publication(s)	Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA

De novo Assembly and Reference Selection

These tasks are only performed if no reference genome is provided

In this workflow, de novo assembly is used solely to facilitate the selection of a closely related reference genome. If the user provides an input reference_fasta, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:

raven
flye
checkv_denovo
quast_denovo
skani
ncbi_datasets

raven

The raven task is used to create a de novo assembly from cleaned reads. Raven is an overlap-layout-consensus based assembler that accelerates the overlap step, constructs an assembly graph from reads pre-processed with pile-o-grams, applies a novel and robust graph simplification method based on graph drawings, and polishes unambiguous graph paths using Racon.

Based on internal benchmarking against Flye and results reported by Cook et al. (2024), Raven is faster, produces more contiguous assemblies, and yields more complete genomes within TheiaViral according to CheckV quality assessment (see task checkv for technical details).

call_raven input parameter

This parameter controls whether or not the raven task is called by the workflow. By default, call_raven is set to true because Raven is used as the primary assembler. Raven is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with Raven, they can set the call_raven variable to false to bypass the raven task and instead de novo assemble using Flye (see task flye for details). Additionally, if the Raven task fails during execution, the workflow will automatically fall back to using Flye for de novo assembly.

Error traceback

Raven may fail with cryptic "segmentation fault" (segfault) errors or by failing to produce an output file. It is difficult to trace the source of these issues, though increasing the memory parameter may resolve some errors.

Non-deterministic output(s)

This task may yield non-deterministic outputs.

Raven Technical Details

	Links
Task	task_raven.wdl
Software Source Code	Raven on GitHub
Software Documentation	Raven Documentation
Original Publication(s)	Time- and memory-efficient genome assembly with Raven

flye

Flye is a de novo assembler for long read data using repeat graphs. Compared to de Bruijn graphs, which require exact k-mer matches, repeat graphs can use approximate matches, which better tolerate the error rate of ONT data.

It can be enabled by setting the call_raven parameter to false. The flye task is used as a fallback option if the raven task fails during execution (see task raven for more details).

read_type input parameter

This input parameter specifies the type of sequencing reads being used for assembly. This parameter significantly impacts the assembly process and should match the characteristics of your input data. Below are the available options:

Parameter	Explanation
`--nano-hq` (default)	Optimized for ONT high-quality reads, such as Guppy5+ SUP or Q20 (<5% error). Recommended for ONT reads processed with Guppy5 or newer
`--nano-raw`	For ONT regular reads, pre-Guppy5 (<20% error)
`--nano-corr`	ONT reads corrected with other methods (<3% error)
`--pacbio-raw`	PacBio regular CLR reads (<20% error)
`--pacbio-corr`	PacBio reads corrected with other methods (<3% error)
`--pacbio-hifi`	PacBio HiFi reads (<1% error)

Refer to the Flye documentation for detailed guidance on selecting the appropriate read_type based on your sequencing data and on additional optional parameters.

Non-deterministic output(s)

This task may yield non-deterministic outputs.

Flye Technical Details

	Links
WDL Task	task_flye.wdl
Software Source Code	Flye on GitHub
Software Documentation	Flye Documentation
Original Publication(s)	Assembly of long, error-prone reads using repeat graphs

skani

The skani task is used to identify and select the most closely related reference genome to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 viral genomes. This database was constructed with the following methodology:

Extracting all complete NCBI viral genomes, excluding RefSeq accessions (redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae). Some complete gene accessions, and not complete genomes, are included because NCBI datasets completeness parameters are susceptible to metadata errors.
Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA
Adding one SARS-CoV-2 genome for each major pangolin lineage

Skani Technical Details

	Links
Task	task_skani.wdl
Software Source Code	Skani on GitHub
Software Documentation	Skani Documentation
Original Publication(s)	Fast and robust metagenomic sequence comparison through sparse chaining with skani

ncbi_datasets

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

This task uses the accession ID output from the skani task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.

NCBI Datasets Technical Details

	Links
Task	task_ncbi_datasets.wdl
Software Source Code	NCBI Datasets on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Reference Mapping

minimap2

minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.

The mode used in this task is map-ont with additional long-read-specific parameters (the -L --cs --MD flags) to align ONT reads to the reference genome. These specialized parameters are essential for proper handling of long read error profiles, generation of detailed alignment information, and improved mapping accuracy for long reads.

map-ont is the default mode for long reads and it indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.

For more information regarding modes and the available options for minimap2, please see the minimap2 manpage

minimap2 Technical Details

	Links
Task	task_minimap2.wdl
Software Source Code	minimap2 on GitHub
Software Documentation	minimap2
Original Publication(s)	Minimap2: pairwise alignment for nucleotide sequences

parse_mapping

The sam_to_sorted_bam sub-task converts the SAM file output by the minimap2 task to a BAM file. It then sorts the BAM file by coordinate and creates a BAM index file.

min_map_quality input parameter

This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.

parse_mapping Technical Details

	Links
Task	task_parse_mapping.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools

read_mapping_stats

The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.

read_mapping_stats Technical Details

	Links
Task	task_assembly_metrics.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools

fasta_utilities

The fasta_utilities task utilizes samtools to index a reference fasta file. This reference is selected by the skani task or provided by the user input reference_fasta. This indexed reference genome is used for downstream variant calling and consensus generation tasks.
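For readers unfamiliar with FASTA indexing, the sketch below reproduces the five columns of a samtools .fai index (name, length, byte offset, bases per line, bytes per line) in pure Python. It is only meant to show what the index contains; the workflow uses samtools itself, the function name is hypothetical, and the sketch assumes a well-formed FASTA with uniform line lengths within each sequence.

```python
def build_fai_records(fasta_bytes):
    """Return (name, length, offset, linebases, linewidth) per sequence."""
    records = []
    current = None  # mutable record for the sequence being read
    offset = 0      # byte offset of the current line within the file
    for raw in fasta_bytes.splitlines(keepends=True):
        if raw.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            name = raw[1:].split()[0].decode()
            # The sequence starts right after the header line.
            current = [name, 0, offset + len(raw), 0, 0]
        elif current is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if current[1] == 0:
                current[3] = bases      # bases per line
                current[4] = len(raw)   # bytes per line, including newline
            current[1] += bases
        offset += len(raw)
    if current is not None:
        records.append(tuple(current))
    return records

# Hypothetical two-segment reference.
fai = build_fai_records(b">seg1 desc\nACGT\nAC\n>seg2\nGGG\n")
```

Because each sequence gets its own record, a multi-segment reference is indexed in the same way as a single-sequence one.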

fasta_utilities Technical Details

	Links
Task	task_fasta_utilities.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools

Variant Calling and Consensus Generation

clair3

Clair3 performs deep learning-based variant detection using a multi-stage approach. The process begins with pileup-based calling for initial variant identification, followed by full-alignment analysis for comprehensive variant detection. Results are merged into a final high-confidence call set.

The variant calling pipeline employs specialized neural networks trained on ONT data to accurately identify:
- Single nucleotide variants (SNVs)
- Small insertions and deletions (indels)
- Structural variants

clair3_model input parameter

This parameter specifies the clair3 model to use for variant calling. The default is set to "r1041_e82_400bps_sup_v500", but users may select from other available models that clair3 was trained on, which may yield better results depending on the basecaller and data type. The following models are available:

"ont"
"ont_guppy2"
"ont_guppy5"
"r941_prom_sup_g5014"
"r941_prom_hac_g360+g422"
"r941_prom_hac_g238"
"r1041_e82_400bps_sup_v500"
"r1041_e82_400bps_hac_v500"
"r1041_e82_400bps_sup_v410"
"r1041_e82_400bps_hac_v410"

Default Parameters and Filtering

In this workflow, clair3 is run with nearly all default parameters. Note that the VCF file produced by the clair3 task is unfiltered and does not represent the final set of variants that will be included in the final consensus genome. A filtered VCF file is generated by the bcftools_consensus task. The filtering parameters are as follows:

The min_map_quality parameter is applied before calling variants.
The min_depth and min_allele_freq parameters are applied after variant calling during consensus genome construction.
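The post-calling filter can be pictured with the minimal sketch below, which keeps only variant calls that satisfy both thresholds. The VariantCall record and function name are hypothetical, and whether the real bcftools filter expression uses inclusive comparisons is an assumption; the defaults shown (10 and 0.6) are the documented defaults for min_depth and min_allele_freq.

```python
from dataclasses import dataclass

@dataclass
class VariantCall:
    position: int       # 1-based position on the reference
    ref: str
    alt: str
    depth: int          # read depth at the site
    allele_freq: float  # fraction of reads supporting the alternate allele

def passes_consensus_filters(call, min_depth=10, min_allele_freq=0.6):
    """Return True when a call meets both consensus thresholds."""
    return call.depth >= min_depth and call.allele_freq >= min_allele_freq

# Hypothetical calls: only the first satisfies both thresholds.
calls = [
    VariantCall(101, "A", "G", depth=45, allele_freq=0.97),
    VariantCall(250, "C", "T", depth=6, allele_freq=0.90),   # too shallow
    VariantCall(512, "G", "A", depth=80, allele_freq=0.35),  # minor allele
]
kept = [call for call in calls if passes_consensus_filters(call)]
```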

Clair3 Technical Details

	Links
Task	task_clair3.wdl
Software Source Code	Clair3 on GitHub
Software Documentation	Clair3 Documentation
Original Publication(s)	Symphonizing pileup and full-alignment for deep learning-based long-read variant calling

parse_mapping

The mask_low_coverage sub-task is used to mask low coverage regions in the reference_fasta file to improve the accuracy of the final consensus genome. Coverage thresholds are defined by the min_depth parameter, which specifies the minimum read depth required for a base to be retained. Bases falling below this threshold are replaced with "N"s to clearly mark low confidence regions. The masked reference is then combined with variants from the clair3 task to produce the final consensus genome.

min_depth input parameter

This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
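The masking rule can be sketched per base as follows. The workflow performs this step on coverage intervals with bedtools, so this pure-Python version is only an illustration of the outcome; the function name and example values are hypothetical, and treating a depth equal to min_depth as sufficient is an assumption.

```python
def mask_low_coverage(reference, depths, min_depth=10):
    """Replace reference bases whose depth is below min_depth with N."""
    if len(reference) != len(depths):
        raise ValueError("reference and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(reference, depths)
    )

masked = mask_low_coverage("ACGTACGT", [12, 11, 9, 0, 10, 30, 2, 15])
```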

parse_mapping Technical Details

	Links
Task	task_parse_mapping.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools

bcftools_consensus

The bcftools_consensus task generates a consensus genome assembly by applying variants from the clair3 task to a masked reference genome. It uses bcftools to filter variants based on the min_depth and min_allele_freq input parameters, left-align and normalize indels, index the VCF file, and generate a consensus genome in FASTA format. Reference bases are substituted with filtered variants where applicable, preserved in regions without variant calls, and replaced with "N"s in areas masked by the mask_low_coverage task.

min_depth input parameter

This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.

min_allele_freq input parameter

This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
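The consensus step can be approximated for the simplest case, single-nucleotide substitutions, with the sketch below. It is not a substitute for bcftools consensus: indels are ignored, the function name and inputs are hypothetical, and leaving masked positions as "N" when a variant coincides with them is an assumption made for illustration.

```python
def apply_snvs(masked_reference, snvs):
    """Apply 1-based SNV substitutions to a masked reference sequence."""
    bases = list(masked_reference)
    for position, alt in snvs.items():
        index = position - 1  # VCF positions are 1-based
        if bases[index] != "N":  # assumption: masked positions stay masked
            bases[index] = alt
    return "".join(bases)

# Hypothetical filtered SNVs keyed by reference position.
consensus = apply_snvs("ACNNACNT", {2: "T", 3: "G", 8: "A"})
```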

bcftools_consensus Technical Details

	Links
Task	task_bcftools_consensus.wdl
Software Source Code	bcftools on GitHub
Software Documentation	bcftools Manual Page
Original Publication(s)	Twelve Years of SAMtools and BCFtools

Assembly Evaluation and Consensus Quality Control

quast_denovo

QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
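As a reminder of what N50 measures, the sketch below computes it from a list of contig lengths: the length of the contig at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembly size. The function name and example lengths are hypothetical; QUAST computes this value itself.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly

assembly_n50 = n50([100, 200, 300, 400])
```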

QUAST Technical Details

	Links
Task	task_quast.wdl
Software Source Code	QUAST on GitHub
Software Documentation	https://quast.sourceforge.net/
Original Publication(s)	QUAST: quality assessment tool for genome assemblies

checkv_denovo & checkv_consensus

CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.

By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports both "weighted_contamination" and "weighted_completeness", which are average percents calculated across the total assembly that are weighted by contig length.
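A length-weighted average of this kind can be sketched as follows. The function name and example values are hypothetical, and how the checkv task treats contigs without a completeness or contamination estimate is not specified here, so the sketch assumes every contig has a value.

```python
def length_weighted_mean(values, contig_lengths):
    """Average per-contig percentages, weighting each by contig length."""
    total_length = sum(contig_lengths)
    if total_length == 0:
        return 0.0
    weighted_sum = sum(v * n for v, n in zip(values, contig_lengths))
    return weighted_sum / total_length

# A 9,000 bp contig at 100% and a 1,000 bp contig at 50% completeness.
weighted_completeness = length_weighted_mean([100.0, 50.0], [9000, 1000])
```

Weighting by length keeps short, fragmentary contigs from dominating the assembly-wide figure.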

CheckV Technical Details

	Links
Task	task_checkv.wdl
Software Source Code	CheckV on Bitbucket
Software Documentation	CheckV Documentation
Original Publication(s)	CheckV assesses the quality and completeness of metagenome-assembled viral genomes

consensus_qc

The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, "N" bases, degenerate bases, and an estimate of the percent coverage of the reference genome.
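These statistics can be sketched in a few lines of Python. The dictionary keys mirror the consensus_qc output names without their prefix, but the function itself is hypothetical, and the coverage formula used here (unambiguous bases divided by reference length) is an assumption rather than the task's confirmed calculation. The input is assumed to contain only IUPAC nucleotide letters.

```python
def consensus_qc(consensus, reference_length):
    """Summarize base composition of a consensus sequence."""
    sequence = consensus.upper()
    total = len(sequence)
    n_count = sequence.count("N")
    unambiguous = sum(sequence.count(base) for base in "ACGT")
    # Everything that is neither N nor A/C/G/T is treated as degenerate.
    degenerate = total - n_count - unambiguous
    if reference_length > 0:
        percent_coverage = 100.0 * unambiguous / reference_length
    else:
        percent_coverage = 0.0
    return {
        "number_Total": total,
        "number_N": n_count,
        "number_Degenerate": degenerate,
        "assembly_length_unambiguous": unambiguous,
        "percent_reference_coverage": percent_coverage,
    }

stats = consensus_qc("ACGTNNRYAC", reference_length=12)
```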

consensus_qc Technical Details

	Links
Task	task_consensus_qc.wdl
Software Source Docker Image	Theiagen Docker Builds: utility:1.1

Taxa-Specific Tasks¶

The TheiaViral workflows automatically activate taxa-specific sub-workflows after the identification of relevant taxa using the taxon ID of the reference genome.

Lyssavirus rabies

nextclade

"Nextclade is an open-source project for viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement."

Theiagen has implemented a full genome-based Nextclade dataset for L. rabies with subclade classification resolution.

Nextclade Technical Details

	Links
Task	task_nextclade.wdl
Software Source Code	https://github.com/nextstrain/nextclade
Software Documentation	Nextclade
Original Publication(s)	Nextclade: clade assignment, mutation calling and quality control for viral genomes.

Outputs¶

TheiaViral_Illumina_PE | TheiaViral_ONT

Variable	Type	Description
abricate_flu_database	String	ABRicate database used for analysis
abricate_flu_results	File	File containing all results from ABRicate
abricate_flu_subtype	String	Flu subtype as determined by ABRicate
abricate_flu_type	String	Flu type as determined by ABRicate
abricate_flu_version	String	Version of ABRicate
assembly_denovo_fasta	File	De novo assembly in FASTA format
auspice_json_flu_ha	File	Auspice-compatible JSON output generated from Nextclade analysis on Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_flu_na	File	Auspice-compatible JSON output generated from Nextclade analysis on Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_mpxv	File	Auspice-compatible JSON output generated from Nextclade analysis on Monkeypox virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_rabies	File	Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
bbduk_docker	String	The Docker image for bbduk, which was used to remove the adapters from the sequences
bbduk_read1_clean	File	Clean forward reads after BBDuk processing
bbduk_read2_clean	File	Clean reverse reads after BBDuk processing
bwa_aligned_bai	File	BAM index file for reads aligned to reference
bwa_read1_aligned	File	Forward reads aligned to reference
bwa_read1_unaligned	File	Forward reads not aligned to reference
bwa_read2_aligned	File	Reverse reads aligned to reference
bwa_read2_unaligned	File	Reverse reads not aligned to reference
bwa_samtools_version	String	Version of samtools used by BWA
bwa_sorted_bai	File	Sorted BAM index file of reads aligned to reference
bwa_sorted_bam	File	Sorted BAM file of reads aligned to reference
bwa_sorted_bam_unaligned	File	A BAM file that only contains reads that did not align to the reference
bwa_sorted_bam_unaligned_bai	File	Index companion file to a BAM file that only contains reads that did not align to the reference
bwa_version	String	Version of BWA software used
checkv_consensus_contamination	Float	Contamination estimate for consensus assembly from CheckV
checkv_consensus_summary	File	Summary report from CheckV for consensus assembly
checkv_consensus_total_genes	Int	Number of genes detected in consensus assembly by CheckV
checkv_consensus_version	String	Version of CheckV used for consensus assembly
checkv_consensus_weighted_completeness	Float	Weighted completeness score for consensus assembly from CheckV
checkv_consensus_weighted_contamination	Float	Weighted contamination score for consensus assembly from CheckV
checkv_denovo_contamination	Float	Contamination estimate for de novo assembly from CheckV
checkv_denovo_summary	File	Summary report from CheckV for de novo assembly
checkv_denovo_total_genes	Int	Number of genes detected in de novo assembly by CheckV
checkv_denovo_version	String	Version of CheckV used for de novo assembly
checkv_denovo_weighted_completeness	Float	Weighted completeness score for de novo assembly from CheckV
checkv_denovo_weighted_contamination	Float	Weighted contamination score for de novo assembly from CheckV
consensus_n_variant_min_depth	Int	Minimum read depth to call variants for iVar consensus and iVar variants. Also represents the minimum consensus support threshold used by IRMA with Illumina Influenza data.
consensus_qc_assembly_length_unambiguous	Int	Length of consensus assembly excluding ambiguous bases
consensus_qc_number_Degenerate	Int	Number of degenerate bases in consensus assembly
consensus_qc_number_N	Int	Number of N bases in consensus assembly
consensus_qc_number_Total	Int	Total number of bases in consensus assembly
consensus_qc_percent_reference_coverage	Float	Percent of reference genome covered in consensus assembly
dehost_wf_dehost_read1	File	Reads that did not map to host
dehost_wf_dehost_read2	File	Paired-reads that did not map to host
dehost_wf_download_status	String	Status of host genome acquisition
dehost_wf_host_accession	String	Host genome accession
dehost_wf_host_fasta	File	Host genome FASTA file
dehost_wf_host_flagstat	File	Output from the SAMtools flagstat command to assess quality of the alignment file (BAM)
dehost_wf_host_mapped_bai	File	BAM index file for the reads aligned to the host reference
dehost_wf_host_mapped_bam	File	Sorted BAM file containing the alignments of reads to the host reference genome
dehost_wf_host_mapping_cov_hist	File	Coverage histogram from host read mapping
dehost_wf_host_mapping_coverage	Float	Average coverage from host read mapping
dehost_wf_host_mapping_mean_depth	Float	Average depth from host read mapping
dehost_wf_host_mapping_metrics	File	File of mapping metrics
dehost_wf_host_mapping_stats	File	File of mapping statistics
dehost_wf_host_percent_mapped_reads	Float	Percentage of reads mapped to host reference genome
fastp_html_report	File	The HTML report made with fastp
fastp_version	String	The version of fastp used
fastq_scan_clean1_json	File	The JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length
fastq_scan_clean2_json	File	The JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length
fastq_scan_clean_pairs	Int	Number of read pairs after cleaning
fastq_scan_docker	String	The Docker image of fastq_scan
fastq_scan_num_reads_clean1	Int	The number of forward reads after cleaning as calculated by fastq_scan
fastq_scan_num_reads_clean2	Int	The number of reverse reads after cleaning as calculated by fastq_scan
fastq_scan_num_reads_raw1	Int	The number of input forward reads as calculated by fastq_scan
fastq_scan_num_reads_raw2	Int	The number of input reverse reads as calculated by fastq_scan
fastq_scan_raw1_json	File	The JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length
fastq_scan_raw2_json	File	The JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length
fastq_scan_raw_pairs	Int	Number of raw read pairs
fastq_scan_version	String	The version of fastq_scan
genoflu_all_segments	String	The genotypes for each individual flu segment
genoflu_genotype	String	The genotype of the whole genome, based on the individual segment types
genoflu_output_tsv	File	The output file from GenoFLU
genoflu_version	String	The version of GenoFLU used
irma_docker	String	Docker image used to run IRMA
irma_subtype	String	Flu subtype as determined by IRMA
irma_subtype_notes	String	Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column"
irma_type	String	Flu type as determined by IRMA
irma_version	String	Version of IRMA used
ivar_tsv	File	Variant descriptor file generated by iVar variants
ivar_variant_proportion_intermediate	String	The proportion of variants of intermediate frequency
ivar_variant_version	String	Version of iVar for running the iVar variants command
ivar_vcf	File	iVar tsv output converted to VCF format
ivar_version_consensus	String	Version of iVar for running the iVar consensus command
kraken2_extracted_read1	File	Forward reads extracted by taxonomic classification
kraken2_extracted_read2	File	Reverse reads extracted by taxonomic classification
kraken_database	File	Database used for Kraken classification
kraken_docker	String	Docker image used for Kraken
kraken_report	File	Full Kraken report
kraken_version	String	Version of Kraken software used
megahit_docker	String	Docker image used for MEGAHIT
megahit_status	String	Status of the MEGAHIT assembly
megahit_version	String	Version of MEGAHIT used
metaviralspades_docker	String	Docker image used for MetaviralSPAdes
metaviralspades_status	String	Status of MetaviralSPAdes assembly
metaviralspades_version	String	Version of MetaviralSPAdes used
ncbi_datasets_docker	String	Docker image used for NCBI datasets
ncbi_datasets_version	String	Version of NCBI datasets used
ncbi_identify_accession	String	NCBI accession ID of identified taxon
ncbi_identify_avg_genome_length	Int	Average genome length from NCBI taxon summary
ncbi_identify_genome_summary_tsv	File	TSV file with genome summary from NCBI
ncbi_identify_read_extraction_rank	String	Taxonomic rank used for read extraction
ncbi_identify_taxon_id	String	NCBI taxonomy ID of identified organism
ncbi_identify_taxon_name	String	Name of identified taxon
ncbi_identify_taxon_summary_tsv	File	TSV file with taxa specific summary from NCBI
ncbi_scrub_docker	String	The Docker image for NCBI's HRRT (human read removal tool)
ncbi_scrub_human_spots_removed	Int	Number of spots removed (or masked)
nextclade_aa_dels_flu_ha	String	Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for HA segment
nextclade_aa_dels_flu_na	String	Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for NA segment
nextclade_aa_dels_mpxv	String	Amino-acid deletions as detected by Nextclade. Specific to Monkeypox
nextclade_aa_dels_rabies	String	Amino-acid deletions as detected by Nextclade. Specific to Rabies
nextclade_aa_subs_flu_ha	String	Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for HA segment
nextclade_aa_subs_flu_na	String	Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for NA segment
nextclade_aa_subs_mpxv	String	Amino-acid substitutions as detected by Nextclade. Specific to Monkeypox
nextclade_aa_subs_rabies	String	Amino-acid substitutions as detected by Nextclade. Specific to Rabies
nextclade_clade_mpxv	String	Nextclade clade designation, specific to Monkeypox
nextclade_clade_rabies	String	Nextclade clade designation, specific to Rabies
nextclade_docker	String	Docker image used to run Nextclade
nextclade_ds_tag	String	Dataset tag used to run Nextclade. Will be blank for Flu
nextclade_ds_tag_flu_ha	String	Dataset tag used to run Nextclade, specific to Flu HA segment
nextclade_ds_tag_flu_na	String	Dataset tag used to run Nextclade, specific to Flu NA segment
nextclade_json_flu_ha	File	Nextclade output in JSON file format, specific to Flu HA segment
nextclade_json_flu_na	File	Nextclade output in JSON file format, specific to Flu NA segment
nextclade_json_mpxv	File	Nextclade output in JSON file format, specific to Monkeypox
nextclade_json_rabies	File	Nextclade output in JSON file format, specific to Rabies
nextclade_lineage_mpxv	String	Nextclade lineage designation, specific to Monkeypox
nextclade_lineage_rabies	String	Nextclade lineage designation, specific to Rabies
nextclade_qc_flu_ha	String	QC metric as determined by Nextclade, specific to Flu HA segment
nextclade_qc_flu_na	String	QC metric as determined by Nextclade, specific to Flu NA segment
nextclade_qc_mpxv	String	QC metric as determined by Nextclade, specific to Monkeypox
nextclade_qc_rabies	String	QC metric as determined by Nextclade, specific to Rabies
nextclade_tsv_flu_ha	File	Nextclade output in TSV file format, specific to Flu HA segment
nextclade_tsv_flu_na	File	Nextclade output in TSV file format, specific to Flu NA segment
nextclade_tsv_mpxv	File	Nextclade output in TSV file format, specific to Monkeypox
nextclade_tsv_rabies	File	Nextclade output in TSV file format, specific to Rabies
organism	String	Standardized organism name used for characterization
pango_lineage	String	Pango lineage as determined by Pangolin
pango_lineage_expanded	String	Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1"
pango_lineage_report	File	Full Pango lineage report generated by Pangolin
pangolin_assignment_version	String	The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment
pangolin_conflicts	String	Number of lineage conflicts as determined by Pangolin
pangolin_docker	String	Docker image used to run Pangolin
pangolin_notes	String	Lineage notes as determined by Pangolin
pangolin_versions	String	All Pangolin software and database versions
quast_denovo_docker	String	Docker image used for QUAST
quast_denovo_gc_percent	Float	GC percentage of de novo assembly from QUAST
quast_denovo_genome_length	Int	Genome length of de novo assembly from QUAST
quast_denovo_largest_contig	Int	Size of largest contig in de novo assembly from QUAST
quast_denovo_n50_value	Int	N50 value of de novo assembly from QUAST
quast_denovo_number_contigs	Int	Number of contigs in de novo assembly from QUAST
quast_denovo_report	File	QUAST report for de novo assembly
quast_denovo_uncalled_bases	Int	Number of uncalled bases in de novo assembly from QUAST
quast_denovo_version	String	Version of QUAST used
read1_dehosted	File	The dehosted forward reads file; suggested read file for SRA submission
read2_dehosted	File	The dehosted reverse reads file; suggested read file for SRA submission
read_mapping_cov_hist	File	Coverage histogram from read mapping
read_mapping_cov_stats	File	Coverage statistics from read mapping
read_mapping_coverage	Float	Average coverage from read mapping
read_mapping_date	String	Date of read mapping analysis
read_mapping_depth	Float	Average depth from read mapping
read_mapping_flagstat	File	Flagstat file from read mapping
read_mapping_meanbaseq	Float	Mean base quality from read mapping
read_mapping_meanmapq	Float	Mean mapping quality from read mapping
read_mapping_percentage_mapped_reads	Float	Percentage of mapped reads
read_mapping_report	File	Report file from read mapping
read_mapping_samtools_version	String	Version of samtools used in read mapping
read_mapping_statistics	File	Statistics file from read mapping
reference_taxon_name	String	NCBI derived taxon name from best ANI hit accession
skani_database	File	Database used for Skani
skani_docker	String	Docker image used for Skani
skani_report	File	Report from Skani
skani_status	String	Status of Skani analysis
skani_top_accession	String	Top accession ID from Skani
skani_top_ani	Float	Top ANI score from Skani
skani_top_ani_fasta	File	FASTA file of top ANI match from Skani
skani_top_query_coverage	Float	Query coverage of top match from Skani
skani_top_score	Float	Top score from Skani
skani_version	String	Version of Skani used
skani_warning	String	Skani warning message
theiaviral_illumina_pe_date	String	Date of TheiaViral Illumina PE workflow run
theiaviral_illumina_pe_version	String	Version of TheiaViral Illumina PE workflow
trimmomatic_docker	String	The docker image used for the trimmomatic module in this workflow
trimmomatic_version	String	The version of Trimmomatic used

Variable	Type	Description
abricate_flu_database	String	ABRicate database used for analysis
abricate_flu_results	File	File containing all results from ABRicate
abricate_flu_subtype	String	Flu subtype as determined by ABRicate
abricate_flu_type	String	Flu type as determined by ABRicate
abricate_flu_version	String	Version of ABRicate
assembly_denovo_fasta	File	De novo assembly in FASTA format
assembly_to_ref_bai	File	BAM index file for reads aligned to reference
assembly_to_ref_bam	File	BAM file of reads aligned to reference
auspice_json_flu_ha	File	Auspice-compatible JSON output generated from Nextclade analysis on Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_flu_na	File	Auspice-compatible JSON output generated from Nextclade analysis on Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_mpxv	File	Auspice-compatible JSON output generated from Nextclade analysis on Monkeypox virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_rabies	File	Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
bcftools_docker	String	Docker image used for bcftools
bcftools_filtered_vcf	File	Filtered variant calls in VCF format from bcftools
bcftools_version	String	Version of bcftools used
checkv_consensus_contamination	Float	Contamination estimate for consensus assembly from CheckV
checkv_consensus_summary	File	Summary report from CheckV for consensus assembly
checkv_consensus_total_genes	Int	Number of genes detected in consensus assembly by CheckV
checkv_consensus_version	String	Version of CheckV used for consensus assembly
checkv_consensus_weighted_completeness	Float	Weighted completeness score for consensus assembly from CheckV
checkv_consensus_weighted_contamination	Float	Weighted contamination score for consensus assembly from CheckV
checkv_denovo_contamination	Float	Contamination estimate for de novo assembly from CheckV
checkv_denovo_summary	File	Summary report from CheckV for de novo assembly
checkv_denovo_total_genes	Int	Number of genes detected in de novo assembly by CheckV
checkv_denovo_version	String	Version of CheckV used for de novo assembly
checkv_denovo_weighted_completeness	Float	Weighted completeness score for de novo assembly from CheckV
checkv_denovo_weighted_contamination	Float	Weighted contamination score for de novo assembly from CheckV
clair3_docker	String	Docker image used for Clair3
clair3_gvcf	File	Genomic VCF file from Clair3
clair3_model	String	Model used for Clair3 variant calling
clair3_vcf	File	Variant calls in VCF format from Clair3
clair3_version	String	Version of Clair3 used
consensus_qc_assembly_length_unambiguous	Int	Length of consensus assembly excluding ambiguous bases
consensus_qc_number_Degenerate	Int	Number of degenerate bases in consensus assembly
consensus_qc_number_N	Int	Number of N bases in consensus assembly
consensus_qc_number_Total	Int	Total number of bases in consensus assembly
consensus_qc_percent_reference_coverage	Float	Percent of reference genome covered in consensus assembly
dehost_wf_dehost_read1	File	Reads that did not map to host
dehost_wf_download_status	String	Status of host genome acquisition
dehost_wf_host_accession	String	Host genome accession
dehost_wf_host_fasta	File	Host genome FASTA file
dehost_wf_host_flagstat	File	Output from the SAMtools flagstat command to assess quality of the alignment file (BAM)
dehost_wf_host_mapped_bai	File	BAM index file for the reads aligned to the host reference
dehost_wf_host_mapped_bam	File	Sorted BAM file containing the alignments of reads to the host reference genome
dehost_wf_host_mapping_cov_hist	File	Coverage histogram from host read mapping
dehost_wf_host_mapping_coverage	Float	Average coverage from host read mapping
dehost_wf_host_mapping_mean_depth	Float	Average depth from host read mapping
dehost_wf_host_mapping_metrics	File	File of mapping metrics
dehost_wf_host_mapping_stats	File	File of mapping statistics
dehost_wf_host_percent_mapped_reads	Float	Percentage of reads mapped to host reference genome
fasta_utilities_fai	File	FASTA index file
fasta_utilities_samtools_docker	String	Docker image used for samtools in fasta utilities
fasta_utilities_samtools_version	String	Version of samtools used in fasta utilities
flye_denovo_docker	String	Docker image used for Flye
flye_denovo_info	File	Information file from Flye assembly
flye_denovo_status	String	Status of Flye assembly
flye_denovo_version	String	Version of Flye used
genoflu_all_segments	String	The genotypes for each individual flu segment
genoflu_genotype	String	The genotype of the whole genome, based on the individual segment types
genoflu_output_tsv	File	The output file from GenoFLU
genoflu_version	String	The version of GenoFLU used
irma_docker	String	Docker image used to run IRMA
irma_subtype	String	Flu subtype as determined by IRMA
irma_subtype_notes	String	Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column"
irma_type	String	Flu type as determined by IRMA
irma_version	String	Version of IRMA used
mask_low_coverage_all_coverage_bed	File	BED file showing all coverage regions
mask_low_coverage_bed	File	BED file showing masked low coverage regions
mask_low_coverage_bedtools_docker	String	Docker image used for bedtools in masking
mask_low_coverage_bedtools_version	String	Version of bedtools used in masking
mask_low_coverage_reference_fasta	File	Reference FASTA with low coverage regions masked
metabuli_classified	File	Classified reads from Metabuli
metabuli_database	File	Database used for Metabuli
metabuli_docker	String	Docker image used for Metabuli
metabuli_krona_report	File	Krona visualization report from Metabuli
metabuli_read1_extract	File	Extracted reads from Metabuli
metabuli_report	File	Classification report from Metabuli
metabuli_version	String	Version of Metabuli used
minimap2_docker	String	The Docker image of minimap2
minimap2_out	File	Output file from Minimap2 alignment
minimap2_version	String	The version of minimap2
nanoplot_html_clean	File	An HTML report describing the clean reads
nanoplot_html_raw	File	An HTML report describing the raw reads
nanoplot_num_reads_clean1	Int	Number of clean reads
nanoplot_num_reads_raw1	Int	Number of raw reads
nanoplot_r1_mean_q_clean	Float	Mean quality score of clean forward reads
nanoplot_r1_mean_q_raw	Float	Mean quality score of raw forward reads
nanoplot_r1_mean_readlength_clean	Float	Mean read length of clean forward reads
nanoplot_r1_mean_readlength_raw	Float	Mean read length of raw forward reads
nanoplot_r1_median_q_clean	Float	Median quality score of clean forward reads
nanoplot_r1_median_q_raw	Float	Median quality score of raw forward reads
nanoplot_r1_median_readlength_clean	Float	Median read length of clean forward reads
nanoplot_r1_median_readlength_raw	Float	Median read length of raw forward reads
nanoplot_r1_n50_clean	Float	N50 of clean forward reads
nanoplot_r1_n50_raw	Float	N50 of raw forward reads
nanoplot_r1_stdev_readlength_clean	Float	Standard deviation read length of clean forward reads
nanoplot_r1_stdev_readlength_raw	Float	Standard deviation read length of raw forward reads
nanoplot_tsv_clean	File	A TSV report describing the clean reads
nanoplot_tsv_raw	File	A TSV report describing the raw reads
nanoq_filtered_read1	File	Filtered reads from NanoQ
nanoq_version	String	Version of nanoq used in analysis
ncbi_datasets_docker	String	Docker image used for NCBI datasets
ncbi_datasets_version	String	Version of NCBI datasets used
ncbi_identify_accession	String	NCBI accession ID of identified taxon
ncbi_identify_avg_genome_length	Int	Average genome length from NCBI taxon summary
ncbi_identify_docker	String	Docker image used for NCBI identify
ncbi_identify_genome_summary_tsv	File	TSV file with genome summary from NCBI
ncbi_identify_read_extraction_rank	String	Taxonomic rank used for read extraction
ncbi_identify_taxon_id	String	NCBI taxonomy ID of identified organism
ncbi_identify_taxon_name	String	Name of identified taxon
ncbi_identify_taxon_summary_tsv	File	TSV file with taxa specific summary from NCBI
ncbi_identify_version	String	Version of NCBI identify tool used
ncbi_scrub_docker	String	The Docker image for NCBI's HRRT (human read removal tool)
ncbi_scrub_human_spots_removed	Int	Number of spots removed (or masked)
ncbi_scrub_read1_dehosted	File	Dehosted reads after NCBI scrub
nextclade_aa_dels_flu_ha	String	Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for HA segment
nextclade_aa_dels_flu_na	String	Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for NA segment
nextclade_aa_dels_mpxv	String	Amino-acid deletions as detected by Nextclade. Specific to Monkeypox
nextclade_aa_dels_rabies	String	Amino-acid deletions as detected by Nextclade. Specific to Rabies
nextclade_aa_subs_flu_ha	String	Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for HA segment
nextclade_aa_subs_flu_na	String	Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for NA segment
nextclade_aa_subs_mpxv	String	Amino-acid substitutions as detected by Nextclade. Specific to Monkeypox
nextclade_aa_subs_rabies	String	Amino-acid substitutions as detected by Nextclade. Specific to Rabies
nextclade_clade_mpxv	String	Nextclade clade designation, specific to Monkeypox
nextclade_clade_rabies	String	Nextclade clade designation, specific to Rabies
nextclade_docker	String	Docker image used to run Nextclade
nextclade_ds_tag	String	Dataset tag used to run Nextclade. Will be blank for Flu
nextclade_ds_tag_flu_ha	String	Dataset tag used to run Nextclade, specific to Flu HA segment
nextclade_ds_tag_flu_na	String	Dataset tag used to run Nextclade, specific to Flu NA segment
nextclade_json_flu_ha	File	Nextclade output in JSON file format, specific to Flu HA segment
nextclade_json_flu_na	File	Nextclade output in JSON file format, specific to Flu NA segment
nextclade_json_mpxv	File	Nextclade output in JSON file format, specific to Monkeypox
nextclade_json_rabies	File	Nextclade output in JSON file format, specific to Rabies
nextclade_lineage_mpxv	String	Nextclade lineage designation, specific to Monkeypox
nextclade_lineage_rabies	String	Nextclade lineage designation, specific to Rabies
nextclade_qc_flu_ha	String	QC metric as determined by Nextclade, specific to Flu HA segment
nextclade_qc_flu_na	String	QC metric as determined by Nextclade, specific to Flu NA segment
nextclade_qc_mpxv	String	QC metric as determined by Nextclade, specific to Monkeypox
nextclade_qc_rabies	String	QC metric as determined by Nextclade, specific to Rabies
nextclade_tsv_flu_ha	File	Nextclade output in TSV file format, specific to Flu HA segment
nextclade_tsv_flu_na	File	Nextclade output in TSV file format, specific to Flu NA segment
nextclade_tsv_mpxv	File	Nextclade output in TSV file format, specific to Monkeypox
nextclade_tsv_rabies	File	Nextclade output in TSV file format, specific to Rabies
organism	String	Standardized organism name used for characterization
pango_lineage	String	Pango lineage as determined by Pangolin
pango_lineage_expanded	String	Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1"
pango_lineage_report	File	Full Pango lineage report generated by Pangolin
pangolin_assignment_version	String	The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment
pangolin_conflicts	String	Number of lineage conflicts as determined by Pangolin
pangolin_docker	String	Docker image used to run Pangolin
pangolin_notes	String	Lineage notes as determined by Pangolin
pangolin_versions	String	All Pangolin software and database versions
parse_mapping_samtools_docker	String	Docker image used for samtools in parse mapping
parse_mapping_samtools_version	String	Version of samtools used in parse mapping
porechop_trimmed_read1	File	Trimmed reads from Porechop
porechop_version	String	Version of Porechop used
quast_denovo_docker	String	Docker image used for QUAST
quast_denovo_gc_percent	Float	GC percentage of de novo assembly from QUAST
quast_denovo_genome_length	Int	Genome length of de novo assembly from QUAST
quast_denovo_largest_contig	Int	Size of largest contig in de novo assembly from QUAST
quast_denovo_n50_value	Int	N50 value of de novo assembly from QUAST
quast_denovo_number_contigs	Int	Number of contigs in de novo assembly from QUAST
quast_denovo_report	File	QUAST report for de novo assembly
quast_denovo_uncalled_bases	Int	Number of uncalled bases in de novo assembly from QUAST
quast_denovo_version	String	Version of QUAST used
rasusa_read1_subsampled	File	Subsampled read file from Rasusa
rasusa_read2_subsampled	File	Subsampled read file from Rasusa (paired file)
rasusa_version	String	Version of RASUSA used for the analysis
raven_denovo_docker	String	Docker image used for Raven
raven_denovo_status	String	Status of Raven assembly
raven_denovo_version	String	Version of Raven used
read_mapping_cov_hist	File	Coverage histogram from read mapping
read_mapping_cov_stats	File	Coverage statistics from read mapping
read_mapping_coverage	Float	Average coverage from read mapping
read_mapping_date	String	Date of read mapping analysis
read_mapping_depth	Float	Average depth from read mapping
read_mapping_flagstat	File	Flagstat file from read mapping
read_mapping_meanbaseq	Float	Mean base quality from read mapping
read_mapping_meanmapq	Float	Mean mapping quality from read mapping
read_mapping_percentage_mapped_reads	Float	Percentage of mapped reads
read_mapping_report	File	Report file from read mapping
read_mapping_samtools_version	String	Version of samtools used in read mapping
read_mapping_statistics	File	Statistics file from read mapping
read_screen_clean	String	PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure
read_screen_clean_tsv	File	Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length
reference_taxon_name	String	NCBI derived taxon name from best ANI hit accession
skani_database	File	Database used for Skani
skani_docker	String	Docker image used for Skani
skani_report	File	Report from Skani
skani_status	String	Status of Skani analysis
skani_top_accession	String	Top accession ID from Skani
skani_top_ani	Float	Top ANI score from Skani
skani_top_ani_fasta	File	FASTA file of top ANI match from Skani
skani_top_query_coverage	Float	Query coverage of top match from Skani
skani_top_score	Float	Top score from Skani
skani_version	String	Version of Skani used
skani_warning	String	Skani warning message
theiaviral_ont_date	String	Date of TheiaViral ONT workflow run
theiaviral_ont_version	String	Version of TheiaViral ONT workflow

What are the differences between the de novo and consensus assemblies?

De novo genomes are generated from scratch without a reference to guide read assembly, while consensus genomes are generated by mapping reads to a reference and replacing reference positions with identified variants (structural and nucleotide). De novo assemblies are thus not biased by requiring reads to map to the reference, though they may be more fragmented. Consensus assembly can generate more robust assemblies from lower coverage samples if the reference genome is of sufficient quality and sufficiently closely related to the input sequence, though consensus assembly may not perform well in instances of significant structural variation. TheiaViral uses de novo assemblies as an intermediate to acquire the best reference genome for consensus assembly.

We generally recommend TheiaViral users focus on the consensus assembly as the desired assembly output. While we chose the best de novo assemblers for TheiaViral based on internal benchmarking, the consensus assembly will often be higher quality than the de novo assembly. However, the de novo assembly can approach or exceed consensus quality if the read inputs largely comprise one virus, have high depth of coverage, and/or are derived from a virus with high potential for recombination. TheiaViral does conduct assembly contiguity and viral completeness quality control for de novo assemblies, so a de novo assembly that meets quality control standards can certainly be used for downstream analysis.

How is de novo assembly quality evaluated?

De novo assembly quality evaluation focuses on the completeness and contiguity of the genome. While a ground truth genome does not truly exist for quality comparison, reference genome selection can help contextualize quality if the reference is sufficiently similar to the de novo assembly. TheiaViral uses QUAST to acquire basic contiguity statistics and CheckV to assess viral genome completeness and contamination. Additionally, the reference selection software, Skani, can provide a quantitative comparison between the de novo assembly and the best reference genome.

Completeness and contamination

checkv_denovo_summary: The summary file reports CheckV results on a contig-by-contig basis. Ideally completeness is 100% for a single contig, or 100% for all segments. If there are multiple extraneous contigs in the assembly, one is ideally 100%. The same principles apply to contamination, though it ideally is 0%.
checkv_denovo_total_genes: The total genes is ideally the same number of genes as expected from the inputted viral taxon. Sometimes CheckV can fail to recover all the genes from a complete genome, so other statistics should be weighted more heavily in quality evaluation.
checkv_denovo_weighted_completeness: The weighted completeness is ideally 100%.
checkv_denovo_weighted_contamination: The weighted contamination is ideally 0%.

Length and contiguity

quast_denovo_genome_length: The de novo genome length is ideally the same as the expected genome length of the focal virus.
quast_denovo_largest_contig: The largest contig is ideally the size of the genome, or the size of the largest expected segment. If there are multiple contigs, and the largest contig is the ideal size, then the smaller contigs may be discarded based on the CheckV completeness for the largest contig (see CheckV outputs).
quast_denovo_n50_value: The N50 is an evaluation of contiguity and is ideally as close as possible to the genome size. For segmented viruses, the N50 should be as close as possible to the size of the segment molecule that would cover at least 50% of the total genome size when segment lengths are added after sorting largest to smallest.
quast_denovo_number_contigs: The number of contigs is ideally 1 or the total number of segments expected.
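
As a worked illustration of the N50 expectation for segmented viruses described above, the following minimal sketch computes the N50 that a perfectly assembled segmented genome would have. The function name and the segment lengths are hypothetical and chosen only for illustration; they are not part of TheiaViral or QUAST.

```python
def expected_n50(segment_lengths):
    """Return the expected N50 for a fully assembled segmented genome.

    Segments are sorted largest to smallest and their lengths are summed;
    the N50 is the length of the segment at which the running total first
    reaches at least 50% of the total genome size.
    """
    total = sum(segment_lengths)
    running = 0
    for length in sorted(segment_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("segment_lengths must not be empty")


# Hypothetical eight-segment genome (lengths in bp, illustrative only).
segments = [2300, 2300, 2200, 1700, 1500, 1400, 1000, 900]
print(expected_n50(segments))  # 2200
```

Comparing this expected value with quast_denovo_n50_value gives a quick sense of whether the largest segments were assembled as single contigs.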

Reference genome similarity

skani_top_ani: The percent average nucleotide identity (ANI) for the top Skani hit is ideally 100% if the sequenced virus is highly similar to a reference genome. However, if the virus is divergent, ANI is not a good indication of assembly quality.
skani_top_query_coverage: The percent query coverage for the top Skani hit is ideally 100% if the sequenced virus has not undergone significant recombination/structural variation.
skani_top_score: The score for the top Skani hit is the ANI x Query (de novo assembly) coverage and is ideally 100% if the sequenced virus is not substantially divergent from the reference dataset.
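
The relationship between the three Skani outputs can be sketched as below. This assumes both ANI and query coverage are expressed as percentages and that the product is rescaled to a percentage; the function name is hypothetical and the exact scaling used by the workflow is not confirmed here.

```python
def skani_score(ani_percent, query_coverage_percent):
    """Combine ANI and query coverage (both in percent) into one score.

    Assumption: the score is the product of the two values, rescaled so
    that 100% ANI with 100% query coverage yields a score of 100.
    """
    return ani_percent * query_coverage_percent / 100.0


# A near-identical reference that covers only half of the de novo assembly
# scores much lower than its ANI alone would suggest.
print(skani_score(98.0, 50.0))  # 49.0
```

Because the score penalizes low query coverage, a high ANI alone does not guarantee that the selected reference represents the whole de novo assembly.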

How is consensus assembly quality evaluated?

Consensus assemblies are derived from a reference genome, so quality assessment focuses on coverage and variant quality. Bases with insufficient coverage are denoted as "N". Additionally, the size and contiguity of a TheiaViral consensus assembly are expected to approximate those of the reference genome, so any discrepancy here is likely due to inferred structural variation.

Completeness and contamination

checkv_consensus_weighted_completeness: The weighted completeness is ideally 100%.

Consensus variant calls

consensus_qc_number_Degenerate: The number of degenerate bases is ideally 0. While degenerate bases indicate ambiguity in the sequence, non-N degenerate bases indicate that some information about the base was obtained.
consensus_qc_number_N: The number of "N" bases is ideally 0.
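
To make the distinction between "N" and other degenerate bases concrete, here is a minimal sketch that counts both in a sequence string. It assumes the degenerate count excludes "N" and covers the remaining IUPAC ambiguity codes; the function name is hypothetical and this is not the workflow's own QC implementation.

```python
from collections import Counter

# IUPAC ambiguity codes other than N (assumed to be the "degenerate" set).
DEGENERATE_CODES = set("RYSWKMBDHV")


def count_ambiguous(sequence):
    """Return (number of N bases, number of non-N degenerate bases)."""
    counts = Counter(sequence.upper())
    n_count = counts["N"]
    degenerate_count = sum(counts[code] for code in DEGENERATE_CODES)
    return n_count, degenerate_count


print(count_ambiguous("ACGTNNRYacgtn"))  # (3, 2)
```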

Coverage

consensus_qc_percent_reference_coverage: The percent reference coverage is ideally 100%.
read_mapping_cov_hist: The read mapping coverage histogram ideally depicts normally distributed coverage, which may indicate uniform coverage across the reference genome. However, uniform coverage is unlikely with repetitive regions that approach/exceed read length.
read_mapping_coverage: The average read mapping coverage is ideally as high as possible.
read_mapping_meanbaseq: The mean base quality of the mapped reads is ideally as high as possible.
read_mapping_meanmapq: The mean mapping (alignment) quality of the mapped reads is ideally as high as possible.
read_mapping_percentage_mapped_reads: The percent of mapped reads is ideally 100% of the reads classified as the lineage of interest. Some unclassified reads may also map, which may indicate they were erroneously unclassified. Alternatively, these reads could have been erroneously mapped.

Why did the workflow complete without generating a consensus?

TheiaViral is designed to "soft fail" when specific steps do not succeed due to input data quality. This means the workflow will be reported as successful, with an output that delineates the step that failed. If the workflow completes without generating a consensus assembly, please look for the following outputs in this order (sorted by timing of failure, latest first):

skani_status: If this output is populated with something other than "PASS" and skani_top_accession is populated with "N/A", this indicates that Skani did not identify a sufficiently similar reference genome. The Skani database comprises a broad array of NCBI viral genomes, so a failure here likely indicates poor read quality because viral contigs are not found in the de novo assembly or are too small. It may be useful to BLAST whatever contigs do exist in the de novo assembly to determine if there is contamination that can be removed via the host input parameter. Additionally, review CheckV de novo outputs to assess if viral contigs were retrieved. Finally, consider keeping extract_unclassified set to "true", using a higher read_extraction_rank if it will not introduce contaminant viruses, and invoking a host input to remove host reads if host contigs are present.
megahit_status / flye_status: If this output is populated with something other than "PASS", it indicates the fallback assembler did not successfully complete. The fallback assemblers are permissive, so failure here likely indicates poor read quality. Review read QC to check read quality, particularly following read classification. If read classification is dispensing with a significant number of reads, consider extract_unclassified, read_extraction_rank, and host input adjustment. Otherwise, sequencing quality may be poor.
metaviralspades_status / raven_denovo_status: If this output is populated with something other than "PASS", it indicates the default assembler did not successfully complete or extract viral contigs (MetaviralSPAdes). On their own, these statuses do not correspond directly to workflow failure because fallback de novo assemblers are implemented for both TheiaViral workflows.
read_screen_clean: If this output is populated with something other than "PASS", it indicates the reads did not pass the imposed thresholds. Either the reads are poor quality or the thresholds are too stringent, in which case the thresholds can be relaxed or skip_screen can be set to "true".
dehost_wf_download_status: If this output is populated with something other than "PASS", it indicates a host genome could not be retrieved for decontamination. See the host input explanation for more information and review the download_accession/download_taxonomy task output logs for advanced error parsing.
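
The triage order above can be sketched as a small helper that scans a sample's outputs and lists every status that is populated with something other than "PASS", latest step first. The function name and the dictionary of outputs are hypothetical (for example, one row exported from a results table); only the output names come from this documentation.

```python
# Output names in the order recommended above (latest failure first).
CHECK_ORDER = [
    "skani_status",
    "megahit_status",
    "flye_status",
    "metaviralspades_status",
    "raven_denovo_status",
    "read_screen_clean",
    "dehost_wf_download_status",
]


def flag_soft_failures(outputs):
    """Return (output name, value) pairs that are populated and not "PASS"."""
    flagged = []
    for name in CHECK_ORDER:
        value = outputs.get(name)
        # Unpopulated outputs (missing or empty) are skipped, not flagged.
        if value and value != "PASS":
            flagged.append((name, value))
    return flagged


# Hypothetical outputs for one sample.
sample = {
    "skani_status": "PASS",
    "megahit_status": "PASS",
    "metaviralspades_status": "FAIL",
    "read_screen_clean": "PASS",
}
print(flag_soft_failures(sample))  # [('metaviralspades_status', 'FAIL')]
```

Note that a flagged metaviralspades_status or raven_denovo_status alone does not imply that no consensus was produced, since the fallback assembler may still have succeeded.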

Known errors associated with read quality

ONT workflows may fail at Metabuli if no reads are classified as the taxon. Check the Metabuli classification.tsv or krona report for the read extraction taxon ID to determine if any reads were classified. This error will report out of memory (OOM), but increasing memory will not resolve it.
Illumina workflows may fail at CheckV (de novo) with "Error: 80 hmmsearch tasks failed. Program should be rerun" if no viral contigs were identified in the de novo assembly.

Acknowledgments¶

We would like to thank Danny Park at the Broad Institute and Jared Johnson at the Washington State Department of Public Health for correspondence during the development of TheiaViral. TheiaViral was built referencing viral-assemble, VAPER, and Artic.