TheiaViral Workflow Series¶
Quick Facts¶
| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore |
|---|---|---|---|---|---|
| Genomic Characterization | Viral | v4.0.0 | Yes; some optional features incompatible | Sample-level | TheiaViral_Illumina_PE_PHB, TheiaViral_ONT_PHB, TheiaViral_Panel_PHB |
TheiaViral Workflows¶
The TheiaViral workflows are for the assembly, quality assessment, and characterization of viral genomes from diverse data sources, including metagenomic samples. There are currently three TheiaViral workflows designed to accommodate different kinds of input data:
- Illumina paired-end sequencing (TheiaViral_Illumina_PE)
- Oxford Nanopore Technology (ONT) sequencing (TheiaViral_ONT)
- Illumina paired-end sequencing originating from hybrid-capture panel-based methods (TheiaViral_Panel)
These workflows function by generating consensus assemblies of recalcitrant viruses, including diverse or recombinant lineages (such as rabies or norovirus), through a three-step approach:
- An intermediate de novo assembly is generated from taxonomy-filtered reads,
- The best reference from a database of ~200,000 viral genomes is selected using average nucleotide identity (ANI), and
- A final consensus assembly is generated through reference-based read mapping and variant calling.
De novo assembly and reference selection can be skipped by providing a reference genome as input; this enables compatibility with tiled-amplicon sequencing data. Subsequent genomic characterization is currently only functional for the viral lineages listed below.
What are the main differences between the TheiaViral and TheiaCoV workflows?
- TheiaCoV Workflows
    - For amplicon-derived viral sequencing methods
    - Supports a limited number of pathogens
    - Uses manually curated, static reference genomes
- TheiaViral Workflows
    - Designed for a variety of sequencing methods
    - Supports relatively diverse and recombinant pathogens
    - Dynamically identifies the most similar reference genome for consensus assembly via an intermediate de novo assembly
What about segmented viruses?
TheiaViral can properly assemble segmented viruses. The reference genome database used in Step 2 excludes segmented viral nucleotide accessions but includes the RefSeq assembly accessions that contain all viral segments, and the consensus assembly modules are built to handle multi-segment references.
Workflow Diagram¶
Inputs¶
Input Data
The TheiaViral_Illumina_PE workflow accepts Illumina paired-end read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip to minimize data upload time and storage costs.
Modifications to the optional trim_minlen parameter may be required to appropriately trim reads shorter than 2 x 150 bp (the read length produced by a 300-cycle sequencing kit), such as the 2 x 75 bp reads generated using a 150-cycle sequencing kit.
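As a quick check before adjusting trim_minlen, you can inspect the read-length distribution of your FASTQ files. The snippet below is a minimal, illustrative sketch (the file name and the cutoff value are hypothetical examples):

```python
# Minimal sketch: summarize read lengths in a gzipped FASTQ to help choose a
# sensible minimum-length cutoff. The file name and cutoff are illustrative.
import gzip
import statistics

fastq_path = "sample_R1.fastq.gz"  # hypothetical input file
cutoff = 75                        # candidate minimum read length after trimming

lengths = []
with gzip.open(fastq_path, "rt") as handle:
    for i, line in enumerate(handle):
        if i % 4 == 1:  # sequence lines are every fourth line, starting at the second
            lengths.append(len(line.strip()))

print(f"reads: {len(lengths)}")
print(f"median length: {statistics.median(lengths)}")
print(f"reads shorter than {cutoff} bp: {sum(l < cutoff for l in lengths)}")
```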
taxon required input parameter
taxon is the standardized taxonomic name (e.g. "Lyssavirus rabies") or NCBI taxon ID (e.g. "11292") of the desired virus to analyze. Inputs must be represented in the NCBI taxonomy database and do not have to be species-level (see read_extraction_rank below).
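If you are unsure whether a name or ID is recognized, a local NCBI Taxonomy lookup can confirm the taxon ID and rank before launching the workflow. The sketch below uses the ete3 package purely for illustration (the workflow performs its own taxonomy resolution internally via an ete4-based task):

```python
# Illustrative sketch: confirm that an input taxon exists in NCBI Taxonomy and
# report its taxon ID and rank. Requires the ete3 package; the first call
# downloads a local copy of the NCBI taxonomy database.
from ete3 import NCBITaxa

ncbi = NCBITaxa()

taxon_name = "Lyssavirus rabies"                 # example from this documentation
name2taxid = ncbi.get_name_translator([taxon_name])
taxid = name2taxid[taxon_name][0]                # 11292

print(f"{taxon_name} -> taxid {taxid}")
print(f"rank: {ncbi.get_rank([taxid])[taxid]}")  # e.g. 'species'
```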
host optional input parameter
The host input triggers the Host Decontaminate workflow, which removes reads that map to a reference host genome. This input must be an NCBI Taxonomy-compatible taxon, a host genome assembly FASTA, or an NCBI assembly accession. If a taxon is provided, the first genome retrieved for that taxon is used. If a genome assembly or accession is provided, it must be coupled with the corresponding boolean set to "true": the Host Decontaminate task's is_genome/is_accession (ONT) or the Read QC Trim PE task's host_is_genome/host_is_accession (Illumina).
extract_unclassified optional input parameter
By default, the extract_unclassified parameter is set to true, which indicates that reads that are not classified by Kraken2 (Illumina) or Metabuli (ONT) will be included with reads classified as the input taxon.
These classification tools often do not comprehensively classify reads against the default RefSeq databases, so extracting unclassified reads is desirable when host and contaminant reads have been sufficiently removed. Host decontamination occurs in TheiaViral via NCBI sra-human-scrubber, read classification against the human genome, and/or mapping reads to the input host. Contaminant viral reads are mostly excluded because they will often be classified by the default RefSeq classification databases.
Consider setting extract_unclassified to false if de novo assembly or Skani reference selection is failing.
min_allele_freq, min_depth, and min_map_quality optional input parameters
These parameters have a direct effect on the variants that are ultimately reported in the consensus assembly. min_allele_freq determines the minimum proportion of reads supporting an allelic variant for it to be reported in the consensus assembly. min_depth and min_map_quality affect how "N" is reported in the consensus, i.e., positions with depth below min_depth are reported as "N", and reads with mapping quality below min_map_quality are not included in depth calculations.
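The interaction of these thresholds can be thought of as per-position logic: reads below the mapping-quality cutoff are ignored, positions whose remaining depth falls below min_depth become "N", and an alternate allele is only written to the consensus when its frequency meets min_allele_freq. The sketch below illustrates that decision flow using the documented default values; it is a deliberate simplification, not the workflow's actual variant-calling or consensus code:

```python
# Illustrative simplification of how min_allele_freq, min_depth, and
# min_map_quality interact at a single reference position.
def consensus_base(ref_base, pileup, min_allele_freq=0.6, min_depth=10, min_map_quality=20):
    """pileup: list of (base, mapping_quality) tuples for reads covering this position."""
    usable = [base for base, mapq in pileup if mapq >= min_map_quality]
    if len(usable) < min_depth:
        return "N"                                 # insufficient usable depth
    counts = {b: usable.count(b) for b in set(usable)}
    top_base, top_count = max(counts.items(), key=lambda kv: kv[1])
    if top_base != ref_base and top_count / len(usable) >= min_allele_freq:
        return top_base                            # alternate allele passes the frequency cutoff
    return ref_base

# Hypothetical position: 12 well-mapped reads, 9 supporting "T", 3 supporting the reference "C"
reads = [("T", 60)] * 9 + [("C", 60)] * 3
print(consensus_base("C", reads))  # "T" (9/12 = 0.75 >= 0.6)
```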
read_extraction_rank optional input parameter
By default, the read_extraction_rank parameter is set to "family", which indicates that reads will be extracted if they are classified to the taxonomic family of the input taxon, including all descendant taxa of that family. Read classification may not resolve to the rank of the input taxon, so some reads may only be classified at higher ranks. For example, some Lyssavirus rabies (species) reads may only be resolved to Lyssavirus (genus), so they would not be extracted if read_extraction_rank were set to "species". Setting read_extraction_rank above the input taxon's rank can therefore dramatically increase the number of reads recovered, at the potential cost of including other viruses. This is unlikely to be a problem for scarcely represented lineages, e.g. a sample expected to contain Lyssavirus rabies is unlikely to also contain other viruses of the corresponding family, Rhabdoviridae. However, setting read_extraction_rank far above the input taxon's rank can be problematic when multiple representatives of the same viral family are present at similar abundance within the same sample. To further refine the desired read_extraction_rank, review the classification reports of the respective classification software (Kraken2 for Illumina and Metabuli for ONT).
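To preview which ancestor of the input taxon a given read_extraction_rank resolves to, and whether a particular classified taxid would fall under it, a local NCBI Taxonomy lookup can help. The sketch below uses ete3 for illustration only (the workflow performs this step with its own tasks), and the genus taxid in the comment is an illustrative example:

```python
# Illustrative sketch: resolve the input taxon's ancestor at the chosen rank and
# test whether a classified read taxid falls within that clade (requires ete3).
from ete3 import NCBITaxa

ncbi = NCBITaxa()

input_taxid = 11292           # Lyssavirus rabies (example from this documentation)
extraction_rank = "family"    # default read_extraction_rank

# Walk the input taxon's lineage and pick the node with the requested rank.
lineage = ncbi.get_lineage(input_taxid)
ranks = ncbi.get_rank(lineage)
rank_taxid = next(t for t in lineage if ranks[t] == extraction_rank)
print(ncbi.get_taxid_translator([rank_taxid])[rank_taxid])   # Rhabdoviridae

# A read resolved only to genus level would still be extracted at family rank.
read_taxid = 11286            # genus Lyssavirus (illustrative)
print(rank_taxid in ncbi.get_lineage(read_taxid))            # True
```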
The TheiaViral_ONT workflow accepts base-called Oxford Nanopore Technology (ONT) read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip to minimize data upload time and storage costs.
It is recommended to trim adapter sequences during Dorado basecalling prior to running TheiaViral_ONT, though Porechop can optionally be called to trim adapters within the workflow.
The ONT sequencing kit and base-calling approach can produce substantial variability in the amount and quality of read data. Genome assemblies produced by the TheiaViral_ONT workflow must be quality assessed before reporting results. We recommend using the Dorado_Basecalling_PHB workflow if applicable.
taxon required input parameter
taxon is the standardized taxonomic name (e.g. "Lyssavirus rabies") or NCBI taxon ID (e.g. "11292") of the desired virus to analyze. Inputs must be represented in the NCBI taxonomy database and do not have to be species-level (see read_extraction_rank below).
host optional input parameter
The host input triggers the Host Decontaminate workflow, which removes reads that map to a reference host genome. This input must be an NCBI Taxonomy-compatible taxon, a host genome assembly FASTA, or an NCBI assembly accession. If a taxon is provided, the first genome retrieved for that taxon is used. If a genome assembly or accession is provided, it must be coupled with the corresponding boolean set to "true": the Host Decontaminate task's is_genome/is_accession (ONT) or the Read QC Trim PE task's host_is_genome/host_is_accession (Illumina).
extract_unclassified optional input parameter
By default, the extract_unclassified parameter is set to true, which indicates that reads that are not classified by Kraken2 (Illumina) or Metabuli (ONT) will be included with reads classified as the input taxon.
These classification tools often do not comprehensively classify reads against the default RefSeq databases, so extracting unclassified reads is desirable when host and contaminant reads have been sufficiently removed. Host decontamination occurs in TheiaViral via NCBI sra-human-scrubber, read classification against the human genome, and/or mapping reads to the input host. Contaminant viral reads are mostly excluded because they will often be classified by the default RefSeq classification databases.
Consider setting extract_unclassified to false if de novo assembly or Skani reference selection is failing.
min_allele_freq, min_depth, and min_map_quality optional input parameters
These parameters have a direct effect on the variants that are ultimately reported in the consensus assembly. min_allele_freq determines the minimum proportion of reads supporting an allelic variant for it to be reported in the consensus assembly. min_depth and min_map_quality affect how "N" is reported in the consensus, i.e., positions with depth below min_depth are reported as "N", and reads with mapping quality below min_map_quality are not included in depth calculations.
read_extraction_rank optional input parameter
By default, the read_extraction_rank parameter is set to "family", which indicates that reads will be extracted if they are classified to the taxonomic family of the input taxon, including all descendant taxa of that family. Read classification may not resolve to the rank of the input taxon, so some reads may only be classified at higher ranks. For example, some Lyssavirus rabies (species) reads may only be resolved to Lyssavirus (genus), so they would not be extracted if read_extraction_rank were set to "species". Setting read_extraction_rank above the input taxon's rank can therefore dramatically increase the number of reads recovered, at the potential cost of including other viruses. This is unlikely to be a problem for scarcely represented lineages, e.g. a sample expected to contain Lyssavirus rabies is unlikely to also contain other viruses of the corresponding family, Rhabdoviridae. However, setting read_extraction_rank far above the input taxon's rank can be problematic when multiple representatives of the same viral family are present at similar abundance within the same sample. To further refine the desired read_extraction_rank, review the classification reports of the respective classification software (Kraken2 for Illumina and Metabuli for ONT).
The TheiaViral_Panel workflow accepts Illumina paired-end read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip to minimize data upload time and storage costs.
For the analysis of RSV and influenza, it is recommended to run TheiaCoV for full characterization of RSV and IRMA assembly for influenza. Due to limitations within the Kraken2 database, RSV A and RSV B will both be extracted under HRSV. Subtypes can be loosely inferred from the Skani outputs.
taxon_ids optional input parameter
The taxon_ids parameter is required for TheiaViral_Panel to run correctly, but is optional in Terra.
By default, TheiaViral_Panel uses a list of 172 taxon IDs derived from the targeted viruses and subtypes in the Viral Surveillance Panel version 2 (VSP v2) produced by Illumina, though this workflow is not specific to that assay. The list can be modified to include or exclude any taxon IDs of interest; however, the taxon IDs must be present in the Kraken2 database used for read classification, and changing this parameter changes which organisms are extracted for assembly and characterization. The list of default taxon IDs can be found below:
| Taxon ID | Common Name | Species Name | Genome Length |
|---|---|---|---|
| 1618189 | Bourbon virus | Thogotovirus bourbonense | 10560 |
| 37124 | Chikungunya virus | Alphavirus chikungunya | 11547 |
| 46839 | Colorado tick fever virus | Coltivirus dermacentoris | 29174 |
| 12637 | Dengue virus | Orthoflavivirus denguei | 10770 |
| 1216928 | Heartland virus | Bandavirus heartlandense | 11540 |
| 59301 | Mayaro virus | Alphavirus mayaro | 11411 |
| 2169701 | Onyong-nyong virus | Alphavirus onyong | 11827 |
| 118655 | Oropouche virus | Orthobunyavirus oropoucheense | 11985 |
| 11587 | Punta Toro virus | Phlebovirus toroense | 12634 |
| 11029 | Ross River virus | Alphavirus rossriver | 11802 |
| 11033 | Semliki Forest virus | Alphavirus semliki | 11341 |
| 11034 | Sindbis virus | Alphavirus sindbis | 11671 |
| 1608084 | Tacheng Tick Virus 2 | Uukuvirus tachengense | 8844 |
| 64286 | Usutu virus | Orthoflavivirus usutuense | 11066 |
| 11082 | West Nile virus | Orthoflavivirus nilense | 10942 |
| 11089 | Yellow fever virus | Orthoflavivirus flavi | NA |
| 64320 | Zika virus | Orthoflavivirus zikaense | 10874 |
| 10804 | adeno-associated virus 2 | Dependoparvovirus primate1 | 4679 |
| 12092 | Hepatovirus A | Hepatovirus ahepa | 7446 |
| 3052230 | Hepacivirus hominis | Hepacivirus hominis | 9431 |
| 12475 | Hepatitis delta virus | Deltavirus italiense | 1680 |
| 291484 | Hepatitis E virus | Hepatitis E virus | 7499 |
| 11676 | Human immunodeficiency virus 1 | Lentivirus humimdef1 | 9388 |
| 11709 | Human immunodeficiency virus 2 | Lentivirus humimdef2 | 10059 |
| 68887 | Torque teno virus | Torque teno virus | 3477 |
| 1980456 | Orthohantavirus andesense | Orthohantavirus andesense | 7735 |
| 3052470 | Orthohantavirus bayoui | Orthohantavirus bayoui | 10861 |
| 3052490 | Orthohantavirus nigrorivense | Orthohantavirus nigrorivense | 6067 |
| 169173 | Choclo virus | Orthohantavirus chocloense | 7844 |
| 3052489 | Orthohantavirus negraense | Orthohantavirus mamorense | NA |
| 238817 | Maporal virus | Orthohantavirus maporalense | 12106 |
| 1980442 | Orthohantavirus | | 8504 |
| 3052496 | Orthohantavirus sangassouense | Orthohantavirus sangassouense | 11928 |
| 3052499 | Orthohantavirus sinnombreense | Orthohantavirus sinnombreense | 10583 |
| 90961 | Lyssavirus australis | Lyssavirus australis | 11822 |
| 80935 | Cache Valley virus | Orthobunyavirus cacheense | 12283 |
| 35305 | California encephalitis virus | Orthobunyavirus encephalitidis | 12466 |
| 1221391 | Cedar virus | Henipavirus cedarense | 18162 |
| 38767 | Lyssavirus duvenhage | Lyssavirus duvenhage | 11976 |
| 11021 | Eastern equine encephalitis virus | Alphavirus eastern | 11675 |
| 38768 | European bat lyssavirus | European bat lyssavirus | 11935 |
| 2847089 | Ghana virus | Henipavirus ghanaense | 18530 |
| 3052223 | Henipavirus hendraense | Henipavirus hendraense | 18234 |
| 260964 | Henipavirus | | 18134 |
| 35511 | Jamestown Canyon virus | Orthobunyavirus jamestownense | 12461 |
| 11072 | Japanese encephalitis virus | Orthoflavivirus japonicum | NA |
| 11577 | La Crosse virus | Orthobunyavirus lacrosseense | 12490 |
| 38766 | Lyssavirus lagos | Lyssavirus lagos | 12016 |
| 1474807 | Mojiang virus | Parahenipavirus mojiangense | 18406 |
| 12538 | Lyssavirus mokola | Lyssavirus mokola | 11940 |
| 11079 | Murray Valley encephalitis virus | Orthoflavivirus murrayense | 7012 |
| 3052225 | Henipavirus nipahense | Henipavirus nipahense | 18248 |
| 11083 | Powassan virus | Orthoflavivirus powassanense | 10826 |
| 11292 | Lyssavirus rabies | Lyssavirus rabies | 11927 |
| 11580 | Snowshoe hare virus | Orthobunyavirus khatangaense | 12208 |
| 11080 | St. Louis encephalitis virus | Orthoflavivirus louisense | 10940 |
| 45270 | Tahyna virus | Orthobunyavirus tahynaense | 12446 |
| 11084 | Tick-borne encephalitis virus | Orthoflavivirus encephalitidis | 7367 |
| 11036 | Venezuelan equine encephalitis virus | Alphavirus venezuelan | 11411 |
| 11039 | Western equine encephalitis virus | Alphavirus western | 11523 |
| 1313215 | aichivirus A1 | Kobuvirus aichi | 8266 |
| 138948 | Enterovirus A | Enterovirus alphacoxsackie | 7427 |
| 138949 | Enterovirus B | Enterovirus betacoxsackie | 7410 |
| 138950 | Enterovirus C | Enterovirus coxsackiepol | 7442 |
| 138951 | Enterovirus D | Enterovirus deconjuncti | 7367 |
| 1239565 | Mamastrovirus 1 | Mamastrovirus hominis | 6791 |
| 1239570 | Mamastrovirus 6 | Mamastrovirus melbournense | 6171 |
| 1239573 | Mamastrovirus 9 | Mamastrovirus virginiaense | 6576 |
| 142786 | Norovirus | | 5162 |
| 28875 | Rotavirus A | Rotavirus alphagastroenteritidis | 8881 |
| 28876 | Rotavirus B | Rotavirus betagastroenteritidis | 17791 |
| 36427 | Rotavirus C | Rotavirus tritogastroenteritidis | 17720 |
| 1348384 | Rotavirus H | Rotavirus aspergastroenteritidis | 17961 |
| 1330524 | Salivirus A | Salivirus aklasse | 7956 |
| 95341 | Sapovirus | | 7470 |
| 2849717 | Aigai virus | Orthonairovirus parahaemorrhagiae | 19245 |
| 1424613 | Anjozorobe virus | Orthohantavirus thailandense | NA |
| 2010960 | Bombali virus | Orthoebolavirus bombaliense | 19043 |
| 565995 | Bundibugyo virus | Orthoebolavirus bundibugyoense | 18940 |
| 3052302 | Mammarenavirus chapareense | Mammarenavirus chapareense | 10464 |
| 3052518 | Orthonairovirus haemorrhagiae | Orthonairovirus haemorrhagiae | 19146 |
| 3052477 | Orthohantavirus dobravaense | Orthohantavirus dobravaense | 9116 |
| 3052307 | Mammarenavirus guanaritoense | Mammarenavirus guanaritoense | 10424 |
| 3052480 | Orthohantavirus hantanense | Orthohantavirus hantanense | 6917 |
| 2169991 | Mammarenavirus juninense | Mammarenavirus juninense | 10525 |
| 33743 | Kyasanur Forest disease virus | Orthoflavivirus kyasanurense | 10579 |
| 3052310 | Mammarenavirus lassaense | Mammarenavirus lassaense | 10686 |
| 3052148 | Cuevavirus lloviuense | Cuevavirus lloviuense | 18893 |
| 3052314 | Mammarenavirus lujoense | Mammarenavirus lujoense | 10352 |
| 3052303 | Mammarenavirus choriomeningitidis | Mammarenavirus choriomeningitidis | 10367 |
| 3052317 | Mammarenavirus machupoense | Mammarenavirus machupoense | 10635 |
| 12542 | Omsk hemorrhagic fever virus | Orthoflavivirus omskense | 10787 |
| 3052493 | Orthohantavirus puumalaense | Orthohantavirus puumalaense | 10925 |
| 186539 | Reston ebolavirus | Orthoebolavirus restonense | 18891 |
| 11588 | Rift Valley fever virus | Phlebovirus riftense | 11979 |
| 2907957 | Sabia virus | Mammarenavirus brazilense | 10499 |
| 3052498 | Orthohantavirus seoulense | Orthohantavirus seoulense | 9746 |
| 1003835 | Severe fever with thrombocytopenia syndrome virus | Bandavirus dabieense | 10547 |
| 1452514 | Sosuga virus | Pararubulavirus sosugaense | 15480 |
| 186540 | Sudan ebolavirus | Orthoebolavirus sudanense | 18875 |
| 186541 | Tai Forest ebolavirus | Orthoebolavirus taiense | 18935 |
| 3052503 | Orthohantavirus tulaense | Orthohantavirus tulaense | 9987 |
| 1891762 | Betapolyomavirus hominis | Betapolyomavirus hominis | 5146 |
| 10376 | human gammaherpesvirus 4 | Lymphocryptovirus humangamma4 | 172146 |
| 10359 | Human betaherpesvirus 5 | Cytomegalovirus humanbeta5 | 214152 |
| 333760 | Human papillomavirus 16 | Alphapapillomavirus 9 | 7905 |
| 333761 | human papillomavirus 18 | Alphapapillomavirus 7 | 7857 |
| 337044 | Alphapapillomavirus 5 | Alphapapillomavirus 5 | 7805 |
| 337050 | Alphapapillomavirus 6 | Alphapapillomavirus 6 | 7847 |
| 1671798 | Human papillomavirus type 54 | Alphapapillomavirus 13 | 7759 |
| 333754 | Alphapapillomavirus 10 | Alphapapillomavirus 10 | 7898 |
| 333767 | Alphapapillomavirus 3 | Alphapapillomavirus 3 | 8061 |
| 746830 | Human polyomavirus 6 | Deltapolyomavirus sextihominis | 4926 |
| 746831 | Human polyomavirus 7 | Deltapolyomavirus septihominis | 4952 |
| 943908 | Human polyomavirus 9 | Alphapolyomavirus nonihominis | 5027 |
| 10632 | JC polyomavirus | Betapolyomavirus secuhominis | 5171 |
| 1891764 | Betapolyomavirus tertihominis | Betapolyomavirus tertihominis | 5040 |
| 1965344 | LI polyomavirus | Alphapolyomavirus quardecihominis | 5269 |
| 493803 | Merkel cell polyomavirus | Alphapolyomavirus quintihominis | 5387 |
| 1203539 | MW polyomavirus | Deltapolyomavirus decihominis | 4927 |
| 1497391 | New Jersey polyomavirus-2013 | Alphapolyomavirus terdecihominis | 5108 |
| 1891767 | Betapolyomavirus macacae | Betapolyomavirus macacae | 5243 |
| 1277649 | STL polyomavirus | Deltapolyomavirus undecihominis | 4776 |
| 862909 | Trichodysplasia spinulosa-associated polyomavirus | Alphapolyomavirus octihominis | 5232 |
| 440266 | WU Polyomavirus | Betapolyomavirus quartihominis | 5229 |
| 10298 | Human alphaherpesvirus 1 | Simplexvirus humanalpha1 | 155275 |
| 11234 | Measles morbillivirus | Morbillivirus hominis | 15956 |
| 152219 | Menangle virus | Pararubulavirus menangleense | 15516 |
| 10244 | Monkeypox virus | Orthopoxvirus monkeypox | 193392 |
| 2560602 | Mumps orthorubulavirus | Orthorubulavirus parotitidis | NA |
| 11041 | Rubella virus | Rubivirus rubellae | 9762 |
| 10335 | Human alphaherpesvirus 3 | Varicellovirus humanalpha3 | 125308 |
| 10255 | Variola virus | Orthopoxvirus variola | 186087 |
| 129875 | Human mastadenovirus A | Mastadenovirus adami | 34077 |
| 108098 | Human mastadenovirus B | Mastadenovirus blackbeardi | 34777 |
| 129951 | Human mastadenovirus C | Mastadenovirus caesari | 35753 |
| 130310 | Human mastadenovirus D | Mastadenovirus dominans | 35160 |
| 130308 | Human mastadenovirus E | Mastadenovirus exoticum | 36099 |
| 130309 | Human mastadenovirus F | Mastadenovirus faecale | 33926 |
| 536079 | Human mastadenovirus G | Mastadenovirus russelli | 21467 |
| 329641 | Human bocavirus | Human bocavirus | 5289 |
| 11137 | Human coronavirus 229E | Alphacoronavirus chicagoense | 27375 |
| 290028 | Human coronavirus HKU1 | Betacoronavirus hongkongense | 29911 |
| 277944 | Human coronavirus NL63 | Alphacoronavirus amsterdamense | 27551 |
| 31631 | Human coronavirus OC43 | Betacoronavirus gravedinis | 30767 |
| 162145 | human metapneumovirus | Metapneumovirus hominis | 13319 |
| 12730 | Human respirovirus 1 | Respirovirus laryngotracheitidis | 15600 |
| 2560525 | Human orthorubulavirus 2 | Orthorubulavirus laryngotracheitidis | 15649 |
| 11216 | Human respirovirus 3 | Respirovirus pneumoniae | 15430 |
| 2560526 | Human orthorubulavirus 4 | Orthorubulavirus hominis | 17235 |
| 1803956 | Parechovirus A | Parechovirus ahumpari | 666 |
| 10798 | Human parvovirus B19 | Erythroparvovirus primate1 | 5595 |
| 11250 | human respiratory syncytial virus | Orthopneumovirus hominis | 15246 |
| 11320 | Influenza A virus | Alphainfluenzavirus influenzae | 13357 |
| 11520 | Influenza B virus | Betainfluenzavirus influenzae | 14563 |
| 11552 | Influenza C virus | Gammainfluenzavirus influenzae | 12430 |
| 1335626 | Middle East respiratory syndrome-related coronavirus | Betacoronavirus cameli | 30150 |
| 147711 | Rhinovirus A | Enterovirus alpharhino | 6983 |
| 147712 | Rhinovirus B | Enterovirus betarhino | 6940 |
| 463676 | Rhinovirus C | Enterovirus cerhino | 5749 |
| 2901879 | Severe acute respiratory syndrome coronavirus | Betacoronavirus pandemicum | 29747 |
| 2697049 | Severe acute respiratory syndrome coronavirus 2 | Betacoronavirus pandemicum | 29883 |
| 10404 | Hepadnaviridae | | 3186 |
| 3052505 | Orthomarburgvirus marburgense | Orthomarburgvirus marburgense | NA |
| 337041 | Alphapapillomavirus 9 | Alphapapillomavirus 9 | 7916 |
| 337042 | Alphapapillomavirus 7 | Alphapapillomavirus 7 | 7861 |
| 333757 | Alphapapillomavirus 8 | Alphapapillomavirus 8 | 7960 |
| 337048 | Alphapapillomavirus 1 | Alphapapillomavirus 1 | 7940 |
| 333754 | Alphapapillomavirus 10 | Alphapapillomavirus 10 | 7898 |
| 333766 | Alphapapillomavirus 13 | Alphapapillomavirus 13 | 7759 |
| 337049 | Alphapapillomavirus 11 | Alphapapillomavirus 11 | 7779 |
output_taxon_table optional input parameter
A key feature of TheiaViral_Panel is the ability to output assemblies and characterization results to taxon-specific Terra tables. This allows users to easily separate results by taxon for downstream analysis.
The output_taxon_table parameter is an optional input file, with a set default, that specifies which taxa are output to which taxon table in Terra.
Formatting the output_taxon_table file
The output_taxon_table file must be uploaded to a Google storage bucket that is accessible by Terra, should be in tab-delimited format, and must include a header. The viral taxon name is listed in the leftmost column, with the name of the Terra data table that samples of that taxon should be copied to in the rightmost column. For example, the first row of the default table pairs "influenza" with "panel_influenza_specimen", so any sample whose taxonomy classification is identified as "influenza" is added to a Terra table named "panel_influenza_specimen". The default table is shown below, followed by a minimal sketch for generating a custom file. For best results, edit your taxon table in a plain-text editor such as Notepad.
| taxon | taxon_table |
|---|---|
| influenza | panel_influenza_specimen |
| coronavirus | panel_coronavirus_specimen |
| human_immunodeficiency_virus | panel_hiv_specimen |
| monkeypox_virus | panel_monkeypox_specimen |
| human_respiratory_syncytial_virus | panel_rsv_specimen |
| west_nile_virus | panel_wnv_specimen |
| other | panel_other_specimen |
| h3n1 | panel_influenza_specimen |
| h1n1 | panel_influenza_specimen |
| h5n1 | panel_influenza_specimen |
| h3n2 | panel_influenza_specimen |
| h2n2 | panel_influenza_specimen |
| mastadenovirus | panel_mastadenovirus_specimen |
| orthohantavirus | panel_orthohantavirus_specimen |
| enterovirus | panel_enterovirus_specimen |
| alphapapillomavirus | panel_alphapapillomavirus_specimen |
| hepatitis | panel_hepatitis_specimen |
| hepadnaviridae | panel_hepatitis_specimen |
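A custom output_taxon_table can be written as a plain tab-delimited file with a header. The minimal sketch below produces a two-column file mirroring part of the default table above (the output path is illustrative); upload the resulting file to a Terra-accessible Google storage bucket before providing it to the workflow:

```python
# Minimal sketch: write a custom tab-delimited output_taxon_table with a header.
# The rows mirror part of the default table above; edit them as needed.
import csv

rows = [
    ("influenza", "panel_influenza_specimen"),
    ("coronavirus", "panel_coronavirus_specimen"),
    ("other", "panel_other_specimen"),
]

with open("custom_output_taxon_table.tsv", "w", newline="") as handle:  # illustrative path
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["taxon", "taxon_table"])  # header row
    writer.writerows(rows)
```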
kraken_db optional input parameter
For the reliable extraction of input taxon IDs, it is important to make sure that the taxon IDs used as input are concordant with the contents of the Kraken2 database; when changing either of these parameters, keep this relationship in mind. The default database can be accessed here.
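One way to check this concordance is to run kraken2-inspect on the database and confirm that every taxon ID of interest appears in the resulting report. The sketch below assumes such a report has already been written to kraken2_inspect.txt in the standard six-column Kraken2 report format; both the report path and the example IDs are illustrative:

```python
# Minimal sketch: confirm that taxon IDs of interest are present in a Kraken2
# database, using a report produced beforehand with `kraken2-inspect --db <db>`.
# Assumes the standard six-column report, where the fifth column is the taxid.
taxon_ids_of_interest = {11292, 11320, 2697049}   # illustrative examples from the default list

db_taxids = set()
with open("kraken2_inspect.txt") as handle:       # illustrative report path
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6 and fields[4].strip().isdigit():
            db_taxids.add(int(fields[4]))

missing = taxon_ids_of_interest - db_taxids
print("missing taxon IDs:", sorted(missing) if missing else "none")
```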
extract_unclassified optional input parameter
By default, extract_unclassified is set to false, which indicates that reads that are not classified by Kraken2 will NOT be included with reads classified as the input taxon.
If the extracted read data is lacking and assemblies are not generated, consider setting this parameter to true to increase the available read count and make assembly generation more probable. Please note that this will introduce reads that are not assigned to the identified taxon and can introduce significant noise and misclassifications.
min_read_count optional input parameter
By default, min_read_count is set to 1000. This is the minimum number of reads a taxon bin must contain to pass the binning threshold and proceed to assembly and characterization.
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| theiaviral_illumina_pe | read1 | File | Illumina forward read file in FASTQ file format (compression optional) | | Required |
| theiaviral_illumina_pe | read2 | File | Illumina reverse read file in FASTQ file format (compression optional) | | Required |
| theiaviral_illumina_pe | samplename | String | Name of the sample being analyzed | | Required |
| theiaviral_illumina_pe | taxon | String | Taxon ID or organism name of interest | Required | |
| bwa | cpu | Int | Number of CPUs to allocate to the task | 6 | Optional |
| bwa | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| bwa | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
| bwa | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| checkv_consensus | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| checkv_consensus | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| checkv_consensus | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
| checkv_consensus | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| checkv_denovo | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| checkv_denovo | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| checkv_denovo | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
| checkv_denovo | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| clean_check_reads | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| clean_check_reads | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| clean_check_reads | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 | Optional |
| clean_check_reads | max_genome_length | Int | Maximum genome length able to pass read screening | 2673870 | Optional |
| clean_check_reads | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| clean_check_reads | min_basepairs | Int | Minimum base pairs to pass read screening | 15000 | Optional |
| clean_check_reads | min_coverage | Int | Minimum coverage to pass read screening | 10 | Optional |
| clean_check_reads | min_genome_length | Int | Minimum genome length to pass read screening | 1500 | Optional |
| clean_check_reads | min_proportion | Int | Minimum read proportion to pass read screening | 40 | Optional |
| clean_check_reads | min_reads | Int | Minimum reads to pass read screening | 50 | Optional |
| consensus | char_unknown | String | Character used to represent unknown bases in the consensus sequence | N | Optional |
| consensus | count_orphans | Boolean | True/False that determines if anomalous read pairs are NOT skipped in variant calling. Anomalous read pairs are those marked in the FLAG field as paired in sequencing but without the properly-paired flag set. | True | Optional |
| consensus | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| consensus | disable_baq | Boolean | True/False that determines if base alignment quality (BAQ) computation should be disabled during samtools mpileup before consensus generation | True | Optional |
| consensus | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| consensus | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
| consensus | max_depth | Int | For a given position, read at maximum INT number of reads per input file during samtools mpileup before consensus generation | 600000 | Optional |
| consensus | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| consensus | min_bq | Int | Minimum base quality required for a base to be considered during samtools mpileup before consensus generation | 0 | Optional |
| consensus | skip_N | Boolean | True/False that determines if "N" bases should be skipped in the consensus sequence | False | Optional |
| consensus_qc | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| consensus_qc | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| consensus_qc | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
| consensus_qc | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| est_genome_length | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| est_genome_length | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| est_genome_length | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ncbi-datasets:18.9.0-python-jq | Optional |
| est_genome_length | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| est_genome_length | summary_limit | Int | Maximum number of genomes to query | 100 | Optional |
| ete4_identify | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| ete4_identify | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| ete4_identify | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ete4:4.3.0 | Optional |
| ete4_identify | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| ivar_variants | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| ivar_variants | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| ivar_variants | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
| ivar_variants | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| ivar_variants | reference_gff | File | A GFF file in the GFF3 format can be supplied to specify coordinates of open reading frames (ORFs) so iVar can identify codons and translate variants into amino acids | Optional | |
| megahit | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| megahit | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| megahit | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/megahit:1.2.9 | Optional |
| megahit | kmers | String | Comma-separated list of kmer sizes to use for assembly. All must be odd, in the range 15-255, increment <= 28 | 21,29,39,59,79,99,119,141 | Optional |
| megahit | megahit_opts | String | Additional parameters for MEGAHIT assembler | Optional | |
| megahit | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| megahit | min_contig_length | Int | Minimum contig length for MEGAHIT assembler | 1 | Optional |
| morgana_magic | abricate_flu_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | abricate_flu_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | abricate_flu_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | abricate_flu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | abricate_flu_min_percent_coverage | Int | Minimum DNA percent coverage | Optional | |
| morgana_magic | abricate_flu_min_percent_identity | Int | Minimum DNA percent identity | Optional | |
| morgana_magic | assembly_metrics_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | assembly_metrics_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | assembly_metrics_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | assembly_metrics_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | flu_track_antiviral_aa_subs | String | Additional list of antiviral resistance-associated amino acid substitutions of interest to be searched against those called on the sample segments. They take the format gene:substitution, e.g. NA:A26V | | Optional |
| morgana_magic | gene_coverage_bam | File | Bam file used for calculating gene coverage | Optional | |
| morgana_magic | gene_coverage_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | gene_coverage_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | gene_coverage_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | gene_coverage_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | gene_coverage_min_depth | Int | The minimum depth to determine if a position was covered. | Optional | |
| morgana_magic | genoflu_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | genoflu_cross_reference | File | An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py | Optional | |
| morgana_magic | genoflu_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | genoflu_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | genoflu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | irma_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | irma_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | irma_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | irma_keep_ref_deletions | Boolean | True/False variable that determines whether sites missed during read gathering (i.e., 0 reads for a site in the reference genome) should be ambiguated by inserting Ns or deleted entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN" | | Optional |
| morgana_magic | irma_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_auspice_reference_tree_json | File | An Auspice JSON phylogenetic reference tree which serves as a target for phylogenetic placement. | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | nextclade_custom_input_dataset | File | For H5N1 flu samples only. A custom Nextclade dataset in JSON format. If provided, this dataset will be used to process any H5N1 flu samples. If not provided, a custom dataset will be selected depending on the GenoFLU Genotype. | Defaults are GenoFLU Genotype specific. Please find these default values here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl | Optional |
| morgana_magic | nextclade_dataset_name | String | NextClade organism dataset name | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| morgana_magic | nextclade_dataset_tag | String | NextClade organism dataset tag | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| morgana_magic | nextclade_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | nextclade_input_ref | File | A nucleotide sequence which serves as a reference for the pairwise alignment of all input sequences. This is also the sequence which defines the coordinate system of the genome annotation. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/02-reference-sequence.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | nextclade_output_parser_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_pathogen_json | File | General dataset configuration file. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/05-pathogen-config.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_reference_gff_file | File | A genome annotation to specify how to translate the nucleotide sequence to proteins (genome_annotation.gff3). specifying this enables codon-informed alignment and protein alignments. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_verbosity | String | other options are: "off" , "error" , "info" , "debug" , and "trace" (highest level of verbosity) | warn | Optional |
| morgana_magic | pangolin_analysis_mode | String | Specify which inference engine to use. Options: accurate (UShER), fast (pangoLEARN), pangolearn, usher. | Optional | |
| morgana_magic | pangolin_arguments | String | Optional arguments for pangolin e.g. ''--skip-scorpio'' | Optional | |
| morgana_magic | pangolin_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | pangolin_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | pangolin_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | pangolin_expanded_lineage | Boolean | True/False that determines if a lineage should be expanded without aliases (e.g., BA.1 → B.1.1.529.1) | Optional | |
| morgana_magic | pangolin_max_ambig | Float | Maximum proportion of Ns allowed for pangolin to attempt assignment. | Optional | |
| morgana_magic | pangolin_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | pangolin_min_length | Int | Minimum query length allowed for pangolin to attempt an assignment | Optional | |
| morgana_magic | pangolin_skip_designation_cache | Boolean | A True/False option that determines if the designation cache should be used | Optional | |
| morgana_magic | pangolin_skip_scorpio | Boolean | A True/False option that determines if scorpio should be skipped. | Optional | |
| morgana_magic | quasitools_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | quasitools_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | quasitools_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | quasitools_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | sc2_s_gene_start | Int | Start position of S gene | Optional | |
| morgana_magic | sc2_s_gene_stop | Int | End position of S gene | Optional | |
| morgana_magic | vadr_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | vadr_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | vadr_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | vadr_max_length | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | Optional | |
| morgana_magic | vadr_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | vadr_min_length | Int | Minimum length for the fasta-trim-terminal-ambigs.pl VADR script | Optional | |
| morgana_magic | vadr_model_file | File | Path to a tar + gzipped VADR model file | | Optional |
| morgana_magic | vadr_options | String | Options to pass to the VADR script | Optional | |
| morgana_magic | vadr_skip_length | Int | Skip reads shorter than this length | Optional | |
| quast_denovo | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| quast_denovo | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| quast_denovo | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 | Optional |
| quast_denovo | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| rasusa | bases | String | Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored | Optional | |
| rasusa | coverage | Float | The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required | 250 | Optional |
| rasusa | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| rasusa | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| rasusa | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 | Optional |
| rasusa | frac | Float | Subsample to a fraction of the reads - e.g., 0.5 samples half the reads | Optional | |
| rasusa | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| rasusa | num | Int | Subsample to a specific number of reads | Optional | |
| rasusa | seed | Int | Random seed for reproducibility | Optional | |
| read_QC_trim | adapters | File | File with adapter sequences to be removed | Optional | |
| read_QC_trim | bbduk_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| read_QC_trim | call_midas | Boolean | Internal component, do not modify | False | Optional |
| read_QC_trim | fastp_args | String | Additional arguments to use with fastp | --detect_adapter_for_pe -g -5 20 -3 20 | Optional |
| read_QC_trim | host_complete_only | Boolean | Only download host reference genome labeled "complete" | False | Optional |
| read_QC_trim | host_decontaminate_mem | Int | Memory allocated for minimap2 (in GB) | 32 | Optional |
| read_QC_trim | host_is_accession | Boolean | Inputted "host" is an accession | False | Optional |
| read_QC_trim | host_is_genome | Boolean | Inputted "host" is a genome URI | False | Optional |
| read_QC_trim | host_refseq | Boolean | Internal component, do not modify | True | Optional |
| read_QC_trim | kraken_cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| read_QC_trim | kraken_disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30GB kraken2 databases such as the "k2_standard" database) | 100 | Optional |
| read_QC_trim | kraken_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| read_QC_trim | midas_db | File | Internal component, do not modify | gs://theiagen-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz | Optional |
| read_QC_trim | phix | File | A file containing the phix used during Illumina sequencing; used in the BBDuk task | Optional | |
| read_QC_trim | read_processing | String | The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp" | trimmomatic | Optional |
| read_QC_trim | read_qc | String | The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc" | fastq_scan | Optional |
| read_QC_trim | target_organism | String | Internal component, do not modify | Optional | |
| read_QC_trim | trim_min_length | Int | Specifies minimum length of each read after trimming to be kept | 75 | Optional |
| read_QC_trim | trim_quality_min_score | Int | Specifies the average quality of bases in a sliding window to be kept | 30 | Optional |
| read_QC_trim | trim_window_size | Int | Specifies window size for trimming (the number of bases to average the quality across) | 4 | Optional |
| read_QC_trim | trimmomatic_args | String | Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data. | -phred33 | Optional |
| read_mapping_stats | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| read_mapping_stats | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| read_mapping_stats | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 | Optional |
| read_mapping_stats | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| skani | acc2taxon_map | File | Tab-delimited map between reference genome accessions and their affiliated taxon | gs://theiagen-public-resources-rp/reference_data/databases/skani/viral_fna_20251107/viral_accession2taxon_20251107.tsv | Optional |
| skani | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| skani | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| skani | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2 | Optional |
| skani | fasta_dir | String | Reference genome database base directory | gs://theiagen-public-resources-rp/reference_data/databases/skani/viral_fna_20251107/fna/ | Optional |
| skani | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| spades | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| spades | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| spades | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/spades:4.1.0 | Optional |
| spades | kmers | String | list of k-mer sizes (must be odd and less than 128) | auto | Optional |
| spades | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| spades | phred_offset | Int | PHRED quality offset in the input reads (33 or 64) | 33 | Optional |
| spades | spades_opts | String | Additional parameters for Spades assembler | Optional | |
| theiaviral_illumina_pe | call_metaviralspades | Boolean | True/False to call assembly with MetaviralSPAdes and use Megahit as fallback | True | Optional |
| theiaviral_illumina_pe | checkv_db | File | Database used for CheckV | Optional | |
| theiaviral_illumina_pe | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads | True | Optional |
| theiaviral_illumina_pe | genome_length | Int | Expected genome length of taxon of interest | Optional | |
| theiaviral_illumina_pe | host | String | Host taxon/accession to dehost reads, if provided | Optional | |
| theiaviral_illumina_pe | kraken_db | File | Kraken2 database file | gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz | Optional |
| theiaviral_illumina_pe | min_allele_freq | Float | Minimum allele frequency required for a variant to populate the consensus sequence | 0.6 | Optional |
| theiaviral_illumina_pe | min_depth | Int | Minimum read depth required for a variant to populate the consensus sequence | 10 | Optional |
| theiaviral_illumina_pe | min_map_quality | Int | Minimum mapping quality required for read alignments | 20 | Optional |
| theiaviral_illumina_pe | read_extraction_rank | String | Taxonomic rank to use for read extraction - limits taxa to only those within the specified rank. | family | Optional |
| theiaviral_illumina_pe | reference_fasta | File | Reference genome in FASTA format | Optional | |
| theiaviral_illumina_pe | reference_gene_locations_bed | File | Use to provide locations of interest where average coverage will be calculated | Optional | |
| theiaviral_illumina_pe | skani_db | File | Skani database file | Optional | |
| theiaviral_illumina_pe | skip_qc | Boolean | Internal component, do not modify | False | Optional |
| theiaviral_illumina_pe | skip_rasusa | Boolean | True/False to skip read subsampling with Rasusa | True | Optional |
| theiaviral_illumina_pe | skip_screen | Boolean | True/False to skip read screening check prior to analysis | False | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | Optional |
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| theiaviral_ont | read1 | File | Base-called ONT read file in FASTQ file format (compression optional) | Required | |
| theiaviral_ont | samplename | String | Name of the sample being analyzed | Required | |
| theiaviral_ont | taxon | String | Taxon ID or organism name of interest | Required | |
| bcftools_consensus | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| bcftools_consensus | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| bcftools_consensus | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/bcftools:1.20 | Optional |
| bcftools_consensus | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| checkv_consensus | checkv_db | File | CheckV database file | gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz | Optional |
| checkv_consensus | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| checkv_consensus | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| checkv_consensus | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
| checkv_consensus | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| checkv_denovo | checkv_db | File | CheckV database file | gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz | Optional |
| checkv_denovo | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| checkv_denovo | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| checkv_denovo | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
| checkv_denovo | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| clair3 | clair3_model | String | Model to be used by Clair3 | r1041_e82_400bps_sup_v500 | Optional |
| clair3 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| clair3 | disable_phasing | Boolean | True/False that determines if variants should be called without whatshap phasing in full alignment calling | True | Optional |
| clair3 | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| clair3 | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/clair3-extra-models:1.0.10 | Optional |
| clair3 | enable_gvcf | Boolean | True/False that determines if an additional GVCF output should be generated | False | Optional |
| clair3 | enable_haploid_precise | Boolean | True/False that determines if haploid precise calling mode is enabled, where only 1/1 is considered as a variant | True | Optional |
| clair3 | include_all_contigs | Boolean | True/False that determines if all contigs should be included in the output | True | Optional |
| clair3 | indel_min_af | Float | Minimum Indel AF required for a candidate variant | 0.08 | Optional |
| clair3 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| clair3 | snp_min_af | Float | Minimum SNP allele frequency required for a candidate variant. Lowering the value might increase a bit of sensitivity in trade of speed and accuracy | 0.08 | Optional |
| clair3 | variant_quality | Int | If set, variants with >$qual will be marked PASS, or LowQual otherwise | 2 | Optional |
| clean_check_reads | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| clean_check_reads | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| clean_check_reads | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 | Optional |
| clean_check_reads | max_genome_length | Int | Maximum genome length able to pass read screening | 2673870 | Optional |
| clean_check_reads | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| clean_check_reads | min_basepairs | Int | Minimum base pairs to pass read screening | 15000 | Optional |
| clean_check_reads | min_coverage | Int | Minimum coverage to pass read screening | 10 | Optional |
| clean_check_reads | min_genome_length | Int | Minimum genome length to pass read screening | 1500 | Optional |
| clean_check_reads | min_reads | Int | Minimum reads to pass read screening | 50 | Optional |
| consensus_qc | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| consensus_qc | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| consensus_qc | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
| consensus_qc | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| est_genome_length | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| est_genome_length | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| est_genome_length | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ncbi-datasets:18.9.0-python-jq | Optional |
| est_genome_length | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| est_genome_length | summary_limit | Int | Maximum number of genomes to query | 100 | Optional |
| ete4_identify | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| ete4_identify | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| ete4_identify | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ete4:4.3.0 | Optional |
| ete4_identify | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| fasta_utilities | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| fasta_utilities | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| fasta_utilities | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
| fasta_utilities | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| flye | additional_parameters | String | Additional parameters for Flye assembler | Optional | |
| flye | asm_coverage | Int | Reduced coverage for initial disjointig assembly | Optional | |
| flye | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| flye | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| flye | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/flye:2.9.4 | Optional |
| flye | flye_polishing_iterations | Int | Number of polishing iterations | 1 | Optional |
| flye | genome_length | Int | Expected genome length for assembly - requires asm_coverage | Optional | |
| flye | keep_haplotypes | Boolean | True/False to prevent collapsing alternative haplotypes | False | Optional |
| flye | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| flye | minimum_overlap | Int | Minimum overlap between reads | Optional | |
| flye | no_alt_contigs | Boolean | True/False to disable alternative contig generation | False | Optional |
| flye | read_error_rate | Float | Expected error rate in reads | Optional | |
| flye | read_type | String | Type of read data for Flye | --nano-hq | Optional |
| flye | scaffold | Boolean | True/False to enable scaffolding using graph | False | Optional |
| host_decontaminate | complete_only | Boolean | Only download genomes labeled "complete" | False | Optional |
| host_decontaminate | is_accession | Boolean | Inputted "host" is an accession | False | Optional |
| host_decontaminate | is_genome | Boolean | Inputted "host" is an assembly FASTA | False | Optional |
| host_decontaminate | minimap2_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| host_decontaminate | read2 | File | Internal component, do not modify | Optional | |
| host_decontaminate | refseq | Boolean | Only download RefSeq genomes | True | Optional |
| mask_low_coverage | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| mask_low_coverage | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| mask_low_coverage | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/bedtools:2.31.0 | Optional |
| mask_low_coverage | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| metabuli | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| metabuli | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| metabuli | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/metabuli:1.1.0 | Optional |
| metabuli | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| metabuli | metabuli_db | File | Metabuli database file | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/refseq_virus-v223.tar.gz | Optional |
| metabuli | min_percent_coverage | Float | Minimum query coverage threshold (0.0 - 1.0) | 0 | Optional |
| metabuli | min_score | Float | Minimum sequence similarity score (0.0 - 1.0) | 0 | Optional |
| metabuli | min_sp_score | Float | Minimum score for species- or lower-level classification | 0 | Optional |
| metabuli | taxonomy_path | File | Path to taxonomy file | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/new_taxdump.tar.gz | Optional |
| minimap2 | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| minimap2 | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| minimap2 | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22 | Optional |
| minimap2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| minimap2 | query2 | File | Internal component, do not modify | Optional | |
| morgana_magic | abricate_flu_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | abricate_flu_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | abricate_flu_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | abricate_flu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | abricate_flu_min_percent_coverage | Int | Minimum DNA percent coverage | Optional | |
| morgana_magic | abricate_flu_min_percent_identity | Int | Minimum DNA percent identity | Optional | |
| morgana_magic | assembly_metrics_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | assembly_metrics_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | assembly_metrics_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | assembly_metrics_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | flu_track_antiviral_aa_subs | String | Additional list of antiviral resistance-associated amino acid substitutions of interest to be searched against those called on the sample segments. These take the format segment:substitution, e.g. NA:A26V | Optional | |
| morgana_magic | gene_coverage_bam | File | Bam file used for calculating gene coverage | Optional | |
| morgana_magic | gene_coverage_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | gene_coverage_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | gene_coverage_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | gene_coverage_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | gene_coverage_min_depth | Int | The minimum depth to determine if a position was covered. | Optional | |
| morgana_magic | genoflu_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | genoflu_cross_reference | File | An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py | Optional | |
| morgana_magic | genoflu_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | genoflu_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | genoflu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | irma_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | irma_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | irma_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | irma_keep_ref_deletions | Boolean | True/False variable that determines whether sites missed during read gathering (i.e. 0 reads for a site in the reference genome) should be ambiguated by inserting N's or deleted from the sequence entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN" | Optional | |
| morgana_magic | irma_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_auspice_reference_tree_json | File | An Auspice JSON phylogenetic reference tree which serves as a target for phylogenetic placement. | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | nextclade_custom_input_dataset | File | For H5N1 flu samples only. A custom Nextclade dataset in JSON format. If provided, this dataset will be used to process any H5N1 flu samples. If not provided, a custom dataset will be selected depending on the GenoFLU Genotype. | Defaults are GenoFLU Genotype specific. Please find these default values here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl | Optional |
| morgana_magic | nextclade_dataset_name | String | NextClade organism dataset name | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| morgana_magic | nextclade_dataset_tag | String | NextClade organism dataset tag | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| morgana_magic | nextclade_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | nextclade_input_ref | File | A nucleotide sequence which serves as a reference for the pairwise alignment of all input sequences. This is also the sequence which defines the coordinate system of the genome annotation. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/02-reference-sequence.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | nextclade_output_parser_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_pathogen_json | File | General dataset configuration file. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/05-pathogen-config.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_reference_gff_file | File | A genome annotation to specify how to translate the nucleotide sequence to proteins (genome_annotation.gff3). Specifying this enables codon-informed alignment and protein alignments. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_verbosity | String | Other options are: "off", "error", "info", "debug", and "trace" (highest level of verbosity) | warn | Optional |
| morgana_magic | pangolin_analysis_mode | String | Specify which inference engine to use. Options: accurate (UShER), fast (pangoLEARN), pangolearn, usher. | Optional | |
| morgana_magic | pangolin_arguments | String | Optional arguments for pangolin e.g. ''--skip-scorpio'' | Optional | |
| morgana_magic | pangolin_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | pangolin_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | pangolin_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | pangolin_expanded_lineage | Boolean | True/False that determines if a lineage should be expanded without aliases (e.g., BA.1 → B.1.1.529.1) | Optional | |
| morgana_magic | pangolin_max_ambig | Float | Maximum proportion of Ns allowed for pangolin to attempt assignment. | Optional | |
| morgana_magic | pangolin_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | pangolin_min_length | Int | Minimum query length allowed for pangolin to attempt an assignment | Optional | |
| morgana_magic | pangolin_skip_designation_cache | Boolean | A True/False option that determines if the designation cache should be used | Optional | |
| morgana_magic | pangolin_skip_scorpio | Boolean | A True/False option that determines if scorpio should be skipped. | Optional | |
| morgana_magic | quasitools_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | quasitools_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | quasitools_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | quasitools_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | read2 | File | Internal component, do not modify | Optional | |
| morgana_magic | sc2_s_gene_start | Int | Start position of S gene | Optional | |
| morgana_magic | sc2_s_gene_stop | Int | End position of S gene | Optional | |
| morgana_magic | vadr_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | vadr_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | vadr_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | vadr_max_length | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | Optional | |
| morgana_magic | vadr_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | vadr_min_length | Int | Minimum length for the fasta-trim-terminal-ambigs.pl VADR script | Optional | |
| morgana_magic | vadr_model_file | File | Path to a tar + gzipped VADR model file | Optional | |
| morgana_magic | vadr_options | String | Options to pass to the VADR script | Optional | |
| morgana_magic | vadr_skip_length | Int | Skip reads shorter than this length | Optional | |
| nanoplot_clean | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| nanoplot_clean | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| nanoplot_clean | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 | Optional |
| nanoplot_clean | max_length | Int | Maximum clean read length to display; reads longer than this are hidden in the plots | 100000 | Optional |
| nanoplot_clean | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| nanoplot_raw | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| nanoplot_raw | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| nanoplot_raw | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 | Optional |
| nanoplot_raw | max_length | Int | Maximum raw read length to display; reads longer than this are hidden in the plots | 100000 | Optional |
| nanoplot_raw | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| nanoq | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| nanoq | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| nanoq | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/nanoq:0.9.0--hec16e2b_1 | Optional |
| nanoq | max_read_length | Int | Maximum read length to keep | 100000 | Optional |
| nanoq | max_read_qual | Int | Maximum read quality to keep | 100 | Optional |
| nanoq | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| nanoq | min_read_length | Int | Minimum read length to keep | 500 | Optional |
| nanoq | min_read_qual | Int | Minimum read quality to keep | 10 | Optional |
| ncbi_scrub_se | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| ncbi_scrub_se | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| ncbi_scrub_se | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional |
| ncbi_scrub_se | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| parse_mapping | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| parse_mapping | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| parse_mapping | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
| parse_mapping | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| porechop | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| porechop | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| porechop | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/porechop:0.2.4 | Optional |
| porechop | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| porechop | trimopts | String | Additional trimming options for Porechop | Optional | |
| quast_denovo | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| quast_denovo | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| quast_denovo | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 | Optional |
| quast_denovo | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| quast_denovo | min_contig_length | Int | Minimum length of contig for QUAST | 500 | Optional |
| rasusa | bases | String | Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored | Optional | |
| rasusa | coverage | Float | The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required | 250 | Optional |
| rasusa | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| rasusa | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| rasusa | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 | Optional |
| rasusa | frac | Float | Subsample to a fraction of the reads - e.g., 0.5 samples half the reads | Optional | |
| rasusa | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| rasusa | num | Int | Subsample to a specific number of reads | Optional | |
| rasusa | read2 | File | Internal component, do not modify | Optional | |
| rasusa | seed | Int | Random seed for reproducibility | Optional | |
| raven | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| raven | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| raven | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/raven:1.8.3 | Optional |
| raven | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| raven | raven_identity | Float | Threshold for overlap between two reads in order to construct an edge between them | 0 | Optional |
| raven | raven_opts | String | Additional parameters for Raven assembler | Optional | |
| raven | raven_polishing_iterations | Int | Number of polishing iterations | 2 | Optional |
| read_mapping_stats | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| read_mapping_stats | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| read_mapping_stats | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 | Optional |
| read_mapping_stats | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| skani | acc2taxon_map | File | Tab-delimited map between reference genome accessions and their affiliated taxon | gs://theiagen-public-resources-rp/reference_data/databases/skani/viral_fna_20251107/viral_accession2taxon_20251107.tsv | Optional |
| skani | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| skani | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| skani | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2 | Optional |
| skani | fasta_dir | String | Reference genome database base directory | gs://theiagen-public-resources-rp/reference_data/databases/skani/viral_fna_20251107/fna/ | Optional |
| skani | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| skani | skani_db | File | Skani database file | gs://theiagen-public-resources-rp/reference_data/databases/skani/skani_db_20251107.tar | Optional |
| theiaviral_ont | call_porechop | Boolean | True/False to trim adapters with porechop | False | Optional |
| theiaviral_ont | call_raven | Boolean | True/False to call assembly with Raven and use Flye as fallback | True | Optional |
| theiaviral_ont | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads | True | Optional |
| theiaviral_ont | genome_length | Int | Expected genome length of taxon of interest | Optional | |
| theiaviral_ont | host | String | Host taxon/accession to dehost reads, if provided | Optional | |
| theiaviral_ont | min_allele_freq | Float | Minimum allele frequency required for a variant to populate the consensus sequence | 0.6 | Optional |
| theiaviral_ont | min_depth | Int | Minimum read depth required for a variant to populate the consensus sequence | 10 | Optional |
| theiaviral_ont | min_map_quality | Int | Minimum mapping quality required for read alignments | 20 | Optional |
| theiaviral_ont | read_extraction_rank | String | Taxonomic rank to use for read extraction; limits extracted taxa to those within the specified rank | family | Optional |
| theiaviral_ont | reference_fasta | File | Reference genome in FASTA format | Optional | |
| theiaviral_ont | reference_gene_locations_bed | File | Use to provide locations of interest where average coverage will be calculated | Optional | |
| theiaviral_ont | skip_rasusa | Boolean | True/False to skip read subsampling with Rasusa | True | Optional |
| theiaviral_ont | skip_screen | Boolean | True/False to skip read screening check prior to analysis | False | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | Optional |
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| theiaviral_panel | read1 | File | Illumina forward read file in FASTQ file format (compression optional) | Required | |
| theiaviral_panel | read2 | File | Illumina reverse read file in FASTQ file format (compression optional) | Required | |
| theiaviral_panel | samplename | String | Name of the sample being analyzed | Required | |
| theiaviral_panel | source_table_name | String | Name of the Terra table the source reads originate from. This is used to identify the originating location of extracted assemblies once they are added to output tables. | Required | |
| theiaviral_panel | terra_project | String | The Terra project containing the data table | Required | |
| theiaviral_panel | terra_workspace | String | The Terra workspace containing the data table | Required | |
| cat_lanes | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| cat_lanes | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| cat_lanes | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.2 | Optional |
| cat_lanes | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| cat_lanes | read1_lane3 | File | Internal component, do not modify | Optional | |
| cat_lanes | read1_lane4 | File | Internal component, do not modify | Optional | |
| cat_lanes | read2_lane3 | File | Internal component, do not modify | Optional | |
| cat_lanes | read2_lane4 | File | Internal component, do not modify | Optional | |
| ete4_identify | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| ete4_identify | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| ete4_identify | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ete4:4.3.0 | Optional |
| ete4_identify | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| ete4_identify | rank | String | Internal component, do not modify | Optional | |
| export_taxon_table | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| export_taxon_table | disk_size | Int | Amount of storage (in GB) to allocate to the task | 25 | Optional |
| export_taxon_table | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21 | Optional |
| export_taxon_table | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| kraken2 | classified_out | String | Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix | classified#.fastq | Optional |
| kraken2 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| kraken2 | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| kraken2 | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db | Optional |
| kraken2 | kraken2_args | String | Allows a user to supply additional kraken2 command-line arguments | Optional | |
| kraken2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| kraken2 | unclassified_out | String | Allows user to rename unclassified FASTQ files output. Must include .fastq as the suffix | unclassified#.fastq | Optional |
| kraken_parser | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| kraken_parser | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| kraken_parser | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/krakentools:d4a2fbe | Optional |
| kraken_parser | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| krakentools | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| krakentools | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| krakentools | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/krakentools:d4a2fbe | Optional |
| krakentools | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| read_QC_trim | adapters | File | File with adapter sequences to be removed | Optional | |
| read_QC_trim | bbduk_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| read_QC_trim | call_midas | Boolean | Internal component, do not modify | False | Optional |
| read_QC_trim | extract_unclassified | Boolean | Allows user to extract unclassified reads | False | Optional |
| read_QC_trim | fastp_args | String | Additional arguments to use with fastp | --detect_adapter_for_pe -g -5 20 -3 20 | Optional |
| read_QC_trim | host_complete_only | Boolean | Only download host reference genome labeled "complete" | False | Optional |
| read_QC_trim | host_decontaminate_mem | Int | Memory allocated for minimap2 (in GB) | 32 | Optional |
| read_QC_trim | host_is_accession | Boolean | Inputted "host" is an accession | False | Optional |
| read_QC_trim | host_is_genome | Boolean | Inputted "host" is a genome URI | False | Optional |
| read_QC_trim | host_refseq | Boolean | Internal component, do not modify | True | Optional |
| read_QC_trim | kraken_cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| read_QC_trim | kraken_disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30GB kraken2 databases such as the "k2_standard" database) | 100 | Optional |
| read_QC_trim | kraken_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| read_QC_trim | midas_db | File | Internal component, do not modify | gs://theiagen-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz | Optional |
| read_QC_trim | phix | File | A file containing the phix used during Illumina sequencing; used in the BBDuk task | Optional | |
| read_QC_trim | read_processing | String | The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp" | trimmomatic | Optional |
| read_QC_trim | read_qc | String | The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc" | fastq_scan | Optional |
| read_QC_trim | target_organism | String | Internal component, do not modify | Optional | |
| read_QC_trim | taxon_id | Int | Internal component, do not modify | 0 | Optional |
| read_QC_trim | trim_min_length | Int | Specifies minimum length of each read after trimming to be kept | 75 | Optional |
| read_QC_trim | trim_quality_min_score | Int | Specifies the average quality of bases in a sliding window to be kept | 30 | Optional |
| read_QC_trim | trim_window_size | Int | Specifies window size for trimming (the number of bases to average the quality across) | 4 | Optional |
| read_QC_trim | trimmomatic_args | String | Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data. | -phred33 | Optional |
| theiaviral_illumina_pe | checkv_db | File | Database used for CheckV | Optional | |
| theiaviral_illumina_pe | extract_unclassified | Boolean | Internal component, do not modify | True | Optional |
| theiaviral_illumina_pe | genome_length | Int | Expected genome length of taxon of interest | Optional | |
| theiaviral_illumina_pe | host | String | Internal component, do not modify | Optional | |
| theiaviral_illumina_pe | min_allele_freq | Float | Minimum allele frequency required for a variant to populate the consensus sequence | 0.6 | Optional |
| theiaviral_illumina_pe | min_depth | Int | Minimum read depth required for a variant to populate the consensus sequence | 10 | Optional |
| theiaviral_illumina_pe | min_map_quality | Int | Minimum mapping quality required for read alignments | 20 | Optional |
| theiaviral_illumina_pe | read_extraction_rank | String | Internal component, do not modify | family | Optional |
| theiaviral_illumina_pe | reference_fasta | File | Reference genome in FASTA format | Optional | |
| theiaviral_illumina_pe | reference_gene_locations_bed | File | Use to provide locations of interest where average coverage will be calculated | Optional | |
| theiaviral_illumina_pe | skani_db | File | Skani database file | Optional | |
| theiaviral_illumina_pe | skip_rasusa | Boolean | True/False to skip read subsampling with Rasusa | True | Optional |
| theiaviral_panel | call_metaviralspades | Boolean | Whether to run metaviralspades for assembly | True | Optional |
| theiaviral_panel | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads | False | Optional |
| theiaviral_panel | host | String | Host taxon/accession to dehost reads, if provided | Optional | |
| theiaviral_panel | kraken_db | File | Kraken2 database file in .tar.gz format. | gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz | Optional |
| theiaviral_panel | min_read_count | Int | Minimum number of reads required to consider a taxon for assembly | 1000 | Optional |
| theiaviral_panel | output_taxon_table | File | A TSV file containing organism names and their corresponding output table names. | gs://theiagen-public-resources-rp/reference_data/family_agnostic/theiaviral_panel_taxon_table_20251111.tsv | Optional |
| theiaviral_panel | taxon_ids | Array[String] | An array of taxon IDs user wishes to analyze. | [['1618189'], ['37124'], ['46839'], ['12637'], ['1216928'], ['59301'], ['2169701'], ['118655'], ['11587'], ['11029'], ['11033'], ['11034'], ['1608084'], ['64286'], ['11082'], ['11089'], ['64320'], ['10804'], ['12092'], ['3052230'], ['12475'], ['11676'], ['11709'], ['68887'], ['1980456'], ['3052470'], ['3052490'], ['169173'], ['3052489'], ['238817'], ['1980442'], ['3052496'], ['3052499'], ['90961'], ['80935'], ['35305'], ['1221391'], ['38767'], ['11021'], ['38768'], ['2847089'], ['3052223'], ['260964'], ['35511'], ['11072'], ['11577'], ['38766'], ['1474807'], ['12538'], ['11079'], ['3052225'], ['11083'], ['11292'], ['11580'], ['11080'], ['45270'], ['11084'], ['11036'], ['11039'], ['1313215'], ['138948'], ['138949'], ['138950'], ['138951'], ['1239565'], ['1239570'], ['1239573'], ['142786'], ['28875'], ['28876'], ['36427'], ['1348384'], ['1330524'], ['95341'], ['2849717'], ['1424613'], ['2010960'], ['565995'], ['3052302'], ['3052518'], ['3052477'], ['3052307'], ['3052480'], ['2169991'], ['33743'], ['3052310'], ['3052148'], ['3052314'], ['3052303'], ['3052317'], ['33727'], ['12542'], ['3052493'], ['186539'], ['11588'], ['2907957'], ['3052498'], ['1003835'], ['1452514'], ['186540'], ['186541'], ['3052503'], ['1891762'], ['10376'], ['10359'], ['333760'], ['333761'], ['337044'], ['337050'], ['1671798'], ['333754'], ['333767'], ['746830'], ['746831'], ['943908'], ['10632'], ['1891764'], ['1965344'], ['493803'], ['1203539'], ['1497391'], ['1891767'], ['1277649'], ['862909'], ['862909'], ['440266'], ['11234'], ['152219'], ['10244'], ['2560602'], ['11041'], ['10335'], ['10255'], ['129875'], ['108098'], ['129951'], ['130310'], ['130308'], ['130309'], ['536079'], ['329641'], ['11137'], ['290028'], ['277944'], ['31631'], ['162145'], ['12730'], ['2560525'], ['11216'], ['2560526'], ['1803956'], ['10798'], ['11250'], ['11320'], ['11520'], ['11552'], ['1335626'], ['147711'], ['147712'], ['463676'], ['2901879'], ['2697049'], ['10404'], ['3052505'], ['337041'], ['337042'], ['333757'], ['337048'], ['333754'], ['333766'], ['337049']] | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | Optional |
Workflow Tasks¶
Versioning
versioning: Version Capture
The versioning task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical details
| Links | |
|---|---|
| Task | task_versioning.wdl |
Taxonomic Identification
ete4_identify
The ete4_identify task parses the NCBI taxonomy hierarchy from a user's input taxon and desired taxonomic rank. This task returns a taxon ID, name, and rank, which facilitates downstream functions, including read classification, targeted read extraction, and genomic characterization modules.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
rank (a.k.a. read_extraction_rank) input parameter
Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.
Important
- The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
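For troubleshooting taxon/rank combinations before launching the workflow, the same kind of rank resolution can be reproduced locally with the ete toolkit's NCBITaxa interface. The sketch below is illustrative only (it is not the ete4_identify task's code); it uses the ete3-style import, which ete4 largely mirrors, and assumes the local NCBI taxonomy database has already been downloaded.

```python
# Minimal sketch of resolving an input taxon to an ancestor at a requested rank,
# approximating what ete4_identify reports. Assumes the NCBITaxa local taxonomy
# database is already present (it is downloaded on first use).
from ete3 import NCBITaxa  # ete4 exposes a comparable NCBITaxa class

def resolve_to_rank(taxon: str, rank: str = "family"):
    ncbi = NCBITaxa()
    # Accept either a numeric NCBI taxon ID or an organism name
    taxid = int(taxon) if taxon.isdigit() else ncbi.get_name_translator([taxon])[taxon][0]
    lineage = ncbi.get_lineage(taxid)            # root -> input taxon
    ranks = ncbi.get_rank(lineage)               # {taxid: rank}
    names = ncbi.get_taxid_translator(lineage)   # {taxid: scientific name}
    for ancestor in lineage:
        if ranks[ancestor] == rank:
            return ancestor, names[ancestor], rank
    # No ancestor at the requested rank: mirrors the task failing when the
    # requested rank is below the input taxon's own rank.
    raise ValueError(f"no ancestor of {taxon} at rank '{rank}'")

# Lyssavirus rabies (11292) at family rank -> (11270, 'Rhabdoviridae', 'family')
print(resolve_to_rank("11292", "family"))
```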
ete4 Identify Technical Details
| Links | |
|---|---|
| Task | task_ete4_taxon_id.wdl |
| Software Source Code | ete4 on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
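The averaging step can be approximated outside the workflow once a Datasets summary has been saved to disk. In the hedged sketch below, the summary.json filename and the reports/length field names are assumptions about the Datasets JSON schema rather than the task's exact parsing; inspect the JSON produced by your datasets version before relying on them.

```python
# Minimal sketch, assuming a summary was produced with something like:
#   datasets summary virus genome taxon "Lyssavirus rabies" > summary.json
# The "reports"/"length" field names are assumptions and may differ between
# datasets versions.
import json
import statistics

with open("summary.json") as handle:
    summary = json.load(handle)

lengths = [report["length"] for report in summary.get("reports", []) if report.get("length")]

if lengths:
    print(f"genomes considered: {len(lengths)}")
    print(f"average expected genome length (bp): {round(statistics.mean(lengths))}")
else:
    print("no genome length metadata found for this taxon")
```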
Read Quality Control, Trimming, Filtering, Identification and Extraction
read_QC_trim
read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.
HRRT: Human Host Sequence Removal
All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).
HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed from all human RefSeq records (Eukaryota), with any k-mers also found in non-Eukaryota RefSeq records subtracted from the database.
NCBI-Scrub Technical Details
| Links | |
|---|---|
| Task | task_ncbi_scrub.wdl |
| Software Source Code | HRRT on GitHub |
| Software Documentation | HRRT on NCBI |
By default, read_processing is set to "trimmomatic". To use fastp instead, set read_processing to "fastp". These tasks are mutually exclusive.
Trimmomatic: Read Trimming (default)
Read processing is available via Trimmomatic by default.
Trimmomatic trims low-quality regions of Illumina paired-end or single-end reads with a sliding window (with a default window size of 4, specified with trim_window_size), cutting once the average quality within the window falls below the trim_quality_min_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below trim_minlen (default of 75 for paired-end, 25 for single-end).
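The sliding-window behaviour is easier to reason about with a short, tool-agnostic sketch. This is not Trimmomatic's implementation; the window size, quality threshold, and minimum length below are illustrative stand-ins for trim_window_size, trim_quality_min_score, and the minimum-length input used in a given run.

```python
# Tool-agnostic illustration of sliding-window quality trimming followed by a
# minimum-length filter (Trimmomatic SLIDINGWINDOW + MINLEN style).
from typing import Optional, Tuple

def sliding_window_trim(
    seq: str,
    quals: list,
    window: int = 4,
    min_quality: float = 30,
    min_length: int = 75,
) -> Optional[Tuple[str, list]]:
    """Cut the read at the first window whose mean quality drops below the
    threshold; discard the read entirely if the kept portion is too short."""
    cut = len(seq)
    for start in range(len(seq) - window + 1):
        if sum(quals[start:start + window]) / window < min_quality:
            cut = start
            break
    if cut < min_length:
        return None  # read is discarded
    return seq[:cut], quals[:cut]

# A 12 bp read whose quality collapses near the 3' end is trimmed to 6 bp
read = "ACGTACGTACGT"
qualities = [38, 38, 38, 38, 38, 38, 38, 38, 12, 11, 10, 9]
print(sliding_window_trim(read, qualities, window=4, min_quality=30, min_length=5))
```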
Trimmomatic Technical Details
| Links | |
|---|---|
| Task | task_trimmomatic.wdl |
| Software Source Code | Trimmomatic on GitHub |
| Software Documentation | Trimmomatic Website |
| Original Publication(s) | Trimmomatic: a flexible trimmer for Illumina sequence data |
fastp: Read Trimming (alternative)
To activate this task, set read_processing to "fastp".
fastp trims low-quality regions of Illumina paired-end or single-end reads with a sliding window (with a default window size of 4, specified with trim_window_size), cutting once the average quality within the window falls below the trim_quality_min_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below trim_minlen (default of 75 for paired-end, 25 for single-end).
fastp also has additional default parameters and features that are not a part of trimmomatic's default configuration.
fastp default read-trimming parameters
| Parameter | Explanation |
|---|---|
| -g | enables polyG tail trimming |
| -5 20 | enables quality trimming from the 5' end (read start) |
| -3 20 | enables quality trimming from the 3' end (read end) |
| --detect_adapter_for_pe | enables adapter-trimming only for paired-end reads |
Additional arguments can be passed using the fastp_args optional parameter.
fastp Technical Details
| Links | |
|---|---|
| Task | task_fastp.wdl |
| Software Source Code | fastp on GitHub |
| Software Documentation | fastp on GitHub |
| Original Publication(s) | fastp: an ultra-fast all-in-one FASTQ preprocessor |
BBDuk: Adapter Trimming and PhiX Removal
Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.
The bbduk task removes adapters from sequence reads. To do this:
- Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files (it re-pairs).
- BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
BBDuk Technical Details
| Links | |
|---|---|
| Task | task_bbduk.wdl |
| Software Source Code | BBMap on SourceForge |
| Software Documentation | BBDuk Guide (archived) |
By default, read_qc is set to "fastq_scan". To use fastqc instead, set read_qc to "fastqc". These tasks are mutually exclusive.
fastq-scan: Read Quantification (default)
Read quantification is available via fastq-scan by default.
fastq-scan quantifies the forward and reverse reads in FASTQ files. For paired-end data, it also provides the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
fastq-scan Technical Details
| Links | |
|---|---|
| Task | task_fastq_scan.wdl |
| Software Source Code | fastq-scan on GitHub |
| Software Documentation | fastq-scan on GitHub |
FastQC: Read Quantification (alternative)
To activate this task, set read_qc to "fastqc".
FastQC quantifies the forward and reverse reads in FASTQ files. For paired-end data, it also provides the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
This tool also provides a graphical visualization of the read quality.
FastQC Technical Details
| Links | |
|---|---|
| Task | task_fastqc.wdl |
| Software Source Code | FastQC on Github |
| Software Documentation | FastQC Website |
host_decontaminate: Host Read Decontamination
Host genetic data is frequently sequenced incidentally alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning them to a reference host genome that is either provided directly or acquired on-the-fly. The reference host genome can be supplied to the host input field as an assembly file (with is_genome set to "true"), as an NCBI Taxonomy-compatible taxon, or as an assembly accession (with is_accession set to "true"). Host Decontaminate maps the input reads to the host genome using minimap2, reports mapping statistics against the host genome, and outputs the unaligned, dehosted reads.
The detailed steps and tasks are as follows:
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Download Accession
The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.
This task uses the accession ID output from the skani task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_ncbi_datasets.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Map Reads to Host
minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.
The mode used in this task is map-ont, the default mode for long reads, which indicates that long reads with ~10% error rates are being aligned to the reference genome. The output file is in SAM format.
For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.
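A stand-alone invocation equivalent in spirit to this mapping step is sketched below; the host FASTA and read file names are placeholders, and the workflow's task may add further options.

```python
# Sketch: map ONT reads against a host genome with the map-ont preset, then
# sort the alignment into a BAM file. File names are placeholders.
import subprocess

host_fasta = "host_genome.fasta"
reads = "sample.fastq.gz"

with open("host_alignment.sam", "w") as sam:
    subprocess.run(["minimap2", "-ax", "map-ont", host_fasta, reads],
                   stdout=sam, check=True)

subprocess.run(["samtools", "sort", "-o", "host_alignment.bam", "host_alignment.sam"],
               check=True)
```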
minimap2 Technical Details
| Links | |
|---|---|
| Task | task_minimap2.wdl |
| Software Source Code | minimap2 on GitHub |
| Software Documentation | minimap2 |
| Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
Extract Unaligned Reads
The bam_to_unaligned_fastq task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.
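With samtools alone, the unaligned reads can be recovered roughly as sketched below (single-end/ONT case; SAM flag 4 marks an unmapped read). This approximates the task's behaviour and omits the paired-end mate handling it performs.

```python
# Sketch: write reads that did not align to the host back out as FASTQ.
# Flag 4 selects unmapped records; paired-end mate handling is omitted here.
import subprocess

with open("dehosted.fastq", "w") as out:
    subprocess.run(["samtools", "fastq", "-f", "4", "host_alignment.bam"],
                   stdout=out, check=True)
```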
parse_mapping Technical Details
| Links | |
|---|---|
| Task | task_parse_mapping.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
assembly_metrics: Mapping Statistics
The assembly_metrics task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
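A comparable summary can be generated directly with samtools; the sketch below uses samtools stats and samtools coverage as stand-ins for the task's reporting and is not its literal command.

```python
# Sketch: summarize a BAM. "SN" lines from samtools stats carry totals such as
# reads mapped and error rate; samtools coverage reports per-contig depth and
# breadth of coverage.
import subprocess

bam = "host_alignment.bam"

stats = subprocess.run(["samtools", "stats", bam],
                       capture_output=True, text=True, check=True)
for line in stats.stdout.splitlines():
    if line.startswith("SN\t"):
        print(line)

coverage = subprocess.run(["samtools", "coverage", bam],
                          capture_output=True, text=True, check=True)
print(coverage.stdout)
```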
assembly_metrics Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
Host Decontaminate Technical Details
| Links | |
|---|---|
| Subworkflow | wf_host_decontaminate.wdl |
Kraken2: Read Identification
Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.
This task runs on cleaned reads passed from the read_QC_trim subworkflow and outputs a Kraken2 report detailing taxonomic classifications. It also separates classified reads from unclassified ones.
Database-dependent
This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.
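A stand-alone Kraken2 run producing the same kinds of outputs (a report plus classified/unclassified read files) looks roughly like the sketch below; the database and read paths are placeholders and the workflow's exact flags may differ.

```python
# Sketch of a paired-end Kraken2 run that writes a report and splits reads into
# classified/unclassified files (Kraken2 substitutes the '#' for each mate).
# Paths are placeholders.
import subprocess

subprocess.run(
    [
        "kraken2",
        "--db", "kraken2_humanGRCh38_viralRefSeq",
        "--paired",
        "--report", "sample.kraken2.report.txt",
        "--output", "sample.kraken2.classifications.txt",
        "--classified-out", "classified#.fastq",
        "--unclassified-out", "unclassified#.fastq",
        "clean_R1.fastq.gz", "clean_R2.fastq.gz",
    ],
    check=True,
)
```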
Kraken2 Technical Details
| Links | |
|---|---|
| Task | task_kraken2.wdl |
| Software Source Code | Kraken2 on GitHub |
| Software Documentation | Kraken2 Documentation |
| Original Publication(s) | Improved metagenomic analysis with Kraken 2 |
krakentools: Read Extraction
The krakentools task extracts reads from the Kraken2 output file. It uses the KrakenTools package to extract reads classified at any user-specified taxon ID.
extract_unclassified input parameter
This parameter determines whether unclassified reads should also be extracted and combined with the taxon-specific extracted reads. By default, this is set to false, meaning that only reads classified to the specified input taxon will be extracted.
Important
This task will extract reads classified to the input taxon and all of its descendant taxa. The rank input parameter controls the extraction of reads classified at the specified rank and all subordinate taxonomic levels. See the ete4_identify task under the Taxonomic Identification section for more details on the rank input parameter.
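KrakenTools performs this extraction with its extract_kraken_reads.py script; a typical paired-end invocation is sketched below with placeholder paths and taxon ID. The --include-children flag mirrors the descendant-taxa behaviour described above and requires the Kraken2 report.

```python
# Sketch: extract read pairs classified to a taxon (and its descendants) from
# Kraken2 results using KrakenTools. Taxon ID and paths are placeholders.
import subprocess

taxon_id = "11292"  # e.g. Lyssavirus rabies

subprocess.run(
    [
        "extract_kraken_reads.py",
        "-k", "sample.kraken2.classifications.txt",
        "-r", "sample.kraken2.report.txt",
        "-s1", "clean_R1.fastq",
        "-s2", "clean_R2.fastq",
        "-t", taxon_id,
        "--include-children",
        "--fastq-output",
        "-o", "extracted_R1.fastq",
        "-o2", "extracted_R2.fastq",
    ],
    check=True,
)
```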
KrakenTools Technical Details
| Links | |
|---|---|
| Task | task_krakentools.wdl |
| Software Source Code | KrakenTools on GitHub |
| Software Documentation | KrakenTools |
| Original Publication(s) | Metagenome analysis using the Kraken software suite |
read_QC_trim Technical Details
| Links | |
|---|---|
| Subworkflow | wf_read_QC_trim_pe.wdl wf_read_QC_trim_se.wdl |
rasusa
Rasusa is a tool to randomly subsample sequencing reads to a specified coverage without assuming that all reads are of equal length, making it especially suitable for long-read data while still being applicable to short-read data.
The Rasusa task performs subsampling on the input raw reads. When enabled, it subsamples reads to a target depth of 250X, using the estimated genome length either generated by the datasets_genome_length task or provided directly by the user. The task is disabled by default; users can enable it by setting the skip_rasusa variable to false. The target subsampling depth can also be adjusted by modifying the coverage variable.
coverage input parameter
This parameter specifies the target coverage for subsampling. The default value is 250, but users can adjust it as needed.
Non-deterministic output(s)
This task may yield non-deterministic outputs since it performs random subsampling. To ensure reproducibility, set a value for the optional seed input variable.
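A stand-alone run equivalent in spirit to this task is sketched below. The flag names follow Rasusa's v2.x reads subcommand as documented upstream; verify them against the container version pinned by the workflow, and treat all file names as placeholders.

```python
# Sketch: subsample reads to ~250x over an expected genome length of 12,500 bp
# with a fixed seed for reproducibility. Flags per Rasusa v2.x; paths are
# placeholders.
import subprocess

subprocess.run(
    [
        "rasusa", "reads",
        "--coverage", "250",
        "--genome-size", "12500",
        "--seed", "42",
        "--output", "subsampled.fastq.gz",
        "extracted.fastq.gz",
    ],
    check=True,
)
```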
Rasusa Technical Details
| Links | |
|---|---|
| Task | task_rasusa.wdl |
| Software Source Code | Rasusa on GitHub |
| Software Documentation | Rasusa on GitHub |
| Original Publication(s) | Rasusa: Randomly subsample sequencing reads to a specified coverage |
clean_check_reads
The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:
- Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
- The proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion basepairs are in either the read1 or read2 file.
- Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
- Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
- Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than min_coverage.
Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen variable to true. Default values vary between the ONT and PE workflow. The rationale for these default values can be found below:
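The pass/fail logic described above reduces to a handful of comparisons. The sketch below mirrors those checks in plain Python with illustrative default thresholds; it is not the screen task's code, and it omits the fastq-scan and mash steps that produce the input metrics.

```python
# Illustration of the read-screening checks. In the actual task the metric
# values come from fastq-scan and mash; here they are supplied directly.
def screen_reads(
    total_reads: int,
    total_basepairs: int,
    est_genome_size: int,
    est_coverage: float,
    min_reads: int = 50,
    min_basepairs: int = 15_000,
    min_genome_size: int = 1_500,
    max_genome_size: int = 2_673_870,
    min_coverage: float = 10,
):
    """Return the list of failed checks; an empty list means the sample passes."""
    failures = []
    if total_reads <= min_reads:
        failures.append("too few reads")
    if total_basepairs < min_basepairs:
        failures.append("too few basepairs")
    if not min_genome_size <= est_genome_size <= max_genome_size:
        failures.append("estimated genome size out of range")
    if est_coverage < min_coverage:
        failures.append("estimated coverage too low")
    return failures

# Passes all checks -> prints []
print(screen_reads(total_reads=12_000, total_basepairs=6_000_000,
                   est_genome_size=12_000, est_coverage=480.0))
```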
Default Thresholds and Rationales
| Variable | Description | Default Value | Rationale |
|---|---|---|---|
| estimated_genome_length | Default genome length estimate | 12500 | Approximates the median RNA virus genome length |
| min_reads | A sample will fail the read screening task if its total number of reads is less than or equal to min_reads | 50 | Minimum number of base pairs for 10x coverage of the Hepatitis delta virus (genus Deltavirus) divided by 300 (longest Illumina read length) |
| min_basepairs | A sample will fail the read screening if there are fewer than min_basepairs basepairs | 15000 | Greater than 10x coverage of the Hepatitis delta virus (genus Deltavirus) |
| min_genome_size | A sample will fail the read screening if the estimated genome size is smaller than min_genome_size | 1500 | Based on the Hepatitis delta virus genome, the smallest viral genome as of 2024-04-11 (1,700 bp) |
| max_genome_size | A sample will fail the read screening if the estimated genome size is larger than max_genome_size | 2673870 | Based on the Pandoravirus salinus genome, the largest viral genome (2,673,870 bp), with 2 Mbp added |
| min_coverage | A sample will fail the read screening if the estimated genome coverage is less than min_coverage | 10 | A bare-minimum coverage for genome characterization; higher coverage would be required for high-quality phylogenetics |
| min_proportion | A sample will fail the read screening if fewer than min_proportion percent of basepairs are in either the read1 or read2 file | 40 | Ensures the forward and reverse read files each contain a comparable share of the data (PE workflow only) |
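The sketch below approximates these checks with plain shell commands plus a mash sketch of the reads for the genome-size and coverage estimates. The actual task uses fastq-scan and its own internal logic, so treat the file names and parsing here as illustrative only.

```bash
# Illustrative approximations of the screen checks (not the exact task commands).
READS=$(zcat clean_R1.fastq.gz | awk 'END {printf "%d\n", NR / 4}')
BASEPAIRS=$(zcat clean_R1.fastq.gz clean_R2.fastq.gz | awk 'NR % 4 == 2 {sum += length($0)} END {printf "%d\n", sum}')

# mash sketch of reads (-r) prints the estimated genome size and coverage to stderr.
mash sketch -o sample -k 21 -m 3 -r clean_R1.fastq.gz 2> mash.log
GENOME_SIZE=$(grep "Estimated genome size" mash.log | awk '{print $NF}')
COVERAGE=$(grep "Estimated coverage" mash.log | awk '{print $NF}')

# Compare against the default thresholds from the table above.
awk -v n="$READS"       'BEGIN {if (n <= 50) print "FAIL: min_reads"}'
awk -v b="$BASEPAIRS"   'BEGIN {if (b < 15000) print "FAIL: min_basepairs"}'
awk -v g="$GENOME_SIZE" 'BEGIN {if (g < 1500 || g > 2673870) print "FAIL: genome size"}'
awk -v c="$COVERAGE"    'BEGIN {if (c < 10) print "FAIL: min_coverage"}'
```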
Screen Technical Details
| Links | |
|---|---|
| Task | task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task) |
De novo Assembly and Reference Selection
These tasks are only performed if no reference genome is provided
In this workflow, de novo assembly is primarily used to facilitate the selection of a closely related reference genome, though high-quality de novo assemblies can be used for downstream analysis. If the user provides an input reference_fasta, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:

- spades
- megahit
- checkv_denovo
- quast_denovo
- skani
spades
SPAdes (St. Petersburg genome assembler) is a de novo assembly tool that uses de Bruijn graphs to assemble genomes from Illumina short reads.
It is run with the --metaviral option, which is recommended for viral genomes. The MetaviralSPAdes pipeline consists of three independent steps: ViralAssembly, which finds putative viral subgraphs in a metagenomic assembly graph and generates contigs from them; ViralVerify, which checks whether the resulting contigs are of viral origin; and ViralComplete, which checks whether those contigs represent complete viral genomes. For more details, please see the original publication.
MetaviralSPAdes was selected as the default assembler because it produces the most complete viral genomes within TheiaViral, determined by CheckV quality assessment (see task checkv for technical details).
call_metaviralspades input parameter
This parameter controls whether or not the spades task is called by the workflow. By default, call_metaviralspades is set to true because MetaviralSPAdes is used as the primary assembler. MetaviralSPAdes is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with MetaviralSPAdes, they can set the call_metaviralspades variable to false to bypass the spades task and instead de novo assemble using MEGAHIT (see task megahit for details). Additionally, if the spades task fails during execution, the workflow will automatically fall back to using MEGAHIT for de novo assembly.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
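A minimal sketch of a MetaviralSPAdes run on cleaned paired-end reads is shown below; file names and resource values are placeholders, and the task may pass additional options.

```bash
# Assemble cleaned paired-end reads with the metaviral pipeline.
# File names and resources are placeholders; the WDL task may add further options.
spades.py \
  --metaviral \
  -1 clean_R1.fastq.gz \
  -2 clean_R2.fastq.gz \
  -t 4 -m 16 \
  -o metaviralspades_out
# Assembled contigs are written to metaviralspades_out/contigs.fasta
```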
MetaviralSPAdes Technical Details
| Links | |
|---|---|
| Task | task_spades.wdl |
| Software Source Code | SPAdes on GitHub |
| Software Documentation | SPAdes Manual |
| Original Publication(s) | TheiaProk: SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing TheiaViral: MetaviralSPAdes: assembly of viruses from metagenomic data |
megahit
The MEGAHIT assembler is a fast and memory-efficient de novo assembler that can handle large datasets. While optimized for metagenomics, MEGAHIT also performs well on single-genome assemblies, making it a versatile choice for various assembly tasks.
MEGAHIT uses a multiple k-mer strategy that can be beneficial for assembling genomes with varying coverage levels, which is common in metagenomic samples. It constructs succinct de Bruijn graphs to efficiently represent the assembly process, allowing it to handle large and complex datasets with reduced memory usage.
This task is optional and turned off by default. It can be enabled by setting the call_metaviralspades parameter to false, and it is also used automatically as a fallback if the spades task fails during execution (see task spades for more details).
Non-deterministic output(s)
This task may yield non-deterministic outputs.
MEGAHIT Technical Details
| Links | |
|---|---|
| Task | task_megahit.wdl |
| Software Source Code | MEGAHIT on GitHub |
| Software Documentation | MEGAHIT on GitHub |
| Original Publication(s) | MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph |
skani
The skani task is used to identify and select the reference genome most closely related to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 viral genomes. This database was constructed with the following methodology:

1. Extracting all complete NCBI viral genomes, excluding RefSeq accessions (to avoid redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae). Some complete gene accessions, rather than complete genomes, are included because the NCBI datasets completeness parameters are susceptible to metadata errors.
2. Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA.
3. Adding one SARS-CoV-2 genome for each major Pangolin lineage.
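For orientation, a minimal sketch of querying an assembly against a pre-sketched skani database is shown below; the database path, file names, and the column layout of the report are assumptions, and the task's exact arguments may differ.

```bash
# Compare the de novo assembly against a directory of skani sketches.
# Paths are placeholders; the task's exact arguments may differ.
skani search denovo_assembly.fasta -d skani_viral_db/ -o skani_hits.tsv

# Inspect the best hits; the ANI column of the report identifies the closest reference
# (column position assumed from skani's default tab-separated output).
sort -t $'\t' -k3,3gr skani_hits.tsv | head
```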
Skani Technical Details
| Links | |
|---|---|
| Task | task_skani.wdl |
| Software Source Code | Skani on GitHub |
| Software Documentation | Skani Documentation |
| Original Publication(s) | Fast and robust metagenomic sequence comparison through sparse chaining with skani |
Reference Mapping
bwa
The bwa task is a wrapper for the BWA alignment tool. It utilizes the BWA-MEM algorithm to map cleaned reads to the reference genome, either selected by the skani task or provided by the user input reference_fasta. This creates a BAM file which is then sorted using the command samtools sort.
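A minimal sketch of this mapping step is shown below; file names are placeholders.

```bash
# Index the selected reference, map cleaned reads with BWA-MEM, and coordinate-sort the output.
bwa index reference.fasta
bwa mem -t 4 reference.fasta clean_R1.fastq.gz clean_R2.fastq.gz \
  | samtools sort -@ 4 -o sample.sorted.bam -
samtools index sample.sorted.bam
```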
BWA Technical Details
| Links | |
|---|---|
| Task | task_bwa.wdl |
| Software Source Code | BWA on GitHub |
| Software Documentation | BWA Documentation |
| Original Publication(s) | Fast and accurate short read alignment with Burrows-Wheeler transform |
read_mapping_stats: Mapping Statistics
The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
read_mapping_stats Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Variant Calling and Consensus Generation
ivar_variants: Variant Calling
iVar uses the outputs of samtools mpileup to call single nucleotide variants (SNVs) and insertions/deletions (indels). Several key parameters can be set to determine the stringency of variant calling, including minimum quality, minimum allele frequency, and minimum depth.
This task returns a VCF file containing all called variants, the number of detected variants, and the proportion of those variants with allele frequencies between 0.6 and 0.9 (also known as intermediate variants).
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
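A minimal sketch of the pileup-to-iVar variant-calling step using the default thresholds above is shown below; file names are placeholders and the task's exact mpileup options may differ.

```bash
# Pipe a samtools pileup into iVar variant calling with the default thresholds
# (-q = minimum quality, -t = min_allele_freq, -m = min_depth).
# File names are placeholders; the task's exact mpileup options may differ.
samtools mpileup -aa -A -d 0 -B -Q 0 -f reference.fasta sample.sorted.bam \
  | ivar variants -p sample_variants -r reference.fasta -q 20 -t 0.6 -m 10
# iVar writes its native variant table to sample_variants.tsv
```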
iVar Technical Details
| Links | |
|---|---|
| Task | task_ivar_variant_call.wdl |
| Software Source Code | Ivar on GitHub |
| Software Documentation | Ivar Documentation |
| Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
ivar_consensus: Consensus Assembly
iVar's consensus tool generates a reference-based consensus assembly. Several parameters can be set that determine the stringency of the consensus assembly, including minimum quality, minimum allele frequency, and minimum depth.
This task is functional for segmented viruses by iteratively executing iVar on a contig-by-contig basis and concatenating the resulting consensus contigs.
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
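Below is a minimal sketch of the contig-by-contig approach described above, looping over the reference's sequence names and concatenating the per-segment consensus sequences; file names are placeholders and the task's exact commands may differ.

```bash
# Generate a consensus per reference contig/segment, then concatenate the results.
# File names are placeholders; the task's exact commands may differ.
samtools faidx reference.fasta
samtools index sample.sorted.bam
cut -f1 reference.fasta.fai | while read -r CONTIG; do
  samtools mpileup -aa -A -d 0 -B -Q 0 -r "$CONTIG" sample.sorted.bam \
    | ivar consensus -p "consensus_${CONTIG}" -q 20 -t 0.6 -m 10
done
cat consensus_*.fa > sample.consensus.fasta
```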
iVar Technical Details
| Links | |
|---|---|
| Task | task_ivar_consensus.wdl |
| Software Source Code | Ivar on GitHub |
| Software Documentation | Ivar Documentation |
| Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
Assembly Evaluation and Consensus Quality Control
quast_denovo
QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
QUAST Technical Details
| Links | |
|---|---|
| Task | task_quast.wdl |
| Software Source Code | QUAST on GitHub |
| Software Documentation | QUAST Manual on SourceForge |
| Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
checkv_denovo & checkv_consensus
CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.
By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports both "weighted_contamination" and "weighted_completeness", which are average percents calculated across the total assembly that are weighted by contig length.
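A minimal sketch of a CheckV run, plus one way to derive a length-weighted summary from its per-contig report, is shown below; the database path is a placeholder and the column names are assumptions based on CheckV's quality_summary.tsv.

```bash
# Run CheckV end-to-end on an assembly (database path is a placeholder).
checkv end_to_end assembly.fasta checkv_out -d /path/to/checkv-db -t 4

# Length-weighted completeness across contigs
# (column names assumed from CheckV's quality_summary.tsv).
awk -F'\t' '
  NR == 1 {for (i = 1; i <= NF; i++) col[$i] = i; next}
  {
    len  = $(col["contig_length"])
    comp = $(col["completeness"])
    if (comp != "NA") {num += len * comp; den += len}
  }
  END {if (den > 0) printf "weighted_completeness: %.2f\n", num / den}
' checkv_out/quality_summary.tsv
```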
CheckV Technical Details
| Links | |
|---|---|
| Task | task_checkv.wdl |
| Software Source Code | CheckV on Bitbucket |
| Software Documentation | CheckV Documentation |
| Original Publication(s) | CheckV assesses the quality and completeness of metagenome-assembled viral genomes |
consensus_qc: Assembly Statistics
The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, "N" bases, degenerate bases, and an estimate of the percent coverage to the reference genome.
consensus_qc Technical Details
| Links | |
|---|---|
| Task | task_consensus_qc.wdl |
Versioning
versioning: Version Capture
The versioning task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical details
| Links | |
|---|---|
| Task | task_versioning.wdl |
Taxonomic Identification
ete4_identify
The ete4_identify task parses the NCBI taxonomy hierarchy from a user's inputted taxonomy and desired taxonomic rank. This task returns a taxon ID, name, and rank, which facilitates downstream functions, including read classification, targeted read extraction, and genomic characterization modules.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
rank a.k.a read_extraction_rank input parameter
Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.
Important
- The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: the taxon ID for Rhabdoviridae (11270), the name "Rhabdoviridae", and the rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an input genus.
ete4 Identify Technical Details
| Links | |
|---|---|
| Task | task_ete4_taxon_id.wdl |
| Software Source Code | ete4 on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Read Quality Control, Trimming, and Filtering
NanoPlot: Read Quantification
NanoPlot is used for the determination of mean quality scores, read lengths, and number of reads. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
NanoPlot Technical Details
| Links | |
|---|---|
| Task | task_nanoplot.wdl |
| Software Source Code | NanoPlot on GitHub |
| Software Documentation | NanoPlot Documentation |
| Original Publication(s) | NanoPack2: population-scale evaluation of long-read sequencing data |
porechop
Porechop is a tool for finding and removing adapters from ONT data. Adapters on the ends of reads are trimmed, and when a read has an adapter in the middle, the read is split into two.
The porechop task is optional and is turned off by default. It can be enabled by setting the call_porechop parameter to true.
Porechop Technical Details
| Links | |
|---|---|
| WDL Task | task_porechop.wdl |
| Software Source Code | Porechop on GitHub |
| Software Documentation | https://github.com/rrwick/Porechop#porechop |
Nanoq: Read Filtering
Reads are filtered by length and quality using nanoq. By default, sequences shorter than 500 basepairs or with quality scores lower than 10 are filtered out to improve assembly accuracy. These defaults can be modified by the user.
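A minimal sketch of the length and quality filter using the defaults above is shown below; file names are placeholders.

```bash
# Keep reads that are at least 500 bp long with a quality of at least Q10.
# File names are placeholders.
nanoq -i reads.fastq.gz --min-len 500 --min-qual 10 -o filtered.fastq.gz
```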
Nanoq Technical Details
| Links | |
|---|---|
| Task | task_nanoq.wdl |
| Software Source Code | Nanoq on GitHub |
| Software Documentation | Nanoq Documentation |
| Original Publication(s) | Nanoq: ultra-fast quality control for nanopore reads |
ncbi_scrub_se
All reads of human origin, including their mates, are removed using NCBI's Human Read Removal Tool (HRRT).
HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.
NCBI-Scrub Technical Details
| Links | |
|---|---|
| Task | task_ncbi_scrub.wdl |
| Software Source Code | HRRT on GitHub |
| Software Documentation | HRRT on NCBI |
host_decontaminate: Host Read Decontamination
Host genetic data is frequently sequenced incidentally alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning them to a reference host genome that is either provided directly or acquired on the fly. The reference host genome can be supplied in the host input field as an assembly FASTA (with is_genome set to "true"), as an NCBI Taxonomy-compatible taxon, or as an NCBI assembly accession (with is_accession set to "true"). Host Decontaminate maps the input reads to the host genome using minimap2, reports mapping statistics against this host genome, and outputs the unaligned, dehosted reads.
The detailed steps and tasks are as follows:
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Download Accession
The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.
Within Host Decontaminate, this task downloads the host reference genome identified by the datasets_genome_length task or provided directly as an assembly accession. The downloaded host genome is then used for host read mapping and removal.
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_ncbi_datasets.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Map Reads to Host
minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.
The mode used in this task is map-ont which is the default mode for long reads and indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.
For more information regarding modes and the available options for minimap2, please see the minimap2 manpage
minimap2 Technical Details
| Links | |
|---|---|
| Task | task_minimap2.wdl |
| Software Source Code | minimap2 on GitHub |
| Software Documentation | minimap2 |
| Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
Extract Unaligned Reads
The bam_to_unaligned_fastq task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.
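A minimal sketch of the host mapping and unaligned-read extraction steps for single-end ONT reads is shown below; file names are placeholders and the task's exact samtools flags may differ.

```bash
# Map reads to the host genome and keep only reads that did NOT align (SAM flag 4 = unmapped).
# File names are placeholders; the task's exact samtools flags may differ.
minimap2 -ax map-ont host_genome.fasta reads.fastq.gz > host_aln.sam
samtools sort -o host_aln.sorted.bam host_aln.sam
samtools fastq -f 4 host_aln.sorted.bam | gzip > dehosted_reads.fastq.gz
```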
parse_mapping Technical Details
| Links | |
|---|---|
| Task | task_parse_mapping.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
assembly_metrics: Mapping Statistics
The assembly_metrics task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
assembly_metrics Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Host Decontaminate Technical Details
| Links | |
|---|---|
| Subworkflow | wf_host_decontaminate.wdl |
rasusa
Rasusa is a tool to randomly subsample sequencing reads to a specified coverage without assuming that all reads are of equal length, making it especially suitable for long-read data while still being applicable to short-read data.
The Rasusa task performs subsampling on the input raw reads. By default, it subsamples reads to a target depth of 250X, using the estimated genome length either generated by the ncbi_identify task or provided directly by the user. Subsampling is disabled by default; users can enable it by setting the skip_rasusa variable to false. The target subsampling depth can also be adjusted by modifying the coverage variable.
coverage input parameter
This parameter specifies the target coverage for subsampling. The default value is 250, but users can adjust it as needed.
Non-deterministic output(s)
This task may yield non-deterministic outputs since it performs random subsampling. To ensure reproducibility, set a value for the rasusa_seed optional input variable.
Rasusa Technical Details
| Links | |
|---|---|
| Task | task_rasusa.wdl |
| Software Source Code | Rasusa on GitHub |
| Software Documentation | Rasusa on GitHub |
| Original Publication(s) | Rasusa: Randomly subsample sequencing reads to a specified coverage |
clean_check_reads
The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:
- Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
- Proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion percent of basepairs are in either the read1 or read2 file.
- Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
- Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
- Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than min_coverage.
Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen variable to true. Default values vary between the ONT and PE workflows. The rationale for these default values can be found below:
Default Thresholds and Rationales
| Variable | Description | Default Value | Rationale |
|---|---|---|---|
| estimated_genome_length | Default genome length estimate | 12500 | Approximates the median RNA virus genome length |
| min_reads | A sample will fail the read screening task if its total number of reads is less than or equal to min_reads | 50 | Minimum number of base pairs for 10x coverage of the Hepatitis delta virus (genus Deltavirus) divided by 300 (longest Illumina read length) |
| min_basepairs | A sample will fail the read screening if there are fewer than min_basepairs basepairs | 15000 | Greater than 10x coverage of the Hepatitis delta virus (genus Deltavirus) |
| min_genome_size | A sample will fail the read screening if the estimated genome size is smaller than min_genome_size | 1500 | Based on the Hepatitis delta virus genome, the smallest viral genome as of 2024-04-11 (1,700 bp) |
| max_genome_size | A sample will fail the read screening if the estimated genome size is larger than max_genome_size | 2673870 | Based on the Pandoravirus salinus genome, the largest viral genome (2,673,870 bp), with 2 Mbp added |
| min_coverage | A sample will fail the read screening if the estimated genome coverage is less than min_coverage | 10 | A bare-minimum coverage for genome characterization; higher coverage would be required for high-quality phylogenetics |
| min_proportion | A sample will fail the read screening if fewer than min_proportion percent of basepairs are in either the read1 or read2 file | 40 | Ensures the forward and reverse read files each contain a comparable share of the data (PE workflow only) |
Screen Technical Details
| Links | |
|---|---|
| Task | task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task) |
Read Classification and Extraction
metabuli
The metabuli task is used to classify and extract reads against a reference database. Metabuli uses a novel k-mer structure, called metamer, to analyze both amino acid (AA) and DNA sequences. It leverages AA conservation for sensitive homology detection and DNA mutations for specific differentiation between closely related taxa.
cpu / memory input parameters
Increasing the memory and cpus allocated to Metabuli can substantially increase throughput.
extract_unclassified input parameter
This parameter determines whether unclassified reads should also be extracted and combined with the taxon-specific extracted reads. By default, this is set to false, meaning that only reads classified to the specified input taxon will be extracted.
Descendant taxa reads are extracted
This task will extract reads classified to the input taxon and all of its descendant taxa. The rank input parameter controls the extraction of reads classified at the specified rank and all subordinate taxonomic levels. See the ete4_identify task under the Taxonomic Identification section above for more details on the rank input parameter.
Metabuli Technical Details
| Links | |
|---|---|
| Task | task_metabuli.wdl |
| Software Source Code | Metabuli on GitHub |
| Software Documentation | Metabuli Documentation |
| Original Publication(s) | Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA |
De novo Assembly and Reference Selection
These tasks are only performed if no reference genome is provided
In this workflow, de novo assembly is used solely to facilitate the selection of a closely related reference genome. If the user provides an input reference_fasta, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:

- raven
- flye
- checkv_denovo
- quast_denovo
- skani
- ncbi_datasets
raven
The raven task is used to create a de novo assembly from cleaned reads. Raven is an overlap-layout-consensus based assembler that accelerates the overlap step, constructs an assembly graph from reads pre-processed with pile-o-grams, applies a novel and robust graph simplification method based on graph drawings, and polishes unambiguous graph paths using Racon.
Based on internal benchmarking against Flye and results reported by Cook et al. (2024), Raven is faster, produces more contiguous assemblies, and yields more complete genomes within TheiaViral according to CheckV quality assessment (see task checkv for technical details).
call_raven input parameter
This parameter controls whether or not the raven task is called by the workflow. By default, call_raven is set to true because Raven is used as the primary assembler. Raven is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with Raven, they can set the call_raven variable to false to bypass the raven task and instead de novo assemble using Flye (see task flye for details). Additionally, if the Raven task fails during execution, the workflow will automatically fall back to using Flye for de novo assembly.
Error traceback
Raven may fail with cryptic "segmentation fault" (segfault) errors or by failing to produce an output file. It is difficult to trace back the source of these issues, though increasing the memory parameter may resolve some errors.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
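For orientation, a minimal sketch of a Raven run is shown below; the file name and thread count are placeholders and the task may pass additional options (e.g. polishing rounds).

```bash
# Assemble cleaned ONT reads with Raven; the assembly is written to stdout in FASTA format.
# File name and thread count are placeholders; the task may pass additional options.
raven --threads 4 clean_reads.fastq.gz > raven_assembly.fasta
```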
Raven Technical Details
| Links | |
|---|---|
| Task | task_raven.wdl |
| Software Source Code | Raven on GitHub |
| Software Documentation | Raven Documentation |
| Original Publication(s) | Time- and memory-efficient genome assembly with Raven |
flye
Flye is a de novo assembler for long read data using repeat graphs. Compared to de Bruijn graphs, which require exact k-mer matches, repeat graphs can use approximate matches which better tolerates the error rate of ONT data.
The flye task is optional and turned off by default; it can be enabled by setting the call_raven parameter to false. It is also used as a fallback option if the raven task fails during execution (see task raven for more details).
read_type input parameter
This input parameter specifies the type of sequencing reads being used for assembly. This parameter significantly impacts the assembly process and should match the characteristics of your input data. Below are the available options:
| Parameter | Explanation |
|---|---|
| --nano-hq (default) | Optimized for ONT high-quality reads, such as Guppy5+ SUP or Q20 (<5% error). Recommended for ONT reads processed with Guppy5 or newer |
| --nano-raw | ONT regular reads, pre-Guppy5 (<20% error) |
| --nano-corr | ONT reads corrected with other methods (<3% error) |
| --pacbio-raw | PacBio regular CLR reads (<20% error) |
| --pacbio-corr | PacBio reads corrected with other methods (<3% error) |
| --pacbio-hifi | PacBio HiFi reads (<1% error) |
Refer to the Flye documentation for detailed guidance on selecting the appropriate read_type based on your sequencing data and additional optional parameters.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
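A minimal sketch of a Flye run using the default read type is shown below; file names are placeholders and the task may pass additional options.

```bash
# Assemble cleaned ONT reads with the default high-quality read model.
# File names are placeholders; the task may pass additional options.
flye --nano-hq clean_reads.fastq.gz --out-dir flye_out --threads 4
# The assembly is written to flye_out/assembly.fasta
```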
Flye Technical Details
| Links | |
|---|---|
| WDL Task | task_flye.wdl |
| Software Source Code | Flye on GitHub |
| Software Documentation | Flye Documentation |
| Original Publication(s) | Assembly of long, error-prone reads using repeat graphs |
skani
The skani task is used to identify and select the reference genome most closely related to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 viral genomes. This database was constructed with the following methodology:

1. Extracting all complete NCBI viral genomes, excluding RefSeq accessions (to avoid redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae). Some complete gene accessions, rather than complete genomes, are included because the NCBI datasets completeness parameters are susceptible to metadata errors.
2. Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA.
3. Adding one SARS-CoV-2 genome for each major Pangolin lineage.
Skani Technical Details
| Links | |
|---|---|
| Task | task_skani.wdl |
| Software Source Code | Skani on GitHub |
| Software Documentation | Skani Documentation |
| Original Publication(s) | Fast and robust metagenomic sequence comparison through sparse chaining with skani |
Reference Mapping
minimap2
minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.
The mode used in this task is map-ont with additional long-read-specific parameters (the -L --cs --MD flags) to align ONT reads to the reference genome. These specialized parameters are essential for proper handling of long read error profiles, generation of detailed alignment information, and improved mapping accuracy for long reads.
map-ont is the default mode for long reads and it indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.
For more information regarding modes and the available options for minimap2, please see the minimap2 manpage
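A minimal sketch of the alignment command described above is shown below; file names are placeholders.

```bash
# Align cleaned ONT reads to the selected reference with the long-read-specific flags described above.
# File names are placeholders.
minimap2 -ax map-ont -L --cs --MD reference.fasta clean_reads.fastq.gz > aligned.sam
```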
minimap2 Technical Details
| Links | |
|---|---|
| Task | task_minimap2.wdl |
| Software Source Code | minimap2 on GitHub |
| Software Documentation | minimap2 |
| Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
parse_mapping
The sam_to_sorted_bam sub-task converts the SAM file output by the minimap2 task to a BAM file, sorts the BAM file by coordinate, and creates a BAM index file.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
parse_mapping Technical Details
| Links | |
|---|---|
| Task | task_parse_mapping.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
read_mapping_stats: Mapping Statistics
The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
read_mapping_stats Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
fasta_utilities
The fasta_utilities task utilizes samtools to index a reference fasta file.
This reference is selected by the skani task or provided by the user input reference_fasta. This indexed reference genome is used for downstream variant calling and consensus generation tasks.
fasta_utilities Technical Details
| Links | |
|---|---|
| Task | task_fasta_utilities.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Variant Calling and Consensus Generation
clair3
Clair3 performs deep learning-based variant detection using a multi-stage approach. The process begins with pileup-based calling for initial variant identification, followed by full-alignment analysis for comprehensive variant detection. Results are merged into a final high-confidence call set.
The variant calling pipeline employs specialized neural networks trained on ONT data to accurately identify:

- Single nucleotide variants (SNVs)
- Small insertions and deletions (indels)
- Structural variants
clair3_model input parameter
This parameter specifies the clair3 model to use for variant calling. The default is set to "r1041_e82_400bps_sup_v500", but users may select from other available models that clair3 was trained on, which may yield better results depending on the basecaller and data type. The following models are available:
"ont""ont_guppy2""ont_guppy5""r941_prom_sup_g5014""r941_prom_hac_g360+g422""r941_prom_hac_g238""r1041_e82_400bps_sup_v500""r1041_e82_400bps_hac_v500""r1041_e82_400bps_sup_v410""r1041_e82_400bps_hac_v410"
Default Parameters and Filtering
In this workflow, clair3 is run with nearly all default parameters. Note that the VCF file produced by the clair3 task is unfiltered and does not represent the final set of variants that will be included in the final consensus genome. A filtered VCF file is generated by the bcftools_consensus task. The filtering parameters are applied as follows:

- The min_map_quality parameter is applied before calling variants.
- The min_depth and min_allele_freq parameters are applied after variant calling, during consensus genome construction.
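For orientation, a minimal sketch of a Clair3 invocation with the default model is shown below; paths are placeholders (including the model path), and the task's exact arguments may differ.

```bash
# Call variants from the sorted, indexed BAM with the default ONT model.
# Paths, including the model path, are placeholders; the task's exact arguments may differ.
run_clair3.sh \
  --bam_fn=sample.sorted.bam \
  --ref_fn=reference.fasta \
  --model_path=/opt/models/r1041_e82_400bps_sup_v500 \
  --platform=ont \
  --threads=4 \
  --output=clair3_out
# The unfiltered calls are written to clair3_out/merge_output.vcf.gz
```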
Clair3 Technical Details
| Links | |
|---|---|
| Task | task_clair3.wdl |
| Software Source Code | Clair3 on GitHub |
| Software Documentation | Clair3 Documentation |
| Original Publication(s) | Symphonizing pileup and full-alignment for deep learning-based long-read variant calling |
parse_mapping
The mask_low_coverage sub-task is used to mask low coverage regions in the reference_fasta file to improve the accuracy of the final consensus genome. Coverage thresholds are defined by the min_depth parameter, which specifies the minimum read depth required for a base to be retained. Bases falling below this threshold are replaced with "N"s to clearly mark low confidence regions. The masked reference is then combined with variants from the clair3 task to produce the final consensus genome.
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
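The task's exact commands are not reproduced here; as a rough equivalent, the sketch below derives low-coverage intervals and masks them with bedtools, which is one common way to implement this kind of masking. The bedtools-based approach, file names, and threshold handling are illustrative assumptions.

```bash
# Identify intervals with depth below min_depth (10) and mask them in the reference with "N"s.
# This bedtools-based approach is an illustrative stand-in for the task's own logic.
bedtools genomecov -bga -ibam sample.sorted.bam \
  | awk '$4 < 10' > low_coverage.bed
bedtools maskfasta -fi reference.fasta -bed low_coverage.bed -fo reference.masked.fasta
```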
parse_mapping Technical Details
| Links | |
|---|---|
| Task | task_parse_mapping.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
bcftools_consensus
The bcftools_consensus task generates a consensus genome assembly by applying variants from the clair3 task to a masked reference genome. It uses bcftools to filter variants based on the min_depth and min_allele_freq input parameters, left-aligns and normalizes indels, indexes the VCF file, and generates a consensus genome in FASTA format. Reference bases are substituted with filtered variants where applicable, preserved in regions without variant calls, and replaced with "N"s in areas masked by the mask_low_coverage task.
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
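A minimal sketch of the filtering and consensus steps is shown below; file names and the filter expression (including the VCF tags used for depth and allele frequency) are assumptions based on the parameters described above.

```bash
# Filter calls by depth and allele frequency, normalize indels,
# and apply them to the low-coverage-masked reference.
# File names and the filter expression (FORMAT/DP, FORMAT/AF) are assumptions.
bcftools view -i 'FORMAT/DP >= 10 && FORMAT/AF >= 0.6' clair3_out/merge_output.vcf.gz \
  | bcftools norm -f reference.masked.fasta -Oz -o filtered.norm.vcf.gz -
bcftools index filtered.norm.vcf.gz
bcftools consensus -f reference.masked.fasta filtered.norm.vcf.gz > sample.consensus.fasta
```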
bcftools_consensus Technical Details
| Links | |
|---|---|
| Task | task_bcftools_consensus.wdl |
| Software Source Code | bcftools on GitHub |
| Software Documentation | bcftools Manual Page |
| Original Publication(s) | Twelve Years of SAMtools and BCFtools |
Assembly Evaluation and Consensus Quality Control
quast_denovo
QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
QUAST Technical Details
| Links | |
|---|---|
| Task | task_quast.wdl |
| Software Source Code | QUAST on GitHub |
| Software Documentation | QUAST Manual on SourceForge |
| Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
checkv_denovo & checkv_consensus
CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.
By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports both "weighted_contamination" and "weighted_completeness", which are average percents calculated across the total assembly that are weighted by contig length.
CheckV Technical Details
| Links | |
|---|---|
| Task | task_checkv.wdl |
| Software Source Code | CheckV on Bitbucket |
| Software Documentation | CheckV Documentation |
| Original Publication(s) | CheckV assesses the quality and completeness of metagenome-assembled viral genomes |
consensus_qc: Assembly Statistics
The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, "N" bases, degenerate bases, and an estimate of the percent coverage to the reference genome.
consensus_qc Technical Details
| Links | |
|---|---|
| Task | task_consensus_qc.wdl |
TheiaViral_Panel operates by identifying reads assigned to the input taxon IDs (specified in the taxon_ids input variable), extracting those reads, and assembling and characterizing them using the same modules as TheiaViral_Illumina_PE. Multiple assemblies and characterizations can be generated from a single sample if reads are assigned to multiple taxon IDs.
Versioning
versioning: Version Capture
The versioning task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical details
| Links | |
|---|---|
| Task | task_versioning.wdl |
Read Quality Control, Trimming, Filtering, Identification
read_QC_trim: Read Quality Trimming, Adapter Removal, Quantification, and Identification
read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.
By default, read_processing is set to "trimmomatic". To use fastp instead, set read_processing to "fastp". These tasks are mutually exclusive.
Trimmomatic: Read Trimming (default)
Read processing is available via Trimmomatic by default.
Trimmomatic trims low-quality regions of Illumina paired-end or single-end reads with a sliding window (with a default window size of 4, specified with trim_window_size), cutting once the average quality within the window falls below the trim_quality_trim_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below trim_minlen (default of 75 for paired-end, 25 for single-end).
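A minimal sketch of paired-end trimming with the default window and thresholds above is shown below; file names are placeholders.

```bash
# Sliding-window quality trimming (window 4, mean quality 20) with a 75 bp length floor.
# File names are placeholders.
trimmomatic PE -phred33 \
  raw_R1.fastq.gz raw_R2.fastq.gz \
  trim_R1.fastq.gz unpaired_R1.fastq.gz \
  trim_R2.fastq.gz unpaired_R2.fastq.gz \
  SLIDINGWINDOW:4:20 MINLEN:75
```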
Trimmomatic Technical Details
| Links | |
|---|---|
| Task | task_trimmomatic.wdl |
| Software Source Code | Trimmomatic on GitHub |
| Software Documentation | Trimmomatic Website |
| Original Publication(s) | Trimmomatic: a flexible trimmer for Illumina sequence data |
fastp: Read Trimming (alternative)
To activate this task, set read_processing to "fastp".
fastp trims low-quality regions of Illumina paired-end or single-end reads with a sliding window (with a default window size of 4, specified with trim_window_size), cutting once the average quality within the window falls below the trim_quality_trim_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below trim_minlen (default of 75 for paired-end, 25 for single-end).
fastp also has additional default parameters and features that are not a part of trimmomatic's default configuration.
fastp default read-trimming parameters
| Parameter | Explanation |
|---|---|
| -g | enables polyG tail trimming |
| -5 20 | enables read end-trimming |
| -3 20 | enables read end-trimming |
| --detect_adapter_for_pe | enables adapter-trimming only for paired-end reads |
Additional arguments can be passed using the fastp_args optional parameter.
Trimmomatic and fastp Technical Details
| Links | |
|---|---|
| Task | task_fastp.wdl |
| Software Source Code | fastp on GitHub |
| Software Documentation | fastp on GitHub |
| Original Publication(s) | fastp: an ultra-fast all-in-one FASTQ preprocessor |
BBDuk: Adapter Trimming and PhiX Removal
Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.
The bbduk task removes adapters from sequence reads. To do this:
- Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files (it re-pairs).
- BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
BBDuk Technical Details
| Links | |
|---|---|
| Task | task_bbduk.wdl |
| Software Source Code | BBMap on SourceForge |
| Software Documentation | BBDuk Guide (archived) |
By default, read_qc is set to "fastq_scan". To use fastqc instead, set read_qc to "fastqc". These tasks are mutually exclusive.
fastq-scan: Read Quantification (default)
Read quantification is available via fastq-scan by default.
fastq-scan quantifies the forward and reverse reads in FASTQ files. For paired-end data, it also provides the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
fastq-scan Technical Details
| Links | |
|---|---|
| Task | task_fastq_scan.wdl |
| Software Source Code | fastq-scan on GitHub |
| Software Documentation | fastq-scan on GitHub |
FastQC: Read Quantification (alternative)
To activate this task, set read_qc to "fastqc".
FastQC quantifies the forward and reverse reads in FASTQ files. For paired-end data, it also provides the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
This tool also provides a graphical visualization of the read quality.
FastQC Technical Details
| Links | |
|---|---|
| Task | task_fastqc.wdl |
| Software Source Code | FastQC on Github |
| Software Documentation | FastQC Website |
read_QC_trim Technical Details
| Links | |
|---|---|
| Subworkflow | wf_read_QC_trim_pe.wdl wf_read_QC_trim_se.wdl |
Read Extraction and Binning
kraken_parser: Parses Kraken Reports
kraken_parser lightens the computation load by taking the input taxon ID list and comparing it to the taxon IDs identified by Kraken2 in the kraken_report_clean output file. Only taxon IDs that were found by Kraken are used in the scatter portion of the workflow, which lowers the number of scatter shards the workflow requires.
kraken_parser Technical Details
| Links | |
|---|---|
| Task | task_kraken_parser.wdl |
krakentools: Read Extraction
The krakentools task extracts reads from the Kraken2 output files. It uses the KrakenTools package to extract reads classified to any user-specified taxon ID.
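A minimal sketch of per-taxon read extraction with KrakenTools is shown below; file names and the taxon ID are placeholders.

```bash
# Extract read pairs classified to taxon 11292 (and its descendants) from Kraken2 output.
# File names and the taxon ID are placeholders.
extract_kraken_reads.py \
  -k kraken2_output.txt \
  -r kraken2_report.txt \
  -s1 clean_R1.fastq.gz -s2 clean_R2.fastq.gz \
  -t 11292 --include-children \
  --fastq-output \
  -o extracted_R1.fastq -o2 extracted_R2.fastq
```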
KrakenTools Technical Details
| Links | |
|---|---|
| Task | task_krakentools.wdl |
| Software Source Code | KrakenTools on GitHub |
| Software Documentation | KrakenTools |
| Original Publication(s) | Metagenome analysis using the Kraken software suite |
Taxonomic Identification
ete4_identify
The ete4_identify task parses the NCBI taxonomy hierarchy from a user's inputted taxonomy and desired taxonomic rank. This task returns a taxon ID, name, and rank, which facilitates downstream functions, including read classification, targeted read extraction, and genomic characterization modules.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
rank a.k.a read_extraction_rank input parameter
Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.
Important
- The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: the taxon ID for Rhabdoviridae (11270), the name "Rhabdoviridae", and the rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an input genus.
ete4 Identify Technical Details
| Links | |
|---|---|
| Task | task_ete4_taxon_id.wdl |
| Software Source Code | ete4 on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
TheiaViral_Panel uses the assembly and characterization tasks of TheiaViral_Illumina_PE. This allows for multiple binned taxon IDs from a single sample to undergo the same viral assembly as other samples. The following tasks are performed for each taxon ID that passes the read binning threshold:
De novo Assembly and Reference Selection
spades
SPAdes (St. Petersburg genome assembler) is a de novo assembly tool that uses de Bruijn graphs to assemble genomes from Illumina short reads.
It is run with the --metaviral option, which is recommended for viral genomes. The MetaviralSPAdes pipeline consists of three independent steps: ViralAssembly, which finds putative viral subgraphs in a metagenomic assembly graph and generates contigs from them; ViralVerify, which checks whether the resulting contigs are of viral origin; and ViralComplete, which checks whether those contigs represent complete viral genomes. For more details, please see the original publication.
MetaviralSPAdes was selected as the default assembler because it produces the most complete viral genomes within TheiaViral, determined by CheckV quality assessment (see task checkv for technical details).
call_metaviralspades input parameter
This parameter controls whether or not the spades task is called by the workflow. By default, call_metaviralspades is set to true because MetaviralSPAdes is used as the primary assembler. MetaviralSPAdes is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with MetaviralSPAdes, they can set the call_metaviralspades variable to false to bypass the spades task and instead de novo assemble using MEGAHIT (see task megahit for details). Additionally, if the spades task fails during execution, the workflow will automatically fall back to using MEGAHIT for de novo assembly.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
MetaviralSPAdes Technical Details
| Links | |
|---|---|
| Task | task_spades.wdl |
| Software Source Code | SPAdes on GitHub |
| Software Documentation | SPAdes Manual |
| Original Publication(s) | TheiaProk: SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing TheiaViral: MetaviralSPAdes: assembly of viruses from metagenomic data |
megahit
The MEGAHIT assembler is a fast and memory-efficient de novo assembler that can handle large datasets. While optimized for metagenomics, MEGAHIT also performs well on single-genome assemblies, making it a versatile choice for various assembly tasks.
MEGAHIT uses a multiple k-mer strategy that can be beneficial for assembling genomes with varying coverage levels, which is common in metagenomic samples. It constructs succinct de Bruijn graphs to efficiently represent the assembly process, allowing it to handle large and complex datasets with reduced memory usage.
This task is optional and turned off by default. It can be enabled by setting the call_metaviralspades parameter to false, and it is also used automatically as a fallback if the spades task fails during execution (see task spades for more details).
Non-deterministic output(s)
This task may yield non-deterministic outputs.
MEGAHIT Technical Details
| Links | |
|---|---|
| Task | task_megahit.wdl |
| Software Source Code | MEGAHIT on GitHub |
| Software Documentation | MEGAHIT on GitHub |
| Original Publication(s) | MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph |
skani
The skani task is used to identify and select the reference genome most closely related to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 viral genomes. This database was constructed with the following methodology:

1. Extracting all complete NCBI viral genomes, excluding RefSeq accessions (to avoid redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae). Some complete gene accessions, rather than complete genomes, are included because the NCBI datasets completeness parameters are susceptible to metadata errors.
2. Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA.
3. Adding one SARS-CoV-2 genome for each major Pangolin lineage.
Skani Technical Details
| Links | |
|---|---|
| Task | task_skani.wdl |
| Software Source Code | Skani on GitHub |
| Software Documentation | Skani Documentation |
| Original Publication(s) | Fast and robust metagenomic sequence comparison through sparse chaining with skani |
Reference Mapping
bwa
The bwa task is a wrapper for the BWA alignment tool. It utilizes the BWA-MEM algorithm to map cleaned reads to the reference genome, either selected by the skani task or provided by the user input reference_fasta. This creates a BAM file which is then sorted using the command samtools sort.
BWA Technical Details
| Links | |
|---|---|
| Task | task_bwa.wdl |
| Software Source Code | BWA on GitHub |
| Software Documentation | BWA Documentation |
| Original Publication(s) | Fast and accurate short read alignment with Burrows-Wheeler transform |
read_mapping_stats: Mapping Statistics
The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
read_mapping_stats Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Variant Calling and Consensus Generation
ivar_variants: Variant Calling
iVar uses the outputs of samtools mpileup to call single nucleotide variants (SNVs) and insertions/deletions (indels). Several key parameters can be set to determine the stringency of variant calling, including minimum quality, minimum allele frequency, and minimum depth.
This task returns a VCF file containing all called variants, the number of detected variants, and the proportion of those variants with allele frequencies between 0.6 and 0.9 (also known as intermediate variants).
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
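Using the default thresholds above, the variant-calling step can be sketched as follows. Applying the mapping-quality filter at the mpileup step (-q) is an assumption, and the exact task command may differ.

```bash
# Hedged sketch of iVar variant calling with the default thresholds listed above
samtools mpileup -aa -A -d 0 -B -Q 0 -q 20 sample.sorted.bam | \
  ivar variants -p sample -r reference.fasta -m 10 -t 0.6
# sample.tsv lists the called variants and is subsequently converted to VCF
```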
iVar Technical Details
| Links | |
|---|---|
| Task | task_ivar_variant_call.wdl |
| Software Source Code | Ivar on GitHub |
| Software Documentation | Ivar Documentation |
| Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
ivar_consensus: Consensus Assembly
iVar's consensus tool generates a reference-based consensus assembly. Several parameters can be set that determine the stringency of the consensus assembly, including minimum quality, minimum allele frequency, and minimum depth.
This task handles segmented viruses by iteratively executing iVar on a contig-by-contig basis and concatenating the resulting consensus contigs.
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
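A minimal sketch of the consensus step with the default thresholds above is shown below; as with variant calling, the mpileup options are assumptions and may differ from the exact task command.

```bash
# Hedged sketch of iVar consensus generation; positions below the depth threshold become "N"
samtools mpileup -aa -A -d 0 -B -Q 0 -q 20 sample.sorted.bam | \
  ivar consensus -p sample.consensus -m 10 -t 0.6
# For segmented references, this is run per contig and the resulting FASTAs are concatenated
```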
iVar Technical Details
| Links | |
|---|---|
| Task | task_ivar_consensus.wdl |
| Software Source Code | Ivar on GitHub |
| Software Documentation | Ivar Documentation |
| Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
Assembly Evaluation and Consensus Quality Control
quast_denovo
QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without requiring a reference. Reported metrics include the number of contigs, the length of the largest contig, and N50.
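A reference-free QUAST run on the de novo assembly can be sketched as below; the output directory name is illustrative.

```bash
# Minimal sketch of a reference-free QUAST evaluation of the de novo assembly
quast.py assembly_denovo.fasta -o quast_denovo
# quast_denovo/report.tsv contains the contig count, largest contig, N50, and related metrics
```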
QUAST Technical Details
| Links | |
|---|---|
| Task | task_quast.wdl |
| Software Source Code | QUAST on GitHub |
| Software Documentation | QUAST Manual on SourceForge |
| Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
checkv_denovo & checkv_consensus
CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identifying host contamination in integrated proviruses, estimating completeness of genome fragments, and identifying closed genomes.
By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports "weighted_contamination" and "weighted_completeness", which are percentages averaged across the total assembly and weighted by contig length.
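The sketch below illustrates a CheckV run and a contig-length-weighted completeness calculation consistent with the description above. The database path is a placeholder, and the column positions in the awk command are assumptions that should be confirmed against the quality_summary.tsv header.

```bash
# Hedged sketch: run CheckV, then derive a contig-length-weighted completeness from its summary
checkv end_to_end assembly_denovo.fasta checkv_denovo -d "$CHECKVDB" -t 4   # $CHECKVDB path is a placeholder
# Column positions are assumptions: $2 = contig_length, $10 = completeness in quality_summary.tsv
awk -F'\t' 'NR > 1 && $10 != "NA" { sum += $2 * $10; len += $2 }
            END { if (len) printf "weighted_completeness: %.2f\n", sum / len }' \
  checkv_denovo/quality_summary.tsv
```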
CheckV Technical Details
| Links | |
|---|---|
| Task | task_checkv.wdl |
| Software Source Code | CheckV on Bitbucket |
| Software Documentation | CheckV Documentation |
| Original Publication(s) | CheckV assesses the quality and completeness of metagenome-assembled viral genomes |
consensus_qc: Assembly Statistics
The consensus_qc task generates a summary of genomic statistics from a consensus genome, including the total number of bases, the number of "N" bases, the number of degenerate bases, and an estimate of the percent coverage of the reference genome.
consensus_qc Technical Details
| Links | |
|---|---|
| Task | task_consensus_qc.wdl |
Exporting Results to Taxon-Specific Tables
Taxon Tables: Copy outputs to new data tables based on taxonomic assignment (optional)
This task is incompatible with command-line execution of TheiaViral_Panel because it is designed specifically for Terra. Do not activate this task if you are a command-line user.
Activate this task by providing a value for the output_taxon_table input variable. If provided, the user must also provide values to the terra_project and terra_workspace optional input variables.
The taxon_tables module will copy sample data to a different data table based on the taxonomic assignment. For example, if an influenza sample is analyzed, the module will copy the sample data to a new table for influenza samples or add the sample data to an existing table.
Formatting the output_taxon_table file
The output_taxon_table file must be uploaded to a Google Storage bucket that is accessible by Terra, and it should be in tab-delimited format with a header. The viral taxon name is listed in the leftmost column, and the name of the data table to copy samples of that taxon to is listed in the rightmost column, as in the example table and commands below.
| taxon | taxon_table |
|---|---|
| influenza | influenza_panel_specimen |
| coronavirus | coronavirus_panel_specimen |
| human_immunodeficiency_virus | hiv_panel_specimen |
| monkeypox_virus | monkeypox_panel_specimen |
There are no output columns for the taxon table task. The only output is that additional data tables will appear in the Terra workspace for samples matching a taxon in the taxon table file.
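The commands below sketch one way to create and upload such a file from the command line. The bucket path is a placeholder, and the taxon/table names simply mirror the example table above.

```bash
# Illustrative creation and upload of the tab-delimited output_taxon_table file
printf "taxon\ttaxon_table\n"                       >  taxon_tables.tsv
printf "influenza\tinfluenza_panel_specimen\n"      >> taxon_tables.tsv
printf "coronavirus\tcoronavirus_panel_specimen\n"  >> taxon_tables.tsv
gsutil cp taxon_tables.tsv gs://your-terra-workspace-bucket/taxon_tables.tsv   # bucket path is a placeholder
```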
export_taxon_table Technical Details
| Links | |
|---|---|
| Task | task_export_taxon_table.wdl |
Taxa-Specific Tasks¶
The TheiaViral workflows activate taxa-specific sub-workflows after relevant taxa are identified. These characterization modules are activated by populating taxon with an exact match to a compatible taxon. We recommend using the taxon ID integer because it is simpler to match exactly. Compatible taxon codes are listed in parentheses below (case-insensitive):
- SARS-CoV-2 ("2697049", "severe acute respiratory syndrome coronavirus 2")
- Monkeypox virus ("10244", "mpox", "monkeypox virus")
- Human Immunodeficiency Virus 1 ("11676", "human immunodeficiency virus 1")
- Human Immunodeficiency Virus 2 ("11709", "human immunodeficiency virus 2")
- West Nile Virus ("11082", "west nile virus")
- Influenza A ("11320", "influenza a virus")
- Influenza B ("11520", "influenza b virus")
- RSV-A ("208893", "human respiratory syncytial virus a")
- RSV-B ("208895", "human respiratory syncytial virus b")
- Measles ("11234", "measles")
- Rabies ("11292", "lyssavirus rabies")
- Mumps ("2560602", "mumps virus", "Mumps orthorubulavirus")
- Rubella ("11041", "rubella virus")
Outputs¶
TheiaViral_Illumina_PE Outputs
| Variable | Type | Description |
|---|---|---|
| abricate_flu_database | String | ABRicate database used for analysis |
| abricate_flu_results | File | File containing all results from ABRicate |
| abricate_flu_subtype | String | Flu subtype as determined by ABRicate |
| abricate_flu_type | String | Flu type as determined by ABRicate |
| abricate_flu_version | String | Version of ABRicate |
| assembly_consensus_fasta | File | Final consensus assembly in FASTA format |
| assembly_denovo_fasta | File | De novo assembly in FASTA format |
| auspice_json | File | Auspice-compatible JSON output generated from Nextclade analysis that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_flu_ha | File | Auspice-compatible JSON output generated from Nextclade analysis on the Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_flu_na | File | Auspice-compatible JSON output generated from Nextclade analysis on the Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_rabies | File | Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| bbduk_docker | String | The Docker image for bbduk, which was used to remove the adapters from the sequences |
| bbduk_read1_clean | File | Clean forward reads after BBDuk processing |
| bbduk_read2_clean | File | Clean reverse reads after BBDuk processing |
| bwa_read1_aligned | File | Forward reads aligned to reference |
| bwa_read1_unaligned | File | Forward reads not aligned to reference |
| bwa_read2_aligned | File | Reverse reads aligned to reference |
| bwa_read2_unaligned | File | Reverse reads not aligned to reference |
| bwa_samtools_version | String | Version of samtools used by BWA |
| bwa_sorted_bai | File | Sorted BAM index file of reads aligned to reference |
| bwa_sorted_bam | File | Sorted BAM file of reads aligned to reference |
| bwa_sorted_bam_unaligned | File | A BAM file that only contains reads that did not align to the reference |
| bwa_sorted_bam_unaligned_bai | File | Index companion file to a BAM file that only contains reads that did not align to the reference |
| bwa_version | String | Version of BWA software used |
| checkv_consensus_contamination | File | Contamination estimate for consensus assembly from CheckV |
| checkv_consensus_summary | File | Summary report from CheckV for consensus assembly |
| checkv_consensus_total_genes | Int | Number of genes detected in consensus assembly by CheckV |
| checkv_consensus_version | String | Version of CheckV used for consensus assembly |
| checkv_consensus_weighted_completeness | Float | Weighted completeness score for consensus assembly from CheckV |
| checkv_consensus_weighted_contamination | Float | Weighted contamination score for consensus assembly from CheckV |
| checkv_denovo_contamination | File | Contamination estimate for de novo assembly from CheckV |
| checkv_denovo_summary | File | Summary report from CheckV for de novo assembly |
| checkv_denovo_total_genes | Int | Number of genes detected in de novo assembly by CheckV |
| checkv_denovo_version | String | Version of CheckV used for de novo assembly |
| checkv_denovo_weighted_completeness | Float | Weighted completeness score for de novo assembly from CheckV |
| checkv_denovo_weighted_contamination | Float | Weighted contamination score for de novo assembly from CheckV |
| consensus_n_variant_min_depth | Int | Minimum read depth to call variants for iVar consensus and iVar variants. Also represents the minimum consensus support threshold used by IRMA with Illumina Influenza data. |
| consensus_qc_assembly_length_unambiguous | Int | Length of consensus assembly excluding ambiguous bases |
| consensus_qc_number_Degenerate | Int | Number of degenerate bases in consensus assembly |
| consensus_qc_number_N | Int | Number of N bases in consensus assembly |
| consensus_qc_number_Total | Int | Total number of bases in consensus assembly |
| consensus_qc_percent_reference_coverage | Float | Percent of reference genome covered in consensus assembly |
| datasets_genome_length_docker | String | The Docker container used for the task |
| datasets_genome_length_version | String | The version of NCBI Datasets used for analysis |
| dehost_wf_dehost_read1 | File | Reads that did not map to host |
| dehost_wf_dehost_read2 | File | Paired-reads that did not map to host |
| dehost_wf_host_accession | String | Host genome accession |
| dehost_wf_host_fasta | File | Host genome FASTA file |
| dehost_wf_host_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
| dehost_wf_host_mapped_bai | File | Indexed bam file of the reads aligned to the host reference |
| dehost_wf_host_mapped_bam | File | Sorted BAM file containing the alignments of reads to the host reference genome |
| dehost_wf_host_mapping_cov_hist | File | Coverage histogram from host read mapping |
| dehost_wf_host_mapping_coverage | Float | Average coverage from host read mapping |
| dehost_wf_host_mapping_mean_depth | Float | Average depth from host read mapping |
| dehost_wf_host_mapping_metrics | File | File of mapping metrics |
| dehost_wf_host_mapping_stats | File | File of mapping statistics |
| dehost_wf_host_percent_mapped_reads | Float | Percentage of reads mapped to host reference genome |
| ete4_docker | String | Docker image used for ETE4 taxonomy parsing |
| ete4_version | String | The version of ETE4 used |
| fastp_html_report | File | The HTML report made with fastp |
| fastp_version | String | The version of fastp used |
| fastq_scan_clean1_json | File | The JSON file output from fastq-scan containing summary stats about clean forward read quality and length |
| fastq_scan_clean2_json | File | The JSON file output from fastq-scan containing summary stats about clean reverse read quality and length |
| fastq_scan_clean_pairs | String | Number of read pairs after cleaning |
| fastq_scan_docker | String | The Docker image of fastq_scan |
| fastq_scan_num_reads_clean1 | Int | The number of forward reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean2 | Int | The number of reverse reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_raw1 | Int | The number of input forward reads as calculated by fastq_scan |
| fastq_scan_num_reads_raw2 | Int | The number of input reverse reads as calculated by fastq_scan |
| fastq_scan_raw1_json | File | The JSON file output from fastq-scan containing summary stats about raw forward read quality and length |
| fastq_scan_raw2_json | File | The JSON file output from fastq-scan containing summary stats about raw reverse read quality and length |
| fastq_scan_raw_pairs | String | Number of raw read pairs |
| fastq_scan_version | String | The version of fastq_scan |
| genoflu_all_segments | String | The genotypes for each individual flu segment |
| genoflu_genotype | String | The genotype of the whole genome, based on the individual segment types |
| genoflu_output_tsv | File | The output file from GenoFLU |
| genoflu_version | String | The version of GenoFLU used |
| irma_docker | String | Docker image used to run IRMA |
| irma_subtype | String | Flu subtype as determined by IRMA |
| irma_subtype_notes | String | Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column" |
| irma_type | String | Flu type as determined by IRMA |
| irma_version | String | Version of IRMA used |
| ivar_tsv | File | Variant descriptor file generated by iVar variants |
| ivar_variant_proportion_intermediate | String | The proportion of variants of intermediate frequency |
| ivar_variant_version | String | Version of iVar for running the iVar variants command |
| ivar_vcf | File | iVar tsv output converted to VCF format |
| ivar_version_consensus | String | Version of iVar for running the iVar consensus command |
| kraken2_extracted_read1 | File | Forward reads extracted by taxonomic classification |
| kraken2_extracted_read2 | File | Reverse reads extracted by taxonomic classification |
| kraken_database | String | Database used for Kraken classification |
| kraken_docker | String | Docker image used for Kraken |
| kraken_report | String | Full Kraken report |
| kraken_version | String | Version of Kraken software used |
| megahit_docker | String | Docker image used for MEGAHIT |
| megahit_status | String | Status of the MEGAHIT assembly |
| megahit_version | String | Version of MEGAHIT used |
| metaviralspades_docker | String | Docker image used for MetaviralSPAdes |
| metaviralspades_status | String | Status of MetaviralSPAdes assembly |
| metaviralspades_version | String | Version of MetaviralSPAdes used |
| morgana_magic_organism | String | Standardized organism name used for characterization |
| ncbi_read_extraction_rank | String | Read extraction rank used |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| ncbi_taxon_id | String | NCBI taxonomy ID of inputted organism following rank extraction |
| ncbi_taxon_name | String | NCBI taxonomy name of inputted taxon following rank extraction |
| nextclade_aa_dels | String | Amino-acid deletions as detected by Nextclade. Will be blank for Flu |
| nextclade_aa_dels_flu_ha | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; includes deletions for the HA segment |
| nextclade_aa_dels_flu_na | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; includes deletions for the NA segment |
| nextclade_aa_dels_rabies | String | Amino-acid deletions as detected by Nextclade. Specific to Rabies |
| nextclade_aa_subs | String | Amino-acid substitutions as detected by Nextclade. Will be blank for Flu |
| nextclade_aa_subs_flu_ha | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; includes substitutions for the HA segment |
| nextclade_aa_subs_flu_na | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; includes substitutions for the NA segment |
| nextclade_aa_subs_rabies | String | Amino-acid substitutions as detected by Nextclade. Specific to Rabies |
| nextclade_clade | String | Nextclade clade designation, will be blank for Flu. |
| nextclade_clade_flu_ha | String | Nextclade clade designation, specific to Flu HA segment |
| nextclade_clade_flu_na | String | Nextclade clade designation, specific to Flu NA segment |
| nextclade_clade_rabies | String | Nextclade clade designation, specific to Rabies |
| nextclade_docker | String | Docker image used to run Nextclade |
| nextclade_ds_tag | String | Dataset tag used to run Nextclade. Will be blank for Flu |
| nextclade_ds_tag_flu_ha | String | Dataset tag used to run Nextclade, specific to Flu HA segment |
| nextclade_ds_tag_flu_na | String | Dataset tag used to run Nextclade, specific to Flu NA segment |
| nextclade_json | File | Nextclade output in JSON file format. Will be blank for Flu |
| nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment |
| nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment |
| nextclade_json_rabies | File | Nextclade output in JSON file format, specific to Rabies |
| nextclade_lineage | String | Nextclade lineage designation |
| nextclade_lineage_rabies | String | Nextclade lineage designation, specific to Rabies |
| nextclade_qc | String | QC metric as determined by Nextclade. Will be blank for Flu |
| nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment |
| nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment |
| nextclade_qc_rabies | String | QC metric as determined by Nextclade, specific to Rabies |
| nextclade_tsv | File | Nextclade output in TSV file format. Will be blank for Flu |
| nextclade_tsv_flu_ha | File | Nextclade output in TSV file format, specific to Flu HA segment |
| nextclade_tsv_flu_na | File | Nextclade output in TSV file format, specific to Flu NA segment |
| nextclade_tsv_rabies | File | Nextclade output in TSV file format, specific to Rabies |
| nextclade_version | String | The version of Nextclade software used |
| pango_lineage | String | Pango lineage as determined by Pangolin |
| pango_lineage_expanded | String | Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1" |
| pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
| pangolin_assignment_version | String | The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
| pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
| pangolin_docker | String | Docker image used to run Pangolin |
| pangolin_notes | String | Lineage notes as determined by Pangolin |
| pangolin_versions | String | All Pangolin software and database versions |
| quasitools_coverage_file | File | The coverage report created by Quasitools HyDRA |
| quasitools_date | String | Date of Quasitools analysis |
| quasitools_dr_report | File | Drug resistance report created by Quasitools HyDRA |
| quasitools_hydra_vcf | File | The VCF created by Quasitools HyDRA |
| quasitools_mutations_report | File | The mutation report created by Quasitools HyDRA |
| quasitools_version | String | Version of Quasitools used |
| quast_denovo_docker | String | Docker image used for QUAST |
| quast_denovo_gc_percent | Float | GC percentage of de novo assembly from QUAST |
| quast_denovo_genome_length | Int | Genome length of de novo assembly from QUAST |
| quast_denovo_largest_contig | Int | Size of largest contig in de novo assembly from QUAST |
| quast_denovo_n50_value | Int | N50 value of de novo assembly from QUAST |
| quast_denovo_number_contigs | Int | Number of contigs in de novo assembly from QUAST |
| quast_denovo_report | File | QUAST report for de novo assembly |
| quast_denovo_uncalled_bases | Float | Number of uncalled bases in de novo assembly from QUAST |
| quast_denovo_version | String | Version of QUAST used |
| read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
| read2_dehosted | File | The dehosted reverse reads file; suggested read file for SRA submission |
| read_mapping_cov_hist | File | Coverage histogram from read mapping |
| read_mapping_cov_stats | File | Coverage statistics from read mapping |
| read_mapping_coverage | Float | Average coverage from read mapping |
| read_mapping_date | String | Date of read mapping analysis |
| read_mapping_depth | Float | Average depth from read mapping |
| read_mapping_flagstat | File | Flagstat file from read mapping |
| read_mapping_meanbaseq | Float | Mean base quality from read mapping |
| read_mapping_meanmapq | Float | Mean mapping quality from read mapping |
| read_mapping_percentage_mapped_reads | Float | Percentage of mapped reads |
| read_mapping_report | File | Report file from read mapping |
| read_mapping_samtools_version | String | Version of samtools used in read mapping |
| read_mapping_statistics | File | Statistics file from read mapping |
| read_screen_clean | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure |
| read_screen_clean_tsv | File | Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length |
| skani_database | String | Database used for Skani |
| skani_docker | String | Docker image used for Skani |
| skani_reference_assembly | File | Reference genome assembly |
| skani_reference_taxon | String | Reference taxon name |
| skani_report | File | Report from Skani |
| skani_status | String | Status of Skani analysis |
| skani_top_accession | String | Top accession ID from Skani |
| skani_top_ani | Float | Top ANI score from Skani |
| skani_top_query_coverage | Float | Query coverage of top match from Skani |
| skani_top_score | Float | Top score from Skani |
| skani_version | String | Version of Skani used |
| skani_warning | String | Skani warning message |
| taxon_avg_genome_length | String | Average genome length for taxon obtained from NCBI datasets summary |
| theiaviral_illumina_pe_date | String | Date of TheiaViral Illumina PE workflow run |
| theiaviral_illumina_pe_version | String | Version of TheiaViral Illumina PE workflow |
| trimmomatic_docker | String | The docker image used for the trimmomatic module in this workflow |
| trimmomatic_version | String | The version of Trimmomatic used |
| vadr_alerts_list | File | A file containing all of the fatal alerts as determined by VADR |
| vadr_all_outputs_tar_gz | File | A .tar.gz file (gzip-compressed tar archive file) containing all outputs from the VADR command v-annotate.pl. This file must be uncompressed & extracted to see the many files within. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description of all files present within the archive. Useful when deeply investigating a sample's genome & annotations. |
| vadr_classification_summary_file | File | Per-sequence tabular classification file. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#explanation-of-sqc-suffixed-output-files for more complete description. |
| vadr_docker | String | Docker image used to run VADR |
| vadr_fastas_zip_archive | File | Zip archive containing all fasta files created during VADR analysis |
| vadr_feature_tbl_fail | File | 5 column feature table output for failing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description. |
| vadr_feature_tbl_pass | File | 5 column feature table output for passing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description. |
| vadr_num_alerts | String | Number of fatal alerts as determined by VADR |
TheiaViral_ONT Outputs
| Variable | Type | Description |
|---|---|---|
| abricate_flu_database | String | ABRicate database used for analysis |
| abricate_flu_results | File | File containing all results from ABRicate |
| abricate_flu_subtype | String | Flu subtype as determined by ABRicate |
| abricate_flu_type | String | Flu type as determined by ABRicate |
| abricate_flu_version | String | Version of ABRicate |
| assembly_consensus_fasta | File | Final consensus assembly in FASTA format |
| assembly_denovo_fasta | File | De novo assembly in FASTA format |
| assembly_to_ref_bai | File | BAM index file for reads aligned to reference |
| assembly_to_ref_bam | File | BAM file of reads aligned to reference |
| auspice_json | File | Auspice-compatible JSON output generated from Nextclade analysis that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_flu_ha | File | Auspice-compatible JSON output generated from Nextclade analysis on the Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_flu_na | File | Auspice-compatible JSON output generated from Nextclade analysis on the Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_rabies | File | Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| bcftools_docker | String | Docker image used for bcftools |
| bcftools_filtered_vcf | File | Filtered variant calls in VCF format from bcftools |
| bcftools_version | String | Version of bcftools used |
| checkv_consensus_contamination | File | Contamination estimate for consensus assembly from CheckV |
| checkv_consensus_summary | File | Summary report from CheckV for consensus assembly |
| checkv_consensus_total_genes | Int | Number of genes detected in consensus assembly by CheckV |
| checkv_consensus_version | String | Version of CheckV used for consensus assembly |
| checkv_consensus_weighted_completeness | Float | Weighted completeness score for consensus assembly from CheckV |
| checkv_consensus_weighted_contamination | Float | Weighted contamination score for consensus assembly from CheckV |
| checkv_denovo_contamination | File | Contamination estimate for de novo assembly from CheckV |
| checkv_denovo_summary | File | Summary report from CheckV for de novo assembly |
| checkv_denovo_total_genes | Int | Number of genes detected in de novo assembly by CheckV |
| checkv_denovo_version | String | Version of CheckV used for de novo assembly |
| checkv_denovo_weighted_completeness | Float | Weighted completeness score for de novo assembly from CheckV |
| checkv_denovo_weighted_contamination | Float | Weighted contamination score for de novo assembly from CheckV |
| clair3_docker | String | Docker image used for Clair3 |
| clair3_gvcf | File | Genomic VCF file from Clair3 |
| clair3_model | String | Model used for Clair3 variant calling |
| clair3_vcf | File | Variant calls in VCF format from Clair3 |
| clair3_version | String | Clair3 Version being used |
| consensus_qc_assembly_length_unambiguous | Int | Length of consensus assembly excluding ambiguous bases |
| consensus_qc_number_Degenerate | Int | Number of degenerate bases in consensus assembly |
| consensus_qc_number_N | Int | Number of N bases in consensus assembly |
| consensus_qc_number_Total | Int | Total number of bases in consensus assembly |
| consensus_qc_percent_reference_coverage | Float | Percent of reference genome covered in consensus assembly |
| datasets_genome_length_docker | String | The Docker container used for the task |
| datasets_genome_length_version | String | The version of NCBI Datasets used for analysis |
| dehost_wf_dehost_read1 | File | Reads that did not map to host |
| dehost_wf_host_accession | String | Host genome accession |
| dehost_wf_host_fasta | File | Host genome FASTA file |
| dehost_wf_host_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
| dehost_wf_host_mapped_bai | File | Indexed bam file of the reads aligned to the host reference |
| dehost_wf_host_mapped_bam | File | Sorted BAM file containing the alignments of reads to the host reference genome |
| dehost_wf_host_mapping_cov_hist | File | Coverage histogram from host read mapping |
| dehost_wf_host_mapping_coverage | Float | Average coverage from host read mapping |
| dehost_wf_host_mapping_mean_depth | Float | Average depth from host read mapping |
| dehost_wf_host_mapping_metrics | File | File of mapping metrics |
| dehost_wf_host_mapping_stats | File | File of mapping statistics |
| dehost_wf_host_percent_mapped_reads | Float | Percentage of reads mapped to host reference genome |
| ete4_docker | String | Docker image used for ETE4 taxonomy parsing |
| ete4_version | String | The version of ETE4 used |
| fasta_utilities_fai | File | FASTA index file |
| fasta_utilities_samtools_docker | String | Docker image used for samtools in fasta utilities |
| fasta_utilities_samtools_version | String | Version of samtools used in fasta utilities |
| flye_denovo_docker | String | Docker image used for Flye |
| flye_denovo_info | File | Information file from Flye assembly |
| flye_denovo_status | String | Status of Flye assembly |
| flye_denovo_version | String | Version of Flye used |
| genoflu_all_segments | String | The genotypes for each individual flu segment |
| genoflu_genotype | String | The genotype of the whole genome, based on the individual segment types |
| genoflu_output_tsv | File | The output file from GenoFLU |
| genoflu_version | String | The version of GenoFLU used |
| irma_docker | String | Docker image used to run IRMA |
| irma_subtype | String | Flu subtype as determined by IRMA |
| irma_subtype_notes | String | Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column" |
| irma_type | String | Flu type as determined by IRMA |
| irma_version | String | Version of IRMA used |
| mask_low_coverage_all_coverage_bed | File | BED file showing all coverage regions |
| mask_low_coverage_bed | File | BED file showing masked low coverage regions |
| mask_low_coverage_bedtools_docker | String | Docker image used for bedtools in masking |
| mask_low_coverage_bedtools_version | String | Version of bedtools used in masking |
| mask_low_coverage_reference_fasta | File | Reference FASTA with low coverage regions masked |
| metabuli_classified | File | Classified reads from Metabuli |
| metabuli_database | String | Database used for Metabuli |
| metabuli_docker | String | Docker image used for Metabuli |
| metabuli_krona_report | File | Krona visualization report from Metabuli |
| metabuli_read1_extract | File | Extracted reads from Metabuli |
| metabuli_report | File | Classification report from Metabuli |
| metabuli_version | String | Version of Metabuli used |
| minimap2_docker | String | The Docker image of minimap2 |
| minimap2_out | File | Output file from Minimap2 alignment |
| minimap2_version | String | The version of minimap2 |
| morgana_magic_organism | String | Standardized organism name used for characterization |
| nanoplot_html_clean | File | An HTML report describing the clean reads |
| nanoplot_html_raw | File | An HTML report describing the raw reads |
| nanoplot_num_reads_clean1 | Int | Number of clean reads |
| nanoplot_num_reads_raw1 | Int | Number of raw reads |
| nanoplot_r1_mean_q_clean | Float | Mean quality score of clean forward reads |
| nanoplot_r1_mean_q_raw | Float | Mean quality score of raw forward reads |
| nanoplot_r1_mean_readlength_clean | Float | Mean read length of clean forward reads |
| nanoplot_r1_mean_readlength_raw | Float | Mean read length of raw forward reads |
| nanoplot_r1_median_q_clean | Float | Median quality score of clean forward reads |
| nanoplot_r1_median_q_raw | Float | Median quality score of raw forward reads |
| nanoplot_r1_median_readlength_clean | Float | Median read length of clean forward reads |
| nanoplot_r1_median_readlength_raw | Float | Median read length of raw forward reads |
| nanoplot_r1_n50_clean | Float | N50 of clean forward reads |
| nanoplot_r1_n50_raw | Float | N50 of raw forward reads |
| nanoplot_r1_stdev_readlength_clean | Float | Standard deviation read length of clean forward reads |
| nanoplot_r1_stdev_readlength_raw | Float | Standard deviation read length of raw forward reads |
| nanoplot_tsv_clean | File | A TSV report describing the clean reads |
| nanoplot_tsv_raw | File | A TSV report describing the raw reads |
| nanoq_filtered_read1 | File | Filtered reads from NanoQ |
| nanoq_version | String | Version of nanoq used in analysis |
| ncbi_read_extraction_rank | String | Read extraction rank used |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| ncbi_scrub_read1_dehosted | File | Dehosted reads after NCBI scrub |
| ncbi_taxon_id | String | NCBI taxonomy ID of inputted organism following rank extraction |
| ncbi_taxon_name | String | NCBI taxonomy name of inputted taxon following rank extraction |
| nextclade_aa_dels | String | Amino-acid deletions as detected by Nextclade. Will be blank for Flu |
| nextclade_aa_dels_flu_ha | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; includes deletions for the HA segment |
| nextclade_aa_dels_flu_na | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; includes deletions for the NA segment |
| nextclade_aa_dels_rabies | String | Amino-acid deletions as detected by Nextclade. Specific to Rabies |
| nextclade_aa_subs | String | Amino-acid substitutions as detected by Nextclade. Will be blank for Flu |
| nextclade_aa_subs_flu_ha | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; includes substitutions for the HA segment |
| nextclade_aa_subs_flu_na | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; includes substitutions for the NA segment |
| nextclade_aa_subs_rabies | String | Amino-acid substitutions as detected by Nextclade. Specific to Rabies |
| nextclade_clade | String | Nextclade clade designation, will be blank for Flu. |
| nextclade_clade_flu_ha | String | Nextclade clade designation, specific to Flu HA segment |
| nextclade_clade_flu_na | String | Nextclade clade designation, specific to Flu NA segment |
| nextclade_clade_rabies | String | Nextclade clade designation, specific to Rabies |
| nextclade_docker | String | Docker image used to run Nextclade |
| nextclade_ds_tag | String | Dataset tag used to run Nextclade. Will be blank for Flu |
| nextclade_ds_tag_flu_ha | String | Dataset tag used to run Nextclade, specific to Flu HA segment |
| nextclade_ds_tag_flu_na | String | Dataset tag used to run Nextclade, specific to Flu NA segment |
| nextclade_json | File | Nextclade output in JSON file format. Will be blank for Flu |
| nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment |
| nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment |
| nextclade_json_rabies | File | Nextclade output in JSON file format, specific to Rabies |
| nextclade_lineage | String | Nextclade lineage designation |
| nextclade_lineage_rabies | String | Nextclade lineage designation, specific to Rabies |
| nextclade_qc | String | QC metric as determined by Nextclade. Will be blank for Flu |
| nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment |
| nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment |
| nextclade_qc_rabies | String | QC metric as determined by Nextclade, specific to Rabies |
| nextclade_tsv | File | Nextclade output in TSV file format. Will be blank for Flu |
| nextclade_tsv_flu_ha | File | Nextclade output in TSV file format, specific to Flu HA segment |
| nextclade_tsv_flu_na | File | Nextclade output in TSV file format, specific to Flu NA segment |
| nextclade_tsv_rabies | File | Nextclade output in TSV file format, specific to Rabies |
| nextclade_version | String | The version of Nextclade software used |
| pango_lineage | String | Pango lineage as determined by Pangolin |
| pango_lineage_expanded | String | Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1" |
| pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
| pangolin_assignment_version | String | The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
| pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
| pangolin_docker | String | Docker image used to run Pangolin |
| pangolin_notes | String | Lineage notes as determined by Pangolin |
| pangolin_versions | String | All Pangolin software and database versions |
| parse_mapping_samtools_docker | String | Docker image used for samtools in parse mapping |
| parse_mapping_samtools_version | String | Version of samtools used in parse mapping |
| porechop_trimmed_read1 | File | Trimmed reads from Porechop |
| porechop_version | String | Version of Porechop used |
| quasitools_coverage_file | File | The coverage report created by Quasitools HyDRA |
| quasitools_date | String | Date of Quasitools analysis |
| quasitools_dr_report | File | Drug resistance report created by Quasitools HyDRA |
| quasitools_hydra_vcf | File | The VCF created by Quasitools HyDRA |
| quasitools_mutations_report | File | The mutation report created by Quasitools HyDRA |
| quasitools_version | String | Version of Quasitools used |
| quast_denovo_docker | String | Docker image used for QUAST |
| quast_denovo_gc_percent | Float | GC percentage of de novo assembly from QUAST |
| quast_denovo_genome_length | Int | Genome length of de novo assembly from QUAST |
| quast_denovo_largest_contig | Int | Size of largest contig in de novo assembly from QUAST |
| quast_denovo_n50_value | Int | N50 value of de novo assembly from QUAST |
| quast_denovo_number_contigs | Int | Number of contigs in de novo assembly from QUAST |
| quast_denovo_report | File | QUAST report for de novo assembly |
| quast_denovo_uncalled_bases | Float | Number of uncalled bases in de novo assembly from QUAST |
| quast_denovo_version | String | Version of QUAST used |
| rasusa_read1_subsampled | File | Subsampled read file from Rasusa |
| rasusa_read2_subsampled | File | Subsampled read file from Rasusa (paired file) |
| rasusa_version | String | Version of RASUSA used for the analysis |
| raven_denovo_docker | String | Docker image used for Raven |
| raven_denovo_status | String | Status of Raven assembly |
| raven_denovo_version | String | Version of Raven used |
| read_mapping_cov_hist | File | Coverage histogram from read mapping |
| read_mapping_cov_stats | File | Coverage statistics from read mapping |
| read_mapping_coverage | Float | Average coverage from read mapping |
| read_mapping_date | String | Date of read mapping analysis |
| read_mapping_depth | Float | Average depth from read mapping |
| read_mapping_flagstat | File | Flagstat file from read mapping |
| read_mapping_meanbaseq | Float | Mean base quality from read mapping |
| read_mapping_meanmapq | Float | Mean mapping quality from read mapping |
| read_mapping_percentage_mapped_reads | Float | Percentage of mapped reads |
| read_mapping_report | File | Report file from read mapping |
| read_mapping_samtools_version | String | Version of samtools used in read mapping |
| read_mapping_statistics | File | Statistics file from read mapping |
| read_screen_clean | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure |
| read_screen_clean_tsv | File | Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length |
| skani_database | String | Database used for Skani |
| skani_docker | String | Docker image used for Skani |
| skani_reference_assembly | File | Reference genome assembly |
| skani_reference_taxon | String | Reference taxon name |
| skani_report | File | Report from Skani |
| skani_status | String | Status of Skani analysis |
| skani_top_accession | String | Top accession ID from Skani |
| skani_top_ani | Float | Top ANI score from Skani |
| skani_top_query_coverage | Float | Query coverage of top match from Skani |
| skani_top_score | Float | Top score from Skani |
| skani_version | String | Version of Skani used |
| skani_warning | String | Skani warning message |
| taxon_avg_genome_length | String | Average genome length for taxon obtained from NCBI datasets summary |
| theiaviral_ont_date | String | Date of TheiaViral ONT workflow run |
| theiaviral_ont_version | String | Version of TheiaViral ONT workflow |
| vadr_alerts_list | File | A file containing all of the fatal alerts as determined by VADR |
| vadr_all_outputs_tar_gz | File | A .tar.gz file (gzip-compressed tar archive file) containing all outputs from the VADR command v-annotate.pl. This file must be uncompressed & extracted to see the many files within. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description of all files present within the archive. Useful when deeply investigating a sample's genome & annotations. |
| vadr_classification_summary_file | File | Per-sequence tabular classification file. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#explanation-of-sqc-suffixed-output-files for more complete description. |
| vadr_docker | String | Docker image used to run VADR |
| vadr_fastas_zip_archive | File | Zip archive containing all fasta files created during VADR analysis |
| vadr_feature_tbl_fail | File | 5 column feature table output for failing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description. |
| vadr_feature_tbl_pass | File | 5 column feature table output for passing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description. |
| vadr_num_alerts | String | Number of fatal alerts as determined by VADR |
TheiaViral_Panel Outputs
| Variable | Type | Description |
|---|---|---|
| assembled_viruses | Int | Number of viruses assembled from sample |
| assemblies | Array[File] | Assembly files generated during the workflow |
| bbduk_docker | String | The Docker image for bbduk, which was used to remove the adapters from the sequences |
| dehost_wf_dehost_read1 | File | Reads that did not map to host |
| dehost_wf_dehost_read2 | File | Paired-reads that did not map to host |
| dehost_wf_host_accession | String | Host genome accession |
| dehost_wf_host_fasta | File | Host genome FASTA file |
| dehost_wf_host_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
| dehost_wf_host_mapped_bai | File | Indexed bam file of the reads aligned to the host reference |
| dehost_wf_host_mapped_bam | File | Sorted BAM file containing the alignments of reads to the host reference genome |
| dehost_wf_host_mapping_cov_hist | File | Coverage histogram from host read mapping |
| dehost_wf_host_mapping_coverage | Float | Average coverage from host read mapping |
| dehost_wf_host_mapping_mean_depth | Float | Average depth from host read mapping |
| dehost_wf_host_mapping_metrics | File | File of mapping metrics |
| dehost_wf_host_mapping_stats | File | File of mapping statistics |
| dehost_wf_host_percent_mapped_reads | Float | Percentage of reads mapped to host reference genome |
| fastp_html_report | File | The HTML report made with fastp |
| fastp_version | String | The version of fastp used |
| fastq_scan_clean1_json | File | The JSON file output from fastq-scan containing summary stats about clean forward read quality and length |
| fastq_scan_clean2_json | File | The JSON file output from fastq-scan containing summary stats about clean reverse read quality and length |
| fastq_scan_clean_pairs | String | Number of read pairs after cleaning |
| fastq_scan_docker | String | The Docker image of fastq_scan |
| fastq_scan_num_reads_clean1 | Int | The number of forward reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean2 | Int | The number of reverse reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_raw1 | Int | The number of input forward reads as calculated by fastq_scan |
| fastq_scan_num_reads_raw2 | Int | The number of input reverse reads as calculated by fastq_scan |
| fastq_scan_raw1_json | File | The JSON file output from fastq-scan containing summary stats about raw forward read quality and length |
| fastq_scan_raw2_json | File | The JSON file output from fastq-scan containing summary stats about raw reverse read quality and length |
| fastq_scan_raw_pairs | String | Number of raw read pairs |
| fastq_scan_version | String | The version of fastq_scan |
| fastqc_clean1_html | File | An HTML file that provides a graphical visualization of clean forward read quality from fastqc to open in an internet browser |
| fastqc_clean2_html | File | An HTML file that provides a graphical visualization of clean reverse read quality from fastqc to open in an internet browser |
| fastqc_docker | String | The Docker container used for fastqc |
| fastqc_num_reads_clean1 | Int | The number of forward reads after cleaning by fastqc |
| fastqc_num_reads_clean2 | Int | The number of reverse reads after cleaning by fastqc |
| fastqc_num_reads_clean_pairs | String | The number of read pairs after cleaning by fastqc |
| fastqc_num_reads_raw1 | Int | The number of input forward reads by fastqc before cleaning |
| fastqc_num_reads_raw2 | Int | The number of input reverse reads by fastqc before cleaning |
| fastqc_num_reads_raw_pairs | String | The number of input read pairs by fastqc before cleaning |
| fastqc_raw1_html | File | An HTML file that provides a graphical visualization of raw forward read quality from fastqc to open in an internet browser |
| fastqc_raw2_html | File | An HTML file that provides a graphical visualization of raw reverse read quality from fastqc to open in an internet browser |
| fastqc_version | String | Version of fastqc software used |
| identified_organisms | Array[String] | List of organisms extracted and identified from a panel-level sample |
| kraken2_classified_report | File | Standard Kraken2 output report. TXT filetype, but can be opened in Excel as a TSV file |
| kraken2_database | String | Kraken2 database used for the taxonomic assignment |
| kraken2_docker | String | Docker image used to run kraken2 |
| kraken2_report_clean | File | The full Kraken report for the sample's clean reads |
| kraken2_report_raw | File | The full Kraken report for the sample's raw reads |
| kraken2_version | String | The version of kraken2 used |
| kraken_percent_human_clean | Float | Percent of human read data detected using the Kraken2 software after host removal for cleaned reads |
| kraken_percent_human_raw | Float | Percent of human read data detected using the Kraken2 software after host removal for raw reads |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| read1_clean | File | Forward read file after quality trimming and adapter removal |
| read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
| read2_clean | File | Reverse read file after quality trimming and adapter removal |
| read2_dehosted | File | The dehosted reverse reads file; suggested read file for SRA submission |
| theiaviral_panel_analysis_date | String | The date the analysis was run |
| theiaviral_panel_version | String | The version of the workflow that was run |
| trimmomatic_docker | String | The docker image used for the trimmomatic module in this workflow |
| trimmomatic_version | String | The version of Trimmomatic used |
What are the differences between the de novo and consensus assemblies?
De novo genomes are generated from scratch without a reference to guide read assembly, while consensus genomes are generated by mapping reads to a reference and replacing reference positions with identified variants (structural and nucleotide). De novo assemblies are thus not biased by requiring reads to map to a reference, though they may be more fragmented. Consensus assembly can generate more robust assemblies from lower-coverage samples if the reference genome is of sufficient quality and sufficiently closely related to the input sequence, though it may not perform well in instances of significant structural variation. TheiaViral uses de novo assemblies as an intermediate to acquire the best reference genome for consensus assembly.
We generally recommend that TheiaViral users focus on the consensus assembly as the desired assembly output. While we chose the best de novo assemblers for TheiaViral based on internal benchmarking, the consensus assembly will often be higher quality than the de novo assembly. However, the de novo assembly can approach or exceed consensus quality if the read inputs largely comprise one virus, have high depth of coverage, and/or are derived from a virus with high potential for recombination. TheiaViral does conduct assembly contiguity and viral completeness quality control for de novo assemblies, so a de novo assembly that meets quality control standards can certainly be used for downstream analysis.
How is de novo assembly quality evaluated?
De novo assembly quality evaluation focuses on the completeness and contiguity of the genome. While a ground truth genome does not truly exist for quality comparison, reference genome selection can help contextualize quality if the reference is sufficiently similar to the de novo assembly. TheiaViral uses QUAST to acquire basic contiguity statistics and CheckV to assess viral genome completeness and contamination. Additionally, the reference selection software, Skani, can provide a quantitative comparison between the de novo assembly and the best reference genome.
Completeness and contamination
- checkv_denovo_summary: The summary file reports CheckV results on a contig-by-contig basis. Ideally, completeness is 100% for a single contig, or 100% for each segment. If there are multiple extraneous contigs in the assembly, ideally one of them is 100% complete. The same principles apply to contamination, though it is ideally 0%.
- checkv_denovo_total_genes: The total number of genes is ideally the number expected for the input viral taxon. CheckV can sometimes fail to recover all the genes from a complete genome, so other statistics should be weighted more heavily in quality evaluation.
- checkv_denovo_weighted_completeness: The weighted completeness is ideally 100%.
- checkv_denovo_weighted_contamination: The weighted contamination is ideally 0%.
Length and contiguity
- quast_denovo_genome_length: The de novo genome length is ideally the same as the expected genome length of the focal virus.
- quast_denovo_largest_contig: The largest contig is ideally the size of the genome, or the size of the largest expected segment. If there are multiple contigs and the largest contig is the ideal size, then the smaller contigs may be discarded based on the CheckV completeness for the largest contig (see CheckV outputs).
- quast_denovo_n50_value: The N50 is an evaluation of contiguity and is ideally as close as possible to the genome size. For segmented viruses, the N50 should be as close as possible to the size of the segment that covers at least 50% of the total genome size when segment lengths are summed after sorting largest to smallest.
- quast_denovo_number_contigs: The number of contigs is ideally 1, or the total number of expected segments.
Reference genome similarity
- skani_top_ani: The percent average nucleotide identity (ANI) for the top Skani hit is ideally 100% if the sequenced virus is highly similar to a reference genome. However, if the virus is divergent, ANI is not a good indication of assembly quality.
- skani_top_query_coverage: The percent query coverage for the top Skani hit is ideally 100% if the sequenced virus has not undergone significant recombination/structural variation.
- skani_top_score: The score for the top Skani hit is the ANI multiplied by the query (de novo assembly) coverage and is ideally 100% if the sequenced virus is not substantially divergent from the reference dataset.
How is consensus assembly quality evaluated?
Consensus assemblies are derived from a reference genome, so quality assessment focuses on coverage and variant quality. Bases with insufficient coverage are denoted as "N". Additionally, the size and contiguity of a TheiaViral consensus assembly is expected to approximate the reference genome, so any discrepancy here is likely due to inferred structural variation.
Completeness and contamination
- checkv_consensus_weighted_completeness: The weighted completeness is ideally 100%.
Consensus variant calls
- consensus_qc_number_Degenerate: The number of degenerate bases is ideally 0. While degenerate bases indicate ambiguity in the sequence, non-N degenerate bases indicate that some information about the base was obtained.
- consensus_qc_number_N: The number of "N" bases is ideally 0.
Coverage
- consensus_qc_percent_reference_coverage: The percent reference coverage is ideally 100%.
- read_mapping_cov_hist: The read mapping coverage histogram ideally depicts normally distributed coverage, which may indicate uniform coverage across the reference genome. However, uniform coverage is unlikely with repetitive regions that approach/exceed read length.
- read_mapping_coverage: The average read mapping coverage is ideally as high as possible.
- read_mapping_meanbaseq: The mean base quality of mapped reads is ideally as high as possible.
- read_mapping_meanmapq: The mean mapping (alignment) quality is ideally as high as possible.
- read_mapping_percentage_mapped_reads: The percent of mapped reads is ideally 100% of the reads classified as the lineage of interest. Some unclassified reads may also map, which may indicate they were erroneously unclassified. Alternatively, these reads could have been erroneously mapped.
Why did the workflow complete without generating a consensus?
TheiaViral is designed to "soft fail" when specific steps do not succeed due to input data quality. This means the workflow will be reported as successful, with an output that delineates the step that failed. If the workflow fails, please look for the following outputs in this order (sorted by timing of failure, latest first):
- skani_status: If this output is populated with something other than "PASS" and skani_top_accession is populated with "N/A", Skani did not identify a sufficiently similar reference genome. The Skani database comprises a broad array of NCBI viral genomes, so a failure here likely indicates poor read quality because viral contigs are not found in the de novo assembly or are too small. It may be useful to BLAST whatever contigs do exist in the de novo assembly to determine if there is contamination that can be removed via the host input parameter. Additionally, review the CheckV de novo outputs to assess whether viral contigs were retrieved. Finally, consider keeping extract_unclassified set to "true", using a higher read_extraction_rank if it will not introduce contaminant viruses, and invoking a host input to remove host reads if host contigs are present.
- megahit_status/flye_status: If this output is populated with something other than "PASS", the fallback assembler did not successfully complete. The fallback assemblers are permissive, so failure here likely indicates poor read quality. Review read QC to check read quality, particularly following read classification. If read classification is discarding a significant number of reads, consider adjusting extract_unclassified, read_extraction_rank, and the host input. Otherwise, sequencing quality may be poor.
- metaviralspades_status/raven_denovo_status: If this output is populated with something other than "PASS", the default assembler did not successfully complete or did not extract viral contigs (MetaviralSPAdes). On their own, these statuses do not correspond directly to workflow failure because fallback de novo assemblers are implemented for both TheiaViral workflows.
- read_screen_clean: If this output is populated with something other than "PASS", the reads did not pass the imposed thresholds. Either the reads are poor quality or the thresholds are too stringent, in which case the thresholds can be relaxed or skip_screen can be set to "true".
- dehost_wf_download_status: If this output is populated with something other than "PASS", a host genome could not be retrieved for decontamination. See the host input explanation for more information and review the download_accession/download_taxonomy task output logs for advanced error parsing.
Known errors associated with read quality
- ONT workflows may fail at Metabuli if no reads are classified as the taxon. Check the Metabuli classification.tsv or krona report for the read extraction taxon ID to determine if any reads were classified. This error will report out of memory (OOM), but increasing memory will not resolve it.
- Illumina workflows may fail at CheckV (de novo) with Error: 80 hmmsearch tasks failed. Program should be rerun if no viral contigs were identified in the de novo assembly.
Acknowledgments¶
We would like to thank Danny Park at the Broad Institute and Jared Johnson at the Washington State Department of Public Health for correspondence during the development of TheiaViral. TheiaViral was built referencing viral-assemble, VAPER, and Artic.


