TheiaViral Workflow Series¶
Quick Facts¶
| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore |
|---|---|---|---|---|---|
| Genomic Characterization | Viral | v4.0.0 | Yes; some optional features incompatible | Sample-level | TheiaViral_Illumina_PE_PHB, TheiaViral_ONT_PHB, TheiaViral_Panel_PHB |
TheiaViral Workflows¶
The TheiaViral workflows are for the assembly, quality assessment, and characterization of viral genomes from diverse data sources, including metagenomic samples. There are currently three TheiaViral workflows designed to accommodate different kinds of input data:
- Illumina paired-end sequencing (TheiaViral_Illumina_PE)
- Oxford Nanopore Technology (ONT) sequencing (TheiaViral_ONT)
- Illumina paired-end sequencing originating from hybrid-capture panel-based methods (TheiaViral_Panel)
These workflows function by generating consensus assemblies of recalcitrant viruses, including diverse or recombinant lineages (such as rabies or norovirus), through a three-step approach:
- An intermediate de novo assembly is generated from taxonomy-filtered reads,
- The best reference from a database of ~200,000 viral genomes is selected using average nucleotide identity (ANI), and
- A final consensus assembly is generated through reference-based read mapping and variant calling.
De novo assembly and reference selection can be skipped by providing a reference genome as input; this enables compatibility with tiled-amplicon sequencing data. Subsequent genomic characterization is currently only functional for the viral lineages listed below.
What are the main differences between the TheiaViral and TheiaCoV workflows?
- TheiaCoV Workflows
    - For amplicon-derived viral sequencing methods
    - Supports a limited number of pathogens
    - Uses manually curated, static reference genomes
- TheiaViral Workflows
    - Designed for a variety of sequencing methods
    - Supports relatively diverse and recombinant pathogens
    - Dynamically identifies the most similar reference genome for consensus assembly via an intermediate de novo assembly
What about segmented viruses?
TheiaViral can properly assemble segmented viruses. The reference genome database used in Step 2 excludes segmented viral nucleotide accessions but includes the RefSeq assembly accessions that contain all viral segments, and the consensus assembly modules are built to handle multi-segment references.
Workflow Diagram¶
Inputs¶
Input Data
The TheiaViral_Illumina_PE workflow accepts Illumina paired-end read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip to minimize data upload time and storage costs.
Modifications to the optional trim_minlen parameter may be required to appropriately trim reads shorter than 2 x 150 bp (the read length produced by a 300-cycle sequencing kit), such as the 2 x 75 bp reads generated using a 150-cycle sequencing kit.
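As a quick check before adjusting trim_minlen, you can inspect the read-length distribution of your FASTQ files. The snippet below is a minimal, illustrative sketch (the file name and the cutoff value are hypothetical examples):

```python
# Minimal sketch: summarize read lengths in a gzipped FASTQ to help choose a
# sensible minimum-length cutoff. The file name and cutoff are illustrative.
import gzip
import statistics

fastq_path = "sample_R1.fastq.gz"  # hypothetical input file
cutoff = 75                        # candidate minimum read length after trimming

lengths = []
with gzip.open(fastq_path, "rt") as handle:
    for i, line in enumerate(handle):
        if i % 4 == 1:  # sequence lines are every fourth line, starting at the second
            lengths.append(len(line.strip()))

print(f"reads: {len(lengths)}")
print(f"median length: {statistics.median(lengths)}")
print(f"reads shorter than {cutoff} bp: {sum(l < cutoff for l in lengths)}")
```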
taxon required input parameter
taxon is the standardized taxonomic name (e.g. "Lyssavirus rabies") or NCBI taxon ID (e.g. "11292") of the desired virus to analyze. Inputs must be represented in the NCBI taxonomy database and do not have to be species-level (see read_extraction_rank below).
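If you are unsure whether a name or ID is recognized, a local NCBI Taxonomy lookup can confirm the taxon ID and rank before launching the workflow. The sketch below uses the ete3 package purely for illustration (the workflow performs its own taxonomy resolution internally via an ete4-based task):

```python
# Illustrative sketch: confirm that an input taxon exists in NCBI Taxonomy and
# report its taxon ID and rank. Requires the ete3 package; the first call
# downloads a local copy of the NCBI taxonomy database.
from ete3 import NCBITaxa

ncbi = NCBITaxa()

taxon_name = "Lyssavirus rabies"                 # example from this documentation
name2taxid = ncbi.get_name_translator([taxon_name])
taxid = name2taxid[taxon_name][0]                # 11292

print(f"{taxon_name} -> taxid {taxid}")
print(f"rank: {ncbi.get_rank([taxid])[taxid]}")  # e.g. 'species'
```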
host optional input parameter
The host input triggers the Host Decontaminate workflow, which removes reads that map to a reference host genome. This input must be an NCBI Taxonomy-compatible taxon, a host genome assembly FASTA, or an NCBI assembly accession. If a taxon is provided, the first genome retrieved for that taxon is used. If a genome assembly or accession is provided, it must be coupled with the corresponding boolean set to "true": the Host Decontaminate task's is_genome/is_accession (ONT) or the Read QC Trim PE task's host_is_genome/host_is_accession (Illumina).
extract_unclassified optional input parameter
By default, the extract_unclassified parameter is set to true, which indicates that reads that are not classified by Kraken2 (Illumina) or Metabuli (ONT) will be included with reads classified as the input taxon.
These classification tools often do not comprehensively classify reads against the default RefSeq databases, so extracting unclassified reads is desirable when host and contaminant reads have been sufficiently removed. Host decontamination occurs in TheiaViral via NCBI sra-human-scrubber, read classification against the human genome, and/or mapping reads to the input host. Contaminant viral reads are mostly excluded because they will often be classified by the default RefSeq classification databases.
Consider setting extract_unclassified to false if de novo assembly or Skani reference selection is failing.
min_allele_freq, min_depth, and min_map_quality optional input parameters
These parameters have a direct effect on the variants that are ultimately reported in the consensus assembly. min_allele_freq determines the minimum proportion of reads supporting an allelic variant for it to be reported in the consensus assembly. min_depth and min_map_quality affect how "N" is reported in the consensus, i.e., positions with depth below min_depth are reported as "N", and reads with mapping quality below min_map_quality are not included in depth calculations.
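The interaction of these thresholds can be thought of as per-position logic: reads below the mapping-quality cutoff are ignored, positions whose remaining depth falls below min_depth become "N", and an alternate allele is only written to the consensus when its frequency meets min_allele_freq. The sketch below illustrates that decision flow using the documented default values; it is a deliberate simplification, not the workflow's actual variant-calling or consensus code:

```python
# Illustrative simplification of how min_allele_freq, min_depth, and
# min_map_quality interact at a single reference position.
def consensus_base(ref_base, pileup, min_allele_freq=0.6, min_depth=10, min_map_quality=20):
    """pileup: list of (base, mapping_quality) tuples for reads covering this position."""
    usable = [base for base, mapq in pileup if mapq >= min_map_quality]
    if len(usable) < min_depth:
        return "N"                                 # insufficient usable depth
    counts = {b: usable.count(b) for b in set(usable)}
    top_base, top_count = max(counts.items(), key=lambda kv: kv[1])
    if top_base != ref_base and top_count / len(usable) >= min_allele_freq:
        return top_base                            # alternate allele passes the frequency cutoff
    return ref_base

# Hypothetical position: 12 well-mapped reads, 9 supporting "T", 3 supporting the reference "C"
reads = [("T", 60)] * 9 + [("C", 60)] * 3
print(consensus_base("C", reads))  # "T" (9/12 = 0.75 >= 0.6)
```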
read_extraction_rank optional input parameter
By default, the read_extraction_rank parameter is set to "family", which indicates that reads will be extracted if they are classified to the taxonomic family of the input taxon, including all descendant taxa of that family. Read classification may not resolve to the rank of the input taxon, so some reads may only be classified at higher ranks. For example, some Lyssavirus rabies (species) reads may only be resolved to Lyssavirus (genus), so they would not be extracted if read_extraction_rank were set to "species". Setting read_extraction_rank above the input taxon's rank can therefore dramatically increase the number of reads recovered, at the potential cost of including other viruses. This is unlikely to be a problem for scarcely represented lineages, e.g. a sample expected to contain Lyssavirus rabies is unlikely to also contain other viruses of the corresponding family, Rhabdoviridae. However, setting read_extraction_rank far above the input taxon's rank can be problematic when multiple representatives of the same viral family are present at similar abundance within the same sample. To further refine the desired read_extraction_rank, review the classification reports of the respective classification software (Kraken2 for Illumina and Metabuli for ONT).
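To preview which ancestor of the input taxon a given read_extraction_rank resolves to, and whether a particular classified taxid would fall under it, a local NCBI Taxonomy lookup can help. The sketch below uses ete3 for illustration only (the workflow performs this step with its own tasks), and the genus taxid in the comment is an illustrative example:

```python
# Illustrative sketch: resolve the input taxon's ancestor at the chosen rank and
# test whether a classified read taxid falls within that clade (requires ete3).
from ete3 import NCBITaxa

ncbi = NCBITaxa()

input_taxid = 11292           # Lyssavirus rabies (example from this documentation)
extraction_rank = "family"    # default read_extraction_rank

# Walk the input taxon's lineage and pick the node with the requested rank.
lineage = ncbi.get_lineage(input_taxid)
ranks = ncbi.get_rank(lineage)
rank_taxid = next(t for t in lineage if ranks[t] == extraction_rank)
print(ncbi.get_taxid_translator([rank_taxid])[rank_taxid])   # Rhabdoviridae

# A read resolved only to genus level would still be extracted at family rank.
read_taxid = 11286            # genus Lyssavirus (illustrative)
print(rank_taxid in ncbi.get_lineage(read_taxid))            # True
```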
The TheiaViral_ONT workflow accepts base-called Oxford Nanopore Technology (ONT) read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip to minimize data upload time and storage costs.
It is recommended to trim adapter sequences during Dorado basecalling prior to running TheiaViral_ONT, though Porechop can optionally be called to trim adapters within the workflow.
The ONT sequencing kit and base-calling approach can produce substantial variability in the amount and quality of read data. Genome assemblies produced by the TheiaViral_ONT workflow must be quality assessed before reporting results. We recommend using the Dorado_Basecalling_PHB workflow if applicable.
taxon required input parameter
taxon is the standardized taxonomic name (e.g. "Lyssavirus rabies") or NCBI taxon ID (e.g. "11292") of the desired virus to analyze. Inputs must be represented in the NCBI taxonomy database and do not have to be species-level (see read_extraction_rank below).
host optional input parameter
The host input triggers the Host Decontaminate workflow, which removes reads that map to a reference host genome. This input must be an NCBI Taxonomy-compatible taxon, a host genome assembly FASTA, or an NCBI assembly accession. If a taxon is provided, the first genome retrieved for that taxon is used. If a genome assembly or accession is provided, it must be coupled with the corresponding boolean set to "true": the Host Decontaminate task's is_genome/is_accession (ONT) or the Read QC Trim PE task's host_is_genome/host_is_accession (Illumina).
extract_unclassified optional input parameter
By default, the extract_unclassified parameter is set to true, which indicates that reads that are not classified by Kraken2 (Illumina) or Metabuli (ONT) will be included with reads classified as the input taxon.
These classification tools often do not comprehensively classify reads against the default RefSeq databases, so extracting unclassified reads is desirable when host and contaminant reads have been sufficiently removed. Host decontamination occurs in TheiaViral via NCBI sra-human-scrubber, read classification against the human genome, and/or mapping reads to the input host. Contaminant viral reads are mostly excluded because they will often be classified by the default RefSeq classification databases.
Consider setting extract_unclassified to false if de novo assembly or Skani reference selection is failing.
min_allele_freq, min_depth, and min_map_quality optional input parameters
These parameters have a direct effect on the variants that are ultimately reported in the consensus assembly. min_allele_freq determines the minimum proportion of reads supporting an allelic variant for it to be reported in the consensus assembly. min_depth and min_map_quality affect how "N" is reported in the consensus, i.e., positions with depth below min_depth are reported as "N", and reads with mapping quality below min_map_quality are not included in depth calculations.
read_extraction_rank optional input parameter
By default, the read_extraction_rank parameter is set to "family", which indicates that reads will be extracted if they are classified to the taxonomic family of the input taxon, including all descendant taxa of that family. Read classification may not resolve to the rank of the input taxon, so some reads may only be classified at higher ranks. For example, some Lyssavirus rabies (species) reads may only be resolved to Lyssavirus (genus), so they would not be extracted if read_extraction_rank were set to "species". Setting read_extraction_rank above the input taxon's rank can therefore dramatically increase the number of reads recovered, at the potential cost of including other viruses. This is unlikely to be a problem for scarcely represented lineages, e.g. a sample expected to contain Lyssavirus rabies is unlikely to also contain other viruses of the corresponding family, Rhabdoviridae. However, setting read_extraction_rank far above the input taxon's rank can be problematic when multiple representatives of the same viral family are present at similar abundance within the same sample. To further refine the desired read_extraction_rank, review the classification reports of the respective classification software (Kraken2 for Illumina and Metabuli for ONT).
The TheiaViral_Panel workflow accepts Illumina paired-end read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip to minimize data upload time and storage costs.
For the analysis of RSV and influenza, it is recommended to run TheiaCoV for full characterization of RSV and IRMA assembly for influenza. Due to limitations within the Kraken2 database, RSV A and RSV B will both be extracted under HRSV. Subtypes can be loosely inferred from the Skani outputs.
taxon_ids optional input parameter
The taxon_ids parameter is required for TheiaViral_Panel to run correctly, but is optional in Terra.
By default, TheiaViral_Panel uses a list of 172 taxon IDs derived from the targeted viruses and subtypes in the Viral Surveillance Panel version 2 (VSP v2) produced by Illumina, though this workflow is not specific to that assay. The list can be modified to include or exclude any taxon IDs of interest; however, the taxon IDs must be present in the Kraken2 database used for read classification, and changing this parameter changes which organisms are extracted for assembly and characterization. The list of default taxon IDs can be found below:
| Taxon ID | Common Name | Species Name | Genome Length |
|---|---|---|---|
| 1618189 | Bourbon virus | Thogotovirus bourbonense | 10560 |
| 37124 | Chikungunya virus | Alphavirus chikungunya | 11547 |
| 46839 | Colorado tick fever virus | Coltivirus dermacentoris | 29174 |
| 12637 | Dengue virus | Orthoflavivirus denguei | 10770 |
| 1216928 | Heartland virus | Bandavirus heartlandense | 11540 |
| 59301 | Mayaro virus | Alphavirus mayaro | 11411 |
| 2169701 | Onyong-nyong virus | Alphavirus onyong | 11827 |
| 118655 | Oropouche virus | Orthobunyavirus oropoucheense | 11985 |
| 11587 | Punta Toro virus | Phlebovirus toroense | 12634 |
| 11029 | Ross River virus | Alphavirus rossriver | 11802 |
| 11033 | Semliki Forest virus | Alphavirus semliki | 11341 |
| 11034 | Sindbis virus | Alphavirus sindbis | 11671 |
| 1608084 | Tacheng Tick Virus 2 | Uukuvirus tachengense | 8844 |
| 64286 | Usutu virus | Orthoflavivirus usutuense | 11066 |
| 11082 | West Nile virus | Orthoflavivirus nilense | 10942 |
| 11089 | Yellow fever virus | Orthoflavivirus flavi | NA |
| 64320 | Zika virus | Orthoflavivirus zikaense | 10874 |
| 10804 | adeno-associated virus 2 | Dependoparvovirus primate1 | 4679 |
| 12092 | Hepatovirus A | Hepatovirus ahepa | 7446 |
| 3052230 | Hepacivirus hominis | Hepacivirus hominis | 9431 |
| 12475 | Hepatitis delta virus | Deltavirus italiense | 1680 |
| 291484 | Hepatitis E virus | Hepatitis E virus | 7499 |
| 11676 | Human immunodeficiency virus 1 | Lentivirus humimdef1 | 9388 |
| 11709 | Human immunodeficiency virus 2 | Lentivirus humimdef2 | 10059 |
| 68887 | Torque teno virus | Torque teno virus | 3477 |
| 1980456 | Orthohantavirus andesense | Orthohantavirus andesense | 7735 |
| 3052470 | Orthohantavirus bayoui | Orthohantavirus bayoui | 10861 |
| 3052490 | Orthohantavirus nigrorivense | Orthohantavirus nigrorivense | 6067 |
| 169173 | Choclo virus | Orthohantavirus chocloense | 7844 |
| 3052489 | Orthohantavirus negraense | Orthohantavirus mamorense | NA |
| 238817 | Maporal virus | Orthohantavirus maporalense | 12106 |
| 1980442 | Orthohantavirus | | 8504 |
| 3052496 | Orthohantavirus sangassouense | Orthohantavirus sangassouense | 11928 |
| 3052499 | Orthohantavirus sinnombreense | Orthohantavirus sinnombreense | 10583 |
| 90961 | Lyssavirus australis | Lyssavirus australis | 11822 |
| 80935 | Cache Valley virus | Orthobunyavirus cacheense | 12283 |
| 35305 | California encephalitis virus | Orthobunyavirus encephalitidis | 12466 |
| 1221391 | Cedar virus | Henipavirus cedarense | 18162 |
| 38767 | Lyssavirus duvenhage | Lyssavirus duvenhage | 11976 |
| 11021 | Eastern equine encephalitis virus | Alphavirus eastern | 11675 |
| 38768 | European bat lyssavirus | European bat lyssavirus | 11935 |
| 2847089 | Ghana virus | Henipavirus ghanaense | 18530 |
| 3052223 | Henipavirus hendraense | Henipavirus hendraense | 18234 |
| 260964 | Henipavirus | | 18134 |
| 35511 | Jamestown Canyon virus | Orthobunyavirus jamestownense | 12461 |
| 11072 | Japanese encephalitis virus | Orthoflavivirus japonicum | NA |
| 11577 | La Crosse virus | Orthobunyavirus lacrosseense | 12490 |
| 38766 | Lyssavirus lagos | Lyssavirus lagos | 12016 |
| 1474807 | Mojiang virus | Parahenipavirus mojiangense | 18406 |
| 12538 | Lyssavirus mokola | Lyssavirus mokola | 11940 |
| 11079 | Murray Valley encephalitis virus | Orthoflavivirus murrayense | 7012 |
| 3052225 | Henipavirus nipahense | Henipavirus nipahense | 18248 |
| 11083 | Powassan virus | Orthoflavivirus powassanense | 10826 |
| 11292 | Lyssavirus rabies | Lyssavirus rabies | 11927 |
| 11580 | Snowshoe hare virus | Orthobunyavirus khatangaense | 12208 |
| 11080 | St. Louis encephalitis virus | Orthoflavivirus louisense | 10940 |
| 45270 | Tahyna virus | Orthobunyavirus tahynaense | 12446 |
| 11084 | Tick-borne encephalitis virus | Orthoflavivirus encephalitidis | 7367 |
| 11036 | Venezuelan equine encephalitis virus | Alphavirus venezuelan | 11411 |
| 11039 | Western equine encephalitis virus | Alphavirus western | 11523 |
| 1313215 | aichivirus A1 | Kobuvirus aichi | 8266 |
| 138948 | Enterovirus A | Enterovirus alphacoxsackie | 7427 |
| 138949 | Enterovirus B | Enterovirus betacoxsackie | 7410 |
| 138950 | Enterovirus C | Enterovirus coxsackiepol | 7442 |
| 138951 | Enterovirus D | Enterovirus deconjuncti | 7367 |
| 1239565 | Mamastrovirus 1 | Mamastrovirus hominis | 6791 |
| 1239570 | Mamastrovirus 6 | Mamastrovirus melbournense | 6171 |
| 1239573 | Mamastrovirus 9 | Mamastrovirus virginiaense | 6576 |
| 142786 | Norovirus | | 5162 |
| 28875 | Rotavirus A | Rotavirus alphagastroenteritidis | 8881 |
| 28876 | Rotavirus B | Rotavirus betagastroenteritidis | 17791 |
| 36427 | Rotavirus C | Rotavirus tritogastroenteritidis | 17720 |
| 1348384 | Rotavirus H | Rotavirus aspergastroenteritidis | 17961 |
| 1330524 | Salivirus A | Salivirus aklasse | 7956 |
| 95341 | Sapovirus | | 7470 |
| 2849717 | Aigai virus | Orthonairovirus parahaemorrhagiae | 19245 |
| 1424613 | Anjozorobe virus | Orthohantavirus thailandense | NA |
| 2010960 | Bombali virus | Orthoebolavirus bombaliense | 19043 |
| 565995 | Bundibugyo virus | Orthoebolavirus bundibugyoense | 18940 |
| 3052302 | Mammarenavirus chapareense | Mammarenavirus chapareense | 10464 |
| 3052518 | Orthonairovirus haemorrhagiae | Orthonairovirus haemorrhagiae | 19146 |
| 3052477 | Orthohantavirus dobravaense | Orthohantavirus dobravaense | 9116 |
| 3052307 | Mammarenavirus guanaritoense | Mammarenavirus guanaritoense | 10424 |
| 3052480 | Orthohantavirus hantanense | Orthohantavirus hantanense | 6917 |
| 2169991 | Mammarenavirus juninense | Mammarenavirus juninense | 10525 |
| 33743 | Kyasanur Forest disease virus | Orthoflavivirus kyasanurense | 10579 |
| 3052310 | Mammarenavirus lassaense | Mammarenavirus lassaense | 10686 |
| 3052148 | Cuevavirus lloviuense | Cuevavirus lloviuense | 18893 |
| 3052314 | Mammarenavirus lujoense | Mammarenavirus lujoense | 10352 |
| 3052303 | Mammarenavirus choriomeningitidis | Mammarenavirus choriomeningitidis | 10367 |
| 3052317 | Mammarenavirus machupoense | Mammarenavirus machupoense | 10635 |
| 12542 | Omsk hemorrhagic fever virus | Orthoflavivirus omskense | 10787 |
| 3052493 | Orthohantavirus puumalaense | Orthohantavirus puumalaense | 10925 |
| 186539 | Reston ebolavirus | Orthoebolavirus restonense | 18891 |
| 11588 | Rift Valley fever virus | Phlebovirus riftense | 11979 |
| 2907957 | Sabia virus | Mammarenavirus brazilense | 10499 |
| 3052498 | Orthohantavirus seoulense | Orthohantavirus seoulense | 9746 |
| 1003835 | Severe fever with thrombocytopenia syndrome virus | Bandavirus dabieense | 10547 |
| 1452514 | Sosuga virus | Pararubulavirus sosugaense | 15480 |
| 186540 | Sudan ebolavirus | Orthoebolavirus sudanense | 18875 |
| 186541 | Tai Forest ebolavirus | Orthoebolavirus taiense | 18935 |
| 3052503 | Orthohantavirus tulaense | Orthohantavirus tulaense | 9987 |
| 1891762 | Betapolyomavirus hominis | Betapolyomavirus hominis | 5146 |
| 10376 | human gammaherpesvirus 4 | Lymphocryptovirus humangamma4 | 172146 |
| 10359 | Human betaherpesvirus 5 | Cytomegalovirus humanbeta5 | 214152 |
| 333760 | Human papillomavirus 16 | Alphapapillomavirus 9 | 7905 |
| 333761 | human papillomavirus 18 | Alphapapillomavirus 7 | 7857 |
| 337044 | Alphapapillomavirus 5 | Alphapapillomavirus 5 | 7805 |
| 337050 | Alphapapillomavirus 6 | Alphapapillomavirus 6 | 7847 |
| 1671798 | Human papillomavirus type 54 | Alphapapillomavirus 13 | 7759 |
| 333754 | Alphapapillomavirus 10 | Alphapapillomavirus 10 | 7898 |
| 333767 | Alphapapillomavirus 3 | Alphapapillomavirus 3 | 8061 |
| 746830 | Human polyomavirus 6 | Deltapolyomavirus sextihominis | 4926 |
| 746831 | Human polyomavirus 7 | Deltapolyomavirus septihominis | 4952 |
| 943908 | Human polyomavirus 9 | Alphapolyomavirus nonihominis | 5027 |
| 10632 | JC polyomavirus | Betapolyomavirus secuhominis | 5171 |
| 1891764 | Betapolyomavirus tertihominis | Betapolyomavirus tertihominis | 5040 |
| 1965344 | LI polyomavirus | Alphapolyomavirus quardecihominis | 5269 |
| 493803 | Merkel cell polyomavirus | Alphapolyomavirus quintihominis | 5387 |
| 1203539 | MW polyomavirus | Deltapolyomavirus decihominis | 4927 |
| 1497391 | New Jersey polyomavirus-2013 | Alphapolyomavirus terdecihominis | 5108 |
| 1891767 | Betapolyomavirus macacae | Betapolyomavirus macacae | 5243 |
| 1277649 | STL polyomavirus | Deltapolyomavirus undecihominis | 4776 |
| 862909 | Trichodysplasia spinulosa-associated polyomavirus | Alphapolyomavirus octihominis | 5232 |
| 440266 | WU Polyomavirus | Betapolyomavirus quartihominis | 5229 |
| 10298 | Human alphaherpesvirus 1 | Simplexvirus humanalpha1 | 155275 |
| 11234 | Measles morbillivirus | Morbillivirus hominis | 15956 |
| 152219 | Menangle virus | Pararubulavirus menangleense | 15516 |
| 10244 | Monkeypox virus | Orthopoxvirus monkeypox | 193392 |
| 2560602 | Mumps orthorubulavirus | Orthorubulavirus parotitidis | NA |
| 11041 | Rubella virus | Rubivirus rubellae | 9762 |
| 10335 | Human alphaherpesvirus 3 | Varicellovirus humanalpha3 | 125308 |
| 10255 | Variola virus | Orthopoxvirus variola | 186087 |
| 129875 | Human mastadenovirus A | Mastadenovirus adami | 34077 |
| 108098 | Human mastadenovirus B | Mastadenovirus blackbeardi | 34777 |
| 129951 | Human mastadenovirus C | Mastadenovirus caesari | 35753 |
| 130310 | Human mastadenovirus D | Mastadenovirus dominans | 35160 |
| 130308 | Human mastadenovirus E | Mastadenovirus exoticum | 36099 |
| 130309 | Human mastadenovirus F | Mastadenovirus faecale | 33926 |
| 536079 | Human mastadenovirus G | Mastadenovirus russelli | 21467 |
| 329641 | Human bocavirus | Human bocavirus | 5289 |
| 11137 | Human coronavirus 229E | Alphacoronavirus chicagoense | 27375 |
| 290028 | Human coronavirus HKU1 | Betacoronavirus hongkongense | 29911 |
| 277944 | Human coronavirus NL63 | Alphacoronavirus amsterdamense | 27551 |
| 31631 | Human coronavirus OC43 | Betacoronavirus gravedinis | 30767 |
| 162145 | human metapneumovirus | Metapneumovirus hominis | 13319 |
| 12730 | Human respirovirus 1 | Respirovirus laryngotracheitidis | 15600 |
| 2560525 | Human orthorubulavirus 2 | Orthorubulavirus laryngotracheitidis | 15649 |
| 11216 | Human respirovirus 3 | Respirovirus pneumoniae | 15430 |
| 2560526 | Human orthorubulavirus 4 | Orthorubulavirus hominis | 17235 |
| 1803956 | Parechovirus A | Parechovirus ahumpari | 666 |
| 10798 | Human parvovirus B19 | Erythroparvovirus primate1 | 5595 |
| 11250 | human respiratory syncytial virus | Orthopneumovirus hominis | 15246 |
| 11320 | Influenza A virus | Alphainfluenzavirus influenzae | 13357 |
| 11520 | Influenza B virus | Betainfluenzavirus influenzae | 14563 |
| 11552 | Influenza C virus | Gammainfluenzavirus influenzae | 12430 |
| 1335626 | Middle East respiratory syndrome-related coronavirus | Betacoronavirus cameli | 30150 |
| 147711 | Rhinovirus A | Enterovirus alpharhino | 6983 |
| 147712 | Rhinovirus B | Enterovirus betarhino | 6940 |
| 463676 | Rhinovirus C | Enterovirus cerhino | 5749 |
| 2901879 | Severe acute respiratory syndrome coronavirus | Betacoronavirus pandemicum | 29747 |
| 2697049 | Severe acute respiratory syndrome coronavirus 2 | Betacoronavirus pandemicum | 29883 |
| 10404 | Hepadnaviridae | | 3186 |
| 3052505 | Orthomarburgvirus marburgense | Orthomarburgvirus marburgense | NA |
| 337041 | Alphapapillomavirus 9 | Alphapapillomavirus 9 | 7916 |
| 337042 | Alphapapillomavirus 7 | Alphapapillomavirus 7 | 7861 |
| 333757 | Alphapapillomavirus 8 | Alphapapillomavirus 8 | 7960 |
| 337048 | Alphapapillomavirus 1 | Alphapapillomavirus 1 | 7940 |
| 333754 | Alphapapillomavirus 10 | Alphapapillomavirus 10 | 7898 |
| 333766 | Alphapapillomavirus 13 | Alphapapillomavirus 13 | 7759 |
| 337049 | Alphapapillomavirus 11 | Alphapapillomavirus 11 | 7779 |
output_taxon_table optional input parameter
A key feature of TheiaViral_Panel is the ability to output assemblies and characterization results to taxon-specific Terra tables. This allows users to easily separate results by taxon for downstream analysis.
The output_taxon_table parameter is an optional input file, with a set default, that specifies which taxa are output to which taxon table in Terra.
Formatting the output_taxon_table file
The output_taxon_table file must be uploaded to a Google storage bucket that is accessible by Terra, should be in tab-delimited format, and must include a header. The viral taxon name is listed in the leftmost column, with the name of the Terra data table that samples of that taxon should be copied to in the rightmost column. For example, the first row of the default table pairs "influenza" with "panel_influenza_specimen", so any sample whose taxonomy classification is identified as "influenza" is added to a Terra table named "panel_influenza_specimen". The default table is shown below, followed by a minimal sketch for generating a custom file. For best results, edit your taxon table in a plain-text editor such as Notepad.
| taxon | taxon_table |
|---|---|
| influenza | panel_influenza_specimen |
| coronavirus | panel_coronavirus_specimen |
| human_immunodeficiency_virus | panel_hiv_specimen |
| monkeypox_virus | panel_monkeypox_specimen |
| human_respiratory_syncytial_virus | panel_rsv_specimen |
| west_nile_virus | panel_wnv_specimen |
| other | panel_other_specimen |
| h3n1 | panel_influenza_specimen |
| h1n1 | panel_influenza_specimen |
| h5n1 | panel_influenza_specimen |
| h3n2 | panel_influenza_specimen |
| h2n2 | panel_influenza_specimen |
| mastadenovirus | panel_mastadenovirus_specimen |
| orthohantavirus | panel_orthohantavirus_specimen |
| enterovirus | panel_enterovirus_specimen |
| alphapapillomavirus | panel_alphapapillomavirus_specimen |
| hepatitis | panel_hepatitis_specimen |
| hepadnaviridae | panel_hepatitis_specimen |
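A custom output_taxon_table can be written as a plain tab-delimited file with a header. The minimal sketch below produces a two-column file mirroring part of the default table above (the output path is illustrative); upload the resulting file to a Terra-accessible Google storage bucket before providing it to the workflow:

```python
# Minimal sketch: write a custom tab-delimited output_taxon_table with a header.
# The rows mirror part of the default table above; edit them as needed.
import csv

rows = [
    ("influenza", "panel_influenza_specimen"),
    ("coronavirus", "panel_coronavirus_specimen"),
    ("other", "panel_other_specimen"),
]

with open("custom_output_taxon_table.tsv", "w", newline="") as handle:  # illustrative path
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["taxon", "taxon_table"])  # header row
    writer.writerows(rows)
```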
kraken_db optional input parameter
For the reliable extraction of input taxon IDs, it is important to make sure that the taxon IDs used as input are concordant with the contents of the Kraken2 database; when changing either of these parameters, keep this relationship in mind. The default database can be accessed here.
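One way to check this concordance is to run kraken2-inspect on the database and confirm that every taxon ID of interest appears in the resulting report. The sketch below assumes such a report has already been written to kraken2_inspect.txt in the standard six-column Kraken2 report format; both the report path and the example IDs are illustrative:

```python
# Minimal sketch: confirm that taxon IDs of interest are present in a Kraken2
# database, using a report produced beforehand with `kraken2-inspect --db <db>`.
# Assumes the standard six-column report, where the fifth column is the taxid.
taxon_ids_of_interest = {11292, 11320, 2697049}   # illustrative examples from the default list

db_taxids = set()
with open("kraken2_inspect.txt") as handle:       # illustrative report path
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6 and fields[4].strip().isdigit():
            db_taxids.add(int(fields[4]))

missing = taxon_ids_of_interest - db_taxids
print("missing taxon IDs:", sorted(missing) if missing else "none")
```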
extract_unclassified optional input parameter
By default, extract_unclassified is set to false, which indicates that reads that are not classified by Kraken2 will NOT be included with reads classified as the input taxon.
If the extracted read data is lacking and assemblies are not generated, consider setting this parameter to true to increase the available read count and make assembly generation more probable. Please note that this will introduce reads that are not assigned to the identified taxon and can introduce significant noise and misclassifications.
min_read_count optional input parameter
By default, min_read_count is set to 1000. This is the minimum number of reads a taxon bin must contain to pass the binning threshold and proceed to assembly and characterization.
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| theiaviral_illumina_pe | read1 | File | Illumina forward read file in FASTQ file format (compression optional) | | Required |
| theiaviral_illumina_pe | read2 | File | Illumina reverse read file in FASTQ file format (compression optional) | | Required |
| theiaviral_illumina_pe | samplename | String | Name of the sample being analyzed | | Required |
| theiaviral_illumina_pe | taxon | String | Taxon ID or organism name of interest | Required | |
| bwa | cpu | Int | Number of CPUs to allocate to the task | 6 | Optional |
| bwa | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| bwa | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
| bwa | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| checkv_consensus | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| checkv_consensus | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| checkv_consensus | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
| checkv_consensus | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| checkv_denovo | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| checkv_denovo | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| checkv_denovo | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
| checkv_denovo | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| clean_check_reads | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| clean_check_reads | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| clean_check_reads | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 | Optional |
| clean_check_reads | max_genome_length | Int | Maximum genome length able to pass read screening | 2673870 | Optional |
| clean_check_reads | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| clean_check_reads | min_basepairs | Int | Minimum base pairs to pass read screening | 15000 | Optional |
| clean_check_reads | min_coverage | Int | Minimum coverage to pass read screening | 10 | Optional |
| clean_check_reads | min_genome_length | Int | Minimum genome length to pass read screening | 1500 | Optional |
| clean_check_reads | min_proportion | Int | Minimum read proportion to pass read screening | 40 | Optional |
| clean_check_reads | min_reads | Int | Minimum reads to pass read screening | 50 | Optional |
| consensus | char_unknown | String | Character used to represent unknown bases in the consensus sequence | N | Optional |
| consensus | count_orphans | Boolean | True/False that determines if anomalous read pairs are NOT skipped in variant calling. Anomalous read pairs are those marked in the FLAG field as paired in sequencing but without the properly-paired flag set. | True | Optional |
| consensus | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| consensus | disable_baq | Boolean | True/False that determines if base alignment quality (BAQ) computation should be disabled during samtools mpileup before consensus generation | True | Optional |
| consensus | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| consensus | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
| consensus | max_depth | Int | For a given position, read at maximum INT number of reads per input file during samtools mpileup before consensus generation | 600000 | Optional |
| consensus | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| consensus | min_bq | Int | Minimum base quality required for a base to be considered during samtools mpileup before consensus generation | 0 | Optional |
| consensus | skip_N | Boolean | True/False that determines if "N" bases should be skipped in the consensus sequence | False | Optional |
| consensus_qc | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| consensus_qc | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| consensus_qc | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
| consensus_qc | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| est_genome_length | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| est_genome_length | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| est_genome_length | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ncbi-datasets:18.9.0-python-jq | Optional |
| est_genome_length | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| est_genome_length | summary_limit | Int | Maximum number of genomes to query | 100 | Optional |
| ete4_identify | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| ete4_identify | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| ete4_identify | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ete4:4.3.0 | Optional |
| ete4_identify | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| ivar_variants | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| ivar_variants | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| ivar_variants | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
| ivar_variants | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| ivar_variants | reference_gff | File | A GFF file in the GFF3 format can be supplied to specify coordinates of open reading frames (ORFs) so iVar can identify codons and translate variants into amino acids | Optional | |
| megahit | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| megahit | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| megahit | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/megahit:1.2.9 | Optional |
| megahit | kmers | String | Comma-separated list of kmer sizes to use for assembly. All must be odd, in the range 15-255, increment <= 28 | 21,29,39,59,79,99,119,141 | Optional |
| megahit | megahit_opts | String | Additional parameters for MEGAHIT assembler | Optional | |
| megahit | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| megahit | min_contig_length | Int | Minimum contig length for MEGAHIT assembler | 1 | Optional |
| morgana_magic | abricate_flu_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | abricate_flu_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | abricate_flu_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | abricate_flu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | abricate_flu_min_percent_coverage | Int | Minimum DNA percent coverage | Optional | |
| morgana_magic | abricate_flu_min_percent_identity | Int | Minimum DNA percent identity | Optional | |
| morgana_magic | assembly_metrics_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | assembly_metrics_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | assembly_metrics_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | assembly_metrics_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | flu_track_antiviral_aa_subs | String | Additional list of antiviral resistance-associated amino acid substitutions of interest to be searched against those called on the sample segments. They take the format gene:substitution, e.g. NA:A26V | | Optional |
| morgana_magic | gene_coverage_bam | File | Bam file used for calculating gene coverage | Optional | |
| morgana_magic | gene_coverage_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | gene_coverage_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | gene_coverage_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | gene_coverage_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | gene_coverage_min_depth | Int | The minimum depth to determine if a position was covered. | Optional | |
| morgana_magic | genoflu_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | genoflu_cross_reference | File | An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py | Optional | |
| morgana_magic | genoflu_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | genoflu_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | genoflu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | irma_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | irma_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | irma_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | irma_keep_ref_deletions | Boolean | True/False variable that determines whether sites missed during read gathering (i.e., 0 reads for a site in the reference genome) should be ambiguated by inserting Ns or deleted entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN" | | Optional |
| morgana_magic | irma_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_auspice_reference_tree_json | File | An Auspice JSON phylogenetic reference tree which serves as a target for phylogenetic placement. | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | nextclade_custom_input_dataset | File | For H5N1 flu samples only. A custom Nextclade dataset in JSON format. If provided, this dataset will be used to process any H5N1 flu samples. If not provided, a custom dataset will be selected depending on the GenoFLU Genotype. | Defaults are GenoFLU Genotype specific. Please find these default values here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl | Optional |
| morgana_magic | nextclade_dataset_name | String | NextClade organism dataset name | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| morgana_magic | nextclade_dataset_tag | String | NextClade organism dataset tag | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| morgana_magic | nextclade_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | nextclade_input_ref | File | A nucleotide sequence which serves as a reference for the pairwise alignment of all input sequences. This is also the sequence which defines the coordinate system of the genome annotation. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/02-reference-sequence.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | nextclade_output_parser_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_pathogen_json | File | General dataset configuration file. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/05-pathogen-config.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_reference_gff_file | File | A genome annotation to specify how to translate the nucleotide sequence to proteins (genome_annotation.gff3). specifying this enables codon-informed alignment and protein alignments. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_verbosity | String | other options are: "off" , "error" , "info" , "debug" , and "trace" (highest level of verbosity) | warn | Optional |
| morgana_magic | pangolin_analysis_mode | String | Specify which inference engine to use. Options: accurate (UShER), fast (pangoLEARN), pangolearn, usher. | Optional | |
| morgana_magic | pangolin_arguments | String | Optional arguments for pangolin e.g. ''--skip-scorpio'' | Optional | |
| morgana_magic | pangolin_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | pangolin_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | pangolin_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | pangolin_expanded_lineage | Boolean | True/False that determines if a lineage should be expanded without aliases (e.g., BA.1 → B.1.1.529.1) | Optional | |
| morgana_magic | pangolin_max_ambig | Float | Maximum proportion of Ns allowed for pangolin to attempt assignment. | Optional | |
| morgana_magic | pangolin_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | pangolin_min_length | Int | Minimum query length allowed for pangolin to attempt an assignment | Optional | |
| morgana_magic | pangolin_skip_designation_cache | Boolean | A True/False option that determines if the designation cache should be used | Optional | |
| morgana_magic | pangolin_skip_scorpio | Boolean | A True/False option that determines if scorpio should be skipped. | Optional | |
| morgana_magic | quasitools_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | quasitools_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | quasitools_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | quasitools_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | sc2_s_gene_start | Int | Start position of S gene | Optional | |
| morgana_magic | sc2_s_gene_stop | Int | End position of S gene | Optional | |
| morgana_magic | vadr_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | vadr_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | vadr_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | vadr_max_length | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | Optional | |
| morgana_magic | vadr_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | vadr_min_length | Int | Minimum length for the fasta-trim-terminal-ambigs.pl VADR script | Optional | |
| morgana_magic | vadr_model_file | File | Path to a tar + gzipped VADR model file | | Optional |
| morgana_magic | vadr_options | String | Options to pass to the VADR script | Optional | |
| morgana_magic | vadr_skip_length | Int | Skip reads shorter than this length | Optional | |
| quast_denovo | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| quast_denovo | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| quast_denovo | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 | Optional |
| quast_denovo | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| rasusa | bases | String | Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored | Optional | |
| rasusa | coverage | Float | The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required | 250 | Optional |
| rasusa | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| rasusa | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| rasusa | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 | Optional |
| rasusa | frac | Float | Subsample to a fraction of the reads - e.g., 0.5 samples half the reads | Optional | |
| rasusa | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| rasusa | num | Int | Subsample to a specific number of reads | Optional | |
| rasusa | seed | Int | Random seed for reproducibility | Optional | |
| read_QC_trim | adapters | File | File with adapter sequences to be removed | Optional | |
| read_QC_trim | bbduk_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| read_QC_trim | call_midas | Boolean | Internal component, do not modify | False | Optional |
| read_QC_trim | fastp_args | String | Additional arguments to use with fastp | --detect_adapter_for_pe -g -5 20 -3 20 | Optional |
| read_QC_trim | host_complete_only | Boolean | Only download host reference genome labeled "complete" | False | Optional |
| read_QC_trim | host_decontaminate_mem | Int | Memory allocated for minimap2 (in GB) | 32 | Optional |
| read_QC_trim | host_is_accession | Boolean | Inputted "host" is an accession | False | Optional |
| read_QC_trim | host_is_genome | Boolean | Inputted "host" is a genome URI | False | Optional |
| read_QC_trim | host_refseq | Boolean | Internal component, do not modify | True | Optional |
| read_QC_trim | kraken_cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| read_QC_trim | kraken_disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30GB kraken2 databases such as the "k2_standard" database) | 100 | Optional |
| read_QC_trim | kraken_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| read_QC_trim | midas_db | File | Internal component, do not modify | gs://theiagen-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz | Optional |
| read_QC_trim | phix | File | A file containing the phix used during Illumina sequencing; used in the BBDuk task | Optional | |
| read_QC_trim | read_processing | String | The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp" | trimmomatic | Optional |
| read_QC_trim | read_qc | String | The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc" | fastq_scan | Optional |
| read_QC_trim | target_organism | String | Internal component, do not modify | Optional | |
| read_QC_trim | trim_min_length | Int | Specifies minimum length of each read after trimming to be kept | 75 | Optional |
| read_QC_trim | trim_quality_min_score | Int | Specifies the average quality of bases in a sliding window to be kept | 30 | Optional |
| read_QC_trim | trim_window_size | Int | Specifies window size for trimming (the number of bases to average the quality across) | 4 | Optional |
| read_QC_trim | trimmomatic_args | String | Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data. | -phred33 | Optional |
| read_mapping_stats | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| read_mapping_stats | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| read_mapping_stats | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 | Optional |
| read_mapping_stats | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| skani | acc2taxon_map | File | Tab-delimited map between reference genome accessions and their affiliated taxon | gs://theiagen-public-resources-rp/reference_data/databases/skani/viral_fna_20251107/viral_accession2taxon_20251107.tsv | Optional |
| skani | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| skani | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| skani | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2 | Optional |
| skani | fasta_dir | String | Reference genome database base directory | gs://theiagen-public-resources-rp/reference_data/databases/skani/viral_fna_20251107/fna/ | Optional |
| skani | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| spades | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| spades | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| spades | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/spades:4.1.0 | Optional |
| spades | kmers | String | list of k-mer sizes (must be odd and less than 128) | auto | Optional |
| spades | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| spades | phred_offset | Int | PHRED quality offset in the input reads (33 or 64) | 33 | Optional |
| spades | spades_opts | String | Additional parameters for Spades assembler | Optional | |
| theiaviral_illumina_pe | call_metaviralspades | Boolean | True/False to call assembly with MetaviralSPAdes and use Megahit as fallback | True | Optional |
| theiaviral_illumina_pe | checkv_db | File | Database used for CheckV | Optional | |
| theiaviral_illumina_pe | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads | True | Optional |
| theiaviral_illumina_pe | genome_length | Int | Expected genome length of taxon of interest | Optional | |
| theiaviral_illumina_pe | host | String | Host taxon/accession to dehost reads, if provided | Optional | |
| theiaviral_illumina_pe | kraken_db | File | Kraken2 database file | gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz | Optional |
| theiaviral_illumina_pe | min_allele_freq | Float | Minimum allele frequency required for a variant to populate the consensus sequence | 0.6 | Optional |
| theiaviral_illumina_pe | min_depth | Int | Minimum read depth required for a variant to populate the consensus sequence | 10 | Optional |
| theiaviral_illumina_pe | min_map_quality | Int | Minimum mapping quality required for read alignments | 20 | Optional |
| theiaviral_illumina_pe | read_extraction_rank | String | Taxonomic rank to use for read extraction - limits taxa to only those within the specified rank. | family | Optional |
| theiaviral_illumina_pe | reference_fasta | File | Reference genome in FASTA format | Optional | |
| theiaviral_illumina_pe | reference_gene_locations_bed | File | Use to provide locations of interest where average coverage will be calculated | Optional | |
| theiaviral_illumina_pe | skani_db | File | Skani database file | Optional | |
| theiaviral_illumina_pe | skip_qc | Boolean | Internal component, do not modify | False | Optional |
| theiaviral_illumina_pe | skip_rasusa | Boolean | True/False to skip read subsampling with Rasusa | True | Optional |
| theiaviral_illumina_pe | skip_screen | Boolean | True/False to skip read screening check prior to analysis | False | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | Optional |
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| theiaviral_ont | read1 | File | Base-called ONT read file in FASTQ file format (compression optional) | Required | |
| theiaviral_ont | samplename | String | Name of the sample being analyzed | Required | |
| theiaviral_ont | taxon | String | Taxon ID or organism name of interest | Required | |
| bcftools_consensus | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| bcftools_consensus | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| bcftools_consensus | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/bcftools:1.20 | Optional |
| bcftools_consensus | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| checkv_consensus | checkv_db | File | CheckV database file | gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz | Optional |
| checkv_consensus | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| checkv_consensus | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| checkv_consensus | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
| checkv_consensus | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| checkv_denovo | checkv_db | File | CheckV database file | gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz | Optional |
| checkv_denovo | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| checkv_denovo | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| checkv_denovo | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
| checkv_denovo | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| clair3 | clair3_model | String | Model to be used by Clair3 | r1041_e82_400bps_sup_v500 | Optional |
| clair3 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| clair3 | disable_phasing | Boolean | True/False that determines if variants should be called without whatshap phasing in full alignment calling | True | Optional |
| clair3 | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| clair3 | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/clair3-extra-models:1.0.10 | Optional |
| clair3 | enable_gvcf | Boolean | True/False that determines if an additional GVCF output should be generated | False | Optional |
| clair3 | enable_haploid_precise | Boolean | True/False that determines if haploid precise calling mode is enabled, where only 1/1 is considered as a variant | True | Optional |
| clair3 | include_all_contigs | Boolean | True/False that determines if all contigs should be included in the output | True | Optional |
| clair3 | indel_min_af | Float | Minimum Indel AF required for a candidate variant | 0.08 | Optional |
| clair3 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| clair3 | snp_min_af | Float | Minimum SNP allele frequency required for a candidate variant. Lowering the value might increase a bit of sensitivity in trade of speed and accuracy | 0.08 | Optional |
| clair3 | variant_quality | Int | If set, variants with >$qual will be marked PASS, or LowQual otherwise | 2 | Optional |
| clean_check_reads | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| clean_check_reads | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| clean_check_reads | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 | Optional |
| clean_check_reads | max_genome_length | Int | Maximum genome length able to pass read screening | 2673870 | Optional |
| clean_check_reads | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| clean_check_reads | min_basepairs | Int | Minimum base pairs to pass read screening | 15000 | Optional |
| clean_check_reads | min_coverage | Int | Minimum coverage to pass read screening | 10 | Optional |
| clean_check_reads | min_genome_length | Int | Minimum genome length to pass read screening | 1500 | Optional |
| clean_check_reads | min_reads | Int | Minimum reads to pass read screening | 50 | Optional |
| consensus_qc | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| consensus_qc | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| consensus_qc | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
| consensus_qc | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| est_genome_length | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| est_genome_length | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| est_genome_length | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ncbi-datasets:18.9.0-python-jq | Optional |
| est_genome_length | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| est_genome_length | summary_limit | Int | Maximum number of genomes to query | 100 | Optional |
| ete4_identify | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| ete4_identify | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| ete4_identify | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ete4:4.3.0 | Optional |
| ete4_identify | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| fasta_utilities | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| fasta_utilities | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| fasta_utilities | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
| fasta_utilities | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| flye | additional_parameters | String | Additional parameters for Flye assembler | Optional | |
| flye | asm_coverage | Int | Reduced coverage for initial disjointig assembly | Optional | |
| flye | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| flye | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| flye | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/flye:2.9.4 | Optional |
| flye | flye_polishing_iterations | Int | Number of polishing iterations | 1 | Optional |
| flye | genome_length | Int | Expected genome length for assembly - requires asm_coverage | Optional | |
| flye | keep_haplotypes | Boolean | True/False to prevent collapsing alternative haplotypes | False | Optional |
| flye | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| flye | minimum_overlap | Int | Minimum overlap between reads | Optional | |
| flye | no_alt_contigs | Boolean | True/False to disable alternative contig generation | False | Optional |
| flye | read_error_rate | Float | Expected error rate in reads | Optional | |
| flye | read_type | String | Type of read data for Flye | --nano-hq | Optional |
| flye | scaffold | Boolean | True/False to enable scaffolding using graph | False | Optional |
| host_decontaminate | complete_only | Boolean | Only download genomes labeled "complete" | False | Optional |
| host_decontaminate | is_accession | Boolean | Inputted "host" is an accession | False | Optional |
| host_decontaminate | is_genome | Boolean | Inputted "host" is an assembly FASTA | False | Optional |
| host_decontaminate | minimap2_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| host_decontaminate | read2 | File | Internal component, do not modify | Optional | |
| host_decontaminate | refseq | Boolean | Only download RefSeq genomes | True | Optional |
| mask_low_coverage | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| mask_low_coverage | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| mask_low_coverage | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/bedtools:2.31.0 | Optional |
| mask_low_coverage | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| metabuli | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| metabuli | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| metabuli | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/metabuli:1.1.0 | Optional |
| metabuli | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| metabuli | metabuli_db | File | Metabuli database file | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/refseq_virus-v223.tar.gz | Optional |
| metabuli | min_percent_coverage | Float | Minimum query coverage threshold (0.0 - 1.0) | 0 | Optional |
| metabuli | min_score | Float | Minimum sequence similarity score (0.0 - 1.0) | 0 | Optional |
| metabuli | min_sp_score | Float | Minimum score for species- or lower-level classification | 0 | Optional |
| metabuli | taxonomy_path | File | Path to taxonomy file | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/new_taxdump.tar.gz | Optional |
| minimap2 | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| minimap2 | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| minimap2 | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22 | Optional |
| minimap2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| minimap2 | query2 | File | Internal component, do not modify | Optional | |
| morgana_magic | abricate_flu_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | abricate_flu_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | abricate_flu_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | abricate_flu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | abricate_flu_min_percent_coverage | Int | Minimum DNA percent coverage | Optional | |
| morgana_magic | abricate_flu_min_percent_identity | Int | Minimum DNA percent identity | Optional | |
| morgana_magic | assembly_metrics_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | assembly_metrics_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | assembly_metrics_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | assembly_metrics_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | flu_track_antiviral_aa_subs | String | Additional list of antiviral resistance-associated amino acid substitutions of interest to be searched against those called on the sample segments. These take the format segment:substitution, e.g. NA:A26V | Optional | |
| morgana_magic | gene_coverage_bam | File | Bam file used for calculating gene coverage | Optional | |
| morgana_magic | gene_coverage_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | gene_coverage_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | gene_coverage_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | gene_coverage_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | gene_coverage_min_depth | Int | The minimum depth to determine if a position was covered. | Optional | |
| morgana_magic | genoflu_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | genoflu_cross_reference | File | An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py | Optional | |
| morgana_magic | genoflu_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | genoflu_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | genoflu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | irma_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | irma_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | irma_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | irma_keep_ref_deletions | Boolean | True/False variable that determines whether sites missed during read gathering (i.e. 0 reads for a site in the reference genome) should be ambiguated by inserting N's or deleted from the sequence entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN" | Optional | |
| morgana_magic | irma_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_auspice_reference_tree_json | File | An Auspice JSON phylogenetic reference tree which serves as a target for phylogenetic placement. | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | nextclade_custom_input_dataset | File | For H5N1 flu samples only. A custom Nextclade dataset in JSON format. If provided, this dataset will be used to process any H5N1 flu samples. If not provided, a custom dataset will be selected depending on the GenoFLU Genotype. | Defaults are GenoFLU Genotype specific. Please find these default values here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl | Optional |
| morgana_magic | nextclade_dataset_name | String | NextClade organism dataset name | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| morgana_magic | nextclade_dataset_tag | String | NextClade organism dataset tag | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| morgana_magic | nextclade_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | nextclade_input_ref | File | A nucleotide sequence which serves as a reference for the pairwise alignment of all input sequences. This is also the sequence which defines the coordinate system of the genome annotation. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/02-reference-sequence.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_output_parser_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | nextclade_output_parser_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | nextclade_pathogen_json | File | General dataset configuration file. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/05-pathogen-config.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_reference_gff_file | File | A genome annotation to specify how to translate the nucleotide sequence to proteins (genome_annotation.gff3). Specifying this enables codon-informed alignment and protein alignments. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html | Inherited from nextclade dataset | Optional |
| morgana_magic | nextclade_verbosity | String | Other options are: "off", "error", "info", "debug", and "trace" (highest level of verbosity) | warn | Optional |
| morgana_magic | pangolin_analysis_mode | String | Specify which inference engine to use. Options: accurate (UShER), fast (pangoLEARN), pangolearn, usher. | Optional | |
| morgana_magic | pangolin_arguments | String | Optional arguments for pangolin e.g. ''--skip-scorpio'' | Optional | |
| morgana_magic | pangolin_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | pangolin_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | pangolin_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | pangolin_expanded_lineage | Boolean | True/False that determines if a lineage should be expanded without aliases (e.g., BA.1 → B.1.1.529.1) | Optional | |
| morgana_magic | pangolin_max_ambig | Float | Maximum proportion of Ns allowed for pangolin to attempt assignment. | Optional | |
| morgana_magic | pangolin_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | pangolin_min_length | Int | Minimum query length allowed for pangolin to attempt an assignment | Optional | |
| morgana_magic | pangolin_skip_designation_cache | Boolean | A True/False option that determines if the designation cache should be used | Optional | |
| morgana_magic | pangolin_skip_scorpio | Boolean | A True/False option that determines if scorpio should be skipped. | Optional | |
| morgana_magic | quasitools_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | quasitools_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | quasitools_docker | String | The Docker container to use for the task | Optional | |
| morgana_magic | quasitools_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | read2 | File | Internal component, do not modify | Optional | |
| morgana_magic | sc2_s_gene_start | Int | Start position of S gene | Optional | |
| morgana_magic | sc2_s_gene_stop | Int | End position of S gene | Optional | |
| morgana_magic | vadr_cpu | Int | Number of CPUs to allocate to the task | Optional | |
| morgana_magic | vadr_disk_size | Int | Amount of storage (in GB) to allocate to the task | Optional | |
| morgana_magic | vadr_docker_image | String | The Docker container to use for the task | Optional | |
| morgana_magic | vadr_max_length | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | Optional | |
| morgana_magic | vadr_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | Optional | |
| morgana_magic | vadr_min_length | Int | Minimum length for the fasta-trim-terminal-ambigs.pl VADR script | Optional | |
| morgana_magic | vadr_model_file | File | Path to a tar + gzipped VADR model file | Optional | |
| morgana_magic | vadr_options | String | Options to pass to the VADR script | Optional | |
| morgana_magic | vadr_skip_length | Int | Skip reads shorter than this length | Optional | |
| nanoplot_clean | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| nanoplot_clean | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| nanoplot_clean | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 | Optional |
| nanoplot_clean | max_length | Int | Maximum clean read length to display; reads longer than this are hidden in the plots | 100000 | Optional |
| nanoplot_clean | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| nanoplot_raw | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| nanoplot_raw | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| nanoplot_raw | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 | Optional |
| nanoplot_raw | max_length | Int | Maximum raw read length to display; reads longer than this are hidden in the plots | 100000 | Optional |
| nanoplot_raw | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| nanoq | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| nanoq | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| nanoq | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/nanoq:0.9.0--hec16e2b_1 | Optional |
| nanoq | max_read_length | Int | Maximum read length to keep | 100000 | Optional |
| nanoq | max_read_qual | Int | Maximum read quality to keep | 100 | Optional |
| nanoq | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| nanoq | min_read_length | Int | Minimum read length to keep | 500 | Optional |
| nanoq | min_read_qual | Int | Minimum read quality to keep | 10 | Optional |
| ncbi_scrub_se | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| ncbi_scrub_se | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| ncbi_scrub_se | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional |
| ncbi_scrub_se | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| parse_mapping | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| parse_mapping | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| parse_mapping | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
| parse_mapping | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| porechop | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| porechop | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| porechop | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/porechop:0.2.4 | Optional |
| porechop | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| porechop | trimopts | String | Additional trimming options for Porechop | Optional | |
| quast_denovo | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| quast_denovo | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| quast_denovo | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 | Optional |
| quast_denovo | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| quast_denovo | min_contig_length | Int | Minimum length of contig for QUAST | 500 | Optional |
| rasusa | bases | String | Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored | Optional | |
| rasusa | coverage | Float | The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required | 250 | Optional |
| rasusa | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| rasusa | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| rasusa | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 | Optional |
| rasusa | frac | Float | Subsample to a fraction of the reads - e.g., 0.5 samples half the reads | Optional | |
| rasusa | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| rasusa | num | Int | Subsample to a specific number of reads | Optional | |
| rasusa | read2 | File | Internal component, do not modify | Optional | |
| rasusa | seed | Int | Random seed for reproducibility | Optional | |
| raven | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| raven | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| raven | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/raven:1.8.3 | Optional |
| raven | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| raven | raven_identity | Float | Threshold for overlap between two reads in order to construct an edge between them | 0 | Optional |
| raven | raven_opts | String | Additional parameters for Raven assembler | Optional | |
| raven | raven_polishing_iterations | Int | Number of polishing iterations | 2 | Optional |
| read_mapping_stats | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| read_mapping_stats | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| read_mapping_stats | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 | Optional |
| read_mapping_stats | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| skani | acc2taxon_map | File | Tab-delimited map between reference genome accessions and their affiliated taxon | gs://theiagen-public-resources-rp/reference_data/databases/skani/viral_fna_20251107/viral_accession2taxon_20251107.tsv | Optional |
| skani | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| skani | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| skani | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2 | Optional |
| skani | fasta_dir | String | Reference genome database base directory | gs://theiagen-public-resources-rp/reference_data/databases/skani/viral_fna_20251107/fna/ | Optional |
| skani | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| skani | skani_db | File | Skani database file | gs://theiagen-public-resources-rp/reference_data/databases/skani/skani_db_20251107.tar | Optional |
| theiaviral_ont | call_porechop | Boolean | True/False to trim adapters with porechop | False | Optional |
| theiaviral_ont | call_raven | Boolean | True/False to call assembly with Raven and use Flye as fallback | True | Optional |
| theiaviral_ont | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads | True | Optional |
| theiaviral_ont | genome_length | Int | Expected genome length of taxon of interest | Optional | |
| theiaviral_ont | host | String | Host taxon/accession to dehost reads, if provided | Optional | |
| theiaviral_ont | min_allele_freq | Float | Minimum allele frequency required for a variant to populate the consensus sequence | 0.6 | Optional |
| theiaviral_ont | min_depth | Int | Minimum read depth required for a variant to populate the consensus sequence | 10 | Optional |
| theiaviral_ont | min_map_quality | Int | Minimum mapping quality required for read alignments | 20 | Optional |
| theiaviral_ont | read_extraction_rank | String | Taxonomic rank to use for read extraction; limits extracted taxa to those within the specified rank | family | Optional |
| theiaviral_ont | reference_fasta | File | Reference genome in FASTA format | Optional | |
| theiaviral_ont | reference_gene_locations_bed | File | Use to provide locations of interest where average coverage will be calculated | Optional | |
| theiaviral_ont | skip_rasusa | Boolean | True/False to skip read subsampling with Rasusa | True | Optional |
| theiaviral_ont | skip_screen | Boolean | True/False to skip read screening check prior to analysis | False | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | Optional |
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| theiaviral_panel | read1 | File | Illumina forward read file in FASTQ file format (compression optional) | Required | |
| theiaviral_panel | read2 | File | Illumina reverse read file in FASTQ file format (compression optional) | Required | |
| theiaviral_panel | samplename | String | Name of the sample being analyzed | Required | |
| theiaviral_panel | source_table_name | String | Name of the Terra table the source reads originate from. This is used to identify the originating location of extracted assemblies once they are added to output tables. | Required | |
| theiaviral_panel | terra_project | String | The Terra project containing the data table | Required | |
| theiaviral_panel | terra_workspace | String | The Terra workspace containing the data table | Required | |
| cat_lanes | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| cat_lanes | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| cat_lanes | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.2 | Optional |
| cat_lanes | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| cat_lanes | read1_lane3 | File | Internal component, do not modify | Optional | |
| cat_lanes | read1_lane4 | File | Internal component, do not modify | Optional | |
| cat_lanes | read2_lane3 | File | Internal component, do not modify | Optional | |
| cat_lanes | read2_lane4 | File | Internal component, do not modify | Optional | |
| ete4_identify | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| ete4_identify | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| ete4_identify | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ete4:4.3.0 | Optional |
| ete4_identify | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| ete4_identify | rank | String | Internal component, do not modify | Optional | |
| export_taxon_table | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| export_taxon_table | disk_size | Int | Amount of storage (in GB) to allocate to the task | 25 | Optional |
| export_taxon_table | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21 | Optional |
| export_taxon_table | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| kraken2 | classified_out | String | Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix | classified#.fastq | Optional |
| kraken2 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| kraken2 | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| kraken2 | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db | Optional |
| kraken2 | kraken2_args | String | Allows a user to supply additional kraken2 command-line arguments | Optional | |
| kraken2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| kraken2 | unclassified_out | String | Allows user to rename unclassified FASTQ files output. Must include .fastq as the suffix | unclassified#.fastq | Optional |
| kraken_parser | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| kraken_parser | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| kraken_parser | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/krakentools:d4a2fbe | Optional |
| kraken_parser | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| krakentools | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| krakentools | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| krakentools | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/krakentools:d4a2fbe | Optional |
| krakentools | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| read_QC_trim | adapters | File | File with adapter sequences to be removed | Optional | |
| read_QC_trim | bbduk_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| read_QC_trim | call_midas | Boolean | Internal component, do not modify | False | Optional |
| read_QC_trim | extract_unclassified | Boolean | Allows user to extract unclassified reads | False | Optional |
| read_QC_trim | fastp_args | String | Additional arguments to use with fastp | --detect_adapter_for_pe -g -5 20 -3 20 | Optional |
| read_QC_trim | host_complete_only | Boolean | Only download host reference genome labeled "complete" | False | Optional |
| read_QC_trim | host_decontaminate_mem | Int | Memory allocated for minimap2 (in GB) | 32 | Optional |
| read_QC_trim | host_is_accession | Boolean | Inputted "host" is an accession | False | Optional |
| read_QC_trim | host_is_genome | Boolean | Inputted "host" is a genome URI | False | Optional |
| read_QC_trim | host_refseq | Boolean | Internal component, do not modify | True | Optional |
| read_QC_trim | kraken_cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| read_QC_trim | kraken_disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30GB kraken2 databases such as the "k2_standard" database) | 100 | Optional |
| read_QC_trim | kraken_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| read_QC_trim | midas_db | File | Internal component, do not modify | gs://theiagen-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz | Optional |
| read_QC_trim | phix | File | A file containing the phix used during Illumina sequencing; used in the BBDuk task | Optional | |
| read_QC_trim | read_processing | String | The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp" | trimmomatic | Optional |
| read_QC_trim | read_qc | String | The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc" | fastq_scan | Optional |
| read_QC_trim | target_organism | String | Internal component, do not modify | Optional | |
| read_QC_trim | taxon_id | Int | Internal component, do not modify | 0 | Optional |
| read_QC_trim | trim_min_length | Int | Specifies minimum length of each read after trimming to be kept | 75 | Optional |
| read_QC_trim | trim_quality_min_score | Int | Specifies the average quality of bases in a sliding window to be kept | 30 | Optional |
| read_QC_trim | trim_window_size | Int | Specifies window size for trimming (the number of bases to average the quality across) | 4 | Optional |
| read_QC_trim | trimmomatic_args | String | Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data. | -phred33 | Optional |
| theiaviral_illumina_pe | checkv_db | File | Database used for CheckV | Optional | |
| theiaviral_illumina_pe | extract_unclassified | Boolean | Internal component, do not modify | True | Optional |
| theiaviral_illumina_pe | genome_length | Int | Expected genome length of taxon of interest | Optional | |
| theiaviral_illumina_pe | host | String | Internal component, do not modify | Optional | |
| theiaviral_illumina_pe | min_allele_freq | Float | Minimum allele frequency required for a variant to populate the consensus sequence | 0.6 | Optional |
| theiaviral_illumina_pe | min_depth | Int | Minimum read depth required for a variant to populate the consensus sequence | 10 | Optional |
| theiaviral_illumina_pe | min_map_quality | Int | Minimum mapping quality required for read alignments | 20 | Optional |
| theiaviral_illumina_pe | read_extraction_rank | String | Internal component, do not modify | family | Optional |
| theiaviral_illumina_pe | reference_fasta | File | Reference genome in FASTA format | Optional | |
| theiaviral_illumina_pe | reference_gene_locations_bed | File | Use to provide locations of interest where average coverage will be calculated | Optional | |
| theiaviral_illumina_pe | skani_db | File | Skani database file | Optional | |
| theiaviral_illumina_pe | skip_rasusa | Boolean | True/False to skip read subsampling with Rasusa | True | Optional |
| theiaviral_panel | call_metaviralspades | Boolean | Whether to run metaviralspades for assembly | True | Optional |
| theiaviral_panel | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads | False | Optional |
| theiaviral_panel | host | String | Host taxon/accession to dehost reads, if provided | Optional | |
| theiaviral_panel | kraken_db | File | Kraken2 database file in .tar.gz format. | gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz | Optional |
| theiaviral_panel | min_read_count | Int | Minimum number of reads required to consider a taxon for assembly | 1000 | Optional |
| theiaviral_panel | output_taxon_table | File | A TSV file containing organism names and their corresponding output table names. | gs://theiagen-public-resources-rp/reference_data/family_agnostic/theiaviral_panel_taxon_table_20251111.tsv | Optional |
| theiaviral_panel | taxon_ids | Array[String] | An array of taxon IDs user wishes to analyze. | [['1618189'], ['37124'], ['46839'], ['12637'], ['1216928'], ['59301'], ['2169701'], ['118655'], ['11587'], ['11029'], ['11033'], ['11034'], ['1608084'], ['64286'], ['11082'], ['11089'], ['64320'], ['10804'], ['12092'], ['3052230'], ['12475'], ['11676'], ['11709'], ['68887'], ['1980456'], ['3052470'], ['3052490'], ['169173'], ['3052489'], ['238817'], ['1980442'], ['3052496'], ['3052499'], ['90961'], ['80935'], ['35305'], ['1221391'], ['38767'], ['11021'], ['38768'], ['2847089'], ['3052223'], ['260964'], ['35511'], ['11072'], ['11577'], ['38766'], ['1474807'], ['12538'], ['11079'], ['3052225'], ['11083'], ['11292'], ['11580'], ['11080'], ['45270'], ['11084'], ['11036'], ['11039'], ['1313215'], ['138948'], ['138949'], ['138950'], ['138951'], ['1239565'], ['1239570'], ['1239573'], ['142786'], ['28875'], ['28876'], ['36427'], ['1348384'], ['1330524'], ['95341'], ['2849717'], ['1424613'], ['2010960'], ['565995'], ['3052302'], ['3052518'], ['3052477'], ['3052307'], ['3052480'], ['2169991'], ['33743'], ['3052310'], ['3052148'], ['3052314'], ['3052303'], ['3052317'], ['33727'], ['12542'], ['3052493'], ['186539'], ['11588'], ['2907957'], ['3052498'], ['1003835'], ['1452514'], ['186540'], ['186541'], ['3052503'], ['1891762'], ['10376'], ['10359'], ['333760'], ['333761'], ['337044'], ['337050'], ['1671798'], ['333754'], ['333767'], ['746830'], ['746831'], ['943908'], ['10632'], ['1891764'], ['1965344'], ['493803'], ['1203539'], ['1497391'], ['1891767'], ['1277649'], ['862909'], ['862909'], ['440266'], ['11234'], ['152219'], ['10244'], ['2560602'], ['11041'], ['10335'], ['10255'], ['129875'], ['108098'], ['129951'], ['130310'], ['130308'], ['130309'], ['536079'], ['329641'], ['11137'], ['290028'], ['277944'], ['31631'], ['162145'], ['12730'], ['2560525'], ['11216'], ['2560526'], ['1803956'], ['10798'], ['11250'], ['11320'], ['11520'], ['11552'], ['1335626'], ['147711'], ['147712'], ['463676'], ['2901879'], ['2697049'], ['10404'], ['3052505'], ['337041'], ['337042'], ['333757'], ['337048'], ['333754'], ['333766'], ['337049']] | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | Optional |
Workflow Tasks¶
Versioning
versioning: Version Capture
The versioning task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical details
| Links | |
|---|---|
| Task | task_versioning.wdl |
Taxonomic Identification
ete4_identify
The ete4_identify task parses the NCBI taxonomy hierarchy from a user's input taxon and desired taxonomic rank. This task returns a taxon ID, name, and rank, which facilitates downstream functions, including read classification, targeted read extraction, and genomic characterization modules.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
rank (a.k.a. read_extraction_rank) input parameter
Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.
Important
- The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
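For troubleshooting taxon/rank combinations before launching the workflow, the same kind of rank resolution can be reproduced locally with the ete toolkit's NCBITaxa interface. The sketch below is illustrative only (it is not the ete4_identify task's code); it uses the ete3-style import, which ete4 largely mirrors, and assumes the local NCBI taxonomy database has already been downloaded.

```python
# Minimal sketch of resolving an input taxon to an ancestor at a requested rank,
# approximating what ete4_identify reports. Assumes the NCBITaxa local taxonomy
# database is already present (it is downloaded on first use).
from ete3 import NCBITaxa  # ete4 exposes a comparable NCBITaxa class

def resolve_to_rank(taxon: str, rank: str = "family"):
    ncbi = NCBITaxa()
    # Accept either a numeric NCBI taxon ID or an organism name
    taxid = int(taxon) if taxon.isdigit() else ncbi.get_name_translator([taxon])[taxon][0]
    lineage = ncbi.get_lineage(taxid)            # root -> input taxon
    ranks = ncbi.get_rank(lineage)               # {taxid: rank}
    names = ncbi.get_taxid_translator(lineage)   # {taxid: scientific name}
    for ancestor in lineage:
        if ranks[ancestor] == rank:
            return ancestor, names[ancestor], rank
    # No ancestor at the requested rank: mirrors the task failing when the
    # requested rank is below the input taxon's own rank.
    raise ValueError(f"no ancestor of {taxon} at rank '{rank}'")

# Lyssavirus rabies (11292) at family rank -> (11270, 'Rhabdoviridae', 'family')
print(resolve_to_rank("11292", "family"))
```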
ete4 Identify Technical Details
| Links | |
|---|---|
| Task | task_ete4_taxon_id.wdl |
| Software Source Code | ete4 on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
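The averaging step can be approximated outside the workflow once a Datasets summary has been saved to disk. In the hedged sketch below, the summary.json filename and the reports/length field names are assumptions about the Datasets JSON schema rather than the task's exact parsing; inspect the JSON produced by your datasets version before relying on them.

```python
# Minimal sketch, assuming a summary was produced with something like:
#   datasets summary virus genome taxon "Lyssavirus rabies" > summary.json
# The "reports"/"length" field names are assumptions and may differ between
# datasets versions.
import json
import statistics

with open("summary.json") as handle:
    summary = json.load(handle)

lengths = [report["length"] for report in summary.get("reports", []) if report.get("length")]

if lengths:
    print(f"genomes considered: {len(lengths)}")
    print(f"average expected genome length (bp): {round(statistics.mean(lengths))}")
else:
    print("no genome length metadata found for this taxon")
```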
Read Quality Control, Trimming, Filtering, Identification and Extraction
read_QC_trim
read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.
HRRT: Human Host Sequence Removal
All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).
HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed from all human RefSeq records (Eukaryota), with any k-mers also found in non-Eukaryota RefSeq records subtracted from the database.
NCBI-Scrub Technical Details
| Links | |
|---|---|
| Task | task_ncbi_scrub.wdl |
| Software Source Code | HRRT on GitHub |
| Software Documentation | HRRT on NCBI |
By default, read_processing is set to "trimmomatic". To use fastp instead, set read_processing to "fastp". These tasks are mutually exclusive.
Trimmomatic: Read Trimming (default)
Read processing is available via Trimmomatic by default.
Trimmomatic trims low-quality regions of Illumina paired-end or single-end reads with a sliding window (with a default window size of 4, specified with trim_window_size), cutting once the average quality within the window falls below the trim_quality_min_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below trim_minlen (default of 75 for paired-end, 25 for single-end).
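The sliding-window behaviour is easier to reason about with a short, tool-agnostic sketch. This is not Trimmomatic's implementation; the window size, quality threshold, and minimum length below are illustrative stand-ins for trim_window_size, trim_quality_min_score, and the minimum-length input used in a given run.

```python
# Tool-agnostic illustration of sliding-window quality trimming followed by a
# minimum-length filter (Trimmomatic SLIDINGWINDOW + MINLEN style).
from typing import Optional, Tuple

def sliding_window_trim(
    seq: str,
    quals: list,
    window: int = 4,
    min_quality: float = 30,
    min_length: int = 75,
) -> Optional[Tuple[str, list]]:
    """Cut the read at the first window whose mean quality drops below the
    threshold; discard the read entirely if the kept portion is too short."""
    cut = len(seq)
    for start in range(len(seq) - window + 1):
        if sum(quals[start:start + window]) / window < min_quality:
            cut = start
            break
    if cut < min_length:
        return None  # read is discarded
    return seq[:cut], quals[:cut]

# A 12 bp read whose quality collapses near the 3' end is trimmed to 6 bp
read = "ACGTACGTACGT"
qualities = [38, 38, 38, 38, 38, 38, 38, 38, 12, 11, 10, 9]
print(sliding_window_trim(read, qualities, window=4, min_quality=30, min_length=5))
```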
Trimmomatic Technical Details
| Links | |
|---|---|
| Task | task_trimmomatic.wdl |
| Software Source Code | Trimmomatic on GitHub |
| Software Documentation | Trimmomatic Website |
| Original Publication(s) | Trimmomatic: a flexible trimmer for Illumina sequence data |
fastp: Read Trimming (alternative)
To activate this task, set read_processing to "fastp".
fastp trims low-quality regions of Illumina paired-end or single-end reads with a sliding window (with a default window size of 4, specified with trim_window_size), cutting once the average quality within the window falls below the trim_quality_min_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below trim_minlen (default of 75 for paired-end, 25 for single-end).
fastp also has additional default parameters and features that are not a part of trimmomatic's default configuration.
fastp default read-trimming parameters
| Parameter | Explanation |
|---|---|
| -g | enables polyG tail trimming |
| -5 20 | enables quality trimming from the 5' end (read start) |
| -3 20 | enables quality trimming from the 3' end (read end) |
| --detect_adapter_for_pe | enables adapter-trimming only for paired-end reads |
Additional arguments can be passed using the fastp_args optional parameter.
fastp Technical Details
| Links | |
|---|---|
| Task | task_fastp.wdl |
| Software Source Code | fastp on GitHub |
| Software Documentation | fastp on GitHub |
| Original Publication(s) | fastp: an ultra-fast all-in-one FASTQ preprocessor |
BBDuk: Adapter Trimming and PhiX Removal
Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.
The bbduk task removes adapters from sequence reads. To do this:
- Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files (it re-pairs).
- BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
BBDuk Technical Details
| Links | |
|---|---|
| Task | task_bbduk.wdl |
| Software Source Code | BBMap on SourceForge |
| Software Documentation | BBDuk Guide (archived) |
By default, read_qc is set to "fastq_scan". To use fastqc instead, set read_qc to "fastqc". These tasks are mutually exclusive.
fastq-scan: Read Quantification (default)
Read quantification is available via fastq-scan by default.
fastq-scan quantifies the forward and reverse reads in FASTQ files. For paired-end data, it also provides the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
fastq-scan Technical Details
| Links | |
|---|---|
| Task | task_fastq_scan.wdl |
| Software Source Code | fastq-scan on GitHub |
| Software Documentation | fastq-scan on GitHub |
FastQC: Read Quantification (alternative)
To activate this task, set read_qc to "fastqc".
FastQC quantifies the forward and reverse reads in FASTQ files. For paired-end data, it also provides the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
This tool also provides a graphical visualization of the read quality.
FastQC Technical Details
| Links | |
|---|---|
| Task | task_fastqc.wdl |
| Software Source Code | FastQC on Github |
| Software Documentation | FastQC Website |
host_decontaminate: Host Read Decontamination
Host genetic data is frequently sequenced incidentally alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning them to a reference host genome that is either provided directly or acquired on-the-fly. The reference host genome can be supplied to the host input field as an assembly file (with is_genome set to "true"), as an NCBI Taxonomy-compatible taxon, or as an assembly accession (with is_accession set to "true"). Host Decontaminate maps the input reads to the host genome using minimap2, reports mapping statistics against the host genome, and outputs the unaligned, dehosted reads.
The detailed steps and tasks are as follows:
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Download Accession
The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.
This task uses the accession ID output from the skani task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_ncbi_datasets.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Map Reads to Host
minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.
The mode used in this task is map-ont, the default mode for long reads, which indicates that long reads with ~10% error rates are being aligned to the reference genome. The output file is in SAM format.
For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.
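A stand-alone invocation equivalent in spirit to this mapping step is sketched below; the host FASTA and read file names are placeholders, and the workflow's task may add further options.

```python
# Sketch: map ONT reads against a host genome with the map-ont preset, then
# sort the alignment into a BAM file. File names are placeholders.
import subprocess

host_fasta = "host_genome.fasta"
reads = "sample.fastq.gz"

with open("host_alignment.sam", "w") as sam:
    subprocess.run(["minimap2", "-ax", "map-ont", host_fasta, reads],
                   stdout=sam, check=True)

subprocess.run(["samtools", "sort", "-o", "host_alignment.bam", "host_alignment.sam"],
               check=True)
```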
minimap2 Technical Details
| Links | |
|---|---|
| Task | task_minimap2.wdl |
| Software Source Code | minimap2 on GitHub |
| Software Documentation | minimap2 |
| Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
Extract Unaligned Reads
The bam_to_unaligned_fastq task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.
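With samtools alone, the unaligned reads can be recovered roughly as sketched below (single-end/ONT case; SAM flag 4 marks an unmapped read). This approximates the task's behaviour and omits the paired-end mate handling it performs.

```python
# Sketch: write reads that did not align to the host back out as FASTQ.
# Flag 4 selects unmapped records; paired-end mate handling is omitted here.
import subprocess

with open("dehosted.fastq", "w") as out:
    subprocess.run(["samtools", "fastq", "-f", "4", "host_alignment.bam"],
                   stdout=out, check=True)
```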
parse_mapping Technical Details
| Links | |
|---|---|
| Task | task_parse_mapping.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
assembly_metrics: Mapping Statistics
The assembly_metrics task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
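A comparable summary can be generated directly with samtools; the sketch below uses samtools stats and samtools coverage as stand-ins for the task's reporting and is not its literal command.

```python
# Sketch: summarize a BAM. "SN" lines from samtools stats carry totals such as
# reads mapped and error rate; samtools coverage reports per-contig depth and
# breadth of coverage.
import subprocess

bam = "host_alignment.bam"

stats = subprocess.run(["samtools", "stats", bam],
                       capture_output=True, text=True, check=True)
for line in stats.stdout.splitlines():
    if line.startswith("SN\t"):
        print(line)

coverage = subprocess.run(["samtools", "coverage", bam],
                          capture_output=True, text=True, check=True)
print(coverage.stdout)
```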
assembly_metrics Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
Host Decontaminate Technical Details
| Links | |
|---|---|
| Subworkflow | wf_host_decontaminate.wdl |
Kraken2: Read Identification
Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.
This task runs on cleaned reads passed from the read_QC_trim subworkflow and outputs a Kraken2 report detailing taxonomic classifications. It also separates classified reads from unclassified ones.
Database-dependent
This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.
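A stand-alone Kraken2 run producing the same kinds of outputs (a report plus classified/unclassified read files) looks roughly like the sketch below; the database and read paths are placeholders and the workflow's exact flags may differ.

```python
# Sketch of a paired-end Kraken2 run that writes a report and splits reads into
# classified/unclassified files (Kraken2 substitutes the '#' for each mate).
# Paths are placeholders.
import subprocess

subprocess.run(
    [
        "kraken2",
        "--db", "kraken2_humanGRCh38_viralRefSeq",
        "--paired",
        "--report", "sample.kraken2.report.txt",
        "--output", "sample.kraken2.classifications.txt",
        "--classified-out", "classified#.fastq",
        "--unclassified-out", "unclassified#.fastq",
        "clean_R1.fastq.gz", "clean_R2.fastq.gz",
    ],
    check=True,
)
```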
Kraken2 Technical Details
| Links | |
|---|---|
| Task | task_kraken2.wdl |
| Software Source Code | Kraken2 on GitHub |
| Software Documentation | Kraken2 Documentation |
| Original Publication(s) | Improved metagenomic analysis with Kraken 2 |
krakentools: Read Extraction
The krakentools task extracts reads from the Kraken2 output file. It uses the KrakenTools package to extract reads classified at any user-specified taxon ID.
extract_unclassified input parameter
This parameter determines whether unclassified reads should also be extracted and combined with the taxon-specific extracted reads. By default, this is set to false, meaning that only reads classified to the specified input taxon will be extracted.
Important
This task will extract reads classified to the input taxon and all of its descendant taxa. The rank input parameter controls the extraction of reads classified at the specified rank and all subordinate taxonomic levels. See the ete4_identify task under the Taxonomic Identification section for more details on the rank input parameter.
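KrakenTools performs this extraction with its extract_kraken_reads.py script; a typical paired-end invocation is sketched below with placeholder paths and taxon ID. The --include-children flag mirrors the descendant-taxa behaviour described above and requires the Kraken2 report.

```python
# Sketch: extract read pairs classified to a taxon (and its descendants) from
# Kraken2 results using KrakenTools. Taxon ID and paths are placeholders.
import subprocess

taxon_id = "11292"  # e.g. Lyssavirus rabies

subprocess.run(
    [
        "extract_kraken_reads.py",
        "-k", "sample.kraken2.classifications.txt",
        "-r", "sample.kraken2.report.txt",
        "-s1", "clean_R1.fastq",
        "-s2", "clean_R2.fastq",
        "-t", taxon_id,
        "--include-children",
        "--fastq-output",
        "-o", "extracted_R1.fastq",
        "-o2", "extracted_R2.fastq",
    ],
    check=True,
)
```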
KrakenTools Technical Details
| Links | |
|---|---|
| Task | task_krakentools.wdl |
| Software Source Code | KrakenTools on GitHub |
| Software Documentation | KrakenTools |
| Original Publication(s) | Metagenome analysis using the Kraken software suite |
read_QC_trim Technical Details
| Links | |
|---|---|
| Subworkflow | wf_read_QC_trim_pe.wdl wf_read_QC_trim_se.wdl |
rasusa
Rasusa is a tool to randomly subsample sequencing reads to a specified coverage without assuming that all reads are of equal length, making it especially suitable for long-read data while still being applicable to short-read data.
The Rasusa task performs subsampling on the input raw reads. When enabled, it subsamples reads to a target depth of 250X, using the estimated genome length either generated by the datasets_genome_length task or provided directly by the user. The task is disabled by default; users can enable it by setting the skip_rasusa variable to false. The target subsampling depth can also be adjusted by modifying the coverage variable.
coverage input parameter
This parameter specifies the target coverage for subsampling. The default value is 250, but users can adjust it as needed.
Non-deterministic output(s)
This task may yield non-deterministic outputs since it performs random subsampling. To ensure reproducibility, set a value for the optional seed input variable.
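A stand-alone run equivalent in spirit to this task is sketched below. The flag names follow Rasusa's v2.x reads subcommand as documented upstream; verify them against the container version pinned by the workflow, and treat all file names as placeholders.

```python
# Sketch: subsample reads to ~250x over an expected genome length of 12,500 bp
# with a fixed seed for reproducibility. Flags per Rasusa v2.x; paths are
# placeholders.
import subprocess

subprocess.run(
    [
        "rasusa", "reads",
        "--coverage", "250",
        "--genome-size", "12500",
        "--seed", "42",
        "--output", "subsampled.fastq.gz",
        "extracted.fastq.gz",
    ],
    check=True,
)
```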
Rasusa Technical Details
| Links | |
|---|---|
| Task | task_rasusa.wdl |
| Software Source Code | Rasusa on GitHub |
| Software Documentation | Rasusa on GitHub |
| Original Publication(s) | Rasusa: Randomly subsample sequencing reads to a specified coverage |
clean_check_reads
The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:
- Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
- The proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion basepairs are in either the read1 or read2 file.
- Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
- Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
- Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than min_coverage.
Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen variable to true. Default values vary between the ONT and PE workflow. The rationale for these default values can be found below:
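The pass/fail logic described above reduces to a handful of comparisons. The sketch below mirrors those checks in plain Python with illustrative default thresholds; it is not the screen task's code, and it omits the fastq-scan and mash steps that produce the input metrics.

```python
# Illustration of the read-screening checks. In the actual task the metric
# values come from fastq-scan and mash; here they are supplied directly.
def screen_reads(
    total_reads: int,
    total_basepairs: int,
    est_genome_size: int,
    est_coverage: float,
    min_reads: int = 50,
    min_basepairs: int = 15_000,
    min_genome_size: int = 1_500,
    max_genome_size: int = 2_673_870,
    min_coverage: float = 10,
):
    """Return the list of failed checks; an empty list means the sample passes."""
    failures = []
    if total_reads <= min_reads:
        failures.append("too few reads")
    if total_basepairs < min_basepairs:
        failures.append("too few basepairs")
    if not min_genome_size <= est_genome_size <= max_genome_size:
        failures.append("estimated genome size out of range")
    if est_coverage < min_coverage:
        failures.append("estimated coverage too low")
    return failures

# Passes all checks -> prints []
print(screen_reads(total_reads=12_000, total_basepairs=6_000_000,
                   est_genome_size=12_000, est_coverage=480.0))
```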
Default Thresholds and Rationales
| Variable | Description | Default Value | Rationale |
|---|---|---|---|
| estimated_genome_length | Default genome length estimate | 12500 | Approximates the median RNA virus genome length |
| min_reads | A sample will fail the read screening task if its total number of reads is less than or equal to min_reads | 50 | Minimum number of base pairs for 10x coverage of the Hepatitis delta virus (genus Deltavirus) divided by 300 (longest Illumina read length) |
| min_basepairs | A sample will fail the read screening if there are fewer than min_basepairs basepairs | 15000 | Greater than 10x coverage of the Hepatitis delta virus (genus Deltavirus) |
| min_genome_size | A sample will fail the read screening if the estimated genome size is smaller than min_genome_size | 1500 | Based on the Hepatitis delta virus genome, the smallest viral genome as of 2024-04-11 (1,700 bp) |
| max_genome_size | A sample will fail the read screening if the estimated genome size is larger than max_genome_size | 2673870 | Based on the Pandoravirus salinus genome, the largest viral genome (2,673,870 bp), with 2 Mbp added |
| min_coverage | A sample will fail the read screening if the estimated genome coverage is less than min_coverage | 10 | A bare-minimum coverage for genome characterization; higher coverage would be required for high-quality phylogenetics |
| min_proportion | A sample will fail the read screening if fewer than min_proportion percent of basepairs are in either the read1 or read2 file | 40 | Ensures the forward and reverse read files each contain a comparable share of the data (PE workflow only) |
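The sketch below approximates these checks with plain shell commands plus a mash sketch of the reads for the genome-size and coverage estimates. The actual task uses fastq-scan and its own internal logic, so treat the file names and parsing here as illustrative only.

```bash
# Illustrative approximations of the screen checks (not the exact task commands).
READS=$(zcat clean_R1.fastq.gz | awk 'END {printf "%d\n", NR / 4}')
BASEPAIRS=$(zcat clean_R1.fastq.gz clean_R2.fastq.gz | awk 'NR % 4 == 2 {sum += length($0)} END {printf "%d\n", sum}')

# mash sketch of reads (-r) prints the estimated genome size and coverage to stderr.
mash sketch -o sample -k 21 -m 3 -r clean_R1.fastq.gz 2> mash.log
GENOME_SIZE=$(grep "Estimated genome size" mash.log | awk '{print $NF}')
COVERAGE=$(grep "Estimated coverage" mash.log | awk '{print $NF}')

# Compare against the default thresholds from the table above.
awk -v n="$READS"       'BEGIN {if (n <= 50) print "FAIL: min_reads"}'
awk -v b="$BASEPAIRS"   'BEGIN {if (b < 15000) print "FAIL: min_basepairs"}'
awk -v g="$GENOME_SIZE" 'BEGIN {if (g < 1500 || g > 2673870) print "FAIL: genome size"}'
awk -v c="$COVERAGE"    'BEGIN {if (c < 10) print "FAIL: min_coverage"}'
```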
Screen Technical Details
| Links | |
|---|---|
| Task | task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task) |
De novo Assembly and Reference Selection
These tasks are only performed if no reference genome is provided
In this workflow, de novo assembly is primarily used to facilitate the selection of a closely related reference genome, though high-quality de novo assemblies can be used for downstream analysis. If the user provides an input reference_fasta, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:

- spades
- megahit
- checkv_denovo
- quast_denovo
- skani
spades
SPAdes (St. Petersburg genome assembler) is a de novo assembly tool that uses de Bruijn graphs to assemble genomes from Illumina short reads.
It is run with the --metaviral option, which is recommended for viral genomes. The MetaviralSPAdes pipeline consists of three independent steps: ViralAssembly, which finds putative viral subgraphs in a metagenomic assembly graph and generates contigs from them; ViralVerify, which checks whether the resulting contigs are of viral origin; and ViralComplete, which checks whether those contigs represent complete viral genomes. For more details, please see the original publication.
MetaviralSPAdes was selected as the default assembler because it produces the most complete viral genomes within TheiaViral, determined by CheckV quality assessment (see task checkv for technical details).
call_metaviralspades input parameter
This parameter controls whether or not the spades task is called by the workflow. By default, call_metaviralspades is set to true because MetaviralSPAdes is used as the primary assembler. MetaviralSPAdes is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with MetaviralSPAdes, they can set the call_metaviralspades variable to false to bypass the spades task and instead de novo assemble using MEGAHIT (see task megahit for details). Additionally, if the spades task fails during execution, the workflow will automatically fall back to using MEGAHIT for de novo assembly.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
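A minimal sketch of a MetaviralSPAdes run on cleaned paired-end reads is shown below; file names and resource values are placeholders, and the task may pass additional options.

```bash
# Assemble cleaned paired-end reads with the metaviral pipeline.
# File names and resources are placeholders; the WDL task may add further options.
spades.py \
  --metaviral \
  -1 clean_R1.fastq.gz \
  -2 clean_R2.fastq.gz \
  -t 4 -m 16 \
  -o metaviralspades_out
# Assembled contigs are written to metaviralspades_out/contigs.fasta
```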
MetaviralSPAdes Technical Details
| Links | |
|---|---|
| Task | task_spades.wdl |
| Software Source Code | SPAdes on GitHub |
| Software Documentation | SPAdes Manual |
| Original Publication(s) | TheiaProk: SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing TheiaViral: MetaviralSPAdes: assembly of viruses from metagenomic data |
megahit
The MEGAHIT assembler is a fast and memory-efficient de novo assembler that can handle large datasets. While optimized for metagenomics, MEGAHIT also performs well on single-genome assemblies, making it a versatile choice for various assembly tasks.
MEGAHIT uses a multiple k-mer strategy that can be beneficial for assembling genomes with varying coverage levels, which is common in metagenomic samples. It constructs succinct de Bruijn graphs to efficiently represent the assembly process, allowing it to handle large and complex datasets with reduced memory usage.
This task is optional and turned off by default. It can be enabled by setting the call_metaviralspades parameter to false, and it is also used automatically as a fallback if the spades task fails during execution (see task spades for more details).
Non-deterministic output(s)
This task may yield non-deterministic outputs.
MEGAHIT Technical Details
| Links | |
|---|---|
| Task | task_megahit.wdl |
| Software Source Code | MEGAHIT on GitHub |
| Software Documentation | MEGAHIT on GitHub |
| Original Publication(s) | MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph |
skani
The skani task is used to identify and select the reference genome most closely related to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 viral genomes. This database was constructed with the following methodology:

1. Extracting all complete NCBI viral genomes, excluding RefSeq accessions (to avoid redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae). Some complete gene accessions, rather than complete genomes, are included because the NCBI datasets completeness parameters are susceptible to metadata errors.
2. Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA.
3. Adding one SARS-CoV-2 genome for each major Pangolin lineage.
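For orientation, a minimal sketch of querying an assembly against a pre-sketched skani database is shown below; the database path, file names, and the column layout of the report are assumptions, and the task's exact arguments may differ.

```bash
# Compare the de novo assembly against a directory of skani sketches.
# Paths are placeholders; the task's exact arguments may differ.
skani search denovo_assembly.fasta -d skani_viral_db/ -o skani_hits.tsv

# Inspect the best hits; the ANI column of the report identifies the closest reference
# (column position assumed from skani's default tab-separated output).
sort -t $'\t' -k3,3gr skani_hits.tsv | head
```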
Skani Technical Details
| Links | |
|---|---|
| Task | task_skani.wdl |
| Software Source Code | Skani on GitHub |
| Software Documentation | Skani Documentation |
| Original Publication(s) | Fast and robust metagenomic sequence comparison through sparse chaining with skani |
Reference Mapping
bwa
The bwa task is a wrapper for the BWA alignment tool. It utilizes the BWA-MEM algorithm to map cleaned reads to the reference genome, either selected by the skani task or provided by the user input reference_fasta. This creates a BAM file which is then sorted using the command samtools sort.
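A minimal sketch of this mapping step is shown below; file names are placeholders.

```bash
# Index the selected reference, map cleaned reads with BWA-MEM, and coordinate-sort the output.
bwa index reference.fasta
bwa mem -t 4 reference.fasta clean_R1.fastq.gz clean_R2.fastq.gz \
  | samtools sort -@ 4 -o sample.sorted.bam -
samtools index sample.sorted.bam
```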
BWA Technical Details
| Links | |
|---|---|
| Task | task_bwa.wdl |
| Software Source Code | BWA on GitHub |
| Software Documentation | BWA Documentation |
| Original Publication(s) | Fast and accurate short read alignment with Burrows-Wheeler transform |
read_mapping_stats: Mapping Statistics
The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
read_mapping_stats Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Variant Calling and Consensus Generation
ivar_variants: Variant Calling
iVar uses the outputs of samtools mpileup to call single nucleotide variants (SNVs) and insertions/deletions (indels). Several key parameters can be set to determine the stringency of variant calling, including minimum quality, minimum allele frequency, and minimum depth.
This task returns a VCF file containing all called variants, the number of detected variants, and the proportion of those variants with allele frequencies between 0.6 and 0.9 (also known as intermediate variants).
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
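A minimal sketch of the pileup-to-iVar variant-calling step using the default thresholds above is shown below; file names are placeholders and the task's exact mpileup options may differ.

```bash
# Pipe a samtools pileup into iVar variant calling with the default thresholds
# (-q = minimum quality, -t = min_allele_freq, -m = min_depth).
# File names are placeholders; the task's exact mpileup options may differ.
samtools mpileup -aa -A -d 0 -B -Q 0 -f reference.fasta sample.sorted.bam \
  | ivar variants -p sample_variants -r reference.fasta -q 20 -t 0.6 -m 10
# iVar writes its native variant table to sample_variants.tsv
```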
iVar Technical Details
| Links | |
|---|---|
| Task | task_ivar_variant_call.wdl |
| Software Source Code | Ivar on GitHub |
| Software Documentation | Ivar Documentation |
| Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
ivar_consensus: Consensus Assembly
iVar's consensus tool generates a reference-based consensus assembly. Several parameters can be set that determine the stringency of the consensus assembly, including minimum quality, minimum allele frequency, and minimum depth.
This task is functional for segmented viruses by iteratively executing iVar on a contig-by-contig basis and concatenating the resulting consensus contigs.
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
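Below is a minimal sketch of the contig-by-contig approach described above, looping over the reference's sequence names and concatenating the per-segment consensus sequences; file names are placeholders and the task's exact commands may differ.

```bash
# Generate a consensus per reference contig/segment, then concatenate the results.
# File names are placeholders; the task's exact commands may differ.
samtools faidx reference.fasta
samtools index sample.sorted.bam
cut -f1 reference.fasta.fai | while read -r CONTIG; do
  samtools mpileup -aa -A -d 0 -B -Q 0 -r "$CONTIG" sample.sorted.bam \
    | ivar consensus -p "consensus_${CONTIG}" -q 20 -t 0.6 -m 10
done
cat consensus_*.fa > sample.consensus.fasta
```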
iVar Technical Details
| Links | |
|---|---|
| Task | task_ivar_consensus.wdl |
| Software Source Code | Ivar on GitHub |
| Software Documentation | Ivar Documentation |
| Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
Assembly Evaluation and Consensus Quality Control
quast_denovo
QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
QUAST Technical Details
| Links | |
|---|---|
| Task | task_quast.wdl |
| Software Source Code | QUAST on GitHub |
| Software Documentation | QUAST Manual on SourceForge |
| Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
checkv_denovo & checkv_consensus
CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.
By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports both "weighted_contamination" and "weighted_completeness", which are average percents calculated across the total assembly that are weighted by contig length.
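A minimal sketch of a CheckV run, plus one way to derive a length-weighted summary from its per-contig report, is shown below; the database path is a placeholder and the column names are assumptions based on CheckV's quality_summary.tsv.

```bash
# Run CheckV end-to-end on an assembly (database path is a placeholder).
checkv end_to_end assembly.fasta checkv_out -d /path/to/checkv-db -t 4

# Length-weighted completeness across contigs
# (column names assumed from CheckV's quality_summary.tsv).
awk -F'\t' '
  NR == 1 {for (i = 1; i <= NF; i++) col[$i] = i; next}
  {
    len  = $(col["contig_length"])
    comp = $(col["completeness"])
    if (comp != "NA") {num += len * comp; den += len}
  }
  END {if (den > 0) printf "weighted_completeness: %.2f\n", num / den}
' checkv_out/quality_summary.tsv
```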
CheckV Technical Details
| Links | |
|---|---|
| Task | task_checkv.wdl |
| Software Source Code | CheckV on Bitbucket |
| Software Documentation | CheckV Documentation |
| Original Publication(s) | CheckV assesses the quality and completeness of metagenome-assembled viral genomes |
consensus_qc: Assembly Statistics
The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, "N" bases, degenerate bases, and an estimate of the percent coverage to the reference genome.
consensus_qc Technical Details
| Links | |
|---|---|
| Task | task_consensus_qc.wdl |
Versioning
versioning: Version Capture
The versioning task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical details
| Links | |
|---|---|
| Task | task_versioning.wdl |
Taxonomic Identification
ete4_identify
The ete4_identify task parses the NCBI taxonomy hierarchy from a user's inputted taxonomy and desired taxonomic rank. This task returns a taxon ID, name, and rank, which facilitates downstream functions, including read classification, targeted read extraction, and genomic characterization modules.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
rank a.k.a read_extraction_rank input parameter
Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.
Important
- The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: the taxon ID for Rhabdoviridae (11270), the name "Rhabdoviridae", and the rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an input genus.
ete4 Identify Technical Details
| Links | |
|---|---|
| Task | task_ete4_taxon_id.wdl |
| Software Source Code | ete4 on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Read Quality Control, Trimming, and Filtering
NanoPlot: Read Quantification
NanoPlot is used for the determination of mean quality scores, read lengths, and number of reads. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
NanoPlot Technical Details
| Links | |
|---|---|
| Task | task_nanoplot.wdl |
| Software Source Code | NanoPlot on GitHub |
| Software Documentation | NanoPlot Documentation |
| Original Publication(s) | NanoPack2: population-scale evaluation of long-read sequencing data |
porechop
Porechop is a tool for finding and removing adapters from ONT data. Adapters on the ends of reads are trimmed, and when a read has an adapter in the middle, the read is split into two.
The porechop task is optional and is turned off by default. It can be enabled by setting the call_porechop parameter to true.
Porechop Technical Details
| Links | |
|---|---|
| WDL Task | task_porechop.wdl |
| Software Source Code | Porechop on GitHub |
| Software Documentation | https://github.com/rrwick/Porechop#porechop |
Nanoq: Read Filtering
Reads are filtered by length and quality using nanoq. By default, sequences shorter than 500 basepairs or with quality scores lower than 10 are filtered out to improve assembly accuracy. These defaults can be modified by the user.
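A minimal sketch of the length and quality filter using the defaults above is shown below; file names are placeholders.

```bash
# Keep reads that are at least 500 bp long with a quality of at least Q10.
# File names are placeholders.
nanoq -i reads.fastq.gz --min-len 500 --min-qual 10 -o filtered.fastq.gz
```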
Nanoq Technical Details
| Links | |
|---|---|
| Task | task_nanoq.wdl |
| Software Source Code | Nanoq on GitHub |
| Software Documentation | Nanoq Documentation |
| Original Publication(s) | Nanoq: ultra-fast quality control for nanopore reads |
ncbi_scrub_se
All reads of human origin, including their mates, are removed using NCBI's Human Read Removal Tool (HRRT).
HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.
NCBI-Scrub Technical Details
| Links | |
|---|---|
| Task | task_ncbi_scrub.wdl |
| Software Source Code | HRRT on GitHub |
| Software Documentation | HRRT on NCBI |
host_decontaminate: Host Read Decontamination
Host genetic data is frequently sequenced incidentally alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning them to a reference host genome that is either provided directly or acquired on the fly. The reference host genome can be supplied in the host input field as an assembly FASTA (with is_genome set to "true"), as an NCBI Taxonomy-compatible taxon, or as an NCBI assembly accession (with is_accession set to "true"). Host Decontaminate maps the input reads to the host genome using minimap2, reports mapping statistics against this host genome, and outputs the unaligned, dehosted reads.
The detailed steps and tasks are as follows:
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Download Accession
The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.
Within Host Decontaminate, this task downloads the host reference genome identified by the datasets_genome_length task or provided directly as an assembly accession. The downloaded host genome is then used for host read mapping and removal.
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_ncbi_datasets.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Map Reads to Host
minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.
The mode used in this task is map-ont which is the default mode for long reads and indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.
For more information regarding modes and the available options for minimap2, please see the minimap2 manpage
minimap2 Technical Details
| Links | |
|---|---|
| Task | task_minimap2.wdl |
| Software Source Code | minimap2 on GitHub |
| Software Documentation | minimap2 |
| Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
Extract Unaligned Reads
The bam_to_unaligned_fastq task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.
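A minimal sketch of the host mapping and unaligned-read extraction steps for single-end ONT reads is shown below; file names are placeholders and the task's exact samtools flags may differ.

```bash
# Map reads to the host genome and keep only reads that did NOT align (SAM flag 4 = unmapped).
# File names are placeholders; the task's exact samtools flags may differ.
minimap2 -ax map-ont host_genome.fasta reads.fastq.gz > host_aln.sam
samtools sort -o host_aln.sorted.bam host_aln.sam
samtools fastq -f 4 host_aln.sorted.bam | gzip > dehosted_reads.fastq.gz
```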
parse_mapping Technical Details
| Links | |
|---|---|
| Task | task_parse_mapping.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
assembly_metrics: Mapping Statistics
The assembly_metrics task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
assembly_metrics Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Host Decontaminate Technical Details
| Links | |
|---|---|
| Subworkflow | wf_host_decontaminate.wdl |
rasusa
Rasusa is a tool to randomly subsample sequencing reads to a specified coverage without assuming that all reads are of equal length, making it especially suitable for long-read data while still being applicable to short-read data.
The Rasusa task performs subsampling on the input raw reads. By default, it subsamples reads to a target depth of 250X, using the estimated genome length either generated by the ncbi_identify task or provided directly by the user. Subsampling is disabled by default; users can enable it by setting the skip_rasusa variable to false. The target subsampling depth can also be adjusted by modifying the coverage variable.
coverage input parameter
This parameter specifies the target coverage for subsampling. The default value is 250, but users can adjust it as needed.
Non-deterministic output(s)
This task may yield non-deterministic outputs since it performs random subsampling. To ensure reproducibility, set a value for the rasusa_seed optional input variable.
Rasusa Technical Details
| Links | |
|---|---|
| Task | task_rasusa.wdl |
| Software Source Code | Rasusa on GitHub |
| Software Documentation | Rasusa on GitHub |
| Original Publication(s) | Rasusa: Randomly subsample sequencing reads to a specified coverage |
clean_check_reads
The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:
- Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
- Proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion percent of basepairs are in either the read1 or read2 file.
- Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
- Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
- Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than min_coverage.
Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen variable to true. Default values vary between the ONT and PE workflows. The rationale for these default values can be found below:
Default Thresholds and Rationales
| Variable | Description | Default Value | Rationale |
|---|---|---|---|
| estimated_genome_length | Default genome length estimate | 12500 | Approximates the median RNA virus genome length |
| min_reads | A sample will fail the read screening task if its total number of reads is less than or equal to min_reads | 50 | Minimum number of base pairs for 10x coverage of the Hepatitis delta virus (genus Deltavirus) divided by 300 (longest Illumina read length) |
| min_basepairs | A sample will fail the read screening if there are fewer than min_basepairs basepairs | 15000 | Greater than 10x coverage of the Hepatitis delta virus (genus Deltavirus) |
| min_genome_size | A sample will fail the read screening if the estimated genome size is smaller than min_genome_size | 1500 | Based on the Hepatitis delta virus genome, the smallest viral genome as of 2024-04-11 (1,700 bp) |
| max_genome_size | A sample will fail the read screening if the estimated genome size is larger than max_genome_size | 2673870 | Based on the Pandoravirus salinus genome, the largest viral genome (2,673,870 bp), with 2 Mbp added |
| min_coverage | A sample will fail the read screening if the estimated genome coverage is less than min_coverage | 10 | A bare-minimum coverage for genome characterization; higher coverage would be required for high-quality phylogenetics |
| min_proportion | A sample will fail the read screening if fewer than min_proportion percent of basepairs are in either the read1 or read2 file | 40 | Ensures the forward and reverse read files each contain a comparable share of the data (PE workflow only) |
Screen Technical Details
| Links | |
|---|---|
| Task | task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task) |
Read Classification and Extraction
metabuli
The metabuli task is used to classify and extract reads against a reference database. Metabuli uses a novel k-mer structure, called metamer, to analyze both amino acid (AA) and DNA sequences. It leverages AA conservation for sensitive homology detection and DNA mutations for specific differentiation between closely related taxa.
cpu / memory input parameters
Increasing the memory and cpus allocated to Metabuli can substantially increase throughput.
extract_unclassified input parameter
This parameter determines whether unclassified reads should also be extracted and combined with the taxon-specific extracted reads. By default, this is set to false, meaning that only reads classified to the specified input taxon will be extracted.
Descendant taxa reads are extracted
This task will extract reads classified to the input taxon and all of its descendant taxa. The rank input parameter controls the extraction of reads classified at the specified rank and all subordinate taxonomic levels. See the ete4_identify task under the Taxonomic Identification section above for more details on the rank input parameter.
Metabuli Technical Details
| Links | |
|---|---|
| Task | task_metabuli.wdl |
| Software Source Code | Metabuli on GitHub |
| Software Documentation | Metabuli Documentation |
| Original Publication(s) | Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA |
De novo Assembly and Reference Selection
These tasks are only performed if no reference genome is provided
In this workflow, de novo assembly is used solely to facilitate the selection of a closely related reference genome. If the user provides an input reference_fasta, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:

- raven
- flye
- checkv_denovo
- quast_denovo
- skani
- ncbi_datasets
raven
The raven task is used to create a de novo assembly from cleaned reads. Raven is an overlap-layout-consensus based assembler that accelerates the overlap step, constructs an assembly graph from reads pre-processed with pile-o-grams, applies a novel and robust graph simplification method based on graph drawings, and polishes unambiguous graph paths using Racon.
Based on internal benchmarking against Flye and results reported by Cook et al. (2024), Raven is faster, produces more contiguous assemblies, and yields more complete genomes within TheiaViral according to CheckV quality assessment (see task checkv for technical details).
call_raven input parameter
This parameter controls whether or not the raven task is called by the workflow. By default, call_raven is set to true because Raven is used as the primary assembler. Raven is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with Raven, they can set the call_raven variable to false to bypass the raven task and instead de novo assemble using Flye (see task flye for details). Additionally, if the Raven task fails during execution, the workflow will automatically fall back to using Flye for de novo assembly.
Error traceback
Raven may fail with cryptic "segmentation fault" (segfault) errors or by failing to produce an output file. It is difficult to trace back the source of these issues, though increasing the memory parameter may resolve some errors.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
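For orientation, a minimal sketch of a Raven run is shown below; the file name and thread count are placeholders and the task may pass additional options (e.g. polishing rounds).

```bash
# Assemble cleaned ONT reads with Raven; the assembly is written to stdout in FASTA format.
# File name and thread count are placeholders; the task may pass additional options.
raven --threads 4 clean_reads.fastq.gz > raven_assembly.fasta
```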
Raven Technical Details
| Links | |
|---|---|
| Task | task_raven.wdl |
| Software Source Code | Raven on GitHub |
| Software Documentation | Raven Documentation |
| Original Publication(s) | Time- and memory-efficient genome assembly with Raven |
flye
Flye is a de novo assembler for long read data using repeat graphs. Compared to de Bruijn graphs, which require exact k-mer matches, repeat graphs can use approximate matches which better tolerates the error rate of ONT data.
The flye task is optional and turned off by default; it can be enabled by setting the call_raven parameter to false. It is also used as a fallback option if the raven task fails during execution (see task raven for more details).
read_type input parameter
This input parameter specifies the type of sequencing reads being used for assembly. This parameter significantly impacts the assembly process and should match the characteristics of your input data. Below are the available options:
| Parameter | Explanation |
|---|---|
| --nano-hq (default) | Optimized for ONT high-quality reads, such as Guppy5+ SUP or Q20 (<5% error). Recommended for ONT reads processed with Guppy5 or newer |
| --nano-raw | ONT regular reads, pre-Guppy5 (<20% error) |
| --nano-corr | ONT reads corrected with other methods (<3% error) |
| --pacbio-raw | PacBio regular CLR reads (<20% error) |
| --pacbio-corr | PacBio reads corrected with other methods (<3% error) |
| --pacbio-hifi | PacBio HiFi reads (<1% error) |
Refer to the Flye documentation for detailed guidance on selecting the appropriate read_type based on your sequencing data and additional optional parameters.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
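A minimal sketch of a Flye run using the default read type is shown below; file names are placeholders and the task may pass additional options.

```bash
# Assemble cleaned ONT reads with the default high-quality read model.
# File names are placeholders; the task may pass additional options.
flye --nano-hq clean_reads.fastq.gz --out-dir flye_out --threads 4
# The assembly is written to flye_out/assembly.fasta
```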
Flye Technical Details
| Links | |
|---|---|
| WDL Task | task_flye.wdl |
| Software Source Code | Flye on GitHub |
| Software Documentation | Flye Documentation |
| Original Publication(s) | Assembly of long, error-prone reads using repeat graphs |
skani
The skani task is used to identify and select the reference genome most closely related to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 viral genomes. This database was constructed with the following methodology:

1. Extracting all complete NCBI viral genomes, excluding RefSeq accessions (to avoid redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae). Some complete gene accessions, rather than complete genomes, are included because the NCBI datasets completeness parameters are susceptible to metadata errors.
2. Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA.
3. Adding one SARS-CoV-2 genome for each major Pangolin lineage.
Skani Technical Details
| Links | |
|---|---|
| Task | task_skani.wdl |
| Software Source Code | Skani on GitHub |
| Software Documentation | Skani Documentation |
| Original Publication(s) | Fast and robust metagenomic sequence comparison through sparse chaining with skani |
Reference Mapping
minimap2
minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.
The mode used in this task is map-ont with additional long-read-specific parameters (the -L --cs --MD flags) to align ONT reads to the reference genome. These specialized parameters are essential for proper handling of long read error profiles, generation of detailed alignment information, and improved mapping accuracy for long reads.
map-ont is the default mode for long reads and it indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.
For more information regarding modes and the available options for minimap2, please see the minimap2 manpage
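A minimal sketch of the alignment command described above is shown below; file names are placeholders.

```bash
# Align cleaned ONT reads to the selected reference with the long-read-specific flags described above.
# File names are placeholders.
minimap2 -ax map-ont -L --cs --MD reference.fasta clean_reads.fastq.gz > aligned.sam
```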
minimap2 Technical Details
| Links | |
|---|---|
| Task | task_minimap2.wdl |
| Software Source Code | minimap2 on GitHub |
| Software Documentation | minimap2 |
| Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
parse_mapping
The sam_to_sorted_bam sub-task converts the SAM file output by the minimap2 task to a BAM file, sorts the BAM file by coordinate, and creates a BAM index file.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
parse_mapping Technical Details
| Links | |
|---|---|
| Task | task_parse_mapping.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
read_mapping_stats: Mapping Statistics
The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
read_mapping_stats Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
fasta_utilities
The fasta_utilities task utilizes samtools to index a reference fasta file.
This reference is selected by the skani task or provided by the user input reference_fasta. This indexed reference genome is used for downstream variant calling and consensus generation tasks.
fasta_utilities Technical Details
| Links | |
|---|---|
| Task | task_fasta_utilities.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Variant Calling and Consensus Generation
clair3
Clair3 performs deep learning-based variant detection using a multi-stage approach. The process begins with pileup-based calling for initial variant identification, followed by full-alignment analysis for comprehensive variant detection. Results are merged into a final high-confidence call set.
The variant calling pipeline employs specialized neural networks trained on ONT data to accurately identify:

- Single nucleotide variants (SNVs)
- Small insertions and deletions (indels)
- Structural variants
clair3_model input parameter
This parameter specifies the clair3 model to use for variant calling. The default is set to "r1041_e82_400bps_sup_v500", but users may select from other available models that clair3 was trained on, which may yield better results depending on the basecaller and data type. The following models are available:
"ont""ont_guppy2""ont_guppy5""r941_prom_sup_g5014""r941_prom_hac_g360+g422""r941_prom_hac_g238""r1041_e82_400bps_sup_v500""r1041_e82_400bps_hac_v500""r1041_e82_400bps_sup_v410""r1041_e82_400bps_hac_v410"
Default Parameters and Filtering
In this workflow, clair3 is run with nearly all default parameters. Note that the VCF file produced by the clair3 task is unfiltered and does not represent the final set of variants that will be included in the final consensus genome. A filtered VCF file is generated by the bcftools_consensus task. The filtering parameters are applied as follows:

- The min_map_quality parameter is applied before calling variants.
- The min_depth and min_allele_freq parameters are applied after variant calling, during consensus genome construction.
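For orientation, a minimal sketch of a Clair3 invocation with the default model is shown below; paths are placeholders (including the model path), and the task's exact arguments may differ.

```bash
# Call variants from the sorted, indexed BAM with the default ONT model.
# Paths, including the model path, are placeholders; the task's exact arguments may differ.
run_clair3.sh \
  --bam_fn=sample.sorted.bam \
  --ref_fn=reference.fasta \
  --model_path=/opt/models/r1041_e82_400bps_sup_v500 \
  --platform=ont \
  --threads=4 \
  --output=clair3_out
# The unfiltered calls are written to clair3_out/merge_output.vcf.gz
```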
Clair3 Technical Details
| Links | |
|---|---|
| Task | task_clair3.wdl |
| Software Source Code | Clair3 on GitHub |
| Software Documentation | Clair3 Documentation |
| Original Publication(s) | Symphonizing pileup and full-alignment for deep learning-based long-read variant calling |
parse_mapping
The mask_low_coverage sub-task is used to mask low coverage regions in the reference_fasta file to improve the accuracy of the final consensus genome. Coverage thresholds are defined by the min_depth parameter, which specifies the minimum read depth required for a base to be retained. Bases falling below this threshold are replaced with "N"s to clearly mark low confidence regions. The masked reference is then combined with variants from the clair3 task to produce the final consensus genome.
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
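The task's exact commands are not reproduced here; as a rough equivalent, the sketch below derives low-coverage intervals and masks them with bedtools, which is one common way to implement this kind of masking. The bedtools-based approach, file names, and threshold handling are illustrative assumptions.

```bash
# Identify intervals with depth below min_depth (10) and mask them in the reference with "N"s.
# This bedtools-based approach is an illustrative stand-in for the task's own logic.
bedtools genomecov -bga -ibam sample.sorted.bam \
  | awk '$4 < 10' > low_coverage.bed
bedtools maskfasta -fi reference.fasta -bed low_coverage.bed -fo reference.masked.fasta
```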
parse_mapping Technical Details
| Links | |
|---|---|
| Task | task_parse_mapping.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
bcftools_consensus
The bcftools_consensus task generates a consensus genome assembly by applying variants from the clair3 task to a masked reference genome. It uses bcftools to filter variants based on the min_depth and min_allele_freq input parameters, left-aligns and normalizes indels, indexes the VCF file, and generates a consensus genome in FASTA format. Reference bases are substituted with filtered variants where applicable, preserved in regions without variant calls, and replaced with "N"s in areas masked by the mask_low_coverage task.
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
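A minimal sketch of the filtering and consensus steps is shown below; file names and the filter expression (including the VCF tags used for depth and allele frequency) are assumptions based on the parameters described above.

```bash
# Filter calls by depth and allele frequency, normalize indels,
# and apply them to the low-coverage-masked reference.
# File names and the filter expression (FORMAT/DP, FORMAT/AF) are assumptions.
bcftools view -i 'FORMAT/DP >= 10 && FORMAT/AF >= 0.6' clair3_out/merge_output.vcf.gz \
  | bcftools norm -f reference.masked.fasta -Oz -o filtered.norm.vcf.gz -
bcftools index filtered.norm.vcf.gz
bcftools consensus -f reference.masked.fasta filtered.norm.vcf.gz > sample.consensus.fasta
```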
bcftools_consensus Technical Details
| Links | |
|---|---|
| Task | task_bcftools_consensus.wdl |
| Software Source Code | bcftools on GitHub |
| Software Documentation | bcftools Manual Page |
| Original Publication(s) | Twelve Years of SAMtools and BCFtools |
Assembly Evaluation and Consensus Quality Control
quast_denovo
QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
QUAST Technical Details
| Links | |
|---|---|
| Task | task_quast.wdl |
| Software Source Code | QUAST on GitHub |
| Software Documentation | QUAST Manual on SourceForge |
| Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
checkv_denovo & checkv_consensus
CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.
By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports both "weighted_contamination" and "weighted_completeness", which are average percents calculated across the total assembly that are weighted by contig length.
CheckV Technical Details
| Links | |
|---|---|
| Task | task_checkv.wdl |
| Software Source Code | CheckV on Bitbucket |
| Software Documentation | CheckV Documentation |
| Original Publication(s) | CheckV assesses the quality and completeness of metagenome-assembled viral genomes |
consensus_qc: Assembly Statistics
The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, "N" bases, degenerate bases, and an estimate of the percent coverage to the reference genome.
consensus_qc Technical Details
| Links | |
|---|---|
| Task | task_consensus_qc.wdl |
TheiaViral_Panel operates by identifying reads assigned to the input taxon IDs (specified in the taxon_ids input variable), extracting those reads, and assembling and characterizing them using the same modules as TheiaViral_Illumina_PE. Multiple assemblies and characterizations can be generated from a single sample if reads are assigned to multiple taxon IDs.
Versioning
versioning: Version Capture
The versioning task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical details
| Links | |
|---|---|
| Task | task_versioning.wdl |
Read Quality Control, Trimming, Filtering, Identification
read_QC_trim: Read Quality Trimming, Adapter Removal, Quantification, and Identification
read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.
By default, read_processing is set to "trimmomatic". To use fastp instead, set read_processing to "fastp". These tasks are mutually exclusive.
Trimmomatic: Read Trimming (default)
Read processing is available via Trimmomatic by default.
Trimmomatic trims low-quality regions of Illumina paired-end or single-end reads with a sliding window (with a default window size of 4, specified with trim_window_size), cutting once the average quality within the window falls below the trim_quality_trim_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below trim_minlen (default of 75 for paired-end, 25 for single-end).
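A minimal sketch of paired-end trimming with the default window and thresholds above is shown below; file names are placeholders.

```bash
# Sliding-window quality trimming (window 4, mean quality 20) with a 75 bp length floor.
# File names are placeholders.
trimmomatic PE -phred33 \
  raw_R1.fastq.gz raw_R2.fastq.gz \
  trim_R1.fastq.gz unpaired_R1.fastq.gz \
  trim_R2.fastq.gz unpaired_R2.fastq.gz \
  SLIDINGWINDOW:4:20 MINLEN:75
```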
Trimmomatic Technical Details
| Links | |
|---|---|
| Task | task_trimmomatic.wdl |
| Software Source Code | Trimmomatic on GitHub |
| Software Documentation | Trimmomatic Website |
| Original Publication(s) | Trimmomatic: a flexible trimmer for Illumina sequence data |
fastp: Read Trimming (alternative)
To activate this task, set read_processing to "fastp".
fastp trims low-quality regions of Illumina paired-end or single-end reads with a sliding window (with a default window size of 4, specified with trim_window_size), cutting once the average quality within the window falls below the trim_quality_trim_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below trim_minlen (default of 75 for paired-end, 25 for single-end).
fastp also has additional default parameters and features that are not a part of trimmomatic's default configuration.
fastp default read-trimming parameters
| Parameter | Explanation |
|---|---|
| -g | enables polyG tail trimming |
| -5 20 | enables read end-trimming |
| -3 20 | enables read end-trimming |
| --detect_adapter_for_pe | enables adapter-trimming only for paired-end reads |
Additional arguments can be passed using the fastp_args optional parameter.
Trimmomatic and fastp Technical Details
| Links | |
|---|---|
| Task | task_fastp.wdl |
| Software Source Code | fastp on GitHub |
| Software Documentation | fastp on GitHub |
| Original Publication(s) | fastp: an ultra-fast all-in-one FASTQ preprocessor |
BBDuk: Adapter Trimming and PhiX Removal
Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.
The bbduk task removes adapters from sequence reads. To do this:
- Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files (it re-pairs).
- BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
BBDuk Technical Details
| Links | |
|---|---|
| Task | task_bbduk.wdl |
| Software Source Code | BBMap on SourceForge |
| Software Documentation | BBDuk Guide (archived) |
By default, read_qc is set to "fastq_scan". To use fastqc instead, set read_qc to "fastqc". These tasks are mutually exclusive.
fastq-scan: Read Quantification (default)
Read quantification is available via fastq-scan by default.
fastq-scan quantifies the forward and reverse reads in FASTQ files. For paired-end data, it also provides the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
fastq-scan Technical Details
| Links | |
|---|---|
| Task | task_fastq_scan.wdl |
| Software Source Code | fastq-scan on GitHub |
| Software Documentation | fastq-scan on GitHub |
FastQC: Read Quantification (alternative)
To activate this task, set read_qc to "fastqc".
FastQC quantifies the forward and reverse reads in FASTQ files. For paired-end data, it also provides the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
This tool also provides a graphical visualization of the read quality.
FastQC Technical Details
| Links | |
|---|---|
| Task | task_fastqc.wdl |
| Software Source Code | FastQC on Github |
| Software Documentation | FastQC Website |
read_QC_trim Technical Details
| Links | |
|---|---|
| Subworkflow | wf_read_QC_trim_pe.wdl wf_read_QC_trim_se.wdl |
Read Extraction and Binning
kraken_parser: Parses Kraken Reports
kraken_parser lightens the computation load by taking the input taxon ID list and comparing it to the taxon IDs identified by Kraken2 in the kraken_report_clean output file. Only taxon IDs that were found by Kraken are used in the scatter portion of the workflow, which lowers the number of scatter shards the workflow requires.
kraken_parser Technical Details
| Links | |
|---|---|
| Task | task_kraken_parser.wdl |
krakentools: Read Extraction
The krakentools task extracts reads from the Kraken2 output files. It uses the KrakenTools package to extract reads classified to any user-specified taxon ID.
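A minimal sketch of per-taxon read extraction with KrakenTools is shown below; file names and the taxon ID are placeholders.

```bash
# Extract read pairs classified to taxon 11292 (and its descendants) from Kraken2 output.
# File names and the taxon ID are placeholders.
extract_kraken_reads.py \
  -k kraken2_output.txt \
  -r kraken2_report.txt \
  -s1 clean_R1.fastq.gz -s2 clean_R2.fastq.gz \
  -t 11292 --include-children \
  --fastq-output \
  -o extracted_R1.fastq -o2 extracted_R2.fastq
```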
KrakenTools Technical Details
| Links | |
|---|---|
| Task | task_krakentools.wdl |
| Software Source Code | KrakenTools on GitHub |
| Software Documentation | KrakenTools |
| Original Publication(s) | Metagenome analysis using the Kraken software suite |
Taxonomic Identification
ete4_identify
The ete4_identify task parses the NCBI taxonomy hierarchy from a user's inputted taxonomy and desired taxonomic rank. This task returns a taxon ID, name, and rank, which facilitates downstream functions, including read classification, targeted read extraction, and genomic characterization modules.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
rank a.k.a read_extraction_rank input parameter
Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.
Important
- The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: the taxon ID for Rhabdoviridae (11270), the name "Rhabdoviridae", and the rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an input genus.
ete4 Identify Technical Details
| Links | |
|---|---|
| Task | task_ete4_taxon_id.wdl |
| Software Source Code | ete4 on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
datasets_genome_length
The datasets_genome_length task uses NCBI Datasets to acquire genome length metadata for an inputted taxon and retrieve a top reference accession. This task generates a summary file of all successful hits to the input taxon, which includes each genome's accession number, completeness status, genome length, source, and other relevant metadata. The task will then calculate the average expected genome length in basepairs for the input taxon.
taxon input parameter
This parameter accepts either a NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
NCBI Datasets Technical Details
| Links | |
|---|---|
| Task | task_identify_taxon_id.wdl |
| Software Source Code | NCBI Datasets on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
TheiaViral_Panel uses the assembly and characterization tasks of TheiaViral_Illumina_PE. This allows for multiple binned taxon IDs from a single sample to undergo the same viral assembly as other samples. The following tasks are performed for each taxon ID that passes the read binning threshold:
De novo Assembly and Reference Selection
spades
SPAdes (St. Petersburg genome assembler) is a de novo assembly tool that uses de Bruijn graphs to assemble genomes from Illumina short reads.
It is run with the --metaviral option, which is recommended for viral genomes. The MetaviralSPAdes pipeline consists of three independent steps: ViralAssembly, which finds putative viral subgraphs in a metagenomic assembly graph and generates contigs from them; ViralVerify, which checks whether the resulting contigs are of viral origin; and ViralComplete, which checks whether those contigs represent complete viral genomes. For more details, please see the original publication.
MetaviralSPAdes was selected as the default assembler because it produces the most complete viral genomes within TheiaViral, determined by CheckV quality assessment (see task checkv for technical details).
call_metaviralspades input parameter
This parameter controls whether or not the spades task is called by the workflow. By default, call_metaviralspades is set to true because MetaviralSPAdes is used as the primary assembler. MetaviralSPAdes is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with MetaviralSPAdes, they can set the call_metaviralspades variable to false to bypass the spades task and instead de novo assemble using MEGAHIT (see task megahit for details). Additionally, if the spades task fails during execution, the workflow will automatically fall back to using MEGAHIT for de novo assembly.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
MetaviralSPAdes Technical Details
| Links | |
|---|---|
| Task | task_spades.wdl |
| Software Source Code | SPAdes on GitHub |
| Software Documentation | SPAdes Manual |
| Original Publication(s) | TheiaProk: SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing TheiaViral: MetaviralSPAdes: assembly of viruses from metagenomic data |
megahit
The MEGAHIT assembler is a fast and memory-efficient de novo assembler that can handle large datasets. While optimized for metagenomics, MEGAHIT also performs well on single-genome assemblies, making it a versatile choice for various assembly tasks.
MEGAHIT uses a multiple k-mer strategy that can be beneficial for assembling genomes with varying coverage levels, which is common in metagenomic samples. It constructs succinct de Bruijn graphs to efficiently represent the assembly process, allowing it to handle large and complex datasets with reduced memory usage.
This task is optional and turned off by default. It can be enabled by setting the call_metaviralspades parameter to false, and it is also used automatically as a fallback if the spades task fails during execution (see task spades for more details).
Non-deterministic output(s)
This task may yield non-deterministic outputs.
MEGAHIT Technical Details
| Links | |
|---|---|
| Task | task_megahit.wdl |
| Software Source Code | MEGAHIT on GitHub |
| Software Documentation | MEGAHIT on GitHub |
| Original Publication(s) | MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph |
skani
The skani task is used to identify and select the reference genome most closely related to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 viral genomes. This database was constructed with the following methodology:

1. Extracting all complete NCBI viral genomes, excluding RefSeq accessions (to avoid redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae). Some complete gene accessions, rather than complete genomes, are included because the NCBI datasets completeness parameters are susceptible to metadata errors.
2. Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA.
3. Adding one SARS-CoV-2 genome for each major Pangolin lineage.
Skani Technical Details
| Links | |
|---|---|
| Task | task_skani.wdl |
| Software Source Code | Skani on GitHub |
| Software Documentation | Skani Documentation |
| Original Publication(s) | Fast and robust metagenomic sequence comparison through sparse chaining with skani |
Reference Mapping
bwa
The bwa task is a wrapper for the BWA alignment tool. It utilizes the BWA-MEM algorithm to map cleaned reads to the reference genome, either selected by the skani task or provided by the user input reference_fasta. This creates a BAM file which is then sorted using the command samtools sort.
BWA Technical Details
| Links | |
|---|---|
| Task | task_bwa.wdl |
| Software Source Code | BWA on GitHub |
| Software Documentation | BWA Documentation |
| Original Publication(s) | Fast and accurate short read alignment with Burrows-Wheeler transform |
read_mapping_stats: Mapping Statistics
The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
read_mapping_stats Technical Details
| Links | |
|---|---|
| Task | task_assembly_metrics.wdl |
| Software Source Code | samtools on GitHub |
| Software Documentation | samtools |
| Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Variant Calling and Consensus Generation
ivar_variants: Variant Calling
iVar uses the outputs of samtools mpileup to call single nucleotide variants (SNVs) and insertions/deletions (indels). Several key parameters can be set to determine the stringency of variant calling, including minimum quality, minimum allele frequency, and minimum depth.
This task returns a VCF file containing all called variants, the number of detected variants, and the proportion of those variants with allele frequencies between 0.6 and 0.9 (also known as intermediate variants).
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
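Using the default thresholds above, the variant-calling step can be sketched as follows. Applying the mapping-quality filter at the mpileup step (-q) is an assumption, and the exact task command may differ.

```bash
# Hedged sketch of iVar variant calling with the default thresholds listed above
samtools mpileup -aa -A -d 0 -B -Q 0 -q 20 sample.sorted.bam | \
  ivar variants -p sample -r reference.fasta -m 10 -t 0.6
# sample.tsv lists the called variants and is subsequently converted to VCF
```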
iVar Technical Details
| Links | |
|---|---|
| Task | task_ivar_variant_call.wdl |
| Software Source Code | Ivar on GitHub |
| Software Documentation | Ivar Documentation |
| Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
ivar_consensus: Consensus Assembly
iVar's consensus tool generates a reference-based consensus assembly. Several parameters can be set that determine the stringency of the consensus assembly, including minimum quality, minimum allele frequency, and minimum depth.
This task handles segmented viruses by iteratively executing iVar on a contig-by-contig basis and concatenating the resulting consensus contigs.
min_depth input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.
min_map_quality input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.
min_allele_freq input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
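A minimal sketch of the consensus step with the default thresholds above is shown below; as with variant calling, the mpileup options are assumptions and may differ from the exact task command.

```bash
# Hedged sketch of iVar consensus generation; positions below the depth threshold become "N"
samtools mpileup -aa -A -d 0 -B -Q 0 -q 20 sample.sorted.bam | \
  ivar consensus -p sample.consensus -m 10 -t 0.6
# For segmented references, this is run per contig and the resulting FASTAs are concatenated
```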
iVar Technical Details
| Links | |
|---|---|
| Task | task_ivar_consensus.wdl |
| Software Source Code | Ivar on GitHub |
| Software Documentation | Ivar Documentation |
| Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
Assembly Evaluation and Consensus Quality Control
quast_denovo
QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without requiring a reference. Reported metrics include the number of contigs, the length of the largest contig, and N50.
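A reference-free QUAST run on the de novo assembly can be sketched as below; the output directory name is illustrative.

```bash
# Minimal sketch of a reference-free QUAST evaluation of the de novo assembly
quast.py assembly_denovo.fasta -o quast_denovo
# quast_denovo/report.tsv contains the contig count, largest contig, N50, and related metrics
```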
QUAST Technical Details
| Links | |
|---|---|
| Task | task_quast.wdl |
| Software Source Code | QUAST on GitHub |
| Software Documentation | QUAST Manual on SourceForge |
| Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
checkv_denovo & checkv_consensus
CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identifying host contamination in integrated proviruses, estimating completeness of genome fragments, and identifying closed genomes.
By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports "weighted_contamination" and "weighted_completeness", which are percentages averaged across the total assembly and weighted by contig length.
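The sketch below illustrates a CheckV run and a contig-length-weighted completeness calculation consistent with the description above. The database path is a placeholder, and the column positions in the awk command are assumptions that should be confirmed against the quality_summary.tsv header.

```bash
# Hedged sketch: run CheckV, then derive a contig-length-weighted completeness from its summary
checkv end_to_end assembly_denovo.fasta checkv_denovo -d "$CHECKVDB" -t 4   # $CHECKVDB path is a placeholder
# Column positions are assumptions: $2 = contig_length, $10 = completeness in quality_summary.tsv
awk -F'\t' 'NR > 1 && $10 != "NA" { sum += $2 * $10; len += $2 }
            END { if (len) printf "weighted_completeness: %.2f\n", sum / len }' \
  checkv_denovo/quality_summary.tsv
```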
CheckV Technical Details
| Links | |
|---|---|
| Task | task_checkv.wdl |
| Software Source Code | CheckV on Bitbucket |
| Software Documentation | CheckV Documentation |
| Original Publication(s) | CheckV assesses the quality and completeness of metagenome-assembled viral genomes |
consensus_qc: Assembly Statistics
The consensus_qc task generates a summary of genomic statistics from a consensus genome, including the total number of bases, the number of "N" bases, the number of degenerate bases, and an estimate of the percent coverage of the reference genome.
consensus_qc Technical Details
| Links | |
|---|---|
| Task | task_consensus_qc.wdl |
Exporting Results to Taxon-Specific Tables
Taxon Tables: Copy outputs to new data tables based on taxonomic assignment (optional)
This task is incompatible with command-line execution of TheiaViral_Panel because it is designed specifically for Terra. Do not activate this task if you are a command-line user.
Activate this task by providing a value for the output_taxon_table input variable. If provided, the user must also provide values to the terra_project and terra_workspace optional input variables.
The taxon_tables module will copy sample data to a different data table based on the taxonomic assignment. For example, if an influenza sample is analyzed, the module will copy the sample data to a new table for influenza samples or add the sample data to an existing table.
Formatting the output_taxon_table file
The output_taxon_table file must be uploaded to a Google Storage bucket that is accessible by Terra, and it should be in tab-delimited format with a header. The viral taxon name is listed in the leftmost column, and the name of the data table to copy samples of that taxon to is listed in the rightmost column, as in the example table and commands below.
| taxon | taxon_table |
|---|---|
| influenza | influenza_panel_specimen |
| coronavirus | coronavirus_panel_specimen |
| human_immunodeficiency_virus | hiv_panel_specimen |
| monkeypox_virus | monkeypox_panel_specimen |
There are no output columns for the taxon table task. The only output is that additional data tables will appear in the Terra workspace for samples matching a taxon in the taxon table file.
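The commands below sketch one way to create and upload such a file from the command line. The bucket path is a placeholder, and the taxon/table names simply mirror the example table above.

```bash
# Illustrative creation and upload of the tab-delimited output_taxon_table file
printf "taxon\ttaxon_table\n"                       >  taxon_tables.tsv
printf "influenza\tinfluenza_panel_specimen\n"      >> taxon_tables.tsv
printf "coronavirus\tcoronavirus_panel_specimen\n"  >> taxon_tables.tsv
gsutil cp taxon_tables.tsv gs://your-terra-workspace-bucket/taxon_tables.tsv   # bucket path is a placeholder
```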
export_taxon_table Technical Details
| Links | |
|---|---|
| Task | task_export_taxon_table.wdl |
Taxa-Specific Tasks¶
The TheiaViral workflows activate taxa-specific sub-workflows after relevant taxa are identified. These characterization modules are activated by populating taxon with an exact match to a compatible taxon. We recommend using the taxon ID integer because it is simpler to match exactly. Compatible taxon codes are listed in parentheses below (case-insensitive):
- SARS-CoV-2 ("2697049", "severe acute respiratory syndrome coronavirus 2")
- Monkeypox virus ("10244", "mpox", "monkeypox virus")
- Human Immunodeficiency Virus 1 ("11676", "human immunodeficiency virus 1")
- Human Immunodeficiency Virus 2 ("11709", "human immunodeficiency virus 2")
- West Nile Virus ("11082", "west nile virus")
- Influenza A ("11320", "influenza a virus")
- Influenza B ("11520", "influenza b virus")
- RSV-A ("208893", "human respiratory syncytial virus a")
- RSV-B ("208895", "human respiratory syncytial virus b")
- Measles ("11234", "measles")
- Rabies ("11292", "lyssavirus rabies")
- Mumps ("2560602", "mumps virus", "Mumps orthorubulavirus")
- Rubella ("11041", "rubella virus")
Outputs¶
TheiaViral_Illumina_PE Outputs
| Variable | Type | Description |
|---|---|---|
| abricate_flu_database | String | ABRicate database used for analysis |
| abricate_flu_results | File | File containing all results from ABRicate |
| abricate_flu_subtype | String | Flu subtype as determined by ABRicate |
| abricate_flu_type | String | Flu type as determined by ABRicate |
| abricate_flu_version | String | Version of ABRicate |
| assembly_consensus_fasta | File | Final consensus assembly in FASTA format |
| assembly_denovo_fasta | File | De novo assembly in FASTA format |
| auspice_json | File | Auspice-compatible JSON output generated from Nextclade analysis that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_flu_ha | File | Auspice-compatible JSON output generated from Nextclade analysis on the Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_flu_na | File | Auspice-compatible JSON output generated from Nextclade analysis on the Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_rabies | File | Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| bbduk_docker | String | The Docker image for bbduk, which was used to remove the adapters from the sequences |
| bbduk_read1_clean | File | Clean forward reads after BBDuk processing |
| bbduk_read2_clean | File | Clean reverse reads after BBDuk processing |
| bwa_read1_aligned | File | Forward reads aligned to reference |
| bwa_read1_unaligned | File | Forward reads not aligned to reference |
| bwa_read2_aligned | File | Reverse reads aligned to reference |
| bwa_read2_unaligned | File | Reverse reads not aligned to reference |
| bwa_samtools_version | String | Version of samtools used by BWA |
| bwa_sorted_bai | File | Sorted BAM index file of reads aligned to reference |
| bwa_sorted_bam | File | Sorted BAM file of reads aligned to reference |
| bwa_sorted_bam_unaligned | File | A BAM file that only contains reads that did not align to the reference |
| bwa_sorted_bam_unaligned_bai | File | Index companion file to a BAM file that only contains reads that did not align to the reference |
| bwa_version | String | Version of BWA software used |
| checkv_consensus_contamination | File | Contamination estimate for consensus assembly from CheckV |
| checkv_consensus_summary | File | Summary report from CheckV for consensus assembly |
| checkv_consensus_total_genes | Int | Number of genes detected in consensus assembly by CheckV |
| checkv_consensus_version | String | Version of CheckV used for consensus assembly |
| checkv_consensus_weighted_completeness | Float | Weighted completeness score for consensus assembly from CheckV |
| checkv_consensus_weighted_contamination | Float | Weighted contamination score for consensus assembly from CheckV |
| checkv_denovo_contamination | File | Contamination estimate for de novo assembly from CheckV |
| checkv_denovo_summary | File | Summary report from CheckV for de novo assembly |
| checkv_denovo_total_genes | Int | Number of genes detected in de novo assembly by CheckV |
| checkv_denovo_version | String | Version of CheckV used for de novo assembly |
| checkv_denovo_weighted_completeness | Float | Weighted completeness score for de novo assembly from CheckV |
| checkv_denovo_weighted_contamination | Float | Weighted contamination score for de novo assembly from CheckV |
| consensus_n_variant_min_depth | Int | Minimum read depth to call variants for iVar consensus and iVar variants. Also represents the minimum consensus support threshold used by IRMA with Illumina Influenza data. |
| consensus_qc_assembly_length_unambiguous | Int | Length of consensus assembly excluding ambiguous bases |
| consensus_qc_number_Degenerate | Int | Number of degenerate bases in consensus assembly |
| consensus_qc_number_N | Int | Number of N bases in consensus assembly |
| consensus_qc_number_Total | Int | Total number of bases in consensus assembly |
| consensus_qc_percent_reference_coverage | Float | Percent of reference genome covered in consensus assembly |
| datasets_genome_length_docker | String | The Docker container used for the task |
| datasets_genome_length_version | String | The version of NCBI Datasets used for analysis |
| dehost_wf_dehost_read1 | File | Reads that did not map to host |
| dehost_wf_dehost_read2 | File | Paired-reads that did not map to host |
| dehost_wf_host_accession | String | Host genome accession |
| dehost_wf_host_fasta | File | Host genome FASTA file |
| dehost_wf_host_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
| dehost_wf_host_mapped_bai | File | Indexed bam file of the reads aligned to the host reference |
| dehost_wf_host_mapped_bam | File | Sorted BAM file containing the alignments of reads to the host reference genome |
| dehost_wf_host_mapping_cov_hist | File | Coverage histogram from host read mapping |
| dehost_wf_host_mapping_coverage | Float | Average coverage from host read mapping |
| dehost_wf_host_mapping_mean_depth | Float | Average depth from host read mapping |
| dehost_wf_host_mapping_metrics | File | File of mapping metrics |
| dehost_wf_host_mapping_stats | File | File of mapping statistics |
| dehost_wf_host_percent_mapped_reads | Float | Percentage of reads mapped to host reference genome |
| ete4_docker | String | Docker image used for ETE4 taxonomy parsing |
| ete4_version | String | The version of ETE4 used |
| fastp_html_report | File | The HTML report made with fastp |
| fastp_version | String | The version of fastp used |
| fastq_scan_clean1_json | File | The JSON file output from fastq-scan containing summary stats about clean forward read quality and length |
| fastq_scan_clean2_json | File | The JSON file output from fastq-scan containing summary stats about clean reverse read quality and length |
| fastq_scan_clean_pairs | String | Number of read pairs after cleaning |
| fastq_scan_docker | String | The Docker image of fastq_scan |
| fastq_scan_num_reads_clean1 | Int | The number of forward reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean2 | Int | The number of reverse reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_raw1 | Int | The number of input forward reads as calculated by fastq_scan |
| fastq_scan_num_reads_raw2 | Int | The number of input reverse reads as calculated by fastq_scan |
| fastq_scan_raw1_json | File | The JSON file output from fastq-scan containing summary stats about raw forward read quality and length |
| fastq_scan_raw2_json | File | The JSON file output from fastq-scan containing summary stats about raw reverse read quality and length |
| fastq_scan_raw_pairs | String | Number of raw read pairs |
| fastq_scan_version | String | The version of fastq_scan |
| genoflu_all_segments | String | The genotypes for each individual flu segment |
| genoflu_genotype | String | The genotype of the whole genome, based on the individual segment types |
| genoflu_output_tsv | File | The output file from GenoFLU |
| genoflu_version | String | The version of GenoFLU used |
| irma_docker | String | Docker image used to run IRMA |
| irma_subtype | String | Flu subtype as determined by IRMA |
| irma_subtype_notes | String | Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column" |
| irma_type | String | Flu type as determined by IRMA |
| irma_version | String | Version of IRMA used |
| ivar_tsv | File | Variant descriptor file generated by iVar variants |
| ivar_variant_proportion_intermediate | String | The proportion of variants of intermediate frequency |
| ivar_variant_version | String | Version of iVar for running the iVar variants command |
| ivar_vcf | File | iVar tsv output converted to VCF format |
| ivar_version_consensus | String | Version of iVar for running the iVar consensus command |
| kraken2_extracted_read1 | File | Forward reads extracted by taxonomic classification |
| kraken2_extracted_read2 | File | Reverse reads extracted by taxonomic classification |
| kraken_database | String | Database used for Kraken classification |
| kraken_docker | String | Docker image used for Kraken |
| kraken_report | String | Full Kraken report |
| kraken_version | String | Version of Kraken software used |
| megahit_docker | String | Docker image used for MEGAHIT |
| megahit_status | String | Status of the MEGAHIT assembly |
| megahit_version | String | Version of MEGAHIT used |
| metaviralspades_docker | String | Docker image used for MetaviralSPAdes |
| metaviralspades_status | String | Status of MetaviralSPAdes assembly |
| metaviralspades_version | String | Version of MetaviralSPAdes used |
| morgana_magic_organism | String | Standardized organism name used for characterization |
| ncbi_read_extraction_rank | String | Read extraction rank used |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| ncbi_taxon_id | String | NCBI taxonomy ID of inputted organism following rank extraction |
| ncbi_taxon_name | String | NCBI taxonomy name of inputted taxon following rank extraction |
| nextclade_aa_dels | String | Amino-acid deletions as detected by Nextclade. Will be blank for Flu |
| nextclade_aa_dels_flu_ha | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; includes deletions for the HA segment |
| nextclade_aa_dels_flu_na | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; includes deletions for the NA segment |
| nextclade_aa_dels_rabies | String | Amino-acid deletions as detected by Nextclade. Specific to Rabies |
| nextclade_aa_subs | String | Amino-acid substitutions as detected by Nextclade. Will be blank for Flu |
| nextclade_aa_subs_flu_ha | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; includes substitutions for the HA segment |
| nextclade_aa_subs_flu_na | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; includes substitutions for the NA segment |
| nextclade_aa_subs_rabies | String | Amino-acid substitutions as detected by Nextclade. Specific to Rabies |
| nextclade_clade | String | Nextclade clade designation, will be blank for Flu. |
| nextclade_clade_flu_ha | String | Nextclade clade designation, specific to Flu HA segment |
| nextclade_clade_flu_na | String | Nextclade clade designation, specific to Flu NA segment |
| nextclade_clade_rabies | String | Nextclade clade designation, specific to Rabies |
| nextclade_docker | String | Docker image used to run Nextclade |
| nextclade_ds_tag | String | Dataset tag used to run Nextclade. Will be blank for Flu |
| nextclade_ds_tag_flu_ha | String | Dataset tag used to run Nextclade, specific to Flu HA segment |
| nextclade_ds_tag_flu_na | String | Dataset tag used to run Nextclade, specific to Flu NA segment |
| nextclade_json | File | Nextclade output in JSON file format. Will be blank for Flu |
| nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment |
| nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment |
| nextclade_json_rabies | File | Nextclade output in JSON file format, specific to Rabies |
| nextclade_lineage | String | Nextclade lineage designation |
| nextclade_lineage_rabies | String | Nextclade lineage designation, specific to Rabies |
| nextclade_qc | String | QC metric as determined by Nextclade. Will be blank for Flu |
| nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment |
| nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment |
| nextclade_qc_rabies | String | QC metric as determined by Nextclade, specific to Rabies |
| nextclade_tsv | File | Nextclade output in TSV file format. Will be blank for Flu |
| nextclade_tsv_flu_ha | File | Nextclade output in TSV file format, specific to Flu HA segment |
| nextclade_tsv_flu_na | File | Nextclade output in TSV file format, specific to Flu NA segment |
| nextclade_tsv_rabies | File | Nextclade output in TSV file format, specific to Rabies |
| nextclade_version | String | The version of Nextclade software used |
| pango_lineage | String | Pango lineage as determined by Pangolin |
| pango_lineage_expanded | String | Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1" |
| pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
| pangolin_assignment_version | String | The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
| pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
| pangolin_docker | String | Docker image used to run Pangolin |
| pangolin_notes | String | Lineage notes as determined by Pangolin |
| pangolin_versions | String | All Pangolin software and database versions |
| quasitools_coverage_file | File | The coverage report created by Quasitools HyDRA |
| quasitools_date | String | Date of Quasitools analysis |
| quasitools_dr_report | File | Drug resistance report created by Quasitools HyDRA |
| quasitools_hydra_vcf | File | The VCF created by Quasitools HyDRA |
| quasitools_mutations_report | File | The mutation report created by Quasitools HyDRA |
| quasitools_version | String | Version of Quasitools used |
| quast_denovo_docker | String | Docker image used for QUAST |
| quast_denovo_gc_percent | Float | GC percentage of de novo assembly from QUAST |
| quast_denovo_genome_length | Int | Genome length of de novo assembly from QUAST |
| quast_denovo_largest_contig | Int | Size of largest contig in de novo assembly from QUAST |
| quast_denovo_n50_value | Int | N50 value of de novo assembly from QUAST |
| quast_denovo_number_contigs | Int | Number of contigs in de novo assembly from QUAST |
| quast_denovo_report | File | QUAST report for de novo assembly |
| quast_denovo_uncalled_bases | Float | Number of uncalled bases in de novo assembly from QUAST |
| quast_denovo_version | String | Version of QUAST used |
| read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
| read2_dehosted | File | The dehosted reverse reads file; suggested read file for SRA submission |
| read_mapping_cov_hist | File | Coverage histogram from read mapping |
| read_mapping_cov_stats | File | Coverage statistics from read mapping |
| read_mapping_coverage | Float | Average coverage from read mapping |
| read_mapping_date | String | Date of read mapping analysis |
| read_mapping_depth | Float | Average depth from read mapping |
| read_mapping_flagstat | File | Flagstat file from read mapping |
| read_mapping_meanbaseq | Float | Mean base quality from read mapping |
| read_mapping_meanmapq | Float | Mean mapping quality from read mapping |
| read_mapping_percentage_mapped_reads | Float | Percentage of mapped reads |
| read_mapping_report | File | Report file from read mapping |
| read_mapping_samtools_version | String | Version of samtools used in read mapping |
| read_mapping_statistics | File | Statistics file from read mapping |
| read_screen_clean | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure |
| read_screen_clean_tsv | File | Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length |
| skani_database | String | Database used for Skani |
| skani_docker | String | Docker image used for Skani |
| skani_reference_assembly | File | Reference genome assembly |
| skani_reference_taxon | String | Reference taxon name |
| skani_report | File | Report from Skani |
| skani_status | String | Status of Skani analysis |
| skani_top_accession | String | Top accession ID from Skani |
| skani_top_ani | Float | Top ANI score from Skani |
| skani_top_query_coverage | Float | Query coverage of top match from Skani |
| skani_top_score | Float | Top score from Skani |
| skani_version | String | Version of Skani used |
| skani_warning | String | Skani warning message |
| taxon_avg_genome_length | String | Average genome length for taxon obtained from NCBI datasets summary |
| theiaviral_illumina_pe_date | String | Date of TheiaViral Illumina PE workflow run |
| theiaviral_illumina_pe_version | String | Version of TheiaViral Illumina PE workflow |
| trimmomatic_docker | String | The docker image used for the trimmomatic module in this workflow |
| trimmomatic_version | String | The version of Trimmomatic used |
| vadr_alerts_list | File | A file containing all of the fatal alerts as determined by VADR |
| vadr_all_outputs_tar_gz | File | A .tar.gz file (gzip-compressed tar archive file) containing all outputs from the VADR command v-annotate.pl. This file must be uncompressed & extracted to see the many files within. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description of all files present within the archive. Useful when deeply investigating a sample's genome & annotations. |
| vadr_classification_summary_file | File | Per-sequence tabular classification file. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#explanation-of-sqc-suffixed-output-files for more complete description. |
| vadr_docker | String | Docker image used to run VADR |
| vadr_fastas_zip_archive | File | Zip archive containing all fasta files created during VADR analysis |
| vadr_feature_tbl_fail | File | 5 column feature table output for failing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description. |
| vadr_feature_tbl_pass | File | 5 column feature table output for passing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description. |
| vadr_num_alerts | String | Number of fatal alerts as determined by VADR |
TheiaViral_ONT Outputs
| Variable | Type | Description |
|---|---|---|
| abricate_flu_database | String | ABRicate database used for analysis |
| abricate_flu_results | File | File containing all results from ABRicate |
| abricate_flu_subtype | String | Flu subtype as determined by ABRicate |
| abricate_flu_type | String | Flu type as determined by ABRicate |
| abricate_flu_version | String | Version of ABRicate |
| assembly_consensus_fasta | File | Final consensus assembly in FASTA format |
| assembly_denovo_fasta | File | De novo assembly in FASTA format |
| assembly_to_ref_bai | File | BAM index file for reads aligned to reference |
| assembly_to_ref_bam | File | BAM file of reads aligned to reference |
| auspice_json | File | Auspice-compatible JSON output generated from Nextclade analysis that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_flu_ha | File | Auspice-compatible JSON output generated from Nextclade analysis on the Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_flu_na | File | Auspice-compatible JSON output generated from Nextclade analysis on the Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| auspice_json_rabies | File | Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
| bcftools_docker | String | Docker image used for bcftools |
| bcftools_filtered_vcf | File | Filtered variant calls in VCF format from bcftools |
| bcftools_version | String | Version of bcftools used |
| checkv_consensus_contamination | File | Contamination estimate for consensus assembly from CheckV |
| checkv_consensus_summary | File | Summary report from CheckV for consensus assembly |
| checkv_consensus_total_genes | Int | Number of genes detected in consensus assembly by CheckV |
| checkv_consensus_version | String | Version of CheckV used for consensus assembly |
| checkv_consensus_weighted_completeness | Float | Weighted completeness score for consensus assembly from CheckV |
| checkv_consensus_weighted_contamination | Float | Weighted contamination score for consensus assembly from CheckV |
| checkv_denovo_contamination | File | Contamination estimate for de novo assembly from CheckV |
| checkv_denovo_summary | File | Summary report from CheckV for de novo assembly |
| checkv_denovo_total_genes | Int | Number of genes detected in de novo assembly by CheckV |
| checkv_denovo_version | String | Version of CheckV used for de novo assembly |
| checkv_denovo_weighted_completeness | Float | Weighted completeness score for de novo assembly from CheckV |
| checkv_denovo_weighted_contamination | Float | Weighted contamination score for de novo assembly from CheckV |
| clair3_docker | String | Docker image used for Clair3 |
| clair3_gvcf | File | Genomic VCF file from Clair3 |
| clair3_model | String | Model used for Clair3 variant calling |
| clair3_vcf | File | Variant calls in VCF format from Clair3 |
| clair3_version | String | Clair3 Version being used |
| consensus_qc_assembly_length_unambiguous | Int | Length of consensus assembly excluding ambiguous bases |
| consensus_qc_number_Degenerate | Int | Number of degenerate bases in consensus assembly |
| consensus_qc_number_N | Int | Number of N bases in consensus assembly |
| consensus_qc_number_Total | Int | Total number of bases in consensus assembly |
| consensus_qc_percent_reference_coverage | Float | Percent of reference genome covered in consensus assembly |
| datasets_genome_length_docker | String | The Docker container used for the task |
| datasets_genome_length_version | String | The version of NCBI Datasets used for analysis |
| dehost_wf_dehost_read1 | File | Reads that did not map to host |
| dehost_wf_host_accession | String | Host genome accession |
| dehost_wf_host_fasta | File | Host genome FASTA file |
| dehost_wf_host_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
| dehost_wf_host_mapped_bai | File | Indexed bam file of the reads aligned to the host reference |
| dehost_wf_host_mapped_bam | File | Sorted BAM file containing the alignments of reads to the host reference genome |
| dehost_wf_host_mapping_cov_hist | File | Coverage histogram from host read mapping |
| dehost_wf_host_mapping_coverage | Float | Average coverage from host read mapping |
| dehost_wf_host_mapping_mean_depth | Float | Average depth from host read mapping |
| dehost_wf_host_mapping_metrics | File | File of mapping metrics |
| dehost_wf_host_mapping_stats | File | File of mapping statistics |
| dehost_wf_host_percent_mapped_reads | Float | Percentage of reads mapped to host reference genome |
| ete4_docker | String | Docker image used for ETE4 taxonomy parsing |
| ete4_version | String | The version of ETE4 used |
| fasta_utilities_fai | File | FASTA index file |
| fasta_utilities_samtools_docker | String | Docker image used for samtools in fasta utilities |
| fasta_utilities_samtools_version | String | Version of samtools used in fasta utilities |
| flye_denovo_docker | String | Docker image used for Flye |
| flye_denovo_info | File | Information file from Flye assembly |
| flye_denovo_status | String | Status of Flye assembly |
| flye_denovo_version | String | Version of Flye used |
| genoflu_all_segments | String | The genotypes for each individual flu segment |
| genoflu_genotype | String | The genotype of the whole genome, based on the individual segment types |
| genoflu_output_tsv | File | The output file from GenoFLU |
| genoflu_version | String | The version of GenoFLU used |
| irma_docker | String | Docker image used to run IRMA |
| irma_subtype | String | Flu subtype as determined by IRMA |
| irma_subtype_notes | String | Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column" |
| irma_type | String | Flu type as determined by IRMA |
| irma_version | String | Version of IRMA used |
| mask_low_coverage_all_coverage_bed | File | BED file showing all coverage regions |
| mask_low_coverage_bed | File | BED file showing masked low coverage regions |
| mask_low_coverage_bedtools_docker | String | Docker image used for bedtools in masking |
| mask_low_coverage_bedtools_version | String | Version of bedtools used in masking |
| mask_low_coverage_reference_fasta | File | Reference FASTA with low coverage regions masked |
| metabuli_classified | File | Classified reads from Metabuli |
| metabuli_database | String | Database used for Metabuli |
| metabuli_docker | String | Docker image used for Metabuli |
| metabuli_krona_report | File | Krona visualization report from Metabuli |
| metabuli_read1_extract | File | Extracted reads from Metabuli |
| metabuli_report | File | Classification report from Metabuli |
| metabuli_version | String | Version of Metabuli used |
| minimap2_docker | String | The Docker image of minimap2 |
| minimap2_out | File | Output file from Minimap2 alignment |
| minimap2_version | String | The version of minimap2 |
| morgana_magic_organism | String | Standardized organism name used for characterization |
| nanoplot_html_clean | File | An HTML report describing the clean reads |
| nanoplot_html_raw | File | An HTML report describing the raw reads |
| nanoplot_num_reads_clean1 | Int | Number of clean reads |
| nanoplot_num_reads_raw1 | Int | Number of raw reads |
| nanoplot_r1_mean_q_clean | Float | Mean quality score of clean forward reads |
| nanoplot_r1_mean_q_raw | Float | Mean quality score of raw forward reads |
| nanoplot_r1_mean_readlength_clean | Float | Mean read length of clean forward reads |
| nanoplot_r1_mean_readlength_raw | Float | Mean read length of raw forward reads |
| nanoplot_r1_median_q_clean | Float | Median quality score of clean forward reads |
| nanoplot_r1_median_q_raw | Float | Median quality score of raw forward reads |
| nanoplot_r1_median_readlength_clean | Float | Median read length of clean forward reads |
| nanoplot_r1_median_readlength_raw | Float | Median read length of raw forward reads |
| nanoplot_r1_n50_clean | Float | N50 of clean forward reads |
| nanoplot_r1_n50_raw | Float | N50 of raw forward reads |
| nanoplot_r1_stdev_readlength_clean | Float | Standard deviation read length of clean forward reads |
| nanoplot_r1_stdev_readlength_raw | Float | Standard deviation read length of raw forward reads |
| nanoplot_tsv_clean | File | A TSV report describing the clean reads |
| nanoplot_tsv_raw | File | A TSV report describing the raw reads |
| nanoq_filtered_read1 | File | Filtered reads from NanoQ |
| nanoq_version | String | Version of nanoq used in analysis |
| ncbi_read_extraction_rank | String | Read extraction rank used |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| ncbi_scrub_read1_dehosted | File | Dehosted reads after NCBI scrub |
| ncbi_taxon_id | String | NCBI taxonomy ID of inputted organism following rank extraction |
| ncbi_taxon_name | String | NCBI taxonomy name of inputted taxon following rank extraction |
| nextclade_aa_dels | String | Amino-acid deletions as detected by Nextclade. Will be blank for Flu |
| nextclade_aa_dels_flu_ha | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; includes deletions for the HA segment |
| nextclade_aa_dels_flu_na | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; includes deletions for the NA segment |
| nextclade_aa_dels_rabies | String | Amino-acid deletions as detected by Nextclade. Specific to Rabies |
| nextclade_aa_subs | String | Amino-acid substitutions as detected by Nextclade. Will be blank for Flu |
| nextclade_aa_subs_flu_ha | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; includes substitutions for the HA segment |
| nextclade_aa_subs_flu_na | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; includes substitutions for the NA segment |
| nextclade_aa_subs_rabies | String | Amino-acid substitutions as detected by Nextclade. Specific to Rabies |
| nextclade_clade | String | Nextclade clade designation, will be blank for Flu. |
| nextclade_clade_flu_ha | String | Nextclade clade designation, specific to Flu HA segment |
| nextclade_clade_flu_na | String | Nextclade clade designation, specific to Flu NA segment |
| nextclade_clade_rabies | String | Nextclade clade designation, specific to Rabies |
| nextclade_docker | String | Docker image used to run Nextclade |
| nextclade_ds_tag | String | Dataset tag used to run Nextclade. Will be blank for Flu |
| nextclade_ds_tag_flu_ha | String | Dataset tag used to run Nextclade, specific to Flu HA segment |
| nextclade_ds_tag_flu_na | String | Dataset tag used to run Nextclade, specific to Flu NA segment |
| nextclade_json | File | Nextclade output in JSON file format. Will be blank for Flu |
| nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment |
| nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment |
| nextclade_json_rabies | File | Nextclade output in JSON file format, specific to Rabies |
| nextclade_lineage | String | Nextclade lineage designation |
| nextclade_lineage_rabies | String | Nextclade lineage designation, specific to Rabies |
| nextclade_qc | String | QC metric as determined by Nextclade. Will be blank for Flu |
| nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment |
| nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment |
| nextclade_qc_rabies | String | QC metric as determined by Nextclade, specific to Rabies |
| nextclade_tsv | File | Nextclade output in TSV file format. Will be blank for Flu |
| nextclade_tsv_flu_ha | File | Nextclade output in TSV file format, specific to Flu HA segment |
| nextclade_tsv_flu_na | File | Nextclade output in TSV file format, specific to Flu NA segment |
| nextclade_tsv_rabies | File | Nextclade output in TSV file format, specific to Rabies |
| nextclade_version | String | The version of Nextclade software used |
| pango_lineage | String | Pango lineage as determined by Pangolin |
| pango_lineage_expanded | String | Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1" |
| pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
| pangolin_assignment_version | String | The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
| pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
| pangolin_docker | String | Docker image used to run Pangolin |
| pangolin_notes | String | Lineage notes as determined by Pangolin |
| pangolin_versions | String | All Pangolin software and database versions |
| parse_mapping_samtools_docker | String | Docker image used for samtools in parse mapping |
| parse_mapping_samtools_version | String | Version of samtools used in parse mapping |
| porechop_trimmed_read1 | File | Trimmed reads from Porechop |
| porechop_version | String | Version of Porechop used |
| quasitools_coverage_file | File | The coverage report created by Quasitools HyDRA |
| quasitools_date | String | Date of Quasitools analysis |
| quasitools_dr_report | File | Drug resistance report created by Quasitools HyDRA |
| quasitools_hydra_vcf | File | The VCF created by Quasitools HyDRA |
| quasitools_mutations_report | File | The mutation report created by Quasitools HyDRA |
| quasitools_version | String | Version of Quasitools used |
| quast_denovo_docker | String | Docker image used for QUAST |
| quast_denovo_gc_percent | Float | GC percentage of de novo assembly from QUAST |
| quast_denovo_genome_length | Int | Genome length of de novo assembly from QUAST |
| quast_denovo_largest_contig | Int | Size of largest contig in de novo assembly from QUAST |
| quast_denovo_n50_value | Int | N50 value of de novo assembly from QUAST |
| quast_denovo_number_contigs | Int | Number of contigs in de novo assembly from QUAST |
| quast_denovo_report | File | QUAST report for de novo assembly |
| quast_denovo_uncalled_bases | Float | Number of uncalled bases in de novo assembly from QUAST |
| quast_denovo_version | String | Version of QUAST used |
| rasusa_read1_subsampled | File | Subsampled read file from Rasusa |
| rasusa_read2_subsampled | File | Subsampled read file from Rasusa (paired file) |
| rasusa_version | String | Version of RASUSA used for the analysis |
| raven_denovo_docker | String | Docker image used for Raven |
| raven_denovo_status | String | Status of Raven assembly |
| raven_denovo_version | String | Version of Raven used |
| read_mapping_cov_hist | File | Coverage histogram from read mapping |
| read_mapping_cov_stats | File | Coverage statistics from read mapping |
| read_mapping_coverage | Float | Average coverage from read mapping |
| read_mapping_date | String | Date of read mapping analysis |
| read_mapping_depth | Float | Average depth from read mapping |
| read_mapping_flagstat | File | Flagstat file from read mapping |
| read_mapping_meanbaseq | Float | Mean base quality from read mapping |
| read_mapping_meanmapq | Float | Mean mapping quality from read mapping |
| read_mapping_percentage_mapped_reads | Float | Percentage of mapped reads |
| read_mapping_report | File | Report file from read mapping |
| read_mapping_samtools_version | String | Version of samtools used in read mapping |
| read_mapping_statistics | File | Statistics file from read mapping |
| read_screen_clean | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure |
| read_screen_clean_tsv | File | Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length |
| skani_database | String | Database used for Skani |
| skani_docker | String | Docker image used for Skani |
| skani_reference_assembly | File | Reference genome assembly |
| skani_reference_taxon | String | Reference taxon name |
| skani_report | File | Report from Skani |
| skani_status | String | Status of Skani analysis |
| skani_top_accession | String | Top accession ID from Skani |
| skani_top_ani | Float | Top ANI score from Skani |
| skani_top_query_coverage | Float | Query coverage of top match from Skani |
| skani_top_score | Float | Top score from Skani |
| skani_version | String | Version of Skani used |
| skani_warning | String | Skani warning message |
| taxon_avg_genome_length | String | Average genome length for taxon obtained from NCBI datasets summary |
| theiaviral_ont_date | String | Date of TheiaViral ONT workflow run |
| theiaviral_ont_version | String | Version of TheiaViral ONT workflow |
| vadr_alerts_list | File | A file containing all of the fatal alerts as determined by VADR |
| vadr_all_outputs_tar_gz | File | A .tar.gz file (gzip-compressed tar archive file) containing all outputs from the VADR command v-annotate.pl. This file must be uncompressed & extracted to see the many files within. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description of all files present within the archive. Useful when deeply investigating a sample's genome & annotations. |
| vadr_classification_summary_file | File | Per-sequence tabular classification file. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#explanation-of-sqc-suffixed-output-files for more complete description. |
| vadr_docker | String | Docker image used to run VADR |
| vadr_fastas_zip_archive | File | Zip archive containing all fasta files created during VADR analysis |
| vadr_feature_tbl_fail | File | 5 column feature table output for failing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description. |
| vadr_feature_tbl_pass | File | 5 column feature table output for passing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for more complete description. |
| vadr_num_alerts | String | Number of fatal alerts as determined by VADR |
TheiaViral_Panel Outputs
| Variable | Type | Description |
|---|---|---|
| assembled_viruses | Int | Number of viruses assembled from sample |
| assemblies | Array[File] | Assembly files generated during the workflow |
| bbduk_docker | String | The Docker image for bbduk, which was used to remove the adapters from the sequences |
| dehost_wf_dehost_read1 | File | Reads that did not map to host |
| dehost_wf_dehost_read2 | File | Paired-reads that did not map to host |
| dehost_wf_host_accession | String | Host genome accession |
| dehost_wf_host_fasta | File | Host genome FASTA file |
| dehost_wf_host_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
| dehost_wf_host_mapped_bai | File | Indexed bam file of the reads aligned to the host reference |
| dehost_wf_host_mapped_bam | File | Sorted BAM file containing the alignments of reads to the host reference genome |
| dehost_wf_host_mapping_cov_hist | File | Coverage histogram from host read mapping |
| dehost_wf_host_mapping_coverage | Float | Average coverage from host read mapping |
| dehost_wf_host_mapping_mean_depth | Float | Average depth from host read mapping |
| dehost_wf_host_mapping_metrics | File | File of mapping metrics |
| dehost_wf_host_mapping_stats | File | File of mapping statistics |
| dehost_wf_host_percent_mapped_reads | Float | Percentage of reads mapped to host reference genome |
| fastp_html_report | File | The HTML report made with fastp |
| fastp_version | String | The version of fastp used |
| fastq_scan_clean1_json | File | The JSON file output from fastq-scan containing summary stats about clean forward read quality and length |
| fastq_scan_clean2_json | File | The JSON file output from fastq-scan containing summary stats about clean reverse read quality and length |
| fastq_scan_clean_pairs | String | Number of read pairs after cleaning |
| fastq_scan_docker | String | The Docker image of fastq_scan |
| fastq_scan_num_reads_clean1 | Int | The number of forward reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean2 | Int | The number of reverse reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_raw1 | Int | The number of input forward reads as calculated by fastq_scan |
| fastq_scan_num_reads_raw2 | Int | The number of input reverse reads as calculated by fastq_scan |
| fastq_scan_raw1_json | File | The JSON file output from fastq-scan containing summary stats about raw forward read quality and length |
| fastq_scan_raw2_json | File | The JSON file output from fastq-scan containing summary stats about raw reverse read quality and length |
| fastq_scan_raw_pairs | String | Number of raw read pairs |
| fastq_scan_version | String | The version of fastq_scan |
| fastqc_clean1_html | File | An HTML file that provides a graphical visualization of clean forward read quality from fastqc to open in an internet browser |
| fastqc_clean2_html | File | An HTML file that provides a graphical visualization of clean reverse read quality from fastqc to open in an internet browser |
| fastqc_docker | String | The Docker container used for fastqc |
| fastqc_num_reads_clean1 | Int | The number of forward reads after cleaning by fastqc |
| fastqc_num_reads_clean2 | Int | The number of reverse reads after cleaning by fastqc |
| fastqc_num_reads_clean_pairs | String | The number of read pairs after cleaning by fastqc |
| fastqc_num_reads_raw1 | Int | The number of input forward reads by fastqc before cleaning |
| fastqc_num_reads_raw2 | Int | The number of input reverse reads by fastqc before cleaning |
| fastqc_num_reads_raw_pairs | String | The number of input read pairs by fastqc before cleaning |
| fastqc_raw1_html | File | An HTML file that provides a graphical visualization of raw forward read quality from fastqc to open in an internet browser |
| fastqc_raw2_html | File | An HTML file that provides a graphical visualization of raw reverse read quality from fastqc to open in an internet browser |
| fastqc_version | String | Version of fastqc software used |
| identified_organisms | Array[String] | List of organisms extracted and identified from a panel-level sample |
| kraken2_classified_report | File | Standard Kraken2 output report. TXT filetype, but can be opened in Excel as a TSV file |
| kraken2_database | String | Kraken2 database used for the taxonomic assignment |
| kraken2_docker | String | Docker image used to run kraken2 |
| kraken2_report_clean | File | The full Kraken report for the sample's clean reads |
| kraken2_report_raw | File | The full Kraken report for the sample's raw reads |
| kraken2_version | String | The version of kraken2 used |
| kraken_percent_human_clean | Float | Percent of human read data detected using the Kraken2 software after host removal for cleaned reads |
| kraken_percent_human_raw | Float | Percent of human read data detected using the Kraken2 software after host removal for raw reads |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| read1_clean | File | Forward read file after quality trimming and adapter removal |
| read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
| read2_clean | File | Reverse read file after quality trimming and adapter removal |
| read2_dehosted | File | The dehosted reverse reads file; suggested read file for SRA submission |
| theiaviral_panel_analysis_date | String | The date the analysis was run |
| theiaviral_panel_version | String | The version of the workflow that was run |
| trimmomatic_docker | String | The docker image used for the trimmomatic module in this workflow |
| trimmomatic_version | String | The version of Trimmomatic used |
What are the differences between the de novo and consensus assemblies?
De novo genomes are generated from scratch without a reference to guide read assembly, while consensus genomes are generated by mapping reads to a reference and replacing reference positions with identified variants (structural and nucleotide). De novo assemblies are thus not biased by requiring reads to map to a reference, though they may be more fragmented. Consensus assembly can generate more robust assemblies from lower-coverage samples if the reference genome is of sufficient quality and sufficiently closely related to the input sequence, though it may not perform well in instances of significant structural variation. TheiaViral uses de novo assemblies as an intermediate to acquire the best reference genome for consensus assembly.
We generally recommend that TheiaViral users focus on the consensus assembly as the desired assembly output. While we chose the best de novo assemblers for TheiaViral based on internal benchmarking, the consensus assembly will often be higher quality than the de novo assembly. However, the de novo assembly can approach or exceed consensus quality if the read inputs largely comprise one virus, have high depth of coverage, and/or are derived from a virus with high potential for recombination. TheiaViral does conduct assembly contiguity and viral completeness quality control for de novo assemblies, so a de novo assembly that meets quality control standards can certainly be used for downstream analysis.
How is de novo assembly quality evaluated?
De novo assembly quality evaluation focuses on the completeness and contiguity of the genome. While a ground truth genome does not truly exist for quality comparison, reference genome selection can help contextualize quality if the reference is sufficiently similar to the de novo assembly. TheiaViral uses QUAST to acquire basic contiguity statistics and CheckV to assess viral genome completeness and contamination. Additionally, the reference selection software, Skani, can provide a quantitative comparison between the de novo assembly and the best reference genome.
Completeness and contamination
- checkv_denovo_summary: The summary file reports CheckV results on a contig-by-contig basis. Ideally, completeness is 100% for a single contig, or 100% for each segment. If there are multiple extraneous contigs in the assembly, ideally one of them is 100% complete. The same principles apply to contamination, though it is ideally 0%.
- checkv_denovo_total_genes: The total number of genes is ideally the number expected for the input viral taxon. CheckV can sometimes fail to recover all the genes from a complete genome, so other statistics should be weighted more heavily in quality evaluation.
- checkv_denovo_weighted_completeness: The weighted completeness is ideally 100%.
- checkv_denovo_weighted_contamination: The weighted contamination is ideally 0%.
Length and contiguity
- quast_denovo_genome_length: The de novo genome length is ideally the same as the expected genome length of the focal virus.
- quast_denovo_largest_contig: The largest contig is ideally the size of the genome, or the size of the largest expected segment. If there are multiple contigs and the largest contig is the ideal size, then the smaller contigs may be discarded based on the CheckV completeness for the largest contig (see CheckV outputs).
- quast_denovo_n50_value: The N50 is an evaluation of contiguity and is ideally as close as possible to the genome size. For segmented viruses, the N50 should be as close as possible to the size of the segment that covers at least 50% of the total genome size when segment lengths are summed after sorting largest to smallest.
- quast_denovo_number_contigs: The number of contigs is ideally 1, or the total number of expected segments.
Reference genome similarity
- skani_top_ani: The percent average nucleotide identity (ANI) for the top Skani hit is ideally 100% if the sequenced virus is highly similar to a reference genome. However, if the virus is divergent, ANI is not a good indication of assembly quality.
- skani_top_query_coverage: The percent query coverage for the top Skani hit is ideally 100% if the sequenced virus has not undergone significant recombination/structural variation.
- skani_top_score: The score for the top Skani hit is the ANI multiplied by the query (de novo assembly) coverage and is ideally 100% if the sequenced virus is not substantially divergent from the reference dataset.
How is consensus assembly quality evaluated?
Consensus assemblies are derived from a reference genome, so quality assessment focuses on coverage and variant quality. Bases with insufficient coverage are denoted as "N". Additionally, the size and contiguity of a TheiaViral consensus assembly is expected to approximate the reference genome, so any discrepancy here is likely due to inferred structural variation.
Completeness and contamination
- checkv_consensus_weighted_completeness: The weighted completeness is ideally 100%.
Consensus variant calls
- consensus_qc_number_Degenerate: The number of degenerate bases is ideally 0. While degenerate bases indicate ambiguity in the sequence, non-N degenerate bases indicate that some information about the base was obtained.
- consensus_qc_number_N: The number of "N" bases is ideally 0.
Coverage
- consensus_qc_percent_reference_coverage: The percent reference coverage is ideally 100%.
- read_mapping_cov_hist: The read mapping coverage histogram ideally depicts normally distributed coverage, which may indicate uniform coverage across the reference genome. However, uniform coverage is unlikely with repetitive regions that approach/exceed read length.
- read_mapping_coverage: The average read mapping coverage is ideally as high as possible.
- read_mapping_meanbaseq: The mean base quality of mapped reads is ideally as high as possible.
- read_mapping_meanmapq: The mean mapping (alignment) quality is ideally as high as possible.
- read_mapping_percentage_mapped_reads: The percent of mapped reads is ideally 100% of the reads classified as the lineage of interest. Some unclassified reads may also map, which may indicate they were erroneously unclassified. Alternatively, these reads could have been erroneously mapped.
Why did the workflow complete without generating a consensus?
TheiaViral is designed to "soft fail" when specific steps do not succeed due to input data quality. This means the workflow will be reported as successful, with an output that delineates the step that failed. If the workflow fails, please look for the following outputs in this order (sorted by timing of failure, latest first):
- skani_status: If this output is populated with something other than "PASS" and skani_top_accession is populated with "N/A", Skani did not identify a sufficiently similar reference genome. The Skani database comprises a broad array of NCBI viral genomes, so a failure here likely indicates poor read quality because viral contigs are not found in the de novo assembly or are too small. It may be useful to BLAST whatever contigs do exist in the de novo assembly to determine if there is contamination that can be removed via the host input parameter. Additionally, review the CheckV de novo outputs to assess whether viral contigs were retrieved. Finally, consider keeping extract_unclassified set to "true", using a higher read_extraction_rank if it will not introduce contaminant viruses, and invoking a host input to remove host reads if host contigs are present.
- megahit_status/flye_status: If this output is populated with something other than "PASS", the fallback assembler did not successfully complete. The fallback assemblers are permissive, so failure here likely indicates poor read quality. Review read QC to check read quality, particularly following read classification. If read classification is discarding a significant number of reads, consider adjusting extract_unclassified, read_extraction_rank, and the host input. Otherwise, sequencing quality may be poor.
- metaviralspades_status/raven_denovo_status: If this output is populated with something other than "PASS", the default assembler did not successfully complete or did not extract viral contigs (MetaviralSPAdes). On their own, these statuses do not correspond directly to workflow failure because fallback de novo assemblers are implemented for both TheiaViral workflows.
- read_screen_clean: If this output is populated with something other than "PASS", the reads did not pass the imposed thresholds. Either the reads are poor quality or the thresholds are too stringent, in which case the thresholds can be relaxed or skip_screen can be set to "true".
- dehost_wf_download_status: If this output is populated with something other than "PASS", a host genome could not be retrieved for decontamination. See the host input explanation for more information and review the download_accession/download_taxonomy task output logs for advanced error parsing.
Known errors associated with read quality
- ONT workflows may fail at Metabuli if no reads are classified as the taxon. Check the Metabuli classification.tsv or krona report for the read extraction taxon ID to determine if any reads were classified. This error will report out of memory (OOM), but increasing memory will not resolve it.
- Illumina workflows may fail at CheckV (de novo) with Error: 80 hmmsearch tasks failed. Program should be rerun if no viral contigs were identified in the de novo assembly.
Acknowledgments¶
We would like to thank Danny Park at the Broad Institute and Jared Johnson at the Washington State Department of Public Health for correspondence during the development of TheiaViral. TheiaViral was built referencing viral-assemble, VAPER, and Artic.


