TheiaViral Workflow Series

Quick Facts

Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level
Genomic Characterization | Viral | vX.X.X | No | Sample-level

TheiaViral Workflows

TheiaViral workflows assemble, quality assess, and characterize viral genomes from diverse data sources, including metagenomic samples. They can generate consensus assemblies of recalcitrant viruses, including diverse or recombinant lineages such as rabies virus and norovirus, through a three-step approach: 1) generating an intermediate de novo assembly from taxonomy-filtered reads, 2) selecting the best reference from a database of ~200,000 complete viral genomes using average nucleotide identity, and 3) producing a final consensus assembly through reference-based read mapping and variant calling. A reference genome can be provided directly to TheiaViral to bypass de novo assembly, which enables compatibility with tiled amplicon sequencing data. Targeted viral characterization is under active development and is currently functional for Lyssavirus rabies.

What are the main differences between the TheiaViral and TheiaCov workflows?
  • TheiaCov Workflows


    • For amplicon-derived viral sequencing methods
    • Supports a limited number of pathogens
    • Uses manually curated, static reference genomes
  • TheiaViral Workflows


    • Designed for a variety of sequencing methods
    • Supports relatively diverse and recombinant pathogens
    • Dynamically identifies the most similar reference genome for consensus assembly via an intermediate de novo assembly
Segmented viruses

Segmented viruses are accounted for in TheiaViral. The reference genome database excludes nucleotide accessions of individual viral segments and instead includes RefSeq assembly accessions that contain all viral segments. Consensus assembly modules are constructed to handle multi-segment references.

Workflow Diagram

TheiaViral_Illumina_PE Workflow Diagram

TheiaViral_ONT Workflow Diagram

TheiaViral Workflows for Different Input Types

  • TheiaViral_Illumina_PE


    Illumina_PE Input Read Data

    The TheiaViral_Illumina_PE workflow inputs Illumina paired-end read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip before Terra uploads to minimize data upload time and storage costs.

    The optional trim_minlen parameter (listed as trim_min_length in the inputs table) may need to be adjusted to appropriately trim reads shorter than 2 x 150 bp (the read length generated by a 300-cycle sequencing kit), such as the 2 x 75 bp reads generated by a 150-cycle sequencing kit.

  • TheiaViral_ONT


    ONT Input Read Data

    The TheiaViral_ONT workflow inputs base-called Oxford Nanopore Technologies (ONT) read data. Read file extensions should be .fastq or .fq, and can optionally include the .gz compression extension. Theiagen recommends compressing files with gzip before Terra uploads to minimize data upload time and storage costs.

    We recommend trimming adapter sequences via dorado basecalling before running TheiaViral_ONT, though porechop can optionally be called to trim adapters within the workflow.

    The ONT sequencing kit and base-calling approach can produce substantial variability in the amount and quality of read data. Genome assemblies produced by the TheiaViral_ONT workflow must be quality assessed before reporting results. We recommend using the Dorado_Basecalling_PHB workflow if applicable.
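
Both workflows expect read files ending in .fastq or .fq, optionally followed by .gz. As a minimal sketch, a pre-upload check along these lines could catch misnamed files; the function name `has_valid_read_extension` is hypothetical and not part of TheiaViral, and the check looks only at the file name, not the file contents.

```python
from pathlib import Path

# Extensions described above: .fastq or .fq, optionally gzip-compressed.
VALID_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def has_valid_read_extension(path: str) -> bool:
    """Return True if the file name ends in an accepted read-file extension.

    Hypothetical helper for illustration; it does not inspect file contents.
    """
    return Path(path).name.endswith(VALID_SUFFIXES)


if __name__ == "__main__":
    for name in ["sample_R1.fastq.gz", "sample.fq", "sample.bam"]:
        print(name, has_valid_read_extension(name))
```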

Inputs

taxon required input parameter

taxon is the standardized taxonomic name (e.g. "Lyssavirus rabies") or NCBI taxon ID (e.g. "11292") of the desired virus to analyze. Inputs must be represented in the NCBI taxonomy database and do not have to be species-level (see read_extraction_rank below).

host optional input parameter

The host input triggers the Host Decontaminate workflow, which removes reads that map to a reference host genome. This input must be an NCBI Taxonomy-compatible taxon or an NCBI assembly accession. If a taxon is provided, the first genome retrieved for that taxon is used. If an accession is provided, the corresponding boolean must also be set to "true": is_accession in the Host Decontaminate task (ONT) or host_is_accession in Read QC Trim PE (Illumina).
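
The pairing between host and its accession flag can be sketched as follows. This is an illustration only: the function `host_inputs`, the workflow labels, and the returned dictionary layout are hypothetical, and the regular expression assumes the usual NCBI assembly accession shape (GCA_ or GCF_ followed by nine digits and an optional version). Only the flag names is_accession and host_is_accession come from the inputs described here.

```python
import re

# Assumed shape of an NCBI assembly accession, e.g. GCF_000001405.40
ASSEMBLY_ACCESSION = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")

# Flag names taken from the text above; the mapping itself is illustrative.
ACCESSION_FLAG = {"illumina_pe": "host_is_accession", "ont": "is_accession"}


def host_inputs(host: str, workflow: str) -> dict:
    """Return the host value plus the accession flag that should accompany it."""
    flag = ACCESSION_FLAG[workflow]
    return {"host": host, flag: bool(ASSEMBLY_ACCESSION.match(host))}


if __name__ == "__main__":
    print(host_inputs("GCF_000001405.40", "ont"))
    print(host_inputs("Homo sapiens", "illumina_pe"))
```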

extract_unclassified optional input parameter

By default, the extract_unclassified parameter is set to "true", which means that reads not classified by Kraken2 (Illumina) or Metabuli (ONT) are included with reads classified as the input taxon. These classifiers often do not comprehensively classify reads using the default RefSeq databases, so extracting unclassified reads is desirable when host and contaminant reads have been sufficiently decontaminated. Host decontamination occurs in TheiaViral using NCBI sra-human-scrubber, read classification to the human genome, and/or mapping reads to the inputted host. Contaminant viral reads are mostly excluded because they will often be classified against the default RefSeq classification databases. Consider setting extract_unclassified to "false" if de novo assembly or Skani reference selection is failing.

min_allele_freq, min_depth, and min_map_quality optional input parameters

These parameters have a direct effect on the variants that will ultimately be reported in the consensus assembly. min_allele_freq determines the minimum proportion of an allelic variant to be reported in the consensus assembly. min_depth and min_map_quality affect how "N" is reported in the consensus, i.e. depth below min_depth is reported as "N" and reads with mapping quality below min_map_quality are not included in depth calculations.
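
To make the interaction of the three thresholds concrete, here is a toy model of calling one consensus position. It is a simplified sketch, not the logic of the variant callers used by the workflows: the function `consensus_base` and its input format are hypothetical, and falling back to the reference base when no variant passes min_allele_freq is an assumption made for illustration. The default values mirror those listed for TheiaViral_Illumina_PE in the inputs table.

```python
from collections import Counter


def consensus_base(reads, ref_base, min_allele_freq=0.6, min_depth=10,
                   min_map_quality=20):
    """Call one consensus position from (base, mapping_quality) tuples.

    Toy model: real callers also weigh base quality, indels, and more.
    """
    # Reads below the mapping-quality cutoff do not count toward depth.
    bases = [base for base, mapq in reads if mapq >= min_map_quality]
    depth = len(bases)

    # Positions with insufficient depth are reported as "N".
    if depth < min_depth:
        return "N"

    # A variant enters the consensus only if it reaches min_allele_freq.
    base, count = Counter(bases).most_common(1)[0]
    if base != ref_base and count / depth >= min_allele_freq:
        return base

    # Assumption for this sketch: otherwise keep the reference base.
    return ref_base


if __name__ == "__main__":
    well_covered = [("A", 60)] * 12 + [("G", 60)] * 3
    poorly_mapped = [("A", 60)] * 5 + [("A", 5)] * 20
    print(consensus_base(well_covered, ref_base="G"))
    print(consensus_base(poorly_mapped, ref_base="G"))
```

In the second call most reads fall below min_map_quality, so the usable depth drops under min_depth even though the raw pileup is deep.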

read_extraction_rank optional input parameter

By default, the read_extraction_rank parameter is set to "family", which indicates that reads will be extracted if they are classified as the taxonomic family of the input taxon, including all descendant taxa of the family. Read classification may not resolve to the rank of the input taxon, so some reads may only be classified at higher ranks. For example, some Lyssavirus rabies (species) reads may only be resolved to Lyssavirus (genus), so they would not be extracted if the read_extraction_rank is set to "species". Setting the read_extraction_rank above the inputted taxon's rank can therefore dramatically increase the number of reads recovered, at the potential cost of including other viruses. This is unlikely to be a problem for scarcely represented lineages, e.g. a sample that is expected to include Lyssavirus rabies is unlikely to also contain other viruses of the corresponding family, Rhabdoviridae. However, setting a read_extraction_rank far above the input taxon rank can be problematic when multiple representatives of the same viral family are present in similar abundance within the same sample. To further refine the desired read_extraction_rank, please review the classification reports of the respective classification software (Kraken2 for Illumina and Metabuli for ONT).
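
The effect of read_extraction_rank (together with extract_unclassified) can be illustrated with a toy taxonomy. This is a sketch under stated assumptions, not the workflow's implementation: the `PARENT` and `RANK` tables are a hand-written miniature taxonomy, the function names are hypothetical, and reads are represented as (read_id, assigned_taxon) pairs with None marking an unclassified read.

```python
# Hand-written miniature taxonomy for illustration only.
PARENT = {
    "Lyssavirus rabies": "Lyssavirus",
    "Lyssavirus": "Rhabdoviridae",
    "Vesiculovirus": "Rhabdoviridae",
    "Rhabdoviridae": "root",
    "Norovirus": "Caliciviridae",
    "Caliciviridae": "root",
}
RANK = {
    "Lyssavirus rabies": "species",
    "Lyssavirus": "genus",
    "Vesiculovirus": "genus",
    "Rhabdoviridae": "family",
    "Norovirus": "genus",
    "Caliciviridae": "family",
    "root": "no rank",
}


def lineage(taxon):
    """Yield the taxon itself, then each ancestor up to the root."""
    while taxon is not None:
        yield taxon
        taxon = PARENT.get(taxon)


def extraction_root(taxon, rank):
    """Find the taxon at `rank` in the lineage of the input taxon."""
    for node in lineage(taxon):
        if RANK[node] == rank:
            return node
    raise ValueError(f"{taxon} has no ancestor at rank {rank}")


def extract_reads(classified, taxon, rank="family", extract_unclassified=True):
    """Keep reads assigned to the extraction root or any of its descendants."""
    root = extraction_root(taxon, rank)
    kept = []
    for read_id, assigned in classified:
        if assigned is None:
            # Unclassified reads are kept only when requested.
            if extract_unclassified:
                kept.append(read_id)
        elif root in lineage(assigned):
            kept.append(read_id)
    return kept


if __name__ == "__main__":
    reads = [
        ("r1", "Lyssavirus rabies"),
        ("r2", "Lyssavirus"),
        ("r3", "Vesiculovirus"),
        ("r4", "Norovirus"),
        ("r5", None),
    ]
    for rank in ("species", "genus", "family"):
        print(rank, extract_reads(reads, "Lyssavirus rabies", rank, False))
```

With rank set to "species", the read resolved only to Lyssavirus (r2) is dropped; at "family", the Vesiculovirus read (r3) is also pulled in, which is the trade-off described above.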

Terra Task Name Variable Type Description Default Value Terra Status
theiaviral_illumina_pe read1 File Illumina forward read file in FASTQ file format (compression optional) Required
theiaviral_illumina_pe read2 File Illumina reverse read file in FASTQ file format (compression optional) Required
theiaviral_illumina_pe samplename String Name of the sample being analyzed Required
theiaviral_illumina_pe taxon String Taxon ID or organism name of interest Required
bwa cpu Int Number of CPUs to allocate to the task 6 Optional
bwa disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
bwa docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan Optional
bwa memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional
checkv_consensus checkv_db File CheckV database file gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz Optional
checkv_consensus cpu Int Number of CPUs allocated for the task 2 Optional
checkv_consensus disk_size Int Disk size allocated for the task (in GB) 100 Optional
checkv_consensus docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 Optional
checkv_consensus memory Int Memory allocated for the task (in GB) 8 Optional
checkv_denovo checkv_db File CheckV database file gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz Optional
checkv_denovo cpu Int Number of CPUs allocated for the task 2 Optional
checkv_denovo disk_size Int Disk size allocated for the task (in GB) 100 Optional
checkv_denovo docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 Optional
checkv_denovo memory Int Memory allocated for the task (in GB) 8 Optional
clean_check_reads cpu Int Number of CPUs to allocate to the task 1 Optional
clean_check_reads disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
clean_check_reads docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 Optional
clean_check_reads max_genome_length Int Maximum genome length able to pass read screening 2673870 Optional
clean_check_reads memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
clean_check_reads min_basepairs Int Minimum base pairs to pass read screening 15000 Optional
clean_check_reads min_coverage Int Minimum coverage to pass read screening 10 Optional
clean_check_reads min_genome_length Int Minimum genome length to pass read screening 1500 Optional
clean_check_reads min_proportion Int Minimum read proportion to pass read screening 40 Optional
clean_check_reads min_reads Int Minimum reads to pass read screening 50 Optional
consensus char_unknown String Character used to represent unknown bases in the consensus sequence N Optional
consensus count_orphans Boolean True/False that determines if anomalous read pairs are NOT skipped in variant calling. Anomalous read pairs are those marked in the FLAG field as paired in sequencing but without the properly-paired flag set. TRUE Optional
consensus cpu Int Number of CPUs to allocate to the task 8 Optional
consensus disable_baq Boolean True/False that determines if base alignment quality (BAQ) computation should be disabled during samtools mpileup before consensus generation TRUE Optional
consensus disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
consensus docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019-epi2me Optional
consensus max_depth Int For a given position, read at maximum INT number of reads per input file during samtools mpileup before consensus generation 600000 Optional
consensus memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional
consensus min_bq Int Minimum base quality required for a base to be considered during samtools mpileup before consensus generation 0 Optional
consensus skip_N Boolean True/False that determines if "N" bases should be skipped in the consensus sequence FALSE Optional
consensus_qc cpu Int Number of CPUs to allocate to the task 1 Optional
consensus_qc disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
consensus_qc docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 Optional
consensus_qc memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
ivar_variants cpu Int Number of CPUs allocated for the task 2 Optional
ivar_variants disk_size Int Disk size allocated for the task (in GB) 100 Optional
ivar_variants docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan Optional
ivar_variants memory Int Memory allocated for the task (in GB) 8 Optional
ivar_variants reference_gff File A GFF file in the GFF3 format can be supplied to specify coordinates of open reading frames (ORFs) so iVar can identify codons and translate variants into amino acids Optional
megahit cpu Int Number of CPUs allocated for the task 4 Optional
megahit disk_size Int Disk size allocated for the task (in GB) 100 Optional
megahit docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/theiagen/megahit:1.2.9 Optional
megahit kmers String Comma-separated list of kmer sizes to use for assembly. All must be odd, in the range 15-255, increment <= 28 21,29,39,59,79,99,119,141 Optional
megahit megahit_opts String Additional parameters for MEGAHIT assembler Optional
megahit memory Int Memory allocated for the task (in GB) 16 Optional
megahit min_contig_length Int Minimum contig length for MEGAHIT assembler 1 Optional
morgana_magic abricate_flu_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic abricate_flu_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
morgana_magic abricate_flu_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/abricate:1.0.1-insaflu-220727 Optional
morgana_magic abricate_flu_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
morgana_magic abricate_flu_min_percent_coverage Int Minimum DNA percent coverage 60 Optional
morgana_magic abricate_flu_min_percent_identity Int Minimum DNA percent identity 70 Optional
morgana_magic assembly_metrics_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic assembly_metrics_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
morgana_magic assembly_metrics_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 Optional
morgana_magic assembly_metrics_memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional
morgana_magic consensus_qc_cpu Int Number of CPUs to allocate to the task 1 Optional
morgana_magic consensus_qc_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
morgana_magic consensus_qc_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 Optional
morgana_magic consensus_qc_memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
morgana_magic genoflu_cpu Int Number of CPUs to allocate to the task 1 Optional
morgana_magic genoflu_cross_reference File An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py Optional
morgana_magic genoflu_disk_size Int Amount of storage (in GB) to allocate to the task 25 Optional
morgana_magic genoflu_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.06 Optional
morgana_magic genoflu_memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
morgana_magic irma_cpu Int Number of CPUs to allocate to the task 4 Optional
morgana_magic irma_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
morgana_magic irma_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/irma:1.2.0 Optional
morgana_magic irma_keep_ref_deletions Boolean True/False variable that determines if sites missed during read gathering (i.e. 0 reads for a site in the reference genome) should be ambiguated by inserting N's or deleted from the sequence entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN" TRUE Optional
morgana_magic irma_memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional
morgana_magic nextclade_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic nextclade_disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
morgana_magic nextclade_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2 Optional
morgana_magic nextclade_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
morgana_magic nextclade_output_parser_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic nextclade_output_parser_disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
morgana_magic nextclade_output_parser_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim Optional
morgana_magic nextclade_output_parser_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
morgana_magic pangolin_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic pangolin_disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
morgana_magic pangolin_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2 Optional
morgana_magic pangolin_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
ncbi_datasets cpu Int Number of CPUs allocated for the task 1 Optional
ncbi_datasets disk_size Int Disk size allocated for the task (in GB) 50 Optional
ncbi_datasets docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1 Optional
ncbi_datasets include_gbff Boolean True/False to include gbff files in the output FALSE Optional
ncbi_datasets include_gff3 Boolean True/False to include gff3 files in the output FALSE Optional
ncbi_datasets memory Int Memory allocated for the task (in GB) 4 Optional
ncbi_identify complete Boolean Only query genomes labeled complete TRUE Optional
ncbi_identify cpu Int Number of CPUs allocated for the task 1 Optional
ncbi_identify disk_size Int Disk size allocated for the task (in GB) 50 Optional
ncbi_identify docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1 Optional
ncbi_identify memory Int Memory allocated for the task (in GB) 4 Optional
ncbi_identify refseq Boolean Only query RefSeq genomes TRUE Optional
ncbi_identify summary_limit Int Maximum number of genomes to return in the summary 100 Optional
ncbi_identify use_ncbi_virus Boolean Set to true to download from NCBI Virus Datasets FALSE Optional
quast_denovo cpu Int Number of CPUs allocated for the task 2 Optional
quast_denovo disk_size Int Disk size allocated for the task (in GB) 100 Optional
quast_denovo docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 Optional
quast_denovo memory Int Memory allocated for the task (in GB) 2 Optional
rasusa bases String Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored Optional
rasusa coverage Float The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required 250 Optional
rasusa cpu Int Number of CPUs allocated for the task 4 Optional
rasusa disk_size Int Disk size allocated for the task (in GB) 100 Optional
rasusa docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 Optional
rasusa frac Float Subsample to a fraction of the reads - e.g., 0.5 samples half the reads Optional
rasusa memory Int Memory allocated for the task (in GB) 8 Optional
rasusa num Int Subsample to a specific number of reads Optional
rasusa seed Int Random seed for reproducibility Optional
read_QC_trim adapters File File with adapter sequences to be removed Optional
read_QC_trim bbduk_memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional
read_QC_trim call_kraken Boolean Internal component, do not modify Optional
read_QC_trim call_midas Boolean Internal component, do not modify Optional
read_QC_trim fastp_args String Additional arguments to use with fastp --detect_adapter_for_pe -g -5 20 -3 20 Optional
read_QC_trim host_complete_only Boolean Only download host reference genome labeled "complete" FALSE Optional
read_QC_trim host_decontaminate_mem Int Memory allocated for minimap2 (in GB) 32 Optional
read_QC_trim host_is_accession Boolean Inputted "host" is an accession FALSE Optional
read_QC_trim kraken_cpu Int Number of CPUs to allocate to the task 4 Optional
read_QC_trim kraken_memory Int Amount of memory/RAM (in GB) to allocate to the task 32 Optional
read_QC_trim phix File A file containing the phix used during Illumina sequencing; used in the BBDuk task Optional
read_QC_trim read_processing String The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp" trimmomatic Optional
read_QC_trim read_qc String The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc" fastq_scan Optional
read_QC_trim target_organism String Internal component, do not modify Optional
read_QC_trim trim_min_length Int Specifies minimum length of each read after trimming to be kept 75 Optional
read_QC_trim trim_quality_min_score Int Specifies the average quality of bases in a sliding window to be kept 30 Optional
read_QC_trim trim_window_size Int Specifies window size for trimming (the number of bases to average the quality across) 4 Optional
read_QC_trim trimmomatic_args String Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data. -phred33 Optional
read_QC_trim_pe kraken_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
read_QC_trim_pe midas_db File Internal component, do not modify Optional
read_mapping_stats cpu Int Number of CPUs allocated for the task 2 Optional
read_mapping_stats disk_size Int Disk size allocated for the task (in GB) 100 Optional
read_mapping_stats docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 Optional
read_mapping_stats memory Int Memory allocated for the task (in GB) 8 Optional
skani cpu Int Number of CPUs allocated for the task 2 Optional
skani disk_size Int Disk size allocated for the task (in GB) 100 Optional
skani docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2 Optional
skani memory Int Memory allocated for the task (in GB) 4 Optional
skani skani_db File Skani database file gs://theiagen-public-resources-rp/reference_data/databases/skani/skani_db_20250606.tar Optional
spades cpu Int Number of CPUs allocated for the task 4 Optional
spades disk_size Int Disk size allocated for the task (in GB) 100 Optional
spades docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/spades:4.1.0 Optional
spades kmers String List of k-mer sizes (must be odd and less than 128) auto Optional
spades memory Int Memory allocated for the task (in GB) 16 Optional
spades phred_offset Int PHRED quality offset in the input reads (33 or 64) 33 Optional
spades spades_opts String Additional parameters for Spades assembler Optional
theiaviral_illumina_pe call_metaviralspades Boolean True/False to call assembly with MetaviralSPAdes and use Megahit as fallback TRUE Optional
theiaviral_illumina_pe extract_unclassified Boolean True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads TRUE Optional
theiaviral_illumina_pe genome_length Int Expected genome length of taxon of interest Optional
theiaviral_illumina_pe host String Host taxon/accession to dehost reads, if provided Optional
theiaviral_illumina_pe kraken_db File Kraken2 database file gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz Optional
theiaviral_illumina_pe min_allele_freq Float Minimum allele frequency required for a variant to populate the consensus sequence 0.6 Optional
theiaviral_illumina_pe min_depth Int Minimum read depth required for a variant to populate the consensus sequence 10 Optional
theiaviral_illumina_pe min_map_quality Int Minimum mapping quality required for read alignments 20 Optional
theiaviral_illumina_pe read_extraction_rank String Taxonomic rank to use for read extraction - limits taxa to only those within the specified rank. family Optional
theiaviral_illumina_pe reference_fasta File Reference genome in FASTA format Optional
theiaviral_illumina_pe skip_rasusa Boolean True/False to skip read subsampling with Rasusa FALSE Optional
theiaviral_illumina_pe skip_screen Boolean True/False to skip read screening check prior to analysis FALSE Optional
version_capture docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 Optional
version_capture timezone String Set the time zone to get an accurate date of analysis (uses UTC by default) Optional
Terra Task Name Variable Type Description Default Value Terra Status
theiaviral_ont read1 File Base-called ONT read file in FASTQ file format (compression optional) Required
theiaviral_ont samplename String Name of the sample being analyzed Required
theiaviral_ont taxon String Taxon ID or organism name of interest Required
bcftools_consensus cpu Int Number of CPUs allocated for the task 2 Optional
bcftools_consensus disk_size Int Disk size allocated for the task (in GB) 100 Optional
bcftools_consensus docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/bcftools:1.20 Optional
bcftools_consensus memory Int Memory allocated for the task (in GB) 4 Optional
checkv_consensus checkv_db File CheckV database file gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz Optional
checkv_consensus cpu Int Number of CPUs allocated for the task 2 Optional
checkv_consensus disk_size Int Disk size allocated for the task (in GB) 100 Optional
checkv_consensus docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 Optional
checkv_consensus memory Int Memory allocated for the task (in GB) 8 Optional
checkv_denovo checkv_db File CheckV database file gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz Optional
checkv_denovo cpu Int Number of CPUs allocated for the task 2 Optional
checkv_denovo disk_size Int Disk size allocated for the task (in GB) 100 Optional
checkv_denovo docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 Optional
checkv_denovo memory Int Memory allocated for the task (in GB) 8 Optional
clair3 clair3_model String Model to be used by Clair3 r1041_e82_400bps_sup_v500 Optional
clair3 cpu Int Number of CPUs allocated for the task 4 Optional
clair3 disable_phasing Boolean True/False that determines if variants should be called without whatshap phasing in full alignment calling TRUE Optional
clair3 disk_size Int Disk size allocated for the task (in GB) 100 Optional
clair3 docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/theiagen/clair3-extra-models:1.0.10 Optional
clair3 enable_gvcf Boolean True/False that determines if an additional GVCF output should be generated FALSE Optional
clair3 enable_haploid_precise Boolean True/False that determines haploid calling mode where only 1/1 is considered as a variant TRUE Optional
clair3 include_all_contigs Boolean True/False that determines if all contigs should be included in the output TRUE Optional
clair3 indel_min_af Float Minimum Indel AF required for a candidate variant 0.08 Optional
clair3 memory Int Memory allocated for the task (in GB) 8 Optional
clair3 snp_min_af Float Minimum SNP AF required for a candidate variant 0.08 Optional
clair3 variant_quality Int If set, variants with >$qual will be marked PASS, or LowQual otherwise 2 Optional
clean_check_reads cpu Int Number of CPUs to allocate to the task 1 Optional
clean_check_reads disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
clean_check_reads docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 Optional
clean_check_reads max_genome_length Int Maximum genome length able to pass read screening 2673870 Optional
clean_check_reads memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
clean_check_reads min_basepairs Int Minimum base pairs to pass read screening 15000 Optional
clean_check_reads min_coverage Int Minimum coverage to pass read screening 10 Optional
clean_check_reads min_genome_length Int Minimum genome length to pass read screening 1500 Optional
clean_check_reads min_reads Int Minimum reads to pass read screening 50 Optional
clean_check_reads skip_mash Boolean If true, skips estimation of genome size and coverage using mash in read screening steps. As a result, providing true also prevents screening using these parameters. TRUE Optional
consensus_qc cpu Int Number of CPUs to allocate to the task 1 Optional
consensus_qc disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
consensus_qc docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 Optional
consensus_qc memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
fasta_utilities cpu Int Number of CPUs allocated for the task 1 Optional
fasta_utilities disk_size Int Disk size allocated for the task (in GB) 10 Optional
fasta_utilities docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/biocontainers/seqkit:2.4.0--h9ee0642_0 Optional
fasta_utilities memory Int Memory allocated for the task (in GB) 2 Optional
flye additional_parameters String Additional parameters for Flye assembler Optional
flye asm_coverage Int Reduced coverage for initial disjointig assembly Optional
flye cpu Int Number of CPUs allocated for the task 4 Optional
flye disk_size Int Disk size allocated for the task (in GB) 100 Optional
flye docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/flye:2.9.4 Optional
flye flye_polishing_iterations Int Number of polishing iterations 1 Optional
flye genome_length Int Expected genome length for assembly - requires asm_coverage Optional
flye keep_haplotypes Boolean True/False to prevent collapsing alternative haplotypes FALSE Optional
flye memory Int Memory allocated for the task (in GB) 32 Optional
flye minimum_overlap Int Minimum overlap between reads Optional
flye no_alt_contigs Boolean True/False to disable alternative contig generation FALSE Optional
flye read_error_rate Float Expected error rate in reads Optional
flye read_type String Type of read data for Flye --nano-hq Optional
flye scaffold Boolean True/False to enable scaffolding using graph FALSE Optional
host_decontaminate complete_only Boolean Only download genomes labeled "complete" FALSE Optional
host_decontaminate is_accession Boolean Inputted "host" is an accession FALSE Optional
host_decontaminate minimap2_memory Int Memory allocated for minimap2 (in GB) 32 Optional
host_decontaminate read2 File Internal component, do not modify Optional
host_decontaminate refseq Boolean Only download RefSeq genomes TRUE Optional
mask_low_coverage cpu Int Number of CPUs allocated for the task 2 Optional
mask_low_coverage disk_size Int Disk size allocated for the task (in GB) 100 Optional
mask_low_coverage docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/bedtools:2.31.0 Optional
mask_low_coverage memory Int Memory allocated for the task (in GB) 8 Optional
metabuli cpu Int Number of CPUs allocated for the task 4 Optional
metabuli disk_size Int Disk size allocated for the task (in GB) 100 Optional
metabuli docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/theiagen/metabuli:1.1.0 Optional
metabuli memory Int Memory allocated for the task (in GB) 16 Optional
metabuli metabuli_db File Metabuli database file gs://theiagen-public-resources-rp/reference_data/databases/metabuli/refseq_virus-v223.tar.gz Optional
metabuli min_percent_coverage Float Minimum query coverage threshold (0.0 - 1.0) 0.0 Optional
metabuli min_score Float Minimum sequence similarity score (0.0 - 1.0) 0.0 Optional
metabuli min_sp_score Float Minimum score for species- or lower-level classification 0.0 Optional
metabuli taxonomy_path File Path to taxonomy file gs://theiagen-public-resources-rp/reference_data/databases/metabuli/new_taxdump.tar.gz Optional
minimap2 cpu Int Number of CPUs to allocate to the task 2 Optional
minimap2 disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
minimap2 docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22 Optional
minimap2 memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional
minimap2 query2 File Internal component, do not modify Optional
morgana_magic abricate_flu_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic abricate_flu_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
morgana_magic abricate_flu_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/abricate:1.0.1-insaflu-220727 Optional
morgana_magic abricate_flu_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
morgana_magic abricate_flu_min_percent_coverage Int Minimum DNA percent coverage 60 Optional
morgana_magic abricate_flu_min_percent_identity Int Minimum DNA percent identity 70 Optional
morgana_magic assembly_metrics_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic assembly_metrics_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
morgana_magic assembly_metrics_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 Optional
morgana_magic assembly_metrics_memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional
morgana_magic consensus_qc_cpu Int Number of CPUs to allocate to the task 1 Optional
morgana_magic consensus_qc_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
morgana_magic consensus_qc_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 Optional
morgana_magic consensus_qc_memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
morgana_magic genoflu_cpu Int Number of CPUs to allocate to the task 1 Optional
morgana_magic genoflu_cross_reference File An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py Optional
morgana_magic genoflu_disk_size Int Amount of storage (in GB) to allocate to the task 25 Optional
morgana_magic genoflu_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.06 Optional
morgana_magic genoflu_memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
morgana_magic irma_cpu Int Number of CPUs to allocate to the task 4 Optional
morgana_magic irma_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
morgana_magic irma_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/irma:1.2.0 Optional
morgana_magic irma_keep_ref_deletions Boolean True/False variable that determines whether sites missed during read gathering (i.e., sites in the reference genome with 0 reads) are filled with N's or deleted from the sequence entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN" TRUE Optional
morgana_magic irma_memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional
morgana_magic nextclade_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic nextclade_disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
morgana_magic nextclade_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2 Optional
morgana_magic nextclade_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
morgana_magic nextclade_output_parser_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic nextclade_output_parser_disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
morgana_magic nextclade_output_parser_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim Optional
morgana_magic nextclade_output_parser_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
morgana_magic pangolin_cpu Int Number of CPUs to allocate to the task 2 Optional
morgana_magic pangolin_disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
morgana_magic pangolin_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2 Optional
morgana_magic pangolin_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
morgana_magic read2 File Internal component, do not modify Optional
nanoplot_clean cpu Int Number of CPUs to allocate to the task 4 Optional
nanoplot_clean disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
nanoplot_clean docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 Optional
nanoplot_clean max_length Int The maximum length of clean reads, for which reads longer than the length specified will be hidden. 100000 Optional
nanoplot_clean memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional
nanoplot_raw cpu Int Number of CPUs to allocate to the task 4 Optional
nanoplot_raw disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
nanoplot_raw docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 Optional
nanoplot_raw max_length Int The maximum length of clean reads, for which reads longer than the length specified will be hidden. 100000 Optional
nanoplot_raw memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional
nanoq cpu Int Number of CPUs allocated for the task 1 Optional
nanoq disk_size Int Disk size allocated for the task (in GB) 100 Optional
nanoq docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/biocontainers/nanoq:0.9.0--hec16e2b_1 Optional
nanoq max_read_length Int Maximum read length to keep 100000 Optional
nanoq max_read_qual Int Maximum read quality to keep 10 Optional
nanoq memory Int Memory allocated for the task (in GB) 2 Optional
nanoq min_read_length Int Minimum read length to keep 500 Optional
nanoq min_read_qual Int Minimum read quality to keep 10 Optional
ncbi_datasets cpu Int Number of CPUs allocated for the task 1 Optional
ncbi_datasets disk_size Int Disk size allocated for the task (in GB) 50 Optional
ncbi_datasets docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1 Optional
ncbi_datasets include_gbff Boolean True/False to include gbff files in the output FALSE Optional
ncbi_datasets include_gff3 Boolean True/False to include gff3 files in the output FALSE Optional
ncbi_datasets memory Int Memory allocated for the task (in GB) 4 Optional
ncbi_identify complete Boolean Only query genomes labeled complete TRUE Optional
ncbi_identify cpu Int Number of CPUs allocated for the task 1 Optional
ncbi_identify disk_size Int Disk size allocated for the task (in GB) 50 Optional
ncbi_identify docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1 Optional
ncbi_identify memory Int Memory allocated for the task (in GB) 4 Optional
ncbi_identify refseq Boolean Only query RefSeq genomes TRUE Optional
ncbi_identify summary_limit Int Maximum number of genomes to return in the summary 100 Optional
ncbi_identify use_ncbi_virus Boolean Set to true to download from NCBI Virus Datasets FALSE Optional
ncbi_scrub_se cpu Int Number of CPUs to allocate to the task 4 Optional
ncbi_scrub_se disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
ncbi_scrub_se docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 Optional
ncbi_scrub_se memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional
parse_mapping cpu Int Number of CPUs allocated for the task 2 Optional
parse_mapping disk_size Int Disk size allocated for the task (in GB) 100 Optional
parse_mapping docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 Optional
parse_mapping memory Int Memory allocated for the task (in GB) 8 Optional
porechop cpu Int Number of CPUs allocated for the task 4 Optional
porechop disk_size Int Disk size allocated for the task (in GB) 100 Optional
porechop docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/porechop:0.2.4 Optional
porechop memory Int Memory allocated for the task (in GB) 16 Optional
porechop trimopts String Additional trimming options for Porechop Optional
quast_denovo cpu Int Number of CPUs allocated for the task 2 Optional
quast_denovo disk_size Int Disk size allocated for the task (in GB) 100 Optional
quast_denovo docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 Optional
quast_denovo memory Int Memory allocated for the task (in GB) 2 Optional
rasusa bases String Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored Optional
rasusa coverage Float The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required 250 Optional
rasusa cpu Int Number of CPUs allocated for the task 4 Optional
rasusa disk_size Int Disk size allocated for the task (in GB) 100 Optional
rasusa docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 Optional
rasusa frac Float Subsample to a fraction of the reads - e.g., 0.5 samples half the reads Optional
rasusa memory Int Memory allocated for the task (in GB) 8 Optional
rasusa num Int Subsample to a specific number of reads Optional
rasusa read2 File Internal component, do not modify Optional
rasusa seed Int Random seed for reproducibility Optional
raven cpu Int Number of CPUs allocated for the task 4 Optional
raven disk_size Int Disk size allocated for the task (in GB) 100 Optional
raven docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/theiagen/raven:1.8.3 Optional
raven memory Int Memory allocated for the task (in GB) 16 Optional
raven raven_identity Float Threshold for overlap between two reads in order to construct an edge between them 0.0 Optional
raven raven_opts Int Additional parameters for Raven assembler Optional
raven raven_polishing_iterations Int Number of polishing iterations 2 Optional
read_mapping_stats cpu Int Number of CPUs allocated for the task 2 Optional
read_mapping_stats disk_size Int Disk size allocated for the task (in GB) 100 Optional
read_mapping_stats docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 Optional
read_mapping_stats memory Int Memory allocated for the task (in GB) 8 Optional
skani cpu Int Number of CPUs allocated for the task 2 Optional
skani disk_size Int Disk size allocated for the task (in GB) 100 Optional
skani docker String Docker image used for the task us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2 Optional
skani memory Int Memory allocated for the task (in GB) 4 Optional
skani skani_db File Skani database file gs://theiagen-public-resources-rp/reference_data/databases/skani/skani_db_20250606.tar Optional
theiaviral_ont call_porechop Boolean True/False to trim adapters with porechop FALSE Optional
theiaviral_ont call_raven Boolean True/False to call assembly with Raven and use Flye as fallback TRUE Optional
theiaviral_ont extract_unclassified Boolean True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads FALSE Optional
theiaviral_ont genome_length Int Expected genome length of taxon of interest Optional
theiaviral_ont host String Host taxon/accession to dehost reads, if provided Optional
theiaviral_ont min_allele_freq Float Minimum allele frequency required for a variant to populate the consensus sequence 0.6 Optional
theiaviral_ont min_depth Int Minimum read depth required for a variant to populate the consensus sequence 10 Optional
theiaviral_ont min_map_quality Int Minimum mapping quality required for read alignments 20 Optional
theiaviral_ont read_extraction_rank String Taxonomic rank to use for read extraction - limits taxa to only those within the specified rank. family Optional
theiaviral_ont reference_fasta File Reference genome in FASTA format Optional
theiaviral_ont skip_rasusa Boolean True/False to skip read subsampling with Rasusa FALSE Optional
theiaviral_ont skip_screen Boolean True/False to skip read screening check prior to analysis FALSE Optional
version_capture docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 Optional
version_capture timezone String Set the time zone to get an accurate date of analysis (uses UTC by default) Optional

All Tasks

Versioning
versioning: Version Capture

The versioning task captures the workflow version from the GitHub code repository.

Version Capture Technical details

Links
Task task_versioning.wdl
Taxonomic Identification
ncbi_identify

The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank a.k.a read_extraction_rank input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important
  • The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

  • If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
  • If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
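The rank rule above can be illustrated with a minimal sketch. The rank ordering comes from the list of valid options; the helper name `rank_is_valid` is hypothetical and is not part of the ncbi_identify task.

```python
# Ranks accepted by the rank parameter, ordered from lowest to highest.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "domain"]


def rank_is_valid(taxon_rank, requested_rank):
    """Return True if requested_rank is equal to or above taxon_rank."""
    return RANKS.index(requested_rank) >= RANKS.index(taxon_rank)


# Lyssavirus rabies (species) with rank set to family: accepted.
print(rank_is_valid("species", "family"))
# Lyssavirus (genus) with rank set to species: rejected.
print(rank_is_valid("genus", "species"))
```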
Read Quality Control, Trimming, Filtering, Identification and Extraction
read_QC_trim

read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.

HRRT: Human Host Sequence Removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool. It employs a database of Eukaryota k-mers derived from all human RefSeq records, from which any k-mers also found in non-Eukaryota RefSeq records have been subtracted.
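The database construction described above amounts to a set subtraction, sketched below on toy sequences. The function names, the toy k of 4, and the rule that a single shared k-mer flags a read are all assumptions made for illustration; they do not describe HRRT's actual implementation.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_db(human_seqs, non_eukaryote_seqs, k):
    """Human k-mers minus any k-mer also seen in non-eukaryote records."""
    human = set()
    for seq in human_seqs:
        human |= kmers(seq, k)
    other = set()
    for seq in non_eukaryote_seqs:
        other |= kmers(seq, k)
    return human - other


def is_host_read(read, host_db, k):
    """Assumed rule: flag a read if any of its k-mers is in the database."""
    return any(kmer in host_db for kmer in kmers(read, k))


db = build_host_db(["ACGTACGTTT"], ["CGTACG"], k=4)
print(sorted(db))
print(is_host_read("GGCGTTGG", db, k=4))
```

Subtracting the non-eukaryote k-mers is what keeps viral and bacterial reads from being removed merely because they share a short sequence with the human genome.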

NCBI-Scrub Technical Details

Links
Task task_ncbi_scrub.wdl
Software Source Code HRRT on GitHub
Software Documentation HRRT on NCBI
Read quality trimming

Either trimmomatic or fastp can be used for read-quality trimming. Trimmomatic is used by default. Both tools trim low-quality regions of reads with a sliding window (with a window size of trim_window_size), cutting once the average quality within the window falls below trim_quality_trim_score. They will both discard the read if it is trimmed below trim_minlen.
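The sliding-window idea can be sketched as follows. This is a simplified illustration, not the Trimmomatic or fastp implementation: it cuts at the start of the first window whose mean quality drops below the threshold, and the function name and example values are hypothetical.

```python
def sliding_window_trim(quals, window_size, min_avg_quality):
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window whose
    average quality falls below min_avg_quality.
    """
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            return start
    return len(quals)


# Illustrative values standing in for trim_window_size,
# trim_quality_trim_score, and trim_minlen.
quals = [38, 38, 37, 36, 35, 30, 20, 12, 8, 5]
kept = sliding_window_trim(quals, window_size=4, min_avg_quality=30)
keep_read = kept >= 5  # the read is discarded if trimmed below the minimum length
print(kept, keep_read)
```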

read_processing input parameter

This input parameter accepts either trimmomatic or fastp as an input to determine which tool should be used for read quality trimming. This is set to trimmomatic by default.

If the fastp option is selected, see below for table of default parameters.

fastp default read-trimming parameters
Parameter Explanation
-g enables polyG tail trimming
-5 20 enables read end-trimming
-3 20 enables read end-trimming
--detect_adapter_for_pe enables adapter-trimming only for paired-end reads

Additional arguments can be passed using the fastp_args optional parameter.

Trimmomatic and fastp Technical Details

Links
Task task_trimmomatic.wdl
task_fastp.wdl
Software Source Code Trimmomatic
fastp on GitHub
Software Documentation Trimmomatic
fastp
Original Publication(s) Trimmomatic: a flexible trimmer for Illumina sequence data
fastp: an ultra-fast all-in-one FASTQ preprocessor
Adapter removal

The BBDuk task removes adapters from sequence reads. To do this:

  • Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files.
  • BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
What are adapters and why do they need to be removed?

Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.

BBDuk Technical Details

Links
Task task_bbduk.wdl
Software Source Code BBTools
Software Documentation BBDuk
Read Quantification

There are two methods for read quantification to choose from: fastq-scan (default) or fastqc. Both quantify the forward and reverse reads in FASTQ files. For paired-end data, they also provide the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads. fastqc also provides a graphical visualization of the read quality.

read_qc input parameter

This input parameter accepts either "fastq_scan" or "fastqc" as an input to determine which tool should be used for read quantification. This is set to "fastq_scan" by default.

fastq-scan and FastQC Technical Details

Links
Task task_fastq_scan.wdl
task_fastqc.wdl
Software Source Code fastq-scan on GitHub
fastqc on GitHub
Software Documentation fastq-scan
fastqc
host_decontaminate: Host read decontamination

Host genetic data is frequently incidentally sequenced alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning to a reference host genome acquired on-the-fly. The reference host genome can be acquired via NCBI Taxonomy-compatible taxon input or assembly accession. Host Decontaminate maps inputted reads to the host genome using minimap2, reports mapping statistics to this host genome, and outputs the unaligned dehosted reads.

The detailed steps and tasks are as follows:

Taxonomic Identification

The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank a.k.a read_extraction_rank input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important
  • The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

  • If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
  • If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
Download Accession

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

This task uses the accession ID output from the skani task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.

Map Reads to Host

minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.

The mode used in this task is map-ont, which is the default mode for long reads and indicates that long reads with ~10% error rates should be aligned to the reference genome. The output file is in SAM format.

For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.
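As a rough illustration of what selecting a preset looks like, the sketch below builds a minimap2 command line with the map-ont preset and SAM output. The exact command used by the workflow is defined in task_minimap2.wdl and may differ; the helper name and file names here are hypothetical.

```python
def minimap2_command(reference, reads, out_sam, mode="map-ont", threads=2):
    """Build a minimap2 argument list.

    -x selects the preset ("mode"), -a requests SAM output,
    -t sets the thread count, and -o names the output file.
    """
    return [
        "minimap2",
        "-a",
        "-x", mode,
        "-t", str(threads),
        "-o", out_sam,
        reference,
        reads,
    ]


cmd = minimap2_command("host.fasta", "sample.fastq.gz", "sample.sam")
print(" ".join(cmd))
# The list could be passed to subprocess.run(cmd, check=True) where minimap2 is installed.
```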

minimap2 Technical Details

Links
Task task_minimap2.wdl
Software Source Code minimap2 on GitHub
Software Documentation minimap2
Original Publication(s) Minimap2: pairwise alignment for nucleotide sequences
Extract Unaligned Reads

The bam_to_unaligned_fastq task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.
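The core of this step can be sketched on SAM text records: in the SAM format, flag bit 0x4 marks a read as unmapped. The sketch below is only an illustration with a hypothetical function name. It works on SAM lines rather than BAM, ignores mate pairing, and is not the samtools-based implementation used by the task.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def unaligned_fastq(sam_lines):
    """Return FASTQ records for every unmapped read in a list of SAM lines."""
    records = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & UNMAPPED:
            records.append(f"@{name}\n{seq}\n+\n{qual}\n")
    return records


sam = [
    "@SQ\tSN:host\tLN:1000",
    "r1\t0\thost\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",  # aligned to the host
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",       # unaligned, kept as dehosted
]
dehosted = unaligned_fastq(sam)
print("".join(dehosted), end="")
```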

parse_mapping Technical Details

Links
Task task_parse_mapping.wdl
Software Source Code samtools on GitHub
Software Documentation samtools
Original Publication(s) The Sequence Alignment/Map format and SAMtools
Twelve Years of SAMtools and BCFtools
Host Read Mapping Statistics

The assembly_metrics task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.

assembly_metrics Technical Details

Links
Task task_assembly_metrics.wdl
Software Source Code samtools on GitHub
Software Documentation samtools
Original Publication(s) The Sequence Alignment/Map format and SAMtools
Twelve Years of SAMtools and BCFtools

Host Decontaminate Technical Details

Links
Subworkflow File wf_host_decontaminate.wdl
Read Identification

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

This task runs on cleaned reads passed from the read_QC_trim subworkflow and outputs a Kraken2 report detailing taxonomic classifications. It also separates classified reads from unclassified ones.

Database-dependent

This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.

Kraken2 Technical Details

Links
Task task_kraken2.wdl
Software Source Code Kraken2 on GitHub
Software Documentation https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
Original Publication(s) Improved metagenomic analysis with Kraken 2
Read Extraction

The task_krakentools.wdl task extracts reads from the Kraken2 output file. It uses the KrakenTools package to extract reads classified at any user-specified taxon ID.

extract_unclassified input parameter

This parameter determines whether unclassified reads should also be extracted and combined with the taxon-specific extracted reads. By default, this is set to false, meaning that only reads classified to the specified input taxon will be extracted.

Important

This task will extract reads classified to the input taxon and all of its descendant taxa. The rank input parameter controls the extraction of reads classified at the specified rank and all subordinate taxonomic levels. See task ncbi_identify under the Taxonomic Identification section for more details on the rank input parameter.
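The extraction logic described above can be sketched as a walk over a taxonomy tree. This is an illustration only, not the KrakenTools implementation: the function names and the toy tree are hypothetical (the intermediate taxon ID is illustrative), and the sketch relies on Kraken2 reporting unclassified reads with taxon ID 0.

```python
def descendants(taxon_id, children):
    """Return taxon_id together with all of its descendant taxon IDs."""
    found, stack = set(), [taxon_id]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def extract_reads(assignments, taxon_id, children, extract_unclassified=False):
    """Keep reads assigned to taxon_id or any descendant taxon."""
    wanted = descendants(taxon_id, children)
    if extract_unclassified:
        wanted.add(0)  # Kraken2 assigns unclassified reads to taxon ID 0
    return [read for read, taxid in assignments if taxid in wanted]


# Toy tree: family 11270 (Rhabdoviridae) -> illustrative genus -> species 11292.
children = {11270: [11286], 11286: [11292]}
assignments = [("r1", 11292), ("r2", 9606), ("r3", 0), ("r4", 11270)]
print(extract_reads(assignments, 11270, children))
print(extract_reads(assignments, 11270, children, extract_unclassified=True))
```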

KrakenTools Technical Details

Links
Task task_krakentools.wdl
Software Source Code KrakenTools on GitHub
Software Documentation KrakenTools
Original Publication(s) Metagenome analysis using the Kraken software suite
rasusa

The rasusa task performs subsampling on the input raw reads. When enabled, it subsamples reads to a default target depth of 250X, using the estimated genome length either generated by the ncbi_identify task or provided directly by the user. The task is disabled by default; users can enable it by setting the skip_rasusa variable to false. The target subsampling depth can also be adjusted by modifying the coverage variable.

coverage input parameter

This parameter specifies the target coverage for subsampling. The default value is 250, but users can adjust it as needed.
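The relationship between coverage and genome length can be sketched as follows: the target number of bases is coverage multiplied by genome length, and reads are drawn in random order until that target is reached. This is a simplified illustration with hypothetical names, not Rasusa's actual code.

```python
import random


def subsample(read_lengths, genome_length, coverage=250.0, seed=None):
    """Return indices of reads kept to reach roughly coverage x genome_length bases."""
    target_bases = coverage * genome_length
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # the seed makes the draw reproducible
    kept, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        kept.append(idx)
        total += read_lengths[idx]
    return sorted(kept)


# 100 reads of 1,000 bp against a toy 100 bp genome: 250X needs 25,000 bases.
kept = subsample([1000] * 100, genome_length=100, coverage=250, seed=1)
print(len(kept))
```

Without a fixed seed the selected reads differ between runs, which is one reason the task's outputs can be non-deterministic.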

Non-deterministic output(s)

This task may yield non-deterministic outputs.

Rasusa Technical Details

Links
Task task_rasusa.wdl
Software Source Code Rasusa on GitHub
Software Documentation Rasusa on GitHub
Original Publication(s) Rasusa: Randomly subsample sequencing reads to a specified coverage
clean_check_reads

The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:

  1. Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
  2. The proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion basepairs are in either the read1 or read2 files.
  3. Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
  4. Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
  5. Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than the min_coverage.

Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen variable to true. Default values vary between the ONT and PE workflows. The rationale for these default values can be found below:

Default Thresholds and Rationales
Variable Description Default Value Rationale
min_reads A sample will fail the read screening task if its total number of reads is less than or equal to min_reads 50 Minimum number of base pairs for 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus divided by 300 (longest Illumina read length)
min_basepairs A sample will fail the read screening if there are fewer than min_basepairs basepairs 15000 Greater than 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus
min_genome_size A sample will fail the read screening if the estimated genome size is smaller than min_genome_size 1500 Based on the Hepatitis delta (of the Deltavirus genus) genome, the smallest viral genome as of 2024-04-11 (1,700 bp)
max_genome_size A sample will fail the read screening if the estimated genome size is larger than max_genome_size 2673870 Based on the Pandoravirus salinus genome, the biggest viral genome (2,673,870 bp), with 2 Mbp added
min_coverage A sample will fail the read screening if the estimated genome coverage is less than the min_coverage 10 A bare-minimum coverage for genome characterization. Higher coverage would be required for high-quality phylogenetics.
min_proportion A sample will fail the read screening if fewer than min_proportion basepairs are in either the read1 or read2 files 40 Greater than 50% reads are in the read1 file; others are in the read2 file. (PE workflow only)
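The five checks can be sketched as a single function that evaluates every threshold and reports all failures. The defaults are taken from the table above; the function name, the failure labels, and the coverage formula (total base pairs divided by estimated genome size) are assumptions for illustration, and the actual task derives its genome size estimate from mash.

```python
def screen_reads(n_reads, read1_bp, read2_bp, est_genome_size,
                 min_reads=50, min_basepairs=15000, min_genome_size=1500,
                 max_genome_size=2673870, min_coverage=10, min_proportion=40):
    """Return the name of every threshold the sample fails (empty list = pass)."""
    failures = []
    total_bp = read1_bp + read2_bp
    # 1. Total number of reads (fails when less than or equal to min_reads).
    if n_reads <= min_reads:
        failures.append("min_reads")
    # 2. Percentage of base pairs held by the smaller of the two read files.
    smaller_share = 100 * min(read1_bp, read2_bp) / total_bp if total_bp else 0
    if smaller_share < min_proportion:
        failures.append("min_proportion")
    # 3. Total number of base pairs.
    if total_bp < min_basepairs:
        failures.append("min_basepairs")
    # 4. Estimated genome size must fall inside the allowed range.
    if est_genome_size < min_genome_size or est_genome_size > max_genome_size:
        failures.append("genome_size")
    # 5. Estimated coverage (assumed here to be total bp / estimated genome size).
    coverage = total_bp / est_genome_size if est_genome_size else 0
    if coverage < min_coverage:
        failures.append("min_coverage")
    return failures


print(screen_reads(n_reads=2000, read1_bp=150000, read2_bp=150000, est_genome_size=12000))
print(screen_reads(n_reads=40, read1_bp=9000, read2_bp=1000, est_genome_size=12000))
```

Collecting failures in a list rather than returning at the first one mirrors the documented behavior of running samples through all threshold checks regardless of earlier failures.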
De novo Assembly and Reference Selection
These tasks are only performed if no reference genome is provided

In this workflow, de novo assembly is primarily used to facilitate the selection of a closely related reference genome, though high quality de novo assemblies can be used for downstream analysis. If the user provides an input reference_fasta, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:

  • spades
  • megahit
  • checkv_denovo
  • quast_denovo
  • skani
  • ncbi_datasets
spades

The spades task is a wrapper for the SPAdes assembler, which is used for de novo assembly of the cleaned reads. It is run with the --metaviral option, which is recommended for viral genomes. The MetaviralSPAdes pipeline consists of three independent steps: ViralAssembly, which finds putative viral subgraphs in a metagenomic assembly graph and generates contigs from these subgraphs; ViralVerify, which checks whether the resulting contigs have viral origin; and ViralComplete, which checks whether these contigs represent complete viral genomes. For more details, please see the original publication.

MetaviralSPAdes was selected as the default assembler because it produces the most complete viral genomes within TheiaViral, determined by CheckV quality assessment (see task checkv for technical details).

call_metaviralspades input parameter

This parameter controls whether or not the spades task is called by the workflow. By default, call_metaviralspades is set to true because MetaviralSPAdes is used as the primary assembler. MetaviralSPAdes is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with MetaviralSPAdes, they can set the call_metaviralspades variable to false to bypass the spades task and instead de novo assemble using MEGAHIT (see task megahit for details). Additionally, if the spades task fails during execution, the workflow will automatically fall back to using MEGAHIT for de novo assembly.
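The assembler selection and fallback behavior can be sketched as below. This is a conceptual illustration only: the workflow itself is written in WDL and does not use Python exceptions, and the function and stub names are hypothetical.

```python
def choose_assembly(reads, run_spades, run_megahit, call_metaviralspades=True):
    """Try MetaviralSPAdes first (if enabled) and fall back to MEGAHIT on failure."""
    if call_metaviralspades:
        try:
            return "metaviralspades", run_spades(reads)
        except RuntimeError:
            pass  # spades failed: fall through to MEGAHIT
    return "megahit", run_megahit(reads)


# Stub assemblers standing in for the real tasks.
def failing_spades(reads):
    raise RuntimeError("spades failed")


def ok_spades(reads):
    return "spades_contigs.fasta"


def ok_megahit(reads):
    return "megahit_contigs.fasta"


print(choose_assembly("reads.fastq", ok_spades, ok_megahit))
print(choose_assembly("reads.fastq", failing_spades, ok_megahit))
print(choose_assembly("reads.fastq", ok_spades, ok_megahit, call_metaviralspades=False))
```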

Non-deterministic output(s)

This task may yield non-deterministic outputs.

MetaviralSPAdes Technical Details

Links
Task task_spades.wdl
Software Source Code SPAdes on GitHub
Software Documentation SPAdes Manual
Original Publication(s) MetaviralSPAdes: assembly of viruses from metagenomic data
megahit

The megahit task is a wrapper for the MEGAHIT assembler, which is used for de novo metagenomic assembly of the cleaned reads. MEGAHIT is a fast and memory-efficient de novo assembler that can handle large datasets. This task is optional and turned off by default. It can be enabled by setting the call_metaviralspades parameter to false, and it is also used automatically as a fallback if the spades task fails during execution (see task spades for more details).

Non-deterministic output(s)

This task may yield non-deterministic outputs.

MEGAHIT Technical Details

Links
Task task_megahit.wdl
Software Source Code MEGAHIT on GitHub
Software Documentation MEGAHIT
Original Publication(s) MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
skani

The skani task is used to identify and select the reference genome most closely related to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 complete viral genomes. This database was constructed with the following methodology:

  1. Extracting all complete NCBI viral genomes, excluding RefSeq accessions (redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae)

  2. Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA

  3. Adding one SARS-CoV-2 genome for each major pangolin lineage
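
As a rough illustration of reference selection, the sketch below picks the hit with the highest ANI from a skani-style tab-separated table. The column names (Ref_file, Query_file, ANI, Align_fraction_ref, Align_fraction_query) are assumed to match skani's tabular output, and ranking by ANI alone is an assumption; the skani task may also weigh alignment fraction or apply other criteria that are not described here.

```python
import csv
import io


def best_reference(skani_tsv: str) -> dict:
    """Return the row with the highest ANI from skani-style tabular output."""
    rows = list(csv.DictReader(io.StringIO(skani_tsv), delimiter="\t"))
    if not rows:
        raise ValueError("no reference hits reported")
    return max(rows, key=lambda row: float(row["ANI"]))


# Made-up values for illustration only.
example = (
    "Ref_file\tQuery_file\tANI\tAlign_fraction_ref\tAlign_fraction_query\n"
    "refA.fasta\tdenovo.fasta\t92.10\t80.5\t78.2\n"
    "refB.fasta\tdenovo.fasta\t97.45\t95.0\t93.1\n"
)
print(best_reference(example)["Ref_file"])  # refB.fasta
```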

Skani Technical Details

Links
Task task_skani.wdl
Software Source Code Skani on GitHub
Software Documentation Skani Documentation
Original Publication(s) Fast and robust metagenomic sequence comparison through sparse chaining with skani
ncbi_datasets

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

This task uses the accession ID output from the skani task to download the reference genome most closely related to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.

Reference Mapping
bwa

The bwa task is a wrapper for the BWA alignment tool. It utilizes the BWA-MEM algorithm to map cleaned reads to the reference genome, either selected by the skani task or provided by the user input reference_fasta. This creates a BAM file which is then sorted using the command samtools sort.

BWA Technical Details

Links
Task task_bwa.wdl
Software Source Code https://github.com/lh3/bwa
Software Documentation https://bio-bwa.sourceforge.net/
Original Publication(s) Fast and accurate short read alignment with Burrows-Wheeler transform
read_mapping_stats

The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.

read_mapping_stats Technical Details

Links
Task task_assembly_metrics.wdl
Software Source Code samtools on GitHub
Software Documentation samtools
Original Publication(s) The Sequence Alignment/Map format and SAMtools
Twelve Years of SAMtools and BCFtools
Variant Calling and Consensus Generation
ivar_variants

The ivar_variants task wraps the iVar tool to call variants from the sorted BAM file produced by the bwa task. It uses the ivar variants command to identify and report variants based on the aligned reads. The ivar_variants task will filter all variant calls based on user-defined parameters, including min_map_quality, min_depth, and min_allele_freq. This task will return a VCF file containing the variant calls, along with the total number of variants, and the proportion of intermediate variant calls.

min_depth input parameter

This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.

min_map_quality input parameter

This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.

min_allele_freq input parameter

This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.

ivar consensus

The consensus task wraps the iVar tool to generate a reference-based consensus assembly from the sorted BAM file produced by the bwa task. It uses the ivar consensus command to call variants and generate a consensus sequence based on those mapped reads. The consensus task will filter all variant calls based on user-defined parameters, including min_map_quality, min_depth, and min_allele_freq. This task will return a consensus sequence in FASTA format and the samtools mpileup output.

This task is functional for segmented viruses by iteratively executing iVar on a contig-by-contig basis and concatenating the resulting consensus contigs.

min_depth input parameter

This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.

min_map_quality input parameter

This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.

min_allele_freq input parameter

This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
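
To make the role of min_depth and min_allele_freq concrete, here is a deliberately simplified per-position consensus call. It is not iVar's algorithm: iVar applies more detailed rules (for example for indels and mixed positions) that are not reproduced here, and the function name and the dictionary of base counts are hypothetical.

```python
def call_consensus_base(counts: dict, min_depth: int = 10, min_allele_freq: float = 0.6) -> str:
    """Return a consensus base for one position, or 'N' if thresholds are not met."""
    depth = sum(counts.values())
    if depth == 0 or depth < min_depth:
        return "N"  # too few reads to call this position
    base, count = max(counts.items(), key=lambda item: item[1])
    # The most frequent base is called only if it reaches the frequency threshold.
    return base if count / depth >= min_allele_freq else "N"


print(call_consensus_base({"A": 18, "G": 2}))  # A (depth 20, frequency 0.9)
print(call_consensus_base({"A": 5, "G": 4}))   # N (depth 9 is below min_depth)
print(call_consensus_base({"A": 11, "G": 9}))  # N (frequency 0.55 is below 0.6)
```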

Assembly Evaluation and Consensus Quality Control
quast_denovo

QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
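
For readers unfamiliar with N50, the following sketch computes it from a list of contig lengths: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the total assembly length. The function name is illustrative; QUAST computes this metric internally.

```python
def n50(contig_lengths: list) -> int:
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs provided")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Total length is 280, so half is 140; 100 + 80 = 180 reaches it at the 80 bp contig.
print(n50([100, 80, 50, 30, 20]))  # 80
```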

QUAST Technical Details

Links
Task task_quast.wdl
Software Source Code QUAST on GitHub
Software Documentation https://quast.sourceforge.net/
Original Publication(s) QUAST: quality assessment tool for genome assemblies
checkv_denovo & checkv_consensus

CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.

By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports both "weighted_contamination" and "weighted_completeness", which are average percents calculated across the total assembly that are weighted by contig length.
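
A length-weighted average of this kind can be sketched as follows. The function name is hypothetical, and the sketch assumes every contig has a numeric value; how the checkv task treats contigs for which CheckV reports no estimate is not described here.

```python
def length_weighted_mean(values: list, lengths: list) -> float:
    """Average per-contig percentages, weighting each contig by its length."""
    if len(values) != len(lengths):
        raise ValueError("values and lengths must have the same number of entries")
    total_length = sum(lengths)
    if total_length == 0:
        raise ValueError("total contig length is zero")
    return sum(value * length for value, length in zip(values, lengths)) / total_length


# A 9,000 bp contig at 100% and a 1,000 bp contig at 50% completeness.
print(length_weighted_mean([100.0, 50.0], [9000, 1000]))  # 95.0
```

Weighting by length means that a short, poorly resolved contig pulls the assembly-wide value down far less than a plain average (75.0 in this example) would suggest.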

CheckV Technical Details

Links
Task task_checkv.wdl
Software Source Code CheckV on Bitbucket
Software Documentation CheckV Documentation
Original Publication(s) CheckV assesses the quality and completeness of metagenome-assembled viral genomes
consensus_qc

The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, the number of "N" bases, the number of degenerate bases, and an estimate of the percent coverage of the reference genome.
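
The statistics listed above can be approximated with a few string counts. This is a sketch rather than the task's code: the function and key names are hypothetical, and estimating percent coverage as unambiguous bases divided by the reference length is an assumption about how the estimate is derived.

```python
def consensus_stats(sequence: str, reference_length: int) -> dict:
    """Summarize base composition of a consensus sequence."""
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    degenerate = sum(seq.count(code) for code in "RYSWKMBDHV")  # IUPAC ambiguity codes
    return {
        "total": len(seq),
        "n": seq.count("N"),
        "degenerate": degenerate,
        "unambiguous": unambiguous,
        # Assumed definition: called A/C/G/T bases relative to the reference length.
        "percent_reference_coverage": round(100 * unambiguous / reference_length, 2),
    }


print(consensus_stats("ACGTNNRYAC", reference_length=10))
```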

consensus_qc Technical Details

Links
Task task_consensus_qc.wdl
Software Source Docker Image Theiagen Docker Builds: utility:1.1
Versioning
versioning: Version Capture

The versioning task captures the workflow version from the GitHub (code repository) version.

Version Capture Technical Details

Links
Task task_versioning.wdl
Taxonomic Identification
ncbi_identify

The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank a.k.a read_extraction_rank input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important
  • The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

  • If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
  • If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
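
The rule that rank must be equal to or above the input taxon's rank can be expressed as a simple ordering check. The sketch below uses only the rank names listed above; the function name is hypothetical and the task's real validation may differ.

```python
# Ranks ordered from lowest to highest, as listed in the valid options above.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "domain"]


def rank_is_valid(taxon_rank: str, requested_rank: str) -> bool:
    """Return True if requested_rank is equal to or above taxon_rank."""
    return RANKS.index(requested_rank) >= RANKS.index(taxon_rank)


print(rank_is_valid("species", "family"))  # True: a species belongs to one family
print(rank_is_valid("genus", "species"))   # False: a genus does not imply one species
```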
Read Quality Control, Trimming, and Filtering
nanoplot_raw & nanoplot_clean

Nanoplot is used for the determination of mean quality scores, read lengths, and number of reads. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.

Nanoplot Technical Details

Links
Task task_nanoplot.wdl
Software Source Code NanoPlot
Software Documentation NanoPlot Documentation
Original Publication(s) NanoPack2: population-scale evaluation of long-read sequencing data
porechop

Porechop is a tool for finding and removing adapters from ONT data. Adapters on the ends of reads are trimmed, and when a read has an adapter in the middle, the read is split into two.

The porechop task is optional and is turned off by default. It can be enabled by setting the call_porechop parameter to true.

Porechop Technical Details

Links
WDL Task task_porechop.wdl
Software Source Code Porechop on GitHub
Software Documentation https://github.com/rrwick/Porechop#porechop
nanoq

Reads are filtered by length and quality using nanoq. By default, reads shorter than 500 base pairs or with a quality score lower than 10 are filtered out to improve assembly accuracy.
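
The length and quality filter can be illustrated with a short sketch. The function and parameter names are hypothetical, and averaging quality in error-probability space (rather than averaging Phred scores directly) is an assumption about how the read-level score is computed; the default thresholds are the ones stated above.

```python
import math


def mean_quality(phred_scores: list) -> float:
    """Average Phred scores via their error probabilities."""
    error_probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def passes_filter(sequence: str, phred_scores: list,
                  min_length: int = 500, min_quality: float = 10) -> bool:
    """Keep a read only if it is long enough and of sufficient mean quality."""
    return len(sequence) >= min_length and mean_quality(phred_scores) >= min_quality


print(passes_filter("A" * 600, [20] * 600))  # True
print(passes_filter("A" * 100, [20] * 100))  # False: shorter than 500 bp
print(passes_filter("A" * 600, [5] * 600))   # False: mean quality below 10
```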

Nanoq Technical Details

Links
Task task_nanoq.wdl
Software Source Code Nanoq
Software Documentation Nanoq Documentation
Original Publication(s) Nanoq: ultra-fast quality control for nanopore reads
ncbi_scrub_se

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.

NCBI-Scrub Technical Details

Links
Task task_ncbi_scrub.wdl
Software Source Code HRRT on GitHub
Software Documentation HRRT on NCBI
host_decontaminate

Host genetic data is frequently incidentally sequenced alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning to a reference host genome acquired on-the-fly. The reference host genome can be acquired via NCBI Taxonomy-compatible taxon input or assembly accession. Host Decontaminate maps inputted reads to the host genome using minimap2, reports mapping statistics to this host genome, and outputs the unaligned dehosted reads.

The detailed steps and tasks are as follows:

Taxonomic Identification

The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank a.k.a read_extraction_rank input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important
  • The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

  • If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
  • If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
Download Accession

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

This task uses the accession ID output from the skani task to download the reference genome most closely related to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.

Map Reads to Host

minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.

The mode used in this task is map-ont which is the default mode for long reads and indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.

For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.

minimap2 Technical Details

Links
Task task_minimap2.wdl
Software Source Code minimap2 on GitHub
Software Documentation minimap2
Original Publication(s) Minimap2: pairwise alignment for nucleotide sequences
Extract Unaligned Reads

The bam_to_unaligned_fastq task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.

parse_mapping Technical Details

Links
Task task_parse_mapping.wdl
Software Source Code samtools on GitHub
Software Documentation samtools
Original Publication(s) The Sequence Alignment/Map format and SAMtools
Twelve Years of SAMtools and BCFtools
Host Read Mapping Statistics

The assembly_metrics task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.

assembly_metrics Technical Details

Links
Task task_assembly_metrics.wdl
Software Source Code samtools on GitHub
Software Documentation samtools
Original Publication(s) The Sequence Alignment/Map format and SAMtools
Twelve Years of SAMtools and BCFtools

Host Decontaminate Technical Details

Links
Subworkflow File wf_host_decontaminate.wdl
rasusa

The rasusa task performs subsampling on the input raw reads. When enabled, it subsamples reads to a default target depth of 250X, using the estimated genome length either generated by the ncbi_identify task or provided directly by the user. The task is disabled by default; users can enable it by setting the skip_rasusa variable to false. The target subsampling depth can also be adjusted by modifying the coverage variable.

coverage input parameter

This parameter specifies the target coverage for subsampling. The default value is 250, but users can adjust it as needed.
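
The relationship between coverage, genome length, and the number of bases kept can be sketched as below: the target is coverage multiplied by genome length, and randomly ordered reads are kept until that target is reached. This mirrors the idea of coverage-based subsampling, not Rasusa's implementation, and all names are hypothetical. The random ordering is also why outputs can differ between runs unless a seed is fixed.

```python
import random


def subsample_to_coverage(read_lengths: list, genome_size: int,
                          coverage: float = 250, seed=None) -> list:
    """Return indices of randomly chosen reads totalling about coverage x genome_size bases."""
    target_bases = genome_size * coverage
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # random read order; a seed makes it repeatable
    kept, total = [], 0
    for index in order:
        if total >= target_bases:
            break
        kept.append(index)
        total += read_lengths[index]
    return kept


# Ten 500 bp reads, 1,000 bp genome, 2X target: 2,000 bases, so four reads are kept.
print(len(subsample_to_coverage([500] * 10, genome_size=1000, coverage=2, seed=1)))  # 4
```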

Non-deterministic output(s)

This task may yield non-deterministic outputs.

Rasusa Technical Details

Links
Task task_rasusa.wdl
Software Source Code Rasusa on GitHub
Software Documentation Rasusa on GitHub
Original Publication(s) Rasusa: Randomly subsample sequencing reads to a specified coverage
clean_check_reads

The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:

  1. Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
  2. The proportion of base pairs in the forward and reverse read files: A sample will fail the read screening if less than min_proportion of the base pairs are in either the read1 or read2 file.
  3. Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
  4. Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
  5. Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than the min_coverage.

Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen variable to true. Default values vary between the ONT and PE workflows. The rationale for these default values can be found below:

Default Thresholds and Rationales
| Variable | Description | Default Value | Rationale |
| --- | --- | --- | --- |
| min_reads | A sample will fail the read screening task if its total number of reads is less than or equal to min_reads | 50 | Minimum number of base pairs for 10x coverage of the Hepatitis delta virus (of the Deltavirus genus) divided by 300 (longest Illumina read length) |
| min_basepairs | A sample will fail the read screening if there are fewer than min_basepairs basepairs | 15000 | Greater than 10x coverage of the Hepatitis delta virus (of the Deltavirus genus) |
| min_genome_size | A sample will fail the read screening if the estimated genome size is smaller than min_genome_size | 1500 | Based on the Hepatitis delta virus (of the Deltavirus genus) genome, the smallest viral genome as of 2024-04-11 (1,700 bp) |
| max_genome_size | A sample will fail the read screening if the estimated genome size is larger than max_genome_size | 2673870 | Based on the Pandoravirus salinus genome, the biggest viral genome (2,473,870 bp), with 200 kbp added |
| min_coverage | A sample will fail the read screening if the estimated genome coverage is less than the min_coverage | 10 | A bare-minimum coverage for genome characterization. Higher coverage would be required for high-quality phylogenetics. |
| min_proportion | A sample will fail the read screening if fewer than min_proportion basepairs are in either the read1 or read2 files | 40 | Greater than 50% reads are in the read1 file; others are in the read2 file. (PE workflow only) |
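
The threshold checks can be sketched as a function that runs every check and collects all failures, matching the behaviour described above in which samples pass through all checks regardless of earlier failures. The function name and message strings are hypothetical, the defaults are the values from the table, and the PE-only min_proportion check is omitted for brevity.

```python
def screen_reads(num_reads: int, num_basepairs: int, est_genome_size: int, est_coverage: float,
                 min_reads: int = 50, min_basepairs: int = 15000,
                 min_genome_size: int = 1500, max_genome_size: int = 2673870,
                 min_coverage: float = 10) -> list:
    """Return a list of failed checks; an empty list means the sample passes."""
    failures = []
    if num_reads <= min_reads:
        failures.append("too few reads")
    if num_basepairs < min_basepairs:
        failures.append("too few basepairs")
    if est_genome_size < min_genome_size:
        failures.append("estimated genome size too small")
    if est_genome_size > max_genome_size:
        failures.append("estimated genome size too large")
    if est_coverage < min_coverage:
        failures.append("estimated coverage too low")
    return failures


print(screen_reads(1000, 500000, 12000, 41.0))  # []
print(screen_reads(50, 10000, 1000, 5))         # four failures reported together
```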
Read Classification and Extraction
metabuli

The metabuli task is used to classify and extract reads against a reference database. Metabuli uses a novel k-mer structure, called metamer, to analyze both amino acid (AA) and DNA sequences. It leverages AA conservation for sensitive homology detection and DNA mutations for specific differentiation between closely related taxa.

cpu / memory input parameters

Increasing the memory and cpus allocated to Metabuli can substantially increase throughput.

extract_unclassified input parameter

This parameter determines whether unclassified reads should also be extracted and combined with the taxon-specific extracted reads. By default, this is set to false, meaning that only reads classified to the specified input taxon will be extracted.

Descendant taxa reads are extracted

This task will extract reads classified to the input taxon and all of its descendant taxa. The rank input parameter controls the extraction of reads classified at the specified rank and all subordinate taxonomic levels. See task ncbi_identify under the Taxonomic Identification section above for more details on the rank input parameter.
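
Extraction of a taxon together with its descendants amounts to a traversal of the taxonomy tree followed by a membership test. The sketch below uses a toy parent-to-children mapping and made-up read classifications; it does not reflect Metabuli's internal data structures, and all names are illustrative.

```python
def descendant_taxa(root: str, children: dict) -> set:
    """Return root plus every taxon below it in a parent -> children mapping."""
    found, stack = set(), [root]
    while stack:
        taxon = stack.pop()
        if taxon not in found:
            found.add(taxon)
            stack.extend(children.get(taxon, []))
    return found


def extract_reads(classifications: dict, root: str, children: dict) -> list:
    """Keep read IDs whose assigned taxon is root or one of its descendants."""
    wanted = descendant_taxa(root, children)
    return [read_id for read_id, taxon in classifications.items() if taxon in wanted]


# Toy taxonomy and classifications for illustration only.
children = {
    "Rhabdoviridae": ["Lyssavirus", "Vesiculovirus"],
    "Lyssavirus": ["Lyssavirus rabies"],
}
classifications = {
    "read1": "Lyssavirus rabies",
    "read2": "Vesiculovirus",
    "read3": "Coronaviridae",
}
print(extract_reads(classifications, "Rhabdoviridae", children))  # ['read1', 'read2']
```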

Metabuli Technical Details

Links
Task task_metabuli.wdl
Software Source Code Metabuli on GitHub
Software Documentation Metabuli Documentation
Original Publication(s) Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA
De novo Assembly and Reference Selection
These tasks are only performed if no reference genome is provided

In this workflow, de novo assembly is used solely to facilitate the selection of a closely related reference genome. If the user provides an input reference_fasta, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:

  • raven
  • flye
  • checkv_denovo
  • quast_denovo
  • skani
  • ncbi_datasets
raven

The raven task is used to create a de novo assembly from cleaned reads. Raven is an overlap-layout-consensus based assembler that accelerates the overlap step, constructs an assembly graph from reads pre-processed with pile-o-grams, applies a novel and robust graph simplification method based on graph drawings, and polishes unambiguous graph paths using Racon.

Based on internal benchmarking against Flye and results reported by Cook et al. (2024), Raven is faster, produces more contiguous assemblies, and yields more complete genomes within TheiaViral according to CheckV quality assessment (see task checkv for technical details).

call_raven input parameter

This parameter controls whether or not the raven task is called by the workflow. By default, call_raven is set to true because Raven is used as the primary assembler. Raven is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with Raven, they can set the call_raven variable to false to bypass the raven task and instead de novo assemble using Flye (see task flye for details). Additionally, if the Raven task fails during execution, the workflow will automatically fall back to using Flye for de novo assembly.

Error traceback

Raven may fail with cryptic "segmentation fault" (segfault) errors or by failing to produce an output file. It is difficult to trace the source of these issues, though increasing the memory parameter may resolve some errors.

Non-deterministic output(s)

This task may yield non-deterministic outputs.

Raven Technical Details

Links
Task task_raven.wdl
Software Source Code Raven on GitHub
Software Documentation Raven Documentation
Original Publication(s) Time- and memory-efficient genome assembly with Raven
flye

Flye is a de novo assembler for long-read data that uses repeat graphs. Compared to de Bruijn graphs, which require exact k-mer matches, repeat graphs can use approximate matches, which better tolerate the error rate of ONT data.

The flye task can be enabled by setting the call_raven parameter to false. It is also used as a fallback option if the raven task fails during execution (see task raven for more details).

read_type input parameter

This input parameter specifies the type of sequencing reads being used for assembly. This parameter significantly impacts the assembly process and should match the characteristics of your input data. Below are the available options:

| Parameter | Explanation |
| --- | --- |
| --nano-hq (default) | Optimized for ONT high-quality reads, such as Guppy5+ SUP or Q20 (<5% error). Recommended for ONT reads processed with Guppy5 or newer |
| --nano-raw | For ONT regular reads, pre-Guppy5 (<20% error) |
| --nano-corr | ONT reads corrected with other methods (<3% error) |
| --pacbio-raw | PacBio regular CLR reads (<20% error) |
| --pacbio-corr | PacBio reads corrected with other methods (<3% error) |
| --pacbio-hifi | PacBio HiFi reads (<1% error) |

Refer to the Flye documentation for detailed guidance on selecting the appropriate read_type based on your sequencing data and on additional optional parameters.

Non-deterministic output(s)

This task may yield non-deterministic outputs.

Flye Technical Details

Links
WDL Task task_flye.wdl
Software Source Code Flye on GitHub
Software Documentation Flye Documentation
Original Publication(s) Assembly of long, error-prone reads using repeat graphs
skani

The skani task is used to identify and select the reference genome most closely related to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.

By default, the reference genome is selected from a database of approximately 200,000 complete viral genomes. This database was constructed with the following methodology:

  1. Extracting all complete NCBI viral genomes, excluding RefSeq accessions (redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae)

  2. Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA

  3. Adding one SARS-CoV-2 genome for each major pangolin lineage

Skani Technical Details

Links
Task task_skani.wdl
Software Source Code Skani on GitHub
Software Documentation Skani Documentation
Original Publication(s) Fast and robust metagenomic sequence comparison through sparse chaining with skani
ncbi_datasets

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

This task uses the accession ID output from the skani task to download the reference genome most closely related to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.

Reference Mapping
minimap2

minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.

The mode used in this task is map-ont with additional long-read-specific parameters (the -L --cs --MD flags) to align ONT reads to the reference genome. These specialized parameters are essential for proper handling of long read error profiles, generation of detailed alignment information, and improved mapping accuracy for long reads.

map-ont is the default mode for long reads and it indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.

For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.

minimap2 Technical Details

Links
Task task_minimap2.wdl
Software Source Code minimap2 on GitHub
Software Documentation minimap2
Original Publication(s) Minimap2: pairwise alignment for nucleotide sequences
parse_mapping

The sam_to_sorted_bam sub-task converts the SAM file output by the minimap2 task to a BAM file. It then sorts the BAM file by coordinate and creates a BAM index file.

min_map_quality input parameter

This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20.

parse_mapping Technical Details

Links
Task task_parse_mapping.wdl
Software Source Code samtools on GitHub
Software Documentation samtools
Original Publication(s) The Sequence Alignment/Map format and SAMtools
Twelve Years of SAMtools and BCFtools
read_mapping_stats

The read_mapping_stats task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.

read_mapping_stats Technical Details

Links
Task task_assembly_metrics.wdl
Software Source Code samtools on GitHub
Software Documentation samtools
Original Publication(s) The Sequence Alignment/Map format and SAMtools
Twelve Years of SAMtools and BCFtools
fasta_utilities

The fasta_utilities task utilizes samtools to index a reference fasta file. This reference is selected by the skani task or provided by the user input reference_fasta. This indexed reference genome is used for downstream variant calling and consensus generation tasks.

fasta_utilities Technical Details

Links
Task task_fasta_utilities.wdl
Software Source Code samtools on GitHub
Software Documentation samtools
Original Publication(s) The Sequence Alignment/Map format and SAMtools
Twelve Years of SAMtools and BCFtools
Variant Calling and Consensus Generation
clair3

Clair3 performs deep learning-based variant detection using a multi-stage approach. The process begins with pileup-based calling for initial variant identification, followed by full-alignment analysis for comprehensive variant detection. Results are merged into a final high-confidence call set.

The variant calling pipeline employs specialized neural networks trained on ONT data to accurately identify:

  • Single nucleotide variants (SNVs)
  • Small insertions and deletions (indels)
  • Structural variants

clair3_model input parameter

This parameter specifies the clair3 model to use for variant calling. The default is set to "r1041_e82_400bps_sup_v500", but users may select from other available models that clair3 was trained on, which may yield better results depending on the basecaller and data type. The following models are available:

  • "ont"
  • "ont_guppy2"
  • "ont_guppy5"
  • "r941_prom_sup_g5014"
  • "r941_prom_hac_g360+g422"
  • "r941_prom_hac_g238"
  • "r1041_e82_400bps_sup_v500"
  • "r1041_e82_400bps_hac_v500"
  • "r1041_e82_400bps_sup_v410"
  • "r1041_e82_400bps_hac_v410"
Default Parameters and Filtering

In this workflow, clair3 is run with nearly all default parameters. Note that the VCF file produced by the clair3 task is unfiltered and does not represent the final set of variants that will be included in the final consensus genome. A filtered VCF file is generated by the bcftools_consensus task. The filtering parameters are as follows:

  • The min_map_quality parameter is applied before calling variants.
  • The min_depth and min_allele_freq parameters are applied after variant calling during consensus genome construction.

Clair3 Technical Details

Links
Task task_clair3.wdl
Software Source Code Clair3 on GitHub
Software Documentation Clair3 Documentation
Original Publication(s) Symphonizing pileup and full-alignment for deep learning-based long-read variant calling
parse_mapping

The mask_low_coverage sub-task is used to mask low coverage regions in the reference_fasta file to improve the accuracy of the final consensus genome. Coverage thresholds are defined by the min_depth parameter, which specifies the minimum read depth required for a base to be retained. Bases falling below this threshold are replaced with "N"s to clearly mark low confidence regions. The masked reference is then combined with variants from the clair3 task to produce the final consensus genome.

min_depth input parameter

This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.

parse_mapping Technical Details

Links
Task task_parse_mapping.wdl
Software Source Code samtools on GitHub
Software Documentation samtools
Original Publication(s) The Sequence Alignment/Map format and SAMtools
Twelve Years of SAMtools and BCFtools
bcftools_consensus

The bcftools_consensus task generates a consensus genome assembly by applying variants from the clair3 task to a masked reference genome. It uses bcftools to filter variants based on the min_depth and min_allele_freq input parameters, left-align and normalize indels, index the VCF file, and generate a consensus genome in FASTA format. Reference bases are substituted with filtered variants where applicable, preserved in regions without variant calls, and replaced with "N"s in areas masked by the mask_low_coverage task.

min_depth input parameter

This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10.

min_allele_freq input parameter

This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6.
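
As a rough illustration of how the two thresholds interact, the sketch below keeps only variants that satisfy both min_depth and min_allele_freq. It is a simplified stand-in for the bcftools-based filtering, not the task's actual command: the `Variant` class and `passes_filters` function are hypothetical, and treating both thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    pos: int            # 1-based position on the reference
    ref: str            # reference allele
    alt: str            # alternate allele
    depth: int          # total read depth at the site
    allele_freq: float  # fraction of reads supporting the alternate allele


def passes_filters(variant: Variant, min_depth: int = 10, min_allele_freq: float = 0.6) -> bool:
    """Keep a variant only if it meets both thresholds (assumed inclusive)."""
    return variant.depth >= min_depth and variant.allele_freq >= min_allele_freq


variants = [
    Variant(101, "A", "G", depth=40, allele_freq=0.95),  # passes both thresholds
    Variant(250, "C", "T", depth=8, allele_freq=0.90),   # fails min_depth
    Variant(512, "G", "A", depth=30, allele_freq=0.35),  # fails min_allele_freq
]
kept_positions = [v.pos for v in variants if passes_filters(v)]
print(kept_positions)  # prints [101]
```

With the defaults of 10 and 0.6, only the variant at position 101 would be applied to the masked reference.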

bcftools_consensus Technical Details

Links
Task task_bcftools_consensus.wdl
Software Source Code bcftools on GitHub
Software Documentation bcftools Manual Page
Original Publication(s) Twelve Years of SAMtools and BCFtools
Assembly Evaluation and Consensus Quality Control
quast_denovo

QUAST (QUality ASsessment Tool) evaluates genome and metagenome assemblies by computing various metrics, without requiring a reference. These metrics include the number of contigs, the length of the largest contig, and N50.

QUAST Technical Details

Links
Task task_quast.wdl
Software Source Code QUAST on GitHub
Software Documentation https://quast.sourceforge.net/
Original Publication(s) QUAST: quality assessment tool for genome assemblies
checkv_denovo & checkv_consensus

CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.

By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports both "weighted_contamination" and "weighted_completeness", which are average percentages calculated across the total assembly and weighted by contig length.
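
A length-weighted average of this kind can be sketched as follows. The function `length_weighted_mean` is a hypothetical helper, not code from the checkv task, and how the task treats contigs for which CheckV reports no estimate is not covered here.

```python
def length_weighted_mean(values: list[float], lengths: list[int]) -> float:
    """Average per-contig percentages, weighting each contig by its length."""
    if len(values) != len(lengths):
        raise ValueError("values and lengths must have the same length")
    total_length = sum(lengths)
    if total_length == 0:
        raise ValueError("total contig length must be greater than zero")
    return sum(v * n for v, n in zip(values, lengths)) / total_length


# Toy assembly: a 9,000 bp contig at 100% completeness
# and a 1,000 bp contig at 50% completeness.
completeness = [100.0, 50.0]
contig_lengths = [9000, 1000]
print(length_weighted_mean(completeness, contig_lengths))  # prints 95.0
```

The unweighted mean of the two contigs would be 75%, so weighting by length keeps short, low-quality contigs from dominating the assembly-level summary.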

CheckV Technical Details

Links
Task task_checkv.wdl
Software Source Code CheckV on Bitbucket
Software Documentation CheckV Documentation
Original Publication(s) CheckV assesses the quality and completeness of metagenome-assembled viral genomes
consensus_qc

The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, the number of "N" bases, the number of degenerate bases, and an estimate of the percent coverage of the reference genome.
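
The statistics above can be approximated with a short Python sketch. The function `consensus_stats` and its dictionary keys are hypothetical, and the coverage estimate (unambiguous bases divided by reference length) is an assumption about how such a percentage could be computed rather than a statement of the task's exact formula.

```python
DEGENERATE_BASES = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N


def consensus_stats(consensus: str, reference_length: int) -> dict:
    """Summarize base composition of a consensus sequence."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    seq = consensus.upper()
    number_n = seq.count("N")
    number_degenerate = sum(1 for base in seq if base in DEGENERATE_BASES)
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return {
        "number_total": len(seq),
        "number_n": number_n,
        "number_degenerate": number_degenerate,
        "assembly_length_unambiguous": unambiguous,
        # Assumption: coverage is estimated from unambiguous bases only.
        "percent_reference_coverage": round(100 * unambiguous / reference_length, 2),
    }


stats = consensus_stats("ACGTNNRYACGT", reference_length=16)
print(stats["number_n"], stats["number_degenerate"], stats["percent_reference_coverage"])
# prints 2 2 50.0
```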

consensus_qc Technical Details

Links
Task task_consensus_qc.wdl
Software Source Docker Image Theiagen Docker Builds: utility:1.1

Taxa-Specific Tasks

The TheiaViral workflows automatically activate taxa-specific sub-workflows after the identification of relevant taxa using the taxon ID of the reference genome.

Lyssavirus rabies
nextclade

"Nextclade is an open-source project for viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement."

Theiagen has implemented a full genome-based Nextclade dataset for L. rabies with subclade classification resolution.

Nextclade Technical Details

Links
Task task_nextclade.wdl
Software Source Code https://github.com/nextstrain/nextclade
Software Documentation Nextclade
Original Publication(s) Nextclade: clade assignment, mutation calling and quality control for viral genomes.

Outputs

Variable Type Description
abricate_flu_database String ABRicate database used for analysis
abricate_flu_results File File containing all results from ABRicate
abricate_flu_subtype String Flu subtype as determined by ABRicate
abricate_flu_type String Flu type as determined by ABRicate
abricate_flu_version String Version of ABRicate
assembly_denovo_fasta File De novo assembly in FASTA format
auspice_json_flu_ha File Auspice-compatible JSON output generated from Nextclade analysis on Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_flu_na File Auspice-compatible JSON output generated from Nextclade analysis on Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_mpxv File Auspice-compatible JSON output generated from Nextclade analysis on Monkeypox virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_rabies File Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
bbduk_docker String The Docker image for bbduk, which was used to remove the adapters from the sequences
bbduk_read1_clean File Clean forward reads after BBDuk processing
bbduk_read2_clean File Clean reverse reads after BBDuk processing
bwa_aligned_bai File BAM index file for reads aligned to reference
bwa_read1_aligned File Forward reads aligned to reference
bwa_read1_unaligned File Forward reads not aligned to reference
bwa_read2_aligned File Reverse reads aligned to reference
bwa_read2_unaligned File Reverse reads not aligned to reference
bwa_samtools_version String Version of samtools used by BWA
bwa_sorted_bai File Sorted BAM index file of reads aligned to reference
bwa_sorted_bam File Sorted BAM file of reads aligned to reference
bwa_sorted_bam_unaligned File A BAM file that only contains reads that did not align to the reference
bwa_sorted_bam_unaligned_bai File Index companion file to a BAM file that only contains reads that did not align to the reference
bwa_version String Version of BWA software used
checkv_consensus_contamination Float Contamination estimate for consensus assembly from CheckV
checkv_consensus_summary File Summary report from CheckV for consensus assembly
checkv_consensus_total_genes Int Number of genes detected in consensus assembly by CheckV
checkv_consensus_version String Version of CheckV used for consensus assembly
checkv_consensus_weighted_completeness Float Weighted completeness score for consensus assembly from CheckV
checkv_consensus_weighted_contamination Float Weighted contamination score for consensus assembly from CheckV
checkv_denovo_contamination Float Contamination estimate for de novo assembly from CheckV
checkv_denovo_summary File Summary report from CheckV for de novo assembly
checkv_denovo_total_genes Int Number of genes detected in de novo assembly by CheckV
checkv_denovo_version String Version of CheckV used for de novo assembly
checkv_denovo_weighted_completeness Float Weighted completeness score for de novo assembly from CheckV
checkv_denovo_weighted_contamination Float Weighted contamination score for de novo assembly from CheckV
consensus_n_variant_min_depth Int Minimum read depth to call variants for iVar consensus and iVar variants. Also represents the minimum consensus support threshold used by IRMA with Illumina Influenza data.
consensus_qc_assembly_length_unambiguous Int Length of consensus assembly excluding ambiguous bases
consensus_qc_number_Degenerate Int Number of degenerate bases in consensus assembly
consensus_qc_number_N Int Number of N bases in consensus assembly
consensus_qc_number_Total Int Total number of bases in consensus assembly
consensus_qc_percent_reference_coverage Float Percent of reference genome covered in consensus assembly
dehost_wf_dehost_read1 File Reads that did not map to host
dehost_wf_dehost_read2 File Paired-reads that did not map to host
dehost_wf_download_status String Status of host genome acquisition
dehost_wf_host_accession String Host genome accession
dehost_wf_host_fasta File Host genome FASTA file
dehost_wf_host_flagstat File Output from the SAMtools flagstat command to assess quality of the alignment file (BAM)
dehost_wf_host_mapped_bai File Indexed bam file of the reads aligned to the host reference
dehost_wf_host_mapped_bam File Sorted BAM file containing the alignments of reads to the host reference genome
dehost_wf_host_mapping_cov_hist File Coverage histogram from host read mapping
dehost_wf_host_mapping_coverage Float Average coverage from host read mapping
dehost_wf_host_mapping_mean_depth Float Average depth from host read mapping
dehost_wf_host_mapping_metrics File File of mapping metrics
dehost_wf_host_mapping_stats File File of mapping statistics
dehost_wf_host_percent_mapped_reads Float Percentage of reads mapped to host reference genome
fastp_html_report File The HTML report made with fastp
fastp_version String The version of fastp used
fastq_scan_clean1_json File The JSON file output from fastq-scan containing summary stats about clean forward read quality and length
fastq_scan_clean2_json File The JSON file output from fastq-scan containing summary stats about clean reverse read quality and length
fastq_scan_clean_pairs Int Number of read pairs after cleaning
fastq_scan_docker String The Docker image of fastq_scan
fastq_scan_num_reads_clean1 Int The number of forward reads after cleaning as calculated by fastq_scan
fastq_scan_num_reads_clean2 Int The number of reverse reads after cleaning as calculated by fastq_scan
fastq_scan_num_reads_raw1 Int The number of input forward reads as calculated by fastq_scan
fastq_scan_num_reads_raw2 Int The number of input reverse reads as calculated by fastq_scan
fastq_scan_raw1_json File The JSON file output from fastq-scan containing summary stats about raw forward read quality and length
fastq_scan_raw2_json File The JSON file output from fastq-scan containing summary stats about raw reverse read quality and length
fastq_scan_raw_pairs Int Number of raw read pairs
fastq_scan_version String The version of fastq_scan
genoflu_all_segments String The genotypes for each individual flu segment
genoflu_genotype String The genotype of the whole genome, based on the individual segment types
genoflu_output_tsv File The output file from GenoFLU
genoflu_version String The version of GenoFLU used
irma_docker String Docker image used to run IRMA
irma_subtype String Flu subtype as determined by IRMA
irma_subtype_notes String Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column"
irma_type String Flu type as determined by IRMA
irma_version String Version of IRMA used
ivar_tsv File Variant descriptor file generated by iVar variants
ivar_variant_proportion_intermediate String The proportion of variants of intermediate frequency
ivar_variant_version String Version of iVar for running the iVar variants command
ivar_vcf File iVar tsv output converted to VCF format
ivar_version_consensus String Version of iVar for running the iVar consensus command
kraken2_extracted_read1 File Forward reads extracted by taxonomic classification
kraken2_extracted_read2 File Reverse reads extracted by taxonomic classification
kraken_database File Database used for Kraken classification
kraken_docker String Docker image used for Kraken
kraken_report File Full Kraken report
kraken_version String Version of Kraken software used
megahit_docker String Docker image used for MEGAHIT
megahit_status String Status of the MEGAHIT assembly
megahit_version String Version of MEGAHIT used
metaviralspades_docker String Docker image used for MetaviralSPAdes
metaviralspades_status String Status of MetaviralSPAdes assembly
metaviralspades_version String Version of MetaviralSPAdes used
ncbi_datasets_docker String Docker image used for NCBI datasets
ncbi_datasets_version String Version of NCBI datasets used
ncbi_identify_accession String NCBI accession ID of identified taxon
ncbi_identify_avg_genome_length Int Average genome length from NCBI taxon summary
ncbi_identify_genome_summary_tsv File TSV file with genome summary from NCBI
ncbi_identify_read_extraction_rank String Taxonomic rank used for read extraction
ncbi_identify_taxon_id String NCBI taxonomy ID of identified organism
ncbi_identify_taxon_name String Name of identified taxon
ncbi_identify_taxon_summary_tsv File TSV file with taxa specific summary from NCBI
ncbi_scrub_docker String The Docker image for NCBI's HRRT (human read removal tool)
ncbi_scrub_human_spots_removed Int Number of spots removed (or masked)
nextclade_aa_dels_flu_ha String Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for HA segment
nextclade_aa_dels_flu_na String Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for NA segment
nextclade_aa_dels_mpxv String Amino-acid deletions as detected by Nextclade. Specific to Monkeypox
nextclade_aa_dels_rabies String Amino-acid deletions as detected by Nextclade. Specific to Rabies
nextclade_aa_subs_flu_ha String Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for HA segment
nextclade_aa_subs_flu_na String Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for NA segment
nextclade_aa_subs_mpxv String Amino-acid substitutions as detected by Nextclade. Specific to Monkeypox
nextclade_aa_subs_rabies String Amino-acid substitutions as detected by Nextclade. Specific to Rabies
nextclade_clade_mpxv String Nextclade clade designation, specific to Monkeypox
nextclade_clade_rabies String Nextclade clade designation, specific to Rabies
nextclade_docker String Docker image used to run Nextclade
nextclade_ds_tag String Dataset tag used to run Nextclade. Will be blank for Flu
nextclade_ds_tag_flu_ha String Dataset tag used to run Nextclade, specific to Flu HA segment
nextclade_ds_tag_flu_na String Dataset tag used to run Nextclade, specific to Flu NA segment
nextclade_json_flu_ha File Nextclade output in JSON file format, specific to Flu HA segment
nextclade_json_flu_na File Nextclade output in JSON file format, specific to Flu NA segment
nextclade_json_mpxv File Nextclade output in JSON file format, specific to Monkeypox
nextclade_json_rabies File Nextclade output in JSON file format, specific to Rabies
nextclade_lineage_mpxv String Nextclade lineage designation, specific to Monkeypox
nextclade_lineage_rabies String Nextclade lineage designation, specific to Rabies
nextclade_qc_flu_ha String QC metric as determined by Nextclade, specific to Flu HA segment
nextclade_qc_flu_na String QC metric as determined by Nextclade, specific to Flu NA segment
nextclade_qc_mpxv String QC metric as determined by Nextclade, specific to Monkeypox
nextclade_qc_rabies String QC metric as determined by Nextclade, specific to Rabies
nextclade_tsv_flu_ha File Nextclade output in TSV file format, specific to Flu HA segment
nextclade_tsv_flu_na File Nextclade output in TSV file format, specific to Flu NA segment
nextclade_tsv_mpxv File Nextclade output in TSV file format, specific to Monkeypox
nextclade_tsv_rabies File Nextclade output in TSV file format, specific to Rabies
organism String Standardized organism name used for characterization
pango_lineage String Pango lineage as determined by Pangolin
pango_lineage_expanded String Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1"
pango_lineage_report File Full Pango lineage report generated by Pangolin
pangolin_assignment_version String The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment
pangolin_conflicts String Number of lineage conflicts as determined by Pangolin
pangolin_docker String Docker image used to run Pangolin
pangolin_notes String Lineage notes as determined by Pangolin
pangolin_versions String All Pangolin software and database versions
quast_denovo_docker String Docker image used for QUAST
quast_denovo_gc_percent Float GC percentage of de novo assembly from QUAST
quast_denovo_genome_length Int Genome length of de novo assembly from QUAST
quast_denovo_largest_contig Int Size of largest contig in de novo assembly from QUAST
quast_denovo_n50_value Int N50 value of de novo assembly from QUAST
quast_denovo_number_contigs Int Number of contigs in de novo assembly from QUAST
quast_denovo_report File QUAST report for de novo assembly
quast_denovo_uncalled_bases Int Number of uncalled bases in de novo assembly from QUAST
quast_denovo_version String Version of QUAST used
read1_dehosted File The dehosted forward reads file; suggested read file for SRA submission
read2_dehosted File The dehosted reverse reads file; suggested read file for SRA submission
read_mapping_cov_hist File Coverage histogram from read mapping
read_mapping_cov_stats File Coverage statistics from read mapping
read_mapping_coverage Float Average coverage from read mapping
read_mapping_date String Date of read mapping analysis
read_mapping_depth Float Average depth from read mapping
read_mapping_flagstat File Flagstat file from read mapping
read_mapping_meanbaseq Float Mean base quality from read mapping
read_mapping_meanmapq Float Mean mapping quality from read mapping
read_mapping_percentage_mapped_reads Float Percentage of mapped reads
read_mapping_report File Report file from read mapping
read_mapping_samtools_version String Version of samtools used in read mapping
read_mapping_statistics File Statistics file from read mapping
reference_taxon_name String NCBI derived taxon name from best ANI hit accession
skani_database File Database used for Skani
skani_docker String Docker image used for Skani
skani_report File Report from Skani
skani_status String Status of Skani analysis
skani_top_accession String Top accession ID from Skani
skani_top_ani Float Top ANI score from Skani
skani_top_ani_fasta File FASTA file of top ANI match from Skani
skani_top_ref_coverage Float Reference coverage of top match from Skani
skani_top_score Float Top score from Skani
skani_version String Version of Skani used
skani_warning String Skani warning message
theiaviral_illumina_pe_date String Date of TheiaViral Illumina PE workflow run
theiaviral_illumina_pe_version String Version of TheiaViral Illumina PE workflow
trimmomatic_docker String The docker image used for the trimmomatic module in this workflow
trimmomatic_version String The version of Trimmomatic used
Variable Type Description
abricate_flu_database String ABRicate database used for analysis
abricate_flu_results File File containing all results from ABRicate
abricate_flu_subtype String Flu subtype as determined by ABRicate
abricate_flu_type String Flu type as determined by ABRicate
abricate_flu_version String Version of ABRicate
assembly_denovo_fasta File De novo assembly in FASTA format
assembly_to_ref_bai File BAM index file for reads aligned to reference
assembly_to_ref_bam File BAM file of reads aligned to reference
auspice_json_flu_ha File Auspice-compatible JSON output generated from Nextclade analysis on Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_flu_na File Auspice-compatible JSON output generated from Nextclade analysis on Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_mpxv File Auspice-compatible JSON output generated from Nextclade analysis on Monkeypox virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
auspice_json_rabies File Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree
bcftools_docker String Docker image used for bcftools
bcftools_filtered_vcf File Filtered variant calls in VCF format from bcftools
bcftools_version String Version of bcftools used
checkv_consensus_contamination Float Contamination estimate for consensus assembly from CheckV
checkv_consensus_summary File Summary report from CheckV for consensus assembly
checkv_consensus_total_genes Int Number of genes detected in consensus assembly by CheckV
checkv_consensus_version String Version of CheckV used for consensus assembly
checkv_consensus_weighted_completeness Float Weighted completeness score for consensus assembly from CheckV
checkv_consensus_weighted_contamination Float Weighted contamination score for consensus assembly from CheckV
checkv_denovo_contamination Float Contamination estimate for de novo assembly from CheckV
checkv_denovo_summary File Summary report from CheckV for de novo assembly
checkv_denovo_total_genes Int Number of genes detected in de novo assembly by CheckV
checkv_denovo_version String Version of CheckV used for de novo assembly
checkv_denovo_weighted_completeness Float Weighted completeness score for de novo assembly from CheckV
checkv_denovo_weighted_contamination Float Weighted contamination score for de novo assembly from CheckV
clair3_docker String Docker image used for Clair3
clair3_gvcf File Genomic VCF file from Clair3
clair3_model String Model used for Clair3 variant calling
clair3_vcf File Variant calls in VCF format from Clair3
clair3_version String Clair3 Version being used
consensus_qc_assembly_length_unambiguous Int Length of consensus assembly excluding ambiguous bases
consensus_qc_number_Degenerate Int Number of degenerate bases in consensus assembly
consensus_qc_number_N Int Number of N bases in consensus assembly
consensus_qc_number_Total Int Total number of bases in consensus assembly
consensus_qc_percent_reference_coverage Float Percent of reference genome covered in consensus assembly
dehost_wf_dehost_read1 File Reads that did not map to host
dehost_wf_download_status String Status of host genome acquisition
dehost_wf_host_accession String Host genome accession
dehost_wf_host_fasta File Host genome FASTA file
dehost_wf_host_flagstat File Output from the SAMtools flagstat command to assess quality of the alignment file (BAM)
dehost_wf_host_mapped_bai File Indexed bam file of the reads aligned to the host reference
dehost_wf_host_mapped_bam File Sorted BAM file containing the alignments of reads to the host reference genome
dehost_wf_host_mapping_cov_hist File Coverage histogram from host read mapping
dehost_wf_host_mapping_coverage Float Average coverage from host read mapping
dehost_wf_host_mapping_mean_depth Float Average depth from host read mapping
dehost_wf_host_mapping_metrics File File of mapping metrics
dehost_wf_host_mapping_stats File File of mapping statistics
dehost_wf_host_percent_mapped_reads Float Percentage of reads mapped to host reference genome
fasta_utilities_fai File FASTA index file
fasta_utilities_samtools_docker String Docker image used for samtools in fasta utilities
fasta_utilities_samtools_version String Version of samtools used in fasta utilities
flye_denovo_docker String Docker image used for Flye
flye_denovo_info File Information file from Flye assembly
flye_denovo_status String Status of Flye assembly
flye_denovo_version String Version of Flye used
genoflu_all_segments String The genotypes for each individual flu segment
genoflu_genotype String The genotype of the whole genome, based on the individual segment types
genoflu_output_tsv File The output file from GenoFLU
genoflu_version String The version of GenoFLU used
irma_docker String Docker image used to run IRMA
irma_subtype String Flu subtype as determined by IRMA
irma_subtype_notes String Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column"
irma_type String Flu type as determined by IRMA
irma_version String Version of IRMA used
mask_low_coverage_all_coverage_bed File BED file showing all coverage regions
mask_low_coverage_bed File BED file showing masked low coverage regions
mask_low_coverage_bedtools_docker String Docker image used for bedtools in masking
mask_low_coverage_bedtools_version String Version of bedtools used in masking
mask_low_coverage_reference_fasta File Reference FASTA with low coverage regions masked
metabuli_classified File Classified reads from Metabuli
metabuli_database File Database used for Metabuli
metabuli_docker String Docker image used for Metabuli
metabuli_krona_report File Krona visualization report from Metabuli
metabuli_read1_extract File Extracted reads from Metabuli
metabuli_report File Classification report from Metabuli
metabuli_version String Version of Metabuli used
minimap2_docker String The Docker image of minimap2
minimap2_out File Output file from Minimap2 alignment
minimap2_version String The version of minimap2
nanoplot_html_clean File An HTML report describing the clean reads
nanoplot_html_raw File An HTML report describing the raw reads
nanoplot_num_reads_clean1 Int Number of clean reads
nanoplot_num_reads_raw1 Int Number of raw reads
nanoplot_r1_mean_q_clean Float Mean quality score of clean forward reads
nanoplot_r1_mean_q_raw Float Mean quality score of raw forward reads
nanoplot_r1_mean_readlength_clean Float Mean read length of clean forward reads
nanoplot_r1_mean_readlength_raw Float Mean read length of raw forward reads
nanoplot_r1_median_q_clean Float Median quality score of clean forward reads
nanoplot_r1_median_q_raw Float Median quality score of raw forward reads
nanoplot_r1_median_readlength_clean Float Median read length of clean forward reads
nanoplot_r1_median_readlength_raw Float Median read length of raw forward reads
nanoplot_r1_n50_clean Float N50 of clean forward reads
nanoplot_r1_n50_raw Float N50 of raw forward reads
nanoplot_r1_stdev_readlength_clean Float Standard deviation read length of clean forward reads
nanoplot_r1_stdev_readlength_raw Float Standard deviation read length of raw forward reads
nanoplot_tsv_clean File A TSV report describing the clean reads
nanoplot_tsv_raw File A TSV report describing the raw reads
nanoq_filtered_read1 File Filtered reads from NanoQ
nanoq_version String Version of nanoq used in analysis
ncbi_datasets_docker String Docker image used for NCBI datasets
ncbi_datasets_version String Version of NCBI datasets used
ncbi_identify_accession String NCBI accession ID of identified taxon
ncbi_identify_avg_genome_length Int Average genome length from NCBI taxon summary
ncbi_identify_docker String Docker image used for NCBI identify
ncbi_identify_genome_summary_tsv File TSV file with genome summary from NCBI
ncbi_identify_read_extraction_rank String Taxonomic rank used for read extraction
ncbi_identify_taxon_id String NCBI taxonomy ID of identified organism
ncbi_identify_taxon_name String Name of identified taxon
ncbi_identify_taxon_summary_tsv File TSV file with taxa specific summary from NCBI
ncbi_identify_version String Version of NCBI identify tool used
ncbi_scrub_docker String The Docker image for NCBI's HRRT (human read removal tool)
ncbi_scrub_human_spots_removed Int Number of spots removed (or masked)
ncbi_scrub_read1_dehosted File Dehosted reads after NCBI scrub
nextclade_aa_dels_flu_ha String Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for HA segment
nextclade_aa_dels_flu_na String Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for NA segment
nextclade_aa_dels_mpxv String Amino-acid deletions as detected by Nextclade. Specific to Monkeypox
nextclade_aa_dels_rabies String Amino-acid deletions as detected by Nextclade. Specific to Rabies
nextclade_aa_subs_flu_ha String Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for HA segment
nextclade_aa_subs_flu_na String Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for NA segment
nextclade_aa_subs_mpxv String Amino-acid substitutions as detected by Nextclade. Specific to Monkeypox
nextclade_aa_subs_rabies String Amino-acid substitutions as detected by Nextclade. Specific to Rabies
nextclade_clade_mpxv String Nextclade clade designation, specific to Monkeypox
nextclade_clade_rabies String Nextclade clade designation, specific to Rabies
nextclade_docker String Docker image used to run Nextclade
nextclade_ds_tag String Dataset tag used to run Nextclade. Will be blank for Flu
nextclade_ds_tag_flu_ha String Dataset tag used to run Nextclade, specific to Flu HA segment
nextclade_ds_tag_flu_na String Dataset tag used to run Nextclade, specific to Flu NA segment
nextclade_json_flu_ha File Nextclade output in JSON file format, specific to Flu HA segment
nextclade_json_flu_na File Nextclade output in JSON file format, specific to Flu NA segment
nextclade_json_mpxv File Nextclade output in JSON file format, specific to Monkeypox
nextclade_json_rabies File Nextclade output in JSON file format, specific to Rabies
nextclade_lineage_mpxv String Nextclade lineage designation, specific to Monkeypox
nextclade_lineage_rabies String Nextclade lineage designation, specific to Rabies
nextclade_qc_flu_ha String QC metric as determined by Nextclade, specific to Flu HA segment
nextclade_qc_flu_na String QC metric as determined by Nextclade, specific to Flu NA segment
nextclade_qc_mpxv String QC metric as determined by Nextclade, specific to Monkeypox
nextclade_qc_rabies String QC metric as determined by Nextclade, specific to Rabies
nextclade_tsv_flu_ha File Nextclade output in TSV file format, specific to Flu HA segment
nextclade_tsv_flu_na File Nextclade output in TSV file format, specific to Flu NA segment
nextclade_tsv_mpxv File Nextclade output in TSV file format, specific to Monkeypox
nextclade_tsv_rabies File Nextclade output in TSV file format, specific to Rabies
organism String Standardized organism name used for characterization
pango_lineage String Pango lineage as determined by Pangolin
pango_lineage_expanded String Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1"
pango_lineage_report File Full Pango lineage report generated by Pangolin
pangolin_assignment_version String The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment
pangolin_conflicts String Number of lineage conflicts as determined by Pangolin
pangolin_docker String Docker image used to run Pangolin
pangolin_notes String Lineage notes as determined by Pangolin
pangolin_versions String All Pangolin software and database versions
parse_mapping_samtools_docker String Docker image used for samtools in parse mapping
parse_mapping_samtools_version String Version of samtools used in parse mapping
porechop_trimmed_read1 File Trimmed reads from Porechop
porechop_version String Version of Porechop used
quast_denovo_docker String Docker image used for QUAST
quast_denovo_gc_percent Float GC percentage of de novo assembly from QUAST
quast_denovo_genome_length Int Genome length of de novo assembly from QUAST
quast_denovo_largest_contig Int Size of largest contig in de novo assembly from QUAST
quast_denovo_n50_value Int N50 value of de novo assembly from QUAST
quast_denovo_number_contigs Int Number of contigs in de novo assembly from QUAST
quast_denovo_report File QUAST report for de novo assembly
quast_denovo_uncalled_bases Int Number of uncalled bases in de novo assembly from QUAST
quast_denovo_version String Version of QUAST used
rasusa_read1_subsampled File Subsampled read file from Rasusa
rasusa_read2_subsampled File Subsampled read file from Rasusa (paired file)
rasusa_version String Version of RASUSA used for the analysis
raven_denovo_docker String Docker image used for Raven
raven_denovo_status String Status of Raven assembly
raven_denovo_version String Version of Raven used
read_mapping_cov_hist File Coverage histogram from read mapping
read_mapping_cov_stats File Coverage statistics from read mapping
read_mapping_coverage Float Average coverage from read mapping
read_mapping_date String Date of read mapping analysis
read_mapping_depth Float Average depth from read mapping
read_mapping_flagstat File Flagstat file from read mapping
read_mapping_meanbaseq Float Mean base quality from read mapping
read_mapping_meanmapq Float Mean mapping quality from read mapping
read_mapping_percentage_mapped_reads Float Percentage of mapped reads
read_mapping_report File Report file from read mapping
read_mapping_samtools_version String Version of samtools used in read mapping
read_mapping_statistics File Statistics file from read mapping
read_screen_clean String PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure
read_screen_clean_tsv File Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length
reference_taxon_name String NCBI derived taxon name from best ANI hit accession
skani_database File Database used for Skani
skani_docker String Docker image used for Skani
skani_report File Report from Skani
skani_status String Status of Skani analysis
skani_top_accession String Top accession ID from Skani
skani_top_ani Float Top ANI score from Skani
skani_top_ani_fasta File FASTA file of top ANI match from Skani
skani_top_ref_coverage Float Reference coverage of top match from Skani
skani_top_score Float Top score from Skani
skani_version String Version of Skani used
skani_warning String Skani warning message
theiaviral_ont_date String Date of TheiaViral ONT workflow run
theiaviral_ont_version String Version of TheiaViral ONT workflow
What are the differences between the de novo and consensus assemblies?

De novo genomes are generated from scratch without a reference to guide read assembly, while consensus genomes are generated by mapping reads to a reference and replacing reference positions with identified variants (structural and nucleotide). De novo assemblies are thus not biased by requiring reads to map to the reference, though they may be more fragmented. Consensus assembly can generate more robust assemblies from lower coverage samples if the reference genome is of sufficient quality and sufficiently closely related to the input sequence, though it may not perform well in instances of significant structural variation. TheiaViral uses de novo assemblies as an intermediate to select the best reference genome for consensus assembly.

We generally recommend that TheiaViral users focus on the consensus assembly as the desired assembly output. While we chose the best de novo assemblers for TheiaViral based on internal benchmarking, the consensus assembly will often be higher quality than the de novo assembly. However, the de novo assembly can approach or exceed consensus quality if the read inputs largely comprise one virus, have high depth of coverage, and/or are derived from a virus with high potential for recombination. TheiaViral conducts assembly contiguity and viral completeness quality control for de novo assemblies, so a de novo assembly that meets quality control standards can be used for downstream analysis.

How is de novo assembly quality evaluated?

De novo assembly quality evaluation focuses on the completeness and contiguity of the genome. While a ground truth genome does not truly exist for quality comparison, reference genome selection can help contextualize quality if the reference is sufficiently similar to the de novo assembly. TheiaViral uses QUAST to acquire basic contiguity statistics and CheckV to assess viral genome completeness and contamination. Additionally, the reference selection software, Skani, can provide a quantitative comparison between the de novo assembly and the best reference genome.

Completeness and contamination

  • checkv_denovo_summary: The summary file reports CheckV results on a contig-by-contig basis. Ideally, completeness is 100% for a single contig, or 100% for every segment of a segmented virus. If the assembly also contains extraneous contigs, at least one contig is ideally 100% complete. The same principles apply to contamination, which is ideally 0%.
  • checkv_denovo_total_genes: The total genes is ideally the same number of genes as expected from the inputted viral taxon. Sometimes CheckV can fail to recover all the genes from a complete genome, so other statistics should be weighted more heavily in quality evaluation.
  • checkv_denovo_weighted_completeness: The weighted completeness is ideally 100%.
  • checkv_denovo_weighted_contamination: The weighted contamination is ideally 0%.

Length and contiguity

  • quast_denovo_genome_length: The de novo genome length is ideally the same as the expected genome length of the focal virus.
  • quast_denovo_largest_contig: The largest contig is ideally the size of the genome, or the size of the largest expected segment. If there are multiple contigs, and the largest contig is the ideal size, then the smaller contigs may be discarded based on the CheckV completeness for the largest contig (see CheckV outputs).
  • quast_denovo_n50_value: The N50 is an evaluation of contiguity and is ideally as close as possible to the genome size. For segmented viruses, the N50 should be as close as possible to the size of the segment molecule that would cover at least 50% of the total genome size when segment lengths are added after sorting largest to smallest.
  • quast_denovo_number_contigs: The number of contigs is ideally 1 or the total number of segments expected.
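
The N50 expectation for segmented viruses can be worked through with a short calculation. The sketch below uses the standard N50 definition (the contig length at which the running total, summed from largest to smallest, first reaches half of the total length); the segment lengths are hypothetical values chosen for illustration and are not taken from TheiaViral.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or segment) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)  # largest to smallest
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical lengths for an eight-segment genome (illustrative only).
segment_lengths = [2341, 2341, 2233, 1778, 1565, 1413, 1027, 890]
expected_n50 = n50(segment_lengths)
print(expected_n50)
```

For an unsegmented virus assembled into a single contig, the same function simply returns the contig length, which matches the expectation that the N50 approaches the genome size.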

Reference genome similarity

  • skani_top_ani: The percent average nucleotide identity (ANI) for the top Skani hit is ideally 100% if the sequenced virus is highly similar to a reference genome. However, if the virus is divergent, ANI is not a good indication of assembly quality.
  • skani_top_ref_coverage: The percent reference coverage for the top Skani hit is ideally 100% if the sequenced virus has not undergone significant recombination/structural variation.
  • skani_top_score: The score for the top Skani hit is the product of ANI and reference coverage (ANI × reference coverage) and is ideally 100% if the sequenced virus is not substantially divergent from the reference dataset.
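
As a rough illustration of how the score combines the two metrics above, the sketch below multiplies ANI by reference coverage. It assumes both inputs are percentages on a 0 to 100 scale and rescales the product back to a percentage; the exact scaling used by the workflow is not stated here, and the function name is hypothetical.

```python
def skani_score(ani_percent, ref_coverage_percent):
    """ANI x reference coverage, expressed as a percentage.

    Assumption: both inputs are percentages (0-100), so the product is
    divided by 100 to stay on the same scale.
    """
    return ani_percent * ref_coverage_percent / 100.0


# A near-identical hit that covers only half of the reference scores low,
# which is why the score can flag recombinant or structurally divergent genomes.
print(skani_score(98.0, 50.0))
```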
How is consensus assembly quality evaluated?

Consensus assemblies are derived from a reference genome, so quality assessment focuses on coverage and variant quality. Bases with insufficient coverage are denoted as "N". Additionally, the size and contiguity of a TheiaViral consensus assembly is expected to approximate the reference genome, so any discrepancy here is likely due to inferred structural variation.
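
The masking of low-coverage bases can be pictured with a toy example. This is a minimal sketch rather than the workflow's implementation: the function name, the per-base depth list, and the `min_depth` threshold are all hypothetical.

```python
def mask_low_coverage(sequence, depths, min_depth):
    """Replace bases whose read depth is below min_depth with 'N'."""
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(sequence, depths)
    )


# Positions with depth below 10 are reported as N.
masked = mask_low_coverage("ACGTAC", [12, 3, 15, 0, 20, 9], min_depth=10)
print(masked)
```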

Completeness and contamination

  • checkv_consensus_weighted_completeness: The weighted completeness is ideally 100%.

Consensus variant calls

  • consensus_qc_number_Degenerate: The number of degenerate bases is ideally 0. While degenerate bases indicate ambiguity in the sequence, non-N degenerate bases indicate that some information about the base was obtained.
  • consensus_qc_number_N: The number of "N" bases is ideally 0.
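
To make the distinction between "N" and other degenerate bases concrete, the sketch below counts both in a sequence string. It assumes that degenerate bases are the IUPAC ambiguity codes other than N; whether the workflow's consensus_qc_number_Degenerate output uses exactly this definition is an assumption, and the function name is hypothetical.

```python
from collections import Counter

# IUPAC ambiguity codes other than N (assumed definition of "degenerate").
IUPAC_DEGENERATE = set("RYSWKMBDHV")


def count_ambiguity(sequence):
    """Return (number of N bases, number of non-N degenerate bases)."""
    counts = Counter(sequence.upper())
    number_n = counts["N"]
    number_degenerate = sum(counts[base] for base in IUPAC_DEGENERATE)
    return number_n, number_degenerate


# Three N bases (case-insensitive) and two degenerate bases (R and Y).
print(count_ambiguity("ACGTNNRYacgn"))
```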

Coverage

  • consensus_qc_percent_reference_coverage: The percent reference coverage is ideally 100%.
  • read_mapping_cov_hist: The read mapping coverage histogram ideally depicts normally distributed coverage, which may indicate uniform coverage across the reference genome. However, uniform coverage is unlikely with repetitive regions that approach/exceed read length.
  • read_mapping_coverage: The average read mapping coverage is ideally as high as possible.
  • read_mapping_meanbaseq: The mean base quality of mapped reads is ideally as high as possible.
  • read_mapping_meanmapq: The mean mapping (alignment) quality of mapped reads is ideally as high as possible.
  • read_mapping_percentage_mapped_reads: The percent of mapped reads is ideally 100% of the reads classified as the lineage of interest. Some unclassified reads may also map, which may indicate they were erroneously unclassified. Alternatively, these reads could have been erroneously mapped.
Why did the workflow complete without generating a consensus?

TheiaViral is designed to "soft fail" when specific steps do not succeed due to input data quality. This means the workflow will be reported as successful, with an output that delineates the step that failed. If the workflow completes without a consensus assembly, check the following outputs in this order (sorted by timing of failure, latest first):

  • skani_status: If this output is populated with something other than "PASS" and skani_top_accession is populated with "N/A", this indicates that Skani did not identify a sufficiently similar reference genome. The Skani database comprises a broad array of NCBI viral genomes, so a failure here likely indicates poor read quality because viral contigs are not found in the de novo assembly or are too small. It may be useful to BLAST whatever contigs do exist in the de novo assembly to determine if there is contamination that can be removed via the host input parameter. Additionally, review CheckV de novo outputs to assess if viral contigs were retrieved. Finally, consider keeping extract_unclassified set to "true", using a higher read_extraction_rank if it will not introduce contaminant viruses, and invoking a host input to remove host reads if host contigs are present.
  • megahit_status / flye_status: If this output is populated with something other than "PASS", it indicates the fallback assembler did not successfully complete. The fallback assemblers are permissive, so failure here likely indicates poor read quality. Review read QC to check read quality, particularly following read classification. If read classification is discarding a significant number of reads, consider adjusting the extract_unclassified, read_extraction_rank, and host inputs. Otherwise, sequencing quality may be poor.
  • metaviralspades_status / raven_denovo_status: If this output is populated with something other than "PASS", it indicates the default assembler did not successfully complete or extract viral contigs (MetaviralSPAdes). On their own, these statuses do not correspond directly to workflow failure because fallback de novo assemblers are implemented for both TheiaViral workflows.
  • read_screen_clean: If this output is populated with something other than "PASS", it indicates the reads did not pass the imposed thresholds. Either the reads are poor quality or the thresholds are too stringent, in which case the thresholds can be relaxed or skip_screen can be set to "true".
  • dehost_wf_download_status: If this output is populated with something other than "PASS", it indicates a host genome could not be retrieved for decontamination. See the host input explanation for more information and review the download_accession/download_taxonomy task output logs for advanced error parsing.
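
The checks above can be scripted against a sample's outputs. The following is a minimal sketch that assumes the outputs have been loaded into a dictionary keyed by output name (for example, from one row of an exported results table); the dictionary, the function name, and the example values are hypothetical. Empty or missing values are treated as "not reported" and skipped.

```python
# Status outputs in the order recommended above (latest failure first).
STATUS_OUTPUTS = [
    "skani_status",
    "megahit_status",
    "flye_status",
    "metaviralspades_status",
    "raven_denovo_status",
    "read_screen_clean",
    "dehost_wf_download_status",
]


def non_passing_steps(outputs):
    """Return (output name, value) pairs that are populated but not 'PASS'."""
    flagged = []
    for name in STATUS_OUTPUTS:
        value = outputs.get(name)
        if value and value != "PASS":
            flagged.append((name, value))
    return flagged


# Hypothetical sample: the default assembler failed but the fallback passed.
example_outputs = {
    "skani_status": "",
    "megahit_status": "PASS",
    "metaviralspades_status": "FAIL",
    "read_screen_clean": "PASS",
}
print(non_passing_steps(example_outputs))
```

A non-passing metaviralspades_status or raven_denovo_status on its own does not imply that no consensus was produced, so a flagged default assembler should be read together with the fallback assembler status.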
Known errors associated with read quality
  • ONT workflows may fail at Metabuli if no reads are classified as the taxon. Check the Metabuli classification.tsv or krona report for the read extraction taxon ID to determine if any reads were classified. This error will report out of memory (OOM), but increasing memory will not resolve it.
  • Illumina workflows may fail at CheckV (de novo) with the error "Error: 80 hmmsearch tasks failed. Program should be rerun." if no viral contigs were identified in the de novo assembly.

Acknowledgments

We would like to thank Danny Park at the Broad Institute and Jared Johnson at the Washington State Department of Public Health for correspondence during the development of TheiaViral. TheiaViral was built referencing viral-assemble, VAPER, and Artic.