TheiaViral Workflow Series
Quick Facts
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
---|---|---|---|---|
Genomic Characterization | Viral | vX.X.X | No | Sample-level |
TheiaViral Workflows
TheiaViral workflows assemble, quality assess, and characterize viral genomes from diverse data sources, including metagenomic samples. TheiaViral workflows can generate consensus assemblies of recalcitrant viruses, including diverse or recombinant lineages, such as rabies virus and norovirus, through a three-step approach: 1) generating an intermediate de novo assembly from taxonomy-filtered reads, 2) selecting the best reference from a database of ~200,000 complete viral genomes using average nucleotide identity, and 3) producing a final consensus assembly through reference-based read mapping and variant calling. Reference genomes can be directly provided to TheiaViral to bypass de novo assembly, which enables compatibility with tiled amplicon sequencing data. Targeted viral characterization is under active development and is currently functional for Lyssavirus rabies.
What are the main differences between the TheiaViral and TheiaCov workflows?
- TheiaCov Workflows
    - For amplicon-derived viral sequencing methods
    - Supports a limited number of pathogens
    - Uses manually curated, static reference genomes
- TheiaViral Workflows
    - Designed for a variety of sequencing methods
    - Supports relatively diverse and recombinant pathogens
    - Dynamically identifies the most similar reference genome for consensus assembly via an intermediate de novo assembly
Segmented viruses
Segmented viruses are accounted for in TheiaViral. The reference genome database excludes nucleotide accessions from segmented viruses and instead includes RefSeq assembly accessions that contain all viral segments. Consensus assembly modules are constructed to handle multi-segment references.
Workflow Diagram
TheiaViral Workflows for Different Input Types
- TheiaViral_Illumina_PE

    Illumina_PE Input Read Data

    The TheiaViral_Illumina_PE workflow inputs Illumina paired-end read data. Read file extensions should be `.fastq` or `.fq`, and can optionally include the `.gz` compression extension. Theiagen recommends compressing files with gzip before Terra uploads to minimize data upload time and storage costs.

    Modifications to the optional parameter for `trim_minlen` may be required to appropriately trim reads shorter than 2 x 150 bp (i.e. generated using a 300-cycle sequencing kit), such as the 2 x 75 bp reads generated using a 150-cycle sequencing kit.

- TheiaViral_ONT

    ONT Input Read Data

    The TheiaViral_ONT workflow inputs base-called Oxford Nanopore Technology (ONT) read data. Read file extensions should be `.fastq` or `.fq`, and can optionally include the `.gz` compression extension. Theiagen recommends compressing files with gzip before Terra uploads to minimize data upload time and storage costs.

    It is recommended to trim adapter sequences via `dorado` basecalling prior to running TheiaViral_ONT, though `porechop` can optionally be called to trim adapters within the workflow.

    The ONT sequencing kit and base-calling approach can produce substantial variability in the amount and quality of read data. Genome assemblies produced by the TheiaViral_ONT workflow must be quality assessed before reporting results. We recommend using the Dorado_Basecalling_PHB workflow if applicable.
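The file-naming and compression recommendations above can be checked before upload with a small helper. This is a minimal sketch, not part of TheiaViral: the function names `has_valid_read_extension` and `gzip_fastq` are hypothetical, and the check only inspects the file name, not the file contents.

```python
import gzip
import shutil
from pathlib import Path

# Extensions named in the text above; ".gz" is the optional compression suffix.
VALID_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def has_valid_read_extension(filename: str) -> bool:
    """Return True if the file name ends with an accepted FASTQ extension."""
    return filename.endswith(VALID_SUFFIXES)


def gzip_fastq(path: str) -> str:
    """Write a gzip-compressed copy next to the input file and return its path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return str(dst)
```

For example, `has_valid_read_extension("sample_R1.fastq.gz")` is true, while a name ending in `.bam` is rejected.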
Inputs
`taxon` required input parameter
`taxon` is the standardized taxonomic name (e.g. "Lyssavirus rabies") or NCBI taxon ID (e.g. "11292") of the desired virus to analyze. Inputs must be represented in the NCBI taxonomy database and do not have to be species-level (see `read_extraction_rank` below).
`host` optional input parameter
The `host` input triggers the Host Decontaminate workflow, which removes reads that map to a reference host genome. This input needs to be an NCBI Taxonomy-compatible taxon or an NCBI assembly accession. If using a taxon, the first genome retrieved for that taxon is used. If using an accession, the Host Decontaminate task `is_accession` (ONT) or Read QC Trim PE `host_is_accession` (Illumina) boolean must also be set to "true".
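The pairing between `host` and the accession flag can be illustrated with a small sketch. The helper `host_inputs` is hypothetical, and the `<workflow>.<task>.<variable>` key layout is an assumption based on the task and variable columns of the input tables, not a confirmed Terra format. The accession used in the usage note is only an example value.

```python
def host_inputs(workflow: str, host: str, is_accession: bool = False) -> dict:
    """Build the host-related inputs for one TheiaViral workflow.

    Key names are an assumed '<workflow>.<task>.<variable>' layout.
    """
    flag_by_workflow = {
        "theiaviral_ont": "host_decontaminate.is_accession",
        "theiaviral_illumina_pe": "read_QC_trim.host_is_accession",
    }
    if workflow not in flag_by_workflow:
        raise ValueError(f"unknown workflow: {workflow}")

    inputs = {f"{workflow}.host": host}
    if is_accession:
        # An accession must be flagged explicitly; a taxon needs no flag.
        inputs[f"{workflow}.{flag_by_workflow[workflow]}"] = True
    return inputs
```

Calling `host_inputs("theiaviral_ont", "GCF_000001405.40", is_accession=True)` yields both the `host` value and the ONT accession flag, whereas `host_inputs("theiaviral_illumina_pe", "Homo sapiens")` yields only the `host` value.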
`extract_unclassified` optional input parameter
By default, the `extract_unclassified` parameter is set to "true", which indicates that reads that are not classified by Kraken2 (Illumina) or Metabuli (ONT) will be included with reads classified as the input `taxon`. These classifiers often do not comprehensively classify reads using the default RefSeq databases, so extracting unclassified reads is desirable when host and contaminant reads have been sufficiently decontaminated. Host decontamination occurs in TheiaViral using NCBI `sra-human-scrubber`, read classification to the human genome, and/or mapping reads to the inputted `host`. Contaminant viral reads are mostly excluded because they will often be classified against the default RefSeq classification databases. Consider setting `extract_unclassified` to false if de novo assembly or Skani reference selection is failing.
`min_allele_freq`, `min_depth`, and `min_map_quality` optional input parameters
These parameters directly affect the variants that are ultimately reported in the consensus assembly. `min_allele_freq` determines the minimum proportion of reads that must support an allelic variant for it to be reported in the consensus assembly. `min_depth` and `min_map_quality` affect how "N" is reported in the consensus, i.e. positions with depth below `min_depth` are reported as "N" and reads with mapping quality below `min_map_quality` are not included in depth calculations.
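How the three thresholds interact at a single position can be sketched as follows. This is a simplified illustration, not the logic of the underlying variant callers: the function `call_consensus_base` and its input format (a list of `(base, mapping_quality)` pairs) are hypothetical, and the default values are taken from the TheiaViral_Illumina_PE input table.

```python
from collections import Counter


def call_consensus_base(observations, ref_base,
                        min_allele_freq=0.6, min_depth=10, min_map_quality=20):
    """Return the consensus base for one position.

    observations: iterable of (base, mapping_quality) pairs, one per read.
    """
    # Reads below the mapping-quality cutoff do not count toward depth.
    counts = Counter(base for base, mapq in observations
                     if mapq >= min_map_quality)
    depth = sum(counts.values())

    # Insufficient depth is reported as "N".
    if depth < min_depth:
        return "N"

    # Report the most common non-reference allele only if it is frequent enough.
    alt_counts = {b: c for b, c in counts.items() if b != ref_base}
    if alt_counts:
        alt_base, alt_count = max(alt_counts.items(), key=lambda kv: kv[1])
        if alt_count / depth >= min_allele_freq:
            return alt_base
    return ref_base
```

With the defaults, eight "T" reads and four "C" reads over a "C" reference give "T" (8/12 is above 0.6), while five "T" and seven "C" reads keep the reference "C". If three of twelve reads fall below `min_map_quality`, the effective depth drops to nine and the position becomes "N".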
`read_extraction_rank` optional input parameter
By default, the `read_extraction_rank` parameter is set to "family", which indicates that reads will be extracted if they are classified as the taxonomic family of the input `taxon`, including all descendant taxa of the family. Read classification may not resolve to the rank of the input `taxon`, so these reads may be classified at higher ranks. For example, some Lyssavirus rabies (species) reads may only be resolved to Lyssavirus (genus), so they would not be extracted if the `read_extraction_rank` is set to "species". Setting the `read_extraction_rank` above the inputted `taxon`'s rank can therefore dramatically increase the number of reads recovered, at the potential cost of including other viruses. This is likely not a problem for scarcely represented lineages, e.g. a sample that is expected to include Lyssavirus rabies is unlikely to contain other viruses of the corresponding family, Rhabdoviridae. However, setting a `read_extraction_rank` far above the input `taxon` rank can be problematic when multiple representatives of the same viral family are present in similar abundance within the same sample. To further refine the desired `read_extraction_rank`, please review the corresponding classification reports of the respective classification software (Kraken2 for Illumina and Metabuli for ONT).
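The effect of `read_extraction_rank` and `extract_unclassified` on a single read can be sketched as below. This is an illustrative model only: the function `should_extract` and the lineage dictionaries are hypothetical and do not reflect how Kraken2 or Metabuli report classifications internally.

```python
def should_extract(read_lineage, target_lineage,
                   read_extraction_rank="family", extract_unclassified=True):
    """Decide whether one read is kept for assembly.

    read_lineage: dict mapping rank -> taxon name for the read's classification,
                  or None if the read was not classified.
    target_lineage: dict mapping rank -> taxon name for the input taxon.
    """
    if read_lineage is None:
        return extract_unclassified

    target_name = target_lineage.get(read_extraction_rank)
    if target_name is None:
        raise ValueError(f"target has no rank {read_extraction_rank!r}")

    # A read resolved only to a higher rank has no entry at a lower rank,
    # so it cannot match when extraction is restricted to that lower rank.
    return read_lineage.get(read_extraction_rank) == target_name


rabies = {"family": "Rhabdoviridae", "genus": "Lyssavirus",
          "species": "Lyssavirus rabies"}
genus_only_read = {"family": "Rhabdoviridae", "genus": "Lyssavirus"}
other_genus_read = {"family": "Rhabdoviridae", "genus": "Vesiculovirus"}
```

Here `genus_only_read` is extracted at the "family" or "genus" rank but not at "species", matching the example in the text. `other_genus_read` is extracted at "family" but not at "genus", which is the cost of extracting at a broad rank when related viruses are present.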
TheiaViral_Illumina_PE

Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
theiaviral_illumina_pe | read1 | File | Illumina forward read file in FASTQ file format (compression optional) | | Required |
theiaviral_illumina_pe | read2 | File | Illumina reverse read file in FASTQ file format (compression optional) | | Required |
theiaviral_illumina_pe | samplename | String | Name of the sample being analyzed | | Required |
theiaviral_illumina_pe | taxon | String | Taxon ID or organism name of interest | | Required |
bwa | cpu | Int | Number of CPUs to allocate to the task | 6 | Optional |
bwa | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
bwa | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
bwa | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
checkv_consensus | checkv_db | File | CheckV database file | gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz | Optional |
checkv_consensus | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
checkv_consensus | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
checkv_consensus | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
checkv_consensus | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
checkv_denovo | checkv_db | File | CheckV database file | gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz | Optional |
checkv_denovo | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
checkv_denovo | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
checkv_denovo | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
checkv_denovo | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
clean_check_reads | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
clean_check_reads | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
clean_check_reads | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 | Optional |
clean_check_reads | max_genome_length | Int | Maximum genome length able to pass read screening | 2673870 | Optional |
clean_check_reads | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
clean_check_reads | min_basepairs | Int | Minimum base pairs to pass read screening | 15000 | Optional |
clean_check_reads | min_coverage | Int | Minimum coverage to pass read screening | 10 | Optional |
clean_check_reads | min_genome_length | Int | Minimum genome length to pass read screening | 1500 | Optional |
clean_check_reads | min_proportion | Int | Minimum read proportion to pass read screening | 40 | Optional |
clean_check_reads | min_reads | Int | Minimum reads to pass read screening | 50 | Optional |
consensus | char_unknown | String | Character used to represent unknown bases in the consensus sequence | N | Optional |
consensus | count_orphans | Boolean | True/False that determines if anomalous read pairs are NOT skipped in variant calling. Anomalous read pairs are those marked in the FLAG field as paired in sequencing but without the properly-paired flag set. | TRUE | Optional |
consensus | cpu | Int | Number of CPUs to allocate to the task | 8 | Optional |
consensus | disable_baq | Boolean | True/False that determines if base alignment quality (BAQ) computation should be disabled during samtools mpileup before consensus generation | TRUE | Optional |
consensus | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
consensus | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019-epi2me | Optional |
consensus | max_depth | Int | For a given position, read at maximum INT number of reads per input file during samtools mpileup before consensus generation | 600000 | Optional |
consensus | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
consensus | min_bq | Int | Minimum base quality required for a base to be considered during samtools mpileup before consensus generation | 0 | Optional |
consensus | skip_N | Boolean | True/False that determines if "N" bases should be skipped in the consensus sequence | FALSE | Optional |
consensus_qc | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
consensus_qc | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
consensus_qc | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
consensus_qc | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
ivar_variants | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
ivar_variants | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
ivar_variants | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
ivar_variants | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
ivar_variants | reference_gff | File | A GFF file in the GFF3 format can be supplied to specify coordinates of open reading frames (ORFs) so iVar can identify codons and translate variants into amino acids | | Optional |
megahit | cpu | Int | Number of CPUs allocated for the task | 4 | Optional |
megahit | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
megahit | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/theiagen/megahit:1.2.9 | Optional |
megahit | kmers | String | Comma-separated list of kmer sizes to use for assembly. All must be odd, in the range 15-255, increment <= 28 | 21,29,39,59,79,99,119,141 | Optional |
megahit | megahit_opts | String | Additional parameters for MEGAHIT assembler | | Optional |
megahit | memory | Int | Memory allocated for the task (in GB) | 16 | Optional |
megahit | min_contig_length | Int | Minimum contig length for MEGAHIT assembler | 1 | Optional |
morgana_magic | abricate_flu_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | abricate_flu_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
morgana_magic | abricate_flu_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/abricate:1.0.1-insaflu-220727 | Optional |
morgana_magic | abricate_flu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
morgana_magic | abricate_flu_min_percent_coverage | Int | Minimum DNA percent coverage | 60 | Optional |
morgana_magic | abricate_flu_min_percent_identity | Int | Minimum DNA percent identity | 70 | Optional |
morgana_magic | assembly_metrics_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | assembly_metrics_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
morgana_magic | assembly_metrics_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 | Optional |
morgana_magic | assembly_metrics_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
morgana_magic | consensus_qc_cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
morgana_magic | consensus_qc_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
morgana_magic | consensus_qc_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
morgana_magic | consensus_qc_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
morgana_magic | genoflu_cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
morgana_magic | genoflu_cross_reference | File | An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py | | Optional |
morgana_magic | genoflu_disk_size | Int | Amount of storage (in GB) to allocate to the task | 25 | Optional |
morgana_magic | genoflu_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.06 | Optional |
morgana_magic | genoflu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
morgana_magic | irma_cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
morgana_magic | irma_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
morgana_magic | irma_docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/irma:1.2.0 | Optional |
morgana_magic | irma_keep_ref_deletions | Boolean | True/False variable that determines whether sites missed during read gathering (i.e. 0 reads for a site in the reference genome) are masked with N's or deleted from the sequence entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN" | TRUE | Optional |
morgana_magic | irma_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
morgana_magic | nextclade_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | nextclade_disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
morgana_magic | nextclade_docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2 | Optional |
morgana_magic | nextclade_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
morgana_magic | nextclade_output_parser_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | nextclade_output_parser_disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
morgana_magic | nextclade_output_parser_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim | Optional |
morgana_magic | nextclade_output_parser_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
morgana_magic | pangolin_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | pangolin_disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
morgana_magic | pangolin_docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2 | Optional |
morgana_magic | pangolin_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
ncbi_datasets | cpu | Int | Number of CPUs allocated for the task | 1 | Optional |
ncbi_datasets | disk_size | Int | Disk size allocated for the task (in GB) | 50 | Optional |
ncbi_datasets | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1 | Optional |
ncbi_datasets | include_gbff | Boolean | True/False to include gbff files in the output | FALSE | Optional |
ncbi_datasets | include_gff3 | Boolean | True/False to include gff3 files in the output | FALSE | Optional |
ncbi_datasets | memory | Int | Memory allocated for the task (in GB) | 4 | Optional |
ncbi_identify | complete | Boolean | Only query genomes labeled complete | TRUE | Optional |
ncbi_identify | cpu | Int | Number of CPUs allocated for the task | 1 | Optional |
ncbi_identify | disk_size | Int | Disk size allocated for the task (in GB) | 50 | Optional |
ncbi_identify | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1 | Optional |
ncbi_identify | memory | Int | Memory allocated for the task (in GB) | 4 | Optional |
ncbi_identify | refseq | Boolean | Only query RefSeq genomes | TRUE | Optional |
ncbi_identify | summary_limit | Int | Maximum number of genomes to return in the summary | 100 | Optional |
ncbi_identify | use_ncbi_virus | Boolean | Set to true to download from NCBI Virus Datasets | FALSE | Optional |
quast_denovo | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
quast_denovo | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
quast_denovo | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 | Optional |
quast_denovo | memory | Int | Memory allocated for the task (in GB) | 2 | Optional |
rasusa | bases | String | Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored | | Optional |
rasusa | coverage | Float | The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required | 250 | Optional |
rasusa | cpu | Int | Number of CPUs allocated for the task | 4 | Optional |
rasusa | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
rasusa | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 | Optional |
rasusa | frac | Float | Subsample to a fraction of the reads - e.g., 0.5 samples half the reads | | Optional |
rasusa | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
rasusa | num | Int | Subsample to a specific number of reads | | Optional |
rasusa | seed | Int | Random seed for reproducibility | | Optional |
read_QC_trim | adapters | File | File with adapter sequences to be removed | | Optional |
read_QC_trim | bbduk_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
read_QC_trim | call_kraken | Boolean | Internal component, do not modify | | Optional |
read_QC_trim | call_midas | Boolean | Internal component, do not modify | | Optional |
read_QC_trim | fastp_args | String | Additional arguments to use with fastp | --detect_adapter_for_pe -g -5 20 -3 20 | Optional |
read_QC_trim | host_complete_only | Boolean | Only download host reference genome labeled "complete" | FALSE | Optional |
read_QC_trim | host_decontaminate_mem | Int | Memory allocated for minimap2 (in GB) | 32 | Optional |
read_QC_trim | host_is_accession | Boolean | Inputted "host" is an accession | FALSE | Optional |
read_QC_trim | kraken_cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
read_QC_trim | kraken_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
read_QC_trim | phix | File | A file containing the phix used during Illumina sequencing; used in the BBDuk task | | Optional |
read_QC_trim | read_processing | String | The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp" | trimmomatic | Optional |
read_QC_trim | read_qc | String | The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc" | fastq_scan | Optional |
read_QC_trim | target_organism | String | Internal component, do not modify | | Optional |
read_QC_trim | trim_min_length | Int | Specifies minimum length of each read after trimming to be kept | 75 | Optional |
read_QC_trim | trim_quality_min_score | Int | Specifies the average quality of bases in a sliding window to be kept | 30 | Optional |
read_QC_trim | trim_window_size | Int | Specifies window size for trimming (the number of bases to average the quality across) | 4 | Optional |
read_QC_trim | trimmomatic_args | String | Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data. | -phred33 | Optional |
read_QC_trim_pe | kraken_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
read_QC_trim_pe | midas_db | File | Internal component, do not modify | | Optional |
read_mapping_stats | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
read_mapping_stats | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
read_mapping_stats | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 | Optional |
read_mapping_stats | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
skani | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
skani | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
skani | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2 | Optional |
skani | memory | Int | Memory allocated for the task (in GB) | 4 | Optional |
skani | skani_db | File | Skani database file | gs://theiagen-public-resources-rp/reference_data/databases/skani/skani_db_20250606.tar | Optional |
spades | cpu | Int | Number of CPUs allocated for the task | 4 | Optional |
spades | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
spades | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/spades:4.1.0 | Optional |
spades | kmers | String | Comma-separated list of k-mer sizes (must be odd and less than 128) | auto | Optional |
spades | memory | Int | Memory allocated for the task (in GB) | 16 | Optional |
spades | phred_offset | Int | PHRED quality offset in the input reads (33 or 64) | 33 | Optional |
spades | spades_opts | String | Additional parameters for SPAdes assembler | | Optional |
theiaviral_illumina_pe | call_metaviralspades | Boolean | True/False to call assembly with MetaviralSPAdes and use Megahit as fallback | TRUE | Optional |
theiaviral_illumina_pe | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads | TRUE | Optional |
theiaviral_illumina_pe | genome_length | Int | Expected genome length of taxon of interest | | Optional |
theiaviral_illumina_pe | host | String | Host taxon/accession to dehost reads, if provided | | Optional |
theiaviral_illumina_pe | kraken_db | File | Kraken2 database file | gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz | Optional |
theiaviral_illumina_pe | min_allele_freq | Float | Minimum allele frequency required for a variant to populate the consensus sequence | 0.6 | Optional |
theiaviral_illumina_pe | min_depth | Int | Minimum read depth required for a variant to populate the consensus sequence | 10 | Optional |
theiaviral_illumina_pe | min_map_quality | Int | Minimum mapping quality required for read alignments | 20 | Optional |
theiaviral_illumina_pe | read_extraction_rank | String | Taxonomic rank to use for read extraction - limits taxa to only those within the specified rank. | family | Optional |
theiaviral_illumina_pe | reference_fasta | File | Reference genome in FASTA format | | Optional |
theiaviral_illumina_pe | skip_rasusa | Boolean | True/False to skip read subsampling with Rasusa | FALSE | Optional |
theiaviral_illumina_pe | skip_screen | Boolean | True/False to skip read screening check prior to analysis | FALSE | Optional |
version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
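
The `rasusa` rows in the table above describe two ways to set a subsampling target: an explicit number of bases, or a desired coverage combined with a genome size. The arithmetic can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, `bases` is taken as a plain integer rather than a string such as "4.3kb", and the genome size is assumed to come from the workflow-level `genome_length` input.

```python
def target_bases(coverage=250.0, genome_length=None, bases=None):
    """Return the number of bases to keep after subsampling."""
    if bases is not None:
        # An explicit base count overrides coverage and genome size.
        return bases
    if genome_length is None:
        raise ValueError("genome_length is required when bases is not given")
    return int(coverage * genome_length)


def expected_fraction_kept(total_bases, target):
    """Approximate fraction of the input bases retained by subsampling."""
    if total_bases <= 0:
        return 0.0
    # If the input has fewer bases than the target, everything is kept.
    return min(1.0, target / total_bases)
```

With the default coverage of 250 and an illustrative genome length of 12,000 bp, the target is 3,000,000 bases, so a read set of 30,000,000 bases would be reduced to roughly one tenth.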

TheiaViral_ONT

Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
theiaviral_ont | read1 | File | Base-called ONT read file in FASTQ file format (compression optional) | | Required |
theiaviral_ont | samplename | String | Name of the sample being analyzed | | Required |
theiaviral_ont | taxon | String | Taxon ID or organism name of interest | | Required |
bcftools_consensus | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
bcftools_consensus | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
bcftools_consensus | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/bcftools:1.20 | Optional |
bcftools_consensus | memory | Int | Memory allocated for the task (in GB) | 4 | Optional |
checkv_consensus | checkv_db | File | CheckV database file | gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz | Optional |
checkv_consensus | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
checkv_consensus | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
checkv_consensus | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
checkv_consensus | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
checkv_denovo | checkv_db | File | CheckV database file | gs://theiagen-public-resources-rp/reference_data/databases/checkv/checkv-db-v1.5.tar.gz | Optional |
checkv_denovo | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
checkv_denovo | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
checkv_denovo | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/checkv:1.0.3 | Optional |
checkv_denovo | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
clair3 | clair3_model | String | Model to be used by Clair3 | r1041_e82_400bps_sup_v500 | Optional |
clair3 | cpu | Int | Number of CPUs allocated for the task | 4 | Optional |
clair3 | disable_phasing | Boolean | True/False that determines if variants should be called without whatshap phasing in full alignment calling | TRUE | Optional |
clair3 | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
clair3 | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/theiagen/clair3-extra-models:1.0.10 | Optional |
clair3 | enable_gvcf | Boolean | True/False that determines if an additional GVCF output should be generated | FALSE | Optional |
clair3 | enable_haploid_precise | Boolean | True/False that determines haploid calling mode where only 1/1 is considered as a variant | TRUE | Optional |
clair3 | include_all_contigs | Boolean | True/False that determines if all contigs should be included in the output | TRUE | Optional |
clair3 | indel_min_af | Float | Minimum Indel AF required for a candidate variant | 0.08 | Optional |
clair3 | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
clair3 | snp_min_af | Float | Minimum SNP AF required for a candidate variant | 0.08 | Optional |
clair3 | variant_quality | Int | If set, variants with >$qual will be marked PASS, or LowQual otherwise | 2 | Optional |
clean_check_reads | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
clean_check_reads | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
clean_check_reads | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 | Optional |
clean_check_reads | max_genome_length | Int | Maximum genome length able to pass read screening | 2673870 | Optional |
clean_check_reads | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
clean_check_reads | min_basepairs | Int | Minimum base pairs to pass read screening | 15000 | Optional |
clean_check_reads | min_coverage | Int | Minimum coverage to pass read screening | 10 | Optional |
clean_check_reads | min_genome_length | Int | Minimum genome length to pass read screening | 1500 | Optional |
clean_check_reads | min_reads | Int | Minimum reads to pass read screening | 50 | Optional |
clean_check_reads | skip_mash | Boolean | If true, skips estimation of genome size and coverage using mash in read screening steps. As a result, providing true also prevents screening using these parameters. | TRUE | Optional |
consensus_qc | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
consensus_qc | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
consensus_qc | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
consensus_qc | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
fasta_utilities | cpu | Int | Number of CPUs allocated for the task | 1 | Optional |
fasta_utilities | disk_size | Int | Disk size allocated for the task (in GB) | 10 | Optional |
fasta_utilities | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/biocontainers/seqkit:2.4.0--h9ee0642_0 | Optional |
fasta_utilities | memory | Int | Memory allocated for the task (in GB) | 2 | Optional |
flye | additional_parameters | String | Additional parameters for Flye assembler | | Optional |
flye | asm_coverage | Int | Reduced coverage for initial disjointig assembly | | Optional |
flye | cpu | Int | Number of CPUs allocated for the task | 4 | Optional |
flye | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
flye | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/flye:2.9.4 | Optional |
flye | flye_polishing_iterations | Int | Number of polishing iterations | 1 | Optional |
flye | genome_length | Int | Expected genome length for assembly - requires asm_coverage | | Optional |
flye | keep_haplotypes | Boolean | True/False to prevent collapsing alternative haplotypes | FALSE | Optional |
flye | memory | Int | Memory allocated for the task (in GB) | 32 | Optional |
flye | minimum_overlap | Int | Minimum overlap between reads | | Optional |
flye | no_alt_contigs | Boolean | True/False to disable alternative contig generation | FALSE | Optional |
flye | read_error_rate | Float | Expected error rate in reads | | Optional |
flye | read_type | String | Type of read data for Flye | --nano-hq | Optional |
flye | scaffold | Boolean | True/False to enable scaffolding using graph | FALSE | Optional |
host_decontaminate | complete_only | Boolean | Only download genomes labeled "complete" | FALSE | Optional |
host_decontaminate | is_accession | Boolean | Inputted "host" is an accession | FALSE | Optional |
host_decontaminate | minimap2_memory | Int | Memory allocated for minimap2 (in GB) | 32 | Optional |
host_decontaminate | read2 | File | Internal component, do not modify | | Optional |
host_decontaminate | refseq | Boolean | Only download RefSeq genomes | TRUE | Optional |
mask_low_coverage | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
mask_low_coverage | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
mask_low_coverage | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/bedtools:2.31.0 | Optional |
mask_low_coverage | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
metabuli | cpu | Int | Number of CPUs allocated for the task | 4 | Optional |
metabuli | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
metabuli | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/theiagen/metabuli:1.1.0 | Optional |
metabuli | memory | Int | Memory allocated for the task (in GB) | 16 | Optional |
metabuli | metabuli_db | File | Metabuli database file | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/refseq_virus-v223.tar.gz | Optional |
metabuli | min_percent_coverage | Float | Minimum query coverage threshold (0.0 - 1.0) | 0.0 | Optional |
metabuli | min_score | Float | Minimum sequence similarity score (0.0 - 1.0) | 0.0 | Optional |
metabuli | min_sp_score | Float | Minimum score for species- or lower-level classification | 0.0 | Optional |
metabuli | taxonomy_path | File | Path to taxonomy file | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/new_taxdump.tar.gz | Optional |
minimap2 | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
minimap2 | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
minimap2 | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22 | Optional |
minimap2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
minimap2 | query2 | File | Internal component, do not modify | | Optional |
morgana_magic | abricate_flu_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | abricate_flu_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
morgana_magic | abricate_flu_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/abricate:1.0.1-insaflu-220727 | Optional |
morgana_magic | abricate_flu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
morgana_magic | abricate_flu_min_percent_coverage | Int | Minimum DNA percent coverage | 60 | Optional |
morgana_magic | abricate_flu_min_percent_identity | Int | Minimum DNA percent identity | 70 | Optional |
morgana_magic | assembly_metrics_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | assembly_metrics_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
morgana_magic | assembly_metrics_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 | Optional |
morgana_magic | assembly_metrics_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
morgana_magic | consensus_qc_cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
morgana_magic | consensus_qc_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
morgana_magic | consensus_qc_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
morgana_magic | consensus_qc_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
morgana_magic | genoflu_cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
morgana_magic | genoflu_cross_reference | File | An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py | | Optional |
morgana_magic | genoflu_disk_size | Int | Amount of storage (in GB) to allocate to the task | 25 | Optional |
morgana_magic | genoflu_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.06 | Optional |
morgana_magic | genoflu_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
morgana_magic | irma_cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
morgana_magic | irma_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
morgana_magic | irma_docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/irma:1.2.0 | Optional |
morgana_magic | irma_keep_ref_deletions | Boolean | True/False variable that determines how sites missed during read gathering (i.e. 0 reads for a site in the reference genome) are handled: either masked by inserting N's or deleted from the sequence entirely. False sets this IRMA parameter to "DEL" and true sets it to "NNN" | TRUE | Optional |
morgana_magic | irma_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
morgana_magic | nextclade_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | nextclade_disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
morgana_magic | nextclade_docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2 | Optional |
morgana_magic | nextclade_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
morgana_magic | nextclade_output_parser_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | nextclade_output_parser_disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
morgana_magic | nextclade_output_parser_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim | Optional |
morgana_magic | nextclade_output_parser_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
morgana_magic | pangolin_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
morgana_magic | pangolin_disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
morgana_magic | pangolin_docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.10.2 | Optional |
morgana_magic | pangolin_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
morgana_magic | read2 | File | Internal component, do not modify | | Optional |
nanoplot_clean | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
nanoplot_clean | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
nanoplot_clean | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 | Optional |
nanoplot_clean | max_length | Int | The maximum length of clean reads, for which reads longer than the length specified will be hidden. | 100000 | Optional |
nanoplot_clean | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
nanoplot_raw | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
nanoplot_raw | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
nanoplot_raw | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 | Optional |
nanoplot_raw | max_length | Int | The maximum length of raw reads, for which reads longer than the length specified will be hidden. | 100000 | Optional |
nanoplot_raw | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
nanoq | cpu | Int | Number of CPUs allocated for the task | 1 | Optional |
nanoq | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
nanoq | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/biocontainers/nanoq:0.9.0--hec16e2b_1 | Optional |
nanoq | max_read_length | Int | Maximum read length to keep | 100000 | Optional |
nanoq | max_read_qual | Int | Maximum read quality to keep | 10 | Optional |
nanoq | memory | Int | Memory allocated for the task (in GB) | 2 | Optional |
nanoq | min_read_length | Int | Minimum read length to keep | 500 | Optional |
nanoq | min_read_qual | Int | Minimum read quality to keep | 10 | Optional |
ncbi_datasets | cpu | Int | Number of CPUs allocated for the task | 1 | Optional |
ncbi_datasets | disk_size | Int | Disk size allocated for the task (in GB) | 50 | Optional |
ncbi_datasets | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1 | Optional |
ncbi_datasets | include_gbff | Boolean | True/False to include gbff files in the output | FALSE | Optional |
ncbi_datasets | include_gff3 | Boolean | True/False to include gff3 files in the output | FALSE | Optional |
ncbi_datasets | memory | Int | Memory allocated for the task (in GB) | 4 | Optional |
ncbi_identify | complete | Boolean | Only query genomes labeled complete | TRUE | Optional |
ncbi_identify | cpu | Int | Number of CPUs allocated for the task | 1 | Optional |
ncbi_identify | disk_size | Int | Disk size allocated for the task (in GB) | 50 | Optional |
ncbi_identify | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:16.38.1 | Optional |
ncbi_identify | memory | Int | Memory allocated for the task (in GB) | 4 | Optional |
ncbi_identify | refseq | Boolean | Only query RefSeq genomes | TRUE | Optional |
ncbi_identify | summary_limit | Int | Maximum number of genomes to return in the summary | 100 | Optional |
ncbi_identify | use_ncbi_virus | Boolean | Set to true to download from NCBI Virus Datasets | FALSE | Optional |
ncbi_scrub_se | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
ncbi_scrub_se | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
ncbi_scrub_se | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional |
ncbi_scrub_se | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
parse_mapping | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
parse_mapping | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
parse_mapping | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
parse_mapping | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
porechop | cpu | Int | Number of CPUs allocated for the task | 4 | Optional |
porechop | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
porechop | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/porechop:0.2.4 | Optional |
porechop | memory | Int | Memory allocated for the task (in GB) | 16 | Optional |
porechop | trimopts | String | Additional trimming options for Porechop | | Optional |
quast_denovo | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
quast_denovo | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
quast_denovo | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 | Optional |
quast_denovo | memory | Int | Memory allocated for the task (in GB) | 2 | Optional |
rasusa | bases | String | Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored | | Optional |
rasusa | coverage | Float | The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required | 250 | Optional |
rasusa | cpu | Int | Number of CPUs allocated for the task | 4 | Optional |
rasusa | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
rasusa | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 | Optional |
rasusa | frac | Float | Subsample to a fraction of the reads - e.g., 0.5 samples half the reads | | Optional |
rasusa | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
rasusa | num | Int | Subsample to a specific number of reads | | Optional |
rasusa | read2 | File | Internal component, do not modify | | Optional |
rasusa | seed | Int | Random seed for reproducibility | | Optional |
raven | cpu | Int | Number of CPUs allocated for the task | 4 | Optional |
raven | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
raven | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/theiagen/raven:1.8.3 | Optional |
raven | memory | Int | Memory allocated for the task (in GB) | 16 | Optional |
raven | raven_identity | Float | Threshold for overlap between two reads in order to construct an edge between them | 0.0 | Optional |
raven | raven_opts | Int | Additional parameters for Raven assembler | | Optional |
raven | raven_polishing_iterations | Int | Number of polishing iterations | 2 | Optional |
read_mapping_stats | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
read_mapping_stats | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
read_mapping_stats | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 | Optional |
read_mapping_stats | memory | Int | Memory allocated for the task (in GB) | 8 | Optional |
skani | cpu | Int | Number of CPUs allocated for the task | 2 | Optional |
skani | disk_size | Int | Disk size allocated for the task (in GB) | 100 | Optional |
skani | docker | String | Docker image used for the task | us-docker.pkg.dev/general-theiagen/staphb/skani:0.2.2 | Optional |
skani | memory | Int | Memory allocated for the task (in GB) | 4 | Optional |
skani | skani_db | File | Skani database file | gs://theiagen-public-resources-rp/reference_data/databases/skani/skani_db_20250606.tar | Optional |
theiaviral_ont | call_porechop | Boolean | True/False to trim adapters with porechop | FALSE | Optional |
theiaviral_ont | call_raven | Boolean | True/False to call assembly with Raven and use Flye as fallback | TRUE | Optional |
theiaviral_ont | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads | FALSE | Optional |
theiaviral_ont | genome_length | Int | Expected genome length of taxon of interest | | Optional |
theiaviral_ont | host | String | Host taxon/accession to dehost reads, if provided | | Optional |
theiaviral_ont | min_allele_freq | Float | Minimum allele frequency required for a variant to populate the consensus sequence | 0.6 | Optional |
theiaviral_ont | min_depth | Int | Minimum read depth required for a variant to populate the consensus sequence | 10 | Optional |
theiaviral_ont | min_map_quality | Int | Minimum mapping quality required for read alignments | 20 | Optional |
theiaviral_ont | read_extraction_rank | String | Taxonomic rank to use for read extraction - limits taxons to only those within the specified ranks. | family | Optional |
theiaviral_ont | reference_fasta | File | Reference genome in FASTA format | | Optional |
theiaviral_ont | skip_rasusa | Boolean | True/False to skip read subsampling with Rasusa | FALSE | Optional |
theiaviral_ont | skip_screen | Boolean | True/False to skip read screening check prior to analysis | FALSE | Optional |
version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
All Tasks¶
Versioning
versioning: Version Capture
The versioning task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical details
Links | |
---|---|
Task | task_versioning.wdl |
Taxonomic Identification
ncbi_identify
The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.
taxon input parameter
This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
rank a.k.a. read_extraction_rank input parameter
Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.
Important
- The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.
Examples:
- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
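The rule above amounts to an ordering check on taxonomic ranks. The following is a minimal Python sketch of that check, not the task's actual implementation; the `RANKS` list and the `rank_is_valid` helper are hypothetical names introduced only for illustration.

```python
# Ranks accepted by the rank parameter, ordered from lowest to highest.
# RANKS and rank_is_valid are hypothetical names used for illustration only.
RANKS = ["species", "genus", "family", "order",
         "class", "phylum", "kingdom", "domain"]


def rank_is_valid(taxon_rank: str, requested_rank: str) -> bool:
    """Return True if requested_rank is equal to or above taxon_rank."""
    return RANKS.index(requested_rank) >= RANKS.index(taxon_rank)


# A species-level taxon reported at the family level satisfies the rule.
print(rank_is_valid("species", "family"))
# A genus-level taxon reported at the species level does not.
print(rank_is_valid("genus", "species"))
```

The first call corresponds to the Lyssavirus rabies example and the second to the Lyssavirus example that fails.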
NCBI Datasets Technical Details
Links | |
---|---|
Task | task_identify_taxon_id.wdl |
Software Source Code | NCBI Datasets on GitHub |
Software Documentation | NCBI Datasets Documentation on NCBI |
Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Read Quality Control, Trimming, Filtering, Identification and Extraction
read_QC_trim
read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.
HRRT: Human Host Sequence Removal
All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).
HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.
NCBI-Scrub Technical Details
Links | |
---|---|
Task | task_ncbi_scrub.wdl |
Software Source Code | HRRT on GitHub |
Software Documentation | HRRT on NCBI |
Read quality trimming
Either trimmomatic or fastp can be used for read-quality trimming. Trimmomatic is used by default. Both tools trim low-quality regions of reads with a sliding window (with a window size of trim_window_size), cutting once the average quality within the window falls below trim_quality_trim_score. They will both discard the read if it is trimmed below trim_minlen.
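To make the sliding-window idea concrete, here is a simplified Python sketch. It is an assumption-laden illustration rather than the behavior of Trimmomatic or fastp, which differ in exactly where they cut within a failing window; the function name and its arguments are hypothetical stand-ins for trim_window_size, trim_quality_trim_score, and trim_minlen.

```python
def sliding_window_trim(quals, window_size, min_mean_quality, min_length):
    """Return the number of leading bases kept, or 0 if the read is discarded.

    Simplification: the read is cut at the start of the first window whose
    mean quality falls below min_mean_quality.
    """
    keep = len(quals)
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_mean_quality:
            keep = start
            break
    # Reads trimmed below the minimum length are dropped entirely.
    return keep if keep >= min_length else 0


# Ten high-quality bases followed by ten low-quality bases.
example_quals = [30] * 10 + [5] * 10
print(sliding_window_trim(example_quals, window_size=4,
                          min_mean_quality=20, min_length=5))
```

Raising `min_length` above the number of bases that survive trimming causes the same read to be discarded.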
read_processing input parameter
This input parameter accepts either trimmomatic or fastp as an input to determine which tool should be used for read quality trimming. This is set to trimmomatic by default.
If the fastp option is selected, see below for a table of default parameters.
fastp default read-trimming parameters
Parameter | Explanation |
---|---|
-g | enables polyG tail trimming |
-5 20 | enables quality trimming at the 5' (front) end of reads |
-3 20 | enables quality trimming at the 3' (tail) end of reads |
--detect_adapter_for_pe | enables adapter-trimming only for paired-end reads |
Additional arguments can be passed using the fastp_args optional parameter.
Trimmomatic and fastp Technical Details
Links | |
---|---|
Task | task_trimmomatic.wdl task_fastp.wdl |
Software Source Code | Trimmomatic fastp on Github |
Software Documentation | Trimmomatic fastp |
Original Publication(s) | Trimmomatic: a flexible trimmer for Illumina sequence data fastp: an ultra-fast all-in-one FASTQ preprocessor |
Adapter removal
The BBDuk task removes adapters from sequence reads. To do this:
- Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files.
- BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
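The k-mer matching step can be pictured with a small Python sketch. This is only an illustration of the general idea, not BBDuk's implementation: it ignores reverse complements and mismatch tolerance, uses a toy k of 5 instead of 31 for readability, and the sequences and helper names are made up.

```python
def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def has_kmer_match(read, reference_kmers, k):
    """Return True if any k-mer of the read occurs in the reference set."""
    return any(read[i:i + k] in reference_kmers
               for i in range(len(read) - k + 1))


# Toy contaminant reference and reads (hypothetical sequences).
K = 5
contaminant_kmers = kmers("ACGTACGTTAGC", K)
reads = {"read_a": "GGGTACGTTCCC", "read_b": "TTTTTTTTTTTT"}

# Keep only reads that share no k-mer with the contaminant.
kept = [name for name, seq in reads.items()
        if not has_kmer_match(seq, contaminant_kmers, K)]
print(kept)
```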
What are adapters and why do they need to be removed?
Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.
BBDuk Technical Details
Links | |
---|---|
Task | task_bbduk.wdl |
Software Source Code | BBTools |
Software Documentation | BBDuk |
Read Quantification
There are two methods for read quantification to choose from: fastq-scan (default) or fastqc. Both quantify the forward and reverse reads in FASTQ files. For paired-end data, they also provide the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads. fastqc also provides a graphical visualization of the read quality.
read_qc input parameter
This input parameter accepts either "fastq_scan" or "fastqc" as an input to determine which tool should be used for read quantification. This is set to "fastq_scan" by default.
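As a rough picture of what read quantification involves, the following Python sketch counts reads and bases in FASTQ text. It assumes strict four-line FASTQ records and is not how fastq-scan or fastqc are implemented; `count_reads_and_bases` is a hypothetical helper.

```python
import io


def count_reads_and_bases(handle):
    """Count reads and bases, assuming four lines per FASTQ record."""
    reads = 0
    bases = 0
    for line_number, line in enumerate(handle):
        # The sequence is the second line of every four-line record.
        if line_number % 4 == 1:
            reads += 1
            bases += len(line.strip())
    return reads, bases


# Two toy records held in memory instead of a file on disk.
example_fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n")
print(count_reads_and_bases(example_fastq))
```

Running the same count on raw and clean reads gives the before-and-after comparison described above.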
fastq-scan and FastQC Technical Details
Links | |
---|---|
Task | task_fastq_scan.wdl task_fastqc.wdl |
Software Source Code | fastq-scan on Github fastqc on Github |
Software Documentation | fastq-scan fastqc |
host_decontaminate: Host read decontamination
Host genetic data is frequently incidentally sequenced alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning to a reference host genome acquired on-the-fly. The reference host genome can be acquired via NCBI Taxonomy-compatible taxon input or assembly accession. Host Decontaminate maps inputted reads to the host genome using minimap2, reports mapping statistics to this host genome, and outputs the unaligned dehosted reads.
The detailed steps and tasks are as follows:
Taxonomic Identification
The ncbi_identify task uses NCBI Datasets to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon.
taxon input parameter
This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
rank a.k.a. read_extraction_rank input parameter
Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.
Important
- The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.
Examples:
- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
NCBI Datasets Technical Details
Links | |
---|---|
Task | task_identify_taxon_id.wdl |
Software Source Code | NCBI Datasets on GitHub |
Software Documentation | NCBI Datasets Documentation on NCBI |
Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Download Accession
The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.
This task uses the accession ID output from the skani task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.
NCBI Datasets Technical Details
Links | |
---|---|
Task | task_ncbi_datasets.wdl |
Software Source Code | NCBI Datasets on GitHub |
Software Documentation | NCBI Datasets Documentation on NCBI |
Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Map Reads to Host
minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.
The mode used in this task is map-ont, which is the default mode for long reads and indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.
For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.
minimap2 Technical Details
Links | |
---|---|
Task | task_minimap2.wdl |
Software Source Code | minimap2 on GitHub |
Software Documentation | minimap2 |
Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
Extract Unaligned Reads
The bam_to_unaligned_fastq task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.
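In SAM/BAM records, a read that failed to align carries the 0x4 bit in its FLAG field. The Python sketch below shows how that bit identifies unaligned reads in SAM text; it is an illustration only (the task itself relies on samtools), and `unmapped_read_names` and the toy records are hypothetical.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_read_names(sam_lines):
    """Return the names of reads whose FLAG has the unmapped bit set."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & UNMAPPED_FLAG:
            names.append(fields[0])
    return names


# Toy SAM records: one unmapped read and two mapped reads.
example_sam = [
    "@HD\tVN:1.6",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read2\t0\thost_chr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t16\thost_chr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(unmapped_read_names(example_sam))
```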
parse_mapping Technical Details
Links | |
---|---|
Task | task_parse_mapping.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Host Read Mapping Statistics
The assembly_metrics task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
assembly_metrics Technical Details
Links | |
---|---|
Task | task_assembly_metrics.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools |
Host Decontaminate Technical Details
Links | |
---|---|
Subworkflow File | wf_host_decontaminate.wdl |
Read Identification
Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.
This task runs on cleaned reads passed from the read_QC_trim subworkflow and outputs a Kraken2 report detailing taxonomic classifications. It also separates classified reads from unclassified ones.
Database-dependent
This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.
Kraken2 Technical Details
Links | |
---|---|
Task | task_kraken2.wdl |
Software Source Code | Kraken2 on GitHub |
Software Documentation | https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown |
Original Publication(s) | Improved metagenomic analysis with Kraken 2 |
Read Extraction
The task_krakentools.wdl task extracts reads from the Kraken2 output file. It uses the KrakenTools package to extract reads classified at any user-specified taxon ID.
extract_unclassified input parameter
This parameter determines whether unclassified reads should also be extracted and combined with the taxon-specific extracted reads. By default, this is set to false, meaning that only reads classified to the specified input taxon will be extracted.
Important
This task will extract reads classified to the input taxon and all of its descendant taxa. The rank input parameter controls the extraction of reads classified at the specified rank and all subordinate taxonomic levels. See task ncbi_identify under the Taxonomic Identification section for more details on the rank input parameter.
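The extraction logic described above can be sketched in Python as a walk over a taxonomy tree followed by a filter on per-read classifications. This is a simplified illustration rather than the KrakenTools code: the taxon IDs, the `children` map, and both helper names are hypothetical, and the only fact relied on from Kraken2 is that unclassified reads are reported with taxon ID 0.

```python
def descendants(taxon_id, children):
    """Return the taxon ID together with every taxon below it."""
    found = set()
    stack = [taxon_id]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def extract_read_ids(classifications, taxon_id, children,
                     include_unclassified=False):
    """Select reads classified to the taxon or any of its descendants."""
    wanted = descendants(taxon_id, children)
    if include_unclassified:
        wanted.add(0)  # Kraken2 reports unclassified reads as taxon ID 0
    return [read_id for read_id, tax in classifications if tax in wanted]


# Toy taxonomy: family 100 -> genus 110 -> species 111; 200 is unrelated.
children = {100: [110], 110: [111]}
classifications = [("r1", 111), ("r2", 110), ("r3", 200), ("r4", 0)]

print(extract_read_ids(classifications, 100, children))
print(extract_read_ids(classifications, 100, children,
                       include_unclassified=True))
```

The second call mirrors setting extract_unclassified to true.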
KrakenTools Technical Details
Links | |
---|---|
Task | task_krakentools.wdl |
Software Source Code | KrakenTools on GitHub |
Software Documentation | KrakenTools |
Original Publication(s) | Metagenome analysis using the Kraken software suite |
rasusa
The rasusa
task performs subsampling on the input raw reads. By default, it subsamples reads to a target depth of 250X, using the estimated genome length either generated by the ncbi_identify
task or provided directly by the user. This task is disabled by default; users can enable it by setting the skip_rasusa
variable to false
. The target subsampling depth can also be adjusted by modifying the coverage
variable.
coverage
input parameter
This parameter specifies the target coverage for subsampling. The default value is 250
, but users can adjust it as needed.
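As a rough illustration of how a target coverage translates into a subsampling amount, the sketch below computes the fraction of bases to keep. The function name and arguments are hypothetical, and Rasusa's own selection procedure may differ in detail.

```python
def subsample_fraction(total_bases, genome_length, coverage=250):
    """Fraction of sequenced bases to keep to reach the target coverage.

    Hypothetical helper: target bases = coverage x estimated genome length.
    """
    if total_bases <= 0 or genome_length <= 0:
        raise ValueError("total_bases and genome_length must be positive")
    target_bases = coverage * genome_length
    # Never keep more than everything: shallow samples are left untouched.
    return min(1.0, target_bases / total_bases)


# A 12 kb genome sequenced to 30 Mbp needs about 10% of its bases for 250X.
print(subsample_fraction(30_000_000, 12_000))
```

Samples whose depth is already below the target are returned with a fraction of 1.0, meaning no reads are discarded.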
Non-deterministic output(s)
This task may yield non-deterministic outputs.
Rasusa Technical Details
Links | |
---|---|
Task | task_rasusa.wdl |
Software Source Code | Rasusa on GitHub |
Software Documentation | Rasusa on GitHub |
Original Publication(s) | Rasusa: Randomly subsample sequencing reads to a specified coverage |
clean_check_reads
The screen
task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan
and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen
task if any thresholds are not met:
- Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
- Proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion basepairs are in either the read1 or read2 file.
- Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
- Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
- Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than min_coverage.
Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen
variable to true
. Default values vary between the ONT and PE workflows. The rationale for these default values can be found below:
Default Thresholds and Rationales
Variable | Description | Default Value | Rationale |
---|---|---|---|
min_reads | A sample will fail the read screening task if its total number of reads is less than or equal to min_reads | 50 | Minimum number of base pairs for 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus divided by 300 (longest Illumina read length) |
min_basepairs | A sample will fail the read screening if there are fewer than min_basepairs basepairs | 15000 | Greater than 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus |
min_genome_size | A sample will fail the read screening if the estimated genome size is smaller than min_genome_size | 1500 | Based on the Hepatitis delta (of the Deltavirus genus) genome, the smallest viral genome as of 2024-04-11 (1,700 bp) |
max_genome_size | A sample will fail the read screening if the estimated genome size is larger than max_genome_size | 2673870 | Based on the Pandoravirus salinus genome, the biggest viral genome (2,673,870 bp), with 2 Mbp added |
min_coverage | A sample will fail the read screening if the estimated genome coverage is less than the min_coverage | 10 | A bare-minimum coverage for genome characterization. Higher coverage would be required for high-quality phylogenetics. |
min_proportion | A sample will fail the read screening if fewer than min_proportion basepairs are in either the reads1 or read2 files | 40 | Greater than 50% reads are in the read1 file; others are in the read2 file. (PE workflow only) |
Screen Technical Details
Links | |
---|---|
Task | task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task) |
De novo Assembly and Reference Selection
These tasks are only performed if no reference genome is provided
In this workflow, de novo assembly is primarily used to facilitate the selection of a closely related reference genome, though high quality de novo assemblies can be used for downstream analysis. If the user provides an input reference_fasta
, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:
spades
megahit
checkv_denovo
quast_denovo
skani
ncbi_datasets
spades
The spades
task is a wrapper for the SPAdes assembler, which is used for de novo assembly of the cleaned reads. It is run with the --metaviral
option, which is recommended for viral genomes. The MetaviralSPAdes pipeline consists of three independent steps: ViralAssembly
for finding putative viral subgraphs in a metagenomic assembly graph and generating contigs in these graphs, ViralVerify
for checking whether the resulting contigs have viral origin, and ViralComplete
for checking whether these contigs represent complete viral genomes. For more details, please see the original publication.
MetaviralSPAdes was selected as the default assembler because it produces the most complete viral genomes within TheiaViral, determined by CheckV quality assessment (see task checkv
for technical details).
call_metaviralspades
input parameter
This parameter controls whether or not the spades
task is called by the workflow. By default, call_metaviralspades
is set to true
because MetaviralSPAdes is used as the primary assembler. MetaviralSPAdes is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with MetaviralSPAdes, they can set the call_metaviralspades
variable to false
to bypass the spades
task and instead de novo assemble using MEGAHIT (see task megahit
for details). Additionally, if the spades
task fails during execution, the workflow will automatically fall back to using MEGAHIT for de novo assembly.
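The assembler selection and fallback described above can be summarized with a small sketch. The `run_spades` and `run_megahit` callables are hypothetical stand-ins for the two tasks, and the actual workflow expresses this logic in WDL rather than with exceptions.

```python
def choose_assembly(reads, run_spades, run_megahit, call_metaviralspades=True):
    """Return (assembler_name, assembly) following the documented fallback.

    run_spades / run_megahit are hypothetical callables standing in for the
    workflow tasks; each takes reads and returns an assembly or raises.
    """
    if call_metaviralspades:
        try:
            return "metaviralspades", run_spades(reads)
        except Exception:
            pass  # MetaviralSPAdes failed: fall through to MEGAHIT
    return "megahit", run_megahit(reads)


def failing_spades(reads):
    raise RuntimeError("assembler crashed")


print(choose_assembly("reads.fastq", failing_spades, lambda r: "megahit_contigs.fasta"))
```

MEGAHIT is reached in two ways: explicitly, when call_metaviralspades is false, or implicitly, when the primary assembler fails.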
Non-deterministic output(s)
This task may yield non-deterministic outputs.
MetaviralSPAdes Technical Details
Links | |
---|---|
Task | task_spades.wdl |
Software Source Code | SPAdes on GitHub |
Software Documentation | SPAdes Manual |
Original Publication(s) | MetaviralSPAdes: assembly of viruses from metagenomic data |
megahit
The megahit
task is a wrapper for the MEGAHIT assembler, which is used for de novo metagenomic assembly of the cleaned reads. MEGAHIT is a fast and memory-efficient de novo assembler that can handle large datasets. This task is optional and turned off by default. It can be enabled by setting the call_metaviralspades
parameter to false
. The megahit
task is also used as a fallback option if the spades
task fails during execution (see task spades
for more details).
Non-deterministic output(s)
This task may yield non-deterministic outputs.
MEGAHIT Technical Details
Links | |
---|---|
Task | task_megahit.wdl |
Software Source Code | MEGAHIT on GitHub |
Software Documentation | MEGAHIT |
Original Publication(s) | MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph |
skani
The skani
task is used to identify and select the most closely related reference genome to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.
By default, the reference genome is selected from a database of approximately 200,000 complete viral genomes. This database was constructed with the following methodology:
- Extracting all complete NCBI viral genomes, excluding RefSeq accessions (to avoid redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae)
- Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA
- Adding one SARS-CoV-2 genome for each major pangolin lineage
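Reference selection by ANI can be pictured as choosing the highest-scoring hit from the database search. The sketch below is an assumption about the general idea, not the task's actual ranking: the tuple layout, the tie-break on aligned fraction, and the accession names are all hypothetical.

```python
def best_reference(hits):
    """Pick the accession with the highest ANI.

    hits: list of (accession, ani_percent, aligned_fraction_percent) tuples.
    Ties on ANI are broken by aligned fraction (an illustrative choice).
    """
    if not hits:
        return None  # no database genome was similar enough to report
    return max(hits, key=lambda hit: (hit[1], hit[2]))[0]


hits = [
    ("REF_A", 92.4, 80.1),  # hypothetical accessions and scores
    ("REF_B", 97.8, 95.3),
    ("REF_C", 97.8, 60.0),
]
print(best_reference(hits))
```

The selected accession is what a downstream download step would then retrieve as the reference genome.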
Skani Technical Details
Links | |
---|---|
Task | task_skani.wdl |
Software Source Code | Skani on GitHub |
Software Documentation | Skani Documentation |
Original Publication(s) | Fast and robust metagenomic sequence comparison through sparse chaining with skani |
ncbi_datasets
The NCBI Datasets
task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.
This task uses the accession ID output from the skani
task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.
NCBI Datasets Technical Details
Links | |
---|---|
Task | task_ncbi_datasets.wdl |
Software Source Code | NCBI Datasets on GitHub |
Software Documentation | NCBI Datasets Documentation on NCBI |
Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Reference Mapping
bwa
The bwa
task is a wrapper for the BWA alignment tool. It utilizes the BWA-MEM algorithm to map cleaned reads to the reference genome, either selected by the skani
task or provided by the user input reference_fasta
. This creates a BAM file which is then sorted using the command samtools sort
.
BWA Technical Details
Links | |
---|---|
Task | task_bwa.wdl |
Software Source Code | https://github.com/lh3/bwa |
Software Documentation | https://bio-bwa.sourceforge.net/ |
Original Publication(s) | Fast and accurate short read alignment with Burrows-Wheeler transform |
read_mapping_stats
The read_mapping_stats
task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
read_mapping_stats
Technical Details
Links | |
---|---|
Task | task_assembly_metrics.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
Variant Calling and Consensus Generation
ivar_variants
The ivar_variants
task wraps the iVar tool to call variants from the sorted BAM file produced by the bwa
task. It uses the ivar variants
command to identify and report variants based on the aligned reads. The ivar_variants
task will filter all variant calls based on user-defined parameters, including min_map_quality
, min_depth
, and min_allele_freq
. This task will return a VCF file containing the variant calls, along with the total number of variants, and the proportion of intermediate variant calls.
min_depth
input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10
.
min_map_quality
input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20
.
min_allele_freq
input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6
.
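The effect of the depth and allele-frequency thresholds can be illustrated with a simple per-variant filter. This is a sketch under stated assumptions: the `Variant` record and its field names are hypothetical, and mapping quality is left out because it is a property of individual reads rather than of a called variant.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical minimal variant record for illustration."""
    pos: int
    ref: str
    alt: str
    depth: int
    alt_freq: float


def passes_filters(variant, min_depth=10, min_allele_freq=0.6):
    """True when the variant meets both documented default thresholds."""
    return variant.depth >= min_depth and variant.alt_freq >= min_allele_freq


calls = [
    Variant(101, "A", "G", depth=250, alt_freq=0.98),  # passes
    Variant(455, "C", "T", depth=8, alt_freq=0.90),    # too shallow
    Variant(902, "G", "A", depth=120, alt_freq=0.35),  # minority allele
]
kept = [v.pos for v in calls if passes_filters(v)]
print(kept)
```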
iVar Technical Details
Links | |
---|---|
Task | task_ivar_variant_call.wdl |
Software Source Code | Ivar on GitHub |
Software Documentation | Ivar Documentation |
Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
ivar consensus
The consensus
task wraps the iVar tool to generate a reference-based consensus assembly from the sorted BAM file produced by the bwa
task. It uses the ivar consensus
command to call variants and generate a consensus sequence based on those mapped reads. The consensus
task will filter all variant calls based on user-defined parameters, including min_map_quality
, min_depth
, and min_allele_freq
. This task will return a consensus sequence in FASTA format and the samtools mpileup output.
This task is functional for segmented viruses by iteratively executing iVar on a contig-by-contig basis and concatenating resulting consensus contigs.
min_depth
input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10
.
min_map_quality
input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20
.
min_allele_freq
input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6
.
iVar Technical Details
Links | |
---|---|
Task | task_ivar_consensus.wdl |
Software Source Code | Ivar on GitHub |
Software Documentation | Ivar Documentation |
Original Publication(s) | An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar |
Assembly Evaluation and Consensus Quality Control
quast_denovo
QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without requiring a reference. It includes useful metrics such as number of contigs, length of the largest contig, and N50.
QUAST Technical Details
Links | |
---|---|
Task | task_quast.wdl |
Software Source Code | QUAST on GitHub |
Software Documentation | https://quast.sourceforge.net/ |
Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
checkv_denovo
& checkv_consensus
CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.
By default, CheckV reports results on a contig-by-contig basis. The checkv
task additionally reports both "weighted_contamination" and "weighted_completeness", which are average percentages calculated across the total assembly and weighted by contig length.
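A length-weighted average of per-contig percentages can be computed as shown below. The function name and input layout are hypothetical; the arithmetic is the standard weighted mean, with contig length as the weight.

```python
def length_weighted_mean(contigs):
    """Weighted mean of per-contig percentages, weighted by contig length.

    contigs: list of (length_bp, percent) tuples.
    """
    total_length = sum(length for length, _ in contigs)
    if total_length == 0:
        raise ValueError("assembly has no bases")
    return sum(length * pct for length, pct in contigs) / total_length


# A 10 kb contig at 100% and a 2 kb contig at 40%:
# (10000 * 100 + 2000 * 40) / 12000 = 90.0
print(length_weighted_mean([(10_000, 100.0), (2_000, 40.0)]))
```

Weighting by length keeps a short, low-quality contig from dominating the summary of an otherwise complete assembly.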
CheckV Technical Details
Links | |
---|---|
Task | task_checkv.wdl |
Software Source Code | CheckV on Bitbucket |
Software Documentation | CheckV Documentation |
Original Publication(s) | CheckV assesses the quality and completeness of metagenome-assembled viral genomes |
consensus_qc
The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, "N" bases, degenerate bases, and an estimate of the percent coverage to the reference genome.
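The statistics listed above can be approximated with a few string counts. This is a sketch, not the task's code: the function name is hypothetical, and the percent coverage estimate is assumed here to be the number of unambiguous A/C/G/T bases divided by the reference length.

```python
IUPAC_DEGENERATE = set("RYSWKMBDHV")  # ambiguity codes other than N


def consensus_stats(sequence, reference_length):
    """Summarize a consensus sequence (hypothetical helper)."""
    seq = sequence.upper()
    n_bases = seq.count("N")
    degenerate = sum(1 for base in seq if base in IUPAC_DEGENERATE)
    acgt = sum(seq.count(base) for base in "ACGT")
    return {
        "total": len(seq),
        "n": n_bases,
        "degenerate": degenerate,
        "acgt": acgt,
        # Assumption: coverage estimate = unambiguous bases / reference length.
        "percent_reference_coverage": 100 * acgt / reference_length,
    }


print(consensus_stats("ACGTNNRYAC", reference_length=10))
```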
consensus_qc
Technical Details
Links | |
---|---|
Task | task_consensus_qc.wdl |
Software Source Docker Image | Theiagen Docker Builds: utility:1.1 |
Versioning
versioning
: Version Capture
The versioning
task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical Details
Links | |
---|---|
Task | task_versioning.wdl |
Taxonomic Identification
ncbi_identify
The ncbi_identify
task uses NCBI Datasets
to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon
, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon
.
taxon
input parameter
This parameter accepts either an NCBI taxon ID (e.g. 11292
) or an organism name (e.g. Lyssavirus rabies
).
rank
a.k.a. read_extraction_rank
input parameter
Valid options include: "species"
, "genus"
, "family"
, "order"
, "class"
, "phylum"
, "kingdom"
, or "domain"
. By default it is set to "family"
. This parameter filters metadata to report information only at the taxonomic rank
specified by the user, regardless of the taxonomic rank implied by the original input taxon
.
Important
- The
rank
parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.
Examples:
- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
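The rule that rank must be equal to or above the input taxon's rank can be expressed as an index comparison over the ordered list of valid options. The function below is a hypothetical sketch of that check, not the task's implementation.

```python
# Valid rank options, ordered from most specific to most general.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "domain"]


def check_rank(input_taxon_rank, requested_rank):
    """Return requested_rank if it is equal to or above the input taxon's rank."""
    if RANKS.index(requested_rank) < RANKS.index(input_taxon_rank):
        raise ValueError(
            f"cannot resolve rank '{requested_rank}' from a taxon of rank '{input_taxon_rank}'"
        )
    return requested_rank


print(check_rank("species", "family"))  # species-level input, family requested

try:
    check_rank("genus", "species")  # genus-level input, species requested
except ValueError as error:
    print(f"failed: {error}")
```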
NCBI Datasets Technical Details
Links | |
---|---|
Task | task_identify_taxon_id.wdl |
Software Source Code | NCBI Datasets on GitHub |
Software Documentation | NCBI Datasets Documentation on NCBI |
Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Read Quality Control, Trimming, and Filtering
nanoplot_raw
& nanoplot_clean
Nanoplot is used for the determination of mean quality scores, read lengths, and number of reads. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.
Nanoplot Technical Details
Links | |
---|---|
Task | task_nanoplot.wdl |
Software Source Code | NanoPlot |
Software Documentation | NanoPlot Documentation |
Original Publication(s) | NanoPack2: population-scale evaluation of long-read sequencing data |
porechop
Porechop is a tool for finding and removing adapters from ONT data. Adapters on the ends of reads are trimmed, and when a read has an adapter in the middle, the read is split into two.
The porechop
task is optional and is turned off by default. It can be enabled by setting the call_porechop
parameter to true
.
Porechop Technical Details
Links | |
---|---|
WDL Task | task_porechop.wdl |
Software Source Code | Porechop on GitHub |
Software Documentation | https://github.com/rrwick/Porechop#porechop |
nanoq
Reads are filtered by length and quality using nanoq
. By default, sequences shorter than 500 basepairs or with a quality score lower than 10 are filtered out to improve assembly accuracy.
Nanoq Technical Details
Links | |
---|---|
Task | task_nanoq.wdl |
Software Source Code | Nanoq |
Software Documentation | Nanoq Documentation |
Original Publication(s) | Nanoq: ultra-fast quality control for nanopore reads |
ncbi_scrub_se
All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).
HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records, with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.
NCBI-Scrub Technical Details
Links | |
---|---|
Task | task_ncbi_scrub.wdl |
Software Source Code | HRRT on GitHub |
Software Documentation | HRRT on NCBI |
host_decontaminate
Host genetic data is frequently incidentally sequenced alongside pathogens, which can negatively affect the quality of downstream analysis. Host Decontaminate attempts to remove host reads by aligning to a reference host genome acquired on-the-fly. The reference host genome can be acquired via NCBI Taxonomy-compatible taxon input or assembly accession. Host Decontaminate maps inputted reads to the host genome using minimap2
, reports mapping statistics to this host genome, and outputs the unaligned dehosted reads.
The detailed steps and tasks are as follows:
Taxonomic Identification
The ncbi_identify
task uses NCBI Datasets
to search the NCBI Viral Genome Database and acquire taxonomic metadata from a user's inputted taxonomy and desired taxonomic rank. This task will always return a taxon ID, name, and rank, and it facilitates multiple downstream functions, including read classification and targeted read extraction. This task also generates a comprehensive summary file of all successful hits to the input taxon
, which includes each taxon's accession number, completeness status, genome length, source, and other relevant metadata. Based on this summary, the task also calculates the average expected genome size for the input taxon
.
taxon
input parameter
This parameter accepts either an NCBI taxon ID (e.g. 11292
) or an organism name (e.g. Lyssavirus rabies
).
rank
a.k.a. read_extraction_rank
input parameter
Valid options include: "species"
, "genus"
, "family"
, "order"
, "class"
, "phylum"
, "kingdom"
, or "domain"
. By default it is set to "family"
. This parameter filters metadata to report information only at the taxonomic rank
specified by the user, regardless of the taxonomic rank implied by the original input taxon
.
Important
- The
rank
parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.
Examples:
- If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
NCBI Datasets Technical Details
Links | |
---|---|
Task | task_identify_taxon_id.wdl |
Software Source Code | NCBI Datasets on GitHub |
Software Documentation | NCBI Datasets Documentation on NCBI |
Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Download Accession
The NCBI Datasets
task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.
This task uses the accession ID output from the skani
task to download the most closely related reference genome to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.
NCBI Datasets Technical Details
Links | |
---|---|
Task | task_ncbi_datasets.wdl |
Software Source Code | NCBI Datasets on GitHub |
Software Documentation | NCBI Datasets Documentation on NCBI |
Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Map Reads to Host
minimap2
is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.
The mode used in this task is map-ont
which is the default mode for long reads and indicates that long reads of ~10% error rates should be aligned to the reference genome. The output file is in SAM format.
For more information regarding modes and the available options for minimap2
, please see the minimap2 manpage.
minimap2 Technical Details
Links | |
---|---|
Task | task_minimap2.wdl |
Software Source Code | minimap2 on GitHub |
Software Documentation | minimap2 |
Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
Extract Unaligned Reads
The bam_to_unaligned_fastq
task will extract a FASTQ file of reads that failed to align, while removing unpaired reads.
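Conceptually, unaligned reads are those whose SAM FLAG has the 0x4 (segment unmapped) bit set. The task itself relies on samtools; the pure-Python sketch below only illustrates the selection logic on SAM text, with a hypothetical function name and toy records.

```python
def unaligned_fastq(sam_lines):
    """Yield FASTQ records for SAM alignments flagged as unmapped (0x4)."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & 0x4:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


sam = [
    "@SQ\tSN:host_chr1\tLN:1000",
    "read1\t0\thost_chr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",  # aligned to host
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tFFFF",             # unmapped
]
print("".join(unaligned_fastq(sam)), end="")
```

Only the unmapped record is emitted, which corresponds to a read retained as dehosted.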
parse_mapping
Technical Details
Links | |
---|---|
Task | task_parse_mapping.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
Host Read Mapping Statistics
The assembly_metrics
task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
assembly_metrics
Technical Details
Links | |
---|---|
Task | task_assembly_metrics.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
Host Decontaminate Technical Details
Links | |
---|---|
Subworkflow File | wf_host_decontaminate.wdl |
rasusa
The rasusa
task performs subsampling on the input raw reads. By default, it subsamples reads to a target depth of 250X, using the estimated genome length either generated by the ncbi_identify
task or provided directly by the user. This task is disabled by default; users can enable it by setting the skip_rasusa
variable to false
. The target subsampling depth can also be adjusted by modifying the coverage
variable.
coverage
input parameter
This parameter specifies the target coverage for subsampling. The default value is 250
, but users can adjust it as needed.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
Rasusa Technical Details
Links | |
---|---|
Task | task_rasusa.wdl |
Software Source Code | Rasusa on GitHub |
Software Documentation | Rasusa on GitHub |
Original Publication(s) | Rasusa: Randomly subsample sequencing reads to a specified coverage |
clean_check_reads
The screen
task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan
and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen
task if any thresholds are not met:
- Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
- Proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion basepairs are in either the read1 or read2 file.
- Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
- Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
- Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than min_coverage.
Read screening is performed only on the cleaned reads. The task may be skipped by setting the skip_screen
variable to true
. Default values vary between the ONT and PE workflows. The rationale for these default values can be found below:
Default Thresholds and Rationales
Variable | Description | Default Value | Rationale |
---|---|---|---|
min_reads | A sample will fail the read screening task if its total number of reads is less than or equal to min_reads | 50 | Minimum number of base pairs for 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus divided by 300 (longest Illumina read length) |
min_basepairs | A sample will fail the read screening if there are fewer than min_basepairs basepairs | 15000 | Greater than 10x coverage of the Hepatitis delta (of the Deltavirus genus) virus |
min_genome_size | A sample will fail the read screening if the estimated genome size is smaller than min_genome_size | 1500 | Based on the Hepatitis delta (of the Deltavirus genus) genome, the smallest viral genome as of 2024-04-11 (1,700 bp) |
max_genome_size | A sample will fail the read screening if the estimated genome size is larger than max_genome_size | 2673870 | Based on the Pandoravirus salinus genome, the biggest viral genome (2,673,870 bp), with 2 Mbp added |
min_coverage | A sample will fail the read screening if the estimated genome coverage is less than the min_coverage | 10 | A bare-minimum coverage for genome characterization. Higher coverage would be required for high-quality phylogenetics. |
min_proportion | A sample will fail the read screening if fewer than min_proportion basepairs are in either the reads1 or read2 files | 40 | Greater than 50% reads are in the read1 file; others are in the read2 file. (PE workflow only) |
Screen Technical Details
Links | |
---|---|
Task | task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task) |
Read Classification and Extraction
metabuli
The metabuli
task is used to classify and extract reads against a reference database. Metabuli uses a novel k-mer structure, called metamer, to analyze both amino acid (AA) and DNA sequences. It leverages AA conservation for sensitive homology detection and DNA mutations for specific differentiation between closely related taxa.
cpu
/ memory
input parameters
Increasing the memory and cpus allocated to Metabuli can substantially increase throughput.
extract_unclassified
input parameter
This parameter determines whether unclassified reads should also be extracted and combined with the taxon
-specific extracted reads. By default, this is set to false
, meaning that only reads classified to the specified input taxon
will be extracted.
Descendant taxa reads are extracted
This task will extract reads classified to the input taxon
and all of its descendant taxa. The rank
input parameter controls the extraction of reads classified at the specified rank
and all subordinate taxonomic levels. See task ncbi_identify
under the Taxonomic Identification section above for more details on the rank
input parameter.
Metabuli Technical Details
Links | |
---|---|
Task | task_metabuli.wdl |
Software Source Code | Metabuli on GitHub |
Software Documentation | Metabuli Documentation |
Original Publication(s) | Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA |
De novo Assembly and Reference Selection
These tasks are only performed if no reference genome is provided
In this workflow, de novo assembly is used solely to facilitate the selection of a closely related reference genome. If the user provides an input reference_fasta
, the following assembly generation, assembly evaluation, and reference selection tasks will be skipped:
raven
flye
checkv_denovo
quast_denovo
skani
ncbi_datasets
raven
The raven
task is used to create a de novo assembly from cleaned reads. Raven is an overlap-layout-consensus based assembler that accelerates the overlap step, constructs an assembly graph from reads pre-processed with pile-o-grams, applies a novel and robust graph simplification method based on graph drawings, and polishes unambiguous graph paths using Racon.
Based on internal benchmarking against Flye and results reported by Cook et al. (2024), Raven is faster, produces more contiguous assemblies, and yields more complete genomes within TheiaViral according to CheckV quality assessment (see task checkv
for technical details).
call_raven
input parameter
This parameter controls whether or not the raven
task is called by the workflow. By default, call_raven
is set to true
because Raven is used as the primary assembler. Raven is generally recommended for most users, but it might not perform optimally on all datasets. If users encounter issues with Raven, they can set the call_raven
variable to false
to bypass the raven
task and instead de novo assemble using Flye (see task flye
for details). Additionally, if the Raven task fails during execution, the workflow will automatically fall back to using Flye for de novo assembly.
Error traceback
Raven may fail with cryptic "segmentation fault" (segfault) errors or by failing to produce an output file. It is difficult to trace back the source of these issues, though increasing the memory
parameter may resolve some errors.
Non-deterministic output(s)
This task may yield non-deterministic outputs.
Raven Technical Details
Links | |
---|---|
Task | task_raven.wdl |
Software Source Code | Raven on GitHub |
Software Documentation | Raven Documentation |
Original Publication(s) | Time- and memory-efficient genome assembly with Raven |
flye
Flye is a de novo assembler for long read data using repeat graphs. Compared to de Bruijn graphs, which require exact k-mer matches, repeat graphs can use approximate matches, which better tolerate the error rate of ONT data.
Flye can be enabled by setting the call_raven parameter to false. The flye task is also used as a fallback option if the raven task fails during execution (see task raven for more details).
read_type
input parameter
This input parameter specifies the type of sequencing reads being used for assembly. This parameter significantly impacts the assembly process and should match the characteristics of your input data. Below are the available options:
Parameter | Explanation |
---|---|
--nano-hq (default) | Optimized for ONT high-quality reads, such as Guppy5+ SUP or Q20 (<5% error). Recommended for ONT reads processed with Guppy5 or newer |
--nano-raw | For ONT regular reads, pre-Guppy5 (<20% error) |
--nano-corr | ONT reads corrected with other methods (<3% error) |
--pacbio-raw | PacBio regular CLR reads (<20% error) |
--pacbio-corr | PacBio reads corrected with other methods (<3% error) |
--pacbio-hifi | PacBio HiFi reads (<1% error) |
Refer to the Flye documentation for detailed guidance on selecting the appropriate read_type based on your sequencing data and for additional optional parameters.
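As a rough illustration of how a read_type value ends up on the Flye command line, the sketch below validates the option against the table above and assembles an argument list. The helper name `build_flye_command` and the default thread count are assumptions for illustration; the actual task may pass further options.

```python
from typing import List

# Read-type flags listed in the table above.
FLYE_READ_TYPES = {
    "--nano-hq",
    "--nano-raw",
    "--nano-corr",
    "--pacbio-raw",
    "--pacbio-corr",
    "--pacbio-hifi",
}


def build_flye_command(
    reads: str, out_dir: str, read_type: str = "--nano-hq", threads: int = 4
) -> List[str]:
    """Build a Flye argument list (hypothetical helper; nothing is executed)."""
    if read_type not in FLYE_READ_TYPES:
        raise ValueError(f"unsupported read_type: {read_type}")
    return ["flye", read_type, reads, "--out-dir", out_dir, "--threads", str(threads)]
```

Rejecting unknown values early avoids starting an assembly with a read type that does not match the data.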
Non-deterministic output(s)
This task may yield non-deterministic outputs.
Flye Technical Details
Links | |
---|---|
WDL Task | task_flye.wdl |
Software Source Code | Flye on GitHub |
Software Documentation | Flye Documentation |
Original Publication(s) | Assembly of long, error-prone reads using repeat graphs |
skani
The skani task is used to identify and select the reference genome most closely related to the de novo assembly. Skani uses an approximate mapping method without base-level alignment to calculate average nucleotide identity (ANI). It is orders of magnitude faster than BLAST-based methods and almost as accurate.
By default, the reference genome is selected from a database of approximately 200,000 complete viral genomes. This database was constructed with the following methodology:
- Extracting all complete NCBI viral genomes, excluding RefSeq accessions (redundancy), SARS-CoV-2 accessions, and segmented families (Orthomyxoviridae, Hantaviridae, Arenaviridae, and Phenuiviridae)
- Adding complete RefSeq segmented viral assembly accessions, which represent segments as individual contigs within the FASTA
- Adding one SARS-CoV-2 genome for each major pangolin lineage
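Once skani has compared the de novo assembly against this database, a best hit has to be picked from its report. The sketch below is a simplified illustration that chooses the row with the highest ANI from a tab-separated report. The column names (`ANI`, `Ref_name`, and so on) are assumed to follow skani's search output header, and ranking by ANI alone is an assumption; the task's actual ranking may also weigh reference coverage.

```python
import csv
import io
from typing import Dict, Optional


def top_ani_hit(skani_tsv: str) -> Optional[Dict[str, str]]:
    """Return the report row with the highest ANI, or None if there are no hits.

    Assumes a tab-separated report with a header line containing an "ANI" column.
    """
    rows = list(csv.DictReader(io.StringIO(skani_tsv), delimiter="\t"))
    if not rows:
        return None
    return max(rows, key=lambda row: float(row["ANI"]))
```

Returning None for an empty report lets a caller distinguish "no reference found" from a low-identity match.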
Skani Technical Details
Links | |
---|---|
Task | task_skani.wdl |
Software Source Code | Skani on GitHub |
Software Documentation | Skani Documentation |
Original Publication(s) | Fast and robust metagenomic sequence comparison through sparse chaining with skani |
ncbi_datasets
The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.
This task uses the accession ID output from the skani task to download the reference genome most closely related to the input assembly. The downloaded reference is then used for downstream analysis, including variant calling and consensus generation.
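The choice between the virus and genome packages can be sketched as follows. This is a hedged illustration only: the helper name `build_datasets_command`, the output filename, and the rule of routing `GCA_`/`GCF_` assembly accessions to the genome package are assumptions, not the task's confirmed logic.

```python
from typing import List


def build_datasets_command(accession: str, out_zip: str = "ncbi_dataset.zip") -> List[str]:
    """Build an NCBI Datasets CLI argument list (hypothetical helper; not executed)."""
    if accession.startswith(("GCA_", "GCF_")):
        # Assembly accessions (e.g. RefSeq segmented assemblies): genome package.
        return [
            "datasets", "download", "genome", "accession", accession,
            "--include", "genome", "--filename", out_zip,
        ]
    # Assumed here: any other accession is a nucleotide accession for the virus package.
    return [
        "datasets", "download", "virus", "genome", "accession", accession,
        "--filename", out_zip,
    ]
```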
NCBI Datasets Technical Details
Links | |
---|---|
Task | task_ncbi_datasets.wdl |
Software Source Code | NCBI Datasets on GitHub |
Software Documentation | NCBI Datasets Documentation on NCBI |
Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
Reference Mapping
minimap2
minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, a "mode" is a group of preset options.
The mode used in this task is map-ont, with additional long-read-specific parameters (the -L --cs --MD flags), to align ONT reads to the reference genome. These specialized parameters are essential for proper handling of long read error profiles, generation of detailed alignment information, and improved mapping accuracy for long reads.
map-ont is the default mode for long reads and indicates that long reads with error rates of ~10% should be aligned to the reference genome. The output file is in SAM format.
For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.
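For orientation, the mode and flags named above correspond to a command line along the lines of the sketch below. The helper name `build_minimap2_command` and the file names are hypothetical, and the task itself may set additional options such as thread count.

```python
from typing import List


def build_minimap2_command(reference: str, reads: str, out_sam: str) -> List[str]:
    """Build a minimap2 argument list for ONT reads (illustrative; not executed)."""
    return [
        "minimap2",
        "-a",              # emit SAM instead of the default PAF
        "-x", "map-ont",   # preset ("mode") for ONT reads
        "-L",              # write long CIGARs to the CG tag
        "--cs",            # add the cs tag describing each alignment
        "--MD",            # add the MD tag
        "-o", out_sam,
        reference,
        reads,
    ]
```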
minimap2 Technical Details
Links | |
---|---|
Task | task_minimap2.wdl |
Software Source Code | minimap2 on GitHub |
Software Documentation | minimap2 |
Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
parse_mapping
The sam_to_sorted_bam sub-task converts the output SAM file from the minimap2 task to a BAM file. It then sorts the BAM file by coordinate and creates a BAM index file.
min_map_quality
input parameter
This parameter accepts an integer value to set the minimum mapping quality for variant calling and subsequent consensus sequence generation. The default value is 20
.
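To make the effect of min_map_quality concrete, the sketch below filters plain-text SAM records by their MAPQ field (the fifth column), keeping header lines untouched. It is a pure-Python illustration of the threshold, not the samtools-based implementation used by the task, and the function name is hypothetical.

```python
from typing import Iterable, List


def filter_sam_by_mapq(sam_lines: Iterable[str], min_map_quality: int = 20) -> List[str]:
    """Keep SAM header lines and alignments with MAPQ >= min_map_quality."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines are always retained
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_map_quality:  # column 5 of a SAM record is MAPQ
            kept.append(line)
    return kept
```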
parse_mapping
Technical Details
Links | |
---|---|
Task | task_parse_mapping.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
read_mapping_stats
The read_mapping_stats
task generates mapping statistics from a BAM file. It uses samtools to generate a summary of the mapping statistics, which includes coverage, depth, average base quality, average mapping quality, and other relevant metrics.
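Two of these metrics can be illustrated from a list of per-position read depths. The sketch assumes that coverage means the percentage of reference positions with at least one aligned read and that depth means the mean per-position depth; the exact definitions used by the task are not stated here, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def summarize_depth(depths: Sequence[int]) -> Dict[str, float]:
    """Summarize per-position depths as percent coverage and mean depth."""
    if not depths:
        return {"coverage": 0.0, "mean_depth": 0.0}
    covered = sum(1 for d in depths if d > 0)
    return {
        "coverage": 100.0 * covered / len(depths),
        "mean_depth": sum(depths) / len(depths),
    }
```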
read_mapping_stats
Technical Details
Links | |
---|---|
Task | task_assembly_metrics.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
fasta_utilities
The fasta_utilities task utilizes samtools to index a reference FASTA file.
This reference is either selected by the skani task or provided by the user through the reference_fasta input. The indexed reference genome is used for downstream variant calling and consensus generation tasks.
fasta_utilities
Technical Details
Links | |
---|---|
Task | task_fasta_utilities.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
Variant Calling and Consensus Generation
clair3
Clair3
performs deep learning-based variant detection using a multi-stage approach. The process begins with pileup-based calling for initial variant identification, followed by full-alignment analysis for comprehensive variant detection. Results are merged into a final high-confidence call set.
The variant calling pipeline employs specialized neural networks trained on ONT data to accurately identify:
- Single nucleotide variants (SNVs)
- Small insertions and deletions (indels)
- Structural variants
clair3_model
input parameter
This parameter specifies the clair3 model to use for variant calling. The default is set to "r1041_e82_400bps_sup_v500"
, but users may select from other available models that clair3
was trained on, which may yield better results depending on the basecaller and data type. The following models are available:
"ont"
"ont_guppy2"
"ont_guppy5"
"r941_prom_sup_g5014"
"r941_prom_hac_g360+g422"
"r941_prom_hac_g238"
"r1041_e82_400bps_sup_v500"
"r1041_e82_400bps_hac_v500"
"r1041_e82_400bps_sup_v410"
"r1041_e82_400bps_hac_v410"
Default Parameters and Filtering
In this workflow, clair3 is run with nearly all default parameters. Note that the VCF file produced by the clair3 task is unfiltered and does not represent the final set of variants that will be included in the final consensus genome. A filtered VCF file is generated by the bcftools_consensus task. The filtering parameters are as follows:
- The min_map_quality parameter is applied before calling variants.
- The min_depth and min_allele_freq parameters are applied after variant calling during consensus genome construction.
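The post-calling filter can be sketched on simplified variant records. The record layout (a dict with `depth` and `alt_depth` keys) and the function name are hypothetical, and computing allele frequency as alt_depth / depth is an assumption about how the frequency is derived; the defaults mirror the documented values of 10 and 0.6.

```python
from typing import Dict, List


def filter_variants(
    variants: List[Dict[str, int]], min_depth: int = 10, min_allele_freq: float = 0.6
) -> List[Dict[str, int]]:
    """Keep variants meeting both the depth and allele-frequency thresholds."""
    kept = []
    for variant in variants:
        depth = variant["depth"]
        if depth < min_depth:
            continue  # too few reads to trust the call
        allele_freq = variant["alt_depth"] / depth
        if allele_freq >= min_allele_freq:
            kept.append(variant)
    return kept
```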
Clair3 Technical Details
Links | |
---|---|
Task | task_clair3.wdl |
Software Source Code | Clair3 on GitHub |
Software Documentation | Clair3 Documentation |
Original Publication(s) | Symphonizing pileup and full-alignment for deep learning-based long-read variant calling |
parse_mapping
The mask_low_coverage
sub-task is used to mask low coverage regions in the reference_fasta
file to improve the accuracy of the final consensus genome. Coverage thresholds are defined by the min_depth
parameter, which specifies the minimum read depth required for a base to be retained. Bases falling below this threshold are replaced with "N"s to clearly mark low confidence regions. The masked reference is then combined with variants from the clair3
task to produce the final consensus genome.
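The masking step can be illustrated with a short sketch that replaces every reference base whose read depth falls below the threshold with "N". The function name and the in-memory inputs are hypothetical; the task operates on files rather than strings.

```python
from typing import Sequence


def mask_low_coverage(reference: str, depths: Sequence[int], min_depth: int = 10) -> str:
    """Replace bases with depth below min_depth by "N"."""
    if len(reference) != len(depths):
        raise ValueError("reference and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(reference, depths)
    )
```

A position with depth exactly equal to min_depth is retained in this sketch, matching the description of min_depth as the minimum depth required for a base to be kept.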
min_depth
input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10
.
parse_mapping
Technical Details
Links | |
---|---|
Task | task_parse_mapping.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
bcftools_consensus
The bcftools_consensus task generates a consensus genome assembly by applying variants from the clair3 task to a masked reference genome. It uses bcftools to filter variants based on the min_depth and min_allele_freq input parameters, left-align and normalize indels, index the VCF file, and generate a consensus genome in FASTA format. Reference bases are substituted with filtered variants where applicable, preserved in regions without variant calls, and replaced with "N"s in areas masked by the mask_low_coverage task.
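The substitution rules described above can be sketched for the simplest case of single-nucleotide variants. This is a deliberately reduced illustration: indels are ignored, the tuple layout (1-based position, reference base, alternate base) and the function name are hypothetical, and skipping variants that land on masked positions is an assumption rather than confirmed bcftools behavior.

```python
from typing import Iterable, Tuple


def apply_snvs(masked_reference: str, snvs: Iterable[Tuple[int, str, str]]) -> str:
    """Apply filtered SNVs to a masked reference, leaving masked "N"s in place."""
    sequence = list(masked_reference)
    for position, ref_base, alt_base in snvs:
        index = position - 1  # VCF positions are 1-based
        if sequence[index] == "N":
            continue  # masked low-coverage position stays "N" (assumption)
        if sequence[index] != ref_base:
            raise ValueError(f"reference mismatch at position {position}")
        sequence[index] = alt_base
    return "".join(sequence)
```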
min_depth
input parameter
This parameter accepts an integer value to set the minimum read depth for variant calling and subsequent consensus sequence generation. The default value is 10
.
min_allele_freq
input parameter
This parameter accepts a float value to set the minimum allele frequency for variant calling and subsequent consensus sequence generation. The default value is 0.6
.
bcftools_consensus
Technical Details
Links | |
---|---|
Task | task_bcftools_consensus.wdl |
Software Source Code | bcftools on GitHub |
Software Documentation | bcftools Manual Page |
Original Publication(s) | Twelve Years of SAMtools and BCFtools |
Assembly Evaluation and Consensus Quality Control
quast_denovo
QUAST stands for QUality ASsessment Tool. It evaluates genome and metagenome assemblies by computing various metrics without requiring a reference. These include the number of contigs, the length of the largest contig, and N50.
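As a reminder of what N50 measures, the sketch below computes it from a list of contig lengths: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly length. It is a standalone illustration with a hypothetical function name, not QUAST's code.

```python
from typing import Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the N50 of the given contig lengths (0 for an empty assembly)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For contigs of 400, 300, 200, and 100 bases, the two longest contigs together hold 700 of 1,000 bases, so the N50 is 300.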
QUAST Technical Details
Links | |
---|---|
Task | task_quast.wdl |
Software Source Code | QUAST on GitHub |
Software Documentation | https://quast.sourceforge.net/ |
Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
checkv_denovo & checkv_consensus
CheckV is a fully automated command-line pipeline for assessing the quality of viral genomes, including identification of host contamination for integrated proviruses, estimating completeness for genome fragments, and identification of closed genomes.
By default, CheckV reports results on a contig-by-contig basis. The checkv task additionally reports "weighted_contamination" and "weighted_completeness", which are average percentages across the total assembly, weighted by contig length.
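A length-weighted average of per-contig percentages can be written as $\sum_i L_i p_i / \sum_i L_i$, where $L_i$ is the contig length and $p_i$ the per-contig value. The sketch below implements that formula; the function name is hypothetical, and how the task handles contigs without a CheckV estimate is not covered.

```python
from typing import Sequence


def length_weighted_mean(percentages: Sequence[float], lengths: Sequence[int]) -> float:
    """Average per-contig percentages, weighting each contig by its length."""
    if len(percentages) != len(lengths):
        raise ValueError("percentages and lengths must have the same length")
    total_length = sum(lengths)
    if total_length == 0:
        return 0.0
    weighted_sum = sum(p * l for p, l in zip(percentages, lengths))
    return weighted_sum / total_length
```

For example, a 3,000-base contig at 100% completeness and a 1,000-base contig at 50% give a weighted completeness of 87.5%, whereas the unweighted mean would be 75%.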
CheckV Technical Details
Links | |
---|---|
Task | task_checkv.wdl |
Software Source Code | CheckV on Bitbucket |
Software Documentation | CheckV Documentation |
Original Publication(s) | CheckV assesses the quality and completeness of metagenome-assembled viral genomes |
consensus_qc
The consensus_qc task generates a summary of genomic statistics from a consensus genome. This includes the total number of bases, "N" bases, degenerate bases, and an estimate of the percent coverage to the reference genome.
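These counts can be sketched directly from a consensus sequence string. The function name is hypothetical, and the percent reference coverage formula used here (unambiguous bases divided by reference length) is an assumption; the task's exact estimate may differ.

```python
from typing import Dict, Union

DEGENERATE_CODES = "RYSWKMBDHV"  # IUPAC ambiguity codes other than N


def consensus_qc(consensus: str, reference_length: int) -> Dict[str, Union[int, float]]:
    """Count total, N, degenerate, and unambiguous bases in a consensus sequence."""
    sequence = consensus.upper()
    number_n = sequence.count("N")
    number_degenerate = sum(sequence.count(code) for code in DEGENERATE_CODES)
    unambiguous = sum(sequence.count(base) for base in "ACGT")
    percent_coverage = (
        100.0 * unambiguous / reference_length if reference_length else 0.0
    )
    return {
        "number_Total": len(sequence),
        "number_N": number_n,
        "number_Degenerate": number_degenerate,
        "assembly_length_unambiguous": unambiguous,
        "percent_reference_coverage": percent_coverage,  # assumed formula
    }
```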
consensus_qc
Technical Details
Links | |
---|---|
Task | task_consensus_qc.wdl |
Software Source Docker Image | Theiagen Docker Builds: utility:1.1 |
Taxa-Specific Tasks¶
The TheiaViral workflows automatically activate taxa-specific sub-workflows after the identification of relevant taxa using the taxon ID of the reference genome.
Lyssavirus rabies
nextclade
Theiagen has implemented a full genome-based Nextclade dataset for L. rabies with subclade classification resolution.
Nextclade Technical Details
Links | |
---|---|
Task | task_nextclade.wdl |
Software Source Code | https://github.com/nextstrain/nextclade |
Software Documentation | Nextclade |
Original Publication(s) | Nextclade: clade assignment, mutation calling and quality control for viral genomes. |
Outputs¶
Variable | Type | Description |
---|---|---|
abricate_flu_database | String | ABRicate database used for analysis |
abricate_flu_results | File | File containing all results from ABRicate |
abricate_flu_subtype | String | Flu subtype as determined by ABRicate |
abricate_flu_type | String | Flu type as determined by ABRicate |
abricate_flu_version | String | Version of ABRicate |
assembly_denovo_fasta | File | De novo assembly in FASTA format |
auspice_json_flu_ha | File | Auspice-compatible JSON output generated from Nextclade analysis on Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
auspice_json_flu_na | File | Auspice-compatible JSON output generated from Nextclade analysis on Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
auspice_json_mpxv | File | Auspice-compatible JSON output generated from Nextclade analysis on Monkeypox virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
auspice_json_rabies | File | Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
bbduk_docker | String | The Docker image for bbduk, which was used to remove the adapters from the sequences |
bbduk_read1_clean | File | Clean forward reads after BBDuk processing |
bbduk_read2_clean | File | Clean reverse reads after BBDuk processing |
bwa_aligned_bai | File | BAM index file for reads aligned to reference |
bwa_read1_aligned | File | Forward reads aligned to reference |
bwa_read1_unaligned | File | Forward reads not aligned to reference |
bwa_read2_aligned | File | Reverse reads aligned to reference |
bwa_read2_unaligned | File | Reverse reads not aligned to reference |
bwa_samtools_version | String | Version of samtools used by BWA |
bwa_sorted_bai | File | Sorted BAM index file of reads aligned to reference |
bwa_sorted_bam | File | Sorted BAM file of reads aligned to reference |
bwa_sorted_bam_unaligned | File | A BAM file that only contains reads that did not align to the reference |
bwa_sorted_bam_unaligned_bai | File | Index companion file to a BAM file that only contains reads that did not align to the reference |
bwa_version | String | Version of BWA software used |
checkv_consensus_contamination | Float | Contamination estimate for consensus assembly from CheckV |
checkv_consensus_summary | File | Summary report from CheckV for consensus assembly |
checkv_consensus_total_genes | Int | Number of genes detected in consensus assembly by CheckV |
checkv_consensus_version | String | Version of CheckV used for consensus assembly |
checkv_consensus_weighted_completeness | Float | Weighted completeness score for consensus assembly from CheckV |
checkv_consensus_weighted_contamination | Float | Weighted contamination score for consensus assembly from CheckV |
checkv_denovo_contamination | Float | Contamination estimate for de novo assembly from CheckV |
checkv_denovo_summary | File | Summary report from CheckV for de novo assembly |
checkv_denovo_total_genes | Int | Number of genes detected in de novo assembly by CheckV |
checkv_denovo_version | String | Version of CheckV used for de novo assembly |
checkv_denovo_weighted_completeness | Float | Weighted completeness score for de novo assembly from CheckV |
checkv_denovo_weighted_contamination | Float | Weighted contamination score for de novo assembly from CheckV |
consensus_n_variant_min_depth | Int | Minimum read depth to call variants for iVar consensus and iVar variants. Also represents the minimum consensus support threshold used by IRMA with Illumina Influenza data. |
consensus_qc_assembly_length_unambiguous | Int | Length of consensus assembly excluding ambiguous bases |
consensus_qc_number_Degenerate | Int | Number of degenerate bases in consensus assembly |
consensus_qc_number_N | Int | Number of N bases in consensus assembly |
consensus_qc_number_Total | Int | Total number of bases in consensus assembly |
consensus_qc_percent_reference_coverage | Float | Percent of reference genome covered in consensus assembly |
dehost_wf_dehost_read1 | File | Reads that did not map to host |
dehost_wf_dehost_read2 | File | Paired-reads that did not map to host |
dehost_wf_download_status | String | Status of host genome acquisition |
dehost_wf_host_accession | String | Host genome accession |
dehost_wf_host_fasta | File | Host genome FASTA file |
dehost_wf_host_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
dehost_wf_host_mapped_bai | File | BAM index file for the reads aligned to the host reference |
dehost_wf_host_mapped_bam | File | Sorted BAM file containing the alignments of reads to the host reference genome |
dehost_wf_host_mapping_cov_hist | File | Coverage histogram from host read mapping |
dehost_wf_host_mapping_coverage | Float | Average coverage from host read mapping |
dehost_wf_host_mapping_mean_depth | Float | Average depth from host read mapping |
dehost_wf_host_mapping_metrics | File | File of mapping metrics |
dehost_wf_host_mapping_stats | File | File of mapping statistics |
dehost_wf_host_percent_mapped_reads | Float | Percentage of reads mapped to host reference genome |
fastp_html_report | File | The HTML report made with fastp |
fastp_version | String | The version of fastp used |
fastq_scan_clean1_json | File | The JSON file output from fastq-scan containing summary stats about clean forward read quality and length |
fastq_scan_clean2_json | File | The JSON file output from fastq-scan containing summary stats about clean reverse read quality and length |
fastq_scan_clean_pairs | Int | Number of read pairs after cleaning |
fastq_scan_docker | String | The Docker image of fastq_scan |
fastq_scan_num_reads_clean1 | Int | The number of forward reads after cleaning as calculated by fastq_scan |
fastq_scan_num_reads_clean2 | Int | The number of reverse reads after cleaning as calculated by fastq_scan |
fastq_scan_num_reads_raw1 | Int | The number of input forward reads as calculated by fastq_scan |
fastq_scan_num_reads_raw2 | Int | The number of input reverse reads as calculated by fastq_scan |
fastq_scan_raw1_json | File | The JSON file output from fastq-scan containing summary stats about raw forward read quality and length |
fastq_scan_raw2_json | File | The JSON file output from fastq-scan containing summary stats about raw reverse read quality and length |
fastq_scan_raw_pairs | Int | Number of raw read pairs |
fastq_scan_version | String | The version of fastq_scan |
genoflu_all_segments | String | The genotypes for each individual flu segment |
genoflu_genotype | String | The genotype of the whole genome, based on the individual segment types |
genoflu_output_tsv | File | The output file from GenoFLU |
genoflu_version | String | The version of GenoFLU used |
irma_docker | String | Docker image used to run IRMA |
irma_subtype | String | Flu subtype as determined by IRMA |
irma_subtype_notes | String | Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column" |
irma_type | String | Flu type as determined by IRMA |
irma_version | String | Version of IRMA used |
ivar_tsv | File | Variant descriptor file generated by iVar variants |
ivar_variant_proportion_intermediate | String | The proportion of variants of intermediate frequency |
ivar_variant_version | String | Version of iVar for running the iVar variants command |
ivar_vcf | File | iVar tsv output converted to VCF format |
ivar_version_consensus | String | Version of iVar for running the iVar consensus command |
kraken2_extracted_read1 | File | Forward reads extracted by taxonomic classification |
kraken2_extracted_read2 | File | Reverse reads extracted by taxonomic classification |
kraken_database | File | Database used for Kraken classification |
kraken_docker | String | Docker image used for Kraken |
kraken_report | File | Full Kraken report |
kraken_version | String | Version of Kraken software used |
megahit_docker | String | Docker image used for MEGAHIT |
megahit_status | String | Status of the MEGAHIT assembly |
megahit_version | String | Version of MEGAHIT used |
metaviralspades_docker | String | Docker image used for MetaviralSPAdes |
metaviralspades_status | String | Status of MetaviralSPAdes assembly |
metaviralspades_version | String | Version of MetaviralSPAdes used |
ncbi_datasets_docker | String | Docker image used for NCBI datasets |
ncbi_datasets_version | String | Version of NCBI datasets used |
ncbi_identify_accession | String | NCBI accession ID of identified taxon |
ncbi_identify_avg_genome_length | Int | Average genome length from NCBI taxon summary |
ncbi_identify_genome_summary_tsv | File | TSV file with genome summary from NCBI |
ncbi_identify_read_extraction_rank | String | Taxonomic rank used for read extraction |
ncbi_identify_taxon_id | String | NCBI taxonomy ID of identified organism |
ncbi_identify_taxon_name | String | Name of identified taxon |
ncbi_identify_taxon_summary_tsv | File | TSV file with taxa specific summary from NCBI |
ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
nextclade_aa_dels_flu_ha | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for HA segment |
nextclade_aa_dels_flu_na | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for NA segment |
nextclade_aa_dels_mpxv | String | Amino-acid deletions as detected by Nextclade. Specific to Monkeypox |
nextclade_aa_dels_rabies | String | Amino-acid deletions as detected by Nextclade. Specific to Rabies |
nextclade_aa_subs_flu_ha | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for HA segment |
nextclade_aa_subs_flu_na | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for NA segment |
nextclade_aa_subs_mpxv | String | Amino-acid substitutions as detected by Nextclade. Specific to Monkeypox |
nextclade_aa_subs_rabies | String | Amino-acid substitutions as detected by Nextclade. Specific to Rabies |
nextclade_clade_mpxv | String | Nextclade clade designation, specific to Monkeypox |
nextclade_clade_rabies | String | Nextclade clade designation, specific to Rabies |
nextclade_docker | String | Docker image used to run Nextclade |
nextclade_ds_tag | String | Dataset tag used to run Nextclade. Will be blank for Flu |
nextclade_ds_tag_flu_ha | String | Dataset tag used to run Nextclade, specific to Flu HA segment |
nextclade_ds_tag_flu_na | String | Dataset tag used to run Nextclade, specific to Flu NA segment |
nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment |
nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment |
nextclade_json_mpxv | File | Nextclade output in JSON file format, specific to Monkeypox |
nextclade_json_rabies | File | Nextclade output in JSON file format, specific to Rabies |
nextclade_lineage_mpxv | String | Nextclade lineage designation, specific to Monkeypox |
nextclade_lineage_rabies | String | Nextclade lineage designation, specific to Rabies |
nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment |
nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment |
nextclade_qc_mpxv | String | QC metric as determined by Nextclade, specific to Monkeypox |
nextclade_qc_rabies | String | QC metric as determined by Nextclade, specific to Rabies |
nextclade_tsv_flu_ha | File | Nextclade output in TSV file format, specific to Flu HA segment |
nextclade_tsv_flu_na | File | Nextclade output in TSV file format, specific to Flu NA segment |
nextclade_tsv_mpxv | File | Nextclade output in TSV file format, specific to Monkeypox |
nextclade_tsv_rabies | File | Nextclade output in TSV file format, specific to Rabies |
organism | String | Standardized organism name used for characterization |
pango_lineage | String | Pango lineage as determined by Pangolin |
pango_lineage_expanded | String | Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1" |
pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
pangolin_assignment_version | String | The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
pangolin_docker | String | Docker image used to run Pangolin |
pangolin_notes | String | Lineage notes as determined by Pangolin |
pangolin_versions | String | All Pangolin software and database versions |
quast_denovo_docker | String | Docker image used for QUAST |
quast_denovo_gc_percent | Float | GC percentage of de novo assembly from QUAST |
quast_denovo_genome_length | Int | Genome length of de novo assembly from QUAST |
quast_denovo_largest_contig | Int | Size of largest contig in de novo assembly from QUAST |
quast_denovo_n50_value | Int | N50 value of de novo assembly from QUAST |
quast_denovo_number_contigs | Int | Number of contigs in de novo assembly from QUAST |
quast_denovo_report | File | QUAST report for de novo assembly |
quast_denovo_uncalled_bases | Int | Number of uncalled bases in de novo assembly from QUAST |
quast_denovo_version | String | Version of QUAST used |
read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
read2_dehosted | File | The dehosted reverse reads file; suggested read file for SRA submission |
read_mapping_cov_hist | File | Coverage histogram from read mapping |
read_mapping_cov_stats | File | Coverage statistics from read mapping |
read_mapping_coverage | Float | Average coverage from read mapping |
read_mapping_date | String | Date of read mapping analysis |
read_mapping_depth | Float | Average depth from read mapping |
read_mapping_flagstat | File | Flagstat file from read mapping |
read_mapping_meanbaseq | Float | Mean base quality from read mapping |
read_mapping_meanmapq | Float | Mean mapping quality from read mapping |
read_mapping_percentage_mapped_reads | Float | Percentage of mapped reads |
read_mapping_report | File | Report file from read mapping |
read_mapping_samtools_version | String | Version of samtools used in read mapping |
read_mapping_statistics | File | Statistics file from read mapping |
reference_taxon_name | String | NCBI derived taxon name from best ANI hit accession |
skani_database | File | Database used for Skani |
skani_docker | String | Docker image used for Skani |
skani_report | File | Report from Skani |
skani_status | String | Status of Skani analysis |
skani_top_accession | String | Top accession ID from Skani |
skani_top_ani | Float | Top ANI score from Skani |
skani_top_ani_fasta | File | FASTA file of top ANI match from Skani |
skani_top_ref_coverage | Float | Reference coverage of top match from Skani |
skani_top_score | Float | Top score from Skani |
skani_version | String | Version of Skani used |
skani_warning | String | Skani warning message |
theiaviral_illumina_pe_date | String | Date of TheiaViral Illumina PE workflow run |
theiaviral_illumina_pe_version | String | Version of TheiaViral Illumina PE workflow |
trimmomatic_docker | String | The docker image used for the trimmomatic module in this workflow |
trimmomatic_version | String | The version of Trimmomatic used |
Variable | Type | Description |
---|---|---|
abricate_flu_database | String | ABRicate database used for analysis |
abricate_flu_results | File | File containing all results from ABRicate |
abricate_flu_subtype | String | Flu subtype as determined by ABRicate |
abricate_flu_type | String | Flu type as determined by ABRicate |
abricate_flu_version | String | Version of ABRicate |
assembly_denovo_fasta | File | De novo assembly in FASTA format |
assembly_to_ref_bai | File | BAM index file for reads aligned to reference |
assembly_to_ref_bam | File | BAM file of reads aligned to reference |
auspice_json_flu_ha | File | Auspice-compatible JSON output generated from Nextclade analysis on Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
auspice_json_flu_na | File | Auspice-compatible JSON output generated from Nextclade analysis on Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
auspice_json_mpxv | File | Auspice-compatible JSON output generated from Nextclade analysis on Monkeypox virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
auspice_json_rabies | File | Auspice-compatible JSON output generated from Nextclade analysis on Rabies virus that includes the Nextclade default samples for clade-typing and the single sample placed on this tree |
bcftools_docker | String | Docker image used for bcftools |
bcftools_filtered_vcf | File | Filtered variant calls in VCF format from bcftools |
bcftools_version | String | Version of bcftools used |
checkv_consensus_contamination | Float | Contamination estimate for consensus assembly from CheckV |
checkv_consensus_summary | File | Summary report from CheckV for consensus assembly |
checkv_consensus_total_genes | Int | Number of genes detected in consensus assembly by CheckV |
checkv_consensus_version | String | Version of CheckV used for consensus assembly |
checkv_consensus_weighted_completeness | Float | Weighted completeness score for consensus assembly from CheckV |
checkv_consensus_weighted_contamination | Float | Weighted contamination score for consensus assembly from CheckV |
checkv_denovo_contamination | Float | Contamination estimate for de novo assembly from CheckV |
checkv_denovo_summary | File | Summary report from CheckV for de novo assembly |
checkv_denovo_total_genes | Int | Number of genes detected in de novo assembly by CheckV |
checkv_denovo_version | String | Version of CheckV used for de novo assembly |
checkv_denovo_weighted_completeness | Float | Weighted completeness score for de novo assembly from CheckV |
checkv_denovo_weighted_contamination | Float | Weighted contamination score for de novo assembly from CheckV |
clair3_docker | String | Docker image used for Clair3 |
clair3_gvcf | File | Genomic VCF file from Clair3 |
clair3_model | String | Model used for Clair3 variant calling |
clair3_vcf | File | Variant calls in VCF format from Clair3 |
clair3_version | String | Version of Clair3 used |
consensus_qc_assembly_length_unambiguous | Int | Length of consensus assembly excluding ambiguous bases |
consensus_qc_number_Degenerate | Int | Number of degenerate bases in consensus assembly |
consensus_qc_number_N | Int | Number of N bases in consensus assembly |
consensus_qc_number_Total | Int | Total number of bases in consensus assembly |
consensus_qc_percent_reference_coverage | Float | Percent of reference genome covered in consensus assembly |
dehost_wf_dehost_read1 | File | Reads that did not map to host |
dehost_wf_download_status | String | Status of host genome acquisition |
dehost_wf_host_accession | String | Host genome accession |
dehost_wf_host_fasta | File | Host genome FASTA file |
dehost_wf_host_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
dehost_wf_host_mapped_bai | File | Index (BAI) file for the BAM of reads aligned to the host reference |
dehost_wf_host_mapped_bam | File | Sorted BAM file containing the alignments of reads to the host reference genome |
dehost_wf_host_mapping_cov_hist | File | Coverage histogram from host read mapping |
dehost_wf_host_mapping_coverage | Float | Average coverage from host read mapping |
dehost_wf_host_mapping_mean_depth | Float | Average depth from host read mapping |
dehost_wf_host_mapping_metrics | File | File of mapping metrics |
dehost_wf_host_mapping_stats | File | File of mapping statistics |
dehost_wf_host_percent_mapped_reads | Float | Percentage of reads mapped to host reference genome |
fasta_utilities_fai | File | FASTA index file |
fasta_utilities_samtools_docker | String | Docker image used for samtools in fasta utilities |
fasta_utilities_samtools_version | String | Version of samtools used in fasta utilities |
flye_denovo_docker | String | Docker image used for Flye |
flye_denovo_info | File | Information file from Flye assembly |
flye_denovo_status | String | Status of Flye assembly |
flye_denovo_version | String | Version of Flye used |
genoflu_all_segments | String | The genotypes for each individual flu segment |
genoflu_genotype | String | The genotype of the whole genome, based on the individual segment types |
genoflu_output_tsv | File | The output file from GenoFLU |
genoflu_version | String | The version of GenoFLU used |
irma_docker | String | Docker image used to run IRMA |
irma_subtype | String | Flu subtype as determined by IRMA |
irma_subtype_notes | String | Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column" |
irma_type | String | Flu type as determined by IRMA |
irma_version | String | Version of IRMA used |
mask_low_coverage_all_coverage_bed | File | BED file showing all coverage regions |
mask_low_coverage_bed | File | BED file showing masked low coverage regions |
mask_low_coverage_bedtools_docker | String | Docker image used for bedtools in masking |
mask_low_coverage_bedtools_version | String | Version of bedtools used in masking |
mask_low_coverage_reference_fasta | File | Reference FASTA with low coverage regions masked |
metabuli_classified | File | Classified reads from Metabuli |
metabuli_database | File | Database used for Metabuli |
metabuli_docker | String | Docker image used for Metabuli |
metabuli_krona_report | File | Krona visualization report from Metabuli |
metabuli_read1_extract | File | Extracted reads from Metabuli |
metabuli_report | File | Classification report from Metabuli |
metabuli_version | String | Version of Metabuli used |
minimap2_docker | String | The Docker image of minimap2 |
minimap2_out | File | Output file from Minimap2 alignment |
minimap2_version | String | The version of minimap2 |
nanoplot_html_clean | File | An HTML report describing the clean reads |
nanoplot_html_raw | File | An HTML report describing the raw reads |
nanoplot_num_reads_clean1 | Int | Number of clean reads |
nanoplot_num_reads_raw1 | Int | Number of raw reads |
nanoplot_r1_mean_q_clean | Float | Mean quality score of clean forward reads |
nanoplot_r1_mean_q_raw | Float | Mean quality score of raw forward reads |
nanoplot_r1_mean_readlength_clean | Float | Mean read length of clean forward reads |
nanoplot_r1_mean_readlength_raw | Float | Mean read length of raw forward reads |
nanoplot_r1_median_q_clean | Float | Median quality score of clean forward reads |
nanoplot_r1_median_q_raw | Float | Median quality score of raw forward reads |
nanoplot_r1_median_readlength_clean | Float | Median read length of clean forward reads |
nanoplot_r1_median_readlength_raw | Float | Median read length of raw forward reads |
nanoplot_r1_n50_clean | Float | N50 of clean forward reads |
nanoplot_r1_n50_raw | Float | N50 of raw forward reads |
nanoplot_r1_stdev_readlength_clean | Float | Standard deviation of read length of clean forward reads |
nanoplot_r1_stdev_readlength_raw | Float | Standard deviation of read length of raw forward reads |
nanoplot_tsv_clean | File | A TSV report describing the clean reads |
nanoplot_tsv_raw | File | A TSV report describing the raw reads |
nanoq_filtered_read1 | File | Filtered reads from NanoQ |
nanoq_version | String | Version of nanoq used in analysis |
ncbi_datasets_docker | String | Docker image used for NCBI datasets |
ncbi_datasets_version | String | Version of NCBI datasets used |
ncbi_identify_accession | String | NCBI accession ID of identified taxon |
ncbi_identify_avg_genome_length | Int | Average genome length from NCBI taxon summary |
ncbi_identify_docker | String | Docker image used for NCBI identify |
ncbi_identify_genome_summary_tsv | File | TSV file with genome summary from NCBI |
ncbi_identify_read_extraction_rank | String | Taxonomic rank used for read extraction |
ncbi_identify_taxon_id | String | NCBI taxonomy ID of identified organism |
ncbi_identify_taxon_name | String | Name of identified taxon |
ncbi_identify_taxon_summary_tsv | File | TSV file with taxa specific summary from NCBI |
ncbi_identify_version | String | Version of NCBI identify tool used |
ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
ncbi_scrub_read1_dehosted | File | Dehosted reads after NCBI scrub |
nextclade_aa_dels_flu_ha | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for HA segment |
nextclade_aa_dels_flu_na | String | Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for NA segment |
nextclade_aa_dels_mpxv | String | Amino-acid deletions as detected by Nextclade. Specific to Monkeypox |
nextclade_aa_dels_rabies | String | Amino-acid deletions as detected by Nextclade. Specific to Rabies |
nextclade_aa_subs_flu_ha | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for HA segment |
nextclade_aa_subs_flu_na | String | Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for NA segment |
nextclade_aa_subs_mpxv | String | Amino-acid substitutions as detected by Nextclade. Specific to Monkeypox |
nextclade_aa_subs_rabies | String | Amino-acid substitutions as detected by Nextclade. Specific to Rabies |
nextclade_clade_mpxv | String | Nextclade clade designation, specific to Monkeypox |
nextclade_clade_rabies | String | Nextclade clade designation, specific to Rabies |
nextclade_docker | String | Docker image used to run Nextclade |
nextclade_ds_tag | String | Dataset tag used to run Nextclade. Will be blank for Flu |
nextclade_ds_tag_flu_ha | String | Dataset tag used to run Nextclade, specific to Flu HA segment |
nextclade_ds_tag_flu_na | String | Dataset tag used to run Nextclade, specific to Flu NA segment |
nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment |
nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment |
nextclade_json_mpxv | File | Nextclade output in JSON file format, specific to Monkeypox |
nextclade_json_rabies | File | Nextclade output in JSON file format, specific to Rabies |
nextclade_lineage_mpxv | String | Nextclade lineage designation, specific to Monkeypox |
nextclade_lineage_rabies | String | Nextclade lineage designation, specific to Rabies |
nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment |
nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment |
nextclade_qc_mpxv | String | QC metric as determined by Nextclade, specific to Monkeypox |
nextclade_qc_rabies | String | QC metric as determined by Nextclade, specific to Rabies |
nextclade_tsv_flu_ha | File | Nextclade output in TSV file format, specific to Flu HA segment |
nextclade_tsv_flu_na | File | Nextclade output in TSV file format, specific to Flu NA segment |
nextclade_tsv_mpxv | File | Nextclade output in TSV file format, specific to Monkeypox |
nextclade_tsv_rabies | File | Nextclade output in TSV file format, specific to Rabies |
organism | String | Standardized organism name used for characterization |
pango_lineage | String | Pango lineage as determined by Pangolin |
pango_lineage_expanded | String | Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1" |
pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
pangolin_assignment_version | String | The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
pangolin_docker | String | Docker image used to run Pangolin |
pangolin_notes | String | Lineage notes as determined by Pangolin |
pangolin_versions | String | All Pangolin software and database versions |
parse_mapping_samtools_docker | String | Docker image used for samtools in parse mapping |
parse_mapping_samtools_version | String | Version of samtools used in parse mapping |
porechop_trimmed_read1 | File | Trimmed reads from Porechop |
porechop_version | String | Version of Porechop used |
quast_denovo_docker | String | Docker image used for QUAST |
quast_denovo_gc_percent | Float | GC percentage of de novo assembly from QUAST |
quast_denovo_genome_length | Int | Genome length of de novo assembly from QUAST |
quast_denovo_largest_contig | Int | Size of largest contig in de novo assembly from QUAST |
quast_denovo_n50_value | Int | N50 value of de novo assembly from QUAST |
quast_denovo_number_contigs | Int | Number of contigs in de novo assembly from QUAST |
quast_denovo_report | File | QUAST report for de novo assembly |
quast_denovo_uncalled_bases | Int | Number of uncalled bases in de novo assembly from QUAST |
quast_denovo_version | String | Version of QUAST used |
rasusa_read1_subsampled | File | Subsampled read file from Rasusa |
rasusa_read2_subsampled | File | Subsampled read file from Rasusa (paired file) |
rasusa_version | String | Version of RASUSA used for the analysis |
raven_denovo_docker | String | Docker image used for Raven |
raven_denovo_status | String | Status of Raven assembly |
raven_denovo_version | String | Version of Raven used |
read_mapping_cov_hist | File | Coverage histogram from read mapping |
read_mapping_cov_stats | File | Coverage statistics from read mapping |
read_mapping_coverage | Float | Average coverage from read mapping |
read_mapping_date | String | Date of read mapping analysis |
read_mapping_depth | Float | Average depth from read mapping |
read_mapping_flagstat | File | Flagstat file from read mapping |
read_mapping_meanbaseq | Float | Mean base quality from read mapping |
read_mapping_meanmapq | Float | Mean mapping quality from read mapping |
read_mapping_percentage_mapped_reads | Float | Percentage of mapped reads |
read_mapping_report | File | Report file from read mapping |
read_mapping_samtools_version | String | Version of samtools used in read mapping |
read_mapping_statistics | File | Statistics file from read mapping |
read_screen_clean | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure |
read_screen_clean_tsv | File | Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length |
reference_taxon_name | String | NCBI derived taxon name from best ANI hit accession |
skani_database | File | Database used for Skani |
skani_docker | String | Docker image used for Skani |
skani_report | File | Report from Skani |
skani_status | String | Status of Skani analysis |
skani_top_accession | String | Top accession ID from Skani |
skani_top_ani | Float | Top ANI score from Skani |
skani_top_ani_fasta | File | FASTA file of top ANI match from Skani |
skani_top_ref_coverage | Float | Reference coverage of top match from Skani |
skani_top_score | Float | Top score from Skani |
skani_version | String | Version of Skani used |
skani_warning | String | Skani warning message |
theiaviral_ont_date | String | Date of TheiaViral ONT workflow run |
theiaviral_ont_version | String | Version of TheiaViral ONT workflow |
What are the differences between the de novo and consensus assemblies?
De novo genomes are generated from scratch without a reference to guide read assembly, while consensus genomes are generated by mapping reads to a reference and replacing reference positions with identified variants (structural and nucleotide). De novo assemblies are thus not biased by requiring reads to map to the reference, though they may be more fragmented. Consensus assembly can generate more robust assemblies from lower coverage samples if the reference genome is of sufficient quality and sufficiently closely related to the inputted sequence, though consensus assembly may not perform well in instances of significant structural variation. TheiaViral uses de novo assemblies as an intermediate to acquire the best reference genome for consensus assembly.
We generally recommend TheiaViral users focus on the consensus assembly as the desired assembly output. While we chose the best de novo assemblers for TheiaViral based on internal benchmarking, the consensus assembly will often be higher quality than the de novo assembly. However, the de novo assembly can approach or exceed consensus quality if the read inputs largely comprise one virus, have high depth of coverage, and/or are derived from a virus with high potential for recombination. TheiaViral does conduct assembly contiguity and viral completeness quality control for de novo assemblies, so a de novo assembly that meets quality control standards can certainly be used for downstream analysis.
How is de novo assembly quality evaluated?
De novo assembly quality evaluation focuses on the completeness and contiguity of the genome. While a ground truth genome does not truly exist for quality comparison, reference genome selection can help contextualize quality if the reference is sufficiently similar to the de novo assembly. TheiaViral uses QUAST to acquire basic contiguity statistics and CheckV to assess viral genome completeness and contamination. Additionally, the reference selection software, Skani, can provide a quantitative comparison between the de novo assembly and the best reference genome.
Completeness and contamination
checkv_denovo_summary
: The summary file reports CheckV results on a contig-by-contig basis. Ideally completeness is 100% for a single contig, or 100% for all segments. If there are multiple extraneous contigs in the assembly, one is ideally 100%. The same principles apply to contamination, though it ideally is 0%.
checkv_denovo_total_genes
: The total genes is ideally the same number of genes as expected from the inputted viral taxon. Sometimes CheckV can fail to recover all the genes from a complete genome, so other statistics should be weighted more heavily in quality evaluation.
checkv_denovo_weighted_completeness
: The weighted completeness is ideally 100%.
checkv_denovo_weighted_contamination
: The weighted contamination is ideally 0%.
Length and contiguity
quast_denovo_genome_length
: The de novo genome length is ideally the same as the expected genome length of the focal virus.
quast_denovo_largest_contig
: The largest contig is ideally the size of the genome, or the size of the largest expected segment. If there are multiple contigs, and the largest contig is the ideal size, then the smaller contigs may be discarded based on the CheckV completeness for the largest contig (see CheckV outputs).
quast_denovo_n50_value
: The N50 is an evaluation of contiguity and is ideally as close as possible to the genome size. For segmented viruses, the N50 should be as close as possible to the size of the segment molecule that would cover at least 50% of the total genome size when segment lengths are added after sorting largest to smallest.
quast_denovo_number_contigs
: The number of contigs is ideally 1 or the total number of segments expected.
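The N50 rule for segmented viruses described above can be illustrated with a minimal sketch. The function name `n50` and the segment lengths are hypothetical and chosen only for illustration; QUAST computes this value itself, so the sketch is meant for sanity-checking an expected N50 by hand, not for replacing the workflow output.

```python
def n50(lengths):
    """Return the length of the contig (or segment) at which the running total,
    summed from largest to smallest, first reaches half of the total size."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must contain at least one contig")


# Hypothetical segment lengths (bp) for an eight-segment genome; illustrative only.
segments = [2341, 2341, 2233, 1778, 1565, 1413, 1027, 890]
expected_n50 = n50(segments)
print(expected_n50)
```

With these example lengths, the third-largest segment is the one that carries the running total past half of the genome size, so a complete, unfragmented assembly would be expected to report an N50 equal to that segment's length.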
Reference genome similarity
skani_top_ani
: The percent average nucleotide identity (ANI) for the top Skani hit is ideally 100% if the sequenced virus is highly similar to a reference genome. However, if the virus is divergent, ANI is not a good indication of assembly quality.
skani_top_ref_coverage
: The percent reference coverage for the top Skani hit is ideally 100% if the sequenced virus has not undergone significant recombination/structural variation.
skani_top_score
: The score for the top Skani hit is the ANI × reference coverage and is ideally 100% if the sequenced virus is not substantially divergent from the reference dataset.
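As a worked example of how the three Skani outputs relate, the sketch below multiplies ANI by reference coverage. It assumes both values are reported as percentages (0 to 100) and that the product is rescaled to a percentage by dividing by 100; that scaling and the function name `top_hit_score` are assumptions for illustration, not a confirmed description of the workflow's internals.

```python
def top_hit_score(ani_percent, ref_coverage_percent):
    """Combine ANI and reference coverage (both assumed to be percentages)
    into a single percentage score."""
    return ani_percent * ref_coverage_percent / 100.0


# Hypothetical values: a close match that covers only part of the reference.
score = top_hit_score(98.5, 80.0)
print(round(score, 2))
```

Under these assumptions, a hit with high ANI but incomplete reference coverage scores noticeably below its ANI, which is why the score is read alongside skani_top_ani and skani_top_ref_coverage instead of on its own.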
How is consensus assembly quality evaluated?
Consensus assemblies are derived from a reference genome, so quality assessment focuses on coverage and variant quality. Bases with insufficient coverage are denoted as "N". Additionally, the size and contiguity of a TheiaViral consensus assembly are expected to approximate the reference genome, so any discrepancy here is likely due to inferred structural variation.
Completeness and contamination
checkv_consensus_weighted_completeness
: The weighted completeness is ideally 100%.
Consensus variant calls
consensus_qc_number_Degenerate
: The number of degenerate bases is ideally 0. While degenerate bases indicate ambiguity in the sequence, non-N degenerate bases indicate that some information about the base was obtained.
consensus_qc_number_N
: The number of "N" bases is ideally 0.
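To make the distinction between "N" and other degenerate bases concrete, here is a minimal sketch that tallies base classes in a FASTA string. The function name, the returned keys, and the exact counting rules (A/C/G/T as unambiguous, the IUPAC codes R, Y, S, W, K, M, B, D, H, V as non-N degenerate) are assumptions for illustration and may not match how the workflow's consensus QC task defines its counts.

```python
def count_consensus_bases(fasta_text):
    """Tally total, N, non-N degenerate, and unambiguous bases in FASTA text."""
    unambiguous = set("ACGT")
    degenerate = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N
    counts = {"total": 0, "n": 0, "degenerate": 0, "unambiguous": 0}
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        for base in line.upper():
            counts["total"] += 1
            if base == "N":
                counts["n"] += 1
            elif base in degenerate:
                counts["degenerate"] += 1
            elif base in unambiguous:
                counts["unambiguous"] += 1
    return counts


# Hypothetical two-segment consensus; sequences are illustrative only.
example = ">seg1\nACGTNNRY\n>seg2\nacgtn\n"
result = count_consensus_bases(example)
print(result)
```

In this toy input, the "N" positions carry no information, while R and Y each narrow the base to two possibilities, which mirrors the point above that non-N degenerate bases still reflect partial evidence.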
Coverage
consensus_qc_percent_reference_coverage
: The percent reference coverage is ideally 100%.
read_mapping_cov_hist
: The read mapping coverage histogram ideally depicts normally distributed coverage, which may indicate uniform coverage across the reference genome. However, uniform coverage is unlikely with repetitive regions that approach/exceed read length.
read_mapping_coverage
: The average read mapping coverage is ideally as high as possible.
read_mapping_meanbaseq
: The mean base quality of the mapped reads is ideally as high as possible.
read_mapping_meanmapq
: The mean mapping quality of the alignments is ideally as high as possible.
read_mapping_percentage_mapped_reads
: The percent of mapped reads is ideally 100% of the reads classified as the lineage of interest. Some unclassified reads may also map, which may indicate they were erroneously unclassified. Alternatively, these reads could have been erroneously mapped.
Why did the workflow complete without generating a consensus?
TheiaViral is designed to "soft fail" when specific steps do not succeed due to input data quality. This means the workflow will be reported as successful, with an output that delineates the step that failed. If the workflow fails, please look for the following outputs in this order (sorted by timing of failure, latest first):
skani_status
: If this output is populated with something other than "PASS" and skani_top_accession is populated with "N/A", this indicates that Skani did not identify a sufficiently similar reference genome. The Skani database comprises a broad array of NCBI viral genomes, so a failure here likely indicates poor read quality because viral contigs are not found in the de novo assembly or are too small. It may be useful to BLAST whatever contigs do exist in the de novo assembly to determine if there is contamination that can be removed via the host input parameter. Additionally, review CheckV de novo outputs to assess if viral contigs were retrieved. Finally, consider keeping extract_unclassified set to "true", using a higher read_extraction_rank if it will not introduce contaminant viruses, and invoking a host input to remove host reads if host contigs are present.
megahit_status / flye_status
: If this output is populated with something other than "PASS", it indicates the fallback assembler did not successfully complete. The fallback assemblers are permissive, so failure here likely indicates poor read quality. Review read QC to check read quality, particularly following read classification. If read classification is dispensing with a significant number of reads, consider extract_unclassified, read_extraction_rank, and host input adjustment. Otherwise, sequencing quality may be poor.
metaviralspades_status / raven_denovo_status
: If this output is populated with something other than "PASS", it indicates the default assembler did not successfully complete or extract viral contigs (MetaviralSPAdes). On their own, these statuses do not correspond directly to workflow failure because fallback de novo assemblers are implemented for both TheiaViral workflows.
read_screen_clean
: If this output is populated with something other than "PASS", it indicates the reads did not pass the imposed thresholds. Either the reads are poor quality or the thresholds are too stringent, in which case the thresholds can be relaxed or skip_screen can be set to "true".
dehost_wf_download_status
: If this output is populated with something other than "PASS", it indicates a host genome could not be retrieved for decontamination. See the host input explanation for more information and review the download_accession / download_taxonomy task output logs for advanced error parsing.
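The lookup order above can be sketched as a small helper that scans a sample's outputs, latest step first, and reports every populated status that is not "PASS". The dictionary of outputs, the function name `non_passing_outputs`, and the example status text are hypothetical; the output names are the ones listed above, and how you export them (for example, from a data table row) is left as an assumption.

```python
# Output names in the documented order: latest point of failure first.
CHECK_ORDER = [
    "skani_status",
    "megahit_status",
    "flye_status",
    "metaviralspades_status",
    "raven_denovo_status",
    "read_screen_clean",
    "dehost_wf_download_status",
]


def non_passing_outputs(outputs):
    """Return (name, value) pairs for populated outputs that are not "PASS",
    in the documented order. Missing or empty outputs are skipped because
    the corresponding step may not have run."""
    flagged = []
    for name in CHECK_ORDER:
        value = outputs.get(name)
        if value and value != "PASS":
            flagged.append((name, value))
    return flagged


# Hypothetical outputs for one ONT sample; the non-PASS text is a placeholder.
sample_outputs = {
    "read_screen_clean": "PASS",
    "raven_denovo_status": "example non-PASS value",
    "flye_status": "PASS",
    "skani_status": "",
}
flagged = non_passing_outputs(sample_outputs)
print(flagged)
```

A non-PASS default assembler status (MetaviralSPAdes or Raven) appearing alone in the result is informational, since the fallback assembler may still have succeeded; the first entry that belongs to any other step is the more likely cause of a missing consensus.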
Known errors associated with read quality
- ONT workflows may fail at Metabuli if no reads are classified as the taxon. Check the Metabuli classification.tsv or krona report for the read extraction taxon ID to determine if any reads were classified. This error will report out of memory (OOM), but increasing memory will not resolve it.
- Illumina workflows may fail at CheckV (de novo) with "Error: 80 hmmsearch tasks failed. Program should be rerun" if no viral contigs were identified in the de novo assembly.
Acknowledgments¶
We would like to thank Danny Park at the Broad Institute and Jared Johnson at the Washington State Department of Public Health for correspondence during the development of TheiaViral. TheiaViral was built referencing viral-assemble, VAPER, and Artic.