TheiaCoV Workflow Series

Quick Facts

Workflow Type: Genomic Characterization
Applicable Kingdom: Viral
Last Known Changes: PHB v2.2.0
Command-line Compatibility: Yes, some optional features incompatible
Workflow Level: Sample-level

TheiaCoV Workflows

The TheiaCoV workflows are for the assembly, quality assessment, and characterization of viral genomes. There are currently five TheiaCoV workflows designed to accommodate different kinds of input data:

  1. Illumina paired-end sequencing (TheiaCoV_Illumina_PE)
  2. Illumina single-end sequencing (TheiaCoV_Illumina_SE)
  3. ONT sequencing (TheiaCoV_ONT)
  4. Genome assemblies (TheiaCoV_FASTA)
  5. ClearLabs sequencing (TheiaCoV_ClearLabs)

Additionally, the TheiaCoV_FASTA_Batch workflow is available to process several hundred SARS-CoV-2 assemblies at the same time.


TheiaCoV Workflow Diagram


Supported Organisms

These workflows currently support the following organisms:

  • SARS-CoV-2 ("sars-cov-2", "SARS-CoV-2") - default organism input
  • Monkeypox virus ("MPXV", "mpox", "monkeypox", "Monkeypox virus", "Mpox")
  • Human Immunodeficiency Virus ("HIV")
  • West Nile Virus ("WNV", "wnv", "West Nile virus")
  • Influenza ("flu", "influenza", "Flu", "Influenza")
  • RSV-A ("rsv_a", "rsv-a", "RSV-A", "RSV_A")
  • RSV-B ("rsv_b", "rsv-b", "RSV-B", "RSV_B")
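
The accepted organism strings listed above can be captured in a small lookup table. The following is a minimal sketch, not the workflow's own implementation: the function name `normalize_organism` and the choice of canonical keys (taken from the abbreviations used in the inputs table) are assumptions made for illustration, and matching is assumed to be exact and case-sensitive because the list enumerates specific spellings.

```python
# Accepted spellings copied from the list above. The canonical keys on the left
# are an assumption of this sketch (they mirror the inputs table abbreviations).
ORGANISM_ALIASES = {
    "sars-cov-2": ["sars-cov-2", "SARS-CoV-2"],
    "MPXV": ["MPXV", "mpox", "monkeypox", "Monkeypox virus", "Mpox"],
    "HIV": ["HIV"],
    "WNV": ["WNV", "wnv", "West Nile virus"],
    "flu": ["flu", "influenza", "Flu", "Influenza"],
    "rsv_a": ["rsv_a", "rsv-a", "RSV-A", "RSV_A"],
    "rsv_b": ["rsv_b", "rsv-b", "RSV-B", "RSV_B"],
}

# Invert the table so that every accepted spelling maps to its canonical key.
_ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in ORGANISM_ALIASES.items()
    for alias in aliases
}


def normalize_organism(value: str = "sars-cov-2") -> str:
    """Return the canonical key for an accepted organism string.

    The default mirrors the documented default organism (SARS-CoV-2).
    Raises ValueError for any string not present in the list above.
    """
    try:
        return _ALIAS_TO_CANONICAL[value]
    except KeyError:
        raise ValueError(f"Unsupported organism string: {value!r}") from None
```

Checking the organism string before launching a run catches typos (for example "Monkeypox" without "virus") early, rather than after a workflow has been submitted.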

The compatibility of each workflow with each pathogen is shown below:

Workflow | SARS-CoV-2 | Mpox | HIV | WNV | Influenza | RSV-A | RSV-B
Illumina_PE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅
Illumina_SE | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅
ClearLabs | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌
ONT | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅
FASTA | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅
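
The compatibility matrix above can also be encoded as data so that a workflow/organism pairing is checked programmatically. This is a sketch only: the names `COMPATIBILITY`, `is_supported`, and `workflows_for` are hypothetical, and the sets simply transcribe the rows of the matrix (column labels are used as organism names).

```python
# Each workflow maps to the set of organisms marked as supported in the matrix.
COMPATIBILITY = {
    "Illumina_PE": {"SARS-CoV-2", "Mpox", "HIV", "WNV", "Influenza", "RSV-A", "RSV-B"},
    "Illumina_SE": {"SARS-CoV-2", "Mpox", "WNV", "RSV-A", "RSV-B"},
    "ClearLabs": {"SARS-CoV-2"},
    "ONT": {"SARS-CoV-2", "Mpox", "HIV", "Influenza", "RSV-A", "RSV-B"},
    "FASTA": {"SARS-CoV-2", "Mpox", "WNV", "Influenza", "RSV-A", "RSV-B"},
}


def is_supported(workflow: str, organism: str) -> bool:
    """Return True if the matrix marks this workflow/organism pair as supported."""
    if workflow not in COMPATIBILITY:
        raise KeyError(f"Unknown workflow: {workflow!r}")
    return organism in COMPATIBILITY[workflow]


def workflows_for(organism: str) -> list[str]:
    """Return the workflows (sorted by name) that support the given organism."""
    return sorted(wf for wf, orgs in COMPATIBILITY.items() if organism in orgs)
```

For example, `workflows_for("HIV")` returns `["Illumina_PE", "ONT"]`, matching the two supported cells in the HIV column.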

To help you set up the workflow for each organism, we've provided the following information in the form of input JSONs.

Inputs

All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)

TheiaCoV_Illumina_PE Input Read Data

The TheiaCoV_Illumina_PE workflow takes in Illumina paired-end read data. Read file names should end with .fastq or .fq, with the optional addition of .gz. When possible, Theiagen recommends compressing files with gzip before uploading them to Terra to minimize data upload time.

By default, the workflow anticipates 2 x 150 bp reads (i.e. the input reads were generated using a 300-cycle sequencing kit). Modifications to the optional parameter for trim_minlen may be required to accommodate shorter read data, such as the 2 x 75 bp reads generated using a 150-cycle sequencing kit.
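
As an illustration of how the required paired-end inputs might be assembled into an inputs JSON, here is a minimal sketch. It assumes the usual WDL convention of `<workflow name>.<variable>` keys, using the task and variable names from the inputs table. The helper name `build_pe_inputs`, the bucket paths, and in particular the `trim_minlen` key and its value are placeholders for illustration only; check the workflow's inputs table for the exact key and choose a value suited to your read length.

```python
import json


def build_pe_inputs(samplename, read1, read2, extra=None):
    """Assemble a TheiaCoV_Illumina_PE inputs dictionary.

    Keys follow the WDL "<workflow>.<variable>" convention (an assumption of
    this sketch). Optional overrides can be supplied through `extra`.
    """
    inputs = {
        "theiacov_illumina_pe.samplename": samplename,
        "theiacov_illumina_pe.read1": read1,
        "theiacov_illumina_pe.read2": read2,
    }
    inputs.update(extra or {})
    return inputs


if __name__ == "__main__":
    # Hypothetical file locations and a hypothetical override for shorter reads.
    inputs = build_pe_inputs(
        samplename="sample01",
        read1="gs://example-bucket/sample01_R1.fastq.gz",
        read2="gs://example-bucket/sample01_R2.fastq.gz",
        # Placeholder key and value: both must be confirmed before use.
        extra={"theiacov_illumina_pe.trim_minlen": 50},
    )
    print(json.dumps(inputs, indent=2))
```

Keeping overrides in a separate `extra` mapping makes it easy to reuse the same required inputs across runs while varying only the optional parameters.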

TheiaCoV_Illumina_SE Input Read Data

TheiaCoV_Illumina_SE takes in Illumina single-end reads. Read file names should end with .fastq or .fq, with the optional addition of .gz. Theiagen highly recommends zipping files with gzip before uploading to Terra to minimize data upload time & save on storage costs.

By default, the workflow anticipates 1 x 35 bp reads (i.e. the input reads were generated using a 70-cycle sequencing kit). Modifications to the optional parameter for trim_minlen may be required to accommodate longer read data.
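
The gzip recommendation above can be followed with the standard `gzip` command-line tool or, as sketched here, with Python's standard library. The function name `gzip_fastq` and the choice to keep the original file by default are assumptions of this example rather than part of the workflow.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(path, keep_original=True):
    """Compress a FASTQ file with gzip and return the path of the .gz file.

    Files that already end in .gz are returned unchanged.
    """
    src = Path(path)
    if src.suffix == ".gz":
        return src
    dest = src.with_name(src.name + ".gz")
    # Stream the file in chunks so large read files are not loaded into memory.
    with src.open("rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if not keep_original:
        src.unlink()
    return dest
```

Calling `gzip_fastq("sample01.fastq")` writes `sample01.fastq.gz` next to the original file, which can then be uploaded in place of the uncompressed reads.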

TheiaCoV_ONT Input Read Data

The TheiaCoV_ONT workflow takes in base-called ONT read data. Read file names should end with .fastq or .fq, with the optional addition of .gz. When possible, Theiagen recommends zipping files with gzip before uploading to Terra to minimize data upload time.
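
The file-name convention shared by the read-based workflows (.fastq or .fq, optionally followed by .gz) can be checked before upload. The sketch below uses hypothetical helper names and assumes the extensions are matched case-sensitively, which the text does not state explicitly.

```python
import re

# Matches names ending in .fastq or .fq, with an optional trailing .gz.
_READ_NAME = re.compile(r"\.(fastq|fq)(\.gz)?$")


def has_valid_read_extension(filename: str) -> bool:
    """Return True if the file name ends with .fastq/.fq, optionally plus .gz."""
    return _READ_NAME.search(filename) is not None


def is_gzipped_name(filename: str) -> bool:
    """Return True if the file name indicates gzip compression."""
    return filename.endswith(".gz")
```

A check like this only inspects the name; it does not confirm that the file contents are valid FASTQ.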

The ONT sequencing kit and base-calling approach can produce substantial variability in the amount and quality of read data. Genome assemblies produced by the TheiaCoV_ONT workflow must be quality assessed before reporting results.

TheiaCoV_FASTA Input Assembly Data

The TheiaCoV_FASTA workflow takes in assembly files in FASTA format.

TheiaCoV_ClearLabs Input Read Data

The TheiaCoV_ClearLabs workflow takes in read data produced by the Clear Dx platform from ClearLabs. However, many users instead run the TheiaCoV_FASTA workflow on ClearLabs-generated FASTA files, because a few known issues that arise when generating assemblies with this pipeline do not occur with those files.

Terra Task Name Variable Type Description Default Value Terra Status Workflow * Organism
theiacov_clearlabs primer_bed File The bed file containing the primers used when sequencing was performed Required CL sars-cov-2
theiacov_clearlabs read1 File Read data produced by the Clear Dx platform from ClearLabs Required CL sars-cov-2
theiacov_fasta assembly_fasta File Input assembly FASTA file Required FASTA HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
theiacov_fasta input_assembly_method String Method used to generate the assembly file Required FASTA HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
theiacov_illumina_pe read1 File Forward Illumina read in FASTQ file format (compression optional) Required PE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
theiacov_illumina_pe read2 File Reverse Illumina read in FASTQ file format (compression optional) Required PE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
theiacov_illumina_se read1 File Forward Illumina read in FASTQ file format (compression optional) Required SE MPXV, WNV, sars-cov-2
theiacov_ont read1 File Demultiplexed ONT read in FASTQ file format (compression optional) Required ONT HIV, MPXV, WNV, flu, sars-cov-2
workflow name samplename String Name of the sample being analyzed Required CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name seq_method String The sequencing methodology used to generate the input read data Required FASTA HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
clean_check_reads cpu Int Number of CPUs to allocate to the task 2 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
clean_check_reads disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
clean_check_reads docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
clean_check_reads memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
consensus cpu Int Number of CPUs to allocate to the task 8 Optional CL, ONT sars-cov-2
consensus disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL, ONT sars-cov-2
consensus docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019-epi2me Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
consensus medaka_model String In order to obtain the best results, the appropriate model must be set to match the sequencer's basecaller model; this string takes the format of {pore}_{device}_{caller variant}_{caller_version}. See also https://github.com/nanoporetech/medaka?tab=readme-ov-file#models. r941_min_high_g360 Optional CL, ONT sars-cov-2
consensus memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional CL, ONT sars-cov-2
consensus_qc cpu Int Number of CPUs to allocate to the task 1 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
consensus_qc disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
consensus_qc docker String The Docker container to use for the task ngolin Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
consensus_qc genome_length Int Internal component, do not modify Do not modify, Optional CL, SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
consensus_qc memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
fastq_scan_clean_reads cpu Int Number of CPUs to allocate to the task 1 Optional CL sars-cov-2
fastq_scan_clean_reads disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL sars-cov-2
fastq_scan_clean_reads docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 Optional CL sars-cov-2
fastq_scan_clean_reads memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional CL sars-cov-2
fastq_scan_clean_reads read1_name Int Internal component, do not modify Do not modify, Optional CL sars-cov-2
fastq_scan_raw_reads cpu Int Number of CPUs to allocate to the task 1 Optional CL sars-cov-2
fastq_scan_raw_reads disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL sars-cov-2
fastq_scan_raw_reads docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 Optional CL sars-cov-2
fastq_scan_raw_reads memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional CL sars-cov-2
fastq_scan_raw_reads read1_name Int Internal component, do not modify Do not modify, Optional CL sars-cov-2
flu_track abricate_flu_cpu Int Number of CPUs to allocate to the task 2 Optional FASTA, ONT, PE flu
flu_track abricate_flu_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional FASTA, ONT, PE flu
flu_track abricate_flu_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/abricate:1.0.1-insaflu-220727 Optional FASTA, ONT, PE flu
flu_track abricate_flu_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional FASTA, ONT, PE flu
flu_track abricate_flu_mincov Int Minimum DNA % coverage 60 Optional FASTA, ONT, PE flu
flu_track abricate_flu_minid Int Minimum DNA % identity 70 Optional FASTA, ONT, PE flu
flu_track antiviral_aa_subs String Additional list of antiviral resistance associated amino acid substitutions of interest to be searched against those called on the sample segments. They take the format of <protein>:<substitution>, e.g. NA:A26V Optional ONT, PE flu
flu_track assembly_metrics_cpu Int Number of CPUs to allocate to the task 2 Optional PE flu
flu_track assembly_metrics_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional PE flu
flu_track assembly_metrics_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 Optional PE flu
flu_track assembly_metrics_memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional PE flu
flu_track flu_h1_ha_ref File Internal component, do not modify Do not modify, Optional ONT, PE flu
flu_track flu_h1n1_m2_ref File Internal component, do not modify Do not modify, Optional ONT, PE flu
flu_track flu_h3_ha_ref File Internal component, do not modify Do not modify, Optional ONT, PE flu
flu_track flu_h3n2_m2_ref File Internal component, do not modify Do not modify, Optional ONT, PE flu
flu_track flu_n1_na_ref File Internal component, do not modify Do not modify, Optional ONT, PE flu
flu_track flu_n2_na_ref File Internal component, do not modify Do not modify, Optional ONT, PE flu
flu_track flu_pa_ref File Internal component, do not modify Do not modify, Optional ONT, PE flu
flu_track flu_pb1_ref File Internal component, do not modify Do not modify, Optional ONT, PE flu
flu_track flu_pb2_ref File Internal component, do not modify Do not modify, Optional ONT, PE flu
flu_track flu_subtype String The influenza subtype being analyzed. Used for picking nextclade datasets. Options: "Yamagata", "Victoria", "H1N1", "H3N2". Only use to override the subtype call from IRMA and ABRicate. Optional CL, ONT, PE, SE flu
flu_track genoflu_cpu Int Number of CPUs to allocate to the task 1 Optional FASTA, ONT, PE flu
flu_track genoflu_cross_reference File An Excel file to cross-reference BLAST findings; probably useful if novel genotypes are not in the default file used by genoflu.py Optional FASTA, ONT, PE
flu_track genoflu_disk_size Int Amount of storage (in GB) to allocate to the task 25 Optional FASTA, ONT, PE
flu_track genoflu_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/genoflu:1.03 Optional FASTA, ONT, PE
flu_track genoflu_memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional FASTA, ONT, PE
flu_track irma_cpu Int Number of CPUs to allocate to the task 4 Optional ONT, PE flu
flu_track irma_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional ONT, PE flu
flu_track irma_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/cdcgov/irma:v1.1.5 Optional ONT, PE flu
flu_track irma_keep_ref_deletions Boolean True/False variable that determines if sites missed during read gathering should be deleted by ambiguation. TRUE Optional ONT, PE flu
flu_track irma_memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional ONT, PE flu
flu_track nextclade_cpu Int Number of CPUs to allocate to the task 2 Optional ONT, PE flu
flu_track nextclade_disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional ONT, PE flu
flu_track nextclade_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1 Optional ONT, PE flu
flu_track nextclade_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional ONT, PE flu
flu_track nextclade_output_parser_cpu Int Number of CPUs to allocate to the task 2 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
flu_track nextclade_output_parser_disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
flu_track nextclade_output_parser_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
flu_track nextclade_output_parser_memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
flu_track read2 File Internal component. Do not use. Optional ONT flu
gene_coverage cpu Int Number of CPUs to allocate to the task 2 Optional CL, ONT, PE, SE MPXV, sars-cov-2
gene_coverage disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL, ONT, PE, SE MPXV, sars-cov-2
gene_coverage docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 Optional CL, ONT, PE, SE MPXV, sars-cov-2
gene_coverage memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional CL, ONT, PE, SE MPXV, sars-cov-2
gene_coverage min_depth Int The minimum depth to determine if a position was covered. 10 Optional ONT, PE, SE MPXV, sars-cov-2
gene_coverage sc2_s_gene_start Int Start/First nucleotide position of the SARS-CoV-2 Spike gene 21563 Optional CL, ONT, PE, SE MPXV, sars-cov-2
gene_coverage sc2_s_gene_stop Int End/Last nucleotide position of the SARS-CoV-2 Spike gene 25384 Optional CL, ONT, PE, SE MPXV, sars-cov-2
ivar_consensus read2 File Internal component, do not modify Do not modify, Optional SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
ivar_consensus skip_N Boolean True/False variable that determines if regions with depth less than minimum depth should not be added to the consensus sequence FALSE Optional PE, SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
kraken2_dehosted cpu Int Number of CPUs to allocate to the task 4 Optional CL sars-cov-2
kraken2_dehosted disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL sars-cov-2
kraken2_dehosted docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv Optional CL sars-cov-2
kraken2_dehosted kraken2_db String The database used to run Kraken2 /kraken2-db Optional CL sars-cov-2
kraken2_dehosted memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional CL sars-cov-2
kraken2_dehosted read2 File Internal component, do not modify Do not modify, Optional CL sars-cov-2
kraken2_raw cpu Int Number of CPUs to allocate to the task 4 Optional CL sars-cov-2
kraken2_raw disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL sars-cov-2
kraken2_raw docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv Optional CL sars-cov-2
kraken2_raw kraken2_db String The database used to run Kraken2 /kraken2-db Optional CL sars-cov-2
kraken2_raw memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional CL sars-cov-2
kraken2_raw read_processing String The tool used for trimming of primers from reads. Options are trimmomatic and fastp trimmomatic Optional HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
kraken2_raw read2 File Internal component, do not modify Do not modify, Optional CL sars-cov-2
nanoplot_clean cpu Int Number of CPUs to allocate to the task 4 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_clean disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_clean docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_clean max_length Int The maximum length of clean reads, for which reads longer than the length specified will be hidden. 100000 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_clean memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw cpu Int Number of CPUs to allocate to the task 4 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw max_length Int The maximum length of clean reads, for which reads longer than the length specified will be hidden. 100000 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nanoplot_raw memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
ncbi_scrub_se cpu Int Number of CPUs to allocate to the task 4 Optional CL sars-cov-2
ncbi_scrub_se disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL sars-cov-2
ncbi_scrub_se docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 Optional CL sars-cov-2
ncbi_scrub_se memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional CL sars-cov-2
nextclade_output_parser cpu Int Number of CPUs to allocate to the task 2 Optional ONT, PE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_output_parser disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional ONT, PE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_output_parser docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/python/python:3.8.18-slim Optional ONT, PE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_output_parser memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional ONT, PE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3 auspice_reference_tree_json File An Auspice JSON phylogenetic reference tree which serves as a target for phylogenetic placement. Inherited from nextclade dataset Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3 cpu Int Number of CPUs to allocate to the task 2 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3 disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3 docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3 gene_annotations_gff File A genome annotation to specify how to translate the nucleotide sequence to proteins (genome_annotation.gff3). specifying this enables codon-informed alignment and protein alignments. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html Inherited from nextclade dataset Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3 input_ref File A nucleotide sequence which serves as a reference for the pairwise alignment of all input sequences. This is also the sequence which defines the coordinate system of the genome annotation. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/02-reference-sequence.html Inherited from nextclade dataset Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3 memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3 nextclade_pathogen_json File General dataset configuration file. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/05-pathogen-config.html Inherited from nextclade dataset Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
nextclade_v3 verbosity String other options are: "off" , "error" , "info" , "debug" , and "trace" (highest level of verbosity) warn Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
organism_parameters auspice_config File Auspice config file used in Augur_PHB workflow. Defaults set for various organisms & Flu segments. A minimal auspice config file is set in cases where organism is not specified and user does not provide an optional input config file. Optional Augur, CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
organism_parameters flu_segment String Influenza genome segment being analyzed. Options: "HA" or "NA". Automatically determined. This input is ignored if provided for TheiaCoV_Illumina_SE and TheiaCoV_ClearLabs N/A Optional CL, ONT, PE, SE flu
organism_parameters flu_subtype String The influenza subtype being analyzed. Options: "Yamagata", "Victoria", "H1N1", "H3N2". Automatically determined. This input is ignored if provided for TheiaCoV_Illumina_SE and TheiaCoV_ClearLabs N/A Optional CL, ONT, PE, SE flu
organism_parameters gene_locations_bed_file File Use to provide locations of interest where average coverage will be calculated Default provided for SARS-CoV-2 ("gs://theiagen-public-files-rp/terra/sars-cov-2-files/sc2_gene_locations.bed") and mpox ("gs://theiagen-public-files/terra/mpxv-files/mpox_gene_locations.bed") Optional CL, FASTA
organism_parameters genome_length_input Int Use to specify the expected genome length; provided by default for all supported organisms Default provided for SARS-CoV-2 (29903), mpox (197200), WNV (11000), flu (13000), RSV-A (16000), RSV-B (16000), HIV (primer versions 1 [9181] and 2 [9840]) Optional CL
organism_parameters hiv_primer_version String The version of HIV primers used. Options are "v1" (https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl#L156) and "v2" (https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl#L164). This input is ignored if provided for TheiaCoV_Illumina_SE and TheiaCoV_ClearLabs v1 Optional CL, FASTA, ONT, PE, SE HIV
organism_parameters kraken_target_organism_input String The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database. Default provided for mpox (Monkeypox virus), WNV (West Nile virus), and HIV (Human immunodeficiency virus 1) Optional FASTA, ONT, SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
organism_parameters pangolin_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.29 Optional CL, FASTA
organism_parameters primer_bed_file File The bed file containing the primers used when sequencing was performed REQUIRED FOR SARS-CoV-2, MPOX, WNV, RSV-A & RSV-B. Provided by default only for HIV primer versions 1 ("gs://theiagen-public-files/terra/hivgc-files/HIV-1_v1.0.primer.hyphen.bed") and 2 ("gs://theiagen-public-files/terra/hivgc-files/HIV-1_v2.0.primer.hyphen400.1.bed") Optional, Sometimes required CL, FASTA
organism_parameters reference_gff_file File Reference GFF file for the organism being analyzed Default provided for mpox ("gs://theiagen-public-files/terra/mpxv-files/Mpox-MT903345.1.reference.gff3") and HIV (primer versions 1 ["gs://theiagen-public-files/terra/hivgc-files/NC_001802.1.gff3"] and 2 ["gs://theiagen-public-files/terra/hivgc-files/AY228557.1.gff3"]) Optional CL, FASTA, ONT
organism_parameters vadr_max_length Int Maximum length for the fasta-trim-terminal-ambigs.pl VADR script Default provided for SARS-CoV-2 (30000), mpox (210000), WNV (11000), flu (0), RSV-A (15500) and RSV-B (15500). Optional CL
organism_parameters vadr_memory Int Amount of memory/RAM (in GB) to allocate to the task 32 (RSV-A and RSV-B) and 8 (all other TheiaCoV organisms) Optional CL, ONT, PE, SE MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
organism_parameters vadr_options String Options for the v-annotate.pl VADR script Default provided for SARS-CoV-2 ("--noseqnamemax --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --out_allfasta"), mpox ("--glsearch -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --out_allfasta --minimap2 --s_overhang 150"), WNV ("--mkey flavi --mdir /opt/vadr/vadr-models-flavi/ --nomisc --noprotid --out_allfasta"), flu (""), RSV-A ("-r --mkey rsv --xnocomp"), and RSV-B ("-r --mkey rsv --xnocomp") Optional CL
organism_parameters vadr_skip_length Int Minimum assembly length (unambiguous) to run VADR 10000 Optional CL MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
pangolin4 analysis_mode String Pangolin inference engine for lineage designations (usher or pangolearn). Default is Usher. Optional CL, FASTA, ONT, PE, SE sars-cov-2
pangolin4 cpu Int Number of CPUs to allocate to the task 4 Optional CL, FASTA, ONT, PE, SE sars-cov-2
pangolin4 disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL, FASTA, ONT, PE, SE sars-cov-2
pangolin4 expanded_lineage Boolean True/False that determines if a lineage should be expanded without aliases (e.g., BA.1 → B.1.1.529.1) TRUE Optional CL, FASTA, ONT, PE, SE sars-cov-2
pangolin4 max_ambig Float The maximum proportion of Ns allowed for pangolin to attempt an assignment 0.5 Optional CL, FASTA, ONT, PE, SE sars-cov-2
pangolin4 memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional CL, FASTA, ONT, PE, SE sars-cov-2
pangolin4 min_length Int Minimum query length allowed for pangolin to attempt an assignment 10000 Optional CL, FASTA, ONT, PE, SE sars-cov-2
pangolin4 pangolin_arguments String Optional arguments for pangolin, e.g. "--skip-scorpio" Optional CL, FASTA, ONT, PE, SE sars-cov-2
pangolin4 skip_designation_cache Boolean A True/False option that determines if the designation cache should be used FALSE Optional CL, FASTA, ONT, PE, SE sars-cov-2
pangolin4 skip_scorpio Boolean A True/False option that determines if scorpio should be skipped. FALSE Optional CL, FASTA, ONT, PE, SE sars-cov-2
qc_check_task ani_highest_percent Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task ani_highest_percent_bases_aligned Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task assembly_length Int Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task assembly_mean_coverage Int Internal component, do not modify Do not modify, Optional FASTA HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task busco_results String Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task combined_mean_q_clean Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task combined_mean_q_raw Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task combined_mean_readlength_clean Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task combined_mean_readlength_raw Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task cpu Int Number of CPUs to allocate to the task 4 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-03-16 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task est_coverage_clean Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task est_coverage_raw Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task gambit_predicted_taxon String Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task kraken_human String Internal component, do not modify Do not modify, Optional FASTA, ONT, SE
qc_check_task kraken_human_dehosted String Internal component, do not modify Do not modify, Optional FASTA, ONT, SE
qc_check_task kraken_sc2 Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task kraken_sc2_dehosted Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task kraken_target_organism Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task kraken_target_organism_dehosted Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task midas_secondary_genus_abundance Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task midas_secondary_genus_coverage Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task minbaseq_trim Int Internal component, do not modify Do not modify, Optional FASTA HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task n50_value Int Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task num_reads_clean2 Int Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, SE
qc_check_task num_reads_raw2 Int Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, SE
qc_check_task number_contigs Int Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task quast_gc_percent Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task r1_mean_q_clean Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task r1_mean_q_raw Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task r1_mean_readlength_clean Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task r1_mean_readlength_raw Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task r2_mean_q_clean Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task r2_mean_q_raw Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task r2_mean_readlength_clean Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task r2_mean_readlength_raw Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task sc2_s_gene_mean_coverage Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
qc_check_task sc2_s_gene_percent_coverage Float Internal component, do not modify Do not modify, Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
quasitools_illumina_pe cpu Int Number of CPUs to allocate to the task 2 Optional PE HIV
quasitools_illumina_pe disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional PE HIV
quasitools_illumina_pe docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/biocontainers/quasitools:0.7.0--pyh864c0ab_1 Optional PE HIV
quasitools_illumina_pe memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional PE HIV
quasitools_ont cpu Int Number of CPUs to allocate to the task 2 Optional ONT HIV
quasitools_ont disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional ONT HIV
quasitools_ont docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/biocontainers/quasitools:0.7.0--pyh864c0ab_1 Optional ONT HIV
quasitools_ont memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional ONT HIV
quasitools_ont read2 File Internal component. Do not use. Do not modify, Optional ONT HIV
raw_check_reads cpu Int Number of CPUs to allocate to the task 2 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
raw_check_reads disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
raw_check_reads docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
raw_check_reads memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim bbduk_memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim call_kraken Boolean True/False variable that determines if the Kraken2 task should be called. FALSE Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim call_midas Boolean True/False variable that determines if the MIDAS task should be called. TRUE Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim downsampling_coverage Float The desired coverage to sub-sample the reads to with RASUSA 150 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim fastp_args String Additional fastp task arguments --detect_adapter_for_pe -g -5 20 -3 20 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim kraken_db File The database used to run Kraken2 /kraken2-db Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim kraken_disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim kraken_memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim midas_db File The database used by the MIDAS task gs://theiagen-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim read_processing String The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp" trimmomatic Optional PE, SE
read_QC_trim read_qc String The tool used for quality control (QC) of reads. Options are fastq_scan and fastqc fastq_scan Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim target_organism String Organism to search for in Kraken Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
read_QC_trim trimmomatic_args String Additional arguments to pass to trimmomatic -phred33 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
set_flu_ha_nextclade_values reference_gff_file File Reference GFF file for flu HA Do not modify, Optional ONT flu
set_flu_na_nextclade_values reference_gff_file File Reference GFF file for flu NA Do not modify, Optional ONT flu
set_flu_na_nextclade_values vadr_mem Int Memory, in GB, allocated to this task 8 Do not modify, Optional ONT flu
stats_n_coverage cpu Int Number of CPUs to allocate to the task 2 Optional CL, ONT
stats_n_coverage disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL, ONT
stats_n_coverage docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 Optional CL, ONT
stats_n_coverage memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional CL, ONT
stats_n_coverage_primtrim cpu Int Number of CPUs to allocate to the task 2 Optional CL, ONT
stats_n_coverage_primtrim disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL, ONT
stats_n_coverage_primtrim docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 Optional CL, ONT
stats_n_coverage_primtrim memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional CL, ONT
vadr cpu Int Number of CPUs to allocate to the task 2 Optional CL, FASTA, ONT, PE, SE MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional CL, FASTA, ONT, PE, SE MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/vadr:1.5.1 Optional CL, FASTA, ONT, PE, SE MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr max_length Int Maximum length of contig allowed to run VADR Optional CL HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr memory Int Amount of memory/RAM (in GB) to allocate to the task 32 (RSV-A and RSV-B) and 8 (all other TheiaCoV organisms) Optional CL MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr min_length Int Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script 50 Optional CL, FASTA, ONT, PE, SE MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr skip_length Int Minimum assembly length (unambiguous) to run VADR 10000 Optional CL MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
vadr vadr_opts String Additional options to provide to VADR Optional CL HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
version_capture docker String The Docker container to use for the task "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" Optional ONT, PE, SE, FASTA, CL HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
version_capture timezone String Set the time zone to get an accurate date of analysis (uses UTC by default) Optional ONT, PE, SE, FASTA, CL HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name adapters File File that contains the adapters used /bbmap/resources/adapters.fa Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name consensus_min_freq Float The minimum frequency for a variant to be called a SNP in consensus genome 0.6 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name flu_segment String Influenza genome segment being analyzed. Options: "HA" or "NA". HA Optional, Required FASTA
workflow name flu_subtype String The influenza subtype being analyzed. Options: "Yamagata", "Victoria", "H1N1", "H3N2". Automatically determined. Optional FASTA
workflow name genome_length Int Use to specify the expected genome length Optional FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name max_genome_length Int Maximum genome length able to pass read screening 2673870 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name max_length Int Maximum length for a read based on the SARS-CoV-2 primer scheme 700 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name medaka_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019:1.3.0-medaka-1.4.3 Optional CL
workflow name min_basepairs Int Minimum base pairs to pass read screening 34000 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name min_coverage Int Minimum coverage to pass read screening 10 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name min_depth Int Minimum depth of reads required to call variants and generate a consensus genome. This value is passed to the iVar software. 100 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name min_genome_length Int Minimum genome length to pass read screening 1700 Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name min_length Int Minimum length of a read based on the SARS-CoV-2 primer scheme 400 Optional ONT HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name min_proportion Int Minimum read proportion to pass read screening 40 Optional PE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name min_reads Int Minimum reads to pass read screening 113 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name nextclade_dataset_name String Nextclade organism dataset names. However, if organism input is set correctly, this input will be automatically assigned the corresponding dataset name. See organism defaults for more information Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name nextclade_dataset_tag String Nextclade dataset tag. Used for pulling up-to-date reference genomes and associated information specific to nextclade datasets (QC thresholds, organism-specific information like SARS-CoV-2 clade & lineage information, etc.) that is required for running the Nextclade tool. Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name normalise Int Used to normalize the amount of reads to the indicated level before variant calling 20000 for CL, 200 for ONT Optional CL, ONT
workflow name organism String The organism that is being analyzed. Options: "sars-cov-2", "MPXV", "WNV", "HIV", "flu", "rsv_a", "rsv_b". However, "flu" is not available for TheiaCoV_Illumina_SE sars-cov-2 Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name pangolin_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.29 Do not modify, Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name phix File File that contains the phix used /bbmap/resources/phix174_ill.ref.fa.gz Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name primer_bed File The bed file containing the primers used when sequencing was performed Optional ONT, PE, SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
workflow name qc_check_table File A TSV file with optional user input QC values to be compared against the default workflow value Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name reference_gene_locations_bed File Use to provide locations of interest where average coverage will be calculated Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name reference_genome File An optional reference genome used for consensus assembly and QC Optional CL, FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name reference_gff File The general feature format (gff) of the reference genome. Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name seq_method String The sequencing methodology used to generate the input read data ILLUMINA Optional CL, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name skip_mash Boolean A True/False option that determines if mash should be skipped in the screen task. FALSE Optional ONT, SE HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2
workflow name skip_screen Boolean A True/False option that determines if the screen task should be skipped. FALSE Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name target_organism String The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database. Optional CL, ONT, PE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name trim_min_length Int The minimum length of each read after trimming 75 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name trim_primers Boolean A True/False option that determines if primers should be trimmed. TRUE Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name trim_quality_min_score Int The minimum quality score to keep during trimming 30 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name trim_window_size Int Specifies window size for trimming (the number of bases to average the quality across) 4 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name vadr_max_length Int Maximum length of contig allowed to run VADR Optional FASTA, ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name vadr_memory Int Amount of memory/RAM (in GB) to allocate to the task 32 (RSV-A and RSV-B) and 8 (all other TheiaCoV organisms) Optional FASTA, ONT, PE, SE MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name vadr_options String Additional options to provide to VADR Optional ONT, PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name vadr_opts String Additional options to provide to VADR Optional FASTA HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name vadr_skip_length Int Minimum assembly length (unambiguous) to run VADR 10000 Optional FASTA, ONT, PE, SE MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
workflow name variant_min_freq Float Minimum frequency for a variant to be reported in ivar outputs 0.6 Optional PE, SE HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2
TheiaCoV_FASTA_Batch_PHB Inputs
TheiaCoV_FASTA_Batch Inputs
Input Data

The TheiaCoV_FASTA_Batch workflow takes in a set of assembly files in FASTA format.

Terra Task Name Variable Type Description Default Value Terra Status
theiacov_fasta_batch assembly_fastas Array[File] Genome assembly files in fasta format. Example: this.sars-cov-2-samples.assembly_fasta Required
theiacov_fasta_batch bucket_name String The GCP bucket for the workspace where the TheiaCoV_FASTA_Batch output files are saved. We recommend using a unique GSURI for the bucket associated with your Terra workspace. The root GSURI is accessible in the Dashboard page of your workspace in the "Cloud Information" section. Do not include the prefix gs:// in the string. Example: "fc-c526190d-4332-409b-8086-be7e1af9a0b6/theiacov_fasta_batch-2024-04-15-seq-run-1/" Required
theiacov_fasta_batch project_name String The name of the Terra project where the data can be found. Example: "my-terra-project" Required
theiacov_fasta_batch samplenames Array[String] The names of the samples to be analyzed. Example: this.sars-cov-2-samples.sars-cov-2-sample_id Required
theiacov_fasta_batch table_name String The name of the Terra table where the data can be found. Example: "sars-cov-2-sample" Required
theiacov_fasta_batch workspace_name String The name of the Terra workspace where the data can be found. Example: "my-terra-workspace" Required
cat_files_fasta cpu Int Number of CPUs to allocate to the task 2 Optional
cat_files_fasta disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
cat_files_fasta docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 Optional
cat_files_fasta memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional
nextclade_v3 auspice_reference_tree_json File The phylogenetic reference tree which serves as a target for phylogenetic placement default is inherited from NextClade dataset Optional
nextclade_v3 cpu Int Number of CPUs to allocate to the task 2 Optional
nextclade_v3 disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
nextclade_v3 docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1 Optional
nextclade_v3 gene_annotations_gff File A genome annotation to specify how to translate the nucleotide sequence to proteins (genome_annotation.gff3). Specifying this enables codon-informed alignment and protein alignments. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html None Optional
nextclade_v3 input_ref File A nucleotide sequence which serves as a reference for the pairwise alignment of all input sequences. This is also the sequence which defines the coordinate system of the genome annotation. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/02-reference-sequence.html None Optional
nextclade_v3 memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
nextclade_v3 nextclade_pathogen_json File General dataset configuration file. See here for more info: https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/05-pathogen-config.html None Optional
nextclade_v3 verbosity String Other options are: "off", "error", "info", "debug", and "trace" (highest level of verbosity) warn Optional
organism_parameters flu_segment String Optional
organism_parameters flu_subtype String Optional
organism_parameters gene_locations_bed_file File Optional
organism_parameters genome_length_input Int Optional
organism_parameters hiv_primer_version String Optional
organism_parameters kraken_target_organism_input String Optional
organism_parameters primer_bed_file File Optional
organism_parameters reference_genome File Optional
organism_parameters reference_gff_file File Optional
organism_parameters vadr_max_length Int Optional
organism_parameters vadr_mem Int Optional
organism_parameters vadr_options String Optional
pangolin4 analysis_mode String Used to switch between usher and pangolearn analysis modes. Only use usher because pangolearn is no longer supported as of Pangolin v4.3 and higher versions. None Optional
pangolin4 cpu Int Number of CPUs to allocate to the task 4 Optional
pangolin4 disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
pangolin4 expanded_lineage Boolean True/False that determines if a lineage should be expanded without aliases (e.g., BA.1 → B.1.1.529.1) TRUE Optional
pangolin4 max_ambig Float The maximum proportion of Ns allowed for pangolin to attempt an assignment 0.5 Optional
pangolin4 memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional
pangolin4 skip_designation_cache Boolean True/False that determines if the designation cache should be used FALSE Optional
pangolin4 skip_scorpio Boolean True/False that determines if scorpio should be skipped. FALSE Optional
sm_theiacov_fasta_wrangling cpu Int Number of CPUs to allocate to the task 8 Optional
sm_theiacov_fasta_wrangling disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
sm_theiacov_fasta_wrangling docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-08-28-v4 Optional
sm_theiacov_fasta_wrangling memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
theiacov_fasta_batch nextclade_dataset_name String Nextclade organism dataset name. Options: "nextstrain/sars-cov-2/wuhan-hu-1/orfs" However, if organism input is set correctly, this input will be automatically assigned the corresponding dataset name. sars-cov-2 Optional
theiacov_fasta_batch nextclade_dataset_tag String Nextclade dataset tag. Used for pulling up-to-date reference genomes and associated information specific to nextclade datasets (QC thresholds, organism-specific information like SARS-CoV-2 clade & lineage information, etc.) that is required for running the Nextclade tool. 2024-06-13--23-42-47Z Optional
theiacov_fasta_batch organism String The organism that is being analyzed. Options: "sars-cov-2" sars-cov-2 Optional
theiacov_fasta_batch pangolin_docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.27 Optional
version_capture docker String The Docker container to use for the task "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" Optional
version_capture timezone String Set the time zone to get an accurate date of analysis (uses UTC by default) Optional
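
As a rough illustration of how the six required inputs above fit together, the following Python sketch assembles them into a JSON object and strips a leading gs:// from the bucket name, as the bucket_name description asks. The `theiacov_fasta_batch.` key prefix, the helper names `strip_gs_prefix` and `build_batch_inputs`, the example file paths, and the check that the two arrays have equal length are all assumptions made for illustration, not part of the workflow itself.

```python
import json


def strip_gs_prefix(bucket_path: str) -> str:
    """Return the bucket path without a leading gs:// prefix."""
    prefix = "gs://"
    return bucket_path[len(prefix):] if bucket_path.startswith(prefix) else bucket_path


def build_batch_inputs(assembly_fastas, samplenames, bucket_name,
                       project_name, workspace_name, table_name):
    """Assemble the required TheiaCoV_FASTA_Batch inputs (hypothetical helper)."""
    # Assumption: the two arrays are parallel, one sample name per assembly.
    if len(assembly_fastas) != len(samplenames):
        raise ValueError("assembly_fastas and samplenames must have the same length")
    # Assumption: inputs are keyed as <workflow name>.<variable name>.
    wf = "theiacov_fasta_batch"
    return {
        f"{wf}.assembly_fastas": list(assembly_fastas),
        f"{wf}.samplenames": list(samplenames),
        f"{wf}.bucket_name": strip_gs_prefix(bucket_name),
        f"{wf}.project_name": project_name,
        f"{wf}.workspace_name": workspace_name,
        f"{wf}.table_name": table_name,
    }


# Placeholder values; replace them with those of your own workspace.
inputs = build_batch_inputs(
    assembly_fastas=["sample1.fasta", "sample2.fasta"],
    samplenames=["sample1", "sample2"],
    bucket_name="gs://fc-c526190d-4332-409b-8086-be7e1af9a0b6/theiacov_fasta_batch-2024-04-15-seq-run-1/",
    project_name="my-terra-project",
    workspace_name="my-terra-workspace",
    table_name="sars-cov-2-sample",
)
print(json.dumps(inputs, indent=2))
```

On Terra these values are normally entered in the workflow configuration form; a JSON object like this one is mainly useful as a record of the settings used.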

Organism-specific parameters and logic

The organism_parameters sub-workflow is the first step in all TheiaCoV workflows. This step automatically sets the parameters needed by each downstream tool to the values appropriate for the user-designated organism (the default organism is "sars-cov-2").

The following tables include the relevant organism-specific parameters; all of these default values can be overwritten by providing a value for the "Overwrite Variable Name" field.

SARS-CoV-2 Defaults
Overwrite Variable Name Organism Default Value
gene_locations_bed_file sars-cov-2 "gs://theiagen-public-files-rp/terra/sars-cov-2-files/sc2_gene_locations.bed"
genome_length_input sars-cov-2 29903
nextclade_dataset_name_input sars-cov-2 "nextstrain/sars-cov-2/wuhan-hu-1/orfs"
nextclade_dataset_tag_input sars-cov-2 "2024-07-17--12-57-03Z"
pangolin_docker_image sars-cov-2 "us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.29"
reference_genome sars-cov-2 "gs://theiagen-public-files-rp/terra/augur-sars-cov-2-references/MN908947.fasta"
vadr_max_length sars-cov-2 30000
vadr_mem sars-cov-2 8
vadr_options sars-cov-2 "--noseqnamemax --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --out_allfasta"
Mpox Defaults
Overwrite Variable Name Organism Default Value
gene_locations_bed_file MPXV "gs://theiagen-public-files/terra/mpxv-files/mpox_gene_locations.bed"
genome_length_input MPXV 197200
kraken_target_organism_input MPXV "Monkeypox virus"
nextclade_dataset_name_input MPXV "nextstrain/mpox/lineage-b.1"
nextclade_dataset_tag_input MPXV "2024-04-19--07-50-39Z"
primer_bed_file MPXV "gs://theiagen-public-files/terra/mpxv-files/MPXV.primer.bed"
reference_genome MPXV "gs://theiagen-public-files/terra/mpxv-files/MPXV.MT903345.reference.fasta"
reference_gff_file MPXV "gs://theiagen-public-files/terra/mpxv-files/Mpox-MT903345.1.reference.gff3"
vadr_max_length MPXV 210000
vadr_mem MPXV 8
vadr_options MPXV "--glsearch -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --out_allfasta --minimap2 --s_overhang 150"
WNV Defaults
Overwrite Variable Name Organism Default Value Notes
genome_length_input WNV 11000
kraken_target_organism_input WNV "West Nile virus"
nextclade_dataset_name_input WNV "NA" TheiaCoV's Nextclade currently does not support WNV
nextclade_dataset_tag_input WNV "NA" TheiaCoV's Nextclade currently does not support WNV
primer_bed_file WNV "gs://theiagen-public-files/terra/theiacov-files/WNV/WNV-L1_primer.bed"
reference_genome WNV "gs://theiagen-public-files/terra/theiacov-files/WNV/NC_009942.1_wnv_L1.fasta"
vadr_max_length WNV 11000
vadr_mem WNV 8
vadr_options WNV "--mkey flavi --mdir /opt/vadr/vadr-models-flavi/ --nomisc --noprotid --out_allfasta"
Flu Defaults
Overwrite Variable Name Organism Flu Segment Flu Subtype Default Value Notes
flu_segment flu all all N/A TheiaCoV will attempt to automatically assign a flu segment
flu_subtype flu all all N/A TheiaCoV will attempt to automatically assign a flu subtype
genome_length_input flu all all 13500
vadr_max_length flu all all 13500
vadr_mem flu all all 8
vadr_options flu all all "--atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --mkey flu"
nextclade_dataset_name_input flu ha h1n1 "nextstrain/flu/h1n1pdm/ha/MW626062"
nextclade_dataset_tag_input flu ha h1n1 "2024-07-03--08-29-55Z"
reference_genome flu ha h1n1 "gs://theiagen-public-files-rp/terra/flu-references/reference_h1n1pdm_ha.fasta"
nextclade_dataset_name_input flu ha h3n2 "nextstrain/flu/h3n2/ha/EPI1857216"
nextclade_dataset_tag_input flu ha h3n2 "2024-08-08--05-08-21Z"
reference_genome flu ha h3n2 "gs://theiagen-public-files-rp/terra/flu-references/reference_h3n2_ha.fasta"
nextclade_dataset_name_input flu ha victoria "nextstrain/flu/vic/ha/KX058884"
nextclade_dataset_tag_input flu ha victoria "2024-07-03--08-29-55Z"
reference_genome flu ha victoria "gs://theiagen-public-files-rp/terra/flu-references/reference_vic_ha.fasta"
nextclade_dataset_name_input flu ha yamagata "nextstrain/flu/yam/ha/JN993010"
nextclade_dataset_tag_input flu ha yamagata "2024-01-30--16-34-55Z"
reference_genome flu ha yamagata "gs://theiagen-public-files-rp/terra/flu-references/reference_yam_ha.fasta"
nextclade_dataset_name_input flu na h1n1 "nextstrain/flu/h1n1pdm/na/MW626056"
nextclade_dataset_tag_input flu na h1n1 "2024-07-03--08-29-55Z"
reference_genome flu na h1n1 "gs://theiagen-public-files-rp/terra/flu-references/reference_h1n1pdm_na.fasta"
nextclade_dataset_name_input flu na h3n2 "nextstrain/flu/h3n2/na/EPI1857215"
nextclade_dataset_tag_input flu na h3n2 "2024-04-19--07-50-39Z"
reference_genome flu na h3n2 "gs://theiagen-public-files-rp/terra/flu-references/reference_h3n2_na.fasta"
nextclade_dataset_name_input flu na victoria "nextstrain/flu/vic/na/CY073894"
nextclade_dataset_tag_input flu na victoria "2024-04-19--07-50-39Z"
reference_genome flu na victoria "gs://theiagen-public-files-rp/terra/flu-references/reference_vic_na.fasta"
nextclade_dataset_name_input flu na yamagata "NA"
nextclade_dataset_tag_input flu na yamagata "NA"
reference_genome flu na yamagata "gs://theiagen-public-files-rp/terra/flu-references/reference_yam_na.fasta"
RSV-A Defaults
Overwrite Variable Name Organism Default Value
genome_length_input rsv_a 16000
kraken_target_organism rsv_a Respiratory syncytial virus
nextclade_dataset_name_input rsv_a nextstrain/rsv/a/EPI_ISL_412866
nextclade_dataset_tag_input rsv_a 2024-08-01--22-31-31Z
reference_genome rsv_a gs://theiagen-public-files-rp/terra/rsv_references/reference_rsv_a.fasta
vadr_max_length rsv_a 15500
vadr_mem rsv_a 32
vadr_options rsv_a -r --mkey rsv --xnocomp
RSV-B Defaults
Overwrite Variable Name Organism Default Value
genome_length_input rsv_b 16000
kraken_target_organism rsv_b "Human orthopneumovirus"
nextclade_dataset_name_input rsv_b nextstrain/rsv/b/EPI_ISL_1653999
nextclade_dataset_tag_input rsv_b "2024-08-01--22-31-31Z"
reference_genome rsv_b gs://theiagen-public-files-rp/terra/rsv_references/reference_rsv_b.fasta
vadr_max_length rsv_b 15500
vadr_mem rsv_b 32
vadr_options rsv_b -r --mkey rsv --xnocomp
HIV Defaults
Overwrite Variable Name Organism Default Value Notes
kraken_target_organism_input HIV Human immunodeficiency virus 1
genome_length_input HIV-v1 9181 This version of HIV originates from Oregon
primer_bed_file HIV-v1 gs://theiagen-public-files/terra/hivgc-files/HIV-1_v1.0.primer.hyphen.bed This version of HIV originates from Oregon
reference_genome HIV-v1 gs://theiagen-public-files/terra/hivgc-files/NC_001802.1.fasta This version of HIV originates from Oregon
reference_gff_file HIV-v1 gs://theiagen-public-files/terra/hivgc-files/NC_001802.1.gff3 This version of HIV originates from Oregon
genome_length_input HIV-v2 9840 This version of HIV originates from Southern Africa
primer_bed_file HIV-v2 gs://theiagen-public-files/terra/hivgc-files/HIV-1_v2.0.primer.hyphen400.1.bed This version of HIV originates from Southern Africa
reference_genome HIV-v2 gs://theiagen-public-files/terra/hivgc-files/AY228557.1.headerchanged.fasta This version of HIV originates from Southern Africa
reference_gff_file HIV-v2 gs://theiagen-public-files/terra/hivgc-files/AY228557.1.gff3 This version of HIV originates from Southern Africa
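
The default-plus-override behaviour described in this section can be sketched in Python as a lookup table whose values are replaced by any user-supplied input. This is only an illustration under stated assumptions: the function name `resolve_parameters` and the dictionary layout are hypothetical, only a subset of the SARS-CoV-2 and Mpox defaults from the tables above is included, and the real logic lives in the organism_parameters sub-workflow.

```python
# Subset of the defaults listed in the SARS-CoV-2 and Mpox tables above.
ORGANISM_DEFAULTS = {
    "sars-cov-2": {
        "genome_length_input": 29903,
        "vadr_max_length": 30000,
        "vadr_mem": 8,
        "nextclade_dataset_name_input": "nextstrain/sars-cov-2/wuhan-hu-1/orfs",
    },
    "MPXV": {
        "genome_length_input": 197200,
        "vadr_max_length": 210000,
        "vadr_mem": 8,
        "kraken_target_organism_input": "Monkeypox virus",
    },
}


def resolve_parameters(organism="sars-cov-2", **overrides):
    """Return an organism's defaults, with user-supplied values taking precedence."""
    if organism not in ORGANISM_DEFAULTS:
        raise ValueError(f"organism not covered by this sketch: {organism}")
    params = dict(ORGANISM_DEFAULTS[organism])  # copy so the defaults stay untouched
    # A value of None stands for an optional input that was left unset.
    params.update({k: v for k, v in overrides.items() if v is not None})
    return params


params = resolve_parameters("MPXV", vadr_mem=16)
print(params["vadr_mem"])             # 16, overridden by the user
print(params["genome_length_input"])  # 197200, table default
```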

Workflow Tasks

All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT, and ClearLabs workflows. These undertake read trimming and assembly appropriate to the input data type. TheiaCoV workflows subsequently launch default genome characterization modules for quality assessment, and additional taxa-specific characterization steps. When setting up the workflow, users may choose to use "optional tasks" as additions or alternatives to tasks run in the workflow by default.

Core tasks

These tasks run regardless of organism and carry out read trimming and various quality control steps.

versioning: Version capture for TheiaCoV

The versioning task captures the workflow version from the GitHub (code repository) version.

Version Capture Technical details

Links
Task task_versioning.wdl
screen: Total Raw Read Quantification and Genome Size Estimation

The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples that do not meet these criteria will not be processed further by the workflow:

  1. Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
  2. Proportion of base pairs in the forward and reverse read files: A sample will fail the read screening if less than min_proportion percent of its base pairs are in either the read1 or read2 file.
  3. Number of base pairs: A sample will fail the read screening if it has fewer than min_basepairs base pairs.
  4. Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_length or larger than max_genome_length.
  5. Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than min_coverage.

Read screening is undertaken on both the raw and cleaned reads. The task may be skipped by setting the skip_screen variable to true.
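
The pass/fail logic of the five criteria can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code in task_screen.wdl: the function name, the dataclass, and the threshold values are hypothetical, min_proportion is assumed to be a percentage, and coverage is approximated as total basepairs divided by the estimated genome size (the task itself relies on mash for the estimate).

```python
from dataclasses import dataclass


@dataclass
class ScreenThresholds:
    # Illustrative container; field names mirror the workflow's optional inputs.
    min_reads: int
    min_basepairs: int
    min_genome_size: int
    max_genome_size: int
    min_coverage: float
    min_proportion: float  # assumed to be a percentage (0-100)


def screen_reads(num_reads, bp_read1, bp_read2, est_genome_size, t):
    """Apply the five criteria in order; return "PASS" or "FAIL; <reason>"."""
    total_bp = bp_read1 + bp_read2
    if num_reads <= t.min_reads:
        return "FAIL; too few reads"
    smaller_share = 100.0 * min(bp_read1, bp_read2) / total_bp if total_bp else 0.0
    if smaller_share < t.min_proportion:
        return "FAIL; read1/read2 basepair proportion too uneven"
    if total_bp < t.min_basepairs:
        return "FAIL; too few basepairs"
    if not (t.min_genome_size <= est_genome_size <= t.max_genome_size):
        return "FAIL; estimated genome size out of range"
    # Assumption: coverage is approximated as total basepairs / genome size.
    if total_bp / est_genome_size < t.min_coverage:
        return "FAIL; estimated coverage too low"
    return "PASS"


# Hypothetical thresholds chosen only for this example.
thresholds = ScreenThresholds(50, 15000, 1500, 3000000, 10.0, 40.0)
print(screen_reads(1000, 150000, 150000, 30000, thresholds))
```

Checking the criteria in a fixed order means a sample reports only the first reason it failed, which keeps the flag short at the cost of hiding any later failures.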

Default values vary between the PE and SE workflows. The rationale for these default values can be found below.

Variable Rationale
skip_screen Set to true to prevent the read screen from running; by default the screen runs, which avoids wasting compute resources on insufficient data
min_reads Minimum number of base pairs for 10x coverage of the hepatitis delta virus (genus Deltavirus) divided by 300 (longest Illumina read length)
min_basepairs Greater than 10x coverage of the hepatitis delta virus (genus Deltavirus)
min_genome_size Based on the hepatitis delta virus (genus Deltavirus) genome, the smallest viral genome as of 2024-04-11 (1,700 bp)
max_genome_size Based on the Pandoravirus salinus genome, the biggest viral genome, (2,673,870 bp) with 2 Mbp added
min_coverage A bare-minimum coverage for genome characterization. Higher coverage would be required for high-quality phylogenetics.
min_proportion Greater than 50% reads are in the read1 file; others are in the read2 file
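
The arithmetic behind the min_basepairs and min_reads rationale can be worked through as follows. The genome size (1,700 bp), the 10x target, and the 300 bp read length come from the table; rounding the read count up to a whole read is an assumption made here, and the resulting numbers are not guaranteed to match the workflow's actual defaults.

```python
import math

genome_size_bp = 1700    # hepatitis delta virus genome size from the table
target_coverage = 10     # 10x coverage
longest_read_bp = 300    # longest Illumina read length

# Basepairs needed to cover the smallest viral genome 10 times over.
min_basepairs = genome_size_bp * target_coverage

# Reads needed to supply those basepairs, rounded up (assumption).
min_reads = math.ceil(min_basepairs / longest_read_bp)

print(min_basepairs, min_reads)
```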

Screen Technical Details

There is a single WDL task for read screening. The screen task is run twice, once for raw reads and once for clean reads.

Links
Task task_screen.wdl
read_QC_trim_pe and read_QC_trim_se: Read Quality Trimming, Host and Adapter Removal, Quantification, and Identification for Illumina workflows

read_QC_trim is a sub-workflow within TheiaCoV that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between TheiaCoV PE and SE in the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.

Host removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.

NCBI-Scrub Technical Details

Links
Task task_ncbi_scrub.wdl
Software Source Code NCBI Scrub on GitHub
Software Documentation https://github.com/ncbi/sra-human-scrubber/blob/master/README.md
Read quality trimming

Either trimmomatic or fastp can be used for read-quality trimming. Trimmomatic is used by default. Both tools trim low-quality regions of reads with a sliding window (with a window size of trim_window_size), cutting once the average quality within the window falls below trim_quality_trim_score. They will both discard the read if it is trimmed below trim_minlen.
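
As a rough illustration of the sliding-window idea, the sketch below scans a list of per-base quality scores and cuts the read at the first window whose mean quality falls below the threshold. It is a simplification and does not reproduce the exact behavior of either trimmomatic or fastp; the function name is hypothetical, and its parameters only mirror trim_window_size, trim_quality_trim_score, and trim_minlen.

```python
from typing import List, Optional


def sliding_window_trim(quals: List[int], window_size: int,
                        min_avg_quality: float, min_len: int) -> Optional[int]:
    """Return the number of leading bases to keep, or None if the read is discarded."""
    keep = len(quals)
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            keep = start  # cut at the start of the first failing window
            break
    # Discard reads that end up shorter than the minimum length.
    return keep if keep >= min_len else None


# Ten high-quality bases followed by ten low-quality bases.
example_quals = [30] * 10 + [5] * 10
print(sliding_window_trim(example_quals, window_size=4, min_avg_quality=20, min_len=5))
```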

If fastp is selected for analysis, fastp also implements the additional read-trimming steps indicated below:

Parameter Explanation
-g enables polyG tail trimming
-5 20 enables read end-trimming
-3 20 enables read end-trimming
--detect_adapter_for_pe enables adapter-trimming only for paired-end reads
Adapter removal

The BBDuk task removes adapters from sequence reads. To do this:

  • Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files.
  • BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
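
The k-mer matching idea behind the PhiX filter can be sketched as follows. This is a conceptual illustration only, not BBDuk's implementation: the function names are hypothetical, the reference string is a made-up toy sequence, and a small k is used in the example so the sequences stay readable (the workflow text describes 31-mers).

```python
def revcomp(seq):
    """Reverse-complement a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_reads(reads, reference, k=31):
    """Drop every read sharing at least one k-mer with the reference (either strand)."""
    ref_kmers = kmers(reference, k) | kmers(revcomp(reference), k)
    return [read for read in reads if not (kmers(read, k) & ref_kmers)]


toy_reference = "ACGTTGCAAGGCT"       # stand-in for a contaminant genome
toy_reads = ["TTACGTTGG", "CCCCCCCCC"]  # the first read shares the 5-mer ACGTT
print(filter_reads(toy_reads, toy_reference, k=5))
```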
What are adapters and why do they need to be removed?

Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.

Read Quantification

There are two methods for read quantification to choose from: fastq-scan (default) or fastqc. Both quantify the forward and reverse reads in FASTQ files. In TheiaCoV_Illumina_PE, they also provide the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads. fastqc also provides a graphical visualization of the read quality.
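
To make the quantification step concrete, here is a minimal sketch of counting reads and bases in a FASTQ stream. It assumes strict four-line records and is not how fastq-scan or fastqc are implemented; the function name and the in-memory example records are made up for illustration.

```python
import io


def count_fastq(handle):
    """Count reads and bases, assuming strict four-line FASTQ records."""
    reads = 0
    bases = 0
    for index, line in enumerate(handle):
        if index % 4 == 1:  # the sequence line of each record
            reads += 1
            bases += len(line.strip())
    return reads, bases


# Two made-up records held in memory instead of a file on disk.
example_fastq = "@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n"
print(count_fastq(io.StringIO(example_fastq)))
```

Running the same count on the raw and the clean files gives the before/after comparison described above.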

Read Identification

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

Kraken2 is run on the set of raw reads, provided as input, as well as the set of clean reads that result from the read_QC_trim workflow.

Database-dependent

TheiaCoV automatically uses a viral-specific Kraken2 database.

Kraken2 Technical Details

Links
Task task_kraken2.wdl
Software Source Code Kraken2 on GitHub
Software Documentation https://github.com/DerrickWood/kraken2/wiki
Original Publication(s) Improved metagenomic analysis with Kraken 2
read_QC_trim_ONT: Read Quality Trimming, Host Removal, and Identification for ONT data

read_QC_trim is a sub-workflow within TheiaCoV that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below.

Host removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.

NCBI-Scrub Technical Details

Links
Task task_ncbi_scrub.wdl
Software Source Code NCBI Scrub on GitHub
Software Documentation https://github.com/ncbi/sra-human-scrubber/blob/master/README.md
Read quality filtering

Read filtering is performed using artic guppyplex, which performs a quality check by filtering the reads by length to remove chimeric reads.
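
Length-based filtering can be illustrated with a short sketch. This is not the artic guppyplex code; the function name and the length bounds in the example are hypothetical, and suitable bounds depend on the amplicon scheme in use.

```python
def filter_by_length(reads, min_length, max_length):
    """Keep reads whose length falls inside the expected amplicon size range."""
    return [seq for seq in reads if min_length <= len(seq) <= max_length]


# Made-up reads: one too short, one in range, and one chimera-like read that is too long.
example_reads = ["A" * 150, "A" * 450, "A" * 900]
kept = filter_by_length(example_reads, min_length=400, max_length=700)
print([len(seq) for seq in kept])
```

Reads much longer than the expected amplicon are the likely chimeras, which is why an upper bound is applied as well as a lower one.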

Read Identification

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

Kraken2 is run on the set of raw reads, provided as input, as well as the set of clean reads that result from the read_QC_trim workflow.

Database-dependent

TheiaCoV automatically uses a viral-specific Kraken2 database.

Kraken2 Technical Details

Links
Task task_kraken2.wdl
Software Source Code Kraken2 on GitHub
Software Documentation https://github.com/DerrickWood/kraken2/wiki
Original Publication(s) Improved metagenomic analysis with Kraken 2

read_QC_trim Technical Details

Each TheiaCoV workflow calls a sub-workflow listed below, which then calls the individual tasks:

Workflow TheiaCoV_ONT
Sub-workflow wf_read_QC_trim_ont.wdl
Tasks task_ncbi_scrub.wdl (SE subtask)
task_artic_guppyplex.wdl
task_kraken2.wdl
Software Source Code NCBI Scrub on GitHub
Artic on GitHub
Kraken2 on GitHub
Software Documentation NCBI Scrub
Artic pipeline
Kraken2
Original Publication(s) STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions
Improved metagenomic analysis with Kraken 2

Assembly tasks

One of these tasks is run, depending on the organism and workflow type.

ivar_consensus: Alignment, Consensus, Variant Detection, and Assembly Statistics for non-flu organisms in Illumina workflows

ivar_consensus is a sub-workflow within TheiaCoV that performs reference-based consensus assembly using the iVar tool by Nathan Grubaugh from the Andersen lab.

The following steps are performed as part of this sub-workflow:

  1. Cleaned reads are aligned to the appropriate reference genome (see also the organism-specific parameters and logic section above) using BWA to generate a Binary Alignment Map (BAM) file.
  2. If trim_primers is set to true, primers will be removed using ivar trim.
    1. General statistics about the remaining reads are calculated.
  3. The ivar consensus command is run to generate a consensus assembly.
  4. General statistics about the assembly are calculated.
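
The consensus-calling step can be pictured with the toy sketch below, which picks the majority base at each position and masks positions with too little depth. It is a conceptual simplification, not iVar's algorithm (iVar, for instance, can emit ambiguity codes, which this sketch replaces with N); the function name, parameters, and input representation are hypothetical.

```python
from collections import Counter


def call_consensus(pileup, min_depth, min_freq, low_depth_char="N"):
    """Call one base per position from the observed bases at that position."""
    consensus = []
    for bases in pileup:
        depth = len(bases)
        if depth == 0 or depth < min_depth:
            consensus.append(low_depth_char)  # not enough reads to call a base
            continue
        base, count = Counter(bases).most_common(1)[0]
        # Simplification: emit N when no base reaches the frequency threshold.
        consensus.append(base if count / depth >= min_freq else "N")
    return "".join(consensus)


# Each string lists the bases observed at one reference position (made-up data).
example_pileup = ["AAAA", "AC", "GGGT", "CCTT"]
print(call_consensus(example_pileup, min_depth=3, min_freq=0.6))
```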
artic_consensus: Alignment, Primer Trimming, Variant Detection, and Consensus for non-flu organisms in ONT & ClearLabs workflows

Briefly, input reads are aligned to the appropriate reference with minimap2 to generate a Binary Alignment Map (BAM) file. Primer sequences are then removed from the BAM file and a consensus assembly file is generated using the Artic minion Medaka argument.

Read-trimming is performed on raw read data generated on the ClearLabs instrument and is thus not a required step in the TheiaCoV_ClearLabs workflow.

General statistics about the assembly are generated with the consensus_qc task (task_assembly_metrics.wdl).

Artic Consensus Technical Details

Links
Task task_artic_consensus.wdl
Software Source Code Artic on GitHub
Software Documentation Artic pipeline
irma: Assembly and Characterization for flu in TheiaCoV_Illumina_PE & TheiaCoV_ONT

Cleaned reads are assembled using irma which does not use a reference due to the rapid evolution and high variability of influenza. irma also performs typing and subtyping as part of the assembly process.

General statistics about the assembly are generated with the consensus_qc task (task_assembly_metrics.wdl).

IRMA Technical Details

Links
Task task_irma.wdl
Software Documentation IRMA website
Original Publication(s) Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler

Organism-specific characterization tasks

The following tasks only run for the appropriate organism designation. The following table illustrates which characterization tools are run for the indicated organism.

SARS-CoV-2 MPXV HIV WNV Influenza RSV-A RSV-B
Pangolin ✅ ❌ ❌ ❌ ❌ ❌ ❌
Nextclade ✅ ✅ ❌ ❌ ✅ ✅ ✅
VADR ✅ ✅ ❌ ✅ ✅ ✅ ✅
Quasitools HyDRA ❌ ❌ ✅ ❌ ❌ ❌ ❌
IRMA ❌ ❌ ❌ ❌ ✅ ❌ ❌
Abricate ❌ ❌ ❌ ❌ ✅ ❌ ❌
% Gene Coverage ✅ ✅ ❌ ❌ ❌ ❌ ❌
Antiviral Detection ❌ ❌ ❌ ❌ ✅ ❌ ❌
GenoFLU ❌ ❌ ❌ ❌ ✅ ❌ ❌
pangolin

Pangolin designates SARS-CoV-2 lineage assignments.

Pangolin Technical Details

Links
Task task_pangolin.wdl
Software Source Code Pangolin on GitHub
Software Documentation Pangolin website
nextclade

"Nextclade is an open-source project for viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement."

Nextclade Technical Details

Links
Task task_nextclade.wdl
Software Source Code https://github.com/nextstrain/nextclade
Software Documentation Nextclade
Original Publication(s) Nextclade: clade assignment, mutation calling and quality control for viral genomes.
vadr

VADR annotates and validates completed assembly files.

quasitools

quasitools performs genome characterization for HIV.

Quasitools Technical Details

Links
Task task_quasitools.wdl
Software Source Code https://github.com/phac-nml/quasitools/
Software Documentation Quasitools HyDRA
irma

IRMA assigns types and subtype/lineages in addition to performing assembly of flu genomes. Please see the section above under "Assembly tasks" to find more information regarding this tool.

IRMA Technical Details

Links
Task task_irma.wdl
Software Documentation IRMA website
Original Publication(s) Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler
abricate

Abricate assigns types and subtypes/lineages for flu samples.

Abricate Technical Details

Links
Task task_abricate.wdl (abricate_flu subtask)
Software Source Code ABRicate on GitHub
Software Documentation ABRicate on GitHub
gene_coverage

This task calculates the percent of the gene covered above a minimum depth. By default, it runs for SARS-CoV-2 and MPXV, but if a bed file is provided with regions of interest, this task will be run for other organisms as well.
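
A minimal sketch of the percent-coverage calculation is shown below. It assumes per-position depths are already available as a list, that gene coordinates are 1-based and inclusive, and that a position counts as covered when its depth is at least min_depth; the function name and these conventions are assumptions for illustration rather than details taken from task_gene_coverage.wdl.

```python
def percent_gene_coverage(depths, gene_start, gene_end, min_depth):
    """Percent of positions in [gene_start, gene_end] with depth >= min_depth."""
    region = depths[gene_start - 1:gene_end]  # convert 1-based inclusive to a slice
    covered = sum(1 for depth in region if depth >= min_depth)
    return 100.0 * covered / len(region)


# Made-up depths for a 10-position reference; the "gene" spans positions 3 to 8.
example_depths = [0, 5, 20, 20, 20, 3, 0, 30, 30, 30]
print(round(percent_gene_coverage(example_depths, 3, 8, min_depth=10), 2))
```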

Gene Coverage Technical Details

Links
Task task_gene_coverage.wdl
flu_antiviral_substitutions

This sub-workflow determines which, if any, antiviral mutations are present in the sample.

The assembled HA, NA, PA, PB1 and PB2 segments are compared against a list of known amino-acid substitutions associated with resistance to the antivirals A_315675, compound_367, Favipiravir, Fludase, L_742_001, Laninamivir, Oseltamivir (Tamiflu), Peramivir, Pimodivir, Xofluza, and Zanamivir. The list of known antiviral amino-acid substitutions can be expanded via the optional user input antiviral_aa_subs in the format "NA:V95A,HA:I97V", i.e. Protein:AAPositionAA.
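
The Protein:AAPositionAA format can be unpacked as in the sketch below. The parser is purely illustrative and is not part of the workflow; the function name and the strictness of the pattern (single-letter amino-acid codes, alphanumeric protein names) are assumptions.

```python
import re

# Protein name, reference amino acid, position, substituted amino acid.
SUB_PATTERN = re.compile(r"^([A-Za-z0-9]+):([A-Z])(\d+)([A-Z])$")


def parse_aa_subs(spec):
    """Parse a comma-separated list such as "NA:V95A,HA:I97V" into tuples."""
    parsed = []
    for item in spec.split(","):
        match = SUB_PATTERN.match(item.strip())
        if match is None:
            raise ValueError(f"Unrecognized substitution: {item!r}")
        protein, ref_aa, position, alt_aa = match.groups()
        parsed.append((protein, ref_aa, int(position), alt_aa))
    return parsed


print(parse_aa_subs("NA:V95A,HA:I97V"))
```

Validating the string before launching a run catches typos such as a missing colon, which would otherwise surface only as an absent resistance call.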

Antiviral Substitutions Technical Details

Links
Workflow wf_influenza_antiviral_substitutions.wdl
genoflu

This sub-workflow determines the whole-genome genotype of an H5N1 flu sample.

GenoFLU Technical Details

Links
Task task_genoflu.wdl
Software Source Code GenoFLU on GitHub

Outputs

All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)

Variable Type Description Workflow
abricate_flu_database String ABRicate database used for analysis FASTA, ONT, PE
abricate_flu_results File File containing all results from ABRicate FASTA, ONT, PE
abricate_flu_subtype String Flu subtype as determined by ABRicate FASTA, ONT, PE
abricate_flu_type String Flu type as determined by ABRicate FASTA, ONT, PE
abricate_flu_version String Version of ABRicate FASTA, ONT, PE
aligned_bai File Index companion file to the bam file generated during the consensus assembly process CL, ONT, PE, SE
aligned_bam File Primer-trimmed BAM file; generated during consensus assembly process CL, ONT, PE, SE
artic_docker String Docker image utilized for read trimming and consensus genome assembly CL, ONT
artic_version String Version of the Artic software utilized for read trimming and consensus genome assembly CL, ONT
assembly_fasta File Consensus genome assembly; for lower quality flu samples, the output may state "Assembly could not be generated" when there is too little and/or too low quality data for IRMA to produce an assembly CL, ONT, PE, SE
assembly_length_unambiguous Int Number of unambiguous basecalls within the consensus assembly CL, FASTA, ONT, PE, SE
assembly_mean_coverage Float Mean sequencing depth throughout the consensus assembly. Generated after performing primer trimming and calculated using the SAMtools coverage command CL, ONT, PE, SE
assembly_method String Method employed to generate consensus assembly CL, FASTA, ONT, PE, SE
auspice_json File Auspice-compatible JSON output generated from Nextclade analysis that includes the Nextclade default samples for clade-typing and the single sample placed on this tree CL, FASTA, ONT, PE, SE
auspice_json_flu_ha File Auspice-compatible JSON output generated from Nextclade analysis on Influenza HA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree ONT, PE
auspice_json_flu_na File Auspice-compatible JSON output generated from Nextclade analysis on Influenza NA segment that includes the Nextclade default samples for clade-typing and the single sample placed on this tree ONT, PE
bbduk_docker String Docker image used to run BBDuk PE, SE
bwa_version String Version of BWA used to map read data to the reference genome PE, SE
consensus_flagstat File Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) CL, ONT, PE, SE
consensus_n_variant_min_depth Int Minimum read depth to call variants for iVar consensus and iVar variants PE, SE
consensus_stats File Output from the SAMtools stats command to assess quality of the alignment file (BAM) CL, ONT, PE, SE
est_coverage_clean Float Estimated coverage of the clean reads ONT
est_coverage_raw Float Estimated coverage of the raw reads ONT
est_percent_gene_coverage_tsv File Percent coverage for each gene in the organism being analyzed (depending on the organism input) CL, ONT, PE, SE
fastp_html_report File HTML report for fastp PE, SE
fastp_version String Fastp version used PE, SE
fastq_scan_num_reads_clean_pairs String Number of paired reads after filtering as determined by fastq_scan PE
fastq_scan_num_reads_clean1 Int Number of forward reads after filtering as determined by fastq_scan CL, PE, SE
fastq_scan_num_reads_clean2 Int Number of reverse reads after filtering as determined by fastq_scan PE
fastq_scan_num_reads_raw_pairs String Number of paired reads identified in the input fastq files as determined by fastq_scan PE
fastq_scan_num_reads_raw1 Int Number of forward reads identified in the input fastq files as determined by fastq_scan CL, PE, SE
fastq_scan_num_reads_raw2 Int Number of reverse reads identified in the input fastq files as determined by fastq_scan PE
fastq_scan_r1_mean_q_clean Float Forward read mean quality value after quality trimming and adapter removal
fastq_scan_r1_mean_q_raw Float Forward read mean quality value before quality trimming and adapter removal
fastq_scan_r1_mean_readlength_clean Float Forward read mean read length value after quality trimming and adapter removal
fastq_scan_r1_mean_readlength_raw Float Forward read mean read length value before quality trimming and adapter removal
fastq_scan_version String Version of fastq_scan used for read QC analysis CL, PE, SE
fastqc_clean1_html File Graphical visualization of clean forward read quality from fastqc to open in an internet browser PE, SE
fastqc_clean2_html File Graphical visualization of clean reverse read quality from fastqc to open in an internet browser PE
fastqc_docker String Docker container used for fastqc PE, SE
fastqc_num_reads_clean_pairs String Number of read pairs after cleaning by fastqc PE
fastqc_num_reads_clean1 Int Number of forward reads after cleaning by fastqc PE, SE
fastqc_num_reads_clean2 Int Number of reverse reads after cleaning by fastqc PE
fastqc_num_reads_raw_pairs Int Number of raw read pairs as computed by fastqc PE
fastqc_num_reads_raw1 Int Number of raw forward-facing reads as computed by fastqc PE, SE
fastqc_num_reads_raw2 Int Number of raw reverse-facing reads as computed by fastqc PE
fastqc_raw1_html File Graphical visualization of raw forward read quality from fastqc to open in an internet browser PE, SE
fastqc_raw2_html File Graphical visualization of raw reverse read quality from fastqc to open in an internet browser PE
fastqc_version String Version of fastqc software used PE, SE
flu_A_315675_resistance String resistance mutations to A_315675 ONT, PE
flu_amantadine_resistance String resistance mutations to amantadine ONT, PE
flu_compound_367_resistance String resistance mutations to compound_367 ONT, PE
flu_favipiravir_resistance String resistance mutations to favipiravir ONT, PE
flu_fludase_resistance String resistance mutations to fludase ONT, PE
flu_L_742_001_resistance String resistance mutations to L_742_001 ONT, PE
flu_laninamivir_resistance String resistance mutations to laninamivir ONT, PE
flu_oseltamivir_resistance String resistance mutations to oseltamivir (Tamiflu®) ONT, PE
flu_peramivir_resistance String resistance mutations to peramivir (Rapivab®) ONT, PE
flu_pimodivir_resistance String resistance mutations to pimodivir ONT, PE
flu_rimantadine_resistance String resistance mutations to rimantadine ONT, PE
flu_xofluza_resistance String resistance mutations to xofluza (Baloxavir marboxil) ONT, PE
flu_zanamivir_resistance String resistance mutations to zanamivir (Relenza®) ONT, PE
genoflu_all_segments String The genotypes for each individual flu segment FASTA, ONT, PE
genoflu_genotype String The genotype of the whole genome, based off of the individual segments types FASTA, ONT, PE
genoflu_output_tsv File The output file from GenoFLU FASTA, ONT, PE
genoflu_version String The version of GenoFLU used FASTA, ONT, PE
irma_docker String Docker image used to run IRMA ONT, PE
irma_ha_segment_fasta File HA (Haemagglutinin) assembly fasta file ONT, PE
irma_mp_segment_fasta File MP (Matrix Protein) assembly fasta file ONT, PE
irma_na_segment_fasta File NA (Neuraminidase) assembly fasta file ONT, PE
irma_np_segment_fasta File NP (Nucleoprotein) assembly fasta file ONT, PE
irma_ns_segment_fasta File NS (Nonstructural) assembly fasta file ONT, PE
irma_pa_segment_fasta File PA (Polymerase acidic) assembly fasta file ONT, PE
irma_pb1_segment_fasta File PB1 (Polymerase basic 1) assembly fasta file ONT, PE
irma_pb2_segment_fasta File PB2 (Polymerase basic 2) assembly fasta file ONT, PE
irma_subtype String Flu subtype as determined by IRMA ONT, PE
irma_subtype_notes String Helpful note to user about Flu B subtypes. Output will be blank for Flu A samples. For Flu B samples it will state: "IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column" ONT, PE
irma_type String Flu type as determined by IRMA ONT, PE
irma_version String Version of IRMA used ONT, PE
ivar_tsv File Variant descriptor file generated by iVar variants PE, SE
ivar_variant_proportion_intermediate String The proportion of variants of intermediate frequency PE, SE
ivar_variant_version String Version of iVar for running the iVar variants command PE, SE
ivar_vcf File iVar tsv output converted to VCF format PE, SE
ivar_version_consensus String Version of iVar for running the iVar consensus command PE, SE
ivar_version_primtrim String Version of iVar for running the iVar trim command PE, SE
kraken_human Float Percent of human read data detected using the Kraken2 software CL, ONT, PE, SE
kraken_human_dehosted Float Percent of human read data detected using the Kraken2 software after host removal CL, ONT, PE
kraken_report File Full Kraken report CL, ONT, PE, SE
kraken_report_dehosted File Full Kraken report after host removal CL, ONT, PE
kraken_sc2 Float Percent of SARS-CoV-2 read data detected using the Kraken2 software CL, ONT, PE, SE
kraken_sc2_dehosted Float Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal CL, ONT, PE
kraken_target_organism String Percent of target organism read data detected using the Kraken2 software CL, ONT, PE, SE
kraken_target_organism_dehosted String Percent of target organism read data detected using the Kraken2 software after host removal CL, ONT, PE
kraken_target_organism_name String The name of the target organism; e.g., "Monkeypox" or "Human immunodeficiency virus" CL, ONT, PE, SE
kraken_version String Version of Kraken software used CL, ONT, PE, SE
meanbaseq_trim Float Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming CL, ONT, PE, SE
meanmapq_trim Float Mean quality of the mapped reads to the reference genome after primer trimming CL, ONT, PE, SE
medaka_reference String Reference sequence used in medaka task CL, ONT
medaka_vcf File A VCF file containing the identified variants ONT
nanoplot_docker String Docker image used to run Nanoplot ONT
nanoplot_html_clean File An HTML report describing the clean reads ONT
nanoplot_html_raw File An HTML report describing the raw reads ONT
nanoplot_num_reads_clean1 Float Number of clean reads ONT
nanoplot_num_reads_raw1 Float Number of raw reads ONT
nanoplot_r1_est_coverage_clean Float Estimated coverage on the clean reads by nanoplot ONT
nanoplot_r1_est_coverage_raw Float Estimated coverage on the raw reads by nanoplot ONT
nanoplot_r1_mean_q_clean Float Mean quality score of clean forward reads ONT
nanoplot_r1_mean_q_raw Float Mean quality score of raw forward reads ONT
nanoplot_r1_mean_readlength_clean Float Mean read length of clean forward reads ONT
nanoplot_r1_mean_readlength_raw Float Mean read length of raw forward reads ONT
nanoplot_r1_median_q_clean Float Median quality score of clean forward reads ONT
nanoplot_r1_median_q_raw Float Median quality score of raw forward reads ONT
nanoplot_r1_median_readlength_clean Float Median read length of clean forward reads ONT
nanoplot_r1_median_readlength_raw Float Median read length of raw forward reads ONT
nanoplot_r1_n50_clean Float N50 of clean forward reads ONT
nanoplot_r1_n50_raw Float N50 of raw forward reads ONT
nanoplot_r1_stdev_readlength_clean Float Standard deviation read length of clean forward reads ONT
nanoplot_r1_stdev_readlength_raw Float Standard deviation read length of raw forward reads ONT
nanoplot_tsv_clean File A TSV report describing the clean reads ONT
nanoplot_tsv_raw File A TSV report describing the raw reads ONT
nanoplot_version String Version of nanoplot tool used ONT
nextclade_aa_dels String Amino-acid deletions as detected by Nextclade. Will be blank for Flu CL, FASTA, ONT, PE, SE
nextclade_aa_dels_flu_ha String Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for HA segment ONT, PE
nextclade_aa_dels_flu_na String Amino-acid deletions as detected by Nextclade. Specific to Flu; it includes deletions for NA segment ONT, PE
nextclade_aa_subs String Amino-acid substitutions as detected by Nextclade. Will be blank for Flu CL, FASTA, ONT, PE, SE
nextclade_aa_subs_flu_ha String Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for HA segment ONT, PE
nextclade_aa_subs_flu_na String Amino-acid substitutions as detected by Nextclade. Specific to Flu; it includes substitutions for NA segment ONT, PE
nextclade_clade String Nextclade clade designation, will be blank for Flu. CL, FASTA, ONT, PE, SE
nextclade_clade_flu_ha String Nextclade clade designation, specific to Flu HA segment ONT, PE
nextclade_clade_flu_na String Nextclade clade designation, specific to Flu NA segment ONT, PE
nextclade_docker String Docker image used to run Nextclade CL, FASTA, ONT, PE, SE
nextclade_ds_tag String Dataset tag used to run Nextclade. Will be blank for Flu CL, FASTA, ONT, PE, SE
nextclade_ds_tag_flu_ha String Dataset tag used to run Nextclade, specific to Flu HA segment ONT, PE
nextclade_ds_tag_flu_na String Dataset tag used to run Nextclade, specific to Flu NA segment ONT, PE
nextclade_json File Nextclade output in JSON file format. Will be blank for Flu CL, FASTA, ONT, PE, SE
nextclade_json_flu_ha File Nextclade output in JSON file format, specific to Flu HA segment ONT, PE
nextclade_json_flu_na File Nextclade output in JSON file format, specific to Flu NA segment ONT, PE
nextclade_lineage String Nextclade lineage designation CL, FASTA, ONT, PE, SE
nextclade_qc String QC metric as determined by Nextclade. (For Flu, this output will be specific to HA segment) CL, FASTA, ONT, PE, SE
nextclade_qc_flu_ha String QC metric as determined by Nextclade, specific to Flu HA segment ONT, PE
nextclade_qc_flu_na String QC metric as determined by Nextclade, specific to Flu NA segment ONT, PE
nextclade_tsv File Nextclade output in TSV file format. (For Flu, this output will be specific to HA segment) CL, FASTA, ONT, PE, SE
nextclade_tsv_flu_ha File Nextclade output in TSV file format, specific to Flu HA segment ONT, PE
nextclade_tsv_flu_na File Nextclade output in TSV file format, specific to Flu NA segment ONT, PE
nextclade_version String The version of Nextclade software used CL, FASTA, ONT, PE, SE
number_Degenerate Int Number of degenerate basecalls within the consensus assembly CL, FASTA, ONT, PE, SE
number_N Int Number of fully ambiguous basecalls within the consensus assembly CL, FASTA, ONT, PE, SE
number_Total Int Total number of nucleotides within the consensus assembly CL, FASTA, ONT, PE, SE
pango_lineage String Pango lineage as determined by Pangolin CL, FASTA, ONT, PE, SE
pango_lineage_expanded String Pango lineage without use of aliases; e.g., "BA.1" → "B.1.1.529.1" CL, FASTA, ONT, PE, SE
pango_lineage_report File Full Pango lineage report generated by Pangolin CL, FASTA, ONT, PE, SE
pangolin_assignment_version String The version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment CL, FASTA, ONT, PE, SE
pangolin_conflicts String Number of lineage conflicts as determined by Pangolin CL, FASTA, ONT, PE, SE
pangolin_docker String Docker image used to run Pangolin CL, FASTA, ONT, PE, SE
pangolin_notes String Lineage notes as determined by Pangolin CL, FASTA, ONT, PE, SE
pangolin_versions String All Pangolin software and database versions CL, FASTA, ONT, PE, SE
percent_reference_coverage Float Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of the reference genome (SC2: 29903) x 100 CL, FASTA, ONT, PE, SE
primer_bed_name String Name of the primer bed files used for primer trimming CL, ONT, PE, SE
primer_trimmed_read_percent Float Percentage of read data with primers trimmed as determined by iVar trim PE, SE
qc_check String The results of the QC Check task CL, FASTA, ONT, PE, SE
qc_standard File The file used in the QC Check task containing the QC thresholds. CL, FASTA, ONT, PE, SE
quasitools_coverage_file File The coverage report created by Quasitools HyDRA ONT, PE
quasitools_date String Date of Quasitools analysis ONT, PE
quasitools_dr_report File Drug resistance report created by Quasitools HyDRA ONT, PE
quasitools_hydra_vcf File The VCF created by Quasitools HyDRA ONT, PE
quasitools_mutations_report File The mutation report created by Quasitools HyDRA ONT, PE
quasitools_version String Version of Quasitools used ONT, PE
read_screen_clean String A PASS or FAIL flag for input reads after cleaning ONT, PE, SE
read_screen_raw String A PASS or FAIL flag for input reads ONT, PE, SE
read1_aligned File Forward read file of only aligned reads CL, ONT, PE, SE
read1_clean File Forward read file after quality trimming and adapter removal PE, SE
read1_dehosted File Dehosted forward reads; suggested read file for SRA submission CL, ONT, PE
read1_trimmed File Forward read file after quality trimming and adapter removal ONT
read1_unaligned File Forward read file of unaligned reads PE, SE
read2_aligned File Reverse read file of only aligned reads PE
read2_clean File Reverse read file after quality trimming and adapter removal PE
read2_dehosted File Dehosted reverse reads; suggested read file for SRA submission PE
read2_unaligned File Reverse read file of unaligned reads PE
samtools_version String The version of SAMtools used to sort and index the alignment file ONT, PE, SE
samtools_version_consensus String The version of SAMtools used to create the pileup before running iVar consensus PE, SE
samtools_version_primtrim String The version of SAMtools used to create the pileup before running iVar trim PE, SE
samtools_version_stats String The version of SAMtools used to assess the quality of read mapping CL, PE, SE
sc2_s_gene_mean_coverage Float Mean read depth for the S gene in SARS-CoV-2 CL, ONT, PE, SE
sc2_s_gene_percent_coverage Float Percent coverage of the S gene in SARS-CoV-2 CL, ONT, PE, SE
seq_platform String Description of the sequencing methodology used to generate the input read data CL, FASTA, ONT, PE, SE
sorted_bam_unaligned File A BAM file that only contains reads that did not align to the reference PE, SE
sorted_bam_unaligned_bai File Index companion file to a BAM file that only contains reads that did not align to the reference PE, SE
theiacov_clearlabs_analysis_date String Date of analysis CL
theiacov_clearlabs_version String Version of PHB used for running the workflow CL
theiacov_fasta_analysis_date String Date of analysis FASTA
theiacov_fasta_version String Version of PHB used for running the workflow FASTA
theiacov_illumina_pe_analysis_date String Date of analysis PE
theiacov_illumina_pe_version String Version of PHB used for running the workflow PE
theiacov_illumina_se_analysis_date String Date of analysis SE
theiacov_illumina_se_version String Version of PHB used for running the workflow SE
theiacov_ont_analysis_date String Date of analysis ONT
theiacov_ont_version String Version of PHB used for running the workflow ONT
trimmomatic_docker String Docker container used with Trimmomatic PE, SE
trimmomatic_version String The version of Trimmomatic used PE, SE
vadr_alerts_list File A file containing all of the fatal alerts as determined by VADR CL, FASTA, ONT, PE, SE
vadr_all_outputs_tar_gz File A .tar.gz file (gzip-compressed tar archive file) containing all outputs from the VADR command v-annotate.pl. This file must be uncompressed & extracted to see the many files within. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for a more complete description of all files present within the archive. Useful when deeply investigating a sample's genome & annotations. CL, FASTA, ONT, PE, SE
vadr_classification_summary_file File Per-sequence tabular classification file. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#explanation-of-sqc-suffixed-output-files for a more complete description. CL, FASTA, ONT, PE, SE
vadr_docker String Docker image used to run VADR CL, FASTA, ONT, PE, SE
vadr_fastas_zip_archive File Zip archive containing all fasta files created during VADR analysis CL, FASTA, ONT, PE, SE
vadr_feature_tbl_fail File 5-column feature table output for failing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for a more complete description. CL, FASTA, ONT, PE, SE
vadr_feature_tbl_pass File 5-column feature table output for passing sequences. See https://github.com/ncbi/vadr/blob/master/documentation/formats.md#format-of-v-annotatepl-output-files for a more complete description. CL, FASTA, ONT, PE, SE
vadr_num_alerts String Number of fatal alerts as determined by VADR CL, FASTA, ONT, PE, SE
variants_from_ref_vcf File VCF file of the variants called relative to the reference genome CL
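
The `vadr_all_outputs_tar_gz` archive listed above has to be extracted before its contents can be inspected. The following is a minimal Python sketch, assuming the archive has already been downloaded locally. The function name `extract_vadr_archive` and any paths passed to it are hypothetical and are not part of the workflow.

```python
import tarfile
from pathlib import Path


def extract_vadr_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its sorted member names.

    archive_path and dest_dir are local paths chosen by the user (assumptions).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = sorted(tar.getnames())
        # Use the safer "data" extraction filter when this Python version has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return names
```

Calling `extract_vadr_archive("sample.vadr.tar.gz", "vadr_out")` (file names assumed) would unpack the archive into `vadr_out` and return the list of files it contained, which can then be matched against the VADR format documentation linked above.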
TheiaCoV_FASTA_Batch Outputs

Overwrite Warning

The TheiaCoV_FASTA_Batch_PHB workflow outputs results to the set-level data table and also overwrites the Pangolin & Nextclade output columns in the sample-level data table. To see exactly which columns were overwritten in the sample-level data table, open the set-level workflow output TSV file called "Datatable".

Variable Type Description
datatable File Sample-level data table TSV file that was used to update the original sample-level data table in the last step of the TheiaCoV_FASTA_Batch workflow.
nextclade_json File Output Nextclade JSON file that contains results for all samples included in the workflow
nextclade_tsv File Output Nextclade TSV file that contains results for all samples included in the workflow
pango_lineage_report File Output Pangolin CSV file that contains results for all samples included in the workflow
theiacov_fasta_batch_analysis_date String Date that the workflow was run.
theiacov_fasta_batch_version String Version of the workflow that was used.
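
To check which sample-level columns a run overwrote, the header row of the `datatable` TSV can be read directly. The following is a minimal Python sketch, assuming the file is tab-delimited with a single header row and has been downloaded locally. The function name `datatable_columns` is hypothetical.

```python
import csv


def datatable_columns(tsv_path):
    """Return the header columns of a tab-delimited data table file.

    Assumes the first row of the file holds the column names.
    Returns an empty list if the file is empty.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return next(reader, [])
```

The returned list can be compared with the columns of the existing sample-level data table to identify the Pangolin and Nextclade columns that the workflow replaced.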