TheiaEuk Workflow Series

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Genomic Characterization	Mycotics	v3.1.0	Some optional features incompatible, Yes	Sample-level

TheiaEuk Workflows

The TheiaEuk workflows are for the assembly, quality assessment, and characterization of fungal genomes. They are designed to accept Illumina paired-end sequencing data or base-called ONT reads as the primary input. They are currently intended only for haploid fungal genomes like Candidozyma auris. Analyzing diploid genomes using TheiaEuk should be attempted only with expert attention to the resulting genome quality.

All input reads are processed through "core tasks" in each workflow. The core tasks include raw read quality assessment, read cleaning (quality trimming and adapter removal), de novo assembly, assembly quality assessment, and species taxon identification. For some taxa identified, taxa-specific sub-workflows will be automatically activated, undertaking additional taxa-specific characterization steps, including clade-typing and/or antifungal resistance detection.

TheiaEuk_Illumina_PE | TheiaEuk_ONT

TheiaEuk Illumina PE Workflow Diagram

TheiaEuk ONT Workflow Diagram

Before running TheiaEuk

TheiaEuk_Illumina_PE relies on Snippy to perform variant calling on the cleaned read dataset and then queries the resulting file for specific mutations that are known to confer antifungal resistance (see the Organism-specific characterization section). This behaviour has been replicated in TheiaEuk_ONT, but there the variant calling is performed directly on the resulting assemblies rather than on the reads. The read support reported by TheiaEuk_ONT is therefore currently unreliable. Improvements to this module are planned.

Inputs

Input Read Data

TheiaEuk_Illumina_PE | TheiaEuk_ONT

The TheiaEuk_Illumina_PE workflow takes in Illumina paired-end read data. Read file names should end with .fastq or .fq, with the optional addition of .gz. When possible, Theiagen recommends zipping files with gzip before uploading to Terra to minimize data upload time.
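The naming rule above can be checked before upload with a small helper. This is a minimal sketch, not part of TheiaEuk: the function names and the example file names are hypothetical, and the check is a plain case-sensitive suffix match.

```python
# Accepted endings per the text above: .fastq or .fq, optionally followed by .gz.
ACCEPTED_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def has_accepted_read_name(file_name: str) -> bool:
    """Return True if the name ends with .fastq or .fq, with or without .gz."""
    return file_name.endswith(ACCEPTED_SUFFIXES)


def is_gzip_named(file_name: str) -> bool:
    """Return True if the name carries the .gz suffix (content is not inspected)."""
    return file_name.endswith(".gz")


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    for name in ["sample_R1.fastq.gz", "sample_R2.fq", "sample.fasta"]:
        print(name, has_accepted_read_name(name), is_gzip_named(name))
```

The helper only inspects names; it does not verify that a file is valid FASTQ or actually gzip-compressed.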

By default, the workflow anticipates 2 x 150bp reads (i.e. the input reads were generated using a 300-cycle sequencing kit). Modifications to the optional trim_min_length parameter may be required to accommodate shorter read data, such as the 2 x 75bp reads generated using a 150-cycle sequencing kit.
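The arithmetic behind this recommendation can be sketched as follows. The sketch assumes the kit's cycles are split evenly between the two reads and uses the default of 75 listed for the read_QC_trim trim_min_length input in the table below; the function names are hypothetical.

```python
# Default listed in the inputs table for read_QC_trim trim_min_length.
DEFAULT_TRIM_MIN_LENGTH = 75


def paired_read_length(kit_cycles: int) -> int:
    """Per-read length, assuming cycles are split evenly between read 1 and read 2."""
    return kit_cycles // 2


def trimming_headroom(read_length: int,
                      trim_min_length: int = DEFAULT_TRIM_MIN_LENGTH) -> int:
    """Number of bases that can be trimmed before a read drops below trim_min_length."""
    return max(read_length - trim_min_length, 0)


if __name__ == "__main__":
    for cycles in (300, 150):
        length = paired_read_length(cycles)
        print(f"{cycles}-cycle kit: 2 x {length}bp, "
              f"headroom {trimming_headroom(length)} bases")
```

With the default setting, a 150bp read can lose up to 75 bases and still be kept, whereas a 75bp read has no headroom: trimming even one base puts it below the minimum length.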

The TheiaEuk_ONT workflow takes in base-called ONT read data. Read file names should end with .fastq or .fq, with the optional addition of .gz. When possible, Theiagen recommends zipping files with gzip before uploading to Terra to minimize data upload time.

The ONT sequencing kit and base-calling approach can produce substantial variability in the amount and quality of read data. Genome assemblies produced by the TheiaEuk_ONT workflow must be quality assessed before reporting results.

TheiaEuk_Illumina_PE | TheiaEuk_ONT

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
theiaeuk_pe	read1	File	Illumina forward read file in FASTQ file format (compression optional)		Required
theiaeuk_pe	read2	File	Illumina reverse read file in FASTQ file format (compression optional)		Required
theiaeuk_pe	samplename	String	The name of the sample being analyzed		Required
busco	cpu	Int	Number of CPUs to allocate to the task	2	Optional
busco	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
busco	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/ezlabgva/busco:v5.3.2_cv1	Optional
cg_pipeline_clean	cg_pipe_opts	String	Options to pass to CG-Pipeline for clean read assessment	#NAME?	Optional
cg_pipeline_clean	cpu	Int	Number of CPUs to allocate to the task	4	Optional
cg_pipeline_clean	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
cg_pipeline_clean	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/lyveset:1.1.4f	Optional
cg_pipeline_clean	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
cg_pipeline_raw	cg_pipe_opts	String	Options to pass to CG-Pipeline for raw read assessment	#NAME?	Optional
cg_pipeline_raw	cpu	Int	Number of CPUs to allocate to the task	4	Optional
cg_pipeline_raw	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
cg_pipeline_raw	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/lyveset:1.1.4f	Optional
clean_check_reads	cpu	Int	Number of CPUs to allocate to the task	1	Optional
clean_check_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
clean_check_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2	Optional
clean_check_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
clean_check_reads	organism	String	Internal component, do not modify		Optional
clean_check_reads	workflow_series	String	Internal component, do not modify		Optional
digger_denovo	assembler_options	String	Assembler-specific options that you might choose for the selected assembler		Optional
digger_denovo	assembler	String	Assembler to use (spades, skesa, megahit)	skesa	Optional
digger_denovo	bwa_cpu	Int	Number of CPUs to allocate to the task	6	Optional
digger_denovo	bwa_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
digger_denovo	bwa_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan	Optional
digger_denovo	bwa_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
digger_denovo	call_pilon	Boolean	Whether to run Pilon polishing after assembly	FALSE	Optional
digger_denovo	filter_contigs_cpu	Int	Number of CPUs to allocate to the task	1	Optional
digger_denovo	filter_contigs_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
digger_denovo	filter_contigs_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/shovilter:0.2	Optional
digger_denovo	filter_contigs_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
digger_denovo	filter_contigs_min_coverage	Float	Minimum coverage threshold for contig filtering	2	Optional
digger_denovo	filter_contigs_skip_coverage_filter	Boolean	Skip filtering contigs based on coverage	FALSE	Optional
digger_denovo	filter_contigs_skip_homopolymer_filter	Boolean	Skip filtering contigs containing homopolymers	FALSE	Optional
digger_denovo	filter_contigs_skip_length_filter	Boolean	Skip filtering contigs based on length	FALSE	Optional
digger_denovo	kmers	String	K-mer sizes for assembly (comma-separated)		Optional
digger_denovo	megahit_cpu	Int	Number of CPUs to allocate to the task	4	Optional
digger_denovo	megahit_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
digger_denovo	megahit_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/megahit:1.2.9	Optional
digger_denovo	megahit_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
digger_denovo	pilon_cpu	Int	Number of CPUs to allocate to the task	8	Optional
digger_denovo	pilon_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
digger_denovo	pilon_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/pilon:1.24--hdfd78af_0	Optional
digger_denovo	pilon_fix	String	Potential issues with assembly to try and automatically fix (snps, indels, gaps, local, all, bases, none)	bases	Optional
digger_denovo	pilon_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
digger_denovo	pilon_min_base_quality	Int	Minimum base quality to keep	3	Optional
digger_denovo	pilon_min_depth	Float	Minimum coverage threshold for variant calling: when set to a value ≥1, it requires that absolute depth of coverage; when set to a fraction <1, it requires coverage at least that fraction of the mean coverage for the region	0.25	Optional
digger_denovo	pilon_min_mapping_quality	Int	Minimum mapping quality for a read to count in pileups	60	Optional
digger_denovo	run_filter_contigs	Boolean	Whether to run contig filtering step	TRUE	Optional
digger_denovo	skesa_cpu	Int	Number of CPUs to allocate to the task	4	Optional
digger_denovo	skesa_disk_size	Int	Disk space in GB for SKESA assembler	50	Optional
digger_denovo	skesa_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/skesa:2.4.0	Optional
digger_denovo	skesa_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
digger_denovo	spades_cpu	Int	Number of CPUs to allocate to the task	16	Optional
digger_denovo	spades_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
digger_denovo	spades_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/spades:4.1.0	Optional
digger_denovo	spades_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
digger_denovo	spades_type	String	SPAdes assembly mode (isolate, meta, rna, etc.); see the SPAdes documentation for the full list of modes	isolate	Optional
gambit	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
gambit	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/gambit:1.0.0	Optional
gambit	gambit_db_genomes	File	Database of metadata for assembled query genomes; requires complementary signatures file. If not provided, uses default database "/gambit-db"	gs://gambit-databases-rp/2.0.0/gambit-metadata-2.0.1-20250505.gdb	Optional
gambit	gambit_db_signatures	File	Signatures file; requires complementary genomes file. If not specified, the file from the docker container will be used.	gs://gambit-databases-rp/2.0.0/gambit-signatures-2.0.1-20250505.gs	Optional
gambit	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
merlin_magic	agrvate_docker_image	String	Internal component, do not modify		Optional
merlin_magic	amr_search_cpu	Int	Number of CPUs to allocate to the task	2	Optional
merlin_magic	amr_search_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
merlin_magic	amr_search_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/amrsearch:0.2.1	Optional
merlin_magic	amr_search_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
merlin_magic	assembly_only	Boolean	Internal component, do not modify		Optional
merlin_magic	call_poppunk	Boolean	Internal component, do not modify		Optional
merlin_magic	call_shigeifinder_reads_input	Boolean	Internal component, do not modify		Optional
merlin_magic	emmtypingtool_docker_image	String	Internal component, do not modify		Optional
merlin_magic	hicap_docker_image	String	Internal component, do not modify		Optional
merlin_magic	ont_data	Boolean	Internal component, do not modify		Optional
merlin_magic	paired_end	Boolean	Internal component, do not modify		Optional
merlin_magic	pasty_docker_image	String	Internal component, do not modify		Optional
merlin_magic	pasty_min_coverage	Int	Internal component, do not modify		Optional
merlin_magic	pasty_min_percent_identity	Int	Internal component, do not modify		Optional
merlin_magic	run_amr_search	Boolean	If true, the AMR_Search workflow is run when the identified species is a supported taxon; see the AMR_Search docs.	FALSE	Optional
merlin_magic	shigatyper_docker_image	String	Internal component, do not modify		Optional
merlin_magic	shigeifinder_docker_image	String	Internal component, do not modify		Optional
merlin_magic	snippy_query_gene	String	Provide a gene to search for using Snippy	Default depends on the detected organism	Optional
merlin_magic	srst2_gene_max_mismatch	Int	Internal component, do not modify		Optional
merlin_magic	srst2_max_divergence	Int	Internal component, do not modify		Optional
merlin_magic	srst2_min_cov	Int	Internal component, do not modify		Optional
merlin_magic	srst2_min_depth	Int	Internal component, do not modify		Optional
merlin_magic	srst2_min_edge_depth	Int	Internal component, do not modify		Optional
merlin_magic	staphopia_sccmec_docker_image	String	Internal component, do not modify		Optional
merlin_magic	tbp_parser_config	File	Internal component, do not modify		Optional
merlin_magic	tbp_parser_debug	Boolean	Internal component, do not modify		Optional
merlin_magic	tbp_parser_docker_image	String	Internal component, do not modify		Optional
merlin_magic	tbp_parser_min_depth	Int	Internal component, do not modify		Optional
merlin_magic	tbp_parser_min_percent_coverage	Float	Internal component, do not modify		Optional
merlin_magic	tbp_parser_operator	String	Internal component, do not modify		Optional
merlin_magic	tbp_parser_output_seq_method_type	String	Internal component, do not modify		Optional
merlin_magic	tbprofiler_custom_db	File	Internal component, do not modify		Optional
merlin_magic	tbprofiler_mapper	String	Internal component, do not modify		Optional
merlin_magic	tbprofiler_min_af	Float	Internal component, do not modify		Optional
merlin_magic	tbprofiler_min_depth	Int	Internal component, do not modify		Optional
merlin_magic	tbprofiler_run_cdph_db	Boolean	Internal component, do not modify		Optional
merlin_magic	tbprofiler_run_custom_db	Boolean	Internal component, do not modify		Optional
merlin_magic	tbprofiler_variant_caller	String	Internal component, do not modify		Optional
merlin_magic	tbprofiler_variant_calling_params	String	Internal component, do not modify		Optional
merlin_magic	virulencefinder_database	String	Internal component, do not modify		Optional
merlin_magic	virulencefinder_docker_image	String	Internal component, do not modify		Optional
merlin_magic	virulencefinder_min_percent_coverage	Float	Internal component, do not modify		Optional
merlin_magic	virulencefinder_min_percent_identity	Float	Internal component, do not modify		Optional
qc_check_task	ani_highest_percent	Float	Internal component, do not modify		Optional
qc_check_task	ani_highest_percent_bases_aligned	Float	Internal component, do not modify		Optional
qc_check_task	assembly_length_unambiguous	Int	Internal component, do not modify		Optional
qc_check_task	assembly_mean_coverage	Float	Internal component, do not modify		Optional
qc_check_task	cpu	Int	Number of CPUs to allocate to the task	4	Optional
qc_check_task	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
qc_check_task	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-03-16	Optional
qc_check_task	kraken_human	String	Internal component, do not modify		Optional
qc_check_task	kraken_human_dehosted	String	Internal component, do not modify		Optional
qc_check_task	kraken_sc2	String	Internal component, do not modify		Optional
qc_check_task	kraken_sc2_dehosted	String	Internal component, do not modify		Optional
qc_check_task	kraken_target_organism	Float	Internal component, do not modify		Optional
qc_check_task	kraken_target_organism_dehosted	Float	Internal component, do not modify		Optional
qc_check_task	meanbaseq_trim	String	Internal component, do not modify		Optional
qc_check_task	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
qc_check_task	midas_secondary_genus_abundance	Float	Internal component, do not modify		Optional
qc_check_task	midas_secondary_genus_coverage	Float	Internal component, do not modify		Optional
qc_check_task	number_Degenerate	Int	Internal component, do not modify		Optional
qc_check_task	number_N	Int	Internal component, do not modify		Optional
qc_check_task	percent_reference_coverage	Float	Internal component, do not modify		Optional
qc_check_task	sc2_s_gene_mean_coverage	Float	Internal component, do not modify		Optional
qc_check_task	sc2_s_gene_percent_coverage	Float	Internal component, do not modify		Optional
qc_check_task	vadr_num_alerts	String	Internal component, do not modify		Optional
quast	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
quast	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2	Optional
quast	min_contig_length	Int	Minimum length of contig for QUAST	500	Optional
rasusa_task	bases	String	Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored		Optional
rasusa_task	cpu	Int	Number of CPUs to allocate to the task	4	Optional
rasusa_task	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
rasusa_task	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0	Optional
rasusa_task	frac	Float	Explicitly define the fraction of reads to keep in the subsample; when used, genome size and coverage are ignored; acceptable inputs include whole numbers and decimals, e.g. 50.0 will leave 50% of the reads in the subsample		Optional
rasusa_task	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
rasusa_task	num	Int	Optional: explicitly define the number of reads in the subsample; when used, genome size and coverage are ignored; acceptable metric suffixes include: b, k, m, g, and t for base, kilo, mega, giga, and tera, respectively		Optional
rasusa_task	seed	Int	Use to assign a name to the "random seed" that is used by the subsampler; i.e. this allows the exact same subsample to be produced from the same input file/s in subsequent runs when providing the seed identifier; do not input values for random downsampling		Optional
raw_check_reads	cpu	Int	Number of CPUs to allocate to the task	2	Optional
raw_check_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
raw_check_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2	Optional
raw_check_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
raw_check_reads	organism	String	Internal component, do not modify		Optional
raw_check_reads	workflow_series	String	Internal component, do not modify		Optional
read_QC_trim	adapters	File	File with adapter sequences to be removed		Optional
read_QC_trim	bbduk_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
read_QC_trim	call_kraken	Boolean	True/False variable that determines if the Kraken2 task should be called; for non-TheiaCoV workflows, the `kraken_db` variable must be provided.	FALSE	Optional
read_QC_trim	call_midas	Boolean	Internal component, do not modify		Optional
read_QC_trim	fastp_args	String	Additional arguments to use with fastp	--detect_adapter_for_pe -g -5 20 -3 20	Optional
read_QC_trim	kraken_db	File	A kraken2 database to use with the kraken2 optional task. The file must be a .tar.gz kraken2 database.		Optional
read_QC_trim	kraken_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
read_QC_trim	midas_db	File	Internal component, do not modify		Optional
read_QC_trim	phix	File	A file containing the phix used during Illumina sequencing; used in the BBDuk task		Optional
read_QC_trim	read_processing	String	The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp"	trimmomatic	Optional
read_QC_trim	read_qc	String	The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc"	fastq_scan	Optional
read_QC_trim	target_organism	String	This string is searched for in the kraken2 outputs to extract the read percentage		Optional
read_QC_trim	trim_min_length	Int	Specifies minimum length of each read after trimming to be kept	75	Optional
read_QC_trim	trim_quality_min_score	Int	Specifies the average quality of bases in a sliding window to be kept	20	Optional
read_QC_trim	trim_window_size	Int	Specifies window size for trimming (the number of bases to average the quality across)	10	Optional
read_QC_trim	trimmomatic_args	String	Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data.	-phred33	Optional
read_QC_trim	workflow_series	String	Internal component, do not modify		Optional
theiaeuk_pe	busco_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
theiaeuk_pe	call_rasusa	Boolean	If true, RASUSA will subsample raw reads to a specified read depth (150X by default)	TRUE	Optional
theiaeuk_pe	gambit_db_genomes	File	User-provided database of assembled query genomes; requires complementary signatures file. If not provided, uses default database, "/gambit-db"	gs://gambit-databases-rp/fungal-version/1.0.0/gambit-fungal-metadata-1.0.0-20241213.gdb	Optional
theiaeuk_pe	gambit_db_signatures	File	User-provided signatures file; requires complementary genomes file. If not specified, the file from the docker container will be used.	gs://gambit-databases-rp/fungal-version/1.0.0/gambit-fungal-signatures-1.0.0-20241213.gs	Optional
theiaeuk_pe	genome_length	Int	User-specified expected genome length to be used in genome statistics calculations		Optional
theiaeuk_pe	max_genome_size	Int	Maximum genome size able to pass read screening	50000000	Optional
theiaeuk_pe	min_basepairs	Int	Minimum number of base pairs able to pass read screening	2241820	Optional
theiaeuk_pe	min_coverage	Int	Minimum genome coverage able to pass read screening	10	Optional
theiaeuk_pe	min_genome_size	Int	Minimum genome size able to pass read screening	100000	Optional
theiaeuk_pe	min_proportion	Int	Minimum proportion of total reads in each read file to pass read screening	50	Optional
theiaeuk_pe	min_reads	Int	Minimum number of reads to pass read screening	10000	Optional
theiaeuk_pe	skip_screen	Boolean	Option to skip the read screening prior to analysis; if setting to true, please provide a value for the theiaeuk_pe genome_length optional input, OR set call_rasusa to false. Otherwise RASUSA will attempt to downsample to an expected genome size of 0 bp, and the workflow will fail.	FALSE	Optional
theiaeuk_pe	subsample_coverage	Float	Read depth for RASUSA task to subsample reads to	150	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional

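The skip_screen row in the table above describes a dependency between three inputs: when skip_screen is true, either genome_length must be provided or call_rasusa must be false. The following is a minimal sketch of that rule as a pre-submission check. The function names are hypothetical, the defaults mirror the table, and the target-bases calculation assumes the subsample target is coverage multiplied by genome size.

```python
from typing import List, Optional


def skip_screen_problems(skip_screen: bool = False,
                         call_rasusa: bool = True,
                         genome_length: Optional[int] = None) -> List[str]:
    """Return a list of input problems; an empty list means the combination is fine."""
    problems = []
    if skip_screen and call_rasusa and genome_length is None:
        problems.append(
            "skip_screen is true: provide genome_length or set call_rasusa to false"
        )
    return problems


def subsample_target_bases(genome_length: int,
                           subsample_coverage: float = 150.0) -> float:
    """Bases to keep when subsampling, assumed to be coverage x genome size."""
    return genome_length * subsample_coverage


if __name__ == "__main__":
    print(skip_screen_problems(skip_screen=True))
    # 12,000,000 bp is a hypothetical genome length used only for illustration.
    print(skip_screen_problems(skip_screen=True, genome_length=12_000_000))
    print(subsample_target_bases(0))           # a 0 bp genome gives a 0-base target
    print(subsample_target_bases(12_000_000))
```

The zero-base target in the third call illustrates why the table warns that the workflow fails when screening is skipped without a genome length.
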
Terra Task Name	Variable	Type	Description	Default Value	Terra Status
theiaeuk_ont	read1	File	ONT read file in FASTQ file format (compression optional)		Required
theiaeuk_ont	samplename	String	The name of the sample being analyzed		Required
busco	cpu	Int	Number of CPUs to allocate to the task	2	Optional
busco	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
clean_check_reads	cpu	Int	Number of CPUs to allocate to the task	1	Optional
clean_check_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
clean_check_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2	Optional
clean_check_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
clean_check_reads	organism	String	Internal component, do not modify		Optional
clean_check_reads	workflow_series	String	Internal component, do not modify		Optional
flye_denovo	auto_medaka_model	Boolean	If true, medaka will automatically select the best Medaka model for assembly	TRUE	Optional
flye_denovo	bandage_cpu	Int	Number of CPUs to allocate to the task	2	Optional
flye_denovo	bandage_disk_size	Int	Amount of storage (in GB) to allocate to the task	10	Optional
flye_denovo	bandage_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
flye_denovo	dnaapler_cpu	Int	Number of CPUs to allocate to the task	1	Optional
flye_denovo	dnaapler_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
flye_denovo	dnaapler_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
flye_denovo	dnaapler_mode	String	Dnaapler-specific inputs	all	Optional
flye_denovo	filtercontigs_cpu	Int	Number of CPUs to allocate to the task	1	Optional
flye_denovo	filtercontigs_disk_size	Int	Amount of storage (in GB) to allocate to the task	10	Optional
flye_denovo	filtercontigs_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
flye_denovo	filtercontigs_min_length	Int	Minimum contig length to keep	1000	Optional
flye_denovo	flye_additional_parameters	String	Any extra Flye-specific parameters		Optional
flye_denovo	flye_asm_coverage	Int	Reduced coverage for initial disjointig assembly		Optional
flye_denovo	flye_cpu	Int	Number of CPUs to allocate to the task	4	Optional
flye_denovo	flye_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
flye_denovo	flye_genome_length	Int	User-specified expected genome length to be used in genome statistics calculations		Optional
flye_denovo	flye_keep_haplotypes	Boolean	If true keep haplotypes	FALSE	Optional
flye_denovo	flye_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
flye_denovo	flye_minimum_overlap	Int	Minimum overlap between reads		Optional
flye_denovo	flye_no_alt_contigs	Boolean	If true, do not generate alternative contigs	FALSE	Optional
flye_denovo	flye_polishing_iterations	Int	Default polishing iterations	1	Optional
flye_denovo	flye_read_error_rate	Float	Maximum expected read error rate		Optional
flye_denovo	flye_read_type	String	Specifies the type of sequencing reads. Options: --nano-raw (default), --nano-corr, --nano-hq, --pacbio-raw, --pacbio-corr, --pacbio-hifi. Refer to Flye documentation for details on each type.	--nano-raw	Optional
flye_denovo	flye_scaffold	Boolean	If true, scaffold	FALSE	Optional
flye_denovo	flye_uneven_coverage_mode	Boolean	If true, runs Flye in its uneven coverage mode	FALSE	Optional
flye_denovo	illumina_read1	File	If Illumina reads are provided, flye_denovo subworkflow will perform Illumina polishing		Optional
flye_denovo	illumina_read2	File	If Illumina reads are provided, flye_denovo subworkflow will perform Illumina polishing		Optional
flye_denovo	medaka_cpu	Int	Number of CPUs to allocate to the task	4	Optional
flye_denovo	medaka_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
flye_denovo	medaka_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
flye_denovo	medaka_model	String	In order to obtain the best results, the appropriate model must be set to match the sequencer's basecaller model; this string takes the format of {pore}{device}{caller variant}_{caller_version}. See also https://github.com/nanoporetech/medaka?tab=readme-ov-file#models. If this is being run on legacy data it is likely to be r941_min_hac_g507.	r1041_e82_400bps_sup_v5.0.0	Optional
flye_denovo	polisher	String	The polishing tool to use for assembly	medaka	Optional
flye_denovo	polishing_rounds	Int	The number of polishing rounds to conduct for medaka or racon (without Illumina)	1	Optional
flye_denovo	polypolish_careful	Boolean	Polypolish-specific inputs	FALSE	Optional
flye_denovo	polypolish_cpu	Int	Polypolish cpu	1	Optional
flye_denovo	polypolish_disk_size	Int	Polypolish disk size	100	Optional
flye_denovo	polypolish_fraction_invalid	Float	Polypolish-specific inputs		Optional
flye_denovo	polypolish_fraction_valid	Float	Polypolish-specific inputs		Optional
flye_denovo	polypolish_high_percentile_threshold	Float	Polypolish-specific inputs		Optional
flye_denovo	polypolish_low_percentile_threshold	Float	Polypolish-specific inputs		Optional
flye_denovo	polypolish_maximum_errors	Int	Polypolish-specific inputs		Optional
flye_denovo	polypolish_memory	Int	Polypolish memory	8	Optional
flye_denovo	polypolish_minimum_depth	Int	Polypolish-specific inputs		Optional
flye_denovo	polypolish_pair_orientation	String	Polypolish-specific inputs		Optional
flye_denovo	porechop_cpu	Int	Number of CPUs to allocate to the task	4	Optional
flye_denovo	porechop_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
flye_denovo	porechop_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
flye_denovo	porechop_trimopts	String	Options to pass to Porechop for trimming		Optional
flye_denovo	racon_cpu	Int	Number of CPUs to allocate to the task	8	Optional
flye_denovo	racon_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
flye_denovo	racon_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
flye_denovo	read1	File	ONT read file in FASTQ file format (compression optional)		Optional
flye_denovo	run_porechop	Boolean	If true, trims reads before assembly using Porechop	FALSE	Optional
flye_denovo	skip_polishing	Boolean	If true, skips polishing	FALSE	Optional
gambit	cpu	Int	Number of CPUs to allocate to the task	8	Optional
gambit	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
gambit	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/gambit:1.0.0	Optional
gambit	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
merlin_magic	agrvate_docker_image	String	Internal component, do not modify		Optional
merlin_magic	agrvate_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/agrvate:1.0.2--hdfd78af_0	Optional
merlin_magic	amr_search_cpu	Int	Number of CPUs to allocate to the task	2	Optional
merlin_magic	amr_search_disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
merlin_magic	amr_search_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/amrsearch:0.2.1	Optional
merlin_magic	amr_search_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
merlin_magic	assembly_only	Boolean	Internal component, do not modify		Optional
merlin_magic	call_poppunk	Boolean	Internal component, do not modify		Optional
merlin_magic	call_shigeifinder_reads_input	Boolean	Internal component, do not modify		Optional
merlin_magic	cladetyper_kmer_size	Int	Kmer size for cladetyper		Optional
merlin_magic	cladetyper_ref_clade1	File	Provide reference for clade1		Optional
merlin_magic	cladetyper_ref_clade1_annotated	File	Provide annotated reference for clade1		Optional
merlin_magic	cladetyper_ref_clade2	File	Provide reference for clade2		Optional
merlin_magic	cladetyper_ref_clade2_annotated	File	Provide annotated reference for clade2		Optional
merlin_magic	cladetyper_ref_clade3	File	Provide reference for clade3		Optional
merlin_magic	cladetyper_ref_clade3_annotated	File	Provide annotated reference for clade3		Optional
merlin_magic	cladetyper_ref_clade4_annotated	File	Provide annotated reference for clade4		Optional
merlin_magic	cladetyper_ref_clade5	File	Provide reference for clade5		Optional
merlin_magic	cladetyper_ref_clade5_annotated	File	Provide annotated reference for clade5		Optional
merlin_magic	clockwork_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/cdcgov/varpipe_wgs_with_refs:2bc7234074bd53d9e92a1048b0485763cd9bbf6f4d12d5a1cc82bfec8ca7d75e	Optional
merlin_magic	emmtypingtool_docker_image	String	Internal component, do not modify		Optional
merlin_magic	hicap_docker_image	String	Internal component, do not modify		Optional
merlin_magic	ont_data	Boolean	Internal component, do not modify		Optional
merlin_magic	paired_end	Boolean	Internal component, do not modify		Optional
merlin_magic	pasty_docker_image	String	Internal component, do not modify		Optional
merlin_magic	pasty_min_coverage	Int	Internal component, do not modify		Optional
merlin_magic	pasty_min_percent_identity	Int	Internal component, do not modify		Optional
merlin_magic	run_amr_search	Boolean	If set to true AMR_Search workflow will be run if species is part of supported taxon, see AMR_Search docs.	FALSE	Optional
merlin_magic	snippy_base_quality	Int	Internal component, do not modify		Optional
merlin_magic	snippy_gene_query_docker_image	String	Internal component, do not modify		Optional
merlin_magic	snippy_map_qual	Int	Internal component, do not modify		Optional
merlin_magic	snippy_maxsoft	Int	Internal component, do not modify		Optional
merlin_magic	snippy_min_coverage	Int	Internal component, do not modify		Optional
merlin_magic	snippy_min_frac	Float	Internal component, do not modify		Optional
merlin_magic	snippy_min_quality	Int	Internal component, do not modify		Optional
merlin_magic	snippy_query_gene	String	Internal component, do not modify		Optional
merlin_magic	snippy_query_gene	String	Provide a gene to search for using Snippy	Default depends on detected organism	Optional
merlin_magic	snippy_reference_afumigatus	File	*Provide an empty file if running TheiaProk on the command-line		Optional
merlin_magic	snippy_reference_calbicans	File	*Provide an empty file if running TheiaProk on the command-line		Optional
merlin_magic	snippy_reference_cryptoneo	File	*Provide an empty file if running TheiaProk on the command-line		Optional
merlin_magic	snippy_variants_docker_image	String	Internal component, do not modify		Optional
merlin_magic	srst2_gene_max_mismatch	Int	Internal component, do not modify		Optional
merlin_magic	srst2_max_divergence	Int	Internal component, do not modify		Optional
merlin_magic	srst2_min_cov	Int	Internal component, do not modify		Optional
merlin_magic	srst2_min_depth	Int	Internal component, do not modify		Optional
merlin_magic	srst2_min_edge_depth	Int	Internal component, do not modify		Optional
merlin_magic	staphopia_sccmec_docker_image	String	Internal component, do not modify		Optional
merlin_magic	tbp_parser_config	File	Internal component, do not modify		Optional
merlin_magic	tbp_parser_debug	Boolean	Internal component, do not modify		Optional
merlin_magic	tbp_parser_docker_image	String	Internal component, do not modify		Optional
merlin_magic	tbp_parser_min_depth	Int	Internal component, do not modify		Optional
merlin_magic	tbp_parser_min_percent_coverage	Float	Internal component, do not modify		Optional
merlin_magic	tbp_parser_operator	String	Internal component, do not modify		Optional
merlin_magic	tbp_parser_output_seq_method_type	String	Internal component, do not modify		Optional
merlin_magic	tbprofiler_custom_db	File	Internal component, do not modify		Optional
merlin_magic	tbprofiler_mapper	String	Internal component, do not modify		Optional
merlin_magic	tbprofiler_min_af	Float	Internal component, do not modify		Optional
merlin_magic	tbprofiler_min_depth	Int	Internal component, do not modify		Optional
merlin_magic	tbprofiler_run_cdph_db	Boolean	Internal component, do not modify		Optional
merlin_magic	tbprofiler_run_custom_db	Boolean	Internal component, do not modify		Optional
merlin_magic	tbprofiler_variant_caller	String	Internal component, do not modify		Optional
merlin_magic	tbprofiler_variant_calling_params	String	Internal component, do not modify		Optional
merlin_magic	virulencefinder_min_percent_coverage	Float	Internal component, do not modify		Optional
merlin_magic	virulencefinder_min_percent_identity	Float	Internal component, do not modify		Optional
nanoplot_clean	cpu	Int	Number of CPUs to allocate to the task	4	Optional
nanoplot_clean	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
nanoplot_clean	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0	Optional
nanoplot_clean	max_length	Int	The maximum length of clean reads, for which reads longer than the length specified will be hidden.	100000	Optional
nanoplot_clean	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
nanoplot_raw	cpu	Int	Number of CPUs to allocate to the task	4	Optional
nanoplot_raw	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
nanoplot_raw	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0	Optional
nanoplot_raw	max_length	Int	The maximum length of raw reads, for which reads longer than the length specified will be hidden.	100000	Optional
nanoplot_raw	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
quast	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
quast	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2	Optional
quast	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
quast	min_contig_length	Int	Minimum length of contig for QUAST	500	Optional
read_QC_trim	artic_guppyplex_cpu	Int	Internal component, do not modify		Optional
read_QC_trim	artic_guppyplex_disk_size	Int	Internal component, do not modify		Optional
read_QC_trim	max_length	Int	Internal component, do not modify		Optional
read_QC_trim	min_length	Int	Internal component, do not modify		Optional
read_QC_trim	nanoq_cpu	Int	Number of CPUs to allocate to the task	2	Optional
read_QC_trim	nanoq_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
read_QC_trim	nanoq_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/nanoq:0.9.0--hec16e2b_1	Optional
read_QC_trim	nanoq_max_read_length	Int	The maximum read length to keep after trimming	100000	Optional
read_QC_trim	nanoq_max_read_qual	Int	The maximum read quality to keep after trimming	40	Optional
read_QC_trim	nanoq_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
read_QC_trim	nanoq_min_read_length	Int	The minimum read length to keep after trimming	500	Optional
read_QC_trim	nanoq_min_read_qual	Int	The minimum read quality to keep after trimming	10	Optional
read_QC_trim	rasusa_bases	String	Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored		Optional
read_QC_trim	rasusa_cpu	Int	Number of CPUs to allocate to the task	4	Optional
read_QC_trim	rasusa_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
read_QC_trim	rasusa_docker	String	Internal component, do not modify		Optional
read_QC_trim	rasusa_fraction_of_reads	Float	Subsample to a fraction of the reads - e.g., 0.5 samples half the reads		Optional
read_QC_trim	rasusa_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
read_QC_trim	rasusa_number_of_reads	Int	Subsample to a specific number of reads		Optional
read_QC_trim	rasusa_seed	Int	Random seed to use		Optional
theiaeuk_ont	busco_docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/ezlabgva/busco:v5.3.2_cv1	Optional
theiaeuk_ont	busco_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	24	Optional
theiaeuk_ont	gambit_db_genomes	File	User-provided database of assembled query genomes; requires complementary signatures file. If not provided, uses default database, "/gambit-db"	gs://gambit-databases-rp/fungal-version/1.0.0/gambit-fungal-metadata-1.0.0-20241213.gdb	Optional
theiaeuk_ont	gambit_db_signatures	File	User-provided signatures file; requires complementary genomes file. If not specified, the file from the docker container will be used.	gs://gambit-databases-rp/fungal-version/1.0.0/gambit-fungal-signatures-1.0.0-20241213.gs	Optional
theiaeuk_ont	genome_length	Int	User-specified expected genome length to be used in genome statistics calculations		Optional
theiaeuk_ont	min_basepairs	Int	Minimum number of base pairs able to pass read screening	45000000	Optional
theiaeuk_ont	min_coverage	Int	Minimum genome coverage able to pass read screening	5	Optional
theiaeuk_ont	min_genome_size	Int	Minimum genome size able to pass read screening	9000000	Optional
theiaeuk_ont	min_reads	Int	Minimum number of reads to pass read screening	5000	Optional
theiaeuk_ont	skip_screen	Boolean	Option to skip the read screening prior to analysis; if setting to true, please provide a value for the theiaeuk_pe genome_length optional input, OR set call_rasusa to false. Otherwise RASUSA will attempt to downsample to an expected genome size of 0 bp, and the workflow will fail.	FALSE	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
workflow name	skip_mash	Boolean	If true, skips estimation of genome size and coverage using mash in read screening steps. As a result, providing true also prevents screening using these parameters.	TRUE	Optional

Workflow Tasks¶

All input reads are processed through "core tasks" in the TheiaEuk workflows. These undertake read trimming and assembly appropriate to the input data type (Illumina paired-end or ONT reads). The TheiaEuk workflows subsequently launch default genome characterization modules for quality assessment, and additional taxa-specific characterization steps. When setting up the workflow, users may choose to use "optional tasks" or alternatives to tasks run in the workflow by default.

Core tasks¶

These tasks are performed regardless of organism. Some are common to all input data types, while others are specific to the input data type; together they perform read trimming and assembly appropriate to the input data.

versioning: Version Capture

The versioning task captures the workflow version from the GitHub (code repository) version.

Version Capture Technical details

	Links
Task	task_versioning.wdl

TheiaEuk_Illumina_PETheiaEuk_ONT

screen: Total Raw Read Quantification and Genome Size Estimation

The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:

Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
The proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion basepairs are in either the read1 or read2 files.
Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than the min_coverage.

Read screening is undertaken on both the raw and cleaned reads. The task may be skipped by setting the skip_screen variable to true.

Default values vary between the PE, SE, and ONT workflows. The rationale for these default values can be found below. If two default values are shown, the first is for Illumina workflows and the second is for ONT.

Variable	Default Value	Rationale
skip_screen	false	Set to true to skip the read screen from running. If you set this value to true, please provide a value for the theiaeuk_illumina_pe genome_length optional input, OR set the theiaeuk_illumina_pe call_rasusa optional input to false. Otherwise RASUSA will attempt to downsample to an expected genome size of 0 bp, and the workflow will fail.
min_reads	3000	Calculated from the minimum number of base pairs required for 20x coverage of the Hansenula polymorpha genome, the smallest fungal genome as of 2015-04-02 (8.97 Mbp), divided by 300 (the longest Illumina read length)
min_basepairs	45000000	Should be greater than 10x coverage of Hansenula polymorpha, the smallest fungal genome as of 2015-04-02 (8.97 Mbp)
min_genome_length	9000000	Based on the Hansenula polymorpha genome, the smallest fungal genome as of 2015-04-02 (8.97 Mbp)
max_genome_length	178000000	Based on the Cenococcum geophilum genome, the largest pathogenic fungal genome (177.57 Mbp), plus an additional 2 Mbp to cater for potential extra genomic material
min_coverage	10	A bare-minimum average per base coverage across the genome required for genome characterization. Higher coverage would be required for high-quality phylogenetics.
min_proportion	40	Neither read1 nor read2 files should have less than 40% of the total number of reads. For paired-end data only.
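The pass/fail logic described above can be illustrated with a minimal Python sketch. This is not the code in task_screen.wdl: the function name `screen_reads`, the `ScreenThresholds` class, and the returned list of failure labels are hypothetical, and the proportion check is assumed here to be computed on basepairs per file. The default thresholds are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenThresholds:
    # Defaults copied from the table above (hypothetical container class).
    min_reads: int = 3000
    min_basepairs: int = 45_000_000
    min_genome_length: int = 9_000_000
    max_genome_length: int = 178_000_000
    min_coverage: float = 10
    min_proportion: float = 40  # percent


def screen_reads(read1_bp, read2_bp, total_reads, est_genome_length,
                 est_coverage, thresholds=ScreenThresholds()):
    """Run every check, even after a failure, and return the failed criteria."""
    t = thresholds
    failures = []
    total_bp = read1_bp + read2_bp

    # A sample fails if its read count is less than or equal to min_reads.
    if total_reads <= t.min_reads:
        failures.append("min_reads")

    # Assumption: the proportion is the share of basepairs in the smaller file.
    smaller_share = 100 * min(read1_bp, read2_bp) / total_bp if total_bp else 0.0
    if smaller_share < t.min_proportion:
        failures.append("min_proportion")

    if total_bp < t.min_basepairs:
        failures.append("min_basepairs")
    if est_genome_length < t.min_genome_length:
        failures.append("min_genome_length")
    if est_genome_length > t.max_genome_length:
        failures.append("max_genome_length")
    if est_coverage < t.min_coverage:
        failures.append("min_coverage")
    return failures


good_sample = screen_reads(150_000_000, 150_000_000, 1_000_000, 12_000_000, 25)
bad_sample = screen_reads(9_000_000, 1_000_000, 3000, 5_000_000, 2)
```

An empty list means the sample passes; any entry in the list corresponds to a threshold that would stop the workflow after the screen task.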

Screen Technical Details

There is a single WDL task for read screening. The screen task is run twice, once for raw reads and once for clean reads.

	Links
Task	task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task)

Rasusa: Read subsampling (optional, on by default)

Non-deterministic output(s)

This task may yield non-deterministic outputs.

Rasusa Technical Details

	Links
Task	task_rasusa.wdl
Software Source Code	Rasusa on GitHub
Software Documentation	Rasusa on GitHub
Original Publication(s)	Rasusa: Randomly subsample sequencing reads to a specified coverage

read_QC_trim: Read Quality Trimming, Adapter Removal, Quantification, and Identification

read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.

Read quality trimming

Either trimmomatic or fastp can be used for read-quality trimming. Trimmomatic is used by default. Both tools trim low-quality regions of reads with a sliding window (with a window size of trim_window_size), cutting once the average quality within the window falls below trim_quality_trim_score. They will both discard the read if it is trimmed below trim_minlen.
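The sliding-window idea can be sketched in a few lines of Python. This is a simplified illustration, not the Trimmomatic or fastp implementation: the function name `sliding_window_trim` is hypothetical, the read is represented only by its list of Phred scores, and the real tools differ in details such as where exactly the cut is placed within the failing window. The three arguments correspond to trim_window_size, trim_quality_trim_score, and trim_minlen.

```python
def sliding_window_trim(quals, window_size, min_avg_quality, min_len):
    """Return how many leading bases to keep, or 0 if the read is discarded."""
    keep = len(quals)
    # Slide the window from the 5' end; reads shorter than one window are not cut.
    for start in range(max(len(quals) - window_size + 1, 0)):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            keep = start  # cut at the start of the first failing window
            break
    # Reads trimmed below the minimum length are dropped entirely.
    return keep if keep >= min_len else 0


# Ten good bases followed by ten poor ones.
example_quals = [30] * 10 + [5] * 10
kept_bases = sliding_window_trim(example_quals, window_size=4,
                                 min_avg_quality=20, min_len=5)
```

In this example the first window whose mean quality drops below 20 starts at position 8, so eight bases are kept.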

read_processing input parameter

This input parameter accepts either trimmomatic or fastp as an input to determine which tool should be used for read quality trimming. This is set to trimmomatic by default.

If the fastp option is selected, see below for table of default parameters.

fastp default read-trimming parameters

Parameter	Explanation
-g	enables polyG tail trimming
-5 20	enables read end-trimming
-3 20	enables read end-trimming
--detect_adapter_for_pe	enables adapter-trimming only for paired-end reads

Additional arguments can be passed using the fastp_args optional parameter.

Trimmomatic and fastp Technical Details

	Links
Task	task_trimmomatic.wdl task_fastp.wdl
Software Source Code	Trimmomatic fastp on GitHub
Software Documentation	Trimmomatic fastp
Original Publication(s)	Trimmomatic: a flexible trimmer for Illumina sequence data fastp: an ultra-fast all-in-one FASTQ preprocessor

Adapter removal

The BBDuk task removes adapters from sequence reads. To do this:

Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files.
BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.

What are adapters and why do they need to be removed?

Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.

BBDuk Technical Details

	Links
Task	task_bbduk.wdl
Software Source Code	BBTools
Software Documentation	BBDuk

Read Quantification

There are two methods for read quantification to choose from: fastq-scan (default) or fastqc. Both quantify the forward and reverse reads in FASTQ files. For paired-end data, they also provide the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads. fastqc also provides a graphical visualization of the read quality.
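As a rough illustration of what read quantification involves, the sketch below counts reads and bases in an uncompressed FASTQ stream. It is not how fastq-scan or fastqc are implemented; the function name `count_fastq` is hypothetical and the input is assumed to be well-formed four-line FASTQ records.

```python
import io


def count_fastq(handle):
    """Count reads and bases in a FASTQ stream with four lines per record."""
    reads = 0
    bases = 0
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # second line of each record is the sequence
            reads += 1
            bases += len(line.strip())
    return reads, bases


# Tiny in-memory examples standing in for raw and clean read files.
raw = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
clean = io.StringIO("@r1\nACGT\n+\nIIII\n")
raw_reads, raw_bases = count_fastq(raw)
clean_reads, clean_bases = count_fastq(clean)
```

Comparing the two results mirrors the expectation stated above: the clean read set should contain no more reads than the raw one.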

read_qc input parameter

This input parameter accepts either "fastq_scan" or "fastqc" as an input to determine which tool should be used for read quantification. This is set to "fastq_scan" by default.

fastq-scan and FastQC Technical Details

	Links
Task	task_fastq_scan.wdl task_fastqc.wdl
Software Source Code	fastq-scan on GitHub fastqc on GitHub
Software Documentation	fastq-scan fastqc

qc_check: Check QC Metrics Against User-Defined Thresholds (optional)

The qc_check task compares generated QC metrics against user-defined thresholds for each metric. This task will run if the user provides a qc_check_table TSV file. If all QC metrics meet the threshold, the qc_check output variable will read QC_PASS. Otherwise, the output will read QC_NA if the task could not proceed or QC_ALERT followed by a string indicating what metric failed.

The qc_check task applies quality thresholds according to the sample taxa. The sample taxa is taken from the gambit_predicted_taxon value inferred by the GAMBIT module OR can be manually provided by the user using the expected_taxon workflow input.

Formatting the qc_check_table.tsv

The first column of the qc_check_table lists the organism that the task will assess and the header of this column must be "taxon".
Any genus or species can be included as a row of the qc_check_table. However, these taxa must uniquely match the sample taxa, meaning that the file can include multiple species from the same genus (Vibrio_cholerae and Vibrio_vulnificus), but not both a genus row and a species within that genus (Vibrio and Vibrio_cholerae). The taxa should be formatted with the first letter capitalized and underscores in lieu of spaces.
Each subsequent column indicates a QC metric and lists a threshold for each organism that will be checked. The column names must exactly match expected values, so we highly recommend copying and pasting the header from the template file below as a starting place.
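The unique-match requirement can be illustrated with a short Python sketch. This is an assumption about the matching behaviour, not the code in task_qc_check_phb.wdl: the function name `find_taxon_row` and the metric column `example_metric_min` are hypothetical, and a row is assumed to match when it equals the sample taxon or its genus.

```python
import csv
import io


def find_taxon_row(table_text, sample_taxon):
    """Return the single matching qc_check_table row, or None if not unique."""
    sample = sample_taxon.strip().replace(" ", "_")
    rows = list(csv.DictReader(io.StringIO(table_text), delimiter="\t"))
    matches = [
        row for row in rows
        if sample == row["taxon"] or sample.startswith(row["taxon"] + "_")
    ]
    # Zero matches or several matches both mean the check cannot proceed.
    return matches[0] if len(matches) == 1 else None


# Hypothetical tables; real column names come from the template file.
valid_table = "taxon\texample_metric_min\nCandidozyma_auris\t40\nCryptococcus\t30\n"
ambiguous_table = "taxon\texample_metric_min\nVibrio\t40\nVibrio_cholerae\t50\n"

species_row = find_taxon_row(valid_table, "Candidozyma auris")
genus_row = find_taxon_row(valid_table, "Cryptococcus neoformans")
ambiguous = find_taxon_row(ambiguous_table, "Vibrio cholerae")
```

The last call returns None because both the genus row and the species row match the sample, which is the situation the formatting rules above are designed to prevent.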

Template qc_check_table.tsv files

TheiaEuk_Illumina_PE_PHB: theiaeuk_qc_check_template.tsv

Example Purposes Only

The QC threshold values shown in the file above are for example purposes only and should not be presumed to be sufficient for every dataset.

qc_check Technical Details

	Links
Task	task_qc_check_phb.wdl

These tasks assemble the reads into a de novo assembly and assess the quality of the assembly.

digger_denovo: De novo Assembly

De novo assembly will be undertaken only for samples that have sufficient read quantity and quality, as determined by the screen task assessment of clean reads.

In this workflow, assembly is performed using the digger_denovo sub-workflow, which is a hat tip to the Shovill pipeline. It undertakes the assembly with one of three assemblers, SKESA (default), SPAdes, or Megahit, and also performs a number of post-processing steps for assembly polishing and contig filtering. Pilon can optionally be run if call_pilon is set to true. By default, the contig filtering task is set to run, which will remove any homopolymers, contigs below a specified length, and contigs with coverage below a specified minimum coverage. This can be turned off by setting run_filter_contigs to false.

What is de novo assembly?

De novo assembly is the process or product of attempting to reconstruct a genome from scratch (without prior knowledge of the genome) using sequence reads. Assembly of fungal genomes from short-reads will produce multiple contigs per chromosome rather than a single contiguous sequence for each chromosome.

Digger-Denovo Technical Details

	Links
SubWorkflow File	wf_digger_denovo.wdl

quast: Assembly Quality Assessment

QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
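To make the metrics concrete, the sketch below computes the number of contigs, the largest contig, and N50 from a list of contig lengths. It is an illustration only, not QUAST's code; the function name `assembly_metrics` is hypothetical, and the `min_contig_length` argument mirrors the min_contig_length input (default 500) by ignoring shorter contigs.

```python
def assembly_metrics(contig_lengths, min_contig_length=500):
    """Return (number of contigs, largest contig, N50) for contigs kept."""
    kept = sorted(
        (length for length in contig_lengths if length >= min_contig_length),
        reverse=True,
    )
    if not kept:
        return 0, 0, 0
    total = sum(kept)
    running = 0
    n50 = 0
    # N50 is the length of the contig at which the cumulative length,
    # counted from the longest contig down, reaches half of the assembly.
    for length in kept:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return len(kept), kept[0], n50


metrics = assembly_metrics([2000, 1500, 1000, 800, 300])
```

Here the 300 bp contig is ignored, leaving four contigs totalling 5,300 bp; the two longest contigs together exceed half of that, so N50 is 1,500.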

QUAST Technical Details

	Links
Task	task_quast.wdl
Software Source Code	QUAST on GitHub
Software Documentation	https://quast.sourceforge.net/
Original Publication(s)	QUAST: quality assessment tool for genome assemblies

CG-Pipeline: Assessment of Read Quality, and Estimation of Genome Coverage

The cg_pipeline task generates metrics about read quality and estimates the coverage of the genome using the run_assembly_readMetrics.pl script from CG-Pipeline. The genome coverage estimates are calculated using both raw and cleaned reads, using either a user-provided genome_size or the estimated genome length generated by QUAST.

CG-Pipeline Technical Details

The cg_pipeline task is run twice in this workflow, once with raw reads, and once with clean reads.

	Links
Task	task_cg_pipeline.wdl
Software Source Code	CG-Pipeline on GitHub
Software Documentation	CG-Pipeline on GitHub
Original Publication(s)	A computational genomics pipeline for prokaryotic sequencing projects

read_QC_trim_ont: Read Quality Trimming, Quantification, and Identification

read_QC_trim_ont is a sub-workflow that filters low-quality reads and trims low-quality regions of reads. It uses several tasks, described below.

Read Identification with Kraken2 (optional)

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

Database-dependent

This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.

As an alternative to MIDAS (see above), the Kraken2 task can also be turned on by setting the call_kraken input variable to true, in order to identify reads and detect contamination with non-target taxa.

A database must be provided if this optional module is activated, through the kraken_db optional input. A list of suggested databases can be found on Kraken2 standalone documentation.

Kraken2 Technical Details

	Links
Task	task_kraken2.wdl
Software Source Code	Kraken2 on GitHub
Software Documentation	https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
Original Publication(s)	Improved metagenomic analysis with Kraken 2

A note on estimated genome length

By default, an estimated genome length is set to 5 Mb, which is around 0.7 Mb higher than the average bacterial genome length, according to the information collated here. This estimate can be overwritten by the user, and is used by RASUSA.

nanoplot: Plotting and quantifying long-read sequencing data

Nanoplot is used for the determination of mean quality scores, read lengths, and number of reads. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads.

Read subsampling

Samples are automatically randomly subsampled to 150X coverage using RASUSA.
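The arithmetic behind coverage-based subsampling is simple: the number of bases to keep is the target coverage multiplied by the expected genome length. The sketch below illustrates this; it is not Rasusa's implementation, and the function name `subsample_fraction` is hypothetical.

```python
def subsample_fraction(total_bases, genome_length, target_coverage=150):
    """Fraction of sequenced bases to keep to reach the target coverage."""
    if genome_length <= 0:
        # A genome length of 0 gives a target of 0 bases, which is not usable.
        raise ValueError("genome_length must be a positive number of bases")
    if total_bases <= 0:
        return 1.0
    target_bases = target_coverage * genome_length
    # Samples already below the target coverage are kept in full.
    return min(1.0, target_bases / total_bases)


# A 12 Mbp genome sequenced to 300X is halved to reach 150X.
fraction = subsample_fraction(total_bases=3_600_000_000, genome_length=12_000_000)
```

The explicit error for a non-positive genome length reflects why an expected genome length should be supplied whenever subsampling is enabled.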

Plasmid prediction

Plasmids are identified using replicon sequences used for typing from PlasmidFinder.

Read filtering

Reads are filtered by length and quality using nanoq. By default, reads shorter than 500 basepairs or with a quality score lower than 10 are filtered out to improve assembly accuracy.
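A minimal sketch of this kind of filter is shown below. It is not nanoq's code: the function names are hypothetical, and the mean read quality is assumed here to be derived by averaging per-base error probabilities rather than averaging Phred scores directly, which is a common convention for nanopore reads.

```python
import math


def mean_read_quality(phred_scores):
    """Mean quality from averaged error probabilities (assumed convention)."""
    if not phred_scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def passes_filter(sequence, phred_scores, min_length=500, min_quality=10):
    """Keep a read only if it meets both the length and quality minimums."""
    return (len(sequence) >= min_length
            and mean_read_quality(phred_scores) >= min_quality)


long_good = passes_filter("A" * 600, [20] * 600)
too_short = passes_filter("A" * 400, [20] * 400)
low_quality = passes_filter("A" * 600, [5] * 600)
```

Only the first read is retained; the second fails the length minimum and the third fails the quality minimum.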

read_QC_trim_ont Technical Details

	Links
Sub-workflow	wf_read_QC_trim_ont.wdl
Tasks	task_fastq_scan.wdl task_nanoplot.wdl task_rasusa.wdl task_nanoq.wdl
Software Source Code	fastq-scan NanoPlot RASUSA nanoq
Software Documentation	NCBI Scrub on GitHub NanoPlot documentation on GitHub Rasusa on GitHub Nanoq on GitHub Artic Pipeline ReadTheDocs Kraken2 on GitHub
Original Publication(s)	NanoPack2: population-scale evaluation of long-read sequencing data Rasusa: Randomly subsample sequencing reads to a specified coverage Nanoq: ultra-fast quality control for nanopore reads Improved metagenomic analysis with Kraken 2

These tasks assemble the reads into a de novo assembly and assess the quality of the assembly.

Flye: De novo Assembly

flye_denovo is a sub-workflow that performs de novo assembly using Flye for ONT data and supports additional polishing and visualization steps.

Ensure correct medaka model is selected if performing medaka polishing

In order to obtain the best results, the appropriate model must be set to match the sequencer's basecaller model; this string takes the format of {pore}_{device}_{caller variant}_{caller_version}. See also https://github.com/nanoporetech/medaka?tab=readme-ov-file#models. If flye is being run on legacy data the medaka model will likely be r941_min_hac_g507. Recently generated data will likely be suited by the default model of r1041_e82_400bps_sup_v5.0.0.

The detailed steps and tasks are as follows:

Porechop: Read Trimming (optional; off by default)

Read trimming is optional and can be enabled by setting the run_porechop input variable to true.

Porechop is a tool for finding and removing adapters from ONT data. Adapters on the ends of reads are trimmed, and when a read has an adapter in the middle, the read is split into two.

Porechop Technical Details

	Links
WDL Task	task_porechop.wdl
Software Source Code	Porechop on GitHub
Software Documentation	https://github.com/rrwick/Porechop#porechop

Flye: De novo Assembly

Flye is a de novo assembler for long-read data that uses repeat graphs. Compared to de Bruijn graphs, which require exact k-mer matches, repeat graphs can use approximate matches, which better tolerate the error rate of ONT data.

flye_read_type input parameter

This input parameter specifies the type of sequencing reads being used for assembly. This parameter significantly impacts the assembly process and should match the characteristics of your input data. Below are the available options:

Parameter	Explanation
`--nano-hq` (default)	Optimized for ONT high-quality reads, such as Guppy5+ SUP or Q20 (<5% error). Recommended for ONT reads processed with Guppy5 or newer
`--nano-raw`	For ONT regular reads, pre-Guppy5 (<20% error)
`--nano-corr`	ONT reads corrected with other methods (<3% error)
`--pacbio-raw`	PacBio regular CLR reads (<20% error)
`--pacbio-corr`	PacBio reads corrected with other methods (<3% error)
`--pacbio-hifi`	PacBio HiFi reads (<1% error)

Refer to the Flye documentation for detailed guidance on selecting the appropriate flye_read_type based on your sequencing data and additional optional parameters.

Non-deterministic output(s)

This task may yield non-deterministic outputs.

Flye Technical Details

	Links
WDL Task	task_flye.wdl
Software Source Code	Flye on GitHub
Software Documentation	Flye Documentation
Original Publication(s)	Assembly of long, error-prone reads using repeat graphs

Bandage: Graph Visualization

Bandage creates de novo assembly graphs containing the assembled contigs and the connections between those contigs.

Bandage Technical Details

	Links
WDL Task	task_bandage_plot.wdl
Software Source Code	Bandage on GitHub
Software Documentation	Bandage Documentation
Original Publication(s)	Bandage: interactive visualization of de novo genome assemblies

Polypolish: Hybrid Assembly Polishing for ONT and Illumina data

If short reads are provided with the optional illumina_read1 and illumina_read2 inputs, Polypolish will use those short-reads to correct errors in the long-read assemblies. Uniquely, Polypolish uses the short-read alignments where each read is aligned to all possible locations, meaning that repeat regions will have error correction.

Polypolish Technical Details

	Links
Task	task_polypolish.wdl
Software Source Code	Polypolish on GitHub
Software Documentation	Polypolish Documentation
Original Publication(s)	Polypolish: short-read polishing of long-read bacterial genome assemblies How low can you go? Short-read polishing of Oxford Nanopore bacterial genome assemblies

Medaka: Polishing of Flye assembly (default; optional)

Polishing is optional and can be skipped by setting the skip_polishing variable to true. If polishing is skipped, then neither Medaka nor Racon will run.

Medaka is the default assembly polisher used in this workflow. Racon may be used alternatively, and if so, Medaka will not run. Medaka uses the raw reads to polish the assembly and generate a consensus sequence.

Importantly, Medaka requires knowing the model that was used to generate the read data. There are several ways to provide this information:

Automatic Model Selection: Automatically determines the most appropriate Medaka model based on the input data, ensuring optimal polishing results without manual intervention.
User-Specified Model Override: Allows users to specify a particular Medaka model if automatic selection does not yield the desired outcome or for specialized use cases.
Default Model: If automatic model selection fails and no user-specified model is provided, Medaka defaults to the predefined fallback model r1041_e82_400bps_sup_v5.0.0.

Medaka Model Resolution Process

Medaka's automatic model selection uses the medaka tools resolve_model command to identify the appropriate model for polishing. This process relies on metadata embedded in the input file, which is typically generated by the basecaller. If the automatic selection fails to identify a suitable model, Medaka gracefully falls back to the default model to maintain workflow continuity. Users should verify the chosen model and consider specifying a model override if necessary.
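The fallback order can be summarised in a small Python sketch. It assumes that a user-specified model takes precedence over the automatically resolved one, which the text implies but does not state outright. The function name `choose_medaka_model` is hypothetical, and `auto_model` stands for whatever the automatic resolution step returned (None when it failed).

```python
DEFAULT_MEDAKA_MODEL = "r1041_e82_400bps_sup_v5.0.0"


def choose_medaka_model(user_model=None, auto_model=None):
    """Pick a model: user override, then automatic result, then the default."""
    if user_model:
        return user_model  # assumed to take precedence over auto-detection
    if auto_model:
        return auto_model
    return DEFAULT_MEDAKA_MODEL


fallback = choose_medaka_model()
resolved = choose_medaka_model(auto_model="r941_min_hac_g507")
override = choose_medaka_model(user_model="r941_min_hac_g507",
                               auto_model="r1041_e82_400bps_sup_v5.0.0")
```

Because a silent fallback can mask a mismatch between the data and the model, it is worth checking which of the three branches was taken for a given sample.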

Medaka Technical Details

	Links
WDL Task	task_medaka.wdl
Software Source Code	Medaka on GitHub
Software Documentation	Medaka Documentation

Racon: Polishing of Flye assembly (alternative; optional)

Polishing is optional and can be skipped by setting the skip_polishing variable to true. If polishing is skipped, then neither Medaka nor Racon will run.

Racon is an alternative to using medaka for assembly polishing, and can be run by setting the polisher input to "racon". Racon is a consensus algorithm designed for refining raw de novo DNA assemblies generated from long, uncorrected sequencing reads.

Racon Technical Details

	Links
WDL Task	task_racon.wdl
Software Source Code	Racon on GitHub
Software Documentation	Racon Documentation
Original Publication(s)	Fast and accurate de novo genome assembly from long uncorrected reads

Filter Contigs: Filter contigs below a threshold length and remove homopolymer contigs

This task filters the created contigs based on a user-defined minimum length threshold (default of 1000) and eliminates homopolymer contigs (contigs of any length that consist of a single nucleotide). This ensures high-quality assemblies by retaining only contigs that meet specified criteria. Detailed metrics on contig counts and sequence lengths before and after filtering are provided in the output.
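The two filtering rules can be sketched as follows. This is an illustration rather than the code in task_filter_contigs.wdl: the function names and the keys of the returned statistics dictionary are hypothetical, and contigs are assumed to be held in a dictionary of name to sequence.

```python
def is_homopolymer(sequence):
    """True when the contig consists of a single repeated nucleotide."""
    return len(set(sequence.upper())) == 1


def filter_contigs(contigs, min_length=1000):
    """Drop short and homopolymer contigs; report counts before and after."""
    kept = {
        name: seq for name, seq in contigs.items()
        if len(seq) >= min_length and not is_homopolymer(seq)
    }
    stats = {
        "contigs_before": len(contigs),
        "contigs_after": len(kept),
        "bases_before": sum(len(seq) for seq in contigs.values()),
        "bases_after": sum(len(seq) for seq in kept.values()),
    }
    return kept, stats


example_contigs = {
    "contig_1": "ACGT" * 300,   # 1,200 bp, kept
    "contig_2": "A" * 2000,     # homopolymer, removed despite its length
    "contig_3": "ACGT" * 100,   # 400 bp, removed as too short
}
kept_contigs, filter_stats = filter_contigs(example_contigs)
```

Note that the homopolymer check applies to contigs of any length, so contig_2 is removed even though it exceeds the length threshold.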

Filter Contigs Technical Details

	Links
WDL Task	task_filter_contigs.wdl

Dnaapler: Final Assembly Orientation

Dnaapler reorients contigs to start at specific reference points. Dnaapler supports the following modes, which can be indicated by filling the dnaapler_mode input variable with the desired mode. The default is all, which reorients contigs to start with dnaA, terL, repA, or COG1474.

all: Reorients contigs to start with dnaA, terL, repA, or COG1474 (Default)
chromosome: Reorients to begin with the dnaA chromosomal replication initiator gene, commonly used for bacterial chromosome assemblies.
plasmid: Reorients to start with the repA plasmid replication initiation gene, ideal for plasmid assemblies
phage: Reorients to start with the terL large terminase subunit gene, used for bacteriophage assemblies
archaea: Reorients to start with the COG1474 archaeal Orc1/cdc6 gene, relevant for archaeal assemblies
custom: Reorients based on a user-specified gene in amino acid FASTA format for experimental or unique workflows
mystery: Reorients to start with a random CDS for exploratory purposes
largest: Reorients to start with the largest CDS in the assembly, often useful for poorly annotated genomes
nearest: Reorients to start with the first CDS nearest to the sequence start, resolving CDS breakpoints
bulk: Processes multiple contigs to start with the desired start gene (dnaA, terL, repA, or custom)

Dnaapler Technical Details

	Links
WDL Task	task_dnaapler.wdl
Software Source Code	Dnaapler on GitHub
Software Documentation	Dnaapler Documentation
Original Publication(s)	Dnaapler: a tool to reorient circular microbial genomes

Organism-agnostic characterization¶

These tasks are performed regardless of the organism and provide quality control and taxonomic assignment.

GAMBIT: Taxon Assignment

GAMBIT determines the taxon of the genome assembly using a k-mer based approach to match the assembly sequence to the closest complete genome in a database, thereby predicting its identity. Sometimes, GAMBIT can confidently designate the organism to the species level. Other times, it is more conservative and assigns it to a higher taxonomic rank.

For additional details regarding the GAMBIT tool and a list of available GAMBIT databases for analysis, please consult the GAMBIT tool documentation.
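
To make the idea of k-mer based matching concrete, the sketch below compares k-mer sets with a Jaccard distance and picks the closest reference. It is a generic illustration only: it does not reproduce GAMBIT's signature construction, parameters, or classification thresholds, and the value of k, the function names, and the toy sequences are all assumptions.

```python
from typing import Dict, Set, Tuple


def kmers(sequence: str, k: int = 5) -> Set[str]:
    """Return the set of all overlapping k-mers in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A intersect B| / |A union B|; 0.0 means identical k-mer content."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def closest_reference(
    query: str, references: Dict[str, str], k: int = 5
) -> Tuple[str, float]:
    """Return the name of the reference with the smallest distance, and that distance."""
    query_kmers = kmers(query, k)
    distances = {
        name: jaccard_distance(query_kmers, kmers(seq, k))
        for name, seq in references.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Toy sequences, far shorter than real genomes.
references = {
    "ref_A": "ACGTACGTTAGCCGATAGCTAGGCTA",
    "ref_B": "TTTTGGGGCCCCAAAATTTTGGGGCC",
}
query = "ACGTACGTTAGCCGATAGCTAGGCTT"  # differs from ref_A at the last base
name, distance = closest_reference(query, references)
print(name, round(distance, 3))
```

A small distance to the best match is what would justify a confident, low-rank call in a scheme like this; a larger one would argue for reporting a higher taxonomic rank.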

GAMBIT Technical Details

	Links
Task	task_gambit.wdl
Software Source Code	GAMBIT on GitHub
Software Documentation	GAMBIT ReadTheDocs
Original Publication(s)	GAMBIT (Genomic Approximation Method for Bacterial Identification and Tracking): A methodology to rapidly leverage whole genome sequencing of bacterial isolates for clinical identification

BUSCO: Assembly Quality Assessment

BUSCO (Benchmarking Universal Single-Copy Orthologue) attempts to quantify the completeness and contamination of an assembly to generate quality assessment metrics. It uses taxa-specific databases containing genes that are all expected to occur in the given taxa, each in a single copy. BUSCO examines the presence or absence of these genes, whether they are fragmented, and whether they are duplicated (suggestive that additional copies came from contaminants).

BUSCO notation

Here is an example of BUSCO notation: C:99.1%[S:98.9%,D:0.2%],F:0.0%,M:0.9%,n:440. There are several abbreviations used in this output:

Complete (C) - genes are considered "complete" when their lengths are within two standard deviations of the BUSCO group mean length.
Single-copy (S) - genes that are complete and have only one copy.
Duplicated (D) - genes that are complete and have more than one copy.
Fragmented (F) - genes that are only partially recovered.
Missing (M) - genes that were not recovered at all.
Number of genes examined (n) - the number of genes examined.

A high-quality assembly will use the appropriate database for the taxa, have high complete (C) and single-copy (S) percentages, and low duplicated (D), fragmented (F) and missing (M) percentages.
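
For downstream checks it can help to split the notation into numbers. The following Python sketch parses a string of the form shown above. The function name and the regular expression are illustrative assumptions and are not part of BUSCO or of task_busco.wdl.

```python
import re
from typing import Dict, Union

# Pattern for strings such as C:99.1%[S:98.9%,D:0.2%],F:0.0%,M:0.9%,n:440
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> Dict[str, Union[float, int]]:
    """Split a BUSCO summary string into percentages (floats) and n (int)."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"Not a BUSCO summary string: {summary!r}")
    values: Dict[str, Union[float, int]] = {
        key: float(match.group(key)) for key in ("C", "S", "D", "F", "M")
    }
    values["n"] = int(match.group("n"))
    return values


result = parse_busco("C:99.1%[S:98.9%,D:0.2%],F:0.0%,M:0.9%,n:440")
print(result)
```

In the example string, the single-copy and duplicated percentages sum to the complete percentage (98.9 + 0.2 = 99.1), which is a quick consistency check on any parsed result.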

BUSCO Technical Details

	Links
Task	task_busco.wdl
Software Source Code	BUSCO on GitLab
Software Documentation	https://busco.ezlab.org/
Original Publication(s)	BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

Organism-specific characterization¶

The TheiaEuk workflow automatically activates taxa-specific tasks after identification of the relevant taxa using GAMBIT. Many of these taxa-specific tasks do not require any additional inputs from the user.

Candidozyma auris (also known as Candida auris)

Three tools can be deployed when Candidozyma auris/Candida auris is identified.

Cladetyping: clade determination

A custom GAMBIT database is created using five clade-specific Candidozyma auris reference genomes. Sequences undergo genomic signature comparison against this database, which then enables assignment to one of the five Candidozyma auris clades (Clade I to Clade V) based on sequence similarity and phylogenetic relationships. This integrated approach ensures precise clade assignments, crucial for understanding the genetic diversity and epidemiology of Candidozyma auris.

See more information on the reference information for the five clades below:

Clade	Genome Accession	Assembly Name	Strain	BioSample Accession
Clade I	GCA_002759435.2	Cand_auris_B8441_V2	B8441	SAMN05379624
Clade II	GCA_003013715.2	ASM301371v2	B11220	SAMN05379608
Clade III	GCA_002775015.1	Cand_auris_B11221_V1	B11221	SAMN05379609
Clade IV	GCA_003014415.1	Cand_auris_B11243	B11243	SAMN05379619
Clade V	GCA_016809505.1	ASM1680950v1	IFRC2087	SAMN11570381

Cauris_Cladetyper Technical Details

	Links
Task	task_cauris_cladetyper.wdl
Software Source Code	GAMBIT on GitHub
Software Documentation	GAMBIT Overview
Original Publication(s)	GAMBIT (Genomic Approximation Method for Bacterial Identification and Tracking): A methodology to rapidly leverage whole genome sequencing of bacterial isolates for clinical identification; TheiaEuk: a species-agnostic bioinformatics workflow for fungal genomic characterization

amr_search: Antimicrobial resistance profiling (Optional)

Set the run_amr_search parameter to true to enable this task.

The AMR Search module will report only specific SNP changes in the AMR genes, like FUR1 F211L. For a complete list of mutations being queried, please refer to amr_search's documentation.

This task performs in silico antimicrobial resistance (AMR) profiling for supported species using AMRsearch, the primary tool used by Pathogenwatch to genotype and infer antimicrobial resistance (AMR) phenotypes from assembled microbial genomes.

AMRsearch screens against Pathogenwatch's library of curated genotypes and inferred phenotypes, developed in collaboration with community experts. Resistance phenotypes are determined based on both resistance genes and mutations, and the system accounts for interactions between multiple SNPs, genes, and suppressors. Predictions follow S/I/R classification (Sensitive, Intermediate, Resistant).

Outputs:

JSON Output: Contains the complete AMR profile, including detailed resistance state, detected resistance genes/mutations, and supporting BLAST results.
CSV & PDF Tables: An incorporated Python script, parse_amr_json.py, extracts and formats results into a CSV file and PDF summary table for easier visualization.

amr_search Technical Details

	Links
Task	task_amr_search.wdl
Software Source Code	AMRsearch
Software Documentation	Pathogenwatch
Original Publication(s)	PAARSNP: rapid genotypic resistance prediction for Neisseria gonorrhoeae

Snippy Variants: antifungal resistance detection

To detect mutations that may confer antifungal resistance, Snippy is used to find all variants relative to the clade-specific reference, then these variants are queried for product names associated with resistance. It's important to note that unlike amr_search, this task reports all variants found in the searched targets.

The genes in which there are known resistance-conferring mutations for this pathogen are:

FKS1
ERG11 (lanosterol 14-alpha demethylase)
FUR1 (uracil phosphoribosyltransferase)

We query the Snippy results for mutations in those genes. By default, the following search terms are checked automatically (the user can override them). In the theiaeuk_snippy_variants_hits column, each mutation is reported next to its search term; the table below maps each search term to the corresponding gene name:

TheiaEuk Search Term	Corresponding Gene Name
B9J08_005340	ERG6
B9J08_000401	FLO8
B9J08_005343	Hypothetical protein (PSK74852)
B9J08_003102	MEC3
B9J08_003737	ERG3
lanosterol.14-alpha.demethylase	ERG11
uracil.phosphoribosyltransferase	FUR1
FKS1	FKS1

For example, one sample may have the following output for the theiaeuk_snippy_variants_hits column:

lanosterol.14-alpha.demethylase: lanosterol 14-alpha demethylase (missense_variant c.428A>G p.Lys143Arg; C:266 T:0),B9J08_000401: hypothetical protein (stop_gained c.424C>T p.Gln142*; A:70 G:0)

Based on this, we can tell that ERG11 has a missense variant at position 143 (Lysine to Arginine) and B9J08_000401 (which is FLO8) has a stop-gained variant at position 142 (Glutamine to Stop).
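
The reading above can be automated with a small parser. The Python sketch below splits a theiaeuk_snippy_variants_hits string into one record per hit. The regular expression is inferred from the single example shown here, so it is an assumption rather than a format specification; in particular it assumes product names contain no parentheses.

```python
import re
from typing import Dict, List

# Layout inferred from the example:
#   <search term>: <product> (<effect> <c. change> <p. change>; <evidence>)
HIT_PATTERN = re.compile(
    r"(?P<query>[^:,]+): (?P<product>.+?) "
    r"\((?P<effect>\S+) (?P<nt_change>c\.\S+) (?P<aa_change>p\.[^;]+); "
    r"(?P<evidence>[^)]*)\)"
)


def parse_hits(hits: str) -> List[Dict[str, str]]:
    """Return one dictionary per variant hit found in the string."""
    return [match.groupdict() for match in HIT_PATTERN.finditer(hits)]


example = (
    "lanosterol.14-alpha.demethylase: lanosterol 14-alpha demethylase "
    "(missense_variant c.428A>G p.Lys143Arg; C:266 T:0),"
    "B9J08_000401: hypothetical protein "
    "(stop_gained c.424C>T p.Gln142*; A:70 G:0)"
)
hits = parse_hits(example)
for hit in hits:
    print(hit["query"], hit["effect"], hit["aa_change"], hit["evidence"])
```

Each record keeps the search term, so it can be mapped back to a gene name with the table above.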

Known resistance-conferring mutations for Candidozyma auris

Mutations in these genes that are known to confer resistance are shown below:

Organism	Found in	Gene name	Gene locus	AA mutation	Drug	Reference
Candidozyma auris	Human	ERG11		Y132F	Fluconazole	Simultaneous Emergence of Multidrug-Resistant Candida auris on 3 Continents Confirmed by Whole-Genome Sequencing and Epidemiological Analyses
Candidozyma auris	Human	ERG11		K143R	Fluconazole	Simultaneous Emergence of Multidrug-Resistant Candida auris on 3 Continents Confirmed by Whole-Genome Sequencing and Epidemiological Analyses
Candidozyma auris	Human	ERG11		F126T	Fluconazole	Simultaneous Emergence of Multidrug-Resistant Candida auris on 3 Continents Confirmed by Whole-Genome Sequencing and Epidemiological Analyses
Candidozyma auris	Human	FKS1		S639P	Micafungin	Activity of CD101, a long-acting echinocandin, against clinical isolates of Candida auris
Candidozyma auris	Human	FKS1		S639P	Caspofungin	Activity of CD101, a long-acting echinocandin, against clinical isolates of Candida auris
Candidozyma auris	Human	FKS1		S639P	Anidulafungin	Activity of CD101, a long-acting echinocandin, against clinical isolates of Candida auris
Candidozyma auris	Human	FKS1		S639F	Micafungin	A multicentre study of antifungal susceptibility patterns among 350 Candida auris isolates (2009–17) in India: role of the ERG11 and FKS1 genes in azole and echinocandin resistance
Candidozyma auris	Human	FKS1		S639F	Caspofungin	A multicentre study of antifungal susceptibility patterns among 350 Candida auris isolates (2009–17) in India: role of the ERG11 and FKS1 genes in azole and echinocandin resistance
Candidozyma auris	Human	FKS1		S639F	Anidulafungin	A multicentre study of antifungal susceptibility patterns among 350 Candida auris isolates (2009–17) in India: role of the ERG11 and FKS1 genes in azole and echinocandin resistance
Candidozyma auris	Human	FUR1	CAMJ_004922	F211I	5-flucytosine	Genomic epidemiology of the UK outbreak of the emerging human fungal pathogen Candida auris
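
Because Snippy reports protein changes in three-letter notation (for example p.Lys143Arg) while the table above uses one-letter notation (K143R), a comparison needs a conversion step. The Python sketch below shows one way to do that lookup. The dictionaries are copied from the tables in this section; the function names are hypothetical, and this is not the logic of task_snippy_gene_query.wdl.

```python
import re
from typing import List, Optional

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Search term to gene name, from the search-term table in this section.
SEARCH_TERM_TO_GENE = {
    "lanosterol.14-alpha.demethylase": "ERG11",
    "uracil.phosphoribosyltransferase": "FUR1",
    "FKS1": "FKS1",
}

# (gene, amino acid change) to drugs, from the table above.
ECHINOCANDINS = ["Micafungin", "Caspofungin", "Anidulafungin"]
KNOWN_RESISTANCE = {
    ("ERG11", "Y132F"): ["Fluconazole"],
    ("ERG11", "K143R"): ["Fluconazole"],
    ("ERG11", "F126T"): ["Fluconazole"],
    ("FKS1", "S639P"): ECHINOCANDINS,
    ("FKS1", "S639F"): ECHINOCANDINS,
    ("FUR1", "F211I"): ["5-flucytosine"],
}


def short_aa_change(hgvs_protein: str) -> Optional[str]:
    """Convert p.Lys143Arg to K143R; return None for anything else (e.g. p.Gln142*)."""
    match = re.fullmatch(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", hgvs_protein)
    if match is None:
        return None
    ref, position, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        return None
    return f"{THREE_TO_ONE[ref]}{position}{THREE_TO_ONE[alt]}"


def resistance_drugs(search_term: str, hgvs_protein: str) -> List[str]:
    """Return the drugs listed for this change in the table, or an empty list."""
    gene = SEARCH_TERM_TO_GENE.get(search_term, search_term)
    change = short_aa_change(hgvs_protein)
    if change is None:
        return []
    return KNOWN_RESISTANCE.get((gene, change), [])


print(resistance_drugs("lanosterol.14-alpha.demethylase", "p.Lys143Arg"))  # ['Fluconazole']
print(resistance_drugs("B9J08_000401", "p.Gln142*"))  # []
```

An empty result only means the change is absent from the table above; it says nothing about whether the variant affects susceptibility.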

Snippy Variants Technical Details

	Links
Task	task_snippy_variants.wdl task_snippy_gene_query.wdl
Software Source Code	Snippy on GitHub
Software Documentation	Snippy on GitHub

Candida albicans

When this species is detected by the taxon ID tool, an antifungal resistance detection task is deployed.

Snippy Variants: antifungal resistance detection

To detect mutations that may confer antifungal resistance, Snippy is used to find all variants relative to the clade-specific reference, and these variants are queried for product names associated with resistance.

The genes in which there are known resistance-conferring mutations for this pathogen are:

ERG11
GCS1 (FKS1)
FUR1
RTA2

We query the Snippy results for mutations in those genes. By default, the following search terms are checked automatically (the user can override them). In the theiaeuk_snippy_variants_hits column, each mutation is reported next to its search term; the table below maps each search term to the corresponding gene name:

TheiaEuk Search Term	Corresponding Gene Name
ERG11	ERG11
GCS1	FKS1
FUR1	FUR1
RTA2	RTA2

Snippy Variants Technical Details

	Links
Task	task_snippy_variants.wdl task_snippy_gene_query.wdl
Software Source Code	Snippy on GitHub
Software Documentation	Snippy on GitHub

Aspergillus fumigatus

When this species is detected by the taxon ID tool, an antifungal resistance detection task is deployed.

Snippy Variants: antifungal resistance detection

To detect mutations that may confer antifungal resistance, Snippy is used to find all variants relative to the clade-specific reference, and these variants are queried for product names associated with resistance.

The genes in which there are known resistance-conferring mutations for this pathogen are:

Cyp51A
HapE
COX10 (AFUA_4G08340)

We query the Snippy results for mutations in those genes. By default, the following search terms are checked automatically (the user can override them). In the theiaeuk_snippy_variants_hits column, each mutation is reported next to its search term; the table below maps each search term to the corresponding gene name:

TheiaEuk Search Term	Corresponding Gene Name
Cyp51A	Cyp51A
HapE	HapE
AFUA_4G08340	COX10

Snippy Variants Technical Details

	Links
Task	task_snippy_variants.wdl task_snippy_gene_query.wdl
Software Source Code	Snippy on GitHub
Software Documentation	Snippy on GitHub

Cryptococcus neoformans

When this species is detected by the taxon ID tool, an antifungal resistance detection task is deployed.

Snippy Variants: antifungal resistance detection

To detect mutations that may confer antifungal resistance, Snippy is used to find all variants relative to the clade-specific reference, and these variants are queried for product names associated with resistance.

The genes in which there are known resistance-conferring mutations for this pathogen are:

ERG11 (CNA00300)

We query the Snippy results for mutations in those genes. By default, the following search terms are checked automatically (the user can override them). In the theiaeuk_snippy_variants_hits column, each mutation is reported next to its search term; the table below maps each search term to the corresponding gene name:

TheiaEuk Search Term	Corresponding Gene Name
CNA00300	ERG11

Snippy Variants Technical Details

	Links
Task	task_snippy_variants.wdl task_snippy_gene_query.wdl
Software Source Code	Snippy on GitHub
Software Documentation	Snippy on GitHub

Outputs¶

TheiaEuk_Illumina_PE (first table below) and TheiaEuk_ONT (second table below)

Variable	Type	Description
amr_results_csv	File	CSV formatted AMR profile
amr_search_docker	String	Docker image used to run AMR_Search
amr_search_results	File	JSON formatted AMR profile including BLAST results
amr_search_results_pdf	File	PDF formatted AMR profile
assembler	String	Assembler used in digger_denovo subworkflow
assembler_version	String	Version of the assembler used in digger_denovo
assembly_fasta	File	De novo genome assembly in FASTA format
assembly_length	Int	Length of assembly (total contig length) as determined by QUAST
bbduk_docker	String	The Docker image for bbduk, which was used to remove the adapters from the sequences
busco_database	String	BUSCO database used
busco_docker	String	BUSCO docker image used
busco_report	File	A plain text summary of the results in BUSCO notation
busco_results	String	BUSCO results in BUSCO notation (see the BUSCO notation explanation in the BUSCO section above)
busco_version	String	BUSCO software version used
cg_pipeline_docker	String	Docker image used for running CG-Pipeline on cleaned reads
cg_pipeline_report	File	TSV file of read metrics from raw reads, including average read length, number of reads, and estimated genome coverage
cladetyper_annotated_reference	String	The annotated reference file for the identified clade, "None" if no clade was identified
cladetyper_clade	String	The clade assigned to the input assembly
cladetyper_docker_image	String	The Docker container used for the task
cladetyper_gambit_version	String	The version of GAMBIT used for the analysis
combined_mean_q_clean	Float	Mean quality score for the combined clean reads
combined_mean_q_raw	Float	Mean quality score for the combined raw reads
combined_mean_readlength_clean	Float	Mean read length for the combined clean reads
combined_mean_readlength_raw	Float	Mean read length for the combined raw reads
contigs_fastg	File	Assembly graph if megahit used for genome assembly
contigs_gfa	File	Assembly graph output generated by SPAdes (Illumina: PE, SE) or Flye (ONT), used to visualize and evaluate genome assembly results.
contigs_lastgraph	File	Assembly graph if velvet used for genome assembly
est_coverage_clean	Float	Estimated coverage calculated from clean reads and genome length
est_coverage_raw	Float	Estimated coverage calculated from raw reads and genome length
fastp_html_report	File	The HTML report made with fastp
fastp_version	String	The version of fastp used
fastq_scan_clean1_json	File	The JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length
fastq_scan_clean2_json	File	The JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length
fastq_scan_num_reads_clean1	Int	The number of forward reads after cleaning as calculated by fastq_scan
fastq_scan_num_reads_clean2	Int	The number of reverse reads after cleaning as calculated by fastq_scan
fastq_scan_num_reads_clean_pairs	String	The number of read pairs after cleaning as calculated by fastq_scan
fastq_scan_num_reads_raw1	Int	The number of input forward reads as calculated by fastq_scan
fastq_scan_num_reads_raw2	Int	The number of input reverse reads as calculated by fastq_scan
fastq_scan_num_reads_raw_pairs	String	The number of input read pairs as calculated by fastq_scan
fastq_scan_raw1_json	File	The JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length
fastq_scan_raw2_json	File	The JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length
fastq_scan_version	String	The version of fastq_scan
fastqc_clean1_html	File	An HTML file that provides a graphical visualization of clean forward read quality from fastqc to open in an internet browser
fastqc_clean2_html	File	An HTML file that provides a graphical visualization of clean reverse read quality from fastqc to open in an internet browser
fastqc_docker	String	The Docker container used for fastqc
fastqc_num_reads_clean1	Int	The number of forward reads after cleaning by fastqc
fastqc_num_reads_clean2	Int	The number of reverse reads after cleaning by fastqc
fastqc_num_reads_clean_pairs	String	The number of read pairs after cleaning by fastqc
fastqc_num_reads_raw1	Int	The number of input forward reads by fastqc before cleaning
fastqc_num_reads_raw2	Int	The number of input reverse reads by fastqc before cleaning
fastqc_num_reads_raw_pairs	String	The number of input read pairs by fastqc before cleaning
fastqc_raw1_html	File	An HTML file that provides a graphical visualization of raw forward read quality from fastqc to open in an internet browser
fastqc_raw2_html	File	An HTML file that provides a graphical visualization of raw reverse read quality from fastqc to open in an internet browser
fastqc_version	String	Version of fastqc software used
filtered_contigs_metrics	File	File containing metrics of contigs filtered
gambit_closest_genomes	File	CSV file listing genomes in the GAMBIT database that are most similar to the query assembly
gambit_db_version	String	Version of the GAMBIT database used
gambit_docker	String	GAMBIT Docker used
gambit_predicted_taxon	String	Taxon predicted by GAMBIT
gambit_predicted_taxon_rank	String	Taxon rank of GAMBIT taxon prediction
gambit_report	File	GAMBIT report in a machine-readable format
gambit_version	String	Version of GAMBIT software used
n50_value	Int	N50 of assembly calculated by QUAST
number_contigs	Int	Total number of contigs in assembly
qc_check	String	A string that indicates whether or not the sample passes a set of pre-determined and user-provided QC thresholds
qc_standard	File	The file used in the QC Check task containing the QC thresholds.
quast_gc_percent	Float	The GC percent of your sample
quast_report	File	TSV report from QUAST
quast_version	String	The version of QUAST
r1_mean_q_raw	Float	Mean quality score of raw forward reads
r1_mean_readlength_raw	Float	Mean read length of raw forward reads
r2_mean_q_raw	Float	Mean quality score of raw reverse reads
r2_mean_readlength_clean	Float	Mean read length of clean reverse reads
rasusa_version	String	Version of RASUSA used for the analysis
read1_clean	File	Forward read file after quality trimming and adapter removal
read1_subsampled	File	Read1 FASTQ files downsampled to desired coverage
read2_clean	File	Reverse read file after quality trimming and adapter removal
read2_subsampled	File	Read2 FASTQ files downsampled to desired coverage
read_screen_clean	String	PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure
read_screen_clean_tsv	File	Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length
read_screen_raw	String	PASS or FAIL result from raw read screening; FAIL accompanied by the reason(s) for failure
read_screen_raw_tsv	File	Raw read screening report TSV depicting read counts, total read base pairs, and estimated genome length
seq_platform	String	Description of the sequencing methodology used to generate the input read data
shovill_pe_version	String	Shovill version used
theiaeuk_illumina_pe_analysis_date	String	Date of TheiaEuk PE workflow execution
theiaeuk_illumina_pe_version	String	TheiaEuk PE workflow version used
theiaeuk_snippy_variants_bai	String	BAI file produced by the snippy module
theiaeuk_snippy_variants_bam	String	BAM file produced by the snippy module
theiaeuk_snippy_variants_coverage_tsv	String	TSV file containing coverage information for each base in the reference genome
theiaeuk_snippy_variants_gene_query_results	File	File containing all lines from variants file matching gene query terms
theiaeuk_snippy_variants_hits	String	String of all variant file entries matching gene query term
theiaeuk_snippy_variants_num_reads_aligned	String	Number of reads aligned by snippy
theiaeuk_snippy_variants_num_variants	Int	Number of variants detected by snippy
theiaeuk_snippy_variants_outdir_tarball	File	Tar compressed file containing full snippy output directory
theiaeuk_snippy_variants_percent_ref_coverage	String	Percent of reference genome covered by snippy
theiaeuk_snippy_variants_query	String	The gene query term(s) used to search the variants file
theiaeuk_snippy_variants_query_check	String	Whether the gene query terms were present in the annotated reference genome file
theiaeuk_snippy_variants_reference_genome	File	The reference genome used in the alignment and variant calling
theiaeuk_snippy_variants_results	File	The variants file produced by snippy
theiaeuk_snippy_variants_summary	File	A file summarizing the variants detected by snippy
theiaeuk_snippy_variants_version	String	The version of the snippy_variants module being used
trimmomatic_docker	String	The docker image used for the trimmomatic module in this workflow
trimmomatic_version	String	The version of Trimmomatic used

Variable	Type	Description
amr_search_csv	File	CSV formatted AMR profile
amr_search_docker	String	Docker image used to run AMR_Search
amr_search_results	File	JSON formatted AMR profile including BLAST results
amr_search_results_pdf	File	PDF formatted AMR profile
assembly_fasta	File	De novo genome assembly in FASTA format
assembly_length	Int	Length of assembly (total contig length) as determined by QUAST
bandage_plot	File	Image file (PNG) visualizing the Flye assembly graph generated by Bandage
bwa_version	String	Version of BWA software used
cladetyper_annotated_reference	String	The annotated reference file for the identified clade, "None" if no clade was identified
cladetyper_clade	String	The clade assigned to the input assembly
cladetyper_docker_image	String	The Docker container used for the task
cladetyper_version	String	The version of Cladetyper used for the analysis
contigs_fastg	File	Assembly graph if megahit used for genome assembly
dnaapler_version	String	Version of dnaapler used
est_coverage_clean	Float	Estimated coverage calculated from clean reads and genome length
est_coverage_raw	Float	Estimated coverage calculated from raw reads and genome length
filtered_contigs_metrics	File	File containing metrics of contigs filtered
gambit_closest_genomes	File	CSV file listing genomes in the GAMBIT database that are most similar to the query assembly
gambit_db_version	String	Version of the GAMBIT database used
gambit_docker	String	GAMBIT Docker used
gambit_predicted_taxon	String	Taxon predicted by GAMBIT
gambit_predicted_taxon_rank	String	Taxon rank of GAMBIT taxon prediction
gambit_report	File	GAMBIT report in a machine-readable format
gambit_version	String	Version of GAMBIT software used
medaka_model	String	Model used by Medaka
medaka_vcf	File	A VCF file containing the identified variants
medaka_version	String	Version of Medaka used
n50_value	Int	N50 of assembly calculated by QUAST
nanoplot_docker	String	Docker image for nanoplot
nanoplot_html_clean	File	An HTML report describing the clean reads
nanoplot_html_raw	File	An HTML report describing the raw reads
nanoplot_num_reads_clean1	Int	Number of clean reads
nanoplot_num_reads_raw1	Int	Number of raw reads
nanoplot_r1_est_coverage_clean	Float	Estimated coverage on the clean reads by nanoplot
nanoplot_r1_est_coverage_raw	Float	Estimated coverage on the raw reads by nanoplot
nanoplot_r1_mean_q_clean	Float	Mean quality score of clean forward reads
nanoplot_r1_mean_q_raw	Float	Mean quality score of raw forward reads
nanoplot_r1_mean_readlength_clean	Float	Mean read length of clean forward reads
nanoplot_r1_mean_readlength_raw	Float	Mean read length of raw forward reads
nanoplot_r1_median_q_clean	Float	Median quality score of clean forward reads
nanoplot_r1_median_q_raw	Float	Median quality score of raw forward reads
nanoplot_r1_median_readlength_clean	Float	Median read length of clean forward reads
nanoplot_r1_median_readlength_raw	Float	Median read length of raw forward reads
nanoplot_r1_n50_clean	Float	N50 of clean forward reads
nanoplot_r1_n50_raw	Float	N50 of raw forward reads
nanoplot_r1_stdev_readlength_clean	Float	Standard deviation read length of clean forward reads
nanoplot_r1_stdev_readlength_raw	Float	Standard deviation read length of raw forward reads
nanoplot_tsv_clean	File	A TSV report describing the clean reads
nanoplot_tsv_raw	File	A TSV report describing the raw reads
nanoplot_version	String	Version of nanoplot used for analysis
nanoq_version	String	Version of nanoq used in analysis
number_contigs	Int	Total number of contigs in assembly
polypolish_version	String	Version of Polypolish used
porechop_version	String	Version of Porechop used
quast_gc_percent	Float	The GC percent of your sample
quast_report	File	TSV report from QUAST
quast_version	String	The version of QUAST
racon_version	String	Version of Racon used
read1_clean	File	Forward read file after quality trimming and adapter removal
read_screen_clean	String	PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure
read_screen_clean_tsv	File	Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length
read_screen_raw	String	PASS or FAIL result from raw read screening; FAIL accompanied by the reason(s) for failure
read_screen_raw_tsv	File	Raw read screening report TSV depicting read counts, total read base pairs, and estimated genome length
theiaeuk_illumina_pe_analysis_date	String	Date of TheiaEuk PE workflow execution
theiaeuk_illumina_pe_version	String	TheiaEuk PE workflow version used
theiaeuk_snippy_variants_bai	String	BAI file produced by the snippy module
theiaeuk_snippy_variants_bam	String	BAM file produced by the snippy module
theiaeuk_snippy_variants_coverage_tsv	String	TSV file containing coverage information for each base in the reference genome
theiaeuk_snippy_variants_gene_query_results	File	File containing all lines from variants file matching gene query terms
theiaeuk_snippy_variants_hits	String	String of all variant file entries matching gene query term
theiaeuk_snippy_variants_num_reads_aligned	String	Number of reads aligned by snippy
theiaeuk_snippy_variants_num_variants	Int	Number of variants detected by snippy
theiaeuk_snippy_variants_outdir_tarball	File	Tar compressed file containing full snippy output directory
theiaeuk_snippy_variants_percent_ref_coverage	String	Percent of reference genome covered by snippy
theiaeuk_snippy_variants_query	String	The gene query term(s) used to search the variants file
theiaeuk_snippy_variants_query_check	String	Whether the gene query terms were present in the annotated reference genome file
theiaeuk_snippy_variants_reference_genome	File	The reference genome used in the alignment and variant calling
theiaeuk_snippy_variants_results	File	The variants file produced by snippy
theiaeuk_snippy_variants_summary	File	A file summarizing the variants detected by snippy
theiaeuk_snippy_variants_version	String	The version of the snippy_variants module being used
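
Once outputs are exported as a tab-separated table, columns such as read_screen_clean can be used to triage samples. The Python sketch below is a minimal illustration under stated assumptions: the sample_id column, the sample names, and the failure text are hypothetical, and the check relies only on the documented behaviour that the value is PASS or FAIL, with FAIL followed by the reason(s).

```python
import csv
import io
from typing import List

# Hypothetical export; only the read_screen_clean column name comes from the table above.
table_text = (
    "sample_id\tread_screen_clean\n"
    "sample_01\tPASS\n"
    "sample_02\tFAIL; hypothetical reason text\n"
    "sample_03\tPASS\n"
)


def passing_samples(tsv_text: str, column: str = "read_screen_clean") -> List[str]:
    """Return the sample IDs whose screening value starts with PASS."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["sample_id"]
        for row in reader
        if row[column].strip().upper().startswith("PASS")
    ]


passed = passing_samples(table_text)
print(passed)
```

The same function works for read_screen_raw by passing that column name instead.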