TheiaMeta

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Genomic Characterization	Any taxa	v3.0.0	Yes	Sample-level

TheiaMeta Workflows

Genomic characterization of pathogens is an increasing priority for public health laboratories globally. The workflows in the TheiaMeta Genomic Characterization Series simplify the analysis of pathogens from metagenomic samples by taking raw next-generation sequencing (NGS) data and generating metagenome-assembled genomes (MAGs), with or without a reference genome.

TheiaMeta can use one of two distinct methods for generating and processing the final assembly:

If a reference genome is not provided, the de novo assembly is the final assembly. The assembly additionally goes through a binning process in which the contigs are separated into distinct files ("bins") according to composition and coverage, so that each bin ideally contains a single taxon.
If a reference genome is provided by the user, the de novo metagenomic assembly is filtered by mapping the contigs to the reference, and the contigs that map constitute the final assembly. No binning is necessary, as the mapping retains only contigs that are likely the same taxon as the reference.

TheiaMeta Workflow Diagram

Inputs

The TheiaMeta_Illumina_PE workflow processes Illumina paired-end (PE) reads generated for metagenomic characterization (typically by shotgun sequencing). By default, this workflow assumes that input reads were generated using a 300-cycle sequencing kit (i.e. 2 x 150 bp reads). The optional trim_min_length parameter may need to be modified to accommodate shorter read data, such as 2 x 75 bp reads generated using a 150-cycle sequencing kit.

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
theiameta_illumina_pe	read1	File	Illumina forward read file in FASTQ file format (compression optional)		Required
theiameta_illumina_pe	read2	File	Illumina reverse read file in FASTQ file format (compression optional)		Required
theiameta_illumina_pe	samplename	String	The name of the sample being analyzed		Required
assembled_reads_percent	cpu	Int	Number of CPUs to allocate to the task	2	Optional
assembled_reads_percent	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
assembled_reads_percent	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17	Optional
assembled_reads_percent	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
bwa	cpu	Int	Number of CPUs to allocate to the task	6	Optional
bwa	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
bwa	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan	Optional
bwa	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
calculate_coverage	cpu	Int	Number of CPUs to allocate to the task	2	Optional
calculate_coverage	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
calculate_coverage	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/bedtools:2.31.0	Optional
calculate_coverage	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
calculate_coverage_paf	cpu	Int	Number of CPUs to allocate to the task	2	Optional
calculate_coverage_paf	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
calculate_coverage_paf	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/quay/ubuntu:latest	Optional
calculate_coverage_paf	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
compare_assemblies	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
kraken2_clean	cpu	Int	Number of CPUs to allocate to the task	4	Optional
kraken2_clean	disk_size	Int	Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) kraken2 databases, such as the "k2_standard" database	100	Optional
kraken2_clean	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db	Optional
kraken2_clean	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
kraken2_raw	cpu	Int	Number of CPUs to allocate to the task	4	Optional
kraken2_raw	disk_size	Int	Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) kraken2 databases, such as the "k2_standard" database	100	Optional
kraken2_raw	docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db	Optional
kraken2_raw	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
krona_clean	cpu	Int	Number of CPUs to allocate to the task	4	Optional
krona_clean	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
krona_clean	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/krona:2.8.1	Optional
krona_clean	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
krona_raw	cpu	Int	Number of CPUs to allocate to the task	4	Optional
krona_raw	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
krona_raw	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/krona:2.8.1	Optional
krona_raw	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
metaspades	metaspades_opts	String	Additional arguments to pass to metaspades task		Optional
minimap2_assembly	cpu	Int	Number of CPUs to allocate to the task	2	Optional
minimap2_assembly	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
minimap2_assembly	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22	Optional
minimap2_assembly	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
minimap2_assembly	query2	File	Internal component, do not modify		Optional
minimap2_reads	cpu	Int	Number of CPUs to allocate to the task	2	Optional
minimap2_reads	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
minimap2_reads	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22	Optional
minimap2_reads	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
quast	cpu	Int	Number of CPUs to allocate to the task	2	Optional
quast	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
quast	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2	Optional
quast	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
read_QC_trim	adapters	File	File with adapter sequences to be removed		Optional
read_QC_trim	bbduk_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
read_QC_trim	call_midas	Boolean	True/False variable that determines if the MIDAS task should be called.	FALSE	Optional
read_QC_trim	fastp_args	String	Additional arguments to use with fastp	--detect_adapter_for_pe -g -5 20 -3 20	Optional
read_QC_trim	midas_db	File	The database used by the MIDAS task in .tar.gz format	gs://theiagen-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz	Optional
read_QC_trim	phix	File	A file containing the phix used during Illumina sequencing; used in the BBDuk task		Optional
read_QC_trim	read_processing	String	The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp"	trimmomatic	Optional
read_QC_trim	read_qc	String	The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc"	fastq_scan	Optional
read_QC_trim	target_organism	String	Internal component, do not modify		Optional
read_QC_trim	trim_min_length	Int	Specifies minimum length of each read after trimming to be kept	75	Optional
read_QC_trim	trim_quality_min_score	Int	Specifies the average quality of bases in a sliding window to be kept	30	Optional
read_QC_trim	trim_window_size	Int	Specifies window size for trimming (the number of bases to average the quality across)	4	Optional
read_QC_trim	trimmomatic_args	String	Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data.	-phred33	Optional
retrieve_aligned_contig_paf	cpu	Int	Number of CPUs to allocate to the task	2	Optional
retrieve_aligned_contig_paf	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
retrieve_aligned_contig_paf	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/seqkit:2.3.1	Optional
retrieve_aligned_contig_paf	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
retrieve_aligned_pe_reads_sam	cpu	Int	Number of CPUs to allocate to the task	2	Optional
retrieve_aligned_pe_reads_sam	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
retrieve_aligned_pe_reads_sam	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17	Optional
retrieve_aligned_pe_reads_sam	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
retrieve_unaligned_pe_reads_sam	cpu	Int	Number of CPUs to allocate to the task	2	Optional
retrieve_unaligned_pe_reads_sam	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
retrieve_unaligned_pe_reads_sam	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17	Optional
retrieve_unaligned_pe_reads_sam	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
sam_to_sorted_bam	cpu	Int	Number of CPUs to allocate to the task	2	Optional
sam_to_sorted_bam	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
sam_to_sorted_bam	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17	Optional
sam_to_sorted_bam	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
semibin	cpu	Int	Number of CPUs to allocate to the task	6	Optional
semibin	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
semibin	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/semibin:2.0.2--pyhdfd78af_0	Optional
semibin	environment	String	Environment model to use. Options:• human_gut• dog_gut• ocean• soil• cat_gut• human_oral• mouse_gut• pig_gut• built_environment• wastewater• chicken_caecum• global	global	Optional
semibin	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
semibin	min_length	Int	Minimum contig length for binning	1000	Optional
semibin	ratio	Float	If the ratio of the number of base pairs in contigs between 1000 and 2500 bp is smaller than this value, the minimum contig length is set to 1000 bp; otherwise it is set to 2500 bp.	0.05	Optional
sort_bam_assembly_correction	cpu	Int	Number of CPUs to allocate to the task	2	Optional
sort_bam_assembly_correction	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
sort_bam_assembly_correction	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17	Optional
sort_bam_assembly_correction	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
theiameta_illumina_pe	kraken2_db	File	A Kraken2 database in .tar.gz format	gs://theiagen-public-files-rp/terra/theiaprok-files/k2_standard_08gb_20230605.tar.gz	Optional
theiameta_illumina_pe	output_additional_files	Boolean	Output additional files such as aligned and unaligned reads to reference	FALSE	Optional
theiameta_illumina_pe	reference	File	Reference file for consensus calling, in FASTA format		Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional

Workflow Tasks

versioning: Version Capture

The versioning task captures the workflow version from the GitHub (code repository) version.

Version Capture Technical details

	Links
Task	task_versioning.wdl

Read Cleaning and QC

HRRT: Human Host Sequence Removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.

NCBI-Scrub Technical Details

	Links
Task	task_ncbi_scrub.wdl
Software Source Code	HRRT on GitHub
Software Documentation	HRRT on NCBI

read_QC_trim: Read Quality Trimming, Adapter Removal, Quantification, and Identification

read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.

Read quality trimming

Either trimmomatic or fastp can be used for read-quality trimming. Trimmomatic is used by default. Both tools trim low-quality regions of reads with a sliding window (of size trim_window_size), cutting once the average quality within the window falls below trim_quality_min_score. Both discard the read if it is trimmed below trim_min_length.
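
The sliding-window rule described above can be sketched as follows. This is a simplified illustration, not the exact behavior of trimmomatic or fastp: the function name is hypothetical, the defaults mirror the trim_window_size, trim_quality_min_score, and trim_min_length values in the inputs table, and the read is assumed to be given as a list of per-base Phred scores.

```python
from typing import Optional


def sliding_window_trim(
    qualities: list[int],
    window_size: int = 4,
    min_quality: int = 30,
    min_length: int = 75,
) -> Optional[int]:
    """Return how many leading bases to keep, or None if the read is discarded.

    Hypothetical helper: scans windows from the 5' end and cuts at the start
    of the first window whose mean quality falls below min_quality.
    """
    keep = len(qualities)
    for start in range(len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < min_quality:
            keep = start  # drop this window and everything after it
            break
    # Reads trimmed below the minimum length are discarded entirely.
    return keep if keep >= min_length else None
```

With these defaults, a 100 bp read whose quality collapses after base 80 is kept in trimmed form, whereas a read whose quality collapses after base 50 is discarded because the trimmed read is shorter than 75 bp.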

read_processing input parameter

This input parameter accepts either trimmomatic or fastp as an input to determine which tool should be used for read quality trimming. This is set to trimmomatic by default.

If the fastp option is selected, see below for table of default parameters.

fastp default read-trimming parameters

Parameter	Explanation
-g	enables polyG tail trimming
-5 20	enables read end-trimming
-3 20	enables read end-trimming
--detect_adapter_for_pe	enables adapter-trimming only for paired-end reads

Additional arguments can be passed using the fastp_args optional parameter.

Trimmomatic and fastp Technical Details

	Links
Task	task_trimmomatic.wdl task_fastp.wdl
Software Source Code	Trimmomatic fastp on GitHub
Software Documentation	Trimmomatic fastp
Original Publication(s)	Trimmomatic: a flexible trimmer for Illumina sequence data fastp: an ultra-fast all-in-one FASTQ preprocessor

Adapter removal

The BBDuk task removes adapters from sequence reads. To do this:

Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files.
BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
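
The k-mer matching idea behind the PhiX filter can be sketched as below. This is a minimal illustration under stated assumptions, not BBDuk's implementation: the function names are hypothetical, sequences are assumed to contain only A, C, G, and T, and both strands of the reference are indexed.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_kmer_set(reference: str, k: int = 31) -> set[str]:
    """Index every k-mer of the reference on both strands."""
    kmers = set()
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        kmers.add(kmer)
        kmers.add(reverse_complement(kmer))
    return kmers


def has_kmer_match(read: str, kmers: set[str], k: int = 31) -> bool:
    """True if any k-mer of the read occurs in the reference index."""
    return any(read[i:i + k] in kmers for i in range(len(read) - k + 1))
```

A read for which has_kmer_match returns True against a PhiX index would be filtered out; reads shorter than k can never match.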

What are adapters and why do they need to be removed?

Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.

BBDuk Technical Details

	Links
Task	task_bbduk.wdl
Software Source Code	BBTools
Software Documentation	BBDuk

Read Quantification

There are two methods for read quantification to choose from: fastq-scan (default) or fastqc. Both quantify the forward and reverse reads in FASTQ files. For paired-end data, they also provide the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads. fastqc also provides a graphical visualization of the read quality.
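
As a rough sketch of what read quantification involves, the snippet below counts records in a FASTQ stream. It is not how fastq-scan or fastqc are implemented; the function names are hypothetical, each record is assumed to span exactly four lines, and raising an error on mismatched forward and reverse counts is an illustrative choice.

```python
import gzip
from typing import IO, Iterable


def open_fastq(path: str) -> IO[str]:
    """Open a FASTQ file as text, handling optional gzip compression."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def count_reads(lines: Iterable[str]) -> int:
    """Count records, assuming exactly four lines per FASTQ record."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of four")
    return n_lines // 4


def count_read_pairs(n_forward: int, n_reverse: int) -> int:
    """Return the number of pairs; mismatched counts are treated as an error."""
    if n_forward != n_reverse:
        raise ValueError("forward and reverse read counts differ")
    return n_forward
```

Running count_reads on the raw and the clean files of a sample gives the two numbers to compare: the clean count should not exceed the raw count.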

read_qc input parameter

This input parameter accepts either "fastq_scan" or "fastqc" as an input to determine which tool should be used for read quantification. This is set to "fastq_scan" by default.

fastq-scan and FastQC Technical Details

	Links
Task	task_fastq_scan.wdl task_fastqc.wdl
Software Source Code	fastq-scan on GitHub fastqc on GitHub
Software Documentation	fastq-scan fastqc

Read Identification with MIDAS (optional)

The MIDAS task is for the identification of reads to detect contamination with non-target taxa. This task is optional and turned off by default. It can be used by setting the call_midas input variable to true.

The MIDAS tool was originally designed for metagenomic sequencing data but has been co-opted for use with bacterial isolate WGS methods. It can be used to detect contamination present in raw sequencing data by estimating bacterial species abundance in bacterial isolate WGS data. If a secondary genus is detected above a relative frequency of 0.01 (1%), then the sample should fail QC and be investigated further for potential contamination.

This task is similar to the one used in the commercial software BioNumerics for estimating secondary species abundance.

How are the MIDAS output columns determined?

Example MIDAS report in the midas_report column:

species_id	count_reads	coverage	relative_abundance
Salmonella_enterica_58156	3309	89.88006645	0.855888033
Salmonella_enterica_58266	501	11.60606061	0.110519371
Salmonella_enterica_53987	99	2.232896237	0.021262881
Citrobacter_youngae_61659	46	0.995216227	0.009477003
Escherichia_coli_58110	5	0.123668877	0.001177644

MIDAS report column descriptions:

species_id: species identifier
count_reads: number of reads mapped to marker genes
coverage: estimated genome-coverage (i.e. read-depth) of species in metagenome
relative_abundance: estimated relative abundance of species in metagenome

The value in the midas_primary_genus column is derived by sorting the rows in descending order of "relative_abundance" and taking the genus of the top species in the "species_id" column (Salmonella). The value in the midas_secondary_genus column is the second-most prevalent genus in the "species_id" column (Citrobacter). The midas_secondary_genus_abundance column is the "relative_abundance" of that second-most prevalent genus (0.009477003), and midas_secondary_genus_coverage is its "coverage" (0.995216227).
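
The derivation above can be reproduced on the example report with a short sketch. This follows the worked example only and may differ from what the MIDAS task does internally (for instance, it takes the most abundant species row of a non-primary genus rather than summing abundances per genus). The function names are hypothetical, and the genus is assumed to be the text before the first underscore in species_id.

```python
import csv
import io

# Example report from above, as tab-separated text.
MIDAS_REPORT = (
    "species_id\tcount_reads\tcoverage\trelative_abundance\n"
    "Salmonella_enterica_58156\t3309\t89.88006645\t0.855888033\n"
    "Salmonella_enterica_58266\t501\t11.60606061\t0.110519371\n"
    "Salmonella_enterica_53987\t99\t2.232896237\t0.021262881\n"
    "Citrobacter_youngae_61659\t46\t0.995216227\t0.009477003\n"
    "Escherichia_coli_58110\t5\t0.123668877\t0.001177644\n"
)


def genus_of(species_id: str) -> str:
    """Assumes the genus is the part of species_id before the first underscore."""
    return species_id.split("_", 1)[0]


def summarize_midas(report_text: str) -> dict:
    rows = list(csv.DictReader(io.StringIO(report_text), delimiter="\t"))
    if not rows:
        raise ValueError("empty MIDAS report")
    # Order rows from most to least abundant.
    rows.sort(key=lambda row: float(row["relative_abundance"]), reverse=True)
    primary = genus_of(rows[0]["species_id"])
    # First row whose genus differs from the primary genus.
    secondary = next(
        (row for row in rows if genus_of(row["species_id"]) != primary), None
    )
    return {
        "midas_primary_genus": primary,
        "midas_secondary_genus": genus_of(secondary["species_id"]) if secondary else None,
        "midas_secondary_genus_abundance": float(secondary["relative_abundance"]) if secondary else 0.0,
        "midas_secondary_genus_coverage": float(secondary["coverage"]) if secondary else 0.0,
    }
```

For the example report, the secondary genus abundance (0.009477003) is below the 0.01 threshold mentioned earlier, so this sample would not be flagged on that criterion.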

MIDAS Reference Database Overview

The MIDAS reference database is a comprehensive tool for identifying bacterial species in metagenomic and bacterial isolate WGS data. It includes several layers of genomic data, helping detect species abundance and potential contaminants.

Key Components of the MIDAS Database

Species Groups:
- MIDAS clusters bacterial genomes based on 96.5% sequence identity, forming over 5,950 species groups from 31,007 genomes. These groups align with the gold-standard species definition (95% ANI), ensuring highly accurate species identification.
Genomic Data Structure:
- Marker Genes: Contains 15 universal single-copy genes used to estimate species abundance.
- Representative Genome: Each species group has a selected representative genome, which minimizes genetic variation and aids in accurate SNP identification.
- Pan-genome: The database includes clusters of non-redundant genes, with options for multi-level clustering (e.g., 99%, 95%, 90% identity), enabling MIDAS to identify gene content within strains at various clustering thresholds.
Taxonomic Annotation:
- Genomes are annotated based on consensus Latin names. Discrepancies in name assignments may occur due to factors like unclassified genomes or genus-level ambiguities.

Using the Default MIDAS Database

TheiaProk and TheiaEuk use the pre-loaded MIDAS database in Terra (see input table for current version) by default for bacterial species detection in metagenomic data, requiring no additional setup.

Create a Custom MIDAS Database

Users can also build their own custom MIDAS database if they want to include specific genomes or configurations. This custom database can replace the default MIDAS database used in Terra. To build a custom MIDAS database, follow the MIDAS GitHub guide on building a custom database. Once the database is built, users can upload it to a Google Cloud Storage bucket or Terra workspace and provide the link to the database in the midas_db input variable.

MIDAS Technical Details

	Links
Task	task_midas.wdl
Software Source Code	MIDAS
Software Documentation	MIDAS
Original Publication(s)	An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

Database-dependent

This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.

Kraken2 is run on the set of raw reads provided as input, as well as on the set of clean reads that result from the read_QC_trim workflow.

The Kraken2 software is database-dependent and taxonomic assignments are highly sensitive to the database used. An appropriate database should contain the expected organism(s) (e.g. Escherichia coli) and other taxa that may be present in the reads (e.g. Citrobacter freundii, a common contaminant).

Kraken2 Technical Details

	Links
Task	task_kraken2.wdl
Software Source Code	Kraken2 on GitHub
Software Documentation	https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
Original Publication(s)	Improved metagenomic analysis with Kraken 2

Assembly

metaspades: De Novo Metagenomic Assembly

While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.

metaspades is a de novo assembler that first constructs a de Bruijn graph of all the reads using the SPAdes algorithm. Through various graph simplification procedures, paths in the assembly graph are reconstructed that correspond to long genomic fragments within the metagenome. For more details, please see the original publication.

Common errors with SPAdes v4+

We found that MetaSPAdes v4+ can raise segmentation fault errors using our validation set of metagenomic samples, so MetaSPAdes v3+ is called by TheiaMeta. A newer version can be called by referencing a more recent container (e.g. "us-docker.pkg.dev/general-theiagen/staphb/spades:4.2.0") via the metaspades_pe docker input.

MetaSPAdes Technical Details

	Links
Task	task_metaspades.wdl
Software Source Code	SPAdes on GitHub
Software Documentation	SPAdes Manual
Original Publication(s)	metaSPAdes: a new versatile metagenomic assembler

minimap2: Assembly Correction

minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.

The mode used in this task is sr which is intended for "short single-end reads without splicing". The sr mode indicates the following parameters should be used: -k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -b0 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no. The output file is in SAM format.

For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.

minimap2 Technical Details

	Links
Task	task_minimap2.wdl
Software Source Code	minimap2 on GitHub
Software Documentation	minimap2
Original Publication(s)	Minimap2: pairwise alignment for nucleotide sequences

samtools: SAM File Conversion

This task takes the output SAM file from minimap2 and converts it to a BAM file. It then sorts the BAM based on the read names and generates an index file.

samtools Technical Details

	Links
Task	task_samtools.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools

pilon: Assembly Polishing

pilon is a tool that uses read alignment to correct errors in an assembly. It is used to polish the assembly produced by metaSPAdes. The input to Pilon is the sorted BAM file produced by samtools, and the original draft assembly produced by metaspades.

pilon Technical Details

	Links
Task	task_pilon.wdl
Software Source Code	Pilon on GitHub
Software Documentation	Pilon Wiki
Original Publication(s)	Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement

Reference Alignment & Contig Filtering

These tasks only run if a reference is provided through the reference optional input.

minimap2: Assembly Alignment and Contig Filtering

minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.

The mode used in this task is asm20 which is intended for "long assembly to reference mapping". The asm20 mode indicates the following parameters should be used: -k19 -w10 -U50,500 --rmq -r100k -g10k -A1 -B4 -O6,26 -E2,1 -s200 -z200 -N50. The output file is in PAF format.

For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.

minimap2 Technical Details

	Links
Task	task_minimap2.wdl
Software Source Code	minimap2 on GitHub
Software Documentation	minimap2
Original Publication(s)	Minimap2: pairwise alignment for nucleotide sequences

Parsing the PAF file into a FASTA file

Following the minimap2 alignment, the output PAF file is parsed into a FASTA file using seqkit and then coverage is calculated using awk.
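
The contig-retrieval step can be sketched as follows. The workflow itself uses seqkit for this; the Python below is only an illustration with hypothetical function names. It relies on the PAF convention that the first tab-separated column of each record is the query (contig) name, and it assumes that any contig with at least one alignment record is kept.

```python
def aligned_query_names(paf_text: str) -> set[str]:
    """Collect query names (PAF column 1) of all contigs with an alignment."""
    names = set()
    for line in paf_text.splitlines():
        if line.strip():
            names.add(line.split("\t")[0])
    return names


def filter_fasta(fasta_text: str, keep: set[str]) -> str:
    """Return only the FASTA records whose name (first header token) is in keep."""
    kept_lines = []
    writing = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            writing = bool(tokens) and tokens[0] in keep
        if writing:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")
```

Contigs that never appear in the PAF file are dropped, which is what restricts the final assembly to sequences resembling the reference.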

parse_mapping Technical Details

	Links
Task	task_parse_mapping.wdl#retrieve_aligned_contig_paf task_parse_mapping.wdl#calculate_coverage_paf
Software Source Code	seqkit on GitHub
Software Documentation	seqkit
Original Publication(s)	SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation SeqKit2: A Swiss army knife for sequence and alignment processing

Assembly QC

This task is run on either:

the reference-aligned contigs (if a reference was provided), or
the Pilon-polished assembly_fasta (if no reference was provided).

quast: Assembly Quality Assessment

QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
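
To make the metrics concrete, the sketch below computes the number of contigs, the largest contig, the total length, and N50 from a list of contig lengths. It is an illustration of the standard definitions rather than QUAST's code, and the function names are hypothetical. N50 is taken as the length of the contig at which the running total, summed from the longest contig downward, first reaches at least half of the assembly length.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def assembly_summary(contig_lengths: list[int]) -> dict:
    """Reference-free summary metrics for a set of contig lengths."""
    return {
        "number_contigs": len(contig_lengths),
        "largest_contig": max(contig_lengths, default=0),
        "assembly_length": sum(contig_lengths),
        "n50": n50(contig_lengths),
    }
```

For contigs of lengths 2 through 10 (total 54), the three longest contigs sum to 27, exactly half, so N50 is 8.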

QUAST Technical Details

	Links
Task	task_quast.wdl
Software Source Code	QUAST on GitHub
Software Documentation	https://quast.sourceforge.net/
Original Publication(s)	QUAST: quality assessment tool for genome assemblies

Binning

These tasks only run if a reference is not provided.

bwa: Read alignment to the assembly

If a reference is not provided, BWA (Burrows-Wheeler Aligner) is used to align the clean reads to the Pilon-polished assembly_fasta.

BWA Technical Details

	Links
Task	task_bwa.wdl
Software Source Code	https://github.com/lh3/bwa
Software Documentation	https://bio-bwa.sourceforge.net/
Original Publication(s)	Fast and accurate short read alignment with Burrows-Wheeler transform

semibin2: Metagenomic binning

After the alignment, the resulting BAM file and index and the Pilon-polished assembly_fasta will be binned with semibin2, a command-line tool for metagenomic binning with deep learning. Specifically, it uses a semi-supervised siamese neural network that uses knowledge from reference genomes while maintaining reference-exclusive bins. By default, the global environment model is used, though a variety of options that may be better suited to your sample are available and are listed in the relevant inputs section.

SemiBin2 Technical Details

	Links
Task	task_semibin2.wdl
Software Source Code	SemiBin2 on GitHub
Software Documentation	SemiBin2 ReadTheDocs
Original Publication(s)	A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments

Additional Outputs

These tasks only run if output_additional_files is set to true (default is false).

minimap2: Read Alignment to the Assembly

minimap2 is a popular aligner that is used to align reads (or assemblies) to an assembly file. In minimap2, "modes" are a group of preset options.

The mode used in this task is sr which is intended for "short single-end reads without splicing". The sr mode indicates the following parameters should be used: -k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -b0 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no. The output file is in SAM format.

For more information regarding modes and the available options for minimap2, please see the minimap2 manpage.

minimap2 Technical Details

	Links
Task	task_minimap2.wdl
Software Source Code	minimap2 on GitHub
Software Documentation	minimap2
Original Publication(s)	Minimap2: pairwise alignment for nucleotide sequences

samtools: SAM File Conversion (Round 2)

This task takes the output SAM file from minimap2 and converts it to a BAM file. It then sorts the BAM based on the read names and generates an index file.

samtools Technical Details

	Links
Task	task_samtools.wdl
Software Source Code	samtools on GitHub
Software Documentation	samtools
Original Publication(s)	The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools

Parsing the BAM file

Several tasks follow that perform the following functions:

Calculates the average depth of coverage of the assembly using bedtools.
Retrieves from the BAM file any unaligned reads using samtools.
Retrieves from the BAM file any aligned reads using samtools.
Calculates the percentage of reads that were assembled using samtools.
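
The separation of aligned from unaligned reads, and the resulting percentage, can be illustrated on SAM text. The workflow performs these steps with samtools on the BAM file; the sketch below only shows the underlying logic, with hypothetical function names. It uses the SAM FLAG bits for unmapped (0x4), secondary (0x100), and supplementary (0x800) records, and assumes that secondary and supplementary records are ignored so that each read is counted once.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def split_sam_records(sam_text: str) -> tuple[list[str], list[str]]:
    """Return (aligned, unaligned) read names from primary SAM records."""
    aligned, unaligned = [], []
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read only once
        (unaligned if flag & UNMAPPED else aligned).append(fields[0])
    return aligned, unaligned


def percent_aligned(aligned: list[str], unaligned: list[str]) -> float:
    """Percentage of reads that aligned to the assembly."""
    total = len(aligned) + len(unaligned)
    return 100.0 * len(aligned) / total if total else 0.0
```

Both mates of a pair appear as separate records, so a fully aligned pair contributes two entries to the aligned list.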

parse_mapping Technical Details

	Links
Task	task_parse_mapping.wdl#calculate_coverage task_parse_mapping.wdl#retrieve_pe_reads_bam task_parse_mapping.wdl#assembled_reads_percent
Software Source Code	bedtools on GitHub samtools on GitHub
Software Documentation	bedtools ReadTheDocs samtools
Original Publication(s)	BEDTools: a flexible suite of utilities for comparing genomic features The Sequence Alignment/Map format and SAMtools Twelve Years of SAMtools and BCFtools

Outputs

Variable	Type	Description
assembly_fasta	File	The final recovered metagenome-assembled genome (MAG). "A MAG represents a microbial genome by a group of sequences from genome assembly with similar characteristics. It enables [the identification of] novel species [to] understand their potential functions in a dynamic ecosystem"¹
assembly_length	Int	Length of assembly (total contig length) as determined by QUAST
assembly_mean_coverage	Float	Mean sequencing depth throughout the consensus assembly. Generated after performing primer trimming and calculated using the SAMtools coverage command
bbduk_docker	String	The Docker image for bbduk, which was used to remove the adapters from the sequences
bedtools_docker	String	The Docker image for bedtools, which was used to calculate coverage
bedtools_version	String	The version of bedtools, which was used to calculate coverage
contig_number	Int	The number of contigs in the assembly_fasta (see description for `assembly_fasta`)
fastp_html_report	File	The HTML report made with fastp
fastp_version	String	The version of fastp used
fastq_scan_clean1_json	File	The JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length
fastq_scan_clean2_json	File	The JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length
fastq_scan_docker	String	The Docker image of fastq_scan
fastq_scan_num_reads_clean1	Int	The number of forward reads after cleaning as calculated by fastq_scan
fastq_scan_num_reads_clean2	Int	The number of reverse reads after cleaning as calculated by fastq_scan
fastq_scan_num_reads_clean_pairs	String	The number of read pairs after cleaning as calculated by fastq_scan
fastq_scan_num_reads_raw1	Int	The number of input forward reads as calculated by fastq_scan
fastq_scan_num_reads_raw2	Int	The number of input reverse reads as calculated by fastq_scan
fastq_scan_num_reads_raw_pairs	String	The number of input read pairs as calculated by fastq_scan
fastq_scan_raw1_json	File	The JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length
fastq_scan_raw2_json	File	The JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length
fastq_scan_version	String	The version of fastq_scan
fastqc_clean1_html	File	An HTML file that provides a graphical visualization of clean forward read quality from fastqc to open in an internet browser
fastqc_clean2_html	File	An HTML file that provides a graphical visualization of clean reverse read quality from fastqc to open in an internet browser
fastqc_docker	String	The Docker container used for fastqc
fastqc_num_reads_clean1	Int	The number of forward reads after cleaning as calculated by fastqc
fastqc_num_reads_clean2	Int	The number of reverse reads after cleaning as calculated by fastqc
fastqc_num_reads_clean_pairs	String	The number of read pairs after cleaning as calculated by fastqc
fastqc_num_reads_raw1	Int	The number of input forward reads as calculated by fastqc
fastqc_num_reads_raw2	Int	The number of input reverse reads as calculated by fastqc
fastqc_num_reads_raw_pairs	String	The number of input read pairs as calculated by fastqc
fastqc_raw1_html	File	An HTML file that provides a graphical visualization of raw forward read quality from fastqc to open in an internet browser
fastqc_raw2_html	File	An HTML file that provides a graphical visualization of raw reverse read quality from fastqc to open in an internet browser
fastqc_version	String	Version of fastqc software used
kraken2_docker	String	Docker image used to run kraken2
kraken2_percent_human_clean	Float	The percentage of human-classified reads in the sample's clean reads
kraken2_percent_human_raw	Float	The percentage of human-classified reads in the sample's raw reads
kraken2_report_clean	File	The full Kraken report for the sample's clean reads
kraken2_report_raw	File	The full Kraken report for the sample's raw reads
kraken2_version	String	The version of kraken2 used
krona_docker	String	The Docker image of Krona
krona_html_clean	File	The KronaPlot after reads are cleaned
krona_html_raw	File	The KronaPlot before reads are cleaned
krona_version	String	The version of Krona
largest_contig	Int	The size of the largest contig in basepairs
metaspades_docker	String	The Docker image of metaspades
metaspades_version	String	The version of metaspades
midas_primary_genus	String	The primary genus detected by MIDAS
midas_report	File	TSV report of full MIDAS results
minimap2_docker	String	The Docker image of minimap2
minimap2_version	String	The version of minimap2
ncbi_scrub_docker	String	The Docker image for NCBI's HRRT (human read removal tool)
percent_coverage	Float	The percentage of the reference genome that is covered, if a reference genome was provided
percentage_mapped_reads	String	Percentage of reads that successfully aligned to the reference genome, calculated as (number of mapped reads / total number of reads) × 100
pilon_docker	String	The Docker image for pilon
pilon_version	String	The version of pilon
quast_docker	String	The Docker image of QUAST
quast_version	String	The version of QUAST
read1_clean	File	Forward read file after quality trimming and adapter removal
read1_dehosted	File	The dehosted forward reads file; suggested read file for SRA submission
read1_mapped	File	The forward reads that mapped to the assembly
read1_unmapped	File	The forward reads that did not map to the assembly
read2_clean	File	Reverse read file after quality trimming and adapter removal
read2_dehosted	File	The dehosted reverse reads file; suggested read file for SRA submission
read2_mapped	File	The reverse reads that mapped to the assembly
read2_unmapped	File	The reverse reads that did not map to the assembly
samtools_docker	String	The Docker image of samtools
samtools_version	String	The version of SAMtools used to sort and index the alignment file
semibin_bins	Array[File]	An array of binned metagenome-assembled genome (MAG) files
semibin_docker	String	The Docker image of semibin
semibin_version	String	The version of Semibin used
theiameta_illumina_pe_analysis_date	String	The date of analysis
theiameta_illumina_pe_version	String	The version of TheiaMeta used during execution
trimmomatic_docker	String	The Docker image used for the trimmomatic module in this workflow
trimmomatic_version	String	The version of Trimmomatic used

References

Human read removal tool (HRRT): https://github.com/ncbi/sra-human-scrubber

Trimmomatic: Anthony M. Bolger and others, Trimmomatic: a flexible trimmer for Illumina sequence data, Bioinformatics, Volume 30, Issue 15, August 2014, Pages 2114–2120, https://doi.org/10.1093/bioinformatics/btu170

Fastq-Scan: https://github.com/rpetit3/fastq-scan

metaSPAdes: Sergey Nurk and others, metaSPAdes: a new versatile metagenomic assembler, Genome Res. 2017 May; 27(5): 824–834, https://doi.org/10.1101/gr.213959.116

Pilon: Bruce J. Walker and others. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. Plos One. November 19, 2014. https://doi.org/10.1371/journal.pone.0112963

Minimap2: Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094–3100, https://doi.org/10.1093/bioinformatics/bty191

QUAST: Alexey Gurevich and others, QUAST: quality assessment tool for genome assemblies, Bioinformatics, Volume 29, Issue 8, April 2013, Pages 1072–1075, https://doi.org/10.1093/bioinformatics/btt086

Samtools: Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079. https://doi.org/10.1093/bioinformatics/btp352

BEDtools: Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842. https://doi.org/10.1093/bioinformatics/btq033

Bcftools: Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li. Twelve years of SAMtools and BCFtools. GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

Semibin2: Shaojun Pan, Xing-Ming Zhao, Luis Pedro Coelho, SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i21–i29, https://doi.org/10.1093/bioinformatics/btad209

¹ Direct quote from the abstract of Yang C, Chowdhury D, Zhang Z, Cheung WK, Lu A, Bian Z, Zhang L. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Comput Struct Biotechnol J. 2021;19:6301-14. doi: 10.1016/j.csbj.2021.11.028. This 2021 paper reviews some, though not all, of the tools used in this workflow.