Metabuli

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level	Dockstore
Standalone	Any taxa	vX.X.X	Yes	Sample-level	Metabuli_PHB

Metabuli Workflow

The Metabuli workflow assesses the taxonomic profile of raw sequencing data (FASTQ files).

Metabuli classifies both short and long reads by comparing them to reference genomes. Optionally, it can extract the reads assigned to a specific NCBI taxon ID of interest. Metabuli uses a novel k-mer structure, called a "metamer", which incorporates both the DNA sequence for high specificity and amino acid conservation for sensitive homology detection.

The Metabuli_PHB workflow additionally includes read trimming software, fastp (Illumina) and fastplong (ONT), for adapter trimming (recommended) and basic read preprocessing.

Metabuli Workflow Diagram

Databases

Database selection

The Metabuli software is database-dependent, and taxonomic assignments are highly sensitive to the database used. An appropriate database should contain the expected organism(s) (e.g. Escherichia coli) and other taxa that may be present in the reads (e.g. Citrobacter freundii, a common contaminant). To enable read extraction, the database and taxon inputs must correspond to an appropriate compressed taxdump, e.g. the NCBI taxdump for RefSeq databases and the GTDB taxdump for GTDB databases (see the suggested databases for examples).

Adjusting computational resources

The default random-access memory (RAM) allocation (metabuli_mem, 32 GB) is typically sufficient for Metabuli, though it may need to be increased if an out-of-memory (OOM) error is returned. Additionally, the default disk size (metabuli_disk_size, 250 GB) is sufficient for the databases noted below, but this input must be increased to accommodate larger databases based on their decompressed size.

Suggested databases

Database name	Database Description	Suggested Applications	GCP URI (for usage in Terra)	Source	Database Size [Decompressed] (GB)	Date of Last Update	Associated taxdump
viral	RefSeq viral + human (T2T-CHM13v2.0)	Viral metagenomics	`gs://theiagen-public-resources-rp/reference_data/databases/metabuli/refseq_virus-v223.tar.gz`	https://metabuli.steineggerlab.workers.dev/	4.0 [8.1]	2024/04/01	gs://theiagen-public-resources-rp/reference_data/databases/metabuli/ncbi_taxdump_20260211.tar.gz
GTDB	Prokaryote (Complete Genome/Chromosome, CheckM completeness > 90, and contamination <5) + human (T2T-CHM13v2.0)	Prokaryote metagenomics	`gs://theiagen-public-resources-rp/reference_data/databases/metabuli/gtdb_20240401.tar.gz`	https://metabuli.steineggerlab.workers.dev/	68.8 [117]	2024/04/01	gs://theiagen-public-resources-rp/reference_data/databases/metabuli/gtdb_taxdump_20250428.tar.gz

Inputs

taxon input parameter

Inputting a taxon (NCBI taxon ID/name) will enable read extraction within the workflow. The input taxon will be standardized via querying the NCBI taxonomy hierarchy in the ete4_identify task. Additionally, a parent taxonomic rank (e.g. "genus", "family", "order", etc.) can be set in ete4_identify to extract reads at a higher taxonomic level relative to the input taxon.

illumina input parameter

Setting illumina to "true" enables Illumina mode for single-end reads. Inputting a read2 implicitly sets illumina to "true".

Metabuli

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
metabuli_wf	metabuli_db	File			Required
metabuli_wf	read1	File	FASTQ file containing read1 sequences		Required
metabuli_wf	samplename	String	The name of the sample being analyzed		Required
ete4_identify	cpu	Int	Number of CPUs to allocate to the task	1	Optional
ete4_identify	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
ete4_identify	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/ete4:4.3.0	Optional
ete4_identify	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
ete4_identify	rank	String	Internal component, do not modify		Optional
fastp	cpu	Int	Number of CPUs to allocate to the task	4	Optional
fastp	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
fastp	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/fastp:1.1.0	Optional
fastp	fastp_adapter_fasta	File	FASTA of adapters to replace Fastp's default adapter search		Optional
fastp	fastp_args	String	Additional arguments to use with fastp		Optional
fastp	fastp_min_length	Int	Minimum read length	15	Optional
fastp	fastp_quality_trim_score	Int	Minimum mean base quality relative to window size	20	Optional
fastp	fastp_window_size	Int	Sliding window size for surveying read quality	4	Optional
fastp	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
fastplong	cpu	Int	Number of CPUs to allocate to the task	4	Optional
fastplong	cut_front	Boolean	Apply trimming options from 5' to 3'	False	Optional
fastplong	cut_tail	Boolean	Apply trimming options from 3' to 5'	False	Optional
fastplong	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
fastplong	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/fastplong:0.4.1	Optional
fastplong	fastplong_adapter_fasta	File	FASTA of adapter sequences to trim		Optional
fastplong	fastplong_args	String	Additional arguments to use with fastplong		Optional
fastplong	fastplong_end_adapter	String	Adapter sequence to trim from the 3' end		Optional
fastplong	fastplong_min_length	Int	Minimum read length	15	Optional
fastplong	fastplong_quality_trim_score	Int	Minimum mean base quality relative to window size	20	Optional
fastplong	fastplong_start_adapter	String	Adapter sequence to trim from the 5' end		Optional
fastplong	fastplong_trim_adapters	Boolean	Enable adapter trimming via fastplong	True	Optional
fastplong	fastplong_window_size	Int	Sliding window size for surveying read quality	4	Optional
fastplong	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
metabuli	cpu	Int	Number of CPUs to allocate to the task	4	Optional
metabuli	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/metabuli:1.1.1	Optional
metabuli	extract_unclassified	Boolean	True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads	False	Optional
metabuli	min_percent_coverage	Float	Minimum query coverage threshold (0.0 - 1.0)	0	Optional
metabuli	min_score	Float	Minimum sequence similarity score (0.0 - 1.0)	0	Optional
metabuli	min_sp_score	Float	Minimum score for species- or lower-level classification	0	Optional
metabuli_wf	call_trim	Boolean	Call adapter and read trimming via Fastp (Illumina) or Fastplong (ONT)	True	Optional
metabuli_wf	illumina	Boolean	Input reads are Illumina - automatically inferred if read2 is populated, ONT is assumed otherwise		Optional
metabuli_wf	metabuli_disk_size	Int	Amount of storage (in GB) to allocate to the task	250	Optional
metabuli_wf	metabuli_mem	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
metabuli_wf	read2	File	FASTQ file containing read2 sequences		Optional
metabuli_wf	taxdump	File	Path to compressed taxonomy dump that corresponds with database	gs://theiagen-public-resources-rp/reference_data/databases/metabuli/ncbi_taxdump_20260211.tar.gz	Optional
metabuli_wf	taxon	String	NCBI taxonomy compatible taxon name/ID to enable read extraction		Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
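As a rough illustration of how the inputs in the table above fit together, the following Python sketch assembles a WDL-style inputs JSON. It assumes that workflow-level inputs are keyed as `metabuli_wf.<variable>`, following the "Terra Task Name" column; the helper `build_inputs`, the sample name, and the read paths are hypothetical, while the database URI is the viral database listed under the suggested databases.

```python
import json

# Assumed key prefix: the workflow name shown in the "Terra Task Name" column.
WORKFLOW = "metabuli_wf"


def build_inputs(samplename, read1, metabuli_db, read2=None, taxon=None, taxdump=None):
    """Return a dict of workflow inputs keyed as '<workflow>.<variable>'.

    Optional inputs are only included when a value is supplied, so the
    workflow defaults apply otherwise.
    """
    inputs = {
        f"{WORKFLOW}.samplename": samplename,
        f"{WORKFLOW}.read1": read1,
        f"{WORKFLOW}.metabuli_db": metabuli_db,
    }
    optional = {"read2": read2, "taxon": taxon, "taxdump": taxdump}
    for name, value in optional.items():
        if value is not None:
            inputs[f"{WORKFLOW}.{name}"] = value
    return inputs


example = build_inputs(
    samplename="sample01",                          # hypothetical sample name
    read1="gs://my-bucket/sample01_R1.fastq.gz",    # hypothetical path
    read2="gs://my-bucket/sample01_R2.fastq.gz",    # hypothetical path
    metabuli_db=(
        "gs://theiagen-public-resources-rp/reference_data/databases/"
        "metabuli/refseq_virus-v223.tar.gz"
    ),
    taxon="11676",  # supplying a taxon enables read extraction
)
print(json.dumps(example, indent=2))
```

Because read2 is supplied in this example, the illumina input is left unset; per the description above, it is inferred automatically.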

Workflow Tasks

ete4_identify

The ete4_identify task parses the NCBI taxonomy hierarchy using a user's inputted taxon and desired taxonomic rank. This task returns a taxon ID, name, and rank, which facilitates downstream functions, including read classification, targeted read extraction, and genomic characterization modules.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank a.k.a read_extraction_rank input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important

The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
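The rule in the examples above can be sketched as a simple comparison of rank positions. This is a minimal illustration only, not the ete4_identify implementation: the function name `rank_is_valid` is hypothetical, and the ordering covers only the valid options listed above (real NCBI lineages also contain intermediate ranks such as subfamily or clade).

```python
# Valid rank options from the documentation, ordered from lowest to highest.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "domain"]


def rank_is_valid(input_taxon_rank, requested_rank):
    """Return True if requested_rank is equal to or above input_taxon_rank."""
    for rank in (input_taxon_rank, requested_rank):
        if rank not in RANKS:
            raise ValueError(f"unsupported rank: {rank!r}")
    return RANKS.index(requested_rank) >= RANKS.index(input_taxon_rank)


# Lyssavirus rabies (species) with rank "family": allowed.
print(rank_is_valid("species", "family"))
# Lyssavirus (genus) with rank "species": not allowed, so the task would fail.
print(rank_is_valid("genus", "species"))
```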

ete4 Identify Technical Details

	Links
Task	task_ete4_taxon_id.wdl
Software Source Code	ete4 on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

fastp: Read Trimming

fastp trims low-quality regions with a sliding window (with a default window size of 4, specified with fastp_window_size), cutting once the average quality within the window falls below the fastp_quality_trim_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below fastp_min_length (default of 15 bases).
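To make the sliding-window behaviour described above concrete, here is a simplified Python sketch. It is not fastp's implementation: it assumes a single 5' to 3' scan that cuts the read at the first window whose mean quality falls below the threshold, and the function name `trim_read` is hypothetical. The defaults mirror fastp_window_size, fastp_quality_trim_score, and fastp_min_length from the inputs table.

```python
def trim_read(quals, window_size=4, min_mean_quality=20, min_length=15):
    """Return the number of bases kept, or None if the read is discarded.

    quals is a list of per-base Phred quality scores in 5' to 3' order.
    """
    keep = len(quals)
    # Slide the window one base at a time along the read.
    for start in range(0, len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_mean_quality:
            # Cut from the start of the first failing window onward.
            keep = start
            break
    # Reads trimmed below the minimum length are discarded.
    return keep if keep >= min_length else None


# 20 high-quality bases followed by 10 low-quality bases.
print(trim_read([30] * 20 + [10] * 10))
# 10 high-quality bases followed by 10 low-quality bases: too short after trimming.
print(trim_read([30] * 10 + [10] * 10))
```

In the first call the read is cut at position 19, the first window whose mean drops below 20; in the second the surviving 9 bases fall below the minimum length, so the read is dropped.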

Adapter trimming is enabled by default and can be disabled by setting fastp_trim_adapters to "false".

Additional arguments can be passed using the fastp_args optional parameter. Please reference the Fastp GitHub for a comprehensive list of arguments.

fastp Technical Details

	Links
Task	task_fastp.wdl
Software Source Code	fastp on GitHub
Software Documentation	fastp on GitHub
Original Publication(s)	fastp: an ultra-fast all-in-one FASTQ preprocessor

fastplong: ONT Read Trimming

fastplong trims low-quality regions with a sliding window (with a default window size of 4, specified with fastplong_window_size), cutting once the average quality within the window falls below the fastplong_quality_trim_score (default of 20). The read is discarded if it is trimmed below fastplong_min_length (default of 15 bases). The direction in which the window moves can be specified by setting cut_front (5' to 3') or cut_tail (3' to 5') to "true".

Adapter trimming is enabled by default and can be disabled by setting fastplong_trim_adapters to "false". Automatic adapter detection is enabled by default, though a FASTA of adapters, a start adapter sequence, or an end adapter sequence can be specified with the fastplong_adapter_fasta, fastplong_start_adapter, or fastplong_end_adapter inputs, respectively.

Additional arguments can be passed using the fastplong_args optional parameter. Please reference the Fastp GitHub for a comprehensive list of arguments.

fastplong Technical Details

	Links
Task	task_fastp.wdl
Software Source Code	fastp on GitHub
Software Documentation	fastp on GitHub
Original Publication(s)	Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp

metabuli

The metabuli task is used to classify and optionally extract reads against a reference database. Metabuli uses a novel k-mer structure, called metamer, to analyze both amino acid (AA) and DNA sequences. It leverages AA conservation for sensitive homology detection and DNA mutations for specific differentiation between closely related taxa.

taxon_id input parameter

taxon_id triggers read extraction by retrieving the inputted NCBI taxon ID and all descendant taxon IDs derived from the input.
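The descendant lookup described above can be sketched as a traversal of a parent-to-children map. This is an illustrative sketch, not Metabuli's implementation: the function name `descendants` is hypothetical, and the toy child-to-parent map uses only the Retroviridae lineage from the example report in this document.

```python
from collections import defaultdict

# Toy child -> parent map taken from the example report lineage:
# Ortervirales (2169561) > Retroviridae (11632) > Orthoretrovirinae (327045)
# > Lentivirus (11646) > Human immunodeficiency virus 1 (11676).
PARENT_OF = {11676: 11646, 11646: 327045, 327045: 11632, 11632: 2169561}


def descendants(taxon_id, parent_of):
    """Return the input taxon ID together with all of its descendant taxon IDs."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    found = {taxon_id}
    stack = [taxon_id]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


# Reads assigned to any of these IDs would be candidates for extraction.
print(sorted(descendants(11646, PARENT_OF)))
```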

Precision mode and min_score / min_sp_score input parameters

The min_score parameter is the minimum score (DNA-level identity) required for a read to be classified and the min_sp_score parameter is the minimum score for a read to be classified at or below species rank. Metabuli precision mode is defined by its authors as more stringently setting the min_score and min_sp_score parameters for specific read types:

Illumina short reads: min_score = 0.15, min_sp_score = 0.5
ONT long reads: min_score = 0.008
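The precision-mode values listed above can be captured in a small lookup. This is a hedged sketch: the function name `precision_settings` and its arguments are hypothetical, and the read-type inference simply follows the illumina and read2 behaviour described in the Inputs section.

```python
# Precision-mode thresholds as listed above (defined by the Metabuli authors).
PRECISION_MODE = {
    "illumina": {"min_score": 0.15, "min_sp_score": 0.5},
    "ont": {"min_score": 0.008},
}


def precision_settings(illumina=False, read2_provided=False):
    """Return the precision-mode thresholds for the inferred read type.

    Reads are treated as Illumina if illumina is true or a read2 file is
    provided; ONT is assumed otherwise.
    """
    read_type = "illumina" if (illumina or read2_provided) else "ont"
    return dict(PRECISION_MODE[read_type])


print(precision_settings(illumina=True))
print(precision_settings())
```

The returned values correspond to the min_score and min_sp_score inputs of the metabuli task.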

Metabuli is run on both raw and human dehosted reads.

`metabuli_db` must be set to activate Metabuli read classification for TheiaProk.

taxdump_path input parameter

The `taxdump_path` directs the task toward a taxonkit-generated taxdump file, e.g. [from NCBI](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) or [from GTDB](https://github.com/shenwei356/gtdb-taxdump/releases). This is not necessary to edit unless users want a more recent taxdump than what Theiagen hosts, or if users want to reference a different taxonomy. By default, Theiagen uses the NCBI taxonomy hierarchy.

cpu / memory input parameters

Increasing the memory and cpus allocated to Metabuli can substantially increase throughput.

extract_unclassified input parameter

This parameter determines whether unclassified reads should also be extracted and combined with the `taxon`-specific extracted reads. By default, this is set to `false`, meaning that only reads classified to the specified input `taxon` will be extracted.

Metabuli Technical Details

|  | Links |
| --- | --- |
| Task | [task_metabuli.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/taxon_id/contamination/task_metabuli.wdl) |
| Software Source Code | [Metabuli on GitHub](https://github.com/steineggerlab/Metabuli) |
| Software Documentation | [Metabuli Documentation](https://github.com/steineggerlab/Metabuli/blob/master/README.md) |
| Original Publication(s) | [Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA](https://doi.org/10.1038/s41592-024-02273-y) |

Outputs

Metabuli

Variable	Type	Description
ete4_docker	String	Docker image used for ETE4 taxonomy parsing
ete4_version	String	The version of ETE4 used
fastp_docker	String	Docker image used for fastp
fastp_html_report	File	The HTML report conveying fastp results
fastp_json_report	File	The JSON report conveying fastp results
fastp_read1_trimmed	File	`read1` input trimmed by fastp
fastp_read2_trimmed	File	`read2` input trimmed by fastp
fastp_version	String	The version of fastp used
fastplong_docker	String	Docker image used for fastplong
fastplong_html_report	File	The HTML report conveying fastplong results
fastplong_json_report	File	The JSON report conveying fastplong results
fastplong_read1_trimmed	File	`read1` input trimmed by fastplong
fastplong_version	String	The version of fastplong used
metabuli_classified_read1	File	FASTQ of `read1` input classified by Metabuli
metabuli_classified_read2	File	FASTQ of `read2` input classified by Metabuli
metabuli_classified_report	File	Classification report from Metabuli
metabuli_docker	String	Docker image used for Metabuli
metabuli_krona_report	File	Krona visualization report from Metabuli
metabuli_report	File	Classification report from Metabuli
metabuli_status	String	Status of Metabuli analysis
metabuli_version	String	Version of Metabuli used
metabuli_wf_analysis_date	String	Analysis date for Metabuli workflow
metabuli_wf_version	String	Version of Metabuli workflow
ncbi_read_extraction_rank	String	Read extraction rank used
ncbi_taxon_id	String	NCBI taxonomy ID of inputted organism following rank extraction
ncbi_taxon_name	String	NCBI taxonomy name of inputted taxon following rank extraction

Interpretation of results

The most important output of the Metabuli workflow is the metabuli_report file. It includes a breakdown of the number of sequences assigned to each taxon and the percentage of reads assigned. A complete description of the report format can be found here.

When assessing the taxonomic identity of a single isolate's sequence, it is normal for a few reads to be assigned to very closely related taxa due to the sequence identity they share. "Very closely related taxa" may be genetically similar species in the same genus, or taxa with which the dominant species has undergone horizontal gene transfer. The presence of unrelated taxa, or a high abundance of these closely related taxa, is indicative of contamination or sequencing of non-target taxa. Interpretation of the results is dependent on the biological context.

Example Metabuli report

Below is an example metabuli_report for a Human immunodeficiency virus 1 sample. Only the first 13 lines are included here since the rows near the bottom are <0.08% of the reads, which are likely human-derived contamination.

From this report, we can see that ~98.78% of the reads were assigned at the species level (species in the 4th column) to "Human immunodeficiency virus 1". ~1.15% of the reads were unclassified, and the remaining <0.08% of reads are annotated as Homo sapiens (not depicted).

 #clade_proportion  clade_count taxon_count rank    taxID   name
 1.1457 3045    3045    no rank 0   unclassified
 98.8543    262722  1   no rank 1   root
 98.7850    262538  0   superkingdom    10239     Viruses
 98.7843    262536  0   clade   2559587     Riboviria
 98.7843    262536  0   kingdom 2732397       Pararnavirae
 98.7843    262536  0   phylum  2732409         Artverviricota
 98.7843    262536  0   class   2732514           Revtraviricetes
 98.7843    262536  0   order   2169561             Ortervirales
 98.7843    262536  0   family  11632                 Retroviridae
 98.7843    262536  0   subfamily   327045                  Orthoretrovirinae
 98.7843    262536  0   genus   11646                     Lentivirus
 **98.7843  262536  262536  species 11676                       Human immunodeficiency virus 1**
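A report like the one above can be summarised programmatically. The sketch below is an assumption-laden illustration: it assumes the report file is tab-delimited with the six columns named in the header, the function name `parse_report` is hypothetical, and the embedded rows are a subset of the example above.

```python
# A subset of the example report, written as tab-delimited text (assumed format).
REPORT = "\n".join([
    "#clade_proportion\tclade_count\ttaxon_count\trank\ttaxID\tname",
    "1.1457\t3045\t3045\tno rank\t0\tunclassified",
    "98.8543\t262722\t1\tno rank\t1\troot",
    "98.7850\t262538\t0\tsuperkingdom\t10239\t  Viruses",
    "98.7843\t262536\t0\tgenus\t11646\t                    Lentivirus",
    "98.7843\t262536\t262536\tspecies\t11676\t                      Human immunodeficiency virus 1",
])


def parse_report(text):
    """Parse report text into a list of dicts, one per taxon row."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and the header
        proportion, clade_count, taxon_count, rank, tax_id, name = line.split("\t")
        rows.append({
            "clade_proportion": float(proportion),
            "clade_count": int(clade_count),
            "taxon_count": int(taxon_count),
            "rank": rank,
            "tax_id": int(tax_id),
            "name": name.strip(),  # drop the indentation that encodes depth
        })
    return rows


rows = parse_report(REPORT)
species = [row for row in rows if row["rank"] == "species"]
by_name = {row["name"]: row for row in rows}
# Classified reads that fall outside Viruses (root minus Viruses).
non_viral = by_name["root"]["clade_proportion"] - by_name["Viruses"]["clade_proportion"]

for row in species:
    print(row["name"], row["clade_proportion"])
print(f"classified but non-viral: {non_viral:.4f}%")
```

The difference between the root and Viruses rows (about 0.07%) corresponds to the rows omitted from the example.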

Krona visualisation of Metabuli report

Krona produces an interactive report that allows hierarchical data, such as the one from Metabuli, to be explored with zooming, multi-layered pie charts. These pie charts are intuitive and highly responsive.

Example Krona report

Below is an example of the krona_html for a bacterial sample. Taxonomic rank is organised from the centre of the pie chart to the edge, with each slice representing the relative abundance of a given taxon in the sample.
