
Metabuli

Quick Facts

| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore |
| --- | --- | --- | --- | --- | --- |
| Standalone | Any taxa | vX.X.X | Yes | Sample-level | Metabuli_PHB |

Metabuli Workflow

The Metabuli workflow assesses the taxonomic profile of raw sequencing data (FASTQ files).

Metabuli classifies both short and long reads by comparing them to reference genomes. Optionally, it can extract the reads assigned to a specific NCBI taxon ID of interest. Metabuli uses a novel k-mer structure, called a "metamer", which incorporates both the DNA sequence for high specificity and amino acid conservation for sensitive homology detection.

The Metabuli_PHB workflow additionally includes the read trimming tools fastp (Illumina) and fastplong (ONT) for adapter trimming (recommended) and basic read preprocessing.

Metabuli Workflow Diagram

Databases

Database selection

The Metabuli software is database-dependent and taxonomic assignments are highly sensitive to the database used. An appropriate database should contain the expected organism(s) (e.g. Escherichia coli) and other taxa that may be present in the reads (e.g. Citrobacter freundii, a common contaminant). To enable read extraction, the database taxon inputs must correspond to an appropriate compressed taxdump, e.g. NCBI taxdump for RefSeq databases and GTDB taxdump for GTDB databases (see suggested databases for example).

Adjusting computational resources

Default random-access memory (RAM) is typically sufficient for Metabuli, though metabuli_mem (32 GB by default) may need to be increased if an out-of-memory (OOM) error is returned. Additionally, the default disk size (metabuli_disk_size, 250 GB) is sufficient for the databases noted below, but this input must be increased to accommodate larger databases based on their decompressed size.
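The disk-size guidance above can be checked with simple arithmetic. The sketch below is only an illustration: the function name, the 20% headroom, the 10 GB read allowance, and the assumption that the compressed archive and the decompressed database occupy the disk at the same time are all hypothetical, not taken from the workflow.

```python
def estimate_disk_gb(compressed_gb, decompressed_gb, reads_gb=0.0, headroom=1.2):
    """Rough estimate: archive + extracted database + reads, with headroom.

    Assumption: both the compressed and decompressed database are on disk
    at the same time. The headroom factor is an arbitrary illustrative value.
    """
    return (compressed_gb + decompressed_gb + reads_gb) * headroom


DEFAULT_DISK_GB = 250  # default metabuli_disk_size from the inputs table

# Sizes taken from the "Suggested databases" table (GTDB: 68.8 [117] GB).
gtdb_needed = estimate_disk_gb(68.8, 117.0, reads_gb=10.0)
print(f"GTDB estimate: {gtdb_needed:.1f} GB, fits default: {gtdb_needed <= DEFAULT_DISK_GB}")
```

If the estimate for a custom database exceeds the default, metabuli_disk_size would be the input to raise.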

Suggested databases

| Database name | Database Description | Suggested Applications | GCP URI (for usage in Terra) | Source | Database Size [Decompressed] (GB) | Date of Last Update | Associated taxdump |
| --- | --- | --- | --- | --- | --- | --- | --- |
| viral | RefSeq viral + human (T2T-CHM13v2.0) | Viral metagenomics | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/refseq_virus-v223.tar.gz | https://metabuli.steineggerlab.workers.dev/ | 4.0 [8.1] | 2024/04/01 | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/ncbi_taxdump_20260211.tar.gz |
| GTDB | Prokaryote (Complete Genome/Chromosome, CheckM completeness > 90, and contamination <5) + human (T2T-CHM13v2.0) | Prokaryote metagenomics | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/gtdb_20240401.tar.gz | https://metabuli.steineggerlab.workers.dev/ | 68.8 [117] | 2024/04/01 | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/gtdb_taxdump_20250428.tar.gz |

Inputs

taxon input parameter

Inputting a taxon (NCBI taxon ID/name) will enable read extraction within the workflow. The input taxon will be standardized via querying the NCBI taxonomy hierarchy in the ete4_identify task. Additionally, a parent taxonomic rank (e.g. "genus", "family", "order", etc.) can be set in ete4_identify to extract reads at a higher taxonomic level relative to the input taxon.

illumina input parameter

Setting illumina to "true" enables Illumina mode for single-end reads. Inputting a read2 implicitly sets illumina to "true".
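The inference rule described above can be sketched as a small function. This is an illustration of the documented behaviour only; the function name and the returned labels are hypothetical and do not come from the workflow source.

```python
from typing import Optional


def resolve_read_mode(read2: Optional[str] = None, illumina: Optional[bool] = None) -> str:
    """Mirror the documented rule: read2 implies Illumina; otherwise ONT unless illumina is true."""
    if read2 is not None:
        # A second read file implicitly enables Illumina (paired-end) mode.
        return "illumina_paired"
    if illumina:
        # Explicit opt-in for single-end Illumina reads.
        return "illumina_single"
    # No read2 and illumina unset or false: ONT is assumed.
    return "ont"


print(resolve_read_mode(read2="sample_R2.fastq.gz"))
print(resolve_read_mode(illumina=True))
print(resolve_read_mode())
```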

| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
| --- | --- | --- | --- | --- | --- |
| metabuli_wf | metabuli_db | File |  |  | Required |
| metabuli_wf | read1 | File | FASTQ file containing read1 sequences |  | Required |
| metabuli_wf | samplename | String | The name of the sample being analyzed |  | Required |
| ete4_identify | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| ete4_identify | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| ete4_identify | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ete4:4.3.0 | Optional |
| ete4_identify | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| ete4_identify | rank | String | Internal component, do not modify |  | Optional |
| fastp | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| fastp | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| fastp | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/fastp:1.1.0 | Optional |
| fastp | fastp_adapter_fasta | File | FASTA of adapters to replace fastp's default adapter search |  | Optional |
| fastp | fastp_args | String | Additional arguments to use with fastp |  | Optional |
| fastp | fastp_min_length | Int | Minimum read length | 15 | Optional |
| fastp | fastp_quality_trim_score | Int | Minimum mean base quality relative to window size | 20 | Optional |
| fastp | fastp_window_size | Int | Sliding window size for surveying read quality | 4 | Optional |
| fastp | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| fastplong | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| fastplong | cut_front | Boolean | Apply trimming options from 5' to 3' | False | Optional |
| fastplong | cut_tail | Boolean | Apply trimming options from 3' to 5' | False | Optional |
| fastplong | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| fastplong | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/fastplong:0.4.1 | Optional |
| fastplong | fastplong_adapter_fasta | File | FASTA of adapter sequences to trim |  | Optional |
| fastplong | fastplong_args | String | Additional arguments to use with fastplong |  | Optional |
| fastplong | fastplong_end_adapter | String | Adapter sequence to trim from the 3' end |  | Optional |
| fastplong | fastplong_min_length | Int | Minimum read length | 15 | Optional |
| fastplong | fastplong_quality_trim_score | Int | Minimum mean base quality relative to window size | 20 | Optional |
| fastplong | fastplong_start_adapter | String | Adapter sequence to trim from the 5' end |  | Optional |
| fastplong | fastplong_trim_adapters | Boolean | Enable adapter trimming via fastplong | True | Optional |
| fastplong | fastplong_window_size | Int | Sliding window size for surveying read quality | 4 | Optional |
| fastplong | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| metabuli | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| metabuli | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/metabuli:1.1.1 | Optional |
| metabuli | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon-specific extracted reads | False | Optional |
| metabuli | min_percent_coverage | Float | Minimum query coverage threshold (0.0 - 1.0) | 0 | Optional |
| metabuli | min_score | Float | Minimum sequence similarity score (0.0 - 1.0) | 0 | Optional |
| metabuli | min_sp_score | Float | Minimum score for species- or lower-level classification | 0 | Optional |
| metabuli_wf | call_trim | Boolean | Call adapter and read trimming via fastp (Illumina) or fastplong (ONT) | True | Optional |
| metabuli_wf | illumina | Boolean | Input reads are Illumina - automatically inferred if read2 is populated, ONT is assumed otherwise |  | Optional |
| metabuli_wf | metabuli_disk_size | Int | Amount of storage (in GB) to allocate to the task | 250 | Optional |
| metabuli_wf | metabuli_mem | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| metabuli_wf | read2 | File | FASTQ file containing read2 sequences |  | Optional |
| metabuli_wf | taxdump | File | Path to compressed taxonomy dump that corresponds with database | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/ncbi_taxdump_20260211.tar.gz | Optional |
| metabuli_wf | taxon | String | NCBI taxonomy compatible taxon name/ID to enable read extraction |  | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) |  | Optional |
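For command-line runs, the inputs above are typically supplied as a JSON file. The sketch below writes such a file under a few assumptions: the keys follow the common `workflow_name.input_name` convention with `metabuli_wf` as the workflow name, and the sample name, read paths, and output filename are placeholders rather than real resources.

```python
import json

# Hypothetical sample and bucket paths; database and taxdump URIs are the
# ones listed under "Suggested databases".
DB_PREFIX = "gs://theiagen-public-resources-rp/reference_data/databases/metabuli"

inputs = {
    "metabuli_wf.samplename": "sample01",
    "metabuli_wf.read1": "gs://my-bucket/sample01_R1.fastq.gz",
    "metabuli_wf.read2": "gs://my-bucket/sample01_R2.fastq.gz",  # implies Illumina mode
    "metabuli_wf.metabuli_db": f"{DB_PREFIX}/refseq_virus-v223.tar.gz",
    "metabuli_wf.taxdump": f"{DB_PREFIX}/ncbi_taxdump_20260211.tar.gz",
    "metabuli_wf.taxon": "Lyssavirus rabies",  # optional: enables read extraction
}

with open("metabuli_inputs.json", "w") as handle:
    json.dump(inputs, handle, indent=2)

print(json.dumps(inputs, indent=2))
```

Only the three required inputs (metabuli_db, read1, samplename) must be present; the remaining keys can be dropped to fall back on the defaults.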

Workflow Tasks

ete4_identify

The ete4_identify task parses the NCBI taxonomy hierarchy from a user's inputted taxonomy and desired taxonomic rank. This task returns a taxon ID, name, and rank, which facilitates downstream functions, including read classification, targeted read extraction, and genomic characterization modules.

taxon input parameter

This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).

rank (a.k.a. read_extraction_rank) input parameter

Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.

Important
  • The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.

Examples:

  • If your input taxon is Lyssavirus rabies (species level) with rank set to family, the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
  • If your input taxon is Lyssavirus (genus level) with rank set to species, the task will fail because it cannot determine species information from an inputted genus.
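The rule and the two examples above can be expressed as a simple ordering check. This is a minimal sketch of the constraint only, not the task's implementation; the `RANKS` list and `rank_is_valid` function are hypothetical names, and the rank order is taken from the list of valid options.

```python
# Valid rank options, ordered from lowest to highest.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "domain"]


def rank_is_valid(input_taxon_rank: str, requested_rank: str) -> bool:
    """Return True if the requested rank is equal to or above the input taxon's rank."""
    return RANKS.index(requested_rank) >= RANKS.index(input_taxon_rank)


# Lyssavirus rabies (species) reported at family level: allowed.
print(rank_is_valid("species", "family"))
# Lyssavirus (genus) reported at species level: not allowed, the task would fail.
print(rank_is_valid("genus", "species"))
```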

ete4 Identify Technical Details

|  | Links |
| --- | --- |
| Task | task_ete4_taxon_id.wdl |
| Software Source Code | ete4 on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |

fastp: Read Trimming

fastp trims low-quality regions with a sliding window (with a default window size of 4, specified with fastp_window_size), cutting once the average quality within the window falls below the fastp_quality_trim_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below fastp_min_length (default of 15 bases).
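The sliding-window behaviour described above can be illustrated with a simplified sketch. It is not fastp's implementation: it assumes the window scans from the 5' end and that everything from the first failing window onward is removed, and the function and argument names are hypothetical. Default values mirror the workflow defaults (window of 4, quality of 20, minimum length of 15).

```python
def trim_sliding_window(seq, quals, window_size=4, min_mean_q=20, min_length=15):
    """Cut the read at the first window whose mean quality is below the threshold.

    Returns the trimmed sequence, or None if it falls below min_length.
    """
    cut = len(seq)
    for start in range(max(len(seq) - window_size + 1, 0)):
        window = quals[start:start + window_size]
        if sum(window) / len(window) < min_mean_q:
            cut = start  # drop this window and everything after it
            break
    trimmed = seq[:cut]
    return trimmed if len(trimmed) >= min_length else None


read = "ACGT" * 7 + "AC"            # 30 bases
qualities = [30] * 20 + [5] * 10    # quality collapses after base 20
print(trim_sliding_window(read, qualities))
```

In this example the first window with a mean below 20 starts at position 18, so the read is cut to 18 bases and retained because it is still at least 15 bases long.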

Adapter trimming is enabled by default and can be disabled by setting fastp_trim_adapters to "false".

Additional arguments can be passed using the optional fastp_args parameter. Please refer to the fastp GitHub repository for a comprehensive list of arguments.

fastp Technical Details

|  | Links |
| --- | --- |
| Task | task_fastp.wdl |
| Software Source Code | fastp on GitHub |
| Software Documentation | fastp on GitHub |
| Original Publication(s) | fastp: an ultra-fast all-in-one FASTQ preprocessor |

fastplong: ONT Read Trimming

fastplong trims low-quality regions with a sliding window (with a default window size of 4, specified with fastplong_window_size), cutting once the average quality within the window falls below the fastplong_quality_trim_score (default of 20). The read is discarded if it is trimmed below fastplong_min_length (default of 15 bases). The direction in which the window moves can be specified by setting cut_front (5' to 3') or cut_tail (3' to 5') to "true".
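The difference between the two directions can be sketched as follows. This is a simplified illustration that assumes each option removes bases from its end of the read for as long as the window mean stays below the threshold; the real tool's behaviour may differ in detail, and the function names are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def cut_front_index(quals, window_size=4, min_mean_q=20):
    """Slide 5' to 3'; return the index of the first base to keep."""
    start = 0
    while start + window_size <= len(quals) and _mean(quals[start:start + window_size]) < min_mean_q:
        start += 1
    return start


def cut_tail_index(quals, window_size=4, min_mean_q=20):
    """Slide 3' to 5'; return the index one past the last base to keep."""
    end = len(quals)
    while end - window_size >= 0 and _mean(quals[end - window_size:end]) < min_mean_q:
        end -= 1
    return end


example_quals = [5, 5, 5, 30, 30, 30, 30, 30, 5, 5]
keep_from = cut_front_index(example_quals)
keep_to = cut_tail_index(example_quals)
print(example_quals[keep_from:keep_to])
```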

Adapter trimming is enabled by default and can be disabled by setting fastplong_trim_adapters to "false". Automatic adapter detection is enabled by default, though an adapter FASTA, a start (5') adapter sequence, or an end (3') adapter sequence can be specified with the fastplong_adapter_fasta, fastplong_start_adapter, or fastplong_end_adapter inputs, respectively.

Additional arguments can be passed using the optional fastplong_args parameter. Please refer to the fastplong GitHub repository for a comprehensive list of arguments.

fastplong Technical Details

|  | Links |
| --- | --- |
| Task | task_fastp.wdl |
| Software Source Code | fastp on GitHub |
| Software Documentation | fastp on GitHub |
| Original Publication(s) | Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp |

metabuli

The metabuli task is used to classify and optionally extract reads against a reference database. Metabuli uses a novel k-mer structure, called metamer, to analyze both amino acid (AA) and DNA sequences. It leverages AA conservation for sensitive homology detection and DNA mutations for specific differentiation between closely related taxa.
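The idea behind joint amino acid and DNA analysis can be shown with a toy example. The sketch below is not Metabuli's metamer encoding: it simply pairs the translated peptide of a short DNA window with its codons, to show how two sequences can match at the amino acid level while differing at the DNA level. The function name is hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def toy_metamer(dna):
    """Return (peptide, codons) for a DNA window; a toy stand-in for a metamer."""
    usable = len(dna) - len(dna) % 3
    codons = tuple(dna[i:i + 3] for i in range(0, usable, 3))
    peptide = "".join(CODON_TABLE[codon] for codon in codons)
    return peptide, codons


a = toy_metamer("GCTGAAAAA")
b = toy_metamer("GCCGAGAAG")  # synonymous substitutions in every codon
print(a[0] == b[0])  # same peptide: amino acid level homology
print(a[1] == b[1])  # different codons: DNA level distinction
```

Matching on the peptide corresponds to the sensitive homology signal, while the codon differences are what would separate closely related taxa.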

taxon_id input parameter

Setting taxon_id triggers read extraction: reads classified to the inputted NCBI taxon ID, or to any of its descendant taxon IDs, are retrieved.
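Collecting a taxon together with all of its descendants amounts to a traversal of the taxonomy tree. The following sketch assumes a simple parent-to-children mapping; the `CHILDREN` dictionary is a hand-built fragment using the Retroviridae lineage from the example report in this document, and the function name is hypothetical.

```python
from collections import deque

# Hand-built fragment of the taxonomy: Retroviridae > Orthoretrovirinae > Lentivirus > HIV-1.
CHILDREN = {
    11632: [327045],
    327045: [11646],
    11646: [11676],
    11676: [],
}


def taxon_and_descendants(taxon_id, children):
    """Breadth-first traversal returning the input taxon ID and every descendant ID."""
    found = {taxon_id}
    queue = deque([taxon_id])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


print(sorted(taxon_and_descendants(11646, CHILDREN)))
```

Reads whose assigned taxon ID is in the returned set would be the ones selected for extraction.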

Precision mode and min_score / min_sp_score input parameters

The min_score parameter is the minimum score (DNA-level identity) required for a read to be classified and the min_sp_score parameter is the minimum score for a read to be classified at or below species rank. Metabuli precision mode is defined by its authors as more stringently setting the min_score and min_sp_score parameters for specific read types:

  • Illumina short reads: min_score = 0.15, min_sp_score = 0.5
  • ONT long reads: min_score = 0.008

Metabuli is run on both raw and human dehosted reads. `metabuli_db` must be set to activate Metabuli read classification for TheiaProk.
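The precision-mode thresholds listed above can be kept in a small lookup so they are applied consistently. This is a convenience sketch; the dictionary and function names are hypothetical, and no min_sp_score is set for ONT because none is given above.

```python
# Precision-mode thresholds as listed above.
PRECISION_MODE = {
    "illumina": {"min_score": 0.15, "min_sp_score": 0.5},
    "ont": {"min_score": 0.008},
}


def precision_settings(read_type):
    """Return a copy of the precision-mode thresholds for a read type."""
    if read_type not in PRECISION_MODE:
        raise ValueError(f"unknown read type: {read_type!r}")
    return dict(PRECISION_MODE[read_type])


print(precision_settings("illumina"))
print(precision_settings("ont"))
```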

??? dna "`taxdump_path` input parameter"
    The `taxdump_path` directs the task toward a taxonkit-generated taxdump file, e.g. [from NCBI](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) or [from GTDB](https://github.com/shenwei356/gtdb-taxdump/releases). This is not necessary to edit unless users want a more recent taxdump than what Theiagen hosts, or if users want to reference a different taxonomy. By default, Theiagen uses the NCBI taxonomy hierarchy.

??? dna "`cpu` / `memory` input parameters"
    Increasing the memory and CPUs allocated to Metabuli can substantially increase throughput.

??? dna "`extract_unclassified` input parameter"
    This parameter determines whether unclassified reads should also be extracted and combined with the `taxon`-specific extracted reads. By default, this is set to `false`, meaning that only reads classified to the specified input `taxon` will be extracted.


!!! techdetails "Metabuli Technical Details"
    |  | Links |
    | --- | --- |
    | Task | [task_metabuli.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/taxon_id/contamination/task_metabuli.wdl) |
    | Software Source Code | [Metabuli on GitHub](https://github.com/steineggerlab/Metabuli) |
    | Software Documentation | [Metabuli Documentation](https://github.com/steineggerlab/Metabuli/blob/master/README.md) |
    | Original Publication(s) | [Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA](https://doi.org/10.1038/s41592-024-02273-y) |

Outputs

| Variable | Type | Description |
| --- | --- | --- |
| ete4_docker | String | Docker image used for ETE4 taxonomy parsing |
| ete4_version | String | The version of ETE4 used |
| fastp_docker | String | Docker image used for fastp |
| fastp_html_report | File | The HTML report conveying fastp results |
| fastp_json_report | File | The JSON report conveying fastp results |
| fastp_read1_trimmed | File | read1 input trimmed by fastp |
| fastp_read2_trimmed | File | read2 input trimmed by fastp |
| fastp_version | String | The version of fastp used |
| fastplong_docker | String | Docker image used for fastplong |
| fastplong_html_report | File | The HTML report conveying fastplong results |
| fastplong_json_report | File | The JSON report conveying fastplong results |
| fastplong_read1_trimmed | File | read1 input trimmed by fastplong |
| fastplong_version | String | The version of fastplong used |
| metabuli_classified_read1 | File | FASTQ of read1 input classified by Metabuli |
| metabuli_classified_read2 | File | FASTQ of read2 input classified by Metabuli |
| metabuli_classified_report | File | Classification report from Metabuli |
| metabuli_docker | String | Docker image used for Metabuli |
| metabuli_krona_report | File | Krona visualization report from Metabuli |
| metabuli_report | File | Classification report from Metabuli |
| metabuli_status | String | Status of Metabuli analysis |
| metabuli_version | String | Version of Metabuli used |
| metabuli_wf_analysis_date | String | Analysis date for Metabuli workflow |
| metabuli_wf_version | String | Version of Metabuli workflow |
| ncbi_read_extraction_rank | String | Read extraction rank used |
| ncbi_taxon_id | String | NCBI taxonomy ID of inputted organism following rank extraction |
| ncbi_taxon_name | String | NCBI taxonomy name of inputted taxon following rank extraction |

Interpretation of results

The most important outputs of the Metabuli workflows are the metabuli_report files. These will include a breakdown of the number of sequences assigned to a particular taxon, and the percentage of reads assigned. A complete description of the report format can be found here.

When assessing the taxonomic identity of a single isolate's sequence, it is normal for a few reads to be assigned to very closely related taxa due to the sequence identity they share. "Very closely related taxa" may be genetically similar species in the same genus, or taxa with which the dominant species has undergone horizontal gene transfer. The presence of unrelated taxa, or a high abundance of these closely related taxa, is indicative of contamination or sequencing of non-target taxa. Interpretation of the results is dependent on the biological context.

Example Metabuli report

Below is an example metabuli_report for a Human immunodeficiency virus 1 sample. Only the first 13 lines are included here since the rows near the bottom are <0.08% of the reads, which are likely human-derived contamination.

From this report, we can see that ~98.78% of the reads were assigned at the species level (species in the 4th column) to "Human immunodeficiency virus 1". ~1.15% of the reads were unclassified, and the remaining <0.08% of reads are annotated as Homo sapiens (not depicted).

 #clade_proportion  clade_count taxon_count rank    taxID   name
 1.1457 3045    3045    no rank 0   unclassified
 98.8543    262722  1   no rank 1   root
 98.7850    262538  0   superkingdom    10239     Viruses
 98.7843    262536  0   clade   2559587     Riboviria
 98.7843    262536  0   kingdom 2732397       Pararnavirae
 98.7843    262536  0   phylum  2732409         Artverviricota
 98.7843    262536  0   class   2732514           Revtraviricetes
 98.7843    262536  0   order   2169561             Ortervirales
 98.7843    262536  0   family  11632                 Retroviridae
 98.7843    262536  0   subfamily   327045                  Orthoretrovirinae
 98.7843    262536  0   genus   11646                     Lentivirus
 **98.7843  262536  262536  species 11676                       Human immunodeficiency virus 1**
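A report like the one above can be read programmatically to pull out the species-level assignments. The sketch below assumes the report file is tab-delimited with the six columns shown in the header line and that taxon names are indented with leading spaces; the function name and the embedded sample rows are illustrative only.

```python
def parse_metabuli_report(text):
    """Parse report text into a list of dicts, one per taxon row."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and the header
        proportion, clade_count, taxon_count, rank, taxid, name = line.split("\t")
        rows.append({
            "clade_proportion": float(proportion),
            "clade_count": int(clade_count),
            "taxon_count": int(taxon_count),
            "rank": rank.strip(),
            "taxid": int(taxid),
            "name": name.strip(),  # drop the indentation that encodes depth
        })
    return rows


sample_report = "\n".join([
    "#clade_proportion\tclade_count\ttaxon_count\trank\ttaxID\tname",
    "1.1457\t3045\t3045\tno rank\t0\tunclassified",
    "98.7843\t262536\t0\tgenus\t11646\t  Lentivirus",
    "98.7843\t262536\t262536\tspecies\t11676\t    Human immunodeficiency virus 1",
])

species_rows = [row for row in parse_metabuli_report(sample_report) if row["rank"] == "species"]
for row in species_rows:
    print(row["name"], row["clade_proportion"])
```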

Krona visualisation of Metabuli report

Krona produces an interactive report that allows hierarchical data, such as the one from Metabuli, to be explored with zooming, multi-layered pie charts. These pie charts are intuitive and highly responsive.

Example Krona report

Below is an example of the krona_html for a bacterial sample. Taxonomic rank is organised from the centre of the pie chart to the edge, with each slice representing the relative abundance of a given taxon in the sample.

Example Krona Report