Metabuli
Quick Facts
| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore |
|---|---|---|---|---|---|
| Standalone | Any taxa | vX.X.X | Yes | Sample-level | Metabuli_PHB |
Metabuli Workflow
The Metabuli workflow assesses the taxonomic profile of raw sequencing data (FASTQ files).
Metabuli classifies both short and long reads by comparing them to reference genomes. Optionally, it can extract the reads assigned to a specific NCBI taxon ID of interest. Metabuli uses a novel k-mer structure, called a "metamer", which incorporates both DNA sequence for high specificity and amino acid conservation for sensitive homology detection.
The Metabuli_PHB workflow additionally includes read trimming software, Fastp (Illumina) and fastplong (ONT), for adapter trimming (recommended) and basic read preprocessing.
Databases
Database selection
The Metabuli software is database-dependent and taxonomic assignments are highly sensitive to the database used. An appropriate database should contain the expected organism(s) (e.g. Escherichia coli) and other taxa that may be present in the reads (e.g. Citrobacter freundii, a common contaminant). To enable read extraction, the database taxon inputs must correspond to an appropriate compressed taxdump, e.g. NCBI taxdump for RefSeq databases and GTDB taxdump for GTDB databases (see suggested databases for example).
Adjusting computational resources
The default memory allocation (metabuli_mem) is typically sufficient for Metabuli, though it may need to be increased if an out-of-memory (OOM) error is returned. Additionally, the default disk size (metabuli_disk_size) is sufficient for the databases noted below, but this input must be increased to accommodate larger databases based on their decompressed size.
Suggested databases
| Database name | Database Description | Suggested Applications | GCP URI (for usage in Terra) | Source | Database Size [Decompressed] (GB) | Date of Last Update | Associated taxdump |
|---|---|---|---|---|---|---|---|
| viral | RefSeq viral + human (T2T-CHM13v2.0) | Viral metagenomics | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/refseq_virus-v223.tar.gz | https://metabuli.steineggerlab.workers.dev/ | 4.0 [8.1] | 2024/04/01 | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/ncbi_taxdump_20260211.tar.gz |
| GTDB | Prokaryote (Complete Genome/Chromosome, CheckM completeness > 90, and contamination < 5) + human (T2T-CHM13v2.0) | Prokaryote metagenomics | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/gtdb_20240401.tar.gz | https://metabuli.steineggerlab.workers.dev/ | 68.8 [117] | 2024/04/01 | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/gtdb_taxdump_20250428.tar.gz |
Inputs
taxon input parameter
Inputting a taxon (NCBI taxon ID/name) will enable read extraction within the workflow. The input taxon will be standardized via querying the NCBI taxonomy hierarchy in the ete4_identify task. Additionally, a parent taxonomic rank (e.g. "genus", "family", "order", etc.) can be set in ete4_identify to extract reads at a higher taxonomic level relative to the input taxon.
illumina input parameter
Setting illumina to "true" enables Illumina mode for single-end reads. Inputting a read2 implicitly sets illumina to "true".
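The rule above can be sketched in a few lines. This is only an illustration of the described behaviour, not the workflow's WDL logic; the function name and the returned labels are hypothetical.

```python
from typing import Optional


def resolve_read_mode(illumina: Optional[bool], read2: Optional[str]) -> str:
    """Illustrative only: mirror how the read type is described as being inferred.

    The labels returned here are hypothetical and are not workflow outputs.
    """
    if read2:
        # Supplying read2 implies paired-end Illumina data.
        return "illumina_paired"
    if illumina:
        # Single-end Illumina reads must be flagged explicitly.
        return "illumina_single"
    # Otherwise ONT long reads are assumed.
    return "ont"


print(resolve_read_mode(illumina=None, read2="sample_R2.fastq.gz"))
print(resolve_read_mode(illumina=True, read2=None))
print(resolve_read_mode(illumina=None, read2=None))
```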
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| metabuli_wf | metabuli_db | File | | | Required |
| metabuli_wf | read1 | File | FASTQ file containing read1 sequences | | Required |
| metabuli_wf | samplename | String | The name of the sample being analyzed | | Required |
| ete4_identify | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| ete4_identify | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| ete4_identify | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ete4:4.3.0 | Optional |
| ete4_identify | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| ete4_identify | rank | String | Internal component, do not modify | | Optional |
| fastp | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| fastp | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| fastp | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/fastp:1.1.0 | Optional |
| fastp | fastp_adapter_fasta | File | FASTA of adapters to replace Fastp's default adapter search | | Optional |
| fastp | fastp_args | String | Additional arguments to use with fastp | | Optional |
| fastp | fastp_min_length | Int | Minimum read length | 15 | Optional |
| fastp | fastp_quality_trim_score | Int | Minimum mean base quality relative to window size | 20 | Optional |
| fastp | fastp_window_size | Int | Sliding window size for surveying read quality | 4 | Optional |
| fastp | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| fastplong | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| fastplong | cut_front | Boolean | Apply trimming options from 5' to 3' | False | Optional |
| fastplong | cut_tail | Boolean | Apply trimming options from 3' to 5' | False | Optional |
| fastplong | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| fastplong | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/fastplong:0.4.1 | Optional |
| fastplong | fastplong_adapter_fasta | File | FASTA of adapter sequences to trim | | Optional |
| fastplong | fastplong_args | String | Additional arguments to use with fastplong | | Optional |
| fastplong | fastplong_end_adapter | String | Adapter sequence to trim from the 3' end | | Optional |
| fastplong | fastplong_min_length | Int | Minimum read length | 15 | Optional |
| fastplong | fastplong_quality_trim_score | Int | Minimum mean base quality relative to window size | 20 | Optional |
| fastplong | fastplong_start_adapter | String | Adapter sequence to trim from the 5' end | | Optional |
| fastplong | fastplong_trim_adapters | Boolean | Enable adapter trimming via fastplong | True | Optional |
| fastplong | fastplong_window_size | Int | Sliding window size for surveying read quality | 4 | Optional |
| fastplong | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| metabuli | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| metabuli | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/metabuli:1.1.1 | Optional |
| metabuli | extract_unclassified | Boolean | True/False that determines if unclassified reads should be extracted and combined with the taxon specific extracted reads | False | Optional |
| metabuli | min_percent_coverage | Float | Minimum query coverage threshold (0.0 - 1.0) | 0 | Optional |
| metabuli | min_score | Float | Minimum sequence similarity score (0.0 - 1.0) | 0 | Optional |
| metabuli | min_sp_score | Float | Minimum score for species- or lower-level classification | 0 | Optional |
| metabuli_wf | call_trim | Boolean | Call adapter and read trimming via Fastp (Illumina) or Fastplong (ONT) | True | Optional |
| metabuli_wf | illumina | Boolean | Input reads are Illumina - automatically inferred if read2 is populated, ONT is assumed otherwise | | Optional |
| metabuli_wf | metabuli_disk_size | Int | Amount of storage (in GB) to allocate to the task | 250 | Optional |
| metabuli_wf | metabuli_mem | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| metabuli_wf | read2 | File | FASTQ file containing read2 sequences | | Optional |
| metabuli_wf | taxdump | File | Path to compressed taxonomy dump that corresponds with database | gs://theiagen-public-resources-rp/reference_data/databases/metabuli/ncbi_taxdump_20260211.tar.gz | Optional |
| metabuli_wf | taxon | String | NCBI taxonomy compatible taxon name/ID to enable read extraction | | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
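For command-line use, the required and optional inputs in the table above are typically supplied as a JSON file. The sketch below assembles one in Python, assuming the common WDL convention of prefixing each input with the workflow name (metabuli_wf). The helper name, sample name, and read path are hypothetical; the database URI is the viral database listed under Suggested databases.

```python
import json
from typing import Dict, Optional


def build_inputs(
    samplename: str,
    read1: str,
    metabuli_db: str,
    read2: Optional[str] = None,
    taxon: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble an inputs dictionary keyed as '<workflow>.<input>' (assumed convention)."""
    inputs = {
        "metabuli_wf.samplename": samplename,
        "metabuli_wf.read1": read1,
        "metabuli_wf.metabuli_db": metabuli_db,
    }
    # Optional inputs are only written when they are provided.
    if read2 is not None:
        inputs["metabuli_wf.read2"] = read2
    if taxon is not None:
        inputs["metabuli_wf.taxon"] = taxon
    return inputs


example = build_inputs(
    samplename="sample01",  # hypothetical sample name
    read1="gs://my-bucket/sample01.fastq.gz",  # hypothetical path
    metabuli_db=(
        "gs://theiagen-public-resources-rp/reference_data/databases/"
        "metabuli/refseq_virus-v223.tar.gz"
    ),
    taxon="11676",  # Human immunodeficiency virus 1; enables read extraction
)
print(json.dumps(example, indent=2))
```

Because read2 is omitted here, the reads would be treated as ONT unless illumina is also set.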
Workflow Tasks
ete4_identify
The ete4_identify task parses the NCBI taxonomy hierarchy using a user's inputted taxon and desired taxonomic rank. This task returns a taxon ID, name, and rank, which facilitates downstream functions, including read classification, targeted read extraction, and genomic characterization modules.
taxon input parameter
This parameter accepts either an NCBI taxon ID (e.g. 11292) or an organism name (e.g. Lyssavirus rabies).
rank a.k.a read_extraction_rank input parameter
Valid options include: "species", "genus", "family", "order", "class", "phylum", "kingdom", or "domain". By default it is set to "family". This parameter filters metadata to report information only at the taxonomic rank specified by the user, regardless of the taxonomic rank implied by the original input taxon.
Important
- The rank parameter must specify a taxonomic rank that is equal to or above the input taxon's taxonomic rank.
Examples:
- If your input taxon is Lyssavirus rabies (species level) with rank set to "family", the task will return information for the family of Lyssavirus rabies: taxon ID for Rhabdoviridae (11270), name "Rhabdoviridae", and rank "family".
- If your input taxon is Lyssavirus (genus level) with rank set to "species", the task will fail because it cannot determine species information from an inputted genus.
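The rank constraint can be illustrated with a small check. This is a sketch of the rule only, using the valid rank options listed above ordered from lowest to highest; it does not query the NCBI taxonomy, and the function name is hypothetical rather than part of the ete4_identify task.

```python
# Valid rank options, ordered from lowest to highest.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "domain"]


def check_rank(taxon_rank: str, requested_rank: str) -> None:
    """Raise ValueError if the requested rank is below the input taxon's rank."""
    if RANKS.index(requested_rank) < RANKS.index(taxon_rank):
        raise ValueError(
            f"rank '{requested_rank}' is below the input taxon's rank '{taxon_rank}'"
        )


# Lyssavirus rabies (species) reported at family level: allowed.
check_rank("species", "family")

# Lyssavirus (genus) reported at species level: rejected.
try:
    check_rank("genus", "species")
except ValueError as err:
    print(f"Rejected: {err}")
```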
ete4 Identify Technical Details
| Links | |
|---|---|
| Task | task_ete4_taxon_id.wdl |
| Software Source Code | ete4 on GitHub |
| Software Documentation | NCBI Datasets Documentation on NCBI |
| Original Publication(s) | Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets |
fastp: Read Trimming
fastp trims low-quality regions with a sliding window (with a default window size of 4, specified with fastp_window_size), cutting once the average quality within the window falls below the fastp_quality_trim_score (default of 20 for paired-end, 30 for single-end). The read is discarded if it is trimmed below fastp_min_length (default of 15 bases).
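The sliding-window idea can be sketched as follows. This is a simplified illustration that scans from the 5' end and cuts at the first failing window; it is not fastp's implementation, and the function name is hypothetical. The defaults mirror fastp_window_size, fastp_quality_trim_score, and fastp_min_length.

```python
from typing import List, Optional


def sliding_window_trim(
    quals: List[int],
    window_size: int = 4,
    min_mean_quality: float = 20,
    min_length: int = 15,
) -> Optional[int]:
    """Return the number of bases kept, or None if the read would be discarded.

    Simplified: slide a window from the 5' end and cut at the start of the
    first window whose mean quality falls below the threshold.
    """
    keep = len(quals)
    for start in range(max(len(quals) - window_size + 1, 0)):
        window = quals[start:start + window_size]
        if sum(window) / len(window) < min_mean_quality:
            keep = start
            break
    # Reads trimmed below the minimum length are discarded.
    return keep if keep >= min_length else None


# 20 high-quality bases followed by 10 low-quality bases.
print(sliding_window_trim([30] * 20 + [5] * 10))
# A read that becomes too short after trimming.
print(sliding_window_trim([30] * 10 + [5] * 10))
```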
Adapter trimming is enabled by default and can be disabled by setting fastp_trim_adapters to "false".
Additional arguments can be passed using the fastp_args optional parameter. Please reference the Fastp GitHub for a comprehensive list of arguments.
fastp Technical Details
| Links | |
|---|---|
| Task | task_fastp.wdl |
| Software Source Code | fastp on GitHub |
| Software Documentation | fastp on GitHub |
| Original Publication(s) | fastp: an ultra-fast all-in-one FASTQ preprocessor |
fastplong: ONT Read Trimming
fastplong trims low-quality regions with a sliding window (with a default window size of 4, specified with fastplong_window_size), cutting once the average quality within the window falls below the fastplong_quality_trim_score (default of 20). The read is discarded if it is trimmed below fastplong_min_length (default of 15 bases). The direction in which the window moves can be specified by setting cut_front (5' to 3') or cut_tail (3' to 5') to "true".
Adapter trimming is enabled by default and can be disabled by setting fastplong_trim_adapters to "false". Automatic adapter detection is enabled by default, though a FASTA, a string of start, or a string of end adapters can be specified with fastplong_adapter_fasta, fastplong_start_adapter, or fastplong_end_adapter inputs respectively.
Additional arguments can be passed using the fastplong_args optional parameter. Please reference the Fastp GitHub for a comprehensive list of arguments.
fastplong Technical Details
| Links | |
|---|---|
| Task | task_fastp.wdl |
| Software Source Code | fastp on GitHub |
| Software Documentation | fastp on GitHub |
| Original Publication(s) | Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp |
metabuli
The metabuli task is used to classify and optionally extract reads against a reference database. Metabuli uses a novel k-mer structure, called metamer, to analyze both amino acid (AA) and DNA sequences. It leverages AA conservation for sensitive homology detection and DNA mutations for specific differentiation between closely related taxa.
taxon_id input parameter
taxon_id triggers read extraction by retrieving the inputted NCBI taxon ID and all descendant taxon IDs derived from the input.
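Collecting a taxon together with all of its descendants amounts to a traversal of the taxonomy tree. The sketch below uses a toy parent-to-children map containing only the Retroviridae lineage shown in the example report later on this page; a real taxonomy has many more branches, and the function name is hypothetical rather than the task's actual code.

```python
from collections import deque
from typing import Dict, List

# Toy taxonomy (parent taxon ID -> child taxon IDs); deliberately incomplete.
CHILDREN: Dict[int, List[int]] = {
    11632: [327045],   # Retroviridae -> Orthoretrovirinae
    327045: [11646],   # Orthoretrovirinae -> Lentivirus
    11646: [11676],    # Lentivirus -> Human immunodeficiency virus 1
    11676: [],
}


def taxon_with_descendants(taxon_id: int, children: Dict[int, List[int]]) -> List[int]:
    """Breadth-first traversal returning the taxon ID and all descendant IDs."""
    collected = []
    queue = deque([taxon_id])
    while queue:
        current = queue.popleft()
        collected.append(current)
        queue.extend(children.get(current, []))
    return collected


# Reads classified to any of these IDs would be candidates for extraction.
print(taxon_with_descendants(11646, CHILDREN))
```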
Precision mode and min_score / min_sp_score input parameters
The min_score parameter is the minimum score (DNA-level identity) required for a read to be classified and the min_sp_score parameter is the minimum score for a read to be classified at or below species rank. Metabuli precision mode is defined by its authors as more stringently setting the min_score and min_sp_score parameters for specific read types:
- Illumina short reads: min_score = 0.15, min_sp_score = 0.5
- ONT long reads: min_score = 0.008
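The thresholds above can be captured in a small lookup for use when preparing inputs. This is a sketch: the dictionary and function names are hypothetical, and the values are the precision-mode thresholds listed above layered over the workflow defaults of 0.

```python
from typing import Dict

# Workflow defaults from the inputs table.
DEFAULTS: Dict[str, float] = {"min_score": 0.0, "min_sp_score": 0.0}

# Precision-mode thresholds as listed above.
PRECISION_MODE: Dict[str, Dict[str, float]] = {
    "illumina": {"min_score": 0.15, "min_sp_score": 0.5},
    "ont": {"min_score": 0.008},
}


def precision_thresholds(read_type: str) -> Dict[str, float]:
    """Return score thresholds for a read type, falling back to the defaults."""
    thresholds = dict(DEFAULTS)
    thresholds.update(PRECISION_MODE[read_type])
    return thresholds


print(precision_thresholds("illumina"))
print(precision_thresholds("ont"))
```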
Metabuli is run on both raw and human dehosted reads.
`metabuli_db` must be set to activate Metabuli read classification for TheiaProk.
taxdump_path input parameter
The `taxdump_path` directs the task toward a taxonkit-generated taxdump file, e.g. [from NCBI](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) or [from GTDB](https://github.com/shenwei356/gtdb-taxdump/releases). This is not necessary to edit unless users want a more recent taxdump than what Theiagen hosts, or if users want to reference a different taxonomy. By default, Theiagen uses the NCBI taxonomy hierarchy.
cpu / memory input parameters
Increasing the memory and CPUs allocated to Metabuli can substantially increase throughput.
extract_unclassified input parameter
This parameter determines whether unclassified reads should also be extracted and combined with the `taxon`-specific extracted reads. By default, this is set to `false`, meaning that only reads classified to the specified input `taxon` will be extracted.
Metabuli Technical Details
| | Links |
| --- | --- |
| Task | [task_metabuli.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/taxon_id/contamination/task_metabuli.wdl) |
| Software Source Code | [Metabuli on GitHub](https://github.com/steineggerlab/Metabuli) |
| Software Documentation | [Metabuli Documentation](https://github.com/steineggerlab/Metabuli/blob/master/README.md) |
| Original Publication(s) | [Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA](https://doi.org/10.1038/s41592-024-02273-y) |
Outputs
| Variable | Type | Description |
|---|---|---|
| ete4_docker | String | Docker image used for ETE4 taxonomy parsing |
| ete4_version | String | The version of ETE4 used |
| fastp_docker | String | Docker image used for fastp |
| fastp_html_report | File | The HTML report conveying fastp results |
| fastp_json_report | File | The JSON report conveying fastp results |
| fastp_read1_trimmed | File | read1 input trimmed by fastp |
| fastp_read2_trimmed | File | read2 input trimmed by fastp |
| fastp_version | String | The version of fastp used |
| fastplong_docker | String | Docker image used for fastplong |
| fastplong_html_report | File | The HTML report conveying fastplong results |
| fastplong_json_report | File | The JSON report conveying fastplong results |
| fastplong_read1_trimmed | File | read1 input trimmed by fastplong |
| fastplong_version | String | The version of fastplong used |
| metabuli_classified_read1 | File | FASTQ of read1 input classified by Metabuli |
| metabuli_classified_read2 | File | FASTQ of read2 input classified by Metabuli |
| metabuli_classified_report | File | Classification report from Metabuli |
| metabuli_docker | String | Docker image used for Metabuli |
| metabuli_krona_report | File | Krona visualization report from Metabuli |
| metabuli_report | File | Classification report from Metabuli |
| metabuli_status | String | Status of Metabuli analysis |
| metabuli_version | String | Version of Metabuli used |
| metabuli_wf_analysis_date | String | Analysis date for Metabuli workflow |
| metabuli_wf_version | String | Version of Metabuli workflow |
| ncbi_read_extraction_rank | String | Read extraction rank used |
| ncbi_taxon_id | String | NCBI taxonomy ID of inputted organism following rank extraction |
| ncbi_taxon_name | String | NCBI taxonomy name of inputted taxon following rank extraction |
Interpretation of results
The most important outputs of the Metabuli workflows are the metabuli_report files. These will include a breakdown of the number of sequences assigned to a particular taxon, and the percentage of reads assigned. A complete description of the report format can be found here.
When assessing the taxonomic identity of a single isolate's sequence, it is normal for a few reads to be assigned to very closely related taxa due to the sequence identity shared between them. "Very closely related taxa" may be genetically similar species in the same genus, or taxa with which the dominant species has undergone horizontal gene transfer. The presence of unrelated taxa, or a high abundance of these closely related taxa, is indicative of contamination or sequencing of non-target taxa. Interpretation of the results is dependent on the biological context.
Example Metabuli report
Below is an example metabuli_report for a Human immunodeficiency virus 1 sample. Only the first 13 lines are included here since the remaining rows account for <0.08% of the reads, which are likely human-derived contamination.
From this report, we can see that ~98.78% of the reads were assigned at the species level (species in the 4th column) to "Human immunodeficiency virus 1". ~1.15% of the reads were unclassified, and the remaining <0.08% of reads are annotated as Homo sapiens (not depicted).
#clade_proportion clade_count taxon_count rank taxID name
1.1457 3045 3045 no rank 0 unclassified
98.8543 262722 1 no rank 1 root
98.7850 262538 0 superkingdom 10239 Viruses
98.7843 262536 0 clade 2559587 Riboviria
98.7843 262536 0 kingdom 2732397 Pararnavirae
98.7843 262536 0 phylum 2732409 Artverviricota
98.7843 262536 0 class 2732514 Revtraviricetes
98.7843 262536 0 order 2169561 Ortervirales
98.7843 262536 0 family 11632 Retroviridae
98.7843 262536 0 subfamily 327045 Orthoretrovirinae
98.7843 262536 0 genus 11646 Lentivirus
**98.7843 262536 262536 species 11676 Human immunodeficiency virus 1**
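A report like the one above can be summarised programmatically. The sketch below assumes the report file is tab-delimited with the six columns shown in the header line; the helper names are hypothetical, and the embedded text reuses a few rows from the example rather than reading a real file.

```python
import csv
import io
from typing import Dict, List, Tuple

# A few rows from the example report, written with explicit tab delimiters.
REPORT = "\n".join([
    "#clade_proportion\tclade_count\ttaxon_count\trank\ttaxID\tname",
    "1.1457\t3045\t3045\tno rank\t0\tunclassified",
    "98.8543\t262722\t1\tno rank\t1\troot",
    "98.7843\t262536\t0\tgenus\t11646\tLentivirus",
    "98.7843\t262536\t262536\tspecies\t11676\tHuman immunodeficiency virus 1",
])


def parse_report(text: str) -> List[Dict[str, object]]:
    """Parse report rows into dictionaries, skipping the header and blank lines."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        proportion, clade_count, taxon_count, rank, tax_id, name = fields
        rows.append({
            "clade_proportion": float(proportion),
            "clade_count": int(clade_count),
            "taxon_count": int(taxon_count),
            "rank": rank.strip(),
            "tax_id": int(tax_id),
            "name": name.strip(),
        })
    return rows


def taxa_at_rank(rows, rank: str = "species", min_proportion: float = 1.0) -> List[Tuple[str, float]]:
    """List (name, clade_proportion) for taxa at a rank above a percentage cutoff."""
    return [
        (row["name"], row["clade_proportion"])
        for row in rows
        if row["rank"] == rank and row["clade_proportion"] >= min_proportion
    ]


rows = parse_report(REPORT)
print(taxa_at_rank(rows))
```

Filtering on clade_proportion at a single rank is a convenient first pass for spotting a dominant taxon; the cutoff of 1% used here is arbitrary.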
Krona visualisation of Metabuli report
Krona produces an interactive report that allows hierarchical data, such as the one from Metabuli, to be explored with zooming, multi-layered pie charts. These pie charts are intuitive and highly responsive.
Example Krona report
Below is an example of the krona_html for a bacterial sample. Taxonomic rank is organised from the centre of the pie chart to the edge, with each slice representing the relative abundance of a given taxon in the sample.
Metabuli Technical Details
| Links | |
|---|---|
| Software Source Code | Metabuli on GitHub |
| Software Documentation | https://github.com/steineggerlab/Metabuli/blob/master/README.md |
| Original Publication(s) | Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA |

