NCBI_Scrub
Quick Facts
| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore |
|---|---|---|---|---|---|
| Standalone | Any taxa | v2.2.1 | Yes | Sample-level | NCBI_Scrub_PE_PHB, NCBI_Scrub_SE_PHB |
NCBI Scrub Workflows
NCBI Scrub, also known as the human read removal tool (HRRT), is based on the SRA Taxonomy Analysis Tool. It takes a FASTQ file as input and produces a FASTQ file in which all reads identified as potentially of human origin are either removed (default) or masked with 'N'. There are two NCBI Scrub workflows:
- NCBI_Scrub_PE is compatible with Illumina paired-end data
- NCBI_Scrub_SE is compatible with Illumina single-end data
Inputs
NCBI_Scrub_PE

| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| dehost_pe | read1 | File | FASTQ file containing read1 sequences | | Required |
| dehost_pe | read2 | File | FASTQ file containing read2 sequences | | Required |
| dehost_pe | samplename | String | The name of the sample being analyzed | | Required |
| dehost_pe | target_organism | String | Target organism for Kraken2 reporting | Severe acute respiratory syndrome coronavirus 2 | Optional |
| kraken2 | bracken_kmer_length | Int | K-mer length for Bracken to use instead of auto-detection; must be present in the database | | Optional |
| kraken2 | call_bracken | Boolean | Call Bracken kraken2 report refinement | True | Optional |
| kraken2 | classified_out | String | Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix | classified#.fastq | Optional |
| kraken2 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| kraken2 | disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) kraken2 databases such as the "k2_standard" database | 100 | Optional |
| kraken2 | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/kraken2:2.17.1 | Optional |
| kraken2 | kraken2_args | String | Allows a user to supply additional kraken2 command-line arguments | | Optional |
| kraken2 | kraken2_db | File | The database used to run Kraken2 | gs://theiagen-public-resources-rp/reference_data/databases/kraken2/k2_viral-refseq_human-GRCh38_20260220.tar.gz | Optional |
| kraken2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| kraken2 | unclassified_out | String | Allows user to rename unclassified FASTQ files output. Must include .fastq as the suffix | unclassified#.fastq | Optional |
| ncbi_scrub_pe | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| ncbi_scrub_pe | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| ncbi_scrub_pe | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional |
| ncbi_scrub_pe | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

NCBI_Scrub_SE

| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| dehost_se | read1 | File | FASTQ file containing read1 sequences | | Required |
| dehost_se | samplename | String | The name of the sample being analyzed | | Required |
| dehost_se | target_organism | String | Target organism for Kraken2 reporting | Severe acute respiratory syndrome coronavirus 2 | Optional |
| kraken2 | bracken_kmer_length | Int | K-mer length for Bracken to use instead of auto-detection; must be present in the database | | Optional |
| kraken2 | call_bracken | Boolean | Call Bracken kraken2 report refinement | True | Optional |
| kraken2 | classified_out | String | Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix | classified#.fastq | Optional |
| kraken2 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| kraken2 | disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) kraken2 databases such as the "k2_standard" database | 100 | Optional |
| kraken2 | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/kraken2:2.17.1 | Optional |
| kraken2 | kraken2_args | String | Allows a user to supply additional kraken2 command-line arguments | | Optional |
| kraken2 | kraken2_db | File | The database used to run Kraken2 | gs://theiagen-public-resources-rp/reference_data/databases/kraken2/k2_viral-refseq_human-GRCh38_20260220.tar.gz | Optional |
| kraken2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| kraken2 | read2 | File | Internal component, do not modify | | Optional |
| kraken2 | unclassified_out | String | Allows user to rename unclassified FASTQ files output. Must include .fastq as the suffix | unclassified#.fastq | Optional |
| ncbi_scrub_se | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| ncbi_scrub_se | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| ncbi_scrub_se | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional |
| ncbi_scrub_se | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
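
For command-line use, WDL runners generally take an inputs JSON whose keys follow the `<workflow>.<variable>` pattern. The sketch below builds such a mapping from the tables above. It assumes the workflow names are `dehost_pe` and `dehost_se` (taken from the Terra Task Name column) and that task-level inputs are addressed as `<workflow>.<task>.<variable>`; the helper `build_inputs`, the sample name, and the file paths are hypothetical.

```python
import json


def build_inputs(workflow, samplename, read1, read2=None, overrides=None):
    """Build a WDL inputs mapping keyed as '<workflow>.<variable>'.

    `workflow` is assumed to be 'dehost_pe' or 'dehost_se'.
    `overrides` maps '<task>.<variable>' to a value for optional inputs.
    """
    inputs = {
        f"{workflow}.samplename": samplename,
        f"{workflow}.read1": read1,
    }
    # read2 is only a required input of the paired-end workflow.
    if read2 is not None:
        inputs[f"{workflow}.read2"] = read2
    # Optional task-level settings, e.g. {"kraken2.memory": 64}.
    for key, value in (overrides or {}).items():
        inputs[f"{workflow}.{key}"] = value
    return inputs


pe_inputs = build_inputs(
    "dehost_pe",
    samplename="sample01",            # hypothetical sample name
    read1="sample01_R1.fastq.gz",     # hypothetical path
    read2="sample01_R2.fastq.gz",     # hypothetical path
    overrides={"kraken2.memory": 64, "ncbi_scrub_pe.cpu": 8},
)
print(json.dumps(pe_inputs, indent=2))
```

Any input left out of the mapping falls back to the default value listed in the table.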
Workflow Tasks
This workflow is composed of two tasks: one to dehost the input reads and another to screen the clean reads with kraken2 against the viral+human database.
HRRT: Human Host Sequence Removal
All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).
HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database built from all human (Eukaryota) RefSeq records, from which any k-mers also found in non-Eukaryota RefSeq records have been subtracted.
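
The idea of k-mer based host removal can be illustrated with a toy sketch. This is not HRRT's implementation: the function names, the tiny k-mer size, and the rule that a single shared k-mer flags a read are all assumptions made for illustration. It does mirror the behavior described above, where a flagged read is dropped together with its mate, or masked with 'N' instead.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_human(seq, human_kmers, k):
    """Flag a read if it shares at least one k-mer with the human set."""
    return not kmers(seq.upper(), k).isdisjoint(human_kmers)


def scrub_pairs(pairs, human_kmers, k, mask=False):
    """Remove (or N-mask) every pair in which either mate is flagged.

    pairs: iterable of (read1_seq, read2_seq) tuples.
    Returns (kept_pairs, number_of_flagged_pairs).
    """
    kept, flagged = [], 0
    for r1, r2 in pairs:
        if is_human(r1, human_kmers, k) or is_human(r2, human_kmers, k):
            flagged += 1
            if mask:
                # Keep the pair but replace every base call with 'N'.
                kept.append(("N" * len(r1), "N" * len(r2)))
            continue
        kept.append((r1, r2))
    return kept, flagged


# Toy data: a made-up "human" reference and two read pairs.
human_kmers = kmers("ACGTACGTGG", 4)
pairs = [("ACGTACGT", "TTTTCCCC"), ("TTTTGGGG", "CCCCAAAA")]

kept, flagged = scrub_pairs(pairs, human_kmers, k=4)
print(kept, flagged)
```

In this toy run only the first pair is flagged, because its first mate shares k-mers with the reference; the flagged count plays the role of the spots-removed metric the workflow reports.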
NCBI-Scrub Technical Details
| Links | |
|---|---|
| Task | task_ncbi_scrub.wdl |
| Software Source Code | HRRT on GitHub |
| Software Documentation | HRRT on NCBI |
Kraken2 + Bracken: Read Classification
kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.
Bracken is a refinement module that improves the resolution of kraken2 reports.
kraken2 is run on both the raw and clean reads.
Database-dependent
This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/k2_viral-refseq_human-GRCh38_20260220.tar.gz.
Bracken report refinement
Bracken refines the Kraken2 taxon classification report when call_bracken is set to "true" (default). Bracken uses a Bayesian model to probabilistically estimate read abundances at the species/genus-level. Bracken will output a bracken_report that:
- increases report-level classification resolution up to the species level
- decreases resolution of sub-species report-level classifications, e.g. Severe acute respiratory syndrome coronavirus 2 will be grouped into Betacoronavirus pandemicum
- does not affect read-level classification and extraction
- will not be used in downstream percent_human and percent_target_organism calculations
- is used in place of Kraken reports in downstream tasks, such as qc_check and krona
- is output separately from the kraken/kraken2_report
By default, Bracken will reference the k-mer database that is closest to the mean read length of the input. This reference k-mer database size can be directly set using the bracken_kmer_length input, though it MUST correspond to an available k-mer database within the Kraken2 database (named database<KMER_LENGTH>mers.kmer_distrib). Bracken will be skipped if there are no k-mer libraries in the Kraken2 database.
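
The selection rule described above can be sketched as follows. This is an illustration rather than the task's actual code: the function names are hypothetical, and breaking ties toward the smaller length is an assumption. Only the `database<KMER_LENGTH>mers.kmer_distrib` naming pattern comes from the text.

```python
import re

# Matches files named database<KMER_LENGTH>mers.kmer_distrib
KMER_DISTRIB = re.compile(r"^database(\d+)mers\.kmer_distrib$")


def available_kmer_lengths(filenames):
    """Return the sorted k-mer lengths found among the database files."""
    lengths = []
    for name in filenames:
        match = KMER_DISTRIB.match(name)
        if match:
            lengths.append(int(match.group(1)))
    return sorted(lengths)


def choose_kmer_length(filenames, mean_read_length, requested=None):
    """Pick the k-mer distribution Bracken would reference.

    Returns None when the database has no k-mer distributions,
    which corresponds to Bracken being skipped.
    """
    lengths = available_kmer_lengths(filenames)
    if not lengths:
        return None
    if requested is not None:
        # A user-supplied length must exist in the database.
        if requested not in lengths:
            raise ValueError(f"no database{requested}mers.kmer_distrib found")
        return requested
    # Closest to the mean read length; ties go to the smaller length
    # because `lengths` is sorted (an assumption of this sketch).
    return min(lengths, key=lambda n: abs(n - mean_read_length))


files = [
    "hash.k2d",
    "database100mers.kmer_distrib",
    "database150mers.kmer_distrib",
    "database250mers.kmer_distrib",
]
print(choose_kmer_length(files, mean_read_length=148))
```

Passing `requested=` stands in for setting bracken_kmer_length; a value with no matching file raises an error here, which reflects the requirement that the length must be present in the database.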
Kraken2 Technical Details
| Links | |
|---|---|
| Task | task_kraken2.wdl |
| Software Source Code | Kraken2 on GitHub, Bracken on GitHub |
| Software Documentation | Kraken2 Documentation, Bracken Documentation |
| Original Publication(s) | Improved metagenomic analysis with Kraken 2, Bracken: estimating species abundance in metagenomics data |
Outputs
NCBI_Scrub_PE

| Variable | Type | Description |
|---|---|---|
| kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal |
| kraken_report_dehosted | File | Full Kraken report after host removal |
| kraken_target_organism_dehosted | String | Percent of target organism read data detected using the Kraken2 software after host removal |
| kraken_version_dehosted | String | Version of Kraken2 software used |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| ncbi_scrub_pe_analysis_date | String | Date of analysis |
| ncbi_scrub_pe_version | String | Version of HRRT software used |
| read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
| read2_dehosted | File | The dehosted reverse reads file; suggested read file for SRA submission |

NCBI_Scrub_SE

| Variable | Type | Description |
|---|---|---|
| kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal |
| kraken_report_dehosted | File | Full Kraken report after host removal |
| kraken_target_organism_dehosted | String | Percent of target organism read data detected using the Kraken2 software after host removal |
| kraken_version_dehosted | String | Version of Kraken2 software used |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| ncbi_scrub_se_analysis_date | String | Date of analysis |
| ncbi_scrub_se_version | String | Version of HRRT software used |
| read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
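
As an illustration of how values such as kraken_human_dehosted could be derived, the sketch below reads the percentage for a named taxon from a standard Kraken2 report, which is tab-delimited with the clade percentage in the first column and the indented scientific name in the last. The function name, the fallback of 0.0 for an absent taxon, and the report rows are assumptions; the workflow's own parsing is not shown in this document.

```python
def percent_for_taxon(report_text, taxon_name):
    """Return the clade-level percentage reported for taxon_name.

    Returns 0.0 if the taxon does not appear in the report
    (an assumption of this sketch).
    """
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # The name column is indented to show taxonomic depth.
        if fields[-1].strip() == taxon_name:
            return float(fields[0])
    return 0.0


# Made-up report rows, for illustration only.
report = "\n".join([
    " 12.50\t125\t125\tU\t0\tunclassified",
    " 87.50\t875\t0\tR\t1\troot",
    "  0.40\t4\t4\tS\t9606\t      Homo sapiens",
    " 85.00\t850\t850\tS1\t2697049\t"
    "        Severe acute respiratory syndrome coronavirus 2",
])

print(percent_for_taxon(report, "Homo sapiens"))
print(percent_for_taxon(
    report, "Severe acute respiratory syndrome coronavirus 2"))
```

The second call uses the default target_organism value from the inputs table as the taxon name.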