NCBI_Scrub

Quick Facts

  • Workflow Type: Standalone
  • Applicable Kingdom: Any taxa
  • Last Known Changes: v2.2.1
  • Command-line Compatibility: Yes
  • Workflow Level: Sample-level
  • Dockstore: NCBI_Scrub_PE_PHB, NCBI_Scrub_SE_PHB

NCBI Scrub Workflows

NCBI Scrub, also known as the human read removal tool (HRRT), is based on the SRA Taxonomy Analysis Tool (STAT). It takes a FASTQ file as input and produces a FASTQ file in which all reads identified as potentially of human origin are either removed (default) or masked with 'N'. There are two NCBI Scrub workflows:

  • NCBI_Scrub_PE is compatible with Illumina paired-end data
  • NCBI_Scrub_SE is compatible with Illumina single-end data

Inputs

NCBI_Scrub_PE

Terra Task Name Variable Type Description Default Value Terra Status
dehost_pe read1 File FASTQ file containing read1 sequences Required
dehost_pe read2 File FASTQ file containing read2 sequences Required
dehost_pe samplename String The name of the sample being analyzed Required
dehost_pe target_organism String Target organism for Kraken2 reporting Severe acute respiratory syndrome coronavirus 2 Optional
kraken2 bracken_kmer_length Int Kmer length for Bracken to use instead of auto-detection - must be present in database Optional
kraken2 call_bracken Boolean Call Bracken kraken2 report refinement True Optional
kraken2 classified_out String Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix classified#.fastq Optional
kraken2 cpu Int Number of CPUs to allocate to the task 4 Optional
kraken2 disk_size Int Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) kraken2 databases, such as the "k2_standard" database 100 Optional
kraken2 docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/kraken2:2.17.1 Optional
kraken2 kraken2_args String Allows a user to supply additional kraken2 command-line arguments Optional
kraken2 kraken2_db File The database used to run Kraken2 gs://theiagen-public-resources-rp/reference_data/databases/kraken2/k2_viral-refseq_human-GRCh38_20260220.tar.gz Optional
kraken2 memory Int Amount of memory/RAM (in GB) to allocate to the task 32 Optional
kraken2 unclassified_out String Allows user to rename unclassified FASTQ files output. Must include .fastq as the suffix unclassified#.fastq Optional
ncbi_scrub_pe cpu Int Number of CPUs to allocate to the task 4 Optional
ncbi_scrub_pe disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
ncbi_scrub_pe docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 Optional
ncbi_scrub_pe memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional
version_capture docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 Optional
version_capture timezone String Set the time zone to get an accurate date of analysis (uses UTC by default) Optional

NCBI_Scrub_SE

Terra Task Name Variable Type Description Default Value Terra Status
dehost_se read1 File FASTQ file containing read1 sequences Required
dehost_se samplename String The name of the sample being analyzed Required
dehost_se target_organism String Target organism for Kraken2 reporting Severe acute respiratory syndrome coronavirus 2 Optional
kraken2 bracken_kmer_length Int Kmer length for Bracken to use instead of auto-detection - must be present in database Optional
kraken2 call_bracken Boolean Call Bracken kraken2 report refinement True Optional
kraken2 classified_out String Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix classified#.fastq Optional
kraken2 cpu Int Number of CPUs to allocate to the task 4 Optional
kraken2 disk_size Int Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) kraken2 databases, such as the "k2_standard" database 100 Optional
kraken2 docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/kraken2:2.17.1 Optional
kraken2 kraken2_args String Allows a user to supply additional kraken2 command-line arguments Optional
kraken2 kraken2_db File The database used to run Kraken2 gs://theiagen-public-resources-rp/reference_data/databases/kraken2/k2_viral-refseq_human-GRCh38_20260220.tar.gz Optional
kraken2 memory Int Amount of memory/RAM (in GB) to allocate to the task 32 Optional
kraken2 read2 File Internal component, do not modify Optional
kraken2 unclassified_out String Allows user to rename unclassified FASTQ files output. Must include .fastq as the suffix unclassified#.fastq Optional
ncbi_scrub_se cpu Int Number of CPUs to allocate to the task 4 Optional
ncbi_scrub_se disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
ncbi_scrub_se docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 Optional
ncbi_scrub_se memory Int Amount of memory/RAM (in GB) to allocate to the task 8 Optional
version_capture docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 Optional
version_capture timezone String Set the time zone to get an accurate date of analysis (uses UTC by default) Optional
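
For command-line runs, the inputs listed above are typically supplied as a JSON file. The following is a minimal sketch for the paired-end workflow, assuming Cromwell-style keys of the form `<workflow>.<variable>` and `<workflow>.<task>.<variable>`, with the workflow and task names taken from the "Terra Task Name" column. The helper `build_pe_inputs`, the sample name, and the bucket paths are hypothetical placeholders, not part of the workflow.

```python
import json

# Assumption: the workflow is addressed as "dehost_pe", as shown in the
# "Terra Task Name" column of the paired-end inputs table.
WORKFLOW = "dehost_pe"


def build_pe_inputs(samplename, read1, read2, overrides=None):
    """Return an inputs dict for the paired-end workflow (hypothetical helper)."""
    inputs = {
        f"{WORKFLOW}.samplename": samplename,
        f"{WORKFLOW}.read1": read1,
        f"{WORKFLOW}.read2": read2,
    }
    # Optional task-level inputs are keyed as <workflow>.<task>.<variable>.
    for key, value in (overrides or {}).items():
        inputs[f"{WORKFLOW}.{key}"] = value
    return inputs


# Placeholder sample name and paths; replace them with your own data.
inputs = build_pe_inputs(
    "sample01",
    "gs://example-bucket/sample01_R1.fastq.gz",
    "gs://example-bucket/sample01_R2.fastq.gz",
    # Larger Kraken2 databases may need more memory and disk than the defaults.
    overrides={"kraken2.memory": 64, "kraken2.disk_size": 200},
)
print(json.dumps(inputs, indent=2))
```

Only the three required inputs must be present; every optional input falls back to the default value listed in the table when it is omitted.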

Workflow Tasks

This workflow is composed of two tasks, one to dehost the input reads and another to screen the clean reads with kraken2 and the viral+human database.

HRRT: Human Host Sequence Removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).

HRRT is based on the SRA Taxonomy Analysis Tool and employs a database of Eukaryota k-mers derived from all human RefSeq records, from which any k-mers also found in non-Eukaryota RefSeq records have been subtracted.
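
The idea of a subtracted k-mer database and mate-aware removal can be illustrated with a toy example. This is a simplified sketch of the concept only, not HRRT's implementation: the sequences, the k-mer length of 4, and the function names are all made up for illustration, and exact set lookups stand in for the tool's actual matching.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_human_db(human_refs, non_eukaryote_refs, k):
    """Human k-mers minus any k-mer also seen in non-eukaryote references."""
    human = set().union(*(kmers(s, k) for s in human_refs))
    other = set().union(*(kmers(s, k) for s in non_eukaryote_refs))
    return human - other


def scrub_pairs(pairs, db, k):
    """Drop a read pair if either mate shares a k-mer with the database."""
    kept, removed = [], 0
    for read1, read2 in pairs:
        if kmers(read1, k) & db or kmers(read2, k) & db:
            removed += 1  # both mates are discarded together
        else:
            kept.append((read1, read2))
    return kept, removed


K = 4  # toy value chosen so the example fits on a few lines
db = build_human_db(["ACGTACGGTT"], ["TTTTACGT"], K)

pairs = [
    ("AAAAAAAA", "CCGGTTCC"),  # read2 contains a human-only k-mer
    ("TTTTACGT", "AAAACCCC"),  # shares only k-mers that were subtracted
    ("GGGGGGGG", "TTTTTTTT"),  # no overlap with the human reference
]
kept, removed = scrub_pairs(pairs, db, K)
print(f"kept {len(kept)} pairs, removed {removed}")
```

The second pair shows why the subtraction step matters: its reads share k-mers with the human reference, but those k-mers also occur in the non-eukaryote reference, so the pair is retained.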

NCBI-Scrub Technical Details

Links:

  • Task: task_ncbi_scrub.wdl
  • Software Source Code: HRRT on GitHub
  • Software Documentation: HRRT on NCBI

Kraken2 + Bracken: Read Classification

kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

Bracken is a refinement module that improves the resolution of kraken2 reports.

In this workflow, kraken2 is run on the dehosted (clean) reads.

Database-dependent

By default, this workflow uses a viral-specific Kraken2 database that was generated in-house from RefSeq's viral sequence collection and the human genome GRCh38. It is available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/k2_viral-refseq_human-GRCh38_20260220.tar.gz. A different database can be supplied through the kraken2_db input.

Bracken report refinement

Bracken refines the Kraken2 taxon classification report when call_bracken is set to "true" (default). Bracken uses a Bayesian model to probabilistically estimate read abundances at the species/genus-level. Bracken will output a bracken_report that:

  • increases report-level classification resolution up to the species level
  • decreases resolution of sub-species report-level classifications, e.g. Severe acute respiratory syndrome coronavirus 2 will be grouped into Betacoronavirus pandemicum
  • does not affect read-level classification and extraction
  • will not be used in downstream percent_human and percent_target_organism calculations
  • is used in place of Kraken reports in downstream tasks, such as qc_check and krona
  • is output separately from the kraken/kraken2_report

By default, Bracken will reference the k-mer database that is closest to the mean read length of the input. This reference k-mer database size can be directly set using the bracken_kmer_length input, though it MUST correspond to an available k-mer database within the Kraken2 database (named database<KMER_LENGTH>mers.kmer_distrib). Bracken will be skipped if there are no k-mer libraries in the Kraken2 database.
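
The selection rule described above can be sketched as follows. This is an illustrative reimplementation, not the task's actual code: the function names are hypothetical, and the tie-breaking behavior (preferring the shorter k-mer length when two are equally close) is an assumption, since the text does not specify it.

```python
import re

# File naming pattern stated above: database<KMER_LENGTH>mers.kmer_distrib
DISTRIB_PATTERN = re.compile(r"^database(\d+)mers\.kmer_distrib$")


def available_kmer_lengths(filenames):
    """Return the sorted k-mer lengths found among the database files."""
    lengths = []
    for name in filenames:
        match = DISTRIB_PATTERN.match(name)
        if match:
            lengths.append(int(match.group(1)))
    return sorted(lengths)


def choose_kmer_length(filenames, mean_read_length, requested=None):
    """Return the k-mer length to use, or None if Bracken should be skipped."""
    lengths = available_kmer_lengths(filenames)
    if not lengths:
        return None  # no k-mer distributions in the database: skip Bracken
    if requested is not None:
        if requested not in lengths:
            raise ValueError(
                f"bracken_kmer_length={requested} not in database: {lengths}"
            )
        return requested
    # Closest to the mean read length; ties go to the shorter length
    # because the list is sorted (an assumption made for this sketch).
    return min(lengths, key=lambda length: abs(length - mean_read_length))


db_files = [
    "hash.k2d", "opts.k2d", "taxo.k2d",
    "database100mers.kmer_distrib",
    "database150mers.kmer_distrib",
    "database250mers.kmer_distrib",
]
print(choose_kmer_length(db_files, mean_read_length=143.2))
print(choose_kmer_length(db_files, mean_read_length=143.2, requested=100))
print(choose_kmer_length(["hash.k2d"], mean_read_length=143.2))
```

With these example files, a mean read length of 143.2 selects the 150-mer distribution, an explicit request for 100 is honored because that file exists, and a database without any `.kmer_distrib` files yields `None`.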

Outputs

NCBI_Scrub_PE

Variable Type Description
kraken_human_dehosted Float Percent of human read data detected using the Kraken2 software after host removal
kraken_report_dehosted File Full Kraken report after host removal
kraken_target_organism_dehosted String Percent of target organism read data detected using the Kraken2 software after host removal
kraken_version_dehosted String Version of Kraken2 software used
ncbi_scrub_docker String The Docker image for NCBI's HRRT (human read removal tool)
ncbi_scrub_human_spots_removed Int Number of spots removed (or masked)
ncbi_scrub_pe_analysis_date String Date of analysis
ncbi_scrub_pe_version String Version of HRRT software used
read1_dehosted File The dehosted forward reads file; suggested read file for SRA submission
read2_dehosted File The dehosted reverse reads file; suggested read file for SRA submission

NCBI_Scrub_SE

Variable Type Description
kraken_human_dehosted Float Percent of human read data detected using the Kraken2 software after host removal
kraken_report_dehosted File Full Kraken report after host removal
kraken_target_organism_dehosted String Percent of target organism read data detected using the Kraken2 software after host removal
kraken_version_dehosted String Version of Kraken2 software used
ncbi_scrub_docker String The Docker image for NCBI's HRRT (human read removal tool)
ncbi_scrub_human_spots_removed Int Number of spots removed (or masked)
ncbi_scrub_se_analysis_date String Date of analysis
ncbi_scrub_se_version String Version of HRRT software used
read1_dehosted File The dehosted forward reads file; suggested read file for SRA submission
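
The kraken_human_dehosted and kraken_target_organism_dehosted values are percentages taken from the Kraken2 report. A minimal sketch of how such values could be read from kraken_report_dehosted is shown below, assuming the standard tab-delimited Kraken2 report layout (percentage first, indented taxon name last). The function name and the report contents are illustrative only and do not reflect the workflow's own parsing code or real results.

```python
def percent_for_taxon(report_text, taxon_name):
    """Return the clade percentage for taxon_name, or 0.0 if it is absent."""
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # The taxon name is the last column and is indented by rank depth.
        if fields[-1].strip() == taxon_name:
            return float(fields[0])
    return 0.0


# Made-up report lines for illustration: percent, clade reads, taxon reads,
# rank code, NCBI taxid, name.
example_report = (
    "  0.50\t5\t5\tU\t0\tunclassified\n"
    " 99.50\t995\t0\tR\t1\troot\n"
    "  1.20\t12\t12\tS\t9606\t      Homo sapiens\n"
    " 95.00\t950\t950\tS1\t2697049\t"
    "        Severe acute respiratory syndrome coronavirus 2\n"
)

human = percent_for_taxon(example_report, "Homo sapiens")
target = percent_for_taxon(
    example_report, "Severe acute respiratory syndrome coronavirus 2"
)
print(f"human: {human}%  target organism: {target}%")
```

Matching on the exact taxon name mirrors the target_organism input, which is given as a full organism name; a taxon that does not appear in the report is reported as 0.0 in this sketch.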