
NCBI_Scrub

Quick Facts

| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
|---|---|---|---|---|
| Standalone | Any Taxa | PHB v2.2.1 | Yes | Sample-level |

NCBI Scrub Workflows

NCBI Scrub, also known as the human read removal tool (HRRT), is based on the SRA Taxonomy Analysis Tool. It takes a FASTQ file as input and produces a FASTQ file in which all reads identified as potentially of human origin are either removed (default) or masked with 'N'. There are two NCBI Scrub workflows:

  • NCBI_Scrub_PE is compatible with Illumina paired-end data
  • NCBI_Scrub_SE is compatible with Illumina single-end data
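
The remove-versus-mask behavior described above can be illustrated with a small sketch. This is not HRRT's implementation: the function name `scrub_reads`, the in-memory tuple representation of reads, and the pre-computed set of flagged read IDs are all assumptions made for illustration.

```python
def scrub_reads(records, human_ids, mask=False):
    """Toy illustration of removing or masking flagged reads.

    records:   list of (read_id, sequence, quality) tuples (hypothetical layout)
    human_ids: set of read IDs already flagged as human (hypothetical input)
    mask:      if True, replace flagged sequences with 'N' instead of dropping them
    Returns the cleaned records and the number of reads removed or masked.
    """
    cleaned = []
    flagged = 0
    for read_id, seq, qual in records:
        if read_id in human_ids:
            flagged += 1
            if mask:
                # Keep the read but hide every base behind 'N'
                cleaned.append((read_id, "N" * len(seq), qual))
            # Default behavior: the flagged read is dropped entirely
            continue
        cleaned.append((read_id, seq, qual))
    return cleaned, flagged


records = [("r1", "ACGT", "IIII"), ("r2", "GGCC", "IIII"), ("r3", "TTAA", "IIII")]
removed, n_removed = scrub_reads(records, {"r2"})
masked, n_masked = scrub_reads(records, {"r2"}, mask=True)
```

In removal mode the output has fewer reads than the input; in masking mode the read count is unchanged and only the flagged sequences are overwritten.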

Inputs

NCBI_Scrub_PE

| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| dehost_pe or dehost_se | read1 | File | FASTQ file containing read1 sequences | | Required |
| dehost_pe or dehost_se | read2 | File | FASTQ file containing read2 sequences | | Required |
| dehost_pe or dehost_se | samplename | String | The name of the sample being analyzed | | Required |
| dehost_pe or dehost_se | target_organism | String | Target organism for Kraken2 reporting | Severe acute respiratory syndrome coronavirus 2 | Optional |
| kraken2 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| kraken2 | disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) Kraken2 databases such as the "k2_standard" database | 100 | Optional |
| kraken2 | docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional |
| kraken2 | kraken2_db | String | The database used to run Kraken2 | /kraken2-db | Optional |
| kraken2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| kraken2 | target_organism | String | The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database. | | Optional |
| ncbi_scrub_pe or ncbi_scrub_se | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional |
| ncbi_scrub_pe or ncbi_scrub_se | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| ncbi_scrub_pe or ncbi_scrub_se | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| ncbi_scrub_pe or ncbi_scrub_se | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |


NCBI_Scrub_SE

| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| dehost_pe or dehost_se | read1 | File | FASTQ file containing read1 sequences | | Required |
| dehost_pe or dehost_se | samplename | String | The name of the sample being analyzed | | Required |
| dehost_pe or dehost_se | target_organism | String | Target organism for Kraken2 reporting | Severe acute respiratory syndrome coronavirus 2 | Optional |
| kraken2 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| kraken2 | disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) Kraken2 databases such as the "k2_standard" database | 100 | Optional |
| kraken2 | docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional |
| kraken2 | kraken2_db | String | The database used to run Kraken2 | /kraken2-db | Optional |
| kraken2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| kraken2 | read2 | File | Internal component, do not modify | | Optional |
| kraken2 | target_organism | String | The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database. | | Optional |
| ncbi_scrub_pe or ncbi_scrub_se | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional |
| ncbi_scrub_pe or ncbi_scrub_se | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| ncbi_scrub_pe or ncbi_scrub_se | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| ncbi_scrub_pe or ncbi_scrub_se | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
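
For command-line use, the required inputs could be collected in a JSON inputs file. The sketch below assumes a WDL runner that namespaces input keys by workflow name and that the paired-end workflow is named `dehost_pe`, as the task-name column suggests; the FASTQ file names and the sample name are hypothetical placeholders.

```python
import json

# Assumption: keys follow the "<workflow_name>.<variable>" convention used by
# common WDL runners, with "dehost_pe" taken from the inputs table.
# The file paths and sample name below are hypothetical.
inputs = {
    "dehost_pe.read1": "sample01_R1.fastq.gz",
    "dehost_pe.read2": "sample01_R2.fastq.gz",
    "dehost_pe.samplename": "sample01",
    # Optional input, shown here with its documented default value
    "dehost_pe.target_organism": "Severe acute respiratory syndrome coronavirus 2",
}

inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

For single-end data the same idea would apply with the `read2` entry omitted and, presumably, the `dehost_se` prefix.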

Workflow Tasks

This workflow is composed of two tasks: one to dehost the input reads, and another to screen the cleaned reads with Kraken2 using the viral+human database.

HRRT: Human Host Sequence Removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).
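
Removing mates along with flagged reads can be pictured as follows. This is a minimal sketch, not the tool's code: it assumes the two read files are held as parallel in-memory lists in the same order, and that flagged read IDs are already known. The name `drop_human_pairs` is hypothetical.

```python
def drop_human_pairs(read1, read2, human_ids):
    """Drop a whole pair when either mate is flagged as human.

    read1, read2: parallel lists of (read_id, sequence) tuples, same order
                  (hypothetical in-memory layout)
    human_ids:    set of flagged read IDs (hypothetical input)
    """
    if len(read1) != len(read2):
        raise ValueError("paired inputs must contain the same number of reads")
    kept1, kept2 = [], []
    for mate1, mate2 in zip(read1, read2):
        # A hit on either mate removes both, so the files stay in sync
        if mate1[0] in human_ids or mate2[0] in human_ids:
            continue
        kept1.append(mate1)
        kept2.append(mate2)
    return kept1, kept2


r1 = [("p1/1", "ACGT"), ("p2/1", "GGCC")]
r2 = [("p1/2", "TTTT"), ("p2/2", "CCGG")]
clean1, clean2 = drop_human_pairs(r1, r2, {"p2/2"})
```

Here only the reverse mate of pair `p2` is flagged, yet both `p2/1` and `p2/2` are dropped, which keeps the forward and reverse outputs the same length.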

HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database built from Eukaryota k-mers derived from all human RefSeq records; any k-mers that are also found in non-Eukaryota RefSeq records are subtracted from the database.
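
The database construction described above amounts to a set difference over k-mers. The sketch below shows the idea on toy sequences. The k-mer length of 4, the function names, and the rule that a single shared k-mer flags a read are all assumptions for illustration; the real tool's k-mer size and matching criteria are not stated here.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_human_specific_db(human_seqs, non_eukaryote_seqs, k):
    """Human k-mers minus any k-mer also seen in non-Eukaryota sequences."""
    human = set().union(*(kmers(s, k) for s in human_seqs))
    other = set().union(*(kmers(s, k) for s in non_eukaryote_seqs))
    return human - other


def looks_human(read, db, k):
    """Assumed rule: flag the read if it shares at least one k-mer with db."""
    return not kmers(read, k).isdisjoint(db)


K = 4  # toy value chosen for readability
db = build_human_specific_db(["ACGTACGGA"], ["TTACGTAA"], K)
human_like = looks_human("GGACGGTT", db, K)
microbial_like = looks_human("TTACGTAA", db, K)
```

Subtracting the non-Eukaryota k-mers is what keeps reads from the organism of interest from being flagged merely because they share short sequences with the human genome.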

NCBI-Scrub Technical Details

| | Links |
|---|---|
| Task | task_ncbi_scrub.wdl |
| Software Source Code | HRRT on GitHub |
| Software Documentation | HRRT on NCBI |

Read Identification with Kraken2

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has also proven valuable for validating taxonomic assignments and checking for contamination in single-species whole genome sequence data (e.g., bacterial, eukaryotic, or viral isolates).

Kraken2 is run on both the raw and clean reads.

Database-dependent

This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.

Kraken2 Technical Details

| | Links |
|---|---|
| Task | task_kraken2.wdl |
| Software Source Code | Kraken2 on GitHub |
| Software Documentation | https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown |
| Original Publication(s) | Improved metagenomic analysis with Kraken 2 |

Outputs

NCBI_Scrub_PE

| Variable | Type | Description |
|---|---|---|
| kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal |
| kraken_report_dehosted | File | Full Kraken report after host removal |
| kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal |
| kraken_version_dehosted | String | Version of Kraken2 software used |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| ncbi_scrub_pe_analysis_date | String | Date of analysis |
| ncbi_scrub_pe_version | String | Version of HRRT software used |
| read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
| read2_dehosted | File | The dehosted reverse reads file; suggested read file for SRA submission |


NCBI_Scrub_SE

| Variable | Type | Description |
|---|---|---|
| kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal |
| kraken_report_dehosted | File | Full Kraken report after host removal |
| kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal |
| kraken_version_dehosted | String | Version of Kraken2 software used |
| ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
| ncbi_scrub_pe_analysis_date | String | Date of analysis |
| ncbi_scrub_pe_version | String | Version of HRRT software used |
| read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
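
The percent outputs above can be cross-checked against the Kraken report file. The sketch below assumes the standard six-column, tab-separated Kraken2 report layout (percent, clade reads, direct reads, rank code, taxonomy ID, indented name). Matching taxa by exact name and returning 0.0 when a taxon is absent are assumptions for illustration, not necessarily how the workflow derives its values. The report text is made-up example data.

```python
def percent_for_taxon(report_text, taxon_name):
    """Return the percent column for the row whose name matches taxon_name."""
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # The name column is indented with spaces to show taxonomic depth
        if fields[5].strip() == taxon_name:
            return float(fields[0])
    return 0.0  # assumption: treat a missing taxon as 0 percent


# Made-up example report, not real results
example_report = (
    " 98.50\t985\t0\tS\t2697049\t        Severe acute respiratory syndrome coronavirus 2\n"
    "  0.10\t1\t1\tS\t9606\t      Homo sapiens\n"
)

sc2_percent = percent_for_taxon(example_report, "Severe acute respiratory syndrome coronavirus 2")
human_percent = percent_for_taxon(example_report, "Homo sapiens")
missing_percent = percent_for_taxon(example_report, "Escherichia coli")
```

A human percentage near zero after dehosting is the expected outcome; a noticeably higher value would suggest inspecting the full report.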