NCBI_Scrub

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level	Dockstore
Standalone	Any taxa	v2.2.1	Yes	Sample-level	NCBI_Scrub_PE_PHB, NCBI_Scrub_SE_PHB

NCBI Scrub Workflows

NCBI Scrub, also known as the human read removal tool (HRRT), is based on the SRA Taxonomy Analysis Tool. It takes a FASTQ file as input and produces a FASTQ file in which all reads identified as potentially of human origin are either removed (default) or masked with 'N'. There are two NCBI Scrub workflows:

NCBI_Scrub_PE is compatible with Illumina paired-end data
NCBI_Scrub_SE is compatible with Illumina single-end data
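
The difference between the two output modes (removal versus masking) can be illustrated with a minimal Python sketch. This is not HRRT's implementation: the `scrub_reads` function, the in-memory record tuples, and the `flagged_ids` set are hypothetical stand-ins for whatever the tool does internally once a read has been identified as potentially human.

```python
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical in-memory FASTQ record: (read id, sequence, quality string)
FastqRecord = Tuple[str, str, str]


def scrub_reads(
    records: Iterable[FastqRecord],
    flagged_ids: Set[str],
    mask: bool = False,
) -> Iterator[FastqRecord]:
    """Drop flagged reads, or keep them with every base replaced by 'N'."""
    for read_id, seq, qual in records:
        if read_id not in flagged_ids:
            yield read_id, seq, qual  # read is kept unchanged
        elif mask:
            yield read_id, "N" * len(seq), qual  # masked read keeps its length
        # flagged and not masking: the read is dropped


reads = [
    ("read_a", "ACGTACGT", "IIIIIIII"),
    ("read_b", "TTGGCCAA", "IIIIIIII"),
]
flagged = {"read_b"}  # assumed result of a human-read classifier

removed = list(scrub_reads(reads, flagged))
masked = list(scrub_reads(reads, flagged, mask=True))
```

In removal mode the output contains fewer records than the input; in masking mode the record count is unchanged and only the sequence of each flagged read is overwritten.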

Inputs

NCBI_Scrub_PE

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
dehost_pe	read1	File	FASTQ file containing read1 sequences		Required
dehost_pe	read2	File	FASTQ file containing read2 sequences		Required
dehost_pe	samplename	String	The name of the sample being analyzed		Required
dehost_pe	target_organism	String	Target organism for Kraken2 reporting	Severe acute respiratory syndrome coronavirus 2	Optional
kraken2	cpu	Int	Number of CPUs to allocate to the task	4	Optional
kraken2	disk_size	Int	Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) Kraken2 databases such as the "k2_standard" database	100	Optional
kraken2	docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db	Optional
kraken2	kraken2_db	File	The database used to run Kraken2	gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz	Optional
kraken2	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
ncbi_scrub_pe	cpu	Int	Number of CPUs to allocate to the task	4	Optional
ncbi_scrub_pe	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
ncbi_scrub_pe	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1	Optional
ncbi_scrub_pe	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional

NCBI_Scrub_SE

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
dehost_se	read1	File	FASTQ file containing read1 sequences		Required
dehost_se	samplename	String	The name of the sample being analyzed		Required
dehost_se	target_organism	String	Target organism for Kraken2 reporting	Severe acute respiratory syndrome coronavirus 2	Optional
kraken2	cpu	Int	Number of CPUs to allocate to the task	4	Optional
kraken2	disk_size	Int	Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) Kraken2 databases such as the "k2_standard" database	100	Optional
kraken2	docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db	Optional
kraken2	kraken2_db	File	The database used to run Kraken2	gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz	Optional
kraken2	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
kraken2	read2	File	Internal component, do not modify		Optional
ncbi_scrub_se	cpu	Int	Number of CPUs to allocate to the task	4	Optional
ncbi_scrub_se	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
ncbi_scrub_se	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1	Optional
ncbi_scrub_se	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
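
As a rough illustration of how the required and optional inputs above might be supplied when running NCBI_Scrub_PE from the command line, the following Python sketch builds an inputs JSON document. It assumes a Cromwell-style `workflow_name.variable` key convention and that the workflow name matches the `dehost_pe` entry in the Terra Task Name column; the `build_inputs` helper and the file paths are hypothetical placeholders.

```python
import json
from typing import Dict, Optional


def build_inputs(
    samplename: str,
    read1: str,
    read2: str,
    target_organism: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble workflow-level inputs; key prefix 'dehost_pe' is an assumption."""
    inputs = {
        "dehost_pe.samplename": samplename,
        "dehost_pe.read1": read1,
        "dehost_pe.read2": read2,
    }
    # Optional input: leave it out to fall back to the workflow default
    if target_organism is not None:
        inputs["dehost_pe.target_organism"] = target_organism
    return inputs


# Placeholder sample name and file paths
inputs_json = json.dumps(
    build_inputs("sample01", "sample01_R1.fastq.gz", "sample01_R2.fastq.gz"),
    indent=2,
)
print(inputs_json)
```

Optional inputs are omitted unless explicitly set, so the defaults listed in the tables stay in effect.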

Workflow Tasks

This workflow is composed of two tasks: one to dehost the input reads, and another to screen the clean reads with Kraken2 against the viral+human database.

HRRT: Human Host Sequence Removal

All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).
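
Removing a read together with its mate keeps the two paired-end files in sync. A minimal sketch of that idea follows, assuming both files list mates in the same order and that the set of flagged read IDs is already known. The `drop_flagged_pairs` function and the record tuples are hypothetical and do not reflect how HRRT is implemented.

```python
from typing import List, Set, Tuple

# Hypothetical in-memory FASTQ record: (read id, sequence, quality string)
FastqRecord = Tuple[str, str, str]


def drop_flagged_pairs(
    read1: List[FastqRecord],
    read2: List[FastqRecord],
    flagged_ids: Set[str],
) -> Tuple[List[FastqRecord], List[FastqRecord]]:
    """Remove a pair from both files when either mate is flagged."""
    kept1: List[FastqRecord] = []
    kept2: List[FastqRecord] = []
    for mate1, mate2 in zip(read1, read2):
        if mate1[0] in flagged_ids or mate2[0] in flagged_ids:
            continue  # drop both mates so the files stay paired
        kept1.append(mate1)
        kept2.append(mate2)
    return kept1, kept2


r1 = [("pair_1", "ACGT", "IIII"), ("pair_2", "GGCC", "IIII")]
r2 = [("pair_1", "TTAA", "IIII"), ("pair_2", "CCGG", "IIII")]

kept_r1, kept_r2 = drop_flagged_pairs(r1, r2, flagged_ids={"pair_2"})
pairs_removed = len(r1) - len(kept_r1)
```

Both output lists always have the same length, which is the property downstream paired-end tools rely on.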

HRRT is based on the SRA Taxonomy Analysis Tool and employs a database of Eukaryota k-mers derived from all human RefSeq records. Any k-mers that are also found in non-Eukaryota RefSeq records are subtracted from the database.
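
The database construction described above (collect host k-mers, then subtract any that also occur elsewhere) can be sketched with plain Python sets. This is a toy illustration under stated assumptions: the k-mer length of 4, the rule that a single matching k-mer flags a read, and the function names are all chosen for the example and are not taken from HRRT.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return every k-length substring of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_kmers(
    host_seqs: Iterable[str],
    non_host_seqs: Iterable[str],
    k: int,
) -> Set[str]:
    """Host k-mers minus any k-mer that also appears in non-host sequences."""
    host: Set[str] = set()
    for seq in host_seqs:
        host |= kmers(seq, k)
    non_host: Set[str] = set()
    for seq in non_host_seqs:
        non_host |= kmers(seq, k)
    return host - non_host


def is_flagged(read: str, database: Set[str], k: int) -> bool:
    """Assumed rule: flag a read if any of its k-mers is in the database."""
    return not kmers(read, k).isdisjoint(database)


K = 4  # toy value chosen for readability
database = build_host_kmers(["AAAACCCCGGGG"], ["CCCCGGGGTTTT"], K)

host_like_read = is_flagged("TAAAACG", database, K)
shared_read = is_flagged("CCCCGGGG", database, K)
```

The subtraction step is what keeps reads from non-host organisms from being flagged on the basis of k-mers they happen to share with the host.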

NCBI-Scrub Technical Details

	Links
Task	task_ncbi_scrub.wdl
Software Source Code	HRRT on GitHub
Software Documentation	HRRT on NCBI

Kraken2: Read Identification

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

Kraken2 is run on both the raw and clean reads.

Database-dependent

By default, this workflow uses a viral-specific Kraken2 database, which can be overridden with the kraken2_db input. This database was generated in-house from RefSeq's viral sequence collection and the human genome GRCh38. It is available at gs://theiagen-public-resources-rp/reference_data/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.
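
To show how per-taxon percentages might be read out of a Kraken2 report, here is a minimal Python sketch. It relies on the standard six-column, tab-delimited Kraken2 report layout (percentage, clade read count, direct read count, rank code, taxid, indented name). The `percent_for_taxon` helper is hypothetical, the report contents are made-up numbers for illustration only, and the way the workflow's own task extracts these values is not described here.

```python
def percent_for_taxon(report_text: str, taxon_name: str) -> float:
    """Return the clade percentage for taxon_name, or 0.0 if it is absent."""
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # The name column is indented to show taxonomic depth
        if fields[5].strip() == taxon_name:
            return float(fields[0])
    return 0.0


# Made-up report rows, for illustration only
example_report = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t0\tR\t1\troot",
    "  0.50\t5\t5\tS\t9606\t        Homo sapiens",
    " 85.00\t850\t850\tS\t2697049\t      "
    "Severe acute respiratory syndrome coronavirus 2",
])

human_percent = percent_for_taxon(example_report, "Homo sapiens")
target_percent = percent_for_taxon(
    example_report, "Severe acute respiratory syndrome coronavirus 2"
)
```

Matching on the stripped name means the lookup is independent of how deeply the taxon is indented in the report.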

Kraken2 Technical Details

	Links
Task	task_kraken2.wdl
Software Source Code	Kraken2 on GitHub
Software Documentation	Kraken2 Documentation
Original Publication(s)	Improved metagenomic analysis with Kraken 2

Outputs

NCBI_Scrub_PE

Variable	Type	Description
kraken_human_dehosted	Float	Percent of human read data detected using the Kraken2 software after host removal
kraken_report_dehosted	File	Full Kraken report after host removal
kraken_sc2_dehosted	String	Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal
kraken_version_dehosted	String	Version of Kraken2 software used
ncbi_scrub_docker	String	The Docker image for NCBI's HRRT (human read removal tool)
ncbi_scrub_human_spots_removed	Int	Number of spots removed (or masked)
ncbi_scrub_pe_analysis_date	String	Date of analysis
ncbi_scrub_pe_version	String	Version of HRRT software used
read1_dehosted	File	The dehosted forward reads file; suggested read file for SRA submission
read2_dehosted	File	The dehosted reverse reads file; suggested read file for SRA submission

NCBI_Scrub_SE

Variable	Type	Description
kraken_human_dehosted	Float	Percent of human read data detected using the Kraken2 software after host removal
kraken_report_dehosted	File	Full Kraken report after host removal
kraken_sc2_dehosted	String	Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal
kraken_version_dehosted	String	Version of Kraken2 software used
ncbi_scrub_docker	String	The Docker image for NCBI's HRRT (human read removal tool)
ncbi_scrub_human_spots_removed	Int	Number of spots removed (or masked)
ncbi_scrub_se_analysis_date	String	Date of analysis
ncbi_scrub_se_version	String	Version of HRRT software used
read1_dehosted	File	The dehosted forward reads file; suggested read file for SRA submission