NCBI_Scrub
Quick Facts
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
---|---|---|---|---|
Standalone | Any Taxa | PHB v2.2.1 | Yes | Sample-level |
NCBI Scrub Workflows
NCBI Scrub, also known as the human read removal tool (HRRT), is based on the SRA Taxonomy Analysis Tool. It takes a FASTQ file as input and produces a FASTQ file in which all reads identified as potentially of human origin are either removed (default) or masked with 'N'. There are two NCBI Scrub workflows:
- NCBI_Scrub_PE is compatible with Illumina paired-end data
- NCBI_Scrub_SE is compatible with Illumina single-end data
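The difference between removing and masking a read can be illustrated with a small sketch. This is not HRRT's implementation: the `scrub` function, the in-memory record tuples, and the pre-computed set of flagged read IDs are all assumptions made for illustration, and the sketch assumes that masking replaces every base of a flagged read with 'N'.

```python
from typing import Iterable, Iterator, Set, Tuple

FastqRecord = Tuple[str, str, str]  # (read_id, sequence, quality)


def scrub(records: Iterable[FastqRecord],
          flagged_ids: Set[str],
          mask: bool = False) -> Iterator[FastqRecord]:
    """Drop flagged reads, or replace their bases with 'N' when mask=True.

    `flagged_ids` stands in for the reads a classifier identified as
    potentially human; producing that set is outside this sketch.
    """
    for read_id, seq, qual in records:
        if read_id not in flagged_ids:
            yield read_id, seq, qual              # unflagged read passes through
        elif mask:
            yield read_id, "N" * len(seq), qual   # keep the read, hide its bases
        # flagged and mask=False: the read is removed entirely


# Hypothetical toy records
reads = [("r1", "ACGT", "IIII"), ("r2", "GGCC", "IIII")]
removed = list(scrub(reads, {"r2"}))
masked = list(scrub(reads, {"r2"}, mask=True))
print(removed)
print(masked)
```

Removal changes the number of reads in the output, whereas masking preserves the read count and read lengths.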
Inputs
NCBI_Scrub_PE
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
dehost_pe or dehost_se | read1 | File | FASTQ file containing read1 sequences | | Required |
dehost_pe or dehost_se | read2 | File | FASTQ file containing read2 sequences | | Required |
dehost_pe or dehost_se | samplename | String | The name of the sample being analyzed | | Required |
dehost_pe or dehost_se | target_organism | String | Target organism for Kraken2 reporting | Severe acute respiratory syndrome coronavirus 2 | Optional |
kraken2 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
kraken2 | disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) Kraken2 databases such as the "k2_standard" database | 100 | Optional |
kraken2 | docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional |
kraken2 | kraken2_db | String | The database used to run Kraken2 | /kraken2-db | Optional |
kraken2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
kraken2 | target_organism | String | The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database. | | Optional |
ncbi_scrub_pe or ncbi_scrub_se | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional |
ncbi_scrub_pe or ncbi_scrub_se | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
ncbi_scrub_pe or ncbi_scrub_se | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
ncbi_scrub_pe or ncbi_scrub_se | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
NCBI_Scrub_SE
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
dehost_pe or dehost_se | read1 | File | FASTQ file containing read1 sequences | | Required |
dehost_pe or dehost_se | samplename | String | The name of the sample being analyzed | | Required |
dehost_pe or dehost_se | target_organism | String | Target organism for Kraken2 reporting | Severe acute respiratory syndrome coronavirus 2 | Optional |
kraken2 | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
kraken2 | disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30 GB) Kraken2 databases such as the "k2_standard" database | 100 | Optional |
kraken2 | docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional |
kraken2 | kraken2_db | String | The database used to run Kraken2 | /kraken2-db | Optional |
kraken2 | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
kraken2 | read2 | File | Internal component, do not modify | | Optional |
kraken2 | target_organism | String | The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database. | | Optional |
ncbi_scrub_pe or ncbi_scrub_se | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional |
ncbi_scrub_pe or ncbi_scrub_se | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
ncbi_scrub_pe or ncbi_scrub_se | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
ncbi_scrub_pe or ncbi_scrub_se | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
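For command-line use, the required inputs listed in the tables can be collected into a WDL inputs JSON file. The following is a minimal sketch, assuming the JSON key prefix matches the Terra task name shown in the tables (dehost_pe or dehost_se); the `build_inputs` helper, the file names, and the sample name are hypothetical.

```python
import json
from typing import Dict, Optional


def build_inputs(samplename: str, read1: str,
                 read2: Optional[str] = None) -> Dict[str, str]:
    """Build an inputs mapping for the paired-end or single-end workflow.

    Assumption: keys take the form "<task name>.<variable>", using the
    task names and variables listed in the inputs tables.
    """
    prefix = "dehost_pe" if read2 is not None else "dehost_se"
    inputs = {
        f"{prefix}.samplename": samplename,
        f"{prefix}.read1": read1,
    }
    if read2 is not None:
        inputs[f"{prefix}.read2"] = read2
    return inputs


# Hypothetical sample and file names
pe_inputs = build_inputs("sample01", "sample01_R1.fastq.gz", "sample01_R2.fastq.gz")
se_inputs = build_inputs("sample02", "sample02.fastq.gz")
print(json.dumps(pe_inputs, indent=2))
print(json.dumps(se_inputs, indent=2))
```

Optional inputs are omitted here, so the defaults in the tables would apply.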
Workflow Tasks
This workflow is composed of two tasks: one to dehost the input reads, and another to screen the clean reads with Kraken2 and the viral+human database.
HRRT: Human Host Sequence Removal
All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).
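Removing a read "including its mate" means that a pair is dropped whenever either mate is flagged, which keeps the two output files in sync. A minimal sketch of that rule follows; the `drop_flagged_pairs` function, the record tuples, and the flagged ID set are hypothetical and do not reflect how HRRT is implemented internally.

```python
from typing import List, Set, Tuple

FastqRecord = Tuple[str, str, str]  # (read_id, sequence, quality)


def drop_flagged_pairs(read1: List[FastqRecord],
                       read2: List[FastqRecord],
                       flagged_ids: Set[str]) -> Tuple[List[FastqRecord], List[FastqRecord]]:
    """Keep only the pairs in which neither mate was flagged."""
    if len(read1) != len(read2):
        raise ValueError("read1 and read2 must contain the same number of records")
    kept1, kept2 = [], []
    for rec1, rec2 in zip(read1, read2):
        if rec1[0] in flagged_ids or rec2[0] in flagged_ids:
            continue  # one flagged mate removes the whole pair
        kept1.append(rec1)
        kept2.append(rec2)
    return kept1, kept2


# Hypothetical toy pairs; only the reverse mate of pair "b" is flagged
r1 = [("a/1", "ACGT", "IIII"), ("b/1", "GGCC", "IIII")]
r2 = [("a/2", "TTGA", "IIII"), ("b/2", "CCAA", "IIII")]
kept1, kept2 = drop_flagged_pairs(r1, r2, {"b/2"})
print(len(kept1), len(kept2))
```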
HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of Eukaryota k-mers derived from all human RefSeq records; any k-mers also found in non-Eukaryota RefSeq records are subtracted from the database.
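The subtraction idea can be sketched with plain Python sets. This is a toy illustration under stated assumptions, not the HRRT algorithm: the function names are hypothetical, the k-mer length of 4 is chosen only to keep the example readable, and treating a single shared k-mer as sufficient to flag a read is an assumption.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_database(human_seqs: Iterable[str],
                   non_eukaryote_seqs: Iterable[str],
                   k: int) -> Set[str]:
    """Collect human k-mers, then subtract any seen in non-Eukaryota sequences."""
    human = set().union(*(kmers(s, k) for s in human_seqs))
    other = set().union(*(kmers(s, k) for s in non_eukaryote_seqs))
    return human - other


def looks_human(read: str, database: Set[str], k: int) -> bool:
    """Flag a read if it shares at least one k-mer with the database (assumed rule)."""
    return any(kmer in database for kmer in kmers(read, k))


# Toy sequences, not real reference data
K = 4
db = build_database(["ACGTACGGA"], ["TTACGTAA"], K)
print(sorted(db))
print(looks_human("GGACGGT", db, K))
print(looks_human("TTACGTAA", db, K))
```

Subtracting the k-mers shared with non-Eukaryota sequences means a read made up only of those shared k-mers is not flagged, which is the purpose of the subtraction step.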
NCBI-Scrub Technical Details
Links | |
---|---|
Task | task_ncbi_scrub.wdl |
Software Source Code | HRRT on GitHub |
Software Documentation | HRRT on NCBI |
Read Identification with Kraken2
Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.
Kraken2 is run on both the raw and clean reads.
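Per-taxon percentages can be read from a standard Kraken2 report, whose tab-separated columns are: percentage of fragments in the clade, fragments in the clade, fragments assigned directly to the taxon, rank code, NCBI taxonomy ID, and indented scientific name. The sketch below assumes that format; the `percent_for_taxon` helper and the report contents are hypothetical, and it is not taken from task_kraken2.wdl.

```python
def percent_for_taxon(report_text: str, taxon_name: str) -> float:
    """Return the clade percentage for taxon_name, or 0.0 if it is absent."""
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # The name column is indented to show taxonomic depth, so strip it
        if fields[5].strip() == taxon_name:
            return float(fields[0])
    return 0.0


# Hypothetical report excerpt in the standard six-column layout
report = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t0\tR\t1\troot",
    "  0.50\t5\t5\tS\t9606\t                Homo sapiens",
    " 80.00\t800\t800\tS1\t2697049\t              Severe acute respiratory syndrome coronavirus 2",
])

human_pct = percent_for_taxon(report, "Homo sapiens")
target_pct = percent_for_taxon(report, "Severe acute respiratory syndrome coronavirus 2")
print(human_pct, target_pct)
```

The exact-name match is why target_organism should be a taxonomic name exactly as it appears in the Kraken2 database.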
Database-dependent
This workflow automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz.
Kraken2 Technical Details
Links | |
---|---|
Task | task_kraken2.wdl |
Software Source Code | Kraken2 on GitHub |
Software Documentation | https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown |
Original Publication(s) | Improved metagenomic analysis with Kraken 2 |
Outputs
NCBI_Scrub_PE
Variable | Type | Description |
---|---|---|
kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal |
kraken_report_dehosted | File | Full Kraken report after host removal |
kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal |
kraken_version_dehosted | String | Version of Kraken2 software used |
ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
ncbi_scrub_pe_analysis_date | String | Date of analysis |
ncbi_scrub_pe_version | String | Version of HRRT software used |
read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |
read2_dehosted | File | The dehosted reverse reads file; suggested read file for SRA submission |
NCBI_Scrub_SE
Variable | Type | Description |
---|---|---|
kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal |
kraken_report_dehosted | File | Full Kraken report after host removal |
kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal |
kraken_version_dehosted | String | Version of Kraken2 software used |
ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) |
ncbi_scrub_pe_analysis_date | String | Date of analysis |
ncbi_scrub_pe_version | String | Version of HRRT software used |
read1_dehosted | File | The dehosted forward reads file; suggested read file for SRA submission |