Kraken2¶
Quick Facts¶
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
---|---|---|---|---|
Standalone | Any Taxa | PHB v2.0.0 | Yes | Sample-level |
Kraken2 Workflows¶
The Kraken2 workflows assess the taxonomic profile of raw sequencing data (FASTQ files).
Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.
There are three Kraken2 workflows:
Kraken2_PE
is compatible with Illumina paired-end dataKraken2_SE
is compatible with Illumina single-end dataKraken2_ONT
is compatible with Oxford Nanopore data
Besides the data input types, there are minimal differences between these two workflows.
Databases¶
Database selection
The Kraken2 software is database-dependent and taxonomic assignments are highly sensitive to the database used. An appropriate database should contain the expected organism(s) (e.g. Escherichia coli) and other taxa that may be present in the reads (e.g. Citrobacter freundii, a common contaminant).
Suggested databases¶
Database name | Database Description | Suggested Applications | GCP URI (for usage in Terra) | Source | Database Size (GB) | Date of Last Update |
---|---|---|---|---|---|---|
Kalamari v5.1 | Kalamari is a database of complete public assemblies, that has been fine-tuned for enteric pathogens and is backed by trusted institutions. Full list available here ( in chromosomes.tsv and plasmids.tsv) | Single-isolate enteric bacterial pathogen analysis (Salmonella, Escherichia, Shigella, Listeria, Campylobacter, Vibrio, Yersinia) | gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2.kalamari_5.1.tar.gz |
‣ | 1.5 | 18/5/2022 |
standard 8GB | Standard RefSeq database (archaea, bacteria, viral, plasmid, human, UniVec_Core) capped at 8GB | Prokaryotic or viral organisms, but for enteric pathogens, we recommend Kalamari | gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_08gb_20240112.tar.gz |
https://benlangmead.github.io/aws-indexes/k2 | 7.5 | 12/1/2024 |
standard 16GB | Standard RefSeq database (archaea, bacteria, viral, plasmid, human, UniVec_Core) capped at 16GB | Prokaryotic or viral organisms, but for enteric pathogens, we recommend Kalamari | gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_16gb_20240112.tar.gz |
https://benlangmead.github.io/aws-indexes/k2 | 15 | 12/1/2024 |
standard | Standard RefSeq database (archaea, bacteria, viral, plasmid, human, UniVec_Core) | Prokaryotic or viral organisms, but for enteric pathogens, we recommend Kalamari | gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_20240112.tar.gz |
https://benlangmead.github.io/aws-indexes/k2 | 72 | 18/4/2023 |
viral | RefSeq viral | Viral metagenomics | gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_viral_20240112.tar.gz |
https://benlangmead.github.io/aws-indexes/k2 | 0.6 | 12/1/2024 |
EuPathDB48 | Eukaryotic pathogen genomes with contaminants removed. Full list available here | Eukaryotic organisms (Candida spp., Aspergillus spp., etc) | gs://theiagen-public-files-rp/terra/theiaprok-files/k2_eupathdb48_20201113.tar.gz |
https://benlangmead.github.io/aws-indexes/k2 | 30.3 | 13/11/2020 |
EuPathDB48 | Eukaryotic pathogen genomes with contaminants removed. Full list available here | Eukaryotic organisms (Candida spp., Aspergillus spp., etc) | gs://theiagen-large-public-files-rp/terra/databases/kraken/k2_eupathdb48_20230407.tar.gz |
https://benlangmead.github.io/aws-indexes/k2 | 11 | 7/4/2023 |
Inputs¶
Terra Task Name | Variable | Type | Description | Default Value | Terra Status | Workflow |
---|---|---|---|---|---|---|
*workflow_name | kraken2_db | File | A Kraken2 database in .tar.gz format | Required | ONT, PE, SE | |
*workflow_name | read1 | File | Required | ONT, PE, SE | ||
*workflow_name | read2 | File | Required for PE only | PE | ||
*workflow_name | samplename | String | Required | ONT, PE, SE | ||
kraken2_pe or kraken2_se | classified_out | String | Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix | classified#.fastq | Optional | ONT, PE, SE |
kraken2_pe or kraken2_se | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional | ONT, PE, SE |
kraken2_pe or kraken2_se | disk_size | Int | GB of storage to request for VM used to run the kraken2 task. Increase this when using large (>30GB kraken2 databases such as the "k2_standard" database) | 100 | Optional | ONT, PE, SE |
kraken2_pe or kraken2_se | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db | Optional | ONT, PE, SE |
kraken2_pe or kraken2_se | kraken2_args | String | Allows a user to supply additional kraken2 command-line arguments | Optional | ONT, PE, SE | |
kraken2_pe or kraken2_se | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional | ONT, PE, SE |
kraken2_pe or kraken2_se | unclassified_out | String | Allows user to rename unclassified FASTQ files output. Must include .fastq as the suffix | unclassified#.fastq | Optional | ONT, PE, SE |
krona | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional | PE, SE |
krona | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | PE, SE |
krona | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/krona:2.7.1--pl526_5 | Optional | PE, SE |
krona | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | PE, SE |
kraken2_recalculate_abundances | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional | ONT |
kraken2_recalculate_abundances | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | ONT |
kraken2_recalculate_abundances | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-08-28-v4 | Optional | ONT |
kraken2_recalculate_abundances | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | ONT |
kraken2_recalculate_abundances | target_organism | String | Target organism for the kraken2 abundance to be exported to the data table | Optional | ONT | |
version_capture | docker | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional | ONT, PE, SE |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | Optional | ONT, PE, SE |
Outputs¶
Variable | Type | Description |
---|---|---|
kraken2_classified_read1 | File | FASTQ file of classified forward/R1 reads |
kraken2_classified_read2 | File | FASTQ file of classified reverse/R2 reads (if PE) |
kraken2_classified_report | File | Standard Kraken2 output report. TXT filetype, but can be opened in Excel as a TSV file |
kraken2_docker | String | Docker image used to run kraken2 |
kraken2_*_wf_analysis_date | String | Date the workflow was run |
kraken2_*_wf_version | String | Workflow version |
kraken2_report | File | TXT document describing taxonomic prediction of every FASTQ record. This file is usually very large and cumbersome to open and view |
kraken2_unclassified_read1 | File | FASTQ file of unclassified forward/R1 reads |
kraken2_unclassified_read2 | File | FASTQ file of unclassified reverse/R2 reads (if PE) |
kraken2_version | String | kraken2 version |
krona_docker | String | Docker image used to run krona (if PE or SE) |
krona_html | File | HTML report of krona with visualisation of taxonomic classification of reads (if PE or SE) |
krona_version | String | krona version (if PE or SE) |
Interpretation of results¶
The most important outputs of the Kraken2 workflows are the kraken2_report
files. These will include a breakdown of the number of sequences assigned to a particular taxon, and the percentage of reads assigned. A complete description of the report format can be found here.
When assessing the taxonomic identity of a single isolate's sequence, it is normal that a few reads are assigned to very closely rated taxa due to the shared sequence identity between them. "Very closely related taxa" may be genetically similar species in the same genus, or taxa with which the dominant species have undergone horizontal gene transfer. Unrelated taxa or a high abundance of these closely related taxa is indicative of contamination or sequencing of non-target taxa. Interpretation of the results is dependent on the biological context.
Example Kraken2 report
Below is an example kraken2_report
for a Klebsiella pneumoniae sample. Only the first 30 lines are included here since rows near the bottom are often spurious results with only a few reads assigned to a non-target organism.
From this report, we can see that 84.35 % of the reads were assigned at the species level (S
in the 4th column) to "Klebsiella pneumoniae". Given almost 6 % of reads were "unclassified" and ~2 % of reads were assigned to very closely related taxa (in the Klebsiella genus), this suggests the reads are from Klebsiella pneumoniae with very little -if any- read contamination.
5.98 108155 108155 U 0 unclassified
94.02 1699669 0 C 1
94.02 1699669 1862 C1 131567 cellular organisms
93.91 1697788 2590 D 2 Bacteria
93.75 1694805 6312 P 1224 Proteobacteria
93.39 1688284 37464 C 1236 Gammaproteobacteria
91.31 1650648 35278 O 91347 Enterobacterales
89.31 1614639 43698 F 543 Enterobacteriaceae
86.40 1561902 22513 G 570 Klebsiella
**84.35 1524918 1524918 S 573 Klebsiella pneumoniae**
0.75 13596 13596 S 548 Klebsiella aerogenes
0.03 600 600 S 244366 Klebsiella variicola
0.01 253 253 S 571 Klebsiella oxytoca
0.00 17 17 S 1134687 Klebsiella michiganensis
0.00 3 0 G1 2608929 unclassified Klebsiella
0.00 3 3 S 1972757 Klebsiella sp. PO552
0.00 2 2 S 1463165 Klebsiella quasipneumoniae
0.17 3035 129 G 590 Salmonella
0.15 2728 909 S 28901 Salmonella enterica
0.03 582 582 S1 9000010 Salmonella enterica subsp. IIa
0.02 306 306 S1 59201 Salmonella enterica subsp. enterica
0.01 230 230 S1 9000014 Salmonella enterica subsp. IIIa
0.01 221 221 S1 9000015 Salmonella enterica subsp. IIIb
0.01 136 136 S1 9000016 Salmonella enterica subsp. IX
0.01 132 132 S1 9000011 Salmonella enterica subsp. IIb
0.01 122 122 S1 59208 Salmonella enterica subsp. VII
0.00 41 41 S1 59207 Salmonella enterica subsp. indica
0.00 25 25 S1 9000017 Salmonella enterica subsp. X
0.00 24 24 S1 9000009 Salmonella enterica subsp. VIII
0.01 178 178 S 54736 Salmonella bongori
Krona visualisation of Kraken2 report¶
Krona produces an interactive report that allows hierarchical data, such as the one from Kraken2, to be explored with zooming, multi-layered pie charts. These pie charts are intuitive and highly responsive.
Krona will only output hierarchical results for bacterial organisms in its current implementation.
Example Krona report
Below is an example of the krona_html
for a metagenomic sample. Taxonomic rank is organised from the centre of the pie chart to the edge, with each slice representing the relative abundance of a given taxa in the sample.
Kraken2 Technical Details
Links | |
---|---|
Software Source Code | Kraken2 on GitHub |
Software Documentation | https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown |
Original Publication(s) | Improved metagenomic analysis with Kraken 2 |