TheiaMeta¶
Quick Facts¶
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
---|---|---|---|---|
Genomic Characterization | Any Taxa | PHB v3.0.0 | Yes | Sample-level |
TheiaMeta Workflows¶
Genomic characterization of pathogens is an increasing priority for public health laboratories globally. The workflows in the TheiaMeta Genomic Characterization Series make the analysis of pathogens from metagenomic samples easy by taking raw next-generation sequencing (NGS) data and generating metagenome-assembled genomes (MAGs), with or without a reference genome.
TheiaMeta can use one of two distinct methods for generating and processing the final assembly:
- If a reference genome is not provided, the de novo assembly will be the final assembly. Additionally, the assembly goes through a binning process in which the contigs are separated into distinct files ("bins") according to composition and coverage, so that each bin ideally contains a single taxon.
- If a reference genome is provided by the user, the de novo metagenomic assembly is filtered by mapping the contigs to the reference, and the contigs that align constitute the final assembly. No binning is necessary, as the mapping retains only contigs that are likely from the same taxon as the reference.
Inputs¶
The TheiaMeta_Illumina_PE workflow processes Illumina paired-end (PE) reads generated for metagenomic characterization (typically by shotgun sequencing). By default, this workflow assumes that input reads were generated using a 300-cycle sequencing kit (i.e. 2 x 150 bp reads). Modifications to the optional parameter for trim_minlen
may be required to accommodate shorter read data, such as 2 x 75bp reads generated using a 150-cycle sequencing kit.
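For command-line runs, WDL engines generally take inputs as a JSON file keyed by workflow-qualified names. The following is a minimal sketch of assembling such a file. The file paths and sample name are placeholders, the nested key for `trim_min_length` (the name used in the inputs table below) assumes the common `workflow.call.input` naming convention, and the value 50 is purely illustrative rather than a recommended setting.

```python
import json


def build_inputs(read1, read2, samplename, trim_min_length=None, reference=None):
    """Assemble a WDL inputs dictionary for theiameta_illumina_pe.

    Only the three required inputs are always set; optional inputs are
    added when the caller supplies them.
    """
    inputs = {
        "theiameta_illumina_pe.read1": read1,
        "theiameta_illumina_pe.read2": read2,
        "theiameta_illumina_pe.samplename": samplename,
    }
    if trim_min_length is not None:
        # Assumed key layout for an input of the read_QC_trim sub-workflow.
        inputs["theiameta_illumina_pe.read_QC_trim.trim_min_length"] = trim_min_length
    if reference is not None:
        inputs["theiameta_illumina_pe.reference"] = reference
    return inputs


# Placeholder paths; a shorter minimum length for hypothetical 2 x 75 bp data.
example = build_inputs(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample01", trim_min_length=50
)
print(json.dumps(example, indent=2))
```

Leaving `reference` unset corresponds to the de novo path with binning; supplying it selects the reference-filtered path.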
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
theiameta_illumina_pe | read1 | File | Forward Illumina read in FASTQ file format | | Required |
theiameta_illumina_pe | read2 | File | Reverse Illumina read in FASTQ file format | | Required |
theiameta_illumina_pe | samplename | String | Name of the sample being analyzed | | Required |
assembled_reads_percent | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
assembled_reads_percent | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
assembled_reads_percent | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
assembled_reads_percent | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
bwa | cpu | Int | Number of CPUs to allocate to the task | 6 | Optional |
bwa | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
bwa | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
bwa | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
calculate_coverage | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
calculate_coverage | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
calculate_coverage | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/bedtools:2.31.0 | Optional |
calculate_coverage | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
calculate_coverage_paf | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
calculate_coverage_paf | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
calculate_coverage_paf | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/quay/ubuntu:latest | Optional |
calculate_coverage_paf | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
compare_assemblies | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
kraken2_clean | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
kraken2_clean | disk_size | Int | GB of storage to request for the VM used to run the kraken2 task. Increase this when using large (>30 GB) kraken2 databases such as the "k2_standard" database | 100 | Optional |
kraken2_clean | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db | Optional |
kraken2_clean | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
kraken2_raw | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
kraken2_raw | disk_size | Int | GB of storage to request for the VM used to run the kraken2 task. Increase this when using large (>30 GB) kraken2 databases such as the "k2_standard" database | 100 | Optional |
kraken2_raw | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db | Optional |
kraken2_raw | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
krona_clean | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
krona_clean | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
krona_clean | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/krona:2.8.1 | Optional |
krona_clean | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
krona_raw | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
krona_raw | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
krona_raw | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/krona:2.8.1 | Optional |
krona_raw | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
metaspades | kmers | String | Kmer list to use with metaspades. If not provided metaspades automatically sets this value | | Optional |
metaspades | metaspades_opts | String | Additional arguments to pass to metaspades task | | Optional |
minimap2_assembly | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
minimap2_assembly | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
minimap2_assembly | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22 | Optional |
minimap2_assembly | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
minimap2_assembly | query2 | File | Internal component. Do not modify. | | Optional |
minimap2_reads | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
minimap2_reads | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
minimap2_reads | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22 | Optional |
minimap2_reads | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
quast | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
quast | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
quast | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 | Optional |
quast | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
read_QC_trim | adapters | File | Adapter file to be trimmed by trimmomatic | | Optional |
read_QC_trim | bbduk_mem | Int | Memory to use with BBDuk | 8 | Optional |
read_QC_trim | call_midas | Boolean | Optional to run Midas on input data | FALSE | Optional |
read_QC_trim | fastp_args | String | Fastp-specific options that you might choose, see https://github.com/OpenGene/fastp | | Optional |
read_QC_trim | midas_db | File | A Midas database in .tar.gz format | gs://theiagen-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz | Optional |
read_QC_trim | phix | File | | | Optional |
read_QC_trim | read_processing | String | | | Optional |
read_QC_trim | read_qc | String | Allows the user to decide between fastq_scan (default) and fastqc for the evaluation of read quality. | fastq_scan | Optional |
read_QC_trim | target_organism | String | Internal component. Do not modify. | | Optional |
read_QC_trim | trim_min_length | Int | | | Optional |
read_QC_trim | trim_window_size | Int | | | Optional |
read_QC_trim | trimmomatic_args | String | | | Optional |
retrieve_aligned_contig_paf | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
retrieve_aligned_contig_paf | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
retrieve_aligned_contig_paf | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/seqkit:2.3.1 | Optional |
retrieve_aligned_contig_paf | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
retrieve_aligned_pe_reads_sam | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
retrieve_aligned_pe_reads_sam | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
retrieve_aligned_pe_reads_sam | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
retrieve_aligned_pe_reads_sam | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
retrieve_unaligned_pe_reads_sam | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
retrieve_unaligned_pe_reads_sam | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
retrieve_unaligned_pe_reads_sam | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
retrieve_unaligned_pe_reads_sam | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
sam_to_sorted_bam | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
sam_to_sorted_bam | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
sam_to_sorted_bam | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
sam_to_sorted_bam | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
semibin | cpu | Int | Number of CPUs to allocate to the task | 6 | Optional |
semibin | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
semibin | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/semibin:2.0.2--pyhdfd78af_0 | Optional |
semibin | environment | String | Environment model to use. Options: • human_gut • dog_gut • ocean • soil • cat_gut • human_oral • mouse_gut • pig_gut • built_environment • wastewater • chicken_caecum • global | global | Optional |
semibin | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
semibin | min_length | Int | Minimum contig length for binning | 1000 | Optional |
semibin | ratio | Float | If the ratio of the number of base pairs in contigs between 1000-2500 bp is smaller than this value, the minimal length will be set to 1000 bp, otherwise 2500 bp. | 0.05 | Optional |
sort_bam_assembly_correction | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
sort_bam_assembly_correction | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
sort_bam_assembly_correction | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.17 | Optional |
sort_bam_assembly_correction | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
theiameta_illumina_pe | kraken2_db | File | A Kraken2 database in .tar.gz format | gs://theiagen-public-files-rp/terra/theiaprok-files/k2_standard_08gb_20230605.tar.gz | Optional |
theiameta_illumina_pe | output_additional_files | Boolean | Output additional files such as aligned and unaligned reads to reference | FALSE | Optional |
theiameta_illumina_pe | reference | File | Reference file for consensus calling, in FASTA format | | Optional |
version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
Workflow Tasks¶
versioning
: Version Capture for TheiaMeta
The versioning
task captures the workflow version as recorded in the GitHub code repository.
Version Capture Technical details
Links | |
---|---|
Task | task_versioning.wdl |
Read Cleaning and QC¶
HRRT
: Human Host Sequence Removal
All reads of human origin are removed, including their mates, by using NCBI's human read removal tool (HRRT).
HRRT is based on the SRA Taxonomy Analysis Tool and employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records with any k-mers found in non-Eukaryota RefSeq records subtracted from the database.
NCBI-Scrub Technical Details
Links | |
---|---|
Task | task_ncbi_scrub.wdl |
Software Source Code | NCBI Scrub on GitHub |
Software Documentation | https://github.com/ncbi/sra-human-scrubber/blob/master/README.md |
read_QC_trim
: Read Quality Trimming, Adapter Removal, Quantification, and Identification
read_QC_trim
is a sub-workflow within TheiaMeta that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below.
Read quality trimming
Either trimmomatic
or fastp
can be used for read-quality trimming. Trimmomatic is used by default. Both tools trim low-quality regions of reads with a sliding window (with a window size of trim_window_size
), cutting once the average quality within the window falls below trim_quality_trim_score
. They will both discard the read if it is trimmed below trim_minlen
.
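The sliding-window behaviour described above can be illustrated with a simplified sketch. It is not the implementation used by either Trimmomatic or fastp, which differ in details; the function and argument names are hypothetical stand-ins for the window size, quality threshold, and minimum length parameters.

```python
def sliding_window_trim(quals, window_size, min_avg_quality, min_length):
    """Return the number of leading bases to keep, or 0 if the read is discarded.

    quals is a list of per-base Phred scores. The read is cut at the start
    of the first window whose mean quality falls below min_avg_quality.
    """
    keep = len(quals)
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            keep = start
            break
    # Discard reads that end up shorter than the minimum length.
    return keep if keep >= min_length else 0


# Ten good bases followed by ten poor ones.
scores = [30] * 10 + [5] * 10
print(sliding_window_trim(scores, window_size=4, min_avg_quality=20, min_length=5))
```

In this toy read the first failing window starts at position 8, so eight bases are kept; raising `min_length` above 8 would discard the read entirely.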
If fastp is selected for analysis, fastp also implements the additional read-trimming steps indicated below:
Parameter | Explanation |
---|---|
-g | enables polyG tail trimming |
-5 20 | enables read end-trimming |
-3 20 | enables read end-trimming |
--detect_adapter_for_pe | enables adapter-trimming only for paired-end reads |
Adapter removal
The BBDuk
task removes adapters from sequence reads. To do this:
- Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files.
- BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
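As a rough illustration of k-mer based filtering, the sketch below flags a read when any of its 31-mers also occurs in a reference sequence or its reverse complement. It is a conceptual toy, not BBDuk's algorithm or options, and the reference string is a made-up sequence rather than the real PhiX genome.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all k-mers in seq; empty if seq is shorter than k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_set(reference, k=31):
    """k-mers from both strands of the reference."""
    return kmers(reference, k) | kmers(revcomp(reference), k)


def matches_reference(read, ref_kmers, k=31):
    """True if the read shares at least one k-mer with the reference."""
    return any(read[i:i + k] in ref_kmers for i in range(len(read) - k + 1))


# Hypothetical 42 bp "contaminant" reference, for illustration only.
toy_reference = "ACGTAGCTAGCTAGGATCCGATCGATTTACGGCATGCAATCG"
toy_kmers = build_kmer_set(toy_reference)
print(matches_reference("TT" + toy_reference[3:40] + "GG", toy_kmers))
```

Reads for which `matches_reference` returns true would be the ones filtered out in this simplified model.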
What are adapters and why do they need to be removed?
Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.
Read Quantification
There are two methods for read quantification to choose from: fastq-scan
(default) or fastqc
. Both quantify the forward and reverse reads in FASTQ files. In TheiaMeta_Illumina_PE, they also provide the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads. fastqc
also provides a graphical visualization of the read quality.
Read Identification (optional)
The MIDAS
task is for the identification of reads to detect contamination with non-target taxa. This task is optional and turned off by default. It can be used by setting the call_midas
input variable to true
.
The MIDAS reference database, located at gs://theiagen-large-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz
, is provided as the default. It is possible to provide a custom database. More information is available here.
How are the MIDAS output columns determined?
Example MIDAS report in the midas_report
column:
species_id | count_reads | coverage | relative_abundance |
---|---|---|---|
Salmonella_enterica_58156 | 3309 | 89.88006645 | 0.855888033 |
Salmonella_enterica_58266 | 501 | 11.60606061 | 0.110519371 |
Salmonella_enterica_53987 | 99 | 2.232896237 | 0.021262881 |
Citrobacter_youngae_61659 | 46 | 0.995216227 | 0.009477003 |
Escherichia_coli_58110 | 5 | 0.123668877 | 0.001177644 |
MIDAS report column descriptions:
- species_id: species identifier
- count_reads: number of reads mapped to marker genes
- coverage: estimated genome-coverage (i.e. read-depth) of species in metagenome
- relative_abundance: estimated relative abundance of species in metagenome
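A report with these columns can be summarised programmatically. The sketch below parses a tab-separated report and takes the genus of the species with the highest relative abundance. This is one plausible way to arrive at a value like `midas_primary_genus`; it is an assumption, not a description of how the workflow computes that output.

```python
import csv
import io


def primary_genus(report_text):
    """Genus of the most abundant species in a MIDAS-style TSV report.

    Assumes species_id values look like Genus_species_12345.
    Returns None for a report with no data rows.
    """
    rows = list(csv.DictReader(io.StringIO(report_text), delimiter="\t"))
    if not rows:
        return None
    top = max(rows, key=lambda row: float(row["relative_abundance"]))
    return top["species_id"].split("_")[0]


# The example report from the table above, as tab-separated text.
example_report = "\n".join([
    "species_id\tcount_reads\tcoverage\trelative_abundance",
    "Salmonella_enterica_58156\t3309\t89.88006645\t0.855888033",
    "Salmonella_enterica_58266\t501\t11.60606061\t0.110519371",
    "Salmonella_enterica_53987\t99\t2.232896237\t0.021262881",
    "Citrobacter_youngae_61659\t46\t0.995216227\t0.009477003",
    "Escherichia_coli_58110\t5\t0.123668877\t0.001177644",
])
print(primary_genus(example_report))
```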
read_QC_trim Technical Details
kraken
: Taxonomic Classification
Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.
Kraken2 is run on the set of raw reads, provided as input, as well as the set of clean reads that result from the read_QC_trim
workflow.
Database-dependent
The Kraken2 software is database-dependent and taxonomic assignments are highly sensitive to the database used. An appropriate database should contain the expected organism(s) (e.g. Escherichia coli) and other taxa that may be present in the reads (e.g. Citrobacter freundii, a common contaminant).
Kraken2 Technical Details
Links | |
---|---|
Task | task_kraken2.wdl |
Software Source Code | Kraken2 on GitHub |
Software Documentation | https://github.com/DerrickWood/kraken2/wiki |
Original Publication(s) | Improved metagenomic analysis with Kraken 2 |
Assembly¶
metaspades
: De Novo Metagenomic Assembly
While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.
metaspades
is a de novo assembler that first constructs a de Bruijn graph of all the reads using the SPAdes algorithm. Through various graph simplification procedures, paths in the assembly graph are reconstructed that correspond to long genomic fragments within the metagenome. For more details, please see the original publication.
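To make the de Bruijn graph idea concrete, here is a toy construction in which each k-mer contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. It is only a conceptual sketch: it omits error correction, graph simplification, coverage tracking, and everything else that metaSPAdes does.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # One edge per k-mer occurrence: prefix node -> suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


toy_graph = de_bruijn_graph(["ACGTC", "CGTCA"], k=3)
for node, successors in sorted(toy_graph.items()):
    print(node, "->", successors)
```

Walking unambiguous paths through such a graph is what yields contigs; overlapping reads reinforce the same edges, which is visible here as repeated successors.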
MetaSPAdes Technical Details
Links | |
---|---|
Task | task_metaspades.wdl |
Software Source Code | SPAdes on GitHub |
Software Documentation | SPAdes Manual |
Original Publication(s) | metaSPAdes: a new versatile metagenomic assembler |
minimap2
: Assembly Correction
minimap2
is a popular aligner that is used for correcting the assembly produced by metaSPAdes. This is done by aligning the reads back to the generated assembly or a reference genome.
In minimap2, "modes" are a group of preset options.
The mode used in this task is sr
which is intended for "short single-end reads without splicing". The sr
mode indicates the following parameters should be used: -k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -b0 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no
. The output file is in SAM format.
For more information, please see the minimap2 manpage
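As a sketch of how a preset is selected on the command line, the helper below builds a minimap2 argument list using the `-x` (preset), `-a` (SAM output), and `-t` (threads) options. The exact command used inside the workflow task is not shown here, so the argument order, thread count, and file names are assumptions for illustration.

```python
def minimap2_command(preset, target, queries, sam_output=True, threads=2):
    """Build a minimap2 argument list without running it.

    preset is passed to -x (for example "sr" or "asm20"); without -a,
    minimap2 writes PAF instead of SAM.
    """
    cmd = ["minimap2", "-x", preset, "-t", str(threads)]
    if sam_output:
        cmd.append("-a")
    cmd.append(target)
    cmd.extend(queries)
    return cmd


# Hypothetical file names: paired reads aligned to a draft assembly.
print(" ".join(minimap2_command("sr", "assembly.fasta", ["r1.fastq.gz", "r2.fastq.gz"])))
```

The resulting list can be passed to `subprocess.run` where minimap2 is installed.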
minimap2 Technical Details
Links | |
---|---|
Task | task_minimap2.wdl |
Software Source Code | minimap2 on GitHub |
Software Documentation | minimap2 |
Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
samtools
: SAM File Conversion
This task converts the SAM file output by minimap2 to a BAM file. It then sorts the BAM based on the read names and generates an index file.
samtools Technical Details
Links | |
---|---|
Task | task_samtools.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
pilon
: Assembly Polishing
pilon
is a tool that uses read alignment to correct errors in an assembly. It is used to polish the assembly produced by metaSPAdes. The input to Pilon is the sorted BAM file produced by samtools
, and the original draft assembly produced by metaspades
.
pilon Technical Details
Links | |
---|---|
Task | task_pilon.wdl |
Software Source Code | Pilon on GitHub |
Software Documentation | Pilon Wiki |
Original Publication(s) | Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement |
Reference Alignment & Contig Filtering¶
These tasks only run if a reference is provided through the reference
optional input.
minimap2
: Assembly Alignment and Contig Filtering
minimap2
is a popular aligner that is used here to map the Pilon-polished assembly to the reference genome. Any aligned contigs are retrieved and returned.
In minimap2, "modes" are a group of preset options.
The mode used in this task is asm20
which is intended for "long assembly to reference mapping". The asm20
mode indicates the following parameters should be used: -k19 -w10 -U50,500 --rmq -r100k -g10k -A1 -B4 -O6,26 -E2,1 -s200 -z200 -N50
. The output file is in PAF format.
For more information, please see the minimap2 manpage
minimap2 Technical Details
Links | |
---|---|
Task | task_minimap2.wdl |
Software Source Code | minimap2 on GitHub |
Software Documentation | minimap2 |
Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
Parsing the PAF file into a FASTA file
Following the minimap2
alignment, the output PAF file is parsed into a FASTA file using seqkit
and then coverage is calculated using awk
.
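The two steps above can be sketched in Python using the standard PAF column layout (query name in column 1, target name, length, start, and end in columns 6 to 9). This assumes the contigs are the query and the reference is the target, and it computes coverage as the merged aligned span divided by the reference length; the workflow's own seqkit and awk commands may differ.

```python
from collections import defaultdict


def parse_paf(paf_text):
    """Extract query name and target interval from each PAF record."""
    records = []
    for line in paf_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        records.append({
            "query": fields[0],
            "target": fields[5],
            "target_len": int(fields[6]),
            "target_start": int(fields[7]),
            "target_end": int(fields[8]),
        })
    return records


def aligned_contigs(records):
    """Sorted names of contigs with at least one alignment."""
    return sorted({rec["query"] for rec in records})


def percent_coverage(records):
    """Percentage of target bases covered by at least one alignment."""
    spans, lengths = defaultdict(list), {}
    for rec in records:
        spans[rec["target"]].append((rec["target_start"], rec["target_end"]))
        lengths[rec["target"]] = rec["target_len"]
    covered = 0
    for intervals in spans.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlapping or adjacent: extend
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    total = sum(lengths.values())
    return 100.0 * covered / total if total else 0.0


# Two hypothetical contigs overlapping on a 1000 bp reference.
toy_paf = "\n".join([
    "contig1\t500\t0\t500\t+\tref\t1000\t0\t500\t480\t500\t60",
    "contig2\t400\t0\t400\t-\tref\t1000\t300\t700\t390\t400\t60",
])
toy_records = parse_paf(toy_paf)
print(aligned_contigs(toy_records), percent_coverage(toy_records))
```

The contig names returned by `aligned_contigs` are what a tool such as seqkit would then use to pull the matching sequences out of the assembly FASTA.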
parse_mapping
Technical Details
Links | |
---|---|
Task | task_parse_mapping.wdl#retrieve_aligned_contig_paf; task_parse_mapping.wdl#calculate_coverage_paf |
Software Source Code | seqkit on GitHub |
Software Documentation | seqkit |
Original Publication(s) | SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation; SeqKit2: A Swiss army knife for sequence and alignment processing |
Assembly QC¶
This task is run on either:
- the reference-aligned contigs (if a reference was provided), or
- the Pilon-polished assembly_fasta (if no reference was provided).
quast
: Assembly Quality Assessment
QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
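For readers unfamiliar with these metrics, the sketch below computes them from a list of contig lengths. N50 is the length of the shortest contig in the smallest set of longest contigs that together span at least half of the assembly. This is an illustration of the definitions, not QUAST's code, and the lengths are made up.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def assembly_summary(contig_lengths):
    """Contig count, total length, largest contig, and N50."""
    return {
        "contig_number": len(contig_lengths),
        "assembly_length": sum(contig_lengths),
        "largest_contig": max(contig_lengths, default=0),
        "n50": n50(contig_lengths),
    }


# Hypothetical contig lengths in base pairs.
print(assembly_summary([100, 200, 300, 400]))
```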
QUAST Technical Details
Links | |
---|---|
Task | task_quast.wdl |
Software Source Code | QUAST on GitHub |
Software Documentation | https://quast.sourceforge.net/ |
Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
Binning¶
These tasks only run if a reference is not provided.
bwa
: Read alignment to the assembly
If a reference is not provided, BWA (Burrows-Wheeler Aligner) is used to align the clean reads to the Pilon-polished assembly_fasta.
BWA Technical Details
Links | |
---|---|
Task | task_bwa.wdl |
Software Source Code | BWA on GitHub |
Software Documentation | BWA Manual |
Original Publication(s) | Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM |
semibin2
: Metagenomic binning
After the alignment, the resulting BAM file and index and the Pilon-polished assembly_fasta will be binned with semibin2
, a command-line tool for metagenomic binning with deep learning. Specifically, it uses a semi-supervised siamese neural network that uses knowledge from reference genomes while maintaining reference-exclusive bins. By default, the global
environment model is used, though a variety of options that may be better suited for your sample are available, and are listed in the relevant inputs section.
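Binning relies on per-contig features such as sequence composition and read coverage. As a hedged illustration of a composition feature, the sketch below computes a tetranucleotide frequency vector for a contig. It is a generic example of the idea and does not reproduce SemiBin2's feature extraction or its neural network.

```python
from itertools import product

# All 256 tetranucleotides in a fixed lexicographic order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Normalised counts of each 4-mer in seq; 4-mers containing other letters are skipped."""
    counts = dict.fromkeys(TETRAMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


vector = tetranucleotide_frequencies("ACGTACGT")
print(len(vector), vector[TETRAMERS.index("ACGT")])
```

Contigs from the same genome tend to have similar vectors, which is why composition, combined with coverage from the BAM file, is informative for grouping contigs into bins.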
SemiBin2 Technical Details
Links | |
---|---|
Task | task_semibin2.wdl |
Software Source Code | SemiBin2 on GitHub |
Software Documentation | SemiBin2 ReadTheDocs |
Original Publication(s) | A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments |
Additional Outputs¶
These tasks only run if output_additional_files
is set to true
(default is false
).
minimap2
: Read Alignment to the Assembly
minimap2
is a popular aligner that is used here to align the clean reads to the final assembly. This is done to provide additional information about the assembly.
In minimap2, "modes" are a group of preset options.
The mode used in this task is sr
which is intended for "short single-end reads without splicing". The sr
mode indicates the following parameters should be used: -k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -b0 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no
. The output file is in SAM format.
For more information, please see the minimap2 manpage
minimap2 Technical Details
Links | |
---|---|
Task | task_minimap2.wdl |
Software Source Code | minimap2 on GitHub |
Software Documentation | minimap2 |
Original Publication(s) | Minimap2: pairwise alignment for nucleotide sequences |
samtools
: SAM File Conversion (Round 2)
This task converts the SAM file output by minimap2 to a BAM file. It then sorts the BAM based on the read names and generates an index file.
samtools Technical Details
Links | |
---|---|
Task | task_samtools.wdl |
Software Source Code | samtools on GitHub |
Software Documentation | samtools |
Original Publication(s) | The Sequence Alignment/Map format and SAMtools; Twelve Years of SAMtools and BCFtools |
Parsing the BAM file
Several tasks follow that perform the following functions:
- Calculates the average depth of coverage of the assembly using bedtools.
- Retrieves any unaligned reads from the BAM file using samtools.
- Retrieves any aligned reads from the BAM file using samtools.
- Calculates the percentage of reads that were assembled using samtools.
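The split between aligned and unaligned reads rests on the SAM FLAG field, where bit 0x4 marks an unmapped read. The sketch below computes a mapped-read percentage from SAM text on that basis, skipping secondary (0x100) and supplementary (0x800) records so each read is counted once. The workflow performs these steps with samtools on the BAM file; this is only an illustration of the underlying logic, and the records are invented.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def percent_mapped(sam_text):
    """Percentage of primary records in SAM text that are mapped."""
    total = mapped = 0
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


# Hypothetical records: one mapped pair, one unmapped pair, one secondary hit.
toy_sam = "\n".join([
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t99\tcontig1\t1\t60\t4M\t=\t50\t53\tACGT\tIIII",
    "r1\t147\tcontig1\t50\t60\t4M\t=\t1\t-53\tACGT\tIIII",
    "r1\t355\tcontig2\t10\t0\t4M\t=\t60\t54\tACGT\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
])
print(percent_mapped(toy_sam))
```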
parse_mapping
Technical Details
Outputs¶
Variable | Type | Description |
---|---|---|
assembly_fasta | File | The final recovered metagenome-assembled genome (MAG). "A MAG represents a microbial genome by a group of sequences from genome assembly with similar characteristics. It enables [the identification of] novel species [to] understand their potential functions in a dynamic ecosystem"1 |
assembly_length | Int | The length of the assembly_fasta (see description for assembly_fasta ) in basepairs |
assembly_mean_coverage | Float | The mean depth of coverage of the assembly_fasta (see description for assembly_fasta ) |
average_read_length | Float | The average read length of the clean reads |
bbduk_docker | String | The Docker image for bbduk, which was used to remove the adapters from the sequences |
bedtools_docker | String | The Docker image for bedtools, which was used to calculate coverage |
bedtools_version | String | The version of bedtools, which was used to calculate coverage |
contig_number | Int | The number of contigs in the assembly_fasta (see description for assembly_fasta ) |
fastp_html_report | File | The report file for fastp in HTML format |
fastp_version | String | The version of fastp used |
fastq_scan_docker | String | The Docker image of fastq_scan |
fastq_scan_clean1_json | File | The JSON file output from fastq-scan containing summary stats about clean forward read quality and length |
fastq_scan_clean2_json | File | The JSON file output from fastq-scan containing summary stats about clean reverse read quality and length |
fastq_scan_num_reads_clean_pairs | String | The number of read pairs after cleaning as calculated by fastq_scan |
fastq_scan_num_reads_clean1 | Int | The number of forward reads after cleaning as calculated by fastq_scan |
fastq_scan_num_reads_clean2 | Int | The number of reverse reads after cleaning as calculated by fastq_scan |
fastq_scan_num_reads_raw_pairs | String | The number of input read pairs as calculated by fastq_scan |
fastq_scan_num_reads_raw1 | Int | The number of input forward reads as calculated by fastq_scan |
fastq_scan_num_reads_raw2 | Int | The number of input reverse reads as calculated by fastq_scan |
fastq_scan_raw1_json | File | The JSON file output from fastq-scan containing summary stats about raw forward read quality and length |
fastq_scan_raw2_json | File | The JSON file output from fastq-scan containing summary stats about raw reverse read quality and length |
fastq_scan_version | String | The version of fastq_scan |
fastqc_clean1_html | File | An HTML file that provides a graphical visualization of clean forward read quality from fastqc to open in an internet browser |
fastqc_clean2_html | File | An HTML file that provides a graphical visualization of clean reverse read quality from fastqc to open in an internet browser |
fastqc_docker | String | The Docker container used for fastqc |
fastqc_num_reads_clean_pairs | String | The number of read pairs after cleaning by fastqc |
fastqc_num_reads_clean1 | Int | The number of forward reads after cleaning by fastqc |
fastqc_num_reads_clean2 | Int | The number of reverse reads after cleaning by fastqc |
fastqc_num_reads_raw_pairs | String | The number of input read pairs by fastqc before cleaning |
fastqc_num_reads_raw1 | Int | The number of input forward reads by fastqc before cleaning |
fastqc_num_reads_raw2 | Int | The number of input reverse reads by fastqc before cleaning |
fastqc_raw1_html | File | An HTML file that provides a graphical visualization of raw forward read quality from fastqc to open in an internet browser |
fastqc_raw2_html | File | An HTML file that provides a graphical visualization of raw reverse read quality from fastqc to open in an internet browser |
fastqc_version | String | The version of fastqc software used |
kraken2_docker | String | The Docker image of kraken2 |
kraken2_percent_human_clean | Float | The percentage of human-classified reads in the sample's clean reads |
kraken2_percent_human_raw | Float | The percentage of human-classified reads in the sample's raw reads |
kraken2_report_clean | File | The full Kraken report for the sample's clean reads |
kraken2_report_raw | File | The full Kraken report for the sample's raw reads |
kraken2_version | String | The version of kraken |
krona_docker | String | The docker image of Krona |
krona_html_clean | File | The KronaPlot after reads are cleaned |
krona_html_raw | File | The KronaPlot before reads are cleaned |
krona_version | String | The version of Krona |
largest_contig | Int | The size of the largest contig in basepairs |
metaspades_docker | String | The Docker image of metaspades |
metaspades_version | String | The version of metaspades |
midas_primary_genus | String | The primary genus detected by MIDAS |
midas_report | File | The MIDAS report file in TSV format |
minimap2_docker | String | The Docker image of minimap2 |
minimap2_version | String | The version of minimap2 |
ncbi_scrub_docker | String | The Docker image for NCBI's HRRT (human read removal tool) |
percent_coverage | Float | The percentage of the reference genome that is covered, if a reference was provided |
percentage_mapped_reads | Float | The percentage of mapped reads to the assembly_fasta |
pilon_docker | String | The Docker image for pilon |
pilon_version | String | The version of pilon |
quast_docker | String | The Docker image of QUAST |
quast_version | String | The version of QUAST |
read1_clean | File | The clean forward reads file |
read1_dehosted | File | The dehosted forward reads file |
read1_mapped | File | The mapped forward reads to the assembly |
read1_unmapped | File | The unmapped forward reads to the assembly |
read2_clean | File | The clean reverse reads file |
read2_dehosted | File | The dehosted reverse reads file |
read2_mapped | File | The mapped reverse reads to the assembly |
read2_unmapped | File | The unmapped reverse reads to the assembly |
samtools_docker | String | The Docker image of samtools |
samtools_version | String | The version of samtools |
semibin_bins | Array[File] | An array of binned metagenome-assembled genome files |
semibin_docker | String | The Docker image of semibin |
semibin_version | String | The version of Semibin used |
theiameta_illumina_pe_analysis_date | String | The date of analysis |
theiameta_illumina_pe_version | String | The version of TheiaMeta used during execution |
trimmomatic_docker | String | The Docker image of trimmomatic |
trimmomatic_version | String | The version of trimmomatic |
References¶
Human read removal tool (HRRT): https://github.com/ncbi/sra-human-scrubber
Trimmomatic: Anthony M. Bolger and others, Trimmomatic: a flexible trimmer for Illumina sequence data, Bioinformatics, Volume 30, Issue 15, August 2014, Pages 2114–2120, https://doi.org/10.1093/bioinformatics/btu170
Fastq-Scan: https://github.com/rpetit3/fastq-scan
metaSPAdes: Sergey Nurk and others, metaSPAdes: a new versatile metagenomic assembler, Genome Res. 2017 May; 27(5): 824–834., https://doi.org/10.1101/gr.213959.116
Pilon: Bruce J. Walker and others. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. Plos One. November 19, 2014. https://doi.org/10.1371/journal.pone.0112963
Minimap2: Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094–3100, https://doi.org/10.1093/bioinformatics/bty191
QUAST: Alexey Gurevich and others, QUAST: quality assessment tool for genome assemblies, Bioinformatics, Volume 29, Issue 8, April 2013, Pages 1072–1075, https://doi.org/10.1093/bioinformatics/btt086
Samtools: Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079. https://doi.org/10.1093/bioinformatics/btp352
BEDtools: Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842. https://doi.org/10.1093/bioinformatics/btq033
Bcftools: Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li. Twelve years of SAMtools and BCFtools. GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008
Semibin2: Shaojun Pan, Xing-Ming Zhao, Luis Pedro Coelho, SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i21–i29, https://doi.org/10.1093/bioinformatics/btad209
1. Direct quote from the abstract of Yang C, Chowdhury D, Zhang Z, Cheung WK, Lu A, Bian Z, Zhang L. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Comput Struct Biotechnol J. 2021;19:6301-14. doi: 10.1016/j.csbj.2021.11.028. This 2021 paper reviews some, though not all, of the tools used in this workflow.