Snippy_Streamline¶

Quick Facts¶

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Phylogenetic Construction	Bacteria	v3.0.0	Yes	Set-level

Snippy_Streamline_PHB¶

Snippy_Streamline_PHB Workflow Diagram

The Snippy_Streamline workflow is an all-in-one approach to generating a reference-based phylogenetic tree and associated SNP-distance matrix. The workflow can be run in multiple ways.

Reference Genome Options

In order to generate a phylogenetic tree, a reference genome is required. This can be:

provided by the user by filling the reference_genome_file input variable
the identified centroid genome by setting use_centroid_as_reference to true
automatically selected, using the centroid task and the referenceseeker task to find a reference genome closely related to your dataset; to do this, provide data in the assembly_fasta input variable and leave the reference_genome_file and use_centroid_as_reference fields blank

Automatic Reference Selection

If no reference genome is provided, then the user MUST fill in the assembly_fasta field for automatic reference genome selection.
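
The three reference options can be summarized as a small decision function. This is a sketch only: choose_reference_mode is a hypothetical helper that is not part of the workflow, and the precedence it gives to reference_genome_file over use_centroid_as_reference is an assumption.

```python
from typing import Optional, Sequence


def choose_reference_mode(
    reference_genome_file: Optional[str] = None,
    use_centroid_as_reference: bool = False,
    assembly_fasta: Optional[Sequence[str]] = None,
) -> str:
    """Return which reference option a set of inputs corresponds to."""
    if reference_genome_file:
        # Assumption: a user-supplied reference takes precedence over the other options.
        return "user_provided"
    if not assembly_fasta:
        # Without a reference genome, assemblies are needed to find one.
        raise ValueError("assembly_fasta is required when no reference genome is provided")
    if use_centroid_as_reference:
        return "centroid"
    return "automatic"


print(choose_reference_mode(assembly_fasta=["sample1.fasta", "sample2.fasta"]))
```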

Phylogenetic Tree Construction Options

There are several options that can be used to customize the phylogenetic tree, including:

masking user-specified regions of the genome (by providing a bed file to snippy_core_bed)
producing either a core or pan-genome phylogeny and SNP-matrix (by altering core_genome; true [default] = core genome, false = pan-genome)
choosing the nucleotide substitution model (by altering iqtree2_model [see below for possible nucleotide substitution models]), or allowing IQ-Tree's ModelFinder to identify the best model for your dataset (default)
masking recombination detected by gubbins, or not (by altering use_gubbins; true [default] = recombination masking, false = no recombination masking)

Multiple Contigs in Reference Genomes

Gubbins cannot mask recombination when the reference genome has multiple contigs. Automatic reference selection may return a reference with multiple contigs; in that case, either seek an alternative reference genome or turn Gubbins off (via use_gubbins = false).
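
One way to check a FASTA reference before enabling Gubbins is to count its sequence headers. The sketch below is illustrative only: count_fasta_records and single_contig are hypothetical helpers, and the check applies to FASTA input rather than GenBank files.

```python
import io
from typing import TextIO


def count_fasta_records(handle: TextIO) -> int:
    """Count sequence records (lines starting with '>') in a FASTA stream."""
    return sum(1 for line in handle if line.startswith(">"))


def single_contig(handle: TextIO) -> bool:
    """True when the reference holds exactly one contig."""
    return count_fasta_records(handle) == 1


# Toy two-contig reference; real use would pass an open file handle instead.
example = io.StringIO(">contig_1\nACGT\n>contig_2\nGGCC\n")
if not single_contig(example):
    print("Reference has multiple contigs: set use_gubbins = false or pick another reference")
```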

Inputs¶

To run Snippy_Streamline, either a reference genome must be provided (reference_genome_file), or you must provide assemblies of the samples in your tree so that the workflow can automatically find and download the closest reference genome to your dataset (via assembly_fasta).

Input Sequencing Data Requirements

Sequencing data used in the Snippy_Streamline workflow must:

Be Illumina reads
Be generated by unbiased whole genome shotgun sequencing
Pass appropriate QC thresholds for the taxa to ensure that the reads represent reasonably complete genomes that are free of contamination from other taxa or cross-contamination of the same taxon.
If masking recombination with Gubbins, input data should represent complete genomes from the same strain/lineage (e.g. MLST) that share a recent common ancestor.

Guidance for optional inputs

Several core and optional tasks can be used to generate the Snippy phylogenetic tree, making it highly flexible and suited to a wide range of datasets. You will need to decide which tasks to use depending on the genomes that you are analyzing. Some guidelines for the optional tasks to use for different genome types are provided below.

Default settings (suitable for most bacteria)

The default settings are as follows and are suitable for generating phylogenies for most bacteria:

core_genome = true (creates core genome phylogeny)
use_gubbins = true (recombination masked)
nucleotide substitution model will be defined by IQTree's ModelFinder

Phylogenies of Mycobacterium tuberculosis complex

Phylogenies of MTBC are typically constructed with the following options:

Using the H37Rv reference genome
- reference_genome_file = "gs://theiagen-public-resources-rp/reference_data/bacterial/mycobacterium/MTB-NC_000962.3.fasta"
Masking repetitive regions of the genome (e.g. PE/PPE genes) that are often misaligned
- snippy_core_bed = "gs://theiagen-public-resources-rp/reference_data/bacterial/mycobacterium/MTB-NC_000962.3.bed"
Without masking recombination because TB can be considered non-recombinant
- use_gubbins = false
Using the core genome
- core_genome = true (as default)
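
The MTBC options above could be collected into an inputs file along these lines. This is a sketch: the bare variable names are taken from the inputs table, but the fully qualified key names that Terra or a WDL runner expects are not given here, so the keys below are assumptions that may need a workflow prefix.

```python
import json

# Optional inputs typically used for MTBC phylogenies, as listed above.
mtbc_inputs = {
    "reference_genome_file": "gs://theiagen-public-resources-rp/reference_data/bacterial/mycobacterium/MTB-NC_000962.3.fasta",
    "snippy_core_bed": "gs://theiagen-public-resources-rp/reference_data/bacterial/mycobacterium/MTB-NC_000962.3.bed",
    "use_gubbins": False,  # TB can be considered non-recombinant
    "core_genome": True,   # same as the default
}

rendered = json.dumps(mtbc_inputs, indent=2)
print(rendered)
```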

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
snippy_streamline	read1	Array[File]	FASTQ files containing read1 sequences		Required
snippy_streamline	read2	Array[File]	FASTQ files containing read2 sequences		Required
snippy_streamline	samplenames	Array[String]	The names of the samples being analyzed		Required
snippy_streamline	tree_name	String	String of your choice to prefix output files		Required
centroid	cpu	Int	Number of CPUs to allocate to the task	1	Optional
centroid	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
centroid	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/centroid:0.1.0	Optional
centroid	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
ncbi_datasets_download_genome_accession	cpu	Int	Number of CPUs to allocate to the task	1	Optional
ncbi_datasets_download_genome_accession	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
ncbi_datasets_download_genome_accession	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:14.13.2	Optional
ncbi_datasets_download_genome_accession	include_gbff	Boolean	Set to true if you would like the GenBank Flat File (GBFF) file included in the output. It contains nucleotide sequence, metadata, and annotations.	FALSE	Optional
ncbi_datasets_download_genome_accession	include_gff3	Boolean	Set to true if you would like the Genomic Feature File v3 (GFF3) file included in the output. It contains nucleotide sequence, metadata, and annotations	FALSE	Optional
ncbi_datasets_download_genome_accession	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
ncbi_datasets_download_genome_accession	use_ncbi_virus	Boolean	Set to true to download from NCBI Virus Datasets	FALSE	Optional
referenceseeker	cpu	Int	Number of CPUs to allocate to the task	4	Optional
referenceseeker	disk_size	Int	Amount of storage (in GB) to allocate to the task	200	Optional
referenceseeker	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/referenceseeker:1.8.0--pyhdfd78af_0	Optional
referenceseeker	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
referenceseeker	referenceseeker_ani_threshold	Float	Bidirectional average nucleotide identity to use as a cut off for identifying reference assemblies with ReferenceSeeker; default value set according to https://github.com/oschwengers/referenceseeker#description	0.95	Optional
referenceseeker	referenceseeker_conserved_dna_threshold	Float	Conserved DNA % to use as a cut off for identifying reference assemblies with ReferenceSeeker; default value set according to https://github.com/oschwengers/referenceseeker#description	0.69	Optional
referenceseeker	referenceseeker_db	File	Database used by the referenceseeker tool that contains bacterial genomes from RefSeq release 205. Downloaded from the referenceseeker GitHub repository.	gs://theiagen-public-resources-rp/reference_data/databases/referenceseeker/referenceseeker-bacteria-refseq-205.v20210406.tar.gz	Optional
snippy_streamline	assembly_fasta	Array[File]	The assembly files for your samples (Required if a reference genome is not provided)		Optional
snippy_streamline	reference_genome_file	File	Reference genome in FASTA or GENBANK format (must be the same reference used in Snippy_Variants workflow); provide this if you want to skip the detection of a suitable reference		Optional
snippy_streamline	use_centroid_as_reference	Boolean	Set to true if you want to use the centroid sample as the reference sample instead of using the centroid to detect a suitable one	FALSE	Optional
snippy_tree_wf	call_shared_variants	Boolean	When true, workflow generates table that combines variants across all samples and a table showing variants shared across samples	TRUE	Optional
snippy_tree_wf	core_genome	Boolean	When true, workflow generates core genome phylogeny; when false, whole genome is used	TRUE	Optional
snippy_tree_wf	data_summary_column_names	String	A comma-separated list of the column names from the sample-level data table for generating a data summary (presence/absence .csv matrix)		Optional
snippy_tree_wf	data_summary_terra_project	String	The billing project for your current workspace. This can be found after the "#workspaces/" section in the workspace's URL		Optional
snippy_tree_wf	data_summary_terra_table	String	The name of the sample-level Terra data table that will be used for generating a data summary		Optional
snippy_tree_wf	data_summary_terra_workspace	String	The name of the Terra workspace you are in. This can be found at the top of the webpage, or in the URL after the billing project.		Optional
snippy_tree_wf	gubbins_cpu	Int	Number of CPUs to allocate to the task	4	Optional
snippy_tree_wf	gubbins_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
snippy_tree_wf	gubbins_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/gubbins:3.3--py310pl5321h8472f5a_0	Optional
snippy_tree_wf	gubbins_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
snippy_tree_wf	iqtree2_bootstraps	String	Number of replicates for ultrafast bootstrap approximation (see http://www.iqtree.org/doc/Tutorial#assessing-branch-supports-with-ultrafast-bootstrap-approximation); minimum recommended = 1000	1000	Optional
snippy_tree_wf	iqtree2_cpu	Int	Number of CPUs to allocate to the task	4	Optional
snippy_tree_wf	iqtree2_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
snippy_tree_wf	iqtree2_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/iqtree2:2.1.2	Optional
snippy_tree_wf	iqtree2_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
snippy_tree_wf	iqtree2_model	String	Nucleotide substitution model to use when generating the final tree with IQTree2. By default, IQTree2 runs its ModelFinder algorithm to identify the model it thinks best fits your dataset		Optional
snippy_tree_wf	iqtree2_opts	String	Additional options to pass to IQTree2		Optional
snippy_tree_wf	midpoint_root_tree	Boolean	A True/False option that determines whether the tree used in the SNP matrix re-ordering task should be re-rooted or not. Options: true or false	TRUE	Optional
snippy_tree_wf	phandango_coloring	Boolean	Boolean variable that tells the data summary task and the reorder matrix task to include a suffix that enables consistent coloring on Phandango; by default, this suffix is not added. To add this suffix set this variable to true.	FALSE	Optional
snippy_tree_wf	snippy_core_bed	File	User-provided bed file to mask out regions of the genome when creating multiple sequence alignments		Optional
snippy_tree_wf	snippy_core_cpu	Int	Number of CPUs to allocate to the task	8	Optional
snippy_tree_wf	snippy_core_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
snippy_tree_wf	snippy_core_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/snippy:4.6.0	Optional
snippy_tree_wf	snippy_core_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
snippy_tree_wf	snp_dists_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/snp-dists:0.8.2	Optional
snippy_tree_wf	snp_sites_cpu	Int	Number of CPUs to allocate to the task	1	Optional
snippy_tree_wf	snp_sites_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
snippy_tree_wf	snp_sites_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/snp-sites:2.5.1	Optional
snippy_tree_wf	snp_sites_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
snippy_tree_wf	use_gubbins	Boolean	When "true", workflow removes recombination with gubbins tasks; when "false", gubbins is not used	TRUE	Optional
snippy_variants_wf	base_quality	Int	Minimum quality for a nucleotide to be used in variant calling	13	Optional
snippy_variants_wf	cpu	Int	Number of CPUs to allocate to the task	4	Optional
snippy_variants_wf	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/snippy:4.6.0	Optional
snippy_variants_wf	map_qual	Int	Minimum mapping quality to accept in variant calling, default from snippy tool is 60		Optional
snippy_variants_wf	maxsoft	Int	Number of bases of alignment to soft-clip before discarding the alignment, default from snippy tool is 10		Optional
snippy_variants_wf	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
snippy_variants_wf	min_coverage	Int	Minimum read coverage of a position to identify a mutation	10	Optional
snippy_variants_wf	min_frac	Float	Minimum fraction of bases at a given position to identify a mutation, default from snippy tool is 0	0.9	Optional
snippy_variants_wf	min_quality	Int	Minimum VCF variant call "quality"	100	Optional
snippy_variants_wf	query_gene	String	Comma-separated strings (e.g. gene names) in which to search for mutations to output to data table		Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
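
Before launching the workflow, the four required inputs can be sanity-checked. The sketch below is not part of the workflow: check_required_inputs is a hypothetical helper, and the rules that the three arrays have equal length and that sample names are unique are assumptions about how the arrays are paired.

```python
from typing import List, Sequence


def check_required_inputs(
    read1: Sequence[str],
    read2: Sequence[str],
    samplenames: Sequence[str],
    tree_name: str,
) -> List[str]:
    """Return a list of problems found in the required inputs (empty if none)."""
    problems = []
    if not tree_name:
        problems.append("tree_name is empty")
    # Assumption: read1, read2 and samplenames are paired by position.
    if not (len(read1) == len(read2) == len(samplenames)):
        problems.append("read1, read2 and samplenames differ in length")
    # Assumption: each sample name should appear only once.
    if len(set(samplenames)) != len(samplenames):
        problems.append("samplenames contains duplicates")
    return problems


print(check_required_inputs(["a_R1.fastq.gz"], ["a_R2.fastq.gz"], ["a"], "my_tree"))
```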

Workflow Tasks¶

Automatic Reference Selection¶

The following tasks perform automatic reference selection (if no reference genome is provided by the user and assembly_fasta is provided).

Centroid¶

Centroid selects the most central genome among a list of assemblies by computing pairwise mash distances. In Snippy_Streamline, this centroid assembly is then used to find a closely related reference genome that can be used to generate the tree. To use Centroid, you should complete the samplenames input.
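
The idea of a "most central" genome can be illustrated with a toy distance table. This is a sketch under an assumption: it picks the sample with the smallest summed distance to all others, which may not be the exact criterion the centroid tool applies. pick_centroid and the distance values are hypothetical.

```python
from typing import Dict, Tuple


def pick_centroid(distances: Dict[Tuple[str, str], float]) -> str:
    """Return the sample with the smallest total distance to all other samples."""
    totals: Dict[str, float] = {}
    for (a, b), d in distances.items():
        totals[a] = totals.get(a, 0.0) + d
        totals[b] = totals.get(b, 0.0) + d
    # Iterating over sorted names makes ties resolve alphabetically.
    return min(sorted(totals), key=totals.__getitem__)


# Made-up pairwise mash distances between three assemblies.
toy = {("A", "B"): 0.01, ("A", "C"): 0.02, ("B", "C"): 0.05}
print(pick_centroid(toy))  # A
```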

centroid Technical Details

	Links
Task	task_centroid.wdl
Software Source Code	https://github.com/theiagen/centroid
Software Documentation	https://github.com/theiagen/centroid

ReferenceSeeker Details (Optional)

ReferenceSeeker¶

ReferenceSeeker uses your draft assembly to identify closely related bacterial, viral, fungal, or plasmid genome assemblies in RefSeq.

Databases that can be used with ReferenceSeeker are as follows, and can be used by pasting the GSURI in double quotation marks " " into the referenceseeker_db optional input:

archaea: gs://theiagen-public-resources-rp/reference_data/databases/referenceseeker/referenceseeker-archaea-refseq-205.v20210406.tar.gz
bacterial (default): gs://theiagen-public-resources-rp/reference_data/databases/referenceseeker/referenceseeker-bacteria-refseq-205.v20210406.tar.gz
fungi: gs://theiagen-public-resources-rp/reference_data/databases/referenceseeker/referenceseeker-fungi-refseq-205.v20210406.tar.gz
plasmids: gs://theiagen-public-resources-rp/reference_data/databases/referenceseeker/referenceseeker-plasmids-refseq-205.v20210406.tar.gz
viral: gs://theiagen-public-resources-rp/reference_data/databases/referenceseeker/referenceseeker-viral-refseq-205.v20210406.tar.gz

For ReferenceSeeker to identify a genome, it must meet user-specified thresholds for sequence coverage (referenceseeker_conserved_dna_threshold; default >= 0.69) and identity (referenceseeker_ani_threshold; default >= 0.95).

A list of closely related genomes is provided in referenceseeker_tsv. The reference genome that ranks highest according to ANI and conserved DNA values is considered the closest match and will be downloaded, with information about this provided in the assembly_fetch_referenceseeker_top_hit_ncbi_accession output.
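
The thresholding and ranking described above can be sketched as follows. The record fields (accession, ani, conserved_dna), the example values, and the use of ANI multiplied by conserved DNA as the ranking score are all assumptions made for illustration; they are not taken from the referenceseeker_tsv format.

```python
from typing import Dict, List, Optional


def top_reference(
    hits: List[Dict],
    ani_threshold: float = 0.95,
    conserved_dna_threshold: float = 0.69,
) -> Optional[Dict]:
    """Return the best hit that meets both thresholds, or None if none qualify."""
    passing = [
        h for h in hits
        if h["ani"] >= ani_threshold and h["conserved_dna"] >= conserved_dna_threshold
    ]
    if not passing:
        return None
    # Assumption: rank candidates by the product of ANI and conserved DNA.
    return max(passing, key=lambda h: h["ani"] * h["conserved_dna"])


# Placeholder accessions and made-up scores.
hits = [
    {"accession": "REF_A", "ani": 0.99, "conserved_dna": 0.60},
    {"accession": "REF_B", "ani": 0.97, "conserved_dna": 0.90},
    {"accession": "REF_C", "ani": 0.98, "conserved_dna": 0.80},
]
print(top_reference(hits)["accession"])  # REF_B
```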

ReferenceSeeker Technical Details

	Links
Task	task_referenceseeker.wdl
Software Source Code	ReferenceSeeker on GitHub
Software Documentation	ReferenceSeeker on GitHub
Original Publication(s)	ReferenceSeeker: rapid determination of appropriate reference genomes

NCBI Datasets¶

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

include_gbff behavior

If include_gbff is set to true, the gbff file will be used as the reference for Snippy_Variants and Snippy_Tree. If include_gbff is set to false, the fasta file will be used as the reference for Snippy_Variants and Snippy_Tree. Tree topology should not differ, though annotations may.

NCBI Datasets Technical Details

	Links
Task	task_ncbi_datasets.wdl
Software Source Code	NCBI Datasets on GitHub
Software Documentation	NCBI Datasets Documentation on NCBI
Original Publication(s)	Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Variant Calling¶

The following task performs variant calling on the samples using a reference genome (either selected in the previous steps, or provided by the user)

Please see the full documentation for Snippy_Variants for more information.

Snippy_Variants¶

Snippy_Variants uses Snippy to align the reads for each sample against the reference genome to call SNPs, MNPs and INDELs according to optional input parameters.

Optionally, if the user provides a value for query_gene, the variant file will be searched for any mutations in the specified regions or annotations. The query string MUST match the gene name or annotation as specified in the GenBank file and provided in the output variant file in the snippy_results column.

QC Metrics from Snippy_Variants

Warning

The following QC metrics may not be applicable to your dataset as they are geared towards read data, not assemblies. Use these metrics with caution.

This task also extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (snippy_variants_qc_metrics). These per-sample QC metrics include the following columns:

samplename: The name of the sample.
reads_aligned_to_reference: The number of reads that aligned to the reference genome.
total_reads: The total number of reads in the sample.
percent_reads_aligned: The percentage of reads that aligned to the reference genome.
variants_total: The total number of variants detected between the sample and the reference genome.
percent_ref_coverage: The percentage of the reference genome covered by reads with a depth greater than or equal to the min_coverage threshold (default is 10).
#rname: Reference sequence name (e.g., chromosome or contig name).
startpos: Starting position of the reference sequence.
endpos: Ending position of the reference sequence.
numreads: Number of reads covering the reference sequence.
covbases: Number of bases with coverage.
coverage: Percentage of the reference sequence covered (depth ≥ 1).
meandepth: Mean depth of coverage over the reference sequence.
meanbaseq: Mean base quality over the reference sequence.
meanmapq: Mean mapping quality over the reference sequence.

Note that the last set of columns (#rname to meanmapq) may repeat for each chromosome or contig in the reference genome.

Snippy Variants Technical Details

	Links
Task	task_snippy_variants.wdl task_snippy_gene_query.wdl
Software Source Code	Snippy on GitHub
Software Documentation	Snippy on GitHub

Phylogenetic Construction¶

The following tasks are a simplified version of the Snippy_Tree workflow, which is used to build the phylogenetic tree. The tasks undertaken are exactly the same between both workflows, but user inputs and outputs have been reduced for clarity and ease.

Please see the full documentation for Snippy_Tree for more information.

Gubbins Nucleotide Substitution Model

In Snippy Streamline, the nucleotide substitution model used by gubbins will always be GTR+GAMMA.

Snippy¶

Snippy is used to generate a whole-genome multiple sequence alignment (fasta file) of reads from all the samples we'd like in our tree.

When generating the multiple sequence alignment, a bed file can be provided by users to mask certain areas of the genome in the alignment. This is particularly relevant for masking known repetitive regions in Mycobacterium tuberculosis genomes, or masking known regions containing phage sequences.

Why do I see snippy_core in Terra?

In Terra, this task is named "snippy_core" after the name of the command in the original Snippy tool. Despite the name, this command is NOT being used to make a core genome, but instead a multiple sequence alignment of the whole genome (excluding any sections masked using a bed file).

Snippy Technical Details

	Links
Task	task_snippy_core.wdl
Software Source Code	Snippy on GitHub
Software Documentation	Snippy on GitHub

Gubbins (optional)¶

Turn on Gubbins with use_gubbins

Gubbins runs when the use_gubbins option is set to true (default=true).

Most optional inputs are hidden in Snippy_Streamline for simplification of the workflow. If you would like to use Gubbins with additional options, please use the Snippy_Tree workflow.

Genealogies Unbiased By recomBinations In Nucleotide Sequences (Gubbins) identifies and masks genomic regions that are predicted to have arisen via recombination. It works by iteratively identifying loci containing elevated densities of SNPs and constructing phylogenies based on the putative single nucleotide variants outside these regions (for more details, see here). By default, these phylogenies are constructed using RAxML and a GTR-GAMMA nucleotide substitution model, which will be the most suitable model for most bacterial phylogenetics, though in the Snippy_Tree workflow this can be modified with the tree_builder and nuc_subst_model inputs.

Gubbins is the industry standard for masking recombination from bacterial genomes when building phylogenies, but limitations to recombination removal exist. Gubbins cannot distinguish recombination from high densities of SNPs that may result from assembly or alignment errors, mutational hotspots, or regions of the genome with relaxed selection. The tool is also intended only to find recombinant regions that are short relative to the length of the genome, so large regions of recombination may not be masked. These factors should be considered when interpreting resulting phylogenetic trees, but overwhelmingly Gubbins improves our ability to understand ancestral relationships between bacterial genomes.

Gubbins Technical Details

	Links
Task	task_gubbins.wdl
Software Source Code	Gubbins on GitHub
Software Documentation	Gubbins v3.3 manual
Original Publication(s)	Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins

SNP-sites (optional)¶

Turn on SNP-Sites with core_genome

SNP-sites runs when the core_genome option is set to true.

SNP-sites is used to filter out invariant sites in the whole-genome alignment, thereby creating a core genome alignment for phylogenetic inference. The output is a fasta file containing the core genome of each sample only. If Gubbins has been used, this output fasta will not contain any sites that are predicted to have arisen via recombination.
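
Filtering out invariant sites can be illustrated on a toy alignment. This is a simplified sketch, not the SNP-sites implementation: variable_sites is a hypothetical helper that keeps a column when more than one of A, C, G and T occurs in it, and it does not reproduce how the real tool treats gaps or ambiguous bases.

```python
from typing import Dict


def variable_sites(alignment: Dict[str, str]) -> Dict[str, str]:
    """Keep only alignment columns in which at least two different bases occur."""
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must have the same length")
    keep = [
        i for i, column in enumerate(zip(*seqs))
        if len(set(column) & set("ACGT")) > 1
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in zip(names, seqs)}


toy = {"s1": "ACGTA", "s2": "ACCTA", "s3": "ATGTA"}
print(variable_sites(toy))  # {'s1': 'CG', 's2': 'CC', 's3': 'TG'}
```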

SNP-sites technical details

	Links
Task	task_snp_sites.wdl
Software Source Code	SNP-sites on GitHub
Software Documentation	SNP-sites on GitHub
Original Publication(s)	SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments

IQTree2¶

IQTree2 is used to build the final phylogeny. It uses the alignment generated in the previous steps of the workflow. The contents of this alignment will depend on whether any sites were masked for recombination.

The phylogeny is generated using the maximum-likelihood method and a specified nucleotide substitution model. By default, the Snippy_Tree workflow will run ModelFinder to determine the most appropriate nucleotide substitution model for your data, but you may specify the nucleotide substitution model yourself using the iqtree2_model optional input (see here for available models).

IQTree will assess the tree using the Shimodaira–Hasegawa approximate likelihood-ratio test (SH-aLRT test) and ultrafast bootstrapping with UFBoot2, a quicker and less biased alternative to standard bootstrapping. A clade should not typically be trusted if it has less than 80% support from the SH-aLRT test or less than 95% support with ultrafast bootstrapping.
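
When reading the tree, the two support values can be checked together. The sketch below assumes branch labels of the form "SH-aLRT/UFBoot" (for example "85.2/97") and follows the IQ-TREE guidance of relying on a clade when SH-aLRT is at least 80% and UFBoot is at least 95%. clade_is_supported is a hypothetical helper.

```python
def clade_is_supported(label: str, min_alrt: float = 80.0, min_ufboot: float = 95.0) -> bool:
    """Return True when both support values in an 'SH-aLRT/UFBoot' label meet the thresholds."""
    alrt_text, ufboot_text = label.split("/")
    return float(alrt_text) >= min_alrt and float(ufboot_text) >= min_ufboot


for label in ["85.2/97", "79.9/99", "90/94"]:
    print(label, clade_is_supported(label))
```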

Nucleotide substitution model

When core_genome= true, the default nucleotide substitution model is set to the General Time Reversible model with Gamma distribution (GTR+G).

When the user sets core_genome= false, the default nucleotide substitution model is set to the General Time Reversible model with invariant sites and Gamma distribution (GTR+I+G).

IQTree2 technical details

	Links
Task	task_iqtree2.wdl
Software Source Code	IQ-TREE on GitHub
Software Documentation	IQTree documentation for the latest version (not necessarily the version used in this workflow)
Original Publication(s)	IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0 Ultrafast Approximation for Phylogenetic Bootstrap UFBoot2: Improving the Ultrafast Bootstrap Approximation ModelFinder: fast model selection for accurate phylogenetic estimates

SNP-dists¶

SNP-dists computes pairwise SNP distances between genomes. It takes the same alignment of genomes used to generate your phylogenetic tree and produces a matrix of pairwise SNP distances between sequences. This means that if you generated a core-genome phylogeny, the output will consist of pairwise core-genome SNP (cgSNP) distances; otherwise, these will be whole-genome SNP distances. In either case, the SNP distance matrix excludes all SNPs in masked regions (i.e. masked with a bed file or Gubbins).
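
A pairwise SNP distance can be illustrated on a toy alignment. This is a sketch rather than the SNP-dists implementation: snp_distance and distance_matrix are hypothetical helpers, and counting a difference only when both bases are A, C, G or T is an assumption about how masked or ambiguous positions are handled.

```python
from typing import Dict, Tuple


def snp_distance(a: str, b: str) -> int:
    """Count positions where two aligned sequences differ, ignoring non-ACGT characters."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    bases = set("ACGT")
    return sum(
        1 for x, y in zip(a.upper(), b.upper())
        if x != y and x in bases and y in bases
    )


def distance_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Compute the distance for every ordered pair of samples."""
    names = list(alignment)
    return {(p, q): snp_distance(alignment[p], alignment[q]) for p in names for q in names}


toy = {"s1": "ACGT", "s2": "ACGA", "s3": "NCTA"}
matrix = distance_matrix(toy)
print(matrix[("s1", "s3")])  # 2
```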

The SNP-distance output can be visualized using software such as Phandango to explore the relationships between the genomic sequences. The task can optionally add a Phandango coloring tag (:c1) to the column names in the output matrix to ensure that all columns are colored with the same color scheme throughout by setting phandango_coloring to true.

SNP-dists Technical Details

	Links
Task	task_snp_dists.wdl
Software Source Code	SNP-dists on GitHub
Software Documentation	SNP-dists on GitHub

Data summary (optional)

Data Summary (optional)¶

If you fill out the data_summary_* and sample_names optional variables, you can use the optional summarize_data task. The task takes a comma-separated list of column names from the Terra data table, which should each contain a list of comma-separated items. For example, "amrfinderplus_virulence_genes,amrfinderplus_stress_genes" (with quotes, comma separated, no spaces) for these output columns from running TheiaProk. The task checks whether those comma-separated items are present in each row of the data table (sample), then creates a CSV file of these results. The CSV file indicates presence (TRUE) or absence (empty) for each item. By default, the task does not add a Phandango coloring tag to group items from the same column, but you can turn this on by setting phandango_coloring to true.

Example output CSV

Sample_Name,aph(3')-IIa,blaCTX-M-65,blaOXA-193,tet(O)
sample1,TRUE,,TRUE,TRUE
sample2,,,FALSE,TRUE
sample3,,,FALSE,
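
The presence/absence logic described for the summarize_data task can be sketched as below. This is not the task's code: summarize_presence is a hypothetical helper, the input is a plain dictionary standing in for the Terra data table, and the column name and values are made up.

```python
import csv
import io
from typing import Dict, List


def summarize_presence(table: Dict[str, Dict[str, str]], column_names: List[str]) -> str:
    """Build a CSV marking, per sample, which comma-separated items are present."""
    per_sample = {}
    for sample, record in table.items():
        items = set()
        for column in column_names:
            value = record.get(column, "")
            items.update(part.strip() for part in value.split(",") if part.strip())
        per_sample[sample] = items
    all_items = sorted(set().union(*per_sample.values()))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Sample_Name", *all_items])
    for sample, items in per_sample.items():
        # TRUE marks presence; an empty cell marks absence.
        writer.writerow([sample] + ["TRUE" if item in items else "" for item in all_items])
    return out.getvalue()


toy_table = {
    "sample1": {"amr_genes": "blaOXA-193,tet(O)"},
    "sample2": {"amr_genes": "tet(O)"},
}
print(summarize_presence(toy_table, ["amr_genes"]))
```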

Example use of Phandango coloring

Data summary produced using the phandango_coloring option, visualized alongside Newick tree at http://jameshadfield.github.io/phandango/#/main

Example phandango_coloring output

Data summary technical details

	Links
Task	task_summarize_data.wdl

Concatenate Variants (optional)¶

This task activates when call_shared_variants is true. The cat_variants task concatenates variant data from multiple samples into a single file concatenated_variants. It is very similar to the cat_files task, but also adds a column to the output file that indicates the sample associated with each row of data.

The concatenated_variants file will be in the following format:

samplename	CHROM	POS	TYPE	REF	ALT	EVIDENCE	FTYPE	STRAND	NT_POS	AA_POS	EFFECT	LOCUS_TAG	GENE
sample1	PEKT02000007	5224	snp	C	G	G:21 C:0
sample2	PEKT02000007	34112	snp	C	G	G:32 C:0	CDS	+	153/1620	51/539	missense_variant c.153C>G p.His51Gln	B9J08_002604	hypothetical protein
sample3	PEKT02000007	34487	snp	T	A	A:41 T:0	CDS	+	528/1620	176/539	missense_variant c.528T>A p.Asn176Lys	B9J08_002604	hypothetical protein
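
The concatenation step can be illustrated as follows. This is a sketch, not the cat_variants task itself: concatenate_variants is a hypothetical helper that takes each sample's tab-separated variant table as text, keeps one header, and prefixes every row with the sample name. The example rows are made up.

```python
from typing import Dict


def concatenate_variants(per_sample: Dict[str, str]) -> str:
    """Merge per-sample tab-separated tables, adding a leading samplename column."""
    out_lines = []
    for sample, text in per_sample.items():
        lines = text.strip("\n").split("\n")
        header, body = lines[0], lines[1:]
        if not out_lines:
            # Write the header once, taken from the first sample.
            out_lines.append("samplename\t" + header)
        out_lines.extend(f"{sample}\t{line}" for line in body)
    return "\n".join(out_lines) + "\n"


toy = {
    "sample1": "CHROM\tPOS\tTYPE\ncontig1\t10\tsnp\n",
    "sample2": "CHROM\tPOS\tTYPE\ncontig1\t20\tsnp\ncontig2\t5\tsnp\n",
}
print(concatenate_variants(toy))
```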

Technical Details

	Links
Task	task_cat_files.wdl

Shared Variants (optional)¶

This task activates when call_shared_variants is true.

The shared_variants task takes in the concatenated_variants output from the cat_variants task and reshapes the data so that variants are rows and samples are columns. For each variant, samples where the variant was detected are populated with a "1" and samples where either the variant was not detected or there was insufficient coverage to call variants are populated with a "0". The resulting table is available as the shared_variants_table output.

The shared_variants_table file will be in the following format:

CHROM	POS	TYPE	REF	ALT	FTYPE	STRAND	NT_POS	AA_POS	EFFECT	LOCUS_TAG	GENE	PRODUCT	sample1	sample2	sample3
PEKT02000007	2693938	snp	T	C	CDS	-	1008/3000	336/999	synonymous_variant c.1008A>G p.Lys336Lys	B9J08_003879	NA	chitin synthase 1	1	1	0
PEKT02000007	2529234	snp	G	C	CDS	+	282/336	94/111	missense_variant c.282G>C p.Lys94Asn	B9J08_003804	NA	cytochrome c	1	1	1
PEKT02000002	1043926	snp	A	G	CDS	-	542/1464	181/487	missense_variant c.542T>C p.Ile181Thr	B9J08_000976	NA	dihydrolipoyl dehydrogenase	1	1	0
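
The reshaping from one row per sample and variant to one row per variant can be sketched like this. It is an illustration only: shared_variants_matrix is a hypothetical helper, a variant is identified here by a (CHROM, POS, REF, ALT) tuple as an assumption, and the records are made up.

```python
from typing import Dict, Iterable, List, Tuple

Variant = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def shared_variants_matrix(
    records: Iterable[Tuple[str, Variant]],
    samples: List[str],
) -> Dict[Variant, Dict[str, int]]:
    """Return, for each variant, a 1/0 flag per sample."""
    table: Dict[Variant, Dict[str, int]] = {}
    for sample, variant in records:
        # Start every sample at 0, then mark the ones where the variant was seen.
        row = table.setdefault(variant, {s: 0 for s in samples})
        row[sample] = 1
    return table


records = [
    ("sample1", ("contig1", 100, "T", "C")),
    ("sample2", ("contig1", 100, "T", "C")),
    ("sample3", ("contig2", 250, "G", "A")),
]
matrix = shared_variants_matrix(records, ["sample1", "sample2", "sample3"])
print(matrix[("contig1", 100, "T", "C")])
```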

Technical Details

	Links
Task	task_shared_variants.wdl

Snippy_Variants QC Metrics Concatenation (optional)

Snippy_Variants QC Metric Concatenation (optional)¶

Optionally, the user can provide the snippy_variants_qc_metrics file produced by the Snippy_Variants workflow as input to the workflow to concatenate the reports for each sample in the tree. These per-sample QC metrics include the following columns:

samplename: The name of the sample.
reads_aligned_to_reference: The number of reads that aligned to the reference genome.
total_reads: The total number of reads in the sample.
percent_reads_aligned: The percentage of reads that aligned to the reference genome.
variants_total: The total number of variants detected between the sample and the reference genome.
percent_ref_coverage: The percentage of the reference genome covered by reads with a depth greater than or equal to the min_coverage threshold (default is 10).
#rname: Reference sequence name (e.g., chromosome or contig name).
startpos: Starting position of the reference sequence.
endpos: Ending position of the reference sequence.
numreads: Number of reads covering the reference sequence.
covbases: Number of bases with coverage.
coverage: Percentage of the reference sequence covered (depth ≥ 1).
meandepth: Mean depth of coverage over the reference sequence.
meanbaseq: Mean base quality over the reference sequence.
meanmapq: Mean mapping quality over the reference sequence.

The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (#rname to meanmapq) may repeat for each chromosome or contig in the reference genome.

QC Metrics for Phylogenetic Analysis

These QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses, and we recommend examining them before proceeding with phylogenetic analysis if performing Snippy_Variants and Snippy_Tree separately.
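
As an illustration of screening samples before building a tree, the sketch below reads a QC metrics table and lists samples that fall below two cut-offs. The column names come from the list above, but flag_low_quality, the 90% cut-offs, and the example values are hypothetical; appropriate thresholds depend on the organism and dataset.

```python
import csv
import io
from typing import List


def flag_low_quality(
    tsv_text: str,
    min_percent_aligned: float = 90.0,
    min_percent_ref_coverage: float = 90.0,
) -> List[str]:
    """Return the names of samples that fall below either cut-off."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = []
    for row in reader:
        aligned = float(row["percent_reads_aligned"])
        covered = float(row["percent_ref_coverage"])
        if aligned < min_percent_aligned or covered < min_percent_ref_coverage:
            flagged.append(row["samplename"])
    return flagged


# Made-up values for two samples.
example = (
    "samplename\tpercent_reads_aligned\tpercent_ref_coverage\n"
    "sample1\t98.5\t96.2\n"
    "sample2\t71.0\t88.4\n"
)
print(flag_low_quality(example))  # ['sample2']
```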

Technical Details

	Links
Task	task_cat_files.wdl

Outputs¶

Variable	Type	Description
snippy_centroid_docker	String	Docker file used for Centroid
snippy_centroid_fasta	File	FASTA file for the centroid sample
snippy_centroid_mash_tsv	File	TSV file containing mash distances computed by centroid
snippy_centroid_samplename	String	Name of the centroid sample
snippy_centroid_version	String	Centroid version used
snippy_cg_snp_matrix	File	CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment
snippy_combined_qc_metrics	File	Combined QC metrics file containing concatenated QC metrics from all samples.
snippy_concatenated_variants	File	Concatenated snippy_results file across all samples in the set
snippy_filtered_metadata	File	TSV recording the columns of the Terra data table that were used in the summarize_data task
snippy_final_alignment	File	Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites)
snippy_final_tree	File	Newick tree produced from the final alignment. Depending on user input for core_genome, the tree could be a core genome tree (default when core_genome is true) or whole genome tree (if core_genome is false)
snippy_gubbins_branch_stats	File	CSV file showing Gubbins output statistics (https://github.com/nickjcroucher/gubbins/blob/master/docs/gubbins_manual.md#output-statistics) for each branch of the tree
snippy_gubbins_docker	String	Docker file used for running Gubbins
snippy_gubbins_recombination_gff	File	Recombination statistics in GFF format; these can be viewed in Phandango against the phylogenetic tree
snippy_gubbins_version	String	Gubbins version used
snippy_iqtree2_docker	String	Docker file used for running IQTree2
snippy_iqtree2_model_used	String	Nucleotide substitution model used by IQTree2
snippy_iqtree2_version	String	IQTree2 version used
snippy_msa_snps_summary	File	TXT file containing summary statistics for the alignment of each input genome against the reference, which indicate how good each alignment is. Pay particular attention to the number of unaligned sites and heterogeneous positions. See also https://github.com/nickjcroucher/gubbins/blob/master/docs/gubbins_manual.md#output-statistics
snippy_ncbi_datasets_docker	String	Docker file used for NCBI datasets
snippy_ncbi_datasets_version	String	NCBI datasets version used
snippy_ref	File	Reference genome (FASTA or GenBank file) used for generating phylogeny
snippy_ref_metadata_json	File	Metadata associated with the reference genome used by Snippy, in JSON format
snippy_referenceseeker_database	String	ReferenceSeeker database used
snippy_referenceseeker_docker	String	Docker file used for ReferenceSeeker
snippy_referenceseeker_top_hit_ncbi_accession	String	NCBI Accession for the top hit identified by referenceseeker
snippy_referenceseeker_tsv	File	TSV file of the top hits between the query genome and the Reference Seeker database
snippy_referenceseeker_version	String	ReferenceSeeker version used
snippy_snp_dists_docker	String	Docker file used for running SNP-dists
snippy_snp_dists_version	String	SNP-dists version used
snippy_snp_sites_docker	String	Docker file used for running SNP-sites
snippy_snp_sites_version	String	SNP-sites version used
snippy_streamline_analysis_date	String	Date of workflow run
snippy_streamline_version	String	Version of Snippy_Streamline used
snippy_summarized_data	File	CSV presence/absence matrix generated by the summarize_data task from the list of columns provided; formatted for Phandango if phandango_coloring input is true
snippy_tree_snippy_docker	String	Docker file used for running Snippy
snippy_tree_snippy_version	String	Snippy version used
snippy_variants_outdir_tarball	Array[File]	A compressed file containing the whole directory of snippy output files. This is used when running Snippy_Tree
snippy_variants_percent_reads_aligned	Float	Percentage of reads aligned to the reference genome
snippy_variants_percent_ref_coverage	Float	Percentage of the reference genome covered by reads with a depth greater than or equal to the min_coverage threshold (default is 10)
snippy_variants_snippy_docker	Array[String]	Docker file used for Snippy in the Snippy_Variants subworkflow
snippy_variants_snippy_version	Array[String]	Version of Snippy used in the Snippy_Variants subworkflow
snippy_wg_snp_matrix	File	CSV file of whole genome pairwise SNP distances between samples, calculated from the final alignment
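
The SNP-distance matrices listed above can be explored programmatically. The following is a minimal sketch rather than part of the workflow: it assumes the CSV is a square matrix whose first row and first column hold the sample names in the same order, and the function name `pairs_within_distance` is hypothetical.

```python
import csv
import io


def pairs_within_distance(matrix_csv_text, max_snps):
    """List sample pairs whose SNP distance is at most max_snps.

    Assumption: square matrix, sample names in the first row and first
    column, rows and columns in the same order.
    """
    rows = [r for r in csv.reader(io.StringIO(matrix_csv_text)) if r]
    names = rows[0][1:]  # column labels; the top-left cell is ignored
    pairs = []
    for i, row in enumerate(rows[1:]):
        # Only read the upper triangle so each pair is reported once.
        for j in range(i + 1, len(names)):
            distance = int(row[j + 1])
            if distance <= max_snps:
                pairs.append((row[0], names[j], distance))
    return pairs
```

Any distance cutoff used to group samples is organism- and context-specific, so the `max_snps` argument is left to the caller.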

References¶

Gubbins: Croucher, Nicholas J., Andrew J. Page, Thomas R. Connor, Aidan J. Delaney, Jacqueline A. Keane, Stephen D. Bentley, Julian Parkhill, and Simon R. Harris. 2015. "Rapid Phylogenetic Analysis of Large Samples of Recombinant Bacterial Whole Genome Sequences Using Gubbins." Nucleic Acids Research 43 (3): e15.

SNP-sites: Page, Andrew J., Ben Taylor, Aidan J. Delaney, Jorge Soares, Torsten Seemann, Jacqueline A. Keane, and Simon R. Harris. 2016. "SNP-Sites: Rapid Efficient Extraction of SNPs from Multi-FASTA Alignments." Microbial Genomics 2 (4): e000056.

IQTree: Nguyen, Lam-Tung, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. 2015. "IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies." Molecular Biology and Evolution 32 (1): 268–74.