Introduction to Phylogenetics

Phylogenetics is an approach to understanding evolutionary relationships among organisms, primarily through analysis of gene, amino acid, or genome sequences. These evolutionary relationships are graphically represented by phylogenetic trees.

Broadly, there are two phylogenetic analysis methods

Phylogenetic tree construction

Creation of a phylogenetic tree from a set of sequences
- Goal: Determine the evolutionary relationship between a set of sequences, often to rule out likely transmission
- Pros:
  - Can be constructed from any suitable set of samples
  - More accurate than phylogenetic placement when a high-quality dataset and appropriate methods are used
- Cons:
  - Can be comparatively slow and computationally expensive, especially for trees with a large number of sequences and large genomes

Phylogenetic placement

Placement of genomes onto an existing phylogenetic tree
- Goal: Determine the closest relatives to a new sequence
- Pros:
  - Avoids building a whole tree, which is comparatively slow and computationally expensive, especially for large amounts of data
- Cons:
  - Requires an existing tree to add the new sample to
  - Less accurate than building a new phylogenetic tree

Phylogenetic tree construction approaches

Key considerations before generating a phylogenetic tree

When using Theiagen workflows, sequences should have been previously analyzed with TheiaCoV, TheiaProk, or TheiaEuk to assess sequence quality, generate assemblies or annotation files that may be required for some phylogenetic tree-building workflows, and generate any metadata that you might like to use for visualization against the tree.
All samples included in a phylogenetic tree should pass agreed QC thresholds
- FASTA input trees are particularly reliant on a high-quality assembly
  - Repetitive regions may be incorrectly assembled (particularly for de novo assemblies as generated by TheiaProk and TheiaEuk)
  - Low-coverage regions and heterologous sites may be included in the phylogeny
For transmission analyses, samples in the same tree should be closely related (the same lineage or sequence type)

Workflow recommendations for phylogenetic tree construction

Recommendations

Augur_Prep & Augur: For building phylogenetic trees from viral genomes
kSNP3: For analysis of clonal sets of genomes (e.g., foodborne outbreak analyses), using a simple method
Snippy_Streamline: For analysis of bacterial genomes that may undergo recombination or require masking of the genome
Snippy_Variants & Snippy_Tree: Similar to Snippy_Streamline, but for when you want more control over the workflow parameters or if you want to generate the tree multiple times using different combinations of sequences aligned against the same reference
Mashtree_FASTA: For very quick trees
Core_Gene_SNP: For generation of a pangenome analysis, with an additional core- or pan-gene phylogeny to visualize the pangenome against

Full comparison of Theiagen phylogenetic construction workflows

Workflow	Genome suitability	Input files	Method	Use cases	Pros	Cons
Mashtree	Low-divergence bacterial genome sets	Assembly FASTA for each genome	Neighbor-joining (NJ) tree based on Mash distances	Identification of obvious outliers/contaminated samples in a dataset; analysis of extensive datasets where other methods are not suitable (thousands of samples)	Very quick; low computational cost; fairly accurate for large low-diversity trees; does not require input reference genome; very easy to run	Does not model evolution; cannot handle complex evolutionary histories (recombination, horizontal gene transfer, etc.) or highly divergent genomes; does not compute SNPs to identify SNP distances
kSNP3	Clonal bacterial genome sets (e.g. foodborne outbreak genomes)	Assembly FASTA for each genome	Parsimony (default), NJ, or maximum likelihood (ML) tree based on k-mer differences	Analysis of clonal pathogens	Reasonably fast for small datasets; does not require input reference genome; very easy to run	Not suitable for highly divergent genomes; does not remove recombination or SNPs within ~9 nucleotides; no control over SNP support; computationally demanding for very large datasets; no control of the evolutionary model, even for ML trees
Snippy phylogenetics workflows	All bacterial genome sets	FASTQ read files for each genome; reference genome or assemblies that can be used to identify a reference	Maximum likelihood with a large selection of nucleotide substitution models; can mask recombination or other genomic regions specified with a bed file	Analysis of any bacterial genome, without expectations for population partitioning	Can generate very high-quality trees; highly modifiable parameters	Slower and more computationally expensive than some other methods; requires the user to consider appropriate input parameters, including computational resources for trees with hundreds of samples
Core_Gene_SNP	Bacterial genome sets	GFF3 annotation files for each genome (from Prokka, run during TheiaProk)	Gene/CDS alignment and SNP-calling with a maximum likelihood tree; core gene and pan-gene trees available	Assessment of accessory CDS that are present or absent amongst all genomes in the dataset, against a phylogenetic tree	Does not require a reference genome; core genes are less likely to have been involved in recombination; provides pangenome presence/absence output
Augur	SARS-CoV-2, mpox virus, influenza genome sets (other viral pathogens require more parameter configuration as defaults are not provided)	FASTA assembly		Phylogenetic analysis for the specified organisms for visualization on Auspice or any platform that takes Newick files	Compatible with Auspice, high-quality trees for the specified pathogens	Custom configurations of tree generation and visualizations can require extensive parameter knowledge and manipulation
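
The Mashtree row above mentions a tree "based on Mash distances". As a rough illustration of what such a distance measures, the sketch below computes the Jaccard index between two k-mer sets and converts it with the Mash distance formula, D = -(1/k) ln(2j / (1 + j)). It is a simplified sketch under stated assumptions, not how Mashtree is implemented: the real tool estimates the Jaccard index from MinHash sketches and uses canonical k-mers, whereas this example uses exact sets of forward-strand k-mers. The toy sequences, the value k = 5, and all function names are hypothetical.

```python
import math
from itertools import combinations


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index: shared k-mers divided by all distinct k-mers."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j: float, k: int) -> float:
    """Convert a Jaccard index into a Mash distance, bounded to [0, 1]."""
    if j <= 0.0:
        return 1.0
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))


def distance_matrix(genomes: dict[str, str], k: int) -> dict[tuple[str, str], float]:
    """Pairwise distances keyed by (name_a, name_b) in input order."""
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    return {
        (a, b): mash_distance(jaccard(sets[a], sets[b]), k)
        for a, b in combinations(genomes, 2)
    }


# Toy sequences (invented for illustration): "near" differs from "ref"
# at one position, "far" differs at three positions.
genomes = {
    "ref": "ACGTTGCATGCCGATAGCTAGGCTAACGT",
    "near": "ACGTTGCATGCCGAAAGCTAGGCTAACGT",
    "far": "ACGTAGCATGCCGAAAGCTAGGCTCACGT",
}
k = 5  # hypothetical; far smaller than the k used on real genomes

for (a, b), d in distance_matrix(genomes, k).items():
    print(f"{a}\t{b}\t{d:.4f}")
```

A matrix like this is what a neighbor-joining step would take as input. Because the distance only reflects shared k-mer content, it carries no information about which sites differ, which is consistent with the table's note that Mashtree does not compute SNPs.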

Interpreting phylogenetic trees and SNP distances

Resources for phylogenetic tree interpretation

Understanding phylogenetic trees, particularly what they represent
How to read a phylogenetic tree
How to interpret phylogenetic trees in terms of transmission

SNP distances

During outbreak investigations, SNP distances are sometimes used to help interpret the potential for transmission. SNP distance thresholds have been established for some pathogens, under some circumstances. Typically, SNP distance thresholds can be used to:

- Identify potential transmission clusters
- Rule out transmission events (may be directional, between two specified locations/people)
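
To make the idea of a threshold concrete, here is a minimal sketch that counts pairwise SNP differences in an alignment and groups samples whose distance falls at or below a cutoff. Several things here are assumptions rather than anything prescribed by the workflows: the toy alignment is invented, the threshold of 2 SNPs is purely hypothetical (real thresholds are pathogen- and context-specific), positions containing `N`, `-` or `?` are simply skipped, and single-linkage grouping is just one possible way to turn a threshold into clusters.

```python
from itertools import combinations

# Characters treated as missing data and ignored when counting SNPs.
AMBIGUOUS = set("N-?")


def snp_distance(a: str, b: str) -> int:
    """Count differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in AMBIGUOUS and y not in AMBIGUOUS
    )


def pairwise_snp_distances(alignment: dict[str, str]) -> dict[tuple[str, str], int]:
    """SNP distance for every pair of samples, keyed by (name_a, name_b)."""
    return {
        (s, t): snp_distance(alignment[s], alignment[t])
        for s, t in combinations(alignment, 2)
    }


def clusters_within(alignment: dict[str, str], threshold: int) -> list[set[str]]:
    """Single-linkage clusters: link any pair at or below the threshold."""
    parent = {name: name for name in alignment}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (s, t), d in pairwise_snp_distances(alignment).items():
        if d <= threshold:
            parent[find(s)] = find(t)

    groups: dict[str, set[str]] = {}
    for name in alignment:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Toy alignment, invented for illustration only.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGTACGAAT",
    "D": "TTGAAGCTNC",
}

for pair, d in pairwise_snp_distances(alignment).items():
    print(pair, d)
print(clusters_within(alignment, threshold=2))
```

Note that single-linkage grouping can chain samples together: two samples may land in the same cluster even when their direct distance exceeds the threshold, as long as intermediate samples connect them.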

It can be difficult to determine SNP thresholds because of:

- within-host diversity
- unknown number of transmissions/other bottlenecks decreasing genetic diversity
- variable mutation rates between strains, in different environments, and/or in different regions of the genome
- imprecise removal of recombination or erroneous SNPs

The comparison of SNP distances between potentially related strains and background strains can be helpful for source attribution (e.g. foodborne outbreaks). Combining SNP distances with epidemiological data can help identify suitable thresholds to rule out transmission. In addition, mutation rates can be calculated based on SNPs at different time points, allowing inference of when an outbreak started. Be aware of incomplete sampling, as SNP distances do not reveal whether there were other infected individuals who weren't sampled.
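
One simple way to picture the mutation-rate calculation mentioned above is a straight-line fit of SNP counts against sampling time. The sketch below fits ordinary least squares to SNP distances from a common reference and reads the slope as SNPs per day and the x-intercept as the time at which the fitted distance would be zero. This is an illustrative assumption, not the method used by any workflow described here: it presumes a constant clock-like rate, the data points are invented, and the function names are hypothetical. Dedicated time-scaled phylogenetic methods would normally be preferred for real analyses.

```python
from statistics import mean


def fit_clock(days: list[float], snps: list[float]) -> tuple[float, float]:
    """Least-squares line snps = slope * day + intercept."""
    if len(days) != len(snps) or len(days) < 2:
        raise ValueError("need at least two matched (day, snps) observations")
    x_bar, y_bar = mean(days), mean(snps)
    sxx = sum((x - x_bar) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("all samples share one date; a rate cannot be estimated")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, snps))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def estimated_origin(slope: float, intercept: float) -> float:
    """Day on which the fitted line crosses zero SNPs."""
    if slope <= 0:
        raise ValueError("no positive temporal signal in the data")
    return -intercept / slope


# Invented observations: sampling day relative to an arbitrary day 0,
# and SNP distance of each sample from a shared reference.
days = [30, 60, 90, 120]
snps = [2, 3, 4, 5]

slope, intercept = fit_clock(days, snps)
print(f"rate: {slope:.4f} SNPs/day")
print(f"estimated origin: day {estimated_origin(slope, intercept):.1f}")
```

With sparse or unevenly sampled data the fitted line can be dominated by a few points, which ties back to the caution about incomplete sampling.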

Visualizing phylogenetic trees

Recommendations

Auspice for phylogenetic trees generated using the Augur workflows
Phandango for visualizing metadata against the phylogenetic tree (e.g. presence/absence of ARGs or plasmid replicons, SNP-distance matrices, recombination GFF files from Gubbins, or pangenome visualizations)
FigTree for re-rooting phylogenetic trees, visualizing trees with annotated nodes (e.g. time-dated phylogenies) and looking at branch lengths
MicrobeTrace for visualizing phylogenetic trees with transmission networks

Full comparison of no-code phylogenetic tree visualization software

Consideration	Auspice	Phandango	FigTree	iTOL	GrapeTree	MicrobeTrace
Link	https://auspice.us/	https://jameshadfield.github.io/phandango/#/	http://tree.bio.ed.ac.uk/software/figtree/	https://itol.embl.de/	https://achtman-lab.github.io/GrapeTree/MSTree_holder.html	https://microbetrace.cdc.gov/MicrobeTrace/
Ease of use	Easy: drag and drop files to visualize; control the view with the menu	Easy: drag and drop files to visualize; control the view with the menu	Easy: Click to load, control view with the menu	Easy: Click to load, control view with the menu	Easy: Click to load, control view with the menu	Easy to visualize a tree: drag and drop files, but you have to change the visualization from network to tree
Performance	Can handle large and complex trees	Difficult to view very large trees (thousands of genomes)	Can take large and complex trees	May be slow to display large trees	Can take large and complex trees	Can take large and complex trees
Interactivity	Zoom, re-color the tree according to the metadata	Zoom, dynamic metadata views alongside tree	Zoom, tree arrangement	Zoom, tree arrangement	Zoom, tree arrangement, re-color tree according to metadata	No zoom, but you can alter horizontal & vertical stretch
Metadata visualization	Terminal nodes color-coded to metadata	Metadata visualized alongside a phylogeny	Branches, internal, and tips color-coded to metadata	Difficult to add metadata	Terminal nodes color-coded to metadata	Terminal nodes color-coded, shape-coded, or sized according to metadata, can also add labels
Input tree type	JSON	Newick	Newick and Nexus	Newick	Newick	Newick
Metadata files supported	CSV of sample characteristics	CSV of sample characteristics, .gff for recombination or pangenome	TXT of sample characteristics	N/A	CSV of sample characteristics	CSV of sample characteristics
Saving tree views	?	Image files only	Nexus, image files (PNG, SVG, JPEG), and PDF	Export options are limited in the free version	JSON, Newick, SVG	Save the MicrobeTrace session as a zip file on the computer, then drag & drop to restore
Availability	Browser-based, but does not share data	Browser-based, but does not share data	Installed on the local computer, requires Java	Browser-based, but does not share data	Browser-based, but does not share data, or installed on the local computer	Browser-based, but does not share data
Other considerations	Highly interactive; a great all-rounder	Great for quickly assessing associations between tree topology and metadata, e.g. cluster association with a given characteristic; can also visualize recombination and pangenome assessments relative to tree	No longer under active development, so some bugs may not be fixed; very useful for rearranging tree view and viewing dates of nodes	Very useful for rearranging tree view		Primarily intended for visualization of transmission networks, with a steep learning curve; actively maintained by CDC
Limitations	Difficult to quickly assess which metadata characteristics may be associated with tree topology	No scale; no ability to rearrange tree file; limitations to interactive views	Difficult to visualize additional metadata	Difficult to visualize additional metadata	Minimum-spanning trees only	No (useful) scale
Maintained	Yes	No	No	Yes	No?	Yes
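
Several of the tools above pair a Newick tree with a CSV of sample characteristics, and in general the metadata can only be matched to the tree when sample identifiers agree with the tip labels. The sketch below is one way to check this before loading files into a viewer. It rests on assumptions: the regular expression only handles simple, unquoted Newick labels, the `id` column name is hypothetical (each tool has its own expectations for the identifier column), and the example tree and CSV are invented.

```python
import csv
import io
import re

# A tip label directly follows "(" or ","; labels after ")" are internal
# node labels (e.g. support values) and are deliberately not matched.
TIP_LABEL = re.compile(r"(?<=[(,])\s*([^():,;\s]+)")


def newick_tip_names(newick: str) -> set[str]:
    """Extract unquoted tip labels from a Newick string."""
    return set(TIP_LABEL.findall(newick))


def metadata_ids(csv_text: str, id_column: str) -> set[str]:
    """Read sample identifiers from one column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column].strip() for row in reader}


def compare(newick: str, csv_text: str, id_column: str = "id") -> dict[str, list[str]]:
    """Report identifiers present on only one side."""
    tips = newick_tip_names(newick)
    ids = metadata_ids(csv_text, id_column)
    return {
        "tips_without_metadata": sorted(tips - ids),
        "metadata_without_tips": sorted(ids - tips),
    }


# Invented example inputs.
tree = "((sample1:0.1,sample2:0.2)90:0.05,sample3:0.3);"
metadata = "id,lineage\nsample1,X\nsample2,X\nsample4,Y\n"

print(compare(tree, metadata))
```

For trees with quoted labels, comments, or other Newick extensions, a dedicated tree-parsing library would be a safer choice than a regular expression.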

To learn more about MicrobeTrace, please see the following video: 📺 Using KSNP3 in Terra and Visualizing Bacterial Genomic Networks in MicrobeTrace