Introduction to Phylogenetics
Phylogenetics is an approach to understanding evolutionary relationships among organisms, primarily through analysis of gene, amino acid, or genome sequences. These evolutionary relationships are graphically represented by phylogenetic trees.
Broadly, there are two phylogenetic analysis methods
- Phylogenetic tree construction: creation of a phylogenetic tree from a set of sequences
  - Goal: Determine the evolutionary relationship between a set of sequences, often to rule out likely transmission
  - Pros:
    - Can be constructed from any suitable set of samples
    - More accurate than phylogenetic placement when a high-quality dataset and appropriate methods are used
  - Cons:
    - Can be comparatively slow and computationally expensive, especially for trees with a large number of sequences and large genomes
- Phylogenetic placement: placement of genomes onto an existing phylogenetic tree
  - Goal: Determine the closest relatives to a new sequence
  - Pros:
    - Avoids building a whole tree, which is comparatively slow and computationally expensive, especially for large amounts of data
  - Cons:
    - Requires an existing tree to add the new sample to
    - Less accurate than building a new phylogenetic tree
Phylogenetic tree construction approaches
Key considerations before generating a phylogenetic tree
- When using Theiagen workflows, sequences should have been previously analyzed with TheiaCoV, TheiaProk, or TheiaEuk to assess sequence quality, generate assemblies or annotation files that may be required for some phylogenetic tree-building workflows, and generate any metadata that you might like to use for visualization against the tree.
- All samples included in a phylogenetic tree should pass agreed QC thresholds
- FASTA input trees are particularly reliant on a high-quality assembly
  - Repetitive regions may be incorrectly assembled (particularly for de novo assemblies as generated by TheiaProk and TheiaEuk)
  - Low-coverage regions and heterologous sites may be included in the phylogeny
- For transmission analyses, samples in the same tree should be closely related, i.e. the same lineage or sequence type
Workflow recommendations for phylogenetic tree construction
Recommendations
- Augur_Prep & Augur: For building phylogenetic trees from viral genomes
- kSNP3: For analysis of clonal sets of genomes (e.g., foodborne outbreak analyses), using a simple method
- Snippy_Streamline: For analysis of bacterial genomes that may undergo recombination or require masking of the genome
- Snippy_Variants & Snippy_Tree: Similar to Snippy_Streamline, but for when you want more control over the workflow parameters or if you want to generate the tree multiple times using different combinations of sequences aligned against the same reference
- Mashtree_FASTA: For very quick trees
- Core_Gene_SNP: For generation of a pangenome analysis, with an additional core- or pan-gene phylogeny to visualize the pangenome against
Full comparison of Theiagen phylogenetic construction workflows
Workflow | Genome suitability | Input files | Method | Use cases | Pros | Cons |
---|---|---|---|---|---|---|
Mashtree | Low-divergence bacterial genome sets | Assembly FASTA for each genome | NJ tree based on mash distances | Identification of obvious outliers/contaminated samples in a dataset; analysis of extensive datasets where other methods are not suitable (thousands of samples) | Very quick; low computational cost; fairly accurate for large low-diversity trees; does not require input reference genome; very easy to run | Does not model evolution; cannot handle complex evolutionary histories (recombination, HGT, etc) or highly divergent genomes; does not compute SNPs to identify SNP distances |
kSNP3 | Clonal bacterial genome sets (e.g. foodborne outbreak genomes) | Assembly FASTA for each genome | Parsimony (default), NJ or ML tree based on kmer differences | Analysis of clonal pathogens | Reasonably fast for small datasets; does not require input reference genome; very easy to run | Not suitable for highly divergent genomes; does not remove recombination or SNPs within ~9 nucleotides; no control over SNP support; computationally demanding for very large datasets; no control of the evolutionary model, even for ML trees |
Snippy phylogenetics workflows | All bacterial genome sets | FASTQ read files for each genome; reference genome or assemblies that can be used to identify a reference | Maximum likelihood with a large selection of nucleotide substitution models; can mask recombination or other genomic regions specified with a bed file | Analysis of any bacterial genome, without expectations for population partitioning | Can generate very high-quality trees; highly modifiable parameters | Slower and more computationally expensive than some other methods; requires the user to consider appropriate input parameters, including computational resources for trees with hundreds of samples |
Core_Gene_SNP | Bacterial genome sets | GFF3 annotation files for each genome (from Prokka, run during TheiaProk) | Gene/CDS alignment and SNP-calling with a maximum likelihood tree; core gene and pan-gene trees available | Assessment of accessory CDS that are present or absent amongst all genomes in the dataset, against a phylogenetic tree | Does not require a reference genome; core genes are less likely to have been involved in recombination; provides pangenome presence/absence output | |
Augur | SARS-CoV-2, mpox virus, influenza genome sets (other viral pathogens require more parameter configuration as defaults are not provided) | FASTA assembly | Phylogenetic analysis for the specified organisms for visualization on Auspice or any platform that takes Newick files | Compatible with Auspice, high-quality trees for the specified pathogens | Custom configurations of tree generation and visualizations can require extensive parameter knowledge and manipulation |
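To make the "NJ tree based on mash distances" entry in the table more concrete, the sketch below shows how a Mash-style distance can be derived from shared k-mers. It is a simplified illustration, not the Mashtree implementation: it compares exact k-mer sets, whereas Mash estimates the Jaccard index from MinHash sketches and uses canonical k-mers. The function names and the default `k` are assumptions chosen for this example.

```python
import math


def kmers(sequence, k):
    """Return the set of all k-length substrings of a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(set_a, set_b):
    """Jaccard index: shared k-mers divided by all distinct k-mers."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mash_distance(seq_a, seq_b, k=21):
    """Mash-style distance D = -(1/k) * ln(2j / (1 + j)) for Jaccard index j.

    Exact k-mer sets are used here for clarity (an assumption of this sketch);
    pairs sharing no k-mers are given the maximal distance of 1.0.
    """
    j = jaccard(kmers(seq_a, k), kmers(seq_b, k))
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


# Toy sequences (made up for illustration); a small k suits short strings
print(mash_distance("ACGTAC", "ACGTAA", k=3))
```

A matrix of such pairwise distances is what a neighbor-joining method would take as input. Because the distance only summarizes k-mer sharing, it carries no information about which sites differ, which is why this kind of approach cannot report SNP distances.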
Interpreting phylogenetic trees and SNP distances
Resources for phylogenetic tree interpretation
- Understanding phylogenetic trees, particularly what they represent
- How to read a phylogenetic tree
- How to interpret phylogenetic trees in terms of transmission
SNP distances
During outbreak investigations, SNP distances are sometimes used to help interpret the potential for transmission. SNP distance thresholds have been established for some pathogens, under some circumstances. Typically, SNP distance thresholds can be used to:
- Identify potential transmission clusters
- Rule out transmission events (may be directional, between two specified locations/people)
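As a rough illustration of how a threshold might be applied, the following sketch counts pairwise SNP differences in an aligned set of sequences and groups samples by single linkage. The function names, the toy alignment, and the threshold of 2 are all hypothetical; they are not taken from any Theiagen workflow, and an appropriate threshold depends on the pathogen and context.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count differing positions between two aligned sequences, skipping N and gap sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_distances(alignment):
    """Return {(name1, name2): distance} for every pair in a {name: sequence} dict."""
    return {
        (n1, n2): snp_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


def clusters_within(alignment, threshold):
    """Single-linkage clusters: samples are joined when any pair is within the threshold."""
    parent = {name: name for name in alignment}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (n1, n2), dist in pairwise_distances(alignment).items():
        if dist <= threshold:
            parent[find(n1)] = find(n2)

    groups = {}
    for name in alignment:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Toy alignment (made-up sequences, for illustration only)
toy_alignment = {
    "sample_A": "ACGTACGTAC",
    "sample_B": "ACGTACGTAT",
    "sample_C": "ACGAACGTAT",
    "sample_D": "TTGAACCTNT",
}

print(clusters_within(toy_alignment, threshold=2))
```

Single linkage is only one possible design choice: a chain of closely related samples can place two genomes in the same cluster even when their direct distance exceeds the threshold, so cluster membership should be read alongside the pairwise matrix itself.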
It can be difficult to determine SNP thresholds because of:
- within-host diversity
- unknown number of transmissions/other bottlenecks decreasing genetic diversity
- variable mutation rates between strains, in different environments, and/or in different regions of the genome
- imprecise removal of recombination or erroneous SNPs
Comparing SNP distances between potentially related strains and background strains can be helpful for source attribution (e.g. foodborne outbreaks). Combining SNP distances with epidemiological data can help identify suitable thresholds to rule out transmission. In addition, mutation rates can be calculated from SNPs observed at different time points, allowing inference of when the outbreak started. Be aware of incomplete sampling: SNP distances don't reveal whether other infected individuals were missed by sampling.
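The idea of estimating a mutation rate from SNPs at different time points can be sketched as a simple linear regression of SNP counts against sampling time. This is a deliberately simplified illustration with made-up numbers: it assumes SNPs are counted relative to a common ancestor and accumulate at a constant rate, and it ignores the tree structure that dedicated time-scaled phylogeny tools account for. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sampling times must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_rate_and_origin(days, snps):
    """Return (SNPs per day, day at which the fitted line reaches zero SNPs)."""
    slope, intercept = fit_line(days, snps)
    if slope <= 0:
        raise ValueError("no positive temporal signal in the data")
    return slope, -intercept / slope


# Made-up observations: days since the first sample, and SNPs from the ancestor
sample_days = [0, 30, 60, 90]
sample_snps = [2, 3, 4, 5]

rate, origin_day = estimate_rate_and_origin(sample_days, sample_snps)
print(f"{rate:.4f} SNPs/day; line reaches zero SNPs at day {origin_day:.0f}")
```

With these toy values the fitted line reaches zero SNPs about 60 days before the first sample, which would be read as a rough estimate of when the sampled lineages shared an ancestor. Real data are noisier, and incomplete sampling can shift the estimate considerably.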
Visualizing phylogenetic trees
Recommendations
- Auspice for phylogenetic trees generated using the Augur workflows
- Phandango for visualizing metadata against the phylogenetic tree (e.g. presence/absence of ARGs or plasmid replicons, SNP-distance matrices, recombination GFF files from Gubbins, or pangenome visualizations)
- FigTree for re-rooting phylogenetic trees, visualizing trees with annotated nodes (e.g. time-dated phylogenies) and looking at branch lengths
- MicrobeTrace for visualizing phylogenetic trees with transmission networks
Full comparison of no-code phylogenetic tree visualization software
Consideration | Auspice | Phandango | FigTree | iTOL | GrapeTree | MicrobeTrace |
---|---|---|---|---|---|---|
Link | https://auspice.us/ | https://jameshadfield.github.io/phandango/#/ | http://tree.bio.ed.ac.uk/software/figtree/ | https://itol.embl.de/ | https://achtman-lab.github.io/GrapeTree/MSTree_holder.html | https://microbetrace.cdc.gov/MicrobeTrace/ |
Ease of use | Easy: drag and drop files to visualize; control the view with the menu | Easy: drag and drop files to visualize; control the view with the menu | Easy: Click to load, control view with the menu | Easy: Click to load, control view with the menu | Easy: Click to load, control view with the menu | Easy to visualize a tree: drag and drop files, but you have to change the visualization from network to tree |
Performance | Can handle large and complex trees | Difficult to view very large trees (thousands of genomes) | Can take large and complex trees | May be slow to display large trees | Can take large and complex trees | Can take large and complex trees |
Interactivity | Zoom, re-color the tree according to the metadata | Zoom, dynamic metadata views alongside tree | Zoom, tree arrangement | Zoom, tree arrangement | Zoom, tree arrangement, re-color tree according to metadata | No zoom, but you can alter horizontal & vertical stretch |
Metadata visualization | Terminal nodes color-coded to metadata | Metadata visualized alongside a phylogeny | Branches, internal, and tips color-coded to metadata | Difficult to add metadata | Terminal nodes color-coded to metadata | Terminal nodes color-coded, shape-coded, or sized according to metadata, can also add labels |
Input tree type | JSON | Newick | Newick and Nexus | Newick | Newick | Newick |
Metadata files supported | CSV of sample characteristics | CSV of sample characteristics, .gff for recombination or pangenome | TXT of sample characteristics | N/a | CSV of sample characteristics | CSV of sample characteristics |
Saving tree views | ? | Image files only | Nexus, image files (PNG, SVG, JPEG), and PDF | Export options are limited in the free version | JSON, Newick, SVG | Save the MicrobeTrace session as a zip file on the computer, then drag & drop to restore |
Availability | Browser-based, but does not share data | Browser-based, but does not share data | Installed on the local computer, requires Java | Browser-based, but does not share data | Browser-based, but does not share data, or installed on the local computer | Browser-based, but does not share data |
Other considerations | Highly interactive; a great all-rounder | Great for quickly assessing associations between tree topology and metadata, e.g. cluster association with a given characteristic; can also visualize recombination and pangenome assessments relative to tree | No longer under active development, so some bugs may not be fixed; very useful for rearranging tree view and viewing dates of nodes | Very useful for rearranging tree view | | Primarily intended for visualization of transmission networks, with a steep learning curve; actively maintained by CDC |
Limitations | Difficult to quickly assess which metadata characteristics may be associated with tree topology | No scale; no ability to rearrange tree file; limitations to interactive views | Difficult to visualize additional metadata | Difficult to visualize additional metadata | Minimum-spanning trees only | No (useful) scale |
Maintained | Yes | No | No | Yes | No? | Yes |
To learn more about MicrobeTrace, please see the following video: 📺 Using KSNP3 in Terra and Visualizing Bacterial Genomic Networks in MicrobeTrace
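Most of the tools compared above pair a Newick tree with a CSV of sample characteristics, so metadata only appears for tips whose labels match an identifier in the CSV. The sketch below is one possible way to check for mismatches before uploading. It uses a minimal regular expression rather than a full Newick parser (quoted labels containing special characters are not handled), and the `sample_id` column name is an assumption, since the identifier column expected by each tool is not specified here.

```python
import csv
import io
import re


def newick_tip_labels(newick):
    """Extract tip labels: names that directly follow '(' or ',' in a Newick string."""
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def metadata_ids(csv_text, id_column="sample_id"):
    """Read sample identifiers from one column of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column].strip() for row in reader}


def compare_labels(newick, csv_text, id_column="sample_id"):
    """Report tree tips lacking metadata and metadata rows lacking a tree tip."""
    tips = newick_tip_labels(newick)
    ids = metadata_ids(csv_text, id_column)
    return {
        "tips_without_metadata": sorted(tips - ids),
        "metadata_without_tips": sorted(ids - tips),
    }


# Toy inputs (made up for illustration)
toy_tree = "((sample_A:0.1,sample_B:0.2):0.3,sample_C:0.4);"
toy_metadata = "sample_id,county\nsample_A,North\nsample_B,South\nsample_D,East\n"

print(compare_labels(toy_tree, toy_metadata))
```

Internal node labels and support values follow a closing parenthesis in Newick, so the pattern above skips them and only tip names are compared.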