Introduction to Phylogenetics

Phylogenetics is an approach to understanding evolutionary relationships among organisms, primarily through analysis of gene, amino acid, or genome sequences. These evolutionary relationships are graphically represented by phylogenetic trees.

Broadly, there are two approaches to phylogenetic analysis:

  • Phylogenetic tree construction


    Creation of a phylogenetic tree from a set of sequences

    • Goal: Determine the evolutionary relationship between a set of sequences, often to rule out likely transmission
    • Pros:
      • Can be constructed from any suitable set of samples
      • More accurate than phylogenetic placement when a high-quality dataset and appropriate methods are used
    • Cons:
      • Can be comparatively slow and computationally expensive, especially for trees with a large number of sequences and large genomes
  • Phylogenetic placement


    Placement of genomes onto an existing phylogenetic tree

    • Goal: Determine the closest relatives to a new sequence
    • Pros:
      • Avoids building a whole tree, which is comparatively slow and computationally expensive, especially for large amounts of data
    • Cons:
      • Requires an existing tree to add the new sample to
      • Less accurate than building a new phylogenetic tree

Phylogenetic tree construction approaches

Key considerations before generating a phylogenetic tree

  • When using Theiagen workflows, sequences should have been previously analyzed with TheiaCoV, TheiaProk, or TheiaEuk to assess sequence quality, generate assemblies or annotation files that may be required for some phylogenetic tree-building workflows, and generate any metadata that you might like to use for visualization against the tree.
  • All samples included in a phylogenetic tree should pass agreed QC thresholds
    • FASTA input trees are particularly reliant on a high-quality assembly
      • Repetitive regions may be incorrectly assembled (particularly for de novo assemblies as generated by TheiaProk and TheiaEuk)
      • Low-coverage regions and heterologous sites may be included in the phylogeny
  • For transmission analyses, samples in the same tree should be closely related (e.g., the same lineage or sequence type)

Workflow recommendations for phylogenetic tree construction

Recommendations

  • Augur_Prep & Augur: For building phylogenetic trees from viral genomes
  • kSNP3: For analysis of clonal sets of genomes (e.g., foodborne outbreak analyses), using a simple method
  • Snippy_Streamline: For analysis of bacterial genomes that may undergo recombination or require masking of the genome
  • Snippy_Variants & Snippy_Tree: Similar to Snippy_Streamline, but for when you want more control over the workflow parameters or if you want to generate the tree multiple times using different combinations of sequences aligned against the same reference
  • Mashtree_FASTA: For very quick trees
  • Core_Gene_SNP: For generation of a pangenome analysis, with an additional core- or pan-gene phylogeny to visualize the pangenome against

Full comparison of Theiagen phylogenetic construction workflows

| Workflow | Genome suitability | Input files | Method | Use cases | Pros | Cons |
|---|---|---|---|---|---|---|
| Mashtree | Low-divergence bacterial genome sets | Assembly FASTA for each genome | NJ tree based on mash distances | Identification of obvious outliers/contaminated samples in a dataset; analysis of extensive datasets where other methods are not suitable (thousands of samples) | Very quick; low computational cost; fairly accurate for large low-diversity trees; does not require input reference genome; very easy to run | Does not model evolution; cannot handle complex evolutionary histories (recombination, HGT, etc) or highly divergent genomes; does not compute SNPs to identify SNP distances |
| kSNP3 | Clonal bacterial genome sets (e.g. foodborne outbreak genomes) | Assembly FASTA for each genome | Parsimony (default), NJ or ML tree based on kmer differences | Analysis of clonal pathogens | Reasonably fast for small datasets; does not require input reference genome; very easy to run | Not suitable for highly divergent genomes; does not remove recombination or SNPs within ~9 nucleotides; no control over SNP support; computationally demanding for very large datasets; no control of the evolutionary model, even for ML trees |
| Snippy phylogenetics workflows | All bacterial genome sets | FASTQ read files for each genome; reference genome or assemblies that can be used to identify a reference | Maximum likelihood with a large selection of nucleotide substitution models; can mask recombination or other genomic regions specified with a bed file | Analysis of any bacterial genome, without expectations for population partitioning | Can generate very high-quality trees; highly modifiable parameters | Slower and more computationally expensive than some other methods; requires the user to consider appropriate input parameters, including computational resources for trees with hundreds of samples |
| Core_Gene_SNP | Bacterial genome sets | GFF3 annotation files for each genome (from Prokka, run during TheiaProk) | Gene/CDS alignment and SNP-calling with a maximum likelihood tree; core gene and pan-gene trees available | Assessment of accessory CDS that are present or absent amongst all genomes in the dataset, against a phylogenetic tree | Does not require a reference genome; core genes are less likely to have been involved in recombination; provides pangenome presence/absence output | |
| Augur | SARS-CoV-2, mpox virus, influenza genome sets (other viral pathogens require more parameter configuration as defaults are not provided) | FASTA assembly | | Phylogenetic analysis for the specified organisms for visualization on Auspice or any platform that takes Newick files | Compatible with Auspice, high-quality trees for the specified pathogens | Custom configurations of tree generation and visualizations can require extensive parameter knowledge and manipulation |
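
To give a feel for what "mash distances" means in the Mashtree row, the sketch below computes a Mash-style distance from the Jaccard index of two k-mer sets, using D = -(1/k) ln(2J / (1 + J)). It is a simplified illustration, not how Mashtree is implemented: it uses exact k-mer sets rather than MinHash sketches, ignores reverse complements, and the function names, example sequences, and `k` value are hypothetical.

```python
import math


def kmers(sequence, k):
    """Return the set of all k-length substrings of a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_style_distance(seq_a, seq_b, k=21):
    """Mash-style distance from the exact Jaccard index of two k-mer sets.

    Assumption: real tools estimate the Jaccard index from MinHash sketches
    of canonical k-mers; this sketch uses the exact sets for clarity.
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a or not b:
        raise ValueError("both sequences must be at least k bases long")
    jaccard = len(a & b) / len(a | b)
    if jaccard == 0:
        return 1.0  # no shared k-mers: maximum distance
    if jaccard == 1:
        return 0.0  # identical k-mer content
    return -math.log(2 * jaccard / (1 + jaccard)) / k


# Toy sequences (hypothetical) that differ by a single substitution
genome_a = "ACGTTGCAAGCTTAGGCTAACGTA"
genome_b = "ACGTTGCAAGCATAGGCTAACGTA"
distance = mash_style_distance(genome_a, genome_b, k=5)
print(f"Mash-style distance: {distance:.4f}")
```

Because the distance depends only on shared k-mer content, it is quick to compute but carries no model of evolution, which matches the pros and cons listed for Mashtree above.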

Interpreting phylogenetic trees and SNP distances

Resources for phylogenetic tree interpretation

SNP distances

During outbreak investigations, SNP distances are sometimes used to help interpret the potential for transmission. SNP distance thresholds have been established for some pathogens, under some circumstances. Typically, SNP distance thresholds can:

  • Identify potential transmission clusters
  • Rule out transmission events (may be directional, between two specified locations/people)
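
As a rough illustration of how a threshold might be applied, the sketch below counts pairwise SNP differences in a toy alignment and groups samples into single-linkage clusters. The sample names, sequences, and the threshold of 2 SNPs are all hypothetical; an appropriate threshold depends on the pathogen and the circumstances.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count aligned positions where both sequences have a called base and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def threshold_clusters(alignment, threshold):
    """Single-linkage clusters: two samples are linked if within `threshold` SNPs."""
    parent = {name: name for name in alignment}

    def find(name):
        # Follow parent links to the cluster representative (union-find).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(alignment, 2):
        if snp_distance(alignment[a], alignment[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in alignment:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda cluster: sorted(cluster))


# Toy alignment (hypothetical samples and sequences)
alignment = {
    "sample_1": "ACGTACGTAC",
    "sample_2": "ACGTACGTAT",
    "sample_3": "ACGAACGTAT",
    "sample_4": "TTGAACCTGG",
}
clusters = threshold_clusters(alignment, threshold=2)
for cluster in clusters:
    print(sorted(cluster))
```

Single linkage is only one possible design choice: it can chain samples together even when the two ends of the chain are further apart than the threshold, so clusters found this way are candidates for follow-up rather than confirmed transmission.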

It can be difficult to determine SNP thresholds because of:

  • within-host diversity
  • unknown number of transmissions/other bottlenecks decreasing genetic diversity
  • variable mutation rates between strains, in different environments, and/or in different regions of the genome
  • imprecise removal of recombination or erroneous SNPs

Comparing SNP distances between potentially related strains and background strains can be helpful for source attribution (e.g. foodborne outbreaks). Combining SNP distances with epidemiological data can help identify suitable thresholds to rule out transmission. In addition, mutation rates can be calculated based on SNPs at different time points, allowing inference of the start of an outbreak. Be aware of incomplete sampling: SNP distances do not reveal whether there were other infected individuals who weren’t sampled.
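
The idea of estimating a mutation rate from SNPs at different time points can be sketched with a simple least-squares line: SNP counts relative to an assumed ancestral or reference genome are regressed on sampling day, the slope approximates the rate, and the point where the line crosses zero SNPs gives a crude estimate of the outbreak start. The data below are invented for illustration, and a real analysis would normally use tree-based dating methods rather than this simplification.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sampling times must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: sampling day and SNPs relative to an assumed reference
days = [30, 60, 90, 120]
snps = [2, 3, 4, 5]

rate_per_day, intercept = fit_line(days, snps)
origin_day = -intercept / rate_per_day  # day at which the fitted line reaches 0 SNPs

print(f"Estimated rate: {rate_per_day * 365:.1f} SNPs per genome per year")
print(f"Estimated outbreak start: day {origin_day:.0f}")
```

With few samples or incomplete sampling, both estimates can be far off, so they are best read as rough guides alongside the epidemiological data.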

Visualizing phylogenetic trees

Recommendations

  • Auspice for phylogenetic trees generated using the Augur workflows
  • Phandango for visualizing metadata against the phylogenetic tree (e.g. presence/absence of ARGs or plasmid replicons, SNP-distance matrices, recombination gff files from gubbins, or pangenome visualizations)
  • FigTree for re-rooting phylogenetic trees, visualizing trees with annotated nodes (e.g. time-dated phylogenies) and looking at branch lengths
  • MicrobeTrace for visualizing phylogenetic trees with transmission networks

Full comparison of no-code phylogenetic tree visualization software

| Consideration | Auspice | Phandango | FigTree | iTOL | GrapeTree | MicrobeTrace |
|---|---|---|---|---|---|---|
| Link | https://auspice.us/ | https://jameshadfield.github.io/phandango/#/ | http://tree.bio.ed.ac.uk/software/figtree/ | https://itol.embl.de/ | https://achtman-lab.github.io/GrapeTree/MSTree_holder.html | https://microbetrace.cdc.gov/MicrobeTrace/ |
| Ease of use | Easy: drag and drop files to visualize; control the view with the menu | Easy: drag and drop files to visualize; control the view with the menu | Easy: click to load, control view with the menu | Easy: click to load, control view with the menu | Easy: click to load, control view with the menu | Easy to visualize a tree: drag and drop files, but you have to change the visualization from network to tree |
| Performance | Can handle large and complex trees | Difficult to view very large trees (thousands of genomes) | Can take large and complex trees | May be slow to display large trees | Can take large and complex trees | Can take large and complex trees |
| Interactivity | Zoom, re-color the tree according to the metadata | Zoom, dynamic metadata views alongside tree | Zoom, tree arrangement | Zoom, tree arrangement | Zoom, tree arrangement, re-color tree according to metadata | No zoom, but you can alter horizontal & vertical stretch |
| Metadata visualization | Terminal nodes color-coded to metadata | Metadata visualized alongside a phylogeny | Branches, internal, and tips color-coded to metadata | Difficult to add metadata | Terminal nodes color-coded to metadata | Terminal nodes color-coded, shape-coded, or sized according to metadata, can also add labels |
| Input tree type | JSON | Newick | Newick and Nexus | Newick | Newick | Newick |
| Metadata files supported | CSV of sample characteristics | CSV of sample characteristics, .gff for recombination or pangenome | TXT of sample characteristics | N/A | CSV of sample characteristics | CSV of sample characteristics |
| Saving tree views | ? | Image files only | Nexus, image files (PNG, SVG, JPEG), and PDF | Export options are limited in the free version | JSON, Newick, SVG | Save the MicrobeTrace session as a zip file on the computer, then drag & drop to restore |
| Availability | Browser-based, but does not share data | Browser-based, but does not share data | Installed on the local computer, requires Java | Browser-based, but does not share data | Browser-based, but does not share data, or installed on the local computer | Browser-based, but does not share data |
| Other considerations | Highly interactive; a great all-rounder | Great for quickly assessing associations between tree topology and metadata, e.g. cluster association with a given characteristic; can also visualize recombination and pangenome assessments relative to tree | No longer under active development, so some bugs may not be fixed; very useful for rearranging tree view and viewing dates of nodes | Very useful for rearranging tree view | | Primarily intended for visualization of transmission networks, with a steep learning curve; actively maintained by CDC |
| Limitations | Difficult to quickly assess which metadata characteristics may be associated with tree topology | No scale; no ability to rearrange tree file; limitations to interactive views | Difficult to visualize additional metadata | Difficult to visualize additional metadata | Minimum-spanning trees only | No (useful) scale |
| Maintained | Yes | No | No | Yes | No? | Yes |
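
Several of these tools pair a Newick tree with a CSV of sample characteristics, so it can help to confirm that tip labels and metadata IDs match before loading them. The sketch below is one minimal way to do that with the Python standard library. It assumes simple, unquoted tip labels and a hypothetical `sample_id` column; the tree and metadata are invented examples.

```python
import csv
import io
import re


def newick_tip_labels(newick):
    """Return unquoted tip labels from a Newick string.

    Tip labels directly follow "(" or ","; internal node labels and support
    values follow ")" and branch lengths follow ":", so they are not captured.
    Quoted labels containing spaces or punctuation are not handled.
    """
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


def compare_tips_to_metadata(newick, metadata_csv, id_column="sample_id"):
    """Return (tips missing from metadata, metadata IDs missing from tree)."""
    tips = newick_tip_labels(newick)
    rows = csv.DictReader(io.StringIO(metadata_csv))
    metadata_ids = {row[id_column] for row in rows}
    return sorted(tips - metadata_ids), sorted(metadata_ids - tips)


# Hypothetical tree and metadata
tree = "((sample_1:0.01,sample_2:0.02)0.98:0.05,sample_3:0.1);"
metadata = "sample_id,county\nsample_1,A\nsample_2,B\nsample_4,C\n"

not_in_metadata, not_in_tree = compare_tips_to_metadata(tree, metadata)
print("Tips without metadata:", not_in_metadata)
print("Metadata rows without a tip:", not_in_tree)
```

For trees with quoted labels or other Newick features, a dedicated tree-parsing library would be a safer choice than this regular expression.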

To learn more about MicrobeTrace, please see the following video: 📺 Using KSNP3 in Terra and Visualizing Bacterial Genomic Networks in MicrobeTrace