Skip to content

Core_Gene_SNP

Quick Facts

Workflow Type Applicable Kingdom Last Known Changes Command-line Compatibility Workflow Level
Phylogenetic Construction Bacteria v3.0.0 Some optional features incompatible, Yes Set-level

Core_Gene_SNP_PHB

Core Gene SNP Workflow Diagram

Core Gene SNP Workflow Diagram

The Core_Gene_SNP workflow is intended for pangenome analysis, core gene alignment, and phylogenetic analysis. The workflow takes in gene sequence data in GFF3 format from a set of samples. It first produces a pangenome summary using Pirate, which clusters genes within the sample set into orthologous gene families. By default, the workflow also instructs Pirate to produce both core gene and pangenome alignments. The workflow subsequently triggers the generation of a phylogenetic tree and SNP distance matrix from the core gene alignment using iqtree and snp-dists, respectively. Optionally, the workflow will also run this analysis using the pangenome alignment. This workflow also features an optional module, summarize_data, that creates a presence/absence matrix for the analyzed samples from a list of indicated columns (such as AMR genes, etc.) that can be used in Phandango.

Default Parameters

Please note that while default parameters for pangenome construction and phylogenetic tree generation are provided, these default parameters may not suit every dataset and have not been validated against known phylogenies. Users should take care to select the parameters that are most appropriate for their dataset. Please reach out to support@theiagen.com or one of the other resources listed at the bottom of this page if you would like assistance with this task.

Inputs

For further detail regarding Pirate options, please see PIRATE's documentation. For further detail regarding IQ-TREE options, please see http://www.iqtree.org/doc/Command-Reference.

This workflow runs on the set level.

Terra Task Name Variable Type Description Default Value Terra Status
core_gene_snp_workflow cluster_name String Name of sample set Required
core_gene_snp_workflow gff3 Array[File] Array of gff3 files to include in analysis, output gff files from both prokka and bakta using TheiaProk workflows are compatible Required
core_gene_snp_workflow align Boolean Boolean variable that instructs the workflow to generate core and pangenome alignments if "true". If "false", the workflow will produce only a pangenome summary. TRUE Optional
core_gene_snp_workflow core_tree Boolean Boolean variable that instructs the workflow to create a phylogenetic tree and SNP distance matrix from the core gene alignment. Align must also be set to true. TRUE Optional
core_gene_snp_workflow data_summary_column_names String A comma-separated list of the column names from the sample-level data table for generating a data summary (presence/absence .csv matrix); e.g., "amrfinderplus_amr_genes,amrfinderplus_virulence_genes" Optional
core_gene_snp_workflow data_summary_terra_project String The billing project for your current workspace. This can be found after the "#workspaces/" section in the workspace's URL Optional
core_gene_snp_workflow data_summary_terra_table String The name of the sample-level Terra data table that will be used for generating a data summary Optional
core_gene_snp_workflow data_summary_terra_workspace String The name of the Terra workspace you are in. This can be found at the top of the webpage, or in the URL after the billing project. Optional
core_gene_snp_workflow midpoint_root_tree Boolean Boolean variable that will instruct the workflow to reroot the tree at the midpoint FALSE Optional
core_gene_snp_workflow pan_tree Boolean Boolean variable that instructs the workflow to create a phylogenetic tree and SNP distance matrix from the pangenome alignment. Align must also be set to true. FALSE Optional
core_gene_snp_workflow phandango_coloring Boolean Boolean variable that tells the data summary task and the reorder matrix task to include a suffix that enables consistent coloring on Phandango; by default, this suffix is not added. To add this suffix set this variable to true. FALSE Optional
core_gene_snp_workflow sample_names Array[String] Array of sample_ids from the data table used Optional
core_iqtree alrt String Number of replicates to use for the SH-like approximate likelihood ratio test (Minimum recommended= 1000). Follows IQ-TREE "-alrt" option 1000 Optional
core_iqtree cpu Int Number of CPUs to allocate to the task 4 Optional
core_iqtree disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
core_iqtree docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/iqtree:1.6.7 Optional
core_iqtree iqtree_bootstraps String Number of ultrafast bootstrap replicates. Follows IQ-TREE "-bb" option. 1000 Optional
core_iqtree iqtree_model String Substitution model, frequency type (optional) and rate heterogeneity type (optional) used by IQ-TREE. This string follows the IQ-TREE "-m" option. For comparison to other tools use HKY for Bactopia, GTR+F+I for Grandeur, GTR+G4 for Nullarbor, GTR+G for Dryad GTR+I+G Optional
core_iqtree iqtree_opts String Additional options for IQ-TREE, see http://www.iqtree.org/doc/Command-Reference Optional
core_iqtree memory Int Amount of memory/RAM (in GB) to allocate to the task 32 Optional
core_reorder_matrix cpu Int Number of CPUs to allocate to the task 2 Optional
core_reorder_matrix disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
core_reorder_matrix docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/mykrobe:0.12.1 Optional
core_reorder_matrix memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
core_snp_dists cpu Int Number of CPUs to allocate to the task 1 Optional
core_snp_dists disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
core_snp_dists docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/snp-dists:0.8.2 Optional
core_snp_dists memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
pan_iqtree alrt String Number of replicates to use for the SH-like approximate likelihood ratio test (Minimum recommended= 1000). Follows IQ-TREE "-alrt" option 1000 Optional
pan_iqtree cpu Int Number of CPUs to allocate to the task 4 Optional
pan_iqtree disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
pan_iqtree docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/iqtree:1.6.7 Optional
pan_iqtree iqtree_bootstraps String Number of ultrafast bootstrap replicates. Follows IQ-TREE "-bb" option. 1000 Optional
pan_iqtree iqtree_model String Substitution model, frequency type (optional) and rate heterogeneity type (optional) used by IQ-TREE. This string follows the IQ-TREE "-m" option. For comparison to other tools use HKY for Bactopia, GTR+F+I for Grandeur, GTR+G4 for Nullarbor, GTR+G for Dryad GTR+I+G Optional
pan_iqtree iqtree_opts String Additional options for IQ-TREE, see http://www.iqtree.org/doc/Command-Reference Optional
pan_iqtree memory Int Amount of memory/RAM (in GB) to allocate to the task 32 Optional
pan_reorder_matrix cpu Int Number of CPUs to allocate to the task 2 Optional
pan_reorder_matrix disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
pan_reorder_matrix docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/mykrobe:0.12.1 Optional
pan_reorder_matrix memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
pan_snp_dists cpu Int Number of CPUs to allocate to the task 1 Optional
pan_snp_dists disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
pan_snp_dists docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/snp-dists:0.8.2 Optional
pan_snp_dists memory Int Amount of memory/RAM (in GB) to allocate to the task 2 Optional
pirate cpu Int Number of CPUs to allocate to the task 4 Optional
pirate disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
pirate docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/biocontainers/pirate:1.0.5--hdfd78af_0 Optional
pirate features String Features to use for pangenome construction [default: CDS] CDS Optional
pirate memory Int Amount of memory/RAM (in GB) to allocate to the task 32 Optional
pirate nucl Boolean Boolean variable that instructs pirate to create a pangenome on CDS features using nucleotide identity, rather than amino acid identity, if true. FALSE Optional
pirate panopt String Additional arguments for Pirate Optional
pirate steps String Identity thresholds to use for pangenome construction 50,60,70,80,90,95,98 Optional
summarize_data cpu Int Number of CPUs to allocate to the task 8 Optional
summarize_data disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
summarize_data docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-03-16 Optional
summarize_data id_column_name String If the sample IDs are in a different column to samplenames, it can be passed here and it will be used instead. Optional
summarize_data memory Int Amount of memory/RAM (in GB) to allocate to the task 1 Optional
version_capture docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 Optional
version_capture timezone String Set the time zone to get an accurate date of analysis (uses UTC by default) Optional

Workflow Tasks

By default, the Core_Gene_SNP workflow will begin by analyzing the input sample set using PIRATE. Pirate takes in GFF3 files and classifies the genes into gene families by sequence identity, outputting a pangenome summary file. The workflow will instruct Pirate to create core gene and pangenome alignments using this gene family data. Setting the "align" input variable to false will turn off this behavior, and the workflow will output only the pangenome summary. The workflow will then use the core gene alignment from Pirate to infer a phylogenetic tree using IQ-TREE. It will also produce an SNP distance matrix from this alignment using snp-dists. This behavior can be turned off by setting the core_tree input variable to false. The workflow will not create a pangenome tree or SNP-matrix by default, but this behavior can be turned on by setting the pan_tree input variable to true.

The optional summarize_data task performs the following only if all of the data_summary_* and sample_names optional variables are filled out:

  1. Digests a comma-separated list of column names, such as "amrfinderplus_virulence_genes,amrfinderplus_stress_genes", etc. that can be found within the origin Terra data table.
  2. It will then parse through those column contents and extract each value; for example, if the amrfinder_amr_genes column for a sample contains these values: "aph(3')-IIIa,tet(O),blaOXA-193", the summarize_data task will check each sample in the set to see if they also have those AMR genes detected.
  3. Outputs a .csv file that indicates presence (TRUE) or absence (empty) for each item in those columns; that is, it will check each sample in the set against the detected items in each column to see if that value was also detected.

By default, this task appends a Phandango coloring tag to color all items from the same column the same; this can be turned off by setting the optional phandango_coloring variable to false.

Outputs

Variable Type Description
core_gene_snp_wf_analysis_date String Date of analysis using Core_Gene_SNP workflow
core_gene_snp_wf_version String Version of PHBG used for analysis
pirate_core_alignment_fasta File Nucleotide alignments of the core genes as created using MAFFT within Pirate. Loci are ordered according to the gene_families.ordered file.
pirate_core_alignment_gff File Annotation data for the gene family within the corresponding fasta file
pirate_core_snp_matrix File SNP distance matrix created from the core gene alignment
pirate_docker_image String Pirate docker image used
pirate_gene_families_ordered File Summary of all gene families, as estimated by Pirate
pirate_iqtree_core_tree File Phylogenetic tree produced by IQ-TREE from the core gene alignment
pirate_iqtree_pan_tree File Phylogenetic tree produced by IQ-TREE from the pangenome alignment
pirate_iqtree_version String IQ-TREE version used
pirate_pan_alignment_fasta File Nucleotide alignments of the pangenome by gene as created using MAFFT within Pirate. Loci are ordered according to the gene_families.ordered file.
pirate_pan_alignment_gff File Annotation data for the gene family within the corresponding fasta file
pirate_pan_snp_matrix File SNP distance matrix created from the pangenome alignment
pirate_pangenome_summary File Summary of the number and frequency of genes in the pangenome, as estimated by Pirate
pirate_presence_absence_csv File A file generated by Pirate that allows many post-alignment tools created for Roary to be used on the output from Pirate
pirate_snp_dists_version String Version of snp-dists used
pirate_summarized_data File The presence/absence matrix generated by the summarize_data task from the list of columns provided

References

Sion C Bayliss, Harry A Thorpe, Nicola M Coyle, Samuel K Sheppard, Edward J Feil, PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria, GigaScience, Volume 8, Issue 10, October 2019, giz119, https://doi.org/10.1093/gigascience/giz119

Lam-Tung Nguyen, Heiko A. Schmidt, Arndt von Haeseler, Bui Quang Minh, IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies, Molecular Biology and Evolution, Volume 32, Issue 1, January 2015, Pages 268–274, https://doi.org/10.1093/molbev/msu300

https://github.com/tseemann/snp-dists