Core_Gene_SNP
Quick Facts
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
---|---|---|---|---|
Phylogenetic Construction | Bacteria | v3.0.0 | Some optional features incompatible, Yes | Set-level |
Core_Gene_SNP_PHB
The Core_Gene_SNP workflow is intended for pangenome analysis, core gene alignment, and phylogenetic analysis. The workflow takes in gene sequence data in GFF3 format from a set of samples. It first produces a pangenome summary using `Pirate`, which clusters genes within the sample set into orthologous gene families. By default, the workflow also instructs `Pirate` to produce both core gene and pangenome alignments. The workflow then generates a phylogenetic tree and SNP distance matrix from the core gene alignment using `iqtree` and `snp-dists`, respectively. Optionally, the workflow will also run this analysis using the pangenome alignment. This workflow also features an optional module, `summarize_data`, that creates a presence/absence matrix for the analyzed samples from a list of indicated columns (such as AMR genes) that can be used in Phandango.
Default Parameters
Please note that while default parameters for pangenome construction and phylogenetic tree generation are provided, these default parameters may not suit every dataset and have not been validated against known phylogenies. Users should take care to select the parameters that are most appropriate for their dataset. Please reach out to support@theiagen.com or one of the other resources listed at the bottom of this page if you would like assistance with this task.
Inputs
For further detail regarding Pirate options, please see PIRATE's documentation. For further detail regarding IQ-TREE options, please see http://www.iqtree.org/doc/Command-Reference.
This workflow runs on the set level.
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
core_gene_snp_workflow | cluster_name | String | Name of sample set | | Required |
core_gene_snp_workflow | gff3 | Array[File] | Array of GFF3 files to include in the analysis; GFF files output by Prokka or Bakta in the TheiaProk workflows are compatible | | Required |
core_gene_snp_workflow | align | Boolean | Boolean variable that instructs the workflow to generate core and pangenome alignments if "true". If "false", the workflow will produce only a pangenome summary. | TRUE | Optional |
core_gene_snp_workflow | core_tree | Boolean | Boolean variable that instructs the workflow to create a phylogenetic tree and SNP distance matrix from the core gene alignment. Align must also be set to true. | TRUE | Optional |
core_gene_snp_workflow | data_summary_column_names | String | A comma-separated list of the column names from the sample-level data table for generating a data summary (presence/absence .csv matrix); e.g., "amrfinderplus_amr_genes,amrfinderplus_virulence_genes" | | Optional |
core_gene_snp_workflow | data_summary_terra_project | String | The billing project for your current workspace. This can be found after the "#workspaces/" section in the workspace's URL | | Optional |
core_gene_snp_workflow | data_summary_terra_table | String | The name of the sample-level Terra data table that will be used for generating a data summary | | Optional |
core_gene_snp_workflow | data_summary_terra_workspace | String | The name of the Terra workspace you are in. This can be found at the top of the webpage, or in the URL after the billing project. | | Optional |
core_gene_snp_workflow | midpoint_root_tree | Boolean | Boolean variable that will instruct the workflow to reroot the tree at the midpoint | FALSE | Optional |
core_gene_snp_workflow | pan_tree | Boolean | Boolean variable that instructs the workflow to create a phylogenetic tree and SNP distance matrix from the pangenome alignment. Align must also be set to true. | FALSE | Optional |
core_gene_snp_workflow | phandango_coloring | Boolean | Boolean variable that tells the data summary task and the reorder matrix task to include a suffix that enables consistent coloring on Phandango; by default, this suffix is not added. To add this suffix set this variable to true. | FALSE | Optional |
core_gene_snp_workflow | sample_names | Array[String] | Array of sample_ids from the data table used | | Optional |
core_iqtree | alrt | String | Number of replicates to use for the SH-like approximate likelihood ratio test (Minimum recommended= 1000). Follows IQ-TREE "-alrt" option | 1000 | Optional |
core_iqtree | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
core_iqtree | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
core_iqtree | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/iqtree:1.6.7 | Optional |
core_iqtree | iqtree_bootstraps | String | Number of ultrafast bootstrap replicates. Follows IQ-TREE "-bb" option. | 1000 | Optional |
core_iqtree | iqtree_model | String | Substitution model, frequency type (optional) and rate heterogeneity type (optional) used by IQ-TREE. This string follows the IQ-TREE "-m" option. For comparison to other tools use HKY for Bactopia, GTR+F+I for Grandeur, GTR+G4 for Nullarbor, GTR+G for Dryad | GTR+I+G | Optional |
core_iqtree | iqtree_opts | String | Additional options for IQ-TREE, see http://www.iqtree.org/doc/Command-Reference | | Optional |
core_iqtree | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
core_reorder_matrix | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
core_reorder_matrix | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
core_reorder_matrix | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/mykrobe:0.12.1 | Optional |
core_reorder_matrix | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
core_snp_dists | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
core_snp_dists | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
core_snp_dists | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/snp-dists:0.8.2 | Optional |
core_snp_dists | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
pan_iqtree | alrt | String | Number of replicates to use for the SH-like approximate likelihood ratio test (Minimum recommended= 1000). Follows IQ-TREE "-alrt" option | 1000 | Optional |
pan_iqtree | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
pan_iqtree | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
pan_iqtree | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/iqtree:1.6.7 | Optional |
pan_iqtree | iqtree_bootstraps | String | Number of ultrafast bootstrap replicates. Follows IQ-TREE "-bb" option. | 1000 | Optional |
pan_iqtree | iqtree_model | String | Substitution model, frequency type (optional) and rate heterogeneity type (optional) used by IQ-TREE. This string follows the IQ-TREE "-m" option. For comparison to other tools use HKY for Bactopia, GTR+F+I for Grandeur, GTR+G4 for Nullarbor, GTR+G for Dryad | GTR+I+G | Optional |
pan_iqtree | iqtree_opts | String | Additional options for IQ-TREE, see http://www.iqtree.org/doc/Command-Reference | | Optional |
pan_iqtree | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
pan_reorder_matrix | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
pan_reorder_matrix | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
pan_reorder_matrix | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/mykrobe:0.12.1 | Optional |
pan_reorder_matrix | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
pan_snp_dists | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
pan_snp_dists | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
pan_snp_dists | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/snp-dists:0.8.2 | Optional |
pan_snp_dists | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
pirate | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
pirate | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
pirate | docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/pirate:1.0.5--hdfd78af_0 | Optional |
pirate | features | String | Features to use for pangenome construction [default: CDS] | CDS | Optional |
pirate | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
pirate | nucl | Boolean | Boolean variable that instructs pirate to create a pangenome on CDS features using nucleotide identity, rather than amino acid identity, if true. | FALSE | Optional |
pirate | panopt | String | Additional arguments for Pirate | | Optional |
pirate | steps | String | Identity thresholds to use for pangenome construction | 50,60,70,80,90,95,98 | Optional |
summarize_data | cpu | Int | Number of CPUs to allocate to the task | 8 | Optional |
summarize_data | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
summarize_data | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-03-16 | Optional |
summarize_data | id_column_name | String | If the sample IDs are in a different column from the sample names, that column's name can be passed here and it will be used instead. | | Optional |
summarize_data | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 1 | Optional |
version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
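For command-line runs, the inputs in the table above are usually supplied as a JSON file. The following is a minimal sketch of how such a file could be assembled, assuming the common Cromwell/miniwdl convention of keying workflow inputs as `<workflow>.<variable>` and task inputs as `<workflow>.<task>.<variable>`. The helper name `build_inputs`, the sample set name, and the file paths are hypothetical and only serve as illustration.

```python
import json


def build_inputs(cluster_name, gff3_paths, optional=None):
    """Return an inputs dict keyed as '<workflow>.<variable>' (assumed convention)."""
    if not gff3_paths:
        raise ValueError("gff3 requires at least one file")
    workflow = "core_gene_snp_workflow"
    inputs = {
        f"{workflow}.cluster_name": cluster_name,  # required
        f"{workflow}.gff3": list(gff3_paths),      # required
    }
    # Optional inputs: workflow-level names such as 'pan_tree', or
    # task-level names written as '<task>.<variable>'.
    for name, value in (optional or {}).items():
        inputs[f"{workflow}.{name}"] = value
    return inputs


# Hypothetical sample set and file names, for illustration only.
inputs = build_inputs(
    "example_set",
    ["sample_a.gff3", "sample_b.gff3"],
    {"pan_tree": True, "core_iqtree.iqtree_model": "GTR+G4"},
)
print(json.dumps(inputs, indent=2))
```

Any input left out of the dictionary falls back to the default listed in the table.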
Workflow Tasks
By default, the Core_Gene_SNP workflow will begin by analyzing the input sample set using PIRATE. Pirate takes in GFF3 files and classifies the genes into gene families by sequence identity, outputting a pangenome summary file. The workflow will instruct Pirate to create core gene and pangenome alignments using this gene family data. Setting the `align` input variable to false will turn off this behavior, and the workflow will output only the pangenome summary. The workflow will then use the core gene alignment from `Pirate` to infer a phylogenetic tree using `IQ-TREE`. It will also produce an SNP distance matrix from this alignment using snp-dists. This behavior can be turned off by setting the `core_tree` input variable to false. The workflow will not create a pangenome tree or SNP distance matrix by default, but this behavior can be turned on by setting the `pan_tree` input variable to true.
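The way the three Boolean inputs interact can be summarized in a short sketch. This is an illustration of the behavior described above, not the workflow's actual WDL; the function name `planned_steps` and the step labels are hypothetical. It assumes that the tree and SNP matrix steps are skipped whenever `align` is false, since both `core_tree` and `pan_tree` require `align` to be true.

```python
def planned_steps(align=True, core_tree=True, pan_tree=False):
    """List the steps implied by the three Boolean inputs (defaults as documented)."""
    steps = ["pirate: pangenome summary"]
    if not align:
        # Without alignments there is nothing to build trees from (assumption).
        return steps
    steps.append("pirate: core gene and pangenome alignments")
    if core_tree:
        steps += ["iqtree: core tree", "snp-dists: core SNP distance matrix"]
    if pan_tree:
        steps += ["iqtree: pangenome tree", "snp-dists: pangenome SNP distance matrix"]
    return steps


for step in planned_steps(pan_tree=True):
    print(step)
```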
The optional `summarize_data` task performs the following only if all of the `data_summary_*` and `sample_names` optional variables are filled out:
- Digests a comma-separated list of column names, such as `"amrfinderplus_virulence_genes,amrfinderplus_stress_genes"`, that can be found within the origin Terra data table.
- Parses the contents of those columns and extracts each value; for example, if the `amrfinderplus_amr_genes` column for a sample contains the values `"aph(3')-IIIa,tet(O),blaOXA-193"`, the `summarize_data` task will check each sample in the set to see if it also has those AMR genes detected.
- Outputs a .csv file that indicates presence (TRUE) or absence (empty) for each item in those columns; that is, it will check each sample in the set against the detected items in each column to see if that value was also detected.
By default, this task does not append a Phandango coloring tag (`phandango_coloring` defaults to false in this workflow). Setting the optional `phandango_coloring` variable to true appends a tag so that all items from the same column are colored the same in Phandango.
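The presence/absence logic described for `summarize_data` can be sketched in a few lines of standard-library Python. This is not the task's actual implementation; the function name `presence_absence`, the `sample_id` column, and the example rows are hypothetical, and the optional Phandango coloring suffix is left out because its exact format is not described here.

```python
import csv
import io


def presence_absence(rows, id_column, columns):
    """Build a presence/absence CSV from comma-separated cell values.

    rows: one dict per sample, as might be read from a data table export.
    """
    header = []       # every distinct item, in first-seen order
    per_sample = {}   # sample ID -> set of items detected in that sample
    for row in rows:
        found = set()
        for col in columns:
            for item in (row.get(col) or "").split(","):
                item = item.strip()
                if item:
                    found.add(item)
                    if item not in header:
                        header.append(item)
        per_sample[row[id_column]] = found

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_column] + header)
    for sample, found in per_sample.items():
        # TRUE marks presence; absence is left empty.
        writer.writerow([sample] + ["TRUE" if h in found else "" for h in header])
    return out.getvalue()


# Hypothetical table rows, for illustration only.
rows = [
    {"sample_id": "s1", "amrfinderplus_amr_genes": "aph(3')-IIIa,tet(O),blaOXA-193"},
    {"sample_id": "s2", "amrfinderplus_amr_genes": "tet(O)"},
]
matrix = presence_absence(rows, "sample_id", ["amrfinderplus_amr_genes"])
print(matrix)
```

In this sketch an item that appears in more than one column is merged into a single matrix column; keeping them separate would require prefixing each item with its source column name.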
Outputs
Variable | Type | Description |
---|---|---|
core_gene_snp_wf_analysis_date | String | Date of analysis using Core_Gene_SNP workflow |
core_gene_snp_wf_version | String | Version of PHB used for analysis |
pirate_core_alignment_fasta | File | Nucleotide alignments of the core genes as created using MAFFT within Pirate. Loci are ordered according to the gene_families.ordered file. |
pirate_core_alignment_gff | File | Annotation data for the gene family within the corresponding fasta file |
pirate_core_snp_matrix | File | SNP distance matrix created from the core gene alignment |
pirate_docker_image | String | Pirate docker image used |
pirate_gene_families_ordered | File | Summary of all gene families, as estimated by Pirate |
pirate_iqtree_core_tree | File | Phylogenetic tree produced by IQ-TREE from the core gene alignment |
pirate_iqtree_pan_tree | File | Phylogenetic tree produced by IQ-TREE from the pangenome alignment |
pirate_iqtree_version | String | IQ-TREE version used |
pirate_pan_alignment_fasta | File | Nucleotide alignments of the pangenome by gene as created using MAFFT within Pirate. Loci are ordered according to the gene_families.ordered file. |
pirate_pan_alignment_gff | File | Annotation data for the gene family within the corresponding fasta file |
pirate_pan_snp_matrix | File | SNP distance matrix created from the pangenome alignment |
pirate_pangenome_summary | File | Summary of the number and frequency of genes in the pangenome, as estimated by Pirate |
pirate_presence_absence_csv | File | A file generated by Pirate that allows many post-alignment tools created for Roary to be used on the output from Pirate |
pirate_snp_dists_version | String | Version of snp-dists used |
pirate_summarized_data | File | The presence/absence matrix generated by the summarize_data task from the list of columns provided |
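To make the SNP distance matrix outputs more concrete, the sketch below computes pairwise SNP distances from a tiny alignment. It is a conceptual illustration rather than the snp-dists implementation, and it assumes that only positions where both sequences carry A, C, G, or T are counted, so gaps and ambiguous bases are ignored. The sample names and sequences are hypothetical.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count differing positions where both sequences have an unambiguous base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    bases = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in bases and b in bases
    )


def distance_matrix(alignment):
    """Return a symmetric nested dict of pairwise SNP distances."""
    names = list(alignment)
    matrix = {n: {m: 0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical aligned sequences, for illustration only.
alignment = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "AC-TTCGA"}
matrix = distance_matrix(alignment)
for name, row in matrix.items():
    print(name, [row[other] for other in alignment])
```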
References
Sion C Bayliss, Harry A Thorpe, Nicola M Coyle, Samuel K Sheppard, Edward J Feil, PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria, GigaScience, Volume 8, Issue 10, October 2019, giz119, https://doi.org/10.1093/gigascience/giz119
Lam-Tung Nguyen, Heiko A. Schmidt, Arndt von Haeseler, Bui Quang Minh, IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies, Molecular Biology and Evolution, Volume 32, Issue 1, January 2015, Pages 268–274, https://doi.org/10.1093/molbev/msu300