Snippy_Streamline_FASTA

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Phylogenetic Construction	Bacteria	PHB v2.2.0	Yes; some optional features incompatible	Set-level

Snippy_Streamline_FASTA_PHB

This workflow is a FASTA-compatible version of Snippy_Streamline. Please see the Snippy_Streamline documentation for more information regarding the workflow tasks.

Snippy_Streamline_FASTA_PHB Workflow Diagram

The Snippy_Streamline_FASTA workflow is an all-in-one approach to generating a reference-based phylogenetic tree and associated SNP-distance matrix. The workflow can be run in multiple ways with options for:

The reference genome to be provided by the user, or automatically selected using the Centroid task and Assembly_Fetch sub-workflow to find a reference genome close to your dataset
The phylogeny to be generated by optionally
- masking user-specified regions of the genome (providing a bed file to snippy_core_bed)
- producing either a core or whole-genome phylogeny and SNP matrix (core_genome; default = true)
- masking recombination detected by Gubbins, or not (use_gubbins; default = true)
- choosing the nucleotide substitution model (by specifying iqtree2_model), or allowing IQ-TREE's ModelFinder to identify the best model for your dataset (default)
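As a rough illustration of how these options map onto workflow inputs, the sketch below builds an inputs dictionary in Python and prints it as JSON. The `snippy_streamline_fasta.<call>.<variable>` key layout, the helper name `build_inputs`, the requirement that both arrays have equal length, and the example file names are all assumptions made for illustration; check the key names against the inputs your platform actually exposes.

```python
import json


def build_inputs(assemblies, samplenames, tree_name, reference=None,
                 core_genome=True, use_gubbins=True,
                 iqtree2_model=None, mask_bed=None):
    """Assemble an inputs dictionary for a hypothetical command-line run."""
    # Assumption: one sample name per assembly file.
    if len(assemblies) != len(samplenames):
        raise ValueError("assembly_fasta and samplenames must have the same length")
    wf = "snippy_streamline_fasta"  # assumed workflow name used as key prefix
    inputs = {
        f"{wf}.assembly_fasta": list(assemblies),
        f"{wf}.samplenames": list(samplenames),
        f"{wf}.tree_name": tree_name,
        f"{wf}.snippy_tree_wf.core_genome": core_genome,
        f"{wf}.snippy_tree_wf.use_gubbins": use_gubbins,
    }
    # Optional inputs are only written when the user supplies them, so the
    # workflow defaults (automatic reference selection, ModelFinder) apply.
    if reference is not None:
        inputs[f"{wf}.reference_genome_file"] = reference
    if iqtree2_model is not None:
        inputs[f"{wf}.snippy_tree_wf.iqtree2_model"] = iqtree2_model
    if mask_bed is not None:
        inputs[f"{wf}.snippy_tree_wf.snippy_core_bed"] = mask_bed
    return inputs


# Hypothetical file names and tree name, for illustration only.
example = build_inputs(
    ["s1.fasta", "s2.fasta"], ["s1", "s2"], "outbreak_tree",
    use_gubbins=False, iqtree2_model="GTR+G",
)
print(json.dumps(example, indent=2))
```

Leaving `reference` unset in this sketch corresponds to letting the workflow select a reference genome automatically.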

Assembly Data Requirements

Input data used in the Snippy_Streamline_FASTA workflow must:

Be assembled genomes in FASTA format
Be generated by unbiased whole genome shotgun sequencing
Pass appropriate QC thresholds for the taxon to ensure that the assemblies represent reasonably complete genomes that are free of contamination from other taxa or cross-contamination of the same taxon
If masking recombination with Gubbins, input data should represent complete genomes from the same strain/lineage (e.g. MLST) that share a recent common ancestor.
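The first requirement above (FASTA-formatted assemblies) can be checked before launching the workflow. The following is a minimal sketch, not part of the workflow itself: the function name `fasta_problems` and the choice to accept only IUPAC nucleotide codes are assumptions for illustration. It checks format only and says nothing about completeness or contamination, which need taxon-specific QC.

```python
# IUPAC nucleotide codes (upper case); lines are upper-cased before checking.
IUPAC_NUCLEOTIDES = set("ACGTRYSWKMBDHVN")


def fasta_problems(lines):
    """Return a list of format problems found in FASTA text lines (empty if none)."""
    problems = []
    current = None  # name of the record currently being read
    seq_len = 0     # number of sequence characters seen for that record
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None and seq_len == 0:
                problems.append(f"record '{current}' has no sequence")
            parts = line[1:].split()
            current = parts[0] if parts else "(unnamed)"
            seq_len = 0
        elif current is None:
            problems.append(f"line {number}: sequence data before first '>' header")
            return problems
        else:
            bad = set(line.upper()) - IUPAC_NUCLEOTIDES
            if bad:
                problems.append(f"line {number}: unexpected characters {sorted(bad)}")
            seq_len += len(line)
    if current is None:
        problems.append("no FASTA records found")
    elif seq_len == 0:
        problems.append(f"record '{current}' has no sequence")
    return problems


good = [">contig1 example", "ACGTNNAC", ">contig2", "acgt"]
bad = [">contig1", "ACGT", ">contig2", ">contig3", "MKLV*"]
print(fasta_problems(good))
print(fasta_problems(bad))
```

With a real file, the same function can be called on an open file handle, for example `fasta_problems(open(path))`.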

Reference Genomes

Reference genomes with multiple contigs are not compatible with using Gubbins to mask recombination in the phylogenetic tree. The automatic selection of a reference genome by the workflow may result in a reference with multiple contigs; in this case, an alternative reference genome should be sought.
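One way to spot this situation ahead of time is to count the records in a FASTA reference. The sketch below is an illustration only: the helper names `count_contigs` and `gubbins_compatible` are hypothetical, and it handles FASTA input but not GenBank-format references.

```python
def count_contigs(fasta_lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def gubbins_compatible(fasta_lines):
    """Return True when the reference consists of exactly one contig."""
    return count_contigs(fasta_lines) == 1


single = [">chromosome", "ACGTACGT"]
multi = [">contig_1", "ACGT", ">contig_2", "TTGA"]

for label, ref in (("single", single), ("multi", multi)):
    if gubbins_compatible(ref):
        print(f"{label}: one contig, suitable for use with Gubbins")
    else:
        print(f"{label}: {count_contigs(ref)} contigs, seek an alternative reference")
```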

Inputs

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
snippy_streamline_fasta	assembly_fasta	Array[File]	The assembly files for your samples		Required
snippy_streamline_fasta	samplenames	Array[String]	The names of your samples		Required
snippy_streamline_fasta	tree_name	String	String of your choice to prefix output files		Required
snippy_streamline_fasta	reference_genome_file	File	Reference genome in FASTA or GENBANK format (must be the same reference used in Snippy_Variants workflow); provide this if you want to skip the detection of a suitable reference		Optional
centroid	cpu	Int	Number of CPUs to allocate to the task	1	Optional
centroid	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
centroid	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/centroid:0.1.0	Optional
centroid	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
ncbi_datasets_download_genome_accession	cpu	Int	Number of CPUs to allocate to the task	1	Optional
ncbi_datasets_download_genome_accession	disk_size	Int	Amount of storage (in GB) to allocate to the task	50	Optional
ncbi_datasets_download_genome_accession	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:14.13.2	Optional
ncbi_datasets_download_genome_accession	include_gbff3	Boolean	When set to true, outputs a gbff3 file (GenBank file)	FALSE	Optional
ncbi_datasets_download_genome_accession	include_gff	Boolean	When set to true, outputs a gff file (Annotation file)	FALSE	Optional
ncbi_datasets_download_genome_accession	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
referenceseeker	cpu	Int	Number of CPUs to allocate to the task	4	Optional
referenceseeker	disk_size	Int	Amount of storage (in GB) to allocate to the task	200	Optional
referenceseeker	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/referenceseeker:1.8.0--pyhdfd78af_0	Optional
referenceseeker	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
referenceseeker	referenceseeker_ani_threshold	Float	Bidirectional average nucleotide identity to use as a cut off for identifying reference assemblies with ReferenceSeeker; default value set according to https://github.com/oschwengers/referenceseeker#description	0.95	Optional
referenceseeker	referenceseeker_conserved_dna_threshold	Float	Conserved DNA % to use as a cut off for identifying reference assemblies with ReferenceSeeker; default value set according to https://github.com/oschwengers/referenceseeker#description	0.69	Optional
referenceseeker	referenceseeker_db	File	Database to use with ReferenceSeeker	gs://theiagen-public-files-rp/terra/theiaprok-files/referenceseeker-bacteria-refseq-205.v20210406.tar.gz	Optional
snippy_tree_wf	call_shared_variants	Boolean	Activates the shared variants analysis task	TRUE	Optional
snippy_tree_wf	core_genome	Boolean	When "true", workflow generates core genome phylogeny; when "false", whole genome is used	TRUE	Optional
snippy_tree_wf	data_summary_column_names	String	A comma-separated list of the column names from the sample-level data table for generating a data summary (presence/absence .csv matrix)		Optional
snippy_tree_wf	data_summary_terra_project	String	The billing project for your current workspace. This can be found after the "#workspaces/" section in the workspace's URL		Optional
snippy_tree_wf	data_summary_terra_table	String	The name of the sample-level Terra data table that will be used for generating a data summary		Optional
snippy_tree_wf	data_summary_terra_workspace	String	The name of the Terra workspace you are in. This can be found at the top of the webpage, or in the URL after the billing project.		Optional
snippy_tree_wf	gubbins_cpu	Int	Number of CPUs to allocate to the task	4	Optional
snippy_tree_wf	gubbins_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
snippy_tree_wf	gubbins_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/biocontainers/gubbins:3.3--py310pl5321h8472f5a_0	Optional
snippy_tree_wf	gubbins_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
snippy_tree_wf	iqtree2_bootstraps	String	Number of replicates for ultrafast bootstrap (http://www.iqtree.org/doc/Tutorial#assessing-branch-supports-with-ultrafast-bootstrap-approximation); minimum recommended = 1000	1000	Optional
snippy_tree_wf	iqtree2_cpu	Int	Number of CPUs to allocate to the task	4	Optional
snippy_tree_wf	iqtree2_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
snippy_tree_wf	iqtree2_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/iqtree2:2.1.2	Optional
snippy_tree_wf	iqtree2_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	32	Optional
snippy_tree_wf	iqtree2_model	String	Nucleotide substitution model to use when generating the final tree with IQTree2. By default, IQTree2 runs its ModelFinder algorithm to identify the model it thinks best fits your dataset		Optional
snippy_tree_wf	iqtree2_opts	String	Additional options to pass to IQTree2		Optional
snippy_tree_wf	midpoint_root_tree	Boolean	Determines whether the tree used in the SNP matrix re-ordering task should be re-rooted or not. Options: true or false	TRUE	Optional
snippy_tree_wf	phandango_coloring	Boolean	Boolean variable that tells the data summary task and the reorder matrix task to include a suffix that enables consistent coloring on Phandango; by default, this suffix is not added. To add this suffix set this variable to true.	FALSE	Optional
snippy_tree_wf	snippy_core_bed	File	User-provided bed file to mask out regions of the genome when creating multiple sequence alignments		Optional
snippy_tree_wf	snippy_core_cpu	Int	Number of CPUs to allocate to the task	8	Optional
snippy_tree_wf	snippy_core_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
snippy_tree_wf	snippy_core_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/snippy:4.6.0	Optional
snippy_tree_wf	snippy_core_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
snippy_tree_wf	snp_dists_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/snp-dists:0.8.2	Optional
snippy_tree_wf	snp_sites_cpu	Int	Number of CPUs to allocate to the task	1	Optional
snippy_tree_wf	snp_sites_disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
snippy_tree_wf	snp_sites_docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/snp-sites:2.5.1	Optional
snippy_tree_wf	snp_sites_memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
snippy_tree_wf	use_gubbins	Boolean	When "true", workflow removes recombination with gubbins tasks; when "false", gubbins is not used	TRUE	Optional
snippy_variants_wf	base_quality	Int	Minimum quality for a nucleotide to be used in variant calling	13	Optional
snippy_variants_wf	cpu	Int	Number of CPUs to allocate to the task	4	Optional
snippy_variants_wf	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/snippy:4.6.0	Optional
snippy_variants_wf	map_qual	Int	Minimum mapping quality to accept in variant calling		Optional
snippy_variants_wf	maxsoft	Int	Number of bases of alignment to soft-clip before discarding the alignment		Optional
snippy_variants_wf	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
snippy_variants_wf	min_coverage	Int	Minimum read coverage of a position to identify a mutation	10	Optional
snippy_variants_wf	min_frac	Float	Minimum fraction of bases at a given position to identify a mutation	0.9	Optional
snippy_variants_wf	min_quality	Int	Minimum VCF variant call "quality"	100	Optional
snippy_variants_wf	query_gene	String	Indicate a particular gene of interest		Optional
snippy_variants_wf	read1	File	Internal component, do not modify.		Do Not Modify, Optional
snippy_variants_wf	read2	File	Internal component, do not modify.		Do Not Modify, Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional

Outputs

Variable	Type	Description
snippy_centroid_docker	String	Docker file used for Centroid
snippy_centroid_fasta	File	FASTA file for the centroid sample
snippy_centroid_mash_tsv	File	TSV file containing mash distances computed by centroid
snippy_centroid_samplename	String	Name of the centroid sample
snippy_centroid_version	String	Centroid version used
snippy_cg_snp_matrix	File	CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment
snippy_concatenated_variants	File	The concatenated variants file
snippy_filtered_metadata	File	TSV recording the columns of the Terra data table that were used in the summarize_data task
snippy_final_alignment	File	Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites)
snippy_final_tree	File	Final phylogenetic tree produced by Snippy_Streamline
snippy_gubbins_branch_stats	File	CSV file showing Gubbins output statistics (https://github.com/nickjcroucher/gubbins/blob/master/docs/gubbins_manual.md#output-statistics) for each branch of the tree
snippy_gubbins_docker	String	Docker file used for Gubbins
snippy_gubbins_recombination_gff	File	Recombination statistics in GFF format; these can be viewed in Phandango against the phylogenetic tree
snippy_gubbins_version	String	Gubbins version used
snippy_iqtree2_docker	String	Docker file used for IQTree2
snippy_iqtree2_model_used	String	Nucleotide substitution model used by IQTree2
snippy_iqtree2_version	String	IQTree2 version used
snippy_msa_snps_summary	File	CSV file showing output statistics (https://github.com/nickjcroucher/gubbins/blob/master/docs/gubbins_manual.md#output-statistics) for each branch of the tree
snippy_ncbi_datasets_docker	String	Docker file used for NCBI datasets
snippy_ncbi_datasets_version	String	NCBI datasets version used
snippy_ref	File	Reference genome used by Snippy
snippy_ref_metadata_json	File	Metadata associated with the reference genome used by Snippy, in JSON format
snippy_referenceseeker_database	String	ReferenceSeeker database used
snippy_referenceseeker_docker	String	Docker file used for ReferenceSeeker
snippy_referenceseeker_top_hit_ncbi_accession	String	NCBI Accession for the top hit identified by Assembly_Fetch
snippy_referenceseeker_tsv	File	TSV file of the top hits between the query genome and the ReferenceSeeker database
snippy_referenceseeker_version	String	ReferenceSeeker version used
snippy_snp_dists_docker	String	Docker file used for SNP-dists
snippy_snp_dists_version	String	SNP-dists version used
snippy_snp_sites_docker	String	Docker file used for SNP-sites
snippy_snp_sites_version	String	SNP-sites version used
snippy_streamline_analysis_date	String	Date of workflow run
snippy_streamline_version	String	Version of Snippy_Streamline used
snippy_summarized_data	File	CSV presence/absence matrix generated by the summarize_data task (within Snippy_Tree workflow) from the list of columns provided
snippy_tree_snippy_docker	String	Docker file used for Snippy in the Snippy_Tree subworkflow
snippy_tree_snippy_version	String	Version of Snippy used in the Snippy_Tree subworkflow
snippy_variants_outdir_tarball	Array[File]	A compressed file containing the whole directory of snippy output files. This is used when running Snippy_Tree
snippy_variants_snippy_docker	Array[String]	Docker file used for Snippy in the Snippy_Variants subworkflow
snippy_variants_snippy_version	Array[String]	Version of Snippy used in the Snippy_Variants subworkflow
snippy_wg_snp_matrix	File	CSV file of whole genome pairwise SNP distances between samples, calculated from the final alignment
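
The SNP matrices listed above can be post-processed with a few lines of Python. The sketch below is a hedged example rather than part of the workflow. It assumes the CSV is a square matrix whose first row and first column hold sample names in the same order, and the helper name `close_pairs`, the sample names, and the 10-SNP cut-off are hypothetical values chosen for illustration.

```python
import csv
import io


def close_pairs(matrix_csv, max_snps):
    """List sample pairs whose pairwise SNP distance is at most max_snps."""
    # Skip blank lines so leading or trailing newlines do not matter.
    rows = [row for row in csv.reader(io.StringIO(matrix_csv)) if row]
    column_names = rows[0][1:]
    pairs = []
    for i, row in enumerate(rows[1:]):
        row_name = row[0]
        for j, value in enumerate(row[1:]):
            # Only read the upper triangle so each pair is reported once.
            if j > i and int(value) <= max_snps:
                pairs.append((row_name, column_names[j], int(value)))
    return pairs


# Made-up distances for three hypothetical samples.
example_matrix = """
,s1,s2,s3
s1,0,3,40
s2,3,0,38
s3,40,38,0
"""

print(close_pairs(example_matrix, max_snps=10))
```

For a real run, the contents of the snippy_cg_snp_matrix or snippy_wg_snp_matrix file would be read from disk and passed in place of `example_matrix`, after confirming that the file matches the assumed layout.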