Lyve_SET

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Phylogenetic Construction	Bacteria	v2.1.0	Yes	Set-level

Lyve_SET_PHB

The Lyve_SET WDL workflow runs the Lyve-SET pipeline developed by Lee Katz et al. for phylogenetic analysis of bacterial genomes using high-quality single nucleotide polymorphisms (hqSNPs). The workflow identifies SNPs among a set of samples by mapping sequencing reads to a reference genome, retaining only high-quality SNPs, and inferring a phylogeny with RAxML.

Lyve-SET Pipeline (from Lyve-SET paper)

Lyve-SET Workflow Diagram

Inputs

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
lyveset_workflow	dataset_name	String	Free text string used to label output files		Required
lyveset_workflow	read1	Array[File]	FASTQ files containing read1 sequences; we recommend using cleaned reads instead of raw reads		Required
lyveset_workflow	read2	Array[File]	FASTQ files containing read2 sequences; we recommend using cleaned reads instead of raw reads		Required
lyveset_workflow	reference_genome	File	Path to reference genome in a Terra-accessible Google bucket. For considerations when choosing a reference genome, see: https://github.com/lskatz/lyve-SET/blob/master/docs/FAQ.md		Required
lyveset	allowedFlanking	Int	Allowed flanking distance in base pairs. Nucleotides this close together cannot be considered as high-quality.	0	Optional
lyveset	cpu	Int	Number of CPUs to allocate to the task	4	Optional
lyveset	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
lyveset	docker_image	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/lyveset:1.1.4f	Optional
lyveset	downsample	Boolean	If true, downsample all reads to 50x. Approximated according to the ref genome assembly	FALSE	Optional
lyveset	fast	Boolean	Shorthand for --downsample --mapper snap --nomask-phages --nomask-cliffs --sample-sites	FALSE	Optional
lyveset	mapper	String	Which mapper? Choices: "smalt", "snap"	smalt	Optional
lyveset	mask_cliffs	Boolean	If true, search for and mask 'Cliffs' in pileups	FALSE	Optional
lyveset	mask_phages	Boolean	If true, search for and mask phages in the reference genome	FALSE	Optional
lyveset	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	16	Optional
lyveset	min_alt_frac	Float	Minimum fraction of reads that must agree (consensus) before a SNP is called; otherwise the site is called 'N'	0.75	Optional
lyveset	min_coverage	Int	Minimum coverage needed before a SNP is called. Otherwise, 'N'	10	Optional
lyveset	nomatrix	Boolean	If true, do not create an hqSNP matrix	FALSE	Optional
lyveset	nomsa	Boolean	If true, do not make a multiple sequence alignment	FALSE	Optional
lyveset	notrees	Boolean	If true, do not make phylogenies	FALSE	Optional
lyveset	presets	String	See presets.conf for more information		Optional
lyveset	read_cleaner	String	Which read cleaner? Choices: "none", "CGP", "BayesHammer"	CGP	Optional
lyveset	sample_sites	Boolean	If true, randomly choose a genome and find SNPs in a quick and dirty way. Then on the SNP-calling stage, only interrogate those sites for SNPs for each genome (including the randomly-sampled genome).	FALSE	Optional
lyveset	snpcaller	String	Which SNP caller? Choices: "varscan", "vcftools"	varscan	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
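
When the workflow is run from the command line, inputs are typically supplied as a JSON file. The following is a minimal sketch of how such a file could be assembled in Python. It assumes the common WDL convention of `<workflow>.<input>` and `<workflow>.<task>.<input>` keys, and that the workflow and task are named `lyveset_workflow` and `lyveset` as the table above suggests. The bucket paths and dataset name are hypothetical placeholders, and the `build_inputs` helper is illustrative only.

```python
import json


def build_inputs(dataset_name, read1, read2, reference_genome, **task_overrides):
    """Assemble an inputs dictionary for the Lyve_SET workflow.

    Key names are an assumption based on the input table, not a
    confirmed schema. Optional task-level inputs (e.g. mask_phages)
    are passed as keyword arguments.
    """
    # read1 and read2 are parallel arrays: one pair of files per sample.
    if len(read1) != len(read2):
        raise ValueError("read1 and read2 must list the same number of files")

    inputs = {
        "lyveset_workflow.dataset_name": dataset_name,
        "lyveset_workflow.read1": list(read1),
        "lyveset_workflow.read2": list(read2),
        "lyveset_workflow.reference_genome": reference_genome,
    }
    # Optional inputs belong to the lyveset task (assumed call name).
    for name, value in task_overrides.items():
        inputs[f"lyveset_workflow.lyveset.{name}"] = value
    return inputs


# Hypothetical example values; replace with real Terra-accessible paths.
example_inputs = build_inputs(
    dataset_name="example_set",
    read1=["gs://example-bucket/s1_R1.fastq.gz", "gs://example-bucket/s2_R1.fastq.gz"],
    read2=["gs://example-bucket/s1_R2.fastq.gz", "gs://example-bucket/s2_R2.fastq.gz"],
    reference_genome="gs://example-bucket/reference.fasta",
    mask_phages=True,
    mask_cliffs=True,
    min_coverage=20,
)
print(json.dumps(example_inputs, indent=2))
```

Only the four required inputs need to be set; any optional input left out falls back to the default listed in the table.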

Workflow Actions

The Lyve_SET WDL workflow is run using read data from a set of samples. The workflow will produce a pairwise SNP matrix for the sample set and a maximum likelihood phylogenetic tree. Details regarding the default implementation of Lyve_SET and optional modifications are listed below.

Read processing
1. By default, the Lyve_SET WDL workflow cleans reads with the CG-Pipeline read cleaner ("CGP"). Read cleaning can be turned off ("none") or performed with BayesHammer ("BayesHammer") by changing the read_cleaner input variable.
Reference procurement
1. By default, the Lyve_SET WDL workflow will not mask phages or cliffs in the reference genome. Cliffs are regions of the reference genome where read coverage rises or falls abruptly. Masking phages and cliffs is intended to remove low-quality SNPs. Users can enable phage and cliff masking by setting the mask_phages and mask_cliffs variables to true.
SNP discovery
1. The Lyve_SET WDL workflow uses the default read mapper and variant caller from the Lyve-SET pipeline (smalt and varscan). Additional options for each are available using the mapper and snpcaller input variables.
2. The workflow also uses the Lyve-SET pipeline's default variant-calling parameters: the minimum consensus fraction required to call a base is 0.75 (75% of reads) and the minimum read depth is 10X. Sites that do not meet these thresholds are called 'N'. These parameters can be changed with the min_alt_frac and min_coverage input variables.
Phylogenetic analysis
1. The Lyve_SET workflow will attempt to produce a multiple sequence alignment, a SNP distance matrix, and a phylogenetic tree. These steps can be skipped by setting nomsa = true, nomatrix = true, or notrees = true, respectively.
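
The effect of the min_coverage and min_alt_frac thresholds described under SNP discovery can be illustrated with a small Python sketch. This is a simplified model, not Lyve-SET's actual implementation: the `call_base` function is hypothetical, it treats a site as a plain string of observed bases, and it assumes both thresholds are inclusive.

```python
from collections import Counter


def call_base(pileup_bases, min_coverage=10, min_alt_frac=0.75):
    """Return the consensus base at one site, or 'N' if thresholds fail.

    pileup_bases: string of bases observed in the reads covering the site.
    Simplified illustration only; thresholds are assumed to be inclusive.
    """
    depth = len(pileup_bases)
    # Too few reads (or none at all): the site cannot be called.
    if depth == 0 or depth < min_coverage:
        return "N"
    # Most frequent base and the number of reads supporting it.
    base, count = Counter(pileup_bases).most_common(1)[0]
    # The supporting fraction must reach the consensus threshold.
    if count / depth < min_alt_frac:
        return "N"
    return base


print(call_base("AAAAAAAATT"))  # 8 of 10 reads agree (0.80)
print(call_base("AAAAAAATTT"))  # 7 of 10 reads agree (0.70)
print(call_base("AAAAAAAAA"))   # depth 9 is below the default of 10
```

With the defaults, the first site is called as A, while the second and third are reported as N because they fail the consensus and depth thresholds, respectively. Raising min_coverage or min_alt_frac makes calls more conservative at the cost of more masked sites.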

Outputs

For full descriptions of Lyve-SET pipeline outputs, we recommend consulting the Lyve-SET documentation: https://github.com/lskatz/lyve-SET/blob/master/docs/OUTPUT.md

The following output files are populated to the Terra data table. However, please note that certain files may not appear in the data table following a run for two main reasons:

The user instructed the workflow to skip an analysis step
1. For example, if notrees = true, no tree file will appear
The workflow skipped an analysis step due to an issue with the input data
1. For example, the workflow will not attempt to produce a phylogenetic tree if there are too few samples or if samples are too closely related

Variable	Type	Description
lyveset_alignment_fasta	File	The output alignment file in fasta format
lyveset_docker_image	String	Lyve_SET docker image used for analysis
lyveset_log	File	Lyve_SET task log file
lyveset_pairwise_matrix	File	Pairwise SNP distances matrix
lyveset_pooled_snps_vcf	File	Pooled SNPs in VCF format
lyveset_raxml_tree	File	RAxML-generated tree in Newick format
lyveset_wf_analysis_date	String	Date analysis was run
lyveset_wf_version	String	Version of PHB used when running Lyve_SET_PHB
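
Because some outputs may be absent, it can help to distinguish files that were skipped on request from files that are missing for another reason. The sketch below is one possible way to do this in Python. It assumes each skip flag suppresses exactly the output it is paired with here (nomsa with the alignment, nomatrix with the matrix, notrees with the tree); the helper names and the example row are hypothetical.

```python
# Assumed pairing of skip flags with the outputs they suppress.
SKIP_FLAG_TO_OUTPUT = {
    "nomsa": "lyveset_alignment_fasta",
    "nomatrix": "lyveset_pairwise_matrix",
    "notrees": "lyveset_raxml_tree",
}


def expected_optional_outputs(nomsa=False, nomatrix=False, notrees=False):
    """List the optional outputs that should exist given the skip flags."""
    flags = {"nomsa": nomsa, "nomatrix": nomatrix, "notrees": notrees}
    return [out for flag, out in SKIP_FLAG_TO_OUTPUT.items() if not flags[flag]]


def unexplained_missing(row, **skip_flags):
    """Return expected outputs that are absent or empty in a data table row.

    row: dict mapping output column names to values (hypothetical shape).
    A non-empty result suggests the workflow skipped a step because of
    the input data rather than because the user asked it to.
    """
    return [name for name in expected_optional_outputs(**skip_flags)
            if not row.get(name)]


# Hypothetical row in which no tree was produced.
row = {
    "lyveset_alignment_fasta": "gs://example-bucket/out.aln.fasta",
    "lyveset_pairwise_matrix": "gs://example-bucket/out.pairwiseMatrix.tsv",
}
print(unexplained_missing(row))
print(unexplained_missing(row, notrees=True))
```

In this example the first call reports the tree as unexpectedly missing, while the second reports nothing because the tree was skipped deliberately.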

In addition to these outputs, all of the files produced by the Lyve-SET pipeline are available in the task-level outputs, including intermediate files and individual BAM and VCF files for each sample. These files can be accessed by viewing the execution directory for the run.

References

Lyve-SET: Katz LS, Griswold T, Williams-Newkirk AJ, Wagner D, Petkau A, et al. (2017) A Comparative Analysis of the Lyve-SET Phylogenomics Pipeline for Genomic Epidemiology of Foodborne Pathogens. Frontiers in Microbiology 8.