MashTree_FASTA

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level	Dockstore
Phylogenetic Construction	Bacteria, Mycotics, Viral	v4.1.0	Some optional features incompatible, Yes	Set-level	MashTree_FASTA_PHB

MashTree_FASTA_PHB

MashTree_FASTA creates a phylogenetic tree using Mash distances.

Mash distances estimate how many kmers two sequences have in common. To generate them, every kmer in a sequence is converted to an integer value using hashing and Bloom filters. The hashed kmers are sorted, and a "sketch" is created from only the kmers that appear at the top of the sorted list. Two sketches are compared by counting the number of hashed kmers they have in common. Mashtree then uses a neighbor-joining algorithm to cluster these distances into a phylogenetic tree.
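The sketching and comparison steps described above can be illustrated with a minimal Python sketch. It is not Mash's implementation: SHA-1 stands in for Mash's hash function, reverse complements and Bloom filtering are ignored, and the function names (`sketch`, `mash_distance`) are made up for illustration. The distance formula, D = -(1/k) ln(2j / (1 + j)) for a Jaccard estimate j, follows Ondov et al. (2016).

```python
import hashlib
import math


def hash_kmer(kmer):
    """Map a kmer to a 64-bit integer (SHA-1 is a stand-in hash for this sketch)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k=21, size=10000):
    """Return the `size` smallest distinct kmer hashes of `seq`, sorted ascending."""
    seq = seq.upper()
    hashes = {hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:size]


def mash_distance(sketch_a, sketch_b, k=21, size=10000):
    """Estimate the Mash distance between two sketches built with the same k."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # Look only at the `size` smallest hashes of the merged sketches.
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        raise ValueError("both sketches are empty (sequences shorter than k?)")
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    jaccard = shared / len(merged)
    if jaccard == 0:
        return 1.0  # no shared hashes: report the maximum distance
    return -math.log(2 * jaccard / (1 + jaccard)) / k


if __name__ == "__main__":
    a = sketch("AAAAAC", k=5)
    b = sketch("AAAAAG", k=5)
    print(mash_distance(a, b, k=5))
```

The defaults `k=21` and `size=10000` mirror the workflow's `kmerlength` and `sketchsize` defaults; the toy call uses `k=5` only so that the short example strings produce kmers.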

This workflow also features an optional module, summarize_data, that creates a presence/absence matrix for the analyzed samples from a list of indicated columns (such as AMR genes, etc.) that can be used in Phandango.

Inputs

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
mashtree_fasta	assembly_fasta	Array[File]	The assembly files for your samples in FASTA format		Required
mashtree_fasta	cluster_name	String	Free text string used to label output files		Required
mashtree_fasta	data_summary_column_names	String	A comma-separated list of the column names from the sample-level data table for generating a data summary (presence/absence .csv matrix); e.g., "amrfinderplus_amr_genes,amrfinderplus_virulence_genes"		Optional
mashtree_fasta	data_summary_terra_project	String	The billing project for your current workspace. This can be found after the "#workspaces/" section in the workspace's URL		Optional
mashtree_fasta	data_summary_terra_table	String	The name of the sample-level Terra data table that will be used for generating a data summary		Optional
mashtree_fasta	data_summary_terra_workspace	String	The name of the Terra workspace you are in. This can be found at the top of the webpage, or in the URL after the billing project.		Optional
mashtree_fasta	midpoint_root_tree	Boolean	If true, midpoint root the final tree	True	Optional
mashtree_fasta	phandango_coloring	Boolean	If true, the data summary and reorder matrix tasks add a suffix that enables consistent coloring in Phandango; by default, this suffix is not added	False	Optional
mashtree_fasta	sample_names	Array[String]	The list of samples		Optional
mashtree_task	cpu	Int	Number of CPUs to allocate to the task	16	Optional
mashtree_task	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
mashtree_task	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/mashtree:1.2.0	Optional
mashtree_task	genomesize	Int	Genome size of the input samples	5000000	Optional
mashtree_task	kmerlength	Int	Hashes will be based on strings of this many nucleotides	21	Optional
mashtree_task	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	64	Optional
mashtree_task	mindepth	Int	If set to zero, mashtree will run in "accurate" mode, choosing a mindepth by itself using a slower method; otherwise, this value is the minimum number of times a kmer must appear in order to be included	5	Optional
mashtree_task	sketchsize	Int	Each sketch will have at most this many non-redundant min-hashes	10000	Optional
mashtree_task	sort_order	String	For neighbor-joining, the sort order can make a difference. Options include: "ABC" (alphabetical), "random", "input-order"	ABC	Optional
mashtree_task	truncLength	Int	How many characters to keep in a filename	250	Optional
reorder_matrix	cpu	Int	Number of CPUs to allocate to the task	1	Optional
reorder_matrix	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
reorder_matrix	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/mykrobe:0.12.1	Optional
reorder_matrix	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	2	Optional
reorder_matrix	outgroup_root	String	Tip name to root phylogenetic tree upon (overrides midpoint root)		Optional
summarize_data	cpu	Int	Number of CPUs to allocate to the task	8	Optional
summarize_data	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
summarize_data	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-03-16	Optional
summarize_data	id_column_name	String	If the sample IDs are in a column other than samplenames, pass that column name here and it will be used instead.		Optional
summarize_data	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	1	Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
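For a run outside Terra with a WDL runner, the inputs are usually supplied as a JSON file keyed by `<workflow>.<input>`. The following is a minimal sketch of building such a file with only the required inputs plus `midpoint_root_tree`. It assumes the workflow name is `mashtree_fasta`, as shown in the Terra Task Name column above; the helper name `build_inputs`, the FASTA paths, and the cluster name are hypothetical.

```python
import json


def build_inputs(fasta_paths, cluster_name, midpoint_root_tree=True):
    """Build a WDL inputs mapping that sets no data_summary_* or sample_names values."""
    return {
        # Key prefix assumes the workflow is named "mashtree_fasta".
        "mashtree_fasta.assembly_fasta": list(fasta_paths),
        "mashtree_fasta.cluster_name": cluster_name,
        "mashtree_fasta.midpoint_root_tree": midpoint_root_tree,
    }


if __name__ == "__main__":
    # Hypothetical file paths and cluster name, for illustration only.
    inputs = build_inputs(
        ["assemblies/sample1.fasta", "assemblies/sample2.fasta"],
        "example_cluster",
    )
    print(json.dumps(inputs, indent=2))
```

Leaving the `data_summary_*` and `sample_names` inputs out of the file keeps the Terra-only data summary task from running.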

Workflow Actions

MashTree_FASTA Details

MashTree_FASTA runs on a set of assembly FASTA files and creates a phylogenetic tree and a distance matrix.
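To show how a distance matrix becomes a tree, here is a textbook neighbor-joining sketch in Python. It is not Mashtree's own implementation; the function name `neighbor_joining` and the four-sample toy matrix are assumptions made for illustration, and the output is an unrooted Newick string without midpoint rooting.

```python
def neighbor_joining(names, matrix):
    """Return an unrooted Newick string for a symmetric distance matrix (n >= 3)."""
    if len(names) < 3:
        raise ValueError("need at least three taxa")
    # Distances as nested dicts so that merged nodes can be added by label.
    d = {a: {b: matrix[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        # Total distance from each node to all others (self-distance is 0).
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                # Strict "<" means ties are broken by the current node order.
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:g},{b}:{len_b:g})"
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    # Join the last three nodes at a single central point.
    a, b, c = nodes
    len_a = 0.5 * (d[a][b] + d[a][c] - d[b][c])
    len_b = 0.5 * (d[a][b] + d[b][c] - d[a][c])
    len_c = 0.5 * (d[a][c] + d[b][c] - d[a][b])
    return f"({a}:{len_a:g},{b}:{len_b:g},{c}:{len_c:g});"


if __name__ == "__main__":
    names = ["A", "B", "C", "D"]
    matrix = [
        [0, 3, 8, 9],
        [3, 0, 9, 10],
        [8, 9, 0, 9],
        [9, 10, 9, 0],
    ]
    print(neighbor_joining(names, matrix))
```

In this sketch, ties between equally good pairs are resolved by the order of the nodes, which is one way the ordering of samples can influence a neighbor-joining result.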

MashTree_FASTA Technical Details

	Links
Task	task_mashtree_fasta.wdl
Software Source Code	Mashtree on GitHub
Software Documentation	Mashtree on GitHub
Original Publication(s)	Mashtree: a rapid comparison of whole genome sequence files

Data summary (optional)

Command-line incompatible

This task is not compatible with command-line use, even with modifications. It is engineered to run on Terra. To run this workflow on the command line, you must leave the data_summary_* and sample_names optional variables blank to prevent this task from running.

If you fill out the data_summary_* and sample_names optional variables, you can use the optional summarize_data task. The task takes a comma-separated list of column names from the Terra data table, each of which should contain a comma-separated list of items. For example, to summarize these TheiaProk output columns, enter "amrfinderplus_virulence_genes,amrfinderplus_stress_genes" (with quotes, comma-separated, no spaces). The task checks whether each item is present in each row (sample) of the data table, then writes the results to a CSV file that marks presence (TRUE) or absence (empty) for each item. By default, the task does not add a Phandango coloring tag to group items from the same column, but you can turn this on by setting phandango_coloring to true.

Example output CSV

Sample_Name,aph(3')-IIa,blaCTX-M-65,blaOXA-193,tet(O)
sample1,TRUE,,TRUE,TRUE
sample2,,,FALSE,TRUE
sample3,,,FALSE,
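The presence/absence logic described for summarize_data can be sketched in a few lines of Python. This is an illustration of the idea, not the task's actual code: the function names and the two input rows are invented, absence is written as an empty cell as described in the text, and the Phandango coloring suffix is not modeled.

```python
import csv
import io


def split_items(cell):
    """Split a comma-separated cell into stripped, non-empty items."""
    return [item.strip() for item in (cell or "").split(",") if item.strip()]


def presence_absence_csv(rows, columns, id_column="Sample_Name"):
    """Return CSV text with one row per sample and one column per observed item."""
    # Collect the set of items seen in the chosen columns for each sample.
    found = {
        row[id_column]: {item for col in columns for item in split_items(row.get(col))}
        for row in rows
    }
    items = sorted(set().union(*found.values()))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([id_column] + items)
    for sample, present in found.items():
        # TRUE marks presence; an empty cell marks absence.
        writer.writerow([sample] + ["TRUE" if item in present else "" for item in items])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up table rows, for illustration only.
    rows = [
        {"Sample_Name": "sample1",
         "amrfinderplus_amr_genes": "blaCTX-M-65,tet(O)",
         "amrfinderplus_virulence_genes": ""},
        {"Sample_Name": "sample2",
         "amrfinderplus_amr_genes": "tet(O)",
         "amrfinderplus_virulence_genes": "iroB"},
    ]
    columns = ["amrfinderplus_amr_genes", "amrfinderplus_virulence_genes"]
    print(presence_absence_csv(rows, columns), end="")
```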

Example use of Phandango coloring

Data summary produced using the phandango_coloring option, visualized alongside Newick tree at http://jameshadfield.github.io/phandango/#/main

[Image: example phandango_coloring output]

Data summary technical details

	Links
Task	task_summarize_data.wdl

Outputs

Variable	Type	Description
mashtree_docker	String	The Docker image used to run the mashtree task
mashtree_filtered_metadata	File	Optional output file with filtered metadata that is only produced if the optional `summarize_data` task is used
mashtree_matrix	File	The pairwise distance matrix produced by Mashtree
mashtree_summarized_data	File	CSV presence/absence matrix generated by the `summarize_data` task from the list of columns provided; formatted for Phandango if `phandango_coloring` input is `true`
mashtree_tree	File	The phylogenetic tree produced by Mashtree
mashtree_version	String	The version of mashtree used in the workflow
mashtree_wf_analysis_date	String	The date the workflow was run
mashtree_wf_version	String	The version of PHB the workflow is hosted in

References

Katz, L. S., Griswold, T., Morrison, S., Caravas, J., Zhang, S., den Bakker, H. C., Deng, X., & Carleton, H. A. (2019). Mashtree: a rapid comparison of whole genome sequence files. Journal of Open Source Software, 4(44), 1762. https://doi.org/10.21105/joss.01762

Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B., Bergman, N. H., Koren, S., & Phillippy, A. M. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1), 132. https://doi.org/10.1186/s13059-016-0997-x