MashTree_FASTA

Quick Facts

| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
|---|---|---|---|---|
| Phylogenetic Construction | Bacteria | PHB v2.1.0 | Yes | Set-level |

MashTree_FASTA_PHB

MashTree_FASTA creates a phylogenetic tree using Mash distances.

Mash distances estimate how dissimilar two sequences are from the proportion of kmers they have in common. To compute them, each kmer in a sequence is transformed into an integer value with hashing and Bloom filters. The hashed kmers are sorted, and a "sketch" is created by keeping only the kmers at the top of the sorted list (those with the smallest hash values, up to the sketch size). Two sketches are compared by counting the hashed kmers they share. Mashtree then uses a neighbor-joining algorithm to cluster these distances into a phylogenetic tree.
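The sketching idea can be illustrated with a minimal Python sketch. It is not Mash's implementation: `hashlib.blake2b` stands in for the MurmurHash3 function Mash uses, reverse-complement (canonical) kmers and Bloom filters are ignored, and all function names are hypothetical. The conversion from the estimated Jaccard index j to a distance follows the formula given in Ondov et al. (2016), D = -ln(2j / (1 + j)) / k.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a kmer to a 64-bit integer (assumption: blake2b as a stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=10000):
    """Keep at most `size` of the smallest distinct hash values."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:size]


def mash_distance(sketch_a, sketch_b, k=21, size=10000):
    """Estimate the Jaccard index from two sketches and convert it to a distance."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # Compare only the `size` smallest hashes of the merged sketches.
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 1.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    jaccard = shared / len(merged)
    if jaccard == 0:
        return 1.0  # no shared kmers: report the maximum distance
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


if __name__ == "__main__":
    # Toy sequences with a small k so the example stays readable.
    seq_a = "ACGTTGCATGCCGATAGCTAGGCTAACGT"
    seq_b = "ACGTTGCATGCCGATAGCTAGGCTTACGT"
    d = mash_distance(sketch(seq_a, k=5, size=50), sketch(seq_b, k=5, size=50), k=5, size=50)
    print(f"estimated distance: {d:.4f}")
```

The `k` and `size` defaults mirror the `kmerlength` and `sketchsize` defaults listed in the inputs for this workflow; larger sketches give a more precise Jaccard estimate at the cost of more memory.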

This workflow also features an optional module, summarize_data, that creates a presence/absence matrix for the analyzed samples from a list of indicated columns (such as AMR genes). The matrix can be viewed in Phandango.

Inputs

| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| mashtree_fasta | assembly_fasta | Array[File] | The set of assembly fastas | | Required |
| mashtree_fasta | cluster_name | String | Free text string used to label output files | | Required |
| mashtree_fasta | data_summary_column_names | String | A comma-separated list of the column names from the sample-level data table for generating a data summary (presence/absence .csv matrix); e.g., "amrfinderplus_amr_genes,amrfinderplus_virulence_genes" | | Optional |
| mashtree_fasta | data_summary_terra_project | String | The billing project for your current workspace. This can be found after the "#workspaces/" section in the workspace's URL | | Optional |
| mashtree_fasta | data_summary_terra_table | String | The name of the sample-level Terra data table that will be used for generating a data summary | | Optional |
| mashtree_fasta | data_summary_terra_workspace | String | The name of the Terra workspace you are in. This can be found at the top of the webpage, or in the URL after the billing project | | Optional |
| mashtree_fasta | midpoint_root_tree | Boolean | If true, midpoint root the final tree | FALSE | Optional |
| mashtree_fasta | phandango_coloring | Boolean | Tells the data summary task and the reorder matrix task to include a suffix that enables consistent coloring on Phandango. By default, this suffix is not added; set this variable to true to add it | FALSE | Optional |
| mashtree_fasta | sample_names | Array[String] | The list of samples | | Optional |
| mashtree_task | cpu | Int | Number of CPUs to allocate to the task | 16 | Optional |
| mashtree_task | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| mashtree_task | docker | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/staphb/mashtree:1.2.0" | Optional |
| mashtree_task | genomesize | Int | Genome size of the input samples | 5000000 | Optional |
| mashtree_task | kmerlength | Int | Hashes will be based on strings of this many nucleotides | 21 | Optional |
| mashtree_task | mindepth | Int | The minimum number of times a kmer must appear in order to be included. If set to zero, mashtree runs in "accurate" mode and chooses a mindepth by itself using a slower method | 5 | Optional |
| mashtree_task | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 64 | Optional |
| mashtree_task | sketchsize | Int | Each sketch will have at most this many non-redundant min-hashes | 10000 | Optional |
| mashtree_task | sort_order | String | For neighbor-joining, the sort order can make a difference. Options include: "ABC" (alphabetical), "random", "input-order" | "ABC" | Optional |
| mashtree_task | truncLength | Int | How many characters to keep in a filename | 250 | Optional |
| reorder_matrix | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| reorder_matrix | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| reorder_matrix | docker | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/staphb/mykrobe:0.12.1" | Optional |
| reorder_matrix | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 100 | Optional |
| summarize_data | cpu | Int | Number of CPUs to allocate to the task | 8 | Optional |
| summarize_data | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| summarize_data | docker | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-03-16" | Optional |
| summarize_data | id_column_name | String | If the sample IDs are in a different column from the sample names, that column can be passed here and will be used instead | | Optional |
| summarize_data | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| version_capture | docker | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
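For command-line use, workflow inputs are typically supplied as a JSON file. The following is a minimal sketch of how such a file could be assembled in Python. It assumes the input keys are prefixed with `mashtree_fasta`, as the "Terra Task Name" column suggests; the helper name `build_inputs` and the file paths are hypothetical, and only workflow-level inputs are covered.

```python
import json


def build_inputs(assembly_fastas, cluster_name, **optional):
    """Return a JSON string of workflow-level inputs (key prefix is an assumption)."""
    inputs = {
        "mashtree_fasta.assembly_fasta": list(assembly_fastas),
        "mashtree_fasta.cluster_name": cluster_name,
    }
    # Optional workflow-level inputs, e.g. midpoint_root_tree=True.
    for name, value in optional.items():
        inputs[f"mashtree_fasta.{name}"] = value
    return json.dumps(inputs, indent=2)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_inputs(
        ["sample1.fasta", "sample2.fasta", "sample3.fasta"],
        "example_cluster",
        midpoint_root_tree=True,
    ))
```

Only the two required inputs must be present; any optional input that is omitted falls back to its default value.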

Workflow Actions

MashTree_FASTA is run on a set of assembly fastas and creates a phylogenetic tree and a matrix. These outputs are passed to a task that rearranges the matrix so that its order matches the order of the tips (terminal ends) in the phylogenetic tree.
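The reordering step can be illustrated with a small Python sketch. This is not the code of the reorder_matrix task, whose implementation is not shown here; the function names are hypothetical, and the Newick parsing assumes unquoted tip names without spaces.

```python
import re


def tip_order(newick):
    """Return tip names in the order they appear in a Newick string."""
    # A tip label directly follows "(" or ","; internal node labels follow ")".
    return re.findall(r"[(,]\s*([^\s():,;]+)", newick)


def reorder_matrix(names, matrix, order):
    """Reorder the rows and columns of a square matrix to follow `order`."""
    index = {name: i for i, name in enumerate(names)}
    idx = [index[name] for name in order]
    return [[matrix[i][j] for j in idx] for i in idx]


if __name__ == "__main__":
    tree = "((A:0.1,B:0.2):0.05,C:0.3);"
    names = ["C", "A", "B"]            # current row/column order of the matrix
    matrix = [[0, 3, 2],
              [3, 0, 1],
              [2, 1, 0]]
    order = tip_order(tree)
    for name, row in zip(order, reorder_matrix(names, matrix, order)):
        print(name, row)
```

Keeping the matrix in tip order means each row lines up with the corresponding tip when the tree and matrix are viewed side by side.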

The optional summarize_data task performs the following only if all of the data_summary_* and sample_names optional variables are filled out:

  1. Digests a comma-separated list of column names (for example, "amrfinderplus_virulence_genes,amrfinderplus_stress_genes") that can be found within the originating Terra data table.
  2. Parses the contents of those columns and extracts each value; for example, if the amrfinderplus_amr_genes column for a sample contains "aph(3')-IIIa,tet(O),blaOXA-193", the summarize_data task checks every sample in the set to see whether those AMR genes were also detected in it.
  3. Outputs a .csv file that indicates, for every sample, the presence (TRUE) or absence (empty) of each item found in those columns.

By default, this task does not append a Phandango coloring tag. To color all items from the same column the same, set the optional phandango_coloring variable to true.
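The three steps performed by summarize_data can be sketched in Python as follows. This is an illustration under stated assumptions rather than the task's actual code: the function name, the `sample_id` column, and the sample names are hypothetical, items from different columns are pooled together, and no Phandango coloring suffix is added.

```python
import csv
import io


def presence_absence_csv(rows, id_column, columns):
    """Build a presence/absence CSV from comma-separated cell values."""
    per_sample = {}
    for row in rows:
        items = set()
        for column in columns:
            # Split each cell on commas and ignore empty entries.
            for value in (row.get(column) or "").split(","):
                if value.strip():
                    items.add(value.strip())
        per_sample[row[id_column]] = items

    all_items = sorted(set().union(*per_sample.values()))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([id_column] + all_items)
    for sample, items in per_sample.items():
        # TRUE marks presence; an empty cell marks absence.
        writer.writerow([sample] + ["TRUE" if item in items else "" for item in all_items])
    return buffer.getvalue()


if __name__ == "__main__":
    table = [
        {"sample_id": "sample1", "amrfinderplus_amr_genes": "aph(3')-IIIa,tet(O),blaOXA-193"},
        {"sample_id": "sample2", "amrfinderplus_amr_genes": "tet(O)"},
    ]
    print(presence_absence_csv(table, "sample_id", ["amrfinderplus_amr_genes"]))
```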

Outputs

| Variable | Type | Description |
|---|---|---|
| mashtree_docker | String | The Docker image used to run the mashtree task |
| mashtree_filtered_metadata | File | Optional output file with filtered metadata that is only produced if the optional summarize_data task is used |
| mashtree_matrix | File | The Mash distance matrix made |
| mashtree_summarized_data | File | CSV presence/absence matrix generated by the summarize_data task from the list of columns provided; formatted for Phandango if the phandango_coloring input is true |
| mashtree_tree | File | The phylogenetic tree made |
| mashtree_version | String | The version of mashtree used in the workflow |
| mashtree_wf_analysis_date | String | The date the workflow was run |
| mashtree_wf_version | String | The version of PHB the workflow is hosted in |

References

Katz, L. S., Griswold, T., Morrison, S., Caravas, J., Zhang, S., den Bakker, H. C., Deng, X., & Carleton, H. A. (2019). Mashtree: a rapid comparison of whole genome sequence files. Journal of Open Source Software, 4(44), 1762. https://doi.org/10.21105/joss.01762

Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B., Bergman, N. H., Koren, S., & Phillippy, A. M. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1), 132. https://doi.org/10.1186/s13059-016-0997-x