MashTree_FASTA
Quick Facts
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
---|---|---|---|---|
Phylogenetic Construction | Bacteria | PHB v3.0.0 | Yes | Set-level |
MashTree_FASTA_PHB
`MashTree_FASTA` creates a phylogenetic tree using Mash distances.
Mash distances estimate how many kmers two sequences have in common. To generate them, every kmer in a sequence is transformed into an integer value with hashing and Bloom filters. The hashed kmers are sorted, and a "sketch" is created from only the kmers at the top of the sorted list (those with the smallest hash values). Two sketches are compared by counting the hashed kmers they share. Mashtree then uses a neighbor-joining algorithm to cluster these distances into a phylogenetic tree.
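The sketching and comparison steps described above can be illustrated with a minimal Python sketch. It is not Mash's implementation: it hashes kmers with SHA-1 from the standard library as a stand-in for Mash's own hash function, it ignores reverse complements and Bloom filtering, and the function names are hypothetical. The distance formula is the one given in Ondov et al. (2016).

```python
import hashlib
import math

def kmers(seq, k):
    """Yield every kmer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash_kmer(kmer):
    """Map a kmer to a 64-bit integer (stand-in for Mash's hash function)."""
    digest = hashlib.sha1(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def sketch(seq, k=21, sketch_size=10000):
    """Keep only the sketch_size smallest distinct hash values."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:sketch_size]

def jaccard_estimate(sketch_a, sketch_b, sketch_size=10000):
    """Estimate the Jaccard index from the smallest hashes of both sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:sketch_size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

def mash_distance(jaccard, k=21):
    """Mash distance: D = -(1/k) * ln(2j / (1 + j)); 1.0 when nothing is shared."""
    if jaccard == 0:
        return 1.0
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)
```

With the workflow defaults, `k` corresponds to `kmerlength` (21) and `sketch_size` to `sketchsize` (10000). Identical sequences give a distance of 0, and sequences with no shared kmers give 1.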
This workflow also features an optional module, `summarize_data`, that creates a presence/absence matrix for the analyzed samples from a list of indicated columns (such as AMR genes) that can be used in Phandango.
Inputs
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
mashtree_fasta | assembly_fasta | Array[File] | The assembly files for your samples in FASTA format | | Required |
mashtree_fasta | cluster_name | String | Free text string used to label output files | | Required |
mashtree_fasta | data_summary_column_names | String | A comma-separated list of the column names from the sample-level data table for generating a data summary (presence/absence .csv matrix); e.g., "amrfinderplus_amr_genes,amrfinderplus_virulence_genes" | | Optional |
mashtree_fasta | data_summary_terra_project | String | The billing project for your current workspace. This can be found after the "#workspaces/" section in the workspace's URL | | Optional |
mashtree_fasta | data_summary_terra_table | String | The name of the sample-level Terra data table that will be used for generating a data summary | | Optional |
mashtree_fasta | data_summary_terra_workspace | String | The name of the Terra workspace you are in. This can be found at the top of the webpage, or in the URL after the billing project. | | Optional |
mashtree_fasta | midpoint_root_tree | Boolean | If true, midpoint root the final tree | FALSE | Optional |
mashtree_fasta | phandango_coloring | Boolean | If true, the data summary task and the reorder matrix task add a suffix that enables consistent coloring in Phandango; by default, this suffix is not added | FALSE | Optional |
mashtree_fasta | sample_names | Array[String] | The list of samples | | Optional |
mashtree_task | cpu | Int | Number of CPUs to allocate to the task | 16 | Optional |
mashtree_task | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
mashtree_task | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/mashtree:1.2.0 | Optional |
mashtree_task | genomesize | Int | Genome size of the input samples | 5000000 | Optional |
mashtree_task | kmerlength | Int | Hashes will be based on strings of this many nucleotides | 21 | Optional |
mashtree_task | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 64 | Optional |
mashtree_task | mindepth | Int | If set to zero, mashtree runs in "accurate" mode and chooses a mindepth itself using a slower method; otherwise, this value is the minimum number of times a kmer must appear in order to be included | 5 | Optional |
mashtree_task | sketchsize | Int | Each sketch will have at most this many non-redundant min-hashes | 10000 | Optional |
mashtree_task | sort_order | String | For neighbor-joining, the sort order can make a difference. Options include: "ABC" (alphabetical), "random", "input-order" | ABC | Optional |
mashtree_task | truncLength | Int | How many characters to keep in a filename | 250 | Optional |
reorder_matrix | cpu | Int | Number of CPUs to allocate to the task | 100 | Optional |
reorder_matrix | disk_size | Int | Amount of storage (in GB) to allocate to the task | 2 | Optional |
reorder_matrix | docker | String | The Docker container to use for the task | 100 | Optional |
reorder_matrix | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
summarize_data | cpu | Int | Number of CPUs to allocate to the task | 8 | Optional |
summarize_data | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
summarize_data | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-03-16 | Optional |
summarize_data | id_column_name | String | If the sample IDs are in a column other than samplenames, pass that column name here and it will be used instead. | | Optional |
summarize_data | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 1 | Optional |
version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
Workflow Actions
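For command-line runs, the inputs in the table above are typically supplied as a JSON file. The following is a minimal sketch of how such a file might be assembled. It assumes the input keys follow the `workflow.input` and `workflow.task.input` pattern using the names in the "Terra Task Name" column; the helper name, the file paths, and the cluster name are hypothetical, so check the keys against the workflow's WDL before relying on them.

```python
import json

def build_inputs(assemblies, cluster_name, **task_overrides):
    """Build an inputs dictionary for the two required inputs plus
    optional overrides for the mashtree_task call (assumed key layout)."""
    inputs = {
        "mashtree_fasta.assembly_fasta": list(assemblies),
        "mashtree_fasta.cluster_name": cluster_name,
    }
    # Task-level inputs are assumed to be nested under the call name.
    for name, value in task_overrides.items():
        inputs[f"mashtree_fasta.mashtree_task.{name}"] = value
    return inputs

if __name__ == "__main__":
    # Hypothetical file paths and cluster name, for illustration only.
    example = build_inputs(
        ["sample1.fasta", "sample2.fasta", "sample3.fasta"],
        "example_cluster",
        kmerlength=21,
        sketchsize=10000,
    )
    print(json.dumps(example, indent=2))
```

Only `assembly_fasta` and `cluster_name` are required; any input left out of the file keeps the default shown in the table.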
MashTree_FASTA Details
`MashTree_FASTA` is run on a set of assembly FASTA files and creates a phylogenetic tree and a distance matrix.
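To show how the `mashtree_task` inputs relate to the underlying tool, here is a rough sketch that builds a `mashtree` command line from the workflow defaults. The flag names are taken from the Mashtree usage text as best recalled and should be confirmed with `mashtree --help`; the output file names and the helper function are hypothetical, and the command that the WDL task actually runs may differ.

```python
import shlex

def mashtree_command(fastas, cluster_name, cpu=16, genomesize=5000000,
                     kmerlength=21, mindepth=5, sketchsize=10000,
                     sort_order="ABC", trunc_length=250):
    """Return an argument list for a mashtree run (assumed flag names)."""
    return [
        "mashtree",
        "--numcpus", str(cpu),
        "--genomesize", str(genomesize),
        "--kmerlength", str(kmerlength),
        "--mindepth", str(mindepth),
        "--sketch-size", str(sketchsize),
        "--sort-order", sort_order,
        "--truncLength", str(trunc_length),
        # Hypothetical output names derived from cluster_name.
        "--outmatrix", f"{cluster_name}.tsv",
        "--outtree", f"{cluster_name}.nwk",
        *fastas,
    ]

if __name__ == "__main__":
    cmd = mashtree_command(["sample1.fasta", "sample2.fasta"], "example_cluster")
    # Print the command for inspection rather than executing it.
    print(shlex.join(cmd))
```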
MashTree_FASTA Technical Details
Links | |
---|---|
Task | task_mashtree_fasta.wdl |
Software Source Code | Mashtree on GitHub |
Software Documentation | Mashtree on GitHub |
Original Publication(s) | Mashtree: a rapid comparison of whole genome sequence files |
Data summary (optional)
If you fill out the `data_summary_*` and `sample_names` optional variables, you can use the optional `summarize_data` task. The task takes a comma-separated list of column names from the Terra data table, each of which should contain a list of comma-separated items. For example, use `"amrfinderplus_virulence_genes,amrfinderplus_stress_genes"` (with quotes, comma-separated, no spaces) for these output columns from running TheiaProk. The task checks whether each of those comma-separated items is present in each row of the data table (sample), then writes the results to a CSV file. The CSV file indicates presence (TRUE) or absence (empty) for each item. By default, the task does not add a Phandango coloring tag to group items from the same column, but you can turn this on by setting `phandango_coloring` to `true`.
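The presence/absence logic described above can be sketched in a few lines of Python. This is an illustration only, not the code in `task_summarize_data.wdl`: the function name and the example rows are hypothetical, the table is represented as a list of dictionaries rather than read from Terra, and the Phandango coloring suffix is left out.

```python
import csv
import io

def presence_absence_csv(rows, id_column, columns):
    """Build a presence/absence CSV: one row per sample, one column per item,
    with TRUE where the item appears and an empty cell where it does not."""
    per_sample = {}
    all_items = set()
    for row in rows:
        items = set()
        for col in columns:
            cell = row.get(col) or ""
            # Each selected column holds a comma-separated list of items.
            items.update(x.strip() for x in cell.split(",") if x.strip())
        per_sample[row[id_column]] = items
        all_items.update(items)

    header = sorted(all_items)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_column] + header)
    for sample, items in per_sample.items():
        writer.writerow([sample] + ["TRUE" if h in items else "" for h in header])
    return out.getvalue()

if __name__ == "__main__":
    # Hypothetical table rows, for illustration only.
    table = [
        {"sample_id": "s1", "amr_genes": "blaTEM-1,tet(A)", "virulence_genes": "stx1"},
        {"sample_id": "s2", "amr_genes": "tet(A)", "virulence_genes": ""},
    ]
    print(presence_absence_csv(table, "sample_id", ["amr_genes", "virulence_genes"]))
```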
Example output CSV
Example use of Phandango coloring
Data summary produced using the `phandango_coloring` option, visualized alongside the Newick tree at http://jameshadfield.github.io/phandango/#/main
Data summary technical details
Links | |
---|---|
Task | task_summarize_data.wdl |
Outputs
Variable | Type | Description |
---|---|---|
mashtree_docker | String | The Docker image used to run the mashtree task |
mashtree_filtered_metadata | File | Optional output file with filtered metadata that is only produced if the optional summarize_data task is used |
mashtree_matrix | File | The distance matrix made |
mashtree_summarized_data | File | CSV presence/absence matrix generated by the summarize_data task from the list of columns provided; formatted for Phandango if phandango_coloring input is true |
mashtree_tree | File | The phylogenetic tree made |
mashtree_version | String | The version of mashtree used in the workflow |
mashtree_wf_analysis_date | String | The date the workflow was run |
mashtree_wf_version | String | The version of PHB the workflow is hosted in |
References
Katz, L. S., Griswold, T., Morrison, S., Caravas, J., Zhang, S., den Bakker, H. C., Deng, X., & Carleton, H. A. (2019). Mashtree: a rapid comparison of whole genome sequence files. Journal of Open Source Software, 4(44), 1762. https://doi.org/10.21105/joss.01762
Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B., Bergman, N. H., Koren, S., & Phillippy, A. M. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1), 132. https://doi.org/10.1186/s13059-016-0997-x