Augur
Quick Facts
| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore |
|---|---|---|---|---|---|
| Phylogenetic Construction | Viral | v4.0.0 | Yes | Sample-level, Set-level | Augur_Prep_PHB, Augur_PHB |
Augur Workflows
Helpful resources for epidemiological interpretation
CDC's COVID-19 Epidemiology Toolkit is a useful resource for learning more about genomic epidemiology, and several of its modules are particularly relevant to the Augur workflows.
Genomic epidemiology is an important approach to understanding and mitigating disease transmission. A critical step in viral genomic epidemiology is generating phylogenetic trees to explore the genetic relationship between viruses on a local, regional, national, or global scale. The Augur workflows enable viral phylogenetic analysis by generating phylogenetic trees from genome assemblies and incorporating metadata into an intuitive visual platform via Auspice.
Two workflows are offered: Augur_Prep_PHB and Augur_PHB. Augur_Prep_PHB prepares sample metadata to be visualized alongside the phylogenetic tree produced by Augur_PHB. Augur_PHB can be run without metadata to produce only a phylogenetic tree. If you want metadata incorporated in your final tree, these workflows must be run sequentially. The final outputs from Augur_PHB can be visualized in Auspice, which is the recommended platform. Alternative tree visualization platforms can also be used, though these may not support all metadata features.
Augur_Prep_PHB
The Augur_Prep_PHB workflow was written to prepare individual sample assemblies and their metadata for inclusion in Augur_PHB analysis. The optional metadata inputs include collection date information (in YYYY-MM-DD format), clade information (like nextclade clade and/or pango lineage), and geographical information.
This workflow runs on the sample level and takes assembly FASTA files and associated metadata formatted in a data table. FASTA files may be generated with one of the TheiaCoV/TheiaViral Characterization workflows and should adhere to quality control guidelines (e.g. QC guidelines produced by PHA4GE).
How to prepare metadata
If you are running this workflow on Terra, we recommend carefully preparing metadata in a TSV file that will be uploaded to the same Terra data table that contains the sample genetic information. A correctly formatted TSV file can be found in this example. A few important considerations are:
- Always provide date information in YYYY-MM-DD format. Other date formats are incompatible with Augur. You can mark an unknown day or month by replacing the respective value with XX (e.g. 2013-01-XX or 2011-XX-XX), while completely unknown dates can be written as 20XX-XX-XX (which does not restrict the sequence to the 21st century; it could be earlier). Alternatively, a reduced-precision format can be used (e.g. 2018, 2018-03).
    - Because Excel automatically changes date formatting, we recommend not opening or preparing your metadata file in Excel. If the metadata is already in Excel, or you decide to prepare it in Excel, we recommend using another program to correct the dates afterwards (and be cautious if you open it in Excel again!).
- Different levels of geographical information can be passed to Augur. A latitude and longitude file is provided by default by Theiagen, which mirrors what can be found here. Ensure that your spelling exactly matches what is in the file, or alternatively provide your own file. Augur_Prep supports the following levels:
    - region - Lowest-level resolution, often used for continents (e.g. europe, asia, north america)
    - country - Denotes the country where the sample originated (e.g. Argentina, Japan, USA)
    - division - Denotes divisions, or states, or sometimes cities, within the country (e.g. California, Colorado, Cork)
    - location - Highest-level resolution, often used with custom latitude and longitude values for further detail on divisions, like cities within states. Ensure that this level is provided in either the default latitude and longitude file or in a custom one.
- Optional clade information, such as that assigned by Nextclade.
- Optional Pangolin lineage information for SARS-CoV-2 samples.
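The date rules above can be checked before uploading a metadata TSV. The following is a minimal sketch, not part of the workflow: the function name `is_augur_style_date` is hypothetical, and it only covers the date forms mentioned in this section (masked day, month, or year digits with XX, plus the reduced-precision YYYY and YYYY-MM forms).

```python
import re

# Full form: YYYY-MM-DD, where the last two year digits, the month,
# or the day may be masked with XX (forms taken from the text above).
_FULL = re.compile(r"^(\d{4}|\d{2}XX)-(\d{2}|XX)-(\d{2}|XX)$")
# Reduced-precision form: YYYY or YYYY-MM.
_REDUCED = re.compile(r"^(\d{4})(?:-(\d{2}))?$")


def is_augur_style_date(value: str) -> bool:
    """Return True if value matches one of the date forms described above."""
    value = value.strip()
    full = _FULL.match(value)
    if full:
        year, month, day = full.groups()
        # A masked component implies every finer component is masked too.
        if year.endswith("XX") and month != "XX":
            return False
        if month == "XX" and day != "XX":
            return False
        if month != "XX" and not 1 <= int(month) <= 12:
            return False
        if day != "XX" and not 1 <= int(day) <= 31:
            return False
        return True
    reduced = _REDUCED.match(value)
    if reduced:
        month = reduced.group(2)
        return month is None or 1 <= int(month) <= 12
    return False
```

Running a check like this over the date column in a plain-text editor or script, rather than in Excel, avoids the automatic reformatting described above.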
Augur_Prep Inputs
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| augur_prep | assembly | File | The assembly file for your sample in FASTA format | | Required |
| augur_prep | clade | String | Clade membership metadata | | Optional |
| augur_prep | collection_date | String | Collection date of the sample | | Optional |
| augur_prep | country | String | Country where sample was collected | | Optional |
| augur_prep | division | String | Sub-national location (e.g. state/province) where sample was collected | | Optional |
| augur_prep | location | String | Sub-divisional location (e.g. county) where sample was collected | | Optional |
| augur_prep | pango_lineage | String | The Pangolin lineage of the sample | | Optional |
| augur_prep | region | String | Continental/regional location where sample was collected | | Optional |
| prep_augur_metadata | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| prep_augur_metadata | disk_size | Int | Amount of storage (in GB) to allocate to the task | 10 | Optional |
| prep_augur_metadata | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
| prep_augur_metadata | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 3 | Optional |
| prep_augur_metadata | organism | String | The organism to be analyzed in Augur; options: "sars-cov-2", "flu", "MPXV", "rsv-a", "rsv-b" | sars-cov-2 | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
Augur_Prep Outputs
| Variable | Type | Description |
|---|---|---|
| augur_metadata | File | TSV file of the metadata provided as input to the workflow in the proper format for Augur analysis |
| augur_prep_phb_analysis_date | String | Date of analysis |
| augur_prep_phb_version | String | Version of the Public Health Bioinformatics (PHB) repository used |
Augur_PHB
Augur is a bioinformatics toolkit for tracking evolution from sequence data: it ingests sequences and metadata such as dates and sampling locations, filters the data, aligns the sequences, infers a tree, and exports the results in a format that can be visualized by Auspice. This is the tool behind Nextstrain's builds, which are available for a large collection of viral organisms.
Before getting started
Phylogenetic inference requires careful planning, quality control of sequences, and metadata curation. You may have to generate phylogenies multiple times by running the Augur_PHB workflow, with several iterations of assessing results and amending inputs, in order to generate a final tree with sufficient diversity and high-quality data of interest. Theiagen's Introduction to Phylogenetics is one resource to give you the necessary information on the considerations you'll need to have before performing this type of analysis.
Augur_PHB takes as input a set of assembly/consensus files (FASTA format), an optional viral organism designation, and optional sample metadata files (TSV format) that have been formatted via the Augur_Prep_PHB workflow. Augur_PHB runs Augur to generate a phylogenetic tree following the construction of a SNP distance matrix and alignment. Provided metadata will be used to refine the final tree and will be incorporated into the Auspice-formatted tree visual.
Augur Inputs
Sample diversity and tree building
Before attempting a phylogenetic tree, you must ensure that the input FASTAs meet quality-control metrics. Sets of FASTAs with highly discordant quality metrics may result in the inaccurate inference of genetic relatedness.
There must be some sequence diversity among the set of input assemblies. If insufficient diversity is present, it may be necessary to add a more divergent sequence to the set of samples to be analyzed.
Some inputs automatically bypass or trigger modules. For example, populating alignment_fasta bypasses alignment. Clade-defining mutations can be automatically extracted if the "clade_membership" metadata field is provided and the extract_clade_mutations optional input is set to true.
Any metadata present in the final JSON file for Auspice visualization is determined by what metadata was provided by the user. If the sample_metadata_tsvs optional input parameter is not provided, the final tree visual will only include the distance tree. If metadata was provided, different metadata fields will trigger different steps: date information will trigger the refinement of the distance tree into a time tree; clade information will be assigned to the tree nodes; geographical information will be represented in the Auspice visual within a map. The following figure illustrates this logic.
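The branching described above can be expressed as a small sketch. This is illustrative only and is not the workflow's implementation: the function name `triggered_steps` and the `date` column name are assumptions, while `clade_membership` and the four geographic levels come from the text of this page.

```python
import csv
import io

# Geographic levels named in the Augur_Prep section of this page.
GEO_COLUMNS = ("region", "country", "division", "location")


def triggered_steps(metadata_tsv_text):
    """List the steps that populated metadata columns would trigger."""
    rows = list(csv.DictReader(io.StringIO(metadata_tsv_text), delimiter="\t"))

    def populated(column):
        # A column counts only if at least one sample has a non-empty value.
        return any((row.get(column) or "").strip() for row in rows)

    steps = ["distance tree"]  # always produced, even without metadata
    if populated("date"):  # assumed column name for collection dates
        steps.append("time tree refinement")
    if populated("clade_membership"):
        steps.append("clade assignment")
    if any(populated(column) for column in GEO_COLUMNS):
        steps.append("map display")
    return steps
```

For a TSV with only strain names, the sketch returns just the distance tree, which mirrors the behaviour described when sample_metadata_tsvs is not provided.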
A Note on Optional Inputs
Defaults change based on the specified organism input
Default values that mimic the Nextstrain builds for the following organisms have been preselected:
- Flu ("flu"), which also requires the following two inputs:
    - flu_segment ("HA" or "NA")
    - flu_subtype ("H1N1", "H3N2", "Victoria", "Yamagata", or "H5N1")
- RSV-A ("rsv-a")
- RSV-B ("rsv-b")
- Mpox ("mpox")
- SARS-CoV-2 (default; "sars-cov-2")
View these default parameters in the relevant toggle block below.
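As a rough illustration of the organism rules above, the sketch below checks an organism selection before a run is submitted. It is not part of the workflow: `check_organism_inputs` is a hypothetical helper, and because the spelling of the mpox option varies on this page ("mpox", "MPXV", "mpxv"), the sketch accepts all of them case-insensitively as an assumption.

```python
FLU_SEGMENTS = {"HA", "NA"}
FLU_SUBTYPES = {"H1N1", "H3N2", "Victoria", "Yamagata", "H5N1"}
# Organisms with preselected defaults; mpox spellings are an assumption.
PRESET_ORGANISMS = {"sars-cov-2", "flu", "rsv-a", "rsv-b", "mpox", "mpxv"}


def check_organism_inputs(organism="sars-cov-2", flu_segment="HA", flu_subtype=None):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    name = organism.lower()
    if name not in PRESET_ORGANISMS:
        problems.append(f"'{organism}' has no preselected defaults; custom inputs are needed")
    if name == "flu":
        if flu_segment not in FLU_SEGMENTS:
            problems.append("flu_segment must be 'HA' or 'NA'")
        if flu_subtype not in FLU_SUBTYPES:
            problems.append("flu_subtype must be one of: " + ", ".join(sorted(FLU_SUBTYPES)))
        # The defaults listed on this page include no NA reference for H5N1.
        if flu_subtype == "H5N1" and flu_segment == "NA":
            problems.append("no default NA reference is listed for H5N1")
    return problems
```

Returning a list of problems rather than raising on the first one lets a user fix every issue in a single pass.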
organism_parameters: Setting default values for specific organisms
Organism Parameters acquires and propagates default files and variable values for specific organisms.
Default values for SARS-CoV-2
- min_num_unambig = 27000
- clades_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/sars-cov-2/sc2_clades_20251008.tsv"
- lat_longs_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/sars-cov-2/sc2_lat_longs_20251008.tsv"
- reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/sars-cov-2/MN908947.fasta"
- reference_genbank = "gs://theiagen-public-resources-rp/reference_data/viral/sars-cov-2/sc2_reference_seq_20251008.gb"
- auspice_config = "gs://theiagen-public-resources-rp/reference_data/viral/sars-cov-2/sc2_auspice_config_20251008.json"
- min_date = 2020.0
- pivot_interval = 1
- pivot_interval_units = "weeks"
- narrow_bandwidth = 0.05
- proportion_wide = 0.0
Default values for Flu
- lat_longs_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/flu/lat_longs.tsv"
- min_num_unambig = 900
- min_date = 2020.0
- pivot_interval = 1
- narrow_bandwidth = 0.1666667
- proportion_wide = 0.0
H1N1
- auspice_config = "gs://theiagen-public-resources-rp/reference_data/viral/flu/auspice_config_h1n1pdm.json"
- HA
    - reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/flu/reference_h1n1pdm_ha.gb"
    - clades_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/flu/clades_h1n1pdm_ha.tsv"
- NA
    - reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/flu/reference_h1n1pdm_na.gb"
H3N2
- auspice_config = "gs://theiagen-public-resources-rp/reference_data/viral/flu/auspice_config_h3n2.json"
- HA
    - reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/flu/reference_h3n2_ha.gb"
    - clades_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/flu/clades_h3n2_ha.tsv"
- NA
    - reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/flu/reference_h3n2_na.gb"
Victoria
- auspice_config = "gs://theiagen-public-resources-rp/reference_data/viral/flu/auspice_config_vic.json"
- HA
    - reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/flu/reference_vic_ha.gb"
    - clades_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/flu/clades_vic_ha.tsv"
- NA
    - reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/flu/reference_vic_na.gb"
Yamagata
- auspice_config = "gs://theiagen-public-resources-rp/reference_data/viral/flu/auspice_config_yam.json"
- HA
    - reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/flu/reference_yam_ha.gb"
    - clades_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/flu/clades_yam_ha.tsv"
- NA
    - reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/flu/reference_yam_na.gb"
H5N1
- auspice_config = "gs://theiagen-public-resources-rp/reference_data/viral/flu/auspice_config_h5n1.json"
- HA
    - reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/flu/reference_h5n1_ha.gb"
    - clades_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/flu/h5nx-clades.tsv"
Default values for MPXV
- min_num_unambig = 150000
- clades_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/mpox/mpox_clades.tsv"
- lat_longs_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/flu/lat_longs.tsv"
- reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/mpox/NC_063383.1.reference.fasta"
- reference_genbank = "gs://theiagen-public-resources-rp/reference_data/viral/mpox/NC_063383.1_reference.gb"
- auspice_config = "gs://theiagen-public-resources-rp/reference_data/viral/mpox/mpox_auspice_config_mpxv.json"
- min_date = 2020.0
- pivot_interval = 1
- narrow_bandwidth = 0.1666667
- proportion_wide = 0.0
Default values for RSV-A
- min_num_unambig = 10850
- clades_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/rsv/rsv_a_clades.tsv"
- lat_longs_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/flu/lat_longs.tsv"
- reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/rsv/reference_rsv_a.EPI_ISL_412866.fasta"
- reference_genbank = "gs://theiagen-public-resources-rp/reference_data/viral/rsv/reference_rsv_a.gb"
- auspice_config = "gs://theiagen-public-resources-rp/reference_data/viral/rsv/rsv_auspice_config.json"
- min_date = 2020.0
- pivot_interval = 1
- narrow_bandwidth = 0.1666667
- proportion_wide = 0.0
Default values for RSV-B
- min_num_unambig = 10850
- clades_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/rsv/rsv_b_clades.tsv"
- lat_longs_tsv = "gs://theiagen-public-resources-rp/reference_data/viral/flu/lat_longs.tsv"
- reference_fasta = "gs://theiagen-public-resources-rp/reference_data/viral/rsv/reference_rsv_b.EPI_ISL_1653999.fasta"
- reference_genbank = "gs://theiagen-public-resources-rp/reference_data/viral/rsv/reference_rsv_b.gb"
- auspice_config = "gs://theiagen-public-resources-rp/reference_data/viral/rsv/rsv_auspice_config.json"
- min_date = 2020.0
- pivot_interval = 1
- narrow_bandwidth = 0.1666667
- proportion_wide = 0.0
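To give a feel for what the min_num_unambig defaults listed above mean in practice, here is a minimal sketch of a length prefilter. It assumes that "unambiguous" means the bases A, C, G, and T, which may differ from how the workflow's filtering task counts called bases; the names `count_unambiguous` and `passes_prefilter` are hypothetical, and the thresholds are copied from the default lists on this page.

```python
# Thresholds copied from the organism defaults listed on this page.
MIN_NUM_UNAMBIG = {
    "sars-cov-2": 27000,
    "flu": 900,
    "MPXV": 150000,
    "rsv-a": 10850,
    "rsv-b": 10850,
}


def count_unambiguous(sequence):
    """Count A/C/G/T bases, ignoring N, gaps, and other ambiguity codes."""
    return sum(1 for base in sequence.upper() if base in "ACGT")


def passes_prefilter(sequence, organism):
    """Return True if the sequence meets the organism's default threshold."""
    return count_unambiguous(sequence) >= MIN_NUM_UNAMBIG[organism]
```

A check like this can help explain why a heavily masked assembly is missing from the final tree.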
Organism Parameters Technical Details
| Links | |
|---|---|
| Task | wf_organism_parameters.wdl |
Running Augur_PHB on custom organisms
For organisms other than the defaults listed above, several otherwise optional inputs are required to guarantee workflow functionality:
| Task | Input | Description |
|---|---|---|
| augur | reference_fasta | Reference sequence in FASTA format. |
| augur | reference_genbank | Reference sequence in GenBank format. |
| augur | organism | Name of expected organism. |
| augur | min_num_unambig | Minimum number of called bases in genome to pass prefilter. |
| augur | lat_longs_tsv | Tab-delimited file of geographic location names with corresponding latitude and longitude values. Only necessary if geographical information is in the metadata; must follow this format |
| augur | clades_tsv | TSV file containing clade mutation positions in four columns. Only necessary if clade information is in the metadata. |
In the inputs table below, these fields have both the "Required" and "Optional" tags.
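The table of inputs for custom organisms can be turned into a simple pre-submission check. This is a sketch under stated assumptions: `missing_custom_inputs` is a hypothetical helper, the input names are taken from the table above, and the file names in the usage example are placeholders.

```python
# Inputs the table above lists as needed for any custom organism.
ALWAYS_NEEDED = ("reference_fasta", "reference_genbank", "organism", "min_num_unambig")


def missing_custom_inputs(inputs, has_geography=False, has_clades=False):
    """Return the names of needed inputs that are absent or empty."""
    needed = list(ALWAYS_NEEDED)
    if has_geography:  # only necessary if geographic metadata is present
        needed.append("lat_longs_tsv")
    if has_clades:  # only necessary if clade metadata is present
        needed.append("clades_tsv")
    return [name for name in needed if inputs.get(name) in (None, "")]


# Usage with placeholder values for a hypothetical custom organism.
example_inputs = {
    "reference_fasta": "my_reference.fasta",
    "organism": "my-virus",
    "min_num_unambig": 9000,
}
print(missing_custom_inputs(example_inputs, has_geography=True))
```

In this usage example the helper reports reference_genbank and lat_longs_tsv as missing, since neither was supplied.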
There are many optional user inputs. For more information regarding these optional inputs, please view Nextstrain's detailed documentation on Augur.
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| augur | assembly_fastas | Array[File]+ | The assembly files for your samples in FASTA format | | Required |
| augur | build_name | String | Name to give to the Augur build | | Required |
| augur | clades_tsv | File | TSV file containing clade mutation positions in four columns | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, an empty clades file is provided to prevent workflow failure, "gs://theiagen-public-resources-rp/empty_files/minimal-clades.tsv", but will not be as useful as an organism specific clades file. | Optional, Required |
| augur | flu_subtype | String | Required if organism = "flu". The subtype of the flu samples being analyzed; options: "H1N1", "H3N2", "Victoria", "Yamagata", "H5N1" | | Optional, Required |
| augur | reference_fasta | File | The reference FASTA file used to align the genomes and build the trees | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, a reference fasta file must be provided otherwise the workflow fails. | Optional, Required |
| augur | reference_genbank | File | The GenBank .gb file for the same reference genome used for the reference_fasta | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, a reference genbank file must be provided otherwise the workflow fails. | Optional, Required |
| augur | alignment_fasta | File | The alignment fasta file in Augur | | Optional |
| augur | augur_id_column | String | Column name of sequence IDs | strain | Optional |
| augur | augur_trait_columns | String | Comma-separated list of columns to use for trait analysis in Augur | | Optional |
| augur | auspice_config | File | Auspice config file for customizing visualizations; takes priority over the other customization values available for augur_export | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, a minimal auspice config file is provided to prevent workflow failure, "gs://theiagen-public-resources-rp/empty_files/minimal-auspice-config.json", but will not be as useful as an organism specific config file. | Optional |
| augur | extract_clade_mutations | Boolean | Generate a "clades.tsv" using "clade_membership" column metadata (overrides clades_tsv) | False | Optional |
| augur | flu_segment | String | Required if organism = "flu". The name of the segment to be analyzed; options: "HA" or "NA" | HA | Optional |
| augur | lat_longs_tsv | File | Tab-delimited file of geographic location names with corresponding latitude and longitude values | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, a minimal lat-long file is provided to prevent workflow failure, "gs://theiagen-public-resources-rp/empty_files/minimal-lat-longs.tsv", but will not be as useful as a detailed lat-longs file covering all the locations for the samples to be visualized. | Optional |
| augur | midpoint_root_tree | Boolean | Boolean variable that will instruct the workflow to reroot the tree at the midpoint | True | Optional |
| augur | min_date | Float | Minimum date to begin filtering or frequencies calculations | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default value is 0.0 | Optional |
| augur | min_num_unambig | Int | Minimum number of called bases in genome to pass prefilter | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default value is 0 | Optional |
| augur | narrow_bandwidth | Float | The bandwidth for the narrow KDE | 0.08333 | Optional |
| augur | organism | String | Organism used to preselect default values; options: "sars-cov-2", "flu", "mpxv", "rsv-a", "rsv-b" | sars-cov-2 | Optional |
| augur | outgroup_root | String | Tip name to root phylogenetic tree upon (overrides midpoint root) | | Optional |
| augur | pivot_interval | Int | Number of units between pivots | 3 | Optional |
| augur | proportion_wide | Float | The proportion of the wide bandwidth to use in the KDE mixture model | 0.2 | Optional |
| augur | remove_reference | Boolean | Whether or not to remove the reference in Augur | False | Optional |
| augur | sample_metadata_tsvs | Array[File] | An array of the metadata files produced in Augur_Prep_PHB | | Optional |
| augur_align | cpu | Int | Number of CPUs to allocate to the task | 64 | Optional |
| augur_align | disk_size | Int | Amount of storage (in GB) to allocate to the task | 750 | Optional |
| augur_align | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/augur:31.5.0 | Optional |
| augur_align | fill_gaps | Boolean | If true, gaps represent missing data rather than true indels and so are replaced by N after aligning. | False | Optional |
| augur_align | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 128 | Optional |
| augur_ancestral | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| augur_ancestral | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| augur_ancestral | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/augur:31.5.0 | Optional |
| augur_ancestral | infer_ambiguous | Boolean | If true, infer nucleotides at ambiguous sites and replace with most likely | False | Optional |
| augur_ancestral | inference | String | Calculate joint or marginal maximum likelihood ancestral sequence states; options: "joint", "marginal" | joint | Optional |
| augur_ancestral | keep_ambiguous | Boolean | If true, do not infer nucleotides at ambiguous (N) sites | False | Optional |
| augur_ancestral | keep_overhangs | Boolean | If true, do not infer nucleotides for gaps on either side of the alignment | False | Optional |
| augur_ancestral | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 50 | Optional |
| augur_clades | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| augur_clades | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| augur_clades | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/augur:31.5.0 | Optional |
| augur_clades | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| augur_export | colors_tsv | File | Custom color definitions, one per line in the format TRAIT_TYPE\tTRAIT_VALUE\tHEX_CODE | | Optional |
| augur_export | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| augur_export | description_md | File | Markdown file with description of build and/or acknowledgements | | Optional |
| augur_export | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| augur_export | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/augur:31.5.0 | Optional |
| augur_export | include_root_sequence | Boolean | Export an additional JSON containing the root sequence used to identify mutations | False | Optional |
| augur_export | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 64 | Optional |
| augur_export | title | String | The title to be displayed by auspice | | Optional |
| augur_refine | branch_length_inference | String | Branch length mode of timetree to use; options: "auto", "joint", "marginal", "input" | auto | Optional |
| augur_refine | clock_filter_iqd | Int | Remove tips that deviate more than n_iqd interquartile ranges from the root-to-tip vs time regression | 4 | Optional |
| augur_refine | clock_rate | Float | Fixed clock rate to use for time tree calculations | | Optional |
| augur_refine | clock_std_dev | Float | Standard deviation of the fixed clock_rate estimate | | Optional |
| augur_refine | coalescent | String | Coalescent time scale in units of inverse clock rate (float), optimize as scalar ("opt") or skyline ("skyline") | | Optional |
| augur_refine | covariance | Boolean | If true, account for covariation when estimating rates and/or rerooting | True | Optional |
| augur_refine | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| augur_refine | date_confidence | Boolean | If true, calculate confidence intervals for node dates | True | Optional |
| augur_refine | date_inference | String | Assign internal nodes to their marginally most likely dates; options: "joint", "marginal" | marginal | Optional |
| augur_refine | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| augur_refine | divergence_units | String | Units in which sequence divergence is exported; options: "mutations" or "mutations-per-site" | mutations | Optional |
| augur_refine | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/augur:31.5.0 | Optional |
| augur_refine | gen_per_year | Int | Number of generations per year | 50 | Optional |
| augur_refine | keep_polytomies | Boolean | If true, don't attempt to resolve polytomies | False | Optional |
| augur_refine | keep_root | Boolean | If true, do not reroot the tree; use it as-is (overrides anything specified by root) | True | Optional |
| augur_refine | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 50 | Optional |
| augur_refine | precision | Int | Precision used by TreeTime to determine the number of grid points that are used for the evaluation of the branch length interpolation objects. Values range from 0 (rough) to 3 (ultra fine) and default to 'auto' | auto | Optional |
| augur_refine | root | String | Rooting mechanism; options: "best", "least-squares", "min_dev", "oldest", etc. | | Optional |
| augur_traits | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| augur_traits | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| augur_traits | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/augur:31.5.0 | Optional |
| augur_traits | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 30 | Optional |
| augur_traits | metadata_id_columns | String | The names of possible metadata columns containing identifier information, ordered by priority | ('strain', 'name') | Optional |
| augur_traits | weights | File | A dictionary of key/value mappings in JSON format used to weight KDE tip frequencies | | Optional |
| augur_translate | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| augur_translate | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| augur_translate | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/augur:31.5.0 | Optional |
| augur_translate | genes | File | A file containing a list of genes to translate (from nucleotides to amino acids) | | Optional |
| augur_translate | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| augur_tree | cpu | Int | Number of CPUs to allocate to the task | 64 | Optional |
| augur_tree | disk_size | Int | Amount of storage (in GB) to allocate to the task | 750 | Optional |
| augur_tree | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/augur:31.5.0 | Optional |
| augur_tree | exclude_sites | File | File of one-based sites to exclude for raw tree building (BED format in .bed files, DRM format in tab-delimited files, or one position per line) | | Optional |
| augur_tree | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
| augur_tree | method | String | The method used to build the tree. Options: "fasttree", "raxml", "iqtree" (default) | iqtree | Optional |
| augur_tree | override_default_args | Boolean | If true, override default tree builder arguments instead of augmenting them | False | Optional |
| augur_tree | substitution_model | String | The substitution model to use; only available for iqtree. Specify "auto" to run ModelTest; model options can be found here | GTR | Optional |
| augur_tree | tree_builder_args | String | Additional tree builder arguments either augmenting or overriding the default arguments. FastTree defaults: "-nt -nosupport". RAxML defaults: "-f d -m GTRCAT -c 25 -p 235813". IQ-TREE defaults: "-ninit 2 -n 2 -me 0.05 -nt AUTO -redo" | | Optional |
| cat_files | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| cat_files | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| cat_files | docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/utility:1.1 | Optional |
| cat_files | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| clade_extraction_task | allow_missing_tips | Boolean | Generate clades if metadata is missing for some tips | True | Optional |
| clade_extraction_task | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| clade_extraction_task | disk_size | Int | Amount of storage (in GB) to allocate to the task | 10 | Optional |
| clade_extraction_task | docker | String | Docker image to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/extract_clade_mutations:0.0.1 | Optional |
| clade_extraction_task | memory | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional |
| clade_extraction_task | skip_singleton_clades | Boolean | Skip extracting sequence signatures for clades with a single tip | True | Optional |
| fasta_to_ids | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| fasta_to_ids | disk_size | Int | Amount of storage (in GB) to allocate to the task | 375 | Optional |
| fasta_to_ids | docker | String | The Docker container to use for the task | ubuntu | Optional |
| fasta_to_ids | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 1 | Optional |
| filter_sequences_by_length | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| filter_sequences_by_length | disk_size | Int | Amount of storage (in GB) to allocate to the task | 300 | Optional |
| filter_sequences_by_length | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/broadinstitute/viral-core:2.1.33 | Optional |
| filter_sequences_by_length | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 1 | Optional |
| mutation_context | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| mutation_context | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| mutation_context | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/nextstrain-mpox-mutation-context:2024-06-27 | Optional |
| mutation_context | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| organism_parameters | flu_genoflu_genotype | String | Internal component, do not modify | N/A | Optional |
| organism_parameters | gene_locations_bed_file | File | Use to provide locations of interest where average coverage will be calculated | Default provided for SARS-CoV-2 ("gs://theiagen-public-resources-rp/reference_data/viral/sars-cov-2/sc2_gene_locations.bed") and mpox ("gs://theiagen-public-resources-rp/reference_data/viral/mpox/mpox_gene_locations.bed") | Optional |
| organism_parameters | genome_length_input | Int | Use to specify the expected genome length; provided by default for all supported organisms | Default provided for SARS-CoV-2 (29903), mpox (197200), WNV (11000), flu (13000), RSV-A (16000), RSV-B (16000), HIV (primer versions 1 [9181] and 2 [9840]) | Optional |
| organism_parameters | hiv_primer_version | String | The version of HIV primers used. Options are https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl#L156 and https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl#L164. This input is ignored if provided for TheiaCoV_Illumina_SE and TheiaCoV_ClearLabs | v1 | Optional |
| organism_parameters | kraken_target_organism_input | String | The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database. | Default provided for mpox (Monkeypox virus), WNV (West Nile virus), and HIV (Human immunodeficiency virus 1) | Optional |
| organism_parameters | nextclade_dataset_name_input | String | NextClade organism dataset name | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| organism_parameters | nextclade_dataset_tag_input | String | NextClade organism dataset tag | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl. For an organism without set defaults, the default is "NA". | Optional |
| organism_parameters | pangolin_docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.34 | Optional |
| organism_parameters | primer_bed_file | File | The bed file containing the primers used when sequencing was performed | REQUIRED FOR SARS-CoV-2, MPOX, WNV, RSV-A & RSV-B. Provided by default only for HIV primer versions 1 ("gs://theiagen-public-resources-rp/reference_data/viral/hiv/HIV-1_v1.0.primer.hyphen.bed") and 2 ("gs://theiagen-public-resources-rp/reference_data/viral/hiv/HIV-1_v2.0.primer.hyphen400.1.bed") | Optional |
| organism_parameters | reference_gff_file | File | Reference GFF file for the organism being analyzed | Default provided for mpox ("gs://theiagen-public-resources-rp/reference_data/viral/mpox/Mpox-MT903345.1.reference.gff3") and HIV (primer versions 1 ["gs://theiagen-public-resources-rp/reference_data/viral/hiv/NC_001802.1.gff3"] and 2 ["gs://theiagen-public-resources-rp/reference_data/viral/hiv/AY228557.1.gff3"]) | Optional |
| organism_parameters | vadr_max_length | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | Default provided for SARS-CoV-2 (30000), mpox (210000), WNV (11000), flu (0), RSV-A (15500) and RSV-B (15500). | Optional |
| organism_parameters | vadr_mem | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 (RSV-A, RSV-B, WNV) and 16 (all other TheiaCoV organisms) | Optional |
| organism_parameters | vadr_model | File | Path to a tarred and gzipped VADR model file | gs://theiagen-public-resources-rp/reference_data/databases/vadr_models/vadr-models-sarscov2-1.3-2.tar.gz | Optional |
| organism_parameters | vadr_options | String | Options for the v-annotate.pl VADR script | --mkey sarscov2 --glsearch -s -r --nomisc --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --noseqnamemax --out_allfasta | Optional |
| organism_parameters | vadr_skip_length | Int | Minimum assembly length (unambiguous) to run VADR | 10000 | Optional |
| reorder_matrix | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| reorder_matrix | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| reorder_matrix | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/mykrobe:0.12.1 | Optional |
| reorder_matrix | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| reorder_matrix | phandango_coloring | Boolean | Whether or not Phandango coloring will color all items from the same column the same | False | Optional |
| snp_dists | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
| snp_dists | disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| snp_dists | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/snp-dists:0.8.2 | Optional |
| snp_dists | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| tsv_join | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| tsv_join | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| tsv_join | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/broadinstitute/viral-core:2.1.33 | Optional |
| tsv_join | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 7 | Optional |
| tsv_join | out_suffix | String | Suffix of merged tsv files | .tsv | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
Workflow Tasks¶
For the Augur subcommands, please view the Nextstrain Augur documentation for more details and explanations.
versioning: Version Capture
The versioning task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical details
| Links | |
|---|---|
| Task | task_versioning.wdl |
augur align: Genome Assembly Alignment
augur align aligns multiple nucleotide sequences and strips insertions relative to a reference sequence by using the MAFFT aligner.
augur align Technical Details
| Links | |
|---|---|
| Task | task_augur_align.wdl |
| Software Source Code | Augur on GitHub MAFFT on GitHub |
| Software Documentation | augur align on Nextstrain |
| Original Publication(s) | Nextstrain: real-time tracking of pathogen evolution Recent developments in the MAFFT multiple sequence alignment program |
snp-dists: Pairwise SNP Distance Calculation
SNP-dists computes pairwise SNP distances between genomes. It takes the same alignment of genomes used to generate your phylogenetic tree and produces a matrix of pairwise SNP distances between sequences. This means that if you generated a core-genome phylogeny, the output will consist of pairwise core-genome SNP (cgSNP) distances; otherwise, these will be whole-genome SNP distances. In either case, the SNP distance matrix excludes all SNPs in masked regions (i.e. regions masked with a BED file or Gubbins).
The SNP-distance output can be visualized using software such as Phandango to explore the relationships between the genomic sequences. Setting phandango_coloring to true makes the task add a Phandango coloring tag (:c1) to the column names in the output matrix, which ensures that all columns are colored with the same color scheme.
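The pairwise counting described above can be illustrated with a minimal Python sketch. This is not the SNP-dists implementation; it assumes that only positions where both sequences carry an unambiguous base (A, C, G, or T) are compared, and the sample names and sequences are made up for illustration.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count aligned positions where both bases are A/C/G/T and differ."""
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def snp_matrix(alignment):
    """Build a symmetric distance matrix (dict of dicts) from {name: aligned sequence}."""
    names = list(alignment)
    matrix = {name: {name: 0} for name in names}
    for a, b in combinations(names, 2):
        distance = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = distance
        matrix[b][a] = distance
    return matrix


# Hypothetical three-sample alignment; the N in sample_3 is ignored.
alignment = {
    "sample_1": "ACGTACGTAC",
    "sample_2": "ACGTACGTAT",
    "sample_3": "ACTTNCGTAT",
}
matrix = snp_matrix(alignment)
for name in alignment:
    print(name, [matrix[name][other] for other in alignment])
```

Masked or ambiguous positions contribute nothing to the distance in this sketch, which mirrors why masked regions do not appear in the workflow's SNP matrix.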
SNP-dists Technical Details
| Links | |
|---|---|
| Task | task_snp_dists.wdl |
| Software Source Code | SNP-dists on GitHub |
| Software Documentation | SNP-dists on GitHub |
augur tree: Phylogenetic Tree Reconstruction
augur tree generates a phylogenetic tree using iqtree, fasttree, or raxml. By default, the tree will be generated by IQ-TREE 2. FastTree can substantially improve performance at the cost of phylogenetic resolution, which may be worthwhile if the alignment contains hundreds of sequences and computational throughput is a priority. Additionally, the task defaults to using a GTR evolutionary model, which can be modified via the substitution_model input.
augur tree Technical Details
reorder_matrix: Matrix Reorder and Phylogenetic Tree Rooting
Reorder Matrix will reorder a TSV matrix file's entries to match the order of the tips of an inputted phylogenetic tree. The phylogenetic tree can be rooted at the midpoint via the midpoint_root_tree Boolean, or at an outgroup tip via the outgroup_root String input. The provided outgroup tip must be an exact match for a sequence input into the phylogenetic tree. Please note that phylogenetic tree tip names are derived from the alignment FASTA file headers in PHB, and may deviate from samplename inputs.
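As a rough illustration of the reordering step (not the task's actual code), the Python sketch below reads tip names from a simple Newick string and rewrites a square TSV matrix in that order. The regular expression assumes unquoted tip names without special characters, and the helper names `tip_order` and `reorder_matrix` are hypothetical.

```python
import csv
import io
import re


def tip_order(newick):
    """Return tip names in the order they appear in a simple, unquoted Newick string."""
    # A tip name directly follows "(" or ","; internal labels follow ")" and are skipped.
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


def reorder_matrix(tsv_text, order):
    """Reorder the rows and columns of a square TSV matrix to follow `order`."""
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    header = rows[0][1:]
    values = {row[0]: dict(zip(header, row[1:])) for row in rows[1:]}
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([rows[0][0]] + order)
    for row_name in order:
        writer.writerow([row_name] + [values[row_name][col] for col in order])
    return out.getvalue()


# Hypothetical tree and SNP matrix for three samples.
newick = "((sample_2:0.1,sample_3:0.1):0.05,sample_1:0.2);"
matrix_tsv = (
    "\tsample_1\tsample_2\tsample_3\n"
    "sample_1\t0\t1\t2\n"
    "sample_2\t1\t0\t1\n"
    "sample_3\t2\t1\t0\n"
)
order = tip_order(newick)
reordered = reorder_matrix(matrix_tsv, order)
print(reordered)
```

Because the lookup uses tip names as keys, a tip name that does not exactly match a matrix row raises a `KeyError` here, which is the same exact-match requirement noted above for tip names derived from FASTA headers.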
Reorder Matrix Technical Details
| Links | |
|---|---|
| Task | task_reorder_matrix.wdl |
augur refine: Phylogenetic Tree Refinement and Time-Calibration
augur refine adds metadata to a phylogenetic tree and can optionally calibrate branch lengths with respect to time using maximum likelihood via TreeTime. Time calibration is triggered by incorporating collection_date in the provided metadata generated by Augur_Prep_PHB.
Augur Refine Technical Details
| Links | |
|---|---|
| Task | task_augur_refine.wdl |
| Software Source Code | Augur on GitHub TreeTime on GitHub |
| Software Documentation | augur refine on Nextstrain TreeTime ReadTheDocs |
| Original Publication(s) | Nextstrain: real-time tracking of pathogen evolution TreeTime: Maximum-likelihood phylodynamic analysis |
augur ancestral: Ancestral Nucleotide Sequence Reconstruction
augur ancestral infers ancestral nucleotide sequences based on phylogenetic relatedness using maximum-likelihood via TreeTime. A "joint" maximum likelihood model is used by default, though "marginal" input is permitted.
NOTE: keep-ambiguous and infer-ambiguous are mutually exclusive options; selecting both will raise an error.
augur ancestral Technical Details
| Links | |
|---|---|
| Task | task_augur_ancestral.wdl |
| Software Source Code | Augur on GitHub TreeTime on GitHub |
| Software Documentation | augur ancestral on Nextstrain TreeTime ReadTheDocs |
| Original Publication(s) | Nextstrain: real-time tracking of pathogen evolution TreeTime: Maximum-likelihood phylodynamic analysis |
augur translate: Translate Nucleotide Sequences
augur translate translates nucleotide sequences of nodes in a phylogeny to amino acids based on annotated features in a provided reference_genbank. The task will not run without a reference_genbank input.
augur translate Technical Details
| Links | |
|---|---|
| Task | task_augur_translate.wdl |
| Software Source Code | Augur on GitHub |
| Software Documentation | augur translate on Nextstrain |
| Original Publication(s) | Nextstrain: real-time tracking of pathogen evolution |
extract mutation context: Contextualizing Mpox Mutation (only for mpox)
This task quantifies the number of G/A, C/T, and dinucleotide conversions for mpox samples. These mutations have been shown to be characteristic of APOBEC3-type editing, which indicates adaptation of the virus to circulation among humans.
The results are formatted into a JSON that is used to contextualize mutation metadata on the Augur phylogeny.
When visualizing the output auspice_input_json file, there will be 2 new choices in the drop-down menu for "Color By":
- G→A or C→T fraction
- NGA/TCN context of G→A or C→T mutations.
An example Mpox tree with these "Color By" options can be viewed here
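To make the two "Color By" quantities concrete, here is a minimal Python sketch of how such fractions could be computed for one sample. It is not the task's script: the function name, the output keys, and the input format (1-based position, reference base, alternative base) are assumptions for illustration. The dinucleotide check treats a G→A change as in context when the G is followed by A (NGA) and a C→T change as in context when the C is preceded by T (TCN).

```python
def mutation_context(reference, mutations):
    """Summarize G->A / C->T mutations and their NGA/TCN context.

    `mutations` is a list of (position_1_based, ref_base, alt_base) tuples.
    """
    ga_ct = 0        # number of G->A or C->T mutations
    in_context = 0   # subset of those found in an NGA or TCN context
    for position, ref, alt in mutations:
        i = position - 1
        if (ref, alt) == ("G", "A"):
            ga_ct += 1
            # NGA: the mutated G is followed by an A in the reference.
            if i + 1 < len(reference) and reference[i + 1] == "A":
                in_context += 1
        elif (ref, alt) == ("C", "T"):
            ga_ct += 1
            # TCN: the mutated C is preceded by a T in the reference.
            if i > 0 and reference[i - 1] == "T":
                in_context += 1
    total = len(mutations)
    return {
        "GA_CT_fraction": ga_ct / total if total else None,
        "dinuc_context_fraction": in_context / ga_ct if ga_ct else None,
    }


# Hypothetical reference fragment and mutation list.
reference = "ATCGGATTCA"
mutations = [(3, "C", "T"), (5, "G", "A"), (4, "G", "A"), (1, "A", "G")]
result = mutation_context(reference, mutations)
print(result)
```

In this toy example, three of four mutations are G→A or C→T, and two of those three fall in an NGA or TCN context.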
Extract Mutation Context Technical Details
| Links | |
|---|---|
| Task | task_augur_mutation_context.wdl |
augur traits: Ancestral Trait Reconstruction
augur traits will reconstruct the ancestral traits of provided metadata. By default, only the "pango_lineage" and "clade_membership" columns are included, though the augur_traits_columns String input can be set to a comma-delimited list of metadata columns to use as traits.
augur traits Technical Details
| Links | |
|---|---|
| Task | task_augur_traits.wdl |
| Software Source Code | Augur on GitHub |
| Software Documentation | augur traits on Nextstrain |
| Original Publication(s) | Nextstrain: real-time tracking of pathogen evolution |
extract clade mutations: Extract Clade-Defining Signature Sequences
Extract Clade Mutations will create an Augur-compatible "clades.tsv" by extracting signature clade-defining sequences. A nucleotide JSON outputted by Augur Ancestral is required, and an optional amino acid JSON outputted by Augur Translate can be used to infer specific amino acid mutations.
Clade-defining signatures can only be extracted from monophyletic clades with unique mutation signatures. If no clade-defining mutations are reported, an error is raised. If the clade metadata column does not exist, then an error is raised as well.
Extract Clade Mutations Technical Details
| Links | |
|---|---|
| Task | task_extract_clade_mutations.wdl |
| Software Source Code | Theiagen Utilities on GitHub TheiaPhylo on GitHub |
augur clades: Assigning Clades based on Sequence Signatures
augur clades assigns clades to nodes in a tree based on amino-acid or nucleotide signatures. A clades_tsv is required to delineate clade-defining mutations in a tab-delimited file with the header "clade\tgene\tsite\talt", where "site" is an integer position and the "alt" column holds the alternative sequence character. Clade designation preferentially references amino acid sequences if they are provided.
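The expected layout of a clades_tsv can be sketched in Python as below. The clade names, genes, sites, and alleles are invented for illustration, and the helper functions are hypothetical; the sketch assumes that nucleotide mutations use "nuc" in the gene column, following Augur's convention.

```python
import csv
import io

CLADES_HEADER = ["clade", "gene", "site", "alt"]


def write_clades_tsv(definitions):
    """Serialize (clade, gene, site, alt) tuples as a tab-delimited clades file."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(CLADES_HEADER)
    for clade, gene, site, alt in definitions:
        writer.writerow([clade, gene, int(site), alt])
    return out.getvalue()


def read_clades_tsv(text):
    """Parse a clades file, checking the header and that each site is an integer."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != CLADES_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return [(r["clade"], r["gene"], int(r["site"]), r["alt"]) for r in reader]


# Hypothetical clade definitions: one amino acid and two nucleotide signatures.
definitions = [
    ("clade_A", "S", 484, "K"),
    ("clade_A", "nuc", 23012, "A"),
    ("clade_B", "nuc", 1059, "T"),
]
clades_tsv = write_clades_tsv(definitions)
parsed = read_clades_tsv(clades_tsv)
print(clades_tsv)
```

A clade can span several rows, one per defining mutation, so the same clade name may repeat in the first column.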
augur clades Technical Details
| Links | |
|---|---|
| Task | task_augur_clades.wdl |
| Software Source Code | Augur on GitHub |
| Software Documentation | augur clades on Nextstrain |
| Original Publication(s) | Nextstrain: real-time tracking of pathogen evolution |
augur export: Exporting Auspice-formatted Phylogenetic Trees
augur export outputs an Auspice-formatted JSON from an Augur-refined phylogenetic tree and its included metadata. The resulting JSON can be inputted directly into the web-based tree viewer, auspice.us.
augur export Technical Details
| Links | |
|---|---|
| Task | task_augur_export.wdl |
| Software Source Code | Augur on GitHub Auspice on GitHub |
| Software Documentation | augur export on Nextstrain Auspice on Nextstrain |
| Original Publication(s) | Nextstrain: real-time tracking of pathogen evolution |
Augur Outputs¶
The auspice_input_json is intended to be uploaded to Auspice to view the resulting phylogenetic tree with the provided metadata. Alternatively, a phylogenetic tree in Newick format is available for visualization in other platforms. The metadata_merged output can also be uploaded to either Auspice or a different visualization platform to add further context to the phylogenetic visualization. The combined_assemblies output can be uploaded to UShER to view the samples on a global tree of representative sequences from public repositories.
The Nextstrain team hosts documentation on how data moves from the Augur workflow to Auspice visualization, which details the various components of the Auspice interface: How data is exported by Augur for visualisation in Auspice.
| Variable | Type | Description |
|---|---|---|
| augur_aligned_fastas | File | Alignment of inputted FASTAs |
| augur_auspice_input_json | File | Auspice-formatted JSON of tree and associated metadata |
| augur_clade_mutations | File | Clade-specific mutations identified in the analysis |
| augur_combined_assemblies | File | Concatenated FASTAs |
| augur_distance_tree | File | Phylogenetic tree without time-calibration |
| augur_fasttree_version | String | The fasttree version used, blank if other tree method used |
| augur_iqtree_model_used | String | The iqtree model used during augur tree, blank if iqtree not used |
| augur_iqtree_version | String | The iqtree version used during augur tree (default), blank if other tree method used |
| augur_keep_list | File | List of samples that were kept and met length filters |
| augur_mafft_version | String | The mafft version used in augur align |
| augur_metadata_merged | File | TSV file of merged metadata |
| augur_phb_analysis_date | String | The date the analysis was run |
| augur_phb_version | String | The version of the Public Health Bioinformatics (PHB) repository used |
| augur_raxml_version | String | The version of raxml used during augur tree, blank if other tree method used |
| augur_snp_matrix | File | SNP matrix generated from alignment |
| augur_time_tree | File | Time-calibrated phylogenetic tree |
| augur_traits_json | File | JSON of traits used for applying metadata to the phylogenetic tree |
| augur_version | String | Version of Augur used |
References¶
When publishing work using the Augur_PHB workflow, please reference the following:
Nextstrain: Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018 Dec 1;34(23):4121-3.

