TheiaEuk Workflow Series
Quick Facts
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
---|---|---|---|---|
Genomic Characterization | Mycotics | vX.X.X | Some optional features incompatible, Yes | Sample-level |
TheiaEuk Workflows
The TheiaEuk_Illumina_PE workflow is for the assembly, quality assessment, and characterization of fungal genomes. It is designed to accept Illumina paired-end sequencing data as the primary input. It is currently intended only for haploid fungal genomes like Candidozyma auris. Analyzing diploid genomes using TheiaEuk should be attempted only with expert attention to the resulting genome quality.
All input reads are processed through "core tasks" in each workflow. The core tasks include raw read quality assessment, read cleaning (quality trimming and adapter removal), de novo assembly, assembly quality assessment, and species taxon identification. For some taxa identified, taxa-specific sub-workflows will be automatically activated, undertaking additional taxa-specific characterization steps, including clade-typing and/or antifungal resistance detection.
Inputs
Input read data
The TheiaEuk_Illumina_PE workflow takes in Illumina paired-end read data. Read file names should end with .fastq or .fq, with the optional addition of .gz. When possible, Theiagen recommends zipping files with gzip prior to Terra upload to minimize data upload time.
By default, the workflow anticipates 2 x 150bp reads (i.e. the input reads were generated using a 300-cycle sequencing kit). Modifications to the optional parameter trim_min_length may be required to accommodate shorter read data, such as the 2 x 75bp reads generated using a 150-cycle sequencing kit.
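The naming rule above can be checked before upload with a short script. This is a minimal sketch, not part of the workflow: the function name `is_valid_read_name` is hypothetical, and the pattern simply encodes the ".fastq or .fq, optionally followed by .gz" rule described in the text.

```python
import re

# Accept names ending in .fastq or .fq, optionally followed by .gz
# (hypothetical helper; it only encodes the naming rule stated in the text).
FASTQ_NAME = re.compile(r"\.(fastq|fq)(\.gz)?$")


def is_valid_read_name(filename: str) -> bool:
    """Return True if the file name follows the expected FASTQ naming rule."""
    return FASTQ_NAME.search(filename) is not None


if __name__ == "__main__":
    for name in ["sample_R1.fastq.gz", "sample_R2.fq", "sample.bam"]:
        print(name, is_valid_read_name(name))
```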
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
theiaeuk_pe | read1 | File | Illumina forward read file in FASTQ file format (compression optional) | | Required |
theiaeuk_pe | read2 | File | Illumina reverse read file in FASTQ file format (compression optional) | | Required |
theiaeuk_pe | samplename | String | The name of the sample being analyzed | | Required |
busco | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
busco | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
busco | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ezlabgva/busco:v5.3.2_cv1 | Optional |
cg_pipeline_clean | cg_pipe_opts | String | Options to pass to CG-Pipeline for clean read assessment | --fast | Optional |
cg_pipeline_clean | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
cg_pipeline_clean | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
cg_pipeline_clean | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/lyveset:1.1.4f | Optional |
cg_pipeline_clean | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
cg_pipeline_raw | cg_pipe_opts | String | Options to pass to CG-Pipeline for raw read assessment | --fast | Optional |
cg_pipeline_raw | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
cg_pipeline_raw | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
cg_pipeline_raw | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/lyveset:1.1.4f | Optional |
clean_check_reads | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
clean_check_reads | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
clean_check_reads | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 | Optional |
clean_check_reads | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
clean_check_reads | organism | String | Internal component, do not modify | | Optional |
clean_check_reads | workflow_series | String | Internal component, do not modify | | Optional |
digger_denovo | assembler | String | Assembler to use (spades, skesa, megahit) | skesa | Optional |
digger_denovo | assembler_options | String | Assembler-specific options that you might choose for the selected assembler | | Optional |
digger_denovo | bwa_cpu | Int | Number of CPUs to allocate to the task | 6 | Optional |
digger_denovo | bwa_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
digger_denovo | bwa_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan | Optional |
digger_denovo | bwa_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
digger_denovo | call_pilon | Boolean | Whether to run Pilon polishing after assembly | FALSE | Optional |
digger_denovo | filter_contigs_cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
digger_denovo | filter_contigs_disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
digger_denovo | filter_contigs_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/shovilter:0.2 | Optional |
digger_denovo | filter_contigs_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
digger_denovo | filter_contigs_min_coverage | Float | Minimum coverage threshold for contig filtering | 2 | Optional |
digger_denovo | filter_contigs_skip_coverage_filter | Boolean | Skip filtering contigs based on coverage | FALSE | Optional |
digger_denovo | filter_contigs_skip_homopolymer_filter | Boolean | Skip filtering contigs containing homopolymers | FALSE | Optional |
digger_denovo | filter_contigs_skip_length_filter | Boolean | Skip filtering contigs based on length | FALSE | Optional |
digger_denovo | kmers | String | K-mer sizes for assembly (comma-separated) | | Optional |
digger_denovo | megahit_cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
digger_denovo | megahit_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
digger_denovo | megahit_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/megahit:1.2.9 | Optional |
digger_denovo | megahit_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
digger_denovo | pilon_cpu | Int | Number of CPUs to allocate to the task | 8 | Optional |
digger_denovo | pilon_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
digger_denovo | pilon_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/pilon:1.24--hdfd78af_0 | Optional |
digger_denovo | pilon_fix | String | Potential issues with assembly to try and automatically fix (snps, indels, gaps, local, all, bases, none) | bases | Optional |
digger_denovo | pilon_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
digger_denovo | pilon_min_base_quality | Int | Minimum base quality to keep | 3 | Optional |
digger_denovo | pilon_min_depth | Float | Minimum coverage threshold for variant calling: when set to a value ≥1, it requires that absolute depth of coverage; when set to a fraction <1, it requires coverage at least that fraction of the mean coverage for the region | 0.25 | Optional |
digger_denovo | pilon_min_mapping_quality | Int | Minimum mapping quality for a read to count in pileups | 60 | Optional |
digger_denovo | run_filter_contigs | Boolean | Whether to run contig filtering step | TRUE | Optional |
digger_denovo | skesa_cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
digger_denovo | skesa_disk_size | Int | Disk space in GB for SKESA assembler | 50 | Optional |
digger_denovo | skesa_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/skesa:2.4.0 | Optional |
digger_denovo | skesa_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
digger_denovo | spades_cpu | Int | Number of CPUs to allocate to the task | 16 | Optional |
digger_denovo | spades_disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
digger_denovo | spades_docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/spades:4.1.0 | Optional |
digger_denovo | spades_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional |
digger_denovo | spades_type | String | SPAdes assembly mode (isolate, meta, rna, etc.), more can be found here | isolate | Optional |
gambit | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
gambit | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/gambit:1.0.0 | Optional |
gambit | gambit_db_genomes | File | Database of metadata for assembled query genomes; requires complementary signatures file. If not provided, uses default database "/gambit-db" | gs://gambit-databases-rp/2.0.0/gambit-metadata-2.0.0-20240628.gdb | Optional |
gambit | gambit_db_signatures | File | Signatures file; requires complementary genomes file. If not specified, the file from the docker container will be used. | gs://gambit-databases-rp/2.0.0/gambit-signatures-2.0.0-20240628.gs | Optional |
gambit | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
merlin_magic | agrvate_docker_image | String | Internal component, do not modify | | Optional |
merlin_magic | amr_search_cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
merlin_magic | amr_search_disk_size | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
merlin_magic | amr_search_docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/amrsearch:0.2.1 | Optional |
merlin_magic | amr_search_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
merlin_magic | assembly_only | Boolean | Internal component, do not modify | | Optional |
merlin_magic | call_poppunk | Boolean | Internal component, do not modify | | Optional |
merlin_magic | call_shigeifinder_reads_input | Boolean | Internal component, do not modify | | Optional |
merlin_magic | emmtypingtool_docker_image | String | Internal component, do not modify | | Optional |
merlin_magic | hicap_docker_image | String | Internal component, do not modify | | Optional |
merlin_magic | ont_data | Boolean | Internal component, do not modify | | Optional |
merlin_magic | paired_end | Boolean | Internal component, do not modify | | Optional |
merlin_magic | pasty_docker_image | String | Internal component, do not modify | | Optional |
merlin_magic | pasty_min_coverage | Int | Internal component, do not modify | | Optional |
merlin_magic | pasty_min_percent_identity | Int | Internal component, do not modify | | Optional |
merlin_magic | run_amr_search | Boolean | If set to true, the AMR_Search workflow will be run when the identified species belongs to a supported taxon; see the AMR_Search docs. | FALSE | Optional |
merlin_magic | shigatyper_docker_image | String | Internal component, do not modify | | Optional |
merlin_magic | shigeifinder_docker_image | String | Internal component, do not modify | | Optional |
merlin_magic | snippy_query_gene | String | Provide a gene to search for using Snippy | Default depends on detected organism | Optional |
merlin_magic | srst2_gene_max_mismatch | Int | Internal component, do not modify | | Optional |
merlin_magic | srst2_max_divergence | Int | Internal component, do not modify | | Optional |
merlin_magic | srst2_min_cov | Int | Internal component, do not modify | | Optional |
merlin_magic | srst2_min_depth | Int | Internal component, do not modify | | Optional |
merlin_magic | srst2_min_edge_depth | Int | Internal component, do not modify | | Optional |
merlin_magic | staphopia_sccmec_docker_image | String | Internal component, do not modify | | Optional |
merlin_magic | tbp_parser_config | File | Internal component, do not modify | | Optional |
merlin_magic | tbp_parser_debug | Boolean | Internal component, do not modify | | Optional |
merlin_magic | tbp_parser_docker_image | String | Internal component, do not modify | | Optional |
merlin_magic | tbp_parser_min_depth | Int | Internal component, do not modify | | Optional |
merlin_magic | tbp_parser_min_percent_coverage | Float | Internal component, do not modify | | Optional |
merlin_magic | tbp_parser_operator | String | Internal component, do not modify | | Optional |
merlin_magic | tbp_parser_output_seq_method_type | String | Internal component, do not modify | | Optional |
merlin_magic | tbprofiler_custom_db | File | Internal component, do not modify | | Optional |
merlin_magic | tbprofiler_mapper | String | Internal component, do not modify | | Optional |
merlin_magic | tbprofiler_min_af | Float | Internal component, do not modify | | Optional |
merlin_magic | tbprofiler_min_depth | Int | Internal component, do not modify | | Optional |
merlin_magic | tbprofiler_run_cdph_db | Boolean | Internal component, do not modify | | Optional |
merlin_magic | tbprofiler_run_custom_db | Boolean | Internal component, do not modify | | Optional |
merlin_magic | tbprofiler_variant_caller | String | Internal component, do not modify | | Optional |
merlin_magic | tbprofiler_variant_calling_params | String | Internal component, do not modify | | Optional |
merlin_magic | virulencefinder_database | String | Internal component, do not modify | | Optional |
merlin_magic | virulencefinder_docker_image | String | Internal component, do not modify | | Optional |
merlin_magic | virulencefinder_min_percent_coverage | Float | Internal component, do not modify | | Optional |
merlin_magic | virulencefinder_min_percent_identity | Float | Internal component, do not modify | | Optional |
qc_check_task | ani_highest_percent | Float | Internal component, do not modify | | Optional |
qc_check_task | ani_highest_percent_bases_aligned | Float | Internal component, do not modify | | Optional |
qc_check_task | assembly_length_unambiguous | Int | Internal component, do not modify | | Optional |
qc_check_task | assembly_mean_coverage | Float | Internal component, do not modify | | Optional |
qc_check_task | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
qc_check_task | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
qc_check_task | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-03-16 | Optional |
qc_check_task | kraken_human | String | Internal component, do not modify | | Optional |
qc_check_task | kraken_human_dehosted | String | Internal component, do not modify | | Optional |
qc_check_task | kraken_sc2 | String | Internal component, do not modify | | Optional |
qc_check_task | kraken_sc2_dehosted | String | Internal component, do not modify | | Optional |
qc_check_task | kraken_target_organism | Float | Internal component, do not modify | | Optional |
qc_check_task | kraken_target_organism_dehosted | Float | Internal component, do not modify | | Optional |
qc_check_task | meanbaseq_trim | String | Internal component, do not modify | | Optional |
qc_check_task | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
qc_check_task | midas_secondary_genus_abundance | Float | Internal component, do not modify | | Optional |
qc_check_task | midas_secondary_genus_coverage | Float | Internal component, do not modify | | Optional |
qc_check_task | number_Degenerate | Int | Internal component, do not modify | | Optional |
qc_check_task | number_N | Int | Internal component, do not modify | | Optional |
qc_check_task | percent_reference_coverage | Float | Internal component, do not modify | | Optional |
qc_check_task | sc2_s_gene_mean_coverage | Float | Internal component, do not modify | | Optional |
qc_check_task | sc2_s_gene_percent_coverage | Float | Internal component, do not modify | | Optional |
qc_check_task | vadr_num_alerts | String | Internal component, do not modify | | Optional |
quast | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
quast | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2 | Optional |
quast | min_contig_length | Int | Minimum length of contig for QUAST | 500 | Optional |
rasusa_task | bases | String | Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored | | Optional |
rasusa_task | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
rasusa_task | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
rasusa_task | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 | Optional |
rasusa_task | frac | Float | Explicitly define the fraction of reads to keep in the subsample; when used, genome size and coverage are ignored; acceptable inputs include whole numbers and decimals, e.g. 50.0 will leave 50% of the reads in the subsample | | Optional |
rasusa_task | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
rasusa_task | num | Int | Optional: explicitly define the number of reads in the subsample; when used, genome size and coverage are ignored; acceptable metric suffixes include: b, k, m, g, and t for base, kilo, mega, giga, and tera, respectively | | Optional |
rasusa_task | seed | Int | Use to assign a name to the "random seed" that is used by the subsampler; i.e. this allows the exact same subsample to be produced from the same input file/s in subsequent runs when providing the seed identifier; do not input values for random downsampling | | Optional |
raw_check_reads | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
raw_check_reads | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
raw_check_reads | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/bactopia/gather_samples:2.0.2 | Optional |
raw_check_reads | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
raw_check_reads | organism | String | Internal component, do not modify | | Optional |
raw_check_reads | workflow_series | String | Internal component, do not modify | | Optional |
read_QC_trim | adapters | File | File with adapter sequences to be removed | | Optional |
read_QC_trim | bbduk_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
read_QC_trim | call_kraken | Boolean | True/False variable that determines if the Kraken2 task should be called; for non-TheiaCoV workflows, the kraken_db variable must be provided. | FALSE | Optional |
read_QC_trim | call_midas | Boolean | Internal component, do not modify | | Optional |
read_QC_trim | fastp_args | String | Additional arguments to use with fastp | --detect_adapter_for_pe -g -5 20 -3 20 | Optional |
read_QC_trim | kraken_db | File | A kraken2 database to use with the kraken2 optional task. The file must be a .tar.gz kraken2 database. | | Optional |
read_QC_trim | kraken_disk_size | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30GB) kraken2 databases such as the "k2_standard" database | 100 | Optional |
read_QC_trim | kraken_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
read_QC_trim | midas_db | File | Internal component, do not modify | | Optional |
read_QC_trim | phix | File | A file containing the phix used during Illumina sequencing; used in the BBDuk task | | Optional |
read_QC_trim | read_processing | String | The name of the tool to perform basic read processing; options: "trimmomatic" or "fastp" | trimmomatic | Optional |
read_QC_trim | read_qc | String | The tool used for quality control (QC) of reads. Options are "fastq_scan" (default) and "fastqc" | fastq_scan | Optional |
read_QC_trim | target_organism | String | This string is searched for in the kraken2 outputs to extract the read percentage | | Optional |
read_QC_trim | trim_min_length | Int | Specifies minimum length of each read after trimming to be kept | 75 | Optional |
read_QC_trim | trim_quality_trim_score | Int | Specifies the average quality of bases in a sliding window to be kept | 20 | Optional |
read_QC_trim | trim_window_size | Int | Specifies window size for trimming (the number of bases to average the quality across) | 10 | Optional |
read_QC_trim | trimmomatic_args | String | Additional arguments to pass to trimmomatic. "-phred33" specifies the Phred Q score encoding which is almost always phred33 with modern sequence data. | -phred33 | Optional |
read_QC_trim | workflow_series | String | Internal component, do not modify | | Optional |
theiaeuk_pe | busco_memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
theiaeuk_pe | call_rasusa | Boolean | If true, RASUSA will subsample raw reads to a specified read depth (150X by default) | TRUE | Optional |
theiaeuk_pe | gambit_db_genomes | File | User-provided database of assembled query genomes; requires complementary signatures file. If not provided, uses default database, "/gambit-db" | gs://gambit-databases-rp/fungal-version/1.0.0/gambit-fungal-metadata-1.0.0-20241213.gdb | Optional |
theiaeuk_pe | gambit_db_signatures | File | User-provided signatures file; requires complementary genomes file. If not specified, the file from the docker container will be used. | gs://gambit-databases-rp/fungal-version/1.0.0/gambit-fungal-signatures-1.0.0-20241213.gs | Optional |
theiaeuk_pe | genome_length | Int | User-specified expected genome length to be used in genome statistics calculations | | Optional |
theiaeuk_pe | max_genome_size | Int | Maximum genome size able to pass read screening | 50000000 | Optional |
theiaeuk_pe | min_basepairs | Int | Minimum number of base pairs able to pass read screening | 2241820 | Optional |
theiaeuk_pe | min_coverage | Int | Minimum genome coverage able to pass read screening | 10 | Optional |
theiaeuk_pe | min_genome_size | Int | Minimum genome size able to pass read screening | 100000 | Optional |
theiaeuk_pe | min_proportion | Int | Minimum proportion of total reads in each read file to pass read screening | 50 | Optional |
theiaeuk_pe | min_reads | Int | Minimum number of reads to pass read screening | 10000 | Optional |
theiaeuk_pe | skip_screen | Boolean | Option to skip the read screening prior to analysis; if setting to true, please provide a value for the theiaeuk_pe genome_length optional input, OR set call_rasusa to false. Otherwise RASUSA will attempt to downsample to an expected genome size of 0 bp, and the workflow will fail. | FALSE | Optional |
theiaeuk_pe | subsample_coverage | Float | Read depth for RASUSA task to subsample reads to | 150 | Optional |
version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
workflow name | trim_min_length | Int | Specifies minimum length of each read after trimming to be kept | 75 | Optional |
workflow name | trim_quality_min_score | Int | Specifies the minimum average quality of bases in a sliding window to be kept | 20 | Optional |
Workflow Tasks
All input reads are processed through "core tasks" in the TheiaEuk workflows. These undertake read trimming and assembly appropriate to the input data type, currently only Illumina paired-end data. The TheiaEuk workflows subsequently launch default genome characterization modules for quality assessment, and additional taxa-specific characterization steps. When setting up the workflow, users may choose to use "optional tasks" or alternatives to tasks run in the workflow by default.
Core tasks
These tasks are performed regardless of organism. They perform read trimming and various quality control steps.
versioning: Version Capture
The versioning task captures the workflow version from the GitHub (code repository) version.
Version Capture Technical details
Links | |
---|---|
Task | task_versioning.wdl |
screen: Total Raw Read Quantification and Genome Size Estimation
The screen task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses fastq-scan and bash commands for quantification of reads and base pairs, and mash sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples are run through all threshold checks, regardless of failures, and the workflow will terminate after the screen task if any thresholds are not met:
- Total number of reads: A sample will fail the read screening task if its total number of reads is less than or equal to min_reads.
- Proportion of basepairs in the forward and reverse read files: A sample will fail the read screening if fewer than min_proportion percent of basepairs are in either the read1 or read2 files.
- Number of basepairs: A sample will fail the read screening if there are fewer than min_basepairs basepairs.
- Estimated genome size: A sample will fail the read screening if the estimated genome size is smaller than min_genome_size or bigger than max_genome_size.
- Estimated genome coverage: A sample will fail the read screening if the estimated genome coverage is less than min_coverage.
Read screening is undertaken on both the raw and cleaned reads. The task may be skipped by setting the skip_screen variable to true.
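The threshold checks listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the task's actual implementation: the names `ScreenThresholds` and `screen_reads` are hypothetical, coverage is approximated here as total basepairs divided by the estimated genome size (the task itself derives its estimates from mash), and the proportion is computed on basepairs as a percentage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScreenThresholds:
    """Hypothetical container for the user-adjustable screening thresholds."""
    min_reads: int
    min_basepairs: int
    min_genome_size: int
    max_genome_size: int
    min_coverage: float
    min_proportion: float  # percent of basepairs required in each read file


def screen_reads(total_reads: int, read1_bp: int, read2_bp: int,
                 est_genome_size: int, t: ScreenThresholds) -> List[str]:
    """Run every check (no early exit) and return the names of failed thresholds."""
    failures = []
    total_bp = read1_bp + read2_bp

    if total_reads <= t.min_reads:
        failures.append("min_reads")

    # Share of basepairs held by the smaller of the two read files, in percent.
    smaller_share = 100.0 * min(read1_bp, read2_bp) / total_bp if total_bp else 0.0
    if smaller_share < t.min_proportion:
        failures.append("min_proportion")

    if total_bp < t.min_basepairs:
        failures.append("min_basepairs")

    if est_genome_size < t.min_genome_size or est_genome_size > t.max_genome_size:
        failures.append("genome_size")

    # Assumption: coverage approximated as total basepairs / estimated genome size.
    coverage = total_bp / est_genome_size if est_genome_size else 0.0
    if coverage < t.min_coverage:
        failures.append("min_coverage")

    return failures


if __name__ == "__main__":
    # Illustrative threshold values only; use the defaults of your workflow version.
    thresholds = ScreenThresholds(3000, 45_000_000, 9_000_000, 178_000_000, 10, 40)
    print(screen_reads(1_300_000, 100_000_000, 100_000_000, 12_000_000, thresholds))
```

Collecting every failure instead of stopping at the first mirrors the statement that samples are run through all threshold checks; an empty list corresponds to a sample that passes screening.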
Default values vary between the PE, SE, and ONT workflows. The rationale for these default values can be found below. If two default values are shown, the first is for Illumina workflows and the second is for ONT.
| Variable | Default Value | Rationale |
| --- | --- | --- |
| skip_screen | false | Set to true to skip the read screen. If you set this value to true, please provide a value for the theiaeuk_illumina_pe genome_length optional input, OR set the theiaeuk_illumina_pe call_rasusa optional input to false. Otherwise RASUSA will attempt to downsample to an expected genome size of 0 bp, and the workflow will fail. |
| min_reads | 3000 | Calculated from the minimum number of base pairs required for 20x coverage of the Hansenula polymorpha genome, the smallest fungal genome as of 2015-04-02 (8.97 Mbp), divided by 300 (the longest Illumina read length) |
| min_basepairs | 45000000 | Should be greater than 10x coverage of Hansenula polymorpha, the smallest fungal genome as of 2015-04-02 (8.97 Mbp) |
| min_genome_length | 9000000 | Based on the Hansenula polymorpha genome, the smallest fungal genome as of 2015-04-02 (8.97 Mbp) |
| max_genome_length | 178000000 | Based on the Cenococcum geophilum genome, the largest pathogenic fungal genome (177.57 Mbp), plus an additional 2 Mbp to cater for potential extra genomic material |
| min_coverage | 10 | A bare-minimum average per base coverage across the genome required for genome characterization. Higher coverage would be required for high-quality phylogenetics. |
| min_proportion | 40 | Neither read1 nor read2 files should have less than 40% of the total number of reads. For paired-end data only. |
Screen Technical Details
There is a single WDL task for read screening. The screen task is run twice, once for raw reads and once for clean reads.
Links | |
---|---|
Task | task_screen.wdl (PE sub-task) task_screen.wdl (SE sub-task) |
Rasusa: Read subsampling (optional, on by default)
The Rasusa task performs subsampling of the raw reads. By default, this task will subsample reads to a depth of 150X using the estimated genome length produced during the preceding raw read screen. The user can prevent the task from being launched by setting the call_rasusa variable to false.
The user can also provide an estimated genome length for the task to use for subsampling using the genome_length variable. In addition, the read depth can be modified using the subsample_coverage variable.
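The relationship between coverage, genome length, and the amount of data kept can be illustrated as follows. This is a simplified sketch and not Rasusa's code: the function name `subsample_to_coverage` is hypothetical, each entry of `read_lengths` is assumed to represent one read pair (combined length), and reads are drawn in random order until the target of coverage × genome length bases is reached.

```python
import random
from typing import List, Optional


def subsample_to_coverage(read_lengths: List[int], genome_length: int,
                          coverage: float = 150.0,
                          seed: Optional[int] = None) -> List[int]:
    """Return the sorted indices of the reads kept after subsampling."""
    target_bases = coverage * genome_length

    # Nothing to do when the input is already at or below the requested depth.
    if sum(read_lengths) <= target_bases:
        return list(range(len(read_lengths)))

    rng = random.Random(seed)  # a fixed seed makes the subsample reproducible
    order = list(range(len(read_lengths)))
    rng.shuffle(order)

    kept, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        kept.append(idx)
        total += read_lengths[idx]
    return sorted(kept)


if __name__ == "__main__":
    pairs = [300] * 1000  # 1000 read pairs of 2 x 150 bp
    kept = subsample_to_coverage(pairs, genome_length=1000, coverage=150, seed=1)
    print(len(kept))
```

In this sketch a genome length of 0 gives a target of 0 bases, so no reads are kept, which is consistent with the warning about skipping the screen without supplying genome_length.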
Rasusa Technical Details
Links | |
---|---|
Task | task_rasusa.wdl |
Software Source Code | Rasusa on GitHub |
Software Documentation | Rasusa on GitHub |
Original Publication(s) | Rasusa: Randomly subsample sequencing reads to a specified coverage |
read_QC_trim: Read Quality Trimming, Adapter Removal, Quantification, and Identification
read_QC_trim is a sub-workflow that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. The differences between the PE and SE versions of the read_QC_trim sub-workflow lie in the default parameters, the use of two or one input read file(s), and the different output files.
Read quality trimming
Either trimmomatic or fastp can be used for read-quality trimming. Trimmomatic is used by default. Both tools trim low-quality regions of reads with a sliding window (with a window size of trim_window_size), cutting once the average quality within the window falls below trim_quality_trim_score. They will both discard the read if it is trimmed below trim_min_length.
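The sliding-window logic described above can be sketched as follows. This is a simplified illustration, not the trimmomatic or fastp implementation: the function name `sliding_window_trim` is hypothetical, the read is cut at the start of the first window whose mean quality falls below the threshold, and the default arguments are taken from the inputs table on this page.

```python
from typing import List


def sliding_window_trim(qualities: List[int], window_size: int = 10,
                        min_avg_quality: float = 20,
                        min_length: int = 75) -> int:
    """Return the number of bases kept from the 5' end, or 0 if the read is discarded."""
    keep = len(qualities)
    for start in range(0, len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            keep = start  # cut at the start of the first failing window
            break
    # Reads trimmed below the minimum length are dropped entirely.
    return keep if keep >= min_length else 0


if __name__ == "__main__":
    read = [30] * 100 + [5] * 50  # quality drops sharply after base 100
    print(sliding_window_trim(read))
```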
If fastp is selected for analysis, fastp also implements the additional read-trimming steps indicated below:
Parameter | Explanation |
---|---|
-g | enables polyG tail trimming |
-5 20 | enables read end-trimming |
-3 20 | enables read end-trimming |
--detect_adapter_for_pe | enables adapter-trimming only for paired-end reads |
Adapter removal
The BBDuk task removes adapters from sequence reads. To do this:
- Repair from the BBTools package reorders reads in paired fastq files to ensure the forward and reverse reads of a pair are in the same position in the two fastq files.
- BBDuk ("Bestus Bioinformaticus" Decontamination Using Kmers) is then used to trim the adapters and filter out all reads that have a 31-mer match to PhiX, which is commonly added to Illumina sequencing runs to monitor and/or improve overall run quality.
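The idea of filtering reads by a 31-mer match can be illustrated with a small conceptual sketch. It is not how BBDuk is implemented internally: the helper names are hypothetical, only exact matches are considered, and the reference k-mers are held in a plain Python set.

```python
from typing import Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_reference_kmers(reference: str, k: int = 31) -> Set[str]:
    """Collect every k-mer of the reference and of its reverse complement."""
    kmers = set()
    for seq in (reference, reverse_complement(reference)):
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def matches_reference(read: str, ref_kmers: Set[str], k: int = 31) -> bool:
    """True if any k-mer of the read occurs in the reference k-mer set."""
    return any(read[i:i + k] in ref_kmers for i in range(len(read) - k + 1))


if __name__ == "__main__":
    # Made-up 51 bp sequence standing in for a contaminant reference such as PhiX.
    ref = "ACGTTGCAAGGCTTAACCGGTTAACGATCGATCGGATCCTAGGCTAAGCTT"
    ref_kmers = build_reference_kmers(ref)
    print(matches_reference("TTTT" + ref[5:36] + "GGGG", ref_kmers))
```

Reads for which `matches_reference` returns True would be the ones removed in this simplified model.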
What are adapters and why do they need to be removed?
Adapters are manufactured oligonucleotide sequences attached to DNA fragments during the library preparation process. In Illumina sequencing, these adapter sequences are required for attaching reads to flow cells. You can read more about Illumina adapters here. For genome analysis, it's important to remove these sequences since they're not actually from your sample. If you don't remove them, the downstream analysis may be affected.
Read Quantification
There are two methods for read quantification to choose from: fastq-scan (default) or fastqc. Both quantify the forward and reverse reads in FASTQ files. For paired-end data, they also provide the total number of read pairs. This task is run once with raw reads as input and once with clean reads as input. If QC has been performed correctly, you should expect fewer clean reads than raw reads. fastqc also provides a graphical visualization of the read quality.
Read Identification with MIDAS (optional)
The MIDAS task is for the identification of reads to detect contamination with non-target taxa. This task is optional and turned off by default. It can be used by setting the call_midas input variable to true.
The MIDAS tool was originally designed for metagenomic sequencing data but has been co-opted for use with bacterial isolate WGS methods. It can be used to detect contamination present in raw sequencing data by estimating bacterial species abundance in bacterial isolate WGS data. If a secondary genus is detected above a relative frequency of 0.01 (1%), then the sample should fail QC and be investigated further for potential contamination.
This task is similar to those used in commercial software, BioNumerics, for estimating secondary species abundance.
How are the MIDAS output columns determined?
Example MIDAS report in the midas_report column:
species_id | count_reads | coverage | relative_abundance |
---|---|---|---|
Salmonella_enterica_58156 | 3309 | 89.88006645 | 0.855888033 |
Salmonella_enterica_58266 | 501 | 11.60606061 | 0.110519371 |
Salmonella_enterica_53987 | 99 | 2.232896237 | 0.021262881 |
Citrobacter_youngae_61659 | 46 | 0.995216227 | 0.009477003 |
Escherichia_coli_58110 | 5 | 0.123668877 | 0.001177644 |
MIDAS report column descriptions:
- species_id: species identifier
- count_reads: number of reads mapped to marker genes
- coverage: estimated genome-coverage (i.e. read-depth) of species in metagenome
- relative_abundance: estimated relative abundance of species in metagenome
The value in the midas_primary_genus column is derived by sorting the rows by "relative_abundance" in descending order and taking the genus of the top species in the "species_id" column (Salmonella). The value in the midas_secondary_genus column is the second-most prevalent genus in the "species_id" column (Citrobacter). The midas_secondary_genus_abundance column is the "relative_abundance" of that second-most prevalent genus (0.009477003), and the midas_secondary_genus_coverage column is its "coverage" (0.995216227).
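The derivation described above can be sketched in a few lines of Python. This is an illustration based only on the example report, not the workflow's actual code. It assumes a tab-separated report, takes the genus as the first underscore-delimited token of species_id, and uses hypothetical function and key names. The 0.01 cutoff is the contamination threshold mentioned earlier in this section.

```python
import csv
import io

# The example report from this section, written as tab-separated text (assumed format).
MIDAS_REPORT = "\n".join([
    "species_id\tcount_reads\tcoverage\trelative_abundance",
    "Salmonella_enterica_58156\t3309\t89.88006645\t0.855888033",
    "Salmonella_enterica_58266\t501\t11.60606061\t0.110519371",
    "Salmonella_enterica_53987\t99\t2.232896237\t0.021262881",
    "Citrobacter_youngae_61659\t46\t0.995216227\t0.009477003",
    "Escherichia_coli_58110\t5\t0.123668877\t0.001177644",
])


def summarize_genera(report_text):
    """Return the primary genus and the first differing (secondary) genus."""
    rows = list(csv.DictReader(io.StringIO(report_text), delimiter="\t"))
    if not rows:
        return None
    # Most abundant species first.
    rows.sort(key=lambda row: float(row["relative_abundance"]), reverse=True)
    primary = rows[0]["species_id"].split("_")[0]
    summary = {
        "primary_genus": primary,
        "secondary_genus": None,
        "secondary_abundance": 0.0,
        "secondary_coverage": 0.0,
    }
    # The secondary genus is the first row whose genus differs from the primary.
    for row in rows[1:]:
        genus = row["species_id"].split("_")[0]
        if genus != primary:
            summary["secondary_genus"] = genus
            summary["secondary_abundance"] = float(row["relative_abundance"])
            summary["secondary_coverage"] = float(row["coverage"])
            break
    return summary


def exceeds_contamination_threshold(summary, threshold=0.01):
    """Flag samples whose secondary genus is above the 1% relative abundance cutoff."""
    return summary["secondary_abundance"] > threshold


summary = summarize_genera(MIDAS_REPORT)
print(summary)
print(exceeds_contamination_threshold(summary))
```

For the example report, the secondary genus (Citrobacter) sits just under the 1% cutoff, so the sample would not be flagged.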
MIDAS Reference Database Overview
The MIDAS reference database is a comprehensive tool for identifying bacterial species in metagenomic and bacterial isolate WGS data. It includes several layers of genomic data, helping detect species abundance and potential contaminants.
Key Components of the MIDAS Database
- Species Groups:
    - MIDAS clusters bacterial genomes based on 96.5% sequence identity, forming over 5,950 species groups from 31,007 genomes. These groups align with the gold-standard species definition (95% ANI), ensuring highly accurate species identification.
- Genomic Data Structure:
    - Marker Genes: Contains 15 universal single-copy genes used to estimate species abundance.
    - Representative Genome: Each species group has a selected representative genome, which minimizes genetic variation and aids in accurate SNP identification.
    - Pan-genome: The database includes clusters of non-redundant genes, with options for multi-level clustering (e.g., 99%, 95%, 90% identity), enabling MIDAS to identify gene content within strains at various clustering thresholds.
- Taxonomic Annotation:
    - Genomes are annotated based on consensus Latin names. Discrepancies in name assignments may occur due to factors like unclassified genomes or genus-level ambiguities.
Using the Default MIDAS Database
TheiaProk and TheiaEuk use the pre-loaded MIDAS database in Terra (see input table for current version) by default for bacterial species detection in metagenomic data, requiring no additional setup.
Create a Custom MIDAS Database
Users can also build their own custom MIDAS database if they want to include specific genomes or configurations. This custom database can replace the default MIDAS database used in Terra. To build a custom MIDAS database, follow the MIDAS GitHub guide on building a custom database. Once the database is built, users can upload it to a Google Cloud Storage bucket or Terra workspace and provide the link to the database in the midas_db input variable.
read_QC_trim Technical Details
Assembly tasks¶
These tasks assemble the reads into a de novo assembly and assess the quality of the assembly.
digger_denovo: De novo Assembly
De novo assembly is undertaken only for samples that have sufficient read quantity and quality, as determined by the screen task's assessment of clean reads.
In this workflow, assembly is performed using the digger_denovo subworkflow, which is a hat tip to the Shovill pipeline. It runs one of three assemblers, SKESA (default), SPAdes, or MEGAHIT, and also performs a number of post-processing steps for assembly polishing and contig filtering. Pilon can optionally be run if call_pilon is set to true. By default, the contig filtering task is run; it removes homopolymers, contigs below a specified length, and contigs with coverage below a specified minimum. This can be turned off by setting run_filter_contigs to false.
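The three contig filters described above can be sketched as follows. This is a minimal illustration and not the workflow's task code. The Contig structure, the function names, and the default thresholds are all hypothetical, and "homopolymer" is interpreted here as a contig made of a single repeated base.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    coverage: float  # mean read depth over the contig (assumed to be known)


def is_homopolymer(sequence):
    """True when the contig consists of one repeated base, e.g. 'AAAAAA'."""
    return len(set(sequence.upper())) == 1


def filter_contigs(contigs, min_length=1000, min_coverage=2.0):
    """Keep contigs that pass all three filters; thresholds are placeholders."""
    kept = []
    for contig in contigs:
        if is_homopolymer(contig.sequence):
            continue  # single-base contig
        if len(contig.sequence) < min_length:
            continue  # too short
        if contig.coverage < min_coverage:
            continue  # too little read support
        kept.append(contig)
    return kept


contigs = [
    Contig("c1", "ACGTACGTAC", 35.0),  # passes every filter
    Contig("c2", "AAAAAAAAAA", 40.0),  # homopolymer
    Contig("c3", "ACG", 50.0),         # below the length cutoff
    Contig("c4", "ACGTTGCAAC", 0.5),   # below the coverage cutoff
]
kept = filter_contigs(contigs, min_length=5, min_coverage=2.0)
print([contig.name for contig in kept])
```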
What is de novo assembly?
De novo assembly is the process or product of attempting to reconstruct a genome from scratch (without prior knowledge of the genome) using sequence reads. Assembly of fungal genomes from short reads will produce multiple contigs per chromosome rather than a single contiguous sequence for each chromosome.
Digger-Denovo Technical Details
Links | |
---|---|
SubWorkflow File | wf_digger_denovo.wdl |
quast: Assembly Quality Assessment
QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50.
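As a reminder of what N50 measures, here is a minimal sketch of the standard definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name is hypothetical and this is not QUAST's own code.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        # The first contig that pushes the running total to half
        # of the assembly length defines the N50.
        if running_total >= half_total:
            return length
    return 0


# Total length is 300, so half is 150; 100 + 80 = 180 reaches it at the 80 bp contig.
print(n50([100, 80, 60, 40, 20]))
```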
QUAST Technical Details
Links | |
---|---|
Task | task_quast.wdl |
Software Source Code | QUAST on GitHub |
Software Documentation | https://quast.sourceforge.net/ |
Original Publication(s) | QUAST: quality assessment tool for genome assemblies |
CG-Pipeline: Assessment of Read Quality and Estimation of Genome Coverage
The cg_pipeline task generates metrics about read quality and estimates the coverage of the genome using the run_assembly_readMetrics.pl script from CG-Pipeline. Genome coverage is estimated for both raw and cleaned reads, using either a user-provided genome_size or the estimated genome length generated by QUAST.
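The coverage estimate is conceptually the total number of sequenced bases divided by the genome length. The sketch below illustrates that arithmetic and the fallback from a user-provided genome_size to the QUAST-estimated length. The function and parameter names are hypothetical, and this is not the CG-Pipeline script itself.

```python
def choose_genome_length(genome_size=None, quast_assembly_length=None):
    """Prefer a user-provided genome size; otherwise use the QUAST estimate."""
    if genome_size is not None:
        return genome_size
    if quast_assembly_length is not None:
        return quast_assembly_length
    raise ValueError("no genome length available")


def estimate_coverage(read_lengths, genome_length):
    """Estimated coverage = total bases sequenced / genome length."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return sum(read_lengths) / genome_length


# 1,000 reads of 150 bp over a 12,500 bp genome: 150,000 / 12,500 = 12x.
genome_length = choose_genome_length(genome_size=None, quast_assembly_length=12_500)
print(estimate_coverage([150] * 1000, genome_length))
```

Because trimming removes bases, the estimate from clean reads is normally no higher than the estimate from raw reads for the same genome length.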
CG-Pipeline Technical Details
The cg_pipeline task is run twice in this workflow: once with raw reads and once with clean reads.
Links | |
---|---|
Task | task_cg_pipeline.wdl |
Software Source Code | CG-Pipeline on GitHub |
Software Documentation | CG-Pipeline on GitHub |
Original Publication(s) | A computational genomics pipeline for prokaryotic sequencing projects |
Organism-agnostic characterization¶
These tasks are performed regardless of the organism and provide quality control and taxonomic assignment.
GAMBIT: Taxon Assignment
GAMBIT determines the taxon of the genome assembly using a k-mer based approach to match the assembly sequence to the closest complete genome in a database, thereby predicting its identity. Sometimes, GAMBIT can confidently designate the organism to the species level. Other times, it is more conservative and assigns it to a higher taxonomic rank.
For additional details regarding the GAMBIT tool and a list of available GAMBIT databases for analysis, please consult the GAMBIT tool documentation.
GAMBIT Technical Details
Links | |
---|---|
Task | task_gambit.wdl |
Software Source Code | GAMBIT on GitHub |
Software Documentation | GAMBIT ReadTheDocs |
Original Publication(s) | GAMBIT (Genomic Approximation Method for Bacterial Identification and Tracking): A methodology to rapidly leverage whole genome sequencing of bacterial isolates for clinical identification |
BUSCO: Assembly Quality Assessment
BUSCO (Benchmarking Universal Single-Copy Orthologs) attempts to quantify the completeness and contamination of an assembly to generate quality assessment metrics. It uses taxa-specific databases containing genes that are all expected to occur in the given taxa, each in a single copy. BUSCO examines the presence or absence of these genes, whether they are fragmented, and whether they are duplicated (which suggests that additional copies came from contaminants).
BUSCO notation
Here is an example of BUSCO notation: C:99.1%[S:98.9%,D:0.2%],F:0.0%,M:0.9%,n:440. There are several abbreviations used in this output:
- Complete (C) - genes are considered "complete" when their lengths are within two standard deviations of the BUSCO group mean length.
- Single-copy (S) - genes that are complete and have only one copy.
- Duplicated (D) - genes that are complete and have more than one copy.
- Fragmented (F) - genes that are only partially recovered.
- Missing (M) - genes that were not recovered at all.
- Number of genes examined (n) - the number of genes examined.
A high-quality assembly will use the appropriate database for the taxa, have high complete (C) and single-copy (S) percentages, and low duplicated (D), fragmented (F), and missing (M) percentages.
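If you need the individual values from a BUSCO notation string, for example to compare them against your own thresholds, a regular expression is enough. The sketch below assumes the string follows exactly the layout of the example above. The function and key names are hypothetical.

```python
import re

# Pattern for strings shaped like C:99.1%[S:98.9%,D:0.2%],F:0.0%,M:0.9%,n:440
BUSCO_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single_copy>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,"
    r"M:(?P<missing>[\d.]+)%,"
    r"n:(?P<n>\d+)"
)


def parse_busco(notation):
    """Split a BUSCO notation string into percentages and the gene count."""
    match = BUSCO_PATTERN.search(notation)
    if match is None:
        raise ValueError(f"not a recognised BUSCO string: {notation!r}")
    fields = match.groupdict()
    result = {key: float(value) for key, value in fields.items() if key != "n"}
    result["n"] = int(fields["n"])
    return result


busco = parse_busco("C:99.1%[S:98.9%,D:0.2%],F:0.0%,M:0.9%,n:440")
print(busco)
```

Note that the complete percentage is the sum of the single-copy and duplicated percentages, which makes a convenient sanity check on parsed values.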
BUSCO Technical Details
Links | |
---|---|
Task | task_busco.wdl |
Software Source Code | BUSCO on GitLab |
Software Documentation | https://busco.ezlab.org/ |
Original Publication(s) | BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs |
qc_check: Check QC Metrics Against User-Defined Thresholds (optional)
The qc_check task compares generated QC metrics against user-defined thresholds for each metric. This task will run if the user provides a qc_check_table TSV file. If all QC metrics meet their thresholds, the qc_check output variable will read QC_PASS. Otherwise, the output will read QC_NA if the task could not proceed, or QC_ALERT followed by a string indicating which metric failed.
The qc_check task applies quality thresholds according to the sample taxon. The sample taxon is taken from the gambit_predicted_taxon value inferred by the GAMBIT module, or can be manually provided by the user using the expected_taxon workflow input.
Formatting the qc_check_table.tsv
- The first column of the qc_check_table lists the organism that the task will assess, and the header of this column must be "taxon".
- Any genus or species can be included as a row of the qc_check_table. However, these taxa must uniquely match the sample taxa, meaning that the file can include multiple species from the same genus (Vibrio_cholerae and Vibrio_vulnificus), but not both a genus row and a species within that genus (Vibrio and Vibrio_cholerae). The taxa should be formatted with the first letter capitalized and underscores in lieu of spaces.
- Each subsequent column indicates a QC metric and lists a threshold for each organism that will be checked. The column names must exactly match expected values, so we highly recommend copying and pasting the header from the template file below as a starting place.
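Before uploading a table, the taxon rules above can be checked locally. The following is a minimal sketch rather than part of the workflow. The function name is hypothetical, and the metric column name `example_metric` is a placeholder rather than a real qc_check column (use the template header for real files).

```python
import csv
import io


def find_taxon_conflicts(table_text):
    """Return species rows whose genus also appears as its own row."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    # The first column header must be exactly "taxon".
    if not reader.fieldnames or reader.fieldnames[0] != "taxon":
        raise ValueError('the first column header must be "taxon"')
    taxa = [row["taxon"] for row in reader]
    # Rows without an underscore are treated as genus-level entries.
    genera = {taxon for taxon in taxa if "_" not in taxon}
    return [
        taxon for taxon in taxa
        if "_" in taxon and taxon.split("_")[0] in genera
    ]


# Two species from one genus is allowed.
valid_table = "taxon\texample_metric\nVibrio_cholerae\t10\nVibrio_vulnificus\t10\n"
# A genus row plus a species within that genus is not.
invalid_table = "taxon\texample_metric\nVibrio\t10\nVibrio_cholerae\t10\n"

print(find_taxon_conflicts(valid_table))
print(find_taxon_conflicts(invalid_table))
```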
Template qc_check_table.tsv files
- TheiaEuk_Illumina_PE_PHB: theiaeuk_qc_check_template.tsv
Example Purposes Only
The QC threshold values shown in the file above are for example purposes only and should not be presumed to be sufficient for every dataset.
qc_check Technical Details
Links | |
---|---|
Task | task_qc_check_phb.wdl |
Organism-specific characterization¶
The TheiaEuk workflow automatically activates taxa-specific tasks after identification of the relevant taxa using GAMBIT. Many of these taxa-specific tasks do not require any additional inputs from the user.
Candidozyma auris (also known as Candida auris)
Two tools are deployed when Candidozyma auris/Candida auris is identified.
Cladetyping: clade determination
A custom GAMBIT database is created using five clade-specific Candidozyma auris reference genomes. Sequences undergo genomic signature comparison against this database, which then enables assignment to one of the five Candidozyma auris clades (Clade I to Clade V) based on sequence similarity and phylogenetic relationships. This integrated approach ensures precise clade assignments, crucial for understanding the genetic diversity and epidemiology of Candidozyma auris.
See more information on the reference information for the five clades below:
Clade | Genome Accession | Assembly Name | Strain | BioSample Accession |
---|---|---|---|---|
Clade I | GCA_002759435.2 | Cand_auris_B8441_V2 | B8441 | SAMN05379624 |
Clade II | GCA_003013715.2 | ASM301371v2 | B11220 | SAMN05379608 |
Clade III | GCA_002775015.1 | Cand_auris_B11221_V1 | B11221 | SAMN05379609 |
Clade IV | GCA_003014415.1 | Cand_auris_B11243 | B11243 | SAMN05379619 |
Clade V | GCA_016809505.1 | ASM1680950v1 | IFRC2087 | SAMN11570381 |
Cauris_Cladetyper Technical Details
Snippy Variants: antifungal resistance detection
To detect mutations that may confer antifungal resistance, Snippy is used to find all variants relative to the clade-specific reference, then these variants are queried for product names associated with resistance.
The genes in which there are known resistance-conferring mutations for this pathogen are:
- FKS1
- ERG11 (lanosterol 14-alpha demethylase)
- FUR1 (uracil phosphoribosyltransferase)
We query the Snippy results to see if any mutations were identified in those genes. By default, we automatically check the following loci (the defaults can be overridden by the user). Mutations are reported next to the search term in the theiaeuk_snippy_variants_hits column; the table below maps each search term to its corresponding gene name:
TheiaEuk Search Term | Corresponding Gene Name |
---|---|
B9J08_005340 | ERG6 |
B9J08_000401 | FLO8 |
B9J08_005343 | Hypothetical protein (PSK74852) |
B9J08_003102 | MEC3 |
B9J08_003737 | ERG3 |
lanosterol.14-alpha.demethylase | ERG11 |
uracil.phosphoribosyltransferase | FUR1 |
FKS1 | FKS1 |
For example, one sample may have the following output for the theiaeuk_snippy_variants_hits column:
lanosterol.14-alpha.demethylase: lanosterol 14-alpha demethylase (missense_variant c.428A>G p.Lys143Arg; C:266 T:0),B9J08_000401: hypothetical protein (stop_gained c.424C>T p.Gln142*; A:70 G:0)
Based on this, we can tell that ERG11 has a missense variant at position 143 (Lysine to Arginine) and B9J08_000401 (which is FLO8) has a stop-gained variant at position 142 (Glutamine to Stop).
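The same reading of the output can be done programmatically. The sketch below is an assumption-laden illustration: the regular expression was written against the single example string above and may not cover every form the column can take (for instance, product names containing parentheses). The function and key names are hypothetical. The search-term-to-gene mapping is copied from the table above.

```python
import re

# Search term to gene name, as listed in the table above.
SEARCH_TERM_TO_GENE = {
    "B9J08_005340": "ERG6",
    "B9J08_000401": "FLO8",
    "B9J08_005343": "Hypothetical protein (PSK74852)",
    "B9J08_003102": "MEC3",
    "B9J08_003737": "ERG3",
    "lanosterol.14-alpha.demethylase": "ERG11",
    "uracil.phosphoribosyltransferase": "FUR1",
    "FKS1": "FKS1",
}

# One hit looks like: term: product (effect c.change p.change; evidence)
HIT_PATTERN = re.compile(
    r"(?P<term>[^,:()]+): (?P<product>[^()]+?) "
    r"\((?P<effect>\S+) (?P<nt_change>c\.\S+) (?P<aa_change>p\.[^;\s]+); "
    r"(?P<evidence>[^)]*)\)"
)


def parse_hits(hits):
    """Split a theiaeuk_snippy_variants_hits string into one dict per variant."""
    parsed = []
    for match in HIT_PATTERN.finditer(hits):
        record = match.groupdict()
        # Fall back to the raw search term when it is not in the default table.
        record["gene"] = SEARCH_TERM_TO_GENE.get(record["term"], record["term"])
        parsed.append(record)
    return parsed


EXAMPLE = (
    "lanosterol.14-alpha.demethylase: lanosterol 14-alpha demethylase "
    "(missense_variant c.428A>G p.Lys143Arg; C:266 T:0),"
    "B9J08_000401: hypothetical protein "
    "(stop_gained c.424C>T p.Gln142*; A:70 G:0)"
)

for hit in parse_hits(EXAMPLE):
    print(hit["gene"], hit["effect"], hit["aa_change"])
```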
Known resistance-conferring mutations for Candidozyma auris
Mutations in these genes that are known to confer resistance are shown below
Snippy Variants Technical Details
Links | |
---|---|
Task | task_snippy_variants.wdl task_snippy_gene_query.wdl |
Software Source Code | Snippy on GitHub |
Software Documentation | Snippy on GitHub |
Candida albicans
When this species is detected by the taxon ID tool, an antifungal resistance detection task is deployed.
Snippy Variants: antifungal resistance detection
To detect mutations that may confer antifungal resistance, Snippy is used to find all variants relative to the clade-specific reference, and these variants are queried for product names associated with resistance.
The genes in which there are known resistance-conferring mutations for this pathogen are:
- ERG11
- GCS1 (FKS1)
- FUR1
- RTA2
We query the Snippy results to see if any mutations were identified in those genes. By default, we automatically check the following loci (the defaults can be overridden by the user). Mutations are reported next to the search term in the theiaeuk_snippy_variants_hits column; the table below maps each search term to its corresponding gene name:
TheiaEuk Search Term | Corresponding Gene Name |
---|---|
ERG11 | ERG11 |
GCS1 | FKS1 |
FUR1 | FUR1 |
RTA2 | RTA2 |
Snippy Variants Technical Details
Links | |
---|---|
Task | task_snippy_variants.wdl task_snippy_gene_query.wdl |
Software Source Code | Snippy on GitHub |
Software Documentation | Snippy on GitHub |
Aspergillus fumigatus
When this species is detected by the taxon ID tool, an antifungal resistance detection task is deployed.
Snippy Variants: antifungal resistance detection
To detect mutations that may confer antifungal resistance, Snippy is used to find all variants relative to the clade-specific reference, and these variants are queried for product names associated with resistance.
The genes in which there are known resistance-conferring mutations for this pathogen are:
- Cyp51A
- HapE
- COX10 (AFUA_4G08340)
We query the Snippy results to see if any mutations were identified in those genes. By default, we automatically check the following loci (the defaults can be overridden by the user). Mutations are reported next to the search term in the theiaeuk_snippy_variants_hits column; the table below maps each search term to its corresponding gene name:
TheiaEuk Search Term | Corresponding Gene Name |
---|---|
Cyp51A | Cyp51A |
HapE | HapE |
AFUA_4G08340 | COX10 |
Snippy Variants Technical Details
Links | |
---|---|
Task | task_snippy_variants.wdl task_snippy_gene_query.wdl |
Software Source Code | Snippy on GitHub |
Software Documentation | Snippy on GitHub |
Cryptococcus neoformans
When this species is detected by the taxon ID tool, an antifungal resistance detection task is deployed.
Snippy Variants: antifungal resistance detection
To detect mutations that may confer antifungal resistance, Snippy is used to find all variants relative to the clade-specific reference, and these variants are queried for product names associated with resistance.
The genes in which there are known resistance-conferring mutations for this pathogen are:
- ERG11 (CNA00300)
We query the Snippy results to see if any mutations were identified in those genes. By default, we automatically check the following loci (the defaults can be overridden by the user). Mutations are reported next to the search term in the theiaeuk_snippy_variants_hits column; the table below maps each search term to its corresponding gene name:
TheiaEuk Search Term | Corresponding Gene Name |
---|---|
CNA00300 | ERG11 |
Snippy Variants Technical Details
Links | |
---|---|
Task | task_snippy_variants.wdl task_snippy_gene_query.wdl |
Software Source Code | Snippy on GitHub |
Software Documentation | Snippy on GitHub |
Outputs¶
Variable | Type | Description |
---|---|---|
amr_results_csv | File | CSV formatted AMR profile |
amr_results_pdf | File | PDF formatted AMR profile |
amr_search_docker | String | Docker image used to run AMR_Search |
amr_search_results | File | JSON formatted AMR profile including BLAST results |
assembler | String | Assembler used in digger_denovo subworkflow |
assembler_version | String | Version of the assembler used in digger_denovo |
assembly_fasta | File | De novo genome assembly in FASTA format |
assembly_length | Int | Length of assembly (total contig length) as determined by QUAST |
bbduk_docker | String | The Docker image for bbduk, which was used to remove the adapters from the sequences |
busco_database | String | BUSCO database used |
busco_docker | String | BUSCO docker image used |
busco_report | File | A plain text summary of the results in BUSCO notation |
busco_results | String | BUSCO results (see relevant toggle in this block) |
busco_version | String | BUSCO software version used |
cg_pipeline_docker | String | Docker image used for running CG-Pipeline on cleaned reads |
cg_pipeline_report | File | TSV file of read metrics from raw reads, including average read length, number of reads, and estimated genome coverage |
cladetyper_annotated_reference | String | The annotated reference file for the identified clade, "None" if no clade was identified |
cladetyper_clade | String | The clade assigned to the input assembly |
cladetyper_docker_image | String | The Docker container used for the task |
cladetyper_gambit_version | String | The version of GAMBIT used for the analysis |
combined_mean_q_clean | Float | Mean quality score for the combined clean reads |
combined_mean_q_raw | Float | Mean quality score for the combined raw reads |
combined_mean_readlength_clean | Float | Mean read length for the combined clean reads |
combined_mean_readlength_raw | Float | Mean read length for the combined raw reads |
contigs_fastg | File | Assembly graph if megahit used for genome assembly |
contigs_gfa | File | Assembly graph output generated by SPAdes (Illumina: PE, SE) or Flye (ONT), used to visualize and evaluate genome assembly results. |
contigs_lastgraph | File | Assembly graph if velvet used for genome assembly |
est_coverage_clean | Float | Estimated coverage calculated from clean reads and genome length |
est_coverage_raw | Float | Estimated coverage calculated from raw reads and genome length |
fastp_html_report | File | The HTML report made with fastp |
fastp_version | String | The version of fastp used |
fastq_scan_clean1_json | File | The JSON file output from fastq-scan containing summary stats about clean forward read quality and length |
fastq_scan_clean2_json | File | The JSON file output from fastq-scan containing summary stats about clean reverse read quality and length |
fastq_scan_num_reads_clean_pairs | String | The number of read pairs after cleaning as calculated by fastq_scan |
fastq_scan_num_reads_clean1 | Int | The number of forward reads after cleaning as calculated by fastq_scan |
fastq_scan_num_reads_clean2 | Int | The number of reverse reads after cleaning as calculated by fastq_scan |
fastq_scan_num_reads_raw_pairs | String | The number of input read pairs as calculated by fastq_scan |
fastq_scan_num_reads_raw1 | Int | The number of input forward reads as calculated by fastq_scan |
fastq_scan_num_reads_raw2 | Int | The number of input reverse reads as calculated by fastq_scan |
fastq_scan_raw1_json | File | The JSON file output from fastq-scan containing summary stats about raw forward read quality and length |
fastq_scan_raw2_json | File | The JSON file output from fastq-scan containing summary stats about raw reverse read quality and length |
fastq_scan_version | String | The version of fastq_scan |
fastqc_clean1_html | File | An HTML file that provides a graphical visualization of clean forward read quality from fastqc to open in an internet browser |
fastqc_clean2_html | File | An HTML file that provides a graphical visualization of clean reverse read quality from fastqc to open in an internet browser |
fastqc_docker | String | The Docker container used for fastqc |
fastqc_num_reads_clean_pairs | String | The number of read pairs after cleaning by fastqc |
fastqc_num_reads_clean1 | Int | The number of forward reads after cleaning by fastqc |
fastqc_num_reads_clean2 | Int | The number of reverse reads after cleaning by fastqc |
fastqc_num_reads_raw_pairs | String | The number of input read pairs by fastqc before cleaning |
fastqc_num_reads_raw1 | Int | The number of input forward reads by fastqc before cleaning |
fastqc_num_reads_raw2 | Int | The number of input reverse reads by fastqc before cleaning |
fastqc_raw1_html | File | An HTML file that provides a graphical visualization of raw forward read quality from fastqc to open in an internet browser |
fastqc_raw2_html | File | An HTML file that provides a graphical visualization of raw reverse read quality from fastqc to open in an internet browser |
fastqc_version | String | Version of fastqc software used |
filtered_contigs_metrics | File | File containing metrics of contigs filtered |
gambit_closest_genomes | File | CSV file listing genomes in the GAMBIT database that are most similar to the query assembly |
gambit_db_version | String | Version of the GAMBIT database used |
gambit_docker | String | GAMBIT Docker used |
gambit_predicted_taxon | String | Taxon predicted by GAMBIT |
gambit_predicted_taxon_rank | String | Taxon rank of GAMBIT taxon prediction |
gambit_report | File | GAMBIT report in a machine-readable format |
gambit_version | String | Version of GAMBIT software used |
n50_value | Int | N50 of assembly calculated by QUAST |
number_contigs | Int | Total number of contigs in assembly |
qc_check | String | A string that indicates whether or not the sample passes a set of pre-determined and user-provided QC thresholds |
qc_standard | File | The file used in the QC Check task containing the QC thresholds. |
quast_gc_percent | Float | The GC percent of your sample |
quast_report | File | TSV report from QUAST |
quast_version | String | The version of QUAST |
r1_mean_q_raw | Float | Mean quality score of raw forward reads |
r1_mean_readlength_raw | Float | Mean read length of raw forward reads |
r2_mean_q_raw | Float | Mean quality score of raw reverse reads |
r2_mean_readlength_clean | Float | Mean read length of clean reverse reads |
rasusa_version | String | Version of RASUSA used for the analysis |
read_screen_clean | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason(s) for failure |
read_screen_clean_tsv | File | Clean read screening report TSV depicting read counts, total read base pairs, and estimated genome length |
read_screen_raw | String | PASS or FAIL result from raw read screening; FAIL accompanied by the reason(s) for failure |
read_screen_raw_tsv | File | Raw read screening report TSV depicting read counts, total read base pairs, and estimated genome length |
read1_clean | File | Forward read file after quality trimming and adapter removal |
read1_subsampled | File | Read1 FASTQ files downsampled to desired coverage |
read2_clean | File | Reverse read file after quality trimming and adapter removal |
read2_subsampled | File | Read2 FASTQ files downsampled to desired coverage |
seq_platform | String | Description of the sequencing methodology used to generate the input read data |
shovill_pe_version | String | Shovill version used |
theiaeuk_illumina_pe_analysis_date | String | Date of TheiaEuk PE workflow execution |
theiaeuk_illumina_pe_version | String | TheiaEuk PE workflow version used |
theiaeuk_snippy_variants_bai | String | BAI file produced by the snippy module |
theiaeuk_snippy_variants_bam | String | BAM file produced by the snippy module |
theiaeuk_snippy_variants_coverage_tsv | String | TSV file containing coverage information for each base in the reference genome |
theiaeuk_snippy_variants_gene_query_results | File | File containing all lines from variants file matching gene query terms |
theiaeuk_snippy_variants_hits | String | String of all variant file entries matching gene query term |
theiaeuk_snippy_variants_num_reads_aligned | String | Number of reads aligned by snippy |
theiaeuk_snippy_variants_num_variants | Int | Number of variants detected by snippy |
theiaeuk_snippy_variants_outdir_tarball | File | Tar compressed file containing full snippy output directory |
theiaeuk_snippy_variants_percent_ref_coverage | String | Percent of reference genome covered by snippy |
theiaeuk_snippy_variants_query | String | The gene query term(s) used to search the variants file |
theiaeuk_snippy_variants_query_check | String | Whether the gene query terms were present in the annotated reference genome file |
theiaeuk_snippy_variants_reference_genome | File | The reference genome used in the alignment and variant calling |
theiaeuk_snippy_variants_results | File | The variants file produced by snippy |
theiaeuk_snippy_variants_summary | File | A file summarizing the variants detected by snippy |
theiaeuk_snippy_variants_version | String | The version of the snippy_variants module being used |
trimmomatic_docker | String | The docker image used for the trimmomatic module in this workflow |
trimmomatic_version | String | The version of Trimmomatic used |