RASUSA

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level	Dockstore
Genomic Characterization, Standalone	Any taxa	vX.X.X	Yes	Sample-level	RASUSA_PHB

RASUSA_PHB

Use the Rasusa workflow to:

reduce computing resources when samples end up with drastically more data than needed to perform analyses
perform limit of detection (LOD) studies to identify appropriate minimum coverage thresholds required to perform downstream analyses

Call-caching disabled

If using RASUSA_PHB workflow version v2.0.0 or higher, the call-caching feature of Terra has been DISABLED to ensure that the workflow is run from the beginning and data is downloaded fresh. Call-caching will not be enabled, even if the user checks the box ✅ in the Terra workflow interface.

Inputs

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
rasusa_workflow	read1	File	FASTQ file containing read1 sequences		Required
rasusa_workflow	samplename	String	The name of the sample being analyzed		Required
rasusa_task	coverage	Float	The desired coverage to sub-sample the reads to. If --bases is not provided, this option and --genome-size are required	250	Optional
rasusa_task	cpu	Int	Number of CPUs to allocate to the task	4	Optional
rasusa_task	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
rasusa_task	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0	Optional
rasusa_task	fraction_of_reads	Float	Explicitly define the fraction of reads to keep in the subsample; when used, genome size and coverage are ignored; acceptable inputs include whole numbers and decimals, e.g. 50.0 will leave 50% of the reads in the subsample		Optional
rasusa_task	genome_length	String	Input the approximate genome size expected in quotations; this is used to determine the number of bases required to achieve the desired coverage; acceptable metric suffixes include: b, k, m, g, and t for base, kilo, mega, giga, and tera, respectively		Optional
rasusa_task	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
rasusa_task	num_bases	String	Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored		Optional
rasusa_task	num_reads	Int	Explicitly define the number of reads in the subsample; when used, genome size and coverage are ignored		Optional
rasusa_task	seed	Int	Integer "random seed" used by the subsampler; providing the same seed with the same input file(s) produces the exact same subsample in subsequent runs; leave blank for random downsampling		Optional
rasusa_workflow	read2	File	FASTQ file containing read2 sequences		Optional
version_capture	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
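
The relationship between coverage, genome_length, and the number of bases kept can be sketched in a few lines of Python. This is a simplified illustration, not Rasusa's own parser: the helper names `parse_size` and `target_bases` are hypothetical, and the assumption that the target equals coverage multiplied by genome size follows from the genome_length description above.

```python
import re
from decimal import Decimal

# Multipliers for the metric suffixes listed in the inputs table (b, k, m, g, t).
_MULTIPLIERS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}
_SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([kmgt]?)b?", re.IGNORECASE)


def parse_size(text: str) -> int:
    """Convert a size string such as '4.3kb' or '9000' to a number of bases."""
    match = _SIZE_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, suffix = match.groups()
    # Decimal avoids floating-point rounding surprises for inputs like '4.3kb'.
    return int(Decimal(number) * _MULTIPLIERS[suffix.lower()])


def target_bases(coverage: float, genome_length: str) -> int:
    """Bases to keep so that the subsample reaches the requested coverage."""
    return int(Decimal(str(coverage)) * parse_size(genome_length))
```

Under these assumptions, the default coverage of 250 with a genome_length of "4.1MB" corresponds to a target of 1,025,000,000 bases.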

Workflow Tasks

Rasusa: Read Subsampling

Rasusa is a tool to randomly subsample sequencing reads to a specified coverage without assuming that all reads are of equal length, making it especially suitable for long-read data while still being applicable to short-read data.

The Rasusa task supports four mutually exclusive subsampling modes:

Mode	Behavior
`--bases`	Subsample to a target number of bases (e.g. `100M`, `4.3kb`). Overrides coverage.
`--frac`	Subsample to a fraction of the input reads (e.g. `0.5` keeps half). Overrides coverage.
`--num`	Subsample to an explicit number of reads. Overrides coverage.
`--coverage` + `--genome-size`	Default mode. Subsamples to a target depth using an estimated genome length.

If more than one of --bases, --frac, or --num is supplied, the task will fail with a descriptive error. See the Inputs section for the corresponding Terra variable names.
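
The mode selection described above can be sketched as follows. This is an illustration only, not the logic in task_rasusa.wdl: the function name `subsampling_args` is hypothetical, and the mapping of the Terra variables num_bases, fraction_of_reads, and num_reads to `--bases`, `--frac`, and `--num` is an assumption inferred from the input descriptions.

```python
from typing import List, Optional


def subsampling_args(
    coverage: float = 250,
    genome_length: Optional[str] = None,
    num_bases: Optional[str] = None,
    fraction_of_reads: Optional[float] = None,
    num_reads: Optional[int] = None,
) -> List[str]:
    """Return the subsampling flags for exactly one mode, or raise ValueError."""
    # Assumed mapping from Terra variable names to the flags in the mode table.
    overrides = {
        "--bases": num_bases,
        "--frac": fraction_of_reads,
        "--num": num_reads,
    }
    supplied = [(flag, value) for flag, value in overrides.items() if value is not None]
    if len(supplied) > 1:
        names = ", ".join(flag for flag, _ in supplied)
        raise ValueError(f"only one of --bases, --frac, --num may be set; got {names}")
    if supplied:
        # An explicit override takes precedence over coverage.
        flag, value = supplied[0]
        return [flag, str(value)]
    if genome_length is None:
        raise ValueError("coverage mode requires a genome size (genome_length)")
    # Default mode: target depth plus estimated genome length.
    return ["--coverage", str(coverage), "--genome-size", genome_length]
```

For example, `subsampling_args(num_bases="100M")` selects the bases mode, while passing both num_bases and num_reads raises an error instead of silently picking one.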

Non-deterministic output(s)

This task may yield non-deterministic outputs since it performs random subsampling. To ensure reproducibility, set a value for the optional seed input variable.
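
The effect of a seed can be illustrated with a small, simplified model of length-aware subsampling: shuffle the reads, then keep them until the cumulative number of bases reaches the target. This is a sketch of the general idea under that assumption, not Rasusa's implementation, and the function name `subsample_indices` is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def subsample_indices(
    read_lengths: Sequence[int],
    target_bases: int,
    seed: Optional[int] = None,
) -> List[int]:
    """Pick read indices at random until their total length reaches target_bases."""
    # With seed=None the generator is seeded from system entropy, so runs differ.
    rng = random.Random(seed)
    order = list(range(len(read_lengths)))
    rng.shuffle(order)

    kept: List[int] = []
    total = 0
    for index in order:
        if total >= target_bases:
            break
        kept.append(index)
        total += read_lengths[index]  # reads may differ in length
    return sorted(kept)
```

Calling the function twice with the same lengths, target, and seed returns the same indices, which is the behavior the seed input is meant to provide; omitting the seed generally gives a different selection on each run.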

Rasusa Technical Details

	Links
Task	task_rasusa.wdl
Software Source Code	Rasusa on GitHub
Software Documentation	Rasusa on GitHub
Original Publication(s)	Rasusa: Randomly subsample sequencing reads to a specified coverage

Outputs

Variable	Type	Description
rasusa_log	File	Log of Rasusa standard error output
rasusa_version	String	Version of RASUSA used for the analysis
rasusa_wf_analysis_date	String	Date of analysis
rasusa_wf_version	String	Version of PHB used for the analysis
read1_subsampled	File	Read1 FASTQ file downsampled to the desired coverage
read2_subsampled	File	Read2 FASTQ file downsampled to the desired coverage

Don't Forget!

Remember to use the subsampled reads in downstream analyses by setting the read inputs to this.read1_subsampled and this.read2_subsampled.

Verify Downsampling

Confirm that reads were successfully subsampled before running downstream analyses by comparing the subsampled read file size(s) to the original read file size(s).

To view a file's size, click on the read file listed in the Terra data table.
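
For a check outside Terra, read and base counts can be compared directly instead of file sizes. The sketch below is a hypothetical helper, not part of the workflow; it assumes standard four-line FASTQ records and treats any file ending in .gz as gzip-compressed.

```python
import gzip
from pathlib import Path
from typing import Tuple


def fastq_stats(path: str) -> Tuple[int, int]:
    """Return (number of reads, number of bases) for a FASTQ file."""
    # Assumption: a .gz extension means the file is gzip-compressed.
    opener = gzip.open if Path(path).suffix == ".gz" else open
    reads = 0
    bases = 0
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of each four-line record.
            if line_number % 4 == 1:
                reads += 1
                bases += len(line.strip())
    return reads, bases
```

Running `fastq_stats` on the original read file and on the corresponding read1_subsampled file (after downloading both) should show fewer reads and bases in the subsampled file whenever downsampling took place.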

References

Hall, M. B. (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941. https://doi.org/10.21105/joss.03941