RASUSA

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level
Standalone	Any Taxa	PHB v2.0.0	Yes	Sample-level

RASUSA_PHB

RASUSA randomly downsamples raw reads to a user-defined threshold.
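
As a conceptual illustration only (this is not RASUSA's actual implementation), random downsampling to a base threshold can be sketched as shuffling the reads and keeping them until the target is reached. The function name `subsample_indices` and its parameters are hypothetical. The sketch also illustrates two behaviors described on this page: the kept total can slightly overshoot the target, and a fixed seed reproduces the same subsample.

```python
import random


def subsample_indices(read_lengths, target_bases, seed=None):
    """Pick reads in random order until at least target_bases are kept.

    Hypothetical helper for illustration; not part of RASUSA or PHB.
    Returns the sorted indices of the kept reads and their total bases.
    """
    rng = random.Random(seed)  # seed=None gives a different draw each run
    order = list(range(len(read_lengths)))
    rng.shuffle(order)

    kept, total = [], 0
    for index in order:
        if total >= target_bases:
            break
        kept.append(index)
        total += read_lengths[index]
    return sorted(kept), total


if __name__ == "__main__":
    lengths = [150] * 1000  # 1000 reads of 150 bases each
    kept, total = subsample_indices(lengths, target_bases=10_000, seed=42)
    print(len(kept), total)
```

Because whole reads are kept, the total lands on or just above the target rather than exactly on it.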

📋 Use Cases

to reduce computing resources when samples end up with drastically more data than needed to perform analyses
to perform limit of detection (LOD) studies to identify appropriate minimum coverage thresholds required to perform downstream analyses

🔧 Desired size may be specified by inputting any one of the following

coverage (e.g. 20X)
number of bases (e.g. "5m" for 5 megabases)
number of reads (e.g. 100000 total reads)
fraction of reads (e.g. 0.5 samples half the reads)
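
The coverage option depends on a genome length to work out how many bases to keep. A minimal sketch of that arithmetic follows, assuming the metric suffixes are decimal (power-of-1000) multipliers. `parse_size` and `target_bases` are hypothetical helper names and are not part of RASUSA or PHB.

```python
# Assumed decimal multipliers for the metric suffixes (base, kilo, mega, giga, tera).
SUFFIX_MULTIPLIERS = {"b": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(text):
    """Convert a size string such as "5m" or "250" into a number of bases."""
    value = text.strip().lower()
    if not value:
        raise ValueError("empty size string")
    multiplier = 1
    if value[-1] in SUFFIX_MULTIPLIERS:
        multiplier = SUFFIX_MULTIPLIERS[value[-1]]
        value = value[:-1]
    return round(float(value) * multiplier)


def target_bases(coverage, genome_length):
    """Bases needed to reach the requested coverage of the given genome length."""
    return round(coverage * parse_size(genome_length))


if __name__ == "__main__":
    # 20X coverage of a 5-megabase genome
    print(target_bases(20, "5m"))
```

Under these assumptions, 20X coverage of a "5m" genome corresponds to 100,000,000 bases.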

Call-caching disabled

In RASUSA_PHB workflow version v2.0.0 and higher, Terra's call-caching feature is DISABLED so that the workflow always runs from the beginning and data is downloaded fresh. Call-caching stays disabled even if the user checks the box ✅ in the Terra workflow interface.

Inputs

Terra Task Name	Variable	Type	Description	Default Attribute	Terra Status
rasusa_workflow	coverage	Float	Use to specify the desired coverage of reads after downsampling; actual coverage of subsampled reads will not be exact and may be slightly higher; always check the estimated clean coverage after performing downstream workflows to verify coverage values, when necessary		Required
rasusa_workflow	genome_length	String	Input the approximate expected genome size in quotation marks; this is used to determine the number of bases required to achieve the desired coverage; acceptable metric suffixes include: `b`, `k`, `m`, `g`, and `t` for base, kilo, mega, giga, and tera, respectively		Required
rasusa_workflow	read1	File	FASTQ file containing read1 sequences		Required
rasusa_workflow	read2	File	FASTQ file containing read2 sequences		Required
rasusa_workflow	samplename	String	Name of the sample to be analyzed		Required
rasusa_task	bases	String	Explicitly define the number of bases required in the downsampled reads in quotation marks; when used, genome size and coverage are ignored; acceptable metric suffixes include: `b`, `k`, `m`, `g`, and `t` for base, kilo, mega, giga, and tera, respectively		Optional
rasusa_task	cpu	Int	Number of CPUs to allocate to the task	4	Optional
rasusa_task	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
rasusa_task	docker	String	The Docker container to use for the task	"us-docker.pkg.dev/general-theiagen/staphb/rasusa:0.7.0"	Optional
rasusa_task	frac	Float	Explicitly define the fraction of reads to keep in the subsample; when used, genome size and coverage are ignored; acceptable inputs include decimals and whole numbers, e.g. 0.5 or 50.0 will leave 50% of the reads in the subsample		Optional
rasusa_task	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	8	Optional
rasusa_task	num	Int	Explicitly define the number of reads in the subsample; when used, genome size and coverage are ignored; acceptable metric suffixes include: `b`, `k`, `m`, `g`, and `t` for base, kilo, mega, giga, and tera, respectively		Optional
rasusa_task	seed	Int	Use to set the "random seed" used by the subsampler; providing the same seed with the same input file(s) produces the exact same subsample in subsequent runs; leave blank for random downsampling		Optional
version_capture	docker	String	The Docker container to use for the task	"us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0"	Optional
version_capture	timezone	String	Set the time zone to get an accurate date of analysis (uses UTC by default)		Optional
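
For command-line runs, the inputs above are typically supplied as a JSON file. The sketch below builds one in Python. It assumes the usual WDL convention of `workflow.variable` and `workflow.call.variable` keys, and it assumes that the workflow and task call are named `rasusa_workflow` and `rasusa_task` as listed in the table; verify the key names against the workflow itself before use. The sample name and file paths are hypothetical placeholders.

```python
import json


def build_inputs(samplename, read1, read2, coverage, genome_length, seed=None):
    """Assemble an inputs dictionary using assumed WDL-style key names."""
    inputs = {
        "rasusa_workflow.samplename": samplename,
        "rasusa_workflow.read1": read1,
        "rasusa_workflow.read2": read2,
        "rasusa_workflow.coverage": coverage,            # Float
        "rasusa_workflow.genome_length": genome_length,  # String, e.g. "5m"
    }
    if seed is not None:
        # Optional task-level input; key name is an assumption.
        inputs["rasusa_workflow.rasusa_task.seed"] = seed
    return inputs


if __name__ == "__main__":
    example = build_inputs(
        samplename="sample01",               # hypothetical sample name
        read1="sample01_R1.fastq.gz",        # hypothetical path
        read2="sample01_R2.fastq.gz",        # hypothetical path
        coverage=20.0,
        genome_length="5m",
        seed=42,
    )
    print(json.dumps(example, indent=2))
```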

Outputs

Variable	Type	Description
rasusa_version	String	Version of RASUSA used for the analysis
rasusa_wf_analysis_date	String	Date of analysis
rasusa_wf_version	String	Version of PHB used for the analysis
read1_subsampled	File	New read1 FASTQ file downsampled to the desired coverage
read2_subsampled	File	New read2 FASTQ file downsampled to the desired coverage

Don't Forget!

Remember to use the subsampled reads in downstream analyses by setting the read inputs to this.read1_subsampled and this.read2_subsampled.

Verify

Before running downstream analyses, confirm that the reads were successfully subsampled by comparing the subsampled read file sizes to the original read file sizes.

To view a file's size, click the read file listed in the Terra data table.
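
When the files are available locally, counting reads and bases gives a more direct check than file size. The sketch below is one way to do that, assuming standard four-line FASTQ records (no wrapped sequence lines). `fastq_stats` and `estimated_coverage` are hypothetical helper names, not part of the workflow.

```python
import gzip


def fastq_stats(path):
    """Return (read_count, base_count) for a plain or gzipped FASTQ file.

    Assumes four lines per record, with the sequence on the second line.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = 0
    bases = 0
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # sequence line of each record
                reads += 1
                bases += len(line.strip())
    return reads, bases


def estimated_coverage(paths, genome_length_bases):
    """Total bases across the given FASTQ files divided by the genome length."""
    total_bases = sum(fastq_stats(path)[1] for path in paths)
    return total_bases / genome_length_bases
```

Comparing `fastq_stats` for the original and subsampled files shows how many reads were removed, and `estimated_coverage` on the subsampled read1 and read2 files gives a rough value to compare against the requested coverage.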

References

Hall, M. B. (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941. https://doi.org/10.21105/joss.03941