RASUSA

Quick Facts

| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore |
|---|---|---|---|---|---|
| Standalone | Any taxa | vX.X.X | Yes | Sample-level | RASUSA_PHB |

RASUSA_PHB

Rasusa: Read Subsampling

Rasusa is a tool to randomly subsample sequencing reads to a specified coverage without assuming that all reads are of equal length, making it especially suitable for long-read data while still being applicable to short-read data.
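The idea of subsampling without assuming equal read lengths can be illustrated with a short sketch. This is a conceptual illustration only, not Rasusa's actual implementation: the function name `subsample_to_bases`, the example read lengths, and the 10,000-base target are all hypothetical. Reads are visited in random order and kept until the running total of bases reaches the target, so a few long reads or many short reads can satisfy the same target.

```python
import random


def subsample_to_bases(read_lengths, target_bases, rng):
    """Pick reads in random order until their combined length reaches target_bases.

    Returns the sorted indices of the kept reads and the total bases kept.
    """
    order = list(range(len(read_lengths)))
    rng.shuffle(order)  # random visiting order, independent of read length

    kept, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        kept.append(idx)
        total += read_lengths[idx]
    return sorted(kept), total


# Hypothetical read lengths typical of mixed long-read data.
lengths = [1200, 300, 15000, 800, 5000, 2500, 9000, 400]
kept, total = subsample_to_bases(lengths, target_bases=10_000, rng=random.Random(1))
print(f"kept {len(kept)} of {len(lengths)} reads, {total} bases")
```

Counting bases rather than reads is what lets the target be met whatever the read-length distribution.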

The Rasusa task supports four mutually exclusive subsampling modes:

| Mode | Behavior |
|---|---|
| --bases | Subsample to a target number of bases (e.g. 100M, 4.3kb). Overrides coverage. |
| --frac | Subsample to a fraction of the input reads (e.g. 0.5 keeps half). Overrides coverage. |
| --num | Subsample to an explicit number of reads. Overrides coverage. |
| --coverage + --genome-size | Default mode. Subsamples to a target depth using an estimated genome length. |

If more than one of --bases, --frac, or --num is supplied, the task will fail with a descriptive error. See the Inputs section for the corresponding Terra variable names.
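The mode-selection rule can be sketched as a small validation function. This is a hedged illustration of the behavior described above, not the logic in task_rasusa.wdl: the function name `select_mode_flags` and its error messages are hypothetical, and only the flags named in this section are used.

```python
def select_mode_flags(bases=None, frac=None, num=None, coverage=None, genome_size=None):
    """Return the Rasusa mode flags implied by the supplied values.

    At most one of bases/frac/num may be set; if none is set, the default
    coverage mode is used and needs both coverage and genome_size.
    """
    overrides = {"--bases": bases, "--frac": frac, "--num": num}
    chosen = [(flag, value) for flag, value in overrides.items() if value is not None]

    if len(chosen) > 1:
        names = ", ".join(flag for flag, _ in chosen)
        raise ValueError(f"Only one of --bases, --frac, --num may be set; got: {names}")

    if chosen:
        # An explicit mode overrides coverage and genome size.
        flag, value = chosen[0]
        return [flag, str(value)]

    if coverage is None or genome_size is None:
        raise ValueError("Default mode needs both --coverage and --genome-size")
    return ["--coverage", str(coverage), "--genome-size", str(genome_size)]


print(select_mode_flags(coverage=250, genome_size="5m"))
print(select_mode_flags(bases="100M", coverage=250, genome_size="5m"))
```

Rejecting conflicting modes up front, rather than silently picking one, matches the fail-fast behavior described for the task.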

Non-deterministic output(s)

This task may yield non-deterministic outputs since it performs random subsampling. To ensure reproducibility, set a value for the rasusa_seed optional input variable.
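The effect of a fixed seed can be demonstrated with Python's standard `random` module. This is an analogy only: Rasusa uses its own random number generator, and the read names, the helper `pick`, and the seed value 42 below are hypothetical.

```python
import random

# Hypothetical read identifiers standing in for a FASTQ file.
reads = [f"read_{i}" for i in range(100)]


def pick(seed=None, k=10):
    """Draw k reads; a fixed seed makes the draw repeatable."""
    return random.Random(seed).sample(reads, k)


same_a = pick(seed=42)
same_b = pick(seed=42)
print(same_a == same_b)  # True: identical seed, identical subsample

unseeded = pick()  # no seed: the selection can differ on every run
```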

Rasusa Technical Details

| | Links |
|---|---|
| Task | task_rasusa.wdl |
| Software Source Code | Rasusa on GitHub |
| Software Documentation | Rasusa on GitHub |
| Original Publication(s) | Rasusa: Randomly subsample sequencing reads to a specified coverage |

📋 Use Cases

  • to reduce computing resources when samples end up with drastically more data than needed to perform analyses
  • to perform limit of detection (LOD) studies to identify appropriate minimum coverage thresholds required to perform downstream analyses

Call-caching disabled

If using RASUSA_PHB workflow version v2.0.0 or higher, the call-caching feature of Terra has been DISABLED to ensure that the workflow is run from the beginning and data is downloaded fresh. Call-caching will not be enabled, even if the user checks the box ✅ in the Terra workflow interface.

Inputs

| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| rasusa_workflow | read1 | File | FASTQ file containing read1 sequences | | Required |
| rasusa_workflow | samplename | String | The name of the sample being analyzed | | Required |
| rasusa_task | coverage | Float | The desired coverage to subsample the reads to. If --bases is not provided, this option and --genome-size are required | 250 | Optional |
| rasusa_task | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| rasusa_task | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| rasusa_task | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 | Optional |
| rasusa_task | fraction_of_reads | Float | Explicitly define the fraction of reads to keep in the subsample; when used, genome size and coverage are ignored; acceptable inputs include whole numbers and decimals, e.g. 50.0 will leave 50% of the reads in the subsample | | Optional |
| rasusa_task | genome_length | String | Approximate expected genome size, entered in quotation marks; used to determine the number of bases required to achieve the desired coverage; acceptable metric suffixes include: b, k, m, g, and t for base, kilo, mega, giga, and tera, respectively | | Optional |
| rasusa_task | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| rasusa_task | num_bases | String | Explicitly set the number of bases required, e.g. 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored | | Optional |
| rasusa_task | num_reads | Int | Explicitly define the number of reads in the subsample; when used, genome size and coverage are ignored; acceptable metric suffixes include: b, k, m, g, and t for base, kilo, mega, giga, and tera, respectively | | Optional |
| rasusa_task | seed | Int | Integer seed for the random subsampler; providing the same seed with the same input file(s) reproduces the exact same subsample in subsequent runs. Leave empty for a different random subsample on each run | | Optional |
| rasusa_workflow | read2 | File | FASTQ file containing read2 sequences | | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
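In the default mode, the number of bases to keep follows from the coverage and genome length inputs: target bases = coverage × genome length. The sketch below works through that arithmetic with the default coverage of 250 and a hypothetical genome length of "5m". The helper names `parse_size` and `target_bases` are illustrative, and the suffix handling only approximates the formats listed in the table; it is not Rasusa's own parser.

```python
import re
from decimal import Decimal

# Metric suffixes described in the inputs table (case-insensitive).
_MULTIPLIERS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([kmgt]?)b?\s*$", re.IGNORECASE)


def parse_size(text):
    """Convert strings such as '5m', '4.3kb', or '9000' to a number of bases."""
    match = _SIZE_RE.match(str(text))
    if match is None:
        raise ValueError(f"Unrecognised size: {text!r}")
    number, suffix = match.groups()
    # Decimal avoids floating-point rounding for inputs like '4.3kb'.
    return int(Decimal(number) * _MULTIPLIERS[suffix.lower()])


def target_bases(coverage, genome_length):
    """Bases needed to reach the requested depth over the whole genome."""
    return round(coverage * parse_size(genome_length))


print(target_bases(250, "5m"))  # 1250000000
```

For a 5 Mb genome, the default coverage of 250 therefore corresponds to roughly 1.25 Gb of sequence.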

Outputs

| Variable | Type | Description |
|---|---|---|
| rasusa_log | File | Log of Rasusa standard error output |
| rasusa_version | String | Version of RASUSA used for the analysis |
| rasusa_wf_analysis_date | String | Date of analysis |
| rasusa_wf_version | String | Version of PHB used for the analysis |
| read1_subsampled | File | Read1 FASTQ files downsampled to desired coverage |
| read2_subsampled | File | Read2 FASTQ files downsampled to desired coverage |

Don't Forget!

Remember to use the subsampled reads in downstream analyses by setting the read inputs to this.read1_subsampled and this.read2_subsampled.

Verify

Before running downstream analyses, confirm that the reads were subsampled by comparing the size of each subsampled read file with that of the original read file.

To view a file's size, click the read file listed in the Terra data table.
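For a more direct check than file size, read and base counts can be compared outside Terra after downloading the files. The sketch below is one possible approach, assuming standard four-line FASTQ records; the function name `fastq_stats` and the file names in the usage comment are hypothetical.

```python
import gzip


def fastq_stats(path):
    """Count reads and bases in a FASTQ file (plain or gzip-compressed).

    Assumes each record spans exactly four lines, with the sequence on the second.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = bases = 0
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # sequence line of each record
                reads += 1
                bases += len(line.strip())
    return reads, bases


# Example usage with hypothetical file names:
# original = fastq_stats("sample_R1.fastq.gz")
# subsampled = fastq_stats("sample_R1_subsampled.fastq.gz")
# print(original, subsampled)
```

Dividing the subsampled base count by the expected genome length gives an estimate of the achieved coverage.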

References

Hall, M. B. (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941. https://doi.org/10.21105/joss.03941