RASUSA

Quick Facts

| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
|---|---|---|---|---|
| Standalone | Any Taxa | PHB v2.0.0 | Yes | Sample-level |

RASUSA_PHB

RASUSA randomly downsamples raw reads to a user-defined threshold.

📋 Use Cases

  • to reduce computing resources when samples end up with drastically more data than needed to perform analyses
  • to perform limit of detection (LOD) studies to identify appropriate minimum coverage thresholds required to perform downstream analyses

🔧 Desired size may be specified by inputting any one of the following

  • coverage (e.g. 20X)
  • number of bases (e.g. "5m" for 5 megabases)
  • number of reads (e.g. 100000 total reads)
  • fraction of reads (e.g. 0.5 samples half the reads)
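
The relationship between the coverage and base-count options can be sketched in a few lines of Python. This is only an illustration of the arithmetic (target bases = coverage × genome length) and of the metric suffixes; `parse_size` and `target_bases` are hypothetical helpers, not part of RASUSA or PHB, and the tool's own parsing rules may differ in detail.

```python
from decimal import Decimal

# Metric suffixes as described for genome size and base counts (assumed case-insensitive here).
SUFFIXES = {"b": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(text: str) -> int:
    """Convert a size string such as "5m" or "4.6m" into a number of bases."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        number, multiplier = text[:-1], SUFFIXES[text[-1]]
    else:
        number, multiplier = text, 1
    # Decimal avoids floating-point surprises for inputs like "4.6m".
    return int(Decimal(number) * multiplier)


def target_bases(coverage: float, genome_length: str) -> int:
    """Number of bases needed to reach the requested coverage."""
    return int(Decimal(str(coverage)) * parse_size(genome_length))


if __name__ == "__main__":
    # 20X coverage of a 5-megabase genome
    print(target_bases(20, "5m"))
```

Under these assumptions, asking for 20X coverage of a "5m" genome is equivalent to asking for "100m" bases directly.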

Call-caching disabled

If using RASUSA_PHB workflow version v2.0.0 or higher, the call-caching feature of Terra has been DISABLED to ensure that the workflow is run from the beginning and data is downloaded fresh. Call-caching will not be enabled, even if the user checks the box ✅ in the Terra workflow interface.

Inputs

| Terra Task Name | Variable | Type | Description | Default Attribute | Terra Status |
|---|---|---|---|---|---|
| rasusa_workflow | coverage | Float | Desired coverage of reads after downsampling. Actual coverage of the subsampled reads will not be exact and may be slightly higher; when necessary, verify by checking the estimated clean coverage reported by downstream workflows | | Required |
| rasusa_workflow | genome_length | String | Approximate expected genome size, in quotation marks; used to determine the number of bases required to achieve the desired coverage. Accepted metric suffixes: b, k, m, g, and t for base, kilo, mega, giga, and tera, respectively | | Required |
| rasusa_workflow | read1 | File | FASTQ file containing read1 sequences | | Required |
| rasusa_workflow | read2 | File | FASTQ file containing read2 sequences | | Required |
| rasusa_workflow | samplename | String | Name of the sample to be analyzed | | Required |
| rasusa_task | bases | String | Number of bases required in the downsampled reads, in quotation marks; when used, genome size and coverage are ignored. Accepted metric suffixes: b, k, m, g, and t for base, kilo, mega, giga, and tera, respectively | | Optional |
| rasusa_task | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| rasusa_task | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| rasusa_task | docker | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/staphb/rasusa:0.7.0" | Optional |
| rasusa_task | frac | Float | Fraction of reads to keep in the subsample; when used, genome size and coverage are ignored. Accepts whole numbers and decimals, e.g. 50.0 will leave 50% of the reads in the subsample | | Optional |
| rasusa_task | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| rasusa_task | num | Int | Number of reads in the subsample; when used, genome size and coverage are ignored. Accepted metric suffixes: b, k, m, g, and t for base, kilo, mega, giga, and tera, respectively | | Optional |
| rasusa_task | seed | Int | Random seed used by the subsampler. Providing the same seed with the same input file(s) reproduces the exact same subsample in subsequent runs; leave unset for random downsampling | | Optional |
| version_capture | docker | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
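
Two behaviours described in the inputs, that a seed makes the subsample reproducible and that the resulting coverage may end up slightly above the target, can be illustrated with a small Python sketch. This is a simplified model of shuffle-then-accumulate subsampling, not RASUSA's actual implementation; `subsample_indices` and its parameters are hypothetical names chosen for the example.

```python
import random


def subsample_indices(read_lengths, target_bases, seed=None):
    """Pick reads in random order until at least target_bases have been collected.

    Returns the sorted indices of the kept reads and the total bases kept.
    """
    rng = random.Random(seed)  # seed=None gives a different shuffle on each run
    order = list(range(len(read_lengths)))
    rng.shuffle(order)

    kept, total = [], 0
    for index in order:
        if total >= target_bases:
            break
        kept.append(index)
        total += read_lengths[index]
    return sorted(kept), total


if __name__ == "__main__":
    lengths = [150] * 1000  # 1000 reads of 150 bases each
    first, total = subsample_indices(lengths, 30_010, seed=42)
    second, _ = subsample_indices(lengths, 30_010, seed=42)
    print(first == second)  # same seed, same subsample
    print(total)            # at least the target, overshooting by less than one read
```

Because whole reads are kept, the total can only land on or just past the target, which is one way to picture why the achieved coverage is not exact.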

Outputs

| Variable | Type | Description |
|---|---|---|
| rasusa_version | String | Version of RASUSA used for the analysis |
| rasusa_wf_analysis_date | String | Date of analysis |
| rasusa_wf_version | String | Version of PHB used for the analysis |
| read1_subsampled | File | New read1 FASTQ file downsampled to the desired coverage |
| read2_subsampled | File | New read2 FASTQ file downsampled to the desired coverage |

Don't Forget!

Remember to use the subsampled reads in downstream analyses by setting the read inputs to this.read1_subsampled and this.read2_subsampled.

Verify

Before running downstream analyses, confirm that the reads were successfully subsampled by comparing the size of each subsampled read file to the size of the original read file.

To view a file's size, click on the read file listed in the Terra data table.
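
For a more direct check than file size, reads and bases can be counted from the FASTQ files themselves. The following Python sketch assumes standard four-line FASTQ records (no wrapped sequences) and plain or gzipped files; `fastq_stats` and `estimated_coverage` are hypothetical helpers written for this illustration and are not part of the workflow.

```python
import gzip


def fastq_stats(path):
    """Return (read_count, base_count) for a plain or gzipped FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = bases = 0
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record holds the sequence
                reads += 1
                bases += len(line.strip())
    return reads, bases


def estimated_coverage(total_bases, genome_length_bp):
    """Approximate coverage as total bases divided by genome length in bases."""
    return total_bases / genome_length_bp
```

Running `fastq_stats` on the original and subsampled files for both read1 and read2, then passing the summed base count to `estimated_coverage`, gives a rough figure to compare against the requested coverage.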

References

Hall, M. B. (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941. https://doi.org/10.21105/joss.03941