RASUSA
Quick Facts
| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore |
|---|---|---|---|---|---|
| Standalone | Any taxa | vX.X.X | Yes | Sample-level | RASUSA_PHB |
RASUSA_PHB
Rasusa: Read Subsampling
Rasusa is a tool to randomly subsample sequencing reads to a specified coverage without assuming that all reads are of equal length, making it especially suitable for long-read data while still being applicable to short-read data.
The Rasusa task supports four mutually exclusive subsampling modes:
| Mode | Behavior |
|---|---|
| --bases | Subsample to a target number of bases (e.g. 100M, 4.3kb). Overrides coverage. |
| --frac | Subsample to a fraction of the input reads (e.g. 0.5 keeps half). Overrides coverage. |
| --num | Subsample to an explicit number of reads. Overrides coverage. |
| --coverage + --genome-size | Default mode. Subsamples to a target depth using an estimated genome length. |
If more than one of --bases, --frac, or --num is supplied, the task will fail with a descriptive error. See the Inputs section for the corresponding Terra variable names.
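The mode-selection rule described above can be sketched in Python. This is only an illustration of the stated behavior, not the task's actual implementation: the function name `select_mode_flags` and its error messages are hypothetical, while the parameter names mirror the Terra inputs listed in the Inputs section.

```python
from typing import List, Optional


def select_mode_flags(
    coverage: float = 250.0,
    genome_length: Optional[str] = None,
    num_bases: Optional[str] = None,
    fraction_of_reads: Optional[float] = None,
    num_reads: Optional[int] = None,
) -> List[str]:
    """Return the Rasusa mode flags implied by the inputs (hypothetical helper)."""
    # The three override modes; each one replaces coverage-based subsampling.
    overrides = {
        "--bases": num_bases,
        "--frac": fraction_of_reads,
        "--num": num_reads,
    }
    chosen = [(flag, value) for flag, value in overrides.items() if value is not None]

    # Supplying more than one override is an error, as described above.
    if len(chosen) > 1:
        names = ", ".join(flag for flag, _ in chosen)
        raise ValueError(f"Only one of --bases, --frac, --num may be set; got {names}")

    if chosen:
        flag, value = chosen[0]
        return [flag, str(value)]

    # Default mode: coverage plus an estimated genome length.
    if genome_length is None:
        raise ValueError("Coverage mode needs a genome length")
    return ["--coverage", str(coverage), "--genome-size", genome_length]
```

For example, `select_mode_flags(fraction_of_reads=0.5)` selects the fraction mode, while passing both `num_bases` and `num_reads` raises an error.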
Non-deterministic output(s)
This task may yield non-deterministic outputs since it performs random subsampling. To ensure reproducibility, set a value for the optional seed input of rasusa_task (see the Inputs section).
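The effect of a fixed seed can be illustrated with a small Python sketch. It is a simplified stand-in, not Rasusa's own algorithm: reads are shuffled with a seeded generator and kept until a target number of bases is reached. The function name `subsample_to_bases` and the `(name, length)` read representation are assumptions made for this example.

```python
import random
from typing import List, Optional, Tuple

Read = Tuple[str, int]  # (read name, read length in bases)


def subsample_to_bases(
    reads: List[Read], target_bases: int, seed: Optional[int] = None
) -> List[str]:
    """Randomly keep reads until their combined length reaches target_bases."""
    rng = random.Random(seed)  # seed=None gives a different shuffle on each run
    order = list(range(len(reads)))
    rng.shuffle(order)

    kept: List[str] = []
    total = 0
    for index in order:
        if total >= target_bases:
            break
        name, length = reads[index]
        kept.append(name)
        total += length  # lengths may differ, so bases are counted, not reads
    return kept


# Toy input with variable read lengths, as is typical for long-read data.
reads = [(f"read{i}", 100 + 10 * i) for i in range(50)]
first = subsample_to_bases(reads, target_bases=2000, seed=42)
second = subsample_to_bases(reads, target_bases=2000, seed=42)
print(first == second)  # the same seed and input give the same subsample
```

Counting bases rather than reads is what lets this approach work when read lengths vary.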
Rasusa Technical Details
| Links | |
|---|---|
| Task | task_rasusa.wdl |
| Software Source Code | Rasusa on GitHub |
| Software Documentation | Rasusa on GitHub |
| Original Publication(s) | Rasusa: Randomly subsample sequencing reads to a specified coverage |
📋 Use Cases
- to reduce computing resources when samples end up with drastically more data than needed to perform analyses
- to perform limit of detection (LOD) studies to identify appropriate minimum coverage thresholds required to perform downstream analyses
Call-caching disabled
If using RASUSA_PHB workflow version v2.0.0 or higher, the call-caching feature of Terra has been DISABLED to ensure that the workflow is run from the beginning and data is downloaded fresh. Call-caching will not be enabled, even if the user checks the box ✅ in the Terra workflow interface.
Inputs
| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| rasusa_workflow | read1 | File | FASTQ file containing read1 sequences | | Required |
| rasusa_workflow | samplename | String | The name of the sample being analyzed | | Required |
| rasusa_task | coverage | Float | The desired coverage to subsample the reads to. Used together with --genome-size unless --bases, --frac, or --num is provided | 250 | Optional |
| rasusa_task | cpu | Int | Number of CPUs to allocate to the task | 4 | Optional |
| rasusa_task | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| rasusa_task | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/rasusa:2.1.0 | Optional |
| rasusa_task | fraction_of_reads | Float | Explicitly define the fraction of reads to keep in the subsample; when used, genome size and coverage are ignored; acceptable inputs include whole numbers and decimals, e.g. 50.0 will leave 50% of the reads in the subsample | | Optional |
| rasusa_task | genome_length | String | Input the approximate genome size expected in quotations; this is used to determine the number of bases required to achieve the desired coverage; acceptable metric suffixes include: b, k, m, g, and t for base, kilo, mega, giga, and tera, respectively | | Optional |
| rasusa_task | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| rasusa_task | num_bases | String | Explicitly set the number of bases required, e.g. 4.3kb, 7Tb, 9000, 4.1MB. If this option is given, --coverage and --genome-size are ignored | | Optional |
| rasusa_task | num_reads | Int | Explicitly define the number of reads in the subsample, given as a whole number; when used, genome size and coverage are ignored | | Optional |
| rasusa_task | seed | Int | Integer random seed used by the subsampler; providing the same seed with the same input file(s) reproduces the exact same subsample in subsequent runs; leave unset for random subsampling | | Optional |
| rasusa_workflow | read2 | File | FASTQ file containing read2 sequences | | Optional |
| version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
| version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
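The relationship between coverage, genome_length, and the number of bases to keep is simple arithmetic: target bases = coverage × genome size. The Python sketch below works through it with a deliberately simplified suffix parser. The helper names `parse_size` and `target_bases` are hypothetical, and Rasusa's own parser may accept forms this one does not.

```python
import re

# Metric suffixes described for genome_length: k, m, g, t (a bare number or "b" means bases).
_MULTIPLIERS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(text: str) -> int:
    """Convert a size such as '5m', '4.3kb' or '9000' into a number of bases."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmgt]?)b?\s*", text.lower())
    if match is None:
        raise ValueError(f"Unrecognised size: {text!r}")
    number, suffix = match.groups()
    return round(float(number) * _MULTIPLIERS[suffix])


def target_bases(coverage: float, genome_length: str) -> int:
    """Number of bases needed to reach the requested coverage."""
    return round(coverage * parse_size(genome_length))


# With the default coverage of 250 and a 5 Mb genome:
print(target_bases(250, "5m"))  # 250 x 5,000,000 = 1250000000
```

Under these assumptions, a sample with fewer than 1.25 billion bases would already be below 250x coverage of a 5 Mb genome.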
Outputs
| Variable | Type | Description |
|---|---|---|
| rasusa_log | File | Log of Rasusa standard error output |
| rasusa_version | String | Version of RASUSA used for the analysis |
| rasusa_wf_analysis_date | String | Date of analysis |
| rasusa_wf_version | String | Version of PHB used for the analysis |
| read1_subsampled | File | Read1 FASTQ files downsampled to desired coverage |
| read2_subsampled | File | Read2 FASTQ files downsampled to desired coverage |
Don't Forget!
Remember to use the subsampled reads in downstream analyses by setting the read inputs to this.read1_subsampled and this.read2_subsampled.
Verify
Confirm that reads were successfully subsampled before running downstream analyses by comparing the subsampled read file sizes to the original read file sizes.
To view a file's size, click on the read file listed in the Terra data table.
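For a more direct check than file size, read and base counts can be compared before and after subsampling once the files are available locally. The following Python sketch assumes standard four-line FASTQ records (no wrapped sequence lines) and treats files ending in .gz as gzip-compressed. The function name `fastq_stats` is hypothetical and is not part of the workflow.

```python
import gzip
from pathlib import Path
from typing import Tuple


def fastq_stats(path: str) -> Tuple[int, int]:
    """Return (number of reads, total bases) for a four-line-per-record FASTQ file."""
    opener = gzip.open if Path(path).suffix == ".gz" else open
    reads = 0
    bases = 0
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of every four-line record.
            if line_number % 4 == 1:
                reads += 1
                bases += len(line.strip())
    return reads, bases
```

Calling `fastq_stats` on the original file and on the subsampled file should show fewer reads and bases in the latter; dividing the subsampled base count by the expected genome size gives a rough estimate of the achieved coverage.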
References
Hall, M. B. (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941. https://doi.org/10.21105/joss.03941