SRA_Fetch¶
Quick Facts¶
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
---|---|---|---|---|
Data Import | Any taxa | PHB v2.2.0 | Yes | Sample-level |
SRA_Fetch_PHB¶
The SRA_Fetch
workflow downloads sequence data from NCBI's Sequence Read Archive (SRA). It requires an SRA run accession then populates the associated read files to a Terra data table.
Read files associated with the SRA run accession provided as input are copied to a Terra-accessible Google bucket. Hyperlinks to those files are shown in the "read1" and "read2" columns of the Terra data table.
Inputs¶
The only required input for the SRA_Fetch workflow is an SRA run accession beginning "SRR", an ENA run accession beginning "ERR", or a DRA run accession which beginning "DRR".
Please see the NCBI Metadata and Submission Overview for assistance with identifying accessions. Briefly, NCBI-accessioned objects have the following naming scheme:
STUDY | SRP# |
---|---|
SAMPLE | SRS# |
EXPERIMENT | SRX# |
RUN | SRR# |
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
fetch_sra_to_fastq | sra_accession | String | SRA, ENA, or DRA accession number | Required | |
fetch_sra_to_fastq | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
fetch_sra_to_fastq | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
fetch_sra_to_fastq | docker_image | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0 | Optional |
fetch_sra_to_fastq | fastq_dl_options | String | Additional parameters to pass to fastq_dl from here | "--provider sra" | Optional |
fetch_sra_to_fastq | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
version_capture | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0 | Optional |
version_capture | timezone | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | Optional |
Outputs¶
Read data are available either with full base quality scores (SRA Normalized Format) or with simplified quality scores (SRA Lite). The SRA Normalized Format includes full, per-base quality scores, whereas base quality scores have been simplified in SRA Lite files. This means that all quality scores have been artificially set to Q-30 or Q3. More information about these files can be found here.
Given the lack of usefulness of SRA Lite formatted FASTQ files, we try to avoid these by preferentially searching SRA directly (SRA-Lite is more probably to be the file synced to other repositories), but sometimes downloading these files is unavoidable. To make the user aware of this, a warning column is present that is populated when an SRA-Lite file is detected.
Variable | Type | Description |
---|---|---|
sra_fetch_version | String | The version of the repository the SRA_Fetch workflow is in |
sra_fetch_analysis_date | String | The date the workflow was run |
read1 | File | File containing the forward reads |
read2 | File | File containing the reverse reads (not available for single-end or ONT data) |
fastq_dl_date | String | The date of the read data download |
fastq_dl_docker | String | The docker used |
fastq_dl_metadata | File | File containing metadata of the provided accession such as submission_accession, library_selection, instrument_platform, among others |
fastq_dl_version | String | The version of fastq-dl used |
fastq_dl_warning | String | This warning field is populated if SRA-Lite files are detected. These files contain all quality encoding as Phred-30 or Phred-3. |
References¶
fastq-dl: Petit III, R. A., Hall, M. B., Tonkin-Hill, G., Zhu, J., & Read, T. D. fastq-dl: efficiently download FASTQ files from SRA or ENA repositories (Version 2.0.2) [Computer software]. https://github.com/rpetit3/fastq-dl