SRA_Fetch

Quick Facts

| Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level |
|---|---|---|---|---|
| Data Import | Any taxa | PHB v2.2.0 | Yes | Sample-level |

SRA_Fetch_PHB

The SRA_Fetch workflow downloads sequence data from NCBI's Sequence Read Archive (SRA). Given an SRA run accession, it populates a Terra data table with the associated read files.

Read files associated with the SRA run accession provided as input are copied to a Terra-accessible Google bucket. Hyperlinks to those files are shown in the "read1" and "read2" columns of the Terra data table.

Inputs

This workflow runs on the sample level.

| Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
|---|---|---|---|---|---|
| fetch_sra_to_fastq | sra_accession | String | SRA, ENA, or DRA accession number | | Required |
| fetch_sra_to_fastq | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
| fetch_sra_to_fastq | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| fetch_sra_to_fastq | docker_image | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0" | Optional |
| fetch_sra_to_fastq | fastq_dl_options | String | Additional parameters to pass to fastq-dl | "--provider sra" | Optional |
| fetch_sra_to_fastq | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
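
To illustrate how these inputs might map onto a fastq-dl invocation, here is a minimal sketch. The task's actual command is not shown on this page, so the function name `build_fastq_dl_command`, the `outdir` parameter, and the way the inputs are combined are all assumptions. The `--accession`, `--outdir`, and `--cpus` flags are fastq-dl options as we understand them, and `SRR000001` is only a placeholder accession.

```python
import shlex


def build_fastq_dl_command(sra_accession, fastq_dl_options="--provider sra",
                           cpu=2, outdir="."):
    """Assemble a fastq-dl argument list from workflow-style inputs.

    Hypothetical helper: the real task may build its command differently.
    """
    if not sra_accession:
        raise ValueError("sra_accession is required")
    command = [
        "fastq-dl",
        "--accession", sra_accession,
        "--outdir", outdir,
        "--cpus", str(cpu),
    ]
    # fastq_dl_options is a free-form string, so split it shell-style
    # before appending it to the argument list.
    command.extend(shlex.split(fastq_dl_options))
    return command


if __name__ == "__main__":
    # Print the command rather than running it; nothing is downloaded here.
    print(shlex.join(build_fastq_dl_command("SRR000001")))
```

Keeping the command as a list (rather than one string) avoids quoting problems if it is later passed to `subprocess.run`.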

The only required input for the SRA_Fetch workflow is a run accession: an SRA run accession beginning with "SRR", an ENA run accession beginning with "ERR", or a DRA run accession beginning with "DRR".

Please see the NCBI Metadata and Submission Overview for assistance with identifying accessions. Briefly, NCBI-accessioned objects have the following naming scheme:

STUDY SRP#
SAMPLE SRS#
EXPERIMENT SRX#
RUN SRR#
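
The naming scheme above lends itself to a quick check before launching the workflow. The sketch below classifies an accession by its prefix and reports whether it is a run accession. The helper names are hypothetical, and extending the same letter scheme to ENA ("ER…") and DRA ("DR…") study, sample, and experiment accessions is an assumption based on the shared INSDC convention, not something stated on this page.

```python
import re

# Third letter of the prefix encodes the object type (from the scheme above).
OBJECT_TYPES = {"P": "STUDY", "S": "SAMPLE", "X": "EXPERIMENT", "R": "RUN"}
# First letter encodes the archive that issued the accession.
ARCHIVES = {"S": "SRA", "E": "ENA", "D": "DRA"}
ACCESSION_PATTERN = re.compile(r"^([SED])R([PSXR])(\d+)$")


def classify_accession(accession):
    """Return (archive, object_type) for an accession, or None if unrecognised."""
    match = ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    archive_letter, type_letter, _digits = match.groups()
    return ARCHIVES[archive_letter], OBJECT_TYPES[type_letter]


def is_run_accession(accession):
    """True only for run-level accessions (SRR, ERR, or DRR followed by digits)."""
    result = classify_accession(accession)
    return result is not None and result[1] == "RUN"


if __name__ == "__main__":
    for example in ["SRR000001", "SRX000001", "ERR000001", "not-an-accession"]:
        print(example, classify_accession(example), is_run_accession(example))
```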

Outputs

Read data are available either with full base quality scores (SRA Normalized Format) or with simplified quality scores (SRA Lite). The SRA Normalized Format retains full, per-base quality scores, whereas in SRA Lite files every base quality score has been artificially set to either Q30 or Q3. More information about these formats is available in the NCBI SRA documentation.

Because SRA Lite FASTQ files are of limited use, we try to avoid them by selecting SRA directly as the provider (the SRA Lite file is more likely to be the one synced to other repositories). Sometimes, however, downloading these files is unavoidable. To make the user aware of this, a warning column is populated whenever an SRA Lite file is detected.
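
As an illustration of what such a detection could look like, the sketch below flags a FASTQ file whose quality strings contain only the Phred+33 characters for Q30 (`?`) and Q3 (`$`). This is a heuristic written for this page under stated assumptions, not the workflow's actual logic, which is not described here. The function names are hypothetical, and the reader assumes standard four-line FASTQ records.

```python
import gzip

# Phred+33 characters for the two simplified scores: Q30 -> "?", Q3 -> "$".
SIMPLIFIED_QUALITY_CHARS = {chr(30 + 33), chr(3 + 33)}


def looks_like_sra_lite(quality_lines):
    """Return True if every non-empty quality string uses only Q30/Q3 characters."""
    seen_any = False
    for line in quality_lines:
        chars = set(line.strip())
        if not chars:
            continue
        seen_any = True
        if not chars <= SIMPLIFIED_QUALITY_CHARS:
            return False  # a genuine per-base score was found
    return seen_any  # an empty input is not treated as SRA Lite


def quality_lines_from_fastq(path, max_records=1000):
    """Yield the quality line of the first max_records four-line FASTQ records."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index // 4 >= max_records:
                break
            if index % 4 == 3:  # fourth line of each record holds the qualities
                yield line
```

A caller could combine the two as `looks_like_sra_lite(quality_lines_from_fastq(path))`. Sampling only the first records keeps the check fast, at the cost of possibly missing real scores later in the file.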

| Variable | Type | Description | Production Status |
|---|---|---|---|
| read1 | File | File containing the forward reads | Always produced |
| read2 | File | File containing the reverse reads (not available for single-end or ONT data) | Produced only for paired-end data |
| fastq_dl_date | String | The date of download | Always produced |
| fastq_dl_docker | String | The Docker image used | Always produced |
| fastq_dl_metadata | File | File containing metadata for the provided accession, such as submission_accession, library_selection, and instrument_platform, among others | Always produced |
| fastq_dl_version | String | The fastq-dl version used | Always produced |
| fastq_dl_warning | String | Populated if SRA Lite files are detected. These files encode all quality scores as Phred-30 or Phred-3. | Depends on internal workflow logic |
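
The split between read1 and read2 can be sketched as follows. This assumes the downloaded files follow the common `<accession>_1.fastq.gz` / `<accession>_2.fastq.gz` naming for paired-end data and `<accession>.fastq.gz` for single-end data. Both that naming and the helper `assign_reads` are assumptions for illustration, not a description of the workflow's internals.

```python
def assign_reads(accession, filenames):
    """Map downloaded FASTQ filenames to read1/read2 (read2 is None if unpaired)."""
    names = set(filenames)
    forward = f"{accession}_1.fastq.gz"
    reverse = f"{accession}_2.fastq.gz"
    single = f"{accession}.fastq.gz"

    if forward in names and reverse in names:
        # Paired-end data: both columns are populated.
        return {"read1": forward, "read2": reverse}
    if single in names:
        # Single-end or ONT data: only read1 is populated.
        return {"read1": single, "read2": None}
    if forward in names:
        # Only a forward file was found; treat it as unpaired.
        return {"read1": forward, "read2": None}
    raise FileNotFoundError(f"no FASTQ files found for {accession}")
```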

References

This workflow relies on fastq-dl, a very handy bioinformatics tool by Robert A. Petit III.