Skip to content

Assembly Fetch

Quick Facts

Workflow Type Applicable Kingdom Last Known Changes Command-line Compatibility Workflow Level
Data Import Any taxa PHB v1.3.0 Yes Sample-level

Assembly_Fetch_PHB

The Assembly_Fetch workflow downloads assemblies from NCBI. This is particularly useful when you need to align reads against a reference genome, for example during a reference-based phylogenetics workflow. This workflow can be run in two ways:

  1. You can provide an accession for the specific assembly that you want to download, and Assembly_Fetch will run only the NCBI genome download task to download this assembly,
  2. You can provide an assembly, and Assembly_Fetch will first use the ReferenceSeeker task to first find the closest reference genome in RefSeq to your query assembly and then will run the NCBI genome download task to download that reference assembly.

Tip

NOTE: If using Assembly_Fetch workflow version 1.3.0 or higher, the call-caching feature of Terra has been DISABLED to ensure that the workflow is run from the beginning and data is downloaded fresh. Call-caching will not be enabled, even if the user checks the box ✅ in the Terra workflow interface.

Inputs

Assembly_Fetch requires the input samplename, and either the accession for a reference genome to download (ncbi_accession) or an assembly that can be used to query RefSeq for the closest reference genome to download (assembly_fasta).

This workflow runs on the sample level.

Terra Task Name Variable Type Description Default Value Terra Status
reference_fetch samplename String Your sample's name Required
reference_fetch assembly_fasta File Assembly FASTA file of your sample Optional
reference_fetch ncbi_accession String NCBI accession passed to the NCBI datasets task to be downloaded. Example: GCF_000006945.2 (Salmonella enterica subsp. enterica, serovar Typhimurium str. LT2 reference genome) Optional
ncbi_datasets_download_genome_accession cpu Int Number of CPUs to allocate to the task 1 Optional
ncbi_datasets_download_genome_accession disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional
ncbi_datasets_download_genome_accession docker String The Docker container to use for the task "us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:14.13.2" Optional
ncbi_datasets_download_genome_accession include_gbff Boolean set to true if you would like the GenBank Flat File (GBFF) file included in the output. It contains nucleotide sequence, metadata, and annotations. FALSE Optional
ncbi_datasets_download_genome_accession include_gff3 Boolean set to true if you would like the Genomic Feature File v3 (GFF3) file included in the output. It contains nucleotide sequence, metadata, and annotations FALSE Optional
ncbi_datasets_download_genome_accession memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
referenceseeker cpu Int Number of CPUs to allocate to the task 4 Optional
referenceseeker disk_size Int Amount of storage (in GB) to allocate to the task 200 Optional
referenceseeker docker String The Docker container to use for the task "us-docker.pkg.dev/general-theiagen/biocontainers/referenceseeker:1.8.0--pyhdfd78af_0" Optional
referenceseeker memory Int Amount of memory/RAM (in GB) to allocate to the task 16 Optional
referenceseeker referenceseeker_ani_threshold Float ANI threshold used to exclude ref genomes when ANI value less than this value. 0.95 Optional
referenceseeker referenceseeker_conserved_dna_threshold Float Conserved DNA threshold used to exclude ref genomes when conserved DNA value is less than this value. 0.69 Optional
referenceseeker referenceseeker_db File Database used by the referenceseeker tool that contains bacterial genomes from RefSeq release 205. Downloaded from referenceseeker GitHub repo. "gs://theiagen-public-files-rp/terra/theiaprok-files/referenceseeker-bacteria-refseq-205.v20210406.tar.gz" Optional
version_capture docker String The Docker container to use for the task "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" Optional
version_capture timezone String Set the time zone to get an accurate date of analysis (uses UTC by default) Optional

Analysis Tasks

ReferenceSeeker (optional) Details
ReferenceSeeker

ReferenceSeeker uses your draft assembly to identify closely related bacterial, viral, fungal, or plasmid genome assemblies in RefSeq.

Databases for use with ReferenceSeeker are as follows, and can be used by pasting the gs uri in double quotation marks " " into the referenceseeker_db optional input:

  • archea: gs://theiagen-public-files-rp/terra/theiaprok-files/referenceseeker-archaea-refseq-205.v20210406.tar.gz
  • bacterial (default): gs://theiagen-public-files-rp/terra/theiaprok-files/referenceseeker-bacteria-refseq-205.v20210406.tar.gz
  • fungi: gs://theiagen-public-files-rp/terra/theiaprok-files/referenceseeker-fungi-refseq-205.v20210406.tar.gz
  • plasmids: gs://theiagen-public-files-rp/terra/theiaprok-files/referenceseeker-plasmids-refseq-205.v20210406.tar.gz
  • viral: gs://theiagen-public-files-rp/terra/theiaprok-files/referenceseeker-viral-refseq-205.v20210406.tar.gz

For ReferenceSeeker to identify a genome, it must meet user-specified thresholds for sequence coverage (referenceseeker_conserved_dna_threshold) and identity (referenceseeker_ani_threshold). The default values for these are set according to community standards (conserved DNA >= 69 % and ANI >= 95 %). A list of closely related genomes is provided in referenceseeker_tsv. The reference genome that ranks highest according to ANI and conserved DNA values is considered the closest match and will be downloaded, with information about this provided in the assembly_fetch_referenceseeker_top_hit_ncbi_accession output.

ReferenceSeeker Technical Details

Links
Task task_referenceseeker.wdl
Software version 1.8.0 ("us-docker.pkg.dev/general-theiagen/biocontainers/referenceseeker:1.8.0--pyhdfd78af_0")
Software Source Code https://github.com/oschwengers/referenceseeker
Software Documentation https://github.com/oschwengers/referenceseeker
Original Publication(s) ReferenceSeeker: rapid determination of appropriate reference genomes
NCBI Datasets Details
NCBI Datasets

The NCBI Datasets task downloads specified assemblies from NCBI using either the virus or genome (for all other genome types) package as appropriate.

NCBI Datasets Technical Details

Links
Task task_ncbi_datasets.wdl
Software version 14.13.2 (us-docker.pkg.dev/general-theiagen/staphb/ncbi-datasets:14.13.2)
Software Source Code https://github.com/ncbi/datasets
Software Documentation https://github.com/ncbi/datasets
Original Publication(s) Not known to be published

Outputs

Variable Type Description
assembly_fetch_analysis_date String Date of assembly download
assembly_fetch_ncbi_datasets_assembly_data_report_json File JSON file containing report about assembly downloaded by Asembly_Fetch
assembly_fetch_ncbi_datasets_assembly_fasta File FASTA file downloaded by Assembly_Fetch
assembly_fetch_ncbi_datasets_docker String Docker file used for NCBI datasets
assembly_fetch_ncbi_datasets_gff File Assembly downloaded by Assembly_Fetch in GFF3 format
assembly_fetch_ncbi_datasets_gff3 File Assembly downloaded by Assembly_Fetch in GFF format
assembly_fetch_ncbi_datasets_version String NCBI datasets version used
assembly_fetch_referenceseeker_database String ReferenceSeeker database used
assembly_fetch_referenceseeker_docker String Docker file used for ReferenceSeeker
assembly_fetch_referenceseeker_top_hit_ncbi_accession String NCBI Accession for the top hit identified by Assembly_Fetch
assembly_fetch_referenceseeker_tsv File TSV file of the top hits between the query genome and the Reference Seeker database
assembly_fetch_referenceseeker_version String ReferenceSeeker version used
assembly_fetch_version String The version of the repository the Assembly Fetch workflow is in

References

ReferenceSeeker: Schwengers O, Hain T, Chakraborty T, Goesmann A. ReferenceSeeker: rapid determination of appropriate reference genomes. J Open Source Softw. 2020 Feb 4;5(46):1994.

NCBI datasets: datasets: NCBI Datasets is an experimental resource for finding and building datasets [Internet]. Github; [cited 2023 Apr 19]. Available from: https://github.com/ncbi/datasets