ONT_Barcode_Concatenation

Quick Facts

Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore
Data Import | Any taxa | vX.X.X | No | | ONT_Barcode_Concatenation_PHB

ONT_Barcode_Concatenation_PHB

Unconcatenated ONT data is the bane of all humanity. This workflow will automatically concatenate all reads in a given folder and upload those reads to a Terra data table.

We recommend running this workflow with "Run workflow with inputs defined by file paths" selected in Terra. This will allow you to upload your data files and provide the necessary information for the workflow without having to specify a data table. There are no outputs for this workflow, as the data is added to either a new or existing table in your workspace.

Barcodes Must Be In Nested Directories

This workflow anticipates that all reads associated with a barcode are located in their own subdirectories while the input_bucket_path points to the parent folder containing all the barcodes that are to be processed.

How does directory structure impact my output?

If you have the following directory structure:

output_bucket_path/
input_bucket_path/
├── barcode01/
│   ├── ABC123_pass_barcode01_123abc_789xyz_0.fastq.gz
│   ├── ...
│   └── ABC123_pass_barcode01_123abc_789xyz_XXX.fastq.gz
├── barcode02/
│   ├── ABC123_pass_barcode02_123abc_789xyz_0.fastq.gz
│   ├── ...
│   └── ABC123_pass_barcode02_123abc_789xyz_XXX.fastq.gz
├── barcodeXXX/
│   └── ...
├── random_name/
│   └── ...
├── ABC123_these_files_will_be_ignored_0.fastq.gz
└── ABC123_these_files_will_be_ignored_1.fastq.gz

The input_bucket_path would point to gs://input_bucket_path/, and the workflow would automatically find and concatenate all reads within each subdirectory (both barcode*/ and random_name/). Instructions for uploading your files in this structure are given below. Please note: reads located directly in the parent directory (e.g., ABC123_these_files_will_be_ignored_0.fastq.gz and ABC123_these_files_will_be_ignored_1.fastq.gz) will be ignored.
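The grouping rule described above can be sketched in a few lines of Python. This is an illustration only, not the workflow's own code: the function names are hypothetical, the paths are written relative to the bucket (without gs://), and files nested more than one folder deep are left out because the text does not say how they are handled.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_reads_by_folder(object_paths, input_prefix, file_extension=".fastq.gz"):
    """Group read files by the subdirectory directly under input_prefix.

    Files sitting directly in input_prefix are skipped, mirroring the
    behaviour described in the text. Hypothetical helper for illustration.
    """
    prefix = PurePosixPath(input_prefix)
    groups = defaultdict(list)
    for path in object_paths:
        if not path.endswith(file_extension):
            continue  # not a read file with the expected extension
        relative = PurePosixPath(path).relative_to(prefix)
        if len(relative.parts) != 2:
            # 1 part: file in the parent directory (ignored);
            # 3+ parts: deeper nesting, not covered by this sketch
            continue
        groups[relative.parts[0]].append(path)
    return {folder: sorted(files) for folder, files in groups.items()}


def planned_output_name(folder):
    """Name of the concatenated file for a folder, as shown in the text."""
    return f"{folder}.all.fastq.gz"
```

Calling group_reads_by_folder on a listing of the example tree would return one entry per subdirectory (barcode01, barcode02, random_name, ...) and no entry for the two files in the parent directory.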

After concatenation, the resulting reads will appear in the gs://output_bucket_path under the following names:

output_bucket_path/
├── previously_concatenated_sample.all.fastq.gz
├── barcode01.all.fastq.gz
├── barcode02.all.fastq.gz
├── ...
├── barcodeXXX.all.fastq.gz
└── random_name.all.fastq.gz
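For readers wondering what concatenating .fastq.gz files involves: gzip files can be appended byte-for-byte, and the result is still a valid gzip stream that decompresses to the combined contents. The sketch below demonstrates that property with Python's standard library. It is not taken from the workflow, and the function name is hypothetical.

```python
import shutil


def concatenate_gzip_files(parts, destination):
    """Append gzip files byte-for-byte into destination.

    The result is a multi-member gzip file, which standard tools
    (including Python's gzip module) read as one continuous stream.
    Hypothetical helper for illustration.
    """
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

No decompression or recompression is needed, which keeps the operation cheap even for large runs.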

All data in the output_bucket_path will appear in the specified Terra table under the read1 column. The sample name is taken from the text before the .all.fastq.gz suffix, which is the name of the folder the data was found in. A file that was added to the table by an earlier run but is still in the output_bucket_path will be added again.

<terra_table_name>_id | read1
previously_concatenated_sample | previously_concatenated_sample.all.fastq.gz
barcode01 | barcode01.all.fastq.gz
barcode02 | barcode02.all.fastq.gz
... | ...
barcodeXXX | barcodeXXX.all.fastq.gz
random_name | random_name.all.fastq.gz
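The naming rule behind this table (sample name = everything before the .all.fastq.gz suffix) could be expressed as follows. This is a minimal sketch with hypothetical function names; it only mirrors the rule stated in the text and does not reproduce how the workflow builds the table.

```python
SUFFIX = ".all.fastq.gz"


def sample_name_from_output(filename):
    """Return the text before the .all.fastq.gz suffix (hypothetical helper)."""
    basename = filename.rsplit("/", 1)[-1]  # tolerate a full object path
    if not basename.endswith(SUFFIX):
        raise ValueError(f"{filename!r} does not end with {SUFFIX}")
    return basename[: -len(SUFFIX)]


def table_rows(filenames):
    """Pair each concatenated file with its sample name: (id, read1)."""
    return [(sample_name_from_output(name), name) for name in filenames]
```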

If a barcode_renaming_file is used (see relevant section below) that maps random_name to my_special_sample, the random_name sample will not appear in the table or in the output_bucket_path, and my_special_sample will appear instead.

<terra_table_name>_id | read1
my_special_sample | my_special_sample.all.fastq.gz

Inputs

Uploading unconcatenated ONT reads to Terra and finding the input_bucket_path

Using the Terra data uploader is not recommended.

The following method is recommended for data upload:
  1. Navigate to your Terra workspace's Dashboard page and click on "Open bucket in browser" under the "Cloud Information" toggle on the right hand side.

    Figure: Open the Google Bucket

  2. Click on the uploads folder.

    Figure: Enter the uploads folder

  3. Click on "Create folder". Give the folder a unique name that identifies your run or group of data, then click on "Create".

    Figure: Create a new folder

  4. Navigate into the newly created folder by clicking on it. You can now drag and drop entire barcode directories into the browser window showing your new folder. This uploads the data directly into your Terra workspace bucket.

    Figure: Upload your barcode directories

    When your files are uploaded, you should see them appear.

    Figure: Uploaded files

  5. Once your files are uploaded, you can identify the input_bucket_path by clicking on the two squares next to the file path at the top of the screen, shown below. When pasting this into the workflow inputs, you will need to add the gs:// prefix.

    Figure: Copy the file path
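The last step, adding the gs:// prefix to the copied path, is easy to get wrong by hand. A small helper such as the one below could normalise the value before it is pasted into the workflow inputs. It is only a sketch: the function name is hypothetical, and it assumes the copied text has the form bucket-name/uploads/your-folder.

```python
def to_gs_uri(copied_path):
    """Add the gs:// prefix to a copied bucket path if it is missing.

    Assumes the input looks like "bucket-name/uploads/run_folder";
    hypothetical helper for illustration.
    """
    path = copied_path.strip()
    if not path:
        raise ValueError("empty path")
    if path.startswith("gs://"):
        return path  # already a full URI, leave untouched
    return "gs://" + path.lstrip("/")
```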

Finding the output_bucket_path

It is recommended to also create a new folder using the method described above for your output_bucket_path. No files should be uploaded to it.

CAUTION! Be careful when reusing output_bucket_path

Currently, the workflow adds every file found in the output_bucket_path to the specified Terra table. If the output_bucket_path is reused, all files in it, including those from earlier runs, will be re-added to Terra and their upload_date values will be overwritten.
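The effect of reusing an output folder can be illustrated with a toy model in which the Terra table is a dictionary keyed by sample name. This is a sketch under stated assumptions, not the workflow's implementation: the function name is hypothetical, and the date is passed in as a plain string because the actual format of upload_date is not given here.

```python
SUFFIX = ".all.fastq.gz"


def add_outputs_to_table(table, output_filenames, run_date):
    """Add every file in the output folder to the table, overwriting rows.

    table maps sample name -> row dict. Hypothetical model for illustration.
    """
    for name in output_filenames:
        if not name.endswith(SUFFIX):
            continue
        sample = name[: -len(SUFFIX)]
        # An existing row for this sample is replaced, so its
        # upload_date reflects the latest run, not the first one.
        table[sample] = {
            "read1": name,
            "table_created_by": "ONT_Barcode_Concatenation_PHB",
            "upload_date": run_date,
        }
    return table
```

Running it twice against the same folder, first with one file and later with two, leaves both samples stamped with the second run's date, which is the behaviour the caution above warns about.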

Creating a barcode_renaming_file

By default, each concatenated file takes the name of the folder that contained the unconcatenated files. If specific sample names correspond to each folder name, you can use a barcode_renaming_file to set the names of the concatenated files.

This file takes the following tab-delimited format. Do not include a header.

barcode01   sample01
barcode02   sample02

The first column is the name of the folder and the second column is the desired sample name.
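Before uploading, it can help to check that the file really is two tab-separated columns with no repeated folder names. The sketch below parses the format described above and shows how a mapping would change the output file name. The function names are hypothetical, and the duplicate check is an extra precaution assumed here, not a documented requirement.

```python
def load_renaming_map(text):
    """Parse headerless, tab-delimited "folder<TAB>sample" lines.

    Hypothetical helper for illustration; raises ValueError on
    malformed lines or repeated folder names.
    """
    mapping = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated columns")
        folder, sample = (field.strip() for field in fields)
        if folder in mapping:
            raise ValueError(f"line {line_number}: folder {folder!r} listed twice")
        mapping[folder] = sample
    return mapping


def output_name(folder, mapping):
    """Concatenated file name for a folder, using the mapping when present."""
    return mapping.get(folder, folder) + ".all.fastq.gz"
```

For example, with a mapping of random_name to my_special_sample, output_name("random_name", mapping) gives my_special_sample.all.fastq.gz, while folders missing from the mapping keep their own names.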

Upload this file to your Terra bucket using either the Data Uploader or the file icon on the right sidebar. Copy the file path into the barcode_renaming_file variable, and your files will be renamed accordingly.

Terra Task Name | Variable | Type | Description | Default Value | Terra Status
ont_barcode_concatenation | input_bucket_path | String | The full path to your unconcatenated data's Google bucket folder location, including the gs://; can be easily copied by right-clicking and copying the link address in the header after navigating to the folder in the "Files" section of the "Data" tab on Terra (see above for examples) | | Required
ont_barcode_concatenation | output_bucket_path | String | The full path to where you want the concatenated data to be stored as a Google bucket folder location, including the gs://; can be easily copied by right-clicking and copying the link address in the header after navigating to the desired folder in the "Files" section of the "Data" tab on Terra (see above for examples) | | Required
ont_barcode_concatenation | terra_project | String | The name of the Terra project where your data table will be located | | Required
ont_barcode_concatenation | terra_table_name | String | The name of the Terra table you want to add your newly concatenated samples to | | Required
ont_barcode_concatenation | terra_workspace | String | The name of the Terra workspace where your data table will be located | | Required
cat_ont_barcodes | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional
cat_ont_barcodes | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional
cat_ont_barcodes | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ont-barcodes:0.0.2 | Optional
cat_ont_barcodes | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional
create_terra_table | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional
create_terra_table | disk_size | Int | Amount of storage (in GB) to allocate to the task | 25 | Optional
create_terra_table | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21 | Optional
create_terra_table | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional
ont_barcode_concatenation | barcode_renaming_file | File | A tab-delimited file that maps the names of the barcode folders to the desired sample names in the Terra table (see above for examples) | | Optional
ont_barcode_concatenation | file_extension | String | If your ONT data ends in a different extension, like ".fq.gz" or ".fastq", you can indicate that here. | .fastq.gz | Optional
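When the inputs are supplied as file paths, it may be convenient to assemble them as a JSON document first. The sketch below builds one using the usual WDL convention of prefixing each input with the workflow name. It assumes the workflow name is ont_barcode_concatenation, as listed in the Terra Task Name column. The function name and every example value are hypothetical placeholders.

```python
import json


def build_inputs(input_bucket_path, output_bucket_path, terra_project,
                 terra_workspace, terra_table_name,
                 barcode_renaming_file=None, file_extension=None):
    """Assemble a WDL-style inputs dictionary (hypothetical helper)."""
    for label, path in (("input_bucket_path", input_bucket_path),
                        ("output_bucket_path", output_bucket_path)):
        if not path.startswith("gs://"):
            raise ValueError(f"{label} must include the gs:// prefix")
    prefix = "ont_barcode_concatenation."  # assumed workflow name
    inputs = {
        prefix + "input_bucket_path": input_bucket_path,
        prefix + "output_bucket_path": output_bucket_path,
        prefix + "terra_project": terra_project,
        prefix + "terra_workspace": terra_workspace,
        prefix + "terra_table_name": terra_table_name,
    }
    # Optional inputs are only included when explicitly provided.
    if barcode_renaming_file is not None:
        inputs[prefix + "barcode_renaming_file"] = barcode_renaming_file
    if file_extension is not None:
        inputs[prefix + "file_extension"] = file_extension
    return inputs


if __name__ == "__main__":
    # All values below are placeholders, not real buckets or projects.
    example = build_inputs(
        "gs://example-bucket/uploads/run1",
        "gs://example-bucket/uploads/run1_concatenated",
        "example-project", "example-workspace", "ont_specimen",
    )
    print(json.dumps(example, indent=2))
```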

Outputs

Your concatenated ONT data will automatically appear in your workspace in the table of choice with information in the following four fields:

  • Sample name (under the terra_table_name_id column), which will be either the name of the parent folder or the remapped name indicated by the barcode_renaming_file input.
  • The concatenated ONT data in the read1 column
  • The name of the workflow (ONT_Barcode_Concatenation_PHB) under the table_created_by column, to indicate the samples were added by this workflow.
  • The date of upload/when the workflow was run under the upload_date column