ONT_Barcode_Concatenation

Quick Facts

Workflow Type	Applicable Kingdom	Last Known Changes	Command-line Compatibility	Workflow Level	Dockstore
Data Import	Any taxa	v4.0.0	No		ONT_Barcode_Concatenation_PHB

ONT_Barcode_Concatenation_PHB

This workflow will automatically concatenate all reads in a given folder and upload those reads to a Terra data table.

We recommend running this workflow with "Run workflow with inputs defined by file paths" selected in Terra. This will allow you to upload your data files and provide the necessary information for the workflow without having to specify a data table. There are no outputs for this workflow, as the data is added to either a new or existing table in your workspace.

Barcodes Must Be In Nested Directories

This workflow expects the reads for each barcode to be located in their own subdirectory, with input_bucket_path pointing to the parent folder that contains all the barcode subdirectories to be processed.

How does directory structure impact my output?

If you have the following directory structure:

output_bucket_path/
input_bucket_path/
├── barcode01/
│   ├── ABC123_pass_barcode01_123abc_789xyz_0.fastq.gz
│   ├── ...
│   └── ABC123_pass_barcode01_123abc_789xyz_XXX.fastq.gz
├── barcode02/
│   ├── ABC123_pass_barcode02_123abc_789xyz_0.fastq.gz
│   ├── ...
│   └── ABC123_pass_barcode02_123abc_789xyz_XXX.fastq.gz
├── barcodeXXX/
│   └── ...
├── random_name/
│   └── ...
├── ABC123_these_files_will_be_ignored_0.fastq.gz
└── ABC123_these_files_will_be_ignored_1.fastq.gz

The input_bucket_path would point to gs://input_bucket_path/, and the workflow would automatically find and concatenate all reads within each subdirectory (both barcode*/ and random_name/). Learn how to upload your files in this structure below. Please note: reads located directly in the parent directory (e.g., ABC123_these_files_will_be_ignored_0.fastq.gz and ABC123_these_files_will_be_ignored_1.fastq.gz) will be ignored.
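The grouping rule described above can be sketched in a few lines of Python. This is only an illustration, not the workflow's own code: the function name `group_reads_by_folder` is hypothetical, the paths are written relative to the bucket rather than as full gs:// URIs, and the sketch assumes that only files sitting directly inside a subdirectory are collected.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_reads_by_folder(object_paths, input_prefix, file_extension=".fastq.gz"):
    """Group read files by the subdirectory of input_prefix that holds them.

    Hypothetical helper: files placed directly in input_prefix are skipped,
    mirroring the behaviour described in the text.
    """
    prefix = PurePosixPath(input_prefix)
    groups = defaultdict(list)
    for path in object_paths:
        if not path.endswith(file_extension):
            continue
        relative = PurePosixPath(path).relative_to(prefix)
        # Assumption: keep only <subdirectory>/<file>; a file with a single
        # path component lives in the parent folder and is ignored.
        if len(relative.parts) != 2:
            continue
        groups[relative.parts[0]].append(path)
    return {folder: sorted(files) for folder, files in groups.items()}


example_paths = [
    "input_bucket_path/barcode01/ABC123_pass_barcode01_123abc_789xyz_0.fastq.gz",
    "input_bucket_path/barcode01/ABC123_pass_barcode01_123abc_789xyz_1.fastq.gz",
    "input_bucket_path/random_name/example_reads_0.fastq.gz",
    "input_bucket_path/ABC123_these_files_will_be_ignored_0.fastq.gz",
]
grouped = group_reads_by_folder(example_paths, "input_bucket_path/")
```

Here `grouped` has one entry per subdirectory (barcode01 and random_name), and the file in the parent folder does not appear in any group.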

After concatenation, the resulting reads will appear in the gs://output_bucket_path under the following names:

output_bucket_path/
├── previously_concatenated_sample.all.fastq.gz
├── barcode01.all.fastq.gz
├── barcode02.all.fastq.gz
├── ...
├── barcodeXXX.all.fastq.gz
└── random_name.all.fastq.gz
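For readers who want to reproduce the concatenation step locally, the following is a minimal sketch under stated assumptions. It relies on the fact that several gzip files appended back to back form a valid gzip stream, so no decompression is needed. The function name `concatenate_gzip_files` and the example file names are hypothetical, and the workflow itself may perform this step differently.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concatenate_gzip_files(parts, destination):
    """Append each gzip file to destination byte for byte."""
    with open(destination, "wb") as out_handle:
        for part in parts:
            with open(part, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
    return destination


# Local demonstration with two made-up FASTQ records.
records = ["@read1\nACGT\n+\n!!!!\n", "@read2\nTTGA\n+\n####\n"]
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "barcode01"
    folder.mkdir()
    for index, record in enumerate(records):
        with gzip.open(folder / f"example_{index}.fastq.gz", "wt") as handle:
            handle.write(record)

    merged = concatenate_gzip_files(
        sorted(folder.glob("*.fastq.gz")),
        Path(tmp) / "barcode01.all.fastq.gz",
    )
    # Reading the merged file back yields both records in order.
    with gzip.open(merged, "rt") as handle:
        merged_text = handle.read()
```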

All data in the output_bucket_path will appear in the specified Terra table under the read1 column. The sample name is the text before the .all.fastq.gz suffix, which is the name of the folder the data was found in. Files that are already in the output_bucket_path and were added to the table by an earlier run will be added again.

<terra_table_name>_id	read1
previously_concatenated_sample	previously_concatenated_sample.all.fastq.gz
barcode01	barcode01.all.fastq.gz
barcode02	barcode02.all.fastq.gz
...	...
barcodeXXX	barcodeXXX.all.fastq.gz
random_name	random_name.all.fastq.gz
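The mapping from file names to table rows can be illustrated as follows. This is a sketch only: `rows_from_listing` is a hypothetical helper, and it assumes the default .all.fastq.gz suffix shown above.

```python
SUFFIX = ".all.fastq.gz"


def rows_from_listing(file_names):
    """Return (sample_name, read1) pairs for every concatenated file.

    Hypothetical helper: the sample name is the text before SUFFIX, and
    files without that suffix are skipped.
    """
    rows = []
    for name in sorted(file_names):
        if name.endswith(SUFFIX):
            rows.append((name[: -len(SUFFIX)], name))
    return rows


listing = [
    "previously_concatenated_sample.all.fastq.gz",
    "barcode01.all.fastq.gz",
    "random_name.all.fastq.gz",
]
table_rows = rows_from_listing(listing)
```

Note that previously_concatenated_sample produces a row even though it was not created by the current run, because every file in the folder is considered.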

If a barcode_renaming_file is used (see relevant section below) that maps random_name to my_special_sample, the random_name sample will not appear in the table or in the output_bucket_path, and my_special_sample will appear instead.

<terra_table_name>_id	read1
my_special_sample	my_special_sample.all.fastq.gz

Inputs

Uploading unconcatenated ONT reads to Terra and finding the `input_bucket_path`

Using the Terra data uploader is not recommended.

The following method is recommended for data upload:

1. Navigate to your Terra workspace's Dashboard page and click on "Open bucket in browser" under the "Cloud Information" toggle on the right-hand side.

   [Image: Open bucket in browser]

2. Click on the uploads folder.

   [Image: Open the uploads folder]

3. Click on "Create folder". Give the folder a unique name that identifies your run or group of data, then click on "Create".

   [Image: Create a new folder]

4. Navigate into the newly created folder by clicking on it. You can now drag and drop entire barcode directories into the browser window showing your new Google bucket folder. This uploads the data directly into your Terra workspace.

   [Image: Drag your barcode folders onto the browser]

   When your files are uploaded, you should see them appear.

   [Image: Uploaded folders should look like this]

5. Once your files are uploaded, you can identify the input_bucket_path by clicking on the two squares (the copy icon) next to the file path at the top of the screen, shown below. When pasting this into the workflow inputs, you will need to add the gs:// prefix.

   [Image: Copy the file path]
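Because the copied path does not include the gs:// prefix, a small helper like the one below can be used to normalise it before pasting it into the workflow inputs. This is a convenience sketch; the function name `to_gs_uri` and the bucket name in the example are hypothetical.

```python
def to_gs_uri(copied_path):
    """Add the gs:// prefix to a path copied from the bucket browser.

    Hypothetical helper: paths that already start with gs:// are returned
    unchanged apart from surrounding whitespace.
    """
    path = copied_path.strip()
    if path.startswith("gs://"):
        return path
    return "gs://" + path.lstrip("/")


# "fc-example-bucket" is a made-up bucket name used only for illustration.
input_bucket_path = to_gs_uri("fc-example-bucket/uploads/my_run")
```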

Finding the `output_bucket_path`

It is recommended to also create a new folder using the method described above for your output_bucket_path. No files should be uploaded to it.

CAUTION! Be careful when reusing output_bucket_path

This workflow currently adds every file in the output_bucket_path to the specified Terra table, not only the files created by the current run. If the output_bucket_path is reused, all files in it will be re-added to Terra and the upload_date column will be overwritten.

Creating a `barcode_renaming_file`

By default, each concatenated file takes the name of the folder that contained the unconcatenated files. If you have specific sample names that correspond to each folder name, you can use a barcode_renaming_file to specify the names you would like the concatenated files to have.

This file takes the following tab-delimited format. Do not include a header.

barcode01   sample01
barcode02   sample02

The first column is the name of the folder and the second column is the desired sample name.

Upload this file to your Terra bucket using either the Data Uploader or by clicking on the file icon on the right sidebar. Copy the file path into the barcode_renaming_file variable, and your files will be renamed accordingly.
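Before uploading, it can help to check that the renaming file really is tab-delimited with exactly two columns and no header. The sketch below shows one way to do that in Python. The function name `load_renaming_map` is hypothetical, the rejection of duplicate folder names is a choice made for this example, and leaving unmapped folders under their original name is an assumption based on the default behaviour described above.

```python
import csv
import io


def load_renaming_map(handle):
    """Parse a headerless, tab-delimited folder-to-sample mapping."""
    mapping = {}
    rows = csv.reader(handle, delimiter="\t")
    for line_number, fields in enumerate(rows, start=1):
        if not fields or all(not field.strip() for field in fields):
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 tab-separated columns, got {len(fields)}"
            )
        folder, sample = (field.strip() for field in fields)
        if folder in mapping:
            raise ValueError(f"line {line_number}: folder {folder!r} listed twice")
        mapping[folder] = sample
    return mapping


example_file = io.StringIO("barcode01\tsample01\nbarcode02\tsample02\n")
renaming = load_renaming_map(example_file)

# Assumption: folders without an entry keep their folder name.
folders = ["barcode01", "barcode02", "random_name"]
sample_names = [renaming.get(folder, folder) for folder in folders]
```

A file that uses spaces instead of tabs would fail the two-column check, which is a common cause of samples not being renamed as expected.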

Terra Task Name	Variable	Type	Description	Default Value	Terra Status
ont_barcode_concatenation	input_bucket_path	String	The full path to your unconcatenated data's Google bucket folder location, including the gs://; can be easily copied by right-clicking and copying the link address in the header after navigating to the folder in the "Files" section of the "Data" tab on Terra (see above for examples)		Required
ont_barcode_concatenation	new_table_name	String	The name of the Terra table where you want the concatenated data to be added to; can be new or pre-existing		Required
ont_barcode_concatenation	output_bucket_path	String	The full path to where you want the concatenated data to be stored as a Google bucket folder location, including the gs://; can be easily copied by right-clicking and copying the link address in the header after navigating to the desired folder in the "Files" section of the "Data" tab on Terra (see above for examples)		Required
ont_barcode_concatenation	terra_project	String	The name of the Terra project where your data table will be located		Required
ont_barcode_concatenation	terra_workspace	String	The name of the Terra workspace where your data table will be located		Required
cat_ont_barcodes	cpu	Int	Number of CPUs to allocate to the task	2	Optional
cat_ont_barcodes	disk_size	Int	Amount of storage (in GB) to allocate to the task	100	Optional
cat_ont_barcodes	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/ont-barcodes:0.0.2	Optional
cat_ont_barcodes	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
create_terra_table	cpu	Int	Number of CPUs to allocate to the task	1	Optional
create_terra_table	disk_size	Int	Amount of storage (in GB) to allocate to the task	25	Optional
create_terra_table	docker	String	The Docker container to use for the task	us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21	Optional
create_terra_table	memory	Int	Amount of memory/RAM (in GB) to allocate to the task	4	Optional
ont_barcode_concatenation	barcode_renaming_file	File	A tab-delimited file where the names of the barcode folders are mapped to the desired sample names in the Terra table (see above for examples)		Optional
ont_barcode_concatenation	file_extension	String	If your ONT data ends in a different extension, like ".fq.gz" or ".fastq", you can indicate that here.	.fastq.gz	Optional
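When running with "Run workflow with inputs defined by file paths", the inputs can be prepared as a JSON document. The sketch below builds one in Python. It assumes that the input keys follow the usual WDL `<workflow_name>.<input_name>` convention with `ont_barcode_concatenation` as the workflow name, as suggested by the table above. All values (bucket, table, project and workspace names) are made up for illustration.

```python
import json

WORKFLOW = "ont_barcode_concatenation"
REQUIRED = (
    "input_bucket_path",
    "new_table_name",
    "output_bucket_path",
    "terra_project",
    "terra_workspace",
)


def build_inputs(values):
    """Check the required inputs and prefix each key with the workflow name."""
    missing = [name for name in REQUIRED if not values.get(name)]
    if missing:
        raise ValueError(f"missing required inputs: {', '.join(missing)}")
    for name in ("input_bucket_path", "output_bucket_path"):
        if not values[name].startswith("gs://"):
            raise ValueError(f"{name} must include the gs:// prefix")
    return {f"{WORKFLOW}.{name}": value for name, value in values.items()}


# Hypothetical values; replace them with your own paths and names.
inputs = build_inputs(
    {
        "input_bucket_path": "gs://fc-example-bucket/uploads/my_run",
        "output_bucket_path": "gs://fc-example-bucket/uploads/my_run_concatenated",
        "new_table_name": "ont_specimen",
        "terra_project": "my-terra-project",
        "terra_workspace": "my-terra-workspace",
        "file_extension": ".fastq.gz",
    }
)
print(json.dumps(inputs, indent=2))
```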

Outputs

Your concatenated ONT data will automatically appear in your workspace in the table of choice with information in the following four fields:

Sample name (under the <terra_table_name>_id column), which will be either the name of the folder that contained the reads or the remapped name indicated by the barcode_renaming_file input.
The concatenated ONT data in the read1 column.
The name of the workflow (ONT_Barcode_Concatenation_PHB) under the table_created_by column, to indicate the samples were added by this workflow.
The date of upload (when the workflow was run) under the upload_date column.
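The four fields can be pictured as one record per sample. The sketch below assembles such a record in Python. The function name `build_table_row`, the table name and the ISO date format are assumptions made for illustration; the format the workflow actually writes to upload_date is not stated here.

```python
from datetime import date


def build_table_row(table_name, sample_name, read1, run_date):
    """Assemble the four fields described above for a single sample."""
    return {
        f"{table_name}_id": sample_name,
        "read1": read1,
        "table_created_by": "ONT_Barcode_Concatenation_PHB",
        # Assumption: ISO 8601 date; the workflow may use another format.
        "upload_date": run_date.isoformat(),
    }


row = build_table_row(
    table_name="ont_specimen",  # hypothetical table name
    sample_name="barcode01",
    read1="barcode01.all.fastq.gz",
    run_date=date(2024, 1, 15),
)
```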
