ONT_Barcode_Concatenation
Quick Facts
Workflow Type | Applicable Kingdom | Last Known Changes | Command-line Compatibility | Workflow Level | Dockstore |
---|---|---|---|---|---|
Data Import | Any taxa | vX.X.X | No | | ONT_Barcode_Concatenation_PHB |
ONT_Barcode_Concatenation_PHB
Unconcatenated ONT data is the bane of all humanity. This workflow will automatically concatenate all reads in a given folder and upload those reads to a Terra data table.
We recommend running this workflow with "Run workflow with inputs defined by file paths" selected in Terra. This will allow you to upload your data files and provide the necessary information for the workflow without having to specify a data table. There are no outputs for this workflow, as the data is added to either a new or existing table in your workspace.
Barcodes Must Be In Nested Directories
This workflow expects all reads associated with a barcode to be located in their own subdirectory, while the `input_bucket_path` points to the parent folder containing all the barcodes that are to be processed.
How does directory structure impact my output?
If you have the following directory structure:
output_bucket_path/
input_bucket_path/
├── barcode01/
│ ├── ABC123_pass_barcode01_123abc_789xyz_0.fastq.gz
│ ├── ...
│ └── ABC123_pass_barcode01_123abc_789xyz_XXX.fastq.gz
├── barcode02/
│ ├── ABC123_pass_barcode02_123abc_789xyz_0.fastq.gz
│ ├── ...
│ └── ABC123_pass_barcode02_123abc_789xyz_XXX.fastq.gz
├── barcodeXXX/
│ └── ...
├── random_name/
│ └── ...
├── ABC123_these_files_will_be_ignored_0.fastq.gz
└── ABC123_these_files_will_be_ignored_1.fastq.gz
The `input_bucket_path` would point to `gs://input_bucket_path/`, and the workflow would automatically find and concatenate all reads within each subdirectory (both `barcode*/` and `random_name/`). Learn how to upload your files in this structure below. Please note: if there are reads located directly in the parent directory (e.g., `ABC123_these_files_will_be_ignored_0.fastq.gz` and `ABC123_these_files_will_be_ignored_1.fastq.gz`), they will be ignored.
After concatenation, the resulting reads will appear in `gs://output_bucket_path` under the following names:
output_bucket_path/
├── previously_concatenated_sample.all.fastq.gz
├── barcode01.all.fastq.gz
├── barcode02.all.fastq.gz
├── ...
├── barcodeXXX.all.fastq.gz
└── random_name.all.fastq.gz
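The grouping behaviour described above can be sketched in a few lines of Python. This is not the workflow's actual implementation: the function name `plan_concatenation` is hypothetical, paths are assumed to be relative to the `input_bucket_path`, and grouping by the top-level folder only is an assumption about how deeper nesting would be handled.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_concatenation(object_paths, file_extension=".fastq.gz"):
    """Map each output file name to the read files that would feed it.

    Files located directly in the parent folder are skipped, mirroring
    the "will be ignored" behaviour described in the text.
    """
    groups = defaultdict(list)
    for path in object_paths:
        parts = PurePosixPath(path).parts
        if len(parts) < 2 or not path.endswith(file_extension):
            continue  # parent-level file, or not a read file
        groups[parts[0]].append(path)  # parts[0] is the subdirectory name
    # Output naming follows the <folder>.all.fastq.gz pattern shown above.
    return {
        f"{folder}.all.fastq.gz": sorted(files)
        for folder, files in groups.items()
    }


plan = plan_concatenation([
    "barcode01/ABC123_pass_barcode01_123abc_789xyz_0.fastq.gz",
    "barcode01/ABC123_pass_barcode01_123abc_789xyz_1.fastq.gz",
    "random_name/reads_0.fastq.gz",
    "ABC123_these_files_will_be_ignored_0.fastq.gz",
])
for output_name, inputs in plan.items():
    print(output_name, "<-", len(inputs), "file(s)")
```

The parent-level file does not appear in the plan, while `barcode01/` and `random_name/` each produce one output.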
All data in the `output_bucket_path` will appear in the specified Terra table under the `read1` column. The sample name is taken from the text before the `.all.fastq.gz` suffix, which is the name of the folder the data was found in. If a file was already added to the table by an earlier run but is still in the `output_bucket_path`, it will be added again.
<terra_table_name>_id | read1 |
---|---|
previously_concatenated_sample | previously_concatenated_sample.all.fastq.gz |
barcode01 | barcode01.all.fastq.gz |
barcode02 | barcode02.all.fastq.gz |
... | ... |
barcodeXXX | barcodeXXX.all.fastq.gz |
random_name | random_name.all.fastq.gz |
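The rule for deriving the sample name from a file name (take everything before the `.all.fastq.gz` suffix) can be illustrated as follows. The helper name `sample_name_from_file` is hypothetical, and raising an error for files without the suffix is an assumption made for this sketch, not documented workflow behaviour.

```python
SUFFIX = ".all.fastq.gz"


def sample_name_from_file(file_name: str) -> str:
    """Return the text before the .all.fastq.gz suffix."""
    if not file_name.endswith(SUFFIX):
        # Assumption for this sketch: unexpected names are rejected.
        raise ValueError(f"{file_name!r} does not end with {SUFFIX}")
    return file_name[: -len(SUFFIX)]


for name in ["barcode01.all.fastq.gz", "previously_concatenated_sample.all.fastq.gz"]:
    print(sample_name_from_file(name), "|", name)
```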
If a `barcode_renaming_file` is used (see relevant section below) that maps `random_name` to `my_special_sample`, the `random_name` sample will not appear in the table or in the `output_bucket_path`, and `my_special_sample` will appear instead.
<terra_table_name>_id | read1 |
---|---|
my_special_sample | my_special_sample.all.fastq.gz |
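The effect of the renaming can be sketched as a dictionary lookup. The function name `apply_renaming` is hypothetical, and keeping the folder name for folders that are missing from the mapping is an assumption based on the default naming behaviour, not something this page states for partially filled renaming files.

```python
def apply_renaming(folder_names, renaming):
    """Return the sample name for each folder, using the mapping when present."""
    # Assumption: folders not listed in the mapping keep their own name.
    return [renaming.get(name, name) for name in folder_names]


names = apply_renaming(
    ["barcode01", "random_name"],
    {"random_name": "my_special_sample"},
)
for name in names:
    print(f"{name} | {name}.all.fastq.gz")
```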
Inputs
Uploading unconcatenated ONT reads to Terra and finding the `input_bucket_path`
Using the Terra data uploader is not recommended.
The following method is recommended for data upload:
- Navigate to your Terra workspace's Dashboard page and click on "Open bucket in browser" under the "Cloud Information" toggle on the right-hand side.
- Click on the `uploads` folder.
- Click on "Create folder". Give the folder a unique name that identifies your run or group of data, then click on "Create".
- Navigate into the newly created folder by clicking on it. You can now drag and drop entire barcode directories into the browser window showing your Google bucket. This uploads the data directly into your Terra workspace. When your files are uploaded, you should see them appear.
- Once your files are uploaded, you can identify the `input_bucket_path` by clicking on the two squares next to the file path at the top of the screen. When pasting this into the workflow inputs, you will need to add the `gs://` prefix.
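The last step, adding the `gs://` prefix to the copied path, can be sketched as a small helper. This is only an illustration: the function name `to_gs_uri` and the example bucket path are hypothetical and are not part of the workflow.

```python
def to_gs_uri(copied_path: str) -> str:
    """Return the copied bucket path with a gs:// prefix.

    Hypothetical helper: paths that already start with gs:// are
    returned unchanged apart from surrounding whitespace.
    """
    path = copied_path.strip()
    if path.startswith("gs://"):
        return path
    # Drop any leading slashes so the result does not contain "gs:///".
    return "gs://" + path.lstrip("/")


# Example with a made-up bucket and folder name.
print(to_gs_uri("my-workspace-bucket/uploads/run_2024_01"))
```

The printed value is what you would paste into the `input_bucket_path` field.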
Finding the `output_bucket_path`
We also recommend creating a new folder for your `output_bucket_path` using the method described above. No files should be uploaded to it.
CAUTION! Be careful when reusing output_bucket_path
Currently, this workflow adds every file in the `output_bucket_path` to the specified Terra table. If the `output_bucket_path` is reused, all files in it, including those from earlier runs, will be re-added to Terra and the `upload_date` column will be overwritten.
Creating a `barcode_renaming_file`
By default, each concatenated file will take the name of the folder that contained the unconcatenated files. If you have specific sample names that correspond to each folder name, you can specify the names of the concatenated files using a `barcode_renaming_file`.
This file takes the following tab-delimited format. Do not include a header.
The first column is the name of the folder and the second column is the desired sample name.
Upload this file to your Terra bucket using either the Data Uploader or by clicking on the file icon on the right sidebar. Copy the file path into the `barcode_renaming_file` variable, and your files will be renamed accordingly.
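As an illustration of the format described above (two tab-separated columns, no header), the following sketch renders a renaming file from a Python dictionary. The function name `render_renaming_tsv` and the sample names are hypothetical; only the two-column, headerless, tab-delimited layout comes from this page.

```python
import csv
import io


def render_renaming_tsv(mapping: dict[str, str]) -> str:
    """Render folder-name -> sample-name pairs as headerless TSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for folder_name, sample_name in mapping.items():
        # Column 1: folder name; column 2: desired sample name.
        writer.writerow([folder_name, sample_name])
    return buffer.getvalue()


# Made-up mapping for illustration.
tsv_text = render_renaming_tsv(
    {"barcode01": "sample_A", "random_name": "my_special_sample"}
)

# Save the text to a local file that can then be uploaded to the bucket.
with open("barcode_renaming.tsv", "w", newline="") as handle:
    handle.write(tsv_text)
```

The resulting `barcode_renaming.tsv` contains one line per folder and no header row.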
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
ont_barcode_concatenation | input_bucket_path | String | The full path to your unconcatenated data's Google bucket folder location, including the gs://; can be easily copied by right-clicking and copying the link address in the header after navigating to the folder in the "Files" section of the "Data" tab on Terra (see above for examples) | | Required |
ont_barcode_concatenation | output_bucket_path | String | The full path to where you want the concatenated data to be stored as a Google bucket folder location, including the gs://; can be easily copied by right-clicking and copying the link address in the header after navigating to the desired folder in the "Files" section of the "Data" tab on Terra (see above for examples) | | Required |
ont_barcode_concatenation | terra_project | String | The name of the Terra project where your data table will be located | | Required |
ont_barcode_concatenation | terra_table_name | String | The name of the Terra table you want to add your newly concatenated samples to | | Required |
ont_barcode_concatenation | terra_workspace | String | The name of the Terra workspace where your data table will be located | | Required |
cat_ont_barcodes | cpu | Int | Number of CPUs to allocate to the task | 2 | Optional |
cat_ont_barcodes | disk_size | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
cat_ont_barcodes | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/ont-barcodes:0.0.2 | Optional |
cat_ont_barcodes | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
create_terra_table | cpu | Int | Number of CPUs to allocate to the task | 1 | Optional |
create_terra_table | disk_size | Int | Amount of storage (in GB) to allocate to the task | 25 | Optional |
create_terra_table | docker | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21 | Optional |
create_terra_table | memory | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
ont_barcode_concatenation | barcode_renaming_file | File | A tab-delimited file where the names of the barcode folders are mapped to the desired sample names in the Terra table (see above for examples) | | Optional |
ont_barcode_concatenation | file_extension | String | If your ONT data ends in a different extension, like ".fq.gz" or ".fastq", you can indicate that here. | .fastq.gz | Optional |
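When running with inputs defined by file paths, the required inputs from the table above could be assembled as a JSON document. The sketch below assumes the common WDL convention of `<workflow_name>.<input_name>` keys; every value (bucket, project, workspace, and table names) is a made-up placeholder, and the function name `build_inputs` is hypothetical.

```python
import json


def build_inputs(input_path, output_path, project, workspace, table_name):
    """Assemble the five required inputs under assumed WDL-style keys."""
    prefix = "ont_barcode_concatenation"  # workflow name from the table above
    return {
        f"{prefix}.input_bucket_path": input_path,
        f"{prefix}.output_bucket_path": output_path,
        f"{prefix}.terra_project": project,
        f"{prefix}.terra_workspace": workspace,
        f"{prefix}.terra_table_name": table_name,
    }


# Placeholder values for illustration only.
inputs = build_inputs(
    "gs://my-workspace-bucket/uploads/run_2024_01/",
    "gs://my-workspace-bucket/uploads/run_2024_01_concatenated/",
    "my-terra-project",
    "my-terra-workspace",
    "ont_samples",
)
print(json.dumps(inputs, indent=2))
```

Optional inputs such as `barcode_renaming_file` or `file_extension` would be added as further keys only when the defaults need to change.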
Outputs
Your concatenated ONT data will automatically appear in your workspace in the table of choice with information in the following four fields:
- Sample name (under the `<terra_table_name>_id` column), which will be either the name of the folder that contained the reads or the remapped name indicated by the `barcode_renaming_file` input.
- The concatenated ONT data in the `read1` column.
- The name of the workflow (`ONT_Barcode_Concatenation_PHB`) under the `table_created_by` column, to indicate the samples were added by this workflow.
- The date of upload (when the workflow was run) under the `upload_date` column.