Skip to content

TheiaValidate

Quick Facts

Workflow Type Applicable Kingdom Last Known Changes Command-line Compatibility Workflow Level
Standalone Any Taxa PHB v2.0.0 No

TheiaValidate_PHB

TheiaValidate Workflow Diagram

TheiaValidate Workflow Diagram

TheiaValidate performs basic comparisons between user-designated columns in two separate tables. We anticipate this workflow being run to determine if any differences exist between version releases or two workflows, such as TheiaProk_ONT vs TheiaProk_Illumina_PE. A summary PDF report is produced in addition to a Excel spreadsheet that lists the values for any columns that do not have matching content for a sample.

Warning

The two tables being compared must have both identical sample names and an equal number of samples. If not, validation will not work or (in the case of unequal number of samples) not be attempted.

In order to enable this workflow to function for different workflow series, we require users to provide a list of columns they want to compare between the two tables. Feel free to use the information below that Theiagen uses to compare versions of the three main workflow series as a starting point for your own validations:

Validation Starting Points

Workflow Series Validation Criteria TSV Columns to Compare
TheiaCoV Workflows TheiaCov Validation Criteria abricate_flu_subtype,abricate_flu_type,assembly_length_unambiguous,assembly_mean_coverage,irma_subtype,irma_type,kraken_human,kraken_human_dehosted,kraken_sc2,kraken_sc2_dehosted,kraken_target_org,kraken_target_org_dehosted,nextclade_aa_dels,nextclade_aa_subs,nextclade_clade,nextclade_lineage,nextclade_tamiflu_resistance_aa_subs,num_reads_clean1,num_reads_clean2,number_N,pango_lineage,percent_reference_coverage,vadr_num_alerts
TheiaEuk Workflows TheiaEuk Validation Criteria assembly_length,busco_results,clade_type,est_coverage_clean,est_coverage_raw,gambit_predicted_taxon,n50_value,num_reads_clean1,num_reads_clean2,number_contigs,quast_gc_percent,theiaeuk_snippy_variants_hits
TheiaProk Workflows TheiaProk Validation Criteria abricate_abaum_plasmid_type_genes,agrvate_agr_group,amrfinderplus_amr_core_genes,amrfinderplus_amr_plus_genes,amrfinderplus_stress_genes,amrfinderplus_virulence_genes,ani_highest_percent,ani_top_species_match,assembly_length,busco_results,ectyper_predicted_serotype,emmtypingtool_emm_type,est_coverage_clean,est_coverage_raw,gambit_predicted_taxon,genotyphi_final_genotype,hicap_genes,hicap_serotype,kaptive_k_type,kleborate_genomic_resistance_mutations,kleborate_key_resistance_genes,kleborate_mlst_sequence_type,legsta_predicted_sbt,lissero_serotype,meningotype_serogroup,midas_primary_genus,midas_secondary_genus,midas_secondary_genus_abundance,n50_value,ngmaster_ngmast_sequence_type,ngmaster_ngstar_sequence_type,num_reads_clean1,num_reads_clean2,number_contigs,pasty_serogroup,pbptyper_predicted_1A_2B_2X,plasmidfinder_plasmids,poppunk_gps_cluster,seqsero2_predicted_serotype,seroba_ariba_serotype,seroba_serotype,serotypefinder_serotype,shigatyper_ipaB_presence_absence,shigatyper_predicted_serotype,shigeifinder_cluster,shigeifinder_serotype,sistr_predicted_serotype,sonneityping_final_genotype,spatyper_type,srst2_vibrio_serogroup,staphopiasccmec_types_and_mecA_presence,tbprofiler_main_lineage,tbprofiler_resistance_genes,ts_mlst_predicted_st,virulencefinder_hits

If additional validation metrics are desired, the user has the ability to provide a validation_criteria_tsv file that specifies what type of comparison should be performed. There are several options for additional validation checks:

  • EXACT performs an exact string match and counts the number of exact match failures/differences
  • IGNORE does not check the values and says there are 0 failures
  • SET checks list items (such as amrfinder_plus_genes which is a comma-delimited list of genes) for identical content — order does not matter; that is, mdsA,mdsB is determined to be same as mdsB,mdsA. The EXACT match does not consider these to be the same, but the SET match does. -, which is an actual decimal value such as 0.02, calculates the percent difference between numerical columns. If the columns are not numerical, this function will not work and will lead to workflow failure. For example, if the decimal percentage is 0.02, the test will indicate a failure if the values in the two columns are more than 2% different.
  • Dates, integers, and object-type values are ignored and indicate 0 failures.

File Comparisons

If a column consists of only GCURIs (Google Cloud file paths), the files will be localized and compared with either an EXACT match or a SET match. In the SET match, the lines in the file are ordered before comparison. Results are returned to the summary table as expected. The results of each file comparison can be found in the theiavalidate_diffs output column.

Inputs

Please note that all string inputs must be enclosed in quotation marks; for example, "column1,column2" or "workspace1"

Terra Task Name Variable Type Description Default Value Terra Status
theiavalidate columns_to_compare String A comma-separated list of the columns the user wants to compare. Do not include whitespace. Required
theiavalidate output_prefix String The prefix for the output files Required
theiavalidate table1_name String The name of the first table Required
theiavalidate table2_name String The name of the second table Required
theiavalidate terra_project1_name String The name of the Terra project where table1_name can be found Required
theiavalidate terra_workspace1_name String The name of the Terra workspace where table1_name can be found Required
theiavalidate column_translation_tsv File If the user wants to link two columns of different names, they may supply a TSV file that provides a "column translation" between the two files (see the section below this table). Optional
theiavalidate terra_project2_name String If the table2_name is located in a different Terra project, indicate it here. Otherwise, the workflow will look for table2_name in the Terra project indicated in terra_project1_name. value for terra_project1_name Optional
theiavalidate terra_workspace2_name String If the table2_name is located in a different Terra workspace, indicate it here. Otherwise, the workflow will look for table2_name in the Terra workspace indicated in terra_workspace1_name. value for terra_workspace1_name Optional
theiavalidate validation_criteria_tsv File If the user wants to specify a different comparison than the default exact string match, they may supply a TSV file that indicates the different options (see the section below this table). Optional
compare_two_tsvs cpu Int Number of CPUs to allocate to the task 2 Optional
compare_two_tsvs debug_output Boolean Set to true to enable more outputs; useful when debugging FALSE Optional
compare_two_tsvs disk_size Int Amount of storage (in GB) to allocate to the task 100 Optional
compare_two_tsvs docker String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/theiagen/theiavalidate:0.1.0 Optional
compare_two_tsvs memory Int Amount of memory/RAM (in GB) to allocate to the task 4 Optional
compare_two_tsvs na_values String If the user knows a particular value in either table that they would like to be considered N/A, they can indicate those values in a comma-separated list here. Any changes here will overwrite the default and not append to the default list. Do not include whitespace. -1.#IND,1.#QNAN,1.#IND,-1.#QNAN,#N/A,N/A,n/a,,#NA,NULL,null,NaN,-NaN,nan,-nan,None Optional
export_two_tsvs cpu Int Number of CPUs to allocate to the task 1 Optional
export_two_tsvs disk_size Int Amount of storage (in GB) to allocate to the task 10 Optional

The optional validation_criteria_tsv file takes the following format (tab-delimited; a header line is required):

1
2
3
4
5
column_name criteria
columnB SET
columnC IGNORE
columnD 0.01
columnE EXACT

Please see above for a description of all available criteria options (EXACT, IGNORE, SET, ).

The optional column_translation_tsv file takes the following format (tab-delimited; there can be no header line):

1
2
3
column_name_in_table1   column_name_in_table2
column_name_in_table2   column_name_in_table1
internal_column_name    display_column_name

Please note that the name in the second column will be displayed and used in all output files.

Known Bug

There must be more than one line in the column_translation_tsv file or else this error will appear: AttributeError: 'str' object has no attribute 'to_dict'. To fix this error, add an additional line in the column_translation_tsv file, like the following: columnA columnA

Known Bug

If performing a comparison, all samples must have values for that column.

Call Caching Disabled

If using TheiaValidate workflow version 1.3.0 or higher, the call-caching feature of Terra has been DISABLED to ensure that the workflow is run from the beginning and data is compared fresh. Call-caching will not be enabled, even if the user checks the box ✅ in the Terra workflow interface.

Outputs

Variable Type Description
theiavalidate_criteria_differences File A TSV file that lists only the differences that fail to meet the validation criteria
theiavalidate_date String The date the analysis was run
theiavalidate_diffs Array[File] An array of files with a single file for each file comparison performed; only has values if a column with files is compared
theiavalidate_exact_differences File A TSV file that lists all exact string match differences between samples
theiavalidate_filtered_input_table1 File The first data table used for validation after removing unexamined columns and translating column names
theiavalidate_filtered_input_table2 File The second data table used for validation after removing unexamined columns and translating column names
theiavalidate_report File A PDF summary report
theiavalidate_status String Indicates whether or not validation was attempted
theiavalidate_version String The version of the TheiaValidate Python Docker
theiavalidate_wf_version String The version of the PHB repository

Example Data and Outputs

To help demonstrate how TheiaValidate works, please observe the following example and outputs:

Table1
entity:example_table1_id columnA-string columnB-set columnC-ignore columnD-float columnE-missing
sample1 option1 item1,item2,item3 cheese 1000 present
sample2 option1 item1,item3,item2 cheesecake 12 present
sample3 option2 item1,item2,item3 cake 14 present
sample4 option1 item2,item1 cakebatter 3492
sample5 option2 item1,item2 batter 3 present
Table2
entity:example_table2_id columnA-string columnB-set columnC-ignore columnD-float missing
sample1 option1 item1,item3,item2 cheesecake 999 present
sample2 option2 item1,item2,item3 batter 12 present
sample3 option1 item1,item2 cheese 24
sample4 option1 item1,item2 cakebatter 728
sample5 option2 item1,item2,item3 batter 4 present
Validation Criteria
column criteria
columnB-set SET
columnC-ignore IGNORE
columnD-float 0.01
columnE-missing EXACT
Column Translation
missing columnE-missing
columnA-string columnA-string

Note: the second row translating columnA-string to itself is included to prevent the known bug explained above.

If the above inputs are provided, then the following output files will be generated:

filtered_example_table1.tsv

filtered_example_table2.tsv

example_summary.pdf

example_exact_differences.tsv

example_validation_criteria_differences.tsv