Validating ground truth results.

The modkit validate sub-command is intended for validating results in a uniform manner from samples with known modified base content. Specifically the modified base status at any annotated reference location should be known.

Validating from modBAM reads and BED reference annotation.

The input to the modkit validate command will be pairs of modBAM and BED files. modBAM files should contain modified base calls in the MM/ML tag as input to most modkit commands. BED files paired to each input modBAM file describe the ground truth modified base status at reference positions. This ground truth status should be known by the researcher due to previous experimental conditions and cannot be derived by modkit.

modkit validate \
    --bam-and-bed sample1.bam sample1_annotation.bed \
    --bam-and-bed sample2.bam sample2_annotation.bed

This will produce output such as the following:

> Parsing BED at /path/to/sample1_annotation.bed
> Processed 10 BED lines
> Parsing BED at /path/to/sample2_annotation.bed
> Processed 10 BED lines
> Canonical base: C
> Parsing mapping at /path/to/sample1.bam
> Processed 10 mapping recrods
> Parsing mapping at /path/to/sample2.bam
> Processed 10 mapping recrods
> Raw counts summary
              Called Base
         ┌───┬───────┬───────┬───┬───┬───┬──────────┐
         │   │ -     │ a     │ C │ G │ T │ Deletion │
         ├───┼───────┼───────┼───┼───┼───┼──────────┤
 Ground  │ - │ 9,900 │   100 │ 1 │ 1 │ 1 │        2 │
 Truth   │ a │   100 │ 9,900 │ 1 │ 1 │ 1 │        2 │
         └───┴───────┴───────┴───┴───┴───┴──────────┘
> Balancing ground truth call totals
> Raw accuracy: 99.00%
> Raw modified base calls contingency table
              Called Base
         ┌───┬────────┬────────┐
         │   │ -      │ a      │
         ├───┼────────┼────────┤
 Ground  │ - │ 99.00% │  1.00% │
 Truth   │ a │  1.00% │ 99.00% │
         └───┴────────┴────────┘
> Call probability threshold: 0.95
> Filtered accuracy: 99.90%
> Filtered modified base calls contingency table
              Called Base
         ┌───┬────────┬────────┐
         │   │ -      │ a      │
         ├───┼────────┼────────┤
 Ground  │ - │ 99.90% │  0.10% │
 Truth   │ a │  0.10% │ 99.90% │
         └───┴────────┴────────┘

The filtering threshold is computed in the same manner as in all other modkit commands. Currently only a defined percentage of input data filtering threshold estimation is implemented. The default value is 10% (as in other modkit commands) and can be adjusted with the --filter-quantile argument. Other methods (including user-defined thresholds) will be implemented in a future version.

The Call probability threshold is intended a value to be used for user-defined thresholds for other modkit commands.

BED ground truth annotation file:

A BED file is a tab-delimited file. For this command the first 6 fields are processed. These fields are as follows:

columnnamedescription
1chromname of reference sequencestr
2start position0-based start positionint
3end position0-based exclusive end positionint
4mod codeModified base codestr
6strandStrand (e.g. +,-,.)str

The 5th column is ignored in the validate command.

The 4th column represents the modified base code annotating the status at this reference position (or range of reference positions). This value can be - representing a canonical base (note that this differs from the remora validate annotation), a single letter code as defined in the modBAM tag specification, or any ChEBI code. The validate command will assume that any base from the associated modBAM file overlapping these positions should match this annotation.

Output file

The --out-filepath option is provided to allow persistent storage of results in a machine-parseable format without other logging lines. This format outputs all contingency tables in a machine-parseable format. For example this contingency table [["ground_truth_label","-","a"],["-",9900,100],["a",100,9900]] would be produced from the above example results.