Frequently asked questions
How are base modification probabilities calculated?
Base modifications are assigned a probability reflecting the confidence the base modification detection algorithm has in making a decision about the modification state of the molecule at a particular position.
The probabilities are parsed from the ML
tag in the BAM record. These values reflect the probability of the base having a specific modification, modkit
uses these values and calculates the probability for each modification as well as the probability the base is canonical:
\[
P_{\text{canonical}} = 1 - \sum_{m \in \textbf{M}} P_{m}
\]
where \(\textbf{M}\) is the set of all of the potential modifications for the base.
For example, consider using a m6A model that predicts m6A or canonical bases at adenine residues, if the \( P_{\text{m6A}} = 0.9 \) then the probability of canonical \( \text{A} \) is \( P_{\text{canonical}} = 1 - P_{\text{m6A}} = 0.1 \). Or considering a typical case for cytosine modifications where the model predicts 5hmC, 5mC, and canonical cytosine:
\[ P_{\text{5mC}} = 0.7, \\ P_{\text{5hmC}} = 0.2, \\ P_{\text{canonical}} = 1 - P_{\text{5mC}} + P_{\text{5hmC}} = 0.1, \\ \]
A potential confusion is that modkit
does not assume a base is canonical if the probability of modification is close to \( \frac{1}{N_{\text{classes}}} \), the lowest probability the algorithm may assign.
What value for --filter-threshold
should I use?
The same way that you may remove low quality data as a first step to any processing, modkit
will filter out the lowest confidence base modification probabilities.
The filter threshold (or pass threshold) defines the minimum probability required for a read's base modification information at a particular position to be used in a downstream step.
This does not remove the whole read from consideration, just the base modification information attributed to a particular position in the read will be removed.
The most common place to encounter filtering is in pileup
, where base modification probabilities falling below the pass threshold will be tabulated in the \( \text{N}_{\text{Fail}} \) column instead of the \( \text{N}_{\text{valid}} \) column.
For highest accuracy, the general recommendation is to let modkit
estimate this value for you based on the input data.
The value is calculated by first taking a sample of the base modification probabilities from the input dataset and determining the \(10^{\text{th}}\) percentile probability value.
This percentile can be changed with the --filter-percentile
option.
Passing a value to --filter-threshold
and/or --mod-threshold
that is higher or lower than the estimated value will have the effect of excluding or including more probabilities, respectively.
It may be a good idea to inspect the distribution of probability values in your data, the modkit sample-probs
command is designed for this task.
Use the --hist
and --out-dir
options to collect a histogram of the prediction probabilities for each canonical base and modification.