Filtering base modification calls.
Modified base calls are qualitied with a probability that is contained in the ML tag (see the
specification). We calculate the confidence that the model
has in the base modification prediction as \(\mathcal{q} = argmax(\textbf{P})\) where \(\textbf{P}\) is the
vector of probabilities for each modification. For example, given a model that can classify canonical
cytosine, 5mC, and 5hmC, \(\textbf{P}\) is \([P_{C}, P_m, P_h]\), and \(\mathcal{q}\) will be \(\mathcal{q} =
argmax(P_{C}, P_m, P_h)\), the maximum of the three probabilities. Filtering in modkit
is performed by
first determining the value of \(\mathcal{q}\) for the lowest n-th percentile of calls (10th percentile by
default). The threshold value is typically an estimate because the base modification probabilities are
sampled from a subset of the reads. All calls can be used to calculate the exact value, but in practice the
approximation gives the same value. Base modification calls with a confidence value less than this number will
not be counted. Determination of the threshold value can be performed on the fly (by sampling) or the
threshold value can be specified on the command line with the --filter_threshold
flag. The sample-probs
command can be used to quickly estimate the value of \(\mathcal{q}\) at various percentiles.
A note on the "probability of modification" (.
) MM flag.
The .
flag (e.g. C+m.
) indicates that primary sequence bases without base modification calls can be inferred to be
canonical (SAM tags). Some base modification callers,
for example dorado
have a default threshold, below which a base modification
probability will be omitted (meaning it will be inferred to be a canonical/unmodified base). In general, omitting
the base modification probabilities when they are very confidently canonical will not change the results from modkit
.
However, since the base modification probabilities are effectively changed to 100% canonical, can affect the
estimation of the pass threshold.