Removing modification calls from BAMs.
There may be multiple possible base modifications to a specific canonical base.
For example, cytosine has at least four known modified variants. As
technologies are increasingly able to detect these chemical moieties, not all
downstream tools may be capable of accepting them, or you may want to convert
your data to make it comparable to other data with less specific resolution. To
address this issue, modkit
implements a method for removing one or more DNA
base modification call from a BAM as well as a command line flag to simply
combine all modification calls when performing a pileup to generate a bedMethyl
file.
Removing DNA base modification probabilities.
BAM records containing probabilities for multiple base modifications at each residue can
be transformed to ignore one (or more) of the predicted base modifications. For example,
if a BAM file contains 5hmC and 5mC probabilities, the total probability is necessarily
distributed over 5mC, \(p_m\), 5hmC, \(p_h\), and canonical C, \(p_{C}\). modkit
implements a method, described below, for reducing the dimensionality of the probability
distribution by one or more of the predicted base modification classes.
Take for example a 3-way modification call for C, 5mC, and 5hmC, the probabilities of each being \(p_{C}\), \(p_{m}\), \(p_{h}\), respectively. To remove the 5hmC probability, \(p_{h}\) is distributed evenly across the other two options. In this example, the updates are \( p_C \leftarrow p_C + (\frac{p_h}{2}) \) and \( p_m \leftarrow p_m + (\frac{p_h}{2}) \).
Combining multiple base modifications into a single count.
Combine: --combine
In modkit
the combine method can be used to produce binary modified/unmodified
counts. Continuing with the above example, the counts for number of modified reads,
N_mod
, becomes the number of reads with 5hmC calls plus the number of reads with 5mC
calls. N_other_mod
will always be 0
. The raw_mod_code
will be reported as the
ambiguous mod code for the canonical base. See
https://samtools.github.io/hts-specs/SAMtags.pdf for details on base modification codes
and schema.yaml
in this project for details on the notation for counts.