Repair MM/ML tags on trimmed reads

The modkit repair command is useful when you have a BAM whose read sequences have been altered in a way that either invalidates the MM and ML tags (for example, by trimming or hard-clipping) or removes them entirely. The command requires the original base modification calls for each read you want to repair, and it projects those calls onto the sequences in the altered BAM.

The command takes two inputs, called the "donor" and the "acceptor". The donor contains the original, correct MM and ML tags; the acceptor is either missing MM and ML tags or has invalid ones (they are discarded either way). The reads in the donor must be a superset of the reads in the acceptor, meaning you can have extra reads in the donor BAM if some reads have been removed or filtered earlier in the workflow. Both the donor and the acceptor must be sorted by read name before running modkit repair. Duplicate reads in the acceptor are allowed so long as they have valid SEQ fields. Lastly, modkit repair only works on reads that have been trimmed; other kinds of alteration, such as run-length encoding, are not currently supported. Split reads, or other derived transformations, are not currently repairable with this command.
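Because the donor must contain every read in the acceptor, it can be worth checking this before running the repair. The following is a minimal sketch, not part of modkit: the helper name missing_from_donor and the two name-list files are hypothetical, and it assumes you have already written one read name per line to each file.

```shell
# Hypothetical helper (not a modkit command): print read names that occur in
# the acceptor list but not in the donor list.
#   $1 = text file of donor read names, one per line
#   $2 = text file of acceptor read names, one per line
missing_from_donor() {
    donor_sorted=$(mktemp)
    # comm needs sorted input; sort -u also collapses duplicate acceptor reads.
    LC_ALL=C sort -u "$1" > "$donor_sorted"
    # -13 hides names unique to the donor and names common to both,
    # leaving only names that are unique to the acceptor.
    LC_ALL=C sort -u "$2" | LC_ALL=C comm -13 "$donor_sorted" -
    rm -f "$donor_sorted"
}
```

The name lists can be produced with, for example, samtools view basecalls_5mC_5hmC.bam | cut -f1 > donor_names.txt, and likewise for the trimmed BAM. If the function prints nothing, every acceptor read has a counterpart in the donor.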

For example, a typical workflow may look like this:

# original base modification calls
basecalls_5mC_5hmC.bam

# basecalls that have been trimmed
trimmed.bam # could also be fastq, but would require conversion to BAM

# the two BAM files need to be sorted by read name
samtools sort -n trimmed.bam -O BAM > trimmed_read_sort.bam
samtools sort -n basecalls_5mC_5hmC.bam -O BAM > basecalls_5mC_5hmC_read_sort.bam

modkit repair \
    --donor-bam basecalls_5mC_5hmC_read_sort.bam \
    --acceptor-bam trimmed_read_sort.bam \
    --log-filepath modkit_repair.log \
    --output-bam trimmed_repaired.bam
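
After the repair finishes, a quick sanity check is to count how many output records carry both tags. The sketch below is an illustration rather than a modkit feature: count_mod_tagged is a hypothetical helper that reads SAM text on standard input and assumes the tags are written in the upper-case MM:Z: and ML:B: form.

```shell
# Hypothetical helper (not a modkit command): read SAM records on stdin and
# print "<tagged> <total>", where tagged counts records that carry both an
# MM and an ML tag.
count_mod_tagged() {
    awk -F '\t' '
        /^@/ { next }                      # skip SAM header lines
        {
            total++
            has_mm = 0; has_ml = 0
            for (i = 12; i <= NF; i++) {   # optional tags start at column 12
                if ($i ~ /^MM:Z:/) has_mm = 1
                if ($i ~ /^ML:B:/) has_ml = 1
            }
            if (has_mm && has_ml) tagged++
        }
        END { printf "%d %d\n", tagged, total }
    '
}
```

It could be used as samtools view trimmed_repaired.bam | count_mod_tagged, with the two numbers compared against each other and against the details written to modkit_repair.log.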