Extracting base modification information
The modkit extract full
sub-commands will produce a table containing the base modification probabilities, the read sequence context, and optionally aligned reference information.
For extract full
and extract calls
, if a correct MN
tag is found, secondary and supplementary alignments may be output with the --allow-non-primary
flag.
See troubleshooting for details.
The table will by default contain unmapped sections of the read (soft-clipped sections, for example).
To only include mapped bases use the --mapped
flag. To only include sites of interest, pass a
BED-formatted file to the --include-bed
option. Similarly, to exclude sites, pass a BED-formatted
file to the --exclude
option. One caution, the files generated by modkit extract
can be large (2-2.5x
the size of the BAM). You may want to either use the --num-reads
option, the --region
option, or
pre-filter the modBAM ahead of time. You can also stream the output to stdout by setting the output to -
or stdout
and filter the columns before writing to disk.
Description of output table for extract full
column | name | description | type |
---|---|---|---|
1 | read_id | name of the read | str |
2 | forward_read_position | 0-based position on the forward-oriented read sequence | int |
3 | ref_position | aligned 0-based reference sequence position, -1 means unmapped | int |
4 | chrom | name of aligned contig, or '.' if the read is Gunmapped | str |
5 | mod_strand | strand of the molecule the base modification is on | str |
6 | ref_strand | strand of the reference the read is aligned to, or '.' if unmapped | str |
7 | ref_mod_strand | strand of the reference with the base modification, or '.' if unmapped | str |
8 | fw_soft_clipped_start | number of bases soft clipped from the start of the forward-oriented read | int |
9 | fw_soft_clipped_end | number of bases soft clipped from the end of the forward-oriented read | int |
10 | read_length | total length of the read | int |
11 | mod_qual | probability of the base modification in the next column | int |
12 | mod_code | base modification code from the MM tag | str |
13 | base_qual | basecall quality score (phred) | int |
14 | ref_kmer | reference 5-mer sequence context (center base is aligned base), '.' if unmapped | str |
15 | query_kmer | read 5-mer sequence context (center base is aligned base) | str |
16 | canonical_base | canonical base from the query sequence, from the MM tag | str |
17 | modified_primary_base | primary sequence base with the modification | str |
18 | inferred | whether the base modification call is implicit canonical | str |
19 | flag | FLAG from alignment record | str |
Tabulating base modification calls for each read position with extract calls
The modkit extract calls
command will generate a table of read-level base modification calls using the same thresholding algorithm employed by modkit pileup
.
The resultant table has, for each read, one row for each base modification call in that read.
If a base is called as modified then call_code
will be the code in the MM
tag. If the base is called as canonical the call_code
will be -
(A
, C
, G
, and T
are
reserved for "any modification"). The full schema of the table is below:
column | name | description | type |
---|---|---|---|
1 | read_id | name of the read | str |
2 | forward_read_position | 0-based position on the forward-oriented read sequence | int |
3 | ref_position | aligned 0-based reference sequence position, -1 means unmapped | int |
4 | chrom | name of aligned contig, or '.' if unmapped | str |
5 | mod_strand | strand of the molecule the base modification is on | str |
6 | ref_strand | strand of the reference the read is aligned to, or '.' if unmapped | str |
7 | ref_mod_strand | strand of the reference with the base modification, or '.' if unmapped | str |
8 | fw_soft_clipped_start | number of bases soft clipped from the start of the forward-oriented read | int |
9 | fw_soft_clipped_end | number of bases soft clipped from the end of the forward-oriented read | int |
10 | read_length | total length of the read | int |
11 | call_prob | probability of the base modification call in the next column | int |
12 | call_code | base modification call, - indicates a canonical call | str |
13 | base_qual | basecall quality score (phred) | int |
14 | ref_kmer | reference 5-mer sequence context (center base is aligned base), '.' if unmapped | str |
15 | query_kmer | read 5-mer sequence context (center base is aligned base) | str |
16 | canonical_base | canonical base from the query sequence, from the MM tag | str |
17 | modified_primary_base | primary sequence base with the modification | str |
18 | fail | true if the base modification call fell below the pass threshold | str |
19 | inferred | whether the base modification call is implicit canonical | str |
20 | within_alignment | when alignment information is present, is this base aligned to the reference | str |
21 | flag | FLAG from alignment record | str |
Note on implicit base modification calls.
The .
MM flag indicates that primary sequence bases without an associated base modification probability
should be inferred to be canonical. By default, when this flag is encountered in a modBAM, modkit extract
will
output rows with the inferred
column set to true
and a mod_qual
value of 0.0
for the base modifications
called on that read. For example, if you have a A+a.
MM tag, and there are A
bases in the read for which
there aren't base modification calls (identifiable as non-0s in the MM tag) will be rows where the mod_code
is a
and the mod_qual
is 0.0.
Note on non-primary alignments
If a valid MN
tag is found, secondary and supplementary alignments can be output in the modkit extract
tables above.
See troubleshooting for details on how to get valid MN
tags.
To have non-primary alignments appear in the output, the --allow-non-primary
flag must be passed.
By default, the primary alignment will have all base modification information contained on the read, including soft-clipped and unaligned read positions.
If the --mapped-only
flag is used, soft clipped sections of the read will not be included.
For secondary and supplementary alignments, soft-clipped positions are not repeated. See advanced usage for more details.
Example usages:
Extract a table of base modification probabilities from an aligned and indexed BAM
modkit extract full <input.bam> <output.tsv> [--bgzf]
If the index input.bam.bai
can be found, intervals along the aligned genome can be performed
in parallel. The optional --bgzf
flag will emit compressed output.
Extract a table from a region of a large modBAM
The below example will extract reads from only chr20, and include reference sequence context
modkit extract full <intput.bam> <output.tsv> --region chr20 --ref <ref.fasta>
Extract only sites aligned to a CG motif
modkit motif bed <reference.fasta> CG 0 > CG_motifs.bed
modkit extract full <in.bam> <out.tsv> --ref <ref.fasta> --include-bed CG_motifs.bed
Extract only sites that are at least 50 bases from the ends of the reads
modkit extract full <in.bam> <out.tsv> --edge-filter 50
Extract read-level base modification calls
modkit extract calls <input.bam> <calls.tsv>
Use --allow-non-primary
to get secondary and supplementary mappings in the output.
modkit extract calls <input.bam> <output.tsv> --allow-non-primary
See the help string and/or advanced_usage for more details and performace considerations if you encounter issues with memory usage.