Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Extracting base modification information

The modkit extract full sub-commands will produce a table containing the base modification probabilities, the read sequence context, and optionally aligned reference information. For extract full and extract calls, if a correct MN tag is found, secondary and supplementary alignments may be output with the --allow-non-primary flag. See troubleshooting for details.

The table will by default contain unmapped sections of the read (soft-clipped sections, for example). To only include mapped bases use the --mapped flag. To only include sites of interest, pass a BED-formatted file to the --include-bed option. Similarly, to exclude sites, pass a BED-formatted file to the --exclude option. One caution, the files generated by modkit extract can be large (2-2.5x the size of the BAM). You may want to either use the --num-reads option, the --region option, or pre-filter the modBAM ahead of time. You can also stream the output to stdout by setting the output to - or stdout and filter the columns before writing to disk.

Description of output table for extract full

columnnamedescriptiontype
1read_idname of the readstr
2forward_read_position0-based position on the forward-oriented read sequenceint
3ref_positionaligned 0-based reference sequence position, -1 means unmappedint
4chromname of aligned contig, or ‘.’ if the read is Gunmappedstr
5mod_strandstrand of the molecule the base modification is onstr
6ref_strandstrand of the reference the read is aligned to, or ‘.’ if unmappedstr
7ref_mod_strandstrand of the reference with the base modification, or ‘.’ if unmappedstr
8fw_soft_clipped_startnumber of bases soft clipped from the start of the forward-oriented readint
9fw_soft_clipped_endnumber of bases soft clipped from the end of the forward-oriented readint
10alignment_startleftmost (i.e. smallest) aligned reference positionint
11alignment_endrightmost (i.e. largest) aligned reference positionint
12read_lengthtotal length of the readint
13mod_qualprobability of the base modification in the next columnint
14mod_codebase modification code from the MM tagstr
15base_qualbasecall quality score (phred)int
16ref_kmerreference 5-mer sequence context (center base is aligned base), ‘.’ if unmappedstr
17query_kmerread 5-mer sequence context (center base is aligned base)str
18canonical_basecanonical base from the query sequence, from the MM tagstr
19modified_primary_baseprimary sequence base with the modificationstr
20inferredwhether the base modification call is implicit canonicalstr
21flagFLAG from alignment recordstr
22motifscomma-separated list of reference motifs matching at this position, only present when --motifs or --cpg is usedstr

Tabulating base modification calls for each read position with extract calls

The modkit extract calls command will generate a table of read-level base modification calls using the same thresholding algorithm employed by modkit pileup. The resultant table has, for each read, one row for each base modification call in that read. If a base is called as modified then call_code will be the code in the MM tag. If the base is called as canonical the call_code will be - (A, C, G, and T are reserved for “any modification”). The full schema of the table is below:

columnnamedescriptiontype
1read_idname of the readstr
2forward_read_position0-based position on the forward-oriented read sequenceint
3ref_positionaligned 0-based reference sequence position, -1 means unmappedint
4chromname of aligned contig, or ‘.’ if unmappedstr
5mod_strandstrand of the molecule the base modification is onstr
6ref_strandstrand of the reference the read is aligned to, or ‘.’ if unmappedstr
7ref_mod_strandstrand of the reference with the base modification, or ‘.’ if unmappedstr
8fw_soft_clipped_startnumber of bases soft clipped from the start of the forward-oriented readint
9fw_soft_clipped_endnumber of bases soft clipped from the end of the forward-oriented readint
10alignment_startleftmost (i.e. smallest) aligned reference positionint
11alignment_endrightmost (i.e. largest) aligned reference positionint
12read_lengthtotal length of the readint
13call_probprobability of the base modification call in the next columnint
14call_codebase modification call, - indicates a canonical callstr
15base_qualbasecall quality score (phred)int
16ref_kmerreference 5-mer sequence context (center base is aligned base), ‘.’ if unmappedstr
17query_kmerread 5-mer sequence context (center base is aligned base)str
18canonical_basecanonical base from the query sequence, from the MM tagstr
19modified_primary_baseprimary sequence base with the modificationstr
20failtrue if the base modification call fell below the pass thresholdstr
21inferredwhether the base modification call is implicit canonicalstr
22within_alignmentwhen alignment information is present, is this base aligned to the referencestr
23flagFLAG from alignment recordstr
24motifscomma-separated list of reference motifs matching at this position, only present when --motifs or --cpg is usedstr

Note on implicit base modification calls.

The . MM flag indicates that primary sequence bases without an associated base modification probability should be inferred to be canonical. By default, when this flag is encountered in a modBAM, modkit extract will output rows with the inferred column set to true and a mod_qual value of 0.0 for the base modifications called on that read. For example, if you have a A+a. MM tag, and there are A bases in the read for which there aren’t base modification calls (identifiable as non-0s in the MM tag) will be rows where the mod_code is a and the mod_qual is 0.0.

Note on non-primary alignments

If a valid MN tag is found, secondary and supplementary alignments can be output in the modkit extract tables above. See troubleshooting for details on how to get valid MN tags. To have non-primary alignments appear in the output, the --allow-non-primary flag must be passed. By default, the primary alignment will have all base modification information contained on the read, including soft-clipped and unaligned read positions. If the --mapped-only flag is used, soft clipped sections of the read will not be included. For secondary and supplementary alignments, soft-clipped positions are not repeated. See advanced usage for more details.

Tabulating per-read base modification content with extract read-stats.

Produce a table where modification counts are summarized on the read level. This table will have one record (line) per valid read and count the number of modified and unmodified bases for each base modification specified on the command line. To specify which modifications to use, pass --mod-codes {primary_base}:{mod_code} on the command line. Multiple {primary_base}:{mod_code} pairs can be passed, separated by spaces. For example --mod-codes C:m C:m will count the number of m (5mC) and h (5hmC) calls per-read. Modification codes that are encountered, but not specified on the command line will be added to the other_modified_{primary_base} count. For example, if you have direct RNA reads with m6A and Inosine calls you can use --mod-codes A:a and Modkit will count the m6A calls on each read and report it in the modified_a column. Inosine calls will be counted in other_modified_A.

Using a filter-threshold

By default extract read-stats will simply tabulate the calls for each base modification on each read. The “call” is defined as the state with the highest probability. Passing --filter-threshold to extract read-stats will filter calls based on the probability, similar to what other functions do (see filtering modified-base calls for details). A filter threshold for each primary sequence base can be used (e.g. --filter-threshold A:0.8 --filter-threshold C:0.9) which will use 0.8 and 0.9 for any base call on adenine or cytosine base, respectively. To use a filter threshold for a specific base modification (e.g. for 6mA calls or 5hmC calls only) pass --mod-thresholds {mod_code}:{threshold} (e.g. --mod-thresholds a:0.9 or --mod-thresholds h:0.8). When any threshold is used, extra columns are added to the output: fail_modified_{mod_code} and fail_unmodified_{primary_base}. These columns contain the counts of modified or unmodified calls that are below the pass threshold for that base/modification.

Schema of output for modkit extract read-stats without filtering:

columnnamedescriptiontype
1read_idname of the BAM recordstr
2chromname of the contig the read is mapped to or “unmapped” for unmapped readsstr
3aln_startleftmost aligned position of the read on the reference genome (-1 for unmapped reads)int
3+n_basesunmodified_{A|C |G |T}number of unmodified (canonical) calls for this primary sequebce baseint
3+n_bases+n_modsmodified_{modification_code}number of modified bases of this type on this readint
3+n_bases+n_mods+1+n_basesother_modified_{A|C|G|T}number of modified bases not specified on the command lineint
3+2*n_bases+n_modsread_lengthlength of SEQ in the BAM recordint

Example usages:

Extract base modification counts for 5mC, 5hmC, and 6mA on each read in a BAM

modkit extract read-stats <input.bam> <output.csv> --mod-codes C:m C:h A:a

To enable filtering, use --filter-threshold (and optionally --mod-thresholds)

modkit extract read-stats <input.bam> <output.csv> --mod-codes C:m C:h A:a --filter-threshold 0.7 --mod-thresholds a:0.9

Extract base modification counts for m6A, Inosine, and pseudouridine from a direct RNA modBAM:

modkit extract read-stats <input.bam> <output.csv> --mod-codes A:a A:17596 T:17802

Extract a table of base modification probabilities from an aligned and indexed BAM

modkit extract full <input.bam> <output.tsv> [--bgzf]

If the index input.bam.bai can be found, intervals along the aligned genome can be performed in parallel. The optional --bgzf flag will emit compressed output.

Extract a table from a region of a large modBAM

The below example will extract reads from only chr20, and include reference sequence context

modkit extract full <intput.bam> <output.tsv> --region chr20 --ref <ref.fasta>

Extract only sites aligned to a CG motif

modkit motif bed <reference.fasta> CG 0 > CG_motifs.bed
modkit extract full <in.bam> <out.tsv> --ref <ref.fasta> --include-bed CG_motifs.bed

Extract only sites that are at least 50 bases from the ends of the reads

modkit extract full <in.bam> <out.tsv> --edge-filter 50

Extract read-level base modification calls

modkit extract calls <input.bam> <calls.tsv>

Use --allow-non-primary to get secondary and supplementary mappings in the output.

modkit extract calls <input.bam> <output.tsv> --allow-non-primary

See the help string and/or advanced_usage for more details and performace considerations if you encounter issues with memory usage.