Constructing bedMethyl tables.
A primary use of modkit is to create summary counts of modified and unmodified bases in an extended bedMethyl format.
bedMethyl files tabulate the counts of base modifications from every sequencing read over each aligned reference genomic position.
In order to create a bedMethyl table, your modBAM must be aligned to a reference genome or transcriptome.
Although not required, providing a reference and using the --modified-bases option will provide the clearest results with the best computational performance.
Only primary alignments are used in generating the table, it is recommended to mark duplicate alignments before running as multiple primary alignments can be double counted.
Recommended usage
Important
Changes for v0.6.0
For best performance use the --modified-bases option with the base modifications you intend to analyze.
For example:
modkit pileup \
path/to/reads.bam \
path/to/output.bed.gz \
--modified-bases 5mC 5hmC \
--reference path/to/reference.fasta \
--log path/to/log.txt \ # optional, recommended
--bgzf \ # optional
Tip
Note when using a transcriptome-aligned modBAM (for direct RNA), pass the
--preload-referencesflag to increase performance.
Passing --modified-bases not required, but directs Modkit to use optimized routines which will lead to better efficiency.
This option does require a FASTA reference, and will only emit bedmethyl records for base modifications to the primary sequence base in the reference.
For example, a command with the option --modified-bases 5mC 5hmC 6mA will not have 6mA records on genomic Cytosine locations where a read has a C>A mismatch nor 5mC/5hmC records at genomic Adenine locations.
For more details on how this option changes results from previous versions see migrating to v0.6.0.
A subset of the base modifications present in the modBAM can specified
For example, passing the option --modified-bases 5mC when the modBAM contains 5mC and 5hmC calls will only produce 5mC records.
Note
If the reference FASTA does not have an index at
path/to/reference.fasta.faione will be created.
Syntax of --modified-bases
You may pass the “long name” such as “5mC” or a primary base and the “short name” separated by a :.
For example, --modified-bases 5mC 5hmC 6mA and --modified-bases C:m C:h A:a are equivalent.
ChEBI codes can also be used.
For example, a to make a pileup for a modBAM with direct RNA reads you may use: --modified-bases A:17596 A:69426 A:a C:m C:19228 G:19229.
Tip
If you don’t know which modified bases are present in your modBAM run:
modkit modbam check-tags $bam --head 100.
Running without a reference
A reference and --modified-bases is not required, for example:
modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log
This command will produce a bedMethyl record for every position in the reference for which there is at least one base modification call (either modified or unmodified).
This invocation may be slower than using --modified-bases with a reference.
In both cases, a single file (described below) with base count summaries will be created.
The program performs best-practices filtering and manipulation of the raw data stored in the input file. For further details see filtering modified-base calls.
Common options that change how base modifications are tabulated
Narrowing output to CpG dinucleotides
For user convenience, the counting process can be modulated using several additional transforms and filters.
The most basic of these is to report only counts from reference CpG dinucleotides, this behavior is enabled by passing the --cpg flag.
Any motif can be used see below.
modkit pileup path/to/reads.bam output/path/pileup.bed \
--cpg \
--modified-bases 5mC 5hmC \
--ref path/to/reference.fasta
Specifying --combine-mods and/or --combine-strands to simplify output
With palindromic motifs, base modification counts can be summed across strands with --combine-strands:
modkit pileup \
path/to/reads.bam \
output/path/pileup.bed \
--modified-bases 5mC 5hmC \
--cpg \
--combine-strands \
--ref path/to/reference.fasta \
Note that the strand field of the output will be marked as ‘.’ indicating that the strand information has been lost.
Finally, --combine-mods can be used to produce a bedMethyl with binary “modified/not-modified” counts and percentages:
modkit pileup \
path/to/reads.bam \
output/path/pileup.bed \
--modified-bases 5mC 5hmC \
--combine-mods \
--ref path/to/reference.fasta \
[--cpg] \ # optional
[--combine-strands] \ #optional
As can be seen in the above example, these flags can be combined to generate the following output types (flags in [braces] indicate that it is optional):
- Nothing: All modifications at each, stranded, position.
--cpg: Narrow output to only CpGs.--combine-mods [--cpg]: sum counts of all modification types together.--combine-strands --cpg [--combine-mods]: Sum (+)-strand and (-)-strand counts onto the (+)-strand position.
To restrict output to only certain CpGs, pass the --include-bed option with the CpGs or regions to be used, see this page for more details.
modkit pileup path/to/reads.bam output/path/pileup.bed \
--modified-bases 5mC 5hmC \
--cpg \
--ref path/to/reference.fasta \
--include-bed path/to/my_cpgs.bed
Narrowing output to specific motifs
By default, modkit will output a BED row for all genomic positions where there is at least one base modification for the reference base in the input modBAM.
We define a motif as a short DNA sequence potentially containing degenerate codes.
To ease downstream analysis, the --motif <Motif> <offset, 0-based> option can be used to pre-filter and annotate the bedMethyl rows.
The --cpg flag is a alias for --motif CG 0 where the sequence motif is CG and the offset is 0, meaning pileup base modification counts for the first C in the motif on the top strand the second C (complement to G) on the bottom strand.
Another example may be --motif GATC 1, signaling to pileup counts for the A in the second position on the top strand and the A in the third position on the bottom strand.
When multiple motifs are specified the name column (column 4), will indicate which motif the counts are tabulated for.
For example, if --motif CGCG 2 --motif CG 0 are passed you may see lines such as:
oligo_741_adapters 39 40 m,CG,0 4 - 39 40 255,0,0 4 100.00 4 0 0 0 0 0 0
oligo_741_adapters 39 40 m,CGCG,2 4 - 39 40 255,0,0 4 100.00 4 0 0 0 0 0 0
The --combine-strands flag can be combined with --motif however all motifs must be reverse-complement palindromic (CG is a palindrome but CHH is not).
Only one motif at a time is supported with --combine-strands is used (see limitations for details).
Partitioning reads based on phasing information with --phased
If you have a modBAM with phased reads containing a HP tag. These can be partitioned into separate bedMethyl files on output by passing the --phased flag.
modkit pileup path/to/reads.bam output/directory/ \
--cpg --modified-bases 5mC 5hmC --ref <reference.fasta> --phased
The output will be 3 files: hp1.bedmethyl, hp2.bedmethyl, and combined.bedmethyl.
hp1.bedmethyl and hp2.bedmethyl contain counts for records with HP=1 and HP=2 tags, respectively. combined.bedmethyl contains counts for all modBAM records.
For more information on the individual options see the Advanced Usage help document.
Description of bedMethyl output.
Below is a description of the bedMethyl columns generated by modkit pileup. A brief description of the
bedMethyl specification can be found on Encode.
Definitions:
- Nmod - Number of calls passing filters that were classified as a residue with a specified base modification.
- Ncanonical - Number of calls passing filters were classified as the canonical base rather than modified. The
exact base must be inferred by the modification code. For example, if the modification code is
m(5mC) then the canonical base is cytosine. If the modification code isa, the canonical base is adenine. - Nother mod - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with
hcalls, 1 with a canonical call, and 2 withmcalls. In the bedMethyl row forhNother_mod would be 2. In themrow Nother_mod would be 3. - Nvalid_cov - the valid coverage. Nvalid_cov = Nmod + Nother_mod + Ncanonical, also used as the
scorein the bedMethyl - Ndiff - Number of reads with a base other than the canonical base for this modification. For example, in a row
for
hthe canonical base is cytosine, if there are 2 reads with C->A substitutions, Ndiff will be 2. - Ndelete - Number of reads with a deletion at this reference position
- Nfail - Number of calls where the probability of the call was below the threshold. The threshold can be set on the command line or computed from the data (usually failing the lowest 10th percentile of calls).
- Nnocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a CG->CH substitution such that no modification call was produced by the basecaller.
bedMethyl column descriptions.
| column | name | description | type |
|---|---|---|---|
| 1 | chrom | name of reference sequence from BAM header | str |
| 2 | start position | 0-based start position | int |
| 3 | end position | 0-based exclusive end position | int |
| 4 | modified base code and motif | single letter code for modified base and motif when more than one motif is used | str |
| 5 | score | equal to Nvalid_cov | int |
| 6 | strand | ‘+’ for positive strand ‘-’ for negative strand, ‘.’ when strands are combined | str |
| 7 | start position | included for compatibility | int |
| 8 | end position | included for compatibility | int |
| 9 | color | included for compatibility, always 255,0,0 | str |
| 10 | Nvalid_cov | see definitions above. | int |
| 11 | percent modified | (Nmod / Nvalid_cov) * 100 | float |
| 12 | Nmod | see definitions above | int |
| 13 | Ncanonical | see definitions above | int |
| 14 | Nother_mod | see definitions above | int |
| 15 | Ndelete | see definitions above | int |
| 16 | Nfail | see definitions above | int |
| 17 | Ndiff | see definitions above | int |
| 18 | Nnocall | see definitions above | int |
Performance considerations
Important
Changes for v0.6.0
As of version 0.6.0 the efficiency of pileup has been greatly improved.
As a result, fewer threads are often required to get high throughput.
For example, increasing threads beyond 8 a 40X human genome BAM will often not yield much benefit.