Find highly modified motif sequences

The modkit find-motifs command will attempt to summarize short genome sequences (motifs) that are more found to be highly modified (i.e. enriched for methylation). The input to this command is a bedMethyl generated by modkit pileup and the reference sequence used. For example, to run the command with default settings (recommended):

bedmethyl=/path/to/pileup.bed
ref=/path/to/reference.fasta

modkit motif search \
  -i ${bedmethyl} \
  -r ${ref} \
  -o ./motifs.tsv \
  --threads 32 \
  --log ./modkit_find_motifs_log.txt 

Specifying an output with -o will generate a machine-readable tab-separated-values file, a human-readable version of the table will always be logged to the terminal and the logfile.

Output format

All output tables are output in two formats, machine-readable and human-readable. The human-readable tables are always output to the log and terminal, the machine-readable tables are output to files specified on the command line.

Machine-readable table

columnnamedescriptiontype
1mod_codecode specifying the modification found in the motifstr
2motifsequence of identified motif using IUPAC codesstr
3offset0-based offset into the motif sequence of the modified baseint
4frac_modfraction of time this sequence is found in the high modified set col-5 / (col-5 + col-6)float
5high_countnumber of occurances of this sequence in the high-modified setint
6low_countnumber of occurances of this sequence in the low-modified setint
7mid_countnumber of occurances of this sequence in the mid-modified setint

Human-readable table

columnnamedescriptiontype
1motifhuman-readable representation of the motif sequence with the modification code in bracketsstr
2frac_modfraction of time this sequence is found in the high modified set col-3 / (col-3 + col-4)float
3high_countnumber of occurances of this sequence in the high-modified setint
4low_countnumber of occurances of this sequence in the low-modified setint
5mid_countnumber of occurances of this sequence in the mid-modified setint

Specifying known motifs

Multiple motif sequences suspected to be present can be specified with the --known-motif option. A machine-readable table of the motif sequences that are not found during the search can be specified with the --known-motifs-table option. Using this option will add two columns to the above tables:

namedescriptiontype
statusequal, Subset, Superset, or Disjoint describes the relationship of the discovered motif to the known motifstr
closest_known_motifof all motifs specified with --known-motif the one that is most similar to the discovered motifstr

If any of the known motifs are not found during the search process an additional table is also emitted in machine- and human-readable versions.

Machine-readable table

columnnamedescriptiontype
1mod_codecode specifying the modification found in the motifstr
2motifsequence of identified motif using IUPAC codesstr
3offset0-based offset into the motif sequence of the modified baseint
4frac_modfraction of time this sequence is found in the high modified set col-5 / (col-5 + col-6)float
5high_countnumber of occurances of this sequence in the high-modified setint
6low_countnumber of occurances of this sequence in the low-modified setint
7mid_countnumber of occurances of this sequence in the mid-modified setint
8statusequal, Subset, Superset, or Disjoint describes the relationship of the known motif to the closest discovered motifstr
9closest_found_motifwhich of the discovered motifs is most simuilar to the known motifstr

Human-readable table

columnnamedescriptiontype
1motifhuman-readable representation of the motif sequence with the modification code in bracketsstr
2frac_modfraction of time this sequence is found in the high modified set col-3 / (col-3 + col-4)float
3high_countnumber of occurances of this sequence in the high-modified setint
4low_countnumber of occurances of this sequence in the low-modified setint
5mid_countnumber of occurances of this sequence in the mid-modified setint
6statusequal, Subset, Superset, or Disjoint describes the relationship of the known motif to the closest discovered motifstr
7closest_found_motifwhich of the discovered motifs is most simuilar to the known motifstr

Simple description of the search algorithm

The first step in find-motifs is to categorize each genomic position in the pileup into one of three groups based on the fraction modified column in the bedMethyl:

  • Low-modified
  • Mid-modified
  • High-modified

The threshold values for each group can be set on the command line (--high-thresh and --low-thresh). For example, consider a high threshold of 0.6 and a low threshold of 0.2 the following 3 bedMethyl records would be put into the high-, low-, and mid-groups, respectively:

contig1      6       7       a       27      -       6       7       255,0,0 27      96.30   26      1       0       0       3       0       0
contig1      8       9       a       24      -       8       9       255,0,0 24      4.17    1       23      0       0       5       1       0
contig1      218     219     a       21      +       218     219     255,0,0 21      28.57   6       15      0       2       3       0       0

The sequence around each modified position is then collected from the reference FASTA file. The length of the sequence can be set with the --context-size option, accepting two values: the number of bases upstream and the number of bases downstream of the modified location. For example --context-size 12 12 will collect 12 bases upstream and 12 bases downstream of the modified base for a maximum motif length of 25 base pairs. The algorithm then iteratively expands, contracts, and merges sequences while the following criteria are met:

  • The number of occurrences in the high-modified set is greater than min_sites (set by --min-sites).
  • The fraction \( \frac{\textit{H}}{\textit{H} + \textit{L}} \), is greater than frac_mod (set with --min-frac-mod), where \( \textit{H} \) and \( \textit{L} \) is the number of total sequence contexts in the high-modified and low-modified set, respectively.
  • The log-odds of the context being in the high-modified category is greater than min_log_odds (set with --min-log-odds).

Once a motif sequence cannot be changed (made more general or more restrictive) without violating one of these criteria, the motif sequence is considered complete. As the algorithm continues, context sequences that match discovered sequences are removed from consideration.

A secondary search step is also performed by starting with every k-mer (where k is less than the total sequence length, 3 by default, set with --exhaustive-seed-len) at every motif position. The log-odds threshold for this search is usually higher and set with --exhaustive-seed-min-log-odds. Decreasing this value can drastically increase computational time.

The default parameters have been picked to be sufficiently sensitive, however if you decide to adjust the parameters in general increasing sensitivity will increase compute time.

  1. Increasing the --min-frac-mod will stop search earlier which will decrease compute time.
  2. Decreasing --min-sites has the largest effect and can especially cause the secondary search to crawl more sequences. Decreasing --min-sites along with --skip-search may be a useful technique to find very rare sequence motifs.
  3. Increasing --exhaustive-seed-min-log-odds can drastically decrease compute time (sometimes while maintaining sensitivity).

Also consider the additional steps in performance considerations.