Find highly modified motif sequences

The modkit find-motifs command will attempt to summarize short genome sequences (motifs) that are more found to be highly modified (i.e. enriched for methylation). The input to this command is a bedMethyl generated by modkit pileup and the reference sequence used. For example, to run the command with default settings (recommended):

bedmethyl=/path/to/pileup.bed
ref=/path/to/reference.fasta

modkit motif search \
  -i ${bedmethyl} \
  -r ${ref} \
  -o ./motifs.tsv \
  --threads 32 \
  --log ./modkit_find_motifs_log.txt

Specifying an output with -o will generate a machine-readable tab-separated-values file, a human-readable version of the table will always be logged to the terminal and the logfile.

If you have compressed your bedMethyl table with bgzip you can use the compressed table as well. A compressed table with tabix-generated index is required to use the --contig option which will find motifs using only a single contig in the bedMethyl table. This may be useful for applications such as metagenomics.

Output format

All output tables are output in two formats, machine-readable and human-readable. The human-readable tables are always output to the log and terminal, the machine-readable tables are output to files specified on the command line.

Machine-readable table

column	name	description	type
1	mod_code	code specifying the modification found in the motif	str
2	motif	sequence of identified motif using IUPAC codes	str
3	offset	0-based offset into the motif sequence of the modified base	int
4	frac_mod	fraction of time this sequence is found in the high modified set col-5 / (col-5 + col-6)	float
5	high_count	number of occurances of this sequence in the high-modified set	int
6	low_count	number of occurances of this sequence in the low-modified set	int
7	mid_count	number of occurances of this sequence in the mid-modified set	int

Human-readable table

column	name	description	type
1	motif	human-readable representation of the motif sequence with the modification code in brackets	str
2	frac_mod	fraction of time this sequence is found in the high modified set col-3 / (col-3 + col-4)	float
3	high_count	number of occurances of this sequence in the high-modified set	int
4	low_count	number of occurances of this sequence in the low-modified set	int
5	mid_count	number of occurances of this sequence in the mid-modified set	int

Specifying known motifs

Multiple motif sequences suspected to be present can be specified with the --known-motif option. A machine-readable table of the motif sequences that are not found during the search can be specified with the --known-motifs-table option. Using this option will add two columns to the above tables:

name	description	type
status	equal, Subset, Superset, or Disjoint describes the relationship of the discovered motif to the known motif	str
closest_known_motif	of all motifs specified with `--known-motif` the one that is most similar to the discovered motif	str

If any of the known motifs are not found during the search process an additional table is also emitted in machine- and human-readable versions.

Machine-readable table

column	name	description	type
1	mod_code	code specifying the modification found in the motif	str
2	motif	sequence of identified motif using IUPAC codes	str
3	offset	0-based offset into the motif sequence of the modified base	int
4	frac_mod	fraction of time this sequence is found in the high modified set col-5 / (col-5 + col-6)	float
5	high_count	number of occurances of this sequence in the high-modified set	int
6	low_count	number of occurances of this sequence in the low-modified set	int
7	mid_count	number of occurances of this sequence in the mid-modified set	int
8	status	equal, Subset, Superset, or Disjoint describes the relationship of the known motif to the closest discovered motif	str
9	closest_found_motif	which of the discovered motifs is most simuilar to the known motif	str

Human-readable table

column	name	description	type
1	motif	human-readable representation of the motif sequence with the modification code in brackets	str
2	frac_mod	fraction of time this sequence is found in the high modified set col-3 / (col-3 + col-4)	float
3	high_count	number of occurances of this sequence in the high-modified set	int
4	low_count	number of occurances of this sequence in the low-modified set	int
5	mid_count	number of occurances of this sequence in the mid-modified set	int
6	status	equal, Subset, Superset, or Disjoint describes the relationship of the known motif to the closest discovered motif	str
7	closest_found_motif	which of the discovered motifs is most simuilar to the known motif	str

Simple description of the search algorithm

The first step in find-motifs is to categorize each genomic position in the pileup into one of three groups based on the fraction modified column in the bedMethyl:

Low-modified
Mid-modified
High-modified

The threshold values for each group can be set on the command line (--high-thresh and --low-thresh). For example, consider a high threshold of 0.6 and a low threshold of 0.2 the following 3 bedMethyl records would be put into the high-, low-, and mid-groups, respectively:

contig1      6       7       a       27      -       6       7       255,0,0 27      96.30   26      1       0       0       3       0       0
contig1      8       9       a       24      -       8       9       255,0,0 24      4.17    1       23      0       0       5       1       0
contig1      218     219     a       21      +       218     219     255,0,0 21      28.57   6       15      0       2       3       0       0

The sequence around each modified position is then collected from the reference FASTA file. The length of the sequence can be set with the --context-size option, accepting two values: the number of bases upstream and the number of bases downstream of the modified location. For example --context-size 12 12 will collect 12 bases upstream and 12 bases downstream of the modified base for a maximum motif length of 25 base pairs. The algorithm then iteratively expands, contracts, and merges sequences while the following criteria are met:

The number of occurrences in the high-modified set is greater than min_sites (set by --min-sites).
The fraction \( \frac{\textit{H}}{\textit{H} + \textit{L}} \), is greater than frac_mod (set with --min-frac-mod), where \( \textit{H} \) and \( \textit{L} \) is the number of total sequence contexts in the high-modified and low-modified set, respectively.
The log-odds of the context being in the high-modified category is greater than min_log_odds (set with --min-log-odds).

Once a motif sequence cannot be changed (made more general or more restrictive) without violating one of these criteria, the motif sequence is considered complete. As the algorithm continues, context sequences that match discovered sequences are removed from consideration.

A secondary search step, called the "exhaustive search", is also performed by starting with every k-mer (where k is less than the total sequence length, 3 by default, set with --exhaustive-seed-len) at every motif position. The log-odds threshold for this search is usually higher and set with --exhaustive-seed-min-log-odds. Decreasing this value can drastically increase computational time, see the next section for more details.

Options to decrease or limit search time.

The optional (but recommended) exhaustive search step checks many sparsely defined sequences, called seeds, for enrichment as defined by --exhaustive-seed-min-log-odds. The algorithm then refines the seeds above the threshold into candidate motifs using the method described previously. An example seed is GNNNNNNNNNNC[m]NNNNNNNNNNNA using the modification code for 5mC. The number of seeds is defined by the --context-size and --exhaustive-seed-length parameters. For example, if --context-size a b is passed the seeds will be length \( a + b + 1 \). The default value is a = 12, b = 12 so the sequence length is 25, but we only use 24 of those positions, call this \( n \). The default value for --exhaustive-seed-length is 3, call this \( k \). Remember that there are 4 primary sequence bases. So the number of seeds to search is \( k^4 * \binom{n}{k} \) or 129,536 seeds. Most of these seeds will fall below the log-odds required, and not follow-on to refinement. However, if a large number ( \( \gt 500 \) ) are above the log-odds threshold then this step can be very time consuming. There are a couple options that can decrease or limit this run time, whist usually affording identical or very good results.

A simple timeout. When the search starts, the top-N (as ranked by log-odds) seeds are taken. The search is run on this "batch", if the timeout has not expired the next batch of N is taken. If all of the seeds are not evaluated before running out of time, the results are returned but an ERROR message is displayed.

# specify that search should stop after 15 minutes
$ modkit motif search ... --search-timeout 15m

Top percent. Only search the top P-% of seeds (or a maximum number). So a command like this --search-top-pct 20 --max-exhaustive-seeds 200 would search the top 20% of seeds as ranked by their log-odds for being motifs or the top 200 seeds if 20% of all of the seeds passing --exhaustive-seed-min-log-odds is greater than 200. n.b. With default settings, in the most pathological settings 129,536 could be searched.

# only search the min(top 20% of seeds, 100 seeds)
$ modkit motif search ... --search-top-pct 20 --max-exhaustive-seeds 100

Batch and Narrow Optimization. This algorithm is a combination of (1) and (2). In this scenario, we take the top P-% of seeds (or the maximum allowed) as in (2). But instead of stopping as in (2) we continue (as in (1)) but we remove contexts from consideration based on the motifs we've found before the next round. At the end of the round, if zero motifs are found, we stop.

# same as (2), but continue to search after each batch
$ modkit motif search .. --search-top-pct 20 --max-exhaustive-seeds 100 --narrow-search

Batch and Narrow Optimization with Timeout. This scenario is the same as (3), but if we've run out of time at the end of a round - stop.

# same as (3), but continue to search after each batch, until we've spent 15 minutes searching.
$ modkit motif search .. --search-top-pct 20 --max-exhaustive-seeds 100 --narrow-search --search-timeout 15m

Tuning parameters and `--skip-search`

The default parameters have been picked to be sufficiently sensitive, however if you decide to adjust the parameters in general increasing sensitivity will increase compute time.

Increasing the --min-frac-mod will stop search earlier which will decrease compute time.
Decreasing --min-sites has the largest effect and can especially cause the secondary search to crawl more sequences. Decreasing --min-sites along with --skip-search may be a useful technique to find very rare sequence motifs.
Increasing --exhaustive-seed-min-log-odds can drastically decrease compute time (sometimes while maintaining sensitivity).

Also consider the additional steps in performance considerations.

Investigating how motifs are evaluated

The motif search function uses JSON-lines structured logging that can be searched and queried for specific events. The schema and some examples are described in another section.

Modkit