Find highly modified motif sequences
The modkit find-motifs
command will attempt to summarize short genome sequences (motifs) that are more found to be highly modified (i.e. enriched for methylation).
The input to this command is a bedMethyl generated by modkit pileup
and the reference sequence used.
For example, to run the command with default settings (recommended):
bedmethyl=/path/to/pileup.bed
ref=/path/to/reference.fasta
modkit motif search \
-i ${bedmethyl} \
-r ${ref} \
-o ./motifs.tsv \
--threads 32 \
--log ./modkit_find_motifs_log.txt
Specifying an output with -o
will generate a machine-readable tab-separated-values file, a human-readable version of the table will always be logged to the terminal and the logfile.
Output format
All output tables are output in two formats, machine-readable and human-readable. The human-readable tables are always output to the log and terminal, the machine-readable tables are output to files specified on the command line.
Machine-readable table
column | name | description | type |
---|---|---|---|
1 | mod_code | code specifying the modification found in the motif | str |
2 | motif | sequence of identified motif using IUPAC codes | str |
3 | offset | 0-based offset into the motif sequence of the modified base | int |
4 | frac_mod | fraction of time this sequence is found in the high modified set col-5 / (col-5 + col-6) | float |
5 | high_count | number of occurances of this sequence in the high-modified set | int |
6 | low_count | number of occurances of this sequence in the low-modified set | int |
7 | mid_count | number of occurances of this sequence in the mid-modified set | int |
Human-readable table
column | name | description | type |
---|---|---|---|
1 | motif | human-readable representation of the motif sequence with the modification code in brackets | str |
2 | frac_mod | fraction of time this sequence is found in the high modified set col-3 / (col-3 + col-4) | float |
3 | high_count | number of occurances of this sequence in the high-modified set | int |
4 | low_count | number of occurances of this sequence in the low-modified set | int |
5 | mid_count | number of occurances of this sequence in the mid-modified set | int |
Specifying known motifs
Multiple motif sequences suspected to be present can be specified with the --known-motif
option.
A machine-readable table of the motif sequences that are not found during the search can be specified with the --known-motifs-table
option.
Using this option will add two columns to the above tables:
name | description | type |
---|---|---|
status | equal, Subset, Superset, or Disjoint describes the relationship of the discovered motif to the known motif | str |
closest_known_motif | of all motifs specified with --known-motif the one that is most similar to the discovered motif | str |
If any of the known motifs are not found during the search process an additional table is also emitted in machine- and human-readable versions.
Machine-readable table
column | name | description | type |
---|---|---|---|
1 | mod_code | code specifying the modification found in the motif | str |
2 | motif | sequence of identified motif using IUPAC codes | str |
3 | offset | 0-based offset into the motif sequence of the modified base | int |
4 | frac_mod | fraction of time this sequence is found in the high modified set col-5 / (col-5 + col-6) | float |
5 | high_count | number of occurances of this sequence in the high-modified set | int |
6 | low_count | number of occurances of this sequence in the low-modified set | int |
7 | mid_count | number of occurances of this sequence in the mid-modified set | int |
8 | status | equal, Subset, Superset, or Disjoint describes the relationship of the known motif to the closest discovered motif | str |
9 | closest_found_motif | which of the discovered motifs is most simuilar to the known motif | str |
Human-readable table
column | name | description | type |
---|---|---|---|
1 | motif | human-readable representation of the motif sequence with the modification code in brackets | str |
2 | frac_mod | fraction of time this sequence is found in the high modified set col-3 / (col-3 + col-4) | float |
3 | high_count | number of occurances of this sequence in the high-modified set | int |
4 | low_count | number of occurances of this sequence in the low-modified set | int |
5 | mid_count | number of occurances of this sequence in the mid-modified set | int |
6 | status | equal, Subset, Superset, or Disjoint describes the relationship of the known motif to the closest discovered motif | str |
7 | closest_found_motif | which of the discovered motifs is most simuilar to the known motif | str |
Simple description of the search algorithm
The first step in find-motifs
is to categorize each genomic position in the pileup into one of three groups based on the fraction modified
column in the bedMethyl:
- Low-modified
- Mid-modified
- High-modified
The threshold values for each group can be set on the command line (--high-thresh
and --low-thresh
).
For example, consider a high threshold of 0.6 and a low threshold of 0.2 the following 3 bedMethyl records would be put into the high-, low-, and mid-groups, respectively:
contig1 6 7 a 27 - 6 7 255,0,0 27 96.30 26 1 0 0 3 0 0
contig1 8 9 a 24 - 8 9 255,0,0 24 4.17 1 23 0 0 5 1 0
contig1 218 219 a 21 + 218 219 255,0,0 21 28.57 6 15 0 2 3 0 0
The sequence around each modified position is then collected from the reference FASTA file.
The length of the sequence can be set with the --context-size
option, accepting two values: the number of bases upstream and the number of bases downstream of the modified location.
For example --context-size 12 12
will collect 12 bases upstream and 12 bases downstream of the modified base for a maximum motif length of 25 base pairs.
The algorithm then iteratively expands, contracts, and merges sequences while the following criteria are met:
- The number of occurrences in the high-modified set is greater than
min_sites
(set by--min-sites
). - The fraction \( \frac{\textit{H}}{\textit{H} + \textit{L}} \), is greater than
frac_mod
(set with--min-frac-mod
), where \( \textit{H} \) and \( \textit{L} \) is the number of total sequence contexts in the high-modified and low-modified set, respectively. - The log-odds of the context being in the high-modified category is greater than
min_log_odds
(set with--min-log-odds
).
Once a motif sequence cannot be changed (made more general or more restrictive) without violating one of these criteria, the motif sequence is considered complete. As the algorithm continues, context sequences that match discovered sequences are removed from consideration.
A secondary search step is also performed by starting with every k-mer (where k is less than the total sequence length, 3 by default, set with --exhaustive-seed-len
) at every motif position.
The log-odds threshold for this search is usually higher and set with --exhaustive-seed-min-log-odds
.
Decreasing this value can drastically increase computational time.
Tuning parameters and --skip-search
The default parameters have been picked to be sufficiently sensitive, however if you decide to adjust the parameters in general increasing sensitivity will increase compute time.
- Increasing the
--min-frac-mod
will stop search earlier which will decrease compute time. - Decreasing
--min-sites
has the largest effect and can especially cause the secondary search to crawl more sequences. Decreasing--min-sites
along with--skip-search
may be a useful technique to find very rare sequence motifs. - Increasing
--exhaustive-seed-min-log-odds
can drastically decrease compute time (sometimes while maintaining sensitivity).
Also consider the additional steps in performance considerations.