The megalodon_extras calibrate command group contains commands that produce Megalodon modified base and sequence variant calibration files for basecalling models.
When a new basecalling model is trained, a calibration file must be produced in order to obtain the most accurate aggregated modified base and sequence variant calls.
Without a calibration file the
--disable-variant-calibration flags may be set, but aggregated results will likely be much less accurate.
Calibration file estimation is broken down into two steps:
Ground truth statistic generation (megalodon_extras calibrate generate_modified_base_stats and megalodon_extras calibrate generate_variant_stats commands)
This step processes completed Megalodon runs to extract ground truth positive and negative statistics.
Calibration estimation (megalodon_extras calibrate modified_bases and megalodon_extras calibrate variants commands)
This step estimates the empirical probability of a modified base or sequence variant given the ground truth statistics from the first step.
Note that the plots produced by the calibration procedure (with examples shown below) are stored in the GitHub repository for each released model (
megalodon_extras calibrate generate_modified_base_stats¶
Generate ground truth modified base statistics.
The ground truth modified base composition for a run can be specified in two ways:
Control Megalodon results
Using this option assumes that all modified base statistics in --control-megalodon-results-dir represent canonical bases and all statistics in the main Megalodon results directory represent modified bases.
This respects the --mod-motif options specified in the main Megalodon commands.
Ground truth reference locations
See the megalodon_extras modified_bases create_ground_truth command for help producing a ground truth CSV file.
megalodon_extras calibrate generate_mod_stats_from_msf¶
In some situations ground truth control samples or reference locations are not available for calibration.
The generate_mod_stats_from_msf sub-command uses the mapped signal file (msf) used for Taiyaki model training to produce Megalodon calibration statistics.
This command uses the ground truth sequence including modified base annotation in order to extract modified base scores as computed in Megalodon.
The extracted scores can be restricted to a fixed canonical sequence motif using the --motif argument (providing the sequence motif and relative modified position; e.g. --motif CG 0 for CpG methylation).
Note that the final set of Megalodon modified base statistics should contain enough data from both the modified and canonical set of sites.
See megalodon_extras calibrate merge_modified_bases_stats below for merging sets of statistics.
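To illustrate the motif convention used by the --motif argument in this section, the following minimal sketch locates the candidate modified positions in a canonical sequence. The function name motif_sites is hypothetical and is not part of Megalodon; the sketch also assumes a motif made of plain A/C/G/T characters with no ambiguity codes.

```python
def motif_sites(seq, motif, rel_pos):
    """Return 0-based positions of the base at rel_pos within each motif hit.

    Hypothetical helper for illustration only; not Megalodon's implementation.
    """
    if not 0 <= rel_pos < len(motif):
        raise ValueError("rel_pos must index a base inside the motif")
    sites = []
    start = seq.find(motif)
    while start != -1:
        sites.append(start + rel_pos)
        # Advance by one base so overlapping motif hits are also reported.
        start = seq.find(motif, start + 1)
    return sites


# CpG example: the modified base is the C at offset 0 of each "CG".
print(motif_sites("ACGTCGGCG", "CG", 0))  # [1, 4, 7]
```

Only scores at the returned positions would be kept when the statistics are restricted to the motif.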
megalodon_extras calibrate generate_variant_stats¶
Generate ground truth sequence variant statistics.
This method produces ground truth sequence variant statistics by proposing alternatives to a reference sequence. It is thus assumed that the mapping location for each read contains the correct reference sequence. It is advised to select a set of reads with high quality mappings to a high quality reference for the sample.
This command performs basecalling and read mapping as in the main Megalodon command.
Variants are then randomly proposed and scored for a random set of sites across each read.
“Correct” variants are not produced by default due to the computational overhead required to map full reads to the “incorrect” reference.
This functionality is provided on an experimental basis via the
--compute-false-reference-scores flag, but these scores are not currently accepted by the
megalodon_extras calibrate variants command.
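The random proposal idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the command's actual logic: the function name propose_random_snvs and its parameters are hypothetical, and the sketch is limited to single-base substitutions for brevity. Because the reference is assumed to be correct, every proposal it returns is an "incorrect" variant.

```python
import random


def propose_random_snvs(ref_seq, n_sites, seed=0):
    """Propose one random substitution at each of n_sites random positions.

    Hypothetical sketch; each proposal differs from the reference base,
    so all proposals are ground truth negatives.
    """
    rng = random.Random(seed)
    n_sites = min(n_sites, len(ref_seq))
    # Sample distinct positions so no site is proposed twice.
    positions = sorted(rng.sample(range(len(ref_seq)), n_sites))
    proposals = []
    for pos in positions:
        ref_base = ref_seq[pos]
        alt_base = rng.choice([b for b in "ACGT" if b != ref_base])
        proposals.append((pos, ref_base, alt_base))
    return proposals


for pos, ref_base, alt_base in propose_random_snvs("ACGTACGTAC", 3):
    print(pos, ref_base, alt_base)
```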
megalodon_extras calibrate modified_bases¶
Estimate modified base calibration file.
Given a set of ground truth modified bases and raw Megalodon called statistics, compute empirical probabilities for a modified base.
The ground truth statistics are generated by the
megalodon_extras calibrate generate_modified_base_stats command, described above, and supplied via the
This command computes the empirical log-likelihood ratio over windows of observed modified base scores.
This process involves several steps to ensure certain characteristics of the generating distributions (e.g. monotonicity).
A separate calibration will be computed and stored in the output calibration file for each modified base found in the ground truth file.
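The windowed log-likelihood ratio and the monotonicity step described above might be sketched as follows. Several things here are assumptions made for illustration: the function name empirical_llr, the fixed-width bins, the pseudocount, the sign convention log(P(score | modified) / P(score | canonical)), and the use of a running maximum as the monotonic constraint. The source does not state which method Megalodon uses for any of these.

```python
import math


def empirical_llr(mod_scores, can_scores, lo, hi, n_bins, pseudo=1.0):
    """Per-bin log(P(bin | modified) / P(bin | canonical)), made non-decreasing.

    Returns (raw, calibrated) lists with one value per score bin.
    """
    width = (hi - lo) / n_bins

    def histogram(scores):
        counts = [pseudo] * n_bins  # pseudocount avoids log(0) in empty bins
        for s in scores:
            # Clamp out-of-range scores into the first or last bin.
            idx = min(max(int((s - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        total = sum(counts)
        return [c / total for c in counts]

    mod_dens = histogram(mod_scores)
    can_dens = histogram(can_scores)
    raw = [math.log(m / c) for m, c in zip(mod_dens, can_dens)]

    # Running maximum: a simple way to force a non-decreasing function.
    calibrated, best = [], -math.inf
    for value in raw:
        best = max(best, value)
        calibrated.append(best)
    return raw, calibrated


mod = [1.5, 2.0, 2.5, 0.5, 3.0, -0.5]
can = [-2.0, -1.5, -2.5, -0.5, 0.5, -3.0]
raw, cal = empirical_llr(mod, can, lo=-3.0, hi=3.0, n_bins=6)
print([round(v, 3) for v in cal])
```

Running this once per modified base in the ground truth data would give one calibration table per modified base, mirroring the per-base storage described above.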
These steps are visualized in the example plot below, which can be produced for any new calibration file by providing the
The top facet of this plot shows the distribution of theoretical modified base log-likelihood ratios produced by the basecalling model.
These distributions are smoothed such that they are monotonic from either extreme to the peak of the densities.
The middle facet shows the inferred empirical probability that a base is modified given the theoretical modified base score produced by the basecaller.
The final facet shows the same probabilities, but in log-likelihood space.
A constraint is enforced on this function such that the value is monotonically increasing (red - before monotonic constraint; yellow - after monotonic constraint).
The three vertical lines indicate common threshold values for modified base aggregation.
Note that the fraction of data ignored at each threshold level is annotated in the figure legend.
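As a rough illustration of the "fraction of data ignored" annotation, the sketch below counts per-read calls that are not confident in either direction at a given threshold. It assumes a call is kept only when max(p, 1 - p) is at least the threshold, where p is the calibrated probability that the base is modified. The function name fraction_ignored and the threshold values are hypothetical and are not taken from Megalodon.

```python
def fraction_ignored(mod_probs, threshold):
    """Fraction of per-read calls that are neither confidently modified
    nor confidently canonical.

    Assumes a call is kept only when max(p, 1 - p) >= threshold.
    """
    if not mod_probs:
        raise ValueError("need at least one per-read probability")
    ignored = sum(1 for p in mod_probs if max(p, 1.0 - p) < threshold)
    return ignored / len(mod_probs)


probs = [0.02, 0.08, 0.35, 0.55, 0.70, 0.85, 0.95, 0.99]
for threshold in (0.67, 0.80, 0.90):  # illustrative values, not Megalodon defaults
    print(threshold, fraction_ignored(probs, threshold))
```

Raising the threshold discards more ambiguous calls, which is the trade-off the vertical lines in the plot are meant to show.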
megalodon_extras calibrate merge_modified_bases¶
Merge modified base calibration files.
In some cases the ground truth for one modified base may come from a different source than the ground truth for another modified base. In this case calibration files can be computed separately and combined with this command. If multiple calibration files contain a calibration for the same modified base, the calibration from the file listed first will be stored.
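The precedence rule described above (the first file listed wins) can be sketched with plain dictionaries. The in-memory layout here, a dict from modified base code to a calibration table, is a hypothetical stand-in for the real calibration file format, and merge_calibrations is not a Megalodon function.

```python
def merge_calibrations(calibration_files):
    """Merge per-modified-base calibrations; the first file listed wins.

    Each "file" is modelled as a dict from modified base code to its
    calibration table (hypothetical stand-in for the real format).
    """
    merged = {}
    for calibration in calibration_files:
        for mod_base, table in calibration.items():
            # setdefault keeps the earliest entry and skips later duplicates.
            merged.setdefault(mod_base, table)
    return merged


first = {"m": [-1.2, 0.0, 1.3]}
second = {"m": [-0.9, 0.1, 1.0], "a": [-2.0, 0.0, 2.1]}
merged = merge_calibrations([first, second])
print(sorted(merged))  # ['a', 'm']
print(merged["m"])     # [-1.2, 0.0, 1.3]
```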
megalodon_extras calibrate merge_modified_bases_stats¶
Merge modified base calibration statistics files.
In some cases the ground truth statistics may be extracted from several sources (unmodified and modified samples) and merged afterwards. This command enables this pipeline.
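Unlike merging calibration files, merging statistics pools all of the data. A minimal sketch, assuming each statistics set is a dict keyed by (modified base, ground truth label) holding a list of scores; this layout and the name merge_stats are hypothetical and do not reflect the real file format.

```python
from collections import defaultdict


def merge_stats(stat_sets):
    """Concatenate ground truth scores per (modified base, label) key."""
    merged = defaultdict(list)
    for stats in stat_sets:
        for key, scores in stats.items():
            merged[key].extend(scores)
    return dict(merged)


unmodified_sample = {("m", "canonical"): [2.1, 3.4, 1.8]}
modified_sample = {("m", "modified"): [-2.7, -1.9], ("m", "canonical"): [0.4]}
merged = merge_stats([unmodified_sample, modified_sample])
print(len(merged[("m", "canonical")]), len(merged[("m", "modified")]))  # 4 2
```

The merged set then contains both modified and canonical scores, which is what the calibration estimation step needs.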
megalodon_extras calibrate variants¶
Estimate sequence variant calibration file.
Given a set of ground truth sequence variant statistics, supplied via the --ground-truth-llrs argument, compute empirical probabilities of a sequence variant.
This command computes the empirical log-likelihood ratio over windows of observed sequence variant scores.
This process involves several steps to ensure certain characteristics of the generating distributions.
This procedure is largely the same as the modified base calibration step, but the variants are grouped into categories based on the type of ground truth sequence variant.
Note that the vertical bars are not present in these plots as sequence variant per-read statistics are combined in a probabilistic fashion and not based on a hard threshold.