Megalodon Model Training¶
This page describes how to use Megalodon to prepare training data and train a new basecalling model using Taiyaki. For modified base data preparation and model training documentation see the modified base training documentation page.
Preparation of training data via Megalodon requires a basecalling model that can produce valid reference mappings.
If valid reference mappings using
minimap2 cannot be produced for a set of reads, model training will not proceed successfully.
To produce a training data (“mapped signal”) file the
--outputs signal_mappings argument should be added to a Megalodon call.
This will produce a
signal_mappings.hdf5 file in the specified Megalodon output directory.
For each read producing a valid reference mapping, this file contains a mapping between the raw signal and the mapped reference bases.
This file can then be directly passed to the Taiyaki
train_flipflop.py command for model training.
# run megalodon; output signal mappings
megalodon raw_fast5s/ \
--outputs signal_mappings \
--reference reference.fa \
--devices 0 --processes 40
# run taiyaki training
train_flipflop.py ./taiyaki/models/mLstm_flipflop.py \
megalodon_results/signal_mappings.hdf5 --device 0
Once training completes, the
training/model_final.checkpoint contains the model.
This can be converted to a guppy compatible model with the
A guppy config with appropriate settings should also be produced for new models.
For optimal performance, it is recommended that the
OMP_NUM_THREADS unix environment variable be set to
1 for the above Megalodon command and a larger value for the Taiyaki training command.
Signal Mapping Options¶
Several options are available to control the behavior of the
Only allow reads with a reference mapping length within this range into the output.
Only include reads with higher mapping percent identity in signal_mappings output.
Only include reads with higher read alignment coverage in signal_mappings output.
This option replaces the reference sequence with more likely proposed alternative sequences as called in the
Cannot specify both this option and
When a new model is trained, the produced scores must be calibrated to achieve optimal aggregated results (over reads).
Once produced, calibration files can be passed to Megalodon via the
Sequence variant calibration requires a ground truth against which to compute scores.
For sequence variants, a high quality reference for a set of reads will suffice for this requirement.
Random sequence variants are proposed and scored in order to create distributions over which to calibrate the produced scores.
In order to create a sequence variant calibration file, run
megalodon/scripts/generate_ground_truth_variant_llr_scores.py followed by
--out-pdf provides visualization of the likelihood ratio score correction.