Megalodon Model Training

This page describes how to use Megalodon to prepare training data and train a new basecalling model using Taiyaki. For modified base data preparation and model training, see the modified base training documentation page.

Note

Preparation of training data via Megalodon requires a basecalling model that can produce valid reference mappings. If minimap2 cannot produce valid reference mappings for a set of reads, model training will not proceed successfully.

Data Preparation

To produce a training data (“mapped signal”) file, add the --outputs signal_mappings argument to a Megalodon call. This produces a signal_mappings.hdf5 file in the specified Megalodon output directory. For each read that produces a valid reference mapping, this file contains a mapping between the raw signal and the mapped reference bases. The file can then be passed directly to the Taiyaki train_flipflop.py command for model training.

# run megalodon; output signal mappings (written to the default megalodon_results/ directory)
megalodon raw_fast5s/ \
    --outputs signal_mappings \
    --reference reference.fa \
    --devices 0 --processes 40

# run taiyaki training
train_flipflop.py ./taiyaki/models/mLstm_flipflop.py \
    megalodon_results/signal_mappings.hdf5 --device 0

Once training completes, the training/model_final.checkpoint file contains the model. This can be converted to a Guppy-compatible model with the taiyaki/bin/dump_json.py script. A Guppy config with appropriate settings should also be produced for new models.

Note

For optimal performance, set the OMP_NUM_THREADS Unix environment variable to 1 for the Megalodon command above and to a larger value for the Taiyaki training command.
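
One way to apply the thread settings above is sketched here, assuming a POSIX shell. The helper run_with_threads is hypothetical (it is not part of Megalodon or Taiyaki) and sets OMP_NUM_THREADS for a single command only, so the two tools can use different values in the same session. The value 8 for training is an illustrative assumption; a suitable value depends on the machine.

```shell
# Hypothetical helper (not part of Megalodon or Taiyaki):
# run a command with OMP_NUM_THREADS set for that one process only,
# leaving the calling shell's environment untouched.
run_with_threads() {
    threads="$1"
    shift
    env OMP_NUM_THREADS="$threads" "$@"
}

# Intended usage, commented out so the sketch has no side effects.
# The commands are the ones shown above; 8 is an assumed thread count.
# run_with_threads 1 megalodon raw_fast5s/ \
#     --outputs signal_mappings \
#     --reference reference.fa \
#     --devices 0 --processes 40
# run_with_threads 8 train_flipflop.py ./taiyaki/models/mLstm_flipflop.py \
#     megalodon_results/signal_mappings.hdf5 --device 0
```

Scoping the variable to each command avoids exporting a single value that would apply to both tools.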

Signal Mapping Options

Several options are available to control the behavior of the signal_mappings output.

  • --ref-length-range

    • Only allow reads with a reference mapping length within this range into the output.

  • --ref-percent-identity-threshold

    • Only include reads with a mapping percent identity above this threshold in the signal_mappings output.

  • --ref-percent-coverage-threshold

    • Only include reads with a read alignment coverage above this threshold in the signal_mappings output.

  • --ref-include-variants

    • This option replaces the reference sequence with more likely proposed alternative sequences as called in the per_read_variants output.

    • This option cannot be combined with --ref-include-mods.
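
The filtering options above could be combined along the following lines. This is a sketch under stated assumptions rather than a recommended configuration: the helper build_filtered_cmd is hypothetical, the numeric values are placeholders, --ref-length-range is assumed to take a minimum and a maximum length, and the two thresholds are assumed to be percentages on a 0 to 100 scale. Check the Megalodon help output before relying on any of these. The helper prints the command instead of running it so that it can be reviewed first.

```shell
# Hypothetical helper: print a Megalodon command that restricts which reads
# are written to the signal_mappings output.
# Arguments: MIN_LEN MAX_LEN MIN_IDENTITY MIN_COVERAGE (all illustrative).
build_filtered_cmd() {
    min_len="$1"
    max_len="$2"
    min_identity="$3"
    min_coverage="$4"
    echo "megalodon raw_fast5s/" \
        "--outputs signal_mappings" \
        "--reference reference.fa" \
        "--devices 0 --processes 40" \
        "--ref-length-range $min_len $max_len" \
        "--ref-percent-identity-threshold $min_identity" \
        "--ref-percent-coverage-threshold $min_coverage"
}

# Print the command for review (placeholder values).
build_filtered_cmd 500 3000 90 90
```

The --ref-include-variants option is left out of this sketch because it depends on the per_read_variants output, which needs additional variant-calling arguments.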

Megalodon Calibration

When a new model is trained, the scores it produces must be calibrated to achieve optimal results when aggregated over reads. Once produced, calibration files can be passed to Megalodon via the --variant-calibration-filename and --mod-calibration-filename arguments.
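
As a rough illustration of passing a calibration file, the sketch below builds a Megalodon command that uses --variant-calibration-filename. Several names here are assumptions and not taken from this page: the helper calibrated_variant_cmd is hypothetical, variants.vcf stands in for a user-supplied variant file given via --variant-filename, and the calibration file path is whatever the calibration step produced. The helper refuses to build the command when the calibration file is missing.

```shell
# Hypothetical helper: print a Megalodon command that applies a variant
# calibration file. Returns non-zero if the calibration file is missing.
calibrated_variant_cmd() {
    calib_file="$1"
    if [ ! -f "$calib_file" ]; then
        echo "calibration file not found: $calib_file" >&2
        return 1
    fi
    # variants.vcf is a placeholder for the user's own variant file.
    echo "megalodon raw_fast5s/" \
        "--outputs per_read_variants" \
        "--reference reference.fa" \
        "--variant-filename variants.vcf" \
        "--variant-calibration-filename $calib_file" \
        "--devices 0 --processes 40"
}

# Usage (commented out; the file name is a placeholder):
# calibrated_variant_cmd my_variant_calibration.npz
```

A modified base calibration file would be passed the same way through --mod-calibration-filename.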

Sequence variant calibration requires a ground truth against which to compute scores. For sequence variants, a high-quality reference for a set of reads suffices. Random sequence variants are proposed and scored to create the distributions over which the produced scores are calibrated. To create a sequence variant calibration file, run megalodon/scripts/generate_ground_truth_variant_llr_scores.py followed by megalodon/scripts/calibrate_variant_llr_scores.py. The optional --out-pdf argument provides a visualization of the likelihood ratio score correction.