Medaka consensus training pipeline

It is possible to train and evaluate medaka consensus models starting from folders of .fast5 or .fasta/q files in a single command.

Input data specification

Read .fast5 files should be placed under top-level folders. Multiple top-level folders can be used, perhaps corresponding to multiple runs.

Within the DATA section of a katuali configuration file, these top-level folders should be listed along with details of reference sequence files and sequences. MEDAKA_TRAIN_REGIONS and MEDAKA_EVAL_REGIONS define genomic regions for training and evaluation.

In the example below data from the first two top-level directories (MinIonRun1 and MinIonRun2) will be used for training using ecoli and yeast reference sequences. Evaluation of the trained models will be performed using the third and fourth top-level directories using the ecoli, yeast, and na12878_chr21 sequences.

DATA:
    'MinIonRun1':
        'REFERENCE': '/path/to/references.fasta'
        'MEDAKA_TRAIN_REGIONS': ['ecoli', 'yeast']
        'MEDAKA_EVAL_REGIONS': []
    'MinIonRun2':
        'REFERENCE': '/path/to/references.fasta'
        'MEDAKA_TRAIN_REGIONS': ['ecoli', 'yeast']
        'MEDAKA_EVAL_REGIONS': []
    'GridIonRun1':
        'REFERENCE': '/path/to/references.fasta'
        'MEDAKA_TRAIN_REGIONS': []
        'MEDAKA_EVAL_REGIONS': ['ecoli', 'yeast', 'na12878_chr21']
    'GridIonRun2':
        'REFERENCE': '/path/to/references.fasta'
        'MEDAKA_TRAIN_REGIONS': []
        'MEDAKA_EVAL_REGIONS': ['ecoli', 'yeast', 'na12878_chr21']
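
To check which top-level folders a DATA section assigns to training and which to evaluation, the configuration can be inspected programmatically. The sketch below is illustrative only: it copies the example above into a plain Python dictionary rather than parsing the YAML file, and the helper name split_datasets is hypothetical and not part of katuali.

```python
# Illustrative only: the DATA section above, written as a Python dict.
REF = "/path/to/references.fasta"
DATA = {
    "MinIonRun1": {"REFERENCE": REF,
                   "MEDAKA_TRAIN_REGIONS": ["ecoli", "yeast"],
                   "MEDAKA_EVAL_REGIONS": []},
    "MinIonRun2": {"REFERENCE": REF,
                   "MEDAKA_TRAIN_REGIONS": ["ecoli", "yeast"],
                   "MEDAKA_EVAL_REGIONS": []},
    "GridIonRun1": {"REFERENCE": REF,
                    "MEDAKA_TRAIN_REGIONS": [],
                    "MEDAKA_EVAL_REGIONS": ["ecoli", "yeast", "na12878_chr21"]},
    "GridIonRun2": {"REFERENCE": REF,
                    "MEDAKA_TRAIN_REGIONS": [],
                    "MEDAKA_EVAL_REGIONS": ["ecoli", "yeast", "na12878_chr21"]},
}


def split_datasets(data):
    """Hypothetical helper: map dataset name -> regions, for training and evaluation."""
    train = {name: cfg["MEDAKA_TRAIN_REGIONS"]
             for name, cfg in data.items() if cfg.get("MEDAKA_TRAIN_REGIONS")}
    evaluate = {name: cfg["MEDAKA_EVAL_REGIONS"]
                for name, cfg in data.items() if cfg.get("MEDAKA_EVAL_REGIONS")}
    return train, evaluate


train, evaluate = split_datasets(DATA)
print("training:  ", train)
print("evaluation:", evaluate)
```

A dataset with an empty region list is simply left out of the corresponding group, which matches the way the example keeps training and evaluation runs separate.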

Coverage depths specification

Read depths at which to create assemblies for training are specified by the DEPTHS key of the katuali configuration. This list should span the range of depths at which the model is to be used.

DEPTHS:
    [25, 50, 75, 100, 125, 150, 175, 200]

For some datasets it may not be possible to create assemblies for all reference sequences at all depths. To avoid katuali exiting early when such trivial failures occur, the --keep-going option can be used. This allows tasks to continue unaffected by the failure of unrelated tasks.
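
Because features are requested for every combination of training dataset, region, and depth, the number of assemblies grows as the product of the three. The following is a rough sketch of that enumeration using the example values from this page; the function name training_jobs is hypothetical and this is not how katuali itself schedules work.

```python
from itertools import product

# Example values taken from the configuration snippets on this page.
DEPTHS = [25, 50, 75, 100, 125, 150, 175, 200]
TRAIN_REGIONS = {
    "MinIonRun1": ["ecoli", "yeast"],
    "MinIonRun2": ["ecoli", "yeast"],
}


def training_jobs(train_regions, depths):
    """Hypothetical helper: yield (dataset, region, depth) for each requested combination."""
    for dataset, regions in train_regions.items():
        for region, depth in product(regions, depths):
            yield dataset, region, depth


jobs = list(training_jobs(TRAIN_REGIONS, DEPTHS))
print(len(jobs))  # 2 datasets x 2 regions x 8 depths = 32
```

Any of these combinations can fail independently, for example when a dataset has too few reads to reach the higher depths.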

Creating training features

To create training data (“features”) for medaka, katuali must:

  • basecall data from all top-level directories (if .fast5 files are provided),

  • align all basecalls to the specified reference sequences,

  • create subsampled sets of basecalls over the desired regions and depths,

  • form draft assemblies from these read sets, and finally

  • create medaka training features data and labels.

There is a single medaka target to perform the above tasks:

katuali all_medaka_feat

The Snakemake --keep-going flag instructs katuali to continue processing tasks when unrelated tasks fail.

Having run the all_medaka_feat target, two files will be produced for every valid combination of dataset (top-level folder), coverage depth, and reference sequence. For example the files:

4bf50792/guppy/align/senterica1/25X_prop/canu_gsz_4.8m/racon/medaka_train/medaka_train.hdf
4bf50792/guppy/align/senterica1/25X_prop/canu_gsz_4.8m/racon/medaka_train/medaka_train_rc.hdf

will be produced for a top-level folder named 4bf50792 and a reference sequence senterica1 at 25-fold coverage.
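
To see which feature files were actually produced, the output tree can be searched for the two file names shown above. The sketch below assumes only those file names (medaka_train.hdf and medaka_train_rc.hdf); the function find_feature_pairs and the choice of search root are hypothetical.

```python
from pathlib import Path

# File names taken from the example paths above.
FEATURE_NAMES = ("medaka_train.hdf", "medaka_train_rc.hdf")


def find_feature_pairs(root):
    """Hypothetical helper: return (complete, incomplete) feature directories under root.

    A directory is complete when it holds both feature files, and incomplete
    when it holds only one of them.
    """
    dirs = {p.parent for name in FEATURE_NAMES for p in Path(root).rglob(name)}
    complete = sorted(d for d in dirs
                      if all((d / name).is_file() for name in FEATURE_NAMES))
    incomplete = sorted(dirs - set(complete))
    return complete, incomplete
```

Calling find_feature_pairs on the directory in which katuali was run returns the directories found. Combinations for which no feature file was written at all do not appear in either list.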

Training models

When the production of all the training data is complete, training can be commenced by running:

katuali all_medaka_train --keep-going

This step requires the use of GPUs.

Note

Note that tensorflow-gpu must be installed in your medaka environment for medaka training.

Coping with missing feature files

If input datasets have insufficient coverage depth for some of the training regions, some training feature files will not be produced. In this case the config flag USE_ONLY_EXISTING_MEDAKA_FEAT can be set to true to allow katuali to train using only those features that already exist.

USE_ONLY_EXISTING_MEDAKA_FEAT: true

Note

Note that you need to first attempt to create all features with the medaka_train_feat rule with USE_ONLY_EXISTING_MEDAKA_FEAT set to false, and then run all_medaka_train with the flag set to true.

Refer to comments in the katuali configuration file for further details.