Medaka consensus training pipeline
It is possible to train and evaluate medaka consensus models starting from folders of .fast5 or .fasta/q files in a single command.
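Assuming the usual Snakemake behaviour of building any missing upstream outputs, the end-to-end run might reduce to a single invocation such as the following (the target and flag are described in the sections below):

katuali all_medaka_train --keep-going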
Input data specification
Read .fast5 files should be placed under top-level folders. Multiple top-level folders can be used, perhaps corresponding to multiple runs. Within the DATA section of a katuali configuration file, these top-level folders should be listed along with details of reference sequence files and sequences. MEDAKA_TRAIN_REGIONS and MEDAKA_EVAL_REGIONS define genomic regions for training and evaluation.
In the example below, data from the first two top-level directories (MinIonRun1 and MinIonRun2) will be used for training with the ecoli and yeast reference sequences. Evaluation of the trained models will be performed using the third and fourth top-level directories with the ecoli, yeast, and na12878_chr21 sequences.
DATA:
    'MinIonRun1':
        'REFERENCE': '/path/to/references.fasta'
        'MEDAKA_TRAIN_REGIONS': ['ecoli', 'yeast']
        'MEDAKA_EVAL_REGIONS': []
    'MinIonRun2':
        'REFERENCE': '/path/to/references.fasta'
        'MEDAKA_TRAIN_REGIONS': ['ecoli', 'yeast']
        'MEDAKA_EVAL_REGIONS': []
    'GridIonRun1':
        'REFERENCE': '/path/to/references.fasta'
        'MEDAKA_TRAIN_REGIONS': []
        'MEDAKA_EVAL_REGIONS': ['ecoli', 'yeast', 'na12878_chr21']
    'GridIonRun2':
        'REFERENCE': '/path/to/references.fasta'
        'MEDAKA_TRAIN_REGIONS': []
        'MEDAKA_EVAL_REGIONS': ['ecoli', 'yeast', 'na12878_chr21']
Coverage depths specification
Read depths at which to create assemblies for training are specified by the DEPTHS key of the katuali configuration. This list should span the range of depths at which the model is to be used.
DEPTHS: [25, 50, 75, 100, 125, 150, 175, 200]
For some datasets it may not be possible to create assemblies for all reference sequences at all depths. To avoid katuali exiting early when such trivial failures occur, the --keep-going option can be used. This allows tasks to continue unaffected by the failure of unrelated tasks.
Creating training features
To create training data (“features”) for medaka, katuali must:

- basecall data from all top-level directories (if .fast5 files are provided),
- align all basecalls to the specified reference sequences,
- create subsampled sets of basecalls over the desired regions and depths,
- form draft assemblies from these read sets, and finally
- create medaka training feature data and labels.
There is a single medaka target to perform the above tasks:
katuali all_medaka_feat
Katuali uses the Snakemake --keep-going flag, which instructs it to continue processing tasks when unrelated tasks fail.
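For example, combining the target and flag above, feature creation can be run as:

katuali all_medaka_feat --keep-going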
Having run the all_medaka_feat target, two files will be produced for every valid combination of dataset (top-level folder), coverage depth, and reference sequence. For example, the files:
4bf50792/guppy/align/senterica1/25X_prop/canu_gsz_4.8m/racon/medaka_train/medaka_train.hdf
4bf50792/guppy/align/senterica1/25X_prop/canu_gsz_4.8m/racon/medaka_train/medaka_train_rc.hdf
will be produced for a top-level folder named 4bf50792 and the reference sequence senterica1 at 25-fold coverage.
Training models
When the production of all the training data is complete, training can be commenced by running:
katuali all_medaka_train --keep-going
This step requires the use of GPUs.
Note
For medaka training, tensorflow-gpu must be installed in your medaka environment.
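As a minimal sketch, assuming a pip-managed medaka virtual environment (the exact package and version constraints depend on your medaka release and CUDA setup, so check the medaka documentation):

# inside the (assumed pip-managed) medaka virtual environment;
# choose the tensorflow-gpu version matching your medaka release and CUDA installation
pip install tensorflow-gpu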
Coping with missing feature files
If input datasets have insufficient coverage depth for some of the training regions, some training feature files will not be produced. In this case the config flag USE_ONLY_EXISTING_MEDAKA_FEAT can be set to true to allow katuali to train using only those features which already exist.
USE_ONLY_EXISTING_MEDAKA_FEAT: true
Note
You need to first attempt to create all features with the medaka_train_feat rule with USE_ONLY_EXISTING_MEDAKA_FEAT set to false, and then run all_medaka_train with the flag set to true.
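Putting this together, the two-stage workflow might look like the following sketch, using the all_medaka_feat and all_medaka_train targets shown above and editing the config value between runs:

# 1) attempt to create all training features
#    (with USE_ONLY_EXISTING_MEDAKA_FEAT: false in the config)
katuali all_medaka_feat --keep-going

# 2) set USE_ONLY_EXISTING_MEDAKA_FEAT: true in the config, then
#    train using only the feature files that were produced
katuali all_medaka_train --keep-going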
Refer to comments in the katuali configuration file for further details.