.. _medaka_train:

Medaka consensus training pipeline
==================================

It is possible to train and evaluate medaka consensus models starting from
folders of ``.fast5`` or ``.fasta/q`` files in a single command.

Input data specification
------------------------

Read ``.fast5`` files should be placed under top-level folders. Multiple
top-level folders can be used, perhaps corresponding to multiple runs. Within
the ``DATA`` section of a katuali configuration file, these top-level folders
should be listed along with details of reference sequence files and sequences.
``MEDAKA_TRAIN_REGIONS`` and ``MEDAKA_EVAL_REGIONS`` define genomic regions for
training and evaluation.

In the example below, data from the first two top-level directories
(``MinIonRun1`` and ``MinIonRun2``) will be used for training using the
``ecoli`` and ``yeast`` reference sequences. Evaluation of the trained models
will be performed using the third and fourth top-level directories
(``GridIonRun1`` and ``GridIonRun2``) using the ``ecoli``, ``yeast``, and
``na12878_chr21`` sequences.

.. code-block:: yaml

    DATA:
        'MinIonRun1':
            'REFERENCE': '/path/to/references.fasta'
            'MEDAKA_TRAIN_REGIONS': ['ecoli', 'yeast']
            'MEDAKA_EVAL_REGIONS': []
        'MinIonRun2':
            'REFERENCE': '/path/to/references.fasta'
            'MEDAKA_TRAIN_REGIONS': ['ecoli', 'yeast']
            'MEDAKA_EVAL_REGIONS': []
        'GridIonRun1':
            'REFERENCE': '/path/to/references.fasta'
            'MEDAKA_TRAIN_REGIONS': []
            'MEDAKA_EVAL_REGIONS': ['ecoli', 'yeast', 'na12878_chr21']
        'GridIonRun2':
            'REFERENCE': '/path/to/references.fasta'
            'MEDAKA_TRAIN_REGIONS': []
            'MEDAKA_EVAL_REGIONS': ['ecoli', 'yeast', 'na12878_chr21']

Coverage depths specification
-----------------------------

Read depths at which to create assemblies for training are specified by the
``DEPTHS`` key of the katuali configuration. This list should span the range of
depths at which the model is to be used.

.. code-block:: yaml

    DEPTHS: [25, 50, 75, 100, 125, 150, 175, 200]

For some datasets it may not be possible to create assemblies for all reference
sequences at all depths. To avoid katuali exiting early when such trivial
failures occur, the ``--keep-going`` option can be used. This allows tasks to
continue unaffected by the failure of unrelated tasks.

Creating training features
--------------------------

To create training data ("features") for medaka, ``katuali`` must:

* basecall data from all top-level directories (if ``.fast5`` files are provided),
* align all basecalls to the specified reference sequences,
* create subsampled sets of basecalls over the desired regions and depths,
* form draft assemblies from these read sets, and finally
* create medaka training features data and labels.

There is a single target to perform the above tasks:

.. code-block:: bash

    katuali all_medaka_feat

The ``Snakemake`` ``--keep-going`` flag can be passed to ``katuali`` to
instruct it to continue processing tasks when unrelated tasks fail.

Having run the ``all_medaka_feat`` target, two files will be produced for every
valid combination of dataset (top-level folder), coverage depth, and reference
sequence. For example, the files:

.. code-block:: bash

    4bf50792/guppy/align/senterica1/25X_prop/canu_gsz_4.8m/racon/medaka_train/medaka_train.hdf
    4bf50792/guppy/align/senterica1/25X_prop/canu_gsz_4.8m/racon/medaka_train/medaka_train_rc.hdf

will be produced for a top-level folder named ``4bf50792`` and a reference
sequence ``senterica1`` at a coverage of ``25``-fold.

.. _training_models:

Training models
---------------

When the production of all the training data is complete, training can be
started by running:

.. code-block:: bash

    katuali all_medaka_train --keep-going

This step requires the use of GPUs.

.. note:: Note that ``tensorflow-gpu`` must be installed in your medaka
    environment for medaka training.

.. _missing_feat:

Coping with missing feature files
---------------------------------

If input datasets have insufficient coverage depth for some of the training
regions, some training feature files will not be produced. In this case the
config flag ``USE_ONLY_EXISTING_MEDAKA_FEAT`` can be set to ``true`` to allow
katuali to train using only those features which already exist.

.. code-block:: yaml

    USE_ONLY_EXISTING_MEDAKA_FEAT: true

.. note:: Note that you need to first attempt to create all features with the
    ``medaka_train_feat`` rule with ``USE_ONLY_EXISTING_MEDAKA_FEAT`` set to
    false, and then run ``all_medaka_train`` with the flag set to true.

Refer to comments in the katuali configuration file for further details.
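
To judge whether feature files are missing, it can help to know how many
combinations of dataset, training region, and depth the configuration implies.
The following is a minimal sketch and is not part of katuali: the function name
``expected_feature_combinations`` is hypothetical, and the configuration is
written as a plain Python dictionary mirroring the YAML examples above rather
than being read from a katuali configuration file.

```python
from itertools import product


def expected_feature_combinations(data, depths):
    """Return (dataset, region, depth) tuples for which training features
    would be attempted. Hypothetical helper, for illustration only."""
    combos = []
    for dataset, settings in data.items():
        # Datasets with an empty MEDAKA_TRAIN_REGIONS list contribute nothing.
        regions = settings.get("MEDAKA_TRAIN_REGIONS", [])
        combos.extend(
            (dataset, region, depth) for region, depth in product(regions, depths)
        )
    return combos


# In-memory copy of the DATA and DEPTHS examples (assumed layout).
DATA = {
    "MinIonRun1": {
        "REFERENCE": "/path/to/references.fasta",
        "MEDAKA_TRAIN_REGIONS": ["ecoli", "yeast"],
        "MEDAKA_EVAL_REGIONS": [],
    },
    "MinIonRun2": {
        "REFERENCE": "/path/to/references.fasta",
        "MEDAKA_TRAIN_REGIONS": ["ecoli", "yeast"],
        "MEDAKA_EVAL_REGIONS": [],
    },
    "GridIonRun1": {
        "REFERENCE": "/path/to/references.fasta",
        "MEDAKA_TRAIN_REGIONS": [],
        "MEDAKA_EVAL_REGIONS": ["ecoli", "yeast", "na12878_chr21"],
    },
    "GridIonRun2": {
        "REFERENCE": "/path/to/references.fasta",
        "MEDAKA_TRAIN_REGIONS": [],
        "MEDAKA_EVAL_REGIONS": ["ecoli", "yeast", "na12878_chr21"],
    },
}
DEPTHS = [25, 50, 75, 100, 125, 150, 175, 200]

combos = expected_feature_combinations(DATA, DEPTHS)
# 2 training datasets x 2 regions x 8 depths
print(len(combos))  # 32
```

Since two files are produced per valid combination, this example configuration
would yield at most 64 training feature files. Fewer files on disk after
attempting feature creation suggests that some combinations lacked sufficient
coverage.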