Walkthrough

The following shows how to train and use a simple form of consensus model from sequencing data obtained from R9.4 flowcells and the ligation sequencing kit.

Note

The below assumes a Linux environment; some changes may need to be made on macOS. Windows environments are not supported. Environment variables set in one code block are assumed to persist through to other code blocks.

The below serves to demonstrate the process at a simple level. It does not represent a best-practices or state-of-the-art workflow. To train models which generalise well to other datasets, more careful preparation of a larger dataset is required. Medaka’s standard models are trained using katuali.

Obtaining Data and Software

Start by downloading training data, containing basecalls from Oxford Nanopore Technologies’ reads:

WALKTHROUGH=${PWD}/medaka_walkthrough
mkdir -p ${WALKTHROUGH} && cd ${WALKTHROUGH}
wget https://s3-eu-west-1.amazonaws.com/ont-research/medaka_walkthrough_no_reads.tar.gz
tar -xvf medaka_walkthrough_no_reads.tar.gz
DATA=${PWD}/data

The extracted archive also contains all the intermediate output files that are created during the process below. Any step may be skipped by simply copying the requisite subfolder from the ${DATA} directory into the ${WALKTHROUGH} directory.
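For example, to skip the assembly step below, one might copy the pre-computed draft assembly into place (a sketch; this assumes the archive contains a draft_assm subfolder mirroring the one created later):

cp -r ${DATA}/draft_assm ${WALKTHROUGH}/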

The necessary software can be sourced using:

cd ${WALKTHROUGH}
git clone https://github.com/nanoporetech/pomoxis --recursive
git clone https://github.com/nanoporetech/medaka

# While it is possible to install pomoxis and medaka into the same
#   virtual environment, we will install each package into its own
#   environment for simplicity. For more details see the readme for
#   each of the packages.

cd pomoxis && make install && cd ..
cd medaka && make install && cd ..

POMOXIS=${WALKTHROUGH}/pomoxis/venv/bin/activate
MEDAKA=${WALKTHROUGH}/medaka/venv/bin/activate

Creating a Draft Assembly

A draft assembly can be formed from the provided basecalls using the miniasm- and racon-based pipeline from pomoxis. Alternatively, one could use canu at this step.

cd ${WALKTHROUGH}
NPROC=$(nproc)
BASECALLS=data/basecalls.fa
source ${POMOXIS}
mini_assemble -i ${BASECALLS} -o draft_assm -p assm -t ${NPROC}

This will create a draft assembly at draft_assm/assm_final.fa. The mini_assemble script has two useful options not used here:

  • specifying -c will first run porechop on the reads to trim sequencing adapters, and
  • specifying -e 10 will perform error correction on the longest 10% of reads prior to assembly (similar to the strategy of canu).

Both these steps can improve the assembly quality at the expense of speed.
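For example, both options could be enabled together as below (a sketch; the output directory name draft_assm_trimmed is arbitrary, and runtimes will be longer):

mini_assemble -i ${BASECALLS} -c -e 10 -o draft_assm_trimmed -p assm -t ${NPROC}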

The number and length of the assembled contigs can be checked with:

DRAFT=draft_assm/assm_final.fa
awk '{if(/>/){n=$1}else{print n " " length($0)}}' ${DRAFT}

The expected output is a contig 4,703,280 bases long (utg000001c).

Polishing a Consensus

After performing all steps up to and including Creating a Draft Assembly, the following commands can be run to yield a consensus using medaka’s default model. This model was trained using data obtained from E. coli, S. cerevisiae, and H. sapiens samples.

cd ${WALKTHROUGH}
source ${MEDAKA}
CONSENSUS=consensus
DRAFT=draft_assm/assm_final.fa
medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${CONSENSUS} -t ${NPROC}

To polish an assembly using another model, use the -m option to specify the filepath of the model.
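For example (a sketch; the output directory consensus_other is arbitrary and the model path is a placeholder to be replaced with a real model file, such as one produced in Training a Consensus Network below):

medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o consensus_other -t ${NPROC} -m /path/to/model.hdf5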

Alignment statistics can be calculated using the assess_assembly program from pomoxis:

cd ${WALKTHROUGH}
source ${POMOXIS}
TRUTH=${DATA}/truth.fasta
DRAFT2TRUTH=draft_to_truth
CONSENSUS2TRUTH=${CONSENSUS}_to_truth
echo "Draft assembly"
assess_assembly -i ${DRAFT} -r ${TRUTH} -p ${DRAFT2TRUTH} -t ${NPROC}
echo "Medaka consensus"
assess_assembly -i ${CONSENSUS}/consensus.fasta -r ${TRUTH} -p ${CONSENSUS2TRUTH} -t ${NPROC}

A decrease in error rate from 0.367% to 0.070% should be observed.

Training a Consensus Network

In order to train a bespoke network first perform all the steps up to and including Creating a Draft Assembly above.

The ultimate aim of the consensus network is to predict the truth sequence from the alignment of basecalls to the draft. This requires understanding how the basecalls may align to the draft and how the draft must be edited to obtain the truth. The draft acts as a common frame-of-reference between the basecalls and the truth.

The basecalls and truth sequence are aligned to the draft. For the latter, this is performed in chunks.

cd ${WALKTHROUGH}
DRAFT=draft_assm/assm_final.fa
TRUTH=${DATA}/truth.fasta
source ${POMOXIS}
CHUNKSIZE=100000
CALLS2DRAFT=calls2draft
TRUTH2DRAFT=truth2draft

mini_align -P -m -r ${DRAFT} -i ${BASECALLS} -t ${NPROC} -p ${CALLS2DRAFT}
mini_align -c ${CHUNKSIZE} -P -m -r ${DRAFT} -i ${TRUTH} -t ${NPROC} -p ${TRUTH2DRAFT}

These raw alignments must now be converted into features for input into a neural network. To reduce any IO bottlenecks during training, the training data can be written to the HDF5 file in batches using the --batch_size option. The --read_fraction option randomly subsamples reads, which makes the resultant model more robust to variations in pileup depth when it is later used to make predictions.

cd ${WALKTHROUGH}
source ${MEDAKA}
REFNAME=utg000001c
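# Restrict training features to the first 80% of the 4,703,280-base contig
#   (0.8 * 4,703,280 = 3,762,624); the remainder can serve as held-out data.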
TRAINEND=3762624
TRAINFEATURES=train_features.hdf
FRACTION="0.1 1"
BATCHSIZE=200
MODEL_FEAT_OPT=medaka/medaka/data/medaka_model.hdf5
medaka features ${CALLS2DRAFT}.bam ${TRAINFEATURES} --truth ${TRUTH2DRAFT}.bam --threads ${NPROC} --region ${REFNAME}:-${TRAINEND} --batch_size ${BATCHSIZE} --read_fraction ${FRACTION} --chunk_len 1000 --chunk_ovlp 0 --model ${MODEL_FEAT_OPT} --max_label_len 1

Now everything is in place to train a consensus network with medaka train:

cd ${WALKTHROUGH}
source ${MEDAKA}
TRAINNAME=training
medaka train ${TRAINFEATURES} --train_name ${TRAINNAME} --epochs 10

Depending on the compute resources available, this step may take some time. During training, models are regularly checkpointed so that training may be easily resumed if interrupted. At the end of training, we have a number of output models including in particular:

  • model.best.hdf5: model with the best accuracy over the training set
  • model.best.val.hdf5: model with the best accuracy over the validation set

Other ancillary outputs are also produced.
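A quick way to confirm the models have been written (a minimal check, assuming the training directory layout described above):

ls -l ${TRAINNAME}/model.best.hdf5 ${TRAINNAME}/model.best.val.hdf5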

To use a model, run medaka_consensus, specifying the full absolute path to the model using the -m option:

cd ${WALKTHROUGH}
source ${MEDAKA}
CONSENSUS=consensus_trained
MODEL=${PWD}/${TRAINNAME}/model.best.val.hdf5
medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${CONSENSUS} -t ${NPROC} -m ${MODEL}

Automated training pipeline

With katuali it is now possible to train medaka models starting from folders of fast5s in a single command:

katuali medaka_train_replicates --keep-going

Running the above will:

  • basecall multiple runs of data,
  • align all basecalls to reference sequences,
  • create subsampled sets of basecalls over reference sequences and depths,
  • assemble those sets of basecalls into draft assemblies,
  • create medaka training features for all assemblies,
  • train medaka models in multiple replicates, and
  • evaluate the models on test data.

For further information concerning setting up and running katuali, refer to its documentation.