Human Variant Calling Benchmarks¶
Medaka implements a small variant calling pipeline to call single nucleotide
polymorphisms (SNPs) together with insertions and deletions (Indels) from Nanopore
basecalls. The pipeline leverages a diploid-aware neural network, read haplotype
tagging via WhatsHap, and haplotype consensus
using medaka
’s usual consensus network.
The following benchmarking is performed on chromosome 21 of the HG001 (NA12878) sample,
according to GA4GH best practices
using hap.py with the
GIAB truth set stratified to exclude
repeat regions as defined by the
GA4GH process.
Variants were filtered by their QUAL
values (separately for SNP and Indel classes)
to maximise the F1 score.
SNP and Indel calling¶
Last updated December 2019
Medaka
’s variant calling pipeline first aligns all reads to a reference sequence,
creates a read pileup and uses a recurrent neural network to predict a pair of bases for
every reference locus. The predictions are combined with the reference
sequence to create candidate variants under an independence assumption between
loci; no attempt is made at this point to combine predicted bases across reference
loci. This approach alone gives a reasonable estimation of variants:
Precision |
Recall |
F1 score |
|
medaka first stage |
0.9929 |
0.9392 |
0.965 |
Single nucleotide polymorphisms account for 85% of small variants in the human genome,
leaving a good proportion of insertion and deletion variants. It is therefore
desirable to be able to detect indel variants. To improve on the above results,
medaka
’s pipeline phases the recovered variants
using WhatsHap. This process allows
recovery of maternal and paternal haplotypes, and the assignment of each read to one of them. Having
partitioned reads into their haplotype, medaka consensus
is run
independently for each haplotype to calculate haplotype specific consensus sequences.

This second pass is a task in which we know medaka
excels, see Benchmarks.
From the multiple haplotype consensus sequences it is a simple task to reconstruct
variant calls for both homo- and hetero-zygous sites.
Class |
Precision |
Recall |
F1 score |
|
medaka variant |
SNP |
0.9967 |
0.9954 |
0.996 |
Indel |
0.9558 |
0.9174 |
0.936 |
|
clair |
SNP |
0.9931 |
0.9950 |
0.994 |
Indel |
0.9094 |
0.8092 |
0.856 |
|
nanopolish |
SNP |
0.9938 |
0.9662 |
0.980 |
Shown also are results from clair,
and older results for SNP calling from nanopolish
. We note that Clair
and medaka are competitive in terms of SNP calling but that medaka
excels Clair for indel calling. In contrast to medaka’s haplotype-consensus
approach, clair attempts to call indels directly from the read pileup;
medaka’s factorisation of variant calling problem would appear to be more
successful. This seems to be particularly true when comparing heterozygous
indels to homozygous cases:
Indel |
Precision |
Recall |
F1 score |
|
medaka variant |
Hom |
0.9790 |
0.9553 |
0.967 |
Het |
0.8995 |
0.8738 |
0.886 |
|
clair |
Hom |
0.9566 |
0.9175 |
0.937 |
Het |
0.8707 |
0.7332 |
0.796 |
R10.3 SNP and Indel calling¶
Last updated February 2020
The benchmarks were performed as described above, but on chromosome 20 of the HG002 (NA24385) sample at 60-fold coverage.
Class |
Precision |
Recall |
F1 score |
|
R9.4.1 |
SNP |
0.9898 |
0.9931 |
0.992 |
Indel |
0.9199 |
0.8634 |
0.891 |
|
R10.3 |
SNP |
0.9892 |
0.9946 |
0.992 |
Indel |
0.9508 |
0.9385 |
0.945 |
Guppy 3.6 SNP and Indel calling¶
Last updated June 2020
The benchmarks were performed as described above, but on chromosome 20 of the HG002 (NA24385) sample at 60-fold coverage.
medaka / variant model |
Class |
Precision |
Recall |
F1 score |
v1.0.1 / r941_prom_variant_g322 |
SNP |
0.9898 |
0.9931 |
0.992 |
Indel |
0.9199 |
0.8634 |
0.891 |
|
v1.0.2 / r941_prom_variant_g360 |
SNP |
0.9917 |
0.9965 |
0.994 |
Indel |
0.9379 |
0.8982 |
0.918 |
Performing Variant Calling¶
Note
The medaka_variant
(and medaka_consensus
) pipeline only operate on
a .bam
alignment file containing a single sample (value of the RG
alignment tag). It will refuse to run in the case of two read groups
being present.
The pipeline described above is implemented in the medaka_variant
program:
source ${MEDAKA}
medaka_variant -f <REFERENCE.fasta> -b <reads.bam>
This will run all steps of the process described above, finally outputting a
phased .vcf
variant file.
Note
Variants output by medaka_variant
are marked as “PASS” or “lowqual” in
the FILTER
column of the output .vcf
file using simple quality
thresholds. No variants are hard-filtered from the output file. Users may
wish to apply alternative filtering for their datasets and applications
using the -U, -P and -N options to medaka_variant
.
Further Improvements¶
Medaka SNP and indel calling is an extremely active area of research. We anticipate that the final step of variant calling, which amounts to a consensus call on a relatively pure haplotyped set of reads, will benefit (along with consensus calling) from further innovations in feature encoding, network architecture and training strategy. Future releases may also exploit direct snp and variant calling from mixed read populations.