Medaka variant-calling training pipeline¶
Katuali supports the training of medaka variant-calling models from .fast5
files, or preprocessed data such as basecalls or alignments if they are
available.
Models can be trained both for the initial SNP calling from mixed reads prior to phasing, referred to below as SNP-calling models, and for final variant (SNP and indel) calling from phased reads, referred to below as variant-calling models. See the medaka docs for more information on the medaka variant calling pipeline.
A specialised config file (config_variant.yaml
) for medaka variant calling
is provided by running
katuali_config --variant my_config_variant.yaml.
Input data specification¶
The data required for variant calling training is separated into two directory structures:
training data
variant calling resources: references and truth data.
Training data¶
As for the Medaka consensus training pipeline read .fast5
files should be placed
under top-level folders. Multiple top-levelfolders can be used.
Within the DATA
section of a katuali configuration file, these top-level
folders should be listed along with details of the reference sequence.
MEDAKA_TRAIN_REGIONS
lists the chromosomes included in training
In the example below data from the run (dna_prom
) will be used for training for
chromosomes 15 -19 of the GIAB sample NA24385.
DATA:
'dna_prom':
'REFERENCE': '/medaka_variant_resources/grch38/grch38.fna.gz.fasta'
'MEDAKA_TRAIN_REGIONS': ['chr15', 'chr16', 'chr17', 'chr18', 'chr19']
'MEDAKA_EVAL_REGIONS': []
'SAMPLE: "NA24385"
Variant calling resources¶
The protocol for training for variant calling require various reference and
truth data. This can be configured within the VARIANT_RESOURCES
section of
a katuali config file. The default config is set up for the GIAB sample
NA24385 (HG002), with configurable URLs which will be used to download the GIAB
high-confidence variants (vcf file) and regions (bed file).
Medaka training features for variant calling (but not SNP-calling) are formed
from pileups of reads aligned to the GRCh38 reference scuffed up with variants
from the 1000 Genomes Project. The VARIANTS_TO_ADD
config entry defines
the template filepath of the per-chromosome VCF file which will be used to
scuff up the reference. If the file is not present, it will automatically be
downloaded. It is possible to use alternative variants to scuff the reference
by changing VARIANTS_TO_ADD
to point at your own VCF. Similarly, if the
GRCh38 reference is not not found at the expected filepath
(medaka_variant_resources/grch38/grch38.fna.gz
), it will automatically be
downloaded. If you already have this file downloaded and wish to avoid
downloading it again, you can place a symlink at the expected filepath.
It is possible to train from a different, or multiple samples by extending the
SAMPLES
config section:
VARIANT_RESOURCES:
"SAMPLES":
"NA24385":
"TRUTH_HIGH_CONF_BED_URL": "https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh38/HG002_GRCh38_GIAB_highconf_CG-Illfb-IllsentieonHC-Ion-10XsentieonHC-SOLIDgatkHC_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed"
"TRUTH_HIGH_CONF_VCF_URL": "https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh38/HG002_GRCh38_GIAB_highconf_CG-Illfb-IllsentieonHC-Ion-10XsentieonHC-SOLIDgatkHC_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz"
"VARIANTS_TO_ADD": "medaka_variant_resources/grch38/1kgenomes/ALL.{chr}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz"
Based on the information provided in the config, Katuali
will download and
process reference and truth data into various intermediate files required for
creating training features. This data required is stored under a directory
structure medaka_training_resources
There should be no need for the user to delve into the contents of the directory, but its contents are nevertheless documented here. The following example shows the folder structure for the grch38 reference and the NA24385 sample.
medaka_variant_resources/
grch38/ # grch38 reference directory (currently only the grch38 is supported).
grch38.fna.gz # grch38 reference file which will be automatically downloaded.
grch38.fna.gz_per_chr/ # per chromosome references stored as chrX.fasta, automatically created.
1kgenomes/ # VARIANTS_TO_ADD: vcf file per chromosome, automatically downloaded.
NA24385/ # SAMPLE Truth reference {SAMPLE.fasta}: NA24385.fasta, created automatically from the truth vcf and reference.
NA24385_per_chr/ # Phased reference per chromosome (e.g. chr18.fasta
# chr18_NA24385_maternal.fasta and chr18_NA24385_paternal.fasta, created automatically).
scuffed_ref/
NA24385/
chrN/ # Scuffed up reference ref_edited.fasta, ref_edited_rc.fasta, automatically created.
Creating training features¶
SNP and variant-calling models can be trained in a single command using the
all_medaka_train
pipeline, see Training models.
As with the Medaka consensus training pipeline, it is also possible to just create the features for training outside of katuali.
Features for training SNP-calling models can be created with the
all_medaka_snp_feat
pipeline:
katuali all_medaka_snp_feat
This will create a SNP-calling feature file for every valid combination of dataset (top-level folder), region and coverage depth . For example the file
dna_prom/guppy/align/chr15/10X_prop/medaka_diploid_snp_features/dna_prom_chr15_10X_medaka_train.hdf
will be produced for a top-level folder named dna_prom
, chromosome 15, at
coverage of 10
-fold.
Similarly, features for training variant-calling models can be created with the
all_medaka_variant_feat
pipeline:
katuali all_medaka_variant
This will create two variant-calling feature files for every valid combination of dataset (top-level folder), region and coverage depth. For example the files
dna_prom/guppy/align/chr16/calls2scuffed_ref/20X_prop/medaka_features/medaka_train_variant_dna_prom_chr16_20X.hdf
dna_prom/guppy/align/chr16/calls2scuffed_ref/20X_prop/medaka_features/medaka_train_variant_dna_prom_chr16_20X_rc.hdf
will be produced for a top-level folder named dna_prom
, chromosome 16, at
coverage of 20
-fold.
Training models¶
SNP and variant-calling models can be trained in a single command using the
same all_medaka_train
pipeline as used for training consensus models, see Training models
The features used for model training are fully configurable in the config,
allowing a single all_medaka_train
pipeline to train either a medaka
consensus model, or medaka SNP and/or variant-calling models.
This can be controled via the MEDAKA_TRAIN_REPLICATES
config entry.
This example from the default config trains three medaka-consensus model replicates using the
all_medaka_feat
pipeline to create medaka-consensus training features.
# Run multiple training replicates - output will be in medaka_train_{key},
# values should be a key of the PIPELINES dictionary in this file. In simple
# cases this allows running technical replicates of the training, but also
# allows the pipeline to be changed to for example create features in a
# different manner. For the latter change the value component to an alternative
# user defined pipeline.
MEDAKA_TRAIN_REPLICATES:
"_rep_1": "all_medaka_feat"
"_rep_2": "all_medaka_feat"
"_rep_3": "all_medaka_feat"
In contrast, this example from the variant-calling config trains three medaka SNP-calling replicates and three medaka variant-calling replicates, each using their respective feature-generation pipelines.
MEDAKA_TRAIN_REPLICATES:
"_snp_rep_1": "all_medaka_snp_feat"
"_snp_rep_2": "all_medaka_snp_feat"
"_snp_rep_3": "all_medaka_snp_feat"
"_variant_rep_1": "all_medaka_variant_feat"
"_variant_rep_2": "all_medaka_variant_feat"
"_variant_rep_3": "all_medaka_variant_feat"
Missing feature files¶
As for the Medaka consensus training pipeline, if input datasets have insufficient coverage-depth for some of the training chromosomes, some training feature files will not be produced. See Coping with missing feature files for how to cope with this.