Advanced Megalodon Arguments
Guppy Backend Arguments
--do-not-use-guppy-server
Use an alternative basecalling backend.
Alternatives are:
FAST5: Read called sequence and full posterior data from fast5 files.
This is the default when --do-not-use-guppy-server is set.
Note that this option requires --post_out to be set when running Guppy and may increase the fast5 file size by 5-10X (see the example invocation after this list).
Taiyaki: Use the Taiyaki package basecalling interface.
This requires a Taiyaki installation (potentially with GPU settings).
Trigger this mode by setting the --taiyaki-model-filename option.
This is much slower than Guppy and is generally intended for experimental models with layers or architectures not supported by Guppy.
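For illustration, a minimal sketch of each alternative backend invocation is shown below. All directory, model, and file names are placeholders, and the FAST5 backend assumes the fast5s were produced by a Guppy run with --post_out set.

    # FAST5 backend: read basecalls and posteriors from Guppy-annotated fast5s
    megalodon guppy_post_out_fast5s/ --do-not-use-guppy-server --outputs basecalls mappings --reference reference.fa

    # Taiyaki backend: basecall directly from a taiyaki checkpoint
    megalodon fast5s/ --do-not-use-guppy-server --taiyaki-model-filename modbase_model.checkpoint --outputs basecalls mappings --reference reference.fa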
--guppy-params
Extra guppy server parameters.
These are mainly intended for tuning performance to a particular compute environment.
Quote the parameters so they are not parsed by megalodon itself.
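For example, server options can be passed through in a single quoted string; the specific Guppy options available depend on the installed Guppy version, and the values here are illustrative.

    megalodon fast5s/ --guppy-params "--num_callers 5 --ipc_threads 6" --outputs basecalls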
--guppy-server-port
Guppy server port.
Default: auto
--reads-per-guppy-batch
Number of reads to send to Guppy per batch within each worker process.
Default: 50
--guppy-timeout
Timeout (in seconds) to wait for the Guppy server to basecall a single read.
Default: 5.0
--list-supported-guppy-configs
List guppy configs with sequence variant and (if applicable) modified base support.
Output Arguments
--basecalls-format
Select either fastq (default) or fasta format for basecalls output.
--num-reads
Number of reads to process. Intended for test runs on a subset.
--read-ids-filename
A file containing read_ids to process (one per line).
Used in the variant phasing pipeline.
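As a sketch, one way to restrict processing to reads seen in a previous run is to pull unique read identifiers from its mappings; file names are placeholders.

    samtools view mappings.bam | cut -f1 | sort -u > read_ids.txt
    megalodon fast5s/ --read-ids-filename read_ids.txt --outputs basecalls mappings --reference reference.fa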
--mod-min-prob
Only include modified base probabilities greater than this value in mod_basecalls and mod_mappings outputs.
Default: 0.01 (1%)
Mapping Arguments
--cram-reference
If --reference is a minimap2 index, the associated FASTA reference needs to be provided for --mappings-format cram.
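For example, CRAM mappings from a prebuilt minimap2 index might be requested as follows; all file names are placeholders.

    megalodon fast5s/ --reference reference.mmi --cram-reference reference.fa --outputs mappings --mappings-format cram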
--samtools-executable
Samtools executable or path for sorting and indexing all mappings.
Default: samtools
--sort-mappings
Perform sorting and indexing of mapping output files.
This can take considerable time for larger runs and thus is off by default.
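For example, to sort and index the mappings at the end of a run using a specific samtools binary (the path is a placeholder):

    megalodon fast5s/ --outputs basecalls mappings --reference reference.fa --sort-mappings --samtools-executable /opt/samtools/bin/samtools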
Sequence Variant Arguments
--context-min-alt-prob
Minimum per-read variant probability to include a variant in the second round of variant evaluation (including context variants).
--disable-variant-calibration
Use raw neural network sequence variant scores.
This option should be set when calibrating a new model.
Default: Calibrate scores as described in --variant-calibration-filename.
--heterozygous-factors
Bayes factor used when computing heterozygous probabilities in diploid variant calling mode.
Two factors must be provided: one for single base substitution variants and one for indels.
--max-indel-size
Maximum indel size to include in testing. Default: 50
--variant-all-paths
Compute the forward algorithm all paths score.
Default: Viterbi best-path score.
--variants-are-atomized
Input variants have been atomized (with megalodon_extras variants atomize).
This saves compute time, but has unpredictable behavior if variants are not atomized.
--variant-calibration-filename
File containing empirical calibration for sequence variant scores.
As created by the megalodon_extras calibrate variants command.
Default: Load the default calibration file for the guppy config.
--variant-context-bases
Context bases for single base SNP and indel calling. Default: [15, 30]
--variant-locations-on-disk
Force sequence variant locations to be stored only within the on-disk database table. This option reduces the RAM requirement, but may drastically slow processing. Default: Store locations in memory and on disk.
--write-variants-text
Output per-read variants in text format.
Output includes the columns: read_id, chrm, strand, pos, ref_log_prob, alt_log_prob, var_ref_seq, var_alt_seq, var_id.
Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.
Reference log probabilities are included to make processing sites with multiple alternative alleles easier.
Position is 0-based.
--write-vcf-log-probs
Write per-read alt log probabilities out in a non-standard VCF field.
The LOG_PROBS field will contain semi-colon delimited log probabilities for each read at this site.
For sites with multiple alternative alleles, per-read calls for each allele are separated by a comma, as specified by the A genotype field type.
The order is consistent within each allele so that per-read probabilities across all alleles can be reconstructed.
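A sketch of a per-read variant calling run using the options above is shown below; it assumes the proposed variants are supplied via the main --variant-filename option, and all file names and values are placeholders.

    megalodon fast5s/ --reference reference.fa --variant-filename proposed_variants.vcf.gz --outputs per_read_variants variants --write-variants-text --write-vcf-log-probs --max-indel-size 50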
Modified Base Arguments
--disable-mod-calibration
Use raw modified base scores from the network.
This option should be set when calibrating a new model.
Default: Calibrate scores as described in --mod-calibration-filename.
--mod-aggregate-method
Modified base aggregation method.
Choices: expectation_maximization (default), binary_threshold
--mod-all-paths
Compute the forward algorithm all paths score for modified base calls.
Default: Viterbi best-path score.
--mod-binary-threshold
Hard threshold for modified base aggregation (probability of modified/canonical base).
Sites where no canonical or modified base achieves this level of confidence will be ignored in aggregation.
Default: 0.75
--mod-calibration-filename
File containing empirical calibration for modified base scores.
As created by the megalodon_extras calibrate modified_bases command.
Default: Load the default calibration file for the guppy config.
--mod-database-timeout
Timeout in seconds for modified base database operations.
Default: 5 seconds
--mod-context-bases
Context bases for modified base calling.
Default: 15
--mod-map-emulate-bisulfite
For mod_mappings output, emulate bisulfite output by converting called modified bases using the --mod-map-base-conv argument.
As of version 2.2, the default mod_mappings output uses the Mm and Ml hts-specs tags (see above) with all modified bases in one output file.
--mod-map-base-conv
For mod_mappings output, convert called bases.
For example, to mimic bisulfite output use:
--mod-map-base-conv C T --mod-map-base-conv Z C
This option is useful since the BAM format does not support modified bases and will convert all alternative bases to ``N``s for storage in BAM/CRAM format.
Note that additional formats may be supported in the future once finalized in hts-specs.
--mod-output-formats
Modified base aggregated output format(s).
Default: bedmethyl
Options: bedmethyl, modvcf, wiggle
The bedmethyl format produces one file per modification type.
This format is specified by the ENCODE consortium.
modvcf is a slight variant of the VCF format used for sequence variant reporting.
This format produces a single file containing all modifications.
The format adds an SN info field, as modified bases occur in a stranded manner, unlike sequence variants (e.g. hemi-methylation).
A VALID_DP genotype field indicates the number of reads included in the proportion modified calculation.
Modified base proportion estimates are stored in genotype fields specified by the single letter modified base encodings (defined in the model file).
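For example, all three aggregated formats could be requested in one run; file names are placeholders and the single letter modified base code (Z here, as in the --mod-motif example below) depends on the basecalling model.

    megalodon fast5s/ --reference reference.fa --outputs mods --mod-motif Z CG 0 --mod-output-formats bedmethyl modvcf wiggle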
--write-mod-log-probs
Write per-read modified base log probabilities out in a non-standard VCF field.
The LOG_PROBS field will contain semi-colon delimited log probabilities for the modified base within each read at this site.
For sites with multiple modified bases, per-read calls for each modification type are separated by a comma, as specified by the A genotype field type.
The order is consistent within each modification type so that per-read probabilities across all modification types can be reconstructed.
--write-mods-text
Output per-read modified bases in text format.
Output includes the columns: read_id, chrm, strand, pos, mod_log_probs, can_log_prob, mod_bases, motif.
Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.
Canonical log probabilities are included to make processing sites with multiple modifications easier.
Megalodon is capable of handling multiple modified bases per site with an appropriate model (e.g. testing for 5mC and 5hmC simultaneously is supported given such a basecalling model).
The motif column includes the searched motif (via --mod-motif) as well as the relative modified base position within that motif (e.g. CG:0 for --mod-motif Z CG 0).
Position is 0-based.
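As an illustration, a run producing this per-read text output for CpG sites might look like the following; paths are placeholders and the modified base code depends on the basecalling model.

    megalodon fast5s/ --reference reference.fa --outputs per_read_mods mods --mod-motif Z CG 0 --write-mods-text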
Taiyaki Backend Arguments
--chunk-size
Size of individual chunks to run as input to the neural network.
A smaller size will result in faster basecalling, but may reduce accuracy.
--chunk-overlap
Overlap between adjacent chunks fed to the basecalling neural network.
A smaller overlap will result in faster basecalling, but may reduce accuracy.
--max-concurrent-chunks
Maximum number of chunks to basecall concurrently.
Allows a global cap on GPU memory usage.
Changes to this parameter do not affect the resulting basecalls.
--taiyaki-model-filename
Taiyaki basecalling model checkpoint file.
In order to identify modified bases, a model trained to identify those modifications must be provided.
Train a new modified base model using taiyaki.
Guppy JSON-format models can be converted to taiyaki checkpoints/models with the taiyaki/bin/json_to_checkpoint.py script for use with megalodon.
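A minimal sketch of a Taiyaki-backend run combining these arguments is shown below; file names are placeholders and the chunk settings are illustrative.

    megalodon fast5s/ --do-not-use-guppy-server --taiyaki-model-filename modbase_model.checkpoint --chunk-size 1000 --chunk-overlap 100 --max-concurrent-chunks 200 --outputs basecalls mappings --reference reference.fa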
Reference/Signal Mapping Output
This output category is intended for use in generating reference sequences or signal mapping files for taiyaki basecall model training.
--ref-include-mods
Include modified base calls in per_read_refs or signal_mappings outputs.
--ref-include-variants
Include sequence variant calls in per-read reference output.
--ref-length-range
Only include reads with read length within the specified range in the per-read reference output.
--ref-percent-identity-threshold
Only include reads with percent identity above this threshold in the per-read reference output.
--ref-percent-coverage-threshold
Only include reads with read alignment coverage above this threshold in the per-read reference output.
--ref-mods-all-motifs
Annotate all --mod-motif occurrences as modified.
Requires that --ref-include-mods is set.
--ref-mod-threshold
Threshold (in log(can_prob/mod_prob) space) used to annotate modified bases in signal_mappings or per_read_refs outputs.
See the megalodon_extras modified_bases estimate_threshold command for help computing this threshold.
Requires that --ref-include-mods is set.
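A sketch of a run producing filtered training data with modified base annotation is shown below; all values are illustrative, including the assumed two-value (min max) form of --ref-length-range.

    megalodon fast5s/ --reference reference.fa --outputs signal_mappings per_read_refs --ref-include-mods --mod-motif Z CG 0 --ref-length-range 500 50000 --ref-percent-identity-threshold 90 --ref-percent-coverage-threshold 80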
Compute Resource Arguments
--num-read-enumeration-threads
Number of parallel threads to use for read enumeration.
This number of threads will be opened in a single read enumeration process and in each signal extraction process (see the next argument).
This value can be increased if the input queue remains empty.
Default: 8
--num-extract-signal-processes
Number of parallel processes to use for signal extraction.
Accessing data and metadata from FAST5 files requires some compute resources. For this reason, multiple processes must be spawned to achieve the highest performance on some systems.
This value can be increased if the input queue remains empty.
Default: 2
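For example, on a machine where the input queue frequently empties, these values might be raised alongside the main worker settings; the --processes and --devices options belong to the main megalodon arguments, and all values are illustrative.

    megalodon fast5s/ --outputs basecalls mappings --reference reference.fa --processes 20 --devices 0 1 --num-read-enumeration-threads 16 --num-extract-signal-processes 4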
Miscellaneous Arguments
--database-safety
Setting for database performance versus corruption protection.
Options:
0 (DB corruption on application crash)
1 (Default; DB corruption on system crash)
2 (DB safe mode)
--edge-buffer
Do not process sequence variant or modified base calls near the edges of a read mapping.
Default: 30
--not-recursive
Only search for fast5 read files directly found within the fast5 directory.
Default: search recursively
--suppress-progress
Suppress progress bar output.
--suppress-queues-status
Suppress dynamic status of output queues.
This status display is helpful for diagnosing I/O issues.
--verbose-read-progress
Output dynamic updates on potential issues encountered during processing.
Default: 3