Advanced Megalodon Arguments
Guppy Backend Argument
--do-not-use-guppy-server
Use an alternative basecalling backend.
Alternatives are:
FAST5: Read called sequence and full posterior data from fast5 files.
This is the default when --do-not-use-guppy-server is set.
Note that this option requires --post_out be set when running Guppy and may increase the fast5 file size by 5-10X.
Taiyaki: Use the Taiyaki package basecalling interface.
This requires a Taiyaki installation (potentially with GPU settings).
Trigger this mode by setting the --taiyaki-model-filename option.
This is much slower than Guppy and is generally intended for experimental models with either layers or architectures not supported by Guppy.
--guppy-params
Extra Guppy server parameters.
The main purpose is to tune performance for the compute environment.
Quote the parameters to avoid them being parsed by Megalodon.
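The quoting advice for --guppy-params can be sketched in Python. This is a minimal illustration, assuming Megalodon is launched from a script; the Guppy option string and the raw_fast5s/ input directory are placeholders chosen for the example, not recommended settings.

```python
import shlex

# Placeholder Guppy server options; substitute whatever your setup needs.
guppy_params = "--num_callers 5 --ipc_threads 6"

# Keeping the options in ONE list element means they reach Megalodon as a
# single argument value instead of being parsed as Megalodon options.
cmd = ["megalodon", "raw_fast5s/", "--guppy-params", guppy_params]

# shlex.join adds the shell quoting needed to paste the command into a terminal.
shell_line = shlex.join(cmd)
print(shell_line)
```

The printed line wraps the Guppy options in single quotes, which is the form to use when typing the command by hand.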
--guppy-server-port
Guppy server port.
Default: auto
--reads-per-guppy-batch
Number of reads to send to Guppy per batch within each worker process.
Default: 50
--guppy-timeout
Timeout, in seconds, to wait for the Guppy server to call a single read.
Default: 5.0
--list-supported-guppy-configs
List Guppy configs with sequence variant and (if applicable) modified base support.
Output Arguments
--basecalls-format
Select either fastq (default) or fasta format for basecalls output.
--num-reads
Number of reads to process. Intended for test runs on a subset.
--read-ids-filename
A file containing read_ids to process (one per line).
Used in the variant phasing pipeline.
--mod-min-prob
Only include modified base probabilities greater than this value in mod_basecalls and mod_mappings outputs.
Default: 0.01 (1%)
Mapping Arguments
--cram-reference
If --reference is a minimap2 index, the associated FASTA reference needs to be provided for --mappings-format cram.
--samtools-executable
Samtools executable or path for sorting and indexing all mappings.
Default: samtools
--sort-mappings
Perform sorting and indexing of mapping output files.
This can take considerable time for larger runs and thus is off by default.
Sequence Variant Arguments
--context-min-alt-prob
Minimum per-read variant probability to include a variant in the second round of variant evaluation (including context variants).
--disable-variant-calibration
Use raw neural network sequence variant scores.
This option should be set when calibrating a new model.
Default: Calibrate scores as described in --variant-calibration-filename
--heterozygous-factors
Bayes factors used when computing heterozygous probabilities in diploid variant calling mode.
Two factors must be provided: one for single base substitution variants and one for indels.
--max-indel-size
Maximum indel size to include in testing. Default: 50
--variant-all-paths
Compute the forward algorithm all-paths score.
Default: Viterbi best-path score.
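The difference between the two scores can be illustrated with a toy example. The sketch below assumes the log scores of every candidate path are already enumerated in a list, which is not how Megalodon computes them (it works on the network posteriors); it only shows that the best-path score takes the maximum while the all-paths score sums probability over paths.

```python
import math

def best_path_score(path_log_scores):
    # Viterbi-style score: keep only the single most likely path.
    return max(path_log_scores)

def all_paths_score(path_log_scores):
    # Forward-style score: log of the summed path probabilities,
    # computed with the log-sum-exp trick for numerical stability.
    top = max(path_log_scores)
    return top + math.log(sum(math.exp(s - top) for s in path_log_scores))

toy_scores = [-1.0, -2.0, -5.0]  # made-up path log scores
print(best_path_score(toy_scores), all_paths_score(toy_scores))
```

The all-paths score is never lower than the best-path score, since it includes that path plus every alternative.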
--variants-are-atomized
Input variants have been atomized (with megalodon_extras variants atomize).
This saves compute time, but has unpredictable behavior if variants are not atomized.
--variant-calibration-filename
File containing empirical calibration for sequence variant scores, as created by the megalodon_extras calibrate variants command.
Default: Load default calibration file for Guppy config.
--variant-context-bases
Context bases for single base SNP and indel calling. Default: [15, 30]
--variant-locations-on-disk
Force sequence variant locations to be stored only within the on-disk database table. This option reduces the RAM requirement, but may drastically slow processing. Default: Store locations in memory and on disk.
--write-variants-text
Output per-read variants in text format.
Output includes columns: read_id, chrm, strand, pos, ref_log_prob, alt_log_prob, var_ref_seq, var_alt_seq, var_id
Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.
Reference log probabilities are included to make sites with multiple alternative alleles easier to process.
Position is 0-based.
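As a rough illustration of working with the per-read variant columns listed above, the sketch below reads a small in-memory table. It assumes the file is tab-delimited with a header line and that the log probabilities are natural logs; both assumptions should be checked against a real output file. The two data rows are made up for the example.

```python
import csv
import io
import math

# Made-up rows in the documented column order (assumed tab-delimited with header).
example = (
    "read_id\tchrm\tstrand\tpos\tref_log_prob\talt_log_prob\t"
    "var_ref_seq\tvar_alt_seq\tvar_id\n"
    "read1\tchr1\t+\t99\t-0.105\t-2.303\tA\tG\tvar1\n"
    "read2\tchr1\t-\t99\t-1.609\t-0.223\tA\tG\tvar1\n"
)

def read_variant_rows(handle):
    """Yield one dict per read/variant call with a few derived values."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {
            "read_id": row["read_id"],
            "chrm": row["chrm"],
            # pos is 0-based in the text output; add 1 for VCF-style coordinates.
            "pos_1based": int(row["pos"]) + 1,
            # Assumes natural-log probabilities.
            "alt_prob": math.exp(float(row["alt_log_prob"])),
        }

rows = list(read_variant_rows(io.StringIO(example)))
for r in rows:
    print(r)
```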
--write-vcf-log-probs
Write per-read alt log probabilities out in a non-standard VCF field.
The LOG_PROBS field will contain semicolon-delimited log probabilities for each read at this site.
For sites with multiple alternative alleles, per-read calls for each allele are separated by a comma as specified by the A genotype field type.
The order is consistent within each allele so that per-read probabilities across all alleles can be reconstructed.
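The LOG_PROBS layout described above (commas between alleles, semicolons between reads) can be unpacked as follows. This is a sketch based only on that description; the example string is invented, and the function name is hypothetical.

```python
def parse_log_probs(field_value):
    """Split a LOG_PROBS value into one list of per-read floats per allele."""
    return [
        [float(lp) for lp in allele_part.split(";")]
        for allele_part in field_value.split(",")
    ]

# Invented value: two alternative alleles, three reads each.
per_allele = parse_log_probs("-0.1;-2.3;-0.5,-3.0;-0.2;-1.1")

# Because read order is consistent within each allele, transposing the
# per-allele lists recovers the log probabilities for each read.
per_read = list(zip(*per_allele))
print(per_read)
```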
Modified Base Arguments
--disable-mod-calibration
Use raw modified base scores from the network.
This option should be set when calibrating a new model.
Default: Calibrate scores as described in --mod-calibration-filename
--mod-aggregate-method
Modified base aggregation method.
Choices: expectation_maximization (default), binary_threshold
--mod-all-paths
Compute the forward algorithm all-paths score for modified base calls.
Default: Viterbi best-path score.
--mod-binary-threshold
Hard threshold for modified base aggregation (probability of modified/canonical base).
Sites where neither the canonical base nor any modified base reaches this level of confidence will be ignored in aggregation.
Default: 0.75
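One plausible reading of the binary threshold rule is sketched below for a single modification type. It is not taken from the Megalodon source: the function name is hypothetical, the input is assumed to be a per-read modified base probability, and the threshold is assumed to be above 0.5 so that a read cannot count as both modified and canonical.

```python
def binary_threshold_fraction(mod_probs, threshold=0.75):
    """Return (fraction modified, number of confident reads) for one site."""
    # A read counts as modified if its modified probability clears the threshold,
    # as canonical if its canonical probability does, and is ignored otherwise.
    n_mod = sum(1 for p in mod_probs if p >= threshold)
    n_can = sum(1 for p in mod_probs if (1.0 - p) >= threshold)
    valid = n_mod + n_can
    if valid == 0:
        return None, 0
    return n_mod / valid, valid

# Invented per-read probabilities: two confident modified calls,
# one confident canonical call, two ambiguous reads that are dropped.
fraction, valid_reads = binary_threshold_fraction([0.9, 0.8, 0.1, 0.6, 0.5])
print(fraction, valid_reads)
```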
--mod-calibration-filename
File containing empirical calibration for modified base scores, as created by the megalodon_extras calibrate modified_bases command.
Default: Load default calibration file for Guppy config.
--mod-database-timeout
Timeout in seconds for modified base database operations.
Default: 5 seconds
--mod-context-bases
Context bases for modified base calling.
Default: 15
--mod-map-emulate-bisulfite
For mod_mappings output, emulate bisulfite output by converting called modified bases using the --mod-map-base-conv argument.
As of version 2.2, the default mod_mappings output uses the Mm and Ml hts-specs tags (see above) with all modified bases in one output file.
--mod-map-base-conv
For mod_mappings output, convert called bases.
For example, to mimic bisulfite output use: --mod-map-base-conv C T --mod-map-base-conv Z C
This option is useful since the BAM format does not support modified bases and will convert all alternative bases to Ns for storage in BAM/CRAM format.
Note additional formats may be supported in the future once finalized in hts-specs.
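The effect of the bisulfite-style conversion pairs shown above (C to T, Z to C) can be sketched on a plain string. This is an illustration only: the function name is hypothetical, and applying all conversions in a single pass is an assumption made here so that a Z converted to C is not converted again to T.

```python
def convert_bases(seq, conversions):
    """Apply (from_base, to_base) pairs to seq in one simultaneous pass."""
    table = str.maketrans({src: dst for src, dst in conversions})
    return seq.translate(table)

# Z stands for the modified base, as in the example command line.
bisulfite_like = convert_bases("ACZGCZ", [("C", "T"), ("Z", "C")])
print(bisulfite_like)  # ATCGTC
```

Unmodified C bases read as T and modified bases read as C, which mirrors what bisulfite sequencing reports.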
--mod-output-formats
Modified base aggregated output format(s).
Default: bedmethyl
Options: bedmethyl, modvcf, wiggle
bedmethyl format produces one file per modification type.
This format is specified by the ENCODE consortium.
modvcf is a slight variant of the VCF format used for sequence variant reporting.
This format produces a single file containing all modifications.
The format adds an SN info field as modified bases occur in a stranded manner unlike sequence variants (e.g. hemi-methylation).
A genotype field VALID_DP indicates the number of reads included in the proportion modified calculation.
Modified base proportion estimates are stored in genotype fields specified by the single letter modified base encodings (defined in the model file).
--write-mod-log-probs
Write per-read modified base log probabilities out in a non-standard VCF field.
The LOG_PROBS field will contain semicolon-delimited log probabilities for the modified base within each read at this site.
For sites with multiple modified bases, per-read calls for each modification type are separated by a comma as specified by the A genotype field type.
The order is consistent within each modification type so that per-read probabilities across all modification types can be reconstructed.
--write-mods-text
Output per-read modified bases in text format.
Output includes columns: read_id, chrm, strand, pos, mod_log_probs, can_log_prob, mod_bases, motif
Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.
Canonical log probabilities are included to make sites with multiple modifications easier to process.
Megalodon is capable of handling multiple modified bases per site with an appropriate model (e.g. testing for 5mC and 5hmC simultaneously is supported given a suitable basecalling model).
motif includes the searched motif (via --mod-motif) as well as the relative modified base position within that motif (e.g. CG:0 for provided --mod-motif Z CG 0).
Position is 0-based.
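The motif column format described above (motif, colon, 0-based relative position) can be split with a few lines of Python. The function name is hypothetical and the sketch relies only on the CG:0 example given in the text.

```python
def parse_motif_field(value):
    """Split a motif value such as 'CG:0' into (motif, relative position)."""
    motif, rel_pos = value.rsplit(":", 1)
    return motif, int(rel_pos)

motif, rel_pos = parse_motif_field("CG:0")
# The relative position is 0-based, so it indexes the motif string directly.
canonical_base = motif[rel_pos]
print(motif, rel_pos, canonical_base)
```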
Taiyaki Backend Arguments
--chunk-size
Size of individual chunks to run as input to the neural network.
Smaller size will result in faster basecalling, but may reduce accuracy.
--chunk-overlap
Overlap between adjacent chunks fed to the basecalling neural network.
Smaller size will result in faster basecalling, but may reduce accuracy.
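How chunk size and chunk overlap interact can be sketched generically. The function below is a hypothetical illustration of overlapping windows over a signal of a given length; Taiyaki's actual chunking (for example, how the final chunk is aligned) may differ.

```python
def chunk_bounds(signal_len, chunk_size, overlap):
    """Return (start, end) index pairs for overlapping chunks over a signal."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this far after the last
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_size, signal_len)
        bounds.append((start, end))
        if end >= signal_len:
            break
        start += step
    return bounds

# Toy numbers: a 10-sample signal, chunks of 4 samples overlapping by 1.
print(chunk_bounds(10, 4, 1))  # [(0, 4), (3, 7), (6, 10)]
```

A larger overlap shrinks the step between chunks, so more chunks are needed to cover the same signal.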
--max-concurrent-chunks
Maximum number of chunks to basecall concurrently.
Allows a global cap on GPU memory usage.
Changes to this parameter do not affect the resulting basecalls.
--taiyaki-model-filename
Taiyaki basecalling model checkpoint file.
In order to identify modified bases, a model trained to identify those modifications must be provided.
Train a new modified base model using Taiyaki.
Guppy JSON-format models can be converted to Taiyaki checkpoints/models with the taiyaki/bin/json_to_checkpoint.py script for use with Megalodon.
Reference/Signal Mapping Output
This output category is intended for use in generating reference sequences or signal mapping files for Taiyaki basecall model training.
--ref-include-mods
Include modified base calls in per_read_refs or signal_mappings outputs.
--ref-include-variants
Include sequence variant calls in per-read reference output.
--ref-length-range
Only include reads with specified read length in per-read reference output.
--ref-percent-identity-threshold
Only include reads with higher percent identity in per-read reference output.
--ref-percent-coverage-threshold
Only include reads with higher read alignment coverage in per-read reference output.
--ref-mods-all-motifs
Annotate all --mod-motif occurrences as modified.
Requires that --ref-include-mods is set.
--ref-mod-threshold
Threshold (in log(can_prob/mod_prob) space) used to annotate modified bases in signal_mappings or per_read_refs outputs.
See the megalodon_extras modified_bases estimate_threshold command for help computing this threshold.
Requires that --ref-include-mods is set.
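The log(can_prob/mod_prob) space mentioned above can be made concrete with a small sketch. The function name is hypothetical, and the direction of the comparison (a ratio below the threshold is annotated as modified) is an assumption made for this example rather than something stated in the text.

```python
def is_annotated_modified(can_log_prob, mod_log_prob, threshold):
    """Compare log(can_prob / mod_prob) against a threshold."""
    # Subtracting log probabilities gives the log of the probability ratio.
    log_ratio = can_log_prob - mod_log_prob
    # Assumed direction: lower ratios mean stronger evidence for the modification.
    return log_ratio < threshold

# Invented values: the first call favors canonical, the second favors modified.
print(is_annotated_modified(-0.1, -3.0, threshold=0.0))  # False
print(is_annotated_modified(-3.0, -0.1, threshold=0.0))  # True
```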
Compute Resource Arguments
--num-read-enumeration-threads
Number of parallel threads to use for read enumeration.
This number of threads will be opened in a single read enumeration process and in each signal extraction process (see next argument).
This value can be increased if the input queue remains empty.
Default: 8
--num-extract-signal-processes
Number of parallel processes to use for signal extraction.
Accessing data and metadata from FAST5 files requires some compute resources. For this reason, multiple processes must be spawned to achieve the highest performance on some systems.
This value can be increased if the input queue remains empty.
Default: 2
Miscellaneous Arguments
--database-safety
Setting for database performance versus corruption protection.
Options:
0 (DB corruption on application crash)
1 (Default; DB corruption on system crash)
2 (DB safe mode)
--edge-buffer
Do not process sequence variant or modified base calls near the edge of a read mapping.
Default: 30
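The idea behind the edge buffer can be sketched as a simple position filter. The function name, the half-open mapping interval and the exact boundary handling are assumptions for illustration; Megalodon's own boundary conventions may differ.

```python
def keep_site(pos, map_start, map_end, edge_buffer=30):
    """Return True if pos is at least edge_buffer bases inside the mapping.

    The mapping is treated as the half-open interval [map_start, map_end).
    """
    return map_start + edge_buffer <= pos < map_end - edge_buffer

# A read mapped to positions 0..99: only sites 30..69 are kept.
kept = [p for p in range(100) if keep_site(p, 0, 100)]
print(kept[0], kept[-1])  # 30 69
```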
--not-recursive
Only search for fast5 read files directly found within the fast5 directory.
Default: search recursively
--suppress-progress
Suppress progress bar output.
--suppress-queues-status
Suppress dynamic status of output queues.
These queues are helpful for diagnosing I/O issues.
--verbose-read-progress
Output dynamic updates to potential issues during processing.
Default: 3