Advanced Megalodon Arguments

Guppy Backend Arguments

  • --do-not-use-guppy-server

    • Use alternative basecalling backend

    • Alternatives are:

      • FAST5: Read called sequence and full posterior data from fast5 files.

        • This is the default when --do-not-use-guppy-server is set.

        • Note that this option requires --post_out be set when running Guppy and may increase the fast5 file size by 5-10X.

      • Taiyaki: Use the Taiyaki package basecalling interface

        • This requires a Taiyaki installation (potentially with GPU settings).

        • Trigger this mode by setting the --taiyaki-model-filename option.

        • This is much slower than Guppy and is generally intended for experimental models with either layers or architectures not supported by Guppy.

  • --guppy-params

    • Extra guppy server parameters.

    • Mainly intended for tuning performance to the compute environment.

    • Quote parameters to avoid them being parsed by megalodon.

  • --guppy-server-port

    • Guppy server port.

    • Default: auto

  • --reads-per-guppy-batch

    • Number of reads to send to Guppy per batch within each worker process.

    • Default: 50

  • --guppy-timeout

    • Time in seconds to wait for the guppy server to call a single read.

    • Default: 5.0

  • --list-supported-guppy-configs

    • List guppy configs with sequence variant and (if applicable) modified base support.
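The quoting advice for --guppy-params can be sketched in Python. This is a minimal illustration, not part of Megalodon: the helper name build_megalodon_command, the fast5s/ directory, the port number, and the guppy parameter string are all placeholders chosen for the example. The point is only that the guppy parameters travel as one argument, so megalodon does not try to parse them itself.

```python
import shlex


def build_megalodon_command(fast5_dir, guppy_params, port=None):
    """Build an argv list in which guppy_params stays a single argument.

    Hypothetical helper for illustration; it only assembles strings.
    """
    argv = ["megalodon", fast5_dir, "--guppy-params", guppy_params]
    if port is not None:
        argv += ["--guppy-server-port", str(port)]
    return argv


# Placeholder values: adjust the guppy parameters to your own environment.
argv = build_megalodon_command(
    "fast5s/", "-d ./rerio/basecall_models/", port=5555
)

# shlex.join quotes the parameter string so a shell passes it through whole.
command_line = shlex.join(argv)
print(command_line)
```

Passing the argv list directly to subprocess.run avoids shell quoting altogether; the joined string is only needed when the command is pasted into a shell.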

Output Arguments

  • --basecalls-format

    • Select either fastq (default) or fasta format for basecalls output.

  • --num-reads

    • Number of reads to process. Intended for test runs on a subset.

  • --read-ids-filename

    • A file containing read_ids to process (one per line).

    • Used in the variant phasing pipeline.

  • --mod-min-prob

    • Only include modified base probabilities greater than this value in mod_basecalls and mod_mappings outputs.

    • Default: 0.01 (1%)
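As a small sketch of the file expected by --read-ids-filename (one read_id per line), the following Python helpers write and read such a file. The function names are hypothetical, and skipping blank lines is an assumption of this example rather than documented Megalodon behavior.

```python
from pathlib import Path


def write_read_ids(read_ids, path):
    """Write one read_id per line."""
    Path(path).write_text("".join(f"{read_id}\n" for read_id in read_ids))


def load_read_ids(path):
    """Read one read_id per line, ignoring blank lines (an assumption)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```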

Mapping Arguments

  • --cram-reference

    • If --reference is a minimap2 index, the associated FASTA reference needs to be provided for --mappings-format cram.

  • --samtools-executable

    • Samtools executable or path for sorting and indexing all mappings.

    • Default: samtools

  • --sort-mappings

    • Perform sorting and indexing of mapping output files.

    • This can take considerable time for larger runs and thus is off by default.
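For runs where --sort-mappings was left off, sorting and indexing can be done afterwards with samtools. The sketch below only builds the two command lines; the helper name and the ".sorted" output naming are assumptions of this example, and the commands Megalodon runs internally may differ.

```python
from pathlib import Path


def sort_and_index_commands(mappings_path, samtools="samtools"):
    """Return (sort_cmd, index_cmd) for a mappings file.

    The output name (e.g. mappings.sorted.bam) is a hypothetical convention.
    """
    source = Path(mappings_path)
    sorted_path = source.with_suffix(".sorted" + source.suffix)
    sort_cmd = [samtools, "sort", "-o", str(sorted_path), str(source)]
    index_cmd = [samtools, "index", str(sorted_path)]
    return sort_cmd, index_cmd


sort_cmd, index_cmd = sort_and_index_commands("mappings.bam")
# Run with subprocess.run(sort_cmd, check=True), then the index command.
```

The samtools argument mirrors --samtools-executable, so a full path can be supplied when samtools is not on the PATH.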

Sequence Variant Arguments

  • --context-min-alt-prob

    • Minimum per-read variant probability to include a variant in second round of variant evaluation (including context variants).

  • --disable-variant-calibration

    • Use raw neural network sequence variant scores.

    • This option should be set when calibrating a new model.

    • Default: Calibrate scores as described in --variant-calibration-filename

  • --heterozygous-factors

    • Bayes factor used when computing heterozygous probabilities in diploid variant calling mode.

    • Two factors must be provided: one for single-base substitution variants and one for indels.

  • --max-indel-size

    • Maximum indel size to include in testing. Default: 50

  • --variant-all-paths

    • Compute the forward algorithm all paths score.

    • Default: Viterbi best-path score.

  • --variants-are-atomized

    • Input variants have been atomized (with megalodon_extras variants atomize).

    • This saves compute time, but has unpredictable behavior if variants are not atomized.

  • --variant-calibration-filename

    • File containing empirical calibration for sequence variant scores.

    • As created by the megalodon_extras calibrate variants command.

    • Default: Load default calibration file for guppy config.

  • --variant-context-bases

    • Context bases for single base SNP and indel calling. Default: [15, 30]

  • --variant-locations-on-disk

    • Force sequence variant locations to be stored only within the on-disk database table.

    • This option reduces the memory requirement, but may drastically slow processing.

    • Default: Store locations in memory and on disk.

  • --write-variants-text

    • Output per-read variants in text format.

      • Output includes columns: read_id, chrm, strand, pos, ref_log_prob, alt_log_prob, var_ref_seq, var_alt_seq, var_id

      • Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.

        • Reference log probabilities are included to make sites with multiple alternative alleles easier to process.

      • Position is 0-based

  • --write-vcf-log-probs

    • Write per-read alt log probabilities out in non-standard VCF field.

      • The LOG_PROBS field will contain semi-colon delimited log probabilities for each read at this site.

      • For sites with multiple alternative alleles, per-read calls for each allele are separated by a comma as specified by the A genotype field type.

        • The order is consistent within each allele so that per-read probabilities across all alleles can be reconstructed.
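The per-read outputs described for --write-variants-text and --write-vcf-log-probs can be read with a few lines of Python. This is a sketch under stated assumptions: the text file is taken to be tab-delimited, log probabilities are taken to be natural logs, and the LOG_PROBS layout (alleles separated by commas, reads within an allele separated by semi-colons) is one reading of the description above. The example row is invented for illustration.

```python
import math

VARIANT_COLUMNS = [
    "read_id", "chrm", "strand", "pos", "ref_log_prob",
    "alt_log_prob", "var_ref_seq", "var_alt_seq", "var_id",
]


def parse_variant_line(line, sep="\t"):
    """Parse one per-read variant row (tab delimiter is an assumption)."""
    record = dict(zip(VARIANT_COLUMNS, line.rstrip("\n").split(sep)))
    record["pos"] = int(record["pos"])  # 0-based position
    record["ref_log_prob"] = float(record["ref_log_prob"])
    record["alt_log_prob"] = float(record["alt_log_prob"])
    return record


def alt_probability(record):
    """Alt allele probability, normalized over ref and alt (natural logs assumed)."""
    ref_p = math.exp(record["ref_log_prob"])
    alt_p = math.exp(record["alt_log_prob"])
    return alt_p / (ref_p + alt_p)


def parse_log_probs_field(value):
    """Split a LOG_PROBS value into one list of per-read values per allele."""
    return [[float(x) for x in allele.split(";")] for allele in value.split(",")]


# Invented example row, not real Megalodon output.
example_line = "\t".join(
    ["read_0001", "chr1", "+", "1000", "-1.6094", "-0.2231", "A", "G", "var_1"]
)
record = parse_variant_line(example_line)
```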

Modified Base Arguments

  • --disable-mod-calibration

    • Use raw modified base scores from the network.

    • This option should be set when calibrating a new model.

    • Default: Calibrate scores as described in --mod-calibration-filename

  • --mod-aggregate-method

    • Modified base aggregation method.

    • Choices: expectation_maximization (default), binary_threshold

  • --mod-all-paths

    • Compute the forward algorithm all paths score for modified base calls.

    • Default: Viterbi best-path score.

  • --mod-binary-threshold

    • Hard threshold for modified base aggregation (probability of modified/canonical base).

      • Sites where no canonical or modified base achieves this level of confidence will be ignored in aggregation.

    • Default: 0.75

  • --mod-calibration-filename

    • File containing empirical calibration for modified base scores.

    • As created by the megalodon_extras calibrate modified_bases command.

    • Default: Load default calibration file for guppy config.

  • --mod-database-timeout

    • Timeout in seconds for modified base database operations.

    • Default: 5 seconds

  • --mod-context-bases

    • Context bases for modified base calling.

    • Default: 15

  • --mod-map-emulate-bisulfite

    • For mod_mappings output, emulate bisulfite output by converting called modified bases using the --mod-map-base-conv argument.

    • As of version 2.2, the default mod_mappings output uses the Mm and Ml hts-specs tags (see above) with all modified bases in one output file.

  • --mod-map-base-conv

    • For mod_mappings output, convert called bases.

      • For example, to mimic bisulfite output use: --mod-map-base-conv C T --mod-map-base-conv Z C

      • This option is useful since the BAM format does not support modified bases and will convert all alternative bases to Ns for storage in BAM/CRAM format.

    • Note additional formats may be supported in the future once finalized in hts-specs.

  • --mod-output-formats

    • Modified base aggregated output format(s).

    • Default: bedmethyl

    • Options: bedmethyl, modvcf, wiggle

      • bedmethyl format produces one file per modification type.

      • modvcf is a slight variant of the VCF format used for sequence variant reporting.

        • This format produces a single file containing all modifications.

        • The format adds an SN info field, as modified bases occur in a stranded manner, unlike sequence variants (e.g. hemi-methylation).

        • A genotype field VALID_DP indicates the number of reads included in the proportion modified calculation.

        • Modified base proportion estimates are stored in genotype fields specified by the single letter modified base encodings (defined in the model file).

  • --write-mod-log-probs

    • Write per-read modified base log probabilities out in non-standard VCF field.

      • The LOG_PROBS field will contain semi-colon delimited log probabilities for the modified base within each read at this site.

      • For sites with multiple modified bases, per-read calls for each modification type are separated by a comma as specified by the A genotype field type.

        • The order is consistent within each modification type so that per-read probabilities across all modification types can be reconstructed.

  • --write-mods-text

    • Output per-read modified bases in text format.

      • Output includes columns: read_id, chrm, strand, pos, mod_log_probs, can_log_prob, mod_bases, motif

      • Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.

        • Canonical log probabilities are included to make sites with multiple modifications easier to process.

          • Megalodon is capable of handling multiple modified bases per site with an appropriate model (e.g. testing for 5mC and 5hmC simultaneously is supported given a suitable basecalling model).

      • motif includes the searched motif (via --mod-motif) as well as the relative modified base position within that motif (e.g. CG:0 for provided --mod-motif Z CG 0).

      • Position is 0-based
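The binary_threshold aggregation described under --mod-binary-threshold can be sketched as follows. This is an illustration of the idea, not Megalodon's implementation: it assumes a single modification type per site, natural-log probabilities as in the mod_log_probs and can_log_prob columns, and that a call "achieves" the threshold when its probability is greater than or equal to it. The function name is hypothetical.

```python
import math


def binary_threshold_aggregate(per_read_log_probs, threshold=0.75):
    """Aggregate per-read calls at one site.

    per_read_log_probs: iterable of (mod_log_prob, can_log_prob) pairs.
    Returns (fraction_modified, valid_reads); the fraction is None when
    no read reaches the threshold.
    """
    n_mod = n_can = 0
    for mod_lp, can_lp in per_read_log_probs:
        mod_p, can_p = math.exp(mod_lp), math.exp(can_lp)
        mod_frac = mod_p / (mod_p + can_p)  # normalize over the two states
        if mod_frac >= threshold:
            n_mod += 1
        elif 1.0 - mod_frac >= threshold:
            n_can += 1
        # Reads confident in neither state are ignored.
    valid = n_mod + n_can
    return (n_mod / valid if valid else None), valid


# Four invented reads with modified probabilities 0.9, 0.8, 0.6 and 0.1.
reads = [(math.log(p), math.log(1.0 - p)) for p in (0.9, 0.8, 0.6, 0.1)]
fraction, valid_reads = binary_threshold_aggregate(reads)
```

In this example the read at 0.6 is dropped, so only three reads contribute to the proportion, which is the kind of count the VALID_DP field reports.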

Taiyaki Backend Arguments

  • --chunk-size

    • Size of individual chunks to run as input to neural network.

    • Smaller size will result in faster basecalling, but may reduce accuracy.

  • --chunk-overlap

    • Overlap between adjacent chunks fed to basecalling neural network.

    • Smaller size will result in faster basecalling, but may reduce accuracy.

  • --max-concurrent-chunks

    • Maximum number of concurrent chunks to basecall at once.

    • Allows a global cap on GPU memory usage.

    • Changes to this parameter do not affect resulting basecalls.

  • --taiyaki-model-filename

    • Taiyaki basecalling model checkpoint file.

    • In order to identify modified bases a model trained to identify those modifications must be provided.

      • Train a new modified base model using taiyaki.

    • Guppy JSON-format models can be converted to taiyaki checkpoints/models with the taiyaki/bin/json_to_checkpoint.py script for use with megalodon.
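To make the interaction between --chunk-size and --chunk-overlap concrete, the sketch below splits a signal of a given length into overlapping chunks. It is a generic illustration with invented numbers; Taiyaki's own chunking may handle the final chunk and the stitching of overlaps differently.

```python
def chunk_bounds(signal_len, chunk_size, overlap):
    """Return (start, end) pairs covering the signal with overlapping chunks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    if signal_len <= 0:
        return []
    step = chunk_size - overlap  # each chunk advances by this many samples
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_size, signal_len)
        bounds.append((start, end))
        if end >= signal_len:
            break
        start += step
    return bounds


# Hypothetical values: a 2500-sample signal, 1000-sample chunks, 100 overlap.
bounds = chunk_bounds(2500, 1000, 100)
```

A larger overlap shortens the step between chunk starts, so more chunks are needed to cover the same signal, which is one way to see the speed cost mentioned above.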

Reference/Signal Mapping Output

This output category is intended for use in generating reference sequences or signal mapping files for taiyaki basecall model training.

  • --ref-include-mods

    • Include modified base calls in per_read_refs or signal_mappings outputs.

  • --ref-include-variants

    • Include sequence variant calls in per-read reference output.

  • --ref-length-range

    • Only include reads with specified read length in per-read reference output.

  • --ref-percent-identity-threshold

    • Only include reads with higher percent identity in per-read reference output.

  • --ref-percent-coverage-threshold

    • Only include reads with higher read alignment coverage in per-read reference output.

  • --ref-mods-all-motifs

    • Annotate all --mod-motif occurrences as modified.

    • Requires that --ref-include-mods is set.

  • --ref-mod-threshold

    • Threshold (in log(can_prob/mod_prob) space) used to annotate modified bases in signal_mappings or per_read_refs outputs.

    • See megalodon_extras modified_bases estimate_threshold command for help computing this threshold.

    • Requires that --ref-include-mods is set.
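The log(can_prob/mod_prob) space used by --ref-mod-threshold can be illustrated with a short calculation. In this sketch a base is annotated as modified when the log ratio falls below the threshold; that comparison direction, the use of natural logs, and the function names are assumptions of the example, so check them against the megalodon_extras modified_bases estimate_threshold output before relying on them.

```python
import math


def log_can_mod_ratio(can_log_prob, mod_log_prob):
    """log(can_prob / mod_prob), computed from log probabilities."""
    return can_log_prob - mod_log_prob


def is_annotated_modified(can_log_prob, mod_log_prob, threshold=0.0):
    """Assumed rule: values below the threshold are annotated as modified."""
    return log_can_mod_ratio(can_log_prob, mod_log_prob) < threshold


# Invented call with can_prob = 0.2 and mod_prob = 0.8.
ratio = log_can_mod_ratio(math.log(0.2), math.log(0.8))
```

Negative values indicate that the modified base is more likely than the canonical base, and a value of 0 corresponds to equal probabilities.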

Compute Resource Arguments

  • --num-read-enumeration-threads

    • Number of parallel threads to use for read enumeration.

      • This number of threads will be opened in a single read enumeration process and each signal extraction process (see next argument).

    • This value can be increased if the input queue remains empty.

    • Default: 8

  • --num-extract-signal-processes

    • Number of parallel processes to use for signal extraction.

      • Accessing data and metadata from FAST5 files requires some compute resources. For this reason, multiple processes must be spawned to achieve the highest performance on some systems.

    • This value can be increased if the input queue remains empty.

    • Default: 2
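Reading the two options together, the enumeration thread count applies to the single read enumeration process and again to each signal extraction process. The arithmetic below is a sketch of that reading, with a hypothetical function name, and is not a measurement of actual thread usage.

```python
def total_enumeration_threads(num_threads=8, num_extract_processes=2):
    """Threads opened across one enumeration process plus each extraction process."""
    return num_threads * (1 + num_extract_processes)


default_total = total_enumeration_threads()
```

With the defaults this gives 8 x (1 + 2) = 24 threads, which is worth keeping in mind when raising either value on a machine with few cores.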

Miscellaneous Arguments

  • --database-safety

    • Setting for database performance versus corruption protection.

      • Options:

        • 0 (DB corruption on application crash)

        • 1 (Default; DB corruption on system crash)

        • 2 (DB safe mode)

  • --edge-buffer

    • Do not process sequence variant or modified base calls near edge of read mapping.

    • Default: 30

  • --not-recursive

    • Only search for fast5 read files directly found within the fast5 directory.

    • Default: search recursively

  • --suppress-progress

    • Suppress progress bar output.

  • --suppress-queues-status

    • Suppress dynamic status of output queues.

    • These queues are helpful for diagnosing I/O issues.

  • --verbose-read-progress

    • Output dynamic updates on potential issues during processing.

    • Default: 3
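The effect of --edge-buffer can be pictured as a simple position filter. The sketch below assumes 0-based, half-open mapping coordinates on the reference and treats the exact boundary handling as an assumption; the function name is hypothetical and Megalodon's internal logic may differ in detail.

```python
def passes_edge_buffer(pos, map_start, map_end, edge_buffer=30):
    """True when pos lies at least edge_buffer bases inside the mapping."""
    return map_start + edge_buffer <= pos < map_end - edge_buffer


# Invented mapping spanning reference positions [1000, 2000).
kept = [p for p in (1010, 1030, 1500, 1969, 1990)
        if passes_edge_buffer(p, 1000, 2000)]
```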