Advanced Megalodon Arguments

Guppy Backend Arguments

--do-not-use-guppy-server
- Use alternative basecalling backend
- Alternatives are:
  - FAST5: Read called sequence and full posterior data from fast5 files.
    - This is the default when --do-not-use-guppy-server is set.
    - Note that this option requires --post_out be set when running Guppy and may increase the fast5 file size by 5-10X.
  - Taiyaki: Use the Taiyaki package basecalling interface
    - This requires a Taiyaki installation (potentially with GPU settings).
    - Trigger this mode by setting the --taiyaki-model-filename option.
    - This is much slower than Guppy and is generally intended for experimental models with either layers or architectures not supported by Guppy.
--guppy-params
- Extra guppy server parameters.
- Mainly used to tune performance for the available compute environment.
- Quote parameters to avoid them being parsed by megalodon.
--guppy-server-port
- Guppy server port.
- Default: auto
--reads-per-guppy-batch
- Number of reads to send to Guppy per batch within each worker process.
- Default: 50
--guppy-timeout
- Time, in seconds, to wait for the Guppy server to call a single read.
- Default: 5.0
--list-supported-guppy-configs
- List guppy configs with sequence variant and (if applicable) modified base support.
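
As a rough illustration of the quoting advice for --guppy-params, the following Python sketch assembles a command line in which the Guppy options travel as a single argument. The input directory name and the Guppy server options shown are placeholders chosen for illustration, not recommended settings.

```python
import shlex

# Hypothetical input directory and Guppy server options (illustrative only).
fast5_dir = "raw_fast5s/"
guppy_params = "--num_callers 5 --ipc_threads 6"

# Passing the Guppy options as ONE list element keeps them from being
# parsed as Megalodon arguments.
argv = [
    "megalodon", fast5_dir,
    "--guppy-params", guppy_params,
    "--reads-per-guppy-batch", "50",
    "--guppy-timeout", "5.0",
]

# shlex.join adds the shell quoting needed to paste the command into a terminal.
command = shlex.join(argv)
print(command)
```

When the list is passed directly to subprocess.run, no extra quoting is needed, since each list element is delivered as one argument.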

Output Arguments

--basecalls-format
- Select either fastq (default) or fasta format for basecalls output.
--num-reads
- Number of reads to process. Intended for test runs on a subset.
--read-ids-filename
- A file containing read_ids to process (one per line).
- Used in the variant phasing pipeline.
--mod-min-prob
- Only include modified base probabilities greater than this value in mod_basecalls and mod_mappings outputs.
- Default: 0.01 (1%)

Mapping Arguments

--cram-reference
- If --reference is a minimap2 index, the associated FASTA reference needs to be provided for --mappings-format cram.
--samtools-executable
- Samtools executable or path for sorting and indexing all mappings.
- Default: samtools
--sort-mappings
- Perform sorting and indexing of mapping output files.
- This can take considerable time for larger runs and thus is off by default.

Sequence Variant Arguments

--context-min-alt-prob
- Minimum per-read variant probability to include a variant in second round of variant evaluation (including context variants).
--disable-variant-calibration
- Use raw neural network sequence variant scores.
- This option should be set when calibrating a new model.
- Default: Calibrate scores as described in --variant-calibration-filename
--heterozygous-factors
- Bayes factor used when computing heterozygous probabilities in diploid variant calling mode.
- Two factors must be provided for single base substitution variants and indels.
--max-indel-size
- Maximum indel size to include in testing. Default: 50
--variant-all-paths
- Compute the forward algorithm all paths score.
- Default: Viterbi best-path score.
--variants-are-atomized
- Input variants have been atomized (with megalodon_extras variants atomize).
- This saves compute time, but has unpredictable behavior if variants are not atomized.
--variant-calibration-filename
- File containing empirical calibration for sequence variant scores.
- As created by the megalodon_extras calibrate variants command.
- Default: Load default calibration file for guppy config.
--variant-context-bases
- Context bases for single-base substitution and indel calling. Default: [15, 30]
--variant-locations-on-disk
- Force sequence variant locations to be stored only within the on-disk database table.
- This option reduces the memory (RAM) requirement, but may drastically slow processing.
- Default: Store locations in memory and on disk.
--write-variants-text
- Output per-read variants in text format.
  - Output includes columns: read_id, chrm, strand, pos, ref_log_prob, alt_log_prob, var_ref_seq, var_alt_seq, var_id
  - Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.
    - Reference log probabilities are included to make sites with multiple alternative alleles easier to process.
  - Position is 0-based
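
To make the column layout concrete, here is a minimal Python sketch that reads per-read variant records and converts the two log probabilities into a log-likelihood ratio. It assumes the text output is tab-delimited with a header row matching the column names above, and that the values are natural logs; the sample rows are invented for illustration.

```python
import csv
import io
import math

# Invented sample rows; assumes tab-delimited text with a header line.
sample = (
    "read_id\tchrm\tstrand\tpos\tref_log_prob\talt_log_prob\t"
    "var_ref_seq\tvar_alt_seq\tvar_id\n"
    "read_a\tchr1\t+\t1000\t-0.105361\t-2.302585\tA\tG\tvar1\n"
    "read_b\tchr1\t-\t1000\t-1.609438\t-0.223144\tA\tG\tvar1\n"
)

def parse_variant_rows(handle):
    """Yield one dict per per-read variant record with numeric fields converted."""
    for row in csv.DictReader(handle, delimiter="\t"):
        ref_lp = float(row["ref_log_prob"])
        alt_lp = float(row["alt_log_prob"])
        yield {
            "read_id": row["read_id"],
            "pos": int(row["pos"]),        # 0-based position
            "alt_prob": math.exp(alt_lp),  # assumes natural-log values
            "llr": ref_lp - alt_lp,        # > 0 favors the reference allele
        }

records = list(parse_variant_rows(io.StringIO(sample)))
for rec in records:
    print(rec["read_id"], rec["pos"], round(rec["alt_prob"], 3), round(rec["llr"], 3))
```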
--write-vcf-log-probs
- Write per-read alt log probabilities out in non-standard VCF field.
  - The LOG_PROBS field will contain semi-colon delimited log probabilities for each read at this site.
  - For sites with multiple alternative alleles, per-read calls for each allele are separated by a comma as specified by the A genotype field type.
    - The order is consistent within each allele so that per-read probabilities across all alleles can be reconstructed.
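
The layout described above can be unpacked with a short Python sketch. It assumes the field value holds one comma-separated group per alternative allele, with each group holding semicolon-delimited values in the same read order; the example string is invented.

```python
def parse_log_probs(field_value):
    """Split a LOG_PROBS value into per-allele lists, then regroup per read."""
    # Assumed layout: alleles separated by ",", reads separated by ";".
    per_allele = [
        [float(x) for x in allele_part.split(";")]
        for allele_part in field_value.split(",")
    ]
    # Read order is consistent across alleles, so zip regroups by read.
    per_read = [list(values) for values in zip(*per_allele)]
    return per_allele, per_read

# Invented example: two alternative alleles, three reads.
per_allele, per_read = parse_log_probs("-0.1;-2.3;-0.7,-3.0;-0.2;-1.9")
print(per_allele)
print(per_read)
```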

Modified Base Arguments

--disable-mod-calibration
- Use raw modified base scores from the network.
- This option should be set when calibrating a new model.
- Default: Calibrate scores as described in --mod-calibration-filename
--mod-aggregate-method
- Modified base aggregation method.
- Choices: expectation_maximization (default), binary_threshold
--mod-all-paths
- Compute the forward algorithm all paths score for modified base calls.
- Default: Viterbi best-path score.
--mod-binary-threshold
- Hard threshold for modified base aggregation (probability of modified/canonical base).
  - Sites where no canonical or modified base achieves this level of confidence will be ignored in aggregation.
- Default: 0.75
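
A simplified Python sketch of binary-threshold aggregation at a single site follows. It is not Megalodon's implementation; it assumes each read contributes one modified-base probability, with the canonical probability taken as its complement, and the function name is hypothetical.

```python
def binary_threshold_proportion(mod_probs, threshold=0.75):
    """Return (proportion modified, number of confident reads) for one site."""
    n_mod = n_can = 0
    for p_mod in mod_probs:
        if p_mod >= threshold:
            n_mod += 1
        elif 1.0 - p_mod >= threshold:  # assumed: canonical prob = 1 - p_mod
            n_can += 1
        # Otherwise neither call is confident enough; the read is ignored.
    n_valid = n_mod + n_can
    proportion = n_mod / n_valid if n_valid else None
    return proportion, n_valid

# Invented per-read probabilities: two confident modified calls,
# one confident canonical call, and two ignored reads.
proportion, n_valid = binary_threshold_proportion([0.95, 0.80, 0.10, 0.55, 0.30])
print(proportion, n_valid)
```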
--mod-calibration-filename
- File containing empirical calibration for modified base scores.
- As created by the megalodon_extras calibrate modified_bases command.
- Default: Load default calibration file for guppy config.
--mod-database-timeout
- Timeout in seconds for modified base database operations.
- Default: 5 seconds
--mod-context-bases
- Context bases for modified base calling.
- Default: 15
--mod-map-emulate-bisulfite
- For mod_mappings output, emulate bisulfite output by converting called modified bases using the --mod-map-base-conv argument.
- As of version 2.2, the default mod_mappings output uses the Mm and Ml hts-specs tags (see above) with all modified bases in one output file.
--mod-map-base-conv
- For mod_mappings output, convert called bases.
  - For example, to mimic bisulfite output use: --mod-map-base-conv C T --mod-map-base-conv Z C
  - This option is useful since the BAM format does not support modified bases and will convert all alternative bases to ``N``s for storage in BAM/CRAM format.
- Note additional formats may be supported in the future once finalized in hts-specs.
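
The effect of the bisulfite-style conversion above can be pictured with a small Python sketch. It assumes the conversions are applied simultaneously to the called sequence (so a C produced from Z is not converted again) and that Z denotes the modified base, as in the example; it illustrates the mapping only, not Megalodon's code.

```python
# Conversions taken from the example: C -> T and Z -> C.
base_conv = {"C": "T", "Z": "C"}
table = str.maketrans(base_conv)

def convert_bases(seq):
    """Apply all base conversions in a single pass over the sequence."""
    # Assumption: conversions are simultaneous, not chained.
    return seq.translate(table)

converted = convert_bases("ACGZGC")
print(converted)
```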
--mod-output-formats
- Modified base aggregated output format(s).
- Default: bedmethyl
- Options: bedmethyl, modvcf, wiggle
  - bedmethyl format produces one file per modification type.
    - This format is specified by the ENCODE consortium.
  - modvcf is a slight variant of the VCF format used for sequence variant reporting.
    - This format produces a single file containing all modifications.
    - The format adds a SN info field as modified bases occur in a stranded manner unlike sequence variants (e.g. hemi-methylation).
    - A genotype field VALID_DP indicates the number of reads included in the proportion modified calculation.
    - Modified base proportion estimates are stored in genotype fields specified by the single letter modified base encodings (defined in the model file).
--write-mod-log-probs
- Write per-read modified base log probabilities out in non-standard VCF field.
  - The LOG_PROBS field will contain semi-colon delimited log probabilities for the modified base within each read at this site.
  - For sites with multiple modified bases, per-read calls for each modification type are separated by a comma as specified by the A genotype field type.
    - The order is consistent within each modification type so that per-read probabilities across all modification types can be reconstructed.
--write-mods-text
- Output per-read modified bases in text format.
  - Output includes columns: read_id, chrm, strand, pos, mod_log_probs, can_log_prob, mod_bases, motif
  - Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.
    - Canonical log probabilities are included to make sites with multiple modifications easier to process.
      - Megalodon is capable of handling multiple modified bases per site with an appropriate model (e.g. testing for 5mC and 5hmC simultaneously is supported given a basecalling model).
  - motif includes the searched motif (via --mod-motif) as well as the relative modified base position within that motif (e.g. CG:0 for provided --mod-motif Z CG 0).
  - Position is 0-based
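
As a sketch of how the motif column might be consumed, the Python snippet below splits it into the motif and the relative modified base position. It assumes tab-delimited text with a header row using the column names above, natural-log values, and a single modified base per row; the sample rows are invented.

```python
import csv
import io
import math

# Invented sample rows; assumes tab-delimited text with a header line.
sample = (
    "read_id\tchrm\tstrand\tpos\tmod_log_probs\tcan_log_prob\tmod_bases\tmotif\n"
    "read_a\tchr1\t+\t250\t-0.051293\t-2.995732\tZ\tCG:0\n"
    "read_b\tchr1\t+\t250\t-1.897120\t-0.162519\tZ\tCG:0\n"
)

def parse_mod_rows(handle):
    """Yield one dict per per-read modified base record."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # "CG:0" -> motif "CG", modified base at relative position 0.
        motif, rel_pos = row["motif"].rsplit(":", 1)
        yield {
            "read_id": row["read_id"],
            "pos": int(row["pos"]),  # 0-based position
            "mod_base": row["mod_bases"],
            # Assumes one modified base (one value) per row.
            "mod_prob": math.exp(float(row["mod_log_probs"])),
            "motif": motif,
            "motif_rel_pos": int(rel_pos),
        }

mod_records = list(parse_mod_rows(io.StringIO(sample)))
for rec in mod_records:
    print(rec["read_id"], rec["motif"], rec["motif_rel_pos"], round(rec["mod_prob"], 3))
```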

Taiyaki Backend Arguments

--chunk-size
- Size of individual chunks to run as input to neural network.
- Smaller size will result in faster basecalling, but may reduce accuracy.
--chunk-overlap
- Overlap between adjacent chunks fed to basecalling neural network.
- Smaller size will result in faster basecalling, but may reduce accuracy.
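
How chunk size and overlap interact can be sketched in Python as follows. This is an illustrative scheme, not Taiyaki's actual chunking code; the function name and the numeric values are hypothetical.

```python
def chunk_bounds(signal_len, chunk_size, overlap):
    """Return (start, end) pairs covering a signal with overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_size, signal_len)
        bounds.append((start, end))
        if end >= signal_len:
            break
        start += step
    return bounds

# Hypothetical values: a 2500-sample signal, 1000-sample chunks, 100-sample overlap.
bounds = chunk_bounds(signal_len=2500, chunk_size=1000, overlap=100)
print(bounds)
```

In this scheme a smaller overlap lengthens the step between chunk starts, so fewer chunks are needed to cover the same signal.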
--max-concurrent-chunks
- Maximum number of concurrent chunks to basecall at once.
- Allows a global cap on GPU memory usage.
- Changes to this parameter do not affect resulting basecalls.
--taiyaki-model-filename
- Taiyaki basecalling model checkpoint file.
- In order to identify modified bases a model trained to identify those modifications must be provided.
  - Train a new modified base model using taiyaki.
- Guppy JSON-format models can be converted to taiyaki checkpoints/models with the taiyaki/bin/json_to_checkpoint.py script for use with megalodon.

Reference/Signal Mapping Output

This output category is intended for use in generating reference sequences or signal mapping files for taiyaki basecall model training.

--ref-include-mods
- Include modified base calls in per_read_refs or signal_mappings outputs.
--ref-include-variants
- Include sequence variant calls in per-read reference output.
--ref-length-range
- Only include reads with specified read length in per-read reference output.
--ref-percent-identity-threshold
- Only include reads with higher percent identity in per-read reference output.
--ref-percent-coverage-threshold
- Only include reads with higher read alignment coverage in per-read reference output.
--ref-mods-all-motifs
- Annotate all --mod-motif occurrences as modified.
- Requires that --ref-include-mods is set.
--ref-mod-threshold
- Threshold (in log(can_prob/mod_prob) space) used to annotate modified bases in signal_mappings or per_read_refs outputs.
- See megalodon_extras modified_bases estimate_threshold command for help computing this threshold.
- Requires that --ref-include-mods is set.
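
The threshold space can be made concrete with a short Python sketch. It assumes a base is annotated as modified when log(can_prob/mod_prob) falls below the threshold; that comparison direction, the function names, and the example values are assumptions for illustration.

```python
import math

def log_ratio(can_prob, mod_prob):
    """Natural-log ratio of canonical to modified probability."""
    return math.log(can_prob / mod_prob)

def annotate_modified(can_prob, mod_prob, threshold=0.0):
    """Assumed rule: call the base modified when the log ratio is below threshold."""
    return log_ratio(can_prob, mod_prob) < threshold

calls = [
    annotate_modified(0.1, 0.9),                 # strongly modified
    annotate_modified(0.6, 0.4),                 # leans canonical
    annotate_modified(0.6, 0.4, threshold=1.0),  # a looser threshold admits it
]
print(calls)
```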

Compute Resource Arguments

--num-read-enumeration-threads
- Number of parallel threads to use for read enumeration.
  - This number of threads will be opened in a single read enumeration process and each signal extraction process (see next argument).
- This value can be increased if the input queue remains empty.
- Default: 8
--num-extract-signal-processes
- Number of parallel processes to use for signal extraction.
  - Accessing data and metadata from FAST5 files requires some compute resources. For this reason, multiple processes must be spawned to achieve the highest performance on some systems.
- This value can be increased if the input queue remains empty.
- Default: 2

Miscellaneous Arguments

--database-safety
- Setting for database performance versus corruption protection.
  - Options:
    - 0 (DB corruption on application crash)
    - 1 (Default; DB corruption on system crash)
    - 2 (DB safe mode)
--edge-buffer
- Do not process sequence variant or modified base calls near the edges of a read mapping.
- Default: 30
--not-recursive
- Only search for fast5 read files directly found within the fast5 directory.
- Default: search recursively
--suppress-progress
- Suppress progress bar output.
--suppress-queues-status
- Suppress dynamic status of output queues.
- These queues are helpful for diagnosing I/O issues.
--verbose-read-progress
- Output dynamic updates on potential issues during processing.
- Default: 3