****************************
Advanced Megalodon Arguments
****************************

----------------------
Guppy Backend Argument
----------------------

- ``--do-not-use-guppy-server``

  - Use alternative basecalling backend
  - Alternatives are:

    - FAST5: Read called sequence and full posterior data from fast5 files.

      - This is the default when ``--do-not-use-guppy-server`` is set.
      - Note that this option requires ``--post_out`` be set when running Guppy and may increase the fast5 file size by 5-10X.
    - Taiyaki: Use the Taiyaki package basecalling interface

      - This requires a Taiyaki installation (potentially with GPU settings).
      - Trigger this mode by setting the ``--taiyaki-model-filename`` option.
      - This is much slower than Guppy and is generally intended for experimental models with either layers or architectures not supported by Guppy.
- ``--guppy-params``

  - Extra guppy server parameters.
  - Main purpose for optimal performance based on compute environment.
  - Quote parameters to avoid them being parsed by megalodon.
- ``--guppy-server-port``

  - Guppy server port.
  - Default: ``auto``
- ``--reads-per-guppy-batch``

  - Number of reads to send to Guppy per batch within each worker processes.
  - Default: ``50``
- ``--guppy-timeout``

  - Timeout to wait for guppy server to call a single read in seconds.
  - Default: ``5.0``
- ``--list-supported-guppy-configs``

  - List guppy configs with sequence variant and (if applicable) modified base support.

----------------
Output Arguments
----------------

- ``--basecalls-format``

  - Select either ``fastq`` (default) or ``fasta`` format for basecalls output.
- ``--num-reads``

  - Number of reads to process. Intended for test runs on a subset.
- ``--read-ids-filename``

  - A file containing ``read_ids`` to process (one per line).
  - Used in the variant phasing pipeline.
- ``--mod-min-prob``

  - Only include modified base probabilities greater than this value in ``mod_basecalls`` and ``mod_mappings`` outputs.
  - Default: ``0.01`` (``1%``)

-----------------
Mapping Arguments
-----------------

- ``--cram-reference``

  - If ``--reference`` is a minimap2 index, the associated FASTA reference needs to be provided for ``--mappings-format cram``.
- ``--samtools-executable``

  - Samtools executable or path for sorting and indexing all mappings.
  - Default: ``samtools``
- ``--sort-mappings``

  - Perform sorting and indexing of mapping output files.
  - This can take considerable time for larger runs and thus is off by default.

--------------------------
Sequence Variant Arguments
--------------------------

- ``--context-min-alt-prob``

  - Minimum per-read variant probability to include a variant in second round of variant evaluation (including context variants).

- ``--disable-variant-calibration``

  - Use raw neural network sequence variant scores.
  - This option should be set when calibrating a new model.
  - Default: Calibrate scores as described in ``--variant-calibration-filename``
- ``--heterozygous-factors``

  - Bayes factor used when computing heterozygous probabilities in diploid variant calling mode.
  - Two factors must be provided for single base substitution variants and indels.
- ``--max-indel-size``

  - Maximum indel size to include in testing. Default: 50
- ``--variant-all-paths``

  - Compute the forward algorithm all paths score.
  - Default: Viterbi best-path score.
- ``--variants-are-atomized``

  - Input variants have been atomized (with ``megalodon_extras variants atomize``).
  - This saves compute time, but has unpredictable behavior if variants are not atomized.
- ``--variant-calibration-filename``

  - File containing empirical calibration for sequence variant scores.
  - As created by the ``megalodon_extras calibrate variants`` command.
  - Default: Load default calibration file for guppy config.
- ``--variant-context-bases``

  - Context bases for single base SNP and indel calling. Default: [15, 30]
- ``--variant-locations-on-disk``

  - Force sequence variant locations to be stored only within on disk database table. This option will reduce the RAM memory requirement, but may drastically slow processing. Default: Store locations in memory and on disk.
- ``--write-variants-text``

  - Output per-read variants in text format.

    - Output includes columns: ``read_id``, ``chrm``, ``strand``, ``pos``, ``ref_log_prob``, ``alt_log_prob``, ``var_ref_seq``, ``var_alt_seq``, ``var_id``
    - Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.

      - Reference log probabilities are included to make processing multiple alternative allele sites easier to process.
    - Position is 0-based
- ``--write-vcf-log-probs``

  - Write per-read alt log probabilities out in non-standard VCF field.

    - The ``LOG_PROBS`` field will contain semi-colon delimited log probabilities for each read at this site.
    - For sites with multiple alternative alleles, per-read calls for each allele are separated by a comma as specified by the ``A`` genotype field type.

      - The order is consistent within each allele so that per-read probabilities across all alleles can be reconstructed.

-----------------------
Modified Base Arguments
-----------------------

- ``--disable-mod-calibration``

  - Use raw modified base scores from the network.
  - This option should be set when calibrating a new model.
  - Default: Calibrate scores as described in ``--mod-calibration-filename``
- ``--mod-aggregate-method``

  - Modified base aggregation method.
  - Choices: expectation_maximization (default), binary_threshold
- ``--mod-all-paths``

  - Compute forwards algorithm all paths score for modified base calls.
  - Default: Viterbi best-path score.
- ``--mod-binary-threshold``

  - Hard threshold for modified base aggregation (probability of modified/canonical base).

    - Sites where no canonical or modified base achieves this level of confidence will be ignored in aggregation.
  - Default: 0.75
- ``--mod-calibration-filename``

  - File containing empirical calibration for modified base scores.
  - As created by ``megalodon_extras calibrate modified_bases`` command.
  - Default: Load default calibration file for guppy config.
- ``--mod-database-timeout``

  - Timeout in seconds for modified base database operations.
  - Default: 5 seconds
- ``--mod-context-bases``

  - Context bases for modified base calling.
  - Default: 15
- ``--mod-map-emulate-bisulfite``

  - For ``mod_mappings`` output, emulate bisulfite output by converting called modified bases using "--mod-map-base-conv" argument.
  - As of version 2.2, the default ``mod_mappings`` output uses the ``Mm`` and ``Ml`` hts-specs tags (see above) with all modified bases in one output file.
- ``--mod-map-base-conv``

  - For ``mod_mappings`` output, convert called bases.

    - For example, to mimic bisulfite output use: ``--mod-map-base-conv C T --mod-map-base-conv Z C``
    - This is option useful since the BAM format does support modified bases and will convert all alternative bases to ``N``s for storage in BAM/CRAM format.
  - Note additional formats may be supported in the future once finalized in hts-specs.
- ``--mod-output-formats``

  - Modified base aggregated output format(s).
  - Default: ``bedmethyl``
  - Options: ``bedmethyl``, ``modvcf``, ``wiggle``

    - ``bedmethyl`` format produces one file per modification type.

      - This format is specified by the `ENCODE consortium <https://www.encodeproject.org/data-standards/wgbs/>`_.
    - ``modvcf`` is a slight variant to the VCF format used for sequence variant reporting.

      - This format produces a single file containing all modifications.
      - The format adds a ``SN`` info field as modified bases occur in a stranded manner unlike sequence variants (e.g. hemi-methylation).
      - A genotype field ``VALID_DP`` indicates the number of reads included in the proportion modified calculation.
      - Modified base proportion estimates are stored in genotype fields specified by the single letter modified base encodings (defined in the model file).
- ``--write-mod-log-probs``

  - Write per-read modified base log probabilities out in non-standard VCF field.

    - The ``LOG_PROBS`` field will contain semi-colon delimited log probabilities for modified base within each read at this site.
    - For sites with multiple modified bases, per-read calls for each modification type are separated by a comma as specified by the ``A`` genotype field type.

      - The order is consistent within each modification type so that per-read probabilities across all modification types can be reconstructed.
- ``--write-mods-text``

  - Output per-read modified bases in text format.

    - Output includes columns: ``read_id``, ``chrm``, ``strand``, ``pos``, ``mod_log_probs``, ``can_log_prob``, ``mod_bases``, ``motif``
    - Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.

      - Canonical log probabilities are included to make processing multiple modification sites easier to process.

        - Megalodon is capable of handling multiple modified bases per site with appropriate model (e.g. testing for 5mC and 5hmC simultaneously is supported given a basecalling model).
    - ``motif`` includes the searched motif (via ``--mod-motif``) as well as the relative modified base position within that motif (e.g. ``CG:0`` for provided ``--mod-motif Z CG 0``).
    - Position is 0-based

-------------------------
Taiyaki Backend Arguments
-------------------------

- ``--chunk-size``

  - Size of individual chunks to run as input to neural network.
  - Smaller size will result in faster basecalling, but may reduce accuracy.
- ``--chunk-overlap``

  - Overlap between adjacent chunks fed to basecalling neural network.
  - Smaller size will result in faster basecalling, but may reduce accuracy.
- ``--max-concurrent-chunks``

  - Maximum number of concurrent chunks to basecall at once.
  - Allows a global cap on GPU memory usage.
  - Changes to this parameter do not effect resulting basecalls.
- ``--taiyaki-model-filename``

  - `taiyaki <https://github.com/nanoporetech/taiyaki>`_ basecalling model checkpoint file
  - In order to identify modified bases a model trained to identify those modifications must be provided.

    - Train a new modified base model using taiyaki.

  - Guppy JSON-format models can be converted to taiyaki checkpoints/models with the ``taiyaki/bin/json_to_checkpoint.py`` script for use with megalodon.

-------------------------------
Reference/Signal Mapping Output
-------------------------------

This output category is intended for use in generating reference sequences or signal mapping files for taiyaki basecall model training.

- ``--ref-include-mods``

  - Include modified base calls in ``per_read_refs`` or ``signal_mappings`` outputs.
- ``--ref-include-variants``

  - Include sequence variant calls in per-read reference output.
- ``--ref-length-range``

  - Only include reads with specified read length in per-read reference output.
- ``--ref-percent-identity-threshold``

  - Only include reads with higher percent identity in per-read reference output.
- ``--ref-percent-coverage-threshold``

  -  Only include reads with higher read alignment coverage in per-read reference output.
- ``--ref-mods-all-motifs``

  - Annotate all ``--mod-motif`` occurrences as modified.
  - Requires that `--ref-include-mods`` is set.
- ``--ref-mod-threshold``

  - Threshold (in ``log(can_prob/mod_prob)`` space) used to annotate a modified bases in ``signal_mappings`` or ``per_read_refs`` outputs.
  - See ``megalodon_extras modified_bases estimate_threshold`` command for help computing this threshold.
  - Requires that `--ref-include-mods`` is set.

--------------------------
Compute Resource Arguments
--------------------------

- ``--num-read-enumeration-threads``

  - Number of parallel threads to use for read enumeration.

    - This number of threads will be opened in a single read enumeration process and each signal extraction process (see next argument).
  - This value can be increased if the input queue remains empty.
  - Default: ``8``
- ``--num-extract-signal-processes``

  - Number of parallel processes to use for signal extraction.

    - Accessing data and metadata from FAST5 files requires some compute resources. For this reason, multiple processes must be spawned to achieve the highest performance on some systems.
  - This value can be increased if the input queue remains empty.
  - Default: ``2``

-----------------------
Miscellaneous Arguments
-----------------------

- ``--database-safety``

  - Setting for database performance versus corruption protection.

    - Options:

      - 0 (DB corruption on application crash)
      - 1 (Default; DB corruption on system crash)
      - 2 (DB safe mode)
- ``--edge-buffer``

  - Do not process sequence variant or modified base calls near edge of read mapping.
  - Default: 30
- ``--not-recursive``

  - Only search for fast5 read files directly found within the fast5 directory.
  - Default: search recursively
- ``--suppress-progress``

  - Suppress progress bar output.
- ``--suppress-queues-status``

  - Suppress dynamic status of output queues.
  - These queues are helpful for diagnosing I/O issues.
- ``--verbose-read-progress``

  - Output dynamic updates to potential issues during processing.
  - Default: ``3``