****************************
Advanced Megalodon Arguments
****************************

----------------------
Guppy Backend Argument
----------------------

- ``--do-not-use-guppy-server``

  - Use an alternative basecalling backend.
  - Alternatives are:

    - FAST5: Read called sequence and full posterior data from fast5 files.

      - This is the default when ``--do-not-use-guppy-server`` is set.
      - Note that this option requires ``--post_out`` be set when running Guppy and may increase the fast5 file size by 5-10X.

    - Taiyaki: Use the Taiyaki package basecalling interface.

      - This requires a Taiyaki installation (potentially with GPU settings).
      - Trigger this mode by setting the ``--taiyaki-model-filename`` option.
      - This is much slower than Guppy and is generally intended for experimental models with either layers or architectures not supported by Guppy.

- ``--guppy-params``

  - Extra Guppy server parameters.
  - Mainly intended for tuning performance to the compute environment.
  - Quote parameters to avoid them being parsed by Megalodon.

- ``--guppy-server-port``

  - Guppy server port.
  - Default: ``auto``

- ``--reads-per-guppy-batch``

  - Number of reads to send to Guppy per batch within each worker process.
  - Default: ``50``

- ``--guppy-timeout``

  - Timeout, in seconds, to wait for the Guppy server to call a single read.
  - Default: ``5.0``

- ``--list-supported-guppy-configs``

  - List Guppy configs with sequence variant and (if applicable) modified base support.

----------------
Output Arguments
----------------

- ``--basecalls-format``

  - Select either ``fastq`` (default) or ``fasta`` format for basecalls output.

- ``--num-reads``

  - Number of reads to process. Intended for test runs on a subset.

- ``--read-ids-filename``

  - A file containing ``read_ids`` to process (one per line).
  - Used in the variant phasing pipeline.

- ``--mod-min-prob``

  - Only include modified base probabilities greater than this value in ``mod_basecalls`` and ``mod_mappings`` outputs.
  - Default: ``0.01`` (``1%``)

-----------------
Mapping Arguments
-----------------

- ``--cram-reference``

  - If ``--reference`` is a minimap2 index, the associated FASTA reference needs to be provided for ``--mappings-format cram``.

- ``--samtools-executable``

  - Samtools executable or path for sorting and indexing all mappings.
  - Default: ``samtools``

- ``--sort-mappings``

  - Perform sorting and indexing of mapping output files.
  - This can take considerable time for larger runs and thus is off by default.

--------------------------
Sequence Variant Arguments
--------------------------

- ``--context-min-alt-prob``

  - Minimum per-read variant probability to include a variant in the second round of variant evaluation (including context variants).

- ``--disable-variant-calibration``

  - Use raw neural network sequence variant scores.
  - This option should be set when calibrating a new model.
  - Default: Calibrate scores as described in ``--variant-calibration-filename``

- ``--heterozygous-factors``

  - Bayes factors used when computing heterozygous probabilities in diploid variant calling mode.
  - Two factors must be provided: one for single base substitution variants and one for indels.

- ``--max-indel-size``

  - Maximum indel size to include in testing.
  - Default: 50

- ``--variant-all-paths``

  - Compute the forward algorithm all-paths score.
  - Default: Viterbi best-path score.

- ``--variants-are-atomized``

  - Input variants have been atomized (with ``megalodon_extras variants atomize``).
  - This saves compute time, but has unpredictable behavior if variants are not atomized.

- ``--variant-calibration-filename``

  - File containing empirical calibration for sequence variant scores.
  - As created by the ``megalodon_extras calibrate variants`` command.
  - Default: Load default calibration file for guppy config.

- ``--variant-context-bases``

  - Context bases for single base SNP and indel calling.
  - Default: [15, 30]

- ``--variant-locations-on-disk``

  - Force sequence variant locations to be stored only within the on-disk database table.
  - This option will reduce the RAM requirement, but may drastically slow processing.
  - Default: Store locations in memory and on disk.

- ``--write-variants-text``

  - Output per-read variants in text format.

    - Output includes columns: ``read_id``, ``chrm``, ``strand``, ``pos``, ``ref_log_prob``, ``alt_log_prob``, ``var_ref_seq``, ``var_alt_seq``, ``var_id``
    - Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.

      - Reference log probabilities are included to make sites with multiple alternative alleles easier to process.

    - Position is 0-based.

- ``--write-vcf-log-probs``

  - Write per-read alt log probabilities out in a non-standard VCF field.

    - The ``LOG_PROBS`` field will contain semicolon-delimited log probabilities for each read at this site.
    - For sites with multiple alternative alleles, per-read calls for each allele are separated by a comma as specified by the ``A`` genotype field type.

      - The order is consistent within each allele so that per-read probabilities across all alleles can be reconstructed.

-----------------------
Modified Base Arguments
-----------------------

- ``--disable-mod-calibration``

  - Use raw modified base scores from the network.
  - This option should be set when calibrating a new model.
  - Default: Calibrate scores as described in ``--mod-calibration-filename``

- ``--mod-aggregate-method``

  - Modified base aggregation method.
  - Choices: ``expectation_maximization`` (default), ``binary_threshold``

- ``--mod-all-paths``

  - Compute the forward algorithm all-paths score for modified base calls.
  - Default: Viterbi best-path score.

- ``--mod-binary-threshold``

  - Hard threshold for modified base aggregation (probability of modified/canonical base).

    - Sites where no canonical or modified base achieves this level of confidence will be ignored in aggregation.
  - Default: 0.75

- ``--mod-calibration-filename``

  - File containing empirical calibration for modified base scores.
  - As created by the ``megalodon_extras calibrate modified_bases`` command.
  - Default: Load default calibration file for guppy config.

- ``--mod-database-timeout``

  - Timeout in seconds for modified base database operations.
  - Default: 5 seconds

- ``--mod-context-bases``

  - Context bases for modified base calling.
  - Default: 15

- ``--mod-map-emulate-bisulfite``

  - For ``mod_mappings`` output, emulate bisulfite output by converting called modified bases using the ``--mod-map-base-conv`` argument.
  - As of version 2.2, the default ``mod_mappings`` output uses the ``Mm`` and ``Ml`` hts-specs tags (see above) with all modified bases in one output file.

- ``--mod-map-base-conv``

  - For ``mod_mappings`` output, convert called bases.
  - For example, to mimic bisulfite output use: ``--mod-map-base-conv C T --mod-map-base-conv Z C``
  - This option is useful since the BAM format does not support modified bases and will convert all alternative bases to ``N`` for storage in BAM/CRAM format.
  - Note that additional formats may be supported in the future once finalized in hts-specs.

- ``--mod-output-formats``

  - Modified base aggregated output format(s).
  - Default: ``bedmethyl``
  - Options: ``bedmethyl``, ``modvcf``, ``wiggle``

    - ``bedmethyl`` format produces one file per modification type.

      - This format is specified by the ENCODE consortium.

    - ``modvcf`` is a slight variant of the VCF format used for sequence variant reporting.

      - This format produces a single file containing all modifications.
      - The format adds a ``SN`` info field, as modified bases occur in a stranded manner unlike sequence variants (e.g. hemi-methylation).
      - A genotype field ``VALID_DP`` indicates the number of reads included in the proportion modified calculation.
      - Modified base proportion estimates are stored in genotype fields specified by the single letter modified base encodings (defined in the model file).

- ``--write-mod-log-probs``

  - Write per-read modified base log probabilities out in a non-standard VCF field.

    - The ``LOG_PROBS`` field will contain semicolon-delimited log probabilities for the modified base within each read at this site.
    - For sites with multiple modified bases, per-read calls for each modification type are separated by a comma as specified by the ``A`` genotype field type.

      - The order is consistent within each modification type so that per-read probabilities across all modification types can be reconstructed.

- ``--write-mods-text``

  - Output per-read modified bases in text format.

    - Output includes columns: ``read_id``, ``chrm``, ``strand``, ``pos``, ``mod_log_probs``, ``can_log_prob``, ``mod_bases``, ``motif``
    - Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.

      - Canonical log probabilities are included to make sites with multiple modifications easier to process.
      - Megalodon is capable of handling multiple modified bases per site with an appropriate model (e.g. testing for 5mC and 5hmC simultaneously is supported given a suitable basecalling model).

    - ``motif`` includes the searched motif (via ``--mod-motif``) as well as the relative modified base position within that motif (e.g. ``CG:0`` for provided ``--mod-motif Z CG 0``).
    - Position is 0-based.

-------------------------
Taiyaki Backend Arguments
-------------------------

- ``--chunk-size``

  - Size of individual chunks to run as input to the neural network.
  - A smaller size will result in faster basecalling, but may reduce accuracy.

- ``--chunk-overlap``

  - Overlap between adjacent chunks fed to the basecalling neural network.
  - A smaller size will result in faster basecalling, but may reduce accuracy.

- ``--max-concurrent-chunks``

  - Maximum number of concurrent chunks to basecall at once.
  - Allows a global cap on GPU memory usage.
  - Changes to this parameter do not affect the resulting basecalls.

- ``--taiyaki-model-filename``

  - taiyaki basecalling model checkpoint file.
  - In order to identify modified bases, a model trained to identify those modifications must be provided.

    - Train a new modified base model using taiyaki.

  - Guppy JSON-format models can be converted to taiyaki checkpoints/models with the ``taiyaki/bin/json_to_checkpoint.py`` script for use with megalodon.

-------------------------------
Reference/Signal Mapping Output
-------------------------------

This output category is intended for use in generating reference sequences or signal mapping files for taiyaki basecall model training.

- ``--ref-include-mods``

  - Include modified base calls in ``per_read_refs`` or ``signal_mappings`` outputs.

- ``--ref-include-variants``

  - Include sequence variant calls in per-read reference output.

- ``--ref-length-range``

  - Only include reads with the specified read length in per-read reference output.

- ``--ref-percent-identity-threshold``

  - Only include reads with higher percent identity in per-read reference output.

- ``--ref-percent-coverage-threshold``

  - Only include reads with higher read alignment coverage in per-read reference output.

- ``--ref-mods-all-motifs``

  - Annotate all ``--mod-motif`` occurrences as modified.
  - Requires that ``--ref-include-mods`` is set.

- ``--ref-mod-threshold``

  - Threshold (in ``log(can_prob/mod_prob)`` space) used to annotate modified bases in ``signal_mappings`` or ``per_read_refs`` outputs.
  - See the ``megalodon_extras modified_bases estimate_threshold`` command for help computing this threshold.
  - Requires that ``--ref-include-mods`` is set.

--------------------------
Compute Resource Arguments
--------------------------

- ``--num-read-enumeration-threads``

  - Number of parallel threads to use for read enumeration.
  - This number of threads will be opened in a single read enumeration process and each signal extraction process (see next argument).
  - This value can be increased if the input queue remains empty.
  - Default: ``8``

- ``--num-extract-signal-processes``

  - Number of parallel processes to use for signal extraction.
  - Accessing data and metadata from FAST5 files requires some compute resources. For this reason, multiple processes must be spawned to achieve the highest performance on some systems.
  - This value can be increased if the input queue remains empty.
  - Default: ``2``

-----------------------
Miscellaneous Arguments
-----------------------

- ``--database-safety``

  - Setting for database performance versus corruption protection.
  - Options:

    - 0 (DB corruption on application crash)
    - 1 (Default; DB corruption on system crash)
    - 2 (DB safe mode)

- ``--edge-buffer``

  - Do not process sequence variant or modified base calls near the edge of a read mapping.
  - Default: 30

- ``--not-recursive``

  - Only search for fast5 read files directly found within the fast5 directory.
  - Default: search recursively

- ``--suppress-progress``

  - Suppress progress bar output.

- ``--suppress-queues-status``

  - Suppress dynamic status of output queues.
  - These queues are helpful for diagnosing I/O issues.

- ``--verbose-read-progress``

  - Output dynamic updates to potential issues during processing.
  - Default: ``3``
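
As an illustration of how the per-read text output from ``--write-mods-text`` relates to the ``--mod-binary-threshold`` setting, the sketch below applies a hard probability threshold to per-read log probabilities and counts modified versus confidently called reads per site. This is not Megalodon's own aggregation code. It assumes the text output is tab-delimited with a header row using the column names listed for ``--write-mods-text``, that log probabilities are natural logs, and that each row carries a single modified base. The ``EXAMPLE`` data and the ``aggregate_binary_threshold`` function are hypothetical names introduced only for this example.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical excerpt of per-read modified base text output.
# Assumption: tab-delimited, header row, natural-log probabilities.
EXAMPLE = (
    "read_id\tchrm\tstrand\tpos\tmod_log_probs\tcan_log_prob\tmod_bases\tmotif\n"
    "read1\tchr1\t+\t100\t-0.105\t-2.303\tm\tCG:0\n"
    "read2\tchr1\t+\t100\t-2.303\t-0.105\tm\tCG:0\n"
    "read3\tchr1\t+\t100\t-0.511\t-0.916\tm\tCG:0\n"
    "read4\tchr1\t+\t100\t-0.223\t-1.609\tm\tCG:0\n"
)


def aggregate_binary_threshold(handle, threshold=0.75):
    """Return {(chrm, strand, pos): (n_modified, n_valid)}.

    A read counts as valid at a site only if its modified or canonical
    probability reaches `threshold`; other reads are ignored.
    """
    counts = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(handle, delimiter="\t"):
        mod_prob = math.exp(float(row["mod_log_probs"]))
        can_prob = math.exp(float(row["can_log_prob"]))
        if max(mod_prob, can_prob) < threshold:
            continue  # neither call is confident enough; skip this read
        site = (row["chrm"], row["strand"], int(row["pos"]))
        counts[site][1] += 1
        if mod_prob > can_prob:
            counts[site][0] += 1
    return {site: tuple(c) for site, c in counts.items()}


sites = aggregate_binary_threshold(io.StringIO(EXAMPLE))
for (chrm, strand, pos), (n_mod, n_valid) in sites.items():
    print(chrm, strand, pos, n_mod, n_valid, n_mod / n_valid)
```

In this toy input, ``read3`` has roughly 0.6 modified and 0.4 canonical probability, so it is excluded at the default threshold of 0.75 and would only contribute if the threshold were lowered.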