*************************************
Welcome to Megalodon's documentation!
*************************************

Megalodon is a research command line tool to extract high-accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.

Raw nanopore reads are processed by a single command to produce basecalls (FASTA/Q), reference mappings (SAM/BAM/CRAM), modified base calls (per-read and aggregated per-reference site), sequence variant calls (per-read and aggregated per-reference site) and more.

-------------
Prerequisites
-------------

The primary Megalodon run mode requires the Guppy basecaller (version >= 4.0).
See the `community page for download/installation instructions [login required] `_.

Megalodon is a Python-based command line software package.
Given a Python (version >= 3.5) installation, all other requirements are handled by ``pip`` or ``conda``.

..

    `Taiyaki `_ is no longer required to run Megalodon, but installation is required for two specific run modes:

    1) output mapped signal files (for basecall model training)
    2) running the Taiyaki basecalling backend (for neural network designs including experimental layers)

------------
Installation
------------

``pip`` is recommended for Megalodon installation.

::

    pip install megalodon

``conda`` installation is available, but not fully supported.
``ont_pyguppy_client_lib`` is not available on conda and thus must be installed with ``pip``.

::

    conda install megalodon
    pip install ont_pyguppy_client_lib

To install from GitHub source for development, run the following commands.

::

    git clone https://github.com/nanoporetech/megalodon
    pip install -e megalodon/

It is recommended that Megalodon be installed in a controlled compute environment.
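One way to set up such an environment is sketched here, assuming the standard-library ``venv`` module and a POSIX shell; the directory name ``megalodon_env`` is an arbitrary placeholder, not something Megalodon requires::

    # Create an isolated Python environment in ./megalodon_env
    # (hypothetical directory name; any path works)
    python3 -m venv megalodon_env

    # Activate it for the current shell session
    . megalodon_env/bin/activate

    # Confirm which interpreter is now first on the PATH
    command -v python

With the environment active, the ``pip install megalodon`` command shown earlier installs into ``megalodon_env`` rather than the system Python.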
See `the python documentation for preparing virtual environments `_.

===========
Quick Start
===========

Megalodon must obtain the intermediate output from the basecall neural network.
Guppy (production nanopore basecalling software) is the recommended backend to obtain this output from raw nanopore signal (from FAST5 files).
Nanopore basecalling is compute intensive, so specifying GPU resources (``--devices``) is highly recommended for optimal Megalodon performance.

Megalodon is accessed via the ``megalodon`` command line interface.

::

    # megalodon help (common args)
    megalodon -h
    # megalodon help (advanced args)
    megalodon --help-long

    # Example command to output basecalls, mappings, and 5mC CpG methylation in both per-read (``mod_mappings``) and aggregated (``mods``) formats
    #   Compute settings: GPU devices 0 and 1 with 40 CPU cores
    megalodon \
        raw_fast5s/ \
        --outputs basecalls mappings mod_mappings mods \
        --reference reference.fa --mod-motif m CG 0 \
        --devices 0 1 --processes 40

This command produces the ``megalodon_results`` output directory containing all requested output files and logs.
The format for common outputs is described briefly below and in more detail in the `full documentation `_.

The above command uses the modified base model included in Guppy.
As of the ``2.3.0`` Megalodon release (March 2021), the models included with Guppy (``4.5.2``) are the most accurate modified basecalling models.
As more accurate basecalling models are trained, they are first released into the `Rerio repository for research models `_.
Once training pipelines are more thoroughly standardized and tested, models will be transferred into Guppy.
The code below shows how to obtain and run the R9.4.1, MinION/GridION, 5mC CpG model from Rerio.
Note that this is the same model now included in Guppy ``4.5.2``.
::

    # Obtain and run R9.4.1, MinION, 5mC CpG model from Rerio
    git clone https://github.com/nanoporetech/rerio
    rerio/download_model.py rerio/basecall_models/res_dna_r941_min_modbases_5mC_CpG_v001
    megalodon \
        raw_fast5s/ \
        --guppy-params "-d ./rerio/basecall_models/" \
        --guppy-config res_dna_r941_min_modbases_5mC_CpG_v001.cfg \
        --outputs basecalls mappings mod_mappings mods \
        --reference reference.fa --mod-motif m CG 0 \
        --devices 0 1 --processes 40

..

    The path to the ``guppy_basecall_server`` executable is required to run Megalodon.
    By default, Megalodon assumes Guppy (Linux GPU) is installed in the current working directory (i.e. ``./ont-guppy/bin/guppy_basecall_server``).
    Use the ``--guppy-server-path`` argument to specify a different path.

--------
Contents
--------

.. toctree::
   :maxdepth: 2

   algorithm_details
   common_arguments
   advanced_arguments
   computing_considerations
   variant_phasing
   file_formats
   model_training
   modbase_training
   extras_aggregate
   extras_calibrate
   extras_merge
   extras_modified_bases
   extras_phase_variants
   extras_per_read_text
   extras_validate
   extras_variants