.. _configuration: Pipeline configuration ====================== `Katuali` uses `Snakemake` which allows pipeline `parameters `_ to be provided in a file, or on the command line. If you use the `katuali` wrapper script (rather than running `Snakemake` directly), by default your pipeline will use the yaml config provided with `katuali`. The default config file can be overridden using the ``--configfile`` option. .. code-block:: bash # use a custom config katuali guppy/miniasm_racon/consensus.fasta.gz --configfile myconfig.yaml Nested configuration -------------------- Nested configs allow access of specific settings using a target suffix. The nested config entry below defines different mini_assemble options to be used with different suffixes: .. code-block:: yaml MINI_ASSEMBLE_OPTS: "": "" # use the mini_assemble defaults "_c": "-c" # run basecalls through pore-chop before assembly "_ce": "-c -e 10" # run basecalls through pore-chop and error correct longest 10% of reads prior to assembly The following katuali targets will then run with either the defaults, or `_ce` options: .. code-block:: bash # use default MINI_ASSEMBLE_OPTS (suffix is empty string "") katuali guppy/miniasm_racon/consensus.fasta.gz # use MINI_ASSEMBLE_OPTS specified by suffix "_ce" katuali guppy/miniasm_racon_ce/consensus.fasta.gz A suffix can be added to most targets to specify options. If the suffix does not exist in the nested config, an error will be raised. Processing and resource ----------------------- The pipeline can be used on the local machine, or submitted to a cluster. There are two parameters which control CPU usage: * the ``--cores N`` option, which limits the totol number of threads which can be simultaneously used by all Snakemake tasks. This can be specified on the command line. * the ``THREADS_PER_JOB`` config parameter, determines the maximum number of `threads `_ that a single multi-threaded rule will use. When fewer cores than threads are provided, the number of threads a task uses will be reduced to the number of given cores. This parameter must be set within the ``RUNTIME`` section of the config file: .. code-block:: yaml RUNTIME: THREADS_PER_JOB: 4 As an example, if ``THREADS_PER_JOB`` is set to 4 and ``--cores`` is set to 8, up to two multi-threaded tasks can run at a time. Running medaka consensus on GPU or CPU -------------------------------------- The workload of running the the medaka consensus neural-network can assigned to either CPU or GPU resources using the ``MEDAKA_CONSENSUS_NUM_GPU`` flag. Setting .. code-block:: yaml RUNTIME: MEDAKA_CONSENSUS_NUM_GPU: 0 will result in the neural-network being run on CPU, while setting .. code-block:: yaml RUNTIME: MEDAKA_CONSENSUS_NUM_GPU: 1 will result in the neural-network being run on GPU. .. note:: Note that MEDAKA_CONSENSUS_NUM_GPU should be 0 or 1; values greater than 1 are not supported. .. note:: Note also that to ``tensorflow-gpu`` must be installed in your medaka environment if you wish to run medaka using a GPU. Running on the local machine ---------------------------- When running on a local machine using GPUs (e.g. while basecalling with guppy, training or evaluating medaka models), `katuali` can limit the number of concurrent GPU tasks scheduled so as not to saturate GPU resource by informing katuali how many GPUs are present on the machine: .. code-block:: bash NCPUS=$(nproc) # how many cores available on the machine NGPUS=$(nvidia-smi --list-gpus | wc -l) # how many GPUs available on the machine katuali --cores ${NCPUS} --resources gpu=${NGPUS} ${targets} here ``--resources gpu=${NGPUS}`` specifies the maximum number of GPUs which can be used simultaneously by concurrent tasks. .. note:: Note that if ``--cores`` is not specified, it defaults to 1, while if ``--resources`` it defaults to 0 (unlimited) and that Snakemake manages `threads/cores `_ separately from other `resources `_. .. _using_cluster: Submitting tasks to a cluster ----------------------------- When submitting to a queuing system, the ``--cores`` option will limit the number of queue slots used simultaneously. The `katuali` wrapper has an ``--autocluster`` option which can handle submission to a default cluster using DRMAA: .. code-block:: bash NSLOTS=100 target=all_fast_assm_polish katuali --cores ${NSLOTS} --autocluster ${target} The ``--autocluster`` option makes us of the default `katuali` `cluster config `_ to submit jobs to an SGE cluster. The use of cluster configs allows us to abstract away details specific to a given cluster, and easily switch between clusters simply by changing the cluster config. See the `Snakemake documentation on cluster configs for futher details `_. Using the default `katuali` cluster config in conjuction with the ``--autocluster`` option is equivalent to running: .. code-block:: bash NSLOTS=100 target=all_fast_assm_polish katuali --cores ${NSLOTS} --latency-wait 300 --drmaa "-V -cwd -l gpu={resources.gpu} -pe mt {threads} -o logs -j y" Here, ``"-V -cwd -l gpu={resources.gpu} -pe mt {threads} -o logs -j y"`` are the options specific to the SGE scheduler informing it what resources a task requires. Note that the resource requirements are expressed in brackets (``{resources.gpu}`` and ``{threads}``) and will be replaced with actual values depending on the rule generating the task being submitted. `katuali` abstracts away these SGE-specific details by using its default cluster config: .. code-block:: yaml __default__: n_cpu: "-pe mt " n_gpu: "-l gpu=" export_env: "-V" cwd: "-cwd" logdir: "-o " misc: "-j y" Using this cluster config, the `katuali` ``--autocluster`` option can support any DRMAA-enabled cluster using an appropriate cluster-config as the command line call to Snakemake is expressed in terms of cluster config entries. The ``--autocluster`` option implements: .. code-block:: bash NSLOTS=100 target=all_fast_assm_polish cluster_config=$(katuali_datafile cluster_config.yaml) katuali --cores ${NSLOTS} --latency-wait 300 --drmaa " {cluster.export_env} {cluster.cwd} {cluster.n_gpu}{resources.gpu} {cluster.n_cpu}{threads} {cluster.logdir}logs {cluster.misc}" --cluster-config ${cluster_config} ${target} Here all ``{cluster.}`` templates are replaced by values from the cluster config. Hence running on another DRMAA cluster should be as simple as creating a new cluster config with terms equivalent to those in the default katuali cluster-config, then running: .. code-block:: bash NSLOTS=100 target=all_fast_assm_polish katuali --cores ${NSLOTS} --latency-wait 300 --autocluster --cluster-config my_cluster_config.yaml ${target} When running on a cluster, the local snakemake task will submit all tasks to the queue for execution. The ``--latency-wait`` parameter is useful for ensuring that pipelines don't crash due to output files not appearing on the node where snakemake is run due to latencies on networked file systems.