Pipeline configuration

Katuali uses Snakemake, which allows pipeline parameters to be provided in a config file or on the command line.
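Individual parameters can also be overridden on the command line; a minimal sketch, assuming the katuali wrapper forwards the option to Snakemake, and where SOME_PARAM is a hypothetical top-level config key:

# override a single (hypothetical) top-level config value on the command line
katuali guppy/miniasm_racon/consensus.fasta.gz --config SOME_PARAM=some_value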

If you use the katuali wrapper script (rather than running Snakemake directly), your pipeline will by default use the YAML config provided with katuali.

The default config file can be overridden using the --configfile option.

# use a custom config
katuali guppy/miniasm_racon/consensus.fasta.gz --configfile myconfig.yaml

Nested configuration

Nested configs allow specific settings to be selected using a target suffix. The nested config entry below defines different mini_assemble options to be used with different suffixes:

MINI_ASSEMBLE_OPTS:
    "": ""             # use the mini_assemble defaults
    "_c": "-c"         # run basecalls through pore-chop before assembly
    "_ce": "-c -e 10"  # run basecalls through pore-chop and error correct longest 10% of reads prior to assembly

The following katuali targets will then run with either the default options or the _ce options:

# use default MINI_ASSEMBLE_OPTS (suffix is empty string "")
katuali guppy/miniasm_racon/consensus.fasta.gz

# use MINI_ASSEMBLE_OPTS specified by suffix "_ce"
katuali guppy/miniasm_racon_ce/consensus.fasta.gz

A suffix can be added to most targets to specify options. If the suffix does not exist in the nested config, an error will be raised.
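For example, to expose a new set of options one could add a hypothetical suffix (here "_myopts") to the nested config:

MINI_ASSEMBLE_OPTS:
    "": ""                 # use the mini_assemble defaults
    "_myopts": "-c -e 20"  # hypothetical suffix: porechop reads and error correct longest 20% of reads

and then request it via the target path:

# use MINI_ASSEMBLE_OPTS specified by the hypothetical suffix "_myopts"
katuali guppy/miniasm_racon_myopts/consensus.fasta.gz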

Processing and resources

The pipeline can be used on the local machine, or submitted to a cluster.

There are two parameters which control CPU usage:

  • the --cores N option, which limits the total number of threads that can be used simultaneously across all Snakemake tasks. This is specified on the command line.

  • the THREADS_PER_JOB config parameter, which determines the maximum number of threads a single multi-threaded rule will use. If fewer cores than THREADS_PER_JOB are provided, the number of threads a task uses will be reduced to the number of available cores. This parameter must be set within the RUNTIME section of the config file:

RUNTIME:
    THREADS_PER_JOB: 4

As an example, if THREADS_PER_JOB is set to 4 and --cores is set to 8, up to two multi-threaded tasks can run at a time.
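For instance, assuming the RUNTIME snippet above is saved in myconfig.yaml, the following illustrative invocation allows at most two four-thread tasks to run at once:

# 8 cores in total with THREADS_PER_JOB: 4 => at most two multi-threaded tasks at a time
katuali guppy/miniasm_racon/consensus.fasta.gz --configfile myconfig.yaml --cores 8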

Running medaka consensus on GPU or CPU

The medaka consensus neural network can be run on either CPU or GPU resources using the MEDAKA_CONSENSUS_NUM_GPU config parameter.

Setting

RUNTIME:
    MEDAKA_CONSENSUS_NUM_GPU: 0

will result in the neural-network being run on CPU, while setting

RUNTIME:
    MEDAKA_CONSENSUS_NUM_GPU: 1

will result in the neural-network being run on GPU.

Note

MEDAKA_CONSENSUS_NUM_GPU should be 0 or 1; values greater than 1 are not supported.

Note

To run medaka on a GPU, tensorflow-gpu must be installed in your medaka environment.
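A minimal sketch of doing this, assuming medaka was installed into a pip-managed virtual environment (the environment path is illustrative):

# activate the (illustrative) virtual environment containing medaka
source medaka_env/bin/activate
# add GPU-enabled tensorflow to the same environment
pip install tensorflow-gpu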

Running on the local machine

When running on a local machine with GPUs (e.g. while basecalling with guppy, or training or evaluating medaka models), katuali can limit the number of concurrently scheduled GPU tasks so as not to saturate the GPU resource. To do this, inform katuali how many GPUs are present on the machine:

NCPUS=$(nproc)  # how many cores available on the machine
NGPUS=$(nvidia-smi --list-gpus | wc -l)  # how many GPUs available on the machine
katuali --cores ${NCPUS} --resources gpu=${NGPUS} ${targets}

Here, --resources gpu=${NGPUS} specifies the maximum number of GPUs that can be used simultaneously by concurrent tasks.

Note

If --cores is not specified it defaults to 1, and if --resources is not specified resource usage is unconstrained; note also that Snakemake manages threads/cores separately from other resources.

Submitting tasks to a cluster

When submitting to a queuing system, the --cores option will limit the number of queue slots used simultaneously.

The katuali wrapper has an --autocluster option which can handle submission to a default cluster using DRMAA:

NSLOTS=100
target=all_fast_assm_polish
katuali --cores ${NSLOTS} --autocluster ${target}

The --autocluster option makes use of the default katuali cluster config to submit jobs to an SGE cluster. The use of cluster configs allows details specific to a given cluster to be abstracted away, so that switching between clusters is simply a matter of changing the cluster config. See the Snakemake documentation on cluster configs for further details.

Using the default katuali cluster config in conjunction with the --autocluster option is equivalent to running:

NSLOTS=100
target=all_fast_assm_polish
katuali --cores ${NSLOTS} --latency-wait 300 --drmaa "-V -cwd -l gpu={resources.gpu} -pe mt {threads} -o logs -j y" ${target}

Here, "-V -cwd -l gpu={resources.gpu} -pe mt {threads} -o logs -j y" are the options specific to the SGE scheduler informing it what resources a task requires. Note that the resource requirements are expressed in brackets ({resources.gpu} and {threads}) and will be replaced with actual values depending on the rule generating the task being submitted.

katuali abstracts away these SGE-specific details by using its default cluster config:

__default__:
    n_cpu: "-pe mt "
    n_gpu: "-l gpu="
    export_env: "-V"
    cwd: "-cwd"
    logdir: "-o "
    misc: "-j y"

Because the command line call to Snakemake is expressed in terms of cluster-config entries, the --autocluster option can support any DRMAA-enabled cluster given an appropriate cluster config. The --autocluster option implements:

NSLOTS=100
target=all_fast_assm_polish
cluster_config=$(katuali_datafile cluster_config.yaml)
katuali --cores ${NSLOTS} --latency-wait 300 --drmaa " {cluster.export_env} {cluster.cwd} {cluster.n_gpu}{resources.gpu} {cluster.n_cpu}{threads} {cluster.logdir}logs {cluster.misc}" --cluster-config ${cluster_config} ${target}

Here all {cluster.<variable_name>} templates are replaced by values from the cluster config.
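For example, for a rule requesting 4 threads and 1 GPU (values illustrative), the templated --drmaa string above would expand to the SGE options shown earlier:

# expansion of the --drmaa string for an illustrative task with threads=4 and resources.gpu=1
-V -cwd -l gpu=1 -pe mt 4 -o logs -j y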

Hence running on another DRMAA cluster should be as simple as creating a new cluster config with entries equivalent to those in the default katuali cluster config, then running:

NSLOTS=100
target=all_fast_assm_polish
katuali --cores ${NSLOTS} --latency-wait 300 --autocluster --cluster-config my_cluster_config.yaml ${target}
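As an illustration, a hypothetical my_cluster_config.yaml for a non-SGE scheduler might look like the following; the option strings are assumptions only and should be replaced with the flags your scheduler expects:

__default__:
    n_cpu: "--cpus-per-task="  # illustrative only: replace with your scheduler's CPU option
    n_gpu: "--gres=gpu:"       # illustrative only: GPU request option
    export_env: "--export=ALL" # illustrative only: export the environment
    cwd: "--chdir=."           # illustrative only: run in the working directory
    logdir: "--output="        # illustrative only: log directory option
    misc: ""                   # any other scheduler-specific options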

When running on a cluster, the local Snakemake process will submit all tasks to the queue for execution. The --latency-wait parameter helps ensure that pipelines do not crash because output files have not yet appeared on the node where Snakemake is running, as can happen with latencies on networked file systems.