Pipeline configuration¶
Katuali uses Snakemake, which allows pipeline parameters to be provided in a file or on the command line.
If you use the katuali wrapper script (rather than running Snakemake directly), your pipeline will by default use the YAML config provided with katuali.
The default config file can be overridden using the --configfile option.
# use a custom config
katuali guppy/miniasm_racon/consensus.fasta.gz --configfile myconfig.yaml
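A common starting point for a custom config is to copy the default and edit it before passing it back with --configfile. The sketch below assumes the default config is exposed through the katuali_datafile helper (used later in this section for the cluster config) under the name config.yaml; check your installation for the actual file name.
# sketch: copy the default config shipped with katuali, edit it, then use it
# (the file name config.yaml is an assumption; adjust to your installation)
cp $(katuali_datafile config.yaml) myconfig.yaml
# ... edit myconfig.yaml ...
katuali guppy/miniasm_racon/consensus.fasta.gz --configfile myconfig.yaml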
Nested configuration¶
Nested configs allow specific settings to be accessed using a target suffix. The nested config entry below defines different mini_assemble options to be used with different suffixes:
MINI_ASSEMBLE_OPTS:
    "": ""            # use the mini_assemble defaults
    "_c": "-c"        # run basecalls through porechop before assembly
    "_ce": "-c -e 10" # run basecalls through porechop and error correct longest 10% of reads prior to assembly
The following katuali targets will then run with either the default or the _ce options:
# use default MINI_ASSEMBLE_OPTS (suffix is empty string "")
katuali guppy/miniasm_racon/consensus.fasta.gz
# use MINI_ASSEMBLE_OPTS specified by suffix "_ce"
katuali guppy/miniasm_racon_ce/consensus.fasta.gz
A suffix can be added to most targets to specify options. If the suffix does not exist in the nested config, an error will be raised.
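For example, to run mini_assemble with a new combination of options, add a matching suffix entry to MINI_ASSEMBLE_OPTS in your config and request a target carrying that suffix. The entry below is a sketch: the suffix name is arbitrary and the options simply re-use the -c and -e flags shown above.
# in myconfig.yaml, add an entry under MINI_ASSEMBLE_OPTS, e.g.
#     "_ce5": "-c -e 5"   # porechop basecalls and error correct the longest 5% of reads
# then request a target using that suffix
katuali guppy/miniasm_racon_ce5/consensus.fasta.gz --configfile myconfig.yaml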
Processing and resources¶
The pipeline can be used on the local machine, or submitted to a cluster.
There are two parameters which control CPU usage:

- the --cores N option, which limits the total number of threads which can be simultaneously used by all Snakemake tasks. This is specified on the command line.
- the THREADS_PER_JOB config parameter, which determines the maximum number of threads that a single multi-threaded rule will use. When fewer cores than threads are provided, the number of threads a task uses will be reduced to the number of given cores. This parameter must be set within the RUNTIME section of the config file:
RUNTIME:
    THREADS_PER_JOB: 4
As an example, if THREADS_PER_JOB
is set to 4 and --cores
is set to 8, up to two multi-threaded
tasks can run at a time.
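For example, with THREADS_PER_JOB set to 4 as above, a local run allowing eight threads in total could look like the following (the target is one of those shown earlier and is used purely for illustration):
# allow at most 8 threads in total; with THREADS_PER_JOB: 4 this permits
# up to two 4-thread tasks to run concurrently
katuali --cores 8 guppy/miniasm_racon/consensus.fasta.gz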
Running medaka consensus on GPU or CPU¶
The workload of running the medaka consensus neural network can be assigned to either CPU or GPU resources using the MEDAKA_CONSENSUS_NUM_GPU config parameter.
Setting
RUNTIME:
    MEDAKA_CONSENSUS_NUM_GPU: 0
will result in the neural network being run on the CPU, while setting
RUNTIME:
    MEDAKA_CONSENSUS_NUM_GPU: 1
will result in the neural network being run on the GPU.
Note
MEDAKA_CONSENSUS_NUM_GPU should be 0 or 1; values greater than 1 are not supported.
Note
tensorflow-gpu must be installed in your medaka environment if you wish to run medaka using a GPU.
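As a sketch, GPU support could be added to an existing medaka environment along the following lines; the environment path is an assumption, and the tensorflow-gpu version required depends on your medaka release, so consult the medaka documentation.
# activate the medaka virtual environment (path shown is an assumption)
source ~/medaka/venv/bin/activate
# install the GPU-enabled tensorflow build; pin the version matching your medaka release
pip install tensorflow-gpu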
Running on the local machine¶
When running on a local machine with GPUs (e.g. while basecalling with guppy, or training or evaluating medaka models), katuali can limit the number of concurrent GPU tasks scheduled so as not to saturate the GPU resource. To do this, inform katuali how many GPUs are present on the machine:
NCPUS=$(nproc) # how many cores available on the machine
NGPUS=$(nvidia-smi --list-gpus | wc -l) # how many GPUs available on the machine
katuali --cores ${NCPUS} --resources gpu=${NGPUS} ${targets}
Here, --resources gpu=${NGPUS} specifies the maximum number of GPUs which can be used simultaneously by concurrent tasks.
Note
If --cores is not specified, it defaults to 1, while if --resources is not specified, it defaults to 0 (unlimited). Snakemake manages threads/cores separately from other resources.
Submitting tasks to a cluster¶
When submitting to a queuing system, the --cores
option will limit the number
of queue slots used simultaneously.
The katuali wrapper has an --autocluster
option which can handle submission to a
default cluster using DRMAA:
NSLOTS=100
target=all_fast_assm_polish
katuali --cores ${NSLOTS} --autocluster ${target}
The --autocluster option makes use of the default katuali cluster config
to submit jobs to an SGE cluster. The use of cluster configs allows
us to abstract away details specific to a given cluster, and to switch
between clusters simply by changing the cluster config. See the Snakemake documentation
on cluster configs for further details.
Using the default katuali cluster config in conjunction with the --autocluster
option is equivalent to running:
NSLOTS=100
target=all_fast_assm_polish
katuali --cores ${NSLOTS} --latency-wait 300 --drmaa "-V -cwd -l gpu={resources.gpu} -pe mt {threads} -o logs -j y" ${target}
Here, "-V -cwd -l gpu={resources.gpu} -pe mt {threads} -o logs -j y"
are the
options specific to the SGE scheduler informing it what resources a task
requires. Note that the resource requirements are expressed in brackets
({resources.gpu}
and {threads}
) and will be replaced with actual values
depending on the rule generating the task being submitted.
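As an illustration, for a rule declaring threads: 4 and resources: gpu=1, Snakemake would expand the submission options to something like the following (the values are illustrative):
# expanded DRMAA options for a hypothetical rule with threads: 4 and gpu: 1
-V -cwd -l gpu=1 -pe mt 4 -o logs -j y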
katuali abstracts away these SGE-specific details by using its default cluster config:
__default__:
    n_cpu: "-pe mt "
    n_gpu: "-l gpu="
    export_env: "-V"
    cwd: "-cwd"
    logdir: "-o "
    misc: "-j y"
Using this cluster config mechanism, the katuali --autocluster option can support
any DRMAA-enabled cluster given an appropriate cluster config, since the command
line call to Snakemake is expressed in terms of cluster config entries.
The --autocluster
option implements:
NSLOTS=100
target=all_fast_assm_polish
cluster_config=$(katuali_datafile cluster_config.yaml)
katuali --cores ${NSLOTS} --latency-wait 300 --drmaa " {cluster.export_env} {cluster.cwd} {cluster.n_gpu}{resources.gpu} {cluster.n_cpu}{threads} {cluster.logdir}logs {cluster.misc}" --cluster-config ${cluster_config} ${target}
Here all {cluster.<variable_name>}
templates are replaced by values from the cluster config.
Hence running on another DRMAA cluster should be as simple as creating a new cluster config with terms equivalent to those in the default katuali cluster-config, then running:
NSLOTS=100
target=all_fast_assm_polish
katuali --cores ${NSLOTS} --latency-wait 300 --autocluster --cluster-config my_cluster_config.yaml ${target}
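For example, a cluster config for a SLURM-based DRMAA setup might look something like the sketch below. The option strings are assumptions, not tested values; check them against your scheduler's and DRMAA library's documentation.
# my_cluster_config.yaml: illustrative sketch for a SLURM DRMAA setup
__default__:
    n_cpu: "--cpus-per-task="
    n_gpu: "--gres=gpu:"
    export_env: "--export=ALL"
    cwd: ""                 # SLURM runs jobs in the submission directory by default
    logdir: "--output="     # note: SLURM's --output expects a file path rather than a directory
    misc: ""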
When running on a cluster, the local snakemake process will submit all tasks to
the queue for execution. The --latency-wait parameter is useful for ensuring
that pipelines don't crash when output files are slow to appear on the node where
snakemake is run, due to latencies on networked file systems.