Performance considerations
Sharding a large modBAM by region.
The --region
option in pileup
, summary
, and sample-probs
can be used to
operate on a subset of records in a large BAM. If you're working in a
distributed environment, the genome could be sharded into large sections which
are specified to modkit
in concurrent processes and merged afterward in a
"map-reduce" pattern.
Setting the --interval-size
and --chunk-size
(pileup
).
Whenever operating on a sorted, indexed BAM, modkit
will operate in parallel
on disjoint spans of the genome. The length of these spans (i.e. intervals) can
be determined by the --interval-size
or the --sampling-interval-size
(for
the sampling algorithm only). The defaults for these parameters works well for
genomes such as the human genome. For smaller genomes with high coverage, you
may decide to decrease the interval size in order to take advantage of
parallelism. The pileup
subcommand also has a --chunk-size
option that will
limit the total number of intervals computed on in parallel. By default,
modkit
will set this parameter to be 50% larger than the number of threads.
In general, this is a good setting for balancing parallelism and memory usage.
Increasing the --chunk-size
can increase parallelism (and decrease run time)
but will consume more memory.
Memory usage in modkit extract
.
Transforming reads into a table with modkit extract
can produce large files (especially with long reads).
Before the data can be written to disk, however, it is enqueued in memory and can potentially create a large memory burden.
There are a few ways to decrease the amount of memory modkit extract
will use in these cases:
- Lower the
--queue-size
, this decreased the number of batches that will be held in flight. - Use
--ignore-index
this will forcemodkit extract
to run a serial scan of the mod-BAM. - Decrease the
--interval-size
, this will decrease the size of the batches.