tombo package¶

Tombo Python API¶

The Tombo python API is intended to give users access to some of the lower level processing functions used throughout Tombo analysis pipelines.

Primarily this includes inspection of per-read modified base detection results and observed (re-squiggle assigned) per-read signal levels.

Note

Effort will be made to maintain this API interface introduced at Tombo version 1.4, but major structural changes to the Tombo framework may require changes to some API interface components. Such changes will be noted in github release notes where applicable.

Python API Examples¶

Import Tombo sub-modules:

from tombo import tombo_helper, tombo_stats, resquiggle

Extract normalized raw signal assigned to each genomic base from each read in a region:

# specify region of interest
reg_data = tombo_helper.intervalData(
    chrm='chr20', start=10000, end=10100, strand='+')

# parse Tombo index from previously re-squiggled set of reads
reads_index = tombo_helper.TomboReads(['test_data/native_reads',])
# extract reads that overlap this interval and then extract base signal
# levels from 10 randomly selected reads
reg_base_levels = reg_data.add_reads(
    reads_index).get_base_levels(num_reads=10)

Extracting per-read testing results:

sample_per_read_stats = tombo_stats.PerReadStats(
    'test_stats.alt_model.5mC.tombo.per_read_stats')
# reg_per_read_stats contains a numpy array containing per-read stats
# over all reads covering the region of interest
reg_per_read_stats = sample_per_read_stats.get_region_per_read_stats(
    reg_data)

Run standard resquiggle algorithm (may cause errors on some reads):

import h5py, mappy

# set read values
fast5_fn, reference_fn = 'path/to/read.fast5', 'genome.fasta'
fast5_data = h5py.File(fast5_fn, 'r')
seq_samp_type = tombo_helper.get_seq_sample_type(fast5_data)

# prep aligner, signal model and parameters
aligner = mappy.Aligner(reference_fn, preset=str('map-ont'), best_n=1)
std_ref = tombo_stats.TomboModel(seq_samp_type=seq_samp_type)
rsqgl_params = tombo_stats.load_resquiggle_parameters(seq_samp_type)

# extract data from FAST5
map_results = resquiggle.map_read(fast5_data, aligner, std_ref)
all_raw_signal = tombo_helper.get_raw_read_slot(fast5_data)['Signal'][:]
if seq_samp_type.rev_sig:
    all_raw_signal = all_raw_signal[::-1]
map_results = map_results._replace(raw_signal=all_raw_signal)

# run full re-squiggle
rsqgl_results = resquiggle.resquiggle_read(
    map_results, std_ref, rsqgl_params, all_raw_signal=all_raw_signal)

# or run individual steps
num_events = tombo_stats.compute_num_events(
    all_raw_signal.shape[0], len(map_results.genome_seq),
    rsqgl_params.mean_obs_per_event)
valid_cpts, norm_signal, scale_values = resquiggle.segment_signal(
    map_results, num_events, rsqgl_params)
event_means = tombo_stats.compute_base_means(norm_signal, valid_cpts)
dp_results = resquiggle.find_adaptive_base_assignment(
    valid_cpts, event_means, rsqgl_params, std_ref, map_results.genome_seq)
norm_signal = norm_signal[
    dp_results.read_start_rel_to_raw:
    dp_results.read_start_rel_to_raw + dp_results.segs[-1]]
segs = resquiggle.resolve_skipped_bases_with_raw(
    dp_results, norm_signal, rsqgl_params)

Submodules¶

tombo.resquiggle module¶

tombo.resquiggle.get_read_seq(fast5_data, bc_grp='Basecall_1D_000', bc_subgrp='BaseCalled_template', seq_samp_type=seqSampleType(name='DNA', rev_sig=False), q_score_thresh=0)[source]¶

Extract read sequence from the Fastq slot providing useful error messages

Parameters

fast5_data (tombo.tombo_helper.readData) – read information
bc_grp (str) – group location containing read information (optional; default: ‘Basecall_1D_000’)
bc_subgrp (str) – sub-group location containing read information (optional; default: ‘BaseCalled_template’)
seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type (default: DNA)
q_score_thresh (float) – basecalling mean q-score threshold (optional; default: 0/no filtering)

Returns

tombo.tombo_helper.sequenceData

tombo.resquiggle.map_read(fast5_data, aligner, std_ref, seq_samp_type=seqSampleType(name='DNA', rev_sig=False), bc_grp='Basecall_1D_000', bc_subgrp='BaseCalled_template', map_thr_buf=None, q_score_thresh=0, seq_len_rng=None)[source]¶

Extract read sequence from the Fastq slot providing useful error messages

Parameters

fast5_data (tombo.tombo_helper.readData) – read information
aligner (mappy.Aligner) – aligner object
std_ref (tombo.tombo_stats.TomboModel) – canonical model (in order to extract extended genomic sequence)
seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type (default: DNA)
bc_grp (str) – group location containing read information (optional; default: ‘Basecall_1D_000’)
bc_subgrp (str) – sub-group location containing read information (optional; default: ‘BaseCalled_template’)
map_thr_buf (mappy.ThreadBuffer) – mappy thread buffer object (optional; default: None)
q_score_thresh (float) – basecalling mean q-score threshold (optional; default: 0/no filtering)
seq_len_rng (tuple) – allowed mapped sequence length range (optional; default: None/no filtering)

Returns

tombo.tombo_helper.resquiggleResults containing valid mapping values (signal to sequence assignment attributes will be None)

tombo.resquiggle.resquiggle_read(map_res, std_ref, rsqgl_params, outlier_thresh=None, all_raw_signal=None, max_raw_cpts=200, min_event_to_seq_ratio=1.1, const_scale=None, skip_seq_scaling=False, seq_samp_type=seqSampleType(name='DNA', rev_sig=False))[source]¶

Identify raw signal to genome sequence assignment, using adaptive banded dynamic programming

Parameters

map_res (tombo.tombo_helper.resquiggleResults) – mapping results
std_ref (tombo.tombo_stats.TomboModel) – canonical base model
rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm
outlier_thresh (float) – windsorize signal greater than this value (optional)
all_raw_signal (np.array::np.int64) – raw data acquisition (DAC) current signal values (optional; default use value in map_res)
max_raw_cpts (int) – read will fail if more than max_raw_cpts must be found to produce a valid re-squiggle results (optional)
min_event_to_seq_ratio (float) – minimum event to sequence ratio (optional)
const_scale (float) – constant scale value (optional; may be deprecated)
skip_seq_scaling (bool) – skip sequence-based scaling step
seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type (default: DNA)

Returns

tombo.tombo_helper.resquiggleResults containing raw signal to genome sequence alignment (note that raw_signal now contains trimmed and normalized raw signal)

tombo.resquiggle.segment_signal(map_res, num_events, rsqgl_params, outlier_thresh=None, const_scale=None)[source]¶

Normalize and segment raw signal as defined by rsqgl_params into num_events.

Parameters

map_res (tombo.tombo_helper.resquiggleResults) – containing mapping results only (attributes after read_start_rel_to_raw will all be None)
num_events (int) – number of events to process
rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm
outlier_thresh (float) – windsorize signal greater than this value (optional)

Returns

identified event positions (0-based)
normalized raw signal
scale values (tombo.tombo_helper.scaleValues)

tombo.resquiggle.find_adaptive_base_assignment(valid_cpts, event_means, rsqgl_params, std_ref, genome_seq, start_clip_bases=None, start_clip_params=startClipParams(bandwidth=1000, num_genome_bases=200), seq_samp_type=seqSampleType(name='DNA', rev_sig=False), reg_id=None)[source]¶

Align expected (from genome sequence) signal levels to observed using a dynamic programming approach with adaptive bandwidth start positions

Parameters

valid_cpts (np.array::np.int64) – raw signal base start locations
event_means (np.array::np.float64) – event normalized raw signal means
rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm
std_ref (tombo.tombo_stats.TomboModel) – canonical model
genome_seq (str) – genome sequence (from mapping)
start_clip_bases (str) – mapping read clipped bases
start_clip_params (tombo.tombo_helper.startClipParams) – start clip basecall params
seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type (default: DNA)
reg_id (str) – debug

Returns

tombo.tombo_helper.dpResults

tombo.resquiggle.resolve_skipped_bases_with_raw(dp_res, norm_signal, rsqgl_params, max_raw_cpts=200, del_fix_window=2, max_del_fix_window=10, extra_sig_factor=1.1)[source]¶

Perform dynamic-programming forward pass over raw signal z-scores. For details, see https://nanoporetech.github.io/tombo/resquiggle.html#resolve-skipped-bases

Parameters

dp_res (tombo_helper.dpResults) – dynamic programming results
norm_signal (np.array::np.float64) – normalized raw siganl
rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm
max_raw_cpts (int) – maximum new changepoints to find from raw signal (optional)
del_fix_window (int) – initial bases to extend skipped base windows (optional)
max_del_fix_window (int) – max bases to extend skipped base windows (optional)
extra_sig_factor (float) – amount of extra signal required to perform signal space re-squiggle (optional)

Returns

np.array::np.int64 containing new deletion resolved base raw signal start positions

tombo.resquiggle.find_seq_start_in_events(event_means, r_ref_means, r_ref_sds, rsqgl_params, num_bases, num_events, seq_samp_type=None, reg_id=None)[source]¶

Identify most probably start of expected levels within observed events

Parameters

event_means (np.array::np.float64) – event normalized raw signal means
r_ref_means (np.array::np.float64) – expected base signal levels
r_ref_sds (np.array::np.float64) – expected base level SDs
rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm
num_bases (int) – number of bases to process
num_events (int) – number of events to process
reg_id (str) – debug

Returns

event position (0-based) corresponding to the start of expected signal levels
mean events per base identified from start of the read

tombo.resquiggle.find_static_base_assignment(event_means, r_ref_means, r_ref_sds, rsqgl_params, reg_id=None)[source]¶

Align expected (from genome sequence) signal levels to observed using a dynamic programming approach with static bandwidth start positions

Parameters

event_means (np.array::np.float64) – read base means
r_ref_means (np.array::np.float64) – expected base signal levels
r_ref_sds (np.array::np.float64) – expected base level SDs
rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm

Returns

np.array::np.int64 containng event to sequence mapping for full length of short read

tombo.tombo_helper module¶

class tombo.tombo_helper.readData[source]¶

Nanopore read meta-data

Example:

r_data = tombo_helper.readData(
    start=0, end=1000, filtered=False, read_start_rel_to_raw=100,
    strand='+', fn='path/to/read.fast5',
    corr_group='RawGenomeCorrected_000', rna=False,
    sig_match_score=1.0, mean_q_score=10.0)

# output is a list of readData objects
reads_index = tombo_helper.TomboReads(['test_data/native_reads',])
cs_reads = reads_index.get_cs_reads('chr20', '+')

Parameters

start (int) – 0-based start mapped position
end (int) – 0-based open interval end mapped position
filtered (bool) – is this read filtered?
read_start_rel_to_raw (int) – start (in raw signal vector) of assigned bases
strand (str) – read mapped strand (‘+’, ‘-‘ or None)
fn (str) – raw read filename.
corr_group (str) – –corrected-group slot specified in re-squiggle command
rna (bool) – is this read RNA?
sig_match_score (float) – signal matching score. Default: None
mean_q_score (float) – mean basecalling q-score. Default: None
read_id (str) – read identifier Default: None

class tombo.tombo_helper.intervalData(chrm, start, end, strand=None, reg_id=None, reg_text='', reads=None, seq=None)[source]¶

Genome/transcriptome interval information

Example:

int_data = tombo_helper.intervalData(
    chrm='chr20', start=10000, end=10100, strand='+')

Note

All intervalData functions return the object in order to allow chaining of interval commands. (e.g. int_data.add_reads(reads_index).add_seq())

__init__(chrm, start, end, strand=None, reg_id=None, reg_text='', reads=None, seq=None)[source]¶

Initialize a genome/transcriptome interval object

Parameters

chrm (str) – chromosome name.
start (int) – 0-based start position.
end (int) – 1-based (or open interval) end position.
strand (str) – interval strand (‘+’, ‘-‘ or None). Default: None
reg_id (str) – used to keep track of multiple intervals. Default: None
reg_text (str) – region description text. Default: ‘’
reads (list) – list of readData values overlapping region. Default: None
seq (str) – region genomic sequence. Default: None

add_reads(reads_index, require_full_span=False)[source]¶

Add reads overlapping this interval

Parameters

reads_index (tombo.tombo_helper.TomboReads) – reads index
require_full_span (bool) – require that reads span then entire interval (default: include all reads overlapping the interval)

add_seq(genome_index=None, error_end=True)[source]¶

Extract the forward strand genomic sequence for an interval from reads or genome_index if provided

Parameters: genome_index (tombo.tombo_helper.Fasta) –

copy(include_reads=True)[source]¶

Create a copy of this interval. Useful when adding sets of reads from multiple samples to compare.

Parameters: include_reads (bool) – include current reads slot in the new object (default: True)

expand_interval(expand_width)[source]¶

Expand this interval by the specified amount (effects only the start and stop attributes; NOT seq or reads)

Parameters: expand_width (int) – amount by which to expand the interval

get_base_levels(read_rows=False, num_reads=None)[source]¶

Extract base levels from this interval.

Parameters

read_rows (bool) – return array with reads as rows (default: row columns)
num_reads (int) – maximal number of reads to output. randomly downsample if more reads are present (default: output all reads)

Returns

np.array containing read mean levels with rows corresponding to interval position and columns to individual reads (or reverse if read_rows)

merge(other_reg)[source]¶

Create a copy of this interval with the reads from this interval and other_reg

Parameters: other_reg (tombo.tombo_helper.intervalData) – a second region to merge with this interval

Note

Aside from reads, all other attributes will be taken from this interval

update(**kwargs)[source]¶: Update slots specified in keyword arguments (kwargs)

class tombo.tombo_helper.TomboReads(fast5s_basedirs, corrected_group='RawGenomeCorrected_000', basecall_subgroups=None, for_writing=False, remove_filtered=True, sample_name=None)[source]¶

A set of reads with associated meta-data from re-squiggle processing

__init__(fast5s_basedirs, corrected_group='RawGenomeCorrected_000', basecall_subgroups=None, for_writing=False, remove_filtered=True, sample_name=None)[source]¶

Parse data from a list of re-squiggle fast5 directories

Parameters

fast5s_basedirs (list) – fast5 base directories, which have been processed by tombo resquiggle
corrected_group (str) – Analysis slot containing the re-squiggle information (optional; default: ‘RawGenomeCorrected_000’)
basecall_subgroups (list) – basecall subgroups (optional; default: Process all basecall subgroups)
for_writing (bool) – open TomboReads to write index (optional; default: Open for reading and parse re-squiggled reads)
remove_filtered (bool) – remove filtered reads as indicated by index file (optional; default: True)
sample_name (str) – for verbose output

add_read_data(chrm, strand, read_data)[source]¶

Add read data to the index

Parameters

chrm (str) – chromosome name
strand (str) – strand (‘+’ or ‘-‘)
read_data (tombo.tombo_helper.readData) – read information

get_all_cs()[source]¶: Get list of all (chromosome, strand) stored in index.

get_coverage(chrm, pos, strand=None, invalid_return=0)[source]¶

Get coverage at specified position

Parameters

chrm (str) – chromosome name
pos (int) – 0-based genomic position
strand (str) – interval strand (‘+’, ‘-‘ or None). Default: None (max over both strands)
invalid__return – return value for invalid (bad chrm, strand or pos beyond coverage) position. Default: 0

get_cs_coverage(chrm, strand, invalid_return=None)[source]¶

Extract coverage levels over a specified chromosome and strand

Parameters

chrm (str) – chromosome name
strand (str) – strand (‘+’ or ‘-‘)
invalid_return – value to return for invalid (chrm, strand)

Returns

numpy array (numpy.int64) containing read coverage levels

get_cs_reads(chrm, strand, invalid_return=[])[source]¶

Extract the list of reads stored in a specified chromosome and strand

Parameters

chrm (str) – chromosome name
strand (str) – strand (‘+’ or ‘-‘)
invalid_return – value to return for invalid (chrm, strand)

is_empty()[source]¶: Are any reads stored in the index?

iter_cov_regs(cov_thresh, region_size=None, ctrl_reads_index=None)[source]¶

Iterate over regions with coverage greater than or equal to cov_thresh.

If region_size is provided, regions are rounded to the nearest region_sized windows and only the region start is yielded (e.g. chrm, strand, start). If not provided (chrm, strand, start, end) is yielded.

iter_coverage_regions(ctrl_reads_index=None)[source]¶

Get genome coverage for a set of reads

Parameters

ctrl_reads_index (tombo.tombo_helper.TomboReads) – a second set of tombo reads to add for coverage

Returns

Yeilds (chrm, strand, cs_cov, cs_cov_starts) indicating read coverage levels

cs_cov and cs_cov_starts are numpy arrays containing the coverage level and the 0-based genomic start positions of those intervals

iter_cs_coverage()[source]¶: Iterate over coverage levels across all stored (chrm, strands)

iter_reads()[source]¶: Iterate over reads stored in the index

next()[source]¶

replace_index(new_reads_index)[source]¶

Replace current reads index

Parameters: new_reads_index (dict) – dictionary with (chrm, strand) pointing to lists of tombo.tombo_helper.readData objects

write_index_file()[source]¶: Write index file

Note

Must be an index from only a single fast5s_dir

Index filename will be: .[fast5s_dir].[corr_grp].tombo.index Note that this is a hidden file (starts with a “.”)

class tombo.tombo_helper.TomboMotif(raw_motif, mod_pos=None)[source]¶

Description of a sequence motif, including potentially modified position

raw_motif¶

input raw motif string

Type: str

motif_len¶

length of motif

Type: int

motif_pat¶

python regular expression for this motif

Type: re.compile

rev_comp_pat¶

python regular expression for reverse complement of this motif

Type: re.compile

is_palindrome¶

is the motif palindromic (in the genomic not literary definition)

Type: bool

mod_pos¶

modified base position within motif

Type: int

mod_base¶

modified base (should generally be A, C, G or T)

Type: str

__init__(raw_motif, mod_pos=None)[source]¶

Parse string motif

Parameters

raw_motif (str) – sequence motif. supports IUPAC single letter codes (use T for RNA).
mod_pos (int) – 1-based position of modified base within the motif

find_mod_poss(seq)[source]¶

Find all mod-base positions within the sequence.

Including partial matches at beginning and end that include mod_pos.

matches_seq(seq)[source]¶

Does the motif match provided sequence (including mod_pos within seq)?

Including partial matches at beginning and end that include mod_pos.

tombo.tombo_helper.parse_motif_descs(stat_motif_descs)[source]¶

Parse string motif descriptions as defined by tombo plot roc --motif-descriptions

Parameters: stat_motif_descs (str) – string motifs description (see tombo plot roc --motif-descriptions)
Returns: list of tuples with tombo.tombo_helper.TomboMotif and motif/modification names

class tombo.tombo_helper.resquiggleParams[source]¶

Re-squiggle parameters

Parameters

match_evalue (float) – expected value for matching event to sequence
skip_pen (float) – penalty for skipped sequence position
bandwidth (int) – adaptive bandwidth
max_half_z_score (float) – windsorize half z-scores above this value
running_stat_width (int) – running neighboring window width for segmentation scoring
min_obs_per_base (int) – minimum observations per genomic base
mean_obs_per_event (int) – mean number of raw obs. per event during segmentation
z_shift (float) – amount to shift z-scores for DP (derived from match_evalue)
stay_pen (float) – stay penalty for DP (derived from match_evalue)
use_t_test_seg (bool) – use t-test segmentation criterion (default: raw neighboring window difference)
band_bound_thresh (int) – bandwidth boundary threshold for determining if a read potentially left the adaptive bandwidth
start_bw (int) – bandwidth for read start identification
start_save_bw (int) – save bandwidth for read start identification (if start_bw_fails)
start_n_bases (int) – number of genomic bases to use for read start identification

class tombo.tombo_helper.startClipParams[source]¶

Parameters to identify read start using bases clipped from mapping

Parameters

bandwidth (int) – bandwidth
num_genome_bases (int) – number of genome bases

class tombo.tombo_helper.resquiggleResults[source]¶

Re-squiggle results

Parameters

align_info (tombo.tombo_helper.alignInfo) – read alignment information
genome_loc (tombo.tombo_helper.genomeLocation) – genome mapping location
genome_seq (str) – mapped genome sequence
mean_q_score (float) – mean basecalling q-score
raw_signal (np.array::np.float64) – raw signal (i.e. un-segmented; signal may be normalized) (optional)
channel_info (tombo.tombo_helper.channelInfo) – channel information (optional)
read_start_rel_to_raw (int) – read start within raw signal (optional)
segs (np.array::np.int64) – relative raw signal segment positions (optional)
scale_values (tombo.tombo_helper.scaleValues) – signal normalization scale values (optional)
sig_match_score (float) – expected to observed signal score (see tombo.tombo_stats.get_read_seg_score; optional)
norm_params_changed (bool) – were scale parameters updated (optional)
start_clip_bases (str) – mapping clipped bases from start of read (optional)
stall_ints (list) – list of rna stall locations within raw signal (optional)

class tombo.tombo_helper.alignInfo[source]¶

Information from genomic read alignment

Parameters

ID (str) – read identifier
Subgroup (str) – sub-group location containing read information (e.g. ‘BaseCalled_template’)
ClipStart (int) – number of bases clipped from beginning of basecalls
ClipEnd (int) – number of bases clipped from end of basecalls
Insertions (int) – number of inserted bases in alignment
Deletions (int) – number of delected bases in alignment
Matches (int) – number of matched bases in alignment
Mismatches (int) – number of mis-matched bases in alignment

class tombo.tombo_helper.genomeLocation[source]¶

Genomic location

Parameters

Start (int) – 0-based genomic start position
Strand (str) – Strand (should be ‘+’ or ‘-‘)
Chrom (str) – Chromosome

class tombo.tombo_helper.sequenceData[source]¶

Read sequence data from FASTQ record

Parameters

seq (str) – read sequence
id (str) – read id (extracted from FASTQ record line)
mean_q_score (float) – mean q-score

class tombo.tombo_helper.dpResults[source]¶

Dynamic programming results

Parameters

read_start_rel_to_raw (int) – read start within raw signal
segs (np.array::np.int64) – relative raw signal segment positions
ref_means (np.array::np.float64) – expected signal levels
ref_sds (np.array::np.float64) – expected SD of signal levels
genome_seq (str) – mapped genome sequence

class tombo.tombo_helper.scaleValues[source]¶

Signal normaliztion scaling parameters. For details see https://nanoporetech.github.io/tombo/resquiggle.html#signal-normalization

Parameters

shift (float) – shift scaling parameter
scale (float) – scale scaling parameter
lower_lim (float) – lower windsorizing threshold
upper_lim (float) – upper windsorizing threshold
outlier_thresh (float) – outlier threshold used to define lower_lim and upper_lim

class tombo.tombo_helper.Fasta(fasta_fn, dry_run=False, force_in_mem=False, assume_dna_base=False)[source]¶

Fasta file sequence format wrapper class. Will load faidx via pyfaidx package if installed, else the fasta will be loaded into memory for sequence extraction.

__init__(fasta_fn, dry_run=False, force_in_mem=False, assume_dna_base=False)[source]¶

Load a fasta

Parameters

fasta_fn (str) – path to fasta file
dry_run (bool) – when pyfaidx is not installed, don’t actually read sequence into memory.
force_in_mem (bool) – force genome to be loaded into memory even if pyfaidx is installed allowing on-disk access
assume_dna_bases (bool) – skip check for DNA or RNA bases (default: False)

get_seq(chrm, start=None, end=None, error_end=True)[source]¶

Extract sequence from a specific genomic region. Note if provided, start and end must both be provided or they will be ignored.

Parameters

chrm (str) – chromosome name
start (int) – 0-based start position
end (int) – 0-based open-interval end position
error_end (bool) – raise an error when the region requested extends beyond the chromosome (default: True)

Returns

Genomic sequence requested. Sequence is converted to RNA sequence if applicable.

iter_chrms()[source]¶: Iterate over chromosome names

class tombo.tombo_helper.seqSampleType[source]¶

Description of a sequencing sample type

Parameters

name (str) – name of the sequencing sample
rev_sig (bool) – Is the raw signal reversed (3’ to 5’)

tombo.tombo_helper.get_seq_sample_type(fast5_data=None, reads_index=None, fast5_fns=None, num_reads=50)[source]¶

Get the sequencing sample type from a single read or set of reads

Parameters

fast5_data (h5py.File) – open read h5py File object
reads_index (tombo.tombo_helper.TomboReads) –
fast5_fns (list) – FAST5 read filenames
num_reads (int) – sample of reads to check for sample type

tombo.tombo_helper.get_raw_read_slot(fast5_data)[source]¶

Get the raw read slot from this FAST5 read file

Parameters: fast5_data (h5py.File) – open FAST5 read file object

Example:

all_aw_signal = get_raw_read_slot(fast5_data)['Signal'][:]

Returns: The HDF5 group slot containing the raw signal data.

tombo.tombo_helper.get_single_slot_read_centric(r_data, slot_name, corr_grp=None)[source]¶

Extract read-centric slot_name from this read’s Events table

Parameters

fast5_data (tombo.tombo_helper.readData or an open h5py.File object) –
slot_name (str) – slot from which to extract data (valid values: norm_mean, norm_stdev, start, length, base)
corr_grp (str) – corrected group slot from which to extract data (default: use value from fast5_data object; required for h5py.File)

tombo.tombo_helper.get_multiple_slots_read_centric(r_data, slot_names, corr_grp=None)[source]¶

Extract multiple read-centric slot_names from this read’s Events table

Parameters

fast5_data (tombo.tombo_helper.readData or an open h5py.File object) –
slot_name (str) – slot from which to extract data (valid values: norm_mean, norm_stdev, start, length, base)
corr_grp (str) – corrected group slot from which to extract data (default: use value from fast5_data object; required for h5py.File)

Returns

A tuple of numpy arrays specified by the slot_names

tombo.tombo_stats module¶

tombo.tombo_stats.TomboStats(stat_fn)[source]¶

Load per-reference location Tombo statistics. Safe method to load statistics for read-only purposes.

Returns: Either tombo.tombo_stats.ModelStats or tombo.tombo_stats.LevelStats depending on the type of Tombo statistics file provided.

class tombo.tombo_stats.ModelStats(stats_fn, stat_type=None, region_size=None, cov_damp_counts=None, cov_thresh=None, num_most_signif=None, most_signif_num_batches=10)[source]¶

Parse and retrieve relevant information from a standard (per-genomic base) Tombo statistics file.

__init__(stats_fn, stat_type=None, region_size=None, cov_damp_counts=None, cov_thresh=None, num_most_signif=None, most_signif_num_batches=10)[source]¶

Parse or open for writing a standard (per-genomic base) Tombo statistics file.

Example:

stats = tombo_stats.TomboStats('path/to/stats.file')
for chrm, strand, pos, frac, damp_frac, valid_cov in                     stats.iter_most_signif_sites():
    # do stuff

Parameters

stats_fn (str) – filename for previously saved tombo stats
stat_type (str) – type of statistic (model_compare, de_novo, or sample_compare); only applicable for new file writing
region_size (int) – size of chunked storage blocks; only applicable for new file writing
cov_damp_counts (tuple) – pseudo-counts for modified and un-modified reads to compute damp_frac
cov_thresh (int) – only sites with coverage greater than or equal to this value will be stored
num_most_signif (int) – number of most significant sites to be stored for faster access
most_signif_num_batches (int) – number of region batches to store before re-computing the most significant array (default: 10)

Warning

If all arguments are provided the current file’s contents will be: deleted.

Intended to open a fresh ModelStats file for writing.

close()[source]¶: Close open HDF5 file and write most significant sites if open for writing

compute_ctrl_motif_stats(ctrl_stats, motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]¶

Compute lists of statistic values and whether this site represents a match to the provided motifs

Parameters

ctrl_stats (tombo.tombo_stats.ModelStats) – control statistics
(list; see (motif_descs) – tombo.tombo_helper.parse_motif_descs): containing tuples with tombo.tombo_helper.TomboMotif and motif/modification names
genome_index (tombo.tombo_helper.Fasta) – genome index
stats_per_block (int) – statistics to include in calculations per-block (–multiprocess-region-size)
total_stats_limit (int) – maximum total statistics to include in computation (Default: include all stats)

Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean native or control sample stats

compute_ground_truth_stats(ground_truth_locs)[source]¶

Compute lists of statistic values and ground truth modification status (boolean)

Parameters: ground_truth_locs – list containing tuples of 1) modified locations (from tombo.tombo_helper.parse_locs_file), 2) unmodified locations and mod names.
Returns: Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match

compute_motif_stats(motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]¶

Compute lists of statistic values and whether this site represents a match to the provided motifs

Parameters

(list; see (motif_descs) – tombo.tombo_helper.parse_motif_descs): containing tuples with tombo.tombo_helper.TomboMotif and motif/modification names
genome_index (tombo.tombo_helper.Fasta) – genome index
stats_per_block (int) – statistics to include in calculations per-block (–multiprocess-region-size)
total_stats_limit (int) – maximum total statistics to include in computation (Default: include all stats)

Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match

get_most_signif_regions(num_bases, num_regions, unique_pos=True, prepend_loc_to_text=False)[source]¶

Select regions centered on locations with the largest fraction of modified bases

Parameters

num_bases (int) – number of bases to output
num_regions (int) – number of regions to output
unique_pos (bool) – get only unique positions (optional; default True) intervals may overlap, but identified significant position is outside other intervals
prepend_loc_to_text (bool) – pre-prend most significant location to the region text (can be off for interval near start/end of sequence records)

Returns

A list of tombo.tombo_helper.intervalData objects

get_pos_stat(chrm, strand, pos, missing_value=None)[source]¶: Extract statistic value from the requested genomic position.

get_reg_stats(chrm, strand, start, end)[source]¶

iter_most_signif_sites()[source]¶: Iterate through statistics table yeilding (chrm, strand, pos, stat).

iter_stat_seqs(genome_index, before_bases, after_bases, include_pos=True)[source]¶

Iterate through most significant genomic sites returning the genomic sequence surrounding each position.

Parameters

genome_index (tombo.tombo_helper.Fasta) – genome index object
before_bases (int) – number of sequence bases before positions to include
after_bases (int) – number of sequence bases after positions to include
include_pos (bool) – yeild (pos_seq, chrm, strand, start, end) for each site (default: True)

next()[source]¶: Return next statistics block from file including (chrm, strand, block start, block end and statistics table numpy structured array)

class tombo.tombo_stats.LevelStats(stats_fn, stat_type=None, region_size=None, cov_thresh=None, num_most_signif=None, most_signif_num_batches=10)[source]¶

class tombo.tombo_stats.PerReadStats(per_read_stats_fn, stat_type=None, region_size=None)[source]¶

Store and accses per-read modified base testing statistics

__init__(per_read_stats_fn, stat_type=None, region_size=None)[source]¶

Open per-read statistics file.

Examples:

per_read_stats = tombo_stats.PerReadStats(                'path/to/sample.tombo.per_read_stats')
int_data = tombo_helper.intervalData(
    chrm='chr20', start=10000, end=10100, strand='+')
reg_per_read_stats = per_read_stats.get_region_per_read_stats(
    int_data, num_reads=10)

Parameters

per_read_stats_fn (str) – filename containing (or to write) per-read Tombo statistics
stat_type (str) – type of statistic (model_compare, de_novo, or sample_compare); only applicable for new file writing
region_size (int) – size of chunked storage blocks; only applicable for new file writing

Warning

If stat_type and region_size are provided the current file’s contents will be deleted.

Intended to open a fresh PerReadStats file for writing.

close()[source]¶: Close HDF5 file

compute_ctrl_motif_stats(pr_ctrl_stats, motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]¶

Compute lists of statistic values and whether this site represents a match to the provided motifs

Parameters

pr_ctrl_stats (tombo.tombo_stats.PerReadStats) – control per-read statistics
(list; see (motif_descs) – tombo.tombo_helper.parse_motif_descs): containing tuples with tombo.tombo_helper.TomboMotif and motif/modification names
genome_index (tombo.tombo_helper.Fasta) – genome index
stats_per_block (int) – statistics to include in calculations per-block (–multiprocess-region-size)
total_stats_limit (int) – maximum total statistics to include in computation (Default: include all stats)

Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean native or control sample stats

compute_ground_truth_stats(ground_truth_locs)[source]¶

Compute lists of statistic values and ground truth modification status (boolean)

Parameters: ground_truth_locs – list containing tuples of 1) modified locations (from tombo.tombo_helper.parse_locs_file), 2) unmodified locations and mod names.
Returns: Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match

compute_motif_stats(motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]¶

Compute lists of statistic values and whether this site represents a match to the provided motifs

Parameters

(list; see (motif_descs) – tombo.tombo_helper.parse_motif_descs): containing tuples with tombo.tombo_helper.TomboMotif and motif/modification names
genome_index (tombo.tombo_helper.Fasta) – genome index
stats_per_block (int) – statistics to include in calculations per-block (–multiprocess-region-size)
total_stats_limit (int) – maximum total statistics to include in computation (Default: include all stats)

Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match

get_reg_stats(chrm, strand, start, end)[source]¶

get_region_per_read_stats(interval_data, num_reads=None)[source]¶

Extract per-read statistics over the specifed interval.

Parameters

interval_data (tombo.tombo_helper.intervalData) – genomic interval
num_reads (int) – randomly select this many reads (default: inlcude all reads)

Returns

np.array structured array containing pos, stat and read_id for per-read stats over requested interval

next()[source]¶: Return next per-read statistics block from file including (chrm, strand, block start, block end and per-read statistics table numpy structured array)

class tombo.tombo_stats.TomboModel(ref_fn=None, is_text_model=False, kmer_ref=None, central_pos=None, seq_samp_type=None, reads_index=None, fast5_fns=None, minimal_startup=True)[source]¶

Load, store and access Tombo model attributes and sequence-based: expected mean and standard deviation levels (median normalization only).

__init__(ref_fn=None, is_text_model=False, kmer_ref=None, central_pos=None, seq_samp_type=None, reads_index=None, fast5_fns=None, minimal_startup=True)[source]¶

Initialize a Tombo k-mer model object

Parameters

ref_fn (str) – tombo model filename
is_text_model (bool) – ref_fn is text (e.g. https://github.com/nanoporetech/kmer_models/blob/master/r9.4_180mv_450bps_6mer/template_median68pA.model)
kmer_ref (list) – containing 3-tuples 1) k-mer 2) expected level 3) level SD
central_pos (int) – base within k-mer to assign signal (only applicable when kmer_ref is provided)
seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type (default: None)
reads_index (tombo.tombo_helper.TomboReads) – For determining seq_samp_type
fast5_fns (list) – fast5 read filenames from which to extract read metadata. For determining seq_samp_type
minimal_startup (bool) – don’t compute inverse variances (default True)

Note

Order of priority for initialization when multiple model: specifications are provided: 1) ref_fn 2) kmer_ref (requires central_pos) 3) seq_samp_type 4) reads_index 5) fast5_fns
Last 3 options load a default model file included with Tombo. Last: 2 determine the sample type from read metadata.

get_exp_levels_from_kmers(seq_kmers)[source]¶

Look up expected signal levels for a list of k-mers

Parameters

seq_kmers (list) – list of k-mers (as returned from tombo.tombo_helper.get_seq_kmers)

Returns

Expected signal level references

ref_means (np.array::np.float64) expected signal levels
ref_sds (np.array::np.float64) expected signal level sds

get_exp_levels_from_seq(seq, rev_strand=False)[source]¶

Compute expected signal levels for a sequence

Parameters

seq (str) – genomic seqeunce to be converted to expected signal levels
rev_strand (bool) – provided sequence is from reverse strand (so flip return order to genome forward direction)

Note

Returned expected signal levels will be trimmed compared to the passed sequence based on the std_ref.kmer_width and std_ref.central_pos.

Returns: Expected signal level references 1) ref_means (np.array::np.float64) expected signal levels 2) ref_sds (np.array::np.float64) expected signal level sds

get_exp_levels_from_seq_with_gaps(reg_seq, rev_strand)[source]¶

reverse_sequence_copy()[source]¶: Return a copy of model for processing sequence/signal in reverse (default models are all saved in genome sequence forward (5p to 3p) direction)

write_model(ref_fn)[source]¶

Write TomboModel to specified file

Parameters: ref_fn (str) – filename to write TomboModel

class tombo.tombo_stats.AltModel(ref_fn=None, kmer_ref=None, central_pos=None, alt_base=None, name=None, motif=None, minimal_startup=True)[source]¶

Load, store and access alternate-base Tombo model attributes and sequence-based expected mean and standard deviation levels (median normalization only).

__init__(ref_fn=None, kmer_ref=None, central_pos=None, alt_base=None, name=None, motif=None, minimal_startup=True)[source]¶

Initialize a Tombo k-mer model object

Parameters

ref_fn (str) – tombo model filename
kmer_ref (list) – containing 3-tuples 1) k-mer 2) expected level 3) level SD
central_pos (int) – base within k-mer to assign signal (only applicable when kmer_ref is provided)
alt_base (int) – “swap” base within k-mer (only applicable when kmer_ref is provided)
name (str) – model name (only applicable when kmer_ref is provided)
motif (tombo.tombo_helper.TomboMotif) – model motif (defaults to alt_base motif; only applicable when kmer_ref is provided)
minimal_startup (bool) – don’t compute inverse variances (default True)

Note

kmer_ref, central_pos, alt_base, name and motif: are ignored if ref_fn is provided.

get_exp_level(kmer, pos)[source]¶

get_exp_levels_from_kmers(seq_kmers, rev_strand=False)[source]¶

Look up expected alternate-base signal levels across a central base. Alternative base to test should be last base in the first k-mer and first base in the last k-mer

Parameters

seq_kmers (list) – list of k-mers the same length as the k-mer

Returns

Expected signal level references

ref_means (np.array::np.float64) expected signal levels
ref_sds (np.array::np.float64) expected signal level sds

get_exp_sd(kmer, pos)[source]¶

write_model(ref_fn)[source]¶

Write AltModel to specified file

Parameters: ref_fn (str) – filename to write AltModel

tombo.tombo_stats.normalize_raw_signal(all_raw_signal, read_start_rel_to_raw=0, read_obs_len=None, norm_type='median', outlier_thresh=None, channel_info=None, scale_values=None, event_means=None, model_means=None, model_inv_vars=None, const_scale=None)[source]¶

Apply scaling and windsorizing parameters to normalize raw signal.

Parameters

all_raw_signal (np.array) – raw nanopore signal obervation values
read_start_rel_to_raw (int) – amount of signal to trim from beginning of the signal (default: 0)
read_obs_len (int) – length of signal to process from read_start_rel_to_raw (default: full length)
norm_type (str) – normalization type (median (default), none, pA_raw, pA, median_const_scale, robust_median; ignored is scale_values provided)
outlier_thresh (float) – windsorizing threshold (MAD units; default: None)
channel_info (tombo.tombo_helper.channelInfo) – channel information (optional; only for pA and pA_raw)
scale_values (tombo.tombo_helper.scaleValues) – scaling values (optional)
event_means (np.array) – for pA fitted scaling parameters (optional)
model_means (np.array) – for pA fitted scaling parameters (optional)
model_inv_vars (np.array) – for pA fitted scaling parameters (optional)
const_scale (float) – global scale parameter (optional)

Returns

Normalized signal and scaling parameters

normalized signal observations (np.array::np.float64)
tombo.tombo_helper.scaleValues

tombo.tombo_stats.compute_base_means(all_raw_signal, base_starts)[source]¶

Efficiently compute new base mean values from raw signal and base start positions

Parameters

all_raw_signal (np.array) – raw nanopore signal obervation values
base_starts (np.array::np.int32) – 0-based base start positions within raw signal

Returns

np.array::np.float64 containing base mean levels

tombo.tombo_stats.get_read_seg_score(r_means, r_ref_means, r_ref_sds)[source]¶

Compute expected to observed signal matching score

Parameters

r_means (np.array::np.float64) – observed base signal levels
r_ref_means (np.array::np.float64) – expected base signal levels
r_ref_sds (np.array::np.float64) – expected base signal level sds

Returns

Mean half z-score for observed versus expected signal levels

tombo.tombo_stats.calc_kmer_fitted_shift_scale(prev_shift, prev_scale, r_event_means, r_model_means, r_model_inv_vars=None, method='theil_sen')[source]¶

Use robust Theil-Sen estimator to compute fitted shift and scale parameters based on read sequence

Parameters

prev_shift (float) – previous shift parameter
prev_scale (float) – previous scale parameter
r_ref_means (np.array::np.float64) – expected base signal levels
r_ref_sds (np.array::np.float64) – expected base signal level sds
r_model_inv_vars (np.array::np.float64) – expected base signal level inverse variances for method of moments (mom) computation
method (str) – one of theil_sen, robust, or mom

Returns

Sequence-fitted scaling parameters

shift parameter (float)
scale parameter (float)
shift correction factor; multiply by prev_scale and add to prev_shift to get shift (float)
scale correction factor; multiply by prev_scale to get scale (float)

tombo.tombo_stats.load_resquiggle_parameters(seq_samp_type, sig_aln_params=None, seg_params=None, use_save_bandwidth=False)[source]¶

Load parameters for re-squiggle algorithm

Parameters

seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type
sig_aln_params (tuple) – signal alignment parameters (optional; default: load seq_samp_type defaults)
seg_params (tuple) – segmentation parameters (optional; default: load seq_samp_type defaults)
use_save_bandwidth (bool) – load larger “save” bandwidth

Returns

tombo.tombo_helper.resquiggleParams

tombo.tombo_stats.compute_num_events(signal_len, seq_len, mean_obs_per_event, min_event_to_seq_ratio=1.1)[source]¶

Compute number of events to find for this read

Parameters

signal_len (int) – length of raw signal
seq_len (int) – length of sequence
mean_obs_per_base (int) – mean raw observations per genome base
min_event_to_seq_ratio (float) – minimum event to sequence ratio (optional)

Returns

Number of events to find for this read