tombo package

Tombo Python API

The Tombo python API is intended to give users access to some of the lower level processing functions used throughout Tombo analysis pipelines.

Primarily this includes inspection of per-read modified base detection results and observed (re-squiggle assigned) per-read signal levels.

Note

Effort will be made to maintain this API interface introduced at Tombo version 1.4, but major structural changes to the Tombo framework may require changes to some API interface components. Such changes will be noted in github release notes where applicable.

Python API Examples

Import Tombo sub-modules:

from tombo import tombo_helper, tombo_stats, resquiggle

Extract normalized raw signal assigned to each genomic base from each read in a region:

# specify region of interest
reg_data = tombo_helper.intervalData(
    chrm='chr20', start=10000, end=10100, strand='+')

# parse Tombo index from previously re-squiggled set of reads
reads_index = tombo_helper.TomboReads(['test_data/native_reads',])
# extract reads that overlap this interval and then extract base signal
# levels from 10 randomly selected reads
reg_base_levels = reg_data.add_reads(
    reads_index).get_base_levels(num_reads=10)

Extracting per-read testing results:

sample_per_read_stats = tombo_stats.PerReadStats(
    'test_stats.alt_model.5mC.tombo.per_read_stats')
# reg_per_read_stats contains a numpy array containing per-read stats
# over all reads covering the region of interest
reg_per_read_stats = sample_per_read_stats.get_region_per_read_stats(
    reg_data)

Run standard resquiggle algorithm (may cause errors on some reads):

import h5py, mappy

# set read values
fast5_fn, reference_fn = 'path/to/read.fast5', 'genome.fasta'
fast5_data = h5py.File(fast5_fn, 'r')
seq_samp_type = tombo_helper.get_seq_sample_type(fast5_data)

# prep aligner, signal model and parameters
aligner = mappy.Aligner(reference_fn, preset=str('map-ont'), best_n=1)
std_ref = tombo_stats.TomboModel(seq_samp_type=seq_samp_type)
rsqgl_params = tombo_stats.load_resquiggle_parameters(seq_samp_type)

# extract data from FAST5
map_results = resquiggle.map_read(fast5_data, aligner, std_ref)
all_raw_signal = tombo_helper.get_raw_read_slot(fast5_data)['Signal'][:]
if seq_samp_type.rev_sig:
    all_raw_signal = all_raw_signal[::-1]
map_results = map_results._replace(raw_signal=all_raw_signal)

# run full re-squiggle
rsqgl_results = resquiggle.resquiggle_read(
    map_results, std_ref, rsqgl_params, all_raw_signal=all_raw_signal)

# or run individual steps
num_events = tombo_stats.compute_num_events(
    all_raw_signal.shape[0], len(map_results.genome_seq),
    rsqgl_params.mean_obs_per_event)
valid_cpts, norm_signal, scale_values = resquiggle.segment_signal(
    map_results, num_events, rsqgl_params)
event_means = tombo_stats.compute_base_means(norm_signal, valid_cpts)
dp_results = resquiggle.find_adaptive_base_assignment(
    valid_cpts, event_means, rsqgl_params, std_ref, map_results.genome_seq)
norm_signal = norm_signal[
    dp_results.read_start_rel_to_raw:
    dp_results.read_start_rel_to_raw + dp_results.segs[-1]]
segs = resquiggle.resolve_skipped_bases_with_raw(
    dp_results, norm_signal, rsqgl_params)

Submodules

tombo.resquiggle module

tombo.resquiggle.get_read_seq(fast5_data, bc_grp='Basecall_1D_000', bc_subgrp='BaseCalled_template', seq_samp_type=seqSampleType(name='DNA', rev_sig=False), q_score_thresh=0)[source]

Extract read sequence from the Fastq slot providing useful error messages

Parameters
  • fast5_data (tombo.tombo_helper.readData) – read information

  • bc_grp (str) – group location containing read information (optional; default: ‘Basecall_1D_000’)

  • bc_subgrp (str) – sub-group location containing read information (optional; default: ‘BaseCalled_template’)

  • seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type (default: DNA)

  • q_score_thresh (float) – basecalling mean q-score threshold (optional; default: 0/no filtering)

Returns

tombo.tombo_helper.sequenceData

tombo.resquiggle.map_read(fast5_data, aligner, std_ref, seq_samp_type=seqSampleType(name='DNA', rev_sig=False), bc_grp='Basecall_1D_000', bc_subgrp='BaseCalled_template', map_thr_buf=None, q_score_thresh=0, seq_len_rng=None)[source]

Extract read sequence from the Fastq slot providing useful error messages

Parameters
  • fast5_data (tombo.tombo_helper.readData) – read information

  • aligner (mappy.Aligner) – aligner object

  • std_ref (tombo.tombo_stats.TomboModel) – canonical model (in order to extract extended genomic sequence)

  • seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type (default: DNA)

  • bc_grp (str) – group location containing read information (optional; default: ‘Basecall_1D_000’)

  • bc_subgrp (str) – sub-group location containing read information (optional; default: ‘BaseCalled_template’)

  • map_thr_buf (mappy.ThreadBuffer) – mappy thread buffer object (optional; default: None)

  • q_score_thresh (float) – basecalling mean q-score threshold (optional; default: 0/no filtering)

  • seq_len_rng (tuple) – allowed mapped sequence length range (optional; default: None/no filtering)

Returns

tombo.tombo_helper.resquiggleResults containing valid mapping values (signal to sequence assignment attributes will be None)

tombo.resquiggle.resquiggle_read(map_res, std_ref, rsqgl_params, outlier_thresh=None, all_raw_signal=None, max_raw_cpts=200, min_event_to_seq_ratio=1.1, const_scale=None, skip_seq_scaling=False, seq_samp_type=seqSampleType(name='DNA', rev_sig=False))[source]

Identify raw signal to genome sequence assignment, using adaptive banded dynamic programming

Parameters
  • map_res (tombo.tombo_helper.resquiggleResults) – mapping results

  • std_ref (tombo.tombo_stats.TomboModel) – canonical base model

  • rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm

  • outlier_thresh (float) – windsorize signal greater than this value (optional)

  • all_raw_signal (np.array::np.int64) – raw data acquisition (DAC) current signal values (optional; default use value in map_res)

  • max_raw_cpts (int) – read will fail if more than max_raw_cpts must be found to produce a valid re-squiggle results (optional)

  • min_event_to_seq_ratio (float) – minimum event to sequence ratio (optional)

  • const_scale (float) – constant scale value (optional; may be deprecated)

  • skip_seq_scaling (bool) – skip sequence-based scaling step

  • seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type (default: DNA)

Returns

tombo.tombo_helper.resquiggleResults containing raw signal to genome sequence alignment (note that raw_signal now contains trimmed and normalized raw signal)

tombo.resquiggle.segment_signal(map_res, num_events, rsqgl_params, outlier_thresh=None, const_scale=None)[source]

Normalize and segment raw signal as defined by rsqgl_params into num_events.

Parameters
  • map_res (tombo.tombo_helper.resquiggleResults) – containing mapping results only (attributes after read_start_rel_to_raw will all be None)

  • num_events (int) – number of events to process

  • rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm

  • outlier_thresh (float) – windsorize signal greater than this value (optional)

Returns

  1. identified event positions (0-based)

  2. normalized raw signal

  3. scale values (tombo.tombo_helper.scaleValues)

tombo.resquiggle.find_adaptive_base_assignment(valid_cpts, event_means, rsqgl_params, std_ref, genome_seq, start_clip_bases=None, start_clip_params=startClipParams(bandwidth=1000, num_genome_bases=200), seq_samp_type=seqSampleType(name='DNA', rev_sig=False), reg_id=None)[source]

Align expected (from genome sequence) signal levels to observed using a dynamic programming approach with adaptive bandwidth start positions

Parameters
Returns

tombo.tombo_helper.dpResults

tombo.resquiggle.resolve_skipped_bases_with_raw(dp_res, norm_signal, rsqgl_params, max_raw_cpts=200, del_fix_window=2, max_del_fix_window=10, extra_sig_factor=1.1)[source]

Perform dynamic-programming forward pass over raw signal z-scores. For details, see https://nanoporetech.github.io/tombo/resquiggle.html#resolve-skipped-bases

Parameters
  • dp_res (tombo_helper.dpResults) – dynamic programming results

  • norm_signal (np.array::np.float64) – normalized raw siganl

  • rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm

  • max_raw_cpts (int) – maximum new changepoints to find from raw signal (optional)

  • del_fix_window (int) – initial bases to extend skipped base windows (optional)

  • max_del_fix_window (int) – max bases to extend skipped base windows (optional)

  • extra_sig_factor (float) – amount of extra signal required to perform signal space re-squiggle (optional)

Returns

np.array::np.int64 containing new deletion resolved base raw signal start positions

tombo.resquiggle.find_seq_start_in_events(event_means, r_ref_means, r_ref_sds, rsqgl_params, num_bases, num_events, seq_samp_type=None, reg_id=None)[source]

Identify most probably start of expected levels within observed events

Parameters
  • event_means (np.array::np.float64) – event normalized raw signal means

  • r_ref_means (np.array::np.float64) – expected base signal levels

  • r_ref_sds (np.array::np.float64) – expected base level SDs

  • rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm

  • num_bases (int) – number of bases to process

  • num_events (int) – number of events to process

  • reg_id (str) – debug

Returns

  1. event position (0-based) corresponding to the start of expected signal levels

  2. mean events per base identified from start of the read

tombo.resquiggle.find_static_base_assignment(event_means, r_ref_means, r_ref_sds, rsqgl_params, reg_id=None)[source]

Align expected (from genome sequence) signal levels to observed using a dynamic programming approach with static bandwidth start positions

Parameters
  • event_means (np.array::np.float64) – read base means

  • r_ref_means (np.array::np.float64) – expected base signal levels

  • r_ref_sds (np.array::np.float64) – expected base level SDs

  • rsqgl_params (tombo.tombo_helper.resquiggleParams) – parameters for the re-squiggle algorithm

Returns

np.array::np.int64 containng event to sequence mapping for full length of short read

tombo.tombo_helper module

class tombo.tombo_helper.readData[source]

Nanopore read meta-data

Example:

r_data = tombo_helper.readData(
    start=0, end=1000, filtered=False, read_start_rel_to_raw=100,
    strand='+', fn='path/to/read.fast5',
    corr_group='RawGenomeCorrected_000', rna=False,
    sig_match_score=1.0, mean_q_score=10.0)

# output is a list of readData objects
reads_index = tombo_helper.TomboReads(['test_data/native_reads',])
cs_reads = reads_index.get_cs_reads('chr20', '+')
Parameters
  • start (int) – 0-based start mapped position

  • end (int) – 0-based open interval end mapped position

  • filtered (bool) – is this read filtered?

  • read_start_rel_to_raw (int) – start (in raw signal vector) of assigned bases

  • strand (str) – read mapped strand (‘+’, ‘-‘ or None)

  • fn (str) – raw read filename.

  • corr_group (str) – –corrected-group slot specified in re-squiggle command

  • rna (bool) – is this read RNA?

  • sig_match_score (float) – signal matching score. Default: None

  • mean_q_score (float) – mean basecalling q-score. Default: None

  • read_id (str) – read identifier Default: None

class tombo.tombo_helper.intervalData(chrm, start, end, strand=None, reg_id=None, reg_text='', reads=None, seq=None)[source]

Genome/transcriptome interval information

Example:

int_data = tombo_helper.intervalData(
    chrm='chr20', start=10000, end=10100, strand='+')

Note

All intervalData functions return the object in order to allow chaining of interval commands. (e.g. int_data.add_reads(reads_index).add_seq())

__init__(chrm, start, end, strand=None, reg_id=None, reg_text='', reads=None, seq=None)[source]

Initialize a genome/transcriptome interval object

Parameters
  • chrm (str) – chromosome name.

  • start (int) – 0-based start position.

  • end (int) – 1-based (or open interval) end position.

  • strand (str) – interval strand (‘+’, ‘-‘ or None). Default: None

  • reg_id (str) – used to keep track of multiple intervals. Default: None

  • reg_text (str) – region description text. Default: ‘’

  • reads (list) – list of readData values overlapping region. Default: None

  • seq (str) – region genomic sequence. Default: None

add_reads(reads_index, require_full_span=False)[source]

Add reads overlapping this interval

Parameters
  • reads_index (tombo.tombo_helper.TomboReads) – reads index

  • require_full_span (bool) – require that reads span then entire interval (default: include all reads overlapping the interval)

add_seq(genome_index=None, error_end=True)[source]

Extract the forward strand genomic sequence for an interval from reads or genome_index if provided

Parameters

genome_index (tombo.tombo_helper.Fasta) –

copy(include_reads=True)[source]

Create a copy of this interval. Useful when adding sets of reads from multiple samples to compare.

Parameters

include_reads (bool) – include current reads slot in the new object (default: True)

expand_interval(expand_width)[source]

Expand this interval by the specified amount (effects only the start and stop attributes; NOT seq or reads)

Parameters

expand_width (int) – amount by which to expand the interval

get_base_levels(read_rows=False, num_reads=None)[source]

Extract base levels from this interval.

Parameters
  • read_rows (bool) – return array with reads as rows (default: row columns)

  • num_reads (int) – maximal number of reads to output. randomly downsample if more reads are present (default: output all reads)

Returns

np.array containing read mean levels with rows corresponding to interval position and columns to individual reads (or reverse if read_rows)

merge(other_reg)[source]

Create a copy of this interval with the reads from this interval and other_reg

Parameters

other_reg (tombo.tombo_helper.intervalData) – a second region to merge with this interval

Note

Aside from reads, all other attributes will be taken from this interval

update(**kwargs)[source]

Update slots specified in keyword arguments (kwargs)

class tombo.tombo_helper.TomboReads(fast5s_basedirs, corrected_group='RawGenomeCorrected_000', basecall_subgroups=None, for_writing=False, remove_filtered=True, sample_name=None)[source]

A set of reads with associated meta-data from re-squiggle processing

__init__(fast5s_basedirs, corrected_group='RawGenomeCorrected_000', basecall_subgroups=None, for_writing=False, remove_filtered=True, sample_name=None)[source]

Parse data from a list of re-squiggle fast5 directories

Parameters
  • fast5s_basedirs (list) – fast5 base directories, which have been processed by tombo resquiggle

  • corrected_group (str) – Analysis slot containing the re-squiggle information (optional; default: ‘RawGenomeCorrected_000’)

  • basecall_subgroups (list) – basecall subgroups (optional; default: Process all basecall subgroups)

  • for_writing (bool) – open TomboReads to write index (optional; default: Open for reading and parse re-squiggled reads)

  • remove_filtered (bool) – remove filtered reads as indicated by index file (optional; default: True)

  • sample_name (str) – for verbose output

add_read_data(chrm, strand, read_data)[source]

Add read data to the index

Parameters
  • chrm (str) – chromosome name

  • strand (str) – strand (‘+’ or ‘-‘)

  • read_data (tombo.tombo_helper.readData) – read information

get_all_cs()[source]

Get list of all (chromosome, strand) stored in index.

get_coverage(chrm, pos, strand=None, invalid_return=0)[source]

Get coverage at specified position

Parameters
  • chrm (str) – chromosome name

  • pos (int) – 0-based genomic position

  • strand (str) – interval strand (‘+’, ‘-‘ or None). Default: None (max over both strands)

  • invalid__return – return value for invalid (bad chrm, strand or pos beyond coverage) position. Default: 0

get_cs_coverage(chrm, strand, invalid_return=None)[source]

Extract coverage levels over a specified chromosome and strand

Parameters
  • chrm (str) – chromosome name

  • strand (str) – strand (‘+’ or ‘-‘)

  • invalid_return – value to return for invalid (chrm, strand)

Returns

numpy array (numpy.int64) containing read coverage levels

get_cs_reads(chrm, strand, invalid_return=[])[source]

Extract the list of reads stored in a specified chromosome and strand

Parameters
  • chrm (str) – chromosome name

  • strand (str) – strand (‘+’ or ‘-‘)

  • invalid_return – value to return for invalid (chrm, strand)

is_empty()[source]

Are any reads stored in the index?

iter_cov_regs(cov_thresh, region_size=None, ctrl_reads_index=None)[source]

Iterate over regions with coverage greater than or equal to cov_thresh.

If region_size is provided, regions are rounded to the nearest region_sized windows and only the region start is yielded (e.g. chrm, strand, start). If not provided (chrm, strand, start, end) is yielded.

iter_coverage_regions(ctrl_reads_index=None)[source]

Get genome coverage for a set of reads

Parameters

ctrl_reads_index (tombo.tombo_helper.TomboReads) – a second set of tombo reads to add for coverage

Returns

Yeilds (chrm, strand, cs_cov, cs_cov_starts) indicating read coverage levels

cs_cov and cs_cov_starts are numpy arrays containing the coverage level and the 0-based genomic start positions of those intervals

iter_cs_coverage()[source]

Iterate over coverage levels across all stored (chrm, strands)

iter_reads()[source]

Iterate over reads stored in the index

next()[source]
replace_index(new_reads_index)[source]

Replace current reads index

Parameters

new_reads_index (dict) – dictionary with (chrm, strand) pointing to lists of tombo.tombo_helper.readData objects

write_index_file()[source]

Write index file

Note

Must be an index from only a single fast5s_dir

Index filename will be: .[fast5s_dir].[corr_grp].tombo.index Note that this is a hidden file (starts with a “.”)

class tombo.tombo_helper.TomboMotif(raw_motif, mod_pos=None)[source]

Description of a sequence motif, including potentially modified position

raw_motif

input raw motif string

Type

str

motif_len

length of motif

Type

int

motif_pat

python regular expression for this motif

Type

re.compile

rev_comp_pat

python regular expression for reverse complement of this motif

Type

re.compile

is_palindrome

is the motif palindromic (in the genomic not literary definition)

Type

bool

mod_pos

modified base position within motif

Type

int

mod_base

modified base (should generally be A, C, G or T)

Type

str

__init__(raw_motif, mod_pos=None)[source]

Parse string motif

Parameters
  • raw_motif (str) – sequence motif. supports IUPAC single letter codes (use T for RNA).

  • mod_pos (int) – 1-based position of modified base within the motif

find_mod_poss(seq)[source]

Find all mod-base positions within the sequence.

Including partial matches at beginning and end that include mod_pos.

matches_seq(seq)[source]

Does the motif match provided sequence (including mod_pos within seq)?

Including partial matches at beginning and end that include mod_pos.

tombo.tombo_helper.parse_motif_descs(stat_motif_descs)[source]

Parse string motif descriptions as defined by tombo plot roc --motif-descriptions

Parameters

stat_motif_descs (str) – string motifs description (see tombo plot roc --motif-descriptions)

Returns

list of tuples with tombo.tombo_helper.TomboMotif and motif/modification names

class tombo.tombo_helper.resquiggleParams[source]

Re-squiggle parameters

Parameters
  • match_evalue (float) – expected value for matching event to sequence

  • skip_pen (float) – penalty for skipped sequence position

  • bandwidth (int) – adaptive bandwidth

  • max_half_z_score (float) – windsorize half z-scores above this value

  • running_stat_width (int) – running neighboring window width for segmentation scoring

  • min_obs_per_base (int) – minimum observations per genomic base

  • mean_obs_per_event (int) – mean number of raw obs. per event during segmentation

  • z_shift (float) – amount to shift z-scores for DP (derived from match_evalue)

  • stay_pen (float) – stay penalty for DP (derived from match_evalue)

  • use_t_test_seg (bool) – use t-test segmentation criterion (default: raw neighboring window difference)

  • band_bound_thresh (int) – bandwidth boundary threshold for determining if a read potentially left the adaptive bandwidth

  • start_bw (int) – bandwidth for read start identification

  • start_save_bw (int) – save bandwidth for read start identification (if start_bw_fails)

  • start_n_bases (int) – number of genomic bases to use for read start identification

class tombo.tombo_helper.startClipParams[source]

Parameters to identify read start using bases clipped from mapping

Parameters
  • bandwidth (int) – bandwidth

  • num_genome_bases (int) – number of genome bases

class tombo.tombo_helper.resquiggleResults[source]

Re-squiggle results

Parameters
  • align_info (tombo.tombo_helper.alignInfo) – read alignment information

  • genome_loc (tombo.tombo_helper.genomeLocation) – genome mapping location

  • genome_seq (str) – mapped genome sequence

  • mean_q_score (float) – mean basecalling q-score

  • raw_signal (np.array::np.float64) – raw signal (i.e. un-segmented; signal may be normalized) (optional)

  • channel_info (tombo.tombo_helper.channelInfo) – channel information (optional)

  • read_start_rel_to_raw (int) – read start within raw signal (optional)

  • segs (np.array::np.int64) – relative raw signal segment positions (optional)

  • scale_values (tombo.tombo_helper.scaleValues) – signal normalization scale values (optional)

  • sig_match_score (float) – expected to observed signal score (see tombo.tombo_stats.get_read_seg_score; optional)

  • norm_params_changed (bool) – were scale parameters updated (optional)

  • start_clip_bases (str) – mapping clipped bases from start of read (optional)

  • stall_ints (list) – list of rna stall locations within raw signal (optional)

class tombo.tombo_helper.alignInfo[source]

Information from genomic read alignment

Parameters
  • ID (str) – read identifier

  • Subgroup (str) – sub-group location containing read information (e.g. ‘BaseCalled_template’)

  • ClipStart (int) – number of bases clipped from beginning of basecalls

  • ClipEnd (int) – number of bases clipped from end of basecalls

  • Insertions (int) – number of inserted bases in alignment

  • Deletions (int) – number of delected bases in alignment

  • Matches (int) – number of matched bases in alignment

  • Mismatches (int) – number of mis-matched bases in alignment

class tombo.tombo_helper.genomeLocation[source]

Genomic location

Parameters
  • Start (int) – 0-based genomic start position

  • Strand (str) – Strand (should be ‘+’ or ‘-‘)

  • Chrom (str) – Chromosome

class tombo.tombo_helper.sequenceData[source]

Read sequence data from FASTQ record

Parameters
  • seq (str) – read sequence

  • id (str) – read id (extracted from FASTQ record line)

  • mean_q_score (float) – mean q-score

class tombo.tombo_helper.dpResults[source]

Dynamic programming results

Parameters
  • read_start_rel_to_raw (int) – read start within raw signal

  • segs (np.array::np.int64) – relative raw signal segment positions

  • ref_means (np.array::np.float64) – expected signal levels

  • ref_sds (np.array::np.float64) – expected SD of signal levels

  • genome_seq (str) – mapped genome sequence

class tombo.tombo_helper.scaleValues[source]

Signal normaliztion scaling parameters. For details see https://nanoporetech.github.io/tombo/resquiggle.html#signal-normalization

Parameters
  • shift (float) – shift scaling parameter

  • scale (float) – scale scaling parameter

  • lower_lim (float) – lower windsorizing threshold

  • upper_lim (float) – upper windsorizing threshold

  • outlier_thresh (float) – outlier threshold used to define lower_lim and upper_lim

class tombo.tombo_helper.Fasta(fasta_fn, dry_run=False, force_in_mem=False, assume_dna_base=False)[source]

Fasta file sequence format wrapper class. Will load faidx via pyfaidx package if installed, else the fasta will be loaded into memory for sequence extraction.

__init__(fasta_fn, dry_run=False, force_in_mem=False, assume_dna_base=False)[source]

Load a fasta

Parameters
  • fasta_fn (str) – path to fasta file

  • dry_run (bool) – when pyfaidx is not installed, don’t actually read sequence into memory.

  • force_in_mem (bool) – force genome to be loaded into memory even if pyfaidx is installed allowing on-disk access

  • assume_dna_bases (bool) – skip check for DNA or RNA bases (default: False)

get_seq(chrm, start=None, end=None, error_end=True)[source]

Extract sequence from a specific genomic region. Note if provided, start and end must both be provided or they will be ignored.

Parameters
  • chrm (str) – chromosome name

  • start (int) – 0-based start position

  • end (int) – 0-based open-interval end position

  • error_end (bool) – raise an error when the region requested extends beyond the chromosome (default: True)

Returns

Genomic sequence requested. Sequence is converted to RNA sequence if applicable.

iter_chrms()[source]

Iterate over chromosome names

class tombo.tombo_helper.seqSampleType[source]

Description of a sequencing sample type

Parameters
  • name (str) – name of the sequencing sample

  • rev_sig (bool) – Is the raw signal reversed (3’ to 5’)

tombo.tombo_helper.get_seq_sample_type(fast5_data=None, reads_index=None, fast5_fns=None, num_reads=50)[source]

Get the sequencing sample type from a single read or set of reads

Parameters
  • fast5_data (h5py.File) – open read h5py File object

  • reads_index (tombo.tombo_helper.TomboReads) –

  • fast5_fns (list) – FAST5 read filenames

  • num_reads (int) – sample of reads to check for sample type

tombo.tombo_helper.get_raw_read_slot(fast5_data)[source]

Get the raw read slot from this FAST5 read file

Parameters

fast5_data (h5py.File) – open FAST5 read file object

Example:

all_aw_signal = get_raw_read_slot(fast5_data)['Signal'][:]
Returns

The HDF5 group slot containing the raw signal data.

tombo.tombo_helper.get_single_slot_read_centric(r_data, slot_name, corr_grp=None)[source]

Extract read-centric slot_name from this read’s Events table

Parameters
  • fast5_data (tombo.tombo_helper.readData or an open h5py.File object) –

  • slot_name (str) – slot from which to extract data (valid values: norm_mean, norm_stdev, start, length, base)

  • corr_grp (str) – corrected group slot from which to extract data (default: use value from fast5_data object; required for h5py.File)

tombo.tombo_helper.get_multiple_slots_read_centric(r_data, slot_names, corr_grp=None)[source]

Extract multiple read-centric slot_names from this read’s Events table

Parameters
  • fast5_data (tombo.tombo_helper.readData or an open h5py.File object) –

  • slot_name (str) – slot from which to extract data (valid values: norm_mean, norm_stdev, start, length, base)

  • corr_grp (str) – corrected group slot from which to extract data (default: use value from fast5_data object; required for h5py.File)

Returns

A tuple of numpy arrays specified by the slot_names

tombo.tombo_stats module

tombo.tombo_stats.TomboStats(stat_fn)[source]

Load per-reference location Tombo statistics. Safe method to load statistics for read-only purposes.

Returns

Either tombo.tombo_stats.ModelStats or tombo.tombo_stats.LevelStats depending on the type of Tombo statistics file provided.

class tombo.tombo_stats.ModelStats(stats_fn, stat_type=None, region_size=None, cov_damp_counts=None, cov_thresh=None, num_most_signif=None, most_signif_num_batches=10)[source]

Parse and retrieve relevant information from a standard (per-genomic base) Tombo statistics file.

__init__(stats_fn, stat_type=None, region_size=None, cov_damp_counts=None, cov_thresh=None, num_most_signif=None, most_signif_num_batches=10)[source]

Parse or open for writing a standard (per-genomic base) Tombo statistics file.

Example:

stats = tombo_stats.TomboStats('path/to/stats.file')
for chrm, strand, pos, frac, damp_frac, valid_cov in                     stats.iter_most_signif_sites():
    # do stuff
Parameters
  • stats_fn (str) – filename for previously saved tombo stats

  • stat_type (str) – type of statistic (model_compare, de_novo, or sample_compare); only applicable for new file writing

  • region_size (int) – size of chunked storage blocks; only applicable for new file writing

  • cov_damp_counts (tuple) – pseudo-counts for modified and un-modified reads to compute damp_frac

  • cov_thresh (int) – only sites with coverage greater than or equal to this value will be stored

  • num_most_signif (int) – number of most significant sites to be stored for faster access

  • most_signif_num_batches (int) – number of region batches to store before re-computing the most significant array (default: 10)

Warning

If all arguments are provided the current file’s contents will be

deleted.

Intended to open a fresh ModelStats file for writing.

close()[source]

Close open HDF5 file and write most significant sites if open for writing

compute_ctrl_motif_stats(ctrl_stats, motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]

Compute lists of statistic values and whether this site represents a match to the provided motifs

Parameters
Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean native or control sample stats

compute_ground_truth_stats(ground_truth_locs)[source]

Compute lists of statistic values and ground truth modification status (boolean)

Parameters

ground_truth_locs – list containing tuples of 1) modified locations (from tombo.tombo_helper.parse_locs_file), 2) unmodified locations and mod names.

Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match

compute_motif_stats(motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]

Compute lists of statistic values and whether this site represents a match to the provided motifs

Parameters
Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match

get_most_signif_regions(num_bases, num_regions, unique_pos=True, prepend_loc_to_text=False)[source]

Select regions centered on locations with the largest fraction of modified bases

Parameters
  • num_bases (int) – number of bases to output

  • num_regions (int) – number of regions to output

  • unique_pos (bool) – get only unique positions (optional; default True) intervals may overlap, but identified significant position is outside other intervals

  • prepend_loc_to_text (bool) – pre-prend most significant location to the region text (can be off for interval near start/end of sequence records)

Returns

A list of tombo.tombo_helper.intervalData objects

get_pos_stat(chrm, strand, pos, missing_value=None)[source]

Extract statistic value from the requested genomic position.

get_reg_stats(chrm, strand, start, end)[source]
iter_most_signif_sites()[source]

Iterate through statistics table yeilding (chrm, strand, pos, stat).

iter_stat_seqs(genome_index, before_bases, after_bases, include_pos=True)[source]

Iterate through most significant genomic sites returning the genomic sequence surrounding each position.

Parameters
  • genome_index (tombo.tombo_helper.Fasta) – genome index object

  • before_bases (int) – number of sequence bases before positions to include

  • after_bases (int) – number of sequence bases after positions to include

  • include_pos (bool) – yeild (pos_seq, chrm, strand, start, end) for each site (default: True)

next()[source]

Return next statistics block from file including (chrm, strand, block start, block end and statistics table numpy structured array)

class tombo.tombo_stats.LevelStats(stats_fn, stat_type=None, region_size=None, cov_thresh=None, num_most_signif=None, most_signif_num_batches=10)[source]
class tombo.tombo_stats.PerReadStats(per_read_stats_fn, stat_type=None, region_size=None)[source]

Store and accses per-read modified base testing statistics

__init__(per_read_stats_fn, stat_type=None, region_size=None)[source]

Open per-read statistics file.

Examples:

per_read_stats = tombo_stats.PerReadStats(                'path/to/sample.tombo.per_read_stats')
int_data = tombo_helper.intervalData(
    chrm='chr20', start=10000, end=10100, strand='+')
reg_per_read_stats = per_read_stats.get_region_per_read_stats(
    int_data, num_reads=10)
Parameters
  • per_read_stats_fn (str) – filename containing (or to write) per-read Tombo statistics

  • stat_type (str) – type of statistic (model_compare, de_novo, or sample_compare); only applicable for new file writing

  • region_size (int) – size of chunked storage blocks; only applicable for new file writing

Warning

If stat_type and region_size are provided the current file’s contents will be deleted.

Intended to open a fresh PerReadStats file for writing.

close()[source]

Close HDF5 file

compute_ctrl_motif_stats(pr_ctrl_stats, motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]

Compute lists of statistic values and whether this site represents a match to the provided motifs

Parameters
Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean native or control sample stats

compute_ground_truth_stats(ground_truth_locs)[source]

Compute lists of statistic values and ground truth modification status (boolean)

Parameters

ground_truth_locs – list containing tuples of 1) modified locations (from tombo.tombo_helper.parse_locs_file), 2) unmodified locations and mod names.

Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match

compute_motif_stats(motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]

Compute lists of statistic values and whether this site represents a match to the provided motifs

Parameters
Returns

Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match

get_reg_stats(chrm, strand, start, end)[source]
get_region_per_read_stats(interval_data, num_reads=None)[source]

Extract per-read statistics over the specifed interval.

Parameters
Returns

np.array structured array containing pos, stat and read_id for per-read stats over requested interval

next()[source]

Return next per-read statistics block from file including (chrm, strand, block start, block end and per-read statistics table numpy structured array)

class tombo.tombo_stats.TomboModel(ref_fn=None, is_text_model=False, kmer_ref=None, central_pos=None, seq_samp_type=None, reads_index=None, fast5_fns=None, minimal_startup=True)[source]
Load, store and access Tombo model attributes and sequence-based

expected mean and standard deviation levels (median normalization only).

__init__(ref_fn=None, is_text_model=False, kmer_ref=None, central_pos=None, seq_samp_type=None, reads_index=None, fast5_fns=None, minimal_startup=True)[source]

Initialize a Tombo k-mer model object

Parameters

Note

Order of priority for initialization when multiple model

specifications are provided: 1) ref_fn 2) kmer_ref (requires central_pos) 3) seq_samp_type 4) reads_index 5) fast5_fns

Last 3 options load a default model file included with Tombo. Last

2 determine the sample type from read metadata.

get_exp_levels_from_kmers(seq_kmers)[source]

Look up expected signal levels for a list of k-mers

Parameters

seq_kmers (list) – list of k-mers (as returned from tombo.tombo_helper.get_seq_kmers)

Returns

Expected signal level references

  1. ref_means (np.array::np.float64) expected signal levels

  2. ref_sds (np.array::np.float64) expected signal level sds

get_exp_levels_from_seq(seq, rev_strand=False)[source]

Compute expected signal levels for a sequence

Parameters
  • seq (str) – genomic seqeunce to be converted to expected signal levels

  • rev_strand (bool) – provided sequence is from reverse strand (so flip return order to genome forward direction)

Note

Returned expected signal levels will be trimmed compared to the passed sequence based on the std_ref.kmer_width and std_ref.central_pos.

Returns

Expected signal level references 1) ref_means (np.array::np.float64) expected signal levels 2) ref_sds (np.array::np.float64) expected signal level sds

get_exp_levels_from_seq_with_gaps(reg_seq, rev_strand)[source]
reverse_sequence_copy()[source]

Return a copy of model for processing sequence/signal in reverse (default models are all saved in genome sequence forward (5p to 3p) direction)

write_model(ref_fn)[source]

Write TomboModel to specified file

Parameters

ref_fn (str) – filename to write TomboModel

class tombo.tombo_stats.AltModel(ref_fn=None, kmer_ref=None, central_pos=None, alt_base=None, name=None, motif=None, minimal_startup=True)[source]

Load, store and access alternate-base Tombo model attributes and sequence-based expected mean and standard deviation levels (median normalization only).

__init__(ref_fn=None, kmer_ref=None, central_pos=None, alt_base=None, name=None, motif=None, minimal_startup=True)[source]

Initialize a Tombo k-mer model object

Parameters
  • ref_fn (str) – tombo model filename

  • kmer_ref (list) – containing 3-tuples 1) k-mer 2) expected level 3) level SD

  • central_pos (int) – base within k-mer to assign signal (only applicable when kmer_ref is provided)

  • alt_base (int) – “swap” base within k-mer (only applicable when kmer_ref is provided)

  • name (str) – model name (only applicable when kmer_ref is provided)

  • motif (tombo.tombo_helper.TomboMotif) – model motif (defaults to alt_base motif; only applicable when kmer_ref is provided)

  • minimal_startup (bool) – don’t compute inverse variances (default True)

Note

kmer_ref, central_pos, alt_base, name and motif

are ignored if ref_fn is provided.

get_exp_level(kmer, pos)[source]
get_exp_levels_from_kmers(seq_kmers, rev_strand=False)[source]

Look up expected alternate-base signal levels across a central base. Alternative base to test should be last base in the first k-mer and first base in the last k-mer

Parameters

seq_kmers (list) – list of k-mers the same length as the k-mer

Returns

Expected signal level references

  1. ref_means (np.array::np.float64) expected signal levels

  2. ref_sds (np.array::np.float64) expected signal level sds

get_exp_sd(kmer, pos)[source]
write_model(ref_fn)[source]

Write AltModel to specified file

Parameters

ref_fn (str) – filename to write AltModel

tombo.tombo_stats.normalize_raw_signal(all_raw_signal, read_start_rel_to_raw=0, read_obs_len=None, norm_type='median', outlier_thresh=None, channel_info=None, scale_values=None, event_means=None, model_means=None, model_inv_vars=None, const_scale=None)[source]

Apply scaling and windsorizing parameters to normalize raw signal.

Parameters
  • all_raw_signal (np.array) – raw nanopore signal obervation values

  • read_start_rel_to_raw (int) – amount of signal to trim from beginning of the signal (default: 0)

  • read_obs_len (int) – length of signal to process from read_start_rel_to_raw (default: full length)

  • norm_type (str) – normalization type (median (default), none, pA_raw, pA, median_const_scale, robust_median; ignored is scale_values provided)

  • outlier_thresh (float) – windsorizing threshold (MAD units; default: None)

  • channel_info (tombo.tombo_helper.channelInfo) – channel information (optional; only for pA and pA_raw)

  • scale_values (tombo.tombo_helper.scaleValues) – scaling values (optional)

  • event_means (np.array) – for pA fitted scaling parameters (optional)

  • model_means (np.array) – for pA fitted scaling parameters (optional)

  • model_inv_vars (np.array) – for pA fitted scaling parameters (optional)

  • const_scale (float) – global scale parameter (optional)

Returns

Normalized signal and scaling parameters

  1. normalized signal observations (np.array::np.float64)

  2. tombo.tombo_helper.scaleValues

tombo.tombo_stats.compute_base_means(all_raw_signal, base_starts)[source]

Efficiently compute new base mean values from raw signal and base start positions

Parameters
  • all_raw_signal (np.array) – raw nanopore signal obervation values

  • base_starts (np.array::np.int32) – 0-based base start positions within raw signal

Returns

np.array::np.float64 containing base mean levels

tombo.tombo_stats.get_read_seg_score(r_means, r_ref_means, r_ref_sds)[source]

Compute expected to observed signal matching score

Parameters
  • r_means (np.array::np.float64) – observed base signal levels

  • r_ref_means (np.array::np.float64) – expected base signal levels

  • r_ref_sds (np.array::np.float64) – expected base signal level sds

Returns

Mean half z-score for observed versus expected signal levels

tombo.tombo_stats.calc_kmer_fitted_shift_scale(prev_shift, prev_scale, r_event_means, r_model_means, r_model_inv_vars=None, method='theil_sen')[source]

Use robust Theil-Sen estimator to compute fitted shift and scale parameters based on read sequence

Parameters
  • prev_shift (float) – previous shift parameter

  • prev_scale (float) – previous scale parameter

  • r_ref_means (np.array::np.float64) – expected base signal levels

  • r_ref_sds (np.array::np.float64) – expected base signal level sds

  • r_model_inv_vars (np.array::np.float64) – expected base signal level inverse variances for method of moments (mom) computation

  • method (str) – one of theil_sen, robust, or mom

Returns

Sequence-fitted scaling parameters

  1. shift parameter (float)

  2. scale parameter (float)

  3. shift correction factor; multiply by prev_scale and add to prev_shift to get shift (float)

  4. scale correction factor; multiply by prev_scale to get scale (float)

tombo.tombo_stats.load_resquiggle_parameters(seq_samp_type, sig_aln_params=None, seg_params=None, use_save_bandwidth=False)[source]

Load parameters for re-squiggle algorithm

Parameters
  • seq_samp_type (tombo.tombo_helper.seqSampleType) – sequencing sample type

  • sig_aln_params (tuple) – signal alignment parameters (optional; default: load seq_samp_type defaults)

  • seg_params (tuple) – segmentation parameters (optional; default: load seq_samp_type defaults)

  • use_save_bandwidth (bool) – load larger “save” bandwidth

Returns

tombo.tombo_helper.resquiggleParams

tombo.tombo_stats.compute_num_events(signal_len, seq_len, mean_obs_per_event, min_event_to_seq_ratio=1.1)[source]

Compute number of events to find for this read

Parameters
  • signal_len (int) – length of raw signal

  • seq_len (int) – length of sequence

  • mean_obs_per_base (int) – mean raw observations per genome base

  • min_event_to_seq_ratio (float) – minimum event to sequence ratio (optional)

Returns

Number of events to find for this read