tombo package¶
Tombo Python API¶
The Tombo python API is intended to give users access to some of the lower level processing functions used throughout Tombo analysis pipelines.
Primarily this includes inspection of per-read modified base detection results and observed (re-squiggle assigned) per-read signal levels.
Note
Effort will be made to maintain this API interface introduced at Tombo version 1.4, but major structural changes to the Tombo framework may require changes to some API interface components. Such changes will be noted in github release notes where applicable.
Python API Examples¶
Import Tombo sub-modules:
from tombo import tombo_helper, tombo_stats, resquiggle
Extract normalized raw signal assigned to each genomic base from each read in a region:
# specify region of interest
reg_data = tombo_helper.intervalData(
chrm='chr20', start=10000, end=10100, strand='+')
# parse Tombo index from previously re-squiggled set of reads
reads_index = tombo_helper.TomboReads(['test_data/native_reads',])
# extract reads that overlap this interval and then extract base signal
# levels from 10 randomly selected reads
reg_base_levels = reg_data.add_reads(
reads_index).get_base_levels(num_reads=10)
Extracting per-read testing results:
sample_per_read_stats = tombo_stats.PerReadStats(
'test_stats.alt_model.5mC.tombo.per_read_stats')
# reg_per_read_stats contains a numpy array containing per-read stats
# over all reads covering the region of interest
reg_per_read_stats = sample_per_read_stats.get_region_per_read_stats(
reg_data)
Run standard resquiggle algorithm (may cause errors on some reads):
import h5py, mappy
# set read values
fast5_fn, reference_fn = 'path/to/read.fast5', 'genome.fasta'
fast5_data = h5py.File(fast5_fn, 'r')
seq_samp_type = tombo_helper.get_seq_sample_type(fast5_data)
# prep aligner, signal model and parameters
aligner = mappy.Aligner(reference_fn, preset=str('map-ont'), best_n=1)
std_ref = tombo_stats.TomboModel(seq_samp_type=seq_samp_type)
rsqgl_params = tombo_stats.load_resquiggle_parameters(seq_samp_type)
# extract data from FAST5
map_results = resquiggle.map_read(fast5_data, aligner, std_ref)
all_raw_signal = tombo_helper.get_raw_read_slot(fast5_data)['Signal'][:]
if seq_samp_type.rev_sig:
all_raw_signal = all_raw_signal[::-1]
map_results = map_results._replace(raw_signal=all_raw_signal)
# run full re-squiggle
rsqgl_results = resquiggle.resquiggle_read(
map_results, std_ref, rsqgl_params, all_raw_signal=all_raw_signal)
# or run individual steps
num_events = tombo_stats.compute_num_events(
all_raw_signal.shape[0], len(map_results.genome_seq),
rsqgl_params.mean_obs_per_event)
valid_cpts, norm_signal, scale_values = resquiggle.segment_signal(
map_results, num_events, rsqgl_params)
event_means = tombo_stats.compute_base_means(norm_signal, valid_cpts)
dp_results = resquiggle.find_adaptive_base_assignment(
valid_cpts, event_means, rsqgl_params, std_ref, map_results.genome_seq)
norm_signal = norm_signal[
dp_results.read_start_rel_to_raw:
dp_results.read_start_rel_to_raw + dp_results.segs[-1]]
segs = resquiggle.resolve_skipped_bases_with_raw(
dp_results, norm_signal, rsqgl_params)
Submodules¶
tombo.resquiggle module¶
-
tombo.resquiggle.
get_read_seq
(fast5_data, bc_grp='Basecall_1D_000', bc_subgrp='BaseCalled_template', seq_samp_type=seqSampleType(name='DNA', rev_sig=False), q_score_thresh=0)[source]¶ Extract read sequence from the Fastq slot providing useful error messages
- Parameters
fast5_data (
tombo.tombo_helper.readData
) – read informationbc_grp (str) – group location containing read information (optional; default: ‘Basecall_1D_000’)
bc_subgrp (str) – sub-group location containing read information (optional; default: ‘BaseCalled_template’)
seq_samp_type (
tombo.tombo_helper.seqSampleType
) – sequencing sample type (default: DNA)q_score_thresh (float) – basecalling mean q-score threshold (optional; default: 0/no filtering)
- Returns
-
tombo.resquiggle.
map_read
(fast5_data, aligner, std_ref, seq_samp_type=seqSampleType(name='DNA', rev_sig=False), bc_grp='Basecall_1D_000', bc_subgrp='BaseCalled_template', map_thr_buf=None, q_score_thresh=0, seq_len_rng=None)[source]¶ Extract read sequence from the Fastq slot providing useful error messages
- Parameters
fast5_data (
tombo.tombo_helper.readData
) – read informationaligner (mappy.Aligner) – aligner object
std_ref (
tombo.tombo_stats.TomboModel
) – canonical model (in order to extract extended genomic sequence)seq_samp_type (
tombo.tombo_helper.seqSampleType
) – sequencing sample type (default: DNA)bc_grp (str) – group location containing read information (optional; default: ‘Basecall_1D_000’)
bc_subgrp (str) – sub-group location containing read information (optional; default: ‘BaseCalled_template’)
map_thr_buf (mappy.ThreadBuffer) – mappy thread buffer object (optional; default: None)
q_score_thresh (float) – basecalling mean q-score threshold (optional; default: 0/no filtering)
seq_len_rng (tuple) – allowed mapped sequence length range (optional; default: None/no filtering)
- Returns
tombo.tombo_helper.resquiggleResults
containing valid mapping values (signal to sequence assignment attributes will beNone
)
-
tombo.resquiggle.
resquiggle_read
(map_res, std_ref, rsqgl_params, outlier_thresh=None, all_raw_signal=None, max_raw_cpts=200, min_event_to_seq_ratio=1.1, const_scale=None, skip_seq_scaling=False, seq_samp_type=seqSampleType(name='DNA', rev_sig=False))[source]¶ Identify raw signal to genome sequence assignment, using adaptive banded dynamic programming
- Parameters
map_res (
tombo.tombo_helper.resquiggleResults
) – mapping resultsstd_ref (
tombo.tombo_stats.TomboModel
) – canonical base modelrsqgl_params (
tombo.tombo_helper.resquiggleParams
) – parameters for the re-squiggle algorithmoutlier_thresh (float) – windsorize signal greater than this value (optional)
all_raw_signal (np.array::np.int64) – raw data acquisition (DAC) current signal values (optional; default use value in map_res)
max_raw_cpts (int) – read will fail if more than max_raw_cpts must be found to produce a valid re-squiggle results (optional)
min_event_to_seq_ratio (float) – minimum event to sequence ratio (optional)
const_scale (float) – constant scale value (optional; may be deprecated)
skip_seq_scaling (bool) – skip sequence-based scaling step
seq_samp_type (
tombo.tombo_helper.seqSampleType
) – sequencing sample type (default: DNA)
- Returns
tombo.tombo_helper.resquiggleResults
containing raw signal to genome sequence alignment (note thatraw_signal
now contains trimmed and normalized raw signal)
-
tombo.resquiggle.
segment_signal
(map_res, num_events, rsqgl_params, outlier_thresh=None, const_scale=None)[source]¶ Normalize and segment raw signal as defined by rsqgl_params into num_events.
- Parameters
map_res (
tombo.tombo_helper.resquiggleResults
) – containing mapping results only (attributes afterread_start_rel_to_raw
will all beNone
)num_events (int) – number of events to process
rsqgl_params (
tombo.tombo_helper.resquiggleParams
) – parameters for the re-squiggle algorithmoutlier_thresh (float) – windsorize signal greater than this value (optional)
- Returns
identified event positions (0-based)
normalized raw signal
scale values (
tombo.tombo_helper.scaleValues
)
-
tombo.resquiggle.
find_adaptive_base_assignment
(valid_cpts, event_means, rsqgl_params, std_ref, genome_seq, start_clip_bases=None, start_clip_params=startClipParams(bandwidth=1000, num_genome_bases=200), seq_samp_type=seqSampleType(name='DNA', rev_sig=False), reg_id=None)[source]¶ Align expected (from genome sequence) signal levels to observed using a dynamic programming approach with adaptive bandwidth start positions
- Parameters
valid_cpts (np.array::np.int64) – raw signal base start locations
event_means (np.array::np.float64) – event normalized raw signal means
rsqgl_params (
tombo.tombo_helper.resquiggleParams
) – parameters for the re-squiggle algorithmstd_ref (
tombo.tombo_stats.TomboModel
) – canonical modelgenome_seq (str) – genome sequence (from mapping)
start_clip_bases (str) – mapping read clipped bases
start_clip_params (
tombo.tombo_helper.startClipParams
) – start clip basecall paramsseq_samp_type (
tombo.tombo_helper.seqSampleType
) – sequencing sample type (default: DNA)reg_id (str) – debug
- Returns
-
tombo.resquiggle.
resolve_skipped_bases_with_raw
(dp_res, norm_signal, rsqgl_params, max_raw_cpts=200, del_fix_window=2, max_del_fix_window=10, extra_sig_factor=1.1)[source]¶ Perform dynamic-programming forward pass over raw signal z-scores. For details, see https://nanoporetech.github.io/tombo/resquiggle.html#resolve-skipped-bases
- Parameters
dp_res (
tombo_helper.dpResults
) – dynamic programming resultsnorm_signal (np.array::np.float64) – normalized raw siganl
rsqgl_params (
tombo.tombo_helper.resquiggleParams
) – parameters for the re-squiggle algorithmmax_raw_cpts (int) – maximum new changepoints to find from raw signal (optional)
del_fix_window (int) – initial bases to extend skipped base windows (optional)
max_del_fix_window (int) – max bases to extend skipped base windows (optional)
extra_sig_factor (float) – amount of extra signal required to perform signal space re-squiggle (optional)
- Returns
np.array::np.int64
containing new deletion resolved base raw signal start positions
-
tombo.resquiggle.
find_seq_start_in_events
(event_means, r_ref_means, r_ref_sds, rsqgl_params, num_bases, num_events, seq_samp_type=None, reg_id=None)[source]¶ Identify most probably start of expected levels within observed events
- Parameters
event_means (np.array::np.float64) – event normalized raw signal means
r_ref_means (np.array::np.float64) – expected base signal levels
r_ref_sds (np.array::np.float64) – expected base level SDs
rsqgl_params (
tombo.tombo_helper.resquiggleParams
) – parameters for the re-squiggle algorithmnum_bases (int) – number of bases to process
num_events (int) – number of events to process
reg_id (str) – debug
- Returns
event position (0-based) corresponding to the start of expected signal levels
mean events per base identified from start of the read
-
tombo.resquiggle.
find_static_base_assignment
(event_means, r_ref_means, r_ref_sds, rsqgl_params, reg_id=None)[source]¶ Align expected (from genome sequence) signal levels to observed using a dynamic programming approach with static bandwidth start positions
- Parameters
event_means (np.array::np.float64) – read base means
r_ref_means (np.array::np.float64) – expected base signal levels
r_ref_sds (np.array::np.float64) – expected base level SDs
rsqgl_params (
tombo.tombo_helper.resquiggleParams
) – parameters for the re-squiggle algorithm
- Returns
np.array::np.int64 containng event to sequence mapping for full length of short read
tombo.tombo_helper module¶
-
class
tombo.tombo_helper.
readData
[source]¶ Nanopore read meta-data
Example:
r_data = tombo_helper.readData( start=0, end=1000, filtered=False, read_start_rel_to_raw=100, strand='+', fn='path/to/read.fast5', corr_group='RawGenomeCorrected_000', rna=False, sig_match_score=1.0, mean_q_score=10.0) # output is a list of readData objects reads_index = tombo_helper.TomboReads(['test_data/native_reads',]) cs_reads = reads_index.get_cs_reads('chr20', '+')
- Parameters
start (int) – 0-based start mapped position
end (int) – 0-based open interval end mapped position
filtered (bool) – is this read filtered?
read_start_rel_to_raw (int) – start (in raw signal vector) of assigned bases
strand (str) – read mapped strand (‘+’, ‘-‘ or None)
fn (str) – raw read filename.
corr_group (str) – –corrected-group slot specified in re-squiggle command
rna (bool) – is this read RNA?
sig_match_score (float) – signal matching score. Default: None
mean_q_score (float) – mean basecalling q-score. Default: None
read_id (str) – read identifier Default: None
-
class
tombo.tombo_helper.
intervalData
(chrm, start, end, strand=None, reg_id=None, reg_text='', reads=None, seq=None)[source]¶ Genome/transcriptome interval information
Example:
int_data = tombo_helper.intervalData( chrm='chr20', start=10000, end=10100, strand='+')
Note
All intervalData functions return the object in order to allow chaining of interval commands. (e.g.
int_data.add_reads(reads_index).add_seq()
)-
__init__
(chrm, start, end, strand=None, reg_id=None, reg_text='', reads=None, seq=None)[source]¶ Initialize a genome/transcriptome interval object
- Parameters
chrm (str) – chromosome name.
start (int) – 0-based start position.
end (int) – 1-based (or open interval) end position.
strand (str) – interval strand (‘+’, ‘-‘ or None). Default: None
reg_id (str) – used to keep track of multiple intervals. Default: None
reg_text (str) – region description text. Default: ‘’
reads (list) – list of readData values overlapping region. Default: None
seq (str) – region genomic sequence. Default: None
-
add_reads
(reads_index, require_full_span=False)[source]¶ Add reads overlapping this interval
- Parameters
reads_index (
tombo.tombo_helper.TomboReads
) – reads indexrequire_full_span (bool) – require that reads span then entire interval (default: include all reads overlapping the interval)
-
add_seq
(genome_index=None, error_end=True)[source]¶ Extract the forward strand genomic sequence for an interval from reads or genome_index if provided
- Parameters
genome_index (
tombo.tombo_helper.Fasta
) –
-
copy
(include_reads=True)[source]¶ Create a copy of this interval. Useful when adding sets of reads from multiple samples to compare.
- Parameters
include_reads (bool) – include current reads slot in the new object (default: True)
-
expand_interval
(expand_width)[source]¶ Expand this interval by the specified amount (effects only the start and stop attributes; NOT seq or reads)
- Parameters
expand_width (int) – amount by which to expand the interval
-
get_base_levels
(read_rows=False, num_reads=None)[source]¶ Extract base levels from this interval.
- Parameters
read_rows (bool) – return array with reads as rows (default: row columns)
num_reads (int) – maximal number of reads to output. randomly downsample if more reads are present (default: output all reads)
- Returns
np.array containing read mean levels with rows corresponding to interval position and columns to individual reads (or reverse if read_rows)
-
merge
(other_reg)[source]¶ Create a copy of this interval with the reads from this interval and other_reg
- Parameters
other_reg (
tombo.tombo_helper.intervalData
) – a second region to merge with this interval
Note
Aside from reads, all other attributes will be taken from this interval
-
-
class
tombo.tombo_helper.
TomboReads
(fast5s_basedirs, corrected_group='RawGenomeCorrected_000', basecall_subgroups=None, for_writing=False, remove_filtered=True, sample_name=None)[source]¶ A set of reads with associated meta-data from re-squiggle processing
-
__init__
(fast5s_basedirs, corrected_group='RawGenomeCorrected_000', basecall_subgroups=None, for_writing=False, remove_filtered=True, sample_name=None)[source]¶ Parse data from a list of re-squiggle fast5 directories
- Parameters
fast5s_basedirs (list) – fast5 base directories, which have been processed by
tombo resquiggle
corrected_group (str) – Analysis slot containing the re-squiggle information (optional; default: ‘RawGenomeCorrected_000’)
basecall_subgroups (list) – basecall subgroups (optional; default: Process all basecall subgroups)
for_writing (bool) – open TomboReads to write index (optional; default: Open for reading and parse re-squiggled reads)
remove_filtered (bool) – remove filtered reads as indicated by index file (optional; default: True)
sample_name (str) – for verbose output
-
add_read_data
(chrm, strand, read_data)[source]¶ Add read data to the index
- Parameters
chrm (str) – chromosome name
strand (str) – strand (‘+’ or ‘-‘)
read_data (
tombo.tombo_helper.readData
) – read information
-
get_coverage
(chrm, pos, strand=None, invalid_return=0)[source]¶ Get coverage at specified position
- Parameters
chrm (str) – chromosome name
pos (int) – 0-based genomic position
strand (str) – interval strand (‘+’, ‘-‘ or None). Default: None (max over both strands)
invalid__return – return value for invalid (bad chrm, strand or pos beyond coverage) position. Default: 0
-
get_cs_coverage
(chrm, strand, invalid_return=None)[source]¶ Extract coverage levels over a specified chromosome and strand
- Parameters
chrm (str) – chromosome name
strand (str) – strand (‘+’ or ‘-‘)
invalid_return – value to return for invalid (chrm, strand)
- Returns
numpy array (numpy.int64) containing read coverage levels
-
get_cs_reads
(chrm, strand, invalid_return=[])[source]¶ Extract the list of reads stored in a specified chromosome and strand
- Parameters
chrm (str) – chromosome name
strand (str) – strand (‘+’ or ‘-‘)
invalid_return – value to return for invalid (chrm, strand)
-
iter_cov_regs
(cov_thresh, region_size=None, ctrl_reads_index=None)[source]¶ Iterate over regions with coverage greater than or equal to cov_thresh.
If region_size is provided, regions are rounded to the nearest region_sized windows and only the region start is yielded (e.g. chrm, strand, start). If not provided (chrm, strand, start, end) is yielded.
-
iter_coverage_regions
(ctrl_reads_index=None)[source]¶ Get genome coverage for a set of reads
- Parameters
ctrl_reads_index (
tombo.tombo_helper.TomboReads
) – a second set of tombo reads to add for coverage- Returns
Yeilds (chrm, strand, cs_cov, cs_cov_starts) indicating read coverage levels
cs_cov and cs_cov_starts are numpy arrays containing the coverage level and the 0-based genomic start positions of those intervals
-
replace_index
(new_reads_index)[source]¶ Replace current reads index
- Parameters
new_reads_index (dict) – dictionary with (chrm, strand) pointing to lists of
tombo.tombo_helper.readData
objects
-
-
class
tombo.tombo_helper.
TomboMotif
(raw_motif, mod_pos=None)[source]¶ Description of a sequence motif, including potentially modified position
-
raw_motif
¶ input raw motif string
- Type
str
-
motif_len
¶ length of motif
- Type
int
-
motif_pat
¶ python regular expression for this motif
- Type
re.compile
-
rev_comp_pat
¶ python regular expression for reverse complement of this motif
- Type
re.compile
-
is_palindrome
¶ is the motif palindromic (in the genomic not literary definition)
- Type
bool
-
mod_pos
¶ modified base position within motif
- Type
int
-
mod_base
¶ modified base (should generally be A, C, G or T)
- Type
str
-
__init__
(raw_motif, mod_pos=None)[source]¶ Parse string motif
- Parameters
raw_motif (str) – sequence motif. supports IUPAC single letter codes (use T for RNA).
mod_pos (int) – 1-based position of modified base within the motif
-
-
tombo.tombo_helper.
parse_motif_descs
(stat_motif_descs)[source]¶ Parse string motif descriptions as defined by
tombo plot roc --motif-descriptions
- Parameters
stat_motif_descs (str) – string motifs description (see
tombo plot roc --motif-descriptions
)- Returns
list of tuples with
tombo.tombo_helper.TomboMotif
and motif/modification names
-
class
tombo.tombo_helper.
resquiggleParams
[source]¶ Re-squiggle parameters
- Parameters
match_evalue (float) – expected value for matching event to sequence
skip_pen (float) – penalty for skipped sequence position
bandwidth (int) – adaptive bandwidth
max_half_z_score (float) – windsorize half z-scores above this value
running_stat_width (int) – running neighboring window width for segmentation scoring
min_obs_per_base (int) – minimum observations per genomic base
mean_obs_per_event (int) – mean number of raw obs. per event during segmentation
z_shift (float) – amount to shift z-scores for DP (derived from match_evalue)
stay_pen (float) – stay penalty for DP (derived from match_evalue)
use_t_test_seg (bool) – use t-test segmentation criterion (default: raw neighboring window difference)
band_bound_thresh (int) – bandwidth boundary threshold for determining if a read potentially left the adaptive bandwidth
start_bw (int) – bandwidth for read start identification
start_save_bw (int) – save bandwidth for read start identification (if start_bw_fails)
start_n_bases (int) – number of genomic bases to use for read start identification
-
class
tombo.tombo_helper.
startClipParams
[source]¶ Parameters to identify read start using bases clipped from mapping
- Parameters
bandwidth (int) – bandwidth
num_genome_bases (int) – number of genome bases
-
class
tombo.tombo_helper.
resquiggleResults
[source]¶ Re-squiggle results
- Parameters
align_info (
tombo.tombo_helper.alignInfo
) – read alignment informationgenome_loc (
tombo.tombo_helper.genomeLocation
) – genome mapping locationgenome_seq (str) – mapped genome sequence
mean_q_score (float) – mean basecalling q-score
raw_signal (np.array::np.float64) – raw signal (i.e. un-segmented; signal may be normalized) (optional)
channel_info (
tombo.tombo_helper.channelInfo
) – channel information (optional)read_start_rel_to_raw (int) – read start within raw signal (optional)
segs (np.array::np.int64) – relative raw signal segment positions (optional)
scale_values (
tombo.tombo_helper.scaleValues
) – signal normalization scale values (optional)sig_match_score (float) – expected to observed signal score (see
tombo.tombo_stats.get_read_seg_score
; optional)norm_params_changed (bool) – were scale parameters updated (optional)
start_clip_bases (str) – mapping clipped bases from start of read (optional)
stall_ints (list) – list of rna stall locations within raw signal (optional)
-
class
tombo.tombo_helper.
alignInfo
[source]¶ Information from genomic read alignment
- Parameters
ID (str) – read identifier
Subgroup (str) – sub-group location containing read information (e.g. ‘BaseCalled_template’)
ClipStart (int) – number of bases clipped from beginning of basecalls
ClipEnd (int) – number of bases clipped from end of basecalls
Insertions (int) – number of inserted bases in alignment
Deletions (int) – number of delected bases in alignment
Matches (int) – number of matched bases in alignment
Mismatches (int) – number of mis-matched bases in alignment
-
class
tombo.tombo_helper.
genomeLocation
[source]¶ Genomic location
- Parameters
Start (int) – 0-based genomic start position
Strand (str) – Strand (should be ‘+’ or ‘-‘)
Chrom (str) – Chromosome
-
class
tombo.tombo_helper.
sequenceData
[source]¶ Read sequence data from FASTQ record
- Parameters
seq (str) – read sequence
id (str) – read id (extracted from FASTQ record line)
mean_q_score (float) – mean q-score
-
class
tombo.tombo_helper.
dpResults
[source]¶ Dynamic programming results
- Parameters
read_start_rel_to_raw (int) – read start within raw signal
segs (np.array::np.int64) – relative raw signal segment positions
ref_means (np.array::np.float64) – expected signal levels
ref_sds (np.array::np.float64) – expected SD of signal levels
genome_seq (str) – mapped genome sequence
-
class
tombo.tombo_helper.
scaleValues
[source]¶ Signal normaliztion scaling parameters. For details see https://nanoporetech.github.io/tombo/resquiggle.html#signal-normalization
- Parameters
shift (float) – shift scaling parameter
scale (float) – scale scaling parameter
lower_lim (float) – lower windsorizing threshold
upper_lim (float) – upper windsorizing threshold
outlier_thresh (float) – outlier threshold used to define lower_lim and upper_lim
-
class
tombo.tombo_helper.
Fasta
(fasta_fn, dry_run=False, force_in_mem=False, assume_dna_base=False)[source]¶ Fasta file sequence format wrapper class. Will load faidx via
pyfaidx
package if installed, else the fasta will be loaded into memory for sequence extraction.-
__init__
(fasta_fn, dry_run=False, force_in_mem=False, assume_dna_base=False)[source]¶ Load a fasta
- Parameters
fasta_fn (str) – path to fasta file
dry_run (bool) – when pyfaidx is not installed, don’t actually read sequence into memory.
force_in_mem (bool) – force genome to be loaded into memory even if pyfaidx is installed allowing on-disk access
assume_dna_bases (bool) – skip check for DNA or RNA bases (default: False)
-
get_seq
(chrm, start=None, end=None, error_end=True)[source]¶ Extract sequence from a specific genomic region. Note if provided, start and end must both be provided or they will be ignored.
- Parameters
chrm (str) – chromosome name
start (int) – 0-based start position
end (int) – 0-based open-interval end position
error_end (bool) – raise an error when the region requested extends beyond the chromosome (default: True)
- Returns
Genomic sequence requested. Sequence is converted to RNA sequence if applicable.
-
-
class
tombo.tombo_helper.
seqSampleType
[source]¶ Description of a sequencing sample type
- Parameters
name (str) – name of the sequencing sample
rev_sig (bool) – Is the raw signal reversed (3’ to 5’)
-
tombo.tombo_helper.
get_seq_sample_type
(fast5_data=None, reads_index=None, fast5_fns=None, num_reads=50)[source]¶ Get the sequencing sample type from a single read or set of reads
- Parameters
fast5_data (h5py.File) – open read h5py File object
reads_index (
tombo.tombo_helper.TomboReads
) –fast5_fns (list) – FAST5 read filenames
num_reads (int) – sample of reads to check for sample type
-
tombo.tombo_helper.
get_raw_read_slot
(fast5_data)[source]¶ Get the raw read slot from this FAST5 read file
- Parameters
fast5_data (h5py.File) – open FAST5 read file object
Example:
all_aw_signal = get_raw_read_slot(fast5_data)['Signal'][:]
- Returns
The HDF5 group slot containing the raw signal data.
-
tombo.tombo_helper.
get_single_slot_read_centric
(r_data, slot_name, corr_grp=None)[source]¶ Extract read-centric slot_name from this read’s Events table
- Parameters
fast5_data (
tombo.tombo_helper.readData
or an openh5py.File
object) –slot_name (str) – slot from which to extract data (valid values: norm_mean, norm_stdev, start, length, base)
corr_grp (str) – corrected group slot from which to extract data (default: use value from
fast5_data
object; required forh5py.File
)
-
tombo.tombo_helper.
get_multiple_slots_read_centric
(r_data, slot_names, corr_grp=None)[source]¶ Extract multiple read-centric slot_names from this read’s Events table
- Parameters
fast5_data (
tombo.tombo_helper.readData
or an openh5py.File
object) –slot_name (str) – slot from which to extract data (valid values: norm_mean, norm_stdev, start, length, base)
corr_grp (str) – corrected group slot from which to extract data (default: use value from
fast5_data
object; required forh5py.File
)
- Returns
A tuple of numpy arrays specified by the
slot_names
tombo.tombo_stats module¶
-
tombo.tombo_stats.
TomboStats
(stat_fn)[source]¶ Load per-reference location Tombo statistics. Safe method to load statistics for read-only purposes.
- Returns
Either
tombo.tombo_stats.ModelStats
ortombo.tombo_stats.LevelStats
depending on the type of Tombo statistics file provided.
-
class
tombo.tombo_stats.
ModelStats
(stats_fn, stat_type=None, region_size=None, cov_damp_counts=None, cov_thresh=None, num_most_signif=None, most_signif_num_batches=10)[source]¶ Parse and retrieve relevant information from a standard (per-genomic base) Tombo statistics file.
-
__init__
(stats_fn, stat_type=None, region_size=None, cov_damp_counts=None, cov_thresh=None, num_most_signif=None, most_signif_num_batches=10)[source]¶ Parse or open for writing a standard (per-genomic base) Tombo statistics file.
Example:
stats = tombo_stats.TomboStats('path/to/stats.file') for chrm, strand, pos, frac, damp_frac, valid_cov in stats.iter_most_signif_sites(): # do stuff
- Parameters
stats_fn (str) – filename for previously saved tombo stats
stat_type (str) – type of statistic (model_compare, de_novo, or sample_compare); only applicable for new file writing
region_size (int) – size of chunked storage blocks; only applicable for new file writing
cov_damp_counts (tuple) – pseudo-counts for modified and un-modified reads to compute
damp_frac
cov_thresh (int) – only sites with coverage greater than or equal to this value will be stored
num_most_signif (int) – number of most significant sites to be stored for faster access
most_signif_num_batches (int) – number of region batches to store before re-computing the most significant array (default: 10)
Warning
- If all arguments are provided the current file’s contents will be
deleted.
Intended to open a fresh
ModelStats
file for writing.
-
compute_ctrl_motif_stats
(ctrl_stats, motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]¶ Compute lists of statistic values and whether this site represents a match to the provided motifs
- Parameters
ctrl_stats (
tombo.tombo_stats.ModelStats
) – control statistics(list; see (motif_descs) –
tombo.tombo_helper.parse_motif_descs
): containing tuples withtombo.tombo_helper.TomboMotif
and motif/modification namesgenome_index (
tombo.tombo_helper.Fasta
) – genome indexstats_per_block (int) – statistics to include in calculations per-block (–multiprocess-region-size)
total_stats_limit (int) – maximum total statistics to include in computation (Default: include all stats)
- Returns
Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean native or control sample stats
-
compute_ground_truth_stats
(ground_truth_locs)[source]¶ Compute lists of statistic values and ground truth modification status (boolean)
- Parameters
ground_truth_locs – list containing tuples of 1) modified locations (from
tombo.tombo_helper.parse_locs_file
), 2) unmodified locations and mod names.- Returns
Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match
-
compute_motif_stats
(motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]¶ Compute lists of statistic values and whether this site represents a match to the provided motifs
- Parameters
(list; see (motif_descs) –
tombo.tombo_helper.parse_motif_descs
): containing tuples withtombo.tombo_helper.TomboMotif
and motif/modification namesgenome_index (
tombo.tombo_helper.Fasta
) – genome indexstats_per_block (int) – statistics to include in calculations per-block (–multiprocess-region-size)
total_stats_limit (int) – maximum total statistics to include in computation (Default: include all stats)
- Returns
Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match
-
get_most_signif_regions
(num_bases, num_regions, unique_pos=True, prepend_loc_to_text=False)[source]¶ Select regions centered on locations with the largest fraction of modified bases
- Parameters
num_bases (int) – number of bases to output
num_regions (int) – number of regions to output
unique_pos (bool) – get only unique positions (optional; default True) intervals may overlap, but identified significant position is outside other intervals
prepend_loc_to_text (bool) – pre-prend most significant location to the region text (can be off for interval near start/end of sequence records)
- Returns
A list of
tombo.tombo_helper.intervalData
objects
-
get_pos_stat
(chrm, strand, pos, missing_value=None)[source]¶ Extract statistic value from the requested genomic position.
-
iter_most_signif_sites
()[source]¶ Iterate through statistics table yeilding (chrm, strand, pos, stat).
-
iter_stat_seqs
(genome_index, before_bases, after_bases, include_pos=True)[source]¶ Iterate through most significant genomic sites returning the genomic sequence surrounding each position.
- Parameters
genome_index (
tombo.tombo_helper.Fasta
) – genome index objectbefore_bases (int) – number of sequence bases before positions to include
after_bases (int) – number of sequence bases after positions to include
include_pos (bool) – yeild (pos_seq, chrm, strand, start, end) for each site (default: True)
-
-
class
tombo.tombo_stats.
LevelStats
(stats_fn, stat_type=None, region_size=None, cov_thresh=None, num_most_signif=None, most_signif_num_batches=10)[source]¶
-
class
tombo.tombo_stats.
PerReadStats
(per_read_stats_fn, stat_type=None, region_size=None)[source]¶ Store and accses per-read modified base testing statistics
-
__init__
(per_read_stats_fn, stat_type=None, region_size=None)[source]¶ Open per-read statistics file.
Examples:
per_read_stats = tombo_stats.PerReadStats( 'path/to/sample.tombo.per_read_stats') int_data = tombo_helper.intervalData( chrm='chr20', start=10000, end=10100, strand='+') reg_per_read_stats = per_read_stats.get_region_per_read_stats( int_data, num_reads=10)
- Parameters
per_read_stats_fn (str) – filename containing (or to write) per-read Tombo statistics
stat_type (str) – type of statistic (model_compare, de_novo, or sample_compare); only applicable for new file writing
region_size (int) – size of chunked storage blocks; only applicable for new file writing
Warning
If
stat_type
andregion_size
are provided the current file’s contents will be deleted.Intended to open a fresh
PerReadStats
file for writing.
-
compute_ctrl_motif_stats
(pr_ctrl_stats, motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]¶ Compute lists of statistic values and whether this site represents a match to the provided motifs
- Parameters
pr_ctrl_stats (
tombo.tombo_stats.PerReadStats
) – control per-read statistics(list; see (motif_descs) –
tombo.tombo_helper.parse_motif_descs
): containing tuples withtombo.tombo_helper.TomboMotif
and motif/modification namesgenome_index (
tombo.tombo_helper.Fasta
) – genome indexstats_per_block (int) – statistics to include in calculations per-block (–multiprocess-region-size)
total_stats_limit (int) – maximum total statistics to include in computation (Default: include all stats)
- Returns
Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean native or control sample stats
-
compute_ground_truth_stats
(ground_truth_locs)[source]¶ Compute lists of statistic values and ground truth modification status (boolean)
- Parameters
ground_truth_locs – list containing tuples of 1) modified locations (from
tombo.tombo_helper.parse_locs_file
), 2) unmodified locations and mod names.- Returns
Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match
-
compute_motif_stats
(motif_descs, genome_index, stats_per_block=None, total_stats_limit=None)[source]¶ Compute lists of statistic values and whether this site represents a match to the provided motifs
- Parameters
(list; see (motif_descs) –
tombo.tombo_helper.parse_motif_descs
): containing tuples withtombo.tombo_helper.TomboMotif
and motif/modification namesgenome_index (
tombo.tombo_helper.Fasta
) – genome indexstats_per_block (int) – statistics to include in calculations per-block (–multiprocess-region-size)
total_stats_limit (int) – maximum total statistics to include in computation (Default: include all stats)
- Returns
Dictionary with (key) motif/modification name and (value) list of tuples containing statistic value and boolean motif match
-
get_region_per_read_stats
(interval_data, num_reads=None)[source]¶ Extract per-read statistics over the specifed interval.
- Parameters
interval_data (
tombo.tombo_helper.intervalData
) – genomic intervalnum_reads (int) – randomly select this many reads (default: inlcude all reads)
- Returns
np.array structured array containing
pos
,stat
andread_id
for per-read stats over requested interval
-
-
class
tombo.tombo_stats.
TomboModel
(ref_fn=None, is_text_model=False, kmer_ref=None, central_pos=None, seq_samp_type=None, reads_index=None, fast5_fns=None, minimal_startup=True)[source]¶ - Load, store and access Tombo model attributes and sequence-based
expected mean and standard deviation levels (median normalization only).
-
__init__
(ref_fn=None, is_text_model=False, kmer_ref=None, central_pos=None, seq_samp_type=None, reads_index=None, fast5_fns=None, minimal_startup=True)[source]¶ Initialize a Tombo k-mer model object
- Parameters
ref_fn (str) – tombo model filename
is_text_model (bool) – ref_fn is text (e.g. https://github.com/nanoporetech/kmer_models/blob/master/r9.4_180mv_450bps_6mer/template_median68pA.model)
kmer_ref (list) – containing 3-tuples 1) k-mer 2) expected level 3) level SD
central_pos (int) – base within k-mer to assign signal (only applicable when kmer_ref is provided)
seq_samp_type (
tombo.tombo_helper.seqSampleType
) – sequencing sample type (default: None)reads_index (
tombo.tombo_helper.TomboReads
) – For determining seq_samp_typefast5_fns (list) – fast5 read filenames from which to extract read metadata. For determining seq_samp_type
minimal_startup (bool) – don’t compute inverse variances (default True)
Note
- Order of priority for initialization when multiple model
specifications are provided: 1) ref_fn 2) kmer_ref (requires central_pos) 3) seq_samp_type 4) reads_index 5) fast5_fns
- Last 3 options load a default model file included with Tombo. Last
2 determine the sample type from read metadata.
-
get_exp_levels_from_kmers
(seq_kmers)[source]¶ Look up expected signal levels for a list of k-mers
- Parameters
seq_kmers (list) – list of k-mers (as returned from
tombo.tombo_helper.get_seq_kmers
)- Returns
Expected signal level references
ref_means (np.array::np.float64) expected signal levels
ref_sds (np.array::np.float64) expected signal level sds
-
get_exp_levels_from_seq
(seq, rev_strand=False)[source]¶ Compute expected signal levels for a sequence
- Parameters
seq (str) – genomic seqeunce to be converted to expected signal levels
rev_strand (bool) – provided sequence is from reverse strand (so flip return order to genome forward direction)
Note
Returned expected signal levels will be trimmed compared to the passed sequence based on the std_ref.kmer_width and std_ref.central_pos.
- Returns
Expected signal level references 1) ref_means (np.array::np.float64) expected signal levels 2) ref_sds (np.array::np.float64) expected signal level sds
-
class
tombo.tombo_stats.
AltModel
(ref_fn=None, kmer_ref=None, central_pos=None, alt_base=None, name=None, motif=None, minimal_startup=True)[source]¶ Load, store and access alternate-base Tombo model attributes and sequence-based expected mean and standard deviation levels (median normalization only).
-
__init__
(ref_fn=None, kmer_ref=None, central_pos=None, alt_base=None, name=None, motif=None, minimal_startup=True)[source]¶ Initialize a Tombo k-mer model object
- Parameters
ref_fn (str) – tombo model filename
kmer_ref (list) – containing 3-tuples 1) k-mer 2) expected level 3) level SD
central_pos (int) – base within k-mer to assign signal (only applicable when kmer_ref is provided)
alt_base (int) – “swap” base within k-mer (only applicable when kmer_ref is provided)
name (str) – model name (only applicable when kmer_ref is provided)
motif (
tombo.tombo_helper.TomboMotif
) – model motif (defaults to alt_base motif; only applicable when kmer_ref is provided)minimal_startup (bool) – don’t compute inverse variances (default True)
Note
kmer_ref
,central_pos
,alt_base
,name
andmotif
are ignored if
ref_fn
is provided.
-
get_exp_levels_from_kmers
(seq_kmers, rev_strand=False)[source]¶ Look up expected alternate-base signal levels across a central base. Alternative base to test should be last base in the first k-mer and first base in the last k-mer
- Parameters
seq_kmers (list) – list of k-mers the same length as the k-mer
- Returns
Expected signal level references
ref_means (np.array::np.float64) expected signal levels
ref_sds (np.array::np.float64) expected signal level sds
-
-
tombo.tombo_stats.
normalize_raw_signal
(all_raw_signal, read_start_rel_to_raw=0, read_obs_len=None, norm_type='median', outlier_thresh=None, channel_info=None, scale_values=None, event_means=None, model_means=None, model_inv_vars=None, const_scale=None)[source]¶ Apply scaling and windsorizing parameters to normalize raw signal.
- Parameters
all_raw_signal (np.array) – raw nanopore signal obervation values
read_start_rel_to_raw (int) – amount of signal to trim from beginning of the signal (default: 0)
read_obs_len (int) – length of signal to process from read_start_rel_to_raw (default: full length)
norm_type (str) – normalization type (median (default), none, pA_raw, pA, median_const_scale, robust_median; ignored is
scale_values
provided)outlier_thresh (float) – windsorizing threshold (MAD units; default: None)
channel_info (
tombo.tombo_helper.channelInfo
) – channel information (optional; only for pA and pA_raw)scale_values (
tombo.tombo_helper.scaleValues
) – scaling values (optional)event_means (np.array) – for pA fitted scaling parameters (optional)
model_means (np.array) – for pA fitted scaling parameters (optional)
model_inv_vars (np.array) – for pA fitted scaling parameters (optional)
const_scale (float) – global scale parameter (optional)
- Returns
Normalized signal and scaling parameters
normalized signal observations (np.array::np.float64)
-
tombo.tombo_stats.
compute_base_means
(all_raw_signal, base_starts)[source]¶ Efficiently compute new base mean values from raw signal and base start positions
- Parameters
all_raw_signal (np.array) – raw nanopore signal obervation values
base_starts (np.array::np.int32) – 0-based base start positions within raw signal
- Returns
np.array::np.float64 containing base mean levels
-
tombo.tombo_stats.
get_read_seg_score
(r_means, r_ref_means, r_ref_sds)[source]¶ Compute expected to observed signal matching score
- Parameters
r_means (np.array::np.float64) – observed base signal levels
r_ref_means (np.array::np.float64) – expected base signal levels
r_ref_sds (np.array::np.float64) – expected base signal level sds
- Returns
Mean half z-score for observed versus expected signal levels
-
tombo.tombo_stats.
calc_kmer_fitted_shift_scale
(prev_shift, prev_scale, r_event_means, r_model_means, r_model_inv_vars=None, method='theil_sen')[source]¶ Use robust Theil-Sen estimator to compute fitted shift and scale parameters based on read sequence
- Parameters
prev_shift (float) – previous shift parameter
prev_scale (float) – previous scale parameter
r_ref_means (np.array::np.float64) – expected base signal levels
r_ref_sds (np.array::np.float64) – expected base signal level sds
r_model_inv_vars (np.array::np.float64) – expected base signal level inverse variances for method of moments (mom) computation
method (str) – one of theil_sen, robust, or mom
- Returns
Sequence-fitted scaling parameters
shift parameter (float)
scale parameter (float)
shift correction factor; multiply by
prev_scale
and add toprev_shift
to getshift
(float)scale correction factor; multiply by
prev_scale
to getscale
(float)
-
tombo.tombo_stats.
load_resquiggle_parameters
(seq_samp_type, sig_aln_params=None, seg_params=None, use_save_bandwidth=False)[source]¶ Load parameters for re-squiggle algorithm
- Parameters
seq_samp_type (
tombo.tombo_helper.seqSampleType
) – sequencing sample typesig_aln_params (tuple) – signal alignment parameters (optional; default: load seq_samp_type defaults)
seg_params (tuple) – segmentation parameters (optional; default: load seq_samp_type defaults)
use_save_bandwidth (bool) – load larger “save” bandwidth
- Returns
-
tombo.tombo_stats.
compute_num_events
(signal_len, seq_len, mean_obs_per_event, min_event_to_seq_ratio=1.1)[source]¶ Compute number of events to find for this read
- Parameters
signal_len (int) – length of raw signal
seq_len (int) – length of sequence
mean_obs_per_base (int) – mean raw observations per genome base
min_event_to_seq_ratio (float) – minimum event to sequence ratio (optional)
- Returns
Number of events to find for this read