fast5_research package¶
Submodules¶
fast5_research.extract module¶
-
class
fast5_research.extract.
Read
(read_id, read_number, tracking_id, channel_id, context_tags, raw)[source]¶ Bases:
object
-
fast5_research.extract.
extract_channel_reads
(source, output, prefix, flat, by_id, max_files, multi, channel, summary=None)[source]¶
-
fast5_research.extract.
reads_in_multi
(src, filt=None)[source]¶ Get list of read IDs contained within a multi-read file.
- Parameters
src – source file.
filt – perform filtering by given set.
- Returns
set of read UUIDs (as strings, as recorded in the HDF group names).
fast5_research.fast5 module¶
-
class
fast5_research.fast5.
Fast5
(fname, read='r')[source]¶ Bases:
h5py._hl.files.File
Class for grabbing data from single-read fast5 files. Many attributes and groups are currently assumed to exist (we’re concerned mainly with reading). The class needs further development to be robust and to support writing.
-
classmethod
New
(fname, read='w', tracking_id={}, context_tags={}, channel_id={})[source]¶ Construct a fresh single-read file, with meta data written to standard locations.
-
property
attributes
¶ Attributes for a read, assumes one read in file
-
property
channel_meta
¶ Channel meta information as python dict
Context tags meta information as python dict
-
get_alignment_attrs
(section='template', analysis='Alignment')[source]¶ Read the annotated alignment meta data from the fast5 file.
- Parameters
section – String to use in paths, e.g. ‘template’.
analysis – Base analysis name (under /Analyses)
-
get_analysis_latest
(name)[source]¶ Get group of latest (present) analysis with a given base path.
- Parameters
name – Get the (full) path of newest analysis with a given base name.
-
get_analysis_new
(name)[source]¶ Get group path for new analysis with a given base name.
- Parameters
name – desired analysis name
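The relationship between get_analysis_latest and get_analysis_new can be illustrated with a minimal sketch. It assumes analysis groups are named with a zero-padded three-digit suffix (as in ‘EventDetection_000’) and works on a plain list of names rather than on HDF groups; the helper names are hypothetical and are not part of fast5_research.

```python
import re


def latest_analysis(names, base):
    """Return the highest-numbered name of the form '<base>_NNN', or None."""
    pattern = re.compile(r'^{}_(\d+)$'.format(re.escape(base)))
    numbered = []
    for name in names:
        match = pattern.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return max(numbered)[1] if numbered else None


def new_analysis(names, base):
    """Return the next unused name for the given base (assumed naming scheme)."""
    latest = latest_analysis(names, base)
    next_index = 0 if latest is None else int(latest.rsplit('_', 1)[1]) + 1
    return '{}_{:03d}'.format(base, next_index)


# Stand-in for the group names found under /Analyses.
groups = ['Basecall_1D_000', 'Basecall_1D_001', 'EventDetection_000']
print(latest_analysis(groups, 'Basecall_1D'))  # Basecall_1D_001
print(new_analysis(groups, 'Basecall_1D'))     # Basecall_1D_002
```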
-
get_any_mapping_data
(section='template', attrs_only=False, get_model=False)[source]¶ Convenience method for extracting whatever mapping data might be present, favouring squiggle_mapping output over basecall_mapping.
- Parameters
section – (Probably) ‘template’
attrs_only – Use attrs_only=True to return mapping attributes without events
- Returns
the tuple (events, attrs) or attrs only
-
get_basecall_data
(section='template', analysis='Basecall_1D')[source]¶ Read the annotated basecall_1D events from the fast5 file.
- Parameters
section – String to use in paths, e.g. ‘template’.
analysis – Base analysis name (under /Analyses)
-
get_engine_state
(state, time=None)[source]¶ Retrieve engine state from /EngineStates/, either across the whole read (default) or at a given time.
- Parameters
state – name of engine state
time – time (in seconds) at which to retrieve the engine state
-
get_fastq
(analysis='Basecall_1D', section='template', custom=None)[source]¶ Get the fastq (sequence) data.
- Parameters
analysis – Base analysis name (under /Analyses)
section – (Probably) ‘template’
custom – Custom hdf path overriding all of the above.
-
get_mapping_attrs
(section='template', analysis='Squiggle_Map')[source]¶ Read the annotated mapping meta data from the fast5 file. Names which are inconsistent between squiggle_mapping and basecall_mapping are added to basecall_mapping (thus duplicating the attributes in basecall mapping).
- Parameters
section – String to use in paths, e.g. ‘template’.
analysis – Base analysis name (under /Analyses) For basecall mapping use analysis = ‘Alignment’
-
get_mapping_data
(section='template', analysis='Squiggle_Map', get_model=False)[source]¶ Read the annotated mapping events from the fast5 file.
Note
The seq_pos column for the events table returned from basecall_mapping is adjusted to be the genome position (consistent with squiggle_mapping)
- Parameters
section – String to use in paths, e.g. ‘template’.
analysis – Base analysis name (under /Analyses). For basecall mapping use analysis = ‘AlignToRef’.
-
get_raw
(scale=True)[source]¶ Get raw data in file, might not be present.
- Parameters
scale – Scale data to pA? (rather than ADC values)
Warning
This method is deprecated and should not be used. Instead, use .get_read(raw=True) to read both MinKnow conformant files and previous Tang files.
-
get_read
(group=False, raw=False, read_number=None)[source]¶ Like get_reads, but only the first read in the file
- Parameters
group – return hdf group rather than event/raw data
-
get_read_stats
()[source]¶ Combines stats based on events with output of .summary, assumes a one read file.
-
get_reads
(group=False, raw=False, read_numbers=None)[source]¶ Iterator across event data for all reads in file
- Parameters
group – return hdf group rather than event data
-
get_reference_fasta
(analysis='Alignment', section='template', custom=None)[source]¶ Get fasta sequence of known DNA fragment for the read.
- Parameters
analysis – Base analysis name (under /Analyses)
section – (Probably) ‘template’
custom – Custom hdf path overriding all of the above.
-
get_sam
(analysis='Alignment', section='template', custom=None)[source]¶ Get SAM (alignment) data.
- Parameters
analysis – Base analysis name (under /Analyses)
section – (Probably) ‘template’
custom – Custom hdf path overriding all of the above.
-
get_section_events
(section, analysis='Segment_Linear')[source]¶ Get the event data for a signal section
- Parameters
analysis – Base analysis path (under /Analyses)
-
get_section_indices
(analysis='Segment_Linear')[source]¶ Get two tuples indicating the event indices for signal segmentation boundaries.
- Parameters
analysis – Base analysis path (under /Analyses)
-
get_split_data
(analysis='Segment_Linear')[source]¶ Get signal segmentation data.
- Parameters
analysis – Base analysis name (under /Analyses)
-
get_temperature
(time=None, field='heatsink')[source]¶ Retrieve temperature data from /EngineStates/, either across the whole read (default) or at a given time.
- Parameters
time – time at which to get temperature
field – one of (‘heatsink’, ‘asic’)
-
set_basecall_data
(events, scale, path, model, seq, section='template', name='unknown', post=None, score=None, quality_data=None, qstring=None, analysis='Basecall_1D')[source]¶ Create an annotated event table and 1D basecalling summary similar to chimaera and add them to the fast5 file.
- Parameters
events – Numpy record array of events. Must contain the mean, stdv, start and length fields.
scale – Scaling object.
path – Viterbi path containing model pointers (1D np.array).
model – Model object.
seq – Basecalled sequence string for fastq.
section – String to use in paths, e.g. ‘template’.
name – Identifier string for fastq.
post – Numpy 2D array containing the posteriors (event, state), used to annotate events.
score – Quality value for the whole strand.
quality_data – Numpy 2D array containing quality_data, used to annotate events.
qstring – Quality string for fastq.
analysis – Base analysis name (under /Analyses)
-
set_engine_state
(data)[source]¶ Set the engine state data.
- Parameters
data – a 1D-array containing two fields, the first of which must be named ‘time’. The name of the second field will be used to name the engine state and be used in the dataset path.
-
set_mapping_data
(events, scale, path, model, seq, ref_name, section='template', post=None, score=None, is_reverse=False, analysis='Squiggle_Map')[source]¶ Create an annotated event table and mapping summary similar to chimaera and add them to the fast5 file.
- Parameters
events – np.ndarray of events. Must contain mean, stdv, start and length fields.
scale – Scaling object.
path – np.ndarray containing position in reference. Negative values will be interpreted as “bad emissions”.
model – Model object to use.
seq – String representation of the reference sequence.
ref_name – Reference name.
section – Section of strand, e.g. ‘template’.
post – Two-dimensional np.ndarray containing posteriors.
score – Mapping quality score.
is_reverse – Mapping refers to ‘-’ strand (bool).
analysis – Base analysis name (under /Analyses)
-
set_raw
(raw, meta=None, read_number=None)[source]¶ Set the raw data in file.
- Parameters
raw – raw data to add
read_number – read number (as usually given in filename and contained within HDF paths, viz. Reads/Read_<>/). If not given attempts will be made to guess the number (assumes single read per file).
-
set_raw_old
(raw, meta)[source]¶ Set the raw data in file.
- Parameters
raw – raw data to add
meta – meta data dictionary
Warning
This method does not write raw data conforming to the Fast5 specification. This class will currently still read data written by this method.
-
set_read
(data, meta)[source]¶ Write event data to file
- Parameters
data – event data
meta – meta data to attach to read
read_number – per-channel read counter
-
set_split_data
(data, analysis='Segment_Linear')[source]¶ Write a dict containing split point data.
- Parameters
data – dict-like object containing attrs to add
analysis – Base analysis name (under /Analyses)
Warning
Not checking currently for required fields.
-
strip_analyses
(keep=('EventDetection_000', 'RawData'))[source]¶ Remove all analyses from file
- Parameters
keep – whitelist of analysis groups to keep
-
property
tracking_id
¶ Tracking id meta information as python dict
-
property
writable
¶ Can we write to the file.
-
fast5_research.fast5.
iterate_fast5
(path='Stream', strand_list=None, paths=False, mode='r', limit=None, shuffle=False, robust=False, progress=False, recursive=False)[source]¶ Iterate over directory of fast5 files, optionally only returning those in list
- Parameters
path – Directory in which single read fast5 are located or filename.
strand_list – List of strands, can be a Python list or a delimited table. If the latter and a filename field is present, this is used to locate files. If a file is given and a strand field is present, the directory index file is searched for and filenames built from that.
paths – Yield file paths instead of fast5 objects.
mode – Mode for opening files.
limit – Limit number of files to consider.
shuffle – Shuffle files to randomize yield of files.
robust – Carry on with iterating over FAST5 files after an exception was raised.
progress – Display progress bar.
recursive – Perform a recursive search for files in subdirectories of path.
fast5_research.fast5_bulk module¶
-
class
fast5_research.fast5_bulk.
AsicBCommand
(command)[source]¶ Bases:
object
Wrapper around the asicb command structure
-
property
configuration
¶
-
property
min_temperature
¶
-
class
fast5_research.fast5_bulk.
AsicBConfiguration
(config)[source]¶ Bases:
object
Wrapper around the asicb configuration struct passed to the asicb over usb
-
property
bias_voltage
¶
-
class
fast5_research.fast5_bulk.
BulkFast5
(filename, mode='r')[source]¶ Bases:
h5py._hl.files.File
Class for reading data from a bulk fast5 file
-
classmethod
New
(fname, read='a', tracking_id={}, context_tags={}, channel_id={})[source]¶ Construct a fresh bulk file, with meta data written to standard locations. There is currently no checking this meta data. TODO: Add meta data checking.
-
get_bias_voltage_changes
()[source]¶ Get changes in the bias voltage.
Note
For a long (-long-long) time the only logging of the common electrode voltage was the experimental history (accurate to one second). The addition of the voltage trace changed this, but this dataset is cumbersome. MinKnow 1.x(.3?) added the asic command history, which is typically much shorter and therefore quicker to query. The bias voltage is thus recorded in several places. For MinION asics there is typically a -5X multiplier to convert the data into correct units with the sign people are used to.
-
get_bias_voltage_changes_in_window
(times=None, raw_indices=None)[source]¶ Find all mux voltage changes within a time window.
- Parameters
times – tuple of floats (start_second, end_second)
raw_indices – tuple of ints (start_index, end_index)
Note
This is the bias voltage from the expt history (accurate to 1 second), and will not include any changes in voltage related to waveforms. For the full voltage trace, use get_voltage.
Exactly one of the slice keyword arguments needs to be specified, as the method will override them in the order of times > raw_indices.
-
get_engine_state
(state, time=None)[source]¶ Get changes in an engine state or the value of an engine state at a given time.
- Parameters
state – the engine state to retrieve.
time – the time at which to grab engine state.
-
get_events
(channel, times=None, raw_indices=None, event_indices=(None, None), use_scaling=True)[source]¶ Parse channel event data.
- Parameters
channel – channel number int
times – tuple of floats (start_second, end_second)
raw_indices – tuple of ints (start_index, end_index)
event_indices – tuple of ints (start_index, end_index)
use_scaling – if True, scale the current level
Note
Exactly one of the slice keyword arguments needs to be specified, as the method will override them in the order of times > raw_indices > event_indices.
-
get_metadata
(channel)[source]¶ Get the metadata for the specified channel.
Look first for events metadata, and fall back on raw metadata, returning an empty dict if neither could be found.
-
get_mux
(channel, raw_index=None, time=None, wells_only=False, return_raw_index=False)[source]¶ Find the multiplex well_id (“the mux”) at a given time
- Parameters
channel – channel number int
raw_index – sample index
time – time in seconds
wells_only – bool, if True, ignore changes to mux states not in [1,2,3,4] and hence return the last well mux.
return_raw_index – bool, if True, return tuple (mux, raw_index), raw_index being raw index when the mux was set.
Note
There are multiple mux states associated with each well (e.g. common_voltage_1 and unblock_voltage_1). Here, we return the well_id associated with the mux state (using self.enum_to_mux), i.e. 1 in both these cases.
Exactly one of the slice keyword arguments needs to be specified, as the method will override them in the order of times > raw_indices.
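The lookup described above can be sketched as a search over a sorted list of mux changes. This is only an illustration under stated assumptions: the changes are given as (raw_index, mux_state) pairs, ENUM_TO_MUX is a hypothetical stand-in for self.enum_to_mux containing only the two entries named in the note, and unknown states fall back to the state number itself.

```python
from bisect import bisect_right

# Hypothetical stand-in for self.enum_to_mux; only these two entries come
# from the note above (1:common_voltage_1 and 6:unblock_voltage_1 -> well 1).
ENUM_TO_MUX = {1: 1, 6: 1}


def mux_at(changes, raw_index, wells_only=False):
    """Return the well id in force at raw_index, or None if nothing was set yet.

    changes: list of (raw_index, mux_state) tuples sorted by raw_index.
    """
    if wells_only:
        # Ignore changes to states outside [1, 2, 3, 4].
        changes = [c for c in changes if c[1] in (1, 2, 3, 4)]
    starts = [c[0] for c in changes]
    pos = bisect_right(starts, raw_index) - 1
    if pos < 0:
        return None
    state = changes[pos][1]
    # Assumption: unmapped states are returned unchanged.
    return ENUM_TO_MUX.get(state, state)


changes = [(0, 2), (100, 0), (200, 6)]
print(mux_at(changes, 150))                   # 0
print(mux_at(changes, 150, wells_only=True))  # 2
print(mux_at(changes, 250))                   # 1
```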
-
get_mux_changes
(channel, wells_only=False)[source]¶ Get changes in multiplex settings for given channel.
- Parameters
channel – channel for which to fetch data
wells_only – bool, if True, ignore changes to mux states not in [1,2,3,4]
Note
There are multiple mux states associated with each well (e.g. 1:common_voltage_1 and 6:unblock_voltage_1). Here, we return mux state numbers, e.g. 1 and 6, which can be linked to the well_id using self.enum_to_mux
-
get_mux_changes_in_window
(channel, times=None, raw_indices=None)[source]¶ Find all mux changes within a time window.
- Parameters
channel – channel number int
times – tuple of floats (start_second, end_second)
raw_indices – tuple of ints (start_index, end_index)
Note
There are multiple mux values associated with each well (e.g. 1:common_voltage_1 and 6:unblock_voltage_1). Here, we return mux values, e.g. 1 and 6, which can be linked to the well_id using self.enum_to_mux.
Exactly one of the slice keyword arguments needs to be specified, as the method will override them in the order of times > raw_indices.
-
get_raw
(channel, times=None, raw_indices=(None, None), use_scaling=True)[source]¶ If available, parse channel raw data.
- Parameters
channel – channel number int
times – tuple of floats (start_second, end_second)
raw_indices – tuple of ints (start_index, end_index)
use_scaling – if True, scale the current level
Note
Exactly one of the slice keyword arguments needs to be specified, as the method will override them in the order of times > raw_indices.
-
get_reads
(channel, transitions=False, multi_row_class='auto')[source]¶ Parse channel read data to yield details of reads.
- Parameters
channel – channel number int
transitions – if True, include transition reads
multi_row_class – options: ‘auto’, ‘modal’, ‘penultimate’, ‘final’. For reads which span multiple rows, use the classification from:
‘auto’: modal class if present, penultimate row if not
‘modal’: modal class if present
‘penultimate’: penultimate row
‘final’: final row
Modal classification is not supported by very old versions of MinKNOW.
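The precedence between the multi_row_class options can be sketched as follows. This is a hedged illustration, not the library's implementation: each read is represented as a list of row dicts, and the column names ‘classification’ and ‘modal_classification’ (taken from the final row) are assumptions made for the example.

```python
def pick_classification(rows, multi_row_class='auto'):
    """Choose one classification for a read spanning one or more rows.

    rows: list of dicts, each with a 'classification' key; the final row may
    also carry a 'modal_classification' key (hypothetical column name).
    """
    options = ('auto', 'modal', 'penultimate', 'final')
    if multi_row_class not in options:
        raise ValueError('multi_row_class must be one of {}'.format(options))
    final = rows[-1]
    modal = final.get('modal_classification')
    if multi_row_class == 'final':
        return final['classification']
    if multi_row_class == 'modal':
        if modal is None:
            raise KeyError('no modal classification recorded')
        return modal
    if multi_row_class == 'auto' and modal is not None:
        return modal
    # 'penultimate', or the 'auto' fallback when no modal class is present.
    penultimate = rows[-2] if len(rows) > 1 else final
    return penultimate['classification']


old_style = [{'classification': 'strand'}, {'classification': 'pore'}]
new_style = [{'classification': 'adapter'},
             {'classification': 'pore', 'modal_classification': 'strand'}]
print(pick_classification(old_style))             # strand
print(pick_classification(new_style))             # strand
print(pick_classification(new_style, 'final'))    # pore
```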
-
get_state
(channel, raw_index=None, time=None)[source]¶ Find the channel state at a given time
- Parameters
channel – channel number int
raw_index – sample index
time – time in seconds
Note
Exactly one of the slice keyword arguments needs to be specified, as the method will override them in the order of times > raw_indices.
-
get_state_changes
(channel)[source]¶ Parse channel state changes.
- Parameters
channel – channel number int
-
get_states_in_window
(channel, times=None, raw_indices=None)[source]¶ Find all channel states within a time window.
- Parameters
channel – channel number int
times – tuple of floats (start_second, end_second)
raw_indices – tuple of ints (start_index, end_index)
Note
Exactly one of the slice keyword arguments needs to be specified, as the method will override them in the order of times > raw_indices.
-
get_voltage
(times=None, raw_indices=(None, None), use_scaling=True)[source]¶ Extracts raw common electrode trace
- Parameters
raw_indices – tuple of ints to limit section of voltage data loaded.
use_scaling – bool, whether to scale voltage data. If no scaling meta is found, scale by -5 (as appropriate for MinION).
- Returns
voltage as array (including the 5x multiplier for MinKnow)
-
get_waveform_timings
()[source]¶ Extract the timings of the waveforms (if any).
- Returns
list of tuples of start and end times
-
parse_history
()[source]¶ Parse the experimental history to pull out various environmental factors. The functions below are quite nasty, don’t enquire too hard.
-
set_events
(data, meta, channel)[source]¶ Write event data to file
- Parameters
data – event data
meta – meta data to attach to read
read_number – per-channel read counter
fast5_research.util module¶
-
fast5_research.util.
build_mapping_summary_table
(mapping_summary)[source]¶ Build a mapping summary table
- Parameters
mapping_summary – List of curr_map dictionaries
- Returns
Numpy record array containing summary contents. One record per array element of mapping_summary
-
fast5_research.util.
build_mapping_table
(events, ref_seq, post, scale, path, model)[source]¶ Build a mapping table based on output of a dragonet.mapper style object. Taken from chimaera.common.utilities.
- Parameters
events – Numpy record array of events. Must contain the mean, stdv, start and length fields.
ref_seq – String representation of the reference sequence.
post – Numpy 2D array containing the posteriors (event, state).
scale – Scaling object.
path – Numpy 1D array containing position in reference. May contain negative values, which will be interpreted as “bad emissions”.
model – Model object to use.
- Returns
numpy record array containing summary fields. One record per event.
Output Field – Description
mean – mean value of event samples (level)
scaled_mean – mean scaled to the bare level emission (mean/mode)
stdv – standard deviation of event samples (noise)
scaled_stdv – stdv scaled to the bare stdv emission (mode)
start – start time of event /s
length – length of event /s
model_level – modelled event level, i.e. the level emission associated with the kmer (the kmer field), scaled to the data
model_scaled_level – bare level emission
model_sd – modelled event noise, i.e. the sd emission associated with the kmer (the kmer field), scaled to the data
model_scaled_sd – bare noise emission
seq_pos – aligned sequence position, position on Viterbi path
p_seq_pos – posterior probability of states on Viterbi path
kmer – kmer identity of seq_pos
mp_pos – aligned sequence position, position with highest posterior
p_mp_pos – posterior probability of most probable states
mp_kmer – kmer identity of mp_pos
good_emission – whether or not the HMM has tagged event as fitting the model
-
fast5_research.util.
compute_movement_stats
(path)[source]¶ Compute movement stats from a mapping state path
- Parameters
path – np.ndarray containing position in reference. Negative values are interpreted as “bad emissions”.
-
fast5_research.util.
create_basecall_1d_output
(raw_events, scale, path, model, post=None)[source]¶ Create the annotated event table and basecalling summaries similar to chimaera.
- Parameters
raw_events – np.ndarray with fields mean, stdv, start, and length.
scale – dragonet.basecall.scaling.Scaler object (or object with attributes shift, scale, drift, var, scale_sd, and var_sd).
path – list containing state indices with respect to model.
model – dragonet.util.model.Model object.
post – Two-dimensional np.ndarray containing posteriors (event, state).
quality_data – np.ndarray containing quality_data, used to annotate events.
- Returns
A tuple of:
the annotated input event table
a dict of result
-
fast5_research.util.
create_mapping_output
(raw_events, scale, path, model, seq, post=None, n_states=None, is_reverse=False, substates=False)[source]¶ Create the annotated event table and summaries similar to chimaera
- Parameters
raw_events – np.ndarray with fields mean, stdv, start, and length.
scale – dragonet.basecall.scaling.Scaler object (or object with attributes shift, scale, drift, var, scale_sd, and var_sd).
path – list containing state indices with respect to model.
model – dragonet.util.model.Model object.
seq – String representation of the reference sequence.
post – Two-dimensional np.ndarray containing posteriors (event, state).
is_reverse – Mapping refers to ‘-’ strand (bool).
substates – Mapping contains substates?
- Returns
A tuple of:
the annotated input event table
a dict of result
-
fast5_research.util.
dtype_descr
(arr)[source]¶ Get arr.dtype.descr. Views of structured arrays in which columns have been re-ordered no longer support arr.dtype.descr; see https://github.com/numpy/numpy/commit/dd8a2a8e29b0dc85dca4d2964c92df3604acc212
-
fast5_research.util.
file_has_fields
(fname, fields=None)[source]¶ Check that a tsv file has given fields
- Parameters
fname – filename to read. If the filename extension is gz or bz2, the file is first decompressed.
fields – list of required fields.
- Returns
boolean
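A header check of this kind can be sketched with the standard library alone. The sketch below is an assumption about the general approach, not the library's code: it reads compressed files transparently instead of decompressing them first, treats the first line as a tab-separated header, and uses hypothetical helper names.

```python
import bz2
import gzip
import os
import tempfile


def open_maybe_compressed(fname):
    """Open a text file, transparently handling .gz and .bz2 extensions."""
    if fname.endswith('.gz'):
        return gzip.open(fname, 'rt')
    if fname.endswith('.bz2'):
        return bz2.open(fname, 'rt')
    return open(fname, 'r')


def has_fields_sketch(fname, fields=None):
    """Return True if every name in fields appears in the tsv header."""
    with open_maybe_compressed(fname) as handle:
        header = handle.readline().rstrip('\n').split('\t')
    if fields is None:
        return True
    return set(fields).issubset(header)


# Demonstration on a small gzip-compressed table written to a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'strands.tsv.gz')
    with gzip.open(path, 'wt') as handle:
        handle.write('filename\tstrand\nread1.fast5\t+\n')
    print(has_fields_sketch(path, ['filename']))             # True
    print(has_fields_sketch(path, ['filename', 'channel']))  # False
```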
-
fast5_research.util.
get_changes
(data, ignore_cols=None, use_cols=None)[source]¶ Return only rows of a structured array which are not equal to the previous row.
- Parameters
data – Numpy record array.
ignore_cols – iterable of column names to ignore in checking for equality between rows.
use_cols – iterable of column names to include in checking for equality between rows (only used if ignore_cols is None).
- Returns
Numpy record array.
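The row-filtering idea can be illustrated without NumPy. This sketch works on a list of dicts rather than a record array, keeps the first row (it has no predecessor to compare against, which is an assumption about the intended behaviour), and uses a hypothetical function name.

```python
def changed_rows(rows, ignore_cols=None, use_cols=None):
    """Keep rows that differ from the previous row in the compared columns."""
    if not rows:
        return []
    cols = list(rows[0].keys())
    if ignore_cols is not None:
        ignored = set(ignore_cols)
        cols = [c for c in cols if c not in ignored]
    elif use_cols is not None:
        # use_cols is only consulted when ignore_cols is None.
        used = set(use_cols)
        cols = [c for c in cols if c in used]
    kept = [rows[0]]
    for prev, row in zip(rows, rows[1:]):
        if any(prev[c] != row[c] for c in cols):
            kept.append(row)
    return kept


rows = [
    {'time': 0, 'voltage': -180},
    {'time': 1, 'voltage': -180},
    {'time': 2, 'voltage': -190},
]
print(changed_rows(rows, ignore_cols=['time']))
# [{'time': 0, 'voltage': -180}, {'time': 2, 'voltage': -190}]
```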
-
fast5_research.util.
group_vector
(arr)[source]¶ Group a vector by unique values.
- Parameters
arr – input vector to be grouped.
- Returns
a dictionary mapping unique values to arrays of indices of the input vector.
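As a rough illustration of the returned structure, the following pure-Python sketch groups indices by value. It returns lists of indices rather than arrays, and the function name is hypothetical.

```python
from collections import defaultdict


def group_indices(values):
    """Map each unique value to the list of positions at which it occurs."""
    groups = defaultdict(list)
    for index, value in enumerate(values):
        groups[value].append(index)
    return dict(groups)


print(group_indices(['strand', 'pore', 'strand']))
# {'strand': [0, 2], 'pore': [1]}
```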
-
fast5_research.util.
kmer_overlap_gen
(kmers, moves=None)[source]¶ From a list of kmers return the character shifts between them. (Movement from i to i+1 entry, e.g. [AATC,ATCG] returns [0,1]). Allowed moves may be specified in moves argument in order of preference. Taken from dragonet.bio.seq_tools
- Parameters
kmers – sequence of kmer strings.
moves – allowed movements, if None all movements to length of kmer are allowed.
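The shift calculation can be sketched as follows. This is an illustrative reimplementation under assumptions, not the dragonet code: it returns a list rather than a generator, reports a leading 0 for the first kmer (matching the [0,1] example above), and picks the first allowed move for which the overlapping characters agree.

```python
def kmer_shifts(kmers, moves=None):
    """Return the character shift between each kmer and the one before it."""
    kmers = list(kmers)
    if not kmers:
        return []
    k = len(kmers[0])
    # With moves=None every shift from 0 up to the kmer length is allowed.
    allowed = range(k + 1) if moves is None else moves
    shifts = [0]
    for prev, curr in zip(kmers, kmers[1:]):
        for move in allowed:
            if prev[move:] == curr[:k - move]:
                shifts.append(move)
                break
        else:
            raise ValueError('no allowed move joins {} and {}'.format(prev, curr))
    return shifts


print(kmer_shifts(['AATC', 'ATCG']))         # [0, 1]
print(kmer_shifts(['ACG', 'GTT']))           # [0, 2]
print(kmer_shifts(['AAA', 'AAA'], [1, 2]))   # [0, 1]
```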
-
fast5_research.util.
mad
(data, factor=None, axis=None, keepdims=False)[source]¶ Compute the Median Absolute Deviation, i.e., the median of the absolute deviations from the median, and (by default) adjust by a factor for asymptotically normal consistency.
- Parameters
data – An ndarray object
factor – Factor to scale MAD by. Default (None) is to be consistent with the standard deviation of a normal distribution (i.e. mad( N(0,sigma^2) ) = sigma).
axis – For multidimensional arrays, which axis to calculate the median over.
keepdims – If True, axis is kept as dimension of length 1
- Returns
the (scaled) MAD
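For a one-dimensional input the calculation can be sketched with the standard library. The default factor here is derived as 1 / Φ⁻¹(0.75) ≈ 1.4826, the usual constant for normal consistency; whether the library uses exactly this value is an assumption, as is the function name, and the axis and keepdims arguments are omitted.

```python
from statistics import NormalDist, median


def mad_sketch(data, factor=None):
    """Median absolute deviation of a 1D sequence, scaled by factor."""
    if factor is None:
        # Makes the MAD of normally distributed data estimate sigma.
        factor = 1.0 / NormalDist().inv_cdf(0.75)
    centre = median(data)
    return factor * median(abs(x - centre) for x in data)


values = [1, 2, 3, 4, 100]
print(mad_sketch(values, factor=1.0))   # 1.0
print(round(mad_sketch(values), 4))     # 1.4826
```

The outlier 100 barely affects the result, which is the reason for preferring the MAD over the standard deviation for noisy signal data.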
-
fast5_research.util.
mean_qscore
(scores)[source]¶ Returns the phred score corresponding to the mean of the probabilities associated with the phred scores provided. Taken from chimaera.common.utilities.
- Parameters
scores – Iterable of phred scores.
- Returns
Phred score corresponding to the average error rate, as estimated from the input phred scores.
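The averaging is done on error probabilities rather than on the scores themselves. A minimal sketch, using the standard relation Q = -10 log10(p) and a hypothetical function name:

```python
import math


def mean_qscore_sketch(scores):
    """Phred score of the mean error probability implied by the scores."""
    probs = [10 ** (-q / 10.0) for q in scores]
    mean_error = sum(probs) / len(probs)
    return -10.0 * math.log10(mean_error)


print(round(mean_qscore_sketch([10, 20]), 2))  # 12.6
```

Note that the result is lower than the arithmetic mean of the scores (15), because low-quality bases dominate the average error rate.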
-
fast5_research.util.
med_mad
(data, factor=None, axis=None, keepdims=False)[source]¶ Compute the Median Absolute Deviation, i.e., the median of the absolute deviations from the median, and the median
- Parameters
data – An ndarray object
factor – Factor to scale MAD by. Default (None) is to be consistent with the standard deviation of a normal distribution (i.e. mad( N(0,sigma^2) ) = sigma).
axis – For multidimensional arrays, which axis to calculate over
keepdims – If True, axis is kept as dimension of length 1
- Returns
a tuple containing the median and MAD of the data
-
fast5_research.util.
qstring_to_phred
(quality)[source]¶ Compute standard phred scores from a quality string.
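As an illustration, assuming “standard” means the Sanger/Phred+33 encoding used in fastq files, the conversion subtracts 33 from each character's ASCII code. The function name is hypothetical, and a plain list is returned here rather than an array.

```python
def qstring_to_phred_sketch(quality):
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(char) - 33 for char in quality]


print(qstring_to_phred_sketch('!+5I'))  # [0, 10, 20, 40]
```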
-
fast5_research.util.
readtsv
(fname, fields=None, **kwargs)[source]¶ Read a tsv file into a numpy array with required field checking
- Parameters
fname – filename to read. If the filename extension is gz or bz2, the file is first decompressed.
fields – list of required fields.
-
fast5_research.util.
seq_to_kmers
(seq, length)[source]¶ Turn a string into a list of (overlapping) kmers.
e.g. perform the transformation:
‘ATATGCG’ => [‘ATA’, ‘TAT’, ‘ATG’, ‘TGC’, ‘GCG’]
- Parameters
seq – character string
length – length of kmers in output
- Returns
A list of overlapping kmers
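The transformation above amounts to a sliding window of the given length. A minimal sketch, with a hypothetical function name:

```python
def seq_to_kmers_sketch(seq, length):
    """Return all overlapping substrings of the given length, in order."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


print(seq_to_kmers_sketch('ATATGCG', 3))
# ['ATA', 'TAT', 'ATG', 'TGC', 'GCG']
```

A sequence of n characters yields n - length + 1 kmers, and an empty list if the sequence is shorter than the requested length.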
-
fast5_research.util.
validate_event_table
(table)[source]¶ Check if an object contains all columns of a basic event array.
-
fast5_research.util.
validate_model_table
(table)[source]¶ Check if an object contains all columns of a dragonet Model.