This page contains breif descriptions of the most common Tombo commands. For more detail on each commands’ options and further algorithm details, please see the corresponding documentation sections.
Re-squiggle (Raw Signal Genomic Alignment)¶
The re-squiggle algorithm aligns raw signal (electric current nanopore measurements) to reference genomic or transcriptomic sequence.
One of the major assumptions of the re-squiggle algorithm is that the provided reference sequence is correct. Thus for a poorly assembled reference or divergent sample, an assembly polishing step (possibly from the same data/sample) may improve results.
resquiggle command will add infomation (the mapped reference location and the raw signal to sequence assignment) to the read files provided (in FAST5 format), as well as producing an index file for more efficient file access in downstream commands.
resquiggle command must be run before any further processing by Tombo commands.
Note: Tombo currenly includes default canonical models for both DNA or RNA data (supporting R9.4 and R9.5; 1D and 1D^2; R9.*.1 chemistries). Analysis of other nanopore data types is not supported at this time (e.g. R7 data). If DNA or RNA sample type is not explicitly specified (via
--rna options) the sample type will be detected automatically from the raw read files.
For more details see the re-squiggle documentation.
# annotate raw FAST5s with FASTQ files produced from the same reads # skip this step if raw read files already contain basecalls tombo preprocess annotate_raw_with_fastqs --fast5-basedir <fast5s-base-directory> --fastq-filenames <reads.fastq> tombo resquiggle <fast5s-base-directory> <reference-fasta> --processes 4 --num-most-common-errors 5
Modified Base Detection¶
Tombo provides four (including two types of sample comparison) methods for the investigation of modified bases (within the
tombo detect_modifications command group). Each method has different advantages and requirements.
All modified base detection methods, except the
level_sample_compare method, produce per-read, per-reference position test statistics (which can be saved via the
--per-read-statistics-basename option). A threshold is then applied to these statistics to produce an estimate for the fraction of reads that appear modified at each reference location.
Specific alternative base detection (recommended)
tombo detect_modifications alternative_modelcommand.
This method identifies sites where signal matches a specific alternative base expected signal levels better than the canonical expected levels producing a statistic similar to a log likelihood ratio.
All-context alternative DNA models are currently available for 5-methylcytosine (5mC) and N6-methyladenosine (6mA; not currently available for RNA).
More accurate motif specific models are available for dam and dcm methylation (found in E. coli) and CpG methylation (found in human samples).
De novo canonical model comparison
tombo detect_modifications de_novocommand.
This method identifies sites where signal deviates from the expected canonical signal levels.
While this method has the highest error rates, it can be used effectively on any sample and is particularly useful for motif discovery for motif-specific modifications (e.g. bacterial samples).
These two methods requires the production of a second set of reads for comparison.
tombo detect_modifications level_sample_comparecommand.
This method performs either a ks-test (default), u-test or t-test to identify sites where signal levels deviate between two samples.
This method is ideal for high coverage samples (in order to accurately estimate the effect size measures) and comparison of two potentailly non-canonical samples (e.g. a KO experiment).
tombo detect_modifications model_sample_comparecommand.
This uses a canonical control sample to adjust the expected signal levels due to un-modeled effects.
This adjusted model is then used to test for modifications as in the de novo method.
This was the
sample_comparemethod prior to version 1.5.
Both the sample comparison and the de novo methods may not identify the exact modified base location (as the shifted signal does not always center on a modified base) and gives no information as to the identity of a modified base.
The result of all
tombo detect_modifications calls will be a binary statistics file(s), which can be passed to other Tombo commands.
For more details see the modified base detection documentation.
Genome Browser File Output¶
In order to output re-squiggle and/or modified base detection results in a genome browser compatible format (either wiggle format or bedgraph format), the
tombo text_output genome_browser command is provided.
tombo text_output browser_files --fast5-basedirs <fast5s-base-directory> \ --statistics-filename sample_alt_model.CpG.tombo.stats \ --browser-file-basename sample_alt_model --file-types dampened_fraction coverage
--file-types available are:
valid_coveragederived from a (non-
level_sample_compare) statistics file
statisticderived from a
coveragederived from the Tombo index (fast)
differencederived from read FAST5 files (slow)
Reference Sequence Output¶
For modified base analysis pipelines (e.g. motif detection), it may be useful to output the reference sequence surrounding the most likely modified sites. The
text_output signif_sequence_context command is provided for this purpose.
tombo text_output signif_sequence_context --statistics-filename sample_alt_model.6mA.tombo.stats \ --genome-fasta <reference-fasta> --sequences-filename sample_alt_model.6mA.most_signif.fasta
Example meme command line modified base motif detection command.
./meme -oc motif_output.meme -dna -mod zoops sample_alt_model.6mA.most_signif.fasta
For more details see the text output documentation.
Tombo provides many plotting functions for the visualization of modified bases within raw nanopore signal.
Most plotting commands are reference-anchored. That is the normalized raw signal is plotted as the re-squiggle algorithm has assigned it to the reference sequence.
Each reference anchored plotting command allows for the selection of reference positions based on generally applicable criterion.
tombo plot max_coverage --fast5-basedirs <fast5s-base-directory> --plot-standard-model tombo plot motif_centered --fast5-basedirs <fast5s-base-directory> --motif AWC \ --genome-fasta genome.fasta --control-fast5-basedirs <control-fast5s-base-directory> tombo plot per_read --per-read-statistics-filename <per_read_statistics.tombo.stats> \ --genome-locations chromosome:1000 chromosome:2000:- \ --genome-fasta genome.fasta
For regions with higher coverage, several over-plotting options are available. For those options producing a distribution, these are taken over each reads average signal assigned to a base. This requires extraction of these levels from all relevant FAST5 files and thus can be slow for very deep coverage regions.
For more details see the plotting documentation.
Read filtering commands can be useful to extract the most out out of a set of reads for modified base detection. Read filtering commands effect only the Tombo index file, and so filters can be cleared or applied iteratively without re-running the re-squiggle command. Five filters are currently made available (
# filter reads to a specific genomic location tombo filter genome_locations --fast5-basedirs path/to/native/rna/fast5s/ \ --include-regions chr1:0-10000000 # apply a more strigent observed to expected signal score (default: 1.1 for DNA reads) tombo filter raw_signal_matching --fast5-basedirs path/to/native/rna/fast5s/ \ --signal-matching-score 1.0
For more details see the filter documentation.