BAM
Format version: 0.1
Binary Alignment Map (BAM) is the lossless, compressed binary representation of Sequence Alignment Map (SAM) files. It holds the complete read-level data from a sequencing run.
BAM files generated by MinKNOW are compliant with the SAM Specification, and can be manipulated using Samtools.
Paths
The following path patterns are used to place the data on disk:
File | Path pattern |
---|---|
BAM file | bam{basecall_status}{duplex_status}/{alias}/{flow_cell_id}{basecall_status}{duplex_status}_{alias_}{short_protocol_run_id}_{short_run_id}_{batch_number}.bam |
BAI file | bam{basecall_status}{duplex_status}/{alias}/{flow_cell_id}{basecall_status}{duplex_status}_{alias_}{short_protocol_run_id}_{short_run_id}_{batch_number}.bam.bai |
See the Patterns documentation for more information on file patterns.
Read batching
The following batching options are used by default:
Option | Value |
---|---|
Duration | 3600s |
For more information on batching see Batching.
Fields
Read groups
ID
([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})_([a-z0-9_@\.])+(_([A-Za-z0-9_@\.]+))?
Examples |
---|
e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_barcode02 |
e4994c62-93f9-439a-bc8f-d20c95a137a5_unknown_barcode02 |
e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_29d8704b |
A group identifier keyed to a run, basecall model, and barcode (if enabled).
model_version_id: if missing, it is replaced with the text "unknown".
barcode_data::arrangement: if missing, the suffix is not appended.
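As a rough illustration of how the ID is composed, the sketch below separates the UUID-shaped run identifier from the rest of the ID and checks for the "unknown" model placeholder. It assumes that an underscore separates the run ID from the remainder, as in the examples above; the function names are hypothetical and are not part of MinKNOW or Dorado.

```python
import re

# UUID-shaped run identifier, followed by "_" and the rest of the ID.
# The underscore separator is an assumption taken from the examples.
RUN_ID_PREFIX = re.compile(
    r"^([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})_(.+)$"
)


def split_read_group_id(rg_id):
    """Return (run_id, remainder), or None if the UUID prefix is absent."""
    match = RUN_ID_PREFIX.match(rg_id)
    if match is None:
        return None
    return match.group(1), match.group(2)


def model_is_unknown(rg_id):
    """True if the model part of the ID is the "unknown" placeholder."""
    parts = split_read_group_id(rg_id)
    if parts is None:
        return False
    remainder = parts[1]
    return remainder == "unknown" or remainder.startswith("unknown_")
```

The remainder is deliberately not split further: model names contain underscores themselves, so the boundary between the model and an optional suffix cannot be recovered from the string alone.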
DT
\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(Z|\.\d{6}\+\d{2}:\d{2})
Examples |
---|
2025-01-06T10:06:36.778368+00:00 |
2025-01-06T10:06:36Z |
The start time of the sequencing run, formatted as ISO 8601.
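Both timestamp forms shown above can be read with the Python standard library. The following is a minimal sketch, assuming the value has already been extracted from the @RG header line; `parse_dt` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_dt(value):
    """Parse a DT value and return a timezone-aware datetime in UTC."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11,
    # so normalise it to an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value).astimezone(timezone.utc)
```

Converting to UTC makes values in the two forms directly comparable.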
DS
runid=([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\s+basecall_model=([a-z0-9_@\.]+)(\s+modbase_models=([A-Za-z0-9_@\.]+))?
Examples |
---|
runid=e4994c62-93f9-439a-bc8f-d20c95a137a5 basecall_model=rna004_130bps_fast@v5.1.0 |
runid=e4994c62-93f9-439a-bc8f-d20c95a137a5 basecall_model=rna004_130bps_fast@v5.1.0 modbase_models=rna004_130bps_hac@v5.1.0_inosine_m6A@v1 |
runid, basecall_model and, optionally, modbase_models, formatted into a space-separated string.
runid contains the protocol_run_id from the MinKNOW experiment that generated the BAM files.
Note that modbase_models only appears if modbase calling was performed, and contains a comma-separated list of the Dorado modbase model names used.
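The description string can be unpacked with plain string splitting. This is a sketch under the assumption that the fields are whitespace-separated key=value pairs, as in the examples; `parse_read_group_ds` is a hypothetical helper name.

```python
def parse_read_group_ds(ds):
    """Split a read group DS string into its named parts."""
    # Each whitespace-separated item is expected to look like key=value.
    fields = dict(item.split("=", 1) for item in ds.split())
    models = fields.get("modbase_models")
    return {
        "runid": fields["runid"],
        "basecall_model": fields["basecall_model"],
        # Absent when modbase calling was not performed.
        "modbase_models": models.split(",") if models else [],
    }
```

Returning an empty list when modbase_models is absent lets callers iterate without a separate presence check.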
LB
[a-zA-Z0-9_\.-]+
Examples |
---|
My_Sample |
my-sample-1 |
The sample library identifier. Included only if data is present.
PL
ONT
Examples |
---|
ONT |
The string "ONT".
PM
[A-Z-0-9]+
Examples |
---|
MN12345 |
The device identifier used for sequencing.
PU
[A-Z0-9_-]+
FASTQ: flow_cell_id
Sample sheet: flow_cell_id
Examples |
---|
FAB12345 |
The unique identifier for the flow cell.
SM
barcode([0-9]+)
barcoding
Examples |
---|
barcode01 |
The barcode identified for the read.
Included only if data is present and the arrangement is not "unclassified".
al
unclassified|[A-Za-z0-9\-_\.]+
barcoding
Examples |
---|
my_sample |
sample01 |
User-specified identifier for the barcode if available, otherwise the arrangement name.
Included only if data is present and the arrangement is not "unclassified".
This is the same barcode descriptor that Dorado uses to generate output folder names: the sample sheet alias if available, otherwise the arrangement name.
Program groups
ID
(basecaller|barcoder|aligner|dorado_aligner|minknow)(_[0-9]+)?
Examples |
---|
aligner_1 |
basecaller |
dorado_aligner_3 |
minknow |
<program_id>{_<unique_id>}
The SAM specification requires this field to be unique within a file, and it may be modified on merging to ensure uniqueness. Where the program_id would not be unique within the file, e.g. when the input is a BAM/SAM file with the program_id already present, the @PG record has the lowest zero-based suffix that can be appended to the ID to ensure its uniqueness, e.g. "_0".
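The suffix rule can be sketched as follows. This is one reading of the rule as described here, not the implementation used by MinKNOW or Dorado; `unique_program_id` is a hypothetical name, and `existing_ids` stands for the set of @PG IDs already present in the header.

```python
def unique_program_id(program_id, existing_ids):
    """Return program_id, or program_id with the lowest free zero-based suffix."""
    if program_id not in existing_ids:
        return program_id
    suffix = 0
    # Try "_0", "_1", ... until an unused ID is found.
    while f"{program_id}_{suffix}" in existing_ids:
        suffix += 1
    return f"{program_id}_{suffix}"
```

For example, adding a second basecaller record to a header that already contains "basecaller" would yield "basecaller_0" under this reading.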
Where an application has used the minimap2 aligner on reads contained in the file, it outputs a @PG record for itself and an additional @PG record for minimap2.
Dorado aligner records have the ID "dorado_aligner" to distinguish them from minimap2 aligner records.
Programs identify themselves as follows:
MinKNOW: "minknow"
ont_basecall_client: "basecaller"
dorado basecaller: "basecaller"
dorado barcoder: "barcoder"
dorado aligner: "dorado_aligner"
minimap2: (if used in the application) "aligner"
PN
(dorado|minimap2|ont_basecall_client|minknow)
Examples |
---|
dorado |
minimap2 |
ont_basecall_client |
minknow |
"dorado" for Dorado application program records. "minimap2" for minimap program records with ID:aligner. "ont_basecall_client", etc.
CL
.*
Examples |
---|
dorado basecaller hac pod5s/ > calls.bam |
Command line of the invoked application.
VN
[0-9a-z.~\+\-]+
Examples |
---|
0.9.1 |
5.1.0 |
6.2.0~pre-a7305ca |
0.0.0.28546+10c25eb94 |
DS
.*
FASTQ: basecall_gpu
Examples |
---|
gpu:NVIDIA A100 80GB PCIe |
gpu:NVIDIA A100 80GB PCIe|Quadro GV100 |
Each GPU type used by the basecaller appears once, because only the GPU types are recorded, not the number of GPUs. For a PromethION the tag would read, for example, DS:gpu:NVIDIA A100 80GB PCIe. If there are multiple GPU types on the system, they are separated by a vertical bar. If a GPU was not used, or the reads were called on Apple Silicon, this field is not present.
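A small sketch of reading the GPU types back out of this field, assuming the value always starts with the "gpu:" prefix shown in the examples; `gpu_types` is a hypothetical helper name.

```python
def gpu_types(ds):
    """Return the list of GPU type names from a @PG DS value."""
    # The field is absent when no GPU was used, so tolerate None.
    if ds is None or not ds.startswith("gpu:"):
        return []
    return ds[len("gpu:"):].split("|")
```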
Read tags
RG:Z:
ID of the read group to which this read belongs. If present, its value must match the ID field of a Read Group record in the header section.
qs:f:
Read mean basecall qscore
mx:i:
read mux meta: read_data::mux
ch:i:
rn:i:
Read number meta: read_data::read_number
st:Z:
FASTQ: start_time
Sequencing summary: start_time
Read start time metadata field:
f5:Z:
fast5 file name: read_data::filename. N.B. only the file name is written, so that any personally identifiable data in the path is excluded.
ns:i:
the number of samples in the signal (read_data::duration)
ts:i:
"
mv:B:c
sequence to signal move table (this has already been done as per Move table format for BAM file output)
sm:f:
scaling median: basecall_data::scaling_median
sd:f:
scaling dispersion (also sometimes referred to as mad, spread): basecall_data::scaling_med_abs_dev
sv:Z:
"med_mad" or "quantile", depending on which scaling method was used by the basecaller: basecall_data::scaling_version
du:f:
duration of the read (in seconds)
pi:Z:
FASTQ: parent_read_id
Sequencing summary: parent_read_id
parent read id for a split read
MN:i:
modified_bases
Length of SEQ field when MM/ML tags were generated.
ML:B:C
modified_bases
Base modification probabilities
MM:Z:
modified_bases
Base modifications / methylation
dx:i:
Duplex read indicator. 1 for duplex reads, 0 for simplex reads without duplex offspring, -1 for simplex reads with duplex offspring.
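The three values can be mapped to readable labels when filtering reads. The labels below are hypothetical names chosen for illustration; only the integer values come from the description above.

```python
# Labels are illustrative; the integer keys follow the dx tag definition.
DUPLEX_STATUS = {
    1: "duplex",
    0: "simplex",
    -1: "simplex_with_duplex_offspring",
}


def duplex_status(dx):
    """Translate a dx tag value into a label, or "unknown" if unrecognised."""
    return DUPLEX_STATUS.get(dx, "unknown")
```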
bh:i:
bed_file
Number of BED file hits. This tag is only included if a BED file was specified when aligning.
pt:i:
poly_a_tail_estimation
Estimated number of bases in the polyA/T tail. This tag is only included if --estimate_poly_a was specified by the client.
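When the file is viewed as SAM text, each of the read tags above appears as a TAG:TYPE:VALUE field. The sketch below converts such a field into a Python value according to its type code. It covers only the type codes used in this section (i, f, Z and B); `parse_tag` is a hypothetical helper name.

```python
def parse_tag(field):
    """Convert a SAM text optional field such as "qs:f:12.5" to (tag, value)."""
    # Split at most twice so that colons inside the value are preserved.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    if type_code == "B":
        # Array values start with a subtype letter, then comma-separated items.
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(item) for item in items]
    # "Z" and any other type code are returned as plain strings.
    return tag, value
```

Libraries that read BAM directly normally perform this conversion already, so a helper like this is only relevant when working with SAM text output.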