BAM

Format version: 0.1

Binary Alignment Map (BAM) is the lossless, compressed binary representation of Sequence Alignment Map (SAM) files, and holds the comprehensive raw data of genome sequencing.

BAM files generated by MinKNOW are compliant with the SAM Specification, and can be manipulated using Samtools.

Paths

The following path patterns are used to place the data on disk:

File Path pattern
BAM file bam{basecall_status}{duplex_status}/{alias}/{flow_cell_id}{basecall_status}{duplex_status}_{alias_}{short_protocol_run_id}_{short_run_id}_{batch_number}.bam
BAI file bam{basecall_status}{duplex_status}/{alias}/{flow_cell_id}{basecall_status}{duplex_status}_{alias_}{short_protocol_run_id}_{short_run_id}_{batch_number}.bam.bai

See the Patterns documentation for more information on file patterns.

Read batching

The following batching options are used by default:

Option Value
Duration 3600s

For more information on batching see Batching.

Fields

Read groups

ID

Regex ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})_([a-z0-9_@\.])+(_([A-Za-z0-9_@\.]+))?
Required
Examples
e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_barcode02
e4994c62-93f9-439a-bc8f-d20c95a137a5_unknown_barcode02
e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_29d8704b

A group identifier keyed to a run, basecall model, and barcode (if enabled).

If model_version_id is missing, it is replaced with the text "unknown".

If barcode_data::arrangement is missing, the barcode suffix is not appended.
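The composition rules above can be sketched as follows. This is a minimal illustration, not MinKNOW's implementation: the function name `read_group_id` is hypothetical, and the underscore separator is an assumption inferred from the examples.

```python
from typing import Optional


def read_group_id(
    run_id: str,
    model_version_id: Optional[str] = None,
    barcode_arrangement: Optional[str] = None,
) -> str:
    """Compose a read group ID from its parts (hypothetical helper)."""
    # A missing model_version_id is replaced with the text "unknown".
    parts = [run_id, model_version_id or "unknown"]
    # The barcode suffix is appended only when an arrangement is present.
    if barcode_arrangement:
        parts.append(barcode_arrangement)
    # Assumption: parts are joined with underscores, as in the examples.
    return "_".join(parts)


print(read_group_id("e4994c62-93f9-439a-bc8f-d20c95a137a5", None, "barcode02"))
# e4994c62-93f9-439a-bc8f-d20c95a137a5_unknown_barcode02
```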

DT

Regex \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(Z|\.\d{6}\+\d{2}:\d{2})
Examples
2025-01-06T10:06:36.778368+00:00
2025-01-06T10:06:36Z

The start time of the sequencing run.

Formatted as ISO 8601.
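Both DT forms shown above can be read into a timezone-aware datetime with the Python standard library. The sketch below is one possible approach; `parse_dt` is a hypothetical helper name, and the pattern is the DT regex from this section.

```python
import re
from datetime import datetime

# The DT regex from this section.
DT_PATTERN = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(Z|\.\d{6}\+\d{2}:\d{2})"
)


def parse_dt(value: str) -> datetime:
    """Validate a DT value and convert it to a timezone-aware datetime."""
    if not DT_PATTERN.fullmatch(value):
        raise ValueError(f"not a valid DT value: {value!r}")
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11,
    # so it is rewritten as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


print(parse_dt("2025-01-06T10:06:36.778368+00:00"))
print(parse_dt("2025-01-06T10:06:36Z"))
```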

DS

Regex runid=([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\s+basecall_model=([a-z0-9_@\.]+)(\s+modbase_models=([A-Za-z0-9_@\.]+))?
Required
Common fields FASTQ: runid; Sequencing summary: run_id
Examples
runid=e4994c62-93f9-439a-bc8f-d20c95a137a5 basecall_model=rna004_130bps_fast@v5.1.0
runid=e4994c62-93f9-439a-bc8f-d20c95a137a5 basecall_model=rna004_130bps_fast@v5.1.0 modbase_models=rna004_130bps_hac@v5.1.0_inosine_m6A@v1

runid, basecall_model and, optionally, modbase_models formatted into a space-separated string.

runid contains the protocol_run_id from the MinKNOW experiment that generated the BAM files. modbase_models appears only if modbase calling was performed, and contains a comma-separated list of the Dorado modbase model names used.
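As an illustration, the DS string can be split into its key=value pairs without a regex. This is a sketch under the assumption that values never contain whitespace, as in the examples; `parse_rg_ds` is a hypothetical helper name.

```python
def parse_rg_ds(ds: str) -> dict:
    """Split a read group DS string into its fields (hypothetical helper)."""
    # Fields are separated by whitespace; each one is a key=value pair.
    fields = dict(item.split("=", 1) for item in ds.split())
    # modbase_models is optional and holds a comma-separated list of names.
    models = fields.get("modbase_models")
    fields["modbase_models"] = models.split(",") if models else []
    return fields


ds = (
    "runid=e4994c62-93f9-439a-bc8f-d20c95a137a5 "
    "basecall_model=rna004_130bps_fast@v5.1.0 "
    "modbase_models=rna004_130bps_hac@v5.1.0_inosine_m6A@v1"
)
print(parse_rg_ds(ds))
```

When modbase_models is absent, the sketch returns an empty list for that key so that callers can iterate over it without a separate presence check.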

LB

Regex [a-zA-Z0-9_\.-]+
Common fields FASTQ: sample_id; Sequencing summary: sample_id; Sample sheet: sample_id
Examples
My_Sample
my-sample-1

The sample library identifier. Included only if data is present.

PL

Regex ONT
Required
Examples
ONT

The string "ONT".

PM

Regex [A-Z-0-9]+
Examples
MN12345

The device identifier used for sequencing.

PU

Regex [A-Z0-9_-]+
Common fields FASTQ: flow_cell_id; Sample sheet: flow_cell_id
Examples
FAB12345

The unique identifier for the flow cell.

SM

Regex barcode([0-9]+)
Only When barcoding
Common fields FASTQ: barcode; Sequencing summary: barcode_arrangement; Sample sheet: barcode
Examples
barcode01

The barcode identified for the read.

Included only if data is present and the arrangement is not "unclassified".

al

Regex unclassified|[A-Za-z0-9\-_\.]+
Only When barcoding
Common fields FASTQ: barcode_alias; Sequencing summary: alias; Sample sheet: alias
Examples
my_sample
sample01

User-specified identifier used for the barcode if available, otherwise the arrangement name.

Included only if data is present and the arrangement is not "unclassified".

This is the same barcode descriptor that Dorado uses to generate output folder names: the sample sheet alias if available, otherwise the arrangement name.

Program groups

ID

Regex (basecaller|barcoder|aligner|dorado_aligner|minknow)(_[0-9]+)?
Required
Examples
aligner_1
basecaller
dorado_aligner_3
minknow

<program_id>{_<unique_id>}

The SAM specification requires this field to be unique within a file, and it may be modified on merging to ensure uniqueness. Where the program_id would not be unique within the file, e.g. when the input is a BAM/SAM file with the program_id already present, the @PG record will have the lowest zero-based suffix that can be appended to the ID to ensure its uniqueness, e.g. "_0".

Where an application has used the minimap aligner on reads contained in the file, it will output an additional @PG record for minimap as well as a @PG record for itself. Dorado aligner records will have the ID "dorado_aligner" to disambiguate them from minimap aligner records.

Programs identify themselves as follows:

MinKNOW: "minknow"
ont_basecall_client: "basecaller"
dorado basecaller: "basecaller"
dorado barcoder: "barcoder"
dorado aligner: "dorado_aligner"
minimap2 (if used in the application): "aligner"
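The suffix rule for program group IDs can be sketched as below. This is an illustrative reading of the rule, not the code used by any of the programs named here; `unique_program_id` is a hypothetical helper, and `existing_ids` stands for the @PG IDs already present in the header.

```python
from typing import Set


def unique_program_id(program_id: str, existing_ids: Set[str]) -> str:
    """Return an ID that does not clash with any ID already in the header."""
    # The bare program_id is used when it is not yet present.
    if program_id not in existing_ids:
        return program_id
    # Otherwise append the lowest zero-based suffix that makes it unique.
    suffix = 0
    while f"{program_id}_{suffix}" in existing_ids:
        suffix += 1
    return f"{program_id}_{suffix}"


print(unique_program_id("basecaller", {"minknow"}))
print(unique_program_id("basecaller", {"basecaller", "basecaller_0"}))
```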

PN

Regex (dorado|minimap2|ont_basecall_client|minknow)
Required
Examples
dorado
minimap2
ont_basecall_client
minknow

The name of the program: "dorado" for Dorado application program records, "minimap2" for minimap program records with ID:aligner, "ont_basecall_client" for ont_basecall_client records, etc.

CL

Regex .*
Examples
dorado basecaller hac pod5s/ > calls.bam

Command line of the invoked application.

VN

Regex [0-9a-z.~\+\-]+
Required
Examples
0.9.1
5.1.0
6.2.0~pre-a7305ca
0.0.0.28546+10c25eb94

DS

Regex .*
Common fields FASTQ: basecall_gpu
Examples
gpu:NVIDIA A100 80GB PCIe
gpu:NVIDIA A100 80GB PCIe|Quadro GV100

Each GPU type used by the basecaller appears once, because only the GPU types are of interest, not how many of each were used. For a PromethION the tag would read, e.g., DS:gpu:NVIDIA A100 80GB PCIe. If there are multiple GPU types on the system, they are separated by a vertical bar. If a GPU was not used, or the reads were called on Apple Silicon, this field will not be present.
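A consumer of this field might recover the list of GPU types as in the sketch below. `gpu_types` is a hypothetical helper; it assumes the value always carries the "gpu:" prefix shown in the examples and treats an absent field as an empty list.

```python
from typing import List, Optional


def gpu_types(ds: Optional[str]) -> List[str]:
    """Return the GPU types named in a program group DS value."""
    prefix = "gpu:"
    # The field is absent when no GPU was used (or on Apple Silicon).
    if ds is None or not ds.startswith(prefix):
        return []
    # Multiple GPU types are separated by a vertical bar.
    return ds[len(prefix):].split("|")


print(gpu_types("gpu:NVIDIA A100 80GB PCIe|Quadro GV100"))
```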

Read tags

RG:Z:

ID of the read group to which this read belongs. If present, its value must match the ID field of a Read Group record in the header section.

qs:f:

Read mean basecall qscore

mx:i:

read mux meta: read_data::mux

ch:i:

Common fields FASTQ: ch; Sequencing summary: channel

read channel meta: read_data::channel

rn:i:

Read number meta: read_data::read_number

st:Z:

Common fields FASTQ: start_time; Sequencing summary: start_time

Read start time metadata field:

f5:Z:

fast5 file name: read_data::filename. N.B. only the filename is written, so any personally identifiable data in the path is not included.

ns:i:

the number of samples in the signal (read_data::duration)

ts:i:

the number of samples trimmed from the start of the signal (equivalent to read_data::duration - read_data::trimmed_duration)

mv:B:c

sequence to signal move table, as described in Move table format for BAM file output

sm:f:

scaling median: basecall_data::scaling_median

sd:f:

scaling dispersion (also sometimes referred to as mad, spread): basecall_data::scaling_med_abs_dev

sv:Z:

"med_mad" or "quantile", depending on which scaling method was used by the basecaller: basecall_data::scaling_version

du:f:

duration of the read (in seconds)

pi:Z:

Common fields FASTQ: parent_read_id; Sequencing summary: parent_read_id

parent read id for a split read

MN:i:

Required
Only When modified_bases

Length of SEQ field when MM/ML tags were generated.

ML:B:C

Required
Only When modified_bases

Base modification probabilities

MM:Z:

Required
Only When modified_bases

Base modifications / methylation

dx:i:

Duplex read indicator. 1 for duplex reads, 0 for simplex reads without duplex offspring, -1 for simplex reads with duplex offspring.
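The three dx values can be mapped to readable labels as in this sketch. The label strings and the name `describe_duplex` are illustrative choices, not part of the format.

```python
def describe_duplex(dx: int) -> str:
    """Translate a dx:i: tag value into a human-readable label."""
    labels = {
        1: "duplex",
        0: "simplex without duplex offspring",
        -1: "simplex with duplex offspring",
    }
    if dx not in labels:
        raise ValueError(f"unexpected dx value: {dx}")
    return labels[dx]


print(describe_duplex(-1))
```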

bh:i:

Only When bed_file

Number of BED file hits. This tag is only included if a BED file was specified when aligning.

pt:i:

Only When poly_a_tail_estimation

Estimated number of bases in the polyA/T tail. This tag is only included if --estimate_poly_a was specified by the client.
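In the SAM text form, each read tag above is written as TAG:TYPE:VALUE. The sketch below converts such fields into Python values using only the type codes that appear in this section (i, f, Z and B); `parse_tag` is a hypothetical helper and the sample values are made up for illustration.

```python
from typing import Any, Tuple


def parse_tag(field: str) -> Tuple[str, Any]:
    """Convert one SAM optional field (TAG:TYPE:VALUE) to a Python value."""
    # Split at most twice so that values containing ":" (e.g. st:Z:) survive.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    if type_code == "B":
        # Array values start with an element subtype, e.g. "c,5,1,0,1".
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(item) for item in items]
    # Z (string) and any other type are returned unchanged.
    return tag, value


# Made-up sample fields, for illustration only.
for field in ["qs:f:12.5", "ch:i:101", "mv:B:c,5,1,0,1", "dx:i:0"]:
    print(parse_tag(field))
```

For real files a SAM/BAM library is the more robust choice, since BAM stores these tags in binary form rather than as text.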