BAM¶

Format version: 0.1

Binary Alignment Map (BAM) is the lossless, compressed binary representation of Sequence Alignment/Map (SAM) files, and holds the comprehensive raw data of genome sequencing.

BAM files generated by MinKNOW are compliant with the SAM Specification, and can be manipulated using Samtools.

Paths¶

The following path patterns are used to place the data on disk:

File	Path pattern
BAM file	`bam{basecall_status}{duplex_status}/{alias}/{flow_cell_id}{basecall_status}{duplex_status}_{alias_}{short_protocol_run_id}_{short_run_id}_{batch_number}.bam`
BAI file	`bam{basecall_status}{duplex_status}/{alias}/{flow_cell_id}{basecall_status}{duplex_status}_{alias_}{short_protocol_run_id}_{short_run_id}_{batch_number}.bam.bai`

See the Patterns documentation for more information on file patterns.

Read batching¶

The following batching options are used by default:

Option	Value
Duration	`3600s`

For more information on batching see Batching.

Fields¶

Read groups¶

`ID`¶

Regex ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})_([a-z0-9_@\.])+(_([A-Za-z0-9_@\.]+))?

Required

Examples
`e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_barcode02`
`e4994c62-93f9-439a-bc8f-d20c95a137a5_unknown_barcode02`
`e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_29d8704b`

A group identifier keyed specifically to a run, basecall model, and barcode (if enabled).

If model_version_id is missing, it is replaced with the text "unknown".

If barcode_data::arrangement is missing, the suffix is not appended.
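As a minimal sketch of the rules above, the function below assembles a read group ID from its parts. The function name and parameter names are hypothetical, and the underscore separator is an assumption inferred from the examples rather than something this page states outright.

```python
from typing import Optional


def build_read_group_id(
    run_id: str,
    model_version_id: Optional[str] = None,
    barcode_arrangement: Optional[str] = None,
) -> str:
    """Join the run ID, basecall model and optional barcode suffix."""
    # A missing basecall model is written as the literal text "unknown".
    parts = [run_id, model_version_id or "unknown"]
    # The barcode suffix is only appended when an arrangement is available.
    if barcode_arrangement:
        parts.append(barcode_arrangement)
    # Assumption: parts are separated by "_", as in the examples above.
    return "_".join(parts)


RUN_ID = "e4994c62-93f9-439a-bc8f-d20c95a137a5"
print(build_read_group_id(RUN_ID, "rna004_130bps_fast@v5.1.0", "barcode02"))
print(build_read_group_id(RUN_ID, None, "barcode02"))
```

Because model names themselves contain underscores, splitting an ID back into its parts is ambiguous; reading the model from the read group `DS` field is likely to be more robust.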

`DT`¶

Regex \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(Z|\.\d{6}\+\d{2}:\d{2})

Examples
`2025-01-06T10:06:36.778368+00:00`
`2025-01-06T10:06:36Z`

The start time of the sequencing run.

Formatted as ISO 8601.
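Both example forms can be read with the Python standard library. This is a sketch only: `parse_dt` is a hypothetical helper, and it assumes the value always carries either a trailing `Z` or an explicit offset, as the regex above requires.

```python
from datetime import datetime, timezone


def parse_dt(value: str) -> datetime:
    """Parse a DT value into a timezone-aware datetime in UTC."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value).astimezone(timezone.utc)


print(parse_dt("2025-01-06T10:06:36.778368+00:00"))
print(parse_dt("2025-01-06T10:06:36Z"))
```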

`DS`¶

Regex

runid=([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\s+basecall_model=([a-z0-9_@\.]+)(\s+modbase_models=([A-Za-z0-9_@\.]+))?

Required

Common fields FASTQ: runid; Sequencing summary: run_id

Examples
`runid=e4994c62-93f9-439a-bc8f-d20c95a137a5 basecall_model=rna004_130bps_fast@v5.1.0`
`runid=e4994c62-93f9-439a-bc8f-d20c95a137a5 basecall_model=rna004_130bps_fast@v5.1.0 modbase_models=rna004_130bps_hac@v5.1.0_inosine_m6A@v1`

runid, basecall_model and, optionally, modbase_models formatted into a space-separated string.

runid contains the protocol_run_id from the MinKNOW experiment that generated the BAM files. Note that modbase_models only appears if modbase calling was performed, and contains a comma-separated list of the Dorado modbase model names used.
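The space-separated `key=value` layout described above can be unpacked as follows. This is a minimal sketch with a hypothetical function name; it assumes that values never contain whitespace, which holds for the examples shown.

```python
def parse_read_group_ds(ds: str) -> dict:
    """Split a read group DS value into its key=value fields."""
    fields = {}
    for token in ds.split():
        key, _, value = token.partition("=")
        fields[key] = value
    # modbase_models is optional and holds a comma-separated list of names.
    models = fields.get("modbase_models")
    fields["modbase_models"] = models.split(",") if models else []
    return fields


example = (
    "runid=e4994c62-93f9-439a-bc8f-d20c95a137a5 "
    "basecall_model=rna004_130bps_fast@v5.1.0"
)
print(parse_read_group_ds(example))
```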

`LB`¶

Regex [a-zA-Z0-9_\.-]+

Common fields FASTQ: sample_id; Sequencing summary: sample_id; Sample sheet: sample_id

Examples
`My_Sample`
`my-sample-1`

The sample library identifier. Included only if data is present.

`PL`¶

Regex ONT

Required

Examples
`ONT`

The string "ONT".

`PM`¶

Regex [A-Z-0-9]+

Examples
`MN12345`

The device identifier used for sequencing.

`PU`¶

Regex [A-Z0-9_-]+

Common fields FASTQ: flow_cell_id; Sample sheet: flow_cell_id

Examples
`FAB12345`

The unique identifier for the flow cell.

`SM`¶

Regex barcode([0-9]+)

Only When barcoding

Common fields FASTQ: barcode; Sequencing summary: barcode_arrangement; Sample sheet: barcode

Examples
`barcode01`

The barcode identified for the read.

Included only if data is present and the arrangement is not "unclassified".

`al`¶

Regex unclassified|[A-Za-z0-9\-_\.]+

Only When barcoding

Common fields FASTQ: barcode_alias; Sequencing summary: alias; Sample sheet: alias

Examples
`my_sample`
`sample01`

User-specified identifier used for the barcode, if available, otherwise the arrangement name.

Included only if data is present and the arrangement is not "unclassified".

This will be the same barcode descriptor Dorado uses for generating the output folder names, which is the sample sheet alias if available otherwise defaulting to the arrangement name.

Program groups¶

`ID`¶

Regex (basecaller|barcoder|aligner|dorado_aligner|minknow)(_[0-9]+)?

Required

Examples
`aligner_1`
`basecaller`
`dorado_aligner_3`
`minknow`

<program_id>{_<unique_id>}

The SAM specification requires this field to be unique within a file, and it may be modified on merging to ensure uniqueness.

Where the program_id would not be unique within the file, e.g. when the input is a BAM/SAM file with the program_id already present, the @PG record will have the lowest zero-based suffix that can be appended to the ID to ensure its uniqueness, e.g. "_0".

Where an application has used the minimap aligner on reads contained in the file, as well as outputting a @PG record for itself it will also output an additional @PG record for minimap. Dorado aligner records will have the ID "dorado_aligner" to disambiguate them from minimap aligner records.

Programs identify themselves as follows:

MinKNOW: "minknow"
ont_basecall_client: "basecaller"
dorado basecaller: "basecaller"
dorado barcoder: "barcoder"
dorado aligner: "dorado_aligner"
minimap2 (if used in the application): "aligner"
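The "lowest zero-based suffix" rule can be illustrated with a short sketch. The function name is hypothetical, and the code shows one reading of the rule as described here, not the implementation used by MinKNOW or Dorado.

```python
def unique_program_id(program_id: str, existing_ids: set) -> str:
    """Return program_id, adding the lowest free zero-based suffix if needed."""
    # No clash with the header: the plain ID is used unchanged.
    if program_id not in existing_ids:
        return program_id
    # Otherwise try "_0", "_1", ... until an unused ID is found.
    suffix = 0
    while f"{program_id}_{suffix}" in existing_ids:
        suffix += 1
    return f"{program_id}_{suffix}"


header_ids = {"basecaller", "basecaller_0", "minknow"}
print(unique_program_id("basecaller", header_ids))
print(unique_program_id("aligner", header_ids))
```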

`PN`¶

Regex (dorado|minimap2|ont_basecall_client|minknow)

Required

Examples
`dorado`
`minimap2`
`ont_basecall_client`
`minknow`

"dorado" for Dorado application program records. "minimap2" for minimap program records with ID:aligner. "ont_basecall_client", etc.

`CL`¶

Regex .*

Examples
`dorado basecaller hac pod5s/ > calls.bam`

Command line of invoked application

`VN`¶

Regex [0-9a-z.~\+\-]+

Required

Examples
`0.9.1`
`5.1.0`
`6.2.0~pre-a7305ca`
`0.0.0.28546+10c25eb94`

`DS`¶

Regex .*

Common fields FASTQ: basecall_gpu

Examples
`gpu:NVIDIA A100 80GB PCIe`
`gpu:NVIDIA A100 80GB PCIe|Quadro GV100`

Each GPU type used by the basecaller appears once, since only the GPU types are of interest, not the total number of GPUs. For a PromethION the tag would read, e.g., DS:gpu:NVIDIA A100 80GB PCIe. If there are multiple GPU types on the system they are separated by a vertical bar. If a GPU was not used, or the reads were called on Apple Silicon, this field is not present.
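A small sketch of how such a value might be built and read back is given below. Both function names are hypothetical, and returning `None` to represent an omitted field is a convention of this sketch only.

```python
def format_gpu_ds(gpu_names):
    """Build the DS value from the GPUs used, listing each type once."""
    # dict.fromkeys drops repeats while keeping first-seen order.
    unique = list(dict.fromkeys(gpu_names))
    if not unique:
        return None  # no GPU was used, so the field is omitted
    return "gpu:" + "|".join(unique)


def parse_gpu_ds(ds):
    """Return the GPU type names from a DS value such as 'gpu:A|B'."""
    prefix = "gpu:"
    if not ds.startswith(prefix):
        return []
    return ds[len(prefix):].split("|")


gpus = ["NVIDIA A100 80GB PCIe"] * 4 + ["Quadro GV100"]
print(format_gpu_ds(gpus))
```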

Read tags¶

`RG`:Z: ¶

ID of the read group to which this read belongs. If present its value must match the ID field of a Read Group record in the header section.

`qs`:f: ¶

Read mean basecall qscore

`mx`:i: ¶

Read mux (meta: read_data::mux)

`ch`:i: ¶

Common fields FASTQ: ch; Sequencing summary: channel

Read channel (meta: read_data::channel)

`rn`:i: ¶

Read number (meta: read_data::read_number)

`st`:Z: ¶

Common fields FASTQ: start_time; Sequencing summary: start_time

Read start time metadata field:

`f5`:Z: ¶

fast5 file name (read_data::filename). N.B. only the filename is written, so any personally identifiable data in the path is not included.

`ns`:i: ¶

the number of samples in the signal (read_data::duration)

`ts`:i: ¶

The number of samples trimmed from the start of the signal (equivalent to read_data::duration - read_data::trimmed_duration)

`mv`:B:c ¶

sequence to signal move table (this has already been done as per Move table format for BAM file output)

`sm`:f: ¶

scaling median: basecall_data::scaling_median

`sd`:f: ¶

scaling dispersion (also sometimes referred to as mad, spread): basecall_data::scaling_med_abs_dev

`sv`:Z: ¶

"med_mad" or "quantile", depending on which scaling method was used by the basecaller: basecall_data::scaling_version

`du`:f: ¶

duration of the read (in seconds)

`pi`:Z: ¶

Common fields FASTQ: parent_read_id; Sequencing summary: parent_read_id

parent read id for a split read

`MN`:i: ¶

Required

Only When modified_bases

Length of SEQ field when MM/ML tags were generated.

`ML`:B:C ¶

Required

Only When modified_bases

Base modification probabilities

`MM`:Z: ¶

Required

Only When modified_bases

Base modifications / methylation

`dx`:i: ¶

Duplex read indicator. 1 for duplex reads, 0 for simplex reads without duplex offspring, -1 for simplex reads with duplex offspring.

`bh`:i: ¶

Only When bed_file

Number of BED file hits. This tag is only included if a BED file was specified when aligning.

`pt`:i: ¶

Only When poly_a_tail_estimation

Estimated number of bases in the polyA/T tail. This tag is only included if --estimate_poly_a was specified by the client.
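When a BAM file is viewed as SAM text (for example with `samtools view`), the read tags listed above appear as tab-separated `TAG:TYPE:VALUE` columns after the eleven mandatory fields. The sketch below parses those columns using only the standard library. The function name is hypothetical and the example record is invented purely for illustration; it is not real instrument output.

```python
def parse_optional_fields(sam_line: str) -> dict:
    """Parse the TAG:TYPE:VALUE columns of one SAM text record."""
    tags = {}
    # The first 11 tab-separated columns are the mandatory SAM fields.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        # Split at most twice so that values containing ":" stay intact.
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        elif type_code == "B":
            # Arrays look like "c,5,1,0": a subtype letter, then the values.
            subtype, *items = value.split(",")
            convert = float if subtype == "f" else int
            tags[tag] = [convert(item) for item in items]
        else:
            # A, Z and H values are kept as strings.
            tags[tag] = value
    return tags


# Invented record: an unmapped read carrying a handful of the tags above.
EXAMPLE_RECORD = "\t".join([
    "read1", "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "!!!!",
    "qs:f:12.5", "ch:i:101", "mx:i:3", "dx:i:0", "mv:B:c,5,1,0,1,1",
    "st:Z:2025-01-06T10:06:40.000000+00:00",
])
tags = parse_optional_fields(EXAMPLE_RECORD)
print(tags)
```

For production use, a dedicated BAM library is likely a better fit than hand-parsing text; the sketch is only meant to show how the tag names and types on this page map onto a record.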
