FASTQ¶

Format version: 0.1

FASTQ is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

Paths¶

The following path patterns are used to place the data on disk:

File	Path pattern
FASTQ file	`fastq{basecall_status}{duplex_status}/{alias}/{flow_cell_id}{basecall_status}{duplex_status}_{alias_}{short_protocol_run_id}_{short_run_id}_{batch_number}.fastq.gz`

See the Patterns documentation for more information on file patterns.
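As a rough illustration of how the pattern expands, the sketch below fills the placeholders with Python's `str.format`. Every value passed in is hypothetical and chosen only for illustration; the real values and their exact formatting are defined by the Patterns documentation, not by this example.

```python
# Path pattern copied from the table above.
FASTQ_PATTERN = (
    "fastq{basecall_status}{duplex_status}/{alias}/"
    "{flow_cell_id}{basecall_status}{duplex_status}_{alias_}"
    "{short_protocol_run_id}_{short_run_id}_{batch_number}.fastq.gz"
)


def expand_fastq_path(**fields: object) -> str:
    """Substitute placeholder values into the FASTQ path pattern."""
    return FASTQ_PATTERN.format(**fields)


# All field values below are assumptions made for this example.
example_path = expand_fastq_path(
    basecall_status="_pass",
    duplex_status="",
    alias="barcode01",
    alias_="barcode01_",
    flow_cell_id="FAB12345",
    short_protocol_run_id="e4994c62",
    short_run_id="29d8704b",
    batch_number=0,
)
print(example_path)
```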

Read batching¶

The following batching options are used by default:

Option	Value
Duration	`3600s`

For more information on batching see Batching.

Record structure¶

Oxford Nanopore Technologies FASTQ records contain a key-value section after the required unique read ID. This section should be treated as an unordered set of values.

The approximate structure of a record is:

@<read-id>(\s<key>=<value>)*
ATCG...
+
QQQQ...

For example:

@bd8655fb-383c-45cc-bff3-eb1dc86533e0 key1=value1 key2=value2
ATCG
+
QQQQ
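The header structure shown above can be split into a read ID and a dictionary of attributes. The following is a minimal sketch, not a reference parser: `parse_header` is a hypothetical helper, and it assumes that any whitespace-separated token without an `=` continues the previous value (so that values containing spaces are kept together).

```python
def parse_header(line: str) -> tuple[str, dict[str, str]]:
    """Split a FASTQ header line into (read_id, attributes)."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines must start with '@'")
    tokens = line[1:].split()
    if not tokens:
        raise ValueError("header line has no read id")

    read_id = tokens[0]
    attrs: dict[str, str] = {}
    last_key = None
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep and key:
            # A new key=value pair starts here.
            attrs[key] = value
            last_key = key
        elif last_key is not None:
            # Assumption: a token without '=' belongs to the previous value.
            attrs[last_key] += " " + token
        else:
            raise ValueError(f"unexpected token before first key: {token!r}")
    return read_id, attrs


read_id, attrs = parse_header(
    "@bd8655fb-383c-45cc-bff3-eb1dc86533e0 key1=value1 key2=value2"
)
```

Because the attributes are returned as a dictionary, callers look them up by key and do not depend on the order in which they appear in the header.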

Attributes included in the key-value section are listed below.

Required header attributes¶

`RG`:Z: ¶

Regex ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})([a-z0-9@\.])+(([A-Za-z0-9@\.]+))?

Required

Examples
`e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_barcode02`
`e4994c62-93f9-439a-bc8f-d20c95a137a5_unknown_barcode02`
`e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_29d8704b`

ID of the read group to which this read belongs.

`DT`:Z: ¶

Regex \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|\+\d{2}:\d{2})

Required

Examples
`2025-01-06T10:06:36.778368+00:00`
`2025-01-06T10:06:36.778368Z`
`2025-01-06T10:06:36+00:00`
`2025-01-06T10:06:36Z`

The protocol start time of the sequencing run, formatted as RFC 3339.

`ch`:i: ¶

Regex [1-9][0-9]*

Required

Common fields BAM: ch; Sequencing summary: channel

Examples
`512`

The channel the read originated from (equivalent to read_data::channel).

`st`:Z: ¶

Regex \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|\+\d{2}:\d{2})

Required

Examples
`2025-01-06T10:06:36.778368+00:00`
`2025-01-06T10:06:36.778368Z`
`2025-01-06T10:06:36+00:00`
`2025-01-06T10:06:36Z`

Read start time, formatted as RFC 3339. If this read was split from a parent read, the start time is that of the split read.
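Both `DT` and `st` use the same timestamp pattern, so one helper can parse either. The sketch below validates a value against the regex given above and converts it to a timezone-aware `datetime`. The function names are hypothetical, and truncating the fractional part to microseconds is an assumption made to fit Python's `datetime` resolution.

```python
import re
from datetime import datetime, timezone

# Same pattern as the DT and st regex, with the date-time part captured.
TIMESTAMP_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(\.\d+)?(Z|\+\d{2}:\d{2})"
)


def parse_timestamp(text: str) -> datetime:
    """Parse a DT or st value into a timezone-aware datetime."""
    match = TIMESTAMP_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid timestamp: {text!r}")
    base, fraction, offset = match.groups()
    # Keep at most six fractional digits (microseconds), padding with zeros.
    micros = (fraction or ".0")[1:7].ljust(6, "0")
    # Spell out 'Z' as an explicit UTC offset for fromisoformat.
    if offset == "Z":
        offset = "+00:00"
    return datetime.fromisoformat(f"{base}.{micros}{offset}")


def seconds_since_protocol_start(dt_value: str, st_value: str) -> float:
    """Seconds between protocol start (DT) and read start (st)."""
    return (parse_timestamp(st_value) - parse_timestamp(dt_value)).total_seconds()
```

Subtracting the parsed `DT` value from the parsed `st` value gives the time into the run at which a read started.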

`PU`:Z: ¶

Regex [A-Z0-9_-]+

Required

Examples
`FAB12345`

The unique identifier for the flow cell.
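The attributes marked Required above (`RG`, `DT`, `ch`, `st`, `PU`) can be checked with a small validator. This is a sketch under two assumptions: that the attributes have already been parsed into a dictionary keyed by the two-letter names, and that `RG` is only checked for presence. `find_problems` is a hypothetical helper; the patterns are copied from the entries above.

```python
import re

TIMESTAMP = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|\+\d{2}:\d{2})"

# Patterns copied from the attribute entries above.
REQUIRED_PATTERNS = {
    "DT": re.compile(TIMESTAMP),
    "ch": re.compile(r"[1-9][0-9]*"),
    "st": re.compile(TIMESTAMP),
    "PU": re.compile(r"[A-Z0-9_-]+"),
}
# RG is required too, but only its presence is checked in this sketch.
REQUIRED_KEYS = ("RG", *REQUIRED_PATTERNS)


def find_problems(attrs: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in attrs]
    for key, pattern in REQUIRED_PATTERNS.items():
        value = attrs.get(key)
        if value is not None and pattern.fullmatch(value) is None:
            problems.append(f"invalid {key}: {value!r}")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every issue in a record at once.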

`LB`:Z: ¶

Regex [a-zA-Z0-9_\.-]+

Examples
`My_Sample`
`my-sample-1`

The sample library identifier. Set by the user in the GUI as "Sample ID". Absent if not set.

`SM`:Z: ¶

Regex barcode([0-9]+)

Only When barcoding

Examples
`barcode01`

The barcode identified for the read.

Included only if data is present and the arrangement is not "unclassified".

`al`:Z: ¶

Regex unclassified|[A-Za-z0-9\-_\.]+

Only When barcoding

Examples
`my_sample`
`sample01`

User-specified identifier used for the barcode, if available, otherwise the arrangement name.

Included only if data is present and the arrangement is not "unclassified".

This is the same barcode descriptor that Dorado uses to generate the output folder names: the sample sheet alias if available, otherwise the arrangement name.

`pi`:Z: ¶

Regex [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}

Examples
`e4994c62-93f9-439a-bc8f-d20c95a137a5`

Parent read ID for a split read.

`DS`:Z: ¶

Regex .*

Only When gpu_calling

Examples
`gpu:NVIDIA A100 80GB PCIe`
`gpu:NVIDIA A100 80GB PCIe|Quadro GV100`

Each GPU type used by the basecaller appears once, because only the GPU types are of interest, not how many of each are present. For a PromethION, the tag would read, for example, `DS:Z:gpu:NVIDIA A100 80GB PCIe`. If the system has multiple GPU types, they are separated by a vertical bar. If no GPU was used, or the reads were called on Apple Silicon, this field is absent.
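The description above translates into a short parsing step: strip the `gpu:` prefix and split on the vertical bar. The helper name `gpu_types` is hypothetical, and returning an empty list for a value without the `gpu:` prefix is an assumption of this sketch.

```python
def gpu_types(ds_value: str) -> list[str]:
    """Return the GPU type names listed in a DS attribute value."""
    prefix = "gpu:"
    if not ds_value.startswith(prefix):
        # Assumption: values without the gpu: prefix carry no GPU types.
        return []
    names = ds_value[len(prefix):].split("|")
    return [name.strip() for name in names if name.strip()]


single = gpu_types("gpu:NVIDIA A100 80GB PCIe")
mixed = gpu_types("gpu:NVIDIA A100 80GB PCIe|Quadro GV100")
```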

`ns`:i: ¶

The number of samples in the signal.

`qs`:f: ¶

Mean basecall Q-score of the read.

`mx`:i: ¶

The mux the read originated from (equivalent to read_data::mux).

`rn`:i: ¶

The read number of the read (equivalent to read_data::read_number).

`ts`:i: ¶

The number of samples trimmed from the start of the signal (equivalent to read_data::duration - read_data::trimmed_duration).

`sm`:f: ¶

Scaling median (equivalent to basecall_data::scaling_median).

`sd`:f: ¶

Scaling dispersion, also sometimes referred to as MAD or spread (equivalent to basecall_data::scaling_med_abs_dev).

`sv`:Z: ¶

"med_mad" or "quantile", depending on which scaling method was used by the basecaller (equivalent to basecall_data::scaling_version).

`du`:f: ¶

Duration of the read (in seconds).

`dx`:i: ¶

Duplex read indicator. 1 for duplex reads, 0 for simplex reads without duplex offspring, -1 for simplex reads with duplex offspring.
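The three `dx` values can be mapped onto named constants so that downstream code does not compare against bare integers. This is an illustrative sketch; the enum and function names are hypothetical and are not part of the format.

```python
from enum import Enum


class DuplexStatus(Enum):
    """Meaning of the dx attribute values described above."""

    DUPLEX = 1
    SIMPLEX = 0  # simplex read without duplex offspring
    SIMPLEX_WITH_DUPLEX_OFFSPRING = -1


def duplex_status(dx_value: str) -> DuplexStatus:
    """Convert a dx attribute value to a DuplexStatus.

    Raises ValueError for anything other than 1, 0 or -1.
    """
    return DuplexStatus(int(dx_value))
```

For example, a filter that keeps only reads not superseded by a duplex read could drop records whose status is `SIMPLEX_WITH_DUPLEX_OFFSPRING`.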

`pt`:i: ¶

Only When poly_a_tail_estimation

Estimated number of bases in the polyA/T tail. This tag is only included if --estimate_poly_a was specified by the client, but may be absent if polyA/T estimation is explicitly disabled by the configuration overrides.

`pa`:B:i ¶

Only When poly_a_tail_estimation

PolyA/T tail range information. This tag is only included if --estimate_poly_a was specified by the client, but may be absent if polyA/T estimation is explicitly disabled by the configuration overrides.
