FASTQ

Format version: 0.1

FASTQ is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

Paths

The following path patterns are used to place the data on disk:

File Path pattern
FASTQ file fastq{basecall_status}{duplex_status}/{alias}/{flow_cell_id}{basecall_status}{duplex_status}_{alias_}{short_protocol_run_id}_{short_run_id}_{batch_number}.fastq.gz

See the Patterns documentation for more information on file patterns.

Read batching

The following batching options are used by default:

Option Value
Duration 3600s

For more information on batching see Batching.

Record structure

Oxford Nanopore Technologies FASTQ records contain a key-value section after the required unique read ID. Treat this section as an unordered set of key-value pairs: do not rely on the order in which attributes appear.

The approximate structure of a record is:

@<read-id>(\s<key>=<value>)*
ATCG...
+
QQQQ...

For example:

@bd8655fb-383c-45cc-bff3-eb1dc86533e0 key1=value1 key2=value2
ATCG
+
QQQQ
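
As an illustration of the structure above, the following minimal sketch splits a header line into the read ID and a dictionary of attributes. It assumes the `key=value` form shown in the approximate structure and that values contain no whitespace; the function name `parse_header` is hypothetical and not part of any Oxford Nanopore tool.

```python
def parse_header(header_line: str) -> tuple[str, dict[str, str]]:
    """Split a FASTQ header line into (read_id, attributes).

    Assumes whitespace-separated key=value pairs after the read ID,
    and that values themselves contain no whitespace.
    """
    if not header_line.startswith("@"):
        raise ValueError("FASTQ header lines must start with '@'")

    tokens = header_line[1:].split()
    if not tokens:
        raise ValueError("header line has no read ID")

    # The read ID is the first token; everything after it is key=value.
    read_id, fields = tokens[0], tokens[1:]

    attributes = {}
    for field in fields:
        # Split on the first '=' only, so a value may contain '='.
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"not a key=value pair: {field!r}")
        attributes[key] = value
    return read_id, attributes


read_id, attributes = parse_header(
    "@bd8655fb-383c-45cc-bff3-eb1dc86533e0 key1=value1 key2=value2"
)
```

A dictionary is used because the section is unordered, so attributes are looked up by key rather than by position.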

Attributes included in the key-value section are listed below.

Required header attributes

RG:Z:

Regex ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})([a-z0-9@\.])+(([A-Za-z0-9@\.]+))?
Required
Examples
e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_barcode02
e4994c62-93f9-439a-bc8f-d20c95a137a5_unknown_barcode02
e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_29d8704b

ID of the read group to which this read belongs.

DT:Z:

Regex \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|\+\d{2}:\d{2})
Required
Examples
2025-01-06T10:06:36.778368+00:00
2025-01-06T10:06:36.778368Z
2025-01-06T10:06:36+00:00
2025-01-06T10:06:36Z

The protocol start time of the sequencing run, formatted as RFC 3339.
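
A timestamp in this format can be checked against the regex above and converted to a timezone-aware value. The sketch below is one possible approach using only the Python standard library; `parse_timestamp` is a hypothetical helper name. It truncates fractional seconds to microseconds, because that is the finest resolution `datetime` stores. The st attribute uses the same pattern, so the same helper would apply there.

```python
import re
from datetime import datetime, timezone

# Same pattern as the documented regex, with the date-time part captured.
TIMESTAMP_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(\.\d+)?(Z|\+\d{2}:\d{2})"
)


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2025-01-06T10:06:36.778368Z."""
    match = TIMESTAMP_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid timestamp: {value!r}")
    base, fraction, offset = match.groups()

    # Keep at most six fractional digits and pad shorter fractions.
    micros = (fraction or ".0")[1:7].ljust(6, "0")
    # 'Z' is shorthand for a zero UTC offset.
    if offset == "Z":
        offset = "+00:00"
    return datetime.strptime(
        f"{base}.{micros}{offset}", "%Y-%m-%dT%H:%M:%S.%f%z"
    )


started = parse_timestamp("2025-01-06T10:06:36.778368Z")
```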

ch:i:

Regex [1-9][0-9]*
Required
Common fields BAM: ch; Sequencing summary: channel
Examples
512

The channel the read originated from (equivalent to read_data::channel).

st:Z:

Regex \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|\+\d{2}:\d{2})
Required
Examples
2025-01-06T10:06:36.778368+00:00
2025-01-06T10:06:36.778368Z
2025-01-06T10:06:36+00:00
2025-01-06T10:06:36Z

The read start time, formatted as RFC 3339. If this read is a split from a parent read, the start time is that of the split read.

PU:Z:

Regex [A-Z0-9_-]+
Required
Examples
FAB12345

The unique identifier for the flow cell.

LB:Z:

Regex [a-zA-Z0-9_\.-]+
Examples
My_Sample
my-sample-1

The sample library identifier. Set by the user in the GUI as "Sample ID". Absent if not set.

SM:Z:

Regex barcode([0-9]+)
Only When barcoding
Examples
barcode01

The barcode identified for the read.

Included only if data is present and the arrangement is not "unclassified".

al:Z:

Regex unclassified|[A-Za-z0-9\-_\.]+
Only When barcoding
Examples
my_sample
sample01

User-specified identifier used for the barcode, if available, otherwise the arrangement name.

Included only if data is present and the arrangement is not "unclassified".

This is the same barcode descriptor that Dorado uses to generate the output folder names: the sample sheet alias if available, otherwise the arrangement name.

pi:Z:

Regex [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
Examples
e4994c62-93f9-439a-bc8f-d20c95a137a5

The parent read ID for a split read.

DS:Z:

Regex .*
Only When gpu_calling
Examples
gpu:NVIDIA A100 80GB PCIe
gpu:NVIDIA A100 80GB PCIe|Quadro GV100

Each GPU type used by the basecaller appears once, because the tag records which GPU types were used, not how many of each. For a PromethION the tag would read, for example, DS:Z:gpu:NVIDIA A100 80GB PCIe. If there are multiple GPU types on the system, they are separated by a vertical bar. If a GPU was not used, or the reads were called on Apple Silicon, this field is not present.
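
To show how the vertical-bar separator works in practice, here is a minimal sketch that extracts the list of GPU types from a DS value. It assumes the value starts with the `gpu:` prefix seen in the examples; `gpu_types` is a hypothetical helper name.

```python
def gpu_types(ds_value: str) -> list[str]:
    """Return the GPU types listed in a DS value, in order of appearance."""
    prefix = "gpu:"
    # Values without the documented prefix are treated as carrying no GPU data.
    if not ds_value.startswith(prefix):
        return []
    # Multiple GPU types are separated by a vertical bar.
    return ds_value[len(prefix):].split("|")


types = gpu_types("gpu:NVIDIA A100 80GB PCIe|Quadro GV100")
```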

ns:i:

The number of samples in the signal.

qs:f:

The mean basecall qscore of the read.

mx:i:

The mux the read originated from (equivalent to read_data::mux).

rn:i:

The read number of the read within its channel (equivalent to read_data::read_number).

ts:i:

The number of samples trimmed from the start of the signal (equivalent to read_data::duration - read_data::trimmed_duration).

sm:f:

The scaling median (equivalent to basecall_data::scaling_median).

sd:f:

The scaling dispersion, also sometimes referred to as mad or spread (equivalent to basecall_data::scaling_med_abs_dev).

sv:Z:

The scaling method used by the basecaller, either "med_mad" or "quantile" (equivalent to basecall_data::scaling_version).

du:f:

The duration of the read, in seconds.

dx:i:

Duplex read indicator. 1 for duplex reads, 0 for simplex reads without duplex offspring, -1 for simplex reads with duplex offspring.
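
The three values can be mapped to named constants so that downstream code does not compare against bare integers. This is a small sketch; the enum and its member names are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class DuplexStatus(IntEnum):
    """Meaning of the dx attribute, as described above."""

    SIMPLEX_WITH_DUPLEX_OFFSPRING = -1
    SIMPLEX = 0  # simplex read without duplex offspring
    DUPLEX = 1


def duplex_status(dx_value: str) -> DuplexStatus:
    """Convert a dx value to a DuplexStatus; raises ValueError if unknown."""
    return DuplexStatus(int(dx_value))


status = duplex_status("-1")
is_duplex = status is DuplexStatus.DUPLEX
```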

pt:i:

Only When poly_a_tail_estimation

The estimated number of bases in the polyA/T tail. This tag is only included if --estimate_poly_a was specified by the client, and may still be absent if polyA/T estimation is explicitly disabled by the configuration overrides.

pa:B:i

Only When poly_a_tail_estimation

PolyA/T tail range information. This tag is only included if --estimate_poly_a was specified by the client, and may still be absent if polyA/T estimation is explicitly disabled by the configuration overrides.