FASTQ¶
Format version: 0.1
FASTQ is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
Paths¶
The following path patterns are used to place the data on disk:
| File | Path pattern |
|---|---|
| FASTQ file | fastq{basecall_status}{duplex_status}/{alias}/{flow_cell_id}{basecall_status}{duplex_status}_{alias_}{short_protocol_run_id}_{short_run_id}_{batch_number}.fastq.gz |
See the Patterns documentation for more information on file patterns.
Read batching¶
The following batching options are used by default:
| Option | Value |
|---|---|
| Duration | 3600s |
For more information on batching see Batching.
Record structure¶
Oxford Nanopore Technologies FASTQ records contain a key-value section after the required unique read ID. Treat this section as an unordered set of values.
The approximate structure of a record is:
@<read-id>(\s<key>=<value>)*
ATCG...
+
QQQQ...
For example:
@bd8655fb-383c-45cc-bff3-eb1dc86533e0 key1=value1 key2=value2
ATCG
+
QQQQ
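The record structure above can be sketched as a small header parser. This is a minimal illustration, not an official implementation: the function name `parse_header` is hypothetical, and it assumes fields are separated by whitespace and that values contain no whitespace themselves (values with embedded spaces would need a different delimiter rule).

```python
def parse_header(line: str) -> tuple[str, dict[str, str]]:
    """Split a FASTQ header line into (read_id, attributes).

    Assumption: fields are whitespace-separated and each field after the
    read id has the form key=value, as in @<read-id>(\\s<key>=<value>)*.
    """
    if not line.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    parts = line[1:].split()
    if not parts:
        raise ValueError("FASTQ header has no read id")
    read_id, fields = parts[0], parts[1:]
    attributes: dict[str, str] = {}
    for field in fields:
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"not a key=value field: {field!r}")
        # The section is an unordered set, so a plain dict lookup is enough.
        attributes[key] = value
    return read_id, attributes


read_id, attributes = parse_header(
    "@bd8655fb-383c-45cc-bff3-eb1dc86533e0 key1=value1 key2=value2"
)
print(read_id, attributes)
```

Because the attributes are unordered, consumers should look values up by key rather than by position.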
Attributes included in the key-value section are listed below.
Required header attributes¶
RG:Z: ¶
([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})([a-z0-9@\.])+(([A-Za-z0-9@\.]+))?
| Examples |
|---|
| e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_barcode02 |
| e4994c62-93f9-439a-bc8f-d20c95a137a5_unknown_barcode02 |
| e4994c62-93f9-439a-bc8f-d20c95a137a5_rna004_130bps_fast@v5.1.0_29d8704b |
ID of the read group to which this read belongs.
DT:Z: ¶
\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|\+\d{2}:\d{2})
| Examples |
|---|
| 2025-01-06T10:06:36.778368+00:00 |
| 2025-01-06T10:06:36.778368Z |
| 2025-01-06T10:06:36+00:00 |
| 2025-01-06T10:06:36Z |
The protocol start time of the sequencing run, formatted as RFC 3339.
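As an illustration, a timestamp matching the pattern above can be validated and converted to a timezone-aware `datetime`. This is a sketch under stated assumptions: the helper name `parse_timestamp` is hypothetical, the regular expression is the documented pattern with named groups added, and fractional seconds are normalised to microseconds so that the code does not depend on newer `datetime.fromisoformat` behaviour.

```python
import re
from datetime import datetime

# The documented pattern, with named groups added for convenience.
TIMESTAMP_RE = re.compile(
    r"(?P<base>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"(?P<frac>\.\d+)?"
    r"(?P<offset>Z|\+\d{2}:\d{2})"
)


def parse_timestamp(value: str) -> datetime:
    """Parse a DT:Z: / st:Z: style timestamp into an aware datetime."""
    match = TIMESTAMP_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid timestamp: {value!r}")
    digits = (match.group("frac") or ".")[1:]
    # Pad or truncate the fraction to exactly six digits (microseconds).
    micro = (digits + "000000")[:6]
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    offset = "+00:00" if match.group("offset") == "Z" else match.group("offset")
    return datetime.fromisoformat(f"{match.group('base')}.{micro}{offset}")


print(parse_timestamp("2025-01-06T10:06:36.778368Z"))
```

Digits beyond microsecond precision are dropped in this sketch, which is a limitation of `datetime` rather than of the format.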
ch:i: ¶
[1-9][0-9]*
| Examples |
|---|
| 512 |
The channel the read originated from (equivalent to read_data::channel).
st:Z: ¶
\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|\+\d{2}:\d{2})
| Examples |
|---|
| 2025-01-06T10:06:36.778368+00:00 |
| 2025-01-06T10:06:36.778368Z |
| 2025-01-06T10:06:36+00:00 |
| 2025-01-06T10:06:36Z |
The start time of the read.
PU:Z: ¶
[A-Z0-9_-]+
| Examples |
|---|
| FAB12345 |
The unique identifier for the flow cell.
LB:Z: ¶
[a-zA-Z0-9_\.-]+
| Examples |
|---|
| My_Sample |
| my-sample-1 |
The sample library identifier. Set by the user in the GUI as "Sample ID". Absent if not set.
SM:Z: ¶
barcoding
barcode([0-9]+)
| Examples |
|---|
| barcode01 |
The barcode identified for the read.
Included only if data is present and the arrangement is not "unclassified".
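A value for this tag can be checked against the documented pattern and reduced to its barcode number. The helper name `barcode_number` below is hypothetical, and the sketch assumes the whole value must match `barcode([0-9]+)`.

```python
import re

# Documented pattern for SM:Z: values.
BARCODE_RE = re.compile(r"barcode([0-9]+)")


def barcode_number(value: str) -> int:
    """Return the numeric part of a barcode value such as 'barcode01'."""
    match = BARCODE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a barcode value: {value!r}")
    # int() drops leading zeros, so 'barcode01' yields 1.
    return int(match.group(1))


print(barcode_number("barcode01"))
```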
al:Z: ¶
barcoding
unclassified|[A-Za-z0-9\-_\.]+
| Examples |
|---|
| my_sample |
| sample01 |
User-specified identifier used for the barcode, if available, otherwise the arrangement name.
Included only if data is present and the arrangement is not "unclassified".
This is the same barcode descriptor that Dorado uses to generate the output folder names: the sample sheet alias if available, otherwise the arrangement name.
pi:Z: ¶
[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
| Examples |
|---|
| e4994c62-93f9-439a-bc8f-d20c95a137a5 |
The parent read ID for a split read.
DS:Z: ¶
gpu_calling
.*
| Examples |
|---|
| gpu:NVIDIA A100 80GB PCIe |
| gpu:NVIDIA A100 80GB PCIe\|Quadro GV100 |
Each GPU type used by the basecaller appears once, because only the GPU types are of interest, not how many of each are present. For a PromethION, the tag would be, e.g., DS:Z:gpu:NVIDIA A100 80GB PCIe. If the system has multiple GPU types, they are separated by a vertical bar. If no GPU was used, or the reads were called on Apple Silicon, this field is not present.
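The vertical-bar convention described above can be sketched as follows. The function name `gpu_types` is hypothetical, and the `gpu:` prefix is an assumption taken from the examples, since the documented pattern itself is unrestricted.

```python
def gpu_types(value: str) -> list[str]:
    """Split a DS:Z: value into the GPU types it lists.

    Assumption: the value starts with 'gpu:' (as in the examples) and
    GPU types are separated by a vertical bar.
    """
    prefix = "gpu:"
    if not value.startswith(prefix):
        raise ValueError(f"expected a value starting with {prefix!r}: {value!r}")
    return value[len(prefix):].split("|")


print(gpu_types("gpu:NVIDIA A100 80GB PCIe|Quadro GV100"))
```

Since the field is absent when no GPU was used, callers should treat a missing tag as "no GPU information" rather than passing an empty string.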
ns:i: ¶
The number of samples in the signal.
qs:f: ¶
The mean basecall qscore of the read.
mx:i: ¶
The mux the read originated from (equivalent to read_data::mux).
rn:i: ¶
The read number of the read (equivalent to read_data::read_number).
ts:i: ¶
The number of samples trimmed from the start of the signal.
sm:f: ¶
Scaling median (equivalent to basecall_data::scaling_median).
sd:f: ¶
Scaling dispersion, sometimes also referred to as MAD or spread (equivalent to basecall_data::scaling_med_abs_dev).
sv:Z: ¶
Either "med_mad" or "quantile", depending on which scaling method the basecaller used (equivalent to basecall_data::scaling_version).
du:f: ¶
The duration of the read, in seconds.
dx:i: ¶
Duplex read indicator. 1 for duplex reads, 0 for simplex reads without duplex offspring, -1 for simplex reads with duplex offspring.
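The three documented values can be mapped to named constants so that downstream code does not compare against bare integers. The enum and function names here are hypothetical; only the numeric values and their meanings come from the description above.

```python
from enum import IntEnum


class DuplexStatus(IntEnum):
    """Meaning of the dx:i: tag values."""

    SIMPLEX_WITH_DUPLEX_OFFSPRING = -1
    SIMPLEX = 0  # simplex read without duplex offspring
    DUPLEX = 1


def duplex_status(value: str) -> DuplexStatus:
    """Convert the textual tag value into a DuplexStatus member."""
    # int() raises ValueError for non-numeric text; DuplexStatus() raises
    # ValueError for integers outside the documented set.
    return DuplexStatus(int(value))


print(duplex_status("1").name)
```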
pt:i: ¶
poly_a_tail_estimation
The estimated number of bases in the polyA/T tail. This tag is included only if --estimate_poly_a was specified by the client, and it may still be absent if polyA/T estimation is explicitly disabled by the configuration overrides.
pa:B:i ¶
poly_a_tail_estimation
PolyA/T tail range information -