FASTQ format - an overview

In the previous post we saw how to download some sample NGS genome data. We ended up with 2 FASTQ files, each one containing a set of reads for the M. tuberculosis H37Ra genome. Before we dive into quality analysis for this experiment, let’s stop for a minute and try to understand what the FASTQ format is and how to read it.

Overview

FASTQ is a text-based format used to represent biological sequences together with their quality scores. In practice these are usually nucleotide sequences, since each base is assigned a quality score during a sequencing experiment. In short, the likelihood that a given nucleotide is an A, T, C or G can be estimated from the shape, resolution and other properties of the peaks obtained from a sequencer (at least in traditional Sanger sequencing, although the very same principles apply to next-generation sequencing approaches). Those estimates are then converted to a quality score (Q).

To get an idea of what this format represents, the easiest way is to look at our own data. In your terminal, navigate to the directory where your FASTQ files are located and print the first 4 lines of one of them (for example with `head -n 4` followed by the file name).

That displays the 4 top-most lines of the indicated file, which together make up a single read (there are always 4 lines per sequence read in the FASTQ format).

The meaning of those 4 lines is the following:

1. Sequence name – here, it’s an accession number of our read set, followed by a read number and sequence length
2. The actual DNA sequence
3. Separating line – starts with a + sign and can optionally contain more text (here, the content of line 1 is repeated)
4. Quality scores corresponding to the nucleotides on line 2
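
The four-line layout described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming exactly 4 lines per read and a header line that starts with `@`; the names `FastqRecord` and `read_fastq` are made up for this example, and the record it parses is invented rather than taken from our data set.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # line 1, without the leading "@"
    sequence: str   # line 2, the DNA sequence
    separator: str  # line 3, starts with "+"
    quality: str    # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly 4 lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or an empty line) ends the iteration
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, separator, quality)


# Invented record, for illustration only.
example = io.StringIO("@read_1 length=8\nGATTACAG\n+read_1 length=8\nIIIIBBB#\n")
records = list(read_fastq(example))
for record in records:
    print(record.name, record.sequence, record.quality)
```

To try it on a real file, an open file handle (for instance from `open(path)`) can be passed in place of the `io.StringIO` object.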

Quality scores

Now, the quality scores presented on the 4th line may seem a little odd. They are not written as integers but as ASCII-encoded characters (see the summary below). The quality score (Q) expresses, on a logarithmic scale, the probability that a particular base was called incorrectly ($P_{error}$):

$\begin{aligned}P_{error} &= 10^{-Q/10} \end{aligned}$

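For example, take a base C whose quality character is B. The ASCII code of B is 66, which corresponds to Q = 66 - 33 = 33, assuming the common Phred+33 encoding:

$\begin{aligned}P_{error} &= 10^{-33/10} \approx 0.0005 \end{aligned}$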
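
The same conversion can be sketched in Python. This is a rough illustration that assumes the Phred+33 encoding (ASCII code minus 33), which matches the B = 33 example; other offsets exist, so check which one your data uses. The function names and the quality string are invented for this example.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding; older data may use another offset


def char_to_quality(char: str, offset: int = PHRED_OFFSET) -> int:
    """Convert one ASCII quality character to its Phred quality score Q."""
    return ord(char) - offset


def error_probability(q: float) -> float:
    """Probability that the base was called incorrectly: 10^(-Q/10)."""
    return 10 ** (-q / 10)


quality_string = "IIIIBBB#"  # invented quality line, for illustration only
scores = [char_to_quality(c) for c in quality_string]
probabilities = [error_probability(q) for q in scores]

for char, q, p in zip(quality_string, scores, probabilities):
    print(f"{char!r}  Q={q:2d}  P_error={p:.6f}")
```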

Here you can see some more examples of base calling error probability and accuracy, given several quality scores:

Quality score (Q)	Base calling error probability	Base calling accuracy
10	$10^{-1}$	90%
20	$10^{-2}$	99%
30	$10^{-3}$	99.9%
40	$10^{-4}$	99.99%
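
The accuracy column is simply one minus the error probability, expressed as a percentage. A small sketch that recomputes the rows above (the helper name `accuracy_percent` is made up for this example):

```python
def accuracy_percent(q: float) -> float:
    """Base calling accuracy in percent: 100 * (1 - 10^(-Q/10))."""
    return 100 * (1 - 10 ** (-q / 10))


# One row per quality score: (Q, error probability, accuracy in percent)
rows = [(q, 10 ** (-q / 10), accuracy_percent(q)) for q in (10, 20, 30, 40)]

for q, p_error, accuracy in rows:
    print(f"{q}\t{p_error:g}\t{accuracy:.2f}%")
```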

Summary

Below you’ll find a graphical summary of what was discussed above about FASTQ format and Phred quality scores.

Resources

A great document from Illumina explaining Phred quality scores in more detail
The Wikipedia page on the FASTQ format
A paper with more about Phred base-calling

Michał

Hey! I’m a protein biochemist by training with more recent experience in software development. Here, I’ll be trying to combine a bit of both – let’s see where we end up with that 🙂
