In the previous post we saw how to download some sample NGS genome data. We ended up with 2 FASTQ files, each one containing a set of reads for the M. tuberculosis H37Ra genome. Before we dive into quality analysis for this experiment, let’s stop for a minute and try to understand what the FASTQ format is and how to read it.
Overview
FASTQ is a text-based format for storing biological sequences together with their quality scores. In practice these are usually nucleotide sequences, because base calling during a sequencing experiment assigns a quality score to every base. In short, one can estimate the likelihood that a given nucleotide is an A, T, C or G from the shape, resolution and other properties of the peaks obtained from a sequencer (at least in traditional Sanger sequencing, although the same principles apply to next-generation sequencing approaches). Those estimates are then converted to a quality score (Q).
To get an idea of what this format looks like, the easiest way is to look at our own data. In your terminal, navigate to the directory where your FASTQ files are located and type:
That should display the 4 top-most lines from the indicated file (as there are always 4 lines per sequence read in the FASTQ format):
The meaning of those 4 lines is the following:
- Sequence name – here, it’s an accession number of our read set, followed by a read number and sequence length
- The actual DNA sequence
- Separating line – starts with a + sign, optionally can contain more text (here, the content of 1st line is repeated)
- Quality scores corresponding to the nucleotides on line 2
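The four-line layout described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming every read occupies exactly 4 lines (no wrapped sequences); the `FastqRecord` type, the `read_fastq` function and the example record are made up for this post and are not taken from the real data set.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # line 1, without the leading "@"
    sequence: str  # line 2, the DNA sequence
    quality: str   # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly 4 lines per read."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of input (or a blank line): stop reading
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record for illustration only; not taken from the real read set.
EXAMPLE = "@read.1 1 length=8\nGATTACAG\n+read.1 1 length=8\nBBBB@@??\n"

records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.name, record.sequence, record.quality)
```

To try it on a real file, the `io.StringIO(EXAMPLE)` part could be replaced with an `open(...)` call on one of your FASTQ files. For anything beyond experimenting, an established parser is probably the safer choice.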
Quality scores
Now, the quality scores on the 4th line may seem a little odd. They are not written as integers but as ASCII-encoded characters, one character per base (see the summary below). The quality score (Q) expresses, on a logarithmic scale, the probability that this particular base was called incorrectly (P_error):
\begin{aligned}P_{error} &= 10^{-Q/10} \end{aligned}
For example, for a base C whose quality character is B (ASCII code 66, which gives Q = 66 - 33 = 33 under the Phred+33 encoding):
\begin{aligned}P_{error} &= 10^{-33/10} \approx 0.0005 \end{aligned}
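The same calculation can be done for a whole quality string. The sketch below assumes the Phred+33 encoding (the character's ASCII code minus 33 gives Q), which is consistent with B standing for Q=33 in the example above. The names `PHRED_OFFSET`, `phred_scores` and `error_probability`, as well as the quality string, are made up for illustration.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 encoding, so "B" (ASCII 66) means Q=33


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert each ASCII quality character to its integer quality score."""
    return [ord(char) - offset for char in quality]


def error_probability(q: float) -> float:
    """P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)


quality_string = "B5I"  # hypothetical quality string, not from the real data
scores = phred_scores(quality_string)
for char, q in zip(quality_string, scores):
    print(f"{char!r}: Q={q}, P_error={error_probability(q):.6f}")
```

Older Illumina pipelines used a different offset (64), which is why the offset is kept as a parameter here rather than hard-coded inside the function.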
Here you can see some more examples of base calling error probability and accuracy, given several quality scores:
Quality score | Base calling error probability | Base calling accuracy |
---|---|---|
10 | \begin{aligned} 10^{-1} \end{aligned} | 90% |
20 | \begin{aligned} 10^{-2} \end{aligned} | 99% |
30 | \begin{aligned} 10^{-3} \end{aligned} | 99.9% |
40 | \begin{aligned} 10^{-4} \end{aligned} | 99.99% |
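The rows of this table follow directly from the formula, and the relation can also be inverted to get Q from an error probability. Here is a minimal sketch; the function names are made up for illustration, and accuracy is simply taken as 1 - P_error.

```python
import math


def error_probability(q: float) -> float:
    """P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_quality(p_error: float) -> float:
    """Inverse relation: Q = -10 * log10(P_error)."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    accuracy = (1 - p) * 100  # base calling accuracy in percent
    print(f"Q={q}: P_error={p:g}, accuracy={accuracy:g}%")
```

In other words, every 10 points of Q correspond to a tenfold drop in the probability of a wrong base call.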
Summary
Resources
- a great document from Illumina explaining more details of Phred quality scores can be found here
- Wikipedia page on FASTQ format
- more about Phred base-calling in this paper
