
## NGS reads and their Scores

**Quality scoring of the base calls**

"Quality scores measure the **probability that a base is called incorrectly**. With SBS technology, each base in a read is assigned a quality score by a phred-like algorithm, similar to that originally developed for Sanger sequencing experiments. The quality score of a given base, Q, is defined by the equation

*Q = -10 × log10(e)*

where e is the estimated probability of the base call being wrong. Thus, a higher quality score indicates a smaller probability of error."(1)

The quality score is usually quoted as QXX, where XX is the score. It means that a particular base call (or all base calls of a read, a sample, or a run) has an error probability of 10^(-XX/10). For example, **Q30 equates to an error rate of 1 in 1000**, or 0.1%, and Q40 equates to an error rate of 1 in 10,000, or 0.01%.
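The conversion between a Q score and its error probability can be sketched in a few lines of Python. This is only an illustration of the formula above; the function names `phred_to_error` and `error_to_phred` are made up for this example and do not come from any sequencing tool.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability for a Phred-scaled score: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(e: float) -> float:
    """Phred-scaled score for an error probability: Q = -10 * log10(e)."""
    if not 0 < e <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(e)


# Print the error probability for a few commonly quoted scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error(q):.4%}")
```

The same two functions apply to any Phred-scaled value, so they also work for the mapping qualities discussed further down.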

During the primary analysis (real-time analysis, RTA) on the sequencing machine, quality scoring is performed by calculating a set of predictors for each base call and using those predictor values to look up the quality score in a *quality table*. The quality table is generated using a modification of the Phred algorithm on a calibration data set representative of run and sequence variability.

"It is important to note how quickly or slowly quality scores degrade over the course of a read. With short-read sequencing, quality scores largely dictate the read length limits of different sequencing platforms. Thus, a longer read length specification suggests that the raw data from that platform have consistently higher quality scores across all bases." (1)
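One way to see why degradation along the read matters is to add up the per-base error probabilities, which gives the expected number of wrong calls in the read. The sketch below does this for two hypothetical 100-base reads; the quality values are invented for illustration and are not taken from any real run.

```python
def expected_errors(qualities):
    """Sum of per-base error probabilities, i.e. the expected number of wrong calls."""
    return sum(10 ** (-q / 10) for q in qualities)


# Hypothetical reads, made up for this example.
stable_read = [35] * 100                 # quality holds steady across the read
degrading_read = [35] * 50 + [20] * 50   # quality drops in the second half

print(f"stable:    {expected_errors(stable_read):.3f} expected errors")
print(f"degrading: {expected_errors(degrading_read):.3f} expected errors")
```

Under these assumed values, almost all of the expected errors in the degrading read come from its low-quality second half.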

**Mapping / Alignment scores**

For each alignment, BWA calculates a mapping quality score, which is the (Phred-scaled) **probability of the alignment being incorrect**. The algorithm is similar between BWA and MAQ, except that BWA assumes that the true hit can always be found. The error probability is recovered from the mapping quality as:

p = 10 ^ (-q/10)

where q is the mapping quality. For example, a mapping quality of 40 gives 10^-4 = 0.0001, which means there is a 0.01% chance that the read is aligned incorrectly.

Example for a whole read:

If your read is 25 bp long and the expected sequencing error rate is 1% per base, the probability that the read contains 0 errors is:

0.99^25 = 0.78

Now suppose there is 1 perfect alignment and 5 possible alignment positions with 1 mismatch. Assuming errors are independent across positions, the probability that the read contains exactly 1 error is:

25 * 0.01 * 0.99^24 = 0.20

Combining these gives the posterior probability that the best (perfect) alignment is the correct one:

P(0-errors) / (P(0-errors) + 5 * P(1-errors))

= 0.78 / (0.78 + 5 * 0.20)

= 0.44, which is low.
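The whole-read example can be reproduced with a short Python sketch. It assumes a simple binomial error model (independent errors at a fixed per-base rate), which matches the numbers above; it is not BWA's actual implementation. The helper name `prob_k_errors` is invented for this example, and the final Phred-scaling step is only there to show what such a posterior would look like as a mapping quality.

```python
import math


def prob_k_errors(read_length: int, error_rate: float, k: int) -> float:
    """Binomial probability of exactly k sequencing errors in a read."""
    return (
        math.comb(read_length, k)
        * error_rate**k
        * (1 - error_rate) ** (read_length - k)
    )


read_length = 25        # bp, as in the example above
error_rate = 0.01       # expected per-base sequencing error rate
one_mismatch_hits = 5   # alternative alignment positions with 1 mismatch

p0 = prob_k_errors(read_length, error_rate, 0)   # read has no errors
p1 = prob_k_errors(read_length, error_rate, 1)   # read has exactly one error

# Posterior probability that the perfect hit is the true origin of the read.
posterior = p0 / (p0 + one_mismatch_hits * p1)

# Illustrative Phred-scaling of the chance that the best hit is wrong.
mapq = -10 * math.log10(1 - posterior)

print(f"P(0 errors) = {p0:.2f}")
print(f"P(1 error)  = {p1:.2f}")
print(f"posterior   = {posterior:.2f}")
print(f"Phred-scaled mapping quality = {mapq:.1f}")
```

Changing `one_mismatch_hits` shows how quickly the posterior falls as more near-identical positions compete with the best hit.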

BWA apparently does not take base quality into account when evaluating hits.

Sources:

- Illumina
- BWA paper
- DaveTang blog
- jwfoley on SEQanswers
- Ying Wei's notes
- Gene-Test bioinformatics (PGS / NGS) consulting

This entry was posted on 01 Aug 2014 at 16:42 by felix and is filed under Sequencing, Bioinformatics.