BCL files

December 30th, 2016

As part of the Primary Analysis Illumina sequencing machines measure the intensity of the channels used for encoding the different bases and identify the most likely base at a given position of a sequencing read (tag). The Real Time Analysis (RTA) software writes the base and the confidence in the call as a quality score to base call (.bcl) files. As the name implies this is done in real time, i.e. for every cycle of the sequencing run a call for every location identified on the flow cell (tiles and lanes) is added. Bcl files are stored in binary format and represent the raw data output of a sequencing run. The format is described here. Software such as Casava/BclToFastq, Eland or the iSAAC aligner can make use of these files.

The *.bcl files are stored in the BaseCalls directory:

<run directory>/Data/Intensities/BaseCalls/L<lane>/C<cycle>.1

They are named in the format:


If you want to overcome errors during downstream processing from missing calls, software such as iSAAC and configureBclToFastq have an "--ignore-missing-bcl" command line option. This will interpret missing *.bcl files as no call (N) at that position.

Sources: Illumina, SeqAnswers

Cystic Fibrosis and its Analysis

June 17th, 2015

Cystic Fibrosis, also called Mucoviscidos, is a hereditary disease (autosomal recessive) in which exocrine (secretory) glands produce abnormally thick mucus. This mucus can cause problems in digestion, breathing, and body cooling. It affects up to one out of 3000 newborns (with northern European ancestry). There are well over a hundred genetic changes linked to CF. It is an area companies like Illumina are very active in with a special assay cleared as an in-vitro diagnostic test with the FDA for the detection of most of the genetic variants known to cause the disease.

Here are notes from a presentation Dr. Carlos Bustamante gave at a recent ClinGen conference:

Background for CF and CFTR


  • Cysstic Fibrosis Transmembrane Conductance Regulator
  • ABC transporter (ATP-binding cassette), that functons as ion channel
  • cAMP-regulated through R domain phosphorylaton
  • Transports chloride and thiocyanate across epithelial cell membranes
  • 1,480 amino acids 

CF disease:

  • Most common autosomal recessive disorder among Caucasians (1/3,300)
  • Dysregulaton of epithelial fluid transport in lung, pancreas, and other organs
  • ~ 2,000 identfied gene mutatons
  • Phe508del – most common, in 70% cases
  • Wide range of severity, most die of pulmonary disease at mean age of 37


  • ~70% of variants currently classed as VUS (variant of unknown significance)
  • ~65% are missense mutations, 24% frameshift & stop-gained, 9% synonymous

Testing Machine Learning Approaches for CF classification

  • Machine learning algorithms show higher performance when compared with separate predictors
  • Tree-based methods perform the best (GBM & RF AUC is 6% higher then the best predictor, MutPred)
  • Top features: MutPred, AF, SIFT, CADD, POSE
  • Predicted pathogenicity probability (RF.pred) correlates with available experimental data for Cl- conductance and sweat Cl-

 Other sources used: PubMedHealth, Wikipedia

CRAM format

January 7th, 2015

CRAM files are compressed versions of BAM files containing (aligned) sequencing reads. They represent a further file size reduction for this type of data that is generated at ever increasing quantities. Where SAM files are human-readable text files optimized for short read storage, BAM files are their binary equivalent, and CRAM files are a restructured column-oriented binary container format for even more efficient storage.

Tke key components of the approach are that positions are encoded in a relative way (i.e., the difference between successive positions is stored rather than the absolute value) and stored as a Golomb code. Also, only differences to the reference genome are listed instead of the full sequence.

The compression rates achieved are shown in the graph below generated by Uppsala University:

File size comparisons of SAM, BAM, CRAM

Comparing speed: Using the C implementation of for CRAM (James K. Bonfield), decoding is 1.5–1.7× slower than generating BAM files, but 1.8–2.6× faster at encoding. (File size savings are reported at 34–55%.(

Additional compression can be achieved by reducing the granularity of the quality values which will result in lossy compression though. Illumina suggested a binning of Q scores without significant calling performance.

Binning of similar Q-scores (Illumina):

qscore binning

Compression achieved by Q-score binning (Illumina):

qscore compression

Sources and further reading:

  1. Format definition and usage
  2. cram-toolkit
  3. Detailed report at the Uppsala University
  4. SAMtools with CRAM support
  5. Original article from Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane and Ewan Birney
  6. Article about the implementation in C
  7. Illumina while paper on Qscore compression

Barcode Balancing for Illumina Sequencing

November 4th, 2014

HiSeq & MiSeq
The HiSeq and MiSeq use a green laser to sequence G/T and a red laser to sequence A/C. At each cycle at least one of two nucleotides for each color channel must be read to ensure proper registration. It is important to maintain color balance for each base of the index read being sequenced, otherwise index read sequencing could fail due to registration failure. E.g. if the sample contains only T and C in the first four cycles, image registration will fail. (If possible spike-in phiX sequence to add diversity to low-plex sequencing libraries.)
If one or more bases are not present in the first 11 cycles the quality of the run will be negatively impacted. This is because the color matrix is calculated from the color signals of these cycles.

NextSeq 500
The NextSeq 500 uses two-channel sequencing, which requires only two images to encode the data for four DNA bases, one red channel and one green channel. The NextSeq also uses a new implementation of real-time analysis (RTA) called RTA2.0, which includes important architecture differences from RTA on other Illumina sequencers. For any index sequences, RTA2.0 requires that there is at least one base other than G in the first two cycles. This requirement for index diversity allows the use of any Illumina index selection for single-plex indexing except index 1 (i7) 705, which uses the sequence GGACTCCT. Use the combinations in the table below for proper color balancing on the NextSeq 500.

Illlumina Nextera tech notes, Illumina Low diversity note
See also TruSeq Guide

NGS reads and their Scores

August 1st, 2014

Quality scoring of the base calls

"Quality scores measure the probability that a base is called incorrectly. With SBS technology, each base in a read is assigned a quality score by a phred-like algorithm, similar to that originally developed for Sanger sequencing experiments. The quality score of a given base, Q, is defined by the equation
Q = -10log10(e)
where e is the estimated probability of the base call being wrong. Thus, a higher quality score indicates a smaller probability of error."(1)
The quality score is usually quoted as QXX, where the XX is the score and refers to that a particular call (or a all base calls of a read / of a sample / of a run) has a probability of error of 10^(-XX/10). For example Q30 equates to an error rate of 1 in 1000, or 0.1%, Q40 equates to an error rate of 1 in 10,000 or 0.01%.

During the primary analysis (real-time analysis, RTA) on the sequencing machine, quality scoring is performed by calculating a set of predictors for each base call, and using those predictor values to look up the quality score in a quality table. The quality table is generated using a modification of the Phred algorithm on a calibration data set representative of run and sequence variability

"It is important to note how quickly or slowly quality scores degrade over the course of a read. With short-read sequencing, quality scores largely dictate the read length limits of different sequencing platforms. Thus, a longer read length specification suggests that the raw data from that platform have consistently higher quality scores across all bases." (1)

Mapping / Alignment scores

For each alignment, BWA calculates a mapping quality score, which is the (Phred-scaled) probability of the alignment being incorrect. The algorithm is similar between BWA and MAQ, except that BWA assumes that the true hit can always be found. The probability for every base is calculated as:

p = 10 ^ (-q/10)

where q is the quality. For example a mapping quality of 40: 10^-4 = 0.0001, which means there is a 0.01% chance that the base is aligned incorrectly.

Example for a whole read:

If your read is 25 bp long and the expected sequencing error rate is 1%, the probability of the read with 0 errors is:

0.99^25 = 0.78

If there is 1 perfect alignment and 5 possible alignment positions with 1 mismatch, we combine these probabilities: The probability of the read with 1 error is
combined posterior probability that the best alignment is correct:

P(0-errors) / (P(0-errors) + 5 * P(1-errors))

= 0.44, which is low.

Base quality is apparently not considered in evaluating hits in bwa.


  1. Illumina
  2. BWA paper
  3. DaveTang blog
  4. jwfoley on SEQanswers
  5. Ying Wei's notes