Cystic Fibrosis and its Analysis

June 17th, 2015

Cystic fibrosis (CF), also called mucoviscidosis, is a hereditary disease (autosomal recessive) in which exocrine (secretory) glands produce abnormally thick mucus. This mucus can cause problems with digestion, breathing, and body cooling. It affects up to one in 3,000 newborns of northern European ancestry. Well over a hundred genetic changes are linked to CF. It is an area in which companies like Illumina are very active, with a special assay cleared by the FDA as an in-vitro diagnostic test for the detection of most of the genetic variants known to cause the disease.

Here are notes from a presentation Dr. Carlos Bustamante gave at a recent ClinGen conference:

Background for CF and CFTR


  • Cystic Fibrosis Transmembrane Conductance Regulator
  • ABC transporter (ATP-binding cassette) that functions as an ion channel
  • cAMP-regulated through R domain phosphorylation
  • Transports chloride and thiocyanate across epithelial cell membranes
  • 1,480 amino acids 

CF disease:

  • Most common autosomal recessive disorder among Caucasians (1/3,300)
  • Dysregulation of epithelial fluid transport in lung, pancreas, and other organs
  • ~ 2,000 identified gene mutations
  • Phe508del – most common, in 70% of cases
  • Wide range of severity, most die of pulmonary disease at mean age of 37


  • ~70% of variants currently classed as VUS (variant of unknown significance)
  • ~65% are missense mutations, 24% frameshift & stop-gained, 9% synonymous

Testing Machine Learning Approaches for CF classification

  • Machine learning algorithms show higher performance when compared with separate predictors
  • Tree-based methods perform the best (GBM & RF AUC is 6% higher than the best predictor, MutPred)
  • Top features: MutPred, AF, SIFT, CADD, POSE
  • Predicted pathogenicity probability (RF.pred) correlates with available experimental data for Cl- conductance and sweat Cl-

 Other sources used: PubMedHealth, Wikipedia

CRAM format

January 7th, 2015

CRAM files are compressed versions of BAM files containing (aligned) sequencing reads. They represent a further file size reduction for this type of data, which is generated in ever-increasing quantities. Where SAM files are human-readable text files optimized for short read storage, BAM files are their binary equivalent, and CRAM files are a restructured, column-oriented binary container format for even more efficient storage.

The key components of the approach are that positions are encoded relatively (i.e., the difference between successive positions is stored rather than the absolute value) and stored as a Golomb code. Also, only differences from the reference genome are listed instead of the full sequence.
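
The two ideas for positions can be sketched in a few lines of Python. This is only an illustration of the principle, not the CRAM codec itself: the function names are made up for this example, and the Golomb code is restricted to a power-of-two parameter (a Rice code) to keep the remainder handling simple.

```python
def delta_encode(positions):
    """Return the first position and the gaps between successive sorted positions."""
    if not positions:
        return None, []
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return positions[0], gaps


def delta_decode(first, gaps):
    """Rebuild absolute positions from the first position and the gaps."""
    if first is None:
        return []
    positions = [first]
    for gap in gaps:
        positions.append(positions[-1] + gap)
    return positions


def rice_encode(value, k):
    """Golomb code with parameter m = 2**k: unary quotient, then k-bit remainder."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    quotient = value >> k
    remainder = value & ((1 << k) - 1)
    unary = "1" * quotient + "0"
    binary = format(remainder, "b").zfill(k) if k > 0 else ""
    return unary + binary


def rice_decode(bits, k):
    """Decode a concatenated string of Rice codewords back into integers."""
    values = []
    i = 0
    while i < len(bits):
        quotient = 0
        while bits[i] == "1":      # count the unary part
            quotient += 1
            i += 1
        i += 1                     # skip the terminating "0"
        remainder = int(bits[i:i + k], 2) if k > 0 else 0
        i += k
        values.append((quotient << k) | remainder)
    return values


# Hypothetical alignment start positions of five reads, sorted by coordinate.
positions = [10000, 10003, 10010, 10010, 10042]
first, gaps = delta_encode(positions)
bitstream = "".join(rice_encode(gap, 2) for gap in gaps)
restored = delta_decode(first, rice_decode(bitstream, 2))
```

Because reads in a sorted file start close to each other, the gaps are small numbers, and small numbers get short codewords. That is where the saving comes from.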

The compression rates achieved are shown in the graph below generated by Uppsala University:

[Figure: file size comparison of SAM, BAM, and CRAM]

Comparing speed: Using the C implementation of CRAM (James K. Bonfield), decoding is 1.5–1.7× slower than for BAM files, but encoding is 1.8–2.6× faster. (File size savings are reported at 34–55%.)

Additional compression can be achieved by reducing the granularity of the quality values, although this makes the compression lossy. Illumina suggested a binning of Q-scores that does not significantly reduce calling performance.

Binning of similar Q-scores (Illumina):

[Figure: Q-score binning]

Compression achieved by Q-score binning (Illumina):

[Figure: Q-score compression]
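
As a rough sketch of what binning does to a quality string, the Python snippet below maps each Q-score to one representative value per bin. The bin boundaries in ASSUMED_BINS are an assumption modeled on an eight-level scheme and should be checked against the Illumina white paper; the function names and the Phred+33 offset are likewise choices made for this example.

```python
# Assumed bins: (lowest Q, highest Q, representative Q written to the file).
ASSUMED_BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
    (40, 93, 40),
]


def bin_qscore(q):
    """Map a single Q-score to the representative value of its bin."""
    for low, high, representative in ASSUMED_BINS:
        if low <= q <= high:
            return representative
    return q  # values outside the table are left unchanged


def bin_quality_string(qual, offset=33):
    """Apply the binning to a FASTQ/SAM quality string (Phred+33 by default)."""
    return "".join(chr(bin_qscore(ord(c) - offset) + offset) for c in qual)


original = "IG5"                      # Q40, Q38, Q20 in Phred+33
binned = bin_quality_string(original)
```

The binned string has far fewer distinct symbols than the original, so a general-purpose compressor can represent it in less space. The price is that the exact original scores cannot be recovered.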

Sources and further reading:

  1. Format definition and usage
  2. cram-toolkit
  3. Detailed report at the Uppsala University
  4. SAMtools with CRAM support
  5. Original article from Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane and Ewan Birney
  6. Article about the implementation in C
  7. Illumina white paper on Q-score compression

Barcode Balancing for Illumina Sequencing

November 4th, 2014

HiSeq & MiSeq
The HiSeq and MiSeq use a green laser to sequence G/T and a red laser to sequence A/C. At each cycle, at least one of the two nucleotides for each color channel must be read to ensure proper registration. It is important to maintain color balance for each base of the index read being sequenced; otherwise, index read sequencing could fail due to registration failure. E.g., if the sample contains only A and C in the first four cycles, there is no signal in the green channel and image registration will fail. (If possible, spike in PhiX to add diversity to low-plex sequencing libraries.)
If one or more bases are not present in the first 11 cycles, the quality of the run will be negatively impacted. This is because the color matrix is calculated from the color signals of these cycles.
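
The per-cycle rule for the two laser channels can be checked with a short Python sketch before pooling. The function name and the example index sequences are hypothetical; the channel assignment (green for G/T, red for A/C) is taken from the paragraph above.

```python
GREEN_BASES = set("GT")   # read with the green laser
RED_BASES = set("AC")     # read with the red laser


def unbalanced_cycles(indexes):
    """Return (cycle, missing channels) for every cycle lacking a color channel.

    Cycles are numbered from 1. All index sequences must have the same length.
    """
    if len({len(seq) for seq in indexes}) > 1:
        raise ValueError("all index sequences must have the same length")
    problems = []
    for cycle, bases in enumerate(zip(*indexes), start=1):
        present = {base.upper() for base in bases}
        missing = []
        if not present & GREEN_BASES:
            missing.append("green")
        if not present & RED_BASES:
            missing.append("red")
        if missing:
            problems.append((cycle, missing))
    return problems


# Hypothetical two-plex pools.
balanced_pool = ["ACGT", "GTAC"]
unbalanced_pool = ["ATCACG", "CGATGT"]

balanced_report = unbalanced_cycles(balanced_pool)
unbalanced_report = unbalanced_cycles(unbalanced_pool)
```

An empty result means every cycle has signal in both channels. Any reported cycle points to a position where another index or a spike-in would be needed.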

NextSeq 500
The NextSeq 500 uses two-channel sequencing, which requires only two images to encode the data for four DNA bases, one from the red channel and one from the green channel. The NextSeq also uses a new implementation of real-time analysis (RTA) called RTA2.0, which includes important architecture differences from RTA on other Illumina sequencers. For any index sequence, RTA2.0 requires that there is at least one base other than G in the first two cycles. This requirement for index diversity allows the use of any Illumina index selection for single-plex indexing except index 1 (i7) 705, which uses the sequence GGACTCCT. Use the combinations in the table below for proper color balancing on the NextSeq 500.
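
The RTA2.0 requirement on the first two cycles is easy to test for a single index. The Python sketch below is a minimal check under the rule as stated above; the function name is made up for this example, and the second sequence is only a sample input.

```python
def passes_first_two_cycles_rule(index_seq):
    """True if at least one of the first two bases of the index is not G."""
    first_two = index_seq[:2].upper()
    return any(base != "G" for base in first_two)


# The index named in the text starts with GG and therefore fails the check.
i7_705_ok = passes_first_two_cycles_rule("GGACTCCT")
other_ok = passes_first_two_cycles_rule("TAAGGCGA")
```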

Illumina Nextera tech notes, Illumina Low diversity note
See also TruSeq Guide

Mount Windows share in Linux system

August 1st, 2014

Using a text editor, create a file for your remote server's logon credentials:

gedit ~/.smbcredentials

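Enter your Windows username and password in the file, one per line (replace the placeholder values with your own):

username=your_windows_username
password=your_windows_password

Save the file, then restrict its permissions so that only your user can read it:

chmod 600 ~/.smbcredentials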

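Edit your /etc/fstab file (root privileges are required) and add the following line:

//servername/sharename /media/windowsshare cifs credentials=/home/ubuntuusername/.smbcredentials,iocharset=utf8,sec=ntlm 0 0

Then mount everything listed in /etc/fstab to test the new entry:

sudo mount -a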


Testing for Equivalence

August 1st, 2014

To assess whether a new test (e.g., a diagnostic test or medical device testing for disease or non-disease status) is equivalent to an existing test, the following measures can be reported. They can be important for the submission of premarket notification (510(k)) or premarket approval (PMA) applications for diagnostic devices (tests) to the U.S. Food and Drug Administration (FDA).

A new test is usually compared to an existing, established test or a generally trusted reference. If the existing test (or reference) is not perfect, the FDA recommends reporting the positive and negative percent agreement (PPA/NPA). These are calculated from the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) as follows (1):

            Existing test
New test    R+       R-       Total
T+          TP       FP       TP+FP
T-          FN       TN       FN+TN
Total       TP+FN    FP+TN    N

PPA = TP * 100 / (TP + FN)
NPA = TN * 100 / (TN + FP)
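
As a small worked example, the two formulas translate directly into Python. The function name and the counts are hypothetical and only serve to show the arithmetic.

```python
def percent_agreement(tp, fp, fn, tn):
    """Return (PPA, NPA) in percent from the four cells of the 2x2 table."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each column of the 2x2 table needs at least one sample")
    ppa = 100.0 * tp / (tp + fn)   # agreement on samples the existing test calls positive
    npa = 100.0 * tn / (tn + fp)   # agreement on samples the existing test calls negative
    return ppa, npa


# Hypothetical study: 100 reference-positive and 100 reference-negative samples.
ppa, npa = percent_agreement(tp=90, fp=5, fn=10, tn=95)
```

With these counts, the new test agrees with the existing test on 90 of the 100 positives and 95 of the 100 negatives.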

Measures of accuracy
"FDA recommends you report measures of diagnostic accuracy (sensitivity and specificity pairs, positive and negative likelihood ratio pairs) or measures of agreement (percent positive agreement and percent negative agreement) and their two-sided 95 percent confidence intervals. We recommend reporting these measures both as fractions (e.g., 490/500) and as percentages (e.g., 98.0%)." (2)

Sensitivity and specificity are explained here.

In general, the FDA recommends reporting (2):

  • the 2x2 table of results comparing the new test with the non-reference standard
  • a description of the non-reference standard
  • measures of agreement and corresponding confidence intervals.
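
A two-sided 95 percent confidence interval for an agreement measure can be sketched in Python as follows. The text above does not say which interval method to use, so the Wilson score interval is an assumption made for this example, as are the function names. The 490/500 input is the fraction quoted from the guidance.

```python
import math


def wilson_interval(successes, total, z=1.959963984540054):
    """Two-sided Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


def format_measure(label, successes, total):
    """Report a measure as a fraction, a percentage, and its 95% interval."""
    low, high = wilson_interval(successes, total)
    return "{}: {}/{} = {:.1f}% (95% CI {:.1f}% to {:.1f}%)".format(
        label, successes, total, 100.0 * successes / total, 100.0 * low, 100.0 * high
    )


low, high = wilson_interval(490, 500)
line = format_measure("PPA", 490, 500)
```

Reporting the fraction next to the percentage, as in format_measure, lets the reader see the sample size behind each interval.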


1 - Workshop notes: Assessing agreement for diagnostic devices
2 - FDA recommendation "Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests"
3 - STAndards for the Reporting of Diagnostic accuracy studies (STARD)
4 - Wikipedia page
