Cystic Fibrosis, also called Mucoviscidos, is a hereditary disease (autosomal recessive) in which exocrine (secretory) glands produce abnormally thick mucus. This mucus can cause problems in digestion, breathing, and body cooling. It affects up to one out of 3000 newborns (with northern European ancestry). There are well over a hundred genetic changes linked to CF. It is an area companies like Illumina are very active in with a special assay cleared as an in-vitro diagnostic test with the FDA for the detection of most of the genetic variants known to cause the disease.
Background for CF and CFTR
- Cysstic Fibrosis Transmembrane Conductance Regulator
- ABC transporter (ATP-binding cassette), that functons as ion channel
- cAMP-regulated through R domain phosphorylaton
- Transports chloride and thiocyanate across epithelial cell membranes
- 1,480 amino acids
- Most common autosomal recessive disorder among Caucasians (1/3,300)
- Dysregulaton of epithelial fluid transport in lung, pancreas, and other organs
- ~ 2,000 identfied gene mutatons
- Phe508del – most common, in 70% cases
- Wide range of severity, most die of pulmonary disease at mean age of 37
- ~70% of variants currently classed as VUS (variant of unknown significance)
- ~65% are missense mutations, 24% frameshift & stop-gained, 9% synonymous
Testing Machine Learning Approaches for CF classification
- Machine learning algorithms show higher performance when compared with separate predictors
- Tree-based methods perform the best (GBM & RF AUC is 6% higher then the best predictor, MutPred)
- Top features: MutPred, AF, SIFT, CADD, POSE
- Predicted pathogenicity probability (RF.pred) correlates with available experimental data for Cl- conductance and sweat Cl-
CRAM files are compressed versions of BAM files containing (aligned) sequencing reads. They represent a further file size reduction for this type of data that is generated at ever increasing quantities. Where SAM files are human-readable text files optimized for short read storage, BAM files are their binary equivalent, and CRAM files are a restructured column-oriented binary container format for even more efficient storage.
Tke key components of the approach are that positions are encoded in a relative way (i.e., the difference between successive positions is stored rather than the absolute value) and stored as a Golomb code. Also, only differences to the reference genome are listed instead of the full sequence.
The compression rates achieved are shown in the graph below generated by Uppsala University:
Comparing speed: Using the C implementation of for CRAM (James K. Bonfield), decoding is 1.5–1.7× slower than generating BAM files, but 1.8–2.6× faster at encoding. (File size savings are reported at 34–55%.(
Additional compression can be achieved by reducing the granularity of the quality values which will result in lossy compression though. Illumina suggested a binning of Q scores without significant calling performance.
Binning of similar Q-scores (Illumina):
Compression achieved by Q-score binning (Illumina):
Sources and further reading:
HiSeq & MiSeq
The HiSeq and MiSeq use a green laser to sequence G/T and a red laser to sequence A/C. At each cycle at least one of two nucleotides for each color channel must be read to ensure proper registration. It is important to maintain color balance for each base of the index read being sequenced, otherwise index read sequencing could fail due to registration failure. E.g. if the sample contains only T and C in the first four cycles, image registration will fail. (If possible spike-in phiX sequence to add diversity to low-plex sequencing libraries.)
If one or more bases are not present in the first 11 cycles the quality of the run will be negatively impacted. This is because the color matrix is calculated from the color signals of these cycles.
The NextSeq 500 uses two-channel sequencing, which requires only two images to encode the data for four DNA bases, one red channel and one green channel. The NextSeq also uses a new implementation of real-time analysis (RTA) called RTA2.0, which includes important architecture differences from RTA on other Illumina sequencers. For any index sequences, RTA2.0 requires that there is at least one base other than G in the first two cycles. This requirement for index diversity allows the use of any Illumina index selection for single-plex indexing except index 1 (i7) 705, which uses the sequence GGACTCCT. Use the combinations in the table below for proper color balancing on the NextSeq 500.
Quality scoring of the base calls
"Quality scores measure the probability that a base is called incorrectly. With SBS technology, each base in a read is assigned a quality score by a phred-like algorithm, similar to that originally developed for Sanger sequencing experiments. The quality score of a given base, Q, is defined by the equation
Q = -10log10(e)
where e is the estimated probability of the base call being wrong. Thus, a higher quality score indicates a smaller probability of error."(1)
The quality score is usually quoted as QXX, where the XX is the score and refers to that a particular call (or a all base calls of a read / of a sample / of a run) has a probability of error of 10^(-XX/10). For example Q30 equates to an error rate of 1 in 1000, or 0.1%, Q40 equates to an error rate of 1 in 10,000 or 0.01%.
During the primary analysis (real-time analysis, RTA) on the sequencing machine, quality scoring is performed by calculating a set of predictors for each base call, and using those predictor values to look up the quality score in a quality table. The quality table is generated using a modification of the Phred algorithm on a calibration data set representative of run and sequence variability
"It is important to note how quickly or slowly quality scores degrade over the course of a read. With short-read sequencing, quality scores largely dictate the read length limits of different sequencing platforms. Thus, a longer read length specification suggests that the raw data from that platform have consistently higher quality scores across all bases." (1)
Mapping / Alignment scores
For each alignment, BWA calculates a mapping quality score, which is the (Phred-scaled) probability of the alignment being incorrect. The algorithm is similar between BWA and MAQ, except that BWA assumes that the true hit can always be found. The probability for every base is calculated as:
p = 10 ^ (-q/10)
where q is the quality. For example a mapping quality of 40: 10^-4 = 0.0001, which means there is a 0.01% chance that the base is aligned incorrectly.
Example for a whole read:
If your read is 25 bp long and the expected sequencing error rate is 1%, the probability of the read with 0 errors is:
0.99^25 = 0.78
If there is 1 perfect alignment and 5 possible alignment positions with 1 mismatch, we combine these probabilities: The probability of the read with 1 error is
combined posterior probability that the best alignment is correct:
P(0-errors) / (P(0-errors) + 5 * P(1-errors))
= 0.44, which is low.
Base quality is apparently not considered in evaluating hits in bwa.
Using a text editor, create a file for your remote servers logon credential:
Enter your Windows username and password in the file:
chmod 600 ~/.smbcredentials
Edit your /etc/fstab file:
//servername/sharename /media/windowsshare cifs credentials=/home/ubuntuusername/.smbcredentials,iocharset=utf8,sec=ntlm 0 0
sudo mount -a