CRAM format

January 7th, 2015

CRAM files are compressed versions of BAM files containing (aligned) sequencing reads. They offer a further file-size reduction for this type of data, which is being generated in ever-increasing quantities. Where SAM files are human-readable text files optimized for short-read storage, BAM files are their binary equivalent, and CRAM files are a restructured, column-oriented binary container format for even more efficient storage.

The key components of the approach are that positions are encoded relatively (i.e., the difference between successive positions is stored rather than the absolute value) and compressed as a Golomb code, and that only differences to the reference genome are listed instead of the full sequence.
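
A minimal sketch of these two ideas in Python (illustrative only; this is not the actual CRAM codec, and the positions and Rice parameter k below are made up):

    # Delta-encode sorted positions, then write each difference as a
    # Golomb-Rice code (Golomb with M = 2**k): unary quotient + k-bit remainder.
    def delta_encode(positions):
        prev, deltas = 0, []
        for pos in positions:
            deltas.append(pos - prev)   # store difference, not absolute value
            prev = pos
        return deltas

    def golomb_rice_encode(n, k=4):
        q, r = n >> k, n & ((1 << k) - 1)
        return "1" * q + "0" + format(r, "0{}b".format(k))

    positions = [100, 105, 113, 138]    # sorted alignment positions (made up)
    deltas = delta_encode(positions)    # [100, 5, 8, 25]
    print("".join(golomb_rice_encode(d) for d in deltas))

Small, clustered deltas produce short codes, which is why sorting reads by position pays off.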

The compression rates achieved are shown in the graph below, generated by Uppsala University:

[Figure: file size comparison of SAM, BAM and CRAM]

Comparing speed: using the C implementation of CRAM (James K. Bonfield), decoding is 1.5–1.7× slower than generating BAM files, but encoding is 1.8–2.6× faster. (File size savings are reported at 34–55%.)

Additional compression can be achieved by reducing the granularity of the quality values, though this results in lossy compression. Illumina suggested a binning of Q-scores that does not significantly affect calling performance.

Binning of similar Q-scores (Illumina):

[Figure: Q-score binning]
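
A toy version of such binning in Python (the boundaries below are modeled on Illumina's published 8-level scheme, but treat the exact bins as an assumption):

    # Map each Phred score to the representative value of its bin.
    BINS = [            # (low, high, representative score)
        (2, 9, 6),
        (10, 19, 15),
        (20, 24, 22),
        (25, 29, 27),
        (30, 34, 33),
        (35, 39, 37),
        (40, 99, 40),
    ]

    def bin_q(q):
        for lo, hi, rep in BINS:
            if lo <= q <= hi:
                return rep
        return 0            # q < 2: no-call / lowest bin

    print([bin_q(q) for q in [38, 31, 12, 40, 25]])   # [37, 33, 15, 40, 27]

Fewer distinct quality values mean longer runs of identical symbols, which downstream entropy coders compress much better.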

Compression achieved by Q-score binning (Illumina):

[Figure: Q-score compression]

Sources and further reading:

  1. Format definition and usage
  2. cram-toolkit
  3. Detailed report from Uppsala University
  4. SAMtools with CRAM support
  5. Original article from Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane and Ewan Birney
  6. Article about the implementation in C
  7. Illumina white paper on Q-score compression

Barcode Balancing for Illumina Sequencing

November 4th, 2014

HiSeq & MiSeq
The HiSeq and MiSeq use a green laser to sequence G/T and a red laser to sequence A/C. At each cycle, at least one of the two nucleotides for each color channel must be read to ensure proper registration. It is important to maintain color balance for each base of the index read being sequenced; otherwise, index read sequencing could fail due to registration failure. E.g., if the sample contains only T and C in the first four cycles, image registration will fail. (If possible, spike in phiX sequence to add diversity to low-plex sequencing libraries.)
If one or more bases are not present in the first 11 cycles, the quality of the run will be negatively impacted, because the color matrix is calculated from the color signals of these cycles.
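
A small Python sketch of this per-cycle check (hypothetical helper; the example index sequences are invented):

    # Check that, at every cycle, the pooled indexes give signal in both
    # color channels (green laser: G/T, red laser: A/C).
    GREEN, RED = set("GT"), set("AC")

    def color_balanced(indexes, cycles=None):
        n = cycles if cycles is not None else min(len(i) for i in indexes)
        for c in range(n):
            bases = {idx[c] for idx in indexes}
            if not bases & GREEN or not bases & RED:
                return False, c     # first cycle lacking one channel
        return True, None

    print(color_balanced(["ACGTAC", "GTACGT"]))  # (True, None)
    print(color_balanced(["TTTTTT", "GGGGGG"]))  # (False, 0): no red signal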

NextSeq 500
The NextSeq 500 uses two-channel sequencing, which requires only two images (one red channel and one green channel) to encode the data for four DNA bases. The NextSeq also uses a new implementation of real-time analysis (RTA) called RTA2.0, which includes important architectural differences from RTA on other Illumina sequencers. For any index sequences, RTA2.0 requires at least one base other than G in the first two cycles. This requirement for index diversity allows the use of any Illumina index selection for single-plex indexing except index 1 (i7) 705, which uses the sequence GGACTCCT. Use the index combinations from Illumina's low-diversity tech note (see sources) for proper color balancing on the NextSeq 500.
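
The corresponding check for the two-channel chemistry is even simpler, since G is read as "no signal" in both channels (a sketch; nextseq_ok is a made-up helper):

    # RTA2.0 rule: at least one non-G base within the first two cycles.
    def nextseq_ok(indexes):
        return any(idx[c] != "G" for idx in indexes for c in range(2))

    print(nextseq_ok(["GGACTCCT"]))              # False: 705 alone fails
    print(nextseq_ok(["GGACTCCT", "TAAGGCGA"]))  # True once pooled with 701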

Source:
Illumina Nextera tech notes, Illumina low-diversity tech note
See also TruSeq Guide

NGS reads and their Scores

August 1st, 2014

Quality scoring of the base calls

"Quality scores measure the probability that a base is called incorrectly. With SBS technology, each base in a read is assigned a quality score by a phred-like algorithm, similar to that originally developed for Sanger sequencing experiments. The quality score of a given base, Q, is defined by the equation
Q = -10log10(e)
where e is the estimated probability of the base call being wrong. Thus, a higher quality score indicates a smaller probability of error."(1)
The quality score is usually quoted as QXX, where XX is the score; it indicates that a particular call (or all base calls of a read, a sample, or a run) has a probability of error of 10^(-XX/10). For example, Q30 equates to an error rate of 1 in 1,000, or 0.1%; Q40 equates to an error rate of 1 in 10,000, or 0.01%.
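
In code, the conversion runs in both directions (a minimal Python sketch of the formula above):

    import math

    def q_to_p(q):
        return 10 ** (-q / 10)      # error probability from Phred score

    def p_to_q(p):
        return -10 * math.log10(p)  # Phred score from error probability

    print(q_to_p(30))     # 0.001  -> 1 error in 1,000 calls
    print(q_to_p(40))     # 0.0001 -> 1 error in 10,000 calls
    print(p_to_q(0.001))  # 30.0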

During the primary analysis (real-time analysis, RTA) on the sequencing machine, quality scoring is performed by calculating a set of predictors for each base call and using those predictor values to look up the quality score in a quality table. The quality table is generated using a modification of the Phred algorithm on a calibration data set representative of run and sequence variability.

"It is important to note how quickly or slowly quality scores degrade over the course of a read. With short-read sequencing, quality scores largely dictate the read length limits of different sequencing platforms. Thus, a longer read length specification suggests that the raw data from that platform have consistently higher quality scores across all bases." (1)

Mapping / Alignment scores

For each alignment, BWA calculates a mapping quality score, which is the (Phred-scaled) probability of the alignment being incorrect. The algorithm is similar between BWA and MAQ, except that BWA assumes that the true hit can always be found. The error probability is calculated from the quality q as:

p = 10 ^ (-q/10)

where q is the quality. For example, a mapping quality of 40 gives 10^-4 = 0.0001, meaning there is a 0.01% chance that the read is aligned incorrectly.

Example for a whole read:

If your read is 25 bp long and the expected sequencing error rate is 1%, the probability of the read having 0 errors is:

0.99^25 = 0.78

If there is 1 perfect alignment and 5 possible alignment positions with 1 mismatch, we combine these probabilities. The probability of the read having exactly 1 error is

25 * 0.01 * 0.99^24 = 0.20

The combined posterior probability that the best alignment is correct is then:

P(0 errors) / (P(0 errors) + 5 * P(1 error)) = 0.78 / (0.78 + 5 * 0.20) = 0.44

which is low.
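
The same arithmetic as a short Python sketch (reproducing the numbers above):

    import math

    read_len, err = 25, 0.01

    p0 = (1 - err) ** read_len                         # P(0 errors) ~ 0.78
    p1 = read_len * err * (1 - err) ** (read_len - 1)  # P(1 error)  ~ 0.20

    posterior = p0 / (p0 + 5 * p1)           # best hit vs. 5 one-mismatch hits
    mapq = -10 * math.log10(1 - posterior)   # Phred-scale the error probability

    print(round(posterior, 2), round(mapq, 1))   # 0.44 2.5 -> a low MAPQ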

Base quality is apparently not considered when evaluating hits in BWA.

Sources:

  1. Illumina
  2. BWA paper
  3. DaveTang blog
  4. jwfoley on SEQanswers
  5. Ying Wei's notes
  6. Gene-Test bioinformatics (PGS / NGS) consulting

Mount Windows share in Linux system

August 1st, 2014

Using a text editor, create a file for your remote server's logon credentials:

gedit ~/.smbcredentials

Enter your Windows username and password in the file:

username=msusername
password=mspassword

Then restrict the file's permissions so only you can read it:

chmod 600 ~/.smbcredentials

Edit your /etc/fstab file:

//servername/sharename /media/windowsshare cifs credentials=/home/ubuntuusername/.smbcredentials,iocharset=utf8,sec=ntlm 0 0

Make sure the mount point exists (sudo mkdir /media/windowsshare), then remount everything listed in fstab:

sudo mount -a

Ref: https://wiki.ubuntu.com/MountWindowsSharesPermanently

Testing for Equivalence

August 1st, 2014

To assess whether a new test (e.g., a diagnostic test or a medical device testing for disease or non-disease status) is equivalent to an existing test, the following measures can be reported. They can be important for the submission of premarket notification (510(k)) or premarket approval (PMA) applications for diagnostic devices (tests) to the US Food and Drug Administration (FDA). A new test is usually compared to an existing, established test or to a generally trusted reference. If the existing test (or reference) is not perfect, the FDA recommends reporting the positive and negative percent agreement (PPA/NPA). These are calculated from true positives, false positives, false negatives, and true negatives as follows (1):

                 Existing Test
New Test     R+          R-
    T+       TP          FP          TP+FP
    T-       FN          TN          FN+TN
             TP+FN       FP+TN       TP+FP+FN+TN

  PPA = TP * 100 / (TP + FN)

  NPA = TN * 100 / (TN + FP)
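
A minimal Python sketch computing both measures from the table's cell counts (the example counts echo the FDA's 490/500 fraction quoted below):

    # Positive/negative percent agreement from the 2x2 table above.
    def agreement(tp, fp, fn, tn):
        ppa = 100 * tp / (tp + fn)
        npa = 100 * tn / (tn + fp)
        return ppa, npa

    print(agreement(tp=490, fp=20, fn=10, tn=980))   # (98.0, 98.0)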

Measures of accuracy

The FDA "recommends you report measures of diagnostic accuracy (sensitivity and specificity pairs, positive and negative likelihood ratio pairs) or measures of agreement (percent positive agreement and percent negative agreement) and their two-sided 95 percent confidence intervals. We recommend reporting these measures both as fractions (e.g., 490/500) and as percentages (e.g., 98.0%)." (2) Sensitivity and specificity are explained here. (A sketch of one way to compute such confidence intervals follows the list below.) In general the FDA recommends reporting (2):

  • the 2x2 table of results comparing the new test with the non-reference standard
  • a description of the non-reference standard
  • measures of agreement and corresponding confidence intervals.
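
One common choice for the recommended two-sided 95 percent confidence intervals is the Wilson score interval (a sketch; treat the choice of interval method as an assumption):

    import math

    def wilson_ci(successes, n, z=1.96):
        # Two-sided 95% Wilson score interval for a proportion.
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    print(wilson_ci(490, 500))   # ~ (0.964, 0.989) for 98.0% agreement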


References:

  1. Workshop notes (B. Biswas): Assessing agreement for diagnostic devices  
  2. FDA recommendation "Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests" 
  3. STAndards for the Reporting of Diagnostic accuracy studies (STARD)
  4. Wikipedia page