Barcode Balancing for Illumina Sequencing

November 4th, 2014

HiSeq & MiSeq
The HiSeq and MiSeq use a green laser to sequence G/T and a red laser to sequence A/C. At each cycle at least one of two nucleotides for each color channel must be read to ensure proper registration. It is important to maintain color balance for each base of the index read being sequenced, otherwise index read sequencing could fail due to registration failure. E.g. if the sample contains only T and C in the first four cycles, image registration will fail. (If possible spike-in phiX sequence to add diversity to low-plex sequencing libraries.)
If one or more bases are not present in the first 11 cycles the quality of the run will be negatively impacted. This is because the color matrix is calculated from the color signals of these cycles.

NextSeq 500
The NextSeq 500 uses two-channel sequencing, which requires only two images to encode the data for four DNA bases, one red channel and one green channel. The NextSeq also uses a new implementation of real-time analysis (RTA) called RTA2.0, which includes important architecture differences from RTA on other Illumina sequencers. For any index sequences, RTA2.0 requires that there is at least one base other than G in the first two cycles. This requirement for index diversity allows the use of any Illumina index selection for single-plex indexing except index 1 (i7) 705, which uses the sequence GGACTCCT. Use the combinations in the table below for proper color balancing on the NextSeq 500.

Illlumina Nextera tech notes, Illumina Low diversity note
See also TruSeq Guide

NGS reads and their Scores

August 1st, 2014

Quality scoring of the base calls

"Quality scores measure the probability that a base is called incorrectly. With SBS technology, each base in a read is assigned a quality score by a phred-like algorithm, similar to that originally developed for Sanger sequencing experiments. The quality score of a given base, Q, is defined by the equation
Q = -10log10(e)
where e is the estimated probability of the base call being wrong. Thus, a higher quality score indicates a smaller probability of error."(1)
The quality score is usually quoted as QXX, where the XX is the score and refers to that a particular call (or a all base calls of a read / of a sample / of a run) has a probability of error of 10^(-XX/10). For example Q30 equates to an error rate of 1 in 1000, or 0.1%, Q40 equates to an error rate of 1 in 10,000 or 0.01%.

During the primary analysis (real-time analysis, RTA) on the sequencing machine, quality scoring is performed by calculating a set of predictors for each base call, and using those predictor values to look up the quality score in a quality table. The quality table is generated using a modification of the Phred algorithm on a calibration data set representative of run and sequence variability

"It is important to note how quickly or slowly quality scores degrade over the course of a read. With short-read sequencing, quality scores largely dictate the read length limits of different sequencing platforms. Thus, a longer read length specification suggests that the raw data from that platform have consistently higher quality scores across all bases." (1)

Mapping / Alignment scores

For each alignment, BWA calculates a mapping quality score, which is the (Phred-scaled) probability of the alignment being incorrect. The algorithm is similar between BWA and MAQ, except that BWA assumes that the true hit can always be found. The probability for every base is calculated as:

p = 10 ^ (-q/10)

where q is the quality. For example a mapping quality of 40: 10^-4 = 0.0001, which means there is a 0.01% chance that the base is aligned incorrectly.

Example for a whole read:

If your read is 25 bp long and the expected sequencing error rate is 1%, the probability of the read with 0 errors is:

0.99^25 = 0.78

If there is 1 perfect alignment and 5 possible alignment positions with 1 mismatch, we combine these probabilities: The probability of the read with 1 error is
combined posterior probability that the best alignment is correct:

P(0-errors) / (P(0-errors) + 5 * P(1-errors))

= 0.44, which is low.

Base quality is apparently not considered in evaluating hits in bwa.


  1. Illumina
  2. BWA paper
  3. DaveTang blog
  4. jwfoley on SEQanswers
  5. Ying Wei's notes

Mount Windows share in Linux system

August 1st, 2014

Using a text editor, create a file for your remote servers logon credential:

gedit ~/.smbcredentials

Enter your Windows username and password in the file:

chmod 600 ~/.smbcredentials

Edit your /etc/fstab file:

//servername/sharename /media/windowsshare cifs credentials=/home/ubuntuusername/.smbcredentials,iocharset=utf8,sec=ntlm 0 0 
sudo mount -a


Testing for Equivalence

August 1st, 2014

To assess whether a new test (e.g. a diagnostic tests or medical device testing for disease or non-disease status) is equivalent to an existing test, the following measures can be reported. They can be of importance for the submission of premarket notification (510(k)) or premarket approval (PMA) applications for diagnostic devices (tests) to the American Food and Drug Administration (FDA).

A new test is usually compared to an existing and established test or a general trusted reference. If the existing test (or reference) is not perfect, the FDA recommends to report the positive and negative percent agreement (PPA/NPA). This is calculated using false positives, true positives, false negative and true negatives and calculated like this (1):

          Existing test		
New test  R+	   R-	
T+	  TP	   FP	    TP+FP
T-	  FN	   TN	    FN+TN
	  TP+FN    FP+TN    N

PPA = TP * 100 / (TP + FN)
NPA = TN * 100 / (TN + FP)

Measures of accuracy
"FDA recommends you report measures of diagnostic accuracy (sensitivity and specificity pairs, positive and negative likelihood ratio pairs) or measures of agreement (percent positive agreement and percent negative agreement) and their two-sided 95 percent confidence intervals. We recommend reporting these measures both as fractions (e.g., 490/500) and as percentages (e.g., 98.0%)." (2)

Sensitivity and specificity are explained here.

In general th FDA recommends to report (2)

  • the 2x2 table of results comparing the new test with the non-reference standard
  • a description of the non-reference standard
  • measures of agreement and corresponding confidence intervals.


1 - Workshop notes: Assessing agreement for diagnostic devices
2 - FDA recommendation "Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests"
3 - STAndards for the Reporting of Diagnostic accuracy studies (STARD)
4 - Wikipedia page

SNP calling & the VCF format

May 16th, 2014

SNP calling refers to the process of identifying posititions where the genome of a sequenced sample differs to that of the reference genome. This might lead to finding disease-causing genomic alterations.
In the following I wanted to re-align short NGS reads against a specific reference (in this case the Mitochondrial genome sequence). A simple way is to use samtools.

1. make a reference genome index

bwa index -a is NCBI_chrM.fa

2. filter reads

samtools view -F4 –hb A1_S1.bam chrM > A1_S1_chrM.bam
samtools view -f4 -hb A1_S1.bam > A1_S1_unmapped.bam
samtools merge A1_S1_chrM_and_Un.bam A1_S1_chrM.bam A1_S1_unmapped.bam

3. create fastq

bamToFastq -i A1_S1_chrM_and_Un.bam -fq A1_S1_chrM_and_Un.1.fq \
 -fq2 A1_S1_chrM_and_Un.2.fq 2> bwa.err &

4. align to new reference

bwa aln NCBI_chrM.fa A1_S1_chrM_and_Un.1.fq >
bwa aln NCBI_chrM.fa A1_S1_chrM_and_Un.2.fq > A1_S1_chrM_and_Un.2.sai
bwa sampe NCBI_chrM.fa A1_S1_chrM_and_Un.1.sai A1_S1_chrM_and_Un.2.sai \
 A1_S1_chrM_and_Un.1.fq A1_S1_chrM_and_Un.2.fq > A1_S1_chrM_realigned.sam
samtools view -F4 -Sbh A1_S1_chrM_realigned.sam \
 | samtools sort -o - sorted > A1_S1_chrM_realigned.bam

5. call SNPs

samtools mpileup -uD -f NCBI_chrM.fa A1_S1_chrM_realigned.bam \
 | bcftools view -cg - > A1_S1_chrM_realigned.sam.vcf

From the Samtools help pages:
One should consider to apply the following parameters to mpileup in different scenarios:

  • Apply -C50 to reduce the effect of reads with excessive mismatches. This aims to fix overestimated mapping quality and appears to be preferred for BWA-short.
  • Given multiple technologies, apply -P to specify which technologies to use for collecting initial INDEL candidates. It is recommended to find INDEL candidates from technologies with low INDEL error rate, such as Illumina. When this option is in use, the value(s) following the option must appear in the PL tag in the @RG header lines.
  • Apply -D and -S to keep per-sample read depth and strand bias. This is preferred if there are more than one samples at high coverage.
  • Adjust -m and -F to control when to initiate indel realignment (requiring r877+). Samtools only finds INDELs where there are sufficient reads containing the INDEL at the same position. It does this to avoid excessive realignment that is computationally demanding. The default works well for many low-coverage samples but not for, say, 500 exomes. In the latter case, using -m 3 -F 0.0002 (3 supporting reads at minimum 0.02% frequency) is necessary to find singletons.
  • Use `-BQ0 -d10000000 -f ref.fa' if the purpose is to get the precise depth of coverage rather than call SNPs. Under this setting, mpileup will count low-quality bases, process all reads (by default the depth is capped at 8000), and skip the time-demanding BAQ calculation.
  • Apply -A to use anomalous read pairs in mpileup, which are not used by default (requring r874+).

The VCF format

The Variant Call Format (VCF) is the emerging standard for storing variant data. Originally designed for SNPs and short INDELs, it also works for structural variations.

VCF consists of a header and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below).

Col	Field	Description
1	CHROM	Chromosome name
2	POS	1-based position. For an indel, this is the position 
		preceding the indel.
3	ID	Variant identifier. Usually the dbSNP rsID.
4	REF	Reference sequence at POS involved in the variant. 
		For a SNP, it is a single base.
5	ALT	Comma delimited list of alternative seuqence(s).
6	QUAL	Phred-scaled probability of all samples being 
		homozygous reference.
7	FILTER	Semicolon delimited list of filters that the variant 
		fails to pass.
8	INFO	Semicolon delimited list of variant information.
9	FORMAT	Colon delimited list of the format of individual 
		genotypes in the following fields.
10+	Sample(s)  Individual genotype information defined by FORMAT.

The following table gives the INFO tags used by samtools and bcftools.

Tag	Description
AC	Allele count in genotypes
AC1	Max-likelihood estimate of the first ALT allele count 
	(no HWE assumption)
AF1	Max-likelihood estimate of the first ALT allele frequency 
	(assuming HWE)
AN	Total number of alleles in called genotypes
CGT	The most probable constrained genotype configuration in the trio
CLR	Log ratio of genotype likelihoods with and without the constraint
DP	Raw read depth (sum for all samples)
DP4	Number of high-quality ref-forward bases, ref-reverse, alt-forward 
	and alt-reverse bases
FQ	Phred probability of all samples being the same
G3	ML estimate of genotype frequencies
HWE	Hardy-Weinberg equilibrium test (PMID:15789306)
ICF	Inbreeding coefficient F
INDEL	Indicates that the variant is an INDEL.
IS	Maximum number of reads supporting an indel and fraction of 
	indel reads
MDV	Maximum number of high-quality nonRef reads in samples
MQ	Root-mean-square mapping quality of covering reads
PC2	Phred probability of the nonRef allele frequency in group1 samples 
	being larger (, smaller) than in group2.
PCHI2	Posterior weighted chi2 P-value for testing the association 
	between group1 and group2 samples.
PR	Number of permutations yielding a smaller PCHI2.
PV4	P-values for strand bias, baseQ bias, mapQ bias and tail 
	distance bias
QBD	Quality by Depth: QUAL/#reads
QCHI2	Phred scaled PCHI2.
RP	# permutations yielding a smaller PCHI2
RPB	Read Position Bias
SF	Source File (index to sourceFiles, f when filtered)
TYPE	Variant type
UGT	The most probable unconstrained genotype configuration in the trio
VDB	Variant Distance Bias (v2) for filtering splice-site artefacts 
	in RNA-seq data.