SNP calling & the VCF format

May 16th, 2014

SNP calling refers to the process of identifying posititions where the genome of a sequenced sample differs to that of the reference genome. This might lead to finding disease-causing genomic alterations.
In the following I wanted to re-align short NGS reads against a specific reference (in this case the Mitochondrial genome sequence). A simple way is to use samtools.

1. make a reference genome index

bwa index -a is NCBI_chrM.fa

2. filter reads

samtools view -F4 –hb A1_S1.bam chrM > A1_S1_chrM.bam
samtools view -f4 -hb A1_S1.bam > A1_S1_unmapped.bam
samtools merge A1_S1_chrM_and_Un.bam A1_S1_chrM.bam A1_S1_unmapped.bam

3. create fastq

bamToFastq -i A1_S1_chrM_and_Un.bam -fq A1_S1_chrM_and_Un.1.fq \
 -fq2 A1_S1_chrM_and_Un.2.fq 2> bwa.err &

4. align to new reference

bwa aln NCBI_chrM.fa A1_S1_chrM_and_Un.1.fq >
bwa aln NCBI_chrM.fa A1_S1_chrM_and_Un.2.fq > A1_S1_chrM_and_Un.2.sai
bwa sampe NCBI_chrM.fa A1_S1_chrM_and_Un.1.sai A1_S1_chrM_and_Un.2.sai \
 A1_S1_chrM_and_Un.1.fq A1_S1_chrM_and_Un.2.fq > A1_S1_chrM_realigned.sam
samtools view -F4 -Sbh A1_S1_chrM_realigned.sam \
 | samtools sort -o - sorted > A1_S1_chrM_realigned.bam

5. call SNPs

samtools mpileup -uD -f NCBI_chrM.fa A1_S1_chrM_realigned.bam \
 | bcftools view -cg - > A1_S1_chrM_realigned.sam.vcf

From the Samtools help pages:
One should consider to apply the following parameters to mpileup in different scenarios:

  • Apply -C50 to reduce the effect of reads with excessive mismatches. This aims to fix overestimated mapping quality and appears to be preferred for BWA-short.
  • Given multiple technologies, apply -P to specify which technologies to use for collecting initial INDEL candidates. It is recommended to find INDEL candidates from technologies with low INDEL error rate, such as Illumina. When this option is in use, the value(s) following the option must appear in the PL tag in the @RG header lines.
  • Apply -D and -S to keep per-sample read depth and strand bias. This is preferred if there are more than one samples at high coverage.
  • Adjust -m and -F to control when to initiate indel realignment (requiring r877+). Samtools only finds INDELs where there are sufficient reads containing the INDEL at the same position. It does this to avoid excessive realignment that is computationally demanding. The default works well for many low-coverage samples but not for, say, 500 exomes. In the latter case, using -m 3 -F 0.0002 (3 supporting reads at minimum 0.02% frequency) is necessary to find singletons.
  • Use `-BQ0 -d10000000 -f ref.fa' if the purpose is to get the precise depth of coverage rather than call SNPs. Under this setting, mpileup will count low-quality bases, process all reads (by default the depth is capped at 8000), and skip the time-demanding BAQ calculation.
  • Apply -A to use anomalous read pairs in mpileup, which are not used by default (requring r874+).

The VCF format

The Variant Call Format (VCF) is the emerging standard for storing variant data. Originally designed for SNPs and short INDELs, it also works for structural variations.

VCF consists of a header and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below).

Col	Field	Description
1	CHROM	Chromosome name
2	POS	1-based position. For an indel, this is the position 
		preceding the indel.
3	ID	Variant identifier. Usually the dbSNP rsID.
4	REF	Reference sequence at POS involved in the variant. 
		For a SNP, it is a single base.
5	ALT	Comma delimited list of alternative seuqence(s).
6	QUAL	Phred-scaled probability of all samples being 
		homozygous reference.
7	FILTER	Semicolon delimited list of filters that the variant 
		fails to pass.
8	INFO	Semicolon delimited list of variant information.
9	FORMAT	Colon delimited list of the format of individual 
		genotypes in the following fields.
10+	Sample(s)  Individual genotype information defined by FORMAT.

The following table gives the INFO tags used by samtools and bcftools.

Tag	Description
AC	Allele count in genotypes
AC1	Max-likelihood estimate of the first ALT allele count 
	(no HWE assumption)
AF1	Max-likelihood estimate of the first ALT allele frequency 
	(assuming HWE)
AN	Total number of alleles in called genotypes
CGT	The most probable constrained genotype configuration in the trio
CLR	Log ratio of genotype likelihoods with and without the constraint
DP	Raw read depth (sum for all samples)
DP4	Number of high-quality ref-forward bases, ref-reverse, alt-forward 
	and alt-reverse bases
FQ	Phred probability of all samples being the same
G3	ML estimate of genotype frequencies
HWE	Hardy-Weinberg equilibrium test (PMID:15789306)
ICF	Inbreeding coefficient F
INDEL	Indicates that the variant is an INDEL.
IS	Maximum number of reads supporting an indel and fraction of 
	indel reads
MDV	Maximum number of high-quality nonRef reads in samples
MQ	Root-mean-square mapping quality of covering reads
PC2	Phred probability of the nonRef allele frequency in group1 samples 
	being larger (, smaller) than in group2.
PCHI2	Posterior weighted chi2 P-value for testing the association 
	between group1 and group2 samples.
PR	Number of permutations yielding a smaller PCHI2.
PV4	P-values for strand bias, baseQ bias, mapQ bias and tail 
	distance bias
QBD	Quality by Depth: QUAL/#reads
QCHI2	Phred scaled PCHI2.
RP	# permutations yielding a smaller PCHI2
RPB	Read Position Bias
SF	Source File (index to sourceFiles, f when filtered)
TYPE	Variant type
UGT	The most probable unconstrained genotype configuration in the trio
VDB	Variant Distance Bias (v2) for filtering splice-site artefacts 
	in RNA-seq data.


Linux Firewall

July 4th, 2013

To block a specific IP address from network access to your (Ubuntu Linux) system, you can add it to your firewall settings:
sudo iptables -A INPUT -s -j DROP
To remove this entry:
sudo iptables -D INPUT -s -j DROP
To just list current firewall rules:
sudo iptables -L


GC content of human chromosomes

April 16th, 2013

The GC content is the molar ratio of guanine+cytosine bases in DNA. The human genome is a mosaic of GC-rich and GC-poor regions, of around 300kb in length, called isochores. GC content is an important factor in many experiments and bioinformatic analysis. This is especially true for next-generation sequencing where the DNA being sequenced has gone through multiple rounds of PCR amplification.

Average GC content per chromosome:

1   0.417439
2   0.402438
3   0.396943
4   0.382479
5   0.395163
6   0.396109
7   0.407513
8   0.401757
9   0.413168
10  0.415849
11  0.415657
12  0.40812
13  0.385265
14  0.408872
15  0.42201
16  0.447894
17  0.455405
18  0.39785
19  0.483603
20  0.441257
21  0.408325
22  0.479881
X   0.394963
Y   0.391288
MT  0.443626

The common way to reduce the GC bias in data analysis is to basically

  1. calculate to GC ratio (number of G/C bases / number of bases) in the region of interest (ROI) being measured
  2. find average value measured (a) across the genome in all regions with this ratio
  3. normalize the value measured in the ROI (m) with this value: m/a

More details on the GC bias in next-gen sequencing is described by Benjamini and Speed here: " The bias is not consistent between samples; and there is no consensus as to the best methods to remove it in a single sample. (...) It is the GC content of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC-rich fragments and AT-rich fragments are underrepresented in the sequencing results. This empirical evidence strengthens the hypothesis that PCR is the most important cause of the GC bias."

Correcting the bias can follow a "read model", "fragment model" or a "global model".

Sources:, PubMed, PubMed

See also: Chromosome length table

Windows Task Scheduler Error

April 16th, 2013

A scheduled task on Microsoft Windows 2008 failed "due to a time trigger condition" and with the error message including "Data: Error Value 2147943726." after running without problems before.
The reason for this was that the network-wide password for the user account assigned to running the task, had been changed since setting up the task.
Re-opening the task properties (double-click in the "Active Tasks" list and select "Options" from the right-hand menue") and saving with the new password fixed the problem.

Chromosome lengths

March 20th, 2013

Here is a quick list of the sizes of human chromosomes in assembly GRCh37 as defined by Ensembl:

chrom	 length [bp]
 1	 249,250,621 
 2	 243,199,373 
 3	 198,022,430 
 4	 191,154,276 
 5	 180,915,260 
 6	 171,115,067 
 7	 159,138,663 
 8	 146,364,022 
 9	 141,213,431 
10	 135,534,747 
11	 135,006,516 
12	 133,851,895 
13	 115,169,878 
14	 107,349,540 
15	 102,531,392 
16	  90,354,753 
17	  81,195,210 
18	  78,077,248 
19	  59,128,983 
20	  63,025,520 
21	  48,129,895 
22	  51,304,566 
X	 155,270,560 
Y	  59,373,566 
Mt	      16,569
Chromosome lengths

These sizes are useful for calculations of percent coverage of genomic features or sequencing reads.
They are often required when working with BED files.

Related: Chromosome ideograms and nomenclature, chromosome GC content