Nucleotide Ambiguity Codes

April 4th, 2012

The symbols used to describe the nucleotides in DNA and RNA are the following:

Symbol       Meaning      Nucleic Acid
A            A           Adenine
C            C           Cytosine
G            G           Guanine
T            T           Thymine
U            U           Uracil
M          A or C
R          A or G
W          A or T
S          C or G
Y          C or T
K          G or T
V        A or C or G
H        A or C or T
D        A or G or T
B        C or G or T
X      G or A or T or C
N      G or A or T or C

Note: these letters are also used in the "samtools tview" program to visually show NGS read alignments.
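
As a small illustration of how the table can be used in practice, the following Python sketch stores the codes in a dictionary and expands an ambiguous sequence into all sequences it can stand for. The names IUPAC, expand and matches are made up for this example and do not come from any particular library.

```python
from itertools import product

# Ambiguity codes from the table above: symbol -> bases it can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def expand(seq):
    """Return every unambiguous sequence encoded by an ambiguous one."""
    choices = [IUPAC[symbol] for symbol in seq.upper()]
    return ["".join(bases) for bases in product(*choices)]


def matches(pattern, seq):
    """True if the unambiguous seq is compatible with the ambiguous pattern."""
    if len(pattern) != len(seq):
        return False
    return all(base in IUPAC[symbol]
               for symbol, base in zip(pattern.upper(), seq.upper()))


print(expand("AYG"))
print(matches("RY", "AC"))
```

Symbols that are not in the table raise a KeyError, and the number of expanded sequences grows quickly with the number of ambiguous positions, so expand is only practical for short sequences.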


1000 Genomes Project Populations

April 3rd, 2012

The goal of the 1000 Genomes Project is to create "A Deep Catalog of Human Genetic Variation" by measuring and analysing most genetic variants that have frequencies of at least 1% in the populations studied.

The population codes used in the project are the following (Source: 1000 Genomes / ftp site):

CHB	Han Chinese             Han Chinese in Beijing, China
JPT	Japanese                Japanese in Tokyo, Japan
CHS	Southern Han Chinese    Han Chinese South 
CDX	Dai Chinese             Chinese Dai in Xishuangbanna, China
KHV	Kinh Vietnamese         Kinh in Ho Chi Minh City, Vietnam
CHD	Denver Chinese          Chinese in Denver, Colorado (pilot 3 only)
CEU	CEPH                    Utah residents (CEPH) with Northern and Western European ancestry
TSI	Tuscan                  Toscani in Italia
GBR	British                 British in England and Scotland
FIN	Finnish                 Finnish in Finland
IBS	Spanish                 Iberian populations in Spain
YRI	Yoruba                  Yoruba in Ibadan, Nigeria
LWK	Luhya                   Luhya in Webuye, Kenya
GWD	Gambian                 Gambian in Western Division, The Gambia
MSL	Mende                   Mende in Sierra Leone
ESN	Esan                    Esan in Nigeria
ASW	African-American SW     African Ancestry in Southwest US  
ACB	African-Caribbean       African Caribbean in Barbados
MXL	Mexican-American        Mexican Ancestry in Los Angeles, California
PUR	Puerto Rican            Puerto Rican in Puerto Rico
CLM	Colombian               Colombian in Medellin, Colombia
PEL	Peruvian                Peruvian in Lima, Peru

GIH	Gujarati                Gujarati Indian in Houston, TX
PJL	Punjabi                 Punjabi in Lahore, Pakistan
BEB	Bengali                 Bengali in Bangladesh
STU	Sri Lankan              Sri Lankan Tamil in the UK
ITU	Indian                  Indian Telugu in the UK

aCGH array QC measures

March 8th, 2012

The within-array quality for (genomic) microarrays is often measured using the following metrics:

  1. Standard Deviation Autosome / Robust (SD autosome)

    Measure of the dispersion of the Log2 ratios of all clones on the array, giving an overall picture of the noise in the array. It is calculated on the normalised but unsmoothed data. The SD robust is calculated on the middle 58%/66% of the data only. Because outliers are excluded, large changes such as trisomies will not cause this number to change significantly. (The SD robust is the number we use when we say “3 SDs away from the noise” in the calling algorithm.) Both measures are given after all data processing but excluding any smoothing. For BlueFuse Multi processed data the values should be 0.07-0.15 for the autosome measure and 0.05-0.11 for the robust measure.

  2. Signal to Background Ratio (SBR)

    Brightness of the mean signal (after the background has been subtracted) divided by the raw background signal (global signal).

  3. Derivative Log2 Ratio / Fused (DLR)

    Measure of the probe-to-probe variability. In an ideal world, probes within a region will have essentially the same ratio. In a noisy array, adjacent probes can have a very large ratio difference. The DLR raw is calculated before any data processing; the DLR fused is calculated after normalization and data correction, but always on unsmoothed data. It is therefore independent of user settings and cannot be adjusted by the user, which gives a consistent array-to-array measure of noise. BlueFuse results should be < 0.2.

  4. % included clones

    Percentage of all clones that were not excluded on a BAC array due to inconsistencies between clone replicates. For BlueFuse results this should be > 95 %.

  5. Mean Spot Amplitude

    The mean fluorescent signal intensities for the two channels: channel 1 = sample (typically Cy3; excitation 550 nm, emission 570 nm) and channel 2 = reference (typically Cy5; excitation 650 nm, emission 670 nm). This metric varies between scanners. The mean spot amplitude can indicate how well the DNA has been labelled with fluorescent dyes but, more importantly, very high values can indicate over-scanning of the microarray image or poor washing that leaves a lot of non-specific signal. The balance between channels can also be assessed: the Cy5 signal tends to give a higher intensity than Cy3, but major differences between the channels may indicate a labelling or scanner problem.

Source: BlueGnome user docs
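
To make some of the metrics above more concrete, here is a rough Python sketch of how they could be computed from a list of log2 ratios in genomic order. This is not the BlueFuse implementation: the function names are invented for this example, and the DLR formula (interquartile range of adjacent differences, scaled by 1.349 and the square root of 2) is a commonly used definition that is assumed here, not taken from the BlueGnome documentation. Only the thresholds come from the text above.

```python
import math
import statistics


def signal_to_background(mean_signal_minus_bg, raw_background):
    """SBR: background-subtracted mean signal divided by raw background."""
    return mean_signal_minus_bg / raw_background


def sd_autosome(log2_ratios):
    """Standard deviation of the unsmoothed autosomal log2 ratios."""
    return statistics.stdev(log2_ratios)


def dlr_spread(log2_ratios):
    """Probe-to-probe noise from differences between adjacent probes.

    Assumed definition: IQR of the adjacent differences, converted to a
    standard deviation (IQR = 1.349 * SD for normal data) and divided by
    sqrt(2) because each difference combines the noise of two probes.
    """
    diffs = [b - a for a, b in zip(log2_ratios, log2_ratios[1:])]
    q1, _, q3 = statistics.quantiles(diffs, n=4)
    return (q3 - q1) / (1.349 * math.sqrt(2))


def percent_included(n_included, n_total):
    """Percentage of clones that were not excluded."""
    return 100.0 * n_included / n_total


def qc_report(sd, dlr, pct_included):
    """Compare the metrics with the BlueFuse ranges quoted in the text.

    The SD range is read literally (0.07-0.15); whether a lower value
    should really count as a failure is a judgement call.
    """
    return {
        "sd_autosome_ok": 0.07 <= sd <= 0.15,
        "dlr_ok": dlr < 0.2,
        "included_ok": pct_included > 95.0,
    }


print(qc_report(sd=0.10, dlr=0.05, pct_included=97.0))
```

The robust SD is left out on purpose, because the exact way the middle 58%/66% of the data is selected is not described in the source.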

Canonical transcripts

January 3rd, 2012

As reported in the Ensembl 2009 NAR paper, canonical transcripts are defined for all genes and for all species in the Ensembl gene sets. "The canonical transcript is defined as either the longest CDS, if the gene has translated transcripts, or the longest cDNA. Should a transcript already regarded as canonical not be selected using the above rules, there is support for storing this information in the Ensembl database."
For the human gene annotation, the hierarchy used to choose when there is more than one protein-coding transcript is:

  1. CCDS transcripts
  2. Havana manual annotation transcripts of the type "protein_coding"
  3. Havana manual annotation transcripts that are also protein-coding
  4. Ensembl protein-coding transcripts

If there are multiple transcripts within a group, take the one with the longest CDS in the highest-priority group.
For non-coding types, take the longest cDNA of:

  1. Havana transcripts
  2. Ensembl transcripts

Source: Ensembl 2009 NAR paper, Ensembl mailing list
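
The selection rules above can be written down as a short Python sketch. This is only an illustration of the described hierarchy, not Ensembl code: the Transcript class, its fields and pick_canonical are hypothetical names, and "group" is simply the position of the transcript's source in the relevant list above (1 = highest priority).

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    stable_id: str
    group: int        # rank in the hierarchy: 1 = highest priority
    cds_length: int   # 0 for non-coding transcripts
    cdna_length: int


def pick_canonical(transcripts):
    """Pick a canonical transcript following the rules described above."""
    if not transcripts:
        return None
    coding = [t for t in transcripts if t.cds_length > 0]
    if coding:
        # Protein-coding: longest CDS within the highest-priority group.
        best = min(t.group for t in coding)
        return max((t for t in coding if t.group == best),
                   key=lambda t: t.cds_length)
    # Non-coding: longest cDNA within the highest-priority group.
    best = min(t.group for t in transcripts)
    return max((t for t in transcripts if t.group == best),
               key=lambda t: t.cdna_length)


example = [
    Transcript("T1", group=4, cds_length=3000, cdna_length=4000),
    Transcript("T2", group=1, cds_length=1200, cdna_length=2000),
    Transcript("T3", group=1, cds_length=1500, cdna_length=1800),
]
print(pick_canonical(example).stable_id)
```

Note that T1 has the longest CDS overall but is not chosen, because priority group is considered before length.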

These objects can be regarded as representative transcripts for the gene and can be fetched with the Perl API method canonical_transcript of a gene object ($gene->canonical_transcript).

Some caution is needed when looking at the pseudo-autosomal regions (PAR): for genes from the Y PAR, the method will return a transcript with X coordinates. While not really a bug, this might mess up your data if it goes unnoticed. To check and fix this, something like the following will work:


# assumes $slice_adaptor and $transcript_adaptor were
# obtained from the Ensembl Registry beforehand
# fetch slice from Y PAR
my $slice = $slice_adaptor->fetch_by_region(
  'chromosome', 'Y', 59100480, 59115127);
# get an example gene
my $gene = $slice->get_all_Genes->[0];
# get canonical transcript from the gene
my $transcript = $gene->canonical_transcript;
# re-fetch transcript on Y to avoid getting
# X locations for PAR
if ($gene->slice->seq_region_name eq 'Y') {
  my $sid = $transcript->stable_id;
  $transcript = undef;
  # second argument: load exons together with the transcripts
  my $transcripts =
    $transcript_adaptor->fetch_all_by_Slice($slice, 1);
  foreach my $poss_transcript (@$transcripts) {
    next unless ($poss_transcript->stable_id eq $sid);
    $transcript = $poss_transcript;
    last;
  }
}

GAL file format

December 9th, 2011

GenePix Array List (GAL) files are text files with specific information about the location, size, and name of each DNA spot on a microarray. They are therefore of vital importance for the analysis of scanned microarray images.

The format defines a specific header before the list of data columns follows:


ATF	1			

9	5			

Type=GenePix ArrayList V1.0				



"Block1=10000, 38780, 150, 20, 200, 18, 200"				


ArrayerSoftwareName=TAS Application Suite (MicroGrid II)				



Block	Column	Row	ID	Name

1	1	1	RP11-163J21	Clone 1

1	1	2	RP11-163J21	Clone 2


ATF -> File conforms to the Axon Text File (ATF) format

1 -> Version number of ATF

9 -> Number of header lines before the "Block, Column, Row, ..." line

5 -> Number of data columns (Block, Column, Row, Name, ID)

Type=GenePix ArrayList V1.0 -> Type of file, same for all GAL files

BlockCount=1 -> Number of blocks described in the file

BlockType=0 -> Type of block, 0 = rectangular

BlockX=A, B, C, D, E, F, G -> The position and dimensions of each block.

A -> xOrigin

B -> yOrigin

C -> Feature diameter

D -> xFeatures

E -> xSpacing

F -> yFeatures

G -> ySpacing

ScanResolution - Optional parameter to scale the position on higher-resolution images

Block arrangement

1	2	3	4

5	6	7	8

9	10	11	12

13	14	15	16

The data columns are:

  • Block
  • Column
  • Row
  • Name
  • ID
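
Based on the layout described above, a GAL file can be read with a few lines of Python. The following is a minimal sketch, not a complete GAL reader: parse_gal and BLOCK_FIELDS are names made up for this example, the block parameters are assumed to be integers as in the example header, and optional fields such as ScanResolution are simply stored as text.

```python
import re

# Order of the values in a "BlockN=A, B, C, D, E, F, G" header line.
BLOCK_FIELDS = ("x_origin", "y_origin", "feature_diameter",
                "x_features", "x_spacing", "y_features", "y_spacing")


def parse_gal(text):
    """Parse GAL text into (header, blocks, columns, rows)."""
    lines = text.splitlines()
    if len(lines) < 2 or not lines[0].startswith("ATF"):
        raise ValueError("not an Axon Text File")
    # Second line: number of header lines and number of data columns.
    n_header, n_columns = (int(x) for x in lines[1].split()[:2])

    header, blocks = {}, {}
    for raw in lines[2:2 + n_header]:
        key, _, value = raw.strip().strip('"').partition("=")
        match = re.fullmatch(r"Block(\d+)", key)
        if match:
            numbers = [int(v) for v in value.split(",")]
            blocks[int(match.group(1))] = dict(zip(BLOCK_FIELDS, numbers))
        else:
            header[key] = value

    columns = [c.strip('"') for c in lines[2 + n_header].split("\t")]
    if len(columns) != n_columns:
        raise ValueError("unexpected number of data columns")
    rows = [dict(zip(columns, (f.strip('"') for f in line.split("\t"))))
            for line in lines[3 + n_header:] if line.strip()]
    return header, blocks, columns, rows


sample = "\n".join([
    "ATF\t1",
    "2\t5",
    "Type=GenePix ArrayList V1.0",
    '"Block1=10000, 38780, 150, 20, 200, 18, 200"',
    "Block\tColumn\tRow\tID\tName",
    "1\t1\t1\tRP11-163J21\tClone 1",
])
header, blocks, columns, rows = parse_gal(sample)
print(blocks[1]["x_origin"], rows[0]["Name"])
```

The data columns are looked up by the names in the column header line, so the sketch does not depend on the order in which ID and Name appear.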

Further reading and sources: