OMIM Symbols

April 16th, 2012

Online Mendelian Inheritance in Man (OMIM) is a manually reviewed catalog of human genes and regions involved in genetic disorders and traits. Each entry has a name and a number, e.g. "#154780 MARSHALL SYNDROME". According to the OMIM FAQs, the symbols preceding a MIM number have the following meanings:

  1. An asterisk (*) before an entry number indicates a gene.
  2. A number symbol (#) before an entry number indicates that it is a descriptive entry, usually of a phenotype, and does not represent a unique locus. The reason for the use of the number symbol is given in the first paragraph of the entry. Discussion of any gene(s) related to the phenotype resides in another entry(ies) as described in the first paragraph.
  3. A plus sign (+) before an entry number indicates that the entry contains the description of a gene of known sequence and a phenotype.
  4. A percent sign (%) before an entry number indicates that the entry describes a confirmed mendelian phenotype or phenotypic locus for which the underlying molecular basis is not known.
  5. No symbol before an entry number generally indicates a description of a phenotype for which the mendelian basis, although suspected, has not been clearly established or that the separateness of this phenotype from that in another entry is unclear.
  6. A caret (^) before an entry number means the entry no longer exists because it was removed from the database or moved to another entry as indicated.
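
The prefix rules above can be sketched as a small lookup. This is a minimal Python sketch, not part of OMIM or the Ensembl API: the function name `parse_omim_entry`, the paraphrased meanings and the assumption that a MIM number has six digits are all illustrative.

```python
import re

# Meaning of each prefix symbol, paraphrased from the list above.
OMIM_PREFIXES = {
    "*": "gene",
    "#": "descriptive entry (usually a phenotype), not a unique locus",
    "+": "gene of known sequence and a phenotype",
    "%": "confirmed mendelian phenotype or locus, molecular basis unknown",
    "": "phenotype, mendelian basis not clearly established",
    "^": "entry removed or moved",
}

# Optional prefix symbol, MIM number (assumed to be six digits), entry name.
ENTRY_PATTERN = re.compile(r"^([*#+%^]?)\s*(\d{6})\s+(.+)$")


def parse_omim_entry(entry):
    """Split an entry such as '#154780 MARSHALL SYNDROME' into its parts."""
    match = ENTRY_PATTERN.match(entry.strip())
    if match is None:
        raise ValueError(f"not a recognisable OMIM entry: {entry!r}")
    prefix, mim_number, name = match.groups()
    return {
        "prefix": prefix,
        "mim_number": mim_number,
        "name": name,
        "meaning": OMIM_PREFIXES[prefix],
    }


print(parse_omim_entry("#154780 MARSHALL SYNDROME"))
```

An entry without a symbol maps to the empty-string key, which mirrors rule 5 of the list.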

To fetch a non-redundant list of OMIM annotations through the Ensembl Perl API, you can look at the external references (xrefs/dblinks) of a gene:


# $gene is assumed to be a Bio::EnsEMBL::Gene object fetched earlier
my $att = "MIM_GENE";
# or: my $att = "MIM_MORBID";
my $attribs = $gene->get_all_DBLinks($att);
my (%ids, %descriptions);
if (@{ $attribs }) {
  foreach my $attrib (@{ $attribs }) {
    # keep only the first xref seen for each primary ID
    if (not exists $ids{ $attrib->primary_id }) {
      $ids{ $attrib->primary_id } = $attrib->display_id;
      $descriptions{ $attrib->description } = $attrib->display_id;
    }
  }
}

OMIM publication,

Nucleotide Ambiguity Codes

April 4th, 2012

The following symbols are used to describe nucleotides in DNA and RNA sequences, including positions where the base is ambiguous:

Symbol   Meaning             Nucleic Acid
A        A                   Adenine
C        C                   Cytosine
G        G                   Guanine
T        T                   Thymine
U        U                   Uracil
M        A or C
R        A or G
W        A or T
S        C or G
Y        C or T
K        G or T
V        A or C or G
H        A or C or T
D        A or G or T
B        C or G or T
X        G or A or T or C
N        G or A or T or C

Note: these letters are also used in the "samtools tview" program to visually show NGS read alignments.
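
One common use of the table is to turn a motif written with ambiguity codes into a regular expression. The following is a minimal Python sketch under that assumption; the names `IUPAC_BASES` and `iupac_to_regex` are illustrative and do not come from any library.

```python
import re

# Ambiguity codes from the table above, mapped to the bases they stand for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Translate a motif with ambiguity codes into a regular expression."""
    parts = []
    for symbol in motif.upper():
        if symbol not in IUPAC_BASES:
            raise ValueError(f"unknown nucleotide symbol: {symbol!r}")
        bases = IUPAC_BASES[symbol]
        # A single base is used as is; several bases become a character class.
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


pattern = iupac_to_regex("GAYR")
print(pattern)                               # GA[CT][AG]
print(bool(re.fullmatch(pattern, "GATA")))   # True
```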


1000 Genomes Project Populations

April 3rd, 2012

The goal of the 1000 Genomes Project is to create "A Deep Catalog of Human Genetic Variation" by measuring and analysing most genetic variants that have frequencies of at least 1% in the populations studied.

The population codes used in the project are the following (Source: 1000 Genomes / ftp site):

CHB   Han Chinese             Han Chinese in Beijing, China
JPT   Japanese                Japanese in Tokyo, Japan
CHS   Southern Han Chinese    Han Chinese South
CDX   Dai Chinese             Chinese Dai in Xishuangbanna, China
KHV   Kinh Vietnamese         Kinh in Ho Chi Minh City, Vietnam
CHD   Denver Chinese          Chinese in Denver, Colorado (pilot 3 only)
CEU   CEPH                    Utah residents (CEPH) with Northern and Western European ancestry
TSI   Tuscan                  Toscani in Italia
GBR   British                 British in England and Scotland
FIN   Finnish                 Finnish in Finland
IBS   Spanish                 Iberian populations in Spain
YRI   Yoruba                  Yoruba in Ibadan, Nigeria
LWK   Luhya                   Luhya in Webuye, Kenya
GWD   Gambian                 Gambian in Western Division, The Gambia
MSL   Mende                   Mende in Sierra Leone
ESN   Esan                    Esan in Nigeria
ASW   African-American SW     African Ancestry in Southwest US
ACB   African-Caribbean       African Caribbean in Barbados
MXL   Mexican-American        Mexican Ancestry in Los Angeles, California
PUR   Puerto Rican            Puerto Rican in Puerto Rico
CLM   Colombian               Colombian in Medellin, Colombia
PEL   Peruvian                Peruvian in Lima, Peru
GIH   Gujarati                Gujarati Indian in Houston, TX
PJL   Punjabi                 Punjabi in Lahore, Pakistan
BEB   Bengali                 Bengali in Bangladesh
STU   Sri Lankan              Sri Lankan Tamil in the UK
ITU   Indian                  Indian Telugu in the UK
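
For scripting it can be handy to have the codes in a lookup table. The following is a minimal Python sketch that assumes the table has been saved as tab-separated text (code, short name, description); only a few rows are embedded here, and the function name `parse_population_table` is illustrative.

```python
# A few rows of the table above, tab-separated: code, short name, description.
POPULATION_TABLE = (
    "CHB\tHan Chinese\tHan Chinese in Beijing, China\n"
    "JPT\tJapanese\tJapanese in Tokyo, Japan\n"
    "CEU\tCEPH\tUtah residents (CEPH) with Northern and Western European ancestry\n"
    "YRI\tYoruba\tYoruba in Ibadan, Nigeria\n"
)


def parse_population_table(text):
    """Return a dict mapping population code to (short name, description)."""
    populations = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        code, name, description = line.split("\t", 2)
        populations[code.strip()] = (name.strip(), description.strip())
    return populations


populations = parse_population_table(POPULATION_TABLE)
print(populations["YRI"][1])
```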

aCGH array QC measures

March 8th, 2012

The within-array quality for (genomic) microarrays is often measured using the following metrics:

  1. Standard Deviation Autosome / Robust (SD autosome)

    Measure of the dispersion of the Log2 ratios of all clones on the array, giving an overall picture of the noise in the array. It is calculated on the normalised but unsmoothed data. The SD robust is calculated on the middle 58%/66% of the data only. Because outliers are excluded, large changes such as trisomies will not cause this number to change significantly. (The SD robust is the number we use when we say “3 SDs away from the noise” in the calling algorithm.) Both measures are given after all data processing but excluding any smoothing. For BlueFuse Multi processed data the values should be 0.07-0.15 for the autosome measure and 0.05-0.11 for the robust measure.

  2. Signal to Background Ratio (SBR)

    Brightness of the mean signal (after the background has been subtracted) divided by the raw background signal (global signal).

  3. Derivative Log2 Ratio / Fused (DLR)

    Measure of the probe-to-probe variability. In an ideal world, probes within a region will have essentially the same ratio. In a noisy array, adjacent probes can have a very large ratio difference. The DLR raw is calculated before any data processing; the DLR fused is calculated after normalization and data correction but always on unsmoothed data. It is therefore independent of user settings and cannot be adjusted by the user, which gives a consistent array-to-array measure of noise. BlueFuse results should be < 0.2.

  4. % included clones

    Percentage of all clones that were not excluded on a BAC array due to inconsistencies between clone replicates. For BlueFuse results this should be > 95 %.

  5. Mean Spot Amplitude

    The mean fluorescent signal intensities for the two channels: channel 1 = sample (typically Cy3; excitation 550 nm, emission 570 nm) and channel 2 = reference (typically Cy5; excitation 650 nm, emission 670 nm). This metric varies between the available scanners. The mean spot amplitude can give an indication of how well the DNA has been labelled with the fluorescent dyes. More importantly, very high values can indicate over-scanning of the microarray image or poor washing that has left a lot of non-specific signal. The balance between the channels can also be assessed, but the Cy5 signal tends to give a higher intensity than Cy3; major differences between the channels may indicate a labelling or scanner problem.

Source: BlueGnome user docs
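
The numeric guidelines quoted above (SD autosome 0.07-0.15, SD robust 0.05-0.11, DLR fused < 0.2, included clones > 95 %) can be turned into a simple pass/fail check. This is a minimal Python sketch, not BlueFuse functionality: the dictionary keys and the function name `check_array_qc` are hypothetical, and treating values outside the quoted ranges as failures is an assumption.

```python
def check_array_qc(metrics):
    """Return a list of QC problems for one array (empty list = all passed).

    `metrics` is a dict with the hypothetical keys 'sd_autosome',
    'sd_robust', 'dlr_fused' and 'pct_included_clones'.
    """
    problems = []
    # Expected ranges for BlueFuse Multi processed data, as quoted in the text.
    if not 0.07 <= metrics["sd_autosome"] <= 0.15:
        problems.append("SD autosome outside 0.07-0.15")
    if not 0.05 <= metrics["sd_robust"] <= 0.11:
        problems.append("SD robust outside 0.05-0.11")
    if not metrics["dlr_fused"] < 0.2:
        problems.append("DLR fused not below 0.2")
    if not metrics["pct_included_clones"] > 95:
        problems.append("included clones not above 95 %")
    return problems


example = {
    "sd_autosome": 0.10,
    "sd_robust": 0.08,
    "dlr_fused": 0.25,
    "pct_included_clones": 98.2,
}
print(check_array_qc(example))
```

The signal to background ratio and the mean spot amplitude are left out because the text gives no thresholds for them.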

Canonical transcripts

January 3rd, 2012

As reported in the Ensembl 2009 NAR paper, canonical transcripts are defined for all genes and for all species in the Ensembl gene sets. "The canonical transcript is defined as either the longest CDS, if the gene has translated transcripts, or the longest cDNA. Should a transcript already regarded as canonical not be selected using the above rules, there is support for storing this information in the Ensembl database."
For the human gene annotation, the hierarchy used to choose the canonical transcript when a gene has more than one protein-coding transcript is:

  1. CCDS transcripts
  2. Havana manual annotation transcripts of the type "protein_coding"
  3. Havana manual annotation transcripts that are also protein-coding
  4. Ensembl protein-coding transcripts

If there are multiple transcripts within a group, take the one with the longest CDS in the highest-priority group.
For non-coding types, take the longest cDNA, choosing in order of priority from:

  1. Havana transcripts
  2. Ensembl transcripts

Ensembl 2009 NAR paper, Ensembl mailing list
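
The selection rule described above can be sketched in a few lines. This is a minimal Python illustration of the priority logic only, not the Ensembl implementation: the `Transcript` class, the group labels and the function name `pick_canonical` are hypothetical, and the stable IDs in the example are made up.

```python
from dataclasses import dataclass

# Priority groups, highest first. The labels are hypothetical stand-ins
# for the four protein-coding and two non-coding groups listed above.
CODING_PRIORITY = ["ccds", "havana_protein_coding", "havana_coding", "ensembl_coding"]
NONCODING_PRIORITY = ["havana", "ensembl"]


@dataclass
class Transcript:
    stable_id: str
    group: str        # one of the priority labels above
    cds_length: int   # 0 for non-coding transcripts
    cdna_length: int


def pick_canonical(transcripts):
    """Pick the longest transcript from the highest-priority non-empty group."""
    coding = [t for t in transcripts if t.cds_length > 0]
    if coding:
        pool, priority = coding, CODING_PRIORITY
        length = lambda t: t.cds_length    # coding: compare CDS length
    else:
        pool, priority = transcripts, NONCODING_PRIORITY
        length = lambda t: t.cdna_length   # non-coding: compare cDNA length
    for group in priority:
        members = [t for t in pool if t.group == group]
        if members:
            return max(members, key=length)
    return None


gene = [
    Transcript("T1", "ensembl_coding", 3000, 4200),
    Transcript("T2", "ccds", 1200, 2000),
    Transcript("T3", "ccds", 1500, 1900),
]
print(pick_canonical(gene).stable_id)  # T3
```

Note that T1 has the longest CDS overall but loses to T3, because the CCDS group outranks the Ensembl group.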

These objects can be regarded as representative transcripts for the gene and can be fetched with the Perl API method canonical_transcript, called on a gene object:

my $transcript = $gene->canonical_transcript;

Some caution is needed with the pseudo-autosomal regions (PARs): for genes in the Y PAR, the method returns a transcript with X coordinates. While not really a bug, this might mess up your data if it goes unnoticed. To check for and fix this, something like the following will work:


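use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# assumed setup: connect to the public Ensembl database and get the adaptors
# (the coordinates below refer to the assembly current at the time of writing)
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);
my $slice_adaptor      = $registry->get_adaptor('Human', 'Core', 'Slice');
my $transcript_adaptor = $registry->get_adaptor('Human', 'Core', 'Transcript');

#fetch slice from Y PAR
my $slice = $slice_adaptor->fetch_by_region('chromosome', 'Y', 59100480, 59115127);
#get an example gene
my $gene = $slice->get_all_Genes->[0];
#get canonical transcript from the gene
my $transcript = $gene->canonical_transcript;
#re-fetch transcript on Y to avoid getting
# X locations for PAR
if ($gene->slice->seq_region_name eq "Y") {
  my $sid = $transcript->stable_id;
  $transcript = undef;
  # second argument: load the exons together with the transcripts
  my $transcripts = $transcript_adaptor->fetch_all_by_Slice($slice, 1);
  foreach my $poss_transcript (@$transcripts) {
    next unless ($poss_transcript->stable_id eq $sid);
    $transcript = $poss_transcript;
    last;
  }
}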