OMIM Symbols

April 16th, 2012

Online Mendelian Inheritance in Man (OMIM) is a manually reviewed catalog of human genes and regions involved in genetic disorders and traits. Each entry has a name and a number, e.g. "#154780 MARSHALL SYNDROME". According to the OMIM FAQs, these are the meanings of the symbols preceding a MIM number:

  1. An asterisk (*) before an entry number indicates a gene.
  2. A number symbol (#) before an entry number indicates that it is a descriptive entry, usually of a phenotype, and does not represent a unique locus. The reason for the use of the number symbol is given in the first paragraph of the entry. Discussion of any gene(s) related to the phenotype resides in another entry(ies) as described in the first paragraph.
  3. A plus sign (+) before an entry number indicates that the entry contains the description of a gene of known sequence and a phenotype.
  4. A percent sign (%) before an entry number indicates that the entry describes a confirmed mendelian phenotype or phenotypic locus for which the underlying molecular basis is not known.
  5. No symbol before an entry number generally indicates a description of a phenotype for which the mendelian basis, although suspected, has not been clearly established or that the separateness of this phenotype from that in another entry is unclear.
  6. A caret (^) before an entry number means the entry no longer exists because it was removed from the database or moved to another entry as indicated.
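
As a minimal sketch of how the prefixes listed above could be handled in a script, the following splits an entry title into its symbol, MIM number and name. The function name parse_omim_title and the hash %omim_prefix are made up for this illustration, the meanings are shortened paraphrases of the FAQ text, and the pattern assumes six-digit MIM numbers.

```perl
use strict;
use warnings;

# Short paraphrases of the FAQ meanings (hypothetical helper table)
my %omim_prefix = (
  '*' => 'gene',
  '#' => 'descriptive entry (phenotype), not a unique locus',
  '+' => 'gene of known sequence and a phenotype',
  '%' => 'confirmed mendelian phenotype, molecular basis unknown',
  '^' => 'entry removed or moved',
  ''  => 'phenotype with suspected but unestablished mendelian basis',
);

# Split a title such as "#154780 MARSHALL SYNDROME" into its parts.
# Assumes a six-digit MIM number; returns nothing if the title does not fit.
sub parse_omim_title {
  my ($title) = @_;
  my ($symbol, $number, $name) =
    $title =~ /^\s*([*#+%^]?)\s*(\d{6})\s+(.*?)\s*$/
    or return;
  return { symbol  => $symbol,
           number  => $number,
           name    => $name,
           meaning => $omim_prefix{$symbol} };
}

my $entry = parse_omim_title('#154780 MARSHALL SYNDROME');
printf "%s: %s (%s)\n", $entry->{number}, $entry->{name}, $entry->{meaning};
```

An entry without a prefix symbol is returned with an empty symbol, which maps to the "no symbol" case of the list.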

To fetch a non-redundant list of OMIM annotations through the Ensembl Perl API, you can look at the external references (xrefs/dblinks):


# $gene is a Bio::EnsEMBL::Gene object fetched beforehand
my $att = "MIM_GENE";
# or: my $att = "MIM_MORBID";
my $attribs = $gene->get_all_DBLinks($att);
my (%ids, %descriptions);
foreach my $attrib (@{ $attribs }){
  # skip xrefs whose MIM number has already been seen
  next if exists $ids{$attrib->primary_id};
  $ids{$attrib->primary_id} = $attrib->display_id;
  $descriptions{$attrib->description} = $attrib->display_id;
}

OMIM publication,

Nucleotide Ambiguity Codes

April 4th, 2012

The symbols used to describe nucleotides in DNA and RNA sequences, including the ambiguity codes, are the following:

Symbol  Meaning             Nucleic Acid
A       A                   Adenine
C       C                   Cytosine
G       G                   Guanine
T       T                   Thymine
U       U                   Uracil
M       A or C
R       A or G
W       A or T
S       C or G
Y       C or T
K       G or T
V       A or C or G
H       A or C or T
D       A or G or T
B       C or G or T
X       G or A or T or C
N       G or A or T or C

Note: these letters are also used in the "samtools tview" program to visually show NGS read alignments.
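
One practical use of the table is searching a sequence for a motif written with ambiguity codes. The sketch below turns such a motif into a regular expression by replacing each symbol with the set of bases it stands for. The hash %ambiguity and the function ambiguity_to_regex are names invented for this example; the mapping itself is copied from the table above.

```perl
use strict;
use warnings;

# Symbol => bases it can stand for, as listed in the table
my %ambiguity = (
  A => 'A',    C => 'C',    G => 'G',   T => 'T',   U => 'U',
  M => 'AC',   R => 'AG',   W => 'AT',  S => 'CG',  Y => 'CT',  K => 'GT',
  V => 'ACG',  H => 'ACT',  D => 'AGT', B => 'CGT',
  X => 'ACGT', N => 'ACGT',
);

# Build a compiled regex from a motif containing ambiguity codes.
sub ambiguity_to_regex {
  my ($motif) = @_;
  my $pattern = '';
  foreach my $symbol (split //, uc $motif) {
    die "Unknown symbol '$symbol'\n" unless exists $ambiguity{$symbol};
    my $bases = $ambiguity{$symbol};
    # single bases stay literal, ambiguous symbols become a character class
    $pattern .= length($bases) == 1 ? $bases : "[$bases]";
  }
  return qr/$pattern/;
}

my $re = ambiguity_to_regex('GAYR');   # G, A, (C or T), (A or G)
print "match\n" if 'TTGACGTT' =~ $re;
```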


1000 Genomes Project Populations

April 3rd, 2012

The goal of the 1000 Genomes Project is to create "A Deep Catalog of Human Genetic Variation" by measuring and analysing most genetic variants that have frequencies of at least 1% in the populations studied.

The population codes used in the project are the following (Source: 1000 Genomes / ftp site):

Code  Population            Description
CHB   Han Chinese           Han Chinese in Beijing, China
JPT   Japanese              Japanese in Tokyo, Japan
CHS   Southern Han Chinese  Han Chinese South
CDX   Dai Chinese           Chinese Dai in Xishuangbanna, China
KHV   Kinh Vietnamese       Kinh in Ho Chi Minh City, Vietnam
CHD   Denver Chinese        Chinese in Denver, Colorado (pilot 3 only)
CEU   CEPH                  Utah residents (CEPH) with Northern and Western European ancestry
TSI   Tuscan                Toscani in Italia
GBR   British               British in England and Scotland
FIN   Finnish               Finnish in Finland
IBS   Spanish               Iberian populations in Spain
YRI   Yoruba                Yoruba in Ibadan, Nigeria
LWK   Luhya                 Luhya in Webuye, Kenya
GWD   Gambian               Gambian in Western Division, The Gambia
MSL   Mende                 Mende in Sierra Leone
ESN   Esan                  Esan in Nigeria
ASW   African-American SW   African Ancestry in Southwest US
ACB   African-Caribbean     African Caribbean in Barbados
MXL   Mexican-American      Mexican Ancestry in Los Angeles, California
PUR   Puerto Rican          Puerto Rican in Puerto Rico
CLM   Colombian             Colombian in Medellin, Colombia
PEL   Peruvian              Peruvian in Lima, Peru
GIH   Gujarati              Gujarati Indian in Houston, TX
PJL   Punjabi               Punjabi in Lahore, Pakistan
BEB   Bengali               Bengali in Bangladesh
STU   Sri Lankan            Sri Lankan Tamil in the UK
ITU   Indian                Indian Telugu in the UK

Canonical transcripts

January 3rd, 2012

As reported in the Ensembl 2009 NAR paper, canonical transcripts are defined for all genes and for all species in the Ensembl gene sets: "The canonical transcript is defined as either the longest CDS, if the gene has translated transcripts, or the longest cDNA. Should a transcript already regarded as canonical not be selected using the above rules, there is support for storing this information in the Ensembl database."
For the human gene annotation, when a gene has more than one protein-coding transcript, the canonical transcript is chosen according to the following hierarchy:

  1. CCDS transcripts
  2. Havana manual annotation transcripts of the type "protein_coding"
  3. Havana manual annotation transcripts that are also protein-coding
  4. Ensembl protein-coding transcripts

If a group contains multiple transcripts, the one with the longest CDS in the highest-priority group is taken.
For non-coding genes, the longest cDNA is taken, with the following priority:

  1. Havana transcripts
  2. Ensembl transcripts

Ensembl 2009 NAR paper, Ensembl mailing list
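
The selection rule for protein-coding transcripts can be sketched as a two-key sort: first by group priority, then by CDS length. This is only an illustration on plain hashes, not the code Ensembl uses; the group labels, the field names and the function pick_canonical are all hypothetical.

```perl
use strict;
use warnings;

# Lower number = higher priority; the group labels are made up for this sketch
my %priority = (
  ccds                   => 1,
  havana_protein_coding  => 2,
  havana_other_coding    => 3,
  ensembl_protein_coding => 4,
);

# Return the transcript from the highest-priority group,
# breaking ties by the longest CDS.
sub pick_canonical {
  my (@transcripts) = @_;
  my ($best) = sort {
       $priority{ $a->{group} } <=> $priority{ $b->{group} }
    || $b->{cds_length}         <=> $a->{cds_length}
  } @transcripts;
  return $best;
}

my @transcripts = (
  { id => 'T1', group => 'ensembl_protein_coding', cds_length => 3000 },
  { id => 'T2', group => 'ccds',                   cds_length => 1200 },
  { id => 'T3', group => 'ccds',                   cds_length => 1500 },
);
print pick_canonical(@transcripts)->{id}, "\n";   # prints T3
```

T1 has the longest CDS overall but loses to the CCDS group, and within that group T3 wins on length.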

Canonical transcripts can be regarded as representative transcripts for the gene and can be fetched with the Perl API method canonical_transcript of the gene object:

my $transcript = $gene->canonical_transcript;

Some caution is needed when looking at the pseudo-autosomal regions (PARs): for genes from the Y PAR, the method will return a transcript with X coordinates. While not really a bug, this might mess up your data if unnoticed. To check and fix this, something like the following will work:


# $slice_adaptor and $transcript_adaptor are assumed to have been
# fetched from the Ensembl registry beforehand
#fetch slice from Y PAR
my $slice = $slice_adaptor->fetch_by_region(
  'chromosome', 'Y', 59100480, 59115127);
#get an example gene
my $gene = $slice->get_all_Genes->[0];
#get canonical transcript from the gene
my $transcript = $gene->canonical_transcript;
#re-fetch transcript on Y to avoid getting
# X locations for PAR
if($gene->slice->seq_region_name eq "Y"){
  my $sid = $transcript->stable_id;
  $transcript = undef;
  my $transcripts =
    $transcript_adaptor->fetch_all_by_Slice($slice, 1);
  foreach my $poss_transcript (@$transcripts){
    next unless($poss_transcript->stable_id eq $sid);
    $transcript = $poss_transcript;
    last;
  }
}

Telomeric and Centromeric regions

September 22nd, 2011

Telomeres form caps on the ends of chromosomes that prevent fusion of chromosomal ends and provide genomic stability.

During gametogenesis, reprogramming of the germ cells leads to elongation of telomeres up to their species-specific maximum.

In normal somatic cells, telomeres are progressively shortened with every cell division. This shortening in normal human cells limits the number of cell divisions. For human cells to proliferate beyond the senescence checkpoint, they need to stabilize telomere length. This is accomplished mainly by reactivation of the telomerase enzyme. Telomerase expression is under the control of many factors. Expression of telomerase can lead to cell immortalization and is activated during tumorigenesis, i.e. cancer.

Male Xq-telomeres are 1100 bp shorter than female Xq-telomeres.

The telomeric repeat found on all human chromosomes is "TTAGGG".
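
To look for this repeat in a sequence, one simple approach is to find the longest uninterrupted run of the motif. The following is a rough sketch; the function name longest_repeat_run and the test sequence are invented for the example, and only the given strand is searched.

```perl
use strict;
use warnings;

# Return the length, in repeat units, of the longest uninterrupted
# run of $motif in $seq (case-insensitive).
sub longest_repeat_run {
  my ($seq, $motif) = @_;
  my $longest = 0;
  while ($seq =~ /((?:\Q$motif\E)+)/gi) {
    my $units = length($1) / length($motif);
    $longest = $units if $units > $longest;
  }
  return $longest;
}

# Artificial sequence: a run of 5 repeats, a spacer, then a run of 2
my $seq = 'ACGT' . ('TTAGGG' x 5) . 'CCTA' . ('TTAGGG' x 2);
print longest_repeat_run($seq, 'TTAGGG'), "\n";   # prints 5
```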

The centromeres and telomeres of the human chromosomes are not explicitly defined as region attributes in the Ensembl Perl API, so to check these regions, one option is to pull them out of the UCSC table browser (use the "Mapping and Sequencing tracks" group and the "Gap" table) and define them manually. You can e.g. create an array of hashes with the regions and use them in your script:


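#read data (listed below) from a file...
# $fh is assumed to be a filehandle already opened on that file
my @telomeres;
while (my $line = <$fh>) {
  chomp $line;
  my @data = split(/\s+/, $line);
  my %telomere = (
      'chrom' => $data[0],
      'start' => $data[1],
      'end'   => $data[2],
  );
  push(@telomeres, \%telomere);
}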
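
Once the regions are stored as an array of hashes, checking whether a feature falls into one of them is a simple interval test. A minimal sketch, assuming 1-based inclusive coordinates and the 'chrom', 'start' and 'end' keys used above; the function name overlaps_region is made up, and the two example regions are taken from the GRCh37 lists in this post.

```perl
use strict;
use warnings;

# Chromosome 1: first telomere and centromere (GRCh37, 1-based)
my @regions = (
  { chrom => '1', start => 1,         end => 10000 },
  { chrom => '1', start => 121535435, end => 124535434 },
);

# True if the 1-based, inclusive interval overlaps any stored region.
sub overlaps_region {
  my ($chrom, $start, $end, $regions) = @_;
  foreach my $r (@$regions) {
    next unless $r->{chrom} eq $chrom;
    return 1 if $start <= $r->{end} && $end >= $r->{start};
  }
  return 0;
}

print overlaps_region('1', 9500, 10500, \@regions) ? "overlap\n" : "clear\n";
```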

The list of centromere regions (transformed from the 0-based UCSC system to the 1-based coordinate system) for GRCh37 is:


1       121535435       124535434
2       92326172       95326171
3       90504855       93504854
4       49660118       52660117
5       46405642       49405641
6       58830167       61830166
7       58054332       61054331
8       43838888       46838887
9       47367680       50367679
10      39254936       42254935
11      51644206       54644205
12      34856695       37856694
13      16000001       19000000
14      16000001       19000000
15      17000001       20000000
16      35335802       38335801
17      22263007       25263006
18      15460899       18460898
19      24681783       27681782
20      26369570       29369569
21      11288130       14288129
22      13000001       16000000
X       58632013       61632012
Y       10104554       13104553
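
The coordinate transformation used for the list above is small but easy to get wrong: UCSC tables store 0-based, half-open intervals, so only the start changes when converting to 1-based, inclusive coordinates. A minimal sketch; the function name ucsc_to_one_based is invented, and the UCSC input values for the chromosome 1 centromere are simply back-calculated from the list.

```perl
use strict;
use warnings;

# UCSC: 0-based start, end not included.
# 1-based inclusive: add 1 to the start, keep the end.
sub ucsc_to_one_based {
  my ($start0, $end0) = @_;
  return ($start0 + 1, $end0);
}

my ($start, $end) = ucsc_to_one_based(121535434, 124535434);
print "1\t$start\t$end\n";   # 1  121535435  124535434
```

The length of the region is the same in both systems: end minus start in UCSC coordinates, end minus start plus one in 1-based coordinates.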

The list of telomere regions for GRCh37 is (1-based):


1       1               10000
1       249240622       249250621
2       1               10000
2       243189374       243199373
3       1               10000
3       198012431       198022430
4       1               10000
4       191144277       191154276
5       1               10000
5       180905261       180915260
6       1               10000
6       171105068       171115067
7       1               10000
7       159128664       159138663
8       1               10000
8       146354023       146364022
9       1               10000
9       141203432       141213431
10      1               10000
10      135524748       135534747
11      1               10000
11      134996517       135006516
12      1               10000
12      133841896       133851895
13      1               10000
13      115159879       115169878
14      1               10000
14      107339541       107349540
15      1               10000
15      102521393       102531392
16      1               10000
16      90344754        90354753
18      1               10000
18      78067249        78077248
19      1               10000
19      59118984        59128983
20      1               10000
20      63015521        63025520
21      1               10000
21      48119896        48129895
22      1               10000
22      51294567        51304566
X       1               10000
X       155260561       155270560
Y       1               10000
Y       59363567        59373566

Telomeres of chromosome 17 have not been defined for assembly GRCh37. They are short, but do exist nonetheless. An assembly patch will address this.