Uniparental Disomy

May 4th, 2012

When a cell carries two copies of a chromosome, or of part of a chromosome, from one parent and no copy from the other parent, we call it uniparental disomy (UPD). Although all DNA information is present, the development of the cell (and the organism) is hindered because of missing or wrong epigenetic markers. The basic mechanism by which this faulty distribution of chromosomes can occur is shown in Fig. 1.

Version Control with Perforce on the Command-line

April 19th, 2012

Besides the visual client, the version control system Perforce can be operated through the command line (Unix prompt or Windows DOS window) and can therefore be controlled from other programs such as MATLAB:

[status, result] = dos(p4command);

A reference manual is available; here are a few hints.
Check the environment settings:

p4 set
  P4CLIENT=try1 (set)
  P4EDITOR=C:\Windows\SysWOW64\notepad.exe (set)

and edit if necessary with

set P4CHARSET=winansi

P4EDITOR is optional; P4CLIENT is the checkout / workspace name.
The settings can also be set permanently in the visual client under
Edit / Preferences / Connection / Change Settings
If these are wrong you will get messages like "file(s) not on client".

Most common commands:
synchronize repository:

p4 sync

checkout file:

p4 edit filename.txt
p4 edit //depot/path/in/perforce/filename.txt

submit changes:

p4 submit -d "description of changes" filename.txt

revert to version in repository:

p4 revert filename.txt

add new file:

p4 add filename.txt

get help:

p4 help
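
The same commands can be driven from Python instead of MATLAB. The following is only a minimal sketch: the helper names p4_command and run_p4 are made up for this example, and it assumes that the p4 executable is on the PATH and that the environment settings above are correct.

```python
import subprocess


def p4_command(action, *args, description=None):
    """Build the argument list for a p4 call, e.g. ['p4', 'edit', 'filename.txt'].

    Hypothetical helper for this sketch, not part of Perforce.
    """
    cmd = ["p4", action]
    if description is not None:
        # "-d" carries the change description, as in the submit example above
        cmd += ["-d", description]
    cmd += list(args)
    return cmd


def run_p4(cmd):
    """Run the command and return (status, result), similar to MATLAB's dos()."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode, completed.stdout + completed.stderr
```

Usage would look like status, result = run_p4(p4_command("edit", "filename.txt")). Passing the arguments as a list avoids quoting problems with descriptions that contain spaces.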

Here are some useful one-liners for various tasks.

OMIM Symbols

April 16th, 2012

The Online Mendelian Inheritance in Man (OMIM) is a manually reviewed catalog of human genes and regions involved in genetic disorders and traits. Each entry has a name and a number, e.g. "#154780 MARSHALL SYNDROME". According to the OMIM FAQs, these are the meanings of the symbols preceding a MIM number:

  1. An asterisk (*) before an entry number indicates a gene.
  2. A number symbol (#) before an entry number indicates that it is a descriptive entry, usually of a phenotype, and does not represent a unique locus. The reason for the use of the number symbol is given in the first paragraph of the entry. Discussion of any gene(s) related to the phenotype resides in another entry(ies) as described in the first paragraph.
  3. A plus sign (+) before an entry number indicates that the entry contains the description of a gene of known sequence and a phenotype.
  4. A percent sign (%) before an entry number indicates that the entry describes a confirmed mendelian phenotype or phenotypic locus for which the underlying molecular basis is not known.
  5. No symbol before an entry number generally indicates a description of a phenotype for which the mendelian basis, although suspected, has not been clearly established or that the separateness of this phenotype from that in another entry is unclear.
  6. A caret (^) before an entry number means the entry no longer exists because it was removed from the database or moved to another entry as indicated.
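
As a small illustration of the list above, an entry string can be split into its symbol, number and name in Python. This is a sketch only: the names MIM_PREFIXES and parse_mim_entry are invented for this example, the meanings are shortened paraphrases of the FAQ text, and the input format is assumed to be the one in the "#154780 MARSHALL SYNDROME" example.

```python
import re

# Shortened meanings of the prefix symbols; the empty string stands for "no symbol".
MIM_PREFIXES = {
    "*": "gene",
    "#": "descriptive entry, usually a phenotype; not a unique locus",
    "+": "gene of known sequence and a phenotype",
    "%": "confirmed mendelian phenotype or locus, molecular basis unknown",
    "": "phenotype whose mendelian basis is suspected but not established",
    "^": "entry removed or moved to another entry",
}

# optional symbol, then the MIM number, then the rest of the line as the name
ENTRY_PATTERN = re.compile(r"^([*#+%^]?)(\d+)\s*(.*)$")


def parse_mim_entry(entry):
    """Split an entry like '#154780 MARSHALL SYNDROME' into its parts."""
    match = ENTRY_PATTERN.match(entry.strip())
    if match is None:
        raise ValueError("not a MIM entry: " + repr(entry))
    symbol, number, name = match.groups()
    return {
        "symbol": symbol,
        "number": int(number),
        "name": name,
        "meaning": MIM_PREFIXES[symbol],
    }
```

For the example entry, parse_mim_entry("#154780 MARSHALL SYNDROME") returns the symbol "#", the number 154780 and the name "MARSHALL SYNDROME".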

To fetch a non-redundant list of OMIM annotations through the Ensembl Perl API, you can look at the external references (xrefs/dblinks):


# $gene is assumed to be a Bio::EnsEMBL::Gene object fetched beforehand
my $att = "MIM_GENE";
# or: my $att = "MIM_MORBID";
my $attribs = $gene->get_all_DBLinks($att);
my (%ids, %descriptions);
if (@{ $attribs }){
  foreach my $attrib (@{ $attribs }){
    # keep only the first xref seen for each primary id
    if (not(exists $ids{$attrib->primary_id})){
      $ids{$attrib->primary_id} = $attrib->display_id;
      $descriptions{$attrib->description} = $attrib->display_id;
    }
  }
}

OMIM publication, http://omim.org/

Nucleotide Ambiguity Codes

April 4th, 2012

The symbols to describe the different nucleotides in DNA are the following:

Symbol   Meaning             Nucleic Acid
A        A                   Adenine
C        C                   Cytosine
G        G                   Guanine
T        T                   Thymine
U        U                   Uracil
M        A or C
R        A or G
W        A or T
S        C or G
Y        C or T
K        G or T
V        A or C or G
H        A or C or T
D        A or G or T
B        C or G or T
X        G or A or T or C
N        G or A or T or C

Note: these letters are also used in the "samtools tview" program to visually show NGS read alignments.
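
One practical use of the table is searching a sequence for a motif that contains ambiguity codes. The sketch below translates such a motif into a regular expression in Python. The names AMBIGUITY and to_regex are made up for this example, and the mapping is copied directly from the table above.

```python
import re

# Symbol -> bases it can stand for, taken from the table above
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "GATC", "N": "GATC",
}


def to_regex(motif):
    """Translate a motif with ambiguity codes into a regular expression string."""
    parts = []
    for symbol in motif.upper():
        bases = AMBIGUITY[symbol]  # raises KeyError for unknown symbols
        # single bases stay as they are, ambiguous symbols become a character class
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


# "GAYN" becomes "GA[CT][GATC]"
print(to_regex("GAYN"))
```

The resulting string can be passed to re.search or re.finditer to locate all matches of the motif in a DNA sequence.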


1000 Genomes Project Populations

April 3rd, 2012

The goal of the 1000 Genomes Project is to create "A Deep Catalog of Human Genetic Variation" by measuring and analysing most genetic variants that have frequencies of at least 1% in the populations studied.

The population codes used in the project are the following (Source: 1000 Genomes / ftp site):

CHB	Han Chinese             Han Chinese in Beijing, China
JPT	Japanese                Japanese in Tokyo, Japan
CHS	Southern Han Chinese    Han Chinese South
CDX	Dai Chinese             Chinese Dai in Xishuangbanna, China
KHV	Kinh Vietnamese         Kinh in Ho Chi Minh City, Vietnam
CHD	Denver Chinese          Chinese in Denver, Colorado (pilot 3 only)
CEU	CEPH                    Utah residents (CEPH) with Northern and Western European ancestry
TSI	Tuscan                  Toscani in Italia
GBR	British                 British in England and Scotland
FIN	Finnish                 Finnish in Finland
IBS	Spanish                 Iberian populations in Spain
YRI	Yoruba                  Yoruba in Ibadan, Nigeria
LWK	Luhya                   Luhya in Webuye, Kenya
GWD	Gambian                 Gambian in Western Division, The Gambia
MSL	Mende                   Mende in Sierra Leone
ESN	Esan                    Esan in Nigeria
ASW	African-American SW     African Ancestry in Southwest US
ACB	African-Caribbean       African Caribbean in Barbados
MXL	Mexican-American        Mexican Ancestry in Los Angeles, California
PUR	Puerto Rican            Puerto Rican in Puerto Rico
CLM	Colombian               Colombian in Medellin, Colombia
PEL	Peruvian                Peruvian in Lima, Peru

GIH	Gujarati                Gujarati Indian in Houston, TX
PJL	Punjabi                 Punjabi in Lahore, Pakistan
BEB	Bengali                 Bengali in Bangladesh
STU	Sri Lankan              Sri Lankan Tamil in the UK
ITU	Indian                  Indian Telugu in the UK