EnsEMBL db Installation

January 14th, 2008

A few simple steps to set up a local EnsEMBL database for querying genome annotation (genes, transcripts, exons, external database IDs such as OMIM or RefSeq) for the human genome or many other vertebrates. Queries through the MySQL interface or the Perl API run much faster against a local mirror of the data.

Find out which Ensembl version is the most current one, e.g. from the website or the Ensembl blog.

  1. Get the code: Bioperl

    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/bioperl login
    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/bioperl \
      co -r bioperl-release-1-2-3 bioperl-live
  2. Get the code: EnsEMBL

    cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/CVSmaster  login
    Logging in ...
    CVS password: CVSUSER
    cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/CVSmaster \
      co -r branch-ensembl-65 ensembl
  3. Install a mySQL server
    free community edition
  4. Get the data
    from FTP site: ftp://ftp.ensembl.org/pub/current_mysql/
  5. Install

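    gunzip *.gz
    mysql -uroot -p -e"create database homo_sapiens_core_65_37;"
    # load the table definitions before importing the data
    mysql -uroot -p homo_sapiens_core_65_37 < homo_sapiens_core_65_37.sql
    mysqlimport -uroot -p --fields_escaped_by=\\ \
      homo_sapiens_core_65_37 -L *.txt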

    Please adjust species names and version numbers to your needs.

  6. Use
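    Point your code to the new database using an entry in the Ensembl Registry file, e.g. like this:

    use Bio::EnsEMBL::DBSQL::DBAdaptor;
    use Bio::EnsEMBL::Utils::ConfigRegistry;

    # connection to the local core database
    new Bio::EnsEMBL::DBSQL::DBAdaptor(
      -host    => '', # or 'localhost'
      -user    => 'readonly',
      -pass    => 'readonly',
      -port    => '3306',
      -species => 'Homo sapiens 65',
      -group   => 'core',
      -dbname  => 'homo_sapiens_core_65_37'
    );

    # alternative names for the species registered above
    my @aliases = ( 'human_65', 'Homo sapiens' );
    Bio::EnsEMBL::Utils::ConfigRegistry->add_alias(
      -species => 'Homo sapiens 65',
      -alias   => \@aliases
    );

    1;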

from: http://www.ensembl.org/info/docs/api/api_installation.html
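
Because only the species name and version numbers change between installs, the install commands can be generated from those values. The following is a minimal sketch, assuming the <species>_core_<release>_<assembly> naming pattern seen in homo_sapiens_core_65_37 and a <dbname>.sql schema file in the download directory. The function names are hypothetical, and the script only prints the commands so they can be checked before running them:

```shell
#!/bin/sh
# Hypothetical helper: build a core database name such as
# homo_sapiens_core_65_37 from species, release and assembly number.
ensembl_db_name() {
    printf '%s_core_%s_%s\n' "$1" "$2" "$3"
}

# Print (do not run) the install commands for one core database.
# Assumption: the dump directory contains <dbname>.sql with the table
# definitions next to the *.txt data files.
print_install_commands() {
    db=$(ensembl_db_name "$1" "$2" "$3")
    printf '%s\n' "gunzip *.gz"
    printf '%s\n' "mysql -uroot -p -e 'create database $db;'"
    printf '%s\n' "mysql -uroot -p $db < $db.sql"
    printf '%s\n' "mysqlimport -uroot -p --fields_escaped_by=\\\\ $db -L *.txt"
}

print_install_commands homo_sapiens 65 37
```

Calling print_install_commands with another species and release, e.g. mus_musculus with its own version numbers, prints the same four commands for that database.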

See also: Ensembl, Biomart installation

screen command

January 11th, 2008

Use the Unix screen application to run programs on a remote host so that you can log out and back in without interrupting the program. I found this extremely useful when logging into a Unix machine via SSH (e.g. PuTTY) from a Windows machine, starting a processing job and leaving it to run without having to worry about losing the connection.

  • create a new session: screen
  • create a named session: screen -S sname
  • list active sessions: screen -ls
  • detach a session (if you want to log out): screen -d sname (remotely) or Ctrl-a d (inside the screen session)
  • reattach a session: screen -r sname
  • kill the session you are attached to: Ctrl-a \ or exit
  • kill a dead, unresponsive session from another terminal window: screen -S <sname> -p 0 -X quit
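
The commands above can be combined into a small wrapper that starts a job in a detached session only if no session of that name is listed yet. This is a sketch under assumptions: the function names are hypothetical, and has_session expects the usual `screen -ls` line layout of `<pid>.<name>` followed by whitespace, which may differ between screen versions.

```shell
#!/bin/sh
# Succeeds if a session named $1 appears in `screen -ls` output on stdin.
# Assumption: sessions are listed as "<pid>.<name>" followed by whitespace.
has_session() {
    grep -Eq "^[[:space:]]*[0-9]+\.$1[[:space:]]"
}

# Start a command in a new detached session (screen -dmS) unless a
# session with that name already exists.
start_in_screen() {
    name=$1
    shift
    if screen -ls | has_session "$name"; then
        echo "session $name already exists; reattach with: screen -r $name"
    else
        screen -dmS "$name" "$@"
        echo "started $name; reattach with: screen -r $name"
    fi
}

# Example (not run here): start_in_screen myjob ./long_job.sh
```

The session name is used inside a regular expression, so names containing characters such as `.` or `*` would need escaping.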

Amino Acids (Aminosäuren)

January 8th, 2008

Amino acids are organic compounds with at least one carboxyl group (–COOH) and one amino group (–NH2) that play an essential role for life on Earth. They are the building blocks of proteins, which make up the body and take part in almost all of its biochemical processes.

The 20 canonical amino acids (those defined by the genetic code and usable by the human body) and their properties are listed here:

Amino acid     Abbr. Code Side chain             Class      Polarity  Acidity or basicity
Alanine        Ala   A    -CH3                              nonpolar  neutral
Arginine       Arg   R    -CH2CH2CH2NH-C(NH)NH2             polar     basic (strong)
Asparagine     Asn   N    -CH2CONH2                         polar     neutral
Aspartic acid  Asp   D    -CH2COOH                          polar     acidic
Cysteine       Cys   C    -CH2SH                            polar     neutral
Glutamic acid  Glu   E    -CH2CH2COOH                       polar     acidic
Glutamine      Gln   Q    -CH2CH2CONH2                      polar     neutral
Glycine        Gly   G    -H                                nonpolar  neutral
Histidine      His   H    -CH2(C3H3N2)           aromatic   polar     basic (weak)
Isoleucine     Ile   I    -CH(CH3)CH2CH3         aliphatic  nonpolar  neutral
Leucine        Leu   L    -CH2CH(CH3)2           aliphatic  nonpolar  neutral
Lysine         Lys   K    -CH2CH2CH2CH2NH2                  polar     basic
Methionine     Met   M    -CH2CH2SCH3                       nonpolar  neutral
Phenylalanine  Phe   F    -CH2(C6H5)             aromatic   nonpolar  neutral
Proline        Pro   P    -CH2CH2CH2-                       nonpolar  neutral
Serine         Ser   S    -CH2OH                            polar     neutral
Threonine      Thr   T    -CH(OH)CH3                        polar     neutral
Tryptophan     Trp   W    -CH2(C8H6N)            aromatic   polar     neutral
Tyrosine       Tyr   Y    -CH2(C6H4)OH           aromatic   polar     neutral
Valine         Val   V    -CH(CH3)2              aliphatic  nonpolar  neutral

Valine, methionine, leucine, isoleucine, phenylalanine, tryptophan, threonine and lysine are the essential amino acids, which humans must take in with their food.

Source and further details: Wikipedia

Genetic code

January 8th, 2008

The standard code, which defines the amino acid built from each combination of three nucleotides (a codon), can be read from the image below.

[Image: Standard Codon Set. DNA triplet codes (Codonsonne/Codesonne)]
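
The same lookup can be sketched in code. The example below is an illustration, not taken from the original post: it stores the standard code as a 64-letter string with the bases ordered T, C, A, G at each codon position (the layout used by NCBI translation table 1), and the function name translate is hypothetical. It reads the coding strand three bases at a time, treats U as T, prints * for stop codons and X for codons containing other letters.

```shell
#!/bin/sh
# Translate a DNA or RNA coding sequence into one-letter amino acid codes.
translate() {
    printf '%s\n' "$1" | awk '
    BEGIN {
        bases = "TCAG"
        # amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG (standard code)
        aa = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    }
    {
        seq = toupper($0)
        gsub(/U/, "T", seq)
        out = ""
        # walk through complete codons; a trailing partial codon is ignored
        for (i = 1; i + 2 <= length(seq); i += 3) {
            a = index(bases, substr(seq, i, 1)) - 1
            b = index(bases, substr(seq, i + 1, 1)) - 1
            c = index(bases, substr(seq, i + 2, 1)) - 1
            if (a < 0 || b < 0 || c < 0) { out = out "X"; continue }
            out = out substr(aa, 16 * a + 4 * b + c + 1, 1)
        }
        print out
    }'
}

translate ATGGCCTGGTAA
```

The call at the end translates the codons ATG, GCC, TGG and TAA, i.e. methionine, alanine, tryptophan and a stop codon.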

Job Scheduling systems on HPCs

January 4th, 2008


The Load Sharing Facility (LSF) system from IBM Platform can be used to manage high-performance computing (HPC) resources. It allows the scheduling of computational tasks (jobs) and load balancing across the machines that are part of the compute farm or cluster.

Basic tasks:

  • send jobs to farm:
    bsub -o output.txt script_name.pl -p parameter_1
  • specify resource requirements:
    bsub -R 'model=IBMBC2800' job
    bsub -R 'select[mem>1000] rusage[mem=1000]' job

    limit access to database:
    select[ecs4<100] rusage[ecs4my3353=10:duration=10:decay=1]
  • kill jobs:
    bkill 0 (kills all of your own jobs)
    bkill jobid
  • example to kill specific jobs:
    bjobs | grep long | awk '{ print $1 }' | xargs bkill
    force-kill: bkill -r <ID>
  • Check jobs:
    bpeek jobid
  • Check farm machines:
    bhosts -w | more
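
The grep in the kill example above matches "long" anywhere on the line, including job names. A slightly stricter variant is sketched below. It assumes the default bjobs layout, with a header line followed by rows whose first field is the job ID and whose fourth field is the queue; the function name jobs_in_queue is hypothetical.

```shell
#!/bin/sh
# Print the IDs of jobs whose QUEUE column (4th field) equals $1.
# Reads `bjobs` output on stdin; line 1 is assumed to be the header.
jobs_in_queue() {
    awk -v queue="$1" 'NR > 1 && $4 == queue { print $1 }'
}

# Example (not run here): kill only the jobs sitting in the "long" queue
#   bjobs | jobs_in_queue long | xargs bkill
```

Running `bjobs | jobs_in_queue long` on its own first shows which job IDs would be passed to bkill.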

Grid Engine

Oracle Grid Engine (formerly Sun Grid Engine) is a similar system that grew out of an open-source software project supported by Sun.

Basic Usage:

  • Submit jobs:
    qsub job
    e.g. for shell scripts:
    qsub -cwd -e run.err -o run.out run_me.sh
    and for binaries:
    qsub -cwd -e run.err -o run.out -b y program.exe
    Use -cwd to run the job from the current directory and write the job results there; use -V to pass in all environment variables.
  • kill jobs
    qdel jobid
    qdel -u user
  • Show host/job/queue status
    qstat -j jobid
    qstat -q

    or just: qstat | grep <username>
    more details:
    qstat -j jobid -explain E
  • queue details/available nodes: qstat -g c

Job status codes:

    qw - Queued and waiting
    w  - Job waiting
    s  - Job suspended
    t  - Job transferring and about to start
    r  - Job running
    h  - Job hold
    R  - Job restarted
    d  - Job has been marked for deletion
    Eqw - An error occurred with the job
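
To get a quick overview of many jobs, the status codes can be counted from the qstat listing. The sketch below makes assumptions that should be checked against the local installation: it expects the default qstat layout, with two header lines followed by rows whose fifth field is the state, and the function name summarize_states is hypothetical. The descriptions are the ones from the list above.

```shell
#!/bin/sh
# Count jobs per state from `qstat` output read on stdin.
# Assumption: lines 1-2 are the header and separator, field 5 is the state.
summarize_states() {
    awk '
    BEGIN {
        desc["qw"] = "queued and waiting"; desc["w"] = "waiting"
        desc["s"] = "suspended"; desc["t"] = "transferring"
        desc["r"] = "running"; desc["h"] = "hold"
        desc["R"] = "restarted"; desc["d"] = "marked for deletion"
        desc["Eqw"] = "error"
    }
    NR > 2 && NF >= 5 { count[$5]++ }
    END {
        for (s in count) {
            d = (s in desc) ? desc[s] : "other"
            printf "%d %s (%s)\n", count[s], s, d
        }
    }' | sort
}

# Example (not run here): qstat | summarize_states
```

Combined states that are not in the list, such as hqw, are reported as "other".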

More details: HowTo, Installation & User manual, and notes.