GENCODE

February 28th, 2008

GENCODE is a scientific project in the area of genome research.

Research Goal:

Annotate gene features across the human genome using computational methods, manual annotation, and targeted experiments as part of the ENCODE project.

Grant proposal: Tim Hubbard, Ph.D.; Wellcome Trust Sanger Institute, Hinxton, England: Extensions of GENCODE gene annotation project.

$8.5 million (four years); Integrated Human Genome Annotation: Generation of a Reference Gene Set. Using computational methods, manual annotation and targeted experiments, this team will annotate gene features in the human genome. Such features include genes that code for proteins; genes that are transcribed, but do not code for proteins; and pseudogenes, which are DNA sequences similar to normal genes, but which have been altered slightly so they are not functional.

Participating centres and principal investigators in the Integrated Annotation Program:

  • Tim Hubbard, Jen Harrow, Steve Searle, Wellcome Trust Sanger Institute, UK
  • Alexandre Reymond, The University of Lausanne, Switzerland
  • Roderic Guigo, Centre de Regulació Genòmica (CRG), Barcelona, Catalonia, Spain
  • David Haussler, Mark Diekhans, Rachel Harte, University of California, Santa Cruz (UCSC), Santa Cruz, California, USA
  • Michael Brent, Washington University, St Louis (WashU), St Louis, USA
  • Manolis Kellis, Massachusetts Institute of Technology (MIT), Cambridge, USA
  • Mark Gerstein, Yale University (Yale), New Haven, USA
  • Alfonso Valencia, Michael Tress, Spanish National Cancer Research Centre (CNIO), Madrid, Spain

See also: http://www.gencodegenes.org and Wikipedia.

Different EnsEMBL routines

February 22nd, 2008

A few useful bits of Perl code to work with the EnsEMBL API

Use the registry & get variation information:


use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# connect to the public Ensembl database server
my $reg  = 'Bio::EnsEMBL::Registry';
my $host = 'ensembldb.ensembl.org';
my $user = 'anonymous';

$reg->load_registry_from_db(-host => $host,
                            -user => $user
                            );

my $gene_stable_id = 'ENSG00000128573';
my $gene_adaptor = $reg->get_adaptor("human","core","gene");
my $gene = $gene_adaptor->fetch_by_stable_id($gene_stable_id);

# returns ALL variations located within the gene
my $vfs = $gene->feature_Slice->get_all_VariationFeatures();
foreach my $vf (@{$vfs}){
    print "Variation ", $vf->variation_name, " in position ",
    $vf->seq_region_name,":",$vf->seq_region_start,"\n";
}



General sorting on two object attributes (ascending start; for equal starts, descending end):

my @sorted_genes =
       sort { $a->start <=> $b->start || $b->end <=> $a->end } @$genes;
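The same two-key ordering (start ascending, ties broken by end descending) can also be checked outside Perl, for instance on a tab-separated export of coordinates. The following is a minimal shell sketch using sort(1); the function name sort_by_start_then_end and the coordinates are made up for illustration and are not part of the EnsEMBL API.

```shell
# Sort "start<TAB>end" rows: start ascending, then end descending.
sort_by_start_then_end() {
    sort -t "$(printf '\t')" -k1,1n -k2,2nr
}

# Illustrative coordinates only (not real gene positions).
printf '200\t300\n100\t150\n100\t400\n' | sort_by_start_then_end
```

With two rows starting at 100, the longer one (end 400) is listed first, which mirrors what the Perl comparison does for features that share a start.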

Write out cross-references (IDs from external databases) from Ensembl objects:

# 'xrefs.txt' is a placeholder output file; @genes is assumed to hold Bio::EnsEMBL::Gene objects
open(my $xrefs_fh, '>', 'xrefs.txt') or die "Cannot open xrefs.txt: $!";
foreach my $gene (@genes) {
  foreach my $trans (@{ $gene->get_all_Transcripts }) {
    foreach my $xref (@{ $trans->get_all_DBEntries }) {
      # one tab-separated row per cross-reference
      print $xrefs_fh join("\t",
        $gene->dbID, $trans->dbID,
        $xref->dbname, $xref->primary_id, $xref->display_id,
        $gene->description, $gene->status), "\n";
    }
  }
}
close($xrefs_fh);

EnsEMBL db Installation

January 14th, 2008

A few simple steps to set up a local EnsEMBL database to query genome annotation (genes, transcripts, exons, external database IDs like OMIM, RefSeq) for the human genome or many other vertebrates. Queries through the MySQL interface or the Perl API run many times faster against a local data mirror than against the public server.

Find out which Ensembl version is the most current one, e.g. from the website or the Ensembl blog.

  1. Get the code: BioPerl

    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/bioperl login
    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/bioperl \
      co -r bioperl-release-1-2-3 bioperl-live
  2. Get the code: EnsEMBL

    cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/CVSmaster  login
    
    Logging in ...
    CVS password: CVSUSER
    cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/CVSmaster \
      co -r branch-ensembl-65 ensembl
  3. Install a MySQL server
    the free community edition is sufficient
  4. Get the data
    from FTP site: ftp://ftp.ensembl.org/pub/current_mysql/
  5. Install

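    gunzip *.gz
    mysql -uroot -p -e"create database homo_sapiens_core_65_37;"
    mysql -uroot -p homo_sapiens_core_65_37 < homo_sapiens_core_65_37.sql
    mysqlimport -uroot -p --fields_escaped_by=\\  \
      homo_sapiens_core_65_37 -L *.txt

    The .sql file from the same FTP directory creates the empty tables, so it has to be loaded before mysqlimport fills them. Please adjust species names and version numbers to your needs.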

  6. Use
    Point your code to the new database using an entry in the Ensembl Registry file, e.g. like this:

    new Bio::EnsEMBL::DBSQL::DBAdaptor(
      -host    => '192.168.1.108', #or 'localhost'
      -user    => 'readonly',
      -pass    => 'readonly',
      -port    => '3306',
      -species => 'Homo sapiens 65',
      -group   => 'core',
      -dbname  => 'homo_sapiens_core_65_37'
    );
    
    my @aliases = ( 'human_65', 'Homo sapiens');
    Bio::EnsEMBL::Utils::ConfigRegistry->add_alias(
      -species => 'Homo sapiens 65',
      -alias   => \@aliases
    );
    

from: http://www.ensembl.org/info/docs/api/api_installation.html
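The database name used in step 5 appears to follow the pattern species_group_release_assembly (homo_sapiens_core_65_37 above). The following is a minimal shell sketch, assuming that pattern holds for the species and release you downloaded. It builds the name once and prints the commands instead of running them, so they can be reviewed first. The function name ensembl_db_name is made up for this example.

```shell
# Build a database name from its four parts (pattern assumed from the example above).
# Arguments: species (lowercase, underscores), group, release, assembly version.
ensembl_db_name() {
    printf '%s_%s_%s_%s\n' "$1" "$2" "$3" "$4"
}

db=$(ensembl_db_name homo_sapiens core 65 37)

# Dry run: print the commands from step 5 with the name filled in.
printf 'mysql -uroot -p -e "create database %s;"\n' "$db"
printf 'mysqlimport -uroot -p --fields_escaped_by=\\\\ %s -L *.txt\n' "$db"
```

Keeping the name in one variable avoids the create and import commands drifting apart when the release number changes.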

See also: Ensembl, Biomart installation

screen command

January 11th, 2008

Use the Unix screen application to run programs on a remote host where you can log out and back in without interrupting the program. I found this extremely useful when logging into a Unix machine via SSH (e.g. PuTTY) from a Windows machine, starting a processing job and leaving it to run without having to worry about losing the connection:

  • create a new session: screen
  • create a named session: screen -S sname
  • list active sessions: screen -ls
  • detach a session (if you want to log out): screen -d sname (from another terminal) or Ctrl-a d (inside the screen session)
  • reattach a session: screen -r sname
  • kill a session while attached: Ctrl-a \ or type exit
  • kill a dead, unresponsive session from another terminal window: screen -S <sname> -p 0 -X quit
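The commands above can be combined into a short script for the common case of starting a long job that should survive a logout. This is a sketch only: the session name longjob and the sleep command are placeholders for your own job, and the script does nothing useful if screen is not installed.

```shell
session="longjob"   # hypothetical session name

if command -v screen >/dev/null 2>&1; then
    # -d -m starts the session already detached, so no terminal stays attached to it
    screen -dmS "$session" sleep 30 || echo "could not start session $session"
    # listing only; the exit status is ignored here
    screen -ls || true
    # remove the demo session again
    screen -S "$session" -X quit || true
else
    echo "screen is not installed on this machine"
fi
```

For a real job, drop the final quit line and reattach later with screen -r longjob to look at the output.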

Amino Acids

January 8th, 2008

Amino acids are organic compounds with at least one carboxyl group (–COOH) and one amino group (–NH2) that play an essential role for life on Earth. They are the building blocks of proteins, which make up the body and take part in almost all biochemical processes in it.

The 20 canonical amino acids (those defined by the genetic code and usable by the human body) and their properties are listed here:

Amino acid      Abbr.  Code  Side chain              Class      Polarity  Acidity/basicity
Alanine         Ala    A     -CH3                    -          nonpolar  neutral
Arginine        Arg    R     -CH2CH2CH2NH-C(NH)NH2   -          polar     basic (strong)
Asparagine      Asn    N     -CH2CONH2               -          polar     neutral
Aspartic acid   Asp    D     -CH2COOH                -          polar     acidic
Cysteine        Cys    C     -CH2SH                  -          polar     neutral
Glutamic acid   Glu    E     -CH2CH2COOH             -          polar     acidic
Glutamine       Gln    Q     -CH2CH2CONH2            -          polar     neutral
Glycine         Gly    G     -H                      -          nonpolar  neutral
Histidine       His    H     -CH2(C3H3N2)            aromatic   polar     basic (weak)
Isoleucine      Ile    I     -CH(CH3)CH2CH3          aliphatic  nonpolar  neutral
Leucine         Leu    L     -CH2CH(CH3)2            aliphatic  nonpolar  neutral
Lysine          Lys    K     -CH2CH2CH2CH2NH2        -          polar     basic
Methionine      Met    M     -CH2CH2SCH3             -          nonpolar  neutral
Phenylalanine   Phe    F     -CH2(C6H5)              aromatic   nonpolar  neutral
Proline         Pro    P     -CH2CH2CH2-             -          nonpolar  neutral
Serine          Ser    S     -CH2OH                  -          polar     neutral
Threonine       Thr    T     -CH(OH)CH3              -          polar     neutral
Tryptophan      Trp    W     -CH2(C8H6N)             aromatic   polar     neutral
Tyrosine        Tyr    Y     -CH2(C6H4)OH            aromatic   polar     neutral
Valine          Val    V     -CH(CH3)2               aliphatic  nonpolar  neutral

(- = no class given)
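
The abbreviation and code columns of the table are what is needed to convert a peptide written in three-letter notation into the one-letter notation used in sequence files. The following is a small shell sketch of that lookup. The function names are invented for this example and the mapping is copied from the table above.

```shell
# Map a three-letter abbreviation to its one-letter code (values from the table above).
aa_three_to_one() {
    case "$1" in
        Ala) echo A ;; Arg) echo R ;; Asn) echo N ;; Asp) echo D ;;
        Cys) echo C ;; Glu) echo E ;; Gln) echo Q ;; Gly) echo G ;;
        His) echo H ;; Ile) echo I ;; Leu) echo L ;; Lys) echo K ;;
        Met) echo M ;; Phe) echo F ;; Pro) echo P ;; Ser) echo S ;;
        Thr) echo T ;; Trp) echo W ;; Tyr) echo Y ;; Val) echo V ;;
        *)   echo "unknown abbreviation: $1" >&2; return 1 ;;
    esac
}

# Convert a peptide given as separate three-letter arguments to one-letter notation.
peptide_to_one_letter() {
    out=""
    for aa in "$@"; do
        code=$(aa_three_to_one "$aa") || return 1
        out="$out$code"
    done
    echo "$out"
}

peptide_to_one_letter Met Ala Trp
```

Unknown abbreviations make the function return a non-zero status instead of silently dropping a residue.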

Valine, methionine, leucine, isoleucine, phenylalanine, tryptophan, threonine and lysine are the essential amino acids, which humans must take in with their food.

Source and further details: Wikipedia