Parsing OMIM data

April 18th, 2011

The Online Mendelian Inheritance in Man (OMIM) data is a "catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression. It is a phenotypic companion to the Human Genome Project." (omim.org)

To get human disease annotation for your gene data, the fine data from the OMIM database can be downloaded from their FTP site and parsed with one of multiple OMIM parsers within the BioPerl framwork.

I used Christian Zmasek's OMIMparser.pm to get hashes with the ids and names:

Code

use Bio::Phenotype::OMIM::OMIMparser;
 
 
 
$omim_parser = Bio::Phenotype::OMIM::OMIMparser->new(
 
    -genemap  => $omim_genemap,
 
    -omimtext => $omim_all );
 
while ( my $omim_entry = $omim_parser->next_phenotype() ) {
 
  my $numb  = $omim_entry->MIM_number();
 
  my $title = $omim_entry->title();
 
  #remove the gene symbol from the title line
 
  $title =~ s/^.?(\d+) //;
 
  $title =~ s/;.*$//;
 
  #store omim ids by disease names
 
  $omim_names{$title} = $numb;
 
  #store genes and disease names in hash ref by omim id
 
  $omim_ids{$numb}->{'disease'} = $title;
 
  my @symbols = $omim_entry->each_gene_symbol();
 
  $omim_ids{$numb}->{'genes'} = \@symbols;
 
  push(@all_omim, $numb.":".$title);
 
}

If you fall over an exception like this:

------------- EXCEPTION -------------

MSG: 16.13.3 does not make sense: 'arm' or 'cen' missing

STACK Bio::Map::CytoPosition::cytorange BioPerl-1.6.0/Bio/Map/CytoPosition.pm:165

You need to fix an error in the genemap file from OMIM:

line 9053 should be

16.25|2|2|10|16p13.3|CHTF18....

instead of

16.25|2|2|10|16.13.3|CHTF18...

source

OMIM ids are pre-fixed with defined symbols. The explanation what these characters means can be found on their FAQ site or here.

Please note that OMIM band start locations have a 1 bp offset to the definitions e.g. in ENSEMBL (probably from a 0-based coordinate system). The "16p11.2" band below is listed as chr16 28100001 - 34600000 in Ensembl.

OMIM example

PAR regions

April 6th, 2011

The pseudo-autosomal regions are homologous DNA sequences on the (human) X and Y chromosomes (see wikipedia for more). They allow the pairing and crossing-over of these sex chromosomes the same way the autosomal chromosomes do during meiosis. As these genomic regions are identical between X and Y, they are oftentimes only stored once.

To pull out the coordinates of the pseudo-autosomal regions (PAR) from the Ensembl database, you can perform the following query on the Ensembl core database:

Code

select (select sr.name from seq_region sr where sr.seq_region_id=ae.seq_region_id) as chrom_1, ae.seq_region_start as start_1, ae.seq_region_end as end_1, (select sr.name from seq_region sr where sr.seq_region_id=ae.exc_seq_region_id) as chrom_2, ae.exc_seq_region_start as start_2, ae.exc_seq_region_end as end_2 from assembly_exception ae where ae.exc_type="PAR";

For the human database schema 61 (assembly GRCh37/hg19) you will get where the corresponding region is located:

+---------+----------+----------+---------+-----------+-----------+
| chrom_1 | start_1  | end_1    | chrom_2 | start_2   | end_2     |
+---------+----------+----------+---------+-----------+-----------+
| Y       |    10001 |  2649520 | X       |     60001 |   2699520 |
| Y       | 59034050 | 59373566 | X       | 154931044 | 155270560 |
+---------+----------+----------+---------+-----------+-----------+

For the old assembly (NCBI36/hg18) you will get:

+---------+----------+----------+---------+-----------+-----------+
| chrom_1 | start_1  | end_1    | chrom_2 | start_2   | end_2     |
+---------+----------+----------+---------+-----------+-----------+
| Y       |        1 |  2709520 | X       |         1 |   2709520 |
| Y       | 57443438 | 57772954 | X       | 154584238 | 154913754 |
+---------+----------+----------+---------+-----------+-----------+

You can alternatively use the API:

Code

my $aefa = $db->get_AssemblyExceptionFeatureAdaptor();
my $sa   = $db->get_SliceAdaptor;
my $slice = $sa->fetch_by_region("chromosome", "Y");
my @aefs = @{$aefa->fetch_all_by_Slice($slice)};
foreach my $ae (@aefs){
  print $ae->display_id."\t".$ae->start."\t".$ae->end."\n";
}
X	10001	2649520
X	59034050	59373566

or for X:

Y	60001	2699520
Y	154931044	155270560

So to translate from Y to X PAR locations you can use the following for GRCh37 / hg19:

Y 10001 - 2649520      <->  X 60001 - 2699520, band Xp22.33
Y 59034050 - 59373566  <->  X 154931044 - 155270560, band Xq28

and for NCBI36 / hg18:

Y 1 - 2709520          <-> X  1 - 2709520, band Xp22.33
Y 57443438 - 57772954  <-> X  154584238 - 154913754, band Xq28

Please note that these coordinates do not agree with the definitions at the GRC and NCBI. This difference of the PAR-2 end coordinates (chrX:155.260.560 / 155.270.560 or chrY:59.363.566 / 59.373.566) is caused by the 10kb telomeric (gap) region which needs to be included in the PAR-2 definition to correctly represent this arrangement.

See also the telomere & centromer definition notes.

A nice list of official HGNC genes that are located in the pseudo-autosomal regions can be found here.

PDL: The Perl Data Language

March 1st, 2011

It doesn't always have to be R!

The Perl Data Language is a Perl extension for numerical manipulation that provides the convenience of Perl with the speed of compiled C.

It also contains plotting modules.

Install with cpan install PDL or check these descriptions.

Code example for getting basic stats from a few values:

Code

use PDL;
 
 
 
my @numbers = (1,4,6,8,10);
 
my $piddle = pdl(@numbers);
 
my ($mean,$prms,$median,$min,$max,$adev,$rms) = statsover($piddle);
 
 
 
print "Mean=$mean\n".
 
      "Root-mean-square deviation=$prms\n".
 
      "Median=$median\n".
 
      "Min=$min\n".
 
      "Max=$max\n".
 
      "StdDev=$adev\n".
 
      "Population-Deviation=$rms\n\n";

Running CronJobs

February 4th, 2011

cron is a extremely useful unix utility that allows tasks to be automatically run in the background at regular intervals.

You need the script / command you want to run and the time it should run. You can the use the crontab command to edit the service:

  1. crontab -e Edit your crontab file, or create one if it doesn't already exist.
  2. crontab -l Display your crontab file.
  3. crontab -r Remove your crontab file.

Format of entries:


*     *     *   *    *        command to be executed

-     -     -   -    -

|     |     |   |    |

|     |     |   |    +----- day of week (0 - 6) (Sunday=0)

|     |     |   +------- month (1 - 12)

|     |     +--------- day of        month (1 - 31)

|     +----------- hour (0 - 23)

+------------- min (0 - 59)

Example:

00 03 * * * bash /users/fsk/backup_db.sh

This runs my backup script at 03:00 every day.

*/10 * * * * echo "job done"

This runs an echo every 10 minutes of every hour of every day.

To receive an email with any result from the jobs, add

Code

MAILTO=yourmail@home.com

to the top of the crontab. To discard any output add

Code

>/dev/null 2>&1

to the end of the job line or as the very first line (for all jobs).

Source

AnnoTrack: Rails System

January 31st, 2011

This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.

Please read elsewhere about general Ruby or Rails questions, there are blog entries about Ruby & Rails Terminology, Rails application layout

The AnnoTrack Ruby-on-Rails code can be found in svn/gencode/tracking_system/rails/. Most AnnoTrack-specific code is stored as "plugin" code in the Redmine directory. This means when trying to find a specific piece of code, you have to check the default application directory app, but also the plugin directory

vendor/plugins/redmine_annotrack/app. The language files defining the terminology and browser links used on the websites are

svn/gencode/tracking_system/rails/lang/en.yml and

svn/gencode/tracking_system/rails/vendor/plugins/redmine_annotrack/lang/en.yml/.

In these files an entry like

Code

label_project_new: New Gene

means "if you come across the term label_project_new, display it as New Gene in the browser".

To understand the code underlying specific web pages it is helpful to check the routing entries in

config/routes.rb and vendor/plugins/redmine_annotrack/routes.rb. Specific paths in the browser are mapped to specific functions in the rails code. E.g.:

Code

map.connect 'flags/show_tecs', :controller => 'flags', :action => 'show_tecs'

maps the URL http://annotrack.sanger.ac.uk/human/flags/show_tecs to

the show_tecs function in the file app/controllers/flags_controller.rb.

The list of chromosomes used as well as the different priority values are set on this page.

Some options for links on the transcript pages etc. can be changed through the administration interface.

These previous actions require administrator user rights in the AnnoTrack system. The list of different user right for all groups is shown here.

The documentation pages can be edited with a wiki-style syntax by clicking on the edit pencil on each page.