Using dbVar

May 12th, 2011

"Structural variation (SV) is generally defined as a region of DNA approximately 1 kb and larger in size and can include inversions and balanced translocations or genomic imbalances (insertions and deletions), commonly referred to as copy number variants (CNVs). These CNVs often overlap with segmental duplications (regions of DNA >1 kb present more than once in the genome). If present at >1% in a population a CNV may be referred to as copy number polymorphism (CNP)."

Estimates of how much of the human genome are CNVs range from 10-20%.

dbVar is the NCBI database of genomic structural variation designed to store data on variant DNA ≥ 1 bp in size.

The databases ids are organised in the following manner:

  • std: the study id - this identifies a submitted study
  • sv: the structural variant id - this identifies the submitted region of variation
  • ssv: the supporting structural variant id - this identifies the supporting regions of variation (often sample-specific) that were used to call the submitted region of variation
  • The ids are prefixed with 'n' if the study was submitted to NCBI, or 'e' if it was submitted to EBI

This means that multiple experimental results, ie. regions identified from different samples, stored as "supporting variants", are combined into regions that describe these as one "event" and are stored as "variant".

An example: esv10580 includes the supporting variants essv57440, essv75601, essv61475 and others. The individual (GRCh37/hg19) coordinates, e.g.

Chr1	521,413	564,458
Chr1	521,413	564,458
Chr1	521,648	575,095

result in the maximum coordinates for the variant:

Chr1	521,413	575,095

They all belong to the study estd20 by Conrad et al. (2010).

There is a good overview page explaining structural variations and related methods.

Source: dbVar

Parsing OMIM data

April 18th, 2011

The Online Mendelian Inheritance in Man (OMIM) data is a "catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression. It is a phenotypic companion to the Human Genome Project." (

To get human disease annotation for your gene data, the fine data from the OMIM database can be downloaded from their FTP site and parsed with one of multiple OMIM parsers within the BioPerl framwork.

I used Christian Zmasek's to get hashes with the ids and names:


use Bio::Phenotype::OMIM::OMIMparser;
$omim_parser = Bio::Phenotype::OMIM::OMIMparser->new(
    -genemap  => $omim_genemap,
    -omimtext => $omim_all );
while ( my $omim_entry = $omim_parser->next_phenotype() ) {
  my $numb  = $omim_entry->MIM_number();
  my $title = $omim_entry->title();
  #remove the gene symbol from the title line
  $title =~ s/^.?(\d+) //;
  $title =~ s/;.*$//;
  #store omim ids by disease names
  $omim_names{$title} = $numb;
  #store genes and disease names in hash ref by omim id
  $omim_ids{$numb}->{'disease'} = $title;
  my @symbols = $omim_entry->each_gene_symbol();
  $omim_ids{$numb}->{'genes'} = \@symbols;
  push(@all_omim, $numb.":".$title);

If you fall over an exception like this:

------------- EXCEPTION -------------

MSG: 16.13.3 does not make sense: 'arm' or 'cen' missing

STACK Bio::Map::CytoPosition::cytorange BioPerl-1.6.0/Bio/Map/

You need to fix an error in the genemap file from OMIM:

line 9053 should be


instead of



OMIM ids are pre-fixed with defined symbols. The explanation what these characters means can be found on their FAQ site or here.

Please note that OMIM band start locations have a 1 bp offset to the definitions e.g. in ENSEMBL (probably from a 0-based coordinate system). The "16p11.2" band below is listed as chr16 28100001 - 34600000 in Ensembl.

OMIM example

PAR regions

April 6th, 2011

The pseudo-autosomal regions are homologous DNA sequences on the (human) X and Y chromosomes (see wikipedia for more). They allow the pairing and crossing-over of these sex chromosomes the same way the autosomal chromosomes do during meiosis. As these genomic regions are identical between X and Y, they are oftentimes only stored once.

To pull out the coordinates of the pseudo-autosomal regions (PAR) from the Ensembl database, you can perform the following query on the Ensembl core database:


select (select from seq_region sr where sr.seq_region_id=ae.seq_region_id) as chrom_1, ae.seq_region_start as start_1, ae.seq_region_end as end_1, (select from seq_region sr where sr.seq_region_id=ae.exc_seq_region_id) as chrom_2, ae.exc_seq_region_start as start_2, ae.exc_seq_region_end as end_2 from assembly_exception ae where ae.exc_type="PAR";

For the human database schema 61 (assembly GRCh37/hg19) you will get where the corresponding region is located:

| chrom_1 | start_1  | end_1    | chrom_2 | start_2   | end_2     |
| Y       |    10001 |  2649520 | X       |     60001 |   2699520 |
| Y       | 59034050 | 59373566 | X       | 154931044 | 155270560 |

For the old assembly (NCBI36/hg18) you will get:

| chrom_1 | start_1  | end_1    | chrom_2 | start_2   | end_2     |
| Y       |        1 |  2709520 | X       |         1 |   2709520 |
| Y       | 57443438 | 57772954 | X       | 154584238 | 154913754 |

You can alternatively use the API:


my $aefa = $db->get_AssemblyExceptionFeatureAdaptor();
my $sa   = $db->get_SliceAdaptor;
my $slice = $sa->fetch_by_region("chromosome", "Y");
my @aefs = @{$aefa->fetch_all_by_Slice($slice)};
foreach my $ae (@aefs){
  print $ae->display_id."\t".$ae->start."\t".$ae->end."\n";
X	10001	2649520
X	59034050	59373566

or for X:

Y	60001	2699520
Y	154931044	155270560

So to translate from Y to X PAR locations you can use the following for GRCh37 / hg19:

Y 10001 - 2649520      <->  X 60001 - 2699520, band Xp22.33
Y 59034050 - 59373566  <->  X 154931044 - 155270560, band Xq28

and for NCBI36 / hg18:

Y 1 - 2709520          <-> X  1 - 2709520, band Xp22.33
Y 57443438 - 57772954  <-> X  154584238 - 154913754, band Xq28

Please note that these coordinates do not agree with the definitions at the GRC and NCBI. This difference of the PAR-2 end coordinates (chrX:155.260.560 / 155.270.560 or chrY:59.363.566 / 59.373.566) is caused by the 10kb telomeric (gap) region which needs to be included in the PAR-2 definition to correctly represent this arrangement.

See also the telomere & centromer definition notes.

A nice list of official HGNC genes that are located in the pseudo-autosomal regions can be found here.

PDL: The Perl Data Language

March 1st, 2011

It doesn't always have to be R!

The Perl Data Language is a Perl extension for numerical manipulation that provides the convenience of Perl with the speed of compiled C.

It also contains plotting modules.

Install with cpan install PDL or check these descriptions.

Code example for getting basic stats from a few values:


use PDL;
my @numbers = (1,4,6,8,10);
my $piddle = pdl(@numbers);
my ($mean,$prms,$median,$min,$max,$adev,$rms) = statsover($piddle);
print "Mean=$mean\n".
      "Root-mean-square deviation=$prms\n".

Running CronJobs

February 4th, 2011

cron is a extremely useful unix utility that allows tasks to be automatically run in the background at regular intervals.

You need the script / command you want to run and the time it should run. You can the use the crontab command to edit the service:

  1. crontab -e Edit your crontab file, or create one if it doesn't already exist.
  2. crontab -l Display your crontab file.
  3. crontab -r Remove your crontab file.

Format of entries:

*     *     *   *    *        command to be executed

-     -     -   -    -

|     |     |   |    |

|     |     |   |    +----- day of week (0 - 6) (Sunday=0)

|     |     |   +------- month (1 - 12)

|     |     +--------- day of        month (1 - 31)

|     +----------- hour (0 - 23)

+------------- min (0 - 59)


00 03 * * * bash /users/fsk/

This runs my backup script at 03:00 every day.

*/10 * * * * echo "job done"

This runs an echo every 10 minutes of every hour of every day.

To receive an email with any result from the jobs, add


to the top of the crontab. To discard any output add


>/dev/null 2>&1

to the end of the job line or as the very first line (for all jobs).