Public Ensembl databases

July 5th, 2011

A quick reminder of the specifications to connect to the public Ensembl mySQL databases:

Database Server Port
Ensembl (v 24-47)
ensembldb.ensembl.org††
3306
Ensembl (v 48 and above)
ensembldb.ensembl.org 5306
Ensembl Mart martdb.ensembl.org 5316
Ensembl Genomes mysql.ebi.ac.uk 4157
Ensembl (curr. v) in US cloud useastdb.ensembl.org 5306

user = "anonymous"

pass = ""

mysql commandline for connection:

Code

mysql -uanonymous -hensembldb.ensembl.org -P5306

SQLite

June 3rd, 2011

Using the SQLite database Engine

SQLite is different (to MySQL) in a number of ways, the main one being that it is server-less and file-based. The other distinctive features are nicely listed here with pros and cons.

It's an ideal choice if you want to bundle a database with your application, as SQLite is small, platform independent and without any usage restrictions.

It can be accessed with the Perl DBI modules:

Code

my $dbh =
DBI->connect("dbi:SQLite:dbname=$db_file","","")
or die "Unable to connect: $DBI::errstr\n";

visually with the (free) Firefox plugin SQLite Manager or the (paid) application SQLite Maestro or on the command-line by calling:
sqlite db_file_name.db
Special sqlite commands are preceeded by a ".", e.g. to exit type ".exit".

The sql syntax is not identical but very similar. Converter tools are listed here, here are some stackoverflow notes about the topic.

Some compatibility notes: SQLite supports sub queries.
It does not support deletes on joined tables.

To make the output more readable you can:

Code

.header on
.separator \t

To inspect the structure of a database you can use the following commands.
1. list table names:

Code

.tables  #or
.tables table_na%  # "like" pattern matching

2. show the create statement:

Code

.schema table_name  #or
.schema table_na%   #or
SELECT sql FROM sqlite_master WHERE name = 'table_name';

To export all data from a database into files seperated by table you can use the "export table" function in the SQLite Manager, or use the command line if you have many tables:
1. create a file with all table names in your database. (get the name as mentioned above.)
2. Then call sqlite with each to export the data:

Code

cat tables.txt | awk '{print ".mode csv\n.output "$1".txt\nselect * from "$1";"} | sqlite dbname.db

Alternative export formats are column, html, insert, line, list, tabs, tcl

Import of these text files can be done with

Code

.import file.txt table_name

The separator for export and import need to be the same, otherwise you will get errors like

data.txt line 1: expected 10 columns of data but found 1

If there are linebreaks in the data fields, the parsing of the import will break in a similiar way. Try to set the separator to

\t

and not specify

.mode csv

for the export.

Here are some very useful FAQs.

Command line options in Windows

May 20th, 2011

Missing the lovely Unix command-line tools when working on MS Windows machines, I've been trying a few options to speed up everyday tasks like easy file processing:

  • Cygwin as a Unix emulation. Works fine most of time, but you can feel that it's an alien in the windows environment unless you configure it extensively: the old problem of different line break encodings, the different way to map/list directories.
  • PowerShell. A useful alternative to the windows command window with a window split into command and output screen and an extended command set.
  • UnxUtils. A collection of all those unix tools I missed wrapped up to be usable by the windows command line (grep, ls, head, awk...). Nice!

    Remember to add "UnxUtils\usr\local\wbin" to your PATH.

Using dbVar

May 12th, 2011

"Structural variation (SV) is generally defined as a region of DNA approximately 1 kb and larger in size and can include inversions and balanced translocations or genomic imbalances (insertions and deletions), commonly referred to as copy number variants (CNVs). These CNVs often overlap with segmental duplications (regions of DNA >1 kb present more than once in the genome). If present at >1% in a population a CNV may be referred to as copy number polymorphism (CNP)."

Estimates of how much of the human genome are CNVs range from 10-20%.

dbVar is the NCBI database of genomic structural variation designed to store data on variant DNA ≥ 1 bp in size.

The databases ids are organised in the following manner:

  • std: the study id - this identifies a submitted study
  • sv: the structural variant id - this identifies the submitted region of variation
  • ssv: the supporting structural variant id - this identifies the supporting regions of variation (often sample-specific) that were used to call the submitted region of variation
  • The ids are prefixed with 'n' if the study was submitted to NCBI, or 'e' if it was submitted to EBI

This means that multiple experimental results, ie. regions identified from different samples, stored as "supporting variants", are combined into regions that describe these as one "event" and are stored as "variant".

An example: esv10580 includes the supporting variants essv57440, essv75601, essv61475 and others. The individual (GRCh37/hg19) coordinates, e.g.

Chr1	521,413	564,458
Chr1	521,413	564,458
Chr1	521,648	575,095

result in the maximum coordinates for the variant:

Chr1	521,413	575,095

They all belong to the study estd20 by Conrad et al. (2010).

There is a good overview page explaining structural variations and related methods.

Source: dbVar

Parsing OMIM data

April 18th, 2011

The Online Mendelian Inheritance in Man (OMIM) data is a "catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression. It is a phenotypic companion to the Human Genome Project." (omim.org)

To get human disease annotation for your gene data, the fine data from the OMIM database can be downloaded from their FTP site and parsed with one of multiple OMIM parsers within the BioPerl framwork.

I used Christian Zmasek's OMIMparser.pm to get hashes with the ids and names:

Code

use Bio::Phenotype::OMIM::OMIMparser;
 
 
 
$omim_parser = Bio::Phenotype::OMIM::OMIMparser->new(
 
    -genemap  => $omim_genemap,
 
    -omimtext => $omim_all );
 
while ( my $omim_entry = $omim_parser->next_phenotype() ) {
 
  my $numb  = $omim_entry->MIM_number();
 
  my $title = $omim_entry->title();
 
  #remove the gene symbol from the title line
 
  $title =~ s/^.?(\d+) //;
 
  $title =~ s/;.*$//;
 
  #store omim ids by disease names
 
  $omim_names{$title} = $numb;
 
  #store genes and disease names in hash ref by omim id
 
  $omim_ids{$numb}->{'disease'} = $title;
 
  my @symbols = $omim_entry->each_gene_symbol();
 
  $omim_ids{$numb}->{'genes'} = \@symbols;
 
  push(@all_omim, $numb.":".$title);
 
}

If you fall over an exception like this:

------------- EXCEPTION -------------

MSG: 16.13.3 does not make sense: 'arm' or 'cen' missing

STACK Bio::Map::CytoPosition::cytorange BioPerl-1.6.0/Bio/Map/CytoPosition.pm:165

You need to fix an error in the genemap file from OMIM:

line 9053 should be

16.25|2|2|10|16p13.3|CHTF18....

instead of

16.25|2|2|10|16.13.3|CHTF18...

source

OMIM ids are pre-fixed with defined symbols. The explanation what these characters means can be found on their FAQ site or here.

Please note that OMIM band start locations have a 1 bp offset to the definitions e.g. in ENSEMBL (probably from a 0-based coordinate system). The "16p11.2" band below is listed as chr16 28100001 - 34600000 in Ensembl.

OMIM example