ENCODE pilot regions

July 28th, 2011

This is the list of genomic regions that were analysed as the 1% of the human genome in the ENCODE pilot phase. (The main phase of ENCODE is looking at the entire human genome.) The coordinates are for assembly NCBI36 (hg18).
See also the entry about ENCODE and the UCSC pages.

Name Chr. Start End Description
ENr231 1 149424685 149924684 Random Picks
ENr131 2 234156564 234656627 Random Picks
ENr331 2 219985590 220485589 Random Picks
ENr112 2 51512209 52012208 Random Picks
ENr121 2 118011044 118511043 Random Picks
ENr113 4 118466104 118966103 Random Picks
ENr212 5 141880151 142380150 Random Picks
ENm002 5 131284314 132284313 Manual Picks:Interleukin
ENr221 5 55871007 56371006 Random Picks
ENr222 6 132218540 132718539 Random Picks
ENr223 6 73789953 74289952 Random Picks
ENr323 6 108371397 108871396 Random Picks
ENr334 6 41405895 41905894 Random Picks
ENm013 7 89621625 90736048 Manual Picks
ENm001 7 115597757 117475182 Manual Picks:CFTR
ENm010 7 26924046 27424045 Manual Picks:HOXA
ENm012 7 113720369 114720368 Manual Picks:FOXP2
ENm014 7 125865892 127029088 Manual Picks
ENr321 8 118882221 119382220 Random Picks
ENr232 9 130725123 131225122 Random Picks
ENr114 10 55153819 55653818 Random Picks
ENr312 11 130604798 131104797 Random Picks
ENr332 11 63940889 64440888 Random Picks
ENm009 11 4730996 5732587 Manual Picks:Beta
ENm011 11 1699992 2306039 Manual Picks:IGF2/H19
ENm003 11 115962316 116462315 Manual Picks:Apo
ENr123 12 38626477 39126476 Random Picks
ENr111 13 29418016 29918015 Random Picks
ENr132 13 112338065 112838064 Random Picks
ENr311 14 52947076 53447075 Random Picks
ENr322 14 98458224 98958223 Random Picks
ENr233 15 41520089 42020088 Random Picks
ENm008 16 1 500000 Manual Picks:Alpha
ENr313 16 60833950 61333949 Random Picks
ENr211 16 25780428 26280428 Random Picks
ENr213 18 23719232 24219231 Random Picks
ENr122 18 59412301 59912300 Random Picks
ENm007 19 59023585 60024460 Manual Picks:Chr19
ENr333 20 33304929 33804928 Random Picks
ENr133 21 39244467 39744466 Random Picks
ENm005 21 32668237 34364221 Manual Picks:Chr21
ENm004 22 30133954 31833953 Manual Picks:Chr22
ENr324 X 122609996 123109995 Random Picks
ENm006 X 152767492 154063081 Manual Picks:ChrX
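
As a quick sanity check on the table, the following Python sketch computes region lengths and tests whether a position falls inside a region. It assumes the coordinates are 1-based and inclusive (which fits ENm008 running from 1 to 500000); the three-row subset and the function names are illustrative only.

```python
# A few rows copied from the table above: name, chromosome, start, end (NCBI36/hg18).
ROWS = """\
ENr231 1 149424685 149924684
ENm001 7 115597757 117475182
ENm008 16 1 500000
"""

def parse_regions(text):
    """Return a dict name -> (chrom, start, end) from whitespace-separated rows."""
    regions = {}
    for line in text.splitlines():
        name, chrom, start, end = line.split()[:4]
        regions[name] = (chrom, int(start), int(end))
    return regions

def region_length(region):
    """Length in bp, assuming 1-based inclusive coordinates."""
    _chrom, start, end = region
    return end - start + 1

def contains(region, chrom, pos):
    """True if the position (chrom, pos) lies inside the region."""
    r_chrom, start, end = region
    return chrom == r_chrom and start <= pos <= end

regions = parse_regions(ROWS)
for name, region in regions.items():
    print(name, region_length(region))
```

The random picks come out at 500 kb each, while the manual picks vary in size.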

Public Ensembl databases

July 5th, 2011

A quick reminder of the settings needed to connect to the public Ensembl MySQL databases:

Database Server Port
Ensembl (v 24-47) ensembldb.ensembl.org 3306
Ensembl (v 48 and above) ensembldb.ensembl.org 5306
Ensembl Mart martdb.ensembl.org 5316
Ensembl Genomes mysql.ebi.ac.uk 4157
Ensembl (curr. v) in US cloud useastdb.ensembl.org 5306

user = "anonymous"

pass = ""

MySQL command line for connecting:

Code

mysql -uanonymous -hensembldb.ensembl.org -P5306
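
If you switch between old and new releases, the port choice can be wrapped in a small helper. This is only a sketch: the function names are made up for illustration, the release-to-port mapping is the one listed for ensembldb.ensembl.org above, and nothing is actually executed.

```python
def ensembl_port(release):
    """Port on ensembldb.ensembl.org for a given Ensembl release number."""
    if 24 <= release <= 47:
        return 3306
    if release >= 48:
        return 5306
    raise ValueError("no public port listed for release %d" % release)

def mysql_command(release, host="ensembldb.ensembl.org", user="anonymous"):
    """Build the argument list for the mysql client (it is not run here)."""
    return ["mysql", "-u" + user, "-h" + host, "-P%d" % ensembl_port(release)]

print(" ".join(mysql_command(62)))
```

The returned list could be handed to subprocess.run if the mysql client is installed.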

SQLite

June 3rd, 2011

Using the SQLite database Engine

SQLite differs from MySQL in a number of ways, the main one being that it is serverless and file-based. The other distinctive features are nicely listed here with pros and cons.

It's an ideal choice if you want to bundle a database with your application, as SQLite is small, platform independent and without any usage restrictions.

It can be accessed with the Perl DBI modules:

Code

use DBI;

# $db_file holds the path to the SQLite database file
my $dbh =
    DBI->connect("dbi:SQLite:dbname=$db_file","","")
    or die "Unable to connect: $DBI::errstr\n";

visually with the (free) Firefox plugin SQLite Manager or the (paid) application SQLite Maestro, or on the command line by calling:
sqlite db_file_name.db
(For SQLite version 3 the binary is usually called sqlite3.) Special sqlite commands are preceded by a ".", e.g. to exit type ".exit".

The SQL syntax is not identical but very similar. Converter tools are listed here, and here are some Stack Overflow notes about the topic.

Some compatibility notes: SQLite supports sub queries.
It does not support deletes on joined tables.
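
A common workaround for the missing join-delete is to move the join condition into a subquery. Below is a minimal Python sketch using the standard sqlite3 module and an in-memory database; the gene/transcript tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transcript (id INTEGER PRIMARY KEY, gene_id INTEGER);
    INSERT INTO gene VALUES (1, 'keep_me');
    INSERT INTO gene VALUES (2, 'drop_me');
    INSERT INTO transcript VALUES (10, 1);
    INSERT INTO transcript VALUES (20, 2);
    INSERT INTO transcript VALUES (21, 2);
""")

# The MySQL-style "DELETE t FROM transcript t JOIN gene g ..." is not accepted,
# so the rows to delete are selected through a subquery instead.
conn.execute("""
    DELETE FROM transcript
    WHERE gene_id IN (SELECT id FROM gene WHERE name = 'drop_me')
""")
conn.commit()

remaining = [row[0] for row in conn.execute("SELECT id FROM transcript ORDER BY id")]
print(remaining)
```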

To make the output more readable you can:

Code

.header on
.separator \t

To inspect the structure of a database you can use the following commands.
1. list table names, optionally filtered with a "like" pattern:

Code

.tables
.tables table_na%

2. show the create statement, either with .schema (a "like" pattern is allowed) or by querying sqlite_master:

Code

.schema table_name
.schema table_na%
SELECT sql FROM sqlite_master WHERE name = 'table_name';
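
The same inspection can be done from a script by querying sqlite_master directly. The following Python sketch uses the standard sqlite3 module; the helper names and the two example tables are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER PRIMARY KEY, label TEXT)")
conn.execute("CREATE TABLE other (x REAL)")

def list_tables(connection, pattern="%"):
    """Table names matching a LIKE pattern (the counterpart of .tables)."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE ? ORDER BY name",
        (pattern,),
    )
    return [name for (name,) in rows]

def create_statement(connection, table):
    """The stored CREATE statement of one table (the counterpart of .schema)."""
    row = connection.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0] if row else None

print(list_tables(conn))
print(create_statement(conn, "table_name"))
```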

To export all data from a database into files separated by table you can use the "export table" function in the SQLite Manager, or use the command line if you have many tables:
1. create a file with all table names in your database (get the names as mentioned above).
2. call sqlite with a set of export commands generated for each table:

Code

cat tables.txt | awk '{print ".mode csv\n.output "$1".txt\nselect * from "$1";"}' | sqlite dbname.db

Alternative export formats are column, html, insert, line, list, tabs and tcl.
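
If you would rather script the export than pipe commands into the shell, the per-table dump can be sketched in Python with the standard sqlite3 and csv modules. The function name, the example table and the temporary output directory are assumptions for illustration; no header row is written here.

```python
import csv
import os
import sqlite3
import tempfile

def export_tables(connection, out_dir):
    """Write every table to out_dir/<table>.txt as CSV and return the file paths."""
    tables = [name for (name,) in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    paths = []
    for table in tables:
        path = os.path.join(out_dir, table + ".txt")
        # Table names cannot be bound as parameters, so the name is quoted by hand.
        cursor = connection.execute('SELECT * FROM "%s"' % table.replace('"', '""'))
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(cursor)
        paths.append(path)
    return paths

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO sample VALUES (?, ?)", [(1, "a"), (2, "b, with comma")])

out_dir = tempfile.mkdtemp()
paths = export_tables(conn, out_dir)
print(paths)
```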

Import of these text files can be done with

Code

.import file.txt table_name

The separator for export and import needs to be the same, otherwise you will get errors like:

data.txt line 1: expected 10 columns of data but found 1

If there are line breaks in the data fields, the parsing of the import will break in a similar way. Try setting the separator to \t and not specifying .mode csv for the export.
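
Another way around the line-break problem is to let a CSV library handle the quoting on both sides. The sketch below is a Python round trip through a tab-separated buffer with the standard csv and sqlite3 modules; the note table and its contents are made up, and this is an alternative to the sqlite shell's .import rather than a description of how .import itself behaves.

```python
import csv
import io
import sqlite3

rows = [(1, "single line"), (2, "two\nlines")]

# Export: tab-separated; fields containing a tab, quote or line break get quoted.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerows(rows)

# Import: the reader has to use the same delimiter as the writer.
buffer.seek(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE note (id INTEGER, body TEXT)")
conn.executemany("INSERT INTO note VALUES (?, ?)", csv.reader(buffer, delimiter="\t"))

imported = conn.execute("SELECT id, body FROM note ORDER BY id").fetchall()
print(imported)
```

The embedded line break survives because the field is quoted on export and the reader joins the quoted lines again on import.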

Here are some very useful FAQs.

Command line options in Windows

May 20th, 2011

Missing the lovely Unix command-line tools when working on MS Windows machines, I've been trying a few options to speed up everyday tasks like easy file processing:

  • Cygwin as a Unix emulation. Works fine most of the time, but you can feel that it's an alien in the Windows environment unless you configure it extensively: the old problem of different line break encodings, the different way to map/list directories.
  • PowerShell. A useful alternative to the Windows command window, with a window split into command and output screen and an extended command set.
  • UnxUtils. A collection of all those Unix tools I missed, wrapped up to be usable from the Windows command line (grep, ls, head, awk...). Nice!

    Remember to add "UnxUtils\usr\local\wbin" to your PATH.

Using dbVar

May 12th, 2011

"Structural variation (SV) is generally defined as a region of DNA approximately 1 kb and larger in size and can include inversions and balanced translocations or genomic imbalances (insertions and deletions), commonly referred to as copy number variants (CNVs). These CNVs often overlap with segmental duplications (regions of DNA >1 kb present more than once in the genome). If present at >1% in a population a CNV may be referred to as copy number polymorphism (CNP)."

Estimates of how much of the human genome consists of CNVs range from 10% to 20%.

dbVar is the NCBI database of genomic structural variation designed to store data on variant DNA ≥ 1 bp in size.

The database IDs are organised in the following manner:

  • std: the study id - this identifies a submitted study
  • sv: the structural variant id - this identifies the submitted region of variation
  • ssv: the supporting structural variant id - this identifies the supporting regions of variation (often sample-specific) that were used to call the submitted region of variation
  • The ids are prefixed with 'n' if the study was submitted to NCBI, or 'e' if it was submitted to EBI
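
The naming scheme in the list above is regular enough to parse with a small pattern. This Python sketch is based only on the prefixes described here; the function name and the returned labels are illustrative and not part of any dbVar API.

```python
import re

# One-letter submitter prefix (n = NCBI, e = EBI), then the id type, then the number.
ID_PATTERN = re.compile(r"(?P<source>[ne])(?P<kind>std|ssv|sv)(?P<number>\d+)")
SOURCES = {"n": "NCBI", "e": "EBI"}
KINDS = {
    "std": "study",
    "sv": "structural variant",
    "ssv": "supporting structural variant",
}

def parse_dbvar_id(identifier):
    """Split an id such as 'esv10580' into (submitted to, id type, number)."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError("not a recognised dbVar id: %r" % identifier)
    return (
        SOURCES[match.group("source")],
        KINDS[match.group("kind")],
        int(match.group("number")),
    )

for example in ("estd20", "esv10580", "essv57440"):
    print(example, parse_dbvar_id(example))
```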

This means that multiple experimental results, i.e. regions identified in different samples and stored as "supporting variants", are combined into a single region that describes them as one "event" and is stored as a "variant".

An example: esv10580 includes the supporting variants essv57440, essv75601, essv61475 and others. The individual (GRCh37/hg19) coordinates, e.g.

Chr1	521,413	564,458
Chr1	521,413	564,458
Chr1	521,648	575,095

result in the maximum coordinates for the variant:

Chr1	521,413	575,095

They all belong to the study estd20 by Conrad et al. (2010).
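
The way the variant coordinates follow from the supporting variants in this example can be reproduced in a few lines of Python. The sketch assumes that the variant region is simply the smallest start and the largest end of its supporting variants, as the numbers above suggest; how a submitter actually merged the calls may differ.

```python
# The three supporting variant coordinates listed above (GRCh37/hg19).
supporting = [
    ("Chr1", 521413, 564458),
    ("Chr1", 521413, 564458),
    ("Chr1", 521648, 575095),
]

def outer_bounds(intervals):
    """Return (chrom, smallest start, largest end) for intervals on one chromosome."""
    chroms = {chrom for chrom, _start, _end in intervals}
    if len(chroms) != 1:
        raise ValueError("expected intervals on exactly one chromosome")
    start = min(start for _chrom, start, _end in intervals)
    end = max(end for _chrom, _start, end in intervals)
    return chroms.pop(), start, end

print(outer_bounds(supporting))
```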

There is a good overview page explaining structural variations and related methods.

Source: dbVar