This is the list of genomic regions that were analysed as the 1% of the human genome in the ENCODE pilot phase. (The main phase of ENCODE is looking at the entire human genome.) The coordinates are for assembly NCBI36 (hg18).
See also the entry about ENCODE and the UCSC pages.
A quick reminder of the settings needed to connect to the public Ensembl MySQL databases:
|Ensembl (v 24-47)
|Ensembl (v 48 and above)
|Ensembl (curr. v) in US cloud||useastdb.ensembl.org||5306|
user = "anonymous"
pass = ""
mysql commandline for connection:
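As a minimal sketch, the following Python snippet assembles a mysql client command line from the settings listed above. The helper name "build_mysql_command" is made up for illustration; host and port are taken from the US cloud row of the table, and -h, -P and -u are the mysql client options for host, port and user. No password option is passed because the password is empty.

```python
import shlex


def build_mysql_command(host, port, user="anonymous"):
    """Return a mysql client command line for a passwordless public server."""
    # -h = host, -P = port (capital P), -u = user.
    # No -p option is added because the anonymous account has no password.
    args = ["mysql", "-h", host, "-P", str(port), "-u", user]
    return shlex.join(args)


# Host and port from the "US cloud" row of the table above.
command = build_mysql_command("useastdb.ensembl.org", 5306)
print(command)
```

The printed line can be pasted into a terminal that has the mysql client installed.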
Using the SQLite database Engine
SQLite is an ideal choice if you want to bundle a database with your application, as it is small, platform-independent and free of usage restrictions.
It can be accessed with the Perl DBI modules:
visually with the (free) Firefox plugin SQLite Manager or the (paid) application SQLite Maestro or on the command-line by calling:
Special sqlite commands are preceded by a ".", e.g. to exit type ".exit".
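Besides Perl DBI, SQLite can be reached from most scripting languages. A minimal sketch using Python's built-in sqlite3 module follows; the table "gene" and its two rows are invented purely for illustration, and ":memory:" is used so that nothing is written to disk.

```python
import sqlite3

# ":memory:" keeps the example self-contained; pass a file name instead
# to open (or create) a database file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO gene (name) VALUES (?)", [("BRCA1",), ("TP53",)])
conn.commit()

# Run a query and fetch all result rows as a list of tuples.
rows = conn.execute("SELECT id, name FROM gene ORDER BY id").fetchall()
for row in rows:
    print(row)
conn.close()
```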
Some compatibility notes: SQLite supports subqueries, but it does not support deletes on joined tables.
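The two notes combine into a common workaround: a delete that would be written with a join elsewhere can be expressed with a subquery in the WHERE clause. A small sketch in Python, with invented tables "sample" and "result":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE result (id INTEGER PRIMARY KEY, sample_id INTEGER, score REAL);
    INSERT INTO sample (id, name) VALUES (1, 'keep'), (2, 'drop');
    INSERT INTO result (sample_id, score) VALUES (1, 0.9), (2, 0.1), (2, 0.2);
""")

# A MySQL-style "DELETE result FROM result JOIN sample ..." is not accepted,
# so the join condition is moved into a subquery instead.
conn.execute(
    "DELETE FROM result WHERE sample_id IN "
    "(SELECT id FROM sample WHERE name = ?)",
    ("drop",),
)
conn.commit()

remaining = conn.execute("SELECT sample_id, score FROM result").fetchall()
print(remaining)
```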
To make the output more readable you can:
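In the sqlite3 shell this is usually done with the dot-commands ".headers on" and ".mode column". When querying from a script, a similar aligned layout can be produced by hand. The following Python sketch shows one way; "format_rows" is a made-up helper and the table contents are illustrative only.

```python
import sqlite3


def format_rows(cursor):
    """Render a query result as left-aligned columns with a header line."""
    headers = [col[0] for col in cursor.description]
    rows = [[str(value) for value in row] for row in cursor.fetchall()]
    # zip(headers, *rows) regroups the cells column by column.
    widths = [max(len(cell) for cell in column) for column in zip(headers, *rows)]
    lines = []
    for line in [headers] + rows:
        padded = "  ".join(cell.ljust(width) for cell, width in zip(line, widths))
        lines.append(padded.rstrip())
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO gene (name) VALUES (?)", [("BRCA1",), ("TP53",)])
table_text = format_rows(conn.execute("SELECT id, name FROM gene ORDER BY id"))
print(table_text)
```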
To inspect the structure of a database you can use the following commands.
1. list table names:
2. show the create statement:
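In the sqlite3 shell the corresponding dot-commands are ".tables" and ".schema". The same information is stored in SQLite's built-in catalogue table sqlite_master, so it can also be read with an ordinary query. A sketch in Python, using two invented tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE transcript (id INTEGER PRIMARY KEY, gene_id INTEGER)")

# sqlite_master holds one row per table, index, view or trigger;
# the "sql" column contains the original CREATE statement.
schema = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()

table_names = [name for name, _ in schema]
print(table_names)
for name, create_sql in schema:
    print(create_sql)
```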
To export all data from a database into files separated by table you can use the "export table" function in the SQLite Manager, or use the command line if you have many tables:
1. Create a file with all table names in your database. (Get the names as described above.)
2. Then call sqlite with each table name to export the data:
Alternative export formats are column, html, insert, line, list, tabs and tcl.
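The per-table export can also be scripted without the shell. The sketch below is an assumption-laden alternative in Python rather than the sqlite command line: "export_tables" is a hypothetical helper, the "<table>.txt" file naming and the tab separator are arbitrary choices, and the table contents are invented.

```python
import csv
import os
import sqlite3
import tempfile


def export_tables(conn, out_dir, delimiter="\t"):
    """Write every table to <out_dir>/<table>.txt and return the file paths."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    paths = []
    for name in names:
        # Names come from the catalogue; quote them as SQL identifiers.
        quoted = '"{}"'.format(name.replace('"', '""'))
        cursor = conn.execute("SELECT * FROM {}".format(quoted))
        path = os.path.join(out_dir, name + ".txt")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter=delimiter)
            writer.writerow([col[0] for col in cursor.description])  # header
            writer.writerows(cursor)                                 # data rows
        paths.append(path)
    return paths


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO gene (name) VALUES ('BRCA1')")
out_dir = tempfile.mkdtemp()
exported = export_tables(conn, out_dir)
print(exported)
```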
Import of these text files can be done with
The separator for export and import needs to be the same; otherwise you will get errors like
data.txt line 1: expected 10 columns of data but found 1
If there are line breaks in the data fields, the parsing of the import will break in a similar way. Try to set the separator to
and not specify
for the export.
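Both pitfalls can be reproduced in a few lines. The Python sketch below uses the standard csv module as a stand-in for the sqlite shell's export and import; the helper names, the "note" table and the "|" separator are assumptions for illustration, and the error text is only modelled on the message quoted above. It round-trips a field containing a line break, then shows the column-count failure when the import separator differs from the export separator.

```python
import csv
import io
import sqlite3


def export_text(conn, table, delimiter):
    """Dump all rows of a table as delimited text (fields quoted when needed)."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, delimiter=delimiter).writerows(
        conn.execute("SELECT * FROM {}".format(table)))
    return buffer.getvalue()


def import_text(conn, table, text, delimiter):
    """Insert delimited text into an existing table; return the row count."""
    expected = len(conn.execute("SELECT * FROM {} LIMIT 0".format(table)).description)
    sql = "INSERT INTO {} VALUES ({})".format(table, ", ".join("?" * expected))
    count = 0
    for row in csv.reader(io.StringIO(text, newline=""), delimiter=delimiter):
        if len(row) != expected:
            raise ValueError("expected {} columns of data but found {}".format(
                expected, len(row)))
        conn.execute(sql, row)
        count += 1
    return count


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE note (id INTEGER, body TEXT)")
source.execute("INSERT INTO note VALUES (?, ?)", (1, "line one\nline two"))
text = export_text(source, "note", "|")

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE note (id INTEGER, body TEXT)")
imported = import_text(target, "note", text, "|")   # same separator: works
body = target.execute("SELECT body FROM note").fetchone()[0]

try:
    import_text(target, "note", text, ",")          # different separator: fails
    mismatch_message = None
except ValueError as error:
    mismatch_message = str(error)
print(imported, repr(body), mismatch_message)
```

The embedded line break survives here because the writer quotes any field that contains one and the reader honours those quotes.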
Here are some very useful FAQs.
Missing the lovely Unix command-line tools when working on MS Windows machines, I've been trying a few options to speed up everyday tasks like easy file processing:
- Cygwin as a Unix emulation. Works fine most of the time, but you can feel that it's an alien in the Windows environment unless you configure it extensively: the old problem of different line break encodings, the different way to map/list directories.
- PowerShell. A useful alternative to the Windows command window with a window split into command and output screen and an extended command set.
- UnxUtils. A collection of all those Unix tools I missed, wrapped up to be usable from the Windows command line (grep, ls, head, awk...). Nice!
Remember to add "UnxUtils\usr\local\wbin" to your PATH.
"Structural variation (SV) is generally defined as a region of DNA approximately 1 kb and larger in size and can include inversions and balanced translocations or genomic imbalances (insertions and deletions), commonly referred to as copy number variants (CNVs). These CNVs often overlap with segmental duplications (regions of DNA >1 kb present more than once in the genome). If present at >1% in a population a CNV may be referred to as copy number polymorphism (CNP)."
Estimates of how much of the human genome consists of CNVs range from 10% to 20%.
dbVar is the NCBI database of genomic structural variation designed to store data on variant DNA ≥ 1 bp in size.
The database IDs are organised in the following manner:
- std: the study id - this identifies a submitted study
- sv: the structural variant id - this identifies the submitted region of variation
- ssv: the supporting structural variant id - this identifies the supporting regions of variation (often sample-specific) that were used to call the submitted region of variation
- The ids are prefixed with 'n' if the study was submitted to NCBI, or 'e' if it was submitted to EBI
This means that multiple experimental results, i.e. regions identified from different samples, stored as "supporting variants", are combined into regions that describe these as one "event" and are stored as "variant".
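The ID scheme listed above is regular enough to be taken apart with a single pattern. A minimal sketch in Python, assuming IDs consist of exactly a one-letter source prefix, one of the three type codes and a number; "parse_dbvar_id" is a made-up helper name, not part of any dbVar tooling.

```python
import re

# Source prefix (n or e), record type (std, ssv or sv), numeric part.
ID_PATTERN = re.compile(r"^(?P<source>[ne])(?P<kind>std|ssv|sv)(?P<number>\d+)$")
SOURCES = {"n": "NCBI", "e": "EBI"}
KINDS = {
    "std": "study",
    "sv": "structural variant",
    "ssv": "supporting structural variant",
}


def parse_dbvar_id(identifier):
    """Split an ID such as 'esv10580' into (source, kind, number)."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError("not a recognised ID: {!r}".format(identifier))
    return SOURCES[match["source"]], KINDS[match["kind"]], int(match["number"])


for identifier in ("estd20", "esv10580", "essv57440"):
    print(identifier, parse_dbvar_id(identifier))
```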
An example: esv10580 includes the supporting variants essv57440, essv75601, essv61475 and others. The individual (GRCh37/hg19) coordinates, e.g.
Chr1 521,413 564,458
Chr1 521,413 564,458
Chr1 521,648 575,095
result in the maximum coordinates for the variant:
Chr1 521,413 575,095
They all belong to the study estd20 by Conrad et al. (2010).
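The arithmetic behind the example is simply the smallest start and the largest end of the supporting regions. A short Python sketch using only the three coordinate triples quoted above (the real variant has further supporting variants that are not listed here); "merge_extent" is a hypothetical helper and is not meant to describe how dbVar itself merges calls.

```python
def merge_extent(regions):
    """Return (chromosome, min start, max end) for regions on one chromosome."""
    chromosomes = {chrom for chrom, _, _ in regions}
    if len(chromosomes) != 1:
        raise ValueError("regions must all lie on the same chromosome")
    start = min(region_start for _, region_start, _ in regions)
    end = max(region_end for _, _, region_end in regions)
    return chromosomes.pop(), start, end


# The three supporting regions quoted in the text (GRCh37/hg19).
supporting = [
    ("Chr1", 521413, 564458),
    ("Chr1", 521413, 564458),
    ("Chr1", 521648, 575095),
]
variant = merge_extent(supporting)
print(variant)
```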
There is a good overview page explaining structural variations and related methods.