AnnoTrack: Data maintenance

December 8th, 2009

This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.

Regular updates

The following Perl scripts update the data and reset priorities and flags. They usually update Havana annotation data, but all other sources can be checked as well by activating the corresponding entry in the config file. They run as a cron job every night, but can also be run manually if needed. The cron job is executed from svn/gencode/tracking_system/perl/scripts/cron_jobs.pl.

The general procedure (which can also be used to push new data into the system) is:

  1. update.pl - updating all sources specified in the config
  2. set_priorities.pl - update active flags to transcripts and set appropriate priorities. This also adds new flag types to the tmp_values table to keep count.
  3. set_relations.pl - create links between flagged transcripts in the same genomic region
  4. cron_queries.sh - SQL queries that update the counts that are shown on the front page

Common parameters are:

  1. env defines the target database as

    • prod or human: main human production database
    • dev: test database with human data on the mcs4a db server
    • zfish: zebrafish production db
    • mouse: mouse production db
  2. write: connect with write access and store changes in chosen db
  3. verbose: write (a lot of) output for testing

Running data update scripts:

Code

perl svn/gencode/tracking_system/perl/scripts/update.pl -env prod -core -write

Run this as a farm job; the sources to be updated are defined in config.pm. Then set the active flags and priorities based on the flags:

Code

perl svn/gencode/tracking_system/perl/scripts/set_priorities.pl -env prod -core -write
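
The four steps of the general procedure can be chained by a small driver. The following is only a dry-run sketch in Python, not part of AnnoTrack: it assumes that set_relations.pl and cron_queries.sh live in the same scripts directory as update.pl, that every Perl step accepts the common -env, -write and -verbose parameters, and that cron_queries.sh takes no arguments. The -core switch from the two commands above is left out because its meaning is not described here.

```python
import shlex

# Directory taken from the commands above; the location of the other
# two scripts in the same directory is an assumption.
SCRIPT_DIR = "svn/gencode/tracking_system/perl/scripts"


def build_update_commands(env, write=False, verbose=False):
    """Return the update steps as argument lists, in the documented order."""
    common = ["-env", env]
    if write:
        common.append("-write")
    if verbose:
        common.append("-verbose")

    perl_steps = ["update.pl", "set_priorities.pl", "set_relations.pl"]
    commands = [["perl", f"{SCRIPT_DIR}/{step}"] + common for step in perl_steps]
    # Assumption: the SQL wrapper is a plain shell script without options.
    commands.append(["sh", f"{SCRIPT_DIR}/cron_queries.sh"])
    return commands


if __name__ == "__main__":
    # Dry run against the dev database: only print the commands.
    for cmd in build_update_commands("dev", verbose=True):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them against the dev environment before anything is run with -write.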

Specific updates

  1. Our Ensembl friends regularly compare CCDS, exon, intron and cDNA features between Ensembl and Havana annotations. This generates text files with locations and IDs that need to be reloaded into AnnoTrack. There are specific source modules for these files, so adjusting the config.pm file (for the affected source definition: pointing the "file" hash entry to the new file and setting the "active" flag to "1") and running the update.pl script should be sufficient.
  2. After every Havana/Ensembl merge a new OTT-/ENS-ID mapping should be generated and loaded into the AnnoTrack tracking system. This can be done with the script svn/gencode/scripts/store_id_conversion.pl, which reads the GTF file or a list of IDs and creates the SQL statements.

    Code

    perl svn/gencode/scripts/store_id_conversion.pl -gtf -infile current_freeze.gtf -out new_id_conversions.sql
     
    mysql -h -P -u -p -D gencode_tracking < new_id_conversions.sql

Adding new data

  1. Importing Ensembl objects

    If an important gene model is missing from Havana but was annotated by Ensembl, it can be imported into AnnoTrack with the script svn/gencode/tracking_system/perl/scripts/import_from_ensembl.pl and the following options:

    Code

    perl import_from_ensembl.pl -user Felix
                                -category Ensembl
                                -gene ENSG00012048
                             (or)
                                -transcript ENST00309486
                                -flag manual_selection
                                -note "important gene"

    Setting a flag (with the chosen flag-name) and adding a note (that will be displayed next to the flag) are optional.

  2. Importing via DAS

    A number of GENCODE sources were imported from external DAS servers. For updates or new sources these source adaptors should be checked at svn/gencode/tracking_system/perl/modules/gencode_tracking_system/sources/

  3. Importing from a file

    There are source adaptors for reading tab-delimited files (tab_file.pm) and GTF files (which can also be used for GFF3). Have a look at the source code of the parser in case it needs slight modifications to read your file format.

  4. Importing via other sources

    If there are new types of data sources that do not fit the categories above, a new source adaptor has to be created. The best way to do this is to copy and modify an existing one from svn/gencode/tracking_system/perl/modules/gencode_tracking_system/sources/.

  5. Creating new entries through the web interface is possible but not recommended. A gene can be added on this admin page (Trackers: only Features is required, Modules: only Issue tracking is required); transcripts can then be added using the URL format "annotrack.sanger.ac.uk/human/projects/NEW-GENE-ID/issues/new".

For all imports with the update.pl script an entry describing the new data source needs to be created in the svn/gencode/tracking_system/perl/modules/gencode_tracking_system/config.pm config file. A hash "%OTHER_SERVERS" contains an entry for every source name with the parameters required:

  • active - set to "1" to include the source in the update procedure, all others should be set to "0"
  • dns/type/proxy - the server definitions for DAS sources
  • user_name - the login name from the users table
  • category - a name for the new data, usually the same as the source name itself
  • detached -
  • by_chrom - does the update need to be performed chromosome-by-chromosome? (for slow DAS servers)
  • description - a short description of the data source
  • update_function - name of the module used for the update, e.g. "gtf" or "missing_ccds"
  • data_type - name of the feature type, e.g. "UCSC_novel_genes"
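
Before running update.pl it can help to sanity-check a source entry against the parameter list above. The real config.pm is a Perl module; the following Python sketch only mirrors one entry as a dictionary to illustrate the checks. The source name "example_gtf", the file name and the helper functions are hypothetical, and the set of allowed keys is just the list documented here plus the "file" entry mentioned under "Specific updates".

```python
# Keys documented in this section (plus "file" from "Specific updates").
DOCUMENTED_KEYS = {
    "active", "dns", "type", "proxy", "user_name", "category", "detached",
    "by_chrom", "description", "update_function", "data_type", "file",
}


def check_source_entry(entry):
    """Return a list of problems found in one source definition."""
    problems = []
    unknown = sorted(set(entry) - DOCUMENTED_KEYS)
    if unknown:
        problems.append("unknown keys: " + ", ".join(unknown))
    if str(entry.get("active")) not in ("0", "1"):
        problems.append('"active" should be "0" or "1"')
    return problems


def active_sources(servers):
    """Names of the sources that the update procedure would include."""
    return [name for name, entry in servers.items()
            if str(entry.get("active")) == "1"]


# Hypothetical mirror of two %OTHER_SERVERS entries.
OTHER_SERVERS = {
    "example_gtf": {
        "active": "1",
        "user_name": "Felix",
        "category": "example_gtf",
        "description": "illustrative GTF source",
        "update_function": "gtf",
        "data_type": "UCSC_novel_genes",
        "file": "current_freeze.gtf",
    },
    "example_off": {"active": "0", "update_funtion": "gtf"},  # typo on purpose
}

if __name__ == "__main__":
    print(active_sources(OTHER_SERVERS))
    for name, entry in OTHER_SERVERS.items():
        print(name, check_source_entry(entry))
```

A misspelled key such as "update_funtion" is reported as unknown instead of being silently ignored.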

Working with flags

Flags are the most important feature of the system: they define which problems we are focusing on.

New flags can be set:

  • Through the web interface (see image 1) by any logged-in user by clicking on "add flag" on a transcript page
  • Through the web interface using a list of IDs (e.g. "OTTHUMT00000334332") with this form.
  • Through the Perl script svn/gencode/tracking_system/perl/scripts/set_flags_from_file.pl. Other scripts (e.g. import_from_ensembl.pl) also have an option to set new flags on the features they are working with.

If the same type of flag was already set and not resolved yet, the scripts should NOT set another flag.
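
The rule above boils down to a single check before a flag is written. The following Python sketch shows the logic only; it is not the code used by the AnnoTrack scripts, and the function name, the (flag type, resolved) pair representation and the flag name "missing_ccds" are assumptions made for illustration.

```python
def should_set_flag(existing_flags, flag_type):
    """Decide whether a new flag of flag_type may be set on a transcript.

    existing_flags is an iterable of (flag_type, resolved) pairs for that
    transcript. A new flag is allowed only if no unresolved flag of the
    same type exists.
    """
    return not any(
        ftype == flag_type and not resolved
        for ftype, resolved in existing_flags
    )


if __name__ == "__main__":
    flags = [("manual_selection", True), ("missing_ccds", False)]
    print(should_set_flag(flags, "missing_ccds"))      # unresolved duplicate
    print(should_set_flag(flags, "manual_selection"))  # earlier flag resolved
```

A flag of the same type that has already been resolved does not block a new one, which matches the "not resolved yet" condition.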

To resolve flags

  • On the web interface the flags for every transcript can be resolved individually by clicking on the check/deny images next to them
  • or multiple flags at once by activating the checkbox and clicking on the check/deny images below the list of flags
  • programmatically, multiple flags can be resolved with the script svn/gencode/tracking_system/perl/scripts/resolve_flags.pl and a text file of solutions. Please check the perldocs.

image 1: resolving flags through the web interface

New types of flags can be created here. This creates an entry in the flags table (with issue_id=-1) and in the tmp_values table, where stats are stored. Also check the list of all flag types and their priorities.

The description of flag types can be updated here.

In general it's a good idea to run new updates / imports against the development environment / test database first (by setting $ENV = "dev" in the config file or using the -env dev parameter for scripts). Changes can then be checked in the database or on a test server first (at the Sanger at http://web-annotrack.internal.sanger.ac.uk:8000/human/).

Conditional Formatting in MS Excel

November 12th, 2009

To change the format of a cell based on the content of that cell or another one, conditional formatting can be used.

  1. For simple things and up to three options the dialog "Format"-"Conditional Formatting" can be called after selecting the target cell. You can select
    • "Value" to use the content of the cell
    • "Formula" to insert any Excel formula, e.g. =FIND("needle", A3)
    and then choose the desired style (font, background etc.).
  2. For other functions you can write a macro in VBA (Visual Basic for Applications). Choose "Tools"-"Macro"-"Visual Basic Editor". In the editor, right-click on the "VBAProject" in the project box and add a module. Code away; an example that changes the background color based on the occurrence of certain strings is given below. It can be run directly from the editor or from the worksheet ("Tools"-"Macro") menu.

Code

Sub Color_groups()

    Set MyPlage = Range("A2:A1000")

    For Each Cell In MyPlage
        If InStr(1, Cell.Value, "Vic_") Then
            Cell.Interior.ColorIndex = 3
        ElseIf InStr(1, Cell.Value, "Tyl_") Then
            Cell.Interior.ColorIndex = 4
        ElseIf InStr(1, Cell.Value, "Wol_") Then
            Cell.Interior.ColorIndex = 6
        ElseIf InStr(1, Cell.Value, "Sim_") Then
            Cell.Interior.ColorIndex = 7
        ElseIf InStr(1, Cell.Value, "Sea_") Then
            Cell.Interior.ColorIndex = 8
        ElseIf InStr(1, Cell.Value, "Mar_") Then
            Cell.Interior.ColorIndex = 15
        ElseIf InStr(1, Cell.Value, "Lio_") Then
            Cell.Interior.ColorIndex = 17
        End If
    Next

End Sub

Caching in ENSEMBL

November 11th, 2009

How to avoid falling in the cache...

Caching is a powerful way to speed up queries to the Ensembl database. It becomes a problem, however, if you repeat a query several times but have updated the data set in between, because you may get stale results. It is therefore important to know how to turn caching off when needed; this is not officially documented, though.

To turn caching off on the MySQL server:

Code

my $sa  = $reg->get_adaptor($species, "core", "slice");
my $sth = $sa->dbc->db_handle->prepare("SET SESSION query_cache_type = OFF");
$sth->execute || die "set session failed\n";

To reset the caches in the Perl API:

Code

sub free_caches {
  my $species = shift;
  my $group   = shift;

  foreach my $adap (@{$registry->get_all_adaptors(-species => $species, -group => $group)}) {
    $adap->{'_slice_feature_cache'} = undef;

    if (defined($adap->{'cache'})) {
      $adap->{'cache'} = undef;
    }

    if (defined($adap->{'seq_region_cache'})) {
      my $seq_region_cache = $adap->{'seq_region_cache'} =
        Bio::EnsEMBL::Utils::SeqRegionCache->new();

      $adap->{'sr_name_cache'} = $seq_region_cache->{'name_cache'};
      $adap->{'sr_id_cache'}   = $seq_region_cache->{'id_cache'};
    }
  }
}

Source: Ian Longden, EBI

Proserver Setup

November 2nd, 2009

Installing and Running Proserver to serve data via DAS

The Distributed Annotation System (DAS) is an elegant way of sharing data and using data from diverse sources. More information at http://www.biodas.org and on these blog pages. The Proserver is a lightweight software system to provide your data as a DAS source.

  1. Download from http://proserver.svn.sf.net/

    or

    Code

    svn co https://proserver.svn.sf.net/svnroot/proserver/trunk Bio-Das-ProServer

  2. Move it to your favorite location
  3. Build:

    Code

    cd Bio-Das-ProServer
    perl Build.PL
    ./Build
    ./Build test
    ./Build install    # optional
  4. Run:

    Code

    eg/proserver -x -c eg/local.ini

Adjust the ini file with the source you want to serve, e.g.:

Code

[otter_das]
state        = on
adaptor      = otter_das
title        = Havana manual annotations
description  = A DAS source that provides access to the Havana annotation.
coordinates  = NCBI_36,Chromosome,Homo sapiens => 21:25673390,25733000
dsncreated   = 2008-03-11
maintainer   = felix@work.ac.uk
doc_href     = http://www.dasregistry.org/showProjectDetails.jsp?project_id=80
host         = otterlive
user         = username
port         = 3306
dbname       = loutre
driver       = mysql
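
Before starting the server it can be handy to list which sources in the ini file are actually switched on. The sketch below uses Python's standard configparser for a quick look; it is not part of ProServer, whose own parser may treat the file differently, and the second section "example_off" is a made-up entry for illustration.

```python
import configparser

SAMPLE_INI = """
[otter_das]
state        = on
adaptor      = otter_das
title        = Havana manual annotations
coordinates  = NCBI_36,Chromosome,Homo sapiens => 21:25673390,25733000
doc_href     = http://www.dasregistry.org/showProjectDetails.jsp?project_id=80

[example_off]
state        = off
adaptor      = otter_das
"""


def enabled_sources(ini_text):
    """Return the section names whose 'state' option is 'on'."""
    # interpolation=None keeps '%' in values from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return [
        section for section in parser.sections()
        if parser.get(section, "state", fallback="off").strip().lower() == "on"
    ]


if __name__ == "__main__":
    print(enabled_sources(SAMPLE_INI))
```

Sections without a state option are treated as off here, which is a choice made for this sketch only.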

Dependencies to re-install:

The compression libraries (Bundle-Compress-Zlib, Compress::Zlib and related modules, see http://search.cpan.org/dist/Compress-Raw-Zlib/lib/Compress/Raw/Zlib.pm) must match each other's versions to avoid errors like "does not match bootstrap parameter".

Links:

Source & Full Guide

Sanger Institute pages

Assessing Gene Predictions

November 2nd, 2009

To compare gene predictions to a reference gene set (and for similar tasks), the commonly used measures of prediction accuracy are specificity (precision) and sensitivity (recall) (Burset and Guigo, Genomics 34, 353-367, 1996). Note that in the gene prediction literature specificity is used in the sense of precision, TP / (TP + FP), not the classical TN / (TN + FP).

 Specificity = TP / (TP + FP)

 Sensitivity = TP / (TP + FN)

with

 TP = true positives (correctly identified)

 FP = false positives (overpredictions)

 TN = true negatives  (correctly un-called)

 FN = false negatives (missed)

You can calculate a combined score like

  Score = (Specificity + Sensitivity) / 2

To assess base-coverage:

  Correlation Coefficient = ( (TP x TN) - (FN x FP) ) / SQRT( (TP + FN) x (TN + FP) x (TP + FP) x (TN + FN) )

See also this text by Roderic Guigo.

Alternatively you can use the combined F1 score:

  F1 = 2 x Specificity x Sensitivity / (Specificity + Sensitivity)

Defined by van Rijsbergen in 1979, Source
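
As a worked example, the measures can be computed from the four counts with a few lines of Python. This is a minimal sketch: specificity is taken in the precision sense, TP / (TP + FP), as named in the opening paragraph, the function name is made up, and zero denominators (e.g. no predictions at all) are not handled.

```python
import math


def prediction_scores(tp, fp, tn, fn):
    """Compute sensitivity, specificity (precision), F1 and the
    correlation coefficient from the four confusion counts."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tp / (tp + fp)   # precision, as used in gene prediction
    f1 = 2 * specificity * sensitivity / (specificity + sensitivity)
    cc = ((tp * tn) - (fn * fp)) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "cc": cc,
    }


if __name__ == "__main__":
    # 100 reference bases of which 80 were found, 20 overpredicted,
    # 880 correctly left un-called.
    for name, value in prediction_scores(tp=80, fp=20, tn=880, fn=20).items():
        print(f"{name}: {value:.3f}")
```

With these counts sensitivity, specificity and F1 all come out at 0.8, while the correlation coefficient is 70000 / 90000, about 0.778.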