AnnoTrack: General Documentation

January 31st, 2011

Setting up a new system & adjusting it to your needs

This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.

The system is flexible enough to be of use for other groups and projects performing genome annotation in a collaborative effort and is therefor provided here. These are notes on how to start a new annotation project with AnnoTrack.

General Redmine installation notes for troubleshooting are here, but all the sourcecode required for AnnoTrack is available here.

Most of the AnnoTrack code is written as a plugin for the Redmine system (rails/vendor/plugins/redmine_annotrack), but since there are some other changes required, which override Redmine's default code, you will need the complete package from this site.

General notes

you will need

  1. a database server (e.g. mysql 5)
  2. ruby on rails installation

    source and help on the official rails page documention for running on Mac OsX (usually pre-installed)

  3. a web server (e.g. Apache) when running in production mode, for testing, the Webbrick server supplied with Rails is fine.
  4. get the AnnoTrack source code and database from this page

    unpack:

    Code

    tar xzvf annotrack.version.tgz

trong>Database

create your database

Code

mysql -u<user> -p<password> -h<host> -P<port> -e"create database annotrack"
 
  mysql -u<user> -p<password> -h<host> -P<port> -Dannotrack < annotrack/database.sql

The main tables fo the database are outlined in this diagram.

Rails server

  • we have frozen the additional external Rails modules used by the application (gems) into the AnnoTrack rails code (rails/vendor/rails/) so you don't necessarily need to install all of them separately.
  • set your environment variables GEM_PATH and RAILS_ENV in your shell or in the file annotrack/rails/config/environment.rb
  • adjust the database configurations file in annotrack/rails/config/database.yml with your settings (production and development if desired)

    additional environments can be created (e.g. for multiple organisms) by adding an entry (e.g. "production_housemouse") and a file in environments (e.g. environments/production_mouse.rb)

  • start the server e.g. on port 6223:

    Code

    cd annotrack/rails
     
    ruby scripts/server -edevelopment -p6223 #(to use the development setup)
  • In a web browser your application will usually be at http://localhost:6223/. Log in as administrator ("admin"/"admin") to set up some initial values.

    The admin interface from Redmine is at DEFAULT_URL/admin, modifications should in particular be made on these pages:

    1. Settings: "Application title", "Welcome text", "Host name"
    2. AnnoTrack settings: "Menu links", "Browsers links", "other settings"

      vendor/plugins/redmine_annotrack/lang/en.yml holds the URL patterns used for browser links.

    3. Flags: define new flags to highlight errors
    4. Users: create & adjust user accounts
  • we have stored a gene with two transcript with two flags for demonstration;

    you can see these by clicking on "Transcripts" at the top of the page and then selecting "View all transcripts".

  • you can create a new gene-level entry manually at DEFAULT_URL/projects/add for testing, in general these will be created by scripts writing directly to the database.

Perl API/scripts

  • You can adjust the settings for your system in the central config.pm file.
  • We use the scripts/cron_jobs.pl file the run automatic updates of the core annotation, to update the stats given on the front page (issue and flag counts), please adjust this to your needs

    Some Perl programming knowledge is required to adjust / write parsers to handle the specific data you will be using.

  • The following additional perl modules (many of which are part of a standard installation) are required to use the AnnoTrack perl API:

    • Bio::Das::Lite
    • MIME::Lite
    • DBI
    • Getopt::Long
    • UNIVERSAL::require
    • Bio::EnsEMBL::DBSQL::DBAdaptor (when accessing Ensembl-style databases)
  • most probably you will have to adjust the source-specific scripts used for data loading and analysis stored in annotrack/perl/modules/annotrack

further hints

  • New genes/transcripts, categories and flags would usually be created by script access. There are functions for all this functionality which is documented "here":/human/docs/core
  • This (/documents/show/8) is a basic *source adaptor* reading data from a tab-delimited file to demonstrate how the modules work.
  • This (/documents/show/10) is an example *source adaptor* to demonstrate a module reading from a database with DBI.
  • Here (/human/docs/core) is the Perl-doc of the AnnoTrack core module.

Further adjustments

to customize the system for your own set-up there are a number of files you can modify:

  • rails/app/views/layouts/base.rhtml: Start page layout
  • rails/vendor/plugins/redmine_annotrack/lang/en.yml: Names and paths to browsers and project-related links
  • we are using a Lucene-based search engine for AnnoTrack, there is a switch option between this and the Redmine-internal search enginge on the Administration/Settings/Annotrack page
  • the scripts annotrack/perl/scripts/cron_jobs.pl.example and annotrack/perl/scripts/cron_queries.sh.example should be adjusted with your environment and run regulary (nightly) to update annotation data, update counts and optimize tables.
  • many "helper variables" are stored in the tmp_values table. Have a look there if stats etc. are not displayed as expected.

Upgrading

General notes on upgrading existing Redmine installations are here.

AnnoTrack: Web-Server

January 27th, 2011

This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.

AnnoTrack is a Ruby-On-Rails application with is executed by an Apache2 server with the mod-rails (Passenger) plugin. It is living on virtual machines (VM) where we don't run any other services as rails does not play nice with other web-services.

James Smith (webteam) knows most about this, Tim Cutts & Dave Holland (infrastructure management) can help with the VMs.

Access restrictions apply to connect to all the following services and the superuser rights.

There is a test environment on the VM web-annotrack, the production servers are running on two VM clones web-annotrack1 and web-annotrack2. All can be accessed directly with SSH:

Code

ssh web-annotrack
 
cd /var/www/annotrack-app

The different species have their own AnnoTrack/Redmine code installations as there does not seem to be another way to have them running in parallel otherwise:

annotrack-app == human

annotrack-app-mouse == mouse

annotrack-app-zfish == zebrafish

Rails/Passenger requires symbolic links from the root-level to the public folder:

human -> annotrack-app/public/

The test system is visible at http://web-annotrack.internal.sanger.ac.uk:8000

The port and other specific server settings are set in the apache2/sites-available/default file.

Re-starting Rails server:

Code

ssh web-annotrack[1,2]
 
sudo touch tmp/restart.txt

Re-starting entire web server:

Code

ssh web-annotrack[1,2]
 
sudo apache2ctl -k graceful

Service monitoring

The VMs are monitored with vSphere (web access, Windows client available as well) and Nagios (web-annotrack 1 / 2).

The website is also checked by the Montastic monitoring service.

Submitting to EMBLdb

January 24th, 2011

To submit DNA sequences from capillary (Sanger) sequencing to the public EMBL database, these steps can be taken:

The strategy is to create one submission at the European Nucleotide Archive (ENA) @ EBI Webin submission page and attach a FASTA file with all sequences.

  1. remove low quality sequences. I my case the filter criteria were:

    • max 5 consecutive Ns
    • max 10% Ns
    • min 80bp length
  2. screen for vector contamination:

    • Use NBCI web interface for small sets
    • Use BioPerl for large set: get EMVEC file in EMBL format, convert to FASTA format file with BioPerl

      Code

      my $inseq = Bio::SeqIO->new(
       
            -file   => "<file.dat",
       
            -format => "embl" );
       
      my $outseq = Bio::SeqIO->new(
       
            -file   => ">file.fa",
       
            -format => "fasta" );
       
      while (my $seq = $inseq->next_seq) {
       
        $outseq->write_seq($seq);
       
      }
    • index with formatdb

      To extract sequences from a BLAST database you need an index file (for protein-dbs these files end with the extension: ".pin", for DNA dbs: ".nin"), a sequence file (".psq", ".nsq") and a header file (".phr" and ".nhr"). formatdb turns FASTA files into BLAST databases.

      Code

      formatdb -i emvec.fa -p F -o F

    • run BioPerl Blast with the sequences to be submitted against the EMVEC db:

      Code

      use Bio::Tools::Run::StandAloneBlast;
       
      my @blast_params = (program  => 'blastn', database => 'emvec.dat.fa');
       
      my $blast_hits = run_blast($seq);

      and filter out hits with very low (<0.1) eValues and long sequence hits.

  3. In my case these are submitted as ESTs. Log in to Webin, create a new submission, choose molecule type (eg.g. "EST"), add a reference publication, specify the number of sequences, describe the header (at least one field, eg. clone-identifier, must be specified to be read from the FASTA header), add common values in the small table to be added to add entries (e.g organism "Homo sapiens"), upload your FASTA file.

Sources:

Sequence Contaminations

January 20th, 2011

When analysing sequences from public databases or from your own sequencer you have to be aware of potential contaminations.

A contaminated sequence is one that does not faithfully represent the genetic information from the biological source organism/organelle because it contains one or more sequence segments of foreign origin. [NCBI]

The primary approach to screening nucleic acid sequences for vector contamination is to run a sequence similarity search against a database of vector sequences. The preferred tool for conducting such a search is NCBI's VecScreen. VecScreen detects contamination by running a BLAST sequence similarity search against the UniVec vector sequence database.

An interactive web-service EMVEC Database BLAST to scan for contamination.

Help with the interpretation of the results of BLAST2 EVEC.

See also this post about submitting to EMBL db and this post about screening NGS reads locally.

GVF Format

January 11th, 2011

The Genome Variation Format (GVF) is a file format for describing sequence variants at nucleotide resolution relative to a reference genome. The GVF format was published in Reese et al., Genome Biol., 2010: A standard variation file format for human genome sequences.

GVF is a type of GFF3 file with additional pragmas and attributes specified.

Two examples:

Code

chr16 samtools SNV 49291141 49291141 . + . ID=ID_1;Variant_seq=A,G;Reference_seq=G;Genotype=heterozygous
 
chr16 samtools SNV 49291360 49291360 . + . ID=ID_2;Variant_seq=G;Reference_seq=C;Genotype=homozygous

Code

chr16 samtools SNV 49291141 49291141 . + . ID=ID_1;Variant_seq=A,G;Reference_seq=G;Genotype=heterozygous;Variant_effect=synonymous_codon 0 mRNA NM_022162;
 
chr16 samtools SNV 49302125 49302125 . + . ID=ID_3;Variant_seq=T,C;Reference_seq=C;Genotype=heterozygous;Variant_effect=nonsynonymous_codon 0 mRNA NM_022162;Alias=NP_071445.1:p.P45S;

This is used e.g. by Ensembl to write out "Watson SNPs" from the variation database (ftp).

Source and full specs: Sequenceontology.org