AnnoTrack: Web-Server

January 27th, 2011

This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.

AnnoTrack is a Ruby on Rails application which is executed by an Apache2 server with the mod_rails (Passenger) plugin. It runs on virtual machines (VMs) that host no other services, as Rails does not play nicely with other web services.

James Smith (webteam) knows most about this; Tim Cutts & Dave Holland (infrastructure management) can help with the VMs.

Access restrictions apply to all of the following services and to superuser rights.

There is a test environment on the VM web-annotrack; the production servers run on two VM clones, web-annotrack1 and web-annotrack2. All can be accessed directly with SSH:

Code

ssh web-annotrack
 
cd /var/www/annotrack-app

Each species has its own AnnoTrack/Redmine code installation, as there does not seem to be another way to run them in parallel:

annotrack-app == human

annotrack-app-mouse == mouse

annotrack-app-zfish == zebrafish

Rails/Passenger requires symbolic links from the root-level to the public folder:

human -> annotrack-app/public/

The test system is visible at http://web-annotrack.internal.sanger.ac.uk:8000

The port and other specific server settings are set in the apache2/sites-available/default file.

Re-starting the Rails server (run from within the application directory, e.g. /var/www/annotrack-app):

Code

ssh web-annotrack[1,2]
 
sudo touch tmp/restart.txt

Re-starting the entire web server:

Code

ssh web-annotrack[1,2]
 
sudo apache2ctl -k graceful

Service monitoring

The VMs are monitored with vSphere (web access, Windows client available as well) and Nagios (web-annotrack 1 / 2).

The website is also checked by the Montastic monitoring service.

Submitting to EMBLdb

January 24th, 2011

To submit DNA sequences from capillary (Sanger) sequencing to the public EMBL database, these steps can be taken:

The strategy is to create one submission via the Webin submission page of the European Nucleotide Archive (ENA) at EBI and attach a FASTA file with all sequences.

  1. Remove low-quality sequences. In my case the filter criteria were:

    • max 5 consecutive Ns
    • max 10% Ns
    • min 80bp length
  2. screen for vector contamination:

    • Use the NCBI web interface for small sets
    • Use BioPerl for large sets: get the EMVEC file in EMBL format and convert it to a FASTA file with BioPerl

      Code

      use Bio::SeqIO;
       
      my $inseq = Bio::SeqIO->new(
       
            -file   => "<file.dat",
       
            -format => "embl" );
       
      my $outseq = Bio::SeqIO->new(
       
            -file   => ">file.fa",
       
            -format => "fasta" );
       
      while (my $seq = $inseq->next_seq) {
       
        $outseq->write_seq($seq);
       
      }
    • index with formatdb

      To extract sequences from a BLAST database you need an index file (for protein-dbs these files end with the extension: ".pin", for DNA dbs: ".nin"), a sequence file (".psq", ".nsq") and a header file (".phr" and ".nhr"). formatdb turns FASTA files into BLAST databases.

      Code

      formatdb -i emvec.fa -p F -o F

    • run BioPerl Blast with the sequences to be submitted against the EMVEC db:

      Code

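      use Bio::Tools::Run::StandAloneBlast;
       
      # database name must match the FASTA file indexed with formatdb above
      my @blast_params = (program  => 'blastn', database => 'emvec.fa');
       
      my $factory = Bio::Tools::Run::StandAloneBlast->new(@blast_params);
       
      # $seq is a Bio::Seq object holding one sequence to be submitted
      my $blast_report = $factory->blastall($seq);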

      then discard the sequences whose hits have very low E-values (<0.1) and long alignment lengths.

  3. In my case these are submitted as ESTs. Log in to Webin, create a new submission, choose the molecule type (e.g. "EST"), add a reference publication, specify the number of sequences, describe the header (at least one field, e.g. clone-identifier, must be specified to be read from the FASTA header), add the values common to all entries in the small table (e.g. organism "Homo sapiens"), and upload your FASTA file.
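The quality filter from step 1 can be sketched in a few lines of Python. This is only an illustration of the three criteria listed above (at most 5 consecutive Ns, at most 10% Ns, at least 80 bp); the function names are made up for this example, and treating the thresholds as inclusive is an assumption.

```python
import re


def passes_quality_filter(seq, max_n_run=5, max_n_fraction=0.10, min_length=80):
    """Return True if a sequence meets all three filter criteria."""
    seq = seq.upper()
    # Criterion: minimum length of 80 bp.
    if len(seq) < min_length:
        return False
    # Criterion: at most 10% of all bases may be N.
    if seq.count("N") / len(seq) > max_n_fraction:
        return False
    # Criterion: no run of more than 5 consecutive Ns.
    longest_run = max((len(run) for run in re.findall(r"N+", seq)), default=0)
    return longest_run <= max_n_run


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines):
    """Keep only the FASTA records that pass the quality filter."""
    return [(h, s) for h, s in read_fasta(lines) if passes_quality_filter(s)]
```

Typical use would be `filter_fasta(open("reads.fa"))`, where `reads.fa` is a placeholder file name.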

Sequence Contaminations

January 20th, 2011

When analysing sequences from public databases or from your own sequencer, you have to be aware of potential contamination.

A contaminated sequence is one that does not faithfully represent the genetic information from the biological source organism/organelle because it contains one or more sequence segments of foreign origin. [NCBI]

The primary approach to screening nucleic acid sequences for vector contamination is to run a sequence similarity search against a database of vector sequences. The preferred tool for conducting such a search is NCBI's VecScreen. VecScreen detects contamination by running a BLAST sequence similarity search against the UniVec vector sequence database.

EMVEC Database BLAST is an interactive web service for scanning sequences for contamination.

Help is available for interpreting the results of BLAST2 EVEC.

See also this post about submitting to EMBL db and this post about screening NGS reads locally.

GVF Format

January 11th, 2011

The Genome Variation Format (GVF) is a file format for describing sequence variants at nucleotide resolution relative to a reference genome. The GVF format was published in Reese et al., Genome Biol., 2010: A standard variation file format for human genome sequences.

GVF is a type of GFF3 file with additional pragmas and attributes specified.

Two examples:

Code

chr16 samtools SNV 49291141 49291141 . + . ID=ID_1;Variant_seq=A,G;Reference_seq=G;Genotype=heterozygous
 
chr16 samtools SNV 49291360 49291360 . + . ID=ID_2;Variant_seq=G;Reference_seq=C;Genotype=homozygous

Code

chr16 samtools SNV 49291141 49291141 . + . ID=ID_1;Variant_seq=A,G;Reference_seq=G;Genotype=heterozygous;Variant_effect=synonymous_codon 0 mRNA NM_022162;
 
chr16 samtools SNV 49302125 49302125 . + . ID=ID_3;Variant_seq=T,C;Reference_seq=C;Genotype=heterozygous;Variant_effect=nonsynonymous_codon 0 mRNA NM_022162;Alias=NP_071445.1:p.P45S;
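To show how the nine GFF3 columns and the GVF attributes fit together, here is a minimal Python sketch that parses one feature line. The function name `parse_gvf_line` is made up for this example. GFF3 columns are normally tab-separated; the sketch splits on any whitespace because the examples above are shown with spaces, and it keeps the attribute column intact since values such as Variant_effect contain spaces themselves.

```python
def parse_gvf_line(line):
    """Parse one GVF feature line into a dict (illustrative sketch)."""
    # Split into at most 9 fields so the attribute column stays in one piece.
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    seqid, source, feature_type, start, end, score, strand, phase, attr_text = fields

    # Attributes are semicolon-separated key=value pairs.
    attributes = {}
    for pair in attr_text.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key.strip()] = value.strip()

    # Variant_seq may list several alleles separated by commas.
    if "Variant_seq" in attributes:
        attributes["Variant_seq"] = attributes["Variant_seq"].split(",")

    return {
        "seqid": seqid,
        "source": source,
        "type": feature_type,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }
```

For the first example line, `parse_gvf_line(...)["attributes"]["Variant_seq"]` gives the two alleles `A` and `G` as a list.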

GVF is used, e.g., by Ensembl to write out "Watson SNPs" from the variation database (ftp).

Source and full specs: Sequenceontology.org

QSEQ File Format

January 6th, 2011

QSEQ is a plain-text file format for sequence reads produced directly by many current next-generation sequencing machines. The content can be described as follows.

Each record is a single line of tab-separated fields, in the following order:

- Machine name: unique identifier of the sequencer.

- Run number: unique number to identify the run on the sequencer.

- Lane number: positive integer (currently 1-8).

- Tile number: positive integer.

- X: x coordinate of the spot. Integer (can be negative).

- Y: y coordinate of the spot. Integer (can be negative).

- Index: positive integer. Without indexing, the value should be 1.

- Read Number: 1 for single reads; 1 or 2 for paired ends.

- Sequence (BASES)

- Quality: the calibrated quality string. (QUALITIES)

- Filter: Did the read pass filtering? 0 - No, 1 - Yes.

Source: SRA_File_Formats_Guide.pdf
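The field list above maps directly onto a small parser. The following Python sketch reads one QSEQ line into a named tuple and keeps only reads that passed filtering. The names `QseqRecord`, `parse_qseq_line` and `passed_reads` are made up for this example; machine name, run number and index are kept as strings here as a cautious assumption, since real files may not always hold plain integers in those fields.

```python
from collections import namedtuple

# One field per QSEQ column, in the order described above.
QseqRecord = namedtuple(
    "QseqRecord",
    ["machine", "run", "lane", "tile", "x", "y",
     "index", "read", "sequence", "quality", "passed_filter"],
)


def parse_qseq_line(line):
    """Parse one tab-separated QSEQ line into a QseqRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 fields, got %d" % len(fields))
    machine, run, lane, tile, x, y, index, read, seq, qual, flt = fields
    return QseqRecord(
        machine, run, int(lane), int(tile), int(x), int(y),
        index, int(read), seq, qual,
        flt == "1",  # filter flag: 1 = passed, 0 = failed
    )


def passed_reads(lines):
    """Yield only the records whose filter flag is 1."""
    for line in lines:
        if line.strip():
            record = parse_qseq_line(line)
            if record.passed_filter:
                yield record
```

Typical use would be `for rec in passed_reads(open("s_1_1_0001_qseq.txt")): ...`, where the file name is only a placeholder.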