RNA-Seq data quality scores

26/02/10 | by felix [mail] | Categories: Other things, Encode

There are different ways to encode the quality scores in FASTQ files. It is important to know which encoding your data uses before working with it, and to convert between encodings if necessary.

  • Sanger format can encode a [[Phred quality score]] from 0 to 93 using [[ASCII]] 33 to 126 (although in raw read data the Phred quality score rarely exceeds 60, higher scores are possible in assemblies or read maps).
  • Illumina 1.3+ format can encode a [[Phred quality score]] from 0 to 62 using [[ASCII]] 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).
  • Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using [[ASCII]] 59 to 126 (although in raw read data Solexa scores from -5 to 40 only are expected).
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                            104                   126
 
  S - Sanger       Phred+33,  41 values  (0, 40)
  I - Illumina 1.3 Phred+64,  41 values  (0, 40)
  X - Solexa       Solexa+64, 68 values (-5, 62)

Source: Wikipedia
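
As a small illustration of the offsets above, the following Python sketch decodes a FASTQ quality line under each of the three encodings. The names `OFFSETS` and `decode_quality` are made up for this example, and the sketch assumes you already know which encoding the file uses.

```python
# ASCII offset for each encoding described above.
OFFSETS = {
    "sanger": 33,        # Phred+33
    "illumina-1.3": 64,  # Phred+64
    "solexa": 64,        # Solexa+64 (scores can go down to -5)
}

def decode_quality(quality_string, encoding):
    """Return the list of integer scores for one FASTQ quality line."""
    offset = OFFSETS[encoding]
    return [ord(char) - offset for char in quality_string]

print(decode_quality("II5+!", "sanger"))        # [40, 40, 20, 10, 0]
print(decode_quality("hhTJ@", "illumina-1.3"))  # [40, 40, 20, 10, 0]
print(decode_quality("hT@;", "solexa"))         # [40, 20, 0, -5]
```

Note that the numbers returned for the Solexa encoding are Solexa scores, which are on a different scale from Phred scores at the low end.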

You can convert Solexa read qualities to Sanger read qualities with Maq:
maq sol2sanger s_1_sequence.txt s_1_sequence.fastq
where s_1_sequence.txt is the Solexa read sequence file and s_1_sequence.fastq is the converted output. Skipping this step will lead to unreliable SNP calling when aligning reads with Maq.

Source: maq-manual
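
To see why a conversion is needed at all: a Phred score is -10*log10(p) for an error probability p, while a Solexa score is -10*log10(p/(1-p)). The two are related by Q_phred = 10*log10(10^(Q_solexa/10) + 1). The Python sketch below applies this formula to a Solexa+64 quality line and re-encodes it with the Sanger offset. It only illustrates the idea and is not Maq's actual implementation; the function names are hypothetical.

```python
import math

def solexa_to_phred(solexa_score):
    """Convert one Solexa score to the nearest integer Phred score."""
    return int(round(10 * math.log10(10 ** (solexa_score / 10) + 1)))

def solexa_line_to_sanger(quality_string):
    """Re-encode a Solexa+64 quality line as a Sanger (Phred+33) line."""
    return "".join(
        chr(solexa_to_phred(ord(char) - 64) + 33) for char in quality_string
    )

print(solexa_to_phred(-5))  # 1
print(solexa_to_phred(0))   # 3
print(solexa_to_phred(40))  # 40
```

The two scales agree for high scores and only diverge for low-quality bases. Illumina 1.3+ data already holds Phred scores, so converting it to Sanger format only needs the ASCII offset shifted from 64 to 33.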


ENCODE cell lines

24/02/10 | by felix [mail] | Categories: Biology, Encode

These are some of the cell lines that are used in the various analyses of the ENCODE project. The first two are so-called tier-1 lines and are covered by all the different types of experiments within ENCODE; the others are tier-2 lines. In addition, there are a number of tier-3 cell lines.

  • GM12878 is a lymphoblastoid cell line produced from the blood of a female donor with northern and western European ancestry by EBV transformation. It was one of the original HapMap cell lines and has been selected by the International HapMap Project for deep sequencing using the Solexa/Illumina platform. This cell line has a relatively normal karyotype and grows well. Choice of this cell line offers potential synergy with the International HapMap Project and genetic variation studies. It represents the mesoderm cell lineage.
  • K562 is an immortalized cell line produced from a female patient with chronic myelogenous leukemia (CML). It is a widely used model for cell biology, biochemistry, and erythropoiesis. It grows well, is transfectable, and represents the mesoderm lineage.
  • HepG2 is a cell line derived from a male patient with liver carcinoma. It is a model system for metabolism disorders and much data on transcriptional regulation have been generated using this cell line. It grows well, is transfectable, and represents the endoderm lineage.
  • HeLa-S3 is an immortalized cell line that was derived from a cervical cancer patient. It grows extremely well in suspension and is transfectable. It represents the ectoderm lineage. Many data sets were produced using this cell line during the pilot phase of the ENCODE Project. In addition, these cells have been widely used in biochemical and molecular genetic studies of gene function and regulation.
  • HUVEC (human umbilical vein endothelial cells) have a normal karyotype and are readily expandable to 10^8-10^9 cells. They represent the mesoderm lineage.
  • Keratinocytes have a normal karyotype and are readily expandable to 10^8-10^9 cells. They represent the ectoderm lineage.
  • H1 human embryonic stem cells.

Source
Full list


Conditional Formatting in MS Excel

12/11/09 | by felix [mail] | Categories: VisualBasic, Job Notes, Bioinformatics

To change the format of a cell based on the content of that cell or another cell, conditional formatting can be used.

  1. For simple cases with up to three conditions, select the target cell and open the "Format"-"Conditional Formatting" dialog. You can select
    • "Value" to use the content of the cell
    • "Formula" to insert any Excel formula, e.g. =FIND("needle", A3)
    and then choose the desired style (font, background etc.).
  2. For anything more complex you can write a macro in VBA (Visual Basic for Applications). Choose "Tools"-"Macro"-"Visual Basic Editor". In the editor, right-click on the "VBAProject" entry in the project box and add a module. The example below changes the background color based on the occurrence of certain strings. It can be run directly from the editor or from the worksheet via the "Tools"-"Macro" menu.

Code:

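Sub Color_groups()
    ' Color each cell in A2:A1000 according to the group prefix found in its text.
    ' ColorIndex values refer to the workbook's color palette
    ' (3 is red and 4 is green in the default palette).
    Dim MyPlage As Range
    Dim Cell As Range

    Set MyPlage = Range("A2:A1000")

    For Each Cell In MyPlage
        ' InStr returns the position of the first match, or 0 if there is none.
        If InStr(1, Cell.Value, "Vic_") > 0 Then
            Cell.Interior.ColorIndex = 3
        ElseIf InStr(1, Cell.Value, "Tyl_") > 0 Then
            Cell.Interior.ColorIndex = 4
        ElseIf InStr(1, Cell.Value, "Wol_") > 0 Then
            Cell.Interior.ColorIndex = 6
        ElseIf InStr(1, Cell.Value, "Sim_") > 0 Then
            Cell.Interior.ColorIndex = 7
        ElseIf InStr(1, Cell.Value, "Sea_") > 0 Then
            Cell.Interior.ColorIndex = 8
        ElseIf InStr(1, Cell.Value, "Mar_") > 0 Then
            Cell.Interior.ColorIndex = 15
        ElseIf InStr(1, Cell.Value, "Lio_") > 0 Then
            Cell.Interior.ColorIndex = 17
        End If
    Next Cell
End Sub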

Caching in ENSEMBL

11/11/09 | by felix [mail] | Categories: EnsEMBL, Perl, Scripts

How to avoid falling in the cache...

Caching is a powerful way to speed up queries to the Ensembl database. It can become a problem, however, for example if you repeat a query multiple times but have updated the data set in between. It is important to know how to turn caching off if needed, although this is not officially documented.

To turn caching off on the MySQL server for the current session:

Code:

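# $reg (the Ensembl registry) and $species are assumed to be set up earlier in the script.
my $sa  = $reg->get_adaptor($species, "core", "slice");
# Switch off the MySQL query cache for this database session only.
my $sth = $sa->dbc->db_handle->prepare("SET SESSION query_cache_type = OFF");
$sth->execute or die "set session failed\n";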

To reset the caches in the Perl API:

Code:

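# Reset the adaptor caches for one species and adaptor group (e.g. "core").
# $registry is assumed to be the Ensembl registry set up elsewhere in the script.
sub free_caches {
  my $species = shift;
  my $group   = shift;

  foreach my $adap (@{ $registry->get_all_adaptors(-species => $species, -group => $group) }) {
    # Drop the cached features and any generic cache the adaptor holds.
    $adap->{'_slice_feature_cache'} = undef;

    if (defined($adap->{'cache'})) {
      $adap->{'cache'} = undef;
    }

    # Replace the sequence region cache with a fresh one and re-link its lookups.
    if (defined($adap->{'seq_region_cache'})) {
      my $seq_region_cache = $adap->{'seq_region_cache'} =
        Bio::EnsEMBL::Utils::SeqRegionCache->new();

      $adap->{'sr_name_cache'} = $seq_region_cache->{'name_cache'};
      $adap->{'sr_id_cache'}   = $seq_region_cache->{'id_cache'};
    }
  }
}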

Source: Ian Longden, EBI


Proserver Setup

02/11/09 | by felix [mail] | Categories: DAS, Encode

Installing and Running Proserver to serve data via DAS

The Distributed Annotation System (DAS) is an elegant way of sharing data and using data from diverse sources. More information at http://www.biodas.org and on these blog pages. The Proserver is a lightweight software system to provide your data as a DAS source.

  1. Download from http://proserver.svn.sf.net/
    or

    Code:

  2. Untar and move to your favorite location
  3. Build:

    Code:

    cd Bio-Das-ProServer
    perl Build.PL
    ./Build
    ./Build test
    ./Build install   # optional
  4. Run:

    Code:

    eg/proserver -x -c eg/local.ini

Adjust the ini file with the source you want to serve, e.g.:

Code:

[otter_das]
state        = on
adaptor      = otter_das
title        = Havana manual annotations
description  = A DAS source that provides access to the Havana annotation.
coordinates  = NCBI_36,Chromosome,Homo sapiens => 21:25673390,25733000
dsncreated   = 2008-03-11
maintainer   = felix@work.ac.uk
doc_href     = http://www.dasregistry.org/showProjectDetails.jsp?project_id=80
host         = otterlive
user         = username
port         = 3306
dbname       = loutre
driver       = mysql
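
Once the server is running with a source like this, it can be queried with ordinary DAS requests of the form /das/<source>/<command>. The Python sketch below builds two such URLs: one listing the available sources and one requesting the features in the region configured above. The host and port (localhost:9000) are assumptions for illustration, so take the real values from your own ini file; `das_url` is a hypothetical helper, not part of ProServer.

```python
from urllib.parse import urlencode

# Assumed server address; take the real host and port from your ini file.
BASE = "http://localhost:9000/das"

def das_url(source, command, **params):
    """Build a DAS request URL of the form <base>/<source>/<command>?<params>."""
    url = f"{BASE}/{source}/{command}"
    if params:
        # Keep ':' and ',' unescaped so the segment stays readable.
        url += "?" + urlencode(params, safe=":,")
    return url

# List the data sources offered by the server.
print(f"{BASE}/dsn")
# Request the features of the otter_das source for the configured region.
print(das_url("otter_das", "features", segment="21:25673390,25733000"))
```

Opening the resulting URLs in a browser is a quick way to check that the source is being served before registering it anywhere.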

Dependencies you may need to re-install:
The compression libraries Bundle-Compress-Zlib, Compress::Zlib, and related modules (http://search.cpan.org/dist/Compress-Raw-Zlib/lib/Compress/Raw/Zlib.pm). Their versions must match each other to avoid errors like "does not match bootstrap parameter".

Links:
Source & Full Guide
Sanger Institute pages
