Sequence Mappability & Alignability

May 16th, 2012

Sequence uniqueness within the genome plays an important part when mapping short sequences, e.g. next-generation sequencing reads. It is one of the factors that can introduce a bias in sequencing or its analysis. The other important factor is GC content: GC-rich sequences (e.g. genic/exonic regions) as well as very GC-poor regions are often under-represented (Bentley et al. 2008), mainly because of the amplification steps in the protocol. Reads that map to multiple regions are often discarded, so genomic regions with high sequence degeneracy or low sequence complexity show lower mapped read coverage than unique regions, creating a systematic bias.

The CRG Alignability tracks at the UCSC genome browser display how uniquely k-mer sequences align to a region of the genome. As you can see from the tracks, the mappability increases with read length:


CRG mappability tracks for different read lengths at the UCSC browser

For each window (of size 36, 40, 50, 75 or 100 nt), a mappability score was computed:
S = 1 / (number of matches found in the genome),
so S=1 means one match in the genome, S=0.5 means two matches, and so on. A further description is given in the publication by Thomas Derrien, Paolo Ribeca, et al. The data for these tracks can be downloaded. If you are working with other read lengths or genomes, you can run the software to generate the data yourself: get the GEM library, unpack it with tar xjvf GEM-libraries-Linux-x86_64.tbz2, and create an index:

Code

gem-do-index -i genome.fasta -o gem_index

then run the mappability step, e.g. with a read length of 250:

Code

gem-mappability -I gem_index -l 250 -o mappability_250.gem

To query a specific region for its mappability you can also use this online tool http://surveyor.chgr.org/.

An alternative is to look at the "uniqueome" data and publication.
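The score definition S = 1 / (number of matches) can be illustrated on a toy sequence. The following is only a minimal sketch: it counts exact matches on the forward strand, whereas the real tracks come from a proper aligner. The helper name toy_mappability and the example sequence are made up for illustration.

```ruby
# Toy mappability: S = 1 / (number of exact occurrences of the k-mer).
# Assumptions: forward strand only, no mismatches allowed.
def toy_mappability(sequence, k)
  # All windows of length k, in genomic order.
  kmers = (0..sequence.length - k).map { |i| sequence[i, k] }

  # Count how often each k-mer occurs in the sequence.
  counts = kmers.each_with_object(Hash.new(0)) { |kmer, h| h[kmer] += 1 }

  # One score per window start position.
  kmers.map { |kmer| 1.0 / counts[kmer] }
end

genome = "ACGTACGTTT"            # hypothetical mini "genome"
scores = toy_mappability(genome, 4)
p scores
```

Here the 4-mer ACGT occurs twice, so both of its windows get S=0.5, while all other windows are unique and get S=1.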

Refs:

  • Fast computation and applications of genome mappability.
    Derrien T, et al. PLoS One. 2012
  • The uniqueome: a mappability resource for short-tag sequencing. Koehler et al. Bioinformatics. 2011; 27(2): 272–274.
  • Blog post at MassGenomics
  • Systematic bias in high-throughput sequencing data and its correction by BEADS. Cheung et al. 2011
  • Accurate Whole Human Genome Sequencing using Reversible
    Terminator Chemistry. Bentley et al., Nature 2008

Ruby Sorting

May 9th, 2012

Sorting (elements in an array) is a very common task in many scripts. A lot of research has gone into finding the most efficient way to sort.
In Ruby the "sort" function performs a standard comparison according to the data type inspected, but as in most other languages you can define any specific order.

   open_orders.sort

is equivalent to

   open_orders.sort { |x, y| x <=> y }

The sort algorithm assumes that this comparison function/block returns a value according to the following logic (like the comparison operator <=>):

    return -1 if x < y
    return  0 if x == y
    return  1 if x > y
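As a small illustration of this contract, the sketch below prints the result of <=> for a few values and then sorts in descending order simply by swapping the operands in the block. The example values are arbitrary.

```ruby
# The comparison operator returns -1, 0 or 1 for comparable values.
p(1 <=> 2)      # -1
p(2 <=> 2)      # 0
p("b" <=> "a")  # 1

# Swapping the operands in the block reverses the sort order.
numbers = [3, 1, 2]
descending = numbers.sort { |x, y| y <=> x }
p descending    # [3, 2, 1]
```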

Using this logic I can define a custom function to compare the elements that need sorting and call it from the sort block afterwards. In my simple example I need to sort order numbers by two criteria: first by a string ("UK" before "ORD") and then by ascending number.

Code

def custom_order_sorting(x_ord, y_ord)
    if x_ord.match('UK') && y_ord.match('ORD')
       # "UK" orders come first
       return -1
    elsif x_ord.match('ORD') && y_ord.match('UK')
       # "UK" orders come first
       return 1
    else
      # same prefix: smaller number first
      # (convert to integers, otherwise "10" would sort before "9")
      x_num = x_ord.match(/(\d+)$/)[1].to_i
      y_num = y_ord.match(/(\d+)$/)[1].to_i
      return x_num <=> y_num
    end
end

open_orders.sort! { |x, y| custom_order_sorting(x, y) }

Source: stackoverflow.com
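The same two-criteria ordering can probably be expressed more compactly with sort_by and an array key, since Ruby compares arrays element by element. This is only a sketch under the assumption that the order numbers start with "UK" or "ORD" and end in digits; the sample order numbers are made up.

```ruby
# Hypothetical order numbers, made up for illustration.
open_orders = ["ORD10", "UK2", "ORD9", "UK10"]

sorted = open_orders.sort_by do |order|
  prefix_rank = order.start_with?("UK") ? 0 : 1  # "UK" before "ORD"
  number = order[/\d+$/].to_i                    # trailing digits as integer
  [prefix_rank, number]                          # compared element by element
end

p sorted  # ["UK2", "UK10", "ORD9", "ORD10"]
```

The key is computed once per element, which tends to be easier to read than a hand-written comparison block when several criteria are combined.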

Genometastasis

May 9th, 2012

The hypothesis of genometastasis was suggested by García-Olmo et al. more than a decade ago (1) and states (simplified) that normal cells could be turned into cancer cells through contact with (dying) cancer cells. In particular, "metastases might develop as a result of transfection of susceptible cells in distant target organs with dominant oncogenes that circulate in the plasma and are derived from the primary tumor." It can therefore be considered a form of horizontal gene / DNA transfer. The uptake of the genomic material was explained through apoptotic bodies from cancer cells, as described by Holmgren et al. (2). The ideas were actually already described a century ago (6,7).
An alternative could be the involvement of a virus as a transmitter as described by zur Hausen (8).

In a later study (3) the same group could show that plasma from colorectal cancer patients could transform cultured cells oncogenically (fig 1):

Genometastasis

Further research by the group was published recently (4), describing the transformation of cells cultured from healthy individuals through particles from cultured colon cancer cells. Goldenberg et al. (5) could stably transform cells between species through cell fusion, resulting in hamster cells that express human oncogenes.

The evidence for horizontal gene transfer, in particular that cancer cells, dying parts of the cells or even cell-free cancer DNA can induce malignancy, is worrying. It is likely only possible under very specific conditions and with certain (aggressive) cancer types, but it is certainly an interesting research area to watch. If confirmed, it could have dramatic effects on treatment strategies and could open up new methodological possibilities for molecular research.

References:

  1. García-Olmo D, et al. (1999) Tumor DNA circulating in the plasma might play a role in metastasis. The hypothesis of the genometastasis. Histol Histopathol. 14(4):1159-64.
  2. Holmgren L, et al. (1999) Horizontal transfer of DNA by the uptake of apoptotic bodies. Blood. 93:3956-3963.
  3. García-Olmo D, García-Olmo DC (2001) Functionality of circulating DNA: the hypothesis of genometastasis. Ann N Y Acad Sci. 945:265-75.
  4. García-Olmo D, et al. (2010) Cell-Free Nucleic Acids Circulating in the Plasma of Colorectal Cancer Patients Induce the Oncogenic Transformation of Susceptible Cultured Cells. Cancer Res. 70(2):560-7.
  5. Goldenberg DM, et al. (2011) Horizontal transmission and retention of malignancy, as well as functional human genes, after spontaneous fusion of human glioblastoma and hamster host cells in vivo. International Journal of Cancer 131(1).
  6. Goldenberg DM (1968) Über die Progression der Malignität: Eine Hypothese [On the progression of malignancy: A hypothesis]. Klin Wochenschr. 46:898-99.
  7. Aichel O (1911) Über Zellverschmelzung mit qualitativ abnormer Chromosomenverteilung als Ursache der Geschwulstbildung [On cell fusion with qualitatively abnormal chromosome distribution as the cause of tumor formation]. In: Roux W, ed. Vorträge und Aufsätze über Entwicklungsmechanik der Organismen, Vol. 13.
  8. zur Hausen H (2000) Papillomaviruses Causing Cancer: Evasion From Host-Cell Control in Early Events in Carcinogenesis. J Natl Cancer Inst. 92(9).

Uniparental Disomy

May 4th, 2012

When a cell contains two copies of a chromosome, or of part of a chromosome, from one parent and no copy from the other parent, we call it uniparental disomy (UPD). While all DNA information is present, the development of the cell (and the organism) is hindered because of missing or wrong epigenetic markers. The basic mechanism by which this faulty distribution of chromosomes can occur is shown in fig. 1.

Uniparental Disomy

Sources:

  • Wikipedia
  • Eggermann and Kotzot (2010) Uniparental disomy, Onset mechanisms and their relevance in clinical genetics [German], Medizinische Genetik

Version Control with Perforce on the Command-line

April 19th, 2012

Besides the visual client, the version control system Perforce can be operated through the command line (Unix prompt or Windows DOS window) and can therefore be controlled from other programs such as MATLAB:

[status, result] = dos(p4command);
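The same idea carries over to other scripting languages. As a minimal sketch in Ruby, the standard library's Open3 can run a command and capture its output, much like the MATLAB call above. The helper name run_cli is hypothetical and not part of Perforce.

```ruby
require "open3"

# Run an external command and return [success, combined_output],
# similar in spirit to MATLAB's [status, result] = dos(...).
# run_cli is a hypothetical helper name, not part of Perforce.
def run_cli(*command)
  output, status = Open3.capture2e(*command)
  [status.success?, output]
rescue Errno::ENOENT
  # The executable was not found on the PATH.
  [false, "command not found: #{command.first}"]
end

# Usage (requires the p4 client to be installed and configured):
#   ok, text = run_cli("p4", "sync")
#   warn text unless ok
```

Passing the command as separate arguments avoids shell quoting problems with file names that contain spaces.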

A reference manual is available; here are a few hints.
Check the environment settings:

p4 set
  P4CHARSET=winansi
  P4CLIENT=try1 (set)
  P4EDITOR=C:\Windows\SysWOW64\notepad.exe (set)
  P4PORT=perforce:1666
  P4USER=Felix_Kokocinski

and edit if necessary with

set P4CHARSET=winansi

P4EDITOR is optional, P4CLIENT is the checkout / workspace name.
The settings can also be set permanently in the visual client under
Edit / Preferences / Connection / Change Settings
If these are wrong you will get messages like "file(s) not on client".
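When p4 is driven from a script, it can help to check these settings up front. The sketch below parses the sample "p4 set" output shown above into a hash and reports missing entries. The function names are hypothetical, the list of required settings is an assumption, and the parser only handles a simple one-word annotation such as "(set)".

```ruby
# Sample output copied from the "p4 set" listing above.
P4_SET_OUTPUT = <<~'OUT'
  P4CHARSET=winansi
  P4CLIENT=try1 (set)
  P4EDITOR=C:\Windows\SysWOW64\notepad.exe (set)
  P4PORT=perforce:1666
  P4USER=Felix_Kokocinski
OUT

# Turn NAME=value lines into a hash, dropping a trailing "(set)" note.
# Assumption: annotations are a single word in parentheses.
def parse_p4_set(text)
  text.each_line.each_with_object({}) do |line, settings|
    m = line.match(/\A\s*(\w+)=(.*?)(?:\s+\(\w+\))?\s*\z/)
    settings[m[1]] = m[2] if m
  end
end

# Assumption: these three are the settings a script cannot do without.
def missing_settings(settings, required = %w[P4PORT P4USER P4CLIENT])
  required.reject { |name| settings.key?(name) && !settings[name].empty? }
end

settings = parse_p4_set(P4_SET_OUTPUT)
puts settings["P4CLIENT"]
p missing_settings(settings)
```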

Most common commands:
synchronize repository:

p4 sync

checkout file:

p4 edit filename.txt
  or
p4 edit //depot/path/in/perforce/filename.txt

submit changes:

p4 submit -d "description of changes" filename.txt

revert to version in repository:

p4 revert filename.txt

add new file:

p4 add filename.txt

get help:

p4 help

Here are some useful one-liners for various tasks.