Building Config Files from a Skeleton

July 12th, 2012

To run programs or pipelines automatically it is often necessary to create or adjust configuration files. Ideally this should be done dynamically by a script from a skeleton (layout) file, replacing placeholder with the adjusted values. This can be done with a unix shell script that even contains the skeleton within:


#! /bin/sh
# pass in variables from command-line arguments
# do other required tasks
# ...
# config skeleton
template='#config file for pipeline
# Generate file output.txt from variable
# $template using placeholders above.
echo "$(eval "echo \"$template\"")" \
> $outputfile
# run the specified program
# with the new config file
./${prog} -conf ${outputfile}

Save as and call with parameters:
sh program_name par1 par2

Source: stackoverflow

Analysing Variation with Ensembl and PolyPhen

May 28th, 2012

The Ensembl variation resources provide information about structural variants and sequence variants (including Single Nucleotide Polymorphisms (SNPs), insertions, deletions and somatic mutations in the human genome. Details and references are described on the web site and in Chen et al. (2010) Ensembl Variation Resources, BMC Genomics and other publications listed in the site.

Sources and Descriptions currently included in Ensembl variation resources (v67):

  • dbSNP - Variants (including SNPs and indels) imported from dbSNP
  • DGVa - Database of Genomic Variants Archive
  • NHGRI_GWAS_catalog - Variants associated with phenotype data from the NHGRI GWAS catalog
  • COSMIC - Somatic mutations found in human cancers from the COSMIC project
  • EGA - Variants imported from the European Genome-phenome Archive with phenotype association
  • Uniprot - Variants with protein annotation imported from Uniprot
  • HGMD-PUBLIC - Variants from HGMD-PUBLIC dataset March 2012
  • OMIM - Variations linked to entries in the Online Mendelian Inheritance in Man (OMIM) database
  • Open Access GWAS Database - Johnson & O'Donnell 'An Open Access Database of Genome-wide Association Results' PMID:19161620
  • LSDB_LEPRE1 - LEPRE1 homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
  • LSDB_PPIB - PPIB homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
  • LSDB_CRTAP - CRTAP homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
  • LSDB_FKBP10 - FKBP10 homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database

Ensembl offers the possibility to run the underlying code on your own data and predict the functional consequences of known and unknown variants using the Variant Effect Predictor (VEP).

Internally the VEP uses PolyPhen which is further explained below:

For a given amino acid substitution in a protein, PolyPhen-2 extracts various sequence and structure-based features of the substitution site and feeds them to a probabilistic classifier to identify:

Sequence-based features include binding or linking sites, transmembrane regions, regulatory modification sites. Profile matrices are calculated to assess the likelihood of the occurrence of this amino acid at the given position.

Structural features include the comparison to known protein 3D structures in PDB, using DSSP (Dictionary of Secondary Structure in Proteins), accessible surface area and properties.

PolyPhen-2 also looks at functional significance of an allele replacement using the UniProtKB database. It uses the "HumDiv" classifier to find disease-related changes and "HumVar" for variations in the "normal" population.

Ensembl have now added a nice blog entry about this with some more details.

Sequence Mappability & Alignability

May 16th, 2012

Sequence uniqueness within the genome plays an important part when attempting to map short sequence parts - e.g. next-generation short sequencing reads. It is one of the factors that can introduce a bias in sequencing or it's analysis - the other important factor being GC content (GC-rich sequences, eg. genic/exonic region, as well as very GC-poor regions are often under-represented (Bentley et al. 2008), mainly caused by amplificatin steps in the protocol). Reads mapped to multiple regions are often discarded, genomic regions with high sequence degeneracy / low sequence complexity therefor show lower mapped read coverage than unique regions, creating a systematic bias.

The CRG Alignability tracks at the UCSC genome browser display how uniquely k-mer sequences align to a region of the genome. As you can see from the tracks, the mappability increases with read length:

Sequence Mappability & Alignability

CRG mappability tracks for different read lengths at the UCSC browser

For each window (of sizes 36, 40, 50, 75 or 100 nts), a mapability score was computed:
S = 1 / (number of matches found in the genome),
so S=1 means one match in the genome, S=0.5 is two matches in the genome, and so on. Further desription in the publication of Thomas Derrien, Paolo Ribeca, et al. The data for these tracks can be downloaded, if you are working with other read lengths or genomes, you can run the software to generate the data yourself: Get the Gem library, unpack it with tar xbvf GEM-libraries-Linux-x86_64.tbz2, create an index:


gem-do-index -i genome.fasta -o gem_index

run the mappability part, eg. with a read length of 250:


gem-mappability -I gem_index -l 250 -o mappability_250.gem

To query a specific region for its mappability you can also use this online tool

An alternative is to look at the "uniqueome" data and publication.


  • Fast computation and applications of genome mappability.
    Derrien T, et al. PLoS One. 2012
  • The uniqueome: a mappability resource for short-tag sequencing. Koehler et al. Bioinformatics. 2011; 27(2): 272–274.
  • Blog post at MassGenomics
  • Systematic bias in high-throughput sequencing data and its correction by BEADS. Cheung et al. 2011
  • Accurate Whole Human Genome Sequencing using Reversible
    Terminator Chemistry. Bentley et al., Nature 2008

Ruby Sorting

May 9th, 2012

Sorting (elements in an array) is a very common tasks in many scripts. A lot of research has gone into finding the most efficient way to sort.
In Ruby the "sort" function performs a standard comparison accoring to the data type inspected, but as in most other languages you can define any specific orders.


is equivalent to

   open_orders.sort { |x, y| x <=> y }

The sort algorithm will assume that this comparison function/block will return a value accoring to the following logic (like the comparison operators):

    return -1 if x < y
    return  0 if x = y
    return  1 if x > y

So using this logic I can define a specific custom function to to compare the elements that need sorting and call it in the sort function afterwards. In my simple example I need to sort order numbers by two criteria: by a string first ("UK" before "ORD") and by ascending numbers afterwards.


def custom_order_sorting(x_ord,y_ord)
       and y_ord.match('ORD'))
       #use UK first
       return -1
       and y_ord.match('UK'))
       #use UK first
       return 1
      #use smaller number first
      x_num = x_ord.match('\w(\d+)$')[1]
      y_num = y_ord.match('\w(\d+)$')[1]
      return x_num <=> y_num
open_orders.sort!{|x,y| custom_order_sorting(x,y)}



May 9th, 2012

The hypothesis of genometastasis was suggested by García-Olmo et al. more than a decade ago (1) and states (simplified) that normal cells could be turned into cancer cells through contact with (dying) cancer cells. In particular, "metastases might develop as a result of transfection of susceptible cells in distant target organs with dominant oncogenes that circulate in the plasma and are derived from the primary tumor." It can therefor be considered as a form of horizontal gene / DNA transfer. The updake of the genomic material was explained through apoptotic bodies from cancer cells as described by Holmgren et al. (2). The ideas were actually already described a century ago (6,7).
An alternative could be the involvement of a virus as a transmitter as described by zur Hausen (8).

In a later study (3) the same group could show that plasma from colorectal cancer patients could transform cultured cells oncogenically (fig 1):


Further research of the group was published recently (4) describing the transformation of cells cultured from healthy individuals through particles from cultured colon cancer cells. Goldenberg et al. (5) could stablely transform cells between species through cell fusion, resulting in hamster cells that express human oncogenes.

The evidence for horizontal gene transfer, in particular that cancer cells, dying parts of the cells or even cell-free cancer DNA can induce malignancy is worrying. It is likely only possible under very specific conditions and with certain (aggressive) cancer types, but certainly an interesting research area to watch. If confirmed it could have dramatic effects on treatment strategies and could open up new methological possibilities for molecular research.


  1. García-Olmo D, et al. (1999) Histol Histopathol. 14(4):1159-64.
    Tumor DNA circulating in the plasma might play a role in metastasis. The hypothesis of the genometastasis.
  2. Holmgren L, et al (1999) Horizontal transfer of DNA by the uptake of apoptotic bodies. Blood. 93:3956-3963.
  3. García-Olmo D, García-Olmo DC (2001) Ann N Y Acad Sci. 945:265-75. Functionality of circulating DNA: the hypothesis of genometastasis.
  4. García-Olmo D, et al. (2010) Cell-Free Nucleic Acids Circulating in the Plasma of Colorectal Cancer Patients Induce the Oncogenic Transformation of Susceptible Cultured Cells; Cancer Res. 70(2):560-7
  5. Goldenberg DM et al. (2011) Horizontal transmission and retention of malignancy, as well as
    functional human genes, after spontaneous fusion of human
    glioblastoma and hamster host cells in vivo. International Journal of Cancer 131,1
  6. Goldenberg DM (1968) Über die Progression der Malignität: Eine Hypothese [On the progression of malignancy: A hypothesis]. Klin Wochenschr; 46: 898–99
  7. Aichel O (1911) Über Zellverschmelzung mit qualitative abnormer Chromosomenverteilung als Ursache der Geschwulstbildung [On cell fusion with qualitative abnormal chromosome distribution as the cause of tumor formation]. In: Roux W, ed. Vorträge und Aufsätze über Entwicklungsmechanik der Organismen, Vol. 13
  8. zur Hausen, HPapillomaviruses Causing Cancer: Evasion From Host-Cell Control in Early Events in Carcinogenesis, J Natl Cancer Inst. 2000;92(9)