Data Processing with Biopieces

August 2nd, 2012

There is a fine set of scripts called biopieces that forms an orderly pipeline (or framework) for processing bioinformatics data on the Unix command line. You can, e.g., process sequencing (NGS) data like this:


./read_fastq -n 1000 -i data/reads.fastq | ./plot_scores -t png -o data/scores.png --no_stream

to read the first 1000 sequences from a FASTQ file and plot the scores to an image file.
The result might look like this:
[Plot of the quality scores, as written to data/scores.png]
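For illustration, the four-line FASTQ records such a pipeline consumes can be parsed with a few lines of plain Ruby (a minimal sketch, not biopieces' actual reader; the read_fastq helper and record layout here are simplified assumptions):

```ruby
require "stringio"

# each FASTQ record is four lines: @name, sequence, "+", quality string
def read_fastq(io, limit)
  records = []
  while records.size < limit && (header = io.gets)
    seq, _plus, qual = io.gets, io.gets, io.gets
    records << { name: header.chomp[1..-1], seq: seq.chomp, qual: qual.chomp }
  end
  records
end

# two records in the input, but we only ask for the first one (like -n 1)
data = StringIO.new("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nJJJJ\n")
p read_fastq(data, 1)  # => [{:name=>"r1", :seq=>"ACGT", :qual=>"IIII"}]
```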

The general logic is
        read_data | calculate_something | write_results
with the data being passed through as a "stream" and all modules sharing the same interface to each other. Installation instructions are here; on my Ubuntu VM I had to follow these steps:

  1. We need Perl, Ruby, Python and SVN. Install as needed, e.g.:


    sudo apt-get install subversion
  2. get biopieces code:


    svn checkout biopieces
    cd biopieces
    svn checkout bp_usage
  3. Check prerequisites with the project's installer script.


  4. Missing Perl modules were listed nicely and could be installed as suggested.
  5. Missing Ruby gems could not be installed due to incompatibilities, e.g.:


    sudo gem install RubyInline
    ERROR:  Error installing RubyInline: ZenTest requires RubyGems version > 1.8.

    But the project supplies an excellent Ruby installer on its downloads page to create a separate Ruby 1.9 installation: the default 1.8 is too old for biopieces, and the newer version is not officially supported on Ubuntu.
  6. Modify your ~/.bashrc file to include:


    export BP_DIR="$HOME/bin/biopieces"
    export BP_DATA="$HOME/bin/biopieces/BP_DATA"
    export BP_TMP="$HOME/bin/biopieces/tmp"
    export BP_LOG="$HOME/bin/biopieces/BP_LOG"
    export PATH="/home/test/bin/biopieces/ruby_install/bin:/home/test/bin/biopieces/biopieces/bp_bin:$PATH"
    export RUBYLIB="/home/test/bin/biopieces/biopieces/code_ruby/lib:$RUBYLIB"
    export PERL5LIB="/home/test/bin/biopieces/biopieces/code_perl:$PERL5LIB"


    source ~/.bashrc
    mkdir $BP_DATA $BP_TMP $BP_LOG
    The Ruby and Perl lib definitions are necessary to avoid errors like:


    cannot load such file -- maasha/biopieces (LoadError)
    Can't locate Maasha/ in @INC

Some of the almost 200 methods that are implemented in biopieces at this time include:

  • read and write various formats like bed, tab, gff, fasta, fastq
  • blast sequences against each other or against a genome
  • calculate the N50 value for a set of sequences
  • create statistics about the exon, intron, etc. content of a (12-column) BED file
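As a worked illustration of one of these computations, the N50 can be sketched in a few lines of Ruby (a generic sketch, not biopieces' actual implementation): sort the sequence lengths in descending order and return the length at which the running sum reaches half the total.

```ruby
# N50: the length L such that sequences of length >= L
# together cover at least half of the total length
def n50(lengths)
  total = lengths.inject(0, :+)
  running = 0
  lengths.sort.reverse.each do |len|
    running += len
    return len if running * 2 >= total
  end
end

puts n50([2, 2, 2, 3, 3, 4, 8, 8])  # total 32; 8 + 8 = 16 covers half, so N50 = 8
```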

Building Config Files from a Skeleton

July 12th, 2012

To run programs or pipelines automatically it is often necessary to create or adjust configuration files. Ideally this should be done dynamically by a script from a skeleton (layout) file, replacing placeholders with the adjusted values. This can be done with a Unix shell script that even contains the skeleton within:


#! /bin/sh
# pass in variables from command-line arguments
prog=$1
par1=$2
par2=$3
outputfile=output.txt
# do other required tasks
# ...
# config skeleton with $par1/$par2 placeholders
template='# config file for pipeline
setting_one = $par1
setting_two = $par2'
# generate file output.txt from variable
# $template, expanding the placeholders above
echo "$(eval "echo \"$template\"")" \
> $outputfile
# run the specified program
# with the new config file
./${prog} -conf ${outputfile}

Save the script (e.g. as program_name) and call it with parameters:
sh program_name par1 par2

Source: stackoverflow
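The same placeholder-expansion idea can also be sketched in Ruby, using String#gsub with a hash of values instead of the shell's eval (the template text and the par1/par2 names below are illustrative only):

```ruby
# config skeleton with $name placeholders
template = "# config file for pipeline\ninfile = $par1\nthreads = $par2\n"

# substitute each $key placeholder with its value
values = { "par1" => "reads.fastq", "par2" => "4" }
config = template.gsub(/\$(\w+)/) { values[$1] }

File.write("output.txt", config)
puts config
```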

Analysing Variation with Ensembl and PolyPhen

May 28th, 2012

The Ensembl variation resources provide information about structural variants and sequence variants (including Single Nucleotide Polymorphisms (SNPs), insertions, deletions and somatic mutations) in the human genome. Details and references are described on the web site and in Chen et al. (2010), Ensembl Variation Resources, BMC Genomics, and other publications listed on the site.

Sources and Descriptions currently included in Ensembl variation resources (v67):

  • dbSNP - Variants (including SNPs and indels) imported from dbSNP
  • DGVa - Database of Genomic Variants Archive
  • NHGRI_GWAS_catalog - Variants associated with phenotype data from the NHGRI GWAS catalog
  • COSMIC - Somatic mutations found in human cancers from the COSMIC project
  • EGA - Variants imported from the European Genome-phenome Archive with phenotype association
  • Uniprot - Variants with protein annotation imported from Uniprot
  • HGMD-PUBLIC - Variants from HGMD-PUBLIC dataset March 2012
  • OMIM - Variations linked to entries in the Online Mendelian Inheritance in Man (OMIM) database
  • Open Access GWAS Database - Johnson & O'Donnell 'An Open Access Database of Genome-wide Association Results' PMID:19161620
  • LSDB_LEPRE1 - LEPRE1 homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
  • LSDB_PPIB - PPIB homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
  • LSDB_CRTAP - CRTAP homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
  • LSDB_FKBP10 - FKBP10 homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database

Ensembl offers the possibility to run the underlying code on your own data and predict the functional consequences of known and unknown variants using the Variant Effect Predictor (VEP).

Internally the VEP uses PolyPhen which is further explained below:

For a given amino acid substitution in a protein, PolyPhen-2 extracts various sequence- and structure-based features of the substitution site and feeds them to a probabilistic classifier to predict whether the substitution is likely to be damaging:

Sequence-based features include binding or linking sites, transmembrane regions, regulatory modification sites. Profile matrices are calculated to assess the likelihood of the occurrence of this amino acid at the given position.

Structural features include the comparison to known protein 3D structures in PDB, using DSSP (Dictionary of Secondary Structure in Proteins), accessible surface area and properties.

PolyPhen-2 also looks at functional significance of an allele replacement using the UniProtKB database. It uses the "HumDiv" classifier to find disease-related changes and "HumVar" for variations in the "normal" population.

Ensembl have now added a nice blog entry about this with some more details.

Sequence Mappability & Alignability

May 16th, 2012

Sequence uniqueness within the genome plays an important part when attempting to map short sequences - e.g. next-generation short sequencing reads. It is one of the factors that can introduce a bias into sequencing or its analysis - the other important factor being GC content (GC-rich sequences, e.g. genic/exonic regions, as well as very GC-poor regions are often under-represented (Bentley et al. 2008), mainly caused by amplification steps in the protocol). Reads mapped to multiple regions are often discarded; genomic regions with high sequence degeneracy / low sequence complexity therefore show lower mapped read coverage than unique regions, creating a systematic bias.

The CRG Alignability tracks at the UCSC genome browser display how uniquely k-mer sequences align to a region of the genome. As you can see from the tracks, the mappability increases with read length:


CRG mappability tracks for different read lengths at the UCSC browser

For each window (of size 36, 40, 50, 75 or 100 nts), a mappability score was computed:
S = 1 / (number of matches found in the genome),
so S=1 means one match in the genome, S=0.5 means two matches, and so on. A further description can be found in the publication by Thomas Derrien, Paolo Ribeca, et al. The data for these tracks can be downloaded; if you are working with other read lengths or genomes, you can run the software to generate the data yourself: get the GEM library, unpack it with tar xjvf GEM-libraries-Linux-x86_64.tbz2, and create an index:


gem-do-index -i genome.fasta -o gem_index

run the mappability part, eg. with a read length of 250:


gem-mappability -I gem_index -l 250 -o mappability_250.gem
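The score itself is easy to reproduce for a toy example. The Ruby sketch below counts only exact k-mer matches (GEM can additionally allow mismatches), so it illustrates the formula rather than replacing gem-mappability:

```ruby
# S = 1 / (number of matches of each k-mer window in the genome)
def mappability(genome, k)
  kmers = (0..genome.length - k).map { |i| genome[i, k] }
  counts = kmers.tally
  kmers.map { |kmer| 1.0 / counts[kmer] }
end

# in "ACGACGA" with k=3, ACG and CGA each occur twice, GAC is unique
p mappability("ACGACGA", 3)  # => [0.5, 0.5, 1.0, 0.5, 0.5]
```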

To query a specific region for its mappability you can also use this online tool.

An alternative is to look at the "uniqueome" data and publication.


  • Fast computation and applications of genome mappability. Derrien T, et al. PLoS One. 2012
  • The uniqueome: a mappability resource for short-tag sequencing. Koehler et al. Bioinformatics. 2011; 27(2): 272–274
  • Blog post at MassGenomics
  • Systematic bias in high-throughput sequencing data and its correction by BEADS. Cheung et al. 2011
  • Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry. Bentley et al. Nature. 2008

Ruby Sorting

May 9th, 2012

Sorting (elements in an array) is a very common task in many scripts, and a lot of research has gone into finding the most efficient way to sort.
In Ruby the "sort" function performs a standard comparison according to the data type inspected, but as in most other languages you can define a specific order.


    open_orders.sort

is equivalent to

    open_orders.sort { |x, y| x <=> y }

The sort algorithm assumes that this comparison function/block returns a value according to the following logic (like the comparison operators):

    return -1 if x < y
    return  0 if x == y
    return  1 if x > y

So using this logic I can define a specific custom function to compare the elements that need sorting and call it in the sort function afterwards. In my simple example I need to sort order numbers by two criteria: by a string prefix first ("UK" before "ORD") and by ascending numbers afterwards.


def custom_order_sorting(x_ord, y_ord)
  if x_ord.match('UK') and y_ord.match('ORD')
    # use UK first
    return -1
  elsif x_ord.match('ORD') and y_ord.match('UK')
    # use UK first
    return 1
  else
    # use smaller number first
    x_num = x_ord.match('\w(\d+)$')[1].to_i
    y_num = y_ord.match('\w(\d+)$')[1].to_i
    return x_num <=> y_num
  end
end

open_orders.sort! { |x, y| custom_order_sorting(x, y) }
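For comparison, the same two-key ordering can be expressed more compactly with sort_by, mapping each order number to a [prefix_rank, number] tuple (a self-contained sketch with made-up sample data):

```ruby
open_orders = ["ORD12", "UK3", "ORD7", "UK10"]

# build a sort key per order: "UK" ranks before "ORD",
# then the numeric part is compared as an integer
sorted = open_orders.sort_by do |ord|
  prefix, num = ord.match(/^([A-Z]+)(\d+)$/).captures
  [prefix == "UK" ? 0 : 1, num.to_i]
end

p sorted  # => ["UK3", "UK10", "ORD7", "ORD12"]
```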