ENCODE publication interview

October 1st, 2012

Following on from the publication of the main papers of the ENCODE (Encyclopedia Of DNA Elements) scale-up phase, I gave an interview to BlueGnome's marketing team for the Newstrack customer newsletter in 2012.

These are my personal opinions, not my employer's (past or present). They might be of interest to researchers considering joining a large-scale project like this.

Q. What was it like to be part of the ENCODE project?
It was a great experience to work on a project of this scale with more than 400 scientists from 32 groups spread across the globe. Many of them are leaders in their field, but at consortium meetings and the many phone conferences everyone could contribute. The amount of data and the number of different technologies were overwhelming at times, so I think the way this project was run, and that the findings have now been published, is an impressive achievement.

Q. What are the main outcomes of the project?
There has been a very lively discussion about the outcome and how it was presented. In my opinion, the most important result is the data itself. ENCODE has created an enormous repository of measurements across the human genome that has been compiled in a systematic and standardised way. The data will be the basis of future research trying to understand genomic processes involved in basic cellular processes as well as in various diseases.
ENCODE has pushed the development of standards and new applications to interrogate the genome, in particular using sequencing technologies.
The results also remind us that there is a lot of activity in the genome that we currently do not fully understand. Up to 80% of the human genome is biochemically active, there are thousands of additional (non-coding) genes in introns and in the intergenic space, and up to 75% of the genome is transcribed at some point. These observations paint a very dynamic genomic landscape, with overlapping active zones and signals of different complexity, indicating that we have to keep the concept of genes and genome regulation pretty flexible in our minds.

Q. What are potential implications for BlueGnome and its customers?
I’m afraid the interpretation of CNV regions is getting even more complex as regulatory regions far away from the actual disease genes might be relevant for cases the clinical customers might come across. This is especially true for the interpretation of cancer profiles – which is highly complex already. We won’t be able to use these new interconnections directly in most cases, but we are looking through the data and have started to incorporate the knowledge by providing new genome-wide annotation data sets as optional BED files on the BlueGnome website, e.g. with GWAS results and regulatory element locations.

Q. Where do you see the human genome in 5 years’ time?
ENCODE is now entering its next phase to extend the catalogue to many additional cell lines as well as the mouse genome. With the recent publications, scientists around the world are now more aware of this data and how to use it, so my hope is that we will see an acceleration in algorithm development, data mining and scientific findings. In 5 years we still won’t understand the genome entirely, but we should have a complete parts list and more connections between the parts. Some of these will be clinically relevant and allow progress in understanding and fighting today’s ‘big killers’ like certain types of cancer.

Q. Would you personally be interested in having your genome sequenced?
As a data exploration exercise I would find this really interesting, but the definitive answers you can get from it are still limited today. I would certainly want to make sure this data is kept private and under my control. With BlueGnome now being part of Illumina we can actually help to develop these ideas further.

Further information: Nature's ENCODE portal, "An integrated encyclopedia of DNA elements in the human genome" publication, Guardian interview with Ewan Birney

SAM format summary

August 30th, 2012

The Sequence Alignment/Map (SAM) format is a generic format for storing read alignments against reference sequences. It is a text format that stores the data in a series of tab-delimited ASCII columns and is commonly used in next-generation sequencing data processing. It is the human-readable (non-binary) counterpart of the BAM format and contains information about each read and its aligned position in the genome. It was developed by Heng Li in Richard Durbin's group and others; their paper is here.

After a header section the alignment section describes all results of the aligned read data. The format is best explained with an example line:


1:497:R:-272+13M17D24M  113  1  497  37  37M  15  100338662  0  CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG  0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>  XT:A:U  NM:i:0  SM:i:37  AM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:37
Fieldname	description	Example-data
QNAME	read name	1:497:R:-272+13M17D24M
FLAG	alignment flag	113
RNAME	alignment chromosome	1
POS	alignment start position	497
MAPQ	overall mapping quality	37
CIGAR	alignment CIGAR string	37M
MRNM/RNEXT	name of next alignm. in group (mate)	15
MPOS/PNEXT	pos. of next alignm. in group (mate)	100338662
ISIZE/TLEN	observed Template LENgth	0
SEQ	read sequence	CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG
QUAL	quality per base	0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>
TAGs	further tags with alignment info	XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
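
To make the column layout concrete, here is a minimal JavaScript sketch that splits the example line into the eleven mandatory fields and decodes the FLAG value 113 into its individual bits. The function and variable names (parseSamLine, decodeFlag, SAM_FIELDS, FLAG_BITS) are my own and purely illustrative; the bit meanings follow the SAM specification. It assumes the columns are separated by tabs, as in a real SAM file.

```javascript
// Names of the 11 mandatory SAM columns, in file order.
const SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"];

// Meaning of each FLAG bit according to the SAM specification.
const FLAG_BITS = [
  [0x1, "paired"],
  [0x2, "properly paired"],
  [0x4, "read unmapped"],
  [0x8, "mate unmapped"],
  [0x10, "read reverse strand"],
  [0x20, "mate reverse strand"],
  [0x40, "first in pair"],
  [0x80, "second in pair"],
  [0x100, "secondary alignment"],
  [0x200, "fails quality checks"],
  [0x400, "PCR or optical duplicate"],
  [0x800, "supplementary alignment"],
];

// Split one alignment line into named fields; optional tags stay as strings.
function parseSamLine(line) {
  const cols = line.replace(/\r?\n$/, "").split("\t");
  const record = {};
  SAM_FIELDS.forEach((name, i) => { record[name] = cols[i]; });
  for (const name of ["FLAG", "POS", "MAPQ", "PNEXT", "TLEN"]) {
    record[name] = Number(record[name]);
  }
  record.TAGS = cols.slice(SAM_FIELDS.length);
  return record;
}

// Return the labels of all bits that are set in a FLAG value.
function decodeFlag(flag) {
  return FLAG_BITS.filter(([bit]) => (flag & bit) !== 0)
                  .map(([, label]) => label);
}

// The example line from above, rebuilt with tab separators.
const exampleLine = [
  "1:497:R:-272+13M17D24M", "113", "1", "497", "37", "37M", "15", "100338662", "0",
  "CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG",
  "0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>",
  "XT:A:U", "NM:i:0", "SM:i:37", "AM:i:0", "X0:i:1",
  "X1:i:0", "XM:i:0", "XO:i:0", "XG:i:0", "MD:Z:37",
].join("\t");

const rec = parseSamLine(exampleLine);
console.log(rec.RNAME, rec.POS, rec.CIGAR);
console.log(decodeFlag(rec.FLAG).join(", "));
```

For the example read, FLAG 113 = 64 + 32 + 16 + 1, i.e. a paired read that is first in its pair, with both the read and its mate aligned to the reverse strand.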

The tags are optional and may vary between alignment programs. Shown are examples from BWA. The tags X0:i (number of best hits of this read in the genome) and XM:i (number of mismatches in the alignment) are usually the important ones for filtering.

       Tag	Meaning
       NM	Edit distance
       MD	Mismatching positions/bases
       AS	Alignment score
       BC	Barcode sequence
       X0	Number of best hits
       X1	Number of suboptimal hits found by BWA
       XN	Number of ambiguous bases in the reference
       XM	Number of mismatches in the alignment
       XO	Number of gap opens
       XG	Number of gap extensions
       XT	Type: Unique/Repeat/N/Mate-sw
       XA	Alternative hits; format: (chr,pos,CIGAR,NM;)*
       XS	Suboptimal alignment score
       XF	Support from forward/reverse alignment
       XE	Number of supporting seeds
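
Filtering on the X0 and XM tags could look like the following JavaScript sketch. The helper names (parseTags, isUniqueHit) and the mismatch threshold are my own choices for illustration, not part of BWA or of any SAM library; the sketch only assumes the TAG:TYPE:VALUE layout shown above.

```javascript
// Turn optional "TAG:TYPE:VALUE" columns into an object.
// Values of type "i" (integer) are converted to numbers, all others stay strings.
function parseTags(tagColumns) {
  const tags = {};
  for (const col of tagColumns) {
    const [name, type, ...rest] = col.split(":");
    const value = rest.join(":"); // a value may itself contain ':'
    tags[name] = type === "i" ? parseInt(value, 10) : value;
  }
  return tags;
}

// Keep a read only if BWA reports exactly one best hit (X0)
// and no more than maxMismatches mismatches (XM).
function isUniqueHit(tags, maxMismatches) {
  return tags.X0 === 1 && tags.XM <= maxMismatches;
}

// Tags of the example read shown earlier.
const exampleTags = parseTags(
  "XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37".split(" ")
);
console.log(exampleTags.X0, exampleTags.XM, isUniqueHit(exampleTags, 2));
```

Reads from aligners that do not write X0 or XM would fail this check, so the filter is only meaningful for BWA-style output.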

Read names (at least those from Illumina machines) are constructed as:

[instrument-name]:[run ID]:[flowcell ID]:[lane-number]:[tile-number]:
[x-pos]:[y-pos] [read number]:[is filtered]:[control number]:
[barcode sequence]


@M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4
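
A read name of this form can be taken apart with a few string splits. The JavaScript sketch below is only an illustration under the layout given above; the function name parseIlluminaReadName and the property names are made up for this example.

```javascript
// Split an Illumina read name into its components.
// Expected layout: @instrument:run:flowcell:lane:tile:x:y read:filtered:control:barcode
function parseIlluminaReadName(header) {
  const [location, description = ""] = header.replace(/^@/, "").split(" ");
  const [instrument, runId, flowcellId, lane, tile, x, y] = location.split(":");
  const [readNumber, isFiltered, controlNumber, barcode] = description.split(":");
  return {
    instrument,
    runId: Number(runId),
    flowcellId,
    lane: Number(lane),
    tile: Number(tile),
    x: Number(x),
    y: Number(y),
    readNumber: Number(readNumber),
    isFiltered: isFiltered === "Y", // "Y" = read was filtered out, "N" = not filtered
    controlNumber: Number(controlNumber),
    barcode, // kept as a string
  };
}

const readName = parseIlluminaReadName(
  "@M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4"
);
console.log(readName.instrument, readName.flowcellId, readName.lane, readName.tile);
```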

Sources: genome.sph.umich.edu (with further useful details), full specs.

Male infertility genetics

August 17th, 2012

10-15% of couples in the western world face some kind of infertility issue; in almost half of these cases there are (co-)factors on the male side.
Male infertility factors are often based on sperm abnormalities which can be categorized into:

  • Azoospermic: no sperm in the semen
  • Oligozoospermic: a low sperm count
  • Asthenozoospermic: poor sperm motility
  • Teratozoospermic: abnormal sperm morphology

The genetic region responsible for spermatogenesis and most of these abnormalities is the azoospermia factor (AZF) region on Yq11. It contains the sub-regions AZFa, AZFb and AZFc. Microdeletions in these regions are responsible for many genetic causes of male infertility. Alterations in the AZFc region (which contains the genes PRY2, BPY2, DAZ and CDY1) are believed to be the most frequent molecularly defined cause of spermatogenic failure. This is due to high genomic variability: AZFc is one of the most genetically dynamic regions in the human genome. This property may serve as a counter against the genetic degeneracy associated with the lack of a meiotic partner, meaning that no exchange of genetic material with a counterpart chromosomal region from the mother can happen.
Intracytoplasmic sperm injection (ICSI) can result in pregnancies, but passes on the genetic infertility to any sons born.

It has been reported that the average sperm count for men in the western world has declined by up to 50% in the past 50 years. These findings are not conclusive, however, as different studies found different trends around the world. It does seem clear, though, that exposure to chemical compounds in our environment will influence the hormone balance, have an adverse effect on male fertility and promote diseases like testicular cancer.

Sources: srlworld.com, endotext.org, Page et al. (1999), Navarro-Costa et al. (2010).

Display today's Date with JavaScript

August 10th, 2012

To display the current date, day of the week and time on a web page, you don't want to refresh the entire page every second or minute. Instead you will want to use JavaScript to dynamically update just this date/clock display element. Here is the code for a display in the format

Friday, 10.8.2012    9:41:49


<!DOCTYPE html>
<html>
<head>
<script type="text/javascript">
// Write the current weekday, date and time into the element with id "txt".
function startTime(){
  var today=new Date();
  var h=today.getHours();
  var m=today.getMinutes();
  var s=today.getSeconds();
  var month = today.getMonth() + 1; // getMonth() counts from 0
  var day = today.getDate();
  var myDays= ["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"];
  var weekday = today.getDay(); // 0 = Sunday
  var wday = myDays[weekday];
  var year = today.getFullYear();
  // add a zero in front of numbers<10
  m=checkTime(m);
  s=checkTime(s);
  document.getElementById('txt').innerHTML=wday + ", " + day + "." + month + "." + year + "&nbsp;&nbsp;&nbsp;&nbsp;" + h+":"+m+":"+s;
  // update the display again in half a second
  setTimeout(startTime,500);
}
function checkTime(i){
  if (i<10){
    i="0" + i;
  }
  return i;
}
</script>
</head>
<body onload="startTime()">
<div id="txt"></div>
</body>
</html>
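
If you want to check the formatting without a browser, the string building can be separated from the page update. The following is a small sketch of that idea, not part of the page above; the names DAY_NAMES, pad2 and formatDateTime are my own, and plain spaces stand in for the non-breaking spaces used in the HTML.

```javascript
const DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
                   "Thursday", "Friday", "Saturday"];

// Add a leading zero to numbers below 10.
function pad2(n) {
  return n < 10 ? "0" + n : String(n);
}

// Format a Date as "Weekday, D.M.YYYY    H:MM:SS" using local time.
function formatDateTime(date) {
  const datePart = date.getDate() + "." + (date.getMonth() + 1) + "." + date.getFullYear();
  const timePart = date.getHours() + ":" + pad2(date.getMinutes()) + ":" + pad2(date.getSeconds());
  return DAY_NAMES[date.getDay()] + ", " + datePart + "    " + timePart;
}

// Months are zero-based in the Date constructor, so 7 means August.
console.log(formatDateTime(new Date(2012, 7, 10, 9, 41, 49)));
// prints "Friday, 10.8.2012    9:41:49"
```

In the page, startTime() could then simply assign the result of formatDateTime(new Date()) to the display element.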

Sources: trans4mind.com, w3schools.com

Data Processing with Biopieces

August 2nd, 2012

Biopieces is a fine set of scripts that form an orderly pipeline (or framework) for processing bioinformatics data on the Unix command line. You can, e.g., process sequencing (NGS) data like this:


./read_fastq -n 1000 -i data/reads.fastq | ./plot_scores -t png -o data/scores.png --no_stream

to read the first 1000 sequences from a FASTQ file and plot the scores to an image file.
The result might look like this:
[Figure: Data Processing with Biopieces]

The general logic is
        read_data | calculate_something | write_results
with the data being passed through as a "stream" and all modules having the same interface to each other. Installation instructions are here; on my Ubuntu VM I had to follow these steps:

  1. we need Perl, Ruby, Python, SVN. Install as needed.


    sudo apt-get install subversion
  2. get biopieces code:


    svn checkout http://biopieces.googlecode.com/svn/trunk/ biopieces
    cd biopieces
    svn checkout http://biopieces.googlecode.com/svn/wiki bp_usage
  3. check pre-requisites with the project's installer script


    bash biopieces_installer.sh
  4. missing Perl modules were listed nicely and could be installed as suggested.
  5. missing Ruby gems could not be installed due to incompatibilities, e.g.:


    sudo gem install RubyInline
    ERROR: Error installing RubyInline: ZenTest requires RubyGems version > 1.8.

    But the project supplies an excellent Ruby installer on the downloads page to create a separate Ruby 1.9 installation: the default 1.8 one is too old for biopieces, and the newer one is not officially supported on Ubuntu.
  6. modify your ~/.bashrc file to include:


    export BP_DIR="$HOME/bin/biopieces"
    export BP_DATA="$HOME/bin/biopieces/BP_DATA"
    export BP_TMP="$HOME/bin/biopieces/tmp"
    export BP_LOG="$HOME/bin/biopieces/BP_LOG"
    export PATH="$HOME/bin/biopieces/ruby_install/bin:$HOME/bin/biopieces/biopieces/bp_bin:$PATH"
    export RUBYLIB="$HOME/bin/biopieces/biopieces/code_ruby/lib:$RUBYLIB"
    export PERL5LIB="$HOME/bin/biopieces/biopieces/code_perl:$PERL5LIB"


    source ~/.bashrc
    mkdir $BP_DATA $BP_TMP $BP_LOG
    The Ruby and Perl lib definitions are necessary to avoid errors like


    cannot load such file -- maasha/biopieces (LoadError)
    Can't locate Maasha/Fasta.pm in @INC

Some of the almost 200 methods that are implemented in biopieces at this time include:

  • read and write various formats like bed, tab, gff, fasta, fastq
  • blast sequences against each other or against a genome
  • calculate the N50 value for a set of sequences
  • create statistics about the exon, intron, etc. content of a (12-column) BED file
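
As an aside on the N50 value mentioned in the list: it is commonly defined as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length. The JavaScript sketch below follows that common definition; it is my own illustration and not taken from the biopieces code, and the function name n50 is an assumption.

```javascript
// N50: sort lengths from longest to shortest and walk down the list until
// the running sum reaches at least half of the total length.
function n50(lengths) {
  if (lengths.length === 0) return 0; // convention chosen here for empty input
  const sorted = [...lengths].sort((a, b) => b - a);
  const total = sorted.reduce((sum, len) => sum + len, 0);
  let running = 0;
  for (const len of sorted) {
    running += len;
    if (running * 2 >= total) return len;
  }
  return 0; // not reached for non-empty input with positive lengths
}

// Total length is 54; the three longest sequences (10 + 9 + 8 = 27) reach half of it.
console.log(n50([2, 3, 4, 5, 6, 7, 8, 9, 10])); // prints 8
```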