Windows Task Scheduler Error

April 16th, 2013

A scheduled task on Microsoft Windows Server 2008 that had previously run without problems failed "due to a time trigger condition", with an error message that included "Data: Error Value 2147943726."
The reason was that the network-wide password for the user account assigned to run the task had been changed since the task was set up.
Re-opening the task properties (double-click the task in the "Active Tasks" list and select "Options" from the right-hand menu) and saving with the new password fixed the problem.
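
The error value is easier to interpret once it is split into its parts. The following is a minimal sketch, assuming the number is a standard 32-bit HRESULT; the function name decode_hresult is my own and not part of any Windows tool. It shows that the value is a failure in the Win32 facility (7) with code 1326, which is the Win32 logon failure code (wrong user name or password) and so fits the changed password.

```python
def decode_hresult(value):
    """Split a 32-bit HRESULT into (failed, facility, code)."""
    value &= 0xFFFFFFFF                # treat the number as unsigned 32-bit
    failed = bool(value >> 31)         # top bit set means failure
    facility = (value >> 16) & 0x7FF   # 11-bit facility field
    code = value & 0xFFFF              # low 16 bits: the underlying error code
    return failed, facility, code


if __name__ == "__main__":
    # Error value taken from the Task Scheduler message.
    print(hex(2147943726), decode_hresult(2147943726))
```

The code 1326 can then be looked up in the list of Win32 system error codes.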

Chromosome lengths

March 20th, 2013

Here is a quick list of the sizes of human chromosomes in assembly GRCh37 as defined by Ensembl:

chrom	 length [bp]
 1	 249,250,621 
 2	 243,199,373 
 3	 198,022,430 
 4	 191,154,276 
 5	 180,915,260 
 6	 171,115,067 
 7	 159,138,663 
 8	 146,364,022 
 9	 141,213,431 
10	 135,534,747 
11	 135,006,516 
12	 133,851,895 
13	 115,169,878 
14	 107,349,540 
15	 102,531,392 
16	  90,354,753 
17	  81,195,210 
18	  78,077,248 
19	  59,128,983 
20	  63,025,520 
21	  48,129,895 
22	  51,304,566 
X	 155,270,560 
Y	  59,373,566 
Mt	      16,569

These sizes are useful for calculations of percent coverage of genomic features or sequencing reads.
They are often required when working with BED files.
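
As a rough illustration of such a coverage calculation, here is a small sketch that uses the lengths from the table above. The function names are my own, and the intervals are assumed to be BED-style (0-based start, end not included), so an interval covers end minus start bases. Overlapping intervals are merged first so that no base is counted twice.

```python
# GRCh37 chromosome lengths in bp, copied from the table above.
GRCH37_LENGTHS = {
    "1": 249250621, "2": 243199373, "3": 198022430, "4": 191154276,
    "5": 180915260, "6": 171115067, "7": 159138663, "8": 146364022,
    "9": 141213431, "10": 135534747, "11": 135006516, "12": 133851895,
    "13": 115169878, "14": 107349540, "15": 102531392, "16": 90354753,
    "17": 81195210, "18": 78077248, "19": 59128983, "20": 63025520,
    "21": 48129895, "22": 51304566, "X": 155270560, "Y": 59373566,
    "Mt": 16569,
}


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def percent_coverage(chrom, intervals):
    """Percentage of a chromosome covered by BED-style intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return 100.0 * covered / GRCH37_LENGTHS[chrom]


if __name__ == "__main__":
    # Two overlapping example intervals on chromosome 21 (made-up coordinates).
    print(percent_coverage("21", [(10_000_000, 20_000_000), (15_000_000, 25_000_000)]))
```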

Related: Chromosome ideograms and nomenclature, chromosome GC content

Simple Website monitor

November 2nd, 2012

There are many sophisticated services and scripts to monitor the accessibility of your website or various aspects of your web server. Check Stack Overflow and look at Monastic for examples. Here is a very simple solution I needed to monitor the availability of a specific server using its IP address within the internal network. It was necessary after the server's IP address, which is used in third-party software to provide specific services, was "stolen" by other machines. In this case the DHCP server assigned the IP address that should have been reserved to mobile devices that connected to the wireless network.

This approach simply fetches a page from a specific URL using the "reserved" IP and looks for a word or pattern you know should be there. The script is run on a second machine (host name "ubuntu64"), an Ubuntu VM. (It does not use any of the additional security measures you will want if you expose the machine externally.)

Prepare second machine to send notification emails:
Install sendmail, sendemail, mailutils, sensible-mda (to have the whole set).
Add/modify entry in /etc/hosts: ubuntu64

run "sudo sendmailconfig"
test with


sendemail -q -f -t -u "mailtest" -m "mail works!"

Write bash script to get and check website and send alert emails:


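# define address and pattern to expect (values not preserved here; set your own)
address="..."
searchword="..."
# define alert email (sender and receiver not preserved here; set your own)
sender="..."
receiver="..."
body="system on machine at risk"
subj="Important server unresponsive"
# fetch page and count the lines that contain the pattern
resp=$(wget -q -O - "$address" | grep -c "$searchword")
# no matching line means the page is missing or not the expected one
if [ "$resp" -lt 1 ]; then
  sendemail -q -f "$sender" -t "$receiver" -u "$subj" -m "$body"
fi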

Add a crontab entry to automatically run this script every 10 minutes:


*/10 * * * * sh /home/user/

Additional improvements could include the options to stop alerting after a specific number of alerts or checking the response time.
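
These two improvements could be sketched as follows. This is only an illustration, not the script I used: the function names, the default limit of three alerts and the ten-second timeout are assumptions. Because cron starts the script anew each time, the count of consecutive failures would have to be stored between runs, for example in a small file.

```python
import time
import urllib.request


def check_page(url, searchword, timeout=10.0, fetch=None):
    """Return (found, seconds): is searchword in the page, and how long the fetch took."""
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
    started = time.monotonic()
    try:
        page = fetch(url)
    except OSError:
        # Connection errors and timeouts count as "pattern not found".
        page = ""
    elapsed = time.monotonic() - started
    return searchword in page, elapsed


def should_alert(consecutive_failures, max_alerts=3):
    """Alert on a failure, but stop once max_alerts alerts have been sent in a row."""
    return 0 < consecutive_failures <= max_alerts
```

A slow response could be treated like a failure by comparing the returned number of seconds with a threshold of your choice.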

Alternatively you can just look up the MAC address associated with the "reserved" IP and compare it to the known physical address of your server and wrap this up into a little script:

>arp -a

Interface: --- 0xb
  Internet Address      Physical Address      Type
                        00-11-18-2c-2e-6d     dynamic
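
Such a little script could look like the sketch below. It assumes the Windows-style "arp -a" output layout shown above. The function names are my own, and the IP addresses in the usage example are made up, since the real ones are not shown here.

```python
import re
import subprocess

# One table row: IPv4 address, MAC address (with - or : separators), entry type.
ARP_ROW = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:[-:][0-9A-Fa-f]{2}){5})\s+(\w+)",
    re.MULTILINE,
)


def normalise_mac(mac):
    """Lower-case a MAC address and use '-' as separator."""
    return mac.lower().replace(":", "-")


def parse_arp_table(text):
    """Map each IP address in 'arp -a' output to its MAC address."""
    return {ip: normalise_mac(mac) for ip, mac, _kind in ARP_ROW.findall(text)}


def ip_has_expected_mac(text, ip, expected_mac):
    """True if the ARP table lists the expected MAC address for this IP."""
    return parse_arp_table(text).get(ip) == normalise_mac(expected_mac)


def read_arp_table():
    """Run 'arp -a' and return its output as text."""
    return subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout
```

For example, ip_has_expected_mac(read_arp_table(), "192.168.0.10", "00-11-18-2c-2e-6d") returns False when another machine has taken over the (hypothetical) reserved address, and also when the address is not in the ARP cache at all.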

ENCODE publication interview

October 1st, 2012

Following on from the publication of the main papers of the ENCODE (Encyclopedia Of DNA Elements) scale-up phase, I gave an interview to BlueGnome's marketing team for the Newstrack customer newsletter in 2012.

These are my personal opinions, not my employer's (past or present). They might be of interest to researchers considering joining a large-scale project like this.

Q. What was it like to be part of the ENCODE project?
It was a great experience to work on a project of this scale with more than 400 scientists from 32 groups spread across the globe. Many of them are the leaders in their field, but at consortium meetings and the many phone conferences everyone could contribute. The amount of data and different technologies was overwhelming at times, so I think it’s an impressive achievement how this project was run and now the findings have been published.

Q. What are the main outcomes of the project?
There has been a very lively discussion about the outcome and how it was presented. In my opinion, the most important result is the data itself. ENCODE has created an enormous repository of measurements across the human genome that has been compiled in a systematic and standardised way. The data will be the basis of future research trying to understand genomic processes involved in basic cellular processes as well as in various diseases.
ENCODE has pushed the development of standards and new applications to interrogate the genome, in particular using sequencing technologies.
The results also remind us that there is a lot of activity in the genome that we currently do not fully understand. Up to 80% of the human genome is biochemically active, there are thousands of additional (non-coding) genes in introns and in the intergenic space, and up to 75% of the genome is transcribed at some point. These observations paint a very dynamic genomic landscape, with overlapping active zones and signals of different complexity, indicating that we have to keep the concept of genes and genome regulation pretty flexible in our mind.

Q. What are potential implications for BlueGnome and its customers?
I’m afraid the interpretation of CNV regions is getting even more complex as regulatory regions far away from the actual disease genes might be relevant for cases the clinical customers might come across. This is especially true for the interpretation of cancer profiles – which is highly complex already. We won’t be able to use these new interconnections directly in most cases, but we are looking through the data and have started to incorporate the knowledge by providing new genome-wide annotation data sets as optional BED files on the BlueGnome website, e.g. with GWAS results and regulatory element locations.

Q. Where do you see the human genome in 5 years’ time?
ENCODE is entering its next phase now to extend the catalogue to many additional cell lines as well as the mouse genome. With the recent publications scientists around the world are now more aware of this data and how to use it, so my hope is that we will see an acceleration in algorithm development, data mining and scientific findings. In 5 years we still won’t understand the genome entirely, but we should have a complete parts list and more connections between the parts. Some of these will be clinically relevant to allow progress in understanding and fighting today’s ‘big killers’ like certain types of cancer.

Q. Would you personally be interested in having your genome sequenced?
As a data exploration exercise I would find this really interesting, but the definitive answers you can get from it are still limited today. I would certainly want to make sure this data is kept private and under my control. With BlueGnome now being part of Illumina we can actually help to develop these ideas further.

Further information: Nature's Encode portal, "An integrated encyclopedia of DNA elements in the human genome" publication, Guardian Interview with Ewan Birney

SAM format summary

August 30th, 2012

The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences. It is a text format that stores sequence data in a series of tab-delimited ASCII columns and is commonly used in next-generation sequencing data processing. It is the human-readable (non-binary) version of the BAM format and contains information about the read and its aligned position in the genome. It was developed by Heng Li in Richard Durbin's group and others; their paper is here.

After a header section the alignment section describes all results of the aligned read data. The format is best explained with an example line:


1:497:R:-272+13M17D24M  113  1  497  37  37M  15  100338662  0  CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG  0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>  XT:A:U  NM:i:0  SM:i:37  AM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:37
Fieldname	description	Example-data
QNAME	read name	1:497:R:-272+13M17D24M
FLAG	alignment flag	113
RNAME	alignment chromosome	1
POS	alignment start position	497
MAPQ	overall mapping quality	37
CIGAR	alignment CIGAR string	37M
MRNM/RNEXT	name of next alignm. in group (mate)	15
MPOS/PNEXT	pos. of next alignm. in group (mate)	100338662
ISIZE/TLEN	observed Template LENgth	0
SEQ	read sequence	CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG
QUAL	quality per base	0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>
TAGs	further tags with alignment info	XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
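
To make the column layout concrete, here is a minimal sketch that splits an alignment line into the eleven mandatory fields and decodes the FLAG value. It assumes the line is tab-delimited as in a real SAM file. The function names and the short labels for the flag bits are my own; the bit values are those of the SAM specification.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")

# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line):
    """Split one alignment line into a dict of mandatory fields plus a tag list."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]      # optional tags, left as strings
    return record


def decode_flag(flag):
    """List the labels of all bits set in a FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


if __name__ == "__main__":
    example = "\t".join([
        "1:497:R:-272+13M17D24M", "113", "1", "497", "37", "37M", "15",
        "100338662", "0", "CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG",
        "0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>", "XT:A:U", "NM:i:0",
    ])
    record = parse_sam_line(example)
    print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```

For the example line, FLAG 113 is 1 + 16 + 32 + 64: a paired read, aligned to the reverse strand, with its mate also on the reverse strand, and first in its pair.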

The tags are optional and might vary between alignment programs. The examples shown are from BWA. Usually important for filtering are the tags X0:i (number of best hits of this read in the genome) and XM:i (number of mismatches in the alignment).

       Tag	Meaning
       NM	Edit distance
       MD	Mismatching positions/bases
       AS	Alignment score
       BC	Barcode sequence
       X0	Number of best hits
       X1	Number of suboptimal hits found by BWA
       XN	Number of ambiguous bases in the reference
       XM	Number of mismatches in the alignment
       XO	Number of gap opens
       XG	Number of gap extensions
       XT	Type: Unique/Repeat/N/Mate-sw
       XA	Alternative hits; format: (chr,pos,CIGAR,NM;)*
       XS	Suboptimal alignment score
       XF	Support from forward/reverse alignment
       XE	Number of supporting seeds
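
A filter based on the X0 and XM tags could be sketched like this. It assumes tags in the TAG:TYPE:VALUE form shown in the example line and BWA's meaning for X0 and XM. The function names and the default mismatch limit of 2 are arbitrary choices for illustration.

```python
def parse_tags(tags):
    """Turn ['NM:i:0', 'XT:A:U', ...] into {'NM': 0, 'XT': 'U', ...}."""
    parsed = {}
    for tag in tags:
        name, value_type, value = tag.split(":", 2)
        if value_type == "i":          # integer tag
            value = int(value)
        elif value_type == "f":        # floating point tag
            value = float(value)
        parsed[name] = value           # other types are kept as strings
    return parsed


def is_unique_hit(tags, max_mismatches=2):
    """Keep reads with exactly one best hit and few mismatches."""
    parsed = parse_tags(tags)
    return parsed.get("X0") == 1 and parsed.get("XM", 0) <= max_mismatches


if __name__ == "__main__":
    example_tags = "XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37".split()
    print(is_unique_hit(example_tags))
```

Reads without an X0 tag (for example from aligners other than BWA) are rejected by this sketch, which may or may not be what you want.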

Read names (at least from Illumina machines) are constructed as:

[instrument-name]:[run ID]:[flowcell ID]:[lane-number]:[tile-number]:
[x-pos]:[y-pos] [read number]:[is filtered]:[control number]:
[barcode sequence]


@M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4
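
Taking such a read name apart is straightforward. The sketch below assumes exactly the layout given above (seven colon-separated parts, a space, then four more parts); the function name and dictionary keys are my own.

```python
def parse_read_name(name):
    """Split an Illumina read name into its named parts."""
    location, _, description = name.lstrip("@").partition(" ")
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read_number, is_filtered, control_number, barcode = description.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
        "read_number": int(read_number),
        "is_filtered": is_filtered == "Y",   # "Y" marks a read that was filtered out
        "control_number": int(control_number),
        "barcode": barcode,
    }


if __name__ == "__main__":
    print(parse_read_name("@M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4"))
```

Names that do not follow this layout raise a ValueError in this sketch, which is a simple way to notice reads from other sources.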

Further information: genome.sph.umich.edu with further useful details, full specs.