Comparing instance prices on the Amazon cloud

April 13th, 2017

As the largest cloud computing company, Amazon Web Services (AWS) offers various options for using compute power on an "as-needed" basis. You can choose the size and type of machine and the number of machines, and you can choose a pricing model where you "bid" for the resource (the spot price). This means you might have to wait longer to get it, but you will get an impressive discount! You can choose your machines from the AWS dashboard.


Here is a comparison of the current prices for "General Purpose - Current Generation" AWS machines in the EU (Frankfurt) region (as of 13/04/2017):

Instance     vCPU  ECU    Memory (GiB)  Instance Storage (GB)  On-Demand Price per Hour (Linux/UNIX)  Spot Price per Hour  Saving %
m4.large     2     6.5    8             EBS Only               $0.129                                 $0.0336              74
m4.xlarge    4     13     16            EBS Only               $0.257                                 $0.0375              85
m4.2xlarge   8     26     32            EBS Only               $0.513                                 $0.1199              77
m4.4xlarge   16    53.5   64            EBS Only               $1.026                                 $0.3536              66
m4.10xlarge  40    124.5  160           EBS Only               $2.565                                 $1.1214              56
m4.16xlarge  64    188    256           EBS Only               $4.104                                 $0.503               88
m3.medium    1     3      3.75          1x4 SSD                $0.079                                 $0.0114              86
m3.large     2     6.5    7.5           1x32 SSD               $0.158                                 $0.0227              86
m3.xlarge    4     13     15            2x40 SSD               $0.315                                 $0.047               85
m3.2xlarge   8     26     30            2x80 SSD               $0.632                                 $0.1504              76

This shows only a selection of machine options, and prices obviously change over time, but the message should be clear...
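
The "Saving %" column can be reproduced from the two price columns. Here is a minimal Python sketch, assuming the saving is the spot discount relative to the on-demand price, rounded to a whole percent. The function name `spot_saving_percent` is made up for illustration; the prices are copied from the table above.

```python
def spot_saving_percent(on_demand: float, spot: float) -> int:
    """Discount of the spot price relative to the on-demand price, in whole percent."""
    return round((1 - spot / on_demand) * 100)


# (on-demand, spot) prices per hour in USD, taken from the comparison table
prices = {
    "m4.large": (0.129, 0.0336),
    "m4.16xlarge": (4.104, 0.503),
    "m3.medium": (0.079, 0.0114),
}

for name, (on_demand, spot) in prices.items():
    print(f"{name}: {spot_saving_percent(on_demand, spot)}% saving")
```

For these three rows, this gives 74, 88 and 86 percent, matching the table.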


Software Requirements Specification

March 2nd, 2017

For any large software project (i.e. one that requires more than a few scripts performing a one-off task) and for every project that was initiated by a customer request, it is useful to define the requirements precisely before starting to write any code. This might be painful at times and slow down the coding fun, but it should avoid a lot of frustration on either side in the end.

Here is a short summary of what a Software Requirements Specification (SRS, IEEE 830) is, how to write one, and what it is good for.

SRS is a complete description of the behavior of a system to be developed, including use cases.

The benefits of writing specifications when planning a software project are:

  • Establish the basis for agreement between the customers and the suppliers on what the software product is to do.
  • Reduce the development effort by revealing omissions, misunderstandings, and inconsistencies early in the development cycle, thereby avoiding redesign, recoding, and retesting.
  • Provide a basis for estimating costs and schedules.
  • Provide a baseline for validation (comparison against what the customer needs) and verification (comparison with the formal specifications).
  • Facilitate transfer to new users or new machines.
  • Serve as a basis for enhancement.

Key points to address:

  • Required functionality.
  • External interfaces.
  • Performance.
  • Attributes.
  • Design constraints imposed on an implementation.

Avoid design details and coding details in the specs. Hardware requirements etc. go into the general System Specifications, not the SRS. The content and language of the document should match the following keywords:

Complete, Consistent, Accurate, Modifiable, Ranked, Testable, Traceable, Unambiguous, Valid, Verifiable

Descriptions of "use cases", mock-up GUI components and other visual aids are extremely useful to communicate with the parties involved.


BCL files

December 30th, 2016

As part of the primary analysis, Illumina sequencing machines measure the intensity of the channels used for encoding the different bases and identify the most likely base at a given position of a sequencing read (tag). The Real Time Analysis (RTA) software writes the base and the confidence in the call, expressed as a quality score, to base call (.bcl) files. As the name implies, this is done in real time, i.e. for every cycle of the sequencing run a call for every location identified on the flow cell (tiles and lanes) is added. BCL files are stored in binary format and represent the raw data output of a sequencing run. The format is described here. Software such as Casava/BclToFastq, Eland or the iSAAC aligner can make use of these files.
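
To make the binary layout more concrete, here is a minimal Python sketch for decoding one uncompressed .bcl file. It assumes the commonly documented layout, which is not spelled out in this post and should be checked against the format description: a 4-byte little-endian cluster count, then one byte per cluster with the base in the two lowest bits (A, C, G, T = 0, 1, 2, 3) and the quality score in the upper six bits, where an all-zero byte means no call. The function name `decode_bcl` and the example bytes are made up for illustration.

```python
import struct

BASES = "ACGT"


def decode_bcl(data: bytes) -> list[tuple[str, int]]:
    """Decode the bytes of one uncompressed .bcl file into (base, quality) pairs."""
    # Assumed header: number of clusters as unsigned 32-bit little-endian integer
    (n_clusters,) = struct.unpack_from("<I", data, 0)
    calls = []
    for byte in data[4:4 + n_clusters]:
        if byte == 0:
            calls.append(("N", 0))  # all bits zero: no call
        else:
            # lowest two bits encode the base, remaining six bits the quality score
            calls.append((BASES[byte & 0b11], byte >> 2))
    return calls


# Hand-built example with three clusters: A (Q30), T (Q20) and a no call
example = struct.pack("<I", 3) + bytes([(30 << 2) | 0, (20 << 2) | 3, 0])
print(decode_bcl(example))  # [('A', 30), ('T', 20), ('N', 0)]
```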

The *.bcl files are stored in the BaseCalls directory:

<run directory>/Data/Intensities/BaseCalls/L<lane>/C<cycle>.1

They are named in the format:


If you want to avoid errors during downstream processing caused by missing calls, software such as iSAAC and configureBclToFastq offers an "--ignore-missing-bcl" command line option. This will interpret missing *.bcl files as no call (N) at that position.
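
The effect of treating a missing cycle as a no call can be illustrated with a small Python sketch. This is not how iSAAC or configureBclToFastq implement the option; it only mimics the behaviour described above. The function `build_reads` and its input (a dictionary mapping cycle number to a string with one base per cluster) are hypothetical.

```python
def build_reads(calls_by_cycle: dict[int, str], n_cycles: int, n_clusters: int) -> list[str]:
    """Assemble one read per cluster; cycles without base calls contribute 'N'."""
    reads = []
    for cluster in range(n_clusters):
        bases = []
        for cycle in range(1, n_cycles + 1):
            cycle_calls = calls_by_cycle.get(cycle)
            # A missing cycle is interpreted as a no call at that position
            bases.append(cycle_calls[cluster] if cycle_calls is not None else "N")
        reads.append("".join(bases))
    return reads


# Two clusters, four cycles, with the calls for cycle 3 missing
calls = {1: "AC", 2: "GT", 4: "TT"}
print(build_reads(calls, n_cycles=4, n_clusters=2))  # ['AGNT', 'CTNT']
```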

Sources: Illumina, SeqAnswers

Embryo Morphology Assessment

July 8th, 2015

Some researchers and clinicians believe embryo morphology and development characteristics can be used to assess the viability of IVF embryos to increase chances of a successful pregnancy.

Healthy embryos, i.e. the most viable zygotes that will develop into blastocysts and beyond, seem to follow a specific growth pattern between development day 3 and re-implantation on day 5:
Growth from 2 to 3 cells should be seen in 9 - 11 hours, and from 3 to 4 cells in under 2 hours. Reaching day 5 is critical, as the embryo will then be re-implanted into the uterus and will attach to the endometrium. The normal development process is shown in figure 1 (source: CMFT NHS):

[Figure 1: normal embryo development]

Embryo morphology is graded on a scale of 1 to 5 as shown in figure 2 (source: CMFT NHS):

[Figure 2: embryo morphology grading scale]

Embryo cell division can be monitored with an "embryoscope", an incubator with an integrated camera. Time-lapse pictures are analysed by an embryologist to help select viable embryos. Systems that support the monitoring process include the "Early Embryo Viability Assessment" (Eeva) software by Auxogyn.


Cell tracking and embryo assessment with Eeva (YouTube)



Cystic Fibrosis and its Analysis

June 17th, 2015

Cystic fibrosis (CF), also called mucoviscidosis, is a hereditary (autosomal recessive) disease in which exocrine (secretory) glands produce abnormally thick mucus. This mucus can cause problems with digestion, breathing, and body cooling. It affects up to one in 3,000 newborns of northern European ancestry. There are well over a hundred genetic changes linked to CF. It is an area in which companies like Illumina are very active, with a dedicated assay cleared by the FDA as an in-vitro diagnostic test for the detection of most of the genetic variants known to cause the disease.

Here are notes from a presentation Dr. Carlos Bustamante gave at a recent ClinGen conference:

Background for CF and CFTR


  • Cystic Fibrosis Transmembrane Conductance Regulator
  • ABC transporter (ATP-binding cassette) that functions as an ion channel
  • cAMP-regulated through R domain phosphorylation
  • Transports chloride and thiocyanate across epithelial cell membranes
  • 1,480 amino acids 

CF disease:

  • Most common autosomal recessive disorder among Caucasians (1/3,300)
  • Dysregulation of epithelial fluid transport in lung, pancreas, and other organs
  • ~ 2,000 identified gene mutations
  • Phe508del – most common, in 70% cases
  • Wide range of severity, most die of pulmonary disease at mean age of 37


  • ~70% of variants currently classed as VUS (variant of unknown significance)
  • ~65% are missense mutations, 24% frameshift & stop-gained, 9% synonymous

Testing Machine Learning Approaches for CF classification

  • Machine learning algorithms show higher performance when compared with separate predictors
  • Tree-based methods perform the best (GBM & RF AUC is 6% higher than that of the best single predictor, MutPred)
  • Top features: MutPred, AF, SIFT, CADD, POSE
  • Predicted pathogenicity probability (RF.pred) correlates with available experimental data for Cl- conductance and sweat Cl-
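
The AUC used above to compare classifiers can be computed without any library. Below is a minimal Python sketch, assuming the usual definition: the AUC is the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, with ties counting half. The labels and scores are made-up toy values, not data from the presentation, and the names `single_predictor` and `combined_model` are hypothetical.

```python
def auc(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    # Count positive/negative pairs ranked correctly; ties count as half
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = pathogenic, 0 = benign (invented values)
labels = [1, 1, 1, 0, 0, 0]
single_predictor = [0.9, 0.4, 0.6, 0.5, 0.3, 0.2]
combined_model = [0.9, 0.7, 0.6, 0.5, 0.3, 0.2]

print(f"single predictor AUC: {auc(labels, single_predictor):.2f}")  # 0.89
print(f"combined model AUC:   {auc(labels, combined_model):.2f}")    # 1.00
```

In this toy case the combined model ranks every pathogenic variant above every benign one, while the single predictor gets one of the nine pairs wrong.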

 Other sources used: PubMedHealth, Wikipedia