For any large software project (i.e. one that requires more than a few scripts performing a one-off task) and for every project initiated by a customer request, it is useful to define the requirements precisely before writing any code. This can be painful at times and slows down the coding fun, but it should spare both sides a lot of frustration in the end.
Here is a short summary of what Software Requirements Specifications (SRS, IEEE 830) are, how to write them, and what they are good for.
An SRS is a complete description of the behavior of a system to be developed, including use cases.
The benefits of writing specifications when planning a software project are:
- Establish the basis for agreement between the customers and the suppliers on what the software product is to do.
- Reduce the development effort by avoiding redesign, recoding, and retesting, and by revealing omissions, misunderstandings, and inconsistencies early in the development cycle.
- Provide a basis for estimating costs and schedules.
- Provide a baseline for validation (comparison against what the customer needs) and verification (comparison with the formal specifications).
- Facilitate transfer to new users or new machines.
- Serve as a basis for enhancement.
Key points to address:
- Required functionality.
- External interfaces.
- Design constraints imposed on an implementation.
Avoid design details and coding details in the specs. Hardware requirements etc. go into the general System Specifications, not the SRS. The content and language of the document should match the following key words:
Complete, Consistent, Accurate, Modifiable, Ranked, Testable, Traceable, Unambiguous, Valid, Verifiable
Descriptions of "use cases", mock-up GUI components and other visual aids are extremely useful for communicating with the parties involved.
As part of the primary analysis, Illumina sequencing machines measure the intensity of the channels used for encoding the different bases and identify the most likely base at a given position of a sequencing read (tag). The Real Time Analysis (RTA) software writes the base and the confidence in the call, expressed as a quality score, to base call (.bcl) files. As the name implies, this is done in real time, i.e. for every cycle of the sequencing run a call is added for every location identified on the flow cell (tiles and lanes). Bcl files are stored in binary format and represent the raw data output of a sequencing run. The format is described here. Software such as Casava/BclToFastq, Eland or the iSAAC aligner can make use of these files.
The *.bcl files are stored in the BaseCalls directory:
They are named in the format:
To avoid errors during downstream processing that are caused by missing calls, software such as iSAAC and configureBclToFastq offers an "--ignore-missing-bcl" command line option. With this option, missing *.bcl files are interpreted as no call (N) at that position.
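To make the binary layout more concrete, here is a minimal Python sketch of decoding the contents of one .bcl file. It assumes the commonly documented layout (a 4-byte little-endian cluster count followed by one byte per cluster, with the base in the two lowest bits, the quality score in the upper six bits, and a zero byte meaning no call); check this against the format description for your RTA version. The function names and the `None`-for-missing-file convention are hypothetical and only imitate the effect of ignoring missing files; they are not the code of any Illumina tool.

```python
import struct

BASES = "ACGT"  # assumed encoding: 0=A, 1=C, 2=G, 3=T


def decode_bcl(data: bytes):
    """Decode the raw bytes of one .bcl file into (bases, qualities)."""
    # Assumed header: little-endian unsigned 32-bit number of clusters.
    (n_clusters,) = struct.unpack_from("<I", data, 0)
    calls = data[4:4 + n_clusters]
    if len(calls) != n_clusters:
        raise ValueError("truncated .bcl data")
    bases, quals = [], []
    for byte in calls:
        if byte == 0:
            # A zero byte is treated as "no call".
            bases.append("N")
            quals.append(0)
        else:
            bases.append(BASES[byte & 0b11])  # lower two bits: base
            quals.append(byte >> 2)           # upper six bits: quality
    return "".join(bases), quals


def decode_or_no_call(data, n_clusters: int):
    """Hypothetical helper: a missing file (None) becomes N for every cluster."""
    if data is None:
        return "N" * n_clusters, [0] * n_clusters
    return decode_bcl(data)


# Three clusters: a no call, C with Q30, T with Q40.
example = struct.pack("<I", 3) + bytes([0, (30 << 2) | 1, (40 << 2) | 3])
print(decode_bcl(example))
print(decode_or_no_call(None, 3))
```

In practice the bytes would come from reading one cycle file for one tile; the sketch works on an in-memory buffer so that it runs without sequencer output.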
Sources: Illumina, SeqAnswers
Some researchers and clinicians believe embryo morphology and development characteristics can be used to assess the viability of IVF embryos to increase chances of a successful pregnancy.
Healthy embryos, i.e. the most viable zygotes that will develop into blastocysts and further, seem to follow a specific growth pattern between development day 3 and re-implantation on day 5:
Growth from 2 to 3 cells should be seen in 9 - 11 hours, from 3 to 4 cells in under 2 hours. Reaching day 5 is critical, as the embryo will then be re-implanted into the uterus and will attach to the endometrium. The normal development process is shown in figure 1 (source: CMFT NHS):
Embryo morphology is graded on a scale of 1 to 5 as shown in fig 2 (source: CMFT NHS):
Embryo cell division can be monitored with an "embryoscope", an incubator with an integrated camera. Time-lapse pictures are analysed by an embryologist to help select viable embryos. Systems that support the monitoring process include the "Early Embryo Viability Assessment" (Eeva) software by Auxogyn.
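As a purely illustrative sketch, the timing windows mentioned earlier (2 to 3 cells in 9 - 11 hours, 3 to 4 cells in under 2 hours) could be checked against time-lapse annotations like this. The function name and the input convention (hours at which each stage is first observed) are assumptions; this is not how Eeva or any clinical system works.

```python
def division_timing_flags(t2: float, t3: float, t4: float) -> dict:
    """Check cell-division intervals against the windows quoted in the text.

    t2, t3, t4: hours at which the 2-, 3- and 4-cell stages are first seen.
    """
    two_to_three = t3 - t2
    three_to_four = t4 - t3
    return {
        # 2 -> 3 cells expected within 9 to 11 hours
        "2_to_3_cells_ok": 9 <= two_to_three <= 11,
        # 3 -> 4 cells expected in under 2 hours
        "3_to_4_cells_ok": 0 <= three_to_four < 2,
    }


print(division_timing_flags(26.0, 36.0, 37.0))
print(division_timing_flags(26.0, 40.0, 43.0))
```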
Cell tracking and embryo assessment with Eeva (YouTube)
Cystic Fibrosis (CF), also called Mucoviscidosis, is a hereditary disease (autosomal recessive) in which exocrine (secretory) glands produce abnormally thick mucus. This mucus can cause problems in digestion, breathing, and body cooling. It affects up to one out of 3000 newborns (with northern European ancestry). There are well over a hundred genetic changes linked to CF. Companies like Illumina are very active in this area; Illumina offers a special assay, cleared by the FDA as an in-vitro diagnostic test, for the detection of most of the genetic variants known to cause the disease.
Background for CF and CFTR
- Cystic Fibrosis Transmembrane Conductance Regulator
- ABC transporter (ATP-binding cassette) that functions as an ion channel
- cAMP-regulated through R domain phosphorylation
- Transports chloride and thiocyanate across epithelial cell membranes
- 1,480 amino acids
- Most common autosomal recessive disorder among Caucasians (1/3,300)
- Dysregulation of epithelial fluid transport in lung, pancreas, and other organs
- ~ 2,000 identified gene mutations
- Phe508del – most common, in 70% of cases
- Wide range of severity, most die of pulmonary disease at mean age of 37
- ~70% of variants currently classed as VUS (variant of unknown significance)
- ~65% are missense mutations, 24% frameshift & stop-gained, 9% synonymous
Testing Machine Learning Approaches for CF classification
- Machine learning algorithms show higher performance when compared with separate predictors
- Tree-based methods perform the best (GBM & RF AUC is 6% higher than the best predictor, MutPred)
- Top features: MutPred, AF, SIFT, CADD, POSE
- Predicted pathogenicity probability (RF.pred) correlates with available experimental data for Cl- conductance and sweat Cl-
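For readers unfamiliar with the AUC metric used in the comparison above, here is a small, self-contained sketch of how it can be computed: the AUC equals the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The labels and scores are made-up toy values, not data from the study, and the variable names are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = pathogenic, 0 = benign (hypothetical values).
labels = [1, 1, 1, 0, 0, 0]
single_predictor = [0.9, 0.4, 0.6, 0.5, 0.3, 0.2]
combined_model = [0.9, 0.7, 0.6, 0.5, 0.3, 0.2]

print(roc_auc(labels, single_predictor))
print(roc_auc(labels, combined_model))
```

The quadratic loop is fine for a handful of variants; for larger sets a rank-based computation or a library routine would be the usual choice.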
CRAM files are compressed versions of BAM files containing (aligned) sequencing reads. They represent a further file size reduction for this type of data, which is generated in ever increasing quantities. Where SAM files are human-readable text files optimized for short read storage, BAM files are their binary equivalent, and CRAM files are a restructured column-oriented binary container format for even more efficient storage.
The key components of the approach are that positions are encoded in a relative way (i.e., the difference between successive positions is stored rather than the absolute value) and stored as a Golomb code. Also, only differences to the reference genome are listed instead of the full sequence.
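The idea of relative position encoding followed by a Golomb code can be sketched in a few lines of Python. This sketch uses the Golomb-Rice variant (the special case where the Golomb parameter is a power of two, 2^k) because it is the simplest to write down, and it produces a string of '0'/'1' characters for readability. It is an illustration of the principle only; the actual CRAM bitstream, parameter choice and codecs differ, and all function names here are hypothetical.

```python
from itertools import accumulate


def delta_encode(positions):
    """Store each sorted position as the difference to the previous one."""
    deltas, prev = [], 0
    for pos in positions:
        if pos < prev:
            raise ValueError("positions must be sorted")
        deltas.append(pos - prev)
        prev = pos
    return deltas


def delta_decode(deltas):
    """Recover absolute positions as the running sum of the differences."""
    return list(accumulate(deltas))


def rice_encode(value: int, k: int) -> str:
    """Golomb-Rice code: quotient in unary ('1' * q + '0'), remainder in k bits."""
    q, r = value >> k, value & ((1 << k) - 1)
    bits = "1" * q + "0"
    if k:
        bits += format(r, f"0{k}b")
    return bits


def rice_decode(bits: str, k: int):
    """Decode a concatenation of Golomb-Rice code words."""
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":  # unary part
            q += 1
            i += 1
        i += 1                 # skip the terminating '0'
        r = int(bits[i:i + k], 2) if k else 0
        i += k
        values.append((q << k) | r)
    return values


positions = [100, 103, 110, 110, 125]
deltas = delta_encode(positions)                      # small numbers after the first
encoded = "".join(rice_encode(d, 3) for d in deltas)  # k = 3 chosen arbitrarily
print(deltas, encoded, len(encoded))
print(delta_decode(rice_decode(encoded, 3)))
```

Small differences between neighbouring read positions turn into short code words, which is why sorting reads by position before encoding pays off.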
The compression rates achieved are shown in the graph below generated by Uppsala University:
Comparing speed: Using the C implementation of CRAM (James K. Bonfield), decoding is 1.5–1.7× slower than for BAM files, but encoding is 1.8–2.6× faster. (File size savings are reported at 34–55%.)
Additional compression can be achieved by reducing the granularity of the quality values, although this makes the compression lossy. Illumina suggested a binning of Q scores that does not significantly reduce calling performance.
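A minimal sketch of what such a binning looks like in code is given here. The bin edges and representative values are illustrative assumptions modelled on an eight-level scheme; they are not taken from the figures in this post, so consult Illumina's documentation for the mapping your instrument actually applies.

```python
import bisect

# Assumed lower bin edges and the value each bin is replaced with.
BIN_LOWER_EDGES = [0, 2, 10, 20, 25, 30, 35, 40]
BIN_VALUES = [0, 6, 15, 22, 27, 33, 37, 40]


def bin_quality(q: int) -> int:
    """Map a Phred quality score onto the representative value of its bin."""
    if q < 0:
        raise ValueError("quality scores must be non-negative")
    idx = bisect.bisect_right(BIN_LOWER_EDGES, q) - 1
    return BIN_VALUES[idx]


raw = [0, 2, 9, 10, 24, 31, 38, 41]
binned = [bin_quality(q) for q in raw]
print(binned)
# Fewer distinct symbols means the quality string compresses better.
print(len(set(range(42))), "->", len({bin_quality(q) for q in range(42)}))
```

The gain comes from the reduced alphabet: a general-purpose compressor needs fewer bits per symbol when only a handful of distinct quality values remain.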
Binning of similar Q-scores (Illumina):
Compression achieved by Q-score binning (Illumina):