Following on from the publication of the main papers of the ENCODE (Encyclopedia Of DNA Elements) scale-up phase, I gave an interview to BlueGnome's marketing team for the Newstrack customer newsletter in 2012.
These are my personal opinions, not my employer's (past or present). They might be of interest to researchers considering joining a large-scale project like this.
Q. What was it like to be part of the ENCODE project?
It was a great experience to work on a project of this scale with more than 400 scientists from 32 groups spread across the globe. Many of them are leaders in their field, but at consortium meetings and the many phone conferences everyone could contribute. The amount of data and the number of different technologies were overwhelming at times, so I think it’s an impressive achievement how well this project was run and that the findings have now been published.
Q. What are the main outcomes of the project?
There has been a very lively discussion about the outcome and how it was presented. In my opinion, the most important result is the data itself. ENCODE has created an enormous repository of measurements across the human genome that has been compiled in a systematic and standardised way. The data will be the basis of future research trying to understand genomic processes involved in basic cellular processes as well as in various diseases.
ENCODE has pushed the development of standards and new applications to interrogate the genome, in particular using sequencing technologies.
The results also remind us that there is a lot of activity in the genome that we currently do not fully understand. Up to 80% of the human genome is biochemically active, there are thousands of additional (non-coding) genes in introns and in the intergenic space, and up to 75% of the genome is transcribed at some point. These observations paint a very dynamic genomic landscape, with overlapping active zones and signals of different complexity, indicating that we have to keep our concept of genes and genome regulation pretty flexible.
Q. What are potential implications for BlueGnome and
I’m afraid the interpretation of CNV regions is getting even more complex as regulatory regions far away from the actual disease genes might be relevant for cases the clinical customers might come across. This is especially true for the interpretation of cancer profiles – which is highly complex already. We won’t be able to use these new interconnections directly in most cases, but we are looking through the data and have started to incorporate the knowledge by providing new genome-wide annotation data sets as optional BED files on the BlueGnome website, e.g. with GWAS results and regulatory element locations.
Q. Where do you see the human genome in 5 years’ time?
ENCODE is entering its next phase now to extend the catalogue to many additional cell lines as well as the mouse genome. With the recent publications scientists around the world are now more aware of this data and how to use it, so my hope is that we will see an acceleration in algorithm development, data mining and scientific findings. In 5 years we still won’t understand the genome entirely, but we should have a complete parts list and more connections between the parts. Some of these will be clinically relevant to allow progress in understanding and fighting today’s ‘big killers’ like certain types of cancer.
Q. Would you personally be interested in having your genome sequenced?
As a data exploration exercise I would find this really interesting, but the definitive answers you can get from it are still limited today. I would certainly want to make sure this data is kept private and under my control. With BlueGnome now being part of Illumina we can actually help to develop these ideas further.
The Sequence Alignment/Map (SAM) format is a generic format for storing read alignments against reference sequences. It is a text format that stores the data in a series of tab-delimited ASCII columns and is commonly used in next-generation sequencing data processing. It is the human-readable (non-binary) version of the BAM format and contains information about each read and its aligned position in the genome. It was developed by Heng Li in Richard Durbin's group and others; their paper is here.
After a header section the alignment section describes all results of the aligned read data. The format is best explained with an example line:
| Field name | Description | Example data |
|---|---|---|
| QNAME | read name | 1:497:R:-272+13M17D24M |
| FLAG | alignment flag | 113 |
| RNAME | alignment chromosome | 1 |
| POS | alignment start position | 497 |
| MAPQ | overall mapping quality | 37 |
| CIGAR | alignment CIGAR string | 37M |
| MRNM/RNEXT | name of next alignment in group (mate) | 15 |
| MPOS/PNEXT | position of next alignment in group (mate) | 100338662 |
| ISIZE/TLEN | observed template length | 0 |
| SEQ | sequence | CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG |
| QUAL | quality per base | 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> |
| TAGs | further tags with alignment info | XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37 |
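To make the column layout concrete, here is a minimal Python sketch that splits one alignment line into the fields described above and decodes the FLAG value. The names `SamRecord`, `parse_sam_line` and `decode_flag` are invented for this illustration and do not belong to any library. The FLAG bit meanings are taken from the SAM specification, and the example line is simply the example data joined with tabs.

```python
from typing import Dict, List, NamedTuple, Union


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns plus a dict of optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, Union[int, str]]


# FLAG bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-delimited alignment line (not a header line) into fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags: Dict[str, Union[int, str]] = {}
    for col in cols[11:]:
        name, value_type, value = col.split(":", 2)
        # Only integer tags (type 'i') are converted; everything else stays text.
        tags[name] = int(value) if value_type == "i" else value
    return SamRecord(
        qname=cols[0], flag=int(cols[1]), rname=cols[2], pos=int(cols[3]),
        mapq=int(cols[4]), cigar=cols[5], rnext=cols[6], pnext=int(cols[7]),
        tlen=int(cols[8]), seq=cols[9], qual=cols[10], tags=tags,
    )


def decode_flag(flag: int) -> List[str]:
    """Return the names of all FLAG bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# The example data from the table, joined with tabs.
EXAMPLE_LINE = "\t".join([
    "1:497:R:-272+13M17D24M", "113", "1", "497", "37", "37M", "15",
    "100338662", "0", "CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG",
    "0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>",
    "XT:A:U", "NM:i:0", "SM:i:37", "AM:i:0", "X0:i:1", "X1:i:0",
    "XM:i:0", "XO:i:0", "XG:i:0", "MD:Z:37",
])

if __name__ == "__main__":
    record = parse_sam_line(EXAMPLE_LINE)
    print(record.rname, record.pos, record.cigar)
    print(decode_flag(record.flag))
```

The sketch skips header lines (those starting with `@`) and does no validation beyond the column count, so for real data a dedicated library is probably the better choice.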
The tags are optional and might vary between alignment programs. Shown are examples from BWA. The tags usually most important for filtering are X0:i (number of best genome alignments of this read) and XM:i (number of mismatches in the alignment).
| Tag | Meaning |
|---|---|
| NM | Edit distance |
| MD | Mismatching positions/bases |
| AS | Alignment score |
| BC | Barcode sequence |
| X0 | Number of best hits |
| X1 | Number of suboptimal hits found by BWA |
| XN | Number of ambiguous bases in the reference |
| XM | Number of mismatches in the alignment |
| XO | Number of gap opens |
| XG | Number of gap extensions |
| XT | Type: Unique/Repeat/N/Mate-sw |
| XA | Alternative hits; format: (chr,pos,CIGAR,NM;)* |
| XS | Suboptimal alignment score |
| XF | Support from forward/reverse alignment |
| XE | Number of supporting seeds |
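As a rough illustration of filtering on these tags, the following Python sketch keeps only reads with exactly one best hit (X0 = 1) and a limited number of mismatches (XM). The function names and the default threshold of 2 mismatches are assumptions made for this example, not recommendations from BWA.

```python
def parse_tags(columns):
    """Turn optional SAM columns such as 'X0:i:1' into a dict."""
    tags = {}
    for col in columns:
        name, value_type, value = col.split(":", 2)
        # Integer tags (type 'i') become int, everything else stays text.
        tags[name] = int(value) if value_type == "i" else value
    return tags


def is_unique_hit(sam_line, max_mismatches=2):
    """True if the read has one best hit and at most max_mismatches mismatches.

    max_mismatches=2 is an arbitrary value chosen for this sketch.
    """
    cols = sam_line.rstrip("\n").split("\t")
    tags = parse_tags(cols[11:])  # optional tags start at column 12
    # Lines without X0/XM (other aligners, unmapped reads) are rejected.
    if "X0" not in tags or "XM" not in tags:
        return False
    return tags["X0"] == 1 and tags["XM"] <= max_mismatches


if __name__ == "__main__":
    # Hypothetical minimal alignment line, invented for this example.
    mandatory = ["read1", "0", "1", "100", "37", "4M", "*", "0", "0", "ACGT", "IIII"]
    line = "\t".join(mandatory + ["X0:i:1", "XM:i:0"])
    print(is_unique_hit(line))
```

Because the X* tags are specific to BWA, this filter silently drops every read when applied to output from an aligner that does not write them.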
Read names (at least those from Illumina machines) are constructed as:
[instrument-name]:[run ID]:[flowcell ID]:[lane-number]:[tile-number]:[x-pos]:[y-pos] [read number]:[is filtered]:[control number]:[barcode sequence]
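A read name in this layout can be taken apart with a few string splits. The sketch below is a minimal Python example; `IlluminaReadName` and `parse_read_name` are names made up for this illustration, and the read name in the usage part is invented rather than taken from real data.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str
    run_id: str
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read_number: int
    is_filtered: str  # kept as text exactly as it appears in the name
    control_number: int
    barcode: str


def parse_read_name(name: str) -> IlluminaReadName:
    """Split a read name of the layout shown above into its components."""
    name = name.strip().lstrip("@")  # FASTQ headers start with '@'
    location, _, description = name.partition(" ")
    loc = location.split(":")
    desc = description.split(":")
    if len(loc) != 7 or len(desc) != 4:
        raise ValueError("read name does not match the expected layout: " + name)
    return IlluminaReadName(
        instrument=loc[0], run_id=loc[1], flowcell_id=loc[2],
        lane=int(loc[3]), tile=int(loc[4]),
        x_pos=int(loc[5]), y_pos=int(loc[6]),
        read_number=int(desc[0]), is_filtered=desc[1],
        control_number=int(desc[2]), barcode=desc[3],
    )


if __name__ == "__main__":
    # Invented read name that follows the layout above.
    parsed = parse_read_name("@EXAMPLE01:42:FC0000XX:3:1101:1234:5678 1:N:0:ACGTAC")
    print(parsed.lane, parsed.tile, parsed.barcode)
```

Names from other sources, such as the simulated read in the SAM example above, do not follow this layout and raise a `ValueError` here.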
10-15% of couples in the western world are faced with some kind of infertility issue; in almost half of the cases there are (co-)factors on the male side.
Male infertility factors are often based on sperm abnormalities which can be categorized into:
- Azoospermic: no sperm in the semen
- Oligozoospermic: a low sperm count
- Asthenozoospermic: poor sperm motility
- Teratozoospermic: abnormal sperm morphology
The genetic region responsible for spermatogenesis and most of these abnormalities is located in the azoospermia factor (AZF) region on Yq11. It contains the sub-regions AZFa, AZFb and AZFc. Microdeletions in these regions are responsible for many genetic causes of male infertility. Alterations in the AZFc region (which contains the genes PRY2, BPY2, DAZ and CDY1) are believed to be the most frequent molecularly defined cause of spermatogenic failure. This is due to high genomic variability; in fact, AZFc is one of the most genetically dynamic regions in the human genome. This property may serve as a counter against the genetic degeneracy associated with the lack of a meiotic partner, meaning that no exchange of genetic material with a counterpart chromosomal region from the mother can happen.
Intracytoplasmic sperm injection (ICSI) can result in pregnancies, but passes on the genetic infertility to any sons born.
It has been reported that the average sperm count for men in the western world has declined by up to 50% in the past 50 years. These findings are not conclusive, however, as different studies found different trends around the world. It seems clear, however, that exposure to chemical compounds in our environment will influence the hormone balance, have an adverse effect on male fertility and promote diseases like testicular cancer.
Friday, 10.8.2012 9:41:49
There is a fine set of scripts called biopieces that form an orderly pipeline (or framework) for processing bioinformatics data on the Unix command line. You can, e.g., process sequencing (NGS) data like this:
to read the first 1000 sequences from a FASTQ file and plot the scores to an image file.
The result might look like this:
The general logic is
read_data | calculate_something | write_results
with the data being passed through as a "stream" and all modules having the same interface to each other. Installation instructions are here; on my Ubuntu VM I had to follow these steps:
- we need Perl, Ruby, Python, SVN. Install as needed.
sudo apt-get install subversion
- get biopieces code:
svn checkout http://biopieces.googlecode.com/svn/trunk/ biopieces
cd biopieces
svn checkout http://biopieces.googlecode.com/svn/wiki bp_usage
- check pre-requisites with the project's installer script
- missing Perl modules were listed nicely and could be installed as suggested.
- missing Ruby gems could not be installed due to incompatibilities, e.g.:
sudo gem install RubyInline
ERROR: Error installing RubyInline:
ZenTest requires RubyGems version > 1.8.
However, the project supplies an excellent Ruby installer on the downloads page that creates a separate Ruby 1.9 installation. This is needed because the default 1.8 version is too old for biopieces and the newer one is not officially supported on Ubuntu.
- modify your ~/.bashrc file to include:
The Ruby and Perl lib definitions are necessary to avoid errors like
cannot load such file -- maasha/biopieces (LoadError)
Can't locate Maasha/Fasta.pm in @INC
- create the data, temp and log directories:
mkdir $BP_DATA $BP_TMP $BP_LOG
Some of the almost 200 methods that are implemented in biopieces at this time include:
- read and write various formats like bed, tab, gff, fasta, fastq
- blast sequences against each other or against a genome
- calculate the N50 value for a set of sequences
- create statistics about the exon, intron, etc. content of a (12-column) BED file
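For readers unfamiliar with the N50 value mentioned in the list above: it is commonly defined as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length. The Python sketch below follows that common definition; it is an illustration only and not necessarily how biopieces computes the value.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and add them up until the
    running total reaches at least half of the overall total; the
    length at which this happens is the N50.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Total is 54, half is 27; 10 + 9 + 8 = 27, so the N50 is 8.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 generally indicates that more of the total sequence sits in long pieces, which is why it is often quoted for assemblies.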