Here is a quick list of the sizes of human chromosomes in assembly GRCh37 as defined by Ensembl:
chrom   length [bp]
1       249,250,621
2       243,199,373
3       198,022,430
4       191,154,276
5       180,915,260
6       171,115,067
7       159,138,663
8       146,364,022
9       141,213,431
10      135,534,747
11      135,006,516
12      133,851,895
13      115,169,878
14      107,349,540
15      102,531,392
16      90,354,753
17      81,195,210
18      78,077,248
19      59,128,983
20      63,025,520
21      48,129,895
22      51,304,566
X       155,270,560
Y       59,373,566
Mt      16,569
These sizes are useful for calculations of percent coverage of genomic features or sequencing reads.
They are often required when working with BED files.
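As an illustration of such a coverage calculation, here is a minimal bash/awk sketch. It assumes the chromosome sizes have been saved without thousands separators in a tab-separated file, and that the BED intervals do not overlap. The function name and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: percent of each chromosome covered by the intervals of a BED file.
# Assumes the BED intervals do not overlap (otherwise merge them first).

# Usage: percent_coverage SIZES_FILE BED_FILE
#   SIZES_FILE: "chrom<TAB>length" lines, e.g. "1<TAB>249250621" (hypothetical file)
#   BED_FILE:   "chrom<TAB>start<TAB>end" lines (0-based start, end exclusive)
percent_coverage() {
    awk -F'\t' '
        NR == FNR    { size[$1] = $2; next }        # first file: chromosome lengths
        ($1 in size) { covered[$1] += $3 - $2 }     # second file: sum interval lengths
        END {
            for (c in covered)
                printf "%s\t%.4f\n", c, 100 * covered[c] / size[c]
        }
    ' "$1" "$2" | sort
}

# Example call (file names are placeholders):
# percent_coverage grch37.sizes.txt features.bed
```

Chromosomes that appear in the BED file but not in the sizes file are silently skipped in this sketch.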
There are many sophisticated services and scripts to monitor the accessibility of your website or various aspects of your web server. Check Stack Overflow and look at Monastic for examples. Here is a very simple solution I needed to monitor the availability of a specific server using its IP address within the internal network. It was necessary after the server's IP address, which is used in third-party software to provide specific services, was "stolen" by other machines. In this case the DNS server assigned the IP address that should have been reserved to mobile devices that connected to the wireless network.
This approach simply fetches a web page from a specific URL using the "reserved" IP and looks for a word/pattern you know should be there. The script is run on a second machine (host name "ubuntu64"), an Ubuntu VM. (It does not use any of the additional security measures you will want if you expose the machine externally.)
Prepare second machine to send notification emails:
Install sendmail, sendemail, mailutils, sensible-mda (to have the whole set).
Add/modify entry in /etc/hosts:
127.0.1.1 ubuntu64.network.local ubuntu64
Run "sudo sendmailconfig"
Write bash script to get and check website and send alert emails:
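The original script is not reproduced here. The following is only a minimal sketch of what such a check could look like, assuming curl and the mail command (from mailutils) are installed. The function names and the three command-line arguments (URL, expected pattern, alert address) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: fetch a page via the "reserved" IP and send an alert email
# if the expected word/pattern is missing or the page cannot be fetched.
# Hypothetical usage: check_website.sh URL PATTERN MAILTO

# Succeeds if the text on stdin contains the pattern (as a fixed string).
page_has_pattern() {
    grep -qF -- "$1"
}

# Fetch the URL and check it for the pattern.
check_site() {
    local url="$1" pattern="$2" page
    # -s: no progress output, -f: fail on HTTP errors,
    # --max-time: do not let a hanging server block the cron job
    page="$(curl -sf --max-time 20 "$url")" || return 1
    printf '%s' "$page" | page_has_pattern "$pattern"
}

# Send a short alert email using the mail command from mailutils.
send_alert() {
    local url="$1" mailto="$2"
    printf 'Check of %s failed at %s (sent from %s)\n' \
        "$url" "$(date)" "$(hostname)" |
        mail -s "ALERT: check of $url failed" "$mailto"
}

# Only run the check when all three arguments are given.
if [ "$#" -eq 3 ]; then
    check_site "$1" "$2" || send_alert "$1" "$3"
fi
```

Matching a fixed string with grep -F avoids surprises when the expected word contains characters that are special in regular expressions.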
Add a crontab entry to automatically run this script every 10 minutes:
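A crontab line for a 10-minute interval could look like the sketch below (edit with "crontab -e"). The script path and its arguments are placeholders and depend on where your script lives and what it expects.

```
# m    h  dom mon dow  command
*/10   *  *   *   *    /home/user/bin/check_website.sh "http://192.168.1.1/" "expected word" "admin@example.com"
```

The "*/10" in the minute field means "every 10th minute"; the remaining four asterisks match every hour, day of month, month and day of week.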
Additional improvements could include the options to stop alerting after a specific number of alerts or checking the response time.
Alternatively you can just look up the MAC address associated with the "reserved" IP and compare it to the known physical address of your server and wrap this up into a little script:
>arp -a 192.168.1.1
Interface: 192.168.1.152 --- 0xb
  Internet Address      Physical Address      Type
  192.168.1.1           00-11-18-2c-2e-6d     dynamic
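On the Ubuntu machine the same comparison could be scripted roughly as follows. This is a sketch under assumptions: it uses "ip neigh" instead of the Windows arp output shown above, the function names are hypothetical, and the IP/MAC in the commented example are simply the values from the arp listing.

```shell
#!/usr/bin/env bash
# Sketch: compare the MAC address currently answering for an IP
# with the known MAC address of the server.

# Lower-case a MAC address and use ":" as separator,
# so that 00-11-18-2C-2E-6D and 00:11:18:2c:2e:6d compare equal.
normalize_mac() {
    local mac="${1,,}"
    printf '%s\n' "${mac//-/:}"
}

# Extract the MAC from "ip neigh show <ip>" output on stdin
# (it is the field following the keyword "lladdr").
mac_from_neigh() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "lladdr") { print $(i + 1); exit } }'
}

# Succeeds if the MAC seen for the IP matches the expected one.
check_mac() {
    local ip="$1" expected="$2" current
    ping -c 1 -W 2 "$ip" > /dev/null 2>&1    # refresh the neighbour table entry
    current="$(ip neigh show "$ip" | mac_from_neigh)"
    [ "$(normalize_mac "$current")" = "$(normalize_mac "$expected")" ]
}

# Example (not run automatically; the mail address is a placeholder):
# check_mac 192.168.1.1 00-11-18-2c-2e-6d ||
#     echo "IP is used by another device" | mail -s "ALERT: MAC mismatch" admin@example.com
```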
Following on from the publication of the main papers of the ENCODE (Encyclopedia Of DNA Elements) scale-up phase, I gave an interview to BlueGnome's marketing team for the Newstrack customer newsletter in 2012.
These are my personal opinions, not my employer's (past or present). They might be of interest to researchers considering joining a large-scale project like this.
Q. What was it like to be part of the ENCODE project?
It was a great experience to work on a project of this scale with more than 400 scientists from 32 groups spread across the globe. Many of them are leaders in their field, but at consortium meetings and the many phone conferences everyone could contribute. The amount of data and different technologies was overwhelming at times, so I think it’s an impressive achievement how this project was run and that the findings have now been published.
Q. What are the main outcomes of the project?
There has been a very lively discussion about the outcome and how it was presented. In my opinion, the most important result is the data itself. ENCODE has created an enormous repository of measurements across the human genome that has been compiled in a systematic and standardised way. The data will be the basis of future research trying to understand genomic processes involved in basic cellular processes as well as in various diseases.
ENCODE has pushed the development of standards and new applications to interrogate the genome, in particular using sequencing technologies.
The results also remind us that there is a lot of activity in the genome that we currently do not fully understand. Up to 80% of the human genome is biochemically active, there are thousands of additional (non-coding) genes in introns and in the intergenic space, and up to 75% of the genome is transcribed at some point. These observations paint a very dynamic genomic landscape, with overlapping active zones and signals of different complexity, indicating that we have to keep the concept of genes and genome regulation pretty flexible in our minds.
Q. What are potential implications for BlueGnome and
I’m afraid the interpretation of CNV regions is getting even more complex as regulatory regions far away from the actual disease genes might be relevant for cases the clinical customers might come across. This is especially true for the interpretation of cancer profiles – which is highly complex already. We won’t be able to use these new interconnections directly in most cases, but we are looking through the data and have started to incorporate the knowledge by providing new genome-wide annotation data sets as optional BED files on the BlueGnome website, e.g. with GWAS results and regulatory element locations.
Q. Where do you see the human genome in 5 years’ time?
ENCODE is entering its next phase now to extend the catalogue to many additional cell lines as well as the mouse genome. With the recent publications scientists around the world are now more aware of this data and how to use it, so my hope is that we will see an acceleration in algorithm development, data mining and scientific findings. In 5 years we still won’t understand the genome entirely, but we should have a complete parts list and more connections between the parts. Some of these will be clinically relevant to allow progress in understanding and fighting today’s ‘big killers’ like certain types of cancer.
Q. Would you personally be interested in having your genome sequenced?
As a data exploration exercise I would find this really interesting, but the definitive answers you can get from it are still limited today. I would certainly want to make sure this data is kept private and under my control. With BlueGnome now being part of Illumina we can actually help to develop these ideas further.
The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences. It is a text format that stores sequence data in a series of tab-delimited ASCII columns and is commonly used in next-generation sequencing data processing. It is the (non-binary) human-readable version of the BAM format and contains information about the read and the aligned position in the genome. It was developed by Heng Li in Richard Durbin's group and others, and is described in their paper (Li et al., 2009, Bioinformatics).
After a header section the alignment section describes all results of the aligned read data. The format is best explained with an example line:
Field name    Description                                   Example data
QNAME         read name                                     1:497:R:-272+13M17D24M
FLAG          alignment flag                                113
RNAME         alignment chromosome                          1
POS           alignment start position                      497
MAPQ          overall mapping quality                       37
CIGAR         alignment CIGAR string                        37M
MRNM/RNEXT    name of next alignment in group (mate)        15
MPOS/PNEXT    position of next alignment in group (mate)    100338662
ISIZE/TLEN    observed Template LENgth                      0
SEQ           sequence                                      CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG
QUAL          quality per base                              0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>
TAGs          further tags with alignment info              XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
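The FLAG value is a sum of bit flags defined in the SAM specification, so the 113 in the example line can be taken apart bit by bit. A minimal bash sketch (the function name is hypothetical and the descriptions are shortened paraphrases of the specification):

```shell
#!/usr/bin/env bash
# Sketch: list the meaning of every bit that is set in a SAM FLAG value.
decode_flag() {
    local flag="$1" i
    local names=(
        "read paired"                   # 0x1
        "read mapped in proper pair"    # 0x2
        "read unmapped"                 # 0x4
        "mate unmapped"                 # 0x8
        "read on reverse strand"        # 0x10
        "mate on reverse strand"        # 0x20
        "first in pair"                 # 0x40
        "second in pair"                # 0x80
        "not primary alignment"         # 0x100
        "read fails quality checks"     # 0x200
        "PCR or optical duplicate"      # 0x400
        "supplementary alignment"       # 0x800
    )
    for i in "${!names[@]}"; do
        # test bit i of the flag value
        if (( flag & (1 << i) )); then
            printf '%s\n' "${names[i]}"
        fi
    done
}

# 113 = 1 + 16 + 32 + 64
decode_flag 113
# prints:
#   read paired
#   read on reverse strand
#   mate on reverse strand
#   first in pair
```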
The tags are optional and might vary between alignment programs. Shown are examples from BWA. The tags most often used for filtering are X0:i (number of genome alignments of this read) and XM:i (number of mismatches in the alignment).
Tag   Meaning
NM    Edit distance
MD    Mismatching positions/bases
AS    Alignment score
BC    Barcode sequence
X0    Number of best hits
X1    Number of suboptimal hits found by BWA
XN    Number of ambiguous bases in the reference
XM    Number of mismatches in the alignment
XO    Number of gap opens
XG    Number of gap extensions
XT    Type: Unique/Repeat/N/Mate-sw
XA    Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS    Suboptimal alignment score
XF    Support from forward/reverse alignment
XE    Number of supporting seeds
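Filtering on the X0 and XM tags can be done directly on the SAM text. The following is a minimal bash/awk sketch, assuming BWA-style tags as listed above. The function name and the default mismatch threshold are hypothetical choices.

```shell
#!/usr/bin/env bash
# Sketch: keep only uniquely aligned reads (X0 == 1) with at most
# MAX_MM mismatches (XM). Reads SAM text on stdin; header lines
# (starting with "@") are passed through unchanged.
# Usage: filter_unique [MAX_MM] < in.sam > out.sam
filter_unique() {
    local max_mm="${1:-0}"
    awk -F'\t' -v max_mm="$max_mm" '
        /^@/ { print; next }
        {
            x0 = ""; xm = ""
            for (i = 12; i <= NF; i++) {        # optional tags start at column 12
                if ($i ~ /^X0:i:/) x0 = substr($i, 6)
                if ($i ~ /^XM:i:/) xm = substr($i, 6)
            }
            # "+ 0" forces numeric comparison; reads lacking either tag are dropped
            if (x0 != "" && xm != "" && (x0 + 0) == 1 && (xm + 0) <= (max_mm + 0))
                print
        }
    '
}

# Example call (file names are placeholders):
# filter_unique 2 < aligned.sam > unique_max2mm.sam
```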
Read names (at least from Illumina machines) are constructed as:
[instrument-name]:[run ID]:[flowcell ID]:[lane-number]:[tile-number]:[x-pos]:[y-pos] [read number]:[is filtered]:[control number]:[barcode sequence]
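A read name in this layout can be split into its parts with plain bash. This is a minimal sketch; the function name is hypothetical and the read name in the example call is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: split an Illumina-style read name into its components.
# Assumes the two-part layout shown above, separated by a single space.
parse_read_name() {
    local name="${1#@}"                 # drop a leading "@" (FASTQ header)
    local first="${name%% *}"           # part before the first space
    local second="${name#* }"           # part after the first space
    local instrument run flowcell lane tile x y
    local read_no filtered control barcode
    IFS=':' read -r instrument run flowcell lane tile x y <<< "$first"
    IFS=':' read -r read_no filtered control barcode <<< "$second"
    printf 'instrument=%s\nrun=%s\nflowcell=%s\nlane=%s\ntile=%s\nx=%s\ny=%s\n' \
        "$instrument" "$run" "$flowcell" "$lane" "$tile" "$x" "$y"
    printf 'read=%s\nfiltered=%s\ncontrol=%s\nbarcode=%s\n' \
        "$read_no" "$filtered" "$control" "$barcode"
}

# Made-up example name:
parse_read_name '@M00001:42:FC1:1:1101:15589:1332 1:N:0:ACGTAC'
```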
10-15% of couples in the western world are faced with some kind of infertility issue; in almost half the cases there are (co-)factors on the male side.
Male infertility factors are often based on sperm abnormalities which can be categorized into:
- Azoospermic: No sperm in the semen
- Oligozoospermic: A low sperm count
- Asthenozoospermic: Poor sperm motility
- Teratozoospermic: Abnormal sperm morphology
The genetic region responsible for spermatogenesis and most of these abnormalities is located in the azoospermia factor (AZF) region on Yq11. It contains the sub-regions AZFa, AZFb and AZFc. Microdeletions in these regions are responsible for many genetic causes of male infertility. Alterations in the AZFc region (which contains the genes PRY2, BPY2, DAZ and CDY1) are believed to be the most frequent molecularly defined cause of spermatogenic failure. This is due to high genomic variability; in fact, AZFc is one of the most genetically dynamic regions in the human genome. This property may serve as a counter against the genetic degeneracy associated with the lack of a meiotic partner, meaning that no exchange of genetic material with a counterpart chromosomal region from the mother can happen.
Intracytoplasmic sperm injection (ICSI) can result in pregnancies, but passes on the genetic infertility to any sons born.
It has been reported that the average sperm count for men in the western world has declined by up to 50% in the past 50 years. These findings are not conclusive, however, as different studies found different trends around the world. It seems clear, however, that exposure to chemical compounds in our environment will influence the hormone balance, have an adverse effect on male fertility and promote diseases like testicular cancer.