Gene Models (and the Central Dogma of Molecular Biology)

October 23rd, 2010

What is a Gene Model?

I found the following text on the teaching pages of Prof. Ann Loraine and found it worth repeating (slightly modified) here:

Gene models are hypotheses about the structure of transcripts produced by a gene. Like all models, they may be correct, partly correct, or entirely wrong. Typically, with evidence-based gene-prediction programs, we use information from EST s (expressed sequence tags) , cDNAs or RNASeq reads to evaluate or create gene models. Alternatively models can be derived from the genomic sequence alone, looking for well-known characteristics (open-reading frames, splice-sites, stops, etc.) of the sequence of genes. This approach is called ab-initio gene prediction.

Itís important to remember at all times that a gene model is only that: a model.

To understand what a gene model represents, you need to refresh your memory about how transcription, RNA splicing, and polyadenylation operate.

Most protein-coding genes in eukaryotic organisms (like humans, the research plant Arabidopsis thaliana, fruit flies, etc.) are transcribed into RNA by an enzyme complex called RNA polymerase II, which binds to the five prime end of a gene in its so-called promoter region. The promoter region typically contains binding sites for transcription factors that help the RNA polymerase complex recognize the position in the genomic DNA where it should begin transcription. Many genes have multiple places in the genomic DNA where transcription can begin, and so transcripts arising from the same gene may have different five-prime ends. Transcripts arising from the same gene that have different transcription start sites are said to come from alternative promoters.

Once the RNA polymerase complex binds to the five prime end of gene, it can begin building an RNA copy of the DNA sense strand via the process known as transcription. The ultimate product of transcription is thus called a transcript. During and after transcription, another large complex of proteins and non-coding RNAs called the spliceosome attaches to the growing RNA molecule, cuts out segments of RNA called introns, and joins together (splices) the flanking sequences, which are called exons. Not every newly synthesized transcript is processed in this way; sometimes no introns are removed at all. Genes whose products do not undergo splicing are often called single ?exon genes.

Also, splicing may remove different segments from transcripts arising from the same gene. This variability in splicing patterns is called alternative splicing. In addition to splicing, RNA transcripts undergo another processing reaction called polyadenylation.

In polyadenylation, a segment of sequence at the 3-prime end of the RNA transcript is cut off, and a polymer consisting of adenosine residues called a polyA tail is attached to the 3-prime end of the transcript. The length of polyA tail may vary a lot from transcript to transcript, and the position where it is added may also differ. Genes whose transcripts can receive a polyA tail at more than one location are said to be subject to alternative polyadenylation or alternative 3?prime end processing. One of the functions of this polyA tail is thought to be increased stability of the transcript.

These processing reactions are believed to take place in the nucleus. Ultimately, most of the mature or maturing RNA transcripts are exported from the nucleus into the cytoplasm, where they will be translated by ribosomes into proteins, chains of amino acids that perform work in the cell (such as enzymes) or that provide form and structure (like actin in the cytoskeleton).

The continuous sequence of bases in an RNA that encode a protein is called a coding region, and the coding typically starts with an AUG codon and terminates with one of three possible stop codons. The segments of sequence that comprise a coding region are called CDSs and they generally occupy the same sequences as the exons, apart from the regions five and three prime of the start and stop codons, respectively.

Most RNAs code for one protein sequence, but there are some interesting exceptions in which one mature mRNA may contain more than one translated open reading frame. The three bases where the ribosome initiates translation are called a start codon and the triplet of bases immediately following the last translated codon are called the stop codon. The start codon encodes the amino acid methionine, typically, and the stop codon doesnít code for any amino acid.

A gene model thus consists of a collection of introns and exons and their locations in the genomic sequence, as well as the location of the translated region or region. Thus, a gene model implies a theory about where the RNA polymerase started transcription, as well as the location of the polyadenylation site and the starts and stops of translation. Usually, we draw gene models as showing the location of introns and exons relative to the genomic sequence, as if we are mapping the RNA copy back onto the genomic DNA itself.

This text nicely describes the classical central dogma of (molecular) biology (DNA -> RNA -> protein), what gene models are and some thoughts about gene prediction at the same time...

Repeat Finding and Masking

October 13th, 2010

What are genomic interspersed repeats? [from the RepeatMasker docu] In the mid 1960's scientists discovered that many genomes contain stretches of highly repetitive DNA sequences ( see Reassociation Kinetics Experiments, and C-Value Paradox ). These sequences were later characterized and placed into five categories:

  1. Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
  2. Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes these are duplications of more complex 100-200 base sequences.
  3. Segmental Duplications - Large blocks of 10-300 kilobases which are that have been copied to another region of the genome.
  4. Interspersed Repeats
    1. Processed Pseudogenes, Retrotranscripts, SINES - Non-functional copies of RNA genes which have been reintegrated into the genome with the assitance of a reverse transcriptase.
    2. DNA Transposons
    3. Retrovirus Retrotransposons
    4. Non-Retrovirus Retrotransposons ( LINES )

Currently up to 50% of the human genome is repetitive in nature and as improvements are made in detection methods this number is expected to increase.

Software for repeat identification

  • The best known program is RepeatMasker (Adrian Smit, Washington University), that screens DNA sequences for interspersed repeats and low complexity DNA sequences. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green. Alternatively WU-BLAST can be used for faster processing. A web-based analysis can be carried out at, but the sequence size limit is 100kb here.
  • Recon and RepeatScout are other (less well-maintained) de novo repeat-finding software packages
  • Dust is a program for filtering low complexity regions from nucleic acid sequences, has been used within BLAST for many years. (Paper in J. Comp. Biol.)
  • TRF is the "Tandem repeats finder". (Paper in NAR)

The default parameters for RepeatMasker as part of the Ensembl gene-prediction pipeline e.g. mouse are:

-nolow -species mouse -s

Further reading: Table in Nature with different programs. See also Tarailo-Graovac and Chen "Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences" in Current Protocols in Bioinformatics, March 2009. See also this RepeatMasker readme at

RNA-Seq data quality scores

February 26th, 2010

There are different ways to encode the quality scores in FASTQ files from Next-generation sequencing machines. It is important to find out before using the data and to convert between formats if necessary.

  • Sanger format can encode a [[Phred quality score]] from 0 to 93 using [[ASCII]] 33 to 126 (although in raw read data the Phred quality score rarely exceeds 60, higher scores are possible in assemblies or read maps).
  • Illumina 1.3+ format can encode a [[Phred quality score]] from 0 to 62 using [[ASCII]] 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).
  • Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using [[ASCII]] 59 to 126 (although in raw read data Solexa scores from -5 to 40 only are expected)

  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

Source: wikipedia

For a simple look-up from ASCII to numeric scores you can use the following list:

ASCII	numeric		ASCII	numeric
!	0		@	31
"	1		A	32
#	2		B	33
$	3		C	34
%	4		D	35
&	5		E	36
'	6		F	37
(	7		G	38
)	8		H	39
*	9		I	40
+	10		J	41
,	11		K	42
-	12		L	43
.	13		M	44
/	14		N	45
0	15		O	46
1	16		P	47
2	17		Q	48
3	18		R	49
4	19		S	50
5	20		T	51
6	21		U	52
7	22		V	53
8	23		W	54
9	24		X	55
:	25		Y	56
;	26		Z	57
<	27		[	58
=	28		\	59
>	29		]	60
?	30		^	61

You can convert the Solexa read quality to Sanger read quality with Maq:

maq sol2sanger s_1_sequence.txt s_1_sequence.fastq

where s_1_sequence.txt is the Solexa read sequence file. Missing this step will lead to unreliable SNP calling when aligning reads with Maq.

Source: maq-manual

Phred itself is a base calling program for DNA sequence traces developed during the initial automation phase of the sequencing of the human genome.
After calling bases, Phred examines the peaks around each base call to assign a quality score to each base call. Quality scores range from 4 to about 60, with higher values corresponding to higher quality. The quality scores are logarithmically linked to error probabilities, as shown in the following table:

Phred quality	Probability of		Accuracy of
score		wrong base call		base call
10 		1 in 10 		90%
20 		1 in 100 		99%
30 		1 in 1,000 		99.9%
40 		1 in 10,000 		99.99%
50 		1 in 100,000 		99.999%

"High quality bases" are usually scores of 20 and above ("Phred20 score").

You can read the original publications about the Phred program and scoring by Brent Ewing et al. from Phil Green's lab here and here.


ENCODE cell lines

February 24th, 2010

These are some of the cell lines that are used in the various analysis of the ENCODE project. The first two are so-called tier-1 lines and covered by all the different types of experiments within ENCODE, the others are tier-2 lines, additionally there are a number of tier-3 cell lines.

  • GM12878 is a lymphoblastoid cell line produced from the blood of a female donor with northern and western European ancestry by EBV transformation. It was one of the original HapMap cell lines and has been selected by the International HapMap Project for deep sequencing using the Solexa/Illumina platform. This cell line has a relatively normal karyotype and grows well. Choice of this cell line offers potential synergy with the International HapMap Project and genetic variation studies. It represents the mesoderm cell lineage.
  • K562 is an immortalized cell line produced from a female patient with chronic myelogenous leukemia (CML). It is a widely used model for cell biology, biochemistry, and erythropoiesis. It grows well, is transfectable, and represents the mesoderm linage.
  • HepG2 is a cell line derived from a male patient with liver carcinoma. It is a model system for metabolism disorders and much data on transcriptional regulation have been generated using this cell line. It grows well, is transfectable, and represents the endoderm lineage.
  • HeLa-S3 is an immortalized cell line that was derived from a cervical cancer patient. It grows extremely well in suspension and is transfectable. It represents the ectoderm lineage. Many data sets were produced using this cell line during the pilot phase of the ENCODE Project. In addition, these cells have been widely used in biochemical and molecular genetic studies of gene function and regulation.
  • HUVEC (human umbilical vein endothelial cells) have a normal karyotype and are readily expandable to 108-109 cells. They represent the mesoderm lineage.
  • Keratinocytes have a normal karyotype and are readily expandable to 108-109 cells. They represent the ectoderm lineage.
  • H1 human embryonic stem cells.


Full list

AnnoTrack: Data maintanance

December 8th, 2009

This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.

Regular updates

The following Perl scripts update the data and re-set priorities and flags. They usually update Havana annotation data, but all other sources can be checked as well by activating the entry in the config file. They run as cron-job every night, but can also be run manually if needed. The cron-job is executed from svn/gencode/tracking_system/perl/scripts/

The general procedure (which can be also used to push new data into the system) is:

  1. - updating all sources specified in the config
  2. - update active flags to transcripts and set appropriate priorities. This also adds new flag types to the tmp_values table to keep count.
  3. - create links between flagged transcripts in the same genomic region
  4. - SQL queries that update the counts that are shown on the from page

Common parameters are:

  1. env defines the target database as

    • prod or human: main human production database
    • dev: test database with human data on the mcs4a db server
    • zfish: zebrafish production db
    • mouse: mouse production db
  2. write: connect with write access and store changes in chosen db
  3. verbose: write (a lot) of output for testing

Running data update scripts:


perl svn/gencode/tracking_system/perl/scripts/ -env proc -core -write

Run as farm job, sources to be updated defined in Set active flags and priorities based on flags:


perl svn/gencode/tracking_system/perl/scripts/ -env proc -core -write

Specific updates

  1. Our Ensembl friends regularly compare CCDS, exon, intron and cDNA features between Ensembl and Havana annotations. This will generate text files with locations and IDs that need to be reloaded into AnnoTrack. There are specific source modules for these files, so adjusting the file (for the affected source definition: pointing the "file" hash entry to the new file and setting the "active" flag to "1") and running script should be sufficient.
  2. After every Havana/Ensembl merge a new OTT-/ENS-ID mapping should be generated and loaded into the AnnoTrack tracking system. This can be done with the script svn/gencode/scripts/ which will read the GTF file or a list of ids and create the SQL statements.


    perl svn/gencode/scripts/ -gtf -infile current_freeze.gtf -out new_id_conversions.sql
    mysql -h -P -u -p -D gencode_tracking < new_id_conversions.sql

Adding new data

  1. Importing Ensembl objects

    If an important gene model is missing from Havana but was annotated by Ensembl an import into AnnoTrack can be accomplished easily with the script svn/gencode/tracking_system/perl/scripts/ with the following options:


    perl -user Felix
                                -category Ensembl
                                -gene ENSG00012048
                                -transcript ENST00309486
                                -flag manual_selection
                                -note "important gene"

    Setting a flag (with the chosen flag-name) and adding a note (that will be displayed next to the flag) are optional.

  2. Importing via DAS

    A number of GENCODE sources were imported from external DAS servers. For updates or new sources these source adaptors should be checked at svn/gencode/tracking_system/perl/modules/gencode_tracking_system/sources/

  3. Importing from a file

    There are source adaptors for reading tab-delimited files ( and GTF files (which can also used for GFF3). You might have a look at the source code of the parser in case it needs slight modifications to read your file format.

  4. Importing via other sources

    If there are new types of data sources not fitting above categories a new source-adator has to be created. The best way for this is to copy and modify an existing one from svn/gencode/tracking_system/perl/modules/gencode_tracking_system/sources/.

  5. Creating new entries through the web interface is possible but not recommended. A gene can be added on this admin page (Trackers: only Features is required, Modules: only Issue tracking is required), transcripts can then be added using the URL format "".

For all imports with the script an entry describing the new data source needs to be created in the svn/gencode/tracking_system/perl/modules/gencode_tracking_system/ config file. A hash "%OTHER_SERVERS" contains an entry for every source name with the parameters required:

  • active - set to "1" to include the source in the update procedure, all others should be set to "0"
  • dns/type/proxy - the server definitions for DAS sources
  • user_name - the login name from the users table
  • category - a name for the new data, usually the same as the source name itself
  • detached -
  • by_chrom - does the update need to be performed chromosome-by-chromosome? (for slow DAS servers)
  • description - a short description of the data source
  • update_function - name of the module used for the update, e.g. "gtf" or "missing_ccds"
  • data_type - name of the feature type, e.g. "UCSC_novel_genes"

Working with flags

Flags are the most important features of the system, they define what problems we are focusing on.

New flags can be set:

  • Through the web interface (see image 1) by any logged-in user by clicking on "add flag" on a transcript page
  • Through the web interface using a list of IDs (eg. "OTTHUMT00000334332") with this form.
  • Through the Perl script svn/gencode/tracking_system/perl/scripts/ Other scripts (eg. can also have an option to set new flags to the features they are working with.

If the same type of flag was already set and not resolved yet, the scripts should NOT set another flag.

To resolve flags

  • On the web interface the flags for every transcript can be resolved individually by clicking on the check/deny images next to them
  • or multiple flags at once by activating the checkbox and clicking on the check/deny images below the list of flags
  • programmatically, multiple flags can be resolved with the script svn/gencode/tracking_system/perl/scripts/ and a text file of solutions. Please check the perldocs.

resolve flags

image 1: resolving flags through the web interface

New types of flags can be created here. This creates an entry in the flags tale (with the issue_id=-1) and in the tmp_values table where stats are stored. Also check the list of all flag types and their priorities.

The description of flag types can be updated here.

In general it's a good idea to run new updates / imports against the development environment / test database first (by setting $ENV = "dev" in the config file or using the -dev env parameter for scripts). Changes can than by checked in the database or a test server first (at the Sanger at