GENCODE: Generating release files

January 4th, 2011

These are notes about the data handling steps involved in creating the GTF files released by the GENCODE project and submitted to the DCC. (Valid as of February 2011)

For general information and data access please visit the project website at http://www.gencodegenes.org, this blog post or the AnnoTrack annotation tracking system.

A. Input sources

-ensembl core database with gene models, stable ids and xrefs

-vega database of same release for id-lookup

-3-way pseudogene file with gene ids:

from Yale, based on pre-dump file from same release (using the newfullmerge.pl script)

-2-way (Yale/UCSC) pseudogene file with full locations and 2 sets of ids (from Yale)

-level-1 (and level-4 if defined) transcript file containing stable-ids

-optional file with additional annotation remarks

-file from HGNC web site with columns

HGNV-ID, gene_symbol, Pubmed-IDs, Vega-ID

-RefSeq NP / NM mapping from current xref database (from Ensembl core team):

Code

mysql -uensro -hens-research -Dianl_human_xref_release_61
 
  -e'select accession1, accession2 from pairs where accession1 like "NP%" and accession2 like "NM%"'
 
  > RefSeq_relations.txt

B. Code to use

svn/gencode/scripts/data_release/newfullmerge.pl

		      .../write_class_file.pl

		      .../gencode_addmetadata.pl

svn/gencode/modules/Gencode/Ensembl2GTF.pm

C. Procedure

Create directory where output files are written to and the following input files are placed:

3-way_consensus_pseudogenes.txt, classes.def, validated_level_1_ids.txt

The paths to these are needed in the newfullmerge.pl script...

mkdir /work/dir/gencode_7

for LSF output files:

mkdir /work/dir/gencode_7/outfiles

dump annotation data (using main chromosomes only)

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT )
 
    bsub -o /work/dir/gencode_7/outfiles/gencode_$chr.out perl svn/gencode/scripts/data_release/newfullmerge.pl -basedir /work/dir/gencode_7 -chrom $chr
 
end

check jobs

Code

grep -c "^Successfully" gencode_*out

update PAR region (We are currently writing out X and Y PAR regions separately. They are stored only once in the Ensembl db though, so the ids need to be made non-redundant with this step)

Code

perl svn/gencode/scripts/data_release/update_y_ids.pl -x gencode_X.gtf -y gencode_Y.gtf -out gencode_YY.gtf

create joined file

add header to release file gencode.v7.annotation.gtf:

##description: evidence-based annotation of the human genome (GRCh37),

 version 7 (Ensembl 62)

##provider: GENCODE

##contact: gencode@sanger.ac.uk

##format: gtf

##date: 2011-03-23

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X YY MT )
 
  cat gencode_$chr.gtf >> gencode.v7.annotation.gtf
 
end

check gene and transcripts numbers(compare to previous release and database ignoring haplotype regions etc.)

Code

awk '{if($3=="gene"){g++}else{if($3=="transcript"){t++}}} END{print "genes: "g"\ntranscripts: "t"\n"}' gencode.v7.annotation.gtf

check tags (annotation remarks)(compare to previous release)

Code

foreach t ( seleno pseudo_consens CCDS mRNA_start_NF mRNA_end_NF cds_start_NF cds_end_NF non_org_supp exp_conf PAR alternative_3_UTR alternative_5_UTR readthrough NMD_exception not_organism-supported not_best-in-genome_evidence non-submitted_evidence upstream_ATG downstream_ATG upstream_uORF overlapping_uORF NAGNAG_splice_site non_canonical_conserved non_canonical_genome_sequence_error non_canonical_other non_canonical_polymorphism non_canonical_U12 non_canonical_TEC )
 
  echo -n $t"\t"; awk '{if($3=="transcript"){print $0}}' gencode.v7.annotation.gtf | grep -c "$t"
 
end

split by level (level 1/2 and 3 are displayed as two sep. tracks in the UCSC browser)

Code

awk '{if($26=="3;"){print $0}}' gencode.v7.annotation.gtf | awk '{if($3!="gene"){print $0}}' > gencode.v7.annotation.level_3.gtf
 
awk '{if($26!="3;"){print $0}}' gencode.v7.annotation.gtf | awk '{if($3!="gene"){print $0}}' > gencode.v7.annotation.level_1_2.gtf

make class file (data loading at UCSC requires a mapping of all gene and transcripts id to a level and a type)

Find classes not yet defined:

Code

grep -h "^Class not defined" gencode_*.out | sort -u

add these manually to the classes.def file. Write out new lists:

Code

perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.annotation.level_1_2.gtf -class classes.def -out gencode.v7.annotation.level_1_2.classes
 
perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.annotation.level_3.gtf -class classes.def -out gencode.v7.annotation.level_3.classes

generate meta-data

perl svn/gencode/scripts/data_release/gencode_addmetadata.pl

requires list of new PAR region IDs

generate tRNAs

Code

bsub -o trna.out perl svn/gencode/scripts/data_release/newfullmerge.pl -trna -out gencode.v7.tRNAs.gtf

[622 lines]

Code

nice perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.tRNAs.gtf -class classes.def -out gencode.v7.tRNAs.classes -types tRNAscan

generate polyAs

Code

nice perl svn/gencode/scripts/data_release/dump_polyAs.pl -out gencode.v7.polyAs.gtf

[28966 lines]

Code

nice perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.polyAs.gtf -class classes.def -out gencode.v7.polyAs.classes

re-format 2-way pseudogenes (from Yale) NEEDS UPDATING

(create header)

Code

awk 'BEGIN{c=0} {print $1"\tYale_UCSC\ttranscript\t"$2"\t"$3"\t.\t"$4"\t.\tgene_id \"Overlap"c"\"; transcript_id \"Overlap"c"\"; gene_type \"pseudogene\"; gene_status \"UNKNOWN\"; gene_name \"Overlap"c"\"; transcript_type \"pseudogene\"; transcript_status \"UNKNOWN\"; transcript_name \"Overlap"c"\"; level 3; tag \"2way_pseudo_cons\"; yale_id \""$5"\"; ucsc_id \""$6"\"; parent_id \""$7"\";"; c++}' yale_ucsc_2way_consensus >> gencode.v7.2wayconspseudos.GRCh37.gtf
 
nice perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.2wayconspseudos.GRCh37.gtf -class classes.def -out gencode.v7.2wayconspseudos.GRCh37.classes -types transcript

create transcript sequence files

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT )
 
    bsub -o seqs/trans_$chr.out perl svn/gencode/scripts/data_release/newfullmerge.pl -outfile seqs/trans_$chr.fa -ass GRCh37 -sequence -chrom $chr
 
end

create protein sequence files

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT )
 
    bsub -o seqs/prot_$chr.out perl svn/gencode/scripts/data_release/newfullmerge.pl -outfile seqs/prot_$chr.fa -ass GRCh37 -sequence -protein -chrom $chr
 
end

update PAR regions in sequence files

Code

nice perl svn/gencode/scripts/data_release/update_y_ids.pl -fasta -x gencode_X.gtf -y seqs/trans_Y.fa -out seqs/trans_YY.fa
 
nice perl svn/gencode/scripts/data_release/update_y_ids.pl -fasta -x gencode_X.gtf -y seqs/prot_Y.fa -out seqs/prot_YY.fa

combine sequence files

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X YY MT )
 
  cat seqs/prot_$chr.fa >> gencode.v7.pc_translations.fa
 
  cat seqs/trans_$chr.fa >> gencode.v7.pc_transcripts.fa
 
end

files to release to the DCC

gencode.v7.annotation.level_1_2.gtf	 

gencode.v7.annotation.level_1_2.classes  

gencode.v7.annotation.level_3.gtf	

gencode.v7.annotation.level_3.classes	

gencode.v7.polyAs.gtf

gencode.v7.polyAs.classes

gencode.v7.2wayconspseudos.gtf

gencode.v7.2wayconspseudos.classes

metadata/

  gencode_Exon_supporting_feature

  gencode_HGNC

  gencode_PDB

  gencode_Pubmed_id

  gencode_RefSeq

  gencode_Source

  gencode_SwissProt

  gencode_Transcript_supporting_feature

Code

tar -czvf gencode7_GRCh37.tgz gencode7
 
cp gencode7_GRCh37.tgz PUB_FTP/gencode/release_7/gencode7_GRCh37.tgz

It can take up to 20 minutes before the files are visible on the public FTP site.

These additional files are added to the FTP sites individually for general users:

gencode.v7.annotation.gtf.gz

gencode.v7.pc_transcripts.fa.gz

gencode.v7.pc_translations.fa.gz

gencode.v7.polyAs.gtf.gz

gencode.v7.tRNAs.gtf.gz

Code

nice gzip -c gencode.v7.pc_transcripts.fa > gencode7_GRCh37.tgz PUB_FTP/gencode/release_7/gencode.v7.pc_transcripts.fa.gz

etc.

Other notes:

  • After every Havana/Ensembl merge a new OTT-/ENS ID mapping should be generated and loaded into the AnnoTrack tracking system. This can be done with the script
    svn/gencode/scripts/store_id_conversion.pl

    which will read the GTF file or a list of ids and create the SQL statements. It's better to use a release file with no versions in the Ensembl ids as the others can not be linked to the Ensembl web site directly and the "." might break some functions in AnnoTrack. Please remember this might create links to ids that are not yet "valid" until the official Ensembl release date.

    Code

    perl svn/gencode/scripts/store_id_conversion.pl -gtf -infile gencode.v7.annotation.gtf -out new_id_conversions.sql
     
    mysql -h -P -u -p -D gencode_tracking < new_id_conversions.sql
  • Also the external annotations in AnnoTrack should be updated from the new ensembl database. These are stored as custom_values with this script:

    Code

    bsub -q long -o job.out perl svn/gencode/tracking_system/perl/scripts/update_external_info.pl
     
                 -coredb homo_sapiens_core_61_37f
     
                 -comparadb ensembl_compara_61
     
                 -ontologydb ensembl_ontology_61

    This is looking at the live-mirror dbs by default, so either modify this or run this after the Ensembl release date.

  • Selenocysteine tags are now read directly from the database, to pull them out separately for other reasons into a file you can do:

    Code

    mysql -uensro -hens-livemirror -Dhomo_sapiens_core_60_37e -e"select tsi.stable_id, ta.value from translation_attrib ta, transcript_stable_id tsi, translation tl where tl.transcript_id=tsi.transcript_id and tl.translation_id=ta.translation_id and ta.attrib_type_id=12 order by stable_id;" | awk '{print $1"\t"$2}' > selenocystein.transcripts

FASTQ Sequence Files

December 15th, 2010

A good description of the FASTQ format can be found at Illumina:

"A fastq file is an ASCII encoded text file that stores DNA or RNA sequences and their corresponding IDs and quality scores. It uses unix newlines and consists of 4 lines per sequence unless wrapping occurs due to sequence length. The first line begins with an "@" followed by an identifier (ID) which acts as a label for the read/sequence, plus index and read (pair) numbers. Read numbers are 1 for single reads and 1 or 2 for paired reads. The second line represents a DNA or RNA sequence, and should consist only of standard bases, and IUPAC ambiguity codes (ACTGNURYSWKMBDHV). This line must be wrapped with newlines if the read is longer than 80nt. The third line must be a single "+" which signifies the end of the sequence, optionally followed by the identifier again. The fourth line is a quality score string showing the quality of each base in the prior sequence, represented as the ASCII character corresponding to the quality Phred score + 33. Phred scores must be 0 and 60 (ASCII chars 33 aka "!" to 93 aka "]"). The quality score must also be wrapped to multiple lines if longer than 80 characters, but must be exactly equal in length to it's corresponding sequence."

Example:

@READNAME[#index]/read_number

BASES

+READNAME[#index]/read_number

QUALITIES

As a sanity and QC check the DCC of the 1000 genomes project applies the following rules (source):
Syntax Checks:
-Each header line begins with @
-The third line always starts with a +
-There are four lines in each entry (implied by the above two rules)
-On line3, if a name follows the + sign, the name has to match the one found in line1
-The sequence and quality lines are the same length
-For paired end files, the _1 and _2 files have the same number of reads in them.
-For SOLID colourspace fastq, each read starts with a base followed by a string of numbers

Sequence Checks:
-Read is longer than 35bp for Solexa, 25bp for Solid, and 30 bp for 454
-Read does not contain any N's in the first 25, 30 or 35bp
-Quality values are all 2 or higher in the first 25bp, 30bp or 35bp
-The reads contain more than one type of base in the first 25, 30, or 35bp
-Read does not contain more than 50% Ns in its whole length
-Read does not contain characters other than ATGCN (this rule does not apply to SOLID reads)

Taking it further:

Ensembl Core Database Schema Diagram

November 26th, 2010

To understand the concept of Ensembl and learn how to query the tables I find it extremely useful to have a schema diagram of the database in front of me.

This can be generated by using the schema.sql and foreign_keys.sql files from the sql directory of the Ensembl API cvs checkout. After loading this data into a program like the free MySQL Workbench the tables and connections can be arranged to your liking.

Here is a pdf version I created based on Ensembl core 59 with the MySQL Workbench file.

UPDATE:
Nice schema diagrams and a description of the different tables can now be found on the Ensembl pages!

Genomic Start Coordinates

October 23rd, 2010

Adding to the confusion about different notations of phases/frames, the start coordinates of genomic features are also noted differently between different genome browsers and file formats.

1. One-based

Counting bases starting with "1" at the first position.

Regions are specified by a "closed interval." Used e.g. by the Ensembl genome browser and annotation system, the GFF/GTF, SAM and wiggle file formats.

2. Zero-based

The interbase system counts spaces starting with "0" at the first position.

Regions are specified by a "half-closed-half-open interval". Used by the UCSC genome browser, Chado (the fruitfly browser), the BED, BAM and PSL file formats.

An example:

    One-based


     1 2 3 4 5 6

     | | | | | |

     C G A T G C

    | | | | | | |

    0 1 2 3 4 5 6


    Zero-based

The ATG interval would be described from 3-5 in the first, from 2-5 in the second system.

Gene Models (and the Central Dogma of Molecular Biology)

October 23rd, 2010

What is a Gene Model?

I found the following text on the teaching pages of Prof. Ann Loraine and found it worth repeating (slightly modified) here:

Gene models are hypotheses about the structure of transcripts produced by a gene. Like all models, they may be correct, partly correct, or entirely wrong. Typically, with evidence-based gene-prediction programs, we use information from EST s (expressed sequence tags) , cDNAs or RNASeq reads to evaluate or create gene models. Alternatively models can be derived from the genomic sequence alone, looking for well-known characteristics (open-reading frames, splice-sites, stops, etc.) of the sequence of genes. This approach is called ab-initio gene prediction.

Itís important to remember at all times that a gene model is only that: a model.

To understand what a gene model represents, you need to refresh your memory about how transcription, RNA splicing, and polyadenylation operate.

Most protein-coding genes in eukaryotic organisms (like humans, the research plant Arabidopsis thaliana, fruit flies, etc.) are transcribed into RNA by an enzyme complex called RNA polymerase II, which binds to the five prime end of a gene in its so-called promoter region. The promoter region typically contains binding sites for transcription factors that help the RNA polymerase complex recognize the position in the genomic DNA where it should begin transcription. Many genes have multiple places in the genomic DNA where transcription can begin, and so transcripts arising from the same gene may have different five-prime ends. Transcripts arising from the same gene that have different transcription start sites are said to come from alternative promoters.

Once the RNA polymerase complex binds to the five prime end of gene, it can begin building an RNA copy of the DNA sense strand via the process known as transcription. The ultimate product of transcription is thus called a transcript. During and after transcription, another large complex of proteins and non-coding RNAs called the spliceosome attaches to the growing RNA molecule, cuts out segments of RNA called introns, and joins together (splices) the flanking sequences, which are called exons. Not every newly synthesized transcript is processed in this way; sometimes no introns are removed at all. Genes whose products do not undergo splicing are often called single ?exon genes.

Also, splicing may remove different segments from transcripts arising from the same gene. This variability in splicing patterns is called alternative splicing. In addition to splicing, RNA transcripts undergo another processing reaction called polyadenylation.

In polyadenylation, a segment of sequence at the 3-prime end of the RNA transcript is cut off, and a polymer consisting of adenosine residues called a polyA tail is attached to the 3-prime end of the transcript. The length of polyA tail may vary a lot from transcript to transcript, and the position where it is added may also differ. Genes whose transcripts can receive a polyA tail at more than one location are said to be subject to alternative polyadenylation or alternative 3?prime end processing. One of the functions of this polyA tail is thought to be increased stability of the transcript.

These processing reactions are believed to take place in the nucleus. Ultimately, most of the mature or maturing RNA transcripts are exported from the nucleus into the cytoplasm, where they will be translated by ribosomes into proteins, chains of amino acids that perform work in the cell (such as enzymes) or that provide form and structure (like actin in the cytoskeleton).

The continuous sequence of bases in an RNA that encode a protein is called a coding region, and the coding typically starts with an AUG codon and terminates with one of three possible stop codons. The segments of sequence that comprise a coding region are called CDSs and they generally occupy the same sequences as the exons, apart from the regions five and three prime of the start and stop codons, respectively.

Most RNAs code for one protein sequence, but there are some interesting exceptions in which one mature mRNA may contain more than one translated open reading frame. The three bases where the ribosome initiates translation are called a start codon and the triplet of bases immediately following the last translated codon are called the stop codon. The start codon encodes the amino acid methionine, typically, and the stop codon doesnít code for any amino acid.

A gene model thus consists of a collection of introns and exons and their locations in the genomic sequence, as well as the location of the translated region or region. Thus, a gene model implies a theory about where the RNA polymerase started transcription, as well as the location of the polyadenylation site and the starts and stops of translation. Usually, we draw gene models as showing the location of introns and exons relative to the genomic sequence, as if we are mapping the RNA copy back onto the genomic DNA itself.

This text nicely describes the classical central dogma of (molecular) biology (DNA -> RNA -> protein), what gene models are and some thoughts about gene prediction at the same time...