Submitting to EMBLdb

January 24th, 2011

To submit DNA sequences from capillary (Sanger) sequencing to the public EMBL database, these steps can be taken:

The strategy is to create one submission at the European Nucleotide Archive (ENA) Webin submission page at the EBI and attach a FASTA file with all sequences.

  1. remove low-quality sequences. In my case the filter criteria were:

    • max 5 consecutive Ns
    • max 10% Ns
    • min 80bp length
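These criteria can be sketched as a small filter function (a sketch, not the script actually used; the thresholds are taken from the list above):

```python
def passes_quality_filter(seq, max_consecutive_n=5, max_n_fraction=0.10, min_length=80):
    """Return True if a read passes the quality criteria listed above."""
    seq = seq.upper()
    if len(seq) < min_length:
        return False
    if seq.count("N") / len(seq) > max_n_fraction:
        return False
    if "N" * (max_consecutive_n + 1) in seq:  # more than 5 consecutive Ns
        return False
    return True
```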
  2. screen for vector contamination:

    • Use the NCBI web interface for small sets
    • Use BioPerl for large sets: get the EMVEC file in EMBL format and convert it to a FASTA file with BioPerl

      Code

      use Bio::SeqIO;

      # read the EMBL-format vector file, write a FASTA copy
      my $inseq = Bio::SeqIO->new(
            -file   => "<file.dat",
            -format => "embl" );
      my $outseq = Bio::SeqIO->new(
            -file   => ">file.fa",
            -format => "fasta" );
      while (my $seq = $inseq->next_seq) {
        $outseq->write_seq($seq);
      }
    • index with formatdb

      To extract sequences from a BLAST database you need an index file (for protein-dbs these files end with the extension: ".pin", for DNA dbs: ".nin"), a sequence file (".psq", ".nsq") and a header file (".phr" and ".nhr"). formatdb turns FASTA files into BLAST databases.

      Code

      formatdb -i emvec.fa -p F -o F

    • run BioPerl Blast with the sequences to be submitted against the EMVEC db:

      Code

      use Bio::Tools::Run::StandAloneBlast;

      # search each sequence against the formatdb-indexed EMVEC database
      my @blast_params = ( program  => 'blastn',
                           database => 'emvec.dat.fa' );
      my $factory      = Bio::Tools::Run::StandAloneBlast->new(@blast_params);
      my $blast_report = $factory->blastall($seq);

      and flag sequences whose hits have very low (< 0.1) E-values and long alignments as vector-contaminated.
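The filtering step can be sketched as follows, assuming BLAST tabular output (the 12-column -m 8 format); the minimum alignment length threshold is an illustrative assumption, not a value from the original post:

```python
def vector_contaminated(blast_tab_lines, max_evalue=0.1, min_aln_len=50):
    """Collect query ids with a strong vector hit (low E-value, long alignment).

    Assumes BLAST tabular output (-m 8): qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore.
    """
    contaminated = set()
    for line in blast_tab_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comments or malformed lines
        qseqid, aln_len, evalue = fields[0], int(fields[3]), float(fields[10])
        if evalue < max_evalue and aln_len >= min_aln_len:
            contaminated.add(qseqid)
    return contaminated
```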

  3. In my case the sequences are submitted as ESTs. Log in to Webin, create a new submission, choose the molecule type (e.g. "EST"), add a reference publication, specify the number of sequences, describe the header (at least one field, e.g. the clone identifier, must be specified to be read from the FASTA header), add common values in the small table applied to all entries (e.g. organism "Homo sapiens"), and upload your FASTA file.


Sequence Contaminations

January 20th, 2011

When analysing sequences from public databases or from your own sequencer, you have to be aware of potential contamination.

A contaminated sequence is one that does not faithfully represent the genetic information from the biological source organism/organelle because it contains one or more sequence segments of foreign origin. [NCBI]

The primary approach to screening nucleic acid sequences for vector contamination is to run a sequence similarity search against a database of vector sequences. The preferred tool for conducting such a search is NCBI's VecScreen. VecScreen detects contamination by running a BLAST sequence similarity search against the UniVec vector sequence database.

The EMVEC Database BLAST is an interactive web service to scan for contamination.

Help with the interpretation of BLAST2 EMVEC results is also available.

See also this post about submitting to EMBL db and this post about screening NGS reads locally.

GVF Format

January 11th, 2011

The Genome Variation Format (GVF) is a file format for describing sequence variants at nucleotide resolution relative to a reference genome. The GVF format was published in Reese et al., Genome Biol., 2010: A standard variation file format for human genome sequences.

GVF is a type of GFF3 file with additional pragmas and attributes specified.

Two examples:

Code

chr16 samtools SNV 49291141 49291141 . + . ID=ID_1;Variant_seq=A,G;Reference_seq=G;Genotype=heterozygous
 
chr16 samtools SNV 49291360 49291360 . + . ID=ID_2;Variant_seq=G;Reference_seq=C;Genotype=homozygous

Code

chr16 samtools SNV 49291141 49291141 . + . ID=ID_1;Variant_seq=A,G;Reference_seq=G;Genotype=heterozygous;Variant_effect=synonymous_codon 0 mRNA NM_022162;
 
chr16 samtools SNV 49302125 49302125 . + . ID=ID_3;Variant_seq=T,C;Reference_seq=C;Genotype=heterozygous;Variant_effect=nonsynonymous_codon 0 mRNA NM_022162;Alias=NP_071445.1:p.P45S;
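A minimal parser for such lines might look like this (a sketch assuming tab-separated GFF3 columns, as the GVF specification requires; only the attributes actually shown above are exercised):

```python
def parse_gvf_line(line):
    """Parse one GVF feature line (tab-separated GFF3 columns).

    The ninth column holds ;-separated key=value attribute pairs,
    e.g. ID, Variant_seq, Reference_seq, Genotype.
    """
    seqid, source, ftype, start, end, score, strand, phase, attrs = \
        line.rstrip("\n").split("\t")
    attributes = {}
    for pair in attrs.strip(";").split(";"):
        key, _, value = pair.partition("=")
        attributes[key] = value
    # multi-valued fields like Variant_seq are comma-separated
    attributes["Variant_seq"] = attributes.get("Variant_seq", "").split(",")
    return {"seqid": seqid, "source": source, "type": ftype,
            "start": int(start), "end": int(end), "strand": strand,
            "attributes": attributes}
```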

This is used e.g. by Ensembl to write out "Watson SNPs" from the variation database (ftp).

Source and full specs: Sequenceontology.org

QSEQ File Format

January 6th, 2011

QSEQ is a plain-text file format for sequence reads produced directly by many current next-generation sequencing machines. The content can be described as follows.

Each record is one line of tab-separated fields in the following order:

- Machine name: unique identifier of the sequencer.

- Run number: unique number to identify the run on the sequencer.

- Lane number: positive integer (currently 1-8).

- Tile number: positive integer.

- X: x coordinate of the spot. Integer (can be negative).

- Y: y coordinate of the spot. Integer (can be negative).

- Index: positive integer. Reads from non-indexed runs have the value 1.

- Read Number: 1 for single reads; 1 or 2 for paired ends.

- Sequence (BASES)

- Quality: the calibrated quality string. (QUALITIES)

- Filter: Did the read pass filtering? 0 - No, 1 - Yes.
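Putting the fields together, one QSEQ record can be converted to FASTQ roughly like this (a sketch; the Phred+64 quality offset and the '.'-for-no-call convention are assumptions that hold for typical Illumina pipelines of that era):

```python
QSEQ_FIELDS = ["machine", "run", "lane", "tile", "x", "y",
               "index", "read_number", "sequence", "quality", "filter"]

def qseq_to_fastq(qseq_line, quality_offset=64):
    """Convert one tab-separated QSEQ record to a FASTQ record.

    quality_offset=64 assumes Illumina-style Phred+64 qualities; use 33
    if the run already writes Sanger-scaled qualities.
    """
    rec = dict(zip(QSEQ_FIELDS, qseq_line.rstrip("\n").split("\t")))
    name = "{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read_number}".format(**rec)
    seq = rec["sequence"].replace(".", "N")  # QSEQ marks no-calls with '.'
    qual = "".join(chr(ord(c) - quality_offset + 33) for c in rec["quality"])
    return "@{}\n{}\n+\n{}".format(name, seq, qual)
```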

Source: SRA_File_Formats_Guide.pdf

GENCODE: Generating release files

January 4th, 2011

These are notes about the data handling steps involved in creating the GTF files released by the GENCODE project and submitted to the DCC. (Valid as of February 2011)

For general information and data access please visit the project website at http://www.gencodegenes.org, this blog post or the AnnoTrack annotation tracking system.

A. Input sources

-ensembl core database with gene models, stable ids and xrefs

-vega database of same release for id-lookup

-3-way pseudogene file with gene ids:

from Yale, based on pre-dump file from same release (using the newfullmerge.pl script)

-2-way (Yale/UCSC) pseudogene file with full locations and 2 sets of ids (from Yale)

-level-1 (and level-4 if defined) transcript file containing stable-ids

-optional file with additional annotation remarks

-file from HGNC web site with columns

HGNC-ID, gene_symbol, Pubmed-IDs, Vega-ID

-RefSeq NP / NM mapping from current xref database (from Ensembl core team):

Code

mysql -uensro -hens-research -Dianl_human_xref_release_61 \
  -e'select accession1, accession2 from pairs where accession1 like "NP%" and accession2 like "NM%"' \
  > RefSeq_relations.txt

B. Code to use

svn/gencode/scripts/data_release/newfullmerge.pl

svn/gencode/scripts/data_release/write_class_file.pl

svn/gencode/scripts/data_release/gencode_addmetadata.pl

svn/gencode/modules/Gencode/Ensembl2GTF.pm

C. Procedure

Create directory where output files are written to and the following input files are placed:

3-way_consensus_pseudogenes.txt, classes.def, validated_level_1_ids.txt

The paths to these are needed in the newfullmerge.pl script...

mkdir /work/dir/gencode_7

for LSF output files:

mkdir /work/dir/gencode_7/outfiles

dump annotation data (using main chromosomes only)

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT )
 
    bsub -o /work/dir/gencode_7/outfiles/gencode_$chr.out perl svn/gencode/scripts/data_release/newfullmerge.pl -basedir /work/dir/gencode_7 -chrom $chr
 
end

check jobs

Code

grep -c "^Successfully" gencode_*out

update PAR region (We are currently writing out X and Y PAR regions separately. They are stored only once in the Ensembl db though, so the ids need to be made non-redundant with this step)

Code

perl svn/gencode/scripts/data_release/update_y_ids.pl -x gencode_X.gtf -y gencode_Y.gtf -out gencode_YY.gtf
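Conceptually, the de-duplication works something like the following sketch (illustrative only: the suffix scheme is invented here and need not match what update_y_ids.pl actually writes):

```python
import re

def deduplicate_par_ids(x_gtf_lines, y_gtf_lines, suffix="_PAR_Y"):
    """Rename gene/transcript ids on chrY that also occur on chrX.

    Sketch only: collects all ENSG/ENST ids seen on the X file, then
    appends a suffix to any of those ids found on the Y file.
    """
    id_pattern = re.compile(r'(ENS[GT]\d+)')
    x_ids = set()
    for line in x_gtf_lines:
        x_ids.update(id_pattern.findall(line))
    out = []
    for line in y_gtf_lines:
        out.append(id_pattern.sub(
            lambda m: m.group(1) + suffix if m.group(1) in x_ids else m.group(1),
            line))
    return out
```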

create joined file

add header to release file gencode.v7.annotation.gtf:

##description: evidence-based annotation of the human genome (GRCh37),

 version 7 (Ensembl 62)

##provider: GENCODE

##contact: gencode@sanger.ac.uk

##format: gtf

##date: 2011-03-23

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X YY MT )
 
  cat gencode_$chr.gtf >> gencode.v7.annotation.gtf
 
end

check gene and transcript numbers (compare to previous release and database, ignoring haplotype regions etc.)

Code

awk '{if($3=="gene"){g++}else{if($3=="transcript"){t++}}} END{print "genes: "g"\ntranscripts: "t"\n"}' gencode.v7.annotation.gtf

check tags (annotation remarks)(compare to previous release)

Code

foreach t ( seleno pseudo_consens CCDS mRNA_start_NF mRNA_end_NF cds_start_NF cds_end_NF non_org_supp exp_conf PAR alternative_3_UTR alternative_5_UTR readthrough NMD_exception not_organism-supported not_best-in-genome_evidence non-submitted_evidence upstream_ATG downstream_ATG upstream_uORF overlapping_uORF NAGNAG_splice_site non_canonical_conserved non_canonical_genome_sequence_error non_canonical_other non_canonical_polymorphism non_canonical_U12 non_canonical_TEC )
 
  echo -n $t"\t"; awk '{if($3=="transcript"){print $0}}' gencode.v7.annotation.gtf | grep -c "$t"
 
end

split by level (levels 1/2 and 3 are displayed as two separate tracks in the UCSC browser)

Code

awk '{if($26=="3;"){print $0}}' gencode.v7.annotation.gtf | awk '{if($3!="gene"){print $0}}' > gencode.v7.annotation.level_3.gtf
 
awk '{if($26!="3;"){print $0}}' gencode.v7.annotation.gtf | awk '{if($3!="gene"){print $0}}' > gencode.v7.annotation.level_1_2.gtf

make class file (data loading at UCSC requires a mapping of all gene and transcript ids to a level and a type)

Find classes not yet defined:

Code

grep -h "^Class not defined" gencode_*.out | sort -u

add these manually to the classes.def file. Write out new lists:

Code

perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.annotation.level_1_2.gtf -class classes.def -out gencode.v7.annotation.level_1_2.classes
 
perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.annotation.level_3.gtf -class classes.def -out gencode.v7.annotation.level_3.classes
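The id-to-level/type mapping that the class files encode can be sketched like this (write_class_file.pl additionally consults classes.def for class names; the attribute layout assumed here follows standard GENCODE GTF lines):

```python
import re

def gtf_attributes(line):
    """Pull the key "value" pairs from column 9 of a GTF line."""
    attrs = line.rstrip("\n").split("\t")[8]
    return dict(re.findall(r'(\w+) "?([^";]+)"?;', attrs))

def class_map(gtf_lines):
    """Map each gene/transcript id to its annotation level and biotype.

    Sketch of what the class files contain: one entry per feature,
    keyed on transcript_id where present, otherwise gene_id.
    """
    mapping = {}
    for line in gtf_lines:
        a = gtf_attributes(line)
        if "transcript_id" in a:
            mapping[a["transcript_id"]] = (a.get("level"), a.get("transcript_type"))
        elif "gene_id" in a:
            mapping[a["gene_id"]] = (a.get("level"), a.get("gene_type"))
    return mapping
```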

generate meta-data

perl svn/gencode/scripts/data_release/gencode_addmetadata.pl

requires list of new PAR region IDs

generate tRNAs

Code

bsub -o trna.out perl svn/gencode/scripts/data_release/newfullmerge.pl -trna -out gencode.v7.tRNAs.gtf

[622 lines]

Code

nice perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.tRNAs.gtf -class classes.def -out gencode.v7.tRNAs.classes -types tRNAscan

generate polyAs

Code

nice perl svn/gencode/scripts/data_release/dump_polyAs.pl -out gencode.v7.polyAs.gtf

[28966 lines]

Code

nice perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.polyAs.gtf -class classes.def -out gencode.v7.polyAs.classes

re-format 2-way pseudogenes (from Yale) NEEDS UPDATING

(create header)

Code

awk 'BEGIN{c=0} {print $1"\tYale_UCSC\ttranscript\t"$2"\t"$3"\t.\t"$4"\t.\tgene_id \"Overlap"c"\"; transcript_id \"Overlap"c"\"; gene_type \"pseudogene\"; gene_status \"UNKNOWN\"; gene_name \"Overlap"c"\"; transcript_type \"pseudogene\"; transcript_status \"UNKNOWN\"; transcript_name \"Overlap"c"\"; level 3; tag \"2way_pseudo_cons\"; yale_id \""$5"\"; ucsc_id \""$6"\"; parent_id \""$7"\";"; c++}' yale_ucsc_2way_consensus >> gencode.v7.2wayconspseudos.GRCh37.gtf
 
nice perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.2wayconspseudos.GRCh37.gtf -class classes.def -out gencode.v7.2wayconspseudos.GRCh37.classes -types transcript

create transcript sequence files

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT )
 
    bsub -o seqs/trans_$chr.out perl svn/gencode/scripts/data_release/newfullmerge.pl -outfile seqs/trans_$chr.fa -ass GRCh37 -sequence -chrom $chr
 
end

create protein sequence files

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT )
 
    bsub -o seqs/prot_$chr.out perl svn/gencode/scripts/data_release/newfullmerge.pl -outfile seqs/prot_$chr.fa -ass GRCh37 -sequence -protein -chrom $chr
 
end

update PAR regions in sequence files

Code

nice perl svn/gencode/scripts/data_release/update_y_ids.pl -fasta -x gencode_X.gtf -y seqs/trans_Y.fa -out seqs/trans_YY.fa
 
nice perl svn/gencode/scripts/data_release/update_y_ids.pl -fasta -x gencode_X.gtf -y seqs/prot_Y.fa -out seqs/prot_YY.fa

combine sequence files

Code

foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X YY MT )
 
  cat seqs/prot_$chr.fa >> gencode.v7.pc_translations.fa
 
  cat seqs/trans_$chr.fa >> gencode.v7.pc_transcripts.fa
 
end

files to release to the DCC

gencode.v7.annotation.level_1_2.gtf	 

gencode.v7.annotation.level_1_2.classes  

gencode.v7.annotation.level_3.gtf	

gencode.v7.annotation.level_3.classes	

gencode.v7.polyAs.gtf

gencode.v7.polyAs.classes

gencode.v7.2wayconspseudos.gtf

gencode.v7.2wayconspseudos.classes

metadata/

  gencode_Exon_supporting_feature

  gencode_HGNC

  gencode_PDB

  gencode_Pubmed_id

  gencode_RefSeq

  gencode_Source

  gencode_SwissProt

  gencode_Transcript_supporting_feature

Code

tar -czvf gencode7_GRCh37.tgz gencode7
 
cp gencode7_GRCh37.tgz PUB_FTP/gencode/release_7/gencode7_GRCh37.tgz

It can take up to 20 minutes before the files are visible on the public FTP site.

These additional files are added to the FTP sites individually for general users:

gencode.v7.annotation.gtf.gz

gencode.v7.pc_transcripts.fa.gz

gencode.v7.pc_translations.fa.gz

gencode.v7.polyAs.gtf.gz

gencode.v7.tRNAs.gtf.gz

Code

nice gzip -c gencode.v7.pc_transcripts.fa > PUB_FTP/gencode/release_7/gencode.v7.pc_transcripts.fa.gz

etc.

Other notes:

  • After every Havana/Ensembl merge a new OTT-/ENS ID mapping should be generated and loaded into the AnnoTrack tracking system. This can be done with the script
    svn/gencode/scripts/store_id_conversion.pl

    which will read the GTF file or a list of ids and create the SQL statements. It's better to use a release file without versions in the Ensembl ids, as the others cannot be linked to the Ensembl web site directly and the "." might break some functions in AnnoTrack. Please remember this might create links to ids that are not yet "valid" until the official Ensembl release date.

    Code

    perl svn/gencode/scripts/store_id_conversion.pl -gtf -infile gencode.v7.annotation.gtf -out new_id_conversions.sql
     
    mysql -h -P -u -p -D gencode_tracking < new_id_conversions.sql
  • Also the external annotations in AnnoTrack should be updated from the new ensembl database. These are stored as custom_values with this script:

    Code

    bsub -q long -o job.out perl svn/gencode/tracking_system/perl/scripts/update_external_info.pl \
                 -coredb homo_sapiens_core_61_37f \
                 -comparadb ensembl_compara_61 \
                 -ontologydb ensembl_ontology_61

    This is looking at the live-mirror dbs by default, so either modify this or run this after the Ensembl release date.

  • Selenocysteine tags are now read directly from the database, to pull them out separately for other reasons into a file you can do:

    Code

    mysql -uensro -hens-livemirror -Dhomo_sapiens_core_60_37e -e"select tsi.stable_id, ta.value from translation_attrib ta, transcript_stable_id tsi, translation tl where tl.transcript_id=tsi.transcript_id and tl.translation_id=ta.translation_id and ta.attrib_type_id=12 order by stable_id;" | awk '{print $1"\t"$2}' > selenocystein.transcripts