Category: "Bioinformatics"

GVF Format

January 11th, 2011
The Genome Variation Format (GVF) is a file format for describing sequence variants at nucleotide resolution relative to a reference genome. GVF was published in Reese et al., Genome Biol., 2010: A standard variation file format for human… more »

QSEQ File Format

January 6th, 2011
Each record is a single line of tab-separated fields in the following format: - Machine name: unique identifier of the sequencer. - Run number: unique number identifying the run on the sequencer. - Lane number: positive integer (currently 1-8). - Tile number:… more »

GENCODE: Generating release files

January 4th, 2011
A. Input sources
- Ensembl core database with gene models, stable IDs and xrefs
- Vega database of the same release for ID lookup
- 3-way pseudogene file with gene IDs: from Yale, based on the pre-dump file from the same release
- selenocysteine file: mysql -… more »

FASTQ Sequence Files

December 15th, 2010
A good description of the FASTQ format can be found at Illumina: "A fastq file is an ASCII encoded text file that stores DNA or RNA sequences and their corresponding IDs and quality scores. It uses unix newlines and consists of 4 lines per sequence un… more »
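
As a rough illustration of the "4 lines per sequence" layout quoted above, here is a minimal Python sketch that reads a FASTQ stream record by record. The names `FastqRecord` and `read_fastq` are made up for this example, and the checks for a leading "@" on the ID line and a "+" on the third line are assumptions drawn from the common FASTQ convention rather than from the excerpt itself. It also assumes every record occupies exactly four lines, so it will not handle wrapped sequences.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One 4-line FASTQ record (hypothetical helper type for this sketch)."""
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a text handle, assuming exactly 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        header = header.rstrip("\n")
        # Assumed convention: line 1 starts with "@", line 3 starts with "+".
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Each base should have exactly one quality character.
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Small in-memory example with two made-up reads.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n"
records = list(read_fastq(io.StringIO(example)))
for record in records:
    print(record.identifier, record.sequence, record.quality)
```

For real files, the same function can be given an open file handle, for example `open("reads.fastq")`, where the file name is only a placeholder.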

Ensembl Core Database Schema Diagram

November 26th, 2010
To understand the concept of Ensembl and learn how to query the tables, I find it extremely useful to have a schema diagram of the database in front of me. One can be generated using the schema.sql and foreign_keys.sql files from the sql directory… more »