Sequencing History

January 7th, 2008

Major landmarks in DNA sequencing and molecular biology

* 1953 Discovery of the structure of the DNA double helix (Watson, Crick, Franklin).

* 1958 The semi-conservative nature of DNA replication is demonstrated (Meselson, Stahl)

* 1961 First DNA triplet is decoded (Matthaei, Nirenberg)

* 1972 Development of recombinant DNA technology, which permits isolation of defined fragments of DNA; prior to this, the only accessible samples for sequencing were from bacteriophage or virus DNA.

* 1972 The first gene is sequenced

* 1977 The first complete DNA genome to be sequenced is that of bacteriophage φX174

* 1977 Allan Maxam and Walter Gilbert publish "DNA sequencing by chemical degradation" [4]. Fred Sanger, independently, publishes "DNA sequencing by enzymatic synthesis".

* 1980 Fred Sanger and Wally Gilbert receive the Nobel Prize in Chemistry

* 1982 GenBank starts as a public repository of DNA sequences.

* 1982 Andre Marion and Sam Eletr from Hewlett Packard start Applied Biosystems in May, which comes to dominate automated sequencing.

* 1982 Akiyoshi Wada proposes automated sequencing and gets support to build robots with help from Hitachi.

* 1983 Polymerase chain reaction (PCR) conceived (Mullis)

* 1984 Restriction fragment length polymorphism fingerprinting (Jeffreys)

* 1984 Medical Research Council scientists decipher the complete DNA sequence of the Epstein-Barr virus, 170 kb.

* 1985 Kary Mullis and colleagues develop the polymerase chain reaction, a technique to replicate small fragments of DNA

* 1986 Leroy E. Hood's laboratory at the California Institute of Technology and Smith announce the first semi-automated DNA sequencing machine, commercialized by Applied Biosystems as the 370A.

* 1987 Applied Biosystems markets first automated sequencing machine, the model ABI 370.

o Walter Gilbert leaves the U.S. National Research Council genome panel to start Genome Corp., with the goal of sequencing and commercializing the data.

* 1990 The U.S. National Institutes of Health (NIH) begins large-scale sequencing trials on Mycoplasma capricolum, Escherichia coli, Caenorhabditis elegans, and Saccharomyces cerevisiae (at 75 cents (US)/base).

* 1990 BLAST algorithm for aligning sequences published (Lipman, Myers).

* 1990 Capillary electrophoresis published (Barry Karger, Lloyd Smith, Norman Dovichi).

* 1990 Official start of the Human Genome Project.

* 1991 Craig Venter develops strategy to find expressed genes with ESTs (Expressed Sequence Tags).

o Uberbacher develops GRAIL, a gene-prediction program.

* 1992 Craig Venter leaves NIH to set up The Institute for Genomic Research (TIGR).

o William Haseltine heads Human Genome Sciences, to commercialize TIGR products.

o Wellcome Trust begins participation in the Human Genome Project.

o Simon et al. develop BACs (Bacterial Artificial Chromosomes) for cloning.

o First chromosome physical maps published:

+ Page et al. - Y chromosome[28];

+ Cohen et al. - chromosome 21[29];

+ Lander - complete mouse genetic map[30];

+ Weissenbach - complete human genetic map[31].

* 1993 Wellcome Trust and MRC open Sanger Centre, near Cambridge, UK.

o The GenBank database migrates from Los Alamos (DOE) to NCBI (NIH).

* 1995 Venter, Fraser and Smith publish first sequence of free-living organism, Haemophilus influenzae (genome size of 1.8 Mb).

o Richard Mathies et al. publish on sequencing dyes (PNAS, May)[32].

o Michael Reeve and Carl Fuller, thermostable polymerase for sequencing[8].

* 1996 International HGP partners agree to release sequence data into public databases within 24 hours.

o International consortium releases genome sequence of yeast S. cerevisiae (genome size of 12.1 Mb).

o Yoshihide Hayashizaki's group at RIKEN completes the first set of full-length mouse cDNAs.

o ABI introduces a capillary electrophoresis system, the ABI310 sequence analyzer.

* 1997 Blattner, Plunkett et al. publish the sequence of E. coli (genome size of 5 Mb)[33]

* 1997 First cloned animal, Sheep "Dolly", is born (Wilmut)

* 1998 Phil Green and Brent Ewing of Washington University publish "phred" for interpreting sequencer data (in use since '95)[34].

o Venter starts new company "Celera"; "will sequence HG in 3 yrs for $300m."

o Applied Biosystems introduces the 3700 capillary sequencing machine.

o Wellcome Trust doubles support for the HGP to $330 million for 1/3 of the sequencing.

o NIH & DOE goal: "working draft" of the human genome by 2001.

o Sulston, Waterston et al. finish sequence of C. elegans (genome size of 97 Mb)[35].

* 1999 NIH moves up completion date for rough draft, to spring 2000.

o NIH launches the mouse genome sequencing project.

o First sequence of human chromosome 22 published[36].

* 2000 Celera and collaborators sequence fruit fly Drosophila melanogaster (genome size of 180 Mb) - validation of Venter's shotgun method. HGP and Celera debate issues related to data release.

* 2000 HGP consortium publishes sequence of chromosome 21.[37]

* 2000 HGP & Celera jointly announce working drafts of HG sequence, promise joint publication.

* 2000 Estimates for the number of genes in the human genome range from 35,000 to 120,000.

* 2000 International consortium completes first plant sequence, Arabidopsis thaliana (genome size of 125 Mb).

* 2001 HGP consortium publishes Human Genome Sequence draft in Nature (15 Feb)[38].

* 2001 Celera publishes the Human Genome sequence[39].

* 2002 HapMap project initiated to decipher human genetic variation

* 2005 420,000 VariantSEQr human resequencing primer sequences published on new NCBI Probe database.

* 2005 Genographic project launched to study human migration

* 2007 A set of closely related species (12 Drosophilidae) is sequenced, launching the era of phylogenomics.

* 2007 Craig Venter publishes his full diploid genome

Source: Wikipedia and ABI

Job Scheduling systems on HPCs

January 4th, 2008

LSF

The Load Sharing Facility (LSF) system from IBM Platform can be used to run high-performance computing (HPC) resources. It schedules computational tasks (jobs) and balances the load across the machines that make up the compute farm or cluster.

Basic tasks:

  • send jobs to farm:
    bsub -o output.txt script_name.pl -p parameter_1
  • specify resource requirements:
    bsub -R 'model=IBMBC2800' job
    bsub -R 'select[mem>1000] rusage[mem=1000]' job

    limit access to database:
    select[ecs4<100] rusage[ecs4my3353=10:duration=10:decay=1]
  • kill jobs:
    bkill 0        (kills all of your own jobs)
    bkill jobid
  • example to kill specific jobs:
    bjobs | grep long | awk '{ print $1 }' | xargs bkill
    force-kill: bkill -r <ID>
  • Check jobs:
    bpeek jobid
  • Check farm machines:
    bhosts -w | more
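
Before piping job IDs into bkill, it can help to check which IDs the filter actually selects. The sketch below is one way to do that, under some assumptions: select_job_ids is a hypothetical helper name, the sample text only imitates the default bjobs column layout (JOBID, USER, STAT, QUEUE, ...), and the job IDs, user and host names are made up. It matches on the queue column instead of grepping the whole line, so a job whose name merely contains "long" is not picked up.

```shell
#!/bin/sh
# Print the job IDs (column 1) of all lines whose queue (column 4) equals $1.
# Reads bjobs-style output on stdin and skips the header line.
select_job_ids() {
    awk -v queue="$1" 'NR > 1 && $4 == queue { print $1 }'
}

# Hypothetical sample standing in for real `bjobs` output.
sample_bjobs='JOBID   USER    STAT  QUEUE   FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1001    alice   RUN   long    head1       node07      align_a    Jan  4 10:02
1002    alice   PEND  normal  head1                   align_b    Jan  4 10:03
1003    alice   RUN   long    head1       node09      align_c    Jan  4 10:05'

# Dry run: list the IDs that would be killed, without calling bkill.
printf '%s\n' "$sample_bjobs" | select_job_ids long
```

On a real farm the same function would sit between the two commands from the list above: bjobs | select_job_ids long | xargs bkill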

Grid Engine

Oracle Grid Engine is a similar system that grew out of an open-source software project supported by Sun.

Basic Usage:

  • Submit jobs
    qsub job,
    e.g. for shell scripts:
    qsub -cwd -e run.err -o run.out run_me.sh
    and for binaries:
    qsub -cwd -e run.err -o run.out -b y program.exe
    use -cwd to run the job from the current directory and write the job results there
    use -V to pass all environment variables to the job
  • kill jobs
    qdel jobid
    qdel -u user
  • Show host/job/queue status
    qhost
    qstat
    qstat -j jobid
    qstat -q

    or just: qstat | grep <username>
    more details:
    qstat -j jobid -explain E
  • queue details/available nodes:
    qstat -g c
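
The qsub options used above can also be stored in the job script itself: qsub reads lines that start with "#$" as if they had been given on the command line, while the shell treats them as ordinary comments. The following is a minimal sketch of such a script. The file name run_me.sh is taken from the example above; the job_banner function and its message are purely illustrative.

```shell
#!/bin/sh
# run_me.sh: sketch of a Grid Engine job script with embedded qsub options.
#$ -cwd
#$ -V
#$ -o run.out
#$ -e run.err

job_banner() {
    # JOB_ID is set by Grid Engine inside a running job;
    # fall back to "local" when the script is run by hand.
    echo "job ${JOB_ID:-local} started in $(pwd)"
}

job_banner
```

With the options embedded, the submission shortens to: qsub run_me.sh. Options given on the command line still take precedence, and embedded options are not read when a binary is submitted with -b y.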

Job status codes:

    qw - Queued and waiting
    w  - Job waiting
    s  - Job suspended
    t  - Job transferring and about to start
    r  - Job running
    h  - Job on hold
    R  - Job restarted
    d  - Job has been marked for deletion
    Eqw - An error occurred with the job
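
To get a quick overview of how many jobs are in each of these states, the state column of the qstat listing can be tallied. This is a sketch under assumptions: count_job_states is a hypothetical helper name, and the sample text only imitates the default qstat layout (two header lines, state in the fifth column) with made-up job IDs, user and host names. Job names containing spaces would shift the columns and break the count.

```shell
#!/bin/sh
# Count jobs per state from qstat-style output on stdin.
# The first two lines (column header and dashed rule) are skipped;
# the state is the fifth whitespace-separated column.
count_job_states() {
    awk 'NR > 2 { count[$5]++ } END { for (s in count) print s, count[s] }' |
        LC_ALL=C sort
}

# Hypothetical sample standing in for real `qstat` output.
sample_qstat='job-ID  prior   name       user   state submit/start at     queue         slots ja-task-ID
-----------------------------------------------------------------------------------------------
   2001 0.55500 run_me.sh  alice  r     01/04/2008 10:02:11 all.q@node07  1
   2002 0.55500 run_me.sh  alice  qw    01/04/2008 10:03:40               1
   2003 0.55500 run_me.sh  alice  qw    01/04/2008 10:03:41               1
   2004 0.55500 run_me.sh  alice  Eqw   01/04/2008 10:04:02               1'

printf '%s\n' "$sample_qstat" | count_job_states
```

On a cluster the function would be fed from the real command, for example: qstat | count_job_states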

More details: HowTo, Installation & User manual, and notes.