Display todays' Date with JavaScript

August 10th, 2012

To display the current date, day of the week and time on a web page, you don't want to refresh the entire page every sencond or minute. Instead you will want to use JavaScript to dynamically update just this date/clock display element. Here is the code for a display in the format

Friday, 10.8.2012    9:41:49

Code

<!DOCTYPE html>
<html>
<head>
<script type="text/javascript">
function startTime(){
  var today=new Date();
  var h=today.getHours();
  var m=today.getMinutes();
  var s=today.getSeconds();
  var month = today.getMonth() + 1
  var day = today.getDate()
  var myDays= ["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
  var weekday = today.getDay()
  var wday = myDays[weekday]
  var year = today.getFullYear()
  // add a zero in front of numbers<10
  m=checkTime(m);
  s=checkTime(s);
  document.getElementById('txt').innerHTML=wday + ", " + day + "." + month + "." + year + "&nbsp;&nbsp;&nbsp;&nbsp;" + h+":"+m+":"+s;
  t=setTimeout(function(){startTime()},500);
}
 
function checkTime(i){
  if (i<10){
    i="0" + i;
  }
  return i;
}
</script>
</head>
 
<body onload="startTime()">
<div id="txt"></div>
</body>
</html>

Sources: trans4mind.com, w3schools.com

Data Processing with Biopieces

August 2nd, 2012

There is a fine set of scripts that form an orderely pipeline (or framework) to process bioinformatics data on the Unix command line called biopieces. You can e.g. process sequencing (NGS) data like this:

Code

biopieces>
./read_fastq -n 1000 -i data/reads.fastq | ./plot_scores -t png -o data/scores.png --no_stream

to read the first 1000 sequences from a FASTQ file and plot the scores to an image file.
The result might look like this:
Data Processing with Biopieces

The general logic is
        read_data | calculate_something | write_results
with the data being passed through as a "stream" and all modules having the same interface to eachother. Installation instructions are here, on my Ubuntu VM I had to follow these steps:

  1. we need Perl, Ruby, Python, SVN. Install as needed.

    Code

    sudo apt-get install subversion
  2. get biopieces code:

    Code

    svn checkout http://biopieces.googlecode.com/svn/trunk/ biopieces cd biopieces svn checkout http://biopieces.googlecode.com/svn/wiki bp_usage
  3. check pre-requisites with the project's installer script

    Code

    bash biopieces_installer.sh
  4. missing Perl modules where listed nicely and could be installed as suggested.
  5. missing Ruby gems could not be installed due to incompatibilities, eg:

    Code

    sudo gem install RubyInline ERROR: Error installing RubyInline: ZenTest requires RubyGems version > 1.8.

    But the project supplies an excellent ruby installer on the downloads page to create a separate Ruby 1.9 installation, as the default 1.8 one is too old for biopieces, the newer one not officially supported on Ubuntu
  6. modify your ~/.bashrc file to include:

    Code

    export BP_DIR="$HOME/bin/biopieces"
    export BP_DATA="$HOME/bin/biopieces/BP_DATA"
    export BP_TMP="$HOME/bin/biopieces/tmp"
    export BP_LOG="$HOME/bin/biopieces/BP_LOG"
    export PATH="/home/test/bin/biopieces/ruby_install/bin:/home/test/bin/biopieces/biopieces/bp_bin:$PATH"
    export RUBYLIB="/home/test/bin/biopieces/biopieces/code_ruby/lib:$RUBYLIB"
    export PERL5LIB="/home/test/bin/biopieces/biopieces/code_perl:$PERL5LIB"

    Code

    source ~/.bashrc
    mkdir $BP_DATA $BP_TMP $BP_LOG
    The Ruby and Perl lib definitions are necessary avoid errors like

    Code

    cannot load such file -- maasha/biopieces (LoadError)
    ----
    Can't locate Maasha/Fasta.pm in @INC

Some of the almost 200 methods that are implemented in biopieces at this time include:

  • read and write various formats like bed, tab, gff, fasta, fastq
  • blast sequences against eachother or against a genome
  • calculate the N50 value for a set of sequences
  • create statistics about the exon, intron, etc. content of a (12-column) BED file

Building Config Files from a Skeleton

July 12th, 2012

To run programs or pipelines automatically it is often necessary to create or adjust configuration files. Ideally this should be done dynamically by a script from a skeleton (layout) file, replacing placeholder with the adjusted values. This can be done with a unix shell script that even contains the skeleton within:

Code

#! /bin/sh
# pass in variables from command-line arguments
prog=$1
var1=$2
var2=$3
 
# do other required tasks
# ...
 
# config skeleton
template='#config file for pipeline
parameter_1=$var1
parameter_2=$var2'
 
# Generate file output.txt from variable
# $template using placeholders above.
echo "$(eval "echo \"$template\"")" \
> $outputfile
 
# run the specified program
# with the new config file
./${prog} -conf ${outputfile}

Save as script.sh and call with parameters:
sh script.sh program_name par1 par2

Source: stackoverflow

Analysing Variation with Ensembl and PolyPhen

May 28th, 2012

The Ensembl variation resources provide information about structural variants and sequence variants (including Single Nucleotide Polymorphisms (SNPs), insertions, deletions and somatic mutations in the human genome. Details and references are described on the web site and in Chen et al. (2010) Ensembl Variation Resources, BMC Genomics and other publications listed in the site.

Sources and Descriptions currently included in Ensembl variation resources (v67):

  • dbSNP - Variants (including SNPs and indels) imported from dbSNP
  • DGVa - Database of Genomic Variants Archive
  • NHGRI_GWAS_catalog - Variants associated with phenotype data from the NHGRI GWAS catalog
  • COSMIC - Somatic mutations found in human cancers from the COSMIC project
  • EGA - Variants imported from the European Genome-phenome Archive with phenotype association
  • Uniprot - Variants with protein annotation imported from Uniprot
  • HGMD-PUBLIC - Variants from HGMD-PUBLIC dataset March 2012
  • OMIM - Variations linked to entries in the Online Mendelian Inheritance in Man (OMIM) database
  • Open Access GWAS Database - Johnson & O'Donnell 'An Open Access Database of Genome-wide Association Results' PMID:19161620
  • LSDB_LEPRE1 - LEPRE1 homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
  • LSDB_PPIB - PPIB homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
  • LSDB_CRTAP - CRTAP homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
  • LSDB_FKBP10 - FKBP10 homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database

Ensembl offers the possibility to run the underlying code on your own data and predict the functional consequences of known and unknown variants using the Variant Effect Predictor (VEP).

Internally the VEP uses PolyPhen which is further explained below:

For a given amino acid substitution in a protein, PolyPhen-2 extracts various sequence and structure-based features of the substitution site and feeds them to a probabilistic classifier to identify:

Sequence-based features include binding or linking sites, transmembrane regions, regulatory modification sites. Profile matrices are calculated to assess the likelihood of the occurrence of this amino acid at the given position.

Structural features include the comparison to known protein 3D structures in PDB, using DSSP (Dictionary of Secondary Structure in Proteins), accessible surface area and properties.

PolyPhen-2 also looks at functional significance of an allele replacement using the UniProtKB database. It uses the "HumDiv" classifier to find disease-related changes and "HumVar" for variations in the "normal" population.

Ensembl have now added a nice blog entry about this with some more details.

Sequence Mappability & Alignability

May 16th, 2012

Sequence uniqueness within the genome plays an important part when attempting to map short sequence parts - e.g. next-generation short sequencing reads. It is one of the factors that can introduce a bias in sequencing or it's analysis - the other important factor being GC content (GC-rich sequences, eg. genic/exonic region, as well as very GC-poor regions are often under-represented (Bentley et al. 2008), mainly caused by amplificatin steps in the protocol). Reads mapped to multiple regions are often discarded, genomic regions with high sequence degeneracy / low sequence complexity therefor show lower mapped read coverage than unique regions, creating a systematic bias.

The CRG Alignability tracks at the UCSC genome browser display how uniquely k-mer sequences align to a region of the genome. As you can see from the tracks, the mappability increases with read length:

Sequence Mappability &amp; Alignability

CRG mappability tracks for different read lengths at the UCSC browser

For each window (of sizes 36, 40, 50, 75 or 100 nts), a mapability score was computed:
S = 1 / (number of matches found in the genome),
so S=1 means one match in the genome, S=0.5 is two matches in the genome, and so on. Further desription in the publication of Thomas Derrien, Paolo Ribeca, et al. The data for these tracks can be downloaded, if you are working with other read lengths or genomes, you can run the software to generate the data yourself: Get the Gem library, unpack it with tar xbvf GEM-libraries-Linux-x86_64.tbz2, create an index:

Code

gem-do-index -i genome.fasta -o gem_index

run the mappability part, eg. with a read length of 250:

Code

gem-mappability -I gem_index -l 250 -o mappability_250.gem

To query a specific region for its mappability you can also use this online tool http://surveyor.chgr.org/.

An alternative is to look at the "uniqueome" data and publication.

Refs:

  • Fast computation and applications of genome mappability.
    Derrien T, et al. PLoS One. 2012
  • The uniqueome: a mappability resource for short-tag sequencing. Koehler et al. Bioinformatics. 2011; 27(2): 272–274.
  • Blog post at MassGenomics
  • Systematic bias in high-throughput sequencing data and its correction by BEADS. Cheung et al. 2011
  • Accurate Whole Human Genome Sequencing using Reversible
    Terminator Chemistry. Bentley et al., Nature 2008