Improving Website performance

October 29th, 2009

Some general notes on ways to shorten the response time of your web site.

1. Make fewer HTTP requests: Reducing 304 requests with Cache-Control Headers

2. Use a CDN

3. Use a customized php.ini: Creating and using a custom PHP.ini

4. Add an Expires header

- Caching with mod_expires on Apache

- Using a .htaccess file:

Code

<FilesMatch "\.(ico|pdf|flv|jpg|jpeg|png|gif|js|css|swf)$">
   Header set Expires "Thu, 15 Apr 2010 20:00:00 GMT"
</FilesMatch>
(source), or better, in your httpd.conf file.

5. Gzip components

- http://askapache.info/2.0/mod/mod_deflate.html

- or with a .htaccess file (mod_gzip is the Apache 1.3 module; Apache 2.x uses mod_deflate):

Code

<IfModule mod_gzip.c>
   mod_gzip_on Yes
   mod_gzip_dechunk Yes
   mod_gzip_item_include file \.(html?|txt|css|js|php|pl|jpg|png|gif|xml)$
   mod_gzip_item_include handler ^cgi-script$
   mod_gzip_item_include mime ^text/.*
   mod_gzip_item_include mime ^application/x-javascript.*
   mod_gzip_item_exclude mime ^image/.*
   mod_gzip_item_exclude rspheader ^Content-Encoding:.*gzip.*
</IfModule>

source
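To judge whether compression is worth enabling for a given file type, it helps to compare sizes before and after gzip. The following is a minimal sketch; the function name gzip_savings is made up for illustration, and it only assumes that gzip, wc and tr are installed.

```shell
# Hypothetical helper: print "<original bytes> <gzipped bytes>" for one file.
gzip_savings() {
    file=$1
    # tr strips the padding that some wc implementations add.
    orig=$(wc -c < "$file" | tr -d ' ')
    comp=$(gzip -c "$file" | wc -c | tr -d ' ')
    echo "$orig $comp"
}
```

Calling gzip_savings style.css prints both byte counts. Text files such as CSS and Javascript usually shrink a lot, while already compressed formats like jpg or png barely change.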

6. Put CSS at the top in head

7. Move Javascript to the bottom

8. Avoid CSS expressions, keep it simple

9. Make CSS and unobtrusive Javascript as external files not inline

10. Reduce DNS lookups: Use a static IP address; use a subdomain with a static IP address for static content.

11. Minimize Javascript: Refactor the code, compress with dojo

12. Avoid external redirects: Use internal redirection with mod_rewrite; the correct way to redirect with 301

13. Turn off ETags: Prevent Caching with htaccess

14. Make AJAX cacheable and small

Source: Firebug-extension & http://www.askapache.com/web-cache/top-methods-for-faster-speedier-web-sites.html

awk

September 2nd, 2009

awk is an extremely useful Unix tool for quick command-line tasks, in particular in combination with other commands like grep or sort.

"AWK is a data-driven programming language designed for processing text-based data, either in files or data streams. It is an example of a programming language that extensively uses the string datatype, associative arrays (that is, arrays indexed by key strings), and regular expressions." [wikipedia]

Built-in Variables

  • ARGC

    The number of command line arguments (not including options or the awk program itself).

  • ARGV

    The array of command line arguments. The array is indexed from 0 to ARGC - 1. Dynamically changing the contents of ARGV can control the files used for data.

  • CONVFMT

    The conversion format to use when converting numbers to strings.

  • ENVIRON

    An array containing the values of the environment variables. The array is indexed by variable name, each element being the value of that variable. Thus, the environment variable HOME would be in ENVIRON["HOME"]. Its value might be `/u/close'. Changing this array does not affect the environment seen by programs which awk spawns via redirection or the system function. Some operating systems do not have environment variables. The array ENVIRON is empty when running on these systems.

  • FILENAME

    The name of the current input file. If no files are specified on the command line, the value of FILENAME is `-'.

  • FNR

    The input record number in the current input file.

  • FS

    The input field separator, a blank by default.

    Multiple alternative field separators can be given as a regular expression:

    FS="\t|=|;" (nawk)

  • NF

    The number of fields in the current input record.

  • NR

    The total number of input records seen so far.

  • OFMT

    The output format for numbers for the print statement, "%.6g" by default.

  • OFS

    The output field separator, a blank by default.

  • ORS

    The output record separator, by default a newline.

  • RS

    The input record separator, by default a newline. RS is exceptional in that only the first character of its string value is used for separating records. If RS is set to the null string, then records are separated by blank lines. When RS is set to the null string, then the newline character always acts as a field separator, in addition to whatever value FS may have.

  • RSTART

    The index of the first character matched by match; 0 if no match.

  • RLENGTH

    The length of the string matched by match; -1 if no match.

  • SUBSEP

    The string used to separate multiple subscripts in array elements, by default "\034".
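As an illustration of the variables listed above, here is a small sketch that combines FS, OFS, NR, NF and RS. The function names and the sample input are made up for this example.

```shell
# Print record number, field count and last field of colon-separated input.
summarize_fields() {
    awk 'BEGIN { FS = ":"; OFS = "\t" }
         { print NR, NF, $NF }' "$@"
}

# Count blank-line-separated paragraphs (RS set to the null string).
count_paragraphs() {
    awk 'BEGIN { RS = "" } END { print NR }' "$@"
}

printf 'root:x:0:0\nguest:x:1000\n' | summarize_fields
```

Both functions read the files named as arguments, or standard input if none are given.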

String functions

  • index(in, find)
  • length(string)
  • match(string, regexp)
  • split(string, array, fieldsep)
  • sprintf(format, expression1,...)
  • sub(regexp, replacement, target)
  • gsub(regexp, replacement, target)
  • substr(string, start, length)
  • tolower(string)
  • toupper(string)
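A short sketch of several of these functions applied to one string. The function name describe_string is hypothetical; the string is passed in with -v, so backslash escapes in it would be interpreted by awk.

```shell
# Show length, index, substr/toupper, split, gsub and match on one string.
describe_string() {
    awk -v s="$1" 'BEGIN {
        print length(s)                  # number of characters
        print index(s, "-")              # position of the first "-", 0 if none
        print toupper(substr(s, 1, 4))   # first four characters, upper-cased
        n = split(s, parts, "-")         # split on "-" into parts[1..n]
        print n, parts[n]
        t = s
        gsub(/-/, "_", t)                # replace every "-" in the copy t
        print t
        if (match(s, /[0-9]+/))          # first run of digits
            print RSTART, RLENGTH
    }'
}
```

For example, describe_string "gene-0-1" prints one result per line in the order of the comments above.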

source: http://people.cs.uu.nl/piet/docs/nawk/nawk_toc.html

The GTF Format

July 9th, 2009

GTF stands for Gene transfer format. It borrows from GFF, but has additional structure that warrants a separate definition and format name. The current version is 2.2.

Structure is as GFF, so the fields are:

Code

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

Attributes consist of key-value pairs; the key and its value are separated by one space.

Multiple attributes are separated by "; ".

The attributes list must start with gene_id and transcript_id.

Example line with attributes:

seq1     BLASTX  similarity   101  235 87.1 + 0  gene_id "gene-0"; transcript_id "transcript-0-1"; gene_name "Frst1"; expression 1;
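Since the attributes live in a single column, awk's match function is a convenient way to pull one of them out. The sketch below assumes tab-separated columns, as the GTF specification requires (the example line above is shown with spaces); the function name gtf_gene_ids is made up.

```shell
# Print seqname, start, end and the gene_id value of each GTF line.
gtf_gene_ids() {
    awk -F '\t' '
        /^#/ { next }    # skip comment lines
        {
            gene = ""
            if (match($9, /gene_id "[^"]*"/)) {
                # drop the 9-character prefix (gene_id, space, quote)
                # and the closing quote
                gene = substr($9, RSTART + 9, RLENGTH - 10)
            }
            print $1, $4, $5, gene
        }' "$@"
}
```

The same pattern works for transcript_id or any other quoted attribute if the prefix length is adjusted accordingly.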

More details:

http://mblab.wustl.edu/GTF22.html

http://www.bioperl.org/wiki/GTF

The SRF Format

July 2nd, 2009

SRF (Sequence Read Format) is a generic and flexible container format for sequencing and next-generation sequencing files.

Format working group: http://srf.sourceforge.net

It's the preferred format for the submission of sequencing results to archives like the European Nucleotide Archive.

How to use it:

SOLiD software to convert SOLiD output to SRF files.

SOLiD software to convert MA (mapping) files to GFF files.

An API: http://sourceforge.net/projects/srf

Also implemented within the Staden package.

Get basic read counts:

Code

/software/solexa/bin/srf_info -l 1 file.srf

To convert them to fastq:

Code

/software/solexa/bin/srf2fastq -c file.srf

(run without parameters/file for more options.)

Filter out reads flagged as "bad":

srf_filter -b infile.srf outfile.srf

Related blog post on Politigenomics

More info from SOLiD

RSYNC

June 15th, 2009

Using rsync

"rsync is a software application for Unix systems which synchronizes files and directories from one location to another while minimizing data transfer using delta encoding when appropriate." [Wikipedia]

It is very fast for file transfers (typically faster than cp or scp when most files already exist at the destination, since only the differences are sent) and ideal for home-brew backup solutions. There is a lot of documentation on the internet; here are some pointers that were useful for me.

Basic command to sync DIR to a different location:
rsync -r DIR ~/backup/

Basic command to sync DIR from hostA to hostB:
ssh hostA
rsync -r DIR user@hostB:~/backup/

Basic command to list files at a remote server:
rsync rsync://pub@your-ip-or-hostname/

If you require a password, the easiest way is to put it in a file that only you can read (chmod 600) and pass it with:
--password-file=your_file

To fetch recursively, use the recursive -r or the archive -a option:
rsync -aPv source/someDirectory .

To specify a timeout you can set these options (in seconds):
--contimeout=1000 --timeout=1000

You can conveniently specify selected files and directories you want to transfer in an include file, ignoring the rest. Example: transfer basic data from a sequencing run to a remote host:

Code

> cat ~/include_file.txt
Data
Data/Intensities
Data/Intensities/BaseCalls
Data/Intensities/BaseCalls/***
InterOp/***
RunInfo.xml
RunParameters.xml
*.csv

> rsync -arv --include-from="$HOME/include_file.txt" --exclude='*' RunFolder_A34MJNACXX/ ~/temp/

Use --dry-run to test the results before starting a long transfer.
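A minimal sketch of such a test run, wrapped in a function so the same include file can be tried against different folders. The function name preview_sync and its argument order are assumptions made for this example.

```shell
# Preview a filtered transfer; with --dry-run nothing is actually copied.
# Usage: preview_sync INCLUDE_FILE SOURCE_DIR/ DEST_DIR/
preview_sync() {
    rsync -av --dry-run --include-from="$1" --exclude='*' "$2" "$3"
}
```

The output lists the files that would be transferred. Once the list looks right, the same rsync command without --dry-run performs the real copy.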

Use the -vvvv option to see the full step-wise processing of all files and directories.

Note: The Sanger firewall seems to block rsync; you can use it on the (guest) wireless network, though.