Friday, 10.8.2012 9:41:49
There is a fine set of scripts called biopieces that form an orderly pipeline (or framework) to process bioinformatics data on the Unix command line. You can e.g. process sequencing (NGS) data like this:
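A sketch of such a command (the file names are placeholders and the exact options are from memory; see the biopieces wiki for the documented usage):

```
read_fastq -i data.fq -n 1000 | plot_scores -t png -o scores.png -x
```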
to read the first 1000 sequences from a FASTQ file and plot the scores to an image file.
The result might look like this:
The general logic is
read_data | calculate_something | write_results
with the data being passed through as a "stream" and all modules having the same interface to each other. Installation instructions are here; on my Ubuntu VM I had to follow these steps:
- we need Perl, Ruby, Python, SVN. Install as needed.
sudo apt-get install subversion
- get biopieces code:
svn checkout http://biopieces.googlecode.com/svn/trunk/ biopieces
cd biopieces
svn checkout http://biopieces.googlecode.com/svn/wiki bp_usage
- check pre-requisites with the project's installer script
- missing Perl modules were listed nicely and could be installed as suggested.
- missing Ruby gems could not be installed due to incompatibilities, e.g.:
sudo gem install RubyInline
ERROR: Error installing RubyInline: ZenTest requires RubyGems version > 1.8.
But the project supplies an excellent Ruby installer on the downloads page to create a separate Ruby 1.9 installation, as the default 1.8 one is too old for biopieces and the newer one is not officially supported on Ubuntu.
- modify your ~/.bashrc file to include:
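The block to add looks roughly like this (the paths assume a checkout in your home directory and are taken from memory of the biopieces wiki; adjust them to your install location):

```shell
# biopieces environment (paths assume a checkout in $HOME/biopieces)
export BP_DIR=$HOME/biopieces
export BP_DATA=$HOME/BP_DATA
export BP_TMP=$HOME/BP_TMP
export BP_LOG=$HOME/BP_LOG
export PERL5LIB=$PERL5LIB:$BP_DIR/code_perl
export RUBYLIB=$RUBYLIB:$BP_DIR/code_ruby/lib
export PATH=$PATH:$BP_DIR/bp_bin
```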
- create the data, temp and log directories:
mkdir $BP_DATA $BP_TMP $BP_LOG
The Ruby and Perl lib definitions are necessary to avoid errors like
cannot load such file -- maasha/biopieces (LoadError)
Can't locate Maasha/Fasta.pm in @INC
Some of the almost 200 methods that are implemented in biopieces at this time include:
- read and write various formats like bed, tab, gff, fasta, fastq
- blast sequences against each other or against a genome
- calculate the N50 value for a set of sequences
- create statistics about the exon, intron, etc. content of a (12-column) BED file
To run programs or pipelines automatically it is often necessary to create or adjust configuration files. Ideally this should be done dynamically by a script from a skeleton (layout) file, replacing placeholders with the adjusted values. This can be done with a Unix shell script that even contains the skeleton within:
Save this as script.sh and call it with parameters:
sh script.sh program_name par1 par2
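A minimal sketch of what such a script could look like (the config keys, placeholder names and defaults are invented for illustration):

```shell
#!/bin/sh
# Fill a skeleton config with the values passed on the command line.
# The skeleton lives inside the script as a here-document, and shell
# variable expansion replaces the placeholders on the fly.
PROGRAM=${1:-myprog}   # program name, also names the output file
PAR1=${2:-input.fq}    # first adjustable value
PAR2=${3:-4}           # second adjustable value

# write the filled-in skeleton to <program>.conf
cat > "${PROGRAM}.conf" <<EOF
# auto-generated config for $PROGRAM
input_file = $PAR1
threads    = $PAR2
EOF
```

Called as sh script.sh mapper reads.fq 8 it would then write mapper.conf with the two values filled in.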
The Ensembl variation resources provide information about structural variants and sequence variants (including Single Nucleotide Polymorphisms (SNPs), insertions, deletions and somatic mutations) in the human genome. Details and references are described on the web site and in Chen et al. (2010) Ensembl Variation Resources, BMC Genomics, and other publications listed on the site.
Sources and Descriptions currently included in Ensembl variation resources (v67):
- dbSNP - Variants (including SNPs and indels) imported from dbSNP
- DGVa - Database of Genomic Variants Archive
- NHGRI_GWAS_catalog - Variants associated with phenotype data from the NHGRI GWAS catalog
- COSMIC - Somatic mutations found in human cancers from the COSMIC project
- EGA - Variants imported from the European Genome-phenome Archive with phenotype association
- Uniprot - Variants with protein annotation imported from Uniprot
- HGMD-PUBLIC - Variants from HGMD-PUBLIC dataset March 2012
- OMIM - Variations linked to entries in the Online Mendelian Inheritance in Man (OMIM) database
- Open Access GWAS Database - Johnson & O'Donnell 'An Open Access Database of Genome-wide Association Results' PMID:19161620
- LSDB_LEPRE1 - LEPRE1 homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
- LSDB_PPIB - PPIB homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
- LSDB_CRTAP - CRTAP homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
- LSDB_FKBP10 - FKBP10 homepage - Osteogenesis Imperfecta Variant Database - Leiden Open Variation Database
Ensembl offers the possibility to run the underlying code on your own data and predict the functional consequences of known and unknown variants using the Variant Effect Predictor (VEP).
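At the time of writing the VEP is distributed as a Perl script; an invocation looks roughly like this (file names are placeholders, and the options are from memory; check the Ensembl VEP documentation for the current ones):

```
perl variant_effect_predictor.pl -i my_variants.vcf -o vep_output.txt --cache
```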
Internally the VEP uses PolyPhen which is further explained below:
For a given amino acid substitution in a protein, PolyPhen-2 extracts various sequence- and structure-based features of the substitution site and feeds them to a probabilistic classifier to predict whether the substitution is likely to be damaging or benign.
Sequence-based features include binding or linking sites, transmembrane regions, regulatory modification sites. Profile matrices are calculated to assess the likelihood of the occurrence of this amino acid at the given position.
Structural features include the comparison to known protein 3D structures in PDB, using DSSP (Dictionary of Secondary Structure in Proteins), accessible surface area and properties.
PolyPhen-2 also looks at functional significance of an allele replacement using the UniProtKB database. It uses the "HumDiv" classifier to find disease-related changes and "HumVar" for variations in the "normal" population.
Ensembl have now added a nice blog entry about this with some more details.
Sequence uniqueness within the genome plays an important part when attempting to map short sequence parts - e.g. next-generation short sequencing reads. It is one of the factors that can introduce a bias in sequencing or its analysis - the other important factor being GC content (GC-rich sequences, e.g. genic/exonic regions, as well as very GC-poor regions are often under-represented (Bentley et al. 2008), mainly caused by amplification steps in the protocol). Reads mapped to multiple regions are often discarded, so genomic regions with high sequence degeneracy / low sequence complexity therefore show lower mapped read coverage than unique regions, creating a systematic bias.
The CRG Alignability tracks at the UCSC genome browser display how uniquely k-mer sequences align to a region of the genome. As you can see from the tracks, the mappability increases with read length:
CRG mappability tracks for different read lengths at the UCSC browser
For each window (of sizes 36, 40, 50, 75 or 100 nts), a mappability score was computed:
S = 1 / (number of matches found in the genome),
so S=1 means one match in the genome, S=0.5 means two matches in the genome, and so on. Further description is in the publication of Thomas Derrien, Paolo Ribeca, et al. The data for these tracks can be downloaded; if you are working with other read lengths or genomes, you can run the software to generate the data yourself: get the GEM library and unpack it with
tar xjvf GEM-libraries-Linux-x86_64.tbz2
then create an index and run the mappability part, e.g. with a read length of 250:
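A sketch of the two GEM calls (the file names are placeholders and the options are from memory; check gem-indexer --help and gem-mappability --help for your GEM version):

```
gem-indexer -i genome.fa -o genome_index
gem-mappability -I genome_index.gem -l 250 -o genome_250mer
```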
To query a specific region for its mappability you can also use this online tool http://surveyor.chgr.org/.
An alternative is to look at the "uniqueome" data and publication.
- Fast computation and applications of genome mappability. Derrien T, et al. PLoS One. 2012
- The uniqueome: a mappability resource for short-tag sequencing. Koehler et al. Bioinformatics. 2011; 27(2): 272–274.
- Blog post at MassGenomics
- Systematic bias in high-throughput sequencing data and its correction by BEADS. Cheung et al. 2011
- Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry. Bentley et al., Nature 2008