Please note that I am moving most of these posts to my page
New articles will appear only there.
To display data as bar graphs along the genome, e.g. in the UCSC genome browser, you can create BigWig files. The underlying Wig format is described in more detail here. BigWig is the binary version (described here), which allows compression of the data and on-demand streaming from a remote location to the machine running the display (i.e. the genome browser).
To create and work with the data, various software options are available; my current recommendation is:
DeepTools, a "suite of python tools particularly developed for the efficient analysis of high-throughput sequencing data". The documentation pages show more options, but to get started install with:
conda install -c bioconda deeptools
or
pip install --user deepTools
and create a BigWig from a Bam file:
bamCoverage -b Seqdata.bam -o Seqdata.bigWig
Originally a Galaxy workflow, DeepTools2 can run on the command line or as an API. It was published by Fidel Ramírez et al.
You can display the data in the UCSC browser by adding a custom track (more details) in the form on the track control page using a line like the following to point to your file that needs to be internet-accessible:
track type=bigWig name="My Big Wig" description="Seqdata BigWig file" bigDataUrl=https://your.server.com/Seqdata.bigWig
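If you want to inspect such a file programmatically, a small Python sketch using the pyBigWig package (pip install pyBigWig) could look like the following; the chromosome name and coordinates are placeholders you would adapt to your data:
Code
import pyBigWig

# open the BigWig created above and query it (region names/coordinates are examples)
bw = pyBigWig.open("Seqdata.bigWig")
print(bw.chroms())                              # chromosome names and lengths
print(bw.stats("chr1", 0, 10000, type="mean"))  # mean coverage of a region
print(bw.values("chr1", 0, 10))                 # per-base values of a small window
bw.close()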
GenePix Array List (GAL) files are text files with specific information about the location, size, and name of each DNA spot on a microarray. They are therefore of vital importance for the analysis of scanned microarray images. The format defines a specific header before the list of data columns follows:
Example:
ATF 1
9 5
Type=GenePix ArrayList V1.0
BlockCount=1
BlockType=0
"Block1=10000, 38780, 150, 20, 200, 18, 200"
Supplier=BioRobotics
ArrayerSoftwareName=TAS Application Suite (MicroGrid II)
ArrayerSoftwareVersion=2.7.1.18
ScanResolution=10
Block Column Row ID Name
1 1 1 RP11-163J21 Clone 1
1 1 2 RP11-163J21 Clone 2
Explanations:
ATF -> File conforms to Axon Text File
1 -> Version number of ATF
9 -> Number of header lines before the "Block, Column, Row, ..." line
5 -> Number of data columns (Block, Column, Row, ID, Name)
Type=GenePix ArrayList V1.0 -> Type of file, same for all GAL files
Block Count=1 -> Number of blocks described in the file
Block Type=0 -> Type of block, 0 = rectangular Block
X=A, B, C, D, E, F, G -> The position and dimensions of each block.
A -> xOrigin
B -> yOrigin
C -> Feature diameter
D -> xFeatures
E -> xSpacing
F -> yFeatures
G -> ySpacing
ScanResolution -> Optional parameter to scale the position on higher-resolution images
Block arrangement:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
The data columns are: Block, Column, Row, ID, and Name.
Further reading and sources:
The within-array quality for (genomic) microarrays is often measured using the following metrics:
Source: BlueGnome user docs
See also: Microarray Scanners and PGS consulting in the UK & Ireland
Congress passed the Clinical Laboratory Improvement Amendments (CLIA) act in 1988 to establish quality standards for all non-research laboratory testing:
The objective of the CLIA program is to ensure quality in laboratory testing procedures and specifically to establish quality standards to ensure the accuracy, reliability, and timeliness of the patient’s test results. The CLIA Quality System Regulations became effective on April 24, 2003. Now the laboratory is required to check (verify) the manufacturer's performance specifications provided in the package insert for:
The number of samples needs to be established for every test; 20 samples are seen as a rule of thumb.
The FDA defines a Laboratory Developed Test (LDT) as an in vitro diagnostic test that is manufactured by and used within a single laboratory (i.e. a laboratory with a single CLIA certificate). LDTs are also sometimes called in-house developed tests, or “home brew” tests. Similar to other in vitro diagnostic tests, LDTs are considered “devices,” as defined by the FFDCA, and are therefore subject to regulatory oversight by FDA.
Sources:Centers for Medicare & Medicaid Services, Genohub
Most of us take vaccinations for granted and rely on them from our very first days. Whooping cough, for example, can be deadly, especially for young babies who are too young to be protected by their own vaccination. Since 2010, the Centers for Disease Control and Prevention (CDC) has recorded between 10,000 and 50,000 cases each year in the United States and up to 20 babies dying. One recent study showed that many whooping cough deaths among babies could be prevented if all babies received the first dose of vaccination on time at 2 months old, when they are old enough to get vaccinated (CDC). Still, some parents believe they know better and risk their children's lives by not vaccinating them at all.
For the US the CDC recommends vaccination of newborns / babies against the following diseases:
For Germany the situation is almost the same, and the following vaccinations are recommended for babies under 2 years:
Sources: CDC, Robert-Koch-Institut
As part of the health assessment of newborn babies, a test for common genetic conditions is done by drawing a few drops of blood from the heel of the baby and sending this off for analysis. Any positive result will then be followed up by a confirmatory test, and treatment can be initiated if required. The conditions are mostly life-threatening or disabling for the child if undiagnosed or left untreated.
Below is a list of conditions that are screened as part of the current standard panel of core conditions and secondary conditions in the US health system. Secondary conditions are results that will be additionally (unintentionally) revealed when testing for the core conditions. If desired, there are even more options for testing (supplemental screening). What tests are offered or paid for depends on the state and the insurance. This information is taken from babysfirsttest.org.
1. Metabolic Disorders
ORGANIC ACID CONDITIONS
FATTY ACID OXIDATION DISORDERS
AMINO ACID DISORDERS
2. Endocrine Disorders
3. Hemoglobin Disorders
4. Other Disorders
5. Lysosomal Storage Disorders
See more at: www.babysfirsttest.org
As the largest cloud computing company, Amazon Web Services (AWS) offers various options to use compute power on an "as-needed" basis. You can choose the size, type and number of machines - and you can choose a price model where you are "bidding" for the resource. This means you might have to wait longer to get it, but you will get an impressive discount! You can choose your machines from the AWS dashboard.
Here is a comparison of the current prices for "General Purpose - Current Generation" AWS machines in the EU (Frankfurt) region (as of 13/04/2017):
Instance | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | On-Demand Price per Hour (Linux/UNIX) | Spot Price per Hour | Saving % |
---|---|---|---|---|---|---|---|
m4.large | 2 | 6.5 | 8 | EBS Only | $0.129 | $0.0336 | 74 |
m4.xlarge | 4 | 13 | 16 | EBS Only | $0.257 | $0.0375 | 85 |
m4.2xlarge | 8 | 26 | 32 | EBS Only | $0.513 | $0.1199 | 77 |
m4.4xlarge | 16 | 53.5 | 64 | EBS Only | $1.026 | $0.3536 | 66 |
m4.10xlarge | 40 | 124.5 | 160 | EBS Only | $2.565 | $1.1214 | 56 |
m4.16xlarge | 64 | 188 | 256 | EBS Only | $4.104 | $0.503 | 88 |
m3.medium | 1 | 3 | 3.75 | 1x4 SSD | $0.079 | $0.0114 | 86 |
m3.large | 2 | 6.5 | 7.5 | 1x32 SSD | $0.158 | $0.0227 | 86 |
m3.xlarge | 4 | 13 | 15 | 2x40 SSD | $0.315 | $0.047 | 85 |
m3.2xlarge | 8 | 26 | 30 | 2x80 SSD | $0.632 | $0.1504 | 76 |
This only shows a selection of machine options and the prices obviously change over time - but the message should be clear...
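The saving column is simply the relative difference between the two prices; a quick Python check of the numbers above:
Code
# percent saved when paying the spot price instead of the on-demand price
def spot_saving(on_demand, spot):
    return round((1 - spot / on_demand) * 100)

print(spot_saving(0.129, 0.0336))   # m4.large    -> 74
print(spot_saving(4.104, 0.503))    # m4.16xlarge -> 88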
For any large software project (i.e. one that requires more than a few scripts performing a one-off task) and for every project that was initiated by a customer request, it is useful to precisely define the requirements before starting to write any code. This might be painful at times and slow down the coding fun, but it should avoid a lot of frustration on either side in the end.
Here is a short summary of what a Software Requirements Specification (SRS, IEEE 830) is, how to write one, and what it is good for.
An SRS is a complete description of the behavior of a system to be developed, including use cases.
The benefits of writing specifications when planning a software project are:
Key points to address:
Avoid design details and coding details in the specs. Hardware requirements etc. go into general System Specifications, not SRS. The content and language of the document should fit the description with the following key words:
Complete, Consistent, Accurate, Modifiable, Ranked, Testable, Traceable, Unambiguous, Valid, Verifiable
Descriptions of "use cases", mock-up GUI components and other visual aids are extremely useful to communicate with the parties involved.
Sources:
Wikipedia
www.microtoolsinc.com
www.techwr-l.com
As part of the Primary Analysis Illumina sequencing machines measure the intensity of the channels used for encoding the different bases and identify the most likely base at a given position of a sequencing read (tag). The Real Time Analysis (RTA) software writes the base and the confidence in the call as a quality score to base call (.bcl) files. As the name implies this is done in real time, i.e. for every cycle of the sequencing run a call for every location identified on the flow cell (tiles and lanes) is added. Bcl files are stored in binary format and represent the raw data output of a sequencing run. The format is described here. Software such as Casava/BclToFastq, Eland or the iSAAC aligner can make use of these files.
The *.bcl files are stored in the BaseCalls directory:
<run directory>/Data/Intensities/BaseCalls/L<lane>/C<cycle>.1
They are named in the format:
s_<lane>_<tile>.bcl
If you want to overcome errors during downstream processing from missing calls, software such as iSAAC and configureBclToFastq have an "--ignore-missing-bcl" command line option. This will interpret missing *.bcl files as no call (N) at that position.
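As a hedged illustration (not Illumina-supplied code), a short Python sketch could enumerate the expected *.bcl files from this naming scheme and report which ones are missing; the run directory, lanes, tiles and cycle count below are placeholders:
Code
import os

run_dir = "/path/to/run"          # <run directory>
lanes   = [1, 2]                  # lanes present on the flow cell
tiles   = [1101, 1102]            # tile numbers
cycles  = range(1, 151)           # e.g. 150 sequencing cycles

missing = []
for lane in lanes:
    for cycle in cycles:
        # lane folders are usually zero-padded, e.g. L001
        cycle_dir = os.path.join(run_dir, "Data", "Intensities", "BaseCalls",
                                 f"L{lane:03d}", f"C{cycle}.1")
        for tile in tiles:
            bcl = os.path.join(cycle_dir, f"s_{lane}_{tile}.bcl")
            if not os.path.exists(bcl):
                missing.append(bcl)

print(f"{len(missing)} bcl files missing")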
Sources: Illumina, SeqAnswers
Some researchers and clinicians believe embryo morphology and development characteristics can be used to assess the viability of IVF embryos to increase chances of a successful pregnancy.
Healthy embryos, i.e. the most viable zygotes that will develop into blastocysts and further seem to follow a specific growth pattern between development day 3 and re-implantation on day 5:
Growth from 2 to 3 cells should be seen in 9-11 hours, and from 3 to 4 cells in under 2 hours. Reaching day 5 is critical, as the embryo will be re-implanted into the uterus and will attach to the endometrium. The normal development process is shown in figure 1 (source: CMFT NHS):
Embryo morphology is graded on a scale of 1 to 5 as shown in fig 2 (source: CMFT NHS):
Further readings:
* http://www.ivf.com/morphology.html
Cystic Fibrosis (CF), also called Mucoviscidosis, is a hereditary (autosomal recessive) disease in which exocrine (secretory) glands produce abnormally thick mucus. This mucus can cause problems in digestion, breathing, and body cooling. It affects up to one out of 3,000 newborns (with northern European ancestry). There are well over a hundred genetic changes linked to CF. It is an area companies like Illumina are very active in, with a special assay cleared by the FDA as an in-vitro diagnostic test for the detection of most of the genetic variants known to cause the disease.
Here are notes from a presentation Dr. Carlos Bustamante gave at a recent ClinGen conference:
CFTR:
CF disease:
Variants:
Other sources used: PubMedHealth, Wikipedia
CRAM files are compressed versions of BAM files containing (aligned) sequencing reads. They represent a further file size reduction for this type of data that is generated at ever increasing quantities. Where SAM files are human-readable text files optimized for short read storage, BAM files are their binary equivalent, and CRAM files are a restructured column-oriented binary container format for even more efficient storage.
The key components of the approach are that positions are encoded in a relative way (i.e., the difference between successive positions is stored rather than the absolute value) and stored as a Golomb code. Also, only differences to the reference genome are listed instead of the full sequence.
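As a toy illustration of the relative-encoding idea (not the actual CRAM codec), consider how sorted alignment start positions shrink to small numbers once only the differences are stored:
Code
def delta_encode(positions):
    deltas, prev = [], 0
    for p in positions:
        deltas.append(p - prev)   # store difference to the previous position
        prev = p
    return deltas

def delta_decode(deltas):
    positions, total = [], 0
    for d in deltas:
        total += d
        positions.append(total)
    return positions

pos = [100005, 100012, 100012, 100040, 100041]
enc = delta_encode(pos)            # [100005, 7, 0, 28, 1] -- mostly small numbers
assert delta_decode(enc) == pos    # lossless round trip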
The compression rates achieved are shown in the graph below generated by Uppsala University:
Comparing speed: using the C implementation of CRAM (James K. Bonfield), decoding is 1.5-1.7x slower than for BAM files, but encoding is 1.8-2.6x faster. (File size savings are reported at 34-55%.)
Additional compression can be achieved by reducing the granularity of the quality values, although this results in lossy compression. Illumina suggested a binning of Q-scores without significant impact on calling performance.
Binning of similar Q-scores (Illumina):
Compression achieved by Q-score binning (Illumina):
Sources and further reading:
HiSeq & MiSeq
The HiSeq and MiSeq use a green laser to sequence G/T and a red laser to sequence A/C. At each cycle at least one of two nucleotides for each color channel must be read to ensure proper registration. It is important to maintain color balance for each base of the index read being sequenced, otherwise index read sequencing could fail due to registration failure. E.g. if the sample contains only T and C in the first four cycles, image registration will fail. (If possible spike-in phiX sequence to add diversity to low-plex sequencing libraries.)
If one or more bases are not present in the first 11 cycles the quality of the run will be negatively impacted. This is because the color matrix is calculated from the color signals of these cycles.
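A rough Python sketch of such a per-cycle colour-balance check for a set of index sequences (the indexes below are made-up examples):
Code
from collections import Counter

indexes = ["ATCACG", "CGATGT", "TTAGGC", "TGACCA"]   # hypothetical index set

for cycle in range(len(indexes[0])):
    bases = Counter(idx[cycle] for idx in indexes)
    green = bases["G"] + bases["T"]          # green laser: G/T
    red   = bases["A"] + bases["C"]          # red laser:  A/C
    status = "OK" if green and red else "WARNING: single colour channel"
    print(f"cycle {cycle + 1}: {dict(bases)} green={green} red={red} {status}")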
NextSeq 500
The NextSeq 500 uses two-channel sequencing, which requires only two images to encode the data for four DNA bases, one red channel and one green channel. The NextSeq also uses a new implementation of real-time analysis (RTA) called RTA2.0, which includes important architecture differences from RTA on other Illumina sequencers. For any index sequences, RTA2.0 requires that there is at least one base other than G in the first two cycles. This requirement for index diversity allows the use of any Illumina index selection for single-plex indexing except index 1 (i7) 705, which uses the sequence GGACTCCT. Use the combinations in the table below for proper color balancing on the NextSeq 500.
Source:
Illumina Nextera tech notes, Illumina low diversity note
See also TruSeq Guide
Quality scoring of the base calls
"Quality scores measure the probability that a base is called incorrectly. With SBS technology, each base in a read is assigned a quality score by a phred-like algorithm, similar to that originally developed for Sanger sequencing experiments. The quality score of a given base, Q, is defined by the equation
Q = -10log10(e)
where e is the estimated probability of the base call being wrong. Thus, a higher quality score indicates a smaller probability of error."(1)
The quality score is usually quoted as QXX, where XX is the score; it means that a particular base call (or all base calls of a read / of a sample / of a run) has a probability of error of 10^(-XX/10). For example, Q30 equates to an error rate of 1 in 1,000, or 0.1%; Q40 equates to an error rate of 1 in 10,000, or 0.01%.
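The conversion in both directions is a one-liner; a small Python example:
Code
import math

def q_to_error(q):
    return 10 ** (-q / 10.0)

def error_to_q(e):
    return -10 * math.log10(e)

print(q_to_error(30))     # 0.001  -> 1 error in 1,000
print(q_to_error(40))     # 0.0001 -> 1 error in 10,000
print(error_to_q(0.001))  # ~30.0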
During the primary analysis (real-time analysis, RTA) on the sequencing machine, quality scoring is performed by calculating a set of predictors for each base call, and using those predictor values to look up the quality score in a quality table. The quality table is generated using a modification of the Phred algorithm on a calibration data set representative of run and sequence variability
"It is important to note how quickly or slowly quality scores degrade over the course of a read. With short-read sequencing, quality scores largely dictate the read length limits of different sequencing platforms. Thus, a longer read length specification suggests that the raw data from that platform have consistently higher quality scores across all bases." (1)
Mapping / Alignment scores
For each alignment, BWA calculates a mapping quality score, which is the (Phred-scaled) probability of the alignment being incorrect. The algorithm is similar between BWA and MAQ, except that BWA assumes that the true hit can always be found. The error probability corresponding to a Phred-scaled quality is:
p = 10 ^ (-q/10)
where q is the quality. For example, for a mapping quality of 40: 10^(-40/10) = 0.0001, which means there is a 0.01% chance that the alignment is incorrect.
Example for a whole read:
If your read is 25 bp long and the expected sequencing error rate is 1%, the probability of the read with 0 errors is:
0.99^25 = 0.78
If there is 1 perfect alignment and 5 possible alignment positions with 1 mismatch, we combine these probabilities: The probability of the read with 1 error is
0.20
combined posterior probability that the best alignment is correct:
P(0-errors) / (P(0-errors) + 5 * P(1-errors))
= 0.44, which is low.
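The same numbers in a few lines of Python, for reproducing the example:
Code
# posterior probability that the error-free alignment of a 25 bp read is
# correct, given five alternative positions with one mismatch each
read_len, err = 25, 0.01

p0 = (1 - err) ** read_len                          # 0 sequencing errors: ~0.78
p1 = read_len * err * (1 - err) ** (read_len - 1)   # exactly 1 error:     ~0.20

posterior = p0 / (p0 + 5 * p1)                      # ~0.44
print(round(p0, 2), round(p1, 2), round(posterior, 2))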
Base quality is apparently not considered in evaluating hits in bwa.
Sources:
Using a text editor, create a file for your remote server's logon credentials:
gedit ~/.smbcredentials
Enter your Windows username and password in the file:
username=msusername
password=mspassword
Restrict the permissions of the credentials file:
chmod 600 ~/.smbcredentials
Edit your /etc/fstab file:
//servername/sharename /media/windowsshare cifs credentials=/home/ubuntuusername/.smbcredentials,iocharset=utf8,sec=ntlm 0 0
Then mount all entries from fstab:
sudo mount -a
Ref: https://wiki.ubuntu.com/MountWindowsSharesPermanently
To assess whether a new test (e.g. a diagnostic test or medical device testing for disease or non-disease status) is equivalent to an existing test, the following measures can be reported. They can be of importance for the submission of premarket notification (510(k)) or premarket approval (PMA) applications for diagnostic devices (tests) to the American Food and Drug Administration (FDA). A new test is usually compared to an existing and established test or a generally trusted reference. If the existing test (or reference) is not perfect, the FDA recommends reporting the positive and negative percent agreement (PPA/NPA). These are calculated from true positives, false positives, false negatives and true negatives like this (1):
| New Test | Existing Test R+ | Existing Test R- | Total |
|---|---|---|---|
| T+ | TP | FP | TP+FP |
| T- | FN | TN | FN+TN |
| Total | TP+FN | FP+TN | TP+FP+FN+TN |
PPA = TP * 100 / (TP + FN)
NPA = TN * 100 / (TN + FP)
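As a minimal sketch, the same calculation in Python (the counts are made-up example numbers chosen to match the 490/500 fraction quoted below):
Code
def percent_agreement(tp, fp, fn, tn):
    ppa = tp * 100.0 / (tp + fn)   # positive percent agreement
    npa = tn * 100.0 / (tn + fp)   # negative percent agreement
    return ppa, npa

ppa, npa = percent_agreement(tp=490, fp=5, fn=10, tn=495)
print(f"PPA = {ppa:.1f}%  NPA = {npa:.1f}%")   # PPA = 98.0%  NPA = 99.0%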
Measures of accuracy
The FDA "recommends you report measures of diagnostic accuracy (sensitivity and specificity pairs, positive and negative likelihood ratio pairs) or measures of agreement (percent positive agreement and percent negative agreement) and their two-sided 95 percent confidence intervals. We recommend reporting these measures both as fractions (e.g., 490/500) and as percentages (e.g., 98.0%)." (2) Sensitivity and specificity are explained here. In general th FDA recommends to report (2):
References:
SNP calling refers to the process of identifying positions where the genome of a sequenced sample differs from that of the reference genome. This might lead to finding disease-causing genomic alterations.
In the following I wanted to re-align short NGS reads against a specific reference (in this case the mitochondrial genome sequence). A simple way is to use bwa, samtools and bedtools.
1. make a reference genome index
bwa index -a is NCBI_chrM.fa
2. filter reads
samtools view -F4 -hb A1_S1.bam chrM > A1_S1_chrM.bam
samtools view -f4 -hb A1_S1.bam > A1_S1_unmapped.bam
samtools merge A1_S1_chrM_and_Un.bam A1_S1_chrM.bam A1_S1_unmapped.bam
3. create fastq
bamToFastq -i A1_S1_chrM_and_Un.bam -fq A1_S1_chrM_and_Un.1.fq \ -fq2 A1_S1_chrM_and_Un.2.fq 2> bwa.err &
4. align to new reference
bwa aln NCBI_chrM.fa A1_S1_chrM_and_Un.1.fq > A1_S1_chrM_and_Un.1.sai
bwa aln NCBI_chrM.fa A1_S1_chrM_and_Un.2.fq > A1_S1_chrM_and_Un.2.sai
bwa sampe NCBI_chrM.fa A1_S1_chrM_and_Un.1.sai A1_S1_chrM_and_Un.2.sai \
  A1_S1_chrM_and_Un.1.fq A1_S1_chrM_and_Un.2.fq > A1_S1_chrM_realigned.sam
samtools view -F4 -Sbh A1_S1_chrM_realigned.sam \
  | samtools sort -o - sorted > A1_S1_chrM_realigned.bam
5. call SNPs
samtools mpileup -uD -f NCBI_chrM.fa A1_S1_chrM_realigned.bam \ | bcftools view -cg - > A1_S1_chrM_realigned.sam.vcf
From the Samtools help pages:
One should consider applying the following parameters to mpileup in different scenarios:
The VCF format
The Variant Call Format (VCF) is the emerging standard for storing variant data. Originally designed for SNPs and short INDELs, it also works for structural variations.
VCF consists of a header and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below).
Col  Field      Description
1    CHROM      Chromosome name
2    POS        1-based position. For an indel, this is the position preceding the indel.
3    ID         Variant identifier. Usually the dbSNP rsID.
4    REF        Reference sequence at POS involved in the variant. For a SNP, it is a single base.
5    ALT        Comma delimited list of alternative sequence(s).
6    QUAL       Phred-scaled probability of all samples being homozygous reference.
7    FILTER     Semicolon delimited list of filters that the variant fails to pass.
8    INFO       Semicolon delimited list of variant information.
9    FORMAT     Colon delimited list of the format of individual genotypes in the following fields.
10+  Sample(s)  Individual genotype information defined by FORMAT.
The following table gives the INFO tags used by samtools and bcftools.
Tag    Description
AC     Allele count in genotypes
AC1    Max-likelihood estimate of the first ALT allele count (no HWE assumption)
AF1    Max-likelihood estimate of the first ALT allele frequency (assuming HWE)
AN     Total number of alleles in called genotypes
CGT    The most probable constrained genotype configuration in the trio
CLR    Log ratio of genotype likelihoods with and without the constraint
DP     Raw read depth (sum for all samples)
DP4    Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
FQ     Phred probability of all samples being the same
G3     ML estimate of genotype frequencies
HWE    Hardy-Weinberg equilibrium test (PMID:15789306)
ICF    Inbreeding coefficient F
INDEL  Indicates that the variant is an INDEL
IS     Maximum number of reads supporting an indel and fraction of indel reads
MDV    Maximum number of high-quality nonRef reads in samples
MQ     Root-mean-square mapping quality of covering reads
PC2    Phred probability of the nonRef allele frequency in group1 samples being larger (, smaller) than in group2
PCHI2  Posterior weighted chi2 P-value for testing the association between group1 and group2 samples
PR     Number of permutations yielding a smaller PCHI2
PV4    P-values for strand bias, baseQ bias, mapQ bias and tail distance bias
QBD    Quality by Depth: QUAL/#reads
QCHI2  Phred-scaled PCHI2
RP     Number of permutations yielding a smaller PCHI2
RPB    Read Position Bias
SF     Source File (index to sourceFiles, f when filtered)
TYPE   Variant type
UGT    The most probable unconstrained genotype configuration in the trio
VDB    Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data
Sources:
To block a specific IP address from network access to your (Ubuntu Linux) system, you can add it to your firewall settings:
sudo iptables -A INPUT -s 223.4.208.56 -j DROP
To remove this entry:
sudo iptables -D INPUT -s 223.4.208.56 -j DROP
To just list current firewall rules:
sudo iptables -L
Sources: cyberciti.biz
The GC content is the molar ratio of guanine and cytosine bases in DNA. The human genome is a mosaic of GC-rich and GC-poor regions of around 300 kb in length, called isochores. GC content is an important factor in many experiments and bioinformatic analyses. This is especially true for next-generation sequencing, where the DNA being sequenced has gone through multiple rounds of PCR amplification.
Average GC content per chromosome:
chrom  GC content
1      0.417439
2      0.402438
3      0.396943
4      0.382479
5      0.395163
6      0.396109
7      0.407513
8      0.401757
9      0.413168
10     0.415849
11     0.415657
12     0.40812
13     0.385265
14     0.408872
15     0.42201
16     0.447894
17     0.455405
18     0.39785
19     0.483603
20     0.441257
21     0.408325
22     0.479881
X      0.394963
Y      0.391288
MT     0.443626
The common way to reduce the GC bias in data analysis is to basically
More details on the GC bias in next-gen sequencing are described by Benjamini and Speed here: "The bias is not consistent between samples; and there is no consensus as to the best methods to remove it in a single sample. (...) It is the GC content of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC-rich fragments and AT-rich fragments are underrepresented in the sequencing results. This empirical evidence strengthens the hypothesis that PCR is the most important cause of the GC bias."
Correcting the bias can follow a "read model", "fragment model" or a "global model".
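Computing the GC content of reads or genomic windows is the usual first step of such corrections; a simple Python sketch (the sequence is a made-up example):
Code
def gc_content(seq):
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def gc_windows(seq, size=100):
    # GC fraction per non-overlapping window along the sequence
    return [gc_content(seq[i:i + size]) for i in range(0, len(seq), size)]

print(gc_content("ATGCGCATTA"))   # 0.4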
Sources: biostars.org, PubMed, PubMed
See also: Chromosome length table
A scheduled task on Microsoft Windows 2008 failed "due to a time trigger condition", with an error message including "Data: Error Value 2147943726.", after running without problems before.
The reason for this was that the network-wide password for the user account assigned to running the task had been changed since setting up the task.
Re-opening the task properties (double-click in the "Active Tasks" list and select "Options" from the right-hand menu) and saving with the new password fixed the problem.
Here is a quick list of the sizes of human chromosomes in assembly GRCh37 as defined by Ensembl:
chrom  length [bp]
1      249,250,621
2      243,199,373
3      198,022,430
4      191,154,276
5      180,915,260
6      171,115,067
7      159,138,663
8      146,364,022
9      141,213,431
10     135,534,747
11     135,006,516
12     133,851,895
13     115,169,878
14     107,349,540
15     102,531,392
16     90,354,753
17     81,195,210
18     78,077,248
19     59,128,983
20     63,025,520
21     48,129,895
22     51,304,566
X      155,270,560
Y      59,373,566
Mt     16,569
These sizes are useful for calculations of percent coverage of genomic features or sequencing reads.
They are often required when working with BED files.
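A trivial example of such a calculation in Python, using values from the table above:
Code
# what fraction of chromosome 1 (GRCh37) does a 2.5 Mb region cover?
chrom_length = {"1": 249_250_621, "X": 155_270_560}   # subset of the table above

region_bp = 2_500_000
print(region_bp / chrom_length["1"] * 100)   # ~1.0 % of chromosome 1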
Related: Chromosome ideograms and nomenclature, chromosome GC content
There are many sophisticated services and scripts to monitor the accessibility of your website or various aspects of your web server. Check Stack Overflow and look at Monastic for examples. Here is a very simple solution I needed to monitor the availability of a specific server using its IP address within the internal network. It was necessary after the server's IP address, which is used in third-party software to provide specific services, was "stolen" by other machines. In this case the DHCP server assigned the IP address, which should have been reserved, to mobile devices that connected to the wireless network.
This approach simply fetches a website from a specific URL using the "reserved" IP and looks for a word/pattern you know should be there. The script runs on a second machine (host name "ubuntu64"), an Ubuntu VM. (It does not use any additional security measures, which you will want to add if you expose the machine externally.)
Prepare second machine to send notification emails:
Install sendmail, sendemail, mailutils, sensible-mda (to have the whole set).
Add/modify entry in /etc/hosts:
127.0.1.1 ubuntu64.network.local ubuntu64
run "sudo sendmailconfig"
test with
Code
sendemail -q -f cron@ubuntux64.network.local -t my@email.com -u "mailtest" -m "mail works!" |
Write bash script to get and check website and send alert emails:
Code
# define address and pattern to expect
address='192.168.1.1/phpmyadmin/main.php'
searchword='phpMyAdmin'

# define alert email
sender="cron@ubuntux64"
receiver="my@email.com"
body="system on machine 192.168.1.1 at risk"
subj="Important server unresponsive"

# fetch page and look for pattern
resp=`wget -q -O - $address | grep -c $searchword`
if [ $resp -lt 1 ]; then
    # quote subject and body so they are passed as single arguments
    sendemail -q -f $sender -t $receiver -u "$subj" -m "$body"
fi
Add a crontab entry to automatically run this script every 10 minutes:
Code
*/10 * * * * sh /home/user/server_check.sh |
Additional improvements could include the options to stop alerting after a specific number of alerts or checking the response time.
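If you prefer to keep cron jobs in one language, here is a rough Python equivalent of the check above (same hypothetical URL, pattern and sendemail parameters):
Code
import subprocess
import urllib.request

ADDRESS    = "http://192.168.1.1/phpmyadmin/main.php"
SEARCHWORD = "phpMyAdmin"

try:
    page = urllib.request.urlopen(ADDRESS, timeout=10).read().decode("utf-8", "ignore")
    ok = SEARCHWORD in page
except Exception:
    ok = False          # unreachable or unexpected response counts as failure

if not ok:
    subprocess.run(["sendemail", "-q", "-f", "cron@ubuntux64", "-t", "my@email.com",
                    "-u", "Important server unresponsive",
                    "-m", "system on machine 192.168.1.1 at risk"])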
Alternatively you can just look up the MAC address associated with the "reserved" IP and compare it to the known physical address of your server and wrap this up into a little script:
> arp -a 192.168.1.1

Interface: 192.168.1.152 --- 0xb
  Internet Address      Physical Address      Type
  192.168.1.1           00-11-18-2c-2e-6d     dynamic
Following on from the publication of the main papers of the ENCODE (Encyclopedia Of DNA Elements) scale-up phase, I gave an interview to BlueGnome's marketing team for the Newstrack customer newsletter in 2012.
These are my personal opinions, not my employer's (past or present). They might be of interest to researchers considering joining a large-scale project like this.
Q. What was it like to be part of the ENCODE project?
It was a great experience to work on a project of this scale with more than 400 scientists from 32 groups spread across the globe. Many of them are the leaders in their field, but at consortium meetings and the many phone conferences everyone could contribute. The amount of data and different technologies was overwhelming at times, so I think it’s an impressive achievement how this project was run and now the findings have been published.
Q. What are the main outcomes of the project?
There has been a very lively discussion about the outcome and how it was presented. In my opinion, the most important result is the data itself. ENCODE has created an enormous repository of measurements across the human genome that has been compiled in a systematic and standardised way. The data will be the basis of future research trying to understand genomic processes involved in basic cellular processes as well as in various diseases.
ENCODE has pushed the development of standards and new applications to interrogate the genome, in particular using sequencing technologies.
The results also remind us that there is a lot of activity in the genome that we currently do not fully understand. Up to 80% of the human genome is biochemically active, there are thousands of additional (non-coding) genes in introns and in the intergenic space, and up to 75% of the genome is transcribed at some point. These observations paint a very dynamic genomic landscape, with overlapping active zones and signals of different complexity, indicating, that we have to keep the concept of genes and genome regulation pretty flexible in our mind.
Q. What are potential implications for BlueGnome and its customers?
I’m afraid the interpretation of CNV regions is getting even more complex as regulatory regions far away from the actual disease genes might be relevant for cases the clinical customers might come across. This is especially true for the interpretation of cancer profiles – which is highly complex already. We won’t be able to use these new interconnections directly in most cases, but we are looking through the data and have started to incorporate the knowledge by providing new genome-wide annotation data sets as optional BED files on the BlueGnome website, e.g. with GWAS results and regulatory element locations.
Q. Where do you see the human genome in 5 years’ time?
ENCODE is entering its next phase now to extend the catalogue to many additional cell lines as well as the mouse genome. With the recent publications scientists around the world are now more aware of this data and how to use it, so my hope is that we will see an acceleration in algorithm development, data mining and scientific findings. In 5 years we still won’t understand the genome entirely, but we should have a complete parts list and more connections between the parts. Some of these will be clinically relevant to allow progress in understanding and fighting today’s ‘big killers’ like certain types of cancer.
Q. Would you personally be interested in having your genome sequenced?
As a data exploration exercise I would find this really interesting, but the definitive answers you can get from it are still limited today. I would certainly want to make sure this data is kept private and under my control. With BlueGnome now being part of Illumina we can actually help to develop these ideas further.
Further information: Nature's Encode portal, "An integrated encyclopedia of DNA elements in the human genome" publication, Guardian Interview with Ewan Birney
The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences. It is a text format that stores sequence data in a series of tab-delimited ASCII columns and is commonly used in next-generation sequencing data processing. It is the (non-binary) human-readable version of the BAM format and contains information about the read and the aligned position in the genome. It was developed by Heng Li in Richard Durbin's group and others; their paper is here.
After a header section the alignment section describes all results of the aligned read data. The format is best explained with an example line:
Code
1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37 |
Fieldname    Description                                  Example data
QNAME        read name                                    1:497:R:-272+13M17D24M
FLAG         alignment flag                               113
RNAME        alignment chromosome                         1
POS          alignment start position                     497
MAPQ         overall mapping quality                      37
CIGAR        alignment CIGAR string                       37M
MRNM/RNEXT   name of next alignment in group (mate)       15
MPOS/PNEXT   position of next alignment in group (mate)   100338662
ISIZE/TLEN   observed Template LENgth                     0
SEQ          sequence                                     CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG
QUAL         quality per base                             0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>
TAGs         further tags with alignment info             XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
The tags are optional and might vary between alignment programs. Shown are examples from BWA. Important for filtering are usually the tags X0:i (numbers of genome alignments of this read) and XM:i (number of mismatches in alignment).
Tag  Meaning
NM   Edit distance
MD   Mismatching positions/bases
AS   Alignment score
BC   Barcode sequence
X0   Number of best hits
X1   Number of suboptimal hits found by BWA
XN   Number of ambiguous bases in the reference
XM   Number of mismatches in the alignment
XO   Number of gap opens
XG   Number of gap extensions
XT   Type: Unique/Repeat/N/Mate-sw
XA   Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS   Suboptimal alignment score
XF   Support from forward/reverse alignment
XE   Number of supporting seeds
The read names (at least from Illumina machines) are constructed as:
[instrument-name]:[run ID]:[flowcell ID]:[lane-number]:[tile-number]: [x-pos]:[y-pos] [read number]:[is filtered]:[control number]: [barcode sequence]
example:
@M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4
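Going back to the alignment example above, the FLAG value 113 can be decoded bit by bit; a small Python sketch using the flag bits from the SAM specification:
Code
SAM_FLAGS = {
    0x1:   "paired",
    0x2:   "proper pair",
    0x4:   "unmapped",
    0x8:   "mate unmapped",
    0x10:  "read reverse strand",
    0x20:  "mate reverse strand",
    0x40:  "first in pair",
    0x80:  "second in pair",
    0x100: "secondary alignment",
    0x200: "QC fail",
    0x400: "duplicate",
}

def decode_flag(flag):
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

print(decode_flag(113))
# ['paired', 'read reverse strand', 'mate reverse strand', 'first in pair']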
Sources:
genome.sph.umich.ed with further useful details, full specs.
10-15% of couples in the western world are faced with some kind of infertility issue, in almost half the cases there are (co-) factors on the male side.
Male infertility factors are often based on sperm abnormalities which can be categorized into:
The genetic region responsible for spermatogenesis and most of these abnormalities is located in the azoospermia factor (AZF) region on Yq11. It contains the sub-regions AZFa, AZFb and AZFc. Microdeletions in these regions are responsible for many genetic causes of male infertility. Alterations in the region AZFc (which contains the genes PRY2, BPY2, DAZ and CDY1) are believed to be the most frequent molecularly defined cause of spermatogenic failure. This is caused by a high genomic variability; in fact, AZFc is one of the most genetically dynamic regions in the human genome. This property may serve as a counter against the genetic degeneracy associated with the lack of a meiotic partner, meaning that no exchange of genetic material with a counterpart chromosomal region from the mother can happen.
Intracytoplasmic sperm injection (ICSI) can result in pregnancies, but passes on the genetic infertility to any sons born.
It has been reported that the average sperm count for men in the western world has declined by up to 50% in the past 50 years. These findings are not conclusive, however, as different studies found different trends around the world. It seems clear, though, that exposure to chemical compounds in our environment can influence the hormone balance, have an adverse effect on male fertility and promote diseases like testicular cancer.
Sources: srlworld.com, endotext.org, Page et al. (1999), Navarro-Costa et al. (2010).
To display the current date, day of the week and time on a web page, you don't want to refresh the entire page every second or minute. Instead you will want to use JavaScript to dynamically update just this date/clock display element. Here is the code for a display in the format
Friday, 10.8.2012 9:41:49
Code
<!DOCTYPE html> | |
<html> | |
<head> | |
<script type="text/javascript"> | |
function startTime(){ | |
var today=new Date(); | |
var h=today.getHours(); | |
var m=today.getMinutes(); | |
var s=today.getSeconds(); | |
var month = today.getMonth() + 1 | |
var day = today.getDate() | |
var myDays= ["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"] | |
var weekday = today.getDay() | |
var wday = myDays[weekday] | |
var year = today.getFullYear() | |
// add a zero in front of numbers<10 | |
m=checkTime(m); | |
s=checkTime(s); | |
document.getElementById('txt').innerHTML=wday + ", " + day + "." + month + "." + year + " " + h+":"+m+":"+s; | |
t=setTimeout(function(){startTime()},500); | |
} | |
| |
function checkTime(i){ | |
if (i<10){ | |
i="0" + i; | |
} | |
return i; | |
} | |
</script> | |
</head> | |
| |
<body onload="startTime()"> | |
<div id="txt"></div> | |
</body> | |
</html> |
Sources: trans4mind.com, w3schools.com
There is a fine set of scripts that form an orderly pipeline (or framework) to process bioinformatics data on the Unix command line, called Biopieces. You can e.g. process sequencing (NGS) data like this:
Code
./read_fastq -n 1000 -i data/reads.fastq | ./plot_scores -t png -o data/scores.png --no_stream
The general logic is
read_data | calculate_something | write_results
with the data being passed through as a "stream" and all modules having the same interface to each other. Installation instructions are here; on my Ubuntu VM I had to follow these steps:
Code
sudo apt-get install subversion |
Code
svn checkout http://biopieces.googlecode.com/svn/trunk/ biopieces cd biopieces svn checkout http://biopieces.googlecode.com/svn/wiki bp_usage |
Code
bash biopieces_installer.sh |
Code
sudo gem install RubyInline

ERROR:  Error installing RubyInline:
        ZenTest requires RubyGems version > 1.8.
Code
export BP_DIR="$HOME/bin/biopieces" | |
export BP_DATA="$HOME/bin/biopieces/BP_DATA" | |
export BP_TMP="$HOME/bin/biopieces/tmp" | |
export BP_LOG="$HOME/bin/biopieces/BP_LOG" | |
export PATH="/home/test/bin/biopieces/ruby_install/bin:/home/test/bin/biopieces/biopieces/bp_bin:$PATH" | |
export RUBYLIB="/home/test/bin/biopieces/biopieces/code_ruby/lib:$RUBYLIB" | |
export PERL5LIB="/home/test/bin/biopieces/biopieces/code_perl:$PERL5LIB" |
Code
source ~/.bashrc | |
mkdir $BP_DATA $BP_TMP $BP_LOG |
Code
cannot load such file -- maasha/biopieces (LoadError) | |
---- | |
Can't locate Maasha/Fasta.pm in @INC |
Some of the almost 200 methods that are implemented in biopieces at this time include:
To run programs or pipelines automatically it is often necessary to create or adjust configuration files. Ideally this should be done dynamically by a script from a skeleton (layout) file, replacing placeholders with the adjusted values. This can be done with a Unix shell script that even contains the skeleton within itself:
Code
#! /bin/sh
# pass in variables from command-line arguments
prog=$1
var1=$2
var2=$3

# file to write the generated config to
outputfile=output.txt

# do other required tasks
# ...

# config skeleton
template='#config file for pipeline
parameter_1=$var1
parameter_2=$var2'

# Generate file output.txt from variable
# $template using placeholders above.
echo "$(eval "echo \"$template\"")" \
  > $outputfile

# run the specified program
# with the new config file
./${prog} -conf ${outputfile}
Save as script.sh
and call with parameters:
sh script.sh program_name par1 par2
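The same idea can be expressed in Python with string.Template, in case the surrounding automation is written in Python anyway (file and parameter names below are just examples):
Code
import subprocess
from string import Template

# config skeleton with $-placeholders, analogous to the shell version above
template = Template("""#config file for pipeline
parameter_1=$var1
parameter_2=$var2
""")

config = template.substitute(var1="value1", var2="value2")
with open("output.txt", "w") as fh:
    fh.write(config)

# then run the program with the generated config file, e.g.:
# subprocess.run(["./program_name", "-conf", "output.txt"])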
Source: stackoverflow
The Ensembl variation resources provide information about structural variants and sequence variants (including Single Nucleotide Polymorphisms (SNPs), insertions, deletions and somatic mutations) in the human genome. Details and references are described on the web site and in Chen et al. (2010) Ensembl Variation Resources, BMC Genomics, and other publications listed on the site.
Sources and Descriptions currently included in Ensembl variation resources (v67):
Ensembl offers the possibility to run the underlying code on your own data and predict the functional consequences of known and unknown variants using the Variant Effect Predictor (VEP).
Internally the VEP uses PolyPhen which is further explained below:
For a given amino acid substitution in a protein, PolyPhen-2 extracts various sequence and structure-based features of the substitution site and feeds them to a probabilistic classifier to identify:
Sequence-based features include binding or linking sites, transmembrane regions, regulatory modification sites. Profile matrices are calculated to assess the likelihood of the occurrence of this amino acid at the given position.
Structural features include the comparison to known protein 3D structures in PDB, using DSSP (Dictionary of Secondary Structure in Proteins), accessible surface area and properties.
PolyPhen-2 also looks at functional significance of an allele replacement using the UniProtKB database. It uses the "HumDiv" classifier to find disease-related changes and "HumVar" for variations in the "normal" population.
Ensembl have now added a nice blog entry about this with some more details.
Sequence uniqueness within the genome plays an important part when attempting to map short sequence parts, e.g. next-generation short sequencing reads. It is one of the factors that can introduce a bias in sequencing or its analysis - the other important factor being GC content (GC-rich sequences, e.g. genic/exonic regions, as well as very GC-poor regions are often under-represented (Bentley et al. 2008), mainly caused by amplification steps in the protocol). Reads mapping to multiple regions are often discarded; genomic regions with high sequence degeneracy / low sequence complexity therefore show lower mapped read coverage than unique regions, creating a systematic bias.
The CRG Alignability tracks at the UCSC genome browser display how uniquely k-mer sequences align to a region of the genome. As you can see from the tracks, the mappability increases with read length (CRG mappability tracks for different read lengths at the UCSC browser). For each window (of sizes 36, 40, 50, 75 or 100 nts), a mappability score was computed: S = 1 / (number of matches found in the genome), so S=1 means one match in the genome, S=0.5 means two matches, and so on. Further description is in the publication of Thomas Derrien, Paolo Ribeca, et al.
The data for these tracks can be downloaded. If you are working with other read lengths or genomes, you can run the software to generate the data yourself: get the GEM library (latest version at GitHub), unpack it with
tar xbvf GEM-libraries-Linux-x86_64.tbz2
create an index:
gem-do-index -i genome.fasta -o gem_index
and run the mappability part, e.g. with a read length of 250:
gem-mappability -I gem_index -l 250 -o mappability_250.gem
To query a specific region for its mappability you can also use the online tool http://surveyor.chgr.org/. An alternative is to look at the "uniqueome" data and publication.
Refs:
Sorting (elements in an array) is a very common task in many scripts. A lot of research has gone into finding the most efficient way to sort.
In Ruby the "sort" function performs a standard comparison according to the data type inspected, but as in most other languages you can define any specific order.
open_orders.sort
is equivalent to
open_orders.sort { |x, y| x <=> y }
The sort algorithm will assume that this comparison function/block returns a value according to the following logic (like the comparison operators):
return -1 if x < y; return 0 if x == y; return 1 if x > y
So using this logic I can define a specific custom function to compare the elements that need sorting and call it in the sort function afterwards. In my simple example I need to sort order numbers by two criteria: by a string first ("UK" before "ORD") and by ascending numbers afterwards.
Code
def custom_order_sorting(x_ord,y_ord)
  if(x_ord.match('UK') and y_ord.match('ORD'))
    #use UK first
    return -1
  elsif(x_ord.match('ORD') and y_ord.match('UK'))
    #use UK first
    return 1
  else
    #use smaller number first (compare as integers, not strings)
    x_num = x_ord.match('\w(\d+)$')[1].to_i
    y_num = y_ord.match('\w(\d+)$')[1].to_i
    return x_num <=> y_num
  end
end

open_orders.sort!{|x,y| custom_order_sorting(x,y)}
Source: stackoverflow.com
The hypothesis of genometastasis was suggested by García-Olmo et al. more than a decade ago (1) and states (simplified) that normal cells could be turned into cancer cells through contact with (dying) cancer cells. In particular, "metastases might develop as a result of transfection of susceptible cells in distant target organs with dominant oncogenes that circulate in the plasma and are derived from the primary tumor." It can therefore be considered a form of horizontal gene / DNA transfer. The uptake of the genomic material was explained through apoptotic bodies from cancer cells, as described by Holmgren et al. (2). The ideas were actually already described a century ago (6,7).
An alternative could be the involvement of a virus as a transmitter as described by zur Hausen (8).
In a later study (3) the same group could show that plasma from colorectal cancer patients could transform cultured cells oncogenically (fig 1):
Further research of the group was published recently (4), describing the transformation of cells cultured from healthy individuals through particles from cultured colon cancer cells. Goldenberg et al. (5) could stably transform cells between species through cell fusion, resulting in hamster cells that express human oncogenes.
The evidence for horizontal gene transfer, in particular that cancer cells, dying parts of the cells or even cell-free cancer DNA can induce malignancy, is worrying. It is likely only possible under very specific conditions and with certain (aggressive) cancer types, but certainly an interesting research area to watch. If confirmed, it could have dramatic effects on treatment strategies and could open up new methodological possibilities for molecular research.
References:
In cases where two copies of the same chromosome, or part of a chromosome, from one parent and no copies from the other parent are present in the cell, we call it uniparental disomy (UPD). While all DNA information is present, the development of the cell (and the organism) is hindered because of missing / wrong epigenetic markers. The basic mechanism of how this faulty distribution of chromosomes can occur, is shown in fig.1.
Sources:
Besides the visual client, the version control system Perforce can be operated through the command line (Unix prompt or Windows DOS window) and can therefore be controlled through other programs like MATLAB:
[status, result] = dos(p4command);
A reference manual is available, here are a few hints:
Check the environment settings:
p4 set
P4CHARSET=winansi
P4CLIENT=try1 (set)
P4EDITOR=C:\Windows\SysWOW64\notepad.exe (set)
P4PORT=perforce:1666
P4USER=Felix_Kokocinski
and edit if necessary with
set P4CHARSET=winansi
P4EDITOR is optional, P4CLIENT is the checkout / workspace name.
The settings can also be set permanently in the visual client under
Edit / Preferences / Connection / Change Settings
If these are wrong you will get messages like "file(s) not on client".
Most common commands:
synchronize repository:
p4 sync
checkout file:
p4 edit filename.txt or p4 edit //depot/path/in/perforce/filename.txt
submit changes:
p4 submit -d "description of changes" filename.txt
revert to version in repository:
p4 revert filename.txt
add new file:
p4 add filename.txt
get help:
p4 help
Here are some useful one-liners for various tasks.
The Online Mendelian Inheritance in Man (OMIM) is a manually reviewed catalog of human genes and regions involved in genetic disorders and traits. Each entry has a name and a number, e.g. "#154780 MARSHALL SYNDROME". According to the OMIM FAQs, these are the meanings of the symbols preceding a MIM number:
To fetch a non-redundant list of OMIM annotation through the Ensembl Perl API you can look at the external references (xrefs/dblinks):
Code
my $att = "MIM_GENE"; | |
# or: my $att = "MIM_MORBID"; | |
my $attribs = $gene->get_all_DBLinks($att); | |
my (%ids, %descriptions); | |
if (@{ $attribs }){ | |
foreach my $attrib (@{ $attribs }){ | |
if (not(exists $ids{$attrib->primary_id()})){ | |
$ids{$attrib->primary_id} = $attrib->display_id; | |
$descriptions{$attrib->description} = $attrib->display_id; | |
} | |
} | |
} |
Ref:
OMIM publication, http://omim.org/
The symbols to describe the different nucleotides in DNA are the following:
Symbol  Meaning           Nucleic Acid
A       A                 Adenine
C       C                 Cytosine
G       G                 Guanine
T       T                 Thymine
U       U                 Uracil
M       A or C
R       A or G
W       A or T
S       C or G
Y       C or T
K       G or T
V       A or C or G
H       A or C or T
D       A or G or T
B       C or G or T
X       G or A or T or C
N       G or A or T or C
Note: these letters are also used in the "samtools tview" program to visually show NGS read alignments.
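For scripting, the same table can be kept as a small Python dictionary, e.g. to test whether a called base is compatible with an ambiguity symbol:
Code
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}

def matches(base, symbol):
    # True if the called base is one of the nucleotides the symbol stands for
    return base.upper() in IUPAC[symbol.upper()]

print(matches("A", "R"))   # True  (R = A or G)
print(matches("C", "K"))   # False (K = G or T)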
Sources:
The goal of the 1000 Genomes Project is to create "A Deep Catalog of Human Genetic Variation" by measuring and analysing most genetic variants that have frequencies of at least 1% in the populations studied.
The population codes used in the project are the following (Source: 1000 Genomes / ftp site):
CHB  Han Chinese           Han Chinese in Beijing, China
JPT  Japanese              Japanese in Tokyo, Japan
CHS  Southern Han Chinese  Han Chinese South
CDX  Dai Chinese           Chinese Dai in Xishuangbanna, China
KHV  Kinh Vietnamese       Kinh in Ho Chi Minh City, Vietnam
CHD  Denver Chinese        Chinese in Denver, Colorado (pilot 3 only)
CEU  CEPH                  Utah residents (CEPH) with Northern and Western European ancestry
TSI  Tuscan                Toscani in Italia
GBR  British               British in England and Scotland
FIN  Finnish               Finnish in Finland
IBS  Spanish               Iberian populations in Spain
YRI  Yoruba                Yoruba in Ibadan, Nigeria
LWK  Luhya                 Luhya in Webuye, Kenya
GWD  Gambian               Gambian in Western Division, The Gambia
MSL  Mende                 Mende in Sierra Leone
ESN  Esan                  Esan in Nigeria
ASW  African-American SW   African Ancestry in Southwest US
ACB  African-Caribbean     African Caribbean in Barbados
MXL  Mexican-American      Mexican Ancestry in Los Angeles, California
PUR  Puerto Rican          Puerto Rican in Puerto Rico
CLM  Colombian             Colombian in Medellin, Colombia
PEL  Peruvian              Peruvian in Lima, Peru
GIH  Gujarati              Gujarati Indian in Houston, TX
PJL  Punjabi               Punjabi in Lahore, Pakistan
BEB  Bengali               Bengali in Bangladesh
STU  Sri Lankan            Sri Lankan Tamil in the UK
ITU  Indian                Indian Telugu in the UK
As reported in the Ensembl 2009 NAR paper canonical transcripts are defined for all genes and for all species in the Ensembl gene sets. "The canonical transcript is defined as either the longest CDS, if the gene has translated transcripts, or the longest cDNA. Should a transcript already regarded as canonical not be selected using the above rules, there is support for storing this information in the Ensembl database."
For the human gene annotation, the hierarchy used to choose between multiple protein-coding transcripts is:
If there are multiple transcripts within the groups, take the longest CDS of the highest priority group.
For non-coding types, take the longest cDNA of
Source:
Ensembl 2009 NAR paper, Ensembl mailing list
These objects can be regarded as representative transcripts for the gene and can be fetched with the Perl API method
Bio::EnsEMBL::Gene::canonical_transcript()
Some caution is needed when looking at the pseudo-autosomal regions (PAR): when looking at genes from the Y PAR, the method will return a transcript with X coordinates. While not really a bug, this might mess up your data if unnoticed. To check and fix this, something like the following will work:
Code
#fetch slice from Y PAR
my $slice = $slice_adaptor->fetch_by_region(
    'Chromosome', 'Y', 59100480, 59115127);

#get an example gene
my $gene = @{$slice->get_all_Genes}[0];

#get canonical transcript from the gene
my $transcript = $gene->canonical_transcript;

#re-fetch transcript on Y to avoid getting
# X locations for PAR
if($gene->slice->seq_region_name eq "Y"){

    my $sid = $transcript->stable_id;
    $transcript = undef;
    my $transcripts =
        $transcript_adaptor->fetch_all_by_Slice($slice, 1);

    foreach my $poss_transcript (@$transcripts){
        next unless($poss_transcript->stable_id eq $sid);
        $transcript = $poss_transcript;
    }

}
Telomeres form caps on the ends of chromosomes that prevent fusion of chromosomal ends and provide genomic stability.
During gametogenesis, reprogramming of the germ cells leads to elongation of telomeres up to their species-specific maximum.
In normal somatic cells, telomeres are progressively shortened with every cell division. This shortening in normal human cells limits the number of cell divisions. For human cells to proliferate beyond the senescence checkpoint, they need to stabilize telomere length. This is accomplished mainly by reactivation of the telomerase enzyme. Telomerase expression is under the control of many factors. Expression of telomerase can lead to cell immortalization and is activated during tumorigenesis, i.e. cancer.
Male Xq-telomeres are 1100 bp shorter than female Xq-telomeres.
The telomeric repeat found on all human chromosomes is "TTAGGG".
The centromeres and telomeres of the human chromosomes are not explicitly defined as region attributes in the Ensembl Perl API, so for checking these regions one option is to pull them out of the UCSC table browser (use the "Mapping and Sequencing tracks" group and the "Gap" table) and define them manually. You can e.g. create an array of hashes with the regions and use them in your script:
Code
#read data (listed below) from a file... | |
my @data = split("\s"); | |
my %telomere = ( | |
'chrom' => $data[0], | |
'start' => $data[1], | |
'end' => $data[2], | |
); | |
push(@telomeres, \%telomere); |
The list of centromere regions (transformed from the 0-based UCSC system to the 1-based coordinate system) for GRCh37 is:
Code
1 121535435 124535434 | |
2 92326172 95326171 | |
3 90504855 93504854 | |
4 49660118 52660117 | |
5 46405642 49405641 | |
6 58830167 61830166 | |
7 58054332 61054331 | |
8 43838888 46838887 | |
9 47367680 50367679 | |
10 39254936 42254935 | |
11 51644206 54644205 | |
12 34856695 37856694 | |
13 16000001 19000000 | |
14 16000001 19000000 | |
15 17000001 20000000 | |
16 35335802 38335801 | |
17 22263007 25263006 | |
18 15460899 18460898 | |
19 24681783 27681782 | |
20 26369570 29369569 | |
21 11288130 14288129 | |
22 13000001 16000000 | |
X 58632013 61632012 | |
Y 10104554 13104553 |
The corresponding list of telomere regions (the first and last 10,000 bp of each chromosome) is:
Code
1 1 10000 | |
1 249240622 249250621 | |
2 1 10000 | |
2 243189374 243199373 | |
3 1 10000 | |
3 198012431 198022430 | |
4 1 10000 | |
4 191144277 191154276 | |
5 1 10000 | |
5 180905261 180915260 | |
6 1 10000 | |
6 171105068 171115067 | |
7 1 10000 | |
7 159128664 159138663 | |
8 1 10000 | |
8 146354023 146364022 | |
9 1 10000 | |
9 141203432 141213431 | |
10 135524748 135534747 | |
10 1 10000 | |
11 134996517 135006516 | |
11 1 10000 | |
12 1 10000 | |
12 133841896 133851895 | |
13 1 10000 | |
13 115159879 115169878 | |
14 1 10000 | |
14 107339541 107349540 | |
15 1 10000 | |
15 102521393 102531392 | |
16 1 10000 | |
16 90344754 90354753 | |
18 1 10000 | |
18 78067249 78077248 | |
19 1 10000 | |
19 59118984 59128983 | |
20 1 10000 | |
20 63015521 63025520 | |
21 1 10000 | |
21 48119896 48129895 | |
22 1 10000 | |
22 51294567 51304566 | |
X 1 10000 | |
X 155260561 155270560 | |
Y 1 10000 | |
Y 59363567 59373566 |
Telomeres of chromosome 17 have not been defined for assembly GRCh37. They are short, but do exist nonetheless. An assembly patch will address this.
Sources:
The karyotype (number and set-up of the chromosomes) of a person, or any changes in specific regions on human chromosomes, are described with a system of numbers and symbols defined by a group of cytogenetic experts as the International System for Human Cytogenetic Nomenclature (ISCN). It was initiated in 1960 by a committee following the suggestion of Charles E. Ford, resulting in the "Proposed Standard System of Nomenclature of Human Mitotic Chromosomes". The system is based on the ideogram definitions of visual bands (described e.g. in Francke et al. 1981) and was last revised in 2005 and 2009.
The visual bands are created by staining techniques and describe regions of similar functionality and base composition (GC content); their lengths are 5-10 Mb.
Example ideograms: Human chromosome 1 in different resolutions, from WashU
Numbers used:
The regions are numbered from the centromere outwards in both directions towards the telomeres, on the shorter p arm and the longer q arm. The numbers are not read in the normal decimal system: "21" is not twenty-one but 2-1 (region 2, band 1). Counting starts at the centromere with region 1 (or 1-0), then 11 (1-1), 21 (2-1), 22 (2-2) etc. Sub-bands are added in a similar way, e.g. 21.1 to 21.2, if the bands are small or only appear at a higher resolution.
There are different levels of resolution that can be used as bands, e.g. 1q32 in the 400-bands resolution can be split up into 1q32.1, 1q32.2, 1q32.3 in the 550-bands resolution (see Figure above as an example for chr 1).
Here is a list of chromosome, band (arm, region, band, sub-band), genomic start and end position, and staining pattern, from the Ensembl database for assembly GRCh37:
Code
1 p11.1 121500001 125000000 acen | |
1 p11.2 120600001 121500000 gneg | |
1 p12 117800001 120600000 gpos50 | |
1 p13.1 116100001 117800000 gneg | |
1 p13.2 111800001 116100000 gpos50 | |
1 p13.3 107200001 111800000 gneg | |
1 p21.1 102200001 107200000 gpos100 | |
1 p21.2 99700001 102200000 gneg | |
1 p21.3 94700001 99700000 gpos75 | |
1 p22.1 92000001 94700000 gneg | |
1 p22.2 88400001 92000000 gpos75 | |
1 p22.3 84900001 88400000 gneg | |
1 p31.1 69700001 84900000 gpos100 | |
1 p31.2 68900001 69700000 gneg | |
1 p31.3 61300001 68900000 gpos50 | |
1 p32.1 59000001 61300000 gneg | |
1 p32.2 56100001 59000000 gpos50 | |
1 p32.3 50700001 56100000 gneg | |
1 p33 46800001 50700000 gpos75 | |
1 p34.1 44100001 46800000 gneg | |
1 p34.2 40100001 44100000 gpos25 | |
1 p34.3 34600001 40100000 gneg | |
1 p35.1 32400001 34600000 gpos25 | |
1 p35.2 30200001 32400000 gneg | |
1 p35.3 28000001 30200000 gpos25 | |
1 p36.11 23900001 28000000 gneg | |
1 p36.12 20400001 23900000 gpos25 | |
1 p36.13 16200001 20400000 gneg | |
1 p36.21 12700001 16200000 gpos50 | |
1 p36.22 9200001 12700000 gneg | |
1 p36.23 7200001 9200000 gpos25 | |
1 p36.31 5400001 7200000 gneg | |
1 p36.32 2300001 5400000 gpos25 | |
1 p36.33 1 2300000 gneg | |
1 q11 125000001 128900000 acen | |
1 q12 128900001 142600000 gvar | |
1 q21.1 142600001 147000000 gneg | |
1 q21.2 147000001 150300000 gpos50 | |
1 q21.3 150300001 155000000 gneg | |
1 q22 155000001 156500000 gpos50 | |
1 q23.1 156500001 159100000 gneg | |
1 q23.2 159100001 160500000 gpos50 | |
1 q23.3 160500001 165500000 gneg | |
1 q24.1 165500001 167200000 gpos50 | |
1 q24.2 167200001 170900000 gneg | |
1 q24.3 170900001 172900000 gpos75 | |
1 q25.1 172900001 176000000 gneg | |
1 q25.2 176000001 180300000 gpos50 | |
1 q25.3 180300001 185800000 gneg | |
1 q31.1 185800001 190800000 gpos100 | |
1 q31.2 190800001 193800000 gneg | |
1 q31.3 193800001 198700000 gpos100 | |
1 q32.1 198700001 207200000 gneg | |
1 q32.2 207200001 211500000 gpos25 | |
1 q32.3 211500001 214500000 gneg | |
1 q41 214500001 224100000 gpos100 | |
1 q42.11 224100001 224600000 gneg | |
1 q42.12 224600001 227000000 gpos25 | |
1 q42.13 227000001 230700000 gneg | |
1 q42.2 230700001 234700000 gpos50 | |
1 q42.3 234700001 236600000 gneg | |
1 q43 236600001 243700000 gpos75 | |
1 q44 243700001 249250621 gneg | |
2 p11.1 90500001 93300000 acen | |
2 p11.2 83300001 90500000 gneg | |
2 p12 75000001 83300000 gpos100 | |
2 p13.1 73500001 75000000 gneg | |
2 p13.2 71500001 73500000 gpos50 | |
2 p13.3 68600001 71500000 gneg | |
2 p14 64100001 68600000 gpos50 | |
2 p15 61300001 64100000 gneg | |
2 p16.1 55000001 61300000 gpos100 | |
2 p16.2 52900001 55000000 gneg | |
2 p16.3 47800001 52900000 gpos100 | |
2 p21 41800001 47800000 gneg | |
2 p22.1 38600001 41800000 gpos50 | |
2 p22.2 36600001 38600000 gneg | |
2 p22.3 32100001 36600000 gpos75 | |
2 p23.1 30000001 32100000 gneg | |
2 p23.2 27900001 30000000 gpos25 | |
2 p23.3 24000001 27900000 gneg | |
2 p24.1 19200001 24000000 gpos75 | |
2 p24.2 16700001 19200000 gneg | |
2 p24.3 12200001 16700000 gpos75 | |
2 p25.1 7100001 12200000 gneg | |
2 p25.2 4400001 7100000 gpos50 | |
2 p25.3 1 4400000 gneg | |
2 q11.1 93300001 96800000 acen | |
2 q11.2 96800001 102700000 gneg | |
2 q12.1 102700001 106000000 gpos50 | |
2 q12.2 106000001 107500000 gneg | |
2 q12.3 107500001 110200000 gpos25 | |
2 q13 110200001 114400000 gneg | |
2 q14.1 114400001 118800000 gpos50 | |
2 q14.2 118800001 122400000 gneg | |
2 q14.3 122400001 129900000 gpos50 | |
2 q21.1 129900001 132500000 gneg | |
2 q21.2 132500001 135100000 gpos25 | |
2 q21.3 135100001 136800000 gneg | |
2 q22.1 136800001 142200000 gpos100 | |
2 q22.2 142200001 144100000 gneg | |
2 q22.3 144100001 148700000 gpos100 | |
2 q23.1 148700001 149900000 gneg | |
2 q23.2 149900001 150500000 gpos25 | |
2 q23.3 150500001 154900000 gneg | |
2 q24.1 154900001 159800000 gpos75 | |
2 q24.2 159800001 163700000 gneg | |
2 q24.3 163700001 169700000 gpos75 | |
2 q31.1 169700001 178000000 gneg | |
2 q31.2 178000001 180600000 gpos50 | |
2 q31.3 180600001 183000000 gneg | |
2 q32.1 183000001 189400000 gpos75 | |
2 q32.2 189400001 191900000 gneg | |
2 q32.3 191900001 197400000 gpos75 | |
2 q33.1 197400001 203300000 gneg | |
2 q33.2 203300001 204900000 gpos50 | |
2 q33.3 204900001 209000000 gneg | |
2 q34 209000001 215300000 gpos100 | |
2 q35 215300001 221500000 gneg | |
2 q36.1 221500001 225200000 gpos75 | |
2 q36.2 225200001 226100000 gneg | |
2 q36.3 226100001 231000000 gpos100 | |
2 q37.1 231000001 235600000 gneg | |
2 q37.2 235600001 237300000 gpos50 | |
2 q37.3 237300001 243199373 gneg | |
3 p11.1 87900001 91000000 acen | |
3 p11.2 87200001 87900000 gneg | |
3 p12.1 83500001 87200000 gpos75 | |
3 p12.2 79800001 83500000 gneg | |
3 p12.3 74200001 79800000 gpos75 | |
3 p13 69800001 74200000 gneg | |
3 p14.1 63700001 69800000 gpos50 | |
3 p14.2 58600001 63700000 gneg | |
3 p14.3 54400001 58600000 gpos50 | |
3 p21.1 52300001 54400000 gneg | |
3 p21.2 50600001 52300000 gpos25 | |
3 p21.31 44200001 50600000 gneg | |
3 p21.32 44100001 44200000 gpos50 | |
3 p21.33 43700001 44100000 gneg | |
3 p22.1 39400001 43700000 gpos75 | |
3 p22.2 36500001 39400000 gneg | |
3 p22.3 32100001 36500000 gpos50 | |
3 p23 30900001 32100000 gneg | |
3 p24.1 26400001 30900000 gpos75 | |
3 p24.2 23900001 26400000 gneg | |
3 p24.3 16400001 23900000 gpos100 | |
3 p25.1 13300001 16400000 gneg | |
3 p25.2 11800001 13300000 gpos25 | |
3 p25.3 8700001 11800000 gneg | |
3 p26.1 4000001 8700000 gpos50 | |
3 p26.2 2800001 4000000 gneg | |
3 p26.3 1 2800000 gpos50 | |
3 q11.1 91000001 93900000 acen | |
3 q11.2 93900001 98300000 gvar | |
3 q12.1 98300001 100000000 gneg | |
3 q12.2 100000001 100900000 gpos25 | |
3 q12.3 100900001 102800000 gneg | |
3 q13.11 102800001 106200000 gpos75 | |
3 q13.12 106200001 107900000 gneg | |
3 q13.13 107900001 111300000 gpos50 | |
3 q13.2 111300001 113500000 gneg | |
3 q13.31 113500001 117300000 gpos75 | |
3 q13.32 117300001 119000000 gneg | |
3 q13.33 119000001 121900000 gpos75 | |
3 q21.1 121900001 123800000 gneg | |
3 q21.2 123800001 125800000 gpos25 | |
3 q21.3 125800001 129200000 gneg | |
3 q22.1 129200001 133700000 gpos25 | |
3 q22.2 133700001 135700000 gneg | |
3 q22.3 135700001 138700000 gpos25 | |
3 q23 138700001 142800000 gneg | |
3 q24 142800001 148900000 gpos100 | |
3 q25.1 148900001 152100000 gneg | |
3 q25.2 152100001 155000000 gpos50 | |
3 q25.31 155000001 157000000 gneg | |
3 q25.32 157000001 159000000 gpos50 | |
3 q25.33 159000001 160700000 gneg | |
3 q26.1 160700001 167600000 gpos100 | |
3 q26.2 167600001 170900000 gneg | |
3 q26.31 170900001 175700000 gpos75 | |
3 q26.32 175700001 179000000 gneg | |
3 q26.33 179000001 182700000 gpos75 | |
3 q27.1 182700001 184500000 gneg | |
3 q27.2 184500001 186000000 gpos25 | |
3 q27.3 186000001 187900000 gneg | |
3 q28 187900001 192300000 gpos75 | |
3 q29 192300001 198022430 gneg | |
4 p11 48200001 50400000 acen | |
4 p12 44600001 48200000 gneg | |
4 p13 41200001 44600000 gpos50 | |
4 p14 35800001 41200000 gneg | |
4 p15.1 27700001 35800000 gpos100 | |
4 p15.2 21300001 27700000 gneg | |
4 p15.31 17800001 21300000 gpos75 | |
4 p15.32 15200001 17800000 gneg | |
4 p15.33 11300001 15200000 gpos50 | |
4 p16.1 6000001 11300000 gneg | |
4 p16.2 4500001 6000000 gpos25 | |
4 p16.3 1 4500000 gneg | |
4 q11 50400001 52700000 acen | |
4 q12 52700001 59500000 gneg | |
4 q13.1 59500001 66600000 gpos100 | |
4 q13.2 66600001 70500000 gneg | |
4 q13.3 70500001 76300000 gpos75 | |
4 q21.1 76300001 78900000 gneg | |
4 q21.21 78900001 82400000 gpos50 | |
4 q21.22 82400001 84100000 gneg | |
4 q21.23 84100001 86900000 gpos25 | |
4 q21.3 86900001 88000000 gneg | |
4 q22.1 88000001 93700000 gpos75 | |
4 q22.2 93700001 95100000 gneg | |
4 q22.3 95100001 98800000 gpos75 | |
4 q23 98800001 101100000 gneg | |
4 q24 101100001 107700000 gpos50 | |
4 q25 107700001 114100000 gneg | |
4 q26 114100001 120800000 gpos75 | |
4 q27 120800001 123800000 gneg | |
4 q28.1 123800001 128800000 gpos50 | |
4 q28.2 128800001 131100000 gneg | |
4 q28.3 131100001 139500000 gpos100 | |
4 q31.1 139500001 141500000 gneg | |
4 q31.21 141500001 146800000 gpos25 | |
4 q31.22 146800001 148500000 gneg | |
4 q31.23 148500001 151100000 gpos25 | |
4 q31.3 151100001 155600000 gneg | |
4 q32.1 155600001 161800000 gpos100 | |
4 q32.2 161800001 164500000 gneg | |
4 q32.3 164500001 170100000 gpos100 | |
4 q33 170100001 171900000 gneg | |
4 q34.1 171900001 176300000 gpos75 | |
4 q34.2 176300001 177500000 gneg | |
4 q34.3 177500001 183200000 gpos100 | |
4 q35.1 183200001 187100000 gneg | |
4 q35.2 187100001 191154276 gpos25 | |
5 p11 46100001 48400000 acen | |
5 p12 42500001 46100000 gpos50 | |
5 p13.1 38400001 42500000 gneg | |
5 p13.2 33800001 38400000 gpos25 | |
5 p13.3 28900001 33800000 gneg | |
5 p14.1 24600001 28900000 gpos100 | |
5 p14.2 23300001 24600000 gneg | |
5 p14.3 18400001 23300000 gpos100 | |
5 p15.1 15000001 18400000 gneg | |
5 p15.2 9800001 15000000 gpos50 | |
5 p15.31 6300001 9800000 gneg | |
5 p15.32 4500001 6300000 gpos25 | |
5 p15.33 1 4500000 gneg | |
5 q11.1 48400001 50700000 acen | |
5 q11.2 50700001 58900000 gneg | |
5 q12.1 58900001 62900000 gpos75 | |
5 q12.2 62900001 63200000 gneg | |
5 q12.3 63200001 66700000 gpos75 | |
5 q13.1 66700001 68400000 gneg | |
5 q13.2 68400001 73300000 gpos50 | |
5 q13.3 73300001 76900000 gneg | |
5 q14.1 76900001 81400000 gpos50 | |
5 q14.2 81400001 82800000 gneg | |
5 q14.3 82800001 92300000 gpos100 | |
5 q15 92300001 98200000 gneg | |
5 q21.1 98200001 102800000 gpos100 | |
5 q21.2 102800001 104500000 gneg | |
5 q21.3 104500001 109600000 gpos100 | |
5 q22.1 109600001 111500000 gneg | |
5 q22.2 111500001 113100000 gpos50 | |
5 q22.3 113100001 115200000 gneg | |
5 q23.1 115200001 121400000 gpos100 | |
5 q23.2 121400001 127300000 gneg | |
5 q23.3 127300001 130600000 gpos100 | |
5 q31.1 130600001 136200000 gneg | |
5 q31.2 136200001 139500000 gpos25 | |
5 q31.3 139500001 144500000 gneg | |
5 q32 144500001 149800000 gpos75 | |
5 q33.1 149800001 152700000 gneg | |
5 q33.2 152700001 155700000 gpos50 | |
5 q33.3 155700001 159900000 gneg | |
5 q34 159900001 168500000 gpos100 | |
5 q35.1 168500001 172800000 gneg | |
5 q35.2 172800001 176600000 gpos25 | |
5 q35.3 176600001 180915260 gneg | |
6 p11.1 58700001 61000000 acen | |
6 p11.2 57000001 58700000 gneg | |
6 p12.1 52900001 57000000 gpos100 | |
6 p12.2 51800001 52900000 gneg | |
6 p12.3 46200001 51800000 gpos100 | |
6 p21.1 40500001 46200000 gneg | |
6 p21.2 36600001 40500000 gpos25 | |
6 p21.31 33500001 36600000 gneg | |
6 p21.32 32100001 33500000 gpos25 | |
6 p21.33 30400001 32100000 gneg | |
6 p22.1 27000001 30400000 gpos50 | |
6 p22.2 25200001 27000000 gneg | |
6 p22.3 15200001 25200000 gpos75 | |
6 p23 13400001 15200000 gneg | |
6 p24.1 11600001 13400000 gpos25 | |
6 p24.2 10600001 11600000 gneg | |
6 p24.3 7100001 10600000 gpos50 | |
6 p25.1 4200001 7100000 gneg | |
6 p25.2 2300001 4200000 gpos25 | |
6 p25.3 1 2300000 gneg | |
6 q11.1 61000001 63300000 acen | |
6 q11.2 63300001 63400000 gneg | |
6 q12 63400001 70000000 gpos100 | |
6 q13 70000001 75900000 gneg | |
6 q14.1 75900001 83900000 gpos50 | |
6 q14.2 83900001 84900000 gneg | |
6 q14.3 84900001 88000000 gpos50 | |
6 q15 88000001 93100000 gneg | |
6 q16.1 93100001 99500000 gpos100 | |
6 q16.2 99500001 100600000 gneg | |
6 q16.3 100600001 105500000 gpos100 | |
6 q21 105500001 114600000 gneg | |
6 q22.1 114600001 118300000 gpos75 | |
6 q22.2 118300001 118500000 gneg | |
6 q22.31 118500001 126100000 gpos100 | |
6 q22.32 126100001 127100000 gneg | |
6 q22.33 127100001 130300000 gpos75 | |
6 q23.1 130300001 131200000 gneg | |
6 q23.2 131200001 135200000 gpos50 | |
6 q23.3 135200001 139000000 gneg | |
6 q24.1 139000001 142800000 gpos75 | |
6 q24.2 142800001 145600000 gneg | |
6 q24.3 145600001 149000000 gpos75 | |
6 q25.1 149000001 152500000 gneg | |
6 q25.2 152500001 155500000 gpos50 | |
6 q25.3 155500001 161000000 gneg | |
6 q26 161000001 164500000 gpos50 | |
6 q27 164500001 171115067 gneg | |
7 p11.1 58000001 59900000 acen | |
7 p11.2 54000001 58000000 gneg | |
7 p12.1 50500001 54000000 gpos75 | |
7 p12.2 49000001 50500000 gneg | |
7 p12.3 45400001 49000000 gpos75 | |
7 p13 43300001 45400000 gneg | |
7 p14.1 37200001 43300000 gpos75 | |
7 p14.2 35000001 37200000 gneg | |
7 p14.3 28800001 35000000 gpos75 | |
7 p15.1 28000001 28800000 gneg | |
7 p15.2 25500001 28000000 gpos50 | |
7 p15.3 20900001 25500000 gneg | |
7 p21.1 16500001 20900000 gpos100 | |
7 p21.2 13800001 16500000 gneg | |
7 p21.3 7300001 13800000 gpos100 | |
7 p22.1 4500001 7300000 gneg | |
7 p22.2 2800001 4500000 gpos25 | |
7 p22.3 1 2800000 gneg | |
7 q11.1 59900001 61700000 acen | |
7 q11.21 61700001 67000000 gneg | |
7 q11.22 67000001 72200000 gpos50 | |
7 q11.23 72200001 77500000 gneg | |
7 q21.11 77500001 86400000 gpos100 | |
7 q21.12 86400001 88200000 gneg | |
7 q21.13 88200001 91100000 gpos75 | |
7 q21.2 91100001 92800000 gneg | |
7 q21.3 92800001 98000000 gpos75 | |
7 q22.1 98000001 103800000 gneg | |
7 q22.2 103800001 104500000 gpos50 | |
7 q22.3 104500001 107400000 gneg | |
7 q31.1 107400001 114600000 gpos75 | |
7 q31.2 114600001 117400000 gneg | |
7 q31.31 117400001 121100000 gpos75 | |
7 q31.32 121100001 123800000 gneg | |
7 q31.33 123800001 127100000 gpos75 | |
7 q32.1 127100001 129200000 gneg | |
7 q32.2 129200001 130400000 gpos25 | |
7 q32.3 130400001 132600000 gneg | |
7 q33 132600001 138200000 gpos50 | |
7 q34 138200001 143100000 gneg | |
7 q35 143100001 147900000 gpos75 | |
7 q36.1 147900001 152600000 gneg | |
7 q36.2 152600001 155100000 gpos25 | |
7 q36.3 155100001 159138663 gneg | |
8 p11.1 43100001 45600000 acen | |
8 p11.21 39700001 43100000 gneg | |
8 p11.22 38300001 39700000 gpos25 | |
8 p11.23 36500001 38300000 gneg | |
8 p12 28800001 36500000 gpos75 | |
8 p21.1 27400001 28800000 gneg | |
8 p21.2 23300001 27400000 gpos50 | |
8 p21.3 19000001 23300000 gneg | |
8 p22 12700001 19000000 gpos100 | |
8 p23.1 6200001 12700000 gneg | |
8 p23.2 2200001 6200000 gpos75 | |
8 p23.3 1 2200000 gneg | |
8 q11.1 45600001 48100000 acen | |
8 q11.21 48100001 52200000 gneg | |
8 q11.22 52200001 52600000 gpos75 | |
8 q11.23 52600001 55500000 gneg | |
8 q12.1 55500001 61600000 gpos50 | |
8 q12.2 61600001 62200000 gneg | |
8 q12.3 62200001 66000000 gpos50 | |
8 q13.1 66000001 68000000 gneg | |
8 q13.2 68000001 70500000 gpos50 | |
8 q13.3 70500001 73900000 gneg | |
8 q21.11 73900001 78300000 gpos100 | |
8 q21.12 78300001 80100000 gneg | |
8 q21.13 80100001 84600000 gpos75 | |
8 q21.2 84600001 86900000 gneg | |
8 q21.3 86900001 93300000 gpos100 | |
8 q22.1 93300001 99000000 gneg | |
8 q22.2 99000001 101600000 gpos25 | |
8 q22.3 101600001 106200000 gneg | |
8 q23.1 106200001 110500000 gpos75 | |
8 q23.2 110500001 112100000 gneg | |
8 q23.3 112100001 117700000 gpos100 | |
8 q24.11 117700001 119200000 gneg | |
8 q24.12 119200001 122500000 gpos50 | |
8 q24.13 122500001 127300000 gneg | |
8 q24.21 127300001 131500000 gpos50 | |
8 q24.22 131500001 136400000 gneg | |
8 q24.23 136400001 139900000 gpos75 | |
8 q24.3 139900001 146364022 gneg | |
9 p11.1 47300001 49000000 acen | |
9 p11.2 43600001 47300000 gneg | |
9 p12 41000001 43600000 gpos50 | |
9 p13.1 38400001 41000000 gneg | |
9 p13.2 36300001 38400000 gpos25 | |
9 p13.3 33200001 36300000 gneg | |
9 p21.1 28000001 33200000 gpos100 | |
9 p21.2 25600001 28000000 gneg | |
9 p21.3 19900001 25600000 gpos100 | |
9 p22.1 18500001 19900000 gneg | |
9 p22.2 16600001 18500000 gpos25 | |
9 p22.3 14200001 16600000 gneg | |
9 p23 9000001 14200000 gpos75 | |
9 p24.1 4600001 9000000 gneg | |
9 p24.2 2200001 4600000 gpos25 | |
9 p24.3 1 2200000 gneg | |
9 q11 49000001 50700000 acen | |
9 q12 50700001 65900000 gvar | |
9 q13 65900001 68700000 gneg | |
9 q21.11 68700001 72200000 gpos25 | |
9 q21.12 72200001 74000000 gneg | |
9 q21.13 74000001 79200000 gpos50 | |
9 q21.2 79200001 81100000 gneg | |
9 q21.31 81100001 84100000 gpos50 | |
9 q21.32 84100001 86900000 gneg | |
9 q21.33 86900001 90400000 gpos50 | |
9 q22.1 90400001 91800000 gneg | |
9 q22.2 91800001 93900000 gpos25 | |
9 q22.31 93900001 96600000 gneg | |
9 q22.32 96600001 99300000 gpos25 | |
9 q22.33 99300001 102600000 gneg | |
9 q31.1 102600001 108200000 gpos100 | |
9 q31.2 108200001 111300000 gneg | |
9 q31.3 111300001 114900000 gpos25 | |
9 q32 114900001 117700000 gneg | |
9 q33.1 117700001 122500000 gpos75 | |
9 q33.2 122500001 125800000 gneg | |
9 q33.3 125800001 130300000 gpos25 | |
9 q34.11 130300001 133500000 gneg | |
9 q34.12 133500001 134000000 gpos25 | |
9 q34.13 134000001 135900000 gneg | |
9 q34.2 135900001 137400000 gpos25 | |
9 q34.3 137400001 141213431 gneg | |
10 p11.1 38000001 40200000 acen | |
10 p11.21 34400001 38000000 gneg | |
10 p11.22 31300001 34400000 gpos25 | |
10 p11.23 29600001 31300000 gneg | |
10 p12.1 24600001 29600000 gpos50 | |
10 p12.2 22600001 24600000 gneg | |
10 p12.31 18700001 22600000 gpos75 | |
10 p12.32 18600001 18700000 gneg | |
10 p12.33 17300001 18600000 gpos75 | |
10 p13 12200001 17300000 gneg | |
10 p14 6600001 12200000 gpos75 | |
10 p15.1 3800001 6600000 gneg | |
10 p15.2 3000001 3800000 gpos25 | |
10 p15.3 1 3000000 gneg | |
10 q11.1 40200001 42300000 acen | |
10 q11.21 42300001 46100000 gneg | |
10 q11.22 46100001 49900000 gpos25 | |
10 q11.23 49900001 52900000 gneg | |
10 q21.1 52900001 61200000 gpos100 | |
10 q21.2 61200001 64500000 gneg | |
10 q21.3 64500001 70600000 gpos100 | |
10 q22.1 70600001 74900000 gneg | |
10 q22.2 74900001 77700000 gpos50 | |
10 q22.3 77700001 82000000 gneg | |
10 q23.1 82000001 87900000 gpos100 | |
10 q23.2 87900001 89500000 gneg | |
10 q23.31 89500001 92900000 gpos75 | |
10 q23.32 92900001 94100000 gneg | |
10 q23.33 94100001 97000000 gpos50 | |
10 q24.1 97000001 99300000 gneg | |
10 q24.2 99300001 101900000 gpos50 | |
10 q24.31 101900001 103000000 gneg | |
10 q24.32 103000001 104900000 gpos25 | |
10 q24.33 104900001 105800000 gneg | |
10 q25.1 105800001 111900000 gpos100 | |
10 q25.2 111900001 114900000 gneg | |
10 q25.3 114900001 119100000 gpos75 | |
10 q26.11 119100001 121700000 gneg | |
10 q26.12 121700001 123100000 gpos50 | |
10 q26.13 123100001 127500000 gneg | |
10 q26.2 127500001 130600000 gpos50 | |
10 q26.3 130600001 135534747 gneg | |
11 p11.11 51600001 53700000 acen | |
11 p11.12 48800001 51600000 gpos75 | |
11 p11.2 43500001 48800000 gneg | |
11 p12 36400001 43500000 gpos100 | |
11 p13 31000001 36400000 gneg | |
11 p14.1 27200001 31000000 gpos75 | |
11 p14.2 26100001 27200000 gneg | |
11 p14.3 21700001 26100000 gpos100 | |
11 p15.1 16200001 21700000 gneg | |
11 p15.2 12700001 16200000 gpos50 | |
11 p15.3 10700001 12700000 gneg | |
11 p15.4 2800001 10700000 gpos50 | |
11 p15.5 1 2800000 gneg | |
11 q11 53700001 55700000 acen | |
11 q12.1 55700001 59900000 gpos75 | |
11 q12.2 59900001 61700000 gneg | |
11 q12.3 61700001 63400000 gpos25 | |
11 q13.1 63400001 65900000 gneg | |
11 q13.2 65900001 68400000 gpos25 | |
11 q13.3 68400001 70400000 gneg | |
11 q13.4 70400001 75200000 gpos50 | |
11 q13.5 75200001 77100000 gneg | |
11 q14.1 77100001 85600000 gpos100 | |
11 q14.2 85600001 88300000 gneg | |
11 q14.3 88300001 92800000 gpos100 | |
11 q21 92800001 97200000 gneg | |
11 q22.1 97200001 102100000 gpos100 | |
11 q22.2 102100001 102900000 gneg | |
11 q22.3 102900001 110400000 gpos100 | |
11 q23.1 110400001 112500000 gneg | |
11 q23.2 112500001 114500000 gpos50 | |
11 q23.3 114500001 121200000 gneg | |
11 q24.1 121200001 123900000 gpos50 | |
11 q24.2 123900001 127800000 gneg | |
11 q24.3 127800001 130800000 gpos50 | |
11 q25 130800001 135006516 gneg | |
12 p11.1 33300001 35800000 acen | |
12 p11.21 30700001 33300000 gneg | |
12 p11.22 27800001 30700000 gpos50 | |
12 p11.23 26500001 27800000 gneg | |
12 p12.1 21300001 26500000 gpos100 | |
12 p12.2 20000001 21300000 gneg | |
12 p12.3 14800001 20000000 gpos100 | |
12 p13.1 12800001 14800000 gneg | |
12 p13.2 10100001 12800000 gpos75 | |
12 p13.31 5400001 10100000 gneg | |
12 p13.32 3300001 5400000 gpos25 | |
12 p13.33 1 3300000 gneg | |
12 q11 35800001 38200000 acen | |
12 q12 38200001 46400000 gpos100 | |
12 q13.11 46400001 49100000 gneg | |
12 q13.12 49100001 51500000 gpos25 | |
12 q13.13 51500001 54900000 gneg | |
12 q13.2 54900001 56600000 gpos25 | |
12 q13.3 56600001 58100000 gneg | |
12 q14.1 58100001 63100000 gpos75 | |
12 q14.2 63100001 65100000 gneg | |
12 q14.3 65100001 67700000 gpos50 | |
12 q15 67700001 71500000 gneg | |
12 q21.1 71500001 75700000 gpos75 | |
12 q21.2 75700001 80300000 gneg | |
12 q21.31 80300001 86700000 gpos100 | |
12 q21.32 86700001 89000000 gneg | |
12 q21.33 89000001 92600000 gpos100 | |
12 q22 92600001 96200000 gneg | |
12 q23.1 96200001 101600000 gpos75 | |
12 q23.2 101600001 103800000 gneg | |
12 q23.3 103800001 109000000 gpos50 | |
12 q24.11 109000001 111700000 gneg | |
12 q24.12 111700001 112300000 gpos25 | |
12 q24.13 112300001 114300000 gneg | |
12 q24.21 114300001 116800000 gpos50 | |
12 q24.22 116800001 118100000 gneg | |
12 q24.23 118100001 120700000 gpos50 | |
12 q24.31 120700001 125900000 gneg | |
12 q24.32 125900001 129300000 gpos50 | |
12 q24.33 129300001 133851895 gneg | |
13 p11.1 16300001 17900000 acen | |
13 p11.2 10000001 16300000 gvar | |
13 p12 4500001 10000000 stalk | |
13 p13 1 4500000 gvar | |
13 q11 17900001 19500000 acen | |
13 q12.11 19500001 23300000 gneg | |
13 q12.12 23300001 25500000 gpos25 | |
13 q12.13 25500001 27800000 gneg | |
13 q12.2 27800001 28900000 gpos25 | |
13 q12.3 28900001 32200000 gneg | |
13 q13.1 32200001 34000000 gpos50 | |
13 q13.2 34000001 35500000 gneg | |
13 q13.3 35500001 40100000 gpos75 | |
13 q14.11 40100001 45200000 gneg | |
13 q14.12 45200001 45800000 gpos25 | |
13 q14.13 45800001 47300000 gneg | |
13 q14.2 47300001 50900000 gpos50 | |
13 q14.3 50900001 55300000 gneg | |
13 q21.1 55300001 59600000 gpos100 | |
13 q21.2 59600001 62300000 gneg | |
13 q21.31 62300001 65700000 gpos75 | |
13 q21.32 65700001 68600000 gneg | |
13 q21.33 68600001 73300000 gpos100 | |
13 q22.1 73300001 75400000 gneg | |
13 q22.2 75400001 77200000 gpos50 | |
13 q22.3 77200001 79000000 gneg | |
13 q31.1 79000001 87700000 gpos100 | |
13 q31.2 87700001 90000000 gneg | |
13 q31.3 90000001 95000000 gpos100 | |
13 q32.1 95000001 98200000 gneg | |
13 q32.2 98200001 99300000 gpos25 | |
13 q32.3 99300001 101700000 gneg | |
13 q33.1 101700001 104800000 gpos100 | |
13 q33.2 104800001 107000000 gneg | |
13 q33.3 107000001 110300000 gpos100 | |
13 q34 110300001 115169878 gneg | |
14 p11.1 16100001 17600000 acen | |
14 p11.2 8100001 16100000 gvar | |
14 p12 3700001 8100000 stalk | |
14 p13 1 3700000 gvar | |
14 q11.1 17600001 19100000 acen | |
14 q11.2 19100001 24600000 gneg | |
14 q12 24600001 33300000 gpos100 | |
14 q13.1 33300001 35300000 gneg | |
14 q13.2 35300001 36600000 gpos50 | |
14 q13.3 36600001 37800000 gneg | |
14 q21.1 37800001 43500000 gpos100 | |
14 q21.2 43500001 47200000 gneg | |
14 q21.3 47200001 50900000 gpos100 | |
14 q22.1 50900001 54100000 gneg | |
14 q22.2 54100001 55500000 gpos25 | |
14 q22.3 55500001 58100000 gneg | |
14 q23.1 58100001 62100000 gpos75 | |
14 q23.2 62100001 64800000 gneg | |
14 q23.3 64800001 67900000 gpos50 | |
14 q24.1 67900001 70200000 gneg | |
14 q24.2 70200001 73800000 gpos50 | |
14 q24.3 73800001 79300000 gneg | |
14 q31.1 79300001 83600000 gpos100 | |
14 q31.2 83600001 84900000 gneg | |
14 q31.3 84900001 89800000 gpos100 | |
14 q32.11 89800001 91900000 gneg | |
14 q32.12 91900001 94700000 gpos25 | |
14 q32.13 94700001 96300000 gneg | |
14 q32.2 96300001 101400000 gpos50 | |
14 q32.31 101400001 103200000 gneg | |
14 q32.32 103200001 104000000 gpos50 | |
14 q32.33 104000001 107349540 gneg | |
15 p11.1 15800001 19000000 acen | |
15 p11.2 8700001 15800000 gvar | |
15 p12 3900001 8700000 stalk | |
15 p13 1 3900000 gvar | |
15 q11.1 19000001 20700000 acen | |
15 q11.2 20700001 25700000 gneg | |
15 q12 25700001 28100000 gpos50 | |
15 q13.1 28100001 30300000 gneg | |
15 q13.2 30300001 31200000 gpos50 | |
15 q13.3 31200001 33600000 gneg | |
15 q14 33600001 40100000 gpos75 | |
15 q15.1 40100001 42800000 gneg | |
15 q15.2 42800001 43600000 gpos25 | |
15 q15.3 43600001 44800000 gneg | |
15 q21.1 44800001 49500000 gpos75 | |
15 q21.2 49500001 52900000 gneg | |
15 q21.3 52900001 59100000 gpos75 | |
15 q22.1 59100001 59300000 gneg | |
15 q22.2 59300001 63700000 gpos25 | |
15 q22.31 63700001 67200000 gneg | |
15 q22.32 67200001 67300000 gpos25 | |
15 q22.33 67300001 67500000 gneg | |
15 q23 67500001 72700000 gpos25 | |
15 q24.1 72700001 75200000 gneg | |
15 q24.2 75200001 76600000 gpos25 | |
15 q24.3 76600001 78300000 gneg | |
15 q25.1 78300001 81700000 gpos50 | |
15 q25.2 81700001 85200000 gneg | |
15 q25.3 85200001 89100000 gpos50 | |
15 q26.1 89100001 94300000 gneg | |
15 q26.2 94300001 98500000 gpos50 | |
15 q26.3 98500001 102531392 gneg | |
16 p11.1 34600001 36600000 acen | |
16 p11.2 28100001 34600000 gneg | |
16 p12.1 24200001 28100000 gpos50 | |
16 p12.2 21200001 24200000 gneg | |
16 p12.3 16800001 21200000 gpos50 | |
16 p13.11 14800001 16800000 gneg | |
16 p13.12 12600001 14800000 gpos50 | |
16 p13.13 10500001 12600000 gneg | |
16 p13.2 7900001 10500000 gpos50 | |
16 p13.3 1 7900000 gneg | |
16 q11.1 36600001 38600000 acen | |
16 q11.2 38600001 47000000 gvar | |
16 q12.1 47000001 52600000 gneg | |
16 q12.2 52600001 56700000 gpos50 | |
16 q13 56700001 57400000 gneg | |
16 q21 57400001 66700000 gpos100 | |
16 q22.1 66700001 70800000 gneg | |
16 q22.2 70800001 72900000 gpos50 | |
16 q22.3 72900001 74100000 gneg | |
16 q23.1 74100001 79200000 gpos75 | |
16 q23.2 79200001 81700000 gneg | |
16 q23.3 81700001 84200000 gpos50 | |
16 q24.1 84200001 87100000 gneg | |
16 q24.2 87100001 88700000 gpos25 | |
16 q24.3 88700001 90354753 gneg | |
17 p11.1 22200001 24000000 acen | |
17 p11.2 16000001 22200000 gneg | |
17 p12 10700001 16000000 gpos75 | |
17 p13.1 6500001 10700000 gneg | |
17 p13.2 3300001 6500000 gpos50 | |
17 p13.3 1 3300000 gneg | |
17 q11.1 24000001 25800000 acen | |
17 q11.2 25800001 31800000 gneg | |
17 q12 31800001 38100000 gpos50 | |
17 q21.1 38100001 38400000 gneg | |
17 q21.2 38400001 40900000 gpos25 | |
17 q21.31 40900001 44900000 gneg | |
17 q21.32 44900001 47400000 gpos25 | |
17 q21.33 47400001 50200000 gneg | |
17 q22 50200001 57600000 gpos75 | |
17 q23.1 57600001 58300000 gneg | |
17 q23.2 58300001 61100000 gpos75 | |
17 q23.3 61100001 62600000 gneg | |
17 q24.1 62600001 64200000 gpos50 | |
17 q24.2 64200001 67100000 gneg | |
17 q24.3 67100001 70900000 gpos75 | |
17 q25.1 70900001 74800000 gneg | |
17 q25.2 74800001 75300000 gpos25 | |
17 q25.3 75300001 81195210 gneg | |
18 p11.1 15400001 17200000 acen | |
18 p11.21 10900001 15400000 gneg | |
18 p11.22 8500001 10900000 gpos25 | |
18 p11.23 7100001 8500000 gneg | |
18 p11.31 2900001 7100000 gpos50 | |
18 p11.32 1 2900000 gneg | |
18 q11.1 17200001 19000000 acen | |
18 q11.2 19000001 25000000 gneg | |
18 q12.1 25000001 32700000 gpos100 | |
18 q12.2 32700001 37200000 gneg | |
18 q12.3 37200001 43500000 gpos75 | |
18 q21.1 43500001 48200000 gneg | |
18 q21.2 48200001 53800000 gpos75 | |
18 q21.31 53800001 56200000 gneg | |
18 q21.32 56200001 59000000 gpos50 | |
18 q21.33 59000001 61600000 gneg | |
18 q22.1 61600001 66800000 gpos100 | |
18 q22.2 66800001 68700000 gneg | |
18 q22.3 68700001 73100000 gpos25 | |
18 q23 73100001 78077248 gneg | |
19 p11 24400001 26500000 acen | |
19 p12 20000001 24400000 gvar | |
19 p13.11 16300001 20000000 gneg | |
19 p13.12 14000001 16300000 gpos25 | |
19 p13.13 13900001 14000000 gneg | |
19 p13.2 6900001 13900000 gpos25 | |
19 p13.3 1 6900000 gneg | |
19 q11 26500001 28600000 acen | |
19 q12 28600001 32400000 gvar | |
19 q13.11 32400001 35500000 gneg | |
19 q13.12 35500001 38300000 gpos25 | |
19 q13.13 38300001 38700000 gneg | |
19 q13.2 38700001 43400000 gpos25 | |
19 q13.31 43400001 45200000 gneg | |
19 q13.32 45200001 48000000 gpos25 | |
19 q13.33 48000001 51400000 gneg | |
19 q13.41 51400001 53600000 gpos25 | |
19 q13.42 53600001 56300000 gneg | |
19 q13.43 56300001 59128983 gpos25 | |
20 p11.1 25600001 27500000 acen | |
20 p11.21 22300001 25600000 gneg | |
20 p11.22 21300001 22300000 gpos25 | |
20 p11.23 17900001 21300000 gneg | |
20 p12.1 12100001 17900000 gpos75 | |
20 p12.2 9200001 12100000 gneg | |
20 p12.3 5100001 9200000 gpos75 | |
20 p13 1 5100000 gneg | |
20 q11.1 27500001 29400000 acen | |
20 q11.21 29400001 32100000 gneg | |
20 q11.22 32100001 34400000 gpos25 | |
20 q11.23 34400001 37600000 gneg | |
20 q12 37600001 41700000 gpos75 | |
20 q13.11 41700001 42100000 gneg | |
20 q13.12 42100001 46400000 gpos25 | |
20 q13.13 46400001 49800000 gneg | |
20 q13.2 49800001 55000000 gpos75 | |
20 q13.31 55000001 56500000 gneg | |
20 q13.32 56500001 58400000 gpos50 | |
20 q13.33 58400001 63025520 gneg | |
21 p11.1 10900001 13200000 acen | |
21 p11.2 6800001 10900000 gvar | |
21 p12 2800001 6800000 stalk | |
21 p13 1 2800000 gvar | |
21 q11.1 13200001 14300000 acen | |
21 q11.2 14300001 16400000 gneg | |
21 q21.1 16400001 24000000 gpos100 | |
21 q21.2 24000001 26800000 gneg | |
21 q21.3 26800001 31500000 gpos75 | |
21 q22.11 31500001 35800000 gneg | |
21 q22.12 35800001 37800000 gpos50 | |
21 q22.13 37800001 39700000 gneg | |
21 q22.2 39700001 42600000 gpos50 | |
21 q22.3 42600001 48129895 gneg | |
22 p11.1 12200001 14700000 acen | |
22 p11.2 8300001 12200000 gvar | |
22 p12 3800001 8300000 stalk | |
22 p13 1 3800000 gvar | |
22 q11.1 14700001 17900000 acen | |
22 q11.21 17900001 22200000 gneg | |
22 q11.22 22200001 23500000 gpos25 | |
22 q11.23 23500001 25900000 gneg | |
22 q12.1 25900001 29600000 gpos50 | |
22 q12.2 29600001 32200000 gneg | |
22 q12.3 32200001 37600000 gpos50 | |
22 q13.1 37600001 41000000 gneg | |
22 q13.2 41000001 44200000 gpos50 | |
22 q13.31 44200001 48400000 gneg | |
22 q13.32 48400001 49400000 gpos50 | |
22 q13.33 49400001 51304566 gneg | |
X p11.1 58100001 60600000 acen | |
X p11.21 54800001 58100000 gneg | |
X p11.22 49800001 54800000 gpos25 | |
X p11.23 46400001 49800000 gneg | |
X p11.3 42400001 46400000 gpos75 | |
X p11.4 37600001 42400000 gneg | |
X p21.1 31500001 37600000 gpos100 | |
X p21.2 29300001 31500000 gneg | |
X p21.3 24900001 29300000 gpos100 | |
X p22.11 21900001 24900000 gneg | |
X p22.12 19300001 21900000 gpos50 | |
X p22.13 17100001 19300000 gneg | |
X p22.2 9500001 17100000 gpos50 | |
X p22.31 6000001 9500000 gneg | |
X p22.32 4300001 6000000 gpos50 | |
X p22.33 1 4300000 gneg | |
X q11.1 60600001 63000000 acen | |
X q11.2 63000001 64600000 gneg | |
X q12 64600001 67800000 gpos50 | |
X q13.1 67800001 71800000 gneg | |
X q13.2 71800001 73900000 gpos50 | |
X q13.3 73900001 76000000 gneg | |
X q21.1 76000001 84600000 gpos100 | |
X q21.2 84600001 86200000 gneg | |
X q21.31 86200001 91800000 gpos100 | |
X q21.32 91800001 93500000 gneg | |
X q21.33 93500001 98300000 gpos75 | |
X q22.1 98300001 102600000 gneg | |
X q22.2 102600001 103700000 gpos50 | |
X q22.3 103700001 108700000 gneg | |
X q23 108700001 116500000 gpos75 | |
X q24 116500001 120900000 gneg | |
X q25 120900001 128700000 gpos100 | |
X q26.1 128700001 130400000 gneg | |
X q26.2 130400001 133600000 gpos25 | |
X q26.3 133600001 138000000 gneg | |
X q27.1 138000001 140300000 gpos75 | |
X q27.2 140300001 142100000 gneg | |
X q27.3 142100001 147100000 gpos100 | |
X q28 147100001 155270560 gneg | |
Y p11.1 11600001 12500000 acen | |
Y p11.2 3000001 11600000 gneg | |
Y p11.31 2500001 3000000 gpos50 | |
Y p11.32 1 2500000 gneg | |
Y q11.1 12500001 13400000 acen | |
Y q11.21 13400001 15100000 gneg | |
Y q11.221 15100001 19800000 gpos50 | |
Y q11.222 19800001 22100000 gneg | |
Y q11.223 22100001 26200000 gpos50 | |
Y q11.23 26200001 28800000 gneg | |
Y q12 28800001 59373566 gvar |
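If you would rather pull the band list above programmatically than copy it, the Ensembl Perl API can return it directly. The following is only a sketch (adaptor and method names — get_adaptor, fetch_by_region, get_all_KaryotypeBands, stain — as I remember them from the core API, so double-check against your installation):
Code
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous' );

my $slice_adaptor = Bio::EnsEMBL::Registry->get_adaptor('human', 'core', 'Slice');
my $slice = $slice_adaptor->fetch_by_region('chromosome', '1');

# each KaryotypeBand object carries the band name, coordinates and stain
foreach my $band (@{ $slice->get_all_KaryotypeBands() }) {
  print join("\t", '1', $band->name, $band->start, $band->end, $band->stain), "\n";
}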
Symbols used:
The following symbols are often used together with band numbers when describing changes in the karyogram, e.g. of cancer cells, in ISCN notation.
Symbol | Meaning
---|---
, | Separates chromosome modal number, sex chromosomes, and chromosome abnormalities
- | Loss of a chromosome
( ) | Surround structurally altered chromosomes and breakpoints
+ | Gain of a chromosome
++ | Multiple signals on one chromosome
; | Separates rearranged chromosomes and breakpoints involving more than one chromosome
/ | Separates cell lines or clones
// | Separates recipient and donor cell lines in bone marrow transplants
~ | Approximation
x | Multiple copies of chromosomes or regions
. | Separates multiple techniques
amp | Amplification
arr | Microarray data
dim | Diminished fluorescence ratio intensity: deletion
del | Deletion
der | Derivative chromosome (used when only one chromosome from a translocation is present, or when one chromosome has two or more structural abnormalities). Alternative description: structurally rearranged chromosome generated either by a rearrangement involving two or more chromosomes or by multiple aberrations within a single chromosome (e.g. an inversion and a deletion of the same chromosome, or deletions in both arms of a single chromosome).[1] The term always refers to the chromosome that has an intact centromere.
dic | Dicentric chromosome
dn | Chromosomal abnormality not inherited from parents (de novo)
dup | Duplication of a portion of a chromosome
enh | Enhanced fluorescence ratio intensity: duplication
fra | Fragile site (usually used with Fragile-X syndrome)
h | Heterochromatic region of a chromosome
hlpa | Multiple ligation-dependent probe amplifications
hmz | Homozygosity
htz | Heterozygosity
i | Isochromosome (both arms of the chromosome are the same)
ins | Insertion of a portion of a chromosome
inv | Inversion
.ish | Precedes karyotype results from fluorescence in situ hybridization (FISH) analysis
mar | Marker chromosome (unidentifiable piece of chromosome)
mat | Maternally derived chromosome rearrangement
p | Short arm of a chromosome
pat | Paternally derived chromosome rearrangement
psu dic | Only one centromere is active (pseudo-dicentric)
q | Long arm of a chromosome
r | Ring chromosome
t | Translocation
ter | Terminal end of arm (i.e. 2qter - end of the long arm of chromosome 2)
tri | Trisomy
trp | Triplication of a portion of a chromosome
The software CyDAS seems to be able to work with this kind of karyotype data, and there is also a discussion there about shortcomings of the nomenclature. (Some of the issues might have been fixed in the meantime.)
There is also a publication by Mascarello et al. about the shortcomings of the system, demonstrated by comparing how different clinicians used it in a survey. Up to 50% of the notations used for specific cases were incorrect, and only 8% of participants used the exact same string to describe a trisomy 21 in uncultured amniocytes.
Sources and further reading:
This is the list of genomic regions that were analysed as the 1% of the human genome in the ENCODE pilot phase. (The main phase of ENCODE looks at the entire human genome.) The coordinates are for assembly NCBI36 (hg18).
See also the entry about ENCODE and the UCSC pages.
Name | Chr. | Start | End | Description
---|---|---|---|---
ENr231 | 1 | 149424685 | 149924684 | Random Picks
ENr131 | 2 | 234156564 | 234656627 | Random Picks
ENr331 | 2 | 219985590 | 220485589 | Random Picks
ENr112 | 2 | 51512209 | 52012208 | Random Picks
ENr121 | 2 | 118011044 | 118511043 | Random Picks
ENr113 | 4 | 118466104 | 118966103 | Random Picks
ENr212 | 5 | 141880151 | 142380150 | Random Picks
ENm002 | 5 | 131284314 | 132284313 | Manual Picks: Interleukin
ENr221 | 5 | 55871007 | 56371006 | Random Picks
ENr222 | 6 | 132218540 | 132718539 | Random Picks
ENr223 | 6 | 73789953 | 74289952 | Random Picks
ENr323 | 6 | 108371397 | 108871396 | Random Picks
ENr334 | 6 | 41405895 | 41905894 | Random Picks
ENm013 | 7 | 89621625 | 90736048 | Manual Picks
ENm001 | 7 | 115597757 | 117475182 | Manual Picks: CFTR
ENm010 | 7 | 26924046 | 27424045 | Manual Picks: HOXA
ENm012 | 7 | 113720369 | 114720368 | Manual Picks: FOXP2
ENm014 | 7 | 125865892 | 127029088 | Manual Picks
ENr321 | 8 | 118882221 | 119382220 | Random Picks
ENr232 | 9 | 130725123 | 131225122 | Random Picks
ENr114 | 10 | 55153819 | 55653818 | Random Picks
ENr312 | 11 | 130604798 | 131104797 | Random Picks
ENr332 | 11 | 63940889 | 64440888 | Random Picks
ENm009 | 11 | 4730996 | 5732587 | Manual Picks: Beta
ENm011 | 11 | 1699992 | 2306039 | Manual Picks: IGF2/H19
ENm003 | 11 | 115962316 | 116462315 | Manual Picks: Apo
ENr123 | 12 | 38626477 | 39126476 | Random Picks
ENr111 | 13 | 29418016 | 29918015 | Random Picks
ENr132 | 13 | 112338065 | 112838064 | Random Picks
ENr311 | 14 | 52947076 | 53447075 | Random Picks
ENr322 | 14 | 98458224 | 98958223 | Random Picks
ENr233 | 15 | 41520089 | 42020088 | Random Picks
ENm008 | 16 | 1 | 500000 | Manual Picks: Alpha
ENr313 | 16 | 60833950 | 61333949 | Random Picks
ENr211 | 16 | 25780428 | 26280428 | Random Picks
ENr213 | 18 | 23719232 | 24219231 | Random Picks
ENr122 | 18 | 59412301 | 59912300 | Random Picks
ENm007 | 19 | 59023585 | 60024460 | Manual Picks: Chr19
ENr333 | 20 | 33304929 | 33804928 | Random Picks
ENr133 | 21 | 39244467 | 39744466 | Random Picks
ENm005 | 21 | 32668237 | 34364221 | Manual Picks: Chr21
ENm004 | 22 | 30133954 | 31833953 | Manual Picks: Chr22
ENr324 | X | 122609996 | 123109995 | Random Picks
ENm006 | X | 152767492 | 154063081 | Manual Picks: ChrX
A quick reminder of the specifications to connect to the public Ensembl MySQL databases:
Database | Server | Port
---|---|---
Ensembl (v 24-47) | ensembldb.ensembl.org | 3306
Ensembl (v 48 and above) | ensembldb.ensembl.org | 5306
Ensembl Mart | martdb.ensembl.org | 5316
Ensembl Genomes | mysql.ebi.ac.uk | 4157
Ensembl (curr. v) in US cloud | useastdb.ensembl.org | 5306
user = "anonymous"
pass = ""
The mysql command line for connecting:
Code
mysql -uanonymous -hensembldb.ensembl.org -P5306 |
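The same connection also works from a Perl script via DBI; a minimal sketch (the schema name homo_sapiens_core_61_37f is just an example, pick whichever database you actually need, e.g. via SHOW DATABASES):
Code
use DBI;

# connect to the public Ensembl MySQL server (v48 and above, port 5306)
my $dbh = DBI->connect(
    "DBI:mysql:database=homo_sapiens_core_61_37f;host=ensembldb.ensembl.org;port=5306",
    "anonymous", "", { RaiseError => 1 });

my $rows = $dbh->selectall_arrayref("SELECT name FROM seq_region LIMIT 5");
print join("\n", map { $_->[0] } @$rows), "\n";

$dbh->disconnect;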
Using the SQLite database Engine
SQLite is different from MySQL in a number of ways, the main one being that it is server-less and file-based. The other distinctive features are nicely listed here, with pros and cons.
It's an ideal choice if you want to bundle a database with your application, as SQLite is small, platform independent and without any usage restrictions.
It can be accessed with the Perl DBI modules:
Code
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=$db_file", "", "")
    or die "Unable to connect: $DBI::errstr\n";
Alternatively, you can work with a database visually with the (free) Firefox plugin SQLite Manager or the (paid) application SQLite Maestro, or on the command line by calling:
sqlite db_file_name.db
Special sqlite commands are preceded by a ".", e.g. to exit type ".exit".
The SQL syntax is not identical but very similar. Converter tools are listed here, and here are some Stack Overflow notes about the topic.
Some compatibility notes: SQLite supports subqueries.
It does not support deletes on joined tables.
To make the output more readable you can:
Code
.header on | |
.separator \t |
To inspect the structure of a database you can use the following commands.
1. List the table names:
Code
.tables #or | |
.tables table_na% # "like" pattern matching |
2. Show the schema of a table:
Code
.schema table_name #or | |
.schema table_na% #or | |
SELECT sql FROM sqlite_master WHERE name = 'table_name'; |
To export all data from a database into files separated by table, you can use the "export table" function in the SQLite Manager, or use the command line if you have many tables:
1. Create a file with all table names in your database (get the names as mentioned above).
2. Then call sqlite for each table to export the data:
Code
cat tables.txt | awk '{print ".mode csv\n.output "$1".txt\nselect * from "$1";"}' | sqlite dbname.db
Import of these text files can be done with
Code
.import file.txt table_name |
The separator for export and import needs to be the same, otherwise you will get errors like
data.txt line 1: expected 10 columns of data but found 1
If there are line breaks in the data fields, the parsing of the import will break in a similar way. Try to set the separator to
\t
and not specify
.mode csv
for the export.
Here are some very useful FAQs.
Missing the lovely Unix command-line tools when working on MS Windows machines, I've been trying a few options to speed up everyday tasks like easy file processing:
UnxUtils. A collection of all those unix tools I missed wrapped up to be usable by the windows command line (grep, ls, head, awk...). Nice!
Remember to add "UnxUtils\usr\local\wbin" to your PATH.
"Structural variation (SV) is generally defined as a region of DNA approximately 1 kb and larger in size and can include inversions and balanced translocations or genomic imbalances (insertions and deletions), commonly referred to as copy number variants (CNVs). These CNVs often overlap with segmental duplications (regions of DNA >1 kb present more than once in the genome). If present at >1% in a population a CNV may be referred to as copy number polymorphism (CNP)."
Estimates of how much of the human genome consists of CNVs range from 10-20%.
dbVar is the NCBI database of genomic structural variation designed to store data on variant DNA ≥ 1 bp in size.
The database ids are organised in the following manner:
Multiple experimental results, i.e. regions identified from different samples, are stored as "supporting variants" and are combined into regions that describe them as one event, which is stored as a "variant".
An example: esv10580 includes the supporting variants essv57440, essv75601, essv61475 and others. The individual (GRCh37/hg19) coordinates, e.g.
Chr1 521,413 564,458
Chr1 521,413 564,458
Chr1 521,648 575,095
result in the maximum coordinates for the variant:
Chr1 521,413 575,095
They all belong to the study estd20 by Conrad et al. (2010).
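In a script, deriving a variant's outer coordinates from its supporting variants is just a min/max over the starts and ends; a small sketch with the coordinates from the example above hard-coded:
Code
# supporting variant coordinates from the example above
my @supporting = (
    { chrom => 'chr1', start => 521_413, end => 564_458 },
    { chrom => 'chr1', start => 521_413, end => 564_458 },
    { chrom => 'chr1', start => 521_648, end => 575_095 },
);

# outer (maximum) coordinates of the merged variant
my ($min_start) = sort { $a <=> $b } map { $_->{start} } @supporting;
my ($max_end)   = sort { $b <=> $a } map { $_->{end} }   @supporting;

print "chr1\t$min_start\t$max_end\n";   # chr1  521413  575095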
There is a good overview page explaining structural variations and related methods.
Source: dbVar
The Online Mendelian Inheritance in Man (OMIM) data is a "catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression. It is a phenotypic companion to the Human Genome Project." (omim.org)
To get human disease annotation for your gene data, the data from the OMIM database can be downloaded from their FTP site and parsed with one of the multiple OMIM parsers within the BioPerl framework.
I used Christian Zmasek's OMIMparser.pm to get hashes with the ids and names:
Code
use Bio::Phenotype::OMIM::OMIMparser;

$omim_parser = Bio::Phenotype::OMIM::OMIMparser->new(
    -genemap  => $omim_genemap,
    -omimtext => $omim_all );

while ( my $omim_entry = $omim_parser->next_phenotype() ) {

  my $numb  = $omim_entry->MIM_number();
  my $title = $omim_entry->title();

  #remove the gene symbol from the title line
  $title =~ s/^.?(\d+) //;
  $title =~ s/;.*$//;

  #store omim ids by disease names
  $omim_names{$title} = $numb;

  #store genes and disease names in hash ref by omim id
  $omim_ids{$numb}->{'disease'} = $title;

  my @symbols = $omim_entry->each_gene_symbol();
  $omim_ids{$numb}->{'genes'} = \@symbols;

  push(@all_omim, $numb.":".$title);

}
If you run into an exception like this:
------------- EXCEPTION -------------
MSG: 16.13.3 does not make sense: 'arm' or 'cen' missing
STACK Bio::Map::CytoPosition::cytorange BioPerl-1.6.0/Bio/Map/CytoPosition.pm:165
You need to fix an error in the genemap file from OMIM:
line 9053 should be
16.25|2|2|10|16p13.3|CHTF18....
instead of
16.25|2|2|10|16.13.3|CHTF18...
OMIM ids are prefixed with defined symbols. An explanation of what these characters mean can be found on their FAQ site or here.
Please note that OMIM band start locations have a 1 bp offset to the definitions e.g. in Ensembl (probably stemming from a 0-based coordinate system). The band "16p11.2", for example, is listed as chr16 28100001 - 34600000 in Ensembl.
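The conversion itself is a one-liner; a tiny sketch (the 0-based start value is only inferred from the 1 bp offset described above):
Code
# OMIM/UCSC-style 0-based start vs. Ensembl-style 1-based start for 16p11.2
my $start_0based = 28_100_000;           # assumed OMIM-style value (see note above)
my $start_1based = $start_0based + 1;    # 28100001, as listed in Ensembl
print "$start_1based\n";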
The pseudo-autosomal regions are homologous DNA sequences on the (human) X and Y chromosomes (see wikipedia for more). They allow the pairing and crossing-over of these sex chromosomes during meiosis in the same way as it occurs between autosomal chromosomes. As these genomic regions are identical between X and Y, they are often only stored once.
To pull out the coordinates of the pseudo-autosomal regions (PAR) from the Ensembl database, you can perform the following query on the Ensembl core database:
Code
select
  (select sr.name from seq_region sr
    where sr.seq_region_id = ae.seq_region_id)     as chrom_1,
  ae.seq_region_start                              as start_1,
  ae.seq_region_end                                as end_1,
  (select sr.name from seq_region sr
    where sr.seq_region_id = ae.exc_seq_region_id) as chrom_2,
  ae.exc_seq_region_start                          as start_2,
  ae.exc_seq_region_end                            as end_2
from assembly_exception ae
where ae.exc_type = "PAR";
For the human database schema 61 (assembly GRCh37/hg19) you will get the PAR coordinates on Y and where the corresponding regions are located on X:
+---------+----------+----------+---------+-----------+-----------+
| chrom_1 | start_1  | end_1    | chrom_2 | start_2   | end_2     |
+---------+----------+----------+---------+-----------+-----------+
| Y       | 10001    | 2649520  | X       | 60001     | 2699520   |
| Y       | 59034050 | 59373566 | X       | 154931044 | 155270560 |
+---------+----------+----------+---------+-----------+-----------+
For the old assembly (NCBI36/hg18) you will get:
+---------+----------+----------+---------+-----------+-----------+
| chrom_1 | start_1  | end_1    | chrom_2 | start_2   | end_2     |
+---------+----------+----------+---------+-----------+-----------+
| Y       | 1        | 2709520  | X       | 1         | 2709520   |
| Y       | 57443438 | 57772954 | X       | 154584238 | 154913754 |
+---------+----------+----------+---------+-----------+-----------+
You can alternatively use the API:
Code
my $aefa = $db->get_AssemblyExceptionFeatureAdaptor(); | |
my $sa = $db->get_SliceAdaptor; | |
my $slice = $sa->fetch_by_region("chromosome", "Y"); | |
my @aefs = @{$aefa->fetch_all_by_Slice($slice)}; | |
foreach my $ae (@aefs){ | |
print $ae->display_id."\t".$ae->start."\t".$ae->end."\n"; | |
} |
X 10001 2649520
X 59034050 59373566
or for X:
Y 60001 2699520
Y 154931044 155270560
So to translate from Y to X PAR locations you can use the following for GRCh37 / hg19:
Y 10001 - 2649520 <-> X 60001 - 2699520, band Xp22.33
Y 59034050 - 59373566 <-> X 154931044 - 155270560, band Xq28
and for NCBI36 / hg18:
Y 1 - 2709520 <-> X 1 - 2709520, band Xp22.33
Y 57443438 - 57772954 <-> X 154584238 - 154913754, band Xq28
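For GRCh37, a minimal helper doing this translation could look like the following sketch (region boundaries hard-coded from the mapping above; this is not a general liftover):
Code
# GRCh37 PAR regions as [Y_start, Y_end, X_start]
my @pars_grch37 = (
    [ 10_001,     2_649_520,  60_001      ],   # PAR1, band Xp22.33
    [ 59_034_050, 59_373_566, 154_931_044 ],   # PAR2, band Xq28
);

# translate a chrY position into its chrX equivalent (undef outside the PARs)
sub y_to_x_par {
    my ($y_pos) = @_;
    foreach my $par (@pars_grch37) {
        my ($y_start, $y_end, $x_start) = @$par;
        return $x_start + ($y_pos - $y_start)
            if $y_pos >= $y_start && $y_pos <= $y_end;
    }
    return undef;
}

print y_to_x_par(10_001), "\n";   # 60001
Going from X to Y works the same way with the columns swapped.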
Please note that these coordinates do not agree with the definitions at the GRC and NCBI. The difference in the PAR-2 end coordinates (chrX:155,260,560 vs. 155,270,560 and chrY:59,363,566 vs. 59,373,566) is caused by the 10 kb telomeric (gap) region, which needs to be included in the PAR-2 definition to correctly represent this arrangement.
See also the telomere & centromere definition notes.
A nice list of official HGNC genes that are located in the pseudo-autosomal regions can be found here.
It doesn't always have to be R!
The Perl Data Language is a Perl extension for numerical manipulation that provides the convenience of Perl with the speed of compiled C.
It also contains plotting modules.
Install with cpan install PDL
or check these descriptions.
Code example for getting basic stats from a few values:
Code
use PDL;

my @numbers = (1,4,6,8,10);
my $piddle  = pdl(@numbers);

my ($mean,$prms,$median,$min,$max,$adev,$rms) = statsover($piddle);

print "Mean=$mean\n".
      "Sample standard deviation=$prms\n".
      "Median=$median\n".
      "Min=$min\n".
      "Max=$max\n".
      "Average absolute deviation=$adev\n".
      "Population standard deviation=$rms\n\n";
cron is an extremely useful unix utility that allows tasks to be run automatically in the background at regular intervals.
You need the script / command you want to run and the time it should run at. You can then use the crontab command to edit the schedule:
Format of entries:
* * * * *  command to be executed
- - - - -
| | | | |
| | | | +----- day of week (0 - 6) (Sunday=0)
| | | +------- month (1 - 12)
| | +--------- day of month (1 - 31)
| +----------- hour (0 - 23)
+------------- min (0 - 59)
Example:
00 03 * * * bash /users/fsk/backup_db.sh
This runs my backup script at 03:00 every day.
*/10 * * * * echo "job done"
This runs an echo every 10 minutes of every hour of every day.
To receive an email with any output from the jobs, add a line like the following at the top of the crontab:
Code
MAILTO=yourmail@home.com |
To suppress the output (and the email) for a specific job instead, append the following redirection to its command:
Code
>/dev/null 2>&1 |
This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.
Please read elsewhere about general Ruby or Rails questions; there are blog entries about Ruby & Rails terminology and the Rails application layout.
The AnnoTrack Ruby-on-Rails code can be found in svn/gencode/tracking_system/rails/. Most AnnoTrack-specific code is stored as "plugin" code in the Redmine directory. This means when trying to find a specific piece of code, you have to check the default application directory app, but also the plugin directory
vendor/plugins/redmine_annotrack/app. The language files defining the terminology and browser links used on the websites are
svn/gencode/tracking_system/rails/lang/en.yml and
svn/gencode/tracking_system/rails/vendor/plugins/redmine_annotrack/lang/en.yml.
In these files an entry like
Code
label_project_new: New Gene |
means "if you come across the term label_project_new, display it as New Gene in the browser".
To understand the code underlying specific web pages it is helpful to check the routing entries in
config/routes.rb and vendor/plugins/redmine_annotrack/routes.rb. Specific paths in the browser are mapped to specific functions in the rails code. E.g.:
Code
map.connect 'flags/show_tecs', :controller => 'flags', :action => 'show_tecs' |
maps the URL http://annotrack.sanger.ac.uk/human/flags/show_tecs to
the show_tecs function in the file app/controllers/flags_controller.rb.
The list of chromosomes used as well as the different priority values are set on this page.
Some options for links on the transcript pages etc. can be changed through the administration interface.
The previous actions require administrator rights in the AnnoTrack system. The list of the different user rights for all groups is shown here.
The documentation pages can be edited with a wiki-style syntax by clicking on the edit pencil on each page.
Setting up a new system & adjusting it to your needs
This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.
The system is flexible enough to be of use for other groups and projects performing genome annotation in a collaborative effort and is therefore provided here. These are notes on how to start a new annotation project with AnnoTrack.
General Redmine installation notes for troubleshooting are here, but all the source code required for AnnoTrack is available here.
Most of the AnnoTrack code is written as a plugin for the Redmine system (rails/vendor/plugins/redmine_annotrack), but since there are some other changes required, which override Redmine's default code, you will need the complete package from this site.
General notes
You will need:
a Ruby on Rails installation
(source and help on the official Rails page; documentation for running on Mac OS X, where it is usually pre-installed)
the AnnoTrack source code and database from this page
unpack:
Code
tar xzvf annotrack.version.tgz |
create your database
Code
mysql -u<user> -p<password> -h<host> -P<port> -e"create database annotrack" | |
| |
mysql -u<user> -p<password> -h<host> -P<port> -Dannotrack < annotrack/database.sql |
The main tables of the database are outlined in this diagram.
Rails server
adjust the database configuration file in annotrack/rails/config/database.yml with your settings (production and development if desired)
additional environments can be created (e.g. for multiple organisms) by adding an entry (e.g. "production_mouse") and a file in environments (e.g. environments/production_mouse.rb)
start the server e.g. on port 6223:
Code
cd annotrack/rails | |
| |
ruby scripts/server -edevelopment -p6223 #(to use the development setup) |
In a web browser your application will usually be at http://localhost:6223/. Log in as administrator ("admin"/"admin") to set up some initial values.
The admin interface from Redmine is at DEFAULT_URL/admin; modifications should in particular be made on these pages:
AnnoTrack settings: "Menu links", "Browsers links", "other settings"
vendor/plugins/redmine_annotrack/lang/en.yml holds the URL patterns used for browser links.
we have stored a gene with two transcripts with two flags for demonstration;
you can see these by clicking on "Transcripts" at the top of the page and then selecting "View all transcripts".
Perl API/scripts
We use the scripts/cron_jobs.pl file to run automatic updates of the core annotation and to update the stats given on the front page (issue and flag counts); please adjust this to your needs
Some Perl programming knowledge is required to adjust / write parsers to handle the specific data you will be using.
The following additional perl modules (many of which are part of a standard installation) are required to use the AnnoTrack perl API:
most probably you will have to adjust the source-specific scripts used for data loading and analysis stored in annotrack/perl/modules/annotrack
further hints
Further adjustments
to customize the system for your own set-up there are a number of files you can modify:
Upgrading
General notes on upgrading existing Redmine installations are here.
This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.
AnnoTrack is a Ruby-on-Rails application which is executed by an Apache2 server with the mod-rails (Passenger) plugin. It lives on virtual machines (VMs) where we don't run any other services, as Rails does not play nicely with other web services.
James Smith (webteam) knows most about this, Tim Cutts & Dave Holland (infrastructure management) can help with the VMs.
Access restrictions apply for connecting to all of the following services and for the superuser rights.
There is a test environment on the VM web-annotrack, the production servers are running on two VM clones web-annotrack1 and web-annotrack2. All can be accessed directly with SSH:
Code
ssh web-annotrack | |
| |
cd /var/www/annotrack-app |
The different species have their own AnnoTrack/Redmine code installations, as there does not seem to be a way to run them in parallel otherwise:
annotrack-app       == human
annotrack-app-mouse == mouse
annotrack-app-zfish == zebrafish
Rails/Passenger requires symbolic links from the root-level to the public folder:
human -> annotrack-app/public/
The test system is visible at http://web-annotrack.internal.sanger.ac.uk:8000
The port and other specific server settings are set in the apache2/sites-available/default file.
Re-starting Rails server:
Code
ssh web-annotrack[1,2] | |
| |
sudo touch tmp/restart.txt |
Re-starting entire web server:
Code
ssh web-annotrack[1,2] | |
| |
sudo apache2ctl -k graceful |
Service monitoring
The VMs are monitored with vSphere (web access, Windows client available as well) and Nagios (web-annotrack 1 / 2).
The website is also checked by the Montastic monitoring service.
To submit DNA sequences from capillary (Sanger) sequencing to the public EMBL database, these steps can be taken:
The strategy is to create one submission at the European Nucleotide Archive (ENA) @ EBI Webin submission page and attach a FASTA file with all sequences.
remove low quality sequences. In my case the filter criteria were:
screen for vector contamination:
Use BioPerl for large sets: get the EMVEC file in EMBL format and convert it to a FASTA format file with BioPerl
Code
use Bio::SeqIO;

my $inseq = Bio::SeqIO->new(
    -file   => "<file.dat",
    -format => "embl" );

my $outseq = Bio::SeqIO->new(
    -file   => ">file.fa",
    -format => "fasta" );

while (my $seq = $inseq->next_seq) {
    $outseq->write_seq($seq);
}
index with formatdb
To extract sequences from a BLAST database you need an index file (for protein-dbs these files end with the extension: ".pin", for DNA dbs: ".nin"), a sequence file (".psq", ".nsq") and a header file (".phr" and ".nhr"). formatdb turns FASTA files into BLAST databases.
Code
formatdb -i emvec.fa -p F -o F
run BioPerl Blast with the sequences to be submitted against the EMVEC db:
Code
use Bio::Tools::Run::StandAloneBlast;

my @blast_params = (program => 'blastn', database => 'emvec.dat.fa');
my $factory      = Bio::Tools::Run::StandAloneBlast->new(@blast_params);

# $seq is a Bio::Seq object holding the sequence to be screened
my $blast_report = $factory->blastall($seq);
and remove sequences that have hits with very low e-values (< 0.1) or long stretches matching vector sequence.
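For the filtering step, the BLAST report can be parsed with Bio::SearchIO; this is only a minimal sketch (the report file name vec_screen.bls and the 0.1 cut-off are placeholders, adjust to your own run):
Code
use strict;
use warnings;
use Bio::SearchIO;

my $searchio = Bio::SearchIO->new(-format => 'blast',
                                  -file   => 'vec_screen.bls');

while (my $result = $searchio->next_result) {
    while (my $hit = $result->next_hit) {
        # report queries hitting vector sequence with a good e-value
        if ($hit->significance < 0.1) {
            printf "%s\tvector hit: %s (e-value %s)\n",
                   $result->query_name, $hit->name, $hit->significance;
            last;    # one reported hit per query is enough
        }
    }
}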
Sources:
When analysing sequences from public databases or from your own sequencer you have to be aware of potential contaminations.
A contaminated sequence is one that does not faithfully represent the genetic information from the biological source organism/organelle because it contains one or more sequence segments of foreign origin. [NCBI]
The primary approach to screening nucleic acid sequences for vector contamination is to run a sequence similarity search against a database of vector sequences. The preferred tool for conducting such a search is NCBI's VecScreen. VecScreen detects contamination by running a BLAST sequence similarity search against the UniVec vector sequence database.
An interactive web service, EMVEC Database BLAST, can also be used to scan for contamination.
Help with the interpretation of the BLAST2 EMVEC results is available as well.
See also this post about submitting to EMBL db and this post about screening NGS reads locally.
The Genome Variation Format (GVF) is a file format for describing sequence variants at nucleotide resolution relative to a reference genome. The GVF format was published in Reese et al., Genome Biol., 2010: A standard variation file format for human genome sequences.
GVF is a type of GFF3 file with additional pragmas and attributes specified.
Two examples:
Code
chr16  samtools  SNV  49291141  49291141  .  +  .  ID=ID_1;Variant_seq=A,G;Reference_seq=G;Genotype=heterozygous
chr16  samtools  SNV  49291360  49291360  .  +  .  ID=ID_2;Variant_seq=G;Reference_seq=C;Genotype=homozygous
Code
chr16  samtools  SNV  49291141  49291141  .  +  .  ID=ID_1;Variant_seq=A,G;Reference_seq=G;Genotype=heterozygous;Variant_effect=synonymous_codon 0 mRNA NM_022162;
chr16  samtools  SNV  49302125  49302125  .  +  .  ID=ID_3;Variant_seq=T,C;Reference_seq=C;Genotype=heterozygous;Variant_effect=nonsynonymous_codon 0 mRNA NM_022162;Alias=NP_071445.1:p.P45S;
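Since GVF records are GFF3 lines with defined key=value attributes, they can be split with a few lines of Perl. A minimal sketch (no pragma handling, column layout as in the examples above):
Code
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    next if $line =~ /^#/;                       # skip pragmas and comments
    my ($seqid, $source, $type, $start, $end,
        $score, $strand, $phase, $attributes) = split /\t/, $line;
    next unless defined $attributes;

    # attributes are semicolon-separated key=value pairs
    my %attr = map { split /=/, $_, 2 } split /;/, $attributes;

    print join("\t", $seqid, $start, $type,
               $attr{Reference_seq}, $attr{Variant_seq},
               $attr{Genotype} // ''), "\n";
}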
This is used e.g. by Ensembl to write out "Watson SNPs" from the variation database (ftp).
Source and full specs: Sequenceontology.org
QSEQ is a plain-text file format for sequence reads produced directly by many current next-generation sequencing machines. The content can be described as follows.
Each record is a single line with tab-separated fields in the following order:
- Machine name: unique identifier of the sequencer.
- Run number: unique number to identify the run on the sequencer.
- Lane number: positive integer (currently 1-8).
- Tile number: positive integer.
- X: x coordinate of the spot. Integer (can be negative).
- Y: y coordinate of the spot. Integer (can be negative).
- Index: positive integer; if no indexing is used the value is 1.
- Read Number: 1 for single reads; 1 or 2 for paired ends.
- Sequence (BASES)
- Quality: the calibrated quality string. (QUALITIES)
- Filter: Did the read pass filtering? 0 - No, 1 - Yes.
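Given these fields, a QSEQ record can be turned into a FASTQ entry with a short Perl script. This is only a sketch: the read-name layout and the '.'-to-'N' replacement are common conventions rather than part of the specification, and the quality encoding (Phred+33 vs. Phred+64) depends on your pipeline version:
Code
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    my ($machine, $run, $lane, $tile, $x, $y,
        $index, $read_no, $bases, $quals, $filter) = split /\t/, $line;

    # some QSEQ files report unknown bases as '.' -> use 'N' instead
    $bases =~ tr/./N/;

    print "\@$machine:$run:$lane:$tile:$x:$y#$index/$read_no\n";
    print "$bases\n+\n$quals\n";
}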
Source: SRA_File_Formats_Guide.pdf
These are notes about the data handling steps involved in creating the GTF files released by the GENCODE project and submitted to the DCC. (Valid as of February 2011)
For general information and data access please visit the project website at http://www.gencodegenes.org, this blog post or the AnnoTrack annotation tracking system.
A. Input sources
-ensembl core database with gene models, stable ids and xrefs
-vega database of same release for id-lookup
-3-way pseudogene file with gene ids:
from Yale, based on pre-dump file from same release (using the newfullmerge.pl script)
-2-way (Yale/UCSC) pseudogene file with full locations and 2 sets of ids (from Yale)
-level-1 (and level-4 if defined) transcript file containing stable-ids
-optional file with additional annotation remarks
-file from HGNC web site with columns
HGNC-ID, gene_symbol, Pubmed-IDs, Vega-ID
-RefSeq NP / NM mapping from current xref database (from Ensembl core team):
Code
mysql -uensro -hens-research -Dianl_human_xref_release_61 -e'select accession1, accession2 from pairs where accession1 like "NP%" and accession2 like "NM%"' > RefSeq_relations.txt
B. Code to use
svn/gencode/scripts/data_release/newfullmerge.pl
.../write_class_file.pl
.../gencode_addmetadata.pl
svn/gencode/modules/Gencode/Ensembl2GTF.pm
C. Procedure
Create directory where output files are written to and the following input files are placed:
3-way_consensus_pseudogenes.txt, classes.def, validated_level_1_ids.txt
The paths to these are needed in the newfullmerge.pl script...
mkdir /work/dir/gencode_7
for LSF output files:
mkdir /work/dir/gencode_7/outfiles
dump annotation data (using main chromosomes only)
Code
foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT ) | |
| |
bsub -o /work/dir/gencode_7/outfiles/gencode_$chr.out perl svn/gencode/scripts/data_release/newfullmerge.pl -basedir /work/dir/gencode_7 -chrom $chr | |
| |
end |
check jobs
Code
grep -c "^Successfully" gencode_*out |
update PAR region (We are currently writing out X and Y PAR regions separately. They are stored only once in the Ensembl db though, so the ids need to be made non-redundant with this step)
Code
perl svn/gencode/scripts/data_release/update_y_ids.pl -x gencode_X.gtf -y gencode_Y.gtf -out gencode_YY.gtf |
create joined file
add header to release file gencode.v7.annotation.gtf:
##description: evidence-based annotation of the human genome (GRCh37), version 7 (Ensembl 62)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2011-03-23
Code
foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X YY MT ) | |
| |
cat gencode_$chr.gtf >> gencode.v7.annotation.gtf | |
| |
end |
check gene and transcript numbers (compare to the previous release and to the database, ignoring haplotype regions etc.)
Code
awk '{if($3=="gene"){g++}else{if($3=="transcript"){t++}}} END{print "genes: "g"\ntranscripts: "t"\n"}' gencode.v7.annotation.gtf |
check tags (annotation remarks)(compare to previous release)
Code
foreach t ( seleno pseudo_consens CCDS mRNA_start_NF mRNA_end_NF cds_start_NF cds_end_NF non_org_supp exp_conf PAR alternative_3_UTR alternative_5_UTR readthrough NMD_exception not_organism-supported not_best-in-genome_evidence non-submitted_evidence upstream_ATG downstream_ATG upstream_uORF overlapping_uORF NAGNAG_splice_site non_canonical_conserved non_canonical_genome_sequence_error non_canonical_other non_canonical_polymorphism non_canonical_U12 non_canonical_TEC ) | |
| |
echo -n $t"\t"; awk '{if($3=="transcript"){print $0}}' gencode.v7.annotation.gtf | grep -c "$t" | |
| |
end |
split by level (level 1/2 and 3 are displayed as two sep. tracks in the UCSC browser)
Code
awk '{if($26=="3;"){print $0}}' gencode.v7.annotation.gtf | awk '{if($3!="gene"){print $0}}' > gencode.v7.annotation.level_3.gtf | |
| |
awk '{if($26!="3;"){print $0}}' gencode.v7.annotation.gtf | awk '{if($3!="gene"){print $0}}' > gencode.v7.annotation.level_1_2.gtf |
make class file (data loading at UCSC requires a mapping of all gene and transcripts id to a level and a type)
Find classes not yet defined:
Code
grep -h "^Class not defined" gencode_*.out | sort -u |
add these manually to the classes.def file. Write out new lists:
Code
perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.annotation.level_1_2.gtf -class classes.def -out gencode.v7.annotation.level_1_2.classes | |
| |
perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.annotation.level_3.gtf -class classes.def -out gencode.v7.annotation.level_3.classes |
generate meta-data
perl svn/gencode/scripts/data_release/gencode_addmetadata.pl
requires list of new PAR region IDs
generate tRNAs
Code
bsub -o trna.out perl svn/gencode/scripts/data_release/newfullmerge.pl -trna -out gencode.v7.tRNAs.gtf |
[622 lines]
Code
nice perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.tRNAs.gtf -class classes.def -out gencode.v7.tRNAs.classes -types tRNAscan |
generate polyAs
Code
nice perl svn/gencode/scripts/data_release/dump_polyAs.pl -out gencode.v7.polyAs.gtf |
[28966 lines]
Code
nice perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.polyAs.gtf -class classes.def -out gencode.v7.polyAs.classes |
re-format 2-way pseudogenes (from Yale) NEEDS UPDATING
(create header)
Code
awk 'BEGIN{c=0} {print $1"\tYale_UCSC\ttranscript\t"$2"\t"$3"\t.\t"$4"\t.\tgene_id \"Overlap"c"\"; transcript_id \"Overlap"c"\"; gene_type \"pseudogene\"; gene_status \"UNKNOWN\"; gene_name \"Overlap"c"\"; transcript_type \"pseudogene\"; transcript_status \"UNKNOWN\"; transcript_name \"Overlap"c"\"; level 3; tag \"2way_pseudo_cons\"; yale_id \""$5"\"; ucsc_id \""$6"\"; parent_id \""$7"\";"; c++}' yale_ucsc_2way_consensus >> gencode.v7.2wayconspseudos.GRCh37.gtf | |
| |
nice perl svn/gencode/scripts/data_release/write_class_file.pl -in gencode.v7.2wayconspseudos.GRCh37.gtf -class classes.def -out gencode.v7.2wayconspseudos.GRCh37.classes -types transcript |
create transcript sequence files
Code
foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT ) | |
| |
bsub -o seqs/trans_$chr.out perl svn/gencode/scripts/data_release/newfullmerge.pl -outfile seqs/trans_$chr.fa -ass GRCh37 -sequence -chrom $chr | |
| |
end |
create protein sequence files
Code
foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT ) | |
| |
bsub -o seqs/prot_$chr.out perl svn/gencode/scripts/data_release/newfullmerge.pl -outfile seqs/prot_$chr.fa -ass GRCh37 -sequence -protein -chrom $chr | |
| |
end |
update PAR regions in sequence files
Code
nice perl svn/gencode/scripts/data_release/update_y_ids.pl -fasta -x gencode_X.gtf -y seqs/trans_Y.fa -out seqs/trans_YY.fa | |
| |
nice perl svn/gencode/scripts/data_release/update_y_ids.pl -fasta -x gencode_X.gtf -y seqs/prot_Y.fa -out seqs/prot_YY.fa |
combine sequence files
Code
foreach chr ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X YY MT ) | |
| |
cat seqs/prot_$chr.fa >> gencode.v7.pc_translations.fa | |
| |
cat seqs/trans_$chr.fa >> gencode.v7.pc_transcripts.fa | |
| |
end |
files to release to the DCC
gencode.v7.annotation.level_1_2.gtf
gencode.v7.annotation.level_1_2.classes
gencode.v7.annotation.level_3.gtf
gencode.v7.annotation.level_3.classes
gencode.v7.polyAs.gtf
gencode.v7.polyAs.classes
gencode.v7.2wayconspseudos.gtf
gencode.v7.2wayconspseudos.classes
metadata/
  gencode_Exon_supporting_feature
  gencode_HGNC
  gencode_PDB
  gencode_Pubmed_id
  gencode_RefSeq
  gencode_Source
  gencode_SwissProt
  gencode_Transcript_supporting_feature
Code
tar -czvf gencode7_GRCh37.tgz gencode7 | |
| |
cp gencode7_GRCh37.tgz PUB_FTP/gencode/release_7/gencode7_GRCh37.tgz |
It can take up to 20 minutes before the files are visible on the public FTP site.
These additional files are added to the FTP sites individually for general users:
gencode.v7.annotation.gtf.gz gencode.v7.pc_transcripts.fa.gz gencode.v7.pc_translations.fa.gz gencode.v7.polyAs.gtf.gz gencode.v7.tRNAs.gtf.gz
Code
nice gzip -c gencode.v7.pc_transcripts.fa > PUB_FTP/gencode/release_7/gencode.v7.pc_transcripts.fa.gz
etc.
Other notes:
svn/gencode/scripts/store_id_conversion.pl
which will read the GTF file or a list of ids and create the SQL statements. It's better to use a release file with no versions in the Ensembl ids as the others can not be linked to the Ensembl web site directly and the "." might break some functions in AnnoTrack. Please remember this might create links to ids that are not yet "valid" until the official Ensembl release date.
Code
perl svn/gencode/scripts/store_id_conversion.pl -gtf -infile gencode.v7.annotation.gtf -out new_id_conversions.sql | |
| |
mysql -h -P -u -p -D gencode_tracking < new_id_conversions.sql |
Also the external annotations in AnnoTrack should be updated from the new ensembl database. These are stored as custom_values with this script:
Code
bsub -q long -o job.out perl svn/gencode/tracking_system/perl/scripts/update_external_info.pl | |
| |
-coredb homo_sapiens_core_61_37f | |
| |
-comparadb ensembl_compara_61 | |
| |
-ontologydb ensembl_ontology_61 |
This is looking at the live-mirror dbs by default, so either modify this or run this after the Ensembl release date.
Selenocysteine tags are now read directly from the database; to pull them out separately into a file you can do:
Code
mysql -uensro -hens-livemirror -Dhomo_sapiens_core_60_37e -e"select tsi.stable_id, ta.value from translation_attrib ta, transcript_stable_id tsi, translation tl where tl.transcript_id=tsi.transcript_id and tl.translation_id=ta.translation_id and ta.attrib_type_id=12 order by stable_id;" | awk '{print $1"\t"$2}' > selenocystein.transcripts |
A good description of the FASTQ format can be found at Illumina:
"A fastq file is an ASCII encoded text file that stores DNA or RNA sequences and their corresponding IDs and quality scores. It uses unix newlines and consists of 4 lines per sequence unless wrapping occurs due to sequence length. The first line begins with an "@" followed by an identifier (ID) which acts as a label for the read/sequence, plus index and read (pair) numbers. Read numbers are 1 for single reads and 1 or 2 for paired reads. The second line represents a DNA or RNA sequence, and should consist only of standard bases, and IUPAC ambiguity codes (ACTGNURYSWKMBDHV). This line must be wrapped with newlines if the read is longer than 80nt. The third line must be a single "+" which signifies the end of the sequence, optionally followed by the identifier again. The fourth line is a quality score string showing the quality of each base in the prior sequence, represented as the ASCII character corresponding to the quality Phred score + 33. Phred scores must be 0 and 60 (ASCII chars 33 aka "!" to 93 aka "]"). The quality score must also be wrapped to multiple lines if longer than 80 characters, but must be exactly equal in length to it's corresponding sequence."
Example:
@READNAME[#index]/read_number
BASES
+READNAME[#index]/read_number
QUALITIES
As a sanity and QC check the DCC of the 1000 genomes project applies the following rules (source):
Syntax Checks:
-Each header line begins with @
-The third line always starts with a +
-There are four lines in each entry (implied by the above two rules)
-On line3, if a name follows the + sign, the name has to match the one found in line1
-The sequence and quality lines are the same length
-For paired end files, the _1 and _2 files have the same number of reads in them.
-For SOLID colourspace fastq, each read starts with a base followed by a string of numbers
Sequence Checks:
-Read is longer than 35bp for Solexa, 25bp for Solid, and 30 bp for 454
-Read does not contain any N's in the first 25, 30 or 35bp
-Quality values are all 2 or higher in the first 25bp, 30bp or 35bp
-The reads contain more than one type of base in the first 25, 30, or 35bp
-Read does not contain more than 50% Ns in its whole length
-Read does not contain characters other than ATGCN (this rule does not apply to SOLID reads)
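A minimal Perl sketch of the syntax checks above (it assumes unwrapped four-line records and does not cover the sequence-level rules):
Code
use strict;
use warnings;

my $n = 0;
while (my $head = <>) {
    my $seq  = <>;
    my $plus = <>;
    my $qual = <>;
    die "truncated record after read $n\n"
        unless defined $seq && defined $plus && defined $qual;
    chomp($head, $seq, $plus, $qual);
    $n++;

    die "read $n: header does not start with \@\n" unless $head =~ /^\@(\S+)/;
    my $name = $1;
    die "read $n: third line does not start with +\n" unless $plus =~ /^\+/;
    die "read $n: name on + line differs from header\n"
        if $plus =~ /^\+(\S+)/ && $1 ne $name;
    die "read $n: sequence and quality differ in length\n"
        if length($seq) != length($qual);
}
print "$n reads checked OK\n";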
Taking it further:
To understand the concept of Ensembl and learn how to query the tables I find it extremely useful to have a schema diagram of the database in front of me.
This can be generated by using the schema.sql and foreign_keys.sql files from the sql directory of the Ensembl API cvs checkout. After loading this data into a program like the free MySQL Workbench the tables and connections can be arranged to your liking.
Here is a pdf version I created based on Ensembl core 59 with the MySQL Workbench file.
UPDATE:
Nice schema diagrams and a description of the different tables can now be found on the Ensembl pages!
Adding to the confusion about different notations of phases/frames, the start coordinates of genomic features are also noted differently between different genome browsers and file formats.
1. One-based
Counting bases starting with "1" at the first position.
Regions are specified by a "closed interval." Used e.g. by the Ensembl genome browser and annotation system, the GFF/GTF, SAM and wiggle file formats.
2. Zero-based
The interbase system counts spaces starting with "0" at the first position.
Regions are specified by a "half-closed-half-open interval". Used by the UCSC genome browser, Chado (the fruitfly browser), the BED, BAM and PSL file formats.
An example:
One-based    1 2 3 4 5 6
             | | | | | |
             C G A T G C
            | | | | | | |
Zero-based  0 1 2 3 4 5 6
The ATG interval would be described from 3-5 in the first, from 2-5 in the second system.
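The conversion between the two conventions is a simple +/-1 on the start coordinate; a small Perl sketch:
Code
use strict;
use warnings;

# zero-based, half-open (BED)  ->  one-based, closed (GFF/GTF/Ensembl)
sub bed_to_gff {
    my ($start0, $end0) = @_;
    return ($start0 + 1, $end0);          # e.g. ATG: (2, 5) -> (3, 5)
}

# one-based, closed  ->  zero-based, half-open
sub gff_to_bed {
    my ($start1, $end1) = @_;
    return ($start1 - 1, $end1);          # e.g. ATG: (3, 5) -> (2, 5)
}

my ($s, $e) = bed_to_gff(2, 5);
print "one-based: $s-$e\n";               # prints "one-based: 3-5"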
What is a Gene Model?
I found the following text on the teaching pages of Prof. Ann Loraine and found it worth repeating (slightly modified) here:
Gene models are hypotheses about the structure of transcripts produced by a gene. Like all models, they may be correct, partly correct, or entirely wrong. Typically, with evidence-based gene-prediction programs, we use information from ESTs (expressed sequence tags), cDNAs or RNA-Seq reads to evaluate or create gene models. Alternatively, models can be derived from the genomic sequence alone, looking for well-known characteristics (open-reading frames, splice-sites, stops, etc.) of the sequence of genes. This approach is called ab-initio gene prediction.
It's important to remember at all times that a gene model is only that: a model.
To understand what a gene model represents, you need to refresh your memory about how transcription, RNA splicing, and polyadenylation operate.
Most protein-coding genes in eukaryotic organisms (like humans, the research plant Arabidopsis thaliana, fruit flies, etc.) are transcribed into RNA by an enzyme complex called RNA polymerase II, which binds to the five prime end of a gene in its so-called promoter region. The promoter region typically contains binding sites for transcription factors that help the RNA polymerase complex recognize the position in the genomic DNA where it should begin transcription. Many genes have multiple places in the genomic DNA where transcription can begin, and so transcripts arising from the same gene may have different five-prime ends. Transcripts arising from the same gene that have different transcription start sites are said to come from alternative promoters.
Once the RNA polymerase complex binds to the five prime end of a gene, it can begin building an RNA copy of the DNA sense strand via the process known as transcription. The ultimate product of transcription is thus called a transcript. During and after transcription, another large complex of proteins and non-coding RNAs called the spliceosome attaches to the growing RNA molecule, cuts out segments of RNA called introns, and joins together (splices) the flanking sequences, which are called exons. Not every newly synthesized transcript is processed in this way; sometimes no introns are removed at all. Genes whose products do not undergo splicing are often called single-exon genes.
Also, splicing may remove different segments from transcripts arising from the same gene. This variability in splicing patterns is called alternative splicing. In addition to splicing, RNA transcripts undergo another processing reaction called polyadenylation.
In polyadenylation, a segment of sequence at the 3-prime end of the RNA transcript is cut off, and a polymer consisting of adenosine residues called a polyA tail is attached to the 3-prime end of the transcript. The length of the polyA tail may vary a lot from transcript to transcript, and the position where it is added may also differ. Genes whose transcripts can receive a polyA tail at more than one location are said to be subject to alternative polyadenylation or alternative 3-prime end processing. One of the functions of this polyA tail is thought to be increased stability of the transcript.
These processing reactions are believed to take place in the nucleus. Ultimately, most of the mature or maturing RNA transcripts are exported from the nucleus into the cytoplasm, where they will be translated by ribosomes into proteins, chains of amino acids that perform work in the cell (such as enzymes) or that provide form and structure (like actin in the cytoskeleton).
The continuous sequence of bases in an RNA that encode a protein is called a coding region, and the coding typically starts with an AUG codon and terminates with one of three possible stop codons. The segments of sequence that comprise a coding region are called CDSs and they generally occupy the same sequences as the exons, apart from the regions five and three prime of the start and stop codons, respectively.
Most RNAs code for one protein sequence, but there are some interesting exceptions in which one mature mRNA may contain more than one translated open reading frame. The three bases where the ribosome initiates translation are called the start codon, and the triplet of bases immediately following the last translated codon is called the stop codon. The start codon typically encodes the amino acid methionine, and the stop codon doesn't code for any amino acid.
A gene model thus consists of a collection of introns and exons and their locations in the genomic sequence, as well as the location of the translated region or regions. Thus, a gene model implies a theory about where the RNA polymerase started transcription, as well as the location of the polyadenylation site and the starts and stops of translation. Usually, we draw gene models as showing the location of introns and exons relative to the genomic sequence, as if we are mapping the RNA copy back onto the genomic DNA itself.
This text nicely describes the classical central dogma of (molecular) biology (DNA -> RNA -> protein), what gene models are and some thoughts about gene prediction at the same time...
What are genomic interspersed repeats? [from the RepeatMasker docu] In the mid 1960's scientists discovered that many genomes contain stretches of highly repetitive DNA sequences ( see Reassociation Kinetics Experiments, and C-Value Paradox ). These sequences were later characterized and placed into five categories:
Currently up to 50% of the human genome is repetitive in nature and as improvements are made in detection methods this number is expected to increase.
Software for repeat identification
The default parameters for RepeatMasker as part of the Ensembl gene-prediction pipeline e.g. mouse are:
-nolow -species mouse -s
Further reading: Table in Nature with different programs. See also Tarailo-Graovac and Chen "Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences" in Current Protocols in Bioinformatics, March 2009. See also this RepeatMasker readme at animalgenome.org.
There are different ways to encode the quality scores in FASTQ files from Next-generation sequencing machines. It is important to find out before using the data and to convert between formats if necessary.
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                            104                   126

S - Sanger         Phred+33,  raw reads typically (0, 40)
X - Solexa         Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+  Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+  Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+  Phred+33,  raw reads typically (0, 41)
Source: wikipedia
For a simple look-up from ASCII to numeric scores you can use the following list:
ASCII  numeric    ASCII  numeric
!      0          @      31
"      1          A      32
#      2          B      33
$      3          C      34
%      4          D      35
&      5          E      36
'      6          F      37
(      7          G      38
)      8          H      39
*      9          I      40
+      10         J      41
,      11         K      42
-      12         L      43
.      13         M      44
/      14         N      45
0      15         O      46
1      16         P      47
2      17         Q      48
3      18         R      49
4      19         S      50
5      20         T      51
6      21         U      52
7      22         V      53
8      23         W      54
9      24         X      55
:      25         Y      56
;      26         Z      57
<      27         [      58
=      28         \      59
>      29         ]      60
?      30         ^      61
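Using these offsets, a quality string can be decoded with a few lines of Perl (a sketch; the offset has to match the encoding of your data, and Solexa scores would additionally need the log-odds conversion):
Code
use strict;
use warnings;

# decode a quality string into numeric Phred scores
# $offset is 33 for Sanger/Illumina 1.8+, 64 for Illumina 1.3+/1.5+
sub decode_quals {
    my ($qual_string, $offset) = @_;
    return map { ord($_) - $offset } split //, $qual_string;
}

my @scores = decode_quals('IIIIHGF@@', 33);
print join(" ", @scores), "\n";    # 40 40 40 40 39 38 37 31 31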
You can convert the Solexa read quality to Sanger read quality with Maq:
maq sol2sanger s_1_sequence.txt s_1_sequence.fastq
where s_1_sequence.txt is the Solexa read sequence file. Missing this step will lead to unreliable SNP calling when aligning reads with Maq.
Source: maq-manual
Phred itself is a base calling program for DNA sequence traces developed during the initial automation phase of the sequencing of the human genome.
After calling bases, Phred examines the peaks around each base call to assign a quality score to each base call. Quality scores range from 4 to about 60, with higher values corresponding to higher quality. The quality scores are logarithmically linked to error probabilities, as shown in the following table:
Phred quality score   Probability of wrong base call   Accuracy of base call
10                    1 in 10                          90%
20                    1 in 100                         99%
30                    1 in 1,000                       99.9%
40                    1 in 10,000                      99.99%
50                    1 in 100,000                     99.999%
"High quality bases" are usually scores of 20 and above ("Phred20 score").
You can read the original publications about the Phred program and scoring by Brent Ewing et al. from Phil Green's lab here and here.
Source: www.phrap.com
These are some of the cell lines that are used in the various analyses of the ENCODE project. The first two are so-called tier-1 lines and are covered by all the different types of experiments within ENCODE; the others are tier-2 lines; additionally there are a number of tier-3 cell lines.
This document is part of the administrator documentation for the AnnoTrack software for genome annotation tracking.
Regular updates
The following Perl scripts update the data and re-set priorities and flags. They usually update Havana annotation data, but all other sources can be checked as well by activating the entry in the config file. They run as cron-job every night, but can also be run manually if needed. The cron-job is executed from svn/gencode/tracking_system/perl/scripts/cron_jobs.pl.
The general procedure (which can also be used to push new data into the system) is:
Common parameters are:
env defines the target database as
Running data update scripts:
Code
perl svn/gencode/tracking_system/perl/scripts/update.pl -env proc -core -write |
Run as farm job, sources to be updated defined in config.pm. Set active flags and priorities based on flags:
Code
perl svn/gencode/tracking_system/perl/scripts/set_priorities.pl -env proc -core -write |
Specific updates
After every Havana/Ensembl merge a new OTT-/ENS-ID mapping should be generated and loaded into the AnnoTrack tracking system. This can be done with the script svn/gencode/scripts/store_id_conversion.pl which will read the GTF file or a list of ids and create the SQL statements.
Code
perl svn/gencode/scripts/store_id_conversion.pl -gtf -infile current_freeze.gtf -out new_id_conversions.sql | |
| |
mysql -h -P -u -p -D gencode_tracking < new_id_conversions.sql |
Adding new data
Importing Ensembl objects
If an important gene model is missing from Havana but was annotated by Ensembl an import into AnnoTrack can be accomplished easily with the script svn/gencode/tracking_system/perl/scripts/import_from_ensembl.pl with the following options:
Code
perl import_from_ensembl.pl -user Felix
     -category Ensembl
     -gene ENSG00012048
     (or)
     -transcript ENST00309486
     -flag manual_selection
     -note "important gene"
Setting a flag (with the chosen flag-name) and adding a note (that will be displayed next to the flag) are optional.
Importing via DAS
A number of GENCODE sources were imported from external DAS servers. For updates or new sources these source adaptors should be checked at svn/gencode/tracking_system/perl/modules/gencode_tracking_system/sources/
Importing from a file
There are source adaptors for reading tab-delimited files (tab_file.pm) and GTF files (which can also be used for GFF3). You might have a look at the source code of the parser in case it needs slight modifications to read your file format.
Importing via other sources
If there are new types of data sources not fitting the above categories, a new source adaptor has to be created. The best way to do this is to copy and modify an existing one from svn/gencode/tracking_system/perl/modules/gencode_tracking_system/sources/.
Creating new entries through the web interface is possible but not recommended. A gene can be added on this admin page (Trackers: only Features is required, Modules: only Issue tracking is required), transcripts can then be added using the URL format "annotrack.sanger.ac.uk/human/projects/NEW-GENE-ID/issues/new".
For all imports with the update.pl script an entry describing the new data source needs to be created in the svn/gencode/tracking_system/perl/modules/gencode_tracking_system/config.pm config file. A hash "%OTHER_SERVERS" contains an entry for every source name with the parameters required:
Working with flags
Flags are the most important features of the system, they define what problems we are focusing on.
New flags can be set:
If the same type of flag was already set and not resolved yet, the scripts should NOT set another flag.
To resolve flags
image 1: resolving flags through the web interface
New types of flags can be created here. This creates an entry in the flags table (with issue_id=-1) and in the tmp_values table where stats are stored. Also check the list of all flag types and their priorities.
The description of flag types can be updated here.
In general it's a good idea to run new updates / imports against the development environment / test database first (by setting $ENV = "dev" in the config file or using the dev value for the env parameter of the scripts). Changes can then be checked in the database or on a test server first (at the Sanger at http://web-annotrack.internal.sanger.ac.uk:8000/human/).
To change the format of a cell based on the content of that or another cell, conditional formatting can be used; alternatively, a small VBA macro like the following colours cells according to their content:
Code
Sub Color_groups()

    Set MyPlage = Range("A2:A1000")

    For Each Cell In MyPlage

        If InStr(1, Cell.Value, "Vic_") Then
            Cell.Interior.ColorIndex = 3
        ElseIf InStr(1, Cell.Value, "Tyl_") Then
            Cell.Interior.ColorIndex = 4
        ElseIf InStr(1, Cell.Value, "Wol_") Then
            Cell.Interior.ColorIndex = 6
        ElseIf InStr(1, Cell.Value, "Sim_") Then
            Cell.Interior.ColorIndex = 7
        ElseIf InStr(1, Cell.Value, "Sea_") Then
            Cell.Interior.ColorIndex = 8
        ElseIf InStr(1, Cell.Value, "Mar_") Then
            Cell.Interior.ColorIndex = 15
        ElseIf InStr(1, Cell.Value, "Lio_") Then
            Cell.Interior.ColorIndex = 17
        End If

    Next

End Sub
How to avoid falling in the cache...
Caching is a powerful way to speed up queries to the Ensembl database. It can become problematic, however, if for example you repeat a query multiple times but have updated the data set in between. It is important to know how to turn caching off if needed - this is not officially documented though.
To turn the caching off on the mysql server
Code
my $sa  = $reg->get_adaptor($species, "core", "slice");
my $sth = $sa->dbc->db_handle->prepare("SET SESSION query_cache_type = OFF");
$sth->execute || die "set session failed\n";
Reset caches in Perl API
Code
sub free_caches {
    my $species = shift;
    my $group   = shift;

    foreach my $adap (@{$registry->get_all_adaptors(-species => $species, -group => $group)}) {

        $adap->{'_slice_feature_cache'} = undef;

        if (defined($adap->{'cache'})) {
            $adap->{'cache'} = undef;
        }

        if (defined($adap->{'seq_region_cache'})) {
            my $seq_region_cache = $adap->{'seq_region_cache'} =
                Bio::EnsEMBL::Utils::SeqRegionCache->new();

            $adap->{'sr_name_cache'} = $seq_region_cache->{'name_cache'};
            $adap->{'sr_id_cache'}   = $seq_region_cache->{'id_cache'};
        }
    }
}
Source: Ian Longden, EBI
Installing and Running Proserver to serve data via DAS
The Distributed Annotation System (DAS) is an elegant way of sharing data and using data from diverse sources. More information at http://www.biodas.org and on these blog pages. The Proserver is a lightweight software system to provide your data as a DAS source.
Download from http://proserver.svn.sf.net/
or
Code
svn co https://proserver.svn.sf.net/svnroot/proserver/trunk Bio-Das-ProServer |
Build:
Code
cd Bio-Das-ProServer
perl Build.PL
./Build
./Build test
(optional:) ./Build install
Run:
Code
eg/proserver -x -c eg/local.ini
Adjust the ini file with the source you want to serve, e.g.:
Code
[otter_das]
state       = on
adaptor     = otter_das
title       = Havana manual annotations
description = A DAS source that provides access to the Havana annotation.
coordinates = NCBI_36,Chromosome,Homo sapiens => 21:25673390,25733000
dsncreated  = 2008-03-11
maintainer  = felix@work.ac.uk
doc_href    = http://www.dasregistry.org/showProjectDetails.jsp?project_id=80
host        = otterlive
user        = username
port        = 3306
dbname      = loutre
driver      = mysql
Dependencies to re-install:
Compression libs Bundle-Compress-Zlib, Compress::Zlib, and such (http://search.cpan.org/dist/Compress-Raw-Zlib/lib/Compress/Raw/Zlib.pm); their versions must match each other to avoid errors like "does not match bootstrap parameter".
Links:
To compare gene predictions to a reference gene set (and similar tasks), the commonly used measures for calculating the prediction rate are specificity (precision) and sensitivity (recall) (Burset and Guigo, Genomics 34, 353-367, 1996).
Sensitivity (recall) = TP / (TP + FN)
Specificity = TN / (TN + FP)
(Note that Burset and Guigo define specificity at the nucleotide level as TP / (TP + FP), i.e. precision, because true negatives are difficult to count along a genome.)
with
TP = true positives (correctly identified)
FP = false positives (overpredictions)
TN = true negatives (correctly not called)
FN = false negatives (missed)
You can calculate a combined score like
Score = Specificity x Sensitivity / 2
To assess base-coverage:
Correlation Coefficient:
CC = ( (TP x TN) - (FN x FP) ) / SQRT( (TP + FN) x (TN + FP) x (TP + FP) x (TN + FN) )
See also this text by Roderic Guigo.
Alternatively you can use the combined F1 score:
F1 = 2 x (Specificity x Sensitivity) / (Specificity + Sensitivity)
Defined by van Rijsbergen in 1979, Source
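A small Perl sketch putting the measures above together (the counts are made-up example numbers, and specificity is calculated here as precision, following Burset and Guigo):
Code
use strict;
use warnings;

my ($TP, $FP, $TN, $FN) = (90, 10, 880, 20);   # example counts only

my $sens = $TP / ($TP + $FN);                  # sensitivity (recall)
my $spec = $TP / ($TP + $FP);                  # specificity as precision
my $f1   = 2 * $spec * $sens / ($spec + $sens);

my $cc = (($TP * $TN) - ($FN * $FP)) /
         sqrt(($TP + $FN) * ($TN + $FP) * ($TP + $FP) * ($TN + $FN));

printf "Sn %.3f  Sp %.3f  F1 %.3f  CC %.3f\n", $sens, $spec, $f1, $cc;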
Some general notes on ways to shorten the response time of your web site.
1. Make fewer HTTP requests - Reducing 304 requests with Cache-Control Headers
2. Use a CDN
3. Use a customized php.ini - Creating and using a custom PHP.ini
4. Add an Expires header
- Caching with mod_expires on Apache
- Using a .htaccess file with
Code
<FilesMatch "\.(ico|pdf|flv|jpg|jpeg|png|gif|js|css|swf)$">
Header set Expires "Thu, 15 Apr 2010 20:00:00 GMT"
</FilesMatch>
5. Gzip components
- http://askapache.info/2.0/mod/mod_deflate.html
- or with a .htaccess file:
Code
<IfModule mod_gzip.c>
mod_gzip_on Yes
mod_gzip_dechunk Yes
mod_gzip_item_include file \.(html?|txt|css|js|php|pl|jpg|png|gif|xml)$
mod_gzip_item_include handler ^cgi-script$
mod_gzip_item_include mime ^text/.*
mod_gzip_item_include mime ^application/x-javascript.*
mod_gzip_item_exclude mime ^image/.*
mod_gzip_item_exclude rspheader ^Content-Encoding:.*gzip.*
</IfModule>
6. Put CSS at the top in head
7. Move Javascript to the bottom
8. Avoid CSS expressions, keep it simple
9. Make CSS and unobtrusive Javascript as external files not inline
10. Reduce DNS lookups - Use a static IP address; use a subdomain with a static IP address for static content.
11. Minimize Javascript - Refactor the code, compress with dojo
12. Avoid external redirects - Use internal redirection with mod_rewrite, The correct way to redirect with 301
13. Turn off ETags - Prevent Caching with htaccess
14. Make AJAX cacheable and small
Source: Firebug-extension & http://www.askapache.com/web-cache/top-methods-for-faster-speedier-web-sites.html
awk is an extremely useful unix tool for quick command-line tasks, in particular in combination with other commands like grep or sort.
"AWK is a data-driven programming language designed for processing text-based data, either in files or data streams. It is an example of a programming language that extensively uses the string datatype, associative arrays (that is, arrays indexed by key strings), and regular expressions." [wikipedia]
Built-in Variables
ARGC
The number of command line arguments (not including options or the awk program itself).
ARGV
The array of command line arguments. The array is indexed from 0 to ARGC - 1. Dynamically changing the contents of ARGV can control the files used for data.
CONVFMT
The conversion format to use when converting numbers to strings.
ENVIRON
An array containing the values of the environment variables. The array is indexed by variable name, each element being the value of that variable. Thus, the environment variable HOME would be in ENVIRON["HOME"]. Its value might be `/u/close'. Changing this array does not affect the environment seen by programs which awk spawns via redirection or the system function. Some operating systems do not have environment variables. The array ENVIRON is empty when running on these systems.
FILENAME
The name of the current input file. If no files are specified on the command line, the value of FILENAME is `-'.
FNR
The input record number in the current input file.
FS
The input field separator, a blank by default.
using multiple alternative field separators:
FS="\t|=|;" (nawk)
NF
The number of fields in the current input record.
NR
The total number of input records seen so far.
OFMT
The output format for numbers for the print statement, "%.6g" by default.
OFS
The output field separator, a blank by default.
ORS
The output record separator, by default a newline.
RS
The input record separator, by default a newline. RS is exceptional in that only the first character of its string value is used for separating records. If RS is set to the null string, then records are separated by blank lines. When RS is set to the null string, then the newline character always acts as a field separator, in addition to whatever value FS may have.
RSTART
The index of the first character matched by match; 0 if no match.
RLENGTH
The length of the string matched by match; -1 if no match.
SUBSEP
The string used to separate multiple subscripts in array elements, by default "\034".
String functions
source: http://people.cs.uu.nl/piet/docs/nawk/nawk_toc.html
GTF stands for Gene transfer format. It borrows from GFF, but has additional structure that warrants a separate definition and format name. The current version is 2.2.
Structure is as GFF, so the fields are:
Code
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
Attributes consist of key - value pairs, separated by one space.
Multiple attributes are separated by "; ".
The attributes list must start with gene_id and transcript_id.
Example attributes:
seq1 BLASTX similarity 101 235 87.1 + 0 gene_id "gene-0"; transcript_id "transcript-0-1"; gene_name "Frst1"; expression 1;
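The attribute field can be split into a hash with a short Perl loop; a minimal sketch that only handles the key "value"; style shown above:
Code
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    next if $line =~ /^#/;
    my @cols = split /\t/, $line;
    my $attributes = $cols[8];
    next unless defined $attributes;

    # attributes: key "value"; pairs separated by "; "
    my %attr;
    while ($attributes =~ /(\S+)\s+"?([^";]+)"?;/g) {
        $attr{$1} = $2;
    }
    print "$attr{gene_id}\t$attr{transcript_id}\n";
}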
More details:
http://mblab.wustl.edu/GTF22.html
http://www.bioperl.org/wiki/GTF
SRF (Sequence Read Format) is a generic and flexible container format for sequencing and next-generation sequencing files.
Format working group: http://srf.sourceforge.net
It's the preferred format for the submission of sequencing results to archives like the European Nucleotide Archive.
How to use it:
SOLiD software to map SOLiD to SRF files.
SOLiD software to map MA (mapping) to GFF files.
An API: http://sourceforge.net/projects/srf
Also implemented within Staden package:
Fetch out basic read counts:
Code
/software/solexa/bin/srf_info -l 1 file.srf |
To convert them to fastq:
Code
/software/solexa/bin/srf2fastq -c file.srf |
(run without parameters/file for more options.)
Filter out reads flagged as "bad":
srf_filter -b infile.srf outfile.srf
Related blog post on Politigenomics
More info from SOLiD
Using rsync
"rsync is a software application for Unix systems which synchronizes files and directories from one location to another while minimizing data transfer using delta encoding when appropriate." [Wikipedia]
Fact is that it is very fast (faster than cp or scp) for file transfers and ideal for home-brew back-up solutions. There is a lot of documentation on the internet, here are some pointers that were useful for me.
Basic command to sync DIR to a different location:
rsync -r DIR ~/backup/
Basic command to sync DIR from hostA to hostB:
ssh hostA
rsync -r DIR user@hostB:~/backup/
Basic command to list files at a remote server:
rsync rsync://pub@your-ip-or-hostname/
If you require a password, the easiest way is to put it in a file and chmod 700 it.
--password-file=your_file
To fetch recursively use the recursive (-r) or the archive (-a) option:
rsync -aPv source/someDirectory .
To specify a timeout you can set these options (in seconds):
--contimeout=1000 --timeout=1000
You can conveniently specify selected files and directories you want to transfer in an include file, ignoring the rest. Example: transfer basic data from a sequencing run to a remote host:
Code
> cat include_file.txt
Data
Data/Intensities
Data/Intensities/BaseCalls
Data/Intensities/BaseCalls/***
InterOp/***
RunInfo.xml
RunParameters.xml
*.csv

> rsync -arv --include-from='~/include_file.txt' --exclude='*' RunFolder_A34MJNACXX/ ~/temp/
Use --dry-run to test the results before starting a long transfer.
Use the -vvvv option to see the full step-wise processing of all files and directories.
Note: The Sanger firewall seems to block rsync, you can use it on the (guest) wireless network though.
User commands for the mailing list software Majordomo
Send these commands in the email body to majordomo@your-server.com, example:
mailto: Majordomo@ebi.ac.uk who mac-list
help - Majordomo replies with a list of acceptable Majordomo commands.
subscribe listname - Majordomo subscribes the sender to the named list. (Example)
subscribe listname address - Majordomo subscribes the address given to the named list.
unsubscribe listname - Majordomo unsubscribes the sender from the named list if the sender sent the mail from exactly the address he was subscribed to. (Example)
unsubscribe listname address - Majordomo unsubscribes the address from the named list.
which - Majordomo sends back a catalogue of the mailing lists the sender is subscribed to at the address he sent the mail from.
which address - Majordomo sends back a list of the mailing lists the address given is subscribed to.
lists - Majordomo sends back a catalogue of the mailing lists which Majordomo handles, with a half-line description of each list. (Example)
info listname - Majordomo sends back summary information about the list. (Example)
who listname - Majordomo replies with a roster of the e-mail addresses that are subscribed to the named list.
index listname - Messages sent to all mailing lists are archived monthly unless the list owner explicitly requests no archiving. Majordomo replies with the filenames of the archived files of the named list.
get listname filename - Majordomo sends the archived messages for the filename requested. Files are archived under the filename listname.yymm. Example: www-l.9503 contains all messages sent to the list www-l during March 1995.
end - Majordomo ignores anything in an e-mail message to Majordomo which comes after the command "end". This can be useful if you have a signature or other text at the end of your message, or if you want to include more than one Majordomo command.
To use the Lucene search engine for querying the AnnoTrack annotation tracking system of Gencode, an XML dump must be prepared. This should be done daily to allow a regular re-indexing of the search.
We are indexing on the gene and transcript level separately. The XML can be written out with this script:
~fsk/3_scripts/gencode/lucene_dump.pl
It writes the following format:
XML
<entry id="otthumg00000159378">
  <name>OTTHUMG00000159378</name>
  <description>Description: putative novel protein
  Genename: AP000221.2</description>
  <cross_references>
    <ref dbname="vega" dbkey="OTTHUMG00000159378" />
    <ref dbname="gentrack_transcript" dbkey="548587" />
  </cross_references>
  <additional_fields>
    <field name="transcript_count">1</field>
    <field name="location">21:25747378,25760913:-</field>
    <field name="chromosome">21</field>
    <field name="category">HAVANA</field>
  </additional_fields>
</entry>
</entries>
With cross-references to Gene/Transcript entries in AnnoTrack and Vega.
Characters to escape in XML:
XML
" " | |
| |
< < | |
| |
> > | |
| |
& & |
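A minimal Perl helper to apply these escapes before writing the dump (ampersands must be replaced first; the apostrophe is included for completeness):
Code
use strict;
use warnings;

sub xml_escape {
    my $text = shift;
    $text =~ s/&/&amp;/g;     # must come first
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&apos;/g;
    return $text;
}

print xml_escape('Description: putative novel protein & "friends"'), "\n";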
The search can be initiated from http://www.sanger.ac.uk/search, a direct link could look like http://www.sanger.ac.uk/search?db=annotrack&t=brca1
To upload data to the Sanger FTP server:
Source: http://intweb.sanger.ac.uk/Sysman/FAQ/ftp.shtml
To make them available to everybody (and for paper submissions) Next-Generation Sequencing results should be submitted to the European Read Archive (ERA) - now called European Nucleotide Archive (ENA) which collaborates with the NCBI Short Read Archive (SRA) (If this is still being funded).
In the GENCODE project we submitted the RT-PCR-Seq data to the ENA using the ArrayExpress submission system.
Please note this system was about to change at the time of writing and might be different now...
Meta data hierarchy:
Study -> Sample -> Experiment -> Run -> Submission
In detail:
The SRA tracks the following five objects:
Study - Identifies the sequencing study or project and contains multiple experiments.
Sample - Identifies the organism, isolate, or individual being sequenced.
Experiment - Specifies the sample, sequencing protocol, sequencing platform, and data processing that will result in one or more runs.
Run - Identifies run data files, the experiment they are contained in, and any runtime parameters gathered from the sequencing instrument.
Analysis - Packages data associated with short read objects that are intended for downstream usage or that otherwise need an archival home. Examples include assemblies, alignments, spreadsheets, QC reports, and read lists.
XML schemata for the different levels
Re-sequenced transcripts (Sanger sequencing) are submitted to the EMBL db, using the Webin interface
All the meta-data in the ERA is available here
the sample.xml file contains a single attribute for each sample e.g.
XML
<SAMPLE_ATTRIBUTES>
  <SAMPLE_ATTRIBUTE>
    <TAG>sample_origin</TAG>
    <VALUE>Trypanosome brucei genetic crosses between T. brucei 927 and T. b. gambiense 386</VALUE>
  </SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
you can put any name/value pair in a SAMPLE_ATTRIBUTE block, check the attributes column at
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=table&f=sample&m=data&s=sample
to see what other people have used.
See also the Trace archive @ Sanger
An easier way to automate large parts of the process is to submit the data through ArrayExpress. This can be done through the magetab web interface.
more info: ArrayExpress help
To find out more about processes running on your machine you can:
- monitor a specific process with top -p 6363
- list all processes with ps -gauwxe | more
- look into the process's data in /proc, e.g. less /proc/5987/cmdline
METHODS THAT ARE OPERATORS
Operators such as + and * work on strings (concatenate and replicate). The % operator is a short form for sprintf, and the << operator appends to a string (like +, but it modifies the receiver in place). You can treat a character string as an array of characters too.
OTHER METHODS
To change case:
capitalize - first character to upper, rest to lower
downcase - all to lower case
swapcase - changes the case of all letters
upcase - all to upper case
To rejustify:
center - add white space padding to center string
ljust - pads string, left justified
rjust - pads string, right justified
To trim:
chop - remove last character
chomp - remove trailing line separators
squeeze - reduces successive equal characters to singles
strip - deletes leading and trailing white space
To examine:
count - return a count of matches
empty? - returns true if empty
include? - is a specified target string present in the source?
index - return the position of one string in another
length or size - return the length of a string
rindex - returns the last position of one string in another
slice - returns a partial string
To encode and alter:
crypt - password encryption
delete - delete an intersection
dump - adds extra \ characters to escape specials
hex - takes string as hex digits and returns number
next or succ - successive or next string (eg ba -> bb)
oct - take string as octal digits and returns number
replace - replace one string with another
reverse - turns the string around
slice! - DELETES a partial string and returns the part deleted
split - returns an array of partial strings exploded at separator
sum - returns a checksum of the string
to_f and to_i - return string converted to float and integer
tr - to map all occurrences of specified char(s) to other char(s)
tr_s - as tr, then squeeze out resultant duplicates
unpack - to extract from a string into an array using a template
To iterate:
each - process each character in turn
each_line - process each line in a string
each_byte - process each byte in turn
upto - iterate through successive strings (see "next" above)
source: http://www.wellho.net/solutions/ruby-string-functions-in-ruby.html
Here's a quick list with the HTML codes to safely display common characters on web pages.
More details are e.g. here.
See also the list of ASCII codes only.
HTML Code |
Browser View |
HTML Code |
Browser View |
HTML Code |
Browser View |
HTML Code |
Browser View |
HTML Code |
Browser View |
© | © | ! | ! | _ | _ |  | | Û | Û |
® | ® | " | " | ` | ` | ž | ž | Ü | Ü |
| # | # | a | a | Ÿ | Ÿ | Ý | Ý | |
" | " | $ | $ | b | b |   | Þ | Þ | |
& | & | % | % | c | c | ¡ | ¡ | ß | ß |
< | < | & | & | d | d | ¢ | ¢ | à | à |
> | > | ' | ' | e | e | £ | £ | á | á |
À | À | ( | ( | f | f | ¤ | ¤ | â | â |
Á | Á | ) | ) | g | g | ¥ | ¥ | ã | ã |
 |  | * | * | h | h | ¦ | ¦ | ä | ä |
à | à | + | + | i | i | § | § | å | å |
Ä | Ä | , | , | j | j | ¨ | ¨ | æ | æ |
Å | Å | - | - | k | k | © | © | ç | ç |
Æ | Æ | . | . | l | l | ª | ª | è | è |
Ç | Ç | / | / | m | m | « | « | é | é |
È | È | 0 | 0 | n | n | ¬ | ¬ | ê | ê |
É | É | 1 | 1 | o | o | ­ | | ë | ë |
Ê | Ê | 2 | 2 | p | p | ® | ® | ì | ì |
Ë | Ë | 3 | 3 | q | q | ¯ | ¯ | í | í |
Ì | Ì | 4 | 4 | r | r | ° | ° | î | î |
Í | Í | 5 | 5 | s | s | ± | ± | ï | ï |
Î | Î | 6 | 6 | t | t | ² | ² | ð | ð |
Ï | Ï | 7 | 7 | u | u | ³ | ³ | ñ | ñ |
Ð | Ð | 8 | 8 | v | v | ´ | ´ | ò | ò |
Ñ | Ñ | 9 | 9 | w | w | µ | µ | ó | ó |
Õ | Õ | : | : | x | x | ¶ | ¶ | ô | ô |
Ö | Ö | ; | ; | y | y | · | · | õ | õ |
Ø | Ø | < | < | z | z | ¸ | ¸ | ö | ö |
Ù | Ù | = | = | { | { | ¹ | ¹ | ÷ | ÷ |
Ú | Ú | > | > | | | | | º | º | ø | ø |
Û | Û | ? | ? | } | } | » | » | ù | ù |
Ü | Ü | @ | @ | ~ | ~ | ¼ | ¼ | ú | ú |
Ý | Ý | A | A |  | ? | ½ | ½ | û | û |
Þ | Þ | B | B | € | € | ¾ | ¾ | ü | ü |
ß | ß | C | C |  | | ¿ | ¿ | ý | ý |
à | à | D | D | ‚ | ‚ | À | À | þ | þ |
á | á | E | E | ƒ | ƒ | Á | Á | ÿ | ÿ |
å | å | F | F | „ | „ | Â | Â | ||
æ | æ | G | G | … | … | Ã | Ã | ||
ç | ç | H | H | † | † | Ä | Ä | ||
è | è | I | I | ‡ | ‡ | Å | Å | ||
é | é | J | J | ˆ | ˆ | Æ | Æ | ||
ê | ê | K | K | ‰ | ‰ | Ç | Ç | ||
ë | ë | L | L | Š | Š | È | È | ||
ì | ì | M | M | ‹ | ‹ | É | É | ||
í | í | N | N | Œ | Œ | Ê | ? | ||
î | î | O | O |  | | Ë | Ë | ||
ï | ï | P | P | Ž | ž | Ì | Ì | ||
ð | ð | Q | Q |  | | Í | Í | ||
ñ | ñ | R | R |  | | Î | Î | ||
ò | ò | S | S | ‘ | ‘ | Ï | Ï | ||
ó | ó | T | T | ’ | ’ | Ð | Ð | ||
ô | ô | U | U | “ | “ | Ñ | Ñ | ||
õ | õ | V | V | ” | ” | Ò | Ò | ||
ö | ö | W | W | • | • | Ó | Ó | ||
ø | ø | X | X | – | – | Ô | Ô | ||
ù | ù | Y | Y | — | — | Õ | Õ | ||
ú | ú | Z | Z | ˜ | ˜ | Ö | Ö | ||
û | û | [ | [ | ™ | ™ | × | × | ||
ý | ý | \ | \ | š | š | Ø | Ø | ||
þ | þ | ] | ] | › | › | Ù | Ù | ||
ÿ | ÿ | ^ | ^ | œ | œ | Ú | Ú |
A quick table to look up the data types used in the mySQL database management system.
Type | Size | Description
CHAR[Length] | Length bytes | A fixed-length field from 0 to 255 characters long.
VARCHAR(Length) | String length + 1 bytes | A variable-length field from 0 to 255 characters long.
TINYTEXT | String length + 1 bytes | A string with a maximum length of 255 characters.
TEXT | String length + 2 bytes | A string with a maximum length of 65,535 characters.
MEDIUMTEXT | String length + 3 bytes | A string with a maximum length of 16,777,215 characters.
LONGTEXT | String length + 4 bytes | A string with a maximum length of 4,294,967,295 characters.
TINYINT[Length] | 1 byte | Range of -128 to 127 or 0 to 255 unsigned.
SMALLINT[Length] | 2 bytes | Range of -32,768 to 32,767 or 0 to 65,535 unsigned.
MEDIUMINT[Length] | 3 bytes | Range of -8,388,608 to 8,388,607 or 0 to 16,777,215 unsigned.
INT[Length] | 4 bytes | Range of -2,147,483,648 to 2,147,483,647 or 0 to 4,294,967,295 unsigned.
BIGINT[Length] | 8 bytes | Range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
FLOAT | 4 bytes | A small number with a floating decimal point.
DOUBLE[Length, Decimals] | 8 bytes | A large number with a floating decimal point.
DECIMAL[Length, Decimals] | Length + 1 or Length + 2 bytes | A DOUBLE stored as a string, allowing for a fixed decimal point.
DATE | 3 bytes | In the format of YYYY-MM-DD.
DATETIME | 8 bytes | In the format of YYYY-MM-DD HH:MM:SS.
TIMESTAMP | 4 bytes | In the format of YYYYMMDDHHMMSS; the acceptable range ends in the year 2037.
TIME | 3 bytes | In the format of HH:MM:SS.
ENUM | 1 or 2 bytes | Short for enumeration: each column can hold one value from a predefined list.
SET | 1, 2, 3, 4, or 8 bytes | Like ENUM, except that each column can hold more than one of the predefined values.
source: http://www.peachpit.com/, mySQL
The Distributed Annotation System allows to share data across servers and applications. Some other blog entries about DAS.
To use and serve the data from a specific source (your database or flat file) using the Perl Proserver you need to define a Source-Adaptor. This will translate from your specific data format to the common DAS XML format understood by all DAS servers and clients.
Here are two example SourceAdaptors for Proserver DAS sources using the 1.53e standard (and complying with the GENCODE format).
Example 1: Reading data from GFF file
Code
package Bio::Das::ProServer::SourceAdaptor::example;

use strict;
use warnings;

# ProServer module:
use base qw(Bio::Das::ProServer::SourceAdaptor);
# for the datestamp format:
use Date::Format;

# General initialization function
# Set metadata such as the commands supported by this source.
sub init {
  my ($self) = @_;
  $self->{'capabilities'} = { 'features' => '1.0' };
}

# General function for the "features" DAS command
# Gather the features annotated in a given segment of sequence.
sub build_features {
  my ($self, $args) = @_;

  my $segment = $args->{'segment'};   # The query segment ID
  my $start   = $args->{'start'};     # The query start position (optional)
  my $end     = $args->{'end'};       # The query end position (optional)

  my @features = ();
  my %group_start;
  my %group_end;
  my %group_count;

  # category, controlled vocabulary:
  # id: ECO:00000067; name: inferred from electronic annotation
  my $typecategory = "ECO:00000067";

  # read data from the GFF file
  open my $fh, '<', '/Users/fsk/great_data/annotation.gff'
    or die "Unable to open data file: $!";

  while (defined (my $line = <$fh>)) {

    chomp $line;
    my ($f_seg, $method, $type, $f_start, $f_end,
        $score, $strand, $phase, $add) = split /\t/, $line;

    # get the extra info from the last column
    my ($f_id, $stamp) = split(";", $add);

    # remove unwanted characters
    $f_id  =~ s|\"||g;
    $f_seg =~ s/^chr//;

    # create group attributes for a new set of features
    if ($type eq "mRNA") {
      $group_start{$f_id} = $f_start;
      $group_end{$f_id}   = $f_end;
      $group_count{$f_id} = 0;
      next;
    }

    # convert the datestamp from machine time into the desired format (2006-04-07T15:15:58+0100)
    my $modstamp = time2str("%Y-%m-%dT%H:%M:%S%z", $stamp);

    # index for a unique feature id
    $group_count{$f_id}++;

    # keep only the features overlapping this genomic region
    if (($f_seg eq $segment) && ($f_start <= $end && $f_end >= $start)) {

      # create an individual feature
      my $feature = {
        # unique id
        'id'           => $f_id."_".$group_count{$f_id},
        # genomic start
        'start'        => $f_start,
        # genomic end
        'end'          => $f_end,
        # strand: +/-/0
        'ori'          => $strand,
        # name of this method/annotation
        'method'       => $method,
        # type must be exon, intron, etc.
        'type'         => $type,
        # category type: ECO id
        'typecategory' => $typecategory,
        # phase: 0/1/2/-
        'phase'        => '-',
        # score, 0 if n.a.
        'score'        => 0,
        # note for various fields, key=value pairs
        'note'         => [
          'lastmod='.$modstamp,
        ],
        # group of features
        'group_id'     => $f_id,
        'grouptype'    => $method."_prediction",
        'groupnote'    => 'Note='.$group_start{$f_id}."-".$group_end{$f_id},
      };

      # store in the feature array
      push @features, $feature;
    }
  }
  close $fh or warn "Problem closing data file";

  # return the entire features array
  return @features;
}

1;
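For reference, the adaptor above expects tab-separated GFF-style lines whose ninth column holds a quoted group ID and an epoch timestamp separated by a semicolon; "mRNA" lines define the group extent and the remaining lines (exons etc.) become individual features. A made-up example (columns separated by tabs):
chr13	CAPS_analysis	mRNA	100100	104900	0	+	.	"GENE-001.1";1144420958
chr13	CAPS_analysis	exon	100100	100250	0	+	.	"GENE-001.1";1144420958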
Example 2: Connecting to a database to serve transcript features
Code
package Bio::Das::ProServer::SourceAdaptor::example2;

use strict;
use warnings;

# ProServer module:
use base qw(Bio::Das::ProServer::SourceAdaptor);
# for accessing MySQL databases
# documentation e.g. at http://www.perl.com/pub/a/1999/10/DBI.html
use DBI;

# General initialization function
# Set metadata such as the commands supported by this source.
sub init {
  my ($self) = @_;
  $self->{'capabilities'} = { 'features' => '1.0' };
}

# General function for the "features" DAS command
# Gather the features annotated in a given segment of sequence.
sub build_features {
  my ($self, $args) = @_;

  my $config = $self->config;

  # the region of interest
  my $qchrom = $args->{'segment'};   # The query segment ID
  my $qstart = $args->{'start'};     # The query start position (optional)
  my $qend   = $args->{'end'};       # The query end position (optional)

  my @features = ();

  # category, controlled vocabulary:
  # example: id: ECO:00000067; name: inferred from electronic annotation
  my $typecategory = "ECO:00000067";

  # method used to create the genes/transcripts
  my $method = "CAPS_analysis";

  # read data from the MySQL database
  # connection parameters are given in the ProServer config (ini) file
  my $dsn = "DBI:".$config->{driver}.":".$config->{dbname}.":".
            $config->{host}.":".$config->{port};
  my $db = DBI->connect($dsn, $config->{user}, $config->{dbpass})
    or die "Can't connect to database ".$config->{dbname}.": ".DBI->errstr."\n";

  # example query to get the transcripts overlapping the region of interest
  my $type = "transcript";
  my $query = "SELECT transcript_name, chromosome, start, end, ".
              "strand, phase, gene_name, lastmod ".
              "FROM transcripts ".
              "WHERE chromosome = ? AND start <= ? AND end >= ? ".
              "ORDER BY chromosome, start";
  my $handle = $db->prepare($query);
  $handle->execute($qchrom, $qend, $qstart);

  while (my ($name, $chromosome, $start, $end, $strand,
             $phase, $gene_name, $lastmod) = $handle->fetchrow_array) {

    # create an individual feature
    my $feature = {
      # unique id
      'id'           => $name,
      # genomic start
      'start'        => $start,
      # genomic end
      'end'          => $end,
      # strand: +/-/0
      'ori'          => $strand,
      # name of this method/annotation
      'method'       => $method,
      # type must be exon, intron, etc.
      'type'         => $type,
      # category type: ECO id
      'typecategory' => $typecategory,
      # phase: 0/1/2/-
      'phase'        => $phase,
      # score, - if n.a.
      'score'        => '-',
      # note for various fields, key=value pairs
      'note'         => [
        'lastmod='.$lastmod,
      ],
    };

    # store in the feature array
    push @features, $feature;
  }

  $handle->finish();
  $db->disconnect();

  # return the entire features array
  return @features;
}

1;
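The values read through $self->config in Example 2 come from the source's section of the ProServer ini file. A minimal sketch, assuming the source is registered under the name example2 (host names and credentials are placeholders; adaptor and state are the usual ProServer keys, the remaining keys simply match those used in the code above):
[example2]
adaptor = example2
state   = on
driver  = mysql
dbname  = annotation_db
host    = dbhost.example.com
port    = 3306
user    = das_reader
dbpass  = secret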
Using MySQL from Ruby
Ruby can connect to a MySQL database using the Ruby/MySQL module. A complete introduction to the subject can be found here:
http://www.kitebird.com/articles/ruby-mysql.html
The core functions of the module are explained here:
http://www.tmtm.org/en/mysql/ruby/
To establish a connection:
require "mysql" begin # connect to the MySQL server dbh = Mysql.real_connect("localhost", "testuser", "testpass", "test") # ..... # disconnect dbh.close end
To run a basic query you can simply do the following:
# issue a retrieval query, perform a fetch loop, print
# the row count, and free the result set
res = dbh.query("SELECT name, category FROM animal")
while row = res.fetch_row do
  printf "%s, %s\n", row[0], row[1]
end
puts "Number of rows returned: #{res.num_rows}"
res.free
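Rows can also be fetched as hashes keyed by column name, which is often more readable (each_hash is part of the Ruby/MySQL result API documented on the tmtm.org page above):
res = dbh.query("SELECT name, category FROM animal")
res.each_hash do |row|
  puts "#{row['name']}, #{row['category']}"
end
res.free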
This can also help to try out small queries from within a Rails application, e.g. like this:
@issue_count = ActiveRecord::Base.connection.execute(
                 "SELECT default_count FROM tmp").fetch_row[0].to_i
Find bottlenecks with NewRelic RPM
From the root of your Rails application, install the plugin:
./script/plugin install http://svn.newrelic.com/rpm/agent/newrelic_rpm
This will automatically fetch and install the plugin (its configuration ends up under rails-application/config/).
grep is another very useful Unix command-line text search utility. The name is taken from the first letters of global / regular expression / print.
It can be used to find occurrences of a specific string or pattern in a file or in all files in a large directory in a few seconds.
Some useful options for the Unix command grep:
just COUNT matching lines:
grep -c find text.txt
using NOT (show lines that do not match):
grep -v notfind text.txt
ignore the case:
grep -i UPorLOW text.txt
using OR:
egrep 'this|that' text.txt
show context:
also show 2 previous lines: grep -B2 find text.txt
also show 2 next lines: grep -A2 find text.txt
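These options can be combined, e.g. to search a whole directory tree (GNU grep):
search all files under a directory, ignoring case, and count matching lines per file: grep -irc find somedirectory/
only list the names of the files that contain a match: grep -irl find somedirectory/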
If you work at the Wellcome Trust Sanger Institute and would like to browse web pages at home the same way you do at work, here is how to set it up. You need an SSH login; open a terminal window and connect like this:
ssh -L3128:webcache.sanger.ac.uk:3128 YOUR-LOGIN-NAME@ssh.sanger.ac.uk
Then change the proxy settings in your web browser to point to:
localhost, port 3128
In Firefox this can be found at Edit / Preferences / Advanced / Network / Connection
This will create a "tunnel", forwarding the pages and other data you request through the Sanger network. You can see the intweb and journal pages, etc.
Even more conveniently, you can store the tunnel setup in your ~/.ssh/config file and then just connect with ssh username@ssh.sanger.ac.uk.
Example:
Host ssh.sanger.ac.uk
LocalForward 14301 imap.sanger.ac.uk:143
LocalForward 25001 mail.sanger.ac.uk:25
LocalForward 3128 wwwcache.sanger.ac.uk:3128
LocalForward 2222 deskproXXXX.dynamic.sanger.ac.uk:22
A good resource for more SSH productivity tips is this blog entry.
Basic Terminology from troubleshooters.com
Rails
A framework for developing web applications.
Ruby on Rails
Synonym for Rails.
Ruby
The computer language used to write Rails, and also the language you use to turn the Rails framework into an application. Ruby is a loosely typed, interpreted language with a full yet simple object model, and in my opinion is a very productive computer language.
Web application
A computer program that interfaces with the user through a web browser.
Framework
A ready made bunch of code and code generators to perform the majority of a software program. It is then up to the application developer to add the code that makes his application unique. Such code is typically added in many different spots throughout the framework.
MVC
Stands for Model, View and Controller. Many web application frameworks, including Rails, partition their code into models, views and controllers. Doing so makes it easier to change and scale the program.
Model
The part of the application that interfaces to persistent data, whether that data is stored in a DBMS (MySQL, Postgres, MS SQL Server, Oracle and the like), or as a flat file on the local disk, or some other way. The persistent data is accessed and validated by code in the model.
There is typically one model for each database table, and one for each relevant flat file.
View
The part of the application that paints screens. Ideally, code in the view paints the screen but does nothing else. Lookups and calculations are done elsewhere, and the view simply sends results of those lookups and calculations to the screen, properly formatted.
There is typically one view for each type of screen, although often one view is used for several similar but slightly different screens. For instance, screens for data insert, modification and deletion are all similar enough to be accomplished with one view using flags set by the controller.
Controller
The part of the application that does what the model and view don't. Some people claim the controller contains the "business rules". I consider that a little pompous. After all, many applications are not intended to be used just for business. Also, some business rules, such as "we don't accept anyone with a credit score under 500" are typically implemented in a model as validation routines.
Every Rails application has at least one controller. There might be more, but usually not a large number. One way of splitting the work is to create a controller for each type of person using the system. For instance, there might be one controller called DataEntryPerson, another called Accountant, and a third called Administrator.
DRY
Stands for Don't Repeat Yourself. This means have each piece of information in one place. This is a basic part of the Ruby philosophy, and of course is also the philosophy behind data normalization.
AJAX
Stands for Asynchronous JavaScript And XML. This technology enables a web page to communicate with the server and update parts of itself without refreshing the whole page, thereby saving bandwidth.
Webrick
The web server that comes with Rails. You run it with this command:
script/server
It can serve only a single application on a single port, so it's more useful for development and testing than for production. Luckily, other web servers can serve Rails pages in production.
Apache
The market leader in web servers. Apache can serve Rails pages if you're willing to put in some deployment work.
InstantRails
Ruby, Gems and Rails, with a production-quality web server, in one bundle. Unfortunately, as of 1/18/2006 it's Windows only, but a Linux/Unix/BSD version is being worked on.
Locomotive
A production quality Rails-capable web server, which unfortunately is Mac only.
fastcgi
A system whereby CGI (Common Gateway Interface) programs stay in memory rather than being spawned as individual processes when requested. This makes for much better efficiency. The lighttpd server comes with a fastcgi interface.
lighttpd
Production quality, Rails-capable, Ruby-centric web server available for Linux/Unix/BSD. Requires fastcgi. See http://wiki.rubyonrails.com/rails/pages/Lighttpd and http://www.lighttpd.net/.
RubyGems
A package manager for Ruby packages. Used to install Rails.
scaffold
An automatically generated chunk of code facilitating the creation of screens to list out a data table and to provide create, edit and delete facilities for it, based on the structure of that data table, which the scaffold generator reads and uses as a specification. You can use a few scaffolds to create a quick and dirty web app to show your client.
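In the Rails version current at the time (1.x), a scaffold was generated from the command line roughly like this (model and controller names are placeholders):
ruby script/generate scaffold Product Admin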
session
A hash-like structure within Rails apps to hold state between pages. It's a front end to cookies, where the state info is really held.
flash
This is NOT Macromedia flash, and is nothing like Macromedia flash!
In Rails, the term "flash" refers to a facility to pass temporary objects between actions. It's a module: ActionController::Flash. Whatever you place in flash will be exposed in the very next action and then deleted, so you don't need to delete it manually (which is why it's better than the session for this kind of thing). It's often used for error, warning and informational messages displayed on the screen after a form the user has just submitted.
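A minimal illustration (names are made up): set a message in one action and display it in the view rendered by the next action:
# in the controller action
flash[:notice] = "Record saved."
redirect_to :action => "list"

# in the view, e.g. in the layout
<%= flash[:notice] %>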
GERP (Genomic Evolutionary Rate Profiling)
"GERP identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. We refer to these deficits as "Rejected Substitutions". Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element." [Sidow lab]
It was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford).
For more information, see the GERP section of the track description for the ENCODE TBA Conservation track in the Human May 2004 (hg17) genome browser, and the GERP web page at the Sidow lab:
http://mendel.stanford.edu/sidowlab/downloads/gerp/index.html
and the publication:
Cooper et al.
"Distribution and intensity of constraint in mammalian genomic sequence"
Genome Research, July 2005
http://www.genome.org/cgi/content/abstract/15/7/901
Blog entry with more details.
Subversion (svn) is an open-source version control system like the Concurrent Versions System (cvs).
Its home is here: http://subversion.apache.org/
These are notes from the creation and maintenance of the Gencode code repository at the Sanger Institute.
Checking out a repository
svn co svn+ssh://cvs.internal.sanger.ac.uk/repos/svn/gencode
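After checking out, the usual working cycle is the standard Subversion one, for example:
svn add new_script.pl        # put a new file under version control
svn commit -m "add parser"   # send local changes to the repository
svn update                   # fetch changes made by others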
Adding an external subversion code directory to my repository:
1. Remove the existing .svn metadata directories recursively:
find tracking_system/ -name ".svn" -print | xargs rm -rf
2. Tell svn to ignore certain types of files:
vi ~/.subversion/config
add the line: global-ignores =
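The list of patterns is cut off here; purely as an illustration, a typical value might look like:
global-ignores = *.o *.lo *.rej *~ .*.swp .DS_Store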