The symbols to describe the different nucleotides in DNA are the following:
------------------------------------------ Symbol Meaning Nucleic Acid ------------------------------------------ A A Adenine C C Cytosine G G Guanine T T Thymine U U Uracil M A or C R A or G W A or T S C or G Y C or T K G or T V A or C or G H A or C or T D A or G or T B C or G or T X G or A or T or C N G or A or T or C
Note: these letters are also used in the "samtools tview" program to visually show NGS read alignments.
- IUPAC-IUB SYMBOLS FOR NUCLEOTIDE NOMENCLATURE: Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030.
The goal of the 1000 Genomes Project is create a "A Deep Catalog of Human Genetic Variation" by measuring and analysing most genetic variants that have frequencies of at least 1% in the populations studied.
CHB Han Chines Han Chinese in Beijing, China JPT Japanese Japanese in Tokyo, Japan CHS Southern Han Chinese Han Chinese South CDX Dai Chinese Chinese Dai in Xishuangbanna, China KHV Kinh Vietnamese Kinh in Ho Chi Minh City, Vietnam CHD Denver Chinese Chinese in Denver, Colorado (pilot 3 only) CEU CEPH Utah residents (CEPH) with Northern and Western European ancestry TSI Tuscan Toscani in Italia GBR British British in England and Scotland FIN Finnish Finnish in Finland IBS Spanish Iberian populations in Spain YRI Yoruba Yoruba in Ibadan, Nigeria LWK Luhya Luhya in Webuye, Kenya GWD Gambian Gambian in Western Division, The Gambia MSL Mende Mende in Sierra Leone ESN Esan Esan in Nigeria ASW African-American SW African Ancestry in Southwest US ACB African-Caribbean African Caribbean in Barbados MXL Mexican-American Mexican Ancestry in Los Angeles, California PUR Puerto Rican Puerto Rican in Puerto Rico CLM Colombian Colombian in Medellin, Colombia PEL Peruvian Peruvian in Lima, Peru GIH Gujarati Gujarati Indian in Houston,TX PJL Punjabi Punjabi in Lahore,Pakistan BEB Bengali Belgali in Bangladesh STU Sri Lankan Sri Lankan Tamil in the UK ITU Indian Indian Telugu in the UK
The within-array quality for (genomic) microarrays is often measured using the following metrics:
Standard Deviation Autosome / Robust (SD autosome)
Measure of the dispersion of Log2 ratio of all clones on the array, giving an overall picture of the noise in the array. It is calculated on the normalised but unsmoothed data. The SD robust is the middle 58%/66% of the data. By excluding outliers large changes such as trisomies will not cause this number to change significantly. (The SD robust is the number we use when we say “3 SDs away from the noise” in the calling algorithm.) Both measures are given after all data processing but excluding any smoothing. For BlueFuse Multi processed data the values should be 0.07-0.15 and 0.05-0.11 for the autosome and robust measure respectively.
Signal to Background Ratio (SBR)
Brightness of the mean signal (after the background has been subtracted) divided by the raw background signal (global signal).
Derivative Log2 Ratio / Fused (DLR)
measure of the probe to probe variability. In an ideal world, probes within a region will have essentially the same ratio. In a noisy array adjacent probes can have a very large ratio difference. The DLR raw is before any data processing, DLR fused is after normalization and data correction BUT is always done on unsmoothed data so it is user setting independent and a cannot be adjusted by the user thereby giving a consistent array-to-array measure of noise. BlueFuse results should be < 0.2.
% included clones
Percentage of all clones that were not excluded on a BAC array due to inconsistencies between clone replicates. For BlueFuse results this should be > 95 %.
Mean Spot Amplitude
the mean fluorescent signal intensities for the two channels; channel 1 = sample (standardly Cy3; ex 550nm, emm 570nm) and channel 2 = reference (standardly Cy5; ex 650nm, emm 670nm). This metric is variable due to the differences between available scanners. The mean spot amplitude metric can give an indication of how well the DNA has labelled with fluorescent dyes, but more importantly, really high values can indicate over scanning of the microarray image OR can indicate poor washing so there is lots of non-specific signal left. The balance between channels can be assessed but the Cy5 signal tends to give a higher intensity than Cy3, major differences in the channels may indicate a labelling or a scanner problem.
Source: BlueGnome user docs
As reported in the Ensembl 2009 NAR paper canonical transcripts are defined for all genes and for all species in the Ensembl gene sets. "The canonical transcript is defined as either the longest CDS, if the gene has translated transcripts, or the longest cDNA. Should a transcript already regarded as canonical not be selected using the above rules, there is support for storing this information in the Ensembl database."
For the human gene annotation the hierarchy to choose if there are more than one protein-coding transcripts is:
- CCDS transcripts
- Havana manual annotation transcripts of the type "protein_coding"
- Havana manual annotation transcripts that are also protein-coding
- Ensembl protein-coding transcripts
If there are multiple transcripts within the groups, take the longest CDS of the highest priority group.
For non-coding types takes the longest cDNA of
- Havana transcripts
- Ensembl transcripts
Ensembl 2009 NAR paper, Ensembl mailing list
These objects can be regarded as representative transcripts for the gene and can be fetched with the Perl API method
Some caution needs to be used when looking at the pseudo-autosomal regions: When looking at genes from the Y PAR, the method will return a transcript with X coordinates. While not really a bug, this might mess up your data if un-noticed. To check and fix this something like the following will work:
GenePix Array List (GAL) files are text files with specific information about the location, size, and name of each DNA spot on a microarray. They are therefor of vital importance for the analysis of scanned microarray images.
The format defines a specific header before the list of data columns follows:
ATF 1 9 5 Type=GenePix ArrayList V1.0 BlockCount=1 BlockType=0 "Block1=10000, 38780, 150, 20, 200, 18, 200" Supplier=BioRobotics ArrayerSoftwareName=TAS Application Suite (MicroGrid II) ArrayerSoftwareVersion=188.8.131.52 ScanResolution=10 Block Column Row ID Name 1 1 1 RP11-163J21 Clone 1 1 1 2 RP11-163J21 Clone 2
ATF -> File conforms to Axon Text File
1 -> Version number of ATF
9 -> Number of header lines before the "Block, Column, Row, ..." line
5 -> Number of data columns (Block, Column, Row, Name, ID)
Type=GenePix ArrayList V1.0 -> Type of file, same for all GAL files
Block Count=1 -> Number of blocks described in the file
Block Type=0 -> Type of block, 0 = rectangular
BlockX=A, B, C, D, E, F, G -> The position and dimensions of each block.
A -> xOrigin
B -> yOrigin
C -> Feature diameter
D -> xFeatures
E -> xSpacing
F -> yFeatures
G -> ySpacing
ScanResolution - Optional parameter to scale the position on higher-resolution images
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
The data columns are:
Further reading and sources: