Introduction to microarray
Bin Yao, byao@med.wayne.edu
Date posted: 15-Jan-2016
![Page 2: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/2.jpg)
Types of Microarray
• Affymetrix GeneChip (Oligo)
• Spotted array (cDNA /Oligo)
![Page 3: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/3.jpg)
Affymetrix GeneChip
• in-situ Synthesis: photolithography and combinatorial chemistry.
• Each probe set contains 13-21 pairs of 25-mer oligo probes.
• Each pair has a PM (perfect match) probe and an MM (mismatch) probe.
![Page 4: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/4.jpg)
![Page 5: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/5.jpg)
Spotted array
cDNA or oligo probes are printed on glass slides using an arrayer.
![Page 6: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/6.jpg)
Procedures
[Figure: Sample1 mRNA and Sample2 mRNA, Cy3 and Cy5 labels, array]
![Page 7: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/7.jpg)
[Figure: scanning, with labels Laser, Array, PMT, ADC, Image, Data]
![Page 8: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/8.jpg)
Image quantification
• Pixel value
• Image: 16-bit grayscale image. Pixel values range from 0 to 65535 ($2^{16}$ values). A signal above 65535 is saturated.
![Page 9: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/9.jpg)
Image segmentation: separate signal, background and contamination
• Output data files: Spotted array
– Signal Mean
– Background Mean
– Signal Median
– Background Median
– Signal Stdev
– Background Stdev
![Page 10: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/10.jpg)
• Output data files: Affymetrix
– .DAT: Pixel data
– .CEL: Intensity information for a given probe on an array
– .EXP: Experiment information
– .CHP: Analysis result from a Microarray Suite analysis
![Page 11: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/11.jpg)
Get gene expression value from probe level data
Consolidate 26 data points (13 PM and 13 MM) into one gene expression value. In the models below, i indexes arrays and j indexes probes.
1. MAS (4 & 5): Affymetrix algorithm. Gene expression = weighted average of (PM - MM)
2. dChip: model-based expression index, $PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$, with invariant set normalization
3. RMA: robust multi-array average, normalized $\log(PM_{ij} - BKG) = \mu_i + \alpha_j + \varepsilon_{ij}$, with quantile normalization
![Page 12: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/12.jpg)
Data analysis
• What are the problems in microarray data analysis?
– Different sources of variance
– Large number of genes (high false positives)
– Small number of replicates (low sensitivity)
![Page 13: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/13.jpg)
Data pre-processing
• Background correction: The signal of a spot contains specific binding signal, non-specific binding signal and background signal. Background estimation: local background, global background and negative control spots.
• Data filtering: Remove low signal spots and contaminated spots.
• Data transformation
– Ratio is not symmetric: a 2-fold decrease is 0.5 and a 2-fold increase is 2, on either side of 1.
– Log ratio is symmetric: log2(2-fold decrease) = -1 and log2(2-fold increase) = 1, on either side of 0.
– Multiplicative in ratio, additive in logarithm: log(A/B) = log A - log B
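The symmetry argument above can be checked numerically. The following is a minimal sketch; the function name `log2_ratio` and the intensity values are made up for illustration.

```python
import math

def log2_ratio(sample, reference):
    """Return log2(sample / reference) for two positive intensities."""
    return math.log2(sample / reference)

# A 2-fold increase and a 2-fold decrease are asymmetric as ratios (2 vs 0.5)
# but symmetric around 0 as log2 ratios.
up = log2_ratio(200.0, 100.0)
down = log2_ratio(100.0, 200.0)
print(up, down)

# Additive in logarithm: log(A/B) = log A - log B.
a, b = 800.0, 100.0
additive = math.isclose(math.log2(a / b), math.log2(a) - math.log2(b))
print(additive)
```

Working in log space is also what makes "subtracting the mean" a sensible normalization later on, since a multiplicative array effect becomes an additive offset.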
![Page 14: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/14.jpg)
[Figure: two histograms with Frequency on the y-axis: fold change distribution and log(fold) distribution]
![Page 15: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/15.jpg)
Sources of Variance
• Printing pin
• Scanning (laser and detector, PMT, focus)
• Hybridization (temperature, time, mixing, etc.)
• Probe labeling
• RNA preparation
• Biological variability
![Page 16: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/16.jpg)
Normalization
Many other effects (systematic errors) besides the treatment effect can also change gene signal values. Normalization removes systematic errors so that gene signals can be compared directly.
Numerous normalization methods are available. How to choose?
1. Understand sources of variation in your data.
2. Understand assumptions behind each method.
3. Check diagnostic plots.
![Page 17: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/17.jpg)
Normalization methods
• Dividing by mean or median: Normalized signal = (signal of a spot on an array) / (mean or median intensity of all spots on the array)
This can be done for a subset of genes, e.g. excluding genes whose intensity is in the top 10% or bottom 10%, to minimize the effect of outliers or differentially expressed genes.
• Subtracting mean: Used for log transformed data
• Z-transformation: Normalized signal = (signal of a spot – mean signal of the array) / (signal standard deviation of the array)
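The three per-array methods can be sketched in a few lines. This is an illustration only: the function names and the example intensities are assumptions, the sample standard deviation (n - 1) is one possible choice for the Z-transformation, and the optional exclusion of the top and bottom 10% is left out for brevity.

```python
import statistics

def divide_by_median(signals):
    """Scale an array so that its median intensity becomes 1."""
    med = statistics.median(signals)
    return [s / med for s in signals]

def subtract_mean(log_signals):
    """Center log-transformed signals so that their mean becomes 0."""
    m = statistics.mean(log_signals)
    return [s - m for s in log_signals]

def z_transform(signals):
    """Center to mean 0 and scale to standard deviation 1."""
    m = statistics.mean(signals)
    sd = statistics.stdev(signals)  # sample standard deviation (n - 1)
    return [(s - m) / sd for s in signals]

# Hypothetical spot intensities from one array.
array = [120.0, 300.0, 80.0, 950.0, 210.0]
scaled = divide_by_median(array)
z = z_transform(array)
print(scaled)
print(z)
```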
![Page 18: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/18.jpg)
Normalization methods
• Quantile normalization:
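Quantile normalization forces every array to share the same intensity distribution: each array is sorted, the values at each rank are averaged across arrays, and every value is replaced by the average for its rank. The sketch below assumes equal-length arrays and breaks ties by position, which is a simplification; the function name and data are illustrative.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one list of signals per array."""
    n = len(arrays[0])
    # Sort each array, then average the values that share the same rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    normalized = []
    for a in arrays:
        # order[r] is the index of the r-th smallest value in this array.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Two hypothetical arrays measuring the same three genes.
raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(raw)
print(normalized)
```

After normalization both arrays contain exactly the same set of values; only the assignment of values to genes differs.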
![Page 19: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/19.jpg)
[Figure: two plots, Before Normalization and After Normalization]
• Housekeeping gene: Normalized signal = (signal of a spot) / (signal of housekeeping gene(s))
• Intensity dependent normalization: Use local regression to correct non-linear intensity dependency.
![Page 20: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/20.jpg)
Which genes are differentially expressed?
One of the goals of a microarray experiment is to find lists of genes that are up- or down-regulated between treatments.
• Fold change:
– Simple
– Low sensitivity
– High false positives
![Page 21: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/21.jpg)
• Hypothesis test
Takes into consideration both the magnitude of the change and the uncertainty of the measurement.
T-test: two-group comparison
– Student t-test: assumes equal variance and normal distribution.
– Welch method: assumes normal distribution; the variances need not be equal.
– Wilcoxon and Mann-Whitney: non-parametric, no assumption about the distribution.
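As an illustration of how a test statistic combines the magnitude of the change with its uncertainty, here is a minimal sketch of the Welch t statistic and its approximate degrees of freedom (Welch-Satterthwaite). The function name and the data are assumptions; turning t into a p-value needs the t distribution, which is not in the Python standard library and is omitted here.

```python
import math
import statistics

def welch_t(group1, group2):
    """Return (t, df) for a two-group comparison with unequal variances."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (statistics.mean(group1) - statistics.mean(group2)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical log signals of one gene in two conditions.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df = welch_t(control, treated)
print(t, df)
```

Here the mean doubles between conditions, yet the large variance of the treated group keeps |t| below 2, which is the kind of case where fold change alone would be misleading.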
![Page 22: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/22.jpg)
• Analysis of Variance (ANOVA):
– Compares multiple groups: which genes are differentially expressed in at least one condition? A post hoc test finds the condition(s) that change gene expression.
– Two- or higher-way ANOVA
One-way ANOVA tests only one factor, the treatment effect. In a microarray experiment there is more than one factor. Some of these are factors that we are not interested in but that cannot be avoided.
An ANOVA model for two-color microarray
Y = A + D + G + A*D + G*T
where A = array effect, D = dye effect, G = gene effect, T = treatment effect, A*D = array-dye interaction, G*T = gene-treatment interaction (usually the term of interest).
![Page 23: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/23.jpg)
Multiple testing and p-value adjustment
If the probability of a false positive in a t-test for a single gene is p = 0.05, then for 5000 genes you can expect 5000 × 0.05 = 250 false positives.
To keep the probability of making even one mistake over the entire 5000 genes at 0.05 (family-wise error rate), the p-value for each gene needs to be adjusted.
• Bonferroni adjustment: simple but conservative. $p^* = \min\{p \times N, 1\}$, where p is the raw p-value and N is the total number of tests.
• Holm or step-down Bonferroni: less conservative.
• Westfall and Young's permutation: takes into consideration possible correlations between genes. Slow.
• False discovery rate: expected percentage of false positives in the gene list.
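The Bonferroni and Holm adjustments are short enough to sketch directly. The function names and the example p-values below are illustrative assumptions; the Holm version multiplies the k-th smallest p-value by N - k + 1 and then enforces that adjusted values never decrease with rank.

```python
def bonferroni(pvalues):
    """p* = min(p * N, 1) for each raw p-value."""
    n = len(pvalues)
    return [min(p * n, 1.0) for p in pvalues]

def holm(pvalues):
    """Step-down Bonferroni: less conservative than plain Bonferroni."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by n - k + 1.
        candidate = min(pvalues[idx] * (n - rank), 1.0)
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw_p)
holm_p = holm(raw_p)
print(bonf)
print(holm_p)
```

Every Holm-adjusted value is at most the corresponding Bonferroni value, which is why the method is described as less conservative.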
![Page 24: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/24.jpg)
Cluster Analysis
• First used by Tryon (1939) to organize observed data into meaningful structures
• Finds genes that have similar expression profiles
• Types of cluster analysis: hierarchical clustering and k-means clustering
![Page 25: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/25.jpg)
Hierarchical cluster
A dendrogram or tree shows the hierarchical relationship.
– Bottom up (agglomerative):
1. Start from individual genes.
2. Measure the distance between all pairs of genes/nodes.
3. Join the two genes/nodes with the shortest distance.
4. Iterate until all genes are joined.
![Page 26: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/26.jpg)
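|    | g1 | g2 | g3 | g4 |
|----|----|----|----|----|
| g1 |    | d1 | d2 | d3 |
| g2 |    |    | d4 | d5 |
| g3 |    |    |    | d6 |
| g4 |    |    |    |    |

Find minimum of {d1…d6}

|     | g12 | g3  | g4  |
|-----|-----|-----|-----|
| g12 |     | d1’ | d2’ |
| g3  |     |     | d3’ |
| g4  |     |     |     |

Find minimum of {d1’…d3’}

|      | g124 | g3   |
|------|------|------|
| g124 |      | d1’’ |
| g3   |      |      |

[Figure: dendrogram with leaves g1, g2, g4, g3; g1 and g2 join at d1, g4 joins at d2’, g3 joins at d1’’]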
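The bottom-up procedure can be sketched as a loop that repeatedly joins the closest pair of clusters. This is a minimal illustration, not an efficient implementation: it assumes Euclidean distance and single linkage (both are choices, discussed in the distance and linkage slides), and the gene profiles are made up so that the merge order matches the example above.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(c1, c2, profiles):
    """Distance between the two closest members of two clusters."""
    return min(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)

def agglomerate(profiles):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        # Measure the distance between all pairs of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Join the closest pair and keep every other cluster as it is.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical expression profiles for four genes.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 2.9],
    "g3": [8.0, 1.0, 0.5],
    "g4": [2.0, 3.0, 4.0],
}
merges = agglomerate(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

The merge distances are the heights at which branches join in the dendrogram.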
![Page 28: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/28.jpg)
• K-means clustering: find k clusters that are separated as far as possible.
– Start from k random clusters and move elements between clusters to minimize the variability within clusters and maximize the variability between clusters. Iterate until convergence or until a specified number of iterations is reached.
– Some methods have been developed to estimate the number of clusters, e.g. the silhouette plot. However, there is no completely satisfactory method for determining the number of clusters.
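One common way to carry out the iteration is Lloyd's algorithm: assign each element to the nearest centroid, recompute the centroids, and repeat. The sketch below makes that choice explicitly (the slides do not name a specific algorithm); the function name, the seeding scheme and the data are assumptions, and it needs Python 3.8+ for `math.dist`.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) after Lloyd-style iterations."""
    rng = random.Random(seed)
    # Start from k randomly chosen elements as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every element to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # converged: no element changed cluster
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two hypothetical, well-separated groups of expression profiles.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (10.0, 10.0), (10.1, 10.2), (10.2, 10.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The result depends on the random start in general, so in practice the procedure is often run several times with different seeds.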
![Page 29: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/29.jpg)
[Figure: x-axis: Time]
![Page 30: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/30.jpg)
Distance measurement
• Euclidean distance: $\text{distance}(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
[Figure: points A, B, C, D]
![Page 31: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/31.jpg)
• City-block (Manhattan) distance: $\text{distance}(x,y) = \sum_{i=1}^{n} |x_i - y_i|$
d(A,B) = a + b + c + d
The result is similar to Euclidean distance. The effect of a single outlier is smaller.
Both methods measure geometric distance.
[Figure: path from A to B made of segments a, b, c, d]
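Both geometric distances follow directly from their formulas. A minimal sketch, with illustrative function names and vectors:

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    """Sum of the absolute differences (city-block distance)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

x = [0.0, 0.0, 0.0]
y = [3.0, 4.0, 0.0]
e = euclidean_distance(x, y)
m = manhattan_distance(x, y)
print(e, m)
```

Because differences are squared in the Euclidean formula, one large difference dominates the total more than it does in the city-block sum.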
![Page 32: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/32.jpg)
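• Angle distance
Euclidean distance is sensitive to the magnitude of the vectors. Angle distance measures the angle between two vectors, so moving A and B along their lines (to A’ and B’) does not change the angle distance between them, although the Euclidean distance changes from d to d’.
$d(x,y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}$
This is the cosine of the angle between x and y.
[Figure: Angle distance. Axes x and y; points A, B and A’, B’ on two lines, with distances d and d’]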
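The invariance to magnitude can be demonstrated by scaling one vector. The sketch below computes the cosine given by the formula; the function name and vectors are illustrative, and it needs Python 3.8+ for `math.dist`. Note that the cosine is largest for vectors pointing the same way, so software that needs a distance commonly uses 1 minus the cosine, which is an assumption beyond what the slide states.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x) * sum(yi * yi for yi in y))
    return dot / norm

a = [1.0, 2.0, 3.0]
a_scaled = [10.0, 20.0, 30.0]  # same direction as a, 10 times the magnitude
b = [3.0, 1.0, 0.5]

print(cosine(a, b), cosine(a_scaled, b))        # unchanged by scaling
print(math.dist(a, b), math.dist(a_scaled, b))  # Euclidean distance changes
```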
![Page 33: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/33.jpg)
• Pearson correlation
Measures how closely two genes change in the same way.
$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
$r_{xy}$ is between –1 and 1. If $r_{xy} < 0$, the two genes change in opposite ways. Distance is defined as $1 - |r_{xy}|$.
• Spearman correlation: a non-parametric method, similar to Pearson correlation.
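A minimal sketch of the Pearson correlation and the distance $1 - |r_{xy}|$ defined above; the function names and profiles are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(
        sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y)
    )
    return num / den

def correlation_distance(x, y):
    """Distance as defined above: 1 - |r|."""
    return 1 - abs(pearson(x, y))

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # changes in the same way as gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]  # changes in the opposite way

print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
print(correlation_distance(gene_a, gene_b), correlation_distance(gene_a, gene_c))
```

With the absolute value in the definition, perfectly anti-correlated genes also get a distance of 0, so they cluster together with the genes they mirror.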
![Page 34: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/34.jpg)
Linkage
Determines the distance between clusters.
– Single linkage (nearest neighbor): The distance between two nodes is determined by the distance between the two closest objects (nearest neighbors) in the different nodes.
– Complete linkage (furthest neighbor): The distance between nodes is determined by the greatest distance between any two objects ("furthest neighbors") in the different nodes.
![Page 35: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/35.jpg)
– Average (centroid): The centroid of a node is the average point in the multidimensional space. It is the center of the node. The distance between two clusters is determined as the distance between centroids.
[Figure: 1. Single linkage, 2. Average linkage, 3. Complete linkage]
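The three linkage rules differ only in how they reduce the pairwise distances between two clusters to one number. A minimal sketch, assuming Euclidean distance between objects and illustrative function names and data (Python 3.8+ for `math.dist`):

```python
import math

def single_linkage(c1, c2):
    """Distance between the two closest objects in different clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Greatest distance between any two objects in different clusters."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def centroid(cluster):
    """Average point of a cluster in the multidimensional space."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def centroid_linkage(c1, c2):
    """Distance between the centroids of two clusters."""
    return math.dist(centroid(c1), centroid(c2))

cluster1 = [(0.0, 0.0), (1.0, 0.0)]
cluster2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(cluster1, cluster2))
print(centroid_linkage(cluster1, cluster2))
print(complete_linkage(cluster1, cluster2))
```

For the same pair of clusters the single linkage distance is the smallest and the complete linkage distance is the largest.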
![Page 36: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/36.jpg)
Self-Organizing Map
Self-Organizing Map (SOM) was introduced by Teuvo Kohonen in 1982.
In this artificial neural network, neurons that form a one- or two-dimensional elastic lattice are trained with input data. The neurons compete to approximate the density of the data. After training is over, input data vectors map to n adjacent map neurons.
![Page 37: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/37.jpg)
Neurons compete for the input pattern. The winner takes all: the winner and its neighbors move toward the input pattern.
Neighborhood: which neurons move with the winner.
Learning rate: how much the winner moves each time.
[Figure: input layer and neurons]
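A single training step can be sketched as follows. This is a simplified illustration on a one-dimensional lattice: the function name, the fixed neighborhood radius and the choice to move neighbors by the same learning rate as the winner are all assumptions (practical SOM implementations usually shrink both the neighborhood and the learning rate over time). It needs Python 3.8+ for `math.dist`.

```python
import math

def train_step(neurons, pattern, learning_rate=0.5, radius=1):
    """Move the winning neuron and its lattice neighbors toward the pattern."""
    # The winner is the neuron closest to the input pattern.
    winner = min(range(len(neurons)), key=lambda i: math.dist(neurons[i], pattern))
    for i in range(len(neurons)):
        if abs(i - winner) <= radius:  # neighborhood on a 1-D lattice
            neurons[i] = [
                w + learning_rate * (p - w) for w, p in zip(neurons[i], pattern)
            ]
    return winner

# Four hypothetical neurons on a line and one input pattern.
neurons = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
winner = train_step(neurons, [1.2, 1.2])
print(winner, neurons)
```

Repeating this step over many input patterns is what lets neighboring neurons end up representing similar patterns.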
![Page 38: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/38.jpg)
Other methods
• Principal component analysis (PCA)
– Reduces the dimensionality of the data matrix by finding new variables. Intended to narrow the number of variables down to only those that are of importance.
• Machine learning: Trained with a data set of known classification, then used to predict or classify a new data set.
[Figure: points A and B plotted on the original axes x and y and the new axes x’ and y’]
![Page 39: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/39.jpg)
![Page 40: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/40.jpg)
Biological data mining
Gene Ontology: Gene functions are classified into hierarchical structures. The top 3 are: molecular function, biological process and cellular component.
• Tools using GO: Onto-Express, EASE, eGOn, GoSurfer
Pathway: KEGG, GenMAPP
Regulatory region analysis:
• Tools for regulatory region analysis: Genomatix, TRANSFAC
Gene network:
• Tools for gene network: Pathway Assist, iHOP
![Page 41: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/41.jpg)
Microarray Standard
MIAME: Minimum Information About a Microarray Experiment. Defines data standards.
Information Required to Interpret and Replicate
•Experimental Design
•Array Design
•Biological Samples
•Hybridizations
•Measurements
•Data Normalization and Transformation
![Page 42: Introduction to microarray Bin Yao byao@med.wayne.edu.](https://reader033.fdocuments.net/reader033/viewer/2022051018/56649d695503460f94a47624/html5/thumbnails/42.jpg)
•MIAME checklist: http://www.mged.org/Workgroups/MIAME/miame_checklist.html
•Public databases:
– ArrayExpress (EBI)
– GEO (NCBI)
– CIBEX (DDBJ)
•Other microarray databases: BASE, SMD, Oncomine, YMD