Gene Expression Data Analyses (3)
Trupti Joshi
Computer Science Department, 317 Engineering Building North
E-mail: [email protected](O)
Lecture Outline -1
Statistical significance vs. biological relevance
Statistical methods
  Two sample statistical tests
    Parametric: T-test (paired and unpaired t test)
    Non-parametric: Mann-Whitney test for independent samples; Wilcoxon signed-rank test for paired data
  Multivariate statistics
    One-way vs Two-way analysis of variance (ANOVA)
    Kruskal-Wallis
  Multiple comparison corrections
    Bonferroni Correction
    False Discovery Rate
Lecture Outline -2
Data interpretation
Selection of software
  Image analysis: Imagene
  Statistical analysis: GeneSpring, SAM, ArrayStat
Why Statistical Analysis?
Rank results by confidence with significance metrics (e.g. p-value)
Estimate the rates of false positives (Type I errors) and false negatives (Type II errors)
Achieve the desired balance of sensitivity and specificity
Expect a certain amount of flexibility (and arbitrariness) when interpreting the significance metrics generated by a test
Statistical Significance vs. Biological Relevance
Normal Distribution
Central peak: mean
Symmetrical
Parametric Analysis
Test the hypothesis that one or more treatments have no effect on the mean and variance of a chosen variable
Assumes that the data (e.g. yield) follow a normal distribution
Disadvantage: the results are unreliable if the data are not normally distributed
Non-parametric Analysis
Use ranks of numerical data rather than the data themselves
Use information about the relative sizes of observations, without making any assumptions about the means and variances of the populations being tested
Can be used for data with any distribution
Disadvantage: if the data are normally distributed, it is less powerful than parametric analysis
T test
Paired t test: the two groups must be the same size, because each observation in one group is matched to one in the other
  Comparison of the same organisms before and after treatment (e.g. before and after heat shock)
Unpaired t test: the two groups do not need to be the same size
  Comparison between organisms with treatment and organisms without treatment
How to Perform T test
Paired T-test; Unpaired T-test
T-test example
Unpaired T test; Paired T test
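The paired and unpaired tests described above can be sketched in Python with SciPy. This is a minimal illustration, not the procedure shown on the original slides. The expression values are made up, and the choice of SciPy is an assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical log2 expression values for one gene (illustrative only).
before = np.array([7.1, 6.8, 7.4, 7.0, 6.9])  # five organisms before heat shock
after = np.array([7.9, 7.2, 8.1, 7.6, 7.5])   # the same five organisms afterwards

# Paired t test: equivalent to a one-sample t test on the per-organism differences.
paired = stats.ttest_rel(after, before)
diff = after - before
t_by_hand = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

# Unpaired t test: two independent groups, group sizes may differ.
treated = np.array([8.2, 7.9, 8.5, 8.1])
control = np.array([7.0, 7.3, 6.8, 7.1, 7.2, 6.9])
pooled = stats.ttest_ind(treated, control)                  # assumes equal variances
welch = stats.ttest_ind(treated, control, equal_var=False)  # does not assume equal variances

print("paired:   t = %.3f, p = %.4g" % (paired.statistic, paired.pvalue))
print("unpaired: t = %.3f, p = %.4g" % (pooled.statistic, pooled.pvalue))
print("Welch:    t = %.3f, p = %.4g" % (welch.statistic, welch.pvalue))
```

The hand-computed `t_by_hand` is included only to show that the paired test works on the differences within each pair.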
Mann-Whitney Test
Use if the sample is not normally distributed
Similar to the unpaired T test, but non-parametric
Uses the rankings of the numerical values instead of the variance
Mann-Whitney Test--example
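Since the worked example from the slide is not available, here is a small sketch of the Mann-Whitney test on made-up values, assuming SciPy. It ranks all values together, computes the U statistic of the first group by hand, and then calls the library routine.

```python
import numpy as np
from scipy import stats

# Hypothetical expression values for two independent groups (illustrative only).
treated = np.array([8.2, 7.9, 8.5, 8.1])
control = np.array([7.0, 7.3, 6.8, 7.1, 7.2, 6.9])
n1, n2 = len(treated), len(control)

# Rank all observations together; only the ranks enter the test.
ranks = stats.rankdata(np.concatenate([treated, control]))
rank_sum_treated = ranks[:n1].sum()

# U statistic of the first group: rank sum minus its smallest possible value.
u_by_hand = rank_sum_treated - n1 * (n1 + 1) / 2

result = stats.mannwhitneyu(treated, control, alternative="two-sided")
print("U (by hand) = %.1f, U (SciPy) = %.1f, p = %.4g"
      % (u_by_hand, result.statistic, result.pvalue))
```

Depending on the SciPy version, the reported statistic is either the U of the first group or of the other group; the two always sum to n1 * n2.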
Wilcoxon Signed-Rank Test
Use if the sample is not normally distributed
Similar to the paired T test, but non-parametric
Rank the absolute differences between the paired arrays
If the difference within a pair is 0, the value is not used
If two pairs have identical differences, both are given the average of the two ranks
Use a Wilcoxon table to look up the critical value
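The ranking rules above (drop zero differences, average the ranks of ties) can be sketched as follows. The intensities are made up, and SciPy is an assumed choice. `zero_method="wilcox"` is the SciPy option that discards zero differences.

```python
import numpy as np
from scipy import stats

# Hypothetical paired intensities for one gene on eight array pairs (illustrative only).
before = np.array([71, 68, 74, 70, 69, 72, 73, 67])
after = np.array([79, 72, 81, 70, 75, 70, 79, 74])

diff = after - before
nonzero = diff[diff != 0]                # pairs with a zero difference are not used
ranks = stats.rankdata(np.abs(nonzero))  # tied absolute differences get the average rank

w_plus = ranks[nonzero > 0].sum()        # rank sum of positive differences
w_minus = ranks[nonzero < 0].sum()       # rank sum of negative differences
w_by_hand = min(w_plus, w_minus)

result = stats.wilcoxon(after, before, zero_method="wilcox")
print("W (by hand) = %.1f, W (SciPy) = %.1f, p = %.4g"
      % (w_by_hand, result.statistic, result.pvalue))
```

With so few pairs SciPy may warn about the sample size. For small samples a Wilcoxon table, as the slide suggests, remains the usual reference.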
ANOVA (Analysis of Variance)
A parametric test
Assumes a normal distribution
The variance in the groups must be equal
The data points in each group must be from independent samples
If there are only two groups, ANOVA is equivalent to the T test
Perform ANOVA
Two estimates of variance are taken
Estimate the variance within the group based on the standard deviation of each group
Estimate the variance among groups based on the variability between means of each group
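The two variance estimates can be made concrete with a short sketch. The data are made up and SciPy is an assumed choice. The F statistic is the ratio of the among-group to the within-group estimate, checked here against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical expression values for one gene under three conditions (illustrative only).
groups = [
    np.array([7.0, 7.3, 6.8, 7.1]),
    np.array([7.9, 8.2, 8.1, 7.8]),
    np.array([7.4, 7.6, 7.2, 7.5]),
]
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Estimate 1: variance within the groups (spread of each group around its own mean).
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n_total - k)

# Estimate 2: variance among the groups (spread of the group means around the grand mean).
ss_among = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_among = ss_among / (k - 1)

f_by_hand = ms_among / ms_within
p_by_hand = stats.f.sf(f_by_hand, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("F = %.3f (by hand), %.3f (SciPy), p = %.4g" % (f_by_hand, f_scipy, p_scipy))

# With only two groups, the ANOVA F statistic equals the squared unpaired t statistic.
t_two = stats.ttest_ind(groups[0], groups[1]).statistic
f_two = stats.f_oneway(groups[0], groups[1])[0]
```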
One-Way ANOVA
One-Way ANOVA-example
Two-Way ANOVA
Two-Way ANOVA--example
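The slide's example is not available, so the following is only a sketch of a two-way ANOVA for a balanced design with replication, written out with NumPy. The factor names (genotype, treatment) and the data are hypothetical. Unbalanced designs need a different decomposition.

```python
import numpy as np
from scipy import stats

# data[i, j, r]: level i of factor A (e.g. genotype), level j of factor B
# (e.g. treatment), replicate r. Values are illustrative only.
data = np.array([
    [[7.0, 7.2, 6.9], [8.1, 8.3, 8.0]],
    [[7.5, 7.4, 7.7], [9.4, 9.6, 9.2]],
])
a, b, n = data.shape
grand = data.mean()
mean_a = data.mean(axis=(1, 2))   # one mean per level of factor A
mean_b = data.mean(axis=(0, 2))   # one mean per level of factor B
cell = data.mean(axis=2)          # one mean per A x B cell

# Sums of squares for the two main effects, their interaction, and the residual.
ss_a = b * n * ((mean_a - grand) ** 2).sum()
ss_b = a * n * ((mean_b - grand) ** 2).sum()
ss_ab = n * ((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2).sum()
ss_err = ((data - cell[:, :, None]) ** 2).sum()

df_err = a * b * (n - 1)
ms_err = ss_err / df_err

results = {}
for name, ss, df in [("A", ss_a, a - 1), ("B", ss_b, b - 1),
                     ("A x B", ss_ab, (a - 1) * (b - 1))]:
    f_stat = (ss / df) / ms_err
    results[name] = (f_stat, stats.f.sf(f_stat, df, df_err))
    print("%-5s F = %8.3f, p = %.4g" % (name, f_stat, results[name][1]))
```

A significant interaction term means the effect of one factor depends on the level of the other, which a pair of one-way analyses cannot show.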
Kruskal-Wallis
Non-parametric equivalent of one-way ANOVA
The test statistic is compared against a Chi-square distribution with k-1 degrees of freedom, where k is the number of groups
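A minimal sketch of the Kruskal-Wallis test on made-up, tie-free data, assuming SciPy. The H statistic is computed from the rank sums and compared with a chi-square distribution with k-1 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical expression values for one gene under three conditions (no ties).
groups = [
    np.array([7.0, 7.3, 6.8, 7.1]),
    np.array([7.9, 8.2, 8.1, 7.8]),
    np.array([7.4, 7.6, 7.2, 7.5]),
]
k = len(groups)
sizes = [len(g) for g in groups]
n_total = sum(sizes)

# Rank all observations together, then split the ranks back into groups.
ranks = stats.rankdata(np.concatenate(groups))
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])

# H statistic (this form is valid when there are no ties).
h_by_hand = (12.0 / (n_total * (n_total + 1))
             * sum(r.sum() ** 2 / len(r) for r in rank_groups)
             - 3 * (n_total + 1))
p_by_hand = stats.chi2.sf(h_by_hand, k - 1)

h_scipy, p_scipy = stats.kruskal(*groups)
print("H = %.3f (by hand), %.3f (SciPy), p = %.4g" % (h_by_hand, h_scipy, p_scipy))
```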
Multiple Comparison Corrections
As the number of comparisons increases, the number of results called significant increases.
The number of false positives (Type I errors) may increase as well.
To fix this problem, some sort of adjustment of the p-values or α-levels is needed.
Let k = the number of groups and K = the number of pair-wise comparisons that are necessary, K = k(k-1)/2.
Each subsequent column represents the chosen level of significance.
Performing multiple pair-wise comparisons increases the likelihood of generating a Type I error.
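How quickly the Type I error risk grows can be sketched with a few lines of Python. The formula 1 - (1 - α)^K assumes the K comparisons are independent, which is a simplification. The function names are illustrative.

```python
def pairwise_comparisons(k):
    """Number of pair-wise comparisons K among k groups."""
    return k * (k - 1) // 2


def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one Type I error, assuming independent comparisons."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


for k in (2, 3, 5, 10):
    big_k = pairwise_comparisons(k)
    rates = ", ".join(
        "alpha=%.2f: %.3f" % (alpha, familywise_error_rate(alpha, big_k))
        for alpha in (0.05, 0.01)
    )
    print("k = %2d groups, K = %2d comparisons -> %s" % (k, big_k, rates))
```

With 5 groups there are already 10 pair-wise comparisons, and at α = 0.05 the chance of at least one false positive is about 40%.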
Bonferroni Correction
The cut-off level of significance being used is divided by the number of comparisons being made.
Instead of testing each hypothesis at level α, test each at level α/m, where m is the number of hypotheses.
Good for a small number of comparisons
May be too conservative when m is large
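A minimal sketch of the Bonferroni correction in plain Python. The function name and the p-values are illustrative, not taken from the lecture.

```python
def bonferroni(p_values, alpha=0.05):
    """Test each hypothesis at level alpha / m; also return adjusted p-values."""
    m = len(p_values)
    threshold = alpha / m
    rejected = [p <= threshold for p in p_values]
    adjusted = [min(1.0, p * m) for p in p_values]  # adjusted p-values are capped at 1
    return rejected, adjusted


# Hypothetical p-values for five genes.
p_values = [0.001, 0.008, 0.020, 0.040, 0.300]
rejected, adjusted = bonferroni(p_values, alpha=0.05)
for p, r, adj in zip(p_values, rejected, adjusted):
    print("p = %.3f  adjusted = %.3f  significant: %s" % (p, adj, r))
```

Here four genes pass an uncorrected 0.05 cutoff, but only two survive the corrected cutoff of 0.05/5 = 0.01. With thousands of genes the cutoff becomes far stricter, which is why the correction is often considered too conservative for microarray data.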
False Discovery Rate
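Multiple-test corrections such as Bonferroni control Prob(V ≥ 1), where V is the number of falsely rejected hypotheses
When the number of tests m is huge, some false rejections (Type I errors) are likely to occur
Better to control the false discovery rate, FDR = E[V/R], where R is the total number of rejected hypotheses
Intuitive definition of false discovery rate: the expected fraction of false positives among the results called significant
Compared to Bonferroni:
  Bonferroni: fix the error rate, then estimate the rejection area
  FDR: fix the rejection area, then estimate the error rate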
Two Algorithms for FDR
Benjamini and Hochberg:
  The rate at which false discoveries occur
  Fix a cutoff α*, and then derive a decision rule that achieves FDR ≤ α*
Storey:
  The rate at which discoveries are false
  Fix a decision rule, and then estimate the FDR associated with using this decision rule
  Requires an estimate of m0, the number of true null hypotheses
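The two viewpoints can be sketched in plain Python. The first function is the Benjamini-Hochberg step-up rule. The second is a simplified plug-in estimate in the spirit of Storey (2002), with the tuning parameter `lam` fixed at 0.5 as an assumption. Function names and p-values are illustrative, and the sketch omits refinements from the paper.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Fix the FDR level, derive the decision rule (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:  # compare the rank-th smallest p with rank/m * fdr
            k_max = rank                   # keep the largest rank that passes
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


def storey_fdr_estimate(p_values, threshold, lam=0.5):
    """Fix the decision rule (reject p <= threshold), estimate its FDR."""
    m = len(p_values)
    # Estimate pi0 = m0 / m from the p-values above lam, which are mostly true nulls.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    n_rejected = max(1, sum(p <= threshold for p in p_values))
    return min(1.0, pi0 * m * threshold / n_rejected)


# Hypothetical p-values for ten genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.55, 0.82]
bh_rejected = benjamini_hochberg(p_values, fdr=0.05)
fdr_hat = storey_fdr_estimate(p_values, threshold=0.01)
print("BH rejects %d of %d hypotheses" % (sum(bh_rejected), len(p_values)))
print("Estimated FDR when rejecting p <= 0.01: %.3f" % fdr_hat)
```

On these values Bonferroni at 0.05/10 = 0.005 would keep only the smallest p-value, while the Benjamini-Hochberg rule keeps two.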
How to Interpret Expression Profiling Data
Overlay functional information and allow biological context to help decide what is of interest and what is not
Using computational methods (classification, clustering, promoter prediction, etc.)
Data mining tools
  Public identifiers: GenBank, Swiss-Prot, Gene Ontology (GO)
  Using databases: LocusLink, HomoloGene, RefSeq, UniGene, etc.
  GeneFAS (Digbio), GenePath (Digbio), NetAffx, etc.
Gene Ontology (GO)
The most commonly used public-domain source of gene classification
Provides controlled vocabulary hierarchies for
  molecular function
  biological process
  cellular component
GO
Current GO annotation
http://www.geneontology.org/GO.current.annotations.shtml
More than 30 species are listed
Image Analysis
More than 20 software packages are listed at http://ihome.cuhk.edu.hk/~b400559/arraysoft_image.html
Imagene (BioDiscovery, Inc.)
Imagene Analysis
Flagging Spot
Defining Thresholds for Empty Spots
GeneSpring
GeneSpring (Silicon Genetics)
Broadly used
Nice user interface
Data normalization (Lowess, etc.)
Powerful ANOVA statistical analysis
  t-test/1-way ANOVA test
  2-way ANOVA tests
  1-way post-hoc tests for reliably identifying differentially expressed genes
Incorporation of different analysis tools
  Clustering
  Visual filtering
  Pathway viewing
  Scripting
ANOVA in GeneSpring (I)
Tools -> Statistical Analysis -> test type: parametric, assume variance equal or parametric, don't assume variance equal.
Applies when technical replicates are on different slides, together with biological replicates (e.g. as in the case of one-color arrays)
GeneSpring does not make the distinction between technical sample and biological sample replicates
ANOVA in GeneSpring (II)
Use Tools -> Statistical Analysis -> test type: parametric, assume variance equal or parametric, don't assume variance equal.
The on-chip variance is ignored.
Technical replicates are spotted on a chip (i.e. on-chip replicates) + biological replicates
e.g. If you have 3 sets of on-chip replicates X 2 biological replicates for group A, and the same setup for group B:
  GeneSpring will first average the on-chip replicates.
  Now you have one average of the on-chip values for replicate #1 and another average of the on-chip values for replicate #2.
  Then GeneSpring uses these two final averages to compute ANOVA. The df is 2 - 1 = 1.
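The averaging step described above can be imitated in a few lines. This is only a sketch of the idea with made-up numbers, assuming NumPy and SciPy. It is not GeneSpring's actual implementation.

```python
import numpy as np
from scipy import stats

# Hypothetical values for one gene: rows = biological replicates,
# columns = on-chip (technical) replicates. Illustrative only.
group_a = np.array([[7.0, 7.2, 6.9],
                    [7.4, 7.5, 7.3]])
group_b = np.array([[8.1, 8.3, 8.0],
                    [8.6, 8.4, 8.5]])

# Step 1: average the on-chip replicates, which discards the on-chip variance.
a_means = group_a.mean(axis=1)
b_means = group_b.mean(axis=1)

# Step 2: run the ANOVA on the per-replicate averages only.
f_stat, p_value = stats.f_oneway(a_means, b_means)
df_per_group = len(a_means) - 1  # 2 biological replicates -> df of 2 - 1 = 1

print("averages A:", a_means, "averages B:", b_means)
print("F = %.3f, p = %.4g, df per group = %d" % (f_stat, p_value, df_per_group))
```

Six measurements per group collapse to two values per group, so the test has very few degrees of freedom.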
ANOVA in GeneSpring (III)
Use Tools -> Statistical Analysis -> test type: parametric, use all available error measurements.
In this case, both the on-chip and biological replicate information are used.
Technical replicates are spotted on a chip (i.e. on-chip replicates) + biological replicates
If you have 3 sets of on-chip replicates X 2 biological replicates for group A, and the same setup for group B:
  GeneSpring will take on-chip and biological variance into account when calculating the ANOVA.
  The degrees of freedom will also account for both types of replicates.
  The equation for the degrees of freedom is actually quite complex, because GeneSpring takes the standard error of the on-chip and biological replicates into consideration. This is done so that different levels of variation between technical and biological replicates are accounted for.
Error correction
P-value cutoff/False discovery rate: 0.05
Multiple testing correction: too conservative. Use None
Post-hoc testing: used for 3 or more conditions.
  Shows the pairs of conditions between which the significant changes are detected.
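The idea of post-hoc testing can be sketched as follows: run the overall ANOVA first, then compare conditions pair by pair. Pair-wise t tests with a Bonferroni adjustment are used here as a simple stand-in. The post-hoc methods GeneSpring actually offers are not specified in this lecture, and the condition names and values are made up.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical expression values for one gene under three conditions.
conditions = {
    "control": np.array([7.0, 7.3, 6.8, 7.1]),
    "heat": np.array([7.9, 8.2, 8.1, 7.8]),
    "cold": np.array([7.1, 7.2, 6.9, 7.0]),
}

# Overall test: is there any difference among the conditions?
f_stat, p_anova = stats.f_oneway(*conditions.values())

# Post-hoc step: which pairs of conditions differ?
pairs = list(combinations(conditions, 2))
significant_pairs = []
for name_1, name_2 in pairs:
    p = stats.ttest_ind(conditions[name_1], conditions[name_2]).pvalue
    p_adjusted = min(1.0, p * len(pairs))  # Bonferroni over the pair-wise tests
    if p_anova < 0.05 and p_adjusted < 0.05:
        significant_pairs.append((name_1, name_2))
    print("%s vs %s: adjusted p = %.4g" % (name_1, name_2, p_adjusted))

print("ANOVA p = %.4g, significant pairs: %s" % (p_anova, significant_pairs))
```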
Significance Analysis of Microarrays (SAM)
From Stanford (http://www-stat.stanford.edu/~tibs/SAM/)
Correlates gene expression data to a wide variety of clinical parameters including treatment, diagnosis categories, survival time and time trends
Provides an estimate of the False Discovery Rate for multiple testing
Automatic imputation of missing data via a nearest neighbor algorithm
Can deal with blocked designs, for example, when treatments are applied within different batches of arrays
Convenient Excel Add-in
Works with data from both cDNA and oligo microarrays. Can also be applied to protein expression data and SNP chip data.
Genes are web-linked to the Stanford SOURCE database
ArrayStat (Imaging Research)
Accepts data from MS Excel and in text format
Novel and standard random error estimation methods
Performs powerful statistics on as few as two replicates
"Outlier" detection and removal
Shows the number of replicates in the results
Flexible normalization within an experiment and across experiments
False positive corrections
Dependent and independent statistical tests of expression changes
Statistical power analysis to minimize false negatives
Reading Assignments
Suggested reading: GeneChip Expression Analysis. Affymetrix, Inc.
John D. Storey. 2002. A direct approach to false discovery rates. J. R. Statist. Soc. B, 64, part 3, 479-498