Gene Expression Data Analyses (3)

Gene Expression Data Analyses (3) Trupti Joshi Computer Science Department 317 Engineering Building North E-mail: [email protected] 573-884-3528(O)


Page 1: Gene Expression Data Analyses (3)

Gene Expression Data Analyses

(3)

Trupti Joshi

Computer Science Department, 317 Engineering Building North

E-mail: [email protected]
573-884-3528 (O)

Page 2: Gene Expression Data Analyses (3)

Lecture Outline -1

Statistical significance vs. biological relevance
Statistical methods
  Two-sample statistical tests
    Parametric: t test (paired and unpaired t test)
    Non-parametric:
      Mann-Whitney test for independent samples
      Wilcoxon signed-rank test for paired data
  Multivariate statistics
    One-way vs. two-way analysis of variance (ANOVA)
    Kruskal-Wallis
  Multiple comparison corrections
    Bonferroni correction
    False discovery rate

Page 3: Gene Expression Data Analyses (3)

Lecture Outline -2

Data interpretation
Selection of software
  Image analysis
    Imagene
  Statistical analysis
    GeneSpring
    SAM
    ArrayStat


Page 5: Gene Expression Data Analyses (3)

Why Statistical Analysis?

Rank results by confidence using significance metrics (e.g. p-value)

Estimate the rates of false positives (Type I errors) and false negatives (Type II errors)

Achieve the desired balance of sensitivity and specificity

Caveat: there is a certain amount of flexibility (and arbitrariness) in interpreting the significance metrics generated by a test

Page 6: Gene Expression Data Analyses (3)

Statistical Significance vs. Biological Relevance


Page 8: Gene Expression Data Analyses (3)

Normal Distribution

Central peak: mean
Symmetrical

Page 9: Gene Expression Data Analyses (3)

Parametric Analysis

Test the hypothesis that one or more treatments have no effect on the mean and variance of a chosen variable

Assume the data (e.g. yield) follow a normal distribution

Disadvantage: the results are unreliable if the data are not normally distributed

Page 10: Gene Expression Data Analyses (3)

Non-parametric Analysis

Use ranks of numerical data rather than the data themselves

Use information about the relative sizes of observations, without making any assumptions about the means and variances of the populations being tested

Can be used for any data set that can be ranked

Disadvantage: if the data are normally distributed, it is less powerful than parametric analysis


Page 12: Gene Expression Data Analyses (3)

T test

Paired t test: the two groups must be the same size, with each observation in one group matched to an observation in the other
  Example: comparison of the same organism before and after treatment (e.g. before and after heat shock)

Unpaired t test: the two groups do not need to be the same size
  Example: comparison between organisms with and without treatment

Page 13: Gene Expression Data Analyses (3)

How to Perform T test

Paired T-test Un-Paired T-test
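The formulas on this slide did not survive in the transcript. As a minimal sketch of the standard textbook versions (not necessarily the exact form shown in the lecture), the code below computes the paired t statistic from the per-pair differences and the unpaired Student's t statistic with a pooled variance, which assumes equal variance in the two groups. The function names and the data are made up for illustration; the p-value would then be read from a t table with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev, variance


def paired_t(before, after):
    """Paired t statistic and degrees of freedom for matched samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # Mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def unpaired_t(x, y):
    """Student's t statistic (pooled variance) for independent samples."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return t, nx + ny - 2


# Hypothetical log-expression values for one gene (illustration only)
before = [1.0, 2.0, 3.0, 4.0]
after = [2.0, 4.0, 5.0, 5.0]

print("paired:  ", paired_t(before, after))
print("unpaired:", unpaired_t(before, after))
```

On the same numbers the paired statistic is much larger in magnitude than the unpaired one, because pairing removes the variation between individuals.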

Page 14: Gene Expression Data Analyses (3)

T-test example

Unpaired T test
Paired T test

Page 15: Gene Expression Data Analyses (3)

Mann-Whitney Test

Use if the sample is not normally distributed
Similar to the unpaired t test but non-parametric
Uses the rankings of the numerical values instead of the variance
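A minimal sketch of how the ranking works, assuming the usual definition of the U statistic: pool both samples, rank them (tied values share the average rank), and sum the ranks of one sample. The helper names and the data are hypothetical; significance would then be looked up in a Mann-Whitney table for the two sample sizes.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Smaller of the two U statistics for independent samples x and y."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return min(u_x, u_y)


# Hypothetical expression values; the groups may differ in size
control = [1.1, 2.3, 0.7]
treated = [3.5, 2.9, 4.1, 2.3]
print(mann_whitney_u(control, treated))
```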

Page 16: Gene Expression Data Analyses (3)

Mann-Whitney Test--example

Page 17: Gene Expression Data Analyses (3)

Wilcoxon Signed-Rank Test

Use if the sample is not normally distributed
Similar to the paired t test but non-parametric
Rank the (absolute) differences between arrays
If the difference within a pair is 0, the pair is not used
If the differences are identical for 2 pairs, the average of their two ranks is used
Use the Wilcoxon table
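The rules above can be sketched as follows, assuming the standard signed-rank procedure: drop zero differences, rank the absolute differences with average ranks for ties, and sum the ranks separately for positive and negative differences. Function names and data are hypothetical; the returned statistic and the number of non-zero pairs are what one would look up in the Wilcoxon table.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (W, n): smaller signed-rank sum and number of non-zero pairs."""
    diffs = [b - a for a, b in zip(x, y) if b != a]  # zero differences dropped
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus), len(diffs)


# Hypothetical paired measurements (e.g. the same samples on two arrays)
array_1 = [10, 12, 9, 11, 8]
array_2 = [12, 12, 8, 14, 10]
print(wilcoxon_signed_rank(array_1, array_2))
```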


Page 19: Gene Expression Data Analyses (3)

ANOVA (Analysis of Variance)

A parametric test
Assumes a normal distribution
The variance in the groups must be equal
The data points in each group must be from independent samples
If there are only two groups, ANOVA is equivalent to the t test

Page 20: Gene Expression Data Analyses (3)

Perform ANOVA

Two estimates of variance are taken

Estimate the variance within the group based on the standard deviation of each group

Estimate the variance among groups based on the variability between means of each group
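As a rough sketch of the two estimates for the one-way case: the within-group estimate is the within-group sum of squares divided by N - k, the among-group estimate is the between-group sum of squares divided by k - 1, and their ratio is the F statistic. The function name and data are hypothetical, and the p-value would come from an F table with the returned degrees of freedom.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, (df_between, df_within)) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    # Variability of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, (k - 1, n_total - k)


# Hypothetical expression values for one gene under three conditions
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(one_way_anova(groups))
```

With only two groups, the F value from this sketch equals the square of the pooled-variance unpaired t statistic, which is the sense in which ANOVA and the t test are equivalent.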

Page 21: Gene Expression Data Analyses (3)

One-Way ANOVA

Page 22: Gene Expression Data Analyses (3)

One-Way ANOVA-example

Page 23: Gene Expression Data Analyses (3)

Two-Way ANOVA

Page 24: Gene Expression Data Analyses (3)

Two-Way ANOVA--example

Page 25: Gene Expression Data Analyses (3)

Kruskal-Wallis

Non-parametric equivalent to ANOVA
Uses the chi-square distribution with k-1 degrees of freedom (k = number of groups)
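A minimal sketch of the H statistic, assuming the standard formula H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), where R_i is the rank sum of group i. This version applies no correction for ties beyond average ranks, and the names and data are hypothetical. H is then compared against the chi-square distribution with k-1 degrees of freedom.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (H, df) without tie correction."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return h, len(groups) - 1


# Hypothetical expression values for one gene under three conditions
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(kruskal_wallis_h(groups))
```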


Page 27: Gene Expression Data Analyses (3)

Multiple Comparison Corrections

As the number of tests (comparisons) increases, the number of results that reach significance by chance increases as well.

That is, the number of false positives (Type I errors) grows with the number of tests.

To fix this problem, some sort of adjustment of the p-values or α-levels is needed.

Page 28: Gene Expression Data Analyses (3)

Let k = the number of groups and K = the number of comparisons that are necessary. Each subsequent column represents the chosen level of significance.

Increased likelihood of generating Type I error by performing multiple pair-wise comparisons
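The table from this slide is not in the transcript. A sketch of how such a table can be computed, assuming all pairwise comparisons among k groups (K = k(k-1)/2) and treating the comparisons as independent, so that the chance of at least one Type I error is 1 - (1 - α)^K. The function names are hypothetical and the independence assumption is a simplification.

```python
def n_pairwise(k):
    """Number of pairwise comparisons among k groups."""
    return k * (k - 1) // 2


def familywise_error(alpha, n_comparisons):
    """Probability of at least one Type I error over independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


# One row per number of groups, one column per significance level
print("k  K  alpha=0.05  alpha=0.01")
for k in range(2, 7):
    K = n_pairwise(k)
    print(k, K, round(familywise_error(0.05, K), 3),
          round(familywise_error(0.01, K), 3))
```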

Page 29: Gene Expression Data Analyses (3)

Bonferroni Correction

The cut-off level of significance is divided by the number of comparisons being made.

Instead of testing each hypothesis at level α, test each at level α/m.

Good for a small number of comparisons

May be too conservative
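A minimal sketch of the correction: with m hypotheses tested at overall level α, each p-value is compared against α/m. The function name and the p-values are invented for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    threshold = alpha / m  # per-test significance level
    return [p <= threshold for p in p_values]


# Hypothetical p-values for five genes
p_values = [0.041, 0.001, 0.30, 0.012, 0.008]
print(bonferroni(p_values))
```

With five tests the per-test cutoff drops from 0.05 to 0.01, so a gene with p = 0.012 is no longer called significant. With thousands of genes on an array the cutoff becomes very small, which is why the method may be too conservative.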

Page 30: Gene Expression Data Analyses (3)

False Discovery Rate

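Multiple-test control of the family-wise error rate: Prob(V ≥ 1), where V is the number of falsely rejected hypotheses
When the number of tests m is huge, false rejections (Type I errors) are likely to occur
Better to control the false discovery rate (FDR)

Intuitive definition of false discovery rate: the expected proportion of false rejections among all rejections, FDR = E[V/R], where R is the total number of rejections

Compared to Bonferroni:
  Bonferroni: fixed error rate, estimated rejection area
  FDR (direct approach): fixed rejection area, estimated error rate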

Page 31: Gene Expression Data Analyses (3)

Two Algorithms for FDR

Benjamini and Hochberg:
  The rate at which false discoveries occur
  Fix a cutoff α*, and then derive a decision rule that achieves FDR ≤ α*

Storey:
  The rate at which discoveries are false
  Fix a decision rule, and then estimate the FDR associated with using this decision rule
  Estimate m0, the number of true null hypotheses
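A sketch of the first of the two approaches, the Benjamini-Hochberg step-up procedure, under its usual statement: sort the m p-values, find the largest rank i with p(i) <= i * q / m, and reject the hypotheses with the i smallest p-values. Storey's estimate of m0 is not included here. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its own threshold
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five genes
p_values = [0.041, 0.001, 0.30, 0.012, 0.008]
print(benjamini_hochberg(p_values))
```

On these numbers the procedure rejects three hypotheses at q = 0.05, whereas a per-test cutoff of 0.05/5 = 0.01 would reject only two.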


Page 35: Gene Expression Data Analyses (3)

How to Interpret Expression Profiling Data

Overlay functional information and allow biological context to help decide what is of interest and what is not

Use computational methods (classification, clustering, promoter prediction, etc.)

Data mining tools
  Public identifiers: GenBank, Swiss-Prot, Gene Ontology (GO)
  Databases: LocusLink, HomoloGene, RefSeq, UniGene, etc.
  GeneFAS (Digbio), GenePath (Digbio), NetAffx, etc.

Page 36: Gene Expression Data Analyses (3)

Gene Ontology (GO)

Most commonly used public-domain source of gene classification

Provides controlled vocabulary hierarchies for:
  molecular function
  biological process
  cellular component

Page 37: Gene Expression Data Analyses (3)

GO

Page 38: Gene Expression Data Analyses (3)

Current GO annotation

http://www.geneontology.org/GO.current.annotations.shtml

More than 30 species are listed


Page 40: Gene Expression Data Analyses (3)

Image Analysis

More than 20 software packages are listed at http://ihome.cuhk.edu.hk/~b400559/arraysoft_image.html

Imagene (BioDiscovery, Inc.)

Page 41: Gene Expression Data Analyses (3)

Imagene Analysis

Page 42: Gene Expression Data Analyses (3)

Flagging Spot

Page 43: Gene Expression Data Analyses (3)

Defining Thresholds for Empty Spots


Page 45: Gene Expression Data Analyses (3)

GeneSpring

GeneSpring (Silicon Genetics)
Broadly used
Nice user interface
Data normalization (Lowess, etc.)
Powerful ANOVA statistical analysis
  t-test/1-way ANOVA test
  2-way ANOVA tests
  1-way post-hoc tests for reliably identifying differentially expressed genes
Incorporation of different analysis tools
  Clustering
  Visual filtering
  Pathway viewing
  Scripting

Page 46: Gene Expression Data Analyses (3)

ANOVA in GeneSpring (I)

Use Tools -> Statistical Analysis -> test type: parametric, assume variance equal or parametric, don't assume variance equal.

Applies when technical replicates are on different slides, together with biological replicates (e.g. as in the case of one-color arrays)

GeneSpring does not distinguish between technical and biological replicates

Page 47: Gene Expression Data Analyses (3)

ANOVA in GeneSpring (II)

Use Tools -> Statistical Analysis -> test type: parametric, assume variance equal or parametric, don't assume variance equal.

The on-chip variance is ignored.

Applies when technical replicates are spotted on a chip (i.e. on-chip replicates), together with biological replicates.

Example: you have 3 sets of on-chip replicates x 2 biological replicates for group A, and the same setup for group B. GeneSpring first averages the on-chip replicates, giving one average of the on-chip values for replicate #1 and another for replicate #2. GeneSpring then uses these two final averages to compute the ANOVA. The df is 2-1.
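The averaging step described on this slide can be illustrated with a small sketch. This is not GeneSpring's code, only an illustration of the described behavior under made-up numbers: the on-chip values are collapsed to one mean per biological replicate, so only the variation between those means (with 2 - 1 = 1 degree of freedom per group) remains for the test.

```python
from statistics import mean, variance

# Hypothetical group A: 2 biological replicates, each with 3 on-chip spots
group_a = [
    [8.1, 8.3, 8.2],  # biological replicate #1
    [9.0, 8.8, 8.9],  # biological replicate #2
]


def collapse_on_chip(bio_replicates):
    """Average the on-chip values within each biological replicate."""
    return [mean(spots) for spots in bio_replicates]


averages = collapse_on_chip(group_a)
df = len(averages) - 1            # 2 - 1: on-chip spots no longer count
between_var = variance(averages)  # only biological variation is left

print(averages, df, between_var)
```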

Page 48: Gene Expression Data Analyses (3)

ANOVA in GeneSpring (III)

Use Tools -> Statistical Analysis -> test type: parametric, use all available error measurements.

In this case, both the on-chip and biological replicate information are used.

Applies when technical replicates are spotted on a chip (i.e. on-chip replicates), together with biological replicates.

Example: you have 3 sets of on-chip replicates x 2 biological replicates for group A, and the same setup for group B. GeneSpring takes both on-chip and biological variance into account when calculating the ANOVA. The degrees of freedom also account for both types of replicates. The equation for the degrees of freedom is actually quite complex, because GeneSpring takes the standard error of the on-chip and biological replicates into consideration. This is done so that different levels of variation between technical and biological replicates are accounted for.

Page 49: Gene Expression Data Analyses (3)

Error correction

P-value cutoff/false discovery rate: 0.05
Multiple testing correction: too conservative; use None
Post-hoc testing: used for 3 or more conditions
  Shows the pairs of conditions between which significant changes are detected

Page 50: Gene Expression Data Analyses (3)

Significance Analysis of Microarrays (SAM)

From Stanford (http://www-stat.stanford.edu/~tibs/SAM/)
Correlates gene expression data to a wide variety of clinical parameters including treatment, diagnosis categories, survival time and time trends
Provides estimate of False Discovery Rate for multiple testing
Automatic imputation of missing data via nearest neighbor algorithm
Can deal with blocked designs, for example, when treatments are applied within different batches of arrays
Convenient Excel Add-in
Works with data from both cDNA and oligo microarrays. Can also be applied to protein expression data and SNP chip data.
Genes are web-linked to Stanford SOURCE database

Page 51: Gene Expression Data Analyses (3)

ArrayStat (Imaging Research)

Accepts data from MS Excel and in text format
Novel and standard random error estimation methods
Performs powerful statistics on as few as two replicates
“Outlier” detection and removal
Shows the number of replicates in the results

Flexible normalization within an experiment and across experiments

False positive corrections

Dependent and independent statistical tests of expression changes

Statistical power analysis to minimize false negatives

Page 52: Gene Expression Data Analyses (3)

Reading Assignments

Suggested reading:

GeneChip Expression Analysis. Affymetrix, Inc.

John D. Storey. 2002. A direct approach to false discovery rates. J. R. Statist. Soc. B, 64, part 3, 479-498.