Carlo Colantuoni – ccolantu@jhsph

Summer Inst. Of Epidemiology and Biostatistics, 2009: Gene Expression Data Analysis 8:30am-12:00pm in Room W2017 Carlo Colantuoni – [email protected] http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/ GEA2009.htm


Page 1: Carlo Colantuoni – ccolantu@jhsph

Summer Inst. Of Epidemiology and Biostatistics, 2009:

Gene Expression Data Analysis

8:30am-12:00pm in Room W2017

Carlo Colantuoni – [email protected]

http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2009.htm

Page 2: Carlo Colantuoni – ccolantu@jhsph

Class Outline

• Basic Biology & Gene Expression Analysis Technology

• Data Preprocessing, Normalization, & QC

• Measures of Differential Expression

• Multiple Comparison Problem

• Clustering and Classification

• The R Statistical Language and Bioconductor

• GRADES – independent project with Affymetrix data.

http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2009.htm

Page 3: Carlo Colantuoni – ccolantu@jhsph

Class Outline - Detailed

• Basic Biology & Gene Expression Analysis Technology
– The Biology of Our Genome & Transcriptome
– Genome and Transcriptome Structure & Databases
– Gene Expression & Microarray Technology

• Data Preprocessing, Normalization, & QC
– Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
– Background correction (PM-MM, RMA, GCRMA)
– Global Mean Normalization
– Loess Normalization
– Quantile Normalization (RMA & GCRMA)
– Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
– Quality Control: PCA and MDS for dimension reduction

• Measures of Differential Expression
– Basic Statistical Concepts
– T-tests and Associated Problems
– Significance analysis in microarrays (SAM) [ & Empirical Bayes]
– Complex ANOVAs (limma package in R)

• Multiple Comparison Problem
– Bonferroni
– False Discovery Rate Analysis (FDR)

• Differential Expression of Functional Gene Groups
– Functional Annotation of the Genome
– Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
– Gene Set Enrichment Analysis (GSEA)
– Parametric Analysis of Gene Set Enrichment (PAGE)
– geneSetTest
– Notes on Experimental Design

• Clustering and Classification
– Hierarchical clustering
– K-means
– Classification
  • LDA (PAM), kNN, Random Forests
  • Cross-Validation

• Additional Topics
– The R Statistical Language
– Bioconductor
– Affymetrix data processing example!

Page 4: Carlo Colantuoni – ccolantu@jhsph

DAY #2:

•Intensity Comparison & Ratio vs. Intensity Plots

•Log transformation

•Background correction (Affymetrix, 2-color, other)

•Normalization: global and local mean centering

•Normalization: quantile normalization

•Batches, plates, pins, hybs, washes, and other artifacts

•QC: PCA and MDS for dimension reduction

Page 5: Carlo Colantuoni – ccolantu@jhsph

Microarray Data Quantification

[Figure: scatterplot of Log Intensity vs. Log Intensity]

Page 6: Carlo Colantuoni – ccolantu@jhsph

Microarray Data Quantification

[Figure: scatterplot of Log Ratio (y-axis) vs. Log Intensity (x-axis)]

Page 7: Carlo Colantuoni – ccolantu@jhsph

Logarithmic Transformation:

If $\log_z(x) = y$, then $z^y = x$

Logarithm math refresher:

log(x) + log(y) = log( x * y )

log(x) - log(y) = log( x / y )
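
As a quick numerical check of the identities above, here is a minimal Python sketch. The intensity values are made up for illustration; the point is only that a log ratio equals a difference of log intensities, and that equal fold changes up and down are symmetric around zero on the log2 scale.

```python
import math

# Hypothetical intensities for one gene on two arrays (illustrative values only)
x, y = 800.0, 200.0

log_ratio = math.log2(x / y)              # log(x / y)
log_diff = math.log2(x) - math.log2(y)    # log(x) - log(y)

# A 4-fold increase and a 4-fold decrease sit at +2 and -2 on the log2 scale
up = math.log2(4.0)
down = math.log2(1.0 / 4.0)

print(log_ratio, log_diff, up, down)
```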

Page 8: Carlo Colantuoni – ccolantu@jhsph

Intensity vs. Intensity: LINEAR

Intensity Distribution: LINEAR

Page 9: Carlo Colantuoni – ccolantu@jhsph

Intensity vs. Intensity: LOG

Intensity Distribution: LOG

Page 10: Carlo Colantuoni – ccolantu@jhsph

Intensity vs. Intensity: LINEAR

Page 11: Carlo Colantuoni – ccolantu@jhsph

Intensity vs. Intensity: LOG

Page 12: Carlo Colantuoni – ccolantu@jhsph

Int vs. Int: LINEAR

Int vs. Int: LOG

Ratio vs. Int: LOG

Microarray Data Quantification

Page 13: Carlo Colantuoni – ccolantu@jhsph

Background Subtraction

Page 14: Carlo Colantuoni – ccolantu@jhsph

Before Hybridization

Array 1 Array 2

Sample 1 Sample 2

Page 15: Carlo Colantuoni – ccolantu@jhsph

After Hybridization

Array 1 Array 2

Page 16: Carlo Colantuoni – ccolantu@jhsph

More Realistic - Before

Array 1 Array 2

Sample 1 Sample 2

Page 17: Carlo Colantuoni – ccolantu@jhsph

Array 1 Array 2

More Realistic - After

Page 18: Carlo Colantuoni – ccolantu@jhsph

poly C

No label

Page 19: Carlo Colantuoni – ccolantu@jhsph

Intensity distributions for the no-label and Yeast DNA

Page 20: Carlo Colantuoni – ccolantu@jhsph

The presence of background noise is clear from the fact that the minimum PM intensity is not 0 and that the geometric mean of the probesets with no spike-in is around 200 units.

Why Adjust for Background?

Hs RNA on Hs chip (w/ spike-ins)

PM intensities

Page 21: Carlo Colantuoni – ccolantu@jhsph

Why Adjust for Background?

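Local slope decreases as nominal concentration decreases!

When expression is low relative to background, (E1 + B) ≈ B and (E2 + B) ≈ B, so:

(E1 + B) / (E2 + B) ≈ 1

When expression is high relative to background, (E1 + B) ≈ E1 and (E2 + B) ≈ E2, so:

(E1 + B) / (E2 + B) ≈ E1 / E2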
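
The effect of an additive background on ratios can be sketched numerically. In the following Python example the background level B = 200 and the expression values are assumptions chosen for illustration; the true fold change is fixed at 2 at every expression level, and only the observed ratio (E1 + B) / (E2 + B) changes.

```python
import math

def observed_log2_ratio(e1, e2, background):
    """log2 of the observed ratio (E1 + B) / (E2 + B) when the same
    additive background B is present in both measurements."""
    return math.log2((e1 + background) / (e2 + background))

B = 200.0  # assumed background level, in intensity units

for e2 in (10.0, 100.0, 1000.0, 10000.0):
    e1 = 2.0 * e2  # true two-fold change at every expression level
    print(e2, round(observed_log2_ratio(e1, e2, B), 3))
```

The observed log2 ratio is close to 0 when expression is far below B and approaches the true value of 1 when expression is far above B.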

Page 22: Carlo Colantuoni – ccolantu@jhsph

By using the log-scale transformation before analyzing microarray data, investigators have, implicitly or explicitly, assumed a multiplicative measurement error model (Dudoit et al., 2002; Newton et al., 2001; Kerr et al., 200; Wolfinger et al., 2001). The fact, seen in Figure 2, that observed intensities increase linearly with concentration in the original scale but not in the log-scale suggests that background noise is additive with non-zero mean. Durbin et al. (2002), Huber et al. (2002), Cui, Kerr, and Churchill (2003), and Irizarry et al. (2003a) have proposed additive-background-multiplicative-measurement-error models for intensities read from microarray scanners.

PM intensities

Page 23: Carlo Colantuoni – ccolantu@jhsph

Affymetrix GeneChip Design

Reference sequence (5’ to 3’):

…TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT…

Perfect match (PM): GTACTACCCAGTCTTCCGGAGGCTA (NSB & SB)

Mismatch (MM): GTACTACCCAGTGTTCCGGAGGCTA (NSB)
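
The relationship between the two probes on this slide can be sketched in a few lines of Python: the MM probe is the PM probe with its middle (13th) base replaced by the complementary base. The function name `mismatch_probe` is a hypothetical helper introduced here for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm):
    """Return the MM probe: the PM sequence with its middle base complemented."""
    middle = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

pm = "GTACTACCCAGTCTTCCGGAGGCTA"  # PM probe shown on the slide
mm = mismatch_probe(pm)
print(mm)
```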

Page 24: Carlo Colantuoni – ccolantu@jhsph

Why not subtract MM?

Page 25: Carlo Colantuoni – ccolantu@jhsph

Why not subtract MM?

Page 26: Carlo Colantuoni – ccolantu@jhsph

Why not subtract MM?

Page 27: Carlo Colantuoni – ccolantu@jhsph

Background: Solutions

Page 28: Carlo Colantuoni – ccolantu@jhsph

Affymetrix GeneChip Design

Reference sequence (5’ to 3’):

…TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT…

Perfect match (PM): GTACTACCCAGTCTTCCGGAGGCTA (NSB & SB)

Mismatch (MM): GTACTACCCAGTGTTCCGGAGGCTA (NSB)

Page 29: Carlo Colantuoni – ccolantu@jhsph

Motivation: PM - MM

The hope is that:

PM = B + S and MM = B, so that PM – MM = S

But this is not correct!

Page 30: Carlo Colantuoni – ccolantu@jhsph

Simulation

• We create some feature level data for two replicate arrays

• Then compute Y=log(PM-kMM) for each array

• We make an MA plot using the Ys for each array

• We make an observed concentration versus known concentration plot

• We do this for various values of k. The following “movie” shows k moving from 0 to 1.
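
The steps above can be sketched in Python. This is not the simulation behind the slides: the data-generating model (log-normal background on both probes, signal only on the PM probe), all parameter values, and the function names are assumptions made for illustration. Features where PM - k*MM is not positive have no defined logarithm and are dropped.

```python
import math
import random

def simulate_array(true_signal, k, rng):
    """Return Y = log2(PM - k*MM) per feature; None where PM - k*MM <= 0."""
    ys = []
    for s in true_signal:
        # Assumed model: log-normal background on both probes, signal on PM only
        pm = rng.lognormvariate(math.log(100), 0.5) + s * rng.lognormvariate(0, 0.1)
        mm = rng.lognormvariate(math.log(100), 0.5)
        diff = pm - k * mm
        ys.append(math.log2(diff) if diff > 0 else None)
    return ys

def ma_values(y1, y2):
    """(M, A) pairs for features that are defined on both replicate arrays."""
    pairs = [(a, b) for a, b in zip(y1, y2) if a is not None and b is not None]
    return [(a - b, (a + b) / 2) for a, b in pairs]

rng = random.Random(0)
# Known concentrations on a log2 grid (hypothetical design, 200 features each)
true_signal = [2.0 ** e for e in range(13) for _ in range(200)]

for k in (0, 0.25, 0.5, 0.75, 1):
    y1 = simulate_array(true_signal, k, rng)
    y2 = simulate_array(true_signal, k, rng)
    ma = ma_values(y1, y2)
    rms = (sum(m * m for m, _ in ma) / len(ma)) ** 0.5
    print(f"k={k}: features kept={len(ma)}, RMS of M={rms:.2f}")
```

Plotting M against A for each k, and the mean Y against the known log2 concentration, would give the two panels shown on the following slides.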

Page 31: Carlo Colantuoni – ccolantu@jhsph

k=0

[Figure: Observed level (log2) vs. Known level (log2); Log2(Ratio) vs. Log2(Intensity)]

Page 32: Carlo Colantuoni – ccolantu@jhsph

k=1/4

[Figure: Observed level (log2) vs. Known level (log2); Log2(Ratio) vs. Log2(Intensity)]

Page 33: Carlo Colantuoni – ccolantu@jhsph

k=1/2

[Figure: Observed level (log2) vs. Known level (log2); Log2(Ratio) vs. Log2(Intensity)]

Page 34: Carlo Colantuoni – ccolantu@jhsph

k=3/4

[Figure: Observed level (log2) vs. Known level (log2); Log2(Ratio) vs. Log2(Intensity)]

Page 35: Carlo Colantuoni – ccolantu@jhsph

k=1

[Figure: Observed level (log2) vs. Known level (log2); Log2(Ratio) vs. Log2(Intensity)]

Page 36: Carlo Colantuoni – ccolantu@jhsph

Real Data

MAS 5.0 RMA

Page 37: Carlo Colantuoni – ccolantu@jhsph

RMA: The Basic Idea

PM=B+S

Observed: PM

Of interest: S

Pose a statistical model and use it to predict S from the observed PM

Page 38: Carlo Colantuoni – ccolantu@jhsph

The Basic Idea

PM=B+S

• A mathematically convenient, useful model

– $B \sim \mathrm{Normal}(\mu, \sigma)$, $S \sim \mathrm{Exponential}(\lambda)$

– No MM

– Borrowing strength across probes

$\hat{S} = E[S \mid PM]$
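
Under this model the conditional mean has a closed form, sketched below in Python. The sketch assumes that the second Normal parameter is a standard deviation, that the Exponential is parameterized by its rate, and that B is not truncated; the parameter values are made up, and estimating them from the data is not shown. It is a textbook calculation under those assumptions, not the code used by any RMA software.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_signal(pm, mu, sigma, rate):
    """E[S | PM] when PM = B + S, B ~ Normal(mu, sd=sigma), S ~ Exponential(rate).

    Given PM, S follows a Normal(a, sd=sigma) truncated to S > 0,
    where a = PM - mu - rate * sigma**2.
    """
    a = pm - mu - rate * sigma ** 2
    # Note: normal_cdf underflows to 0 if a / sigma is extremely negative
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

# Hypothetical parameter values, for illustration only
mu, sigma, rate = 100.0, 20.0, 0.001
for pm in (50.0, 100.0, 200.0, 10000.0):
    print(pm, round(expected_signal(pm, mu, sigma, rate), 2))
```

The estimate stays positive even when PM falls below the mean background, and approaches PM minus the mean background for bright probes.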

Page 39: Carlo Colantuoni – ccolantu@jhsph
Page 40: Carlo Colantuoni – ccolantu@jhsph

MAS 5.0

Page 41: Carlo Colantuoni – ccolantu@jhsph

RMA

Notice improved precision but worse accuracy

Page 42: Carlo Colantuoni – ccolantu@jhsph

Problem

• Global background correction ignores probe-specific NSB

• MM have problems

• Another possibility: Use probe sequence

Page 43: Carlo Colantuoni – ccolantu@jhsph

Probe-specific Background

Page 44: Carlo Colantuoni – ccolantu@jhsph

G-C content effect in PM’s

Boxplots of log intensities from the array hybridized to Yeast DNA for strata of probes defined by their G-C content. Probes with 6 or fewer G-C are grouped together, as are probes with 20 or more. Smooth density plots are shown for the strata with G-C contents of 6, 10, 14, and 18.

Any given probe will have some propensity to non-specific binding. As described in Section 2.3 and demonstrated in Figure 3, this tends to be directly related to its G-C content. We propose a statistical model that describes the relationship between the PM, MM, and probes of the same G-C content.
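
The stratification described in the caption above can be sketched as follows. The function name `gc_stratum` and the two synthetic probes are hypothetical; the first probe is the PM sequence from the GeneChip design slide.

```python
def gc_stratum(probe, low=6, high=20):
    """Count G and C bases, pooling counts <= low and counts >= high."""
    gc = sum(base in "GC" for base in probe.upper())
    return max(low, min(high, gc))

probes = [
    "GTACTACCCAGTCTTCCGGAGGCTA",  # PM probe from the GeneChip design slide
    "ATATATATTTAAATATATAATTTAA",  # synthetic A/T-only probe (illustrative)
    "GCGCGGCCGCGCGGGCCCGCGCGCG",  # synthetic G/C-only probe (illustrative)
]
for p in probes:
    print(p, gc_stratum(p))
```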

Page 45: Carlo Colantuoni – ccolantu@jhsph

General Model (GCRMA)

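$$PM_{gij} = O_i^{PM} + \underbrace{\exp\left(h_i(\alpha_j^{PM}) + b_{gj}^{PM} + \varepsilon_{gij}^{PM}\right)}_{\text{NSB}} + \underbrace{\exp\left(f_i(\alpha_j) + \theta_{gi} + \varepsilon_{gij}\right)}_{\text{SB}}$$

$$MM_{gij} = O_i^{MM} + \underbrace{\exp\left(h_i(\alpha_j^{MM}) + b_{gj}^{MM} + \varepsilon_{gij}^{MM}\right)}_{\text{NSB}}$$

We can calculate:

$$E[\theta_{gi} \mid PM_{gij}, MM_{gij}]$$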

Due to the associated variance with the measured MM intensities we argue that one data point is not enough to obtain a useful adjustment. In this paper we propose using probe sequence information to select other probes that can serve the same purpose as the MM pair. We do this by defining subsets of the existing MM probes with similar hybridization properties.

Page 46: Carlo Colantuoni – ccolantu@jhsph

The MA plot shows log fold change as a function of mean log expression level. A set of 14 arrays representing a single experiment from the Affymetrix spike-in data are used for this plot. A total of 13 sets of fold changes are generated by comparing the first array in the set to each of the others. Genes are symbolized by numbers representing the nominal log2 fold change for the gene. Non-differentially expressed genes with observed fold changes larger than 2 are plotted in red. All other probesets are represented with black dots. The smooth lines are 3SDs away with SD depending on log expression.

Page 47: Carlo Colantuoni – ccolantu@jhsph
Page 48: Carlo Colantuoni – ccolantu@jhsph

Naef & Magnasco (2003), PHYSICAL REVIEW E 68, 011906

Another sequence effect in PM’s and MM’s

Page 49: Carlo Colantuoni – ccolantu@jhsph

Another sequence effect in PM’s and MM’s

We show in Fig. 2 joint probability distributions of PMs and MMs, obtained from all probe pairs in a large set of experiments. Actually, two separate probability distributions are superimposed: in red, the distribution for all probe pairs whose 13th letter is a purine, and in cyan those whose 13th letter is a pyrimidine. The plot clearly shows two distinct branches in two colors, corresponding to the basic distinction between the shapes of the bases: purines are large, double ringed nucleotides while pyrimidines have smaller single rings. This underscores that by replacing the middle letter of the PM with its complementary base, the situation on the MM probe is that the middle letter always faces itself, leading to two quite distinct outcomes according to the size of the nucleotide. If the letter is a purine, there is no room within an undistorted backbone for two large bases, so this mismatch distorts the geometry of the double helix, incurring a large steric and stacking cost. But if the letter is a pyrimidine, there is room to spare, and the bases just dangle. The only energy lost is that of the hydrogen bonds.

Naef & Magnasco (2003), PHYSICAL REVIEW E 68, 011906

Page 50: Carlo Colantuoni – ccolantu@jhsph

C and T are pyrimidines (and small); A and G are purines (and large).

Page 51: Carlo Colantuoni – ccolantu@jhsph

Why not subtract MM?

Page 52: Carlo Colantuoni – ccolantu@jhsph

Another sequence effect in PM’s

Naef & Magnasco (2003), PHYSICAL REVIEW E 68, 011906, 2003

The asymmetry of (A,T) and (G,C) affinities in Fig. 3 can be explained because only A-U and G-C bonds carry labels (the pyrimidines U and C on the mRNA are labeled). Notice the nearly equal magnitudes of the reduction in both types of bonds. (Remember also that G-C pairs have 3 and A-T pairs have 2 hydrogen bonds!)

Page 53: Carlo Colantuoni – ccolantu@jhsph

Two color platforms (Agilent, cDNA)

• Common to have just one feature per gene

• 60 vs. 25 NT?

• Optical noise still a concern

• After spots are identified, a measure of local background is obtained from area around spot

(this is also applicable to some spotted one-channel data)

Page 54: Carlo Colantuoni – ccolantu@jhsph

Local background

---- GenePix

---- QuantArray

---- ScanAnalyze

Page 55: Carlo Colantuoni – ccolantu@jhsph

Two color feature level data

• Red and Green foreground and background obtained from each feature

• We have $Rf_{gij}$, $Gf_{gij}$, $Rb_{gij}$, $Gb_{gij}$ (g is gene, i is array and j is replicate)

• A default summary statistic is the log-ratio:

log2 [(Rf - Rb) / (Gf - Gb)]
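
A minimal Python sketch of this default summary statistic follows. The function name and the intensity values are illustrative assumptions. Returning None when a background-subtracted channel is not positive is one possible convention, chosen here because the log-ratio is undefined in that case.

```python
import math

def log_ratio(rf, gf, rb, gb):
    """log2[(Rf - Rb) / (Gf - Gb)]; None if either corrected channel is <= 0."""
    red, green = rf - rb, gf - gb
    if red <= 0 or green <= 0:
        return None  # the log-ratio is undefined for this feature
    return math.log2(red / green)

# Hypothetical foreground/background intensities for two features
print(log_ratio(rf=1800.0, gf=600.0, rb=200.0, gb=200.0))  # (1800-200)/(600-200) = 4
print(log_ratio(rf=150.0, gf=600.0, rb=200.0, gb=200.0))   # red foreground below background
```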

Page 56: Carlo Colantuoni – ccolantu@jhsph

[Figure panels: Background subtraction; No background subtraction]

Page 57: Carlo Colantuoni – ccolantu@jhsph

Diagnostics: images of Rb, Gb, scatterplot of log2 (Rf/Gf) vs. log2(Rb/Gb)

Page 58: Carlo Colantuoni – ccolantu@jhsph

Correlation may be spatially dependent

Page 59: Carlo Colantuoni – ccolantu@jhsph

Two color platforms

• Again, we can assess the tradeoff of accuracy and precision via simulation

• Simulation uses a self versus self (SVS) hybridization experiment -- no differential expression should occur.

• Mean squared error (MSE) = bias^2 + variance.
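
The decomposition in the last bullet can be checked numerically. In a self versus self experiment the true log-ratio is 0, so bias, variance, and MSE can all be computed directly from the observed log-ratios. The simulated values below (a small offset plus Gaussian noise) are assumptions for illustration, not data from the slides.

```python
import random

def bias_variance_mse(log_ratios, truth=0.0):
    """Return (bias, variance, mse) of estimates against a known true value."""
    n = len(log_ratios)
    mean = sum(log_ratios) / n
    bias = mean - truth
    variance = sum((m - mean) ** 2 for m in log_ratios) / n
    mse = sum((m - truth) ** 2 for m in log_ratios) / n
    return bias, variance, mse

rng = random.Random(0)
# Hypothetical SVS log-ratios: the truth is 0, simulated with offset and noise
svs_log_ratios = [rng.gauss(0.1, 0.3) for _ in range(1000)]

bias, variance, mse = bias_variance_mse(svs_log_ratios)
print(bias, variance, mse, bias ** 2 + variance)
```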

Page 60: Carlo Colantuoni – ccolantu@jhsph

Lower MSE with no background subtraction (NBS) if correlation < 0.2

Page 61: Carlo Colantuoni – ccolantu@jhsph

Background Subtraction: Conclusions

• A procedure that subtracts local background as a function of the correlation of fg and bg ratios may be a nice compromise between background subtraction and no background subtraction.

• For references, see background subtraction paper by C. Kooperberg, J Computational Biol, 2002.

• Limma package in R has many useful functions for background subtraction.

• Following the decision to background subtract, we need to consider a normalization algorithm.

Page 62: Carlo Colantuoni – ccolantu@jhsph

Normalization

Page 63: Carlo Colantuoni – ccolantu@jhsph

Normalization

• Normalization is needed to ensure that differences in intensities are indeed due to differential expression, and not some printing, hybridization, or scanning artifact.

• Normalization is necessary before any analysis that involves within- or between-slide comparisons of intensities, e.g., clustering, testing.

• Somewhat different approaches are used in two-color and one-color technologies

Page 64: Carlo Colantuoni – ccolantu@jhsph

Varying distributions of intensities from each microarray.

Page 65: Carlo Colantuoni – ccolantu@jhsph

Distributions of intensities after global mean normalization.
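
One common form of global mean normalization can be sketched as follows: log-transform each array, then shift it so that every array has the same mean log intensity. Whether the slides centre on the mean, the median, or a fixed target is not stated, so the choice of the grand mean here is an assumption, as are the example intensities.

```python
import math

def global_mean_normalize(arrays):
    """Shift each array's log2 intensities so all arrays share the same mean."""
    logs = [[math.log2(x) for x in arr] for arr in arrays]
    means = [sum(a) / len(a) for a in logs]
    target = sum(means) / len(means)  # assumed target: grand mean over arrays
    return [[v - m + target for v in a] for a, m in zip(logs, means)]

# Two hypothetical arrays; the second is uniformly 3x brighter than the first
arrays = [[100.0, 200.0, 400.0], [300.0, 600.0, 1200.0]]
normalized = global_mean_normalize(arrays)
for arr in normalized:
    print([round(v, 3) for v in arr])
```

A single shift per array removes an overall brightness difference, but it cannot remove differences that depend on intensity.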

Page 66: Carlo Colantuoni – ccolantu@jhsph

What does this normalization mean in Int vs. Int, or Ratio vs. Int space?

Page 67: Carlo Colantuoni – ccolantu@jhsph

Distributions of intensities after global mean normalization – global mean normalization is not enough …

Possible solutions:

Local Mean Normalization

Quantile Normalization

Page 68: Carlo Colantuoni – ccolantu@jhsph

Local Mean Normalization (loess):

Adjusts for intensity-dependent bias in ratios.

Requires Comparison!
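
The idea of local mean centering can be sketched without a loess implementation. The Python example below is a simplified stand-in, not loess itself: it subtracts from each log-ratio M the plain average of M over the features nearest in log intensity A, whereas loess fits weighted local regressions. The function name, window size, and noise-free example data are assumptions for illustration.

```python
def local_mean_normalize(m_values, a_values, window=51):
    """Subtract from each M the mean M of the `window` features nearest in A-rank."""
    order = sorted(range(len(a_values)), key=lambda i: a_values[i])
    half = window // 2
    normalized = [0.0] * len(m_values)
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        neighbours = [m_values[j] for j in order[lo:hi]]
        normalized[i] = m_values[i] - sum(neighbours) / len(neighbours)
    return normalized

# Hypothetical data: log-ratio M drifts linearly with mean log intensity A
a_values = [6.0 + 8.0 * i / 199 for i in range(200)]
m_values = [0.3 * (a - 10.0) for a in a_values]

normalized = local_mean_normalize(m_values, a_values)
print(max(abs(m) for m in m_values), max(abs(m) for m in normalized))
```

Because the correction is estimated from M as a function of A, it needs a ratio to work on: two channels on one array, or two arrays compared with each other.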

Page 69: Carlo Colantuoni – ccolantu@jhsph
Page 70: Carlo Colantuoni – ccolantu@jhsph
Page 71: Carlo Colantuoni – ccolantu@jhsph

Loess

Page 72: Carlo Colantuoni – ccolantu@jhsph

Loess

Page 73: Carlo Colantuoni – ccolantu@jhsph

Loess

Page 74: Carlo Colantuoni – ccolantu@jhsph

Loess

Page 75: Carlo Colantuoni – ccolantu@jhsph

Loess

Page 76: Carlo Colantuoni – ccolantu@jhsph

Loess

Page 77: Carlo Colantuoni – ccolantu@jhsph

Quantile Normalization

Page 78: Carlo Colantuoni – ccolantu@jhsph

Quantile normalization

• All these non-linear methods perform similarly

• Quantiles is commonly used because it's fast and conceptually simple

• Basic idea:

– Order the values in each array

– Take the average across arrays at each rank

– Substitute each probe intensity with the average for its rank

– Put the values back in their original order

Page 79: Carlo Colantuoni – ccolantu@jhsph

Example of quantile normalization

Original      Ordered       Averaged      Re-ordered
2  4  4       2  3  4       3  3  3       3  5  3
5  4 14       3  4  8       5  5  5       8  5  8
4  6  8       3  4  8       5  5  5       6  8  5
3  5  8       4  5  9       6  6  6       5  6  5
3  3  9       5  6 14       8  8  8       5  3  6
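
The four steps of this example can be reproduced with a short Python sketch, where rows are probes and columns are arrays. The function name is hypothetical and ties are handled in the simplest way, with tied values taking consecutive ranks. Note that the last rank average is (5 + 6 + 14) / 3 ≈ 8.33, which the slide displays rounded to 8.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a list of rows (probes) x columns (arrays)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # For each array, the row indices sorted by intensity (stable for ties)
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    # Average across arrays at each rank
    rank_means = [
        sum(columns[j][orders[j][r]] for j in range(n_cols)) / n_cols
        for r in range(n_rows)
    ]
    # Substitute each value with its rank average, in the original order
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for rank, i in enumerate(orders[j]):
            result[i][j] = rank_means[rank]
    return result

original = [[2, 4, 4], [5, 4, 14], [4, 6, 8], [3, 5, 8], [3, 3, 9]]
for row in quantile_normalize(original):
    print([round(v, 2) for v in row])
```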

Page 80: Carlo Colantuoni – ccolantu@jhsph

Before Quantile Normalization

Page 81: Carlo Colantuoni – ccolantu@jhsph

After Quantile Normalization

A worry is that it over-corrects

Page 82: Carlo Colantuoni – ccolantu@jhsph

QC

Page 83: Carlo Colantuoni – ccolantu@jhsph
Page 84: Carlo Colantuoni – ccolantu@jhsph
Page 85: Carlo Colantuoni – ccolantu@jhsph
Page 86: Carlo Colantuoni – ccolantu@jhsph
Page 87: Carlo Colantuoni – ccolantu@jhsph

Print-tip Effect

Page 88: Carlo Colantuoni – ccolantu@jhsph

Print-tip Loess

Page 89: Carlo Colantuoni – ccolantu@jhsph

Plate effect

Page 90: Carlo Colantuoni – ccolantu@jhsph

Bad Plate Effect

Page 91: Carlo Colantuoni – ccolantu@jhsph

Bad Plate Effect

Page 92: Carlo Colantuoni – ccolantu@jhsph

Print Order Effect

Page 93: Carlo Colantuoni – ccolantu@jhsph

Microarray Pseudo Images: Intensity

Page 94: Carlo Colantuoni – ccolantu@jhsph

Microarray Pseudo Images: Ratios

Page 95: Carlo Colantuoni – ccolantu@jhsph

Images of probe level data

This is the raw data

Page 96: Carlo Colantuoni – ccolantu@jhsph

Images of probe level data

Residuals (or weights) from probe level model fits show problem clearly

Page 97: Carlo Colantuoni – ccolantu@jhsph

Hybridization Artifacts

Page 98: Carlo Colantuoni – ccolantu@jhsph
Page 99: Carlo Colantuoni – ccolantu@jhsph
Page 100: Carlo Colantuoni – ccolantu@jhsph

PCA, MDS, and Clustering: Dimension Reduction to Detect Experimental Artifacts and Biological Effects

Page 101: Carlo Colantuoni – ccolantu@jhsph

Principal Components Analysis (PCA)

and

Multi-Dimensional Scaling (MDS)

Page 102: Carlo Colantuoni – ccolantu@jhsph

PCA

Page 103: Carlo Colantuoni – ccolantu@jhsph

MDS

Page 104: Carlo Colantuoni – ccolantu@jhsph
Page 105: Carlo Colantuoni – ccolantu@jhsph
Page 106: Carlo Colantuoni – ccolantu@jhsph
Page 107: Carlo Colantuoni – ccolantu@jhsph
Page 108: Carlo Colantuoni – ccolantu@jhsph

Uncorrected Intensities: MDS Colored by Batch

Page 109: Carlo Colantuoni – ccolantu@jhsph

Removing The Batch Effect

Much Like Red:Green Analysis
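
One simple reading of batch subtraction, by analogy with taking a red:green log-ratio against a reference, is to subtract from each sample the mean log intensity of its own batch, gene by gene. The slides do not spell out the exact procedure, so the Python sketch below is an assumption, with hypothetical names and values.

```python
from collections import defaultdict

def subtract_batch_means(values, batches):
    """For one gene: subtract from each sample the mean of its own batch."""
    groups = defaultdict(list)
    for v, b in zip(values, batches):
        groups[b].append(v)
    means = {b: sum(vs) / len(vs) for b, vs in groups.items()}
    return [v - means[b] for v, b in zip(values, batches)]

# Hypothetical log2 intensities for one gene across four arrays in two batches
values = [8.0, 8.4, 9.0, 9.6]
batches = ["A", "A", "B", "B"]
print([round(v, 2) for v in subtract_batch_means(values, batches)])
```

If a biological group is confined to one batch, this centering removes the biological difference along with the batch effect.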

Page 110: Carlo Colantuoni – ccolantu@jhsph

Uncorrected Intensities: MDS Colored by Batch

Page 111: Carlo Colantuoni – ccolantu@jhsph

Batch Subtracted Measures: MDS Colored by Batch

Page 112: Carlo Colantuoni – ccolantu@jhsph

MDS of All Array Experiments: Subject Replicates

Page 113: Carlo Colantuoni – ccolantu@jhsph
Page 114: Carlo Colantuoni – ccolantu@jhsph

AGE

?

Page 115: Carlo Colantuoni – ccolantu@jhsph
Page 116: Carlo Colantuoni – ccolantu@jhsph
Page 117: Carlo Colantuoni – ccolantu@jhsph
Page 118: Carlo Colantuoni – ccolantu@jhsph
Page 119: Carlo Colantuoni – ccolantu@jhsph
Page 120: Carlo Colantuoni – ccolantu@jhsph
Page 121: Carlo Colantuoni – ccolantu@jhsph
Page 122: Carlo Colantuoni – ccolantu@jhsph

[Figure axes: AGE vs. RNA Quality]

Page 123: Carlo Colantuoni – ccolantu@jhsph

[Figure axes: AGE vs. Batch]

Page 124: Carlo Colantuoni – ccolantu@jhsph

Biological Effects: Tissue Types and Growth Factor Treatments

Page 125: Carlo Colantuoni – ccolantu@jhsph

Illumina 24K

Page 126: Carlo Colantuoni – ccolantu@jhsph
Page 127: Carlo Colantuoni – ccolantu@jhsph
Page 128: Carlo Colantuoni – ccolantu@jhsph
Page 129: Carlo Colantuoni – ccolantu@jhsph
Page 130: Carlo Colantuoni – ccolantu@jhsph
Page 131: Carlo Colantuoni – ccolantu@jhsph
Page 132: Carlo Colantuoni – ccolantu@jhsph