Summarization of Oligonucleotide Expression Arrays BIOS 691-803 Winter 2010.

What is Summarization?

• Some expression arrays (Affymetrix, Nimblegen) use multiple probes to target a single transcript – a ‘probe set’

• Typically probes have different fold changes between any two samples

• How to effectively summarize the information in a probe set?

Many Probes for One Gene

[Figure: gene sequence (5´ to 3´) covered by multiple oligo probes; labels: Perfect Match, Mismatch]

How to combine signals from multiple probes into a single gene abundance estimate?

Probe Variation

• Individual probes don’t agree on fold changes

• Probes vary by two orders of magnitude on each chip
  – CG content is most important factor in signal strength

Signal from 16 probes along one gene on one chip

Probe Measure Variation

• Typical probes are two orders of magnitude different!

• CG content is most important factor

• RNA target folding also affects hybridization

[Figure: probe signal intensities, axis from 0 to 3×10^4]

Bioinformatics Issues

• Probes may not map accurately

• SNPs in probes

• Affymetrix places most probes in 3’ UTR of genes
  – Alternate Poly-A sites mean that some probe targets may really be less common than others

Probe Mapping

• Early builds of the genome often confused regions or genes and their complements

• Probe sets at right target an rRNA gene and its complement

Alternate Poly-Adenylation Sites

Poly-A marks mRNA ‘tail’

Many genes have alternatives

3’ UTR may be longer or shorter

Alternate Polyadenylation of MID1

Many Approaches to Summarization

• Affymetrix MicroArray Suite; PLiER

• dChip - Li and Wong, HSPH

• Bioconductor:
  – RMA - Bolstad, Irizarry, Speed, et al
  – affyPLM – Bolstad
  – gcRMA – Wu

• Physical chemistry models – Zhang et al

• Factor model

• Probe-weighting

Critique of Averaging (MAS5)

• Not clear what an average of different probes should mean

• Tukey bi-weight can be unstable when data cluster at either end – frequently the case here

• No ‘learning’ based on cross-chip performance of individual probes
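The instability is easy to reproduce. Below is a minimal sketch of a one-step Tukey biweight average. The tuning constants `c = 5` and `eps = 1e-4` are assumed here as commonly quoted MAS5-style defaults, and the probe values are invented for illustration. Moving a single probe from the low cluster to the high cluster flips the estimate by about five log units.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight average (c and eps are assumed defaults)."""
    m = median(values)
    s = median(abs(v - m) for v in values)   # median absolute deviation
    total_w = 0.0
    total_wx = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)          # scaled distance from the median
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        total_w += w
        total_wx += w * v
    return total_wx / total_w

# Hypothetical log2 probe signals clustered at both ends of the range.
mostly_low = [6.0, 6.1, 6.2, 6.3, 11.0, 11.1, 11.2]
mostly_high = [6.0, 6.1, 6.2, 11.0, 11.1, 11.2, 11.3]

low_estimate = tukey_biweight(mostly_low)
high_estimate = tukey_biweight(mostly_high)
```

In both cases the median lands inside the majority cluster, the MAD is small, and the minority cluster receives zero weight.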

Motivation for multi-chip models:

Probe-level data from spike-in study (log scale); note parallel trend of all probes

Courtesy of Terry Speed

Model for Probe Signal

• Each probe signal is proportional to
  – i) the amount of target sample – a
  – ii) the affinity of the specific probe sequence to the target – f

• NB: High affinity is not the same as Specificity
  – Probe can give high signal to intended target and also to other transcripts

[Figure: probes 1, 2, 3 with affinities f1, f2, f3, measured on chip 1 and chip 2 with target amounts a1 and a2]

Multiplicative Model

• For each gene, a set of probes p1,…,pk

• Each probe pj binds the gene with efficiency fj

• In each sample there is an amount ai.

• Probe intensity should be proportional to fj × ai

• Always some noise!
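A small simulation makes the model concrete. This is a sketch only: the amounts, efficiencies, and the choice of log-normal noise are all assumptions made for illustration, not values from the course.

```python
import random

def simulate_probe_set(amounts, efficiencies, sigma=0.1, seed=0):
    """Return pm[i][j] = a_i * f_j * noise, with log-normal noise (assumed)."""
    rng = random.Random(seed)
    return [
        [a * f * rng.lognormvariate(0.0, sigma) for f in efficiencies]
        for a in amounts
    ]

amounts = [100.0, 400.0]          # hypothetical target amounts in two samples
efficiencies = [0.5, 2.0, 10.0]   # hypothetical probe binding efficiencies

pm = simulate_probe_set(amounts, efficiencies)

# Each probe gives its own noisy estimate of the true fold change (4x here).
fold_changes = [pm[1][j] / pm[0][j] for j in range(len(efficiencies))]

# With the noise switched off, the table is exactly the outer product a_i * f_j.
noise_free = simulate_probe_set(amounts, efficiencies, sigma=0.0)
```

Within one sample the three probes differ 20-fold in raw signal, yet every probe reports roughly the same ratio between samples, which is what a summary method should exploit.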

Robust Linear Models

• Criterion of fit
  – Least median squares
  – Sum of weighted squares
  – Least squares and throw out outliers

• Method for finding fit
  – High-dimensional search
  – Iteratively re-weighted least squares
  – Median Polish
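To illustrate iteratively re-weighted least squares, here is a minimal sketch for a straight-line fit, which stands in for the probe-set model to keep the code short. The Huber weight function, the tuning constant `k = 1.345`, and the MAD-based scale are assumptions; the slide does not name a particular weighting scheme.

```python
from statistics import median

def weighted_line_fit(x, y, w):
    """Closed-form weighted least squares for y = intercept + slope * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def irls_huber_line(x, y, k=1.345, n_iter=50):
    """Refit repeatedly, down-weighting points with large residuals."""
    w = [1.0] * len(x)
    intercept, slope = weighted_line_fit(x, y, w)
    for _ in range(n_iter):
        r = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        med = median(r)
        scale = median(abs(ri - med) for ri in r) / 0.6745  # robust scale
        if scale < 1e-12:        # remaining residuals are essentially zero
            break
        w = [min(1.0, k * scale / abs(ri)) if ri != 0 else 1.0 for ri in r]
        intercept, slope = weighted_line_fit(x, y, w)
    return intercept, slope, w

# Points on y = 2 + 3x with one gross outlier at the last position.
x = [float(i) for i in range(10)]
y = [2.0 + 3.0 * xi for xi in x]
y[9] += 50.0

ols_intercept, ols_slope = weighted_line_fit(x, y, [1.0] * len(x))
rob_intercept, rob_slope, weights = irls_huber_line(x, y)
```

Ordinary least squares is pulled strongly toward the outlier, while the re-weighted fit ends up giving that point very little weight.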

Bolstad, Irizarry, Speed – (RMA)

• For each probe set, take log of PMij = ai fj:

$$\log(PM_{ij}) = \log(a_i) + \log(f_j)$$

• then fit the model:

$$\log(\widehat{PM}_{ij}) = \mu_i + \alpha_j + \varepsilon_{ij}$$

• where caret represents “after pre-processing”, μi = log(ai) is the chip effect, αj = log(fj) is the probe effect, and εij is noise

• Fit this additive model by iteratively re-weighted least-squares or median polish

Critique: Model assumes probe noise is constant (homoscedastic) on log scale
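Median polish on the additive log-scale model can be sketched in a few lines. This is an illustration under assumptions, not the Bioconductor implementation: rows are taken to be chips and columns probes, the input is assumed to be already background-corrected and normalized, and the data are invented.

```python
from statistics import median

def median_polish(y, max_iter=10, tol=1e-6):
    """Fit y[i][j] ~ overall + row[i] + col[j] by alternating median sweeps."""
    n_rows, n_cols = len(y), len(y[0])
    resid = [[float(v) for v in r] for r in y]
    overall = 0.0
    row = [0.0] * n_rows
    col = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        delta = median(col)
        col = [c - delta for c in col]
        overall += delta
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        col = [c + m for c, m in zip(col, col_med)]
        resid = [[resid[i][j] - col_med[j] for j in range(n_cols)]
                 for i in range(n_rows)]
        delta = median(row)
        row = [r - delta for r in row]
        overall += delta
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row, col, resid

# Hypothetical log2 data: 3 chips x 4 probes, exactly additive plus one outlier.
log_a = [8.0, 9.0, 11.0]          # chip effects
log_f = [-1.0, 0.0, 0.5, 2.0]     # probe effects
y = [[a + f for f in log_f] for a in log_a]
y[0][2] += 5.0                    # one aberrant probe on chip 1

overall, chip, probe, resid = median_polish(y)
expression = [overall + c for c in chip]   # per-chip summary on the log scale
```

The outlier ends up entirely in the residuals, so the differences between chip summaries match the true log-scale differences (1 and 3).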

Comparing Measures

20 replicate arrays – variance should be small. Standard deviations of expression estimates on arrays, arranged in four groups of genes by increasing mean expression level

Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA

Courtesy of Terry Speed

Background

• 25-mers are prone to cross-hybridization

• MM > PM for about 1/3 of all probes

• Cross-hybridization varies with GC content

• Signal intensity varies with cross-hybridization

The gcRMA Approach

• Estimate non-specific binding using either:
  – True null assay (non-homologous RNA)
  – Estimates from MM

• Subtract background before normalization and fitting model
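The idea of subtracting a GC-dependent background can be sketched as follows. This is a deliberately simplified stand-in and not the gcRMA model: it takes the median MM signal among probes with the same GC count as the non-specific binding estimate. The function names, the floor value, and the data are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def background_by_gc(mm, gc_count):
    """Median MM intensity for each GC count (crude non-specific estimate)."""
    groups = defaultdict(list)
    for value, gc in zip(mm, gc_count):
        groups[gc].append(value)
    return {gc: median(values) for gc, values in groups.items()}

def subtract_background(pm, gc_count, bg_by_gc, floor=1.0):
    """Subtract the GC-matched background, keeping values positive for logs."""
    return [max(p - bg_by_gc[gc], floor) for p, gc in zip(pm, gc_count)]

# Hypothetical probes: GC count, mismatch signal, perfect-match signal.
gc_count = [10, 10, 12, 12]
mm = [50, 70, 200, 220]
pm = [100, 500, 210, 1000]

bg = background_by_gc(mm, gc_count)
corrected = subtract_background(pm, gc_count, bg)
```

The floor keeps corrected values positive so that the log transform in the next step stays defined; a PM signal no larger than its estimated background carries essentially no information about the target.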

Evaluating gcRMA

• On AffyComp data sets, gcRMA wins
  – Replicates with 14 spike-ins done by Affy

• Many investigators get crappy results (and don’t write it up)

• gcRMA does very well on highly expressed genes, not nearly so well on less expressed genes

• Gharaibeh et al. BMC Bioinformatics 2008 9:452

Factor Model

• Assume relation between p observations x and true value z: $x = z + \varepsilon$, where the $\varepsilon_i$ are independent

• Use factor analytic methods to estimate z
  – Depends on assuming z ~ Normal
  – Differs from RMA in relaxing assumption of IID errors – some probes can have more random error than others

Weighting Probes

• It is clear that some probes are more reliable than others

• How to assess this in a simple fashion?

• If a gene really changes across arrays, then a responsive probe will change more than a noisy probe

• Weight by relative ranges

• Best performance on AffyComp!
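Weighting by relative ranges can be sketched very simply. This is an illustration of the idea only, not the published DFW algorithm: each probe's weight is its range across arrays divided by the sum of all ranges, and the summary is a weighted mean of log signals. The function names and data are hypothetical.

```python
def range_weights(log_pm):
    """log_pm[i][j]: log signal of probe j on array i. Weights sum to 1."""
    n_probes = len(log_pm[0])
    ranges = [
        max(row[j] for row in log_pm) - min(row[j] for row in log_pm)
        for j in range(n_probes)
    ]
    total = sum(ranges)
    if total == 0:                       # no probe changes: weight equally
        return [1.0 / n_probes] * n_probes
    return [r / total for r in ranges]

def weighted_summary(log_pm, weights):
    """One weighted-mean expression value per array."""
    return [sum(w * v for w, v in zip(weights, row)) for row in log_pm]

# Hypothetical probe set: 2 arrays x 3 probes; the third probe never responds.
log_pm = [
    [8.0, 8.0, 8.0],
    [10.0, 9.0, 8.0],
]

weights = range_weights(log_pm)
summary = weighted_summary(log_pm, weights)
```

The unresponsive third probe gets zero weight, so the summarized change between arrays (about 1.67 log units) is driven by the two probes that actually moved.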

Summary and Evaluation

• No one best solution for all situations

• gcRMA and DFW seem to do very well on AffyComp data
  – May need weights for DFW by tissue

• Leading methods seem to rely on probe weighting