Exploiting the EMERALD mixture design for model based...

Exploiting the EMERALD mixture design formodel based microarray platform comparisons by

Bayesian inference of technical and biologicalvariance components

Thomas Tuchler1 Florian Klinglmuller2 Peter Sykacek1

David P. Kreil1

1 WWTF Chair of Bioinformatics, BOKU University, 1190 Vienna, Austria.2 Medical Statistics and Informatics, Medical University of Vienna, Spitalgasse 23,

1090 Vienna, Austria

5 December 2008

Thomas Tuchler WWTF Chair of Bioinformatics

CAMDA 08

Evaluation approaches

I Spike-based approachesKnowing concentration of particular RNA species

I Non spike-based approachesKnowing features about whole samples

→ Ability to detect subtle biological differences


CAMDA 08

The EMERALD dataset

I 6 rats (genetically different)

I 4 titrations of two tissues(liver to kidney 4:0, 3:1, 1:3, 0:4)

I 3 technical replicates per sample

I 3 Platforms(Affymetrix, Agilent, Illumina)

I 216 microarrays

→ Biological vs. technical variance


CAMDA 08

Outline

I PreprocessingI Comparable measurementsI Clean data

I ModellingI Exploiting mixture design

I InvestigatingI Biological vs. technical varianceI Impact of signal intensity and

Normalization


CAMDA 08

Exploring the data

All platforms are affected by outlier slides

4 6 8 10 12

4

6

8

10

12

E−TABM−555−raw−data−1660806719.txt

E−

TA

BM

−55

5−ra

w−

data

−16

6080

6501

.txt

Rat 5 on Agilent

15

101418232732364045495458626771

Counts

4 6 8 10 12

4

6

8

10

12


E−

TA

BM

−55

5−ra

w−

data

−16

6080

6616

.txt

Rat 5 on Agilent

110202938485767768595104114123132142151

Counts

4 6 8 10 12

4

6

8

10

12


E−

TA

BM

−55

5−ra

w−

data

−16

6080

6501

.txt

Rat 5 on Agilent

17

131925313743495561677379859197

Counts

I Affymetrix: 3B-3

I Agilent: third replicate series (hybridization names ‘. . . -3’)conducted by operator ‘A’ (high ‘AmpLabelingInputMass’)

I Illumina: low cRNA yield (3 of 6), plate location 3 (4 of 6)and hybridization date 10/04/08 (5 of 6)


CAMDA 08

Modelling the mixture factor

Mixing total RNA

(from Shippy, 2006)

Measuring mRNA

CmRNA = 0.75 · a · AtotRNA +

0.25 · b · BtotRNA

ρ = b/a

ρ = 2/3 A B C D

Liver 100 82 33 0Kidney 0 18 67 100

→ Titration 6= Mixture ratio


CAMDA 08

Modelling the EMERALD data

Lig

Liver Kidney

individual i

gene g

xL

Lg Lg

t

xM1xM2

xK

Kig

Kg Kg

Directed Acyclic Graph (DAG)

µLig ∼ N(µLg , λ

−1Lg

)µKig ∼ N

(µKg , λ

−1Kg

)xLig ∼ N

(µLig , λ

−1t

)xKig ∼ N

(µKig , λ

−1t

)mig = f (µLig , µKig , ρ)

xM1ig ∼ N(m1ig , λ

−1t

)xM2ig ∼ N

(m2ig , λ

−1t

)


CAMDA 08

Inference using Variational Bayes

Approximate true posteriorp (θ|x) by factorized Q(θ):

Q(θ) =∏i

Qi (θi )

Optimizing free energybetween p (θ|x) and Q(θ):

FE = −∫

Q(θ) lnQ(θ)

p (x , θ)dθ

(a) �0.2

0.3

0.4

0.50.60.70.80.9

1

0 0.5 1 1.5 2�(b) �

0.2

0.3

0.4

0.50.60.70.80.9

1

0 0.5 1 1.5 2�( ) �

0.2

0.3

0.4

0.50.60.70.80.9

1

0 0.5 1 1.5 2�(d) �0.2

0.3

0.4

0.50.60.70.80.9

1

0 0.5 1 1.5 2�(e) �

0.2

0.3

0.4

0.50.60.70.80.9

1

0 0.5 1 1.5 2�: : : (f) �0.2

0.3

0.4

0.50.60.70.80.9

1

0 0.5 1 1.5 2�Updates for µ and σ of a Gaussian(from MacKay, 2003)


CAMDA 08

Validating the implementation

●●

●

●

●

●

●●

● ● ●

−40

000

−20

000

Mixing factor

rho

FE

0 2 4 6 8 10 12 14 16

●

● ● ● ●

1 2 3 4 5

−22

000

−17

000

Convergence

FE: −15800iteration

FE

−500 0 500 1000 1500

200

600

1000

Mean expression of genes

True

Infe

rred

−500 0 500 1000 1500

200

600

1000

Mean expression of genes

Individuum 2True

Infe

rred

Liver

mean: 0.696 EvgL>Evt: 89 %Biological variance contrib.

Den

sity

0.0 0.2 0.4 0.6 0.8 1.0

02

4

Kidney

mean: 0.809 EvgK>Evt: 97 %Biological variance contrib.

Den

sity

0.0 0.2 0.4 0.6 0.8 1.0

02

4

Simulation validating retrieval and conv.

Pros

I Full posteriordistributions

I Fast convergence

I Monitor convergence

I Model comparisonyard stick

Cons

I Approximation


CAMDA 08

Results within platform and intensity

Low Signal

Biological variance fraction

Den

sity

0.0 0.5 1.0

01

23

High Signal

Biological variance fraction

Den

sity

0.0 0.5 1.0

01

23

0 <varbio

varbio + vartec< 1

I Biological variancedetectable

I Biological < technicalvariance

I More mRNA inkidney samples; ρ > 1(cf. Liggett, 2008)


CAMDA 08

Results across platforms and intensities0

1020

3040

50

base Out

Intensity

Bio

logi

cal v

aria

nce

cont

r. [%

]

low medium high

AffymetrixAgilentIllumina

Baseline

010

2030

4050

quan Out

Intensity

Bio

logi

cal v

aria

nce

cont

r. [%

]

low medium high


Quantile

010

2030

4050

vsn Out

Intensity

Bio

logi

cal v

aria

nce

cont

r. [%

]

low medium high


VSN

Percentage of genes with biological > technical variance

I Platform and intensity dependent

I Normalization dependent


CAMDA 08

Conclusions and Outlook

Methods

I Detecting biological differences as performance measure

I VB model exploiting mixture design

→ Efficient Bayesian inference for genome size data

EMERALD dataset

I Biological variance detectable

I Platform and intensity dependence


CAMDA 08

Conclusions and Outlook

Methods

I Detecting biological differences as performance measure

I VB model exploiting mixture design→ Efficient Bayesian inference for genome size data

EMERALD dataset

I Biological variance detectable

I Platform and intensity dependence


CAMDA 08

Exploiting the EMERALD mixture design for model based...

Documents

Transcript of Exploiting the EMERALD mixture design for model based...