Exploiting the EMERALD mixture design for model based...
Transcript of Exploiting the EMERALD mixture design for model based...
Exploiting the EMERALD mixture design formodel based microarray platform comparisons by
Bayesian inference of technical and biologicalvariance components
Thomas Tuchler1 Florian Klinglmuller2 Peter Sykacek1
David P. Kreil1
1 WWTF Chair of Bioinformatics, BOKU University, 1190 Vienna, Austria.2 Medical Statistics and Informatics, Medical University of Vienna, Spitalgasse 23,
1090 Vienna, Austria
5 December 2008
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Evaluation approaches
I Spike-based approachesKnowing concentration of particular RNA species
I Non spike-based approachesKnowing features about whole samples
→ Ability to detect subtle biological differences
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Evaluation approaches
I Spike-based approachesKnowing concentration of particular RNA species
I Non spike-based approachesKnowing features about whole samples
→ Ability to detect subtle biological differences
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
The EMERALD dataset
I 6 rats (genetically different)
I 4 titrations of two tissues(liver to kidney 4:0, 3:1, 1:3, 0:4)
I 3 technical replicates per sample
I 3 Platforms(Affymetrix, Agilent, Illumina)
I 216 microarrays
→ Biological vs. technical variance
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Outline
I PreprocessingI Comparable measurementsI Clean data
I ModellingI Exploiting mixture design
I InvestigatingI Biological vs. technical varianceI Impact of signal intensity and
Normalization
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Exploring the data
All platforms are affected by outlier slides
4 6 8 10 12
4
6
8
10
12
E−TABM−555−raw−data−1660806719.txt
E−
TA
BM
−55
5−ra
w−
data
−16
6080
6501
.txt
Rat 5 on Agilent
15
101418232732364045495458626771
Counts
4 6 8 10 12
4
6
8
10
12
E−TABM−555−raw−data−1660806719.txt
E−
TA
BM
−55
5−ra
w−
data
−16
6080
6616
.txt
Rat 5 on Agilent
110202938485767768595104114123132142151
Counts
4 6 8 10 12
4
6
8
10
12
E−TABM−555−raw−data−1660806616.txt
E−
TA
BM
−55
5−ra
w−
data
−16
6080
6501
.txt
Rat 5 on Agilent
17
131925313743495561677379859197
Counts
I Affymetrix: 3B-3
I Agilent: third replicate series (hybridization names ‘. . . -3’)conducted by operator ‘A’ (high ‘AmpLabelingInputMass’)
I Illumina: low cRNA yield (3 of 6), plate location 3 (4 of 6)and hybridization date 10/04/08 (5 of 6)
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Modelling the mixture factor
Mixing total RNA
(from Shippy, 2006)
Measuring mRNA
CmRNA = 0.75 · a · AtotRNA +
0.25 · b · BtotRNA
ρ = b/a
ρ = 2/3 A B C D
Liver 100 82 33 0Kidney 0 18 67 100
→ Titration 6= Mixture ratio
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Modelling the mixture factor
Mixing total RNA
(from Shippy, 2006)
Measuring mRNA
CmRNA = 0.75 · a · AtotRNA +
0.25 · b · BtotRNA
ρ = b/a
ρ = 2/3 A B C D
Liver 100 82 33 0Kidney 0 18 67 100
→ Titration 6= Mixture ratio
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Modelling the EMERALD data
Lig
Liver Kidney
individual i
gene g
xL
Lg Lg
t
xM1xM2
xK
Kig
Kg Kg
Directed Acyclic Graph (DAG)
µLig ∼ N(µLg , λ
−1Lg
)µKig ∼ N
(µKg , λ
−1Kg
)xLig ∼ N
(µLig , λ
−1t
)xKig ∼ N
(µKig , λ
−1t
)mig = f (µLig , µKig , ρ)
xM1ig ∼ N(m1ig , λ
−1t
)xM2ig ∼ N
(m2ig , λ
−1t
)
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Inference using Variational Bayes
Approximate true posteriorp (θ|x) by factorized Q(θ):
Q(θ) =∏i
Qi (θi )
Optimizing free energybetween p (θ|x) and Q(θ):
FE = −∫
Q(θ) lnQ(θ)
p (x , θ)dθ
(a) �0.2
0.3
0.4
0.50.60.70.80.9
1
0 0.5 1 1.5 2�(b) �
0.2
0.3
0.4
0.50.60.70.80.9
1
0 0.5 1 1.5 2�( ) �
0.2
0.3
0.4
0.50.60.70.80.9
1
0 0.5 1 1.5 2�(d) �0.2
0.3
0.4
0.50.60.70.80.9
1
0 0.5 1 1.5 2�(e) �
0.2
0.3
0.4
0.50.60.70.80.9
1
0 0.5 1 1.5 2�: : : (f) �0.2
0.3
0.4
0.50.60.70.80.9
1
0 0.5 1 1.5 2�Updates for µ and σ of a Gaussian(from MacKay, 2003)
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Validating the implementation
●●
●
●
●
●
●●
● ● ●
−40
000
−20
000
Mixing factor
rho
FE
0 2 4 6 8 10 12 14 16
●
● ● ● ●
1 2 3 4 5
−22
000
−17
000
Convergence
FE: −15800iteration
FE
−500 0 500 1000 1500
200
600
1000
Mean expression of genes
True
Infe
rred
−500 0 500 1000 1500
200
600
1000
Mean expression of genes
Individuum 2True
Infe
rred
Liver
mean: 0.696 EvgL>Evt: 89 %Biological variance contrib.
Den
sity
0.0 0.2 0.4 0.6 0.8 1.0
02
4
Kidney
mean: 0.809 EvgK>Evt: 97 %Biological variance contrib.
Den
sity
0.0 0.2 0.4 0.6 0.8 1.0
02
4
Simulation validating retrieval and conv.
Pros
I Full posteriordistributions
I Fast convergence
I Monitor convergence
I Model comparisonyard stick
Cons
I Approximation
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Results within platform and intensity
Low Signal
Biological variance fraction
Den
sity
0.0 0.5 1.0
01
23
High Signal
Biological variance fraction
Den
sity
0.0 0.5 1.0
01
23
0 <varbio
varbio + vartec< 1
I Biological variancedetectable
I Biological < technicalvariance
I More mRNA inkidney samples; ρ > 1(cf. Liggett, 2008)
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Results across platforms and intensities0
1020
3040
50
base Out
Intensity
Bio
logi
cal v
aria
nce
cont
r. [%
]
low medium high
AffymetrixAgilentIllumina
Baseline
010
2030
4050
quan Out
Intensity
Bio
logi
cal v
aria
nce
cont
r. [%
]
low medium high
AffymetrixAgilentIllumina
Quantile
010
2030
4050
vsn Out
Intensity
Bio
logi
cal v
aria
nce
cont
r. [%
]
low medium high
AffymetrixAgilentIllumina
VSN
Percentage of genes with biological > technical variance
I Platform and intensity dependent
I Normalization dependent
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Results across platforms and intensities0
1020
3040
50
base Out
Intensity
Bio
logi
cal v
aria
nce
cont
r. [%
]
low medium high
AffymetrixAgilentIllumina
Baseline
010
2030
4050
quan Out
Intensity
Bio
logi
cal v
aria
nce
cont
r. [%
]
low medium high
AffymetrixAgilentIllumina
Quantile
010
2030
4050
vsn Out
Intensity
Bio
logi
cal v
aria
nce
cont
r. [%
]
low medium high
AffymetrixAgilentIllumina
VSN
Percentage of genes with biological > technical variance
I Platform and intensity dependent
I Normalization dependent
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Results across platforms and intensities0
1020
3040
50
base Out
Intensity
Bio
logi
cal v
aria
nce
cont
r. [%
]
low medium high
AffymetrixAgilentIllumina
Baseline
010
2030
4050
quan Out
Intensity
Bio
logi
cal v
aria
nce
cont
r. [%
]
low medium high
AffymetrixAgilentIllumina
Quantile
010
2030
4050
vsn Out
Intensity
Bio
logi
cal v
aria
nce
cont
r. [%
]
low medium high
AffymetrixAgilentIllumina
VSN
Percentage of genes with biological > technical variance
I Platform and intensity dependent
I Normalization dependent
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Conclusions and Outlook
Methods
I Detecting biological differences as performance measure
I VB model exploiting mixture design
→ Efficient Bayesian inference for genome size data
EMERALD dataset
I Biological variance detectable
I Platform and intensity dependence
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Conclusions and Outlook
Methods
I Detecting biological differences as performance measure
I VB model exploiting mixture design→ Efficient Bayesian inference for genome size data
EMERALD dataset
I Biological variance detectable
I Platform and intensity dependence
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Conclusions and Outlook
Methods
I Detecting biological differences as performance measure
I VB model exploiting mixture design→ Efficient Bayesian inference for genome size data
EMERALD dataset
I Biological variance detectable
I Platform and intensity dependence
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08
Conclusions and Outlook
Methods
I Detecting biological differences as performance measure
I VB model exploiting mixture design→ Efficient Bayesian inference for genome size data
EMERALD dataset
I Biological variance detectable
I Platform and intensity dependence
Thomas Tuchler WWTF Chair of Bioinformatics
CAMDA 08