Sylvia Richardson Centre for Biostatistics Imperial College, London
1
Sylvia Richardson
Centre for Biostatistics
Imperial College, London
Bayesian hierarchical modelling of genomic data
In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s)
Helen Causton and Tim Aitman (Hammersmith)
Peter Green (Bristol)
Philippe Broët (INSERM, Paris)
BBSRC Exploiting Genomics grant
2
Outline
• Introduction
• A fully Bayesian gene expression index (BGX)
• Differential expression and array effects
• Mixture models
• Discussion
3
Part 1 Introduction
• Recent developments in genomics have led to techniques
– Capable of interrogating the genome at different levels
– Aiming to capture one or several stages of the biological process
DNA → mRNA → protein → phenotype
4
DNA -> mRNA -> protein
Pictures from http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html
Protein-encoding genes are transcribed into mRNA (messenger), and the mRNA is translated to make proteins
Fundamental process
5
DNA microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell, under particular conditions
The fundamental principle used to measure expression is hybridisation between a sample and probes:
– Known sequences of single-stranded DNA representing genes are immobilised on the microarray
– Tissue sample (with unknown concentration of RNA) is fluorescently labelled
– Sample is hybridised to the array
– Array is scanned to measure the amount of RNA present for each sequence
The expression levels of tens of thousands of probes are measured on a single microarray!
gene expression profile
What are gene expression data?
gene expression measure
6
Variation and uncertainty
Gene expression data (e.g. Affymetrix) are the result of multiple sources of variability:
• condition/treatment
• biological
• array manufacture
• imaging
• technical
• gene specific variability of the probes for a gene
• within/between array variation
Structured statistical modelling allows all sources of uncertainty to be considered at once
7
Example of within vs between strains gene variability
• 7 cross-bred strains of mice that differ only by a small portion of chromosome 1
• Strains have different phenotypes related to immunological disorders
• For each line, 9 animals were used to obtain 3 pooled RNA extracts from spleen → 7 × 3 samples
Excellent experimental design to minimise “biological variability between replicate animals”
Aim: to tease out differences between expression profiles of the 7 lines of mice and relate these to locations on chromosome 1
8
Biological variability is large !
Total variance calculated over the 21 samples
Average (over the 7 groups) of within strain variance calculated from the 3 pooled samples
Ratio within/total
9
1000 genes most variable between strains: hierarchical clustering recovers the cross-bred lines structure
Random set of 1000 genes
10
Common characteristics of genomics data sets
• High dimensional data (tens of thousands of genes) and few samples
• Many sources of variability (low signal/noise ratio)
Common issues
• Pre-processing and data reduction
• Multiple testing
• Need to borrow information
• Importance of including prior biological knowledge
11
Part 2
• Introduction
• A fully Bayesian gene expression index (BGX)
– Single array model
– Multiple array model
• Differential expression and array effects
• Mixture models
• Discussion
12
A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays
Anne Mette Hein, SR, Helen Causton, Graeme Ambler, Peter Green
[Diagram: raw intensities (PM, MM probe pairs) → background correction and gene specific variability (probe) → gene index BGX]
13
[Figure, slide courtesy of Affymetrix: image of hybridised array with zoom, showing a hybridised spot, an expressed PM and a non-expressed PM]
Each gene g is represented by a probe set (J = 11–20 probe pairs):
Perfect match: $PM_{g1}, \dots, PM_{gJ}$
Mis-match: $MM_{g1}, \dots, MM_{gJ}$
→ expression measure for gene g
Affymetrix GeneChips:
14
Commonly used methods for estimating expression levels from GeneChips
MAS5:
• uses PMs and MMs. Imputes ideal mismatches (IMs) from the MMs so that all PM − IM differences are positive
• gene expression measure: estimate obtained by applying Tukey biweight to the set of log(PM − IM) values in the probe set
Characteristics: positive, robust, noisy at low levels
RMA:
• uses PMs only
• fits a model with additive gene and probe effects to log-scale background-corrected PMs using median polish
Characteristics: positive, robust, attenuated signal detection
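The median polish step named for RMA can be sketched as below. This is a minimal illustration of Tukey's median polish on a small probes × arrays table of log-scale PM values, not the RMA implementation itself; the function name `median_polish`, the fixed iteration count and the toy table are assumptions made for the example.

```python
from statistics import median


def median_polish(table, n_iter=10):
    """Tukey median polish of a rows x cols table (list of lists).

    Rows play the role of probes and columns the role of arrays.
    Returns (overall, row_effects, col_effects, residuals) such that
    table[i][j] == overall + row_effects[i] + col_effects[j] + residuals[i][j].
    """
    resid = [list(row) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep out row medians into the row (probe) effects.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        # Re-centre the column effects and move their median to the overall term.
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep out column medians into the column (array) effects.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        # Re-centre the row effects in the same way.
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


# Toy table: 3 probes (rows) x 3 arrays (columns), exactly additive.
toy = [[2, 4, 6], [3, 5, 7], [4, 6, 8]]
overall, probe_eff, array_eff, resid = median_polish(toy)
# Per-array expression summary on the log scale: overall + array effect.
expression = [overall + a for a in array_eff]
```

Medians rather than means are used in each sweep, which is what makes the fitted probe and array effects robust to a few outlying probes.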
15
Variability across conditions depends on the choice of summary measure! Beware of filtering
[Figure: mean (left) and empirical standard deviation (right) over 7 conditions (arrays) for 45000 genes, estimated by 2 different methods for quantifying gene expression]
16
• The intensity of the PM measurement for probe (reporter) j and gene g is due to
– binding of labelled fragments that perfectly match the oligos in the spot → the true signal $S_{gj}$
– binding of labelled fragments that do not perfectly match these oligos → the non-specific hybridisation $H_{gj}$
• The intensity of the corresponding MM measurement is caused
– by binding of a fraction $\Phi$ of the true signal $S_{gj}$
– by non-specific hybridisation $H_{gj}$
Model assumptions and key biological parameters
17
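BGX single array model: g = 1,…,G (thousands), j = 1,…,J (11–20)
$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$
$MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$
Background noise, additive; $\Phi$: fraction of the signal binding to the MM probe
Signal: $\log(S_{gj} + 1) \sim TN(\mu_g, \xi_g^2)$, j = 1,…,J
Gene specific error terms, exchangeable: $\log(\xi_g^2) \sim N(a, b^2)$
Non-specific hybridisation: $\log(H_{gj} + 1) \sim TN(\lambda, \eta^2)$, array-wide distribution
Gene expression index (BGX): $\theta_g = \mathrm{median}(TN(\mu_g, \xi_g^2))$, “pools” information over probes j = 1,…,J
Priors: “vague” $\tau^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\Phi \sim B(1,1)$, $\mu_g \sim U(0,15)$, $\eta^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\lambda \sim N(0, 10^3)$; “Empirical Bayes” for (a, b)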
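To make the generative structure of the single array model concrete, here is a small forward simulation of one probe set, together with the closed-form median of a truncated normal that defines the index. It is a sketch only: it assumes TN means a normal truncated to [0, ∞) (natural because log(S + 1) ≥ 0), and all function names and numerical parameter values are hypothetical choices for illustration, not estimates from the talk.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def truncated_normal(rng, mu, sd, lower=0.0):
    """Draw from N(mu, sd^2) restricted to [lower, inf) by rejection sampling."""
    while True:
        x = rng.gauss(mu, sd)
        if x >= lower:
            return x


def tn_median(mu, sd, lower=0.0):
    """Median of N(mu, sd^2) truncated to [lower, inf): the BGX-style index."""
    p_low = STD_NORMAL.cdf((lower - mu) / sd)
    return mu + sd * STD_NORMAL.inv_cdf(0.5 * (1.0 + p_low))


def simulate_probe_set(rng, mu_g, xi_g, lam, eta, phi, tau, n_probes):
    """Simulate (PM, MM) pairs for one gene under the single array model."""
    pairs = []
    for _ in range(n_probes):
        s = math.exp(truncated_normal(rng, mu_g, xi_g)) - 1.0  # true signal S_gj
        h = math.exp(truncated_normal(rng, lam, eta)) - 1.0    # non-specific H_gj
        pm = rng.gauss(s + h, tau)         # PM ~ N(S + H, tau^2)
        mm = rng.gauss(phi * s + h, tau)   # MM ~ N(Phi * S + H, tau^2)
        pairs.append((pm, mm))
    return pairs


rng = random.Random(1)
# Hypothetical parameter values, chosen only to produce plausible intensities.
pairs = simulate_probe_set(rng, mu_g=6.0, xi_g=0.5, lam=4.0, eta=0.7,
                           phi=0.1, tau=5.0, n_probes=11)
index = tn_median(6.0, 0.5)
```

For genes well above zero the truncation is negligible and the index is essentially μ_g; for low-expressed genes the median is pulled above μ_g, which is where the index differs most from a plain mean parameter.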
18
Inference:
• Implemented in WinBUGS and C
• Allows: joint estimation of parameters in full Bayesian framework
• Obtain: posterior distributions of parameters (and functions of these) in model
For each gene g:
– Log-scale true signals: $\log(S_{gj} + 1)$, j = 1,…,J
– Log-scale non-specific hybridisation: $\log(H_{gj} + 1)$, j = 1,…,J
– BGX gene expression index: $\theta_g$ (NB! A distribution)
[Figure: posterior densities for one gene, each marked with its mean and 2.5–97.5% credibility interval]
19
Computational issues
• We found mixing slow for gene specific parameters ($\mu_g$, $\xi_g^2$), with large autocorrelation
• For low signal (bottom 25%), there is more variability of $S_{gj}$ and $H_{gj}$ and less separation, so there is less information on ($\mu_g$, $\xi_g^2$) and longer runs are needed
• For the full hierarchical model, the convergence of the hyperparameters for the distribution of $\xi_g^2$ was problematic
• We studied sensitivity to a range of plausible values for those and implemented an “empirical Bayes” version of the model, which was reproducible with sensible run length
20
Posterior mean of $\theta_g$ using a run of 30 000 versus those obtained from runs of 5 000, 10 000 and 20 000 sweeps
Reproducibility is obtained with short runs for large expression values
Longer runs are necessary for low expression values
21
• 14 samples of cRNA from acute myeloid leukemia (AML) tumor cell line
• In sample k: each of 11 genes spiked in at concentration $c_k$:
sample k:          1    2    3     4    5    6    7    8    9     10  11  12  13   14
conc. c_k (pM):    0.0  0.5  0.75  1.0  1.5  2.0  3.0  5.0  12.5  25  50  75  100  150
• Each sample hybridised to an array
Single array model performance
Data set: varying concentrations (GeneLogic)
Consider subset consisting of 500 normal genes + 11 spike-ins
Single array model: examples of posterior distributions of BGX expression indices
Each curve (truncated normal with median param.) represents a gene
Examples with data:
o: $\log(PM_{gj} - MM_{gj})$, j = 1,…,$J_g$ (at 0 if not defined)
Mean ± 1 SD
23
Single array model performance:
11 genes spiked in at 13 (increasing) concentrations
BGX index $\theta_g$ increases with concentration …
… except for gene 7 (spiked-in??)
Indication of smooth and sustained increase over a wider range of concentrations
Comparison with other expression measures
24
2.5 – 97.5 % credibility intervals for the Bayesian expression index
11 spike-in genes at 13 different concentrations (data set A)
Note how the variability is substantially larger for low expression levels
Each colour corresponds to a different spike-in gene
Gene 7: broken red line
25
Integrated modelling of Affymetrix data
[Diagram: for each of Condition 1 and Condition 2, PM/MM probe pairs → gene specific variability (probe) → gene index BGX → hierarchical model of replicate (biological) variability and array effect → distribution of expression index for gene g in that condition; the two conditions combine into the distribution of the differential expression parameter]
26
BGX multiple array model: conditions c = 1,…,C; replicates r = 1,…,$R_c$
$PM_{gjcr} \sim N(S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$
$MM_{gjcr} \sim N(\Phi S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$
Background noise, additive, array specific
$\log(S_{gjcr} + 1) \sim TN(\mu_{gc}, \xi_{gc}^2)$
Gene and condition specific BGX: $\theta_{gc} = \mathrm{median}(TN(\mu_{gc}, \xi_{gc}^2))$
“Pools” information over replicate probe sets j = 1,…,J, r = 1,…,$R_c$
$\log(H_{gjcr} + 1) \sim TN(\lambda_{cr}, \eta_{cr}^2)$
Array-specific distribution of non-specific hybridisation
27
Subset of AffyU133A spike-in data set (AffyComp)
Consider:
• Six arrays, 1154 genes (every 20th and 42 spike-ins)
• Same cRNA hybridised to all arrays EXCEPT for spike-ins:
                          1     2     3     …  12     13     14
Spike-in genes:           1-3   4-6   7-9   …  34-36  37-39  40-42
Spike-in conc (pM):
Condition 1 (array 1-3):  0.0   0.25  0.50  …  128    256    512
Condition 2 (array 4-6):  0.25  0.50  1.00  …  256    512    0.00
Fold change:              -     2     2     …  2      2      -
28
BGX: measure of uncertainty provided
Posterior mean ± 1 SD credibility intervals
$\mathrm{diff}_g = \mathrm{bgx}_{g,1} - \mathrm{bgx}_{g,2}$
Spike-ins (genes 1113–1154) above the blue line
Blue stars show RMA measure
29
Part 3
• Introduction
• A fully Bayesian gene expression index (BGX)
• Differential expression and array effects
– Non linear array effects
– Model checking
• Mixture models
• Discussion
31
Data Set and Biological question
Biological Question
Understand the mechanisms of insulin resistance
Using animal models where key genes are knocked out, and comparing gene expression of wildtype (normal) and knockout mice
Data set A (MAS5) (12000 genes on each array)
3 wildtype mice compared with 3 mice with Cd36 knocked out
Data set B (RMA) (22700 genes on each array)
8 wildtype mice compared with 8 knocked out mice
32
[Diagram: start with given point estimates of expression for Condition 1 and Condition 2; each feeds a hierarchical model of replicate variability and array effect; the two combine into the differential expression parameter, summarised either by its posterior distribution (flat prior) or by mixture modelling for classification]
33
Condition 1 (3 replicates)
Condition 2 (3 replicates)
Needs ‘normalisation’
Spline curves shown
Exploratory analysis of array effect
Mouse dataset A
34
Model for Differential Expression
• Expression-level-dependent normalisation
• Only few replicates per gene, so share information between genes to estimate variability of gene expression between the replicates
• To select interesting genes:
– Use posterior distribution of quantities of interest (functions of the parameters, ranks, …)
– Use mixture prior on the differential expression parameter
35
Data: $y_{gsr}$ = log gene expression for gene g, condition s, replicate r
$\alpha_g$ = gene effect
$\delta_g$ = differential effect for gene g between 2 conditions
$\beta_{r(g)s}$ = array effect (expression-level dependent)
$\sigma_{gs}^2$ = gene variance
• 1st level
$y_{g1r} \sim N(\alpha_g - \tfrac{1}{2}\delta_g + \beta_{r(g)1}, \sigma_{g1}^2)$
$y_{g2r} \sim N(\alpha_g + \tfrac{1}{2}\delta_g + \beta_{r(g)2}, \sigma_{g2}^2)$
$\sum_r \beta_{r(g)s} = 0$, $\beta_{r(g)s}$ = function of $\alpha_g$, parameters {a} and {b}
• 2nd level
Priors for $\alpha_g$, $\delta_g$, coefficients {a} and {b}
$\sigma_{gs}^2 \sim \mathrm{lognormal}(\mu_s, \tau_s)$
Bayesian hierarchical model for differential expression
36
• Piecewise polynomial with unknown break points:
$\beta_{r(g)s}$ = quadratic in $\alpha_g$ for $a_{rs(k-1)} \le \alpha_g \le a_{rs(k)}$, with coeff $(b_{rsk}^{(1)}, b_{rsk}^{(2)})$, k = 1, …, # breakpoints
– Locations of break points not fixed
– Must do sensitivity checks on # break points
• Joint estimation of array effects and differential expression, in comparison to the 2-step method:
– More accurate estimates of array effects
– Lower percentage of false positives (simulation study)
Details of array effects (Normalization)
37
Mouse Data set A
3 replicate arrays (wildtype mouse data)
Model: posterior means $E(\beta_{r(g)s} \mid \text{data})$ v. $E(\alpha_g \mid \text{data})$
Data: $y_{gsr} - E(\alpha_g \mid \text{data})$
For this data set, cubic fits well
38
• Check assumptions on gene variances, e.g. exchangeable variances, what distribution?
• Predict sample variance $S_g^{2\,new}$ (a chosen checking function) from the model specification (not using the data for this)
• Compare predicted $S_g^{2\,new}$ with observed $S_g^{2\,obs}$
‘Bayesian p-value’: $\mathrm{Prob}(S_g^{2\,new} > S_g^{2\,obs})$
• Distribution of p-values approx Uniform if model is ‘true’ (Marshall and Spiegelhalter, 2003)
• Easily implemented in MCMC algorithm
Bayesian Model Checking
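The predictive check described on this slide can be sketched as follows. The sketch assumes that MCMC draws of the lognormal hyperparameters are already available as `(mu, tau)` pairs, that `tau` is a standard deviation on the log scale, and that replicates are centred normal draws; the function name and these conventions are illustrative assumptions rather than the authors' implementation.

```python
import math
import random
from statistics import variance


def mixed_predictive_pvalue(s2_obs, hyper_draws, n_rep, rng):
    """Estimate Prob(S2_new > S2_obs) for one gene.

    For each posterior draw (mu, tau) of the variance hyperparameters:
      1. draw a new gene variance from lognormal(mu, tau), so the gene's
         own data are not used for the prediction,
      2. simulate n_rep replicates with that variance,
      3. compare the predicted sample variance with the observed one.
    """
    exceed = 0
    for mu, tau in hyper_draws:
        sigma2_new = math.exp(rng.gauss(mu, tau))
        y_new = [rng.gauss(0.0, math.sqrt(sigma2_new)) for _ in range(n_rep)]
        if variance(y_new) > s2_obs:
            exceed += 1
    return exceed / len(hyper_draws)


rng = random.Random(0)
# Hypothetical hyperparameter draws standing in for MCMC output.
draws = [(-3.0 + 0.1 * rng.gauss(0, 1), 0.8) for _ in range(2000)]
p_value = mixed_predictive_pvalue(s2_obs=0.05, hyper_draws=draws, n_rep=3, rng=rng)
```

Repeating this for every gene and plotting the resulting p-values against a Uniform(0, 1) reference gives the diagnostic the slide refers to.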
40
Possible Statistics for Differential Expression
$\delta_g$ ≈ log fold change
$\delta_g^* = \delta_g / (\sigma_{g1}^2 / R_1 + \sigma_{g2}^2 / R_2)^{1/2}$ (standardised difference)
• We obtain the posterior distribution of all {$\delta_g$} and/or {$\delta_g^*$}
• Can compute directly the posterior probability of genes satisfying criterion X of interest:
$p_{g,X}$ = Prob(g of “interest” | Criterion X, data)
• Can compute the distributions of ranks
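Given MCMC output, the posterior probabilities above are just proportions of draws that satisfy the criterion. The following is a minimal sketch under that assumption; the helper names and the four toy draws are hypothetical.

```python
import math


def standardised_difference(delta, var1, var2, r1, r2):
    """delta* = delta / sqrt(var1 / R1 + var2 / R2) for one posterior draw."""
    return delta / math.sqrt(var1 / r1 + var2 / r2)


def posterior_prob(draws, criterion):
    """Monte Carlo estimate of Prob(criterion | data) from posterior draws."""
    return sum(1 for d in draws if criterion(d)) / len(draws)


# Toy posterior draws of delta_g for a single gene (hypothetical values).
delta_draws = [0.1, 0.8, -0.9, 1.2]
p_fold = posterior_prob(delta_draws, lambda d: abs(d) > math.log(2))
d_star = standardised_difference(1.0, 0.5, 0.5, 2, 2)
```

The same `posterior_prob` helper applies to any criterion X, including joint conditions on fold change and overall expression, as long as the criterion is evaluated draw by draw.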
41
Criterion X: gene is of interest if |log fold change| > log(2) and log(overall expression) > 4
The majority of the genes have very small $p_{g,X}$: 90% of genes have $p_{g,X}$ < 0.2
Genes with $p_{g,X}$ > 0.5 (green): 280
Genes with $p_{g,X}$ > 0.8 (red): 46
$p_{g,X}$ = 0.49
Plot of log fold change versus overall expression level
Data set A: 3 wildtype mice compared to 3 knockout mice (U74A chip), MAS5
Genes with low overall expression have a greater range of fold change than those with higher expression
42
Criterion X: gene is of interest if |log fold change| > log(1.5)
The majority of the genes have very small $p_{g,X}$: 97% of genes have $p_{g,X}$ < 0.2
Genes with $p_{g,X}$ > 0.5 (green): 292
Genes with $p_{g,X}$ > 0.8 (red): 139
Plot of log fold change versus overall expression level
Experiment: 8 wildtype mice compared to 8 knockout mice RMA
43
Posterior probabilities and log fold change
Data set A : 3 replicates MAS5 Data set B : 8 replicates RMA
44
Credibility intervals for ranks
100 genes with lowest rank (most under/over expressed)
Low rank, high uncertainty
Low rank, low uncertainty
Data set B
45
Using the posterior distribution of $\delta_g^*$ (standardised difference)
• Compute Probability($|\delta_g^*| > 2$ | data): Bayesian analogue of a t test!
• Order genes
• Select genes such that Probability($|\delta_g^*| > 2$ | data) > cut-off (in blue)
By comparison, additional genes selected by a standard t test with p value < 5% are in red
46
Part 4
• Introduction
• A fully Bayesian gene expression index
• Differential expression and array effects
• Mixture models
– Classification for differential expression
– Bayesian estimate of False Discovery Rates
– CGH arrays: models including information on clones’ spatial location on chromosome
• Discussion
47
Mixture and Bayesian estimation of false discovery rates
Natalia Bochkina, Philippe Broët, Alex Lewin, SR
48
• Gene lists can be built by computing a criterion separately for each gene and ranking
• Thousands of genes are considered simultaneously
• How to assess the performance of such lists?
Multiple Testing Problem
Statistical Challenge
Select interesting genes without including too many false positives in a gene list
A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set up
Want an evaluation of the expected false discovery rate (FDR)
49
Bayesian Estimate of FDR
• Step 1: Choose a gene specific parameter (e.g. δg ) or a gene statistic
• Step 2: Model its prior (resp marginal) distribution using a mixture model
-- with one component to model the unaffected genes (null hypothesis) e.g. point mass at 0 for δg
-- other components to model (flexibly) the alternative
• Step 3: Calculate the posterior probability for any gene of belonging to the unmodified component : pg0 | data
• Step 4: Evaluate FDR (and FNR) for any list, assuming that all the gene classifications are independent (Broët et al 2004):
Bayes FDR(list) | data = $\frac{1}{\mathrm{card}(\text{list})} \sum_{g \in \text{list}} p_{g0}$
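The Step 4 formula is an average of null-membership probabilities over the list, which also suggests a simple way to grow a list up to a target FDR. The sketch below assumes the posterior probabilities $p_{g0}$ have already been computed; the function names and the five toy probabilities are hypothetical.

```python
def bayes_fdr(p_null, selected):
    """Bayes FDR of a gene list: mean posterior probability of being null."""
    return sum(p_null[g] for g in selected) / len(selected)


def list_for_target_fdr(p_null, target):
    """Add genes in order of increasing p_g0 while the estimated FDR <= target."""
    order = sorted(range(len(p_null)), key=lambda g: p_null[g])
    chosen, total = [], 0.0
    for g in order:
        if (total + p_null[g]) / (len(chosen) + 1) > target:
            break
        total += p_null[g]
        chosen.append(g)
    return chosen


# Toy posterior null probabilities p_g0 for five genes (hypothetical values).
p_null = [0.01, 0.9, 0.05, 0.3, 0.02]
fdr = bayes_fdr(p_null, [0, 2, 4])
gene_list = list_for_target_fdr(p_null, target=0.05)
```

Because genes are added in order of increasing $p_{g0}$, the running average is non-decreasing, so stopping at the first violation gives the longest list meeting the target.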
50
Mixture framework for differential expression
• To obtain a gene list, a commonly used method (cf Lönnstedt & Speed 2002, Newton 2003, Smyth 2003, …) is to define a mixture prior for $\delta_g$:
• H0: $\delta_g = 0$, point mass at 0 with probability $p_0$
• H1: $\delta_g$ ~ flexible 2-sided distribution to model pattern of differential expression
Classify each gene according to its posterior probability of not being in the null: $1 - p_{g0}$
Use Bayes rule or fix the FDR to get a cutoff
51
Mixture prior for differential expression
• In full Bayesian framework, introduce latent allocation variable zg to help computations
• Joint estimation of all the mixture parameters (including p0) avoids plugging-in of values (e.g. p0) that are influential on the classification
• Sensitivity to prior settings of the alternative distribution
• Performance has been tested on simulated data sets
Poster by Natalia Bochkina
52
Performance of the mixture prior
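$y_{g1r} = \alpha_g - \tfrac{1}{2}\delta_g + \varepsilon_{g1r}$, r = 1, …, $R_1$
$y_{g2r} = \alpha_g + \tfrac{1}{2}\delta_g + \varepsilon_{g2r}$, r = 1, …, $R_2$
(For simplification, we assume that the data have been pre-normalised)
$\mathrm{Var}(\varepsilon_{gsr}) = \sigma_{gs}^2 \sim IG(a_s, b_s)$
$\delta_g \sim p_0 \delta_0 + p_1 G(1.5, \eta_1) + p_2 G(1.5, \eta_2)$
(H0: point mass $\delta_0$; H1: the two Gamma components)
Dirichlet distribution for ($p_0$, $p_1$, $p_2$)
Exponential hyperprior for $\eta_1$ and $\eta_2$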
53
Estimation
• Estimation of all parameters combines information from biological replicates and between-condition contrasts
• $s_{gs}^2 = \frac{1}{R_s} \sum_r (y_{gsr} - \bar{y}_{gs\cdot})^2$, s = 1, 2: within-condition biological variability
• $\bar{y}_{gs\cdot} = \frac{1}{R_s} \sum_r y_{gsr}$: average expression over replicates
• $\tfrac{1}{2}(\bar{y}_{g1\cdot} + \bar{y}_{g2\cdot})$: average expression over conditions
• $\tfrac{1}{2}(\bar{y}_{g1\cdot} - \bar{y}_{g2\cdot})$: between-conditions contrast
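The four summaries listed on this slide can be computed per gene as in the short sketch below. It follows the slide in dividing the sum of squares by $R_s$ rather than $R_s - 1$; the function name, dictionary keys and toy replicate values are hypothetical.

```python
def condition_summaries(y1, y2):
    """Per-gene summaries from replicate log expressions in two conditions."""
    def mean(values):
        return sum(values) / len(values)

    def s2(values):
        # Divisor R_s, as written on the slide (not R_s - 1).
        m = mean(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    m1, m2 = mean(y1), mean(y2)
    return {
        "s2_1": s2(y1),               # within-condition variability, condition 1
        "s2_2": s2(y2),               # within-condition variability, condition 2
        "average": 0.5 * (m1 + m2),   # average expression over conditions
        "contrast": 0.5 * (m1 - m2),  # between-conditions contrast
    }


summary = condition_summaries([1.0, 2.0, 3.0], [3.0, 4.0, 5.0])
```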
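54
DAG for the mixture model
[DAG, plate g = 1:G. Parameter nodes: ($a_1$, $b_1$), ($a_2$, $b_2$), ($\eta_1$, $\eta_2$), p, $z_g$, $\delta_g$, $\alpha_g$, $\sigma_{g1}^2$, $\sigma_{g2}^2$. Data nodes: $s_{g1}^2$, $s_{g2}^2$, $\tfrac{1}{2}(\bar{y}_{g1\cdot} + \bar{y}_{g2\cdot})$, $\tfrac{1}{2}(\bar{y}_{g1\cdot} - \bar{y}_{g2\cdot})$]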
55
Simulated data
$y_{gr} \sim N(\delta_g, \sigma_g^2)$ (8 replicates)
$\sigma_g^2 \sim IG(1.5, 0.05)$
$\delta_g \sim (-1)^{\mathrm{Bern}(0.5)}\, G(2, 2)$, g = 1:200
$\delta_g = 0$, g = 201:1000
Choice of simulation parameters inspired by estimates found in analyses of biological data sets
Plot of the true differences
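A data set with this structure can be generated as sketched below. The slide does not state the parameterisation of the Gamma and inverse Gamma distributions, so the sketch assumes G(shape, rate) and IG(shape, scale); the function name `simulate_dataset` and the seed are likewise illustrative.

```python
import math
import random


def simulate_dataset(rng, n_genes=1000, n_de=200, n_rep=8):
    """Simulate y_gr ~ N(delta_g, sigma2_g) with n_de differentially expressed genes."""
    data, truth = [], []
    for g in range(n_genes):
        if g < n_de:
            # delta_g = +/- Gamma(shape 2, rate 2); gammavariate takes a scale.
            sign = -1.0 if rng.random() < 0.5 else 1.0
            delta = sign * rng.gammavariate(2.0, 1.0 / 2.0)
        else:
            delta = 0.0
        # sigma2_g ~ IG(shape 1.5, scale 0.05), drawn as 1 / Gamma(shape 1.5, rate 0.05).
        sigma2 = 1.0 / rng.gammavariate(1.5, 1.0 / 0.05)
        sd = math.sqrt(sigma2)
        data.append([rng.gauss(delta, sd) for _ in range(n_rep)])
        truth.append(delta)
    return data, truth


rng = random.Random(42)
data, truth = simulate_dataset(rng)
```

Keeping the true `delta` values alongside the data is what allows observed FDR and FNR to be compared with their model-based estimates.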
56
Post Prob (g ∈ H1) = $1 - p_{g0}$
Bayes rule
FDR (black) and FNR (blue) as a function of $1 - p_{g0}$
Observed and estimated FDR/FNR correspond well
Important feature
57
Comparison of mixture classification and posterior probabilities for $\delta_g^*$ (standardised differences)
In red, 200 genes with $\delta_g \neq 0$
Probability($|\delta_g^*| > 2$ | data)
31 = 4% false negative
10 = 6% false positive
Post Prob (g ∈ H1)
58
Wrongly classified by mixture:
truly dif. expressed,
truly not dif. expressed
Classification errors are on the borderline:
Confusion between size of fold change and biological variability
59
Another simulation
Can we improve estimation of within-condition biological variability?
2628 data points
Many points added on borderline: classification errors in red
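60
DAG for the mixture model
[DAG, plate g = 1:G. Parameter nodes: ($a_1$, $b_1$), ($a_2$, $b_2$), ($\eta_1$, $\eta_2$), p, $z_g$, $\delta_g$, $\alpha_g$, $\sigma_{g1}^2$, $\sigma_{g2}^2$. Data nodes: $s_{g1}^2$, $s_{g2}^2$, $\tfrac{1}{2}(\bar{y}_{g1\cdot} + \bar{y}_{g2\cdot})$, $\tfrac{1}{2}(\bar{y}_{g1\cdot} - \bar{y}_{g2\cdot})$]
The variance estimates are influenced by the mixture parameters
Use only partial information from the replicates to estimate $\sigma_{gs}^2$ and feed forward in the mixture?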
61
Mixture, full vs partial
Classification altered for 57 points:
46 data points with improved classification when ‘feedback from mixture is cut’
11 data points with changed but new incorrect classification
Work in progress
62
Mixture models in CGH arrays experiments
• Philippe Broët, SR
• Curie Institute oncology department
CGH = Comparative Genomic Hybridization between fluorescein-labelled normal and pathologic samples to an array containing clones designed to cover certain areas of the genome
63
In oncology, where carcinogenesis is associated with complex chromosomal alterations, CGH arrays can be used for detailed analysis of genomic changes in copy number (gains or losses of genetic information) in the tumor sample.
Amplification of an oncogene or deletion of a tumor suppressor gene are considered important mechanisms for tumorigenesis
Loss Gain
Tumor suppressor gene Oncogene
Aim: study genomic alterations
64
Specificity of CGH array experiment
A priori biological knowledge from conventional CGH:
• Limited number of states for a sequence: presence, deletion, gain(s), corresponding to different intensity ratios on the slide
→ Mixture model to capture the underlying discrete states
• Clones located contiguously on chromosomes are likely to carry alterations of the same type
→ Use clone spatial location in the allocation model
• Some CGH custom array experiments target restricted areas of the genome
→ Large proportion of genomic alterations are expected
65
3 component mixture model with spatial allocation
$y_{gr} \sim N(\theta_g, \sigma_g^2)$, normal versus tumoral change, clone g, replicate measure r
$\theta_g \sim w_{g0} N(\mu_0, \eta_0^2) + w_{g1} N(\mu_1, \eta_1^2) + w_{g2} N(\mu_2, \eta_2^2)$ (states: presence, deletion, gain)
$\mu_0$: known central estimate obtained from reference clones
Introduce centred spatial autoregressive Markov random fields $\{u_g^0\}$, $\{u_g^1\}$, $\{u_g^2\}$ with nearest neighbours along the chromosomes
Spatial neighbours of g: g − 1 and g + 1
Define mixture proportions to depend on the chromosomal location via a logistic model: $w_{gk} = \exp(u_g^k) / \sum_m \exp(u_g^m)$
This favours allocation of nearby clones to the same component
Work in progress
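The logistic link from the spatial fields to the mixture proportions is a softmax over the three components. A minimal sketch, with a hypothetical function name and toy field values, is:

```python
import math


def mixture_weights(u):
    """w_k = exp(u_k) / sum_m exp(u_m) for one clone, computed stably."""
    shift = max(u)  # subtracting the maximum avoids overflow and cancels in the ratio
    expu = [math.exp(x - shift) for x in u]
    total = sum(expu)
    return [e / total for e in expu]


# Toy field values (u_g^0, u_g^1, u_g^2) for two neighbouring clones (hypothetical).
w_a = mixture_weights([0.0, 2.0, -1.0])
w_b = mixture_weights([0.1, 1.8, -0.9])
```

Because neighbouring clones have similar field values under the autoregressive prior, their weight vectors are similar too, which is how the model favours allocating nearby clones to the same component.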
66
Deletion ?
Presence ?
Ref value $\mu_0$ = −0.11
Curie Institute CGH platform
Focus on investigating deletion areas on chromosome 1 (tumour suppressor locus)
Data on 190 clones
67
Mixture model posterior probability p of clone being deleted
Classification with cut-off at p ≥ 0.8
Short arm
68
Bayesian gene expression measure (BGX)
• Good range of resolution, provides credibility intervals
Differential Expression
• Expression-level-dependent normalisation
• Borrow information across genes for variance estimation
• Gene lists based on posterior probabilities or mixture classification
False Discovery Rate
• Mixture gives good estimate of FDR and classifies
• Flexibility to incorporate a priori biological features, e.g. dependence on chromosomal location
Future work
• Mixture prior on BGX index, with uncertainty propagated to mixture parameters
• Comparison of marginal and prior mixture approaches
• Clustering of profiles for more general experimental set-ups
Summary
69
Papers and technical reports:
Hein A.M., Richardson S., Causton H., Ambler G. and Green P. (2004) BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data (to appear in Biostatistics)
Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2003) Bayesian Modelling of Differential Gene Expression (under revision for Biometrics)
Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics 20, 2562-2571.
Broët P., Richardson S. and Radvanyi F. (2002) Bayesian Hierarchical Model for Identifying Changes in Gene Expression from Microarray Experiments. Journal of Computational Biology 9, 671-683.
Available at http://www.bgx.org.uk/
Thanks