Alex Lewin Sylvia Richardson ( IC Epidemiology) Tim Aitman (IC Microarray Centre)
Sylvia Richardson Centre for Biostatistics Imperial College, London
1BGX
Sylvia Richardson, Centre for Biostatistics
Imperial College, London
Statistical Analysis of Gene Expression Data
In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s)
Tim Aitman (Hammersmith)
Peter Green (Bristol)
BBSRC
www.bgx.org.uk
Biological Atlas of Insulin Resistance
2BGX
Statistical modelling and biology
• Extracting the ‘message’ from microarray data needs statistical as well as biological understanding
• Statistical modelling – in contrast to data analysis – gives a framework for formally organising assumptions about signal and noise
• Our models are structured, reflecting the data generation process:
– Bayesian hierarchical modelling approach
– Inference based on the posterior distribution of quantities of interest
3BGX
What are gene expression data ?
• DNA Microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell type, under specific conditions
• Gene expression data (e.g. Affymetrix) result from the scanning of arrays where hybridisation between a sample and a large number of probes has taken place:
– gene expression measure for each gene
• The expression levels of tens of thousands of probes are measured on a single microarray:
– gene expression profile
• Typically, gene expression profiles are obtained for several samples, in a single experiment or in related experiments:
– gene expression data matrix
4BGX
Common characteristics of data sets in transcriptomics
• High dimensional data (tens of thousands of genes) and few samples
• Many sources of variability (low signal/noise ratio):
– condition/treatment
– biological
– array manufacture
– imaging
– technical
• within/between array variation
• gene specific variability of the probes for a gene (e.g. for Affymetrix)
5BGX
Analysing gene expression data
• Gene expression data can be used in several types of analysis:
-- Comparison of gene expression under different experimental conditions, or in different tissues
-- Building a predictive model for classification or prognosis based on gene expression measurements
-- Exploration of patterns in gene expression matrices
[Figure: gene expression data matrix; rows are genes (20000), columns are samples, entries are gene expression levels]
6BGX
Common statistical issues
• Pre-processing and data reduction
– account for the uncertainty of the signal?
– making arrays comparable: “normalisation”
• Realistic assessment of uncertainty
• Multiplicity: control of “error rates”
• Need to borrow information
• Importance of including prior biological knowledge
Illustrate how structured statistical modelling can help to tease out signal from noise and strengthen inference in the context of differential expression studies
7BGX
Outline
• Background
• Modelling uncertainty in the signal
• Bayesian hierarchical models for differential expression experiments
– posterior predictive checks
– use of posterior distribution of parameters of interest to select genes of interest
• Further structure: mixture models
8BGX
I – Modelling uncertainty in the signal: A fully Bayesian Gene expression index for Affymetrix Gene Chip arrays (Anne Mette Hein)
Data: Affymetrix chip:
- Each gene g is represented by a probe set, consisting of a number of probe pairs (reporters) j: Perfect match (PM) and Mismatch (MM)
Aim: Formulate a model to combine PM and MM values into a new expression value for the gene -- BGX
- Base the model on biological assumptions
- Combine good features of Li and Wong (dChip) and RMA (Robust Multichip Analysis, Irizarry et al)
Use a flexible Bayesian framework that will allow
• to get a measure of uncertainty of the expression
• to integrate further components of the experimental design
9BGX
Single array model: Motivation
Key observations and conclusions:
• Observation: PMs and MMs both increase with spike-in concentration (MMs slower than PMs)
  Conclusion: MMs bind a fraction of the signal
• Observation: Spread of PMs increases with level
  Conclusion: Multiplicative (and additive) error; transformation needed
• Observation: Considerable variability in PM (and MM) response within a probe set
  Conclusion: Varying reliability in gene expression estimation for different genes
• Observation: Probe effects approximately additive on log-scale
  Conclusion: Estimate gene expression measure from PMs and MMs on log scale
10BGX
BGX single array model
Gene and probe specific S and H (g = 1,…,“1000s”, j = 1,…,“tens”):
$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$
$MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$
Background noise, additive; $\Phi$: fraction
Expression measure for gene g is built from:
$\log(S_{gj} + 1) \sim TN(\mu_g, \sigma_g^2)$, j = 1,…,J (20); $\mu_g$: “BGX” expression measure
Shrinkage: exchangeability
$\log(\sigma_g^2) \sim N(a, b^2)$, “Emp. Bayes”
Non-specific hybridisation: array wide distribution:
$\log(H_{gj} + 1) \sim TN(\lambda, \eta^2)$, j = 1,…,J (20), g = 1,…,G
Remaining priors: “vague”
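To make the generative structure of the single array model concrete, the following is a minimal simulation sketch, not the BGX implementation. All numerical values (phi, lam, eta, a, b, tau, and the uniform range used for the gene-level means) are hypothetical choices made for illustration, and the truncation of TN at 0 is an assumption based on log(S + 1) and log(H + 1) being non-negative.

```python
import numpy as np

def truncated_normal(rng, mean, sd, size, lower=0.0):
    """Draw from N(mean, sd^2) truncated to [lower, inf) by simple rejection."""
    mean = np.broadcast_to(np.asarray(mean, dtype=float), size)
    sd = np.broadcast_to(np.asarray(sd, dtype=float), size)
    out = rng.normal(mean, sd)
    bad = out < lower
    while bad.any():
        # Redraw only the entries that fell below the truncation point.
        out[bad] = rng.normal(mean[bad], sd[bad])
        bad = out < lower
    return out

def simulate_bgx_array(G=200, J=20, phi=0.1, lam=4.0, eta=0.7,
                       a=-1.0, b=0.5, tau=5.0, seed=0):
    """Simulate PM/MM intensities for G genes with J probe pairs each."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(2.0, 8.0, size=G)             # hypothetical gene-level signals mu_g
    sigma = np.sqrt(np.exp(rng.normal(a, b, G)))   # log(sigma_g^2) ~ N(a, b^2)
    log_S = truncated_normal(rng, mu[:, None], sigma[:, None], (G, J))
    log_H = truncated_normal(rng, lam, eta, (G, J))
    S = np.exp(log_S) - 1.0                        # specific signal
    H = np.exp(log_H) - 1.0                        # non-specific hybridisation
    PM = rng.normal(S + H, tau)                    # PM_gj ~ N(S_gj + H_gj, tau^2)
    MM = rng.normal(phi * S + H, tau)              # MM_gj ~ N(phi S_gj + H_gj, tau^2)
    return {"mu": mu, "S": S, "H": H, "PM": PM, "MM": MM}

sim = simulate_bgx_array()
print(sim["PM"].shape, sim["PM"].mean() > sim["MM"].mean())
```

Simulated arrays of this kind are mainly useful for checking that an inference routine recovers the known values of mu_g.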
11BGX
BGX model: inference (Hein et al, Biostatistics, 2005)
For each gene g: obtain a distribution for signal (log scale)
[Figure: PM and MM data and the resulting BGX gene expression for gene g]
• Implemented in WinBUGS and C++ (MCMC)
• All parameters estimated jointly in full Bayesian framework
• Posterior distributions of parameters (and functions) obtained
The single array model can be extended to estimate signal from several biological replicates, as well as differential signal between conditions
12BGX
Single array model: examples of posterior distributions of BGX indices
Each curve represents a gene
Examples with data:
o: $\log(PM_{gj} - MM_{gj})$, j = 1,…,J (at 0 if not defined)
Mean ± 1SD
13BGX
Comparison with other expression measures
11 genes spiked in at 13 (increasing) concentrations
BGX index μg increases with concentration … except for gene 7 (incorrectly spiked-in??)
Indication of smooth & sustained increase over a wider range of concentrations
14BGX
95% credibility intervals for Bayesian gene expression index
11 spike-in genes at 13 different concentrations
Note how the variability is substantially larger for low expression level
Each colour corresponds to a different spike-in gene
Gene 7: broken red line
15BGX
II – Modelling differential expression
[Diagram labels: Start with given point estimates of expression; Condition 1 and Condition 2, each with a hierarchical model of replicate variability and array effect; Differential expression parameter; Posterior distribution (flat prior); Mixture modelling for classification]
16BGX
Data Sets and Biological question
Biological Question
• Understand the mechanisms of insulin resistance
• Using animal models where key genes are knocked out
A) Cd36 knock-out data set (MAS 5): 3 wildtype (“normal”) mice compared with 3 mice with Cd36 knocked out (12000 genes on each array)
B) IRS2 knock-out data set (RMA): 8 wildtype (“normal”) mice compared with 8 mice with the IRS2 gene knocked out (22700 genes on each array)
17BGX
Exploratory analysis showing array effect
Mouse data set A
[Figure: panels for Condition 1 (3 replicates) and Condition 2 (3 replicates); spline curves shown]
Needs ‘normalisation’
18BGX
Bayesian hierarchical model for differential expression (Lewin et al, Biometrics, 2005)
Data: $y_{gcr}$ = log gene expression for gene g, replicate r, condition c
$\alpha_g$ = gene effect
$d_g$ = differential effect for gene g between 2 conditions
$\beta_{r(g)c}$ = array effect, modelled as a smooth (spline) function of $\alpha_g$
$\sigma_{gc}^2$ = gene specific variance
• 1st level
$y_{g1r} \sim N(\alpha_g - \tfrac{1}{2} d_g + \beta_{r(g)1}, \sigma_{g1}^2)$
$y_{g2r} \sim N(\alpha_g + \tfrac{1}{2} d_g + \beta_{r(g)2}, \sigma_{g2}^2)$
$\sum_r \beta_{r(g)c} = 0$, $\beta_{r(g)c}$ = function of $\alpha_g$, parameters {c,d}
• 2nd level
“Flat” priors for $\alpha_g$, $d_g$, {c,d}
$\sigma_{gc}^2 \sim$ lognormal$(a_c, b_c)$ (exchangeable variances)
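As a hedged illustration of the first-level model, the sketch below simulates data with the array effect set to zero and then uses a simplification that is not the full analysis: conditional on the variances, flat priors on the gene effect and on d_g give a normal posterior for d_g with mean equal to the difference of condition means and variance sigma_g1^2/n1 + sigma_g2^2/n2. The numerical settings, the number of differentially expressed genes, and the reading of lognormal(a, b) as "log-variance has mean a and standard deviation b" are all assumptions.

```python
import numpy as np

def simulate_two_conditions(G=500, n1=3, n2=3, a=-2.0, b=0.5, seed=1):
    """Simulate log expression for two conditions, with no array effect."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(7.0, 1.5, size=G)                 # hypothetical gene effects
    d = np.zeros(G)
    d[:50] = rng.choice([-1.0, 1.0], size=50) * 1.5      # hypothetical DE genes
    sigma2 = np.exp(rng.normal(a, b, size=(G, 2)))       # exchangeable lognormal variances
    y1 = rng.normal((alpha - 0.5 * d)[:, None], np.sqrt(sigma2[:, [0]]), size=(G, n1))
    y2 = rng.normal((alpha + 0.5 * d)[:, None], np.sqrt(sigma2[:, [1]]), size=(G, n2))
    return y1, y2, d, sigma2

def conditional_posterior_d(y1, y2, sigma2):
    """Posterior mean and sd of d_g given the variances, under flat priors."""
    mean = y2.mean(axis=1) - y1.mean(axis=1)
    var = sigma2[:, 0] / y1.shape[1] + sigma2[:, 1] / y2.shape[1]
    return mean, np.sqrt(var)

y1, y2, d_true, sigma2 = simulate_two_conditions()
post_mean, post_sd = conditional_posterior_d(y1, y2, sigma2)
z = (post_mean - d_true) / post_sd   # should look roughly standard normal
print(round(float(np.std(z)), 2))
```

In the full hierarchical model the variances and array effects are unknown and are estimated jointly by MCMC, so this closed form only describes one conditional step.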
19BGX
Directed Acyclic Graph for the differential expression model (no array effect represented)
[Figure: DAG with nodes a1, b1; a2, b2; $\sigma_{g1}^2$, $s_{g1}^2$; $\sigma_{g2}^2$, $s_{g2}^2$; $d_g$; $\alpha_g$; ½(y_g1. + y_g2.); ½(y_g1. - y_g2.)]
20BGX
Differential expression model
Joint modelling of array effects and differential expression:
• Performs normalisation simultaneously with estimation
• Gives fewer false positives
How to check some of the modelling assumptions?
Posterior predictive checks
How to use the posterior distribution of dg to select genes of interest?
Decision rules
21BGX
Bayesian Model Checking
• Check assumptions on gene variances, e.g. exchangeable variances, what distribution?
• Predict sample variance $s_g^{2,new}$ (a chosen checking function) from the model specification (not using the data for this)
• Compare predicted $s_g^{2,new}$ with observed $s_g^{2,obs}$
‘Bayesian p-value’: Prob($s_g^{2,new} > s_g^{2,obs}$)
• Distribution of p-values approx Uniform if model is ‘true’ (Marshall and Spiegelhalter, 2003)
• Easily implemented in MCMC algorithm
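The check described above can be sketched as follows, under stated assumptions. For each (stand-in) MCMC draw of the hyperparameters, a new variance is drawn from the second-level lognormal distribution, a new sample variance is generated from it, and the result is compared with the observed sample variance. The function name, the fixed hyperparameter "draws" and the lognormal parameterisation (mean a and standard deviation b of the log-variance) are hypothetical; in a real analysis the draws would come from the fitted model.

```python
import numpy as np

def mixed_predictive_pvalues(s2_obs, n_rep, a_draws, b_draws, rng):
    """Estimate Prob(s2_new > s2_obs) per gene, averaging over hyperparameter draws."""
    G = len(s2_obs)
    exceed = np.zeros(G)
    for a, b in zip(a_draws, b_draws):
        sigma2_new = np.exp(rng.normal(a, b, size=G))   # new variance from the 2nd level
        # Sample variance of n_rep normal replicates: sigma^2 * chi2(n-1) / (n-1).
        s2_new = sigma2_new * rng.chisquare(n_rep - 1, size=G) / (n_rep - 1)
        exceed += s2_new > s2_obs
    return exceed / len(a_draws)

rng = np.random.default_rng(2)
G, n_rep, a_true, b_true = 2000, 3, -2.0, 0.5
sigma2 = np.exp(rng.normal(a_true, b_true, size=G))
y = rng.normal(0.0, np.sqrt(sigma2)[:, None], size=(G, n_rep))
s2_obs = y.var(axis=1, ddof=1)

# Stand-in for MCMC output: hyperparameters fixed at the simulation values.
a_draws = np.full(400, a_true)
b_draws = np.full(400, b_true)
pvals = mixed_predictive_pvalues(s2_obs, n_rep, a_draws, b_draws, rng)
print(round(float(pvals.mean()), 2))
```

Because the data here are simulated from the same model used for prediction, a histogram of `pvals` should be close to flat; marked departures from uniformity on real data would suggest that the assumed variance distribution is inadequate.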
22BGX
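Bayesian model checking
[Figure: DAG with nodes a1, b1; a2, b2; $\sigma_{g1}^2$, $s_{g1}^2$; $\sigma_{g2}^2$, $s_{g2}^2$; $d_g$; $\alpha_g$; ½(y_g1. + y_g2.); ½(y_g1. - y_g2.), extended with predicted nodes $\sigma_{g1}^{2,new}$ and $s_{g1}^{2,new}$ and the observed $s_{g1}^{2,obs}$]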
23BGX
Mouse data set A
24BGX
Use of tail probabilities for selecting gene lists
$d_g$: log fold change
$t_g = d_g / (\sigma_{g1}^2 / n_1 + \sigma_{g2}^2 / n_2)^{1/2}$: standardised difference ($n_1$ and $n_2$: number of replicates in each condition)
-- Obtain the posterior distribution of $d_g$ and/or $t_g$
-- Compute directly the posterior probability of genes satisfying criterion X of interest, e.g. $d_g$ > threshold or $t_g$ > percentile:
$p_{g,X}$ = Prob(g of “interest” | Criterion X, data)
-- Compute the distributions of ranks, ….
Interesting statistical issues on relative merits and properties of different selection rules based on tail probabilities
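Given posterior draws, the tail probabilities above reduce to simple Monte Carlo averages. The sketch below assumes arrays of MCMC draws with one row per iteration and one column per gene; the function name, thresholds and the toy draws are hypothetical and only illustrate the calculation.

```python
import numpy as np

def tail_probabilities(d_draws, s2_1_draws, s2_2_draws, n1, n2,
                       d_threshold=1.0, t_threshold=2.0):
    """Posterior probabilities Prob(d_g > threshold) and Prob(|t_g| > threshold)."""
    # Standardised difference computed draw by draw.
    t_draws = d_draws / np.sqrt(s2_1_draws / n1 + s2_2_draws / n2)
    p_d = (d_draws > d_threshold).mean(axis=0)
    p_t = (np.abs(t_draws) > t_threshold).mean(axis=0)
    return p_d, p_t

rng = np.random.default_rng(0)
T = 1000
# Toy posterior draws for three genes: clearly DE, clearly null, borderline.
d_draws = rng.normal([2.0, 0.0, 0.8], [0.1, 0.1, 0.5], size=(T, 3))
s2_draws = np.full((T, 3), 0.04)
p_d, p_t = tail_probabilities(d_draws, s2_draws, s2_draws, n1=3, n2=3)
selected = np.where(p_t > 0.8)[0]   # hypothetical cut-off on the tail probability
print(p_d, p_t, selected)
```

Ordering genes by `p_t` and applying a cut-off gives a gene list whose membership reflects posterior uncertainty rather than a point estimate alone.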
25BGX
Using the posterior distribution of $t_g$ (standardised difference) (Natalia Bochkina)
Bayesian T test
• Compute Probability($|t_g| > 2$ | data)
• Order genes
• Select genes such that Probability($|t_g| > 2$ | data) > cut-off (in blue)
By comparison, additional genes selected by a standard T test with p value < 5% are in red
Data set B
26BGX
Credibility intervals for ranks
100 genes with lowest rank (most under/over expressed)
Low rank, high uncertainty
Low rank, low uncertainty
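One way to obtain credibility intervals for ranks, sketched here under assumptions, is to rank the genes within each posterior draw of the differential effect and then take quantiles of each gene's rank across draws. The function name and the toy draws are hypothetical; rank 1 is assigned to the smallest value in each draw.

```python
import numpy as np

def rank_intervals(d_draws, level=0.95):
    """Median rank and credibility interval of the rank for each gene."""
    # Double argsort converts values to ranks within each draw (row).
    ranks = d_draws.argsort(axis=1).argsort(axis=1) + 1
    lo, hi = np.percentile(
        ranks, [100 * (1 - level) / 2, 100 * (1 + level) / 2], axis=0
    )
    return np.median(ranks, axis=0), lo, hi

rng = np.random.default_rng(3)
means = np.array([-3.0, 0.0, 0.1, 0.2, 3.0])
sds = np.array([0.1, 0.5, 0.5, 0.5, 0.1])
d_draws = rng.normal(means, sds, size=(2000, 5))   # toy posterior draws, 5 genes
med, lo, hi = rank_intervals(d_draws)
print(med, lo, hi)
```

Genes whose effects are well separated from the rest get narrow rank intervals, while genes with similar effects swap ranks between draws and get wide ones.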
27BGX
III – Mixture and Bayesian estimation of False Discovery Rates (FDR)
• Mixture models can be used to perform a model based classification
• Mixture models can be considered at the level of the data (e.g. clustering time profiles) or for the underlying parameters
• Mixture models can be used to detect differentially expressed genes if a model of the alternative is specified
• One benefit is that an estimate of the uncertainty of the classification, the False Discovery Rate, is obtained simultaneously
28BGX
Mixture framework for differential expression
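$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \varepsilon_{g1r}$, r = 1, …, $R_1$
$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \varepsilon_{g2r}$, r = 1, …, $R_2$
(We assume that the data have been pre-normalised)
$\mathrm{Var}(\varepsilon_{gcr}) = \sigma_{gc}^2 \sim IG(a_c, b_c)$
Explicit modelling of the alternative:
$d_g \sim \pi_0 \delta_0 + \pi_1 G(-x \mid 1.5, \lambda_1) + \pi_2 G(x \mid 1.5, \lambda_2)$
H0: the point mass term $\pi_0 \delta_0$; H1: the two Gamma terms
Dirichlet distribution for $(\pi_0, \pi_1, \pi_2)$
Exp(1) hyper prior for $\lambda_1$ and $\lambda_2$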
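A minimal sketch of drawing differential effects from a three-component mixture prior of this form is given below. Several choices are assumptions not stated on the slide: the Dirichlet parameters (1, 1, 1), the reading of the second Gamma parameter as a rate, and the names `lam` and `sample_mixture_prior`.

```python
import numpy as np

def sample_mixture_prior(G, rng, dirichlet_alpha=(1.0, 1.0, 1.0)):
    """Draw d_g from a point mass at 0 plus negative and positive Gamma components."""
    pi = rng.dirichlet(dirichlet_alpha)          # mixture weights (pi_0, pi_1, pi_2)
    lam = rng.exponential(1.0, size=2)           # Exp(1) hyperprior on the Gamma rates
    z = rng.choice(3, size=G, p=pi)              # component label per gene
    d = np.zeros(G)                              # component 0: d_g = 0 (null)
    neg, pos = z == 1, z == 2
    # NumPy's gamma takes (shape, scale), so scale = 1 / rate.
    d[neg] = -rng.gamma(1.5, 1.0 / lam[0], size=int(neg.sum()))
    d[pos] = rng.gamma(1.5, 1.0 / lam[1], size=int(pos.sum()))
    return d, z, pi

rng = np.random.default_rng(4)
d, z, pi = sample_mixture_prior(1000, rng)
print(pi.round(2), (z == 0).mean().round(2))
```

The Gamma shape of 1.5 makes the density of the alternative vanish at zero, which helps separate genuinely changed genes from the null point mass.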
29BGX
Mixture for classification of DE genes
• Calculate the posterior probability for any gene of belonging to the unmodified component: $p_{g0}$ | data
• Classify using a cut-off on $p_{g0}$: i.e. declare gene is DE if $1 - p_{g0} > p_{cut}$
Bayes rule corresponds to $p_{cut} = 0.5$
• Bayesian estimate of FDR (and FNR) for any list (Newton et al 2003, Broët et al 2004):
Bayes FDR (list) | data = $\frac{1}{\mathrm{card}(list)} \sum_{g \in list} p_{g0}$
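The classification rule and the Bayes FDR estimate can be computed directly from the posterior null probabilities. In the sketch below the FDR follows the formula above; the FNR, taken as the average of 1 - p_g0 over genes not in the list, is an assumed analogue since the slide gives no explicit formula for it. The function name and the toy probabilities are hypothetical.

```python
import numpy as np

def bayes_fdr(p_null, p_cut=0.5):
    """Classify genes and estimate FDR (and FNR) from posterior null probabilities."""
    p_null = np.asarray(p_null, dtype=float)
    selected = (1.0 - p_null) > p_cut                 # declare DE if 1 - p_g0 > p_cut
    # FDR: average posterior null probability over the declared list.
    fdr = p_null[selected].mean() if selected.any() else 0.0
    # FNR (assumed analogue): average posterior DE probability over the rest.
    fnr = (1.0 - p_null[~selected]).mean() if (~selected).any() else 0.0
    return selected, float(fdr), float(fnr)

p_null = [0.05, 0.10, 0.40, 0.70, 0.95]               # toy posterior probabilities p_g0
selected, fdr, fnr = bayes_fdr(p_null)
print(selected, round(fdr, 4), round(fnr, 4))
```

Evaluating `bayes_fdr` over a grid of `p_cut` values gives the trade-off between the two error rates, from which a cut-off can be chosen.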
30BGX
Performance of the mixture prior
• Joint estimation of all the mixture parameters (including π0) using MCMC algorithms avoids plugging-in of values that are influential on the classification
• Estimation of all parameters combines information from biological replicates and between condition contrasts
• Performance has been tested on simulated data sets
31BGX
Plot of true difference in each case
[Figure panels: π0 = 0.8, 500 DE; π0 = 0.9, 250 DE; π0 = 0.99, 25 DE; π0 = 0.95, 125 DE; π0 = 0.80, 500 DE]
32BGX
Examples of simulated data for each case
33BGX
Results averaged over 50 replications
[Figure panels, average estimates: Av. $\hat{\pi}_0$ = 0.99; 0.80; 0.90; 0.78; 0.95]
Good estimates of π0 = Prob(null) for each case
34BGX
Comparison of estimated (dotted lines) and observed (full) FDR (black) and FNR (red) rates as cut-off for declaring DE is varied
Bayesian mixture:
• good estimates of FDR and FNR
• easy way to choose efficient classification rule
35BGX
In summary
Integrated gene expression analysis
• Uses the natural hierarchical structure of the data (e.g. probes within genes within replicate arrays within condition) to synthesize, borrow information and provide realistic quantification of uncertainty
• Posterior distributions can be exploited for inference with few replicates: choice of decision rules
• Framework where biological prior information, e.g. on the structure of the probes or on chromosomal location, can be incorporated
• Model based classification, e.g. through mixtures, provides interpretable output and a structure to deal with multiplicity
General framework for investigating other questions
36BGX
Many interesting questions in the analysis of gene expression data
-- Comparison of gene expression under different experimental conditions, or in different tissues
-- Integrated gene expression analysis
-- Investigate high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)
-- Building a predictive model for classification or prognosis based on gene expression measurements, finding “signatures”
37BGX
Association of gene expression with prognosis
Investigate properties of high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)
Expression plot of 115 prognostic genes comprising The Ovarian Cancer Prognostic Profile
38BGX
-- Comparison of gene expression under different experimental conditions, or in different tissues
-- Building a predictive model for classification or prognosis based on gene expression measurements, finding “signatures”
Other questions ….
-- Integrated gene expression analysis
-- Investigate high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)
-- Perform unsupervised model based clustering
-- Estimate graphical models
-- Exploration of patterns and association networks in gene expression matrices
39BGX
-- Comparison of gene expression under different experimental conditions, or in different tissues
-- Classification of gene expression profiles and association of gene expression with other factors, e.g. prognosis (prediction problem)
Exploration of patterns in gene expression matrices
Perform unsupervised model based clustering (e.g. semi-parametric using basis functions, mixtures or DP processes)
[Figure: gene expression matrix (genes by samples) for the development of central nervous systems in rats (9 time points)]
40BGX
BBSRC Exploiting Genomics grant
Colleagues
Natalia Bochkina, Anne Mette Hein, Alex Lewin (Imperial College)
Peter Green (Bristol University)
Philippe Broët (INSERM, Paris)
Papers and technical reports: www.bgx.org.uk/
Thanks