Marie-Laure Martin-Magniette - Mixture models for analysing transcriptome and ChIP-chip data
Mixture models for analysing transcriptome and ChIP-chip data
Marie-Laure Martin-Magniette
French National Institute for agricultural research (INRA)
Unit of Applied Mathematics and Informatics at AgroParisTech, Paris
Unit of Plant Genomics Research (URGV), Evry
M.L Martin-Magniette (INRA) Mixture models and genomic data 7-11 July 2014 1 / 30
Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
4 Conclusions
Introduction
Observations described by 2 variables
Observation distribution seems easy to model with one Gaussian
Introduction
Observations described by 2 variables
Data are scattered and subpopulations are observed
According to the experimental design, there exists no external information about them
This is an underlying structure observed through the data
Introduction
Definition of a mixture model
It is a probabilistic model for representing the presence of subpopulations within an overall population.
Introduction of a latent variable Z indicating the subpopulation where each observation comes from
[Figure, three panels: what we observe (Z = ?), the model, the expected results (Z takes the values 1, 2, 3)]
→ It is an unsupervised classification method
Functional annotation is the new challenge
- It is now relatively easy to sequence an organism and to localize its genes
- But between 20% and 40% of the genes have an unknown function
- For Arabidopsis thaliana, 16% of the genes are orphan genes, i.e. without any information on their function
→ with high-throughput technologies, it is now possible to improve the functional annotation
First genomic example: co-expression analysis
- Co-expressed genes are good candidates to be involved in the same biological process (Eisen et al., 1998)
- Pearson correlation values are often used to measure co-expression, but this is a local point of view
- Co-expression analysis can be recast as a search for an underlying structure in a whole dataset
Table: Examples of co-expression clusters of genes observed on 45 independent transcriptome experiments. Clusters are identified with a mixture.
Second example: ChIP-chip analysis
- These experiments aim at identifying interactions between a protein and DNA
- Most methods look for peaks of log(IP/Input) along the genome
- There exists an underlying structure between the two samples
Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
4 Conclusions
Key ingredients of a mixture model
[Figure, three panels: what we observe (Z = ?), the model, the expected results (Z takes the values 1, 2, 3)]
Let $y = (y_1, \dots, y_n)$ denote $n$ observations with $y_i \in \mathbb{R}^Q$ and let $Z = (Z_1, \dots, Z_n)$ be the latent vector.
1) Distribution of Z: the $\{Z_i\}$ are assumed to be independent and
$$P(Z_i = k) = \pi_k \quad \text{with} \quad \sum_{k=1}^{K} \pi_k = 1 \quad \rightarrow \quad Z \sim \mathcal{M}(n; \pi_1, \dots, \pi_K)$$
where $K$ is the number of components of the mixture
2) Distribution of $(y_i \mid Z_i = k)$: a parametric distribution $f(\cdot\,; \gamma_k)$
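The two ingredients can be read as a generative recipe: draw the label, then draw the observation given the label. A minimal sketch, assuming univariate Gaussian components (the slides leave $f$ generic); the function name `sample_mixture` and all parameter values are hypothetical.

```python
import random


def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n pairs (z_i, y_i) from a univariate Gaussian mixture.

    Labels are 0-based here (0..K-1), whereas the slides use 1..K.
    """
    rng = random.Random(seed)
    labels = list(range(len(weights)))
    draws = []
    for _ in range(n):
        # Step 1: latent label Z_i, with P(Z_i = k) = pi_k.
        z = rng.choices(labels, weights=weights, k=1)[0]
        # Step 2: observation y_i from f(.; gamma_k); here gamma_k = (mean, sd),
        # a Gaussian choice made only for this illustration.
        y = rng.gauss(means[z], sds[z])
        draws.append((z, y))
    return draws


# Hypothetical parameters for K = 3 components.
draws = sample_mixture(1000, weights=[0.5, 0.3, 0.2],
                       means=[-4.0, 0.0, 5.0], sds=[1.0, 1.0, 1.0])
counts = [sum(1 for z, _ in draws if z == k) for k in range(3)]
print(counts)  # empirical label counts, roughly proportional to the weights
```

In a real analysis only the `y` values are observed; the labels `z` are exactly what the mixture model tries to recover.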
Some properties:
- $\{Z_i\}$ are independent
- $\{Y_i\}$ are independent conditionally on $\{Z_i\}$
- Couples $\{(Y_i, Z_i)\}$ are i.i.d.
- The model is invariant under any permutation of the labels $\{1, \dots, K\}$ ⇒ the mixture model has $K!$ equivalent definitions.
Distribution of Y:
$$P(Y \mid K, \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Y_i, Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Z_i = k)\, P(Y_i \mid Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k)$$
→ It is a weighted sum of parametric distributions known up to the parameter vector $\theta = (\pi_1, \dots, \pi_{K-1}, \gamma_1, \dots, \gamma_K)$
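The product of weighted sums above is usually evaluated on the log scale. A minimal sketch, again assuming univariate Gaussian components; `normal_logpdf` and `mixture_loglik` are hypothetical helper names, and the log-sum-exp step is a standard numerical safeguard rather than something stated on the slides.

```python
import math


def normal_logpdf(y, mean, sd):
    """Log-density of a univariate Gaussian (assumed component family)."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((y - mean) / sd) ** 2


def mixture_loglik(ys, weights, means, sds):
    """log P(Y | K, theta) = sum_i log sum_k pi_k f(y_i; gamma_k).

    Weights must be strictly positive because their logarithm is taken.
    """
    total = 0.0
    for y in ys:
        # log(pi_k) + log f(y_i; gamma_k) for each component k.
        terms = [math.log(w) + normal_logpdf(y, m, s)
                 for w, m, s in zip(weights, means, sds)]
        # Log-sum-exp: factor out the largest term to avoid underflow.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


# Hypothetical data and parameters.
print(mixture_loglik([-3.9, 0.2, 5.1], [0.5, 0.3, 0.2], [-4.0, 0.0, 5.0], [1.0, 1.0, 1.0]))
```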
Statistical inference of incomplete data models
Maximum likelihood estimate:
$$\hat{\theta} = \arg\max_{\theta} \log P(Y \mid K, \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log \left[ \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k) \right]$$
→ Direct maximisation is not always possible, since expanding the likelihood over all configurations of Z involves $K^n$ terms.
Expectation-Maximization algorithm: iterative algorithm based on the expectation of the complete-data log-likelihood, conditionally on Y and $\theta^{(l)}$
$$\theta^{(l+1)} = \arg\max_{\theta} E\left\{ \log P(Y, Z \mid K, \theta) \mid Y, \theta^{(l)} \right\}$$
→ According to the theory, it implies that $\log P(Y \mid K, \theta^{(l)})$ increases at each iteration and tends toward a local maximum.
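The conditional expectation in the EM criterion can be written out explicitly. A short derivation, using only the independence of the couples $(Y_i, Z_i)$ and the definitions of $\pi_k$ and $f$; the indicator notation $\mathbb{1}\{\cdot\}$ is introduced here for convenience.

$$\log P(Y, Z \mid K, \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}\{Z_i = k\} \left[ \log \pi_k + \log f(Y_i; \gamma_k) \right]$$

$$E\left\{ \log P(Y, Z \mid K, \theta) \mid Y, \theta^{(l)} \right\} = \sum_{i=1}^{n} \sum_{k=1}^{K} P(Z_i = k \mid Y_i, \theta^{(l)}) \left[ \log \pi_k + \log f(Y_i; \gamma_k) \right]$$

Only the indicators are random given Y, so taking the expectation amounts to replacing each indicator by the conditional probability of the corresponding label.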
EM algorithm details
Initialisation of $\theta^{(0)}$
While the convergence criterion is not reached, iterate
E-step: Calculation of the conditional probabilities
$$\tau_{ik}^{(l)} = P(Z_i = k \mid y_i, \theta^{(l)}) = \frac{\pi_k^{(l)} f(y_i; \gamma_k^{(l)})}{\sum_{k'=1}^{K} \pi_{k'}^{(l)} f(y_i; \gamma_{k'}^{(l)})}$$
M-step: Calculation of $\hat{\theta}$ by maximising the complete-data log-likelihood where Z is replaced with the conditional probabilities
$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(l)} \left[ \log \pi_k + \log f(y_i; \gamma_k) \right]$$
→ weighted version of the usual maximum likelihood estimates (MLE).
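The E-step and M-step can be sketched for the simplest case. This is an illustration under assumptions, not the implementation of any package named in the talk: components are univariate Gaussians, the function name `em_gaussian_mixture` is hypothetical, and the stopping rule (small increase of the log-likelihood) is one possible convergence criterion.

```python
import math
import random


def normal_pdf(y, mean, sd):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def em_gaussian_mixture(ys, weights, means, sds, n_iter=200, tol=1e-8):
    """EM for a univariate Gaussian mixture, started from the given parameters.

    Returns the updated parameters, plus tau and the log-likelihood computed
    in the last E-step (i.e. before the final M-step update).
    """
    n, K = len(ys), len(weights)
    weights, means, sds = list(weights), list(means), list(sds)
    previous = -math.inf
    for _ in range(n_iter):
        # E-step: tau_ik = pi_k f(y_i; gamma_k) / sum_k' pi_k' f(y_i; gamma_k').
        tau, loglik = [], 0.0
        for y in ys:
            joint = [weights[k] * normal_pdf(y, means[k], sds[k]) for k in range(K)]
            g = sum(joint)
            loglik += math.log(g)
            tau.append([j / g for j in joint])
        # M-step: weighted versions of the usual Gaussian MLEs.
        for k in range(K):
            nk = sum(tau[i][k] for i in range(n))
            weights[k] = nk / n
            means[k] = sum(tau[i][k] * ys[i] for i in range(n)) / nk
            var = sum(tau[i][k] * (ys[i] - means[k]) ** 2 for i in range(n)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)  # floor to avoid a degenerate component
        # Stop when the log-likelihood no longer increases noticeably.
        if loglik - previous < tol:
            break
        previous = loglik
    return weights, means, sds, tau, loglik


# Simulated data: two well-separated groups (hypothetical values).
rng = random.Random(1)
ys = [rng.gauss(-3.0, 1.0) for _ in range(150)] + [rng.gauss(3.0, 1.0) for _ in range(150)]
weights, means, sds, tau, loglik = em_gaussian_mixture(ys, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(weights, means, sds)
```

Changing the starting values passed to the function is a simple way to see how much the result depends on the initialisation.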
EM algorithm properties
Convergence is always reached but not always toward a globalmaximum
EM algorithm is sensitive to the initialisation step
The EM algorithm is available in all good statistical software
In R, it is available in the mclust and Rmixmod packages.
Rmixmod proposes the best initialisation strategy
Outputs of the model
Distribution:
$$g(y_i) = \pi_1 f(y_i; \gamma_1) + \pi_2 f(y_i; \gamma_2) + \pi_3 f(y_i; \gamma_3)$$
Conditional probabilities:
$$\tau_{ik} = P(Z_i = k \mid y_i) = \frac{\pi_k f(y_i; \gamma_k)}{g(y_i)}$$
τik (%)   i = 1   i = 2   i = 3
k = 1     65.8    0.7     0.0
k = 2     34.2    47.8    0.0
k = 3     0.0     51.5    1.0
→ These probabilities enable the classification of the observations into the subpopulations
Maximum A Posteriori rule: Classification in the component for whichthe conditional probability is the highest.
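The MAP rule is a per-observation argmax. A minimal sketch applied to the conditional probabilities shown on the slide, copied as printed; the names `tau_percent` and `map_rule` are hypothetical.

```python
# Conditional probabilities (in %) from the slide, stored per observation i
# as [tau_i1, tau_i2, tau_i3]. Values are copied as printed.
tau_percent = {
    1: [65.8, 34.2, 0.0],
    2: [0.7, 47.8, 51.5],
    3: [0.0, 0.0, 1.0],
}


def map_rule(probabilities):
    """Return the 1-based component with the highest conditional probability."""
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return best + 1


labels = {i: map_rule(p) for i, p in tau_percent.items()}
print(labels)  # {1: 1, 2: 3, 3: 3}
```

Observation 2 is assigned to component 3 with 51.5% against 47.8% for component 2: the rule always returns a label, even when the classification is uncertain.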
Model selection
- The number of components of the mixture is often unknown
- A collection of models where K varies between 2 and $K_{\max}$
- The best model is the one maximising a criterion
Bayesian Information Criterion (BIC)
- proxy of the integrated likelihood $P(Y \mid K) = \int P(Y \mid K, \theta)\, \pi(\theta \mid K)\, d\theta$
- aims at finding a good number of components for a global fit of the data distribution
$$BIC(K) = \log P(Y \mid K, \hat{\theta}) - \frac{\nu_K}{2} \log(n)$$
where
- $\nu_K$ is the number of free parameters of the model
- $P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.
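The BIC formula can be applied directly once each model has been fitted. A minimal sketch in which the maximised log-likelihoods are made-up numbers used only for illustration; the parameter count assumes univariate Gaussian components with free variances, and the function names are hypothetical.

```python
import math


def bic(loglik, n_free_params, n_obs):
    """BIC(K) = log P(Y | K, theta_hat) - (nu_K / 2) * log(n); larger is better."""
    return loglik - 0.5 * n_free_params * math.log(n_obs)


def n_free_params_univariate_gaussian(K):
    """nu_K for univariate Gaussian components: K - 1 proportions, K means, K variances."""
    return (K - 1) + 2 * K


# Made-up maximised log-likelihoods for K = 2, 3, 4 (NOT results from the talk).
n_obs = 200
logliks = {2: -520.0, 3: -480.0, 4: -478.0}
bics = {K: bic(ll, n_free_params_univariate_gaussian(K), n_obs) for K, ll in logliks.items()}
best_K = max(bics, key=bics.get)
print(best_K)  # 3
```

Going from K = 3 to K = 4 improves the log-likelihood by only 2 units, which does not compensate for the three extra parameters.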
Integrated Completed Likelihood (ICL)
- proxy of the integrated complete likelihood $P(Y, Z \mid K)$
- dedicated to classification since it strongly penalizes models for which the classification is uncertain
$$ICL(K) = BIC(K) + \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik} \log \tau_{ik}$$
where
- $\nu_K$ is the number of free parameters
- $P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.
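The extra term in ICL is non-positive and vanishes only when every observation is classified with certainty. A minimal sketch; the function names and the two small `tau` matrices are hypothetical, and the convention 0 log 0 = 0 is assumed.

```python
import math


def entropy_term(tau):
    """sum_i sum_k tau_ik * log(tau_ik), with the convention 0 * log(0) = 0."""
    total = 0.0
    for row in tau:
        for t in row:
            if t > 0.0:
                total += t * math.log(t)
    return total


def icl(bic_value, tau):
    """ICL(K) = BIC(K) + sum_i sum_k tau_ik log tau_ik."""
    return bic_value + entropy_term(tau)


# Two hypothetical classifications of n = 2 observations into K = 2 components.
confident = [[1.0, 0.0], [0.0, 1.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
bic_value = -500.0  # arbitrary value, identical for both cases

print(icl(bic_value, confident))  # equals the BIC value: no penalty
print(icl(bic_value, uncertain))  # lower than the BIC value by 2 * log(2)
```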
Conclusions on the model selection
- BIC aims at finding a good number of components for a global fit of the data distribution. It tends to overestimate the number of components
- ICL is dedicated to a classification purpose. It strongly penalizes models for which the classification is uncertain.
- Whatever the criterion, it must be a convex function of the number of components
[Figure: two criterion curves, labelled "Bad behavior" and "Correct behavior"]
→ a non-convex function may indicate a modelling issue
Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
   - Mixtures for co-expression analysis
   - Mixtures for analysing ChIP-chip data
4 Conclusions
GEM2Net: From gene expression modeling to -omics network
Goal: Explore the orphan gene space to identify new genes involved in defense and adaptation processes
Method: Predict co-expression networks using mixture models
Data: An original resource generated by the transcriptomic platform of URGV
- Homogeneous data generated with the CATMA microarray
- 5,095 genes not present in Affymetrix chip
- High diversity of biological samples relative to stress conditions
Workflow overview
- Extraction from CATdb of 387 stress comparisons
- 17,264 genes are differentially expressed in at least one of these comparisons (FWER controlled at 5% over all the tests)
- Analyses performed with Gaussian Mixture Models
- According to the BIC curve, the naive clustering on the whole dataset is not relevant
- Gene co-expression depends on the stress categories
→ The functional modules vary with the environment
Results of the co-expression analysis
- 18 categories (9 biotic and 9 abiotic), identification of 681 clusters
- Large overlap between biotic and abiotic clusters
- 98% of clusters have a functional bias toward a Gene Ontology term
- 80% are associated with a stress term
- 39% have a preferential sub-cellular localization in plastid
- 18% are enriched in transcription factors (TF); for stifenia, no cluster is enriched in TF
Focus on nematode stress
- 7467 genes described by 10 expression differences
- 29 clusters of co-expression identified
- 1519 genes with a conditional probability close to 1
Example of Cluster 14
- 49 genes repressed from 14 days after infection
- 13 genes known to be involved in stress response
- 10 orphan genes
- Endoplasmic reticulum bias
GEM2Net database
http://urgv.evry.inra.fr/GEM2NET
- Integration of various resources: gene ontology, genes involved in stress responses, gene families (transcription factors and hormones) and protein-protein interactions (experimental and predicted).
- Original representation and interactive visualization, using pie charts to summarize the functional biases at first glance
ChIP-chip experiments
- The log-ratio is not tractable while the couple (IP, Input) is
- Development of a mixture of 2 linear regressions
MultiChIPmix: Mixture of two linear regressions
Let $Z_i$ be the status of probe $i$: $P(Z_i = 1) = \pi$
The linear relation between IP and Input depends on the probe status
$$IP_{ir} = \begin{cases} a_{0r} + b_{0r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 0 \text{ (normal)} \\ a_{1r} + b_{1r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 1 \text{ (enriched)} \end{cases} \qquad V(IP_{ir}) = \sigma_r^2$$
Martin-Magniette et al. (2008), Bioinformatics
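Given the parameters, the conditional probability that a probe is enriched follows from Bayes' rule, exactly as for $\tau_{ik}$ in a generic mixture. A minimal sketch for a single replicate, assuming Gaussian errors $E_{ir}$ (the slide only states the variance); the function name and all numerical values are hypothetical, and this is not the code of the MultiChIPmix software.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def enriched_probability(ip, inp, pi, normal_line, enriched_line, sigma):
    """P(Z_i = 1 | IP, Input) for one probe and one replicate.

    normal_line = (a0, b0), enriched_line = (a1, b1), sigma = residual sd.
    Gaussian errors are an assumption made for this illustration.
    """
    a0, b0 = normal_line
    a1, b1 = enriched_line
    f0 = normal_pdf(ip, a0 + b0 * inp, sigma)  # density under the normal regression
    f1 = normal_pdf(ip, a1 + b1 * inp, sigma)  # density under the enriched regression
    return pi * f1 / ((1.0 - pi) * f0 + pi * f1)


# Hypothetical parameters: 10% of probes enriched, two distinct regression lines.
params = dict(pi=0.1, normal_line=(0.0, 1.0), enriched_line=(1.0, 1.2), sigma=0.5)
p_low = enriched_probability(ip=10.0, inp=10.0, **params)   # on the normal line
p_high = enriched_probability(ip=13.0, inp=10.0, **params)  # on the enriched line
print(p_low, p_high)
```

With several replicates, the densities would be multiplied over the replicates before applying the same formula, with replicate-specific coefficients.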
Used to
- create the first epigenomic map of Arabidopsis thaliana: Roudier et al. (2011), EMBO Journal
- study the additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids: Moghaddam et al. (2011), Plant Journal
MultiChIPmixHMM for taking the spatial information into account
- When probes are (almost) equally spaced along the genome, hybridisation signals tend to be clustered
- Assuming that the probe statuses are (Markov-)dependent brings this information into the model:
$$\{Z_i\} \sim MC(\pi, \nu), \qquad \pi_{k\ell} = \Pr\{Z_i = k \mid Z_{i-1} = \ell\}$$
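With Markov-dependent statuses, the conditional probabilities of each probe depend on its neighbours and are classically obtained with the forward-backward recursions. A minimal sketch of that standard algorithm, not the code of the MultiChIPmixHMM package; treating ν as the initial distribution is an assumption, the emission values are hypothetical, and the code stores transitions as `trans[l][k]` = $\pi_{k\ell}$.

```python
def forward_backward(emission, trans, init):
    """Posterior P(Z_i = k | y_1..y_n) for a hidden Markov chain.

    emission[i][k] = f(y_i | Z_i = k)
    trans[l][k]    = P(Z_i = k | Z_{i-1} = l)
    init[k]        = P(Z_1 = k)
    """
    n, K = len(emission), len(init)
    # Forward pass, normalised at each position to avoid underflow.
    alpha, prev = [], None
    for i in range(n):
        if i == 0:
            a = [init[k] * emission[0][k] for k in range(K)]
        else:
            a = [emission[i][k] * sum(prev[l] * trans[l][k] for l in range(K))
                 for k in range(K)]
        s = sum(a)
        prev = [x / s for x in a]
        alpha.append(prev)
    # Backward pass, also normalised.
    beta = [[1.0] * K for _ in range(n)]
    for i in range(n - 2, -1, -1):
        b = [sum(trans[k][l] * emission[i + 1][l] * beta[i + 1][l] for l in range(K))
             for k in range(K)]
        s = sum(b)
        beta[i] = [x / s for x in b]
    # Combine both passes and normalise per position.
    post = []
    for i in range(n):
        w = [alpha[i][k] * beta[i][k] for k in range(K)]
        s = sum(w)
        post.append([x / s for x in w])
    return post


# Five probes, states 0 = normal and 1 = enriched (hypothetical values).
# The middle probe is ambiguous on its own; its neighbours look enriched.
emission = [[0.1, 0.9], [0.1, 0.9], [0.5, 0.5], [0.1, 0.9], [0.1, 0.9]]
trans = [[0.9, 0.1], [0.1, 0.9]]
post = forward_backward(emission, trans, init=[0.5, 0.5])
print(post[2])  # the middle probe is pulled toward "enriched" by its neighbours
```

Under an independent mixture the middle probe would get a probability of 0.5; the Markov dependence is what moves it toward the status of its neighbours.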
Table: Example of one known H3K27me3 target gene identified only with MultiChIPmixHMM.
- MultiChIPmix and MultiChIPmixHMM are alternatives to peak detection methods
- Analysis of several replicates simultaneously + modelling the spatial dependency = more accurate conditional probabilities
- MultiChIPmixHMM is available as an R package: Bérard et al. (2013), BMC Bioinformatics
Presentation outline
1 Introduction
2 Mixture model definition
3 Genomic examples
4 Conclusions
Conclusions
- Mixtures reveal underlying structures
- Key ingredients are P(Z) and P(Y|Z)
- For genomic data, component distribution modeling is sometimes tricky, especially for RNA-Seq data
- Applications on genomic data sometimes raise new methodological questions about the parameter inference and classification rules
- Examples of R packages using mixtures: mclust, Rmixmod, MultiChIPmixHMM, HTSDiff, HTSCluster, poisson.glm.mix
Acknowledgements
Statistics Bioinformatics Biology
S. Robin, V. Brunaud, J-P. Renou
T. Mary-Huard, J-P. Tamby, E. Delannoy
C. Bérard, R. Zaag, S. Balzergue
G. Celeux, Z. Tariq
C. Maugis-Rabusseau, V. Colot
G. Rigaill, F. Roudier
A. Rau
P. Papastamoulis
M. Seifert
Thank you for your attention!