Marie-Laure Martin-Magniette - Mixture models for analysing transcriptome and ChIP-chip data

Mixture models for analysing transcriptome and ChIP-chip data

Marie-Laure Martin-Magniette

French National Institute for Agricultural Research (INRA)

Unit of Applied Mathematics and Informatics at AgroParisTech, Paris

Unit of Plant Genomics Research (URGV), Evry

M.L Martin-Magniette (INRA) Mixture models and genomic data 7-11 July 2014 1 / 30

Presentation outline

1 Introduction

2 Mixture model definition

3 Genomic examples

4 Conclusions


Introduction

Observations described by 2 variables

Observation distribution seems easy to model with one Gaussian


Introduction

Observations described by 2 variables

Data are scattered and subpopulations are observed

According to the experimental design, there exists no external information about them

This is an underlying structure observed through the data


Introduction

Definition of a mixture model

It is a probabilistic model for representing the presence of subpopulations within an overall population.

Introduction of a latent variable Z indicating the subpopulation where each observation comes from

[Figure: three panels, "what we observe" (Z = ?), "the model", and "the expected results" (Z takes the values 1, 2, 3, one colour per subpopulation)]

→ It is an unsupervised classification method

Functional annotation is the new challenge

It is now relatively easy to sequence an organism and to localize its genes

But between 20% and 40% of the genes have an unknown function

For Arabidopsis thaliana, 16% of the genes are orphan genes, i.e. without any information on their function

→ with high-throughput technologies, it is now possible to improve the functional annotation


First genomic example: co-expression analysis

Co-expressed genes are good candidates to be involved in the same biological process (Eisen et al, 1998)

Pearson correlation values are often used to measure co-expression, but this is a local point of view

Co-expression analysis can be recast as a search for an underlying structure in a whole dataset

Table: Examples of co-expression clusters of genes observed on 45 independent transcriptome experiments. Clusters are identified with a mixture.


Second example: ChIP-chip analysis

These experiments aim at identifying interactions between a protein and DNA

Most methods look for peaks of log(IP/Input) along the genome

There exists an underlying structure between the two samples



1 Introduction

2 Mixture model definition

3 Genomic examples

4 Conclusions


Key ingredients of a mixture model


Let $y = (y_1, \dots, y_n)$ denote $n$ observations with $y_i \in \mathbb{R}^Q$, and let $Z = (Z_1, \dots, Z_n)$ be the latent vector.

1) Distribution of Z: the $\{Z_i\}$ are assumed to be independent and

$$P(Z_i = k) = \pi_k \quad \text{with} \quad \sum_{k=1}^{K} \pi_k = 1 \quad \rightarrow \quad Z \sim \mathcal{M}(n; \pi_1, \dots, \pi_K)$$

where $K$ is the number of components of the mixture

2) Distribution of $(y_i \mid Z_i = k)$: a parametric distribution $f(\cdot\,; \gamma_k)$


Some properties:

$\{Z_i\}$ are independent

$\{Y_i\}$ are independent conditionally on $\{Z_i\}$

Couples $\{(Y_i, Z_i)\}$ are i.i.d.

The model is invariant under any permutation of the labels $\{1, \dots, K\}$ ⇒ the mixture model has $K!$ equivalent definitions.

Distribution of Y:

$$P(Y \mid K, \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Y_i, Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Z_i = k)\, P(Y_i \mid Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k)$$

→ It is a weighted sum of parametric distributions known up to the parameter vector $\theta = (\pi_1, \dots, \pi_{K-1}, \gamma_1, \dots, \gamma_K)$
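As an illustration of the formula above, here is a minimal sketch that evaluates the mixture density and the log-likelihood. It assumes univariate Gaussian components, so that each $\gamma_k$ is a (mean, standard deviation) pair; the function names and the numerical values are hypothetical and chosen only for the example.

```python
import math

def gaussian_pdf(y, mean, sd):
    """Density of a univariate Gaussian N(mean, sd^2) evaluated at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(y, weights, means, sds):
    """g(y) = sum_k pi_k f(y; gamma_k), with gamma_k = (mean_k, sd_k)."""
    return sum(w * gaussian_pdf(y, m, s) for w, m, s in zip(weights, means, sds))

def log_likelihood(ys, weights, means, sds):
    """log P(Y | K, theta) = sum_i log sum_k pi_k f(y_i; gamma_k)."""
    return sum(math.log(mixture_density(y, weights, means, sds)) for y in ys)

# Hypothetical parameters for K = 2 components (illustrative values only).
weights = [0.3, 0.7]   # pi_1, pi_2 (sum to 1)
means = [0.0, 4.0]
sds = [1.0, 1.0]
ys = [-0.2, 0.5, 3.8, 4.4]

print(log_likelihood(ys, weights, means, sds))
```

Only $K - 1$ proportions are free because the weights sum to one, which is why $\theta$ contains $\pi_1, \dots, \pi_{K-1}$ but all $K$ component parameters.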


Statistical inference of incomplete data models

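Maximum likelihood estimate:

$$\hat{\theta} = \arg\max_{\theta} \log P(Y \mid K, \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log \left[ \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k) \right]$$

→ Direct maximisation is not always possible: written as a sum over all the configurations of Z, the likelihood involves $K^n$ terms.

Expectation-Maximization algorithm: iterative algorithm based on the expectation of the complete-data log-likelihood, conditionally on Y and on the current value $\theta^{(l)}$

$$\theta^{(l+1)} = \arg\max_{\theta} E\left\{ \log P(Y, Z \mid K, \theta) \mid Y, \theta^{(l)} \right\}$$

→ According to the theory, it implies that $\log P(Y \mid K, \theta)$ tends toward a local maximum.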


EM algorithm details

Initialisation of $\theta^{(0)}$

While the convergence criterion is not reached, iterate

E-step: Calculation of the conditional probabilities

$$\tau_{ik}^{(l)} = P(Z_i = k \mid y_i, \theta^{(l)}) = \frac{\pi_k^{(l)} f(y_i; \gamma_k^{(l)})}{\sum_{k'=1}^{K} \pi_{k'}^{(l)} f(y_i; \gamma_{k'}^{(l)})}$$

M-step: Calculation of $\hat{\theta}$ by maximising the complete likelihood where Z is replaced with the conditional probabilities

$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(l)} \left[ \log \pi_k + \log f(y_i; \gamma_k) \right]$$

→ weighted version of the usual maximum likelihood estimates (MLE).
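The two steps can be sketched in a few lines for the simplest case. The code below is a minimal illustration, assuming univariate Gaussian components and simulated data; it is not the implementation used in any of the R packages mentioned in this talk, and the function names, starting values and stopping rule are choices made for the example.

```python
import math
import random

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def em_gaussian_mixture(ys, weights, means, sds, n_iter=200, tol=1e-8):
    """EM for a univariate Gaussian mixture (sketch, no safeguards for empty components)."""
    n, K = len(ys), len(means)
    weights, means, sds = list(weights), list(means), list(sds)
    previous = -math.inf
    tau, loglik = [], previous
    for _ in range(n_iter):
        # E-step: tau_ik = pi_k f(y_i; gamma_k) / sum_k' pi_k' f(y_i; gamma_k')
        tau, loglik = [], 0.0
        for y in ys:
            joint = [weights[k] * normal_pdf(y, means[k], sds[k]) for k in range(K)]
            total = sum(joint)
            tau.append([j / total for j in joint])
            loglik += math.log(total)  # log-likelihood at the current parameters
        # M-step: weighted versions of the usual Gaussian MLEs
        for k in range(K):
            nk = sum(t[k] for t in tau)
            weights[k] = nk / n
            means[k] = sum(t[k] * y for t, y in zip(tau, ys)) / nk
            var = sum(t[k] * (y - means[k]) ** 2 for t, y in zip(tau, ys)) / nk
            sds[k] = math.sqrt(max(var, 1e-12))  # floor avoids a degenerate variance
        # Stop when the log-likelihood no longer increases noticeably
        if loglik - previous < tol:
            break
        previous = loglik
    return weights, means, sds, tau, loglik

# Simulated data: two subpopulations centred at 0 and 5 (illustrative only).
random.seed(0)
ys = [random.gauss(0.0, 1.0) for _ in range(150)] + [random.gauss(5.0, 1.0) for _ in range(150)]

# Simple initialisation theta^(0): means at the extremes of the data.
w, m, s, tau, ll = em_gaussian_mixture(ys, [0.5, 0.5], [min(ys), max(ys)], [1.0, 1.0])
print(w, m, s, ll)
```

Running the sketch from several different starting values and keeping the run with the highest log-likelihood is a common way to reduce the risk of stopping at a poor local maximum.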


EM algorithm properties

Convergence is always reached but not always toward a globalmaximum

EM algorithm is sensitive to the initialisation step

The EM algorithm is available in all good statistical software

In R, it is available in the mclust and Rmixmod packages.

Rmixmod proposes the best strategy of initialisation


Outputs of the model

Distribution:

$$g(y_i) = \pi_1 f(y_i; \gamma_1) + \pi_2 f(y_i; \gamma_2) + \pi_3 f(y_i; \gamma_3)$$

Conditional probabilities:

$$\tau_{ik} = P(Z_i = k \mid y_i) = \frac{\pi_k f(y_i; \gamma_k)}{g(y_i)}$$

τik (%)    i = 1    i = 2    i = 3
k = 1      65.8     0.7      0.0
k = 2      34.2     47.8     0.0
k = 3      0.0      51.5     1.0

→ These probabilities enable the classification of the observations into the subpopulations



Maximum A Posteriori rule: Classification in the component for which the conditional probability is the highest.
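The MAP rule amounts to taking, for each observation, the index of the largest conditional probability. The short sketch below applies it to the conditional probabilities reported for observations i = 1 and i = 2 in the table of this section (converted from percentages); the function name is hypothetical.

```python
def map_classification(tau):
    """For each observation, return the 1-based label k with the highest tau_ik."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in tau]

# One row per observation, one column per component k = 1, 2, 3.
tau = [
    [0.658, 0.342, 0.000],  # observation i = 1
    [0.007, 0.478, 0.515],  # observation i = 2
]

labels = map_classification(tau)
print(labels)  # prints [1, 3]
```

Observation i = 2 is assigned to component 3 although its probability for component 2 is almost as high (47.8% against 51.5%), which shows that the MAP rule alone hides how uncertain a classification is.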


Model selection

The number of components of the mixture is often unknown

A collection of models is considered, where K varies between 2 and Kmax

The best model is the one maximising a criterion

Bayesian Information Criterion (BIC)

proxy of the integrated likelihood $P(Y \mid K) = \int P(Y \mid K, \theta)\, \pi(\theta \mid K)\, d\theta$

aims at finding a good number of components for a global fit of the data distribution

$$BIC(K) = \log P(Y \mid K, \hat{\theta}) - \frac{\nu_K}{2} \log(n)$$

where

$\nu_K$ is the number of free parameters of the model

$P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.
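A small numerical sketch may help to see how the penalty acts. It assumes Gaussian components in dimension Q with unconstrained covariance matrices, which gives $\nu_K = (K - 1) + KQ + KQ(Q+1)/2$; the maximised log-likelihood values are hypothetical numbers, not results from the analyses presented here.

```python
import math

def n_free_parameters(K, Q):
    """nu_K for a Gaussian mixture with K unconstrained covariance matrices in dimension Q."""
    proportions = K - 1                    # pi_1, ..., pi_{K-1}
    mean_params = K * Q                    # one mean vector per component
    cov_params = K * Q * (Q + 1) // 2      # one symmetric matrix per component
    return proportions + mean_params + cov_params

def bic(loglik, nu, n):
    """BIC(K) = log P(Y | K, theta_hat) - (nu_K / 2) log(n); larger is better."""
    return loglik - 0.5 * nu * math.log(n)

# Hypothetical maximised log-likelihoods for K = 2, ..., 5 (illustrative only).
logliks = {2: -1250.0, 3: -1180.0, 4: -1176.0, 5: -1174.5}
n, Q = 500, 2

scores = {K: bic(ll, n_free_parameters(K, Q), n) for K, ll in logliks.items()}
best_K = max(scores, key=scores.get)
print(scores)
print(best_K)  # prints 3
```

In this made-up example the log-likelihood keeps increasing with K, but beyond K = 3 the gain is smaller than the penalty, so the criterion selects three components.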



Integrated Completed Likelihood (ICL)

proxy of the integrated complete likelihood $P(Y, Z \mid K)$

dedicated to classification since it strongly penalizes models for which the classification is uncertain

$$ICL(K) = BIC(K) + \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik} \log \tau_{ik},$$

where BIC(K) depends on

$\nu_K$, the number of free parameters

$P(Y \mid K, \hat{\theta})$, the maximum likelihood under this model.
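The extra term in ICL is never positive, since each $\tau_{ik} \log \tau_{ik} \le 0$, and it is zero only when every observation is classified with certainty. The sketch below illustrates this with two hypothetical matrices of conditional probabilities and a made-up BIC value; the function names are assumptions for the example.

```python
import math

def entropy_term(tau):
    """sum_i sum_k tau_ik log(tau_ik), with the convention 0 log 0 = 0 (always <= 0)."""
    return sum(t * math.log(t) for row in tau for t in row if t > 0.0)

def icl(bic_value, tau):
    """ICL(K) = BIC(K) + sum_i sum_k tau_ik log(tau_ik)."""
    return bic_value + entropy_term(tau)

bic_value = -100.0                       # hypothetical BIC, identical for both cases
confident = [[1.0, 0.0], [0.0, 1.0]]     # every observation clearly assigned
uncertain = [[0.5, 0.5], [0.5, 0.5]]     # every observation ambiguous

print(icl(bic_value, confident))   # no additional penalty
print(icl(bic_value, uncertain))   # penalised by 2 * log(2)
```

With the same fit to the data, the model that separates the observations clearly keeps its BIC value, while the ambiguous one is penalised further.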


Conclusions on the model selection

BIC aims at finding a good number of components for a global fit of the data distribution. It tends to overestimate the number of components

ICL is dedicated to a classification purpose. It strongly penalizes models for which the classification is uncertain.

Whatever the criterion, it must be a convex function of the number of components

[Figure: two panels, "Bad behavior" and "Correct behavior"]

→ a non-convex function may indicate a modeling issue


1 Introduction

2 Mixture model definition

3 Genomic examples
Mixtures for co-expression analysis
Mixtures for analysing ChIP-chip data

4 Conclusions


GEM2Net: From gene expression modeling to -omics network

Goal: Explore the orphan gene space to identify new genes involved in defense and adaptation processes

Method: Predict co-expression networks using mixture models

Data: An original resource generated by the transcriptomic platform of URGV

Homogeneous data generated with the CATMA microarray
5,095 genes not present in the Affymetrix chip
High diversity of biological samples relative to stress conditions


Workflow overview

- Extraction from CATdb of 387 stress comparisons

- 17,264 genes are differentially expressed in at least one of these comparisons (FWER controlled at 5% over all the tests)

- Analyses performed with Gaussian Mixture Models

- According to the BIC curve, the naive clustering on the whole dataset is not relevant

- Gene co-expression depends on the stress categories → The functional modules vary with the environment


Results of the co-expression analysis

- 18 categories (9 biotic and 9 abiotic), identification of 681 clusters

- Large overlap between biotic and abiotic clusters

- 98% of clusters show a functional bias for a Gene Ontology term

- 80% are associated with a stress term

- 39% have a preferential sub-cellular localization in plastid

- 18% are enriched in transcription factors (TF); for stifenia, no cluster is enriched in TF


Focus on nematode stress

7467 genes described by 10 expression differences
29 clusters of co-expression identified
1519 genes with a conditional probability close to 1

Example of Cluster 14

49 genes repressed from 14 days after infection
13 genes known to be involved in stress response
10 orphan genes
Endoplasmic reticulum bias


GEM2Net database
http://urgv.evry.inra.fr/GEM2NET

Integration of various resources: gene ontology, genes involved instress responses, gene families (transcription factors andhormones) and protein-protein interactions (experimental andpredicted).

Original representation and interactive visualization, using piecharts to summarize the functional biases at first glance


ChIP-chip experiments

The log-ratio is not tractable while the couple (IP, Input) is

Development of a mixture of 2 linear regressions


MultiChIPmix: Mixture of two linear regressions

Let $Z_i$ be the status of probe $i$: $P(Z_i = 1) = \pi$

The linear relation between IP and Input depends on the probe status (index $r$ denotes the replicate)

$$IP_{ir} = \begin{cases} a_{0r} + b_{0r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 0 \text{ (normal)} \\ a_{1r} + b_{1r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 1 \text{ (enriched)} \end{cases} \qquad V(IP_{ir}) = \sigma_r^2$$

Martin-Magniette et al. (2008), Bioinformatics
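To make the role of the two regression lines concrete, the sketch below computes the conditional probability that a probe is enriched once the parameters are known. It is not the MultiChIPmix implementation: it assumes Gaussian errors and independence of the replicates given the probe status, and all parameter values and function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def enrichment_probability(inputs, ips, pi, normal, enriched, sds):
    """P(Z_i = 1 | Input_i, IP_i) for one probe measured in R replicates.

    inputs, ips : Input and IP values of the probe, one per replicate
    normal      : (a_0r, b_0r) per replicate, regression line when Z_i = 0
    enriched    : (a_1r, b_1r) per replicate, regression line when Z_i = 1
    sds         : residual standard deviation sigma_r per replicate
    """
    lik0, lik1 = 1.0 - pi, pi
    for x, y, (a0, b0), (a1, b1), sd in zip(inputs, ips, normal, enriched, sds):
        lik0 *= normal_pdf(y, a0 + b0 * x, sd)   # replicate r under "normal"
        lik1 *= normal_pdf(y, a1 + b1 * x, sd)   # replicate r under "enriched"
    return lik1 / (lik0 + lik1)

# Hypothetical parameters for R = 2 replicates.
pi = 0.1
normal = [(0.0, 1.0), (0.0, 1.0)]
enriched = [(1.5, 1.0), (1.5, 1.0)]
sds = [0.5, 0.5]

p_a = enrichment_probability([8.0, 8.2], [8.1, 8.1], pi, normal, enriched, sds)
p_b = enrichment_probability([8.0, 8.2], [9.6, 9.8], pi, normal, enriched, sds)
print(p_a, p_b)
```

The first probe lies close to the "normal" line in both replicates and gets a probability near 0; the second lies close to the "enriched" line and gets a probability near 1.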



Used to

create the first epigenomic map of Arabidopsis thaliana: Roudier et al. (2011), EMBO Journal

study the additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids: Moghaddam et al. (2011), Plant Journal


MultiChIPmixHMM for taking the spatial information into account

When probes are (almost) equally spaced along the genome, hybridisation signals tend to be clustered

Assuming that the probe statuses are (Markov-)dependent brings this information into the model:

$$\{Z_i\} \sim MC(\pi, \nu), \qquad \pi_{k\ell} = \Pr\{Z_i = k \mid Z_{i-1} = \ell\}$$
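The effect of the Markov dependency on the conditional probabilities can be illustrated with a generic forward-backward recursion. This is a textbook sketch, not the MultiChIPmixHMM code: the emission likelihoods, the transition matrix and the initial distribution are hypothetical, and `trans[l][k]` stores $\Pr\{Z_i = k \mid Z_{i-1} = \ell\}$.

```python
def forward_backward(emission_lik, trans, init):
    """Posterior P(Z_i = k | all observations) for a hidden Markov chain.

    emission_lik[i][k] : density of observation i under state k
    trans[l][k]        : P(Z_i = k | Z_{i-1} = l)
    init[k]            : P(Z_1 = k)
    """
    n, K = len(emission_lik), len(init)
    # Forward pass, normalised at each position to avoid underflow
    first = [init[k] * emission_lik[0][k] for k in range(K)]
    total = sum(first)
    alpha = [[v / total for v in first]]
    for i in range(1, n):
        cur = [emission_lik[i][k] * sum(alpha[i - 1][l] * trans[l][k] for l in range(K))
               for k in range(K)]
        total = sum(cur)
        alpha.append([v / total for v in cur])
    # Backward pass, also normalised
    beta = [[1.0] * K for _ in range(n)]
    for i in range(n - 2, -1, -1):
        raw = [sum(trans[k][l] * emission_lik[i + 1][l] * beta[i + 1][l] for l in range(K))
               for k in range(K)]
        total = sum(raw)
        beta[i] = [v / total for v in raw]
    # Combine both passes and renormalise
    post = []
    for i in range(n):
        w = [alpha[i][k] * beta[i][k] for k in range(K)]
        total = sum(w)
        post.append([v / total for v in w])
    return post

# Five probes, states 0 = normal and 1 = enriched; the third probe is ambiguous.
emission_lik = [[0.1, 0.9], [0.1, 0.9], [0.5, 0.5], [0.1, 0.9], [0.1, 0.9]]
init = [0.5, 0.5]

markov = forward_backward(emission_lik, [[0.9, 0.1], [0.1, 0.9]], init)
independent = forward_backward(emission_lik, [[0.5, 0.5], [0.5, 0.5]], init)
print(markov[2][1], independent[2][1])
```

With independent statuses, the ambiguous third probe keeps a probability of 0.5 of being enriched; with the Markov dependency, its enriched neighbours pull this probability well above 0.5.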


Table: Example of one known H3K27me3 target gene identified only with MultiChIPmixHMM.

MultiChIPmix and MultiChIPmixHMM are alternatives to peak detection methods

Analysis of several replicates simultaneously + modelling the spatial dependency = more accurate conditional probabilities

MultiChIPmixHMM is available as an R package: Bérard et al. (2013), BMC Bioinformatics



1 Introduction

2 Mixture model definition

3 Genomic examples

4 Conclusions


Conclusions

Mixtures reveal underlying structures

Key ingredients are P(Z) and P(Y|Z)

For genomic data, component distribution modeling is sometimes tricky, especially for RNA-Seq data

Applications on genomic data sometimes raise new methodological questions about the parameter inference and classification rules

Examples of R packages using mixtures: mclust, Rmixmod, MultiChIPmixHMM, HTSDiff, HTSCluster, poisson.glm.mix


Acknowledgements

Statistics Bioinformatics Biology

S. Robin    V. Brunaud    J-P. Renou
T. Mary-Huard    J-P Tamby    E. Delannoy
C. Bérard    R. Zaag    S. Balzergue
G. Celeux    Z. Tariq
C. Maugis-Rabusseau    V. Colot
G. Rigaill    F. Roudier
A. Rau
P. Papastamoulis    M. Seifert

Thank you for your attention !

