
Topic Model-based Mass Spectrometric Data Analysis in Cancer Biomarker Discovery Studies

Minkun Wang

Bradley Department of Electrical and Computer Engineering, Virginia Tech

Preliminary Exam

April 29, 2016

Kevin Minkun Wang (ECE, VT) Probabilistic Purification Model April 29, 2016 1 / 38


Overview

1 Introduction

2 Topic Model

3 Intensity-level purification model (IPM)

4 Scan-level purification model (SPM)

5 Evaluation using synthetic and experimental data

6 Summary & future work


Liquid or gas chromatography-mass spectrometry (LC/GC-MS)


Chromatogram and mass spectrum


About LC/GC-MS

Advantages

High-throughput (thousands of biomolecules in one LC/GC-MS run)

Highly sensitive (ability to profile low-abundance biomolecules)

Main applications: Biomarker discovery to identify candidate markers of

Proteins/glycoproteins (proteomics/glycoproteomics)

Metabolites (metabolomics)

Other biomolecules · · ·

Major challenges:

Significant variability in intensity measurement

Irreproducible chromatographic separation


Application to profiling biomolecules

LC-MS profiled proteins and GC-MS profiled metabolites

Preprocessing: raw data → extracted ion chromatograms (EICs, scan-level features) ⇒ integrated peak intensities (intensity-level features)

This allows us to investigate the data at different levels.


LC/GC-MS based omics & corresponding publications

experimental design, sample collection

sample preparation, data acquisition

data preprocessing: peak detection → retention time alignment → normalization → purification

statistical analysis, biomarker discovery

verification, validation

integrative/pathway/network analysis

biological interpretation

Data preprocessing

1. [Wang M, et al. (2013). IEEE International Conference on Bioinformatics and Biomedicine Workshop (BIBMW)]

2. [Tsai TH, Wang M, et al. (2016). Statistical Analysis in Proteomics (Methods in Molecular Biology)]

Computational purification

3. [Wang M, et al. (2015). IEEE International Conference on Bioinformatics and Biomedicine (BIBM)]

4. [Wang M, et al. (2016). BMC Genomics, in revision]

Biomarker discovery

5. [Tsai TH, Wang M, et al. (2014). J Proteome Res.]

6. [Tsai TH*, Wang M*, et al. (2015). Proteomics (* first authors)]

7. [Di Poto C, Wang M, et al. (2016). Cancer Epidemiol Biomarkers Prev, submitted]

Integrative analysis

8. [Wang M, et al. (2015). International Conference of the IEEE Engineering in Medicine and Biology Society]

9. [Wang M, et al. (2016). IEEE J Biomed Health Inform, in revision]

10. [Ressom HW, Wang M, et al. (2016). IEEE Engineering in Medicine and Biology Society, submitted]


Computational purification

Motivation

1. Specimens (e.g., tumor tissue and human blood) collected from patients exhibit some degree of heterogeneity.

2. The cancerous profiles of interest are typically contaminated by other components, leading to unreliable results in differential analyses (e.g., biomarker discovery).

3. Computational purification methods offer an inexpensive and efficient alternative to experimental methods.

4. Although this issue has been discussed in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry based proteomic and metabolomic studies.

Objective

Main focus of this study: address the data heterogeneity issue in LC/GC-MS based biomolecular expression profiles through appropriate computational purification.


Approaches in genomic field

1. Numerical approaches

Non-negative matrix factorization (NMF) methods or linear regression based models.

$$Y = \Theta \cdot X \quad \text{or} \quad y = \sum_{i=1}^{S} \theta_i x_i + \varepsilon, \qquad \sum_{i=1}^{S} \theta_i = 1$$

ssNMF [Gaujoux et al., 2012], PERT [Qiao et al., 2012], UNDO [Wang et al., 2014], etc.

2. Statistical approaches

Probabilistic graphical models to mimic the heterogeneous data generating process.

$$\mathcal{L} = p(y \mid \Theta, x) \prod_{l=1}^{L} p_l(\Theta_l, x)$$

DeMix [Ahn J et al., 2013], ISOLATE, ISOpure [Quon et al., 2009, 2013], etc.
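As a minimal sketch of the numerical (linear-mixture) formulation, assume only two known source profiles, so that y = θ·x1 + (1 − θ)·x2 + ε. Under that assumption the least-squares θ has a closed form. The function name and the toy profiles are illustrative assumptions; this is not the algorithm used by ssNMF, PERT, or UNDO.

```python
def estimate_two_source_proportion(y, x1, x2):
    """Least-squares theta for y ~ theta*x1 + (1 - theta)*x2, clipped to [0, 1]."""
    diff = [a - b for a, b in zip(x1, x2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        raise ValueError("source profiles are identical; theta is not identifiable")
    # Project (y - x2) onto the direction (x1 - x2).
    num = sum((yi - bi) * d for yi, bi, d in zip(y, x2, diff))
    return min(1.0, max(0.0, num / denom))


# Toy example (hypothetical numbers): a 70% / 30% noise-free mixture.
x1 = [4.0, 1.0, 0.0]
x2 = [0.0, 2.0, 3.0]
y = [0.7 * a + 0.3 * b for a, b in zip(x1, x2)]
theta = estimate_two_source_proportion(y, x1, x2)
```

With more than two sources, the sum-to-one and non-negativity constraints generally require a constrained solver rather than this projection.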


Proposed approach

* Proposed topic-model based approaches

I extend latent Dirichlet allocation (LDA), i.e., the topic model [Blei et al., 2003], to perform LC/GC-MS based omic data purification.

Both intensity-level and scan-level purification models (IPM & SPM) are proposed.

Clues

a) No tools have been developed for LC/GC-MS data purification.

b) Biomarker discovery studies provide group information (reference profiles).

c) Topic models provide a richer explanation of the data and more powerful purification performance.

d) A statistical approach is capable of modeling noise.


Topic model in natural language processing

* A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.

Document ← a collection of words (w) [Observed]

Words in a document ← an underlying set of topics (β = [β_1, ..., β_K]) [Latent]

Each word ← a topic indicator z [Latent]

Each topic ↔ a probability distribution over words/vocabulary

Each document ↔ a probability distribution over topics (mixture proportion: θ)


Topic model in natural language processing

Latent Dirichlet Allocation Bayesian Network:

Parameters

β = {β_1; ...; β_K}: topics

α = [α_1, ..., α_K]: Dirichlet priors

θ_d ∼ Dirichlet(θ_d | α): mixture proportion

z_{d,n} ∼ Multinomial(z_{d,n} | θ_d): topic indicator

w_{d,n} ∼ Multinomial(w_{d,n} | β_{z_{d,n}}): word

Outer plate is repeated for D documents; inner plate is repeated for N words.
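The generative process encoded by this Bayesian network can be sketched in a few lines. This is an illustration under stated assumptions, not the author's code: the helper names, the two-topic vocabulary, and the hyperparameter values are all hypothetical, and the Dirichlet draw uses the standard construction of normalizing independent Gamma variates.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def sample_categorical(probs, rng):
    """Draw an index k with probability probs[k]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, beta, n_words, rng):
    """One pass through the inner plate: theta -> z_n -> w_n for n = 1..N."""
    theta = sample_dirichlet(alpha, rng)                          # mixture proportion
    z = [sample_categorical(theta, rng) for _ in range(n_words)]  # topic indicators
    w = [sample_categorical(beta[k], rng) for k in z]             # words
    return theta, z, w


# Hypothetical setup: K = 2 topics over a 3-word vocabulary.
rng = random.Random(0)
alpha = [0.5, 0.5]
beta = [[0.9, 0.1, 0.0],   # topic 0 never emits word 2
        [0.0, 0.1, 0.9]]   # topic 1 never emits word 0
theta, z, w = generate_document(alpha, beta, n_words=20, rng=rng)
```

Calling `generate_document` D times corresponds to the outer plate.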

LDA explicitly includes dependence on model parameters {α,β}:

$$P(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{k=1}^{K} P(z_n^{(k)} \mid \theta)\, P(w_n \mid z_n^{(k)}, \beta) \right) d\theta$$

Coupling of θ and β → intractable likelihood calculation ↛ Expectation Maximization

Approximation: Variational EM to estimate parameters α, β and infer θ.

LDA → Intensity-level purification

Strength & limitation of LDA

LDA allows each single document to be associated with a specific mixture of multiple topics, a more flexible representation of data structure than that of mixture-of-unigrams models.

LDA is an unsupervised model that takes no advantage of prior knowledge (e.g., group information) in biomarker discovery studies.

LDA infers underlying topics β shared by the whole corpus. No document-specific topics can be captured.

Extension to Intensity-level Purification Model (IPM)

Documents ↔ MS dataset

Words ↔ Biomolecules

Mixture of topics ↔ Mixture of sources

Uncover topics ↔ Purification

1. Purification ⇐⇒ identify the underlying pure source.

2. More assumptions to consider prior knowledge and enable sample-specific purification.

3. Purification is carried out on intensity-level features after MS data preprocessing.


Intensity-level purification model

IPM Bayesian Network:

Application

Study: MS-based cancer biomarker discovery.

Data: intensities of multiple biomolecules across samples from case and control groups.

Notation

Heterogeneous data:

- {t_d}, d = 1, ..., D: expression profiles of samples in the cancer group (to be purified). [Observed]

Sources ('topics') β, γ → t (mixed via θ):

- {γ_d}, d = 1, ..., D: sample-specific pure cancerous origin. [Latent]

- {β_m}, m = 1, ..., M: non-cancerous contaminants (unwanted sources). [Observed]

- γ′: average cancer origin (whole-collection level). [Latent]

Hyperparameters:

- α, η, κ: Dirichlet priors conjugate to multinomials (θ, γ)


Intensity-level purification model: assumptions

IPM Bayesian Network:

Three assumptions in IPM:

{β_m} → t_d: The source contaminants in each expression profile t_d come from the control group {β_m}, m = 1, ..., M. It has been observed that the cancerous tissues within tumor samples are typically surrounded by adjacent non-cancerous tissues.

γ′ → γ_d: The corresponding cancerous origins {γ_d}, d = 1, ..., D share an average cancer profile γ′. An individual cancerous profile can be treated as a noisy version of the average cancer profile in the same group (e.g., liver cancer group).

{β_m} → γ′: The average cancer profile has patterns similar to the non-cancerous profiles, except for some sites (biomolecules) that are differentially expressed between case and control groups. This holds within the same cohort.

Mathematically, {β_m}, γ′, {γ_d} represent multinomial (probability) distributions over biomolecules.


Intensity-level purification model

IPM complete likelihood:

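$$\mathcal{L}(t, z, \theta, \gamma, \gamma' \mid \alpha, \beta, \eta, \kappa, \kappa') = p(\gamma' \mid \beta, \eta, \kappa') \cdot \prod_{d=1}^{D} p(\theta_d \mid \alpha) \cdot p(\gamma_d \mid \gamma', \kappa_d) \cdot \prod_{n=1}^{N} \left[ p(z_{d,n} \mid \theta_d) \cdot p(t_{d,n} \mid z_{d,n}, \theta_d, \beta, \gamma_d) \right]$$

- p(θ_d | α) = Dirichlet(θ_d | α, 1)

- p(γ′ | β, η, κ′) = Dirichlet(γ′ | η^T β, κ′)

- p(γ_d | γ′, κ_d) = Dirichlet(γ_d | γ′, κ_d)

- p(z_{d,n} | θ_d) = Multinomial(z_{d,n} | θ_d)

- p(t_{d,n} | z_{d,n} ≤ M, θ_d, β, γ_d) = Multinomial(t_{d,n} | β_{z_{d,n}})

- p(t_{d,n} | z_{d,n} = M + 1, θ_d, β, γ_d) = Multinomial(t_{d,n} | γ_d)

Inference & Estimation: maximizing the complete likelihood function via variational expectation maximization (variational EM) algorithms.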
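To make the factorization above concrete, the following sketch samples one heterogeneous profile from the IPM generative process. It rests on two assumptions that the slides do not spell out: Dirichlet(x | m, κ) is read as a Dirichlet with parameter vector κ·m (mean m, concentration κ), and a profile is represented as counts of biomolecule "tokens", as in LDA. All names and numbers are hypothetical.

```python
import random


def dirichlet(mean, strength, rng):
    """Dirichlet draw with parameters strength * mean (assumed parameterization).

    All entries of `mean` must be strictly positive.
    """
    draws = [rng.gammavariate(strength * m, 1.0) for m in mean]
    total = sum(draws)
    return [g / total for g in draws]


def generate_ipm_profile(alpha, beta, gamma_avg, kappa_d, n_tokens, rng):
    """Sample theta_d, gamma_d and a token-count profile t_d for one cancer sample."""
    n_control = len(beta)                          # M control (contaminant) sources
    theta = dirichlet(alpha, 1.0, rng)             # length M + 1 mixing proportions
    gamma_d = dirichlet(gamma_avg, kappa_d, rng)   # noisy copy of the average profile
    sources = beta + [gamma_d]                     # source M + 1 is the cancer origin
    counts = [0] * len(gamma_avg)
    for _ in range(n_tokens):
        z = rng.choices(range(n_control + 1), weights=theta, k=1)[0]
        t = rng.choices(range(len(counts)), weights=sources[z], k=1)[0]
        counts[t] += 1
    return theta, gamma_d, counts


# Hypothetical setup: M = 2 control profiles over 3 biomolecules.
rng = random.Random(1)
beta = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
gamma_avg = [0.2, 0.3, 0.5]
theta, gamma_d, counts = generate_ipm_profile(
    alpha=[1.0, 1.0, 2.0], beta=beta, gamma_avg=gamma_avg,
    kappa_d=50.0, n_tokens=500, rng=rng)
```

Purification is the inverse problem: given the counts and the control profiles β, recover θ_d and γ_d.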


Variational EM for IPM

Two-phase updating rules

1. Treat γ′ as a consistent cancer origin for all profiles. Each profile is mixed from the topic panel {β_1, ..., β_M, γ′}.

* Use the same variational EM framework as LDA [Blei et al., 2003] to estimate α, κ′, and infer {θ_d}, γ′.

2. Fix the cancer mixing proportion θ_{d,M+1}, and use the average cancer origin γ′ as a prior to infer the sample-specific pure cancer profile γ_d and the contaminant mixing proportions {θ_{d,k}}, k = 1, ..., M.

* Iteratively maximize the complete log-likelihood function through conjugate gradient descent until convergence.

[Wang M, et al. (2015). IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp.228-233.]


Intensity-level → Scan-level purification

Strength & limitation of IPM

IPM enables sample-specific purification. Intensity-level features are convenient to implement.

An intensity-level feature is obtained by integrating the scan-level measurements of a detected chromatographic peak within a specified retention time (RT) interval. This integration or truncation inevitably introduces variance that interferes with the original sample heterogeneity. IPM ignores peak shape information.

IPM is not robust to noise.

Extension to Scan-level Purification Model (SPM)

1. Consider peak shape information based on extracted ion chromatogram (EIC).

2. Consider random noise.

* We hypothesize that purification at the scan level leads to more accurate results and offers the opportunity to extend the model to characterize both ion abundance and peak shape.


Scan-level information: EICs

An EIC is characterized by its retention time (corresponding to multiple scans), mass value, and ion abundance. Area under EIC ⇒ integrated peak intensity.

Now, {t_d} (and likewise {β_m}) consists of multiple EIC peaks, each represented by ion abundances across S scans with a certain elution shape F(·), characterized by an exponentially modified Gaussian (EMG).

$$t_{d,n}(s) = x_{d,n} \cdot \delta_{d,n}(s) \cdot F(s, \phi_{d,n}) + e_{d,n}(s), \quad s = 1, \dots, S$$

$$F(s, \phi) = \frac{\zeta}{2} \exp\!\left( \frac{\zeta}{2} \left( 2\mu + \zeta\sigma^2 - 2s \right) \right) \cdot \left( 1 - \operatorname{erf}\!\left( \frac{\mu + \zeta\sigma^2 - s}{\sqrt{2}\,\sigma} \right) \right), \quad \phi \doteq \{\mu, \zeta, \sigma\}$$
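The EMG elution shape can be evaluated directly with the standard library. The sketch below uses 1 − erf(a) = erfc(a) and numerically checks two known properties of the EMG density: it integrates to one, and its mean is μ + 1/ζ. The parameter values and the integration grid are illustrative assumptions.

```python
import math


def emg(s, mu, zeta, sigma):
    """Exponentially modified Gaussian elution shape F(s, phi), phi = (mu, zeta, sigma)."""
    expo = 0.5 * zeta * (2.0 * mu + zeta * sigma ** 2 - 2.0 * s)
    arg = (mu + zeta * sigma ** 2 - s) / (math.sqrt(2.0) * sigma)
    # erfc(a) equals 1 - erf(a) and keeps more precision in the tail.
    return 0.5 * zeta * math.exp(expo) * math.erfc(arg)


# Hypothetical peak: apex region near mu = 10, tailing rate zeta = 0.5, width sigma = 1.
mu, zeta, sigma = 10.0, 0.5, 1.0
step = 0.01
grid = [i * step for i in range(-1000, 6001)]   # s from -10 to 60
values = [emg(s, mu, zeta, sigma) for s in grid]
area = sum(values) * step                        # Riemann sum, should be close to 1
mean = sum(s * v for s, v in zip(grid, values)) * step
```

Because F integrates to one, the ion abundance x_{d,n} in the scan-level equation plays the role of the integrated peak area, which is the intensity-level feature used by IPM.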


Scan-level purification model (SPM)

Incorporate EIC information (by adding a lower layer) into IPM.

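Three assumptions in IPM still hold for ion abundances x^t, x^β, x^{γ′}, and x^γ.

Noise variable:

$$e_{d,n}(s) \mid \sigma^2_{e_d} \sim \mathcal{N}(0, \sigma^2_{e_d}), \qquad \sigma^2_{e_d} \sim \mathrm{IG}(a_e, b_e)$$

Missing scan indicator variable:

$$p(\delta_{d,n}(s) \mid q_d) = \mathrm{Bernoulli}(\delta_{d,n}(s) \mid q_d), \qquad p(q_d \mid a_q, b_q) = \mathrm{Beta}(q_d \mid a_q, b_q)$$

Heterogeneous data point is modeled as:

$$t_{d,n}(s) \mid x^t_{d,n}, q_d, \phi_{d,n}, \sigma^2_{e_d} \sim q_d\, \mathcal{N}\!\left( x^t_{d,n} F(s, \phi_{d,n}), \sigma^2_{e_d} \right) + (1 - q_d)\, \mathcal{N}(0, \sigma^2_{e_d})$$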
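The scan-level observation model can be simulated as follows. This is a sketch under assumptions: δ(s) = 1 (scan present) occurs with probability q_d, IG(a_e, b_e) is read as shape a_e and scale b_e, and every numeric value is hypothetical.

```python
import math
import random


def emg(s, mu, zeta, sigma):
    """EMG elution shape F(s, phi) with phi = (mu, zeta, sigma)."""
    expo = 0.5 * zeta * (2.0 * mu + zeta * sigma ** 2 - 2.0 * s)
    arg = (mu + zeta * sigma ** 2 - s) / (math.sqrt(2.0) * sigma)
    return 0.5 * zeta * math.exp(expo) * math.erfc(arg)


def simulate_eic(x, phi, q, sigma2, n_scans, rng):
    """Draw t(s), s = 1..S, from q*N(x*F(s, phi), sigma2) + (1 - q)*N(0, sigma2)."""
    mu, zeta, sigma = phi
    sd = math.sqrt(sigma2)
    scans = []
    for s in range(1, n_scans + 1):
        present = rng.random() < q                 # delta(s) ~ Bernoulli(q)
        center = x * emg(s, mu, zeta, sigma) if present else 0.0
        scans.append(rng.gauss(center, sd))        # additive Gaussian noise e(s)
    return scans


rng = random.Random(2)
phi = (10.0, 0.5, 1.0)
q = rng.betavariate(9.0, 1.0)                      # q_d ~ Beta(a_q, b_q)
sigma2 = 1.0 / rng.gammavariate(3.0, 1.0 / 2.0)    # sigma^2 ~ IG(a_e = 3, b_e = 2)
noisy = simulate_eic(100.0, phi, q, sigma2, n_scans=30, rng=rng)

# Degenerate settings make the two mixture components visible.
clean = simulate_eic(100.0, phi, 1.0, 0.0, n_scans=30, rng=rng)
missing = simulate_eic(100.0, phi, 0.0, 0.0, n_scans=30, rng=rng)
```

With q = 1 and zero noise the trace is exactly x·F(s, φ); with q = 0 every scan is missing and only noise remains.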


Inference method for SPM

Split SPM into two components:

P1. Mixture model of underlying ion abundances (same as IPM).

P2. Scan-level feature modeling and inference.

Two-step updating rules:

S1. Markov chain Monte Carlo sampling → peak shape model parameters in P2 (i.e., ion abundances x^t, x^β, and shape function parameters φ).

* Gibbs sampling for variables (indicated by Θ_g) with known posterior density functions.

* Metropolis–Hastings for the rest, Υ_mh, with a multivariate Gaussian proposal distribution Q().

S2. Treat x^t, x^β as observed variables to implement the inference of P1 using the same VEM algorithm employed in IPM.

[Wang M, et al. (2016). BMC Genomics, in revision.]


Evaluation using synthetic and experimental datasets

Models in comparison

Intensity-level purification model (IPM)

Scan-level purification model (SPM)

DeMix* (version 1.0.1)

Latent Dirichlet allocation (LDA)

Mass spectrometric datasets

Synthetic dataset

LC-MS based serum proteomic dataset

- 116 samples from 57 patients with hepatocellular carcinoma (HCC) and 59 controls with liver cirrhosis. 101 proteins were identified, corresponding to 187 peptides.

GC-MS based tissue metabolomic dataset

- 15 samples from 5 HCC cases (5 tumor and 5 adjacent cirrhotic tissues) and 5 patients with liver cirrhosis. 559 metabolites were identified.

* Statistical approach for deconvolving mixed cancer transcriptomes


Synthetic dataset generation

Mix previously profiled LC-MS serum proteomic data to simulate heterogeneous cancer profiles.


Synthetic dataset evaluation

To test if the models can:

1) reasonably estimate the proportion of mixtures θ in each of the synthetic profiles;

2) accurately infer the underlying pure cancer profiles γ.

Evaluation metrics

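Estimation error ratio between the estimated proportions of mixtures θ*_d and the true ones θ_d:

$$\xi_d(\theta^*, \theta) = \frac{\lVert \theta^*_d - \theta_d \rVert_1}{\lVert \theta_d \rVert_1} \times 100\%, \quad d = 1, \dots, D$$

Estimation error ratio between the inferred sample-specific pure profile γ*_d and the ground truth γ_d (the correlation coefficient ρ⟨γ*, γ⟩ between the two is also reported):

$$\xi_d(\gamma^*, \gamma) = \frac{\left\lVert \sum_{s=1}^{S} \left[ \gamma^*_d(s) - \gamma_d(s) \right] \right\rVert_1}{\left\lVert \sum_{s=1}^{S} \gamma_d(s) \right\rVert_1} \times 100\%, \quad d = 1, \dots, D$$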
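Both metrics are relative L1 errors expressed in percent, so a single helper covers them. The function name and the toy vectors are illustrative assumptions.

```python
def error_ratio(estimate, truth):
    """Relative L1 error in percent: ||estimate - truth||_1 / ||truth||_1 * 100."""
    if len(estimate) != len(truth):
        raise ValueError("estimate and truth must have the same length")
    denom = sum(abs(v) for v in truth)
    if denom == 0.0:
        raise ValueError("truth has zero L1 norm")
    return 100.0 * sum(abs(a - b) for a, b in zip(estimate, truth)) / denom


# Hypothetical mixing proportions: two entries are off by 0.1 each.
theta_true = [0.4, 0.4, 0.2]
theta_est = [0.5, 0.3, 0.2]
xi = error_ratio(theta_est, theta_true)   # about 20 (percent)
```

For ξ_d(γ*, γ), the same helper would be applied to the scan-summed profiles, since the formula sums over s before taking the norm.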


Estimation of mixture proportion θ

Comparison between θ*_d and θ_d (first 6 instances displayed)


Estimation of pure cancer profile γ

Scatter plots of the original profile t_d versus γ_d, compared with scatter plots of the estimated pure profile γ*_d versus γ_d (first 6 instances displayed)


Performance of IPM on synthetic dataset

Estimation error ratio ξ_d(θ*, θ): means (standard deviations) based on 100 realizations.

SNR   LDA             DeMix           IPM
∞     30.87 (8.95)    4.395 (1.014)   2.331 (0.541)
50    38.56 (10.12)   6.753 (2.455)   4.198 (1.705)
25    51.45 (12.21)   12.74 (3.258)   13.71 (4.302)
10    76.18 (14.25)   35.78 (9.854)   32.25 (10.98)

Estimation error ratio ξ_d(γ*, γ): means (standard deviations) based on 100 realizations.

SNR   No purification*   LDA             DeMix           IPM
∞     16.57 (3.432)      12.87 (2.043)   7.294 (1.821)   6.510 (1.015)
50    24.11 (5.217)      24.05 (5.885)   16.33 (3.753)   10.20 (2.781)
25    30.66 (7.514)      31.25 (7.356)   19.34 (4.255)   20.16 (4.041)
10    39.78 (8.021)      36.75 (7.953)   25.64 (4.863)   21.53 (3.872)

Correlation coefficients ρ⟨γ*, γ⟩: means (standard deviations) based on 100 realizations.

SNR   No purification*   LDA             DeMix           IPM
∞     0.985 (0.002)      0.988 (0.003)   0.998 (2.455)   0.999 (0.001)
50    0.947 (0.005)      0.955 (0.005)   0.988 (0.002)   0.995 (0.002)
25    0.875 (0.025)      0.895 (0.015)   0.926 (0.014)   0.950 (0.005)
10    0.755 (0.035)      0.795 (0.022)   0.890 (0.015)   0.940 (0.012)

* Original profile without purification, ⟨t_d, γ_d⟩


Performance of SPM on synthetic dataset with EICs

Estimation error ratio ξ_d(θ*, θ): means (standard deviations) based on 100 realizations.

SNR   LDA             DeMix          IPM             SPM
∞     42.71 (11.59)   10.69 (3.01)   7.231 (2.526)   3.568 (1.422)
50    48.33 (10.12)   13.75 (2.55)   13.85 (2.15)    3.922 (1.305)
25    57.29 (8.99)    22.57 (4.58)   20.41 (5.20)    4.392 (1.823)
10    83.48 (12.52)   27.62 (5.84)   25.16 (6.52)    9.573 (2.117)

Estimation error ratio ξ_d(γ*, γ): means (standard deviations) based on 100 realizations.

SNR   No purification   LDA            DeMix           IPM             SPM
∞     9.61 (1.432)      8.72 (2.043)   4.239 (1.821)   4.231 (1.206)   3.120 (0.085)
50    15.14 (2.71)      14.47 (2.15)   14.33 (3.73)    13.85 (2.15)    4.201 (0.091)
25    29.32 (4.54)      23.95 (4.36)   19.13 (3.27)    18.41 (3.20)    6.571 (0.523)
10    35.22 (8.14)      34.52 (7.51)   20.46 (5.63)    23.16 (5.85)    10.454 (0.946)

Correlation coefficients ρ⟨γ*, γ⟩: means (standard deviations) based on 100 realizations.

SNR   No purification   LDA             DeMix           IPM             SPM
∞     0.985 (0.005)     0.988 (0.003)   0.998 (0.002)   0.999 (0.001)   0.999 (0.001)
50    0.935 (0.005)     0.955 (0.015)   0.988 (0.002)   0.975 (0.005)   0.998 (0.002)
25    0.885 (0.025)     0.895 (0.015)   0.945 (0.014)   0.955 (0.005)   0.990 (0.005)
10    0.785 (0.035)     0.815 (0.025)   0.920 (0.015)   0.925 (0.015)   0.980 (0.015)


Evaluation on LC-MS based proteomic data

116 LC-MS based serum proteomic data

57 hepatocellular carcinoma (HCC) vs. 59 cirrhosis.

101 proteins were identified.

[Tsai, et al. (2014). Proteomics]


Evaluation on LC-MS based proteomic data

Principal component analysis:


Evaluation on LC-MS based proteomic data

ROC curves: a) No purification b) purified by IPM c) purified by SPM

Performance       No purification        IPM                    SPM
# of biomarkers   43                     75                     69
AUC (95% CI)      0.706 [0.606, 0.795]   0.793 [0.700, 0.863]   0.811 [0.719, 0.890]

A bootstrap method (1000 bootstrap replicates) was used to compute the 95% confidence interval (CI) of the area under each ROC curve.

More powerful biomarkers were selected after scan-level purification.
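A bootstrap confidence interval for the AUC can be sketched as below. The slides do not say which bootstrap variant was used, so two choices here are assumptions: cases and controls are resampled separately (stratified), and the interval is the percentile interval. The function names and toy scores are hypothetical.

```python
import random


def auc(pos, neg):
    """AUC as the Mann-Whitney probability P(pos > neg), with ties counted as 1/2."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(pos, neg, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the AUC, resampling each group with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        stats.append(auc(boot_pos, boot_neg))
    stats.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))
    return stats[lo_idx], stats[hi_idx]


# Hypothetical classifier scores for cases (pos) and controls (neg).
pos = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4]
neg = [0.7, 0.5, 0.45, 0.3, 0.2, 0.1]
point = auc(pos, neg)
ci_low, ci_high = bootstrap_auc_ci(pos, neg)
```

The default of 1000 replicates matches the number reported on the slide.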



Pathway analysis: signaling pathways (number of significant proteins involved in the pathway)

Pathway                               No Purification   IPM   SPM
Complement and coagulation cascades   13                18    19
Systemic lupus erythematosus          5                 6     4
Prion diseases                        4                 4     4
PPAR signaling pathway*               -                 5     6

* Evidence found in previous reports [Tachibana, et al. PPAR Research 2008] linking cancer and PPARs expressed in human liver.


Evaluation on GC-MS based metabolomic data

15 GC-MS based tissue metabolomic data

5 HCC vs. 5 adjacent cirrhosis vs. 5 independent cirrhosis

559 metabolites were identified.

No scan-level features are available.

Purify the HCC profiles $\{t_d\}_{d=1,\dots,5}$ using independent cirrhotic profiles $\{\beta_m\}_{m=1,\dots,5}$.

No. of biomarkers (FDR adjusted p-value ≤ 0.05): 0 → 7

Purify adjacent cirrhotic profiles $\{\psi_d\}_{d=1,\dots,5}$ using the HCC profiles $\{t_d\}_{d=1,\dots,5}$. As expected, the purified adjacent cirrhotic profiles move closer to the independent cirrhotic profiles:

$\bar{\xi}(\psi, \beta) = 28.3\% \longrightarrow \bar{\xi}(\psi^{*}, \beta) = 24.9\%$

The improvements are less substantial than on the previous datasets, presumably due to the limited sample size and potential overfitting.
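The biomarker counts on this slide rely on an FDR-adjusted p-value threshold of 0.05. The slides do not name the adjustment procedure; the sketch below assumes the Benjamini-Hochberg method, and the p-values and the function name `bh_adjust` are made up for illustration.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Hypothetical raw p-values for four metabolites.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(raw)
n_biomarkers = int((adjusted <= 0.05).sum())
print(adjusted, n_biomarkers)
```

A metabolite would then count as a candidate biomarker when its adjusted p-value is at or below 0.05.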


Summary of research

Target LC/GC-MS based omics for cancer biomarker discovery.

Aim to address the sample heterogeneity issue.

1. Investigate topic model-based inference methods (IPM, SPM) to computationally purify LC/GC-MS data.

- IPM purifies each profile in a sample-specific fashion.

- SPM hypothesizes information loss in data preprocessing.

2. Observe that incorporating scan-level features has the potential to yield more accurate purification results by alleviating the information loss that results from integrating peaks.

3. Show that MS based biomarker discovery studies can potentially benefit from topic model-based purification of the data prior to statistical and pathway analyses.


Future work

1. Evaluate IPM and SPM on more experimental MS based omic datasets in biomarker discovery studies.

2. Apply appropriate forms of regularization to the parameters to address the limitation due to small sample size. Treat the number of topics nonparametrically.

3. Develop an R package integrating IPM and SPM into an MS probabilistic purification model (MSppm).

May 2016: Evaluate IPM and SPM on 105 tissue metabolomic data.

August 2016: Adjust the parameter regularization.

November 2016: Make the models nonparametric.

January 2017: Re-investigate biomarker discovery results on the purified list.

April 2017: Integrate IPM and SPM into MSppm.


References

-[Wang M, Yu G, Mechref Y, Ressom HW (2013). GPA: an algorithm for LC/MS based glycan profile annotation. In the proceedings of the 2013 IEEE International Conference on Bioinformatics and Biomedicine Workshop (BIBMW), pp. 16-22.]

-[Tsai TH, Wang M, Ressom HW (2016). Preprocessing and Analysis of LC-MS-Based Proteomic Data. Statistical Analysis in Proteomics (Methods in Molecular Biology), 63-76.]

-[Wang M, Tsai TH, Yu G, Ressom HW (2015). Purification of LC/GC-MS based biomolecular expression profiles using a topic model. In the proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 228-233.]

-[Wang M, Tsai TH, Di Poto C, Ferrarini A, Yu G, Ressom HW (2016). Topic model-based mass spectrometric data analysis in cancer biomarker discovery studies. BMC Genomics, in revision.]

-[Tsai TH, Wang M, Di Poto C, Hu Y, Zhou S, Zhao Y, Varghese RS, Luo Y, Tadesse MG, Ziada DH, Desai CS, Shetty K, Mechref Y, Ressom HW (2014). LC-MS profiling of N-Glycans derived from human serum samples for biomarker discovery in hepatocellular carcinoma. J Proteome Res. 13(11), 4859-4868.]

-[Tsai TH*, Song E*, Zhu R*, Di Poto C*, Wang M*, Luo Y, Varghese RS, Tadesse MG, Ziada DH, Desai CS, Shetty K, Mechref Y, Ressom HW (2015). LC-MS/MS based serum proteomics for identification of candidate biomarkers for hepatocellular carcinoma. Proteomics. 15(13), 2369-2381. (* first authors)]

-[Di Poto C, Ferrarini A, Zhao Y, Varghese RS, Tu C, Zuo Y, Wang M et al (2016). Metabolomic characterization of hepatocellular carcinoma in patients with liver cirrhosis for biomarker discovery. Cancer Epidemiol Biomarkers Prev, submitted.]
