
Learning graphical models for hypothesis testing

and classification¹

Umamahesh Srinivas

iPAL Group Meeting

October 22, 2010

¹ Tan et al., IEEE Trans. Signal Processing, Nov. 2010

Outline

1 Background and motivation

2 Graphical models: some preliminaries

3 Generative learning of trees

4 Discriminative learning of trees

Discriminative learning of forests

5 Learning thicker graphs via boosting

6 Extension to multi-class problems


Binary hypothesis testing problem

Random vector x = (x1, . . . , xn) ∈ X^n generated from either of two hypotheses:

H0 : x ∼ p

H1 : x ∼ q

Given: Training sets Tp and Tq, K samples each

Goal: Classify new sample as coming from H0 or H1

Assumption: Class densities p and q known exactly

Likelihood ratio test (LRT):

L(x) := p(x)/q(x)  ≷  τ,

declaring H0 when L(x) exceeds the threshold τ, and H1 otherwise.

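A minimal sketch of the LRT decision rule, assuming two hypothetical Gaussian class densities and a unit threshold (none of these choices come from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional densities (illustration only, not from the slides).
p = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))  # H0: x ~ p
q = multivariate_normal(mean=[1.0, 0.5], cov=np.eye(2))  # H1: x ~ q

def lrt_decision(x, tau=1.0):
    """Declare H0 when L(x) = p(x)/q(x) exceeds tau, and H1 otherwise."""
    L = p.pdf(x) / q.pdf(x)
    return "H0" if L > tau else "H1"

x_new = np.array([0.9, 0.4])
print(lrt_decision(x_new))  # classify a new sample
```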

What if true densities are not known a priori?

Estimate empirical distributions p̃ and q̃ from Tp and Tq respectively

LRT using p̃ and q̃ in place of p and q

Problem: For high-dimensional data, a large number of samples is needed to get reasonable empirical estimates.

Graphical models:

Efficiently learn tractable models from insufficient data

Trade-off between consistency and generalization

Generative learning: Learning models to approximate the individual distributions.

Learn p̂ from Tp alone, and q̂ from Tq alone.

Discriminative learning: Learning models for binary classification.

Learn p̂ from both Tp and Tq; likewise q̂.


Graphical models: Preliminaries

(Undirected) graph G = (V, E) defined by a set of nodes V = {1, . . . , n} and a set of edges E ⊆ {(i, j) : i, j ∈ V, i ≠ j} (unordered pairs of nodes).

Graphical model: Random vector defined on a graph such that each node represents one (or more) random variables, and edges reveal conditional dependencies.

Graph structure defines factorization of joint probability distribution.

[Figure: example tree on seven nodes x1, . . . , x7, rooted at x1; x2 and x3 are children of x1, x4 and x5 of x2, and x6 and x7 of x3.]

f(x) = f(x1)f(x2|x1)f(x3|x1)f(x4|x2)f(x5|x2)f(x6|x3)f(x7|x3).

Local Markov property:

p(xi | xV∖i) = p(xi | xN(i)), ∀ i ∈ V, where N(i) denotes the neighbors of node i.

Such a p(x) is Markov w.r.t. G.


Trees and forests

Tree: Undirected acyclic graph with exactly (n− 1) edges.

Forest: Contains k < (n− 1) edges → not connected.

Factorization property:

p(x) = ∏_{i ∈ V} p(xi) · ∏_{(i,j) ∈ E} p_{i,j}(xi, xj) / [ p(xi) p(xj) ].

Notational convention: p denotes a probability distribution; p̂ denotes a graphical (tree- or forest-structured) approximation of it.

Projection p̂ of p onto a tree (or forest) with edge set E:

p̂(x) := ∏_{i ∈ V} p(xi) · ∏_{(i,j) ∈ E} p_{i,j}(xi, xj) / [ p(xi) p(xj) ].

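As a quick check of the factorization and projection formulas, the sketch below evaluates a tree-structured distribution from node and pairwise marginals; the three-node tree and the numerical marginals are made up for illustration:

```python
import numpy as np

# Node marginals p_i and pairwise marginals p_ij on the tree edges (1,2) and (1,3).
# The numbers are made up, but chosen to be mutually consistent.
node_marginals = {
    1: np.array([0.6, 0.4]),
    2: np.array([0.5, 0.5]),
    3: np.array([0.7, 0.3]),
}
edge_marginals = {                       # rows index x_i, columns index x_j
    (1, 2): np.array([[0.35, 0.25], [0.15, 0.25]]),
    (1, 3): np.array([[0.50, 0.10], [0.20, 0.20]]),
}

def tree_density(x):
    """Evaluate p(x) = prod_i p(x_i) * prod_(i,j) p_ij(x_i, x_j) / (p(x_i) p(x_j))."""
    val = 1.0
    for i, pi in node_marginals.items():
        val *= pi[x[i]]
    for (i, j), pij in edge_marginals.items():
        val *= pij[x[i], x[j]] / (node_marginals[i][x[i]] * node_marginals[j][x[j]])
    return val

# Sanity check: with consistent marginals, the factorization sums to 1.
total = sum(tree_density({1: a, 2: b, 3: c})
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # -> 1.0
```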

Generative learning of trees²

Optimal tree approximation of a distribution

Given p, find p̂ = argmin_{p̂ ∈ T} D(p || p̂), where T is the family of tree-structured distributions and

D(p || p̂) := ∫ p(x) log [ p(x) / p̂(x) ] dx.

Equivalent max-weight spanning tree (MWST) problem:

max_{E : G = (V, E) is a tree}  ∑_{(i,j) ∈ E} I(xi ; xj).

Need only marginal and pairwise statistics

Solved by Kruskal's MWST algorithm.

² Chow and Liu, IEEE Trans. Inf. Theory, 1968

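A minimal sketch of the Chow-Liu procedure on toy binary data follows; the data and its dimensions are made up, and the Kruskal MWST step is implemented with a simple union-find:

```python
import numpy as np
from itertools import combinations

def empirical_mutual_info(xi, xj):
    """Empirical mutual information I(x_i; x_j) for two binary sample vectors."""
    I = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pab = np.mean((xi == a) & (xj == b))
            pa, pb = np.mean(xi == a), np.mean(xj == b)
            if pab > 0:
                I += pab * np.log(pab / (pa * pb))
    return I

def kruskal_mwst(n, weights):
    """Max-weight spanning tree over n nodes via Kruskal's algorithm (union-find)."""
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:              # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy training set: K = 200 samples of an n = 5 dimensional binary vector.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))

weights = {(i, j): empirical_mutual_info(X[:, i], X[:, j])
           for i, j in combinations(range(X.shape[1]), 2)}
print(kruskal_mwst(X.shape[1], weights))  # edge set of the Chow-Liu tree
```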

J-divergence

Given distributions p and q,

J(p, q) := D(p||q) + D(q||p) = ∫ (p(x) − q(x)) log [ p(x) / q(x) ] dx.

Bounds on the probability of error in terms of J:

(1/4) exp(−J) ≤ Pr(err) ≤ (1/2) (J/4)^(−1/4).

Maximize J to minimize upper bound on Pr(err).

Tree-approximate J-divergence of p̂, q̂ w.r.t. p, q:

Ĵ(p̂, q̂; p, q) := ∫ (p(x) − q(x)) log [ p̂(x) / q̂(x) ] dx.

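For a quick numerical check of the definition and the error bound above, here is a sketch with two made-up discrete distributions:

```python
import numpy as np

# Two made-up discrete distributions over a three-letter alphabet.
p = np.array([0.80, 0.15, 0.05])
q = np.array([0.05, 0.15, 0.80])

# J(p, q) = D(p||q) + D(q||p) = sum_x (p(x) - q(x)) log(p(x)/q(x)).
J = np.sum((p - q) * np.log(p / q))

lower = 0.25 * np.exp(-J)            # (1/4) exp(-J)
upper = 0.5 * (J / 4.0) ** (-0.25)   # (1/2) (J/4)^(-1/4)
print(J, lower, upper)               # per the slide, Pr(err) lies between the two bounds
```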

Marginal consistency of p̂ w.r.t. p:

p̂_{i,j}(xi, xj) = p_{i,j}(xi, xj), ∀ (i, j) ∈ E_p̂.

Benefits of the tree-approximate J-divergence:

Maximizing the tree-approximate J-divergence gives good discriminative performance (shown experimentally).

Marginal consistency leads to a tractable optimization for p̂ and q̂.

Trees provide a rich class of distributions for modeling high-dimensional data.


Discriminative learning of trees

p̃ and q̃: empirical distributions estimated from Tp and Tq respectively.

(p̂, q̂) = argmax_{p̂ ∈ T, q̂ ∈ T} Ĵ(p̂, q̂; p̃, q̃).

Decoupling into two independent MWST problems:

p̂ = argmin_{p̂ ∈ T} [ D(p̃ || p̂) − D(q̃ || p̂) ]

q̂ = argmin_{q̂ ∈ T} [ D(q̃ || q̂) − D(p̃ || q̂) ].


Edge weights:

ψ^p_{i,j} := E_{p̃_{i,j}} [ log ( p̃_{i,j} / (p̃_i p̃_j) ) ] − E_{q̃_{i,j}} [ log ( p̃_{i,j} / (p̃_i p̃_j) ) ]

ψ^q_{i,j} := E_{q̃_{i,j}} [ log ( q̃_{i,j} / (q̃_i q̃_j) ) ] − E_{p̃_{i,j}} [ log ( q̃_{i,j} / (q̃_i q̃_j) ) ].

Algorithm 1 Discriminative trees (DT)

Given: Training sets Tp and Tq.

1: Estimate pairwise statistics p̃_{i,j}(xi, xj) and q̃_{i,j}(xi, xj) for all node pairs (i, j).
2: Compute edge weights ψ^p_{i,j} and ψ^q_{i,j} for all node pairs (i, j).
3: Find E_p̂ = MWST(ψ^p_{i,j}) and E_q̂ = MWST(ψ^q_{i,j}).
4: Obtain p̂ by projecting p̃ onto E_p̂; likewise q̂.
5: Perform the LRT using p̂ and q̂.

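The sketch below mirrors steps 1-3 of Algorithm 1 on toy binary data. The training sets, their dimensions, and the add-half smoothing of the pairwise marginals are assumptions for illustration, and networkx's maximum_spanning_tree stands in for the Kruskal MWST step:

```python
import numpy as np
import networkx as nx
from itertools import combinations

def pairwise_marginal(xi, xj):
    """Empirical 2x2 joint pmf of two binary variables (add-half smoothing)."""
    p = np.zeros((2, 2))
    for a in (0, 1):
        for b in (0, 1):
            p[a, b] = np.sum((xi == a) & (xj == b)) + 0.5
    return p / p.sum()

def dt_edge_weight(p_ij, q_ij):
    """psi^p_ij = E_p[log p_ij/(p_i p_j)] - E_q[log p_ij/(p_i p_j)]."""
    p_i, p_j = p_ij.sum(axis=1), p_ij.sum(axis=0)
    log_ratio = np.log(p_ij / np.outer(p_i, p_j))
    return np.sum(p_ij * log_ratio) - np.sum(q_ij * log_ratio)

# Toy training sets Tp and Tq (made up): 200 binary samples each, n = 5.
rng = np.random.default_rng(1)
Xp = rng.integers(0, 2, size=(200, 5))
Xq = rng.integers(0, 2, size=(200, 5))

Gp, Gq = nx.Graph(), nx.Graph()
for i, j in combinations(range(5), 2):
    p_ij = pairwise_marginal(Xp[:, i], Xp[:, j])
    q_ij = pairwise_marginal(Xq[:, i], Xq[:, j])
    Gp.add_edge(i, j, weight=dt_edge_weight(p_ij, q_ij))  # psi^p_ij
    Gq.add_edge(i, j, weight=dt_edge_weight(q_ij, p_ij))  # psi^q_ij

Ep = sorted(nx.maximum_spanning_tree(Gp).edges())
Eq = sorted(nx.maximum_spanning_tree(Gq).edges())
print(Ep, Eq)  # edge sets of the two discriminative trees
```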

Discriminative learning of forests

p̂^(k) and q̂^(k): Markov on forests with at most k ≤ (n − 1) edges.

Maximize joint objective over both pairs of distributions:

(p̂^(k), q̂^(k)) = argmax_{p̂ ∈ T^(k), q̂ ∈ T^(k)} Ĵ(p̂, q̂; p̃, q̃),

where T^(k) is the family of distributions Markov on forests with at most k edges.

Useful property:

Ĵ(p̂, q̂; p̃, q̃) = ∑_{i ∈ V} J(p̃_i, q̃_i) + ∑_{(i,j) ∈ E_p̂ ∪ E_q̂} w_{ij},

where the w_{ij} can be expressed in terms of mutual information terms and KL-divergences involving marginal and pairwise statistics.

Additivity of the cost function and optimality of the k-step Kruskal MWST algorithm for each k ⇒ k-step Kruskal MWST leads to the optimal forest-structured distribution.

Estimated edge sets are nested, i.e., E_{p̂^(k−1)} ⊆ E_{p̂^(k)}, ∀ k ≤ n − 1.

A single run of Kruskal's MWST algorithm recovers all (n − 1) pairs of edge substructures!

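Here is a minimal sketch of that single-run property: one pass of Kruskal's algorithm records the accepted edges in order, and the k-edge prefix is the forest for that k. The node count and the weights w_ij are made up for illustration:

```python
import numpy as np

def kruskal_nested_forests(n, weights):
    """One pass of Kruskal's MWST; the k-edge prefix of the returned list is
    the forest with k edges, so the forests are nested by construction."""
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    chosen = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            chosen.append((i, j))
    return chosen

# Made-up symmetric edge weights w_ij on n = 5 nodes.
rng = np.random.default_rng(2)
weights = {(i, j): rng.normal() for i in range(5) for j in range(i + 1, 5)}

edges = kruskal_nested_forests(5, weights)
for k in range(1, len(edges) + 1):
    print(f"k = {k}:", edges[:k])  # E^(k-1) is a subset of E^(k)
```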

Learning thicker graphs via boosting

Trees learn (n− 1) edges → sparse representation.

Desirable to learn more graph edges for better classification, if we can also avoid overfitting.

Learning of junction trees known to be NP-hard.

Learning general graph structures is intractable.

Use boosting to learn more than (n− 1) edges per model.


Boosted graphical model classification

DT classifier used as a weak learner

Training sets Tp and Tq remain unchanged

In the t-th iteration, learn trees p̂_t and q̂_t, and classify using:

h_t(x) := log [ p̂_t(x) / q̂_t(x) ].

Learn the trees by minimizing the weighted training error: use weighted empirical distributions (p̃_w, q̃_w) instead of (p̃, q̃).

Final boosted classifier:

H_T(x) = sgn [ ∑_{t=1}^{T} α_t log ( p̂_t(x) / q̂_t(x) ) ] = sgn [ log ( p̂*(x) / q̂*(x) ) ],

where p̂*(x) := ∏_{t=1}^{T} p̂_t(x)^{α_t}, and similarly q̂*(x).

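A sketch of the boosting wrapper follows. The slides do not spell out the weight and coefficient updates, so the standard discrete AdaBoost recipe is assumed here, and a simple weighted decision stump stands in for the DT weak learner so that the example runs end to end:

```python
import numpy as np

def adaboost(X, y, weak_learner, T=10):
    """Discrete AdaBoost; in the slides the DT classifier plays the role of
    weak_learner, trained on the weighted empirical distributions."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # sample weights
    learners, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)                # h: x -> {-1, +1}
        pred = np.array([h(x) for x in X])
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # coefficient alpha_t
        w *= np.exp(-alpha * y * pred)           # re-weight the training samples
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    # H_T(x) = sgn( sum_t alpha_t h_t(x) )
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, learners)))

def stump(X, y, w):
    """Weighted decision stump, standing in for the DT weak learner."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] > thr, s, -s)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, s)
    _, j, thr, s = best
    return lambda x: s if x[j] > thr else -s

# Toy two-class data (made up): H0 samples labeled +1, H1 samples labeled -1.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(0.8, 1.0, size=(100, 3))])
y = np.hstack([np.ones(100), -np.ones(100)])

H = adaboost(X, y, stump, T=5)
print(np.mean(np.array([H(x) for x in X]) == y))  # training accuracy of H_T
```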

Some comments

Boosting learns at most (n − 1) edges per iteration ⇒ a maximum of (n − 1)T edges (pairwise features)

With suitable normalization, p̂*(x)/Z_p(α) is a probability distribution.

p̂*(x)/Z_p(α) is Markov on a graph G = (V, E_p̂*) with edge set

E_p̂* = ⋃_{t=1}^{T} E_{p̂_t}.

How to avoid overfitting?

Use cross-validation to determine the optimal number of iterations T*.


Extension to multi-class problems

Set of classes I: one-versus-all strategy.

p̂^(k)_{i|j}(x) and p̂^(k)_{j|i}(x): learned forests for the binary classification problem Class i versus Class j.

f^(k)_{ij}(x) := log [ p̂^(k)_{i|j}(x) / p̂^(k)_{j|i}(x) ],   i, j ∈ I.

Multi-class decision function:

g^(k)(x) := argmax_{i ∈ I} ∑_{j ∈ I} f^(k)_{ij}(x).

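A minimal sketch of this decision rule, in which generic Gaussian class densities stand in for the learned forest models p̂^(k)_{i|j}; the classes, means, and test point are assumptions for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Generic class-conditional densities standing in for the learned forests (made up).
classes = [0, 1, 2]
means = {0: [0.0, 0.0], 1: [2.0, 0.0], 2: [0.0, 2.0]}
density = {i: multivariate_normal(mean=means[i], cov=np.eye(2)) for i in classes}

def f(i, j, x):
    """Pairwise statistic f_ij(x) = log p_{i|j}(x) - log p_{j|i}(x) (stand-in densities)."""
    return density[i].logpdf(x) - density[j].logpdf(x)

def g(x):
    """Multi-class decision: g(x) = argmax_i sum_j f_ij(x)."""
    return max(classes, key=lambda i: sum(f(i, j, x) for j in classes if j != i))

print(g(np.array([1.8, 0.2])))  # -> 1
```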

Conclusion

Discriminative learning optimizes an approximation to the expectation of the log-likelihood ratio.

Superior performance in classification applications compared to generative approaches.

Learned tree models can have different edge structures → removes the restriction of the Tree Augmented Naive Bayes (TAN) framework.

No additional computational overhead compared to existing tree-based methods.

Amenable to boosting → weak learners on weighted empiricals.

Learning thicker graphical models in a principled manner.
