
Page 1:

Learning Mixtures of Truncated Basis Functions from Data

Helge Langseth, Thomas D. Nielsen, and Antonio Salmerón

PGM 2012

This work is supported by an Abel grant from Iceland, Liechtenstein, and Norway through the EEA Financial Mechanism (Nils mobility project).

Supported and coordinated by Universidad Complutense de Madrid, by the Spanish Ministry of Science and Innovation through projects TIN2010-20900-C04-02-03, and by ERDF (FEDER) funds.


Page 2:

Background: Approximations


Pages 3-7:

Geometry of approximations

A quick recall of how to do approximations in Rⁿ:

[Figure: the vector f = (3, 2, 5) in R³ and its successive approximations along the coordinate axes.]

We want to approximate the vector f = (3, 2, 5) with:

A vector along e1 = (1, 0, 0). Best choice is 〈f, e1〉 · e1 = (3, 0, 0).

Now, add a vector along e2. Best choice is 〈f, e2〉 · e2, independently of the choice made for e1. Also, the choice we made for e1 is still optimal since e1 ⊥ e2.

Best approximation is in general ∑ℓ 〈f, eℓ〉 · eℓ.

All of this maps over to approximations of functions! We only need a definition of the inner product and the equivalent of orthonormal basis vectors.

Inner product for functions

For two functions u(·) and v(·) defined on Ω ⊆ R, we use 〈u, v〉 = ∫Ω u(x) v(x) dx.
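As a concrete illustration of the projection argument above, here is a minimal numpy sketch (mine, not from the paper): it rebuilds f = (3, 2, 5) one orthonormal direction at a time, each coefficient being 〈f, eℓ〉 independently of the others.

```python
import numpy as np

f = np.array([3.0, 2.0, 5.0])
basis = np.eye(3)                        # e1, e2, e3 are already orthonormal

approx = np.zeros(3)
for e in basis:
    approx = approx + np.dot(f, e) * e   # best coefficient along e is <f, e>
    print(approx)                        # (3,0,0), then (3,2,0), then (3,2,5)
```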

Pages 8-10:

Generalised Fourier series

Definition (Legal set of basis functions)

Let Ψ = {ψi}_{i=0}^∞ be an indexed set of basis functions. Let Q be the set of all linear combinations of functions in Ψ. Ψ is a legal set of basis functions if:

1 ψ0 is constant;

2 u ∈ Q and v ∈ Q implies that (u · v) ∈ Q;

3 For any pair of real numbers s and t, s ≠ t, there exists a function ψi ∈ Ψ s.t. ψi(s) ≠ ψi(t).

Legal basis functions

1, x, x², x³, . . . is a legal set of basis functions.

1, exp(−x), exp(x), exp(−2x), exp(2x), . . . is also legal.

1, log(x), log(2x), log(3x), . . . is not a legal set of basis functions.

Generalized Fourier series

Assume Ψ is legal and contains orthonormal basis functions (if not, they can be made orthonormal through a Gram-Schmidt process). Then, the generalized Fourier series approximation to a function f is defined as

f̂(·) = ∑ℓ 〈f, ψℓ〉 ψℓ(·).

Important properties

1 Any function, including density functions, can be approximated arbitrarily well by this approach.

2 For any coefficients cℓ,

∫Ω ( f(x) − ∑_{ℓ=0}^{k} cℓ ψℓ(x) )² dx ≥ ∫Ω ( f(x) − ∑_{ℓ=0}^{k} 〈f, ψℓ〉 ψℓ(x) )² dx,

so the generalized Fourier series approximation is optimal in the L2 sense.
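A small Python sketch of the same machinery for functions, using the inner product 〈u, v〉 = ∫Ω u(x) v(x) dx evaluated on a grid; the domain [0, 1], the target function, and the order are illustrative choices of mine, not the paper's. Gram-Schmidt turns the monomials 1, x, x², ... into orthonormal basis functions, and the generalized Fourier coefficients are then 〈f, ψℓ〉.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2001)                   # grid over Omega = [0, 1]

def inner(u, v):
    return np.trapz(u * v, x)                      # <u, v> = int_Omega u(t) v(t) dt

# Gram-Schmidt on the monomials 1, x, x^2, ... gives orthonormal basis functions.
psi = []
for i in range(5):
    p = x**i
    for q in psi:
        p = p - inner(p, q) * q                    # remove components along earlier psi's
    psi.append(p / np.sqrt(inner(p, p)))           # normalise

f = np.exp(-2.0 * x)                               # an arbitrary target function
coeffs = [inner(f, q) for q in psi]                # generalized Fourier coefficients <f, psi_l>
f_hat = sum(c * q for c, q in zip(coeffs, psi))    # best order-4 approximation in L2

print("max abs error:", np.max(np.abs(f - f_hat)))
```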

Page 11:

MoTBFs


Pages 12-13:

The marginal MoTBF potential

Definition

Let Ψ = {ψi}_{i=0}^∞ with ψi : R → R define a legal set of basis functions on Ω ⊆ R. Then gk : Ω → R₀⁺ is an MoTBF potential at level k wrt. Ψ . . .

1 if

gk(x) = ∑_{i=0}^{k} ai ψi(x)

for all x ∈ Ω, where the ai are real constants;

2 . . . or there is a partition of Ω into intervals I1, . . . , Im s.t. gk is defined as above on each Ij.

Special cases

An MoTBF potential at level k = 0 is simply a standard discretisation.

MoPs (original definition) and MTEs are also special cases of MoTBFs.

Simplification

We do not utilize the option to split the domain into subdomains here.
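For concreteness, a tiny sketch of what a single-interval MoTBF potential looks like as code; the coefficients and the polynomial basis are illustrative choices of mine. With the polynomial basis this is an MoP, and with k = 0 the potential reduces to a constant, i.e. a one-interval discretisation.

```python
import numpy as np

def motbf(a, basis):
    """g_k(x) = sum_i a[i] * basis[i](x); a marginal MoTBF potential on one interval."""
    return lambda x: sum(ai * psi(x) for ai, psi in zip(a, basis))

poly_basis = [lambda x, i=i: x**i for i in range(3)]   # psi_i(x) = x^i (an MoP basis)
g2 = motbf([0.5, 0.2, -0.1], poly_basis)               # illustrative coefficients
print(g2(np.linspace(0.0, 1.0, 5)))
```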

Page 14:

Example: Polynomials vs. the Std. Gaussian

[Figure: the standard Gaussian density on [−3, 3] together with the approximations g0, g2, and g8.]

g0 = 0.4362 · ψ0

g2 = 0.4362 · ψ0 + 0 · ψ1 − 0.1927 · ψ2

g8 = 0.4362 · ψ0 + 0 · ψ1 + . . . + 0.0052 · ψ8

Use orthonormal polynomials (shifted & scaled Legendre polynomials).

Approximation always integrates to unity.

Direct computations give the gk closest in L2-norm.

Positivity constraint and KL minimisation ⇒ convex optimization.
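The slide's construction can be sketched roughly as follows: orthonormal shifted-and-scaled Legendre polynomials on [−3, 3], with coefficients 〈f, ψi〉 computed by quadrature. To make the approximation integrate to one, this sketch projects the Gaussian truncated to [−3, 3] and renormalised; the slide may use a different convention, so its exact numbers (0.4362, −0.1927, ...) need not coincide with what this prints, but the qualitative behaviour (vanishing odd coefficients, unit integral) does.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre
from scipy.integrate import quad
from scipy.stats import norm

a, b = -3.0, 3.0
Z = norm.cdf(b) - norm.cdf(a)                      # mass of N(0,1) on [a, b]
f = lambda x: norm.pdf(x) / Z                      # standard Gaussian truncated to [a, b]

def psi(i):
    """Orthonormal (shifted & scaled) Legendre polynomial of degree i on [a, b]."""
    P = Legendre.basis(i)                          # standard P_i on [-1, 1]
    return lambda x: np.sqrt((2 * i + 1) / (b - a)) * P((2 * x - a - b) / (b - a))

k = 8
coeffs = [quad(lambda x, i=i: f(x) * psi(i)(x), a, b)[0] for i in range(k + 1)]
g8 = lambda x: sum(c * psi(i)(x) for i, c in enumerate(coeffs))

print("coefficients:", np.round(coeffs, 4))        # odd-degree terms vanish by symmetry
print("integral of g8:", quad(g8, a, b)[0])        # = 1, since psi_1, psi_2, ... integrate to 0 on [a, b]
```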

Page 15:

Learning Univariate Distributions


Pages 16-17:

Relationship between KL and ML

Idea for learning MoTBFs from data

Generate a kernel density for a (marginal) probability distribution, and use the translation-scheme to approximate it with an MoTBF.

Setup

Let f(x) be the density generating x1, . . . , xN.

Let gk(x|θ) = ∑_{i=0}^{k} θi · ψi(x) be an MoTBF of order k.

Let hN(x) be a kernel density estimator.

Result: KL minimization is likelihood maximization in the limit

Let θN = argminθ D( hN(·) ‖ gk(·|θ) ). Then θN converges to the maximum likelihood estimator of θ as N → ∞ (given certain regularity conditions).
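A simplified, grid-based sketch of this translation scheme (my own discretisation, basis choice, and variable names, not the authors' implementation): build a kernel density estimate hN from the sample, then choose θ to minimise D(hN ‖ gk(·|θ)) subject to gk being nonnegative and integrating to one.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre
from scipy.optimize import minimize
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=50)                            # sample from the unknown density f
a, b = data.min() - 0.5, data.max() + 0.5             # support assumed for the MoTBF

def psi(i):
    """Orthonormal (shifted & scaled) Legendre polynomial of degree i on [a, b]."""
    P = Legendre.basis(i)
    return lambda x: np.sqrt((2 * i + 1) / (b - a)) * P((2 * x - a - b) / (b - a))

k = 4
grid = np.linspace(a, b, 400)
B = np.array([psi(i)(grid) for i in range(k + 1)])    # basis evaluated on the grid
h = gaussian_kde(data)(grid)                          # kernel density estimate h_N

def integral(y):
    return np.trapz(y, grid)

def objective(theta):
    g = np.maximum(theta @ B, 1e-12)                  # clip so the log stays finite
    return -integral(h * np.log(g))                   # minimising this minimises D(h_N || g)

constraints = [
    {"type": "eq",   "fun": lambda t: integral(t @ B) - 1.0},  # g integrates to one
    {"type": "ineq", "fun": lambda t: t @ B},                   # g >= 0 on the grid
]
theta0 = np.zeros(k + 1)
theta0[0] = 1.0 / integral(B[0])                      # start from the uniform density
res = minimize(objective, theta0, method="SLSQP", constraints=constraints)
print("fitted MoTBF coefficients:", np.round(res.x, 4))
```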

Pages 18-22:

Example: Learning the standard Gaussian

[Figure: density estimate from 50 samples of the standard Gaussian, together with the fitted MoTBF approximations.]

Density estimate; 50 samples.

g0: BIC = −91.54.

g2: BIC = −83.21.

g4: BIC = −76.13. ⇐ Best BIC score.

g12: BIC = −88.78.
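A hedged sketch of the model-selection loop the slide suggests: fit gk for a few candidate orders and keep the one with the best BIC. The fit below is a simple empirical projection (θi estimated as the sample average of ψi(xj)), not the paper's KL-based procedure, and the parameter count in the BIC penalty is my assumption; the scores it prints will therefore not reproduce the slide's numbers.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

rng = np.random.default_rng(1)
data = rng.normal(size=50)                     # 50 samples from the standard Gaussian
a, b = data.min() - 0.5, data.max() + 0.5
N = len(data)

def psi(i):
    P = Legendre.basis(i)
    return lambda x: np.sqrt((2 * i + 1) / (b - a)) * P((2 * x - a - b) / (b - a))

def bic(k):
    # theta_i = <f, psi_i> estimated by the sample average of psi_i(x_j)
    theta = [np.mean(psi(i)(data)) for i in range(k + 1)]
    dens = np.maximum(sum(t * psi(i)(data) for i, t in enumerate(theta)), 1e-12)
    loglik = np.sum(np.log(dens))              # clipping stands in for a positivity constraint
    return loglik - 0.5 * k * np.log(N)        # k free parameters; theta_0 fixed by normalisation

for k in (0, 2, 4, 12):
    print(f"g_{k}: BIC = {bic(k):.2f}")
```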

Page 23:

Comparison to “State-of-the-art”

Direct ML optimization

At PGM'08/IJAR'10 we presented ML-learning of univariate MTEs:

Divides the support of the function up into intervals.

Direct ML optimization inside each interval.

Computationally difficult.

Summary of results

Precision of the new method in terms of log likelihood is comparable to (but slightly poorer than) previous results.

Speedup factor from 10 to 15.

Fewer parameters chosen by the BIC selection criterion.

Page 24:

Conditional Distributions


Page 25:

Definition of conditional distributions

Assume we have x ∈ Im, and want to define g_k^(m)(y|x) there.

We define conditional MoTBFs to only depend on their conditioning variable(s) through the relevant hypercube, and not the numerical value:

g_k^(m)(y|x) = ∑_{j=0}^{k} θ_j^(m) ψj(y) for x ∈ Im.

[Figure: the joint domain of the parents X1 and X2 partitioned into 3 × 3 hypercubes, with a separate potential g^(i,j)(y) in each cell.]

Conditioning hypercubes are learned by optimizing the BIC score.
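To make the hypercube idea concrete, here is a simplified sketch for a single parent: split the parent's range into equal-frequency intervals, fit a separate univariate MoTBF for the child in each interval, and score each partition by BIC. The equal-frequency splits, the projection-style fit, and the parameter count are my simplifications; the paper searches over conditioning hypercubes by optimising the BIC score directly.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

rng = np.random.default_rng(2)
x = rng.normal(size=500)                          # parent
y = rng.normal(loc=x / 2.0, scale=1.0)            # child: Y | X = x ~ N(x/2, 1)

def make_basis(a, b, k):
    def psi(i):
        P = Legendre.basis(i)
        return lambda t: np.sqrt((2 * i + 1) / (b - a)) * P((2 * t - a - b) / (b - a))
    return [psi(i) for i in range(k + 1)]

def loglik_motbf(sample, k):
    """Log-likelihood of a level-k MoTBF fitted to `sample` by empirical projection."""
    a, b = sample.min() - 0.5, sample.max() + 0.5
    basis = make_basis(a, b, k)
    theta = [np.mean(p(sample)) for p in basis]
    dens = np.maximum(sum(t * p(sample) for t, p in zip(theta, basis)), 1e-12)
    return np.sum(np.log(dens))                   # clipping stands in for a positivity constraint

def bic_of_partition(n_parts, k=4):
    """BIC of a conditional MoTBF with equal-frequency splits of the parent's range."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_parts + 1))
    edges[-1] += 1e-9                             # make the last interval right-inclusive
    ll = sum(loglik_motbf(y[(x >= lo) & (x < hi)], k)
             for lo, hi in zip(edges[:-1], edges[1:]))
    n_params = n_parts * k                        # assumed: k free coefficients per hypercube
    return ll - 0.5 * n_params * np.log(len(y))

for n_parts in (1, 2, 3, 4):
    print(n_parts, "interval(s): BIC =", round(bic_of_partition(n_parts), 2))
```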

Page 26:

Results: X ∼ N(0, 1), Y | X = x ∼ N(x/2, 1)

[Figure: learned conditional densities for 50, 500, 2500, and 5000 cases.]

Page 27:

Concluding Remarks


Page 28:

Summary

Conclusions:

KL-guided learning is much faster than the current implementations of direct ML optimization.

There is, however, a loss in precision.

The KL-guided learning results do not use splitpoints for the head variable. This can be exploited by inference algorithms.

Future work:

1 Look for improvements with respect to computational speed and numerical stability of the learning algorithm.

2 Investigate the formal properties of the estimators.

3 Compare our approach to López-Cruz et al. (2012): Learning mixtures of polynomials from data using B-spline interpolation.