LING 696B: Midterm review: parametric and non-parametric inductive inference

Big question: How do people generalize?
Examples related to language:
- Categorizing a new stimulus
- Assigning structure to a signal
- Telling whether a form is grammatical
What is the nature of inductive inference? What role does statistics play?
Two paradigms of statistical learning (I)
Fisher's paradigm: inductive inference through likelihood -- p(X|θ)
- X: observed set of data
- θ: parameters of the probability density function p, or an interpretation of X
We expect X to come from an infinite population obeying p(X|θ)
Representational bias: the form of p(X|θ) constrains what kinds of things you can learn
Learning in Fisher's paradigm
Philosophy: finding the infinite population so that the chance of seeing X is large (an idea from Bayes)
- Knowing the universe by seeing individuals
- Randomness is due to the finiteness of X
Maximum likelihood: find θ so that p(X|θ) reaches its maximum
Natural consequence: the more X you see, the better you learn about p(X|θ)
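To make the maximum-likelihood idea concrete, here is a minimal sketch (my example, not from the lecture) for the simplest case, a 1-D Gaussian: maximizing p(X|θ) over an i.i.d. sample gives the sample mean and the biased (divide-by-N) sample variance.

```python
def gaussian_mle(xs):
    """ML estimates of a 1-D Gaussian's parameters theta = (mu, var).

    Maximizing the likelihood p(X|theta) over an i.i.d. sample X gives
    the sample mean and the biased (divide-by-N) sample variance.
    """
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var
```

As the slide says, the more X you see, the closer these estimates get to the parameters of the population that generated the data.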
Extending Fisher's paradigm to complex situations
Statisticians cannot specify p(X|θ) for you!
- It must come from an understanding of the structure that generates X, e.g. a grammar
- This needs a supporting theory that guides the construction of p(X|θ) -- "language is special"
Extending p(X|θ) to include hidden variables: the EM algorithm
Making bigger models from smaller models: iterative learning through coordinate-wise ascent
Example: unsupervised learning of categories
- X: instances of pre-segmented speech sounds
- θ: mixture of a fixed number of category models
Representational bias:
- Discreteness
- Distribution of each category (bias from mixture components)
Hidden variable: category membership
Learning: EM algorithm
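As a sketch of the learning scheme (an assumed toy example, not the course code): EM for a two-component 1-D Gaussian mixture, where the hidden variable is each point's category membership.

```python
import math

def em_gmm_1d(xs, mu0, mu1, iters=50):
    """EM for a two-component 1-D Gaussian mixture.

    Hidden variable: which category generated each x.
    E-step: posterior membership probabilities (responsibilities).
    M-step: re-estimate weights, means, variances by weighted ML.
    """
    w = [0.5, 0.5]
    mu = [mu0, mu1]
    var = [1.0, 1.0]
    for _ in range(iters):
        # E-step: responsibilities under the current parameters
        resp = []
        for x in xs:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: weighted ML re-estimates
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return w, mu, var
```

Each iteration increases the likelihood p(X|θ), with category membership filled in probabilistically rather than observed.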
Example: unsupervised learning of phonological words
- X: instances of word-level signals
- θ: mixture model + phonotactic model + word segmentation
Representational bias:
- Discreteness
- Distribution of each category (bias from mixture components)
- Combinatorial structure of phonological words
Learning: coordinate-wise ascent
From Fisher's paradigm to Bayesian learning
Bayesian: wants to learn the posterior distribution p(θ|X)
Bayes' formula: p(θ|X) ∝ p(X|θ) p(θ) = p(X, θ)
Same as ML when p(θ) is uniform
Still needs a theory guiding the construction of p(θ) and p(X|θ) -- more on this later
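A sketch of Bayes' formula in the simplest conjugate case (my example, not the lecture's): coin-flip data with a Beta prior over the bias θ. With a uniform prior, the posterior mode coincides with the ML estimate, as the slide notes.

```python
def beta_posterior(heads, tails, a=1.0, b=1.0):
    """Posterior over a coin's bias theta under a Beta(a, b) prior.

    p(theta|X) ∝ p(X|theta) p(theta); for binomial data and a Beta
    prior this is Beta(a + heads, b + tails).  Returns the posterior
    mean and mode.  With a uniform prior (a = b = 1), the posterior
    mode equals the ML estimate heads / (heads + tails).
    """
    a_post, b_post = a + heads, b + tails
    mean = a_post / (a_post + b_post)
    mode = (a_post - 1) / (a_post + b_post - 2)  # assumes a_post, b_post > 1
    return mean, mode
```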
Attractions of generative modeling
Has clear semantics:
- p(X|θ) -- prediction/production/synthesis
- p(θ) -- belief/prior knowledge/initial bias
- p(θ|X) -- perception/interpretation
Can make "infinite generalizations": synthesizing from p(X, θ) can tell us something about the generalization
A very general framework -- a theory of everything?
Challenges to generative modeling
- The representational bias can be wrong -- but "all models are wrong"
- Unclear how to choose from different classes of models, e.g. the destiny of K
- Simplicity is relative, e.g. f(x) = a*sin(bx) + c
- Computing max_θ p(X|θ) can be very hard -- Bayesian computation may help
Challenges to generative modeling
Even finding X can be hard for language: a probability distribution over what?
Example: X for statistical syntax?
- Strings of words
- Parse trees
- Semantic interpretations
- Social interactions
Hope: staying at low levels of language will make the choice of X easier
Two paradigms of statistical learning (II)
Vapnik's critique of generative modeling: "Why solve a more general problem before solving a specific one?"
Example: the generative approach to 2-class classification (supervised) -- likelihood ratio test: log[p(x|A)/p(x|B)], where A, B are parametric models
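The generative two-class recipe can be sketched as follows (an assumed 1-D example): fit a Gaussian model to each class, then classify a new x by the sign of the log likelihood ratio.

```python
import math

def fit_gaussian(xs):
    # ML estimates for a 1-D Gaussian class model
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def log_lik(x, mu, var):
    # log p(x | mu, var) for a 1-D Gaussian
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def classify(x, model_a, model_b):
    """Likelihood ratio test: log[p(x|A)/p(x|B)] > 0 -> class A."""
    return 'A' if log_lik(x, *model_a) - log_lik(x, *model_b) > 0 else 'B'
```

This is Vapnik's point in miniature: to answer the specific question "A or B?" we first solved the more general problem of modeling each class's full distribution.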
Non-parametric approach to inductive inference
Main idea: don't want to know the universe first, then generalize
- The universe is complicated; the representational bias is often inappropriate
- There are very few data to learn from, compared to the dimensionality of the space
Instead, want to generalize directly from old data to new data
Rules vs. analogy?
Examples of non-parametric learning (I)
Nearest neighbor classification: analogy-based learning by dictionary lookup
Generalizes to K nearest neighbors
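The dictionary-lookup idea is short enough to sketch directly (a toy 1-D version, my example): store the old (x_i, label) pairs and let the K closest ones vote on a new x.

```python
from collections import Counter

def knn_classify(x, examples, k=3):
    """K-nearest-neighbor classification: analogy by dictionary lookup.

    examples: list of (x_i, label) pairs.  Generalization to a new x is
    a majority vote among the k stored examples closest to x -- no
    model of the data-generating distribution is ever built.
    """
    by_dist = sorted(examples, key=lambda e: abs(e[0] - x))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 this is pure lookup of the single most similar old datum; larger k smooths the analogy over a neighborhood.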
Examples of non-parametric learning (II)
Radial basis networks for supervised learning: F(x) = Σ_i a_i K(x, x_i)
- K(x, x_i): a non-linear similarity function centered at x_i, with tunable parameters
Interpretation: "soft/smooth" dictionary lookup/analogy within a population
Learning: find the a_i from (x_i, y_i) pairs -- a regularized regression problem: min Σ_i [F(x_i) - y_i]² + λ ||F||²
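A minimal sketch of that regression (my example; the Gaussian kernel and the regularization weight lam are assumptions): the regularized least-squares problem has the closed-form solution (K + λI) a = y, solved here with a small hand-rolled elimination routine.

```python
import math

def gaussian_kernel(x, xi, width=1.0):
    # K(x, x_i): a non-linear similarity function centered at x_i
    return math.exp(-((x - xi) ** 2) / (2 * width ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting (small systems only)
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        a[i] = (M[i][n] - sum(M[i][j] * a[j] for j in range(i + 1, n))) / M[i][i]
    return a

def fit_rbf(xs, ys, lam=1e-3, width=1.0):
    """Fit F(x) = sum_i a_i K(x, x_i) by regularized least squares.

    Solving (K + lam*I) a = y is the closed form of
    min sum_i [F(x_i) - y_i]^2 + lam * ||F||^2.
    """
    n = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j], width) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    a = solve(K, ys)
    return lambda x: sum(ai * gaussian_kernel(x, xi, width)
                         for ai, xi in zip(a, xs))
```

Note that every training point x_i contributes its own basis function K(x, x_i), which is exactly the "soft dictionary lookup" reading on the slide.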
Radial basis functions/networks
Each data point x_i is associated with K(x, x_i) -- a radial basis function
Linear combinations of enough K(x, x_i) can approximate any smooth function from R^n to R: the universal approximation property
Network interpretation (see demo)
How is this different from generative modeling?
It does not assume a fixed space to search for the best hypothesis; instead, this space grows with the amount of data
Basis of the space: K(x, x_i)
Interpretation: local generalization from old data x_i to new data x
F(x) = Σ_i a_i K(x, x_i) represents an ensemble generalization from {x_i} to x
Examples of non-parametric learning (III)
Support Vector Machines (last time): linear separation, f(x) = sign(<w, x> + b)

Max margin classification
The solution is also a direct generalization from old data, but sparse: the coefficients are mostly zero

Interpretation of support vectors
Support vectors make non-zero contributions to the generalization -- "prototypes" for analogical learning
Kernel generalization of SVM
The solution looks very much like an RBF network:
- RBF net: F(x) = Σ_i a_i K(x, x_i) -- many old data contribute to the generalization
- SVM: F(x) = sign(Σ_i a_i K(x, x_i) + b) -- relatively few old data contribute
The dense vs. sparse solutions reflect different goals (see demo)
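The contrast can be sketched as follows (illustrative numbers, not from the lecture): the SVM-style decision function has the same kernel-expansion form as the RBF net, but only the support-vector terms have non-zero a_i, so the other stored data drop out of the sum.

```python
import math

def gaussian_kernel(x, xi, width=1.0):
    return math.exp(-((x - xi) ** 2) / (2 * width ** 2))

def kernel_decision(x, coeffs, xs, b=0.0):
    """SVM-style F(x) = sign(sum_i a_i K(x, x_i) + b).

    Zero coefficients (non-support-vectors) contribute nothing, so
    only a few old data points drive the generalization.
    """
    s = sum(a * gaussian_kernel(x, xi)
            for a, xi in zip(coeffs, xs) if a != 0.0)
    return 1 if s + b >= 0 else -1
```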
Transductive inference with support vectors
One more wrinkle: now I'm putting two points there, but don't tell you the color

Transductive SVM
Not only do the old data affect generalization; the new data affect each other too
A general view of non-parametric inductive inference
A function approximation problem: knowing that (x_1, y_1), ..., (x_N, y_N) are inputs and outputs of some unknown function F, how can we approximate F and generalize to new values of x?
- Linguistics: find the universe for F
- Psychology: find the best model that "behaves" like F
In realistic terms, non-parametric methods often win
Who's got the answer?
The parametric approach can also approximate functions: model the joint distribution p(x, y|θ)
But the model is often difficult to build, e.g. for a realistic experimental task
Before reaching a conclusion, we need to know how people learn -- they may be doing both
Where do neural nets fit?
Clearly not generative: they do not reason with probability
Somewhat different from the analogy type of non-parametric learning: the network does not directly reason from old data, and the generalization is difficult to interpret
Some results are available for limiting cases: similar to non-parametric methods when the hidden units are infinite
A point that nobody gets right
The small-sample dilemma: people learn from very few examples (compared to the dimension of the data), yet any statistical machinery needs many
- Parametric: the ML estimate approaches the true distribution with an infinite sample
- Non-parametric: universal approximation requires an infinite sample
The limit is taken in the wrong direction