LING 696B: Midterm review: parametric and non-parametric inductive inference


Transcript of LING 696B: Midterm review: parametric and non-parametric inductive inference.


Big question: How do people generalize?
Examples related to language:
- Categorizing a new stimulus
- Assigning structure to a signal
- Telling whether a form is grammatical
What is the nature of inductive inference? What role does statistics play?

Two paradigms of statistical learning (I)
Fisher's paradigm: inductive inference through likelihood -- p(X|θ)
- X: observed set of data
- θ: parameters of the probability density function p, or an interpretation of X
We expect X to come from an infinite population that follows p(X|θ)
Representational bias: the form of p(X|θ) constrains what kinds of things you can learn

Learning in Fisher's paradigm
Philosophy: find the infinite population so that the chance of seeing X is large (idea from Bayes)
- Knowing the universe by seeing individuals
- Randomness is due to the finiteness of X
Maximum likelihood: find θ so that p(X|θ) reaches its maximum
Natural consequence: the more X you see, the better you learn about p(X|θ)
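
To make the maximum-likelihood idea concrete, here is a minimal sketch (my own illustration with hypothetical data, not from the slides): fit a Gaussian p(X|θ) with θ = (μ, σ) by numerically maximizing the log-likelihood, and compare with the closed-form MLE.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical sample X, imagined as draws from an infinite Gaussian population
rng = np.random.default_rng(0)
X = rng.normal(loc=1.5, scale=0.7, size=200)

# Negative log-likelihood of the Gaussian model p(X | mu, sigma)
def neg_log_lik(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(X, loc=mu, scale=np.exp(log_sigma)))

# Maximum likelihood: find theta that maximizes p(X | theta)
res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# Closed-form MLE for comparison: sample mean and (biased) standard deviation
print(mu_hat, sigma_hat)
print(X.mean(), X.std())
```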

Extending Fisher's paradigm to complex situations
Statisticians cannot specify p(X|θ) for you!
- It must come from an understanding of the structure that generates X, e.g. grammar
- Needs a supporting theory that guides the construction of p(X|θ) -- "language is special"
Extending p(X|θ) to include hidden variables: the EM algorithm
Making bigger models from smaller models: iterative learning through coordinate-wise ascent

Example: unsupervised learning of categories
- X: instances of pre-segmented speech sounds
- θ: mixture of a fixed number of category models
Representational bias:
- Discreteness
- Distribution of each category (bias from mixture components)
Hidden variable: category membership
Learning: EM algorithm
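
A minimal EM sketch for this kind of category learning, assuming hypothetical one-dimensional acoustic measurements and two Gaussian categories (my own illustration, far simpler than the course model): the E-step computes posterior category memberships (the hidden variable), and the M-step re-estimates each category's parameters.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Hypothetical 1-D acoustic measurements drawn from two unknown categories
X = np.concatenate([rng.normal(300, 30, 150), rng.normal(600, 40, 100)])

# Initial guesses for the mixture parameters theta
w = np.array([0.5, 0.5])          # mixing weights
mu = np.array([250.0, 700.0])     # category means
sd = np.array([50.0, 50.0])       # category standard deviations

for _ in range(100):
    # E-step: posterior probability of each category for each token
    lik = w * norm.pdf(X[:, None], mu, sd)        # shape (N, 2)
    resp = lik / lik.sum(axis=1, keepdims=True)   # hidden memberships
    # M-step: re-estimate weights, means, and standard deviations
    Nk = resp.sum(axis=0)
    w = Nk / len(X)
    mu = (resp * X[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(w, mu, sd)
```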

Example: unsupervised learning of phonological words
- X: instances of word-level signals
- θ: mixture model + phonotactic model + word segmentation
Representational bias:
- Discreteness
- Distribution of each category (bias from mixture components)
- Combinatorial structure of phonological words
Learning: coordinate-wise ascent
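
The word-learning model itself is too rich to reproduce here, but the update pattern is generic: alternately maximize the objective over one block of parameters while holding the other fixed. A toy sketch with a hypothetical two-block objective (my own illustration, standing in for "category model" vs. "segmentation"):

```python
import numpy as np

# Toy objective standing in for a composite likelihood with two parameter blocks
def objective(a, b):
    return -(a - 2.0) ** 2 - (b + 1.0) ** 2 - 0.5 * (a * b) ** 2

a, b = 0.0, 0.0
grid = np.linspace(-5, 5, 2001)
for _ in range(20):
    # Coordinate-wise ascent: maximize over one block while the other is fixed
    a = grid[np.argmax(objective(grid, b))]
    b = grid[np.argmax(objective(a, grid))]

print(a, b, objective(a, b))
```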

From Fisher's paradigm to Bayesian learning
Bayesian: wants to learn the posterior distribution p(θ|X)
Bayes' formula: p(θ|X) ∝ p(X|θ) p(θ) = p(X, θ)
- Same as ML when p(θ) is uniform
Still needs a theory guiding the construction of p(θ) and p(X|θ)
More on this later
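
A minimal conjugate example (my own, with hypothetical data): a Bernoulli likelihood with a Beta prior over θ. With the uniform prior Beta(1, 1), the posterior mode coincides with the maximum-likelihood estimate, matching the "same as ML when p(θ) is uniform" point.

```python
import numpy as np
from scipy.stats import beta

# Hypothetical binary data X (e.g. whether a form was judged grammatical)
X = np.array([1, 1, 0, 1, 1, 1, 0, 1])
heads, tails = X.sum(), len(X) - X.sum()

# Bernoulli likelihood with a Beta(a0, b0) prior over theta.
# A uniform prior is Beta(1, 1); the posterior is Beta(a0 + heads, b0 + tails).
a0, b0 = 1.0, 1.0
posterior = beta(a0 + heads, b0 + tails)

# Posterior mode under the uniform prior equals the ML estimate heads / N
mode = (a0 + heads - 1) / (a0 + b0 + heads + tails - 2)
print(mode, heads / len(X))
print(posterior.mean(), posterior.interval(0.95))
```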

Attractions of generative modeling
Has clear semantics:
- p(X|θ) -- prediction/production/synthesis
- p(θ) -- belief/prior knowledge/initial bias
- p(θ|X) -- perception/interpretation
Can make "infinite generalizations":
- synthesizing from p(X, θ) can tell us something about the generalization
A very general framework
- Theory of everything?
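
For instance, once a mixture model like the one fit above has parameters, "synthesis" is just ancestral sampling from p(X, θ): draw the hidden category, then draw an observation from that category. A small sketch (the parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical mixture parameters, e.g. as estimated by the EM sketch above
w = np.array([0.6, 0.4])
mu = np.array([300.0, 600.0])
sd = np.array([30.0, 40.0])

# Synthesis from the generative model: first draw the hidden category,
# then draw an observation from that category's distribution
z = rng.choice(2, size=10, p=w)
X_new = rng.normal(mu[z], sd[z])
print(z)
print(np.round(X_new, 1))
```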

Challenges to generative modeling
- The representational bias can be wrong
  (but "all models are wrong")
- Unclear how to choose from different classes of models
  E.g. the destiny of K
  Simplicity is relative, e.g. f(x) = a*sin(bx) + c
- Computing max{p(X|θ)} can be very hard
  Bayesian computation may help
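
The slides do not endorse a particular fix for choosing K, but one common (and contestable) heuristic is to fit mixtures with different numbers of components and compare a penalized score such as BIC. A sketch with hypothetical data, using scikit-learn's GaussianMixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Hypothetical 1-D data generated by two underlying categories
X = np.concatenate([rng.normal(300, 30, 150),
                    rng.normal(600, 40, 100)]).reshape(-1, 1)

# Fit mixtures with different K and compare BIC scores (lower is better)
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, round(gm.bic(X), 1))
```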

Challenges to generative modeling
Even finding X can be hard for language
- Probability distribution over what?
- Example: X for statistical syntax? Strings of words, parse trees, semantic interpretations, social interactions
Hope: staying at low levels of language will make the choice of X easier

Two paradigms of statistical learning (II)
Vapnik's critique of generative modeling: "Why solve a more general problem before solving a specific one?"
Example: the generative approach to 2-class classification (supervised)
- Likelihood ratio test: log[p(x|A)/p(x|B)]
- A, B are parametric models
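
A minimal sketch of that generative route (my own illustration with hypothetical 1-D data): fit a parametric model to each class, then classify new points by the sign of the log-likelihood ratio.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
# Hypothetical labeled training data for two classes A and B
xA = rng.normal(0.0, 1.0, 100)
xB = rng.normal(2.0, 1.5, 100)

# Generative route: fit a parametric (Gaussian) model to each class ...
muA, sdA = xA.mean(), xA.std()
muB, sdB = xB.mean(), xB.std()

# ... then classify new points by log p(x|A) - log p(x|B)
x_new = np.array([-1.0, 0.5, 1.2, 3.0])
llr = norm.logpdf(x_new, muA, sdA) - norm.logpdf(x_new, muB, sdB)
print(np.where(llr > 0, "A", "B"))
```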

Non-parametric approach to inductive inference
Main idea: don't want to know the universe first and then generalize
- The universe is complicated; the representational bias is often inappropriate
- There are very few data to learn from, compared to the dimensionality of the space
Instead, want to generalize directly from old data to new data
- Rules vs. analogy?

Examples of non-parametric learning (I)
Nearest neighbor classification:
- Analogy-based learning by dictionary lookup
- Generalizes to K-nearest neighbors
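
A small K-nearest-neighbor sketch (hypothetical 2-D features, my own illustration): a new point is classified by looking up its K most similar old points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
# Hypothetical 2-D feature vectors (e.g. two acoustic measurements) with labels
X_train = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
                     rng.normal([3, 3], 1.0, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

# "Dictionary lookup": a new point gets the majority label of its K nearest old points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
X_new = np.array([[0.5, 0.2], [2.8, 3.1], [1.5, 1.5]])
print(knn.predict(X_new))
```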

Examples of non-parametric learning (II)
Radial basis networks for supervised learning:
- F(x) = Σ_i a_i K(x, x_i)
- K(x, x_i) is a non-linear similarity function centered at x_i, with tunable parameters
Interpretation: "soft/smooth" dictionary lookup/analogy within a population
Learning: find the a_i from the (x_i, y_i) pairs -- a regularized regression problem:
min Σ_i [F(x_i) - y_i]^2 + λ ||F||^2
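
One standard way to solve that regularized regression with a Gaussian (RBF) kernel is kernel ridge regression; the fitted function is exactly a weighted sum of kernels centered at the training points. A sketch with hypothetical data (my own illustration, not the course demo):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(6)
# Hypothetical (x_i, y_i) pairs from a noisy smooth function
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.1, 60)

# Regularized regression with a Gaussian (RBF) kernel:
# minimizes sum_i [F(x_i) - y_i]^2 + alpha * ||F||^2 in the kernel's function space
model = KernelRidge(kernel="rbf", gamma=1.0, alpha=0.1).fit(X, y)

# The fitted F(x) is a weighted sum of kernels centered at the training x_i
print(model.dual_coef_.shape)           # one coefficient a_i per training point
print(model.predict(np.array([[0.0], [1.0]])))
```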

Radial basis functions/networks
- Each data point x_i is associated with a K(x, x_i) -- a radial basis function
- Linear combinations of enough K(x, x_i) can approximate any smooth function from R^n to R: the universal approximation property
- Network interpretation (see demo)

How is this different from generative modeling?
- Do not assume a fixed space in which to search for the best hypothesis
- Instead, this space grows with the amount of data; the basis of the space is the K(x, x_i)
- Interpretation: local generalization from old data x_i to new data x
- F(x) = Σ_i a_i K(x, x_i) represents an ensemble generalization from {x_i} to x

Examples of non-parametric learning (III)
Support Vector Machines (last time): linear separation
f(x) = sign(<w, x> + b)

Max margin classification
The solution is also a direct generalization from old data, but sparse: the coefficients are mostly zero
f(x) = sign(<w, x> + b)

Interpretation of support vectors
- Support vectors are the points with non-zero contribution to the generalization; the other coefficients are mostly zero
- They act as "prototypes" for analogical learning
f(x) = sign(<w, x> + b)
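
A small sketch of this with scikit-learn's SVC (hypothetical 2-D data, my own illustration): after fitting a max-margin linear separator, only a handful of training points survive as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
# Hypothetical, roughly linearly separable 2-D data for two classes
X = np.vstack([rng.normal([0, 0], 0.8, (60, 2)),
               rng.normal([4, 4], 0.8, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

# Max-margin linear separation f(x) = sign(<w, x> + b)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors contribute to the decision function;
# for most training points the dual coefficient is zero
print(len(X), "training points,", len(clf.support_vectors_), "support vectors")
print(clf.coef_, clf.intercept_)   # w and b of the linear boundary
```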

Kernel generalization of SVM
The solution looks very much like an RBF network:
- RBF net: F(x) = Σ_i a_i K(x, x_i) -- many old data contribute to the generalization
- SVM: F(x) = sign(Σ_i a_i K(x, x_i) + b) -- relatively few old data contribute
The dense/sparse difference between the solutions is due to their different goals (see demo)
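
To see the dense/sparse contrast numerically, one can fit kernel ridge regression and an RBF-kernel SVM to the same hypothetical data and count how many training points carry non-zero coefficients (my own sketch, not the course demo):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC

rng = np.random.default_rng(8)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([3, 3], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Same Gaussian kernel, two different goals:
# squared-error regression -> dense solution; max margin -> sparse solution
ridge = KernelRidge(kernel="rbf", gamma=0.5, alpha=1.0).fit(X, y)
svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

dense = np.sum(np.abs(ridge.dual_coef_) > 1e-8)
print("kernel ridge: nonzero coefficients =", dense, "of", len(X))
print("kernel SVM:   support vectors      =", len(svm.support_vectors_), "of", len(X))
```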

Transductive inference with support vectors
One more wrinkle: now I'm putting two points there, but I don't tell you their color

Transductive SVM
Not only do the old data affect the generalization; the new data affect each other too
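
scikit-learn does not ship a transductive SVM, so as a stand-in here is a label-spreading sketch (my own illustration, a different algorithm with the same flavor): unlabeled points influence each other's labels instead of being classified one at a time.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(9)
X = np.vstack([rng.normal([0, 0], 0.6, (30, 2)),
               rng.normal([3, 3], 0.6, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

# Hide most labels: -1 marks the "new points whose color you aren't told"
y_partial = np.full_like(y, -1)
y_partial[[0, 1, 30, 31]] = y[[0, 1, 30, 31]]

# Transductive-style inference: labeled and unlabeled points jointly
# determine the labeling of the unlabeled points
model = LabelSpreading(kernel="rbf", gamma=2.0).fit(X, y_partial)
print((model.transduction_ == y).mean())   # agreement with the hidden labels
```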

A general view of non-parametric inductive inference
A function approximation problem: knowing that (x1, y1), ..., (xN, yN) are input and output of some unknown function F, how can we approximate F and generalize to new values of x?
- Linguistics: find the universe for F
- Psychology: find the best model that "behaves" like F
In realistic terms, non-parametric methods often win

Who's got the answer?
- The parametric approach can also approximate functions: model the joint distribution p(x, y|θ)
- But the model is often difficult to build, e.g. for a realistic experimental task
- Before reaching a conclusion, we need to know how people learn; they may be doing both
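
A minimal sketch of the parametric route to function approximation (my own illustration, hypothetical data): model the joint p(x, y|θ) as a 2-D Gaussian and read off the predictor as the conditional mean E[y|x].

```python
import numpy as np

rng = np.random.default_rng(10)
# Hypothetical (x, y) pairs from a roughly linear relation
x = rng.uniform(-2, 2, 200)
y = 1.5 * x + 0.5 + rng.normal(0, 0.3, 200)

# Parametric route: model the joint p(x, y | theta) as a 2-D Gaussian,
# then predict with the conditional mean E[y | x]
mu = np.array([x.mean(), y.mean()])
cov = np.cov(np.vstack([x, y]))

def conditional_mean(x_new):
    # For a joint Gaussian, E[y|x] = mu_y + (cov_xy / cov_xx) * (x - mu_x)
    return mu[1] + cov[0, 1] / cov[0, 0] * (x_new - mu[0])

print(conditional_mean(np.array([-1.0, 0.0, 1.0])))
```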

Where does neural net fit?
- Clearly not generative: it does not reason with probability
- Somewhat different from the analogy type of non-parametric learning: the network does not directly reason from old data, and the generalization is difficult to interpret
- Some results are available for limiting cases: similar to non-parametric methods when the hidden units are infinite

A point that nobody gets right
Small sample dilemma: people learn from very few examples (compared to the dimension of the data), yet any statistical machinery needs many
- Parametric: the ML estimate approaches the true distribution only with an infinite sample
- Non-parametric: universal approximation requires an infinite sample
The limit is taken in the wrong direction