LING 696B: Midterm review: parametric and non-parametric inductive inference

Big question: How do people generalize?
Examples related to language:
- Categorizing a new stimulus
- Assigning structure to a signal
- Telling whether a form is grammatical
What is the nature of inductive inference? What role does statistics play?
Two paradigms of statistical learning (I)
Fisher's paradigm: inductive inference through likelihood -- p(X|θ)
- X: observed set of data
- θ: parameters of the probability density function p, or an interpretation of X
We expect X to come from an infinite population obeying p(X|θ)
Representational bias: the form of p(X|θ) constrains what kinds of things you can learn
Learning in Fisher's paradigm
Philosophy: finding the infinite population so that the chance of seeing X is large (an idea from Bayes)
- Knowing the universe by seeing individuals
- Randomness is due to the finiteness of X
Maximum likelihood: find θ so that p(X|θ) reaches its maximum
Natural consequence: the more X you see, the better you learn about p(X|θ)
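To make the maximum-likelihood idea concrete, here is a minimal sketch (my example, not from the lecture) for the simplest case, a 1-D Gaussian: maximizing p(X|θ) over an i.i.d. sample gives the sample mean and the biased (divide-by-N) sample variance.

```python
def gaussian_mle(xs):
    """ML estimates of a 1-D Gaussian's parameters theta = (mu, var).

    Maximizing the likelihood p(X|theta) over an i.i.d. sample X gives
    the sample mean and the biased (divide-by-N) sample variance.
    """
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var
```

As the slide says, the more X you see, the closer these estimates get to the parameters of the population that generated the data.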
Extending Fisher's paradigm to complex situations
Statisticians cannot specify p(X|θ) for you!
- It must come from an understanding of the structure that generates X, e.g. a grammar
- This needs a supporting theory that guides the construction of p(X|θ) -- "language is special"
Extending p(X|θ) to include hidden variables: the EM algorithm
Making bigger models from smaller models: iterative learning through coordinate-wise ascent
Example: unsupervised learning of categories
- X: instances of pre-segmented speech sounds
- θ: mixture of a fixed number of category models
Representational bias:
- Discreteness
- Distribution of each category (bias from mixture components)
Hidden variable: category membership
Learning: EM algorithm
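As a sketch of the learning scheme (an assumed toy example, not the course code): EM for a two-component 1-D Gaussian mixture, where the hidden variable is each point's category membership.

```python
import math

def em_gmm_1d(xs, mu0, mu1, iters=50):
    """EM for a two-component 1-D Gaussian mixture.

    Hidden variable: which category generated each x.
    E-step: posterior membership probabilities (responsibilities).
    M-step: re-estimate weights, means, variances by weighted ML.
    """
    w = [0.5, 0.5]
    mu = [mu0, mu1]
    var = [1.0, 1.0]
    for _ in range(iters):
        # E-step: responsibilities under the current parameters
        resp = []
        for x in xs:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: weighted ML re-estimates
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return w, mu, var
```

Each iteration increases the likelihood p(X|θ), with category membership filled in probabilistically rather than observed.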
Example: unsupervised learning of phonological words
- X: instances of word-level signals
- θ: mixture model + phonotactic model + word segmentation
Representational bias:
- Discreteness
- Distribution of each category (bias from mixture components)
- Combinatorial structure of phonological words
Learning: coordinate-wise ascent
From Fisher's paradigm to Bayesian learning
Bayesian: wants to learn the posterior distribution p(θ|X)
Bayes' formula: p(θ|X) ∝ p(X|θ) p(θ) = p(X, θ)
Same as ML when p(θ) is uniform
Still needs a theory guiding the construction of p(θ) and p(X|θ) -- more on this later
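A sketch of Bayes' formula in the simplest conjugate case (my example, not the lecture's): coin-flip data with a Beta prior over the bias θ. With a uniform prior, the posterior mode coincides with the ML estimate, as the slide notes.

```python
def beta_posterior(heads, tails, a=1.0, b=1.0):
    """Posterior over a coin's bias theta under a Beta(a, b) prior.

    p(theta|X) ∝ p(X|theta) p(theta); for binomial data and a Beta
    prior this is Beta(a + heads, b + tails).  Returns the posterior
    mean and mode.  With a uniform prior (a = b = 1), the posterior
    mode equals the ML estimate heads / (heads + tails).
    """
    a_post, b_post = a + heads, b + tails
    mean = a_post / (a_post + b_post)
    mode = (a_post - 1) / (a_post + b_post - 2)  # assumes a_post, b_post > 1
    return mean, mode
```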
Attractions of generative modeling
Has clear semantics:
- p(X|θ) -- prediction/production/synthesis
- p(θ) -- belief/prior knowledge/initial bias
- p(θ|X) -- perception/interpretation
Can make "infinite generalizations": synthesizing from p(X, θ) can tell us something about the generalization
A very general framework -- a theory of everything?
Challenges to generative modeling
- The representational bias can be wrong -- but "all models are wrong"
- Unclear how to choose from different classes of models, e.g. the destiny of K
- Simplicity is relative, e.g. f(x) = a*sin(bx) + c
- Computing max_θ p(X|θ) can be very hard -- Bayesian computation may help
Challenges to generative modeling
Even finding X can be hard for language: a probability distribution over what?
Example: X for statistical syntax?
- Strings of words
- Parse trees
- Semantic interpretations
- Social interactions
Hope: staying at low levels of language will make the choice of X easier
Two paradigms of statistical learning (II)
Vapnik's critique of generative modeling: "Why solve a more general problem before solving a specific one?"
Example: the generative approach to 2-class classification (supervised) -- likelihood ratio test: log[p(x|A)/p(x|B)], where A, B are parametric models
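The generative two-class recipe can be sketched as follows (an assumed 1-D example): fit a Gaussian model to each class, then classify a new x by the sign of the log likelihood ratio.

```python
import math

def fit_gaussian(xs):
    # ML estimates for a 1-D Gaussian class model
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def log_lik(x, mu, var):
    # log p(x | mu, var) for a 1-D Gaussian
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def classify(x, model_a, model_b):
    """Likelihood ratio test: log[p(x|A)/p(x|B)] > 0 -> class A."""
    return 'A' if log_lik(x, *model_a) - log_lik(x, *model_b) > 0 else 'B'
```

This is Vapnik's point in miniature: to answer the specific question "A or B?" we first solved the more general problem of modeling each class's full distribution.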
Non-parametric approach to inductive inference
Main idea: don't want to know the universe first, then generalize
- The universe is complicated; the representational bias is often inappropriate
- There are very few data to learn from, compared to the dimensionality of the space
Instead, want to generalize directly from old data to new data
Rules vs. analogy?
Examples of non-parametric learning (I)
Nearest neighbor classification: analogy-based learning by dictionary lookup
Generalizes to K nearest neighbors
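The dictionary-lookup idea is short enough to sketch directly (a toy 1-D version, my example): store the old (x_i, label) pairs and let the K closest ones vote on a new x.

```python
from collections import Counter

def knn_classify(x, examples, k=3):
    """K-nearest-neighbor classification: analogy by dictionary lookup.

    examples: list of (x_i, label) pairs.  Generalization to a new x is
    a majority vote among the k stored examples closest to x -- no
    model of the data-generating distribution is ever built.
    """
    by_dist = sorted(examples, key=lambda e: abs(e[0] - x))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 this is pure lookup of the single most similar old datum; larger k smooths the analogy over a neighborhood.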
Examples of non-parametric learning (II)
Radial basis networks for supervised learning: F(x) = Σ_i a_i K(x, x_i)
- K(x, x_i): a non-linear similarity function centered at x_i, with tunable parameters
Interpretation: "soft/smooth" dictionary lookup/analogy within a population
Learning: find the a_i from (x_i, y_i) pairs -- a regularized regression problem: min Σ_i [F(x_i) - y_i]² + λ ||F||²
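A minimal sketch of that regression (my example; the Gaussian kernel and the regularization weight lam are assumptions): the regularized least-squares problem has the closed-form solution (K + λI) a = y, solved here with a small hand-rolled elimination routine.

```python
import math

def gaussian_kernel(x, xi, width=1.0):
    # K(x, x_i): a non-linear similarity function centered at x_i
    return math.exp(-((x - xi) ** 2) / (2 * width ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting (small systems only)
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        a[i] = (M[i][n] - sum(M[i][j] * a[j] for j in range(i + 1, n))) / M[i][i]
    return a

def fit_rbf(xs, ys, lam=1e-3, width=1.0):
    """Fit F(x) = sum_i a_i K(x, x_i) by regularized least squares.

    Solving (K + lam*I) a = y is the closed form of
    min sum_i [F(x_i) - y_i]^2 + lam * ||F||^2.
    """
    n = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j], width) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    a = solve(K, ys)
    return lambda x: sum(ai * gaussian_kernel(x, xi, width)
                         for ai, xi in zip(a, xs))
```

Note that every training point x_i contributes its own basis function K(x, x_i), which is exactly the "soft dictionary lookup" reading on the slide.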
Radial basis functions/networks
Each data point x_i is associated with K(x, x_i) -- a radial basis function
Linear combinations of enough K(x, x_i) can approximate any smooth function from R^n to R: the universal approximation property
Network interpretation (see demo)
How is this different from generative modeling?
It does not assume a fixed space to search for the best hypothesis; instead, this space grows with the amount of data
Basis of the space: K(x, x_i)
Interpretation: local generalization from old data x_i to new data x
F(x) = Σ_i a_i K(x, x_i) represents an ensemble generalization from {x_i} to x
Examples of non-parametric learning (III)
Support Vector Machines (last time): linear separation, f(x) = sign(<w, x> + b)

Max margin classification
The solution is also a direct generalization from old data, but sparse: the coefficients are mostly zero

Interpretation of support vectors
Support vectors make non-zero contributions to the generalization -- "prototypes" for analogical learning
Kernel generalization of SVM
The solution looks very much like an RBF network:
- RBF net: F(x) = Σ_i a_i K(x, x_i) -- many old data contribute to the generalization
- SVM: F(x) = sign(Σ_i a_i K(x, x_i) + b) -- relatively few old data contribute
The dense vs. sparse solutions reflect different goals (see demo)
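The contrast can be sketched as follows (illustrative numbers, not from the lecture): the SVM-style decision function has the same kernel-expansion form as the RBF net, but only the support-vector terms have non-zero a_i, so the other stored data drop out of the sum.

```python
import math

def gaussian_kernel(x, xi, width=1.0):
    return math.exp(-((x - xi) ** 2) / (2 * width ** 2))

def kernel_decision(x, coeffs, xs, b=0.0):
    """SVM-style F(x) = sign(sum_i a_i K(x, x_i) + b).

    Zero coefficients (non-support-vectors) contribute nothing, so
    only a few old data points drive the generalization.
    """
    s = sum(a * gaussian_kernel(x, xi)
            for a, xi in zip(coeffs, xs) if a != 0.0)
    return 1 if s + b >= 0 else -1
```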
Transductive inference with support vectors
One more wrinkle: now I'm putting two points there, but don't tell you the color

Transductive SVM
Not only do the old data affect generalization; the new data affect each other too
A general view of non-parametric inductive inference
A function approximation problem: knowing that (x_1, y_1), ..., (x_N, y_N) are inputs and outputs of some unknown function F, how can we approximate F and generalize to new values of x?
- Linguistics: find the universe for F
- Psychology: find the best model that "behaves" like F
In realistic terms, non-parametric methods often win
Who's got the answer?
The parametric approach can also approximate functions: model the joint distribution p(x, y|θ)
But the model is often difficult to build, e.g. for a realistic experimental task
Before reaching a conclusion, we need to know how people learn -- they may be doing both
Where do neural nets fit?
Clearly not generative: they do not reason with probability
Somewhat different from the analogy type of non-parametric learning: the network does not directly reason from old data, and the generalization is difficult to interpret
Some results are available for limiting cases: similar to non-parametric methods when the hidden units are infinite
A point that nobody gets right
The small-sample dilemma: people learn from very few examples (compared to the dimension of the data), yet any statistical machinery needs many
- Parametric: the ML estimate approaches the true distribution with an infinite sample
- Non-parametric: universal approximation requires an infinite sample
The limit is taken in the wrong direction