Part III
Learning structured representations: Hierarchical Bayesian models
[Figure: a hierarchy of linguistic knowledge]
Universal Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
Grammar: rules such as S → NP VP, VP → VP NP, VP → Verb, NP → Det Adj Noun RelClause, RelClause → Rel NP V
Phrase structure
Utterance
Speech signal
Outline
• Learning structured representations
  – grammars
  – logical theories
• Learning at multiple levels of abstraction
A historical divide
Innate knowledge vs. Learning
• Structured representations, innate knowledge (Chomsky, Pinker, Keil, ...)
• Unstructured representations, learning (McClelland, Rumelhart, ...)
Structure learning
Representations:
• Grammars (phrase structure rules such as S → NP VP)
• Logical theories
• Causal networks (e.g., asbestos → lung cancer → coughing, chest pain)
• Semantic networks (e.g., chemicals cause diseases; bio-active substances affect biological functions; substances interact with and disrupt functions)
• Phonological rules
How to learn a representation R
• Search for R that maximizes P(R | D) ∝ P(D | R) P(R)
• Prerequisites
  – Put a prior over a hypothesis space of Rs.
  – Decide how observable data are generated from an underlying R.
• R can be anything.
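The search described above can be sketched generically: score each candidate R by prior times likelihood and keep the best. The two hypotheses, their prior values, and their likelihood functions below are toy assumptions invented for illustration; in practice R could equally be a grammar, a logical theory, or a causal network.

```python
# Toy sketch of "search for R that maximizes P(R|D) ∝ P(D|R) P(R)".
# The hypothesis space and its numbers are invented for illustration.

def posterior_score(prior, likelihood, data):
    """Unnormalized posterior: prior times the likelihood of each datum."""
    score = prior
    for d in data:
        score *= likelihood(d)
    return score

hypotheses = {
    "fair coin":   (0.7, lambda d: 0.5),                       # prior 0.7
    "biased coin": (0.3, lambda d: 0.9 if d == "H" else 0.1),  # prior 0.3
}

data = ["H", "H", "H", "H"]
scores = {name: posterior_score(p, lik, data)
          for name, (p, lik) in hypotheses.items()}
best = max(scores, key=scores.get)
print(best)  # "biased coin": the data overwhelm its lower prior
```

The same prior-times-likelihood scoring applies whether the hypotheses are coins, grammars, or theories; only the likelihood function changes.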
Context-free grammar
S → N VP      N → "Alice"      V → "scratched"
VP → V        N → "Bob"        V → "cheered"
VP → V N

Example parses:
(S (N Alice) (VP (V cheered)))
(S (N Alice) (VP (V scratched) (N Bob)))
Probabilistic context-free grammar
S → N VP   1.0      N → "Alice"   0.5      V → "scratched"   0.5
VP → V     0.6      N → "Bob"     0.5      V → "cheered"     0.5
VP → V N   0.4

(S (N Alice) (VP (V cheered)))
probability = 1.0 × 0.5 × 0.6 × 0.5 = 0.15

(S (N Alice) (VP (V scratched) (N Bob)))
probability = 1.0 × 0.5 × 0.4 × 0.5 × 0.5 = 0.05
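The derivation probabilities above are just products over the rules used. A minimal sketch with the slide's grammar (rule probabilities copied from the slide; the encoding as tuples is mine):

```python
# Probability of a PCFG derivation: the product of the probabilities of
# all rules used. Rules and probabilities are the slide's values.

RULES = {
    ("S", ("N", "VP")): 1.0,
    ("VP", ("V",)): 0.6,
    ("VP", ("V", "N")): 0.4,
    ("N", ("Alice",)): 0.5,
    ("N", ("Bob",)): 0.5,
    ("V", ("scratched",)): 0.5,
    ("V", ("cheered",)): 0.5,
}

def derivation_prob(rules_used):
    """Multiply the probabilities of the rules in one derivation."""
    p = 1.0
    for r in rules_used:
        p *= RULES[r]
    return p

# Derivation of "Alice scratched Bob"
p = derivation_prob([
    ("S", ("N", "VP")),
    ("N", ("Alice",)),
    ("VP", ("V", "N")),
    ("V", ("scratched",)),
    ("N", ("Bob",)),
])
print(p)  # 1.0 * 0.5 * 0.4 * 0.5 * 0.5 = 0.05
```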
The learning problem
Grammar G:
S → N VP   1.0      N → "Alice"   0.5      V → "scratched"   0.5
VP → V     0.6      N → "Bob"     0.5      V → "cheered"     0.5
VP → V N   0.4

Data D:
Alice scratched. Bob scratched. Alice scratched Alice. Alice scratched Bob. Bob scratched Alice. Bob scratched Bob.
Alice cheered. Bob cheered. Alice cheered Alice. Alice cheered Bob. Bob cheered Alice. Bob cheered Bob.
Grammar learning
• Search for G that maximizes P(G | D) ∝ P(D | G) P(G)
• Prior: P(G)
• Likelihood: assume that sentences in the data are independently generated from the grammar.
(Horning 1969; Stolcke 1994)
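Under the independence assumption, the corpus likelihood is a product of per-sentence probabilities. A minimal sketch using hand-computed sentence probabilities from the slide's PCFG (each of these sentences has a single parse under that grammar; the dictionary encoding is mine):

```python
# P(D | G) under the independence assumption: a product over sentences.
# Sentence probabilities are hand-computed from the slide's PCFG.

from math import prod

sentence_prob = {
    "Alice cheered.":       1.0 * 0.5 * 0.6 * 0.5,        # S→N VP, N→Alice, VP→V, V→cheered
    "Bob cheered.":         1.0 * 0.5 * 0.6 * 0.5,
    "Alice scratched Bob.": 1.0 * 0.5 * 0.4 * 0.5 * 0.5,  # uses VP→V N
    "Bob scratched Alice.": 1.0 * 0.5 * 0.4 * 0.5 * 0.5,
}

def corpus_likelihood(data):
    """P(D | G): product of independent per-sentence probabilities."""
    return prod(sentence_prob[s] for s in data)

D = ["Alice cheered.", "Alice scratched Bob."]
print(corpus_likelihood(D))  # product of the two sentence probabilities
```

A grammar learner would compute this likelihood for each candidate G and trade it off against the prior P(G).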
Experiment (Stolcke, 1994)
• Data: 100 sentences
[Figure: the generating grammar and the model's solution]
Predicate logic
• A compositional language:
  – For all x and y, if y is the sibling of x then x is the sibling of y.
  – For all x, y and z, if x is the ancestor of y and y is the ancestor of z, then x is the ancestor of z.
Learning a kinship theory
(Hinton, Quinlan, ...)
Data D: Sibling(victoria, arthur), Sibling(arthur, victoria), Ancestor(chris, victoria), Ancestor(chris, colin), Parent(chris, victoria), Parent(victoria, colin), Uncle(arthur, colin), Brother(arthur, victoria), ...
Theory T: [figure]
Learning logical theories
• Search for T that maximizes P(T | D) ∝ P(D | T) P(T)
• Prior: P(T)
• Likelihood: assume that the data include all facts that are true according to T.
(Conklin and Witten; Kemp et al 08; Katz et al 08)
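The likelihood assumption can be made concrete with a tiny forward-chaining sketch (the function and its names are mine, not from the cited work): a theory's core facts plus its rules entail a full fact set, and the data are assumed to be exactly those facts.

```python
# Sketch of the likelihood assumption for theory learning: a theory
# (core facts + rules) entails a set of facts by forward chaining.

def forward_chain(facts, transitive=False, symmetric=False):
    """Close a set of binary-relation facts under the given rules."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        if symmetric:
            new |= {(y, x) for (x, y) in facts}
        if transitive:
            new |= {(x, z) for (x, y) in facts for (w, z) in facts if y == w}
        if not new <= facts:
            facts |= new
            changed = True
    return facts

# The lab example below: R(f,k). R(k,c). R(c,l). R(l,b). R(b,h). plus transitivity.
core = {("f", "k"), ("k", "c"), ("c", "l"), ("l", "b"), ("b", "h")}
entailed = forward_chain(core, transitive=True)
print(len(entailed))  # 15: every ordered pair (x, y) with x earlier in the chain
```

A compact theory (5 core facts + one transitivity rule) thereby predicts all 15 observed facts, which is what gives it a high likelihood relative to a theory that lists facts one by one.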
Theory-learning in the lab
(cf. Krueger 1979)
[Figure: all pairwise relations R(·,·) among six items f, k, c, l, b, h]
Theory-learning in the lab
R(f,k). R(k,c). R(c,l). R(l,b). R(b,h).
Transitive: R(X,Z) ← R(X,Y), R(Y,Z).
Entailed pairs: f,k; f,c; f,l; f,b; f,h; k,c; k,l; k,b; k,h; c,l; c,b; c,h; l,b; l,h; b,h
[Figure: learning time tracks theory complexity (theory length; cf. Goodman) across transitive and transitive-with-exceptions conditions]
(Kemp et al 08)
Conclusion: Part 1
• Bayesian models can combine structured representations with statistical inference.
Outline
• Learning structured representations
  – grammars
  – logical theories
• Learning at multiple levels of abstraction
Vision (Han and Zhu, 2006)
Motor control (Wolpert et al., 2003)
Causal learning
(Kelley; Cheng; Waldmann)
Schema: chemicals cause diseases, which cause symptoms
Causal models:
  asbestos → lung cancer → coughing, chest pain
  mercury → Minamata disease → muscle wasting
Contingency data:
  Patient 1: asbestos exposure, coughing, chest pain
  Patient 2: mercury exposure, muscle wasting
[Figure: the linguistic hierarchy, now annotated with conditional distributions]
Universal Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
  P(grammar | UG)
Grammar
  P(phrase structure | grammar)
Phrase structure
  P(utterance | phrase structure)
Utterance
  P(speech | utterance)
Speech signal
[Diagram: U → G → s1 … s6 → u1 … u6, i.e., Universal Grammar → Grammar → Phrase structures → Utterances]
A hierarchical Bayesian model specifies a joint distribution over all variables in the hierarchy:
P({ui}, {si}, G | U) = P({ui} | {si}) P({si} | G) P(G | U)
Hierarchical Bayesian model
[Diagram: U → G → s1 … s6 → u1 … u6, with edges labeled P(G | U), P(s | G), P(u | s)]
Top-down inferences: infer {si} given {ui} and G:
P({si} | {ui}, G) ∝ P({ui} | {si}) P({si} | G)
Bottom-up inferences: infer G given {si} and U:
P(G | {si}, U) ∝ P({si} | G) P(G | U)
Simultaneous learning at multiple levels: infer G and {si} given {ui} and U:
P(G, {si} | {ui}, U) ∝ P({ui} | {si}) P({si} | G) P(G | U)
Word learning
Words in general: whole-object bias, shape bias
Individual words: "gavagai", "duck", "monkey", "car"
Data
A hierarchical Bayesian model
FH, FT
θi ~ Beta(FH, FT)
Coin 1 (θ1), Coin 2 (θ2), …, Coin 200 (θ200), each observed through flips d1 d2 d3 d4
Physical knowledge
• Qualitative physical knowledge (symmetry) can influence estimates of continuous parameters (FH, FT).
• Explains why 10 flips of 200 coins are better than 2000 flips of a single coin: more informative about FH, FT.
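A small simulation illustrates the point (the population parameter values are toy assumptions): the spread of head-rates across many coins carries the information about FH and FT that no number of flips of a single coin can provide.

```python
# Toy simulation of the coins model: each coin's bias theta is drawn from
# a shared Beta(F_H, F_T), then flipped a few times. The across-coin
# spread of head-rates informs (F_H, F_T); a single coin, however many
# flips, reveals only its own theta. Parameter values are assumed.

import random

random.seed(0)
F_H, F_T = 2.0, 2.0            # "true" population parameters (toy values)
n_coins, flips = 200, 10

rates = []
for _ in range(n_coins):
    theta = random.betavariate(F_H, F_T)       # this coin's bias
    heads = sum(random.random() < theta for _ in range(flips))
    rates.append(heads / flips)

mean = sum(rates) / n_coins
var = sum((r - mean) ** 2 for r in rates) / n_coins
# Under Beta(2, 2) the across-coin variance of theta is 0.05; the observed
# variance of head-rates approximates it (plus binomial noise). 2000 flips
# of one coin would pin down that coin's theta but say nothing about var.
print(round(mean, 2), round(var, 2))
```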
Coins → Word learning
"This is a dax." "Show me the dax."
• 24-month-olds show a shape bias
• 20-month-olds do not
(Landau, Smith & Gleitman)
Is the shape bias learned?
• Smith et al (2002) trained 17-month-olds on labels for 4 artificial categories: "wib", "lug", "zup", "div"
• After 8 weeks of training, 19-month-olds show the shape bias: "This is a dax." "Show me the dax."
Learning about feature variability
(cf. Goodman)
A hierarchical model
Meta-constraints M
Bags in general: color varies across bags but not much within bags
Bag proportions: mostly red, mostly brown, mostly green, mostly yellow, mostly blue?
Data
A hierarchical Bayesian model
Meta-constraints M
Bags in general: α = 0.1 (within-bag variability), β = [0.4, 0.4, 0.2]
Bag proportions: [1,0,0] [0,1,0] [1,0,0] [0,1,0] [.1,.1,.8]
Data: [6,0,0] [0,6,0] [6,0,0] [0,6,0] [0,0,1]
A hierarchical Bayesian model
Meta-constraints M
Bags in general: α = 5 (within-bag variability), β = [0.4, 0.4, 0.2]
Bag proportions: [.5,.5,0] [.5,.5,0] [.5,.5,0] [.5,.5,0] [.4,.4,.2]
Data: [3,3,0] [3,3,0] [3,3,0] [3,3,0] [0,0,1]
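The two settings above can be compared directly with the Dirichlet-multinomial marginal likelihood, a standard identity for count data whose proportions are drawn from a Dirichlet prior (the function below is a sketch; β = [0.4, 0.4, 0.2] as on the slides): a pure bag such as [6,0,0] is far more probable when α is small, and a mixed bag such as [3,3,0] when α is large.

```python
# Marginal likelihood of one bag's color counts when the bag's proportions
# theta ~ Dirichlet(alpha * beta): the Dirichlet-multinomial, up to the
# multinomial coefficient (which cancels when comparing values of alpha).

from math import lgamma

def dir_mult_loglik(counts, alpha, beta):
    """log P(counts | alpha, beta), ignoring the multinomial coefficient."""
    a = [alpha * b for b in beta]
    A, n = sum(a), sum(counts)
    ll = lgamma(A) - lgamma(A + n)
    for c, ai in zip(counts, a):
        ll += lgamma(ai + c) - lgamma(ai)
    return ll

beta = [0.4, 0.4, 0.2]
pure, mixed = [6, 0, 0], [3, 3, 0]

# Small alpha (low within-bag variability) favors pure bags;
# large alpha favors mixed bags.
print(dir_mult_loglik(pure, 0.1, beta) > dir_mult_loglik(pure, 5.0, beta))    # True
print(dir_mult_loglik(mixed, 5.0, beta) > dir_mult_loglik(mixed, 0.1, beta))  # True
```

This is how the learner can infer α itself: data consisting of nearly pure bags push the posterior toward small α, which then licenses confident generalization from a single draw out of a new bag.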
Shape of the Beta prior
[Figure: Beta densities for different parameter values]
A hierarchical Bayesian model
Meta-constraints M
Bags in general
Bag proportions
Data
Learning about feature variability
Meta-constraints M
Categories in general
Individual categories: "wib", "lug", "zup", "div", "dax"
Data
Model predictions
[Figure: predicted choice probabilities for "Show me the dax"]
Where do priors come from?
Meta-constraints M
Categories in general
Individual categories
Data
Knowledge representation
Children discover structural form
• Children may discover that:
  – Social networks are often organized into cliques
  – The months form a cycle
  – "Heavier than" is transitive
  – Category labels can be organized into hierarchies
A hierarchical Bayesian model
Meta-constraints M
Form: Tree
Structure: a tree over mouse, squirrel, chimp, gorilla
Data: a feature matrix (whiskers, hands, tail, X, …) for each animal
A hierarchical Bayesian model
Meta-constraints M
F: form (Tree)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data (feature matrix)
Structural forms
Partition, Order, Chain, Ring, Hierarchy, Tree, Grid, Cylinder
P(S|F, n): Generating structures
• P(S | F, n) = 0 if S is inconsistent with F.
• Otherwise, each structure is weighted by the number of nodes it contains, where |S| is the number of nodes in S (a complexity penalty on larger structures).
P(S|F, n): Generating structures from forms
[Figure: candidate structures A, B, C over mouse, squirrel, chimp, gorilla]
• Simpler forms are preferred: P(S|F) must be spread over all possible graph structures S consistent with F (e.g., Chain vs. Grid).
A hierarchical Bayesian model
Meta-constraints M
F: form (Tree)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data (feature matrix: whiskers, hands, tail, X, …)
p(D|S): Generating feature data
(Zhu, Lafferty & Ghahramani)
• Intuition: features should be smooth over graph S.
[Figure: a relatively smooth feature assignment vs. a non-smooth one]
Let f_i be the feature value at node i; smoothness favors small differences (f_i − f_j)² across edges (i, j).
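The smoothness intuition can be sketched with the graph-Laplacian penalty used in this line of work, the quadratic form summing (f_i − f_j)² over edges; the chain graph and feature vectors below are toy examples of mine, not the slide's data.

```python
# Smoothness of a feature over a graph, scored by the Laplacian quadratic
# form f' L f = sum over edges of (f_i - f_j)^2. Small penalty = smooth.
# The chain graph and feature vectors are toy examples.

def laplacian_penalty(edges, f):
    """Sum of squared differences across edges: low means smooth over S."""
    return sum((f[i] - f[j]) ** 2 for i, j in edges)

chain = [(0, 1), (1, 2), (2, 3)]            # a 4-node chain graph

smooth = [0.0, 0.1, 0.2, 0.3]               # varies gradually: smooth
rough = [0.0, 1.0, 0.0, 1.0]                # alternates: not smooth

print(laplacian_penalty(chain, smooth))     # small, ≈ 0.03
print(laplacian_penalty(chain, rough))      # large, 3.0
```

A likelihood of the form p(D|S) ∝ exp(−penalty) then assigns high probability to structures S over which the observed features vary gradually.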
A hierarchical Bayesian model
Meta-constraints M
F: form (Tree) → S: structure → D: data (feature matrix over mouse, squirrel, chimp, gorilla)
Feature data: results
[Figures: structures recovered from animals × features data and judges × cases data]
Developmental shifts
[Figures: structures recovered with 5, 20, and 110 features]
Similarity data: results
[Figure: structure recovered from colors × colors similarity data]
Relational data
Meta-constraints M
Form: Cliques
Structure: items 1-8 assigned to cliques
Data: an 8 × 8 relational matrix
Relational data: results
Primates ("x dominates y"), Bush cabinet ("x tells y"), Prisoners ("x is friends with y")
Why structural form matters
• Structural forms support predictions about new or sparsely-observed entities.

Experiment: Form discovery
Cliques (n = 8/12), Chain (n = 7/12)
Form
Structure: over mouse, squirrel, chimp, gorilla
Data: features whiskers, hands, tail, warm-blooded; the warm-blooded value is observed for mouse and unobserved (?) for squirrel, chimp, gorilla
A hypothesis space of forms
Universal Structure grammar U
[Figure: forms paired with their generative processes]
Conclusions: Part 2
• Hierarchical Bayesian models provide a unified framework which helps to explain:
  – How abstract knowledge is acquired
  – How abstract knowledge is used for induction
Outline
• Learning structured representations
  – grammars
  – logical theories
• Learning at multiple levels of abstraction
Handbook of Mathematical Psychology, 1963