
Transcript of Bob Carpenter Probabilistic Graphs: Efficient Natural (Spoken) Language Processing.

Bob Carpenter

Probabilistic Graphs: Efficient Natural (Spoken) Language Processing

October 1999

The Standard Clichés

• Moore’s Cliché – Exponential growth in computing power and memory will continue to open up new possibilities

• The Internet Cliché – With the advent and growth of the world-wide web, an ever-increasing amount of information must be managed


More Standard Clichés

• The Convergence Cliché – Data, voice and video networking will be integrated over a universal network that:
– includes land lines and wireless
– includes broadband and narrowband
– likely implementation is IP (internet protocol)

• The Interface Cliché – The three forces above (growth in computing power, information online, and networking) will both enable and require new interfaces
– Speech will become as common as graphics


Application Requirements

• Robustness
– acoustic and linguistic variation
– disfluencies and noise

• Scalability
– from embedded devices to palmtops to clients to servers
– across tasks from simple to complex
– system-initiative form-filling to mixed initiative dialogue

• Portability
– simple adaptation to new tasks and new domains
– preferably automated as much as possible


The Big Question

• How do humans handle unrestricted language so effortlessly in real time?

• Unfortunately, the “classical” linguistic assumptions and methodology completely ignore this issue

• This is a dangerous strategy for processing natural spoken language


My Favorite Experiments (I)

• Head-Mounted Eye Tracking – Mike Tanenhaus et al. (Univ. Rochester)
– example instruction: “Pick up the yellow plate”
– eyes track semantic resolution, with ~200 ms tracking time

• Clearly shows human understanding is online


My Favorite Experiments (II)

• Garden Paths and Context Sensitivity – Crain & Steedman (U. Connecticut & U. Edinburgh)
– if the noun’s denotation is not a singleton in context, postmodification is much more likely

• Garden Paths are Frequency and Agreement Sensitive – Tanenhaus et al.
– The horse raced past the barn fell. (raced likely past tense)
– The horses brought into the barn fell. (brought likely participle, and a less likely activity for horses)


Conclusion: Function & Evolution

• Humans aggressively prune in real time
– This is an existence proof: there must be enough info to do so; we just need to find it
– All linguistic information is brought in at ~200 ms
– Other pruning strategies have no such existence proof

• Speakers are cooperative in their use of language
– Especially with spoken language, which is very different from written language due to real-time requirements

• (Co-?)Evolution of language and speakers to optimize these requirements


Stats: Explanation or Stopgap?

• The Common View
– Statistics are some kind of approximation of underlying factors requiring further explanation

• Steve Abney’s Analogy (AT&T Labs) – Statistical Queueing Theory
– Consider traffic flows through a toll gate on a highway
– Underlying factors are diverse, and explain the actions of each driver, their cars, possible causes of flat tires, drunk drivers, etc.
– Statistics is more insightful [“explanatory”] in this case as it captures emergent generalizations
– It is a reductionist error to insist on a low-level account


Algebraic vs. Statistical

• False Dichotomy
– Statistical systems have an algebraic basis, even if trivial

• Best performing statistical systems have best linguistic conditioning
– Holds for phonology/phonetics and morphology/syntax
– Most “explanatory” in traditional sense
– Statistical estimators less significant than conditioning

• In “other” sciences, statistics used for exploratory data analysis
– trendier: data mining; trendiest: information harvesting

• Emergent statistical generalizations can be algebraic


The Speech Recognition Problem

• The Recognition Problem
– Find the most likely sequence w of “words” given the sequence of acoustic observation vectors a
– Use Bayes’ law to create a generative model (a minimal sketch follows this slide):
  max_w P(w|a) = max_w P(a|w) P(w) / P(a) = max_w P(a|w) P(w)

• Language Model: P(w) [usually n-grams - discrete]

• Acoustic Model: P(a|w) [usually HMMs - continuous density]

• Challenge 1: beat trigram language models
• Challenge 2: extend this paradigm to NLP
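As a hedged illustration (not from the talk), the Java sketch below applies the decomposition above to a toy candidate list: each hypothesis w is scored by log P(a|w) + log P(w), and the constant P(a) is dropped. All strings and numbers are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NoisyChannelDecode {
    public static void main(String[] args) {
        // Hypothetical acoustic log-likelihoods log P(a|w) from a recognizer.
        Map<String, Double> acoustic = new LinkedHashMap<>();
        acoustic.put("flights from Boston today", -120.0);
        acoustic.put("lights for Boston to pay",  -118.5);
        acoustic.put("flights from Austin today", -121.0);

        // Hypothetical language-model log-probabilities log P(w).
        Map<String, Double> languageModel = new LinkedHashMap<>();
        languageModel.put("flights from Boston today", -14.0);
        languageModel.put("lights for Boston to pay",  -22.0);
        languageModel.put("flights from Austin today", -15.5);

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String w : acoustic.keySet()) {
            // log P(a|w) + log P(w); P(a) is constant over w and can be dropped.
            double score = acoustic.get(w) + languageModel.get(w);
            if (score > bestScore) {
                bestScore = score;
                best = w;
            }
        }
        System.out.println("best hypothesis: " + best + " (log score " + bestScore + ")");
    }
}
```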


N-best and Word Graphs

• Speech recognizers can return n-best histories:
1. flights from Boston today
2. lights for Boston to pay
3. flights from Austin today
4. flights for Boston to pay

• Or a packed word graph of histories (a minimal data-structure sketch follows this slide)
– sum of path log probs equals acoustic/language log prob

[Figure: packed word graph with edges flights/lights, from/for, Boston/Austin, today/to pay]

• Path closest to utterance in dense graphs much better than first-best on average [density: 1:24%; 5:15%; 180:11%]
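A minimal sketch, assuming a simple adjacency-list representation (the Lattice and Edge names are mine, not the talk’s), of how a packed word graph might be stored: edges carry a word label and a log probability, and a path’s score is the sum of its edge log probs.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal packed word graph: nodes 0..numNodes-1, edges carry a word label and
 *  a log probability; a path's score is the sum of its edge log probs.
 *  (Illustrative sketch; names and numbers are not from the talk.) */
public class Lattice {
    public static final class Edge {
        public final int from, to;
        public final String word;
        public final double logProb;
        public Edge(int from, int to, String word, double logProb) {
            this.from = from; this.to = to; this.word = word; this.logProb = logProb;
        }
    }

    public final int numNodes;
    public final List<List<Edge>> outgoing; // edges indexed by source node

    public Lattice(int numNodes) {
        this.numNodes = numNodes;
        this.outgoing = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) outgoing.add(new ArrayList<>());
    }

    public void addEdge(int from, int to, String word, double logProb) {
        outgoing.get(from).add(new Edge(from, to, word, logProb));
    }

    public static void main(String[] args) {
        // Toy graph packing "flights|lights from|for Boston|Austin today|to pay".
        Lattice g = new Lattice(5);
        g.addEdge(0, 1, "flights", Math.log(0.7));
        g.addEdge(0, 1, "lights",  Math.log(0.3));
        g.addEdge(1, 2, "from",    Math.log(0.6));
        g.addEdge(1, 2, "for",     Math.log(0.4));
        g.addEdge(2, 3, "Boston",  Math.log(0.8));
        g.addEdge(2, 3, "Austin",  Math.log(0.2));
        g.addEdge(3, 4, "today",   Math.log(0.9));
        g.addEdge(3, 4, "to pay",  Math.log(0.1));
        System.out.println("edges leaving node 0: " + g.outgoing.get(0).size());
    }
}
```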


Probabilistic Graph Processing

• The architecture we’re exploring in the context of spoken dialogue systems involves:
– Speech recognizers that produce a probabilistic word graph output, with scores given by acoustic probabilities
– A tagger that transforms a word graph into a word/tag graph, with scores given by joint probabilities
– A parser that transforms a word/tag graph into a syntactic graph (as in CKY parsing), with scores given by the grammar

• Allows each module to rescore the output of the previous module’s decision (a minimal interface sketch follows this slide)

• Long Term: Apply this architecture to speech act detection, dialogue act selection, and generation
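As a hedged sketch of the module chaining described above (ScoredGraph and GraphModule are hypothetical names, not the talk’s API): each stage consumes a scored graph and produces a rescored graph, so recognizer output, tagger, and parser compose into a pipeline.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch of the graph-in/graph-out pipeline described above. */
public class GraphPipeline {
    /** Placeholder for a probabilistic graph (word, word/tag, or syntactic). */
    public static final class ScoredGraph {
        public final String description;
        public ScoredGraph(String description) { this.description = description; }
    }

    /** Each module rescores and transforms the previous module's graph. */
    public interface GraphModule {
        ScoredGraph apply(ScoredGraph input);
    }

    public static ScoredGraph run(ScoredGraph recognizerOutput, List<GraphModule> modules) {
        ScoredGraph g = recognizerOutput;
        for (GraphModule m : modules) {
            g = m.apply(g); // each stage may prune and rescore the previous stage's output
        }
        return g;
    }

    public static void main(String[] args) {
        GraphModule tagger = in -> new ScoredGraph(in.description + " -> word/tag graph");
        GraphModule parser = in -> new ScoredGraph(in.description + " -> syntactic graph");
        ScoredGraph result = run(new ScoredGraph("word graph"), Arrays.asList(tagger, parser));
        System.out.println(result.description);
    }
}
```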


Probabilistic Graph Tagger

• In: probabilistic word graph
– P(As|Ws) : conditional acoustic likelihoods [or confidences]

• Out: probabilistic word/tag graph
– P(Ws,Ts) : joint word/tag likelihoods [ignores acoustics]
– P(As,Ws,Ts) : joint acoustic/word/tag likelihoods [but…]

• General history-based implementation [in Java]
– next tag/word probability a function of specified history
– operates purely left to right on the forward pass
– backwards prune to edges within a beam / on n-best path
– able to output hypotheses online
– optional backwards confidence rescoring [not P(As,Ws,Ts)]
– need a node for each active history class for a proper model


Backwards: Rescore & Minimize

All Paths:
1. A,C,E : 1/64
2. A,C,D : 1/128
3. B,C,D : 1/256
4. B,C,E : 1/512

[Figure: Backward pass edge scores A: 4/5, B: 1/5, C: 1, D: 2/3, E: 1/3; note: outputs sum to 1 after backward pass]

• Edge gets sum of all path scores that go through it
• Normalize by total: (1/64 + 1/128 + 1/256 + 1/512) (a sketch follows this slide)

[Figure: Joint Out edge scores A: 1/2, B: 1/8, C: 1/4, D: 1/8, E: 1/16]
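A minimal sketch of the backward rescoring step, assuming a toy topology guessed from the slide (two initial edges A/B, a shared C, then D/E) with the Joint Out scores as input: each edge receives the summed score of all complete paths through it, normalized by the total.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of backward rescoring on a tiny lattice: each edge ends up with the
 *  normalized sum of the scores of all complete paths passing through it.
 *  Topology is my guess at the slide's figure; scores come from the Joint Out figure. */
public class BackwardRescore {
    static final class Edge {
        final int from, to; final String label; final double score;
        Edge(int from, int to, String label, double score) {
            this.from = from; this.to = to; this.label = label; this.score = score;
        }
    }

    public static void main(String[] args) {
        int numNodes = 4; // 0 = start, 3 = end
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(0, 1, "A", 1.0 / 2));
        edges.add(new Edge(0, 1, "B", 1.0 / 8));
        edges.add(new Edge(1, 2, "C", 1.0 / 4));
        edges.add(new Edge(2, 3, "D", 1.0 / 8));
        edges.add(new Edge(2, 3, "E", 1.0 / 16));

        // forward[v] = summed score of all partial paths from start to v
        double[] forward = new double[numNodes];
        forward[0] = 1.0;
        for (Edge e : edges) forward[e.to] += forward[e.from] * e.score; // edges are in topological order

        // backward[v] = summed score of all partial paths from v to end
        double[] backward = new double[numNodes];
        backward[numNodes - 1] = 1.0;
        for (int i = edges.size() - 1; i >= 0; i--) {
            Edge e = edges.get(i);
            backward[e.from] += e.score * backward[e.to];
        }

        double total = forward[numNodes - 1]; // sum over all complete paths
        for (Edge e : edges) {
            double through = forward[e.from] * e.score * backward[e.to];
            System.out.printf("%s : %.4f%n", e.label, through / total);
        }
    }
}
```

With these inputs the printout should come to A 0.8, B 0.2, C 1.0, D 0.667, E 0.333, i.e. the values in the backward figure above.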


Tagger Probability Model

• Exact Probabilities:
– P(As,Ws,Ts) = P(Ws,Ts) * P(As|Ws,Ts)
– P(Ws,Ts) = P(Ts) * P(Ws|Ts) [“top-down”]

• Approximations:
– Two Tag History: tag trigram
  • P(Ts) ~ PRODUCT_n P(T_n | T_n-2, T_n-1)
– Words Depend only on Tags: HMM
  • P(Ws|Ts) ~ PRODUCT_n P(W_n | T_n)
– Pronunciation Independent of Tag: use standard acoustics
  • P(As|Ws,Ts) ~ P(As|Ws)

(A small sketch of the trigram/HMM approximation follows this slide.)
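A hedged sketch of the approximate joint model above: log P(Ws,Ts) is accumulated as the sum of tag-trigram and word-given-tag log probabilities. The probability tables are invented placeholders; a real tagger would estimate and smooth them from a corpus.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the tagger's approximate joint model:
 *  log P(Ws,Ts) ~ sum_n [ log P(T_n | T_n-2, T_n-1) + log P(W_n | T_n) ].
 *  All table entries below are invented for illustration. */
public class TrigramHmmScore {
    static final Map<String, Double> tagTrigram = new HashMap<>();   // "T-2 T-1 T" -> log prob
    static final Map<String, Double> wordGivenTag = new HashMap<>(); // "T word"    -> log prob

    static double logP(Map<String, Double> table, String key) {
        // Crude floor for unseen events; real smoothing (backoff, etc.) is omitted.
        return table.getOrDefault(key, Math.log(1e-6));
    }

    static double score(String[] words, String[] tags) {
        double logJoint = 0.0;
        for (int n = 0; n < words.length; n++) {
            String t2 = n >= 2 ? tags[n - 2] : "<s>";
            String t1 = n >= 1 ? tags[n - 1] : "<s>";
            logJoint += logP(tagTrigram, t2 + " " + t1 + " " + tags[n]); // P(T_n | T_n-2, T_n-1)
            logJoint += logP(wordGivenTag, tags[n] + " " + words[n]);    // P(W_n | T_n)
        }
        return logJoint;
    }

    public static void main(String[] args) {
        // Invented numbers, loosely echoing the "Prices rose sharply today" example.
        tagTrigram.put("<s> <s> NNS", Math.log(0.2));
        tagTrigram.put("<s> NNS VBD", Math.log(0.3));
        tagTrigram.put("NNS VBD RB", Math.log(0.2));
        tagTrigram.put("VBD RB NN", Math.log(0.1));
        wordGivenTag.put("NNS prices", Math.log(0.01));
        wordGivenTag.put("VBD rose", Math.log(0.005));
        wordGivenTag.put("RB sharply", Math.log(0.01));
        wordGivenTag.put("NN today", Math.log(0.02));

        String[] words = {"prices", "rose", "sharply", "today"};
        String[] tags = {"NNS", "VBD", "RB", "NN"};
        System.out.println("log P(Ws,Ts) ~ " + score(words, tags));
    }
}
```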


Prices rose sharply today

0. -35.612683136497516 : NNS/prices VBD/rose RB/sharply NN/today (0, 2;NNS/prices) (2, 10;VBD/rose) (10, 14;RB/sharply) (14, 15;NN/today)

1. -37.035496392922575 : NNS/prices VBD/rose RB/sharply NNP/today (0, 2;NNS/prices) (2, 10;VBD/rose) (10, 14;RB/sharply) (14, 15;NNP/today)

2. -40.439580756197934 : NNS/prices VBP/rose RB/sharply NN/today (0, 2;NNS/prices) (2, 9;VBP/rose) (9, 11;RB/sharply) (11, 15;NN/today)

3. -41.86239401262299 : NNS/prices VBP/rose RB/sharply NNP/today (0, 2;NNS/prices) (2, 9;VBP/rose) (9, 11;RB/sharply) (11, 15;NNP/today)

4. -43.45450487625557 : NN/prices VBD/rose RB/sharply NN/today (0, 1;NN/prices) (1, 6;VBD/rose) (6, 14;RB/sharply) (14, 15;NN/today)

5. -44.87731813268063 : NN/prices VBD/rose RB/sharply NNP/today (0, 1;NN/prices) (1, 6;VBD/rose) (6, 14;RB/sharply) (14, 15;NNP/today)

6. -45.70597331609037 : NNS/prices NN/rose RB/sharply NN/today (0, 2;NNS/prices) (2, 8;NN/rose) (8, 13;RB/sharply) (13, 15;NN/today)

7. -45.81027979248346 : NNS/prices NNP/rose RB/sharply NN/today (0, 2;NNS/prices) (2, 7;NNP/rose) (7, 12;RB/sharply) (12, 15;NN/today)

8. ……………..


Prices rose sharply after hours: 15-best as a word/tag graph + minimization

[Figure: word/tag graph with edges prices:NNS, prices:NN, rose:VBD, rose:VBP, rose:NN, rose:NNP, sharply:RB, after:IN, after:RB, hours:NNS]


Prices rose sharply after hours: 15-best as a word/tag graph + minimization + collapsing

[Figure: minimized and collapsed word/tag graph with edges prices:NNS, prices:NN, rose:VBD, rose:VBP, rose:NN, rose:NNP, sharply:RB, after:IN, after:RB, hours:NNS]


Weighted Minimize (isn’t easy)

• Can push probabilities back through the graph
• Ratio of scores must be equivalent for sound minimization (difference of log scores); a sketch follows this slide

[Figure: before pushing, B:w and C:z lead into parallel edges A:x and A:y; after pushing, B:w+(x-y) and C:z lead into a single shared edge A:y]

• Assume x > y; the operation preserves the total of each path: B,A : w+x; C,A : z+y
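A small numeric sketch of the pushing operation (the concrete numbers are invented): the difference x - y is moved onto the incoming B edge so both A edges can share the score y, and every path’s total log score is unchanged.

```java
/** Sketch of pushing a log-score difference back through the graph so that two
 *  parallel "A" edges can be merged. Numbers are illustrative only. */
public class WeightPush {
    public static void main(String[] args) {
        // Log scores from the slide's schematic: path B->A has score w + x,
        // path C->A has score z + y.
        double w = -1.0, x = -2.0; // B:w followed by A:x
        double z = -1.5, y = -3.0; // C:z followed by A:y  (assume x > y)

        double before1 = w + x;
        double before2 = z + y;

        // Push the difference (x - y) onto the B edge; both A edges then carry y
        // and can be merged into a single edge A:y.
        double wPushed = w + (x - y);
        double after1 = wPushed + y;
        double after2 = z + y;

        System.out.println("path B,A before = " + before1 + ", after = " + after1); // unchanged
        System.out.println("path C,A before = " + before2 + ", after = " + after2); // unchanged
    }
}
```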


Weighted Minimize is Problematic

• Can’t minimize if ratio is not the same:

[Figure: paths with edges B:x1 then A:y1 and B:x2 then A:y2 (plus edges B:w, C:z); the A edges cannot be merged when the score differences differ]

• To push, must have an amount to push:
  (x1 - x2) = (y1 - y2)   [i.e., e^x1 / e^x2 = e^y1 / e^y2]


How to Collect n Best in O(n k)

• Do a forward pass through the graph, saving:
– best total path score at each node
– backpointers to all previous nodes, with scores

• This is done during tagging (linear in max length k)

• Algorithm (a sketch follows this slide):
– add the first-best and second-best final paths to a priority queue
– repeat n times:
  • follow the backpointers of the best path on the queue to the beginning & save the next best (if any) at each node onto the queue

• Can do the same for all paths within beam epsilon
• Result is deterministic; minimize before parsing
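A hedged sketch in the spirit of this slide, not the talk’s exact algorithm: after a forward pass stores the best prefix score at each node, an A*-style priority queue expands suffixes backward from the final node, using the stored prefix scores as an exact look-ahead, and emits complete paths in best-first order. The toy lattice echoes the earlier flights/lights example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class NBestPaths {
    static final class Edge {
        final int from, to; final String label; final double logScore;
        Edge(int from, int to, String label, double logScore) {
            this.from = from; this.to = to; this.label = label; this.logScore = logScore;
        }
    }
    /** A partial path from some node back to the end node. */
    static final class Suffix {
        final int node; final double suffixScore; final String labels;
        Suffix(int node, double suffixScore, String labels) {
            this.node = node; this.suffixScore = suffixScore; this.labels = labels;
        }
    }

    public static void main(String[] args) {
        int numNodes = 5, start = 0, end = 4, n = 4;
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(0, 1, "flights", Math.log(0.7)));
        edges.add(new Edge(0, 1, "lights",  Math.log(0.3)));
        edges.add(new Edge(1, 2, "from",    Math.log(0.6)));
        edges.add(new Edge(1, 2, "for",     Math.log(0.4)));
        edges.add(new Edge(2, 3, "Boston",  Math.log(0.8)));
        edges.add(new Edge(2, 3, "Austin",  Math.log(0.2)));
        edges.add(new Edge(3, 4, "today",   Math.log(0.9)));
        edges.add(new Edge(3, 4, "to pay",  Math.log(0.1)));

        // Forward pass (edges in topological order): best prefix score per node,
        // plus incoming edges to serve as backpointers.
        double[] bestPrefix = new double[numNodes];
        java.util.Arrays.fill(bestPrefix, Double.NEGATIVE_INFINITY);
        bestPrefix[start] = 0.0;
        List<List<Edge>> incoming = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) incoming.add(new ArrayList<>());
        for (Edge e : edges) {
            bestPrefix[e.to] = Math.max(bestPrefix[e.to], bestPrefix[e.from] + e.logScore);
            incoming.get(e.to).add(e);
        }

        // Best-first backward expansion: priority = best prefix score + suffix score.
        PriorityQueue<Suffix> queue = new PriorityQueue<Suffix>(
            Comparator.comparingDouble((Suffix s) -> -(bestPrefix[s.node] + s.suffixScore)));
        queue.add(new Suffix(end, 0.0, ""));
        int found = 0;
        while (!queue.isEmpty() && found < n) {
            Suffix s = queue.poll();
            if (s.node == start) {              // reached the beginning: a complete path
                found++;
                System.out.printf("%d. %.4f :%s%n", found, s.suffixScore, s.labels);
                continue;
            }
            for (Edge e : incoming.get(s.node)) { // follow backpointers one edge back
                queue.add(new Suffix(e.from, e.logScore + s.suffixScore, " " + e.label + s.labels));
            }
        }
    }
}
```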


Collins’ Head/Dependency Parser

• Michael Collins (AT&T) 1998 UPenn PhD Thesis

• Generative model of tree probabilities: P(Tree)

• Parses WSJ with ~90% constituent precision/recall
– Best performance for a single parser, but Henderson’s Johns Hopkins thesis beat it by blending with other parsers (Charniak & Ratnaparkhi)

• Formal “language” induced from simple smoothing of the treebank is trivial: ~Word* (Charniak)

• Collins’ parser runs in real time
– Collins’ naïve C implementation
– Parses 100% of the test set


Collins’ Grammar Model

• Similar to GPSG + CG (aka HPSG) model
– Subcat frames: adjuncts / complements distinguished
– Generalized coordination
– Unbounded dependencies via slash
– Punctuation
– Distance metric codes word order (canonical & not)

• Probabilities conditioned top-down

• 12,000 word vocabulary (>= 5 occurrences in treebank)
– backs off to a word’s tag
– approximates unknown words from words with < 5 occurrences

• Induces “feature information” statistically


Collins’ Statistics (Simplified)

• Choose Start Symbol, Head Tag, & Head Word
– P(RootCat, HeadTag, HeadWord)

• Project Daughter and Left/Right Subcat Frames
– P(DaughterCat | MotherCat, HeadTag, HeadWord)
– P(SubCat | MotherCat, DtrCat, HeadTag, HeadWord)

• Attach Modifier (Comp/Adjunct & Left/Right)
– P(ModifierCat, ModifierTag, ModifierWord | SubCat, MotherCat, DaughterCat, HeadTag, HeadWord, Distance)

(A toy scoring sketch follows this slide.)
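A toy illustration only (events and numbers are invented, and Collins’ real model has far more conditioning and careful backoff): a tree’s log probability is the sum of the log probabilities of the generative decisions listed above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sketch: log P(Tree) as the sum of log probabilities of the generative
 *  decisions (root + head, daughter/subcat projection, modifier attachment). */
public class TreeScoreSketch {
    public static void main(String[] args) {
        Map<String, Double> logProb = new HashMap<>();
        // P(RootCat, HeadTag, HeadWord)
        logProb.put("root: S VBD rose", Math.log(0.01));
        // P(DaughterCat | MotherCat, HeadTag, HeadWord)
        logProb.put("dtr: VP | S VBD rose", Math.log(0.7));
        // P(SubCat | MotherCat, DtrCat, HeadTag, HeadWord)
        logProb.put("subcat: {NP-left} | S VP VBD rose", Math.log(0.6));
        // P(ModCat, ModTag, ModWord | SubCat, MotherCat, DtrCat, HeadTag, HeadWord, Dist)
        logProb.put("mod: NP NNS prices | {NP-left} S VP VBD rose adj", Math.log(0.02));

        String[] derivation = {
            "root: S VBD rose",
            "dtr: VP | S VBD rose",
            "subcat: {NP-left} | S VP VBD rose",
            "mod: NP NNS prices | {NP-left} S VP VBD rose adj"
        };
        double logTree = 0.0;
        for (String decision : derivation) {
            logTree += logProb.getOrDefault(decision, Math.log(1e-9)); // unseen -> crude floor
        }
        System.out.println("log P(Tree) ~ " + logTree);
    }
}
```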


Complexity and Efficiency

• Collins’ wide-coverage linguistic grammar generates millions of readings for simple strings

• But Collins’ parser runs faster than real time on unseen sentences of arbitrary length

• How?

• Punchline: Time-Synchronous Beam Search reduces time to linear (a minimal sketch follows this slide)

• Tighter estimates with more features and more complex grammars ran faster and more accurately
– Beam allows a tradeoff of accuracy (search error) and speed
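A minimal sketch of time-synchronous beam pruning, illustrated on a toy tagging search rather than Collins’ parser: at each input position all partial hypotheses compete, and only those within a fixed log-score beam of the best survive, keeping the work per position roughly constant (hence linear overall). The local scores below are invented stand-ins for a real model.

```java
import java.util.ArrayList;
import java.util.List;

public class TimeSynchronousBeam {
    static final class Hyp {
        final String tags; final double logScore;
        Hyp(String tags, double logScore) { this.tags = tags; this.logScore = logScore; }
    }

    // Stand-in for real model scores (acoustic/tagging/grammar conditioning).
    static double localScore(String word, String tag) {
        return -(Math.abs(word.hashCode() % 7) + tag.length()) / 10.0;
    }

    public static void main(String[] args) {
        String[] words = {"prices", "rose", "sharply", "today"};
        String[] tagSet = {"NN", "NNS", "VBD", "VBP", "RB"};
        double beam = 0.5; // width in log units; wider = slower but fewer search errors

        List<Hyp> hyps = new ArrayList<>();
        hyps.add(new Hyp("", 0.0));
        for (String word : words) {
            // Extend every surviving hypothesis with every tag (one time-synchronous step).
            List<Hyp> extended = new ArrayList<>();
            double best = Double.NEGATIVE_INFINITY;
            for (Hyp h : hyps) {
                for (String tag : tagSet) {
                    double score = h.logScore + localScore(word, tag);
                    extended.add(new Hyp(h.tags + " " + tag + "/" + word, score));
                    best = Math.max(best, score);
                }
            }
            // Prune to hypotheses within the beam of the best at this position.
            hyps = new ArrayList<>();
            for (Hyp h : extended) if (h.logScore >= best - beam) hyps.add(h);
            System.out.println("after '" + word + "': " + hyps.size() + " hypotheses survive");
        }
    }
}
```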


Completeness & Dialogue

• Collins’ parser is not complete in the usual sense
• But neither are humans (e.g. garden paths)
• Syntactic features alone don’t determine structure
– Humans can’t parse without context, semantics, etc.
– Even phone or phoneme detection is very challenging, especially in a noisy environment
– Top-down expectations and knowledge of likely bottom-up combinations prune the vast search space online
– Question is how to combine it with other factors

• Next steps: semantics, pragmatics & dialogue


Conclusions

• Need ranking of hypotheses for applications

• Beam can reduce processing time to linear
– need good statistics to do this

• More linguistic features are better for statistical models
– can induce the relevant ones and their weights from data
– linguistic rules emerge from these generalizations

• Using acoustic / word / tag / syntax graphs allows the propagation of uncertainty
– ideal is totally online (the model is compatible with this)
– approximation allows simpler modules to do first pruning