Transcript: Inference and Representation (lecture slides)
people.csail.mit.edu/dsontag/courses/inference14/slides/...

Page 1

Inference and Representation

David Sontag

New York University

Lecture 1, September 2, 2014

Page 2

One of the most exciting advances in machine learning (AI, signal processing, coding, control, . . .) in recent decades

Page 3

How can we gain global insight based on local observations?

Page 4

Key idea

1 Represent the world as a collection of random variables X1, . . . , Xn with joint distribution p(X1, . . . , Xn)

2 Learn the distribution from data

3 Perform “inference” (compute conditional distributions p(Xi | X1 = x1, . . . , Xm = xm))

Page 5

Reasoning under uncertainty

As humans, we are continuously making predictions under uncertainty

Classical AI and ML research ignored this phenomenon

Many of the most recent advances in technology are possible because of this new, probabilistic, approach

Page 6

Applications: Deep question answering

Page 7

Applications: Machine translation

Page 8

Applications: Speech recognition

Page 9

Applications: Stereo vision

Input: two images; output: disparity

Page 10

Key challenges

1 Represent the world as a collection of random variables X1, . . . , Xn with joint distribution p(X1, . . . , Xn)

How does one compactly describe this joint distribution?
Directed graphical models (Bayesian networks)
Undirected graphical models (Markov random fields, factor graphs)

2 Learn the distribution from data

Maximum likelihood estimation. Other estimation methods?
How much data do we need?
How much computation does it take?

3 Perform “inference” (compute conditional distributions p(Xi | X1 = x1, . . . , Xm = xm))

Page 11

Syllabus overview

We will study Representation, Inference & Learning

First in the simplest case

Only discrete variables
Fully observed models
Exact inference & learning

Then generalize

Continuous variables
Partially observed data during learning (hidden variables)
Approximate inference & learning

Learn about algorithms, theory & applications

Page 12

Logistics: class

Class webpage: http://cs.nyu.edu/~dsontag/courses/inference14/

Sign up for mailing list!

Book: Machine Learning: a Probabilistic Perspective by Kevin Murphy, MIT Press (2012)

Required readings for each lecture posted to course website.
A good optional reference is Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman, MIT Press (2009)

Office hours: Tuesdays 10:30-11:30am. 715 Broadway, 12th floor, Room 1204

Lab: Thursdays, 5:10-6:00pm in Silver Center 401

Instructor: Yacine Jernite ([email protected])
Required attendance; no exceptions.

Grader: Prasoon Goyal ([email protected])

Page 13

Logistics: prerequisites & grading

Prerequisite: DS-GA-1003/CSCI-GA.2567 (Machine Learning and Computational Statistics)
Exceptions to the prerequisite must be confirmed by me (via email), and are only likely to be granted to PhD students

Grading: problem sets (55%) + in-class midterm exam (20%) + in-class final exam (20%) + participation (5%)

Class attendance is required.
7-8 assignments (every 1–2 weeks). Both theory and programming.
First homework out today, due Monday Sept. 15 at 10pm (via email)
Important: See collaboration policy on class webpage

Solutions to the theoretical questions require formal proofs.

For the programming assignments, I recommend Python (Java or Matlab OK too). Do not use C++.

Page 14

Example: Medical diagnosis

Variable for each symptom (e.g. “fever”, “cough”, “fast breathing”, “shaking”, “nausea”, “vomiting”)

Variable for each disease (e.g. “pneumonia”, “flu”, “common cold”, “bronchitis”, “tuberculosis”)

Diagnosis is performed by inference in the model:

p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

One famous model, Quick Medical Reference (QMR-DT), has 600 diseases and 4000 findings

Page 15

Representing the distribution

Naively, could represent multivariate distributions with a table of probabilities for each outcome (assignment)

How many outcomes are there in QMR-DT? 2^4600

Estimation of joint distribution would require a huge amount of data

Inference of conditional probabilities, e.g.

p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

would require summing over exponentially many variables’ values

Moreover, defeats the purpose of probabilistic modeling, which is to make predictions with previously unseen observations

Page 16

Structure through independence

If X1, . . . ,Xn are independent, then

p(x1, . . . , xn) = p(x1)p(x2) · · · p(xn)

2^n entries can be described by just n numbers (if |Val(Xi)| = 2)!

However, this is not a very useful model – observing a variable Xi cannot influence our predictions of Xj

If X1, . . . , Xn are conditionally independent given Y, denoted as Xi ⊥ X−i | Y, then

p(y, x1, . . . , xn) = p(y) p(x1 | y) ∏_{i=2}^{n} p(xi | x1, . . . , xi−1, y)

                    = p(y) p(x1 | y) ∏_{i=2}^{n} p(xi | y).

This is a simple, yet powerful, model

Page 17

Example: naive Bayes for classification

Classify e-mails as spam (Y = 1) or not spam (Y = 0)
Let 1 : n index the words in our vocabulary (e.g., English)
Xi = 1 if word i appears in an e-mail, and 0 otherwise
E-mails are drawn according to some distribution p(Y, X1, . . . , Xn)

Suppose that the words are conditionally independent given Y . Then,

p(y, x1, . . . , xn) = p(y) ∏_{i=1}^{n} p(xi | y)

Estimate the model with maximum likelihood. Predict with:

p(Y = 1 | x1, . . . , xn) = p(Y = 1) ∏_{i=1}^{n} p(xi | Y = 1) / ∑_{y∈{0,1}} p(Y = y) ∏_{i=1}^{n} p(xi | Y = y)

Are the independence assumptions made here reasonable?

Philosophy: Nearly all probabilistic models are “wrong”, but many are nonetheless useful
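As a concrete illustration of the estimate-then-predict recipe above, here is a minimal Python sketch for the spam example. The toy arrays X and y are made up, and a small Laplace smoothing term is added (a slight departure from plain maximum likelihood) so that unseen words do not zero out the product.

import numpy as np

# Toy binary document-word matrix: X[d, i] = 1 if word i appears in e-mail d.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]])
y = np.array([1, 1, 0, 0])      # 1 = spam, 0 = not spam

def fit(X, y, smooth=1.0):
    # Estimate p(Y = c) and p(X_i = 1 | Y = c) by (smoothed) maximum likelihood.
    p_y = np.array([(y == c).mean() for c in (0, 1)])
    p_x1_given_y = np.array([
        (X[y == c].sum(axis=0) + smooth) / ((y == c).sum() + 2 * smooth)
        for c in (0, 1)
    ])
    return p_y, p_x1_given_y

def p_spam(x, p_y, p_x1_given_y):
    # Bayes' rule: p(Y=1 | x) = p(Y=1) prod_i p(x_i | Y=1) / sum_y p(Y=y) prod_i p(x_i | Y=y)
    likelihood = np.array([
        np.prod(np.where(x == 1, p_x1_given_y[c], 1 - p_x1_given_y[c]))
        for c in (0, 1)
    ])
    joint = p_y * likelihood
    return joint[1] / joint.sum()

p_y, p_x1_given_y = fit(X, y)
print(p_spam(np.array([1, 1, 1, 0]), p_y, p_x1_given_y))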

Page 18

Bayesian networks
Reference: Chapter 10

A Bayesian network is specified by a directed acyclic graph G = (V, E) with:

1 One node i ∈ V for each random variable Xi

2 One conditional probability distribution (CPD) per node, p(xi | xPa(i)), specifying the variable’s probability conditioned on its parents’ values

Corresponds 1-1 with a particular factorization of the joint distribution:

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

Powerful framework for designing algorithms to perform probability computations

Enables use of prior knowledge to specify (part of) model structure

Page 19

Example

Consider the following Bayesian network:

[Figure: the “student” Bayesian network (Difficulty → Grade ← Intelligence, Intelligence → SAT, Grade → Letter), with CPDs:]

p(D):   d0 = 0.6, d1 = 0.4
p(I):   i0 = 0.7, i1 = 0.3

p(G | I, D):        g1     g2     g3
        i0, d0      0.3    0.4    0.3
        i0, d1      0.05   0.25   0.7
        i1, d0      0.9    0.08   0.02
        i1, d1      0.5    0.3    0.2

p(S | I):           s0     s1
        i0          0.95   0.05
        i1          0.2    0.8

p(L | G):           l0     l1
        g1          0.1    0.9
        g2          0.4    0.6
        g3          0.99   0.01

What is its joint distribution?

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
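A minimal Python sketch of this computation, storing the CPDs above as dictionaries and multiplying the relevant entries to get the probability of a full assignment; the encoding is an illustrative choice, not from the slides.

# CPDs of the student network above, stored as plain dictionaries
# (values are read off the tables on this slide).
p_D = {'d0': 0.6, 'd1': 0.4}
p_I = {'i0': 0.7, 'i1': 0.3}
p_G = {('i0', 'd0'): {'g1': 0.3,  'g2': 0.4,  'g3': 0.3},
       ('i0', 'd1'): {'g1': 0.05, 'g2': 0.25, 'g3': 0.7},
       ('i1', 'd0'): {'g1': 0.9,  'g2': 0.08, 'g3': 0.02},
       ('i1', 'd1'): {'g1': 0.5,  'g2': 0.3,  'g3': 0.2}}
p_S = {'i0': {'s0': 0.95, 's1': 0.05},
       'i1': {'s0': 0.2,  's1': 0.8}}
p_L = {'g1': {'l0': 0.1,  'l1': 0.9},
       'g2': {'l0': 0.4,  'l1': 0.6},
       'g3': {'l0': 0.99, 'l1': 0.01}}

def joint(d, i, g, s, l):
    # p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
    return p_D[d] * p_I[i] * p_G[(i, d)][g] * p_S[i][s] * p_L[g][l]

# e.g. an intelligent student, easy class, grade A, high SAT, strong letter:
print(joint('d0', 'i1', 'g1', 's1', 'l1'))   # 0.6 * 0.3 * 0.9 * 0.8 * 0.9 = 0.11664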

Page 20

More examples

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

Will my car start this morning?

Heckerman et al., Decision-Theoretic Troubleshooting, 1995

Page 21

More examples

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

What is the differential diagnosis?

Beinlich et al., The ALARM Monitoring System, 1989

Page 22

Bayesian networks are generative models

[Figure: the naive Bayes network, a label Y with a directed edge to each feature X1, X2, X3, . . . , Xn]

Evidence is denoted by shading in a node

Can interpret Bayesian network as a generative process. For example, to generate an e-mail, we

1 Decide whether it is spam or not spam, by sampling y ∼ p(Y)
2 For each word i = 1 to n, sample xi ∼ p(Xi | Y = y)

Page 23

Bayesian network structure implies conditional independencies!

[Figure: the student network from the previous example (Difficulty, Intelligence, Grade, SAT, Letter)]

The joint distribution corresponding to the above BN factors as

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

However, by the chain rule, any distribution can be written as

p(d, i, g, s, l) = p(d) p(i | d) p(g | i, d) p(s | i, d, g) p(l | d, i, g, s)

Thus, we are assuming the following additional independencies: D ⊥ I, S ⊥ {D, G} | I, L ⊥ {I, D, S} | G. What else?

Page 24

Bayesian network structure implies conditional independencies!

Generalizing the above arguments, we obtain that a variable is independent from its non-descendants given its parents

Common parent – fixing B decouples A and C

[Figure: common parent structure A ← B → C]

Cascade – knowing B decouples A and C

[Figure: cascade structure A → B → C]

V-structure – Knowing C couples A and B

[Figure: v-structure A → C ← B]

This important phenomenon is called explaining away and is what makes Bayesian networks so powerful

Page 25

A simple justification (for common parent)

[Figure: common parent structure A ← B → C]

We’ll show that p(A, C | B) = p(A | B) p(C | B) for any distribution p(A, B, C) that factors according to this graph structure, i.e.

p(A, B, C) = p(B) p(A | B) p(C | B)

Proof.

p(A, C | B) = p(A, B, C) / p(B) = p(A | B) p(C | B)
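As a quick numerical sanity check of this claim (an illustrative sketch, not part of the slides), one can build a random distribution that factors as p(B) p(A | B) p(C | B) and verify the conditional independence directly:

import numpy as np

rng = np.random.default_rng(0)

def random_dist(shape):
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)     # normalize over the first axis

p_B = random_dist((2,))                          # p(b)
p_A_given_B = random_dist((2, 2))                # p(a | b), indexed [a, b]
p_C_given_B = random_dist((2, 2))                # p(c | b), indexed [c, b]

# joint p(a, b, c) = p(b) p(a | b) p(c | b)
p_ABC = np.einsum('b,ab,cb->abc', p_B, p_A_given_B, p_C_given_B)

for b in range(2):
    p_AC_given_b = p_ABC[:, b, :] / p_ABC[:, b, :].sum()        # p(a, c | b)
    outer = np.outer(p_A_given_B[:, b], p_C_given_B[:, b])      # p(a | b) p(c | b)
    assert np.allclose(p_AC_given_b, outer)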

Page 26

D-separation (“directed separated”) in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation

Look to see if there is an active path between X and Z when variables Y are observed:

[Figure: three-node trails between X and Z through Y, cases (a) and (b)]

Page 27

D-separation (“directed separated”) in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation

Look to see if there is an active path between X and Z when variables Y are observed:

[Figure: three-node trails between X and Z through Y, cases (a) and (b)]

Page 28

D-separation (“directed separated”) in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation

Look to see if there is an active path between X and Z when variables Y are observed:

[Figure: three-node trails between X and Z through Y, cases (a) and (b)]

If no such path, then X and Z are d-separated with respect to Y

d-separation reduces statistical independencies (hard) to connectivity in graphs (easy)

Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query
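A minimal Python sketch of such a check, following the standard "find all nodes reachable from X via active trails given Y" procedure (as in Koller & Friedman); the dictionary graph encoding and the student-network usage example are illustrative assumptions, not from the slides.

def d_separated(parents, x, z, observed):
    # `parents` maps each node to the list of its parents; `observed` is the set Y.
    children = {n: [] for n in parents}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)

    # Phase 1: the observed nodes and all of their ancestors.
    ancestors = set()
    frontier = list(observed)
    while frontier:
        node = frontier.pop()
        if node not in ancestors:
            ancestors.add(node)
            frontier.extend(parents[node])

    # Phase 2: search for an active trail from x, tracking the direction in
    # which each node is entered ('up' = from a child, 'down' = from a parent).
    visited = set()
    frontier = [(x, 'up')]
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == z and node not in observed:
            return False                              # an active trail reaches z
        if direction == 'up' and node not in observed:
            for p in parents[node]:
                frontier.append((p, 'up'))            # cascade continuing upward
            for c in children[node]:
                frontier.append((c, 'down'))          # common parent: turn back down
        elif direction == 'down':
            if node not in observed:
                for c in children[node]:
                    frontier.append((c, 'down'))      # cascade continuing downward
            if node in ancestors:
                for p in parents[node]:
                    frontier.append((p, 'up'))        # v-structure activated by evidence
    return True

# Hypothetical usage on the student network from earlier slides:
student = {'D': [], 'I': [], 'G': ['D', 'I'], 'S': ['I'], 'L': ['G']}
print(d_separated(student, 'D', 'S', set()))    # True: the v-structure at G blocks the trail
print(d_separated(student, 'D', 'S', {'G'}))    # False: observing G couples D and S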

Page 29

D-separation example 1

[Figure: a Bayesian network over X1, . . . , X6]

Is X6 ⊥ X5 | X2,X3? Is X4 ⊥ X5 | X2,X3?

Page 30

D-separation example 2

[Figure: a Bayesian network over X1, . . . , X6]

Is X4 ⊥ X5 | X1, X6? What about if X6 is not observed? I.e., is X4 ⊥ X5 | X1?

Page 31

Independence maps

Let I(G) be the set of all conditional independencies implied by the directed acyclic graph (DAG) G

Let I(p) denote the set of all conditional independencies that hold for the joint distribution p.

A DAG G is an I-map (independence map) of a distribution p if I(G) ⊆ I(p)

A fully connected DAG G is an I-map for any distribution, since I(G) = ∅ ⊆ I(p) for all p

G is a minimal I-map for p if the removal of even a single edge makes it not an I-map

A distribution may have several minimal I-maps
Each corresponds to a specific node ordering

G is a perfect map (P-map) for distribution p if I(G) = I(p)

Page 32

Equivalent structures

Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions)

Which of these are equivalent?

[Figure: four candidate structures over X, Y, and Z, labeled (a)-(d)]

Page 33

Equivalent structures

Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions)

Are these equivalent?

[Figure: two candidate structures over V, W, X, Y, and Z]

Page 34

The 2011 Turing Award (to Judea Pearl) was for Bayesian networks

Page 35

What are some frequently used graphical models?

Page 36

Hidden Markov models

[Figure: HMM graph, a hidden chain Y1 → Y2 → · · · → Y6 with an emission edge Yt → Xt at each step]

Frequently used for speech recognition and part-of-speech tagging

Joint distribution factors as:

p(y, x) = p(y1) p(x1 | y1) ∏_{t=2}^{T} p(yt | yt−1) p(xt | yt)

p(y1) is the distribution for the starting state
p(yt | yt−1) is the transition probability between any two states
p(xt | yt) is the emission probability

What are the conditional independencies here? For example, Y1 ⊥ {Y3, . . . , Y6} | Y2

Page 37

Hidden Markov models

[Figure: HMM graph, a hidden chain Y1 → Y2 → · · · → Y6 with an emission edge Yt → Xt at each step]

Joint distribution factors as:

p(y, x) = p(y1) p(x1 | y1) ∏_{t=2}^{T} p(yt | yt−1) p(xt | yt)

A homogeneous HMM uses the same parameters (β and α below) for each transition and emission distribution (parameter sharing):

p(y, x) = p(y1) α_{x1,y1} ∏_{t=2}^{T} β_{yt,yt−1} α_{xt,yt}

How many parameters need to be learned?
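A minimal Python sketch of the homogeneous-HMM factorization above, with small hypothetical parameter tables; the closing comment sketches one way to count the free parameters for discrete observations.

import numpy as np

# Hypothetical parameters for a 2-state, 2-symbol homogeneous HMM.
pi = np.array([0.6, 0.4])                  # p(y1)
beta = np.array([[0.7, 0.3],               # beta[y_t, y_{t-1}] = p(y_t | y_{t-1})
                 [0.3, 0.7]])
alpha = np.array([[0.9, 0.2],              # alpha[x_t, y_t] = p(x_t | y_t)
                  [0.1, 0.8]])

def joint_prob(y, x):
    # p(y, x) = p(y1) alpha_{x1,y1} * prod_{t=2}^T beta_{yt,yt-1} alpha_{xt,yt}
    p = pi[y[0]] * alpha[x[0], y[0]]
    for t in range(1, len(y)):
        p *= beta[y[t], y[t - 1]] * alpha[x[t], y[t]]
    return p

print(joint_prob([0, 0, 1], [0, 1, 1]))

# Parameter count with K states and M observation symbols (discrete case):
# (K - 1) initial + K * (K - 1) transition + K * (M - 1) emission free parameters.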

Page 38

Mixture of Gaussians

The N-dim. multivariate normal distribution, N (µ,Σ), has density:

p(x) = 1 / ( (2π)^{N/2} |Σ|^{1/2} ) exp( −(1/2) (x − µ)ᵀ Σ^{−1} (x − µ) )

Suppose we have k Gaussians given by µk and Σk, and a distribution θ over the numbers 1, . . . , k

Mixture of Gaussians distribution p(y, x) given by
1 Sample y ∼ θ (specifies which Gaussian to use)
2 Sample x ∼ N(µy, Σy)
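A minimal numpy sketch of this two-step sampling procedure; the mixture weights θ and the component means and covariances are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.5, 0.3, 0.2])                        # p(y = j)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])]

def sample():
    y = rng.choice(len(theta), p=theta)                  # 1) which Gaussian to use
    x = rng.multivariate_normal(mus[y], sigmas[y])       # 2) x ~ N(mu_y, Sigma_y)
    return y, x

samples = [sample() for _ in range(5)]
print(samples)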

Page 39

Mixture of Gaussians

The marginal distribution over x looks like:

[Figure: a multimodal density over x, a weighted combination of Gaussian bumps, one per mixture component]

Page 40

Latent Dirichlet allocation (LDA)

Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents


Many applications in information retrieval, document summarization, and classification

[Figure (from “Complexity of Inference in Latent Dirichlet Allocation”, Sontag & Roy): topics βt = p(w | z = t) are distributions over words, e.g.
  politics .0100, president .0095, obama .0090, washington .0085, religion .0060
  religion .0500, hindu .0092, judaism .0080, ethics .0075, buddhism .0016
  sports .0105, baseball .0100, soccer .0055, basketball .0050, football .0045
Almost all uses of topic models (e.g., for unsupervised learning, information retrieval, classification) require probabilistic inference: given a new document with words w1, . . . , wN, what is it about, i.e. what is its distribution over topics θ (e.g. weather .50, finance .49, sports .01)?]

LDA is one of the simplest and most widely used topic models

Page 41

Generative model for a document in LDA

1 Sample the document’s topic distribution θ (aka topic vector)

θ ∼ Dirichlet(α1:T )

where the {αt}_{t=1}^{T} are fixed hyperparameters. Thus θ is a distribution over T topics with mean θt = αt / ∑_{t′} αt′

2 For i = 1 to N, sample the topic zi of the i ’th word

zi |θ ∼ θ

3 ... and then sample the actual word wi from the zi ’th topic

wi | zi ∼ βzi

where {βt}_{t=1}^{T} are the topics (a fixed collection of distributions on words)
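A minimal numpy sketch of this generative process for a single document; the tiny vocabulary, topics β, and hyperparameters α are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical topics: beta[t] is a distribution over the (tiny) vocabulary.
vocab = ["politics", "president", "sports", "baseball", "religion"]
beta = np.array([[0.40, 0.40, 0.05, 0.05, 0.10],
                 [0.05, 0.05, 0.45, 0.40, 0.05],
                 [0.10, 0.05, 0.05, 0.05, 0.75]])
alpha = np.array([1.0, 1.0, 1.0])          # Dirichlet hyperparameters

def generate_document(num_words=8):
    theta = rng.dirichlet(alpha)           # 1) document's topic distribution
    words = []
    for _ in range(num_words):
        z = rng.choice(len(alpha), p=theta)        # 2) topic of the i'th word
        w = rng.choice(len(vocab), p=beta[z])      # 3) word drawn from topic z
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document()
print(theta, doc)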

Page 42

Generative model for a document in LDA

1 Sample the document’s topic distribution θ (aka topic vector)

θ ∼ Dirichlet(α1:T )

where the {αt}_{t=1}^{T} are hyperparameters. The Dirichlet density, defined over the simplex ∆ = {θ ∈ R^T : θt ≥ 0 for all t, ∑_{t=1}^{T} θt = 1}, is:

p(θ1, . . . , θT) ∝ ∏_{t=1}^{T} θt^{αt−1}

For example, for T=3 (θ3 = 1− θ1 − θ2):

[Figure: plots of log Pr(θ) over (θ1, θ2) for two different settings of α1 = α2 = α3]
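A small Python sketch evaluating this density for T = 3, including the normalizing constant Γ(∑t αt) / ∏t Γ(αt) that the proportional form above leaves out, so the effect of the hyperparameters can be checked numerically.

import numpy as np
from math import gamma

def dirichlet_pdf(theta, alpha):
    # Normalizing constant: Gamma(sum(alpha)) / prod(Gamma(alpha_t))
    norm = gamma(sum(alpha)) / np.prod([gamma(a) for a in alpha])
    # Unnormalized part, exactly as in the formula above: prod_t theta_t^(alpha_t - 1)
    return norm * np.prod([t ** (a - 1) for t, a in zip(theta, alpha)])

print(dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))     # flat over the simplex (density 2)
print(dirichlet_pdf([0.2, 0.3, 0.5], [10.0, 10.0, 10.0]))  # peaked near theta = (1/3, 1/3, 1/3)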

Page 43

Generative model for a document in LDA

3 ... and then sample the actual word wi from the zi ’th topic

wi |zi ∼ βzi

where {βt}_{t=1}^{T} are the topics (a fixed collection of distributions on words)

[Figure: the topics illustration from page 40: each topic βt = p(w | z = t) is a distribution over words (politics, religion, sports, . . .)]

Page 44

Example of using LDA

[Figure 1 from Blei, “Introduction to Probabilistic Topic Models”: topics are distributions over words (e.g. gene 0.04, dna 0.02, genetic 0.01; life 0.02, evolve 0.01, organism 0.01; brain 0.04, neuron 0.02, nerve 0.01; data 0.02, number 0.02, computer 0.01), shown alongside a document with its topic proportions θd and per-word topic assignments z1d, . . . , zNd drawn from topics β1, . . . , βT. Each document is generated by first choosing a distribution over topics, then, for each word, choosing a topic assignment and choosing the word from the corresponding topic.]

(Blei, Introduction to Probabilistic Topic Models, 2011)

Page 45

“Plate” notation for LDA model

[Figure: plate diagram for LDA:
  α: Dirichlet hyperparameters
  θd: topic distribution for document d (outer plate d = 1 to D)
  zid: topic of word i of doc d (inner plate i = 1 to N)
  wid: word
  β: topic-word distributions]

Variables within a plate are replicated in a conditionally independent manner

Page 46

Comparison of mixture and admixture models

[Figure: two plate diagrams.
  Left (mixture model): prior distribution over topics θ; zd: topic of doc d; wid: word (plate i = 1 to N inside plate d = 1 to D); β: topic-word distributions.
  Right (LDA): α: Dirichlet hyperparameters; θd: topic distribution for document; zid: topic of word i of doc d; wid: word; β: topic-word distributions]

Model on left is a mixture model
Called multinomial naive Bayes (a word can appear multiple times)
Document is generated from a single topic

Model on right (LDA) is an admixture model
Document is generated from a distribution over topics

Page 47

Summary

Bayesian networks given by (G, P) where P is specified as a set of local conditional probability distributions associated with G’s nodes

One interpretation of a BN is as a generative model, where variables are sampled in topological order

Local and global independence properties identifiable via d-separation criteria

The probability of any assignment is obtained by multiplying CPDs

Bayes’ rule is used to compute conditional probabilities
Marginalization or inference is often computationally difficult

Examples (will show up again): naive Bayes, hidden Markov models, latent Dirichlet allocation
