Transcript of March 12 slides (natarajan)
Introduction HMM Window Based MaxEnt CRF Summary References
Machine Learning for Sequential Data: A Review
MD2K Reading Group, March 12, 2015
Classical Supervised Learning
• Given a train set {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
• x – features, independent variables, scalar or vector, i.e. |x| ≥ 1; y ∈ Y – label/class, dependent variable, scalar, i.e. |y| = 1
• Learn a model h ∈ H such that y = h(x)
• Example: character classification; x – image of a handwritten character, y ∈ {A, B, ..., Z}
[Figure: label nodes y_1, ..., y_{t−1}, y_t, y_{t+1}, ..., y_n, each with its own feature node x_1, ..., x_n]
Sequential Supervised Learning (SSL)
• Given a train set {(x_{1,n}, y_{1,n}), (x_{2,n}, y_{2,n}), ..., (x_{l,n}, y_{l,n})}
• l – training instances, each of length n (instances need not all have the same length, i.e. n can vary)
• x – features, independent variables, scalar or vector; y ∈ Y – labels/classes, dependent variables
• Learn a model h ∈ H such that y_l = h(x_l)
• SSL is different from time-series prediction and sequence classification
• Leverage sequential patterns and interactions (solid lines – left to right; dotted – right to left)
• Example: POS tagging; x – 'the dog saw a cat' (English sentence), y = {D, N, V, D, N}
[Figure: label sequence y_{l,1}, ..., y_{l,t−1}, y_{l,t}, y_{l,t+1}, ..., y_{l,n} with feature nodes x_{l,1}, ..., x_{l,n}]
1 Hidden Markov Models
2 Window based Approaches
3 Maximum Entropy Models
4 Conditional Random Fields
Hidden Markov Models (HMM)
p(y|x) = p(x|y) × p(y) / p(x)                                   (Bayes' rule, single class)
       ∝ p(x|y) × p(y)                                          (since p(x) is the same across all classes)
       = p(x, y)
       = p(x_1|x_2, ..., x_n, y) × p(x_2|x_3, ..., x_n, y) × ... × p(y)   (chain rule)
       = p(x_1|y) × p(x_2|y) × ... × p(y)                       (Naive Bayes assumption)
       = p(y) × ∏_{i=1}^{n} p(x_i|y)                            (Naive Bayes model, single class)

For a whole sequence (x, y are vectors):

p(y|x) ∝ ∏_{i=1}^{n} p(y_i) × p(x_i|y_i)                        (predict whole sequence)
       = ∏_{i=1}^{n} p(y_i|y_{i−1}) × p(x_i|y_i)                (first-order Markov property; tack on y_0)

p(x) = Σ_{y∈Y} ∏_{i=1}^{n} p(y_i|y_{i−1}) × p(x_i|y_i)          (Y – all possible label sequences y)
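The derivation above can be checked numerically on a toy HMM by brute-force enumeration of all label sequences. A minimal sketch; the states, symbols, and probability values below are illustrative, not from the slides:

```python
from itertools import product

# Toy first-order HMM: 2 hidden labels, 2 observation symbols (illustrative values).
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}                      # p(y_1)
trans = {"A": {"A": 0.7, "B": 0.3},               # p(y_i | y_{i-1})
         "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.9, "o": 0.1},                # p(x_i | y_i)
        "B": {"x": 0.2, "o": 0.8}}

def joint(x, y):
    """p(x, y) = prod_i p(y_i | y_{i-1}) * p(x_i | y_i), with p(y_1 | y_0) = p(y_1)."""
    p = start[y[0]] * emit[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[y[i - 1]][y[i]] * emit[y[i]][x[i]]
    return p

def posterior(x):
    """p(y | x): sum the joint over all |states|^n label sequences to get p(x)."""
    scores = {y: joint(x, y) for y in product(states, repeat=len(x))}
    z = sum(scores.values())                      # p(x), the sum on the slide
    return {y: s / z for y, s in scores.items()}

x = ("x", "o", "x")
post = posterior(x)
best = max(post, key=post.get)                    # most likely label sequence
assert abs(sum(post.values()) - 1.0) < 1e-9       # posteriors normalize
```

The enumeration is exponential in n; real HMM implementations replace it with the forward and Viterbi recursions.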
HMM (contd...)
[Figure: HMM as a chain y_1 → ... → y_{t−1} → y_t → y_{t+1} → ... → y_n, with each y_t emitting x_t]
• HMMs are generative models, i.e. they model the joint probability p(x, y)
• Predicts the whole sequence
• Models only the first-order Markov property, which is not suitable for many real-world applications
• x_t only influences y_t; cannot model dependencies like p(x_t|y_{t−1}, y_t, y_{t+1}), which would mean x_t influences {y_{t−1}, y_t, y_{t+1}}
Sliding Window Approach
• Sliding windows consider a window of features to make each decision, e.g. y_t is predicted from x_{t−1}, x_t, x_{t+1}
• Predicts a single class at a time
• Can use any existing supervised learning algorithm without modification, e.g. SVM, logistic regression, etc.
• Cannot model dependencies between the y labels (either short- or long-range)
[Figure: sliding window – each y_t is predicted from the window x_{t−1}, x_t, x_{t+1}]
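The windowing step reduces SSL to standard classification: each position becomes one training row. A minimal sketch; the padding token and window width are illustrative choices:

```python
def window_features(xs, w=1, pad="<pad>"):
    """For each position t, build the feature tuple
    (x_{t-w}, ..., x_t, ..., x_{t+w}); positions past either end are padded."""
    padded = [pad] * w + list(xs) + [pad] * w
    return [tuple(padded[t:t + 2 * w + 1]) for t in range(len(xs))]

sent = ["the", "dog", "saw", "a", "cat"]
feats = window_features(sent, w=1)
# feats[0] == ("<pad>", "the", "dog"); each row can feed any standard
# classifier (SVM, logistic regression, ...) to predict y_t independently.
```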
Recurrent Sliding Window Approach
• Similar to the sliding window approach
• Models short-range dependencies by using the previous decision (y_{t−1}) when making the current decision (y_t)
• Problem: y values are needed at both training and test time; at test time the model must rely on its own previous predictions
[Figure: recurrent sliding window – each y_t is predicted from its feature window plus the previous decision y_{t−1}]
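The train/test asymmetry can be made concrete: at training time y_{t−1} comes from the gold labels, while at test time it must come from the model's own previous prediction. A minimal sketch, with a hypothetical rule-based function standing in for a trained classifier:

```python
def recurrent_features(xs, prev_labels):
    """Feature tuple at t = (x_t, y_{t-1}); y_0 is a start token.
    prev_labels holds gold y at train time, predicted y at test time."""
    ys = ["<start>"] + list(prev_labels[:-1])
    return list(zip(xs, ys))

def greedy_decode(xs, classify):
    """At test time, feed each prediction back in as the next y_{t-1}."""
    preds = []
    prev = "<start>"
    for x in xs:
        prev = classify(x, prev)
        preds.append(prev)
    return preds

# Hypothetical stand-in for a trained model: determiners, then nouns after them.
tags = greedy_decode(["the", "dog", "saw"],
                     lambda x, prev: "D" if x in ("the", "a") else
                                     ("N" if prev == "D" else "V"))
# tags == ["D", "N", "V"]
```

Because errors in y_{t−1} propagate forward, this greedy feedback loop is the usual source of error accumulation in recurrent sliding windows.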
Maximum Entropy Model (MaxEnt)
• Based on the Principle of Maximum Entropy (Jaynes, 1957): if only incomplete information about a probability distribution is available, the only unbiased assumption is the distribution that is as uniform as possible given the available information
• Uniform distribution – maximum entropy (primal problem)
• Model the available information – expressed as constraints over the training data (dual problem)
• Discriminative model, i.e. models p(y|x)
• Predicts a single class
MaxEnt (contd...)
I. Model the known (dual problem)

Train set = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}   (given)

p̃(x, y) = (1/N) × number of times (x, y) occurs in the train set   (empirical joint probability table)

f_i(x, y) = 1 if y = k AND x = x_k, and 0 otherwise
(e.g. y = physical activity AND x = HR ≥ 110 bpm; 1 ≤ i ≤ m, m – number of features)

Ẽ(f_i) = Σ_{x,y} p̃(x, y) × f_i(x, y)   (expected value of f_i under the empirical distribution)

E(f_i) = Σ_{x,y} p(x, y) × f_i(x, y)   (expected value of f_i under the model distribution)
       = Σ_{x,y} p(y|x) × p(x) × f_i(x, y)
       ≈ Σ_{x,y} p(y|x) × p̃(x) × f_i(x, y)   (replace p(x) with p̃(x))

so we only need to learn the conditional probability, as opposed to the joint probability.
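The empirical expectation Ẽ(f_i) is just a count-weighted sum over the training pairs. A minimal sketch using the heart-rate/activity indicator feature from the slide, with made-up data:

```python
from collections import Counter

# Toy train set of (x, y) pairs; x is a heart-rate bucket (illustrative data).
train = [("high", "active"), ("high", "active"),
         ("low", "rest"), ("high", "rest")]

def f(x, y):
    """Indicator feature: fires when y == "active" AND the HR bucket is "high"."""
    return 1.0 if (y == "active" and x == "high") else 0.0

N = len(train)
p_emp = {pair: c / N for pair, c in Counter(train).items()}   # empirical p~(x, y)

# Empirical expectation: sum over (x, y) of p~(x, y) * f(x, y)
E_emp = sum(p * f(x, y) for (x, y), p in p_emp.items())
# E_emp == 0.5: the feature fires on 2 of the 4 training pairs.
```

Training a MaxEnt model then amounts to choosing λ so that the model expectation E(f_i) matches this empirical value.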
MaxEnt (contd...)

Ẽ(f_i) = E(f_i)   (constraints)
Σ_{x,y} p̃(x, y) × f_i(x, y) = Σ_{x,y} p(y|x) × p̃(x) × f_i(x, y)
(goal: find the best conditional probability p*(y|x))

II. Make zero assumptions about the unknown (primal problem)

H(y|x) = −Σ_{(x,y)∈(X×Y)} p̃(x) × p(y|x) × log p(y|x)   (conditional entropy)

III. Objective function and Lagrange multipliers

Λ(p(y|x), λ) = H(y|x) + Σ_{i=1}^{m} λ_i (E(f_i) − Ẽ(f_i)) + λ_{m+1} (Σ_{y∈Y} p(y|x) − 1)   (objective function)

p*_λ(y|x) = (1/Z_λ(x)) × exp(Σ_{i=1}^{m} λ_i f_i(x, y))
(the conditional distribution that maximizes the entropy subject to the constraints)

p*_λ(y_t|y_{t−1}, x) = (1/Z_λ(y_{t−1}, x)) × exp(Σ_{i=1}^{m} λ_i f_i(x, y))
(inducing the Markov property yields the Maximum Entropy Markov Model (MEMM))
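The solution form p*_λ(y|x) is a log-linear (softmax) model over weighted feature sums. A minimal sketch with two hypothetical indicator features and hand-picked weights λ (a trained model would fit λ to the constraints instead):

```python
from math import exp

labels = ["rest", "active"]

def features(x, y):
    """Two hypothetical indicator features on (x, y)."""
    return [1.0 if (y == "active" and x == "high") else 0.0,
            1.0 if (y == "rest" and x == "low") else 0.0]

lam = [1.5, 0.8]   # hypothetical weights, one per feature

def p_lambda(y, x):
    """p_lambda(y | x) = exp(sum_i lam_i * f_i(x, y)) / Z_lambda(x)."""
    score = lambda yy: exp(sum(l * f for l, f in zip(lam, features(x, yy))))
    return score(y) / sum(score(yy) for yy in labels)   # denominator is Z_lambda(x)

# "active" gets more mass for x == "high" because its feature fires:
assert p_lambda("active", "high") > p_lambda("rest", "high")
```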
Conditional Random Fields (CRF)
• Discriminative model, i.e. models p(y|x)
• The conditional probability p(y|x) is modeled as a product of factors Ψ_C(x_C, y_C)
• Factors have a log-linear representation: Ψ_C(x_C, y_C) = exp(λ_C × φ_C(x_C, y_C))
• Predicts the whole sequence

p(y|x) = (1/Z(x)) × ∏_{C} Ψ_C(x_C, y_C)   (CRF general form; the product runs over the cliques C of the graph)
Linear Chain CRF
[Figure: linear-chain CRF over y_1, ..., y_n and x_1, ..., x_n, with observation factors φ_F on each (y_t, x_t) pair and transition factors φ_T on each (y_{t−1}, y_t) pair]
p(y_t|x_t) = (1/Z(x)) × exp(λ_F × φ_F(y_t, x_t) + λ_T × φ_T(y_t, y_{t−1}))   (individual prediction)

p(y|x) = (1/Z(x)) × ∏_{i=1}^{n} exp(λ_F × φ_F(y_i, x_i) + λ_T × φ_T(y_i, y_{i−1}))   (predict whole sequence; tack on y_0)

p(y|x) = (1/Z(x)) × ∏_{i=1}^{n} exp(Σ_{j=1}^{k} λ_j × φ_j(y_i, y_{i−1}, x_i))   (general form of linear-chain CRFs)
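The general linear-chain form can be checked on a toy example by computing Z(x) through brute-force enumeration of all label sequences (feasible only for tiny |Y|^n; real implementations use the forward algorithm). The feature definitions and weights below are illustrative:

```python
from itertools import product
from math import exp

labels = ["D", "N"]

def score(y_prev, y, x):
    """Per-position weighted feature sum: lam_F * phi_F + lam_T * phi_T (illustrative)."""
    phi_F = 1.0 if (y == "D" and x in ("the", "a")) else 0.0   # observation feature
    phi_T = 1.0 if (y_prev, y) == ("D", "N") else 0.0          # transition feature
    return 2.0 * phi_F + 1.0 * phi_T                           # lam_F = 2, lam_T = 1

def unnorm(x, y):
    """exp of the summed per-position scores; y_0 is a start symbol "<s>"."""
    return exp(sum(score(yp, yc, xi)
                   for yp, yc, xi in zip(("<s>",) + y, y, x)))

def p(x, y):
    z = sum(unnorm(x, yy) for yy in product(labels, repeat=len(x)))  # Z(x)
    return unnorm(x, y) / z

x = ("the", "dog")
dist = {yy: p(x, yy) for yy in product(labels, repeat=2)}
assert abs(sum(dist.values()) - 1.0) < 1e-9
# ("D", "N") scores highest: phi_F fires at "the" and the D -> N transition fires.
```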
CRF (contd...)
[Figure: CRF over y_1, ..., y_n and x_1, ..., x_n with additional long-range factors φ_1 and φ_2 connecting y_t to distant nodes]
p(y_t|x, y_{1:t−1}) = (1/Z(x)) × exp(λ_1 × φ_1(y_t, x_t) + λ_2 × φ_2(y_t, y_{t−1}) +
λ_3 × φ_3(y_t, x_2) + λ_4 × φ_4(y_t, x_{t−1}) + λ_5 × φ_5(y_t, x_{t+1}) +
λ_6 × φ_6(y_t, y_1))   (additional features; these induce loops)
Figure: Sample CRF
For further reading, refer to [3, 4, 2, 1].
Model Space
Figure: Graphical models for sequential data [4]
For further reading, refer to [3, 4, 2, 1].
[1] Berger, A. A brief maxent tutorial. www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
[2] Blake, A., Kohli, P., and Rother, C. Markov Random Fields for Vision and Image Processing. MIT Press, 2011.
[3] Dietterich, T. G. Machine learning for sequential data: A review. In Structural, Syntactic, and Statistical Pattern Recognition. Springer, 2002, pp. 15–30.
[4] Klinger, R., and Tomanek, K. Classical probabilistic models and conditional random fields. TU, Algorithm Engineering, 2007.