
DT2118

Speech and Speaker Recognition
Lecture 2: Probability, Statistics and Pattern Recognition

Giampiero Salvi

KTH/CSC/TMH giampi@kth.se

VT 2012


Outline

Probability Theory

Pattern Recognition (Bayes Decision Theory)

Estimation Theory
- Maximum Likelihood methods
- Bayesian methods
- Neural networks

Unsupervised Learning
- Vector Quantisation, K-Means
- Probabilistic Clustering

Classification and Regression Trees


Different views on probabilities

Axiomatic defines axioms and derives properties

Classical number of ways something can happen over the total number of things that can happen (e.g. dice)

Logical same, but weight the different ways

Frequency frequency of success in repeated experiments

Propensity physical tendency of a system to produce an outcome

Subjective degree of belief


Axiomatic view on probabilities (Kolmogorov)

Given an event E in an event space F:

1. P(E) ≥ 0 for all E ∈ F
2. Sure event Ω: P(Ω) = 1
3. For a countable sequence of pairwise disjoint events E1, E2, . . . :

P(E1 ∪ E2 ∪ · · · ) = Σ_{i=1}^{∞} P(Ei)

[Venn diagram: the union E1 ∪ E2 ∪ · · · of the disjoint events E1, E2, · · ·]


Consequences

1. Monotonicity: P(A) ≤ P(B) if A ⊆ B

[Venn diagram: A contained in B]

2. Empty set ∅: P(∅) = 0

3. Bounds: 0 ≤ P(E) ≤ 1 for all E ∈ F


More Consequences: Addition

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

[Venn diagram: overlapping events A and B, showing A ∪ B and A ∩ B]

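As a quick sanity check of the addition rule, here is a minimal Python sketch that enumerates the sample space of a fair die (the running example from the classical view); the events A and B are chosen arbitrarily:

```python
from fractions import Fraction

# Sample space of a fair die; every outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
P = lambda E: Fraction(len(E & omega), len(omega))

A = {1, 2, 3}   # "at most three"
B = {2, 4, 6}   # "even"

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))  # 5/6
```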

More Consequences: Negation

P(Ā) = P(Ω \ A) = 1 − P(A)

[Venn diagram: Ω split into A and its complement Ω \ A]


Conditional Probabilities

P(A|B)

The probability of event A when we know that event B has happened

Note: different from the probability that event A and event B happen


Conditional Probabilities

P(A|B) ≠ P(A ∩ B)

[Venn diagrams: A and B inside Ω; conditioning on B rescales the space so that B ≡ Ω]

P(A|B) = P(A ∩ B) / P(B)


Bayes’ Rule

if

P(A|B) = P(A ∩ B) / P(B)

then

P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)

and

P(A|B) = P(B|A) P(A) / P(B)

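To make the rule concrete, a small numeric sketch in Python; the prior and likelihood values are made up for illustration:

```python
# Hypothetical numbers, just to exercise Bayes' rule.
P_A = 0.3
P_B_given_A = 0.8
P_B_given_notA = 0.2

# Evidence via the law of total probability.
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

P_A_given_B = P_B_given_A * P_A / P_B
print(round(P_A_given_B, 3))  # 0.632
```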

Bayes’ Rule and Pattern Recognition

A = words, B = sounds:

- During training we know the words and can compute P(sounds|words) using a frequentist approach (repeated observations)
- during recognition we want

words = arg max P(words|sounds)

- using Bayes’ rule:

P(words|sounds) = P(sounds|words) P(words) / P(sounds)

where
P(words): a priori probability of the words (Language Model)
P(sounds): a priori probability of the sounds (constant, can be ignored)

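A toy Python sketch of this arg max decoding; the three-word vocabulary and all scores are invented for illustration, and P(sounds) is dropped since it does not affect the arg max:

```python
import math

# Toy example: pick the word maximising P(sounds|word) * P(word).
acoustic = {"yes": 0.02, "yeast": 0.05, "chess": 0.01}   # P(sounds|word)
prior    = {"yes": 0.70, "yeast": 0.05, "chess": 0.25}   # P(word), language model

best = max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(prior[w]))
print(best)  # 'yes': 0.02*0.70 beats 0.05*0.05 and 0.01*0.25
```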

Discrete vs Continuous variables

Discrete (e.g. a die):
- Discrete events: either 1, 2, 3, 4, 5, or 6
- Discrete probability distribution p(x) = P(d = x)
- P(d = 1) = 1/6 (fair die)

Continuous (e.g. a temperature t):
- Any real number (theoretically infinite values)
- Probability density function (PDF) f(x) (NOT a probability!)
- P(t = 36.6) = 0
- P(36.6 < t < 36.7) = 0.1


Gaussian distributions: One-dimensional

f(x|µ, σ²) = N(µ, σ²) = 1/(√(2π) σ) exp[ −(x − µ)² / (2σ²) ]

[Plot: one-dimensional Gaussian density f(x), peaked at x = µ]

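The density can be evaluated directly from the formula; a short numpy sketch, with µ and σ chosen only to roughly match the plotted range:

```python
import numpy as np

def gauss_pdf(x, mu, sigma2):
    """One-dimensional Gaussian density N(mu, sigma2) evaluated at x."""
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma = 2.0, 0.03               # illustrative values
x = np.linspace(1.92, 2.08, 161)    # the range shown in the plot
print(gauss_pdf(mu, mu, sigma**2))  # peak ≈ 13.3: a density, not a probability
```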


Gaussian distributions: One-dimensional (cont.)

[Same plot, annotated: the density value f(x1) = 8.1 can exceed 1, while the area P(x2 < x < x3) = 0.15 is a probability]


Gaussian distributions: d Dimensions

x = (x1, x2, . . . , xd)ᵀ    µ = (µ1, µ2, . . . , µd)ᵀ

Σ = [ σ11 σ12 . . . σ1d
      σ21 . . .
      . . .
      σd1 . . . σdd ]

f(x|µ, Σ) = exp[ −½ (x − µ)ᵀ Σ⁻¹ (x − µ) ] / ( (2π)^{d/2} |Σ|^{1/2} )

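A numpy/scipy sketch that evaluates the d-dimensional density both with scipy.stats.multivariate_normal and directly from the formula above; the numbers are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

# d = 2 example with a full covariance matrix.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, -0.5])

print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))

# Same value straight from the formula on this slide.
d = len(mu)
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
print(np.exp(-0.5 * quad) / ((2 * np.pi)**(d / 2) * np.linalg.det(Sigma)**0.5))
```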

The Probabilistic Model of Classification

- “Nature” assumes one of c states ωj with a priori probability P(ωj)
- When in state ωj, “nature” emits observations x with distribution p(x|ωj)

[Figure: bar plot of a priori probabilities for state1, state2, state3 (values between 0.0 and 0.6), and the corresponding class conditional probability distributions over x]


Problem

- If I observe x and I know P(ωj) and p(x|ωj) for each j
- what can I say about the state of “nature” ωj?


Bayes decision theory

[Figure: a priori probabilities P(ωj) and class conditional distributions p(x|ωj) for state1, state2, state3, combined into posterior probabilities over x]

P(ωj|x) = p(x|ωj) P(ωj) / p(x)


Classifiers: Discriminant Functions

[Diagram: inputs x1, x2, . . . , xd feed discriminant functions d1, d2, . . . , ds; the class decision is c(x) = arg max over the di]

di(x) = p(x|ωi) P(ωi)

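A minimal sketch of such a classifier in Python, with three hypothetical states, Gaussian class conditional densities, and the a priori probabilities as weights; all numbers are invented:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.3, 0.2])                               # P(omega_j)
likelihoods = [norm(-2.0, 1.0), norm(0.0, 0.5), norm(3.0, 2.0)]  # p(x|omega_j)

def classify(x):
    # Discriminant functions d_i(x) = p(x|omega_i) P(omega_i);
    # dividing by p(x) would not change the arg max.
    d = np.array([pdf.pdf(x) for pdf in likelihoods]) * priors
    return int(np.argmax(d)), d / d.sum()   # decision, posterior probabilities

state, post = classify(0.3)
print(state, post.round(3))                 # state 1 wins near x = 0.3
```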

Classifiers: Decision Boundaries


Decision Boundaries in Two Dimensions


Estimation Theory

- so far we assumed we know P(ωj) and p(x|ωj)
- how can we obtain them from collections of data?
- this is the subject of Estimation Theory


Parametric vs Non-Parametric Estimation

[Figure panels: a parametric density estimate vs a non-parametric one]

We only consider parametric methods in this course


Parameter estimation

[Scatter plot: samples from two classes in the (x1, x2) plane]

- ideally: p(x|ωj), i.e. p(x|θj); in reality: p(x|θ̂j) or p(x|D)
- Assumptions:
  - samples from class ωi do not influence the estimate for class ωj, i ≠ j
  - samples from the same class are independent and identically distributed (i.i.d.)


Parameter estimation (cont.)

- class independence assumption:

[Figure: the joint dataset in (x1, x2) splits into one dataset per class, each used for a separate estimate]

- Maximum likelihood estimation
- Bayesian estimation


Maximum likelihood estimation

- Find the parameter vector θ that maximises p(D|θ), with D = {x1, . . . , xn}
- i.i.d. → p(D|θ) = ∏_{k=1}^{n} p(xk|θ)

[Plots: the likelihood p(D|θ) and the log-likelihood as functions of θ, and the resulting density p(x|θ)]
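For a Gaussian, the maximisation has a closed-form solution; a minimal numpy sketch, with true parameters chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=1.5, scale=0.7, size=1000)   # i.i.d. samples

# Maximising log p(D|theta) = sum_k log p(x_k|theta) for a Gaussian
# gives the sample mean and (biased) sample variance in closed form.
mu_ml = D.mean()
var_ml = ((D - mu_ml)**2).mean()
print(mu_ml, np.sqrt(var_ml))   # close to the true (1.5, 0.7)
```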

Overfitting



Overfitting: Phoneme Discrimination


Bayesian estimation

- Consider θ as a random variable
- characterise θ with the posterior distribution p(θ|D) given the data
- using Bayes formula, the posterior can be computed from the likelihood p(D|θ) and the prior p(θ):

p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ

- ML: D → θ    Bayes: D, p(θ) → p(θ|D)

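A sketch of Bayesian estimation for the mean of a Gaussian with known variance, using a Gaussian prior so that the integral stays analytic; all numbers are illustrative:

```python
import numpy as np

sigma2 = 0.5**2              # known data variance
mu0, sigma0_2 = 0.0, 1.0**2  # prior p(mu) = N(mu0, sigma0^2)

rng = np.random.default_rng(1)
D = rng.normal(2.0, np.sqrt(sigma2), size=20)

# Conjugacy gives the posterior p(mu|D) in closed form.
n, xbar = len(D), D.mean()
post_var = 1.0 / (1.0 / sigma0_2 + n / sigma2)
post_mean = post_var * (mu0 / sigma0_2 + n * xbar / sigma2)
print(post_mean, post_var)   # the posterior concentrates near 2.0

# Recursion: p(mu|D) can now serve as the prior for the next batch of data.
```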

Bayesian estimation (cont.)

- we can compute p(x|D) instead of p(x|θ)
- integrate the joint density p(x, θ|D) = p(x|θ) p(θ|D):

p(x|D) = ∫ p(x|θ) p(θ|D) dθ

[Plots: the joint distribution p(x, θ|D) and its integral over θ]



Bayesian estimation (cont.)

Pros:

- better use of the data
- makes a priori assumptions explicit
- easily implemented recursively: use the posterior p(θ|D) as the new prior
- reduces overfitting

Cons:

- definition of noninformative priors can be tricky
- often requires numerical integration
- not widely accepted by traditional statistics (frequentism)


Other Training Strategies: Discriminative Training

- Maximum Mutual Information Estimation
- Minimum Error Rate Estimation
- Neural Networks


Multi layer neural networks

[Diagram: input units x0 (bias unit), x1, x2, . . . , xd; a hidden layer; an output unit; error E]

- Backpropagation algorithm

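A minimal numpy sketch of backpropagation for one hidden layer, one output unit and squared error E, trained on a single sample by gradient descent; the sizes, nonlinearity and learning rate are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input (d = 3); bias handled via b1/b2
t = 1.0                       # target

W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros(4)   # hidden layer
w2, b2 = rng.normal(size=4) * 0.1, 0.0                # output unit

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for step in range(1000):
    # forward pass
    h = sigmoid(W1 @ x + b1)
    y = w2 @ h + b2
    # backward pass (chain rule): dE/dy, then back through each layer
    dy = y - t                        # E = 0.5*(y - t)^2
    dw2, db2 = dy * h, dy
    dh = dy * w2 * h * (1 - h)        # sigmoid'(a) = h*(1 - h)
    dW1, db1 = np.outer(dh, x), dh
    # gradient descent update
    for p, g in ((W1, dW1), (b1, db1), (w2, dw2)):
        p -= 0.1 * g
    b2 -= 0.1 * db2

print(w2 @ sigmoid(W1 @ x + b1) + b2)  # ≈ 1.0 after training
```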

Unsupervised Learning

- so far we assumed we knew the class ωi for each data point
- what if we don’t?
- class independence assumption loses meaning

[Scatter plot: unlabelled data points in the (x1, x2) plane]


Vector Quantisation, K-Means

- describes each class with a centroid
- a point belongs to a class if the corresponding centroid is closest (Euclidean distance)
- iterative procedure
- guaranteed to converge
- not guaranteed to find the optimal solution
- used in vector quantization


K-means: algorithm

Data: k (number of desired clusters), n data points xi
Result: k clusters
initialization: assign initial values to the k centroids ci;
repeat
    assign each point xi to the closest centroid cj;
    compute new centroids as the mean of each group of points;
until centroids do not change;
return k clusters;

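A direct Python transcription of this algorithm (assignment by Euclidean distance, centroid update by the mean); the sketch does not handle the corner case of a cluster losing all its points:

```python
import numpy as np

def kmeans(X, k, seed=0):
    """Plain k-means on data X (n x d); returns centroids and labels."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # assign each point to the closest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids as the mean of each group of points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            return new, labels
        centroids = new

# Three well-separated blobs as test data.
X = np.vstack([np.random.default_rng(1).normal(m, 0.3, size=(50, 2))
               for m in ((0, 0), (2, 2), (0, 2))])
centroids, labels = kmeans(X, 3)
print(centroids)
```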


K-means: example

[Animation: cluster assignments and centroid updates, shown up to iteration 20]

K-means: sensitivity to initial conditions

[Animation: with different initial centroids, the algorithm converges to a different solution (iteration 20 shown)]

Solution: LBG Algorithm

- Linde–Buzo–Gray
- start with one centroid
- adjust to mean
- split centroid (with ε)
- K-means
- split again. . . (see the sketch below)

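A compact sketch of the LBG loop under the same assumptions as the k-means sketch above; centroids are split by a small additive ±ε perturbation, so the codebook size doubles each round (n_centroids should be a power of two here):

```python
import numpy as np

def lloyd(X, centroids):
    """K-means (Lloyd) iterations from the given initial centroids."""
    while True:
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0)
                        for j in range(len(centroids))])
        if np.allclose(new, centroids):
            return new
        centroids = new

def lbg(X, n_centroids, eps=1e-3):
    """Linde-Buzo-Gray: start with one centroid, then split and refine."""
    centroids = lloyd(X, X.mean(axis=0, keepdims=True))  # adjust to mean
    while len(centroids) < n_centroids:
        centroids = np.vstack([centroids + eps, centroids - eps])  # split
        centroids = lloyd(X, centroids)                            # K-means
    return centroids
```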

K-means: limits of Euclidean distance

- the Euclidean distance is isotropic (the same in all directions in Rᵖ)
- this favours spherical clusters
- the size of the clusters is controlled by their distance


K-means: non-spherical classes

[Figure: two non-spherical classes]


Probabilistic Clustering

- model data as a mixture of probability distributions (Gaussian)
- each distribution corresponds to a cluster
- clustering corresponds to parameter estimation


Gaussian distributions

fk(xi|µk, Σk) = exp[ −½ (xi − µk)ᵀ Σk⁻¹ (xi − µk) ] / ( (2π)^{p/2} |Σk|^{1/2} )

Eigenvalue decomposition of the covariance matrix:

Σk = λk Dk Ak Dkᵀ

[Figures: covariance shapes in the (x1, x2) plane for different parameterisations]


Mixture of Gaussian distributions

Σk            Distribution  Volume    Shape     Orientation
λI            Spherical     Equal     Equal     N/A
λk I          Spherical     Variable  Equal     N/A
λDADᵀ         Ellipsoidal   Equal     Equal     Equal
λDk ADkᵀ      Ellipsoidal   Equal     Equal     Variable
λk Dk ADkᵀ    Ellipsoidal   Variable  Equal     Variable
λk Dk Ak Dkᵀ  Ellipsoidal   Variable  Variable  Variable

[Figure: example mixture in the (x1, x2) plane]



Fitting the model

- given the data D = {xi}
- given a certain model M and its parameters θ
- maximize the model fit to the data as expressed by the likelihood

L = p(D|θ)


Unsupervised Case

- release the class independence assumption:
- learn the mixture at once
- problem of missing data
- solution: Expectation Maximization


Expectation Maximization

- let x = (x1, . . . , xn) be the data (observations) drawn from K distributions (known)
- we call zj ∈ [1, K] the index of the Gaussian that generated the point xj (unknown)
- the combination of x and z is called the complete data
- the probability that the ith Gaussian generates a particular x is proportional to

p(x|z = i, θ) = N(µi, Σi)


Expectation Maximization 2

- the task is to estimate the unknown parameters

θ = {µ1, . . . , µK, Σ1, . . . , ΣK, P(z = 1), . . . , P(z = K)}

- to do so we iterate between improving our knowledge about z and improving the estimate of θ given this knowledge


EM: formulation

E-step: estimate the probability of z given the observation and the current model:

P(zj = i|xj, θt)

M-step: 1) compute the expected log-likelihood of the complete data (x, z):

Q(θ) = E_z[ ln ∏_{j=1}^{n} p(xj, z|θ) | xj ]

2) maximize Q(θ) with respect to the model parameters θ

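A compact numpy/scipy sketch of these two steps for a Gaussian mixture; the initialisation and the small regularisation term (which implements the variance constraint mentioned under EM and GM properties below) are pragmatic choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture (minimal sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)]            # initial means
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    Pz = np.full(K, 1.0 / K)                           # P(z = i)
    for _ in range(n_iter):
        # E-step: responsibilities P(z_j = i | x_j, theta_t)
        r = np.stack([Pz[i] * multivariate_normal(mu[i], Sigma[i]).pdf(X)
                      for i in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximise Q(theta) -> weighted ML updates
        Nk = r.sum(axis=0)
        Pz = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        for i in range(K):
            diff = X - mu[i]
            Sigma[i] = (r[:, i, None] * diff).T @ diff / Nk[i] \
                       + 1e-6 * np.eye(d)   # keep variances constrained
    return Pz, mu, Sigma

# Two blobs as test data.
X = np.vstack([np.random.default_rng(2).normal(m, 0.5, size=(100, 2))
               for m in ((0, 0), (3, 3))])
Pz, mu, Sigma = em_gmm(X, K=2)
print(Pz.round(2), mu.round(2))
```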

EM and GM: properties

- the variance in the GM model must be constrained (to avoid infinite likelihood)
- EM is guaranteed to converge to a local maximum of the complete data likelihood
- the initial conditions play an important role (as with K-means)
- GM and EM are equivalent to K-means when the covariances are all equal to the identity matrix
- equal covariances lead to linear discriminants


Classification and Regression Tree (CART)

- Binary decision tree
- An automatic and data-driven framework to construct a decision process based on objective criteria
- Handles data samples with mixed types and nonstandard structures
- Handles missing data, robust to outliers and mislabeled data samples
- Used in speech recognition for model tying


Classification and Regression Tree (CART)

[Figure: example tree]


Steps in constructing a CART

- Find set of questions
- Put all training samples in root
- Recursive algorithm:
  - Find the best combination of question and node. Split the node into two new nodes
  - Move the corresponding data into the new nodes
  - Repeat until right-sized tree is obtained
- Greedy algorithm, only locally optimal, splitting without regard to subsequent splits


Defining questions

Data described by x = (x1, x2, . . . , xd)

- one question per variable (singleton questions)
- If xi is discrete with values in {c1, . . . , cK}, questions of the form: is xi ∈ S?, with S a subset of the values
- If xi is continuous, questions of the form: is xi ≤ c?, with c a real number
- in both cases, a finite number of questions for a given dataset

59 / 60

Splitting Criterion

- we want the data points in each leaf to be homogeneous
- Find the pair of node and question for which the split gives the largest improvement
- Examples (see the sketch below):
  1. Largest decrease in class entropy
  2. Largest decrease in squared error from a regression of the data in the node

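A small Python sketch of the first criterion: for a continuous variable, score each candidate threshold c by the decrease in class entropy and keep the best; the data and labels are invented:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Class entropy H = -sum p log2 p of a node's labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def split_gain(x, labels, c):
    """Decrease in class entropy for the question: is x_i <= c?"""
    left, right = labels[x <= c], labels[x > c]
    n = len(labels)
    return entropy(labels) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))

x = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 1.5])
y = np.array([0, 0, 0, 1, 1, 1])
# candidate thresholds: finitely many per dataset (midpoints between sorted x)
best = max((x[:-1] + x[1:]) / 2, key=lambda c: split_gain(x, y, c))
print(best, split_gain(x, y, best))   # 0.7 gives gain 1.0 (a perfect split)
```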
