DT2118
Speech and Speaker Recognition
Lecture 2: Probability, Statistics and Pattern Recognition
Giampiero Salvi
KTH/CSC/TMH [email protected]
VT 2012
Outline
Probability Theory
Pattern Recognition (Bayes Decision Theory)
Estimation Theory
  Maximum Likelihood methods
  Bayesian methods
  Neural networks
Unsupervised Learning
  Vector Quantisation, K-Means
  Probabilistic Clustering
Classification and Regression Trees
Different views on probabilities
Axiomatic: defines axioms and derives properties
Classical: number of ways something can happen over total number of things that can happen (e.g. dice)
Logical: same as classical, but with the different ways weighted
Frequency: frequency of success in repeated experiments
Propensity
Subjective: degree of belief
Axiomatic view on probabilities (Kolmogorov)
Given an event E in an event space F:
1. P(E) ≥ 0 for all E ∈ F
2. sure event Ω: P(Ω) = 1
3. E1, E2, ... countable sequence of pairwise disjoint events, then
$$P(E_1 \cup E_2 \cup \cdots) = \sum_{i=1}^{\infty} P(E_i)$$
[Figure: the union E1 ∪ E2 ∪ ··· drawn as pairwise disjoint regions E1, E2, ...]
Consequences
1. Monotonicity: P(A) ≤ P(B) if A ⊆ B
[Figure: Venn diagram with A contained in B]
2. Empty set ∅: P(∅) = 0
3. Bounds: 0 ≤ P(E ) ≤ 1 for all E ∈ F
More Consequences: Addition
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
[Figure: Venn diagram of A and B showing A ∪ B and A ∩ B]
More Consequences: Negation
P(Ā) = P(Ω \ A) = 1 − P(A)
[Figure: Ω divided into A and its complement Ω \ A]
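The addition and negation rules can be checked by direct enumeration. The sketch below assumes a single fair die as the sample space; the events `A` and `B` and the helper `prob` are illustrative names, not part of the lecture.

```python
from fractions import Fraction

# Assumed sample space: one fair die, all outcomes equally likely (classical view).
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of OMEGA) under the fair-die assumption."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({2, 4, 6})  # illustrative event: "even"
B = frozenset({4, 5, 6})  # illustrative event: "greater than 3"

p_union = prob(A | B)
p_not_a = prob(OMEGA - A)

# Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert p_union == prob(A) + prob(B) - prob(A & B)
# Negation rule: P(Ω \ A) = 1 − P(A)
assert p_not_a == 1 - prob(A)

print(p_union, p_not_a)
```

Exact fractions are used instead of floats so that the two identities hold without rounding error.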
Conditional Probabilities
P(A|B)
The probability of event A when we know that event B has happened
Note: different from the probability that event A and event B both happen
P(A|B) ≠ P(A ∩ B)
[Figure: Venn diagram of A and B inside Ω, with A ∩ B marked; conditioning on B makes B the new sample space (B ≡ Ω)]
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
Bayes’ Rule
if
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
then
$$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$$
and
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$
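As a small numerical check of the definition of conditional probability and of Bayes' rule, the sketch below again assumes a fair die; the events and the helper names `prob` and `cond_prob` are made up for illustration.

```python
from fractions import Fraction

# Assumed sample space: one fair die.
OMEGA = frozenset(range(1, 7))

def prob(event):
    return Fraction(len(event), len(OMEGA))

def cond_prob(a, b):
    """P(a|b) = P(a ∩ b) / P(b); undefined when P(b) = 0."""
    if not b:
        raise ValueError("conditioning event has zero probability")
    return prob(a & b) / prob(b)

A = frozenset({2, 4, 6})  # illustrative event: "even"
B = frozenset({4, 5, 6})  # illustrative event: "greater than 3"

p_a_given_b = cond_prob(A, B)
p_a_and_b = prob(A & B)
# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_bayes = cond_prob(B, A) * prob(A) / prob(B)

print(p_a_given_b, p_a_and_b, p_bayes)
```

The conditional probability and the joint probability differ here, while the value obtained through Bayes' rule agrees with the direct computation.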
Bayes’ Rule and Pattern Recognition
A = words, B = sounds:
• During training we know the words and can compute P(sounds|words) using a frequentist approach (repeated observations)
• During recognition we want
$$\widehat{\text{words}} = \arg\max_{\text{words}} P(\text{words}|\text{sounds})$$
• Using Bayes’ rule:
$$P(\text{words}|\text{sounds}) = \frac{P(\text{sounds}|\text{words})\,P(\text{words})}{P(\text{sounds})}$$
where
P(words): a priori probability of the words (Language Model)
P(sounds): a priori probability of the sounds (constant with respect to the words, can be ignored)
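A toy illustration of why P(sounds) can be dropped from the arg max: dividing every score by the same constant does not change which word wins. The vocabulary and all probabilities below are hypothetical numbers chosen for the example, not values from the course.

```python
# Hypothetical acoustic scores P(sounds | words) for one observed utterance.
likelihood = {"yes": 0.020, "no": 0.050, "maybe": 0.010}
# Hypothetical language model P(words).
prior = {"yes": 0.60, "no": 0.10, "maybe": 0.30}

def recognise(likelihood, prior):
    """Return the arg max with and without normalising by P(sounds)."""
    scores = {w: likelihood[w] * prior[w] for w in likelihood}
    evidence = sum(scores.values())  # P(sounds), the same for every word
    posterior = {w: s / evidence for w, s in scores.items()}
    best_unnormalised = max(scores, key=scores.get)
    best_posterior = max(posterior, key=posterior.get)
    return best_unnormalised, best_posterior, posterior

best_unnormalised, best_posterior, posterior = recognise(likelihood, prior)
print(best_unnormalised, best_posterior)
```

In this made-up case the word with the highest acoustic likelihood ("no") loses to "yes" once the prior is taken into account.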
Discrete vs Continuous variables
Discrete:
• Discrete events: either 1, 2, 3, 4, 5, or 6.
• Discrete probability distribution p(x) = P(d = x)
• P(d = 1) = 1/6 (fair die)
Continuous:
• Any real number (theoretically infinite)
• Probability density function (PDF) f(x) (NOT PROBABILITY!!!)
• P(t = 36.6) = 0
• P(36.6 < t < 36.7) = 0.1
Gaussian distributions: One-dimensional
$$f(x|\mu, \sigma^2) = N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$$
[Figure: f(x) against x for x between 1.92 and 2.08, with µ and 2σ marked; annotated with the density value f(x1) = 8.1 and the area P(x2 < x < x3) = 0.15]
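The difference between a density value and a probability can be made concrete with a short sketch. The parameters `mu = 2.0` and `sigma = 0.03` are assumed for illustration (they are not read off the figure), and the interval probability is computed from the Gaussian CDF via `math.erf`.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """One-dimensional Gaussian density f(x | mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def gaussian_cdf(x, mu, sigma):
    """P(X <= x) for a Gaussian, expressed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_prob(a, b, mu, sigma):
    """P(a < X < b): the area under the density between a and b."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)

mu, sigma = 2.0, 0.03  # assumed example values
peak = gaussian_pdf(mu, mu, sigma)                     # density at the mean, larger than 1
p_one_sigma = interval_prob(mu - sigma, mu + sigma, mu, sigma)
p_point = interval_prob(mu, mu, mu, sigma)             # probability of a single point

print(peak, p_one_sigma, p_point)
```

A narrow Gaussian has a density well above 1 at its mean, yet every interval probability stays between 0 and 1 and a single point has probability 0.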
Gaussian distributions: d Dimensions
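$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}, \quad \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ \sigma_{d1} & \cdots & \cdots & \sigma_{dd} \end{bmatrix}$$
$$f(\mathbf{x}|\boldsymbol{\mu}, \Sigma) = \frac{\exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]}{(2\pi)^{d/2} |\Sigma|^{1/2}}$$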
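To show how the terms of the d-dimensional density fit together, here is a minimal sketch restricted to d = 2 so that the inverse and determinant of Σ can be written out by hand. The function name and the example inputs are assumptions; a practical implementation would normally rely on a linear algebra library.

```python
import math

def gaussian_pdf_2d(x, mu, cov):
    """Density of a 2-D Gaussian; cov is ((s11, s12), (s21, s22))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    norm = (2.0 * math.pi) ** (2 / 2) * math.sqrt(det)  # (2*pi)^(d/2) |Sigma|^(1/2), d = 2
    return math.exp(-0.5 * quad) / norm

identity = ((1.0, 0.0), (0.0, 1.0))
at_mean = gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), identity)
off_mean = gaussian_pdf_2d((1.0, 1.0), (0.0, 0.0), identity)
print(at_mean, off_mean)
```

With an identity covariance the density at the mean is 1/(2π), and it decreases as the point moves away from the mean.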
The Probabilistic Model of Classification
• “Nature” assumes one of c states ωj with a priori probability P(ωj)
• When in state ωj, “nature” emits observations x with distribution p(x|ωj)
[Figure: a priori probabilities of state1, state2 and state3 (axis from 0.0 to 0.6), and the class conditional probability distributions plotted against x]
Problem
• If I observe x and I know P(ωj) and p(x|ωj) for each j
• what can I say about the state of “nature” ωj?
Bayes decision theory
[Figure: a priori probabilities of state1, state2 and state3, and the class conditional probability distributions plotted against x]
$$P(\omega_j|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_j)\,P(\omega_j)}{p(\mathbf{x})}$$
[Figure: posterior probabilities plotted against x]
Classifiers: Discriminant Functions
[Figure: the inputs x1, x2, ..., xd feed the discriminant functions d1, d2, ..., ds; an arg max over their outputs gives the class c(x)]
di(x) = p(x|ωi) P(ωi)
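The posterior formula and the discriminant functions di(x) = p(x|ωi) P(ωi) can be sketched for a one-dimensional observation. The three states, their priors, and their Gaussian class-conditional parameters below are hypothetical values chosen only to make the example run.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

# Hypothetical model: priors P(w_j) and Gaussian class conditionals p(x | w_j).
classes = {
    "state1": {"prior": 0.5, "mu": 0.0, "sigma": 1.0},
    "state2": {"prior": 0.3, "mu": 3.0, "sigma": 1.0},
    "state3": {"prior": 0.2, "mu": 6.0, "sigma": 1.0},
}

def discriminants(x):
    """d_i(x) = p(x | w_i) P(w_i) for every class."""
    return {name: gaussian_pdf(x, c["mu"], c["sigma"]) * c["prior"]
            for name, c in classes.items()}

def posteriors(x):
    """P(w_j | x): discriminants normalised by the evidence p(x)."""
    d = discriminants(x)
    evidence = sum(d.values())
    return {name: value / evidence for name, value in d.items()}

def classify(x):
    """Pick the class with the largest discriminant (same arg max as the posterior)."""
    d = discriminants(x)
    return max(d, key=d.get)

print(classify(0.2), classify(3.0), classify(5.5))
print(posteriors(1.5))
```

Because p(x) is shared by all classes, the classifier only needs the unnormalised discriminants; the posteriors are computed here just to show that they sum to one.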
Classifiers: Decision Boundaries
Decision Boundaries in Two Dimensions
Estimation Theory
• So far we assumed we know P(ωj) and p(x|ωj)
• How can we obtain them from collections of data?
• This is the subject of Estimation Theory
Parametric vs Non-Parametric Estimation
[Figure: two panels, parametric and non-parametric]
We only consider parametric methods in this course
Parameter estimation
[Figure: data plotted over x1 and x2, with an arrow to the estimated distribution]
• ideally: p(x|ωj), i.e. p(x|θj); in reality: p(x|θ̂j) or p(x|D)
• Assumptions:
  - samples from class ωi do not influence the estimate for class ωj, i ≠ j
  - samples from the same class are independent and identically distributed (i.i.d.)
Parameter estimation (cont.)
• Class independence assumption:
[Figure: the estimation problem over x1 and x2 shown as equivalent to one separate estimation problem per class]
• Maximum likelihood estimation
• Bayesian estimation
Maximum likelihood estimation
• Find parameter vector θ that maximises p(D|θ) with D = {x1, ..., xn}
• i.i.d. →
$$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)$$
[Figure: p(x|θ) over the data X, the likelihood p(D|θ), and the log likelihood]
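For a one-dimensional Gaussian the maximiser of p(D|θ) has a closed form: the sample mean and the (biased) sample variance. The sketch below assumes that model; the data values are made up, and the log likelihood is used because the product over samples becomes a sum.

```python
import math

def log_likelihood(data, mu, sigma):
    """ln p(D | mu, sigma) for i.i.d. Gaussian samples: a sum of log densities."""
    n = len(data)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2))

def ml_gaussian(data):
    """Closed-form ML estimates for a 1-D Gaussian (variance divides by n, not n - 1)."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(var)

data = [1.9, 2.1, 2.0, 2.3, 1.7]  # hypothetical observations
mu_hat, sigma_hat = ml_gaussian(data)
best = log_likelihood(data, mu_hat, sigma_hat)

print(mu_hat, sigma_hat, best)
```

Evaluating `log_likelihood` at slightly different parameter values gives a lower score, which is a quick sanity check that the closed-form estimate is a maximum.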
Overfitting
Overfitting: Phoneme Discrimination
Bayesian estimation
• Consider θ as a random variable
• Characterise θ with the posterior distribution p(θ|D) given the data
• Using Bayes' formula, the posterior can be computed from the likelihood p(D|θ) and the prior p(θ):
$$p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{\int p(D|\theta)\,p(\theta)\,d\theta}$$
• ML: D → θ    Bayes: D, p(θ) → p(θ|D)
Bayesian estimation (cont.)
• We can compute p(x|D) instead of p(x|θ)
• Integrate the joint density p(x, θ|D) = p(x|θ)p(θ|D):
$$p(x|D) = \int p(x|\theta)\,p(\theta|D)\,d\theta$$
[Figure: p(x|θ) over the data X, the joint distribution p(x, θ|D), and its integral over θ]
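One case where the posterior and the predictive integral have closed forms is the mean of a Gaussian with known variance and a Gaussian prior on that mean. The sketch below assumes exactly that conjugate setting; the data, the known `sigma`, and the prior parameters `mu0` and `sigma0` are illustrative values.

```python
import math

def update_mean_posterior(data, sigma, mu0, sigma0):
    """Posterior N(mu_n, sigma_n^2) over the mean of a Gaussian with known
    standard deviation sigma, given the prior N(mu0, sigma0^2)."""
    n = len(data)
    xbar = sum(data) / n
    var, var0 = sigma ** 2, sigma0 ** 2
    mu_n = (n * var0 * xbar + var * mu0) / (n * var0 + var)
    var_n = (var0 * var) / (n * var0 + var)
    return mu_n, math.sqrt(var_n)

def predictive(mu_n, sigma_n, sigma):
    """p(x | D) = N(mu_n, sigma^2 + sigma_n^2): the integral over the posterior."""
    return mu_n, math.sqrt(sigma ** 2 + sigma_n ** 2)

data = [2.0, 2.2, 1.8, 2.4]          # hypothetical observations
sigma, mu0, sigma0 = 0.5, 0.0, 1.0   # assumed known noise and prior

mu_n, sigma_n = update_mean_posterior(data, sigma, mu0, sigma0)
pred_mu, pred_sigma = predictive(mu_n, sigma_n, sigma)

# Recursive use: the posterior after each sample becomes the prior for the next.
mu_r, sigma_r = mu0, sigma0
for x in data:
    mu_r, sigma_r = update_mean_posterior([x], sigma, mu_r, sigma_r)

print(mu_n, sigma_n, pred_sigma, mu_r, sigma_r)
```

The posterior mean lies between the prior mean and the sample mean, and the predictive distribution is wider than p(x|θ) because it also carries the remaining uncertainty about θ.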
Bayesian estimation (cont.)
Pros:
• better use of the data
• makes a priori assumptions explicit
• easily implemented recursively
  - use posterior p(θ|D) as new prior
• reduces overfitting
Cons:
• definition of noninformative priors can be tricky
• often requires numerical integration
• not widely accepted by traditional statistics (frequentism)
Other Training Strategies: Discriminative Training
• Maximum Mutual Information Estimation
• Minimum Error Rate Estimation
• Neural Networks
Multi layer neural networks
[Figure: multi layer neural network with input units x0, x1, x2, ..., xd, a bias unit, a hidden layer, an output unit, and E]
• Backpropagation algorithm
Unsupervised Learning
• So far we assumed we knew the class ωi for each data point
• What if we don’t?
• Class independence assumption loses meaning
[Figure: unlabelled data points plotted over x1 and x2]
Vector Quantisation, K-Means
• describes each class with a centroid
• a point belongs to a class if the corresponding centroid is closest (Euclidean distance)
• iterative procedure
• guaranteed to converge
• not guaranteed to find the optimal solution
• used in vector quantisation
K-means: algorithm
Data: k (number of desired clusters), n data points xi
Result: k clusters
initialization: assign initial value to k centroids ci;
repeat
    assign each point xi to closest centroid cj;
    compute new centroids as mean of each group of points;
until centroids do not change;
return k clusters;
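The pseudocode above can be sketched in plain Python as follows. Passing the initial centroids explicitly, capping the number of iterations, and keeping the old centroid when a cluster becomes empty are choices made for this sketch; the pseudocode does not specify them.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """K-means with Euclidean distance; points and centroids are tuples of floats."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: new centroid = mean of the points assigned to it.
        updated = [
            tuple(sum(col) / len(group) for col in zip(*group)) if group else old
            for group, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # centroids did not change: converged
            break
        centroids = updated
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

On this made-up data set the two centroids settle on the means of the two well-separated groups after one update.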
K-means: example
[Figure: K-means example at iteration 20, cluster update step]
K-means: sensitivity to initial conditions
[Figure: K-means with different initial conditions at iteration 20, cluster update step]
Solution: LBG Algorithm
• Linde–Buzo–Gray
• start with one centroid
• adjust to mean
• split centroid (with ε)
• K-means
• split again...
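The split-and-refine loop listed above might be sketched like this. The additive perturbation by ±`eps` in every coordinate, the fixed number of splits, and the small K-means helper are assumptions of this sketch rather than details given in the lecture.

```python
import math

def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, centroids, max_iter=100):
    """Plain K-means refinement starting from the given centroids."""
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Keep the old centroid if a group ends up empty (a choice made for this sketch).
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def lbg(points, n_splits, eps=0.01):
    """Start with one centroid at the mean, then repeatedly split and refine."""
    centroids = [mean(points)]
    for _ in range(n_splits):
        # Split every centroid into two copies shifted by +eps and -eps.
        centroids = [tuple(v + s for v in c) for c in centroids for s in (eps, -eps)]
        centroids = kmeans(points, centroids)
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
codebook = sorted(lbg(points, n_splits=1))
print(codebook)
```

Each split doubles the number of centroids, so `n_splits` rounds give a codebook of size 2 to the power `n_splits`. Shifting all coordinates by the same amount is a simplification: if the data separate along a direction orthogonal to that shift, the two copies can tie and one of them may end up with no points.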
K-means: limits of Euclidean distance
• the Euclidean distance is isotropic (same in all directions in R^p)
• this favours spherical clusters
• the size of the clusters is controlled by their distance
K-means: non-spherical classes
[Figure: two non-spherical classes]
Probabilistic Clustering
• model data as a mixture of probability distributions (Gaussian)
• each distribution corresponds to a cluster
• clustering corresponds to parameter estimation
Gaussian distributions
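$$f_k(x_i|\mu_k, \Sigma_k) = \frac{\exp\left\{-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right\}}{(2\pi)^{p/2} |\Sigma_k|^{1/2}}$$
Eigenvalue decomposition of the covariance matrix:
$$\Sigma_k = \lambda_k D_k A_k D_k^T$$
[Figure: three panels plotted over x1 and x2]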
Mixture of Gaussian distributions
Σk                   Distribution  Volume    Shape     Orientation
λ I                  Spherical     Equal     Equal     N/A
λ_k I                Spherical     Variable  Equal     N/A
λ D A D^T            Ellipsoidal   Equal     Equal     Equal
λ D_k A D_k^T        Ellipsoidal   Equal     Equal     Variable
λ_k D_k A D_k^T      Ellipsoidal   Variable  Equal     Variable
λ_k D_k A_k D_k^T    Ellipsoidal   Variable  Variable  Variable
[Figure: plot over x1 and x2]
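To make the roles of the three factors concrete, the sketch below builds a 2-D covariance matrix from a volume λ, a rotation D, and a diagonal shape matrix A. It assumes D is a rotation by `angle` and that A is normalised so that its determinant is 1, which is a common convention for this decomposition; the slides do not state the normalisation.

```python
import math

def covariance_2d(lam, angle, shape):
    """Sigma = lam * D * A * D^T for d = 2.

    lam:   volume factor
    angle: rotation angle defining D = [[cos, -sin], [sin, cos]] (orientation)
    shape: diagonal entries (a1, a2) of A
    """
    c, s = math.cos(angle), math.sin(angle)
    a1, a2 = shape
    s11 = lam * (a1 * c * c + a2 * s * s)
    s12 = lam * (a1 - a2) * c * s
    s22 = lam * (a1 * s * s + a2 * c * c)
    return ((s11, s12), (s12, s22))

def det_2d(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

spherical = covariance_2d(2.0, 0.7, (1.0, 1.0))          # lam * I, rotation has no effect
elongated = covariance_2d(2.0, 0.0, (4.0, 0.25))         # axis-aligned ellipse
rotated = covariance_2d(2.0, math.pi / 4, (4.0, 0.25))   # same shape and volume, rotated

print(spherical, elongated, rotated)
```

With the determinant of A fixed to 1, the determinant of Σ depends only on λ, which is why λ can be read as the volume while A and D change the shape and orientation.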
Fitting the model
• given the data D = {xi}
• given a certain model M and its parameters θ
• maximize the model fit to the data as expressed by the likelihood
L = p(D|θ)
Unsupervised Case
• relax the class independence assumption:
• learn the mixture at once
• problem of missing data
• solution: Expectation Maximization
Expectation Maximization
• let x = (x1, ..., xn) be the data (observations) drawn from K distributions (known)
• we call zj ∈ {1, ..., K} the index of the Gaussian that generated the point xj (unknown)
• the combination of x and z is called the complete data
• the probability that the ith Gaussian generates a particular x is proportional to
p(x|z = i, θ) = N(µi, Σi)
Expectation Maximization 2
• the task is to estimate the unknown parameters
θ = {µ1, ..., µK, Σ1, ..., ΣK, P(z = 1), ..., P(z = K)}
• to do so we iterate between improving our knowledge about z and improving the estimate of θ given this knowledge
EM: formulation
E-step: estimate the probability of z given the observation and the current model:
$$P(z_j = i | x_j, \theta_t)$$
M-step:
1) compute the expected log-likelihood of the complete data (x, z)
$$Q(\theta) = E_z\left[\ln \prod_{j=1}^{n} p(x_j, z|\theta) \,\Big|\, x_j\right]$$
2) maximize Q(θ) with respect to the model parameters θ
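For a mixture of one-dimensional Gaussians both steps have simple closed forms, which the following sketch illustrates. The data, the initial parameters, the fixed number of iterations, and the variance floor `var_floor` are assumptions made for the example; the M-step uses the standard weighted-mean and weighted-variance updates that maximise Q(θ) for this model.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, mus, variances, weights, n_iter=50, var_floor=1e-3):
    """EM for a 1-D Gaussian mixture; weights[i] plays the role of P(z = i)."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibilities P(z_j = i | x_j, theta_t).
        resp = []
        for x in data:
            joint = [weights[i] * gaussian_pdf(x, mus[i], variances[i]) for i in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: closed-form maximisers of Q(theta) for Gaussian components.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            mus[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var, var_floor)  # constrain the variance
            weights[i] = n_i / len(data)
    return mus, variances, weights

data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]  # hypothetical observations
mus, variances, weights = em_gmm_1d(data, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
print(mus, variances, weights)
```

The floor on the variance keeps a component from collapsing onto a single point, where the likelihood would otherwise grow without bound. This sketch does not guard against a component receiving zero total responsibility, which a more careful implementation would need to handle.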
EM and GM: properties
• the variance in the GM model must be constrained (to avoid infinite likelihood)
• EM is guaranteed to converge to a local maximum of the complete data likelihood
• the initial conditions play an important role (as with K-means)
• GM and EM are equivalent to K-means when the covariances are all equal to the identity matrix
• equal covariances lead to linear discriminants
Classification and Regression Tree (CART)
• Binary decision tree
• An automatic and data-driven framework to construct a decision process based on objective criteria
• Handles data samples with mixed types, nonstandard structures
• Handles missing data, robust to outliers and mislabeled data samples
• Used in speech recognition for model tying
Steps in constructing a CART
• Find set of questions
• Put all training samples in root
• Recursive algorithm
  - Find the best combination of question and node. Split the node into two new nodes
  - Move the corresponding data into the new nodes
  - Repeat until right-sized tree is obtained
• Greedy algorithm, only locally optimal, splitting without regard to subsequent splits
Defining questions
Data described by x = (x1, x2, . . . , xd)
• one question per variable (singleton questions)
• If xi discrete with values in {c1, ..., cK}, questions in the form: is xi ∈ S?, with S subset of the values.
• If xi continuous, questions in the form: is xi ≤ c?, with c real number.
• in both cases, finite number of questions for a dataset
Splitting Criterion
• we want data points in each leaf to be homogeneous
• Find the pair of node and question for which the split gives largest improvement
• Examples:
  1. Largest decrease in class entropy
  2. Largest decrease in squared error from a regression of the data in the node
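The entropy criterion for a single continuous variable can be sketched as follows. Using midpoints between consecutive distinct values as the candidate thresholds c is an assumption of this sketch (it is one way to obtain a finite set of questions of the form "is x ≤ c?"); the data are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy in bits of a list of labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def best_threshold(values, labels):
    """Find the question 'is x <= c?' with the largest decrease in class entropy."""
    n = len(values)
    parent = entropy(labels)
    best_gain, best_c = 0.0, None
    distinct = sorted(set(values))
    # Finite candidate set: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        c = (low + high) / 2.0
        left = [lab for v, lab in zip(values, labels) if v <= c]
        right = [lab for v, lab in zip(values, labels) if v > c]
        children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = parent - children
        if gain > best_gain:
            best_gain, best_c = gain, c
    return best_gain, best_c

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # hypothetical feature values
labels = ["a", "a", "a", "b", "b", "b"]      # hypothetical class labels
gain, threshold = best_threshold(values, labels)
print(gain, threshold)
```

A full tree builder would evaluate such questions for every variable and every current leaf, apply the best one, and repeat, which is the greedy procedure described in the construction steps.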