
Audio signal classification

Contents:
- Introduction to pattern classification
- Features
- Feature selection
- Classification methods

1 Introduction to pattern classification

Refresher of the basic concepts of pattern classification: introductory slides accompanying the book Pattern Classification by Duda, Hart and Stork, John Wiley & Sons, 2000.

Audio signal classification

Concrete problems:
- musical genre classification, musical instrument recognition
- speaker recognition, language recognition
- audio context recognition (office vs. restaurant vs. street), video segmentation based on audio, sound effects retrieval

Closely related to sound source recognition in humans
- source recognition includes segmentation (perceptual sound separation) in polyphonic signals

Many efficient methods have been developed in the speech / speaker recognition field.

Example: musical instrument recognition

Different acoustic properties of sound sources make them recognizable
- the properties result from the sound production mechanism

But: even a single source produces varying sounds
- we must find something characteristic to the source, not to sound events alone: source invariants

Figure: examples of a flute (left) and a clarinet (right); what to measure? [Eronen & Klapuri, 2000]


Supervised classification system

Simplified:
A. Feature extraction
B. Classification (using models)

1. training phase: learn from examples, collect statistics for feature values
2. classify new instances based on what was learnt in the training phase

[Block diagram: signal → segmentation → feature extraction → model training (producing models) / classify (using models) → recognition result]


2 Feature extraction

Feature extraction is inevitable
- the time-domain signal as such contains too much irrelevant data to use it directly for classification

Using appropriate features is crucial for successful classification
- lousy features (with little discriminating power) can hardly be compensated for by the classifier


2.1 Spectral features

= Features that characterize the short-time spectrum
Usually the most successful features for audio classification.
Also the most general-purpose features for different problems
- in contrast, temporal features are typically different for musical genre recognition, instrument recognition, or speaker recognition, for example

In extracting spectral features:
- the phase spectrum is typically discarded (a 50% reduction of information)
- the spectral fine structure is usually discarded in most tasks (an even larger reduction of irrelevant information)
- only the coarse spectral energy distribution is retained
  - effective for general audio classification
  - the basis for speech and speaker recognition

Spectral features

Cepstral coefficients
Cepstral coefficients c(k) are a very convenient way to model the spectral energy distribution:

c(k) = IDFT{ log |DFT{x(n)}| }

where DFT denotes the Fourier transform and IDFT its inverse.
In Matlab (see the sketch below), only the real part, real( ), is taken because finite numerical precision produces an infinitesimally small imaginary part.
Cepstrum coefficients are calculated in short frames over time
- they can be further modeled by calculating e.g. the mean and variance of each coefficient over time
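The Matlab snippet referenced above did not survive the PDF extraction; the following one-liner is a minimal sketch of the formula (x is one analysis frame as a column vector; adding eps to guard against log of zero is my addition, not from the slide):

    % Real cepstrum of one frame: c = IDFT{ log |DFT{x}| }.
    % real() strips the infinitesimally small imaginary part
    % that remains due to finite numerical precision.
    c = real(ifft(log(abs(fft(x)) + eps)));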


Spectral features

Cepstral coefficients
Only the first M cepstrum coefficients are used as features
- all coefficients together model the precise spectrum
- the coarse spectral shape is modeled by the first coefficients
- precision is selected by the number of coefficients taken
- the first coefficient (energy) is usually discarded
Usually M = fs / (2000 Hz) is a good first guess for M.
Figure: piano spectrum (thin black line) and spectrum modeled with the first 50 coeffs (thick blue line) or 10 coeffs (red broken line); axes: frequency (Hz) vs. log magnitude.

Spectral features

Cepstral coefficients
A drawback of the cepstral coefficients: linear frequency scale.
Perceptually, the frequency ranges 100...200 Hz and 10...20 kHz are approximately equally important
- the standard cepstral coefficients do not take this into account
- a logarithmic frequency scale would be better

Why mimic perception?
- typically we want to classify sounds according to perceptual (dis)similarity
- perceptually relevant features often lead to robust classification, too

Desirable for features: a small change in the feature vector should correspond to a small perceptual change (and vice versa).
Mel-frequency cepstral coefficients fulfill this criterion.

Spectral features

Frequency and magnitude warping
- Linear scale: usually hard to see anything
- Log-frequency: each octave is approximately equally important perceptually
- Log-magnitude: the perceived change from 50 dB to 60 dB is about the same as from 60 dB to 70 dB

Spectral features

Mel-frequency cepstral coefficients
Improves over the standard cepstral coefficients.
Mel frequency scale:

Mel(f) = 2595 · log10(1 + f/700)

Block diagram (calculations for one analysis frame; a Matlab sketch follows below):
Signal → Frame blocking, windowing → Discrete Fourier transform → Simulate Mel filterbank → Power at the output of each filter → Discrete cosine transform → MFCC
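A sketch of the block diagram in Matlab, one step per line. The Mel filterbank matrix H (one triangular filter per row, sized to match the nonnegative-frequency bins) is assumed to be built beforehand, and the constants (512-point FFT, 13 coefficients) are illustrative choices, not values from the slide:

    % One analysis frame x -> MFCC vector (sketch).
    N    = length(x);
    w    = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));  % Hamming window
    X    = fft(x(:) .* w, 512);                   % discrete Fourier transform
    P    = abs(X(1:257)).^2;                      % power spectrum
    E    = log(H * P + eps);                      % log power at each Mel filter
    M    = 13; Q = numel(E);
    dctm = cos(pi/Q * (0:M-1)' * ((1:Q) - 0.5));  % DCT-II basis matrix
    mfcc = dctm * E;                              % discrete cosine transform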


Spectral features

Mel-frequency cepstral coefficients
Some reasons why MFCCs are successful:
- Mel-frequency scale and log of power: a large change in the MFCC vector corresponds to a large perceptual change
- discrete cosine transform: drops the spectral fine structure and decorrelates the features

Delta-MFCCs are usually concatenated to the feature vector to include information about the temporal variation of features
- let v_t be the feature vector at frame t; the first-order delta feature is then

Δ_t = (v_{t+1} − v_{t−1}) / 2

- the delta features are added after the static features in the feature vector (see the sketch below)
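A minimal Matlab sketch of the delta computation, assuming V is an M-by-T matrix of static MFCCs with one column per frame (replicating the edge frames is an assumption, not from the slide):

    % First-order deltas: D(:,t) = (V(:,t+1) - V(:,t-1)) / 2.
    Vprev = V(:, [1 1:end-1]);   % previous frame, first one replicated
    Vnext = V(:, [2:end end]);   % next frame, last one replicated
    D = (Vnext - Vprev) / 2;
    F = [V; D];                  % deltas appended after the static features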

Spectral features

Other spectral features
Spectral centroid (correlates with brightness):

SC = Σ_{k=1}^{K} k·|X(k)|² / Σ_{k=1}^{K} |X(k)|²

where |X(k)| is the magnitude of frequency component k.
Bandwidth:

BW² = Σ_{k=1}^{K} (k − SC)²·|X(k)|² / Σ_{k=1}^{K} |X(k)|²
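Both features are direct sums over the magnitude spectrum; a Matlab sketch, assuming Xmag holds the magnitudes |X(k)| of one frame (both results are in frequency-bin units here):

    % Spectral centroid and bandwidth from a magnitude spectrum.
    k  = (1:numel(Xmag))';
    P  = Xmag(:).^2;
    SC = sum(k .* P) / sum(P);                  % centroid (in bins)
    BW = sqrt(sum((k - SC).^2 .* P) / sum(P));  % bandwidth (in bins)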


2.2 Temporal features

Characterize the temporal evolution of an audio signal.
Temporal features tend to be more task-specific.
Mid-level representation for feature extraction:
- power envelope of the signal sampled at a 100 Hz...1 kHz rate
- or: power envelopes of the signal at 3...40 subbands

Note: also these drop the phase spectrum and all spectral fine structure!


Temporal features

Musical instrument classification:
- rise time: time interval between the onset and the instant of maximal amplitude
- onset asynchrony at different frequencies
- frequency modulation: amplitude and rate (vibrato: 4...8 Hz)
- amplitude modulation: amplitude and rate (tremolo; roughness: fast amplitude modulation at a 15...300 Hz rate)

General audio classification:
- e.g. amplitude modulation
- also the delta-MFCCs can be seen as temporal features


2.3 Features calculated in the time domain

Sometimes (computationally) ultra-light features are needed and the Fourier transform is avoided.
Zero-crossing rate:

ZCR = (1/N) Σ_{n=2}^{N} |sign(x(n)) − sign(x(n−1))|

where sign(x) = 1 for x > 0, 0 for x = 0, and −1 for x < 0.
Figure: ZCR correlates strongly with spectral centroid (~ brightness); axes: zero-crossing rate vs. spectral centroid. [Peltonen, MSc thesis, 2001]

Short-time energy:

STE = (1/N) Σ_{n=1}^{N} x(n)²

- a lousy feature as such, but different statistics of STE are useful
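A Matlab sketch of the two formulas for one frame x of length N (Matlab's built-in sign() already matches the definition above):

    % Zero-crossing rate and short-time energy of one frame.
    x = x(:); N = numel(x);
    s = sign(x);
    ZCR = sum(abs(s(2:N) - s(1:N-1))) / N;
    STE = sum(x.^2) / N;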


More features...

Table: some features for musical instrument recognition. [Eronen & Klapuri, 2000]
One can measure many things...


3 Feature selection and transformation 1)

1) This section is based on the lecture notes of Antti Eronen.

Let us consider methods to select a set of features from a larger set of available features.
The number of features at the disposal of the designer is usually very large (tens or even hundreds).
Reasons for reducing the number of features:
- fewer features → simpler models → less training data needed (remember the "curse of dimensionality")
- amount of training data: the higher the ratio of the number of training patterns N to the number of free model parameters, the better the generalisation properties
- computational complexity; correlating (redundant) features


Feature selection

Feature selection (or, reduction) problem: reduce the dimension of the feature vectors while preserving as much class-discriminatory information as possible.

Good features result in a large between-class distance and a small within-class variance.

Approaches:
- examine features individually and discard those with little discrimination ability
- a better way is to examine the features in combinations
- linear (or nonlinear) transformation of the feature vector


3.1 Data normalization

Features with large values have a larger influence on the cost function than features with small values (e.g. in a Euclidean distance).
For the N available data points of the dth feature (d = 1, 2, ..., D):

mean: m_d = (1/N) Σ_{n=1}^{N} x_{nd}

variance: σ_d² = (1/(N−1)) Σ_{n=1}^{N} (x_{nd} − m_d)²

normalized feature: x̂_{nd} = (x_{nd} − m_d) / σ_d

The resulting normalized features have zero mean and unit variance
- if desired, feature weighting can be performed separately after normalization
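A Matlab sketch of the normalization, assuming X is an N-by-D matrix with one data point per row (the implicit expansion in the last line requires Matlab R2016b or newer):

    % Z-score normalization: zero mean, unit variance per feature.
    m  = mean(X, 1);          % 1-by-D feature means
    s  = std(X, 0, 1);        % 1-by-D standard deviations, 1/(N-1) norm.
    Xn = (X - m) ./ s;        % normalized features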


Mean, variance, correlation
Figure: illustration of the mean, variance, and correlation of two features x1 and x2.


3.2 Principal component analysis

Idea: find a linear transform A such that when applied to the feature vectors x, the resulting new features y are uncorrelated,
i.e., the covariance matrix of the feature vectors y is diagonal.
Define the covariance matrix of x by (here we use the training data!)

C_x = E{ (x − m)(x − m)^T }

Because C_x is symmetric, it has D orthonormal eigenvectors, and hence it can be diagonalised:

A C_x A^T = Λ

Above, A is a D-by-D matrix whose rows are formed from the eigenvectors of C_x, and Λ is a diagonal matrix with the corresponding eigenvalues on its diagonal.
Now the transform can be written as

y = A(x − m)

where m is the mean of the feature vectors x.

Principal component analysis

Eigenvectors

Recall the eigenvector equations:

C e_d = λ_d e_d, i.e., (C − λ_d I) e_d = 0

where the e_d are the eigenvectors, the λ_d are the eigenvalues, and C is a square matrix.
In Matlab: see the sketch below.
The transform matrix A in y = A(x − m) consists of the eigenvectors of C_x: A = [e_1, e_2, ..., e_D]^T.
Since C_x is real and symmetric, A is orthonormal (A A^T = I, A^T = A^{−1}).
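The Matlab snippet on the original slide was lost in extraction; presumably it called eig. A sketch of the full PCA transform for an N-by-D data matrix X:

    % PCA: y = A(x - m), where the rows of A are eigenvectors of Cx.
    m         = mean(X, 1);
    Cx        = cov(X);                   % D-by-D covariance estimate
    [E, L]    = eig(Cx);                  % columns of E are eigenvectors
    [lam, ix] = sort(diag(L), 'descend');
    A         = E(:, ix)';                % largest eigenvalue first
    Y         = (X - m) * A';             % decorrelated features
    % Whitening (unit variance): Yw = (X - m) * (diag(1./sqrt(lam)) * A)'.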


Principal component analysis

Transformed features

The transformed feature vectors y = A(x − m) have zero mean, and their covariance matrix (ignoring the mean terms) is

C_y = E{ A(x − m)(A(x − m))^T } = A C_x A^T = Λ

i.e., C_y is diagonal and the features are thus uncorrelated.
Furthermore, let us denote by Λ^{1/2} the diagonal matrix whose elements are the square roots of the eigenvalues of C_x. Then the transformed vectors

y = Λ^{−1/2} A(x − m)

have uncorrelated elements with unit variance:

C_y = (Λ^{−1/2} A) C_x (Λ^{−1/2} A)^T = ... = I.

Principal component analysis

Transformed features
Figure: mean removal, y = x − m.

Principal component analysis

Transformed features
Left: decorrelated features; the basis vectors are orthogonal, e_1^T e_2 = 0.
Right: variance scaling; the basis vectors are orthonormal, e_1^T e_1 = e_2^T e_2 = 1.


Principal component analysis

The transformation can be used to reduce the amount of needed data.
Sort the rows of A so that the first row corresponds to the largest eigenvalue and the last row to the smallest eigenvalue, and keep only the M principal components to create an M-by-D matrix A_M for the data projection:

y = A_M (x − m)

We obtain an approximation for x, which essentially is the projection of x onto the subspace spanned by the M orthonormal eigenvectors.
This projection is optimal in the sense that it minimises the mean square error (MSE)

E{ ||x − x̂||² }

over all approximations with M components.


Principal component analysis

Practical use of PCA
PCA is useful for preprocessing features before classification.
Note: PCA does not take the different classes into account; it only considers the properties of the features themselves.
If two features x_i and x_j are redundant, then one eigenvalue of C_x is very small and one dimension can be dropped
- we do not need to choose between two correlating features: it is better to do the linear transform and then drop the least significant dimension; that way both of the correlating features are utilized

You need to scale the feature variances before the eigenanalysis, since the eigenvalues are proportional to the numerical ranges of the features
- procedure: 1. normalize, 2. PCA, 3. check whether there are unnecessary dimensions

Example of correlating features: ZCR and spectral centroid.


3.3 Class separability measures

PCA does not take the different classes into account
- features remaining after PCA are efficient in characterising sounds but do not necessarily discriminate between different classes

Now we will look at a set of simple criteria that measure the discriminating properties of feature vectors.
Within-class scatter matrix (K different classes):

S_w = Σ_{k=1}^{K} p(ω_k) C_k

where C_k = E{ (x − m_k)(x − m_k)^T } is the covariance matrix for class k and p(ω_k) is the prior probability of class k, i.e., p(ω_k) = N_k / N, where N_k is the number of samples from class k (out of N samples in total).
trace{S_w} is the sum of the diagonal elements of S_w and here quantifies the average within-class variance of the features over all classes.


Class separability measures

Between-class scatter matrix:

S_b = Σ_{k=1}^{K} p(ω_k) (m_k − m_0)(m_k − m_0)^T

where m_k is the mean of class k and m_0 is the global mean vector

m_0 = Σ_{k=1}^{K} p(ω_k) m_k

trace{S_b} is a measure of the average distance of the mean of each class from the global mean value over all classes.


Class separability measures

Criterion:

J_1 = trace{S_b} / trace{S_w}

It obtains large values when
- the samples in the D-dimensional space are well clustered around their mean within each class (small trace{S_w})
- the clusters of the different classes are well separated (large trace{S_b})
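A Matlab sketch that estimates S_w, S_b, and J_1 from labeled training data X (N-by-D) with class labels y (N-by-1); sample-based estimates replace the expectations above:

    % Class separability J1 = trace(Sb)/trace(Sw).
    cls = unique(y); D = size(X, 2);
    m0  = mean(X, 1);
    Sw  = zeros(D); Sb = zeros(D);
    for k = 1:numel(cls)
        Xk = X(y == cls(k), :);
        pk = size(Xk, 1) / size(X, 1);          % prior p(class k) = Nk/N
        mk = mean(Xk, 1);
        Sw = Sw + pk * cov(Xk, 1);              % within-class scatter
        Sb = Sb + pk * (mk - m0)' * (mk - m0);  % between-class scatter
    end
    J1 = trace(Sb) / trace(Sw);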


3.4 Feature subset selection

Problem: how to select a subset of M features from the D originally available so that
1. we reduce the dimensionality of the feature vector (M < D)
2. we optimize the desired class separability criterion

To find the optimal subset of features, we should form all possible combinations of M features out of the D originally available
- the best combination is then selected according to any desired class separability measure J

In practice, it is not possible to evaluate all the possible feature combinations!

Feature vector selection

Sequential backward selection (SBS)

Particularly suitable for discarding a few worst features:
1. Choose a class separability criterion J, and calculate its value for the feature vector which consists of all available features (length D).
2. Eliminate one feature, and for each possible resulting combination (of length D − 1) compute J. Select the best combination.
3. Continue this for the remaining features, and stop when you have obtained the desired dimension M.

This is a suboptimal search procedure, since we cannot guarantee that the optimal (r − 1)-dimensional vector has to originate from the optimal r-dimensional one!

Feature vector selection

Sequential forward selection (SFS)
Particularly suitable for finding a few "golden" features (a sketch follows below):

1. Compute the criterion value J for each individual feature. Select the best.
2. Form all possible two-dimensional vectors that contain the winner from the previous step. Calculate the criterion for each vector and select the best.
3. Continue adding features one at a time, always taking the one that results in the largest value of the criterion J.
4. Stop when the desired vector dimension M is reached.

Both SBS and SFS suffer from the nesting effect: once a feature is discarded in SBS (selected in SFS), it cannot be reconsidered again (discarded in SFS).
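A Matlab sketch of SFS; J is assumed to be a function handle, e.g. the J1 computation above wrapped as J = @(X, y) ..., and M is the target dimension:

    % Sequential forward selection of M features out of D.
    D = size(X, 2);
    sel = []; rem = 1:D;
    for step = 1:M
        best = -Inf;
        for f = rem
            v = J(X(:, [sel f]), y);     % criterion with feature f added
            if v > best, best = v; bestf = f; end
        end
        sel = [sel bestf];               % the winner is kept for good
        rem = setdiff(rem, bestf);       % (this causes the nesting effect)
    end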


3.5 Linear transforms

Feature generation by combining all the D features to obtain M (with M < D) new features.
Find a linear transform A that maps the original feature vectors x to M-dimensional feature vectors y such that

y = Ax

Objectives:
1. reduce the dimensionality of the feature vector
2. optimize the desired class separability criterion

Note the similarity with PCA; the difference is that here we consider the class separability.


Linear transforms

Linear discriminant analysis (LDA)

Choose A to optimize

J_2 = trace{ S_w^{−1} S_b }

Very similar to PCA. The difference is that here the eigenanalysis is performed for the matrix S_w^{−1} S_b instead of the global covariance.
The rows of the transform matrix are obtained by choosing the eigenvectors corresponding to the M largest eigenvalues of S_w^{−1} S_b.
This leads to a feature vector dimension M ≤ K − 1, where K is the number of classes
- this is because the rank of S_w^{−1} S_b is K − 1
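A Matlab sketch of the LDA transform, reusing the Sw and Sb computed in the J1 sketch above; real() is a guard against tiny imaginary parts, since Sw\Sb is not symmetric:

    % LDA: keep the eigenvectors of inv(Sw)*Sb with the largest eigenvalues.
    [E, L]  = eig(Sw \ Sb);
    [~, ix] = sort(real(diag(L)), 'descend');
    M       = numel(unique(y)) - 1;     % rank of inv(Sw)*Sb is K-1
    A       = real(E(:, ix(1:M)))';     % M-by-D transform matrix
    Y       = X * A';                   % y = A x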


4 Classification methods

The goal is to classify previously unseen instances after the classifier has learned the training data.
Supervised vs. unsupervised classification:
- supervised: the classes of the training instances are told during learning
- unsupervised: clustering into hitherto unknown classes
Supervised classification is the focus here.

Example on the next page:
- data is represented as M-dimensional feature vectors (here M = 2)
- 18 different classes
- several training samples from each class

Classification methods

Example
Figure: training samples from 18 classes as two-dimensional feature vectors, and a new instance to classify.


4.1 Classification by distance functions

Minimum distance classification:
- calculate the distance D(x, y_i) between the unknown sample x and all the training samples y_i from all classes, for example the Euclidean distance

D_E(x, y) = sqrt( (x − y)^T (x − y) )

- choose the class according to the closest training sample

k-nearest-neighbour (k-NN) classifier:
- pick the k nearest neighbours of x and then choose the class which was most often picked

These are "lazy" classifiers:
- training is trivial: just store the training samples y_i
- classification gets complex with a lot of training data: the distance to all training samples must be measured
- computational efficiency can be improved by storing only a sufficient number of class prototypes for each class, or by using efficient indexing techniques (locality-sensitive hashing)


Classification by distance functions

Distance metrics

The choice of the distance metric is very important.
Euclidean distance metric:

D_E(x, y) = sqrt( (x − y)^T (x − y) ) = sqrt( Σ_d (x_d − y_d)² )

- sqrt is order-preserving, thus it is equivalent to minimize

D_E²(x, y) = (x − y)^T (x − y)

Mahalanobis distance between x and y:

D_M(x, y) = sqrt( (x − y)^T C^{−1} (x − y) )

where C is the covariance matrix of the training data.
The Mahalanobis distance D_M is generally a good choice.
Comment: k-NN is a decent classifier, yet very easy to implement (see the sketch below).
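A Matlab sketch of 1-NN classification with both metrics; Xtr is the N-by-D training data, ytr the labels, and x the test vector. Extending to k-NN means sorting the distances, taking the k smallest, and voting:

    % 1-NN with squared Euclidean and Mahalanobis distances.
    dif = Xtr - x(:)';                 % N-by-D differences
    dE2 = sum(dif.^2, 2);              % squared Euclidean distances
    C   = cov(Xtr);                    % covariance of the training data
    dM2 = sum((dif / C) .* dif, 2);    % squared Mahalanobis distances
    [~, i] = min(dM2);                 % nearest training sample
    label  = ytr(i);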


Classification by distance functions

Example: decision boundaries for a 1-NN classifier
Left: Euclidean distance, axes not scaled; undesired weighting (elongated horizontal contours).
Right: Euclidean distance, axes scaled. [Peltonen, MSc thesis, 2001]


Classification by distance functions

Example: decision boundaries for a 1-NN classifier
Left: Euclidean distance, axes scaled.
Right: the Mahalanobis distance scales the axes and rotates the feature space to decorrelate the features. [Peltonen, MSc thesis, 2001]


4.2 Statistical classification

Idea: interpret the feature vector x as a random variable whose distribution depends on the class.
Bayes' formula is a very important tool:

P(ω_i | x) = P(x | ω_i) P(ω_i) / P(x)

where P(ω_i | x) denotes the probability of class ω_i given an observed feature vector x.
Note: we do not know P(ω_i | x), but P(x | ω_i) can be estimated from the training data.
Moreover, since P(x) is the same for all classes, we can perform classification based on (MAP classifier):

ω* = argmax_i P(x | ω_i) P(ω_i)

A remaining task is to parametrize and learn P(x | ω_i) and to define (or learn) the prior probabilities P(ω_i).


Statistical classification

Gaussian mixture model (GMM)
GMMs are very convenient for representing P(x | ω_i), i.e., the multidimensional distribution of the features for class ω_i.
Weighted sum of multidimensional Gaussian distributions:

P(x | ω_i) = Σ_{q=1}^{Q} w_{i,q} N(x; μ_{i,q}, Σ_{i,q})

where the w_{i,q} are weights, N(x; μ, Σ) is a Gaussian distribution, and Σ_q w_{i,q} = 1.
Figure: a GMM in one dimension. [Heittola, MSc thesis, 2004]

Statistical classification

Gaussian mixture model (GMM)
GMMs can fit any distribution given a sufficient number of Gaussians
- successful generalization requires limiting the number of components

GMM is a parametric model
- parameters: weights w_{i,q}, means μ_{i,q}, covariance matrices Σ_{i,q}
- diagonal covariance matrices can be used if the features are decorrelated: far fewer parameters

The multidimensional Gaussian distribution is

N(x; μ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp( −(1/2)(x − μ)^T Σ^{−1} (x − μ) )

The parameters of the GMM of each class are estimated so as to maximize P(x | ω_i), i.e., the probability of the training data of each class ω_i
- iterative EM algorithm
- toolboxes: insert training data → get the model → use it to classify new instances

Statistical classification

MAP classification

After having learned P(x | ω_i) and P(ω_i) for each class, maximum a posteriori classification of a new instance x:

ω* = argmax_i P(x | ω_i) P(ω_i)

Usually we have a sequence of feature vectors x_{1:T}, and then

ω* = argmax_i P(ω_i) Π_{t=1}^{T} P(x_t | ω_i)

(assuming that the feature vectors are i.i.d. given the class). Since the logarithm is order-preserving, we can equivalently maximize

ω* = argmax_i [ log P(ω_i) + Σ_{t=1}^{T} log P(x_t | ω_i) ]

A sketch of this rule follows below.
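A Matlab sketch of the log-domain MAP rule for a feature sequence X (D-by-T), with one diagonal-covariance GMM per class. The struct layout (fields w, mu, v, and prior in a cell array gmm) is a hypothetical choice for this sketch, not from the slides:

    % score(i) = log P(w_i) + sum_t log P(x_t | w_i).
    score = zeros(1, numel(gmm));
    for i = 1:numel(gmm)
        ll = 0;
        for t = 1:size(X, 2)
            p = 0;
            for q = 1:numel(gmm{i}.w)     % sum over mixture components
                d = X(:,t) - gmm{i}.mu(:,q);
                v = gmm{i}.v(:,q);        % diagonal covariance entries
                p = p + gmm{i}.w(q) * exp(-0.5*sum(d.^2./v)) ...
                        / sqrt(prod(2*pi*v));
            end
            ll = ll + log(p + eps);       % sum of frame log-likelihoods
        end
        score(i) = log(gmm{i}.prior) + ll;
    end
    [~, winner] = max(score);             % MAP class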

Statistical classification

Hidden Markov model (HMM)
HMMs account for the time-varying nature of sounds
- probability of observation x in state j: typically a GMM model b_j(x)
- transition probabilities between states: matrix A with a_{ij} = P(q_t = i | q_{t−1} = j)
- the state variable is hidden, and usually not of interest in classification

Train an HMM for each class. HMMs allow computing

P(ω_i | x_{1:T}) = P(x_{1:T} | ω_i) P(ω_i) / P(x_{1:T})

where x_{1:T} is an observation sequence.
Figure: [Heittola, MSc thesis, 2004]


Statistical classification

Hidden Markov model (HMM)

Hidden Markov models are a generalization of GMMs
- GMM = HMM with only one state (when using GMMs as state-conditional observation distributions)

HMMs do not necessarily perform better than GMMs
- HMMs are useful only if some temporal information is truly learned into the state transition probability matrix, e.g. in speech
- example: genre classification, general audio classification

In speech recognition, HMMs are a key technology (the sequential order of phonemes is crucial).
More about HMMs on the speech recognition lectures.