
Audio signal classification

Contents:
- Introduction to pattern classification
- Features
- Feature selection
- Classification methods

1 Introduction to pattern classification

Refresher of the basic concepts of pattern classification: introductory slides accompanying the book Pattern Classification by Duda, Hart and Stork, John Wiley & Sons, 2000.

Audio signal classification

Concrete problems:
- musical genre classification, musical instrument recognition
- speaker recognition, language recognition
- audio context recognition (office vs. restaurant vs. street), video segmentation based on audio, sound effects retrieval

Closely related to sound source recognition in humans
- source recognition includes segmentation (perceptual sound separation) in polyphonic signals

Many efficient methods have been developed in the speech / speaker recognition field.

Example: musical instrument recognition

Different acoustic properties of sound sources make them recognizable
- the properties result from the sound production mechanism

But: even a single source produces varying sounds
- we must find something characteristic to the source, not to sound events alone: source invariants

Figure: examples of a flute (left) and a clarinet (right); what to measure? [Eronen & Klapuri, 2000]


Supervised classification system

Simplified:
A. Feature extraction
B. Classification (using models)

1. training phase: learn from examples, collect statistics for feature values
2. classify new instances based on what was learnt in the training phase

[Block diagram: signal → segmentation → feature extraction → model training (producing models) / classify (using models) → recognition result]


2 Feature extraction

Feature extraction is inevitable
- the time-domain signal as such contains too much irrelevant data to use it directly for classification

Using appropriate features is crucial for successful classification
- lousy features (with little discriminating power) can hardly be compensated for by the classifier


2.1 Spectral features

= Features that characterize the short-time spectrum
Usually the most successful features for audio classification.
Also the most general-purpose features for different problems
- in contrast, temporal features are typically different for musical genre recognition, instrument recognition, or speaker recognition, for example

In extracting spectral features:
- the phase spectrum is typically discarded (a 50% reduction of information)
- the spectral fine structure is usually discarded in most tasks (an even larger reduction of irrelevant information)
- only the coarse spectral energy distribution is retained
  - effective for general audio classification
  - the basis for speech and speaker recognition

Spectral features

Cepstral coefficients
Cepstral coefficients c(k) are a very convenient way to model the spectral energy distribution:

c(k) = IDFT{ log |DFT{x(n)}| }

where DFT denotes the Fourier transform and IDFT its inverse.
In Matlab (see the sketch below), only the real part, real( ), is taken because finite numerical precision produces an infinitesimally small imaginary part.
Cepstrum coefficients are calculated in short frames over time
- they can be further modeled by calculating e.g. the mean and variance of each coefficient over time
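The Matlab snippet referenced above did not survive the PDF extraction; the following one-liner is a minimal sketch of the formula (x is one analysis frame as a column vector; adding eps to guard against log of zero is my addition, not from the slide):

    % Real cepstrum of one frame: c = IDFT{ log |DFT{x}| }.
    % real() strips the infinitesimally small imaginary part
    % that remains due to finite numerical precision.
    c = real(ifft(log(abs(fft(x)) + eps)));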


Spectral features

Cepstral coefficients
Only the first M cepstrum coefficients are used as features
- all coefficients together model the precise spectrum
- the coarse spectral shape is modeled by the first coefficients
- precision is selected by the number of coefficients taken
- the first coefficient (energy) is usually discarded
Usually M = fs / (2000 Hz) is a good first guess for M.
Figure: piano spectrum (thin black line) and spectrum modeled with the first 50 coeffs (thick blue line) or 10 coeffs (red broken line); axes: frequency (Hz) vs. log magnitude.

Spectral features

Cepstral coefficients
A drawback of the cepstral coefficients: linear frequency scale.
Perceptually, the frequency ranges 100...200 Hz and 10...20 kHz are approximately equally important
- the standard cepstral coefficients do not take this into account
- a logarithmic frequency scale would be better

Why mimic perception?
- typically we want to classify sounds according to perceptual (dis)similarity
- perceptually relevant features often lead to robust classification, too

Desirable for features: a small change in the feature vector should correspond to a small perceptual change (and vice versa).
Mel-frequency cepstral coefficients fulfill this criterion.

Spectral features

Frequency and magnitude warping
- Linear scale: usually hard to see anything
- Log-frequency: each octave is approximately equally important perceptually
- Log-magnitude: the perceived change from 50 dB to 60 dB is about the same as from 60 dB to 70 dB

Spectral features

Mel-frequency cepstral coefficients
Improves over the standard cepstral coefficients.
Mel frequency scale:

Mel(f) = 2595 · log10(1 + f/700)

Block diagram (calculations for one analysis frame; a Matlab sketch follows below):
Signal → Frame blocking, windowing → Discrete Fourier transform → Simulate Mel filterbank → Power at the output of each filter → Discrete cosine transform → MFCC
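A sketch of the block diagram in Matlab, one step per line. The Mel filterbank matrix H (one triangular filter per row, sized to match the nonnegative-frequency bins) is assumed to be built beforehand, and the constants (512-point FFT, 13 coefficients) are illustrative choices, not values from the slide:

    % One analysis frame x -> MFCC vector (sketch).
    N    = length(x);
    w    = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));  % Hamming window
    X    = fft(x(:) .* w, 512);                   % discrete Fourier transform
    P    = abs(X(1:257)).^2;                      % power spectrum
    E    = log(H * P + eps);                      % log power at each Mel filter
    M    = 13; Q = numel(E);
    dctm = cos(pi/Q * (0:M-1)' * ((1:Q) - 0.5));  % DCT-II basis matrix
    mfcc = dctm * E;                              % discrete cosine transform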


Spectral features

Mel-frequency cepstral coefficients
Some reasons why MFCCs are successful:
- Mel-frequency scale and log of power: a large change in the MFCC vector corresponds to a large perceptual change
- discrete cosine transform: drops the spectral fine structure and decorrelates the features

Delta-MFCCs are usually concatenated to the feature vector to include information about the temporal variation of features
- let v_t be the feature vector at frame t; the first-order delta feature is then

Δ_t = (v_{t+1} − v_{t−1}) / 2

- the delta features are added after the static features in the feature vector (see the sketch below)
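A minimal Matlab sketch of the delta computation, assuming V is an M-by-T matrix of static MFCCs with one column per frame (replicating the edge frames is an assumption, not from the slide):

    % First-order deltas: D(:,t) = (V(:,t+1) - V(:,t-1)) / 2.
    Vprev = V(:, [1 1:end-1]);   % previous frame, first one replicated
    Vnext = V(:, [2:end end]);   % next frame, last one replicated
    D = (Vnext - Vprev) / 2;
    F = [V; D];                  % deltas appended after the static features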

Spectral features

Other spectral features
Spectral centroid (correlates with brightness):

SC = Σ_{k=1}^{K} k·|X(k)|² / Σ_{k=1}^{K} |X(k)|²

where |X(k)| is the magnitude of frequency component k.
Bandwidth:

BW² = Σ_{k=1}^{K} (k − SC)²·|X(k)|² / Σ_{k=1}^{K} |X(k)|²
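Both features are direct sums over the magnitude spectrum; a Matlab sketch, assuming Xmag holds the magnitudes |X(k)| of one frame (both results are in frequency-bin units here):

    % Spectral centroid and bandwidth from a magnitude spectrum.
    k  = (1:numel(Xmag))';
    P  = Xmag(:).^2;
    SC = sum(k .* P) / sum(P);                  % centroid (in bins)
    BW = sqrt(sum((k - SC).^2 .* P) / sum(P));  % bandwidth (in bins)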


2.2 Temporal features

Characterize the temporal evolution of an audio signal.
Temporal features tend to be more task-specific.
Mid-level representation for feature extraction:
- power envelope of the signal sampled at a 100 Hz...1 kHz rate
- or: power envelopes of the signal at 3...40 subbands

Note: also these drop the phase spectrum and all spectral fine structure!


Temporal features

Musical instrument classification:
- rise time: time interval between the onset and the instant of maximal amplitude
- onset asynchrony at different frequencies
- frequency modulation: amplitude and rate (vibrato: 4...8 Hz)
- amplitude modulation: amplitude and rate (tremolo; roughness: fast amplitude modulation at a 15...300 Hz rate)

General audio classification:
- e.g. amplitude modulation
- also the delta-MFCCs can be seen as temporal features


2.3 Features calculated in the time domain

Sometimes (computationally) ultra-light features are needed and the Fourier transform is avoided.
Zero-crossing rate:

ZCR = (1/N) Σ_{n=2}^{N} |sign(x(n)) − sign(x(n−1))|

where sign(x) = 1 for x > 0, 0 for x = 0, and −1 for x < 0.
Figure: ZCR correlates strongly with spectral centroid (~ brightness); axes: zero-crossing rate vs. spectral centroid. [Peltonen, MSc thesis, 2001]

Short-time energy:

STE = (1/N) Σ_{n=1}^{N} x(n)²

- a lousy feature as such, but different statistics of STE are useful
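A Matlab sketch of the two formulas for one frame x of length N (Matlab's built-in sign() already matches the definition above):

    % Zero-crossing rate and short-time energy of one frame.
    x = x(:); N = numel(x);
    s = sign(x);
    ZCR = sum(abs(s(2:N) - s(1:N-1))) / N;
    STE = sum(x.^2) / N;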


More features...

Table: some features for musical instrument recognition. [Eronen & Klapuri, 2000]
One can measure many things...


3 Feature selection and transformation 1)

1) This section is based on the lecture notes of Antti Eronen.

Let us consider methods to select a set of features from a larger set of available features.
The number of features at the disposal of the designer is usually very large (tens or even hundreds).
Reasons for reducing the number of features:
- fewer features → simpler models → less training data needed (remember the "curse of dimensionality")
- amount of training data: the higher the ratio of the number of training patterns N to the number of free model parameters, the better the generalisation properties
- computational complexity; correlating (redundant) features


Feature selection

Feature selection (or, reduction) problem: reduce the dimension of the feature vectors while preserving as much class-discriminatory information as possible.

Good features result in a large between-class distance and a small within-class variance.

Approaches:
- examine features individually and discard those with little discrimination ability
- a better way is to examine the features in combinations
- linear (or nonlinear) transformation of the feature vector


3.1 Data normalization

Features with large values have a larger influence on the cost function than features with small values (e.g. in a Euclidean distance).
For the N available data points of the dth feature (d = 1, 2, ..., D):

mean: m_d = (1/N) Σ_{n=1}^{N} x_{nd}

variance: σ_d² = (1/(N−1)) Σ_{n=1}^{N} (x_{nd} − m_d)²

normalized feature: x̂_{nd} = (x_{nd} − m_d) / σ_d

The resulting normalized features have zero mean and unit variance
- if desired, feature weighting can be performed separately after normalization
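A Matlab sketch of the normalization, assuming X is an N-by-D matrix with one data point per row (the implicit expansion in the last line requires Matlab R2016b or newer):

    % Z-score normalization: zero mean, unit variance per feature.
    m  = mean(X, 1);          % 1-by-D feature means
    s  = std(X, 0, 1);        % 1-by-D standard deviations, 1/(N-1) norm.
    Xn = (X - m) ./ s;        % normalized features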


Mean, variance, correlation
Figure: illustration of the mean, variance, and correlation of two features x1 and x2.


3.2 Principal component analysis

Idea: find a linear transform A such that when applied to the feature vectors x, the resulting new features y are uncorrelated,
i.e., the covariance matrix of the feature vectors y is diagonal.
Define the covariance matrix of x by (here we use the training data!)

C_x = E{ (x − m)(x − m)^T }

Because C_x is symmetric, it has D orthonormal eigenvectors, and hence it can be diagonalised:

A C_x A^T = Λ

Above, A is a D-by-D matrix whose rows are formed from the eigenvectors of C_x, and Λ is a diagonal matrix with the corresponding eigenvalues on its diagonal.
Now the transform can be written as

y = A(x − m)

where m is the mean of the feature vectors x.

Principal component analysis

Eigenvectors

Recall the eigenvector equations:

C e_d = λ_d e_d, i.e., (C − λ_d I) e_d = 0

where the e_d are the eigenvectors, the λ_d are the eigenvalues, and C is a square matrix.
In Matlab: see the sketch below.
The transform matrix A in y = A(x − m) consists of the eigenvectors of C_x: A = [e_1, e_2, ..., e_D]^T.
Since C_x is real and symmetric, A is orthonormal (A A^T = I, A^T = A^{−1}).
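The Matlab snippet on the original slide was lost in extraction; presumably it called eig. A sketch of the full PCA transform for an N-by-D data matrix X:

    % PCA: y = A(x - m), where the rows of A are eigenvectors of Cx.
    m         = mean(X, 1);
    Cx        = cov(X);                   % D-by-D covariance estimate
    [E, L]    = eig(Cx);                  % columns of E are eigenvectors
    [lam, ix] = sort(diag(L), 'descend');
    A         = E(:, ix)';                % largest eigenvalue first
    Y         = (X - m) * A';             % decorrelated features
    % Whitening (unit variance): Yw = (X - m) * (diag(1./sqrt(lam)) * A)'.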


Principal component analysis

Transformed features

The transformed feature vectors y = A(x − m) have zero mean, and their covariance matrix (ignoring the mean terms) is

C_y = E{ A(x − m)(A(x − m))^T } = A C_x A^T = Λ

i.e., C_y is diagonal and the features are thus uncorrelated.
Furthermore, let us denote by Λ^{1/2} the diagonal matrix whose elements are the square roots of the eigenvalues of C_x. Then the transformed vectors

y = Λ^{−1/2} A(x − m)

have uncorrelated elements with unit variance:

C_y = (Λ^{−1/2} A) C_x (Λ^{−1/2} A)^T = ... = I.

Principal component analysis

Transformed features
Figure: mean removal, y = x − m.

Principal component analysis

Transformed features
Left: decorrelated features; the basis vectors are orthogonal, e_1^T e_2 = 0.
Right: variance scaling; the basis vectors are orthonormal, e_1^T e_1 = e_2^T e_2 = 1.


Principal component analysis

The transformation can be used to reduce the amount of needed data.
Sort the rows of A so that the first row corresponds to the largest eigenvalue and the last row to the smallest eigenvalue, and keep only the M principal components to create an M-by-D matrix A_M for the data projection:

y = A_M (x − m)

We obtain an approximation for x, which essentially is the projection of x onto the subspace spanned by the M orthonormal eigenvectors.
This projection is optimal in the sense that it minimises the mean square error (MSE)

E{ ||x − x̂||² }

over all approximations with M components.


Principal component analysis

Practical use of PCA
PCA is useful for preprocessing features before classification.
Note: PCA does not take the different classes into account; it only considers the properties of the features themselves.
If two features x_i and x_j are redundant, then one eigenvalue of C_x is very small and one dimension can be dropped
- we do not need to choose between two correlating features: it is better to do the linear transform and then drop the least significant dimension; that way both of the correlating features are utilized

You need to scale the feature variances before the eigenanalysis, since the eigenvalues are proportional to the numerical ranges of the features
- procedure: 1. normalize, 2. PCA, 3. check whether there are unnecessary dimensions

Example of correlating features: ZCR and spectral centroid.


3.3 Class separability measures

PCA does not take the different classes into account
- features remaining after PCA are efficient in characterising sounds but do not necessarily discriminate between different classes

Now we will look at a set of simple criteria that measure the discriminating properties of feature vectors.
Within-class scatter matrix (K different classes):

S_w = Σ_{k=1}^{K} p(ω_k) C_k

where C_k = E{ (x − m_k)(x − m_k)^T } is the covariance matrix for class k and p(ω_k) is the prior probability of class k, i.e., p(ω_k) = N_k / N, where N_k is the number of samples from class k (out of N samples in total).
trace{S_w} is the sum of the diagonal elements of S_w and here quantifies the average within-class variance of the features over all classes.


Class separability measures

Between-class scatter matrix:

S_b = Σ_{k=1}^{K} p(ω_k) (m_k − m_0)(m_k − m_0)^T

where m_k is the mean of class k and m_0 is the global mean vector

m_0 = Σ_{k=1}^{K} p(ω_k) m_k

trace{S_b} is a measure of the average distance of the mean of each class from the global mean value over all classes.


Class separability measures

Criterion:

J_1 = trace{S_b} / trace{S_w}

It obtains large values when
- the samples in the D-dimensional space are well clustered around their mean within each class (small trace{S_w})
- the clusters of the different classes are well separated (large trace{S_b})
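A Matlab sketch that estimates S_w, S_b, and J_1 from labeled training data X (N-by-D) with class labels y (N-by-1); sample-based estimates replace the expectations above:

    % Class separability J1 = trace(Sb)/trace(Sw).
    cls = unique(y); D = size(X, 2);
    m0  = mean(X, 1);
    Sw  = zeros(D); Sb = zeros(D);
    for k = 1:numel(cls)
        Xk = X(y == cls(k), :);
        pk = size(Xk, 1) / size(X, 1);          % prior p(class k) = Nk/N
        mk = mean(Xk, 1);
        Sw = Sw + pk * cov(Xk, 1);              % within-class scatter
        Sb = Sb + pk * (mk - m0)' * (mk - m0);  % between-class scatter
    end
    J1 = trace(Sb) / trace(Sw);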


3.4 Feature subset selection

Problem: how to select a subset of M features from the D originally available so that
1. we reduce the dimensionality of the feature vector (M < D)
2. we optimize the desired class separability criterion

To find the optimal subset of features, we should form all possible combinations of M features out of the D originally available
- the best combination is then selected according to any desired class separability measure J

In practice, it is not possible to evaluate all the possible feature combinations!

Feature vector selection

Sequential backward selection (SBS)

Particularly suitable for discarding a few worst features:
1. Choose a class separability criterion J, and calculate its value for the feature vector which consists of all available features (length D).
2. Eliminate one feature, and for each possible resulting combination (of length D − 1) compute J. Select the best combination.
3. Continue this for the remaining features, and stop when you have obtained the desired dimension M.

This is a suboptimal search procedure, since we cannot guarantee that the optimal (r − 1)-dimensional vector has to originate from the optimal r-dimensional one!

Feature vector selection

Sequential forward selection (SFS)
Particularly suitable for finding a few "golden" features (a sketch follows below):

1. Compute the criterion value J for each individual feature. Select the best.
2. Form all possible two-dimensional vectors that contain the winner from the previous step. Calculate the criterion for each vector and select the best.
3. Continue adding features one at a time, always taking the one that results in the largest value of the criterion J.
4. Stop when the desired vector dimension M is reached.

Both SBS and SFS suffer from the nesting effect: once a feature is discarded in SBS (selected in SFS), it cannot be reconsidered again (discarded in SFS).
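A Matlab sketch of SFS; J is assumed to be a function handle, e.g. the J1 computation above wrapped as J = @(X, y) ..., and M is the target dimension:

    % Sequential forward selection of M features out of D.
    D = size(X, 2);
    sel = []; rem = 1:D;
    for step = 1:M
        best = -Inf;
        for f = rem
            v = J(X(:, [sel f]), y);     % criterion with feature f added
            if v > best, best = v; bestf = f; end
        end
        sel = [sel bestf];               % the winner is kept for good
        rem = setdiff(rem, bestf);       % (this causes the nesting effect)
    end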


3.5 Linear transforms

Feature generation by combining all the D features to obtain M (with M < D) new features.
Find a linear transform A that maps the original feature vectors x to M-dimensional feature vectors y such that

y = Ax

Objectives:
1. reduce the dimensionality of the feature vector
2. optimize the desired class separability criterion

Note the similarity with PCA; the difference is that here we consider the class separability.


Linear transforms

Linear discriminant analysis (LDA)

Choose A to optimize

J_2 = trace{ S_w^{−1} S_b }

Very similar to PCA. The difference is that here the eigenanalysis is performed for the matrix S_w^{−1} S_b instead of the global covariance.
The rows of the transform matrix are obtained by choosing the eigenvectors corresponding to the M largest eigenvalues of S_w^{−1} S_b.
This leads to a feature vector dimension M ≤ K − 1, where K is the number of classes
- this is because the rank of S_w^{−1} S_b is K − 1
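A Matlab sketch of the LDA transform, reusing the Sw and Sb computed in the J1 sketch above; real() is a guard against tiny imaginary parts, since Sw\Sb is not symmetric:

    % LDA: keep the eigenvectors of inv(Sw)*Sb with the largest eigenvalues.
    [E, L]  = eig(Sw \ Sb);
    [~, ix] = sort(real(diag(L)), 'descend');
    M       = numel(unique(y)) - 1;     % rank of inv(Sw)*Sb is K-1
    A       = real(E(:, ix(1:M)))';     % M-by-D transform matrix
    Y       = X * A';                   % y = A x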


4 Classification methods

The goal is to classify previously unseen instances after the classifier has learned the training data.
Supervised vs. unsupervised classification:
- supervised: the classes of the training instances are told during learning
- unsupervised: clustering into hitherto unknown classes
Supervised classification is the focus here.

Example on the next page:
- data is represented as M-dimensional feature vectors (here M = 2)
- 18 different classes
- several training samples from each class

Classification methods

Example
Figure: training samples from 18 classes as two-dimensional feature vectors, and a new instance to classify.


4.1 Classification by distance functions

Minimum distance classification:
- calculate the distance D(x, y_i) between the unknown sample x and all the training samples y_i from all classes, for example the Euclidean distance

D_E(x, y) = sqrt( (x − y)^T (x − y) )

- choose the class according to the closest training sample

k-nearest-neighbour (k-NN) classifier:
- pick the k nearest neighbours of x and then choose the class which was most often picked

These are "lazy" classifiers:
- training is trivial: just store the training samples y_i
- classification gets complex with a lot of training data: the distance to all training samples must be measured
- computational efficiency can be improved by storing only a sufficient number of class prototypes for each class, or by using efficient indexing techniques (locality-sensitive hashing)


Classification by distance functions

Distance metrics

The choice of the distance metric is very important.
Euclidean distance metric:

D_E(x, y) = sqrt( (x − y)^T (x − y) ) = sqrt( Σ_d (x_d − y_d)² )

- sqrt is order-preserving, thus it is equivalent to minimize

D_E²(x, y) = (x − y)^T (x − y)

Mahalanobis distance between x and y:

D_M(x, y) = sqrt( (x − y)^T C^{−1} (x − y) )

where C is the covariance matrix of the training data.
The Mahalanobis distance D_M is generally a good choice.
Comment: k-NN is a decent classifier, yet very easy to implement (see the sketch below).
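A Matlab sketch of 1-NN classification with both metrics; Xtr is the N-by-D training data, ytr the labels, and x the test vector. Extending to k-NN means sorting the distances, taking the k smallest, and voting:

    % 1-NN with squared Euclidean and Mahalanobis distances.
    dif = Xtr - x(:)';                 % N-by-D differences
    dE2 = sum(dif.^2, 2);              % squared Euclidean distances
    C   = cov(Xtr);                    % covariance of the training data
    dM2 = sum((dif / C) .* dif, 2);    % squared Mahalanobis distances
    [~, i] = min(dM2);                 % nearest training sample
    label  = ytr(i);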


Classification by distance functions

Example: decision boundaries for a 1-NN classifier
Left: Euclidean distance, axes not scaled; undesired weighting (elongated horizontal contours).
Right: Euclidean distance, axes scaled. [Peltonen, MSc thesis, 2001]


Classification by distance functions

Example: decision boundaries for a 1-NN classifier
Left: Euclidean distance, axes scaled.
Right: the Mahalanobis distance scales the axes and rotates the feature space to decorrelate the features. [Peltonen, MSc thesis, 2001]


4.2 Statistical classification

Idea: interpret the feature vector x as a random variable whose distribution depends on the class.
Bayes' formula is a very important tool:

P(ω_i | x) = P(x | ω_i) P(ω_i) / P(x)

where P(ω_i | x) denotes the probability of class ω_i given an observed feature vector x.
Note: we do not know P(ω_i | x), but P(x | ω_i) can be estimated from the training data.
Moreover, since P(x) is the same for all classes, we can perform classification based on (MAP classifier):

ω* = argmax_i P(x | ω_i) P(ω_i)

A remaining task is to parametrize and learn P(x | ω_i) and to define (or learn) the prior probabilities P(ω_i).


Statistical classification

Gaussian mixture model (GMM)
GMMs are very convenient for representing P(x | ω_i), i.e., the multidimensional distribution of the features for class ω_i.
Weighted sum of multidimensional Gaussian distributions:

P(x | ω_i) = Σ_{q=1}^{Q} w_{i,q} N(x; μ_{i,q}, Σ_{i,q})

where the w_{i,q} are weights, N(x; μ, Σ) is a Gaussian distribution, and Σ_q w_{i,q} = 1.
Figure: a GMM in one dimension. [Heittola, MSc thesis, 2004]

Statistical classification

Gaussian mixture model (GMM)
GMMs can fit any distribution given a sufficient number of Gaussians
- successful generalization requires limiting the number of components

GMM is a parametric model
- parameters: weights w_{i,q}, means μ_{i,q}, covariance matrices Σ_{i,q}
- diagonal covariance matrices can be used if the features are decorrelated: far fewer parameters

The multidimensional Gaussian distribution is

N(x; μ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp( −(1/2)(x − μ)^T Σ^{−1} (x − μ) )

The parameters of the GMM of each class are estimated so as to maximize P(x | ω_i), i.e., the probability of the training data of each class ω_i
- iterative EM algorithm
- toolboxes: insert training data → get the model → use it to classify new instances

Statistical classification

MAP classification

After having learned P(x | ω_i) and P(ω_i) for each class, maximum a posteriori classification of a new instance x:

ω* = argmax_i P(x | ω_i) P(ω_i)

Usually we have a sequence of feature vectors x_{1:T}, and then

ω* = argmax_i P(ω_i) Π_{t=1}^{T} P(x_t | ω_i)

(assuming that the feature vectors are i.i.d. given the class). Since the logarithm is order-preserving, we can equivalently maximize

ω* = argmax_i [ log P(ω_i) + Σ_{t=1}^{T} log P(x_t | ω_i) ]

A sketch of this rule follows below.
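A Matlab sketch of the log-domain MAP rule for a feature sequence X (D-by-T), with one diagonal-covariance GMM per class. The struct layout (fields w, mu, v, and prior in a cell array gmm) is a hypothetical choice for this sketch, not from the slides:

    % score(i) = log P(w_i) + sum_t log P(x_t | w_i).
    score = zeros(1, numel(gmm));
    for i = 1:numel(gmm)
        ll = 0;
        for t = 1:size(X, 2)
            p = 0;
            for q = 1:numel(gmm{i}.w)     % sum over mixture components
                d = X(:,t) - gmm{i}.mu(:,q);
                v = gmm{i}.v(:,q);        % diagonal covariance entries
                p = p + gmm{i}.w(q) * exp(-0.5*sum(d.^2./v)) ...
                        / sqrt(prod(2*pi*v));
            end
            ll = ll + log(p + eps);       % sum of frame log-likelihoods
        end
        score(i) = log(gmm{i}.prior) + ll;
    end
    [~, winner] = max(score);             % MAP class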

Statistical classification

Hidden Markov model (HMM)
HMMs account for the time-varying nature of sounds
- probability of observation x in state j: typically a GMM model b_j(x)
- transition probabilities between states: matrix A with a_{ij} = P(q_t = i | q_{t−1} = j)
- the state variable is hidden, and usually not of interest in classification

Train an HMM for each class. HMMs allow computing

P(ω_i | x_{1:T}) = P(x_{1:T} | ω_i) P(ω_i) / P(x_{1:T})

where x_{1:T} is an observation sequence.
Figure: [Heittola, MSc thesis, 2004]


Statistical classification

Hidden Markov model (HMM)

Hidden Markov models are a generalization of GMMs
- GMM = HMM with only one state (when using GMMs as state-conditional observation distributions)

HMMs do not necessarily perform better than GMMs
- HMMs are useful only if some temporal information is truly learned into the state transition probability matrix, e.g. in speech
- example: genre classification, general audio classification

In speech recognition, HMMs are a key technology (the sequential order of phonemes is crucial).
More about HMMs on the speech recognition lectures.