
Research Article
Music Emotion Detection Using Hierarchical Sparse Kernel Machines

Yu-Hao Chin, Chang-Hong Lin, Ernestasia Siahaan, and Jia-Ching Wang

Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan

Correspondence should be addressed to Jia-Ching Wang; jcw@csie.ncu.edu.tw

Received 30 August 2013; Accepted 17 October 2013; Published 3 March 2014

Academic Editors: B.-W. Chen, S. Liou, and C.-H. Wu

Copyright © 2014 Yu-Hao Chin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Scientific World Journal, Volume 2014, Article ID 270378, 7 pages, http://dx.doi.org/10.1155/2014/270378

For music emotion detection, this paper presents a music emotion verification system based on hierarchical sparse kernel machines. With the proposed system, we intend to verify whether a music clip possesses the happiness emotion or not. There are two levels in the hierarchical sparse kernel machines. In the first level, a set of acoustical features is extracted, and principal component analysis (PCA) is implemented to reduce the dimension. The acoustical features are utilized to generate the first-level decision vector, which is a vector with each element being the significant value of an emotion. The significant values of eight main emotional classes are utilized in this paper. To calculate the significant value of an emotion, we construct its 2-class SVM with the calm emotion as the global (non-target) side of the SVM. The probability distributions of the adopted acoustical features are calculated, and the probability product kernel is applied in the first-level SVMs to obtain the first-level decision vector feature. In the second level of the hierarchical system, we construct a single 2-class relevance vector machine (RVM) with happiness as the target side and the other emotions as the background side of the RVM. The first-level decision vector is used as the feature, with a conventional radial basis function kernel. The happiness verification threshold is built on the probability value. In the experimental results, the detection error tradeoff (DET) curve shows that the proposed system performs well in verifying whether a music clip reveals the happiness emotion.

1. Introduction

Listening to music plays an important role in people's daily lives, and people usually gain much benefit from it. Besides leisure, music listening has other application areas such as education, inspiration, therapy, and marketing [1]. Sometimes people try to reach a particular emotional state by listening to music. In such a situation, however, they need to choose music that induces that particular feeling, and they would have to listen to each song at least once to know its emotion, a process that takes much time. If a computer can detect the emotional content of music, this problem is solved. Beyond this application, music emotion detection technology can be applied to other areas as well, such as music research, music recommendation, and music retrieval. Because of the vast potential of music emotion detection technology, many researchers focus on detecting emotion in music.

Much research on music emotion detection has been proposed [2]. Existing methods can be divided into two main categories: the dimensional approach and the categorical approach. The dimensional approach defines an emotion plane and views it as a continuous emotion state space; each position on the plane represents an emotion state [3], and acoustical features can be mapped to a point on the plane [4]. The categorical approach works by categorizing emotions into a number of emotion classes, each of which represents an area on the emotion plane [3]. Different from the dimensional approach, each emotion class is defined clearly. In the training phase, acoustical features are directly used to train classifiers to recognize the corresponding emotion classes [5]. The method proposed in this paper belongs to the second type.

In previous music emotion detection studies, many machine learning algorithms have been applied. In [5], features were mapped into emotion categories on the emotion plane, and two support vector regressors were trained to predict the arousal and valence values. In [6], a hierarchical framework was adopted to detect emotion from acoustic music data; the method has the advantage of emphasizing the proper features in different detection tasks. In [7], a support vector machine was applied to detect emotional content in music. In [8], kernel-based class separability was used to weight features; after feature selection, principal component analysis and linear discriminant analysis were applied, and a k-nearest neighbor (KNN) classifier was then implemented. In this paper, a music emotion detection system is proposed. The system establishes a hierarchical sparse kernel machine. In the first level, eight 2-class SVM models are trained, with the eight emotion classes as the target sides, respectively. It is noted that emotion perception is usually based not on a single acoustical feature but on a combination of acoustical features [4, 9]. This paper adopts an acoustical feature set comprising root mean square energy (RMS energy), tempo, chromagram, MFCCs, spectrum centroid, spectrum spread, and the ratio of a spectral flatness measure to a spectral center (RSS). Each of them is normalized. In the second level of the hierarchical sparse kernel machines, a 2-class relevance vector machine (RVM) model with happiness as the target side and the other emotions as the background side is trained. The first-level decision vector is used as the feature in this level.

Figure 1: Block diagram of the proposed system. (The diagram shows an emotional music database passing through audio preprocessing and time-frequency processing; extraction of spectrum, tempo estimation, chromagram, RMS energy, MFCC, spectrum centroid, spectrum spread, and RSS; principal component analysis; first-level decision vector extraction by support vector machines with the probability product kernel; and a relevance vector machine whose training/testing output is the verified emotion.)

The rest of this paper is organized as follows. The system overview is described in Section 2. The features and the first-level decision vector extraction are described in Section 3. Principal component analysis is described in Section 4. SVM and RVM are introduced in Section 5. Section 6 shows our experimental results, and the conclusion is given in Section 7.

2. System Overview

The block diagram of the proposed system is presented in Figure 1. The system mainly comprises two levels of sparse kernel machines. For the first-level SVMs, we use a set of acoustical features which includes RMS energy, tempo, chromagram, MFCCs, spectrum centroid, spectrum spread, and RSS. In Table 1, the adopted acoustical features are classified into four main types, that is, dynamic, rhythm, timbre, and tonality.

Table 1: The proposed acoustical feature set.

Feature class   Feature name (dimension of feature)
Dynamic         RMS energy (1)
Rhythm          Tempo (1)
Timbre          MFCCs (13), spectrum centroid (1), spectrum spread (1), RSS (1)
Tonality        Chromagram (12)

Because each feature's scale is different, normalization of the whole feature set is performed [10]. After normalization, eight SVM models are trained to transform the acoustical features into emotion profile features. Each of the eight SVM models is trained and tested using the probability product kernel. We use the first-level decision vectors generated from the angry, happy, sad, relaxed, pleased, bored, nervous, and peaceful emotion classes. To calculate the value corresponding to an emotion in the emotion profile features, we construct its 2-class SVM with the calm emotion as the background side of the SVM. For the RVM, a conventional radial basis function kernel is used, and the first-level decision vector extracted in the first level is utilized as the feature. To verify the happiness emotion, a 2-class RVM with happiness as the target side and the other emotions as the background side is constructed. For a tested music clip, the probability value obtained from this 2-class RVM is used to judge whether the clip belongs to the happiness emotion or not.
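The paper does not spell out its normalization scheme; for concreteness, the sketch below (in Python, with illustrative names) applies a per-dimension z-score, a standard choice when features live on very different scales.

```python
import numpy as np

def zscore_normalize(F):
    """Per-dimension z-score normalization of a feature matrix F
    (rows = music clips, columns = the 30 acoustical feature dimensions).
    This is an assumed scheme; the paper only states that normalization
    is performed."""
    return (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-12)
```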

3. Extraction of Acoustical Features and First-Level Decision Vector Feature

In the 2-level hierarchical sparse kernel machines, the first-level SVMs use acoustical features, while the second-level RVM adopts the first-level decision vector. For acoustical features, the proposed system extracts RMS energy, tempo, chromagram, MFCCs, spectrum centroid, spectrum spread, and RSS. The extraction of these acoustical features, as well as of the first-level decision vectors, is described in the following.


3.1. Extraction of Acoustical Features

3.1.1. RMS Energy. RMS (root mean square) energy computes the global energy of the input signal x [11]. The operation is defined as follows:

$X_{\mathrm{RMS}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}$, (1)

where n is the signal's length (in hundredths of a second by default).
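As a concrete illustration, Eq. (1) reduces to a few lines of NumPy; the authors use the MIR Toolbox [11], so this standalone sketch is only an equivalent restatement.

```python
import numpy as np

def rms_energy(x):
    """Global RMS energy of a mono signal x, as in Eq. (1)."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x ** 2))

# A 1 kHz sine sampled at 16 kHz has RMS amplitude 1/sqrt(2) ~ 0.707
t = np.arange(16000) / 16000.0
print(rms_energy(np.sin(2 * np.pi * 1000.0 * t)))
```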

3.1.2. Tempo. Many tempo estimation methods have been proposed. The estimation of tempo is based on detecting periodicities in a range of BPMs [12]. First, significant onset events are detected in the frequency domain [11]. Then, the events that best represent the tempo of the song are found, which means choosing the maximum periodicity score for each frame separately.
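The paper delegates this step to the MIR Toolbox [11]; as a rough, hedged sketch of the idea, periodicities in an onset-strength envelope can be scored with autocorrelation (the names `onset_env` and `frame_rate` below are illustrative, not the toolbox's API).

```python
import numpy as np

def estimate_tempo(onset_env, frame_rate, bpm_range=(40.0, 200.0)):
    """Pick the BPM whose lag maximizes the autocorrelation of an
    onset-strength envelope; a simplified periodicity score."""
    env = np.asarray(onset_env, dtype=float)
    env = env - env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    lags = np.arange(1, len(ac))               # skip the zero lag
    bpms = 60.0 * frame_rate / lags
    mask = (bpms >= bpm_range[0]) & (bpms <= bpm_range[1])
    best_lag = lags[mask][np.argmax(ac[1:][mask])]
    return 60.0 * frame_rate / best_lag
```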

3.1.3. Chromagram. Chroma, which is also called the harmonic pitch class profile, has a strong relationship with the structure of music [13]. A chromagram is a joint distribution of signal strength over the variables of time and chroma. Like the short-time Fourier transform, it is a frame-based representation of audio. In music clips, frequency components belonging to the same pitch class are extracted by the chromagram and mapped to a 12-dimensional representation covering the pitch classes C, C#, D, D#, E, F, F#, G, G#, A, A#, and B. The chromagram can present the distribution of energy along the pitches or pitch classes [11, 14]. In [14], the chromagram is defined as a remapping of the time-frequency distribution. It is extracted by

$v(t, k) = \sum_{n \in S_k} \frac{X_t(n)}{Q_k}, \quad k \in \{0, 1, 2, \ldots, 11\}$, (2)

where $X_t(n)$ is the logarithmic magnitude of the discrete Fourier transform of the t-th frame, $S_k$ is the subset of the discrete frequency space belonging to pitch class k, and $Q_k$ is the number of elements in that subset [15]. Figure 2 exemplifies the chromagram of a piece of music.
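As a minimal sketch of Eq. (2), the following folds one frame's log-magnitude spectrum into 12 pitch classes; the reference frequency 261.63 Hz (C4) and the 20 Hz floor are illustrative assumptions.

```python
import numpy as np

def chromagram_frame(X_t, freqs, fref=261.63):
    """Fold one frame's log-magnitude spectrum X_t into 12 pitch classes,
    following Eq. (2); freqs holds the DFT bin frequencies in Hz."""
    chroma = np.zeros(12)
    counts = np.zeros(12)
    valid = freqs > 20.0                         # drop DC and sub-audio bins
    pitch = 12.0 * np.log2(freqs[valid] / fref)  # semitones relative to C4
    k = np.mod(np.round(pitch), 12).astype(int)
    np.add.at(chroma, k, X_t[valid])             # sum over the subset S_k
    np.add.at(counts, k, 1)                      # |S_k|, i.e., Q_k
    return chroma / np.maximum(counts, 1)
```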

3.1.4. Mel-Frequency Cepstral Coefficients (MFCCs). After a signal is digitized, a large amount of information is not needed yet costs plenty of storage space; the power spectrum is often adopted to encode the signal and alleviate this problem [16]. It is noted that MFCCs behave similarly to the human auditory perception system. The feature is adopted in various research topics, including speaker recognition, speech recognition, and music emotion recognition. For example, Cooper and Foote extracted MFCCs from music signals and found that MFCCs resemble music timbre expression [17]. In [18], MFCCs were also shown to perform well in music recommendation.

Figure 2: Example of a chromagram from a piece of music (chroma class versus time in seconds).

MFCC extraction is based on the spectrum, which can be computed using the discrete Fourier transform:

$X_w(f) = \sum_{n=0}^{N-1} x_w(n) \exp\left(-j \frac{2\pi f n}{N}\right)$. (3)

After the power spectrum is extracted, subband energies can be obtained using Mel filter banks, and the logarithm of the energies is then evaluated as follows:

$S_i = \log \sum_{f = F_l}^{F_h} L(i, f) \left| X_w(f) \right|^2$, (4)

where $F_h$ is the discrete frequency index corresponding to the high cutoff frequency, $F_l$ is the discrete frequency index corresponding to the low cutoff frequency, and $L(i, f)$ is the amplitude of the f-th discrete frequency index of the i-th Mel window. The number of Mel windows often ranges from 20 to 24. Finally, the MFCCs are obtained by performing a discrete cosine transform (DCT) [19]. Figure 3 exemplifies the average MFCC values of a piece of music.

3.1.5. Spectrum Centroid. The spectrum centroid is an economical description of the shape of the power spectrum [20–22]. Additionally, it is correlated with a major perceptual dimension of timbre, namely sharpness. Figure 4 gives an example of a spectrum and its spectrum centroid obtained from a frame in a piece of music; the spectrum centroid value is 2638 Hz in this example.

3.1.6. Spectrum Spread. The spectrum spread is an economical descriptor of the shape of the power spectrum that indicates whether the spectrum is concentrated in the vicinity of its centroid or spread out over the spectrum [20–22]. It allows differentiating between tone-like and noise-like sounds. Figure 5 provides an example of the spectrum spread of a piece of music.
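Both descriptors are the first and second moments of the spectrum; a minimal sketch follows (it weights by magnitude, whereas MPEG-7 [20–22] works on the power spectrum, so treat the weighting as an assumption).

```python
import numpy as np

def spectrum_centroid(mag, freqs):
    """Magnitude-weighted mean frequency of one frame's spectrum."""
    return np.sum(freqs * mag) / np.sum(mag)

def spectrum_spread(mag, freqs):
    """Standard deviation of the spectrum around its centroid."""
    c = spectrum_centroid(mag, freqs)
    return np.sqrt(np.sum(((freqs - c) ** 2) * mag) / np.sum(mag))
```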


Figure 3: Example of average MFCC values from a piece of music (magnitude versus coefficient rank).

Figure 4: Example of a spectrum and its spectrum centroid from a frame in a piece of music (magnitude versus frequency in Hz).

3.1.7. Ratio of a Spectral Flatness Measure to a Spectral Center (RSS). RSS was proposed for speaker-independent emotional speech recognition [23]. RSS is the ratio of spectrum flatness to spectrum centroid and is calculated by

$\mathrm{RSS} = \frac{1000 \times \mathrm{SF}}{\mathrm{SC}}$, (5)

where SF denotes the spectrum flatness and SC the spectrum centroid.
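Assuming the usual definition of spectral flatness as the ratio of the geometric mean to the arithmetic mean of the power spectrum (the paper does not spell this out), Eq. (5) can be sketched as:

```python
import numpy as np

def rss(mag, freqs, eps=1e-12):
    """RSS = 1000 * spectral flatness / spectral centroid, Eq. (5).
    Flatness is taken as the geometric over the arithmetic mean of the
    power spectrum, a common but here assumed definition."""
    power = mag ** 2
    flatness = np.exp(np.mean(np.log(power + eps))) / (np.mean(power) + eps)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + eps)
    return 1000.0 * flatness / centroid
```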

3.2. Extraction of First-Level Decision Vector. The acoustical feature set is utilized to generate the first-level decision vector, with each element being the significant value of an emotion. This approach is able to interpret the emotional content by providing multiple probabilistic class labels rather than a single hard label [24]. For example, the happiness emotion not only contains happiness content but also other properties that are similar to the content of peace. The similarity to peacefulness may cause a music clip to be recognized as an incorrect emotion class. In this example, the advantage of the first-level decision vector representation is its ability to convey the evidence of both the happiness and peaceful emotions. This paper uses the significant values of eight emotions (angry, happy, sad, relaxed, pleased, bored, nervous, and peaceful) to construct an emotion profile feature vector. To calculate the significant value of an emotion, we construct its 2-class SVM with the calm emotion as the background side of the SVM.

Figure 5: Example of spectrum spread from a piece of music (frequency versus frame index).

4. Principal Component Analysis

PCA is an important mathematical technique for feature extraction. In this paper, PCA is implemented to reduce the dimensionality of the extracted features. The first step of PCA is to calculate the d-dimensional mean vector u and the d × d covariance matrix Σ of the samples [25]. After that, the eigenvectors and eigenvalues are computed. Finally, the k largest eigenvectors are selected to form a d × k matrix M whose columns consist of the k eigenvectors; the remaining dimensions are treated as noise. The PCA-transformed data take the form

$x' = M^T (x - u)$. (6)
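A minimal sketch of this procedure, using the eigendecomposition of the covariance matrix, is:

```python
import numpy as np

def pca_transform(X, k):
    """Project the rows of X onto the top-k principal components, Eq. (6).
    Returns the transformed data plus the mean and projection used."""
    u = X.mean(axis=0)                          # d-dimensional mean vector
    cov = np.cov(X - u, rowvar=False)           # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    M = vecs[:, np.argsort(vals)[::-1][:k]]     # d x k, largest-k eigenvectors
    return (X - u) @ M, u, M
```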

5. Emotion Classifier

The emotion classifier used in the proposed system adopts a 2-level hierarchical structure of sparse kernel machines. The first-level SVMs use the probability product kernel, while the second-level RVM adopts a traditional radial basis function kernel with the first-level decision vector feature.

5.1. Support Vector Machine. SVM theory is an effective statistical technique and has drawn much attention in audio classification tasks [7]. An SVM is a binary classifier that creates an optimal hyperplane to classify input samples; this optimal hyperplane linearly divides the two classes with the largest margin [23]. Denote $T = \{(x_i, y_i),\ i = 1, 2, \ldots, N\}$ as a training set for the SVM, where each pair $(x_i, y_i)$ means that training sample $x_i$ belongs to class $y_i$, with $y_i \in \{+1, -1\}$. The fundamental concept is to choose a hyperplane that classifies T accurately while maximizing the distance between the two classes. This means finding a pair $(w, b)$ such that

$y_i (w \cdot x_i + b) > 0, \quad i = 1, \ldots, N$, (7)

where $w \in R^N$ is normalized and $b \in R$. The pair $(w, b)$ defines a separating hyperplane of equation

$w \cdot x + b = 0$. (8)

If there exists a hyperplane satisfying (7), the set T is said to be linearly separable, and w and b can be rescaled so that

$y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \ldots, N$. (9)

According to (9), we can derive an objective function under constraints:

$\min \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \ldots, N$. (10)

Since $\|w\|^2$ is convex, we can solve (10) by applying the classical method of Lagrange multipliers:

$\min \|w\|^2 - \sum_{i=1}^{N} \mu_i \left[ y_i (w \cdot x_i + b) - 1 \right]$. (11)

We denote $U = (\mu_1, \mu_2, \ldots, \mu_N)$ as the N nonnegative Lagrange multipliers associated with (10). After solving (11), the optimal hyperplane has the following expansion:

$w = \sum_{i=1}^{N} \mu_i y_i x_i$. (12)

b can be determined from U and from the Kuhn-Tucker conditions:

$\mu_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0, \quad i = 1, 2, \ldots, N$. (13)

According to (12), the expected hyperplane is a linear combination of training samples. The training samples $(x_i, y_i)$ with nonzero Lagrange multipliers are called support vectors. Finally, the decision value for a new data point x can be written as

$\mathrm{dec}(x) = \sum_{i=1}^{N} \mu_i y_i \, x_i \cdot x + b$. (14)

Functions that satisfy Mercer's theorem can be used as kernels; in this paper, the probability product kernel is adopted.
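For later reference, Eq. (14) generalizes to any Mercer kernel by replacing the dot product $x_i \cdot x$ with $k(x_i, x)$; a minimal sketch:

```python
import numpy as np

def svm_decision(x, support_x, support_y, mu, b, kernel):
    """Eq. (14) with the dot product x_i . x generalized to k(x_i, x),
    which is how the probability product kernel enters the first level."""
    return sum(m * y * kernel(x_i, x)
               for m, y, x_i in zip(mu, support_y, support_x)) + b

# With a linear kernel this reproduces Eq. (14) exactly:
linear = lambda u, v: float(np.dot(u, v))
```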

5.2. Probability Product Support Vector Machine. A function can be used as a kernel function if it satisfies Mercer's theorem. Using Mercer's theory, we can introduce a mapping function $\phi(x)$ such that $k(x_j, x_i) = \phi(x_j) \cdot \phi(x_i)$. This provides the ability to handle nonlinear data by mapping the original input space $R^d$ into some other space.

In this paper, the probability product kernel is utilized. The probability product kernel measures similarity between distributions, and its conception is simple and intuitively compelling [26]. It computes a generalized inner product between two probability distributions in a Hilbert space. A positive definite kernel $k: O \times O \rightarrow R$ on an input space O and examples $o_1, o_2, \ldots, o_m \in O$ are defined. First, each input datum is mapped to a probability distribution by fitting separate probabilistic models $p_1(x), p_2(x), \ldots, p_m(x)$ to $o_1, o_2, \ldots, o_m$. After that, a kernel $k^{\mathrm{prob}}(p_i, p_j)$ between probability distributions on O is defined. Finally, the kernel between examples is defined to be equal to $k^{\mathrm{prob}}$ between the corresponding distributions:

$k(o_i, o_j) = k^{\mathrm{prob}}(p_i, p_j)$. (15)

This kernel is then applied to the SVM, and training proceeds as usual. The probability product kernel between distributions $p_i$ and $p_j$ is defined as

$k(p_i, p_j) = \int_x p_i^{\rho}(x)\, p_j^{\rho}(x)\, dx = \langle p_i^{\rho}, p_j^{\rho} \rangle_{L_2}$, (16)

where $p_i$ and $p_j$ are probability distributions on a space O. It is assumed that $p_i^{\rho}, p_j^{\rho} \in L_2(O)$, where $L_2$ is a Hilbert space and $\rho$ is a positive constant. The probability product kernel allows us to introduce prior knowledge of the data; in this paper, we assume a d-dimensional Gaussian distribution for our data.
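For the Gaussian assumption above, the integral in Eq. (16) has a closed form; with ρ = 1 (the expected likelihood kernel, one common choice, since the paper does not state its ρ) it evaluates to a Gaussian density in the two means:

```python
import numpy as np

def gaussian_ppk(mu1, cov1, mu2, cov2):
    """Probability product kernel with rho = 1 between two d-dimensional
    Gaussians: the integral of Eq. (16) equals N(mu1; mu2, cov1 + cov2)."""
    d = len(mu1)
    S = cov1 + cov2
    diff = mu1 - mu2
    norm = 1.0 / np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(S))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))
```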

5.3. First-Level Decision Vector Extraction. The first-level decision vector presents the perception probability of each of the eight emotion-specific decisions; it is extracted from the input data by collecting the decision values from each model. The decision value of an SVM represents the degree of similarity between the model and the testing data, and this similarity measure can be used to find out which model fits the data most accurately [24]. Using the first-level decision vector, the most probably perceived emotion in music can be detected.
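Conceptually, the extraction just stacks the eight per-emotion decision values; the sketch below assumes a scikit-learn-style decision_function interface, which is illustrative rather than the authors' actual LIBSVM setup.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "relaxed",
            "pleased", "bored", "nervous", "peaceful"]

def first_level_vector(clip_features, svms):
    """Stack the eight emotion-versus-calm SVM decision values into one
    first-level decision vector; `svms` maps emotion name -> trained model
    (a hypothetical interface for illustration)."""
    return np.array([svms[e].decision_function([clip_features])[0]
                     for e in EMOTIONS])
```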

5.4. Relevance Vector Machine. The RVM is a development of the SVM. Different from the SVM, the RVM tries to find a set of weights with the highest sparsity [27]. The model defines a conditional distribution for the target class $y \in \{0, 1\}$ given an input set $x_1, \ldots, x_n$ [28]. Assume that the model output is a linear combination of weighted nonlinear basis functions $\phi_i(x)$, which is then transformed by a logistic sigmoid function:

$f(x; w) = w^T \phi(x)$, (17)

where $w = (w_1, w_2, \ldots, w_G)^T$ denotes the weights and $\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_G(x))^T$. In order to make the weights sparse, a Bayesian probabilistic framework is implemented to find the distribution over the weights instead of using a pointwise estimation; therefore, a separate hyperparameter $a_i$ for each of the weight parameters $w_i$ is introduced. According to Bayes' rule, the posterior probability of w is

$p(w \mid y, a) = \frac{p(y \mid w, a)\, p(w \mid a)}{p(y \mid a)}$, (18)

where $p(y \mid w, a)$ is the likelihood, $p(w \mid a)$ is the prior conditioned on the weights, $a = [a_1, \ldots, a_n]^T$, and $p(y \mid a)$ denotes the evidence. Because y is a binary variable, the likelihood function can be given by

$p(y \mid w, a) = \prod_{i=1}^{n} \left[\sigma(f(x_i; w))\right]^{y_i} \left[1 - \sigma(f(x_i; w))\right]^{1 - y_i}$, (19)

where $\sigma(f) = 1/(1 + e^{-f})$ is the logistic sigmoid link function. From (18), it can be found that a significant proportion of the hyperparameters tend to infinity, so the corresponding posterior distributions of the weight parameters are concentrated at zero. The basis functions multiplied by these parameters are therefore not taken into account when training the model; as a result, the model is sparse.

Figure 6: DET curve of the proposed system (miss probability versus false alarm probability, in percent).

6. Experimental Results

In the experiments, we collected one hundred songs from two websites, All Music Guide [29] and Last.fm [30], to construct a music emotion database. As mentioned before, music may contain multiple emotions; if we know which emotion class a song most likely belongs to, we may know the main emotion of the song. Songs on Last.fm are tagged by many people on the Internet, and we choose the emotion tagged by the most people as the ground truth of the data.

The database consists of nine classes of emotions: happy, angry, sad, bored, nervous, relaxed, pleased, calm, and peaceful. Calm is taken as a model's opposite side when training the models. Each emotion class contains twenty songs. Each song is thirty seconds long and is divided into five-second clips. Half of the songs are used as training data, and the others are used as testing data; in this paper, 240 music clips are tested. All of the songs are western music and are encoded in 16 kHz WAV format. The adopted acoustical feature set is listed in Table 1; the whole feature set dimension is 30. The SVMs are based on the LIBSVM library [31], and the RVM is based on the PRT toolbox [32]. The system performance is evaluated in terms of the DET curve. Figure 6 depicts the DET curve of the proposed happiness verification system. The proposed system achieves a 13.33% equal error rate (EER). From these results, we see that the system performs well on happiness emotion verification in music.

7. Conclusion

Detecting emotion in music has drawn the attention of many researchers in recent years. In this paper, we proposed a first-level decision-vector-based music happiness detection system. The proposed system adopts a hierarchical structure of sparse kernel machines. First, eight SVM models are trained on acoustical features with the probability product kernel. Eight decision values are then extracted to construct the first-level decision vector feature. After that, these eight decision values are used as a new feature to train and test a 2-class RVM with happiness as the target side. The probability value of the RVM is used to verify happiness content in music. Experimental results reveal that the proposed system achieves a 13.33% equal error rate (EER).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] R. E. Milliman, "Using background music to affect the behavior of supermarket shoppers," Journal of Marketing, vol. 46, no. 3, pp. 86–91, 1982.

[2] C.-H. Yeh, H.-H. Lin, and H.-T. Chang, "An efficient emotion detection scheme for popular music," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '09), pp. 1799–1802, Taipei City, Taiwan, May 2009.

[3] Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H. H. Chen, "A regression approach to music emotion recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 448–457, 2008.

[4] Y. H. Yang and H. H. Chen, "Prediction of the distribution of perceived music emotions using discrete samples," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2184–2196, 2011.

[5] B. Han, S. Rho, R. B. Dannenberg, and E. Hwang, "SMERS: music emotion recognition using support vector regression," in Proceedings of the International Conference on Music Information Retrieval, Kobe, Japan, 2009.

[6] L. Lu, D. Liu, and H.-J. Zhang, "Automatic mood detection and tracking of music audio signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 5–18, 2006.

[7] C.-Y. Chang, C.-Y. Lo, C.-J. Wang, and P.-C. Chung, "A music recommendation system with consideration of personal emotion," in Proceedings of the International Computer Symposium (ICS '10), pp. 18–23, Tainan City, Taiwan, December 2010.

[8] F. C. Hwang, J. S. Wang, P. C. Chung, and C. F. Yang, "Detecting emotional expression of music with feature selection approach," in Proceedings of the International Conference on Orange Technologies (ICOT '13), pp. 282–286, March 2013.

[9] K. Hevner, "Expression in music: a discussion of experimental studies and theories," Psychological Review, vol. 42, no. 2, pp. 186–204, 1935.

[10] M. Chouchane, S. Paris, F. Le Gland, C. Musso, and D.-T. Pham, "On the probability distribution of a moving target: asymptotic and non-asymptotic results," in Proceedings of the 14th International Conference on Information Fusion (FUSION '11), pp. 1–8, July 2011.

[11] O. Lartillot and P. Toiviainen, "MIR in Matlab (II): a toolbox for musical feature extraction from audio," in Proceedings of the International Conference on Music Information Retrieval, pp. 127–130, 2007, https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox.

[12] C.-W. Chen, K. Lee, and H.-H. Wu, "Towards a class-based representation of perceptual tempo for music retrieval," in Proceedings of the 8th International Conference on Machine Learning and Applications (ICMLA '09), pp. 602–607, December 2009.

[13] W. Chai, "Semantic segmentation and summarization of music," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 124–132, 2006.

[14] M. A. Bartsch and G. H. Wakefield, "Audio thumbnailing of popular music using chroma-based representations," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 96–104, 2005.

[15] X. Yu, J. Zhang, J. Liu, W. Wan, and W. Yang, "An audio retrieval method based on chromagram and distance metrics," in Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP '10), pp. 425–428, Shanghai, China, November 2010.

[16] J. O. García and C. A. R. García, "Mel-frequency cepstrum coefficients extraction from infant cry for classification of normal and pathological cry with feed-forward neural networks," in Proceedings of the International Joint Conference on Neural Networks, pp. 3140–3145, July 2003.

[17] C. Y. Lin and S. Cheng, "Multi-theme analysis of music emotion similarity for jukebox application," in Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP '12), pp. 241–246, July 2012.

[18] B. Shao, M. Ogihara, D. Wang, and T. Li, "Music recommendation based on acoustic features and user access patterns," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, pp. 1602–1611, 2009.

[19] W.-Q. Zhang, D. Yang, J. Liu, and X. Bao, "Perturbation analysis of mel-frequency cepstrum coefficients," in Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP '10), pp. 715–718, Shanghai, China, November 2010.

[20] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, New York, NY, USA, 2005.

[21] ISO/IEC JTC1/SC29/WG11 Moving Pictures Experts Group, "Information technology—multimedia content description interface—part 4: Audio," Committee Draft 15938-4, ISO/IEC, 2000.

[22] M. Casey, "MPEG-7 sound-recognition tools," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 737–747, 2001.

[23] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.

[24] E. Mower, M. J. Mataric, and S. Narayanan, "A framework for automatic human emotion classification using emotion profiles," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1057–1070, 2011.

[25] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley, New York, NY, USA, 2nd edition, 2001.

[26] T. Jebara, R. Kondor, and A. Howard, "Probability product kernels," Journal of Machine Learning Research, vol. 5, pp. 819–844, 2004.

[27] F. A. Mianji and Y. Zhang, "Robust hyperspectral classification using relevance vector machine," IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6, pp. 2100–2112, 2011.

[28] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer, New York, NY, USA, 2nd edition, 2007.

[29] "The All Music Guide," http://www.allmusic.com/.

[30] "Last.fm," http://cn.last.fm/home.

[31] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[32] "Pattern Recognition Toolbox," http://www.newfolderconsulting.com/prt.


Page 2: Research Article Music Emotion Detection Using ...emotion or not. 3. Extraction of Acoustical Feature and First Level Decision Value Vector Feature In the -level hierarchical sparse

2 The Scientific World Journal

Unknown

Spectrum

Tempo

Chromagram

energyRoot mean square

estimation

MFCC

emotionalmusic

Audiopreprocessing

Emotionalmusic database Time

-frequencyprocessing

Spectrum centroidSpectrum spread

RSS

Principle component Support vector First-level decision Relevance vector machine machineanalysis vector

With probability Testing

Verifiedemotion

Training

product Kernel

First-level decision vectorextraction

Feature extraction

Figure 1 Block diagram of the proposed system

arousal and valence value In [6] hierarchical frameworkwas adopted to detect emotion from acoustic music dataThemethod has the advantage of emphasizing proper featurein different detection work In [7] support vector machinewas applied to detect emotion content in music In [8]kernel-based class separability is used to weight featuresAfter feature selection principal component analysis andlinear discriminant analysis were applied and 119896-nearestneighborhood (KNN) classifier was then implemented Inthis paper amusic emotion detection system is proposedThesystem establishes a hierarchical sparse kernel machine Inthe first level eight 2-class SVMmodels are trainedwith eightemotion classes as the target sides respectively It is noted thatemotion perception is usually not based on a single acousticalfeature but a combination of acoustical features [4 9] Thispaper adopts an acoustical feature set comprising root meansquare energy (RMS energy) tempo chromagram MFCCsspectrum centroid spectrum spread and ratio of a spectralflatness measure to a spectral center (RSS) Each of them isnormalized In the second level of hierarchical sparse kernelmachines a 2-class relevance vector machine (RVM) modelwith happiness as the target side and other emotion as thebackground side is trained Besides first-level decision vectoris used as the feature in this level

The rest of this paper is organized as follows The systemoverview is described in Section 2 The features and first-level decision vector extraction are described in Section 3Principle component analysis is described in Section 4 Theintroduction of SVM and RVM is described in Section 5Section 6 shows our experimental results The conclusion isgiven in Section 7

2 System Overview

The block diagram of the proposed system is presented inFigure 1The systemmainly comprises two level sparse kernelmachines For the first-level SVMs we use a set of acousticalfeatures which includes RMS energy tempo chromagramMFCCs spectrum centroid spectrum spread and RSS InTable 1 the used acoustical features are classified into fourmain types that is dynamic rhythm timbre and tonality

Table 1 The proposed acoustical feature set

Feature class Feature name (dimension of feature)Dynamic RMS energy (1)Rhythm Tempo (1)

Timbre MFCCs (13) spectrum centroid (1) spectrumspread (1) RSS (1)

Tonality Chromagram (12)

Because each featurersquos scale is different normalization of thewhole feature set is performed [10] After normalization eightSVM models are trained to transform acoustical featuresinto emotion profile features Each of the eight SVM modelis trained and tested using probability product kernel Weuse the first-level decision vectors generated from the angryhappy sad relaxed pleased bored nervous and peacefulemotion classes For an emotion to calculate the correspond-ing value in the emotion profile features we construct its 2-class SVM with calm emotion as the background side of theRVM For the RVM conventional radial basis function kernelis used and the first-level decision vector extracted in the firstlevel is utilized as the feature To verify happiness emotiona 2-class RVM with happiness as the target side and otheremotion as the background side is constructed For a testedmusic clip the obtained probabilities value from this 2-classRVM is used to judge if this music clip belongs to happinessemotion or not

3 Extraction of Acoustical Feature and FirstLevel Decision Value Vector Feature

In the 2-level hierarchical sparse kernel machines the first-level SVMs use acoustical features while the second-levelRVM adopts first-level decision vector For acoustical fea-tures the proposed system extracts RMS energy tempochromagram MFCCs spectrum centroid spectrum spreadand RSSThe extraction of these acoustical features as well asfirst-level decision vectors are described in the following

The Scientific World Journal 3

31 Extraction of Acoustical Feature

311 RMS Energy RMS energy is also called root meansquare energy It computes the global energy of input signal119909 [11] The operation is defined as follows

119883RMS = radic1

119899

119899

sum

119894=1

1199092119894 (1)

where 119899 means signalrsquos length in hundredth of a second bydefault

312 Tempo Many tempo estimation methods have beenproposed The estimation of tempo is based on detectingperiodicities in a range of BPMs [12] Firstly significant onsetevents are detected in the frequency domain [11] Then findthe events that best represents the tempo of the song whichmeans to choose the maximum periodicity score for eachframe separately

313 Chromagram Chroma which is also called harmonicpitch class profile has a strong relationship with the structureof music [13] Chromagram is a joint distribution of signalstrength over the variables of time and chroma Chromais a frame-based representation of audio and is similar toshort time Fourier transform In music clips frequencycomponents belonging to the same pitch class are extractedby chromagram and transformed to a 12-dimensional repre-sentation including C C D D E F F G G A A andB The chromagram can present the distribution of energyalong the pitches or pitch classes [11 14] In [14] chromagramis defined as the remapping of time-frequency distributionThe chromagram is extracted by

V (119905 119896) = sum

119899isin119878119896

119883119905(119899)

119876119896

119896 isin 0 1 2 11 (2)

where 119883119905(119899) means the logarithmic magnitude of discrete

Fourier transform of the 119905th frame and 119876119896is the number of

elements in a subset of the discrete frequency space for eachpitch class [15]

In Figure 2 the chromagram from a piece of music isexemplified

314 Mel-Frequency Cepstral Coefficients (MFCCs) Aftersignal is digitized a large amount of information is notneeded and cost plenty of storage space Power spectrum isoften adopted to encode the signal to solve the problem [16]It is noted that MFCCs performs similar to human auditoryperception systemThe feature is adopted in various researchtopics including speaker recognition speech recognitionand music emotion recognition For example Cooper andFoote extracted MFCCs from music signal and they foundthat MFCCs are similar to music timbre expression [17] In[18]MFCCswere also proven to be having goodperformancein music recommendation

0 1 2 3 4 5C

CD

DEF

FG

GA

AB

Chromagram

Time axis (s)

Chro

ma c

lass

Figure 2 Example of chromagram from a piece of music

MFCCs extraction is based on spectrum The spectrumcan be extracted by using discrete Fourier transform

119909119908(119891) =

119873

sum

119899=0

119909119908(119899) expminus

2120587119891119899

119873 (3)

After power spectrum is extracted subband energies canbe extracted by using Mel filter banks and then evaluatelogarithm value of the energies as follows

119878119894= log

119865ℎ

sum

119891=119865119897

119871 (119894 119891)1003816100381610038161003816119883119908(119891)

10038161003816100381610038162

(4)

where 119865ℎis the discrete frequency index corresponding to

the high cutoff frequency 119865119897is the discrete frequency index

corresponding to low cutoff frequency and 119871(119894 119891) is theamplitude of the 119891th discrete frequency index of the 119894th Melwindow The number of the Mel windows often ranges from20 to 24 Finally MFCCs is obtained by performing discretecosine transform (DCT) [19] In Figure 3 the averageMFCCsvalues from a piece of music are exemplified

315 Spectrum Centroid Spectrum centroid is an econom-ical description of the shape of the power spectrum [20ndash22] Additionally it is correlated with a major perceptualdimension of timbre that is sharpness Figure 4 gives anexample of a spectrum and its spectrum centroid obtainedfrom a frame in a piece ofmusicThe spectrum centroid valueis 2638Hz in this example

316 Spectrum Spread Spectrum spread is an economicaldescriptor of the shape of the power spectrum that indicateswhether it is concentrated in the vicinity of its centroidor else spread out over the spectrum [20ndash22] It allowsdifferentiating between tone-like and noise-like sounds InFigure 5 an example of spectrum spread from a piece ofmusic is provided

4 The Scientific World Journal

1 2 3 4 5 6 7 8 9 10 11 12 13minus02

0

02

04

06

08

1

12MFCC

Coefficient ranks

Mag

nitu

de

Figure 3 Example of average MFCCs values from a piece of music

0 1000 2000 3000 4000 5000 6000 7000 80000

100

200

300

400

500

600

700Spectrum

Frequency (Hz)

Mag

nitu

de

Figure 4 Example of a spectrum and its spectrum centroid from aframe in a piece of music

317 Ratio of a Spectral Flatness Measure to a Spectral Center(RSS) RSSwas proposed byVapnik for speaker-independentemotional speech recognition [23] RSS is the ratio of spec-trum flatness to spectrum centroid and is calculated by

RSS =1000 times SF

SC (5)

where SF denotes spectrum flatness and SC represents spec-trum centroid

32 Extraction of First-Level Decision Vector The acousticalfeature set is utilized to generate the first-level decision vectorwith each element being a significant value of an emotionThis approach is able to interpret the emotional content byproviding multiple probabilistic class labels rather than asingle hard label [24] For example happiness emotion not

0 50 100 150 200 250 300 3500

500

1000

1500

2000

2500

3000

3500

4000

Frame

Freq

uenc

y

Spectrum spread

Figure 5 Example of spectrum spread from a piece of music

only contains happiness content but also other propertiesthat are similar to the content of peace The similarity topeaceful may cause a music clip to be recognized as an incor-rect emotion class In this example the advantage of first-level decision vector representation is its ability to conveyboth the evidences of happiness and peaceful emotions Thispaper uses the significant values of eight emotions (angryhappy sad relaxed pleased bored nervous and peaceful) toconstruct an emotion profile feature vector To calculate thesignificant value of an emotion we construct its 2-class SVMwith calm emotion as the background side of the SVM

4 Principle Component Analysis

PCA is an important mathematic technology in featureextraction approach In this paper PCA is implemented toreduce the dimensions of the extracted featuresThe first stepof PCA is to calculate the 119889-dimension mean vector u and119889 times 119889 covariance matrix Σ of the samples [25] After thatthe eigenvectors and eigenvalues are computed Finally thelargest 119896 eigenvectors are selected to form a 119889 times 119896 matrix 119872

whose columns consist of the 119896 eigenvectors In fact the otherdimensions are noise The PCA transformed data can be inthe form

x1015840 = M119879 (x minus u) (6)

5 Emotion Classifier

The emotion classifier used in the proposed system adoptsa 2-level hierarchical structure of sparse kernel machinesThe first-level SVMs use probability product kernel whilethe second-level RVMadopts traditional radial basis functionkernel with first-level decision vector feature

51 Support Vector Machine The SVM theory is an effectivestatistical technique and has drawn much attention on audioclassification tasks [7] An SVM is a binary classifier thatcreates an optimal hyperplane to classify input samples Thisoptimal hyperplane linearly divides the two classes with thelargest margin [23] Denote 119879 = (x

119894 119910119894) 119894 = 1 2 119873

as a training set for SVM each pair (x119894 119910119894) means training

The Scientific World Journal 5

sample x119894belongs to a class 119910

119894 where 119910

119894isin +1 minus1 The

fundamental concept is to choose a hyperplane which canclassify T accurately while maximizing the distance betweenthe two classes This means to find a pair (w 119887) such that

119910119894(w sdot x119894+ 119887) gt 0 119894 = 1 119873 (7)

where w isin 119877119873 is normalized by itself and 119887 isin 119877The pair (w 119887) defines a separating hyperplane of equa-

tion

w sdot x + 119887 = 0 (8)

If there exists a hyperplane satisfying (7) the set 119879 is saidto be linearly separable and we can change w and 119887 so that

119910119894(w sdot x119894+ 119887) gt 1 119894 = 1 119873 (9)

According to (9) we can derive an objective functionunder constraint

min w2

subject to 119910119894(w sdot x119894+ 119887) gt 1 119894 = 1 119873

(10)

Since w2 is convex we can solve (9) by applying the

classical method of Lagrange multipliers

min w2

+ 120583119894[119910119894(w sdot x119894+ 119887) minus 1] 119894 = 1 119873 (11)

We denote U = (1205831 1205832 120583

119873) as the 119873 nonnegative

Lagrange multipliers associated with (10) After solving (11)the optimal hyperplane has the following expansion

w =

119873

sum

119894=1

120583119894119910119894119909119894 (12)

119887 can be determined from U and from the Kuhn-Tuckerconditions Consider

120583119894(119910119894(w sdot 119909

119894+ 119887) minus 1) = 0 119894 = 1 2 119873 (13)

Accordingly (11) the expected hyperplane is a linearcombination of training samplesThe corresponding trainingsamples (x

119894 119910119894) with nonzero Lagrange multipliers are called

support vectors Finally the decision value from a new datapoint x can be written as

dec (119909) =

119873

sum

119894=1

120583119894119910119894x119894sdot x + 119887 (14)

Functions that satisfy Mercerrsquos theorem can be used askernels In this paper probability product kernel is adopted

52 Probability Product Support Vector Machine A functioncan be considered as kernel function if the function satisfiesMercerrsquos theorem Using Mercerrsquos theory we can introduce amapping function 120601(x) such that 119896(x

119895 x119894) = 120601(x

119895)120601(x119894) This

provides the ability of handling nonlinear data by mappingthe original input space R119889 into some other space

In this paper the probability product kernel is utilizedThe probability product kernel is a method of measuringsimilarity between distributions and it has the property ofsimple and intuitively compelling conception [26] Proba-bility product kernel computes a generalized inner productbetween two probability distributions in the Hilbert space Apositive definite kernel 119896 119874 times 119874 rarr R on input space 119874

and examples o1 o2 o

119898isin 119874 are defined Firstly the input

data 119909 is mapped to a probability distribution119901(119909 | 119874) whichfits separate probabilistic models 119901

1(119909) 1199012(119909) 119901

119898(119909) to

o1 o2 o

119898 After that a novel kernel 119896prob(119901

119894 119901119895) between

probability distributions on 119874 is defined At last a kernelbetween examples is needed to be defined and the kernelis equal to 119896

prob between the corresponding distributionsConsider

119896 (o119894 o119895) = 119896

prob(119901119894 119901119895) (15)

Finally this kernel is applied to SVM and proceeded as usualThe probability product kernel between distributions 119901

119894and

119901119895is defined as

119896 (119901119894 119901119895) = int119909

119901120588

119894(x) 119901120588119895(x) 119889119909 = ⟨119901

120588

119894 119901120588

119895⟩1198712

(16)

where 119901119894and 119901

119895are probability distributions on a space 119874

Assume that 119901120588

119894 119901120588

119895isin 1198712(119874) 119871

2is a Hilbert space and 120588 is

a positive constant Probability product kernel allows us tointroduce prior knowledge of data In this paper we assumea 119889-dimensional Gaussian distribution of our data

53 First-Level Decision Vector Extraction First-level deci-sion vector presents perception probability of each of theeight emotion-specific decisions which is extracted frominput data by collecting decision values from each modelThe decision value of SVM represents the degree of similaritybetween model and testing data The advantage of similaritymeasure can be used to find out which model fits the datamost accurately [24] Using the first-level decision vector themost probably perceived emotion in music can be detected

54 Relevance Vector Machine RVM is a development ofSVM Different from SVM RVM tries to find a considerablenumber of weights which has highest sparsity [27]Themodeldefines a conditional distribution for target class 119910 = 0 1given an input set x

1 x

119899 [28] Assume that a training

data can be a linear combination of weighted nonlinear basisfunctions 120601

119894(x) which is transformed by a logistic sigmoid

function as follows

119891 (xw) = w119879120601 (x) (17)

where w = (1199081 1199082 119908

119866) 120601(x) = (120601

1(x) 1206012(x)

120601119866(x))119879 denotes the weights In order to make weight sparse

the Bayesian probabilistic framework is implemented to findthe distribution over the weights instead of using pointwiseestimation therefore a separate hyperparameter 119886 for each

6 The Scientific World Journal

01 02 05 1 2 5 10 20 40

0102

05

1

2

5

10

20

40

False alarm probability ()

Miss

pro

babi

lity

()

Figure 6 DET curve of the proposed system

of the weight parameters119908 is introduced According to Bayesrule the posterior probability of 119908 is

119901 (w | 119910 a) =119901 (119910 | w a) 119901 (w | a)

119901 (119910 | a) (18)

where 119901(119910 | w a) is likelihood 119901(w | a) is prior conditionedon weights a = [119886

1 119886

119899]119879 and 119901(119910 | a) denotes the

evidence For the reason that 119910 is a binary variable thelikelihood function can be given by

119901 (119910 | w a) =

119899

prod

119894=1

[120590 (119891 (x119894w))]119910119894[1 minus 120590 (119891 (x

119894w))]1minus119910119894

(19)

where 120590(119891) = 1(1+119890minus119891

) is the logistic sigmoid link functionAccording to (18) it can be found that a significant proportionof hyperparameters tend to be infinity and the correspondingposterior distributions of weight parameters are concentratedat zeroTherefore the basis functions that multiplied by theseparameters will not be taken for reference when training themodel As a result the model will be sparse

6 Experimental Results

In the experiments we collected one hundred songs fromtwo websites to construct a music emotion database Thesewebsites are All Music Guide [29] and Lastfm [30] Asmentioned before music may contain multiple emotions Ifwe know which emotion class a song most likely belongs towemay know themain emotion of the song Songs in Lastfmare tagged by many people on the Internet We choose theemotion which is tagged by most people to be the groundtruth of data

The database consists of nine classes of emotions: happy, angry, sad, bored, nervous, relaxed, pleased, calm, and peaceful. Calm is taken as a model's opposite side when training the models. Each emotion class contains twenty songs. Each song is thirty seconds long and is divided into five-second clips. Half of the songs are used as training data, and the others are used as testing data; in total, 240 music clips are tested. All songs are Western music and are encoded in 16 kHz WAV format. The acoustical feature set used is listed in Table 1; its overall dimension is 30. The SVMs are based on the LIBSVM library [31], and the RVM is based on the PRT toolbox [32]. The system performance is evaluated in terms of the DET curve. Figure 6 depicts the DET curve of the proposed happiness verification system. The proposed system achieves a 13.33% equal error rate (EER). From these results, we see that the system performs well on happiness emotion verification in music.
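For reference, the EER is the operating point on the DET curve where the miss and false alarm probabilities are equal; a minimal sketch of its computation from verification scores (with hypothetical score arrays, for illustration only) is:

import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    # Sweep a decision threshold over all observed scores; the EER is
    # where the miss and false alarm probabilities cross.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[i] + fa[i])

rng = np.random.default_rng(0)
happy_scores = rng.normal(1.0, 1.0, size=120)   # target clips
other_scores = rng.normal(-1.0, 1.0, size=120)  # non-target clips
print(equal_error_rate(happy_scores, other_scores))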

7. Conclusion

Detecting emotion in music has attracted many researchers in recent years. In this paper, we proposed a first-level decision-vector-based music happiness detection system. The proposed system adopts a hierarchical structure of sparse kernel machines. First, eight SVM models are trained on acoustical features with the probability product kernel. Eight decision values are then extracted to construct the first-level decision vector feature. After that, these eight decision values are taken as a new feature to train and test a 2-class RVM with happiness as the target side. The probability value of the RVM is used to verify the happiness content in music. Experimental results reveal that the proposed system achieves a 13.33% equal error rate (EER).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] R. E. Milliman, "Using background music to affect the behavior of supermarket shoppers," Journal of Marketing, vol. 46, no. 3, pp. 86-91, 1982.
[2] C.-H. Yeh, H.-H. Lin, and H.-T. Chang, "An efficient emotion detection scheme for popular music," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '09), pp. 1799-1802, Taipei City, Taiwan, May 2009.
[3] Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H. H. Chen, "A regression approach to music emotion recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 448-457, 2008.
[4] Y. H. Yang and H. H. Chen, "Prediction of the distribution of perceived music emotions using discrete samples," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2184-2196, 2011.
[5] B. Han, S. Rho, R. B. Dannenberg, and E. Hwang, "SMERS: music emotion recognition using support vector regression," in Proceedings of the International Conference on Music Information Retrieval, Kobe, Japan, 2009.
[6] L. Lu, D. Liu, and H.-J. Zhang, "Automatic mood detection and tracking of music audio signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 5-18, 2006.
[7] C.-Y. Chang, C.-Y. Lo, C.-J. Wang, and P.-C. Chung, "A music recommendation system with consideration of personal emotion," in Proceedings of the International Computer Symposium (ICS '10), pp. 18-23, Tainan City, Taiwan, December 2010.
[8] F. C. Hwang, J. S. Wang, P. C. Chung, and C. F. Yang, "Detecting emotional expression of music with feature selection approach," in Proceedings of the International Conference on Orange Technologies (ICOT '13), pp. 282-286, March 2013.
[9] K. Hevner, "Expression in music: a discussion of experimental studies and theories," Psychological Review, vol. 42, no. 2, pp. 186-204, 1935.
[10] M. Chouchane, S. Paris, F. Le Gland, C. Musso, and D.-T. Pham, "On the probability distribution of a moving target: asymptotic and non-asymptotic results," in Proceedings of the 14th International Conference on Information Fusion (FUSION '11), pp. 1-8, July 2011.
[11] O. Lartillot and P. Toiviainen, "MIR in Matlab (II): a toolbox for musical feature extraction from audio," in Proceedings of the International Conference on Music Information Retrieval, pp. 127-130, 2007, https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox.
[12] C.-W. Chen, K. Lee, and H.-H. Wu, "Towards a class-based representation of perceptual tempo for music retrieval," in Proceedings of the 8th International Conference on Machine Learning and Applications (ICMLA '09), pp. 602-607, December 2009.
[13] W. Chai, "Semantic segmentation and summarization of music," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 124-132, 2006.
[14] M. A. Bartsch and G. H. Wakefield, "Audio thumbnailing of popular music using chroma-based representations," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 96-104, 2005.
[15] X. Yu, J. Zhang, J. Liu, W. Wan, and W. Yang, "An audio retrieval method based on chromagram and distance metrics," in Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP '10), pp. 425-428, Shanghai, China, November 2010.
[16] J. O. García and C. A. R. García, "Mel-frequency cepstrum coefficients extraction from infant cry for classification of normal and pathological cry with feed-forward neural networks," in Proceedings of the International Joint Conference on Neural Networks, pp. 3140-3145, July 2003.
[17] C. Y. Lin and S. Cheng, "Multi-theme analysis of music emotion similarity for jukebox application," in Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP '12), pp. 241-246, July 2012.
[18] B. Shao, M. Ogihara, D. Wang, and T. Li, "Music recommendation based on acoustic features and user access patterns," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, pp. 1602-1611, 2009.
[19] W.-Q. Zhang, D. Yang, J. Liu, and X. Bao, "Perturbation analysis of mel-frequency cepstrum coefficients," in Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP '10), pp. 715-718, Shanghai, China, November 2010.
[20] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, New York, NY, USA, 2005.
[21] ISO/IEC JTC1/SC29/WG11 Moving Pictures Experts Group, "Information technology, multimedia content description interface, part 4: audio," Committee Draft 15938-4, ISO/IEC, 2000.
[22] M. Casey, "MPEG-7 sound-recognition tools," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 737-747, 2001.
[23] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[24] E. Mower, M. J. Matarić, and S. Narayanan, "A framework for automatic human emotion classification using emotion profiles," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1057-1070, 2011.
[25] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley, New York, NY, USA, 2nd edition, 2001.
[26] T. Jebara, R. Kondor, and A. Howard, "Probability product kernels," Journal of Machine Learning Research, vol. 5, pp. 819-844, 2004.
[27] F. A. Mianji and Y. Zhang, "Robust hyperspectral classification using relevance vector machine," IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6, pp. 2100-2112, 2011.
[28] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer, New York, NY, USA, 2007.
[29] "The All Music Guide," http://www.allmusic.com.
[30] "Last.fm," http://cn.last.fm/home.
[31] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[32] "Pattern Recognition Toolbox," http://www.newfolderconsulting.com/prt.





Page 5: Research Article Music Emotion Detection Using ...emotion or not. 3. Extraction of Acoustical Feature and First Level Decision Value Vector Feature In the -level hierarchical sparse

The Scientific World Journal 5

sample x119894belongs to a class 119910

119894 where 119910

119894isin +1 minus1 The

fundamental concept is to choose a hyperplane which canclassify T accurately while maximizing the distance betweenthe two classes This means to find a pair (w 119887) such that

119910119894(w sdot x119894+ 119887) gt 0 119894 = 1 119873 (7)

where w isin 119877119873 is normalized by itself and 119887 isin 119877The pair (w 119887) defines a separating hyperplane of equa-

tion

w sdot x + 119887 = 0 (8)

If there exists a hyperplane satisfying (7) the set 119879 is saidto be linearly separable and we can change w and 119887 so that

119910119894(w sdot x119894+ 119887) gt 1 119894 = 1 119873 (9)

According to (9) we can derive an objective functionunder constraint

min w2

subject to 119910119894(w sdot x119894+ 119887) gt 1 119894 = 1 119873

(10)

Since w2 is convex we can solve (9) by applying the

classical method of Lagrange multipliers

min w2

+ 120583119894[119910119894(w sdot x119894+ 119887) minus 1] 119894 = 1 119873 (11)

We denote U = (1205831 1205832 120583

119873) as the 119873 nonnegative

Lagrange multipliers associated with (10) After solving (11)the optimal hyperplane has the following expansion

w =

119873

sum

119894=1

120583119894119910119894119909119894 (12)

119887 can be determined from U and from the Kuhn-Tuckerconditions Consider

120583119894(119910119894(w sdot 119909

119894+ 119887) minus 1) = 0 119894 = 1 2 119873 (13)

Accordingly (11) the expected hyperplane is a linearcombination of training samplesThe corresponding trainingsamples (x

119894 119910119894) with nonzero Lagrange multipliers are called

support vectors Finally the decision value from a new datapoint x can be written as

dec (119909) =

119873

sum

119894=1

120583119894119910119894x119894sdot x + 119887 (14)

Functions that satisfy Mercerrsquos theorem can be used askernels In this paper probability product kernel is adopted

5.2. Probability Product Support Vector Machine. A function can serve as a kernel function if it satisfies Mercer's theorem. Using Mercer's theorem, we can introduce a mapping function $\phi(\mathbf{x})$ such that $k(\mathbf{x}_j, \mathbf{x}_i) = \phi(\mathbf{x}_j) \cdot \phi(\mathbf{x}_i)$. This provides the ability to handle nonlinear data by mapping the original input space $\mathbb{R}^d$ into some other space.

In this paper, the probability product kernel is utilized. The probability product kernel measures the similarity between distributions, and its conception is simple and intuitively compelling [26]. It computes a generalized inner product between two probability distributions in a Hilbert space. A positive definite kernel $k : O \times O \to \mathbb{R}$ is defined on an input space $O$ with examples $\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_m \in O$. First, the input data are mapped to probability distributions by fitting separate probabilistic models $p_1(x), p_2(x), \ldots, p_m(x)$ to $\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_m$. Next, a kernel $k^{\mathrm{prob}}(p_i, p_j)$ between probability distributions on $O$ is defined. Finally, a kernel between examples is defined to equal $k^{\mathrm{prob}}$ between the corresponding distributions:

$$k(\mathbf{o}_i, \mathbf{o}_j) = k^{\mathrm{prob}}(p_i, p_j). \tag{15}$$

This kernel is then applied to the SVM, and training proceeds as usual. The probability product kernel between distributions $p_i$ and $p_j$ is defined as

$$k(p_i, p_j) = \int p_i^{\rho}(\mathbf{x}) \, p_j^{\rho}(\mathbf{x}) \, d\mathbf{x} = \left\langle p_i^{\rho}, p_j^{\rho} \right\rangle_{L_2}, \tag{16}$$

where $p_i$ and $p_j$ are probability distributions on a space $O$, $p_i^{\rho}, p_j^{\rho} \in L_2(O)$, $L_2$ is a Hilbert space, and $\rho$ is a positive constant. The probability product kernel allows us to introduce prior knowledge of the data. In this paper, we assume a $d$-dimensional Gaussian distribution for our data.
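To make (16) concrete: for $\rho = 1$ (the expected likelihood kernel, one special case studied in [26]) and Gaussian models, the integral has a closed form, because the product of two Gaussian densities integrates to a Gaussian density evaluated at the difference of the means. The sketch below rests on that $\rho = 1$ assumption and uses hypothetical function names:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(frames):
    """Fit a d-dimensional Gaussian to frame-level features, shape (n, d)."""
    return frames.mean(axis=0), np.cov(frames, rowvar=False)

def gaussian_ppk(mu_i, cov_i, mu_j, cov_j):
    """Probability product kernel of Eq. (16) for rho = 1:
    int N(x; mu_i, S_i) N(x; mu_j, S_j) dx = N(mu_i; mu_j, S_i + S_j).
    """
    return multivariate_normal.pdf(mu_i, mean=mu_j, cov=cov_i + cov_j)
```

For other values of $\rho$ (e.g., $\rho = 1/2$, the Bhattacharyya kernel) a Gaussian closed form also exists but takes a different expression [26].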

5.3. First-Level Decision Vector Extraction. The first-level decision vector represents the perceived probability of each of the eight emotion-specific decisions; it is extracted from the input data by collecting the decision value of each model. The decision value of an SVM represents the degree of similarity between the model and the testing data, and this similarity measure can be used to find out which model fits the data most accurately [24]. Using the first-level decision vector, the most probably perceived emotion in a music clip can be detected.
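The extraction step can be sketched as follows. scikit-learn's SVC, which wraps LIBSVM [31], accepts a precomputed Gram matrix, so probability product kernel values can be supplied directly; the argument names below are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn wraps LIBSVM [31]

def train_emotion_svms(gram_matrices, labels):
    """One 2-class SVM per emotion, each with calm as the non-target side.

    gram_matrices[k]: (n_k, n_k) probability-product Gram matrix of the k-th
    model's training clips; labels[k]: (n_k,) array of +1/-1 labels.
    """
    return [SVC(kernel="precomputed").fit(G, y)
            for G, y in zip(gram_matrices, labels)]

def first_level_vector(svms, test_kernel_rows):
    """Collect one decision value per emotion model (eight in total).

    test_kernel_rows[k]: (n_k,) kernel values between the test clip and the
    k-th model's training clips.
    """
    return np.array([svm.decision_function(row[None, :])[0]
                     for svm, row in zip(svms, test_kernel_rows)])
```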

5.4. Relevance Vector Machine. The RVM is a development of the SVM. Different from the SVM, the RVM seeks a set of weights with the highest sparsity [27]. The model defines a conditional distribution for the target class $y \in \{0, 1\}$ given an input set $\mathbf{x}_1, \ldots, \mathbf{x}_n$ [28]. Assume that a training datum can be modeled as a linear combination of weighted nonlinear basis functions $\phi_i(\mathbf{x})$, which is then transformed by a logistic sigmoid function:

$$f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{T} \boldsymbol{\phi}(\mathbf{x}), \tag{17}$$

where $\mathbf{w} = (w_1, w_2, \ldots, w_G)$ denotes the weights and $\boldsymbol{\phi}(\mathbf{x}) = (\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_G(\mathbf{x}))^{T}$. To make the weights sparse, the Bayesian probabilistic framework is implemented to find the distribution over the weights instead of using a pointwise estimation; therefore, a separate hyperparameter $a_i$ for each of the weight parameters $w_i$ is introduced.

[Figure 6: DET curve of the proposed system. Miss probability (%) versus false alarm probability (%); both axes span 0.1 to 40.]

According to Bayes' rule, the posterior probability of $\mathbf{w}$ is

$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{a}) = \frac{p(\mathbf{y} \mid \mathbf{w}, \mathbf{a}) \, p(\mathbf{w} \mid \mathbf{a})}{p(\mathbf{y} \mid \mathbf{a})}, \tag{18}$$

where $p(\mathbf{y} \mid \mathbf{w}, \mathbf{a})$ is the likelihood, $p(\mathbf{w} \mid \mathbf{a})$ is the prior conditioned on the weights, $\mathbf{a} = [a_1, \ldots, a_n]^{T}$, and $p(\mathbf{y} \mid \mathbf{a})$ denotes the evidence. Because $y$ is a binary variable, the likelihood function is given by

$$p(\mathbf{y} \mid \mathbf{w}, \mathbf{a}) = \prod_{i=1}^{n} \left[\sigma(f(\mathbf{x}_i, \mathbf{w}))\right]^{y_i} \left[1 - \sigma(f(\mathbf{x}_i, \mathbf{w}))\right]^{1 - y_i}, \tag{19}$$

where $\sigma(f) = 1/(1 + e^{-f})$ is the logistic sigmoid link function. Under this framework, a significant proportion of the hyperparameters tend to infinity, and the corresponding posterior distributions of the weight parameters concentrate at zero. The basis functions multiplied by these weights are therefore dropped when training the model, and as a result the model is sparse.
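A minimal numpy sketch of the likelihood in (19) and of the pruning that produces sparsity follows; Phi, w, y, and the pruning cutoff are hypothetical stand-ins, not the toolbox implementation:

```python
import numpy as np

def sigmoid(f):
    """Logistic sigmoid link, sigma(f) = 1 / (1 + exp(-f))."""
    return 1.0 / (1.0 + np.exp(-f))

def rvm_likelihood(w, Phi, y):
    """Eq. (19): prod_i sigma(f_i)^y_i * (1 - sigma(f_i))^(1 - y_i),
    with f = Phi @ w as in Eq. (17) and y holding 0/1 targets."""
    p = sigmoid(Phi @ w)
    return float(np.prod(p ** y * (1.0 - p) ** (1 - y)))

def prune_basis(Phi, a, limit=1e6):
    """Drop basis functions whose hyperparameter a_i has diverged toward
    infinity (hypothetical cutoff); the survivors are the relevance vectors."""
    keep = a < limit
    return Phi[:, keep], keep
```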

6. Experimental Results

In the experiments, we collected one hundred songs from two websites, All Music Guide [29] and Last.fm [30], to construct a music emotion database. As mentioned before, music may contain multiple emotions; if we know which emotion class a song most likely belongs to, we may know the main emotion of the song. Songs on Last.fm are tagged by many people on the Internet, and we take the emotion tagged by the most people as the ground truth.

The database consists of nine classes of emotions: happy, angry, sad, bored, nervous, relaxed, pleased, calm, and peaceful. Calm is taken as a model's opposite side when training the models. Each emotion class contains twenty songs. Each song is thirty seconds long and is divided into five-second clips. Half of the songs are used as training data and the others as testing data; in total, 240 music clips are tested. All of the songs are Western music encoded in 16 kHz WAV format. The acoustical feature set is listed in Table 1, and the whole feature set has dimension 30. The SVMs are built on the LIBSVM library [31], and the RVM on the PRT toolbox [32]. The system performance is evaluated in terms of the DET curve; Figure 6 depicts the DET curve of the proposed happiness verification system. The proposed system achieves a 13.33% equal error rate (EER). From these results, we see that the system performs well on happiness emotion verification in music.
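The DET curve plots the miss probability against the false alarm probability as the happiness verification threshold sweeps over the RVM probability values, and the EER is the operating point at which the two error rates are equal. Below is a minimal sketch of this evaluation, assuming hypothetical score arrays for happy (target) and non-happy (non-target) test clips:

```python
import numpy as np

def det_curve(target_scores, nontarget_scores):
    """Miss and false-alarm probabilities over a sweep of thresholds."""
    thresholds = np.unique(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return miss, fa

def equal_error_rate(target_scores, nontarget_scores):
    """EER: the operating point where miss and false-alarm rates cross."""
    miss, fa = det_curve(target_scores, nontarget_scores)
    i = int(np.argmin(np.abs(miss - fa)))
    return (miss[i] + fa[i]) / 2.0
```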

7. Conclusion

Detecting emotion in music has attracted the attention of many researchers in recent years. In this paper, we proposed a first-level decision-vector-based music happiness emotion detection system. The proposed system adopts a hierarchical structure of sparse kernel machines. First, eight SVM models are trained on acoustical features with the probability product kernel. Eight decision values are then extracted to construct the first-level decision vector feature. These eight decision values are used as a new feature to train and test a 2-class RVM with happiness as the target side, and the probability value of the RVM is used to verify happiness content in music. Experimental results reveal that the proposed system achieves a 13.33% equal error rate (EER).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] R. E. Milliman, "Using background music to affect the behavior of supermarket shoppers," Journal of Marketing, vol. 46, no. 3, pp. 86-91, 1982.

[2] C.-H. Yeh, H.-H. Lin, and H.-T. Chang, "An efficient emotion detection scheme for popular music," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '09), pp. 1799-1802, Taipei City, Taiwan, May 2009.

[3] Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H. H. Chen, "A regression approach to music emotion recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 448-457, 2008.

[4] Y. H. Yang and H. H. Chen, "Prediction of the distribution of perceived music emotions using discrete samples," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2184-2196, 2011.

[5] B. Han, S. Rho, R. B. Dannenberg, and E. Hwang, "SMERS: music emotion recognition using support vector regression," in Proceedings of the International Conference on Music Information Retrieval, Kobe, Japan, 2009.

[6] L. Lu, D. Liu, and H.-J. Zhang, "Automatic mood detection and tracking of music audio signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 5-18, 2006.

[7] C.-Y. Chang, C.-Y. Lo, C.-J. Wang, and P.-C. Chung, "A music recommendation system with consideration of personal emotion," in Proceedings of the International Computer Symposium (ICS '10), pp. 18-23, Tainan City, Taiwan, December 2010.

[8] F. C. Hwang, J. S. Wang, P. C. Chung, and C. F. Yang, "Detecting emotional expression of music with feature selection approach," in Proceedings of the International Conference on Orange Technologies (ICOT '13), pp. 282-286, March 2013.

[9] K. Hevner, "Expression in music: a discussion of experimental studies and theories," Psychological Review, vol. 42, no. 2, pp. 186-204, 1935.

[10] M. Chouchane, S. Paris, F. Le Gland, C. Musso, and D.-T. Pham, "On the probability distribution of a moving target: asymptotic and non-asymptotic results," in Proceedings of the 14th International Conference on Information Fusion (FUSION '11), pp. 1-8, July 2011.

[11] O. Lartillot and P. Toiviainen, "MIR in Matlab (II): a toolbox for musical feature extraction from audio," in Proceedings of the International Conference on Music Information Retrieval, pp. 127-130, 2007, https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox.

[12] C.-W. Chen, K. Lee, and H.-H. Wu, "Towards a class-based representation of perceptual tempo for music retrieval," in Proceedings of the 8th International Conference on Machine Learning and Applications (ICMLA '09), pp. 602-607, December 2009.

[13] W. Chai, "Semantic segmentation and summarization of music," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 124-132, 2006.

[14] M. A. Bartsch and G. H. Wakefield, "Audio thumbnailing of popular music using chroma-based representations," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 96-104, 2005.

[15] X. Yu, J. Zhang, J. Liu, W. Wan, and W. Yang, "An audio retrieval method based on chromagram and distance metrics," in Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP '10), pp. 425-428, Shanghai, China, November 2010.

[16] J. O. García and C. A. R. García, "Mel-frequency cepstrum coefficients extraction from infant cry for classification of normal and pathological cry with feed-forward neural networks," in Proceedings of the International Joint Conference on Neural Networks, pp. 3140-3145, July 2003.

[17] C. Y. Lin and S. Cheng, "Multi-theme analysis of music emotion similarity for jukebox application," in Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP '12), pp. 241-246, July 2012.

[18] B. Shao, M. Ogihara, D. Wang, and T. Li, "Music recommendation based on acoustic features and user access patterns," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, pp. 1602-1611, 2009.

[19] W.-Q. Zhang, D. Yang, J. Liu, and X. Bao, "Perturbation analysis of mel-frequency cepstrum coefficients," in Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP '10), pp. 715-718, Shanghai, China, November 2010.

[20] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, New York, NY, USA, 2005.

[21] ISO/IEC JTC1 SC29 WG11 Moving Pictures Experts Group, "Information technology—multimedia content description interface—part 4: Audio," Committee Draft 15938-4, ISO/IEC, 2000.

[22] M. Casey, "MPEG-7 sound-recognition tools," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 737-747, 2001.

[23] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.

[24] E. Mower, M. J. Matarić, and S. Narayanan, "A framework for automatic human emotion classification using emotion profiles," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1057-1070, 2011.

[25] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, New York, NY, USA, 2nd edition, 2001.

[26] T. Jebara, R. Kondor, and A. Howard, "Probability product kernels," Journal of Machine Learning Research, vol. 5, pp. 819-844, 2004.

[27] F. A. Mianji and Y. Zhang, "Robust hyperspectral classification using relevance vector machine," IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6, pp. 2100-2112, 2011.

[28] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer, New York, NY, USA, 2nd edition, 2007.

[29] "The All Music Guide," http://www.allmusic.com.

[30] "Last.fm," http://cn.last.fm/home.

[31] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[32] "Pattern Recognition Toolbox," http://www.newfolderconsulting.com/prt.
