Audio Music Monitoring: Analyzing Current Techniques for Song Recognition and Identification
E.D. Nishan W. Senevirathna and Lakshman Jayaratne
Abstract—When people are attached to or interested in something, they usually try to interact with it frequently. Music has been attached to people since the day they were born. As music repositories grow, people face many challenges, such as finding a song quickly, categorizing and organizing collections, and listening to a song again on demand. Because of this, people turn to electronic solutions. To index music, most researchers use content-based information retrieval, since content-based classification needs no additional information beyond the audio features embedded in the signal. It is also the most suitable way to search for music when the user does not know the metadata attached to it, such as the author of the song. The most valuable application of this kind of audio recognition is copyright infringement detection. Throughout this survey we present approaches proposed by various researchers for detecting and recognizing music using content-based mechanisms, and we conclude by analyzing the current status of the field.
Keywords—Audio fingerprint; feature extraction; wavelets; broadcast monitoring; audio classification; audio identification.
I. INTRODUCTION
Music repositories around the world are growing exponentially, and new technologies let new artists enter the field easily. Once we hear a new song, we cannot easily find it again unless we know its metadata, such as the author or singer. The most common method of accessing music is still through textual metadata, but this no longer works well against huge music collections. In the audio music recognition field, the key considerations are the following.
- Can we find an unknown song using a small part of it, or by humming the melody?
- Can we organize and index songs without metadata such as the singer of the song?
- Can we detect copyright infringement, for example after a song is broadcast on a radio channel?
- Can we identify a cover song when multiple versions exist?
- Can we obtain a statistical report about the songs broadcast on a radio channel without a manual monitoring process?
These considerations motivate researchers to find proper solutions to these challenges. Many ideas have been proposed, and some have been implemented; Shazam is one example. However, this is still a challenging research area, since there is no optimal solution. The problem becomes even more complex when:
- the audio signal is altered by noise;
- the audio signal is polluted by unrelated audio objects, such as advertisements in radio broadcasting;
- multiple versions of a song exist;
- only a small part of a song is available.
In any of the above situations the human auditory system can still recognize music, but providing an automated electronic solution is a very challenging task: the similarity between the original music and the query may be small, and the similar features may be impossible to model mathematically. This means researchers also need to consider perceptual features in order to provide a proper solution. Feature extraction can be considered the heart of any of these approaches, since accuracy and overall performance depend on how features are extracted.
The rest of this survey provides a broad overview and comparison of the proposed feature extraction methods, searching algorithms, and overall solution architectures.
DOI: 10.5176/2251-3043_4.3.328
GSTF Journal on Computing (JoC) Vol.4 No.3, October 2015
The Author(s) 2015. This article is published with open access by the GSTF
23
Received 20 Jul 2015 Accepted 13 Aug 2015
DOI 10.7603/s40601-014-0015-7
II. CLASSIFICATION (RECOGNITION) VS. IDENTIFICATION
What is the difference between audio recognition (classification) and identification? In audio classification, an audio object is classified into pre-defined sets such as song, advertisement, or vocals, but it is not identified further. Ultimately we know that a clip is a song or an advertisement, but we do not know which song it is. Audio classification is less complex than identification. Most of the time the two are combined in order to get better results. For example, in an audio song recognition system, we can first extract only the songs from a collection of other audio objects using an audio classifier and feed the output into the audio identification system. This kind of approach gives better results by narrowing down the search space. Many audio classification approaches have been proposed; some of them are discussed in the next subsection.
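The classify-then-identify pipeline described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the class labels, the thresholded features, and the one-number "fingerprints" take the place of a real classifier and a real matching algorithm.

```python
# Toy sketch of a classify-then-identify pipeline.
# Feature names, thresholds, and the database are illustrative only.

def classify(clip_features):
    """Coarse stage: label a clip as 'song', 'speech', or 'ad'
    by thresholding two hypothetical features."""
    if clip_features["harmonicity"] > 0.7:
        return "song"
    return "speech" if clip_features["zcr"] > 0.3 else "ad"

def identify(clip_features, song_db):
    """Fine stage: nearest fingerprint among known songs only."""
    return min(song_db,
               key=lambda s: abs(s["fingerprint"] - clip_features["fingerprint"]))

song_db = [
    {"title": "Song A", "fingerprint": 0.42},
    {"title": "Song B", "fingerprint": 0.91},
]

clip = {"harmonicity": 0.8, "zcr": 0.1, "fingerprint": 0.40}
if classify(clip) == "song":            # gate: only songs reach the matcher,
    match = identify(clip, song_db)     # so the search space is narrowed
```

The gate is the point of the design: the expensive identification search runs only on clips the cheap classifier has already labeled as songs.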
A. Audio Classification
1) Overview
There is a considerable number of real-world applications for audio classification. For example, it is very helpful to be able to automatically search for sound effects, such as explosions, windstorms, earthquakes, or animals, in a very large audio database during film post-processing [1]. Audio content analysis and classification are also useful for audio-assisted video classification. For example, all videos of gunfight scenes should include the sound of shooting and/or explosions, but the image content may vary significantly from one scene to another.
When classifying audio content into different sets, different classes have to be considered. Most researchers have started by classifying speech versus music; however, the classes depend on the situation. For example, "music", "speech", and "others" can be considered for the parsing of news stories, whereas an audio recording can be classified into "speech", "laughter", "silence", and "non-speech" for the purpose of segmenting discussion recordings in meetings [1]. In any of these cases, we have to extract some sort of audio features. This is the challenging part, and it is where past research differs. We can, however, consider feature extraction for audio classification and feature extraction for audio identification separately, since most of the time these two cases use disjoint feature sets [7].
2) Feature extraction for audio classification
Most of the time, the output of audio classification is the input of audio identification. This reduces the search space, speeds up the process, and helps retrieve better results. In most of the research, audio classification is broken down into further steps. In [1], two steps are used: in the first stage, the audio signal is segmented and classified into basic types, including speech, music, several types of environmental sounds, and silence. They call this the coarse-level classification. In the second stage, further classification is conducted within each basic type. Speech is differentiated into the voices of men, women, and children, as well as speech with a music background, and so on. Music is classified according to instruments or genres (for example, classical, blues, jazz, rock and roll, music with singing, and plain song). Environmental sounds are classified into finer classes such as applause, bell rings, footsteps, windstorms, laughter, birds' cries, and so on. They call this the fine-level classification. The overall idea is to reduce the search space step by step in order to get better results. We can also use a proper feature extraction mechanism for each fine-level class based on its basic type; for example, due to differences in the origination of the three basic types of audio, i.e. speech, music, and environmental sounds, different approaches can be taken in their fine classification. Most researchers have used low-level (physical, acoustic) features such as the spectral centroid or Mel-frequency coefficients, but end users may prefer to interact at a higher semantic level [2]. For example, they may want to find a dog-barking sound rather than "environmental sounds" in general. However, low-level features can be extracted far more easily using signal processing than high-level (perceptual) features.
Most researchers have used the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM) as the pattern recognition tool. These are widely used, very powerful statistical tools in pattern recognition, and to use them we have to extract distinctive features. Audio features can be grouped into two or more sets. Most researchers group all audio features into two groups: physical (or mathematical) features and perceptual features. Physical features are extracted directly from the audio wave, such as the energy of the wave, frequency, peaks, average zero crossings, and so on. These features cannot be identified by the human auditory system. Perceptual features, in contrast, are the features humans can understand, such as loudness, pitch, timbre, and rhythm. Perceptual features cannot easily be modeled by mathematical functions, but they are very important audio features, since humans use them to differentiate audio.
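As a minimal illustration of physical features, the sketch below computes energy, peak amplitude, and a zero-crossing count directly from a synthetic waveform. The 8 kHz sampling rate and the 440 Hz test tone are arbitrary choices for the example, not values taken from the surveyed papers.

```python
import numpy as np

# Three physical features computed directly from the raw waveform.
sr = 8000                                       # assumed sampling rate (Hz)
t = np.arange(sr) / sr                          # one second of sample times
x = 0.5 * np.sin(2 * np.pi * 440 * t + 0.1)     # 440 Hz test tone

energy = float(np.sum(x ** 2))                  # signal energy (sum of squares)
peak = float(np.max(np.abs(x)))                 # peak amplitude
# Count sign changes between consecutive samples (zero crossings).
zero_crossings = int(np.sum(np.abs(np.diff(np.sign(x))) > 0))
```

For a pure 440 Hz tone, the crossing count comes out close to 880 per second (two crossings per cycle), which is exactly the "rough estimate of the dominant frequency" role the ZCR plays in the text.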
Sometimes, however, audio features are classified into hierarchical groups with similar characteristics [12]. There, all audio features are divided into six main categories; refer to Figure 1.
Figure 1. High-level audio feature classification [12].
However, no one can define an audio feature and its category exactly, since there is no broad consensus on the allocation of features to particular groups. The same feature may be classified into two different groups by two different researchers, depending on the viewpoints of the authors. The features defined in Figure 1 can be further classified into several groups by considering the structure of each feature.
Considering the structure of the temporal-domain features, [12] classifies them into three subgroups: amplitude-based, power-based, and zero-crossing-based features. Each of these features is related to one or more physical properties of the wave; refer to Figure 2.
Figure 2. The organization of features in Temporal Domain [12].
Here, some researchers have defined the zero-crossing rate (ZCR) as a physical feature. Frequency-domain features are very important; most researchers consider only the frequency domain. Next we look at the frequency-domain feature classification done by [12]; refer to Figure 3.
Some researchers have further subdivided the other four main feature categories as well, but those subdivisions are less important here. Next we discuss the main characteristics of the major features.
Figure 3. The organization of features in Frequency Domain [12].
a) Temporal (raw) domain features
Most of the time, we cannot extract features without transforming the native audio signal. However, several features can be extracted from the native signal directly; these are known as temporal features. Since the native signal does not need to be transformed, this is a very low-cost feature extraction methodology, but temporal features alone cannot uniquely identify a piece of music.
The zero-crossing rate is a main temporal-domain feature. It is a very helpful, low-cost feature that is often used in audio classification. It is usually defined as the number of zero crossings in the temporal domain within one second, and it is a rough estimate of the dominant frequency and the spectral centroid [12]. Sometimes the ZCR is obtained after altering the audio signal slightly: frequency information and correspondingly scaled intensity sub-bands are extracted from the time-domain zero crossings. This gives a more stable measurement and is very helpful in noisy environments, since noise is spread around the zero axis but does not create a considerable number of peaks, so a peak-related zero-crossing rate remains unchanged.
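The classification use of the ZCR can be illustrated with a frame-wise (short-time) version. The frame length, sampling rate, and the two test signals below are illustrative: a low-frequency tone stands in for voiced-like audio and white noise for unvoiced-like audio.

```python
import numpy as np

# Short-time ZCR: fraction of sign changes per frame.
def short_time_zcr(x, frame_len=256):
    frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)
    signs = np.sign(frames)
    # Mean of sign-change indicators within each frame.
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t + 0.1)               # voiced-like: low ZCR
noise = np.random.default_rng(0).standard_normal(sr)   # unvoiced-like: high ZCR

zcr_tone = short_time_zcr(tone)
zcr_noise = short_time_zcr(noise)
```

The tone's per-frame ZCR sits near 0.05 (two crossings per 200 Hz cycle at 8 kHz), while the noise's sits near 0.5, which is exactly the gap classifiers exploit.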
Amplitude-based features are another example of temporal-domain features. They are obtained by directly reading the amplitude of the audio signal. This is again a good measurement, but it is subject to change even when the audio signal is altered slightly, for example by noise or other unwanted effects.
Power measurements are also raw-domain features and are closely related to the amplitude-based features: the power, or energy, of a signal is the square of the amplitude represented by the waveform. Volume is a well-known power-measurement feature, widely used in silence detection and speech/music segmentation.
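Silence detection with a volume feature can be sketched as below: the RMS amplitude is computed per frame and thresholded. The 256-sample frame length and the threshold value are illustrative guesses, not values from the surveyed work.

```python
import numpy as np

# Volume = RMS amplitude per frame; frames below a threshold are "silent".
def frame_volume(x, frame_len=256):
    frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))   # RMS per frame

sr = 8000
t = np.arange(sr) / sr
# Half a second of silence followed by half a second of a 440 Hz tone.
signal = np.concatenate([np.zeros(sr // 2),
                         0.5 * np.sin(2 * np.pi * 440 * t[: sr // 2])])

vol = frame_volume(signal)
is_silent = vol < 0.01                             # threshold is illustrative
```

In practice the threshold is usually set adaptively (e.g. relative to the recording's noise floor) rather than fixed as here.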
b) Physical features
Most audio features are obtained from the frequency domain, since almost all features live in this domain. Before extracting frequency-domain features, we have to transform the base signal into another representation. Several methods can be used for this; the most popular are the Fourier transform and autocorrelation, while other popular methods are the cosine transform, the wavelet transform, and the constant-Q transform [12]. Frequency-domain features can be categorized into two major classes: physical features and perceptual features. Physical features are defined using physical characteristics of the audio signal that have no semantic meaning. Next we discuss the main physical features, and then the perceptual features.
Auto-regression-based features: In statistics and signal processing, an autoregressive (AR) model is a representation of a type of random process; as such, it describes certain time-varying processes in nature, economics, etc. [18]. It is a widely used standard technique for speech/music discrimination, and it can be used to extract basic parameters of a speech signal, such as formant frequencies and the vocal tract transfer function [18]. This feature group is sometimes divided further into two subgroups, linear predictive coding (LPC) and line spectral frequencies (LSF), but we will not discuss these subgroups in detail here.
Short-time Fourier transform (STFT)-based features: These are another widely used family of audio features based on the audio spectrum. The STFT can be used to obtain characteristics of both the frequency component and the phase component. Several features are derived from the STFT, such as Shannon entropy, Renyi entropy, spectral centroid, spectral bandwidth, spectral flatness measure, spectral crest factor, and Mel-frequency cepstral coefficients [15].
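Three of these spectral features can be sketched from a single frame's magnitude spectrum. The frame size, sampling rate, and the 1 kHz test tone are illustrative.

```python
import numpy as np

sr, n = 8000, 1024
t = np.arange(n) / sr
frame = np.sin(2 * np.pi * 1000 * t) * np.hamming(n)   # windowed 1 kHz tone

mag = np.abs(np.fft.rfft(frame))                       # magnitude spectrum
freqs = np.fft.rfftfreq(n, d=1 / sr)                   # bin frequencies (Hz)

# Spectral centroid: magnitude-weighted mean frequency.
centroid = np.sum(freqs * mag) / np.sum(mag)
# Spectral bandwidth: magnitude-weighted spread around the centroid.
bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * mag) / np.sum(mag))
# Spectral flatness: geometric / arithmetic mean (near 0 = tonal, near 1 = noisy).
flatness = np.exp(np.mean(np.log(mag + 1e-12))) / np.mean(mag)
```

For this pure tone the centroid lands near 1000 Hz and the flatness near zero, matching the tone-like/noise-like interpretation given for the SFM below.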
Short-time energy function: The energy of an audio signal is measured by its amplitude; the amplitude variation over time is called the energy function of the signal. For speech signals it is a basis for distinguishing voiced speech components from unvoiced ones, as the energy function values for unvoiced components are significantly smaller than those of the voiced components [1].
Short-time average zero-crossing rate (ZCR): This feature is another measurement for separating voiced and unvoiced speech components. Voiced components usually have a much smaller ZCR than unvoiced components [1].
Short-time fundamental frequency (FuF): This feature captures harmonic properties. Most musical instrument sounds are harmonic, while some sounds can be a mixture of harmonic and non-harmonic components. This feature can also be used to classify audio objects [1].
Spectral flatness measure (SFM): an estimate of the tone-like or noise-like quality of a band in the spectrum [1]. It is widely used for audio classification.
There are other widely used physical features, such as the Mel-frequency cepstrum coefficients (MFCC). Papaodysseus et al. (2001) presented the band representative vectors, an ordered list of indexes of bands with prominent tones (i.e. with peaks of significant amplitude). The energy of each band is used by Kimura et al. (2001). Normalized spectral sub-band centroids are proposed by Seo et al. (2005). Haitsma et al. use the energies of 33 Bark-scaled bands to obtain their hash string, which is the sign of the energy band differences (along both the time and the frequency axes), and so on.
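The Haitsma-style hash string can be sketched structurally as follows: each bit is the sign of an energy difference taken first across adjacent bands and then across adjacent frames. This is a simplified sketch; the actual scheme works on heavily overlapping frames of real band energies, whereas the random "energies" here are placeholders.

```python
import numpy as np

# Sketch of a sign-of-energy-differences hash (Haitsma et al. style).
def hash_bits(band_energy):
    """band_energy: 2-D array of shape (frames, bands)."""
    d_band = np.diff(band_energy, axis=1)   # difference between adjacent bands
    d_time = np.diff(d_band, axis=0)        # difference to the previous frame
    return (d_time > 0).astype(np.uint8)    # one bit per band pair per frame

rng = np.random.default_rng(0)
energies = rng.random((4, 33))              # 4 frames x 33 bands (placeholder)
bits = hash_bits(energies)                  # 32 bits per frame transition
```

Because only signs are kept, the resulting bits are robust to global gain changes, which is what makes this kind of sub-fingerprint useful for matching degraded broadcast audio.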
Most of the time, silent audio frames are identified early and are not passed on for further processing. There are several approaches to identifying and defining a silent frame; some researchers have used the ZCR property. In [4], silent frames are defined as follows.
Before feature extraction, the audio signal (8-bit ISDN mu-law encoding) is pre-emphasized with parameter 0.96 and then divided into frames. Given the sampling frequency of 8000 Hz, the frames are 256 samples (32 ms) each, with 25% (64 samples, or 8 ms) overlap between adjacent frames. Each frame is Hamming-windowed by w_i = 0.54 - 0.46 * cos(2*pi*i/256), and it is marked as a silent frame if its short-time energy (the sum of the squared windowed samples) falls below a preset threshold.
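This pre-processing chain (pre-emphasis, framing with overlap, Hamming windowing) can be sketched directly from the stated parameters. The random input array is a stand-in for real mu-law-decoded audio.

```python
import numpy as np

# Pre-processing per the description: pre-emphasis coefficient 0.96,
# 256-sample frames (32 ms at 8 kHz), 25% (64-sample) overlap,
# Hamming window w_i = 0.54 - 0.46*cos(2*pi*i/256).
sr, frame_len, hop = 8000, 256, 192      # 64-sample overlap -> hop of 192

x = np.random.default_rng(2).standard_normal(sr)   # stand-in for real audio
x = np.append(x[0], x[1:] - 0.96 * x[:-1])         # pre-emphasis filter

w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
starts = range(0, len(x) - frame_len + 1, hop)
frames = np.stack([x[s:s + frame_len] * w for s in starts])
```

One second of 8 kHz audio yields 41 windowed frames here; each frame would then be tested for silence and, if kept, passed to feature extraction.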