Audio Music Monitoring: Analyzing Current Techniques for Song Recognition and Identification
E.D. Nishan W. Senevirathna and Lakshman Jayaratne
Abstract—When people are attached to or interested in something, they usually try to interact with it frequently. Music has been attached to people since the day they were born. As music repositories grow, people face many challenges, such as finding a song quickly, categorizing and organizing collections, and listening to a song again on demand. Because of this, people turn to electronic solutions. To index music, most researchers use content-based information retrieval, since content-based classification needs no additional information beyond the audio features embedded in the signal. It is also the most suitable way to search for music when the user does not know the metadata attached to it, such as the author of the song. The most valuable application of this kind of audio recognition is copyright infringement detection. Throughout this survey we present approaches proposed by various researchers for detecting and recognizing music using content-based mechanisms, and we conclude by analyzing the current status of the field.
Keywords—Audio fingerprint; feature extraction; wavelets; broadcast monitoring; audio classification; audio identification.
I. INTRODUCTION
Music repositories around the world are growing exponentially, and new technologies let new artists enter the field easily. Once we hear a new song, we cannot easily find it again unless we know its metadata, such as the author or singer. The most common method of accessing music is still through textual metadata, but this no longer works well against huge music collections. In the audio music recognition field, the key considerations are the following.
- Can we find an unknown song using a small part of it, or by humming the melody?
- Can we organize and index songs without metadata such as the singer of the song?
- Can we detect copyright infringement, for example after a song is broadcast on a radio channel?
- Can we identify a cover song when multiple versions exist?
- Can we obtain a statistical report about the songs broadcast on a radio channel without a manual monitoring process?
These considerations motivate researchers to find proper solutions to these challenges. Many ideas have been proposed, and some have been implemented; Shazam is one example. However, this is still a challenging research area, since there is no optimal solution. The problem becomes even more complex when:
- the audio signal is altered by noise;
- the audio signal is polluted by unrelated audio objects, such as advertisements in radio broadcasting;
- multiple versions of a song exist;
- only a small part of a song is available.
In any of the above situations the human auditory system can still recognize music, but providing an automated electronic solution is a very challenging task: the similarity between the original music and the query may be small, and the similar features may be impossible to model mathematically. This means researchers also need to consider perceptual features in order to provide a proper solution. Feature extraction can be considered the heart of any of these approaches, since accuracy and overall performance depend on how features are extracted.
The rest of this survey provides a broad overview and comparison of the proposed feature extraction methods, searching algorithms, and overall solution architectures.
DOI: 10.5176/2251-3043_4.3.328
GSTF Journal on Computing (JoC) Vol.4 No.3, October 2015
The Author(s) 2015. This article is published with open access by the GSTF
23
Received 20 Jul 2015 Accepted 13 Aug 2015
DOI 10.7603/s40601-014-0015-7
II. CLASSIFICATION (RECOGNITION) VS. IDENTIFICATION
What is the difference between audio recognition (classification) and identification? In audio classification, an audio object is classified into pre-defined sets such as song, advertisement, or vocals, but it is not identified further. Ultimately we know that a clip is a song or an advertisement, but we do not know which song it is. Audio classification is less complex than identification. Most of the time the two are combined in order to get better results. For example, in an audio song recognition system, we can first extract only the songs from a collection of other audio objects using an audio classifier and feed the output into the audio identification system. This kind of approach gives better results by narrowing down the search space. Many audio classification approaches have been proposed; some of them are discussed in the next subsection.
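The classify-then-identify pipeline described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the class labels, the thresholded features, and the one-number "fingerprints" take the place of a real classifier and a real matching algorithm.

```python
# Toy sketch of a classify-then-identify pipeline.
# Feature names, thresholds, and the database are illustrative only.

def classify(clip_features):
    """Coarse stage: label a clip as 'song', 'speech', or 'ad'
    by thresholding two hypothetical features."""
    if clip_features["harmonicity"] > 0.7:
        return "song"
    return "speech" if clip_features["zcr"] > 0.3 else "ad"

def identify(clip_features, song_db):
    """Fine stage: nearest fingerprint among known songs only."""
    return min(song_db,
               key=lambda s: abs(s["fingerprint"] - clip_features["fingerprint"]))

song_db = [
    {"title": "Song A", "fingerprint": 0.42},
    {"title": "Song B", "fingerprint": 0.91},
]

clip = {"harmonicity": 0.8, "zcr": 0.1, "fingerprint": 0.40}
if classify(clip) == "song":            # gate: only songs reach the matcher,
    match = identify(clip, song_db)     # so the search space is narrowed
```

The gate is the point of the design: the expensive identification search runs only on clips the cheap classifier has already labeled as songs.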
A. Audio Classification
1) Overview
There is a considerable number of real-world applications for audio classification. For example, it is very helpful to be able to automatically search for sound effects, such as explosions, windstorms, earthquakes, or animals, in a very large audio database during film post-processing [1]. Audio content analysis and classification are also useful for audio-assisted video classification. For example, all videos of gunfight scenes should include the sound of shooting and/or explosions, but the image content may vary significantly from one scene to another.
When classifying audio content into different sets, different classes have to be considered. Most researchers have started by classifying speech versus music; however, the classes depend on the situation. For example, "music", "speech", and "others" can be considered for the parsing of news stories, whereas an audio recording can be classified into "speech", "laughter", "silence", and "non-speech" for the purpose of segmenting discussion recordings in meetings [1]. In any of these cases, we have to extract some sort of audio features. This is the challenging part, and it is where past research differs. We can, however, consider feature extraction for audio classification and feature extraction for audio identification separately, since most of the time these two cases use disjoint feature sets [7].
2) Feature extraction for audio classification
Most of the time, the output of audio classification is the input of audio identification. This reduces the search space, speeds up the process, and helps retrieve better results. In most of the research, audio classification is broken down into further steps. In [1], two steps are used: in the first stage, the audio signal is segmented and classified into basic types, including speech, music, several types of environmental sounds, and silence. They call this the coarse-level classification. In the second stage, further classification is conducted within each basic type. Speech is differentiated into the voices of men, women, and children, as well as speech with a music background, and so on. Music is classified according to instruments or genres (for example, classical, blues, jazz, rock and roll, music with singing, and plain song). Environmental sounds are classified into finer classes such as applause, bell rings, footsteps, windstorms, laughter, birds' cries, and so on. They call this the fine-level classification. The overall idea is to reduce the search space step by step in order to get better results. We can also use a proper feature extraction mechanism for each fine-level class based on its basic type; for example, due to differences in the origination of the three basic types of audio, i.e. speech, music, and environmental sounds, different approaches can be taken in their fine classification. Most researchers have used low-level (physical, acoustic) features such as the spectral centroid or Mel-frequency coefficients, but end users may prefer to interact at a higher semantic level [2]. For example, they may want to find a dog-barking sound rather than "environmental sounds" in general. However, low-level features can be extracted far more easily using signal processing than high-level (perceptual) features.
Most researchers have used the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM) as the pattern recognition tool. These are widely used, very powerful statistical tools in pattern recognition, and to use them we have to extract distinctive features. Audio features can be grouped into two or more sets. Most researchers group all audio features into two groups: physical (or mathematical) features and perceptual features. Physical features are extracted directly from the audio wave, such as the energy of the wave, frequency, peaks, average zero crossings, and so on. These features cannot be identified by the human auditory system. Perceptual features, in contrast, are the features humans can understand, such as loudness, pitch, timbre, and rhythm. Perceptual features cannot easily be modeled by mathematical functions, but they are very important audio features, since humans use them to differentiate audio.
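As a minimal illustration of physical features, the sketch below computes energy, peak amplitude, and a zero-crossing count directly from a synthetic waveform. The 8 kHz sampling rate and the 440 Hz test tone are arbitrary choices for the example, not values taken from the surveyed papers.

```python
import numpy as np

# Three physical features computed directly from the raw waveform.
sr = 8000                                       # assumed sampling rate (Hz)
t = np.arange(sr) / sr                          # one second of sample times
x = 0.5 * np.sin(2 * np.pi * 440 * t + 0.1)     # 440 Hz test tone

energy = float(np.sum(x ** 2))                  # signal energy (sum of squares)
peak = float(np.max(np.abs(x)))                 # peak amplitude
# Count sign changes between consecutive samples (zero crossings).
zero_crossings = int(np.sum(np.abs(np.diff(np.sign(x))) > 0))
```

For a pure 440 Hz tone, the crossing count comes out close to 880 per second (two crossings per cycle), which is exactly the "rough estimate of the dominant frequency" role the ZCR plays in the text.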
Sometimes, however, audio features are classified into hierarchical groups with similar characteristics [12]. There, all audio features are divided into six main categories; refer to Figure 1.
Figure 1. High-level audio feature classification [12].
However, no one can define an audio feature and its category exactly, since there is no broad consensus on the allocation of features to particular groups. The same feature may be classified into two different groups by two different researchers, depending on the viewpoints of the authors. The features defined in Figure 1 can be further classified into several groups by considering the structure of each feature.
Considering the structure of the temporal-domain features, [12] classifies them into three subgroups: amplitude-based, power-based, and zero-crossing-based features. Each of these features is related to one or more physical properties of the wave; refer to Figure 2.
Figure 2. The organization of features in Temporal Domain [12].
Here, some researchers have defined the zero-crossing rate (ZCR) as a physical feature. Frequency-domain features are very important; most researchers consider only the frequency domain. Next we look at the frequency-domain feature classification done by [12]; refer to Figure 3.
Some researchers have further subdivided the other four main feature categories as well, but those subdivisions are less important here. Next we discuss the main characteristics of the major features.
Figure 3. The organization of features in Frequency Domain [12].
a) Temporal (raw) domain features
Most of the time, we cannot extract features without transforming the native audio signal. However, several features can be extracted from the native signal directly; these are known as temporal features. Since the native signal does not need to be transformed, this is a very low-cost feature extraction methodology, but temporal features alone cannot uniquely identify a piece of music.
The zero-crossing rate is a main temporal-domain feature. It is a very helpful, low-cost feature that is often used in audio classification. It is usually defined as the number of zero crossings in the temporal domain within one second, and it is a rough estimate of the dominant frequency and the spectral centroid [12]. Sometimes the ZCR is obtained after altering the audio signal slightly: frequency information and correspondingly scaled intensity sub-bands are extracted from the time-domain zero crossings. This gives a more stable measurement and is very helpful in noisy environments, since noise is spread around the zero axis but does not create a considerable number of peaks, so a peak-related zero-crossing rate remains unchanged.
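The classification use of the ZCR can be illustrated with a frame-wise (short-time) version. The frame length, sampling rate, and the two test signals below are illustrative: a low-frequency tone stands in for voiced-like audio and white noise for unvoiced-like audio.

```python
import numpy as np

# Short-time ZCR: fraction of sign changes per frame.
def short_time_zcr(x, frame_len=256):
    frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)
    signs = np.sign(frames)
    # Mean of sign-change indicators within each frame.
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t + 0.1)               # voiced-like: low ZCR
noise = np.random.default_rng(0).standard_normal(sr)   # unvoiced-like: high ZCR

zcr_tone = short_time_zcr(tone)
zcr_noise = short_time_zcr(noise)
```

The tone's per-frame ZCR sits near 0.05 (two crossings per 200 Hz cycle at 8 kHz), while the noise's sits near 0.5, which is exactly the gap classifiers exploit.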
Amplitude-based features are another example of temporal-domain features. They are obtained by directly reading the amplitude of the audio signal. This is again a good measurement, but it is subject to change even when the audio signal is altered slightly, for example by noise or other unwanted effects.
Power measurements are also raw-domain features and are closely related to the amplitude-based features: the power, or energy, of a signal is the square of the amplitude represented by the waveform. Volume is a well-known power-measurement feature, widely used in silence detection and speech/music segmentation.
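Silence detection with a volume feature can be sketched as below: the RMS amplitude is computed per frame and thresholded. The 256-sample frame length and the threshold value are illustrative guesses, not values from the surveyed work.

```python
import numpy as np

# Volume = RMS amplitude per frame; frames below a threshold are "silent".
def frame_volume(x, frame_len=256):
    frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))   # RMS per frame

sr = 8000
t = np.arange(sr) / sr
# Half a second of silence followed by half a second of a 440 Hz tone.
signal = np.concatenate([np.zeros(sr // 2),
                         0.5 * np.sin(2 * np.pi * 440 * t[: sr // 2])])

vol = frame_volume(signal)
is_silent = vol < 0.01                             # threshold is illustrative
```

In practice the threshold is usually set adaptively (e.g. relative to the recording's noise floor) rather than fixed as here.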
b) Physical features
Most audio features are obtained from the frequency domain, since almost all features live in this domain. Before extracting frequency-domain features, we have to transform the base signal into another representation. Several methods can be used for this; the most popular are the Fourier transform and autocorrelation, while other popular methods are the cosine transform, the wavelet transform, and the constant-Q transform [12]. Frequency-domain features can be categorized into two major classes: physical features and perceptual features. Physical features are defined using physical characteristics of the audio signal that have no semantic meaning. Next we discuss the main physical features, and then the perceptual features.
Auto-regression-based features: In statistics and signal processing, an autoregressive (AR) model is a representation of a type of random process; as such, it describes certain time-varying processes in nature, economics, etc. [18]. It is a widely used standard technique for speech/music discrimination, and it can be used to extract basic parameters of a speech signal, such as formant frequencies and the vocal tract transfer function [18]. This feature group is sometimes divided further into two subgroups, linear predictive coding (LPC) and line spectral frequencies (LSF), but we will not discuss these subgroups in detail here.
Short-time Fourier transform (STFT)-based features: These are another widely used family of audio features based on the audio spectrum. The STFT can be used to obtain characteristics of both the frequency component and the phase component. Several features are derived from the STFT, such as Shannon entropy, Renyi entropy, spectral centroid, spectral bandwidth, spectral flatness measure, spectral crest factor, and Mel-frequency cepstral coefficients [15].
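Three of these spectral features can be sketched from a single frame's magnitude spectrum. The frame size, sampling rate, and the 1 kHz test tone are illustrative.

```python
import numpy as np

sr, n = 8000, 1024
t = np.arange(n) / sr
frame = np.sin(2 * np.pi * 1000 * t) * np.hamming(n)   # windowed 1 kHz tone

mag = np.abs(np.fft.rfft(frame))                       # magnitude spectrum
freqs = np.fft.rfftfreq(n, d=1 / sr)                   # bin frequencies (Hz)

# Spectral centroid: magnitude-weighted mean frequency.
centroid = np.sum(freqs * mag) / np.sum(mag)
# Spectral bandwidth: magnitude-weighted spread around the centroid.
bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * mag) / np.sum(mag))
# Spectral flatness: geometric / arithmetic mean (near 0 = tonal, near 1 = noisy).
flatness = np.exp(np.mean(np.log(mag + 1e-12))) / np.mean(mag)
```

For this pure tone the centroid lands near 1000 Hz and the flatness near zero, matching the tone-like/noise-like interpretation given for the SFM below.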
Short-time energy function: The energy of an audio signal is measured by its amplitude; the amplitude variation over time is called the energy function of the signal. For speech signals it is a basis for distinguishing voiced speech components from unvoiced ones, as the energy function values for unvoiced components are significantly smaller than those of the voiced components [1].
Short-time average zero-crossing rate (ZCR): This feature is another measurement for separating voiced and unvoiced speech components. Voiced components usually have a much smaller ZCR than unvoiced components [1].
Short-time fundamental frequency (FuF): This feature captures harmonic properties. Most musical instrument sounds are harmonic, while some sounds can be a mixture of harmonic and non-harmonic components. This feature can also be used to classify audio objects [1].
Spectral flatness measure (SFM): an estimate of the tone-like or noise-like quality of a band in the spectrum [1]. It is widely used for audio classification.
There are other widely used physical features, such as the Mel-frequency cepstrum coefficients (MFCC). Papaodysseus et al. (2001) presented the band representative vectors, an ordered list of indexes of bands with prominent tones (i.e. with peaks of significant amplitude). The energy of each band is used by Kimura et al. (2001). Normalized spectral sub-band centroids are proposed by Seo et al. (2005). Haitsma et al. use the energies of 33 Bark-scaled bands to obtain their hash string, which is the sign of the energy band differences (along both the time and the frequency axes), and so on.
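The Haitsma-style hash string can be sketched structurally as follows: each bit is the sign of an energy difference taken first across adjacent bands and then across adjacent frames. This is a simplified sketch; the actual scheme works on heavily overlapping frames of real band energies, whereas the random "energies" here are placeholders.

```python
import numpy as np

# Sketch of a sign-of-energy-differences hash (Haitsma et al. style).
def hash_bits(band_energy):
    """band_energy: 2-D array of shape (frames, bands)."""
    d_band = np.diff(band_energy, axis=1)   # difference between adjacent bands
    d_time = np.diff(d_band, axis=0)        # difference to the previous frame
    return (d_time > 0).astype(np.uint8)    # one bit per band pair per frame

rng = np.random.default_rng(0)
energies = rng.random((4, 33))              # 4 frames x 33 bands (placeholder)
bits = hash_bits(energies)                  # 32 bits per frame transition
```

Because only signs are kept, the resulting bits are robust to global gain changes, which is what makes this kind of sub-fingerprint useful for matching degraded broadcast audio.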
Most of the time, silent audio frames are identified early and are not passed on for further processing. There are several approaches to identifying and defining a silent frame; some researchers have used the ZCR property. In [4], silent frames are defined as follows.
Before feature extraction, the audio signal (8-bit ISDN mu-law encoding) is pre-emphasized with parameter 0.96 and then divided into frames. Given the sampling frequency of 8000 Hz, the frames are 256 samples (32 ms) each, with 25% (64 samples, or 8 ms) overlap between adjacent frames. Each frame is Hamming-windowed by w_i = 0.54 - 0.46 * cos(2*pi*i/256), and it is marked as a silent frame if its short-time energy (the sum of the squared windowed samples) falls below a preset threshold.
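This pre-processing chain (pre-emphasis, framing with overlap, Hamming windowing) can be sketched directly from the stated parameters. The random input array is a stand-in for real mu-law-decoded audio.

```python
import numpy as np

# Pre-processing per the description: pre-emphasis coefficient 0.96,
# 256-sample frames (32 ms at 8 kHz), 25% (64-sample) overlap,
# Hamming window w_i = 0.54 - 0.46*cos(2*pi*i/256).
sr, frame_len, hop = 8000, 256, 192      # 64-sample overlap -> hop of 192

x = np.random.default_rng(2).standard_normal(sr)   # stand-in for real audio
x = np.append(x[0], x[1:] - 0.96 * x[:-1])         # pre-emphasis filter

w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
starts = range(0, len(x) - frame_len + 1, hop)
frames = np.stack([x[s:s + frame_len] * w for s in starts])
```

One second of 8 kHz audio yields 41 windowed frames here; each frame would then be tested for silence and, if kept, passed to feature extraction.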