Page 1: Classification of Underwater Broadband Bio-acoustics Using ...wuwnet.acm.org/2012/files/Papers/05_paper19.pdf · Classification of Underwater Broadband Bio-acoustics Using Spectro-Temporal

Classification of Underwater Broadband Bio-acoustics Using Spectro-Temporal Features

Navinda Kottege
Autonomous Systems Laboratory, CSIRO ICT Centre
Pullenvale QLD 4069, Australia
[email protected]

Raja Jurdak
Autonomous Systems Laboratory, CSIRO ICT Centre
Pullenvale QLD 4069, Australia
[email protected]

Frederieke Kroon
CSIRO Ecosystem Sciences
Atherton QLD 4883, Australia
[email protected]

Dean Jones
CSIRO Ecosystem Sciences
Atherton QLD 4883, Australia
[email protected]

ABSTRACT

Large scale freshwater monitoring networks can passively capture sound for species detection or classification. The sheer volume of acoustic recordings in such systems requires in-network classification. Most of the recent work on bio-acoustic in-network classification targets narrowband or long-duration signals, which renders it unsuitable for classifying species that emit broadband short-duration signals. This paper proposes a method for broadband sound based classification for large scale aquatic monitoring networks. The method is based on the extraction of a small set of spectral and temporal features. We collect empirical fish sounds, using the case study of the spotted tilapia (Tilapia mariae), which is an invasive freshwater fish species in Australia, and extract spectral and temporal features with our method. We then evaluate the classification accuracy and precision of these features for detecting tilapia sounds against the performance of existing narrowband sound features. The results show that using logistic regression with our limited feature set yields the best performance. Surprisingly, performance slightly improves when we downsample the signal from 44.1 to 16 kHz, indicating that our method is well-suited for classification on embedded devices. We quantify the computational benefits of our approach for enabling broader long-term in-situ species tracking in underwater environments.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous

Keywords

Tilapia, Freshwater fish, Sound, Detection, Classification, Underwater, Bio-acoustics

1 Introduction

The health of freshwater ecosystems worldwide is suffering from adverse anthropogenic effects, including population growth, pollution, and the introduction of invasive species. This has motivated new initiatives to deploy large-scale networks of monitoring nodes for environmental assessment [34]. Environmental managers and biologists have an increasing need for tracking resident and invasive species in freshwater ecosystems, particularly for quantifying biomass, identifying spawning regions, and recording movement during floods. The passive capture of acoustic signals is particularly suitable as a noninvasive method for species detection and classification in underwater environments, as these environments facilitate sound propagation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WUWNet'12, Nov. 5 - 6, 2012 Los Angeles, California, USA. Copyright 2012 ACM 978-1-4503-1773-3/12/11 ... $15.00.

Large scale bio-acoustic monitoring in freshwater systems remains a challenge, primarily because the sheer volume of bio-acoustic data captured over the long term makes it prohibitive to either send or store all of the acoustic recordings. This highlights the need for in-network sound classification to identify the presence and biomass of species across space and time. In-network classification of bio-acoustic sound has been demonstrated for terrestrial species, such as amphibians, mammals and birds [20, 2]. Much of the recent work in aquatic bio-acoustic classification has focused on automatically classifying narrowband whale sounds using spectrogram correlation [26, 5]. There are currently no techniques for sound classification of broadband signals [8], which are known to occur in several fish species. A few recent studies have characterised the broadband sound features of specific fish species without an explicit method for classifying their sound [23, 6, 24]. The related body of work on automated music genre classification using various machine learning techniques [36, 35] deals with spectral features that are tailored for the human ear, and these features are therefore not suitable for broadband fish sound classification.

This paper proposes an in-network classification approach for broadband fish sounds in large scale freshwater monitoring systems. The approach is based on the retrospective selection of small feature vectors that are tailored for broadband sound classification, followed by classification using machine learning. Our approach is well-suited for our target embedded audio sensor node, thanks to its small feature set and its robustness under reduced audio sampling rates.

We focus on Tilapia mariae, an invasive fish species in Australia's Northern Tropics, as a representative case study. We record high quality empirical sounds and video from this species, and use the video footage as a ground truth for the fish sound events. This enables us to determine that the species emits broadband short duration sounds of around 10 ms. We then use the empirical data for retrospective feature extraction, yielding six key features. We evaluate the performance of our feature vector, namely the accuracy, precision, sensitivity, and specificity, using Discriminant Analysis and Logistic Regression, and show that it outperforms the same algorithms using timbral texture features and spectrogram correlation. We also downsample the sound signal and show that the performance of Logistic Regression with our feature vector remains nearly the same, making it attractive for underwater bio-acoustic sound classification on embedded platforms. Based on these findings, we quantify the computational benefits of this approach on our target sound sampling node.

Figure 1: System hardware: (a) the audio node board stack, (b) components inside the enclosure and (c) an audio node deployed at a fisheries pond with an attached hydrophone (inset: inside of the pelican case housing showing the same hardware configuration as in (b)).

The remainder of this paper is structured as follows. Section 2 provides an overview of the motivating application of tracking invasive fish species and discusses our proposal for a large scale bio-acoustic monitoring network, its constituent audio nodes, and candidate classification algorithms. Section 3 introduces our spectro-temporal feature method and describes our methodology for empirical data collection and for extraction of the feature vector. Section 4 evaluates our approach against spectrogram correlation and timbral texture features, showing the impact of varying the sampling rate. Section 5 discusses the design implications of the results for in-network species classification on our target audio node, and concludes the paper.

2 Bio-acoustic Monitoring Overview

We first motivate the need for bio-acoustic monitoring through a specific species of invasive fish, T. mariae, and then proceed to discuss the architecture of a large-scale monitoring network to track such species. We then introduce our audio node as the main building block of a monitoring network, and briefly revisit candidate bio-acoustic classification methods that can be implemented on our audio node.

2.1 Motivating Application: Tracking Tilapia mariae

T. mariae (Cichlidae) is a freshwater and estuarine teleost native to West African coastal drainages in the Gulf of Guinea and naturalised in the USA, Australia and possibly Russia [7] due to aquarium and aquaculture releases. In its native range, it can be the dominant fish species in streams, rivers, lakes and estuaries [1], and supports local subsistence and artisanal fisheries in some catchments [27].

Outside its native range, T. mariae has a potential detrimental impact on native ichthyofauna, including competition for food [13] and breeding space [10], thereby affecting the relative abundance of native and endemic species [21].

In Australia, T. mariae is a declared noxious fish under the relevant State Fisheries Acts in all states and territories, except Western Australia, and is listed on the National Noxious Fish List [11]. Notwithstanding, it has continued its range expansion in the Wet Tropics region since its first detection around 1980 [37, 21]. A population in a 5 km section of the Walsh River, a tributary of the western-flowing Mitchell River, was successfully eradicated using rotenone, a chemical used as a piscicide [17]. The long-term efficacy of other T. mariae management strategies, including the banning of its possession [12, 16], the introduction of the peacock cichlid (C. ocellaris) in Florida (USA) [31], public awareness and education in Australia [17], and electro-fishing, is uncertain and has not been quantified.

2.2 Large Scale Bio-acoustic Monitoring

The growing spread of invasive fish species, such as T. mariae, demands monitoring systems that can cover large spatial scales. Invasive monitoring methods such as ultrasonic fish tags are expensive and not scalable. Instead, we focus on the passive capture of sound through a wireless network of acoustic sensor nodes as a more affordable and noninvasive tracking approach. We target local processing of captured sounds on each node, in order to avoid the energy and bandwidth overhead of sending raw audio data continuously over wireless links. In-network acoustic classification must consider the resource limitations that are inherent to battery/solar powered sensor nodes and the suitability of feature sets for the species of interest. We turn our attention next to the audio node, as the main building block for a large scale acoustic monitoring system.

2.2.1 Audio node

Figure 1(a) shows the audio node board stack, while figure 1(b) shows the inside of the weatherproof audio node enclosure, which is ready for outdoor deployment with a 3G modem, batteries, solar regulator and charger. Additional environmental sensors such as temperature and humidity sensors and leaf wetness sensors would be connected to the audio node in a typical deployment scenario. Figure 2 depicts a functional block diagram of the deployment-ready audio node showing the connectivity and relationships between different components of the system in terms of power, control and communication. Figure 1(c) shows an audio node in a more rugged housing deployed near a fisheries pond with an attached hydrophone lowered into the water at the end of a long boom floating on a pontoon. The inset of figure 1(c) shows the inside of the pelican case enclosure with the same hardware configuration as that shown in figure 1(b).

Figure 2: Functional block diagram of the audio node. Blocks shown: microphones/hydrophones with pre-amplifiers; AIC3254 audio codecs (8-192 kHz stereo); Bluetechnix BF537E DSP (600 MHz, 32 MB SDRAM, 4 MB flash, VDK); microSD card; USB storage; Fleck3z mote (ATmega 1281, 8 MHz, TinyOS); temperature and humidity sensor; leaf wetness sensor; external antenna; 3G modem (UMTS/HSDPA; Datacall gateway, ARM9 200 MHz, 32 MB SDRAM, 128 MB flash, Linux 2.6); solar panel, solar regulator and charger, and batteries. Links are labelled SPI, analog in, Ethernet, serial and GPIO, and are grouped into power, control and communication.

Each node consists of a wireless mote and a higher power DSP for general signal processing. The mote used is the Fleck™ 3z, which is one of a family of robust wireless motes designed for outdoor applications in environmental monitoring and agriculture. It contains an Atmel AT86RF212 IEEE 802.15.4 compliant radio with up to 1 Mbps data rate and high sensitivity (down to -110 dBm). The DSP is an Analog Devices 600 MHz ADSP-BF537 Blackfin processor. We use the BlueTechnix CM-BF537E implementation with 132 kB internal SRAM, 32 MB external SDRAM and 4 MB flash.

The Texas Instruments TLV320AIC3254 stereo codec is used for audio collection. It can sample at rates from 8 kHz up to 192 kHz, and its power consumption ranges from 4.2 mW at 8 kHz mono to 21.9 mW at 192 kHz stereo. It offers two onboard mini-DSPs for specific audio processing tasks. We typically set the sampling rate to 16 kHz for bio-acoustic monitoring applications as a means of keeping the power consumption low while capturing frequencies up to 8 kHz. Apart from the power consumption of the codec itself, higher sampling rates produce more audio samples, which require more processing time on the Blackfin DSP and therefore further increase power consumption. The audio node board also contains a microSD slot. We currently use 32 GB microSD cards and can upgrade to higher capacity cards as they become available and cost effective. Table 1 gives the power consumed by different components of the audio node.

Process                        Power (mW)
Mote Idle                      60 × 10^-3
Mote Tx/Rx                     60
Audio codec                    4.2 - 21.9
DSP Startup/Processing/Idle    1200
3G modem                       2400

Table 1: Power consumption of the audio node.

The DSP clearly dominates the power profile of the audio node, which highlights the need for classification algorithms that are both appropriate to the species of interest and resource-efficient. With these requirements in mind, the next section briefly revisits current methods for bio-acoustic classification.
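The trade-off between sampling rate, storage and power discussed for the audio node can be made concrete with a little arithmetic. The sketch below is illustrative only: the power figures are those listed in Table 1 and the 32 GB card size comes from the text, while the 16-bit mono sample format, the always-on duty cycle and all function names are assumptions made for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def nyquist_hz(sample_rate_hz):
    """Highest frequency that can be represented at this sampling rate."""
    return sample_rate_hz / 2.0


def raw_bytes_per_day(sample_rate_hz, bytes_per_sample=2, channels=1):
    """Uncompressed audio volume per day (16-bit mono is an assumption)."""
    return sample_rate_hz * bytes_per_sample * channels * SECONDS_PER_DAY


def days_until_full(card_bytes, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Days of continuous raw recording that fit on a storage card."""
    per_day = raw_bytes_per_day(sample_rate_hz, bytes_per_sample, channels)
    return card_bytes / per_day


def energy_wh_per_day(power_mw, duty_cycle=1.0):
    """Daily energy in watt-hours for a component drawing power_mw."""
    return power_mw / 1000.0 * 24.0 * duty_cycle


if __name__ == "__main__":
    for rate in (16000, 44100):
        print(rate, "Hz: Nyquist", nyquist_hz(rate), "Hz,",
              round(days_until_full(32e9, rate), 1), "days on a 32 GB card")
    # Power values taken from Table 1; continuous operation is assumed.
    for name, mw in (("DSP", 1200), ("3G modem", 2400), ("Mote Tx/Rx", 60)):
        print(name, round(energy_wh_per_day(mw), 2), "Wh/day")
```

Under these assumptions a node that stored raw audio continuously would fill its card in days rather than months, which is consistent with the argument for classifying sounds on the node.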

Figure 3: The experimental setup showing the hydrophone inside the aquarium tank with T. mariae. Labelled elements: hydrophone, tank, courting pair, video camera, audio interface and computer.

2.2.2 Existing Bio-acoustic Classification Methods

Machine learning and artificial intelligence techniques have been successfully used for detection and classification of frogs, bats, birds, whales and land mammals based on the sound emitted by these species [32, 20, 28, 25, 19, 2]. A closely related area in the underwater bio-acoustic domain is automated whale sound classification using techniques such as spectrogram correlation [26, 5]. This method assumes non-impulsive calls with narrow short-time frequency bandwidths consisting of a sequence of frequency up-sweeps and down-sweeps. A correlation kernel is constructed in a piece-wise manner to encompass the shape of the spectrogram of the target call. This kernel is then moved along the spectrogram of the input signal and cross-correlated in the time domain to obtain a recognition score function. Thresholding this function yields detections of the target calls. Typical durations of calls used with this technique are on the order of seconds, which is common for whale vocalisations.
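The kernel-sliding step of spectrogram correlation can be sketched in a few lines. This is a minimal illustration of the general idea described above, not the implementation from [26, 5]: the spectrogram layout (`[frequency_bin][time_frame]`), the hand-made up-sweep kernel and the function names are assumptions for this example.

```python
def correlation_scores(spectrogram, kernel):
    """Slide the kernel along the time axis and return one score per offset.

    Both arguments are lists of rows indexed as [frequency_bin][time_frame]
    and must cover the same number of frequency bins.
    """
    n_freq = len(kernel)
    kernel_len = len(kernel[0])
    n_frames = len(spectrogram[0])
    scores = []
    for t0 in range(n_frames - kernel_len + 1):
        total = 0.0
        for f in range(n_freq):
            for dt in range(kernel_len):
                total += kernel[f][dt] * spectrogram[f][t0 + dt]
        scores.append(total)
    return scores


def detect(scores, threshold):
    """Return the time offsets whose recognition score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy up-sweep kernel: positive along the rising diagonal, negative elsewhere.
UPSWEEP = [
    [1.0, -0.5, -0.5],
    [-0.5, 1.0, -0.5],
    [-0.5, -0.5, 1.0],
]

if __name__ == "__main__":
    # Synthetic spectrogram with one up-sweep starting at frame 2.
    spec = [[0.0] * 8 for _ in range(3)]
    spec[0][2] = spec[1][3] = spec[2][4] = 1.0
    print(detect(correlation_scores(spec, UPSWEEP), threshold=2.0))
```

A kernel of this kind only scores well when energy follows a narrow time-frequency contour, which is why the method is a poor match for short broadband clicks.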

The other related area is automated music genre classification using machine learning techniques such as Hidden Markov Models and Support Vector Machines [35] with a set of features extracted from the input signal. While the features used in the literature do not assume narrowband signals, they do rely on signals with durations on the order of seconds. This is reasonable since it generally applies to music segments used in this field. As an example, the work presented in [36] uses a 30 dimensional feature vector consisting of 'timbral texture features', 'rhythmic content features' and 'pitch content features'. The timbral texture feature (TTF) vector is assembled by initially dividing the signal into relatively small analysis windows with a duration of 23 ms. Multiples of these analysis windows make up a larger texture window with a duration of 1 s. Characteristics such as spectral centroid, spectral rolloff, spectral flux, zero-crossings as well as Mel-frequency cepstral coefficients (MFCC) are calculated for each analysis window, and the final feature vector consists of the means and variances of these quantities over a full texture window. Rhythmic and pitch content features are calculated for the full signal duration, which in this case means full length music files.
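To illustrate how timbral texture features summarise per-window quantities, the following sketch computes two of the characteristics named above (spectral centroid and zero-crossings) for each analysis window and then takes their mean and variance across the windows. It is a simplified illustration, not the feature set of [36]: rolloff, flux and MFCCs are omitted, the window length is given in samples, a naive DFT is used for clarity rather than speed, and all names are assumptions.

```python
import cmath
import math
import statistics


def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins from 0 Hz up to the Nyquist bin."""
    n = len(frame)
    return [
        abs(sum(frame[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n)))
        for b in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate_hz):
    """Magnitude-weighted mean frequency of one analysis window, in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0
    bin_hz = sample_rate_hz / len(frame)
    return sum(b * bin_hz * m for b, m in enumerate(mags)) / total


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def texture_features(signal, sample_rate_hz, window_len):
    """Mean and variance of per-window features over the whole signal."""
    frames = [signal[i:i + window_len]
              for i in range(0, len(signal) - window_len + 1, window_len)]
    centroids = [spectral_centroid(f, sample_rate_hz) for f in frames]
    crossings = [zero_crossings(f) for f in frames]
    return {
        "centroid_mean": statistics.mean(centroids),
        "centroid_var": statistics.pvariance(centroids),
        "zero_crossings_mean": statistics.mean(crossings),
        "zero_crossings_var": statistics.pvariance(crossings),
    }
```

Because the statistics are taken over many windows, a 10 ms click yields very few windows to average over, which hints at why such features suit second-long music segments better than short impulsive sounds.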

Figure 4: Spectral (first column) and temporal (second column) characteristics of a representative example click (a & b), average of 48 detected clicks (c & d), average of 773 other sounds detected in the recordings (e & f). Axes: relative sound level [dB] (-120 to -50) against frequency [kHz] (0 to 20) in the first column and against time [ms] (0 to 60) in the second column.

The above methods make implicit assumptions about the underlying signal of the species' vocalisation, particularly that the signals are narrowband and relatively long in duration. These methods are therefore not suitable for classifying broadband bio-acoustic calls [9], which are increasingly recognised to exist in nature, and are characteristic of our species of interest in this paper. The following section introduces our spectro-temporal feature vectors for classifying broadband bio-acoustics.

3 Spectro-temporal feature vectors

Our initial experiments reveal that a typical T. mariae sound is a click approximately 10 ms in duration and has most of the spectral energy in the 3-8 kHz region with higher harmonics. We introduce a unique set of spectro-temporal features (STF) which effectively characterise short duration broadband sounds such as the T. mariae clicks described.

The following sections describe the experimental procedure used in acquiring acoustic data containing T. mariae clicks, and then present the proposed STF and the feature extraction and classification method.

3.1 Case study: Classification of Tilapia mariae clicks

To study the acoustic behaviour of T. mariae, a number of recordings were made in an aquarium environment with five T. mariae individuals (total length 15-28 cm). A hydrophone was placed in a tank (tank size: 183 cm (L) × 60 cm (H) × 45 cm (W), temperature: 24 °C) and the associated behaviour was recorded via a video camera from outside the tank.

Figure 5: Spectral and temporal characteristics of a typical T. mariae click. The various annotated measurements are used to define the spectro-temporal features in the text. The figure shows a spectrogram (time [ms], 0 to 50, against frequency [kHz], 0 to 18, with relative sound level [dB] as colour) together with the relative sound level over time and over frequency, annotated with the mean levels, the angles θ1 and θ2, and the measurements a1, b1, c1, d1 and a2, b2, c2, d2.

Fish species in the Cichlid family produce sound in captivity and in the wild [15], including males of several Tilapia species that vocalise during courtship, nest preparation and spawning [3, 22]. To maximise our chances of recording sounds from T. mariae, we focussed our study on an established pair that showed distinct courtship and nest building behaviour [7]. Audio and video recordings were conducted simultaneously to capture potential vocalisations in parallel with any specific behaviours. The aerators of the tank were switched off during the recording periods to minimise acoustic interference. Because of this, recording periods were at most 60 min in duration to avoid any stress to the fish due to oxygen deprivation.

The audio recordings were done with a Reson TC4032 hydrophone. This is a low noise, sea-state zero hydrophone with a 10 dB built-in preamplifier. It has a linear frequency range of 15 Hz to 40 kHz within ±2 dB [30]. The hydrophone was connected to a laptop computer via a Cakewalk FA-66 device, which is a FireWire audio interface providing additional preamplification and analog to digital conversion. Recordings were stored as uncompressed PCM Wave files with a sample rate of 44.1 kHz and a resolution of 24 bits. The experimental setup is shown in Figure 3. We also used an audio node to do recordings in parallel with two different hydrophones connected to the two input channels. One was another Reson TC4032 hydrophone and the other was a low cost Teledyne Benthos AQ-2000 hydrophone connected via an Ecologic HP/02 preamplifier [33, 18]. Online processing was not performed on the audio node for this case study, and the audio data was stored on the onboard SD card for later analysis and comparison of recording quality, classification performance and hardware cost.

3.2 Feature extraction

The recorded audio was initially processed with a high pass filter with a cut-off frequency fc of 1000 Hz to remove low frequency noise such as aquarium pumps and aerators of other tanks in the aquarium as well as the 50 Hz mains hum and its harmonics. The audio and video tracks were then synchronised in post-processing using an event captured on both recordings (tapping on the aquarium glass). The recordings were then manually annotated with ground truth, i.e. the time stamps of acoustic activity (clicking sounds) by the fish were saved as a label track in Audacity [4].

Feature set  Method                   Accuracy  Precision  Sensitivity  Specificity
STF          Discriminant Analysis    0.97      0.65       0.94         0.97
STF          Logistic Regression      0.98      0.82       0.75         0.99
TTF          Discriminant Analysis    0.93      0.45       0.90         0.93
TTF          Logistic Regression      0.95      0.63       0.24         0.99
-            Spectrogram Correlation  0.85      0.00       0.00         0.90

Table 2: Performance of Discriminant Analysis and Logistic Regression with the proposed Spectro-temporal features (STF) and Timbral texture features (TTF) along with Spectrogram correlation in identifying the click sound from the original recording sampled at 44.1 kHz. These values were obtained by averaging k-fold cross-validation over 100 iterations with k=10. The dataset consisted of 821 observations.

Feature set  Method                   Accuracy  Precision  Sensitivity  Specificity
STF          Discriminant Analysis    0.95      0.69       0.95         0.95
STF          Logistic Regression      0.96      0.85       0.78         0.98
TTF          Discriminant Analysis    0.89      0.52       0.27         0.97
TTF          Logistic Regression      0.90      0.79       0.13         0.99
-            Spectrogram Correlation  0.85      0.25       0.19         0.93

Table 3: Performance of Discriminant Analysis and Logistic Regression with the proposed Spectro-temporal features (STF) and Timbral texture features (TTF) along with Spectrogram correlation in identifying the click sound from audio downsampled to 16 kHz. These values were obtained by averaging k-fold cross-validation over 100 iterations with k=10. The dataset consisted of 436 observations.
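The high pass filtering step can be illustrated with a very small example. The text specifies only the cut-off frequency of 1000 Hz, not the filter design, so the first-order RC-style filter below is an assumption chosen for brevity; a steeper filter would suppress mains hum harmonics more strongly. The function name and signature are also hypothetical.

```python
import math


def high_pass(samples, sample_rate_hz, cutoff_hz=1000.0):
    """First-order high pass filter (discrete RC approximation).

    y[0] = x[0]
    y[i] = alpha * (y[i-1] + x[i] - x[i-1]),  alpha = RC / (RC + dt)
    """
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    out = [float(samples[0])]
    for i in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


if __name__ == "__main__":
    fs = 44100
    hum = [math.sin(2 * math.pi * 50 * n / fs) for n in range(4410)]
    filtered = high_pass(hum, fs)
    # Residual amplitude of the 50 Hz component once the filter has settled.
    print(max(abs(v) for v in filtered[-2000:]))
```

With this design a constant offset decays to zero and a 50 Hz tone is strongly attenuated, while components well above the cut-off pass with little change.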

Many different sounds were present in the recordings, which included pebble movement, clicking, scraping and chewing by the fish, most of which could be classified as broadband. An audio detection and segmentation algorithm using an energy detector implemented as a finite state machine was used to extract the various instances of sounds and to discard the background noises [14]. The manually annotated ground truth time stamps were aligned with the segmented sounds to give each detected sound a corresponding class label to be used for cross validation of automated classification. From one set of recordings used for this study, there were 48 manually annotated T. mariae clicks, verified with video footage showing synchronous mouth movement. The detection and segmentation algorithm found a total of 821 sounds, which included all of the 48 clicks.
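An energy detector driven by a finite state machine can be sketched as follows. This is not the algorithm of [14], whose details are not given here; it is a minimal two-state version in which the frame length, the energy threshold, the hangover rule and all names are assumptions made for illustration.

```python
IDLE, ACTIVE = 0, 1


def segment_sounds(samples, frame_len, threshold, hangover_frames=1):
    """Return (start_sample, end_sample) pairs for detected sound events.

    A segment opens when the mean energy of a frame exceeds the threshold and
    closes once more than `hangover_frames` consecutive quiet frames follow.
    """
    state = IDLE
    segments = []
    start = 0
    quiet = 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        loud = energy > threshold
        if state == IDLE:
            if loud:
                state = ACTIVE
                start = i * frame_len
                quiet = 0
        elif loud:
            quiet = 0
        else:
            quiet += 1
            if quiet > hangover_frames:
                # The segment ends after the last loud frame.
                segments.append((start, (i - quiet + 1) * frame_len))
                state = IDLE
    if state == ACTIVE:
        # The signal ended while a segment was still open.
        segments.append((start, (n_frames - quiet) * frame_len))
    return segments
```

Each returned segment would then be matched against the annotated time stamps to obtain its class label, as described above.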

Figures 4(a) and 4(b) show the relative sound level of a typical T. mariae click with respect to frequency and time. Figures 4(c) and 4(d) show these frequency and temporal characteristics averaged over all 48 detected clicks, and figures 4(e) and 4(f) show the same plots averaged over all other 773 detected sounds.

Figure 6: Functional block diagram of the evaluation process. Blocks shown: hydrophone, pre-amplification & A/D conversion, high-pass filter (fc = 1000 Hz), detection & segmentation, feature extraction, classification, ground truth annotation and cross-validation.

We observe a number of attributes in the spectral andtemporal domain which could be used to effectively charac-terise T. mariae clicks. Figure 5 shows a typical T. mariaeclick spectrogram along with its frequency distribution andtemporal evolution. The spectrogram consists of power spec-tral density values. Assuming time and frequencies are dis-cretized, the spectrogram is given by:

P (f, t) = k|S(f, t)|2,where k = 2/

(fs

L∑n=1

|ω(n)|2)

(1)

where S(f, t) is the short time Fourier transform (STFT) oftime domain signal s(t), ω(n) is the Hamming windowingfunction, L the window length and fs is the sampling fre-quency. The relative sound levels over frequency F (f) andover time T (t) are given by

$$T(t) = \sum_{f=0}^{f_s/2} P(f,t) \tag{2}$$

$$F(f) = \sum_{t=0}^{t'} P(f,t) \tag{3}$$

where t′ is the duration of the relevant sound segment.
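Equations (1) to (3) can be illustrated with a short dependency-free sketch. It assumes a symmetric Hamming window, a hop size chosen by the caller (the text does not state one) and a naive DFT that is only practical for short windows; the function names are hypothetical. The scaling constant k is applied to every bin exactly as written in equation (1).

```python
import cmath
import math


def hamming(L):
    """Symmetric Hamming window of length L."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)) for n in range(L)]


def psd_spectrogram(s, fs, L, hop):
    """P[t][f] = k * |S(f, t)|^2 with k = 2 / (fs * sum |w(n)|^2), as in Eq. (1)."""
    w = hamming(L)
    k = 2.0 / (fs * sum(v * v for v in w))
    n_bins = L // 2 + 1  # one-sided spectrum: 0 .. fs/2
    P = []
    for start in range(0, len(s) - L + 1, hop):
        frame = [s[start + n] * w[n] for n in range(L)]
        row = []
        for b in range(n_bins):
            # Naive DFT of the windowed frame at bin b (O(L) per bin).
            S = sum(frame[n] * cmath.exp(-2j * math.pi * b * n / L) for n in range(L))
            row.append(k * abs(S) ** 2)
        P.append(row)
    return P


def temporal_profile(P):
    """T(t): sum of P over frequency for each time frame, Eq. (2)."""
    return [sum(row) for row in P]


def spectral_profile(P):
    """F(f): sum of P over time for each frequency bin, Eq. (3)."""
    return [sum(col) for col in zip(*P)]


# Illustrative input: a 2 kHz tone sampled at 16 kHz (falls exactly on bin 4 for L = 32).
fs, L, hop = 16000, 32, 16
tone = [math.sin(2 * math.pi * 2000 * n / fs) for n in range(128)]
P = psd_spectrogram(tone, fs, L, hop)
T = temporal_profile(P)
F = spectral_profile(P)
peak_hz = F.index(max(F)) * fs / L
print(len(T), len(F), peak_hz)
```

In practice an FFT-based routine would replace the inner DFT loop; the naive form is used here only to keep the correspondence with the equations visible.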

The set of STFs, with six elements denoted by {v1..v6}, is defined as follows: v1 is the ratio between the peak position of T(t) and the duration of the sound segment t′ (i.e. a1/b1 in figure 5); v2 is the ratio between the peak position of F(f) and the frequency bandwidth (i.e. a2/b2 in the figure); v3 is the gradient of T(t) immediately before the peak (i.e. tan(θ1)); v4 is the gradient of the least squares fit line of F(f) (i.e. tan(θ2)); v5 is the ratio between the peak value of T(t) and its mean value (i.e. c1/d1); and v6 is the ratio between the peak value of F(f) and its mean value (i.e. c2/d2). These features are selected to be amplitude invariant, which contributes to the robustness of the subsequent classification process.
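The six definitions above can be sketched as follows. Several details are assumptions rather than the authors' implementation: the profiles are normalised by their peak before the two gradients are taken (one way to keep v3 and v4 amplitude invariant), v3 is a one-step backward difference, and the bandwidth is taken as the full analysed band. The function name and the `dt` and `df` parameters are hypothetical.

```python
def stf_features(T, F, dt, df):
    """Sketch of the six spectro-temporal features from profiles T(t) and F(f).

    T, F : lists of non-negative profile values (at least two samples each)
    dt   : time step between samples of T, in seconds
    df   : frequency step between samples of F, in Hz
    """
    it = max(range(len(T)), key=T.__getitem__)  # index of temporal peak
    jf = max(range(len(F)), key=F.__getitem__)  # index of spectral peak
    duration = (len(T) - 1) * dt
    bandwidth = (len(F) - 1) * df

    # Assumption: normalise by the peak so gradients do not scale with amplitude.
    Tn = [x / T[it] for x in T]
    Fn = [x / F[jf] for x in F]

    v1 = (it * dt) / duration                               # peak position / duration
    v2 = (jf * df) / bandwidth                              # peak position / bandwidth
    v3 = (Tn[it] - Tn[it - 1]) / dt if it > 0 else 0.0      # rise just before the peak

    # v4: slope of the least squares line fitted to the normalised F(f).
    xs = [j * df for j in range(len(Fn))]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(Fn) / len(Fn)
    v4 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, Fn))
          / sum((x - x_mean) ** 2 for x in xs))

    v5 = T[it] / (sum(T) / len(T))                          # temporal peak / mean
    v6 = F[jf] / (sum(F) / len(F))                          # spectral peak / mean
    return [v1, v2, v3, v4, v5, v6]


# Illustrative profiles (made-up values, not measured data).
T_demo = [1, 2, 8, 4, 1]
F_demo = [1, 4, 3, 2, 1, 1]
v = stf_features(T_demo, F_demo, dt=0.001, df=1000.0)
v_loud = stf_features([10 * x for x in T_demo], [10 * x for x in F_demo], dt=0.001, df=1000.0)
print(v)
```

Scaling both profiles by a constant, as in `v_loud`, leaves all six values unchanged, which is the amplitude invariance the text refers to.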


Figure 7: Receiver Operating Characteristic curves at (a) fs = 44.1 kHz and (b) fs = 16 kHz, for Discriminant Analysis and Logistic Regression with the proposed spectro-temporal features (STF) and timbral texture features (TTF), along with spectrogram correlation. [Both panels plot true positive rate against false positive rate, each from 0.0 to 1.0; curves: Logistic Regression (TTF), Discriminant Analysis (TTF), Logistic Regression (STF), Discriminant Analysis (STF), Spectrogram Correlation.]

STFs are calculated for each detected sound segment and make up the feature vector associated with that sound. These were used as the input vectors for automated classification of sounds produced by T. mariae. The functional blocks of this process are shown in figure 6. We use discriminant analysis and logistic regression as two independent machine learning methods for classification and present the cross-validated results in the following section. We also compare the performance of STF vectors with features used in the literature for sound classification.
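To make the classification step concrete, the following is a minimal sketch of binary logistic regression trained by batch gradient descent on feature vectors. It is not the authors' implementation (the text does not say which software was used); the learning rate, epoch count, function names and toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Label 1 (e.g. click) when the estimated probability reaches the threshold."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
            for xi in X]


# Toy two-dimensional feature vectors (hypothetical, linearly separable).
X_demo = [[-2.0, -1.0], [-1.5, -0.5], [-1.0, -1.5],
          [1.0, 1.5], [1.5, 0.5], [2.0, 1.0]]
y_demo = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X_demo, y_demo)
print(predict(X_demo, w, b))
```

Sweeping `threshold` from 0 to 1 and recording the true and false positive rates at each value is how an ROC curve is traced for such a classifier.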

4 Evaluation and results

We use our proposed STF vectors as inputs for the standard classification methods of Logistic Regression and Discriminant Analysis [29]. For comparison, we use timbral texture features (TTF) with the same two classification methods. We also compare STF vectors against spectrogram correlation, which is a classification technique without explicit feature extraction. To evaluate the performance of the features and classification methods at lower sampling rates, we downsampled the original 44100 Hz recording to 16000 Hz and then to 8000 Hz. We compute the Accuracy, Precision, Sensitivity (Recall) and Specificity of each classification method/feature set combination using k-fold cross-validation. These four metrics are defined as follows:

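$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{4}$$

$$\text{Precision} = \frac{tp}{tp + fp} \tag{5}$$

$$\text{Sensitivity} = \frac{tp}{tp + fn} \tag{6}$$

$$\text{Specificity} = \frac{tn}{tn + fp} \tag{7}$$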

where tp, tn, fp and fn are the numbers of true positives, true negatives, false positives and false negatives respectively.
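Equations (4) to (7) translate directly into code. The sketch below uses hypothetical confusion-matrix counts that merely match the size of the data set described earlier (48 clicks among 821 detected sounds); they are not results from the paper. Undefined ratios are returned as NaN, which is a choice made here for illustration.

```python
import math


def _ratio(num, den):
    """Return num / den, or NaN when the ratio is undefined."""
    return num / den if den else math.nan


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity and specificity, Eqs. (4) to (7)."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


# Hypothetical counts: 48 positives and 773 negatives in total.
m = classification_metrics(tp=40, tn=763, fp=10, fn=8)

# A classifier that never reports a click still scores a high accuracy
# on such imbalanced data, while its sensitivity is zero.
m_never = classification_metrics(tp=0, tn=773, fp=0, fn=48)
print(m)
print(m_never)
```

The second case shows why accuracy alone is a poor summary when positives are rare, and why precision and sensitivity are reported alongside it.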

4.1 Robustness of features

Tables 2 and 3 give the performance of each feature set and classification method for the original 44.1 kHz recording and the 16 kHz version (a 64% reduction in sampling rate). At 8 kHz, the average accuracy dropped to 0.89 for STF and to 0.86 for TTF; spectrogram correlation performance was not evaluated at this sampling rate. For the higher sampling rate of 44.1 kHz, STF outperforms TTF for Accuracy and Precision, and performs comparably or slightly worse for Sensitivity and Specificity. Both the precision and the sensitivity of STF increase with downsampling, while accuracy decreases by about 2%. For TTF, the decrease in accuracy was 5% and only precision improved with downsampling, with sensitivity decreasing drastically (70% for discriminant analysis and 46% for logistic regression). Despite an accuracy of 0.85, spectrogram correlation has very low utility because of its low precision and sensitivity. This stems from the fact that the correlation kernel used in spectrogram correlation relies on a relatively long-duration, slowly varying narrowband signal such as the frequency sweeps common in whale calls. The short-duration, impulsive broadband click of T. mariae does not conform to these characteristics, yielding poor performance when using spectrogram correlation.

Receiver Operating Characteristic (ROC) curves, which plot true positive rate against false positive rate, are shown in figures 7(a) and 7(b) for the 44.1 kHz and 16 kHz versions. These curves further demonstrate the robustness of STF, compared to TTF, as the sampling rate is reduced to 16 kHz. They also depict the poor performance of spectrogram correlation at both 44.1 kHz and 16 kHz sampling. The ROC curves of STF show better performance for both classification methods when compared to TTF, which seems to perform better with discriminant analysis than with logistic regression.

4.2 Effects of reduced sampling rate

To observe the effect of reduced sampling rate on each of the STF components, we plot the density distributions of features associated with detected T. mariae clicks and those associated with all detected sounds (full data set) for both 44.1 kHz and 16 kHz sampling rates (figure 8). T. mariae clicks have their peak energy at frequencies below 8 kHz, whereas the other sounds present in the recordings have their energy more towards the higher frequencies (figure 4). At 16 kHz sampling, where the Nyquist frequency is 8 kHz, the peak energy portion of the T. mariae click is still preserved while the other sounds lose the more energetic part of their spectrum. This explains why feature v4 loses its separation from the full data set in figure 8(d) when downsampled to 16 kHz, as this feature is based on the shape of the frequency distribution over the full bandwidth. However, features v2 and v3 have increased separation between clicks and the full data set at the lower sampling rate.
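The band-limiting argument above can be checked with a small calculation. The sketch below computes the Nyquist frequency for a given sampling rate and the fraction of a spectral profile's energy that lies below it. The two profiles are invented for illustration (one concentrated below 8 kHz, one above) and are not the measured spectra of figure 4; the function names are hypothetical.

```python
def nyquist_hz(fs_hz):
    """Highest frequency representable at sampling rate fs_hz."""
    return fs_hz / 2.0


def energy_fraction_below(profile, df_hz, cutoff_hz):
    """Fraction of total profile energy in bins whose frequency is below cutoff_hz."""
    total = sum(profile)
    kept = sum(p for j, p in enumerate(profile) if j * df_hz < cutoff_hz)
    return kept / total


# Made-up spectral profiles on 1 kHz bins from 0 to 22 kHz.
click_like = [2, 6, 10, 9, 7, 5, 3, 2] + [0.5] * 15
other_like = [0.5] * 8 + [2, 4, 6, 8, 9, 10, 9, 8, 6, 5, 4, 3, 2, 1, 1]

cutoff = nyquist_hz(16000)                  # 8000.0 Hz
click_kept = energy_fraction_below(click_like, 1000.0, cutoff)
other_kept = energy_fraction_below(other_like, 1000.0, cutoff)
rate_reduction = 1 - 16000 / 44100          # about 0.64, i.e. the 64% quoted earlier
print(cutoff, round(click_kept, 2), round(other_kept, 2), round(rate_reduction, 2))
```

With profiles of this shape, most of the click-like energy survives the reduction to 16 kHz while the other profile loses nearly all of it, which mirrors the qualitative behaviour described in the text.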


Figure 8: Density distribution of the six features {v1..v6}, comparing the positive examples (red triangles and orange crosses) with the full data set (blue circles and green squares) for the original 44.1 kHz recording and the downsampled 16 kHz version. [Panels (a) v1 to (f) v6 each plot relative density against feature value; series: Clicks (44.1 kHz), All (44.1 kHz), Clicks (16 kHz), All (16 kHz).]

Because the other sounds lose most of the energetic portion of their spectrum when downsampled, the total number of sounds picked up by the detection and segmentation algorithm decreases with downsampling, from 821 at 44.1 kHz to 430 at 16 kHz and 236 at 8 kHz. Despite this reduction, the 48 manually annotated clicks always appear in the detected segments. The total number of samples that need to be processed during classification is a product of the sampling rate and the number of detected sound segments. To further quantify the classification performance of each feature set, we define a performance metric which is the product of accuracy, precision, sensitivity and specificity. Since it is desirable for each of these quantities to tend to 1, and since each lies in the range 0 to 1, the new performance metric reflects their overall performance and also remains in the range 0 to 1. Figure 9 shows the reduction in samples to be processed as the sampling rate is reduced, along with the classification performance for both STF and TTF using the new metric. The error bars represent the maxima and minima for the two classification algorithms, discriminant analysis and logistic regression. At 16 kHz, which is the sampling rate used on the audio node, the number of samples to be processed is 70% lower than at 44.1 kHz. This substantial reduction comes with a small decrease in accuracy (2%) and an actual increase in precision and sensitivity when using STF. This is reflected as an overall increase in the performance metric (0.58 to 0.60). For the same change in sampling rate, the performance for TTF is nearly halved, highlighting the robustness of STF to lower sampling rates. We therefore conclude that a sampling rate of 16 kHz strikes a much better balance between the number of samples to be processed and the classification performance of STF.

Figure 9: Reduction in number of samples to be processed during sound classification with the decreasing sampling rate. [Horizontal axis: sample rate (44.1 kHz, 16 kHz, 8 kHz); vertical axes: samples (×10^6) and performance; series: samples to process, STF, TTF.]
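The combined performance metric defined in this section is simply the product of the four quantities. A minimal sketch follows; the input values are hypothetical and are not taken from Tables 2 and 3, and the function name is an assumption.

```python
def combined_performance(accuracy, precision, sensitivity, specificity):
    """Product of the four metrics; each input must lie in [0, 1]."""
    values = (accuracy, precision, sensitivity, specificity)
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError("each metric must lie in the range 0 to 1")
    result = 1.0
    for v in values:
        result *= v
    return result


# Hypothetical metric values for illustration only.
balanced = combined_performance(0.95, 0.80, 0.80, 0.97)
one_weak = combined_performance(0.95, 0.80, 0.20, 0.97)
print(round(balanced, 3), round(one_weak, 3))
```

Because the result can never exceed the smallest of its factors, a single weak metric (here sensitivity) pulls the combined score down sharply, which is the behaviour that makes the product useful as a single summary.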

5 Discussion

The ability to detect and automatically classify T. mariae through the passive capture of sound enables not only the detection of such species, but also the possibility of population control. Through integration with wireless sensor network technology, detection of these invasive species can be automated over large spatial scales to cover freshwater bodies such as river systems and lakes. This in itself could enable ecological management specialists to implement control measures in the specific regions where the species has been detected. Population control can be enhanced through the automation of active measures as well. For instance, the click sound can potentially be played back in areas suspected to be on the invasion front to lure or deter these fish to assist management.

This paper presented a novel set of amplitude invariant spectro-temporal features which effectively characterise short-duration broadband bio-acoustic calls. These features can then be used with standard machine learning techniques to accurately and efficiently detect and classify species such as T. mariae. When combined with the audio detection and segmentation algorithm used in this work, we showed that an accuracy > 0.95 can be achieved while the number of samples processed can be reduced by 70% when the signal is downsampled. We demonstrated that our proposed features remain robust even at the lower sampling rate of 16 kHz. This, along with the relatively small feature set size, makes the method well suited for implementation on the wireless audio node platform described earlier in this paper. These audio nodes would then serve as the building blocks of a large scale freshwater monitoring system deployed along inland waterways and lakes, enabling real-time in-network detection and classification of invasive species.

6 References

[1] R. C. Akpaniteaku and J. N. Aguigwo. Seasonal variation in catch of tilapiines (Osteichthyes: Chichlidae) in Agulu and Nawfia Lakes, Anambra State, Nigeria. Journal of Sustainable Tropical Agriculture Research, 6:19–22, 2003.

[2] M. Allen, L. Girod, R. Newton, S. Madden, D. T. Blumstein, and D. Estrin. VoxNet: an interactive, rapidly-deployable acoustic monitoring platform. In Proceedings of the International Conference on Information Processing in Sensor Networks, IPSN ’08, pages 371–382, Washington, DC, USA, 2008. IEEE Computer Society.

[3] M. C. P. Amorim, P. J. Fonseca, and V. C. Almada. Sound production during courtship and spawning of Oreochromis mossambicus: male-female and male-male interactions. Journal of Fish Biology, 62(3):658–672, 2003.

[4] Audacity Development Team. Audacity [Computer program], 2010. http://audacity.sourceforge.net/.

[5] M. F. Baumgartner and S. E. Mussoline. A generalized baleen whale call detection and classification system. Journal of the Acoustical Society of America, 129(5):2889–2902, 2011.

[6] K. S. Boyle and T. C. Tricas. Pulse sound generation, anterior swim bladder buckling and associated muscle activity in the pyramid butterflyfish, Hemitaurichthys polylepis. Journal of Experimental Biology, 213(22):3881–3893, 2010.

[7] M. Bradford, F. J. Kroon, and D. J. Russell. The biology and management of Tilapia mariae (Pisces: Cichlidae) as a native and invasive species: a review. Marine and Freshwater Research, 62(8):902–917, 2011.

[8] T. S. Brandes. Automated sound recording and analysis techniques for bird surveys and conservation. Bird Conservation International, 18:S163–S173, 2008.

[9] T. S. Brandes. Feature vector selection and use with hidden Markov models to identify frequency modulated bioacoustic signals amidst noise. IEEE Transactions on Audio, Speech, and Language Processing, 16(6):1173–1180, 2008.

[10] W. R. Brooks and R. C. Jordan. Enhanced interspecific territoriality and the invasion success of the spotted tilapia (Tilapia mariae) in South Florida. Biological Invasions, 12(4):865–874, 2010.

[11] Bureau of Rural Sciences. A strategic approach to the management of ornamental fish in Australia. Technical report, Bureau of Rural Sciences, 2007.

[12] M. R. Clark. Probable establishment and range extension of the spotted tilapia, Tilapia mariae Boulenger (Pisces: Cichlidae) in East Central Florida. Florida Scientist, 44(3):168–171, 1981.

[13] W. R. Courtenay, Jr. and J. E. Deacon. Fish Introductions in the American Southwest: A Case History of Rogers Spring, Nevada. The Southwestern Naturalist, 28(2):221–224, 1983.

[14] B. Croker and N. Kottege. Using feature vectors to detect frog calls in wireless sensor networks. The Journal of the Acoustical Society of America, 131(5):EL400–EL405, 2012.

[15] P. D. Danley, M. Husemann, and J. Chetta. Acoustic diversity in Lake Malawi’s rock-dwelling cichlids. Environmental Biology of Fishes, 93(1):23–30, 2012.

[16] DEEDI. Queensland fisheries act. Technical report, Queensland Department of Employment, Economic Development and Innovation, 1994.

[17] DEEDI. Control of exotic pest fishes: an operational strategy for Queensland freshwaters 2000-2005. Technical report, Queensland Department of Employment, Economic Development and Innovation, 2000.

[18] Ecologic. HP/02 Hydrophone preamplifiers, 2006. http://ecologicuk.co.uk/Equipment.htm.

[19] S. Fagerlund. Bird species recognition using support vector machines. EURASIP Journal on Applied Signal Processing, 2007(1):64–64, 2007.

[20] W. Hu, N. Bulusu, C. T. Chou, S. Jha, A. Taylor, and V. N. Tran. Design and evaluation of a hybrid sensor network for cane toad monitoring. ACM Transactions on Sensor Networks, 5(1):4:1–4:28, February 2009.

[21] F. J. Kroon, D. J. Russell, P. A. Thuesen, T. Lawson, and A. Hogan. Using environmental variables to predict distribution and abundance of invasive fish in the wet tropics. Technical report, CSIRO, 2011.

[22] N. Longrie, M. L. Fine, and E. Parmentier. Innate sound production in the cichlid Oreochromis niloticus. Journal of Zoology, 275(4):413–417, 2008.

[23] N. Longrie, S. Van Wassenbergh, P. Vandewalle, Q. Mauguit, and E. Parmentier. Potential mechanism of sound production in Oreochromis niloticus (Cichlidae). Journal of Experimental Biology, 212(21):3395–3402, 2009.

[24] K. P. Maruska, U. S. Ung, and R. D. Fernald. The African cichlid fish Astatotilapia burtoni uses acoustic communication for reproduction: Sound production, hearing, and behavioral significance. PLoS ONE, 7(5):e37612, 2012.

[25] D. K. Mellinger. A comparison of methods for detecting right whale calls. Canadian Acoustics, 55(2):55–65, 2004.

[26] D. K. Mellinger and C. W. Clark. Recognizing transient low-frequency whale sounds by spectrogram correlation. Journal of the Acoustical Society of America, 107(6):3518–3529, 2000.

[27] C. S. Nwadiaro. Fecundity of cichlid fishes of the Sombreiro River in the lower Niger delta. Revue de Zoologie Africaine, 101:433–437, 1987.

[28] S. Parsons and G. Jones. Acoustic identification of twelve species of echolocating bat by discriminant function analysis and artificial neural networks. Journal of Experimental Biology, 203:2641–2656, 2000.

[29] S. J. Press and S. Wilson. Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, 73(364):699–705, 1978.

[30] Reson. TC4032 Datasheet, 2005. http://www.reson.com/products/hydrophones/tc-4032/.

[31] P. L. Shafland. Introduction and establishment of a successful butterfly peacock fishery in southeast Florida canals. In Proceedings of the American Fisheries Society Symposium, pages 443–451, 1995.

[32] A. Taylor, G. Watson, G. Grigg, and H. McCallum. Monitoring frog communities: an application of machine learning. In Proceedings of the Eighth Annual Conference on Innovative Applications of Artificial Intelligence, pages 1564–1569, 1996. AAAI Press.

[33] Teledyne Benthos. AQ-2000 hydrophone, 2012. http://www.benthos.com/undersea-hydrophones-microphones-elements.asp.

[34] C. U. The Beacon Institute. River and Estuary Observatory Network (REON), 2012.

[35] T. Pohle, E. Pampalk, and G. Widmer. Evaluation of frequently used audio features for classification of music into perceptual categories. In Proceedings of the Fourth International Workshop on Content-Based Multimedia Indexing (CBMI’05), 2005.

[36] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.

[37] A. C. Webb. Status of non-native freshwater fishes in tropical northern Queensland, including establishment success, rates of spread, range and introduction pathways. Journal and Proceedings of the Royal Society of New South Wales, 140:63–78, 2007.