
Improvisation in Interactive Music Systems

© Toby Gifford

A thesis submitted in partial fulfillment of the degree of

Doctor of Philosophy

Faculty of Creative Industries
Queensland University of Technology

March 2011

Brisbane Queensland


Key Words

interactive music systems; generative; algorithmic composition; beat tracking; metre induction; onset detection; polyphonic; pitch tracking; machine listening; improvisation


Abstract

This project investigates machine listening and improvisation in interactive music systems with the goal of improvising musically appropriate accompaniment to an audio stream in real-time. The input audio may be from a live musical ensemble, or playback of a recording for use by a DJ. I present a collection of robust techniques for machine listening in the context of Western popular dance music genres, and strategies of improvisation to allow for intuitive and musically salient interaction in live performance.

The findings are embodied in a computational agent – the Jambot – capable of real-time musical improvisation in an ensemble setting. Conceptually the agent's functionality is split into three domains: reception, analysis and generation. The project has resulted in novel techniques for addressing a range of issues in each of these domains.

In the reception domain I present a novel suite of onset detection algorithms for real-time detection and classification of percussive onsets. This suite achieves reasonable discrimination between the kick, snare and hi-hat attacks of a standard drum-kit, with sufficiently low latency to allow perceptually simultaneous triggering of accompaniment notes. The onset detection algorithms are designed to operate in the context of complex polyphonic audio.

In the analysis domain I present novel beat-tracking and metre-induction algorithms that operate in real-time and are responsive to change in a live setting. I also present a novel analytic model of rhythm, based on musically salient features. This model informs the generation process, affording intuitive parametric control and allowing for the creation of a broad range of interesting rhythms.

In the generation domain I present a novel improvisatory architecture drawing on theories of music perception, which provides a mechanism for the real-time generation of complementary accompaniment in an ensemble setting.

All of these innovations have been combined into a computational agent – the Jambot – which is capable of producing improvised percussive musical accompaniment to an audio stream in real-time. I situate the architectural philosophy of the Jambot within contemporary debate regarding the nature of cognition and artificial intelligence, and argue for an approach to algorithmic improvisation that privileges the minimisation of cognitive dissonance in human-computer interaction.

This thesis contains extensive written discussions of the Jambot and its component algorithms, along with some comparative analyses of aspects of its operation and aesthetic evaluations of its output. The accompanying CD contains the Jambot software, along with video documentation of experiments and performances conducted during the project.


Contents

Keywords
Abstract
List of Figures
Supplementary CD
Statement of Original Authorship
Acknowledgements

1 Introduction
  1.1 Project Summary
  1.2 Knowledge Claims
  1.3 Guide to the Accompanying CD
  1.4 Associated Publications

2 Background
  2.1 Introduction
  2.2 Interactive Music Systems
  2.3 Approaches to Improvisation
  2.4 Musical Pulse
  2.5 Musical Metre
  2.6 Utilising Metrical Ambiguity

3 Methodology
  3.1 Introduction
  3.2 Research Framework
  3.3 Epistemology
  3.4 Theoretical Perspective
  3.5 Methodology
  3.6 Methods
  3.7 Summary

4 Architecture
  4.1 Introduction
  4.2 Jambot Architecture
  4.3 Architectures for Perception and Action
  4.4 Summary

5 Reception
  5.1 Introduction
  5.2 Stochastic Onset Detection
  5.3 Polyphonic Pitch Tracking
  5.4 Summary

6 Analysis of Metre
  6.1 Introduction
  6.2 Beat Tracking
  6.3 The Substratum
  6.4 Estimating the Substratum
  6.5 Estimating the Substratum Phase
  6.6 Mirex Database
  6.7 Attentional Modulation
  6.8 Finding the Bar Period
  6.9 Summary

7 Analysis of Rhythm
  7.1 Introduction
  7.2 Representing Onsets as Notes
  7.3 Rhythmic Analyses
  7.4 Summary

8 Generation
  8.1 Introduction
  8.2 Reactive Generation Techniques
  8.3 Proactive Generation Techniques
  8.4 Interactive
  8.5 Summary

9 Conclusion
  9.1 Reception
  9.2 Analysis
  9.3 Generation
  9.4 Architecture
  9.5 Further Research
  9.6 Final Thoughts

A Derivation of Phase Estimate

Bibliography


List of Figures

2.1 Coherence metre

3.1 Theoretical Framework – adapted from (Crotty 1998:4)
3.2 Scaling up problem (Kitano 1993)

4.1 Reception, Analysis and Generation

5.1 Attacks can be masked in multipart signals
5.2 Splitting out the RCC, Amplitude vs Samples
5.3 Comparison of Detection Functions

6.1 Substratum pulse is quavers
6.2 SOD stream salience of a segment of the Amen Break
6.3 ACF for the SOD stream salience
6.4 Clumped ACF for the SOD stream salience
6.5 MIREX training set example 5

8.1 Rozin's Wooden Mirror
8.2 Children interacting with the Continuator
8.3 Edward Ihnatowicz's Sound Activated Module
8.4 Edward Ihnatowicz's Senster
8.5 Simon Penny's Petit Mal
8.6 Offbeat snare hits with anticipatory timing
8.7 Offbeat snare hits without anticipatory timing
8.8 Dancehall beat with anticipatory timing
8.9 Dancehall beat without anticipatory timing
8.10 A depiction of the Chimæra on an ancient Greek plate
8.11 Data from three updates of the inferred metric contexts
8.12 Simple example of ambiguity strategies


Supplementary CD

Root Folder
  thesis.pdf

1 Introduction
  MainDemo.mov
  RobobongoAllstars.mov
  RobobongoAllstars2.mov

2 Reception/Polyphonic Pitch Tracking
  4 ton mantis.mp3
  4 ton mantis jam.mp3
  mas que nada.mp3
  mas que nada jam.mp3
  sweet.mp3
  sweet jam.mp3

2 Reception/Stochastic Onset Detection
  JungleBoogie bq.mp3
  JungleBoogie hf.mp3
  JungleBoogie nd.mp3
  JungleBoogie.mp3
  train* bq.mp3
  train* hf.mp3
  train* nd.mp3

3 Analysis of Metre
  MirexPracticeData.zip

3 Analysis of Metre/MirexJambot
  track*.mp3


3 Analysis of Metre/MirexBeatRoot
  BeatRootTrack*.mp3

4 Generation/Reactive
  BernardLubat.mp4
  Children.mp4
  petit mal.mp4
  SAM.mpg
  senster.mpg

4 Generation/Proactive
  chimaera.mov
  original.mp3
  disambiguate.mp3
  ambiguate.mp3
  follow.mp3

4 Generation/Proactive/Anticipatory Timing
  AlternatingSnareAndHatAnticipation.mp3
  AlternatingSnareAndHatNoAnticipation.mp3
  DanceHallAnticipation.mp3
  DanceHallNoAnticipation.mp3
  EnsembleAnticipation.mp3
  EnsembleNoAnticipation.mp3
  AmenBreak.mp3

5 Discussion
  BeatTracking.mov
  AttentionalModulation.mov

A Software
  impromptu.app
  jambot-demo.scm
  Ybot.component
  AmenBreak.wav
  README.rtf


Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature:

Date:


Acknowledgements

First and foremost thanks must go to my supervisors, Andrew Brown and Steve Dillon, whose patience, energy and commitment are nothing short of remarkable. Many thanks also to Cori Stewart and Siall Waterbright for their invaluable guidance in the intricacies of the English language. To Andrew Sorensen for creating Impromptu, which has played a large role in my work. To the QUT postgrad community who made my experience what it was. And lastly to all my friends and family who have patiently waited for me to emerge from the cocoon that is a PhD.


Chapter 1

Introduction

The difference between a tool and a machine is not capable of very precise distinction.

Babbage (1833)

1.1 Project Summary

This project aimed to develop a computational musical agent capable of improvising in an ensemble setting. The research was practice-led, drawing on my practice as a computational instrument builder and live acoustic/electronic musician. Consequently there was a design requirement for the computational agent – dubbed the Jambot – to be a complete operational system of sufficient quality for use in live performance. Furthermore, the goal was to be able to process complex audio, such as a full band, or to augment a recording for use by a DJ.

Viewed from another perspective, this project was an investigation into algorithms for machine listening and algorithmic improvisation. Conceptually the constituent algorithms are divided into three domains: reception, analysis and generation. Reception refers to the conversion of raw audio into timestamped notes. Analysis refers to beat-tracking and metre induction, as well as various rhythmic analyses. Generation refers to processes of algorithmic improvisation, aimed at producing complementary percussive accompaniment to a musical input signal. This project aimed to develop new techniques in all of these domains, subject to the constraint of needing to work together in a functioning, robust, concert-ready system.
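
To make this division concrete, the following Python sketch outlines the three domains as a processing pipeline. The class and method names here are illustrative assumptions, not identifiers from the Jambot's source code; note the feedback path from analysis back to reception, discussed further in §6.7.

    # Illustrative sketch of the reception -> analysis -> generation pipeline.
    # Names are hypothetical; they do not come from the Jambot's codebase.
    class JambotStylePipeline:
        def __init__(self, reception, analysis, generation):
            self.reception = reception      # raw audio -> timestamped notes
            self.analysis = analysis        # notes -> beat, metre, rhythm features
            self.generation = generation    # musical context -> improvised actions

        def process_block(self, audio_block, now):
            notes = self.reception.detect_onsets(audio_block, now)
            context = self.analysis.update(notes, now)
            # Feedback: beat predictions modulate onset detection thresholds (§6.7)
            self.reception.set_expectations(context.beat_predictions)
            return self.generation.act(context, now)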

From another perspective again, this project aimed to provide commentary on theories of music perception and music analysis, by encoding them into computational implementations, and assessing their veracity in the context of an interactive music system. In particular, this project aimed to investigate appropriate architectures for music representation, machine listening and algorithmic improvisation, with a view to providing commentary on plausible architectures for human music perception and cognition.


1.1.1 Personal Motivations

I am an acoustic musician, playing primarily the clarinet, in a variety of styles including jazz, funk and rock, with a strong interest in folk dance genres. I am also a live electronic musician, performing both as a DJ and hybrid acoustic/electronic musician. My interest is primarily popular dance music, in a wide range of genres from disco through to Balkan gypsy.

As both an acoustic and electronic musician I have a natural interest in combining these two modes of performance. In my experience the two modes operate in somewhat incompatible paradigms; certain aspects of performance that are natural in one are quite unnatural in the other. The idea of this project has been to create a technological bridge between the modes of performance that transforms this incompatibility into a complementarity.

An attribute of electronic performance that often differentiates it from traditional acoustic performance is that technology can release the performer from the ‘one-gesture-one-sound’ paradigm (Wessel 1991), nigh on ubiquitous for acoustic instruments. In playing a traditional acoustic instrument each gesture (such as moving a finger) is generally responsible for an individual sonic event. However, with technology such as samplers, sequencers and complex synthesiser patches, a single gesture can set in motion a long series of sonic events. In the case of a DJ this enables a solo performer to produce a full and engaging sound in a manner that would be difficult to match as a solo acoustic instrumentalist, at least in typical DJ performance contexts such as dance clubs.

On the other hand, live acoustic performance involves a number of attributes often lost in electronic performance that I feel are intrinsic to the experience of live music. In particular, I believe the communicative dynamics of an ensemble, especially when playing improvised music, play a large role in my enjoyment of live performances. One aspect of the ensemble dynamics notably missing from electronic performance, at least when operating outside of the one-gesture-one-sound paradigm, is human tempo variation.

For example, the use of loops and sequencers in combination with live acoustic performance is quite widespread, particularly in genres such as hip-hop (Reynolds 2006). However, to my mind, as soon as a loop is added to the mix something of the live feel is lost, as the performers are locked into the tempo of the loop.

I suggest that the fundamental problem is that the technology is not ‘listening’, and so is not responsive to what the rest of the ensemble does. The idea of the Jambot is to address this issue. One potential application of the technologies developed in this project is tempo tracking for controlling sample playback, although I haven't concentrated on this application. Rather, I have been interested in creating algorithms that both listen and respond.

Mixing acoustic with electronic performance is not the only motivation (or application) of this research. Another primary goal is to develop tools for use by DJs. The Jambot listens to an audio signal – this might be from a band, or it could be listening to an audio recording. So a DJ could make use of the Jambot by having it listen and jam to a record. In an abstract sense this application is similar to the first, in that I seek for this technology to increase the fluidity, flexibility and dynamism of live performance with electronics.

Through the course of this research my focus changed to lean more heavily towards DJ applications, rather than live ensembles. In the middle stages of the project I conducted a number of performance trials with a live ensemble – the Robobongo Allstars – examples of which can be found on the accompanying CD. As the project progressed I concentrated more on having the Jambot augment playback of recordings. The reason for this was primarily pragmatic; from an experimental perspective, the ease and repeatability of using an audio file input, compared with corralling a live ensemble, recording the improvisation and later analysing the results, meant that progress on the underlying algorithms was substantially quicker with prerecorded input.

Concentration on DJ applications also entailed a shift in focus from an autonomous system to a semi-autonomous system, with the DJ able to manipulate aspects of the output via parametric control. This involved a shift in the phenomenological aims. When developing the Jambot as an autonomous agent I was interested in interaction strategies for creating an impression of agency. In semi-autonomous mode, the focus shifted to providing an intuitive and expressive interface for parametric control of the generated output.

1.1.2 Creative Practice Research and my Role as Researcher

This project has been conducted in the creative practice paradigm (Hamilton & Jaaniste 2009). In creative practice research it is common for the researcher to utilise subjective measures, such as aesthetic evaluation, in reflecting on their practice (Stock 2010). The design goal for the Jambot was to provide appropriate and complementary musical accompaniment within a limited musical scope. The evaluation of appropriateness and complementarity in this research has been framed by my aesthetic preferences and the intended musical scope. In the sections below I discuss these issues further.

Creative Practice Research

Creative practice research is research that is “initiated in practice, where questions, problems, challenges are identified and formed by the needs of practice and practitioners ... and the research strategy is carried out through practice” (Gray 1998:3). The Queensland University of Technology's doctoral research regulations formally recognise creative practice research projects (QUT 2011), requiring that the candidate nominate a percentage split between the theoretical and practical outcomes of the project. This PhD project is a creative practice project with a split of 70% theoretical to 30% practical.

My practice has two mutually informing aspects – computational instrument building, and live performance. The practical outcome of this project is the Jambot itself. In addition to the Jambot, I have provided video and audio documentation of musical performances with the Jambot. These are not intended to be considered as the practical outcomes of this project per se, but rather as another source of documentation of the Jambot.

Musical Scope

I developed the Jambot in the context of my creative practice in hybrid acoustic/electronic live performance. The musical scope of interest to me is popular dance music in a wide variety of genres, including electronic styles such as breakbeat, drum ‘n’ bass, house and techno, as well as acoustic styles such as funk, rock, latin and gypsy. Some common threads are that the music be intended for dance, strongly pulse driven, and populist rather than experimental. A further practical limitation, for the Jambot in its current form, is that the music contain percussive elements. As such the Jambot is not currently able to completely replace a rhythm section; the ability to do so is a goal for future development.

Evaluation of Musical Appropriateness

In presenting the research questions and discussing the outcomes of this project I frequently refer to musical appropriateness or complementarity. These terms should be understood in the context of the goal of the Jambot, which is to provide stylistically convincing output within the musical scope outlined above. In assessing how stylistically convincing the Jambot's output is, I have utilised my own subjective aesthetic evaluation using a set of critical values discussed in §3.5.1 and §3.6.5.


1.1.3 Research Questions

In describing this research I have divided the investigation into three domains: reception, analysis and generation. New knowledge has been created in each of these domains. This delineation is reflected in the architecture of the Jambot (§4.2.1), and also corresponds roughly to the chronological order in which the research was carried out (§3.6.1). Additionally, this research sought to find an effective architecture for combining these three domains.

Reception

How can salient events be extracted from a raw audio stream? The Jambot is designed for use with an audio input, rather than symbolic input such as MIDI. The first stage of its processing chain is to extract salient events from the raw audio stream. I investigated techniques for detecting the onsets of notes, both percussive and pitched, and techniques for polyphonic pitch tracking.

The extraction of features from complex audio is an active area of research in Computational Auditory Scene Analysis, posing significant engineering challenges. One research focus within this field is automatic transcription of a musical audio signal into notes (Plumbley et al. 2002). This project aimed to provide incremental advances in this area, sufficient for the overall goal of generating appropriate improvised accompaniment. In particular it aimed to provide precise timing information for use in beat-tracking.

Analysis

How can the notes be analysed into musical features? In order to provide appropriate musical accompaniment, the Jambot performs a number of musical analyses on the surface notes. In particular the Jambot performs beat-tracking and metre induction to enable real-time metrically appropriate responses. The Jambot also performs rhythmic analyses, so as to understand the musically salient rhythmic features of the input signal.

Computational musical analyses such as these are the domain of machine listening. Much research has been conducted in machine listening, particularly in beat-tracking and metre-induction (Dixon 2001). However, a general robust beat-tracking algorithm is yet to be found, and is perhaps unachievable (Goto & Muraoka 1995; Collins 2006b:99). This project aimed to extend work in this field to enable the Jambot to follow mildly varying tempos, and quickly recover from abrupt tempo changes, in percussive dance music.


The Jambot's collection of rhythmic analyses together form a model of rhythm, which informs the Jambot's generation process. The aim of the model was to provide a parametrisation of rhythm space into musically salient features, to allow for intuitive parametric control over rhythm generation.

Generation

How can appropriate musical responses be generated? The purpose of the Jambot is to provide improvised accompaniment in an ensemble setting. Having parsed the raw audio signal into notes, and analysed these notes into musical features, how can the Jambot generate musically appropriate responses?

The field of interactive music systems is concerned with the creation of computational musical agents that improvise in an ensemble. A range of processes have been studied, including transformations of the input signal, and generative techniques from algorithmic composition.

Transformative processes tend to be quite robust, whilst generative processes can allow for more interesting and novel accompaniment. Most interactive music systems use either transformative or generative processes, but not both (§2.2). This project aimed to find ways of combining transformative and generative processes to provide for robust and interesting accompaniment of real-world complex audio signals in a concert setting.

Architecture

What is an effective architecture for combining reception, analysis and generation? The reception, analysis and generation modules are not independent – they exist in a single coherent architecture. The Jambot's architecture is a complex hierarchical structure with bi-directional feedback between the layers of the hierarchy.

Many architectures have been studied for interactive music systems, music analysis, and music generation (§4.3). More broadly, many architectures have been applied to problems in Artificial Intelligence (AI) (§4.3.2). Some of the more successful experiments in robotics have criticised traditional AI architectures based on knowledge representations (§4.3.3). Instead they emphasise the importance of feedback, both between the robotic agent and its environment, and between modules within the agent's perceptual architecture. This project aimed to apply architectural notions from situated robotics to an interactive music system, in the hope of creating a more robust system.


1.2 Knowledge Claims

This project has generated new knowledge in a number of disciplines. As an interdisciplinary project, the findings of this research have relevance to science, engineering and the creative arts.

The knowledge claims for this project are divided into the same three domains as the research questions, namely reception, analysis and generation. Additionally, I present some architectural findings, relating to the manner in which reception, analysis and generation interact in a complete functioning system.

1.2.1 Reception Knowledge Claims

In the reception domain I present a novel suite of low latency percussive onset detectors for use in complex audio, able to achieve reasonable discrimination between the kick, snare and hi-hat attacks of a standard drum-kit. These detectors all operate with sufficiently low latency to allow real-time imitation (and timbral remapping) of a drum track with imperceptible delay.

The suite consists of three onset detectors: the SOD, MID and LOW detectors. These are tuned to detect the hi-hat, snare and kick attacks respectively. More broadly they can be thought of as detecting high, mid and low frequency percussive onsets, although they are not simply band-pass filtered energy detectors.

Of the three detectors, the SOD algorithm (§5.2) represents the most significant contribution to knowledge. It significantly departs from existing detection schemes by searching directly for growth in the noisiness of the signal. It is more robust to confounding high frequency noise than existing detectors.

The MID detector (§5.2.7) combines a standard band-pass filtered energy detector with a variation of the SOD algorithm. The LOW detector (§5.2.6) is simply a band-pass filtered energy detector. These detectors do not represent as great a contribution to knowledge as the SOD algorithm. However, the implementation details required to achieve sufficient accuracy and discrimination at the desired latency represent a substantial research effort.
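
For orientation, the sketch below implements the simplest of these ideas: a band-pass filtered energy detector of the kind described for the LOW detector, reporting an onset whenever low-band energy jumps above a running average. The parameter values are illustrative assumptions rather than the Jambot's settings, and the SOD algorithm itself works quite differently, searching for growth in the signal's noisiness rather than its energy (§5.2).

    import numpy as np

    def band_energy_onsets(x, sr, block=512, band=(40.0, 120.0),
                           jump=2.0, decay=0.9):
        """A minimal band-pass filtered energy onset detector (illustrative).

        x: mono signal as a numpy array. Reports an onset time whenever the
        energy in the given frequency band rises 'jump' times above a
        decaying average of recent band energy.
        """
        onsets, avg = [], 1e-9
        freqs = np.fft.rfftfreq(block, 1.0 / sr)
        in_band = (freqs >= band[0]) & (freqs <= band[1])
        window = np.hanning(block)
        for start in range(0, len(x) - block + 1, block):
            spectrum = np.abs(np.fft.rfft(x[start:start + block] * window))
            energy = float(np.sum(spectrum[in_band] ** 2))
            if energy > jump * avg:            # sudden rise relative to history
                onsets.append(start / sr)
            avg = decay * avg + (1 - decay) * energy
        return onsets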

I also present a novel technique aimed at improving the accuracy and latency of pitch estimates in polyphonic pitch tracking (§5.3). The resulting algorithm is, however, of insufficient quality for production use. Consequently the Jambot uses only information from the onset detection suite, and produces only rhythmic accompaniment.

1.2.2 Analysis Knowledge Claims

In the analysis domain I present new findings relating to computational metric analysis, and to computational rhythmic analysis. The metric analysis findings relate to beat-tracking and metre induction. The rhythmic analysis findings present a model of rhythm based on musically salient features.

For the purposes of beat-tracking, I introduce a new music theoretical notion called the substratum (§6.3). The substratum acts as a referent pulse level, distinct from existing notions of referent pulse such as the tactus and tatum. The substratum is a mathematical property of the musical surface, rather than a perceptual property such as the tactus, and so is more amenable to computational implementation.

I present a novel beat-tracking algorithm. This algorithm combines and extends a number of existing approaches, but has significant differences. It achieves reasonable tracking in percussive dance music with mildly varying tempo, and also recovers quickly from abrupt tempo (or phase) changes. The beat-tracking algorithm presents novel techniques for the estimation of both beat period (§6.4) and beat phase (§6.5). The phase estimation is better suited to high precision onset timing input (such as achieved by the Jambot's onset detection suite) than existing methods, and is also designed to be most accurate at the time of estimation, unlike many existing methods which are most accurate in the middle of a short window of history prior to the time of estimation.
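
Chapter 6 details these estimators; as a point of reference for the general approach, the sketch below shows the standard autocorrelation computation that underlies many period estimators, applied to an onset salience signal (compare Figures 6.3 and 6.4). It is a textbook baseline with assumed parameter ranges, not the algorithm of §6.4.

    import numpy as np

    def acf_period_estimate(salience, frame_rate, min_period=0.2, max_period=1.5):
        """Estimate a pulse period (in seconds) from an onset salience signal
        by picking the strongest autocorrelation lag in a plausible range.
        A generic baseline for comparison, not the estimator of §6.4."""
        s = np.asarray(salience, dtype=float)
        s = s - s.mean()
        acf = np.correlate(s, s, mode='full')[len(s) - 1:]   # non-negative lags
        lo = int(min_period * frame_rate)                    # assumes len(s) > hi
        hi = min(int(max_period * frame_rate), len(acf))
        lag = lo + int(np.argmax(acf[lo:hi]))                # strongest periodicity
        return lag / frame_rate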

I also present a novel metre-induction algorithm (§6.8), which estimates the number of beats in a bar (but not the position of the downbeat) using the notion of beat-class interval invariance. More complex notions of metre are not estimated.
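
The algorithm itself is specified in §6.8. Purely to illustrate the intuition behind estimating a bar length by invariance, the sketch below scores candidate bar lengths by how similar consecutive candidate bars of an onset grid are, preferring the shortest bar achieving near-maximal self-similarity. This is one possible reading of the idea, not the thesis's formulation of beat-class interval invariance.

    import numpy as np

    def bar_invariance(grid, n):
        """Mean cosine similarity between consecutive n-beat chunks of an
        onset grid (one salience value per beat). Illustrative only."""
        chunks = [grid[i:i + n] for i in range(0, len(grid) - n + 1, n)]
        sims = [np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
                for a, b in zip(chunks, chunks[1:])]
        return float(np.mean(sims)) if sims else 0.0

    def beats_per_bar(grid, candidates=(2, 3, 4, 6), tol=0.05):
        grid = np.asarray(grid, dtype=float)
        scores = {n: bar_invariance(grid, n) for n in candidates}
        best = max(scores.values())
        # Prefer the shortest bar length achieving near-maximal invariance.
        return min(n for n, s in scores.items() if s >= best - tol)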

The Jambot's rhythmic analyses together form a model of rhythm (§7.3.4). This model informs the generation process. As such, the model is designed to parametrise rhythm space into musically salient features. This affords intuitive parametric control over the Jambot's rhythm generation.


1.2.3 Generation Knowledge Claims

The generation stage is the third component of the Jambot's architecture. The task of the generation stage is to produce musically appropriate actions, given the current musical context inferred by the reception and analysis stages. The generation stage utilises three improvisatory strategies: reactive, proactive and interactive.

The reactive strategy uses a novel technique called transformational mimesis (§8.2.6). Transformational mimesis is imitation that has been transformed to obfuscate the connection between the imitation and its source. It is similar in philosophy to the existing notion of reflexivity in interactive music systems (§2.2), but differs in that it utilises synchronous imitation, made possible by the low latency onset detectors. I present a description of an interaction paradigm termed Smoke and Mirrors (§8.2.2) to which transformational mimesis and reflexivity both conform.

The proactive strategy uses a novel approach to rhythm generation (§8.3.7) based on the model of rhythm defined in the analysis stage. The approach sets target values for salient rhythmic features, and continuously searches for musical actions that tend to move the ensemble towards these target values. The target values can be controlled by a human in real-time. This results in appropriate rhythmic accompaniment, with intuitive parametric control over salient aspects of the accompaniment.

In order to allow for intuitive parametric control of this generative process, I have introduced a novel search technique called anticipatory timing (§8.3.6). Anticipatory timing is designed to provide a good trade-off between computational efficiency and optimality. It involves an extension of greedy optimisation that allows for finessing of the timing of actions. It is particularly appropriate when searching for optimal rhythmic actions under real-time constraints.
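
By way of illustration, the sketch below shows the shape of such a target-driven search: greedy selection over candidate actions, extended with a set of candidate onset times within a short look-ahead window so that the timing of an action can be finessed. The feature predictor, action set and cost function here are assumptions standing in for the model of rhythm of Chapter 7, not the implementation of §8.3.6.

    import numpy as np

    def choose_action(context, actions, candidate_times, predict_features, targets):
        """Greedy search with timing look-ahead (illustrative sketch).

        For each candidate action (e.g. kick, snare, hat, rest) and each
        candidate onset time in a short look-ahead window, predict the
        resulting rhythmic feature vector and pick the pair that lands
        closest to the target feature values."""
        best, best_cost = None, float('inf')
        for action in actions:
            for t in candidate_times:
                features = np.asarray(predict_features(context, action, t))
                cost = float(np.linalg.norm(features - np.asarray(targets)))
                if cost < best_cost:
                    best, best_cost = (action, t), cost
        return best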

I also present a meta-generative algorithm called the Chimæra (§8.3.10) for automatically controlling some of the target values for the rhythmic features. The Chimæra aims to achieve specifiable levels of metric ambiguity by selectively emphasising more or less plausible metric assumptions gathered from the metric analysis stage.

I present a novel mechanism for combining proactive and reactive strategies called the juxtapositional calculus (§8.4.1). The idea of the juxtapositional calculus is to provide a means of deviating from a baseline of transformational mimesis, so that elements of musical understanding can be acted upon in a way that minimises cognitive dissonance.


Architectural Knowledge Claims

The Jambot's architecture is a complex hierarchical structure with communicative feedback between layers in the hierarchy. This architecture encompasses all of the stages of reception, analysis and generation. Feedback between the stages means that they are not independent. This research demonstrates that such an architecture is effective for an interactive music system, and that the use of feedback increases robustness.

The most common architectures used for music representation, and for modelling cognitive structures involved in music perception, are trees and Markov chains. I present an argument that these simple architectures are inappropriate for music representation or cognition (§4.3). Rather, a complex hierarchical structure with communicative feedback, such as the Jambot's architecture, is required to capture both local and global features of music, and to provide for robust perception.

The Jambot's architecture implements simple attentional mechanisms (§8.4.2). I present an argument that byproducts of attention such as curiosity and distractibility can create an impression of ‘lifelike’ behaviour in an agent (§8.2.5).

Attention facilitates one source of feedback in the Jambot's architecture through the use of attentional modulation of onset expectation (§6.7), which involves feedback from the analysis stage to the reception stage. The modulation changes the signal/noise ratio required to report an onset, making it easier for an onset to be detected close to a predicted beat. The use of attentional modulation increases the robustness of the beat-tracking algorithm.
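
A minimal sketch of such a modulation follows, in which the detection threshold is lowered in a window around each predicted beat. The Gaussian shape and the parameter values are illustrative assumptions, not the Jambot's exact scheme.

    import numpy as np

    def modulated_threshold(t, base, predicted_beats, depth=0.5, width=0.03):
        """Lower the onset detection threshold near predicted beat times
        (in seconds), so onsets close to a predicted beat are easier to
        report. Shape and parameters are illustrative, not the Jambot's."""
        if not predicted_beats:
            return base
        nearest = min(abs(t - b) for b in predicted_beats)
        attention = np.exp(-0.5 * (nearest / width) ** 2)   # peaks on the beat
        return base * (1.0 - depth * attention)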

The Jambot

Finally, the Jambot itself represents a significant contribution to knowledge. All of the knowledge claims listed above are demonstrated by, and embodied in, a complete operational concert-ready system.

Formally, this project has been practice-led, with a 70% theoretical to 30% practical weighting. For the practical component I present the Jambot itself. The accompanying CD contains complete source code documentation, as well as the Jambot program.

A short demonstration of the main features of the Jambot can be viewed in the file MainDemo.mov on the accompanying CD.


1.3 Guide to the Accompanying CD

The data CD that accompanies this thesis contains the Jambot software, documentation for the code, and supplementary audio-visual materials referred to in the thesis. In this section I give a brief description of these materials, organised by the folders of the CD.

A number of these files are QuickTime movies. The QuickTime player for Windows and Mac OS X can be downloaded from (Apple 2011).

Root Folder

thesis.pdf: PDF version of this document.

1 Introduction

MainDemo.mov: demonstrates the main features of the Jambot, and gives an example of jamming to a prerecorded track.

RobobongoAllstars.mov: documentation of the Jambot jamming with a live ensemble – the Robobongo Allstars. This performance took place at the 2009 Australasian Computer Music Conference at QUT in Brisbane.

RobobongoAllstars2.mov: another performance of the Jambot with the Robobongo Allstars. This performance took place at the 2008 Ignite Conference at QUT in Brisbane.

2 Reception

Polyphonic Pitch Tracking: this folder contains examples of the Jambot augmenting a short segment of a prerecorded track, by mimicking the pitch material that it perceives (§5.3.6). The original track segments, and the corresponding augmented files, are:

mas que nada.mp3, mas que nada jam.mp3
4 ton mantis.mp3, 4 ton mantis jam.mp3
sweet.mp3, sweet jam.mp3

Stochastic Onset Detection:

kick and synth.mp3: an audio sample of a synth note playing over a kick drum sound (§5.2.1). This example demonstrates the difficulty with using amplitude envelope based onset detection.

JungleBoogie bq.mp3, JungleBoogie hf.mp3, JungleBoogie nd.mp3: augmentations of a short audio sample, overlaid with a click at the times when an onset is detected, using three different onset detection techniques: the bonk∼ (Puckette et al. 1998), HFC (Masri & Bateman 1996) and SOD techniques respectively (§5.2.5). Also included is the unaugmented sample file JungleBoogie.mp3.

train* bq.mp3, train* hf.mp3, train* nd.mp3: similar to the JungleBoogie *.mp3 files above, but applied to a subset of the training data from the MIREX (2006) tempo tracking data set (§5.2.5).

3 Analysis of Metre

MirexPracticeData.zip: This is the MIREX (2006) tempo tracking practice data set. The subset of these audio files containing percussive elements is used for algorithm evaluation in several places in this thesis.

MirexJambot: track*.mp3
These files are audio examples of the Jambot's beat tracking, applied to a subset of the MIREX (2006) tempo tracking data set (§6.6).

MirexBeatRoot: BeatRootTrack*.mp3
These files are audio examples of BeatRoot's beat tracking, applied to a subset of the MIREX (2006) tempo tracking data set (§6.6).

4 Generation

Reactive: this folder contains video documentation of exemplar artworks supporting the arguments made in the section on Reactive Generation (§8.2).
BernardLubat.mp4: an example of the Continuator (Pachet 2002) interacting with an experienced jazz musician (§8.2.2).
Children.mp4: the Continuator (Pachet 2002) interacting with children (§8.2.2).
petit mal.mp4: video documentation of Simon Penny's (2011) Petit Mal (§8.2.2).
SAM.mpg: documentation of Edward Ihnatowicz's (2011) Sound Activated Module (§8.2.2).
senster.mpg: documentation of Edward Ihnatowicz's (2011) Senster (§8.2.2).


Proactive/Chimaera: this folder contains examples of the Chimæra process for ambiguous rhythm generation (§8.3.10).
chimaera.mov: video demonstration of the Chimæra process for ambiguous rhythm generation.

The following three files are examples of the different modes that the Chimæra employs: disambiguation, ambiguation and following (§8.3.10). These files are recordings of the Jambot augmenting a simple drum loop (example file original.mp3):
disambiguate.mp3, ambiguate.mp3, follow.mp3

Proactive/Anticipatory Timing: this folder contains examples of the Jambot's rhythm generation with, and without, the Anticipatory Timing search technique (§8.3.6). Also included is the input file AmenBreak.mp3. The files are:

AlternatingSnareAndHatAnticipation.mp3, AlternatingSnareAndHatNoAnticipation.mp3
DanceHallAnticipation.mp3, DanceHallNoAnticipation.mp3
EnsembleAnticipation.mp3, EnsembleNoAnticipation.mp3

5 Discussion

BeatTracking.mov: demonstrates the Jambot’s beat tracking facility (§6.2.2).

AttentionalModulation.mov: demonstrates the stabilising effect of modulating the onset detection thresholds according to metric expectations (§6.7).

A Software

The files in this folder work together as a working demonstration of the Jambot software. Refer to the README.rtf file for installation and usage instructions.
README.rtf: installation and usage instructions for the Jambot demonstration.
impromptu.app: Andrew Sorensen's free audio programming environment (Sorensen 2005).
Ybot.component: the Jambot itself, implemented as an Audio Unit plugin.
jambot-demo.scm: Impromptu source code for loading and demonstrating the Jambot.
AmenBreak.wav: example audio file for use in this demonstration.


1.4 Associated Publications

Aspects of this doctoral research have been published in peer-reviewed conference proceedings. The publication references are listed below:

Gifford, T & Brown, A (2011). ‘Beyond Reflexivity: Mediating between imitative and intelligent action in an interactive music system’. In British Computer Society Human-Computer Interaction Conference, July 2011. British Computer Society, Newcastle upon Tyne.

Gifford, T & Brown, A (2010). ‘Anticipatory Timing in Algorithmic Rhythm Generation’. In Opie, T, ed., Australasian Computer Music Conference, June 2010, 21-28. ACMA, Canberra.

Brown, A, Gifford, T, Davidson, R & Narmour, E (2009). ‘Generation in Context’. In Stevens, C, Schubert, E, Kruithof, B, Buckley, K & Fazio, S, eds., 2nd International Conference on Music Communication Science, Dec 2009, 7-10. HCSNet, Sydney.

Gifford, T & Brown, A (2009). ‘Do Androids Dream of Electric Chimera?’. In Sorensen, AC, ed., Australasian Computer Music Conference, July 2009, 53-56. ACMA, Brisbane.

Gifford, T & Brown, A (2008). ‘Stochastic Onset Detection: An approach to detecting percussive onset attacks in complex audio’. In Wilkie, S & Hood, A, eds., Australasian Computer Music Conference, June 2008. ACMA, Sydney.

Gifford, T & Brown, A (2007). ‘Polyphonic Listening: Real-time accompaniment of polyphonic audio’. In Australasian Computer Music Conference, June 2007. ACMA, Canberra.

Gifford, T & Brown, A (2006). ‘The Ambidrum: Ambiguous generative rhythms’. In Australasian Computer Music Conference, July 2006. ACMA, Adelaide.


Chapter 2

Background

2.1 Introduction

The Jambot is an interactive music system – a computer system for live performance with humans. The first section of this chapter gives a brief overview of the field of interactive music systems, highlighting some areas in which relatively little research has taken place, and which this project has explored. One such area is the use of generative algorithms based on salient musical attributes. Another is the combination of transformative techniques with generative techniques.

The second section outlines some approaches to improvisation identified from the literature, and my own practice as an improvising musician. A key improvisatory strategy is identified: striking a balance between novelty and coherence. Some theories of music perception relating to musical expectations and ambiguity are introduced, making the case that manipulation of the level of ambiguity present in the improvisation provides a mechanism for altering the balance between novelty and coherence. The level of ambiguity is one of the salient musical attributes that the Jambot uses in its generative processes.

The remaining sections discuss theories and concepts relating to musical pulse and metre. The tactus, tatum and density referent pulses are described. Some competing theories of metre are introduced, and a general view of metre as an expectational framework is presented. Finally the manipulation of metric ambiguity as an improvisatory device is suggested.

2.2 Interactive Music Systems

The Jambot is an interactive music system. Interactive music systems are computer systems for musical performance, in which one or more human performers interact with the system in live performance. The computer system is responsible for part of the sound production, whether by synthesis or by robotic control of a mechanical instrument. The human performers may be playing instruments, manipulating physical controllers, or both. The system's musical output is affected by the human performer, either directly via manipulation of synthesis or compositional parameters through physical controllers, or indirectly through musical interactions.

There exists a large array of interactive music systems, varying greatly in type, ranging from systems that are best characterised as hyperinstruments to those that are essentially experiments in artificial intelligence. The type of output varies from systems that perform digital signal processing on input from an acoustic instrument, through systems that use techniques of algorithmic improvisation to produce MIDI output, to systems that mechanically control physical instruments. Most systems are designed for a single human performer, and when in a band setting will generally track a single instrument, typically the drummer. The Jambot differs in this respect by analysing the aggregate signal of the whole band.

The following brief discussion of interactive music systems is not intended to be a comprehensive review, but rather to outline a general classification scheme for such systems, with a few key exemplars, so as to situate the Jambot within this field. More detailed surveys of interactive music systems may be found in Rowe (1993; 2001), Dean (2003) and Collins (2006b).

Rowe (1993) describes a multidimensional taxonomy of interactive music systems. One dimension of this taxonomy classifies systems as transformative or generative. Transformative systems transform incoming musical input (generally from the human performer playing an instrument) to produce output. Generative systems utilise techniques of algorithmic composition to generate output. Rowe also discusses a third category of sequencing; however, in this discussion I will consider sequencing as a simple form of generation. This categorisation is somewhat problematic, in that systems may be composed of both transformative and generative elements. Nonetheless it provides a useful launching point for discussion.

Transformative systems have the capacity to be relatively robust to a variety of musical styles. They can benefit from inheriting musicality from the human performer, since many musical features of the input signal may be invariant under the transformations used. A limitation of transformative systems is that they tend to produce output that is either stylistically similar (at one extreme), or musically unrelated (at the other extreme), to the input material.

Generative systems use algorithmic composition techniques to produce output. The appropriateness of the output to the input is achieved through more abstract musical analyses, such as beat tracking and chord classification. Generative systems are able to produce output that has a greater degree of novelty than transformative systems. They are often limited stylistically by the pre-programmed improvisatory approaches, and may not be robust to unexpected musical styles.

Within the class of transformative systems is the subclass of reflexive (Pachet 2006) systems. Reflexive systems are transformative in the sense that they manipulate the input music to produce an output. The manipulations that they perform are designed to create a sense of similarity to the input material, but without the similarity being too obvious. Pachet describes reflexive systems as allowing the user to “experience the sensation of interacting with a copy of [themselves]” (ibid:360). Reflexive systems aim to model the style of the input material, for example using Markov models trained on a short history of the input. Reflexive systems enjoy the benefits of transformative systems, namely inheriting musicality from the human input, and so are robust to a variety of input styles. The use of abstracted transformations means that they can produce surprising and novel output whilst maintaining stylistic similarity to the input. Reflexive systems do not, however, perform abstract musical analyses such as beat-tracking.

The Jambot is designed to combine transformative and generative approaches. In this way it hopes to achieve the flexibility and robustness of a transformative system, whilst allowing for aspects of abstract musical analysis to be inserted into the improvisation. The idea is to operate from a baseline of transformed imitation, and to utilise moments of confident understanding to deviate musically from this baseline.

Another limitation of many reflexive and generative systems is that they model music using statistical/mathematical machinery such as Markov chains, neural nets, genetic algorithms and the like. The difficulty with these models is that they do not directly expose salient musical features. This means the parameters of these models do not afford intuitive control over the musical features of their output. The Jambot utilises a representation of musical rhythm that parameterises rhythm space into musically salient features. This way the Jambot's generative processes may be controlled intuitively.


2.2.1 Examples of systems

In this section I list a few prominent examples of existing interactive music systems. Many more examples can be found in Rowe (1993; 2001), Dean (2003) and Collins (2006b).

M and Jam Factory: Two early interactive music systems, M and Jam Factory, were pioneered by Chadabe & Zicarelli (Zicarelli 1987). These systems were MIDI based systems that captured input in Markov models based on pitch, duration and loudness. They created real-time output based on the Markov models with some variability over the generative strategies.

Oscar: Another pioneering system was Peter Beyl's Oscar system, which he describes as “a program with an artificial ear. The program tries to express its own personal character while simultaneously aiming for integration into a larger social whole” (1988:219). The Oscar system takes as input an audio signal and saxophone key data; the analysis is focused on pitch material, and it outputs a MIDI signal.

Band-out-of-a-Box: An interesting example of an improvisational interactive system is provided by Belinda Thom (2000a). Her system, BoB, learns to improvise with another improviser, and attempts to emulate their style. The heart of her system is a variable-sized multinomial mixture model, which gains a knowledge of a given human's improvisation style by a process of unsupervised training.

B-Keeper: Andrew Robertson & Mark Plumbley (2007) have developed a system called B-Keeper, which is designed as a beat-tracking system for standard drum-kits. It outputs a varying tempo estimate which can be used to control the playback rate of a sample or sequence.

The Continuator: Francois Pachet's (2002) Continuator is an example of a reflexive system. The Continuator operates by sampling short phrases of input from a MIDI keyboard. It trains a Markov model on this short phrase, and then generates a response from this model. It re-uses the same rhythm as in the original phrase.
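
A minimal sketch of this style of reflexive response might look as follows. It illustrates the general recipe described above (a Markov model over the pitches of the captured phrase, with the phrase's rhythm reused) rather than Pachet's actual implementation.

    import random

    def reflexive_response(phrase):
        """phrase: list of (pitch, duration) pairs captured from MIDI input.
        Trains a first-order Markov model on the pitches, then generates a
        response that re-uses the original rhythm. A sketch of the general
        idea only, not the Continuator's implementation."""
        pitches = [p for p, _ in phrase]
        transitions = {}
        for a, b in zip(pitches, pitches[1:]):
            transitions.setdefault(a, []).append(b)
        pitch, response = random.choice(pitches), []
        for _, duration in phrase:            # same rhythm as the input phrase
            response.append((pitch, duration))
            pitch = random.choice(transitions.get(pitch, pitches))
        return response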

Omax: Developed at IRCAM, the Omax system operates in the reflexive paradigm, accompanying a live acoustic musician (Assayag et al. 2006). The Omax system uses the Factor Oracle, a type of statistical learning algorithm, to model the input and produce reflexive output.


2.3 Approaches to Improvisation

A successful performance should balance the predictable with the unpredictable.

Borgo (2002)

The Jambot is an improvisatory agent. Its goal is to provide appropriate improvised accompaniment in an ensemble setting. In order to codify a sense of musical appropriateness (or complementarity) I have drawn upon theories of music perception, discussions of improvisatory approaches by practising musicians, and my own experience as an improvising musician.

The main insight I have utilised is that in improvisation one must strike an appropriate balance between novelty and coherence. In the sections below I elaborate on this idea, and introduce some theories of music perception relating to musical expectations and ambiguity. The gist of the argument I present below is that multiple expectations give rise to ambiguities, and manipulation of the level of ambiguity present in the improvisation provides a mechanism for altering the balance between novelty and coherence.

2.3.1 Free Improvisation

In examining improvisatory processes in humans I have paid particular attention to free improvisation. Free improvisation refers to the spontaneous production of music with little or no pre-arranged musical structure. Free improvisation has played a prominent role in my practice as a musician, and is of interest due to the importance it places on ensemble interaction. I believe that lessons learnt from free improvisation have value in broader improvisatory contexts.

Thomas Nunn goes as far as to say that free improvisation “is the essence of all forms of improvisation” (1998:7). Whilst this is stating the case more strongly than I would, I nevertheless share the conviction that elements of free improvisation have broad applicability in any improvisatory setting. Moreover, the removal of all structure places these elements in bas-relief and consequently, I suggest, free improvisation provides a valuable arena for study of improvisation in general.

The Jambot improvises freely – it has no prior knowledge of the structure of the music. It operates ‘in the moment’. The temporal scope of the Jambot’s musical analysis is quite short, at most a couple of bars. It has no notion of phrasing, hyper-measure, or form. I do not mean to dismiss the importance of such structures in free improvisation – indeed in my experience they are pivotal to the success of an improvisation. However, at this stage of development the Jambot seeks only to act appropriately in a very localised sense. Higher order temporal structures are a topic for further research.

It may be worth reiterating at this point that the Jambot currently is restricted to percussive output. Whilst pitch-based improvisation is of great interest to me (and was initially part of the design brief for the Jambot) the quality of the polyphonic pitch tracking algorithms developed (described in §5.3) was not sufficient to allow for this. In this sense the Jambot is quite different from many computational improvisatory agents, which have frequently been designed for melodic jazz improvisation (over a known harmonic progression) (Thom 2000b; Biles 2002; Keller et al. 2006).

I have, however, done some experimentation with pitch-based improvisation that did not rely upon pitch-based listening – this was achieved by choosing a key and letting the human performers adapt to the Jambot’s harmonic motions. An example of this can be seen in the accompanying demonstration video RobobongoAllstars.mov.

2.3.2 Balancing Novelty and Coherence

In my experience of free improvisation there is a constant tension between maintaining a coherent foundational structure and keeping the music interesting. Free jazz saxophonist David Borgo comments:

When a listener (or performer) hears what they expect, there is a low complexity and what is called “coherence” ... and when they hear something unexpected, there is “incoherence” ... this creates a dilemma for improvisers, since they must constantly create new patterns, or patterns of patterns, in order to keep the energy going, while still maintaining the coherence of the piece. (Borgo 2004)

Part of the art of improvisation (and composition) is to strike the right balance between coherence and novelty. For the listener to perceive coherent and interesting structure, there must be some element of surprise, but not so much that the listener loses their bearings entirely.

‘good’ music ... must cut a path midway between the expected and the unexpected ... If a work's musical events are all completely unsurprising ... then the music will fulfill all of the listener's expectations, never be surprising – in a word, will be boring. On the other hand, if musical events are all surprising ... the musical work will be, in effect, unintelligible: chaotic. (Kivy 2002:74)

Figure 2.1: Coherence metre

The Jambot attempts to strike an appropriate balance between coherence and novelty by maintaining an ongoing measure of the level of coherence in the improvisation. Figure 2.1 shows a whimsical portrayal of a ‘coherence metre’, displaying a real-time measure of coherence. A target level of coherence is set either by a human or by a higher order generative process. The Jambot then takes musical actions to alter the level of coherence to maintain this target.

In order to model the coherence level of the improvisation, I have utilised notions of ambiguity and expectation. Borgo and Kivy (above) both identify expectations regarding future musical events as a key contributor to the sense of coherence of the improvisation. By creating multiple expectations, a sense of ambiguity can be created, which in turn decreases the coherence level. Conversely, by highlighting a single expectation, ambiguity is decreased and coherence increased. In the next sections I discuss some theories from music perception regarding musical expectations, and their relation to notions of musical ambiguity.

2.3.3 Expectation

The importance of taking account of the dynamic nature of musical expectations when considering musical experience has been acknowledged in the music theory literature for some time (Lerdahl & Jackendoff 1983; Meyer 1956; Narmour 1990; Bharucha 1993) but has only recently been translated into computational descriptions, and has rarely been the basis for algorithmic music systems. Meyer suggests that affect in music perception can be largely attributed to the formation and subsequent fulfilment or violation of expectations. His exposition is compelling but imprecise as to the exact nature of musical expectations and to the mechanisms of their formation.

A number of extensions to Meyer’s theory have been proposed, which have in common the postulation of at least two separate types of expectations: structural expectations of the type considered by Meyer, and dynamic expectations. Narmour’s (1990) theory of Implication and Realisation, an extension of Meyer’s work, posits two cognitive modes; one of a schematic type, and one of a more innate expectancy type. Bharucha (1993) also discriminates between schematic expectations (expectations derived from exposure to a musical culture) and veridical expectations (expectations formed on the basis of knowledge of a particular piece).

Huron (2006) has recently published an extensive and detailed model of musical expectations that builds further on this work. He argues that there are, in fact, a number of different types of expectations involved in music perception, and that indeed the interplay between these expectations is an important aspect of the affective power of the music. Huron extends Bharucha’s categorisation of schematic and veridical expectations, and in particular makes the distinction between schematic and dynamic expectations.

Dynamic expectations are constantly learned from the local context. Several authors have suggested that these dynamic expectations may be represented as statistical inferences formed from the immediate past (Huron 2006; Pearce & Wiggins 2006). Like Bharucha, Huron argues that the interplay of these expectancies is an integral part of the musical experience.
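A minimal sketch of a dynamic expectation, assuming (purely for illustration) that expectancies are the relative frequencies of events in a short sliding window of the immediate past; the window length and event representation are my own choices here, not Huron's or Pearce & Wiggins's models:

    from collections import Counter, deque

    # Dynamic expectation as statistical inference over the immediate past:
    # expectancies are just relative frequencies of recently heard events.

    class DynamicExpectation:
        def __init__(self, window=8):
            self.recent = deque(maxlen=window)  # sliding window of events

        def hear(self, event):
            self.recent.append(event)

        def expectancy(self, event):
            """P(event) inferred from the sliding window."""
            if not self.recent:
                return 0.0
            return Counter(self.recent)[event] / len(self.recent)

    model = DynamicExpectation(window=8)
    for e in ["kick", "hat", "kick", "hat", "snare", "hat", "kick", "hat"]:
        model.hear(e)
    print(model.expectancy("hat"))    # 0.5 -- strongly expected
    print(model.expectancy("crash"))  # 0.0 -- would violate expectation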

2.3.4 Ambiguity

Meyer (1956) identifies ambiguity as a mechanism by which expectations may be exploited for artistic effect. In this context ambiguity refers to musical surfaces that create several disparate expectations. The level of ambiguity in the music creates a cycle of tension and release, which forms an important part of the listening experience in Meyer’s theory. An ambiguous situation creates tension – the resolution of which is part of the art of composition.

Ambiguity is important because it gives rise to particularly strong tensions and powerful expectations. For the human mind, ever searching for the certainty and control which comes with the ability to envisage and predict, avoids and abhors such doubtful and confused states and expects subsequent clarification. (Meyer 1956:27)

Temperley notes that ambiguity can arise as the result of multiple plausible analyses of the musical surface:

Some moments in music are clearly ambiguous, offering two or perhaps several analyses that all seem plausible and perceptually valid. These two aspects of music – diachronic processing and ambiguity – are essential to musical experience. (Temperley 2001:205)

I have been discussing ambiguity as inversely related to coherence. However, the notion of ambiguity has an extra nuance that is worth mentioning. Certainly, an unambiguous (musical) situation should be highly coherent. A high level of ambiguity, however, should not be confused with vagueness; where vagueness implies a lack of any strong suggestion, ambiguity implies a multiplicity of strong suggestions.

2.3.5 Multiple Parallel Analyses

The concept of systems of musical analysis that yield several plausible results has been posited by a number of authors as a model of human musical cognition. Notably, Jackendoff (1992:140) proposed the multiple parallel analysis model. This model, which was motivated by models of how humans parse speech, claims that at any one time a human listening to music will keep track of a number of plausible analyses in parallel.

In a similar vein, Huron (2006) describes the competing concurrent representation theory. He goes further to claim that, more than just a model of music cognition, “Competing concurrent representations may be the norm in mental functioning” (2006:108).

2.3.6 Ambiguity in Multiple Parallel Representations

An analysis system that affords multiple interpretations provides a natural mechanism for the generation of ambiguity. In discussing their Generative Theory of Tonal Music (GTTM), Lerdahl & Jackendoff observe that their “rules establish not inflexible decisions about structure, but relative preferences among a number of logically possible analyses” (1983:42), and that this gives rise to ambiguity. In saying this Lerdahl & Jackendoff are not explicitly referencing a cognitive model of multiple parallel analyses; the GTTM predates Jackendoff’s construction of this model, and does not consider real-time cognition processes. Indeed it was considerations of the cognitive constraints involved in resolving the ambiguities of multiple interpretations that led Jackendoff to conclude that the mind must be processing multiple analyses in parallel (Jackendoff 1992).

Temperley (2001:219) has revisited the preference rule approach to musical analyses in a multiple parallel analyses model:

The preference rule approach [is] well suited to the description of ambiguity. Informally speaking, an ambiguous situation is one in which, on balance, the preference rules do not express a strong preference for one analysis over another ... At any moment, the system has a set of “best-so-far” analyses, the analysis with the higher score being the preferred one. In some cases, there may be a single analysis whose score is far above all others; in other cases, one or more analyses may be roughly equal in score. The latter situation represents synchronic ambiguity.

In a similar spirit, Huron (2006:109) argues that multiple parallel analyses (or competing concurrent representations, as he calls them) must all be generating expectations, and consequently must give rise to the kind of expectational ambiguity that was argued above to play a central role in producing musical affect.
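One simple way to make this notion operational (a sketch under my own assumptions, not Temperley's or Huron's formulation) is to treat the normalised scores of the competing analyses as a probability distribution and take its normalised entropy as the ambiguity level: one clearly dominant analysis gives a value near zero, several near-equal analyses give a value near one.

    import math

    def ambiguity(scores):
        """Normalised entropy of competing analysis scores, in [0, 1].

        scores: non-negative plausibility scores of the "best-so-far"
        parallel analyses. 0 = one clear winner (unambiguous);
        1 = all analyses equally plausible (maximal synchronic ambiguity).
        """
        total = sum(scores)
        probs = [s / total for s in scores if s > 0]
        if len(probs) <= 1:
            return 0.0
        entropy = -sum(p * math.log(p) for p in probs)
        return entropy / math.log(len(probs))

    print(ambiguity([9.0, 0.5, 0.5]))  # ~0.36: one analysis dominates
    print(ambiguity([5.0, 4.8, 5.1]))  # ~1.0: strongly ambiguous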

2.4 Musical Pulse

An important skill in ensemble improvisation is an ability to play in time. Terms relating to musical time, such as beat and tempo, are often explained with reference to a regular pulse. The sections below outline some pulses commonly discussed in the literature: the tactus, the tatum, and the density referent.

2.4.1 The Tactus

The notion of a referent pulse level is a common construct in analyses of musical rhythm (whether performed computationally for beat tracking, or manually for musicological analysis), and this referent level is often identified as being the beat, or tactus. The tactus is generally described as being the most salient pulse level (Parncutt 1994), or the ‘toe-tapping’ pulse – the tempo at which a listener would tend to tap their toe, or clap their hands along to (Newman 1995).

The listener tends to focus primarily on one (or two) intermediate level(s) in which the beats pass by at a moderate rate. This is the level at which the conductor waves his baton, the listener taps his foot, and the dancer completes a shift in weight. Adopting a Renaissance term, we call such a level the tactus. (Lerdahl & Jackendoff 1983)

From a computational standpoint the notion of tactus presents some difficulties, particularly when trying to estimate the tactus in a musical audio stream. The problem stems from the fact that, like metre, the tactus is a perceptual construct, and is not even uniformly perceived between different people. As an example, each year a computational tempo tracking contest is run by the Music Information Retrieval Evaluation eXchange (MIREX) community, and the ground-truth tactus values for the sample audio tracks (which are determined by measuring the rate at which a group of test-subjects tap along to a given piece) are generally given as two values with relative probabilities (corresponding to the top two choices amongst the test-subjects and their relative frequencies). By and large the different choices of tactus are closely related, having tempi in a ratio of 2:1, but nevertheless this highlights the difficulties that can be expected in modelling the tactus when it is defined as a perceptual quality rather than a surface property of the music.

2.4.2 The Tatum

An alternative temporal reference pulse dubbed the tatum was suggested by Bilmes (1993) in his study of expressive timing. The tatum, a whimsical contraction of temporal atom in honour of Art Tatum, is the fastest commonly occurring pulse evident in the surface of the music. The term was conceived of in a computational context; Bilmes was modelling microtiming and expressive tempo variations in percussion performances, and needed a quantized unit of time in which to measure the rhythms. This is reminiscent of the step sequencer metaphor prevalent in computer music, where the tatum corresponds to the quantization period of the step.

The tatum is the high frequency pulse or clock that we keep in mind when perceiving or performing music. The tatum is the lowest level of the metrical hierarchy. We use it to judge the placement of all musical events. (ibid:21)

The tatum is not necessarily constant through the course of a piece of music (ibid:109); the tatum may change to reflect a different predominant level of subdivision within a single piece.
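For illustration, a crude computational estimate of a tatum-like pulse (my own sketch, not Bilmes's method) might search for the largest period of which most inter-onset intervals are near-integer multiples:

    def estimate_tatum(onset_times, tolerance=0.05):
        """Crude tatum estimate from onset times (in seconds).

        Scores each candidate period by how many inter-onset intervals
        are close to an integer multiple of it, preferring the largest
        period that fits so that unheard subdivisions are not chosen.
        """
        iois = [b - a for a, b in zip(onset_times, onset_times[1:])]
        # Candidate periods: each distinct IOI and its half.
        candidates = {d for ioi in iois for d in (ioi, ioi / 2)}

        def fit(period):
            return sum(1 for ioi in iois
                       if round(ioi / period) >= 1
                       and abs(ioi / period - round(ioi / period)) < tolerance)

        return max(candidates, key=lambda p: (fit(p), p))

    # Eighth notes at 120 bpm with one quarter-note gap: tatum = 0.25 s.
    print(estimate_tatum([0.0, 0.25, 0.5, 0.75, 1.25, 1.5]))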

2.4.3 Referent Pulse Level

In their theory of dynamic attention, Jones & Boltz posit a referent pulse level and describe it as an “anchor or referent time level for the perceiver” (1989:470), which other periodicities are interpreted in terms of, either as subdivisions (for faster pulses) or groupings (for slower pulses). The suggestion is that the perception of pulses above and below this reference pulse operates via different cognitive mechanisms:

[Jones & Boltz’s] temporal perception model relates smaller and larger periods of attention to a middle level that she refers to as the referent time period. The temporal referent anchors our attentional process, and mediates between analytic attending (awareness of local details) and future-oriented attending (awareness of more global processes and goals). (London 2004:18)

Although Jones’s original exposition does not explicitly mention the tactus, London goes on to identify this referent level as being the tactus:

In metric terms, the beat or tactus serves as the referent level. Jones’ model thus suggests that beat subdivisions are the product of analytic attending, that is we grasp them as fractions of a larger span. Conversely, larger levels – measures – are yoked to expectations that derive from the referent level, such as anticipating that an important event will occur “in two beats” or “every three beats”. (London 2004:18)

Some doubt is cast on this identification by an examination of Western African percussion music, in which the tactus does not function in the same manner as in Western music (Arom 1991:206); if there were a fundamental split in the perceptual processes above and below the tactus, then one might expect the tactus to function universally across musical cultures.

2.4.4 Pulsation

Analysts of Western African music have found a more useful concept variously known as the Common Fast Beat (Kauffman 1980:396), density referent (Hood 1971:114) or pulsation (Arom 1991:202), which plays the role of a referent pulse level in terms of which all other musical events are measured.

Pulsations are an uninterrupted sequence of reference points with respect to which rhythmic flow is organised. All the durations in a piece ... are defined in relationship to the pulsation. (Arom 1991:202)

It seems almost undebatable that most African rhythms can be related to a fast regular pulse. Density Referent seems to be the term that is increasingly used to identify these fast pulses ... Musical scholars are probably more aware of the density referent than are performers of the music, but the absolute accuracy of rhythmic relationships in performance seems to attest to at least an unconscious measuring of fast, evenly paced units. (Kauffman 1980:407)

The density referent seems more likely to be amenable to computational estimation than the tactus, as it is not a perceptual quality, but relates to the surface properties of the music:

the density referent ... can be used to study and understand temporal elements that would be rendered ambiguous to more subjective concepts of beat. (Kauffman 1980:396)

The pulsation may differ from the tatum. The idea of the tatum is that it is the fastest pulse which actually occurs in the surface of the music, and that all other pulses (and inter-onset-intervals) in the music are some multiple of the tatum – it is fundamentally the unit of time in which the music is measured. The pulsation similarly sets up a grid in terms of which all other musical events are to be measured. However, unlike the tatum, the pulsation is not necessarily the fastest pulse – subdivisions of it may appear in the surface of the music. Whilst the tatum may change during a piece to reflect a different predominant subdivision (Bilmes 1993:109), the pulsation should remain constant (Arom 1991:202).

2.5 Musical Metre

Musical metre is frequently described as the pattern of strong and weak beats in a musical stream (Cooper & Meyer 1960; Lerdahl & Jackendoff 1983; Large & Kolen 1994). From the point of view of music psychology, metre is understood as a perceptual construct, in contrast to rhythm, which is a phenomenal pattern of accents in the musical surface (Lerdahl & Jackendoff 1983; London 2004).

Metre is inferred from the surface rhythms, and possesses a kind of perceptual inertia. In other words, once established in the mind, a metrical context tends to persist even when it conflicts with the rhythmic surface, until the conflicts become too great:

Once a clear metrical pattern has been established, the listener renounces it only in the face of strongly contradicting evidence. (Lerdahl & Jackendoff 1983:17)

The Jambot implements perceptual inertia for its beat tracking (§6.4.6) and metre induction (§6.8.3).
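A minimal sketch of perceptual inertia (illustrative only; the margin value and scoring scheme are assumptions, not the Jambot's implementation): the current metrical hypothesis is retained unless a rival analysis outscores it by a clear margin.

    # Perceptual inertia as hysteresis: keep the established metrical
    # hypothesis unless a rival outscores it by a clear margin.
    # The margin and scoring scheme are illustrative assumptions.

    def update_hypothesis(current, scored_hypotheses, margin=0.2):
        """Return the metrical hypothesis to adopt this frame.

        scored_hypotheses: dict mapping hypothesis -> evidence score.
        """
        best = max(scored_hypotheses, key=scored_hypotheses.get)
        if best == current:
            return current
        # Switch only in the face of strongly contradicting evidence.
        if scored_hypotheses[best] > scored_hypotheses.get(current, 0.0) + margin:
            return best
        return current

    print(update_hypothesis("4/4", {"4/4": 0.55, "3/4": 0.60}))  # stays 4/4
    print(update_hypothesis("4/4", {"4/4": 0.30, "3/4": 0.60}))  # switches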

Cooper & Meyer define metre as “the measurement of the number of pulses between more or less regularly recurring accents” (1960:4), whilst Yeston describes metre as “an outgrowth of the interaction of two levels — two differently rated strata, the faster of which provides the elements and the slower of which groups them” (1976:66).

A related but more elaborate description of metre is given by Lerdahl & Jackendoff (1983) in their Generative Theory of Tonal Music (GTTM). They propose a representation of metre which reflects a hierarchy of timescales. In this representation, a beat of any given level is assigned a strength according to the number of levels in the hierarchy that contain this beat.
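This representation is easily made concrete. The sketch below computes GTTM-style metrical weights for a bar of sixteen subdivisions under a duple hierarchy; the choice of pulse levels is an illustrative assumption:

    # GTTM-style metrical weights: a position's strength is the number
    # of pulse levels in the hierarchy that contain it. The levels here
    # (16 subdivisions per bar, duple grouping) are an illustrative choice.

    def metrical_weights(steps=16, level_periods=(16, 8, 4, 2, 1)):
        """Weight of each subdivision = number of levels containing it."""
        return [sum(1 for period in level_periods if step % period == 0)
                for step in range(steps)]

    print(metrical_weights())
    # [5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1]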

2.5.1 Metre as an Expectational Framework

Within the music perception field metre is generally considered as an expectational framework against which the phenomenal rhythms of the music are interpreted (London 2004; Huron 2006). Jones (1987), for example, argues that metre should be construed as a cognitive mechanism for predicting when salient musical events are expected to happen. This description of metre has been widely accepted within the music psychology community (Huron 2006; Large 1994; London 2004).

Metrical structure provides listeners with a temporal framework upon which to build expectations for events. (Large & Kolen 1994)

London adopts this view of metre, and develops it further into a description of metre similar to the GTTM in that it involves a hierarchy of pulse levels, but different in that it takes the tactus pulse as the primary reference level, and posits different cognitive mechanisms for the perception of levels above the tactus (grouping) and levels below the tactus (subdivision), as discussed in §2.4.3.

London’s model of metre is rooted in Jones & Boltz’s (1989) theory of dynamic attention. This theory contends that a listener’s attention varies through time according to the metre, so that attention is greatest at metrically strong points. The Jambot implements a similar form of attentional modulation to help filter spurious onsets (§6.7).
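A minimal sketch of such attentional modulation (the weighting scheme and threshold are my own assumptions, not the Jambot's actual parameters) scales each detected onset's confidence by the metrical strength of its position, so that weak-beat detections need higher confidence to survive:

    # Attentional modulation as an onset filter: detections at metrically
    # strong positions are trusted more than those at weak positions.
    # Weighting scheme and threshold are illustrative assumptions.

    def filter_onsets(onsets, weights, threshold=0.5, attention=0.5):
        """Keep onsets whose metrically modulated confidence clears the bar.

        onsets    : list of (step, confidence) pairs from a detector.
        weights   : metrical weight per subdivision of the bar.
        attention : 0 = ignore metre; 1 = fully metre-driven attention.
        """
        max_w = max(weights)
        kept = []
        for step, confidence in onsets:
            modulation = (1 - attention) + attention * weights[step] / max_w
            if confidence * modulation >= threshold:
                kept.append((step, confidence))
        return kept

    weights = [5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1]
    detections = [(0, 0.6), (3, 0.6), (8, 0.6), (13, 0.9)]
    print(filter_onsets(detections, weights))
    # Keeps (0, 0.6), (8, 0.6) and (13, 0.9); drops the weak-beat (3, 0.6).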

Huron similarly treats metre as a distribution of expectational strengths across beat-classes, and claims that metric strength correlates with the probability of an onset (2006:179). Huron’s model of music perception is premised on the notion that much (if not all) of our musical understanding is derived from learned expectations, due to long exposure to statistical regularities in the music that we listen to.

The description of metre as an interaction between two (or more) isochronous pulses is not suitable for some styles of music. The recognition of non-isochronous pulses has become common in more recent writings in music perception (Huron 2006; London 2004). London insists that the unequal beat periods must be rationally related, and indeed that such pulses arise as uneven groupings of a faster isochronous pulse (2004:100). However, Moelants (1999) observes that Bulgarian dance music contains non-isochronous pulses with irrational ratios of duration, and Huron (2006:188) suggests the same of Viennese Waltzes.

Another example of possibly irrational subdivisions of a pulse is the case of swing. Although swing is often described as arising from a triplet subdivision of the tactus, actual swing timings occupy a spectrum, and may subdivide the tactus arbitrarily (but consistently) (Lindsay & Nordquist 2007).

Both Huron and London subscribe to the view of metre as an expectational framework, in which temporal positions of beats need only form some predictable pattern. Isochrony yields a particularly simple predictive mechanism. For London this is the only mechanism at work in metre perception, but for Huron any (even irrational) subdivision of the bar provides a predictable template given sufficient exposure.

The general view of metre as a predictive template fits a wide variety of musical styles, encompassing both metres that arise out of isochronous pulses, and those that are better described as an irrational (but consistent) subdivision of a single isochronous pulse (Collins 2006a:29).

The target musical genres for this project are popular contemporary Western dance styles, such as Electronic Dance Music (EDM), jazz, funk and pop. The (less general) description of metre as arising from a referent pulse level, and a periodic patterning of this pulse into strong/weak beats, appears to be broadly applicable in these styles. The Jambot implements this view of metre. However, conceiving of metre as an expectational framework suggests a mechanism for manipulating ambiguity in improvisation.

2.6 Utilising Metrical Ambiguity

The view of metre as an expectational framework creates the possibility of manipulating the level of metric ambiguity as an improvisatory device. As discussed in §2.3.2, the Jambot seeks to maintain a target level of ambiguity in the ensemble improvisation. The Jambot does this by simultaneously reinforcing a multiplicity of metrical possibilities when it believes the coherence level to be too high. The multiplicity of expectations creates ambiguity, which decreases the coherence. Conversely, if the coherence level is assessed to be too low (i.e. the improvisation has become too chaotic), then the Jambot will strongly reinforce the most plausible metric possibility to lower the ambiguity level.
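The control logic amounts to a simple feedback loop. The sketch below captures the idea only; the function name and action encoding are invented for illustration and do not reflect the Jambot's actual API:

    # Coherence-target control loop -- a sketch of the idea only; the
    # function name and action encoding are invented for illustration.

    def choose_action(coherence, target, metre_analyses):
        """Pick an action nudging the coherence level toward the target.

        metre_analyses: metric hypotheses ranked by plausibility, as
        produced by the multiple parallel analyses of the input.
        """
        if coherence > target:
            # Too predictable: reinforce several metric possibilities at
            # once, raising ambiguity and lowering coherence.
            return ("reinforce", metre_analyses[:3])
        # Too chaotic: strongly reinforce only the most plausible metre,
        # lowering ambiguity and restoring coherence.
        return ("reinforce", metre_analyses[:1])

    ranked = ["4/4 at 120 bpm", "2/4 at 240 bpm", "12/8 at 120 bpm"]
    print(choose_action(coherence=0.9, target=0.6, metre_analyses=ranked))
    print(choose_action(coherence=0.3, target=0.6, metre_analyses=ranked))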

The Jambot achieves this via two different mechanisms. One mechanism makes use of different rhythmic variables (timbre, dynamics, duration) to give disparate signals regarding the underlying metre – this technique is discussed in §7.3.7. The second mechanism utilises a generative process that I have dubbed the Chimaera, which will be described in §8.3.10. The Chimaera draws upon the multiple parallel analyses of metre performed by the Jambot, and seeks to take musical actions to manipulate the level of ambiguity in the music by selectively highlighting particular analyses.

Chapter 3

Methodology

3.1 Introduction

This research project consisted of designing, implementing and evaluating an interactive music system according to a design research methodology (§3.5.1). In this chapter I describe the mechanics of this process and the theoretical perspective and epistemology framing the research, and argue that this framework is a valid approach to investigating the research questions posed.

The design brief for the project was to create a robust concert-ready interactive music system that could listen to a raw audio stream, and produce real-time improvised rhythmic accompaniment. This research is formally categorised as creative practice research. The practice is computational instrument building, and the creative output is the Jambot itself. In this chapter I argue that:

• The Jambot, as a complete operational interactive music system, in and of itself represents a substantial contribution to knowledge in the engineering domain.

• Additionally, the process of developing and evaluating the Jambot has generated knowledge contributions in the science domain and the creative arts domain.

• The Jambot as a computational artefact is a mechanism for the generation of further knowledge in both science and creative arts.

This project has been interdisciplinary, straddling science, engineering and creative arts, with new knowledge generated in all three domains. Traditionally these different domains have adopted quite disparate epistemologies. In this chapter I argue that a particular form of pragmatist epistemology (§3.3.2), known as Ecological Epistemology (§3.3.3), provides a view of knowledge compatible with all three of these domains.

Theories of knowledge are important to this project for several reasons, beyond purely methodological concerns. This research delves into artificial intelligence and cognitive psychology. Both of these fields model structural aspects of knowledge. I have adopted a theoretical perspective on knowledge called Situated Cognition (§3.4.3). This perspective gives a structural description of knowledge compatible with the abstract mandates of Ecological Epistemology. The architecture of the Jambot has been designed according to the principles of Situated Cognition.

3.2 Research Framework

I am adopting Michael Crotty’s (1998) framework for discussing the research methods. He delineates a hierarchy of four components: Epistemology, Theoretical Perspective, Methodology and Method. Crotty (1998:3) defines these terms as follows:

• Epistemology: the theory of knowledge embedded in the theoretical perspective and thereby in the methodology.

• Theoretical Perspective: the philosophical stance informing the methodology and thus providing a context for the process and grounding its logic and criteria.

• Methodology: the strategy, plan of action, process or design lying behind the choice and use of particular methods and linking the choice and use of methods to the desired outcomes.

• Method: the techniques or procedures used to gather and analyse data related to some research question or hypothesis.

This framework is depicted in Figure 3.1, together with the particular approaches employed in this project. The approaches I have used are:

• The epistemology that I have adopted is Ecological Epistemology (§3.3.3) – in which knowledge is viewed as an interaction between the knower and the known.

• The theoretical perspective I employ is the Brunswikian (§3.4.1) perspective of Situated Cognition (§3.4.3), which views perception as an interaction between the perceiver and the environment.

• The methodology is Design Research (§3.5.1).

• A number of methods were employed, including reflective practice, software design, computational modelling, interaction design and algorithmic composition.

In the sections below I describe each of these components, which together form the theoretical framework for this research. The most important conceptual thread running through all of these components is an emphasis on the importance of context.

Figure 3.1: Theoretical Framework – adapted from (Crotty 1998:4)

3.3 Epistemology

Epistemology is the study of knowledge. Epistemology is pivotal to any research project, as it justifies why the research methods used can lead to credible new knowledge. Epistemology is important for this project in a number of respects. The research goals of this project include finding new insights into both artificial intelligence and human cognition. In both of these fields the study of knowledge is a central concern of the discipline. So beyond methodological concerns, theories of knowledge play a mechanistic and structural role in this work.

The nature of this research project is interdisciplinary. On the one hand there are contributions to knowledge in a variety of scientific fields – computational auditory scene analysis, artificial intelligence, psychology. On the other hand, this project is firmly rooted in the creative arts, with knowledge contributions in interactive music systems, applied music theory, and the phenomenology of human-machine interaction. This interdisciplinary nature has highlighted some epistemological issues, since the sciences and the creative arts frequently take differing epistemological stances.

3.3.1 Episteme, Techne and Praxis

Aristotle identified several categories of knowledge, including episteme, techne and phronesis. Flyvbjerg (2001:57) summarises these categories as:

• Episteme: Scientific Knowledge. Universal, invariable, context-independent. Based on general analytic rationality. The original concept is known today from the terms ‘epistemology’ and ‘epistemic’.

• Techne: Craft/Art. Pragmatic, variable, context-dependent. Oriented towards production. Based on a practical instrumental rationality governed by a conscious goal. The original concept appears today in terms such as ‘technique’, ‘technical’ and ‘technology’.

• Phronesis: Ethics. Deliberation about values with reference to praxis. Pragmatic, variable, context-dependent. Oriented towards action. Based on practical value-rationality.

These categories of knowledge were all regarded by Aristotle as “intellectual virtues, of which epistemic science, with its emphasis on theories, analysis, and universals was but one, and not even the most important” (Flyvbjerg 2001:57).

Corresponding to these three categories of knowledge, Aristotle described three basic intellectual activities of mankind as theoria, poiesis and praxis (Hickman 1992:99). Modern usages of these terms have sometimes blurred the distinction between the intellectual activity of praxis and the corresponding intellectual virtue of phronesis (Flyvbjerg 2001; Callaos 2011; Greenwood & Levin 2005:51). I will follow this convention and adopt a trifold categorisation of knowledge into episteme, techne and praxis.

Science and Episteme

Mainstream contemporary thought in science holds episteme to be the only legitimate form of knowledge (Goldman 2004). From this perspective the products of technology, whilst vital to the actual practice of science, are not considered themselves to constitute knowledge (McCarthy 2006). Goldman (2004) argues that this perspective is the result of a long history of philosophical prejudice towards Platonic ideals equating rationality with high culture. Dewey suggests that this prejudice is primarily historical:

If one could get rid of one’s traditional logical theories and set to work afresh to frame a theory of knowledge on the basis of the procedure of the common man, the moralist, and the experimentalist, would it be the forced or the natural procedure to say that the realities which we know, which we are sure of, are precisely those realities that have taken shape in and through the active procedures of knowing? (1973:213)

Engineering and Techne

The nascent field of the Philosophy of Engineering rejects the primacy of epistemic knowledge, and argues that techne should be considered as equally legitimate. The knowledge categories of episteme, techne, and praxis are also sometimes described in terms of Ryle’s (1949) distinction between know-that and know-how, and Polanyi’s (1974) notion of tacit knowledge respectively (Callaos 2011). McCarthy suggests that engineering know-how should not be considered as purely a mechanism for enabling scientific know-thats:

engineering can be seen as delivering knowledge by a much more direct route than by aiding science. There is a useful distinction in philosophy between ‘knowing that’ and ‘knowing how’ ... Engineering is ‘know-how’ ... [and] as a consequence yields highly successful knowledge about how to control materials and processes to bring about desired results. It is a way of getting to the nature of things – a voyage of discovery as much as science is. Hence engineering provides a useful case study for philosophers inquiring about the status of human knowledge. (2006)

Goldman (2004) characterises the distinction between episteme and techne thus: episteme is fundamentally concerned with the necessary whilst techne is fundamentally concerned with the contingent, and locates engineering knowledge in the realm of techne:

Engineering problem solving employs a contingency based form of reasoning that stands in sharp contrast to the necessity based model of rationality that has dominated Western philosophy since Plato and that underlies modern science. The concept ‘necessity’ is cognate with the concepts ‘certainty’, ‘universality’, ‘abstractness’ and ‘theory’. Engineering by contrast is characterised by wilfulness, particularity, probability, concreteness and practice. The identification of rationality with necessity has impoverished our ability to apply reason effectively to action. This article locates the contingency based reasoning of engineering in a philosophical tradition extending from pre-Socratic philosophers to American pragmatism, and suggests how a contingency based philosophy of engineering might enable more effective technological action. (Goldman 2004)

Techne is also distinct from praxis. Where praxis knowledge is often tacit, techne is explicit and methodical. Innis summarises the pragmatist position on technology similarly:

Technology ... is a reflective and methodical discipline. Techniques ... are habitual and traditional ways of dealing with things. They involve, it is clear, both observational and material activities. Technology ... raises them to explicitness and subjects them to methodical control. (Innis 2003:50)

Creative Arts and Praxis

In the creative arts the recognition of praxis as a legitimate form of knowledge is widespread (Crouch 2007). Artistic knowledge is accepted as residing within the artist, embodied and personal, resultant from a history of engagement with artistic practice. Epistemic knowledge, in my experience, is often regarded with suspicion or disdain within the creative arts – or at least viewed as an unlikely product of creative arts research.

An Interdisciplinary Approach

This project is interdisciplinary, straddling science, engineering and the creative arts, with knowledge claims in all three of these domains. In terms of the preceding discussion, this project has generated episteme, techne and praxis. Consequently the default perspectives of either science or the creative arts were unsuitable, given their prejudice towards one or other of these knowledge categories.

In the next section I discuss the pragmatist epistemology, termed Epistemic Pragmatism, that I have adopted for this research. Epistemic Pragmatism embraces episteme, techne and praxis all as legitimate forms of knowledge. I situate Epistemic Pragmatism in contrast with Objectivism and Subjectivism, which I suggest would be the natural epistemologies had this project been solely concerned with science or creative arts respectively.

After discussing Epistemic Pragmatism I introduce a more particular pragmatic perspective, that of Ecological Epistemology. Ecological Epistemology not only embraces all of these knowledge perspectives, but goes further to describe a cybernetic inter-relationship between them.

3.3.2 Objectivism, Subjectivism and Pragmatism

The light dove, cleaving the air in her free flight, and feeling its resistance, might imagine that its flight would be still easier in empty space.

(Kant 1964:47)

Pragmatism is a philosophy in which context plays a central role in both epistemology and ontology, and in which “the worth of a proposition or theory is to be judged by the consequences of accepting the proposition or theory” (Marshall et al. 2005). In this thesis I adopt a pragmatist position, particularly as articulated by John Dewey (1973).

Pragmatism entails a particular epistemology, sometimes termed Epistemic Pragmatism (Long 2002), in which knowledge is viewed as an interaction between the knower and the known (Dewey 1949). In the sections below I contrast Epistemic Pragmatism with the epistemologies of Objectivism and Subjectivism.

Objectivism

Within science the dominant epistemology is Objectivism (Crotty 1998:18), so much so that epistemology is frequently not discussed within scientific theses. A key tenet of Objectivism is the separation of the observer from the observed in experimentation (Gray & Malins 2004:19). According to Objectivism, knowledge can be thought of as referring to something external to the knower. Of the categorisation of knowledge into episteme, techne and praxis discussed above, Objectivism recognises only episteme (Schon 1995:34).

Subjectivism

In the creative arts a subjectivist epistemology is frequently employed (Gray & Malins 2004:21). Subjectivism takes exception to the objectivist position of separating the observed from the observer. Subjectivism, in contrast, recognises the role of the observer in observation (Crotty 1998:9). From a methodological standpoint this translates to the recognition that a researcher’s personal biases will affect the results of the research.

More broadly, Subjectivism embraces the role of the researcher as a participant in the research. As such, Subjectivism resonates with creative arts researchers, as in creative arts research the knowledge and skills of the researcher are often inseparably interwoven into both the method and the outcomes of the research.

As a theory of knowledge, Subjectivism makes the fundamental claim that knowledge requires a knower. In Subjectivism praxis is most highly regarded, with episteme often given little or no accord.

Epistemic Pragmatism

Epistemic Pragmatism sits between Objectivism and Subjectivism, inheriting from Objectivism a commitment to empirical experimentation, and from Subjectivism a regard for the importance of context and the role of experience in knowledge.

Epistemic Pragmatism regards episteme, techne and praxis all as legitimate forms of knowledge (Greenwood & Levin 2005:53). The embrace of techne in particular differentiates the Deweyian pragmatist viewpoint from either Objectivism or Subjectivism.

According to Dewey, the fundamental path to knowledge is through experimentation, and experimentation is viewed as a process of design, engineering and exploration, rather than simply one of validation:

Pragmatists generally and Dewey especially were enthusiasts for science, the scientific method and the application of scientific reasoning to all aspects of life. But for Dewey, the key to understanding science lay in engineering! Dewey argued that science was a form of engineering, that it was only hypothetically abstract, universal, necessary and certain. In truth science was as value laden and ‘interested’, hence as contextual, as engineering. (Goldman 2004)

3.3.3 Ecological Epistemology

Epistemology is that science whose subject matter is itself.

Bateson (1979)

Within Epistemic Pragmatism a more particular view of knowledge, called Ecological Epistemology, has been championed by Bateson (1972). The primary tenets of Ecological Epistemology are that knowledge is an interaction between the knower and the known, and has a dynamic structure comprised of a complex feedback loop between the observer and the environment.

Ecological Epistemology is often attributed to Gregory Bateson (Frielick 2004; Harries-Jones 1995:8), although Bateson himself used the term Recursive Epistemology (Bateson & Bateson 1987). This view of knowledge has a longer heritage, however; midway through the twentieth century Dewey (1949) gave a similar view in his pragmatic description of knowledge as a transaction between the knower and the known, and Egon Brunswik (1956) coined the terms ‘ecological validity’ and ‘representative design’ in describing a new methodology in psychology emphasising the importance of context, or situation, to cognition and perception.

Like Epistemic Pragmatism, Ecological Epistemology views knowledge as an interaction between the knower and the known. Ecological Epistemology goes further, however, to give a structural description of this interaction as a cybernetic feedback loop. In this description episteme, techne and praxis are intertwined. Callaos (2011) describes the process of engineering research similarly:

scientific and engineering activities are related to each other and integrated in a more comprehensive whole, in which Science provides the “know-that”, the propositional knowledge that engineering activities and thinking need as one of its inputs, and the processes and technologies produced by Engineering support scientific activities and provide a rational scientific progress and a possible ground for philosophical reflections with regards to the epistemic stand of scientific theories. According to this perspective, scientific and engineering activities might be related through (positive and negative) feedback and feedforward loops, in order to generate mutual synergies where the whole would be greater than the sum of its parts. (Callaos 2011)

Ecological Epistemology is unusual in its level of description of knowledge, which goes beyond the metaphysical level of most epistemologies to describe the actual structure of knowledge in a mechanistic manner:

Bateson uses the term ‘epistemology’ in an idiosyncratic manner. Philosophers define the term as a study of the theory of knowledge, or that branch of philosophy which investigates the origins, structure, methods and validity of knowledge. By contrast, Bateson means by the term the examination of knowledge in an operational sense: the ‘how’ of knowing and deciding, rather than the ‘what’ of the origins and validity of knowledge. (Harries-Jones 1995:8)

I commented earlier on the importance of theories of knowledge to the fields of artificial intelligence and cognitive psychology. These fields explicitly model the structure of knowledge. It is not surprising, then, that ideas relating to Ecological Epistemology have appeared in these areas, given the emphasis Ecological Epistemology places on structural (and possibly model-able) aspects of knowledge. Indeed, ecological ideas can be found in a wide variety of fields including artificial intelligence (Clancey 1997), psychology (Brunswik 1955; Gibson 1979), cognitive biology (Varela & Maturana 1980), design (Norman 1990), computational neuroscience (Hawkins & Blakeslee 2004) and robotics (Brooks 1991b).

3.3.4 The Role of the Artefact

The Jambot is a technological artefact, representing new techne knowledge. Beyond this, the Jambot is also an instrument for the generation of new knowledge in both episteme and praxis.

In mainstream contemporary scientific thought, technological artefacts are not generally considered to constitute knowledge. In the creative arts, by way of contrast, the recognition of creative artefacts as embodying knowledge is widespread, though the recognition of technological artefacts as knowledge is less common. In the preceding sections I have argued that, from a Pragmatist perspective, technological artefacts constitute techne knowledge.

Ecological Epistemology posits an interplay between techne, episteme and praxis in which “the recognition of the centrality of knowledge leads to conceiving technology as more than artifact, and as more than technique and process” (Herschbach 1995:32). To use Dewey’s notion of instrumentalism (a term he used in place of pragmatism), a technological artefact is an instrument for the generation of epistemic and practical knowledge, as well as embodying technical knowledge (Hickman 1992:31).

The Jambot, as a case in point, provides an experimental mechanism for evaluating the constituent machine listening algorithms. It also provides an apparatus for developing praxis-based knowledge of effective techniques for live performance with an interactive music system.

From this perspective, the Jambot takes over the role of experimental data in a science project, or auto-ethnographical reflection in a creative arts project. The methodological purpose of these types of data is to lend credibility to the research from the perspective of a third party. As a technological artefact, the Jambot fulfills this role by allowing a third party to explore the knowledge claims in a context of interest to them. From an ecological perspective this provides a powerful complement to more traditional forms of research data.

The next sections describe the development of ecological ideas within the fields of psychology and artificial intelligence. The starting point is the methodological theories of Egon Brunswik (1955) in psychology, which influenced Gibson’s (1979) theory of Ecological Psychology. Another line of development is Maturana & Varela’s (1980) concept of structural coupling. Finally, these elements combine in the perspective of Situated Cognition (Clancey 1997), which has directly informed the architectural design of the Jambot.

3.4 Theoretical Perspective

The theoretical perspective I employ is Situated Cognition. Situated Cognition is a perspective on human cognition and perception that emphasises the importance of context in these processes. The key points of this perspective are:

• It is important to develop prototypes in realistic contexts.

• Computer models of perception provide indirect evidence regarding human perceptual processes and architectures.

• Feedback is an important facet of a perceptual architecture, both within components of a perceptual architecture, and between a perceiver and the environment.

Situated Cognition has developed from a long heritage of ecological theories within psychology, education, biology and artificial intelligence. In this section I chart some of these influences, beginning with Brunswik’s notion of Representative Design, through Gibson’s theory of Ecological Psychology, and concluding in the perspective of Situated Cognition.

3.4.1 The Brunswikian Perspective

Egon Brunswik was a psychologist working in the first half of the twentieth century. His theories were revolutionary at the time, but failed to gain popular support in the field. Today he remains in relative obscurity, although awareness of his ideas appears to be increasing (Hammond & Stewart 2001:10). Cooksey (2001:232) identifies three major strands to Brunswik’s thought:

1. The Lens Model (§3.4.1): a description of perception as a feedback loop between perceiver and environment.

2. Representative Design (§3.4.1): a methodological approach to psychological experimentation that emphasises the importance of experimenting in a realistic context.

3. Probabilistic Functionalism (§3.4.1): a view of cognition that casts perception as a process of statistical inference.

I have adopted these three ideas into the research design for this project:

• The role of feedback in the Lens Model of perception has been influential in the design of the Jambot’s architecture (§4.2).

• The methodological assertions of Representative Design have shaped the design of the research, both in endeavouring to have the Jambot operate in realistic musical contexts and in arguing for the validity of indirect evidence of theories of music and music perception.

• The theory of Probabilistic Functionalism provides an argument as to why evidence obtained from implementations of machine perception may have relevance to theories of human perception.

The Lens Model

Brunswik’s ideas centre around the notion that perception must be thought of as an interaction between the perceiver and the environment. His Lens Model (Cooksey 2001:227) of perception emphasises the importance of feedback in this system. His ideas have synergies with more recent theories from a diverse range of fields such as computational biology, complexity theory and artificial intelligence. For example, Bateson’s description of anthropological cybernetics (Bateson 1979) and Maturana & Varela’s (1980) theory of autopoiesis resonate with Brunswik’s work, particularly in regard to the emphasis on feedback loops as an integral part of the system.

Strands within psychology also find common ground with Brunswik. The field of Ecological Psychology (Gibson 1979) has many conceptual similarities, although it differs critically in its treatment of the role of inference, as will be discussed below. More recently, a general trend in psychology that is sometimes described as Perception and Action (Byrne & Anderson 1998) or Enactive Perception (Noe 2004) has emerged that shares an intellectual heritage with Brunswik’s ideas, and considers perception and action to be inseparably linked.

Representative Design

Brunswik’s notion of Representative Design is a methodological approach to experimental psychology that emphasises the importance of conducting experiments in representative contexts. He argued that standard experimental methodology in psychology was hypocritical in its insistence on the importance of obtaining a representative sample of subjects, whilst performing experiments in a context that was not representative of the conditions to which the experimental results were to be generalised.

[Brunswik] argued that the logic of induction should be applied in both directions, that the circumstances toward which the generalisation was intended should be specified (as with subjects) and that these should be included in the experiment if the generalization is to be justified. And he often used the expression “double standard” to indicate that psychologists employed the logic of induction in one direction only (over subjects) but claimed generalization over conditions without justification. (Hammond & Stewart 2001:5)

The notion of representative design has developed to become a central aspect of the perspective of Situated Cognition (§3.4.3), and has impacted this research by demanding that the Jambot's development and evaluation be conducted in realistic musical circumstances.

Brunswik’s emphasis on the importance of context is unsurprising given his castingof perception as a dynamic system comprising both perceiver and environment. Thegestalt nature of perception as emergent from this dynamic also led him to criticise thereductionist epistemology prevalent in psychology at the time:

By extension of the principle of sampling from individuals to situations, representative design countermands a number of preconceived methodological notions of nomothetic experimentation, among them the "rule of one variable" i.e. the study of one factor at a time. It further countermands the more general inclination to design experimental research in accordance with formalistic-"systemic" patterns which are too narrow to bring out the essentials of behavioural functioning (Cooksey 2001:233).

Probabilistic Functionalism

Brunswik’s theory of Probabilistic Functionalism describes perception as statisticalinference of causation (Brunswik 1955). The methodological importance of this posi-tion for my research is that it lends credibility to claims that computational models ofperceptual processes provide evidence regarding actual cognitive processes.

Several authors have championed statistical theories of music perception in the last few years (Huron 2006; Temperley 2007; Pearce & Wiggins 2007) in a broader trend in cognitive psychology sometimes termed probabilistic cognition (Chater et al. 2006). These theories claim that much of human cognition can be explained as induction of statistical regularities in the environment. Temperley makes explicit the claim that the success of his computational model supports its adoption as a theory of cognition:

The fact that this probabilistic approach ... can ... be applied successfully to the meter-finding problem gives it further plausibility as a general hypothesis about the workings of the human mind. (Temperley 2007:47)


The success of a computational model at performing some perceptual task provides only indirect evidence that it actually simulates cognitive activity. However I agree with Temperley that it does add to the plausibility of the claim. Contrapositively, the abject failure of a class of computational models to achieve a perceptual task lends support to the notion that the mind must be doing something different to these models. This has certainly been the case in the field of Artificial Intelligence (AI), where the failure of knowledge representation models to perform seemingly simple tasks (such as stacking blocks) has contributed to the loss of support for representational theories of cognition (Winograd 2006).

Brunswik himself proposed multiple regression as a model of cognition (Hammond & Stewart 2001:9). In Chapter 4 I argue that this is too simplistic, proposing instead an architecture consisting of hierarchical statistical categorisation. Further evidence for this architecture as a model of cognition is provided by several recent theories of biologically plausible mechanisms by which it could be realised within the brain (Hawkins & Blakeslee 2004; Coward 2005).

3.4.2 Ecological Psychology

Ecological Psychology is a strand of psychology developed by J. J. Gibson in the mid twentieth century. Ecological Psychology bears many similarities to Brunswikian Psychology, particularly in regarding perception as an interaction between the perceiver and the environment. I mention it here for two reasons: firstly because Gibson's theory more explicitly champions the notion of invariant representations (§3.4.2), which is an important architectural consideration for this research, and secondly because it was through Ecological Psychology that the perspective of Situated Cognition, which forms the theoretical perspective for this research, developed.

Gibson and Brunswik were contemporaries, and were aware of each other's work. Indeed Gibson seems to have been one of the few of Brunswik's peers who were sympathetic to his ideas (Gibson 1957). In one important respect, however, Gibson and Brunswik were in fundamental opposition – the role of inference in perception. Where Brunswik sought to describe all cognition as statistical inference (Brunswik 1955), Gibson proposed a theory of direct perception in which inference plays no role.

Direct Perception

Gibson’s view was that perception is direct in the sense of being unmediated by pro-cessing in the brain. In this Gibson was in direct opposition to Brunswik’s views, sincethe notion of probabilistic inference was central to Brunswik’s theories.


Disagreement between Brunswik and Gibson crystallized around the issue of whether the organism's relation to the environment should be considered irreducibly probabilistic or fully informative. (Kirlik 2001:240)

The theory of Direct Perception asserts that the environment contains the information we perceive, and that we perceive this information by 'picking' it up directly, rather than by performing inferences upon incoming sensory information. Gibson referred to the perceptual system as 'resonating' with invariant properties of the objects being perceived.

There are some examples of machine listening techniques that favour direct perception. Scheirer (1996) argues against the note transcription metaphor of musical perception. Large & Kolen's (1994) resonance theory of metre models metric perception with a network of oscillators that resonate at the perceived metre. Both of these examples distinguish themselves from the mainstream by eschewing intermediate symbolic representations in the processing chain.

The notion that information is contained in the environment, and perceived directly without symbolic mediation, is an important paradigm in situated robotics (Brown 2008). The central idea is summarised by Brooks as "the world is its own best model" (1990:5). The Smoke and Mirrors approach to musical improvisation discussed in Chapter 8 draws upon these ideas, although the Jambot's implementation of Smoke and Mirrors does utilise some degree of symbolic representation.

Gibson and Brooks alike have been frequently criticised for their uncompromising refusal to admit that any degree of processing or representation occurs in perception, despite apparent contradictions with other aspects of their theories. The complete denial of any form of representational processing in the act of perception seems to have been instrumental in the failure of ecological psychology to gain acceptance in the mainstream of psychology (Fodor & Pylyshyn 1981).

An alternative view of perception that unites Gibson's and Brunswik's positions is provided by the perspective of Situated Cognition (§3.4.3), which describes a representational but non-inferential mechanism for perception termed structural coupling. Before discussing Situated Cognition I must introduce the notion of invariant representation.


Invariant Representation

A central aspect of Ecological Psychology is the notion of invariant representation. Gibson argues that what the mind perceives must be an invariant representation of objects in the world, since the actual information being presented to the senses is constantly changing, but our perception is mostly of constant objects:

The features of the object that make it different from other objects have corresponding features in the family of perspectives that are invariant under perspective transformations. These invariants constitute information about the object. Although the observer gets a different form-sensation at each moment of his tour, his information-based perception can be of the same object (Gibson 1972).

Here Gibson is referring to invariant properties of an object in the visual field under geometric transformations; indeed his theories are heavily biased towards the explanation of visual perception.

A more general notion of invariance is proposed by Hawkins & Blakeslee (2004), who suggest that the perception of invariant properties is a form of categorical perception, and that cognition consists of a hierarchical process of parsing invariant properties from sensory information. Gibson's insistence on the direct nature of perception, and the complete lack of any internal representation or processing of sensory information, appear to have stopped him from exploring more complex architectures of invariant representation such as this.

Gibson appears to recognize the importance of the issue of internal structure in percepts, since he frequently refers to the perceptual objects in the environment as being "nested". However he denies that the detection of such nested units is cascaded in the sense that the identification of the higher units is dependent on the prior identification of the lower ones. He must deny this because the mechanisms in virtue of which the identification of the former is contingent upon the identification of the latter could only be inferential. (Fodor & Pylyshyn 1981:208).

Kirlik (2001:240) suggests that Gibson's difficulty in reconciling his theory of direct perception of invariants with any known (or even imagined) mechanism led him to gradually diminish emphasis on invariance in his writings in favour of the notion of affordances.

In this research I have made central use of an invariant representation of the musical audio input. The invariant representation is that of notes and their temporal locations expressed in terms of beats. The crucial aspect of this representation is that by storing temporal locations as beats, rather than as times, the resulting rhythms are invariant to tempo changes. I have found this to be essential for the robust perception of rhythmic and metrical properties in the context of a fluctuating tempo. This notion of invariance is similar to that of Hawkins discussed above, and differs from a Gibsonian view of invariance in that it involves a symbolic representation as an intermediate step of the perceptual processing chain.
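By way of illustration, the following minimal C++ sketch shows the essential property (the names are invented for illustration; this is not the Jambot's actual code): because onsets are stored as beat positions, a subsequent tempo change alters only the mapping back to clock time, leaving the stored rhythm untouched.

    #include <iostream>
    #include <vector>

    // Illustrative sketch of a tempo-invariant note representation.
    // Names are invented; this is not the Jambot's actual code.
    struct Note {
        double beat;       // temporal location in beats from a reference point
        double amplitude;  // loudness of the detected onset
    };

    // Convert a wall-clock onset time to a beat position, given the
    // currently tracked tempo. A real beat tracker would also account
    // for tempo drift and phase.
    double timeToBeats(double seconds, double bpm) {
        return seconds * bpm / 60.0;
    }

    int main() {
        const double bpm = 120.0;  // tracked tempo when the onsets arrived
        std::vector<double> onsetTimes = {0.0, 0.5, 1.25};  // seconds
        std::vector<Note> rhythm;
        for (double t : onsetTimes)
            rhythm.push_back({timeToBeats(t, bpm), 1.0});
        // The stored rhythm is {0, 1, 2.5} beats. If the tempo later
        // fluctuates, only the conversion back to clock time
        // (beats * 60 / bpm) changes; the rhythm itself does not.
        for (const Note& n : rhythm)
            std::cout << "beat " << n.beat << "\n";
        return 0;
    }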

A limitation of this invariant representation is that it does not capture models of musical time involving both absolute and beat-relative timing, such as Bilmes' (1993) model of expressive timing, and Honing's (2008) studies of swing timing.

3.4.3 Situated Cognition

The theory of Situated Cognition has developed relatively recently at the interface between Cognitive Psychology and Artificial Intelligence, influenced in part by Gibson's theory of Ecological Psychology. Situated Cognition is a perspective on human cognition that views knowledge as a dynamic combination of parallel influences, acknowledges the importance of context to understanding, and argues that perception is properly viewed as an interaction between the perceiver and the environment. In this perspective, perception and action are inextricably linked.

Situated Cognition is the study of how human knowledge develops as a means of coordinating activity within activity itself. This means that feedback – occurring internally and with the environment over time – is of paramount importance. Knowledge therefore has a dynamic aspect in both formation and content. This shift in perspective from knowledge as stored artefact to knowledge as constructed capability-in-action is inspiring a new generation of cyberneticists in the fields of situated robotics, ecological psychology, and computational neuroscience. (Clancey 1997:4)

As a theory of knowledge, Situated Cognition is compatible with Bateson's epistemological view that knowledge is an interaction between the knower and the known.

Clancey argues that Situated Cognition seeks to integrate the previously disparate perspectives of direct perception and knowledge representation.

Now is the time for the culprit and the offended to be reconciled ... I show how the "structural" aspect of situated cognition may be understood in terms of a coupling mechanism, on which the inferential processes described in studies of symbolic reasoning depend. (Clancey 1997:269)


Situated Cognition promotes the notion of structural coupling as a representational, yet non-inferential, mechanism for perception. Clancey proposes this as a resolution between Gibson and his critics:

Fodor and Pylyshyn have no sense of a non-inferential mechanism, and Gibson didn't provide any. Both parties appear to agree that the distinction is direct versus inference, but direct perception is absurd to Fodor and Pylyshyn because it suggests that there is no mechanism, no process occurring at all. Today, thanks to the work of the neuroscientists and situated roboticists I have surveyed, we do have an alternative, non-inferential mechanism: direct (structural) coupling. Unlike Fodor and Pylyshyn's transducers, coupling is a mechanism of categorical learning – a means of representing that creates features. (1997:270)

The term structural coupling has been co-opted from the work of Maturana & Varela (1980) in cognitive biology.

In this research project I have embraced the perspective of Situated Cognition. This has had ramifications for the architectural design of the Jambot, and for the nature of the development and evaluation of its operation. The notion of combining Gibson's view of direct perception with representational theories of 'musical understanding' forms the backbone of the Jambot's architecture.

Another key aspect of Situated Cognition is the emphasis on conducting experiments "in the wild" (Clancey 1997:250). In the context of situated robotics this means that at all stages of development the robot should be placed in real-world situations, in keeping with Brunswik's notion of Representative Design. Stowell et al. recommend this approach in the evaluation of interactive music systems:

One of the key themes in our recommendations is that the design of an evaluation experiment should aim as far as possible to reflect an authentic context for the use of the system. Experimental design should include a phase which encourages use and exploration of the system. (2009:974)

The adoption of this perspective has guided the development of the Jambot to be able to perform in realistic musical situations. This has also been driven by my desire to use the Jambot in performance.


3.5 Methodology

Design is a way of inquiring, a way of producing knowing and knowledge; this means it is a way of researching.

(Downton 2003:2)

This research consisted of developing and evaluating an interactive music system. The resultant system, the Jambot, both embodies knowledge as a technological artefact, and acts as an instrument for exploring and validating the underlying theories encoded into the system. In short, the goal of this research was to design the Jambot, with a design brief of creating a robust concert-ready system. This approach to research is called Design Research.

3.5.1 Design Research

Design Research is an exploratory methodology in which research is conducted by following a design process (Downton 2003:2). The research is framed by a design brief – an overall goal to be achieved through mechanisms unknown, and possibly unimagined, at the outset of the research.

In designing a solution to the brief, a number of subproblems may emerge. In this way Design Research is an exploratory methodology, as it provides a productive mechanism for exploring a problem space. Downton suggests that Design Research approaches are common even in scientific research, despite a lack of explicit recognition of this fact in scientific writing.

[The standard account of the scientific method] does not address one common activity of sciences – exploration ... This kind of inquiry looks much more like designing in structure; it is focussed on the production of something new. It is concerned with moving away from the existing, the known, through intentional actions to arrive at an as yet unknown, but desired, outcome. (Downton 2003:75)

Design Research generates two types of knowledge: firstly, solutions to the emergent subproblems; and secondly, knowledge regarding how the solutions to subproblems can operate together in a system.


Figure 3.2: Scaling up problem (Kitano 1993)

The systems focus of Design Research makes it compatible with the perspective of Situated Cognition, both because of the recognition of the role of context, and because of the methodological mantra of building prototypes in realistic settings throughout the development process. Goto & Muraoka, in designing a real-time beat tracking system, echo this advice:

Our strategy of first building a system that works in realistic complex environments, and then upgrading the ability of the system, is related to the scaling up problem (Kitano 1993) in the domain of artificial intelligence ... As Hiroaki Kitano stated: experiences in expert systems, machine translation systems, and other knowledge-based systems indicate that scaling up is extremely difficult for many of the prototypes (Kitano 1993). In other words, it is hard to scale up the system whose preliminary implementation works in not real environments but only laboratory environments. We can expect that computational auditory scene analysis would have similar scaling up problems. We believe that our strategy addresses this issue. (1995:74)

Figure 3.2, reproduced from Kitano (1993), depicts this issue: in order to create a system that is both complex and operates in a realistic setting, it is best to first create a system that is simple and works in a realistic setting, then gradually increase the complexity.

By adopting a systems approach Design Research can be expected to generate different insights regarding its subproblems than researching these problems in isolation would. The requirement to operate within a larger system may necessitate new approaches to the subproblems – approaches robust to a set of practical constraints that may render standard techniques unsuitable. An example in the case of the Jambot is the requirement for efficiency and causality in all of its perceptual algorithms, due to the context of real-time operation. Conversely, the operational requirements of the design brief may mean that partial solutions to difficult engineering problems are sufficient. The field of Musical Scene Description takes this view as fundamental (Goto 2006:252).

Iterative Design

In engineering, enlightened trial and error, not the planning of flawless intellects, has brought most advances; this is why engineers build prototypes.

Drexler (1986)

A central aspect of Design Research is the use of a cyclic process of development and evaluation called iterative design (Downton 2003:98). Zimmerman describes iterative design as "a design methodology based on a cyclic process of prototyping, testing, analyzing and refining a work in progress. In iterative design, interaction with the designed system is used as a form of research for informing and evolving a project as successive versions or iterations of a design are implemented" (2003:176).

This iterative approach is also common in the software development industry (Nielsen 1993; van Lamsweerde 2004). The use of this methodology in the context of music research has been discussed by Brown (2007) under the rubric Software Development As Research (SoDaR).

The iterative design process is also acknowledged as fundamental to engineering practice. The systems nature of engineering research means that there will often be a two-tiered evaluation strategy, where component algorithms are evaluated in isolation during initial design, then re-evaluated in terms of how well they operate within the system. Durban writes:

Only at the lowest level does the individual engineer encounter a well-defined, well-circumscribed problem to which technical knowledge can be directly applied. But even there, the most obvious technical solution may not match expectations at higher levels of what the solution needed to accomplish given a broader perception of the problem, leading to its return for re-engineering. (2006:144)

This two-tiered approach to evaluation reflects the mechanics of the Creative Practice research method that I have employed in developing the Jambot.

Creative Practice Research

Design research can in some cases be regarded as a type of creative practice research. Creative practice research is research that is "initiated in practice, where questions, problems, challenges are identified and formed by the needs of practice and practitioners ... and the research strategy is carried out through practice" (Gray 1998:3). Creative practice research is becoming increasingly prevalent within the creative arts.

The mechanics of creative practice research are often similar to those of design research. Anne Burdick writes:

Designers who are conducting research through their creative practice create work that is intended to address both a particular design brief and a larger set of questions at the same time. In most cases, the inquiry is sustained over a period of time and the designers create a body of work in response – projects and practices that serve as experiments through which they interrogate their ideas, test their hypotheses, and pose new questions. (2003:82)

My practice has two mutually informing aspects – computational instrument building and live performance. The two work together as aspects of the design research process. Computational instrument building plays the role of development, and live performance of evaluation, in the iterative design cycle. The use of reflective practice as an evaluative mechanism is common in creative practice research (Schon 1995).

Research through Design

The term 'Design Research' is used in a variety of ways, including research for design, research into design, and research through design (Downton 2003:2). This project has involved research through design, with the overall design brief of developing a concert-ready interactive music system. Research through design emphasises the practical and contextual considerations required of a complete operational system:


Research through design focuses on the role of the product prototype as an instrument of design knowledge enquiry. The prototype can evolve in degrees of granularity, from interactive mockups to fully functional prototypes, as a means to formulate, develop and validate design knowledge. The designer-researcher can begin to explore complex product interaction issues in a realistic user context and reflect back on the design process and decisions made based on actual user-interaction with the test prototype. Observations of how the prototype was experienced may be used to guide research through design as an iterative process, helping to evolve the product prototype. (Keyson & Alonso 2009:4548)

An important characteristic of research through design is that it seeks to test component hypotheses underlying the design. In this way the design artefact becomes an instrument (in Dewey's language §3.3.4) for both the exploration and validation of the theories upon which the design is based, serving as a "means to empirically test underlying theoretical hypothesis, rather than acting mainly as a reflective mechanism or as a means to only organize user feedback" (Keyson & Alonso 2009:4550).

In the case of software design, the prototypes created in research through design can be thought of as computational models, or simulations, of the underlying theories. The design of the Jambot utilises a broad range of theories of music, music perception and computational auditory scene analysis. Consequently the Jambot is an instrument for the exploration and validation of these component theories.

Aesthetic Evaluation

The strength of research through design is that it allows for the underlying hypotheses informing the design to be evaluated in a realistic setting. In the field of Human-Computer Interaction (HCI) research through design is valued because it enables evaluation of design subproblems in a coherent system via subjective user experience, and by "evaluating the performance and effect of the artifact situated in the world, design researchers can ... discover unanticipated effects" (Zimmerman et al. 2007:497).

In the case of the Jambot, the subjective user experience is a musical interaction. Consequently I have utilised aesthetic evaluation as the primary evaluative mechanism for this research. In other words I evaluated the system by having the Jambot produce musical output, and by reflecting on both the musical quality and the intuitiveness of the interaction. The use of aesthetic evaluation of generative algorithms encoding musical theories is discussed further in Brown et al. (2009).


In reality evaluation of the component theories and algorithms was a more intricate process, involving both independent evaluation during development utilising a variety of techniques, and aesthetic evaluation in the context of the complete system. In the section below I detail the mechanics of the development and evaluation process.

3.6 Methods

A variety of methods were used in the iterative cycle of development and evaluation. Development involved a combination of software design, computational modelling, algorithmic composition and interaction design. Evaluation used reflective practice for assessing the operation of the system as a whole. Additionally a variety of specific evaluation techniques were employed in initial development of component algorithms in isolation. The sections below detail the methods employed during development and evaluation of the three stages of reception, analysis and generation.

3.6.1 Prototype Development and Evaluation

In describing the processes of development and evaluation of the Jambot I will use the same structure used to describe the Jambot's architecture – reception, analysis and generation – as this delineation roughly corresponds to the stages in which the research was conducted. The investigative methods in all three stages were similar in some respects, though each had its own flavour.

An important aspect of the research method at all of these stages was the process of prototyping, moving to production, evaluating, and returning to the prototype in an iterative design cycle. For the reception and analysis stages, prototyping was performed by running non-real-time simulations in Gnu Octave¹, and for the generation stage prototyping was performed in Impromptu, a Scheme-based live-coding environment for music production. After prototyping, algorithms were implemented in real-time form in C++.

Although I have separated the process into stages, the entire project followed the iterative design metaphor, so that each stage informed the other stages. Consequently all of the algorithms were revisited many times throughout the research as the requirements of the system clarified the requirements of the component algorithms.

¹Gnu Octave is an open source emulation of MATLAB, a popular mathematics package.


3.6.2 Development of Reception Algorithms

The goal of this stage of the project was to develop algorithms for extracting note onset and pitch information from an audio stream. The iterative design cycle started with the inspection of some short audio samples that I manually annotated with note onset points and pitches. Manual annotation was done by listening to the audio samples whilst inspecting the waveform in Amadeus Pro, and making note of the sample position of perceptual onsets. Pitch estimates were made in the same way with the aid of an external reference tone produced in Max/MSP.

The samples were used to develop algorithms that reported note onset/pitch information in agreement with the manual annotation. Importantly, in developing these algorithms I strongly preferred techniques that aligned with my musical intuition, as a precaution against overfitting (de Marchi 2005:36). Inevitably however, upon moving from prototype to production, exposing the algorithm to a larger data set revealed weaknesses, often due to the algorithm capturing idiosyncrasies of the small development sample rather than genuinely generalisable properties.

For the real-time implementation of these algorithms manually annotating the input material was not viable. In this case a valuable evaluative strategy was the use of mimicry: having the Jambot repeat back what it received, as it received it, provided an effective method for assessing how well its reception matched my own.

The percussive onset detection algorithms operate very fast – they are generally able to report an onset within 128 samples (≈ 3ms at 44.1kHz), a latency which is imperceptible (to my ear). Consequently, when mimicking the percussive content of the audio, the onset of the mimicked response was imperceptibly different from the onset time that the algorithm detected, making the use of mimicry for evaluation more valuable. The harmonic content extraction algorithm is much slower (≈ 40ms) and the latency of its mimicry is quite noticeable. Nonetheless mimicry proved valuable as an evaluation mechanism for the quality of the pitch estimates.

3.6.3 Development of Analytic Algorithms

In this thesis I describe the analysis stage in two parts: metric and rhythmic. These two parts were developed independently, although, as with all of the component algorithms, the design research methodology entailed revisiting each algorithm in view of its performance in the context of the complete system.


Metric analysis focused on beat-tracking and metre induction. The research methods employed were very similar to those used in the reception stage – an iterative design cycle starting with small audio samples which I manually annotated with beat locations. I developed candidate algorithms, adjusting them until they reported beats in agreement with my annotation, and then implemented them in real-time and exposed them to a much broader data set. This exposure invariably highlighted problematic areas, prompting a revisit of the prototype for refinement.

Again mimicry was a valuable tool for evaluation. In the case of tracking the beat, mimicry consisted of having the Jambot play a click on each beat that it predicted, and similarly for the downbeat.

The rhythmic analyses performed in the analysis stage are used by the Jambot to construct appropriate rhythmic accompaniment. So, whilst in this thesis I categorise these analytic algorithms as belonging to the analysis stage, their development and evaluation is best described in the context of the generation stage, discussed in the next section.

3.6.4 Development of Generative Algorithms

The research process in developing the generative algorithms was similar to the previous two stages in that it utilised an iterative design methodology. However, in a number of other respects it was quite different.

Firstly, the cycle of moving from offline to online implementation was not present, although the cycle of moving from prototype to production remained. Prototyping was carried out in Impromptu (Sorensen 2005), itself a real-time environment, and the move to production was primarily for efficiency. Aspects of the improvisatory process of the completed Jambot are still implemented in Impromptu, as this environment allows for the alteration of aspects of the improvisatory algorithms in real-time, which is desirable from a performance perspective.

Secondly, the evaluative mechanisms were different. Mimicry was no longer adequate as an evaluation tool. Instead I used my own aesthetic judgment of the quality of the Jambot's improvisation. The ability of the Jambot to jam to recorded material was very helpful in this stage of development, as this enabled the use of an enormous supply of readily available data in the form of CDs and mp3s, which could be used without the need for any intricate experimental setup. However I also experimented with having the Jambot improvise with live musicians in this stage of the research process.


3.6.5 Evaluation Techniques

Evaluation of the component algorithms followed a two-tiered process. The first tier involved evaluating the algorithm in isolation. For the reception and analysis stages the primary technique for this first tier was the use of mimicry, combined with comparisons of the signal/noise ratio with existing techniques (for reception), or application of the algorithm to a test corpus (for analysis).

The second tier of evaluation involved re-assessing the performance of the component algorithms in the context of the complete system. The technique employed for this second tier was aesthetic evaluation, meaning that I actively interacted with the Jambot and reflected on its musical quality. In doing so I employed a set of critical values relating to the design brief for the Jambot.

I conducted these evaluations myself. Although I received informal feedback from both listeners and other musicians playing in ensembles with the Jambot, I did not conduct any formal evaluation sessions with third parties. I leave expanded user studies as a topic for further research.

The quality of the onset detection algorithms was directly evident aesthetically when using the transformational mimesis improvisation strategy. This strategy (described in §8.2.6) involves real-time imitation of the percussive attacks. Human onset synchrony discrimination can be as low as 2ms (Goebl & Parncutt 2001), and so this evaluative technique was an effective measure of the latency of the algorithms.

The beat tracking algorithm was evaluated subjectively via the use of a click-track accompaniment. Dannenberg uses a similar subjective assessment criterion to evaluate his beat-tracking algorithm and comments that "although human judgement is involved in this evaluation, the determination of whether the beat tracker is actually tracking or not seems to be quite unambiguous, so the results are believed to be highly repeatable" (2005:371).

Within the Music Information Retrieval community a large number of quantitative measures have been used for beat tracking evaluation (Davies et al. 2009), but these metrics tend not to address situations involving abrupt tempo/phase changes (Collins 2006b:96). In subjectively evaluating the Jambot's beat-tracking facility I deliberately manipulated the playback tempo, paused playback, and abruptly switched between test tracks. Part of the evaluation was the recovery time required after an abrupt tempo shift or pause. A couple of seconds was viewed as an acceptable recovery time (particularly since the Jambot's juxtapositional calculus (§8.4.1) ensures that the improvisation will remain musically appropriate during the interim).

For the generation stage I aesthetically evaluated the Jambot's improvised output. In doing so, I was looking for stylistic appropriateness, rather than hoping to forge new musical directions. Indeed, in this project generally I have been interested in modelling music so as to be able to reproduce it computationally. This contrasts with many approaches to algorithmic music, which deliberately seek novel musical structures.

As an evaluative criterion I utilised a form of musical Turing test; namely, how similar the improvisation produced by the Jambot was to what might be expected of a human performer. Stowell et al. (2009) discuss the use of Turing-like tests for evaluating interactive music systems at some length. My interpretation is a little different – rather than being concerned with how human the improvisation sounds, I am concerned with how believable it is. This distinction will be elaborated in more detail in §8.2.1. Practically this amounted to wanting the Jambot's improvisation to maintain a tension between seamless integration and a sense of individuality.

The Jambot is designed to be used in either autonomous or semi-autonomous mode. A key design requirement was that, when used semi-autonomously, the parametric controls afforded intuitive manipulation of salient musical attributes. Consequently, in subjectively assessing the Jambot's output, I also experimented with the intuitiveness of the responses to parametric manipulation. The anticipatory timing (§8.3.6) optimisation technique was created as a result of this experimentation – without this enhanced search technique the parametric controls were unintuitive. A further evaluative criterion was the ability to produce a broad range of interesting rhythms by manipulating these controls.

3.7 Summary

In this research I have employed a pragmatist epistemology (§3.3.2), the theoretical perspective of Situated Cognition (§3.4.3) and a design research methodology (§3.5.1), which together form a coherent theoretical framework appropriate for the interdisciplinary research aims. The central conceptual thread running through the components of this framework is a recognition of the importance of context.

This framework describes structural aspects of knowledge, and in particular views knowledge (and cognition) as a dynamic exchange between knower and known, with feedback between the two, and feedback internally between layers of cognitive abstraction.

Having adopted this framework, consideration of the Jambot's architecture is important for two reasons. Firstly, the perspective of Situated Cognition suggests that architecture is a key factor in the development of robust interactive agents. Secondly, adopting an architecture inspired by structural descriptions of cognition lends more credence to commentary linking computational design experiences to the mechanisms at play in human perception. The next chapter discusses architectural considerations in general, and describes the particular architecture of the Jambot.


Chapter 4

Architecture

4.1 Introduction

A key aspect of the Jambot's design is a single coherent architecture encompassing all of its perceptual, analytical and improvisatory algorithms. The Jambot's architecture is a complex hierarchical structure with communicative feedback between layers in the hierarchy. The first section in this chapter describes the Jambot's architecture.

In the second section I review a number of architectures that have been used for the representation of music, and for modelling cognitive structures involved in the perception of music. I build a case that the simple tree-like structures often used for these purposes are inadequate. Rather, an appropriate architecture is a multiple belief hierarchy with complex topology, including bidirectional feedback between the layers of the hierarchy.

4.2 Jambot Architecture

The path from audio signal to improvised musical response is a multi-staged one, and as such this project required the creation of a suite of algorithms in various areas of machine listening and musicianship. The individual algorithms, for the most part, were adapted from existing algorithms and represent incremental advances of varying degrees of novelty. Equally important is the architecture that determines how these algorithms combine to form a coherent and robust system.

4.2.1 Reception, Analysis and Generation

The algorithms are organised into three modules: reception, analysis and generation. The reception module is the Jambot's auditory 'front-line'. It is responsible for converting the raw audio signal into timestamped musical notes. The analysis component infers the underlying metre from the notes and performs rhythmic analyses. The generation component utilises these analyses to generate an appropriate musical response.


The Jambot's architecture comprises these three modules and communication between them. There is a natural flow of information from reception to analysis to generation. Additionally there is feedback of information from generation to analysis, and from analysis to reception. The feedback is important to the robustness of the system.

Figure 4.1 depicts the Jambot's architecture. At the base of the figure is the audio signal to which the Jambot is listening. The audio signal is fed into the reception module, which contains a collection of algorithms for onset detection and pitch tracking. Currently the pitch tracking algorithms are not of production quality, so information from them is not utilised. The result of applying the reception module is to represent the audio as a stream of timestamped note onsets.

The analysis module consists of algorithms for beat tracking and metre induction, and a collection of rhythmic analyses. It takes the timestamped note onsets from the reception module as input. The beat tracking algorithm results in a reference pulse level containing both tempo and phase information. This reference pulse level is not the tactus, but a new theoretical construct dubbed the substratum, which will be described in §6.3. The metre induction estimates the most salient bar length and the location of the downbeat. The result of performing these analyses is to represent the audio stream as notes at beat locations, similar to the symbolic representation of common practice notation (with the addition of amplitude information). Once the audio is represented in this form it is further analysed for rhythmic properties. There are a variety of rhythmic analyses performed, described in §7.3.

There is feedback of information from the analysis module to the reception module. Once the Jambot has established a sense of the beat, it forms an expectation regarding likely times for onsets to occur. It then modulates the threshold signal/noise ratio required for reporting an onset. The modulation is such that it is more difficult for an onset to be registered at an unexpected time. The details of this modulation are described in §6.7.
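A minimal sketch of how such expectation-driven modulation could work follows; the shape of the modulation and all constants here are invented for illustration, and the Jambot's actual scheme is the one described in §6.7.

    #include <algorithm>
    #include <cmath>

    // Raise the onset-reporting threshold away from metrically expected
    // positions. The modulation shape and constants are illustrative only.
    double modulatedThreshold(double baseThreshold, double nowBeats,
                              double pulsePeriodBeats, double maxBoost) {
        // Distance (as a fraction of the pulse period) to the nearest
        // expected pulse location.
        double phase = std::fmod(nowBeats, pulsePeriodBeats) / pulsePeriodBeats;
        double distance = std::min(phase, 1.0 - phase);  // in [0, 0.5]
        // Lowest threshold on the expected pulse, highest midway between.
        return baseThreshold * (1.0 + maxBoost * 2.0 * distance);
    }

    // An onset candidate with signal/noise ratio snr at beat position
    // nowBeats is reported only if it clears the modulated threshold.
    bool reportOnset(double snr, double nowBeats) {
        return snr > modulatedThreshold(3.0, nowBeats, 1.0, 0.5);
    }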

The generation module is responsible for taking musical actions based on the results of the analysis module. There are two generation algorithms, one reactive and one proactive. Reactive generation utilises real-time imitation of the input audio. The imitation is transformed sufficiently to obscure from the listener the direct relationship between the input audio and the generated response, so that the result sounds like an appropriate improvisation rather than a simple imitation. This technique, which I call transformational mimesis, is described in §8.2.6.


The proactive generation algorithm is not imitational, but rather relies on prediction, and utilises the metrical and rhythmic analyses performed in the analysis module. It operates by having target values for the rhythmic analyses, and by taking musical actions to create output that meets these targets as closely as possible. The targets themselves are parameters that may be adjusted by a human performer during the course of performance.
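As a sketch of this target-driven scheme (the feature names and cost function are invented stand-ins; the actual rhythmic analyses appear in §7.3 and the search procedure in Chapter 8), candidate actions can be scored by how close the predicted ensemble output would sit to the performer-set targets, with the closest action taken.

    #include <cmath>
    #include <limits>
    #include <vector>

    // Performer-adjustable targets for rhythmic features of the ensemble
    // output. The features named here are illustrative stand-ins for the
    // analyses of §7.3.
    struct RhythmTargets {
        double density;      // e.g. proportion of pulse positions sounded
        double syncopation;  // e.g. degree of off-beat emphasis
    };

    struct CandidateAction {
        int voice;                // which percussion voice to trigger
        RhythmTargets predicted;  // predicted ensemble features if taken
    };

    // Choose the action whose predicted outcome lies closest to the
    // targets (assumes candidates is non-empty).
    CandidateAction chooseAction(const std::vector<CandidateAction>& candidates,
                                 const RhythmTargets& target) {
        CandidateAction best = candidates.front();
        double bestCost = std::numeric_limits<double>::max();
        for (const CandidateAction& c : candidates) {
            double cost =
                std::pow(c.predicted.density - target.density, 2.0) +
                std::pow(c.predicted.syncopation - target.syncopation, 2.0);
            if (cost < bestCost) { bestCost = cost; best = c; }
        }
        return best;
    }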

There is feedback from the generation module to the analysis module. The Jambot remembers the actions that it has taken, and feeds these back into the analysis module, which merges them with the input stream. The target values for the rhythmic analyses are applied to the musical output of the ensemble, including the Jambot's own actions. This feedback only affects the rhythmic analyses, not the tempo tracking and metre induction.

4.2.2 Multiple Parallel Beliefs

The Jambot uses a multiple parallel belief architecture. This means that for all of its perceptual parameters it maintains multiple hypotheses about the value, with associated levels of confidence. The use of a multiple parallel belief architecture mirrors several theories regarding the cognitive structures involved in music perception (§2.3.5).
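As a data structure, such an architecture might be sketched as follows (illustrative only – these are not the Jambot's actual classes): each perceptual parameter carries a list of competing hypotheses, each with a confidence level.

    #include <algorithm>
    #include <vector>

    // Illustrative sketch of a multiple parallel belief structure.
    struct Belief {
        double value;       // hypothesised parameter value (e.g. a tempo)
        double confidence;  // accumulated evidence for this hypothesis
    };

    // A perceptual parameter maintains several beliefs in parallel,
    // rather than committing to a single estimate.
    struct ParallelBeliefs {
        std::vector<Belief> beliefs;

        const Belief& mostPlausible() const {
            return *std::max_element(beliefs.begin(), beliefs.end(),
                [](const Belief& a, const Belief& b) {
                    return a.confidence < b.confidence;
                });
        }
    };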

4.2.3 Perceptual Inertia

An important property of the Jambot's perceptual architecture is perceptual inertia. This means that once a belief in a particular value for a parameter has been spawned, it is robust to a temporary disappearance of evidence supporting that value, so long as the value is coherent with the current evidence.

An important consideration for any real-time system is finding a good trade-off between reactivity and inertia (Gouyon & Dixon 2005). Within a multiple parallel belief architecture, often a single belief will need to be identified as dominant, so that actions can be taken according to that belief. Some of the Jambot's improvisatory strategies do not rely on a dominant belief (cf. the Chimæra §8.3.10), but some do. For robust and temporally coherent improvisation it is important that the dominant belief display some level of hysteresis, i.e. it should continue on as the dominant belief even if a competing belief momentarily becomes more plausible. However, it is also important to quickly relinquish a belief that is actually incorrect.

Figure 4.1: Reception, Analysis and Generation

The Jambot's implementation of perceptual inertia addresses this issue. By maintaining confidence in a belief so long as there is no contradictory evidence, the dominant belief tends to be more stable. Whilst the dominant belief may at times not coincide with the most plausible belief, the lack of contradictory evidence should mean that the Jambot's actions based on the dominant belief remain appropriate. However, faced with contradictory evidence an implausible belief is quickly abandoned.
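One way such behaviour could be realised is sketched below; the coherence test, decay constants and takeover margin are all invented for illustration, and this is not the Jambot's actual update rule.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Belief { double value; double confidence; };

    // Illustrative sketch of perceptual inertia with hysteresis.
    struct InertialParameter {
        std::vector<Belief> beliefs;
        std::size_t dominant = 0;  // index of the current dominant belief

        // Called only when an observation arrives. When no observation
        // arrives, confidence simply persists - which is what makes a
        // belief robust to a temporary disappearance of evidence.
        void update(double observed) {
            for (Belief& b : beliefs) {
                bool coherent = std::abs(b.value - observed) < 0.05;
                if (coherent)
                    b.confidence += 1.0;  // supporting evidence
                else
                    b.confidence *= 0.5;  // contradiction: decay quickly
            }
            // Hysteresis: a rival takes over the dominant slot only if
            // it is more plausible by a clear margin (here 20%).
            for (std::size_t i = 0; i < beliefs.size(); ++i)
                if (beliefs[i].confidence > 1.2 * beliefs[dominant].confidence)
                    dominant = i;
        }
    };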

4.2.4 Parameter Estimation

The use of a multiple parallel analysis architecture throughout the Jambot's perceptual system means that the estimation of perceptual parameters can follow a common pattern. At any given time the Jambot maintains a number of hypotheses regarding the value of any given parameter. At each new decision point (for example when an onset is detected, or on the next beat), new information is assessed in the light of these beliefs and a weighted list of plausible new values is estimated. This approach is similar to the use of Bayesian Networks, although it differs in the use of feedback from higher levels of the architecture to affect the estimation procedures.

The process is one of continually updating multiple inter-related hypotheses. For this process to operate each parameter must have some estimation procedure that allows hypotheses to be drawn from the available evidence. However these estimation procedures will not in general need to return a single answer, and will not need to be foolproof – a degree of perceptual inertia means that the system is robust to periods of noise.

4.3 Architectures for Perception and Action

The preceding section describes the Jambot's architecture, characterising it as a multiple belief hierarchy with complex topology including bidirectional feedback between the layers of the hierarchy. In this section I review a number of architectures that have been used for music representation, for machine perception, and for machine action. I argue that a multiple belief hierarchy with bidirectional feedback is an appropriate architecture for a computational improvisatory agent.

4.3.1 Architectures for Music Representation

Music analysis, by and large, involves processes of reduction from the musical surface to some more abstract representation. Traditionally the starting point for analysis is a score, in which the fundamental representational unit is the note (which in itself is already an abstract representation of the sonic event that it prescribes).

Analytic representations are often hierarchical; notes are grouped into phrases, phrases into sections, etc. Tree structures are common, with the leaves of the tree being the notes of the musical surface. The process of analysis involves moving up levels of the tree, in a process of increasing abstraction.

From a generative music perspective, it is of interest to move down the tree, a process that music analysis does not traditionally address. As a simple example, the top level of a hierarchical representation of a piece of music might be its form. The next level down could be the chord progressions for each section, and the bottom level the actual notes in the song. A generative process that started at the middle level and moved down would retain the same form and chord structure as the original song, but might have a different melody. A more sophisticated representation could have another level before the surface level that represented some abstract feature of the melody (such as pitch contour), so that moving down from that layer resulted in a new piece with the same form and chord structure and a melody that is similar to the original but not the same. In this thesis I refer to moving up a representational hierarchy as analysis and moving down as generation.
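The example in this paragraph can be made concrete with a small sketch (the types and the trivial regeneration rule are invented; a real system would move down via a proper melodic model):

    #include <string>
    #include <vector>

    // Illustrative three-level hierarchy: form -> chords -> notes.
    struct Piece {
        std::vector<std::string> form;                 // e.g. {"A", "B", "A"}
        std::vector<std::vector<std::string>> chords;  // progressions per section
        std::vector<int> melody;                       // surface: MIDI pitches
    };

    // "Moving down" from the chord level: form and chords are preserved,
    // and only the surface is regenerated. The regeneration here is a
    // trivial stand-in (arpeggiating chord roots).
    Piece regenerateMelody(const Piece& original,
                           const std::vector<int>& chordRoots) {
        Piece variation = original;  // form and chord structure retained
        variation.melody.clear();
        for (int root : chordRoots) {
            variation.melody.push_back(root);      // root
            variation.melody.push_back(root + 4);  // major third above
            variation.melody.push_back(root + 7);  // fifth above
        }
        return variation;
    }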

Schenkerian Analysis

Schenkerian Analysis is an example of a tree-structured hierarchical representation of music, which has received wide acceptance (as well as criticism) in musicology.

[Schenkerian analysis has] as its cornerstone the concept of structural levels ... the background ... middleground ... and foreground levels ... The progression from background to foreground moves the basic idea to its realisation; conversely analysis involves the progressive reduction of a finished work to its fundamental outline. (Forte & Gilbert, 1982).

In Schenkerian analysis the process of reduction involves identifying structural tones (Salzer 1962:41), of which surrounding non-structural tones are regarded as ornamentations, diminutions, or other subservient classes. The abstractions (or reductions) in Schenkerian theory then have the interesting property of being (superficially) similar – i.e. the 'abstractions' at each layer of the hierarchy are still notes.


Network Structures

The representational architecture of Schenkerian analysis is a (uni-directional) tree-structure, which is the simplest topology amongst graphs (or networks). The simplicity of this topology brings into question whether such a representation is rich enough to adequately describe musical structures. Narmour, in his critique of Schenkerian analysis, comments:

One consequence of the way Schenkerian theory works is that every analysis results in a treelike structure ... [such a] treelike generation might seem reasonable were it not that such a parsing distorts melodic, formal, rhythmic, and metric facts. (Narmour 1977:96)

The restricted connectedness of a (uni-directional) tree prohibits lateral communication between nodes of the same level, nodes in non-adjacent levels, or bidirectional communicative feedback between levels. Such restrictions limit the ability of such a representation to model temporal motion (indeed Schenkerian theory asserts that temporal motion should always be understood as elaborations of archetypal motions at higher levels), and also limit the ability to describe exact repetitions of motifs versus variational returns. Narmour contends that such limitations render tree-like structures incapable of adequately representing music:

analytical reductions should be conceptualized not as trees – except perhaps in the most simplistic kind of music where each unit (form, prolongation, whatever) is highly closed – but as networks. That is musical structures should not be analyzed as consisting of levels systematically stacked as blocks ... but rather as intertwined, reticulated complexes. (Narmour 1977:98)

Computational Models

In the last few decades a class of models for computational music analysis has emerged under the rubric of computational musicology, made possible by advances in digital computers. Within computational musicology a variety of representations have been used, by far the most common of which is the Markov Chain.

For example, single-level Hidden Markov Models have been used by Pearce & Wiggins (2007) to analyse melodic structure in a collection of Bach Chorales, and similarly applied to Greek church chants by Mavromatis (2005). Pearce & Wiggins use their analytical model in reverse to generate novel melodies that they hope will be stylistically similar to the corpus on which the model was trained. They note that the output of their system suffers from the kinds of musical limitations discussed in relation to tree-structured representations (§4.3.1), suggesting that "more structured generation strategies ... may be able to conserve phrase level regularity and repetition in ways that our systems were not" (2007:8).

In this case, however, the structure is not a tree but rather a linear sequence of nodes, with lateral communication from one node to the next. The problem is similarly one of a lack of bidirectional and multilateral communication at an appropriate level of abstraction. So whilst there is at least some lateral communication in the model (between adjacent nodes, or more precisely between nodes within n of each other, where n is the order of the Markov Model), there is no capacity for exact repetition (let alone precise control over dimensions of motif variation) beyond the order of the Markov Chain (which for tractability needs to be relatively small).
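
To make the repetition limitation concrete, the following Python sketch (purely illustrative; it is not Pearce & Wiggins' system) trains an order-n Markov model on a toy melody and samples from it. Any structure longer than the n-note context is invisible to the model, so phrase-level repetition can only occur by chance.

    import random
    from collections import defaultdict

    def train_markov(notes, order=2):
        # Count continuations observed after each length-`order` context.
        model = defaultdict(list)
        for i in range(len(notes) - order):
            model[tuple(notes[i:i + order])].append(notes[i + order])
        return model

    def sample_markov(model, seed, length=32):
        # Repeatedly sample a continuation of the most recent context.
        out = list(seed)
        while len(out) < length:
            choices = model.get(tuple(out[-len(seed):]))
            if not choices:                     # no observed continuation: stop
                break
            out.append(random.choice(choices))
        return out

    corpus = [60, 62, 64, 62, 60, 62, 64, 65, 64, 62, 60]   # toy melody (MIDI pitches)
    model = train_markov(corpus, order=2)
    print(sample_markov(model, seed=(60, 62)))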

Christopher Thornton notes that this is a general issue with music generated from Markov Models, and argues that it reflects an architectural flaw in such statistical representations.

If we construct a statistical model on the basis of note-transition probabilities, the model obtained will tend to represent note-transition structures. Structures at other scales may not be represented. Music generated from the model may conform to observed structure at one level while deviating from it at another. The well known 'wandering melody' effect, exhibited by some Markov-chained music, is a case in point. (Thornton 2009:1)

Thornton suggests that constructing a hierarchy of Markov Models will help alleviate these issues. He constructs a hierarchical Markov Model for analysing melodies in Bach Chorales, and uses the model to generate novel works. He takes the somewhat unusual approach of training his analytic model on just one chorale before using it to generate back a chorale that is supposed to be similar to the original in some sense (more commonly models are trained on a corpus of data, and the generated music is supposed to be of a similar style to the corpus). The model is like a tree-structure except that it allows for lateral communication between adjacent nodes in the same level (but not multilateral or bidirectional communication).

Thornton describes the generated results as "interesting". They can be heard at (Thornton 2011). My impression is that whilst they clearly exhibit a greater degree of form than single-layer Markov Models, they still suffer from the kinds of limitations of tree-structured representations discussed in §4.3.1, despite the inclusion of lateral communication between adjacent nodes. Also, the fact that the representation does not explicitly model any music theoretic or music perceptual notions (except perhaps the theory that music is hierarchical and statistical) seems evident to me in the output, which to my ear is not convincingly of the genre, although its relationship to the original is vaguely recognisable. Whether this is fundamentally the fault of his representational architecture, or a result of training the model on only a single chorale, is not clear; if such a model were capable of inducing music theoretic notions as statistical regularities, it is unlikely that it would do so on the basis of a single work.

Nodal

An example of a generative system that uses a more complex network topology as its representational system is the Nodal composition tool (McIlwain et al. 2007). Nodal allows the user (composer) to construct their own representation, which can be an arbitrary graph with the nodes of the graph being notes. The representation is used to generate music by having a play-head traverse the graph, playing the note associated with a node when it passes through that node. The physical distance between nodes, in relation to the scale of the graph, determines the time delay between playing successive nodes. If a node has several edges emanating from it then the playback splits, enabling polyphony.

Nodal is particularly good at representing exact repetitions, a property that is difficult to achieve with either Markov Models or tree-like representations. In particular the Nodal system excels at producing isorhythms. Nodal is not, however, hierarchical; there is no abstraction mechanism allowing a node to represent a collection of nodes (although a single node may contain a list of pitches rather than a single pitch). Consequently, compositional actions like section changes, which are simple to represent in a hierarchical structure, are cumbersome or impossible to represent in this way. A demonstration of Nodal can be viewed at (Nodal 2011).
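
The generative mechanism is easy to sketch. The following toy traversal (illustrative only; it is not Nodal's implementation) plays the pitch at each node and splits the play-head wherever a node has more than one outgoing edge, so a cycle in the graph yields an exact, isorhythmic repetition.

    # Toy node graph: each node holds a pitch and outgoing (target, delay) edges.
    graph = {
        "A": {"pitch": 60, "edges": [("B", 0.5)]},
        "B": {"pitch": 64, "edges": [("C", 0.5), ("A", 1.0)]},   # branch: playback splits
        "C": {"pitch": 67, "edges": [("A", 0.5)]},
    }

    def traverse(graph, start, until=3.0):
        # Collect (time, pitch) events; play-heads split at branches.
        events, frontier = [], [(0.0, start)]
        while frontier:
            time, node = frontier.pop(0)
            if time > until:
                continue
            events.append((time, graph[node]["pitch"]))
            for target, delay in graph[node]["edges"]:
                frontier.append((time + delay, target))
        return sorted(events)

    for time, pitch in traverse(graph, "A"):
        print(f"{time:4.1f}s  note {pitch}")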

4.3.2 Architectures for Perception

The architectures discussed above for representations aimed to describe the structures of objects (or phenomena) without regard to how these objects might be perceived. In this section I discuss architectures for perception, by which I mean architectures of the internal representations used by agents in perception.

A word on anthropomorphism: by an agent I mean either a human or a machine – but only when the machine is designed to give some impression of agency. The notion of agency will be discussed in more detail in §8.2.1, but roughly speaking by an agent I mean something that both senses and behaves. I am only concerned with agents that are able to enter some form of interactional dialogue with a human.

In this thesis I discuss computational architectures for music perception. It is important to note that whilst these computational architectures have been inspired by cognitive processes of music perception, I am not claiming that the Jambot's architecture is a good model of the actual mental representations used in music perception. My goal has not been to model the brain's function; rather I am primarily concerned with finding an effective architecture for an improvisational agent. That said, I suggest that the results of implementing computational perceptual architectures provide indirect evidence regarding theories of music cognition.

Knowledge Representation

The concept of knowledge representation has played a dominant role in artificial intelligence throughout its history. The idea is rooted in a view of intelligence as a process of symbol manipulation. The symbols are abstract representations of objects in the world, and together with rules for their manipulation these symbols form an agent's knowledge of the world. The knowledge is internal to the agent, and separate from the outside world. As data structures the symbols are static, with no sense of temporality. This sort of model of knowledge is consistent with some formal theories of language structure, such as Chomsky's descriptions of Generative and Transformational Grammar (Chomsky 1969). Clancey describes knowledge representation:

According to this view, intelligence is mental, and the content of thought consists of networks of words, coordinated by an architecture for matching, search, and rule application. These representations, describing the world and how to behave, serve as the machine's knowledge, just as they are the basis for human reasoning and judgement. According to this symbolic approach to building an artificial intelligence, descriptive models not only represent human knowledge, they correspond in a maplike way to structures stored in human memory. (Clancey 1997:6)

Generative Theory of Tonal Music

Directly inspired by linguistic models of Generative Grammar is the Generative Theory of Tonal Music (GTTM). The theory is intended as a model of the actual cognitive structures involved in music perception:

Where, then, do the constructs and relationships described by music theory reside? ... In-so-far as one wishes to ascribe some sort of "reality" to these kinds of structure, one must ultimately treat them as mental products ... In our view the central task of music theory should be to explicate this mentally produced organisation. (Lerdahl & Jackendoff 1983:2)

The representational structures used in the GTTM are trees of increasing abstraction:

we propose a detailed theory of musical hierarchies ... By hierarchy we mean an organization composed of discrete elements (or regions) related in such a way that one element may subsume or contain other elements. The elements cannot overlap; at any given hierarchical level the elements must be adjacent; and the relation of subsuming or containing can continue recursively from level to level ... we notate [these hierarchical relationships] by means of trees. (Lerdahl & Jackendoff 1993:290,291,295)

The authors note, as I have suggested above, that the strict tree structure is at odds with some aspects of music perception, particularly notions of parallel horizontal motion.

The nonoverlapping condition raises a number of issues. Inherent in our tree notation is the conception of a musical surface as a sequence of discrete events. This view is only partly accurate: music is also perceived as simultaneous polyphonic lines ... we reserve for future research an extension of the theory in this dimension. (Lerdahl & Jackendoff 1993:308)

The use of the term generative in this theory is related to Chomsky's use of it in generative grammar; in the language of this thesis it would be more appropriate to call it the Analytical Theory of Tonal Music. Lerdahl & Jackendoff do not use the theory for generating musical surfaces from the analysis.

Experiments In Musical Intelligence

One celebrated example of generative composition is David Cope's (1996) Experiments in Musical Intelligence (EMI). EMI also utilises a knowledge representation style structure inspired by natural language processing; however, rather than the strict tree structures discussed above, EMI utilises Augmented Transition Networks, which allow for more complex topologies. Furthermore, EMI implements an additional process called SPEAC, which allows the attachment of contextual information to nodes in the network, further complicating the topology. I find the musical results of Cope's work to be quite convincing, and suggest that his choice of architecture plays a large role in this.


Multiple Parallel Processing

The concept of systems of musical analysis that yield several plausible results has been posited by a number of authors as a model of human musical cognition. Notably, Jackendoff (1992:140) proposed the parallel multiple analysis model. This model, which was motivated by models of how humans parse speech, claims that at any one time a human listening to music will keep track of a number of plausible analyses in parallel. In a similar vein, Huron describes the competing concurrent representation theory. Huron goes further to claim that, more than just a model of music cognition, "Competing concurrent representations may be the norm in mental functioning" (Huron 2006:108).

4.3.3 Architectures for Action

Behaviourist AI

Despite purporting to model human cognition, knowledge representation based approaches such as GTTM do not address the temporal nature of perception. Knowledge representations model static structures rather than the processes involved in converting sensations into these representations. Indeed the tree structures of generative theories cannot be computed causally; an entire sentence (or phrase) must be available for inspection before the analytical rules can be applied. Minsky criticises this approach in music representation as failing to capture the temporal and interactional relationships fundamental to musical experience:

In a computation based treatment of musical expression you would expect to see attempts to describe such sorts of structure. Yet the most 'respectable' present day analyses – e.g. the well known Lerdahl & Jackendoff work on generative grammars for tonal music – seem to me insufficiently concerned with such relationships. The so-called 'generative' approach purports to describe all choices open to a speaker or composer - but it also tries to abstract away the actual procedure, the temporal evolution of the compositional process. Consequently it cannot even begin to describe the choices a composer must actually face – and we can understand that only by making models of the cognitive constraints that motivate an author or composer ... a lot of what a composer does is setting up expectations, and then figuring out how to frustrate them. (Balaban et al. 1992:xvii)

An alternative school of thought within artificial intelligence, standing in stark contrast to the knowledge representation view, arose in the mid 80s: behaviour-based AI. This is an approach to AI that draws inspiration from Minsky's views on the gestalt nature of human intelligence (Minsky 1987). In this paradigm, intelligence is viewed as emerging from the interactions of competing behaviours, each of which directly connects perception to action.

The implementation of a behaviourist approach in an artificial intelligence system has been addressed by Rodney Brooks. He, and many others in this field, are primarily concerned with mobile robots. The computational engine behind the robot can be considered as an agent; the properties of robots as he discusses them apply equally well to behaviour based agents. Brooks lists the following properties as characteristic of behaviour based robots:

Situatedness The robots are situated in the world - they do not deal with abstract descriptions, but with the here and now of the world directly influencing the behaviour of the system.

Embodiment The robots have bodies and experience the world directly - their actions are part of a dynamic with the world and have immediate feedback on their own sensations.

Intelligence They are observed to be intelligent - but the source of the intelligence is not limited to just the computational engine. It also comes from the situation in the world, the signal transformations within the sensors, and the physical coupling of the robot with the world.

Emergence The intelligence of the system emerges from the system's interactions with the world and from sometimes indirect interactions between its components - it is sometimes hard to point to one event or place within the system and say that is why some external action was manifested. (1991b)

Subsumption Architecture

One prominent example of a behaviour-based AI design is the subsumption architecture (Brooks 1991a). The subsumption architecture consists of a collection of layers, each of which is a complete (though possibly simple) behavioural entity in the sense that its output maps to an action of the agent. The layers communicate with each other only via the sensory inputs and response outputs already existing in a lower layer, by suppressing the input to the layer or inhibiting the output of the layer. The bottom layer is thought of as a basic instinctive sensory system, whilst higher layers implement progressively more abstract and complex behaviours.


Brooks (1991a) lists several design requirements that motivated the development of the subsumption architecture:

1. Realtime interaction in a dynamic environment
2. Robustness to environmental noise
3. Ability to pursue multiple goals

Because all of the agent's behaviours are in operation (though possibly suppressed or inhibited) at any time, a subsumption architecture agent is able to operate in real-time and respond to unexpected environmental occurrences with relative ease. By not relying on a central knowledge representation the agent is more robust to encounters that lie outside the scope in which the representation was conceived. The pursuit of multiple goals is facilitated by the parallel running of behaviours – this also allows the agent to be opportunistic, abandoning a previous goal in favour of another action when an appropriate opportunity arises.
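
As a concrete illustration, here is a heavily simplified subsumption-style sketch (the behaviours are hypothetical, and real subsumption wires suppression and inhibition at the signal level rather than by simple priority ordering): each layer maps the current percept directly to an action, and a higher layer suppresses the layers below it by claiming the output.

    def wander(percept):
        return "step-forward"                    # layer 0: default behaviour

    def avoid(percept):
        if percept.get("obstacle"):
            return "turn-away"                   # layer 1: overrides wandering
        return None                              # abstain: defer to lower layers

    def seek_goal(percept):
        if percept.get("goal_visible"):
            return "approach-goal"               # layer 2: opportunistic goal pursuit
        return None

    layers = [seek_goal, avoid, wander]          # highest layer first

    def act(percept):
        # The first (highest) layer that produces an action suppresses the rest.
        for layer in layers:
            action = layer(percept)
            if action is not None:
                return action

    print(act({"obstacle": True}))               # turn-away
    print(act({"goal_visible": True}))           # approach-goal
    print(act({}))                               # step-forward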

4.4 Summary

In this chapter I have discussed a number of architectures for music representation, perception and action. I argued that, for robust musical interaction, an agent's music representations should not be separated from the overall architecture. Rather, a holistic architectural view in which perception and action are intertwined is likely to be more robust to novel musical contexts. Furthermore, I suggested that an effective architecture of this kind is a complex hierarchical topology with bi-directional communication between the layers of the hierarchy.

The Jambot's architecture follows this model. It is a simple hierarchy consisting of three layers: reception, analysis and generation. There is a natural flow of information up the hierarchy – the notes parsed during the reception stage form the input to the analysis stage, and the resulting analyses are used by the generation stage in producing appropriate accompaniment. There is also feedback of information down the hierarchy: the Jambot's own generated rhythm is combined with the parsed rhythm of the ensemble as input to the analysis stage, and metric information is passed down to the reception stage to modulate the sensitivity of the onset detection algorithms according to metric expectation.
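
The flow just described can be summarised schematically. The sketch below is purely illustrative – the class and function names are hypothetical, and the Jambot's actual implementation is far richer – but it shows the shape of the loop: note events flow up from reception through analysis to generation, while the agent's own output and metric expectations flow back down.

    class Reception:
        def __init__(self):
            self.sensitivity = 1.0               # modulated by metric expectation
        def parse(self, frame):
            # stub onset detector: keep (time, energy) pairs above the threshold
            return [t for t, energy in frame if energy > self.sensitivity]

    class Analysis:
        def analyse(self, ensemble_onsets, own_onsets):
            # crude pulse estimate over the combined rhythm (ensemble + own output)
            onsets = sorted(ensemble_onsets + own_onsets)
            gaps = [b - a for a, b in zip(onsets, onsets[1:])]
            return sum(gaps) / len(gaps) if gaps else None

    class Generation:
        def generate(self, period, now):
            return [now + period] if period else []

    reception, analysis, generation = Reception(), Analysis(), Generation()
    own_notes = []
    for frame in [[(0.0, 2.0), (0.5, 2.4)], [(1.0, 2.1), (1.5, 0.3)]]:
        heard = reception.parse(frame)
        period = analysis.analyse(heard, own_notes)     # feedback up: own notes included
        own_notes += generation.generate(period, heard[-1] if heard else 0.0)
        if period:
            reception.sensitivity = 0.9                 # feedback down: expect more onsets
    print(own_notes)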

Having discussed how the reception, analysis and generation stages fit together in a coherent architecture, in the next few chapters I will describe these individual stages in detail.


Chapter 5

Reception

5.1 Introduction

The first stage of the Jambot's perceptual apparatus is the conversion of a raw audio signal into musically salient events in time. I have named this stage reception. The reception module consists of a collection of onset detection algorithms and a polyphonic pitch tracking algorithm. The onset detection algorithms are primarily designed for detecting percussive onsets, and are specifically tuned to be able to discriminate between the kick, snare and hi-hat onsets of a standard drum kit. The three percussive onset detection algorithms involved, which I have termed the SOD, MID and LOW detectors, can be thought of as detecting high, mid and low frequency percussive onsets respectively. The polyphonic pitch tracking algorithm is of insufficient quality for production use, and consequently the Jambot currently utilises only rhythmic attributes of the input signal.

The SOD detector tracks noisy onsets, such as those of hi-hats, by looking for peaks in the noisiness of the signal. It is designed to operate in the presence of confounding noise, such as that present in an ensemble with distorted guitars and synthesisers. The SOD algorithm is a significant departure from existing onset detection algorithms. The LOW detector utilises a band-pass filtered energy onset detector, quite similar to existing algorithms, although the details of the implementation required to achieve an appropriate balance of accuracy versus latency are the result of substantial experimentation. The MID detector combines these two detectors to discriminate a snare drum onset from a kick-drum or hi-hat, and again the tuning of parameters in the MID detector represents a substantial research effort. These detectors all utilise an adaptive thresholding technique for real-time peak-picking.

The pitch tracking algorithm attempts to determine the harmonic content of the audio, and also to report on the timing of non-percussive onsets. It is an adaptation and amalgamation of a number of existing spectral techniques. It was designed to provide low latency frequency estimates for use in real-time imitation; however its accuracy is too low for use in a performance setting.

The output of the reception stage is a series of timestamped musical events. These events may be thought of as similar to MIDI notes, save that the timing information at this stage is expressed in clock time, rather than in the invariant representation of beat location. The notes possess a pitch, a velocity, and an onset time. The pitch tracking capacity of the Jambot is currently of insufficient quality for production usage, so the pitch of these notes is used only to discriminate between the different percussive timbres that the Jambot can detect.

The three percussive onset detectors mentioned above are all designed to report (and discriminate) an onset from the audio stream in real-time, and with very low latency. The latency of all three is sufficiently low to enable real-time imitation (and re-mapping) of the percussive elements of a complex audio stream. In other words, the time taken to report an onset is small enough that by generating a sound as soon as the onset is detected, the Jambot can play along with the input audio with an imperceptible delay. This capacity is essential for the transformational mimesis generation technique (§8.2.6).

I chose the term reception to be similar, but not identical, to the term perception. The reason for this is that I am adopting the Perception in Action perspective (§3.4.1), in which perception is viewed as inextricably bound to action. In this spirit I consider the Jambot's perceptual apparatus to encompass all stages of processing and responding, and embrace the interconnections between these processes. Indeed, in Chapter 4 I argued for the efficacy of an architecture that models connectivity and feedback between stages in the processing/responding chain, and have implemented such an architecture in the Jambot (§4.2.1).

5.2 Stochastic Onset Detection

The Stochastic Onset Detection technique (SOD) is a novel approach to extracting percussive onsets from an audio signal. The crux of the technique is to search for patterns of increasing noise in the signal. The SOD technique is designed for use with complex audio signals consisting of both pitched and percussive instrumental sounds together, and aims to report solely on the timing of noisy percussive attacks, such as hi-hats. In contrast to most onset detection algorithms it operates in the time domain and is very efficient, suiting my requirement for real-time detection.


Figure 5.1: Attacks can be masked in multipart signals

5.2.1 Existing Onset Detection Techniques

A survey of onset detection techniques is given by Bello et al. (2005). A common approach shared by the majority of these techniques can be summarised as follows:

• the input signal is distilled into a reduced form called the detection function;

• the detection function is then searched for recognisable features, often peak values; and

• these features are filtered, and then reported as onsets.

The simplest method for detecting onsets is to look for growth in the amplitude envelope. However, in the presence of complex audio signals containing multiple musical parts this technique is not viable. For example, Figure 5.1 shows the waveform for a sustained synthesiser note with a kick drum sound in the middle (corresponding to the example audio file kick and synth.mp3). The kick drum is clearly audible but its onset does not correspond to a peak in the amplitude envelope. To deal with situations like this, a number of onset detection algorithms first split the signal into frequency bands using a Fourier transform. Onsets are then associated with growth in the energy of any band. One algorithm using this technique is Miller Puckette's bounded-Q onset detector, available as the bonk∼ external for Max/MSP (1998). Another example is the High Frequency Content (HFC) detection function of Masri & Bateman (1996), which aggregates energy across all bins but preferentially weights higher frequencies.
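
For concreteness, here is a minimal numpy sketch of an HFC-style detection function (illustrative only; the window and hop sizes are assumed, and the exact weighting in Masri & Bateman's formulation differs in detail):

    import numpy as np

    def hfc(frame):
        # Power spectrum of a Hanning-windowed frame, linearly weighted by bin index.
        power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
        return np.sum(np.arange(len(power)) * power)

    def hfc_detection_function(signal, n_fft=1024, hop=256):
        # One HFC value per frame; percussive onsets show up as peaks.
        return np.array([hfc(signal[i:i + n_fft])
                         for i in range(0, len(signal) - n_fft, hop)])

    # Toy input: a quiet sine with a burst of noise (a "percussive attack") inserted.
    sr = 44100
    t = np.arange(sr) / sr
    x = 0.2 * np.sin(2 * np.pi * 220 * t)
    x[sr // 2 : sr // 2 + 512] += 0.5 * np.random.randn(512)
    print(int(np.argmax(hfc_detection_function(x))))     # frame index near (sr/2)/256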

However, for complex audio signals in which there is power throughout the spectrum, the growth of energy in a frequency sub-band due to a percussive attack may still be masked by the ambient power of the signal in that band. The SOD technique described in this chapter is designed to address this problem by seeking time domain artifacts of percussive attacks that are absent in periodic signals.

5.2.2 Rapidly Changing Component

The SOD technique adopts the Deterministic Plus Stochastic model of Serra (1997) for modelling musical signals. In this model a musical signal is considered to consist of a deterministic component, which may be described as a combination of sinusoids, and a stochastic component, which is described by a random noise variable. The crux of the SOD technique is the assumption that a percussive onset will be characterised by an increase in the noise component of the signal.

In order to identify times of increased noise, I define a time domain component of the signal called the Rapidly Changing Component (RCC). The RCC can be thought of as the zigzags in the signal. For example, referring to Figure 5.1, the signal is smooth when the synthesiser is playing on its own. The time at which the kick-drum starts is visually discernible because the signal becomes rougher (contains more zigzags) at this point.

The RCC consists of both high frequency sounds and noisy sounds. The SOD algorithm operates by separating the RCC from slower moving components, then measuring the loudness of the RCC and estimating what fraction of it is due to noise.

Transient Detection

Transients give rise to high frequency components in a Fourier Transform. One approach to transient detection is the High Frequency Content (HFC) technique (Masri & Bateman 1996). As the name suggests, this approach aggregates all of the energy in high frequency bands (to be precise, it aggregates all bands but linearly weights by frequency). The HFC approach does, however, have some limitations. It suffers from the same basic problem as a direct amplitude approach, but in more limited circumstances; if the periodic part of the signal has a lot of energy in high frequencies, then the growth in the HFC due to the percussive onset may be small in comparison to the ambient level of HFC, degrading the signal/noise ratio of the detection function.

The SOD technique is designed to improve upon the HFC technique in this respect. The approach is to look at the short timescale activity and measure how random it is. The growth in that randomness then forms the onset detection function. In this way the presence of background high frequency periodic content will not affect the detection function.

5.2.3 Description of the SOD Algorithm

The SOD algorithm is designed for realtime use with minimal latency. The input signal is processed in short windows of 128 samples. For each window I measure the level of noise in the signal. This measurement consists of four steps:

1. Separate out the RCC

2. Measure the size of the RCC

3. Measure the randomness of the RCC

4. Estimate the loudness of the stochastic component – this is the detection function.

Having obtained the most recent value of the detection function, an adaptive peak picking algorithm (described below) is employed to look for significant growth in the noise. Points of significant growth that exceed an absolute noise threshold are marked as percussive onsets.

Splitting out the RCC

The first step in the construction of the noise measure is separating the Rapidly Changing Component (RCC) from the rest of the signal. To do this I use a little rocket science, drawing inspiration from a technique developed at NASA called Empirical Mode Decomposition (Huang et al. 1998). This is a technique for extracting 'modes' from a nonlinear signal, where a mode may have a varying frequency through time. The basic idea is that to get the RCC you look at adjacent turning points of the signal (i.e., the local maxima and minima) and consider these to be short timescale activity around a carrier signal, which is taken to be halfway between the turning points. This is similar to creating a smoothed carrier wave by using a moving average with a varying order, and taking the RCC to be the residual of the signal from the carrier wave. The process is illustrated in Figure 5.2.

Stochastic Components of the RCC

Generally the RCC will comprise both the stochastic component of the signal and high frequency parts of the deterministic component. So as to get a sense of the relative sizes of these contributions to the RCC, I measure the level of randomness in the RCC.


Figure 5.2: Splitting out the RCC, Amplitude vs Samples

The statistic I use to measure the level of the stochastic component is the first order autocorrelation, which measures how related the signal is to itself from one sample to the next. The stochastic component of the signal has statistically independent samples, and so will have an autocorrelation of zero. The deterministic component, on the other hand, will be strongly related to itself from one sample to the next, and so should have an autocorrelation close to one. The autocorrelation of the RCC will then reflect the relative amplitudes of these two components: an autocorrelation close to zero means that the RCC is mostly stochastic, whilst an autocorrelation close to one means that the RCC is mostly deterministic.

Another measure of randomness that could be considered is the signal entropy (Shannon 1948). The use of entropy in searching for changes in the signal noise was explored by Bercher & Vignat (2000), who give an adaptive procedure for estimating the entropy. However, their procedure is not intended for realtime use; indeed the calculation of entropy is computationally expensive (Hall & Morton 2004). Furthermore, the autocorrelation measure has the advantage of a direct interpretation as approximating the proportion of the RCC that is deterministic. Conversely, if I take the measurement of randomness to be 1 − c, where c is the autocorrelation, then this will be an approximate measure of the proportion of the RCC attributable to noise. For these reasons I prefer the autocorrelation measure to entropy.


Description of Noise Measure

Having extracted the RCC, its loudness can be reported. Then, having also estimated the stochastic component of the RCC, and hence the approximate proportion of the RCC attributable to noise, I can estimate the loudness of the noise in the signal by multiplying the amplitude of the RCC by its stochastic component. In more detail, the noise measure is constructed as follows (a code sketch follows the list):

1. Split the signal into rectangular analysis windows (I have used a window size of 128 samples).

2. Calculate the Rapidly Changing Component:

(a) Find the turning points of the signal.

(b) The carrier wave is assumed to be halfway between adjacent turning points of the signal, so construct the carrier wave by linearly interpolating between these midpoints.

(c) The Rapidly Changing Component is the difference between the signal and the carrier wave.

3. Calculate the size of the RCC:

Size_RCC = Std. Dev. of the derivative of the RCC

4. Calculate the randomness of the RCC:

Randomness_RCC = 1 − autocorrelation of the RCC

5. Calculate the noise:

Noise_RCC = Size_RCC × Randomness_RCC
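
The five steps translate directly into code. The following numpy sketch is illustrative – it follows the description above rather than the Jambot's actual implementation, and edge cases are handled crudely:

    import numpy as np

    def noise_measure(window):
        # Step 2: split out the RCC via turning points and a midpoint carrier wave.
        d = np.diff(window)
        turning = np.where(d[:-1] * d[1:] < 0)[0] + 1     # local maxima and minima
        if len(turning) < 3:
            return 0.0                                    # too smooth to measure
        mid_x = (turning[:-1] + turning[1:]) / 2.0
        mid_y = (window[turning[:-1]] + window[turning[1:]]) / 2.0
        carrier = np.interp(np.arange(len(window)), mid_x, mid_y)
        rcc = window - carrier
        # Step 3: size of the RCC = standard deviation of its derivative.
        size = np.std(np.diff(rcc))
        # Step 4: randomness = 1 - first order autocorrelation of the RCC.
        a, b = rcc[:-1] - rcc[:-1].mean(), rcc[1:] - rcc[1:].mean()
        denom = np.std(a) * np.std(b)
        autocorr = np.mean(a * b) / denom if denom > 1e-12 else 0.0
        # Step 5: noise = size * randomness.
        return size * (1.0 - autocorr)

    def detection_function(signal, win=128):
        # Step 1: rectangular 128-sample analysis windows.
        return np.array([noise_measure(signal[i:i + win])
                         for i in range(0, len(signal) - win, win)])

    # A steady sine stays quiet; a hi-hat-like noise burst produces a clear peak.
    sr = 44100
    t = np.arange(sr // 4) / sr
    x = 0.5 * np.sin(2 * np.pi * 440 * t)
    x[5000:5600] += 0.3 * np.random.randn(600)
    print(int(np.argmax(detection_function(x))))          # window index near 5000/128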

An example of the Noise detection function for a short audio sample (corresponding to the file JungleBoogie.mp3 in the examples) is shown in Figure 5.3. Also shown for comparison are the HFC detection function and a Bounded-Q detection function similar to that used by bonk∼. The horizontal axis is time (measured in analysis windows of 128 samples), and onsets are identified by looking for peaks in these detection functions. The Noise detection measure has much more clearly discriminated peaks than the other two.


Figure 5.3: Comparison of Detection Functions


Adaptive Thresholding

Having calculated the noise function, the next step is to identify peaks, which are interpreted as percussive attacks. More precisely, an attack is reported when there is sudden growth in the noise, followed by a peak and then a decay – a situation I describe as a 'significant jump' in the noise function. Different pieces of music may have markedly different noise characteristics; the size of a jump which is significant will depend on the ratio of the ambient noisiness of the pitched instruments to that of the percussive instruments. To deal with this variation between musical signals I have used an adaptive thresholding technique.

I maintain measures of the mean and standard deviation of the noise in the recent past using an Exponentially Weighted Moving Average. For each new window I update these measures by accumulating a weighted value from the preceding window (I currently use a weighting of 8%). So, for each new window, the measures of the mean and standard deviation of recent history will be 92% of what they were before, plus 8% of the values for the immediately preceding window. This process allows for the identification of a significant jump in the noise level – a point where the noise level is some number of standard deviations above the mean of the recent past.

Once an onset is detected using this technique, it is not necessary to report any more onsets until the current attack is completed. A common strategy for measuring attack completion is to maintain a high and a low threshold; for an onset to be reported the detection function must exceed the high threshold, and then no further onsets will be reported until the detection function has dropped below the low threshold. I have utilised an adaptive version of this technique, for the reasons mentioned previously. Once a significant jump is detected, an ongoing measure of the peak value of the detection function is maintained, and the attack is considered to be ongoing until the detection function has dropped sufficiently that the recent past is significantly lower than the peak (using the same exponentially weighted moving average scheme as for detecting the onset).

The detected onsets are then further filtered by an absolute noise threshold. To be considered as an attack, a significant jump must have a peak value higher than this threshold. To allow for realtime responsiveness to the signal with minimum latency, the onset is allowed through the filter as soon as the ongoing measure of its peak value exceeds the noise threshold. For example, an open hi-hat onset will have a rapid increase in noise level but not a quick decay – so if the algorithm were to wait until the noise had peaked before reporting the onset it would incur significant latency.
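
A minimal sketch of this adaptive thresholding scheme follows. The 8% EWMA weight is taken from the text; the 2-sigma significance level and the absolute floor are assumed values, and the end-of-attack test is simplified to comparing the current value against the running peak rather than a second EWMA.

    import numpy as np

    class AdaptiveThreshold:
        def __init__(self, weight=0.08, sigmas=2.0, floor=0.05):
            self.w, self.k, self.floor = weight, sigmas, floor
            self.mean = self.var = self.peak = 0.0
            self.in_attack = self.reported = False

        def step(self, value):
            # Feed one detection-function value; returns True when an onset fires.
            onset, std = False, self.var ** 0.5
            if not self.in_attack:
                if value > self.mean + self.k * std:      # significant jump begins
                    self.in_attack, self.reported, self.peak = True, False, value
            else:
                self.peak = max(self.peak, value)
                if value < self.peak - self.k * std:      # decayed well below the peak
                    self.in_attack = False
            if self.in_attack and not self.reported and self.peak > self.floor:
                onset, self.reported = True, True         # fire with minimal latency
            # EWMA update of the recent mean and variance (92% old, 8% new).
            delta = value - self.mean
            self.mean += self.w * delta
            self.var = (1 - self.w) * (self.var + self.w * delta * delta)
            return onset

    baseline = 0.02 + 0.005 * np.random.rand(50)
    df = np.concatenate([baseline, [0.5, 0.4, 0.1], baseline])
    threshold = AdaptiveThreshold()
    print([i for i, v in enumerate(df) if threshold.step(v)])   # -> [50]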


5.2.4 Computational Efficiency

The stochastic onset detection (SOD) algorithm presented in this chapter is quite efficient. No FFT is required because it works in the time domain. In my real-time implementation, a 128 point sample window took approximately 8 samples' worth of time to process. It is also quite responsive, because the RCC measurements can be calculated on small sample buffers – typically as small as 32 samples, providing a latency of less than 1ms.

5.2.5 Example Results

I applied this algorithm to a selection of audio snippets containing complex audio with percussion. In addition to these hand selected tracks, I tested the algorithm against the MIREX Audio Tempo Extraction training data set (MIREX 2011). Training snippets in this set that did not have any percussion parts were omitted.

The Noise detection function generally seems to have a superior signal to noise ratio compared to the HFC or Bounded-Q detection functions. For example, referring back to Figure 5.3, of the three detection functions the Noise detection function has the most clearly defined peaks.

I evaluated the algorithm by having it 'jam' along with the audio track (in realtime), mimicking what it hears by triggering a MIDI percussion sound when it detects an onset. The Noise measure also gives an estimate of the amplitude of the onset, and this information is used to determine the velocity of the MIDI imitation. As a comparison, I performed the same trials using the bonk∼ external for Max/MSP.

These results are not intended to be exhaustive or comprehensive. I have tested the algorithm with a limited range of musical examples and performed only aural analysis of the results. Furthermore, I have compared SOD only to the Bounded-Q and HFC approaches, whilst there are other more recent detection algorithms that would warrant comparison (Bello et al. 2005). Nonetheless, these results suggest that the SOD approach is generally more robust than the algorithm in bonk∼.

The algorithm appears to be particularly attuned to hi-hat and cymbal onsets. For example, referring once again to Figure 5.3, in the snippet from Jungle Boogie the Noise Detection algorithm follows the hi-hats solidly, whilst the HFC algorithm appears more drawn to the guitar rhythm (and the Bounded-Q algorithm is totally at sea). The evaluations of these three algorithms may be heard online in the examples as JungleBoogie nd.mp3, JungleBoogie hf.mp3, and JungleBoogie bq.mp3. The other examples are of the form train* nd.mp3 for the Noise Detection sample and train* bq.mp3 for the Bonk sample.

5.2.6 Low Percussive Onsets

The LOW onset detector reports on low frequency percussion sounds, such as a kick-drum. This detector is a band-pass filtered energy onset detector – a common approach to percussive onset detection (Bello et al. 2005). The devil is, however, in the detail. As Dixon notes, "relatively small differences in implementation ... and parameter settings greatly influence the results" (2006:135). The implementation and parameter settings for the LOW detector were carefully tuned to achieve an appropriate balance between accuracy and latency.

The evaluative criterion for acceptable latency was imperceptibility – the latency was required to be sufficiently low to allow real-time imitation without perceptible onset asynchrony. My experience in developing the SOD algorithm was that a latency of 128 samples @44.1kHz (approximately 3ms) was desirable, with 256 samples being the upper limit.

In tension with the latency requirement was the desire to accurately discriminate between the snare and kick sounds of a standard drum kit. The most obvious differentiating feature between the two classes of sound is the frequency bands in which most energy is contained, with a kick drum onset containing more low frequency content than a snare hit. In order to discriminate reliably, I found it necessary to allow sufficient time for at least a quarter cycle of the lowest frequency measured.

I found it best to measure frequencies down to around 40Hz, which is near the lower threshold of human hearing. To this end I utilised a 1024 point FFT (with a sample rate of 44.1kHz); the first bin of this transform corresponds to 43Hz. However, a latency of 1024 samples was unacceptable, so a hop size of 256 samples (corresponding to an overlap of 75%) was used, corresponding to a quarter cycle of the lowest bin frequency. The use of a 75% overlap in successive analysis windows also helped smooth the detection function.

The LOW detection function was calculated by summing the first two bins (corresponding to 43Hz and 86Hz) of the Hanning windowed power spectrum of this 1024 point FFT with 75% overlap. The LOW onset detector uses the same adaptive thresholding technique as the SOD detector.
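
A numpy sketch of the LOW detection function follows (illustrative; thresholding is omitted, and the windowing details are as described above):

    import numpy as np

    def low_detection_function(signal, n_fft=1024, hop=256):
        # Hanning-windowed 1024-point FFT, 75% overlap; sum power in the
        # first two non-DC bins (~43Hz and ~86Hz at a 44.1kHz sample rate).
        window = np.hanning(n_fft)
        values = []
        for i in range(0, len(signal) - n_fft, hop):
            power = np.abs(np.fft.rfft(signal[i:i + n_fft] * window)) ** 2
            values.append(power[1] + power[2])
        return np.array(values)

    # A kick-like 50Hz thump against a 440Hz tone produces a clear bump.
    sr = 44100
    t = np.arange(sr // 2) / sr
    x = 0.3 * np.sin(2 * np.pi * 440 * t)
    n = np.arange(4000)
    x[8000:12000] += np.exp(-n / 800.0) * np.sin(2 * np.pi * 50 * n / sr)
    print(int(np.argmax(low_detection_function(x))) * 256)   # sample index near 8000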


5.2.7 Mid Percussive Onsets

The MID onset detector is aimed at extracting snare hits from complex audio. Snare onsets lie between hi-hat and kick onsets, in the sense that a snare onset contains both broadband mid-frequency energy and a stochastic noise component. The MID onset detector exploits this structure by combining a band-pass energy detection scheme with the SOD approach discussed in §5.2.3 above.

The noisiness of a snare hit is similar to that of a hi-hat, but operates on a slower timescale. One component of the MID detection function is therefore the SOD detection function applied to a downsampled version of the signal. Downsampling the signal has the effect of expanding the timescale on which the SOD algorithm operates. In this case the downsampling uses a factor of 4.

The second component of the MID detection algorithm is a band-pass filtered energy detection function, similar to the LOW algorithm. The MID algorithm uses the same 1024 point Hanning windowed FFT with 75% overlap as the LOW algorithm, but calculates the summed energy of bins 3 to 6 (corresponding to frequencies of 130Hz to 260Hz).

The MID detection function is formed by taking the geometric mean of the noise measure and the broadband energy measure. The same adaptive thresholding technique is used as in the SOD and LOW detection functions.
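
Combining the two components is then straightforward. The sketch below reuses the noise_measure() function from the SOD sketch in §5.2.3 and is illustrative only; a real implementation would low-pass filter before downsampling.

    import numpy as np

    def mid_detection_function(signal, n_fft=1024, hop=256):
        # Geometric mean of (a) the SOD noise measure on a 4x-downsampled
        # signal and (b) band energy in bins 3-6 (~130Hz to ~260Hz).
        window = np.hanning(n_fft)
        slow = signal[::4]                       # crude 4x downsample (no filtering)
        values = []
        for i in range(0, len(signal) - n_fft, hop):
            power = np.abs(np.fft.rfft(signal[i:i + n_fft] * window)) ** 2
            band = power[3:7].sum()
            noise = noise_measure(slow[i // 4 : i // 4 + 128])
            values.append(np.sqrt(band * noise))
        return np.array(values)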

5.2.8 Spectral Flux Onsets

In order to allow the Jambot to operate when there are no percussive elements, I implemented the existing Spectral Flux onset detection scheme, which performs favourably compared with other pitched onset detection schemes (Dixon 2006), and is both simple to implement and relatively efficient to calculate.

The spectral flux is defined as the sum of all positive changes in power across the bins X(t,k) of an FFT from one frame to the next:

SF(t) = ∑_{k : G(t,k) > 0} G(t,k),   where G(t,k) = X(t,k) − X(t−1,k)        (5.1)

The spectral flux onset detection algorithm looks for significant growth in the spectral flux. It uses the same adaptive thresholding technique as the SOD, MID and LOW algorithms.
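
Equation 5.1 translates directly into a few lines of numpy (a sketch; the frame and hop sizes are assumed):

    import numpy as np

    def spectral_flux(signal, n_fft=1024, hop=256):
        window = np.hanning(n_fft)
        frames = np.array([signal[i:i + n_fft] * window
                           for i in range(0, len(signal) - n_fft, hop)])
        power = np.abs(np.fft.rfft(frames)) ** 2    # one spectrum per frame
        growth = np.diff(power, axis=0)             # G(t,k) = X(t,k) - X(t-1,k)
        return np.maximum(growth, 0.0).sum(axis=1)  # keep only positive changes

    # A legato pitch change (no attack) still produces a peak in the flux.
    sr = 44100
    t = np.arange(sr // 2) / sr
    freq = np.where(t < 0.25, 330.0, 440.0)
    x = np.sin(2 * np.pi * np.cumsum(freq) / sr)    # continuous-phase pitch change
    print(int(np.argmax(spectral_flux(x))) * 256)   # near sample 11025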


5.3 Polyphonic Pitch Tracking

There is a large body of work on monophonic pitch tracking, particularly from the speech analysis community. Much less attention has been devoted to polyphonic pitch tracking, which is generally regarded as a difficult and unsolved problem (Hainsworth 2004; Collins 2006b). However, the salient features of musical material required for generative musical accompaniment need not be as accurately described as those required for transcription or music information retrieval. The pitch features identified as most important in a range of studies include pitch-class set, pitch range or contour, and changes in these over time. These lead to structural features including tonality, harmonic change, textural density, grouping and proximity, which can be used for generative purposes (Sloboda 1988; Temperley 2001; Cope & Hofstadter 2001).

In this section I describe an algorithm designed to extract salient harmonic material from complex polyphonic audio, to allow for appropriate real-time improvisation. The results of this algorithm were deemed to be of insufficient quality to be useful in a real performance situation, and so it has not been included in the Jambot. Nevertheless the algorithm goes some way towards this goal.

The aim of the pitch tracking algorithm was not to achieve full transcription of the audio, but rather to garner sufficient harmonic and onset timing information for an appropriate improvisatory response. The design goals of the algorithm were:

1. Tonality analysis – I want to be able to identify the current ‘key’.

2. Timing information from pitch changes – I wish to be able to identify the times at which a new note is played, even in the case where the note changes are the result of a legato movement – for which no attack is present – so as to enable tempo, metre and rhythm to be inferred.

3. Cope with low frequencies – much of the important harmonic information is provided by bass instruments, and many pitch tracking algorithms require large window sizes to cope adequately with the lowest notes on the bass guitar.

4. Avoid False Positives – with a focus on salient features it was important to have correct information, even at the expense of less complete data.

5.3.1 Pitched Onset Detection

A large variety of methods for the detection of note onsets have been discussed in the literature; a survey may be found in (Collins 2006a). Typically, onset detection algorithms are analysed in terms of their performance on different types of sounds. Bello et al. (2005) consider the broad classes of Pitched Non Percussive (PNP), Pitched Percussive (PP), Non Pitched Percussive (NPP) and Complex Mixtures (CMIX) sounds. Broadly speaking, energy based techniques are reasonably successful for Percussive sounds but fail spectacularly for Pitched Non Percussive. Bello reports on a technique based on spectral phase information that is relatively successful in the PNP class.

The algorithm I present in this section concentrates on reporting onsets for the PNP class of sounds. An example of the type of onset I am considering would be a wind instrument playing a legato passage. Intuitively, in order to detect note boundaries in this class of sounds, one must have an accurate estimate of the frequency of the sound through time, so that an onset may be reported when the frequency changes substantially.

The reason I concentrate on this class is that, in accord with the intuition expressed above, accurate onset timing information in the PNP class of sounds is an added bonus of accurate frequency estimation, which is the second goal of this algorithm. Indeed Bello's technique essentially estimates the instantaneous frequencies contained in the audio signal (although he is not explicitly interested in these estimates). The algorithm I present here operates on a similar principle to Bello's, producing accurate frequency information and timing information for the PNP class of sounds.

If the timing information is to be useful for beat and metre induction tasks, it seems likely that a high degree of time resolution for onset events is desirable. As I am proposing to utilise frequency information to report on note boundaries, a practical consideration arises regarding the size of the analysis window, namely that for low frequencies the analysis window may be too short to obtain an accurate estimate of the frequency. In signal processing there is generally a tradeoff between time resolution and frequency resolution (Brown 1998). I am aiming at a time resolution of 1024 samples at a sample rate of 44100Hz, corresponding to approximately 23ms.

I wish to be able to determine the frequency content of pitched musical material to within a semitone across the range of frequencies commonly present in (western popular) music, say 60Hz to 16000Hz. A fundamental problem is that a window size of 1024 samples represents around 1.5 cycles of a 60Hz frequency component. The low number of cycles makes it difficult to accurately estimate the frequency of low frequency components with the desired time resolution.

Because my goal is to isolate salient pitch material that will enable inference of the current harmonic activity, such as key or chord progression, I am not so concerned about accurately detecting all notes being produced. My plan is to first identify the spectral peaks in the analysis window – these are viewed as the constituent frequencies. I then associate harmonically related frequencies together as being part of one pitch. This yields a number of distinct pitches present in the analysis window; this information is fed to the generation module. The nature of overtone series for physically produced sounds suggests that an appropriate method for associating harmonically related frequencies is to start with the lowest frequency and associate with it any frequencies that are an integer multiple of it. To this end it is important to have an accurate estimate of the fundamental frequency, as errors in this estimate become compounded with each multiple.

A number of pitch tracking techniques attack this problem by using information from harmonics of the fundamental to refine the pitch estimate. The harmonics, being at higher frequencies, have more cycles within the analysis window and so can be more accurately estimated. Then, assuming that the sound source is harmonic, the frequency of the fundamental can be estimated as a common divisor of the frequencies of the harmonics. One such technique is the Harmonic Product Spectrum (Noll 1969), which can be quite effective for pitch tracking of monophonic harmonic sources. Another technique that utilises information from a larger portion of the spectrum to infer the frequency of the fundamental is Goto's predominant F0 estimation (Goto 2004).

However, approaches that use information from the whole spectrum to estimate the frequency of the fundamental introduce an undesirable circularity to the analysis; whilst trying to ascertain whether or not a given frequency is a harmonic of a given fundamental frequency, the process relies upon refining the estimate of the fundamental based on the assumption that it is harmonically related to the given frequency. For monophonic audio sources with a known spectrum this does not pose a great difficulty, but in the case of polyphonic audio composed of an unknown number of varied sources it can be problematic.

In §5.3.4 below I describe a novel frequency estimation technique that avoids such circularity by obtaining an accurate estimate of the fundamental frequency independently of the rest of the spectrum. Having done this, the estimate of the fundamental frequency may be further refined by whole spectrum techniques such as those above, but applying the new technique first facilitates an accurate assessment of which of the constituent frequencies are indeed harmonically related to each other, before using the harmonic relations to refine the frequency estimates.


5.3.2 Frequency Estimation Techniques

There is a range of techniques commonly used to estimate the fundamental frequency of a signal. I will explore the most prominent of them and highlight ways in which they may not meet my requirements.

Time Domain:

1. Zero Crossings. A simple method for frequency estimation is to count the number of times that the signal crosses zero in a given period. For low frequency signals, where the number of cycles is a small multiple of the analysis period, the discrete nature of the zero crossing events leads to large variations in the frequency estimate depending on the phase of the signal.

2. Autocorrelation. The autocorrelation of a signal is the correlation of the signal with a time-lagged copy of itself. A periodic signal should be most highly correlated with itself at a time-lag equal to its period. Autocorrelation frequency estimation techniques utilise this by calculating the autocorrelation as a function of time-lag (the autocorrelation function) and searching for a maximum of this function. In practice this technique has a large variance for low frequency signals and so is unsuitable for my purposes.

Frequency Domain: Fourier techniques generally take as a starting point the spectrum of the windowed signal, usually obtained via a Fast Fourier Transform (FFT) algorithm. The most straightforward application of the spectrum is to plot its power versus bin number and identify the bins at which the power has a significant peak. The centre frequency of each such bin is then identified as a constituent frequency of the signal. A problematic issue with this approach is that the resolution of the frequency estimates is determined by the size of the bins. In a standard FFT the bins are equally sized at SF/N, where SF is the sampling frequency and N is the number of samples in the window. In our case the sampling frequency is 44100Hz and the number of samples is 1024, yielding a bin size of 43Hz. This means that frequency estimates obtained in this manner are accurate to within 43Hz, which is ample for high frequency components but insufficient for low frequency components.

A number of techniques exist for either increasing the resolution of the FFT, or interpolating frequencies that lie between bin centres:

1. Zero Padding. Before taking the FFT the signal is padded with zeros – i.e. the signal vector is concatenated with a vector of zeros, producing a longer vector upon which the FFT is performed. The zeros do not affect the frequency composition of the signal, but the frequency estimates are effectively interpolated between bin centres (Smith 2003).

2. Parabolic Interpolation. The frequency of a component that is identified by a peak in the spectrum is refined by fitting a quadratic function to the value of the power spectrum at the peak bin and the bins either side of it. The refined frequency estimate is given by the location of the maximum of the fitted curve (Smith 2003); a code sketch is given after this list. Other more elaborate interpolation schemes also exist (Milivojevic et al. 2006).

3. Constant Q transform. Rather than measuring the power at equally spaced frequency intervals as the FFT does, the Constant Q transform measures power at exponentially spaced frequency intervals (Brown & Puckette 1992). This means that the frequency resolution in percentage terms is equal across the spectrum.

4. Chirp Z transform. This transform utilises equally spaced frequency intervals, but concentrated in a frequency band of interest rather than across the whole spectrum (Rabiner et al. 1972).
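
As promised above, here is a sketch of the parabolic refinement (illustrative; interpolating on the log power spectrum is common, but plain power is used here for simplicity):

    import numpy as np

    def parabolic_peak(power, k):
        # Fit a parabola through bins k-1, k, k+1; return the interpolated bin.
        a, b, c = power[k - 1], power[k], power[k + 1]
        return k + 0.5 * (a - c) / (a - 2 * b + c)

    sr, n = 44100, 1024
    freq = 443.0                                    # deliberately between bin centres
    x = np.sin(2 * np.pi * freq * np.arange(n) / sr) * np.hanning(n)
    power = np.abs(np.fft.rfft(x)) ** 2
    k = int(np.argmax(power))
    print(k * sr / n, parabolic_peak(power, k) * sr / n)   # raw vs refined estimate

Repeating this experiment at 60Hz, where the window holds only around 1.5 cycles, exhibits the large variance described in the next paragraph.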

All of the methods above do a good job of either interpolating between bin centres, or increasing the resolution of frequency estimates. However, analysis of a low frequency sine wave using these methods reveals that resolution is not the only issue. Indeed all of these methods yield a similar result, and have a high variance when estimating a single sinusoidal component of known frequency. In particular, for a 60Hz sinusoid over a 1024 sample analysis window, the variance in the frequency estimate for all of these techniques exceeds a semitone; consequently these techniques are insufficient for our purposes. The fundamental issue is that for a low frequency signal the peak of the power spectrum is simply not an accurate estimator of the frequency.

5.3.3 Phase Based Techniques

The Fourier spectrum contains more information than just the power spectrum; additionally it contains phase information. The efficacy of exploiting this phase information for the purposes of frequency estimation was famously described by Flanagan & Golden (1966) in their description of the Phase Vocoder. The essential idea is that the instantaneous frequencies of the signal are equal to the time derivatives of the phases. The values of the instantaneous frequencies plotted against the Fourier bin number is known as the Instantaneous Frequency Distribution (IFD) (Charpentier 1986).


A number of techniques have been proposed for the calculation of the IFD. The original Phase Vocoder simply advances the signal by one sample, computes another FFT, and approximates the phase derivatives by the difference in phase divided by the sample period. This is however computationally expensive, as it involves computing an FFT for every sample in the signal.

Charpentier (1986) proposed utilising symmetry properties of the Fourier Transform to approximate the FFT of a window advanced by one sample from the FFT of the original window. This way one need only calculate one FFT per analysis window, and obtains very similar results to the Phase Vocoder technique.
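The following Python sketch shows the underlying phase-difference calculation. For clarity it simply computes the second, one-sample-advanced FFT directly rather than deriving it via Charpentier's symmetry properties, so it is less efficient but yields the same estimates; the test tone is my own choice.

```python
import numpy as np

def ifd(x, sr, n=1024, hop=1):
    """Instantaneous Frequency Distribution from two FFTs `hop` samples
    apart (the phase-vocoder method)."""
    w = np.hanning(n)
    s0 = np.fft.rfft(x[:n] * w)
    s1 = np.fft.rfft(x[hop:n + hop] * w)
    k = np.arange(len(s0))
    expected = 2 * np.pi * k * hop / n               # bin-centre phase advance
    dphi = np.angle(s1) - np.angle(s0) - expected
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi   # wrap deviation into [-pi, pi)
    return (expected + dphi) * sr / (2 * np.pi * hop)  # Hz, one value per bin

sr, n = 44100, 1024
t = np.arange(n + 1) / sr
freqs = ifd(np.sin(2 * np.pi * 60 * t), sr, n)
print(freqs[1])   # near 60 Hz, even though bin 1's centre is ~43 Hz
```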

Brown & Puckette (1998) expand on Charpentier’s work and compare the accuracy of this technique to a similar technique where the signal is advanced by H (the hop size) samples instead of one. The accuracy of the estimates of the IFD increases with larger hop size, but there is a tradeoff. The authors identify two issues:

• The frequency estimate from this technique is essentially using information from a window of size N + H, where N is the size of the analysis window, so the time resolution of this technique decreases with larger hop size.

• The estimate becomes increasingly vulnerable to mistakes in the phase unwrapping. The measured change in phase corresponds to some integer number of cycles plus a measured fraction of a cycle. The number of whole cycles that the phase has gone through is unknown, except by virtue of some other independent estimation of the frequency.

The technique I present in this section, which extends this method, addresses these two issues.

To use the IFD to determine the constituent frequencies of a signal, one computes the spectrum, picks the peaks of the power spectrum, and then reports the values of the IFD at the bins corresponding to the peaks of the power spectrum. A further refinement to this class of techniques has been proposed by Kawahara et al. (1999). The idea is that, when plotting instantaneous frequency against frequency, where there is a true frequency component in the signal the instantaneous frequency should equal the frequency. In other words the mapping frequency ⇒ instantaneous frequency should have a fixed point at every true frequency in the signal. Much as the discrete Fourier spectrum can be interpolated, the phase spectrum can be interpolated. Then searching for fixed points on the interpolated IFD yields refined frequency estimates.

Masataka Goto (2004) has utilised fixed point analysis of the IFD in conjunction with a Bayesian belief framework in his predominant F0 estimation. A number of other authors have utilised Bayesian techniques in the context of polyphonic transcription (Cemgil 2004; Hainsworth 2004).

5.3.4 Harmonic Content Extraction Technique

The technique I propose is essentially a hybrid of the above techniques, tailored to suit my specific needs. It is a two stage estimation technique similar to those of Charpentier (1986) and Brown & Puckette (1998), which addresses the shortcomings of these techniques by utilising a combination of IFD fixed point analysis, belief propagation, and MQ-analysis (McAulay & Quatieri 1986). The algorithm is as follows:

The input signal is analysed in overlapped windows of 1024 samples with 75% overlap. Internally, a number of beliefs are held, consisting of a frequency, amplitude and phase value for a component currently believed to be sounding. The beliefs are stored in a fixed number of ‘tracks’, using the terminology of MQ-analysis. For each window the following procedure is iterated:

1. Perform an FFT on the (Hanning) windowed signal.

2. Pick the significant peaks of the power spectrum.

3. The frequencies of the spectral peaks are estimated using Charpentier’s technique.

4. These estimates are refined by looking at the phase evolution of the Fourier component at the initially estimated frequency.

5. The refined estimates are pairwise matched with our current beliefs.

6. Any peak that is more than a quarter tone away from the closest belief is considered to be a new component. If the refined estimate of the peak is ‘sensible’, then this estimate is put into a free track (if any are available) and considered to be a tentative new belief. The frequency estimate is recorded as the frequency of this belief, the power of the spectrum at this peak as the amplitude of this belief, and the phase of this belief is calculated by performing a single component Fourier transform of the window centred on the estimated frequency. A tentative track-birth is recorded, and if there was a component previously in this track then a track-death is recorded.

7. Any peak that is within a quarter-tone of the closest belief is considered to be a candidate for continuation of the belief. The stored phase information for this belief is used to perform an enhanced frequency estimate using the N-hop technique of Brown & Puckette (1998). This can be done extremely efficiently by utilising the frequency value of the belief: it requires calculation of just a single Fourier spectral coefficient centred at the believed frequency. The algorithm also calculates the number of whole cycles that this component should have gone through in one window, based on the believed frequency. Adding to this whole number of cycles the partial phase advance (measured as the difference between the phase of the single calculated coefficient and the stored phase of this belief) yields a total phase advance through the window, which can be converted to a refined frequency estimate (a sketch of this step follows the list).

8. If the refined estimate from step 7 is ‘sensible’ (within a quarter-tone of our belief) then this is considered to be a definite continuation of the believed frequency. If the belief was previously tentative then it is retrospectively marked as definite. An average of the old belief and the new estimate of the frequency is used as the frequency for our belief. The algorithm uses the phase value already calculated for the phase, and the power of the peak as the amplitude.

9. Finally the algorithm goes through old beliefs, and for any belief that has not been continued a track-death is recorded, and the track is filled with a new belief as needed.
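The phase-based refinement of step 7 can be sketched in Python as follows. This is a hypothetical illustration of the arithmetic only: the track and belief bookkeeping is omitted, and the helper names and test tone are my own.

```python
import numpy as np

def fourier_coefficient(frame, freq, sr):
    """Single Fourier coefficient of a Hann-windowed frame at an
    arbitrary frequency -- the 'one coefficient' evaluation of step 7."""
    t = np.arange(len(frame)) / sr
    return np.sum(np.hanning(len(frame)) * frame * np.exp(-2j * np.pi * freq * t))

def refine_frequency(frame, believed_freq, stored_phase, hop, sr):
    """Resolve the whole-cycle ambiguity of the phase advance using the
    believed frequency, then convert the total advance back to Hz
    (after Brown & Puckette 1998)."""
    new_phase = np.angle(fourier_coefficient(frame, believed_freq, sr))
    predicted = 2 * np.pi * believed_freq * hop / sr   # expected advance
    fractional = new_phase - stored_phase              # known only mod 2*pi
    # Pick the integer cycle count that brings the measured advance
    # closest to the prediction: the belief does the phase unwrapping.
    cycles = np.round((predicted - fractional) / (2 * np.pi))
    total = fractional + 2 * np.pi * cycles
    return total * sr / (2 * np.pi * hop), new_phase

# Example: a 210 Hz tone, believed to be ~208 Hz, frames 768 samples apart.
sr, hop = 44100, 768
t = np.arange(hop + 1024) / sr
x = np.sin(2 * np.pi * 210.0 * t)
phase0 = np.angle(fourier_coefficient(x[:1024], 208.0, sr))
est, _ = refine_frequency(x[hop:hop + 1024], 208.0, phase0, hop, sr)
print(est)   # close to 210 Hz
```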

The use of a two stage estimation, firstly utilising the 1-Hop technique of Charpentier, and secondly the N-Hop technique of Brown & Puckette, circumvents the issues raised regarding the N-Hop technique. Firstly, by encapsulating the information about the previous analysis window in a set of believed frequencies and phases, the time resolution of the N-Hop technique is increased to a single analysis window, at least in so far as detecting when a steady component terminates. Whilst it will still take two windows to get an accurate frequency estimate from the N-Hop technique once a new component starts, in the meantime the algorithm can use the 1-Hop estimate of the frequency during this first window. So for the first window of a new pitch the estimate is not as good as for any following windows – however I view this as acceptable, especially since during the attack phase of a new note the pitch may be unstable in any case. Furthermore, the timing information regarding the beginning of a new component has a time resolution of one window, which is the desired result. Once the frequency stabilises the frequency estimate for the first window may be retrospectively adjusted if desired.

Secondly, maintaining a belief of the sounding frequency (or, in the case of a new component, having a first stage estimate of the frequency) helps alleviate the phase unwrapping errors inherent in the N-Hop technique.


5.3.5 Aggregation into Notes

The output of the above algorithm is a fixed number of tracks containing frequency beliefs (along with corresponding amplitude and phase), and a series of timing events for track-births and track-deaths. The next step in the algorithm is to aggregate these frequencies, and associate them into notes. Here I am not trying to reproduce the notes in the audio stream as they may have been physically produced, but rather perform a sensible data reduction that groups frequencies into salient units.

In each analysis window the algorithm aggregates frequencies that are harmonically related. This was one motivation for obtaining an accurate frequency estimation for the lowest frequencies in the signal. The aggregation is performed as follows: first loop through all beliefs for a given window, starting with the lowest frequency and associating with it any frequencies that are within a semitone of being a harmonic of that frequency – and consider such a collection to be a pitch. Associate this pitch with the frequency of the lowest component. Then take the remaining beliefs and iterate this procedure, yielding a number of believed pitches each window.

In actuality the procedure is modified somewhat, since as described the algorithm would aggregate the fundamentals of a bass line playing a C2 with a melodic line simultaneously playing a C4, for example. Whilst it is not my intention to segregate the audio stream into distinct parts, I make some concession towards this by requiring that harmonics have no more than half the energy of the fundamental to which they are aggregated.
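A minimal Python sketch of this vertical aggregation, including the half-energy concession, might look as follows. The function name, tolerance parameterisation and test values are my own illustrative choices.

```python
import numpy as np

def aggregate_pitches(freqs, energies, tol_semitones=1.0, max_ratio=0.5):
    """Group one window's frequency beliefs into pitches: the lowest
    unclaimed frequency becomes a fundamental and claims any component
    within a semitone of one of its harmonics, provided the harmonic
    has no more than half the energy of the fundamental."""
    order = np.argsort(freqs)
    freqs = np.asarray(freqs, dtype=float)[order]
    energies = np.asarray(energies, dtype=float)[order]
    unclaimed = list(range(len(freqs)))
    pitches = []
    while unclaimed:
        root = unclaimed.pop(0)
        members = [freqs[root]]
        for i in list(unclaimed):
            harmonic = round(freqs[i] / freqs[root])
            if harmonic < 2:
                continue
            # Distance in semitones from the nearest exact harmonic.
            dist = 12 * abs(np.log2(freqs[i] / (harmonic * freqs[root])))
            if dist <= tol_semitones and energies[i] <= max_ratio * energies[root]:
                members.append(freqs[i])
                unclaimed.remove(i)
        pitches.append((freqs[root], members))
    return pitches

# A C2 with two harmonics, plus a loud C4 that stays a separate pitch:
print(aggregate_pitches([65.4, 130.9, 196.2, 261.7], [1.0, 0.3, 0.2, 0.8]))
```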

Human frequency perception displays a phenomenon known as the ‘missing fundamental’ (Rasch & Plomp 1982), where harmonics give rise to a sense of a fundamental frequency even when the actual fundamental frequency component is absent. The vertical aggregation procedure I have described will not replicate this phenomenon. However, the goal of the procedure is not to accurately mimic human perception, but rather to give some sense of the harmonic context.

Having aggregated frequencies vertically into pitches, I then aggregate pitches horizontally (in time) into notes. Notes are formed by examination of the track birth and death timing information. A note is considered to start/finish at the time of the birth/death of the track containing the fundamental. The note structure consists of a start time, an end time, a pitch envelope and an amplitude envelope. The information about the notes is known in realtime with a latency of two analysis windows; however, the resolution of the timing information for the start/end of notes is just one analysis window.


Once the notes are formed, the algorithm merges notes that are close in time. Any two notes whose frequencies are within a quarter tone, and where one note finishes in the same window as (or the window before) that in which the second note starts, are merged. This step eliminates a great many erroneous discontinuities caused by noise. On the other hand it means that this algorithm will not pick up rhythmic information from repeated pitched notes. However, since the notes are of the same pitch, for a human listener to gather rhythmic information from them they must be marked in some other way, for example with attacks or articulatory information, and these markings may be picked up by an independent onset detection algorithm, such as the percussive onset detectors described above, operating in parallel with this algorithm. Note that this information is available with a latency of two analysis windows.

Finally, notes that are too short are eliminated from consideration. Three analysis windows was chosen as the minimum length for a note to be validated. Consequently the algorithm as a whole has a latency of three analysis windows, though the timing information (for feeding into a beat-tracking algorithm, for example) has a resolution of one analysis window.

5.3.6 Example Results

The results from the audio analysis are in the form of ‘notes’ with a frequency, amplitude, and duration. When listening to this output reproduced it is clear that the transcriptions are not accurate. Often pitch slides and note ghosting are misconstrued as additional notes, and jumps to frequencies a fifth or an octave away are not uncommon. These can be filtered by ignoring short notes or by increasing the belief threshold to provide additional stability. For the purposes of auto accompaniment, these ‘notes’ are treated like incoming MIDI messages, their frequencies quantised to an equally tempered scale.

For the purposes of evaluation of the pitch tracking algorithm, I have given some examples of the algorithm mimicking the notes that it perceives, as a simple means of accompaniment. Since the accompaniment displays noticeable latency and frequent errors, I thought it fitting to make the accompaniment play instruments of a high school band. The examples may be found online in the audio examples for this chapter.

The first example is of a simple two part acoustic piece, mas que nada.mp3. The auto-accompanied file is mas que nada jam.mp3, where the algorithm is mimicking both the bass-line and the clarinet line. For the second example I use the same setup but feed the algorithm a more challenging task with a snippet from Amon Tobin’s 4 Ton Mantis: the original snippet is 4 ton mantis.mp3 and the auto-accompanied version is 4 ton mantis jam.mp3. The last example uses different instrumentation, and has the algorithm accompany Lamb’s song Sweet: the accompanied file is sweet jam.mp3.

5.4 Summary

This chapter discussed the Jambot’s suite of onset detection algorithms. The suite consists of three percussive onset detectors – the SOD, MID and LOW algorithms – tuned for the detection of hi-hat, snare and kick-drum sounds respectively. The Jambot also implements the FLUX onset detector for pitched onsets.

The SOD algorithm operates by listening for growth in the noisiness of the signal. As such it is well suited to the detection of hi-hat onsets. The LOW algorithm is a band-pass filtered energy detector, whilst the MID algorithm combines a downsampled version of the SOD detector with a band-pass filtered energy detector.

A polyphonic pitch tracking algorithm was also presented, which was a synthesis of existing techniques designed to yield lower latency estimates than the component techniques. The accuracy of the pitch estimates was, however, insufficient for use in the Jambot.

The next two chapters discuss how musical features are inferred from the raw onset times produced by the reception stage.


Chapter 6

Analysis of Metre

6.1 Introduction

The previous chapter discussed the first stage of the Jambot’s processing chain, the reception of raw audio signals resulting in something akin to musical notes. The next two chapters discuss the second stage of processing, analysis, which takes the received notes and converts them into some higher order musical representation. This chapter focuses on the metrical analyses that the Jambot performs: beat tracking and metre induction. Beat tracking follows the varying tempo of the music, whilst metre induction measures the bar structure.

I introduce the notion of the substratum, a new theoretical construct that underpins the beat tracking techniques. The substratum plays the role of the referent pulse, similar to (but different from) notions such as the tactus, tatum and pulsation.

I present novel algorithms for real-time estimation of both the period and the phase of the substratum. The period estimation algorithm is a novel synthesis of a number of techniques used in existing beat tracking algorithms. It uses the harmonic spectrum of the Autocorrelation Function of the onset salience in a multiple hypothesis architecture. It differs from existing beat tracking algorithms in a number of respects – notably in the tracking of the substratum, the use of the SOD onset detector, the perceptual inertia of its architecture, its sparse implementation and its use of feedback to modulate the onset detectors according to metric expectations.

The substratum phase estimation algorithm is also similar to existing algorithms, particularly that of Laroche (2003), but differs in a number of respects. Firstly, it uses a discrete salience function and a continuous pulse function (where Laroche uses a continuous salience function and a discrete pulse function), which I argue to be better suited to onsets with sharp attacks. Secondly, it utilises an efficient search mechanism involving Complex Linear Regression, which also affords a natural extension that biases the results towards accuracy at the current time.

I also present a novel metre induction algorithm, based on beat-class interval invariance. The metre induction algorithm estimates the length of a bar in the invariant representation of substratum beats. Although I also made substantial investigations into estimating the position of the downbeat, the results are not of production quality and are not discussed in this thesis.

6.2 Beat Tracking

Beat tracking is the process of determining the varying tempo and location of the beat in a piece of music. The beat in this context is a perceptual construct, which may or may not be explicitly evident in the musical surface. The idea of beat tracking is to mimic what humans do when they ‘tap their foot’ to music (Ellis 2007). Indeed, each year the Music Information Retrieval Evaluation Exchange (MIREX) runs a contest in algorithmic tempo tracking, for which the test data has ground truth tempi established by having test subjects tap to the music (Davies et al. 2009). Despite being effortless for humans, computational beat tracking is no simple matter.

Computational beat tracking has been an active area of research for the last 30 years (Dixon 2001). A number of different approaches have been developed to suit a variety of applications; however, the creation of a general purpose beat tracking algorithm remains an open question. Indeed a number of authors have questioned the possibility of such an algorithm (Goto & Muraoka 1995; Collins 2006b:99).

In this thesis I am not primarily concerned with constructing a universal beat tracker, but rather with supplying sufficient timing information to the Jambot to allow it to make musically appropriate actions, within a restricted set of musical genres. This is in keeping with the Design Research methodology employed (§3.5.1): I first seek effective solutions to specific problems, and then ask what aspects of the solutions are transferable to more general situations.

Restricting the genre allows me to use stylistic cues in the estimation of tempo and beat locations, and also in the determination of metre (described below in §6.8). The Jambot’s beat tracking facility pays particular attention to noisy sounds such as a hi-hat. Many popular western dance music styles use the hi-hat (or similarly noisy sounds) to mark the pulse. This tendency forms the basis of the beat tracking algorithm described here.

The beat tracking algorithm that I present is designed particularly for use in a real-time interactive music system. This application entails a number of design considerations. Firstly, the algorithm needs to be real-time and causal. Secondly, robustness is more important than accuracy.


6.2.1 Existing Approaches to Beat Tracking

This section briefly covers the current state-of-the-art in beat tracking, with the aim of highlighting areas of similarity and difference to the Jambot’s beat tracking facility. More comprehensive surveys may be found in (Collins 2006b), (Hainsworth 2004) and (Gouyon & Dixon 2005).

Existing beat tracking techniques can be categorised in several dimensions on the basis of their intended application. One important dimension is whether the technique operates in real-time or offline. For applications such as the Jambot it is necessary for beat estimation to be both real-time and causal, meaning that beat estimation requires only data from the past, and can be done fast enough to be acted upon as it happens. For many applications of beat tracking these requirements are not essential.

In the field of Music Information Retrieval it is common to perform beat estimation offline. For example an audio file may be supplied for which it is desired to estimate the locations of the beats. The estimation procedure in this case does not need to be causal; when estimating the beat at a particular location in the file both the past and future of the stream are known. One prominent example of an offline beat tracking algorithm is BeatRoot (Dixon 2008).

Another important dimension in the categorisation of beat tracking algorithms is the type of input that they take; in particular whether they use symbolic input such as MIDI, or work directly with audio input. Early work in beat tracking, such as that of Allen & Dannenberg (1990), worked with MIDI input. Some techniques, such as the comb filter approach of Scheirer (1998), work directly from the audio signal without an intermediate stage of symbolic representation. Other beat tracking algorithms comprise two stages: first a raw audio signal is converted into some form of symbolic representation (generally onset times or musical notes), and then the symbolic data is converted into beat locations. The Jambot’s beat tracking facility falls into this latter category.

Autocorrelation

A common strategy in beat tracking analysis is to examine the Autocorrelation Function (ACF) of the onset times (or, more precisely, of a salience function constructed from the onsets) (Goto 2004; Klapuri et al. 2006; Vercoe 1994). The ACF is strongly related to the Fourier Transform (Eck 2007) – the engineer’s tool of choice for measuring periodicities.


A number of authors utilise the Inter-Onset-Interval (IOI) histogram to search for periodicities. Often, in this context, the IOIs between all pairs of onsets (rather than just successive pairs, as is normally implied by an IOI) are considered in the construction of the IOI histogram (Gouyon et al. 2002; Dixon 2001). In this case the IOI histogram will be very similar to the ACF, the difference depending on how the autocorrelation has been smoothed, and how (if at all) the salience of the onsets is accounted for in the calculation of the histogram. For example Goto (2001) considers only the onset times, whereas Dixon (2001) combines pitch, duration and amplitude into a salience measure. The Jambot’s beat tracking facility utilises the ACF of a salience function constructed by placing unit masses at the location of each onset, and then weighting each mass by the detected amplitude of the onset.

Harmonically Related Periods

The ACF contains information about all periodicities evident in the onset stream. A peak in the ACF at a particular pulse period should be accompanied by peaks at integer multiples of this period. Using these harmonically related peaks can help to improve the accuracy of the pulse period estimation. This idea is very similar to the technique described in §5.3.1 for improving the accuracy of fundamental pitch estimation in pitch tracking algorithms. A number of authors (Scheirer 1997; Klapuri et al. 2006) have implemented such a scheme:

other pulse periods are integer multiples of the tatum period. Thus, the overall function . . . contains information about the tatum . . . Gouyon et al. [24] used an inter-onset-interval histogram and Mahers two-way mismatch procedure [34] served the same purpose. Their idea was to find a tatum period which best explained the multiple harmonically related peaks in the histogram. (Klapuri et al. 2006:346)

The Jambot uses the harmonic information in the ACF to improve the estimate of the pulse period. Moreover, in contrast to Klapuri’s model, the Jambot’s beat tracking algorithm does not attempt to estimate the fastest pulse (which Klapuri is referring to as the tatum in the above quote – this term is described in §2.4.2). Rather, the beat tracking algorithm that I present here has a uniform treatment of observed periodicities that are either integer multiples, or integer divisors, of a reference pulse period.

Multi-Agent Architectures

Many of the more successful beat tracking algorithms utilise some form of multi-agent architecture (Goto 2001; Dixon 2001; Davies & Plumbley 2005). This entails maintaining a number of hypotheses regarding the pulse period and phase, and gathering evidence regarding the plausibility of these hypotheses through time. Both online and offline algorithms have utilised such architectures. Of interest to this research is that at least one real-time algorithm that has been implemented in an actual performance system – namely Nick Collins’ DrumTrack (2006b) – utilises a multi-agent architecture.

The notion of multi-agent architectures as a general approach to robust real-time modelling was discussed in §4.3.2. The Jambot utilises a multi-agent architecture for all of its perceptual parameters. Rather than using the term agent (as is common in the beat tracking literature) I prefer to use the term belief or hypothesis, and describe the Jambot as utilising a multiple parallel belief architecture. The reason for this is that I am reserving the term agent for something that displays agency in a stronger sense, which will be discussed in §8.2.1.

6.2.2 Points of Difference

The Jambot’s beat tracking facility has a multiple parallel belief architecture, analysing the ACF of the onset saliences for periodicities and harmonics as evidence for the continuation of period and phase beliefs. In this it has many commonalities with existing beat tracking algorithms. The Jambot’s beat tracking is, however, unique in a number of ways. In the paragraphs below I highlight the points of difference between the beat tracking algorithm that I present in this thesis and existing systems.

Tracking the Substratum

In estimating the beat period I utilise information from peaks in the ACF that are harmonically related to the candidate period, a common technique as described above. However, in contrast to most existing techniques, I do not attempt to model a fastest pulse and look for periods that are integer multiples of this pulse period. Instead I introduce a new concept – the Substratum (§6.3) – that plays the role of a referent pulse. Evidence of periodicities at integer multiples and integer divisors is incorporated into the period estimation in a uniform fashion.

Use of the SOD Stream

A critical component of any beat tracking algorithm is the quality of the onset timing information that it utilises. Collins suggests that improved onset information may be the most critical factor in improving the performance of beat tracking systems (Collins 2006b:117). The Jambot benefits from the utilisation of four parallel onset detection techniques that discriminate between kick, snare and hi-hat onsets, as well as (non-legato) pitched onsets. Other beat-trackers, such as those of Goto (2001) and Collins (2006b), also utilise onset detectors that discriminate between kick and snare sounds. However, the Jambot benefits from the use of the SOD detection algorithm (§5.2) developed in this project.

The SOD algorithm is particularly attuned to detecting hi-hat onsets. This is efficacious for beat tracking of many popular songs, in which the hi-hat often plays a pulse-keeping role.

Sparse Implementation

The onset salience function that I use to detect periodicities is sparse in the sense that it is mostly zero, with non-zero entries only at times where an onset is reported. The beat tracking algorithm is implemented to utilise sparse calculations from end to end, which results in a substantial gain in efficiency (around 100 times faster) over using non-sparse calculations. This was born of necessity – the non-sparse algorithm that I originally used would not run in real-time.
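To illustrate the idea (though not the Jambot's actual implementation), an autocorrelation over a spiked salience function can be computed directly from the onset list, touching only onset pairs rather than every analysis window:

```python
import numpy as np

def sparse_acf(onset_windows, onset_amps, max_lag):
    """Autocorrelation of a spiked salience function computed from the
    onset list: O(k^2) in the number of onsets k, rather than a dense
    pass over the whole window. Onset times are assumed sorted and
    already quantised to analysis-window indices."""
    acf = np.zeros(max_lag + 1)
    for i in range(len(onset_windows)):
        for j in range(i, len(onset_windows)):
            lag = onset_windows[j] - onset_windows[i]
            if lag <= max_lag:
                acf[lag] += onset_amps[i] * onset_amps[j]
    return acf

# Strong onsets every 40 windows with weaker offbeats every 20:
windows = [0, 20, 40, 60, 80, 100, 120]
amps = [1.0, 0.4, 1.0, 0.4, 1.0, 0.4, 1.0]
acf = sparse_acf(windows, amps, 80)
print(acf[20], acf[40], acf[80])   # 2.4, 3.32, 2.16: the period of 40 dominates
```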

Robustness through Perceptual Architecture

The Jambot’s architecture models perceptual inertia (§4.2.3). It achieves this by virtue of its multiple parallel belief structure, and by continuing beliefs so long as the evidence is consistent with these beliefs.

Other multiple hypothesis systems implement some form of perceptual inertia. For example Collins’ DrumTrack (2006b) selects between hypotheses using a causal dynamic programming calculation that penalises state changes. The Jambot’s implementation of perceptual inertia is, however, unusual in that its confidence in a belief, once established, is not diminished by lack of corroborating evidence, only by contradictory evidence.

The reason for implementing such strong inertia is related to the design brief for the Jambot. The guiding principle driving the development of the Jambot’s perceptual algorithms is to provide mechanisms that are easily usable in a live performance system. To this end the stability of metrical beliefs is important. Even if the Jambot’s sense of the metre diverges from a human’s, the lack of contradictory evidence means it is likely the Jambot’s musical responses will nevertheless be somewhat musically appropriate. Furthermore, it is a simple matter for a human performer to press a button instructing the Jambot to ‘forget’ whatever beliefs it has, should the need arise.


Responsivity

The trade-off between reactiveness and inertia is an important consideration in real-time beat tracking systems (Gouyon & Dixon 2005). The Jambot attempts to finesse this trade-off in several ways. In particular I have employed two mechanisms to enhance the responsiveness of the beat tracking algorithm in circumstances of mildly varying tempo.

The estimation of beat phase has a mechanism for biasing the result to be most accurate at the ‘pointy end’ of its window of history, namely the current time. It essentially does this by weighting the salience of recent events more heavily; the details of this procedure are described in §6.5.

Feedback

The Jambot’s architecture is a hierarchical structure with bidirectional communication between the levels in the hierarchy (§4.2.1). In particular, there is feedback from the analysis module to the reception module. This feedback is in the form of modulation of the onset detection algorithms according to the metric expectations formed in the analysis stage. The modulation alters the signal/noise critical thresholds required to report an onset. The alteration favours detection of onsets at predicted beat locations.

Goto (1995) utilises bidirectional feedback between onset detection and higher order agents to create dynamic manipulation of his signal processing parameters. However, his manipulations are in response to poor performance of a particular agent, and do not modulate according to metric expectations.

Modulating the onset detection algorithms according to metric expectations creates a positive feedback loop that tends to enhance the stability of a metric belief. As discussed in §6.2.2, I have tended to bias the Jambot’s algorithms towards stability in the absence of blatantly contradictory evidence. The feedback can be manually turned on or off by a human performer, so that if the Jambot latches on to an incorrect belief then a simple button press can inform it to reconsider.

6.3 The Substratum

The Jambot’s beat tracking facility relies on a novel analytic notion that I call the substratum. The substratum plays the role of a referent pulse level (§2.4.3). The substratum beats are the beats which the Jambot tracks, and the Jambot’s metre induction consists of measuring the number of substratum beats in a bar.


The substratum plays a similar role to the pulsation (§2.4.4) in West African music, but is applied to Western music. The name is inspired by the tatum, and by the notion of a substrate. In this context I am thinking of a substrate as a kind of grid or mesh upon which the musical events are placed. However, in contrast to the temporal atomicity of the tatum, the substratum allows for interpolation between the grid points. In this sense the substratum is functioning more like the tactus, but is a mathematical property of the music rather than a perceptual quantity.

In terms of the metrical analysis of Western music, the substratum has a particular interpretation. Yeston describes metre as “an outgrowth of the interaction of ... differently rated strata” (1976:66). Here Yeston is using strata in the same sense as I have been using pulse. His analysis of metre revolves around the concepts of metric consonance and dissonance:

Yeston considers collections of strata consonant when their rates of motion are multiples or factors of each other by an integer greater than one (Krebs 1987)

The substratum is defined to be the most metrically consonant pulse. In other words, the substratum is the pulse such that most other pulses perceived in the music can be expressed as integer multiples or divisors of it.

The Jambot takes any two onsets separated by one pulse period as evidence for that pulse. Consequently the substratum is the pulse such that most IOIs in the music can be expressed as integer multiples or divisors of its period.

The idea of the substratum is to act as a computationally defined reference pulse (§2.4.3). As such, the substratum should have some perceptual inertia (§4.2.3), much as the tactus does in human perception (Large & Kolen 1994).

The substratum may differ from both the tactus and the tatum. In general, the substratum will lie somewhere in between, being faster than (or equal to) the tactus, and slower than (or equal to) the tatum.

For example, in figure 6.1 the tactus is at the dotted crotchet level, the tatum at the semiquaver level, whilst the substratum is at the quaver level. The tactus is not suitable as the substratum due to the presence of crotchets, which contribute an IOI of 2/3 of the tactus period, which is not a simple multiple or divisor. The tatum level is not the substratum because the first two bars establish a quaver substratum, and the later appearance of semiquavers does not contradict this choice, so the perceptual inertia of the substratum maintains the quaver level.


Figure 6.1: substratum pulse is quavers

The Jambot’s sense of timing is rooted in the substratum – the grid of substratum beats forms a set of time points at which the Jambot considers taking actions (although the Jambot may decide to schedule actions at interpolated points between substratum beats).

The notion of substratum is not well suited to all styles of music. For example, music with irrational non-isochronous pulses, such as discussed in §2.5.1, is not amenable to analysis in terms of a substratum pulse. Consequently the Jambot does not deal well with such styles. In particular the Jambot does not cope well with swing.

6.4 Estimating the Substratum

The Jambot performs real-time estimation of both the substratum period and phase. There is, however, nothing to guarantee that any one pulse level will exist of which all other surface pulses are integer multiples or divisors. Rather than requiring the substratum to ‘account’ for all surface pulses, the estimation technique seeks the pulse which is most consonant with the more salient surface pulses. Potentially there will be several choices of substratum with corresponding levels of plausibility. The estimation technique does not only pick the best choice, but returns a list of possible substrata and their plausibilities. As with all perceptual parameters, the Jambot’s perceptual architecture maintains multiple hypotheses regarding the substratum.

6.4.1 Multivisors, Overtones and Undertones

The relation of consonance between two pulses is symmetric, i.e.

pulse₁ is consonant with pulse₂ ⇐⇒ pulse₂ is consonant with pulse₁

In estimating the substratum this symmetry means that pulses both faster and slower than the substratum are treated similarly by the estimation procedure. It is convenient to have a term that covers both. I will use the term multivisor to mean either a multiple or a divisor. By simple multivisor I mean an integer multiple or divisor, like 3 or 1/3. Using this language the definition of substratum above can be rephrased: the substratum is the pulse for which most of the surface pulses are simple multivisors.


I will also borrow terminology from harmonic theory and refer to integer divisors of the substratum as overtones. This is in much the same spirit as Yeston’s use of the term consonance. I will also refer to integer multiples of the substratum as undertones. This term is not used in acoustics – the physical reality of the harmonic series means that there is always a fundamental which plays the role of a reference level and for which all other harmonics are overtones. However in the case of pulses, the reference level has both overtones and undertones.

6.4.2 The Multivisor Transform

In order to treat undertones and overtones mathematically similarly, it is convenient to transform the periods of the tones so that all multivisors become multiples, with divisors transforming into negative multiples. So for example the undertone that has three times the period of the substratum is transformed into the number 3, whilst the overtone that has one-third the period of the substratum is transformed into the number −3. I call this transform the MultiVisor Transform (MVT). If p0 is the period of the substratum then the MVT is defined by

MVT_p0(p) = p/p0 if p0 ≤ p, −p0/p otherwise    (6.1)

and is a bijection from the interval (0, ∞) to the set (−∞, −1) ∪ [1, ∞).

The application of the MVT allows for undertones and overtones to be used together in a multiple linear regression, as will be described below. Also, after applying the MVT, meaningful comparisons can be made between observational deviations from a period and a hypothesised tone. For example, if the substratum is hypothesised to have a period of 60 analysis windows, and tones are observed at periods of 32 and 115, the application of the MVT allows a single tolerance to be applied in determining if the observed tones are sufficiently close to multivisors of the substratum to be considered as supporting evidence for this hypothesis. If, for example, a tolerance of 0.1 were chosen, then the first tone, with MVT_60(32) = −1.875, would be discarded (being 0.125 away from the closest multivisor) whilst the second tone, with MVT_60(115) = 1.9167, would be accepted.
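The transform and the tolerance test are one-liners in Python; the following sketch reproduces the worked figures above.

```python
def mvt(p, p0):
    """MultiVisor Transform of eq. 6.1: map a tone period p to a signed
    multiple of the substratum period p0 (undertones positive, overtones
    negative)."""
    return p / p0 if p0 <= p else -p0 / p

# The tolerance example from the text, with a substratum period of 60:
for period in (32, 115):
    v = mvt(period, 60)
    accepted = abs(v - round(v)) <= 0.1
    print(period, round(v, 4), accepted)  # 32 -> -1.875, rejected; 115 -> 1.9167, accepted
```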

6.4.3 Update Procedure

The general process of parameter estimation in a multiple parallel analysis architecture (§4.2.4) requires that each parameter have an update procedure that returns a weighted list of plausible new values given its previous belief and the new evidence.


In estimating the substratum the Jambot uses the onsets derived from the SOD detection technique (§5.2.3). The SOD technique is essentially listening for the noisiness of the signal, and so reports the greatest salience for onsets of instruments such as hi-hats and cymbals. The motivation for using this stream is that for many popular music styles the pulse is kept by such instruments – for instance it is common in western rock/pop songs for the hi-hat to tap out a pulse, in West African music the bell is often used in what has been suggested to be a time-keeping role (Arom 1991), and many other examples can be found. I suspect that the reason for this is that the onsets of these instruments are highly localised in time. In contrast, the onset of a kick drum is less precise, but precise enough to mark a beat as special when the sense of pulse has already been established.

Throughout this section I will give examples of the various stages of the estimation and update procedure using the Amen Break sample introduced in the previous chapter.

6.4.4 Surface Pulses

The first step in the estimation is to extract the surface pulses from the recent past. The recent past is defined by a moving segment of fixed length – the actual length of this segment is a parameter that can be varied in the interface of the Jambot. I tend to use 2048 analysis windows (where each analysis window is 128 samples at 44kHz), which is approximately 6 seconds.

A 6 second window is relatively long compared with the time required by humans to lock onto a pulse period, which varies with the style of music but tends to be around 3 seconds (Collins 2006b:47). However, for the Jambot’s beat tracking algorithm, 6 seconds seems to provide the best trade-off between reactivity and inertia for mildly varying tempos, such as live drumming in a funk track.

The SOD onsets are first converted to a salience function, which is a function defined for each analysis window in the recent past. The salience function is zero unless there is an onset in the given window, in which case its value is the amplitude of that onset. The surface pulses are then extracted by calculating the Autocorrelation Function (ACF) of the salience, and searching for peaks in the ACF.

An example of the salience of the SOD stream and the ACF calculated for a segment of the Amen Break is shown in Figures 6.2 and 6.3. The x-axis in these graphs is time measured in analysis windows (of 128 samples).


Figure 6.2: SOD stream salience of a segment of the Amen Break

Figure 6.3: ACF for the SOD stream salience


Figure 6.4: Clumped ACF for the SOD stream salience

6.4.5 Peak Picking

The ACF is discrete and discontinuous due to the discrete nature of the onset times reflected in the salience function. Consequently a simple search for local maxima is not adequate in determining the peaks of the ACF. One solution to this issue would be to convolve the ACF with some smoothing kernel (Gouyon et al. 2002). For reasons of efficiency, however, I have used a different technique that does not require smoothing. The peak-picking function has a resolution, which is a particular fixed width, which it uses to suppress the reporting of local maxima where there is another local maximum of greater value within the resolution.

Once the locations of the peaks are determined, their values are taken to be the sum of all values within the resolution of the peak, and the location of the peak is refined by taking a value-weighted average of the locations within the resolution. Peaks also need to contain a fixed percentage of the total value of the ACF to be reportable. I call this technique clumping. Clumping bears similarity to the clustering method employed by Dixon (2001) to deal with the same issue.
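A minimal Python sketch of clumping might read as follows; the parameter values and the synthetic ACF are illustrative only, not the Jambot's settings.

```python
import numpy as np

def clump_peaks(acf, resolution=4, min_fraction=0.05):
    """Peak picking by 'clumping': keep a local maximum only if nothing
    larger lies within `resolution` lags; its value is the sum of the
    ACF over that neighbourhood and its location the ACF-weighted mean
    of the lags. Peaks holding less than `min_fraction` of the total
    ACF are dropped."""
    total = acf.sum()
    peaks = []
    for k in range(1, len(acf) - 1):
        lo, hi = max(0, k - resolution), min(len(acf), k + resolution + 1)
        if acf[k] < acf[lo:hi].max():
            continue                      # a larger neighbour suppresses k
        value = acf[lo:hi].sum()
        if value < min_fraction * total:
            continue                      # too small a share of the ACF
        location = np.dot(np.arange(lo, hi), acf[lo:hi]) / value
        peaks.append((location, value))
    return peaks

acf = np.zeros(100)
acf[[39, 41, 80]] = [0.5, 0.6, 0.9]      # a split peak near lag 40, one at 80
print(clump_peaks(acf))                  # the split peak merges to ~40.1
```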

An example of the result of the clumping technique described, applied to the ACF for the Amen Break displayed in the previous figure, is shown in Figure 6.4.


6.4.6 Surface Substrata

The surface pulses are then assessed for plausibility as substratum pulses, so that a weighted list of the most plausible substratum periods can be reported. Recall that the substratum pulse was defined to be the pulse for which most of the surface pulses are simple multivisors. This definition is rather vague as to what is meant by most – in fact I will take the following estimation procedure as being definitive of the substratum.

The core of the estimation procedure is the calculation of how plausible a given candidate substratum is. The calculation proceeds by setting the plausibility to 1 (the plausibility ranges from 0 to 1) and penalising the plausibility for each surface pulse that is dissonant with the candidate via the formula

plausibility ↦ plausibility / (1 + w_dissonant / w_candidate)    (6.2)

where w_dissonant and w_candidate are the values of the dissonant and candidate peaks of the clumped ACF. The fraction w_dissonant / w_candidate means that a weightier dissonant peak will penalise the plausibility more, and the weight of the candidate peak determines the scale to which other peaks are compared.

This formula is used only for spawning new candidate hypotheses. In calculating the plausibility of an existing belief the value w_candidate may not be available, since the candidate pulse may no longer be directly evident in the musical surface. An important property of the Jambot’s perceptual architecture is perceptual inertia; once a belief in a particular value for a parameter has been spawned, it is robust to a temporary disappearance of evidence supporting that value, so long as the value is coherent with the current evidence. The substratum update procedure achieves this by using the candidate’s current plausibility instead of w_candidate in eq. 6.2.

To clarify, at each decision point (decision points may be defined in a number of ways, generally either when an onset is detected or at a new substratum beat) two sets of plausibility calculations are performed.

1. The surface substrata are extracted from the recent past and assigned plausibilities. This first calculation is independent of any current substratum beliefs.

2. Each currently held belief is put through the plausibility calculation, but using the current plausibility instead of w_candidate in eq. 6.2.

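The spawning case of eq. 6.2 is straightforward to sketch in Python; the tolerance value and the dictionary representation of the clumped ACF peaks are my own illustrative choices.

```python
def mvt(p, p0):
    # MultiVisor Transform (eq. 6.1).
    return p / p0 if p0 <= p else -p0 / p

def spawn_plausibility(peaks, candidate_period, tol=0.1):
    """Plausibility of a candidate substratum spawned from the clumped
    ACF, applying the eq. 6.2 penalty once per dissonant peak. `peaks`
    maps peak period -> clumped-ACF value."""
    w_candidate = peaks[candidate_period]
    plausibility = 1.0
    for period, weight in peaks.items():
        if period == candidate_period:
            continue
        v = mvt(period, candidate_period)
        if abs(v - round(v)) > tol:           # dissonant with the candidate
            plausibility /= 1 + weight / w_candidate
    return plausibility

peaks = {45: 0.42, 92: 0.11, 178: 0.36, 262: 0.25, 362: 0.26}
print(spawn_plausibility(peaks, 92))   # only the 262 peak is dissonant here
```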


6.4.7 Refining Beliefs

The property of perceptual inertia means that beliefs are persistent for brief periods of time during which there is no direct evidence supporting them. However it is important that the belief is able to change slightly – indeed this is fundamental to the Jambot’s ability to track changing tempo. The change must, however, be slight, otherwise it will be assumed to be a new belief. The substratum update technique refines the estimate of a belief in addition to updating its plausibility. The refinement takes into account the values of all of the overtones and undertones. By finding the substratum period that best describes all of these tones, the period can be estimated much more precisely than is possible from just looking at the location of the peak in the clumped ACF.

The refined substratum period is obtained by regressing the peak periods of the clumped ACF onto the tone numbers corresponding to these peaks. In order to take account of both overtones and undertones in this regression it is necessary to first perform the MultiVisor Transform (§6.4.2) on the periods to obtain the tones, so that an overtone will have a negative integer tone number. In this way all of the tones are expressed as linear multiples of the substratum, which is necessary to perform the linear regression.

Only peaks that are consonant with the belief are included in the regression. The criterion for consonance is that the MultiVisor Transform of the peak period is within a fixed tolerance of an integer (§6.4.2). The regression is computed by Weighted Least Squares (Carroll & Ruppert 1988) with the peak values as weights.

Weighted Least Squares regression finds the value β that minimises the weighted sum-of-squares of the residual vector ε in the linear equation

y = βx + ε, with weighting vector w    (6.3)

In this case y is the vector of MultiVisor Transformed periods for the peaks of the clumped ACF which are consonant with the belief, w the vector of corresponding peak values, and x is the vector of corresponding tone numbers.

The refined periods are obtained by inverse MultiVisor transformation of the estimates ŷ from the fitted model

ŷ = βx    (6.4)

The substratum period corresponds to a tone number of 1, and the refined substratum period is given by β × p0, where p0 is the current belief for the substratum period.
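Gathering the pieces, a minimal Python sketch of the refinement step might look as follows; the test periods and peak values are invented for illustration.

```python
import numpy as np

def refine_period(p0, peak_periods, peak_values, tol=0.1):
    """Refine a substratum period belief: MultiVisor-transform the
    clumped-ACF peak periods, keep those within `tol` of an integer
    tone number, and regress them onto the tone numbers by weighted
    least squares through the origin (eqs. 6.3 and 6.4)."""
    mvt = lambda p: p / p0 if p0 <= p else -p0 / p
    y = np.array([mvt(p) for p in peak_periods])     # transformed periods
    x = np.round(y)                                  # nearest tone numbers
    w = np.asarray(peak_values, dtype=float)
    keep = np.abs(y - x) <= tol                      # consonant peaks only
    beta = np.sum(w[keep] * x[keep] * y[keep]) / np.sum(w[keep] * x[keep] ** 2)
    return beta * p0

# A belief of 90 windows, with peaks near half and multiples of that period:
print(refine_period(90.0, [44, 91, 179, 362], [0.4, 0.1, 0.35, 0.25]))
# about 90.66: a small refinement of the 90-window belief
```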


Example

To clarify the refining procedure I will work through a numeric example. Suppose that the current belief for the substratum period is 90.25 analysis windows, and that peaks of the clumped ACF are found at 45, 92, 178, 262 and 362, with peak values of 0.42, 0.11, 0.36, 0.25 and 0.26 respectively. Then p0 = 90.25. The peak periods vector and its MultiVisor Transform are

p = [45, 92, 178, 262, 362]
MVT_p0(p) = [−2.0056, 1.0194, 1.9723, 2.7922, 4.0111]

Supposing that the tolerance for accepting a peak as consonant is 0.1, then all of these peaks are consonant except for 262, where the MVT equals 2.7922 (which is more than 0.1 away from the closest integer of 3). Excluding this peak we have

w = [0.42, 0.11, 0.36, 0.26]
y = [−2.0056, 1.0194, 1.9723, 4.0111]
x = [−2, 1, 2, 4]

Performing the weighted linear regression yields β = 0.98, and thus the refined substratum period is β × p0 = 0.98 × 90.25 = 88.5.

6.5 Estimating the Substratum Phase

Having estimated the period of the substratum, the Jambot’s next task is to estimate the phase of the substratum. Conceptually my approach to phase estimation is simply to find the phase at which times of high rhythmic salience in the signal tend to fall on the substratum beats most emphatically.

One approach that suggests itself is to create another salience function characteristic of the pulse, and to measure its similarity to the rhythmic salience at various phase lags. In applying this approach decisions need to be made regarding the construction of the characteristic function, and the nature of the similarity measure. A simple choice would be to create a characteristic function which is zero everywhere except at exactly the positions of the beats (assuming that the timeline has been quantised to some granularity) where it takes the value of one², and to use correlation as the measure of similarity.

² In mathematical terms this would mean the characteristic function of the pulse was a measure with ‘unit’ masses on the beats. To have literally unit masses would require the values at the beats to be 1/∆, where ∆ is the size of the quantisation granularity, but the correlation measure discussed is scale invariant.


Laroche (2003) uses this approach for pulse phase determination; he estimates the phase by maximising the cross-correlation of a discrete pulse expectation function with a continuous measure of spectral flux. The use of a discrete function to model pulse salience is reminiscent of Povel’s clock model of beat perception (Povel & Essens 1985), which suggests that humans internally represent pulses as a clock which ‘ticks’ on the beat.

A drawback of modelling pulse salience with a discrete function (and using correlation to measure similarity) is that it is not appropriate when the rhythmic salience function also consists of discrete spikes. This is because when both the rhythmic salience and the pulse salience are composed of discrete spikes, and correlation is used as the similarity measure, there is no room for timing error – there will only be a positive contribution to the correlation when the spikes line up exactly. For real data this will almost never happen; even for a rhythm that would be perceived as being identical to the substratum pulse, small timing differences would mean that the correlation measure would most likely report such a rhythm as bearing zero similarity to the pulse.

In Laroche’s case the rhythmic salience function is a continuous spectral flux measure, the peaks of which have some temporal width, which should mitigate this issue somewhat. However, from a design perspective, it seems problematic to be reliant on the imprecision of onset detection for the stability of the pulse phase estimation. In particular, for sounds with a very sharp attack, Laroche’s rhythmic salience measure may be small at the time of an expected beat if the onset is slightly late.

I suggest it makes more sense from a physical standpoint that the detected saliences should be spiked, whilst the pulse saliences should have some smoothing applied. Temporal discrimination of onsets in human listeners can be quite precise, particularly for percussive onsets, where onsets separated by only a few milliseconds can be discriminated (Hirsch 1959). On the other hand, a number of theories of pulse salience suggest that the salience function is best modelled as being non-zero in some window around the beat. For example Large & Kolen (1994) posit that pulse should be modelled as an harmonic oscillator (i.e. a sinusoid) where the peaks of the sinusoid are at the beat positions. In discussing this model Large emphasises similar concerns regarding the robustness of matching of rhythms to the pulse in the (inevitable) presence of timing noise:

The fixed beat structure of the clock model is not robust even to small perturbations in a rhythmic pattern ... introducing even modest amounts of temporal jitter would cause most onsets to fall off the beat. (Large & Kolen 1994)


Despite Large’s theory of resonant oscillators having some shortcomings (for example in explaining the tendency for syncopation to reinforce the sense of tempo), I have adopted this functional form (sinusoidal) for the pulse salience for the purpose of phase estimation.

The peaks in the Jambot’s onset detection functions have some temporal width; however, the tips of the peaks are generally unambiguous. Rather than using these continuous detection functions as measures of rhythmic salience, I use discrete functions composed of spikes at the detected onset times.

Following the approach above, the phase would be determined by performing a line search for the optimum value of the correlation between the pulse salience and the rhythmic salience, calculating a value for each phase shift of the pulse salience from 0 to 1 at some granularity. However, doing this would be computationally expensive (O(n²), where n is the number of quantised times in one pulse period). Rather than performing this search I have devised a novel technique which is more computationally efficient.

The optimum value is obtained by performing a complex regression of the rhythmic salience onto the complex oscillator e^{−2πiωt}, where ω is the frequency of the substratum pulse (expressed in cycles per time unit). The regression itself is calculated via the Weighted Complex Least Squares formula (A.9). A derivation showing that this procedure yields the maximally correlated phase is given in Appendix A.

Is correlation a good measure of similarity? I suggest this depends on the nature of the salience function. In the case of the Jambot’s detection functions – SOD, MID and LOW – the amplitude of the salience function is linearly related to the amplitude of the detected onsets. The intensity of the onsets should then vary as the square of the salience. I have argued above that maximising the correlation is identical to minimising the least squares difference, which should correspond to matching the intensities of the rhythmic salience and the pulse salience. Given that the ear responds to intensity more directly than to amplitude, this seems like a reasonable approach.

Another way of looking at this technique for phase estimation is that it yields the phase of the ω-frequency component of the Fourier transform of the rhythmic salience. From this perspective it again seems like an intuitively reasonable approach to estimating the phase. An advantage of formulating the problem as a complex regression rather than a Fourier transform is that we can utilise the extension of the Ordinary Least Squares algorithm to Weighted Least Squares. In reality, the substratum pulse that the Jambot is tracking will not be constant, nor will the estimate of the pulse period be exact. The unweighted version of this phase estimation technique will yield the best phase as an average over the window of data considered, but in an improvisation what really matters is the current moment.

Consider, for example, the case of a slight acceleration in the substratum pulse – an application of this phase estimation technique will be working with a single pulse period, and so even if the rhythm is exactly tapping out the pulse, the peaks of the pulse salience function described above will not exactly coincide with the perceived onsets. The nature of averaging will then mean that the tendency is to align the onsets with the beats in the middle of the analysis period, whereas what I really want is for the beat to line up with the onsets at the end of the analysis period. My solution is to perform a weighted regression in which I linearly weight the substratum beats from 0 at the beginning of the analysis period to 1 at the end. The complexity of this phase estimation is O(n), comparing favourably with the line search technique.
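
A minimal sketch of this weighted estimate, assuming the rhythmic salience is supplied as a spike train of onset times and amplitudes (the names and representation here are illustrative; the Jambot itself uses the Weighted Complex Least Squares formula of Appendix A):

```python
import numpy as np

def phase_by_weighted_regression(onset_times, onset_amps, omega):
    """Sketch: estimate pulse phase by projecting the spike-train
    salience onto the complex oscillator exp(-2*pi*i*omega*t). This is
    equivalent to reading the phase of the omega-frequency Fourier
    component, computed in O(n) for n onsets.
    """
    t = np.asarray(onset_times, dtype=float)
    a = np.asarray(onset_amps, dtype=float)
    # Linear weights: 0 at the start of the analysis window, 1 at the end,
    # so the fit favours lining up with the most recent onsets.
    w = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    coeff = np.sum(w * a * np.exp(-2j * np.pi * omega * t))
    # Onsets at t = tau + k/omega give coeff proportional to
    # exp(-2*pi*i*omega*tau), so the phase (as a fraction of the pulse
    # period, in [0, 1)) is recovered from the argument of coeff.
    return (-np.angle(coeff) / (2.0 * np.pi)) % 1.0
```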

6.6 MIREX Database

As a demonstration of the substratum estimation, I give some examples of the technique applied to a subset of the Music Information Retrieval Evaluation eXchange (MIREX 2006) tempo extraction practice data. The MIREX community runs an annual suite of music information retrieval contests, including tempo extraction and beat tracking tasks. The practice data for the 2006 tempo extraction task is publicly available (MIREX 2011). The training set consists of 20 excerpts from a wide variety of musical genres. Each excerpt is 30 seconds long.

The practice set is annotated with ground truth values for the tempo estimates. The ground truth values consist of two tempo estimates and a relative strength for these estimates. These values were established by having a pool of human test subjects tap along to the tracks; the reported values correspond to the two highest peaks in the perceived tempo distribution, and the ratio of their heights. The beat tracking task used the same training set, and additionally annotated the perceived beat times.

I evaluated the substratum estimation by having the Jambot play a click track along in real-time, and using subjective judgement as to how ‘in time’ it sounded. Whilst the MIREX community has proposed a number of quantitative empirical measures of beat-tracking accuracy (Davies et al. 2009), Dannenberg (2005) suggests that subjective judgement gives a reliable measure. Moreover, many of the standard measures reflect their intended use in offline music information retrieval tasks, rather than real-time performance tasks. For example, various measures of longest segment tracked are common, but often disregard the time taken to stabilise on the tempo estimate. Where time to stabilise is measured, it is at the beginning of the tracking, not in the context of an abrupt tempo change or pause (Collins 2006b:98). These measures therefore fail to capture the algorithm’s reactivity versus inertia tradeoff (§4.2.3).

The substratum is not the same as the tactus, and so the Jambot’s substratum estimates are not expected to coincide with the given ground truth tempo values, but generally should be related to them by tempo doublings/halvings or, infrequently, other multivisors. As with all of the Jambot’s perceptual parameters, the substratum period estimate has multiple hypotheses. The number of hypotheses is configurable. The version of the Jambot included on the accompanying CD limits the total number of hypothesised metric contexts to four, for reasons of efficiency. Figure 6.5 shows an example screenshot of the Jambot as it is tracking the test file train5.wav. The augmented file with click track is track5.mp3. The ground truth values for this file are 68.5 and 205.5 beats per minute. The Jambot’s tempo hypotheses are clustered around 69 and 205.

[Figure 6.5: MIREX training set example 5.]

In Table 6.1 I show the results of taking a snapshot of the Jambot’s tempo hypotheses as it tracks a subset of the practice data set. The subset I have used consists of all the tracks that contain percussion, since the Jambot is designed for this situation.

In all cases the Jambot’s substratum tempo estimates were close to a multivisor of one of the ground truth tempos. Also, in the two cases (track5 and track15) where the two ground truth tempos were related by a factor of 3, the Jambot reflected both of these tempos in its estimates.


Audio File     Tempo #1   Tempo #2   Jambot #1   Jambot #2   Jambot #3   Jambot #4

track1.mp3         64.5      129.5       256.5       260.2       130.5       258.4
track5.mp3         68.5      205.5       204.5        68.8       204.1       205.5
track9.mp3         64.5      129.0       129.3       259.5       259.5       259.5
track11.mp3        70.0      140.0       139.3        69.6       139.8        69.6
track13.mp3        90.0      180.0       359.7       359.7       359.7       359.7
track14.mp3        65.0      130.0       266.0       516.5       516.5       516.5
track15.mp3        62.0      186.0        61.3        61.3       184.0       184.0
track16.mp3        45.0       90.0        91.4        45.5       182.1       182.1
track19.mp3        93.5      188.0       186.0       186.0       186.0       369.4
track20.mp3       115.5      220.5       224.9       438.3       222.2       222.2

Table 6.1: Jambot tempo estimates of MIREX practice data (all values in beats per minute)

Audio recordings of the practice data augmented with a click-track by the Jambot are supplied on the accompanying CD in the files track*.mp3. Each recording plays the practice audio snippet twice. The snippets are designed not to be an integral number of beats in length, and so when the snippet is looped a phase discontinuity occurs. Consequently these recordings also give a sense of the time it takes the Jambot to recover from a phase discontinuity.

By way of comparison I have also included audio recordings of the same snippets augmented using Dixon’s (2008) BeatRoot system, which is freely available from BeatRoot (2011). The download takes the form of a Java application which enables an audio file to be loaded, offline beat-tracking to be performed, and then playback to be augmented with a click-track. The BeatRoot augmented audio recordings are supplied on the accompanying CD in the files BeatRootTrack*.mp3.

My assessment is that the two perform comparably. To my ears BeatRoot is mildly superior on a few of the snippets (track9, track19 and track20), whilst the Jambot is mildly superior on track14 and significantly superior on track5. In comparing BeatRoot to the Jambot’s beat-tracking it must be remembered that BeatRoot is offline and acausal, whereas the Jambot is real-time and causal. Consequently BeatRoot is often able to click in time from the first beat, whilst the Jambot requires a few beats to lock into the tempo. Also, the Jambot occasionally skips beats. This happens because it is constantly reassessing the beat in real-time, and sometimes it updates the nearest beat estimate to a time slightly in the past. It must also be remembered that BeatRoot is tracking the tactus, whilst the Jambot tracks the substratum, which is often a subdivision of the tactus.


6.7 Attentional Modulation

In Chapter 4 I argued that hierarchical architectures utilising feedback between layers are appropriate for robust perception. The Jambot uses information from its metric analysis to modulate the onset detection functions. Drawing inspiration from Jones & Boltz’s (1989) theory of dynamic attention, the Jambot pays more attention to onsets that occur at expected times, namely at metrically strong times (§2.5.1). This is an example of the use of feedback between layers of the Jambot’s architecture; in this case the feedback is from the analysis layer to the reception layer.

The notion of utilising higher order information to ‘tune’ the parameters of onset detection functions, in the context of beat tracking, is discussed by Goto & Muraoka (1995). They use a multi-agent architecture for beat-tracking, and attach to each agent a set of tuning parameters for the onset detection function. Agents that exhibit high reliability in beat prediction maintain their associated parameters, whilst poorly performing agents dump their parameters in favour of a new randomly generated set. Goto notes that this technique improves the beat-tracking system’s robustness to new (musical) environments.

The Jambot’s attentional modulation strategy works differently from Goto’s agent-based parameter tuning. The dominant metre is used as an expectational template for onsets, i.e. onsets are expected to occur on a beat. The level of expectation is determined by the metric strength. The modulation works by changing the signal/noise ratio required for reporting a significant growth in the detection function, such that the threshold is lower in the vicinity of a beat.
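
The following sketch illustrates the shape of such a modulation; the linear taper and the parameter values are assumptions for illustration, not the Jambot’s actual settings:

```python
def modulated_threshold(base, beat_distance, metric_strength,
                        max_reduction=0.5, width=0.1):
    """Sketch of attentional modulation of the onset-reporting threshold.
    `beat_distance` is the distance to the nearest expected beat (in
    beats); `metric_strength` is that beat's strength in [0, 1].
    """
    if beat_distance >= width:
        return base  # far from any expected beat: threshold unchanged
    proximity = 1.0 - beat_distance / width  # 1 on the beat, 0 at the edge
    return base * (1.0 - max_reduction * metric_strength * proximity)
```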

The attentional modulation is intended to make the beat-tracking more robust through time, rather than across environments. The effect of modulation is to increase the stability of a beat hypothesis by decreasing the effect of spurious onsets. This is particularly helpful for signals with a significant amount of confounding high frequency noise, such as heavily distorted guitars.

In practice I find the attentional modulation to be effective in stabilising the tracking when the correct tempo and phase have been ascertained, and the tempo is reasonably steady. When the Jambot is still seeking the beat, or if the tempo is varying substantially, the attentional modulation is counter-productive. Consequently I tend to manually switch on the attentional modulation once the Jambot’s beat tracking matches my own, and turn it off when it does not.


6.8 Finding the Bar Period

The Jambot’s model of metre consists of two pulses – the substratum and the bar. The two are required to be consonant (and the bar is required to be slower than the substratum), so the bar period is expressed by describing the number of substratum beats in a bar.

In the abstract, the bar period estimation process is similar to the substratum estimation process. At each decision point the surface pulses are extracted and examined for evidence of bar periods. Sufficiently plausible hypotheses spawn new beliefs. Currently held beliefs have their plausibility updated in light of the recent evidence. Beliefs that have become implausible are pruned.

Whereas the substratum estimation process used the onset stream from the SOD detection function (which reacts most strongly to noisy instruments like hi-hats), the bar period estimation technique uses the onset stream from the LOW detection function (which reacts most strongly to bassy onsets, like a kick drum). The motivation for doing this is the observation that, at least in a lot of Western popular music, the kick drum seems to be more important in giving a sense of the downbeat (and in many cases falls on the downbeat) than other percussive instruments.

6.8.1 Beat-Classes and Beat-Class Intervals

The technique that the Jambot uses to estimate the bar period makes use of the notion of beat-classes (Warburton 1988). The term derives from an analogy with pitch-classes. A pitch-class is a collection of pitches that are equivalent to each other modulo the octave; so, for example, the notes C1, C2 and C4 all belong to the pitch-class of C. Analogously, a beat-class is a collection of beats that are equivalent modulo the bar. So for example the downbeat is a beat-class consisting of the first beats of each bar.

Cohn (1992) describes rhythmic content in terms of beat-class intervals, in analogy with Forte’s (1973) pitch-class intervals. An interval between pitch-classes is defined modulo octaves, so that two intervals which are inversionally related, such as a perfect 4th and a perfect 5th, belong to the same pitch-class interval. Similarly, an interval between two beat-classes is defined modulo the bar length. As an analogy to inversionally related pitch intervals, I call two beat-classes complementary if their beat numbers add up to the number of beats in a bar. So for example, if there are 8 beats in the bar, then the 5th beat and the 3rd beat are complementary beat-classes. I apply the same terminology to overtones of the substratum, so again if the bar is in 8 then the 5th and the 3rd overtones are complementary.

The Jambot’s bar period estimation technique operates by making the assumption that surface properties of the music should, to some extent, respect beat-class interval invariance. In particular, a given overtone evident in the surface should be accompanied by an overtone of the complementary tone number. So for example, if the bar is in 8 and the 3rd overtone is evident in the surface, then the 5th overtone should also be. The bar period is then estimated by finding the period that best meets this assumption.

6.8.2 Partition Symmetric Difference

In order to calculate how well a candidate bar period meets the assumption of beat-class interval invariance, I use a function that I have called the partition symmetric difference. The function is similar to the plausibility calculation used for estimating the substratum (6.2), in that it starts by setting the partition symmetric difference (psd) to 1, and then penalises it for each overtone that violates beat-class interval invariance. So for every overtone p of the substratum from 1 up to n (where n is the candidate number of beats in the bar) the partition symmetric difference is penalised via

psd ↦ psd / (1 + |w_p − w_q| / w_candidate)        (6.5)

where p and q are complementary (i.e. p + q = n), w_p and w_q are the weights of the peaks in the ACF corresponding to these overtones (or zero if there is no peak), and w_candidate is the weight of the peak corresponding to the hypothesised bar period. This function penalises bar periods for which there are substantial uncomplemented overtones evident.

Overtones beyond the bar length are transformed back into an overtone from 1 to n, again by the assumption of beat-class interval equivalence. So for example, if the candidate bar length is 8, an overtone of 10 will be expected to be complemented by an overtone of 2.
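
A sketch of the whole calculation might look as follows, assuming the ACF peak weights are supplied as a map from overtone numbers to weights (an assumed representation, not the Jambot’s actual data structures):

```python
def partition_symmetric_difference(acf_weights, n):
    """Sketch of Eq. 6.5 for a candidate bar length of n substratum beats.
    `acf_weights` maps overtone numbers to the weights of their ACF peaks
    (absent overtones count as zero).
    """
    # Fold overtones beyond the bar length back into the range 1..n,
    # following the beat-class interval equivalence described above.
    folded = {}
    for p, w in acf_weights.items():
        cls = p % n or n
        folded[cls] = folded.get(cls, 0.0) + w
    w_candidate = folded.get(n, 0.0)
    if w_candidate <= 0.0:
        return 0.0  # no evidence at all for this bar period
    psd = 1.0
    for p in range(1, n):
        q = n - p  # the complementary overtone: p + q = n
        # Penalise each substantially uncomplemented overtone (Eq. 6.5).
        psd /= 1.0 + abs(folded.get(p, 0.0) - folded.get(q, 0.0)) / w_candidate
    return psd
```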

After all of the evident overtones have been examined for complementarity, the partition symmetric difference is interpreted directly as the plausibility of the bar period.


6.8.3 Perceptual Inertia

Perceptual inertia for the bar period parameter is achieved in a similar manner as for the substratum. The bar period belief may not be evident in the surface of the recent past, but so long as it is coherent with the recent past it can persist. Accordingly, in updating the plausibility of a belief I have employed a slight alteration to the partition symmetric difference function (6.5), since the value w_candidate may not exist (there may be no peak in the ACF at the belief). Much as in the substratum case (6.2), I replace w_candidate with the current plausibility of the belief. Even if the belief is evident in the surface I use the current plausibility in place of w_candidate. As with the substratum, this is the mechanism by which perceptual inertia is achieved.

As with all of the Jambot’s perceptual parameters, the bar period estimation technique does not just calculate the most plausible bar period, but a weighted list of possible periods. At each decision point³ the surface of the recent past is examined for plausible bar periods, and sufficiently plausible periods spawn new beliefs. Then the current beliefs have their plausibilities updated in light of the recent past, and implausible beliefs are pruned.

6.9 Summary

This chapter discussed the Jambot’s beat tracking and metre induction facilities. The Jambot’s sense of timing is rooted in a referent pulse level called the substratum. The substratum is a mathematical convenience designed to assist the beat tracking process. It is defined to be the pulse which is most consonant with the pulses evident in the musical surface. Generally the substratum will be faster than the tactus, but slower than the tatum.

The beat tracking facility consists of algorithms for real-time estimation of the period and phase of the substratum. The period estimation bears similarity to existing algorithms, in that it uses autocorrelation of the rhythmic salience in a multiple belief architecture, and refines the estimate using information from harmonically related peaks of the ACF. It differs from existing algorithms in a number of ways, namely tracking the substratum, use of the SOD onset detector, having a sparse implementation, utilising perceptual inertia, and using feedback from the beat tracking to the front end onset detection in the form of attentional modulation.

³ In the case of bar period estimation the decision points are taken to be LOW onsets, since this is the stream that the bar period estimation function uses.


The substratum phase estimation is also similar to existing techniques; it seeks to maximise the cross-correlation of rhythmic salience with a pulse salience function. It differs from existing techniques in the use of a continuous pulse salience function with discrete rhythmic salience, and in the use of complex regression for the estimate, which is more computationally efficient than an exhaustive search.

The metre induction facility simply estimates the number of substratum beats in one bar. It does this by using the notion of beat-class interval invariance: the Jambot finds the bar period that displays the most beat-class interval invariance.

The next chapter discusses how the Jambot uses the beat tracking and metre estimates to perform rhythmic analyses on the detected onsets.


Chapter 7

Analysis of Rhythm

7.1 Introduction

In the previous chapter I discussed the first stage of the Jambot’s analysis process, the analysis of metre, resulting in a collection of metric contexts. In this chapter I discuss the second stage of analysis – the analysis of rhythm. The first section covers the transformation of the raw onset data into an invariant representation of notes and beat locations. The second section describes the rhythmic analyses that the Jambot performs.

The purpose of performing rhythmic analyses is to inform the generation process. In order to generate complementary and appropriate rhythmic accompaniment, the Jambot analyses candidate musical actions in the context of the recent past. The mechanics of the proactive generation process will be discussed later in §8.3.

To allow for intuitive parametric control over the Jambot’s proactive generation processes I have selected a collection of salient rhythmic features, and implemented analyses that quantify the real-time variation of these features. Together this collection forms a model of rhythm. The features relate to both the temporal patterning and the metric alignment of the rhythm.

The model is designed to give a salient parametrisation of rhythm space, so that the Jambot’s proactive generative process can be intuitively controlled and is able to generate a broad range of interesting rhythms.

7.2 Representing Onsets as Notes

The first stage of the Jambot’s rhythmic analysis process involves interpreting the onset data from the reception algorithms in the metrical contexts obtained from the metrical analysis. The result is the transformation of the reception data into notes and beat locations.


7.2.1 Invariant Representation

The notion of an invariant representation was discussed in §3.4.2. Invariant representations play an important role in perception, particularly in regard to the robustness of perception to novel stimuli. The process of transforming raw stimuli into an invariant representation is a process of abstraction, or categorical perception.

The invariant representation that the Jambot uses consists of notes and beat locations. The chunking of musical acoustic signals into notes is a fundamental process in music perception (Bregman 1990). Most treatises on music theory take the symbolic representation of music comprised of notes as the starting point of their analysis. To engage with these theories it is necessary to represent the musical surface data in some symbolic form that is at least comparable with the notes of common practice notation. Also, utilising a note representation that bears similarity to standard digital representations such as MIDI eases the integration of the Jambot with existing computer hardware and software such as synthesisers. This is particularly important since the Jambot does not itself have any sound synthesis capacity – it relies on external synthesisers to produce sound. The Jambot’s primary mechanism for communicating with external software and hardware is via MIDI messages.

The Jambot’s Note data structure consists of a pitch, velocity and duration. The pitch and velocity variables are designed to be compatible with the MIDI note representation, and as such are integer values that range between 0 and 127. The Note differs from the standard MIDI representation of a note in that the duration is encoded directly in the data structure, whereas in a MIDI file the duration is indirectly encoded as the time difference between the NoteOn and NoteOff events.
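
A sketch of such a structure, using Python’s Fraction to stand in for the Jambot’s Rational type (field names are my rendering of the description above, not the actual implementation):

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass
class Note:
    """Sketch of a Note structure following the description above."""
    pitch: int          # 0-127, MIDI-compatible
    velocity: int       # 0-127, MIDI-compatible
    location: Fraction  # onset as a rational number of substratum beats
    duration: Fraction  # duration in substratum beats, stored directly
```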

The invariant representation of the temporal locations of notes involves a transformation from the raw onset times into beat locations. Representing onset locations in this invariant fashion is central to the Jambot’s robust operation. In this case the invariance is with regard to tempo variations. Once the onset locations have been transformed into beat-space, the tempo may be varied without affecting the structural properties of the rhythm.

Clarke (1999:490) asserts that the separation of actual physical onset timings into structural rhythm and expressive timing is a fundamental perceptual process, related to the level of cognitive abstraction at which the rhythmic timings are represented:

The distinction between rhythm and expressive timing is only psychologically plausible if some mechanism can be identified that is able to separate the one from the other. That mechanism is categorical perception, and both Clarke (1987b) and Schulze (1989) have demonstrated its existence empirically. In general terms, the idea is that listeners assign the continuously variable durations of expressive performance to a relatively small number of rhythmic categories.

Both the onset time and the duration of a Note are expressed in this invariant representation of beats, as a Rational number. In converting the raw times into this representation, times are quantised to the nearest quarter or third of a substratum beat. Once the raw reception data has been transformed into notes and beat locations, the transformed data is fed into the suite of rhythmic analyses described below.
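
A minimal sketch of this quantisation, assuming the raw time has already been converted to a floating point beat value:

```python
from fractions import Fraction

def quantise_location(beat: float) -> Fraction:
    """Sketch: snap a raw beat value to the nearest quarter or third of a
    substratum beat, whichever lies closer.
    """
    quarters = Fraction(round(beat * 4), 4)
    thirds = Fraction(round(beat * 3), 3)
    return quarters if abs(beat - quarters) <= abs(beat - thirds) else thirds
```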

Bilmes (1993) models musical timing as being composed of a combination of categorical (beat relative) timing and absolute timing. Since the Jambot does record absolute timings, it could be an interesting extension to this research to adopt a similar model. Currently, however, the Jambot analyses only categorical timing information.

7.2.2 Time, Beat, Location and Position

The rhythmic categories discussed above consist of proportional durations (Parncutt 1994). In this representation the onset locations and durations of notes are described as rational multiples of the substratum beat period. The transformation from the reception timing data takes input times expressed in terms of the Jambot’s internal clock, and returns a beat location expressed as a rational number of beats. In the implementation of the Jambot I have used the terms time, beat, location and position to denote the various different representations of time, the meanings of which I discuss below.

The Jambot’s internal clock is quantised in accordance with the audio hardware. The fundamental unit of time is determined by the number of audio frames that the hardware processes at any one time. Part of the Jambot’s design is to be able to react as quickly as possible to events in the audio signal, and so it has been designed to operate with a relatively small audio buffer – the recommended buffer size is 128 samples (at a sample rate of 44.1kHz), which corresponds to approximately 3ms. Time is then represented as an integer, namely the number of audio buffers that have been processed since the Jambot was instantiated. References to either time or window in the code should be interpreted this way.

The term beat, when used in the Jambot’s implementation, refers to a floating point number that represents a given time interpreted in a metric context. A time expressed as a beat is theoretically related to a time expressed as a time by integration of the varying tempo. In actuality, the Jambot does not maintain a history of the tempo, and so it is not possible to perform this integration as such. However, each metric context has a variable rate internal clock that continuously keeps track of the current beat, so that the beat at which an event occurs may be recorded at the time of occurrence.

In order to perform the rhythmic analyses described below, the onsets and durations of notes need to be expressed as rational multiples of the substratum beat period. The term location refers to a representation of time in beats, expressed as a rational number of beats. The Jambot’s implementation uses the Rational data structure to hold this representation.

For some purposes it is convenient to express the timing of onset locations and durations as rational multiples of the bar period, rather than rational multiples of the substratum period. In this case I have used the term position to denote the representation, which also utilises the Rational data structure.

7.3 Rhythmic Analyses

Having represented the onsets in the invariant representation of notes at beat locations, the Jambot can next perform a collection of rhythmic analyses. The analyses are chosen to measure the real-time variation of musically salient rhythmic features.

This collection of analyses forms the Jambot’s model of rhythm, which is used later in the generation process. The purpose of this rhythmic model is to provide a salient parametrisation of rhythm space, so as to afford intuitive parametric control of the generative algorithm.

7.3.1 Salient Parametrisation

The Jambot is designed to be used with parametric control. In the context of an acoustic performer jamming with the Jambot, this could be physically realised by a foot pedal or flex sensor controller. In the context of a DJ having the Jambot augment playback of a recording, this could be achieved with a more extensive MIDI controller. In either case the cognitive load required by these controllers should be minimised, as the performer’s concentration is already under significant demand.

In order to minimise cognitive load, the Jambot’s proactive generative process affords direct control over musically salient features. This contrasts with many generative systems, which utilise processes such as Markov chains and neural nets, the parameters of which do not relate directly to musically salient features, and so do not afford intuitive control.

In the sections below I describe the Jambot’s model of rhythm, which is a collection of analyses that measure musically salient rhythmic features. I argue for the salience of the chosen analyses, drawing on a range of music theoretic and music perceptual theories regarding rhythm. I also argue that the collection of analyses represents a good cross-section of the aspects of rhythm commonly regarded as important. The purpose of this model is to allow for parametric control of the generated rhythms that is both intuitive and expressive, i.e. able to generate a broad range of interesting rhythms.

7.3.2 Analysis as Reduction

The rhythmic analyses that I discuss in this section can all be considered as reductions of the musical surface to a higher order representation. By the musical surface I mean the invariant representation of the audio stream resulting from the prior perceptual stages of reception and analysis of metre, as discussed above (§7.2). In other words, the musical surface consists of transcribed notes with their temporal locations recorded in terms of beats. An analysis in this context is a mapping that takes a section of the musical surface as input, and produces a value – the value of the analysis for that sequence of notes.

I describe the analyses as reductions because they perform a data-reduction from the multi-dimensional space of musical surfaces to a single value. The idea is that these analyses perform a task of abstraction analogous to the processes of abstraction involved in human perception, such as chunking (Miller 1956) and categorical perception (Clarke 1999:490; Parncutt 1994).

7.3.3 Theories of Rhythm

The term rhythm is used in a diverse variety of ways (Arom 1991:186). In this thesis I consider rhythm to be a phenomenal property of the musical surface, in contrast with pulse and metre, which are perceptual constructs (§2.5). Amongst the diversity of descriptions of rhythm I will concentrate on three aspects that are commonly identified within music theory and music perception:

1. Accentual rhythm - the grouping of notes according to phenomenal accent.

2. Temporal rhythm - the temporal proportionalities of durations between phenomenal accents.


3. Rhythm and metre - the interaction between the expectational patterning of accents by metre and the phenomenal patterning of accents by rhythm.

The Jambot’s rhythmic analyses measure a number of rhythmic properties from these three categories. The periodicity and coprimality analyses report on temporal aspects of the rhythm, whilst the metric coherence and syncopation analyses report on interactions between rhythm and metre. Accentual aspects of rhythm are not reported on in isolation, but are used in measuring the interaction of metric and rhythmic accents.

Accentual Rhythm

Cooper & Meyer (1960) articulate a theory of rhythm that focusses primarily on the accentual nature of the durations (as well as other mechanisms for accent) of the notes, and the cognitive hierarchy of note groupings that is formed as a consequence. The basis of their model is a collection of five archetypal patterns of accent based on the notion of poetic feet from classical descriptions of prosody (1960:6). These archetypal patterns consist of two or three onsets with a particular arrangement of strong and weak accents. The patterns are:

1. iamb: weak-strong

2. anapest: weak-weak-strong

3. trochee: strong-weak

4. dactyl: strong-weak-weak

5. amphibrach: weak-strong-weak

Notes of longer duration are considered to be more strongly accented than notes of shorter duration. However, the actual lengths of the durations are not specified by these archetypes, which are based purely on a relative notion of stronger versus weaker, or longer versus shorter. Their theory is not fundamentally temporal (although they do discuss at some length the interaction of these groupings with the metre).

This description of rhythm is perceptual, rather than phenomenal, in that the groupings are purely perceptual quantities – they are not directly observable in the musical surface. The circumstances under which a particular group of notes will be analysed as adhering to one or other of these archetypes involve a complex interaction of a variety of mechanisms for phenomenal accent, including duration, dynamics, articulation and harmonic/melodic function. Cooper & Meyer do not attempt to provide a formal mechanism for segmenting the musical surface into these basic groupings, frequently relying upon their intuition as expert musical listeners to decide upon the correct analysis in cases where their theory is ambiguous.

The reliance on intuition in conducting musical analysis is standard practice within music theory, particularly when the analytic theory in question admits several plausible analyses for a given musical surface. For example, Lerdahl & Jackendoff (1983:55) note that their Generative Theory of Tonal Music (GTTM) does not admit an obvious computational implementation.

For a computational musical agent, the musical analyses that it performs must admit computational implementation. Despite Lerdahl & Jackendoff’s scepticism regarding the possibility of finding a numerical scheme for the automatic resolution of analytic ambiguities within preference rule systems, numerous authors have attempted to adapt the GTTM, or other preference rule systems, to perform automated analyses (Hamanaka et al. 2006; Temperley 2001; Cambouropoulos 1998), though none of these operate in real-time. At this stage the Jambot does not perform grouping or segmentation analysis based on patterns of phenomenal accent. Nevertheless, the taxonomy of mechanisms for phenomenal accent outlined in Cooper & Meyer (1960) is utilised in the metric coherence analysis described in §7.3.7.

Temporal Rhythm

Another commonly identified aspect of rhythm is that it relates to the organisation of musical material in time, particularly the proportions or patterning of temporal durations between phenomenal accents in the musical surface. Arom (1991:183) makes the case that temporal relations in the musical surface can exist separately from either accentual groupings or metric considerations. His argument is based upon a comparison of Western art music and central African music traditions, the latter of which he argues is not organised metrically. Whereas in Cooper & Meyer’s theory durations are considered as a form of phenomenal accent, temporal descriptions of rhythm place importance on the relative values of the durations, and the periodicities between phenomenal accents in the musical surface.

The organisation of rhythm implies diverse forms of periodicity, i.e. the predictable recurrence of an event in essentially identical but existentially different form ... Periodicity, which is initially perceived in the equality in time between beats, suggests that measurable units can be found in rhythm. Science de la Musique (1976:II, 903) in (Arom 1991:194)


The first two of the Jambot’s suite of rhythmic analyses fall into this category. The periodicities analysis reports upon the periodicities evident in the onset times between notes in a single perceptual stream. The coprimality analysis reports on polyrhythmic relationships in the onset times.

These analyses do not depend on the metre; the periodic relations in the rhythm are independent of the number of beats in a bar. Nevertheless, they do need to be performed in a metric context, because they depend on the substratum pulse. In the presence of tempo variations the actual physical proportions of note durations will never be equal to a simple ratio. Clarke explains this by separating the structural timing properties of the notes from expressive performance properties:

If there is one principle on which virtually all rhythm research agrees, it is that small integer ratios of duration are easier to process than more complex ratios. And yet this principle is at first sight deeply paradoxical when the temporal properties of real musical performances are examined, for almost nowhere are small integer ratios of duration found — even in performed sequences that are manifestly and demonstrably simple for listeners to interpret. The answer to this apparent paradox is that a distinction must be made between the structural properties of rhythm (which are indeed based on the principle of small integer ratios) and their so-called expressive properties. (1999:489)

These analyses must therefore be carried out on the invariant representation of the onsets in beat-space, rather than on the raw onset times. The Jambot converts the raw onset times to the closest quarter or third of a substratum beat.

Rhythm and Metre

The interaction between rhythm and metre provides another mechanism by which temporal aspects of rhythm can be described. Whereas the accentual perspective on rhythm (§7.3.3) categorises notes as strong or weak, the metre categorises points in time as strong or weak. Clarke observes that the interaction between these two accent structures is both interesting and relatively unexplored:

Few studies have investigated the relationship between grouping and meter, despite Lerdahl and Jackendoff’s insistence that it is in the interactions between the two that the power and interest of rhythmic structure lies. (Clarke 1999)

The importance of temporal properties of rhythm in the perception of metre is attested to by many researchers. Certainly every computational approach to metre induction that I have come across has utilised onset periodicity information in some manner. Clarke goes further, suggesting that the interaction between the accentual and temporal descriptions of rhythm is important to the perception of metre:

A smaller literature does exist, however, that has considered the nature and variety of different kinds of accent and the ways in which accentual and temporal factors mutually influence one another (e.g. Dawe, Platt & Racinne, 1993; Drake & Palmer, 1993; Povel & Ockkerman, 1981; Thomassen, 1982; Windsor, 1993; Woodrow, 1909). Useful though this work has been, there has been little attempt so far to integrate what is known about the perceptual functioning of anything other than temporal accents with existing or new models of meter perception. This is unfortunate both because purely temporal models of meter perception . . . are unrealistic in the demands that they make by deriving meter from temporal information alone and because such models tend to project a static and one-dimensional view of meter, rather than the more dynamic and fluid reality that is the consequence of the interplay of different sources of perceptual information (temporal and accentual). (1999:489)

The Jambot’s metre perception does not utilise such interactions, relying solely on temporal properties of the phenomenal accents. However, the third and fourth of the Jambot’s suite of rhythmic analyses report on interactions between the accentual structure of the rhythm and the accentual structure of the metre. The metric coherence analysis reports on the level of alignment between the rhythmic and metric accents for the rhythmic variables of duration, dynamics and pitch/timbre. The syncopation analysis reports on the extent to which note durations cross over metrically strong beats.

7.3.4 Rhythmic Analyses

The Jambot’s model of rhythm comprises a collection of analyses that measure salient rhythmic features. The rhythmic analyses that I consider are

1. periodicities - what periodicities are present in the rhythm.

2. coprimality - how consonant the periodicities are.

3. metric coherence - how well aligned to the metre the rhythm is.

4. syncopation - how syncopated the rhythm is.

5. density - how densely the rhythm fills the bar.

In the following sections I describe these analyses in more detail.


7.3.5 Periodicities

The first of the Jambot’s rhythmic analyses is the periodicities analysis, which reports on temporal periodicities in the onset times for one perceptual stream. This analysis describes an aspect of temporal rhythm (in the sense of §7.3.3) and is not dependent on the metre. It is, however, dependent on the metric context in the sense that it measures periodicities in the beat locations of the onsets, not in the raw onset times.

Arom (1991:202) discusses temporal rhythm as being formed by periodicities (or more complex recurrent patterning) in the phenomenal accents of the musical surface. The accents may be formed by a variety of mechanisms including duration, timbre and dynamics. The periodicities analysis reports only on periodicities in dynamic accent.

The analysis is performed by calculating something similar to a velocity weighted Inter-Onset-Interval histogram. In this case the Inter-Onset-Interval is calculated between all pairs of notes, not just successive notes (this definition of the Inter-Onset-Interval histogram is also often used when seeking to determine metre from periodicities in the musical surface – see §6.2.1).

The output of the periodicities analysis is a weighted list of periodicities found in the surface. The periodicities are measured in units of the substratum period. Only periodicities that are integer multiples of the substratum beat and less than or equal to the bar period are reported. The calculation is performed by looping through the periods 1 to H, where H is the number of substratum beats in a bar and the periods are measured in units of the substratum period. For a periodicity to be reported it requires just a single pair of onsets whose Inter-Onset-Interval equals that period. Whenever such a pair is found, the average of their amplitudes is added to the reported weight for that periodicity.
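
In outline, and assuming notes are supplied as (location, velocity) pairs with locations in substratum beats (an assumed representation), the calculation could look like this:

```python
from itertools import combinations

def periodicities(notes, bar_beats):
    """Sketch of the velocity-weighted all-pairs IOI histogram."""
    weights = {p: 0.0 for p in range(1, bar_beats + 1)}
    for (t1, v1), (t2, v2) in combinations(notes, 2):
        ioi = abs(t2 - t1)
        # Report only whole-beat periodicities up to one bar in length.
        if ioi == int(ioi) and 1 <= ioi <= bar_beats:
            weights[int(ioi)] += (v1 + v2) / 2.0
    return weights
```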

7.3.6 Coprimality

The second of the Jambot’s rhythmic analyses is the coprimality analysis. It reports on polyrhythms found in the dynamic accent structure. It does this by finding amplitude weighted pairs of coprime periodicities in the onset locations.

The coprimality analysis performs a secondary analysis on the nature of the periodicities found in the musical surface. The input to this analysis is the weighted list of periodicities output from the periodicities analysis. The idea of the analysis is to give some measure of the rhythmic complexity, by looking for periodicities that are not related to each other by a simple integer ratio.

London defines a polyrhythm as “any two or more separate rhythmic streams in the musical texture whose periodicities are noninteger multiples” (2004:49). In this sense the coprimalities found by this analysis are different from polyrhythms, since they are found in a single rhythmic stream. However, combining the coprimality measures of the perceptual streams gives a means for both detecting and generating polyrhythms, particularly in the context of repetitive drum patterns. This is because repetition at the level of the bar tends to mean that all streams have some level of periodicity at the bar period, and consequently polyrhythms may be formed by combining a stream with low coprimality (meaning that its periodicities are divisors of the bar length) with a stream of high coprimality (meaning that it possesses a periodicity coprime with the bar length). It would be interesting to perform a similar coprimality calculation between streams, which would more directly model polyrhythms; however, I have not implemented this at this stage.

London (2004:50) suggests that polyrhythmic material is cognitively more complex to listen to. More generally, Clarke (1999:489) suggests that complex ratios of duration are cognitively more demanding to process (regardless of whether these ratios are formed within a single perceptual stream, or by the interaction of several streams). In light of these suggestions the coprimality measure is designed to give some sense of rhythmic complexity.

The calculation is performed by looping through all pairs of periodicities output from the periodicities analysis, and testing whether their periods are coprime (meaning that neither is a simple integer multiple of the other)¹. When a coprime pair of periodicities is found, the weight of the coprimality is taken to be the ratio of the weights of the periodicities divided by the weight of the heaviest periodicity.
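
A sketch of the pair scan follows; the weighting rule as described admits more than one reading, so the weighting below is one possible interpretation and should be treated as an assumption:

```python
from math import gcd

def coprimality(period_weights):
    """Sketch of the coprimality scan over the reported periodicities.
    `period_weights` maps integer periods (in substratum beats) to weights.
    """
    pairs = {}
    w_max = max(period_weights.values(), default=0.0)
    if w_max <= 0.0:
        return pairs
    periods = sorted(p for p, w in period_weights.items() if w > 0.0)
    for i, p in enumerate(periods):
        for q in periods[i + 1:]:
            if gcd(p, q) == 1:  # coprime: greatest common divisor is 1
                # One reading of the rule: the lighter weight of the pair,
                # relative to the heaviest periodicity overall.
                pairs[(p, q)] = min(period_weights[p], period_weights[q]) / w_max
    return pairs
```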

7.3.7 Metric Coherence

The third of the Jambot’s rhythmic analyses is a measure of metric coherence, which reports both on how coherent a sense of metre is implied by the perceived rhythm, and on how well aligned this sense of metre is to the current metric context.

¹ To be precise, two integers are coprime if their greatest common divisor is equal to 1, or equivalently if their prime factorisations have nothing in common. This condition is actually more stringent than requiring that neither is an integer multiple of the other – for example, 4 and 6 are not coprime because they share a common factor of 2.


In the perception of rhythm there appears to be a strong tendency to interpret rhythmic material in terms of an underlying metre (Lerdahl & Jackendoff 1983; Toiviainen & Snyder 2000:17). In Chapter 6 I discussed several descriptions of metre, and outlined the metre induction process that the Jambot utilises. The induced metre consists of a substratum pulse in combination with the number of substratum beats in a bar.

The metric coherence analysis uses a more detailed model of metre, similar to Huron’s description (§2.5.1), which additionally asserts that each beat-class has an associated metric strength. Since the metre induction procedure does not estimate the metric strengths for individual beat-classes, I have set these by hand, according to my intuition. I acknowledge that this introduces stylistic bias into the analysis. In the future I hope to extend the Jambot’s metre induction capacity to include the induction of metric strengths at each beat-class. The particular strengths that I chose depend on the number of beats in the bar. For a bar in 8, inspired by the GTTM, the vector of strengths was chosen to be

[1,0.125,0.25,0.125,0.5,0.125,0.25,0.125]

with similar schemes for bars of length 4 and 16. For all other bar periods, the downbeat was assigned a strength of 1 and all other beat-classes a strength of 0.5.

Metre as I am describing it is not a directly observable property of a musical stream; it is a perceptual construct in the mind of the listener that is formed in response to the actual rhythmic patterns in the music. Rhythm in this context refers to the manner in which accented beats are grouped with unaccented beats. Here by an accent I mean “a stimulus which is marked for consciousness in some way” (Cooper & Meyer 1960:8), or what the GTTM refers to as a phenomenal accent.

A number of authors (Arom 1991; Cooper & Meyer 1960; Epstein 1995; Lerdahl & Jackendoff 1983; Temperley 2001; Dixon 2001) identify three primary means of phenomenal accent, which I will refer to as rhythmic variables:

(i) Stress

(ii) Duration

(iii) Pitch or Timbre

Where these variables work in concert to imply a metre, I say that the music is metrically coherent, whilst if these markers induce contradictory senses of metre, I say the music is metrically ambiguous.

Phenomenal accent functions as a perceptual input to metrical accent – that is, the moments of musical stress in the raw signal serve as “cues” from which the listener attempts to extrapolate a regular pattern of metrical accents. If there is little regularity to these cues, or if they conflict, the sense of metrical accent becomes attenuated or ambiguous. If on the other hand the cues are regular and mutually supporting, the sense of metrical accent becomes definite (Lerdahl & Jackendoff 1983).

The metric coherence analysis reports both on how mutually consistent these three rhythmic variables are, and on how well they align with the metric context.

To perform this analysis it is necessary to make numerical assessments of the level of accent that different values of these rhythmic variables represent. Dixon (2001) describes a similar approach to constructing a combined salience measure for detected notes in his work on beat tracking. He also constructs numerical measures of the salience of pitch, duration and velocity, and combines them according to empirically determined weights, with duration being the dominant factor. He notes that the dominance of duration is probably particular to pitched melodic material, and in particular should not be applied to non-pitched percussive instruments.

The Jambot’s analysis does not attempt to combine the salience of these variables, but rather to measure how synchronised they are. The question of how to weight these rhythmic variables when considering appropriate musical responses will be taken up in the discussion of the Chimaera (§8.3.10).

Description of the metric coherence analysis

The output of the metric coherence analysis is not a single number, but rather a collection of numbers representing the coherences between each pair of rhythmic variables, and between each rhythmic variable and the metric context. The values of the analysis are phrased in terms of metric violations rather than metric coherence (metric violation = 1 − metric coherence). The calculation is performed as follows (see the sketch after this list):

• For each rhythmic variable the metric violation is initialised to 0.

• For each note in the rhythm a degree of emphasis is calculated for each of the rhythmic variables of duration, timbre and velocity. The emphasis is scaled to lie between 0 and 1.

• These emphases are compared with the metric strength of the location of the note in the bar (which is also scaled between 0 and 1). Any difference between the metric strength and the emphasis for a rhythmic variable is added to the metric violation for that variable.

The result of this analysis is a multidimensional measure both of how coherent the metric suggestions of the rhythmic variables are, and of how well they align with the underlying metric context.
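
A sketch of this accumulation, abstracting away the per-note emphasis calculation (all names and representations here are illustrative, not the Jambot’s implementation):

```python
def metric_violations(notes, strengths, bar_beats):
    """Sketch of the metric-violation accumulation described above.
    `notes` is a list of (beat_position, emphases) pairs, where emphases
    maps each rhythmic variable to an emphasis in [0, 1]; `strengths`
    maps beat-classes to metric strengths in [0, 1].
    """
    violations = {"duration": 0.0, "timbre": 0.0, "velocity": 0.0}
    for position, emphases in notes:
        strength = strengths.get(position % bar_beats, 0.0)
        for variable, emphasis in emphases.items():
            # Any mismatch between emphasis and metric strength counts
            # towards the violation for that variable.
            violations[variable] += abs(emphasis - strength)
    return violations
```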


7.3.8 Syncopation

The fourth of the Jambot’s suite of rhythmic analyses is the syncopation analysis, which reports on the level of syncopation present in a given perceptual stream. In this context syncopation is defined to occur when a note’s duration crosses over a beat of greater metrical strength than the note’s onset beat.

Syncopation is identified as an important aspect of rhythm in many treatises on music theory and music perception (Meyer 1956; Cooper & Meyer 1960; Arom 1991; Epstein 1995; Lerdahl & Jackendoff 1983; Temperley 2001; Huron 2006). A classical definition of syncopation is “the extension of a sound begun on a weak beat onto the strong beat” (Rousseau 1768:459), quoted in Arom (1991:207). Huron offers a psychological explanation of syncopation as a violation of expectation, caused when a note occurring on a metrically weak beat is not followed by a note at the next strong beat:

Whether [a] note is perceived as syncopated or not depends entirely on what happens after the note ends. It is the absence of an onset on the ensuing stronger metrical position that creates the syncopated experience [emphasis in the original]. (Huron 2006:200)

Arom (1991:207) notes that these definitions imply that syncopation must be understood within the context of a metrical structure that differentiates between strong and weak beats.

London offers a slightly different definition of syncopation as a "mismatch between an established metre and a rhythmic figure ... in which the durational periods on the surface are out of phase with those of a related metric subcycle" (2004:86). He relates this to Krebs's (1987) theory of indirect dissonance, which describes periodicities of conflicting phase that are present sequentially rather than simultaneously. These sorts of dissonances need not rely on an underlying metre (in the sense of a hierarchy of pulses, or a pattern of strong and weak beats), requiring only the perceptual inertia of a single pulse.

The syncopation analysis is constructed along the lines of Huron's description of syncopation as an onset on a metrically weaker beat not followed by an onset on the next metrically stronger beat. The calculation is performed by looping through all of the notes in the musical surface and recording a syncopation whenever an onset occurs that is not followed by an onset at the next stronger beat. All syncopations are added with equal weight, and the output is normalised by the number of notes in the surface, so that the result may be interpreted as the percentage of the notes that are syncopated.
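As a sketch (not the Jambot's actual code, whose note representation differs), this calculation might be expressed as follows, with the musical surface represented as a quantised grid of onset booleans and a parallel list of metric strengths:

```python
def syncopation(onsets, strengths):
    """Proportion of notes that are syncopated, after Huron's definition:
    an onset that is not followed by an onset at the next beat of greater
    metrical strength.

    onsets:    one boolean per quantised beat location.
    strengths: one metric strength value per beat location.
    """
    note_count = 0
    syncopated = 0
    for i, onset in enumerate(onsets):
        if not onset:
            continue
        note_count += 1
        # locate the next beat of greater metrical strength than the onset beat
        for j in range(i + 1, len(onsets)):
            if strengths[j] > strengths[i]:
                if not onsets[j]:       # no onset there: record a syncopation
                    syncopated += 1
                break                   # only the next stronger beat matters
    return syncopated / note_count if note_count else 0.0
```

For example, in a 4/4 bar at quaver resolution with strengths [4, 1, 2, 1, 3, 1, 2, 1], an onset on the fourth quaver with no onset on the (stronger) fifth quaver counts as a syncopation.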


7.3.9 Density

The last of the Jambot's rhythmic analyses is the density analysis, which reports on the percentage of onsets in a perceptual stream compared to the number of substratum beats. Rhythmic density has been identified as an important factor in musical affect (Livingstone 2008). Density is also commonly incorporated in generative music systems (Rowe 1993).

The density calculation falls outside of the rhythmic categories discussed above in that it is neither accentual nor temporal in nature: it does not rely on a metrical hierarchy of strong and weak beats, nor does it reference the dynamic, durational or timbral accents of the musical surface. It does use the substratum period as a baseline for comparison (i.e. the density is expressed as the percentage of the substratum beats for which there is an onset); however, a different choice of baseline changes the density calculation by a constant factor; it does not change the relative densities of two surfaces.

A different approach to calculating the density might be to weight the onsets by their salience (in any of the accentual dimensions); however, the Jambot's density calculation looks only at whether or not an onset occurs.

The actual calculation of the density analysis is performed by dividing the number of notes in the musical surface by the number of beats that the surface spans.
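Under the same grid representation used in the sketch above (again a hypothetical stand-in for the Jambot's internal representation), the calculation reduces to a single line:

```python
def density(onsets):
    """Density of a surface quantised to substratum beats: the proportion
    of beat locations that carry an onset (unweighted, as described above)."""
    return sum(onsets) / len(onsets) if onsets else 0.0
```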

7.4 Summary

This chapter discussed the rhythmic analyses that the Jambot performs. Having inferred a metric context from the musical surface, the Jambot represents onsets in the invariant representation of notes at beat locations. This representation is invariant to changes in tempo, and is thus well suited to structural analyses of rhythmic properties in the presence of varying tempo.

The Jambot performs a variety of rhythmic analyses, rooted in theories of rhythm perception, including both temporal and accentual properties. These analyses together parametrise rhythm in terms of salient musical features. The benefit of such a parametrisation is that it allows for intuitive manipulation of rhythms, as will be discussed in chapter 8.


Chapter 8

Generation

8.1 Introduction

The generation stage is the third component of the Jambot's architecture. The task of the generation stage is to produce musically appropriate actions, given the current musical context inferred by the reception and analysis stages. The generation stage utilises three improvisatory strategies: reactive, proactive and interactive.

The reactive strategy operates by performing real-time imitation of the onsets detected in the input signal. The Jambot uses transformational mimesis to do this effectively. Transformational mimesis is imitation that has been transformed to obfuscate the connection between the imitation and its source. This chapter describes transformational mimesis as an example of a broad interaction paradigm that I call Smoke and Mirrors. The reactive strategy uses the onset timing information gathered in the reception stage, but does not require information from the analysis stage.

The proactive strategy attempts to use the musical understanding gathered in the analysis stage to make informed musical choices. Unlike the reactive strategy, the proactive strategy relies on prediction of future musical events. The Jambot's proactive strategy utilises its rhythmic analyses to make predictions regarding the musical consequences of its actions. It has a number of musical goals that it seeks to achieve through its improvisation.

It utilises a novel form of optimisation, called anticipatory timing, to balance the various competing goals. Anticipatory timing is an optimisation technique designed to balance computational efficiency with optimality, in a way that is especially suited to rhythmic generation.

The proactive strategy also potentially utilises all of the hypothesised metric contexts, selectively emphasising more or less plausible contexts to manipulate the overall level of ambiguity in the improvisation. In Chapter 2 I argued that manipulating ambiguity provides a mechanism for altering the balance between novelty and coherence, an important attribute in musical improvisation.


The experience of implementing this proactive strategy has led me to question a common theory in music perception, and more broadly in Gestalt psychology: the Figure-Ground dichotomy (Boring 1942). In music perception this dichotomy surfaces in the supposition that listeners will only perceive a single metre at any point in time. I argue that, in fact, multiple metres may be simultaneously perceived, albeit unconsciously, and that manipulation of these multiple perceptions is an efficacious compositional strategy.

The interactive strategy combines the proactive and reactive strategies, using a mechanism that I call juxtapositional calculus. The idea of the juxtapositional calculus is to provide a means of deviating from a baseline of transformational mimesis, so that elements of musical understanding can be acted upon in a way that minimises cognitive dissonance. The interactive strategy also implements a model of attention in an attempt to imbue the Jambot with a sense of agency.

8.2 Reactive Generation Techniques

When the music is really happening, I as a player try not to think about it and let the music lead. Other times thoughts creep in. In an ideal situation the music would always take over.

Borgo (2002)

The goal of this research project has been to create a computational agent that displays rudimentary improvisatory intelligence. The theoretical perspective I have employed in constructing the agent is that of Situated Cognition (§3.4.3). Situated Cognition is a perspective on human thought that views knowledge as a dynamic combination of parallel influences, acknowledges the importance of context to understanding, and argues that perception is properly described as an interaction between the perceiver and the environment. This viewpoint contrasts with classical approaches to artificial intelligence (AI), which privilege knowledge representations.

An important point of difference between classical AI and the approach I have taken in this project relates to the philosophical intent. The fundamental goal of classical AI is to create an agent that can be viewed as intelligent. The goal of the Jambot is different; I am not primarily concerned with whether the Jambot is intelligent, I am mostly concerned with whether or not it makes for an effective jamming partner. Whether it achieves this by virtue of some form of understanding of the structure of music, or by cunning use of the art of illusion, is a secondary concern. The question that I am really addressing in this research is: what is the most effective way to (musically) behave to create an impression of improvisatory intelligence?

I have concluded that the art of illusion should not be discounted in this endeavour. One of the Jambot's improvisatory techniques, dubbed Transformational Mimesis (§8.2.6), adopts this philosophy, utilising an interaction paradigm that I call Smoke And Mirrors (§8.2.2). Transformational mimesis is essentially reactive, in the sense that it is reacting to what is in the music, without necessarily having any musical understanding.

In this section I describe the Smoke and Mirrors paradigm, and its application in the Transformational Mimesis improvisatory strategy. In the next sections I will go on to describe the Jambot's proactive generation techniques, and the ways in which the reactive and proactive generation techniques can be combined to provide robust and convincing generative accompaniment.

8.2.1 The Turing Test

Since the birth of AI in the 1950s there has been debate around what constitutes intelligence. Alan Turing (1950), often considered the godfather of AI, championed a behaviourist criterion that has become known as the Turing Test. The test is that if a computational agent can behave in such a way as to fool a human into believing that it is intelligent, then it is intelligent.

Turing was referring to a particular kind of interaction: a non-realtime natural language conversation. This was important to his definition of intelligence, because the computer (at that stage) was never going to look like a human, and also would not be able to process the input fast enough to engage in a real-time dialogue. Nevertheless the key philosophical point is that his criterion for intelligence related to the effect that it had on a human in interaction, irrespective of the manner of its operation. Ariza (2009) has discussed the Turing Test in a musical context.

The Turing Test has received a great deal of criticism, and by and large the task of creating agents that can pass the Turing Test has not been actively pursued by the AI community (Russell & Norvig 1995). I suspect that this is for two reasons: partly because of disagreement that the test truly captures the internal essence of intelligence, but mostly because attempts to create an agent that passes the Turing Test have been unconvincing. Hawkins criticises the test from a philosophical perspective:

Intelligence is not just a matter of acting or behaving intelligently. Behaviour is a manifestation of intelligence, but not the central characteristic or primary definition of being intelligent. A moment's reflection proves this: you can be intelligent just lying there in the dark, thinking and understanding. Ignoring what goes on in your head and focusing instead on behaviour has been a large impediment to understanding intelligence. (Hawkins & Blakeslee 2004:28)

In regard to the nature of intelligence, I agree with Turing's critics that the Turing Test criterion does not capture the fundamental essence of intelligence. However, for most applications of AI, actually being intelligent is not important, and I believe that the quest to create agents that are truly intelligent has detracted from their success at engaging in the tasks for which they are designed. I further suggest that for many applications in Human Computer Interaction for which AI has been deemed appropriate, even Turing's criterion is unnecessarily restrictive. For example, with the Jambot, I am not attempting to fool anyone into believing that the Jambot is a human; quite the opposite, I am explicitly advertising that it is an artificial agent. What I do regard as important is that it creates an impression of agency for a human interacting with it.

A similar line of thought is discussed by Michael Mateas in his discussion of Believable Agents for computational interactive drama:

When attempting to marry a technical field like Computer Science with a cultural activity such as story telling, it is extremely easy to become sidetracked from the artistic goals and to begin pursuing purely technical research. This research, while it may be good science, does not lead you closer to building a new kind of cultural experience: engaging, compelling, and hopefully beautiful and profound. Effective techno-artistic research must continuously evaluate whether the technology is serving the artistic and expressive goals. The application of this principle to interactive characters implies that interactive character technology should follow from an understanding of what makes characters believable. And indeed, creators of non-interactive characters have written extensively on what makes a character believable.

Before continuing, it's a good idea to say something about this word believable. For many people, the phrase believable agent conjures up some notion of an agent that tells the truth, or an agent you can trust. But this is not what is meant at all. Believable is a term coming from the character arts. A believable character is one who seems lifelike, whose actions make sense, who allows you to suspend disbelief. This is not the same thing as realism. For example, Bugs Bunny is a believable character, but not a realistic character. (Mateas 1999)

The Smoke and Mirrors paradigm (which I will describe further below) aims to create a Believable Agent, and does so essentially through the art of illusion. I implemented this technique in an early version of the Jambot, called Robobongo. Along with a five piece band (The Robobongo Allstars) this version of the Jambot gave a live improvised performance at the QUT Ignite conference in 2008. An interesting discovery for me that came out of this performance was that the Smoke and Mirrors paradigm was in some sense too effective. Robobongo's contribution to the performance was sufficiently seamless that, aside from a few people who sat and watched attentively, people didn't realise that there was a robot playing (despite the wild bongos sans bongo player). From my point of view this was undesirable; what I am aiming to achieve is an agent that is musically convincing, but not necessarily like a human.

8.2.2 Smoke And Mirrors

The paradigm of Smoke And Mirrors is one of imitation. In an interaction between a human and a computational agent, there is already intelligence in the dialogue: that of the human. The premise of the Smoke And Mirrors paradigm is to leverage the intelligence already present, and to manipulate it in such a manner as to create an impression of agency external to the human. Breaking down the metaphor further: a mirror is something that reflects, whilst smoke obfuscates. The idea of Smoke and Mirrors is to reflect the intelligence of the human, but to transform it sufficiently that its source is not recognisable, yet it retains enough of the form of the source to portray a convincing agent.

An (almost literal) example of the Smoke and Mirrors paradigm put to use in an artistic context is given by Daniel Rozin's Wooden Mirror, pictured in Figure 8.1. The mirror is composed of a large number of independently mobile wooden slats that are controlled by servo-motors to change their angle in response to input from a video camera. The camera captures the image of the person looking at the mirror, and adjusts the wooden slats so that they capture more or less incident light from a light-source overhead. The slats are adjusted to resemble the input image, so that the Wooden Mirror is in a sense reflecting what is looking at it, albeit with a curious twist.

Figure 8.1: Rozin's Wooden Mirror

In a musical improvisation setting, the Smoke And Mirrors paradigm involves imitating what is happening in the music, but with a twist. In this way it inherits musicality, yet maintains a sense of individuality. Some benefits of this approach to improvisation are robustness to novel musical situations, no requirement of stylistic knowledge, and rapid adaptation to changes in the musical environment.

Looking to myself as an improvising musician, I would say that this technique plays a large role in what I do in improvisation, certainly in free improvisation. A common improvisatory technique for me is to latch on to a riff that someone else has played, repeat it until it forms a groove, then repeat it with variation. Whilst there is a limit to the musical complexity that can be expected to arise from this approach, it is nevertheless an effective jamming technique in lieu of any pre-agreement as to musical content.

An example of an interactive music system that utilises this technique is Francois Pachet's Continuator (2002), depicted in Figure 8.2. The Continuator takes MIDI input via a keyboard, and makes a musical response. Unlike the Jambot it operates strictly in a call-response modality. It waits until the human interactor has finished playing a phrase (which it determines simply by waiting for a period of silence longer than a couple of seconds), and then responds with a phrase that is stylistically similar. The phrase that the Continuator responds with actually uses exactly the same rhythm as the input. The pitches that it uses are determined by a hierarchical Markov model trained on the input phrase. Some examples of the Continuator can be seen in BernardLubat.mp4 and Children.mp4.

Figure 8.2: Children interacting with the Continuator

More broadly, Pachet's description of reflexive interactive music systems (§2.2) places them in the Smoke and Mirrors paradigm. He describes reflexive systems (such as the Continuator) as using a similar mirroring effect, so that the user can "experience the sensation of interacting with a copy of themselves" (Pachet 2006:360), and notes that the mirroring cannot be too direct, else this sensation is lost.

The Smoke And Mirrors paradigm is reminiscent of the classification of artists as (in part) bricoleurs, as discussed by Levi-Strauss (1962), who notes that the bricoleur "uses devious means compared to those of a craftsman" (1962:16). Levi-Strauss characterises bricoleurs as operating by reorganising existing materials and events:

It is common knowledge that the artist is both something of a scientist and of a 'bricoleur'. By his craftsmanship he constructs a material object which is also an object of knowledge. We have already distinguished the scientist and the 'bricoleur' by the inverse functions which they assign to events and structures as ends and means, the scientist creating events ... by means of structures and the 'bricoleur' creating structures by means of events. (1962:22)

An example of the efficacy of this approach in AI is given by the program ELIZA. Developed by Joseph Weizenbaum in the 1960s, ELIZA was an early attempt at a conversational agent, intended as a parody of a psychoanalyst. ELIZA operated conversationally by taking what was said to it and repeating it back in manipulated form. Although originally created as a joke, it was for many years the most successful AI agent from the point of view of the Turing Test.


ELIZA's modus operandi is a direct parallel of the Smoke And Mirrors paradigm: it reflects what is being said, but with some transformation to give the illusion of understanding. A more recent conversational agent inspired by ELIZA, named PC Therapist, was the winner of the inaugural (1991) Loebner Prize (an actual Turing Test contest hosted annually by the Cambridge Centre for Behavioural Studies).

Another high-ranking entry that year deliberately made typing mistakes to give the impression of humanity. Interestingly, at the same event one of the humans, chosen for her in-depth knowledge of Shakespeare (which was the designated topic of conversation), was repeatedly mistaken for a computer.

Ms. Cynthia Clay, the Shakespeare aficionado, was thrice misclassified as a computer. At least one of the judges made her classification on the premise that "[no] human would have that amount of knowledge about Shakespeare". (Schieber 2009)

The point of the conversational analogy is that in a musical improvisation you don't need to understand everything that's going on: if you can at least discern a few things that are important and highlight them, then you can give the impression that you do know what's going on. For example, you may not know the key, or the mode that is being employed, but if you can guess the tonic and the downbeat then you can at least get away with playing the tonic on the downbeat. You may not appear to be the world's greatest musician, but at least you will give the sense that you understand the music. People may even suspect that you are deliberately showing tasteful restraint.

8.2.3 Imitation and Representation

The Jambot utilises Smoke And Mirrors as an aspect of its improvisatory process. The idea for this stemmed from the use of imitation as an evaluative mechanism for the reception algorithms. For example, the polyphonic pitch tracking component of the Jambot attempts to chunk the audio stream into notes. In order to get a sense of how well this component was working, I utilised imitation, where the Jambot plays back exactly what it hears (as it is hearing it).

Naturally the Jambot's perceptual processes are not perfect, indeed far from it. Consequently the imitation the Jambot produces is a transformed version of the input. In this case the 'Smoke' is caused by imperfect perception. Ideally it would be possible to imitate exactly, yet artfully choose to depart from exact imitation to create a sufficient sense of independence for the illusion of agency. However, in reality, parsing an audio stream into notes is an extremely difficult engineering problem. This project has made several advances in this area, but the problem is far from being solved. However, as discussed above, it is not important to completely understand the musical environment, rather just to understand the salient aspects.

The idea of imperfect perception carries a pejorative overtone that in some ways may be undeserved. To be sure, there are times when the Jambot's perceptions are just plain wrong, but at other times they are not so much wrong as different to the manner in which a human would perceive the same audio stream. For example, the polyphonic pitch tracking algorithm takes MQ-analysis frequency tracks (McAulay & Quatieri 1986), clusters them in frequency and segments them in time, to produce notes (§5.3.5). The mechanism by which it clusters them in frequency is quite simple: for a given frequency, any overtones of lesser energy are aggregated into this fundamental, and an overtone of greater energy is assumed to be a new fundamental.

In reality, an overtone produced from the same physical source as a given fundamental frequency may well have more energy than that fundamental. As an example, spectral analysis that I have performed on the timbre of my clarinet playing reveals that (for me) the first harmonic tends to have more energy than the fundamental. It is an intriguing aspect of human hearing that we tend to perceive overtones of the fundamental as belonging to the fundamental even when the fundamental is significantly softer, or even missing (Bharucha 1991:86). The ear utilises many more cues than does the Jambot in clustering frequencies, for example the correlation of amplitude and frequency modulations amongst physically related frequency components (Bregman 1990). Consequently the Jambot will have a different perception to a human as to what pitches are present in the audio stream. An interesting side-effect of this, when using imitation as an improvisatory technique, is that the pitches that the Jambot 'imitates' will often not be the pitches we hear, but will generally be harmonically related. As a result the Jambot's improvisation inherits a harmonisation from the differences in perception.

It should also be noted that exact imitation is not the goal. In fact, (close to) exact imitation would be quite easy in a digital setting: just pass on the raw audio buffer unaltered, or with some filtering. Looked at this way, the Jambot can be thought of as a filter. What differentiates it from, say, a graphic equaliser is that the transformations it performs are not on the audio itself, but on some higher order representation of the audio. Even when the Jambot is operating entirely imitatively there is still a layer of musical understanding involved in both the analysis from audio to notes and the interpretation from notes back into audio.


8.2.4 Situated Dialogue

In discussions of the connection between perception and action, the concept that perception should be viewed as an interaction between the perceiver and the environment is usually framed in terms of a static environment. Certainly there is the notion of perception being active and temporal; this is, for example, an important aspect of Gibson's (1979) theory of Ecological Psychology, but the source of this activity is the motion of the perceiver through the environment, rather than a change in the environment itself. Similarly, discussions such as those of Brooks (1991b) or Simon (1969:64) on behaviour-based robotics, where complexities in the agent's behaviour are viewed as inherited from the complexity of the environment, are still framed in terms of a static environment.

Viewed as a geometric figure, the ant's path is irregular, complex, hard to describe. But its complexity is really a complexity in the surface of the beach, not a complexity in the ant. (Simon 1969:64)

The notion of a changing environment is addressed in the perspective of Situated Cognition, in which the complexity of an agent's behaviour is inherited both from the complexity and the dynamism of the environment. What is not addressed, however, is the case of dialogue, where the environment contains intelligent agents. I posit a stance that views complexity (and the impression of intelligence) in an agent's actions as inherited from the human in a dialogue. It is ironic that the context of a dialogue has received little attention as an environment in discussions of Situated Cognition, because in dialogue the importance of the interaction between the perceiver and the environment is manifest, as is the role bidirectional feedback plays in the emergence of structure from the interaction.

I suggest the term Situated Dialogue for the application of ideas of Situated Cognition to designing interaction strategies for dialogue. By dialogue, I don't just mean a linguistic exchange, but any communicative exchange between agents. The motivation for utilising Situated Dialogue as an interaction design paradigm is that by communicating in the same way a human does, the cognitive dissonance that the human experiences in the communication will hopefully be minimised, even when the computational agent has an incomplete understanding of the content of the dialogue.


Figure 8.3: Edward Ihnatowicz’s Sound Activated Module

8.2.5 Attention and Lifelikeness

The role of attention in robust perception is commonly argued (Hawkins & Blakeslee 2004; Gibson 1979; Bregman 1990; Coward 2005; James 1890); the ability to focus on a subset of the total sensory input is critical in maintaining a persistent perception of a particular object or phenomenon in a noisy environment. An example of this is the 'Cocktail Party Effect' (Bregman 1990): a reveller at a cocktail party is able to focus on the conversation with the person next to them, despite the actual auditory input at their ear-drum being composed of multiple conversations.

Beyond assisting in robust perception, it is my contention that attention, and the related notions of distractibility and curiosity, are important interactive qualities for imparting an impression of agency. To elucidate these notions I will give some examples of interactive artworks that work with these qualities: the cybernetic sculptures of Edward Ihnatowicz, and Simon Penny's Petit Mal.

Edward Ihnatowicz was a cybernetic sculptor working in the 1960s. His Sound Activated Module (SAM) (Ihnatowicz 2011), shown in the video SAM.mpg, was an installation piece, resembling a robotic flower, that used an array of volume sensors to roughly detect the location of a sound. It would then move to face the direction that the sound was coming from. The sensors were quite rough, and so the effect was that it would tend to follow people as they moved around, but not always entirely predictably. Another work along the same lines, but on a larger scale, was the Senster (Ihnatowicz 2011), shown in the video senster.mpg.


Figure 8.4: Edward Ihnatowicz’s Senster

Petit Mal is the work of Simon Penny (2011), shown in the video petit mal.mp4. It is a robot comprised of two bicycle wheels supporting a pendulum-like structure, with ultrasonic sensors at the top of the pendulum, and a motor that can drive the wheels forward and backward independently.

Petit Mal has two fundamental behavioural drivers. Firstly, it has a control-theoretic balancing mechanism that moves itself forward and backward to stop it from tipping over. The ungainly swinging action of the pendulum often frustrates this activity, and gives its motion an unpredictable lurching quality. Secondly, the ultrasonic sensors locate people in its vicinity. Petit Mal then attempts to maintain a particular distance from the person, approximately one metre: no closer and no further. Its ability to do so accurately is hampered by its awkward locative mechanism. The overall effect is of convincingly lifelike behaviour, which is really a result of complexity and change in the environment interacting with its physical constraints.

The actions that an observer takes (in moving around the installation space) are reflected in the actions of Petit Mal, but are transformed (by virtue of its crazy lurching motion, and inability to track them precisely) so that they are not immediately recognisable as a simple reflection.

Figure 8.5: Simon Penny's Petit Mal

I argue that the Sound Activated Module, the Senster, and Petit Mal are all utilising Smoke and Mirrors, and that they display interactive qualities of attention, curiosity and distractibility. There is a definite 'lifelike' quality to these works, an impression of their agency. This may be due in part to their physical mechanisms of motion being reminiscent of biological mechanisms. However, I suggest that a hefty contribution to the impression of agency is that they seem to be curious: a newly arrived observer will capture their attention for a period. Importantly, though, they are not slavishly focused on one thing; they are easily distracted. In that sense they are not 'robotically' (in the popular sense of the term) following some easily discerned program; to some extent they appear to have 'minds of their own'. I suggest that a combination of attention, curiosity and distractibility is a recipe for imparting a sense of agency.

Attention is a top-down process that influences the ability of bottom-up information to rise through levels of the cognitive hierarchy, by either emphasising or suppressing sensory data according to the current focus. Thus, architecturally, cognitive models of perception that include attention are hierarchies with bi-directional communicative feedback. Representations of music that do not include bidirectional feedback, such as the generative tree structures of GTTM or the single-layer Markov models discussed in chapter 4, are unable to model attention, and are consequently inappropriate architectures for either robust perception or lifelike interaction.

The two recent computational neuroscience models discussed in chapter 4, Hierarchical Temporal Memory and the Recommendation Architecture, both implement attention as a top-down process of suppression. The perceptual architecture used by the Jambot also implements a notion of attention, both for the purposes of robust perception and as an approach to lifelike interaction. The use of attention in creating convincing musical interactions will be discussed in §8.4.2.


8.2.6 Transformational Mimesis

Transformational mimesis is the term that I am using for the Jambot's application of the Smoke and Mirrors paradigm to musical improvisation. Transformational mimesis involves imitating the percussive onsets in the musical stream as they are received.

For transformational mimesis to sound musical it is essential that onsets in the signal are detected with very low latency. This way the Jambot can play a percussive sample as soon as an onset is detected, and the timing difference between the actual onset and the Jambot's response will be imperceptible to a human listener. My experience has been that 128 samples (at a sample rate of 44.1kHz, roughly 2.9 milliseconds) is the largest acceptable latency before the onset asynchrony becomes noticeable.

In keeping with the Smoke and Mirrors paradigm, transformational mimesis seeks to transform the onsets, so that the imitation is not too readily identifiable as being an imitation. The Jambot uses several approaches to transforming the detected onsets.

One obvious way in which the detected onsets are transformed in the imitation is that the timbre of the percussive samples that the Jambot plays differs from that of the original onsets. Indeed, the Jambot itself does not have any synthesis capacity, but rather sends out MIDI note-on messages which may be used to fire samples from any synthesiser. Used in this way, transformational mimesis may be thought of as a real-time timbral remapping technique.

The timbral remapping is made more effective by the Jambot's ability to discriminate between three streams of percussive onsets, using the SOD, MID and LOW detectors. Because of this the Jambot is able to selectively highlight any of these streams, which again helps to obscure its direct relationship to the source signal.
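As an illustration (not the Jambot's implementation), a timbral remapping layer might map each detector stream to a different MIDI note, leaving synthesis to whatever device receives the messages. The note numbers and the send_note_on transport are hypothetical; the sketch also folds in the amplitude threshold described in the next paragraph:

```python
# Hypothetical remapping from detector streams to General MIDI percussion notes
# (36 = kick, 38 = snare, 42 = closed hi-hat).
STREAM_TO_NOTE = {'LOW': 36, 'MID': 38, 'SOD': 42}

def remap_onset(stream, amplitude, send_note_on, threshold=0.0):
    """Fire a MIDI note-on for a detected onset, remapping its timbre.

    stream:       which detector fired ('SOD', 'MID' or 'LOW').
    amplitude:    detected onset amplitude, scaled to [0, 1].
    send_note_on: callable taking (note, velocity) -- the MIDI transport.
    threshold:    amplitude filter, selectively highlighting stronger events.
    """
    if amplitude < threshold:
        return                                   # filtered out
    velocity = max(1, min(127, int(amplitude * 127)))
    send_note_on(STREAM_TO_NOTE[stream], velocity)
```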

Another simple transformation is to filter the onsets in some fashion, such as by a threshold amplitude. This can have the effect of highlighting important musical events. Transformations that select certain events from the original are reminiscent of Aristotle's discussion of mimesis in the context of drama:

At first glance, mimesis seems to be a stylizing of reality in which the ordinary features of our world are brought into focus by a certain exaggeration, the relationship of the imitation to the object it imitates being something like the relationship of dancing to walking. Imitation always involves selecting something from the continuum of experience, thus giving boundaries to what really has no beginning or end. Mimesis involves a framing of reality that announces that what is contained within the frame is not simply real. Thus the more "real" the imitation the more fraudulent it becomes. (Aristotle in Davis 1999:3)

A limitation of the purely reactive approach is that it is difficult (or musically dangerous) for the Jambot to take any action other than when an onset is detected. For music that is very busy (i.e. has a lot of percussive onsets) simple reactivity can be quite effective. For music that is more sparse, this can render the purely reactive approach ineffective. Transformational mimesis need not, however, be a purely reactive approach. Indeed, the Jambot uses musical understanding gathered in the analysis stage to help transform its imitation. This will be described in §8.4.

8.2.7 Through the Looking Glass

In arguing that imitation is an important and fruitful technique in improvisation, and indeed in AI generally, I do not mean to imply that other techniques are uninteresting. Much as Levi-Strauss asserts that an artist has qualities both of the scientist and the bricoleur, so I am suggesting that both imitative behaviour and behaviour grounded in understanding are crucial aspects of creating an impression of agency. The Jambot utilises imitation, but also takes actions based on its musical understanding of what would be an appropriate and complementary action.

The critical point here is a question of baseline. From the perspective of classical AI, a musical improvising agent would operate from a baseline of nothing: any actions taken would be on the basis of parsing the incoming music into some higher order understanding, and utilising its musical knowledge to generate an appropriate response. The Smoke and Mirrors paradigm, on the other hand, takes direct imitation as a baseline, and utilises musical understanding to deviate artfully from this baseline. In this way the agent can communicate its musical understanding in a way that minimises the cognitive dissonance experienced by the human performers.

The juxtaposition of musical understanding with transformational mimesis discussed above relates to the previous discussion of Situated Cognition, in particular the need for models of cognition to combine processes of direct structural coupling with processes of inference. The Jambot's interactive generation strategies include a mechanism for combining transformational mimesis with its proactive generation techniques. I call this mechanism the juxtapositional calculus. It will be discussed in §8.4.1.


8.3 Proactive Generation Techniques

The Jambot utilises two distinct improvisatory approaches: reactive and proactive. The previous section outlined the reactive approach. In this section I describe the proactive approach. The proactive approach seeks to apply musical knowledge and understanding in the production of appropriate musical responses to the input signal.

The proactive approach operates by creating target values for the various rhythmic analyses that the Jambot performs (§7.3.4). It chooses musical actions that best meet these targets. In this section I describe the computational technique used to select musical actions that optimally meet these goals, and situate this technique within the context of existing approaches to generative music production.

I outline a novel optimisation search technique, anticipatory timing, that utilises a limited form of planning. The anticipatory timing search technique is designed to strike a good balance between optimality and computational efficiency, and is specifically tailored for generative rhythm production in a real-time interactive music system.

8.3.1 Existing Approaches to Rhythm Generation

The proactive rhythm production uses a generative process, rather than a transformative process. The generative process is able to produce output without any musical input. When input is present, the Jambot manipulates parameters of the generative algorithm to make the output musically appropriate. To facilitate such manipulation, the generative algorithm is based on a salient parametrisation of rhythm space (§7.3.1).

The Jambot's proactive rhythm generation differs from existing approaches to generative rhythm by utilising a musically salient representation which models both metric alignment and temporal patterning. In this section I outline some existing approaches to rhythm generation, categorising them into sequencing/sampling, generative algorithms based on statistical representations, and generative algorithms based on musically salient representations. Of those approaches that use musically salient representations, most model either metric alignment or temporal patterning, but not both.

A number of generative systems utilise sequenced or sampled rhythmic material. Pearce & Wiggins (2007) algorithmically create chorales in a style similar to a seed chorale by remapping the pitches onto the original rhythm. Pachet (2002) uses a similar approach in his interactive music system, the Continuator, recycling the rhythm of a phrase sampled from the human player. Charles Ames' Cybernetic Composer (1992) creates drum-kit rhythms by looping a single bar generated according to stochastic variations from a stylistic template. David Cope's Experiments in Musical Intelligence (1996) uses recombinatorial processes, and so inherits rhythm from source material.

Amongst systems that generate rhythms from scratch (rather than sampling or transforming existing rhythms) a common approach is to use a stochastic onset model. This means that at each time-point (of a quantised timeline) there is a probability of a note onset. For example, Temperley's Melisma Melody Generator (2010) creates rhythms this way, weighting the probabilities at each beat according to the metric strength of that beat.

A disadvantage of generating rhythms with probabilistic onsets is that Inter-Onset-Intervals are not directly modelled. A number of systems generate stochastic rhythms by randomly drawing from a collection of durations, rather than modelling onset probabilities. For example, Brown (2005) creates rhythms by selecting random durations, weighted according to the metric position of the onset and offset.
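A minimal sketch of the probabilistic onset approach follows, assuming a per-location table of metric strengths; the weighting scheme is illustrative, not Temperley's actual model:

```python
import random

def stochastic_onsets(strengths, base_prob=0.3):
    """Generate a rhythm as one onset boolean per beat location, with
    stronger metric locations proportionally more likely to carry an onset.

    strengths: metric strength of each location, scaled to [0, 1].
    """
    return [random.random() < base_prob * (0.5 + s) for s in strengths]

# e.g. one 4/4 bar at quaver resolution
print(stochastic_onsets([1.0, 0.2, 0.5, 0.2, 0.8, 0.2, 0.5, 0.2]))
```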

Neither of the stochastic approaches just discussed (probabilistic onsets/durations) addresses notions of temporal patterning (§7.3.3). Many systems incorporate patterning into their generative algorithms by using Markov Chains of note durations, or other statistical representations. For example, Belinda Thom's Band-Out-of-a-Box (2000a) uses Variable Length Trees, John Biles' GenJam (1994) uses Genetic Algorithms, and George Lewis' Voyager uses Markov Chains to select sequences of note durations.

A difficulty with using sub-symbolic representations is that the parameters of these representations do not correspond to musically salient features. Consequently, parametric manipulation of generative algorithms based on these representations is unintuitive.

A more direct approach to modelling temporal patterning is taken by a number of systems that generate rhythms according to periodicities of onsets. For example, Ariza (2005) generates rhythms using onsets defined by a Xenakis Sieve (a mechanism for generating periodic sequences), McIlwain's (2007) Nodal uses a graph representation to create polyrhythmic sequences, and Sorensen's Oscillating Rhythms (2010) generates note onsets from periodic functions. These examples are deterministic, and not readily adaptable to modelling variations of metric alignment.

One system that uses a purely symbolic representation, based on salient musical features, to model both temporal patterning and metric alignment is Sorensen & Brown's (2008) MetaScore algorithmic composition system. They use Gestalt properties of proximity and similarity to generate a number of candidate rhythms, then select the candidate that most closely matches target values for their metric alignment features, such as syncopation.

The Jambot's proactive rhythm generation uses a salient parametrisation of rhythm space (§7.3.4) that addresses both metric alignment and temporal patterning features. The Jambot differs from the system of Sorensen & Brown in a number of ways: first, it is interactive and real-time (MetaScore does not take musical input). Second, it uses a different collection of rhythmic features; in particular, it contains measures of metric ambiguity and polyrhythmic complexity not found in MetaScore. Third, it utilises amplitude information in all of its measures, so that periodicities, syncopations and metric ambiguity may all be achieved through dynamic accent.

8.3.2 Analysis and Generation

One approach to creating generative algorithms is to utilise analysis algorithms in reverse (Brown et al. 2009; Conklin 2003). An analysis may be thought of as a mapping that takes the musical surface as input, and reduces it to a value. Inverting this mapping creates a generative algorithm, in which a value (of a parameter changing in time) is mapped to a musical surface: the generated musical output. In this way the parameter value becomes a meta-compositional (or meta-improvisational) tool.

Generally an analysis algorithm, when considered as a mapping, will not be one-to-one: many different musical surfaces may have the same value under the analysis. This means that a generative algorithm constructed by inverting the analysis must choose between a number of possible musical surfaces. In the sections below I discuss two broad approaches to making these choices, rule-based systems and stochastic systems, and discuss how these approaches are related.
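One simple way to realise such an inversion is generate-and-test: sample candidate surfaces and keep the one whose analysis value lies closest to the target. The sketch below assumes a candidate generator and a single analysis function; it mirrors the selection step of the MetaScore-style systems described above, rather than the Jambot's own algorithm:

```python
def invert_analysis(analysis, generate_candidate, target, n_candidates=100):
    """Invert an analysis mapping by generate-and-test.

    analysis:           function from a musical surface to a value.
    generate_candidate: function producing a random musical surface.
    target:             desired analysis value for the generated output.
    Returns the candidate whose analysis value is closest to the target.
    """
    candidates = [generate_candidate() for _ in range(n_candidates)]
    return min(candidates, key=lambda s: abs(analysis(s) - target))
```

For instance, using the density measure of §7.3.9 as the analysis with a target of 0.5 selects, from the candidates, the surface closest to having onsets on half of its beat locations.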

8.3.3 Rule-based Systems

Rule-based systems for describing music operate by supplying a collection of heuristic properties that music is expected to obey. Historically, many rule-based systems have been intended for analysis rather than for generation (Schenker 1980; Narmour 1990). However, if systems for analysis are formalised sufficiently to afford computational implementation, then they may be used in reverse as generative algorithms (Brown et al. 2009; Conklin 2003).

A frequently cited example of a rule-based system for musical analysis is the Generative Theory of Tonal Music (GTTM) (Lerdahl & Jackendoff 1983). The GTTM supplies a collection of preference rules that well-formed music should conform to. Actually performing an analysis using the GTTM involves coming to a compromise between competing preference rules. Lerdahl & Jackendoff's model stops short of providing a computationally implementable scheme for deciding between competing rules.

The reason [that we have not implemented our theory computationally] is that we have not completely characterized what happens when two preference rules come into conflict. Sometimes the outcome is a vague or ambiguous intuition . . . We suggested above the possibility of quantifying rule strengths, so that the nature of a judgment in a conflicting situation could be determined numerically. (Lerdahl & Jackendoff 1983:54)

There have been various efforts to provide computational implementations of the GTTM or similar preference rule systems (Hamanaka et al. 2006; Temperley 2001). These implementations consider the problem as a question of optimisation. In other words, the correct analysis is found by searching through the space of possible analyses for the best analysis according to some weighted combination of the rules.

In a given preference rule system, all possible analyses of a piece are considered. Following Lerdahl & Jackendoff, the set of possible analyses is defined by basic well-formedness rules. Each preference rule then assigns a numerical score to each analysis . . . The preferred analysis is the one with the highest score (Temperley 2001:15).

The task of implementing a computational version of the GTTM (or any rule-based system) as an optimisation problem is made difficult by the vast number of possible analyses to be searched over for any given piece of music. A naive search over all analyses for the optimal one is generally computationally intractable. There are, however, a number of search techniques that are substantially more efficient than the brute force approach of exhaustive search.

A common approach to dealing with the combinatorial explosion of searching over all possible analyses is to perform the search in a sequential fashion, and to use some heuristics for pruning unlikely branches of the search tree. Such approaches also seem more likely to provide a model for "the moment to moment course of processing as it unfolds during listening" (Temperley 2001:14). One such approach is dynamic programming (Bellman & Kalaba 1956), which has found use in a number of algorithmic analysis and generation systems (Rowe 1993; Pennycook et al. 1993; Dannenberg 2000; Rolland & Ganascia 2000; Temperley 2001; Collins 2008).


Temperley has constructed a preference rule system similar to the GTTM, called the Melisma model, that affords computational implementation and utilises dynamic programming to search for the 'best' analysis of a given piece of music.

When the program searches for a globally optimal analysis based on the results of the local analyses, it must consider global analyses rather than simply choosing the best analysis for each segment in isolation ... Usually, even for a monophonic short melody, the number of possible local analyses may grow exponentially with the number of segments, and the size of the best-so-far analysis becomes extremely large. The Melisma system suppresses the explosion of analyses by properly pruning less significant analyses by dynamic programming. Temperley argues that, to some extent, this searching process reflects the human moment-to-moment cognitive process of revision, ambiguity and expectation when listening to music. (Hamanaka et al. 2006)

The Melisma system, although originally intended for computational analysis, has been 'reversed' into a generative algorithm, the Melisma Stochastic Melody Generator (Temperley & Sleator 2010), providing both an example of an analytic theory transformed into a generative process, and an example of a (deterministic) rule-based system reinterpreted as a stochastic statistical model.

8.3.4 Statistical Models

There have been many statistical models of music generation, from simple mappings of mathematical functions such as periodic motion and random walks, to the use of more complex processes such as chaos theory and cellular automata. Some of the more musically successful statistical models have been based on probability distributions derived from analysis of previously composed material. Recent coverage of probabilistic tendencies in music has been written by Huron (2006) and Temperley (2007). Whilst these publications have focused on music analysis, the use of these approaches for music generation is suggested by various authors, such as Conklin:

Analytic statistical models have an objective goal, which is to assign high probability to new pieces in a style. These models can guide the generation process by evaluating candidate generations and ruling out those with low probabilities. The generation of music is thereby equated with the problem of sampling from a statistical model, or equivalently, exploring a search space with the statistical model used for evaluation. (2003)


In a simplistic sense, statistical models can be used to generate musical material at each step in a sequence independently of context; however, it is also common, and often more effective, to use previous and concurrent note events to inform statistical processes of generation.

The most prevalent type of statistical model encountered for music, both for analysis and synthesis, are models which assign probabilities to events conditioned only on earlier events in the sequence. (Conklin 2003)

The most common type of probabilistic model that uses past events as context is the Markov model. Markov models have been used for the generation of computer music since Hiller and Isaacson's compositions in the late 1950s. They are useful for sequential processes such as music because they describe the frequency of sequences of events, such as which rhythmic values follow each other.
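As a sketch of this idea, a first-order Markov model of note durations can be trained by counting transitions and then sampled as a random walk. (This is illustrative only; the systems cited above use considerably richer models.)

```python
import random
from collections import defaultdict

def train_markov(durations):
    """Count first-order transitions between successive note durations."""
    transitions = defaultdict(list)
    for prev, nxt in zip(durations, durations[1:]):
        transitions[prev].append(nxt)
    return transitions

def generate(transitions, start, length):
    """Random walk over the transition table, yielding a duration sequence."""
    sequence = [start]
    for _ in range(length - 1):
        sequence.append(random.choice(transitions[sequence[-1]]))
    return sequence

# durations in beats: a training phrase of quavers, crotchets and a minim
phrase = [0.5, 0.5, 1.0, 0.5, 0.5, 1.0, 2.0, 1.0, 0.5, 0.5, 1.0]
print(generate(train_markov(phrase), start=0.5, length=8))
```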

Markov models, and many other statistical approaches, whilst taking into account past context, do not take into account future possibilities, and therefore, I suggest, miss an important opportunity to produce more optimal event selections. One challenge for extending statistical models to consider multiple future scenarios is the computational complexity that can result.

8.3.5 Greedy Algorithms

Many real-time generative systems that utilise either statistical models or rule-based operation use what is described as a greedy algorithm (Cormen et al. 2006:370). In an optimisation context, a greedy algorithm is one which generates each step of the sequence by optimising locally. Generally greedy algorithms fail to find the global optimum (Russell & Norvig 1995). However, they are often used because they are much faster than exhaustive searches.

The random walk method, while applicable for real-time music improvisation systems that require fast and immediate system response (Assayag et al., 1999; Pachet, 2002), is flawed for generating complete pieces because it is "greedy" and cannot guarantee that pieces with high overall probability will be produced. The method may generate high probability events but may at some stage find that subsequently only low probability events are possible, or equivalently, that the distribution at subsequent stages have high entropy (Conklin 2003).


Greedy algorithms represent the most extreme form of ‘pruning’ of the search tree: only the current action is searched over, which amounts to pruning all of the branches of the tree, leaving only the current node.

8.3.6 Anticipatory Timing

Incorporating planning by naively searching over even a small number of future actions is computationally intractable. Even for offline systems this is not feasible, so smarter approaches to searching, such as dynamic programming, must be employed. For real-time systems (such as we are concerned with) computational efficiency is a strong constraint, and even dynamic programming may be too expensive, so greedy online variants of dynamic programming are sometimes used (Collins 2008).

I propose anticipatory timing as a computationally tractable means of improving upon greedy algorithms. Anticipatory timing is an extension of greedy optimisation to include some level of anticipation in the timing of the next action. This involves searching both over possible actions and over possible times for the next action.

At each step of the optimisation, the fitness function is calculated for each possible action, and recalculated at each time slice (for a short window of the future). This calculation is done under the constraint that only one action is considered. If the highest overall value for the optimisation occurs at the current time slice, then that action is taken. Otherwise no action is taken at the current time slice.

If the algorithm anticipates that the best time for action is a few steps into the future then it does nothing at the current time slice. In the Jambot's case, the periodic nature of the metrical context means that, unless something unexpected happens, the action is likely to be taken at that future time. However, if in the meantime something unexpected has happened, then the algorithm (which is run at every time slice) may reassess the situation and decide against taking action at this time. This is particularly important in a real-time improvisation context, because it allows for planning without commitment, so that the algorithm may be flexible to unexpected changes in the improvisation.
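
The following Python sketch captures the control flow just described. It is my illustration rather than the Jambot's actual code, and the fitness function is passed in as a black box (in the Jambot it would be the weighted heuristic objective of §8.3.7):

    def anticipatory_step(actions, now, lookahead, fitness):
        # Search over every candidate action at every slice in the window,
        # under the constraint that only one action is considered.
        best_action, best_time, best_score = None, None, float('-inf')
        for t in range(now, now + lookahead):
            for action in actions:
                score = fitness(action, t)
                if score > best_score:
                    best_action, best_time, best_score = action, t, score
        # Act only if the optimum falls on the current slice; otherwise wait.
        # Because this routine is re-run at every slice, the deferred plan can
        # be revised if something unexpected happens before that time arrives.
        if best_time == now:
            return best_action
        return None

    # Example: the fitness peaks two slices ahead, so the algorithm waits.
    fitness = lambda action, t: {('kick', 2): 1.0}.get((action, t), 0.0)
    print(anticipatory_step(['kick', 'snare'], now=0, lookahead=4,
                            fitness=fitness))  # None: the optimum lies ahead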

Interpretation as Pruning the Search Tree

Anticipatory timing is similar to greedy optimisation in only planning a single action at a time, but allows for the possibility of finessing the timing of this event if required. At each step of the generation the optimisation routine examines future outcomes at each time slice until the next event, but not every possible outcome in the rhythmic space.


This strategy increases the computational demand linearly with the number of beats that it looks ahead, comparing favourably with the exponential increase in complexity of planning several actions ahead with a full search.

In terms of ‘pruning’ the search tree, anticipatory timing amounts to pruning down to just a single branch. This branch consists of a series of ‘take-no-action’ nodes and terminates when an action is decided upon. This means that the number of computations required overall is equal to the number required for searching over a single action multiplied by the number of time slices into the future that it looks.

Comparison with Full Search

An alternative to anticipatory timing would be to plan a number of notes in advance. Presuming that we have a fixed amount of computational power available, a quick computation demonstrates how much further into the future we can look using anticipatory timing, rather than searching over the entire tree.

Suppose, for purposes of demonstration, that there are 16 possible notes to consider at each time slice. Suppose further that the bar is divided into 16 semiquaver time slices. Using anticipatory timing we look a full bar ahead: at each of the 16 time slices in the bar we search over 16 notes. Then the total computation involves searching over 16 x 16 = 256 notes.

On the other hand, consider attempting to plan some number of notes into the future considering all possible combinations. At each time slice we must search over 16 notes, and for each of these notes we must search over 16 notes at the next time slice, and so on. Then by the second time slice we must already examine 16 x 16 = 256 notes.

So for the same computation budget, we can look ahead a full bar of 16 semiquavers with anticipatory timing, but can only look one semiquaver ahead if we attempt to search the entire tree. To look ahead the full bar for all possible notes would require searching over 16^16 notes, which is approximately the number of grains of sand on the earth (Argonne National Laboratory 2010).
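
The same arithmetic in Python, for illustration:

    anticipatory = 16 * 16   # 256 evaluations: 16 candidate notes at each of 16 slices
    full_search = 16 ** 16   # 18446744073709551616, roughly 1.8 x 10^19 combinations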


8.3.7 The Jambot’s Rhythm Generation

The Jambot operates in the paradigm of a rule-based context model as described in §8.3.3. In order to decide upon which musical actions to take, the Jambot performs a collection of analyses of the recent musical context (which may consist solely of its own output, or may additionally include the actions of other members of an ensemble). The analyses are those performed in the analysis stage of the Jambot's processing chain (§7.3.4): they include measures of rhythmic density, syncopation, periodicity and metric alignment.

The Jambot has a target value for each analysis, and searches over a quantised set of candidate actions so as to move the music as closely as possible towards these target values. The target values themselves are improvisatory parameters which may be set in real-time by a human interacting with the Jambot, or may be set by the Jambot using a higher order generative process. Below I provide some simple examples of rhythms produced using static target values with the goal of achieving stable repetitive rhythmic patterns.

In the discussions above of rule-based systems I used the term heuristic to denote a particular rule, with the implicit connotation that the rule be amenable to numerical implementation. In the case of the Jambot each heuristic consists of a particular analysis combined with a target value for that analysis. The analyses all report a single numerical value between 0 and 1. For example, the rhythmic density analysis reports the fraction of beats that have an onset. The value of the heuristic is the absolute difference between the analysis and the target value. Using the density analysis as an example again, if the target value for the density was set to 0.5 (aiming for 50% of the beats to have onsets) and the density analysis reported 0.25 (only 25% of the beats have onsets), then the value of the density heuristic would be 0.25.

There are five groups of heuristics (corresponding to the five different analyses described in §7.3.4). The heuristics are combined into a single objective function simply by forming a weighted sum of the individual heuristic values. The weights then also become improvisatory parameters which may be set by hand or generatively.
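
A minimal sketch of this objective, using the density heuristic from the example above (the function and dictionary names are mine, and the other four analyses are elided):

    def density(rhythm):
        # Fraction of beats carrying an onset; rhythm is one 0/1 flag per beat.
        return sum(rhythm) / len(rhythm)

    analyses = {'density': density}   # ... syncopation, periodicity, etc.
    targets = {'density': 0.5}
    weights = {'density': 1.0}

    def objective(rhythm):
        # Weighted sum of heuristic values: the absolute distance of each
        # analysis from its target. Candidate actions are chosen so as to
        # minimise this.
        return sum(weights[name] * abs(analyses[name](rhythm) - targets[name])
                   for name in analyses)

    print(objective([1, 0, 0, 0, 1, 0, 0, 0]))  # density 0.25 vs target 0.5 -> 0.25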


8.3.8 Ensemble Interaction

The proactive generative technique described above for rhythm generation is introspective, in the sense that the rhythmic analyses are being performed on a history of the Jambot's own musical actions. The idea of the Jambot is, however, not simply to create generative rhythms, but to create appropriate musical output in an improvised ensemble setting. The output of the proactive rhythm generation is influenced by (and aims to be complementary to) the input signal via two mechanisms.

Firstly, the rhythm heuristics are rooted in the metric context, which is inferred from the input signal. Thus the generative rhythms will (hopefully) be in time, and appropriate to the metric structure.

Secondly, the Jambot has an ensemble mode, which may be turned on and off as desired during performance. In ensemble mode the Jambot merges its own musical actions with its perception of the rhythm present in the input signal. The rhythmic analyses used for generation are performed on the combined rhythm. It is the ensemble rhythm that the Jambot seeks to manipulate to best match the target values.

The ensemble mode can produce quite interesting and appropriate actions for simple settings of the target values. For example, consider setting the target density to 100%. In ensemble mode this will result in the Jambot playing on every substratum beat for which there is no onset present in the input signal (it will also play at beats where the input signal has an onset close to, but after, the predicted beat). In this case the Jambot is utilising a sort of contrariness to achieve complementarity. The use of various forms of musical contrariness as an improvisational technique has been discussed by Collins (2010).
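
The 100% density example reduces to the following toy illustration, assuming one onset flag per substratum beat:

    input_rhythm = [1, 0, 1, 0, 0, 1, 0, 0]

    # With the density target at 1.0, the locally optimal action at each beat
    # lacking an input onset is to play, so the combined ensemble rhythm
    # reaches full density: complementarity achieved through contrariness.
    jambot_rhythm = [0 if onset else 1 for onset in input_rhythm]
    ensemble_rhythm = [max(a, b) for a, b in zip(input_rhythm, jambot_rhythm)]

    assert sum(ensemble_rhythm) / len(ensemble_rhythm) == 1.0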

8.3.9 Musical Examples

In this section I give some very simple examples of the musical output of the Jambot's rhythm generation, and compare the results with and without anticipatory timing. The point of these examples is to demonstrate that the use of anticipatory timing imbues my rule-based system with much greater control over the level of musical coherence in its output.

In the first two examples I attempt to find parameter settings for the rhythmic heuristics that result in a desired simple drum pattern. When using anticipatory timing it is easy to create these patterns with intuitive settings. Without anticipatory timing I was unable to find any settings that would result in these simple patterns. The third example demonstrates the output of the Jambot in an ensemble improvisation.

Figure 8.6: Offbeat snare hits with anticipatory timing

Figure 8.7: Offbeat snare hits without anticipatory timing

Example 1: Offbeat Snare

The anticipatory timing approach seems to be most helpful when constructing rhythms that utilise a mixture of the metrical and non-metrical heuristics. As a first example, consider trying to construct a rhythm in 4/4 quantised to 8 quavers, consisting of a kick on the downbeat, and hi-hats on each crotchet beat alternating with snares on each quaver offbeat.

The heuristics that I use for the hi-hat are density, syncopation and a periodicity of 2 quavers. The density is set to 50%, since I want hi-hats on the crotchet beats, but not the quaver offbeats. The syncopation is set to 0% since I want the hi-hats to be on all the on-beats. Finally, I set a large weighting towards a periodicity of 2 quavers.

For the snare I use similar heuristics, except I target a syncopation of 100%, so as to force the snare drum to sound on the offbeats.
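
Gathered together, the settings for this example might look as follows (the dictionary layout and the numerical periodicity weight are my assumptions; the text above specifies only 'a large weighting'):

    heuristic_targets = {
        'hihat': {'density': 0.50, 'syncopation': 0.00},
        'snare': {'density': 0.50, 'syncopation': 1.00},
    }
    # Both parts also carry a heavily weighted periodicity-of-2-quavers
    # heuristic (the weight value 10.0 is illustrative).
    periodicity_weights = {'hihat': {2: 10.0}, 'snare': {2: 10.0}}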

The resulting rhythm, using anticipatory timing, is shown in Figure 8.6 and can be heard in the example file AlternatingSnareAndHatAnticipation.mp3.

However, using the same settings for the heuristics but without anticipatory timing results in the music shown in Figure 8.7 and heard in the example file AlternatingSnareAndHatNoAnticipation.mp3. Without the anticipatory timing the hi-hat pattern remains intact, but the snare pattern is all over the place. One way to think about what is happening here is that the offbeat alignment is being forced by syncopation, but without anticipation the algorithm can't tell that a given offbeat is a good opportunity for a syncopation because it is not looking ahead to the stronger on-beat.


Figure 8.8: Dancehall beat with anticipatory timing

Figure 8.9: Dancehall beat without anticipatory timing

Example 2: Dancehall Kick Pattern

As another example I try to construct a ‘dancehall’ pattern with the kick drum, being a 3 + 3 + 2 rhythm in quavers in 4/4. The heuristics that I use are a combination of periodicities of 3 and 8. The periodicity of 3 on its own would tend to produce a stream of 3 + 3 + 3 ... indefinitely. However, the addition of a periodicity of 8 constrains the rhythm to attempt to repeat itself every 8 quavers, and so I hope that this combination of settings should tend to produce the desired pattern, or a permutation of it (i.e., 3 + 2 + 3). To obtain the desired permutation, and have it align to the metre, I also specify the metric ambiguity heuristic to be low, meaning that the rhythm should adhere to the metre as closely as possible.

The resulting rhythm (with a cowbell and clave click to show the metre), with anticipatory timing, is shown in Figure 8.8 and can be heard in the example file DanceHallAnticipation.mp3.

To see the effect of anticipatory timing, note that the same settings without anticipatory timing give the result shown in Figure 8.9 and heard in the example file DanceHallNoAnticipation.mp3. Without anticipatory timing the 3 + 3 + 2 pattern is lost.

Example 3: Ensemble Interaction

In the examples above the context for the Jambot's rhythm model has been a short window of history of its own musical actions. The Jambot is, however, primarily designed to be used for improvising in an ensemble, in which case the context consists of both its own musical actions and those of the ensemble. As an example of this mode of interaction I had the Jambot improvise in real-time to the Amen Break.


The heuristic settings used were just the density and syncopation for each of the hi-hat, snare and kick:

Hi-hat   Density: 95%   Syncopation: 0%
Snare    Density: 33%   Syncopation: 50%
Kick     Density: 33%   Syncopation: 0%

The resulting rhythmic improvisation with anticipatory timing is given in EnsembleAnticipation.mp3 (the original input file is AmenBreak.mp3). The Jambot's improvisation is relatively sparse, consisting of a repeated pattern of a few snare hits only. This is because the hi-hat and kick parts in the loop are already sufficiently busy to satisfy the density target. The snare pattern, to my ears, complements the input rhythm in a musically appropriate and coherent fashion.

The same settings without anticipatory timing produced the result heard in EnsembleNoAnticipation.mp3. As with the previous improvisation it is restricted to the snare. The snare pattern in this example is, to my ears, still interesting but less musically coherent than the example with anticipatory timing.

8.3.10 The Chimæra

The Jambot's proactive generation technique operates by targeting values for various rhythmic heuristics. The target values themselves may be static, or may be altered by a human in the course of performance. A third possibility is that the Jambot itself may dynamically alter the target values according to some higher order generative process. In this section I describe one such approach, which I call the Chimæra.

The Chimæra is a generative process for dynamically altering the target values for the Jambot's rhythmic heuristics. It functions by measuring and exploiting the level of metric ambiguity present in the music. It seeks to achieve musical complementarity by manipulating the metric ambiguity via its improvisational actions.

The Chimæra makes use of the multiplicity of metric contexts inferred in the analysis stage of the Jambot's processing chain. In §2.3.6 I discussed how a multiple-parallel-analysis architecture affords a natural mechanism for manipulating ambiguity. By selectively highlighting several metric contexts simultaneously, metric ambiguity should be increased. Conversely, by highlighting the dominant metric context only, metric ambiguity should be decreased.


Figure 8.10: A depiction of the Chimæra on an ancient Greek plate

The idea of the Chimæra is to extend this notion to a generative rhythmic process in which the rhythms are emergent from the interaction of multiple simple rhythms, which individually are strongly suggestive of a particular metric context. My use of the metaphor of the Chimæra was inspired by Bregman's description of Chimeric sounds:

The Chimaera was a beast in Greek Mythology with the head of a lion, the body of a goat, and the tail of a serpent. We use the word Chimera metaphorically to refer to an image derived as a composition of other images. An example of an auditory Chimera would be a heard sentence that was created by the accidental composition of the voices of two persons who just happened to be speaking at the same time. Natural hearing tries to avoid chimeric percepts, but music often tries to create them. It may want the listener to accept the simultaneous roll of the drum, clash of the cymbal, and brief pulse of noise from the woodwinds as a single coherent event with its own striking emergent properties. The sound is chimeric in the sense that it does not belong to any single environmental object. To avoid Chimeras the auditory system utilizes the correlations that normally hold between acoustic components that derive from a single source and the independence that usually exists between the sensory effects of different sources. Frequently orchestration is called upon to oppose these tendencies and force the auditory system to create Chimeras. (Bregman 1990)

Where Bregman is referring to Chimeric sounds composed of physically disparate source sounds, the Chimæra process in the Jambot seeks to create Chimeric rhythms as emergent from simpler rhythms, each rooted in disparate senses of the metric context.

Chimeric Rhythms

The Jambot's perceptual architecture tracks multiple plausible metric contexts. The hypothesis of the Chimæra process is that utilising all of the parallel scenarios for generative improvisation can be musically efficacious. Whilst many metre induction techniques utilise multiple analysis models for the purposes of metric estimation, usually these techniques will report the strongest result as the metre. However, the other metric analyses and their associated confidences give information regarding the metric ambiguity present in the improvisation, and suggest musical opportunities for manipulating the ambiguity in an appropriate fashion.

Flanagan (2008) also discusses the use of multiple estimates of metre to measure metric ambiguity, and suggests that it could be possible to create a generative algorithm that exploits this. However, as far as I am aware, the Jambot is the only interactive music system that uses this information for generative accompaniment purposes.

A more common approach is to select the most plausible metre and generate material which is appropriate to it. This approach is similar to that used by authors who have considered multiple parallel analysis models for musical analysis: the explicit assumption being that, despite tracking multiple analyses, at any one time there is only one analysis which is perceived as being ‘correct’. For example, Temperley (2001:219) refers to the preferred analysis as the one with the highest score. Similarly the GTTM explicitly insists that only one analysis at a time can be ‘heard’:

Our hypothesis is that one hears a musical surface in terms of that analysis (or those analyses) that represent the highest degree of overall preference (Lerdahl & Jackendoff 1983).

A similar hypothesis is widely held in a number of fields of psychology and neuroscience, under a variety of names. The gestalt psychologists refer to it as the Figure-Ground dichotomy (Boring 1942); in neuroscience it is called the Winner-Takes-All hypothesis (Kandel et al. 2000:494). The idea is that the mind can only be conscious of a single reality at any one time.

In music perception a number of authors have commented similarly on the impossibility of consciously perceiving more than one musical analysis simultaneously:

It is true that we are conscious of only one analysis at a time, or that we can attend to only one analysis at a time. But this leaves open the possibility that other analyses are present unconsciously, inaccessible to attention. (London 2004:141)

London, while discussing music psychology experiments on attending to metre by tapping along to a single stream of a polyrhythm, notes that:

These studies ... indicate that while on any given presentation I tend to hear a passage under one and only one metric framework, it is possible to re-construe the same figure or passage under a different meter on another listening occasion. A polyrhythmic pattern may be heard “in three” or “in four”, just as metrically malleable patterns may be set in different metric contexts. This should not surprise anyone familiar with the basic tenets of perception, as the need to maintain a single coherent ground seems to be universal ... Thus there is no such thing as polymetre. (London 2004:50)

However, I am at odds with this view, and contrastingly suggest that all of the metric contexts inferred by the Jambot may be used to artistic effect. Support for this notion can be found in Huron's (2006:109) discussion of his theory of competing concurrent representations. He argues that the concurrent representations must all be creating expectations, and so must all be contributing to the musical affect. Music psychology experiments (such as those above) that suggest that only one metre can be consciously attended to at a time may not be relevant to the acts of listening to, or improvising with, music, since these activities do not necessarily involve conscious attention to musical representations.

More precisely, I propose that by selectively emphasising the inferred metric contexts in different ways I may achieve different musical results, particularly in regard to the level of ambiguity present in the improvised generative material. For example, emphasising a single metric context should result in a decrease in the overall ambiguity of the ensemble's playing, since emphasising the most prominent metric context is likely to have the effect of further increasing its relative plausibility. Conversely, generating material that is equally appropriate to two metric contexts regardless of their relative plausibility should be likely to increase the level of ambiguity present in the ensemble. I suggest then three different strategies for utilising the Chimæra process, revolving around controlling the level of ambiguity present (sketched in code after the list):

(i) Disambiguation: utilise only the most plausible of the inferred metric contexts.

(ii) Ambiguation: utilise all of the inferred metric contexts with equal weight, regardless of their plausibility.

(iii) Following: utilise all of the inferred metric contexts with weight according to their plausibility.
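
The three strategies amount to three ways of mapping plausibilities to weights. A minimal sketch, assuming the contexts arrive as (context, plausibility) pairs sorted most plausible first:

    def chimaera_weights(contexts, mode):
        if mode == 'disambiguate':
            # All weight on the single most plausible metric context.
            return [1.0] + [0.0] * (len(contexts) - 1)
        if mode == 'ambiguate':
            # Equal weight to every inferred context, regardless of plausibility.
            return [1.0 / len(contexts)] * len(contexts)
        if mode == 'follow':
            # Weight each context in proportion to its plausibility.
            total = sum(p for _, p in contexts)
            return [p / total for _, p in contexts]
        raise ValueError(mode)

    contexts = [('8 beats', 0.5), ('4 beats', 0.3), ('5 beats', 0.2)]
    print(chimaera_weights(contexts, 'follow'))  # [0.5, 0.3, 0.2]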


A Simple Example

To clarify the operation of the Chimæra process, I will give some simple musical examples designed to clearly demonstrate the musical results of the three improvisatory strategies outlined above. The generative procedure that I use here is just for demonstration purposes. The actual implementation of the Chimæra process in the Jambot uses a more subtle generative technique, designed to integrate with the other rhythmic generation techniques described earlier in this chapter. This implementation will be described in §8.3.10.

The generative process that I will use for demonstration is perhaps the simplest approach to mapping a metric context to a rhythm. For any given metric context, I want to create rhythmic material that is appropriate to, or stereotypical of, that metric context. To achieve this I simply play a percussive attack on each beat identified in the metre, with an accent on the downbeat. The downbeat accent is created by utilising a different timbre (i.e. a different percussive instrument), whilst the other beats have the same timbre, but varying dynamics. The dynamics are chosen to correspond to the metric weights of the beats specified by the metre.

The dynamics of the attacks on the beats are further modulated by the desired weighting of the metric context in the generated material. So, for example, if I am using the Ambiguation strategy discussed above, the pulse streams corresponding to the different metric contexts will be equally loud (on average), whereas if I am using the Following strategy then the pulse stream corresponding to the most plausible metric context will be loudest.
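
As a sketch of this demonstration mapping (the representation of metric weights and the event tuple format are my assumptions):

    def pulse_stream(metric_weights, context_weight):
        # One attack per beat: the downbeat gets its own timbre, the other
        # beats share a timbre with dynamics set by their metric weight,
        # all scaled by the desired weighting of this metric context.
        events = []
        for beat, weight in enumerate(metric_weights):
            timbre = 'downbeat' if beat == 0 else 'pulse'
            events.append((beat, timbre, weight * context_weight))
        return events

    # A four-beat context at half weight, strong-weak-medium-weak:
    print(pulse_stream([1.0, 0.25, 0.5, 0.25], context_weight=0.5))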

I have provided examples of the system improvising rhythmic accompaniment to a short recorded loop of live drums to demonstrate its operation. The original loop, which can be found in the example file original.mp3, is a sample rock beat played on a standard drum kit. I perceived it as being in a 4/4 time signature at 110 bpm. The rhythm has regular hi-hats played on the quavers, with snare hits on the backbeat, and a triplet-feel kick-drum pattern. The kick-drum pattern contrasted with the snare and the hi-hat gives rise to metric ambiguity, as the triplet feel is metrically dissonant with the straight four/eight feel of the snare and hi-hat.

The Jambot's analysis of metre yielded four plausible metric contexts:

1. The most plausible scenario has a pulse period of 0.27 seconds (∼220 bpm), and 8 beats to the bar.

2. The second most plausible has the same bar period as the first, but counts only 4 beats at 110 bpm.


3. The third has 5 beats at 220 bpm.

4. The fourth has 3 beats at 220 bpm.

Figure 8.11: Data from three updates of the inferred metric contexts

The collection of plausible metric contexts, and their associated plausibilities, changes through time, most notably the scenarios of 5 and/or 3 beats to a bar. The top two contexts (8 and 4 beats to the bar) are stable, whilst the others pop in and out as the system updates their relative plausibilities. An example print-out of the inferred metric contexts is given in Figure 8.11. In the print-out metric contexts are labelled ‘context dumps’. The letters P, B and N indicate the pulse period, bar period and number of beats in a bar respectively (periods are measured as a number of analysis windows, where an analysis window is 128 samples @ 44.1kHz). Notice that the Chimæra briefly entertains the notion of a bar of 10 before pruning it out when its plausibility becomes too low.

Audio examples of the system generating material utilising the three improvisatory strategies of disambiguation, ambiguation, and following are given in disambiguate.mp3, ambiguate.mp3, and follow.mp3 respectively. These examples have been transcribed as common practice notation below. The Disambiguate strategy results in a simple single-part accompaniment, whilst the Follow and Ambiguate strategies result in multi-part, polyrhythmic accompaniments.


Figure 8.12: Simple example of ambiguity strategies


Chimæra Implementation in the Jambot

The implementation of the Chimæra in the Jambot uses a more subtle technique than that discussed above, designed to integrate with the other proactive improvisatory strategies described in this chapter. The idea of the Chimæra is to take improvisatory actions that selectively emphasise either one or several of the plausible metric contexts, depending upon the mode that the Chimæra is in.

The Jambot seeks to highlight a particular metric context by manipulating the periodicity heuristics (described in §7.3.5). When the Chimæra process is enabled the Chimæra takes control of the target periodicities and their weights. What it does with them depends upon which of the three strategies (Disambiguate, Ambiguate or Follow) it is employing. The choice of strategy is controlled by a human performing with the Jambot, or by default is set to Follow.

In Disambiguation mode, the Chimæra seeks to emphasise only the most prominent metric context. To do this, it sets a heavy weighting towards a periodicity that is equal to the period of the bar in the most plausible metric context. Similarly, in Ambiguation mode, the Chimæra sets heavy weightings on all periodicities that are equal to the bar period of a plausible metric context. In Following mode, the weights are proportional to the plausibility of the context.

An example of the Chimæra implementation in the Jambot is given in the demonstration video chimaera.mov, in which the Jambot jams to the same drum loop as was used for the examples above. The Chimæra process begins in Disambiguation mode, then switches to Ambiguation and then Following mode.


8.4 Interactive

The third arm of the Jambot's generative strategies is interactive generation. Interactive generation combines the reactive and proactive techniques using the Juxtapositional Calculus. It also targets interactional strategies of dialogue in order to produce lifelike interaction.

8.4.1 Juxtapositional Calculus

The idea of the juxtapositional calculus, as discussed in §8.2.7, is to provide a mechanism for incorporating musical knowledge (in this case the location of the beat and the downbeat) into a baseline interaction strategy of mimicry.

The first component of this calculus is a switching mechanism which relies upon the plausibility attached to each of the Jambot's metric contexts. Below a threshold confidence the Jambot operates entirely by transformational mimesis; only once a scenario has reached an acceptable level of plausibility will it be incorporated into consideration for actions.

The second component of the calculus is the use of the perceived onsets as a ‘decision grid’. Each time an onset is detected the Jambot considers its potential actions. If the onset is tolerably close to a substratum beat then it will make a stochastic decision (the distribution being dependent on its internal state, its improvisatory mode, its ambiguity targets etc.) as to whether to take an action. If it does take an action, that action may be to play a note (which note depends on whether this beat is believed to be the downbeat or not), or to play a fill. A fill consists of playing a note followed by a series of notes that evenly subdivide the gap between the current beat and the next beat. If the onset is not tolerably close to the beat then no action is taken.
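
In outline the decision grid behaves something like the following sketch; the tolerance, probabilities and note names are illustrative stand-ins, not the Jambot's actual values:

    import random

    def on_onset(onset_time, beat_time, next_beat_time, is_downbeat,
                 tolerance=0.03, p_act=0.5, p_fill=0.2):
        # Onsets not tolerably close to a substratum beat trigger no action.
        if abs(onset_time - beat_time) > tolerance:
            return []
        # Stochastic decision whether to act at all (in the Jambot the
        # distribution depends on internal state, mode, ambiguity targets etc.).
        if random.random() > p_act:
            return []
        if random.random() < p_fill:
            # A fill: a note now, then notes evenly subdividing the gap
            # between this beat and the next.
            subdivisions = 4
            step = (next_beat_time - beat_time) / subdivisions
            return [('fill_note', beat_time + i * step)
                    for i in range(subdivisions)]
        note = 'downbeat_note' if is_downbeat else 'beat_note'
        return [(note, beat_time)]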

The musical understanding, which in this case consists of knowing where the beat and downbeat are, is thus incorporated by affecting the distributions for choosing whether or not to play a note and which note to play, and for the timing of the subdivisions of the fills. The fills mean that the Jambot is not restricted to playing only when an onset is detected, but anchoring the decision points to detected onsets provides a good deal of robustness.

Using this calculus, if no onsets are detected then the Jambot does not play. Although in some musical circumstances it would be desirable to have the Jambot playing in the absence of any other percussion (such as taking a solo), in practice this silence is frequently a desirable property. It means that the Jambot doesn't play before the piece starts or after it finishes, and allows for sharp stops; the most the Jambot will ever spill over into a pause is one beat's worth of fill. It also means that it is highly responsive to tempo variation, and can cope with sudden extreme time signature changes, especially in combination with the confidence thresholding described above.

8.4.2 Attention

The Jambot's architecture implements various attentional mechanisms, both for the purposes of robust perception and for lifelike interaction. One attentional mechanism is achieved via a combination of Perceptual Inertia (§4.2.3) and the multiple parallel belief structure (§4.2.2) with ambiguity targeting (§8.3.10).

Perceptual Inertia operates as a top-down process in which currently held beliefs affect the perceptual mechanism so as to emphasise sensory data that is coherent with the belief, and suppress sensory data that is not (§6.4.6, §6.8.3). This provides for perceptual robustness by paying attention to input that is coherent with internal beliefs.

Similarly, the attentional modulation technique (§6.7) affects the reception mechanisms, so that more attention is paid to onsets occurring at expected times, adopting the view of metre as an attentional framework (§2.5.1).

A further level of attention is provided by the ambiguity targeting. When in unambiguous mode the Jambot essentially pays attention to only the most plausible scenario. The temporal evolution of the plausibilities means that it is possible to momentarily distract the Jambot (cf. the discussion of curiosity and agency in §8.2.5) by performing some unexpected actions.

8.5 Summary

This chapter discussed the Jambot's rhythm generation techniques, which include reactive, proactive and interactive approaches.

The reactive approach is based on imitation. I outlined a general approach to interaction, Smoke and Mirrors, which uses transformed imitation. The idea of Smoke and Mirrors is for the computer to create an impression of agency by inheriting human-like qualities from the human in a human-computer interaction.


The Jambot's musical version of Smoke and Mirrors is called Transformational Mimesis. It involves real-time imitation of the percussive onsets in the audio stream. The imitation is transformed in a number of ways to obscure its directness.

The proactive approach uses the salient parametrisation of rhythm discussed in chapter 7 to generate rhythms. It operates by targeting values for the rhythmic analyses, and taking actions to move the ensemble rhythm closer to these target values.

The rhythmic generation uses a search technique called anticipatory timing to find the optimal action. Anticipatory timing is a variant of greedy optimisation that allows for finessing the timing of an optimal action. This search technique is designed to give a good compromise between efficiency and optimality that is particularly apt for rhythmic generation. Use of anticipatory timing is crucial to the intuitiveness of the rhythmic model.

Another technique the Jambot uses for rhythmic generation is the Chimæra, which seeks to create rhythms of specifiable metric ambiguity. It does this by selectively highlighting one, or many, of the plausible metric contexts inferred in the analysis stage.

The interactive approach combines the reactive and proactive approaches. The idea is to operate from a baseline of transformed imitation, and to use moments of confident understanding to deviate musically from this baseline.


Chapter 9

Conclusion

This research has resulted in a robust concert-ready interactive music system: the Jambot. The Jambot listens to a musical audio signal, and improvises percussive accompaniment. The input signal may be from a live ensemble, or may be playback of a recording for use by a DJ. An example of the Jambot improvising to a prerecorded track is given in the demonstration video MainDemo.mov from 5'43" until the end of the video.

The Jambot's perceptual apparatus is composed of three stages: reception, analysis and generation. The reception stage parses a raw audio signal into timestamped notes. These notes feed into the analysis stage, which infers tempo and metre, places the notes at beat locations, and performs rhythmic analyses. Information from the analysis stage is used by the generation stage to improvise appropriate rhythmic accompaniment.

New knowledge has been contributed by this research in each of the reception, analysis and generation stages. Moreover, the Jambot's architecture, which utilises feedback between these stages, represents a contribution to knowledge. The sections below detail the findings in each of these areas, relating back to the research questions posed in §1.1.3.


9.1 Reception

How can salient events be extracted from a raw audio stream? The reception stage contains a novel suite of onset detection algorithms, and a novel synthesis of existing polyphonic pitch tracking techniques.

9.1.1 Onset Detection Suite

The Jambot utilises a novel suite of percussive onset detection algorithms which achieve reasonable accuracy and discrimination between the kick, snare and hi-hat sounds of a standard drum-kit in the context of complex audio. The algorithms are causal and low-latency, allowing for the real-time timbral remapping of percussion elements of complex audio. The suite consists of the SOD, MID and LOW detectors, which are tuned to the detection of the hi-hat, snare and kick sounds respectively of a standard drum-kit.

Stochastic Onset Detection

The most significant contribution to knowledge in the reception domain is the Stochastic Onset Detection (SOD) algorithm for extracting hi-hat onsets from complex audio (§5.2.6). The SOD algorithm operates by tracking growth in the ‘noisiness’ of the signal.

The SOD algorithm is designed to extract noisy onsets, such as from hi-hats or cymbals, from complex audio with very low latency. It is designed to outperform existing algorithms such as bonk∼ (Puckette et al. 1998) and HFC (Masri & Bateman 1996) in the context of complex audio with confounding high frequency content, such as from a distorted guitar.

The tracking and discrimination of hi-hat onsets is of particular interest due to the common use of hi-hats to mark the pulse in popular Western dance music (and many other genres). The ability to discriminate between different percussive onsets is also important for two reasons. First, it allows the Jambot to perform real-time timbral remapping of percussion sounds. Second, the temporal structural relationship between different percussive streams is used both in the estimation of metre and in the generation of rhythmic accompaniment.

I gave a concrete example of a musical context in which the SOD algorithm yields better results than either bonk∼ or HFC (§5.2.5). I also provided an auditory comparison of the SOD algorithm with bonk∼ applied to the MIREX 2006 beat tracking practice dataset, by having both algorithms imitate the onsets in the data in real-time. The SOD algorithm is far more consistent at detecting musically salient onsets than bonk∼, and is more particularly tuned to noisy onsets such as a hi-hat.

The Jambot also implements the Spectral Flux (Masri 1996) onset detection algorithm, which Dixon (2006) suggests to be optimal amongst the common approaches to onset detection surveyed by Bello et al. (2005). Dixon uses the Spectral Flux algorithm in his BeatRoot beat tracking system. I implemented Spectral Flux detection both to enable comparison with the SOD algorithm, and to enhance the Jambot's ability to operate in musical contexts lacking percussion elements.

Extensive interaction with the Jambot has led me to conclude that the SOD algorithm is superior to the Spectral Flux algorithm for the desired task of extracting noisy sounds (such as a hi-hat) from complex audio. The Jambot's GUI allows for real-time selection of the percussion stream(s) used for beat tracking; any or all of the FLUX, SOD, MID and LOW streams may be used. Use of the FLUX stream notably degrades beat tracking performance in contexts where the pulse is being maintained by a hi-hat. I found the FLUX stream most useful for music without percussion.

Collins (2006b:70) compares the performance of a number of onset detection algorithms over the annotated test database created by Bello et al. (2005) at the Queen Mary University of London (QMUL). He concludes that a modification of the algorithm of Hainsworth & McLeod (2003), inspired by an algorithm of Klapuri (1999), is the best performing algorithm overall. An interesting topic for further research would be to implement this algorithm to enable comparison with the SOD algorithm.

LOW onset detection

The LOW onset detection algorithm (§5.2.6) is a band-pass filtered energy detection scheme. Conceptually this is similar to many existing schemes. However, in order to satisfy the overall project design requirements this algorithm required substantial parameter tuning.

The design requirements were (i) an ability to discriminate between kick and snare onsets, and (ii) sufficiently low latency to allow real-time imitation (or timbral remapping) with imperceptible asynchrony. Heuristically this task is challenging because the main cue differentiating these sounds is the presence of low frequency content in the kick sound. The time required to ascertain frequency content after an attack varies inversely with frequency, and for the frequencies in question (∼40 Hz to 80 Hz) the timescales approach perceptibility.

The resultant parameter tunings are a compromise between these requirements. The latency of the LOW algorithm is 256 samples (@ 44.1kHz). From my experimentation I concluded that 128 samples would be preferable, as at 256 samples the asynchrony is borderline perceptible for me. However, at 128 samples of latency the accuracy of discrimination between kick and snare sounds was undesirably low.

MID onset detection

The MID onset detection algorithm (§5.2.7) combines elements of the LOW and SOD algorithms. It is particularly tuned for the detection of snare onsets, which contain both broadband mid-frequency energy and a noise component.

The MID algorithm uses the SOD algorithm applied to a downsampled version of the input signal. It also utilises a bandpass filtered energy detection scheme. These two components are combined via a geometric mean to produce the MID detection function.
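
The combination step amounts to the following sketch (the variable names are placeholders for the per-window outputs of the two component detectors). The geometric mean fires only when both components are active, which suits a sound carrying both mid-band energy and noise:

    import math

    def mid_detection(sod_value, bandpass_energy):
        # Both inputs are non-negative per-window detection values.
        return math.sqrt(sod_value * bandpass_energy)

    print(mid_detection(0.8, 0.5))  # ~0.63: both components agree -> detection
    print(mid_detection(0.8, 0.0))  # 0.0: a noisy component alone is not enough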

The MID algorithm does a reasonable job of discriminating a snare sound from both kick and hi-hat sounds within 256 samples (@ 44.1kHz) in the context of an unaccompanied drum track. In the presence of confounding audio it is less successful. Vocal parts with voiced sibilants are problematic, as these sounds often contain mid-frequency content and a stochastic component.

Example of Onset Detection Suite

An example of real-time imitation is given in the demonstration video MainDemo.mov. From the start of this video, and up until 2'50", the Jambot mimics the kick, snare and hi-hat elements of a drum loop, the Amen Break.

9.1.2 Polyphonic Pitch Tracking

I presented a novel technique for estimating the salient harmonic content of complex audio (§5.3). This technique is a synthesis of a number of existing techniques, and addresses theoretical shortcomings of the component techniques in the context of real-time operation.

The technique provides more accurate frequency estimates with lower latency than the component techniques, by utilising a multistage procedure combining Instantaneous Frequency Distribution fixed point analysis into an MQ-analysis framework with belief propagation. The goal of this technique is to provide sufficient harmonic information to afford appropriate improvisation.

The output of the polyphonic pitch tracking algorithm, whilst improving on the component algorithms, was not of sufficient quality for inclusion in the Jambot. The latency of the estimates was too high for convincing real-time imitation, and the errors in frequency estimation too high to warrant development of chord identification or key finding algorithms based on this information. Consequently, the Jambot currently only listens to percussive elements.

9.2 Analysis

How can the notes be analysed into musical features? The analysis stage takes input from the reception stage, and performs a number of metrical and rhythmic analyses. The metrical analyses include beat tracking and metre induction. The beat tracking is designed to track mildly varying tempos, and to recover quickly from abrupt tempo changes. The rhythmic analyses are designed to provide a musically salient parametrisation of rhythm space, affording intuitive parametric control over the rhythm generation process.

The Substratum

I proposed a new theoretical construct called the substratum (§6.3). The substratum acts as a referent pulse level. In this sense it is similar to the tactus, but it is defined as a mathematical property, directly observable from the musical surface, rather than a perceptual property.

The substratum is defined as the most consonant pulse, i.e. the pulse for which most other pulses evident in the musical surface are a simple multivisor. In my experience, the substratum tends to be faster than the tactus.

The Jambot's beat tracking estimates the substratum. The construction of the substratum means the beat tracking mechanism benefits from a uniform treatment of metric levels above and below the substratum, which may be estimated together in a single linear regression.

The substratum is a useful theoretical device for beat tracking in many genres of percussive dance music. In §6.3 I highlighted a theoretical shortcoming of the substratum: it is not well suited to analysis of music with swing, particularly if the swing ratio is irrational. A topic for further research is to develop a theory based on arbitrary subdivisions of a referent pulse level.

Beat Tracking

I presented a novel beat tracking technique (§6.4). This technique shares aspects of a number of existing techniques, but is different in many respects (§6.2.2). It differs by tracking the substratum (§6.3), utilising the SOD stream (§5.2), having a sparse implementation (§6.2.2), utilising perceptual inertia (§4.2.3), implementing responsivity (§6.2.2) and utilising feedback from the beat tracking to alter the onset detection algorithms (§6.7).

The beat tracking achieves good results on the subset of the MIREX (2006) practice data set containing percussive elements, comparing favourably with Dixon's (2008) BeatRoot system, but operating causally and in real-time. Moreover, the beat tracking mechanism recovers quickly from drastic tempo changes or pauses.

I also presented a novel technique for estimating the substratum phase. This technique is similar to existing techniques for pulse phase estimation, particularly that of Laroche (2003). It differs by treating the onset salience function as discrete and the pulse function as continuous (Laroche does the opposite), which I argued to be better suited to high precision onset timing input. It also utilises a complex weighted ordinary least squares estimation technique to bias the results towards being accurate at the current time.

Metre Induction

I presented a novel technique for estimating the number of beats in a bar (§6.8). The technique utilises the notion of beat-class intervals, and operates on the assumption that properties of the musical surface should, to some extent, respect beat-class interval invariance. The bar length for which this is most true determines the Jambot's estimate.

As with all of its perceptual parameters, the Jambot maintains multiple hypotheses regarding the number of beats in a bar, along with associated plausibilities. These multiple beliefs are fed to the Chimæra process (§8.3.10) and used to manipulate the level of metric ambiguity present in the improvisation.

In practice, most percussive dance music utilises a persistent time signature, and the actual number of beats in the bar is readily communicated to the Jambot by a human user at the press of a button. For applications other than manipulation of metric ambiguity, I tend to manually control the number of beats in the bar. Moreover, deliberately misinforming the Jambot can be musically useful. The demonstration video MainDemo.mov gives an example of this technique. As the Jambot starts improvising (on timbales) at 6'12" I manually set the bar period to 3 beats, creating a polyrhythm against the background track in 8. I then alternate between setting the bar period to 3 or 4 beats for a time, then settle into 8.

Salient Parametrisation of Rhythm

I presented a model of rhythm in terms of musically salient parameters (§7.3.4), consisting of a collection of rhythmic analyses. The model is used by the generation stage to create generative rhythms. The use of musically salient parameters contrasts with many approaches to modelling and representing rhythm, such as Markov chains and neural nets, which do not expose intuitive parameters for manipulation. The Jambot's rhythm model affords intuitive parametric control over the generation process.

9.3 Generation

How can appropriate musical responses be generated? The generation stage consists of three improvisatory strategies: reactive, proactive and interactive. The reactive strategy utilises real-time imitation of the input signal to provide convincing accompaniment. The proactive strategy utilises information from the analysis stage to predict the beat, and applies musical understanding in the production of appropriate accompaniment. The interactive strategy mediates between the reactive and proactive strategies, and utilises interaction techniques designed to produce an impression of lifelikeness.

Smoke and Mirrors

I described a broad interaction paradigm dubbed Smoke and Mirrors (§8.2.2) that seeks to minimise cognitive dissonance in interaction between human and machine. Smoke and Mirrors involves reflecting the actions of a human interactor whilst transforming the reflection sufficiently to obfuscate the direct relationship between input and output. In this way the machine can inherit from the human the intelligence already present in the interaction.

The idea of Smoke and Mirrors is to give an impression of intelligence even with limited understanding. Beyond this, the Smoke and Mirrors paradigm advocates the use of mimesis as a baseline of interaction. Moments of confident understanding can then be inserted into the interaction as deviations from this baseline. This general strategy allows for robust interaction in a wide variety of circumstances, with novel and insightful input when appropriate.

Transformational Mimesis

I presented a reactive improvisation strategy named Transformational Mimesis (§8.2.6). Transformational Mimesis conforms to the Smoke and Mirrors paradigm. It involves the real-time imitation of percussive elements of the input stream, made possible by the low latency onset detection suite (§9.1.1). The imitation is transformed to obfuscate the connection with the input signal by a variety of mechanisms. Part of the transformation process uses aspects of musical ‘understanding’ inferred in the analysis stage (§9.3).

Generative Rhythm

I presented a novel approach to rhythm generation, designed to afford intuitive and musically salient parametric control (§8.3.7). The generative algorithm operates by seeking target values for rhythmic heuristics. The heuristics are defined by the Jambot's rhythm model (§7.3.1), which is designed to provide a musically salient parametrisation of rhythm space.

This algorithm maintains musical appropriateness to the input signal in two ways. First, some of the heuristics relate to the inferred metre, and so the generated rhythm will be in time, and metrically appropriate. Second, the algorithm seeks target values for the rhythmic analyses performed on the ensemble rhythm, i.e. the combination of the parsed rhythm from the input signal with the Jambot's own musical actions. In this way the Jambot takes actions to move the ensemble rhythm as a whole closer to the desired heuristics.

Anticipatory Timing

I demonstrated a novel search technique for real-time algorithmic rhythm generation (§8.3.6) named Anticipatory Timing. Anticipatory Timing is an extension of greedy optimisation that allows for the finessing of timing. It is designed to provide a good compromise between computational efficiency and optimality that is particularly efficacious for rhythm generation. Anticipatory timing is essential to the intuitiveness of parametric control over the rhythm generation.

The Chimæra Process

One improvisatory strategy used in human ensemble improvisation that I sought to implement was maintaining a balance between novelty and coherence. In §2.3 I argued that manipulating the level of ambiguity provides a mechanism for altering this balance. Further, I argued that selectively highlighting one or several beliefs in a multiple hypothesis architecture provides a mechanism for manipulating ambiguity. The Jambot implements such a scheme in the Chimæra process (§8.3.10).

The Chimæra process is a meta-generative algorithm that controls the target values used in the rhythm generation. It seeks to highlight one or more of the metric contexts currently hypothesised by the Jambot's metric analysis stage. It has three modes of operation: disambiguate, ambiguate and follow. When disambiguating, it acts to highlight only the most plausible metric context. When ambiguating it acts to highlight all plausible metric contexts equally. When following it acts to highlight all plausible metric contexts in proportion to their plausibility.

Juxtapositional Calculus

This research aimed to find ways to combine transformative and generative processes (or, in the language of this thesis, reactive and proactive processes) in an interactive music system (§1.1.3). I presented the Juxtapositional Calculus for combining reactive and proactive improvisation strategies (§8.4.1). It comprises two components. First, a metrically aware filtering and elaboration mechanism that performs part of the transformation in the transformational mimesis technique (§8.2.6). Second, a mode-switching procedure for alternating between transformational mimesis and proactive generation, based on the Jambot’s confidence in its metric estimations.
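The mode-switching component can be sketched as a simple threshold rule on the metric confidence; the threshold values, and the use of hysteresis to prevent rapid chattering between modes, are illustrative details rather than the Jambot's actual parameters.

    def choose_mode(confidence, current_mode, on=0.7, off=0.4):
        """Switch between proactive generation and reactive mimesis based
        on the Jambot's confidence in its metric estimate. Separate on/off
        thresholds (hysteresis) stop the mode flapping near one value."""
        if current_mode == "proactive" and confidence < off:
            return "reactive"   # lost the beat: fall back to mimesis
        if current_mode == "reactive" and confidence > on:
            return "proactive"  # beat re-established: generate against the metre
        return current_mode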

An example of this mode switching is given in the demonstration video BeatTracking.mov. The first part of this video demonstrates the Jambot’s beat tracking of the Amen Break, whilst the tempo of the loop playback is being manually altered. The beat estimates are highlighted by having the Jambot play a click on each beat. In the second part of the video, from time 2’08” to the end, the Jambot improvises a timbale line along to the loop, again with the loop playback speed being manually altered. For gentle tempo alterations the beat tracking is able to follow continuously, the Jambot’s confidence remains high, and the proactive generation mode is used. When the tempo is altered drastically, the Jambot loses confidence and switches to reactive generation. The result is a relatively seamless improvisation despite drastic tempo variations.

Attention and Lifelikeness

In §8.2.5 I discussed notions of agency and lifelikeness, and argued that the lifelikeness of computational agents can be enhanced by implementing behavioural qualities of curiosity and distractibility. I further argued that these behavioural qualities can be realised via an attentional mechanism, and discussed how such a mechanism is implemented in the Jambot.


These arguments draw upon theories of situated robotics and exemplar artworks. Whilst the Jambot implements these behavioural qualities, the gradual shift in focus of this research from autonomous agents to semi-autonomous agents (§1.1.1) meant that I spent more time exploring the intuitiveness and musical salience of the parametric controls than exploring the lifelikeness of the Jambot in interaction.

The experiments that I did perform with the Jambot in autonomous configuration involved it jamming with a live ensemble. As an example, the video file RobobongoAllstars.mov documents a performance of the Robobongo Allstars at the 2009 Australasian Computer Music Conference. The Jambot is playing the vibraphone; pitch information was communicated to the Jambot simply by pre-agreeing on a key.

Throughout the performance the Jambot displayed a good balance of reflexive and generative behaviour, creating simultaneous impressions of both ensemble awareness and individuality – the combination of which creates, I suggest, an impression of musically intelligent agency. For example, between the times 1’18” and 1’20” in the video the Jambot produces a short solo ‘fill’ on vibraphone which is musically apt in the context, yet displays individuality and novelty. The overall effect was to create a musically convincing moment which elicited appreciation from the audience, and prompted one audience member to comment that the Jambot “had a sense of humour”.

9.4 Architecture

What is an effective architecture for combining reception, analysis and generation? The goal of this project – to produce an operational concert-ready system – meant that architectural considerations were important. The benefit of using a design research methodology was that the individual algorithms were not developed in isolation, but rather in an iterative process in the context of a complete system. The identification and implementation of an architecture for the Jambot led me to conclude that a complex hierarchy with bidirectional communication between its layers is an effective architecture for robust perception and improvisation.

Complex Hierarchy with Feedback

I argued that a complex hierarchy with bidirectional communication between its layers is an effective architecture for interactive music systems (§4.3). This architecture contrasts with typical architectures used for music representation, and for modelling cognitive structures of music perception, such as Markov chains and trees.

A particular point of difference between the Jambot’s architecture and typical architectures for music representation and cognitive modelling is the use of feedback between layers in the architecture. The Jambot uses feedback in two ways. First, it utilises feedback from the generation stage to the analysis stage by performing its analyses on the combined rhythm of the ensemble with its own musical actions. Second, it utilises feedback from the analysis stage to the reception stage by modulating the onset detection algorithms according to the current metric strength.

Attentional Modulation of Onset Detection Algorithms

The Jambot modulates the onset detection algorithms through time according to the expectation of an onset occurring. It does this by altering the threshold signal/noise ratio required to report an onset, inversely to the metric strength of the current time. The effect of this modulation is to stabilise the beat tracking by reducing the impact of spurious onsets.
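In outline, the modulation scales the reporting threshold inversely with the metric strength of the current position; the base threshold and modulation depth below are illustrative parameters, not the Jambot's tuned values.

    def modulated_threshold(base, metric_strength, depth=0.5, enabled=True):
        """Lower the signal/noise threshold where an onset is metrically
        expected (high metric strength) and raise it between expected
        positions, suppressing spurious onset reports."""
        if not enabled:  # the human operator can switch modulation off
            return base
        # metric_strength in [0, 1]: 1 at strongly expected beat positions.
        return base * (1.0 - depth * metric_strength)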

The modulation function can be turned on and off by a human operator. The intended usage is to have the modulation off when first establishing the beat, since its stabilising effect is counter-productive until the correct beat is approximately established. Once the Jambot has locked on to the beat, turning on the modulation tends to make the tracking more robust, though still able to follow mild tempo variations. Should the tempo change abruptly, or should the beat estimate become inaccurate, the modulation should be turned off until a good estimate is re-established.

An example of the stabilising effect of attentional modulation is given in the demonstration video AttentionalModulation.mov. The input material is a short looped segment from Rage Against the Machine’s Killing in the Name. During development of the onset detection suite I used this loop as a ‘difficult’ test-case for the various percussive onset detectors. The heavily distorted guitar degrades the signal/noise ratio for the detection functions, leading to numerous spurious onset reports. The lyrical content (for which I issue a language warning) seems an ironic commentary on the recalcitrance of this loop to submit to analysis.

The beat tracking has been set to track the combined onsets from the SOD, MID and FLUX detectors. The modulation is on at the beginning of the video, with the Jambot augmenting the loop with a click track initially, and additionally with percussive improvisation after a few seconds. The tracking and the improvisation are (to my ear) tolerably in time. After a short time, the modulation function is turned off, and the tracking and improvisation rapidly degrade. The beat estimate is then manually recalibrated through tap-tempo, and the modulation turned back on. The beat estimate quickly restabilises.


9.5 Further Research

The process of developing an operational interactive music system necessitated delving into various areas of machine listening and improvisation. This research has contributed incremental advances in a number of these areas, and also suggested several opportunities for further research. Below I discuss some of these opportunities in regard to pitch information.

Modulating the onset detection sensitivities according to metric expectation strikes me as a first step in a more comprehensive strategy of combining bottom-up perceptual processes with top-down expectations. I suggest that such a strategy could be particularly helpful in real-time polyphonic pitch tracking. Expectations regarding note boundaries may assist in segmenting analysis periods, reducing the confounding effects of note transitions on frequency estimates. Furthermore, tonal expectations based on schematic musical knowledge (such as key) could also assist in reducing the impact of erroneous frequency estimates.

More accurate harmonic tracking would open up several other avenues of further research. The estimation of downbeat location would, I suggest, benefit from this information. During this research I invested considerable effort into developing algorithms for downbeat estimation. This did not, however, result in a particularly robust algorithm. In practice this is not overly problematic for the Jambot, since a human user can readily provide this information with the press of a button. However, the issue remains tantalising. I suggest that the approaches I tried for downbeat estimation were limited by inspecting only percussive elements of the audio signal. I suspect that a large part of our human sense of downbeat relates to pitch information, such as the bass-line and points of harmonic change.

Harmonic information would also open up opportunities in the generation stage. From an algorithmic composition perspective, tonal improvisation provides a much wider research arena than generative rhythm.

9.6 Final Thoughts

This project has been a fascinating journey for me. At the start of the journey I sought to develop an autonomous musical agent. Through the process of simulating human musical faculties I developed great respect for the human perceptual system. This led me to question the goal of autonomy, and to gradually shift perspective to semi-autonomous systems. Rather than seeking to replicate human abilities, I moved towards seeking to augment them.

I believe a similar shift in focus is underway across a number of disciplines including design, computer science and artificial intelligence. Whilst the classical goal of artificial intelligence – to simulate human mental function – remains philosophically important, the onset of the age of ubiquitous computing has precipitated interest in more practical aspects of human-computer interaction, such as how natural the interaction is. I suggest that there is a rich future for research into strategies for human-computer interaction that seek to minimise cognitive dissonance and maximise the leverage of the human capacity already present in the interaction.


Appendix A

Derivation of Phase Estimate

The phase estimation technique described in §6.5 involves finding the phase shift for which the rhythmic and pulse saliences are maximally correlated. The optimum value is obtained by performing a complex regression of the rhythmic salience onto the complex oscillator $e^{-2\pi i \omega t}$, where $\omega$ is the frequency of the substratum pulse (expressed in cycles per time unit). Below is a derivation showing that this regression yields the optimal phase shift.

First, linear regression and the Ordinary Least Squares solution are described. Complex regression, a little-known extension of linear regression to complex vectors, is then described. Changing tack, a relationship between the correlation of pulse to rhythm, and the least squares difference of rhythm to complexified pulse, is derived. This leads to a demonstration that the given regression yields the maximally correlated phase shift. Finally, there is a discussion of the Weighted Least Squares formulation to allow for weighting schemes.

Ordinary Least Squares

Linear regression seeks to find the linear combination of the independent variables $x_1, \ldots, x_n$ that best explains the dependent variable $y$. The usual measure of goodness-of-fit is the sum of the squared residuals between $y$ and the fitted estimate $\hat{y}$. Formally, we seek scalars $c_1, \ldots, c_n$ for which

$$\hat{y} = \sum_{k=1}^{n} c_k x_k \quad \text{minimises the sum of squares} \quad \|y - \hat{y}\|^2$$

The use of the sum of squared residuals as the objective function means that an analytic solution is available. Gathering the $x$ and $c$ variables together as the columns of the matrices $X = [x_1 \ldots x_n]$, $C = [c_1 \ldots c_n]$, the optimal combination is given by the Ordinary Least Squares formula (Harvey 1990):

$$C = (X^T X)^{-1} X^T y \tag{A.1}$$


Complex Least Squares

Although not widely recognised, it turns out that the Ordinary Least Squares formula is also valid in the context of complex linear regression (Miller 1973). In this case we seek complex scalars $\beta = [\beta_1 \ldots \beta_n]$ to best explain a dependent complex vector $z$ as a (complex) linear combination of the independent complex vectors $W = [w_1 \ldots w_n]$. The least squares solution is then given by

$$\beta = (W^* W)^{-1} W^* z \tag{A.2}$$

where $W^*$ is the Hermitian conjugate of $W$.

Of particular interest to us will be the case of a single-variable complex regression, where we wish to find a single complex scalar $\beta$ that creates the best fit when explaining a dependent complex vector $z$ as a multiple of a single independent complex vector $w$, i.e.

$$\hat{z} = \beta w \quad \text{minimises the sum of squared magnitudes} \quad \|z - \hat{z}\|^2$$

Multiplication by a complex scalar $\beta$ has the effect of scaling the amplitude of the complex signal $w$ by the magnitude of $\beta$, and of rotating the phase of $w$ by the phase of $\beta$.

Maximising Correlation of Pulse and Rhythm

I am adopting Large & Kolen’s resonant oscillator model (1994), which describes pulse salience as an offset sinusoidal function. Denoting the pulse salience by $p(t)$ and the phase of the pulse by $\phi$,

$$p(t) = 1 + \cos\big(2\pi(\omega t - \phi)\big) \tag{A.3}$$

where $\omega$ is the frequency (in cycles per time unit) of the pulse.

The cosine function in (A.3) can be written as the real part of a complex oscillator

$$p(t) = 1 + \mathrm{Real}\big[e^{-2\pi i(\omega t - \phi)}\big] \tag{A.4}$$

In order to find the phase of the pulse we seek the value of $\phi$ for which the correlation of $p(t)$ with the rhythmic salience is maximal. Up to a normalisation factor the correlation is the same as the covariance, and the covariance may be expressed as an inner product $\mathrm{cov}(p, r) = \langle p, r \rangle$. Denoting by $q(t)$ the complex oscillator corresponding to the sinusoidal oscillator in the pulse salience, i.e.

$$q(t) = e^{-2\pi i(\omega t - \phi)}$$


then noting that $p = 1 + \tfrac{1}{2}(q + \bar{q})$, we can re-express the covariance

$$\begin{aligned}
\mathrm{cov}(p, r) &= \langle p, r \rangle \\
&= \langle 1, r \rangle + \tfrac{1}{2} \langle q + \bar{q}, r \rangle \\
&= \langle 1, r \rangle + \tfrac{1}{2} \langle q, r \rangle + \tfrac{1}{2} \langle \bar{q}, r \rangle \\
&= \langle 1, r \rangle + \tfrac{1}{2} \langle q, r \rangle + \tfrac{1}{2} \overline{\langle q, r \rangle} \\
&= \langle 1, r \rangle + \mathrm{Real} \langle q, r \rangle \qquad \text{(A.5)}
\end{aligned}$$

The quantity $\mathrm{Real}\langle q, r \rangle$ is related to the least squares difference of $q$ and $r$:

$$\|q - r\|^2 = \langle q - r, q - r \rangle = \langle q, q \rangle + \langle r, r \rangle - \langle q, r \rangle - \langle r, q \rangle = \|q\|^2 + \|r\|^2 - 2\,\mathrm{Real}\langle q, r \rangle \tag{A.6}$$

Substituting (A.6) into (A.5) yields

$$\mathrm{cov}(p, r) = \langle 1, r \rangle + \tfrac{1}{2}\big(\|q\|^2 + \|r\|^2 - \|q - r\|^2\big) \tag{A.7}$$

Hence the covariance of $p$ with $r$ will be maximised precisely when the least squares difference of $q$ and $r$ is minimised.

Now notice that the phase term $\phi$ can be factored out of $q$, and redefined as a complex scalar $\beta \equiv e^{2\pi i \phi}$, so that

$$q = e^{2\pi i \phi} e^{-2\pi i \omega t} = \beta e^{-2\pi i \omega t}$$

The problem of finding the phase $\phi$ which maximises the correlation of $p$ and $r$ can thus be re-expressed as the problem of finding the complex scalar $\beta$ for which the phase-shifted pulse salience $\beta e^{-2\pi i \omega t}$ minimises the least squares difference to $r$.


Calculating the Phase Estimate

Comparing this formulation of the problem to the description of one-dimensional complex linear regression above, we can see that the optimal phase may be found by performing a complex linear regression of the rhythmic salience $r$ onto the complex oscillator $e^{-2\pi i \omega t}$.

Let us rewrite the rhythmic and pulse saliences as discrete vectors at some quantisation of time, i.e.

$$r = \begin{bmatrix} r(t_0) & r(t_1) & \cdots & r(t_{n-1}) \end{bmatrix} \qquad W = \begin{bmatrix} e^{2\pi i \omega \cdot 0} & e^{2\pi i \omega \cdot 1} & \cdots & e^{2\pi i \omega \cdot (n-1)} \end{bmatrix}$$

where $n$ is the number of bins in the quantisation. Then, consulting the Complex Least Squares formula (A.2) above, we see that the phase estimate for the rhythmic salience $r$ is given by

$$\phi = \arg(\beta) \quad \text{where} \quad \beta = W^T r \tag{A.8}$$
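A minimal numpy sketch of (A.8) follows, assuming the rhythmic salience is sampled at unit-spaced time bins. Note that np.angle returns radians; since $\beta \equiv e^{2\pi i \phi}$ with $\phi$ expressed in cycles, the result is divided by $2\pi$.

    import numpy as np

    def phase_estimate(r, omega):
        """Estimate the phase of the substratum pulse from the rhythmic
        salience r (a length-n real vector) and the pulse frequency omega
        (in cycles per time bin), via the complex regression of (A.8)."""
        n = len(r)
        W = np.exp(2j * np.pi * omega * np.arange(n))  # complex oscillator
        beta = W @ np.asarray(r, dtype=float)          # beta = W^T r
        return np.angle(beta) / (2 * np.pi)            # phase in cycles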

Weighted Least Squares

A benefit of formulating the phase estimation technique in terms of linear regression is that there is a simple extension to weighted linear regression, where it is sought to minimise the weighted sum of squared residuals between a dependent variable $y$ and a linear combination of independent variables $x_1, \ldots, x_n$, with a weight vector $w$. The solution is the same as for the Ordinary Least Squares case, save that the variables $y$ and $x_i$ are first transformed by

$$x_i' = x_i \otimes w^{\frac{1}{2}}, \qquad y' = y \otimes w^{\frac{1}{2}}$$

where $\otimes$ denotes element-wise multiplication (and $w^{\frac{1}{2}}$ the element-wise square root of $w$). Then, writing $X' = [x_1' \ldots x_n']$, the Weighted Least Squares solution to weighted regression is

$$C = (X'^T X')^{-1} X'^T y' \tag{A.9}$$
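A corresponding numpy sketch of (A.9): ordinary least squares applied after scaling the variables by the element-wise square root of the weights.

    import numpy as np

    def weighted_least_squares(X, y, w):
        """Solve the weighted regression of y on the columns of X with
        weight vector w, by transforming to an Ordinary Least Squares
        problem as in (A.9)."""
        sw = np.sqrt(np.asarray(w, dtype=float))
        Xp = X * sw[:, None]   # x_i' = x_i (element-wise) w^(1/2)
        yp = y * sw            # y'  = y  (element-wise) w^(1/2)
        return np.linalg.solve(Xp.T @ Xp, Xp.T @ yp)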


Bibliography

Allen, P & Dannenberg, R (1990). ‘Tracking musical beats in real time’. In International Computer Music Conference, 140–143. ICMA, San Francisco.

Ames, C & Domino, M (1992). ‘Cybernetic composer: an overview’. In Balaban, M, Ebcioglu, K & Laske, O, eds., Understanding Music With AI: Perspectives on Music Cognition, 186–205. MIT Press, Cambridge, MA.

Apple (2011). ‘Quicktime’. Accessed 31st March 2011. http://www.apple.com/quicktime/download/

Argonne National Laboratory (2010). ‘Ask a scientist’. Accessed 9th May 2010. http://www.newton.dep.anl.gov/askasci/ast99/ast99215.htm

Ariza, C (2005). ‘The Xenakis sieve as object: A new model and a complete implementation’. Computer Music Journal, 29(2):40–60.

Ariza, C (2009). ‘The interrogator as critic’. Computer Music Journal, 33(2):48–70.

Arom, S (1991). African Polyphony and Polyrhythm. Cambridge University Press, Cambridge, UK.

Assayag, G, Bloch, G, Chemillier, M, Cont, A & Dubnov, S (2006). ‘Omax brothers: A dynamic topology of agents for improvisation learning’. In ACM Workshop on Audio and Music Computing for Multimedia. Santa Barbara.

Babbage, C (1833). ‘Economy of manufactures’. In Minor, DK, ed., American Railroad Journal and Advocate of Internal Improvements, 310–313. Simmon-Boardman, New York.

Balaban, M, Ebcioglu, K & Laske, O (1992). Understanding Music With AI: Perspectives on Music Cognition. MIT Press, Cambridge, MA.

Barucha, J (1991). ‘Pitch, harmony and neural nets: A psychological perspective’. In Todd, PM & Loy, G, eds., Music and Connectionism, 84–99. MIT Press, Cambridge, MA.

Barucha, J (1993). ‘Muscat: A connectionist model of music harmony’. In Machine Models of Music, 497–509. MIT Press, Cambridge, MA.

Bateson, G (1972). Steps to an Ecology of Mind: collected essays in anthropology, psychiatry, evolution and epistemology. Aronson, London.

Bateson, G (1979). Mind and Nature. Bantam, New York.


Bateson, G & Bateson, MC (1987). Angels Fear: towards an epistemology of the sacred. Macmillan, New York.

BeatRoot (2011). ‘Beatroot’. Accessed 31st March 2011. http://www.eecs.qmul.ac.uk/~simond/beatroot/

Bellman, R & Kalaba, R (1956). Dynamic Programming and Modern Control Theory. Academic Press, New York.

Bello, J, Daudet, L, Abdallah, S, Duxbury, C, Davies, M & Sandler, M (2005). ‘A tutorial on onset detection in music signals’. IEEE Transactions on Speech and Audio Processing, 13(5):1035–1047.

Bercher, J & Vignat, C (2000). ‘Estimating the entropy of a signal with applications’. IEEE Transactions on Signal Processing, 48(6):1687–1694.

Beyls, P (1988). ‘Introducing Oscar’. In Lischka, C & Fritsch, J, eds., International Computer Music Conference, Cologne, 219–230. ICMA, San Francisco.

Biles, J (1994). ‘GenJam: A genetic algorithm for generating jazz solos’. In International Computer Music Conference, Aarhus, Denmark, 131–137. ICMA, San Francisco.

Biles, J (2002). ‘GenJam: Evolution of a jazz improviser’. In Bentley, P & Corne, D, eds., Creative Evolutionary Systems, 165–186. Morgan Kaufmann.

Bilmes, J (1993). Timing is of the Essence: Perceptual and Computational Techniques for Representing, Learning, and Reproducing Expressive Timing in Percussive Rhythm. Master’s thesis, MIT.

Borgo, D (2002). ‘Synergy and surrealestate: The orderly disorder of free improvisation’. Pacific Review of Ethnomusicology, 10.

Borgo, D (2004). ‘Sync or swarm: Group dynamics in musical free improvisation’. In Parncutt, R, Kessler, A & Zimmer, F, eds., Conference of Interdisciplinary Musicology, 52–53. Graz.

Boring, E (1942). Sensation and Perception in the History of Experimental Psychology. Appleton-Century, Oxford.

Bregman, A (1990). Auditory Scene Analysis. MIT Press, Cambridge, MA.

Brooks, R (1990). ‘Elephants don’t play chess’. Robotics and Autonomous Systems, 6:3–15.

Brooks, R (1991a). ‘Intelligence without reason’. In 12th International Conference on Artificial Intelligence, 569–595. Sydney.

Brooks, R (1991b). ‘Intelligence without representation’. Artificial Intelligence, 47:139–159.

Brown, A (2005). ‘Generative music in live performance’. In Australasian Computer Music Conference, July 2005, 23–26. ACMA, Brisbane.

Brown, AR (2007). ‘Software development as music education research’. International Journal of Education and the Arts, 8(6).


Brown, AR, Gifford, T, Davidson, R, Narmour, E & Wiggins, G (2009). ‘Generation in context’. In Stevens, C, Schubert, E, Kruithof, B, Buckley, K & Fazio, S, eds., 2nd International Conference on Music Communication Science, 7–10. HCSNet, Sydney.

Brown, J & Puckette, M (1992). ‘An efficient algorithm for the calculation of a constant-Q transform’. Journal of the Acoustical Society of America, 92:1394–1402.

Brown, M (2008). ‘A subsumption architecture for control of the lego mindstorm NXT robot’. Honour’s thesis, University of Sheffield.

Brown, MPJ (1998). ‘Accuracy of frequency estimates using the phase vocoder’. IEEE Transactions on Speech and Audio Processing, 6(2):166–176.

Brunswik, E (1955). ‘Representative design and probabilistic theory in a functional psychology’. Psychological Review, 62:193–217.

Brunswik, E (1956). Perception and the Representative Design of Psychological Experiments. University of California Press, Berkeley.

Burdick, A (2003). ‘Design (as) research’. In Laurel, B, ed., Design Research: Methods and Perspectives, 82. MIT Press, Cambridge, MA.

Byrne, MD & Anderson, JR (1998). ‘Perception and action’. In Anderson, JR & Lebiere, C, eds., The Atomic Components of Thought, 167–200. Lawrence Erlbaum Associates, Mahwah, NJ.

Callaos, N (2011). ‘The essence of engineering and meta-engineering: A work in progress’. Accessed 31st March 2011. www.iiis.org/Engineering-and-Meta-Engineering

Cambouropoulos, E (1998). Towards a General Computational Theory of Musical Structure. Ph.D. thesis, The University of Edinburgh.

Carroll, R & Ruppert, D (1988). Transformation and Weighting in Regression. Monographs on Statistics and Applied Probability. Chapman and Hall.

Cemgil, AT (2004). Bayesian Music Transcription. Ph.D. thesis, Radboud Universiteit Nijmegen.

Charpentier, FJ (1986). ‘Pitch detection using the short term phase spectrum’. In International Conference on Acoustics, Speech and Signal Processing, April 1986, 113–116. IEEE.

Chater, N, Tenenbaum, JB & Yuille, A (2006). ‘Probabilistic models of cognition: Conceptual foundations’. Trends in Cognitive Sciences, 10(7):287–291. Special issue: Probabilistic models of cognition.

Chomsky, N (1969). Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.

Clancey, W (1997). Situated Cognition: On Human Knowledge and Computer Representations. Cambridge University Press, Cambridge, UK.

Clarke, EF (1999). ‘Rhythm and timing in music’. In Deutsch, D, ed., The Psychology of Music, 473–497. Academic Press, New York.


Cohn, R (1992). ‘Transpositional combination of beat-class sets in Steve Reich’s phase-shifting music’. Perspectives of New Music, 30(2):146–177.

Collins, N (2006a). ‘Towards a style-specific basis for computational beat tracking’. In 9th International Conference on Music Perception and Cognition, August 2006, 461–467. Bologna.

Collins, N (2006b). Towards Autonomous Agents for Live Computer Music: Realtime machine listening and interactive music systems. Ph.D. thesis, Cambridge.

Collins, N (2008). ‘Infno: Generating synth pop and electronic dance music on demand’. In International Computer Music Conference, 2008. ICMA, San Francisco.

Collins, N (2010). ‘Contrary motion: An oppositional interactive music system’. In New Interfaces for Musical Expression, 2010. Sydney.

Conklin, D (2003). ‘Music generation from statistical models’. In AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences, 30–35. Aberystwyth.

Cooksey (2001). ‘Brunswik’s the conceptual framework of psychology: Then and now’. In Hammond, KR & Stewart, TR, eds., The Essential Brunswik. Oxford University Press, Oxford.

Cooper, G & Meyer, L (1960). The Rhythmic Structure of Music. Chicago University Press, Chicago.

Cope, D (1996). Experiments in Musical Intelligence. A-R Editions, Madison, WI.

Cope, D & Hofstadter, D (2001). Virtual Music. MIT Press.

Cormen, T, Leiserson, C, Rivest, R & Stein, C (2006). Introduction to Algorithms. MIT Press.

Coward, LA (2005). A Systems Architecture Approach to the Brain. Nova, New York.

Crotty, M (1998). The Foundations of Social Research: Meaning and perspective in the research process. Allen & Unwin, Sydney.

Crouch, C (2007). ‘Using praxis to develop research methods into personal creativity in the visual arts’. In Hatched 07 Arts Research Symposium, April 2007. Perth.

Dannenberg, R (2000). ‘Dynamic programming for interactive music systems’. In Readings in Music and Artificial Intelligence, 189–206. Harwood Academic Publishers, Amsterdam.

Dannenberg, R (2005). ‘Towards automated holistic beat tracking, music analysis, and understanding’. In 6th International Conference on Music Information Retrieval, 366–373. London.

Davies, M, Degara, N & Plumbley, M (2009). ‘Evaluation methods for musical audio beat tracking algorithms’. Tech. rep., Centre for Digital Music, Queen Mary University of London.

Davies, M & Plumbley, M (2005). ‘Beat tracking with a two state model’. In IEEE International Conference on Acoustics, Speech and Signal Processing.

Davis, M (1999). The Philosophy of Poetry: on Aristotle’s poetics. St Augustine’s Press, South Bend, Indiana.


de Marchi, S (2005). ‘Looking for car keys without any street lights’. In Computational and Mathematical Modeling in the Social Sciences. Cambridge University Press, Cambridge, MA.

Dean, R (2003). Hyperimprovisation: Computer-Interactive Sound Improvisation. A-R Editions, Madison, WI.

Dewey, J (1949). Knowing and the Known. The Beacon Press, Boston.

Dewey, J (1973). ‘The practical character of reality’. In McDermott, J, ed., The Philosophy of John Dewey. The University of Chicago Press, Chicago.

Dixon, S (2001). ‘Automatic extraction of tempo and beat from expressive performances’. Journal of New Music Research, 30(1):39–58.

Dixon, S (2006). ‘Onset detection revisited’. In 9th International Conference on Digital Audio Effects, 133–137. Montreal.

Dixon, S (2008). ‘Evaluation of the audio beat tracking system BeatRoot’. Journal of New Music Research, 36:39–50.

Downton, P (2003). Design Research. RMIT University Press, Melbourne.

Drexler, KE (1986). Engines of Creation: the coming era of nanotechnology. Anchor Books, New York.

Durban, P (2006). ‘Philosophy of technology: In search of discourse synthesis’. Techne: Research in Philosophy and Technology, 10(2):1–289.

Eck, D (2007). ‘Beat tracking using an autocorrelation phase matrix’. In Proceedings of the 2007 International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1313–1316. IEEE Signal Processing Society.

Ellis, D (2007). ‘Beat tracking by dynamic programming’. Journal of New Music Research, 36(1):51–60.

Epstein, D (1995). Shaping Time: Music, the Brain, and Performance. Schirmer Books, New York.

Flanagan, JL & Golden, RM (1966). ‘Phase vocoder’. Bell System Tech Journal, 45:1493–1509.

Flanagan, P (2008). ‘Quantifying metrical ambiguity’. In ISMIR 2008. Philadelphia.

Flyvbjerg, B (2001). Making Social Science Matter: Why social inquiry fails and how it can succeed again. Cambridge University Press, Cambridge, UK.

Fodor, J & Pylyshyn, Z (1981). ‘How direct is visual perception? Some reflections on Gibson’s “ecological approach”’. Cognition, 9:139–196.

Forte, A (1973). The Structure of Atonal Music. Yale University Press, New Haven.

Frielick, S (2004). ‘Beyond constructivism: An ecological approach to e-learning’. In Atkinson, R, McBeath, C, Jonas-Dwyer, D & Phillips, R, eds., Beyond the Comfort Zone: Proceedings of the 21st ASCILITE Conference, 328–332. Perth.


Goebl, W & Parncutt, R (2001). ‘Perception of onset asynchronies: Acoustic piano versus synthesised complex versus pure tones’. In Meeting of the Society for Music Perception and Cognition, 2001.

Gibson, JJ (1957). ‘Survival in a world of probable objects’. Contemporary Psychology, 2(2):33–35.

Gibson, JJ (1972). ‘A theory of direct visual perception’. In Royce, JR & Rozeboom, WW, eds., The Psychology of Knowing, 215–240. Gordon and Breach, New York.

Gibson, JJ (1979). The Ecological Approach to Visual Perception. Lawrence Erlbaum Associates, Hillsdale, NJ.

Goldman, SL (2004). ‘Why we need a philosophy of engineering: A work in progress’. Interdisciplinary Science Reviews, 29(2):163–176.

Goto, M (2001). ‘An audio-based real-time beat tracking system for music with or without drum-sounds’. Journal of New Music Research, 30(2):159–171.

Goto, M (2004). ‘A real-time music scene description system: predominant-f0 estimation for detecting melody and bass lines in real-world signals’. Speech Communications, 43:311–329.

Goto, M (2006). ‘Analysis of musical audio signals’. In Wang, D & Brown, GJ, eds., Computational Auditory Scene Analysis. John Wiley & Sons, Hoboken, NJ.

Goto, M & Muraoka, Y (1995). ‘Music understanding at the beat level: Real-time beat tracking for audio signals’. In Rosenthal, DF & Okuno, H, eds., Computational Auditory Scene Analysis, 157–176. Lawrence Erlbaum Associates, New Jersey.

Gouyon, F & Dixon, S (2005). ‘A review of automatic rhythm description systems’. Computer Music Journal, 29(1):34–54.

Gouyon, F, Herrera, P & Cano, P (2002). ‘Pulse dependent analyses of percussive music’. In AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, July 2002, 396–401. Espoo, Finland.

Gray, C (1998). ‘Inquiry through practice: developing appropriate research strategies’. In No Guru, No Method? Discussion of Art and Design Research. UIAH, University of Art and Design, Helsinki.

Gray, C & Malins, J (2004). Visualising Research: A guide to the Research Process in Art and Design. Ashgate Publishing, Aldershot, England.

Greenwood, D & Levin, M (2005). ‘Reform of the social sciences, and of universities through action research’. In The SAGE Handbook of Qualitative Research. Sage Publications, Thousand Oaks, CA.

Hainsworth, S (2004). Techniques for the Automated Analysis of Musical Audio. Ph.D. thesis, Cambridge.

Hainsworth, S & McLeod, M (2003). ‘Onset detection in musical audio signals’. In International Computer Music Conference, 2003, 163–166. International Computer Music Association, San Francisco.

Hall, P & Morton, S (2004). ‘On the estimation of entropy’. Annals of the Institute of Statistical Mathematics, 45(1).

Hamanaka, M, Hirata, K & Tojo, S (2006). ‘Implementing a generative theory of tonal music’. Journal of New Music Research, 35(4):249–277.


Hamilton, J & Jaaniste, L (2009). ‘Content, structure and orientations of the practice-led exegesis’. In Art.Media.Design: Writing Intersections. Swinburne University, Melbourne.

Hammond, KR & Stewart, TR (2001). The Essential Brunswik. Oxford University Press, Oxford.

Harries-Jones, P (1995). A Recursive Vision: ecological understanding and Gregory Bateson. University of Toronto Press, Toronto.

Harvey, A (1990). The Econometric Analysis of Time Series. MIT Press, Cambridge, MA.

Hawkins, J & Blakeslee, S (2004). On Intelligence. Times Books, New York.

Herschbach, D (1995). ‘Technology as knowledge: Implications for instruction’. Journal of Technology Education, 7(1):31–42.

Hickman, L (1992). John Dewey’s Pragmatic Technology. Indiana University Press, Bloomington.

Hirsch, IJ (1959). ‘Auditory perception of temporal order’. Journal of the Acoustical Society of America, 31(6):759–767.

Honing, H & Haas, WB (2008). ‘Swing once more: Relating timing and tempo in expert jazz drumming’. Music Perception, 25(5):471–476.

Honneger, M, ed. (1976). Science de la Musique: Formes, Technique, Instruments. Bordas, Paris.

Hood, M (1971). The Ethnomusicologist. McGraw Hill, New York.

Huang, N, Shen, Z, Long, S, Wu, M, Shih, H & Zheng, Q (1998). ‘The empirical mode decomposition and the Hilbert spectrum for nonlinear and nonstationary time series analysis’. Proceedings of the Royal Society of London A, 454:903–995.

Huron, D (2006). Sweet Anticipation. MIT Press, Cambridge, MA.

Ihnatowicz, E (2011). ‘Edward Ihnatowicz’. Accessed 31st March 2011. http://www.senster.com/ihnatowicz/index.htm

Innis, R (2003). ‘The meanings of technology’. Techne: Research in Philosophy and Technology, 7(1):49–58.

Jackendoff, R (1992). Languages of the Mind. MIT Press, Cambridge, MA.

James, W (1890). The Principles of Psychology. Holt, Boston.

Jones, MR (1987). ‘Dynamic patterns structure in music: recent theory and research’. Perception and Psychophysics, 41:621–634.

Jones, MR & Boltz, M (1989). ‘Dynamic attending and responses to time’. Psychological Review, 96(3):459–491.

Kawahara, H, Katayose, H, de Cheveigné, A & Patterson, RD (1999). ‘Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of f0 and periodicity’. In European Conference on Speech, Communication and Technology, 2781–2784. Budapest.


Kandel, ER, Schwartz, JH & Jessell, TM (2000). Principles of Neural Science. McGraw Hill, New York.

Kant, I (1964). The Critique of Pure Reason. McMillan, London.

Kauffman, R (1980). ‘African rhythm: A reassessment’. Ethnomusicology, 24(3):393–415.

Keller, R, Morrison, D, Jones, S, Thom, B & Wolin, A (2006). ‘A computational framework enhancing jazz creativity’. In Third Workshop on Computational Creativity. Riva del Garda, Italy.

Keyson, D & Alonso, M (2009). ‘Empirical research through design’. Design, 9:4548–4557.

Kirlik, A (2001). ‘On Gibson’s review of Brunswik’. In Hammond, KR & Stewart, TR, eds., The Essential Brunswik. Oxford University Press, Oxford.

Kitano, H (1993). ‘Challenges of massive parallelism’. In IJCAI-93, 813–834. Chambery, France.

Kivy, P (2002). Introduction to a Philosophy of Music. Oxford University Press, Oxford.

Klapuri, A (1999). ‘Sound onset detection by applying psychoacoustic knowledge’. In IEEE International Conference on Acoustics, Speech and Signal Processing, 115–118. Phoenix, AZ.

Klapuri, A, Eronen, A & Astola, J (2006). ‘Analysis of the meter of acoustic musical signals’. IEEE Transactions on Audio, Speech and Language Processing, 14(1):342–355.

Krebs, H (1987). ‘Some extensions of the concepts of metrical consonance and dissonance’. Journal of Music Theory, 31(1):99–120.

Large, EW (1994). Dynamic Representation of Musical Structure. Ph.D. thesis, Ohio State University.

Large, EW & Kolen, JF (1994). ‘Resonance and the perception of musical meter’. Connection Science, 6(1):177–208.

Laroche, J (2003). ‘Efficient tempo and beat tracking in audio recording’. Journal of the Audio Engineering Society, 51(4):226–233.

Lerdahl, F & Jackendoff, R (1983). A Generative Theory of Tonal Music. MIT Press, Cambridge, MA.

Lerdahl, F & Jackendoff, R (1993). ‘An overview of hierarchical structure in music’. In Machine Models of Music. MIT Press, Cambridge, MA.

Levi-Strauss, C (1962). The Savage Mind. Chicago University Press, Chicago.

Lindsay, K & Nordquist, P (2007). ‘Pulse and swing: Quantitative analysis of hierarchical structure in swing rhythm’. Journal of the Acoustical Society of America, 122(5):2945.

Livingstone, S (2008). Changing Musical Emotion through Score and Performance with a Computational Rule System. Ph.D. thesis, University of Queensland.

London, J (2004). Hearing in Time. Oxford University Press, Oxford.

Long, J (2002). ‘Who’s a pragmatist: Distinguishing epistemic pragmatism and contextualism’. The Journal of Speculative Philosophy, 1:39–49.


Marshall, P, Kelder, JA & Perry, A (2005). ‘Social constructionism with a twist of pragmatism: A suitable cocktail for information systems research’. In Campbell, B, Underwood, J & Bunker, D, eds., 16th Australasian Conference on Information Systems. Australasian Chapter of the Association for Information Systems, Sydney.

Masri, P (1996). Computer Modeling of Sound for Transformation and Synthesis of Musical Signal. Ph.D. thesis, University of Bristol, Bristol, UK.

Masri, P & Bateman, A (1996). ‘Improved modelling of attack transients in music analysis-resynthesis’. In International Computer Music Conference, 1996. ICMA, San Francisco.

Mateas, M (1999). ‘An Oz-centric review of interactive drama and believable agents’. In Wooldridge, M & Veloso, M, eds., Lecture Notes in Artificial Intelligence, vol. 1600, 297–329. Springer-Verlag.

Mavromatis, P (2005). ‘A hidden Markov model of melody in Greek church chant’. Computing in Musicology, 14:93–112.

McAuley, RJ & Quatieri, T (1986). ‘Speech analysis/synthesis based on a sinusoidal representation’. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(4).

McCarthy, N (2006). ‘Philosophy in the making’. Ingenia, 26:47–51.

McIlwain, P, McCormack, J, Lane, A & Dorin, A (2007). ‘Generative composition with Nodal’. In Miranda, E, ed., Workshop in Music and Artificial Life, 2007. Lisbon.

Meyer, L (1956). Emotion and Meaning in Music. Chicago University Press, Chicago.

Milivojevic, Z, Mirkovic, M & Milivojevic, S (2006). ‘An estimate of the fundamental frequency using PCC interpolation – comparative analysis’. Information Technology and Control, 35(2):131–136.

Miller, G (1956). ‘The magical number seven, plus or minus two: Some limits on our capacity for processing information’. Psychological Review, 63:81–97.

Miller, KS (1973). ‘Complex linear least squares’. SIAM Review, 15(4):706–726.

Minsky, M (1987). The Society of Mind. Pan Books, London.

MIREX (2006). ‘Music information retrieval evaluation exchange’. Accessed 19th April 2010. http://www.music-ir.org/mirexwiki/index.php

MIREX (2011). ‘Audio beat tracking’. Accessed 3rd March 2011. http://www.music-ir.org/mirex/wiki/2006:Audio_Beat_Tracking

Moelants, D (1999). ‘Perceptual analysis of “aksak” meters’. In New Techniques in Ethnomusicology, 3–26.

Narmour, E (1977). Beyond Schenkerism. Chicago University Press, Chicago.

Narmour, E (1990). The Analysis and Cognition of Basic Musical Structures. The University of Chicago Press, Chicago.

Newman, A (1995). Bach and the Baroque. Pendragon Press, Hillsdale, NJ.


Nielsen, J (1993). ‘Iterative user-interface design’. Computer, 26(11).

Nodal (2011). ‘Nodal’. Accessed 3rd April 2011. http://www.csse.monash.edu.au/~cema/nodal/index.html

Noe, A (2004). Action In Perception. MIT Press, Cambridge, MA.

Noll, M (1969). ‘Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and maximum likelihood estimate’. In Symposium on Computer Processing in Communications, 779–797. Brooklyn.

Norman, D (1990). The Design of Everyday Things. Doubleday, New York.

Nunn, T (1998). ‘Wisdom of the impulse: On the nature of musical free improvisation’. Accessed 31st March 2011. http://www20.brinkster.com/improarchive/tn_wisdom_part1.pdf

Pachet, F (2002). ‘Interacting with a musical learning system: the Continuator’. In Music and Artificial Intelligence, 103–108. Springer, Berlin.

Pachet, F (2006). ‘Enhancing individual creativity with interactive musical reflective systems’. In Wiggins, G & Deliege, I, eds., Musical Creativity: Current Research in Theory and Practice, 359–375. Psychology Press, London.

Parncutt, R (1994). ‘A perceptual model of pulse salience and metrical accent in musical rhythms’. Music Perception, 11(4):409–464.

Pearce, MT & Wiggins, GA (2006). ‘Expectation in melody: the influence of context and learning’. Music Perception, 23(5):377–405.

Pearce, MT & Wiggins, GA (2007). ‘Evaluating cognitive models of musical composition’. In Cardoso, A & Wiggins, G, eds., 4th International Joint Workshop on Computational Creativity.

Penny, S (2011). ‘Petit mal’. Accessed 31st March 2011. http://ace.uci.edu/penny/works/petitmal.html

Pennycook, B, Stammen, DR & Reynolds, D (1993). ‘Toward a computer model of a jazz improviser’. In International Computer Music Conference, 1993, 228–231. ICMA, San Francisco.

Plumbley, MD, Abdallah, SA, Bello, JP, Davies, ME, Monti, G & Sandler, MB (2002). ‘Automatic music transcription and audio source separation’. Cybernetics and Systems, 33(6):603–627.

Polanyi, M (1974). Personal Knowledge: Towards a post-critical philosophy. The University of Chicago Press, Chicago.

Povel, DJ & Essens, P (1985). ‘Perception of temporal patterns’. Music Perception, 2(4):411–440.

Puckette, M, Apel, T & Zicarelli, D (1998). ‘Real-time audio analysis tools for Pd and MSP’. In International Computer Music Conference, 1998. ICMA, San Francisco.

QUT (2011). ‘Practice-led research’. Accessed 29th March 2011. http://www.creativeindustries.qut.edu.au/research/practice-led-research/


Rabiner, LR, Schafer, WR & Rader, CM (1972). ‘The Chirp-Z transform’. In Digital Signal Processing. IEEE Press.

Rasch, R & Plomp, R (1982). ‘The perception of musical tones’. In Deutsch, D, ed., The Psychology of Music, 89–112. Academic Press, 2nd edn.

Reynolds, S (2006). ‘Post-rock’. In Cox, C & Warner, D, eds., Audio Culture: Readings in Modern Music, 358–361. Continuum, New York.

Robertson, A & Plumbley, M (2007). ‘B-Keeper: a beat-tracker for live performance’. In New Interfaces for Musical Expression, 2007, 234–237. ACM Press, New York.

Rolland, P & Ganascia, J (2000). ‘Musical pattern matching and similarity assessment’. In Miranda, E, ed., Readings in Music and Artificial Intelligence. Harwood Academic Publishers, Amsterdam.

Rowe, R (1993). Interactive Music Systems. MIT Press, Cambridge, MA.

Rowe, R (2001). Machine Musicianship. MIT Press, Cambridge, MA.

Russell, SJ & Norvig, P (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.

Ryle, G (1949). The Concept of Mind. The University of Chicago Press, Chicago.

Salzer, F (1962). Structural Hearing: Tonal Coherence in Music. Dover, New York.

Scheirer, E (1997). ‘Pulse tracking with a pitch tracker’. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1997. Mohonk, NY.

Scheirer, E (1998). ‘Tempo and beat analysis of acoustical musical signals’. Journal of the Acoustical Society of America, 103(1):588–601.

Scheirer, ED (1996). ‘Bregman’s chimerae: Music perception as auditory scene analysis’. In Proceedings of the International Conference on Music Perception and Cognition. ICMA, San Francisco.

Schenker, H (1980). Harmony. Chicago University Press, Chicago.

Shieber, SM (2009). ‘Lessons from a restricted Turing test’. Accessed June 10th 2009. http://www.eecs.harvard.edu/shieber/Biblio/Papers/loebner-rev-html/loebner-rev-html.html#turingmind

Schon, D (1995). Reflective Practitioner: how professionals think in action. Arena, Aldershot, England.

Serra, X (1997). ‘Musical sound modeling with sinusoids plus noise’. In Musical Signal Processing. Swets and Zeitlinger.

Shannon, C (1948). ‘A mathematical theory of communication’. Bell System Tech Journal, 27:379–423.

Simon, H (1969). The Sciences of the Artificial. MIT Press, Cambridge, MA.

Sloboda, J (1988). Generative Processes in Music: The Psychology of Performance, Improvisation and Composition. Oxford University Press, Oxford.


Smith, JO (2003). Mathematics of the Discrete Fourier Transform (DFT) with Music and Audio Applications. W3K Publishing.

Sorensen, A (2005). ‘An interactive programming environment for composition and performance’. In Australasian Computer Music Conference, 2005. ACMA, Brisbane.

Sorensen, A (2010). ‘Oscillating rhythms’. Accessed 9th May 2010. http://www.acid.net.au/index.php?option=com_content&task=view&id=99

Sorensen, A & Brown, A (2008). ‘A computational model for the generation of orchestral music in symphonic tradition: A progress report’. In Sound : Space – The Australasian Computer Music Conference, 78–84. Sydney.

Stock, C (2010). ‘Aesthetic tensions: Evaluating outcomes for practice-led research and industry’. TEXT Journal, 8.

Stowell, D, Robertson, A, Bryan-Kinns, N & Plumbley, MD (2009). ‘Evaluation of live human-computer music-making: quantitative and qualitative approaches’. International Journal of Human-Computer Studies, 67(11):960–975.

Temperley, D (2001). The Cognition of Basic Musical Structures. MIT Press, Cambridge, MA.

Temperley, D (2007). Music and Probability. MIT Press, Cambridge, MA.

Temperley, D & Sleator, D (2010). ‘Melisma stochastic melody generator’. Accessed 9th May 2010. http://www.link.cs.cmu.edu/melody-generator/

Thom, B (2000a). ‘Bob: An interactive improvisational music companion’. In Fourth International Conference on Autonomous Agents, 309–316. ACM Press, Barcelona.

Thom, B (2000b). ‘Unsupervised learning and interaction in jazz/blues improvisation’. In AAAI-2000. Austin, TX.

Thornton, C (2009). ‘Hierarchical Markov modelling for generative music’. In International Computer Music Conference, 2009. ICMA, San Francisco.

Thornton, C (2011). ‘Reconstituted music examples’. Accessed 31st March 2011. http://www.christhornton.eu/demos/reconstituted-music-examples.html

Toiviainen, P & Snyder, J (2000). ‘The time course of pulse sensation: dynamics of beat induction’. In Sixth International Conference on Music Perception and Cognition. Keele, UK.

Turing, AM (1950). ‘Computing machinery and intelligence’. Mind, 59(236):433–460.

van Lamsweerde, A (2004). ‘Goal-oriented requirements engineering: a roundtrip from research to practice’. In 12th International Requirements Engineering Conference, 4–7. Kyoto, Japan.

Varela, F & Maturana, H (1980). Autopoiesis and Cognition: the Realisation of the Living. D. Reidel, Dordrecht, Holland.

Vercoe, B (1994). ‘Perceptually-based music pattern recognition and response’. In Third International Conference for the Perception and Cognition of Music, 59–60. Liege, Belgium.


Wessel, D (1991). ‘Instruments that learn, refined controllers, and source model loudspeakers’. Computer Music Journal, 15(4):82–86.

Wharburton, D (1988). ‘A working terminology for minimal music’. Music Theory Spectrum, 2:135–159.

Winograd, T (2006). ‘Shifting viewpoints: artificial intelligence and human-computer interaction’. Artificial Intelligence, 170:1256–1258.

Yeston, M (1976). The Stratification of Musical Rhythm. Yale University Press, New Haven.

Zicarelli, D (1987). ‘M and Jam Factory’. Computer Music Journal, 11(4):13–29.

Zimmerman, E (2003). ‘Play as research’. In Design Research. MIT Press, Cambridge, MA.

Zimmerman, J, Forlizzi, J & Evenson, S (2007). ‘Research through design as a method for interaction design research in HCI’. In SIGCHI Conference on Human Factors in Computing Systems, CHI ’07, 493–502. ACM, New York.
