Voice-led interactive exploration of audio
Rishi Shukla
November 2019
BBC Research & Development
bbc.co.uk/rd

CONTENTS

EXECUTIVE SUMMARY
ACKNOWLEDGEMENTS
INTRODUCTION
    Digital audio and smart speakers
    Current limitations of voice-led audio interaction
    Future possibilities for voice-led audio interaction
    Existing BBC R&D work related to these future possibilities
RATIONALE AND DESIGN DECISIONS
    Aims and design scope
    System architecture
    Voice interaction design
IMPLEMENTATION AND ENGINEERING
    Prototype realisation
    Sound design
    Binaural spatial audio rendering
    Smart technology simulation
USER RESEARCH DESIGN
    Participant recruitment
    Study format
    Data collection
FINDINGS
    Patterns of use for voice-led content discovery
    Specific profile differences for voice-led content discovery
    Smart headphones and smart speaker – behaviour and usability ratings
    Smart headphones and smart speaker – connectedness
CONCLUSIONS AND RECOMMENDATIONS
    Review of research aims
    Future work
APPENDIX
    References
    Mental model drawings


Executive summary

• The dominance of digital audio distribution and rapid growth in smart speaker technology present new challenges for interactive exploration of audio-only media.

• Smart headphones are an emergent technology whose unique affordances for accessing, interacting with and displaying content are only starting to be explored.

• This research project developed the Auditory Archive Explorer, a working prototype of a voice-led interactive application for navigating the BBC's Glastonbury 2019 highlights on smart speakers and smart headphones.

• The voice-led interaction design and two prototype versions were evaluated using a mixed-methods user experience research approach. The two iterations featured identical content but alternate modes of audio presentation: mono playback in the case of the smart speaker, and full binaural synthesis (i.e. static placement of sound sources in virtual 3D space using head-tracking) in the smart headphone experience.

• Twenty-two people were selected to participate in the research study after an open call to register interest.

• Analysis of user interaction showed strong evidence that the design supports effective browsing and discovery of content, with a learning curve that was quickly surmounted without guidance, irrespective of prior exposure to voice technologies or first language. These quantitative patterns were also supported by participants' self-reported qualitative ratings on usability. It was notable, however, that frequent smart assistant users were significantly more favourable about the system's capacity to support content discovery than infrequent users, despite these two groups showing no notable differences in their patterns of interaction or activity. This indicates that the merits of the prototype were more evident to those familiar with the limitations of current voice interaction design.

• No significant differences were found in the pattern of interactions, activity or recall of core system features between users of the two prototype versions, indicating that the binaural spatial design did not present any practical benefits in this instance. Two potential limiting factors in the study design could, however, have prevented differences from emerging clearly: the interaction delay inherent in voice command input and the short task duration. Nevertheless, analysis of mental model representations suggests a markedly different character in headphone users' interaction and connection with the content.


Acknowledgements

This research was conducted within the Internet Research and Future Services section of BBC R&D. Many minds contributed towards its development, but the project benefitted from the specific contributions of the following people: Nicky Birch, for conceiving and commissioning the research and steering its conceptualisation and focus; Alex Robertson, for guidance on decisions around content and its presentation, including production and editing of all music and speech featured in the prototype; and Holly Challenger and Joanna Rusznica, for significant work on the design, recruitment, management and analysis of the user experience research. BBC R&D also extends thanks and gratitude to CereProc for kind permission to use its synthetic voice technology in the prototype for this research.


Introduction

This research project was motivated by recent trends in digital audio consumption through smart technology. An overview of these factors is presented here to provide context for the research.

Digital audio and smart speakers

Digital distribution has accounted for more than 50% of UK radio engagement since 2016 – either online or via digital audio broadcasting receivers [1]. In the same year, over half of global recorded music revenue was generated by download or streaming services for the first time. The current annual proportion of digital streaming and download revenue stands at 59% [2], [3]. In parallel, smart speaker technology has quickly gained a niche but significant footing in the audio technology market, with an estimated 20% of UK households adopting devices between 2016 and 2019 [4]. In summary, digital delivery is now firmly established as the dominant mode of audio distribution. Smart speakers are a fresh and immediate gateway to any form of content within this realm, incorporating convenient access to live radio, music streaming services and podcast or catch-up programming.

Although smart speakers offer ready access to all forms of digital audio, current trends indicate that the proportion of their usage directed towards on-demand content is comparatively high when examined against the overall distribution of listening activity. If all audio devices and sources are considered collectively, 75% of consumption in 2018 was dedicated to live radio. By comparison, smart speaker engagement shows a relatively balanced combination of live audio broadcasting (54%) and request-led listening (45%) [1].


Current limitations of voice-led audio interaction

Discovering unfamiliar content on audio-only devices is problematic because it relies entirely on a user's memory and current mood to motivate listening activity; there are no external triggers to direct exploration. For on-demand content in particular, voice-led search technologies encourage either known-item requests (i.e. "play me this track/album/artist") or unscrutinised use of provider-generated collections (i.e. "play me happy/newly released/1980s tracks"). Recent UK music industry analysis of adoption noted concerns that smart speakers could encourage listeners towards less engaged forms of interaction with audio: as acceptance of pre-curated or algorithmically generated recommendations and playlists increases, listeners might become disconnected from the individual works and artists that form these compilations. Two answers to this possible shortcoming are offered. Firstly, it is supposed that use of "branded" recommendations or playlists with smart speakers caters to the types of casual listeners who, in the past, typically used radio as background anyway. Secondly, it is suggested that these devices are simply not designed for discovery and exploration, which will continue through other media, and that user data gained there can be used to populate tailored recommendation lists more effectively [5].

Future possibilities for voice-led audio interaction

Classing voice-led audio interaction as a reductive or secondary experience would ignore three potential opportunities encompassed by this technology:

1. As the same industry report goes on to note, speech-delivered search invites the possibility of more verbose and nuanced queries. These could be harnessed to achieve better recommendations if semantic analysis algorithms and metadata structures are suitably designed in future [5].


2. Interaction design for voice-led technology is still in its infancy. More consideration can be given to how pre-curated or auto-compiled content might be previewed and navigated efficiently with voice-controlled, audio-only devices.

3. Voice-controlled devices are not limited to smart speakers. Smart headphones or "hearables" are an emergent technology that (amongst other features) enable voice interaction for eyes-free control of a connected smart device. Many manufacturers now offer headphones or earbuds with inbuilt voice interaction capability and, increasingly, some form of layered or augmented reality audio, which blends transmitted audio with real-world sounds using adaptive ambient noise cancellation.¹ Specifically, the Bose AR platform further supports motion detection in enabled hardware, offering the potential to deliver responsive binaural audio – or surround sound over headphones.²

Existing BBC R&D work related to these future possibilities

The first of the three opportunities identified above is a vast field of research, not simply within BBC R&D, but for the media industry and academic research institutions worldwide. The relationship between recommender systems and voice assistants was far beyond the scope of investigation for this research project. The BBC has been working prominently in the second area of potential, releasing the first of several services for smart speakers in 2017³ and aiming to launch its own assistant in 2020.⁴ These offerings for current audiences have been accompanied in parallel by BBC R&D's creative investigation of interaction and user experience design for voice content, spanning multiple projects over two years [6]–[8]. Finally, BBC R&D has also begun early experimentation with the third point of innovation, using Bose AR to explore how audio augmented reality could redefine programming and audience experiences.⁵

Creative approaches to voice experience design and the affordances of smart headphones both informed the objectives and design process for this research project.

¹ For recent overviews of some available smart headphones at the time of writing, see: https://www.wired.co.uk/article/best-wireless-earbuds; https://www.wired.co.uk/article/best-bluetooth-headphones; https://www.wareable.com/hearables/best-hearables-smart-wireless-earbuds
² https://www.bose.com/en_us/better_with_bose/augmented_reality.html
³ https://www.bbc.co.uk/mediacentre/latestnews/2017/smart-speakers
⁴ https://www.bbc.co.uk/news/technology-49481210
⁵ https://www.bbc.co.uk/rd/blog/2019-03-audio-augmented-reality-spatial-sound


Rationale and design decisions

The project sought to research novel interaction mechanisms for surfacing BBC content on voice-controlled devices. This section outlines the specific objectives and design principles that informed the process.

Aims and design scope

Two research aims were established:

1. Develop and evaluate an interactive prototype for navigation of audio content by voice.

2. Compare smart headphone and smart speaker engagement in terms of:
   • exploration behaviour;
   • connectedness to content.

Additionally, two key design principles determined the prototype development. Firstly, it was to be populated with a defined set of existing BBC content arranged in segmented form, rather than full-length programming. Use of short-form audio was necessary to support the time-constrained, lab-based interactions required in the planned user experience research. (Creative applications of segmented content are also a significant focus of BBC R&D's current workstream on Object-Based Media.⁶) Secondly, the navigation design had to include seamless previewing of content to enable rapid, catch-up-style exploration, with a significant degree of agency devolved to the user.

⁶ https://www.bbc.co.uk/rd/object-based-media


System architecture

The prototype that resulted was not conceived as a completist experience with tractable pathways to each piece of featured content. To be effective, it had to avoid complex layers of menus and navigation, and instead encourage the engrossing onward journeys commonly associated with discovery platforms like YouTube. YouTube-style engagement was a conscious conceptual influence, since the platform currently accounts for 47% of all on-demand music streaming [3]. The resulting design was termed the Auditory Archive Explorer (AAE) and is represented visually in figure 1.

Figure 1: Auditory Archive Explorer overview

The prototype is populated with 50 highlights from the BBC's coverage of the Glastonbury 2019 festival. Each of the 50 segments is categorised into one of five moods: 'happy', 'sad', 'energetic', 'mellow' or 'dark'. Users hear a 30-second introduction using synthesised speech generated from one of CereProc's commercially available voices.⁷

⁷ https://www.cereproc.com/en/node/1166


This opening summarises the navigation concept and presents the voice commands "start", "select", "forward", "back" and "change". The latter four commands are used both to navigate through menus and to control playback of audio tracks once selected, as shown in figure 1. The 50 audio pieces are accessed via two menu layers (to select a mood and then a track). A range of notification sounds serve different purposes, including feedback confirmation for voice command recognition, delineation of menu items, and signposting of navigation transitions. The same synthesised voice adds short announcements for further contextual orientation. Menu options are represented by eight-second auditory previews of content: either short montages with a voiceover (for the five mood categories) or a representative excerpt from a given piece (in the track menu). Menus only feature five items, which repeat on a loop, but the track menu offers a further "refresh" option to repopulate the content. When a track is selected, its title and artist are announced and segue over the start of the track at the beginning of playback.
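The two-layer menu traversal just described can be sketched as a small state machine. The following is a hypothetical Python sketch only: the actual prototype was implemented in Max/Reaper, and names such as `MOODS` and `AAENavigator`, along with the exact semantics of "change", are illustrative inferences from the text rather than the real implementation.

```python
# Hypothetical sketch of the AAE two-layer menu navigation.
# States: mood_menu -> track_menu -> playback; menus loop round.

MOODS = ["happy", "sad", "energetic", "mellow", "dark"]

class AAENavigator:
    def __init__(self, tracks_by_mood):
        self.tracks_by_mood = tracks_by_mood   # mood -> list of track titles
        self.state = "mood_menu"               # mood_menu | track_menu | playback
        self.index = 0                         # currently highlighted menu item
        self.mood = None

    def _menu(self):
        if self.state == "mood_menu":
            return MOODS
        # Track menu shows five tracks plus a "refresh" option to repopulate.
        return self.tracks_by_mood[self.mood][:5] + ["refresh"]

    def command(self, word):
        menu = self._menu()
        if word == "forward":                  # menu items repeat on a loop
            self.index = (self.index + 1) % len(menu)
        elif word == "back":
            self.index = (self.index - 1) % len(menu)
        elif word == "select":
            if self.state == "mood_menu":
                self.mood, self.state, self.index = menu[self.index], "track_menu", 0
            elif self.state == "track_menu" and menu[self.index] != "refresh":
                self.state = "playback"        # "refresh" would repopulate instead
        elif word == "change":                 # return to the previous context
            self.state = "track_menu" if self.state == "playback" else "mood_menu"
            self.index = 0
        # (In the real prototype, forward/back also control playback once a
        # track is selected; that dual role is omitted from this sketch.)
        menu = self._menu()                    # menu level may have changed
        item = None if self.state == "playback" else menu[self.index]
        return self.state, item
```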

Voice interaction design

It should be noted at this stage that only limited consideration was given to the voice interaction design. The focus of the research was to evaluate the audio-only mode of displaying and navigating content, not the efficacy of the voice control mechanics per se. Given this premise, the four navigation voice commands were defined, as far as possible, to:

• be succinct and clear for the purpose of recognition;
• echo functions familiar from visual interface paradigms (e.g. "select") and media player controllers (e.g. "forward" and "back");
• provide semantic coherence with their dual-purpose functions for navigating menus and controlling audio playback.


The inclusion of the "change" command is evidently less intuitive, since it meets these criteria less well and its function is too easily confused with "back" in the menu navigation context. It would also have been preferable to limit the voice navigation commands to just three, which is a common approach found in media playing systems. In these cases, it is quite typical for "back" to return to a previous context if triggered within the first three seconds of playback. "Change" was included with these drawbacks and inconsistencies fully in mind, as a means of enabling the research construct in the absence of an immediate and more elegant solution.
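The common three-command convention mentioned above, where "back" is time-dependent, can be made concrete with a minimal sketch (hypothetical Python; the three-second threshold is taken from the text, and the returned action names are invented for illustration):

```python
# Sketch of the common media-player convention described above: "back"
# returns to the previous context only when triggered early in playback,
# and otherwise restarts the current track.

BACK_THRESHOLD_S = 3.0  # the three-second window quoted in the text

def handle_back(elapsed_s):
    """Decide what 'back' does, given seconds of playback elapsed."""
    if elapsed_s < BACK_THRESHOLD_S:
        return "previous_context"   # e.g. return to the track menu
    return "restart_track"          # otherwise restart from the top
```

Under this convention, "back" can carry both roles and the fourth "change" command becomes unnecessary, which is why the text calls the three-command approach preferable.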


Implementation and engineering

Existing technologies and hardware were used to make a fully interactive prototype that simulated both smart headphone and smart speaker engagements. This section details some of the software and sound engineering decisions that created the experience.

Prototype realisation

Four pieces of software were combined to produce the AAE prototype. All system code was implemented in the Max visual programming language, with audio content and mixing handled in the Reaper digital audio workstation:

Figure 2: Auditory Archive Explorer software architecture⁸

• SpeakOSC: leverages the Mac OSX 'Dictation' function to convert recognised voice commands to OSC messages.
• Max: interprets and sends OSC (and head-tracking data) to control navigation (and 3D binaural scene rendering).
• Reaper: acts as a database for all audio files (music excerpts, speech, notifications) and as a playback engine.
• Head-tracking data (for the smart headphone implementation): 3DoF head rotation co-ordinates transmitted wirelessly.

⁸ https://github.com/dlublin/SpeakOSC; https://cycling74.com/; https://www.reaper.fm/; https://edtrackerpro.mybigcommerce.com/
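The control flow between these components can be illustrated with a minimal dispatcher. This is a hypothetical Python sketch: the OSC address patterns below are invented for illustration and are not the messages SpeakOSC or the Max patch actually use, and a real sender would transmit over UDP rather than returning tuples.

```python
# Hypothetical illustration of the voice-command -> OSC control flow.
# Address patterns like "/aae/command" are invented for this sketch.

RECOGNISED = {"start", "select", "forward", "back", "change"}

def to_osc(word):
    """Map a recognised voice command to an OSC-style (address, args) pair."""
    word = word.lower().strip()
    if word not in RECOGNISED:
        return None                      # unrecognised speech is ignored
    return ("/aae/command", [word])

def head_pose_to_osc(yaw, pitch, roll):
    """Package 3DoF head-rotation data (degrees) for the binaural renderer."""
    return ("/aae/head", [float(yaw), float(pitch), float(roll)])
```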


Sound design

Spatial placement of sound sources in remote conferencing has been shown not only to be a preferred format for presenting audio information, but also to benefit memory, focal assurance (assignment of identity to location) and comprehension [9]. Binaural signal processing enables 3D sound placement over headphones – an effect that can be enhanced by incorporating head tracking to simulate static virtual scenes.⁹ In contrast, smart speakers typically provide mono playback of content. On these devices, sound sources usually have no spatial separation, except through a combination of volume level and applied reverberation for a limited impression of relative distance.

The linear arrangement of content for the headphone and speaker simulations of AAE was identical, but the spatial sound production differed significantly. Figure 3 illustrates how spatial sound positioning was used to segregate audio information streams in the smart headphone version. In navigation mode (a), menu item previews were spatialised in front of the listener using incremental positions from around nine o'clock to three o'clock (-85°, -30°, 0°, 35° and 85° azimuth). Voice announcements and navigation transition effects were placed in elevated positions, whilst speech recognition sounds used regular stereophonic playback (i.e. were heard from "inside" the listener's head). In track playback mode (b), speaker positions were simulated binaurally over headphones, which created the impression of "externalised" listening. By contrast, all audio streams in the smart speaker version were co-located at the same point in space: the position of the smart speaker itself.

⁹ There is considerable work available on how binaural perception and synthesis can be exploited in the design of audio information systems; [10] represents a major contribution.
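The head-tracked, "static" scene described above reduces to subtracting the tracked head yaw from each source's fixed azimuth before binaural rendering, so that sources stay put in the room as the head turns. A minimal sketch, using the five menu azimuths from the text; the sign convention (negative azimuth = listener's left) and function names are illustrative, not those of the actual Max implementation:

```python
import math

# Menu-item azimuths from the text (degrees; negative = listener's left).
MENU_AZIMUTHS = [-85, -30, 0, 35, 85]

def world_locked_azimuth(source_az, head_yaw):
    """Azimuth to render, given head yaw (degrees), so the source stays
    fixed in the room rather than rotating with the head."""
    return (source_az - head_yaw + 180) % 360 - 180   # wrap to (-180, 180]

def azimuth_to_unit_vector(az_deg):
    """Horizontal-plane direction vector (x = front, y = left) under the
    convention that positive azimuth lies to the listener's right."""
    az = math.radians(az_deg)
    return (math.cos(az), -math.sin(az))
```

For example, with the head turned 30° to the right, the centre item at 0° should be rendered at -30° (slightly to the left), which is exactly what the wrap-around subtraction yields.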


Figure 3: Sound design for a) navigation and b) audio playback


Binaural spatial audio rendering

Effective binaural rendering is computationally expensive and contingent on a number of considerations. Hardware performance and the complexity of the binaural rendering implementation have an interdependent and fundamental bearing on fidelity and, therefore, spatial realism [10]. Quality of experience is also dependent on the unique anatomic and cognitive makeup of individuals, so tends to be highly subjective [11].

Bose AR is a prominent commercial development platform for authoring smart headphone experiences. The technical constraints in its software and associated hardware were used as a yardstick for optimising the prototype. By design, the AAE binaural implementation therefore:

• runs virtual Third Order Ambisonics in Reaper using AmbiX plugins¹⁰ – equivalent to the most complex 3D audio algorithm available to Bose AR;

• uses the Institute for Electronic Music and Acoustics 24-speaker "Cube" room impulse response set (at 2048-sample length, to include early room reflection data) for real-time binaural rendering – avoiding application of additional reverb, which would be computationally challenging to implement on a mobile platform;

• has 68 milliseconds (ms) of known system latency in DSP buffering, but a total interval likely to be beyond 100 ms – probably below, but approaching, the known round-trip responsiveness of 196-246 ms for Bose AR.¹¹

¹⁰ http://www.matthiaskronlachner.com/?p=2015
¹¹ https://developer.bose.com/guides/bose-ar/end-end-system-latency (requires developer login).
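Two back-of-envelope figures put these bullets in context: a full-sphere Ambisonic signal of order N carries (N+1)² channels, so third order means 16 channels to render; and a 2048-sample impulse response spans roughly 43 ms of room response. The sketch below assumes a 48 kHz sample rate, which the report does not state:

```python
# Back-of-envelope figures for the rendering configuration described above.
# The 48 kHz sample rate is an assumption; the report does not state it.

def ambisonic_channels(order):
    """Channel count for a full-sphere Ambisonic signal: (N + 1)^2."""
    return (order + 1) ** 2

def samples_to_ms(n_samples, sample_rate_hz=48_000):
    """Duration of an audio buffer or impulse response in milliseconds."""
    return 1000 * n_samples / sample_rate_hz

print(ambisonic_channels(3))          # third order -> 16 channels
print(round(samples_to_ms(2048), 1))  # 2048-sample IR -> ~42.7 ms at 48 kHz
```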


In short, the AAE prototype was purposefully designed with a spatial audio resolution and head-movement response time that approximated current or near-future high-street technology.

Smart technology simulation

Likewise, the smart headphone and speaker experiences were simulated with hardware that approximated the capabilities of consumer devices, as illustrated in figure 4. The onboard microphone of a 13-inch 2017 model MacBook Air provided input for voice capture, and all software described ran on the same device. For the headphone version, open-back, wired Sennheiser HD650 headphones were paired with a wireless EdTracker Pro for motion detection. The speaker version was delivered through a Zamakol ZK606 with a wired connection.

Figure 4: Hardware configuration for smart headphone/speaker experiences


User research design

Both versions of the prototype were evaluated in a user experience testing lab at the BBC Broadcast Centre, London, using the processes described in this section.

Participant recruitment

Twenty-two participants were recruited from an open call and selected to achieve the desired balance across the criteria in table 1.

Table 1: Participant profiles recruited for user evaluation

                        Headphones (10)          Speaker (12)
Age                     23-38                    23-34
Gender                  6 female, 4 male         5 female, 7 male
First language          5 English, 5 another     6 English, 6 another
Voice assistant usage   6 infrequent, 4 regular  5 infrequent, 7 regular

Headphone participants either had no previous exposure to binaural audio listening (eight people) or had only experienced the technology on a few occasions (two people).

Study format

Participants undertook an individual one-hour study session, comprising:

Page 21: Voice-led interactive exploration of audiodownloads.bbc.co.uk/rd/workstreams/irfs-misc/bbc-rd... · 2020. 1. 10. · Voice-led interactive exploration of audio Page 5 of 39 Research

Voice-led interactive exploration of audio

Page 21 of 39 Research & Development

1. Pre-task interview (approx. 10 minutes)

A short introductory discussion investigated participants’ existing behaviour in music

exploration and discovery, before they were given a high-level summary of the

prototype concept.

2. Onboarding (approx. 10 minutes)

Some training and supervised use of the system was required before participants

completed the evaluation task. Those undertaking the headphone experience were

exposed to a one-minute audio demonstration. This presented a direct comparison

between spatial positioning in standard stereo playback and sound placement using

head-tracked binaural audio. Those undertaking the speaker experience received a

short tutorial on projecting their voice with sufficient loudness and clarity to be

recognised over system playback, using the keyword “testing”. In either mode,

participants were then given a short time to navigate the system freely. They were

provided with a written prompt reiterating the four navigation commands and

informed only that these enabled interaction with the system menus and controlled

track playback. They were given a minimum of 1m30s (but never more than 2m00s) to

ensure they successfully completed two or more voice commands. This pre-exposure

phase also included the 30-second narrated system introduction:

Welcome to the Glastonbury 2019 audio browser. Browse the performance highlights

to suit your mood, using just your voice. You can use four voice commands to navigate

the browser: “select”, “forward”, “back”, “change”. Say “start” to begin browsing or

wait to hear this information again.


Participants were invited to ask questions after the pre-exposure phase, but no further

instruction or indication was given on how or when commands were to be used, or the

actions they effected.

3. Evaluation task (15 minutes)

Participants were given a copy of the following task, which they also retained for

reference:

You have 15 minutes to explore the archive as far as you can and find six new

tracks that you like. Use the pen and paper provided to make your list as you go.

Make a note of the track names and artists that you choose and anything

particular you liked about each one.

A final opportunity to ask questions was provided before starting. Again, no

guidance was given about how to operate the system or access track and artist

names. When ready, participants were left alone to complete the task with the

prototype, which stopped responding automatically after exactly 15 minutes.

4. Post-task questionnaire (approx. 10 minutes)

Participants were given a short questionnaire to assess the prototype’s usability and

evaluate user experience (discussed in detail in the following section).

5. Post-task interview (approx. 10 minutes)

A short closing discussion explored participants’ responses to interacting with the

prototype in greater depth.


Data collection

Data was collected from participants using four sources:

• real-time logs of their interactions with AAE during the task

• qualitative ratings from the post-task survey evaluation

• a hand-drawn mental model of the system they interacted with

• video recordings of the pre- and post-task interviews
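The interaction logs listed above underpin the activity breakdown reported in the Findings section. As a minimal sketch (the actual log format is not documented in this report, so the event structure and timings below are invented), browse and listen time can be derived from timestamped state-change events:

```python
# Derive browse vs. listen time from timestamped state-change events.
# Event data here is hypothetical, for illustration only.
events = [  # (seconds into task, state entered)
    (0,   "browsing"),
    (210, "listening"),
    (390, "browsing"),
    (700, "listening"),
    (900, "end"),       # task cut off at exactly 15 minutes
]

totals = {"browsing": 0, "listening": 0}
# Each state lasts until the next event's timestamp.
for (t0, state), (t1, _) in zip(events, events[1:]):
    if state in totals:
        totals[state] += t1 - t0

task_len = events[-1][0]
shares = {s: round(100 * d / task_len) for s, d in totals.items()}
```

With real per-participant logs, averaging these shares across the cohort would yield the searching/listening split shown in Figure 5.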


Findings

Data analysis was focussed on the original project aims: evaluating AAE’s voice-led

interactive navigation mechanism and comparing the two smart technology

implementations.

Patterns of use for voice-led content discovery

Figure 5 illustrates the average activity of all 22 participants viewed collectively.

Approximately two thirds of all time was used to browse content and a third was dedicated to listening. This division seems an appropriate balance given the task: the majority of time was spent previewing options, with a significant minority spent checking the full content of tracks, to affirm choices and discover artist/title information.

In their 15 minutes, participants on average previewed half of the available content and

selected six tracks for further listening, but tended to fall short of their assigned target to

note six new tracks they liked. All participants accessed more than one mood category.

All but one participant listened to multiple tracks and 19 listened to five or more. These

patterns of interaction suggest that, as a cohort, users had no real difficulty navigating the system to discover new content and working towards the specified outcome, even if they were unable to fulfil it in the allotted time.


Figure 5: Average activity for all participants

[Pie chart: searching time 64%; listening time 36%. Accompanying table:

                                     Mean    Median
  Previews heard (50 tracks)         24.05   25
  Track listens (50 tracks)          5.64    6
  Responses given (6 track target)   4.27    5  ]

Figure 6: All visits by content area

[Node diagram; shading legend runs from "No participants" to "All participants".]


The range and concentration of exploration presents a similarly positive outlook. Figure 6

shows the combined reach of participant activity during the 15-minute task. Each circular

node represents a location in AAE, spanning from the primary mood menu (1 x large),

through track menus (10 x medium) and individual tracks (50 x small). The value within

circles specifies how many participants visited that point in the system. As expected, visit

counts reduce at deeper locations (outer points in the diagram), since these destinations

are more removed from the starting point of the user. However, more notable is that

activity is fairly well balanced in terms of menu item precedence – that visits are quite

evenly distributed left-to-right, at all levels in the diagram. (Though there was slightly

greater traffic through the ‘happy’ category, this could be ascribed partly to initial trial-

and-error experimentation with voice commands and orientation. It could also be

reasonably supposed that the slightly lower level of traffic through the ‘dark’ route might

be due to the more specialist appeal of this category.) This demonstrates that the system

design supported exploration across all sections of the content, seemingly without any

undue effect from the order in which categories or tracks were presented.
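The hierarchy described above (one mood menu, ten track menus, fifty tracks) and the four voice commands can be sketched as a small state machine. This is an illustrative sketch only: the exact command semantics and the track names below are assumptions, not the prototype's actual implementation, and the catalogue uses the five mood categories the report lists later.

```python
# Hypothetical sketch of the AAE menu hierarchy and four-command navigation.
MOODS = ["happy", "sad", "energetic", "mellow", "dark"]
TRACKS = {m: [f"{m}-track-{i}" for i in range(1, 11)] for m in MOODS}  # 10 per mood

class Browser:
    """Levels: mood menu -> track menu -> playback."""
    def __init__(self):
        self.level = "moods"      # "moods" | "tracks" | "playing"
        self.mood = 0
        self.track = 0

    def options(self):
        return MOODS if self.level == "moods" else TRACKS[MOODS[self.mood]]

    def forward(self):            # advance to the next option at this level
        if self.level == "moods":
            self.mood = (self.mood + 1) % len(MOODS)
        elif self.level == "tracks":
            self.track = (self.track + 1) % len(self.options())

    def back(self):               # return to the previous option
        if self.level == "moods":
            self.mood = (self.mood - 1) % len(MOODS)
        elif self.level == "tracks":
            self.track = (self.track - 1) % len(self.options())

    def select(self):             # descend a level / start playback
        if self.level == "moods":
            self.level, self.track = "tracks", 0
        else:
            self.level = "playing"

    def change(self):             # jump back to the mood menu
        self.level, self.track = "moods", 0
```

Under this mapping, the wrap-around indexing is what would let visits spread evenly across menu positions, consistent with the balanced traffic observed in Figure 6.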

Specific profile differences for voice-led content discovery

Prior experience with voice assistant technology appeared to influence how favourably

users viewed AAE in its capacity for content discovery. Regular users were defined as those who self-reported daily or weekly voice assistant interaction; infrequent users were those who declared their usage to be monthly or never. Regular users were significantly more likely

to provide favourable responses to the statement in figure 7 (Mann-Whitney U-test

p=0.045). However, equivalent disparities were not found between these two groups in

any of the other four qualitative usability ratings (shown separately in figure 10).

Additionally, there was no notable difference in the extent of task completion (i.e.

number of written responses given) between these groups. These patterns suggest,


therefore, that the prototype’s potential to aid music discovery was more evident to

those familiar with (current limitations of) accessing audio content through voice

assistants.
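The between-groups comparison reported above (Mann-Whitney U) can be sketched in pure Python. The Likert ratings below are invented for illustration; they are not the study's data, and only the U statistic is computed (the p-value would come from an exact table or a normal approximation).

```python
# Mann-Whitney U statistic via rank sums, with average (mid) ranks for ties.
def mann_whitney_u(a, b):
    pooled = sorted(a + b)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2   # mean of 1-based ranks i+1 .. j
        i = j
    r_a = sum(rank[x] for x in a)           # rank sum of group a
    u_a = r_a - len(a) * (len(a) + 1) / 2
    return min(u_a, len(a) * len(b) - u_a)

# Hypothetical 1-4 ratings for the two usage groups (n = 11 each).
infrequent = [2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2]
regular    = [3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4]
u = mann_whitney_u(infrequent, regular)
```

A small U relative to n1*n2/2 indicates the groups' ratings separate strongly, which is what a p-value of 0.045 reflects.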

Figure 7: ‘I was able to discover something new’ ratings by voice assistant

usage

Participants with English as their first language were significantly more likely to progress

further with completing the task (two-way ANOVA, p=0.038).¹² However, there was no

notable discrepancy found between these groups in their actual interactions with the

prototype. The group without English as a first language tended to register a comparable

count of voice commands and encountered similar proportions of preview content and

full track playback. This points towards the likelihood that language fluency did not

¹² A two-way ANOVA compared the effects of gender and first language, which were found to be independent. Female participants were more likely to progress with the task, to a similarly significant extent.

[Stacked bar chart: number of participants (0-12) giving each rating ("Somewhat disagree", "Somewhat agree", "Totally agree"), for infrequent vs. regular voice assistant users.]


present a barrier to engaging with the system itself, but that deciphering the spoken

artist names and track titles was more challenging. Post-task interviews revealed a

number of comments that the pace, volume and accent (a Scottish male) of the narrator

made it difficult to discern in some cases. Added to this is the fact that transcribing artist

and track names is a relatively artificial construct included for the purposes of the

research study, which would not typically be pursued in a real-world content discovery

journey.

Figure 8: Task completion rate (number of choices provided) by first

language

Smart headphones and smart speaker – behaviour and usability ratings

No significant differences were found in any of the interactions (t-test), the task

completion rate (Mann-Whitney U-test) or self-reported usability ratings (ANOVA)

between users of either implementation. Figure 9 shows how similar the two groups were



in their use of the respective versions of AAE over 15 minutes. Speaker users appear, on

aggregate, to have been marginally more proactive in their behaviour, but in no instance

was this to a statistically significant degree. Likewise, figure 10 illustrates that opinion

expressed through usability ratings coalesced very comparably between users of either

version. Both the absolute and relative values between rating statements are mirrored

for the headphone and speaker responses. Although the headphone ratings are

consistently more favourable in this instance, again this is never to any statistically

notable extent.
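The per-command t-test comparisons above can be sketched in pure Python. The per-participant command counts below are invented for illustration; only the t statistic (Welch's unequal-variance form) is computed, not the p-value.

```python
import math

# Welch's t statistic for two independent samples of unequal size/variance.
def welch_t(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Hypothetical "forward" counts: 10 headphone users vs. 12 speaker users.
headphone_forward = [20, 28, 22, 31, 19, 26, 24, 27, 21, 25]
speaker_forward   = [27, 33, 25, 36, 24, 30, 29, 31, 26, 28, 32, 27]
t = welch_t(headphone_forward, speaker_forward)
```

A |t| below the critical value for the relevant degrees of freedom is what "in no instance statistically significant" corresponds to here.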

Figure 9: Average command use and activity by platform

[Bar chart values - mean count per participant:

               Headphones   Speaker
  "Select"     15.5         17
  "Forward"    24.3         29
  "Back"       6.2          8
  "Change"     11.6         13
  Listens      5.1          6
  Previews     24.8         24
  Responses    4.4          4.5  ]


Figure 10: Self-reported usability ratings by platform

Though differences in behaviour and usability between the experiences have not been

revealed, both figure 9 and 10 present promising data regarding the effectiveness of the

prototype design. Participants were exposed to AAE for a very limited time and without

any operating instructions. Both groups of users nevertheless felt confident skipping

through menus to identify content of potential interest (“forward” by far the most used

command), and rarely needed to repeat an option ("back" was used the least). The fact

that “select” was used noticeably more frequently than “change” emphasises that at least

a proportion of users discovered the extended uses of the former command – i.e. to

refresh a track menu list and/or to restart playback from any position during a piece. This

quantitative data is corroborated by the self-reported usability ratings, in which all responses
averaged within the affirmative (where the statement is favourable) or negative (where

the statement is pejorative) thirds. In summary, both versions of AAE seemed to enable

users to onboard themselves and pursue a specific time-bound task straightforwardly and

with a good degree of success.

[Chart: mean agreement rating (1 = "Don't agree at all" to 4 = "Totally agree"), headphones vs. speaker, for five statements: "I was able to successfully complete the task I was given"; "I found the system difficult to use"; "The system was easy to navigate"; "I was able to discover something new"; "I needed to learn a lot of things before I was able to get going with this system".]


Smart headphones and smart speaker – connectedness

Following the evaluation task, participants were asked to recall details of the main mood

menu that they encountered. A scoring system was applied to assess how well users

remembered the number, name and sequence of the mood categories (‘happy’, ‘sad’,

‘energetic’, ‘mellow’, ‘dark’) in the primary menu of the prototype. The maximum

available score was 12. Of the twenty-two participants, twelve scored full marks and only

five scored less than 9, with 5 being the lowest score registered. Importantly, there was

no significant difference found (t-test) in users’ ability to recall the makeup of the mood

menu between either version of AAE. In this instance, therefore, there is no evidence to

suggest that spatial presentation of menu options benefitted storage and recall of that

information from users’ short-term memory.
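One scheme consistent with the 12-point maximum described above can be sketched as follows. The report does not specify the actual rubric, so this split (2 points for the category count, 1 per name recalled, 1 per name in its correct position) is purely a hypothetical reconstruction:

```python
# Hypothetical recall-scoring rubric: 2 (count) + 5 (names) + 5 (order) = 12.
ACTUAL = ["happy", "sad", "energetic", "mellow", "dark"]

def recall_score(recalled):
    score = 2 if len(recalled) == len(ACTUAL) else 0        # correct count
    score += sum(1 for name in recalled if name in ACTUAL)  # names, max 5
    score += sum(1 for i, name in enumerate(recalled)       # order, max 5
                 if i < len(ACTUAL) and name == ACTUAL[i])
    return score
```

Under this rubric, perfect recall scores 12 and recalling all five names in a scrambled order still scores well, which fits the generally high scores reported.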

Participants were also asked to provide a mental model illustration of the system they

had experienced (“draw a visual representation of the system you interacted with”). Full

analysis of all of these diagrams is beyond the scope of this report. However, submissions

were subsequently discussed collectively by three BBC researchers, who together

summarised characteristic differences in the representations.

Table 2: Characterisation of mental model illustrations, by prototype

version

Headphones Speaker

• physical space and entities • hierarchical structures

• portrayals of experience • flowcharts / decision trees

• narrative explanations • process definitions

• curves and circles • hard angles


The descriptions in table 2 were not all evident universally, but they represent a

consensus overview on the general character of the two collections, which was quite

clearly distinct. One of each illustration is appended to this report to show instances

where these traits were exemplified perhaps most strongly. Figure 11 shows a depiction

that superimposed a physical journey onto the headphone experience; figure 12 shows a

logic diagram interpretation of the speaker experience. The qualitative summary in table

2 suggests that, overall, the spatialised auditory environment had a markedly different

effect for headphone users in the nature or character of their interaction and connection

with the content.


Conclusions and recommendations

The original project aims are revisited to summarise research outcomes and areas of

future work to be considered.

Review of research aims

1. Develop and evaluate an interactive prototype for navigation of audio content by

voice

The data gathered in this user study indicates that the AAE design presents a good

potential basis for voice-led interactive audio exploration. Users of the prototype

were shown to navigate through a wide range of content, without any prominent

precedence bias and with a balance of activity in line with expectations, given the

task they were set. Despite only 16.5 to 17 minutes’ exposure and relying on self-

orientation, aggregate interaction patterns also suggest that users were able to

navigate confidently, fluently and accurately, with some use of more advanced

features. These statistical findings about the user experience are supported by

qualitative, self-declared usability ratings.

2. Compare a smart headphone vs. smart speaker engagement in terms of

• exploration behaviour

There was no evidence that the spatial arrangement of content and the system

notifications used in this study had any effect on patterns of interaction, activity

or task success. It should be noted that aspects of the study design could have

presented potential limiting factors in this respect (see Future Work section

below).


• connectedness to content

Spatial presentation of menu options did not significantly aid retention or recall

of category information, compared to monaural presentation. However, data
gathered from users' mental model representations of the system strongly suggests

that the smart headphone version did create an evidently different type of

interactive experience.

Future work

There are three considerations worth noting that could be addressed if subsequent

iteration occurs in this area of investigation.

• Allow more time (at least 20 minutes) for user interaction

It is possible that the 15-minute engagement (plus 1m30s–2m pre-exposure)

was too short to establish full fluency with the system. Users will have spent a

good proportion of their allotted task time continuing to orient themselves to the

voice commands and system structure. If differences in exploration behaviour

were to emerge between the two prototype versions, it’s possible they might

only present when a base degree of interaction proficiency is established by

users. There is a possibility that the relatively short duration used by this study

did not allow that threshold to be surpassed for a sufficient amount of time.

• Test gesture instead of voice interaction with headphones

Likewise, it is possible that the inherent latency of voice interaction itself could

limit users’ ability to take advantage of any added perceptual orientation

afforded by the headphone version. In the prototype, spatial cues communicate

where, in a menu list of five, the user is currently located. If the interaction


method was instantaneously responsive (as in most media playing technology),

this would potentially allow skipping forward or back to previously heard menu

options by relying on (memory of) its virtual position as an anchor. Delayed

responsiveness in voice interaction could have been a further limiting factor on

users’ ability to exploit the navigational benefits in the headphone versions.

• Use other content and browsing contexts

Although the prototype used Glastonbury 2019 content for the purposes of

this study, it could have been populated with any segmented content, or even

full programme content (though the latter would have been more challenging

to evaluate in a controlled study). Rather than using pre-determined

categorisation, content could also be presented using dynamic categories to

introduce a recommender system mechanic to the experience. These

adaptations would provide more insight on potential avenues for exploring how

BBC content can be surfaced more effectively on audio-only devices.


Appendix



Mental model drawings

Figure 11: Example mental model drawing of smart headphone experience


Figure 12: Example mental model drawing of smart speaker experience