
INTERACTION APPROACH FOR DIGITAL VIDEO BASED STORYTELLING

Norbert Braun

Department of Digital Storytelling
Computer Graphics Center (ZGDV e.V.), Rundeturmstrasse 6

64283 Darmstadt, Germany

[email protected] http://www.zgdv.de/distel

ABSTRACT

This paper presents an approach to storytelling-oriented interaction with digital video. All interaction capabilities of the system are driven by the video context and are therefore media centric. Interaction possibilities are offered to the audience through conversation (on topics of the video content) or through classical Direct Manipulation (of video objects). Conversations can be held by the user with a personalized assistant or directly with the video. The implementation of the approach is shown through a discussion of an application architecture based on the Real Media Server, SMIL and Java 3D. The application runs as a Video on Demand system, accessible through the World Wide Web.

Keywords: Interaction, Conversation, Media Centred, Digital Video, Digital Storytelling.

1. INTRODUCTION

User interfaces surprise the everyday user with constantly new methods of interaction on their computer systems. Operating systems please their audience with approaches like Direct Manipulation, WIMP interfaces, drag-and-drop, conversational interfaces for the delegation of tasks, etc. Generally, however, the interaction metaphors are built as interaction with the system, not as interaction with the information that users are looking for and that the systems deliver. The interaction is system-centred, not centred on the needs of the user. If users shift from one system to another, the interaction opportunities change too. Users have to adapt to each system's interaction style and therefore spend a lot of time learning interaction methods. This disturbs the users' concentration on the information they need to manage within their everyday work.

The change to User Centred Interaction is nevertheless tricky. What exactly is User Centred Interaction? If one supposes that a major task of the user's everyday work is the collection of information, and that this information is delivered through media, then the work with media is important for the user. If the interaction capabilities of the user are focused on the interaction with the media and the information context, the interaction is centred on the user's needs.

How should the user interact in a Media Centred way? This question points to another paradigm of User Centred Interaction. The user should interact with the media in a somewhat natural, human-like style. This implies that the user should be able to directly manipulate objects he recognizes in the media, as well as to delegate tasks to the media, for example an altered orientation of information delivery through the media, or a request for background information on information objects presented by the media. Users should be able to phrase such delegations in a human-like way, i.e. through a conversation. Even the information delivery through the media should be conversational, because humans are trained in handling information through conversations. For the input channels of a Media Centred Information System, this implies the use of natural speech. The output channels of a Media Centred Information System should use mimics and gesture as well as speech to meet the user's needs for natural conversation and natural information assimilation.


Media Centred Interaction is thus on the one hand media-type specific; on the other hand it should be designed and adjusted in relation to the type of media used to deliver the information. Continuous media types like audio or video (audiovisual media) are time dependent. For the interaction capabilities on such media, it is important that users are able to interact synchronously with the media, i.e. with the information objects currently experienced in the media, as well as asynchronously, i.e. with information objects not currently experienced, but that have been or will be experienced while the media delivers information to the user. As in the user's real life, it should be possible to react to actions of the present or of the past, or to prepare for actions in the future.

Last but not least, the user interaction is strongly related to the type of story narrated through the media. The approach should not interfere with the suspense and immersion of the story. The story context should limit the user's interaction demands to a level that is fulfilled by the interaction capabilities of the system.

The approach of this paper has been briefly outlined in this section. The paper is organized as follows: the next section gives an overview of related work in this research area. Conversational interfaces for video are discussed in Section 3, Direct Manipulation for video in Section 4. The approach of Media Centred Interaction is presented in Section 5, and a Media Centred Interaction system for video is introduced in Section 6. The paper ends with a conclusion and some words on future work.

2. RELATED WORK

Design-based approaches to interaction within stories are closely related to the narration of the stories. At the MIT Media Lab [Davenport96], a lot of work exists on automated storytelling with avatars within a virtual environment. The interaction capabilities of their systems are strongly related to the aesthetics of the narrated story. Both Direct Manipulation and Conversational Interaction are used in these concepts. At Carnegie Mellon University [Bates96], a story engine has been developed that is based on virtual reality and also offers object manipulation and Conversational Interaction.

A video-based approach to storytelling is discussed in [Sawhney96]. The video is used to give coffee-house impressions to the user. Interaction is done by Direct Manipulation of several video streams and text hyperlinks. As in most video-based systems, conversational access to the video information is not developed.

3. CONVERSATIONAL INTERACTION

Conversational Interaction is often combined with a human-like avatar because users need someone to talk to, a character, instead of talking to a video or an audio stream. The vision, see [Spierling00], is a very realistic conversation within an immersive environment, as shown in Figure 1.

Figure 1: Idealistic view of Conversational Interaction

The use of conversations in HCI is based on the idea of task delegation. Users should be able to delegate their work to the computer, in a somewhat human-like, easy-to-learn way of interaction. The computer, on the other hand, should be able to present its results to the users in the same human-like and intuitive way. Multimedia and multimodality are frequently used words in this context, but these technologies are not strictly necessary for a conversation: text-based interfaces, used years ago in AI (see ELIZA [Weizenbaum65]), showed that conversations are possible without any picture, animation or sound interface. But the use of multimedia, e.g. video presentations, has another major advantage: it gives context to a conversation.

Conversational Interaction often suffers from the absence of restrictions on the conversation topics. Users are frequently allowed to talk about anything on their mind. Using a context to restrict conversation topics is a good approach to give the user the impression that the Conversational Interface used within the HCI component really has something to say within the conversation. Users do not get the impression of yet another fancy gadget that annoys them with the message 'I didn't understand – please repeat your sentence'. User tests showed that video presentations are a method to focus the user's conversation goals on topics that are within the range of the conversational capabilities of the system. In addition, nonlinear video presentation can be used to apply a narrative structure to the conversation. This enables the system to steer the user's talk in a direction that is understandable for the system.

If context restriction based on video presentations is used in conjunction with speech recognition, the speech recognition component can be optimised for the given context, allowing a more advanced understanding of the user's words.

Besides the input and output of speech, there is another factor that gives Conversational Interaction a human-like style: mimics and gestures, see [Buxton90]. With mimics and gestures, the conversational range can be expanded to emotional and social signals that are familiar to the average user and therefore easy to understand. In addition, the emotional and social behaviour of an avatar can expand the selective story narration of a video presentation in a generic way. Figure 2 gives an example of an avatar using its hands for gesture and its face for mimic information transfer. The user has the impression of a human-like counterpart talking with his hands.

Figure 2: Avatar using its hands and face for mimic and gesture information

If a system expands the narrative approach of a video presentation with a human-like avatar, the avatar can be used in several ways, see [Braun00]:

• Part of the story: The avatar is used as a narrative part of the story. The restrictions of the video clip narration (fixed characters, action and suspense) are expanded by the generated behaviour of the avatar. As a result of this story structure, the user has the impression of the avatar acting 'inside' the story world.

• Conferencier: The avatar is used as a kind of show master. In contrast to the previous point, the avatar acts outside of the story and is therefore noticeably (for the user) not a part of the story.

• Audience: The avatar behaves and talks as if it were a part of the audience. This differs from the previous points in that the avatar has no impact on the story's narration that is noticeable to the user. Certainly the avatar influences the user's emotions and reactions, and therefore its behaviour and speech are planned at authoring time as a part of the narration of the story.

Figure 3 shows an avatar as a conferencier within a Virtual Trade Show. Its task is to present videos on various topics, i.e. to moderate the presentation and to entertain the audience of the show. As suggested by results in psychological research [Gleich97], users accept the avatar as a conversational partner.

Figure 3: Avatar is working as a conferencier

The three proposed usages of the avatar are driven by the content of the video presentation, and the usage is therefore Media Centric. As a benefit of the Media Centric approach, the conversation context is reduced in a way that can be handled by speech and behaviour engines.

4. DIRECT MANIPULATION FOR VIDEO

The Direct Manipulation approach to video splits into Direct Manipulation of whole video clips, i.e. treatment of the video as a BLOB (binary large object), and treatment of the video as a continuous media object that can be subdivided into several information objects. Those objects have a temporal and spatial appearance within the video. As BLOB-based manipulation of a video is not that interesting for storytelling (it amounts to primitive VCR functionality), this section focuses on the manipulation of the information objects within the video.

Page 4: INTERACTION APPROACH FOR DIGITAL VIDEO BASED STORYTELLING · audience with approaches like Direct Manipulation WIMP-Interfaces, drag-and-drop, conversational interfaces for delegation

All information objects within a video have a spatial and a temporal characteristic, as well as a media-specific characteristic based upon their presentation within the graphics or acoustics of the video clip. Beyond the media presentation, the interaction possibilities on these objects are not that versatile: one can annotate the objects in a hypermedia style to give the user interactive access. This annotation can be done within the media stream that contains the annotated object (intramedia annotation), for instance by drawing a polygon around the annotated object if it is within the visual stream, or by showing a signal within a separate medium synchronously with the presence of the annotated object (intermedia annotation). An intermedia annotation could be a text hyperlink or a button that is active synchronously with the presentation time of the annotated object, as seen in [Martel96]. Intermedia annotation has several problems, e.g. the problem of associating the annotation with the annotated object. The approach of this paper is an intramedia annotation, due to the better association of object and annotation.
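To make the concept concrete, the following minimal sketch shows how such an intramedia annotation could be represented and hit-tested on the client side. The class and field names are hypothetical illustrations, not the described system's actual code:

```java
import java.awt.Polygon;

// Hypothetical sketch: an intramedia annotation as a polygon that is
// only active during a given interval of the video's timeline.
public class VideoAnnotation {
    private final Polygon region;      // spatial appearance in the frame
    private final double startSec;     // annotation becomes active
    private final double endSec;       // annotation expires
    private final String targetUri;    // information linked to the object

    public VideoAnnotation(Polygon region, double startSec,
                           double endSec, String targetUri) {
        this.region = region;
        this.startSec = startSec;
        this.endSec = endSec;
        this.targetUri = targetUri;
    }

    // A click only hits the annotation if it falls inside the polygon
    // while the annotated object is present in the video.
    public boolean hit(int x, int y, double mediaTimeSec) {
        return mediaTimeSec >= startSec && mediaTimeSec <= endSec
                && region.contains(x, y);
    }

    public String getTargetUri() { return targetUri; }
}
```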

Figure 3: Concept of a Temporal Video Hyperlink within the video's graphics

The problem of the spatial and temporal presentation of objects within the visual media stream can be solved by using Temporal Video Hyperlinks, see [Braun99]. These Temporal Video Hyperlinks present the temporal structure of a hyperlink to the user by giving him an explicit notion of the following points:

• The start of the hyperlink: With the start of the annotation, the user notices a sign that is explicitly assigned to the annotated object.

• The duration of the hyperlink: While the annotation is accessible, the hyperlink reflects the remaining time that is left to access it.

• The end of the hyperlink: The end of the hyperlink is shown to the user by an explicit end symbol. After the hyperlink has ended, users know that there is no more access to the object.

Figure 3 shows this approach for a visual annotation. A white rectangle is drawn around the annotated object. This rectangle changes its colour to black as the hyperlink elapses in time. Figure 4 shows the same approach with a circle drawn around the annotated object. The circle's colour starts as white and gradually changes to red. User tests show that interaction is more comfortable with this approach: the stress of clicking an annotation as fast as possible is reduced to a minimum (a colour-interpolation sketch follows after Figure 4).

Figure 4: Temporal Video Hyperlink shown by a filling ring around the target object
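The countdown colouring of Figures 3 and 4 amounts to a colour interpolation over the hyperlink's lifetime. A minimal sketch, assuming a simple linear RGB blend (the paper does not specify the blending scheme):

```java
import java.awt.Color;

// Sketch: interpolate the border colour of a Temporal Video Hyperlink so
// that the border itself shows how much access time remains.
public final class HyperlinkCountdown {

    // Returns the border colour at the given media time; start/end are the
    // hyperlink's activation interval on the video timeline.
    public static Color borderColor(Color from, Color to,
                                    double startSec, double endSec,
                                    double mediaTimeSec) {
        double t = (mediaTimeSec - startSec) / (endSec - startSec);
        t = Math.max(0.0, Math.min(1.0, t)); // clamp the elapsed fraction
        return new Color(
                (int) (from.getRed()   + t * (to.getRed()   - from.getRed())),
                (int) (from.getGreen() + t * (to.getGreen() - from.getGreen())),
                (int) (from.getBlue()  + t * (to.getBlue()  - from.getBlue())));
    }
}
```

For Figure 3, borderColor(Color.WHITE, Color.BLACK, start, end, t) fades the rectangle from white to black; Figure 4's ring would use Color.WHITE and Color.RED.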

The same approach holds for the acoustic channel of a video presentation. Acoustic objects within the audio channel can be annotated by a sound played in parallel with the object. Since Auditory Icons (sounds with a natural source, see [Brewster97]) do not work well with this approach, Earcons (synthetic sounds, see [Brewster97]) can be used for the sonification of the duration of the hyperlink. Figure 5 shows how two converging tones, played synchronously with the annotated audio object, indicate the duration of the hyperlink by converging to one tone.

Figure 5: Temporal Audio Hyperlink by converging tones

Figure 6 shows how the same concept works with increasing repeats of a tone. Figure 7 shows how the duration of the hyperlink is indicated by tones that shorten in duration (a tone-generation sketch follows after Figure 7).


Figure 6: Temporal Audio Hyperlink by increasing repeats of a tone

Figure 7: Temporal Audio Hyperlink by shortened tone duration
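As an illustration of the converging-tones Earcon of Figure 5, here is a hedged sketch using the standard Java sound API; the duration and pitches are assumed values, not taken from the paper:

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.SourceDataLine;

// Hypothetical sketch: an Earcon of two sine tones whose pitches converge
// to a single tone over the hyperlink's lifetime, as in Figure 5.
public class ConvergingEarcon {
    public static void main(String[] args) throws Exception {
        float rate = 44100f;
        double seconds = 4.0;                 // assumed hyperlink duration
        double fLow = 440.0, fHigh = 880.0;   // start pitches of the two tones
        double fMeet = 660.0;                 // pitch both tones converge to
        AudioFormat fmt = new AudioFormat(rate, 16, 1, true, false);
        SourceDataLine line = AudioSystem.getSourceDataLine(fmt);
        line.open(fmt);
        line.start();
        int n = (int) (seconds * rate);
        byte[] buf = new byte[2 * n];
        double phase1 = 0.0, phase2 = 0.0;
        for (int i = 0; i < n; i++) {
            double p = i / (double) n;        // elapsed fraction of the link
            // Sweep the instantaneous frequencies towards the meeting pitch
            // and accumulate phase so the sweep stays click-free.
            phase1 += 2 * Math.PI * (fLow + p * (fMeet - fLow)) / rate;
            phase2 += 2 * Math.PI * (fHigh + p * (fMeet - fHigh)) / rate;
            short v = (short) (0.3 * Short.MAX_VALUE
                    * (Math.sin(phase1) + Math.sin(phase2)));
            buf[2 * i] = (byte) v;
            buf[2 * i + 1] = (byte) (v >> 8);
        }
        line.write(buf, 0, buf.length);
        line.drain();
        line.close();
    }
}
```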

User tests show, on the one hand, that users can distinguish between the annotation and the audio object (known as the cocktail party effect, see [Arons92]) and, on the other hand, that users can predict the duration of a temporal audio hyperlink from the start of the annotation.

5. MEDIA CENTRED INTERACTION

Having discussed Conversational Interaction and Direct Manipulation of videos in Sections 3 and 4, the question is how these can be combined into a Media Centric Interaction metaphor. If the conversational component, shown as an avatar in Section 3, is driven by the content of the video (i.e. the behaviour and speech of the avatar are shown synchronously with the video and its context), one needs a processing unit that takes the user interaction on both interaction components into account.

Figure 8: Concept of Media Centric Interaction on video, based on selective/generative storytelling

This unit has to calculate an adapted system reaction to every user interaction, based on the story requirements. These requirements derive from the narration of the story and therefore from the author's intentions for the story processing. Conversations have to take place as a part of the story dramaturgy, allowing immersion and suspense of the user within the story space. The approach of this paper is a combined selective and generative narration of the story. This combination allows a playout of video clips and avatar behaviour and speech, based on the plots that are processed by the story engine and presented as a nonlinear video. As the selective and the generative components are combined, every information output is based on the story processing, and every user interaction is based on the context given by the video/avatar presentation. User interaction is therefore Media Centric. Figure 8 shows a sketch of the approach, with the video as the central information unit, flanked by the Direct Manipulation and Conversational Interaction capabilities, and the story generated by a combined selective/generative story engine.
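A minimal sketch of such a processing unit, with hypothetical type names: both interaction channels are funnelled into one story engine, whose reply is always derived from the story context:

```java
import java.util.Optional;

// Hypothetical sketch of the central processing unit described above.
enum Channel { DIRECT_MANIPULATION, CONVERSATION }

// A user event: which channel it came from, a symbolic payload
// (e.g. a hyperlink id or a recognized query), and the media time.
record UserEvent(Channel channel, String payload, double mediaTimeSec) { }

// What the story engine answers with: any combination of a new clip,
// an avatar behaviour instruction and spoken text.
record StoryOutput(Optional<String> videoClipUri,
                   Optional<String> avatarBehaviour,
                   Optional<String> speechText) { }

interface StoryEngine {
    // Combined selective/generative step: pick or generate the next
    // story element that fits both the plot and the user's event.
    StoryOutput react(UserEvent event);
}

// One entry point for both interaction metaphors, so the reaction is
// always media centric (driven by the story context).
class InteractionUnit {
    private final StoryEngine engine;

    InteractionUnit(StoryEngine engine) { this.engine = engine; }

    StoryOutput onUserInteraction(UserEvent event) {
        return engine.react(event);
    }
}
```

The point of the single entry is that neither channel bypasses the story processing, which is what keeps every system reaction media centric.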

6. ARCHITECTURE

The approach of this paper, nonlinear storytelling combined with Direct Manipulation and Conversational Interaction, makes widespread use of state-of-the-art technologies to allow users Media Centric access to stories presented by video. The architecture of the system has to consider the following components:

• A conversational approach based on a human-like avatar
• A Direct Manipulation approach based on video annotation
• Story generation based on selective and generative elements
• Speech generation
• Speech recognition
• Video service
• Animation of the avatar

Figure 9 shows an architectural overview of the system. The whole system is based upon the Real Media video server, see [HeftaGaub97].

Server

As one can see, a database is used to store the selective and generative components of the story, such as video clips and avatar behaviour primitives. This database is used by the story engine to calculate the actual story narration, depending on the authoring (pre-authored video, combined with video-synchronous avatar animation and text) and on the user replies sent by the Real Media client. These story components are delivered via the Real Media server (using plug-ins) through the World Wide Web to a Real Media client running within a Netscape browser.


The animation control is synchronized to the video stream by SMIL [SMIL98].

Figure 9: Architecture of the Media Centric system

Client

The Real Media client distributes the presentation streams to

• the video rendering unit

• the avatar behaviour decision unit

o a Java 3D based feature morphing renderer, see [Alexa99]

o a phoneme generator [Portele92], combined with a speech generator [Dutoit96]

o a viseme generator

The avatar decision unit decides which avatar behaviour to show to the user. The avatar behaviour consists of three components (a selection sketch follows below):

• the behaviour streamed by the Real server; since this behaviour is streamed synchronously with the video (see Figure 10), it is called synchronous behaviour;

• the asynchronous conversation reply generated by the story engine; since this behaviour is streamed asynchronously with respect to the video, it is called asynchronous behaviour;

• feedback behaviour generated on the client as direct feedback to user inputs.

The avatar behaviour, combined with the viseme animation, is generated on the fly on the client side of the system. The viseme animation is lip-synchronized with the speech rendered by the phoneme-to-speech generator.
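A sketch of how the decision unit might arbitrate between the three behaviour components; the priority order (client feedback first, then conversation replies, then streamed behaviour) is an assumption for illustration, not specified in the paper:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the avatar behaviour decision unit.
interface Behaviour { }  // a streamed behaviour instruction

class AvatarDecisionUnit {
    private final Deque<Behaviour> feedback = new ArrayDeque<>();      // client-generated
    private final Deque<Behaviour> asynchronous = new ArrayDeque<>();  // story engine replies
    private final Deque<Behaviour> synchronous = new ArrayDeque<>();   // streamed with video

    void onClientFeedback(Behaviour b) { feedback.add(b); }
    void onConversationReply(Behaviour b) { asynchronous.add(b); }
    void onStreamedBehaviour(Behaviour b) { synchronous.add(b); }

    // Called by the renderer: the most urgent queued behaviour wins.
    Behaviour nextBehaviour() {
        if (!feedback.isEmpty()) return feedback.poll();
        if (!asynchronous.isEmpty()) return asynchronous.poll();
        return synchronous.poll(); // null if nothing is queued
    }
}
```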

The Conversational Interaction of the user is processed via speech recognition. The speech recognition of the system runs in a separate application (using Java Speech and IBM ViaVoice), due to Java runtime limitations within the Netscape browser. The speech recognition unit processes the user's speech into a symbolic query reply. This query reply is passed via the Real Media client to the Real Media server.
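Since the paper names Java Speech and IBM ViaVoice, the conversational input could look roughly like the following JSAPI 1.0 sketch; the grammar content and the mapping to query symbols are assumptions:

```java
import java.io.StringReader;
import javax.speech.Central;
import javax.speech.recognition.*;

// Sketch of the conversational input path (JSAPI 1.0): a context-restricted
// grammar keeps recognition tractable; accepted results are printed as the
// symbolic query reply that would be forwarded to the Real Media server.
// The grammar rules below are invented examples.
public class ConversationInput {
    public static void main(String[] args) throws Exception {
        Recognizer rec = Central.createRecognizer(null); // default engine, e.g. ViaVoice
        rec.allocate();
        String jsgf = "#JSGF V1.0; grammar story; "
                + "public <query> = tell me more | who is that | skip this scene;";
        RuleGrammar grammar = rec.loadJSGF(new StringReader(jsgf));
        grammar.setEnabled(true);
        rec.addResultListener(new ResultAdapter() {
            public void resultAccepted(ResultEvent e) {
                Result r = (Result) e.getSource();
                StringBuilder sb = new StringBuilder();
                for (ResultToken t : r.getBestTokens())
                    sb.append(t.getSpokenText()).append(' ');
                // In the real system this symbolic reply goes to the server.
                System.out.println("query reply: " + sb.toString().trim());
            }
        });
        rec.commitChanges();
        rec.requestFocus();
        rec.resume();
    }
}
```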

The speech recognition also processes the Direct Manipulation interaction on the acoustic part of the video, since interaction with the acoustic part of the video is done by the user's voice. Every user interaction with a temporal audio hyperlink is given as a reply to the Real Media server.

The Direct Manipulation interaction on the visual part of the video, based on annotations of the visual objects within the video, is handled by the Real Media client. Every user interaction is given as a reply to the Real Media server.

On the server side, a reply is calculated by the story engine (in the form of a video clip and/or avatar behaviour and text) and streamed back to the Real Media client.

Behaviour Streaming

The streaming of the avatar behaviour is done by using high-level behaviour instructions instead of pre-animating the behaviour on the server side. The avatar behaviour is hierarchically divided into four levels (a sketch of this command hierarchy follows below):

• Motivation: Commands that trigger a proactive high-level behaviour like 'angry' or 'sad'.

• Task: Commands that trigger direct actions like 'salutation' or 'illustration'.

• Feature: Commands that trigger a modification of high-level body components like 'left eyebrow up' or 'mouth smile'.

• Geometric: Commands that trigger low-level changes of the avatar geometry.

With these commands one can reduce network traffic and maintain an avatar animation and rendering on the client side that responds quickly to user interaction. As a nice side effect, the streamed avatar behaviour is conveniently synchronized by SMIL, see Figure 10.
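The four command levels could be modelled as a small class hierarchy; the following sketch uses illustrative names, as the paper does not give the instruction format:

```java
// Sketch of the four behaviour command levels; class and field names are
// illustrative, the actual instruction format is not given in the paper.
abstract class BehaviourCommand {
    enum Level { MOTIVATION, TASK, FEATURE, GEOMETRIC }

    final Level level;
    final double mediaTimeSec; // start time, SMIL-synchronized to the video

    BehaviourCommand(Level level, double mediaTimeSec) {
        this.level = level;
        this.mediaTimeSec = mediaTimeSec;
    }
}

// Motivation level: proactive high-level behaviour such as 'angry' or 'sad'.
class MotivationCommand extends BehaviourCommand {
    final String mood;
    MotivationCommand(String mood, double t) {
        super(Level.MOTIVATION, t);
        this.mood = mood;
    }
}

// Task level: direct actions such as 'salutation' or 'illustration'.
class TaskCommand extends BehaviourCommand {
    final String action;
    TaskCommand(String action, double t) {
        super(Level.TASK, t);
        this.action = action;
    }
}

// Feature level: named body features such as 'left eyebrow up' or
// 'mouth smile', with an intensity the renderer maps to morph weights.
// (A GeometricCommand carrying raw vertex offsets would follow the
// same pattern.)
class FeatureCommand extends BehaviourCommand {
    final String feature;
    final float intensity;
    FeatureCommand(String feature, float intensity, double t) {
        super(Level.FEATURE, t);
        this.feature = feature;
        this.intensity = intensity;
    }
}
```

Sending such compact commands instead of pre-rendered animation frames is what keeps the network traffic low while the client stays responsive.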

Figure 10: Avatar behaviour streamed synchronously to the video

Due to the feature morphing, animation sequences have to be defined only once. If the geometry of the avatar is changed, the avatar behaviour can still be used by the system. Figure 11 shows two morph targets based on the geometry of a video cassette. The animation defined with the morph targets can be used with any other avatar geometry (a morphing sketch follows after Figure 11).

Figure 11: Morph targets (left: smile, right: brow up)
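Java 3D's standard Morph node can stand in here for the feature morphing renderer of [Alexa99]: the morph targets of Figure 11 become geometry arrays, and a feature command translates into a change of the blending weights. A minimal sketch, assuming pre-built GeometryArrays with identical vertex counts (a requirement of the Morph node):

```java
import javax.media.j3d.GeometryArray;
import javax.media.j3d.Morph;

// Hypothetical sketch: blend a neutral face with the 'smile' and
// 'brow up' morph targets of Figure 11 via Java 3D's Morph node.
public class AvatarFace {
    private final Morph morph;

    public AvatarFace(GeometryArray neutral, GeometryArray smile,
                      GeometryArray browUp) {
        morph = new Morph(new GeometryArray[] { neutral, smile, browUp });
        morph.setCapability(Morph.ALLOW_WEIGHTS_WRITE); // allow runtime updates
    }

    // Weights must sum to 1; the residue stays on the neutral target.
    public void express(double smileWeight, double browUpWeight) {
        double neutral = 1.0 - smileWeight - browUpWeight;
        morph.setWeights(new double[] { neutral, smileWeight, browUpWeight });
    }

    public Morph getNode() { return morph; }
}
```

Because the weights, not the geometry, carry the animation, the same expression sequence can drive the human-like avatar and the video cassette avatar alike.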

Prototype

The whole system is implemented as a prototype, with the client running on a 900 MHz Windows NT computer.

Figure 12: System prototype with human-like avatar and video

Figure 12 shows the prototype's interface, presenting an avatar and a video. The avatar acts synchronously with the video, talking about the movie. It changes its behaviour if a user interaction happens: a direct feedback is given to the user, followed by an (optional) change of the video clip and/or of the avatar behaviour and speech. Figure 13 shows another avatar, this time not human-like, that acts in the same way as the avatar described above.

Figure 14 shows a user interacting by voice with a video. The avatar is not visible; the conversation is held directly with the video stream. Tests show that users prefer having the avatar shown on the screen.

Figure 13: System with video cassette avatar and video

Figure 14: Media Centric user interaction by voice

The prototype is used for user testing. Users gave very positive feedback on the system; they interacted with the story shown by the system in a very intuitive way. Users got used to the approach without a longer training phase. The test results are consistent with earlier tests based on video hyperlinks [Zahn00] and assistance systems [Nietschke00].

7. CONCLUSION

The approach discussed in this paper showed a storytelling system based on a story engine that combines selective and generative components for story narration. The story is presented using video clips and avatar animation. User interaction is possible via a hypermedia metaphor based on temporal acoustic and visual video hyperlinks, as well as via a conversational metaphor. The interaction facilities are given to the user in a Media Centric way: the context of the conversation and of the hypernavigation is always based on the story narrated through the video.

The story narration of the system is done using a plot-based story engine; the story management is located in event triggers combined with hypernavigation. Future work will expand the story engine towards a generative storytelling model with a larger degree of freedom in story generation, a finer granularity of control and a global location of the story management.

Another interesting extension of the approach towards web-based interactive television is a multi-user connection based on avatars. The avatar behaviour and speech should then be driven by the story impressions of several users, based on the nonlinear story narration presented by the video.

REFERENCES

[Alexa99] Alexa, M. and Müller, W.: The Morphing Space, Proceedings of the 8th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, 1999

[Arons92] Arons, B.: A Review of the Cocktail Party Effect, Journal of the American Voice I/O Society, 1992

[Bates96] Bates, J. and Smith, S.: Towards a Theory of Narrative for Interactive Fiction, Technical Report, School of Computer Science, Carnegie Mellon University, 1989

[Braun00] Braun, N. and Finke, M.: Interaction of Video on Demand Systems with Human-like Avatars and Hypermedia, in Scholten, H. and van Sinderen, M.J. (Eds.): Interactive Distributed Multimedia Systems and Telecommunication Services, Springer, 172-186, 2000

[Braun99] Braun, N. and Dörner, R.: Temporal Hypermedia for Multimedia Applications in the World Wide Web, Proceedings of the 3rd International Conference on Computational Intelligence and Multimedia Applications, World Scientific, 1999

[Brewster97] Brewster, S.A. and Clarke, C.V.: The Design and Evaluation of a Sonically-Enhanced Tool Palette, Proceedings of the International Conference on Auditory Display (ICAD), 1997

[Buxton90] Buxton, W.: The Natural Language of Interaction: A Perspective on Non-Verbal Dialogues, in Laurel, B. (Ed.): The Art of Human-Computer Interface Design, Addison-Wesley, Reading, MA, 405-416, 1990

[Davenport96] Davenport, G.: 1001 Electronic Story Nights: Interactivity and the Language of Storytelling, Proceedings of the Australian Film Commission's 1996 Language of Interactivity Conference, 1996

[Dutoit96] Dutoit, T., Pagel, V., Pierret, N., Van Der Vreken, O. and Bataille, F.: The MBROLA Project: Towards a Set of High-Quality Speech Synthesizers Free of Use for Non-Commercial Purposes, Proceedings of ICSLP 96, Philadelphia, 1996

[Gleich97] Gleich, U.: Parasocial Interaction with People on the Screen, New Horizons in Media Psychology, 1997

[HeftaGaub97] Hefta-Gaub, B.: The Real Media Platform Architectural Overview, RealNetworks Conference, San Francisco, USA, 1997

[Martel96] Martel, R.: A Distributed Network Approach and Evolution to HTML for New Media, Real Time Multimedia and the World Wide Web, No. 25, France, 1996

[Nietschke00] Nitschke, J. and Wandke, H.: Human Support as a Model for Assistive Technology, Proceedings of OZCHI, 2000

[Portele92] Portele, T., Steffan, B., Preuss, R., Sendlmeier, W.F. and Hess, W.: HADIFIX - A Speech Synthesis System for German, Proceedings of ICSLP '92, Banff, 1227-1230, 1992

[Sawhney96] Sawhney, N., Balcom, D. and Smith, I.: HyperCafe: Narrative and Aesthetic Properties of Hypervideo, Proceedings of Hypertext '96, 1996

[Spierling00] Spierling, U., Gerfelder, N. and Mueller, W.: Novel User Interface Technologies and Conversational User Interfaces for Information Appliances, Proceedings of the ACM Conference on CHI, Development Consortium, 2000

[SMIL98] World Wide Web Consortium (W3C): Synchronized Multimedia Integration Language (SMIL) 1.0 Specification, http://www.w3.org/TR/REC-smil/, 1998

[Weizenbaum65] Weizenbaum, J.: ELIZA - A Computer Program for the Study of Natural Language Communication Between Man and Machine, Communications of the ACM, 9, 36-45, 1966

[Zahn00] Zahn, C., Schwan, S. and Barquero, B.: Authoring and Navigating Video: Design Strategies and Users' Needs for Hyperlinks in Educational Films, Report on an Explorative Study, University of Tuebingen, 2000