Modeling and Evaluating a Bayesian Network of Culture ... · a hybrid approach of modeling...

Modeling and Evaluating a Bayesian Networkof Culture-specific Non-verbal Behaviors

for Virtual Character Dialogues

*** *** ***

ABSTRACTHuman behavior is, amongst others, influenced by culturalbackground. Thus research on virtual characters that shouldimitate human behavior, considers cultural background toinfluence the behavior of agents as well. This paper presentsa hybrid approach of modeling culture-specific non-verbalbehaviors for virtual character dialogues based on both: the-oretical knowledge and empirical data. A Bayesian networkmodel has been established that augments the causal rela-tions between culture, dialog behavior, and non-verbal be-havior with empirical data. Therefore, the structure andvariables of the network are designed based on models andtheories from the social sciences, while its parameters arelearned from a video corpus of German and Japanese con-versations. A 10-fold-cross-validation has been conductedto validate the model. Having such a model at hand allowsfurther investigation of behavioral dependencies as well asthe automatic generation of non-verbal behaviors for virtualcharacter dialogs that reflect prototypical differences basedon culture.

Categories and Subject DescriptorsI.2.11 [Artificial Intelligence]: Distributed Artificial Intel-ligence—Intelligent agents; I.6.7 [Simulation and Model-ing]: Model Development

General TermsHuman Factors, Design

KeywordsVirtual Agents, Culture, Bayesian Networks, Non-verbal Be-havior

1. MOTIVATIONNon-verbal behavior takes a significant role during inter-

personal interaction, e.g. [15, 24]. How these non-verbal be-haviors are conducted and perceived is, amongst others, de-pendent on cultural background [27]. Unintended messagesmight be perceived due to different cultural backgrounds ofthe interlocutors. An example includes the expressiveness

Appears in: Proceedings of the 14th International Confer-ence on Autonomous Agents and Multiagent Systems (AA-MAS 2015), Bordini, Elkind, Weiss, Yolum (eds.), May,4–8, 2015, Istanbul, Turkey.Copyright c© 2015, International Foundation for Autonomous Agents andMultiagent Systems (www.ifaamas.org). All rights reserved.

of gestures. “What might seem like violent gesticulating tosomeone from Japan would seem quite normal and usual tosomeone from a Latin culture” [16].

In a similar manner, a virtual character’s expressions canbe perceived differently due to the cultural background ofthe observer. Studies by Koda and colleagues [20], for ex-ample, show that observers from different cultures judge theemotions of a virtual character differently based on their fa-cial expressions. We therefore think that designers of virtualcharacters should take potential cultural differences into ac-count when simulating natural non-verbal behaviors withvirtual characters.

However, building models that determine culture-relateddifferences in behavior is challenging as the causal relationof culture and corresponding behavior needs to be simu-lated in a convincing and consistent manner. Using Bayesiannetworks appears well suited, as they allow dealing withuncertain knowledge resulting from the fact that there isno deterministic mapping between human factors such ascultural background and non-verbal behaviors. Similar at-tempts have been made, for instance, for the generation ofgestures supporting spatial information [4] for virtual char-acters. Rehm et al. [25] have applied a Bayesian net-work based on theoretical knowledge, to determine the mostlikely cultural background of a user and to adapt the behav-ior of virtual characters taking the observed gestural speedand spatial extent into account using a Wiimote controller.However, learning the parameters of a Bayesian networkfrom a multi-modal corpus has not been applied to culture-specific behaviors yet. Using this approach, culture as anon-deterministic concept can be modeled in an intuitivemanner and without giving up a certain amount of variabil-ity that is necessary to ensure that an agent is perceivedas an individual. Training such a network, however, is atime and resource consuming task as a great amount of em-pirical data needs to be collected and processed to be ableto extract prototypical differences and distinguish culturaldifferences from other human factors such as personality.

In this paper, we introduce a hybrid approach to build aBayesian network that determines the probabilities of non-verbal behaviors based on cultural background. Therefore,the structure of the model with its nodes and variables nec-essary to parametrize different behavioral aspects, are basedon theoretical knowledge from the social sciences, while theparameters are learned from empirical data. Applying a hy-brid approach bears certain advantages. Whilst profoundmodels can be found in the literature, described behavioraltendencies sometimes lack enough detail for implementation.

These details can be extracted from empirical data. Thus,such a model can augment the causal relations between cul-ture and non-verbal behavior with empirical data. Havingsuch a model at hand allows further investigation of behav-ioral dependencies which might stay unrecognized with astatistical approach.

2. RELATED WORKThe majority of approaches that investigate culture for

virtual characters is theory-driven, with theories from the so-cial sciences being taken as a basis for computational models.That way, the causality of culture and corresponding behav-ior can be formalized in a generalizable manner. A commonway of implementing the theory-driven approach is to startfrom existing multi-agent architectures and extend them toallow for culture-specific adaption of goals, beliefs and plans.One of the earliest and most well-known systems is the Tac-tical Language Training System (TLTS) [18] which is basedon an architecture that implements a version of Theory ofMind, and has formed the basis of a variety of products forlanguage and culture training. More recently systems havebeen developed that extended the agent mind architectureFAtiMA [8] by representations of Hofstede’s cultural dimen-sions [13] to model the culture of the character, culture-specific symbols, culture-specific goals and needs, and therituals of the culture. Using this approach, the ORIENT [3],the MIXER [2] and the Traveller [22] applications simulateabstract cultures that are based on theoretical knowledgewith the aim to generate more cultural awareness on theuser’s side.

Data-driven approaches, on the other hand, use data suchas annotated multimodal recordings of existing cultures asa basis for computational models of culture. Such a cross-cultural corpus has, for example, been recorded for multi-party multi-modal dialogues in the Arab, American Englishand Mexican Spanish cultures by Herrera and colleagues[12]. The corpus has partly been coded regarding prox-emics, gaze and turn taking behaviors to allow extractionof culture-related differences in multi-party conversations.A video corpus of cross-cultural dyadic conversations in theGerman and Japanese cultures [1] was recorded with theaim to build a model for culture-related virtual agent be-havior. With the corpus data, statistical analyses were per-formed highlighting differences between the recorded cul-tures. Later, perception studies with virtual characters fol-lowing scripts that reflect the statistical data were performedin the target cultures. Results suggest that users prefer vir-tual character behavior that was designed to resemble theirown cultural background [1]. So far, behavioral aspects wereanalysed in isolation only, and no dependencies, e.g. be-tween verbal and non-verbal behavior, were considered.

While the theory-driven approach is very well suited forthe implementation of synthetic cultures and usually ensuresa higher level of consistency than the data-driven approach,it is not grounded in real data and thus may not faithfullyreflect the non-verbal behavior of existing cultures. Anotherlimitation is that it is difficult to decide which non-verbal be-haviors to choose for externalizing the goals and needs gen-erated in the agent minds. The advantage of data-drivencomputational models of culture lies in their empirical foun-dation. Compared to the theory-driven approach they al-low the extraction of concrete human behaviors. However,they are hard to adapt to settings different from the ones

recorded, as the data is hard to generalize for lack of a causalmodel. Furthermore, there is the danger that the model de-rived from the corpora is incomplete, potentially resultingin an inconsistent culture-specific behavior of the agents.

3. HYBRID APPROACHIn the present contribution we employ a hybrid approach

to build a Bayesian network that determines prototypicalculture-specific non-verbal behavior for virtual character di-alogs. The model of the network is designed based on the-oretical knowledge whilst its parameters are learned frommultimodal data.

3.1 Network ModelThe structure of the network with its variables was mod-

eled based on cultural theories and categorizations of behav-ioral aspects derived from the literature and implementedusing the GeNie modeling environment [9], see Figure 1.Since the aim of our approach is to generate non-verbalbehavior for a given virtual character dialog, the networkis divided in two parts: influencing factors and (resulting)non-verbal behavior.

3.1.1 Influencing FactorsThe objective of our approach is to generate non-verbal

behavior for a given dialogue based on culture. Althoughthere are other factors that influence human behavior, suchas personality, in our influencing factors we focus on cultureand conversational behavior.

Culture.To model culture in our network, we use Hofstede’s dimen-

sional model [13] as a basis. For the model more than 70cultures were categorized along 5 dimensions in an empiricalsurvey. The Power Distance dimension (PDI) describes theextent to which a different distribution of power is acceptedby the less powerful members of a culture. The Individual-ism dimension (IDV) describes the degree to which individu-als are integrated into a group. On the individualist side tiesbetween individuals are loose, while on the collectivist side,people are integrated into strong, cohesive in-groups. TheMasculinity dimension (MAS) describes the distribution ofroles between the genders. In feminine cultures, the rolesdiffer less than in masculine cultures, while competition israther accepted in masculine cultures and status symbolsare of importance. The Uncertainty Avoidance dimension(UAI) defines the tolerance for uncertainty and ambiguity.It indicates to what extent the members of a culture feelcomfortable or uncomfortable in unstructured or unknownsituations. The Long-Term Orientation dimension (LTO)explains differences in the perception of virtue. Values forlong term orientation are, for example, thrift and persever-ance, whereas examples for values for the short term orien-tation are respect for tradition, fulfilling social obligations,and saving one’s face.

For each dimension, clear mappings are available from ex-isting national cultures to the cultural dimensions [13]. Ini-tially these scores were normalized across cultures to staybetween 0 and 100, and extended later when more cultureswere added and more extreme values were observed. For thestructure of our network, we categorized the scores on thecultural dimensions into three discrete values (low, medium,high).

Figure 1: Network model for culture-related non-verbal behavior generation.

Gender.Another human factor that influences gestural expressiv-

ity and non-verbal behavior selection is the gender of thespeaker. For example, it is stated that gestures of menare performed more forceful, while women typically gesturemore fluidly [11]. Furthermore, men favor larger gesturesthan women. Although not the focus of the present contri-bution we added a node for gender (male, female) to ournetwork to obtain a more complete model.

Verbal Behavior.In order to generate non-verbal behavior for virtual char-

acter dialogs, verbal behavior was broken down to speechacts, and conversational topics in our network.

Speech acts can be categorized along the DAMSL (DialogAct Markup in Several Layers) coding scheme that was in-troduced by Core and Allen [6]. One layer of the schema, thecommunicative function, labels the communicative meaningof a speech-act. We use the following subset of communica-tive functions: statement, answer, info request, agreement /disagreement (indicating the speaker’s point of view), un-derstanding / misunderstanding (without stating a point ofview), hold, laugh and other.

According to Schneider [26], topics prototypically occur-ring in first-time meetings can be classified as follows: Theimmediate situation holds topics that are elements of theso-called frame of the situation, holding topics such as thesurrounding or the atmosphere of the conversation. Theexternal situation describes all topics that hold the largercontext of the immediate situation, such as the news, pol-itics, sports or movies. For the communication situationinterlocutors are seen as a subset of the immediate situa-tion, with topics focusing on the conversation partners, e.g.,their hobbies, family or career.

3.1.2 Non-verbal BehaviorThe second part of our network model consists of non-

verbal behavioral traits, that shall be calculated by the modelbased on the influencing factors (see Figure 1). Particu-larly, the nodes for posture type, gesture type, and expressiv-

ity form the output of our model.

Gesture types.The type of gesture chosen can be dependent on cultural

background. In out network, we classify gesture types ac-cording to McNeill’s categorization [23]: deictic, beat, em-blem, iconic, metaphoric, and adaptors. Deictic gesturesare pointing gestures. Beat gestures are rhythmic gesturesthat follow the prosody of speech. Emblems have a con-ventionalized meaning and do not need to be accompaniedby speech. Iconic gestures explain the semantic content ofspeech, while metaphoric gestures accompany the seman-tic content of speech in an abstract manner by the use ofmetaphors. Adaptors are hand movements towards otherparts of the body. As we are focusing on gestures thataccompany speech, adaptors were excluded from our net-work. Emblems are problematic for our purpose and wereexcluded as well, as they are highly dependent on cultureand might even convey different meanings in different lo-cations. Thus, even if we could predict that a emblematicgesture type would be appropriate, different concrete ges-tures needed to be selected based on cultural background.

Posture types.To distinguish arm postures, we employed Bull’s posture

categorization [5]. In total, 32 different arm positions are in-cluded to the network, such as PHEw (put hands on elbow),PHWr (put hands on wrist) or FAs (fold arms).

Expressivity.How we exhibit a gesture can sometimes be more crucial

to the observer’s perception than the gesture itself. We thusadded a node describing the expressivity to our network. It isfurther broken down into parameters describing differencesin the dynamic variation along different dimensions [10]. Forour model, we employ the expressivity parameters spatialextent, power, speed, fluidity and repetition [21]: The spa-tial extent describes the arm’s extent relative to the torso.The speed and power (acceleration) with which a gesture isperformed can vary as well. The fluidity describes the flow

of movements, as gestures can be conducted in a jerky orfluid manner. The repetition holds information about therepetition of the stroke of a gesture. In our network expres-sivity is categorized by the three values low, medium andhigh. To model the dependencies of the expressivity nodeand the parameters, initial values were set in a manner thata high expressivity is more likely to result in a higher valuefor the parameters.

3.1.3 Dependencies in the ModelTo model the connection from culture to non-verbal ex-

pressivity, we rely on the idea of synthetic cultures [14].These synthetic cultures find themselves on one extreme endof each cultural dimension described above, and look at thiscultural dimension in isolation. In [14], prototypical behav-ioral traits for these synthetic cultures are described. Forexample, the extreme masculine culture is described as be-ing loud and verbal, liking physical contact, direct eye con-tact, and animated gestures. Extreme feminine cultures, onthe other hand, do not raise their voices, like agreement, donot take much room and are warm and friendly in conversa-tions. Although the descriptions of prototypical behavior forthe synthetic cultures clearly indicate that the positioningon the dimensions have an impact on the level of expressiv-ity, it stays unclear how exactly they correlate. Thus, alldimensions are connected to the expressivity node.

Although there is evidence from the literature that thechoice of gesture types and posture types are dependent oncultural background, e.g. [27], there is no clear correlationof McNeill’s gesture types or Bull’s posture types with thesynthetic cultures. We thus, connected the nodes holdinggesture types and posture types directly to the culture nodeinstead of the dimensions.

In order to reflect the phenomena of gender differencesand non-verbal behavior described above, the gender node ofour influencing factors is connected it to all nodes describingnon-verbal behavior (expressivity, gesture types and posturetypes).

The selected verbal behavior should also influence accom-panying non-verbal behavior. Trees and Manusov [28], forexample, showed that lower expressivity as a non-verbal ex-pression of defence is perceived more polite in female con-versations when showing criticism. Thus, non-verbal be-havior should change in situations where more politeness isexpected (e.g. talking about a sensitive topic or choosing arequest as an speech act). Therefore, nodes holding infor-mation about the selected verbal behavior (speech act andconversational topic) are also directly linked to the nodesholding non-verbal behavior.

It is known from the literature that verbal behavior, e.g.the choice of conversational topic is amongst others depen-dent on cultural background. According to Isbister and col-leagues [17], for example, the categorization into safe andunsafe topics varies with cultural background. However, theaim of our network model is to generate culture-dependentnon-verbal behavior for a given agent dialog. We thereforeneglect dependencies between nodes of our influencing fac-tors (culture, gender, speech act and conversational topic)and assume them to be conditionally independent.

3.2 Parameters of the ModelTo augment the network model with empirical data, culture-

sensitive data had to be prepared and included to our net-

work model by a automated learning process.

3.2.1 Data CollectionThe empirical data used to learn the parameters of our

network was extracted from a multi-cultural corpus recordedfor the *** project [1] in the German and Japanese cultures.For the corpus, more than 20 participants were recorded ineach culture, each running through three scenarios: a first-time meeting, a negotiation, and a conversation with some-one of a higher social status. For the work presented in thispaper, we only consider the first scenario. Each scenariowas recorded with one student interaction partner and oneprofessional actor. This way, participants did not know eachother in advance. Actors were told to be as passive as pos-sible and to allow the participant to lead the conversation.

To prepare the data for our purposes, the videos wereannotated using the Anvil tool [19] that allows to specifybehavioral traits with their parameter and align them ina timely manner. Annotations were conducted by studentworkers of *** and ***. Annotation of speech translitera-tion and translation had to be done in the country wherethe respective videos had been recorded for language rea-sons. Nonverbal behavior was less critical in this regard andwas annotated in both in Germany and in Japan. For eachparticipants the cultural background and gender was addedto the meta data of the annotations. Verbal behavior wasannotated for conversational topics [26] and speech acts us-ing the subset of the DAMSL coding scheme [6] as mentionedabove. Gestures were annotated according to McNeill’s ges-ture types [23] and the expressivity parameters [21]. Eachexpressivity parameter was annotated using a seven-pointscale. The only exception includes the parameter repetition,for which the value denotes the exact number of repetitionsof the stroke of a gesture. For the annotation of differentpostures, Bull’s posture coding scheme [5] was employed forarm postures.

Unfortunately, not all aspects were annotated for eachperson recorded in the corpus. In particular, there are someverbal annotations (speech act and topic) missing in theJapanese part of the corpus, as translation was not avail-able. More crucially, due to human and hardware failure,the timely information on the annotations of expressivitywas lost and could not be reconstructed.

3.2.2 Parameter LearningTo learn the parameters of our network from the anno-

tated corpus data, several steps had to be performed: align-ment of the data, extraction of datasets as well as applyinga learning-algorithm.

Alignment of the Data.In a first step, the different modalities annotated in the

data have to be aligned. Therefore, we divided each anno-tated conversation into conversational blocks that we furtherrefer to as datasets. A dataset is determined by a speechutterance, specified by speech act and conversational topic.For each dataset, there may or may not be an accompanyinggesture and / or posture available.

Behavioral aspects occur at a certain time interval duringthe conversation. Based on the speech act, we added ges-tures and postures to the dataset, in case there is a overlap oftheir intervals. In case a gesture or posture did not occur, anempty token is added to the dataset. Gestures and postures

are added to all speech acts that they accompany to reflectthat a gesture or posture can be held for a longer time. Incase several gestures or postures overlapped with one speechact, the gesture and posture is added to the dataset wherethe overlap with the speech act lasted for longer.

Due to the data loss, the information on gestural expres-sivity could not be aligned to the speech act. To neverthelessbe able to integrate the empirical data on expressivity to thenetwork, two different datasets were used for the learningprocess.

Firstly, we used the aligned dataset to learn the joint prob-ability distributions of gesture types and arm postures (de-pendent on culture, gender and verbal behavior). Secondly,we used the non-aligned dataset to learn the gestural ex-pressivity (dependent on culture and gender only). Due tothe data loss, the dependency between verbal behavior andexpressivity could not be learned from the data and wastherefore removed from our network mode. However, thedata of all participants could be included to this step, asfor the not-aligned dataset the missing verbal annotationsof some Japanese participants did not account.

Extraction of Datasets.To use the datasets as training data they need to be ex-

tracted to a format that the learning algorithm can match tothe network model. The gender and culture of each person,as well as the culture’s positioning on Hofstede’s dimensionsare static and can therefore be specified for each dataset ofa person.

As the network model and the annotated behavioral traitswere parametrized in the same manner, the data could easilybe imported: States for the speech act type were simplyimported by their values. Annotated topics were categorizedto immediate, external and communication situation andaugmented by a none state if the mapping of the topic toone of the categories was not clear.

For each gesture, the annotated type, or none in caseno gesture was performed, is used. The values of gesturalexpressivity are matched from the annotated values to thestates low, medium, and high. The annotated values of pos-tures are imported directly as well, including a none statewhere no posture was performed.

Learning of Parameters.After the extraction, the aligned dataset contained a list

of 2155 dataset values, the non-aligned dataset 457 values,both used for parameter learning. The SMILE-Framework[9] provides, amongst others, an implementation of the EM-algorithm [7]. The EM-algorithm is well suited for our pur-pose, because it is capable of dealing with incomplete data.Empty dataset values occur for different reasons in our data,e.g., certain aspects were not annotated. During the processof learning, the algorithm matches the values in the datasetsto the states of the network. In our two-folded learning ap-proach, firstly, the aligned dataset with information aboutthe speech act was used to learn the probabilities of theparameters of the gesture type and arm posture node. Sec-ondly, the non-aligned dataset was applied to determine theparameters for the gestural expressivity.

3.3 Calculations of the NetworkAfter augmenting the network model with the empirical

data of the annotated video corpus the model reflects the

findings of our former statistical analysis [1], in case onlythe evidence for the cultural background is set. Regardinggesture types, our analysis revealed that the overall numberof gestures per minute is comparable in both cultures. Forthe frequencies of McNeill’s gesture types, no statisticallysignificant differences were found in the data distinguishingthe two cultures. Comparing the two cultures in terms ofgestural expressivity, we found significant differences for allparameters, suggesting that German participants gesturedmore expressively than Japanese participants. In particu-lar, gestures were performed faster and more powerfully inthe German data than in the Japanese one. German par-ticipants used wider space for their gestures compared toJapanese participants. Gestures were also performed morefluently in the German conversations, but the stroke of agesture was repeated more frequently in the Japanese con-versations. Investigating posture types, interestingly, pos-tures that regularly occurred in one culture barely occurredin the other culture. The only exception constitutes puttingthe hands behind the back, which was observed in both cul-tures.

Figure 2 exemplifies the calculations of the network withthe evidence of cultural background being set to German.Setting more evidences, such as the gender or the verbalbehavior, additional information can be obtained. For ex-ample, a correlation of chosen topic and non-verbal behaviorfrequency stayed unnoticed in our earlier work, as we havebeen looking at the behavioral aspects in isolation. Fromprevious analysis of verbal behavior [1], we know that thetopic distribution is different for the two cultures in the data.While in Japan significantly more topics covering the imme-diate situation occurred compared to Germany, in Germanysignificantly more topics covering the communication situa-tion occurred compared to Japan. Setting evidences to thetopic nodes as suggested by their prototypical occurrencesin a culture, the network reveals that people in both culturesare more likely to perform gestures when talking about lesscommon topics, in particular, the communication situationin the Japanese culture and the immediate situation in theGerman culture. This effect could be explained by the ten-dency that talking about a more uncommon topic might leadto a feeling of insecurity that results in an increased usageof gestures. While the results reflect the behavior of theparticipants of our corpus, the Bayesian network allows usto investigate such correlations in an intuitive manner.

4. EVALUATIONIn order to validate the model a 10-fold-cross-validation

was performed. The aligned dataset included 2155 datasetvalues where non-verbal behavior was aligned with the speechof the participants. Leaving every tenth dataset out, train-ing the model with the remaining 90% as described aboveand performing this step ten times, provides us with a val-idation set containing 2155 entries again. For each of thesedatasets, the cultural background (German or Japanese)and the gender (male or female) of the participant is given,as well as the performed verbal behavior (speech act typeand topic category). The accompanying non-verbal behavior(gesture type and posture type) is predicted by the network.Unfortunately, we cannot validate the non-verbal expressiv-ity using this approach due to the missing alignment. For alldatasets the predictions of the network were tested againstthe behavior that was actually observed in the corpus data.

Figure 2: Learned parameters of the Bayesian network with cultural background set to German.

Figure 3: Prediction rates of observed gesture typesand posture types being in the first, second or thirdmost likely gesture and posture type.

In human behavior it appears quite unlikely for a person toperform the exact same way several times in a given situa-tion. For a virtual character’s behavior a similar variety ofbehavior is desirable. With the aim of our project, to pre-dict stereotypical behavior for a given cultural background,we therefore additionally present in our figures whether theperformed gesture type or posture type is finding itself inthe best three guesses of the network, or not.

Figure 3 shows the prediction rates for gesture types andposture types respectively. Results look quite promising,with an overall accuracy of 88% for gesture types and 56%for posture types (with only 16% being not in the top threeguesses out of 32 possible posture types).

However, these results should not be overrated as for manyof the observed speech acts no non-verbal behavior was con-ducted (resulting in the gesture and posture type none). Inparticular, only in 11% of our dataset a gesture was per-formed during a speech act, while in 71% of the speech actsa posture was performed.

We therefore excluded datasets where no gesture or pos-ture was observed and performed another 10-fold-cross-validationleaving the option ”none” out to evaluate prediction rates

Figure 4: Prediction rates of observed gesture typeand posture type being in the first, second or thirdmost likely gesture and posture type, excludingnone-elements.

of our network for those cases where a gesture or postureshould be performed by an agent. In total, 233 speech actswere accompanied by a gesture, 1551 by a posture. Fig-ure 4 summarizes the results. Regarding gesture types, only34% of the performed gestures were correctly predicted bythe network. Considering the fact that there were only fourpotential gesture types to choose from, the accuracy of thepredictor is not better than random. Overall accuracy forposture types is 61% which looks more promising, takinginto account that there are 32 potential posture types tochoose from.

In order to find out whether the predicted gesture andposture types reflect a prototypical cultural background, in afurther evaluation step, we reversed the cultural background.Therefore, we performed a 10-fold-cross validation with thenetwork being set to a Japanese cultural background for theGerman part of the validation set, and vice versa. Includingnone-elements, accuracy rate of gesture types is still 80%,as performing no gesture is the most likely prediction inboth cultures. Leaving none-elements out, accuracy rate forgesture types drops to 24% (see Figure 5).

Figure 5: Prediction rates of observed gesture typesand posture types being in the first, second or thirdmost likely gesture or posture type with the culturalbackground set to the reversed culture, excludingnone elements.

Figure 6: Prediction rates of observed gestural ex-pressivity being in the first, second or third mostlikely category.

Regarding posture types, accuracy rate drops to 5% withthe reversed cultural background including none elements,with 75% of the observed postures falling not even in thetop 3 categories. Excluding none elements, the accuracyrate for postures with reversed cultural background is lessthan 2%, with 92% of the observed posture types not beingin the top three guesses (see Figure 5). Thus, for posturetypes, changing the cultural background leads to a very lowpredictive power of the network, suggesting that the networkis able to predict posture types dependent on the culturalbackground of the speaker.

As mentioned before, gestural expressivity could not becalculated based on verbal behavior, due to missing align-ment. We thus performed a 10-fold-cross validation for theparameters of expressivity based on cultural background andgender only. We thus predict the probabilities for the levelof expressivity given that a gesture is being performed. Amore complete data set could be used in this case, as missingtranslations of verbal behavior could be ignored. In total 457gestures were added to the validation set. Figure 6 shows thecalculated levels of expressivity. Please note that in this caseonly three categories were available (low, medium, high).

5. DEMONSTRATORFor our demonstrator we used a virtual character system

[1] containing culture-specific characters that match a pro-totypical Asian or Western ethnic background. A first timemeeting dialogue, similar to the ones recorded in the cor-pus, was scripted and tagged with the same categorizationsof speech acts and topics as in our network. The cultural

Figure 7: Virtual characters showing prototypicalculture-specific non-verbal behavior.

background and the gender is set for each character. Prob-abilities for culture-specific non-verbal behaviors are gener-ated by the network, added to the dialogue and simulatedwith the characters. Figure 7 shows a screenshot of a maleGerman character in a prototypical German posture (FA -fold arms) in conversation with a female Japanese characterperforming an iconic gesture with a small spatial extent.

For gesture selection, animations were labeled accordingto the gesture types used in the network. Each of the anima-tions can be customized by the animation engine to matchdifferent levels of expressivity. For example, the speed pa-rameter is adapted by using a different frame rate, whiledifferences in the spatial extent are achieved by animationblending. Animations for the most probable posture typesin both cultures were modeled for the characters.

In our former work, we performed perception studies to in-vestigate whether prototypical culture-specific behavior thatfollows the statistical analysis of our video corpus has animpact on the perception of human observers regarding thecharacters’ behavior. Behavioral aspects were tested in iso-lation. Results suggest that observers tend to prefer virtualagent behavior that is in line with their own cultural back-ground for some of the behavioral aspects [1]. As the samestatistical distribution is reflected in the presented networkas well, we assume similar preferences for virtual characters’non-verbal behavior that is generated by the network.

In comparison to the scripted perception studies, the Bayesiannetwork is able to present culture as a non-deterministic con-cept and preserve a certain variety in the characters’ behav-iors. For the demonstrator, we select postures and gesturesthat match the probability distribution of the network in-stead of using the most probable non-verbal behavior calcu-lated by the network.

6. DISCUSSIONThe presented network model is an approximation to re-

ality in two ways: (1) the underlying theories and catego-rizations are sometimes too broad to serve as a basis forculturally dependent behavior generation, and (2) the re-sulting model can only be as meaningful as the data beingused to specify the probabilities.

At the current stage of our work, breaking down cul-ture into dimensions does not provide additional informa-tion compared to a direct link from culture to non-verbal

behavior as only the data of two cultures is used to train thenetwork. However, we consider the dimensional approach agood direction. In case data of more cultures can be ob-tained, the network could potentially allow for interpolationbetween cultures based on cultural dimensions, to simulatebehavior of cultures that were not recorded or to demon-strate synthetic cultures.

Regarding gesture generation, relying on McNeill’s ges-ture categorization and the DAMSL dialogue acts does notseem to be sufficient to predict culturally specific behaviors,as the influence of the semantics of speech that the gestureshould accompany, is not considered in our approach. It ap-pears likely that the content of speech is more crucial for thechoice of gesture, compared to the abstraction to speech actsand conversational topics. Additionally, even if the correctgesture type could be predicted their performance might bevery different across cultures. A deictic gesture, for exam-ple, is typically performed using the index finger in Westerncultures. As this is considered rude in some Asian cultures,deictic gestures here are usually performed using the wholehand. The former statistical analysis showed that the over-all number of gestures is similar in both cultures while nosignificant differences were found in the data regarding thefrequencies of McNeill’s gesture types. Although our net-work augments former results by an alignment of the datawith verbal behavior, it fails in predicting the most probableculture-specific gesture types. Regarding gestural expres-sivity, the data suggested that German participants gesturemore expressively than Japanese participants which is re-flected in our network.

Regarding posture types, the presented network performedwell in predicting body postures based on cultural back-ground. The strong correlation of cultural background andbody posture in our data was also reflected by our statisticalanalysis, showing significant differences in total occurrenceof posture types between the cultures. Due to possible align-ment with several speech acts, our network also considers theduration of a posture which was not reflected in the previousstatistical approach.

7. CONCLUSIONS AND FUTURE WORKThis paper presents a hybrid approach to model a Bayesian

network that determines culture-dependent non-verbal be-havior for virtual character dialogues. In the hybrid ap-proach, the structure of the network along with categoriza-tions of behavioral aspects were constructed based on ex-isting theories and models. The parameters of the networkwere learned from an annotated video corpus that recordedfirst-time meetings of German and Japanese participants re-spectively.

The approach combines advantages of the commonly usedtheory-driven and data-driven approaches, as it explains thecausal relations of cultural background and resulting behav-ior and augments them by findings from empirical data. Apurely theory-driven approach is not able to simulate exist-ing national cultures with the same accuracy as a hybridapproach since it requires a strong abstraction that doesnot cover the subtleties of human behaviors. With a purelystatistical approach, we would suffer a certain flexibility pro-vided by the categorizations and theories.

With the network, the most likely culture-specific non-verbal behavioral type (gesture and posture) is determinedbased on the verbal behavior (speech act and topic) and

cultural background, and the most probable culture-relatedlevel of expressiveness in non-verbal behavior can be addedbased on cultural background. Regarding posture types andgestural expressivity, the network performs quite well in pre-dicting culture-specific behavior. Regarding gesture-types,no reliable predictions could be made based on culture. Thismight on the one hand be caused by the small sample sizeand on the other hand by the missing semantics of speech.

In our demonstrator, verbal behavior is scripted whilenon-verbal behavior is generated and added automatically.Although it is known from the literature that verbal behav-ior can be dependent on cultural background, e.g. the cat-egorization into safe and unsafe topics varies with culturalbackground [17], we did not model these dependencies. Inour future work, we aim on adding a component that de-termines the most likely sequence of speech acts and topicsbased on cultural background, and generate the most prob-able culture-specific non-verbal behavior using the networkafterwards.

In a similar way, a temporal component could improve thepredictions of the network. For example, if a person per-formed a certain body posture for a couple of speech acts,it should be more likely that the same posture is performedduring the subsequent speech act as well. We therefore aimon building a dynamic Bayesian network, that takes previ-ously performed behaviors into account.

The set of non-verbal behaviors used in our network isstill limited. However the model allows to be expanded byfurther aspects. In our future work, we aim on adding ad-ditional non-verbal behavioral traits that are known to bedependent on cultural background such as head nods to ob-tain a more complete model.

8. ACKNOWLEDGMENTS

9. REFERENCES[1] ***. These references were blinded for the reviewing

process.

[2] R. Aylett, L. Hall, S. Tazzymann, B. Endrass,E. Andre, C. Ritter, A. Nazir, A. Paiva, G. J.Hofstede, and A. Kappas. Werewolves, Cheats, andCultural Sensitivity. In Proc. of 13th Int. Conf. onAutonomous Agents and Multiagent Systems (AAMAS2014), 2014.

[3] R. Aylett, A. Paiva, N. Vannini, S. Enz, E. Andre,and L. Hall. But that was in another country: agentsand intercultural empathy. In Proc. of 8th Int. Conf.on Autonomous Agents and Multiagent Systems(AAMAS 2009), 2009.

[4] K. Bergmann and S. Kopp. Bayesian DecisionNetworks for Iconic Gesture Generation. InZ. Ruttkay, M. Kipp, A. Nijholt, and H.-H.Vilhjalmsson, editors, Proc. of 9th Int. Conf. onIntelligent Virtual Agents (IVA 2009), pages 76–89.Springer, 2009.

[5] P. Bull. Posture and Gesture. Pergamon Press,Oxford, 1987.

[6] M. Core and J. Allen. Coding Dialogs with theDAMSL Annotation Scheme. In Working Notes ofAAAI Fall Symposium on Communicative Action inHumans and Machines, pages 28–35, Boston, MA,1997.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin.Maximum likelihood from incomplete data via the emalgorithm. Journal of the Royal Statistical Society.Series B (Methodological), pages 1–38, 1977.

[8] J. Dias and A. Paiva. Feeling and reasoning: acomputational model. In 12th Portuguese Conferenceon Artificial Intelligence, EPIA, pages 127–140.Springer, 2005.

[9] M. J. Druzdzel. SMILE: Structural Modeling,Inference, and Learning Engine and GeNIe: Adevelopment environment for graphicaldecision-theoretic models (Intelligent SystemsDemonstration). In Proc. of the 16th National Conf.on Artificial Intelligence (AAAI-99), pages 902–903.AAAI Press, 1999.

[10] P. E. Gallaher. Individual Differences in NonverbalBehavior: Dimensions of Style. Journal of Personalityand Social Psychology, 63(1):133–145, 1992.

[11] L. Glass. He says, she says: Closing thecommunication gap between the sexes. Piatkus, 1992.

[12] D. Herrera, D. Novick, D. Jan, and D. R. Traum. TheUTEP-ICT Cross-Cultural Multiparty MultimodalDialog Corpus. In Multimodal Corpora Workshop:Advances in Capturing, Coding and AnalyzingMultimodality (MMC 2010), 2010.

[13] G. Hofstede, G.-J. Hofstede, and M. Minkov. Culturesand Organisations. SOFTWARE OF THE MIND.Intercultural Cooperation and its Importance forSurvival. McGraw Hill, 2010.

[14] G. J. Hofstede, P. B. Pedersen, and G. Hofstede.Exploring Culture - Exercises, Stories and SyntheticCultures. Intercultural Press, Yarmouth, UnitedStates, 2002.

[15] K. Hogan. Can’t Get Through: Eight Barriers toCommunication. Pelican Publishing, 2003.

[16] K. Isbister. Agent Culture: Human-Agent Interactionin a Multikultural World, chapter Building BridgesThrough the Unspoken: Embodied Agents toFacilitate Intercultural Communication, pages233–244. Lawrence Erlbaum Associates, 2004.

[17] K. Isbister, H. Nakanishi, T. Ishida, and C. Nass.Helper agent: Designing an assistant forhuman-human interaction in a virtual meeting space.In T. Turner and G. Szwillus, editors, Proc. of Int.Conf. on Human Factors in Computing Systems (CHI2000), pages 57–64. ACM, 2000.

[18] W.-J. Johnson, S. Marsella, and H. Vilhjalmsson. TheDARWARS Tactical Language Training System. InInterservice / Industry Training, Simulation, andEducation Conference, 2004.

[19] M. Kipp. Anvil - A Generic Annotation Tool forMultimodal Dialogue. In Eurospeech 2001, pages1367–1370, 2001.

[20] T. Koda, Z. Ruttkay, Y. Nakagawa, and K. Tabuchi.Cross-Cultural Study on Facial Regions as Cues toRecognize Emotions of Virtual Agents. In T. Ishida,editor, Culture and Computing, pages 16–27. Springer,2010.

[21] J.-C. Martin, S. Abrilian, L. Devillers, M. Lamolle,M. Mancini, and C. Pelachaud. Levels ofRepresentation in the Annotation of Emotion for theSpecification of Expressivity in ECAs. In Proc. of 5th

Int. Conf. on Intelligent Virtual Agents (IVA 2005),pages 405–417. Springer, 2005.

[22] S. Mascarenhas, A. Silva, A. Paiva, R. Aylett,F. Kistler, E. Andre, N. Deggens, G. J. Hofstede, andA. Kappas. Traveller: an intercultural training systemwith intelligent agents. In Proc. of 12th Int. Conf. onAutonomous Agents and Multiagent Systems (AAMAS2013), 2013.

[23] D. McNeill. Hand and Mind - What Gestures Revealabout Thought. University of Chicago Press, Chicago,London, 1992.

[24] A. Mehrabian. Silent messages: Implicitcommunication of emotions and attitudes. Wadsworth,1980.

[25] M. Rehm, N. Bee, and E. Andre. Wave like anegyptian: accelerometer based gesture recognition forculture specific interactions. In BCS HCI (1), pages13–22, 2008.

[26] K. P. Schneider. Small Talk: Analysing PhaticDiscourse. Hitzeroth, Marburg, 1988.

[27] S. Ting-Toomey. Communicating across cultures. TheGuilford Press, New York, 1999.

[28] A. Trees and V. Manusov. Managing face concerns incriticism: Integrating nonverbal behaviors as adimension of politeness in female friendship dyads.Human Communication Research, 24:564–683, 1998.

Modeling and Evaluating a Bayesian Network of Culture ... · a hybrid approach of modeling...

Documents

Transcript of Modeling and Evaluating a Bayesian Network of Culture ... · a hybrid approach of modeling...