
AUDITORY-VISUAL SPEECH ANALYSIS: IN SEARCH OF A THEORY

Christian Kroos

MARCS Auditory Laboratories, University of Western Sydney, Australia
[email protected]

ABSTRACT

In the last decade auditory-visual speech analysis has benefited greatly from advances in face motion measurement technology. Devices and systems have become widespread, more versatile, easier to use and cheaper. Statistical methods to handle the multi-channel data returned by face motion measurements are readily available. However, no comprehensive theory or, minimally, a common framework to guide auditory-visual speech analysis has emerged. In this paper it is proposed that Articulatory Phonology [3], developed by Browman and Goldstein for auditory-articulatory speech production, is capable of filling the gap. Benefits and problems are discussed.

Keywords: Auditory-visual speech, face motion analysis, articulatory phonology

1. INTRODUCTION

When in the early 1960s Charles Hockett attempted to determine the universal “design features” of human language [9], he listed as the first feature: Vocal-auditory channel. Research on sign languages, and an increased awareness of a socio-cultural bias in favour of spoken language in the past, made it clear that this definition needed to be revised, in contrast to the 12 or 15 other features that have retained their validity to the present day. As sign languages have presumably been around for at least as long as vocal languages, and some researchers have even argued that they were the precursors of vocal languages (e.g., [1]), they should, in principle, have been easy to observe and study. However, the visual aspects of spoken language are much more difficult to identify as a genuine part of human speech processing. Though the “lip reading” abilities of the hearing impaired or deaf were well known and acknowledged, it is the tight link to auditory speech that remained unnoticed for a long time. That is, speech reading was classified as a special ability that developed only if the acoustic signal was permanently unavailable.

Besides the greater difficulty in observing the role that visual speech plays in the complex speech production/perception process, there is another crucial difference to the study of other forms of visual communication, like sign language: visual speech can arguably neither be studied nor understood without reference to auditory speech, that is, without reference to vocal tract behaviour or, according to other accounts, the acoustic signal. For the most part visual and auditory speech are intimately linked through their common origin in vocal tract behaviour. Figure 1 shows the three-way relationship among vocal tract behaviour, face motion and the acoustic output. This paper deals with what appears to be the primary dichotomy in the development of auditory-visual speech analysis in recent years:

• on the one hand, there have been impressive advances concerning face motion (and also articulatory) measurement techniques;

• on the other hand, the field is plagued by the lack of any comprehensive theoretical foundation or, minimally, at least a common framework to guide auditory-visual speech analysis.

As a consequence, the results from the research resemble more a patchwork than a gradually advancing research undertaking.

2. ADVANCES IN MEASUREMENT OF FACE MOTION

The rapid development and commercial production of ever more powerful integrated circuits, in general, and CCD and CMOS imaging chips, in particular, has changed the landscape in motion tracking. Hardware components for optical “mocap” systems became affordable as they were mass-produced for the consumer market, and the necessary computationally intense processing of the raw data became manageable as general-purpose and specialized DSP chips became cheaper. A brief and by no means comprehensive review is given in the following.

2.1. Marker-based systems

Marker-based systems can be divided into two groups according to whether they use active or passive markers. Active markers are usually LEDs that emit infrared light pulses at a characteristic frequency for each marker, e.g., Optotrak (Northern Digital, Inc.), PhoeniX (PhoeniX Technologies, Inc.). They have the advantage that the sensors are tracked consistently and with high accuracy. On the downside, the number of sensors is limited, the sensors are relatively big and are wired to a device providing power to the LEDs and generating the pulse series with the characteristic frequency (though PhoeniX just introduced wireless LEDs). Accordingly, researchers who employed these systems in their experiments had to make careful decisions about where to place the available 15-20 markers.

Figure 1: Relations among vocal tract behaviour (Vp), face motion (Fp), and acoustic signal (Ap) in auditory-visual speech

Passive marker systems (e.g., Qualisys (Qualisys Medical AB), Elite (BTS), Vicon (Oxford Metrics)) typically use highly reflective spherical markers. Small markers are available and there are no wires impeding natural face motion, but since single markers cannot be uniquely identified, marker confusion problems can only be resolved post hoc.

Both types of system have problems with reflections from sources other than the markers, though passive marker systems are in general more strongly affected. Also, for many tasks marker occlusions can only be avoided by adding more and more camera devices to the system.

An interesting alternative to the commercial systems mentioned above is the “home-made” passive marker system developed at ICP (Grenoble, France) that uses beads as markers and tracks them with an appearance model [17].

2.2. Marker-free methods

Most marker-free methods operate on plain video recorded with a single camera or multiple cameras, and the actual tracking happens exclusively on the software side (image processing and computer vision algorithms). However, face motion poses some notoriously difficult-to-handle problems:

• face motion is non-rigid motion; its degrees of freedom cannot be determined and, if approximated, are dependent on the chosen resolution;

• some areas of the face, like the cheeks, do not exhibit strong image gradients.

Accordingly, face motion cannot be conveniently modelled as rigid body motion, and certain areas provide very little structure on which to base the motion tracking. The second problem can be alleviated by introducing structure artificially. Tracking using structured light [4] projects a light grid onto the participant's face during the recording, and the deformation of the regular pattern allows reconstruction of the 3D shape of the face and its changes over time. However, since no flesh points are tracked, the usefulness of the results is limited. In the commercial system Mova Contour (Mova), phosphorescent makeup applied to the face creates randomly scattered, irregularly shaped, but very dense patterns on the facial surface which only show up in fluorescent light. The moving face is then recorded with a camera array and the patterns are tracked retrospectively at very high resolution.

Another way to deal with the problems mentioned above is to circumvent them. Feature tracking methods measure the changes over time of predetermined facial features, e.g., lips, eyebrows, chin outline, nostrils. Note that the choice of features is usually governed by technical considerations regarding the tracking process, i.e., the requirement of a strong image gradient, and not by conclusive knowledge of the working mechanisms of auditory-visual speech processing. Nevertheless, for many tasks, for instance adding visual speech parameters to the acoustic feature vector in Automatic Speech Recognition (ASR), the impoverished description of speech face motion gained in this way might be sufficient.

Global video-based face motion tracking methods ideally make no high-level assumptions about the areas of the face or the nature of the face motion to be tracked. Roughly speaking, two main approaches can be identified:

1. Using dense optical flow (for a review see [2]) and applying additional constraints beyond the smoothness requirements at some stage of the tracking. For instance, in [5] the direct optical-flow-based tracking was optimized with an underlying muscle model using Kalman filtering.

2. Generating some kind of appearance model and using image registration techniques to find the motion parameters that would have most likely produced the actual video image. For instance, in [13] a parameterised ellipsoidal mesh was used to model the facial surface as a whole and areas between neighbouring mesh nodes to model patches of it. A two-dimensional normalized cross-correlation served as the technique to register the patches from the previous video frame with the current one. A coarse-to-fine strategy employing different mesh resolutions and a wavelet-based multi-resolution analysis on the image side were subsequently used to gradually increase the resolution of the tracking.
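To make the registration step of the second approach concrete, the following is a minimal sketch (not code from [13]) that registers a single facial patch between two consecutive greyscale frames with two-dimensional normalized cross-correlation, here via OpenCV's template matching; the patch size, search radius and frame variables are illustrative assumptions, and the patch is assumed to stay well inside the image.

```python
import cv2
import numpy as np

def track_patch(prev_frame, cur_frame, centre, patch=21, search=15):
    """Estimate the displacement of a small facial patch between two
    consecutive greyscale frames via normalized cross-correlation."""
    x, y = centre
    h = patch // 2
    # Template: the patch around the old position in the previous frame.
    template = prev_frame[y - h:y + h + 1, x - h:x + h + 1]
    # Search window around the old position in the current frame.
    win = cur_frame[y - h - search:y + h + search + 1,
                    x - h - search:x + h + search + 1]
    score = cv2.matchTemplate(win, template, cv2.TM_CCORR_NORMED)
    _, _, _, best = cv2.minMaxLoc(score)   # (x, y) of the correlation peak
    return (x + best[0] - search, y + best[1] - search)

# Usage (illustrative): new_centre = track_patch(frame_t, frame_t1, (320, 240))
```

In a full tracker this step would be repeated for every mesh patch and embedded in the coarse-to-fine scheme described above.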

3. MEASUREMENTS AND WHAT THEN?

The typical measurement results consist of a time series of ten to several tens of measurement points with high internal correlations (multi-collinearity). Obviously a dimensionality reduction would be a preferable first step. And it is here where the problems start. A statistically motivated, all-purpose procedure like Principal Component Analysis (see [11] for a review) yields good results, usually recovering a high amount of the variance in the data with only a few components [13, 8]. However, the components become increasingly difficult to interpret beyond the first two, which usually can be attributed to jaw movements and lip rounding.
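As an illustration of this first step, the following minimal sketch applies Principal Component Analysis to a matrix of flattened marker trajectories using scikit-learn; the file name, the number of markers and the retained-variance threshold are assumptions made for the example, not values from the cited studies.

```python
import numpy as np
from sklearn.decomposition import PCA

# T frames of, say, 20 markers in 3D, flattened to shape (T, 60);
# the file name and marker count are hypothetical.
motion = np.load("face_markers.npy").reshape(-1, 60)

# Keep as many components as needed to explain 95% of the variance
# (PCA centres the data internally).
pca = PCA(n_components=0.95)
scores = pca.fit_transform(motion)

print(pca.n_components_, "components retained")
print("explained variance ratios:", np.round(pca.explained_variance_ratio_[:5], 3))
```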

In fact, it is not even likely that the higher components pick up physiologically meaningful behaviour: this would only be the case if the measured surface motion could be expressed as a linear combination of the underlying face muscle activity, but the passive “filtering” effect of the different layers of facial tissue and the intricately distributed insertion points of the facial muscles make this relationship more complex and probably non-linear.

An alternative first step in analysis consists of determining components using information from the other “modalities”, e.g., components that maximize the correlation (Canonical Correlation Analysis) or the joint covariance (Partial Least Squares) between either the articulatory or the acoustic signal and the face motion data. But there are drawbacks here, too: first, simultaneously recorded articulatory and face motion data are not readily available (but see [20]), and second, procedures like this can only capture linear relationships. The latter might not be the severe limitation it seems to be at first sight. Previous research [21] has shown that a high amount of variance can be “explained” with linear models, and in many of the remaining cases the problem can be linearized. The quotation marks enclosing “explained” are well deserved, since determining meaningful components in face motion via their relation to the other modalities does not explain anything, but simply pushes the explanation problem to a different level, one step higher. All findings remain phenomenological, although now inter-connected between modalities, if they cannot be evaluated relative to a theory of auditory-visual speech production. And such a theory is still missing.
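For concreteness, the cross-modal first step described at the beginning of this section might look like the following sketch, here using the CCA and PLS implementations in scikit-learn; the feature matrices, their dimensionalities and the number of components are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA, PLSCanonical

# Hypothetical, frame-synchronous feature matrices: X could hold acoustic
# (or articulatory) features, Y the face motion channels.
X = np.load("acoustic_features.npy")   # shape (T, d_a)
Y = np.load("face_motion.npy")         # shape (T, d_f)

# Canonical Correlation Analysis: paired components maximizing correlation.
cca = CCA(n_components=4).fit(X, Y)
Xc, Yc = cca.transform(X, Y)
r = [np.corrcoef(Xc[:, k], Yc[:, k])[0, 1] for k in range(4)]
print("canonical correlations:", np.round(r, 3))

# Partial Least Squares: paired components maximizing covariance instead.
pls = PLSCanonical(n_components=4).fit(X, Y)
Xp, Yp = pls.transform(X, Y)
```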

4. ISSUES REGARDING A THEORY OF AUDITORY-VISUAL SPEECH PRODUCTION

4.1. Theoretical foundations (or lack thereof)

All major speech perception theories (e.g., [14, 6, 15]) had to account for auditory-visual speech, since effects of visual speech, like the increase in intelligibility in noisy environments when watching the speaker's face [19], or the McGurk effect [16], are well documented and were found to be pronounced and robust.

However, in our view there is only one comprehensive theory of speech production: Browman and Goldstein's Articulatory Phonology (AP) [3] in combination with Saltzman's Task-Dynamic Model (TD) [18]. Very briefly, Articulatory Phonology posits that the units of speech are dynamic articulatory gestures realized through movements of independently acting vocal tract organs. The latter are identified as the lips, the tongue tip, body and dorsum, the velum and the glottis. Gestures are defined as dynamical systems, the behaviour of which can be fully described with a characteristic set of parameter values according to the Task-Dynamic Model. Every possible utterance of any language can then be modelled as parallel and serial combinations of these gestures. That is, gestures can be either executed simultaneously or strung together in time. Most of the time, however, both situations arise, creating different amounts of gestural overlap. The gestures themselves are discrete and context-independent; their low-level (physical) realisations are the continuous, context-dependent and multi-dimensional patterns of vocal tract behaviour found by speech production research. Note that the high-level (phonological) gestures do not need to specify details of the motor execution like context effects, since these are lawfully entailed by the dynamical systems implementing the gestures.
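To make the dynamical-systems formulation more tangible, here is a minimal numerical sketch of a single gesture modelled as a critically damped second-order (point-attractor) system, in the spirit of the Task-Dynamic account; the stiffness, duration and target values are illustrative and not taken from the AP/TD literature.

```python
import numpy as np

def gesture(x0, target, k=200.0, dt=0.001, dur=0.3):
    """Trajectory of one gesture: x'' + b*x' + k*(x - target) = 0 with
    unit mass and critical damping b = 2*sqrt(k), started from rest at x0."""
    b = 2.0 * np.sqrt(k)
    x, v = float(x0), 0.0
    traj = []
    for _ in range(int(round(dur / dt))):
        a = -k * (x - target) - b * v   # restoring force toward the target
        v += a * dt                     # simple forward-Euler integration
        x += v * dt
        traj.append(x)
    return np.array(traj)

# e.g. a hypothetical lip-aperture gesture closing from 12 mm to a 2 mm target
aperture = gesture(x0=12.0, target=2.0)
print(aperture[[0, 100, -1]])   # start, after 0.1 s, end of the movement
```

Overlapping gestures would correspond to several such systems being active at the same time and sharing articulators, which is where the Task-Dynamic machinery proper comes in.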

In principle, there is nothing preventing the treatment of visual speech production within the framework of AP/TD. Since the basic units of speech are assumed to be articulatory gestures, the sensory channel through which the information about them is transmitted is of secondary importance. However, no explicit references to visual speech have so far been made.

4.2. Causal and non-causal functional relationships

If auditory-visual speech can be described within AP/TD, is there then any need to have a separate theory (or a branch of it) dedicated to auditory-visual speech? The answer to this question depends on the answer to another question: are the visible speech gestures exclusively a causal consequence of the vocal tract gestures, or are there non-causal functional relationships, too?

• If all face motion providing phonetic information were the mechanical consequence of articulator movements, visual speech could be explained within AP/TD and no separate approach would be necessary (but see Section 4.3 for remaining problems).

• If there are non-causal functional relationships, e.g., between head or eyebrow movements and laryngeal activity, the picture becomes more complicated, since it cannot be safely assumed that the related visual gestures behave in the same way as the vocal tract gestures.

Unfortunately, the research to answer the second question has not yet been conducted. In Section 3 we described ways to statistically assess the relationships between vocal tract behaviour and face motion. In studies employing the described or similar methods, face motion data and articulatory data were not recorded simultaneously, with the exception of [12], and in all of them the articulatory data were limited to the mid-sagittal plane, thus unable to capture important characteristics of certain phoneme categories (laterals), missing potential sources of variance of others (e.g., the depth of the groove occurring at tongue back and root for many vowels) and ignoring lateral asymmetries in general. High cross-modal correlations recovering high amounts of variance were found nevertheless. Keeping in mind the usual statistical textbook warnings about inferring causality from correlations, and the above shortcomings of the measurement methods, it becomes clear that we currently cannot decide which relations are causal and which are not, with the exception of the trivial case of jaw and lip movements, where the domains of visual and acoustic/auditory speech overlap. An ideal tool for investigating and answering this question would be a complete and detailed model of the vocal tract and the face, including bone structure, cartilages, muscles, tendons, etc. Clearly, such a model, be it a virtual computer model or a real-world mechanical one, is still far away.

4.3. Articulatory Phonology and visual speech

Even if we assume that all relationships are causal or, equivalently, that existing non-causal functional relationships behave in all relevant aspects as if they were causal, interpreting speech face motion within the framework of AP/TD is still not straightforward. If visual speech were investigated without any knowledge about the acoustic/auditory channel, the two primary articulators would be the lips and the jaw. As reviewed in Section 3, they emerge as dominant essential components, and though lip shape and jaw position are not fully independent, the fact that they are recovered by different principal components shows that they are functionally orthogonal, at least in ordinary speech. Lip gestures map directly from the articulatory-auditory to the visual domain: changes in the lip configuration lead to changes in the resonance frequencies of the vocal tract and they are also entirely visible. The jaw, however, is not an articulatory organ in AP, but plays a subordinate role as part of a coordinative structure realizing gestures of the lips and the tongue. Accordingly, some of the articulatory gestures must be fully or partially reflected in jaw movements.

Figure 2 visualizes potential mappings. Dashed lines indicate that there is some but not (yet) conclusive evidence for a relationship; dotted lines indicate hypothesized relations that, if they exist, would produce very weak effects.

Figure 2: Mapping of articulatory organs to their visual counterparts. Auditory-articulatory organs: lips, tongue tip, tongue body, tongue root, velum, larynx. Visual-articulatory counterparts: lips, tongue tip (dentals), jaw, lower face surface deformation, eyebrows, head, other.

As suggested in the figure, tongue tip, body, and root gestures might to a certain degree be recoverable from jaw position. But none of the three tongue “organs” confines the jaw to a certain position in an absolute sense, which should make it difficult to detect the active organ reliably. In addition, compensatory behaviour involving jaw and tongue (e.g., [7]) should result in visual ambiguity regarding the degree of constriction. Accordingly it can be predicted that those gestures that require the most precise jaw settings in natural unperturbed speech should be the easiest to recover and should be the default in ambiguous cases.

It can also be assumed that there are other minor surface deformations of the lower face that are consequences of tongue gestures. Velum gestures most likely have no visual complement at all, though some very subtle indications of a nasal consonant, or more likely its absence (e.g., increased intra-oral air pressure moving the cheeks slightly outwards for stop consonants, but not for nasals), cannot be completely ruled out. Correlations between the larynx and F0 have been found, but whether there are any visual complements to the voicing gesture needs to be investigated.

Similar to the articulatory-to-acoustic relationship, yet much more pronounced, the mapping between vocal tract behaviour and face motion appears to be a many-to-one mapping. As shown in [10] for the articulatory-to-acoustic mapping, the inverse problem might be solvable in many cases. It would be worthwhile to investigate whether the differences in the many-to-one mappings are in fact to some degree complementary, so that they could be used mutually to provide more constraints and help to disambiguate further.
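One simple, purely linear way to probe such complementarity, offered here as a sketch under strong assumptions rather than an established procedure, is to recover the articulatory channels by ridge regression from the acoustic features alone, from the face motion alone, and from both combined, and to compare the held-out error; all file names and dimensionalities below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical, frame-synchronous recordings of the three modalities.
A = np.load("articulatory.npy")   # shape (T, d_art), e.g. EMA channels
X = np.load("acoustic.npy")       # shape (T, d_ac)
V = np.load("face_motion.npy")    # shape (T, d_vis)

def heldout_rmse(features):
    """Fit a ridge regression from the given features to the articulatory
    channels and return the root-mean-square error on held-out frames."""
    f_tr, f_te, a_tr, a_te = train_test_split(features, A,
                                              test_size=0.2, random_state=0)
    pred = Ridge(alpha=1.0).fit(f_tr, a_tr).predict(f_te)
    return np.sqrt(np.mean((pred - a_te) ** 2))

for name, feats in [("acoustic only", X),
                    ("visual only", V),
                    ("combined", np.hstack([X, V]))]:
    print(name, heldout_rmse(feats))
```

A clearly lower error for the combined feature set than for either modality alone would, at least within the linear regime, indicate that the two many-to-one mappings constrain articulation in partly complementary ways.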

The major advantage of studying auditory-visual speech within the AP/TD model and using it as a framework for the analysis is that within it articulatory gestures are well defined both on the physical level (surface phenomena) and on the cognitive speech processing level (phonology), in contrast to segmental approaches where, for instance, the complement of a phoneme on the physical level is hard to define.¹

On the gestural level, predictions about visual gestures can be made, and they can guide visual speech analysis. A comparison with the analysis of auditory speech makes the substantial improvement in the “analysis landscape” immediately evident. If acoustic speech analysis in the past had been more or less limited to seeking statistical regularities in the acoustic signal and correlating them with articulatory measurements, then many of the advances in the field would not have been possible, no matter how carefully experiments on the perception side were designed. Describing and analysing acoustic speech within theoretical frameworks that explained the underlying mechanisms of its production process allowed the progression beyond the level of simply aggregating phenomenological findings.

5. CONCLUSION AND OUTLOOK

In the last decade auditory-visual speech analysis has profited massively from advances in face motion tracking and measurement technology. Devices and systems have become widespread, more versatile, easier to use and cheaper. Statistical tools to handle the multi-channel data returned by face motion measurements are readily available. It seems, however, that, casually formulated, it is not clear what we are looking for. While on the application side the field has been able to progress without the need to understand the underlying mechanisms of auditory-visual speech in detail, e.g., in AVASR (Audio-Visual Automatic Speech Recognition), auditory-visual speech analysis critically requires a deeper understanding of speech face motion and how it is controlled in humans in order to advance in the same way.

AP together with TD offers in our view a very promising framework, though there are open questions to be answered first and many problems to be solved along the way. It would bring a very valuable dowry to the marriage, so to speak: a unified gestural approach spanning the entire phonetic-phonological domain that has the potential to explain the difficulties in determining the number of visemes and peculiarities of the McGurk effect. However, to sort out the remaining problems, speech face motion needs to be studied on the gestural level and hypotheses regarding the relationship between articulatory gestures and face motion need to be tested. Since currently no complete model of the human vocal tract and face exists, the quest for a theory that has been the topic of this paper ironically has to begin (and the paper to end) with a call for more data. The shortage of simultaneous 3D realtime recordings of vocal tract behaviour, face motion and the acoustic signal has already been mentioned. The situation looks even more bleak when taking into account that any reasonable attempt to understand auditory-visual speech needs data from many speakers and many languages/dialects. For comparison, simply imagine that phonetics and phonology had until now covered only, say, English, German, Swedish, French, and Japanese.

At MARCS Auditory Laboratories we recently created a new facility to study auditory-visual speech production (and perception), offering the possibility to record data in the three modalities simultaneously. The setup consists of a 3D-EMA system (AG500, Carstens Medizinelektronik), a high-resolution face motion tracking system (Vicon system with four MX40 cameras, Oxford Metrics), a high-speed high-resolution camera (Phantom v10 with ImageCube high-speed mass storage system, Vision Research) and three powerful workstations and a data server. Figure 3 shows the setup. To minimise reflection the acrylic glass cube of the EMA system was painted black. The combined equipment can record articulatory and face motion data at 200 Hz, with the 4 megapixel resolution (2352 x 1728 pixels at up to 160 fps) of the motion-tracking Vicon cameras only slightly reduced by windowing and the colour video Phantom camera still at full resolution (2,400 x 1,800 pixels at up to 480 fps). Thus we are now in the position to examine auditory-visual speech as the cross-modal process it genuinely is and to provide data to test a gestural account of speech face motion.

Figure 3: Setup of the system to study auditory-visual speech production at MARCS Auditory Laboratories. E: Electromagnetic Articulograph; VM: Vicon MX40 system; A: Audio; S: Stimulus presentation; V: High-speed Video

The author would like to thank Michael Tyler, Chris Davis, and two anonymous reviewers for many insightful comments.

6. REFERENCES

[1] Arbib, M. A. 2002. The mirror system, imitation, and the evolution of language. In: Dautenhahn, K., Nehaniv, C. L. (eds), Imitation in animals and artifacts. Cambridge, Massachusetts: MIT Press, 229–280.

[2] Barron, J. L., Fleet, D. J., Beauchemin, S. S. 1994. Performance of optical flow techniques. International Journal of Computer Vision 12(1), 43–77.

[3] Browman, C. P., Goldstein, L. 1992. Articulatory phonology: An overview. Phonetica 49, 155–180.

[4] Carter, J., Shadle, C., Davies, C. May 1996. On the use of structured light in speech research. ETRW - 4th Speech Prod. Seminar, Autrans, 229–232.

[5] Essa, I., Pentland, A. July 1997. Coding, analysis, interpretation and recognition of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 757–763.

[6] Fowler, C. A. 1986. An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics 14, 3–28.

[7] Geumann, A., Kroos, C., Tillmann, H. G. 1999. Are there compensatory effects in natural speech? Proceedings of the 14th International Congress of Phonetic Sciences (ICPhS), San Francisco, USA.

[8] Gutierrez-Osuna, R., Kakumanu, P. K., Esposito, A., Garcia, O. N., Bojórquez, A., Castillo, J. L., Rudomín, I. 2005. Speech-driven facial animation with realistic dynamics. IEEE Transactions on Multimedia 7(1), 33–42.

[9] Hockett, C. 1963. The problem of universals in language. In: Greenberg, J. (ed), Universals of Language. Cambridge, MA: MIT Press.

[10] Hogden, J., Rubin, P., McDermott, E., Katagiri, S., Goldstein, L. in press. Inverting mappings from smooth paths through R^n to paths through R^m: A technique applied to recovering articulation from acoustics. Speech Communication.

[11] Jackson, J. E. 1991. A user's guide to principal components. New York: John Wiley & Sons.

[12] Jiang, J., Alwan, A., Bernstein, L. E., Keating, P., Auer, E. 2000. On the correlation between facial movements, tongue movements and speech acoustics. International Conference on Spoken Language Processing, volume 1, Beijing, China, 42–45.

[13] Kroos, C., Kuratate, T., Vatikiotis-Bateson, E. 2002. Video-based face motion measurement. Journal of Phonetics (special issue) 30, 569–590.

[14] Liberman, A. M., Mattingly, I. G. 1985. The motor theory of speech perception revised. Cognition 21(1), 1–36.

[15] Massaro, D. W. 1998. Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press.

[16] McGurk, H., MacDonald, J. 1976. Hearing lips and seeing voices. Nature 264, 746–748.

[17] Revèret, L., Bailly, G., Badin, P. 2000. MOTHER: a new generation of talking heads providing a flexible articulatory control for video-realistic speech animation. International Conference on Speech and Language Processing, Beijing, China.

[18] Saltzman, E. L., Munhall, K. G. 1989. A dynamical approach to gestural patterning in speech production. Ecological Psychology 1(4), 333–382.

[19] Sumby, W., Pollack, I. 1954. Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America 26, 212–215.

[20] Wrench, A. 2000. A multi-channel/multi-speaker articulatory database for continuous speech recognition research. Phonus 5, 1–13.

[21] Yehia, H. C., Rubin, P. E., Vatikiotis-Bateson, E. 1998. Quantitative association of vocal-tract and facial behavior. Speech Communication 26, 23–44.

¹ Strictly speaking, a viseme (as the term is currently used) is not defined at all, since there is no visual-only language of which it could be the smallest unit conveying a distinction of meaning: only sign languages could have visemes.
