Department of Speech, Language, and Hearing...

12
Vowel and consonant contributions to vocal tract shape Brad H. Story a Department of Speech, Language, and Hearing Sciences, Speech Acoustics Laboratory, University of Arizona, Tucson, Arizona 85721 Received 22 October 2008; revised 13 May 2009; accepted 13 May 2009 The purpose of this study was to develop a method by which a vowel-consonant-vowel VCV utterance based on x-ray microbeam articulatory data could be separated into a vowel-to-vowel transition and a consonant superposition function. The result is a model that represents a vowel sequence as a time-dependent perturbation of the neutral vocal tract shape governed by coefficients of canonical deformation patterns. Consonants were modeled as superposition functions that can force specific portions of the vocal tract shape to be constricted or expanded, over a specific time course. The three VCVs .pÄ, .tÄ, and .kÄ, produced by one female speaker, were analyzed and reconstructed with the developed model. They were shown to be reasonable approximations of the original VCVs, as assessed qualitatively by visual inspection and quantitatively by calculating rms error and correlation coefficients. This establishes a method for future modeling of other speech material. © 2009 Acoustical Society of America. DOI: 10.1121/1.3158816 PACS numbers: 43.70.Bk, 43.70.Jt, 43.70.Aj CHS Pages: 825–836 I. INTRODUCTION It is well known that speech does not merely consist of a serial concatentation of individual speech sounds, but rather a continuous blending of one sound into another. Production of a blended acoustic stream requires that movements of the articulatory structures be temporally and spatially coordi- nated such that they are engaged in generating more than one sound segment at any instant of time. This process of over- lapping speech movements, commonly referred to as coar- ticulation e.g., Kozhevnikov and Chistovich, 1965; Öhman, 1966; Kent and Minifie, 1977 or coproduction e.g., Fowler, 1980, must be included in the development of any realistic model of speech production. Because movement of the ar- ticulators has the collective effect of determining the shape of the vocal tract airspace, and because the shape of the airspace is the most direct connection to the acoustic reso- nances expressed in the speech signal as formants, it is im- portant to develop a model that directly relates overlap of phonetically-relevant gestures to the time-dependent changes of the tract shape during production of speech. The purpose of this study was to determine whether realistic changes in vocal tract shape can be accurately represented with a model where coarticulated speech is produced by superimposing consonant gestures on an ongoing sequence of vowel transi- tions. This particular view of coarticulation seems to have originated with Öhman’s 1966 spectrographic analysis of vowel-consonant-vowel VCV utterances. The results sug- gested that vowels and consonants are generated by two par- allel production systems and place coincident demands on the same articulatory structures. Thus, a VCV is considered to be produced as a vowel-vowel VV sequence upon which a consonant gesture is superimposed. Based on a cineradio- graphic study of /h.’CV/ utterances, Perkell 1969 similarly concluded that production of vowels and consonants is car- ried out by two functionally different types of articulatory activity. He noted that consonant articulations are generally faster and require more precise timing than vowel articula- tions. This led to the hypothesis that articulatory activity could be divided into two classes in which the extrinsic speech musculature is utilized primarily for vowel produc- tion, whereas the intrinsic musculature of the tongue and lips executes the place, manner, and degree of a specific conso- nant that is additively superimposed on the overall vocal tract posture provided by the extrinsic musculature. That is, rapid, precise consonantal gestures are superimposed on an underlying vowel gesture. Gracco 1992 also referred to a functional division of vocal tract movement into two general categories: “those that produce and release constrictions valving, and those that modulate the shape or geometry of the vocal tract.” Löfqvist and Gracco 1999 later reported that, during production of a VCV, tongue movement toward the final vowel often began before the consonant closure had occurred, and further that most of the movement from the initial vowel occurred during the consonant closure. Both findings suggest that the overall shaping of the vocal tract continues while constrictions are imposed and released. The paradigm of functional division of the articulatory system into separate vowel and consonant classes has influ- enced various control strategies for articulatory and vocal tract models. Öhman 1967 followed his spectrographic study by proposing a model that allows for interpolation of the midsagittal cross-distance width of one vowel shape to another over the time course of an utterance. Simultaneously, a consonant constriction function can be activated that varies over the same time course as the vowel component. At each successive point in time, the consonantal function is super- imposed on the modeled vowel substrate to produce a com- posite tract shape. Nakata and Mitsuoka 1965 and Ichikawa and Nakata 1968 used the idea of superimposing a conso- nant on a VV transition in a rule-based speech synthesizer. a Electronic mail: [email protected] J. Acoust. Soc. Am. 126 2, August 2009 © 2009 Acoustical Society of America 825 0001-4966/2009/1262/825/12/$25.00

Transcript of Department of Speech, Language, and Hearing...

Page 1: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

Vowel and consonant contributions to vocal tract shapeBrad H. Storya�

Department of Speech, Language, and Hearing Sciences, Speech Acoustics Laboratory, University ofArizona, Tucson, Arizona 85721

�Received 22 October 2008; revised 13 May 2009; accepted 13 May 2009�

The purpose of this study was to develop a method by which a vowel-consonant-vowel �VCV�utterance based on x-ray microbeam articulatory data could be separated into a vowel-to-voweltransition and a consonant superposition function. The result is a model that represents a vowelsequence as a time-dependent perturbation of the neutral vocal tract shape governed by coefficientsof canonical deformation patterns. Consonants were modeled as superposition functions that canforce specific portions of the vocal tract shape to be constricted or expanded, over a specific timecourse. The three VCVs �.pÄ�, �.tÄ�, and �.kÄ�, produced by one female speaker, were analyzed andreconstructed with the developed model. They were shown to be reasonable approximations of theoriginal VCVs, as assessed qualitatively by visual inspection and quantitatively by calculating rmserror and correlation coefficients. This establishes a method for future modeling of other speechmaterial. © 2009 Acoustical Society of America. �DOI: 10.1121/1.3158816�

PACS number�s�: 43.70.Bk, 43.70.Jt, 43.70.Aj �CHS� Pages: 825–836

I. INTRODUCTION

It is well known that speech does not merely consist of aserial concatentation of individual speech sounds, but rathera continuous blending of one sound into another. Productionof a blended acoustic stream requires that movements of thearticulatory structures be temporally and spatially coordi-nated such that they are engaged in generating more than onesound segment at any instant of time. This process of over-lapping speech movements, commonly referred to as coar-ticulation �e.g., Kozhevnikov and Chistovich, 1965; Öhman,1966; Kent and Minifie, 1977� or coproduction �e.g., Fowler,1980�, must be included in the development of any realisticmodel of speech production. Because movement of the ar-ticulators has the collective effect of determining the shapeof the vocal tract airspace, and because the shape of theairspace is the most direct connection to the acoustic reso-nances expressed in the speech signal as formants, it is im-portant to develop a model that directly relates overlap ofphonetically-relevant gestures to the time-dependent changesof the tract shape during production of speech. The purposeof this study was to determine whether realistic changes invocal tract shape can be accurately represented with a modelwhere coarticulated speech is produced by superimposingconsonant gestures on an ongoing sequence of vowel transi-tions.

This particular view of coarticulation seems to haveoriginated with Öhman’s �1966� spectrographic analysis ofvowel-consonant-vowel �VCV� utterances. The results sug-gested that vowels and consonants are generated by two par-allel production systems and place coincident demands onthe same articulatory structures. Thus, a VCV is consideredto be produced as a vowel-vowel �VV� sequence upon whicha consonant gesture is superimposed. Based on a cineradio-graphic study of /h.’CV/ utterances, Perkell �1969� similarly

a�

Electronic mail: [email protected]

J. Acoust. Soc. Am. 126 �2�, August 2009 0001-4966/2009/126�2

concluded that production of vowels and consonants is car-ried out by two functionally different types of articulatoryactivity. He noted that consonant articulations are generallyfaster and require more precise timing than vowel articula-tions. This led to the hypothesis that articulatory activitycould be divided into two classes in which the extrinsicspeech musculature is utilized primarily for vowel produc-tion, whereas the intrinsic musculature of the tongue and lipsexecutes the place, manner, and degree of a specific conso-nant that is additively superimposed on the overall vocaltract posture provided by the extrinsic musculature. That is,rapid, precise consonantal gestures are superimposed on anunderlying vowel gesture. Gracco �1992� also referred to afunctional division of vocal tract movement into two generalcategories: “…those that produce and release constrictions�valving�, and those that modulate the shape or geometry ofthe vocal tract.” Löfqvist and Gracco �1999� later reportedthat, during production of a VCV, tongue movement towardthe final vowel often began before the consonant closure hadoccurred, and further that most of the movement from theinitial vowel occurred during the consonant closure. Bothfindings suggest that the overall shaping of the vocal tractcontinues while constrictions are imposed and released.

The paradigm of functional division of the articulatorysystem into separate vowel and consonant classes has influ-enced various control strategies for articulatory and vocaltract models. Öhman �1967� followed his spectrographicstudy by proposing a model that allows for interpolation ofthe midsagittal cross-distance �width� of one vowel shape toanother over the time course of an utterance. Simultaneously,a consonant constriction function can be activated that variesover the same time course as the vowel component. At eachsuccessive point in time, the consonantal function is super-imposed on the modeled vowel substrate to produce a com-posite tract shape. Nakata and Mitsuoka �1965� and Ichikawaand Nakata �1968� used the idea of superimposing a conso-

nant on a VV transition in a rule-based speech synthesizer.

© 2009 Acoustical Society of America 825�/825/12/$25.00

Page 2: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

Similarly, Båvegård �1995� and Carré and Chennoukh �1995�both reported vocal tract area function models where conso-nant constrictions are superimposed on an interpolation of avowel-to-vowel transition. Browman and Goldstein’s �1990�development of “articulatory phonology” also reflects theideas suggested by Öhman’s work. In their view, speech isproduced by a series of overlapping gestures created by ac-tivation of “tract variables” such as constriction location anddegree of the tongue body and tip. Models based on similarconceptualizations of speech production have been proposedby Fowler and Saltzman �1993�, Byrd �1996�, and Tjaden�1999�.

More recently, Story �2005a� described a model of thevocal tract area function that incorporates many aspects ofthe type of speech production system suggested by Öhman�1966, 1967� and Perkell �1969�. The model consists of mul-tiple hierarchical tiers,1 each of which is capable of imposingparticular types of perturbation on the vocal tract shape. Inthe first tier, vowels and vowel-to-vowel transitions are pro-duced by perturbing a phonetically-neutral vocal tract con-figuration with two canonical deformation patterns calledmodes. The modes, which were derived by principal compo-nent analysis, are shaped such that they efficiently exploit theacoustic properties of the vocal tract �Story, 2007a�, and tendto be common across speakers �Story and Titze, 1998; Story,2005b, 2007b�. Consonant production results from a secondtier of perturbation that imposes severe constrictions on theunderlying vowel or evolving VV transition. The constrictionfunctions are precisely defined by the range along the tractlength that is affected by the constriction, the constrictionlocation, the cross-sectional area at the point of maximumconstriction, and by the timing of the activation.

A speech signal produced by this model2 carries infor-mation characteristic of the successive tiers of vocal tractstructure or movement on which it is built. For example, theidiosyncratic features of the neutral vocal tract shape will setthe acoustic background on which vowel transitions gener-ated by the first tier are superimposed. This is demonstratedin Fig. 1�a� where time-dependent formant contours areshown for a transition from �*� to �i� simulated by the modelin terms of area function changes. At any point in time, theformants along the transition can be considered to be per-turbed or deflected away �as suggested by the up and downarrows� from the formants of the underlying neutral tract�shown with dotted lines�. The characteristics of the voweltransitions, in turn, provide the acoustic background onwhich consonantal perturbations are imposed by the secondtier of the model. Demonstrated in Fig. 1�b� are the formantcontour effects that result from perturbing the �*i� transitionwith a consonant constriction intended to approximate a �d�.For comparison purposes, the �*i� transition and neutral tractformants are also shown in this figure, along with the timecourse of the consonant magnitude, denoted as mc�t�. Theconsonant magnitude indicates the period of time in whichthe consonant perturbation affects the vocal tract shape �i.e.,for mc�t��0� and causes the formant frequencies to be de-flected upward or downward relative to those formants that

would exist in the absence of the consonant �see arrows�.

826 J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

Although utterances can be simulated with this model,temporal activations of the vowel and consonant elementshave been mostly prescribed by ad hoc rules �Story, 2005a�or formant-to-coefficient mapping �Story and Titze, 1998�. Itwas shown in Story �2007b�, however, that canonical modessimilar to those on which the vowel tier is based could beextracted from the articulatory data available in the Univer-sity of Wisconsin-Madison’s x-ray microbeam �XRMB� da-tabase �Westbury, 1994�. Specifically, the x-y coordinates offleshpoint pellets, along with the outline of the hard palate,were used to approximate the shape of the oral part of the

0 0.1 0.2 0.3 0.40

0.5

1

1.5

2

2.5

3

F1v1

F2v1

F3v1

F1v2

F2v2

F3v2

F1n

F2n

F3n

Time (s)

Fre

quen

cy(k

Hz)

(a) [Ui]

0 0.1 0.2 0.3 0.40

0.5

1

1.5

2

2.5

3

F1ci

F2ci

F3ci

F1cf

F2cf

F3cf

Time (s)

Fre

quen

cy(k

Hz)

Con

sona

ntm

agni

tude

,mc(t

)

↑↑

↓ ↓

(b) [Udi]

FIG. 1. Demonstration of multiple “layers” of vocal tract perturbations toproduce a VCV. �a� Time-varying formant frequencies �thick solid lines� thatresult from perturbing a neutral vocal tract shape with vowel-to-vowel �VV�sequence; the Fn’s are the formants for the neutral tract shape and shownwith dotted lines, Fv’s are the formants in an approximately stable portion ofthe initial and final vowels. �b� Time-varying formant frequencies �thicksolid lines� that result from perturbing the VV sequence �thin solid lines�with a consonant perturbation approximately representative of �d�; thedashed line indicates the time course of the consonant activation whoseamplitude ranges from 0 to 1 �not Hz� and the Fci’s and Fcf’s are formantfrequencies that exist at the point where the occlusion begins and ends.

vocal tract as a midsagittal cross-distance function for eleven

Brad H. Story: Vocal tract shape

Page 3: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

vowels. A principal component analysis �PCA� performed onspeaker-specific sets of these cross-distance functions re-vealed mode shapes essentially the same as those derivedfrom MRI-based �magnetic resonance imaging� area functionsets in previous research �e.g., Story and Titze, 1998; Story,2005b; Mokhtari et al. 2007�, even though only the oral por-tion of the vocal tract was available for analysis. Furtheranalysis demonstrated that VV transitions could be accu-rately represented with time-dependent scaling of the modessuperimposed on the mean cross-distance function, thus vali-dating the first tier of the Story �2005a� model. It remains aquestion though whether naturally spoken VCVs can be ac-curately represented by modeling them as a VV transitionwith a superimposed consonant function, as prescribed bythe second tier of the model.

The aim of this study was to address this question byextracting time-varying cross-distance functions fromXRMB data, with the method described in Story �2007b�, forVCV utterances with stop consonants, and determining howwell their time course can be represented with a model simi-lar to Story �2005a� but formulated for cross-distance, ratherthan area. The assumption was that the initial and final vow-els in a VCV can be well represented by scaling coefficientsof the vowel-based modes. These coefficients can then beinterpolated to generate the underlying VV transition. It wasalso assumed that the consonant superposition function canbe determined from the shape of the vocal tract during theclosure period, and represented with parameters such as con-striction location, extent, degree, and temporal activation.

Described in Secs. II–V is a step-by-step process fortransforming the x-y coordinates of articulatory fleshpointsduring production of a VCV to the components of a vocaltract model based on superimposing a consonant gesture on avowel substrate. Since the outcome of early steps in the pro-cess are needed for later steps, each section combines expla-nations of the method with intermediate results.

II. MEASUREMENT OF TIME-VARYING CROSS-DISTANCE FUNCTIONS

A. Speaker and speaking tasks

The speaker chosen from the XRMB database for thepresent study was JW26. This female speaker was 24 yearsold at the time of data collection and her dialect base is listedas Verona, Wisconsin �Westbury, 1994�. She was also one ofthe four speakers studied in Story �2007b� so that vocal tractmodes have already been reported for her vowels. These willbe used subsequently in Sec. III. The XRMB protocol foreach speaker in the database contains a set of VCVs in theform of �.CÄ�, where “C” is 1 of 21 different consonants�task number 16, file “tp016”�. Only those VCVs with theunvoiced stop consonants �p, t, k� were analyzed in thisstudy.

B. Cross-distance functions from XRMB data

The XRMB data consist of time-dependent displace-ments of gold pellets affixed to four points on the tongue,two on the jaw, and one on each of the upper and lower lips.

During data collection, x-y coordinates of each pellet were

J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

acquired at a sampling interval of 6.866 ms �Westbury,1994�. To extract a representation of the vocal tract shape,the same technique was used as described by Story �2007b�where a midsagittal cross-distance function can be deter-mined from a two-dimensional vocal tract profile, con-structed from pellet coordinates, for any time frame of dataproduced by a given speaker. Although the details of thetechnique will not be repeated here, the process is demon-strated graphically in Fig. 2. The positions of the pellets onthe tongue �T1–T4�, incisor �MNi�, lower lip �LL�, and upperlip �UL� are shown as filled circles in Fig. 2�a� for a specific

−80 −60 −40 −20 0 20−40

−20

0

20

40

LL

UL

T1

T2T3

T4

MNi

Palatal Outline

PosteriorPharyngeal Wall

P−A (mm)

I−S

(mm

)

(a)

−7 −6 −5 −4 −3 −2 −1 0 10

0.5

1

1.5

2

Distance from Lips (cm)

Cro

ss−

dist

ance

(cm

)

(b)

FIG. 2. Demonstration of finding a cross-distance function from XRMBdata. �a� Sagittal view of a time frame representative of JW26’s �æ� vowel.A superior and an inferior vocal tract boundary are generated based on thetongue points �T1-T4�, palatal outline, pharyngeal wall, and four phantompoints �open circles� related to the mandible and lips. The lines extendingacross the vocal tract are perpendicular to the centerline and comprise thecross-distance measurements. �b� Resulting cross-distance function; notethat the x-axis indicates distance into the vocal tract from the lips, where thelips are at 0 cm.

time frame representative of JW26’s �æ� vowel. Also shown

Brad H. Story: Vocal tract shape 827

Page 4: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

is the palatal outline and an approximation of the posteriorpharyngeal wall. The open circles denote derived “phantom”points �see Story, 2007b, p. 3774� that more accurately rep-resent the interface of tissue and air than do the actual pelletson the lips and jaw.3 The inferior vocal tract profile is gen-erated with a spline fit through the four tongue points and thetwo lower phantom points. Similarly, the upper profile isconstructed with a spline fit through the upper two phantompoints and the palatal outline. The next step is to use aniterative bisection algorithm to find the centerline throughthe airspace, extending from the lips back to the most poste-rior point of the palatal outline. Finally, the distance from thelower to upper profile is measured perpendicularly at a suc-cession of points along the centerline. This collection ofcross-distances can be plotted as a function of the distancefrom the lips, as shown in Fig. 2�b�. It is noted that thenumber of points in a particular cross-distance function de-pends on the vocal tract shape at a given time frame �seeStory, 2007b, p. 3775�. For all uses of the algorithm in thisstudy and previously in Story �2007b�, each cross-distancefunction was resampled with a cubic spline so that it contains33 elements separated by equal length intervals.

This same process can be performed over consecutiveXRMB time frames to generate a time-dependent cross-distance function. These will, in general, be referred to asDVCV�x , t� where x is the distance from the lips and t is time.Shown in Fig. 3 are three-dimensional �3D� surface plots ofthe DVCV�x , t�’s determined for JW26’s utterances �.pÄ�,�.tÄ�, and �.kÄ�. The three phonetic symbols on each plotindicate the approximate temporal location of the initialvowel �.�, the consonant, and the final vowel �Ä�, respec-tively. It is noted that, as expected, the consonant constrictionis moved progressively in the posterior direction for the �p�,�t�, and �k� consonants. Because the cross-distance analysisonly extends as far back as the posterior end of the palataloutline, the representation of the �k� is somewhat incomplete.That is, the point of maximum constriction is included, alongwith the vocal tract shape anterior to it, but the tract shapeposterior to the constriction is not available. Although diffi-cult to discern from the figures, it is also noted that the cross-distance does not necessarily become zero at the point ofconstriction. For points on the tongue this is a result of thelow spatial resolution of the data. That is, if the point ofcontact of the tongue with an opposing surface is betweentwo pellets, the cross-distance algorithm will not account forit. At the lips, an incomplete occlusion may appear in thecross-distance function for a bilabial closure because of po-tentially different degrees of compression force. Correctionsfor the “non-zero” occlusions based on various mathematicalfunctions could perhaps be proposed. For the present study,however, it was decided not to alter the measured cross-distances and to assume that the spatial and temporal char-acteristics of the consonant constriction are accurately de-picted. These three VCV surfaces served as the vocal tractshape data for which the subsequently described method was

developed and tested.

828 J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

III. RECOVERY OF THE VV SEQUENCE

The method for separating the vowel and consonantparts of a VCV cross-distance function is based on the as-sumption that

DVCV�x,t� = DVV�x,t�C�x,t� for x = ��1,N� , �1�

where DVV�x , t� is the vowel-to-vowel �VV� cross-distancefunction that would exist in the absence of a consonant,

−6−5

−4−3

−2−1

0 00.1

0.20.3

0.40.5

0

1

2

3

Time (sec.)Distance fromLips (cm)

Cro

ss−

dist

ance

(cm

) A

@p

(a)

−6−5

−4−3

−2−1

0 00.1

0.20.3

0.40.5

0

1

2

3

Time (sec.)Distance fromLips (cm)

Cro

ss−

dist

ance

(cm

) A

@t

(b)

−6−5

−4−3

−2−1

0 00.1

0.20.3

0.40.5

0.6

0

1

2

3

Time (sec.)Distance fromLips (cm)

Cro

ss−

dist

ance

(cm

) A

@k

(c)

FIG. 3. �Color online� Time-varying cross-distance functions for VCVs spo-ken by JW26. The phonetic symbols are approximately aligned with thetime frame where the respective vowels or consonants are expressed in thecross-distance function. �a� �.p�, �b� �.t�, and �c� �.k�.

C�x , t� is a consonant superposition function that perturbs the

Brad H. Story: Vocal tract shape

Page 5: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

VV by imposing a constriction at a specific time and loca-tion, and � is the distance between each of the N=33 con-secutive elements in a cross-distance function. Thus, the aimof the method is to extract a representation of both DVV�x , t�and C�x , t� from a measured DVCV�x , t�.

In this section, a technique is described for approximat-ing the vowel-to-vowel sequence DVV�x , t� that underlies agiven VCV. Using the concept of vocal tract modes, the tech-nique effectively allows for the “recovery” of the VV se-quence from the cross-distance functions generated in theprevious section.

A. Vocal tract modes

It was assumed that the VV cross-distance function canbe represented by time-dependent linear combinations of vo-cal tract modes such as those reported in Story �2007b�. Inthat study cross-distance functions were obtained for 11vowels from each of four speakers. Vocal tract modes werethen computed, with principal component analysis, for eachspeaker’s set of vowels. The two modes that accounted fornearly 99% of the variance in JW26’s set of cross-distancefunctions are shown in the lower part of Fig. 4. The firstmode �1 is plotted with a solid line and the second mode �2

is shown as a dashed line. The mean cross-distance function�across the eleven vowels� � is shown in the upper part ofthe plot. Although plotted in the same figure for convenience,the units for the modes and mean cross-distance are not thesame, as indicated on the left and right hand sides of the plot,respectively.

Cross-distance functions can be reconstructed with highaccuracy for the original 11 vowels with the relation,

DV�x� = ��x� + q1�1�x� + q2�2�x� , �2�

where �1�x� and �2�x� are the modes, ��x� is the mean

−6 −5 −4 −3 −2 −1 0

0

0.5

1

1.5

2

Distance from Lips (cm)

Nor

mal

ized

units

Cro

ss−

dist

ance

(cm

)

Mean cross−distance

φ1

φ2

FIG. 4. �Color online� Vocal tract modes and mean tract shape for JW26�see Story 2007b�. The modes ��1 and �2� are shown in the lower part of theplot as solid and dashed lines and the mean cross-distance is shown in theupper part. Note that the axis label corresponding to the modes is on leftside, whereas for the mean tract shape the label is on the right.

cross-distance function, x is the distance from the lips, and q1

J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

and q2 are weighting coefficients that generate a specific vo-cal tract shape.

B. Time-varying mode coefficients

The purpose of this step was to find time-dependentmode coefficients representative of the transition from initialto final vowel for each of the VCV utterances shown previ-ously as time-varying cross-distance functions in Fig. 3. Forany given VCV cross-distance function, it was hypothesizedthat the portions of the utterance not affected by the conso-nant can be fairly well approximated by a particular combi-nation of the q1 and q2 coefficients of Eq. �1�. It follows thatthose portions of the VCV that are affected by the consonantwill be poorly represented by the mode coefficients. Thus,the first step in recovering the vowel and consonant compo-nents of the VCV consists of finding mode coefficients thatprovide a best fit to the cross-distance function at each timeframe. The error between measured and reconstructed cross-distances should be low during the initial and final vowelportions and high during the consonant, and thus can providea means of segmenting the VCV.

To determine coefficient values for the cross-distancefunction within each time frame of a VCV, a Nelder–MeadSimplex optimization technique �Lagarias et al., 1998; theMathworks, 2008� was used to find the �q1 ,q2� coefficientsthat minimized the squared error between a measured cross-distance function and that constructed with Eq. �2�. Theminimum error within each frame can be used to generate anerror function over the duration of the utterance.

Applying this process to the three DVCV�x , t�’s in Fig. 3produces the error functions shown in the top row of Fig. 5.Although they represent the squared error at each time frame,for purposes here, each one is normalized so that the maxi-mum error is equal to 1.0. The points denoted with filledcircles, labeled as to and tf, indicate the time instants wherelocal minima of the error functions occur, suggesting that theparticular combination of coefficients determined at thesepoints provides a good fit to the vocal tract shape. The ver-tical lines that pass through each of the filled circles dividethe VCV into three regions. The first region �“I” in the fig-ure�, extending from the beginning of the utterance to thefirst local minima to, is the domain of the initial vowel. Be-tween to and tf is region II, a time segment where the errorincreases rapidly due to the influence of the consonant con-striction on the underlying VV transition, and then decreasestoward zero as the consonant vanishes. Region II is assumedto be a transition zone where the speaker simultaneouslymoves the vocal tract from the initial to final vowel shapeand executes the onset and offset of a consonantal constric-tion. Region III extends from the second minimum tf to theend of the utterance, and is assumed to be the domain of thefinal vowel.

Ideally, the error within regions I and III would drop tozero for any given VCV because the vowel-based modeswould provide an exact match to the actual cross-distancefunction. Although each error function plot indicates a ten-dency toward zero in these regions, none of the three actually

becomes zero at any point in time. For �.p�, there is a fairly

Brad H. Story: Vocal tract shape 829

Page 6: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

large error in region I indicating that the cross-distance func-tions are not well represented by the vowel-based modes,likely due to the influence of the consonant already duringthis period of time. For �.t�, a local minimum does notactually exist on the left side of the error peak, so to wasmanually chosen at a point where the slope of the error wasreduced; this was done so that three regions could still bedefined. The error function for �.k� contains two clearlydefined minima that bracket the error peak but the error doesrise at the beginning of the utterance. In all three cases, theerror increases toward the end of the utterance suggestingthat the speaker may have been moving the vocal tract to-ward a less vowel-like shape that could not be well repre-sented by the mode coefficients.

In the second row of Fig. 5 are plots of the mode coef-ficients as they vary over the time course of each of the threeVCV utterances. The same three regions defined by the errorfunctions are also indicated in each of these three plots. Thethick lines �both solid and dashed� represent the q1 and q2

coefficients derived by the frame-by-frame optimization pro-cess. As would be expected because of the same target vow-els, they are similar, although certainly not identical, across

[@pA] [

0 0.1 0.2 0.3 0.4 0.50

0.2

0.4

0.6

0.8

1I II III

to

tf

Normalized

Error

Time (sec.)0 0.1 0.20

0.2

0.4

0.6

0.8

1I II

to

Normalized

Error

0 0.1 0.2 0.3 0.4 0.5

−4

−3

−2

−1

0

1

2

3

4

Time (sec.)

Mod

eco

eff.va

lues

I II III

to

tf

q1(t)

q2(t)

q1vv

(t)

q2vv

(t)

0 0.1 0.2

−4

−3

−2

−1

0

1

2

3

4

T

Mod

eco

eff.va

lues

I II

to

FIG. 5. �Color online� Normalized error functions �top row� and time-varyinand �.k� �right column�. The error functions indicate the differences betmode-coefficient fitting algorithm. The bottom row of plots show the time-vcorrected versions �thin lines� based on interpolation. The regions in each ploof the peak in the respective error function.

the three utterances in regions I and III. In region II, how-

830 J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

ever, the coefficients differ across the three VCVs becausethe optimization process was attempting to find a vocal tractshape that would fit both the vowel shape and the givenconsonant constriction during this period.

To remove the effect of the consonant constriction fromthe temporal pattern of the mode coefficients it was assumedthat, within region II, they could be replaced with an inter-polation of the coefficient values at to �boundary of regions Iand II�, and their values at tf �boundary of regions II and III�.With the interpolation specified as a sinusoidal function, theq1�t� and q2�t� coefficients over the entire duration T of theutterance were replaced with

qnVV�t�

= �qn�t� for 0 � t � to,

�qnf − qno�sin�2��t − to�4�tf − to� � + qno for to � t � tf ,

qn�t� for tf � t � T ,�

�3�

where n= �1,2�, and qno and qnf are the coefficient values at

[@kA]

.3 0.4 0.5

III

tf

(sec.)0 0.1 0.2 0.3 0.4 0.5 0.60

0.2

0.4

0.6

0.8

1I II III

to

tf

Normalized

Error

Time (sec.)

3 0.4 0.5(sec.)

III

tf

q1(t)

q2(t)

q1vv

(t)

q2vv

(t)

0 0.1 0.2 0.3 0.4 0.5 0.6

−4

−3

−2

−1

0

1

2

3

4

Time (sec.)

Mod

eco

eff.va

lues

I II III

to

tf

q1(t)

q2(t)

q1vv

(t)

q2vv

(t)

de coefficients �bottom row� for �.p� �left column�, �.t� �middle column�,the original VCV cross-distance functions and those constructed by the

g mode coefficients derived from the fitting algorithm �thick lines� and thedefined by the points, to and tf, which denote the local minima on either side

@tA]

0Time

0.ime

g moweenaryint are

the boundaries of region II. The qnVV�t� were also smoothed

Brad H. Story: Vocal tract shape

Page 7: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

with a tenth order FIR filter that had a low-pass cutoff fre-quency of 10 Hz. The thin lines �both solid and dashed� vis-ible in region II of the three mode coefficient plots �Fig. 5�are the result of interpolation with Eq. �3�, and are represen-tative of the underlying VV transition. Although not visiblein the plots, note that q1VV�t�=q1�t� and q2VV�t�=q2�t� inregions I and III. The filled circle corresponding to to on theq2VV�t� trace for �.pÄ� is displaced slightly upward from theq2�t� trace. This is due to the mild filtering that is appliedafter the interpolation. Other interpolation schemes such ascosine, minimum jerk, or linear were also tested, but thesinusoidal function seemed to best serve the purpose of ap-proximating the time-dependence of the coefficients duringthe portion of the vowel transition that is affected by theconstriction. Kröger et al. �1995� made use of a similarsinusoid-based function for estimating movement trajectoriesof articulatory fleshpoints.

It is noted that division of the VCV into the three spe-cific regions is critical for separating the vowel and conso-

Dvv(x, t)

−6−5

−4−3

−2−1

0 00.1

0.20.3

0.40.5

0

1

2

3

Time (sec.)Distance fromLips (cm)

Cross

−distan

ce(cm) A

@

−6−5

−4−3

−2−1

0

1

2

Distance fromLips (cm)

C(x,tc)

C(x,t)

p

(a) [@A] with [p] removed (b) [

−6−5

−4−3

−2−1

0 00.1

0.20.3

0.40.5

0

1

2

3

Time (sec.)Distance fromLips (cm)

Cross

−distan

ce(cm) A

@

−6−5

−4−3

−2−

0

1

2

Distance fromLips (cm)

C(x,tc)

C(x,t)

t

(c) [@A] with [t] removed (d) [

−6−5

−4−3

−2−1

0 00.1

0.20.3

0.40.5

0.6

0

1

2

3

Time (sec.)Distance fromLips (cm)

Cross

−distan

ce(cm) A

@

−6−5

−4−3

−2−1

0

1

2

Distance fromLips (cm)

C(x,tc)

C(x,t)

k

(e) [@A] with [k] removed (f) [

nant components. Temporal parsing by any other criterion

J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

would undermine the hypothesized representation of a VCVas a constriction superimposed on a vowel transition, andwould reduce the effectiveness of the interpolation in regionII to remove the influence of the consonant.

C. Reconstruction of the hypothetical VV sequence

The interpolated time-varying mode coefficients cannow be used to construct a hypothetical cross-distance his-tory over the duration of the vowel sequence with a time-dependent version of Eq. �2�,

DVV�x,t� = ���x� + q1VV�t��1�x� + q2VV�t��2�x�� , �4�

in which q1VV�t� and q2VV�t� are functions of time, and�1�x�, �2�x�, and ��x� remain unchanged as functions of thedistance from the lips. Time-varying vowel-to-vowel cross-distance functions were generated with the coefficientsq1VV�t� and q2VV�t� from the each of the plots in the bottomrow of Fig. 5, and are shown in the left column of Fig. 6.

x, t)

0.10.2

0.30.4

0.5

C(xc,t)

Time (sec.)

rturbation

00.1

0.20.3

0.40.5

C(xc,t)

Time (sec.)

rturbation

00.1

0.20.3

0.40.5

0.6

Time (sec.)

C(xc,t)

rturbation

FIG. 6. �Color online� In the left col-umn are the hypothetical VV se-quences, Dvv�x , t�, recovered from thethree original VCVs, and shown in theright column are the consonant pertur-bation functions C�x , t�. The productof the VV sequences with the conso-nant perturbations would reconstructthe original cross-distance functions.The white lines in the right columnplots indicate the time frame and spa-tial location corresponding to themaximum constriction.

C(

0 0

p] pe

10

t] pe

0

k] pe

There are some subtle differences, but, as expected, each plot

Brad H. Story: Vocal tract shape 831

Page 8: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

shows a similar time-progression of the cross-distance fromthe initial �.� to the final �� in the absence of the consonantconstriction. Thus a hypothetical underlying VV transitionhas been recovered, in each case, from the original VCV.

IV. RECOVERY OF THE CONSONANTSUPERPOSITION FUNCTION

A. Separation of vowel and consonant portions ofVCVs

With DVV�x , t� known for a given VCV, the time-dependence of the consonant constriction can now be recov-ered by rearrangement of Eq. �1� such that

C�x,t� =DVCV�x,t�DVV�x,t�

. �5�

This is a ratio of the measured VCV to hypothetical VVcross-distances at every point in time and space. In regions Iand III, as defined previously, this ratio should ideally be 1.0since the cross-distances are representative of the same vow-els in both the VCV and VV. The values of C�x , t� in regionII are expected to deviate toward 0.0 due to the influence ofthe consonant constriction.

With Eq. �5�, C�x , t� functions were calculated for thethree VCVs and are shown in the right column of Fig. 6. Inthe initial and final vowel portions of each case, the functionsare fairly flat and nearly equal to one along the posteriorextent from the lips into the oral cavity. The slight deviationsaway from one in the vowel portions result from the inexactmatch of the vowel-based model to the measured cross-distance functions. Such deviations are expected based onthe nonzero values present in regions I and III of the errorfunctions �Fig. 5�. In the vicinity of the consonant, eachC�x , t� decreases toward a minimum value near zero, reveal-ing the onset and offset of the constriction perturbation in theabsence of the vowel substrate. The temporal pattern of eachconstriction can be best seen along the spatial point xc, whereC�xc , t� is highlighted with a white line. The shape of C�x , t�at time instant tc represents the fully expressed consonantconstriction. C�x , tc� functions are indicated on each of theplots as white lines extending from the lips back into thevocal tract.

To provide a clearer picture of the overall shape of theconstrictions, all three C�x , tc� functions are replotted in Fig.7. The constriction minima xc, marked with filled circles, arelocated at the lips for �p�, at −2.3 cm from the lips for �t�,and at −5.1 cm for �k�. It is noted that the constriction for �k�is located at the most posterior point afforded by the XRMBdata. This means that the C�x , tc� for �k� captures only theanterior portion of the constriction shape. It is also noted thateven though each VCV contains a stop consonant, none ofthe three C�x , tc� functions in Fig. 7 actually becomes zero,suggesting that a complete occlusion is never attained. Thisis an artifact due to the low spatial resolution as discussed

previously in Sec. II B.

832 J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

B. Time dependence of the consonant perturbationfunction

The operations described to this point allow for a givenVCV cross-distance function to be effectively separated intothe “VV” and “C” components of Eq. �1�. In this section,additional operations are described for estimating the tempo-ral activation of the consonant constriction.

The spatial configuration of the constriction is assumedto be fully expressed at time instant tc as the functionC�x , tc�; these were shown previously in Fig. 7 for each ofthe three VCVs. This function can be represented in generalas

C�x,tc� = 1 − mc�tc�f�x� = 1 − f�x� , �6�

where mc�tc� is a scaling function defined to be equal to oneat tc, and f�x� is a spatial shaping function that, when sub-tracted from 1.0 at all values along the x-axis, will produceC�x , tc�. At all other time instants, C�x , t� is considered to bea scaled version of C�x , tc� that returns to a value of 1.0along the entire x-axis when the consonant effect is absent.

The time course of the consonant perturbation can bedetermined from the variation along C�xc , t�, where xc is theconstriction location. These functions were shown as whitelines in the right column plots of Fig. 6. Note that eachC�xc , t� tracks the temporal variation of the point of maxi-mum constriction, and is perpendicular to the correspondingspatial variation. Each C�xc , t� can be written as

C�xc,t� = 1 − mc�t�f�xc� , �7�

where mc�t� is the time-dependent consonant magnitude. Themc�t� function can be determined from known quantities byfirst substituting Eq. �7� into Eq. �1� �evaluated at xc� to givethe relation

DVCV�xc,t� = DVV�xc,t��1 − mc�t�f�xc�� . �8�

Rearranging Eq. �6� and evaluating at xc yields

f�xc� = 1 − C�xc,tc� , �9�

which can be substituted into Eq. �8�. Solving for mc�t� re-

−6 −5 −4 −3 −2 −1 00

0.2

0.4

0.6

0.8

1

1.2

1.4

Distance from Lips (cm)

Con

sona

ntsu

perp

ositi

on,C

(x,t

c)

xc

xc

xc

ptk

FIG. 7. Consonant perturbation �superposition� functions for �p, t, k� at thepoint in time representative of maximum constriction �tc�. These plots pro-vide a clearer view of the white lines C�x , tc� shown previously in Fig. 6.

sults in

Brad H. Story: Vocal tract shape

Page 9: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

mc�t� =DVCV�xc,t� − DVV�xc,t�DVV�xc,t��C�xc,tc� − 1�

, �10�

where all quantities on the right-hand side of the equation areknown from previous measurements and analyses.

Plots of the magnitude mc�t� calculated with Eq. �10� foreach VCV, along with the corresponding audio signal areshown in Fig. 8. In all three cases, the peak �where mc�t�=1� occurs just prior to the plosive burst as indicated by thevertical lines, and is the point at which the correspondingconsonant function would be fully superimposed on the un-derlying VV transition. On either side of the peaks, mc�t�decreases rapidly indicating progressively less consonantaleffect on the vocal tract shape. Ideally the consonant magni-tude would drop to zero in these regions �i.e., no consonantalinfluence�, but for �p� and �k�, it decreases to values slightlybelow zero, and for �t�, never reaches zero. The imperfectmatch of the hypothetical VV transitions to the non-consonantal portions of the original utterances causes the nu-merator of Eq. �10� to be non-zero during these time periods,

0 0.1 0.2 0.3 0.4 0.5 0.6−1

−0.5

0

0.5

1

Time (sec.)

Am

plitu

de

Audiom

c(t)

(a) [@pA]

0 0.1 0.2 0.3 0.4 0.5 0.6−1

−0.5

0

0.5

1

Time (sec.)

Am

plitu

de

Audiom

c(t)

(b) [@tA]

0 0.1 0.2 0.3 0.4 0.5 0.6−1

−0.5

0

0.5

1

Time (sec.)

Am

plitu

de

Audiom

c(t)

(c) [@kA]

FIG. 8. Consonant magnitude functions mc�t� calculated with Eq. �10� foreach of the three VCVs. Also shown in the background are the correspond-ing audio signals. The vertical line in each plot marks the time instant of theplosive burst.

thus generating the positive or negative offset observed in

J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

the mc�t� functions. Nonetheless, it is apparent from theseplots that the influence of the consonant begins prior to theacoustic offset of the initial vowel, and seems to subside inless than 0.1 s after the acoustic onset of the final vowel.These values are approximately representative of the threeutterances spoken by the speaker JW26 and are not intendedto characterize the consonants in general, but rather to showthe superposition effect of the consonant on the underlyingvowel substrate.

V. RECONSTRUCTION OF THE ORIGINAL VCVs

The mc�t� functions derived in the previous section cannow be used to reconstruct the time-varying cross-distancefunctions for each of the respective VCVs. First, from Eq.�6�

f�x� = 1 − C�x,tc� , �11�

which can be substituted into

C��x,t� = 1 − mc�t�f�x� = 1 − mc�t��1 − C�x,tc�� �12�

to produce a consonant superposition function. Next, Eq. �1�is rewritten with new notation as

DVCV� �x,t� = DVV�x,t�C��x,t�, x = ��1,N� , �13�

where the primes indicate an approximation to the previoussimilar quantities.

Using Eqs. �12� and �13�, the C��x , t� was recovered andDVCV� �x , t� reconstructed for �.p�, �.t�, and �.k�. The re-sulting time-varying functions are shown as 3D plots in theleft and right columns of Fig. 9. As prescribed by Eq. �13�,each DVCV� �x , t� in the right column is the product of thecorresponding C��x , t� in the left column with the appropriateDVV�x , t� shown previously in Fig. 6. Qualitatively, theseplots indicate that the primary temporal and spatial featuresof both the original cross-distance functions �Fig. 3� and thederived consonant superposition functions �Fig. 6, right col-umn� are maintained in the reconstructed versions, but someof the fine detail is lost.

The accuracy of each DVCV� �x , t� relative to the originaltime-varying cross-distance function was assessed by calcu-lating the rms error and correlation coefficient at each timeframe. These quantities are shown in Fig. 10 as functions oftime. The rms error, which provides a measure of the abso-lute difference in the cross-distances, is shown in the lowerpart of the graph and its units are denoted by the axis labelon the left. With the exception of the initial 0.05 s of �.p�,the rms error of all three VCVs is less than 0.2 cm for theduration of the utterance. During the period of time wherethe consonant affected the vocal tract shape, the rms error isnearly zero. Shown in the upper part of Fig. 10, and denotedby the right axis label, is the correlation coefficient. Thisgives an assessment of the similarity in shape of the originaland reconstructed cross-distance functions at each timeframe, and is nearly 1.0 over the duration of each VCV. Theexception, again, is the initial 0.05 s portion of �.p� wherethe correlation coefficient drops to about 0.85. Taken to-

gether, these two measures suggest that the three VCVs can

Brad H. Story: Vocal tract shape 833

Page 10: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

be reasonably well represented and reconstructed by combin-ing a separate VV sequence with a time-dependent consonantperturbation.

VI. DISCUSSION

In the discussion of his numerical model of coarticula-tion, Öhman �1967, p. 318� stated that “The principal valueof the model…lies in its ability to summarize with a singlemathematical formula the articulatory equivalent of therather complex acoustic description of VCV coarticula-tion…” Perhaps the same might be said for the presentmodel, although a few more mathematical relations than justone are required for the complete description. In particular,the ability of the VCV cross-distance model to separate thetemporal and spatial contributions of vowels and consonantsto the vocal tract shape may allow for insight into the plan-ning of speech utterances. For instance, questions regardingthe variability of constriction location, shape, and timing

C �(x, t)

−6−5

−4−3

−2−1

0 00.1

0.20.3

0.40.5

0

1

2

Time (sec.)Distance fromLips (cm)

C(x,t)

p

−6−5

−4−3

−2

0

1

2

3

Distance fromLips (cm)

Cross−distan

ce(cm)

@p

(a)

−6−5

−4−3

−2−1

0 00.1

0.20.3

0.40.5

0

1

2

Time (sec.)Distance fromLips (cm)

C(x,t)

t

−6−5

−4−3

−2

0

1

2

3

Distance fromLips (cm)

Cross−distan

ce(cm)

@t

(c)

−6−5

−4−3

−2−1

0 00.1

0.20.3

0.40.5

0.6

0

1

2

Time (sec.)Distance fromLips (cm)

C(x,t)

k

−6−5

−4−3

−2−

0

1

2

3

Distance fromLips (cm)

Cross

−distan

ce(cm) A

@k

(e)

within different vowel environments may be addressed. It is

834 J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

also of interest to relate vocal tract shape change to acousticoutput. This could be studied by comparing the structure offormant frequency contours measured for the VCV utter-ances with the temporal variations of the consonant magni-tude and mode coefficients, similar to the idealized versionof acoustic characteristics shown in Fig. 1.

A goal in developing the VCV model is to eventuallyuse it to provide information useful for controlling an areafunction model of the vocal tract. Since the cross-distancemodel shares many features with the area function modelproposed by Story �2005b�, transformation of the spatial andtemporal information derived for VCVs to parameters rel-evant for this model should be relatively straightforward.Such a transformation was demonstrated for vowel se-quences �Story, 2007b� based on mode coefficients, but ad-ditional work is needed to parametrize the consonant super-position functions. Specifically, the C�x , tc� functions �Fig. 7�will need to be described in terms of constriction location

v(x, t)

00.1

0.20.3

0.40.5

Time (sec.)

(b)

00.1

0.20.3

0.40.5

Time (sec.)

(d)

00.1

0.20.3

0.40.5

0.6

Time (sec.)

(f)

FIG. 9. �Color online� In the left col-umn are the consonant perturbationfunctions C��x , t� recovered for eachof the three VCVs with Eq. �12�, andshown in the right column are the re-constructed VCV cross-distance func-tions DVCV� �x , t�.

D�vc

−10

A

−10

A

10

�already denoted as xc�, extent of the constriction along the

Brad H. Story: Vocal tract shape

Page 11: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

vocal tract length, and degree of symmetry about the con-striction location �skewness�. In addition, informed tech-niques for correcting the C�x , tc� functions for complete oc-clusions need to be developed, as well as corrections for theoffsets of the consonant magnitude functions �mc�t�� duringthe vowel-only portions of VCVs, as noted in Fig. 8.

Although speculative at this point, transformation of theVCV cross distance information to area function model pa-rameters may allow for more realistic simulation of vocaltract shape changes along with associated acoustic output.This could aid in understanding both the acoustic and per-ceptual consequences of various spatial and temporal aspectsof vocal tract movement. Controlling an area function withthese parameters would also allow for experimenting withthe temporal or spatial variability of the vowel and consonantcomponents of an utterance. For example, the acoustic effectand perception of “sliding” the consonant magnitude func-tion �cf. Tjaden, 1999; Löfqvist and Gracco, 1999� along thetime axis relative to the vowel transition could be investi-gated systematically.

A key part of the VCV cross-distance model is the in-terpolation of the time-varying mode coefficients during theperiod when the consonant affects the vocal tract shape. It isthis interpolation that allows for the recovery of a hypotheti-cal VV sequence that can subsequently be used to recoverthe consonant superposition function. The sinusoidal func-tion used in the model was chosen because it was found toprovide a reasonable fit to the mode coefficients between theboundaries of region II. This does not mean that a sinusoidalfunction is necessarily representative of the temporal patternof a complete vocal tract gesture, however. It is noted thatthe time points, to and tf, that bound region II for each VCVare located at the onset and offset of the consonant gesture�defined by the minima in the error functions�, not the begin-ning and end of the VV transition. Thus, the interpolation ofthe mode coefficients within region II is an estimate of onlya portion of the entire VV transition. It is also important tonote that locating to and tf at the error function minima isnecessary to avoid an interpolation that �1� could include

FIG. 10. Comparison of the reconstructed VCVs of Fig. 9 to the originaltime-varying cross-distance functions shown previously in Fig. 3. The lowerpart of the plot indicates the rms error between the cross-distance functionsat every time frame over the course of each utterance. The upper part showsthe correlation coefficients calculated at each time frame. Note the axis labelfor the rms error is on the left side of the plot and on the upper right side forthe correlation coefficients.

remnants of the consonant constriction if Region II were re-

J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

duced in duration or �2� not adequately represent the VVtransition if the duration of Region II were increased.

Although the VCV cross-distance model shares manysimilarities with the Öhman �1967� model, a primary differ-ence is that all of the parameters needed by the presentmodel to decompose and reconstruct a given VCV can beobtained from analysis of that VCV alone. In Öhman’smodel the canonical vocal tract shape for a particular conso-nant and the coarticulation function could be obtained onlyfrom multiple VCVs in which that consonant was embeddedwithin two different symmetric vowel contexts. Also requiredwere canonical tract shapes for the vowels themselves. In theXRMB database, isolated VCVs were spoken only in theasymmetric �.Ä� context, hence, a direct comparison with theÖhman model is not currently possible unless a variety ofVCVs were approximated from word and sentence level ma-terial. Additional new XRMB data collection could poten-tially supply a wider variety of VCV speech material forwhich a comparison could be made. Another difference isthat the representation of velar stop consonants by the Öh-man model is problematic because the location of the con-striction is dependent on the vowel context, meaning that acanonical tract shape does not exist for this class of conso-nants. Although the VCV cross-distance approach has theproblem of limited data in the velar region, there is no limi-tation of the model with regard to vowel-dependent constric-tion locations.

At this point the utility of the VCV model is admittedlylimited by a number of factors. First, it has been applied toonly three VCVs from one speaker. Certainly more speechmaterial from additional speakers needs to be modeled inorder to determine if the decomposition technique general-izes across a wide range of initial and final vowels, as well asacross a variety of consonantal constrictions. Second, theconstraint that XRMB articulatory data represent only theoral cavity does limit the degree to which the results can beextended to describe movement of the entire vocal tract. Thislimitation is tempered somewhat by knowing that the vocaltract modes determined from XRMB data appear to replicatethose modes found for area functions based on the entirevocal tract �Story, 2007b�. In addition, for most consonantsother than those in the velar region, the consonant constric-tion is fairly well represented by the extent of the XRMBdata. For the velar consonants, only a portion of the fullextent of the constriction is captured by the data, and it ispossible that in some cases the location of maximum con-striction is not well represented. It can be noted, however,that the location and general configuration of the constric-tions for each of the three consonants represented by theconsonant superposition functions in Fig. 7 are similar to thevocal tract area functions reported for the same stop conso-nants in Story and Titze �1998�. A third limitation is that theXRMB flesh points �pellets� provide a sparse spatial repre-sentation of the vocal tract in only the midsagittal plane. Thelow spatial resolution does not allow for a detailed represen-tation of the air channels needed for fricative and affricateproduction, and midsagittal data cannot adequately depict

consonants with lateral constrictions.

Brad H. Story: Vocal tract shape 835

Page 12: Department of Speech, Language, and Hearing …sal.arizona.edu/sites/default/files/Story_microbeam_vcv...Vowel and consonant contributions to vocal tract shape Brad H. Storya Department

Despite these limitations, however, reconstruction ofeven a rudimentary and partial representation of the time-varying vocal tract shape is difficult to achieve by any meansother than fleshpoint tracking �i.e., XRMB data, but alsoarticulometer-type data �Perkell et al. 1992��. New tech-niques in magnetic resonance imaging show promise for ob-taining 3D time-dependent image sets �e.g., Takemoto et al.2006�, but these are currently limited to short utterances thatmust be repeated several hundred times. Finally, it must bepointed out that reasonable success in separating the voweland consonant components of a VCV with the proposedmodel does not prove that a speaker necessarily plans anutterance by superimposing a constriction on a vowel transi-tion. Rather, the results merely suggest it to be a possibleplanning paradigm that, at this stage, may provide usefulinformation for more detailed modeling of the time-varyingvocal tract shape.

VII. CONCLUSION

The aim of this study was to develop a method by whicha VCV, represented as a time-dependent cross-distance func-tion derived from XRMB articulatory data, could be sepa-rated into a vowel-to-vowel sequence and a consonant super-position function. The result is a model that represents avowel sequence as a time-dependent perturbation of the neu-tral vocal tract shape governed by coefficients of the vocaltract modes. Consonants are modeled as superposition func-tions that can force specific portions of the tract shape to beconstricted or expanded, over a specific time course. Recon-structions of three VCVs ��.p�, �.t�, and �.k�� with thedeveloped model were shown to be reasonable approxima-tions of the original VCVs, as assessed qualitatively by vi-sual inspection and quantitatively by calculating rms errorand correlation coefficients. Although the method was devel-oped and tested on a small data set from one female speaker,it does establish a method for future modeling of otherspeech material.

ACKNOWLEDGMENTS

This research was supported by NIH Grant No. R01-DC04789. The author would like to thank Kate Bunton fordiscussions on x-ray microbeam data and editing of earlydrafts of the manuscript, as well as the Associate Editor andtwo anonymous reviewers for their insightful comments.

1The model includes four hierarchical tiers where the third and fourth tiersprovide time-dependent control of vocal tract length changes and nasal-ization, respectively. These aspects of speech production, however, are notdiscussed in the present study. Hence, these capabilities of the model arenot reviewed.

2In addition to a model of the vocal tract shape, production of synthetic orsimulated speech requires components for computing voice source char-acteristics and wave propagation. These latter two components are part ofan existing speech production model �e.g., Story and Titze, 1998�, but willnot be discussed here.

3In Story �2007b� the two most anterior phantom points were determinedby a correction function applied to the upper and lower lip pellet positionsbased on assuming contact during production of the bilabial consonant�m�. In the present study, this correction function was based on lip closureduring a silent non-speech segment of recorded data. This was done so thatthe upper and lower lip positions would indicate contact without the lip

836 J. Acoust. Soc. Am., Vol. 126, No. 2, August 2009

compression that is characteristic of bilabials. The vocal tract modes usedin this study were also recalculated based on the new correction.

Båvegård, M. �1995�. “Introducing a parametric consonantal model to thearticulatory speech synthesizer,” in Proceedings Eurospeech 95, Madrid,Spain, pp. 1857–1860.

Browman, C., and Goldstein, L. �1990�. “Gestural specification usingdynamically-defined articulatory structures,” J. Phonetics 18, 299–320.

Byrd, D. �1996�. “Influences on articulatory timing in consonant sequences,”J. Phonetics 24, 209–244.

Carré, R., and Chennoukh, S. �1995�. “Vowel-consonant-vowel modeling bysuperposition of consonant closure on vowel-to-vowel gestures,” J. Pho-netics 23, 231–241.

Fowler, C. A. �1980�. “Coarticulation and theories of extrinsic timing,” J.Phonetics 8, 113–133.

Fowler, C. A., and Saltzman, E. �1993�. “Coordination and coarticulation inspeech production,” Lang Speech 36, 171–195.

Gracco, V. L. �1992�. “Characteristics of speech as a motor control system,”Haskins Labs. Stat. Rep. on speech Res., SR-109/110, 13-26.

Ichikawa, A., and Nakata, K. �1968�. “Speech synthesis by rule,” Reports ofthe Sixth International Congress on Acoustics, edited by Y. Kohasi �Inter-national Council of Scientific Unions, Tokyo�, pp. 171–1744.

Kent, R. D., and Minifie, F. D. �1977�. “Coarticulation in recent speechproduction models,” J. Phonetics 5, 115–133.

Kozhevnikov, V. A., and Chistovich, L. A. �1965�. “Speech: Articulationand perception �trans. US Dept. of Commerce, Clearing House for FederalScientific and Technical Information�,” Joint Publications Research Ser-vice, Washington, DC, No. 30, p. 543.

Kröger, B. J., Schröder, G., and Opgen-Rhein, C. �1995�. “A gesture-baseddynamic model describing articulatory movement data,” J. Acoust. Soc.Am. 98, 1878–1889.

Lagarias, J. C., Reeds, J. A., Wright, M. H., and Wright, P. E. �1998�.“Convergence properties of the Nelder–Mead Simplex method in low di-mensions,” SIAM J. Optim. 9, 112–147.

Löfqvist, A., and Gracco, V. L. �1999�. “Interarticulator programming inVCV sequences: Lip and tongue movements,” J. Acoust. Soc. Am. 105,1864–1876.

Mokhtari, P., Kitamura, T., Takemoto, H., and Honda, K. �2007�. “Principalcomponents of vocal tract area functions and inversion of vowels by linearregression of cepstrum coefficients,” J. Phonetics 35, 20–39.

Nakata, K., and Mitsuoka, T. �1965�. “Phonemic transformation and controlaspects of synthesis of connected speech,” J. Radio Res. Labs. 12, 171–186.

Öhman, S. E. G. �1966�. “Coarticulation in VCV utterances: Spectrographicmeasurements,” J. Acoust. Soc. Am. 39, 151–168.

Öhman, S. E. G. �1967�. “Numerical model of coarticulation,” J. Acoust.Soc. Am. 41, 310–320.

Perkell, J. �1969�. Physiology of Speech Production: Results and Implica-tions of a Quantitative Cineradiographic Study �MIT, Cambridge, MA�.

Perkell, J., Cohen, M. H., Svirsky, M. A., Matthies, M. L., Garabicta, I., andJackson, M. T. T. �1992�. “Electromagnetic midsagittal articulometer sys-tems for transducing speech articulatory movements,” J. Acoust. Soc. Am.92, 3078–3096.

Story, B. H. �2005a�. “A parametric model of the vocal tract area functionfor vowel and consonant simulation,” J. Acoust. Soc. Am. 117, 3231–3254.

Story, B. H. �2005b�. “Synergistic modes of vocal tract articulation forAmerican English vowels,” J. Acoust. Soc. Am. 118, 3834–3859.

Story, B. H. �2007a�. “A comparison of vocal tract perturbation patternsbased on statistical and acoustic considerations,” J. Acoust. Soc. Am. 122,EL107–EL114.

Story, B. H. �2007b�. “Time-dependence of vocal tract modes during pro-duction of vowels and vowel sequences,” J. Acoust. Soc. Am. 121, 3770–3789.

Story, B. H., and Titze, I. R. �1998�. “Parameterization of vocal tract areafunctions by empirical orthogonal modes,” J. Phonetics 26, 223–260.

Takemoto, H., Honda, K., Masaki, S., Shimada, Y., and Fujimoto, I. �2006�.“Measurement of temporal changes in vocal tract area function from 3Dcine-MRI data,” J. Acoust. Soc. Am. 119, 1037–1049.

Tjaden, K. �1999�. “Can a model of overlapping gestures account for scan-ning speech patterns?,” J. Speech Lang. Hear. Res. 42, 604–617.

The Mathworks, MATLAB, Version 7.6.0.324 �R2008a�.Westbury, J. R. �1994�. X-Ray Microbeam Speech Production Database Us-

er’s Handbook, �version 1.0��UW-Madison�.

Brad H. Story: Vocal tract shape