Three-dimensional tongue surface reconstruction: …...Three-dimensional tongue surface...

Three-dimensional tongue surface reconstruction: Practicalconsiderations for ultrasound data

Andrew J. Lundberga)

Department of Computer Science, Johns Hopkins University, 3400 North Charles Street, Baltimore,Maryland 21218

Maureen StoneDivision of Otolaryngology, University of Maryland Medical School, 16 South Eutaw Street, Suite 500,Baltimore, Maryland 21201

~Received 8 September 1998; revised 10 June 1999; accepted 29 June 1999!

This paper discusses methods for reconstructing the tongue from sparse data sets. Sixty ultrasoundslices already have been used to reconstruct three-dimensional~3D! tongue surface shapes@Stoneand Lundberg, J. Acoust. Soc. Am.99, 3728–3737~1996!#. To reconstruct 3D surfaces, particularlyin motion, collecting 60 slices would be impractical, and possibly unnecessary. The goal of thisstudy was to select a sparse set of slices that would best reconstruct the 18 measured speech sounds.First a coronal sparse set was calculated from 3D surface reconstructions. Selection of contours wasglobally optimized using coarse to fine search. Sparse and dense reconstructions were comparedusing maximum error, standard deviation error, and surface coverage. For all speech sounds,maximum error was less than 1.5 mm, standard deviation error was less than 0.32 mm, and averagereconstruction coverage was 80%. To generalize the method across subjects, optimal slice locationswere calculated from only the midsagittal contour. Six midsagittal points were optimized toreconstruct the midsagittal contour. Corresponding coronal slices were then used to reconstruct 3Dsurfaces. For data collection planning, a midsagittal sample can be collected first and optimalcoronal slices can be determined from it. Errors and reconstruction coverage from the midsagittalsource set were comparable to the optimized coronal sparse set. These sparse surfaces reconstructedstatic 3D surfaces, and should be usable for motion sequences as well. ©1999 Acoustical Societyof America.@S0001-4966~99!03610-3#

PACS numbers: 43.70.Jt@AL #

g

,to

to

c

thcohowllgfa

ou-glta

cu-ifi-ntsve-forno

force-of

tiction

2Dde

ofcriti-s aapeguesed.

2D’s

INTRODUCTION

Ultrasound imaging has been used to represent tonpositions for over 15 years~Sonieset al., 1981; Keller andOstry, 1983; Stoneet al., 1983!. Like other imaging systemsit provides a 2D measurement of the tongue surface conin a single plane~such as midsagittal, coronal, or oblique!.One strength of ultrasound is that it images tongue conmovement using a fairly rapid frame rate~30 Hz!. Another isthat contours from several spatial planes can be reconstruinto 3D surfaces~Stone and Lundberg, 1996!. However, asultrasound collects 2D slices, the subject must repeatspeech corpus once for each desired slice. Also, bothtours and surfaces are represented by many points. Tcompact quantification of movement is difficult in the casecontours and even more difficult in the case of surfaces. Timprovements of the ultrasound technique would radicaincrease its usefulness: reduced dimensionality of a toncontour or surface, and accurate representation of surmotion. Prior research has accurately represented statictongue surface shapes from dense data sets of ultrasimages~Stone and Lundberg, 1996!. The present paper presents methodology to accurately represent static 3D tonsurface shapes and motion from sparse data sets of usound images. The method also reduces the dimension

a!Electronic mail: [email protected]

2858 J. Acoust. Soc. Am. 106 (5), November 1999 0001-4966/99/10

ue

ur

ur

ted

en-us,foyuece3Dnd

uera-lity

of tongue surface representation and maintains highly acrate reproduction of local deformation features. This modcation is an essential step if multi-plane tongue movemeare to be reconstructed practically into tongue surface moments. Ultrasound has been shown to be a useful toolcollecting 3D tongue surface data. It is noninvasive, hasexposure limits, and is relatively inexpensive~$20 000USD!.

In previous research, a series of 2D images was usedreconstructing a detailed 3D view of the tongue surfa~Stone and Lundberg, 1996!. This required a special transducer, however, which collected 60 slices in a polar sweep60 degrees in 10 s. While this was feasible for a 3D staspeech sample, this method is too slow for 4D data collec~3D surfaces moving in time!. As there are not yet any 3Dultrasound devices that simultaneously collect multipleslices in motion, any 4D sample would need to be repeateNtimes ~where N is the number of slices to be used in threconstruction!.

The motion of three-dimensional tongue surfaces isinterest because the tongue is a complex system that iscal in speaking, swallowing, and breathing. The tongue ivolume preserving, deformable object. That is, tongue shis systematically related to tongue position, because tonvolume can be redistributed, but not increased or decreaComplicated tongue surface shapes can be produced inand 3D due to the complex distribution of the tongue

28586(5)/2858/10/$15.00 © 1999 Acoustical Society of America

thd

usape

-eg

torcba

aerstcese

stthheatn

r-

gthathntipr

atb

nsuDongnI

ecs

aln

o-as

3Dgesuldightor-

cese-dl-

sh-

thewas

onre-erguedatasthe

wereas3Dals oneme

tion

-s-

eentbe-

dif-andges

quettalittal

thethetheore

f the

muscles and its lack of bony tissue~Kier and Smith, 1985!.The intrinsic muscles originate and insert on soft tissue;extrinsic muscles also insert on soft tissue, thus insuringformation with every movement. Contracting variomuscles and contacting the palate allows complicated shto be made. Subtle changes in these shapes reflect differdue to coarticulation~Stone and Lele, 1992!, dialect and lan-guage~Stone and Yeni-Komshian, 1991!, and speech disorders~Stone, 1995!. In order to capture subtle shape changdue to these factors, especially over time, accurate tonsurface representation is essential.

The present study used the database acquired in Sand Lundberg~1996! to determine a minimal number, osparse set, of coronal slices needed for reasonable restructions. The specific coronal slices to collect mustspecified, as well as what error tolerances define a reasonreconstruction. Earlier work~Miyawaki et al., 1975; Stone,1990! suggested that 3D tongue surface shape could beequately specified using five lengthwise segments. Thfore, although many sparse sets of coronal slices were tefive slices were hypothesized to be optimal. In fact, six sliwere determined to give the most accurate compact repretation, as discussed below.

While real-time 3D ultrasound devices do not yet exithere are experimental ultrasound systems that simplifydata collection of multiple static 2D slices. These approacuse a 2D ultrasound transducer with an automated 3D sppositioning system. One system is the scanning 3D traducer used in Stone and Lundberg~1996!, which internallymoves a 2D transducer through a radial space~Acoustic Im-aging Inc., Phoenix, AZ!. The second is a holder that extenally rotates a traditional 2D transducer~Tomtec Inc., Den-ver, CO!. The first is static; the second allows time-varyindata to be collected independently at several slices andreconstructed. Both systems collect up to 60 planes of dand use computer control to position the transducer, butcommercial reconstruction algorithms are quite poor, aslices in planes, other than the original, are very unrealisMoreover, measurement of 60 tongue planes at 30 framessecond is unrealistically time consuming and unnecessadense spatially.

An alternative method, data-driven slice selection, cculates from subject data an optimal sparse source secoronal slices from which reasonable 3D surfaces canconstructed. For this method, an externally rotated traducer, or even a manually positioned system, would beficient for 4D data collection. Thus, to collect data for 4reconstructions, one would do multiple data collectionscoronal tongue image sequences at a few specific orietions. The separate image sequences would then be aliin 3D space from their respective collection orientations.this paper, data-driven slice selection is simulated by seling a sparse set of slices from the already existing denseof coronal slices described in Stone and Lundberg~1996!.

I. METHODS

A. Subject and speech materials

The subject was a 26-year-old white female with a Btimore Maryland accent. Nineteen English speech sou

2859 J. Acoust. Soc. Am., Vol. 106, No. 5, November 1999 A

ee-

esnce

sue

ne

on-eble

d-e-ed,sn-

,es

ials-

entae

dc.er

ily

l-ofe

s-f-

fta-ed

nt-et

-ds

were studied: /i/, /(/, /e/, /}/ /,/, /Ä/, /#/, /Å/, /o/, /)/, /u/, /É/,///, /Y/, /b/, /s/, /l/, /n/, /G/. The subject sustained each phneme for 10 s while 60 slices were collected. The data wcollected with the subject lying on her back because thetransducer required a non-upright position for clear ima~due to a fluid bubble within the transducer head that woobscure ultrasound when the transducer was in an uprposition!. Complete recording procedures and subject infmation can be found in Stone and Lundberg~1996!. In addi-tion, for this experiment, several additional tongue surfawere collected. For validation of the reconstruction procdures the vowel /,/ was collected twice using sagittal ancoronal slices. To study variability the sound /l/ was colected twice, once with a normal articulation, and once puing the tongue tip into the palate to induce variability.

B. Validation of dense data set

Prior to determining an optimal sparse source set for3D reconstructions, validation of the dense data setsperformed to guarantee the accuracy of the original~dense!data reconstructions. Further validation was performedthe reconstruction method beyond the original phantomconstruction of a known surface done in Stone and Lundb~1996!. Data collection and reconstructions of the tongsurface were done from the coronal and sagittal densesets for /,/. The first collection was done with the 60 sliceoriented in the coronal direction. For the second data set,transducer was rotated 90 degrees so that the 60 slicesoriented in the sagittal direction. Surface reconstruction wperformed on each data set, and they were overlaid inspace to find the best fit in terms of overlap and minimerror~measured as the 3D distance between surface pointthe data sets!. It should be noted that this error would includthe normal variability between repeated tokens of the saphoneme. For the coronal and sagittal /,/ data, the maxi-mum error achieved was 2.6 mm, with a standard deviaerror of 1.16 mm, and 86% overlap of the two data sets~seeFig. 1!. Figure 1 shows the reconstructed /,/ surfaces fromcoronal~left! and sagittal~right! slice sets and a set of distance vectors~bottom! comparing the two surfaces. The ditance vector image is a set of vectors from the first~left!reconstructed surface to the second~right! surface. Thelength of any vector corresponds to the 3D distance betwthe surfaces at that point~the orientation of the vectors is nonecessarily in the direction of the shortest 3D distancetween the surfaces!.

These intersurface differences were largely due toferences in measurement errors that occur in coronalsagittal tongue contours when using ultrasound. Tissue edbecome difficult to measure whenever the surface is oblito the ultrasound beam. This is most problematic for sagiimages when the tongue surface is grooved, and a sagcontour may lie entirely along the descending slope ofgroove. In the coronal plane, this is most problematic intongue root, where the entire contour may be oblique toultrasound beam. In general, one can recover a groove measily from a coronal scan, and one can measure more otongue root on sagittal scans.

2859. J. Lundberg and M. Stone: 3D tongue surface reconstruction

ilit

dam

the

con-e set,e-pline

ngueheewdre-hisim-icesand

to

mayon-ruc-

nalarethene ofslice

beion

o

hadughans-ererseer-the

e set.

thges

eInuein-thesend

lrf

For comparison, and to test measurement repeatabthe same judge twice measured the /Ä/ surface of a singlecoronal ultrasound image~see Fig. 2!. A year had passedbetween the two measurements of the images. The errortance between the two reconstructed surfaces had a mmum error of 1.84 mm, standard deviation error of 0.32 m

FIG. 1. Reconstructions of coronal /,/ surface~left! and sagittal /,/ surface~right! and distance vectors showing the distances between them.

FIG. 2. Two reconstructions of an /Ä/ measured twice from the coronaimages to test measurement repeatability and the distance vector subetween them.


y,

is-xi-,

and 96% overlap. The maximum error was greater thanerror induced by using sparse reconstructions.

C. Determining an optimal sparse source set

For the dense data sets, tongue surfaces were restructed as a b-spline surface that interpolated the densof coronal tongue surface contours~Stone and Lundberg1996!. As tongue surfaces are fairly smooth, particularly btween the measured contours of the dense data set, a b-ssurface is a sufficient model. For the sparse data sets, tosurfaces were similarly reconstructed by defining tb-spline surface that most smoothly interpolated the fcoronal tongue surface contours~measured from ultrasounslice images!. Tongue surfaces are simple enough to beconstructed by a small set of coronal contours by tmethod, but the position of these coronal slices becomesportant. For example, for an arched surface the coronal slmust be selected near the point of maximal curvaturedisplacement. If inappropriate coronal contours are usedreconstruct the tongue, the resulting surface may~at worst!intersect the true surface only along those contours, andbe of a significantly different shape. Selection of a reasable set of coronal contours is critical to sparse reconsttion of the tongue.

A sparse reconstruction contains just a few coroslices from the dense set of 60 coronal slices, so theremany possible sparse sets one could collect. In fact,dense data set can be considered to be 56 slices, as nothe speech sounds had measurable data beyond the 55thand slice numbering starts at 0~see Fig. 3!. We consideredselecting six slices because this was in fact determined tothe most appropriate choice for balancing data collectconstraints and reconstruction accuracy~see Fig. 4!. Therewere then 56 choose 6@6 selected from 56 without regard tselection order556!/(6!(5626)!)#, or about 32 millionpossible sets of six coronal slices. The optimal slice setto be defined globally for all 19 speech sounds, even thoeach sound had a different optimal set, because the trducer is fixed during actual speech production. There wtwo desirable properties used in defining an optimal spareconstruction. The first was maximal reconstruction covage, i.e., the ratio of the tongue surface measured indense set of tongue slices that was covered by the sparsThe second was minimizing error.

1. Reconstruction coverage

As the tongue moved forward and back in the mouduring speech, the first and last measurable coronal ima~for a fixed dense set of radial images! varied widely ~seeFig. 3!. Loss of the front slice~s! occurred when the tonguwas pulled back and up, creating a sublingual air cavity.the back, limits on measurable slices were not from tongposition, but from reduced image clarity caused by thecreasingly oblique orientation of the tongue surface toultrasound beam~which varied in different tongue surfaceand speech sounds!. The sound /i/ exemplifies both thesproblems, as /i/ is the highest of the front raised vowels, ais also very oblique and difficult to measure in the back.

ace


eF

ant

nn

rueio.oc

yifill

ldafo

es.ealnts,slice,icesioniblethe

uc-pti-ur-hodingForffectttalandatathe

ical

t alle in-facet inpointcesrors.ea-uesurelyfor

cesall

rtt

um-

Each of the speech sounds had a specific range of msurable surface contours within the 60-degree 3D sector.any speech sound, if the extremes of that measurable rare in the sparse data set, the sparse reconstruction ofspeech sound will cover the full 3D sector range that a dedata set covered. If the extreme measurable slices arepart of the sparse set, the sparse reconstruction will be tcated at the most extreme slice that does lie within its msurable range. Reconstruction coverage is the area ratthe sparse reconstruction over the dense reconstructionthe coronal slices were collected in a polar sweep, the recstruction coverage can be estimated by the degree rangeered by the measurable slices of the sparse set divided brange covered by all measurable slices for any specsound. For example, for /i/ only four of the six slices fewithin the measurable surface~see Fig. 3!, so its reconstruc-tion coverage is 25 degrees/31 degrees.

In order for a 3D reconstruction to be useful it shoucover as much of the tongue as possible. Therefore, spsets containing from two to nine-slices were optimized

FIG. 3. The range of measurable slices for each of the data sets and velines showing the locations of the six-point optimal sparse source secoronal slices.


a-orgehatseotn-a-ofAsn-ov-thec

rser

maximum coverage~see Fig. 4!. The benefit gained fromincreasing the number of slices diminished beyond six slicA second consideration was the practical limitations of rspeech data collection. Using current ultrasound instrumethe subject must repeat the speech corpus once for eachas they are collected independently. Therefore, fewer slare preferred. The third consideration was a prior indicatthat a large reduction in the number of slices was feas~Stone, 1990!. Based on these three considerations anddata in Fig. 4, six-slice sets were optimized.

2. Error analysis of the six-slice set

The second property desired in an optimal reconstrtion was minimal error. Sparse sets of six slices were omized for minimum error. Reconstruction of the tongue sface from a sparse set of slices was identical to the metfor reconstructing from a dense data set. An interpolatb-spline surface was fit to the set of surface data points.the sparse data set from six coronal slices, this had the eof simplifying the reconstructed surfaces along the sagiaxis. The resulting tongue surfaces were smoother,might lose detail. To measure the errors induced by this dreduction, the dense reconstruction was compared tosparse one. To do this, a regularly spaced 2D grid of vertlines ~about one 1-mm spacing! was intersected with thetongue surface. A large enough grid was selected so thathe tongue surfaces in the data set were covered. Thestersections gave a regularly spaced set of tongue surpoints from the dense reconstructions. For each grid pointhe dense data set reconstruction, the closest surfacewas found for the sparse reconstruction. The 3D distanbetween these point sets over all points gave a set of erFrom these, maximum and standard deviation errors msured in millimeters were determined for each of 19 tongsurfaces~corresponding to 19 different static speech sound!.The 3D distances were used as a distance measure, as pvertical error measures would exaggerate the distancesoblique areas of the surface. In contrast, 3D error distanare measured in a direction normal to the surfaces inareas.

icalof

FIG. 4. Coverage of reconstructions for sparse data sets with different nbers of coronal slices.


s, excep
TABLE I. Reconstruction errors resulting from sparse reconstructions based on different optimizations. No optimization~None! refers to simply taking anequidistant spacing of slices over the full range of the data. Errors reported are for 3D error measured over the surfaces of the 19 speech soundt forthe six-point set.
Optimization Selected slicesAverage

errorMaximum

errorStandard deviation

error CoverageError cost5max1s.d.1@23(1-coverage)3#

None ~0 11 22 33 44 55! 0.37 2.66 0.53 0.79 4.21Six-slice set ~0 8 16 24 33 38! 0.23 1.42 0.32 0.82 2.63Six-point set ~3 10 18 24 32 38! 0.21 1.40 0.29 0.80 2.67Six-point seta ~3 10 18 24 32 38! 0.20a 0.81a 0.27a 0.80a 2.06a

aErrors measured for the midsagittal contour only.

a

ae

onntiovim

tpttaef

oveneb

thiotr

re

estereethrea

, s

theneo

otre

hly

e

re-tioner-

ofpos-

thendi-ime-

ingleop-echpe.n-

ces.alerrornalon-set.ndbutse19intsThettal3Dhtlyea-

tion

uewasped

4D

3. Optimization of global error cost

To select a set of optimal sparse coronal slices, optimity was defined as minimizing the error cost function:

error cost5maximum error1s.d. error

1@23~12reconstruction coverage!3#. ~1!

In addition to this cost function, sparse sets covering anerage of less than 0.66 of the 19 speech sound set weliminated from consideration to prevent the optimizatifrom being skewed by outlying maximum errors. The costant 2 in the error cost equation balances the optimizabetween the goals of minimizing error and maximizing coerage. Some balance is necessary because simply maxing coverage results in high surface errors~larger errors thanthe default equidistant errors in Table I!, and optimizing forerror only would result in shrinking the sparse surfaceconsecutive slices one degree apart. For all subjectssented in this paper, the value 2 worked well for both sagicontour and 3D surface optimizations. Evaluation of theror cost for any slice set required a sparse reconstructioneach of the speech sounds. A brute force search of themillion possible sets would require roughly a year of prcessing time. Thus, a search was needed that could gifast approximation to the global optimum. A coarse to fimethod was used to first get a rough estimation of the glooptimum, and then refine that estimation. To do this,method tests all possible six-slice sets with the restrictthat only every fourth possible slice from the dense seconsidered. This is equivalent to finding the best sparseconstruction from a set of coronal slices space 4 degapart. So, at this most coarse level, there are only 56/4514possible slices. Since 14 choose 6 is only 3003, all thpossibilities can be tested. After determining this coarse soptimum, six-slice sets at a finer level are tested. Nowstricted to every second slice, all possible six-slice swithin a single size 2-degree step are considered. In owords, at each slice position, consider the slice 2 degbefore, the current slice, and the slice 2 degrees after,choose the best permutation across all six slices. Thuslecting from three slices at six positions gives 365729 per-mutations to consider. The best of these permutations isrefined by the same process, using a step size of 1 degregive a global optimum approximation. This coarse to fimethod is much faster than considering the full rangepermutations. It considers 30031729172954461 ratherthan 32 million possibilities. The possibilities it does nconsider are those where multiple slices are within 4 deg


l-

v-re

-n

-iz-

ore-l

r-or32-

a

alenise-es

ep-

tseresnde-

en, to

f

es

of one another. For creating sparse data sets, it is higunlikely that such data sets will be optimal~over the totalrange of 501 degrees!, so the use of the coarse to finmethod should be very reasonable.

The optimal sparse coronal set, for all 19 sounds,sulted in an average error of 0.25 mm, a standard deviaof 0.33 mm, a maximum error of 1.47 mm, and 84% covage of the dense data sets. Due to the variability in lengthtongue surfaces the maximal reconstruction coveragesible for any six-slice set would be 90%~see Fig. 4!. Asultrasound has a measurement error around 0.5 mm,sparse data set was a very good approximation. This icated that accurate reconstructions could be made from tvarying ultrasound with as few as six slices~at the appropri-ate positions!.

D. Optimizing source sets for individual subjects

These data sets and analyses were based on a ssubject, so there is legitimate concern that any subject’stimal sparse source set will vary based on factors of speproduction, subject size, or the surrounding vocal tract shaIt would be foolish and impractical to do a dense 3D recostruction of each subject simply to find the best sparse sliFor this reason, a simpler method for estimating optimsparse source sets was sought. Instead of measuring thein the entire 3D surface reconstruction from a sparse coroslice set, error was measured only in the 2D midsagittal ctour reconstruction from a sparse midsagittal point dataIn effect, this would concentrate on the midsagittal slice, aperform the same analysis as finding the best set of slicesin only two dimensions. This was simulated on the dendata set by extracting the midsagittal profiles for thespeech sounds, and determining the optimal set of poneeded to best reconstruct the global set of profiles.coronal slices corresponding to the optimal midsagipoints were remarkably close to those selected by theanalysis. Using them as the sparse set resulted in sligreduced surface coverage, but also gave improved error msurements particularly at midline.

II. RESULTS

The goal of this study was to reduce the representaof the tongue surface to a few key slices~i.e., optimal sparsesource slices!. These slices had to reconstruct 3D tongsurfaces with the highest accuracy possible. If this stepaccomplished adequately, the procedure could be develofurther to collect time varying data at each slice for use in


–0.95

TABLE II. Errors in 3D reconstructions based on the optimizations of six coronal slices and six midsaggital points.

6 slices 6 points

SoundAverage

errorMaximum


error CoverageAverage

errorMaximum


error Coverage

i 0.37 1.42 0.48 0.97 0.32 1.37 0.44 0.90( 0.17 0.88 0.22 0.70 0.17 1.03 0.23 0.65e 0.16 0.82 0.21 0.75 0.15 0.51 0.20 0.70} 0.23 1.34 0.31 0.88 0.26 1.40 0.32 0.81, 0.26 0.77 0.28 0.94 0.19 0.72 0.25 0.83Ä 0.19 0.79 0.25 0.93 0.21 1.06 0.28 0.85# 0.42 1.42 0.52 0.97 0.31 1.18 0.40 0.85Å 0.32 1.20 0.41 0.73 0.28 1.29 0.36 0.93o 0.16 0.74 0.21 0.83 0.18 0.70 0.24 0.73) 0.20 0.89 0.29 0.81 0.12 0.49 0.15 0.95u 0.21 0.86 0.27 0.63 0.17 1.14 0.24 0.80É 0.38 1.38 0.50 0.80 0.20 1.06 0.27 0.90/ 0.14 0.53 0.17 0.91 0.21 0.79 0.28 0.83Y 0.15 0.83 0.20 0.83 0.15 0.60 0.19 0.76b 0.32 1.24 0.42 0.73 0.29 0.99 0.36 0.68s 0.18 0.64 0.22 0.92 0.16 0.52 0.20 0.81l 0.22 1.07 0.30 0.61 0.24 0.90 0.30 0.57n 0.20 1.38 0.30 0.86 0.19 1.14 0.27 0.80G 0.22 0.86 0.29 0.88 0.27 0.84 0.34 0.82

Range 0.14–0.42 0.64–1.42 0.17–0.52 0.61–0.97 0.12–0.31 0.49–1.40 0.12–0.44 0.57

ide

afo

copccf

berawadettime,gs

iodn

coatsr

ms

rcethem isbil-

ury,

theerupt

ane

d byceeft/ym-ssedhethetry

guewas

was

eatlytheittalse

reconstructions~x,y,z,t!. Two sparse source sets were consered. The first was the set of six coronal slices optimizfrom all the coronal slices of the 56-slice dense set~hereaftercalled the six-slice set!. The second set was the six coronslices corresponding to the midsagittal points optimizedreconstructing the midsagittal profile~hereafter called thesix-point set!.

A. Global characteristics of the reconstructions

For each of the sparse sets, global measures of restruction accuracy were calculated. Table I shows the omal six-point and six-slice sets with their global reconstrution errors. Maximum error, standard deviation error, surfacoverage, and the resulting cost function were calculatedthe entire set of surfaces. The results indicated that theoptimal sparse source was the six-point set. Surface covewas degraded from the six-slice set optimum and errorimproved. Use of midsagittal points as a source set tendeproduce better reconstructed surfaces than the coronal smany cases, because midsagittal points focused the opzation algorithm on midsagittal features. Thus local deprsions, or ‘‘dimples,’’ as seen in /l/ and /Ä/, and steep slopesas seen in /i/ and /É/, were better captured using the midsaittal source sets. Larger error was seen instead at thefaces’ extreme edges~the least important areas! and also inareas of left-to-right asymmetry, as midsagittal optimizatignores and thus may diminish asymmetries. Errors for invidual sounds are shown in Table II. The six-point recostructions had smaller average errors than the six-slice restructions for 11 of the 19 sounds; maximum error wsmaller for 13 of the sounds, including all the consonanCoverage was improved in only four cases. The sparseconstructions were also more accurate than repeatedsurements of a frame~Fig. 2! or comparing sagittal versu


-d

lr

n-ti--eorstgesto, in

i-s-

-ur-

ni--n-

s.e-ea-

coronal dense data sets~Fig. 1!. This would indicate thathuman error in edge detection would be the primary souof error in sparse reconstructions. Concurrent withpresent study, a new and automated edge detection systebeing developed that should improve measurement reliaity.

B. Preservation of local features

In addition to global statistical error measurement, fo‘‘local’’ features were considered: left-to-right asymmetrabrupt changes in slope, local surface depressions, andconstriction location in fricatives. Visual inspection of thdense reconstructions indicate that depressions and abchanges in slope were most evident in the midsagittal pl~Stone and Lundberg, 1996; Figs. 4–6!. Preservation of thesetwo features in the sparse reconstructions was enhanceoptimizing slice selection in the midsagittal plane. A sourset determined by midsagittal points cannot account for lright differences in shape or motion. In these data sets asmetry was least well represented. If the selected slices pathrough maximally asymmetric regions, the length of tasymmetry would be overestimated. If the slices missedareas of maximal asymmetry, the degree of asymmewould be underestimated. The most asymmetrical tonshape in the data set was /i/ where the maximum error1.37 mm.

The second and most easily resolved local featurethe local dimple seen in low back vowels and /l/~Stone andLundberg, 1996; Figs. 5 and 6!. The use of the five- andsix-point sparse sets instead of the coronal sparse set grimproved resolution of centrally occurring depressions in3D surfaces, as they were key features in the midsagprofile as well. Figure 5 compares the dimple in the denand sparse surfaces for /l/.


hethvefo

euy

as

heec-c-

the

setfrompe

gestm

g-wasnnel

saypro--ces.ho-l of

ltwog theon,dif-t

thece-

l

o

e/lstthean

berion

avete-sag-ur/

tsosstureues

that had incomplete data at one end or the other; and little

The third local feature was abrupt change in slope. Tfeature was particularly evident for /i/ which had an archtongue in the front, and abruptly became grooved inback. In addition, the measurable tongue surface wasshort. The six-point set resulted in four measurable sliceseven the shortest tongue surfaces, and captured the grovery accurately. Figure 6 shows good representationabrupt slope changes and deep groove in /i/.

The fourth local feature was the location of fricativconstrictions. Fricative constructions in English often occslightly off midline. Moreover, they may not be marked b

FIG. 5. Reconstructions from a dense set of slices~left!, and from a six-sliceset ~right! for /l/ and the distance vector surface between them.

FIG. 6. Reconstructions from a dense set of slices~left!, and from a six-sliceset ~right! for /i/ and the distance vector surface between them.


isderyorvesof

r

midline tongue features. Therefore, particular attention wpaid to the error in the three fricatives /Y/, /s/, and /b/. TableII shows the maximum error for each sound. For /Y/ themaximum error 0.60 occurred laterally, though not at tedges. Maximum error was 0.2 mm at the constriction. Eltropalatography~EPG! data confirmed the subjects constrition locations~see Stone and Lundberg, 1996!. For the /s/ thelargest error was 0.5 mm and occurred at the edge. Atconstriction, the largest error was 0.3 mm. The /Y/ and /s/shapes were actually fairly easy to predict from a sparsebecause the tongue shape did not change dramaticallyfront to back. The /b/ had a more changeable surface shaand had larger average and maximum errors. The larerror, 0.99 mm, occurred laterally. Several errors of 0.7 mdid appear in the constriction region slightly off the midsaittal plane. In the constriction region, the sparse data setbelow the dense data set, which overestimated the chasize.

C. Variability

Intrasubject variation occurs because humans do notan utterance exactly the same way every time. Phonemeduction varies slightly from repetition to repetition. Intrasubject variation could not be seen in these single utteranOne example of variation was contrived, however. The pneme /l/ was repeated twice by the subject with the goacreating two different shapes. The first, / l1 /, was producednormally. The second, /l2 /, was produced with a forcefuapical contact. Both were sustained about 10 s. Thetongue surfaces were measured and reconstructed usinsame procedure as in Figs. 5 and 6. In the /l/ comparishowever, the two surfaces were dense reconstructions offerent repetitions~see Fig. 7!. We were interested in whacauses the midsagittal depression often seen just behindtongue blade in /l/. It was hypothesized that the more forful apical contact would create a larger deformation for /l2 /,the more extreme or ‘‘tense’’ production. Therefore, the /2 /would have a deeper depression than /l1 /. This was found tobe true. The lower left portion of Fig. 7 shows the twtongues spatially aligned and superimposed. The /l2 / tongueis higher than /l1 / in back and on the sides, and lower in thdepression region. The depression depths were 2 mm for1 /and 4 mm for /l2 /, at the deepest point relative to the highepoint in the same coronal slice. The important features,dimples, differed across the repetitions by 2 mm, larger ththe maximal error for sparse reconstructions. This numshould be accurate since it occurred in the midsagittal regwhere we generally expect smaller reconstruction errors.

Intersubject variation occurs because humans hslightly different oral morphologies and use different stragies for creating speech gestures. Table III presents midittal optimization data from 17 additional subjects. Fospeech sounds were collected for each of these subjects:,/,/Ä/, /i/, and /u/. The best six-point sparse set~optimal selec-tions! is compared to the equidistant six-point sparse set~de-fault selections!. The optimized selections column presenthe optimized range of slices. This makes it clear that acrsubjects there existed a variety of tongue lengths and fealocations. Smaller ranges were caused by two things: tong


enctinar

ti-ancu

ver-erces

aterfouras

onto

m,Theim-

onthe

f /rsth

nes

anterior-to-posterior differences among sounds. Differstarting and ending slices among subjects, e.g., subjeversus subject 16, reflect rotational differences in positionthe transducer, not true position differences. The primsubject is the subject used in the rest of the paper.

Table III shows that for 11 of the 17 subjects the opmization reduced the maximum error by at least 0.8 mmfor 13 subjects it increased the surface coverage. Subjebenefited the most from the optimization. Her default eq

FIG. 7. Reconstructions of dense sets from two distinct productions o~normal and tense! to show intrasubject variability, the distance vectobetween them, and, on the lower left, the two surfaces superimposed inbest alignment. The black surface is for normal production, /l1 /, and thegray surface is for tense production, /l2 /.


t3

gy

dt 7i-

distant point set selected points 6 degrees apart with an oall tongue length of 30 degrees. After optimization, htongue length was 25 degrees, and her interpoint distanwere 5, 4, 4, 5, and 8 degrees apart, indicating a grerepresentation in the anterior tongue. Figure 8 shows thevowel contours and the optimized points for subject 7radial trajectories. With these modifications in point locatifor this subject, the average error decreased from 0.520.28, the maximum error decreased from 2.42 to 1.04 mand the standard deviation decreased from 0.68 to 0.37.least improvement was seen for subject 10 whose errorsproved only slightly. For some subjects, the optimizatiwas essential, for without it some sounds only had two of

l/

eir

FIG. 8. Sagittal contours of the tongue for subject 7 showing radial liindicating the default~dashed! and optimal~solid! point sets and their inter-sections with the four speech sounds.

spline

age

TABLE III. Midsagittal optimization for the primary and additional subjects based on four speech sounds. The default selections and results~in parentheses!use simple equidistantly spaced points. Values of~---! are displayed for selections that captured only two points for at least one sound, so that noestimation can be done for that sound. Errors reported here are measured only on the midsagittal contour.

Subject Optimal selections Default selections Average error Maximum error Standard deviation Cover

Primary @00 10 18 23 31 35# ~00 09 18 27 36 45! 0.15 ~0.27! 0.45 ~1.40! 0.19 ~0.38! 0.84 ~0.82!1. A. C. @10 13 21 29 38 43# ~09 16 24 31 39 46! 0.50 ~0.54! 1.49 ~1.86! 0.63 ~0.70! 0.90 ~0.87!2. C. S. @12 17 21 29 35 42# ~09 16 22 29 35 42! 0.32 ~0.43! 0.94 ~2.15! 0.41 ~0.56! 0.88 ~0.93!3. E. B. @04 12 17 21 26 29# ~02 09 16 22 29 36! 0.21 ~0.36! 0.71 ~1.59! 0.28 ~0.48! 0.79 ~0.82!4. E. D. @08 14 18 23 27 32# ~02 09 16 22 29 36! 0.37 ~0.50! 1.16 ~2.01! 0.49 ~0.65! 0.84 ~0.76!5. E. L. @05 10 19 27 35 42# ~00 08 17 25 34 42! 0.41 ~---! 1.30 ~---! 0.52 ~---! 0.88 ~0.69!6. E. S. @07 12 16 21 26 31# ~01 08 14 21 27 34! 0.35 ~---! 0.99 ~---! 0.44 ~---! 0.84 ~0.62!7. F. S. @10 15 19 23 28 35# ~09 15 21 27 33 39! 0.28 ~0.52! 1.04 ~2.42! 0.37 ~0.68! 0.86 ~0.80!8. J. M. @06 10 16 23 31 45# ~05 13 21 29 37 45! 0.43 ~0.46! 1.40 ~2.27! 0.55 ~0.65! 0.83 ~0.78!9. J. U. @03 07 13 19 25 30# ~00 08 16 24 32 40! 0.27 ~0.42! 1.02 ~2.40! 0.36 ~0.57! 0.81 ~0.81!10. K. L. @02 06 15 22 30 42# ~01 10 18 27 35 44! 0.47 ~0.53! 1.65 ~1.68! 0.58 ~0.68! 0.90 ~0.79!11. K. R. @16 21 24 28 33 38# ~13 18 23 29 34 39! 0.31 ~0.50! 0.99 ~2.07! 0.42 ~0.69! 0.87 ~0.83!12. M. B. @17 21 25 29 32 35# ~14 19 24 28 33 38! 0.23 ~0.36! 0.71 ~1.42! 0.32 ~0.51! 0.84 ~0.77!13. R. S. @15 19 22 24 29 34# ~12 17 23 28 34 39! 0.28 ~0.40! 0.98 ~1.40! 0.39 ~0.56! 0.85 ~0.86!14. S. F. @07 10 15 17 23 31# ~03 10 17 24 31 38! 0.39 ~0.44! 1.44 ~1.50! 0.50 ~0.53! 0.79 ~0.82!15. S. G. @01 08 16 24 32 45# ~00 10 20 31 41 51! 0.39 ~---! 1.16 ~---! 0.48 ~---! 0.84 ~0.66!16. T. M. @19 23 27 32 37 40# ~16 22 27 33 38 44! 0.38 ~0.45! 1.33 ~1.54! 0.50 ~0.63! 0.84 ~0.83!17. V. S. @18 23 28 31 34 38# ~17 22 26 31 35 40! 0.34 ~0.49! 1.18 ~1.45! 0.46 ~0.62! 0.93 ~0.78!


keorFifoa

acsli

ranjeecrsfottadrubrlece

ecs

ra

pugrad

matn

bleely,

cyti-

onea-ror

tion.2

rors

calandcal

mi-a-

tionfor’’ch.tal

-the

riesis

nsuearserror.les,/l/.nd

e

erysix-

hirdndst.

teepr-in-

riorfarionngh aselyy of

n,

fau

e

six equidistant points fall on the tongue surface, as marby ~---!. This occurred when the subject had great anteriposterior differences in tongue position across sounds.ure 9 presents an example of this positional differencesubject 5. The contour for /i/ went from a two-point tothree-point representation after the optimization.

III. DISCUSSION

This study was able to reconstruct 3D tongue surfshapes using as few as six coronal slices. The bestselection used an optimized set of midsagittal points.

Three important issues are involved in choosing a spadata set for 3D reconstruction. The first and most importissue is finding the best six-point source set for each subWithout this, results cannot be generalized across subjand validity of the method is breached. The optimal spasource sets determined here will certainly not be optimalall subjects. Therefore, prior to data collection, a midsagidata set needs to be collected for each subject. From thisset an optimal source point set is determined for reconsttion of the midsagittal profile. Coronal images would thencollected at each point and reconstructed as described eaThis procedure can be used to collect time-varying spesamples at each coronal slice angle for use in 3D timmotion reconstructions of the tongue surface during speThe midsagittal data can be used in the reconstructionwell.

Second, the transducer must be positioned in an accuand precise manner. A positional error of a few degreesone slice will reduce significantly the capture of local shafeatures such as dimples and degree of grooving. Althonot addressed in this paper, a 3D automated head and tducer support system~AHATS! based on the currently use2D head and transducer support~HATS! system~Stone andDavis, 1995! is under construction. This system uses coputer~or manually! controled positioning of the transducerpredetermined angles for collection of real-time sagittal a

FIG. 9. Sagittal contours of the tongue for subject 5 showing that the deset~dashed! fails to measure the contours for one sound at three~the solid ormore points, while the optimal set does intersect each contour at thremore points.


d–g-r

ece

set

ct.tserl

atac-eier.h–h.as

teinehns-

-

d

coronal motion. Manual transducer positioning is acceptaif predetermined slice positions are calculated accuratand precision of transducer placement is assured.

The third issue of importance is reconstruction accuraof 3D shape and motion. Global reconstruction was opmized by minimizing the maximum and standard deviatierrors. As a result, the average errors were below the msurement error for ultrasound. The largest maximum erfor all 19 sounds was 1.40 mm, which occurred in /}/ onetime on an extreme edge. The greatest standard deviaerror was 0.44 mm, which occurred for /i/. Errors above 1mm occurred exclusively at the most lateral edge, and erabove 1.0 tended to occur in the posterior most row.

Optimized reconstructions also need to represent lofeatures well, such as asymmetry, local depressions,steep slopes. Optimization improved representation of lofeatures compared to equidistant slices. Midsagittal optization further improved midsagittal features. The first feture, tongue asymmetry, is more prevalent in tongue mothan in static data and so will be even more importantfuture studies. Left-to-right rotation and a ‘‘leading edgeare seen fairly often in coronal ultrasound images of speeThese asymmetries do not vary systematically with palashape, or handedness~Hamlet, 1987!; they are more prevalent in some subjects and some tasks, however. Whenslice selection is based on midsagittal points, asymmetcannot be taken into account, since no lateral informationavailable. However, leading edges and left-to-right rotatioextend across a fairly long region of the lengthwise tongand, therefore, should be captured by one or more spslices. Future research will continue to carefully assess ein representation of asymmetry using the current method

The second feature, local tongue depressions or dimpwas visible in this data set for nonhigh back vowels andThey have been observed fairly often in other ultrasoudata sets~Davis, 1996; Fig. 1! and can be inferred from sompoint tracking data@Stone, 1990~Table I!# and MRI data~Kumadaet al., 1992; Niitsuet al., 1992! as well. They tendto appear in the ‘‘middle’’ segment of the tongue~approxi-mately 2.5–4 cm back from the protruded tip! ~cf. Stone,1990!. The present 3D reconstructions captured dimples vaccurately because dimples occur at midline and thepoint sparse set optimized their representations.

Accurate representation of steep slopes was the tfeature examined in the reconstructions. Front raised sou~e.g., high front vowels! have a very advanced tongue rooThis is due to genioglossus posterior~GGP! contraction,which causes a deep posterior groove defined by a sslope midsagittally and laterally. Anteriorly, the tongue suface is high and flat, or even arched. Therefore, a sharpflection point in the midsagittal profile separates the antearch from the posterior groove. Choosing a point toofrom the inflection point will cause a serious underestimatin the slope magnitude and origin point. Moreover, durichanges from front raising to other shape categories, sucback raising, 3D motion reconstructions from inappropriatselected slices will misrepresent and reduce the accuracthe deformation.

One type of ‘‘error’’ is utterance-to-utterance variatio

lt

or


ei

eeeb

rid

ot,

rroaaxuresishs

thio

4t, p

nd

e.

eoc.

welsoni-

n-ver-

ngiat-

oc.

ed

ngforl

moc.

han-

e

-

g a

or free variation. Humans do not produce repeated spesounds identically. We expect that the induced variationshape between the /l1 / and /l2 / ~1.63-mm maximum differ-ence! is the same size or larger than would occur in frvariation and is consistent with systematic differences dumorphological constraints. If so, such differences wouldlarger than the maximum measurement error~1.40 mm!caused by using the sparse data set and should be wellresented, especially if the important features occur at mline.

The current sparse set criteria minimize the problemaccurate 3D tongue reconstruction from a sparse slice secan be seen from the maximum and standard deviation ein Table II. The standard deviation errors seen in the dwere no worse than typical measurement error. The mmum errors~above 1.3 mm! were seen on the edges. Oexpectation is that the selection of fairly equidistant slicand the optimization across all the lingual sounds in Englwill continue to provide as reasonable a 3D coverage apossible.

ACKNOWLEDGMENT

This work was supported by a research grant fromNational Institute of Deafness and Other CommunicatDisorders~NIH-R01DC01758!.

Davis, E., Douglas, A., and Stone, M.~1996!. ‘‘A Continuum MechanicsRepresentation of Tongue Motion in Speech,’’ Proceedings of theInternational Conference on Spoken Language Processing, Vol. 2788–792.

Hamlet, S.~1987!. ‘‘Handedness and articulatory asymmetries on /s/ a/l/,’’ J. Phonetics15, 191–195.

Keller, E., and Ostry, D.~1983!. ‘‘Computerized measurement of tongudorsum movements with pulsed echo ultrasound,’’ J. Acoust. Soc. Am73,1309–1315.


chn

toe

ep--

fasrs

tai-

,,is

en

hp.

Kier, W. M., and Smith, K.~1985!. ‘‘Tongues, tentacles and trunks: thbiomechanics of movement in muscular-hydrostats,’’ Zoo. J. Linnean S83, 307–324.

Kumada, M., Niitsu, M., Niimi, S., and Hirose, H.~1992!. ‘‘A Study on theInner Structure of the Tongue in the Production of the 5 Japanese Voby Tagging Snapshot MRI,’’ Research Institute of Logopedics and Phatrics, University of Tokyo, Annual Bulletin, Vol. 26, pp. 1–13.

Miyawaki, K., Hirose, H., Ushijima, T., and Sawashima, M.~1975!. ‘‘Apreliminary report on the electromyographic study of the activity of ligual muscles,’’ Research Institute of Logopedics and Phoniatrics, Unisity of Tokyo, Annual Bulletin, Vol. 9, pp. 91–106.

Niitsu, Kumada, M., Niimi, S., and Itai, Y.~1992!. ‘‘Tongue Movementduring Phonation: A Rapid Quantitative Visualization Using TaggiSnapshot MRI Imaging,’’ Research Institute of Logopedics and Phonrics, University of Tokyo, Annual Bulletin, Vol. 26, pp. 149–156.

Sonies, B., Shawker, T., Hall, T., Gerber, L., and Leighton, S.~1981!. ‘‘Ul-trasonic Visualization of Tongue Motion During Speech,’’ J. Acoust. SAm. 70, 683–686.

Stone, M.~1990!. ‘‘A three-dimensional model of tongue movement bason ultrasound and x-ray microbeam data,’’ J. Acoust. Soc. Am.87, 2207–2217.

Stone, M.~1995!. ‘‘How the tongue takes advantage of the palate durispeech,’’ inProducing Speech: Contemporary Issues: A FestschriftKatherine Safford Harris, edited by F. Bell-Berti and L. J. Raphae~American Institute of Physics, New York!, Chap. 10, pp. 143–153.

Stone, M., and Davis, E. P.~1995!. ‘‘A head and transducer support systefor making ultrasound images of tongue/jaw movement,’’ J. Acoust. SAm. 98, 3107–3112.

Stone, M., and Lele, S.~1992!. ‘‘Representing the Tongue Surface WitCurve Fits,’’ Proceedings of the International Conference on Spoken Lguage Processing, Vol. 2, pp. 875–878.

Stone, M., and Lundberg, A.~1996!. ‘‘Three-dimensional tongue surfacshapes of English consonants and vowels,’’ J. Acoust. Soc. Am.99,3728–3737.

Stone, M., and Yeni-Komshian, G.~1991!. ‘‘Some effects of pharyngealization on tongue shape as seen on ultrasound,’’ J. Acoust. Soc. Am.89,S1979~A!.

Stone, M., Sonies, B., Shawker, T., Weiss, G., and Nadel, L.~1983!.‘‘Analysis of real-time ultrasound images of tongue configuration usingrid-digitizing system,’’ J. Acoust. Soc. Am.11, 207–218.


Three-dimensional tongue surface reconstruction: …...Three-dimensional tongue surface...

Documents

Transcript of Three-dimensional tongue surface reconstruction: …...Three-dimensional tongue surface...