
Vowel recognition: Formants, Spectral Peaks,

and Spectral Shape Representations

James Hillenbrand

Speech Pathology and Audiology

Western Michigan University

Kalamazoo MI 49008

and

Robert A. Houde

RIT Research Corporation

125 Tech Park Drive

Rochester, NY 14623

Invited presentation, Acoustical Society of America, St. Louis, November, 1995.


We have decided to restrict our comments today to just a few issues having to do with the spectral representations for vowels. One of the main issues that we'd like to focus on is the general question of formants versus spectral shape representations for speech, with a particular focus on the representation of vowel quality. Before getting into that material, we would like to describe a new database that we collected of acoustic measurements for vowels in /hVd/ syllables (Hillenbrand, Getty, Clark, and Wheeler, 1995). The experiment was designed as a replication and extension of the classic Bell Labs study by Peterson & Barney (1952). The measurements that Peterson & Barney reported have obviously had a profound influence on vowel perception research. Our study was intended to address a number of limitations of this classic work, the most important of which are: (1) the unavailability of the original recordings, and (2) the absence of spectral change and duration measurements.

The design of the study was quite simple. Recordings were made of /hVd/ utterances spoken by 139 talkers: 45 men, 48 women, and 46 10- to 12-year-old children. The utterances included the 10 vowels recorded by Peterson & Barney, plus /e/ and /o/. The majority of the talkers (87%) were from the Lower Peninsula of Michigan; the remainder were primarily from other areas of the upper Midwest (e.g., Illinois, Wisconsin, Ohio, Minnesota). The talkers were screened for dialect using a fairly simple procedure that I will not have time to describe. Recordings were made of the 12 vowels in /hVd/ context. The acoustic measurements consisted of: (1) vowel duration and "steady state" times, which were measured by hand from gray-scale spectrograms, (2) f0 contours, which were automatically tracked and then hand edited, and (3) contours of F1-F4, which were hand-edited from LPC peak displays.

_______________________________________________

Figure 1

_______________________________________________

Figure 1 shows an LPC peak display before and after hand editing. From left to right, the dashed lines indicate the start of the vowel, an "eyeball" measurement of the steadiest portion of the vowel, and the end of the vowel.

_______________________________________________

Figure 2

_______________________________________________

The signals were presented to a panel of listeners for identification. Although there are some details that differ, the overall intelligibility of our vowels is nearly identical to that of Peterson & Barney. In both cases, the error rate is about 5%, where an error is any trial in which the listener reports hearing a vowel other than the one intended by the talker. It is also of some interest to note that identification rates do not differ much across the three groups of talkers. (Peterson & Barney did not report results separately for men, women, and child talkers.)


_______________________________________________

Figure 3

_______________________________________________

In contrast to the identification results, there were some very large differences in the acoustic measurements. The two main differences were: (1) our data showed somewhat more overlap among vowel categories than the Peterson & Barney data when only F1 and F2 are considered, and (2) the average positions of some of the vowels in F1-F2 space are different. Figure 3 shows a standard F1-F2 plot of the data from our study. Individual tokens that were not well identified by the listeners have been removed, and the database has been thinned of redundant data points. It can be seen that there is a large amount of overlap among adjacent vowels. The overlap is especially large for the vowels /æ/ and /ɛ/, /u/ and /ʊ/, and the group /ɑ-ɔ-ʌ/. As will be seen in a moment, these vowels can be separated rather well when duration and spectral change are taken into account.

_______________________________________________

Figure 4

_______________________________________________

Figure 4 compares average formant values from our study with Peterson & Barney. The figure shows data from the adult females only, although very similar patterns are seen for the men. (The children present a somewhat different case; this is discussed in our JASA paper.) The data are plotted in the form of an acoustic vowel diagram; the Peterson & Barney data are shown in dotted lines, and our data are shown in solid lines. The central vowels /ɝ/ and /ʌ/ occupy nearly identical positions in the two data sets, but substantial differences are seen for all other vowels. To the extent that we can rely on traditional articulatory interpretations of formant data, these findings suggest the following: (1) the high vowels in our data are produced somewhat lower in the mouth, (2) the back vowels in our data seem to be produced with somewhat more anterior tongue positions, and (3) the low back vowels /ɑ/ and /ɔ/ in our data are shifted down and forward. There are also very large differences in the production of the vowels /æ/ and /ɛ/. The Peterson & Barney formant data are consistent with the textbook description of these vowels, suggesting a higher and more anterior tongue position for /æ/ than /ɛ/. Our data suggest rather similar tongue positions for these two vowels, at least when the formants are sampled at "steady state". Our results are consistent with several studies by Labov suggesting that the vowel /æ/ is drifting upward in many dialects of American English, particularly the "Northern Cities" dialect spoken by our subjects (e.g., Labov, Yaeger, and Steiner, 1972). We believe that our speakers are differentiating these two vowels primarily by differences in the spectral change patterns and duration.

These comparisons raise a rather obvious question: What accounts for the differences between the two sets of formant measurements? Very broadly, there seem to be only two possibilities: (1) methodological differences in the measurement of the formant frequencies (we used LPC spectral peaks and Peterson & Barney used narrow band spectra produced on an analog spectrograph), and (2) dialect differences. Although we do not have time to develop the argument here, we believe that dialect is by far the more important of the two factors. There are two dialect factors that may have played a role. First, as discussed above, there are dialect differences based on region. (This is somewhat difficult to evaluate since so little is known about the dialect of Peterson & Barney's talkers, although the majority of the women and children were from the Mid-Atlantic region.) Second, the two sets of recordings are separated by approximately 40 years. An excellent study of British RP by Bauer (1985) shows clearly that substantial shifts in vowel formant frequencies can be expected over a period of 40 years.

There is one final point that is worth making on these comparisons. There has been a tendency over the years to view the Peterson & Barney measurements as some sort of a "gold standard" that tells us how the vowels of American English are produced. I think that we all understand that these values vary across region and that they can show significant changes over time. If nothing else, our study serves to remind us of these simple facts.

_______________________________________________

Figure 5

_______________________________________________

All of the measurements that have been discussed up to this point were taken at the point in the vowel where the formant pattern was judged to be maximally steady. Figure 5 shows what the spectral change patterns look like for our vowels. What we have done here is simply to connect a line between the spectral pattern sampled at 20% of vowel duration and the spectral pattern sampled at 80% of vowel duration. The symbol for each vowel is plotted at the location of the second sample. There are just two simple points that we would like to make about this figure. First, it is obvious that nearly all of the vowels show a fair amount of spectral movement. The only notable exceptions are /i/ and /u/, which show very little formant movement. The more important point, however, is that the spectral changes occur in such a way as to increase the contrast between adjacent vowels. For example, vowel pairs such as /æ/ and /ɛ/, /u/ and /ʊ/, and /ɑ/ and /ɔ/, which occupy very similar static positions, show very different patterns of spectral change.
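The 20%/80% sampling scheme just described is easy to sketch in code; the short formant track below is a hypothetical /æ/-like token with converging formants (illustrative values, not measurements from the database):

```python
def sample_formants(track, fractions=(0.2, 0.8)):
    """Sample a formant track at given fractions of vowel duration.

    track: list of (F1, F2) tuples in Hz, one per analysis frame.
    """
    n = len(track)
    return [track[min(int(round(f * (n - 1))), n - 1)] for f in fractions]

# Hypothetical /ae/-like track in which F1 and F2 converge toward vowel offset:
track = [(660, 1900), (680, 1850), (700, 1800), (690, 1700), (650, 1600)]
onset, offset = sample_formants(track)
print(onset, offset)  # (680, 1850) (690, 1700)
```

Sampling away from the absolute endpoints (20% and 80% rather than 0% and 100%) keeps the measurements clear of the /h/ and /d/ transitions.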

One obvious limitation of this figure is that we are displaying averages rather than individual data points. We have not found a way to make a coherent display of the spectral change patterns that shows individual data points, but this issue can be addressed by looking at the performance of a discriminant classifier that is trained on either static or dynamic formant patterns.

_______________________________________________

Figure 6

_______________________________________________

Figure 6 shows the classification accuracy for a quadratic discriminant classifier that has been trained on either a single sample of the formant pattern or two samples, taken at 20% and 80% of vowel duration. The open bars show the results for the single-sample case, and the filled bars show the results for two samples. In the left half of the figure duration information was not used, while the right half of the figure shows comparable results with vowel duration added to the parameter set. Results are shown for several combinations of parameters: F1-F2, F1-F2-F3, f0-F1-F2, and f0-F1-F2-F3.


The discriminant analysis results are quite straightforward: the pattern classifier does a much better job of sorting the vowels when it is provided with information about the pattern of spectral change. The effect is particularly large for simpler parameter sets. For example, with F1 and F2 alone, the classification accuracy is 71% for a single sample at "steady state" and 91% for two samples of the formant pattern. Adding vowel duration to the parameter set also improves classification accuracy, although the effect is not particularly large if the formant pattern is sampled twice.
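For readers who want the mechanics: a quadratic discriminant classifier of this kind fits one full-covariance Gaussian per vowel category and assigns a token to the category with the highest log-likelihood. A minimal numpy sketch follows; the F1-F2 values are toy numbers standing in for two confusable vowels, not measurements from the database:

```python
import numpy as np

def train_qda(X, y):
    """Fit one full-covariance Gaussian per class: (mean, inverse cov, log det)."""
    models = {}
    for c in sorted(set(y)):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # jitter for stability
        models[c] = (mu, np.linalg.inv(cov), np.log(np.linalg.det(cov)))
    return models

def classify(x, models):
    """Return the class with the highest Gaussian log-likelihood for x."""
    def loglik(m):
        mu, icov, logdet = m
        d = x - mu
        return -0.5 * (d @ icov @ d + logdet)
    return max(models, key=lambda c: loglik(models[c]))

# Toy "F1-F2 at steady state" training tokens for two vowel categories:
X = np.array([[650, 1850], [660, 1845], [640, 1862], [655, 1843], [645, 1852],
              [550, 1750], [562, 1741], [541, 1764], [556, 1747], [546, 1752]], float)
y = np.array(["ae"] * 5 + ["eh"] * 5)
models = train_qda(X, y)
print(classify(np.array([648.0, 1848.0]), models))  # ae
```

Moving from one formant sample to two simply doubles the feature dimension per parameter set; the classifier itself is unchanged.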

The pattern classification approach has some obvious limitations (e.g., Nearey, 1992; Hillenbrand & Gayvert, 1993). Showing that a pattern classifier is strongly affected by spectral change is not the same as showing that human listeners rely on spectral change patterns. We recently completed an experiment that was designed to examine this question more closely using synthesized signals. We selected 300 signals from our /hVd/ database. Listeners were asked to identify the original signals and two different synthesized versions. The two synthesis methods are shown in Figure 7.

_______________________________________________

Figure 7

_______________________________________________

One set of signals was synthesized using the original measured formant contours, shown here by the dashed curves. The vowel here is /æ/, and the original measured formant contour shows a pronounced offglide in which the formants converge, a pattern which was quite common in our data for /æ/. A second set of signals was synthesized with flat formants, fixed at the values measured at "steady state," shown here by the solid curves. For these signals a 40 ms transition connected the steady-state values to the formant values that were measured at the end of the vowel. Both sets of synthetic signals were generated with the original measured f0 contours and at the original measured durations.
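The flat-formant condition can be sketched as a contour generator: hold each formant at its steady-state value, then connect to the vowel-offset value with a 40 ms linear transition. The frame rate and the particular frequency values below are assumptions for illustration:

```python
def flat_contour(steady_hz, offset_hz, dur_ms, frame_ms=5, trans_ms=40):
    """One formant contour: steady value, then a linear 40 ms offglide."""
    n = int(dur_ms / frame_ms)                   # total frames
    n_trans = min(int(trans_ms / frame_ms), n)   # frames in the final transition
    step = (offset_hz - steady_hz) / n_trans
    flat = [steady_hz] * (n - n_trans)
    trans = [steady_hz + step * (i + 1) for i in range(n_trans)]
    return flat + trans

# A hypothetical F1 for a 200 ms vowel: flat at 700 Hz, gliding to 560 Hz.
f1 = flat_contour(steady_hz=700.0, offset_hz=560.0, dur_ms=200)
print(f1[0], f1[-1], len(f1))  # 700.0 560.0 40
```

One such contour per formant, together with the original f0 contour and duration, specifies the flat-formant stimulus.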

_______________________________________________

Figure 8

_______________________________________________

The labeling results are shown in Figure 8. The findings are rather clear: identification rates are highest for the original signals, and lowest for the signals that were synthesized with flattened formant contours. The primary purpose of this experiment was to measure the relative importance of formant movements in vowel recognition, but the significant drop in intelligibility between the naturally produced signals and the signals that were synthesized to match the measured formant patterns is quite important, in our view. This drop in intelligibility can be attributed to some unknown combination of two factors: (1) errors in the estimation of the acoustic measurements, especially the formant frequencies, and (2) the loss of phonetically relevant information in the transformation to a formant representation, as has been suggested by Bladon (Bladon, 1982; Bladon and Lindblom, 1981) and Zahorian and Jagharghi (1986, 1993). We will have more to say about the general question of formant representations later.


The effect of flattening the formant contours is rather large, and not especially surprising given the significant body of evidence that has accumulated over the last 20 years or so implicating a role for spectral change in vowel recognition (e.g., Jenkins et al., 1983; Nearey & Assmann, 1986; Nearey, 1989; Hillenbrand and Gayvert, 1993). We see this study not so much as a confirmation that spectral change is important, since this is already well established, but as a way to measure the size of the effect using speech signals other than isolated vowels, signals which are closely modeled after naturally produced utterances. An obvious limitation of this study is that vowel duration was not manipulated. We are working on that experiment right now and should have some results to report soon. [Note: This study has been completed; see Hillenbrand, Clark, and Houde, 2000.]

FORMANTS VERSUS SPECTRAL SHAPE

The next issue that we would like to discuss is the longstanding debate regarding formant representations versus overall spectral shape. This is an issue that has received only sporadic attention over the years, but has been revived to some extent recently, due in part to the work of Bladon and Lindblom (Bladon, 1982; Bladon and Lindblom, 1981) and Zahorian and Jagharghi (1986, 1993). The essence of formant theory is that phonetic quality is controlled not by the fine details of the spectrum, but by the frequency locations of spectral peaks that correspond to the natural resonant frequencies of the vocal tract.

_______________________________________________

Figure 9

_______________________________________________

The primary virtue of formant theory is that formant representations account for a very large number of findings in the phonetic perception literature. The most straightforward demonstration is the high intelligibility of speech that has been synthesized from measured formant patterns.

There are, however, several important problems with formant representations. The most important of these is the one that Bladon has called the determinacy problem. Quite simply, after several decades of concentrated effort by many imaginative investigators, no one has developed a reliable method to measure formant frequencies automatically from natural speech. The problem, of course, is that formant tracking involves more than just extracting spectral peaks; it is necessary to assign each of the peaks to specific formant slots corresponding to F1, F2, and F3.
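To make the slot-assignment problem concrete, here is a deliberately naive sketch (the frequency ranges and peak values are illustrative assumptions, not a real tracker): it works when exactly one peak falls in each range, and fails outright as soon as a formant is weak or two formants merge, which is precisely the determinacy problem.

```python
def assign_slots(peaks_hz, ranges=((200, 900), (800, 2300), (2200, 3500))):
    """Naively assign spectral peaks (Hz) to F1/F2/F3 slots by frequency range."""
    slots = {}
    for name, (lo, hi) in zip(("F1", "F2", "F3"), ranges):
        used = set(slots.values())
        candidates = [p for p in peaks_hz if lo <= p <= hi and p not in used]
        slots[name] = candidates[0] if candidates else None
    return slots

print(assign_slots([650, 1700, 2600]))  # clean case: one peak per slot
print(assign_slots([700, 2600]))        # merged F1-F2: the F2 slot comes up empty
```

Real trackers add continuity constraints across frames, but the underlying assignment ambiguity remains.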

A second problem has to do with explaining listening errors. As Klatt (1982a; 1982b) has pointed out, errors that are made by human listeners do not appear to be consistent with the idea that the auditory system is tracking formants. When formant trackers make errors, they tend to be large ones, since the errors involve missing formants entirely or tracking spurious peaks. Human listeners, on the other hand, nearly always hear a vowel that is very close in phonetic space to the intended vowel. This fact is easily accommodated by a whole-spectrum model.

_______________________________________________

Figure 10

_______________________________________________


An alternative to formant representations is a whole spectrum or spectral shape model. The basic idea is essentially the inverse of formant theory: Phonetic quality is assumed to be controlled not by formants, but by the overall shape of the spectrum. According to this view, formants only seem to be important because spectral peaks have such a large effect on the overall shape of the spectrum. The main advantage of this view is that it does not require formant tracking, which we know to be somewhere between difficult and impossible. Spectral shape models can also easily accommodate the data on listening errors. But there is a very significant problem with whole spectrum models: There is very solid evidence that large alterations can be made in spectral shape without affecting phonetic quality, as long as formant frequencies are preserved. There is a fair amount of evidence on this issue, but the key experiment, in our view, comes from a rather simple study by Klatt (1982a).

_______________________________________________

Figure 11

_______________________________________________

Listeners in Klatt's study were presented with pairs of synthetic signals consisting of a standard /æ/ and one of 13 comparison signals. The comparison signals were generated by manipulating one of two kinds of parameters: (1) those that affect overall spectral shape (e.g., high- or low-pass filtering, spectral tilt, changes in formant amplitudes or bandwidths, the introduction of spectral notches, etc.), and (2) changes in formant frequencies. Listeners were asked to judge either the overall psychophysical distance or the phonetic distance between the standard and comparison stimuli. Results were very clear: changes in formant frequency were easily the most important factor affecting phonetic quality. Other changes in spectral shape, while clearly audible to the listeners, did not result in large changes in phonetic quality.

The conclusions that we draw from this literature as a whole (not just the Klatt experiment) are the following:

1. Phonetic quality seems to be controlled primarily by the frequencies of spectral peaks. However, we do not believe that a traditional formant analysis is feasible.

2. We believe that the evidence is consistent with a spectral shape model, but one which is strongly dominated by energy in the vicinity of spectral peaks.

In response to these findings we have developed a spectrum analysis technique called the Masked Peak Representation (MPR). The MPR is a relatively simple transform that was designed to: (1) show the kind of sensitivity to spectral peaks that is exhibited by human listeners, but without requiring explicit formant tracking, (2) be relatively insensitive to spectral shape details other than peaks, and (3) be relatively insensitive to formant mergers, split formants, and related phenomena that cause such difficulty for traditional formant frequency representations. The representation shares some features with a cepstrum-based processing scheme described by Aikawa et al. (1993), which we recently became aware of.

_______________________________________________

Figure 12

_______________________________________________


The spectrum analysis begins with a smoothed Fourier power spectrum; the smoothing is carried out both across frequency and across time. The only thing unconventional about the analysis is the calculation of a running average of spectral amplitudes, calculated over a rather wide (800 Hz) window. This running average is subtracted from the smoothed spectrum, producing what we have called the masked spectrum. Subtraction of the running average is a rough simulation of lateral suppression, and one practical result is that it allows you to find spectral peaks that would otherwise be missed. Although it may not be obvious from this figure, the bump corresponding to F2 in this signal is not, in fact, a spectral peak, but a soft shoulder that would be missed by a conventional peak picker. We find this pattern to be rather common, and there is little doubt that listeners are sensitive to these shoulders, because the phonetic quality is off for signals that are synthesized without representing these shoulders.
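The masking step can be sketched as follows. The toy spectrum is an assumption for illustration: a steep spectral tilt plus a broad resonance near 1500 Hz that forms only a shoulder, with no local maximum in the raw spectrum. After subtracting the wide-window running average, the shoulder emerges as a clear peak.

```python
import numpy as np

def masked_spectrum(spec_db, bin_hz, window_hz=800.0):
    """Subtract a centered running average (window_hz wide) from a dB spectrum."""
    half = max(int(window_hz / bin_hz) // 2, 1)
    out = np.empty_like(spec_db)
    for i in range(len(spec_db)):
        lo, hi = max(0, i - half), min(len(spec_db), i + half + 1)
        out[i] = spec_db[i] - spec_db[lo:hi].mean()
    return out

# Toy smoothed spectrum: strong tilt plus a resonance that is only a shoulder.
f = np.arange(0.0, 4000.0, 50.0)
spec = 80.0 - 0.02 * f + 6.0 * np.exp(-((f - 1500.0) / 400.0) ** 2)
m = masked_spectrum(spec, bin_hz=50.0)
print(int(np.argmax(spec)))              # 0: the raw spectrum falls monotonically
print(f[10 + int(np.argmax(m[10:70]))])  # 1500.0: masking exposes the shoulder
```

In the masked spectrum the tilt cancels and the valleys go negative, so a simple peak picker applied to the result finds the shoulder that it would have missed in the raw spectrum.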

_______________________________________________

Figure 13

_______________________________________________

The display on top is a normal FFT spectrogram of the utterance "OK, Hello," and the display below shows what the utterance looks like after subtraction of the running average. It can be seen that most of what is left after the subtraction of the running average is the energy in the vicinity of spectral peaks; overall spectral tilt has been removed, as has the energy in the valleys between spectral peaks. However, although the formants are quite visible, there are many peaks in the masked spectrum that clearly do not correspond to formants.

We are just beginning a series of experiments that are intended to test the validity of the MPR. One specific claim that we are making is that the MPR retains those aspects of the original speech signal that are most relevant to phonetic quality. As one test of this claim, we resynthesized signals from the /hVd/ database from masked spectra. The synthesizer that was used in this experiment was a damped sinewave vocoder. The vocoder is driven by three kinds of information:

1. Spectral Peaks

2. Instantaneous Fundamental Frequency

3. A Measure of the Degree of Periodicity

_______________________________________________

Figure 14

_______________________________________________

The spectral peaks that drive the synthesizer are extracted from the masked spectrum. The signal is then resynthesized by summing exponentially damped sinusoids at frequencies corresponding to the spectral peaks. This figure shows three damped sinusoids that would be summed to produce the vowel /ɑ/. If the periodicity measure indicates that the speech is voiced, the damped sinusoids are repeated at a rate corresponding to the measured fundamental period. For unvoiced speech, the damped sinusoids are pulsed on and off at random intervals.
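As a sketch of the voiced case: one damped sinusoid per spectral peak is summed into a single pulse, and the pulse is repeated at the fundamental period. The peak frequencies, bandwidths, f0, and sample rate below are illustrative assumptions (textbook-style /ɑ/ values, not measurements from our database), and the unvoiced branch is omitted:

```python
import numpy as np

def damped_sine_synth(freqs_hz, bands_hz, f0_hz, dur_s, fs=16000):
    """Repeat a sum of exponentially damped sinusoids at the fundamental period."""
    n = int(dur_s * fs)
    period = int(fs / f0_hz)                  # samples per pitch period
    t = np.arange(period) / fs
    pulse = sum(np.exp(-np.pi * b * t) * np.sin(2 * np.pi * fr * t)
                for fr, b in zip(freqs_hz, bands_hz))
    out = np.zeros(n)
    for start in range(0, n, period):         # one pulse per pitch period
        seg = min(period, n - start)
        out[start:start + seg] = pulse[:seg]
    return out

# Three damped sinusoids at /ɑ/-like peak frequencies, 120 Hz f0, 200 ms:
y = damped_sine_synth([730.0, 1090.0, 2440.0], [60.0, 80.0, 120.0],
                      f0_hz=120.0, dur_s=0.2)
```

Each damped sinusoid contributes a resonance-like spectral peak at its frequency, with the decay rate setting the effective bandwidth.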


__________________________________________________________________

Audio Tape: Damped Sinewave Synthesis Demo

__________________________________________________________________

The vowel intelligibility tests that we conducted with the damped sinewave synthesizer used the same 300 signals from the /hVd/ database that were used in the formant synthesis experiment described previously. The degree-of-periodicity measure that is normally used to control the voiced/unvoiced balance in the damped sinewave vocoder was not used in this experiment: The damped sinusoids were pulsed at random intervals during the /h/, and at intervals derived from the hand-edited fundamental frequency measurements during the remainder of the syllable.

Listeners were asked to identify: (1) the original, naturally produced utterances, (2) utterances that were resynthesized from the MPR spectra using the damped sinewave vocoder, and (3) formant-synthesized utterances that followed the measured pitch and formant contours.

_______________________________________________

Figure 15

_______________________________________________

There are several details of the labeling results with these signals that are of some interest, but in the limited time that is available we will only be able to take a brief look at overall intelligibility. The main finding here is that the signals that were generated from the MPR-derived spectral peaks are quite intelligible, although less intelligible than the original signals. The main point to be made about these results is that vowel quality (and probably phonetic quality in general) is maintained quite well if spectral peaks are preserved. Further, the spectral peaks do not necessarily need to correspond to formants. We should point out that we do not regard this as an especially strong test of the validity of the MPR. A stronger test would involve demonstrating that phonetic distance judgments such as those reported in Klatt's (1982a) study can be predicted well from masked spectra. We are currently developing a distance measure using masked spectra that could be used in an experiment of this kind. We hope to have results to report soon.


REFERENCES

Aikawa, K., Singer, H., Kawahara, H., and Tohkura, Y. (1993). A dynamic cepstrum incorporating time-frequency masking and its application to continuous speech recognition. Proceedings of ICASSP-93, 668-671.

Assmann, P., Nearey, T., and Hogan, J. (1982). Vowel identification: orthographic, perceptual, and acoustic aspects. Journal of the Acoustical Society of America, 71, 975-989.

Bauer, L. (1985). Tracing phonetic change in the received pronunciation of British English. Journal of Phonetics, 13, 61-81.

Bennett, D.C. (1968). Spectral form and duration as cues in the recognition of English and German vowels. Language and Speech, 11, 65-85.

Bennett, S. (1983). A three-year longitudinal study of school-aged children's fundamental frequencies. Journal of Speech and Hearing Research, 26, 137-142.

Bladon, A. (1982). Arguments against formants in the auditory representation of speech. In R. Carlson and B. Granstrom (Eds.), The Representation of Speech in the Peripheral Auditory System. Amsterdam: Elsevier Biomedical Press, 95-102.

Bladon, A., and Lindblom, B. (1981). Modeling the judgment of vowel quality differences. Journal of the Acoustical Society of America, 69, 1414-1422.

Hedlin, P. (1982). A representation of speech with partials. In R. Carlson and B. Granstrom (Eds.), The Representation of Speech in the Peripheral Auditory System. Amsterdam: Elsevier Biomedical Press, 247-250.

Hillenbrand, J.M., Clark, M.J., and Houde, R.A. (2000). Some effects of duration on vowel recognition. Journal of the Acoustical Society of America, 108, 3013-3022.

Hillenbrand, J., and Gayvert, R.T. (1993). Vowel classification based on fundamental frequency and formant frequencies. Journal of Speech and Hearing Research, 36, 694-700.

Hillenbrand, J., and Gayvert, R.T. (1993). Identification of steady-state vowels synthesized from the Peterson and Barney measurements. Journal of the Acoustical Society of America, 94, 668-674.

Hillenbrand, J., Getty, L.A., Clark, M.J., and Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97, 1300-1313.

Hillenbrand, J.M., and Nearey, T.M. (1999). Identification of resynthesized /hVd/ syllables: Effects of formant contour. Journal of the Acoustical Society of America, 105, 3509-3523.

Jenkins, J.J., Strange, W., and Edman, T.R. (1983). Identification of vowels in 'vowelless' syllables. Perception & Psychophysics, 34, 441-450.

Klatt, D.H. (1982a). Prediction of perceived phonetic distance from critical-band spectra: A first step. Proceedings of IEEE ICASSP, 1278-1281.

Klatt, D.H. (1982b). Speech processing strategies based on auditory models. In R. Carlson and B. Granstrom (Eds.), The Representation of Speech in the Peripheral Auditory System. Amsterdam: Elsevier Biomedical Press, 181-196.

Labov, W., Yaeger, M., and Steiner, R. (1972). A quantitative study of sound change in progress. Report on National Science Foundation Contract NSF-FS-3287.

Miller, J.D. (1989). Auditory-perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85, 2114-2134.

Nearey, T.M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85, 2088-2113.

Nearey, T.M. (1992). Applications of generalized linear modeling to vowel data. In J. Ohala, T. Nearey, B. Derwing, M. Hodge, and G. Wiebe (Eds.), Proceedings of ICSLP 92, 583-586. Edmonton, AB: University of Alberta.

Nearey, T.M., Hogan, J., and Rozsypal, A. (1979). Speech signals, cues and features. In G. Prideaux (Ed.), Perspectives in Experimental Linguistics. Amsterdam: Benjamins.

Nearey, T.M., and Assmann, P. (1986). Modeling the role of vowel inherent spectral change in vowel identification. Journal of the Acoustical Society of America, 80, 1297-1308.

Peterson, G., and Barney, H.L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175-184.

Strange, W. (1989). Dynamic specification of coarticulated vowels spoken in sentence context. Journal of the Acoustical Society of America, 85, 2135-2153.

Strange, W., Jenkins, J.J., and Johnson, T.L. (1983). Dynamic specification of coarticulated vowels. Journal of the Acoustical Society of America, 74, 695-705.

Syrdal, A.K. (1985). Aspects of a model of the auditory representation of American English vowels. Speech Communication, 4, 121-135.

Syrdal, A.K., and Gopal, H.S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086-1100.

Zahorian, S., and Jagharghi, A. (1986). Matching of 'physical' and 'perceptual' spaces for vowels. Journal of the Acoustical Society of America, Suppl. 1, 79, S8.

Zahorian, S., and Jagharghi, A. (1993). Spectral shape features versus formants as acoustic correlates for vowels. Journal of the Acoustical Society of America, 94, 1966-1982.

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.

Figure 7.

Figure 8. (Note: This study has been completed and the results published in Hillenbrand & Nearey, JASA, 1999.)

Formant Representations

Basic Idea: Phonetic quality is controlled NOT by the detailed shape of the spectrum, but by the formant frequency pattern.

Advantage: Formants account for a large number of findings in speech perception. The high intelligibility of formant synthesis (and even sinewave speech) confirms that formants convey a great deal of phonetic information.

Problems: (1) Formant tracking is an unresolved problem. (2) Many listening errors are inconsistent with formants.

Figure 9.
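As an illustration of the basic idea only (not the recognizer used in any study cited here), a formant representation reduces each vowel to its formant frequencies and classifies by nearest stored pattern. The F1/F2 template values below are rough textbook-style figures for an adult male talker, chosen purely for this sketch:

```python
# Minimal sketch of formant-pattern vowel classification.
# Template F1/F2 values (Hz) are illustrative approximations, NOT
# measurements from the Hillenbrand et al. (1995) database.
import math

VOWEL_TEMPLATES = {
    "iy": (270, 2290),   # as in "heed"
    "ae": (660, 1720),   # as in "had"
    "aa": (730, 1090),   # as in "hod"
    "uw": (300, 870),    # as in "who'd"
}

def classify_by_formants(f1, f2):
    """Return the template vowel nearest in (F1, F2) space (Euclidean, Hz)."""
    return min(VOWEL_TEMPLATES,
               key=lambda v: math.dist((f1, f2), VOWEL_TEMPLATES[v]))

print(classify_by_formants(280, 2250))  # → iy
```

Note that the classification depends only on the peak frequencies; the spectrum's overall level, tilt, and bandwidths never enter the computation.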

Spectral Shape Representations

Basic Idea: Phonetic quality is controlled NOT by the formant pattern, but by the overall shape of the spectrum.

Advantages: (1) Does not require formant tracking. (2) Accommodates data on listening errors (and vowel similarity judgments).

Problem: Large alterations can be made in spectral shape without affecting phonetic quality, as long as the formant frequencies are preserved.

Figure 10.

Klatt (1982) Experiment

-----------------------------------------------------------------
                                            Psy      Phon
-----------------------------------------------------------------
Spectral Shape Manipulations
-----------------------------------------------------------------
 1. Random Phase                           20.0       2.9
 2. High-Pass Filter                       15.8       1.6
 3. Low-Pass Filter                         7.6       2.2
 4. Spectral Tilt                           9.6       1.5
 5. Overall Amplitude                       4.5       1.5
 6. Formant Bandwidth                       3.4       2.4
 7. Spectral Notch                          1.6       1.5
-----------------------------------------------------------------
Formant Frequency Manipulations
-----------------------------------------------------------------
 8. F1 ±8%                                  6.5       6.2
 9. F2 ±8%                                  5.0       6.1
10. F1, F2 ±8%                              8.2       8.6
11. F1, F2, F3, F4, F5 ±8%                 10.7       8.1
-----------------------------------------------------------------
Psy:  Perceived Psychophysical Distance
Phon: Perceived Phonetic Distance

Figure 11.
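The pattern in Klatt's table can be illustrated with a toy computation (our own sketch, not Klatt's critical-band metric): a uniform spectral tilt produces a large overall spectrum-level distance, yet leaves the spectral peak locations, and hence the predicted phonetic quality under a formant account, unchanged.

```python
# Toy sketch (NOT Klatt's actual distance measure): a spectral-tilt change
# yields a large frame-level spectral distance, but the spectral-peak
# (formant) locations -- a crude proxy for phonetic quality -- survive intact.

def rms_distance(a, b):
    """Root-mean-square difference between two spectra (dB per bin)."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

def peak_bins(spectrum):
    """Indices of local maxima: a crude stand-in for formant peak locations."""
    return [i for i in range(1, len(spectrum) - 1)
            if spectrum[i - 1] < spectrum[i] > spectrum[i + 1]]

base = [0, 10, 30, 10, 5, 25, 5, 0, 15, 0]          # dB; peaks at bins 2, 5, 8
tilted = [x - 1.5 * i for i, x in enumerate(base)]  # apply a -1.5 dB/bin tilt

print(peak_bins(base) == peak_bins(tilted))  # → True (peaks unmoved)
print(round(rms_distance(base, tilted), 1))  # → 8.0 (large shape distance)
```

This is the asymmetry in rows 1-7 of the table: manipulations that preserve peak frequencies can produce large psychophysical distances with little phonetic effect.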

Figure 12.

Figure 13.

Figure 14.

________________________________________________________________________________
IDENTIFICATION RESULTS
________________________________________________________________________________
                ORIGINAL         MPR              FORMANT
                SIGNALS          SYNTHESIS        SYNTHESIS

                  96.9             90.9             88.7
________________________________________________________________________________

Figure 15.