Acoustic Determinants of Phrase Boundary...



Acoustic determinants of phrase boundary perception Lynn A. Streeter Bell Laboratories, Murray Hill, New Jersey 07974 (Received 24 April 1978; revised 5 September 1978)

How three suprasegmental variables (amplitude, pitch contour, and duration pattern) influence phrase boundary perception was investigated in two studies. Listeners located the phrase boundary in ambiguous algebraic expressions, such as "(A plus E) times O" and "A plus (E times O)." In one experiment, two values of each of three variables (appropriate or neutral) were orthogonally varied, using linear predictive coding analysis-synthesis procedures. There was a total of eight manipulations for each expression. In the other, the three suprasegmental variables were exchanged between the two alternative meanings of an expression, yielding a total of eight manipulations for each expression. Results from the two studies were consistent in showing that listeners use all three cues, and just these three, to parse such utterances. That is, it was possible to completely shift the meaning of an expression uttered with one meaning into its alternate meaning by exchanging all three variables. In both studies, the effects of duration pattern and pitch contour were additive in total proportion correct. Possible models of how listeners process pitch and duration information independently in making a parsing decision are discussed.

PACS numbers: 43.70.Dn, 43.70.Ve

INTRODUCTION

Observations of production data have indicated that a number of acoustic events are correlated with the presence of a major syntactic boundary. For instance, speakers can modulate the fundamental frequency contour to group together words that constitute a major syntactic unit. Similarly, the durational pattern or rhythmic structure of an utterance varies with syntactic structure. For example, Klatt (1975) observed that phonetic segments are lengthened before syntactic boundaries. Speakers sometimes insert pauses at phrase boundaries (O'Malley, Kloker, and Dara-Abrams, 1973; Macdonald, 1976) as well as alter the phonetic structure itself preceding a boundary (Lehiste, 1960; Lehiste, 1973; Dukes and Nakatani, 1976). Thus, there appear to be numerous options available to signal a major syntactic boundary's presence.

However, with respect to the listener's perception of syntactic boundaries, the relative importance of various acoustic cues is still unknown. There exists a sizable literature on how various suprasegmental cues, such as amplitude, fundamental frequency, and duration, contribute to the perception of lexical stress in polysyllabic words presented in isolation (Fry, 1955; Bolinger, 1958; Fry, 1958; Rigault, 1962; Morton and Jassem, 1965; Westin, Buddenhagen, and Obrecht, 1966). In general, the findings of such studies indicate that while duration, intensity, and fundamental frequency all contribute to the perception of lexical stress, intensity variation appears to be the weakest cue and fundamental frequency contour the strongest. However, our knowledge of how these suprasegmental variables affect phrase boundary perception is incomplete.

With respect to syntactic boundary perception, there is evidence that durational variables can affect listeners' perception. O'Malley, Kloker, and Dara-Abrams (1973) as well as Macdonald (1976) have shown that juncture

a) Based on a talk given at the Acoustical Society of America Meeting, San Diego, CA, November 1976 [J. Acoust. Soc. Am. 60, 828(A) (1976)].

pauses longer than 50 ms are used by listeners to parse syntactically ambiguous utterances. Lehiste, Olive, and Streeter (1976) found that listeners can reliably disambiguate utterances using durational cues other than pauses. In that study the duration of ambiguous constituents was linearly expanded or compressed in time. They found that duration was an effective disambiguation cue only when the two meanings of a particular sentence could be represented by two distinct surface structure bracketings, e.g., "The [old (men and women)] stayed at home." versus "The [(old men) and women] stayed at home."

The experiments reported here investigated the role of three suprasegmental variables (amplitude, duration, and fundamental frequency) in the perception of a phrase boundary. To study the perceptual prominence of and interactions among these variables, ambiguous algebraic expressions of the type used by O'Malley, Kloker, and Dara-Abrams (1973) were used [e.g., "(A plus E) times O" or alternatively, "A plus (E times O)"]. Admittedly, such expressions constitute a rather special subset of possible English sentences. However, they have the distinct advantage of being truly "practically" ambiguous. Many ambiguous sentences are difficult for listeners to perceive as ambiguous. Even if listeners can perceive a meaning duality, it is difficult to find sentences with nearly equally probable alternative meanings.

In the first experiment two values of these three prosodic variables were orthogonally varied (a naturally occurring value and a neutral value), yielding a total of eight manipulations for each expression. All expressions were analyzed and manipulated using linear predictive coding analysis-synthesis procedures. Listeners decided on the phrase boundary's location. In a second experiment these same three suprasegmental variables were systematically exchanged between the two alternative renderings of a given expression. This procedure of orthogonal variation allows one not only to rank variables in terms of relative importance in the perceptual parsing process, but in addition one can test

1582 J. Acoust. Soc. Am. 64(6), Dec. 1978 0001-4966/78/6406-1582$00.80 © 1978 Acoustical Society of America 1582


1583 Lynn A. Streeter: Acoustic determinants of phrase boundary perception 1583

TABLE I. Duration patterns of utterances used in experiment I.

                           Duration (ms)
Expression               A    +/x    E    +/x    O
(A + E) x O             160   270   280   350   290
A + (E x O)             280   290   230   340   400
(A x E) + O             160   390   260   310   400
A x (E + O)             470   510   280   290   410
Neutral: A +/x E +/x O  170   240   170   350   280

for interactions among suprasegmental variables. Presence or absence of interactions can then be used to evaluate alternative hypotheses of how these cues affect the listener's parsing decision.

I. EXPERIMENT I

A. Method

1. Stimuli

A male speaker of Northeastern American dialect produced each of the following ambiguous bracketed expressions: (1) "(A plus E) times O," (2) "A plus (E times O)," (3) "(A times E) plus O," and (4) "A times (E plus O)." For each production, the speaker was instructed to convey, by whatever means he deemed appropriate, the desired syntactic bracketing structure. In addition the speaker produced each of the following unambiguous, unbracketed expressions: (1) A times E times O, and (2) A plus E plus O. The speaker was instructed to avoid parsing these unbracketed expressions. All expressions were intermixed and recorded in random order.

Prior to the experiment proper, two exemplars of each of the four bracketed expressions were presented in a random order to six listeners. There were ten repetitions of each of the eight utterances. Listeners decided whether the speaker intended "E" to be grouped with "A" as a single unit, or whether "E" and "O" constituted a single unit. On the basis of these pretest results, the exemplar of each expression with the fewer errors was selected for further analysis. The average error rate for the four selected expressions was 5.5%.

The selected utterances were analyzed in terms of rms amplitude, fundamental frequency, and 12 linear predictive coding (LPC) pseudoarea functions, representing spectral characteristics of speech (Atal and Hanauer, 1971). The analysis program estimates a value for each of the 14 parameters for every 10 ms of speech, and a synthesis program reproduces speech using these parameters. The advantage of LPC analysis-synthesis is that pitch and amplitude can be manipulated without modifying other speech parameters.

Three suprasegmental variables were manipulated as described below: amplitude, fundamental frequency contour, and duration. For each of the experimental variables there were two levels: (a) "neutral" and (b) a value taken from one of the bracketed expressions. Manipulations of the LPC parameters and synthesis of the utterances were done using an interactive program (Nakatani, 1976) implemented on a DDP-224 computer.

The set of bracketed expressions was constructed by concatenating appropriate words from the unbracketed expressions to form the basic utterances to be manipulated. For example, the expression "A plus E times O" was constructed in two ways. In one case, the elements "A plus E" were taken from the unbracketed expression "A plus E plus O," while the elements "times O" were excised from another unbracketed expression, "A times E times O." These elements, "A plus E" and "times O," were then concatenated to form the expression "A plus E times O." Similarly, in the second case "A plus" was taken from "A plus E plus O," while "E times O" was taken from "A times E times O." These elements were then concatenated to form a second "A plus E times O" expression. The set of four basic utterances (two expression types by two construction types) was then manipulated to make the test utterances.

The durations of the operands (A, E, and O), operators (plus and times), and silent portions in the unbracketed expressions, e.g., "A plus E plus O," were measured to the nearest 10 ms from the LPC representations. The particular words in the expressions were chosen so as to minimize segmentation difficulties. The beginning point of "A" was taken to be the first voiced sample in the utterance, and the end point was defined to be the last voiced sample prior to a voiceless sample, which indicated the beginning of /p/ or /t/. The frication periods of the voiceless stops were also marked by the pattern of the pseudoarea function parameters. The end points of "plus" and "times" were determined by examining the area function pattern as well as the fundamental frequency contour. (The pseudoarea functions for fricatives form a notably irregular pattern.) Silences were defined to be samples of zero amplitude. The neutral duration pattern was defined as the mean value of each of the operands, operators, and silent portions averaged across the four unbracketed tokens. The durations of operands, operators, and silent portions were measured in the bracketed expressions as well. In the + duration case, elements in the concatenated expressions described above were linearly expanded or compressed in time to match the duration pattern of one of the originally uttered bracketed expressions. It should be noted that this particular speaker did not insert juncture pauses to disambiguate the bracketed expressions; rather, he lengthened segments preceding the syntactic boundary. His duration disambiguation strategy was to lengthen the operand directly preceding the phrase boundary. So, when "A" directly preceded the phrase boundary, as in "A plus (E times O)," "A" was nearly twice as long as "A" in nonboundary position. Table I shows the durations for each of the constituents in all expressions.

In the neutral or - duration case, the elements in the concatenated expressions were linearly expanded or compressed to match the mean durations of each of the


TABLE II. Experiment I. Mean proportion correct and standard error of the mean (sem) for each experimental condition.

                  + Amplitude                 - Amplitude
            + Duration   - Duration     + Duration   - Duration
+ Pitch       0.773        0.595          0.673        0.542
  sem         0.027        0.026          0.031        0.028
- Pitch       0.689        0.501          0.557        0.499
  sem         0.027        0.024          0.031        0.028

operands, operators, and silent portions in the unbracketed tokens.
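The linear expansion or compression used for the duration manipulation can be sketched as follows. This is a minimal illustration, not the program used in the study (Nakatani, 1976). It assumes a parameter track sampled every 10 ms, as in the LPC analysis, and resamples it by linear interpolation to a target duration. The function name `resample_linear`, the interpolation scheme, and the track values are assumptions made for illustration; only the 170-ms neutral and 280-ms pre-boundary durations of "A" come from Table I.

```python
FRAME_MS = 10  # analysis step reported in the text: one parameter value per 10 ms


def resample_linear(track, target_ms, frame_ms=FRAME_MS):
    """Linearly stretch or compress a per-frame parameter track to target_ms.

    Hypothetical helper: interpolates between neighboring frames so the
    overall shape of the track is preserved while its length changes.
    """
    n_in = len(track)
    n_out = max(1, round(target_ms / frame_ms))
    if n_in == 1 or n_out == 1:
        return [track[0]] * n_out
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[hi] * frac)
    return out


# Made-up parameter values for a 170-ms "A" (17 frames at 10 ms per frame).
neutral_a = [float(v) for v in range(17)]

# Stretch to the 280-ms duration of "A" in "A plus (E times O)" (Table I).
stretched_a = resample_linear(neutral_a, 280)
scale_factor = 280 / 170  # expansion factor applied to this element
```

Applying the same resampling element by element (operands, operators, and silent portions) would yield an utterance whose duration pattern matches a chosen row of Table I.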

Fundamental frequency manipulations involved imposing the duration-adjusted fundamental frequency contour from a bracketed expression on a concatenated expression. For example, to construct the +pitch, -duration condition for the expression "(A plus E) times O," the pitch contour from "(A plus E) times O" as originally uttered was linearly expanded or compressed word by word to conform to the average duration pattern of the unbracketed expressions. In the case of "plus" and "times," voicing onset was manipulated to be the same as in the unbracketed expression. This duration-adjusted fundamental frequency contour was then superimposed on the concatenated utterance, "A plus E times O." The neutral or -pitch condition was a flat contour of 100 Hz across all voiced sections of the utterance.

The third variable was amplitude. In the + amplitude condition the duration-adjusted amplitude envelope from one of the bracketed expressions was superimposed on a concatenated expression. The amplitude envelope of the concatenated expressions was not altered in the - amplitude condition.

2. Design

In total, three variables were manipulated, each with two levels (+ and -): duration pattern, fundamental frequency contour, and amplitude envelope. All variables were fully crossed, yielding a total of eight separate manipulations for each of the expressions. There were two expression types ("A plus E times O" and "A times E plus O"), crossed with two bracketing structures, each with two replications. Thus, in total there were 64 distinct utterances. Amplitude was a between-subjects variable, whereas all other variables were within-subjects variables. There were two separate random orders for each of the two amplitude conditions, yielding a total of four orders. Stimuli in each random order, consisting of 160 trials (five repetitions of each of the 32 stimuli), were recorded onto analog tape. There were five seconds of silence separating each utterance.
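The stimulus counts in this design follow directly from crossing the factors, as the short sketch below illustrates. The factor labels (for example `"left"` and `"right"` for the two bracketing structures) are hypothetical names introduced here; only the numbers of levels and repetitions are taken from the text.

```python
from itertools import product

levels = ("+", "-")  # bracketed-expression value vs. neutral value
expressions = ("A plus E times O", "A times E plus O")
bracketings = ("left", "right")  # hypothetical labels: (A op E) op O vs. A op (E op O)
replications = (1, 2)

# Fully cross duration, pitch, and amplitude with expression, bracketing, replication.
utterances = [
    {
        "duration": d,
        "pitch": p,
        "amplitude": a,
        "expression": e,
        "bracketing": b,
        "replication": r,
    }
    for d, p, a, e, b, r in product(
        levels, levels, levels, expressions, bracketings, replications
    )
]

n_distinct = len(utterances)
# Amplitude is between-subjects, so each listener hears only one amplitude level.
n_per_amplitude = sum(1 for u in utterances if u["amplitude"] == "+")
trials_per_order = 5 * n_per_amplitude  # five repetitions of each stimulus
```

Under these assumptions `n_distinct` is 64, `n_per_amplitude` is 32, and `trials_per_order` is 160, matching the figures reported above.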

3. Subjects

The subjects were 20 local high school students, who were paid for participation.

4. Procedure

Subjects were told that they would hear two algebraic expressions, either "A plus E times O" or "A times E plus O," which could be grouped in two different ways. They were told about the two possible bracketing structures for each expression. They were to decide for each sentence which of the two alternative meanings was the one conveyed by the speaker. If "E" was grouped with "A" as in "(A plus E) times O," one written response was required ("1"), and conversely, if they thought "E" was grouped with "O," as in "A plus (E times O)," another response was required ("2").

Subjects listened to the utterances over headphones in a sound-insulated booth. Block order was counterbalanced across subjects. Each subject made a total of ten judgments for each of the 32 stimuli in one amplitude condition.

B. Results and discussion

Table II shows the results in terms of proportion correct and the standard error of the mean (sem) for each of the experimental cells collapsed across sentence types and replications. The two-by-two table on the left side of Table II shows the effects of altering pitch and duration while holding amplitude at a constant value, in this case the + amplitude condition. A "+" for a particular variable indicates that values were taken from the bracketed expressions, whereas a "-" indicates that the values for that variable were neutral.

An analysis of variance revealed significant main effects for each of the three experimental variables: amplitude, F (1, 18) = 7.84, p < 0.05; duration, F (1, 18) = 45.80, p < 0.001; pitch, F (1, 18) = 21.81, p < 0.001. Thus, changing the duration pattern from a bracketed to a neutral pattern decreased proportion correct by approximately 0.14, whereas neutralizing the fundamental frequency contour and amplitude decreased performance by 0.08 and 0.07, respectively. That the -pitch and -duration values used here were indeed neutral is supported by the fact that the two cell means (-pitch, -duration, +amplitude and -pitch, -duration, -amplitude) were at chance level, i.e., no preference for either meaning.
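The sizes of these main effects can be recovered from the cell means in Table II. The sketch below assumes the table is read with columns in the order +amplitude/+duration, +amplitude/-duration, -amplitude/+duration, -amplitude/-duration, and rows +pitch then -pitch; that reading is an assumption, chosen because it places the two chance-level cells where the text says they are. The function name `main_effect` is introduced here for illustration.

```python
# Cell means keyed by (amplitude, duration, pitch); "+" = bracketed value, "-" = neutral.
# Assignment of table cells to conditions is an assumed reading of Table II.
cell_means = {
    ("+", "+", "+"): 0.773,
    ("+", "-", "+"): 0.595,
    ("-", "+", "+"): 0.673,
    ("-", "-", "+"): 0.542,
    ("+", "+", "-"): 0.689,
    ("+", "-", "-"): 0.501,
    ("-", "+", "-"): 0.557,
    ("-", "-", "-"): 0.499,
}

FACTOR_INDEX = {"amplitude": 0, "duration": 1, "pitch": 2}


def main_effect(factor):
    """Mean of the four '+' cells minus mean of the four '-' cells for one factor."""
    i = FACTOR_INDEX[factor]
    plus = [m for key, m in cell_means.items() if key[i] == "+"]
    minus = [m for key, m in cell_means.items() if key[i] == "-"]
    return sum(plus) / len(plus) - sum(minus) / len(minus)


effects = {name: round(main_effect(name), 2) for name in FACTOR_INDEX}
```

With this reading, the rounded differences are 0.14 for duration, 0.08 for pitch, and 0.07 for amplitude, in agreement with the values quoted in the text.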

The pattern of results with respect to the amplitude manipulation is somewhat puzzling. There was a significant amplitude by duration interaction (F (1, 18) = 4.75, p < 0.05) as well as a significant three-way interaction (F (1, 18) = 4.42, p < 0.05). When amplitude was the sole disambiguation cue, performance was at chance level. The amplitude results may be due to the fact that the amplitude manipulation was a between-subjects variable. Since individual subjects had no contrastive amplitude experience, they may not have perceived it as a meaning-differentiating cue.

Ignoring for the moment all cells in the - amplitude condition, inspection of the two-by-two table on the left presents a less complicated state of affairs. Both pitch and duration were reliable disambiguation cues, and in addition, the two cues did not interact (F < 1).
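What the absence of a pitch by duration interaction means for the cell means can be illustrated with a simple contrast on the + amplitude cells of Table II. This is a descriptive sketch only, not the analysis of variance reported above, and it assumes the same reading of the table (columns +duration then -duration, rows +pitch then -pitch). The variable names are introduced here for illustration.

```python
# + amplitude cells of Table II, keyed by (pitch, duration); assumed reading of the table.
plus_amp = {
    ("+", "+"): 0.773,
    ("+", "-"): 0.595,
    ("-", "+"): 0.689,
    ("-", "-"): 0.501,
}

# Effect of duration at each level of pitch.
duration_effect_with_pitch = plus_amp[("+", "+")] - plus_amp[("+", "-")]
duration_effect_without_pitch = plus_amp[("-", "+")] - plus_amp[("-", "-")]

# Interaction contrast: zero if the two cues combine purely additively.
interaction_contrast = duration_effect_with_pitch - duration_effect_without_pitch

# Additive prediction for the cell with both cues present, from the other three cells.
additive_prediction = (
    plus_amp[("+", "-")] + plus_amp[("-", "+")] - plus_amp[("-", "-")]
)
```

Here the duration effect is 0.178 with the pitch cue present and 0.188 without it, so the contrast is about -0.01, and the additive prediction of 0.783 lies close to the observed 0.773.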


To determine whether there were differences between the experimentally devised versions (i.e., versions with duration, pitch, and amplitude specifications taken from the originally produced bracketed expressions, the +D, +P, +A cell) and the originally produced sentences, a control experiment was run. A separate group of six listeners judged each of the originally uttered, LPC-analyzed bracketed expressions 15 times. The mean proportion correct for the control group, who heard the originally uttered, LPC-analyzed expressions, was 0.76, while proportion correct for the experimentally devised bracketed expressions was 0.77. Thus, utterances with manipulated suprasegmental variables were as effectively disambiguated as were the original utterances from which they were modeled.

II. EXPERIMENT II

To interpret the size of the effects obtained in experiment I, one is forced to rely on the difference between the manipulated and "neutral" utterances. The neutral utterance is an artificial average that conceivably could be worse for some prosodic variables than for others. Thus, while we can conclude that the various variables affect perceptual segmentation, this procedure does not allow us to assess the variables' absolute effects or relative importance. To circumvent this problem, in the second experiment the role of these same three variables in phrase boundary perception was assessed, but using a different method.

In the second experiment the suprasegmental cues were systematically exchanged between the two alternative meanings of a given algebraic expression. For example, the duration pattern from "(A plus E) times O" was imposed on the utterance "A plus (E times O)." In the resulting "hybrid" or mixed utterance, duration information conflicts with information provided by pitch and amplitude cues. Note that this technique of exchanging acoustic cues circumvents the problem of determining a priori what constitutes a neutral value of a particular acoustic variable.

A. Method

1. Stimuli

Two male speakers of Northeastern dialect produced the four bracketed expressions used in experiment I, attempting to make the syntactic structure of each utterance clear. (The second speaker was the speaker from experiment I, who served in a second recording session.) Each expression was recorded four times in random order. Stimuli from both speakers were pretested using a group of six listeners. Each listener judged each of the 32 utterances five times. From this set of 32, the 16 utterances with the better comprehension scores were selected for the actual experiment: two examples of each of the four bracketed expressions for each of the two speakers. Overall percent correct for speakers 1 and 2 on these eight expressions was 94% and 91%, respectively. This set of expressions was LPC analyzed.

The duration, pitch, and amplitude, as well as all combinations of these three variables, were exchanged between sentence versions containing the same lexical pattern but different bracketing structures. For example, consider the two expressions: (a) "(A plus E) times O" and (b) "A plus (E times O)." To manipulate duration (D) for (a), all elements in expression (a) were linearly expanded or compressed to assume the duration pattern of expression (b). In this case, duration cues conflicted with the pitch, amplitude, and spectral information. Similarly, in the duration-pitch condition (DP), both the duration pattern and fundamental frequency contour from expression (b) were mapped onto expression (a), resulting in duration and pitch cues conflicting with amplitude and spectral information.

Thus, there was a total of eight variations for each of the eight sentences for each speaker: (1) original utterance (OR), (2) duration changed (D), (3) amplitude changed (A), (4) pitch changed (P), (5) duration and amplitude changed (DA), (6) duration and pitch changed (DP), (7) amplitude and pitch changed (AP), and (8) duration, amplitude, and pitch changed (DAP). Hypothetically, if duration, amplitude, and pitch are the only variables affecting perception of the phrase boundary, it should be possible to shift the meaning of an original utterance to the meaning of the alternate bracketing structure by exchanging all three variables; that is, an original sentence would be judged no differently from a sentence with all replaced variables (DAP). If original sentences were judged no differently from sentences with all replaced variables, the amount of influence each variable has separately could be taken as a measure of the relative importance of that cue in the speech of these speakers.

There were 64 test utterances for each speaker. Test blocks consisted of a random order of the 64 stimuli for one speaker. There were four different random orders for each speaker.
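The eight exchange conditions are the subsets of the three cues, which the following sketch enumerates. It is an illustration of the design only: the function names `label` and `cue_sources` and the string values `"original"` and `"alternate"` are hypothetical, and spectral information is assumed always to come from the original utterance, as the description above implies.

```python
from itertools import combinations

CUES = ("D", "A", "P")  # duration, amplitude, pitch

# All subsets of the three cues, from none exchanged (OR) to all exchanged (DAP).
conditions = [frozenset(c) for k in range(len(CUES) + 1) for c in combinations(CUES, k)]


def label(exchanged):
    """Condition label in the paper's notation; the empty set is the original (OR)."""
    return "".join(c for c in CUES if c in exchanged) or "OR"


def cue_sources(exchanged):
    """Which rendering each cue is taken from in the hybrid utterance."""
    sources = {c: ("alternate" if c in exchanged else "original") for c in CUES}
    sources["spectral"] = "original"  # segmental structure is never exchanged
    return sources


labels = [label(c) for c in conditions]
n_per_speaker = 8 * len(conditions)  # eight selected sentences per speaker
n_total = 2 * n_per_speaker  # two speakers
```

For example, `cue_sources(frozenset("DP"))` takes duration and pitch from the alternate rendering while amplitude and spectral information stay with the original, which is the DP condition described above.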

2. Subjects

Seventeen high school students were paid for their participation. The experiment was conducted on two consecutive days.

3. Procedure

The procedure was basically the same as the procedure described in experiment I. However, here every subject heard all stimuli, i.e., the design was completely within-subjects. Each subject judged all 128 utterances a total of four times. On each day, subjects heard alternating blocks of speaker 1 and speaker 2 for a total of four blocks. The groups of four blocks were counterbalanced across subjects.

B. Results

Figure 1 shows the results for each of the eight experimental conditions separately for speakers 1 and 2. Speaker 1 is denoted by horizontal stripes and speaker 2 by vertical stripes. Small solid bars indicate plus and minus one standard error of the mean, calculated across the eight sentence means. Data are represented in


[Figure 1: bar chart. Horizontal axis: prosodic manipulations (OR, D, A, P, DA, DP, AP, DAP). Vertical axis: percentage of judgments, with ticks from about 20 to 85.]

FIG. 1. Percentage of judgments favoring the original utterance's parsing for each of the eight conditions and for each of the two speakers. Data for speaker 1 are denoted by horizontally lined bars, whereas data for speaker 2 are denoted by vertically lined bars. Plus and minus one standard error of the mean is depicted by small solid bars.

terms of the percentage of responses obtained for the original sentences' intended bracketing structure. So, in the case of speaker 1, original sentences were judged to have the meaning intended by the speaker in 81% of judgments, whereas when just the duration pattern from the alternate meaning was mapped onto original sentences, only 34% of the judgments favored the original bracketing structure. Therefore, altering the duration pattern from the original meaning to the alternative meaning shifted the average categorization response by about 47% for speaker 1.

If a residual effect of segmental structure exists, it should manifest itself in the difference between an originally uttered expression and an expression with all replaced variables (i.e., one minus the proportion correct in the DAP condition). For example, for speaker 1, 81% of listeners' judgments favored the original parsing. However, when D, A, and P were substituted from the alternate meaning, 22% of judgments favored the original meaning, whereas 78% favored the meaning signaled by the D, A, and P cues. Thus, the residual effect due to segmental structure was 3% for speaker 1. There was no significant effect of segmental structure for either speaker's data when originals were compared to expressions with all replaced variables (p > 0.10). Thus, for this sentence set only the experimental variables D, P, and A affect the perception of a phrase boundary. However, it is conceivable that with an entirely different set of phonetic elements, and/or a different arrangement of the elements, spectral characteristics might be an important cue in phrase boundary perception.
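The arithmetic behind the duration shift and the residual effect for speaker 1 can be laid out explicitly. The sketch below uses only the percentages quoted in the text; the variable names are introduced here for illustration.

```python
# Speaker 1: percentage of judgments favoring the ORIGINAL bracketing, from the text.
original_pct = 81       # OR condition (nothing exchanged)
duration_only_pct = 34  # D condition (duration pattern exchanged)
dap_pct = 22            # DAP condition (duration, amplitude, and pitch exchanged)

# Shift in categorization produced by exchanging duration alone.
duration_shift = original_pct - duration_only_pct

# In the DAP condition, judgments following the exchanged cues.
dap_cue_following_pct = 100 - dap_pct

# Residual attributable to segmental (spectral) structure: how much less effective
# the exchanged cues are than the cues in the original utterance.
residual_pct = original_pct - dap_cue_following_pct
```

This gives a duration shift of 47 percentage points, 78% of DAP judgments following the exchanged cues, and a residual of 3 percentage points.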

Figure 1 shows that there are large speaker differences (F (1, 16) = 10.42, p < 0.01). First, speaker 1's original utterances were comprehended better than speaker 2's (t (7) = 2.92, p < 0.05). Second, there is a striking duration by speaker interaction (F (1, 16) = 32.79, p < 0.001). While each of the speakers showed a reliable duration effect (F (1, 16) = 36.78, p < 0.001 and F (1, 16) = 17.70, p < 0.001), duration shifted meaning categorization by approximately 44% for speaker 1 and by only 15% for speaker 2. An analysis of the two speakers' duration patterns provides an explanation of this interaction. Both speakers appeared to use the same durational disambiguation strategy, namely to vary the duration of the operands, "A" and "E," depending on the syntactic structure. When "A" directly preceded the phrase boundary, it was lengthened relative to "E," and vice versa. Table III shows the durational patterns for both speakers. Again, neither speaker inserted juncture pauses as a disambiguation cue. All silent portions between an operand and operator were within the range of values observed for word-initial, nonphrase-initial stop closures (Umeda, 1977). However, even though the durational strategies appeared to be the same for the two speakers, there were differences in their realizations of them. Speaker 1's utterances taken as a whole were significantly longer than speaker 2's (t (7) = 5.13, p < 0.01). In an attempt to control for this overall speech rate difference, the "A" to "E" duration ratios were examined in all sentences. These duration ratios (before/after boundary) were reliably greater for speaker 1 (t (7) = 4.29, p < 0.01), with an average ratio of 1.89 for speaker 1 and 1.36 for speaker 2. Thus, speaker 1 not only spoke more slowly, but made a clearer distinction between operands before and after the phrase boundary, which presumably facilitated listeners' judgments.
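The duration ratios described above can be approximated from the operand durations listed in Table III. The sketch below is an assumption-laden illustration: it reads the ratio as the duration of the operand directly preceding the boundary divided by the duration of the other operand ("A" or "E"), the names `OPERAND_DURATIONS`, `boundary_ratio`, and `mean_boundary_ratio` are hypothetical, and because the tabled durations are rounded the means need not match the reported averages exactly.

```python
from statistics import mean

# Operand durations in ms, transcribed from Table III as
# (operand preceding the boundary, duration of A, duration of E).
OPERAND_DURATIONS = {
    "speaker 1": [
        ("E", 210, 420), ("E", 210, 340),  # (A + E) x O
        ("A", 400, 180), ("A", 370, 210),  # A + (E x O)
        ("E", 250, 380), ("E", 310, 480),  # (A x E) + O
        ("A", 410, 210), ("A", 430, 160),  # A x (E + O)
    ],
    "speaker 2": [
        ("E", 180, 250), ("E", 200, 270),  # (A + E) x O
        ("A", 280, 190), ("A", 250, 210),  # A + (E x O)
        ("E", 190, 200), ("E", 160, 230),  # (A x E) + O
        ("A", 280, 190), ("A", 290, 190),  # A x (E + O)
    ],
}


def boundary_ratio(pre_boundary, a_ms, e_ms):
    """Pre-boundary operand duration divided by the other operand's duration."""
    return a_ms / e_ms if pre_boundary == "A" else e_ms / a_ms


def mean_boundary_ratio(rows):
    """Average before/after ratio over all utterances of one speaker."""
    return mean(boundary_ratio(*row) for row in rows)


for speaker, rows in OPERAND_DURATIONS.items():
    print(speaker, round(mean_boundary_ratio(rows), 2))
```

Under this reading, speaker 1's mean ratio comes out clearly larger than speaker 2's, in line with the pattern reported in the text.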

Fundamental frequency contour was a reliable phrase boundary location cue for both speakers (F (1, 16) = 23.92, p < 0.001). The magnitude of the pitch effect was 14% for speaker 1 and 20% for speaker 2. However, the two speakers used pitch differentially to disambiguate the expressions. Speaker 1 placed a rise-fall-rise pattern on the operand preceding the boundary, whereas speaker 2 placed a rising contour on the operator preceding the phrase boundary.

TABLE III. Duration patterns of utterances used in experiment II.

                    Speaker 1, duration (ms)        Speaker 2, duration (ms)
Expression          A    +/x   E    +/x   O         A    +/x   E    +/x   O
(A + E) x O         210  340   420  440   410       180  310   250  350   310
                    210  310   340  400   390       200  300   270  360   270
A + (E x O)         400  320   180  400   360       280  300   190  350   310
                    370  320   210  390   320       250  280   210  390   300
(A x E) + O         250  470   380  320   460       190  350   200  300   300
                    310  450   480  370   390       160  340   230  310   270
A x (E + O)         410  400   210  330   380       280  340   190  330   330
                    430  370   160  330   360       290  380   190  290   330

Amplitude was the least effective disambiguation cue, shifting overall categorization response by only 2.6% over the two speakers. However, this effect was reliable (F (1, 16) = 10.29, p < 0.01). Further, there was a duration by amplitude interaction (F (1, 16) = 5.62, p < 0.05), as well as a three-way interaction (F (1, 16) = 4.54, p < 0.05). (Note that in this experiment amplitude was a within-subjects factor, whereas in the first experiment it was a between-subjects factor. However, the same pattern of results was obtained in both studies.) As in experiment I, amplitude alone was not an important disambiguation cue, i.e., the cells +amplitude, -duration, -pitch and -amplitude, -duration, -pitch do not differ. Amplitude manifests itself as a cue only when duration is "+."

In an attempt to characterize amplitude differences between the two meanings of the expressions, the average amplitude was calculated in the vowel portion of each of the words in the natural utterances. The values used to calculate these averages were the peak vowel amplitude in the operands and operators and two amplitude values in two adjacent 10 ms samples on both sides of the peak vowel amplitude. Figure 2 shows these average dB amplitudes plotted for each of the two meanings and for each of the two speakers. There are discernible differences in the amplitude pattern between the two bracketing structures. It appears that amplitude tends to drop more after a phrase boundary. In addition, when "E" is in phrase-final or stressed position, it has greater amplitude associated with it than when it is in an unstressed position.
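One possible reading of this averaging procedure is sketched below. Several points are assumptions rather than details given in the text: that "two adjacent 10 ms samples on both sides" means two frames on each side of the peak (five values in all), that dB values are averaged arithmetically, and that the window is simply truncated at the edges of the vowel. The function name and the sample values are hypothetical.

```python
def mean_amplitude_around_peak(frames_db, n_side=2):
    """Average the peak frame and up to n_side frames on each side of it.

    frames_db: vowel amplitudes in dB, one value per 10 ms frame.
    n_side:    frames taken on each side of the peak (assumed to be 2).
    """
    if not frames_db:
        raise ValueError("need at least one amplitude frame")
    # Index of the peak vowel amplitude (first maximum if there are ties).
    peak = max(range(len(frames_db)), key=frames_db.__getitem__)
    lo = max(0, peak - n_side)
    hi = min(len(frames_db), peak + n_side + 1)
    window = frames_db[lo:hi]
    return sum(window) / len(window)


# Hypothetical vowel amplitude track (dB), one value per 10 ms frame.
print(mean_amplitude_around_peak([40, 44, 48, 46, 42, 38]))  # 44.0
```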

The data represented in Fig. 1 indicate that there is no interaction between duration and pitch (F < 1). In fact, the effects of duration and pitch appear to sum in total probability correct, and the pattern of results obtained here is similar to that obtained in experiment I. Further discussion of this additive effect is contained in the general discussion.

FIG. 2. Average dB amplitude for individual words in the expressions plotted as a function of syntactic bracketing structure and speakers. [Vertical axis: 38 to 50 dB. Horizontal axis: word position (A, +/x, E, x/+, O). Avg. s.e.m.: speaker 1 = .438, speaker 2 = .640.]

FIG. 3. Schematic representation of model I, a model in which both outputs from a duration processor and a pitch processor are combined on each trial to produce a parsing decision. [Diagram: the stimulus feeds a duration processor and a pitch processor. If both processors output 0, the listener guesses (respond 1 or respond 0); otherwise the listener responds 1. Total probability correct = Pd + Pp - PdPp.]

III. GENERAL DISCUSSION

Modeling the listener's decision process

In both experiments the effects of pitch and duration did not interact; rather, their effects summed in total probability correct. That is, it is possible to predict their combined effect by merely adding their separate effects.

Results from the two experiments can be used to propose and choose among various models of the listeners' decision process. In the interest of brevity, three relatively disparate models of how listeners might use pitch and duration cues will be contrasted. Briefly, the first model assumes that pitch and duration information are processed in independent channels or processors. Each processor outputs a probability of a correct parsing on each trial. The outputs from both processors are combined probabilistically in making the parsing decision. The second model, an alternating model, assumes that a parsing decision is made using either pitch or duration information, but not both, on any given trial. Thus, the listener's decision alternates between using pitch and duration information. The third model assumes that the pitch and duration processors output values indicating how consistent each cue is with each of the possible parsings. These independent estimates are then averaged by a combining process to arrive at a parsing decision.

The first model is one that has been applied successfully to various phenomena in visual perception, most notably with respect to modeling perception based on outputs from multiple channels or receptive fields (Sachs, Nachmias, and Robson, 1971; Graham, 1977; Watson and Nachmias, 1977). The first model, shown in Fig. 3, assumes that on any particular trial an utterance is examined for both pitch and duration information. The duration and pitch processors are independent, and each has a nonzero probability of arriving at the correct parsing decision: Pd and Pp, respectively. Before the listener responds, the output from both processors is examined. Assume that the outputs of each processor are binary (either "1" or "0," with "1" indicating that the location of the phrase boundary is known and "0" indicating that the location of the phrase boundary is not known). Each response is based on the union of the two outputs, which results in the probability of a correct decision of Pd + Pp - (Pd x Pp). To complete this model, we need to consider that a correct response could also be due to guessing the correct parsing based on neither pitch nor duration cues. For the data from experiment I let Pd be the probability of correctly judging the location of the phrase boundary using duration information when it is available; Pp, the probability of correctly judging the location of the phrase boundary using pitch information when it is available; and g, the probability of a correct guess when neither duration nor pitch information is available. Thus, when only the duration cue provides information as to the phrase boundary location, the output of the duration processor is either "0" (the location of the boundary is not known) or "1" (the location of the boundary is known). If the processor outputs a "1," the location of the phrase boundary is consistent with the acoustic input. That is, if the output is "1," the subject is indeed correct. If the output of the duration processor is "0," a guess as to the phrase boundary location is made. Thus, the probability of correctly judging the phrase boundary location using only duration information is Pd. A guess will be made with probability (1 - Pd), and the guess will have a probability g of being correct. Thus, the observed probability of a correct parsing decision based on duration information alone (Pcd) is the sum of the probability that the duration processor outputs a correct decision plus the probability of correctly guessing the phrase boundary location:

$$P_{cd} = P_d + g(1 - P_d).$$

Similarly, the observed probability of a correct parsing decision based on pitch information alone equals

$$P_{cp} = P_p + g(1 - P_p).$$

When both pitch and duration provide information as to the phrase boundary location, there is a probability Pd that the duration processor arrived at a correct parsing and a probability Pp that the pitch processor arrived at a correct parsing. If either or both processors output a "1," a correct response will be made with probability Pd + Pp - PdPp. If both processors output a "0," a guess will be made with probability [1 - (Pd + Pp - PdPp)]. Again, there is a probability g that the guess will be correct. Thus, the observed probability correct when both duration and pitch information are present in the acoustic signal equals

$$P_{cdp} = P_d + P_p - P_d P_p + g\,[1 - (P_d + P_p - P_d P_p)].$$

The observed data are Pcd, Pcp, and Pcdp. To evaluate model I, we need to estimate Pd, Pp, and Pdp (the estimate from the observed data of the condition in which both pitch and duration provide information). These three probabilities (Pd, Pp, and Pdp) are estimated using Pcd, Pcp, and Pcdp, respectively, by solving the corresponding observed-probability equation for the underlying probability:

$$P_{\cdot} = \frac{P_{c\cdot} - g}{1 - g}.$$

FIG. 4. Schematic representation of model II. This model is an alternating model in which on each trial a parsing decision is made on the basis of either duration information or pitch information, but not both. [Diagram: the stimulus is routed to either the duration processor (probability of correct decision = Pd) or the pitch processor (probability of correct decision = Pp). The selected processor either leads to respond 1 or to a guess (respond 1 or respond 0). Total probability correct = Pd + g(1 - Pd) + Pp + g(1 - Pp).]

The guessing probability (g) is taken to be the observed probability of a correct parsing decision when neither pitch nor duration information is present, namely, the -P, -D, +A condition for half of the subjects and the -P, -D, -A condition for the other half of the subjects in experiment I. Having estimated Pd, Pp, and Pdp, we can evaluate model I:

$$\text{model I } P_{dp} \text{ prediction} = P_d + P_p - P_d P_p.$$

We can then compare the model I prediction of Pdp against the estimate of Pdp derived from the observed data. Comparing the model I Pdp prediction and the observed value of Pdp for each of the twenty subjects in experiment I, we find that the data reject model I (t (19) = 3.08, p < 0.01). The mean observed Pdp value was 0.411, while model I predicted a value of 0.326. Using sentences as the analysis unit, the data marginally reject model I (t (7) = 2.20, p < 0.10). Table V shows the observed values of Pd, Pp, and Pdp as well as the model I predictions of Pdp separately across subjects and sentences for the data in experiment I.
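The steps of the model I evaluation (correcting observed proportions for guessing, then predicting the combined condition from the union of two independent processors) can be sketched as follows. This is a minimal illustration, not the original analysis code; the function names and the numerical values of g, Pcd, and Pcp are hypothetical.

```python
def observed_correct(p_known, g):
    """Observed P(correct): the processor knows the boundary, or a guess succeeds."""
    return p_known + g * (1.0 - p_known)


def correct_for_guessing(p_observed, g):
    """Invert observed_correct to recover the underlying probability."""
    if g >= 1.0:
        raise ValueError("guessing probability must be below 1")
    return (p_observed - g) / (1.0 - g)


def model_one_pdp(p_d, p_p):
    """Model I prediction: union of two independent binary processor outputs."""
    return p_d + p_p - p_d * p_p


# Hypothetical observed proportions for one listener.
g = 0.50                 # proportion correct with neither cue informative
pcd, pcp = 0.62, 0.56    # duration-only and pitch-only conditions

p_d = correct_for_guessing(pcd, g)
p_p = correct_for_guessing(pcp, g)
predicted_pdp = model_one_pdp(p_d, p_p)

print(round(p_d, 3), round(p_p, 3), round(predicted_pdp, 4))
```

The predicted Pdp would then be compared, listener by listener or sentence by sentence, with the value of Pdp obtained by applying the same guessing correction to the observed Pcdp.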

In experiment II pitch and duration cues were exchanged between the two alternative meanings. In this case, we assume that the subject starts with an assumption of one or the other meaning: Pd is the probability that duration signals a change of meaning; Pp, the probability that pitch signals a change of meaning; and g, the probability that a change in meaning will result from guessing alone, if neither pitch nor duration leads to a change. As before, the observed probability correct (Pc) for each of the three conditions equals

$$P_{c\cdot} = P_{\cdot} + g(1 - P_{\cdot}).$$

In estimating Pd, Pp, and Pdp, we let g equal the probability that the subject selects the alternate meaning when neither pitch nor duration signals a change in meaning. In other words, g equals one minus the observed probability correct in the original utterance condition (i.e., the +D, +P, +A condition). Using this estimate of g and the observed Pcd in the -D, +P, +A condition, Pcp in the +D, -P, +A condition, and Pcdp in the -D, -P, +A condition, we solve for Pd, Pp, and Pdp, respectively. We can then compare the obtained Pdp for the -D, -P, +A condition with the estimate from model I (Pd + Pp - PdPp) for each subject in experiment II. The model I estimate of Pdp differs reliably from the observed estimate of Pdp (t (16) = 3.00, p < 0.01). Likewise, a comparison using the 16 sentences as the analysis unit leads to a rejection of model I (t (15) = 3.20, p < 0.01). The various parameter values are shown in Table V.

The second model is shown in Fig. 4. In model II on any given trial the parsing decision is made using duration information with probability Pch, or the parsing decision is made using pitch information with probability (1 - Pch). If the duration processor is selected, the probability that the duration processor will yield the correct parsing decision is Pd. Similarly, if the pitch processor is selected, its probability of arriving at a correct parsing decision equals Pp. Thus, when both pitch and duration information are present, the Pdp prediction from model II equals

$$P_{ch} P_d + (1 - P_{ch}) P_p.$$

Thus, on any particular trial a parsing decision is made on the basis of either pitch information or duration information, but not both. Again, assume that if the processor selected outputs "0," a guess is made. For example, if the duration processor is selected, the probability of a correct response (Pcd) again equals

$$P_d + (1 - P_d)g.$$

Similarly, if the pitch processor is selected, Pcp equals

$$P_p + (1 - P_p)g.$$

FIG. 5. Schematic representation of model III. In model III consistency estimates from the duration and pitch processors are averaged to make a parsing decision. [Diagram: the stimulus feeds a duration processor (outputs D1 and D2, each ranging from 0 to 1) and a pitch processor (outputs P1 and P2, each ranging from 0 to 1). A combiner forms C1 = (D1 + P1)/2 and C2 = (D2 + P2)/2, and the decision stage computes D = .50 + (C1 - C2)/2: respond meaning 1 if D > .50, guess if D = .50, and respond meaning 2 otherwise.]

Thus, total probability correct when both pitch and duration provide information equals

$$P_d + (1 - P_d)g + P_p + (1 - P_p)g.$$

To test the second model, we again use the estimates of Pd, Pp, Pdp, and g (the probability of a correct guess) obtained above from the observed data to test whether the observed Pdp differs from the model II prediction of Pdp (i.e., Pd + Pp). Let us assume that the choice parameter, Pch, equals 0.50 to simplify matters. For the data in the first experiment using subjects as the analysis unit, the model II prediction of Pdp does not differ significantly from the observed Pdp value (t (19) = 1.58, p > 0.10). Similarly, using sentences as the analysis unit, model II's prediction and the observed Pdp do not differ (t (7) = 1.18, p > 0.10). For the experiment II data the model II predictions do not differ reliably from the observed Pdp's either across subjects (t (16) = 0.35) or across sentences (t (15) = 0.07). The mean obtained Pd's, Pp's, and Pdp's as well as the model II predictions are shown in Table V.¹
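The comparison of the two probabilistic models can be illustrated from the mean values reported in Table V. The sketch below is only a rough check on the means: the tests in the text were run on per-subject and per-sentence values, which are not reproduced here, and the tabled means are rounded, so small discrepancies from the tabled predictions are expected. The names `TABLE_V_MEANS` and `model_predictions` are hypothetical.

```python
# Mean values transcribed from Table V:
# (Pd, Pp, mean of Pd*Pp, observed Pdp)
TABLE_V_MEANS = {
    ("experiment I", "subjects"): (0.242, 0.128, 0.043, 0.441),
    ("experiment I", "sentences"): (0.270, 0.146, 0.057, 0.503),
    ("experiment II", "subjects"): (0.366, 0.178, 0.081, 0.553),
    ("experiment II", "sentences"): (0.389, 0.219, 0.086, 0.611),
}


def model_predictions(p_d, p_p, p_d_times_p_p):
    """Return the (model I, model II) predictions of Pdp as tested in the text."""
    model_one = p_d + p_p - p_d_times_p_p  # independent probabilistic combination
    model_two = p_d + p_p                  # simple sum of the separate effects
    return model_one, model_two


for key, (p_d, p_p, product, observed) in TABLE_V_MEANS.items():
    model_one, model_two = model_predictions(p_d, p_p, product)
    print(key, round(model_one, 3), round(model_two, 3), observed)
```

In each of the four rows the model II value lies closer to the observed mean Pdp than the model I value does, which matches the direction of the reported tests.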

Models I and II assume that the outputs from the duration and pitch processors are combined probabilistically. However, if we abandon the requirement of a probabilistic combining rule, there is a variant of model I which will produce additivity of pitch and duration cues in total proportion correct. This third model is shown in Fig. 5.

In model III, instead of having the duration and pitch processors output binary responses with certain probabilities attached to them, we assume that both processors output numbers ranging from zero to one, which indicate the degree to which a certain cue is consistent with each parsing. Thus, the duration processor outputs two values: D1 (the degree to which the duration information is consistent with meaning one) and D2 (the degree to which the duration information is consistent with meaning two). Similarly, the pitch processor examines the pitch information and outputs two numbers ranging from zero to one: P1 and P2. These "consistency" numbers or strength estimates are then transmitted to a combiner, which takes the average of D1 and P1 and also averages D2 and P2. Thus, the combiner takes the mean of the two cues for each of the two possible meanings. The combiner outputs these two averages, C1 and C2, where

$$C_1 = (D_1 + P_1)/2 \quad \text{and} \quad C_2 = (D_2 + P_2)/2.$$

These outputs feed into a decision processor, which combines C1 and C2 in the following way:

$$D = 0.50 + (C_1 - C_2)/2.$$

When D is greater than 0.50, the subject responds with meaning one, and when D is less than 0.50, the subject responds with meaning two. If D is equal to 0.50, the subject guesses.
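The combining and decision rules of model III can be written out directly. The following is a minimal sketch of the rule as stated in the text; the function name and the example consistency values are hypothetical.

```python
def model_three_decision(d1, d2, p1, p2):
    """Average the cue consistencies per meaning and apply the decision rule.

    d1, d2: consistency of the duration information with meanings 1 and 2.
    p1, p2: consistency of the pitch information with meanings 1 and 2.
    Returns the decision variable D and the resulting response.
    """
    for value in (d1, d2, p1, p2):
        if not 0.0 <= value <= 1.0:
            raise ValueError("consistency values must lie between 0 and 1")
    c1 = (d1 + p1) / 2.0            # combiner output for meaning one
    c2 = (d2 + p2) / 2.0            # combiner output for meaning two
    d = 0.50 + (c1 - c2) / 2.0      # decision variable
    if d > 0.50:
        return d, "meaning 1"
    if d < 0.50:
        return d, "meaning 2"
    return d, "guess"


# Hypothetical consistency values.
print(model_three_decision(0.8, 0.2, 0.7, 0.3)[1])  # meaning 1
print(model_three_decision(0.2, 0.8, 0.3, 0.7)[1])  # meaning 2
print(model_three_decision(0.5, 0.5, 0.5, 0.5)[1])  # guess
```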

While this model is similar to model I in spirit, it differs in some basic respects. In model III, since the outputs of the pitch and duration processors are not probabilities, the combining algorithm can differ from the one used in model I (i.e., Pd + Pp - PdPp). Thus, in this third model, additivity in total proportion correct is observed because additivity is inherent in the combining algorithm. The combiner's best guess as to the utterance's parsing is an average of the strengths of the pitch and duration cues.

TABLE IV. Experiment II. Mean proportion correct for each experimental condition.

                  + Amplitude                 - Amplitude
             + Duration   - Duration     + Duration   - Duration
+ Pitch        0.759        0.452          0.732        0.442
- Pitch        0.606        0.287          0.540        0.288

TABLE V. Parameter values derived from observed data versus predictions of model I and model II.

                        Predictions of parameters          Model predictions
                        from observed data                 of Pdp
                        Pd      Pp      PdPp    Pdp        Model I   Model II
Experiment I
  Subjects
    Mean                0.242   0.128   0.043   0.441      0.327     0.370
    Standard error      0.046   0.044   0.014   0.061      0.061     0.073
  Sentences
    Mean                0.270   0.146   0.057   0.503      0.358     0.415
    Standard error      0.075   0.045   0.020   0.085      0.093     0.112
Experiment II
  Subjects
    Mean                0.366   0.178   0.081   0.553      0.463     0.544
    Standard error      0.051   0.040   0.015   0.083      0.064     0.078
  Sentences
    Mean                0.389   0.219   0.086   0.611      0.526     0.608
    Standard error      0.060   0.026   0.013   0.037      0.055     0.066

To illustrate how model III fits the observed data, consider the observed data shown in the panel on the left in Table IV. To test model III, assume that D1 and D2, and P1 and P2, are complementary. In other words, D1 equals (1 - D2) and P1 equals (1 - P2). Using the four observed data values, we solve for D1, D2, P1, and P2. The best fitting values of D1 and P1 are 0.813 and 0.659, respectively. Thus, model III predicts a value of 0.736 for the +D, +P, +A cell, 0.577 for the +D, -P, +A cell, 0.423 for the -D, +P, +A cell, and 0.264 for the -D, -P, +A cell. Thus, this model does produce additivity and provides a good fit to the observed data values.
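The model III fit described above can be reproduced with a small least-squares calculation on the +Amplitude panel of Table IV. The text does not state which fitting criterion was used, so least squares is an assumption here, as are the names `OBSERVED`, `predict`, and `fit_least_squares`. With complementary outputs, the predicted proportion for a cell is 0.5 plus or minus (D1 - 0.5)/2 plus or minus (P1 - 0.5)/2, so in this balanced 2 x 2 design each parameter has a closed-form estimate.

```python
# Observed proportions from the +Amplitude panel of Table IV,
# keyed by (duration sign, pitch sign).
OBSERVED = {
    (+1, +1): 0.759,
    (+1, -1): 0.606,
    (-1, +1): 0.452,
    (-1, -1): 0.287,
}


def predict(d1, p1, d_sign, p_sign):
    """Model III proportion for one cell, assuming D2 = 1 - D1 and P2 = 1 - P1."""
    d = d1 if d_sign > 0 else 1.0 - d1
    p = p1 if p_sign > 0 else 1.0 - p1
    return (d + p) / 2.0


def fit_least_squares(observed):
    """Closed-form least-squares estimates of D1 and P1 for the 2 x 2 design."""
    n = len(observed)
    # Each effect equals half the distance of the parameter from 0.5.
    d_effect = sum(ds * (y - 0.5) for (ds, _), y in observed.items()) / n
    p_effect = sum(ps * (y - 0.5) for (_, ps), y in observed.items()) / n
    return 0.5 + 2.0 * d_effect, 0.5 + 2.0 * p_effect


d1, p1 = fit_least_squares(OBSERVED)
print(round(d1, 3), round(p1, 3))
for (d_sign, p_sign), y in OBSERVED.items():
    print(d_sign, p_sign, y, round(predict(d1, p1, d_sign, p_sign), 3))
```

Under these assumptions the estimates agree with the best fitting values and cell predictions quoted in the text.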

In all the models presented, the pitch and duration processors have been assumed to be independent. If both processors output probabilities, we expect to observe the pattern of results predicted by model I; namely, we expect an interaction (i.e., PdPp > 0). The absence of such an interaction may suggest that pitch and duration are not processed independently as we have assumed, since the probability of both cues together (Pdp) is greater than the sum of the parts (Pd + Pp - PdPp). It may be that pitch and duration are inseparable cues, i.e., they form some sort of integrated percept or "gestalt." The plausibility of such an explanation is bolstered by the fact that in natural speech pitch and duration information are normally consistent or perfectly correlated. Due to this correlation in nature, listeners may have learned to treat these two cues as an integrated whole, i.e., they do not perceive the two cues independently. In addition, both pitch and duration exist in the time domain. Duration is obviously time-dependent. However, the pitch contour is also a function of both time and fundamental frequency. Thus, it is logically impossible to manipulate these two cues independently. We observed that the separate duration and pitch manipulations affected phrase boundary perception. Thus, the rise and fall of the pitch conveyed information about the syntactic structure. Similarly, the durational relation of the elements in a sentence also conveyed information about the syntactic structure. However, either manipulation alone distorts the melody or tune of the utterance, and this tune or melody cannot be perfectly reconstructed without changing both simultaneously.

IV. CONCLUSION

The results of the two experiments are consistent. Both the pitch contour and the duration pattern were reliably used as cues in parsing ambiguous algebraic expressions. Amplitude by comparison appears to be a less important cue that is effective only in combination with appropriate values of duration. Moreover, at least for this set of sentences, only these three cues are used in phrase boundary perception. That is, the segmental or spectral characteristics of the utterance were not reliable cues for the location of the phrase boundary.

There were differences between the speakers in the effectiveness of the duration cue. For these two speakers, duration had a larger range of effectiveness than did pitch. For one speaker, for whom duration differences between pre- and postboundary positions were quite pronounced, altering duration shifted categorization responses by 44%, whereas for the second speaker, whose pre- and postboundary duration differences were less pronounced, the effect of duration was smaller.

Two probabilistic models and one nonprobabilistic model were discussed. Since the effects of pitch and duration summed in total probability correct, it appears that listeners do not independently combine the probabilities of duration and pitch providing a correct parsing when deciding on the location of a phrase boundary. However, the other two models were consistent with the data. One, a probabilistic model, assumes that pitch and duration information are used in an alternating fashion, with one or the other cue determining the phrase boundary location on a particular trial. In the other model, the pitch and duration processors output values indicating the degree of consistency of a particular utterance with each of the two possible parsings. These consistency values are then averaged to determine the phrase boundary decision.

ACKNOWLEDGMENTS

I thank S. L. Donald, N. H. Macdonald, and J. D. Dukes for assistance. The following people have provided valuable discussion in various phases of this research: T. K. Landauer, M. Y. Liberman, I. Lehiste, D. E. Meyer, L. H. Nakatani, and J. P. Olive. T. K. Landauer's, M. Y. Liberman's, and O. Fujimura's comments on the manuscript are greatly appreciated.

¹The amount of data becomes somewhat sparse to evaluate each of the speakers separately. However, the data were examined separately for each speaker, and the results were somewhat inconclusive. For speaker 1 the observed Pdp was 0.707, which was intermediate between the model I prediction (0.667) and the model II prediction (0.778). For this speaker the data do not clearly reject either model (p > 0.10 in both cases). On the other hand, speaker 2's data reject model I (t (7) = 3.68, p < 0.01), but fail to reject the second model (p > 0.10). For speaker 2 the observed Pdp value was 0.517, and the model I prediction was 0.385, whereas model II predicted a value of 0.439.


Atal, B. S., and Hanauer, S. L. (1971). "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Am. 50, 637-655.

Bolinger, D. L. (1958). "A theory of pitch accent in English," Word 14, 109-149.

Dukes, K. D., and Nakatani, L. H. (1976). "A study of acoustic cues related to juncture perception" (unpublished paper, Bell Labs).

Fry, D. B. (1955). "Duration and intensity as physical correlates of linguistic stress," J. Acoust. Soc. Am. 27, 765-768.

Fry, D. B. (1958). "Experiments in the perception of stress," Lang. Speech 1, 126-152.

Graham, N. (1977). "Visual detection of aperiodic spatial stimuli by probability summation among narrowband channels," Vision Res. 17, 637-652.

Klatt, D. H. (1975). "Vowel lengthening is syntactically determined in a connected discourse," J. Phonetics 3, 129-140.

Lehiste, I. (1960). "An acoustic-phonetic study of open juncture," Phonetica 5 (Suppl.), 1-54.

Lehiste, I. (1973). "Phonetic disambiguation of syntactic ambiguity," Glossa 7, 107-121.

Lehiste, I., Olive, J. P., and Streeter, L. A. (1976). "The role of duration in disambiguating syntactically ambiguous sentences," J. Acoust. Soc. Am. 60, 1199-1202.

Macdonald, N. H. (1976). "Duration as a syntactic boundary cue in ambiguous sentences," Paper presented at the IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, PA.

Morton, J., and Jassem, W. (1965). "Acoustic correlates of stress," Lang. Speech 8, 159-181.

Nakatani, L. H. (1976). "SYNLOG: An interactive system for manipulating speech," (unpublished paper, Bell Labs).

O'Malley, M. H., Kloker, B. R., and Dara-Abrams, B. (1973). "Recovering parentheses from spoken algebraic expressions," IEEE Trans. Audio Electroacoust. AU-21, 217-220.

Rigualt, A., (1962). "Role de la frequencie de l'intensite et de la duree vocaliques dans la perception de l'accent en francais," Proceedings of the 9th International Congress of Linguists (Mouton, S-Gravenhage), pp. 849-858.

Sachs, M. B., Nachmias, J., and Robson, J. G. (1971). "Spatial-frequency channels in human vision," J. Opt. Soc. Am. 61, 1176-1186.

Umeda, N. (1977). "Consonant duration in American English," J. Acoust. Soc. Am. 61, 846-858.

Watson, A. B., and Nachmias, J. (1977). "Patterns of temporal interaction in the detection of gratings," Vision Res. 17, 893-902.

Westin, K., Buddenhagen, R. G., and Obrecht, D. H. (1966). "An experimental analysis of the relative importance of pitch, quantity, and intensity as cues to phonemic distinctions in southern Swedish," Lang. Speech 9, 114-126.
