
Transcript of Computer Models of British English Rhythm and Intonation

paper 129 p 1

There are very few computer models of British English rhythm and intonation. Part of the reason lies in the relative absence of relevant models in the literature, and part in the fact that such descriptive models as do exist are lacking the kind of essential detail required for computer implementation. This paper is not intended as a survey of past or present efforts, but as a short statement of some practical experience we have had on the joint speech analysis and speech synthesis project that currently exists de facto between the universities of Calgary and Essex, currently comprising Ian Witten, Peter Madams, Rick Jenkins, Walter Lawrence and myself.

In 1973 Ian and I set out to attempt speech synthesis by rule at both segmental and suprasegmental levels, based partly upon some rather simple segmental rules I had developed at Edinburgh University in 1964 using the published Haskins Laboratory work and the Phonetics Dept. experience with speech synthesis. Figure 1 illustrates the basic framework at the segmental level, the durations and steady-state values being looked up from a table. Such a model at the segmental level clearly indicates an early tendency towards a postural view of speech, albeit oversimplified, and no doubt reflects the influence of my tutors. At the suprasegmental level we faced two problems: patterns of segment timing had to be specified, and the pattern of voicing frequency variation determined. My research and experience (especially that of hearing infinitely compressed speech, in which there is no amplitude variation) strongly indicated that we need not worry about any special provision for patterns of intensity variation.

The initial timing algorithm decided upon was simple. Like Mattingly, we used segment durations intrinsically determined from a table, with lengthening (50% of the steady-state duration) of stressed vowels. However, under the influence of the isochronicity theory, especially as expounded by Abercrombie, we elected to view speech as having a foot structure based upon the stressed syllables, and to modify the intrinsically determined element durations in each foot to provide a pronounced tendency for the stressed syllables to fall at equal intervals. The procedure involved “squashing” or “stretching” complete feet, towards a norm of 600 milliseconds duration, by scaling the steady-state durations of each foot’s elements after applying the stressing. Under favourable (?) circumstances the procedure produced precise isochrony, but there were limits on the amount of squashing or stretching allowed. This choice was an important factor in the further decision to use Halliday’s (1970) prescription for voicing frequency variation, diagrammatically summarised in figure 2. Other factors were the notation, well suited to computer input; the comprehensiveness (it covered, at an adequate level of delicacy, enough patterns to deal with all sentences in an appropriate way); the explicit inclusion of the whole of each contour, giving continuity of specification; its relative simplicity, giving hope of viable implementation; and its link with grammatical usage, giving hope for future research on automatic assignment of appropriate contours. This list goes a long way towards specifying the requirements such a model must satisfy in order to be suited to computer implementation, and the possession of these qualities by Halliday’s model probably reflects its successful design as a vehicle for communicating with students of English as a foreign language.
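The squashing/stretching procedure just described can be outlined in a few lines. This is a minimal sketch, not the original 1973 program: the 600 ms foot norm and the 50% stress lengthening come from the paper, while the clamping limits and all function and variable names are assumptions.

```python
# Illustrative sketch of the foot-timing procedure described above.
# FOOT_NORM_MS and STRESS_FACTOR come from the paper; the clamping
# limits and all names are assumed, since the original program is
# not reproduced here.

FOOT_NORM_MS = 600.0               # target foot duration (paper)
STRESS_FACTOR = 1.5                # stressed vowel: steady state + 50% (paper)
MIN_SCALE, MAX_SCALE = 0.5, 2.0    # assumed limits on squashing/stretching

def scale_foot(steady_states_ms, stressed_indices):
    """Lengthen stressed vowels, then scale the whole foot's
    steady-state durations towards the 600 ms norm, within limits."""
    durations = list(steady_states_ms)
    for i in stressed_indices:          # apply the stressing first
        durations[i] *= STRESS_FACTOR
    scale = FOOT_NORM_MS / sum(durations)   # exact isochrony if unclamped
    scale = max(MIN_SCALE, min(MAX_SCALE, scale))
    return [d * scale for d in durations]
```

For example, a foot of elements lasting 100, 200 and 100 ms with the second element stressed first becomes 100 + 300 + 100 = 500 ms, and is then stretched by a factor of 1.2 to exactly 600 ms; a foot far from the norm is only scaled up to the assumed limit.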

8th International Congress of Phonetic Sciences, Leeds, UK, Aug 17-23, 1975

Computer Models of British English Rhythm and Intonation

David R. Hill

Assoc. Professor, Dept. of Computer Science, The University, Calgary, Alberta, Canada.

paper 129 p 2

The first implementation (summer 1973) produced much amusement and was soon modified, for a subjective effect of plain-song chanting of speech was inescapable. Some passages could have been taken straight from the spoken part of a church service conducted by a vicar of the old school, insofar as they were credible at all. The basic trouble seemed to lie in the lack of voicing frequency movement during pretonic feet, in voicing frequency movements that were too smooth and too predictable, in having too long a foot length, and in having too regular a rhythm. Especially, we had ignored what Halliday calls the “sandhi” features of intonation patterns, but does not specify. These amount to voicing frequency variations which necessarily occur with speakers, being an overlay on the basic intonation contour, and thus do not need to be specified in teaching people to speak correctly. Ian modified the timing algorithm, incorporating some of Walter’s ideas on syllable structure and its relation to segment duration, which in turn derived from Abercrombie (Lawrence 1974), and introduced some pretonic foot voicing frequency shift, especially for tone-group 1, though he retained the principle of tending towards isochronicity. The result is the computer voice you hear on the tape, during the log-in procedure to the speaking computer. Walter has continued his introspective approach to rhythm and intonation, using the same Essex computer system (developed as a component of the joint Essex-Calgary project), and the second tape you hear alternates between the same computer voice as before and speech using the same segmental synthesis algorithm, but with timing and voicing frequency data supplied by hand on the basis of Walter’s ideas as to what is necessary. This also illustrates how the computer system may be used for comparison of different rule sets, and what improvement is possible, even with our present knowledge, given a computer formalisation of the knowledge that goes into Walter’s synthesis.

In writing and modifying such program suites, many arbitrary decisions must be made, and much taken on trust. It behoves the aspiring computer speech-output programmer to gain as much linguistic knowledge as possible, and to talk to the right people. Despite the acceptability of the speech we can now generate, for people hearing it in the context of practical applications such as stores enquiries (components, delivery dates, etc.), it is clear that much improvement is needed. We have reacted in two ways. First, we have jointly written a new segmental-suprasegmental program suite which is just now working, and forms a basis for planned elaboration in the future.

Brief mention is made of the segmental part in another paper in these proceedings (Hill 1975), and the suprasegmental module is outlined in Witten (1975). Detailed reports will become available, but the emphasis is on flexibility, to permit information-gathering experiments with computer-generated synthetic speech, and on a better-defined interface between segmental and suprasegmental levels. Both these areas have caused problems with the present system.

In parallel with this work, I have begun an analytic-spectrographic study of intonation and rhythm, observing that our knowledge of the basic cues and patterns is incomplete and may be wrongly structured, and that experiments in synthesis alone, based on invalid assumptions about real speech, may be very misleading. In view of the strange character of perception, it is not even certain that such cues and patterns are as people perceive them, for perception follows its own unstated assumptions and prejudices. Figures 3 and 4 show two voicing frequency analyses as specific but typical examples of the small set I have so far examined. The intonation contour is superimposed upon an approximation to a segmental analysis of the conventional type. Times are given in milliseconds. The original speech was generated and recorded to illustrate Halliday’s book (1970, study unit 30, sentences 1 and 2) and may be taken

paper 129 p 3

as realisations of the contours specified by a trained speaker. It is interesting to compare these real-speech intonation contours to Halliday’s paradigms, and to try to interpret the rhythm in terms of a theory of isochronicity. This very limited view is suggestive of some of the difficulties we face in implementing computer models of rhythm and intonation.

Acknowledgements

I would like to thank Ian Witten and Peter Madams of the EES Dept. at Essex University, as well as Walter Lawrence, for their collaboration in this research, and for many lively discussions on relevant topics. I would also like to thank my own University (of Calgary) for generously granting sabbatical leave, and Essex University for granting me facilities during this period. Finally, I should like to thank the National Research Council of Canada for the generous financial support over a number of years that has made the research possible.

References

HALLIDAY, M.A.K. (1970) A course in spoken English: Intonation. Oxford University Press.

HILL, D.R. (1964) Unpublished experiments at Dept. of Phonetics, Edinburgh University.

HILL, D.R. (1975) Avoiding segmentation in speech analysis: problems and benefits. Proc. 8th Int. Cong. of Phonetic Sciences, Leeds, 17th-23rd August 1975, paper no. 128.

LAWRENCE, W. (1974) The phoneme, the syllable and the parameter track. Proc. of Speech Communication Seminar, K.T.H., Stockholm, August 1-3.

WITTEN, I.H. (1975) A flexible scheme for assigning timing and pitch to synthetic speech. Technical Report EES-MMS-SYN 1-75 University of Essex: Colchester, U.K. [Later published under the same title in Language and Speech, 20(3), pp. 240-260, July-September.]

paper 129 p 4

1. Basic segmental synthesis framework (Hill 1964).

paper 129 p 5

The dashed lines indicate the start of each tonic. For cases where the contour is linear for the tonic, note that if more than one syllable occurs, more change occurs on the first syllable than on the remainder. The above summary is a condensed basic statement of the main features.

2. Summary of the main features of different intonation contours required as a basic framework for British English intonation, following Halliday (1970).
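The note above, that a linear tonic puts more pitch change on its first syllable than on the remainder, can be sketched as a simple distribution rule. This is purely illustrative: the 50% first-syllable share and the function name are assumptions, not figures from Halliday.

```python
def tonic_pitch_steps(total_change_hz, n_syllables, first_share=0.5):
    """Distribute a linear tonic's pitch change over its syllables,
    giving the first syllable a larger share (first_share is an
    assumed parameter) and spreading the remainder evenly."""
    if n_syllables == 1:
        return [total_change_hz]
    rest = total_change_hz * (1.0 - first_share) / (n_syllables - 1)
    return [total_change_hz * first_share] + [rest] * (n_syllables - 1)
```

For instance, a 60 Hz fall over a three-syllable tonic would, under this assumed split, place a 30 Hz fall on the first syllable and 15 Hz on each of the other two.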

paper 129 p 6

3. Two typical voicing frequency tracings taken from recordings accompanying Halliday (1970), study unit 30.

Eighth International Congress of Phonetic Sciences Paper 129

17-23 August 1975

DISCUSSION SHEET

Paper presented by HILL on Friday 22nd at 16.20

YOUR NAME: STEVE MARCUS

PLEASE RECORD ON THIS SHEET, writing very clearly, a brief summary, in English, French or German, of what you have just said. Then hand in the sheet, at the Secretariat, so that your contribution may reach the presenter of the paper as soon as possible, for him to record his reply (on the back if necessary).

PRESENTERS are requested to return the completed sheet to the Secre-tariat before the end of the Congress.

AUTHOR’S REPLY

A strict interpretation of isochrony in spoken English must be concerned with the interval between successive word stresses. However, one problem is that we lack, at present, a reliable objective set of criteria for deciding that a syllable carries such stress; nor have I seen a published discussion of such problems as words bearing double stress in the context of isochronicity (how does “isochronicity” fit into the foot structure, for example?). Walter Lawrence suggests that simple cases (“blackbird”, for example) are accommodated by foot reduction -- two successive feet for the two syllables, but having a total duration of only one and a half feet, with a foot boundary in the middle of the word. Our simple squashing/stretching algorithm would produce much the same effect. But even assigning stress (which is what assigning foot boundaries amounts to) is a subject of debate. Jonathan Allen, at the MIT Research Lab of Electronics, used the Chomsky-Halle approach in assigning word stress for speech synthesis. The problem of looking for isochrony quickly assumes the character of looking for a real (?) invisible man, if we are not careful. However, from the point of view of natural rhythm -- even intelligibility -- in synthetic speech by rules, isochronicity, or any other theory dealing only with the spacing of stressed syllables, is not enough. Indeed we do need a theory dealing with the finer time structure of syllables. Our current computer program for synthesis has some rules for this, but I am currently including the syllable structure problem in my general research effort. I would not be surprised if the syllable turned out to be the important fundamental speech unit for many purposes, although I am not working on that assumption yet.
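Lawrence’s foot-reduction suggestion amounts to simple arithmetic. A minimal sketch under the 600 ms foot norm used in the paper, with the function and parameter names assumed:

```python
FOOT_NORM_MS = 600.0  # foot duration norm from the paper

def reduced_feet(n_feet=2, total_in_norms=1.5):
    """Foot reduction for a double-stressed word such as 'blackbird':
    n_feet successive feet sharing a combined duration of only
    total_in_norms normal feet (here, two feet scaled by 0.75 each)."""
    return [FOOT_NORM_MS * total_in_norms / n_feet] * n_feet
```

With the 600 ms norm, each of the word’s two feet comes to 450 ms, 900 ms in all, rather than the 1200 ms of two full feet.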

DISCUSSION SHEET

Paper presented by HILL on Friday 22nd at 16.20

YOUR NAME: J.D. O’Connor

AUTHOR’S REPLY

This is an excellent point and I thank you for your question. I think the main relevant difference between the two theories is that Jassem allows the equivalent of “anacrusis” in his metrical calculations, i.e. unstressed syllables that are not counted in determining the time interval between successive beats. If observed departures from Abercrombie’s simpler statement of isochronicity always took the form of unexpectedly lengthened feet which also contained syllables that would not be counted under Jassem’s formulation, then your suggestion could be valid. In practice I find discrepancies between foot lengths of the order of 200-300%, while George Allen tells me he has found interval variations on a five-to-one ratio. Such variation is outside any reasonable interpretation of anacrusis. However, I do take the point that one should not entirely dismiss isochronicity just because measurements on real speech disagree with a simple interpretation.