8th International Congress of Phonetic Sciences, 17-23 August 1975: paper no. 128

Avoiding segmentation in speech analysis: problems and benefits

David R. Hill
Assoc. Professor, Dept. of Computer Science, The University, Calgary, Alberta, Canada.

There is a chicken and egg problem that exists at all levels of speech analysis. Assumptions about the essential structure constrain the analysis that is performed, while the analysis constrains the variety of structure that may be discovered; yet the structure is precisely what the analysis aims to clarify. There are currently two basic approaches to speech analysis. The first is a type of mnemonic analysis in terms of sequences of symbols representing units of speech that have been perceived (with all that implies) by a listener. It includes alphabetic writing, pictograms and transcriptions used by phoneticians. With sufficient training any required degree of “delicacy” may be achieved. Such analyses are essentially segmental and subjective.

The second type of analysis, made possible by advances in various branches of signal processing and instrumentation, extracts and records various objective attributes of utterances, and is a truly physical analysis. This second type of analysis is essentially parallel-continuous and objective, even taking pitch-synchronous analysis into account.

Various problems arise in trying to relate these two very different kinds of analysis, and they are encountered most uncompromisingly in the so-called automatic speech recognition problem. The difficulties are inherent because of the chicken and egg problem noted above. The phonemic intuition of speakers, which underpins phonetic transcription, assumes speech is generated as a succession of elements (thus /k/ followed by /æ/ followed by /t/, for “c-a-t”, is used as a bridge between the printed and spoken word). At an articulatory level, such a view is fair. Although the precise posture may vary, and is not generally “held” in any sense, there is a target for /k/, a target for /æ/ and so on, and when these targets are successively attempted, with suitable control of voice and timing, an utterance identifiable as /k æ t/ will emerge. At an acoustic level, as seen by the machine, the view fails for a number of reasons. For instance, there may be no identifiable acoustic boundary within a two-posture sequence. The problem is most obvious for glide-vowel combinations but, even when there seems an obvious boundary (e.g. vowel-nasal, nasal-vowel, vowel-stop sequences, etc.), the “boundary” is actually most nearly related to the point of closure or release, in whatever sense this may be defined. It is well established that there are important perceptual cues to the identity of the underlying postures that do not lie between such boundaries, perhaps the most important cues.

Although the variety of approaches to speech description based on machine analysis is apparently quite large, most attempt the route from continuous parameters to symbolic description by some form of segmentation into time blocks followed by processing through to phoneme symbols, either directly or, via coalescing procedures, from “machine phonemes” -- whatever that may mean (usually spectral sections based on 10 msec segments of waveform). However, because of the character of the varying speech stream, the perceptually relevant changes that are observed in the many simultaneously varying parameters are not necessarily synchronised either with each other or with any recognisable major boundary. Furthermore, attributes that vary are important at several levels, or may be distinctive in terms of relative timing. Thus, presence versus absence of voicing, in the time-neighbourhood of the approximation to the target posture, helps distinguish voiced from unvoiced fricatives, while it is the relative time of cessation/onset of voicing compared to the occurrence/release of closure that is the important cue distinguishing voiced from unvoiced stops. Rises and falls of voicing frequency may give clues at the segmental level (micro-intonation associated with larynx postures and/or a constricted air-passage through the vocal tract -- for stops and fricatives, for example), or at the supra-segmental level where the stress pattern (partly manifested as voicing frequency variation) and intonation contour (pure voicing frequency change) may change the identification of words having the same segmental description, or the meaning of phrases having the same word sequence. Recent work at Eindhoven (‘t Hart & Cohen 1973) suggests that the timing of supra-segmentally important voicing frequency variation, relative to events at the segmental level, is vitally important in this. Similar remarks may be made about duration cues, which operate both at the segmental and the supra-segmental levels. Indeed, in this context, the distinction between the levels seems somewhat artificial.
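
To make the relative-timing cue concrete, here is a minimal sketch in Python (not from the paper; the event names and the 30 msec figure, a rough value often quoted for English stops, are assumptions) that labels a stop by the interval between release and voicing onset:

    # Hypothetical illustration of a relative-timing cue (voice onset time).
    # Times are in milliseconds; the 30 ms boundary is an assumed rough
    # threshold, not a value taken from the paper.

    def classify_stop(release_ms: float, voicing_onset_ms: float) -> str:
        """Label a stop by when voicing begins relative to the release."""
        vot = voicing_onset_ms - release_ms
        return "voiced" if vot < 30.0 else "unvoiced"

    # Voicing beginning 10 ms after release gives a short-lag, "voiced" stop.
    print(classify_stop(release_ms=100.0, voicing_onset_ms=110.0))

The point of the sketch is that neither event is itself a boundary; the cue lies wholly in the relative timing of the two events.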

The present author has described in detail (Hill & Wacker 1969, Hill 1972) a realisable framework for objectively analysing speech utterances into symbol strings without imposing or implying segmental division of the speech stream. In essence, any continuous description of speech is interpreted in terms of events occurring in time order, any given event being the beginning or ending of some perceptually significant state or change of state in some perceptually significant attribute. For example, the fact that the frequency of a formant peak starts to rise, or the cessation of regular larynx pulses, could be such events. The problems of instrumenting such measurements are being solved. The occurrence of such events, and their local partial ordering relative to one another, are regarded as evidence, in a statistical sense, for hypotheses about causal postures or posture sequences, which may then be represented symbolically. If the duration of a perceptually significant feature is itself perceptually significant, then time thresholds may be set up to distinguish the ending of different categories of duration as different events (related to chronemes?). The output symbols of such a system depend on the level of hypothesis chosen for testing. Choosing words implies redundancy, and the implicit postural sequences may be identified even when incomplete or poorly rendered. For narrow transcription, the postural hypotheses will differ, corresponding to finer distinctions between events for shorter fragments of event sequence, but the chance of error will increase, and the machine would need to be more perfectly and completely constructed -- as might be expected, phoneticians require more training and special knowledge than native speakers. In that an overall structure (in this case an ideal sequence of events characterising a posture or posture sequence in speech) is identified or constructed on the basis of those fragments of structure (sequence) that can be detected, the approach is analogous to the determination of the molecular structure of organic compounds on the basis of mass spectrum data which reveals fragments of the structure (Buchanan, Sutherland & Feigenbaum 1969). Figure 1 illustrates diagrammatically the relationship between postural, segmental and event-based views of speech, showing especially how events are distributed in their relevance to identifying implicit postures, but omitting the detection or grouping of event sequence fragments.
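
As a way of fixing ideas, the following sketch gives one minimal Python reading of this framework; every name, threshold and fragment is invented for illustration, and none is drawn from Hill & Wacker (1969) or Hill (1972). One attribute track is reduced to begin/end events, and a posture hypothesis is scored by the fraction of the ordered event pairs it predicts that are actually observed:

    # A minimal sketch of the event-based framework: each perceptually
    # significant attribute is reduced to events marking the beginning or
    # ending of a state or change of state, and a posture hypothesis is
    # scored by the ordered event pairs it predicts. Names and thresholds
    # are illustrative assumptions.

    from typing import List, Tuple

    Event = Tuple[float, str]  # (time in ms, event label)

    def detect_events(track: List[Tuple[float, float]], name: str,
                      slope_threshold: float) -> List[Event]:
        """Mark instants where a parameter track (e.g. a formant frequency,
        given as (time_ms, value) samples) starts or stops rising."""
        events: List[Event] = []
        rising = False
        for (t0, v0), (t1, v1) in zip(track, track[1:]):
            slope = (v1 - v0) / (t1 - t0)
            if slope > slope_threshold and not rising:
                events.append((t0, f"{name}_rise_begins"))
                rising = True
            elif slope <= slope_threshold and rising:
                events.append((t0, f"{name}_rise_ends"))
                rising = False
        return events

    def evidence(events: List[Event],
                 fragments: List[Tuple[str, str]]) -> float:
        """Score a posture hypothesis by the fraction of its predicted
        ordered event pairs observed in the event stream, in that order."""
        when = {label: t for t, label in events}
        hits = sum(1 for a, b in fragments
                   if a in when and b in when and when[a] < when[b])
        return hits / len(fragments) if fragments else 0.0

Because the score depends only on local order, an omitted or displaced event lowers the evidence gracefully instead of derailing a segmentation.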

An advantage of the approach is that all significant attributes of speech may be treated on an equal footing, and in relation to other attributes, at any level of hypothesis. Perhaps such an approach represents at least a step towards Daniel Jones’ “signemic analysis” (Jones 1967, p. 269). Prosodemes and phonemes would become “-emic” entities at higher levels of the descriptive hierarchy, with signemes providing the primitives for all “-emic” analysis. Another advantage is the avoidance of segmentation and related problems.

There is, in fact, an element of division into “before” and “after”, but this is limited to the detection of the beginnings or endings of changes of state, or of new states, for fairly well defined and simple sound attributes (voicing, voicing frequency, formant frequency and so on), with the segmentation (in this very restricted sense) of each attribute, by events, being independent of the others. In place of the “building brick” analogy, which is so natural but so misleading in phonemic analysis, it substitutes a “rope woven from knotted strings” analogy in which the strings are perceptually significant attributes and the knots the events, as above, occurring at particular time instants. The omission or displacement of quite a few knots need not affect the overall ability to accept and reject hypotheses about the causal postures or posture sequences.
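
As a toy rendering of the analogy (again with all labels, times, fragments and the 0.6 acceptance threshold invented for illustration), each string is knotted independently, the knots are pooled into one time-ordered stream, and the hypothesis still passes with a knot missing:

    # Each "string" (attribute) is knotted (evented) on its own; the knots
    # are pooled into one time-ordered stream; a posture hypothesis
    # survives a missing knot. All values here are invented.

    voicing_knots = [(120.0, "voicing_ends"), (195.0, "voicing_begins")]
    formant_knots = [(118.0, "F2_fall_begins")]    # the F2 rise knot is lost
    closure_knots = [(125.0, "closure"), (190.0, "release")]

    stream = sorted(voicing_knots + formant_knots + closure_knots)

    # Ordered event pairs an intervocalic stop posture might predict.
    fragments = [("F2_fall_begins", "closure"),
                 ("voicing_ends", "closure"),
                 ("closure", "release"),
                 ("release", "voicing_begins"),
                 ("F2_rise_begins", "voicing_begins")]  # needs the lost knot

    when = {label: t for t, label in stream}
    hits = sum(1 for a, b in fragments
               if a in when and b in when and when[a] < when[b])
    score = hits / len(fragments)                  # 4 of 5 pairs: 0.8
    print("accept" if score >= 0.6 else "reject", f"(score {score:.2f})")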

In view of recent findings in neurophysiology, which suggest time-pattern coding of different (though related) kinds of information within a single nerve fibre, it is tempting, if apparently highly speculative, to characterise speech as an “external neural signal”, explaining, perhaps, both the enigmatic face it presents to those working in automatic speech recognition, and the apparent ease with which it is generated and received naturally. However, some further substance is lent to such speculation by recent publications (Condon & Ogston 1971; Condon 1974) which show that a listener’s (as well as the speaker’s) body movements are synchronised with speech events at a very low level, and that this phenomenon is observed even for very young infants (less than two days old). Also, of course, the articulatory movements that underlie speech are the direct result of nervous activity -- a fact very likely reflected in the signal produced.

Among the problems to be solved -- by no means all confined to this approach -- are the selection of attributes and the definition of criteria of significance, both of state and change of state. Also there is a problem in selecting what order relationships (sequence fragments) to detect. Solutions have been proposed within the framework and are described in the papers already cited. Current work is out of the pilot study phase and into the construction of a hardware-software prototype system. A further problem deserves note. A descriptive framework offering advantages for analysis ought to do the same for synthesis, yet the method of implementing synthesis on such a basis has long remained an elusive goal. A recent result has been the translation of the analysis framework into a synthesis algorithm that at least comes close to realising this goal. Occupying little core, and designed as a flexible tool for experiments in speech synthesis, the program allows events of the kind referred to above to be displaced in time relative to the basic rhythmic framework of speech which, in turn, is tied to the release of each successive posture being synthesised.
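
The synthesis algorithm itself is only outlined above; the following sketch is an assumed reading of the idea (all postures, event labels and offsets are invented), in which a rhythmic framework fixes successive posture release times and each event is displaced by a signed offset from the release of its posture:

    # Hypothetical sketch: a rhythmic framework fixes the release time of
    # each successive posture, and every event is scheduled at a signed
    # offset from the release of the posture it belongs to.

    posture_releases = {"k": 0.0, "ae": 80.0, "t": 260.0}  # releases, ms

    # (posture, event label, offset in ms relative to that release)
    event_plan = [
        ("k",  "burst",             0.0),
        ("k",  "aspiration_ends",  40.0),
        ("ae", "voicing_begins",   -5.0),  # events may lead or lag a release
        ("ae", "F1_rise_begins",    0.0),
        ("t",  "closure",         -60.0),
        ("t",  "voicing_ends",    -55.0),
    ]

    schedule = sorted((posture_releases[p] + offset, label)
                      for p, label, offset in event_plan)
    for t, label in schedule:
        print(f"{t:7.1f} ms  {label}")

Displacing an event then means changing only its offset, leaving the rhythmic framework, and hence the timing of every other event, untouched.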

The author gratefully acknowledges the generous financial support of the National Research Council of Canada, over a number of years, which has made this research possible.

References

BUCHANAN, B.G., SUTHERLAND, G.L. & FEIGENBAUM, E.A. (1969) Heuristic DENDRAL: a program for generating explanatory hypotheses in organic chemistry. In Machine Intelligence 4 (Meltzer, B. & Michie, D., eds), Edinburgh University Press: Edinburgh.

CONDON, W.S. & OGSTON, W.D. (1971) Speech and body motion synchrony of the speaker-hearer. In The Perception of Language (Horton, D.L. & Jenkins, J.J., eds), Merrill (Int. Psych. Series): New York.

CONDON, W.S. (1974) Speech makes babies move. New Scientist, 6th June, pp. 624-627.

‘t HART, J. & COHEN, A. (1973) Intonation by rule: a perceptual quest. Journal of Phonetics 1, pp. 309-327.

HILL, D.R. & WACKER, E.B. (1969) ESOTerIC II -- an approach to practical voice control: progress report ’69. In Machine Intelligence 5 (Michie, D. & Meltzer, B., eds), pp. 463-493, Edinburgh University Press: Edinburgh.

HILL, D.R. (1972) A basis for model building and learning in automatic speech pattern discrimination. Proc. NPL/Phys. Soc./IEE Conference on Machine Perception of Patterns and Pictures, NPL, Teddington, April 1972, pp. 151-160. Physical Society: London.

JONES, D. (1967) The phoneme: its nature and use. 3rd edition. Heffer: Cambridge.

Figure 1. Relationship between postural, segmental, analytic and event-based views of speech, with an indication of the distribution of evidence for different postures. Note that the events depicted are at a comparatively low level (classically segmental) and the detection or grouping of event-sequence fragments is omitted.

Eighth International Congress of Phonetic Sciences, paper 128, 17-23 August 1975

DISCUSSION SHEET

Paper presented by HILL on 20.8.75 at 12.30

YOUR NAME: J. CARNOCHAN

PLEASE RECORD ON THIS SHEET, writing very clearly, a brief summary, in English, French or German, of what you have just said. Then hand in the sheet at the Secretariat, so that your contribution may reach the presenter of the paper as soon as possible, for him to record his reply (on the back if necessary).

PRESENTERS are requested to return the completed sheet to the Secretariat before the end of the Congress.

AUTHOR’S REPLY

Thank you for your support, and the impeccable pedigree you provide for a non-segmental approach. I agree that, for different purposes, one often needs different views. It is unfortunate that, in what may be termed “applied acoustic phonetics” (which includes work on automatic speech pattern discrimination), the attractive possibility of a non-segmental approach has so long been ignored. Part of the reason for this probably lies in difficulties of realisation, which is the problem area I have addressed.

Eighth International Congress of Phonetic Sciences, paper 128, 17-23 August 1975

DISCUSSION SHEET

Paper presented by HILL on 20.8.75 at 12.30

YOUR NAME: Peter Ladefoged

PLEASE RECORD ON THIS SHEET, writing very clearly, a brief summary, in English, French or German, of what you have just said. Then hand in the sheet at the Secretariat, so that your contribution may reach the presenter of the paper as soon as possible, for him to record his reply (on the back if necessary).

PRESENTERS are requested to return the completed sheet to the Secretariat before the end of the Congress.

AUTHOR’S REPLY

I appreciate your question, which touches on an important point. The answer depends on what is meant by “separate”. For written output, separation is essential in the output form, since this is how language is written. However, the same attribute, even the same event, may contribute to the testing of hypotheses at different levels, and there is no internal distinction between the levels. If, instead of requiring written output, we require the machine to take action, such action might well be discriminated by cues at both segmental and suprasegmental levels together, just as the words “object” (noun) and “object” (verb) are discriminated from other words by segmental cues and from each other by suprasegmental cues. This is one sense in which I think the distinction between levels may be artificial and may even represent a defect of our orthography. If one put non-speech sounds in with the speech input, as you have for humans (Broadbent & Ladefoged 1960), then -- since such items would not give rise to events that were incorporated into the sequence fragments -- it would be very difficult for the system to determine where they had occurred in time relation to the speech. They would tend to be assigned at the next major break, as in your experiment. This is a fundamental property of the system, closely related to the point at issue, as well as an interesting parallel with human perception.

BROADBENT, D.E. & LADEFOGED, P. (1960) Perception of sequence in auditory events. Quarterly J. Experimental Psychology 12 (3), pp. 162-170.