CUHK-EE-DSPSTL 1 An Introduction to Speech Perception Ph.D. student: Li Yujia, Rain Supervisor:...

CUHK-EE-DSPSTL 1

An Introduction to Speech Perception

Ph.D. student: Li Yujia, Rain

Supervisor: Prof. Tan Lee

Jan. 28, 2005

Contents

• Basic Knowledge
• Speech Perception
• Perception Theories
• Speech Perception versus Music Perception
• Applications

Basic Knowledge

• Three levels of speech
• Segments vs. supra-segments
• Basic acoustic features
• Auditory components of human speech perception
• Basic methodology of perception research

Three levels of speech

• Linguistic: define rules
• Acoustic: speech realization
• Perceptual: interpretation

[Diagram: the three levels are arranged along the path from speaker to listener.]

Segments vs. Supra-segments

• Segments (phonemes): vowel, consonant; related to intelligibility
• Supra-segments (prosody): F0, duration, energy; related to naturalness (stress, rhythm, intonation, emotion)

Basic acoustic features

[Figure: waveform and spectrogram of a speech sample, with the pitch period (1/f0) and the formants labeled.]
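The relation between the pitch period and f0 labeled in the figure can be illustrated with a minimal Python sketch. It is not part of the original slides: the synthetic two-harmonic signal, the 8 kHz sample rate, the search range, and the function name `estimate_f0` are all assumptions chosen for illustration, and autocorrelation is only one common way to find the period.

```python
import math

def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate f0 in Hz from the lag of the strongest autocorrelation peak."""
    min_lag = int(sample_rate / f0_max)   # shortest period considered (samples)
    max_lag = int(sample_rate / f0_min)   # longest period considered (samples)
    window = len(samples) - max_lag       # same number of terms for every lag
    if window <= 0:
        raise ValueError("signal too short for the requested f0 range")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

SAMPLE_RATE = 8000   # Hz (assumed for this sketch)
TRUE_F0 = 125.0      # Hz (assumed for this sketch)

# Synthetic "voiced" signal: a fundamental plus a weaker second harmonic.
signal = [
    math.sin(2 * math.pi * TRUE_F0 * n / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 2 * TRUE_F0 * n / SAMPLE_RATE)
    for n in range(800)
]

f0 = estimate_f0(signal, SAMPLE_RATE)
period_ms = 1000.0 / f0
print(f"estimated f0 = {f0:.1f} Hz, pitch period 1/f0 = {period_ms:.2f} ms")
```

The estimate should come out close to the 125 Hz used to build the signal, that is, a pitch period of about 8 ms. Real speech would need framing and a voicing decision, which this sketch leaves out.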

Auditory components of human speech perception

• The peripheral auditory organs – ear (signal processing)
• The auditory nervous system – brain (interpretation)

[Figure labels: semantic, prosody]

Basic methodology of perception research

• Stimuli: synthesized speech
• Testing: by human listening
• Results are affected by
– Intrinsic factors: attributes of the speech sounds
– Extrinsic factors: resulting from the experimental conditions


Speech Perception

• Perception of vowels
• Perception of consonants
• Perception of prosody

Perception of vowels (1)

• Vowel sounds are perceptually specified by their formant frequencies.

[Figure: spectrogram of an /i/ vowel with the first and second formants labeled.]

Perception of vowels (2)

• Evidence
– From production:
  • Vowel → tongue position → vocal tract shape → formant frequencies.
– From perception:
  • In synthesized speech, changing the first two formants yields a different vowel sound.
– From physics:
  • “There is some evidence that the human auditory nerve already reacts directly to formant frequencies.” (Delgutte, 1980)
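The idea that the first two formants are enough to tell vowels apart can be sketched as a nearest-neighbour lookup in the F1/F2 plane. This is an illustration rather than the method of any study cited in the slides: the reference values are approximate adult-male averages of the kind commonly quoted in textbooks, and the table `REFERENCE_FORMANTS` and the function `classify_vowel` are hypothetical names.

```python
import math

# Approximate (F1, F2) values in Hz for three vowels. These are textbook-style
# averages assumed for illustration, not measurements from the slides.
REFERENCE_FORMANTS = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}

def classify_vowel(f1_hz, f2_hz):
    """Return the reference vowel whose (F1, F2) point is closest."""
    def distance(label):
        ref_f1, ref_f2 = REFERENCE_FORMANTS[label]
        return math.hypot(f1_hz - ref_f1, f2_hz - ref_f2)
    return min(REFERENCE_FORMANTS, key=distance)

print(classify_vowel(300.0, 2200.0))  # a low F1 with a high F2
print(classify_vowel(700.0, 1150.0))  # a high F1 with a mid F2
print(classify_vowel(320.0, 900.0))   # a low F1 with a low F2
```

A plain Euclidean distance in Hz is the simplest choice here. A perceptually motivated frequency scale would be a natural refinement, but it is beyond what the slide states.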


Perception of consonants (1)

• The perception of many consonants depends on the neighboring vowels; the perception of stop consonants depends largely on the rapidly changing formant transitions.

[Figure: schematic of the first two formant frequency patterns for a /di/ syllable, with the transition and steady-state portions labeled.]

Perception of consonants (2)

[Figure: schematic representations of the first two formant frequency patterns for /d/ in front of different vowels.]

• Lack of acoustic invariance: the lack of something constant in the spectrographic representation (the visual representation of speech) that would explain the perception of a particular consonant.
• Locus theory: the second formant frequency transitions all seem to point toward the same frequency, which is called the locus.
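The locus idea can be made concrete with a small sketch that draws straight F2 tracks from one shared frequency to different vowel targets. Everything numeric here is an assumption for illustration: the 1800 Hz locus for /d/, the vowel F2 targets, and the helper name `f2_transition` are not taken from the slides. The tracks also start at the locus itself for simplicity, whereas the theory only says that the transitions point toward it.

```python
def f2_transition(locus_hz, vowel_f2_hz, steps=5):
    """Linearly interpolate an F2 track from the locus to the vowel target."""
    return [
        locus_hz + (vowel_f2_hz - locus_hz) * k / (steps - 1)
        for k in range(steps)
    ]

D_LOCUS_HZ = 1800.0  # assumed F2 locus for /d/ (illustrative value)
VOWEL_F2_HZ = {"i": 2290.0, "a": 1090.0, "u": 870.0}  # assumed vowel targets

transitions = {
    vowel: f2_transition(D_LOCUS_HZ, target)
    for vowel, target in VOWEL_F2_HZ.items()
}

for vowel, track in transitions.items():
    direction = "rising" if track[-1] > track[0] else "falling"
    print(f"/d{vowel}/: F2 {direction}, {track[0]:.0f} -> {track[-1]:.0f} Hz")
```

The tracks differ in direction and extent from vowel to vowel, which is the lack of acoustic invariance, yet all of them share the same origin, which is the locus.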

Perception of consonants (3)

• What is the basic unit for speech perception?
– Because we cannot isolate stop consonants from vowels in perception, researchers began to think of speech as encoded (vowels and consonants are squeezed together), perhaps in syllable-sized units.
• Speech can be presented at a faster rate (30 phonemes per second) than other sounds and still remain intelligible.

Perception of prosody (1)

• The perception of prosody has been described as dependent on the “melody of speech”, the fluctuations in pitch, rhythm, and stress (Monrad-Krohn, 1947).
• The related acoustic features are f0, duration, and energy intensity.


Perception of prosody (2)

• Perception of prosody is more complex
– The definition is relatively vague.
– The perception of prosody is nonlinear in the acoustic features (double f0 ≠ double pitch; double duration ≠ double stress).
– It is perceived over a long time span in a relative sense (the degree of contrast between the values of the acoustic variables over a number of syllables).
– A perceived attribute of prosody may be related to several acoustic features (f0 is the most powerful cue to stress, followed by duration and energy intensity).
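The point that doubling f0 does not double the perceived pitch can be illustrated with the mel scale. The slides do not name a particular pitch scale, so using the mel scale here is an assumption. The formula below is one widely used approximation, and `hz_to_mel` is a hypothetical helper name.

```python
import math

def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels with a common log-based approximation."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

low_ratio = hz_to_mel(200.0) / hz_to_mel(100.0)     # doubling f0 at 100 Hz
high_ratio = hz_to_mel(2000.0) / hz_to_mel(1000.0)  # doubling at 1000 Hz

print(f"100 -> 200 Hz:   pitch in mels grows by a factor of {low_ratio:.2f}")
print(f"1000 -> 2000 Hz: pitch in mels grows by a factor of {high_ratio:.2f}")
```

Under this approximation both factors stay below 2, and the one at the higher frequency is roughly 1.5. The mapping from frequency to perceived pitch is therefore compressive rather than linear.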


Perception of prosody (3)

• Research is relatively sparse
• The targets of our research will be:
– From acoustics to perception: to determine how one or several acoustic features contribute to the perceived naturalness.
– To improve the naturalness of synthesized speech in an effective way.


Perception Theories

• Masking
• Categorical perception
• Motor theory
• Analysis-by-synthesis
• Bottom-up versus top-down

Masking

• Frequency masking
– One sound cannot be perceived if another sound close in frequency has a high enough level.
• Temporal masking
– A sound cannot be perceived if it is too close in time to another sound.
– Pre-masking tends to last 5 ms; post-masking can last from 50 to 300 ms.

[Figure: timing diagrams of sounds A and B illustrating pre-masking (5 ms) and post-masking (50-300 ms).]
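The timing rule for temporal masking can be written as a small check. This is a deliberately crude sketch: it uses only the 5 ms and 50 to 300 ms figures from the slide, treats the masked sound as a single instant, and ignores level and frequency entirely. The function name `temporal_masking` and its parameters are assumptions made for illustration.

```python
PRE_MASKING_MS = 5.0              # from the slide: pre-masking lasts about 5 ms
POST_MASKING_RANGE_MS = (50.0, 300.0)  # from the slide: 50 to 300 ms

def temporal_masking(masker_start_ms, masker_end_ms, probe_ms,
                     post_window_ms=POST_MASKING_RANGE_MS[0]):
    """Classify a probe instant relative to a masker by timing alone.

    Returns "pre-masking", "simultaneous", "post-masking", or None when the
    probe falls outside every masking window.
    """
    if masker_start_ms <= probe_ms <= masker_end_ms:
        return "simultaneous"
    if probe_ms < masker_start_ms:
        gap = masker_start_ms - probe_ms
        return "pre-masking" if gap <= PRE_MASKING_MS else None
    gap = probe_ms - masker_end_ms
    return "post-masking" if gap <= post_window_ms else None

# Masker from 100 ms to 200 ms, probes at various instants.
print(temporal_masking(100.0, 200.0, 97.0))    # 3 ms before the masker
print(temporal_masking(100.0, 200.0, 90.0))    # 10 ms before the masker
print(temporal_masking(100.0, 200.0, 230.0))   # 30 ms after the masker
print(temporal_masking(100.0, 200.0, 280.0))   # 80 ms after, default window
print(temporal_masking(100.0, 200.0, 280.0, post_window_ms=300.0))
```

The default post-masking window is the conservative lower end of the range on the slide. Passing `post_window_ms=300.0` gives the most permissive reading.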

Categorical perception (1)

• Voice onset time (VOT) (Lisker and Abramson, 1964)
– Voiced versus voiceless: whether the vocal folds vibrate (e.g. /z/ and /s/).
– The difference between voiced and voiceless stop consonants (e.g. /b/ and /p/; /d/ and /t/; /g/ and /k/) is actually one of the relative timing of the onset of vocal fold vibration.
– This timing difference is referred to as voice onset time (VOT).


Categorical perception (2)

• Voice onset time (VOT)
– Voiced stop consonants have a relatively short VOT, whereas voiceless consonants have a longer VOT.

[Figure: VOT measure for a /b/ and VOT measure for a /p/.]

Categorical perception (3)

• VOT categories
– From production:
[Figure: VOT productions of a single normal adult speaker of American English for words beginning with /d/ and /t/.]
– From perception:
[Figure: identification functions of a single listener for a VOT continuum from /d/ to /t/ in approximately 11 ms steps. Each stimulus is presented 10 times in random order.]

Categorical perception (4)

• Categorical perception
– The insensitivity to differences within a category, combined with keen sensitivity to cross-category differences, is referred to as categorical perception.
– It is characteristic of certain speech sound distinctions, and it is generally not found for nonspeech sounds (Cutting, 1972).
– It represents one of the human perceptual mechanisms for coping rapidly with a tremendous amount of variation (ignoring nonessential variation within a category).
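The contrast between within-category and cross-category sensitivity can be sketched numerically. The model below is an assumption and not taken from the slides: identification is modelled as a logistic function of VOT with a hypothetical /d/-/t/ boundary at 30 ms, and discrimination is predicted from labeling alone using the simple rule 0.5 · (1 + (p1 − p2)²), a labeling-based prediction of the kind used for ABX tasks.

```python
import math

BOUNDARY_MS = 30.0  # assumed /d/-/t/ category boundary (illustrative)
SLOPE_MS = 3.0      # assumed steepness of the identification function

def prob_voiceless(vot_ms):
    """Probability of labeling a stimulus as /t/ (logistic identification)."""
    return 1.0 / (1.0 + math.exp(-(vot_ms - BOUNDARY_MS) / SLOPE_MS))

def predicted_discrimination(vot_a_ms, vot_b_ms):
    """Predicted proportion correct if listeners rely on category labels only."""
    diff = prob_voiceless(vot_a_ms) - prob_voiceless(vot_b_ms)
    return 0.5 * (1.0 + diff * diff)

# Two pairs with the same 10 ms physical difference.
within = predicted_discrimination(10.0, 20.0)   # both on the /d/ side
across = predicted_discrimination(25.0, 35.0)   # straddling the boundary

print(f"within-category pair: {within:.3f} (chance is 0.5)")
print(f"cross-category pair:  {across:.3f}")
```

With these assumed parameters the within-category pair is predicted to be discriminated at about chance, while the cross-category pair is discriminated well above chance, even though both pairs differ by the same 10 ms.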


Motor theory (1)

• Motor commands:
– The neural messages that the brain sends to set the articulators in motion to produce speech.
• Motivation:
– When a stop consonant is produced in various vowel contexts, the acoustic signal lacks invariance, so there must be constant motor commands to the articulators that produce the same consonant.


Motor theory (2)

• Original theory:
– “Though we cannot exclude the possibility that a purely auditory decoder exists, we find it more plausible to assume that speech is perceived by processes that are also involved in its production” (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967).


Motor theory (3)

• Weak version:
– Speech production offers important cues about speech perception which can be used by listeners.
• Strong version:
– Speech production forms the basis for speech perception.

Analysis-by-synthesis

Listeners are hypothesized to decode the acoustic signal by internally generating matching signals.

The signal that provides the best match is the one “perceived” by the listener.
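The generate-and-match loop described above can be sketched as a toy program. It is a minimal illustration under strong assumptions, not a model from the literature: each hypothesis is a vowel defined by two assumed formant frequencies, the internal "synthesizer" simply sums two sinusoids, and the match is scored by squared error on the waveform. The names `synthesize`, `mismatch`, `perceive`, and `HYPOTHESES` are hypothetical.

```python
import math
import random

SAMPLE_RATE = 8000   # Hz (assumed)
NUM_SAMPLES = 200    # 25 ms of signal

# Assumed (F1, F2) pairs in Hz for the candidate vowels.
HYPOTHESES = {"i": (270.0, 2290.0), "a": (730.0, 1090.0), "u": (300.0, 870.0)}

def synthesize(f1_hz, f2_hz):
    """Internal synthesizer: a crude two-sinusoid stand-in for a vowel."""
    return [
        math.sin(2 * math.pi * f1_hz * n / SAMPLE_RATE)
        + math.sin(2 * math.pi * f2_hz * n / SAMPLE_RATE)
        for n in range(NUM_SAMPLES)
    ]

def mismatch(signal_a, signal_b):
    """Sum of squared differences between two equal-length signals."""
    return sum((a - b) ** 2 for a, b in zip(signal_a, signal_b))

def perceive(observed):
    """Return the hypothesis whose synthesized signal best matches the input."""
    return min(
        HYPOTHESES,
        key=lambda label: mismatch(observed, synthesize(*HYPOTHESES[label])),
    )

# Observed input: an /i/-like signal corrupted by a little seeded noise.
rng = random.Random(0)
observed = [x + rng.gauss(0.0, 0.3) for x in synthesize(*HYPOTHESES["i"])]

perceived = perceive(observed)
print(f"perceived vowel: /{perceived}/")
```

Comparing raw waveforms only works here because the observed signal shares the phase of the internal candidates. A more realistic sketch would compare spectra or other features instead.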


Bottom-up versus top-down (1)

• Bottom-up:
– Use the acoustic information to discover what is being uttered.
• Top-down:
– Use linguistic information.


Bottom-up versus top-down (2)

• Bottom-up information is important at the beginning of an utterance, while top-down information becomes primary as more syllables of a sentence are uttered.
• The role of top-down information is supported by the fact that good organization and prosody speed up the understanding of speech.

[Figure labels: bottom-up, top-down]

Speech Perception versus Music Perception

• Physical difference in perception
• Categorical perception in speech; continuous perception in music
– We can discriminate about 1200 different pitches in music, but we can only absolutely identify about 7 (Liberman, 1967).
– For certain sound differences relevant to speech, listeners can only discriminate accurately about as many sounds as they can identify.

[Figure: two panels, labeled “For speech” and “For music”.]

Applications

• Speech recognition
• Speech synthesis
• Speaker recognition
• Hearing aids

Summary

• Speech perception
– Vowel, consonant, prosody
• Perception theories
– Masking, categorical perception, motor theory, analysis-by-synthesis, bottom-up and top-down
• Speech vs. music perception

Conclusions

• What is known about speech perception is very limited, especially for prosody perception.
• A better understanding of speech perception will greatly benefit speech technology.

