CUHK-EE-DSPSTL: An Introduction to Speech Perception. Ph.D. student: Li Yujia, Rain. Supervisor: Prof. Tan Lee. Jan. 28, 2005.

CUHK-EE-DSPSTL 1

An Introduction to Speech Perception

Ph.D. student: Li Yujia, Rain

Supervisor: Prof. Tan Lee

Jan. 28, 2005

CUHK-EE-DSPSTL 2

Contents

• Basic Knowledge
• Speech Perception
• Perception Theories
• Speech Perception versus Music Perception
• Applications

CUHK-EE-DSPSTL 3

Basic Knowledge

• Three levels of speech
• Segments vs. supra-segments
• Basic acoustic features
• Auditory components of human speech perception
• Basic methodology of perception research

CUHK-EE-DSPSTL 4

Three levels of speech

• Linguistic: define rules
• Acoustic: speech realization
• Perceptual: interpretation

[Diagram: the three levels link the speaker to the listener.]

CUHK-EE-DSPSTL 5

Segments vs. Supra-segments

• Segments (phonemes): vowel, consonant; associated with intelligibility
• Supra-segments (prosody): F0, duration, energy; associated with naturalness (stress, rhythm, intonation, emotion)

CUHK-EE-DSPSTL 6

Basic acoustic features

[Figure: a speech waveform and its spectrogram, with the pitch period 1/f0 and the formants labeled.]
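The 1/f0 label refers to the pitch period, the time after which a voiced waveform repeats. The sketch below converts a fundamental frequency into a period and checks that a toy periodic waveform repeats after that many samples. The function names, the 100 Hz f0, and the 16 kHz sampling rate are assumptions chosen for illustration, not values from the slides.

```python
import math


def pitch_period_ms(f0_hz):
    """Return the pitch period T0 = 1/f0 in milliseconds."""
    return 1000.0 / f0_hz


def pitch_period_samples(f0_hz, sample_rate_hz):
    """Return the pitch period in samples, rounded to the nearest sample."""
    return round(sample_rate_hz / f0_hz)


# Hypothetical values chosen for illustration only.
F0_HZ = 100.0
SAMPLE_RATE_HZ = 16000

period = pitch_period_samples(F0_HZ, SAMPLE_RATE_HZ)

# A toy voiced-like waveform: the fundamental plus two weaker harmonics.
waveform = [
    sum(math.sin(2 * math.pi * k * F0_HZ * n / SAMPLE_RATE_HZ) / k for k in (1, 2, 3))
    for n in range(3 * period)
]

# The waveform repeats (up to rounding error) every `period` samples.
repeats = all(
    abs(waveform[n] - waveform[n + period]) < 1e-9 for n in range(2 * period)
)
```

Real speech is only quasi-periodic, so this exact repetition holds for the synthetic signal but only approximately for a recorded vowel.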

CUHK-EE-DSPSTL 7

Auditory components of human speech perception

• The peripheral auditory organs (ear): signal processing
• The auditory nervous system (brain): interpretation

[Figure labels: semantic, prosody.]

CUHK-EE-DSPSTL 8

Basic methodology of perception research

• Stimuli: synthesized speech
• Testing: by human listening
• Results are affected by
– Intrinsic factors: attributes of the speech sounds
– Extrinsic factors: resulting from the experimental conditions

CUHK-EE-DSPSTL 9

Speech Perception

• Perception of vowels
• Perception of consonants
• Perception of prosody

CUHK-EE-DSPSTL 10

Perception of vowels (1)

• Vowel sounds are perceptually specified by their formant frequencies.

[Figure: spectrogram of an /i/ vowel with the first and second formants labeled.]

CUHK-EE-DSPSTL 11

Perception of vowels (2)

• Evidence
– From production: the vowel determines the tongue position, which shapes the vocal tract, which in turn sets the formant frequencies.
– From perception: in synthesized speech, changing the first two formants produces different vowel sounds.
– From physics: “There is some evidence that the human auditory nerve already reacts directly to formant frequencies.” (Delgutte, 1980)
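The idea that the first two formants specify a vowel can be sketched as a nearest-neighbor lookup in the (F1, F2) plane. This is a minimal illustration, not a model from the slides: the reference formant values are rough, textbook-style numbers assumed for the example, and the names `REFERENCE_FORMANTS` and `nearest_vowel` are hypothetical.

```python
import math

# Rough, illustrative (F1, F2) values in Hz for three vowels.
# These are assumptions for the sketch, not data from the slides.
REFERENCE_FORMANTS = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}


def nearest_vowel(f1_hz, f2_hz, references=REFERENCE_FORMANTS):
    """Return the reference vowel closest to (f1_hz, f2_hz) on a log-frequency scale."""
    def distance(ref):
        ref_f1, ref_f2 = ref
        return math.hypot(math.log(f1_hz / ref_f1), math.log(f2_hz / ref_f2))

    return min(references, key=lambda vowel: distance(references[vowel]))
```

For example, `nearest_vowel(300.0, 2200.0)` selects "i" because a low F1 combined with a high F2 lies closest to the /i/ reference. Real listeners also normalize for speaker differences, which this sketch ignores.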

CUHK-EE-DSPSTL 12

Perception of consonants (1)

• In perception, many consonants depend on the neighboring vowels; the perception of stop consonants depends largely on the rapidly changing formant transitions.

[Figure: schematic of the first two formant frequency patterns for a /di/ syllable, showing the transition and the steady state.]

CUHK-EE-DSPSTL 13

Perception of consonants (2)

[Figure: schematic representations of the first two formant frequency patterns for /d/ in front of different vowels.]

• Lack of acoustic invariance: nothing constant in the spectrographic representation (a visual representation of speech) explains the perception of a particular consonant.
• Locus theory: the second formant transitions all seem to point toward the same frequency, which is called the locus.

CUHK-EE-DSPSTL 14

Perception of consonants (3)

• What is the basic unit for speech perception?
– Because we cannot isolate stop consonants from vowels in perception, researchers began to think of speech as encoded (vowels and consonants are squeezed together), perhaps in syllable-sized units.
• Speech can be presented at a faster rate (30 phonemes per second) than other sounds and still remain intelligible.

CUHK-EE-DSPSTL 15

Perception of prosody (1)

• The perception of prosody has been described as dependent on the “melody of speech”: the fluctuations in pitch, rhythm, and stress (Monrad-Krohn, 1947).
• The related acoustic features are f0, duration, and energy (intensity).

CUHK-EE-DSPSTL 16

Perception of prosody (2)

• Perception of prosody is more complex
– The definition is relatively vague.
– The perception of prosody is nonlinear in the acoustic features (double f0 ≠ double pitch; double duration ≠ double stress).
– It is perceived over a long time span and in a relative sense (the degree of contrast between the values of the acoustic variables over a number of syllables).
– A perceived attribute of prosody may be related to several acoustic features (f0 is the most powerful cue to stress, followed by duration and energy).
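The point that doubling f0 does not double perceived pitch can be illustrated with a perceptual frequency scale. The sketch below uses one common approximation of the mel scale; this formula is an assumption brought in for illustration and does not appear in the slides, and the mel scale itself was derived from pure tones rather than from speech prosody.

```python
import math


def hz_to_mel(frequency_hz):
    """Map frequency in Hz to mels with a common approximation (an assumption here)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def pitch_ratio_when_doubling(frequency_hz):
    """Ratio of mel values when the physical frequency is doubled."""
    return hz_to_mel(2.0 * frequency_hz) / hz_to_mel(frequency_hz)


for f0 in (100.0, 200.0, 400.0):
    print(f"{f0:.0f} Hz -> {2 * f0:.0f} Hz: mel ratio {pitch_ratio_when_doubling(f0):.2f}")
```

Under this approximation the ratio stays below 2 for every starting frequency, and it shrinks as the frequency rises.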

CUHK-EE-DSPSTL 17

Perception of prosody (3)

• Research is relatively sparse
• The target of our research will be:
– From acoustics to perception: determine how one or several acoustic features contribute to the perceived naturalness.
– Improve the naturalness of synthesized speech in an effective way.

CUHK-EE-DSPSTL 18

Perception Theories

• Masking
• Categorical perception
• Motor theory
• Analysis-by-synthesis
• Bottom-up versus top-down

CUHK-EE-DSPSTL 19

Masking

• Frequency masking
– One sound cannot be perceived if another sound close in frequency has a high enough level.
• Temporal masking
– A sound cannot be perceived if it is too close in time to another sound.
– Pre-masking tends to last 5 ms; post-masking can last from 50 to 300 ms.

[Figure: pre-masking (5 ms) and post-masking (50 to 300 ms) between two sounds A and B.]
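The timing windows for temporal masking can be written as a simple check. This is a minimal sketch that models timing only: it ignores sound levels and frequencies, which also determine whether masking actually occurs. The function name and parameters are hypothetical, and the default post-masking window uses the lower end (50 ms) of the 50 to 300 ms range.

```python
def temporal_masking(probe_ms, masker_start_ms, masker_end_ms,
                     pre_window_ms=5.0, post_window_ms=50.0):
    """Classify a short probe sound by its timing relative to a masker.

    Returns "pre-masking", "simultaneous", "post-masking", or "none".
    Levels and frequencies are ignored; only timing is modeled.
    """
    if masker_start_ms - pre_window_ms <= probe_ms < masker_start_ms:
        return "pre-masking"
    if masker_start_ms <= probe_ms <= masker_end_ms:
        return "simultaneous"
    if masker_end_ms < probe_ms <= masker_end_ms + post_window_ms:
        return "post-masking"
    return "none"
```

With a masker from 100 ms to 200 ms, a probe at 97 ms falls in the pre-masking window, while a probe at 260 ms is outside the default post-masking window but inside it when `post_window_ms=300.0`.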

CUHK-EE-DSPSTL 20

Categorical perception (1)

• Voice onset time (VOT) (Lisker and Abramson, 1964)
– Voiced versus voiceless: whether the vocal folds vibrate (e.g. /z/ versus /s/).
– The difference between voiced and voiceless stop consonants (e.g. /b/ and /p/; /d/ and /t/; /g/ and /k/) is actually one of the relative timing of the onset of vocal fold vibration.
– This timing difference is referred to as voice onset time (VOT).

CUHK-EE-DSPSTL 21

Categorical perception (2)

• Voice onset time (VOT)
– Voiced stop consonants have a relatively short VOT, whereas voiceless stop consonants have a longer VOT.

[Figures: VOT measure for a /b/; VOT measure for a /p/.]

CUHK-EE-DSPSTL 22

Categorical perception (3)

• VOT categories
– From production:
[Figure: VOT productions of a single normal adult speaker of American English for words beginning with /d/ and /t/.]
– From perception:
[Figure: identification functions of a single listener for a VOT continuum from /d/ to /t/ in approximately 11 ms steps. Each stimulus was presented 10 times in random order.]
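An identification function of this kind is often summarized as a steep S-shaped curve along the VOT continuum. The sketch below simulates one with a logistic function; the 35 ms boundary and 3 ms slope are hypothetical values chosen for illustration, not measurements from the slides or from any listener.

```python
import math


def prob_voiceless(vot_ms, boundary_ms=35.0, slope_ms=3.0):
    """Probability of a /t/ response, modeled as a logistic function of VOT.

    boundary_ms and slope_ms are hypothetical values chosen for illustration.
    """
    return 1.0 / (1.0 + math.exp(-(vot_ms - boundary_ms) / slope_ms))


# A VOT continuum in 11 ms steps, as in the identification experiment.
continuum_ms = [11 * step for step in range(8)]  # 0, 11, ..., 77 ms
identification = [prob_voiceless(vot) for vot in continuum_ms]

# Change in labeling between neighboring stimuli: a simple stand-in for how
# discriminable each pair would be if listeners relied only on category labels.
label_change = [
    identification[i + 1] - identification[i] for i in range(len(continuum_ms) - 1)
]
sharpest = max(range(len(label_change)), key=label_change.__getitem__)
boundary_pair = (continuum_ms[sharpest], continuum_ms[sharpest + 1])
```

In this simulation the labeling changes most between the two stimuli that straddle the assumed boundary, while equally spaced pairs inside one category are labeled almost identically.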

CUHK-EE-DSPSTL 23

Categorical perception (4)

• Categorical perception
– The insensitivity to differences within a category, combined with keen sensitivity to cross-category differences, is referred to as categorical perception.
– It is characteristic of certain speech sound distinctions and is generally not found for nonspeech sounds (Cutting, 1972).
– It is one of the human perceptual mechanisms for coping rapidly with a tremendous amount of variation (nonessential variation within a category is ignored).

CUHK-EE-DSPSTL 24

Motor theory (1)

• Motor commands:
– The neural messages that the brain sends to set the articulators in motion to produce speech.
• Motivation:
– When a stop consonant is produced in various vowel contexts, there is no acoustic invariance, so there must be constant motor commands to the articulators that produce the same consonant.

CUHK-EE-DSPSTL 25

Motor theory (2)

• Original theory:
– “Though we cannot exclude the possibility that a purely auditory decoder exists, we find it more plausible to assume that speech is perceived by processes that are also involved in its production” (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967).

CUHK-EE-DSPSTL 26

Motor theory (3)

• Weak version:
– Speech production offers important cues about speech perception which can be used by listeners.
• Strong version:
– Speech production forms the basis for speech perception.

CUHK-EE-DSPSTL 27

Analysis-by-synthesis

Listeners are hypothesized to decode the acoustic signal by internally generating matching signals.

The signal that provides the best match is the one “perceived” by the listener.
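The generate-and-match loop of analysis-by-synthesis can be sketched with a toy example. Here a sine-wave generator stands in for the listener's internal synthesizer, and the best match is the hypothesis with the smallest squared error against the input. Every name, frequency, and noise level below is a hypothetical choice for illustration; the sketch makes no claim about how the auditory system implements the matching.

```python
import math
import random


def synthesize(frequency_hz, sample_rate_hz=8000, num_samples=400):
    """Toy internal synthesizer: a sine wave stands in for a candidate speech signal."""
    return [math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
            for n in range(num_samples)]


def squared_error(signal_a, signal_b):
    """Sum of squared sample differences between two equal-length signals."""
    return sum((a - b) ** 2 for a, b in zip(signal_a, signal_b))


def analysis_by_synthesis(observed, hypotheses):
    """Return the hypothesis whose synthesized signal best matches the input."""
    return min(hypotheses,
               key=lambda label: squared_error(observed, synthesize(hypotheses[label])))


# Hypothetical candidate "percepts", each tied to one synthesizer setting.
HYPOTHESES = {"low": 100.0, "mid": 120.0, "high": 140.0}

# Simulated input: the "mid" pattern plus a little noise (fixed seed).
rng = random.Random(0)
observed = [sample + rng.gauss(0.0, 0.1) for sample in synthesize(120.0)]

perceived = analysis_by_synthesis(observed, HYPOTHESES)
```

The noisy input is matched to "mid" because the internally generated 120 Hz signal leaves only the noise as residual error, whereas the other candidates differ from the input throughout.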

CUHK-EE-DSPSTL 28

Bottom-up versus top-down (1)

• Bottom-up:
– Use the acoustic information to discover what is being uttered.
• Top-down:
– Use linguistic information.

CUHK-EE-DSPSTL 29

Bottom-up versus top-down (2)

• Bottom-up information is important at the beginning of an utterance, while top-down information becomes primary as more syllables of a sentence are uttered.
• The role of top-down information is supported by the observation that good organization and prosody speed up the understanding of speech.

[Figure labels: bottom-up, top-down.]

CUHK-EE-DSPSTL 30

Speech Perception versus Music Perception

• Physical difference in perception
• Categorical perception in speech; continuous perception in music
– We can discriminate about 1200 different pitches in music, but we can absolutely identify only about 7 (Liberman, 1967).
– For certain sound differences relevant to speech, listeners can discriminate accurately only about as many sounds as they can identify.

[Figure panels: for speech; for music.]

CUHK-EE-DSPSTL 31

Applications

• Speech recognition
• Speech synthesis
• Speaker recognition
• Hearing aids

CUHK-EE-DSPSTL 32

Summary

• Speech perception
– Vowel, consonant, prosody
• Perception theories
– Masking, categorical perception, motor theory, analysis-by-synthesis, bottom-up and top-down
• Speech vs. music perception

CUHK-EE-DSPSTL 33

Conclusions

• What we know about speech perception is very limited, especially about prosody perception.
• A better understanding of speech perception will greatly help speech technology.
