
Optical Phonetics and Visual Perception of Lexical and Phrasal Stress in English

Patricia Keating, Marco Baroni, Sven Mattys, Rebecca Scarborough, Abeer Alwan, Edward T. Auer, Lynne E. Bernstein

Introduction

Phrasal (focal) stress can be perceived visually above chance, though intonation cannot (e.g. Bernstein et al. 1989).

Many studies have shown that stress is marked by longer, larger, and faster movements of jaw, lips, and tongue; sometimes by eyebrow movements; and acoustically mainly by f0 (pitch accents), lengthening, and loudness.

Jaw lowering and acoustic duration are known to correlate with auditory perception of stress, and eyebrow movement with visual perception.

Optical phonetics of stress

• Extents, durations, and velocities of movements of lips, chin, and eyebrows, and mouth opening, are all potentially visible to perceivers.

• Our production (optical) measures are position and movement measures of visible fleshpoints.

This study

• Production experiment: Do speakers show any consistent optical correlates of phrasal and lexical stresses?

• Perception experiment: Are there differences in the visual intelligibility of phrasal and lexical stress, and of the different speakers?

• Production-perception comparison: Which, if any, of the optical production correlates account for visual intelligibility?

Production methods: Lexical stress materials

• 4 minimal pairs

– DIScharge / disCHARGE

– DIScount / disCOUNT

– PERvert / perVERT

– SUBject / subJECT

• 4 non-minimal pairs

– DEbit / casSETTE

– INstance / conVINCE

– BUSiness / subMIT

– COUrage / gaZELLE

• Minimal pairs read as given, and also reiterantly

• Non-minimal pairs only reiterantly

• 2 reiterant syllables

– “buh” = [bʌ] / [bə]

– “fer” = [fɝ] / [fɚ]

– differ in mouth opening

• TOTAL 40 words
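The total of 40 words can be reconstructed from the list above. The following is a minimal sketch of that tally. The variable names are illustrative, and the breakdown (8 real words, plus every word read with each of the 2 reiterant syllables) is an inference from the bullets rather than something the slides spell out.

```python
# Hypothetical tally of the lexical stress materials; the word list is from the slide.
minimal_pairs = [
    ("DIScharge", "disCHARGE"),
    ("DIScount", "disCOUNT"),
    ("PERvert", "perVERT"),
    ("SUBject", "subJECT"),
]
non_minimal_pairs = [
    ("DEbit", "casSETTE"),
    ("INstance", "conVINCE"),
    ("BUSiness", "subMIT"),
    ("COUrage", "gaZELLE"),
]
reiterant_syllables = ["buh", "fer"]

minimal_words = [word for pair in minimal_pairs for word in pair]
non_minimal_words = [word for pair in non_minimal_pairs for word in pair]

# Minimal pairs are read as given AND reiterantly; non-minimal pairs only reiterantly.
real_word_items = len(minimal_words)
reiterant_minimal_items = len(minimal_words) * len(reiterant_syllables)
reiterant_non_minimal_items = len(non_minimal_words) * len(reiterant_syllables)

total_items = real_word_items + reiterant_minimal_items + reiterant_non_minimal_items
print(real_word_items, reiterant_minimal_items, reiterant_non_minimal_items, total_items)
```

Under these assumptions the three subsets contain 8, 16, and 16 items, which sums to the 40 words stated on the slide.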

Production methods: Phrasal stress materials

“So TOMMY gave Timmy a song from Debby.”

“So Tommy gave TIMMY a song from Debby.”

“So Tommy gave Timmy a song from DEBBY.”

“So Tommy gave Timmy a song from Debby.”

• narrow (contrast) accent on one name or “neutral” broad focus

• these 4 stress conditions x 6 combinations of names = 24 sentences

• sentences not read reiterantly
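The 24 sentences can be generated mechanically from the frame shown above. This sketch assumes that the "6 combinations of names" are the six orderings of the three names. That is an interpretation, not something the slides state, and the function name `build_sentences` is hypothetical.

```python
from itertools import permutations

NAMES = ("Tommy", "Timmy", "Debby")
# Index of the name carrying narrow focus; None stands for "neutral" broad focus.
FOCUS_CONDITIONS = (0, 1, 2, None)

def build_sentences():
    """Cross every name ordering (assumed: permutations) with every stress condition."""
    sentences = []
    for order in permutations(NAMES):
        for focus in FOCUS_CONDITIONS:
            # The accented name is written in capitals, as on the slide.
            words = [name.upper() if i == focus else name
                     for i, name in enumerate(order)]
            sentences.append("So {} gave {} a song from {}.".format(*words))
    return sentences

sentences = build_sentences()
print(len(sentences))
print(sentences[0])
```

With three names there are 6 orderings, and 6 x 4 stress conditions gives the 24 sentences reported.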

Production methods: Both stress contrasts involve nuclear accent

• Lexical stress items read in isolation

• Phrasal stress items read with narrow focus to show contrast and/or emphasis

Tonal transcription of the two item types:

– “…a song from TIMMY” (phrasal stress): H* L-L%

– “DIScount” (lexical stress): H* L-L%

Production methods: Speakers

• 3 male Californians differing in perceptually-determined visual intelligibility for segments

– low-medium = Sp-LO

– medium = Sp-MID

– high = Sp-HI

• VISUAL INTELLIGIBILITY SCORING:

– speakers video-recorded reading 320 (other) sentences

– 8 expert deaf lipreaders transcribed sentences, yielding % correct visual intelligibility scores

Production methods: Recording set-up and procedure

• Videorecording

– professional-quality

– teleprompter under camera

• DAT recording

• Facial motion using Qualisys™ system

– 120 Hz sampling rate

– 20 small passive retroreflectors

– three cameras

– infrared flash

– 3D position for each retroreflector

• Items blocked by stress location

• Two tokens of each item

Production methods: Facepoint marker locations and measurements

Figure: marker locations (eyebrow markers, head marker, lip markers, chin marker).

• Left eyebrow displacement

• Head displacement

• Interlip maximum distance

• Interlip opening displacement

• Interlip closing displacement

• Lower lip opening peak velocity

• Lower lip closing peak velocity

• Chin opening displacement

• Chin opening peak velocity

• Chin closing displacement

• Chin closing peak velocity
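The slides do not give the exact definitions behind these measures, so the following is only a sketch of how displacement and peak velocity might be derived from a marker track sampled at 120 Hz. The function names, the finite-difference velocity estimate, and the example track are all assumptions made for illustration.

```python
import math

SAMPLE_RATE_HZ = 120  # sampling rate given in the recording set-up

def interlip_distance(upper_lip, lower_lip):
    """Euclidean distance between two 3D marker tracks, sample by sample."""
    return [math.dist(u, l) for u, l in zip(upper_lip, lower_lip)]

def velocity(track, rate=SAMPLE_RATE_HZ):
    """First-difference velocity estimate (units per second)."""
    return [(b - a) * rate for a, b in zip(track, track[1:])]

def gesture_measures(track, rate=SAMPLE_RATE_HZ):
    """Assumed definitions for one open-close gesture (needs at least 2 samples)."""
    peak = max(range(len(track)), key=track.__getitem__)
    vel = velocity(track, rate)
    return {
        "maximum": track[peak],
        "opening_displacement": track[peak] - min(track[:peak + 1]),
        "closing_displacement": track[peak] - min(track[peak:]),
        "opening_peak_velocity": max(vel),
        "closing_peak_velocity": -min(vel),
    }

# Made-up opening values in mm for one syllable, not data from the study.
example_track = [10.0, 12.0, 16.0, 18.0, 15.0, 11.0]
measures = gesture_measures(example_track)
print(measures)
```

Applied to an interlip distance track, `maximum` would correspond to interlip maximum distance. The same function could be run on the vertical position of the chin or lower lip marker to obtain the corresponding displacement and velocity measures.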

Production methods: Data analysis

• Prosody of audio speech signals checked by two transcribers (some small differences found between prompted and produced stresses, but these differences generally do not affect analyses presented here)

• Here, only tokens used in perception study analyzed (1 of the 2 tokens of each item)

• Effects of stress on the 11 facepoint marker measurements tested by (factorial) ANOVAs

Production results: Overview

• Stress is well-marked by these measures

• Lexical vs. phrasal stress: more measures differ significantly, and the differences between stressed and unstressed are larger, with phrasal stress than with lexical

• Reiterant vs. nonreiterant words: both sets show stress effect

Production results: Significant differences due to Lexical stress

Figure: Interlip Opening Displacement, all reiterant words, syllable 1 vs. syllable 2.

• 5 of 11 measures distinguish stress: 3 opening gesture measures (e.g. Interlip opening displacement), Head displacement, and Interlip maximum distance

• Generally holds across speakers and real vs. reiterant

Production results: Significant differences due to Phrasal stress

• All 11 measures distinguish stress, e.g. Chin closing peak velocity

• Chin and eyebrow measures are more consistent across speakers

Figure: Chin Closing Peak Velocity for accented vs. unaccented names, by position (1st name, 2nd name, 3rd name).

Production results: Significant Head and Eyebrow movements

Stress in words

• Head moves, eyebrow not

Stress in phrases

• Head down (2 speakers)

• Eyebrow up

Figure: head movement and eyebrow movement during “So TIMMY gave Tommy a song from Debby”.

Production results: An aside on eyebrows and F0

• 40 sentences from the phrasal stress corpus

• F0 from audio, and right and left eyebrow positions, at 12 ms intervals

• Significant correlations between eyebrows and F0, but accounting for little variance (only 1-4%)
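The "variance accounted for" figures are squared correlation coefficients, so 1-4% of variance corresponds to correlations of only about 0.1 to 0.2 in magnitude. The sketch below shows the relation with a plain Pearson correlation. The function names are illustrative, and nothing here reproduces the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long series (e.g. eyebrow height and F0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def variance_explained(r):
    """Proportion of variance shared by two variables with correlation r."""
    return r * r

for r in (0.1, 0.2):
    print(f"r = {r:.1f} -> {100 * variance_explained(r):.0f}% of variance")
```

With many samples per sentence, correlations this small can still reach statistical significance, which is consistent with the slide reporting significant but weak relations.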

Perception methods

• 1 token of each item from production corpus (120 words, 72 sentences), each presented twice (384 total trials)

• 16 hearing perceivers (not screened for lipreading ability)

• Test video clip (no sound) on right monitor, clickable response choices on left monitor

• Lexical stress: Response choices were pairs of real words, even for reiterant items

• Sentences: Click on one name, or on “NoStress”
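The trial counts follow from the materials and the design. The sketch below reconstructs them, assuming that the 120 words and 72 sentences are the 40 words and 24 sentences from each of the 3 speakers. That reading is an inference from the numbers, and the constant and function names are illustrative.

```python
SPEAKERS = 3
WORDS_PER_SPEAKER = 40       # lexical stress items
SENTENCES_PER_SPEAKER = 24   # phrasal stress items
PRESENTATIONS = 2            # each clip presented twice
PERCEIVERS = 16

word_clips = SPEAKERS * WORDS_PER_SPEAKER
sentence_clips = SPEAKERS * SENTENCES_PER_SPEAKER
trials_per_perceiver = (word_clips + sentence_clips) * PRESENTATIONS

def pooled_responses(items_per_speaker):
    """Total responses for one item type, pooled over speakers, repeats, and perceivers."""
    return items_per_speaker * SPEAKERS * PRESENTATIONS * PERCEIVERS

print(word_clips, sentence_clips, trials_per_perceiver)
print(pooled_responses(24), pooled_responses(32), pooled_responses(8))
```

Under these assumptions each perceiver sees 384 trials. The pooled counts for sentences, reiterant words (32 per speaker), and non-reiterant words (8 per speaker) come out as 2304, 3072, and 768, which appear to match the N values reported with the overall results.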

Perception results: Overview

• Stress is perceived above chance

• Lexical vs. phrasal stress: phrasal stress is perceived better

• Reiterant vs. nonreiterant words: perceived equally well

Perception results: Overall results, all above chance

Figure: % correct for sentences (N=2304, chance 25%), reiterant words (N=3072, chance 50%), and non-reiterant words (N=768, chance 50%).

Perception results: Lexical vs. phrasal stress

Figure: individual subjects’ % correct, all lexical vs. phrasal.

Individual subjects’ % correct relative to levels that are significantly above chance: phrasal perceived better (significantly so by paired t-test)
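One way to obtain a "significantly above chance" level for an individual subject is a one-sided exact binomial test. The slides do not say which test was used, so the sketch below is an assumption, as are the alpha of 0.05, the function names, and the per-subject trial counts (240 word trials at 50% chance and 144 sentence trials at 25% chance, derived from the design).

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_correct_above_chance(n, p, alpha=0.05):
    """Smallest number correct whose upper-tail probability under chance is below alpha."""
    for k in range(n + 1):
        if upper_tail(n, k, p) < alpha:
            return k
    return None  # no score is significant with this few trials

# Assumed per-subject designs: (label, trials, chance level).
for label, n, p in (("words", 240, 0.5), ("sentences", 144, 0.25)):
    k = min_correct_above_chance(n, p)
    print(f"{label}: at least {k} of {n} correct ({100 * k / n:.1f}%)")
```

Because the chance level and the number of trials differ between the two tasks, the threshold differs too, which is why scores are best compared relative to these levels rather than as raw percentages.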

Perception results: Lexical stress

All lexical speech conditions equally-well perceived overall:

• Reiterant & non-reiterant

• buh & fer

• Minimal & non-minimal

Figure: % correct by word type (buh, fer, non-reit) for minimal pairs and non-minimal pairs.

Perception results: Speakers: lexical stress

• All speakers’ lexical stress perceived above chance (50%)

• Sp-LO perceived better on reiterant words

Figure: % correct by speaker (Sp-LO, Sp-MID, Sp-HI) for non-reiterant, reiterant minimal, and reiterant non-minimal words.

Perception results: Phrasal stress

• 3 focal positions perceived equally well, and correct above chance for almost every item

• Responses to Neutral condition at chance

Figure: % correct by position of stress in sentence (1, 2, 3, Neutral).

Perception results: Speakers: phrasal stress

• All speakers’ phrasal stress perceived above chance (25%)

• Sp-MID perceived less accurately

• Sp-LO best for Neutral condition (not shown here)

Figure: % correct by speaker (Sp-LO, Sp-MID, Sp-HI).

Production-perception comparisons: Speaker differences

• Prosodic intelligibility: Sp-LO highest for words, Neutral sentences; Sp-MID lowest for sentences

• Re production: Sp-LO shows larger lip differences than Sp-MID on sentences, and largest Chin closing displacement on words (but Sp-HI has largest head movement differences)

• Unrelated to segmental intelligibility: compare above with speakers’ names LO-MID-HI, which reflect their segmental intelligibility

Production-perception comparisons: Correlational analyses of sentences

• Tested relations between production measures and % correct perception of phrasal stresses

• 10 of 11 measures correlated significantly with perception, with chin measures accounting for the most variance (up to 40%)

• Only Interlip maximum distance (mouth opening) did not correlate with perception

Production-perception comparisons: Correlational analyses of sentences

• Partial correlations (controlling for contributions of various lip measures) show independent contributions to perception of

– Chin opening displacement (15% of variance)

– Chin peak opening velocity (11%)

– Lower lip peak opening velocity (11%)

• Closing gestures generally make no independent contributions to perception
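A partial correlation measures how strongly a production measure relates to perception once a covariate has been held constant. The sketch below shows the standard first-order formula for a single covariate. The analysis on the slide controlled for several lip measures at once, so this is a simplified illustration and not the procedure used. The function name and the example coefficients are hypothetical.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z partialled out (first-order formula)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Made-up coefficients: x = a chin measure, y = % correct, z = a lip measure.
r_partial = partial_correlation(r_xy=0.6, r_xz=0.5, r_yz=0.5)
print(f"partial r = {r_partial:.3f}, independent variance = {100 * r_partial**2:.1f}%")
```

Squaring the partial correlation gives the independent share of variance, so a contribution of 15% corresponds to a partial correlation of roughly 0.39 in magnitude.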

Summary

• Lexical and phrasal stress are visually perceived above chance

• Phrasal stress is marked by more and larger production differences, and perceived better

• Chin opening accounts for most variance in perception of phrasal stress

• Speakers’ visual intelligibility for prosody does not correspond to segmental