ALIGNMENT OF SPEECH TO HIGHLY IMPERFECT TEXT TRANSCRIPTIONS


Alexander Haubold and John R. Kender

Department of Computer Science, Columbia University

Problem

Unsophisticated Manual Transcription → Near Perfect Transcript

... prepared slides please. Ok, alright you'll have to bear with me until I get a little more used to this medium. I haven't taught in a 2 hour format for a while and generally speaking when I do teach I use the blackboard. They've requested because it's easier to make archiveable summaries of things if I write things on slides, because then they'll be digitized and placed online. What this means is that if you miss a lecture you don't even have to go to the engineering library ...

Inexpensive Automatic Transcription → Imperfect Transcripts

... prepared slides plea and a and nonblack to go with the unsettled of more use this medium haven't gotten a two-out, four while and generally speaking when they did teach a if blackboard they've requested because it's easier to make hot titles online of flames that by lightning funds lies the this and that will be digitized and placed on long a look this means is that in this lecture you don't even after go to the engineering library ...

[Figure: the words "… prepared slides plea and a and …" of the imperfect transcript, each with an unknown time stamp (?)]

• Missing Temporal Alignment
  • Words or higher-level structures not time-stamped
• Linear Fit (Speech Signal to Text) unsuitable
  • Does not consider pauses in speech
  • Does not adjust to various speeds of user speech

Need temporal alignment to index text from speech.
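The weakness of a linear fit listed above can be illustrated with a minimal sketch. The helper `linear_fit_times` is hypothetical (it is not from the poster), and the word timings, pause length and audio duration are invented for illustration only.

```python
def linear_fit_times(num_words, duration_sec):
    """Spread words evenly over the audio: word i is placed at i / num_words * duration."""
    return [i * duration_sec / num_words for i in range(num_words)]


# Hypothetical data: 10 words spoken at one word per second, with a
# 20-second pause after the fifth word (total audio length: 30 seconds).
true_times = [0, 1, 2, 3, 4, 25, 26, 27, 28, 29]

estimated = linear_fit_times(len(true_times), duration_sec=30)

# Signed error per word (estimated minus true time). The linear fit knows
# nothing about the pause, so words before it are placed too late and
# words after it too early.
errors = [est - true for est, true in zip(estimated, true_times)]
```

In this toy case the error grows to 8 seconds just before the pause and jumps to -10 seconds right after it, although every word is spoken at the same rate.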

Application

Multimedia Browser for Student Presentation Videos:

• Database: 5 years, >180 videos, >160 hrs, >1500 students

• Used for archival and reference by students and instructors

• ASR transcripts aligned for more accurate retrieval

• Filtered transcript text in yellow boxes

• More salient phrases highlighted in red

• Temporal occurrence preserved along horizontal timeline

• Text search results are highlighted in separate yellow box
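Once transcript words carry time stamps, text search can return positions on the timeline. The following is a minimal sketch of that idea only; `build_time_index` and `search` are hypothetical helpers, the sample words and times are invented, and nothing here describes how the actual browser is implemented.

```python
from collections import defaultdict


def build_time_index(aligned_words):
    """Map each lowercased word to the sorted times (seconds) at which it occurs."""
    index = defaultdict(list)
    for word, time_sec in aligned_words:
        index[word.lower()].append(time_sec)
    return {word: sorted(times) for word, times in index.items()}


def search(index, query):
    """Return the time stamps of a query word, or an empty list if it never occurs."""
    return index.get(query.lower(), [])


# Hypothetical aligned transcript: (word, time in seconds).
aligned = [("prepared", 12.0), ("slides", 12.6), ("blackboard", 47.3), ("Slides", 58.1)]
index = build_time_index(aligned)
hits = search(index, "slides")
```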

Results

[Figure, panels (A)-(D): alignment error (seconds, -100 to 100) versus length of audio (seconds)]

(A) Lecture, single speaker
• t = 1:48:21
• Manual transcription
• Avg. Matching Error = 3.9 sec

(B) Lecture, single speaker
• t = 1:48:21
• Automatic transcription
• Avg. Matching Error = 7.7 sec

(C) Student Presentation, 31 speakers
• t = 1:15:12
• Automatic transcription
• Avg. Matching Error = 6.4 sec

(D) Student Presentation, 10 speakers
• t = 0:22:32
• Automatic transcription
• Avg. Matching Error = 26.7 sec

Correct speech alignment for 1-100 second accuracy. More than 75% of correct alignment occurs within a 20 second error margin.

[Figure: % of speech aligned (10 to 100) versus matching accuracy (seconds, 0 to 100) for videos (A), (B), (C) and (D)]

Alignment results for 4 videos:

• Correct alignment within small error margins

• Longer pauses introduce larger errors (gray bars)

• Smaller error margin for manual transcript
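The two kinds of numbers reported above (an average matching error per video, and the share of speech aligned within a given margin) can be computed along these lines. This is a sketch under assumptions: the poster does not state the exact definitions, so the average is taken here as the mean absolute error, and the per-word errors are invented.

```python
def average_matching_error(errors_sec):
    """Mean absolute alignment error in seconds (assumed definition)."""
    return sum(abs(e) for e in errors_sec) / len(errors_sec)


def percent_within(errors_sec, margin_sec):
    """Percentage of items whose absolute error is at most margin_sec."""
    hits = sum(1 for e in errors_sec if abs(e) <= margin_sec)
    return 100.0 * hits / len(errors_sec)


def cumulative_curve(errors_sec, margins=range(1, 101)):
    """(margin, percent aligned) pairs for margins of 1 to 100 seconds."""
    return [(m, percent_within(errors_sec, m)) for m in margins]


# Hypothetical signed errors (seconds) for eight aligned words.
errors = [-2.0, 0.5, 3.0, -15.0, 40.0, 1.5, -8.0, 25.0]

avg_error = average_matching_error(errors)
within_20 = percent_within(errors, 20)
curve = cumulative_curve(errors)
```

The curve is non-decreasing by construction, which matches the shape expected of a plot of "% of speech aligned" against the allowed error margin.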

Approach

Observation:

• Vowels and fricatives are the most accurate among automatic transcriptions

• Align speech and text on easily detectable phonemes

Use the Edit Distance dynamic programming algorithm to align long sequences of phonemes (60 min ≈ 15,000 text phonemes, ≈ 45,000 speech phonemes*).
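As a minimal sketch of the edit-distance alignment, the code below fills the standard dynamic programming table and then backtraces to recover which text phoneme matches which speech phoneme. Unit costs for insertion, deletion and substitution are an assumption (the poster does not give costs), and `align_phonemes`, the sample sequences and the time stamps are hypothetical. At the sizes quoted above the full table would have 15,000 × 45,000 = 675,000,000 cells, so a practical implementation would likely need a more compact representation than the nested lists used here.

```python
def align_phonemes(text_ph, speech_ph):
    """Return (edit distance, list of (text index, speech index) exact matches)."""
    n, m = len(text_ph), len(speech_ph)
    # dist[i][j] = edit distance between text_ph[:i] and speech_ph[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if text_ph[i - 1] == speech_ph[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # match or substitute
                             dist[i - 1][j] + 1,        # text phoneme unmatched
                             dist[i][j - 1] + 1)        # extra speech phoneme
    # Backtrace along one optimal path, collecting exact matches.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        sub = 0 if text_ph[i - 1] == speech_ph[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + sub:
            if sub == 0:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


# Hypothetical filtered sequences; speech phonemes carry detection times.
text_ph = ["IY", "S", "AH", "SH"]
speech_ph = ["IY", "IY", "S", "AH", "AH", "SH"]
speech_times = [0.0, 0.1, 0.9, 1.4, 1.5, 2.2]

distance, pairs = align_phonemes(text_ph, speech_ph)
# Each matched text phoneme inherits the time of its speech phoneme.
text_times = {i: speech_times[j] for i, j in pairs}
```

Matched pairs come out in increasing order of both indices, so the inherited time stamps never run backwards along the transcript.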

*Phoneme and Phoneme Substitution Table

Phonemes:
Monophthongs: IY (beet), IH (bit), EH (bet), AE (bat), AH (above), UW (boot), UH (book), AA (father), ER (bird), AO (bought)
Fricatives: SH (assure), S (sign)

Phoneme substitutions:
Diphthongs: AW (out) → AH, AY (five) → AH, EY (day) → AE, OW (crow) → UH, OY (boy) → AO
Fricatives: Z (resign) → S
Affricates: CH (church) → SH
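The table can be applied as a simple filter: substitute first, then keep only phonemes in the retained set. This is a sketch, not the authors' code. `filter_phonemes` and `merge_runs` are hypothetical names, and reading the diagram's "merge sequences" step as collapsing consecutive repeats of the same phoneme is an assumption.

```python
# Phonemes kept for alignment, as listed in the table.
KEPT = {"IY", "IH", "EH", "AE", "AH", "UW", "UH", "AA", "ER", "AO", "SH", "S"}

# Substitutions from the table: each phoneme is mapped onto a kept phoneme.
SUBSTITUTIONS = {
    "AW": "AH", "AY": "AH", "EY": "AE", "OW": "UH", "OY": "AO",
    "Z": "S",
    "CH": "SH",
}


def filter_phonemes(phonemes):
    """Apply the substitutions, then drop every phoneme outside the kept set."""
    filtered = []
    for phoneme in phonemes:
        phoneme = SUBSTITUTIONS.get(phoneme, phoneme)
        if phoneme in KEPT:
            filtered.append(phoneme)
    return filtered


def merge_runs(phonemes):
    """Collapse consecutive repeats (assumed meaning of 'merge sequences')."""
    merged = []
    for phoneme in phonemes:
        if not merged or merged[-1] != phoneme:
            merged.append(phoneme)
    return merged


# "resign" as R IH Z AY N: the consonants R and N are dropped, Z and AY are mapped.
resign = filter_phonemes(["R", "IH", "Z", "AY", "N"])
church = filter_phonemes(["CH", "ER", "CH"])
```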

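[Diagram: processing pipeline]

• Transcript → Text Phonemes, using the CMU Pronouncing Dictionary (>125,000 words with phonemes)
• Audio → Speech Phonemes, via Phoneme Detection: Vowels (Auto Regressive Model); Fricatives, Affricates (Spectrogram Energy Distribution)
• Text Phonemes and Speech Phonemes → keep subset*, merge sequences → Filtered Text Phonemes (T.P.) and Filtered Speech Phonemes (S.P.)
• Filtered T.P. and Filtered S.P. → Edit Distance → Text-Speech Alignment with smallest Edit Distance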
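Converting transcript words to phonemes with the dictionary might look like the sketch below. It assumes the dictionary's plain-text line format (a word followed by phonemes, stress digits on vowels, `;;;` comment lines, alternate pronunciations marked like `WORD(1)`). The embedded sample entries are for illustration and should be checked against the real dictionary file. `parse_cmudict` and `transcript_to_phonemes` are hypothetical helpers, and skipping unknown words is an assumption.

```python
SAMPLE_DICT = """\
;;; sample entries written in the dictionary's line format
BEET  B IY1 T
ABOVE  AH0 B AH1 V
SIGN  S AY1 N
CHURCH  CH ER1 CH
"""


def parse_cmudict(text):
    """Map lowercased words to phoneme lists with stress digits removed."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        word = word.split("(")[0].lower()  # fold WORD(1) variants onto WORD
        # Keep only the first pronunciation seen for each word.
        entries.setdefault(word, [p.rstrip("012") for p in phones])
    return entries


def transcript_to_phonemes(words, entries):
    """Concatenate the phonemes of each known word; unknown words are skipped."""
    phonemes = []
    for word in words:
        phonemes.extend(entries.get(word.lower(), []))
    return phonemes


entries = parse_cmudict(SAMPLE_DICT)
text_phonemes = transcript_to_phonemes(["Sign", "above", "nonblack"], entries)
```

Skipping out-of-vocabulary words simply leaves a gap in the text phoneme sequence, which an edit-distance alignment can absorb as unmatched speech phonemes.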