ALIGNMENT OF SPEECH TO HIGHLY IMPERFECT TEXT TRANSCRIPTIONS
Alexander Haubold and John R. Kender
Department of Computer Science, Columbia University
Problem
Approach
Results Application
Unsophisticated Manual Transcription
... prepared slides please. Ok, alright you'll
have to bear with me until I get a little more
used to this medium. I haven't taught in a 2
hour format for a while and generally speaking
when I do teach I use the blackboard. They've
requested because it's easier to make
archiveable summaries of things if I write
things on slides, because then they'll be
digitized and placed online. What this means is
that if you miss a lecture you don't even have
to go to the engineering library ...
Near Perfect Transcript
Inexpensive Automatic Transcription
... prepared slides plea and a and nonblack to
go with the unsettled of more use this medium
haven't gotten a two-out, four while and
generally speaking when they did teach a if
blackboard they've requested because it's
easier to make hot titles online of flames that
by lightning funds lies the this and that will
be digitized and placed on long a look this
means is that in this lecture you don't even
after go to the engineering library ...
Imperfect Transcripts
… prepared slides plea and a and …
? ? ? ? ? ?
• Missing Temporal Alignment
• Words or higher-level structures not time-stamped
• Linear Fit (Speech Signal to Text) unsuitable
• Does not consider pauses in Speech
• Does not adjust to various speeds of user speech
Need temporal alignment to index text from speech.
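To illustrate why a linear fit is unsuitable, here is a minimal sketch that spreads words evenly over the audio and compares the result with the true word times. The word list, the timings, and the function name `linear_fit_times` are hypothetical and chosen only for illustration; they are not data from the poster.

```python
def linear_fit_times(words, audio_seconds):
    """Assign each word a time by spreading the words evenly over the audio."""
    n = len(words)
    return [(word, audio_seconds * i / n) for i, word in enumerate(words)]


# Hypothetical example: the speaker pauses for about 20 seconds
# after the second word, so the true times are far from evenly spaced.
words = ["prepared", "slides", "please", "ok"]
true_times = [0.0, 1.0, 22.0, 23.0]
audio_seconds = 24.0

estimates = linear_fit_times(words, audio_seconds)

# Signed error per word: estimated time minus true time.
errors = [est - true for (_, est), true in zip(estimates, true_times)]

for (word, est), err in zip(estimates, errors):
    print(f"{word:10s} estimated {est:5.1f} s, error {err:+6.1f} s")
```

A single long pause shifts every estimate around it, which is the behavior the bullets above describe: the linear fit ignores pauses and changes in speaking rate.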
Multimedia Browser for Student Presentation Videos:
• Database: 5 years, >180 videos, >160 hrs, >1500 students
• Used for archival and reference by students and instructors
• ASR transcripts aligned for more accurate retrieval
• Filtered transcript text in yellow boxes
• More salient phrases highlighted in red
• Temporal occurrence preserved along horizontal timeline
• Text search results are highlighted in separate yellow box
(A) Lecture, single speaker
• t = 1:48:21
• Manual transcription
• Avg. Matching Error = 3.9 sec
[Plot: alignment error (seconds, -100 to 100) vs. length of audio (seconds)]
(B) Lecture, single speaker
• t = 1:48:21
• Automatic transcription
• Avg. Matching Error = 7.7 sec
[Plot: alignment error (seconds, -100 to 100) vs. length of audio (seconds)]
(C) Student Presentation, 31 speakers
• t = 1:15:12
• Automatic transcription
• Avg. Matching Error = 6.4 sec
[Plot: alignment error (seconds, -100 to 100) vs. length of audio (seconds)]
(D) Student Presentation, 10 speakers
• t = 0:22:32
• Automatic transcription
• Avg. Matching Error = 26.7 sec
[Plot: alignment error (seconds, -100 to 100) vs. length of audio (seconds)]
Correct speech alignment for 1-100 second accuracy. More than 75% of correct alignment occurs within a 20-second error margin.
[Plot: % of speech aligned (10 to 100) vs. matching accuracy (seconds, 0 to 100), one curve each for videos (A), (B), (C), (D)]
Alignment results for 4 videos:
• Correct alignment within small error margins
• Longer pauses introduce larger errors (gray bars)
• Smaller error margin for manual transcript
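The two summary figures reported for each video can be computed from per-word alignment errors as sketched below. This assumes that "Avg. Matching Error" means the mean absolute error and that "aligned within a margin" means an absolute error no larger than the margin; the poster does not spell out either definition. The error values are hypothetical, not the poster's data.

```python
def average_matching_error(errors):
    """Mean absolute alignment error in seconds (assumed definition)."""
    return sum(abs(e) for e in errors) / len(errors)


def percent_within(errors, margin):
    """Percentage of alignments whose absolute error is at most `margin` seconds."""
    hits = sum(1 for e in errors if abs(e) <= margin)
    return 100.0 * hits / len(errors)


# Hypothetical signed errors (aligned time minus true time), in seconds.
errors = [-2.0, 5.0, 0.0, 30.0, -15.0, 8.0, 1.0, -40.0]

avg_error = average_matching_error(errors)
within_20 = percent_within(errors, 20)

print(f"average matching error: {avg_error:.3f} s")
print(f"aligned within 20 s:    {within_20:.1f} %")
```

Evaluating `percent_within` for margins from 1 to 100 seconds would give one cumulative curve of the kind plotted for videos (A) to (D).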
Observation:
• Vowels and fricatives are the most accurately recognized phonemes in automatic transcriptions
• Alignment of Speech and Text on easily detectable phonemes
Use the Edit Distance dynamic programming algorithm to align long sequences of phonemes (60 min ~ 15,000 text phonemes, ~45,000 speech phonemes *)
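A minimal sketch of edit-distance alignment between two phoneme sequences follows. It assumes unit costs for insertion, deletion, and substitution, which the poster does not specify, and the function name `align` and the short sequences are hypothetical.

```python
def align(text_ph, speech_ph):
    """Edit-distance alignment of two phoneme sequences.

    Returns (distance, pairs), where pairs lists (text_index, speech_index)
    for every match or substitution on one optimal path.
    Unit costs are an assumption made for this sketch.
    """
    n, m = len(text_ph), len(speech_ph)
    # dist[i][j] = edit distance between text_ph[:i] and speech_ph[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if text_ph[i - 1] == speech_ph[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # text phoneme has no speech partner
                dist[i][j - 1] + 1,         # extra speech phoneme
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Trace back from the bottom-right corner to recover aligned index pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if text_ph[i - 1] == speech_ph[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


# Hypothetical filtered sequences: the speech side has two spurious detections.
text_ph = ["IY", "AH", "S", "IY"]
speech_ph = ["IY", "UH", "AH", "S", "AE", "IY"]

distance, pairs = align(text_ph, speech_ph)
print(distance, pairs)
```

Each aligned pair links a text phoneme to a detected speech phoneme, so the time of the speech phoneme can be carried over to the text. At the sizes quoted above, a full table would hold about 15,000 × 45,000 = 675 million cells, so a practical implementation would likely need a banded or otherwise memory-reduced variant; that is an observation about this sketch, not a description of the authors' implementation.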
* Phoneme and Phoneme Substitution Table
Phonemes kept:
Monophthongs: IY (beet), IH (bit), EH (bet), AE (bat), AH (above), UW (boot), UH (book), AA (father), ER (bird), AO (bought)
Fricatives: SH (assure), S (sign)
Phonemes substituted:
Diphthongs: AW (out) → AH, AY (five) → AH, EY (day) → AE, OW (crow) → UH, OY (boy) → AO
Fricatives: Z (resign) → S
Affricates: CH (church) → SH
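The table can be applied as a simple filter, sketched below: substituted phonemes are mapped to their targets, and anything outside the kept set is dropped. The names `KEEP`, `SUBSTITUTE`, and `filter_phonemes` are hypothetical, and the "merge sequences" step named in the pipeline diagram is not modeled because the poster does not define it further.

```python
# Phonemes retained for alignment, taken from the table above.
KEEP = {
    "IY", "IH", "EH", "AE", "AH", "UW", "UH", "AA", "ER", "AO",  # monophthongs
    "SH", "S",                                                   # fricatives
}

# Substitutions from the table above: source phoneme -> kept phoneme.
SUBSTITUTE = {
    "AW": "AH", "AY": "AH", "EY": "AE", "OW": "UH", "OY": "AO",  # diphthongs
    "Z": "S",                                                    # fricative
    "CH": "SH",                                                  # affricate
}


def filter_phonemes(phonemes):
    """Apply the substitution table, then keep only the retained subset."""
    filtered = []
    for phoneme in phonemes:
        phoneme = SUBSTITUTE.get(phoneme, phoneme)
        if phoneme in KEEP:
            filtered.append(phoneme)
    return filtered


# "slides" as an ARPAbet sequence without stress markers.
slides = ["S", "L", "AY", "D", "Z"]
filtered_slides = filter_phonemes(slides)
print(filtered_slides)
```

Running the same filter on both the text phonemes and the detected speech phonemes puts the two sequences in the same reduced alphabet before they are compared.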
Processing pipeline:
• Transcript → Text Phonemes (CMU Pronouncing Dictionary: >125,000 words with phonemes) → Filtered T.P. (keep subset *, merge sequences)
• Audio → Speech Phonemes (Phoneme Detection: vowels by Auto Regressive Model; fricatives and affricates by Spectrogram Energy Distribution) → Filtered S.P. (keep subset *, merge sequences)
• Filtered T.P. + Filtered S.P. → Edit Distance → Text-Speech Alignment with smallest Edit Distance
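The transcript-to-phoneme step can be sketched as a dictionary lookup. The three entries below follow the plain-text layout of the CMU Pronouncing Dictionary (word, then ARPAbet phonemes with stress digits on vowels), but they are typed in here for illustration and should be checked against the real dictionary file. The function names and the choice to skip unknown words are assumptions of this sketch.

```python
# Illustrative entries in CMU-dictionary style; verify against the real file.
SAMPLE_ENTRIES = """\
PREPARED  P R IY0 P EH1 R D
SLIDES  S L AY1 D Z
PLEASE  P L IY1 Z
"""


def parse_entries(text):
    """Build a word -> phoneme-list lexicon, stripping stress digits."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):  # blank or comment line
            continue
        word, *phones = line.split()
        lexicon[word.lower()] = [p.rstrip("012") for p in phones]
    return lexicon


def text_to_phonemes(transcript, lexicon):
    """Concatenate the phonemes of every known word; unknown words are skipped."""
    phonemes = []
    for token in transcript.lower().split():
        word = token.strip('.,?!"')
        phonemes.extend(lexicon.get(word, []))
    return phonemes


lexicon = parse_entries(SAMPLE_ENTRIES)
text_phonemes = text_to_phonemes("Prepared slides please.", lexicon)
print(text_phonemes)
```

Skipping out-of-vocabulary words suits imperfect transcripts, since misrecognized tokens may not be dictionary words at all; whether the authors handled them this way is not stated.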