Institute of Information Science, Academia Sinica
12 July, 2011 @ IIS, Academia Sinica
Automatic Detection-based Phone Recognition on TIMIT
Hung-Shin Lee (李鴻欣)
Based on Chen and Wang in ISCSLP’08 and Interspeech’09
Page-2
Detection-Based ASR
• Human SR: knowledge detection → knowledge integration → knowledge (higher level)
• DB ASR: detectors → integrator → results
  – Detectors: phonological attr., prosodic attr., acoustic attr., …
  – Integrator: HMM, CRF, …
  – Results: phone, syllable, word, sentence, semantic info, …
Page-3
Phonological Systems
• SPE (Sound Pattern of English)
  – Literature: N. Chomsky & M. Halle, 1968
  – Feature type: production-based, binary
  – Number of features: 13
  – Examples: anterior, nasal, round
• MV (Multi-valued Feature)
  – Literature: S. King, 2000 (?)
  – Feature type: production-based, 2-10 values
  – Number of features: 6
  – Examples: centrality, front-back, manner, phonation, place, roundness
• GP (Government Phonology)
  – Literature: J. Harris, 1994
  – Feature type: sound structure primes, binary
  – Number of features: 11
Page-4
Phonological Feature Detection (1)
[Figure: MLP detectors. Input layer: 13 MFCCs × 9 frames (i-4 … i … i+4); hidden layer; output posterior probabilities, quantized into binary vectors SPE_14 (0101...01) and GP_11 (011..01). MLP variants labelled in the figure: recurrent, time-delay.]
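The detector input on this slide (13 MFCCs over the 9 frames i-4 to i+4) and the quantization of the output posteriors can be sketched as follows. This is a minimal illustration, not the Nico Toolkit setup used in the talk: the edge handling (repeating the first and last frame) and the 0.5 threshold are assumptions, and `stack_context` and `quantize` are hypothetical helper names.

```python
def stack_context(frames, left=4, right=4):
    """Concatenate each frame with its `left` and `right` neighbours."""
    n_frames = len(frames)
    stacked = []
    for i in range(n_frames):
        window = []
        for offset in range(-left, right + 1):
            # Assumption: clamp at utterance edges by repeating the
            # first/last frame; the slide does not specify edge handling.
            j = min(max(i + offset, 0), n_frames - 1)
            window.extend(frames[j])
        stacked.append(window)
    return stacked


def quantize(posteriors, threshold=0.5):
    """Turn detector posteriors into binary features (threshold is assumed)."""
    return [1 if p >= threshold else 0 for p in posteriors]


# Toy utterance: 20 frames of 13 "MFCCs", each frame filled with its index.
frames = [[float(t)] * 13 for t in range(20)]
stacked = stack_context(frames)
print(len(stacked[0]))            # 13 coefficients x 9 frames
print(quantize([0.2, 0.7, 0.5]))
```

With 13 coefficients and a 9-frame window, each MLP input vector has 117 dimensions; the centre frame occupies positions 52 to 64 of the stacked vector.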
Page-5
Phonological Feature Detection (2)
[Figure: one time-delay MLP per MV feature (6 MV features; Centrality, Front-Back and Roundness are shown). Input: 13 MFCCs × 9 frames (i-4 … i … i+4). The example outputs 0100, 100 and 010 are concatenated into the vector MV_29 (0100100.........010).]
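The concatenation of the per-feature MLP outputs into one MV vector can be sketched as below. The sketch assumes that each multi-valued detector is quantized winner-take-all (a one-hot code for the most probable value); the slide only shows the resulting bit patterns. Only the three features named on the slide are included, with the value counts implied by the patterns 0100, 100 and 010; the posterior values are made up.

```python
def one_hot_argmax(posteriors):
    """One-hot code for the most probable value of one multi-valued feature."""
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return [1 if k == best else 0 for k in range(len(posteriors))]


def mv_vector(per_feature_posteriors):
    """Concatenate the one-hot codes of all features for one frame."""
    bits = []
    for posteriors in per_feature_posteriors:
        bits.extend(one_hot_argmax(posteriors))
    return bits


# Hypothetical posteriors for one frame.
frame = [
    [0.1, 0.7, 0.1, 0.1],  # centrality (4 values)
    [0.8, 0.1, 0.1],       # front-back (3 values)
    [0.2, 0.6, 0.2],       # roundness (3 values)
]
code = "".join(str(b) for b in mv_vector(frame))
print(code)
```

Running all six MV detectors this way would give the 29-bit vector the slide calls MV_29.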
Page-6
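Conditional Random Field (CRF) Integrator
• General Chain CRF

\[
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{i} \sum_{j} \lambda_j \, s_j(y_i, \mathbf{x}, i) + \sum_{i} \sum_{k} \mu_k \, t_k(y_{i-1}, y_i, \mathbf{x}, i) \Big)
\]

  – s_j: state feature function
  – t_k: transition feature function
  – λ_j, μ_k: feature function weight parameters
[Figure: chain CRF. Output Y (phone): y_{i-1}, y_i. Input X (phonological features): x_{i-1}, x_i, x_{i+1}. Edges labelled j and k.]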
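To make the chain CRF on this slide concrete, here is a toy sketch that evaluates p(y | x) with state features s_j weighted by λ_j and transition features t_k weighted by μ_k. It computes Z(x) by enumerating every label sequence, which only works for tiny inputs; real toolkits such as CRF++ use dynamic programming instead. The labels, the single "nasal" input bit, the feature functions and the weights are all hypothetical.

```python
import itertools
import math


def path_score(y, x, state_feats, trans_feats, lam, mu):
    """Exponent of the chain CRF: weighted features summed over positions."""
    total = 0.0
    for i in range(len(y)):
        # State features s_j(y_i, x, i), weighted by lambda_j.
        total += sum(l * s(y[i], x, i) for l, s in zip(lam, state_feats))
        if i > 0:
            # Transition features t_k(y_{i-1}, y_i, x, i), weighted by mu_k.
            total += sum(m * t(y[i - 1], y[i], x, i)
                         for m, t in zip(mu, trans_feats))
    return total


def crf_prob(y, x, labels, state_feats, trans_feats, lam, mu):
    """p(y | x), with Z(x) obtained by brute-force enumeration."""
    def weight(seq):
        return math.exp(path_score(seq, x, state_feats, trans_feats, lam, mu))
    z = sum(weight(seq) for seq in itertools.product(labels, repeat=len(x)))
    return weight(tuple(y)) / z


# Toy problem (all values hypothetical): x[i] is one "nasal" bit per frame.
labels = ("m", "k")
state_feats = [lambda y_i, x, i: 1.0 if y_i == "m" and x[i] == 1 else 0.0]
trans_feats = [lambda y_prev, y_i, x, i: 1.0 if y_prev == y_i else 0.0]
lam, mu = [2.0], [0.5]
x = [1, 1, 0]

p = crf_prob(("m", "m", "k"), x, labels, state_feats, trans_feats, lam, mu)
print(p)
```

The state feature rewards labelling a nasal frame as "m", and the transition feature rewards staying in the same phone, so sequences that follow the input bits and change label rarely receive more probability mass.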
Page-7
CRF Integrator – Training Issues
• Required labels for CRF training
  – Phone: y
  – Phonological features: x
• Detected-data trained CRF (DT CRF)
  – Training data: speech → MLP detectors → phonological features (with errors)
  – The detected features and the phone labels train the DT CRF
• Oracle-data trained CRF (OT CRF)
  – Training data: phone labels → mapping (phones → phonological features) → phonological features
  – The mapped features and the phone labels train the OT CRF
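The difference between the two training sets on this slide can be sketched as below: oracle features come from a phone-to-feature mapping table, while detected features come from the MLP detectors and may contain errors. The three-phone table is a toy illustration with three SPE-style features, not the SPE_14 table used in the talk, and `oracle_features` and `training_pairs` are hypothetical helper names.

```python
# Toy phone -> feature table (illustrative; not the table used in the talk).
PHONE_TO_FEATURES = {
    "m":  {"anterior": 1, "nasal": 1, "round": 0},
    "k":  {"anterior": 0, "nasal": 0, "round": 0},
    "uw": {"anterior": 0, "nasal": 0, "round": 1},
}
FEATURE_ORDER = ["anterior", "nasal", "round"]


def oracle_features(phone_labels):
    """Map each frame's phone label to its error-free feature vector."""
    return [[PHONE_TO_FEATURES[p][f] for f in FEATURE_ORDER]
            for p in phone_labels]


def training_pairs(phone_labels, detected=None):
    """Pair x (features) with y (phones).

    With `detected=None` the pairs are oracle data (OT CRF); otherwise the
    detector outputs are used as x (DT CRF).
    """
    xs = oracle_features(phone_labels) if detected is None else detected
    if len(xs) != len(phone_labels):
        raise ValueError("features and phone labels must have equal length")
    return list(zip(xs, phone_labels))


phones = ["m", "m", "uw", "k"]
ot_data = training_pairs(phones)
# Hypothetical detector output with one error in the second frame.
dt_data = training_pairs(phones, detected=[[1, 1, 0], [1, 0, 0],
                                           [0, 0, 1], [0, 0, 0]])
print(ot_data[1], dt_data[1])
```

In the oracle set a feature vector determines the phone almost deterministically, whereas the detected set exposes the CRF to the kind of errors it will see at test time.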
Page-8
Experiments
• Corpus: TIMIT
  – No SA1, SA2
  – Training set (3296 utts), Dev set (400 utts)
  – Test set (1344 utts)
• Phone set: TIMIT61
  – Evaluation: CMU/MIT 39
• Baseline
  – CI-HMM
• Toolkits
  – Nico Toolkit (for MLP), CRF++ (for CRF)
Page-9
Results (1)

Model: OT CRF; Test: OD features
System            Phone Corr. (%)   Phone Acc. (%)
SPE14             93.28             93.20
GP11              98.39             98.36
MV29              88.75             88.56

Model: OT/DT CRF; Test: DD features
System            Phone Corr. (%)   Phone Acc. (%)
HMM-baseline      69.02             63.45
OT CRF, SPE14     66.19             29.68
OT CRF, GP11      69.03             31.38
OT CRF, MV29      59.24             30.33
DT CRF, SPE14     56.56             55.27
DT CRF, GP11      55.74             54.53
DT CRF, MV29      51.84             50.68
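The two columns in these tables can be computed as sketched below, assuming the usual HTK-style definitions (the slides do not state them): Corr = (N - S - D) / N and Acc = (N - S - D - I) / N, where N is the number of reference phones and S, D, I are the substitutions, deletions and insertions of a minimum-edit alignment. The function names and the example sequences are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    # Each cell holds (total_errors, S, D, I) for a prefix pair.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            c, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, n)              # match
            else:
                diag = (c + 1, s + 1, d, n)      # substitution
            c, s, d, n = prev[j]
            dele = (c + 1, s, d + 1, n)          # deletion
            c, s, d, n = cur[j - 1]
            ins = (c + 1, s, d, n + 1)           # insertion
            cur.append(min(diag, dele, ins))
        prev = cur
    _, s, d, n = prev[-1]
    return s, d, n


def corr_acc(ref, hyp):
    """Phone correct and phone accuracy, both in percent."""
    s, d, n = align_counts(ref, hyp)
    total = len(ref)
    corr = 100.0 * (total - s - d) / total
    acc = 100.0 * (total - s - d - n) / total
    return corr, acc


ref = ["n", "eh", "ow", "k"]
hyp = ["n", "n", "eh", "ow", "ow", "k"]   # two spurious phones
print(corr_acc(ref, hyp))
```

Under these definitions the gap between the two columns equals the insertion rate I / N, so a row such as OT CRF SPE14 on detected features (66.19 vs 29.68) would correspond to insertions amounting to roughly 36.5% of the reference phones.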
Page-10
Results (2)

System Fusion
Methods                   # Systems   Phone Corr. (%)   Phone Acc. (%)
HMM baseline              1           69.02             63.45
OT: SPE+GP+MV             3           61.97             60.65
DT: SPE+GP+MV             3           52.90             52.06
OT+DT: SPE+GP+MV          6           60.81             59.20
OT: SPE+GP+MV +HMM        4           65.53             64.31
DT: SPE+GP+MV +HMM        4           59.57             58.64
OT+DT: SPE+GP+MV +HMM     7           64.22             62.59
Page-11
System Fusion with CRF
[Figure: chain CRF used for fusion. Input X: phone sequences from the SPE, MV, GP and HMM systems (x_{i-1}, x_i, x_{i+1}). Output Y: combined results (phone), y_{i-1}, y_i. Edges labelled j and k.]
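One way to feed the outputs of several systems to a fusion CRF is to write them as parallel columns in the CRF++ training format: one token per line, whitespace-separated columns, the reference label in the last column, and a blank line between sequences. The sketch below assumes the four system outputs are already aligned position by position with the reference (the slide does not say how they are aligned); `fusion_rows` and the example labels are hypothetical.

```python
SYSTEMS = ["SPE", "MV", "GP", "HMM"]


def fusion_rows(outputs, reference):
    """Format one utterance as CRF++ training rows.

    `outputs` maps a system name to its phone label per position;
    `reference` holds the target phone labels (last column).
    """
    length = len(reference)
    for name in SYSTEMS:
        if len(outputs[name]) != length:
            raise ValueError(f"{name} output is not aligned with the reference")
    rows = []
    for i in range(length):
        columns = [outputs[name][i] for name in SYSTEMS] + [reference[i]]
        rows.append(" ".join(columns))
    # A blank line terminates the sequence.
    return "\n".join(rows) + "\n\n"


# Hypothetical system outputs for a two-position utterance.
outputs = {
    "SPE": ["n", "eh"],
    "MV":  ["n", "eh"],
    "GP":  ["n", "ih"],
    "HMM": ["m", "eh"],
}
text = fusion_rows(outputs, ["n", "eh"])
print(text, end="")
```

A CRF++ feature template would then decide which columns and which neighbouring positions each feature function looks at.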
Page-12
Two Types of AFDT Imperfection
[Figure: the phone sequence "h# n eh ow kcl k w eh ae eh s tcl t ix n" shown against two AF tracks, AF(A) and AF(A'), illustrating the two imperfections: AF asynchrony and AFDT errors.]
Page-13
CRF Training (1)
• Oracle Data Training
  – Phone y → mapping table (phone → AFs) → AFs x, over time t
• Detected Data Training
  – AFs x come from the AFDT and contain detected errors; phone y over time t
Page-14
CRF Training (2)
• Aligned Data Training
[Figure: phone y and AFs x over time t; the AFDT is shown together with the AF sequence used for training.]
Page-15
Results (3)

System            Phone Corr. (%)   Phone Acc. (%)
Upper Bound
  OT CRF          98.31             98.28
  AT CRF          71.49             70.31
Real Case
  OT CRF          70.55             34.38
  DT CRF          57.30             56.14
  AT CRF          64.87             62.32
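• Introducing AF asynchrony lowers phone accuracy by 27.97 points (upper-bound OT CRF 98.28 → upper-bound AT CRF 70.31)
• Detection errors cause a further drop of 7.99 points (upper-bound AT CRF 70.31 → real-case AT CRF 62.32)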
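The two drops quoted on this slide follow directly from the Phone Acc. column of the Results (3) table; a short check, using only numbers from that table:

```python
# Phone accuracy values (%) taken from the Results (3) table.
upper_bound_ot = 98.28   # OT CRF, upper bound
upper_bound_at = 70.31   # AT CRF, upper bound
real_case_at = 62.32     # AT CRF, real case

# Drop attributed to AF asynchrony, then the further drop from detection errors.
asynchrony_drop = round(upper_bound_ot - upper_bound_at, 2)
detection_drop = round(upper_bound_at - real_case_at, 2)
print(asynchrony_drop, detection_drop)
```

Both differences are absolute percentage points, not relative reductions.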
Page-16
AF Asynchrony Compensation
• AF asynchrony is caused by context variation
• We can reduce AF asynchrony by letting our systems learn context variation directly
  – Long-term information
[Figure: long-term feature extraction. Labels: 23-dim Mel input over a 310 ms span; left context and right context branches, each with windows + DCTs feeding an MLP; a further MLP; dimension labels 144, 72, 72, 72.]
Page-17
Results (4)

Test Data Type         System                             Corr    Acc
-                      CI-HMM                             69.02   63.45
-                      CD-HMM                             75.76   65.78
Detected (real case)   OT CRF (±3)                        75.24   47.97
                       Long Term AFDT + DT CRF (±3)       64.58   63.12
Ideal (upper bound)    Long Term AFDT + AT CRF            74.96   73.64
                       MFCC AFDT + AT CRF (±3)            72.87   71.62
                       Long Term AFDT + AT CRF (±3)       76.83   74.97
Detected (real case)   Long Term AFDT + AT CRF            69.83   66.97
                       MFCC AFDT + AT CRF (±3)            66.21   63.16
                       Long Term AFDT + AT CRF (±3)       71.01   67.67
Page-18
Conclusions
• A well-designed phonological feature system is important
  – AF asynchrony minimization training and AF-phone synchronization could also be investigated
• An oracle-trained CRF is able to retrieve more phonological information from speech
  – High phone correct rate (but sensitive to detection errors)
  – Helpful for system combination
• Detection-based ASR is promising
  – The front-end detector is a major issue
Page-19
AF and Phone Alignment Using AFDT
[Figure: a phone sequence and an AF sequence drawn on parallel time axes (t), illustrating their alignment.]