
Institute of Information Science, Academia Sinica

12 July, 2011 @ IIS, Academia Sinica

Automatic Detection-based Phone Recognition on TIMIT

Hung-Shin Lee (李鴻欣)

Based on Chen and Wang in ISCSLP’08 and Interspeech’09


Page-2

Detection-Based ASR

Human SR: Knowledge Detection → Integration → Knowledge (Higher Level)

DB ASR: Detectors → Integrator → Results

Detectors
• Phonological attr.
• Prosodic attr.
• Acoustic attr.
• …

Integrator
• HMM
• CRF
• …

Results
• Phone
• Syllable
• Word
• Sentence
• Semantic info
• …


Page-3

Phonological Systems

System           SPE (Sound Pattern of English)   MV (Multi-valued Feature)         GP (Government Phonology)
Literature       (N. Chomsky & M. Halle, 1968)    (S. King, 2000)?                  (J. Harris, 1994)
Feature Types    Production-based, binary         Production-based, 2-10 values     Sound structure primes, binary
Feature Number   13                               6                                 11
Examples         anterior, nasal, round           centrality, front-back, manner,
                                                  phonation, place, roundness


Page-4

Phonological Feature Detection (1)

Input layer: 13 MFCCs × 9 frames (frames i-4 … i … i+4)

MLP (Detectors): recurrent / time-delay, with a hidden layer

Output: posterior probability → quantization → binary feature vector
• SPE_14: 0101...01
• GP_11: 011..01
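The detector input and the quantization step can be illustrated with a minimal sketch. It assumes edge frames are repeated at utterance boundaries and that posteriors are thresholded at 0.5; neither detail is stated on the slide, and the names `stack_context` and `quantize` are hypothetical.

```python
from typing import List


def stack_context(frames: List[List[float]], i: int, half_width: int = 4) -> List[float]:
    """Concatenate frames i-half_width .. i+half_width into one input vector.

    Boundary handling (repeating the first/last frame) is an assumption.
    """
    last = len(frames) - 1
    stacked: List[float] = []
    for t in range(i - half_width, i + half_width + 1):
        t_clamped = min(max(t, 0), last)
        stacked.extend(frames[t_clamped])
    return stacked


def quantize(posteriors: List[float], threshold: float = 0.5) -> str:
    """Turn per-feature posteriors into a bit string (threshold is an assumption)."""
    return "".join("1" if p >= threshold else "0" for p in posteriors)


# Toy utterance: 20 frames, each holding 13 placeholder "MFCC" values.
frames = [[float(t)] * 13 for t in range(20)]
mlp_input = stack_context(frames, i=10)
print(len(mlp_input))                    # 9 frames x 13 MFCCs = 117
print(quantize([0.1, 0.9, 0.3, 0.7]))    # 0101
```

With 13 MFCCs and a window of i-4 to i+4, each MLP input vector has 117 dimensions.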


Page-5

Phonological Feature Detection (2)

Input: 13 MFCCs × 9 frames (frames i-4 … i … i+4)

One time-delay MLP per MV feature (6 MV Features), for example:
• MLP (Centrality) → 0100
• MLP (Front-Back) → 100
• MLP (Roundness) → 010

The per-feature outputs are concatenated into MV_29: 0100100.........010
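The MV_29 vector appears to be a concatenation of one one-hot code per multi-valued feature. The sketch below shows that construction for the three detectors named on the slide only; taking the arg-max of each MLP's posteriors is an assumption, and the posterior values are made up for illustration.

```python
from typing import Dict, List


def one_hot_argmax(posteriors: List[float]) -> str:
    """One-hot bit string marking the most probable value of one MV feature."""
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return "".join("1" if k == best else "0" for k in range(len(posteriors)))


def concat_mv(outputs: Dict[str, List[float]], order: List[str]) -> str:
    """Concatenate the per-feature one-hot codes in a fixed feature order."""
    return "".join(one_hot_argmax(outputs[name]) for name in order)


# Illustrative posteriors for three of the six MV features (values are invented).
outputs = {
    "centrality": [0.1, 0.7, 0.1, 0.1],
    "front_back": [0.8, 0.1, 0.1],
    "roundness": [0.2, 0.6, 0.2],
}
bits = concat_mv(outputs, ["centrality", "front_back", "roundness"])
print(bits)  # 0100100010
```

Running all six detectors this way would give the full 29-bit observation for one frame.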


Page-6

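Conditional Random Field (CRF) Integrator

• General Chain CRF

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_i \sum_j \lambda_j \, s_j(y_i, \mathbf{x}, i) + \sum_i \sum_k \mu_k \, t_k(y_{i-1}, y_i, \mathbf{x}, i) \right)$$

s_j: state feature function; t_k: transition feature function

λj, μk : feature function weight parameters

[Figure: chain CRF. Input X (phonological features): … xi-1, xi, xi+1 …; Output Y (phone): … yi-1, yi …; state edges weighted by λj, transition edges by μk]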
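To make the chain-CRF definition concrete, here is a toy sketch that scores a label sequence with state and transition weights and normalizes by brute force over all labelings. The two-phone label set, the 2-bit observations, and all weights are invented for illustration; a real system would use dynamic programming rather than enumeration.

```python
import itertools
import math
from typing import Dict, Sequence, Tuple

LABELS = ("aa", "s")  # toy phone set (hypothetical)

# Hypothetical weights: lambda_j * s_j fires on (label, observation),
# mu_k * t_k fires on (previous label, current label).
STATE_W: Dict[Tuple[str, str], float] = {("aa", "10"): 2.0, ("s", "01"): 2.0}
TRANS_W: Dict[Tuple[str, str], float] = {("aa", "aa"): 0.5, ("aa", "s"): 0.5, ("s", "s"): 0.5}


def score(y: Sequence[str], x: Sequence[str]) -> float:
    """Unnormalized log-score: sum of state and transition feature weights."""
    total = 0.0
    for i, (label, obs) in enumerate(zip(y, x)):
        total += STATE_W.get((label, obs), 0.0)
        if i > 0:
            total += TRANS_W.get((y[i - 1], label), 0.0)
    return total


def prob(y: Sequence[str], x: Sequence[str]) -> float:
    """p(y | x) with Z(x) computed by enumerating every label sequence."""
    z = sum(math.exp(score(c, x)) for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z


x = ["10", "10", "01"]  # quantized feature vectors, one per frame
best = max(itertools.product(LABELS, repeat=len(x)), key=lambda c: score(c, x))
print(best, round(prob(best, x), 3))
```

The probabilities of all eight possible labelings of this three-frame input sum to one, which is what the normalizer Z(x) guarantees.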


Page-7

CRF Integrator – Training Issues

• Required Label for CRF Training
– Phone: y
– Phonological features: x

• Detected-data trained CRF (DT CRF)
– Training Data: Speech → Detectors (MLP) → Phonological features (with errors)
– Phonological features (with errors) + Phone labels → DT CRF

• Oracle-data trained CRF (OT CRF)
– Training Data: Phone labels → Mapping (phones → phonological features) → Phonological features
– Phonological features + Phone labels → OT CRF
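The difference between the two training regimes can be sketched as follows. The 3-bit mapping table and the detector output are hypothetical placeholders, not the SPE, MV, or GP tables used in the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical phone -> feature-bit table, for illustration only.
PHONE_TO_FEATURES: Dict[str, str] = {"aa": "101", "n": "011", "s": "000"}

Pair = Tuple[str, str]  # (feature bits x, phone label y)


def oracle_pairs(phone_labels: List[str]) -> List[Pair]:
    """OT CRF data: features are looked up from the reference phone labels."""
    return [(PHONE_TO_FEATURES[p], p) for p in phone_labels]


def detected_pairs(detected_bits: List[str], phone_labels: List[str]) -> List[Pair]:
    """DT CRF data: features come from the detectors and may contain errors."""
    if len(detected_bits) != len(phone_labels):
        raise ValueError("need one detected feature vector per frame label")
    return list(zip(detected_bits, phone_labels))


def mismatches(oracle: List[Pair], detected: List[Pair]) -> int:
    """Number of frames where detected features differ from the oracle ones."""
    return sum(1 for (xo, _), (xd, _) in zip(oracle, detected) if xo != xd)


labels = ["s", "aa", "aa", "n"]
ot_data = oracle_pairs(labels)
dt_data = detected_pairs(["000", "101", "111", "011"], labels)  # third frame is wrong
print(mismatches(ot_data, dt_data))  # 1
```

An OT CRF never sees frames like the third one during training, which is one plausible reason it is sensitive to detection errors at test time.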


Page-8

Experiments

• Corpus: TIMIT
– No SA1, SA2
– Training set (3296 utts), Dev set (400 utts)
– Test set (1344 utts)

• Phone set: TIMIT61
– Evaluation: CMU/MIT 39

• Baseline
– CI-HMM

• Toolkits
– Nico Toolkit (for MLP), CRF++ (for CRF)
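CRF++ reads training data as one token per line with whitespace-separated columns, the label in the last column, and an empty line between sequences. The sketch below writes frame-level data in that layout; putting each feature bit in its own column is an assumption about how the features were encoded, and `to_crfpp` is a hypothetical helper.

```python
from typing import List, Tuple


def to_crfpp(sequences: List[List[Tuple[str, str]]]) -> str:
    """Render utterances as CRF++ training text.

    Each utterance is a list of (feature_bits, phone) frames. Every bit becomes
    one column and the phone label is the last column.
    """
    blocks = []
    for seq in sequences:
        lines = [" ".join(list(bits) + [phone]) for bits, phone in seq]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


utterances = [
    [("101", "aa"), ("011", "n")],
    [("000", "s")],
]
print(to_crfpp(utterances), end="")
```

CRF++ expects the same number of columns on every line, so all frames must use the same feature-vector length.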


Page-9

Results (1)

Model: OT CRF, Test: OD Features

Features   Phone Corr. %   Phone Acc. %
SPE14      93.28           93.20
GP11       98.39           98.36
MV29       88.75           88.56

Model: OT/DT CRF, Test: DD Features

Model          Features   Phone Corr. %   Phone Acc. %
HMM-baseline   -          69.02           63.45
OT CRF         SPE14      66.19           29.68
OT CRF         GP11       69.03           31.38
OT CRF         MV29       59.24           30.33
DT CRF         SPE14      56.56           55.27
DT CRF         GP11       55.74           54.53
DT CRF         MV29       51.84           50.68
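The large gap between Phone Corr. and Phone Acc. for the OT CRF on detected features is what one would expect if the output contains many inserted phones. The sketch below assumes the usual HTK-style definitions, Corr = (N - S - D) / N and Acc = (N - S - D - I) / N, which the slides do not spell out; the reference and hypothesis strings are invented.

```python
from typing import List, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total errors, S, D, I) for ref[:i] against hyp[:j]
    cost = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = cost[i - 1][j - 1]
            diag = (e, s, d, ins) if ref[i - 1] == hyp[j - 1] else (e + 1, s + 1, d, ins)
            e, s, d, ins = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = cost[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            cost[i][j] = min(diag, deletion, insertion)
    _, s, d, ins = cost[n][m]
    return s, d, ins


def corr_acc(ref: List[str], hyp: List[str]) -> Tuple[float, float]:
    """Percent correct and percent accuracy under the assumed definitions."""
    n = len(ref)
    s, d, ins = align_counts(ref, hyp)
    return 100.0 * (n - s - d) / n, 100.0 * (n - s - d - ins) / n


ref = ["s", "aa", "n"]
hyp = ["s", "s", "aa", "aa", "n", "n"]  # every phone recognized, three extra phones
print(corr_acc(ref, hyp))  # (100.0, 0.0)
```

In this toy case every reference phone is found, so Corr is perfect, but the three insertions drive Acc to zero.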


Page-10

Results (2)

System Fusion

Methods                  # System   Phone Corr. (%)   Phone Acc. (%)
HMM baseline             1          69.02             63.45
OT: SPE+GP+MV            3          61.97             60.65
DT: SPE+GP+MV            3          52.90             52.06
OT+DT: SPE+GP+MV         6          60.81             59.20
OT: SPE+GP+MV +HMM       4          65.53             64.31
DT: SPE+GP+MV +HMM       4          59.57             58.64
OT+DT: SPE+GP+MV +HMM    7          64.22             62.59


Page-11

System Fusion with CRF

[Figure: chain CRF used for fusion. Input X (Phone Sequence from each of the SPE Sys., MV Sys., GP Sys., and HMM Sys.): … xi-1, xi, xi+1 …; Output Y (Combined Results, Phone): … yi-1, yi …; weights λj, μk]


Page-12

Two Types of AFDT Imperfection

[Figure: phone track "h# n eh ow kcl k w eh ae eh s tcl t ix n" shown with two AF tracks, AF(A) and AF(A’), illustrating the two imperfection types: AF asynchrony and AFDT errors]


Page-13

CRF Training (1)

• Oracle Data Training
[Figure: Phone y → Mapping Table → AFs x, over time t]

• Detected Data Training
[Figure: Phone y and AFs x over time t, with AFs x produced by the AFDT and containing Detected Errors]


Page-14

CRF Training (2)

• Aligned Data Training
[Figure: Phone y and AFs x over time t; the AFDT provides the AF Sequence used for alignment]


Page-15

Results (3)

Condition     System   Phone Corr. (%)   Phone Acc. (%)
Upper Bound   OT CRF   98.31             98.28
Upper Bound   AT CRF   71.49             70.31
Real Case     OT CRF   70.55             34.38
Real Case     DT CRF   57.30             56.14
Real Case     AT CRF   64.87             62.32

• Introducing AF asynchrony drops accuracy by 27.97 % (98.28 → 70.31)
• Detection errors cause a further 7.99 % accuracy drop (70.31 → 62.32)
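The two quoted drops follow directly from the Phone Acc. column; a short check, using only the numbers in the table (the dictionary keys are hypothetical names):

```python
# Phone accuracies (%) taken from the Results (3) table.
acc = {
    "upper_bound_OT": 98.28,  # oracle features, no asynchrony
    "upper_bound_AT": 70.31,  # aligned features, asynchrony present
    "real_case_AT": 62.32,    # detected features, asynchrony and detection errors
}

asynchrony_drop = round(acc["upper_bound_OT"] - acc["upper_bound_AT"], 2)
detection_drop = round(acc["upper_bound_AT"] - acc["real_case_AT"], 2)
print(asynchrony_drop, detection_drop)  # 27.97 7.99
```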


Page-16

AF Asynchrony Compensation

• AF asynchrony is caused by context variation
• We can reduce AF asynchrony by letting our systems learn context variation directly
– Long-Term information

[Figure: long-term AF detector. A 310 ms span of 23 dim Mel features is split into Left Context and Right Context; each context passes through Windows + DCTs and an MLP, and a further MLP follows. Dimension labels shown in the figure: 72 Dim (three times) and 144 Dim]
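The "Windows + DCTs" step can be sketched for a single Mel band as below. Several details are assumptions not given on the slide: a 10 ms frame shift (so 310 ms is 31 frames), left and right halves that share the centre frame, a full Hamming window on each half, and the number of DCT coefficients kept.

```python
import math
from typing import List, Tuple


def hamming(n: int) -> List[float]:
    """Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def dct2(signal: List[float], num_coeffs: int) -> List[float]:
    """First num_coeffs coefficients of an unnormalized DCT-II."""
    n = len(signal)
    return [
        sum(v * math.cos(math.pi * (t + 0.5) * k / n) for t, v in enumerate(signal))
        for k in range(num_coeffs)
    ]


def long_term_features(trajectory: List[float], num_coeffs: int) -> Tuple[List[float], List[float]]:
    """Window and DCT-compress the left and right halves of one band's trajectory."""
    centre = len(trajectory) // 2
    halves = (trajectory[: centre + 1], trajectory[centre:])  # both include the centre frame
    feats = []
    for part in halves:
        windowed = [v * w for v, w in zip(part, hamming(len(part)))]
        feats.append(dct2(windowed, num_coeffs))
    return feats[0], feats[1]


trajectory = [1.0] * 31  # one Mel band over 31 frames (placeholder values)
left, right = long_term_features(trajectory, num_coeffs=4)
print(len(left), len(right))  # 4 4
```

Repeating this for all 23 Mel bands and concatenating the coefficients would give one input vector per context MLP.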


Page-17

Results (4)

Test Data Type         System                            Corr    Acc
-                      CI-HMM                            69.02   63.45
-                      CD-HMM                            75.76   65.78
Detected (real case)   OT CRF (±3)                       75.24   47.97
Detected (real case)   Long Term AFDT + DT CRF (±3)      64.58   63.12
Ideal (upper bound)    Long Term AFDT + AT CRF           74.96   73.64
Ideal (upper bound)    MFCC AFDT + AT CRF (±3)           72.87   71.62
Ideal (upper bound)    Long Term AFDT + AT CRF (±3)      76.83   74.97
Detected (real case)   Long Term AFDT + AT CRF           69.83   66.97
Detected (real case)   MFCC AFDT + AT CRF (±3)           66.21   63.16
Detected (real case)   Long Term AFDT + AT CRF (±3)      71.01   67.67
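If "(±3)" denotes a CRF observation window of three frames on either side, which is an interpretation rather than something the slides state, a CRF++ feature template for it could be generated as sketched below. The `%x[row,col]` macro and the `U`/`B` prefixes are CRF++ template syntax; the column count and the helper name are hypothetical.

```python
def crfpp_template(num_columns: int, context: int = 3) -> str:
    """Build a CRF++ template with unigram features over a +/-context frame window."""
    lines = ["# Unigram"]
    index = 0
    for col in range(num_columns):
        for offset in range(-context, context + 1):
            # %x[row,col]: value in column `col` of the token `row` lines away
            lines.append(f"U{index:03d}:%x[{offset},{col}]")
            index += 1
    lines += ["", "# Bigram", "B"]  # plain label-bigram (transition) features
    return "\n".join(lines) + "\n"


print(crfpp_template(num_columns=2, context=1), end="")
```

For 11 feature columns and a window of ±3 this would produce 77 unigram templates plus the bigram line.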


Page-18

Conclusions

• A well-designed phonological feature system is important
– AF asynchrony minimization training and AF-phone synchronization could also be investigated

• Oracle Trained CRF is able to retrieve more phonological information from speech
– High phone correct rate (but sensitive to detection error)
– Helpful for combination

• Detection-Based ASR is promising
– A front-end detector is a major issue


Page-19

AF and Phone Alignment Using AFDT

[Figure: phone sequence and AF sequence plotted against time t]