NSF Workshop, Atlanta, Georgia, 7-8 Oct 2003

Speech Recognition: It Takes a Village to Raise a Child

Michael Picheny
Human Language Technologies Group
IBM Thomas J. Watson Research Center

Special thanks to: Stan Chen, Yuqing Gao, Ramesh Gopinath, Makis Potamianos, Bhuvana Ramabhadran, Bowen Zhou, and Geoff Zweig


Page 2:

http://www.nist.gov/speech/tests/rt/rt2003/spring/presentations/rt03s-stt-results-v9.pdf

Page 3:

Developmental Factors

1988-1995:
• CD HMM (Modeling)
• Multiple Codebooks (Modeling)
• Tied Mixtures (Modeling)
• Double Deltas (Sig Proc)
• PTMs (Modeling)
• STMs (Modeling)
• Data
• MLLR (Adaptation)
• SAT (Adaptation)
• PLP (Sig Proc)

1996-2003:
• Multiple Models (Modeling)
• VTLN (Adaptation)
• Data
• MLLT, BIC (Modeling)
• Data
• ROVER (Decoding)
• Data
• fMLLR-SAT (Adaptation)
• MMI (Training)
• FSTs (Modeling)
• MPE (Training)
• Data

• Bulk of improvements from better modeling and more data, closely followed by adaptation (a form of modeling)

Page 4:

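Continue the Basics: Advances in Gaussian Modeling

• Most current systems model speech as a mixture of diagonal Gaussians, but there is this nagging suspicion that full-covariance models would be better.

• Try to approximate full-covariance models with a controlled increase in the number of parameters (Axelrod, 2003), writing the precision (inverse covariance) matrix P_g of Gaussian g as a weighted sum of D_p shared basis matrices:

(EMLLT) $P_g = \sum_{k=1}^{D_p} \lambda_k^g \, a_k a_k^T, \quad d \le D_p \le \frac{d(d+1)}{2}$

(PCGMM) $P_g = \sum_{k=1}^{D_p} \lambda_k^g \, S_k, \quad d \le D_p \le \frac{d(d+1)}{2}$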
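The two parameterizations on this slide both build each Gaussian's precision matrix as a weighted sum of basis matrices shared across all Gaussians: rank-one matrices a_k a_k^T for EMLLT, general symmetric matrices S_k for PCGMM. A minimal plain-Python sketch of that construction follows; the function names, the toy dimension d = 2, and every number in it are illustrative assumptions, not values from the talk.

```python
# Sketch of the two precision-matrix parameterizations (toy sizes, made-up numbers).

def outer(a):
    """Rank-one matrix a a^T as a list of rows."""
    return [[x * y for y in a] for x in a]

def weighted_sum(weights, matrices):
    """Return sum_k weights[k] * matrices[k] for equally sized square matrices."""
    d = len(matrices[0])
    total = [[0.0] * d for _ in range(d)]
    for w, m in zip(weights, matrices):
        for i in range(d):
            for j in range(d):
                total[i][j] += w * m[i][j]
    return total

def emllt_precision(lambdas, directions):
    """EMLLT: P_g = sum_k lambda_k^g a_k a_k^T, with shared directions a_k."""
    return weighted_sum(lambdas, [outer(a) for a in directions])

def pcgmm_precision(lambdas, basis):
    """PCGMM: P_g = sum_k lambda_k^g S_k, with shared symmetric matrices S_k."""
    return weighted_sum(lambdas, basis)

# Toy case: d = 2, so D_p can range from d = 2 up to d(d+1)/2 = 3.
directions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical shared a_k
lambdas_g = [2.0, 3.0, 0.5]                          # hypothetical per-Gaussian weights

P_emllt = emllt_precision(lambdas_g, directions)
P_pcgmm = pcgmm_precision(lambdas_g, [outer(a) for a in directions])
P_diag = emllt_precision(lambdas_g[:2], directions[:2])  # D_p = d, axis-aligned a_k

print(P_emllt)
print(P_diag)
```

With only the two axis-aligned directions (D_p = d) the result is a diagonal precision matrix; adding the third direction introduces off-diagonal terms at the cost of one extra weight per Gaussian. EMLLT is the special case of PCGMM in which every S_k is rank one, which is why the two calls above agree when given the same basis.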

Page 5:

Advances in Gaussian Modeling

D_p (d = 52)    EMLLT    PCGMM
d               2.67     1.96
2d              2.04     1.75
8d              1.65     1.64
26.5d           1.58     1.58

Page 6:

Advances in Gaussian Modeling

nGauss     Diagonal    Full
5000       3.48        1.83
10000      2.68        1.56
42993      2.00        1.35
142622     1.74        1.54
350286     1.68        Forget it
609100     1.65        Really forget it

• 10k FC model better than 600k diagonal model, with 20% of the parameters

• FC models clearly prone to overtraining. PCGMM helps but still increases the number of parameters

• Clearly need lots more acoustic data to train even PCGMM models, much less FC models
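The "20% of the parameters" figure can be checked with simple counting. The sketch below assumes the feature dimension d = 52 quoted in the D_p table also applies to this table, counts only means and covariances (mixture weights are ignored), and uses hypothetical helper names.

```python
# Rough parameter counts for diagonal vs. full-covariance Gaussians.
# Assumption: d = 52 as in the D_p table; mixture weights are not counted.

D = 52

def diag_params(n_gauss, d=D):
    """Mean (d) plus diagonal variances (d) per Gaussian."""
    return n_gauss * 2 * d

def full_params(n_gauss, d=D):
    """Mean (d) plus symmetric covariance, d(d+1)/2 free entries, per Gaussian."""
    return n_gauss * (d + d * (d + 1) // 2)

free_cov_entries = D * (D + 1) // 2      # size of a full symmetric covariance
fc_10k = full_params(10000)
diag_600k = diag_params(609100)
ratio = fc_10k / diag_600k

print(f"free covariance entries per Gaussian: {free_cov_entries}")
print(f"10,000 full-covariance Gaussians: {fc_10k:,} parameters")
print(f"609,100 diagonal Gaussians:       {diag_600k:,} parameters")
print(f"ratio: {ratio:.1%}")
```

Under these assumptions the ratio comes out a little above one fifth, consistent with the slide's rounded 20%. The same count gives d(d+1)/2 = 1378 = 26.5d, which is why the last row of the D_p table (26.5d) corresponds to an unconstrained full-covariance model.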

Page 7:

Teens: Utilizing Linguistic Information in ASR

• Standard LVCSR does not explicitly use linguistic information

• Past history is littered with failure

• Over the last few years, the area has begun to show signs of life

Page 8:

Syntactic Structured LM

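Exploiting syntactic dependencies (Chelba ACL98, Wu ICASSP00)

[Figure: partial parse of "The contract ended with a loss of 7 cents after", with POS tags DT NN VBD IN DT NN IN CD NNS. The exposed heads are h_{i-2} = contract (nt_{i-2} = NP) and h_{i-1} = ended (nt_{i-1} = VP); w_{i-2} and w_{i-1} are the two preceding words and w_i is the word being predicted.]

$P(w_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid W_{i-1}, T_i) \, \rho(T_i \mid W_{i-1})$

$\approx \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) \, \rho(T_i \mid W_{i-1})$

• Observe performance improvements of ~1% absolute on SWB/BN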
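The structured LM on this slide predicts the next word by summing over candidate partial parses, each contributing a prediction conditioned on its exposed heads and weighted by how plausible that parse is given the word history. The toy sketch below shows only that weighted sum; the parses, weights, and probabilities are invented for illustration, and the function name is hypothetical.

```python
# Toy version of P(w_i | W_{i-1}) = sum over parses T of P(w_i | context from T) * rho(T | W_{i-1}).
# All numbers are made up; a real system estimates them from treebank-style data.

def slm_word_probability(word, parses):
    """parses: list of (rho, predictor) pairs, where rho is the weight of one
    candidate parse and predictor maps a next word to its probability given
    the heads that parse exposes. Weights are renormalized over the list."""
    total_rho = sum(rho for rho, _ in parses)
    return sum(rho / total_rho * predictor.get(word, 0.0)
               for rho, predictor in parses)

parses = [
    (0.7, {"after": 0.20, "in": 0.05}),   # parse exposing heads (contract, ended)
    (0.3, {"after": 0.02, "in": 0.10}),   # a competing parse exposing other heads
]

p_after = slm_word_probability("after", parses)
print(f"P(after | history) = {p_after:.3f}")
```

The point of the construction is that a parse exposing "contract" and "ended" can assign "after" a much higher probability than the trigram context "7 cents" alone would.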

Page 9:

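Semantic Structured LM

Exploiting semantic dependencies (Erdogan ICSLP02)

[Figure: semantic parse of "I want to book a one way ticket to Houston Texas for tomorrow morning". Word-level tags, in order: null null null book null rt-ow rt-ow flight null city state word day timerng. Constituent labels: RT-OW, LOC, DATE, TIME, LOC-TO, SEGMENT, BOOK, S. The figure marks the words w_{j-2}, w_{j-1}, w_j and the parse labels p_j, g_j, c_j used in the model.]

$p(w_j \mid W_1^{j-1}) \approx p(w_j \mid w_{j-1}, w_{j-2}, p_j, g_j, c_j)$

• Error-rate reductions of 20% for limited-domain tasks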

Page 10:

Super-Structured LM for LVCSR

[Figure: the word sequence W1, ..., WN conditioned on multiple knowledge sources: Dialogue State; Semantic Parser; World Knowledge; Named Entity; Syntactic Parser; Speaker (turn, gender, ID).]

• Such an LM would clearly require substantially more annotated data than is currently available

Page 11:

Nutrition: “There’s no data like more data”

-- Robert L. Mercer

Training (hrs)    141     297     602     843
WER (%)           17.2    15.4    14.7    14.5

RT03 Workshop (BBN)

LIMSI: Lamel (2002)
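The table's trend is easier to read as gain per doubling of training data. The short sketch below computes, for each consecutive pair of columns, how many doublings of data separate them and the absolute and relative WER reduction; only the hours and WER values come from the table, the rest is illustrative.

```python
import math

hours = [141, 297, 602, 843]        # training hours, from the table
wer = [17.2, 15.4, 14.7, 14.5]      # WER (%), from the table

steps = []  # (doublings of data, absolute WER gain, relative WER gain)
for i in range(1, len(hours)):
    doublings = math.log2(hours[i] / hours[i - 1])
    abs_gain = wer[i - 1] - wer[i]
    rel_gain = abs_gain / wer[i - 1]
    steps.append((doublings, abs_gain, rel_gain))
    print(f"{hours[i - 1]:>4} -> {hours[i]:>4} h: {doublings:.2f} doublings, "
          f"{abs_gain:.1f} absolute, {rel_gain:.1%} relative")

total_abs_gain = wer[0] - wer[-1]
print(f"overall: {hours[-1] / hours[0]:.1f}x data, {total_abs_gain:.1f} absolute")
```

Each step buys less than the previous one, which fits the talk's argument that exploiting much larger corpora will also take new and costlier algorithms.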

Page 12:

Nutrition: “There is no data like more data”

• A “Super-Structured LM” probably needs 100x the current data

• A large number of linguistic knowledge sources are now available
– Brown corpus, Penn Treebank (syntactically & semantically annotated)
– WordNet and FrameNet, Cyc ontologies
– Online dictionaries and thesauri
– Text data from WWW

• How to provide the necessary annotation at reasonable cost? May require community effort.

Page 13:

Speed, Data & Computing

• Decoding today is fast
– ML training even faster
– Discriminative training same as decoding
  • ~5-10xRT for numerous iterations

• But
– The data is growing, e.g. the EARS program aiming for
  • 2000 hrs/year telephony
  • 5000 hrs/year BN
  • ~10x increase from current
– Evidence suggests that new & costlier algorithms are necessary to exploit more data

• So
– Need minimum 10x increase in compute power just to track data
– 100x to run 10xRT programs rather than 1xRT programs

[Figure: decoding pipeline. Acoustic signal → Speech/non-speech segmentation → Speaker-Independent Decoding → Adaptive Transforms → Speaker-Adaptive Decoding → Words. Stage real-time factors, in slide order: 0.01 xRT, 0.11 xRT, 0.1 xRT, 0.63 xRT; total 0.85 xRT.]
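The slide's numbers fit together arithmetically: the four stage real-time factors add up to the 0.85 xRT total, and a 10x growth in data combined with 10x costlier algorithms gives the 100x compute figure. The sketch below works this through. The CPU-hour estimates for one pass over a year of EARS data are an added illustration, and all variable names are hypothetical.

```python
# Real-time factors for the four pipeline stages, in the order given on the slide.
stage_xrt = [0.01, 0.11, 0.10, 0.63]
total_xrt = sum(stage_xrt)

# EARS targets from the slide: hours of new audio per year.
ears_hours_per_year = 2000 + 5000   # telephony + broadcast news

def cpu_hours(audio_hours, xrt):
    """CPU time needed for one pass over the audio at a given real-time factor."""
    return audio_hours * xrt

one_pass_fast = cpu_hours(ears_hours_per_year, total_xrt)   # the 0.85 xRT pipeline
one_pass_slow = cpu_hours(ears_hours_per_year, 10)          # a 10 xRT algorithm

data_growth = 10        # ~10x more data than today
algorithm_cost = 10     # 10xRT programs rather than 1xRT programs
compute_multiplier = data_growth * algorithm_cost

print(f"pipeline total: {total_xrt:.2f} xRT")
print(f"one pass at {total_xrt:.2f} xRT: {one_pass_fast:,.0f} CPU-hours")
print(f"one pass at 10 xRT:   {one_pass_slow:,.0f} CPU-hours")
print(f"compute needed vs. today: {compute_multiplier}x")
```

Discriminative training makes numerous such passes, so on a single processor the 10 xRT case is measured in years per iteration, which motivates the move to large parallel machines.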

Page 14:

The BlueGene Frontier

• 200 TeraFlop computer
– Combined power of top 500 supercomputers in 2001
– 65,000 processors
  • 2GB per processor
  • ~1GHz clock
  • 3D torus interconnection
  • 2 nodes per card; 16 cards per board; 16 boards per plane; 2 planes per rack
  • Pieces beginning to be tested
– Intended for molecular dynamics, but available for other uses

• Potential ASR applications
– Physics-based articulatory modeling
– Brute-force parameter adjustment to minimize WER
– Large-scale neural network modeling
– Incorporation of Visual Processing
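A quick back-of-the-envelope check of the packaging figures on this slide is sketched below. It treats the slide's "nodes" and "processors" as the same unit, which is an assumption made only for this arithmetic, and uses decimal units throughout.

```python
import math

# Packaging hierarchy from the slide.
nodes_per_card = 2
cards_per_board = 16
boards_per_plane = 16
planes_per_rack = 2
nodes_per_rack = nodes_per_card * cards_per_board * boards_per_plane * planes_per_rack

processors = 65_000          # from the slide (rounded)
peak_flops = 200e12          # 200 TeraFlops, from the slide
memory_gb_each = 2           # 2GB per processor, from the slide

# Assumption: one processor per node, so racks follow from the node count.
racks = math.ceil(processors / nodes_per_rack)
gflops_each = peak_flops / processors / 1e9
total_memory_tb = processors * memory_gb_each / 1000

print(f"nodes per rack: {nodes_per_rack}")
print(f"racks needed:   {racks}")
print(f"per processor:  {gflops_each:.2f} GFLOPS")
print(f"total memory:   {total_memory_tb:.0f} TB")
```

Under that assumption the machine comes to a few dozen racks, with each processor contributing only a few GFLOPS. ASR workloads would therefore have to be split into many small, loosely coupled jobs to use it well.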


Page 16:

The Village: Collaborative Paradigms

• Originally, progress in ASR was haphazard: there was no way to compare results, and claims generated a lot of skepticism because of NIH syndrome

• Evaluation-driven ASR programs (prior DARPA, DOD paradigms)
– Provided a common metric to compare algorithms
– Funding based on relative performance of each site
  • Sites hope not only to do well, but for other sites to do badly
  • Discourages free exchange of resources between sites
  • Large portion of each site’s effort spent replicating other sites’ algorithms + data

• Non-evaluation-driven programs: NGSW, MALACH

• How to encourage collaboration while retaining the motivation of competition?
– While also retaining objective evaluation of progress

• Recent EARS program a strong step in this direction

• Even broader collaboration possible through sharing resources

Page 17:

Encouraging Collaboration

• Decompose ASR systems into modules
– Front end; acoustic model; pronunciation model; language model; adaptation; search; etc.

• Sites collaborate to create a single ASR system rather than one per site
– Each site works on writing better ASR modules, rather than complete ASR systems
– Each module (e.g., MMIE, VTLN, etc.) needs only be implemented once across all sites
– Progress measured and credit assigned to sites based on how modules affect WER of global system
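One way to picture the proposed credit assignment is sketched below: score the shared system with and without a site's module, and credit the site with the WER difference. The WER routine is the standard word-level edit distance. The `module_credit` function, the example sentences, and the scoring convention (positive means the module helped) are hypothetical illustrations, not a scheme described in the talk.

```python
def word_errors(ref, hyp):
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion of a reference word
                           cur[j - 1] + 1,            # insertion of a hypothesis word
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1]

def wer(refs, hyps):
    """Word error rate over a test set of reference/hypothesis sentence pairs."""
    errors = sum(word_errors(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words

def module_credit(refs, baseline_hyps, candidate_hyps):
    """WER change when one module is swapped into the shared system.
    Positive values mean the candidate module lowered the error rate."""
    return wer(refs, baseline_hyps) - wer(refs, candidate_hyps)

refs = ["the contract ended with a loss of seven cents"]
baseline = ["the contract ended with loss of seven sense"]     # 1 deletion, 1 substitution
candidate = ["the contract ended with a loss of seven sense"]  # 1 substitution

credit = module_credit(refs, baseline, candidate)
print(f"baseline WER  {wer(refs, baseline):.3f}")
print(f"candidate WER {wer(refs, candidate):.3f}")
print(f"credit        {credit:.3f}")
```

A real scheme would also have to deal with interactions between modules, since the gain from one module usually depends on which others are already in the system.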

Page 18:

Existence Proof: UIMA (Unstructured Information Management Architecture)

http://www.research.ibm.com/people/a/aspector/presentations/www2000f.pdf

Charts courtesy of David Ferrucci

• Accelerate Progress in Search and Analysis
– Reuse across teams
– Ease of experimentation
– Combination Hypothesis

Page 19:

Text analysis through a series of annotators

Annotators: Analyze, Recognize & Label specific semantic content for next consumer

[Figure: a document passes through a chain of annotators, each adding labels for the next:
• Detagger: document with HTML tags identified and content extracted
• Tokenizer: document with tokens (e.g., words) identified
• Language Identification: document labeled with language of text
• Part of Speech: word labeled with its part of speech
• Named-Entities: document with name identified
• Semantic Classes: semantic classes identified
Levels of analysis shown: document level, word level, phrase level.]
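The annotator chain on this slide can be pictured as a list of functions that each read a shared document object and add labels for the next consumer. The sketch below covers the first three stages only. It is a plain-Python illustration of the pattern, not the UIMA API, and the class, function names, and the toy language rule are all assumptions.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Document:
    raw: str                                   # original input, possibly with HTML
    text: str = ""                             # content after tag removal
    annotations: dict = field(default_factory=dict)

def detagger(doc):
    """Strip HTML tags and normalize whitespace."""
    doc.text = " ".join(re.sub(r"<[^>]+>", " ", doc.raw).split())
    return doc

def tokenizer(doc):
    """Identify word tokens in the de-tagged text."""
    doc.annotations["tokens"] = re.findall(r"\w+", doc.text)
    return doc

def language_identifier(doc):
    """Toy document-level rule: call it English if common English words appear."""
    english_cues = {"the", "a", "to", "of", "and"}
    tokens = {t.lower() for t in doc.annotations["tokens"]}
    doc.annotations["language"] = "en" if english_cues & tokens else "unknown"
    return doc

def run_pipeline(doc, annotators):
    """Apply annotators in order; each sees the labels added by earlier ones."""
    for annotate in annotators:
        doc = annotate(doc)
    return doc

doc = run_pipeline(
    Document("<p>I want to book a <b>ticket</b> to Houston</p>"),
    [detagger, tokenizer, language_identifier],
)
print(doc.text)
print(doc.annotations)
```

Because every stage shares one document type, a site could replace a single annotator without touching the others, which is the property the talk hopes to carry over to ASR modules.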

Page 20:

[Figure: UIMA architecture diagram. Stage labels: Acquisition; Unstructured Information Analysis; Structured Information; Access. Components shown: Client/User; UIM Application; Semantic Search Engine; Indices; Collection Processing Manager; Crawlers; (Text) Analysis Engines (Document-Level); Collection-Level Analyses; Analysis Engine Directory; Knowledge Source Access; Knowledge Source Adapter Directory; Knowledge & Data Bases; Document, Collection & Metadata Store (Documents, Collections, Metadata); Component Discovery; Unstructured Information.]

Page 21:

Possible Collaboration Discussion Points

• Sharing data and object files seems reasonable

• Speech community needs to design:
– Standard file formats supporting rich annotation
– Stable, general, open-source C++ interfaces for front end modules, acoustic models, LMs, etc.
– Rich tool set

• Port competitive trainer, decoder, adaptation into this framework; create basic file manipulation tools

• Can we ride on top of existing architectures such as UIMA?

Page 22:

Summary

• To help our “speech recognition” child develop, continue basic successful approaches of the past:
– Better Modeling
– More Data

• Increasing difficulty of the problem requires focus on community-wide collaboration for both algorithms and resources