Human Language Technology (HLT) Research at MIT CSAIL Zue Harbin-Speech-07-01... · 2018. 5. 8. ·...


Human Language Technology (HLT) Research at MIT CSAIL

Victor Zue ([email protected])
MIT Computer Science and Artificial Intelligence Laboratory
Cambridge, MA 02139, USA

2007-01-19

MIT = Labs + Departments

Faculty and grad students wear two hats: lab + department

• Departments (aka “courses”): … Mechanical Engineering (2), Materials Science (3), Chemistry (5), EECS (6), Physics (8), Brain & Cog. Sci. (9), Chemical Engineering (10), Aero & Astro (16), Math (18), …

• Labs: … CSAIL, LIDS, MTL, RLE, …


CSAIL History

• 1963: Project MAC formed at MIT

• 1970: Artificial Intelligence Laboratory (AI Lab) separated from Project MAC

• 1974: Project MAC renamed to Laboratory for Computer Science (LCS)

• 2003: CSAIL formed by the merger of AI Lab and LCS

• 2004: CSAIL moved into the Stata Center

Our Home: The Stata Center


CSAIL Today

• MIT’s largest interdepartmental laboratory

• About 840 members

– >90 principal investigators (PIs)

* >70 active teaching faculty (from 8 departments)

– ~110 research staff and affiliates

– ~470 graduate students

– ~60 undergraduates

– >20 postdoctoral fellows

– ~45 technical and admin staff

– ~30 visitors


CSAIL PIs

• 19 NAS/NAE/IM members
• 6 MacArthur Foundation Genius Awards
• 3 Turing Awards
• 2 Japan Prizes
• 1 Millennium Technology Award
• 1 Knight of the British Empire
• …

Agarwal, Arvind, Asanovic, Balakrishnan, Clark, Corbato, Dennis, Devadas, Ernst, Fano, Guttag, Jackson, Kaashoek, Lampson, Liskov, Miller, Morris, Rinard, Rudolph, Sollins, Terman, Katabi, Madden, Ward, Amarasinghe, Stonebraker

Systems

Darrell, Durand, Glass, Golland, Grimson, Horn, Jaakkola, Kaelbling, Katz, Popovic, Richards, Seneff, Zue, Adelson, Fisher, Freeman, Lozano-Perez, Tenenbaum, Willsky, Poggio

Perception & Learning

Berners-Lee, Weitzner, Barzilay, Collins, Abelson, Brooks, Davis, Gifford, Knight, Leonard, Long, Massaquoi, Moses, Shrobe, Sussman, Szolovits, Teller, Tidor, Williams, Winston, Roy, Rus, Stultz, Wisdom, O’Reilly

Physical, Biological, & Social Systems

Kellis, Tedrake, Demaine, Edelman, Garland, Indyk, Karger, Leighton, Leiserson, Lynch, Meyer, Micali, Rivest, Sipser, Sudan, Berger, Goemans, Goldwasser, Shor, Rubinfeld

Theory

HLT: 6 PIs, 4 researchers, ~35 graduate students


What are human language technologies?

• They are technologies that enable machines to process human languages

– Speech Coding: efficient & robust speech transmission

– Speech Recognition: speech → text

– Speech Synthesis: text → speech

– Language Understanding: text → meaning

– Language Generation: meaning → text

– Discourse Analysis: understanding in context

– Dialogue Modeling: preparing system’s side of the interaction

– Machine Translation: text in L1 → text in L2

– Information Summarization: verbose information → terse info

– …

They enable humans to communicate with humans and machines on their own terms


The Premise

• “Natural” language is what humans use to communicate

• Future information devices must satisfy the basic needs of human communication

• Current day devices can show and tell, but many are still deaf and blind


Ubiquitous Needs

• Interact with the physical world

– “Turn down the music.” “It’s a little too hot in here.” “Record the next Red Sox game for me.” …

• Create, access, and manage information

– “Where is the nearest pharmacy?” “When is my next appointment?” “Find the pictures I took at Michelle’s wedding.” …

• Angel on your shoulder

– “Where did I leave my keys?” “Remind me to send the slides to Randy.” “Who was at the meeting yesterday?” “What were the action items?” …

• …

• Human language technology (HLT) is needed for

– Natural human-machine interaction

– Easier access to audio-visual content


Speech as Interface: Dialogue Interactions

(Seneff, Glass)


Next Generation Speech Interfaces

• Can verbalize response

– Language generation

– Speech synthesis

• Can engage in dialogue with a user during the interaction

• Can communicate with users through conversation

• Can understand verbal input

– Speech recognition

– Language understanding (in context)

[Diagram: Speech Recognition, Language Understanding, Context Resolution, Dialogue Management, Language Generation, Speech Synthesis, Audio, Database]
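The component chain named on this slide can be sketched as a toy text-only pipeline. This is a minimal illustration under stated assumptions, not the MIT system: every function name, the keyword-spotting "understanding", and the tiny forecast table are hypothetical, and speech recognition and synthesis are omitted (the input is assumed to be already-recognized text).

```python
# Hypothetical toy pipeline; all names and data below are illustrative assumptions.
FORECASTS = {"boston": "sunny", "seattle": "rain"}  # stand-in for the Database box


def understand(text):
    """Language understanding: map recognized text to a small meaning frame."""
    words = text.lower().split()
    frame = {}
    if "weather" in words:
        frame["topic"] = "weather"
    for city in FORECASTS:
        if city in words:
            frame["city"] = city
    return frame


def resolve_context(frame, history):
    """Context resolution: inherit missing slots from the previous turn."""
    resolved = dict(history[-1]) if history else {}
    resolved.update(frame)
    return resolved


def manage_dialogue(frame, database):
    """Dialogue management: decide whether to answer or ask for more."""
    if "city" not in frame:
        return {"act": "request", "slot": "city"}
    return {"act": "inform", "city": frame["city"],
            "forecast": database[frame["city"]]}


def generate(reply):
    """Language generation: turn the system's decision into a sentence."""
    if reply["act"] == "request":
        return "Which city?"
    return f"In {reply['city'].title()}: {reply['forecast']}."


def respond(text, history, database=FORECASTS):
    """Run one user turn through the chain and remember the resolved frame."""
    frame = resolve_context(understand(text), history)
    history.append(frame)
    return generate(manage_dialogue(frame, database))


history = []
first = respond("what is the weather in boston", history)
second = respond("how about seattle", history)  # topic inherited from turn one
```

The second turn never mentions weather; it is answered only because context resolution carries the topic forward, which is the "(in context)" part of the slide.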


Conversational Interaction at MIT

[Diagram: Hub with Speech Recognition, Context Resolution, Dialogue Management, Language Generation, Speech Synthesis, Audio, Database]

• Conversation requires understanding, generation & dialogue

• System deployment shortens development cycle

• Prototypes:

– Weather, traffic, flight status, and schedules

– Real, up-to-date information

– Access via toll-free telephone numbers

video


Visual cues can help

• Visual conversational cues

– Body and head pose tracking

– Integrating with speech for high SNR applications


Importance of Audio-Visual Integration

• Audio and visual signals both contain information about:

– Identity/location of the person: Who is talking? Where is he?

– Linguistic message: What’s she saying?

– Emotion, mood, stress, etc.: How does he feel?

• Proper utilization of these two channels of information can lead to robust and enhanced capabilities, e.g.,

– Locating and identifying the speaker

– Speech understanding augmented with facial features

– Speech, gesture, and sketching integration

– Audio/visual information delivery


Audio-Visual Person Verification

[Diagram: Speaker Identification and Face Identification feed a Joint Decision (accept / reject)]

[Plot: False Reject probability (%) vs. False Accept probability (%), both axes from 0.01 to 40; curves for Face, Speech, and Combined]

Combined audio-visual inputs reduce equal error rate by 90%

video
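To make "equal error rate" and "joint decision" concrete, here is a minimal sketch. The slide does not say how the two identifiers are combined; the weighted sum below is an assumption, and the scores are invented toy numbers that say nothing about the reported 90% reduction.

```python
def error_rates(genuine, impostor, threshold):
    """Return (false accept rate, false reject rate) when accepting score >= threshold."""
    far = sum(score >= threshold for score in impostor) / len(impostor)
    frr = sum(score < threshold for score in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER: scan observed scores for the threshold where FAR and FRR are closest."""
    def gap(threshold):
        far, frr = error_rates(genuine, impostor, threshold)
        return abs(far - frr)

    best = min(sorted(set(genuine) | set(impostor)), key=gap)
    far, frr = error_rates(genuine, impostor, best)
    return (far + frr) / 2


def fuse(speech_score, face_score, weight=0.5):
    """Assumed fusion rule: weighted sum of the two modality scores."""
    return weight * speech_score + (1 - weight) * face_score


# Toy (speech, face) score pairs: each modality alone makes one mistake per class.
genuine_trials = [(0.9, 0.4), (0.4, 0.9), (0.8, 0.8), (0.7, 0.6)]
impostor_trials = [(0.5, 0.1), (0.1, 0.5), (0.2, 0.2), (0.3, 0.3)]

speech_eer = equal_error_rate([s for s, _ in genuine_trials],
                              [s for s, _ in impostor_trials])
face_eer = equal_error_rate([f for _, f in genuine_trials],
                            [f for _, f in impostor_trials])
fused_eer = equal_error_rate([fuse(s, f) for s, f in genuine_trials],
                             [fuse(s, f) for s, f in impostor_trials])
```

In this toy set the errors of the two modalities fall on different trials, so averaging the scores separates genuine from impostor trials; real gains depend on how correlated the two error sources are.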


Audio-Visual Information Delivery

• New, data-driven approach can produce very natural and intelligible synthetic speech

• We can now produce video-realistic animations

• These animated agents can speak and sing in different languages

• We can combine speech synthesis and facial animation to produce realistic avatars

Videos: Mary101-M, Mary101-F, Marilyn, Mary101-sing-J, Mary101-sing-C, English Avatar, Chinese Avatar


Some Ongoing Challenges

• Robust HLTs for realistic environments

• Shrinking the platform

• Reconfigurable HLTs

• Multimodality

• . . .

video

video


A Specific Problem: Destination Entry

• Speech input will enable natural, efficient interaction

• Spoken dialogue provides flexibility (e.g., error recovery)

• Challenges for a spoken interface: a very large search space!

– e.g., U.S. has 250 million addresses, 28,000 cities, 1.1 million street names

– Wide range of street counts per city (e.g., Delmar, AK has 1 street, Houston, TX has 19,000)


Important Ability: Dynamic Vocabularies

• Example: United States street address recognition

– 6.2M unique street, city, state pairs (283K unique words)

– 3-pass recognition has much smaller vocabulary (<20K words)

– Web-based interface integrated with Google Maps API

video
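The idea of a dynamic vocabulary can be sketched as follows. The slide does not say what each of the three passes recognizes; the state, then city, then street ordering below is an assumption, and the gazetteer is a hypothetical toy table rather than real coverage data.

```python
# Hypothetical toy gazetteer: state -> city -> street names.
GAZETTEER = {
    "MA": {
        "Cambridge": ["Vassar Street", "Main Street"],
        "Boston": ["Beacon Street", "Main Street"],
    },
    "TX": {
        "Houston": ["Main Street", "Travis Street", "Fannin Street"],
    },
}


def words(phrases):
    """Lower-cased set of words appearing in an iterable of phrases."""
    return {word.lower() for phrase in phrases for word in phrase.split()}


def flat_vocabulary(gazetteer):
    """Vocabulary a single-pass recognizer would need: every word at every level."""
    vocab = words(gazetteer)
    for cities in gazetteer.values():
        vocab |= words(cities)
        for streets in cities.values():
            vocab |= words(streets)
    return vocab


def pass_vocabulary(gazetteer, state=None, city=None):
    """Vocabulary activated for one pass, given what earlier passes hypothesized.

    Assumed pass order: states first, then cities in that state, then streets in that city.
    """
    if state is None:
        return words(gazetteer)
    if city is None:
        return words(gazetteer[state])
    return words(gazetteer[state][city])


everything = flat_vocabulary(GAZETTEER)
street_pass = pass_vocabulary(GAZETTEER, state="MA", city="Cambridge")
```

Each pass activates only the words that remain possible given the earlier hypotheses, which is how the active vocabulary can stay far below the full word count.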


A Platform for Map-Based Interactions

• Web-based restaurant content integrated with Google Maps API

• Virtual: web-based deployment allows worldwide access

• Adaptive: dialogue-sensitive vocabulary; speaker identification

• Multimodal: speech and pen interaction


Other Important Capabilities

• Accessible

• Contextual

• Multimodal

• Personalizable

• Customizable

video


Vehicle Integration

• Collaborating with researchers at BMW North America to integrate speech interfaces into vehicle infrastructure

• Initial prototype has incorporated CityGuide interface

• Test vehicle being used at BMW and CSAIL for system development, data collection, and evaluation

video


Speech as Content: Review Summarization

(Barzilay)


Review Summarization

• Users increasingly rely on online review sites to make decisions about products they plan to buy, restaurants they plan to visit, movies they’d like to watch, etc.

– “What are people saying about the new iPod Nano?”

– “Find me a good and inexpensive Thai restaurant near here”

– “Read me a representative review for The Terminator II”

• Sometimes we want to access this information via voice

– Reading all the reviews is impractical (too many, each with disparate opinions)

– Intelligent access to reviews requires semantic analysis

• How do we develop summarization capabilities to fulfill such needs?

– Automatic, domain-independent, high performance


A First Step…

• Developed an algorithm for review-ranking

– Convert free text reviews to ranking in multiple categories

… I started with a half dozen of freshly shucked oysters and my friend ordered the fried shrimp. … The service was very no-nonsense but very friendly and down to earth… The staff were happy to take our photo for us … Very well worth the very spontaneous visit and I'd definitely go there again given the chance



Finding Structure in Online Reviews

• Score reviews automatically, according to different dimensions

• The algorithm is based on machine learning

• Can be easily adapted to new domains

• Empirical results

– Trained and tested the algorithm on 4,000 and 500 restaurant reviews, respectively

– Measured ranking accuracy in terms of distance from true ranking (between 0 and 5)

– Achieved accuracy (distance from true ranking) of 0.6 on test data
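One plausible reading of "distance from true ranking" is the mean absolute difference between predicted and true ranks, averaged over all reviews and aspects. That reading is an assumption, as are the aspect names and the toy ranks in this sketch.

```python
def ranking_distance(predicted, actual):
    """Mean absolute difference between predicted and true ranks.

    `predicted` and `actual` are parallel lists of {aspect: rank} dicts,
    one dict per review. Lower is better; 0 means every rank matched.
    """
    total = 0
    count = 0
    for predicted_ranks, true_ranks in zip(predicted, actual):
        for aspect, true_rank in true_ranks.items():
            total += abs(predicted_ranks[aspect] - true_rank)
            count += 1
    return total / count


# Toy example with two reviews and two hypothetical aspects.
predicted = [{"food": 5, "service": 4}, {"food": 2, "service": 3}]
actual = [{"food": 4, "service": 4}, {"food": 2, "service": 5}]
distance = ranking_distance(predicted, actual)
```

Under this reading, a value such as 0.6 would mean the predicted rank is off by a little over half a rank point per aspect on average.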


Using Structured Representation

• Support of complex queries

– Find most negative reviews

– Find restaurants with great food, ignore atmosphere ratings

• Analyze temporal trend in consumer opinion

• Compare several products in multiple dimensions
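Once each review carries per-aspect ranks, the queries listed above reduce to simple filters and sorts. The sketch below is illustrative only: the `RatedReview` type, the aspect names, and the "great food" cutoff of 4 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RatedReview:
    """A review reduced to per-aspect ranks (hypothetical structure)."""
    restaurant: str
    ratings: dict  # aspect name -> rank, higher is better


def most_negative(reviews, n=1):
    """Return the n reviews with the lowest average rank across aspects."""
    def average(review):
        return sum(review.ratings.values()) / len(review.ratings)

    return sorted(reviews, key=average)[:n]


def great_food(reviews, min_food=4):
    """Restaurants with at least one highly ranked food score; other aspects are ignored."""
    return sorted({r.restaurant for r in reviews if r.ratings["food"] >= min_food})


reviews = [
    RatedReview("Oyster Bar", {"food": 5, "service": 4, "atmosphere": 2}),
    RatedReview("Noodle House", {"food": 4, "service": 2, "atmosphere": 1}),
    RatedReview("Corner Cafe", {"food": 2, "service": 2, "atmosphere": 5}),
]
worst = most_negative(reviews)
tasty = great_food(reviews)
```

The same structure would support the other bullets, for example grouping by date for temporal trends, if a date field were added.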


Speech as Content: MIT Lecture Browser

(Glass, Barzilay)


Spoken Lecture Processing

• In the past decade there has been a dramatic increase in the availability of on-line academic lecture material

• Lecture data has not been widely studied by the community

• Human language technology can help educators and students to more effectively create and disseminate lecture recordings


The Research Challenge

1) I've been talking -- I've been multiplying matrices already, but certainly time for me to discuss the rules for matrix multiplication.
2) And the interesting part is the many ways you can do it, and they all give the same answer.
3) So it's -- and they're all important.
4) So matrix multiplication, and then, uh, come inverses.
5) So we're -- uh, we -- mentioned the inverse of a matrix, but there's -- that's a big deal.
6) Lots to do about inverses and how to find them.
7) Okay, so I'll begin with how to multiply two matrices.
8) First way, okay, so suppose I have a matrix A multiplying a matrix B and -- giving me a result -- well, I could call it C.
9) A times B. Okay.
10) Uh, so, l- let me just review the rule for w- for this entry.

8 Rules of Matrix Multiplication: The method for multiplying two matrices A and B to get C = AB can be summarized as follows:
1) Rule 8.1 To obtain the element in the rth row and cth column of C, multiply each element in the rth row of A by the corresponding…

“I want to learn how to multiply matrices”

Using spoken language technology to transcribe, structure, and retrieve recorded lecture material


User Interface

[Screenshot labels: Topic search, Category selection, Lecture hits, Segment hits, Expanded hit, Synchronized transcript, Lecture video, Play]


Lecture Analysis

• Lecture speaking style similar to human-human conversations

• Average of 800 unique words/lecture (~1/3 News Broadcasts)

• Difficult to cover content words w/o topic-specific material

• Language models from text a poor predictor of spoken words

[Chart: Unique Words/Lecture]

Top 10 Vocabulary (1.5K stop-words)

Algebra        Physics     Comp. Science
Diagonal       Maximum     Cdr
Orthogonal     Omega       Arguments
Eigen          Theta       Machine
Matrices       Energy      Procedures
Rows           Volts       Program
Eigenvalues    Force       Cons
Null           Electric    Stream
Determinant    Magnetic    Environment
Transpose      Charge      Expression
Matrix         Field       Procedure
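The two statistics on this slide, unique words per lecture and top vocabulary after stop-word removal, can be computed along these lines. This is a sketch only: the tokenizer is a simple regular expression and the stop-word list is a tiny stand-in for the 1.5K-word list mentioned above.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (the slide refers to a 1.5K-word list).
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "we", "so", "by"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def unique_words(text):
    """Number of distinct word types in a transcript."""
    return len(set(tokenize(text)))


def top_content_words(text, stop_words=STOP_WORDS, n=10):
    """Most frequent words that are not stop words, most frequent first."""
    counts = Counter(word for word in tokenize(text) if word not in stop_words)
    return [word for word, _ in counts.most_common(n)]


transcript = ("the matrix is the product of a matrix and a matrix "
              "so the transpose of the transpose is the matrix")
vocabulary_size = unique_words(transcript)
top_two = top_content_words(transcript, n=2)
```

Run per course, the same counting would produce topic-specific lists like the Algebra, Physics, and Comp. Science columns above.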


Some Transcription and Retrieval Results

• Based on ten computer science lectures

• Recognition Performance

– Vocabulary size: ~37,000

– Word error rate (w/ adaptation): ~41%

• Measured information retrieval ability with text index terms

• Retrieval Performance

– Precision (P = % returned segments containing keyword): 92-95%

– Recall (R = % relevant segments retrieved): 78-88%

procedure, complex numbers, programming, program, abstraction, function, arguments, constructor, variable, recursion, predicate, computer science, fixed-point syntax, primitive data, algorithm, logic programming,…
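For reference, the three measures on this slide can be computed as below. This is the standard textbook formulation (word-level edit distance for WER, set overlap for precision and recall), not the scoring tools used for the reported numbers; the example sentences and segment IDs are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, deletions, insertions) divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[-1][-1] / len(ref)


def precision_recall(returned, relevant):
    """Precision and recall of a set of returned segment IDs against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    return hits / len(returned), hits / len(relevant)


wer = word_error_rate("i want to multiply matrices",
                      "i want multiply the matrices")
precision, recall = precision_recall(returned=[1, 2, 3, 4],
                                     relevant=[2, 3, 4, 5, 6])
```

High precision with lower recall, as reported above, is the pattern one would expect when keywords are sometimes misrecognized: a returned segment usually does contain the keyword, but some segments that contain it are never found.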


Example of Structure Induction


Topic Segmentation

• Task

– Partition a text into a linear sequence of topically coherent segments

• Challenges

– Recognition errors and lack of separators

* >20-30% WER

– Smooth topic transitions

* cf. broadcast news segmentation

– Lack of training data

– Existing segmentation methods achieve poor performance on lecture transcripts


Lecture Segmentation

• Segment boundaries in word-occurrence matrix

– Computes pairwise similarity between sentences

• Red lines show manually determined lecture segments

• Lecture boundaries are not as clear as broadcast news
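A pairwise sentence similarity matrix of the kind described here can be sketched with cosine similarity over raw word counts. The slide does not name the similarity measure, so cosine similarity is an assumption, and the three sentences are invented.

```python
import math
from collections import Counter


def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(sentences):
    """Return sim[i][j], the cosine similarity between sentences i and j."""
    vectors = [Counter(sentence.lower().split()) for sentence in sentences]
    return [[cosine(a, b) for b in vectors] for a in vectors]


sentences = [
    "matrix multiplication rules",
    "rules for matrix inverses",
    "the weather is nice",
]
sim = similarity_matrix(sentences)
```

Plotted as an image, blocks of high similarity along the diagonal suggest topically coherent stretches, and the gaps between blocks suggest candidate segment boundaries.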


Minimum-Cut Segmentation*

• New graph-theoretic formalization of the segmentation objective

– Jointly maximizes within-cluster similarity and minimizes between-cluster similarity

– Incorporates long-range lexical dependencies

• Exact, fast decoding using dynamic programming

Key Strength: Can detect subtle topic changes

* Malioutov & Barzilay, ACL 2006

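The following sketch shows how a normalized-cut objective over contiguous segments can be decoded exactly with dynamic programming. It is a simplified illustration written for this summary, not the authors' implementation: the segment cost is recomputed naively (so it is not fast), the number of segments is given in advance, and the similarity matrix is a toy example.

```python
def min_cut_segmentation(sim, num_segments):
    """Split sentences 0..n-1 into contiguous segments minimizing the summed normalized cut.

    `sim` is a symmetric sentence-similarity matrix. Returns (start, end) index
    pairs with `end` exclusive.
    """
    n = len(sim)
    row_sum = [sum(row) for row in sim]  # similarity of each sentence to all sentences

    def segment_cost(start, end):
        """cut(A, rest) / vol(A) for the segment A = sentences[start:end]."""
        volume = sum(row_sum[start:end])
        within = sum(sim[u][v] for u in range(start, end) for v in range(start, end))
        return (volume - within) / volume if volume > 0 else 0.0

    infinity = float("inf")
    # best[k][i]: minimal cost of splitting the first i sentences into k segments.
    best = [[infinity] * (n + 1) for _ in range(num_segments + 1)]
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cost = best[k - 1][j] + segment_cost(j, i)
                if cost < best[k][i]:
                    best[k][i] = cost
                    back[k][i] = j

    # Follow back-pointers from the end to recover the segment boundaries.
    segments = []
    end = n
    for k in range(num_segments, 0, -1):
        start = back[k][end]
        segments.append((start, end))
        end = start
    return segments[::-1]


# Toy matrix: sentences 0-2 are mutually similar, as are sentences 3-4.
toy_sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
segments = min_cut_segmentation(toy_sim, num_segments=2)
```

Because every pair of sentences contributes to the cost, not only adjacent ones, the objective can take long-range lexical similarity into account, and the contiguity constraint is what makes exact dynamic-programming decoding possible.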


Speech in Education: Language Learning

(Seneff)


Motivation

• There is an ever-increasing need for second language learning

• However, there is a severe shortage of capable language teachers

• Even when teachers are available, they don’t have enough time to interact with students in dialogue exchanges, which is an important part of learning a language

• Perhaps computers can help …



Approach

• Use computers to aid language learning

– Learning pronunciations, new words, grammar, etc.

– More importantly, computers can serve as a conversational partner for students to practice dialogue interaction

– Computers provide non-threatening environment in which to practice communicating

• Leverage our extensive prior research in multilingual spoken dialogue systems to support language learning

• Create a game-like setting for conversational learning

• Currently focused on (bi-directional) learning of Chinese and English


Three Types of Activities

1. Translation: Learning what to say

– “I want to arrive in the morning” “Hmm…” “我想早上到达”

2. Eavesdropping: Learning how to communicate

– “Are you free tomorrow afternoon?” “No, I am going to play basketball.”

3. Conversation: Practicing communication

– “I like to go dancing. Would you like to go dancing with me?”

[Figure label: Equal-Party Conversational System]


Translation: Learning What to Say

• Interacting over the telephone:

– Bilingual recognizer supports seamless language switching

– Student speaks English, system paraphrases, then translates into Mandarin

– Student speaks Mandarin, system paraphrases, then translates into English

• Interacting at a Web page in game mode:

– System poses sentence in English to translate

– User attempts to speak an equivalent sentence in Mandarin

– System evaluates user’s sentence, congratulates them if they succeed, and then moves on to the next sentence

– Number of turns to success becomes an evaluation metric

– Game difficulty level adjusted over time based on student performance
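The two game-mode measurements, turns to success and difficulty adjustment, could be tracked roughly as follows. The slide gives no rules for either, so the thresholds, the 1 to 5 level range, and the function names are all hypothetical.

```python
def turns_to_success(attempt_results):
    """Number of turns until the first accepted attempt, or None if none was accepted.

    `attempt_results` is a sequence of booleans, one per spoken attempt.
    """
    for turn, accepted in enumerate(attempt_results, start=1):
        if accepted:
            return turn
    return None


def adjust_level(level, turns, max_level=5, fast=1, slow=3):
    """Hypothetical rule: move up after a first-try success, down after `slow` or more turns or a failure."""
    if turns is not None and turns <= fast:
        return min(level + 1, max_level)
    if turns is None or turns >= slow:
        return max(level - 1, 1)
    return level


turns = turns_to_success([False, False, True])
next_level = adjust_level(2, turns)
```

Averaged over many sentences, the turn count gives a per-student progress measure of the kind the slide proposes as an evaluation metric.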


A Web-based Translation Game



Equal-Party Conversation

• Most of the dialogue systems we have developed have been based on an asymmetric human-computer relationship

– Computer has access to information sources (weather, flights, restaurants, etc.)

– User makes requests concerning the content of those databases

• An interesting new research topic is the development of computer conversational skills within an equal-partner scenario

– More appropriate than database access domains for beginning students

– Human and computer play identical roles, but with different hobbies and schedules

– In our initial scenario, student must find a mutually agreeable time to jointly participate in a shared hobby


There is simply not enough time . . .

• To describe some of the other projects

– Statistical machine translation (Collins)

– Information extraction (Katz)

– Spoken dialogue system using machine learning (Collins)

– Audio-visual speech recognition (Darrell)

– Acoustic scene analysis (Zue)

– …

• Please visit our web site for more information


Summary

• Future information devices must satisfy the basic needs of human communication

• Natural interactions

– Provide natural interactions using human language

– Integrate multiple modalities

– Accommodate multiple languages

• Content processing

– Transcribe, index, retrieve, and summarize

– Translate speech as well as text

• Work as well as, if not better than, humans

• Much research remains


Thank You!
