Human Language Technology (HLT) Research at MIT CSAIL Zue Harbin-Speech-07-01... · 2018. 5. 8. ·...
Human Language Technology (HLT) Research at MIT CSAIL
Victor Zue ([email protected])
MIT Computer Science and Artificial Intelligence Laboratory
Cambridge, MA 02139, USA
2007-01-19 MIT Computer Science and Artificial Intelligence Laboratory
MIT = Labs + Departments
• Faculty and grad students wear two hats: lab + department
• Departments (aka “courses”): …, Mechanical Engineering (2), Materials Science (3), Chemistry (5), EECS (6), Physics (8), Brain & Cog. Sci. (9), Chemical Engineering (10), Aero & Astro (16), Math (18), …
• Labs: …, CSAIL, LIDS, MTL, RLE, …
CSAIL History
• 1963: Project MAC formed at MIT
• 1970: Artificial Intelligence Laboratory (AI Lab) separated from Project MAC
• 1974: Project MAC renamed to Laboratory for Computer Science (LCS)
• 2003: CSAIL formed by the merger of AI Lab and LCS
• 2004: CSAIL moved into the Stata Center
Our Home: The Stata Center
CSAIL Today
• MIT’s largest interdepartmental laboratory
• About 840 members
– >90 principal investigators (PIs)
* >70 active teaching faculty (from 8 departments)
– ~110 research staff and affiliates
– ~470 graduate students
– ~60 undergraduates
– >20 postdoctoral fellows
– ~45 technical and admin staff
– ~30 visitors
CSAIL PIs
• Systems: Agarwal, Arvind, Asanovic, Balakrishnan, Clark, Corbato, Dennis, Devadas, Ernst, Fano, Guttag, Jackson, Kaashoek, Lampson, Liskov, Miller, Morris, Rinard, Rudolph, Sollins, Terman, Katabi, Madden, Ward, Amarasinghe, Stonebraker
• Perception & Learning: Darrell, Durand, Glass, Golland, Grimson, Horn, Jaakkola, Kaelbling, Katz, Popovic, Richards, Seneff, Zue, Adelson, Fisher, Freeman, Lozano-Perez, Tenenbaum, Willsky, Poggio
Berners-Lee, Weitzner, Barzilay, Collins
• Physical, Biological, & Social Systems: Abelson, Brooks, Davis, Gifford, Knight, Leonard, Long, Massaquoi, Moses, Shrobe, Sussman, Szolovits, Teller, Tidor, Williams, Winston, Roy, Rus, Stultz, Wisdom, O’Reilly
Kellis, Tedrake
• Theory: Demaine, Edelman, Garland, Indyk, Karger, Leighton, Leiserson, Lynch, Meyer, Micali, Rivest, Sipser, Sudan, Berger, Goemans, Goldwasser, Shor, Rubinfeld
• 19 NAS/NAE/IM members
• 6 MacArthur Foundation Genius Awards
• 3 Turing Awards
• 2 Japan Prizes
• 1 Millennium Technology Award
• 1 Knight of the British Empire
• ….
• HLT: 6 PIs, 4 researchers, ~35 graduate students
What are human language technologies?
• They are technologies that enable machines to process human languages
– Speech Coding: efficient & robust speech transmission
– Speech Recognition: speech → text
– Speech Synthesis: text → speech
– Language Understanding: text → meaning
– Language Generation: meaning → text
– Discourse Analysis: understanding in context
– Dialogue Modeling: preparing system’s side of the interaction
– Machine Translation: text in L1 → text in L2
– Information Summarization: verbose information → terse info
– …
They enable humans to communicate with humans and machines on their own terms
The Premise
• “Natural” language is what humans use to communicate
• Future information devices must satisfy the basic needs of human communication
• Current-day devices can show and tell, but many are still deaf and blind
Ubiquitous Needs
• Interact with the physical world
– “Turn down the music.” “It’s a little too hot in here.” “Record the next Red Sox game for me.” …
• Creating, accessing, and managing information
– “Where is the nearest pharmacy?” “When is my next appointment?” “Find the pictures I took at Michelle’s wedding.” …
• Angel on your shoulder
– “Where did I leave my keys?” “Remind me to send the slides to Randy.” “Who was at the meeting yesterday?” “What were the action items?” …
• …
• Human language technology (HLT) is needed for
– Natural human-machine interaction
– Easier access to audio-visual content
Speech as Interface: Dialogue Interactions
(Seneff, Glass)
Next Generation Speech Interfaces
• Can communicate with users through conversation
• Can understand verbal input
– Speech recognition
– Language understanding (in context)
• Can verbalize response
– Language generation
– Speech synthesis
• Can engage in dialogue with a user during the interaction
[Diagram: Speech Recognition, Language Understanding, Context Resolution, Dialogue Management, Language Generation, Speech Synthesis, Audio, Database]
Conversational Interaction at MIT
[Diagram: a Hub connecting Speech Recognition, Language Understanding, Context Resolution, Dialogue Management, Language Generation, Speech Synthesis, Audio, and Database]
• Prototypes:
– Weather, traffic, flight status, and schedules
– Real, up-to-date information
– Access via toll-free telephone numbers
• Conversation requires understanding, generation & dialogue
• System deployment shortens development cycle
video
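The hub-and-component layout on this slide can be illustrated with a toy router that passes one shared frame through each component in turn. This is only a sketch under stated assumptions: every function name, the `Hub` class, and the `WEATHER` table are hypothetical stand-ins invented for illustration, and none of them reflects the actual MIT systems.

```python
from typing import Callable, Dict, List

Frame = Dict[str, str]
Server = Callable[[Frame], Frame]

# Hypothetical toy database; a deployed system would query live sources.
WEATHER = {"boston": "sunny", "seattle": "rainy"}


def recognize(frame: Frame) -> Frame:
    # Stand-in for speech recognition: the "audio" is already a word string.
    frame["text"] = frame["audio"].lower()
    return frame


def understand(frame: Frame) -> Frame:
    # Stand-in for language understanding: spot an intent and a known city.
    words = frame["text"].split()
    if "weather" in words:
        frame["intent"] = "weather"
    for word in words:
        if word in WEATHER:
            frame["city"] = word
    return frame


def resolve_context(frame: Frame) -> Frame:
    # Inherit slots missing from this turn from the previous turn.
    for slot in ("intent", "city"):
        if slot not in frame and "prev_" + slot in frame:
            frame[slot] = frame["prev_" + slot]
    return frame


def manage_dialogue(frame: Frame) -> Frame:
    # Decide what to say: look up an answer, or leave it empty to ask back.
    if frame.get("intent") == "weather" and "city" in frame:
        frame["answer"] = WEATHER[frame["city"]]
    else:
        frame["answer"] = ""
    return frame


def generate(frame: Frame) -> Frame:
    # Stand-in for language generation.
    if frame["answer"]:
        frame["reply"] = f"It is {frame['answer']} in {frame['city'].title()}."
    else:
        frame["reply"] = "Which city do you mean?"
    return frame


def synthesize(frame: Frame) -> Frame:
    # Stand-in for speech synthesis: tag the reply as spoken output.
    frame["speech"] = f"<spoken: {frame['reply']}>"
    return frame


class Hub:
    """Routes each turn through the registered servers and keeps history."""

    def __init__(self, servers: List[Server]) -> None:
        self.servers = servers
        self.history: Frame = {}

    def turn(self, audio: str) -> str:
        frame: Frame = {"audio": audio}
        frame.update({"prev_" + k: v for k, v in self.history.items()})
        for server in self.servers:
            frame = server(frame)
        self.history = {k: frame[k] for k in ("intent", "city") if k in frame}
        return frame["reply"]


PIPELINE = [recognize, understand, resolve_context,
            manage_dialogue, generate, synthesize]
hub = Hub(PIPELINE)
print(hub.turn("What is the weather in Boston"))
print(hub.turn("How about Seattle"))
```

The second turn carries no explicit intent, so the reply depends on the context-resolution step inheriting "weather" from the first turn.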
Visual cues can help
• Visual conversational cues
– Body and head pose tracking
– Integrating with speech for high SNR applications
Importance of Audio-Visual Integration
• Audio and visual signals both contain information about:
– Identity/location of the person: Who is talking? Where is he?
– Linguistic message: What’s she saying?
– Emotion, mood, stress, etc.: How does he feel?
• Proper utilization of these two channels of information can lead to robust and enhanced capabilities, e.g.,
– Locating and identifying the speaker
– Speech understanding augmented with facial features
– Speech, gesture, and sketching integration
– Audio/visual information delivery
Audio-Visual Person Verification
[Diagram: Speaker Identification and Face Identification feed a Joint Decision (accept / reject)]
[Plot: False Reject probability (%) vs. False Accept probability (%), both axes from 0.01 to 40, with curves for Face, Speech, and Combined]
video
• Combined audio-visual input reduces equal error rate by 90%
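The slide does not say how the joint decision is formed. As a minimal sketch, assume each modality produces a match score in [0, 1], the two scores are combined by a weighted sum, and the equal error rate (EER) is found by sweeping the accept threshold. The function names and the toy scores below are invented for illustration and are not the data behind the plot.

```python
from typing import List, Tuple


def fuse(speech: float, face: float, weight: float = 0.5) -> float:
    """Weighted sum of two match scores, each assumed to lie in [0, 1]."""
    return weight * speech + (1.0 - weight) * face


def error_rates(genuine: List[float], impostor: List[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false-accept, false-reject) rates when accepting score >= threshold."""
    false_accept = sum(s >= threshold for s in impostor) / len(impostor)
    false_reject = sum(s < threshold for s in genuine) / len(genuine)
    return false_accept, false_reject


def equal_error_rate(genuine: List[float], impostor: List[float]) -> float:
    """Sweep candidate thresholds; report the error where both rates are closest."""
    best_gap, best_rate = float("inf"), 1.0
    for threshold in sorted(set(genuine) | set(impostor)):
        fa, fr = error_rates(genuine, impostor, threshold)
        if abs(fa - fr) < best_gap:
            best_gap, best_rate = abs(fa - fr), (fa + fr) / 2
    return best_rate


# Toy (speech, face) score pairs, made up for this example only.
GENUINE = [(0.9, 0.4), (0.4, 0.9), (0.8, 0.7), (0.7, 0.8)]
IMPOSTOR = [(0.5, 0.2), (0.2, 0.5), (0.3, 0.3), (0.6, 0.1)]

speech_eer = equal_error_rate([s for s, _ in GENUINE], [s for s, _ in IMPOSTOR])
face_eer = equal_error_rate([f for _, f in GENUINE], [f for _, f in IMPOSTOR])
fused_eer = equal_error_rate([fuse(s, f) for s, f in GENUINE],
                             [fuse(s, f) for s, f in IMPOSTOR])
print(speech_eer, face_eer, fused_eer)
```

In this toy data each modality alone confuses one genuine user with one impostor, while the fused score separates the two groups, which is the qualitative effect the slide reports.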
Audio-Visual Information Delivery
• New, data-driven approach can produce very natural and intelligible synthetic speech
• We can now produce video-realistic animations
• These animated agents can speak and sing in different languages
• We can combine speech synthesis and facial animation to produce realistic avatars
[videos: Mary101-M, Mary101-F, Marilyn, Mary101-sing-J, Mary101-sing-C, English Avatar, Chinese Avatar]
Some Ongoing Challenges
• Robust HLTs for realistic environments
• Shrinking the platform
• Reconfigurable HLTs
• Multimodality
• . . .
video
video
A Specific Problem: Destination Entry
• Speech input will enable natural, efficient interaction
• Spoken dialogue provides flexibility (e.g., error recovery)
• Challenges for a spoken interface: a very large search space!
– e.g., U.S. has 250 million addresses, 28,000 cities, 1.1 million street names
– Wide range of streets per city (e.g., Delmar, AK has 1 street, Houston, TX has 19,000)
Important Ability: Dynamic Vocabularies
• Example: United States street address recognition
– 6.2M unique street, city, state pairs (283K unique words)
– 3-pass recognition has much smaller vocabulary (<20K words)
– Web-based interface integrated with Google Maps API
video
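The slide does not describe what each of the three passes does. One plausible reading, sketched below purely as an assumption, is that each pass recognizes one field (state, then city, then street) and activates only the vocabulary consistent with the earlier passes. The `ADDRESSES` table and the string-matching "recognizer" are hypothetical stand-ins; a real recognizer scores audio, not spellings.

```python
import difflib
from typing import Dict, List, Tuple

# Hypothetical miniature address database: state -> city -> streets.
ADDRESSES: Dict[str, Dict[str, List[str]]] = {
    "massachusetts": {
        "cambridge": ["vassar street", "main street"],
        "boston": ["beacon street", "main street"],
    },
    "texas": {
        "houston": ["main street", "westheimer road"],
    },
}


def recognize(heard: str, vocabulary: List[str]) -> str:
    """Stand-in recognizer: pick the closest entry in the active vocabulary."""
    return difflib.get_close_matches(heard, vocabulary, n=1, cutoff=0.0)[0]


def recognize_address(heard_state: str, heard_city: str,
                      heard_street: str) -> Tuple[str, str, str]:
    # Pass 1: only state names are active.
    state = recognize(heard_state, list(ADDRESSES))
    # Pass 2: only cities in the recognized state are active.
    city = recognize(heard_city, list(ADDRESSES[state]))
    # Pass 3: only streets in the recognized city are active.
    street = recognize(heard_street, ADDRESSES[state][city])
    return street, city, state


print(recognize_address("masachusets", "cambrige", "vasser street"))
```

The point of the design is that no single pass ever searches the full cross-product of streets, cities, and states; each pass sees a short list selected by the previous one.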
A Platform for Map-Based Interactions
• Web-based restaurant content integrated with Google Maps API
• Virtual: web-based deployment allows worldwide access
• Adaptive: dialogue-sensitive vocabulary; speaker identification
• Multimodal: speech and pen interaction
Other Important Capabilities
Accessible
Contextual
Multimodal
Personalizable
Customizable
video
Vehicle Integration
• Collaborating with researchers at BMW North America to integrate speech interfaces into vehicle infrastructure
• Initial prototype has incorporated CityGuide interface
• Test vehicle being used at BMW and CSAIL for system development, data collection, and evaluation
video
Speech as Content: Review Summarization
(Barzilay)
Review Summarization
• Users increasingly rely on online review sites to make decisions about products they plan to buy, restaurants they plan to visit, movies they’d like to watch, etc.
– “What are people saying about the new iPod Nano?”
– “Find me a good and inexpensive Thai restaurant near here”
– “Read me a representative review for The Terminator II”
• Sometimes we want to access this information via voice
– Reading all the reviews is impractical (too many, each with disparate opinions)
– Intelligent access to reviews requires semantic analysis
• How do we develop summarization capabilities to fulfill such needs?
– Automatic, domain-independent, high performance
A First Step…
• Developed an algorithm for review-ranking
– Convert free text reviews to ranking in multiple categories
… I started with a half dozen of freshly shucked oysters and my friend ordered the fried shrimp. … The service was very no-nonsense but very friendly and down to earth… The staff were happy to take our photo for us … Very well worth the very spontaneous visit and I'd definitely go there again given the chance
Finding Structure in Online Reviews
• Score reviews automatically, according to different dimensions
• The algorithm is based on machine learning
• Can be easily adapted to new domains
• Empirical results
– Trained and tested the algorithm on 4,000 and 500 restaurant reviews, respectively
– Measured ranking accuracy in terms of distance from true ranking (between 0 and 5)
– Achieved accuracy (distance from true ranking) of 0.6 on test data
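The slide gives the evaluation measure only as "distance from true ranking". A minimal sketch, assuming this means the absolute difference between predicted and true rank averaged over all reviews and categories, follows. The function name `rank_loss`, the category names, and the numbers are illustrative inventions, not the experiment's data.

```python
from typing import Dict, List

Ranks = Dict[str, int]  # category name -> rank


def rank_loss(predicted: List[Ranks], truth: List[Ranks]) -> float:
    """Mean absolute distance between predicted and true ranks.

    Lower is better; 0.0 means every category of every review was ranked
    exactly right.
    """
    total = 0
    count = 0
    for pred_review, true_review in zip(predicted, truth):
        for category, true_rank in true_review.items():
            total += abs(pred_review[category] - true_rank)
            count += 1
    return total / count


# Two made-up restaurant reviews with three hypothetical categories each.
predicted = [
    {"food": 5, "service": 3, "ambience": 4},
    {"food": 2, "service": 2, "ambience": 1},
]
truth = [
    {"food": 4, "service": 3, "ambience": 5},
    {"food": 2, "service": 4, "ambience": 1},
]
print(rank_loss(predicted, truth))
```

Under this reading, a reported value of 0.6 would mean predictions are off by a bit more than half a rank on average.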
Using Structured Representation
• Support of complex queries
– Find most negative reviews
– Find restaurants with great food, ignore atmosphere ratings
• Analyze temporal trend in consumer opinion
• Compare several products in multiple dimensions
Speech as Content: MIT Lecture Browser
(Glass, Barzilay)
Spoken Lecture Processing
• In the past decade there has been a dramatic increase in the availability of on-line academic lecture material
• Lecture data has not been widely studied by the community
• Human language technology can help educators and students to more effectively create and disseminate lecture recordings
The Research Challenge
“I want to learn how to multiply matrices”
1) I've been talking -- I've been multiplying matrices already, but certainly time for me to discuss the rules for matrix multiplication.
2) And the interesting part is the many ways you can do it, and they all give the same answer.
3) So it's -- and they're all important.
4) So matrix multiplication, and then, uh, come inverses.
5) So we're -- uh, we -- mentioned the inverse of a matrix, but there's -- that's a big deal.
6) Lots to do about inverses and how to find them.
7) Okay, so I'll begin with how to multiply two matrices.
8) First way, okay, so suppose I have a matrix A multiplying a matrix B and -- giving me a result -- well, I could call it C.
9) A times B. Okay.
10) Uh, so, l- let me just review the rule for w- for this entry.

8 Rules of Matrix Multiplication:
The method for multiplying two matrices A and B to get C = AB can be summarized as follows:
1) Rule 8.1 To obtain the element in the rth row and cth column of C, multiply each element in the rth row of A by the corresponding…

Using spoken language technology to transcribe, structure, and retrieve recorded lecture material
User Interface
[Screenshot labels: Topic search, Category selection, Lecture hits, Segment hits, Expanded hit, Synchronized transcript, Lecture video, Play]
Lecture Analysis
• Lecture speaking style similar to human-human conversations
• Average of 800 unique words/lecture (~1/3 News Broadcasts)
• Difficult to cover content words w/o topic-specific material
• Language models from text a poor predictor of spoken words
[Chart: Unique Words/Lecture]
Top 10 Vocabulary (1.5K stop-words)
Algebra        Physics     Comp. Science
Matrix         Field       Procedure
Transpose      Charge      Expression
Determinant    Magnetic    Environment
Null           Electric    Stream
Eigenvalues    Force       Cons
Rows           Volts       Program
Matrices       Energy      Procedures
Eigen          Theta       Machine
Orthogonal     Omega       Arguments
Diagonal       Maximum     Cdr
Some Transcription and Retrieval Results
• Based on ten computer science lectures
• Recognition Performance
– Vocabulary size: ~37,000
– Word error rate (w/ adaptation): ~41%
• Measured information retrieval ability with text index terms
– procedure, complex numbers, programming, program, abstraction, function, arguments, constructor, variable, recursion, predicate, computer science, fixed-point syntax, primitive data, algorithm, logic programming, …
• Retrieval Performance
– Precision (P = % returned segments containing keyword): 92-95%
– Recall (R = % relevant segments retrieved): 78-88%
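For readers unfamiliar with the three measures quoted on this slide, the sketch below computes them in their standard textbook form: word error rate as word-level edit distance divided by reference length, and precision and recall over sets of segment IDs. The example strings and ID sets are invented; they are not drawn from the lecture experiments.

```python
from typing import Set, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


def precision_recall(returned: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Precision = hits / returned; recall = hits / relevant."""
    hits = len(returned & relevant)
    return hits / len(returned), hits / len(relevant)


reference = "so i will begin with how to multiply two matrices"
hypothesis = "so i'll begin with how to multiply to matrices"
print(word_error_rate(reference, hypothesis))
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
```

Retrieval can stay usable at a high word error rate when the errors fall mostly on function words rather than on the content words used as index terms, which is consistent with the precision figures above.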
Example of Structure Induction
Topic Segmentation
• Task
– Partition a text into a linear sequence of topically coherent segments
• Challenges
– Recognition errors and lack of separators
* >20-30% WER
– Smooth topic transitions
* cf. broadcast news segmentation
– Lack of training data
– Existing segmentation methods achieve poor performance on lecture transcripts
Lecture Segmentation
• Segment boundaries in word-occurrence matrix
– Computes pairwise similarity between sentences
• Red lines show manually determined lecture segments
• Lecture boundaries are not as clear as those in broadcast news
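The slide does not name the similarity measure behind the matrix. A common choice, assumed here for illustration only, is cosine similarity between bag-of-words count vectors. The function names and the four sample sentences are made up for this sketch.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(sentences: List[str]) -> List[List[float]]:
    """Pairwise similarity between all sentences (symmetric, 1.0 on the diagonal)."""
    bags = [Counter(sentence.lower().split()) for sentence in sentences]
    return [[cosine(x, y) for y in bags] for x in bags]


sentences = [
    "matrix multiplication rules",
    "rules for matrix multiplication",
    "electric field and charge",
    "the electric field",
]
for row in similarity_matrix(sentences):
    print(" ".join(f"{value:.2f}" for value in row))
```

Sentences on the same topic form bright blocks along the diagonal of such a matrix, and segment boundaries sit where one block ends and the next begins.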
Minimum-Cut Segmentation*
• New graph-theoretic formalization of the segmentation objective
– Jointly maximizes within-cluster similarity and minimizes between-cluster similarity
– Incorporates long-range lexical dependencies
• Exact, fast decoding using dynamic programming
Key Strength: Can detect subtle topic changes
* Malioutov & Barzilay, ACL 2006
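To make the idea concrete, here is a simplified sketch of segmentation by dynamic programming over a normalized-cut style cost. It assumes a precomputed symmetric sentence-similarity matrix and a known number of segments, and it scores each contiguous segment by (similarity leaving the segment) / (total similarity attached to the segment). It is written in the spirit of the cited approach, not as a reproduction of it; any refinements in the paper are omitted, and the function name and toy matrix are illustrative.

```python
from typing import List, Sequence


def segment(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return start indices of k contiguous segments with minimum total cost."""
    n = len(sim)
    row_sum = [sum(row) for row in sim]

    def ncut(i: int, j: int) -> float:
        # Cost of the segment covering sentences i .. j-1.
        volume = sum(row_sum[u] for u in range(i, j))
        within = sum(sim[u][v] for u in range(i, j) for v in range(i, j))
        return (volume - within) / volume if volume else 0.0

    inf = float("inf")
    # best[m][j]: minimum cost of splitting the first j sentences into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                cost = best[m - 1][i] + ncut(i, j)
                if cost < best[m][j]:
                    best[m][j], back[m][j] = cost, i

    # Walk the back-pointers from the end to recover segment starts.
    starts: List[int] = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    return starts[::-1]


# Toy similarity matrix: sentences 0-2 share a topic, as do sentences 3-5.
SIM = [
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.0],
    [0.9, 1.0, 0.9, 0.1, 0.1, 0.0],
    [0.8, 0.9, 1.0, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.9, 0.8],
    [0.0, 0.1, 0.1, 0.9, 1.0, 0.9],
    [0.0, 0.0, 0.1, 0.8, 0.9, 1.0],
]
print(segment(SIM, 2))
```

Because segments must be contiguous, the search over all partitions reduces to a table indexed by (number of segments, end position), which is what makes exact decoding feasible.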
Speech in Education: Language Learning
(Seneff)
Motivation
• There is an ever-increasing need for second language learning
• However, there is a severe shortage of capable language teachers
• Even when teachers are available, they don’t have enough time to interact with students in dialogue exchanges, which is an important part of learning a language
• Perhaps computers can help …
Approach
• Use computers to aid language learning
– Learning pronunciations, new words, grammar, etc.
– More importantly, computers can serve as a conversational partner for students to practice dialogue interaction
– Computers provide non-threatening environment in which to practice communicating
• Leverage our extensive prior research in multilingual spoken dialogue systems to support language learning
• Create a game-like setting for conversational learning
• Currently focused on (bi-directional) learning of Chinese and English
Three Types of Activities
1. Translation: Learning what to say
– I want to arrive in the morning
– 我想早上到达
2. Eavesdropping: Learning how to communicate
– Are you free tomorrow afternoon?
– No, I am going to play basketball.
3. Conversation: Practicing communication
– I like to go dancing. Would you like to go dancing with me?
Equal-Party Conversational System
Hmm…
Translation: Learning What to Say
• Interacting over the telephone:
– Bilingual recognizer supports seamless language switching
– Student speaks English, system paraphrases, then translates into Mandarin
– Student speaks Mandarin, system paraphrases, then translates into English
• Interacting at a Web page in game mode:
– System poses sentence in English to translate
– User attempts to speak an equivalent sentence in Mandarin
– System evaluates user’s sentence, congratulates them if they succeed, and then moves on to the next sentence
– Number of turns to success becomes an evaluation metric
– Game difficulty level adjusted over time based on student performance
A Web-based Translation Game
Equal-Party Conversation
• Most of the dialogue systems we have developed have been based on an asymmetric human-computer relationship
– Computer has access to information sources (weather, flights, restaurants, etc.)
– User makes requests concerning the content of those databases
• An interesting new research topic is the development of computer conversational skills within an equal-partner scenario
– More appropriate than database access domains for beginning students
– Human and computer play identical roles, but with different hobbies and schedules
– In our initial scenario, student must find a mutually agreeable time to jointly participate in a shared hobby
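The goal state of the initial scenario, a shared hobby at a time both parties are free, can be written down as a pair of set intersections. The sketch below shows only that underlying constraint; the `Party` class, the slot labels, and the sample data are hypothetical, and the dialogue needed to discover this information through conversation is the actual research problem and is not modeled here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Party:
    """One conversational partner: identical role, different hobbies and schedule."""
    name: str
    hobbies: FrozenSet[str]
    free_slots: FrozenSet[str]


def agreeable_plans(a: Party, b: Party) -> List[Tuple[str, str]]:
    """All (hobby, time slot) pairs that suit both parties, in sorted order."""
    shared_hobbies = sorted(a.hobbies & b.hobbies)
    shared_slots = sorted(a.free_slots & b.free_slots)
    return [(hobby, slot) for hobby in shared_hobbies for slot in shared_slots]


student = Party("student",
                frozenset({"dancing", "basketball"}),
                frozenset({"saturday morning", "sunday afternoon"}))
computer = Party("computer",
                 frozenset({"dancing", "swimming"}),
                 frozenset({"sunday afternoon", "monday evening"}))
print(agreeable_plans(student, computer))
```

Because both parties are represented by the same structure, neither side holds a privileged database, which mirrors the symmetric roles described above.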
There is simply not enough time . . .
• To describe some of the other projects
– Statistical machine translation (Collins)
– Information extraction (Katz)
– Spoken dialogue system using machine learning (Collins)
– Audio-visual speech recognition (Darrell)
– Acoustic scene analysis (Zue)
– …
• Please visit our web site for more information
Summary
• Future information devices must satisfy the basic needs of human communication
• Natural interactions
– Provide natural interactions using human language
– Integrate multiple modalities
– Accommodate multiple languages
• Content processing
– Transcribe, index, retrieve, and summarize
– Translate speech as well as text
• Work as well as, if not better than, humans
• Much research remains
Thank You!