Subproject III - Spoken Language Systems


Page 1: Subproject III - Spoken Language Systems

1

Subproject III - Spoken Language Systems

Members:

Lin-shan Lee (PI), Lee-Feng Chien (Co-PI)

Hsin-min Wang (Co-PI), Berlin Chen (Co-PI)

Other Participants:

Sin-Horng Chen, Yih-Ru Wang

Yuan-Fu Liao, Jen-Tzung Chien

Page 2: Subproject III - Spoken Language Systems

2

Outline

Members
Research Theme
Current Achievements with Demos
Future Directions

Page 3: Subproject III - Spoken Language Systems

3

Members

Page 4: Subproject III - Spoken Language Systems

4

Research Theme

Information Extraction and Retrieval

(IE & IR)

Spoken Dialogues

Spoken Document Understanding and Organization

Multimedia Network Content, Networks, Users

[Figure: processing pipeline over the Chinese Broadcast News Archive. Modules: Named Entity Extraction, Segmentation, Topic Analysis and Organization, Summarization, Title Generation, Information Retrieval, Two-dimensional Tree Structure for Organized Topics. Inputs: Input Query, User Instructions. Outputs: retrieval results, titles, summaries.]

[Figure: latent topic model. Documents d_1, d_2, ..., d_i, ..., d_N with probabilities P(d_i); latent topics T_1, T_2, ..., T_k, ..., T_K with P(T_k | d_i); query terms t_1, t_2, ..., t_j, ..., t_n with P(t_j | T_k); query Q = t_1 t_2 ... t_j ... t_n.]

Page 5: Subproject III - Spoken Language Systems

5

Research Roadmap

Current Achievements

• Term Extraction/Organization, Term Translation/Indexing
• Retrieval Modeling
• Title/Summary Generation
• Topic Analysis/Organization
• Information Extraction and Retrieval (IE & IR)
• Spoken Document Understanding and Organization
• Spoken Dialogues
• Distributed Speech Recognition
• Speech & Language Understanding
• …..

Future Directions

• Information Navigation across Multimedia/Spoken Documents
• Cross-language Information Processing
• Knowledge Discovery and Web Mining
• Spoken Language Applications

Page 6: Subproject III - Spoken Language Systems

6

Information Extraction & Retrieval (IE & IR)

Named Entity Extraction from Text/Spoken Documents

Taxonomy Generation

Term Translation

Retrieval Modeling for Text/Spoken Documents

Page 7: Subproject III - Spoken Language Systems

7

Named Entity Extraction from Text/Spoken Documents

Global Information for the Entire Document Extracted from Forward/Backward PAT-Trees
– Some named entities may not be easily identified from a single sentence, but can be extracted when the information in several sentences is considered jointly

Named Entity Matching using Retrieved Text Documents to Identify Some Out-of-Vocabulary (OOV) Words
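The idea of using document-wide evidence can be illustrated with a small sketch. It does not build a PAT-tree; as a stand-in, it counts every short substring in a plain dictionary and keeps those that recur in several sentences, which is the kind of global statistic a forward/backward PAT-tree would supply. The function name, the length limits and the threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def repeated_substrings(sentences, min_len=2, max_len=6, min_sentences=2):
    """Return {substring: number of sentences containing it} for substrings
    that occur in at least `min_sentences` different sentences.

    A dictionary of substrings is used here as a simple stand-in for the
    suffix statistics that a PAT-tree would index (assumption).
    """
    seen_in = defaultdict(set)
    for idx, sent in enumerate(sentences):
        for start in range(len(sent)):
            for length in range(min_len, max_len + 1):
                if start + length > len(sent):
                    break
                seen_in[sent[start:start + length]].add(idx)
    return {s: len(ids) for s, ids in seen_in.items() if len(ids) >= min_sentences}


# Toy usage: "abc" appears in both sentences, so it is a candidate string.
candidates = repeated_substrings(["abcx", "yabc"], min_len=3, max_len=3)
```

Candidates found this way would still need filtering (for example by boundary statistics or a classifier) before being accepted as named entities.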

Page 8: Subproject III - Spoken Language Systems

8

Automatic Taxonomy Generation (1/2)

Problem
– Find relationships and associations between terms, and organize them into a hierarchical structure (i.e., a taxonomy)
– Useful for identifying and analyzing concepts embedded in documents and queries

Method
– An approach for clustering terms into comprehensive hierarchical clusters
– Web mining techniques: relationships between terms are generated automatically from the relationships between the documents retrieved from the Web with those terms
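A minimal sketch of the clustering idea, under stated assumptions: each term is represented by the set of document identifiers a Web search returned for it, term similarity is the Jaccard overlap of those sets, and clusters are merged greedily into a binary tree. The slides do not specify the similarity measure or the linkage rule, so `jaccard`, `build_taxonomy` and the toy data are illustrative only.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two sets of retrieved document ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_taxonomy(term_docs):
    """Greedy agglomerative clustering of terms.

    term_docs maps each term to the (non-empty dict of) document ids retrieved
    with it. Returns a nested tuple of terms representing the hierarchy.
    """
    # Each cluster is (subtree, union of document ids of its terms).
    clusters = [(term, set(docs)) for term, docs in term_docs.items()]
    while len(clusters) > 1:
        # Pick the pair of clusters whose document sets overlap most.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy data (hypothetical retrieval results).
tree = build_taxonomy({
    "speech": {1, 2, 3},
    "voice": {2, 3, 4},
    "stock": {7, 8},
    "bond": {8, 9},
})
# tree == (("speech", "voice"), ("stock", "bond"))
```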

Page 9: Subproject III - Spoken Language Systems

9

Automatic Taxonomy Generation (2/2)

A Typical Example for Term Taxonomy

Page 10: Subproject III - Spoken Language Systems

10

Automatic Term Translation (1/2)

Problem
– Cross-language information retrieval systems usually rely on bilingual dictionaries; however, search terms are very often missing from them because they are proper nouns or OOV words
– Discovering translations of unknown query terms in different languages

Method
– Finding translations of query terms by mining huge quantities of data obtained from the Web
– Correlation/association patterns extracted from parallel bilingual pages retrieved from the Web, from the anchor texts of out-links pointing to multilingual pages, etc.
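One simple way to turn such co-occurrence patterns into a ranking is an association score. The sketch below assumes the anchor texts pointing to the same page have already been grouped into (source-language terms, target-language terms) pairs and scores each candidate with the Dice coefficient. The choice of Dice, the function name and the toy data are assumptions; the slides do not say which association measure is used.

```python
from collections import Counter


def rank_translations(anchor_sets, source_term):
    """Rank target-language terms by Dice association with `source_term`.

    anchor_sets: list of (source_terms, target_terms) pairs, one per linked
    page, collected from the anchor texts that point to that page.
    """
    src_count = 0
    tgt_counts = Counter()
    joint = Counter()
    for src_terms, tgt_terms in anchor_sets:
        tgt_counts.update(set(tgt_terms))
        if source_term in src_terms:
            src_count += 1
            joint.update(set(tgt_terms))
    # Dice coefficient: 2 * co-occurrences / (occurrences of each side).
    scores = {t: 2 * joint[t] / (src_count + tgt_counts[t]) for t in joint}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data with romanised placeholder tokens (hypothetical).
anchors = [
    ({"yahoo"}, {"yahu"}),
    ({"yahoo"}, {"yahu", "qimo"}),
    ({"google"}, {"guge"}),
    ({"news"}, {"qimo"}),
]
ranking = rank_translations(anchors, "yahoo")
# ranking == [("yahu", 1.0), ("qimo", 0.5)]
```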

Page 11: Subproject III - Spoken Language Systems

11

Automatic Term Translation (2/2)

Machine-Extracted Translations

The Live Query Term Translation System (LiveTrans)

http://wkd.iis.sinica.edu.tw/LiveTrans/lt.html

Page 12: Subproject III - Spoken Language Systems

12

Retrieval Modeling for Text/Spoken Documents (1/2)

Problem
– Conventional retrieval models cannot be trained or improved through use
– Word usage mismatch between the query and the documents

Method
– Literal term matching: HMM/N-gram model trained with maximum likelihood (ML) or minimum classification error (MCE) criteria
– Concept matching: Topical mixture model (TMM), extended from PLSA, trained in either a supervised or an unsupervised manner

Page 13: Subproject III - Spoken Language Systems

13

Retrieval Modeling for Text/Spoken Documents (2/2)

HMM/N-gram retrieval model
– A document is viewed as a probabilistic generative model for the query
– Literal term matching

Topical Mixture Model (extended from PLSA)
– A document is composed of a set of K latent topical distributions (unigrams) for predicting the query
– Concept matching
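The contrast between the two models can be sketched as two query-likelihood scorers. The first interpolates a document unigram with a corpus unigram, a common form of the HMM/N-gram retrieval model; the second mixes K topic unigrams with document-specific weights, so a query term can score well even when it never occurs in the document. The interpolation weight `lam`, the function names and the toy numbers are assumptions, and no training (ML or MCE) is shown.

```python
import math


def unigram_score(query, doc_tokens, corpus_tokens, lam=0.7):
    """log P(Q | d) with P(t | d) interpolated with a corpus background model.

    Assumes every query term occurs somewhere in the corpus; otherwise the
    probability is zero and math.log raises ValueError.
    """
    score = 0.0
    for t in query:
        p_doc = doc_tokens.count(t) / len(doc_tokens)
        p_bg = corpus_tokens.count(t) / len(corpus_tokens)
        score += math.log(lam * p_doc + (1 - lam) * p_bg)
    return score


def topic_mixture_score(query, p_term_given_topic, p_topic_given_doc):
    """log P(Q | d) = sum_j log sum_k P(t_j | T_k) * P(T_k | d)."""
    score = 0.0
    for t in query:
        p = sum(
            topic.get(t, 0.0) * weight
            for topic, weight in zip(p_term_given_topic, p_topic_given_doc)
        )
        score += math.log(p)
    return score


# Toy topics: topic 0 is about weather, topic 1 about finance (hypothetical).
topics = [{"rain": 0.5, "typhoon": 0.5}, {"stock": 1.0}]
weather_doc = [0.9, 0.1]   # P(T_k | d) for a weather story
finance_doc = [0.1, 0.9]   # P(T_k | d) for a finance story

# "typhoon" need not appear literally in the weather story to match it.
better = topic_mixture_score(["typhoon"], topics, weather_doc)
worse = topic_mixture_score(["typhoon"], topics, finance_doc)
```

In this toy setting `better` exceeds `worse` because the weather story puts more weight on the topic that generates "typhoon".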

Page 14: Subproject III - Spoken Language Systems

14

Spoken Document Understanding & Organization (1/2)

Problem
– The content of multimedia documents is very often described by the associated speech information
– Unlike text documents, whose paragraphs and titles are easy to look through at a glance, multimedia/spoken documents are unstructured and difficult to retrieve and browse

Page 15: Subproject III - Spoken Language Systems

15

Spoken Document Understanding & Organization (2/2)

Spoken Document Transcription
Multimedia/Spoken Document Segmentation
Summarization for Multimedia/Spoken Documents
Title Generation for Multimedia/Spoken Documents
Topic Analysis and Organization for Multimedia/Spoken Documents

Page 16: Subproject III - Spoken Language Systems

16

Spoken Document Segmentation (Broadcast News)

Dividing a One-hour News Episode into News Stories

An improved audio segmentation technique integrating BIC and divide-and-conquer approaches

Viterbi search over the hidden Markov model of text clusters

[Figure: distance computation]
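The slide does not spell out the improved technique, so the following is only a generic sketch of how a BIC test and divide-and-conquer can be combined: compute the delta-BIC of splitting a feature sequence at each candidate point, accept the best split if its delta-BIC is positive, then recurse on both halves. It uses one-dimensional features with a single Gaussian per segment; real systems typically use multivariate cepstral features. All names, the `penalty` weight and the `margin` are assumptions.

```python
import math


def _log_var(xs):
    """Log of the (biased) sample variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return math.log(max(var, 1e-8))


def delta_bic(xs, i, penalty=1.0):
    """Positive values favour a change point between xs[:i] and xs[i:]."""
    n = len(xs)
    gain = 0.5 * (n * _log_var(xs) - i * _log_var(xs[:i]) - (n - i) * _log_var(xs[i:]))
    # A second 1-D Gaussian adds two parameters (mean and variance).
    return gain - penalty * 0.5 * 2 * math.log(n)


def best_change_point(xs, margin=2):
    """Best split index, or None if no split has a positive delta-BIC."""
    candidates = range(margin, len(xs) - margin + 1)
    i = max(candidates, key=lambda k: delta_bic(xs, k))
    return i if delta_bic(xs, i) > 0 else None


def segment(xs, offset=0, margin=2):
    """Divide and conquer: split at the best change point, then recurse."""
    if len(xs) < 2 * margin:
        return []
    i = best_change_point(xs, margin)
    if i is None:
        return []
    left = segment(xs[:i], offset, margin)
    right = segment(xs[i:], offset + i, margin)
    return left + [offset + i] + right


# Toy feature track with one obvious change after the fifth frame.
frames = [0.0, 0.1, -0.1, 0.05, -0.05, 5.0, 5.1, 4.9, 5.05, 4.95]
boundaries = segment(frames)
```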

Page 17: Subproject III - Spoken Language Systems

17

Title Generation for Spoken Documents (Broadcast News)

[Figure: Training Phase and Generation Phase]

Training Phase
– Developing statistical relationships between words in the training documents and their human-generated titles

For New Spoken Documents (Generation Phase)
– Transcribing them into term sequences
– Identifying suitable terms, and using them to generate a readable title

Training Documents D = {d_j, j = 1, 2, …, N} (text form)
Human-generated Titles of Training Documents T = {t_j, j = 1, 2, …, N} (text form)
New Spoken Documents D = {d_i, i = 1, 2, …, M} (speech form)
Computer-generated Titles of Spoken Documents T = {t_i, i = 1, 2, …, M} (text/speech form)
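A minimal sketch of the two phases, assuming the "statistical relationship" is a simple co-occurrence estimate of P(title word | document word) and that generation only selects candidate title terms. Ordering the selected terms into a readable title, which the slide also mentions, is not covered. The function names and the toy training pairs are hypothetical.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Estimate P(title word | document word) from (document, title) pairs.

    pairs: list of (document tokens, human-written title tokens).
    """
    cooc = defaultdict(Counter)
    for doc, title in pairs:
        for w in set(doc):
            cooc[w].update(set(title))
    model = {}
    for w, counts in cooc.items():
        total = sum(counts.values())
        model[w] = {t: c / total for t, c in counts.items()}
    return model


def suggest_title_terms(model, transcript, top_k=3):
    """Score title terms by summing P(title word | w) over transcript words."""
    scores = Counter()
    for w in transcript:
        for t, p in model.get(w, {}).items():
            scores[t] += p
    return [t for t, _ in scores.most_common(top_k)]


# Toy training pairs (hypothetical).
pairs = [
    (["typhoon", "hits", "coast", "rain"], ["typhoon", "warning"]),
    (["heavy", "rain", "floods", "city"], ["flood", "alert"]),
    (["typhoon", "rain", "wind"], ["typhoon", "update"]),
]
model = train(pairs)
terms = suggest_title_terms(model, ["typhoon", "wind", "strong"], top_k=2)
# terms == ["typhoon", "update"]
```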

Page 18: Subproject III - Spoken Language Systems

18

Topic Analysis and Organization for Spoken Documents (Broadcast News)

Based on Probabilistic Latent Semantic Analysis (PLSA)
– Terms (words, syllable pairs, etc.) and documents are analyzed by probabilities considering a set of latent topics
– Trained by the EM algorithm
– Related documents don't have to share common sets of terms, and related terms don't have to co-exist in the same set of documents

Spoken Documents Clustered by the Latent Topics and Organized in a Two-dimensional Tree Structure, or a Two-layer Map

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$P(t_j \mid d_i) = \sum_{k=1}^{K} P(t_j \mid T_k)\, P(T_k \mid d_i)$$
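The PLSA decomposition P(t_j | d_i) = sum over k of P(t_j | T_k) P(T_k | d_i) is trained with EM. A compact pure-Python sketch of that training loop is given below, assuming a small dense term-count matrix in which every document contains at least one term; the function name, the random initialisation and the fixed iteration count are choices made for illustration, not details taken from the slides.

```python
import random


def plsa(counts, num_topics, iterations=50, seed=0):
    """Fit PLSA by EM.

    counts[i][j] is the count of term j in document i.
    Returns (p_t_T, p_T_d): p_t_T[k][j] = P(t_j | T_k), p_T_d[i][k] = P(T_k | d_i).
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalise(row):
        total = sum(row)
        return [x / total for x in row]

    # Random positive initialisation breaks the symmetry between topics.
    p_t_T = [normalise([rng.random() + 0.1 for _ in range(n_terms)])
             for _ in range(num_topics)]
    p_T_d = [normalise([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(n_docs)]

    for _ in range(iterations):
        # Small floor keeps every probability strictly positive.
        new_t_T = [[1e-12] * n_terms for _ in range(num_topics)]
        new_T_d = [[1e-12] * num_topics for _ in range(n_docs)]
        for i in range(n_docs):
            for j in range(n_terms):
                if counts[i][j] == 0:
                    continue
                # E-step: posterior P(T_k | d_i, t_j), up to normalisation.
                joint = [p_t_T[k][j] * p_T_d[i][k] for k in range(num_topics)]
                total = sum(joint)
                for k in range(num_topics):
                    resp = counts[i][j] * joint[k] / total
                    # M-step accumulators: expected counts per topic.
                    new_t_T[k][j] += resp
                    new_T_d[i][k] += resp
        p_t_T = [normalise(row) for row in new_t_T]
        p_T_d = [normalise(row) for row in new_T_d]
    return p_t_T, p_T_d


# Toy corpus: two documents use terms 0-1, two use terms 2-3.
counts = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]]
p_t_T, p_T_d = plsa(counts, num_topics=2)
```

Documents can then be clustered by their dominant topic, for example `max(range(2), key=lambda k: p_T_d[i][k])`, which is one simple way to obtain the topic groups that a tree or map display would organize.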

Page 19: Subproject III - Spoken Language Systems

19

Spoken Dialogues

Analysis and Design Using Quantitative Simulations

Page 20: Subproject III - Spoken Language Systems

20

Analysis and Design Based on Quantitative Simulations

Problem
– Dialogue performance cannot be predicted before the system is online
– The effects of different factors, such as the system's dialogue strategies and the speech recognition and understanding conditions, cannot be quantitatively identified and analyzed

Method
– Computer-aided analysis and design approaches based on quantitative simulations

[Figure axes: misunderstanding rate, slot loss rate, transaction success rate]
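A quantitative simulation of this kind can be sketched as a Monte Carlo experiment. The model below is deliberately simple and entirely an assumption: a slot-filling dialogue in which each turn either loses the slot value (the system must ask again), fills the slot with a misunderstood value, or fills it correctly, and a transaction succeeds only if every slot is filled correctly within the turn limit. The slides do not define the rates this way, so the code only shows the general shape of such an analysis.

```python
import random


def simulate_dialogue(num_slots, p_misunderstand, p_slot_loss, max_turns, rng):
    """Simulate one dialogue; return True if all slots end up correct."""
    filled = {}
    for _ in range(max_turns):
        missing = [s for s in range(num_slots) if s not in filled]
        if not missing:
            break
        slot = missing[0]
        r = rng.random()
        if r < p_slot_loss:
            continue  # slot value lost; the system asks again next turn
        # Values in [p_slot_loss, p_slot_loss + p_misunderstand) are
        # misunderstood: the slot is filled, but with a wrong value.
        filled[slot] = r >= p_slot_loss + p_misunderstand
    return len(filled) == num_slots and all(filled.values())


def success_rate(trials=10000, seed=0, **kwargs):
    """Estimate the transaction success rate over many simulated dialogues."""
    rng = random.Random(seed)
    wins = sum(simulate_dialogue(rng=rng, **kwargs) for _ in range(trials))
    return wins / trials


# Sweep one factor while holding the others fixed (hypothetical settings).
curve = {
    p: success_rate(num_slots=3, p_misunderstand=p, p_slot_loss=0.1, max_turns=6)
    for p in (0.0, 0.1, 0.2, 0.3)
}
```

Sweeping both rates over a grid yields a success-rate surface that can be inspected before any system is deployed.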

Page 21: Subproject III - Spoken Language Systems

21

Demo: Understanding and Organization of Chinese Broadcast News with Interactive Interface

[Figure: processing pipeline over the Chinese Broadcast News Archive. Modules: Named Entity Extraction, Segmentation, Topic Analysis and Organization, Summarization, Title Generation, Information Retrieval, Two-dimensional Tree Structure for Organized Topics. Inputs: Input Query, User Instructions. Outputs: retrieval results, titles, summaries.]


Page 24: Subproject III - Spoken Language Systems

24

Future Directions

Information Navigation across Multimedia/Spoken Documents
– The fast-growing quantities of multimedia/spoken documents are much more difficult to browse than text documents
– Better approaches are needed to navigate huge quantities of multimedia/spoken documents using comprehensive presentations (e.g., topic taxonomy)

Cross-language Information Processing Technologies
– Reducing language barriers in a future multilingual world
– Seeking international collaboration and resource exchange
– Collaboration between the two major non-English languages may be a good direction

Knowledge Discovery and Web Mining
– The Web offers live, dynamic, and by far the most complete global knowledge available to human beings
– Better approaches are needed to explore Web resources and enhance language processing technologies