Latent Topic-semantic Indexing based Automatic Text Summarization


Transcript of Latent Topic-semantic Indexing based Automatic Text Summarization


Latent Topic-semantic Indexing based Automatic Text Summarization

Jiangsheng Yu, Xue-wen Chen
Presenter: Elaheh Barati

Futurewei Technologies - Wayne State University

December 18, 2016


Outline:
- Introduction
  - Automatic summarization
  - Latent Dirichlet Allocation
- Method
- Experiments
- Conclusions


An Introduction to Automatic Summarization (AS)

- Automatic summarization (AS), or text summarization, is a challenging task of natural language processing (NLP) and machine learning.
- It transforms the source text into a summary text while retaining the most important information in the source.


- Many extraction methods have been proposed in the literature, and some of them are implemented as open-source tools or online services.
- In the last decade, topic-driven approaches became popular, and some work based on pLSI and LDA has achieved significantly better performance.


An Introduction to Latent Dirichlet Allocation

(Figure: the plate notation of LDA, a three-level hierarchical Bayesian (HB) model, in which $\theta_{1:M} \sim \mathrm{Dir}(\alpha)$, $\phi_{1:K} \sim \mathrm{Dir}(\beta)$, $z_{m,1:N_m} \sim \langle\theta_m\rangle$, and $w_{mn} \sim \langle\phi_{z_{mn}}\rangle$.)


For the n-th word in the m-th document, denoted by $w_{mn}$, where $m = 1, \dots, M$ and $n = 1, \dots, N_m$, its topic $z_{mn}$ is a latent variable varying in the set $\{1, \dots, K\}$, satisfying $w_{mn} \sim \langle\phi_{z_{mn}}\rangle$.


The discrete distribution of words: $w_{mn} \sim \mathrm{Multin}(1; \phi_{z_{mn}})$.


The $N_m$ latent topics in the m-th document: $z_{m,1:N_m} \sim \langle\theta_m\rangle$, where $\theta_m$, the (vector) parameter of the multinomial distribution of topics for the m-th document, is itself Dirichlet-distributed: $\theta_{1:M} \sim \mathrm{Dir}(\alpha)$.


LDA models adopt the Dirichlet distribution, the conjugate prior of the multinomial distribution, to describe the priors of the multinomial parameters.
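
As an illustration of this generative story, here is a minimal sketch in Python (NumPy only; all sizes are made-up toy values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only)
M, K, V = 3, 4, 100          # documents, topics, vocabulary size
N = [50, 80, 60]             # N_m: number of words in document m
alpha, beta = 0.1, 0.1       # symmetric Dirichlet hyperparameters

# phi_{1:K} ~ Dir(beta): each topic is a distribution over the vocabulary
phi = rng.dirichlet(np.full(V, beta), size=K)

docs = []
for m in range(M):
    # theta_m ~ Dir(alpha): topic mixture of document m
    theta_m = rng.dirichlet(np.full(K, alpha))
    words = []
    for n in range(N[m]):
        z_mn = rng.choice(K, p=theta_m)       # z_mn ~ <theta_m>
        w_mn = rng.choice(V, p=phi[z_mn])     # w_mn ~ Multin(1; phi_{z_mn})
        words.append(w_mn)
    docs.append(words)
```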


Limitations on LDA

While topic models have been successfully applied to automatic summarization, they are limited in several aspects:


- In LDA-type models, the observation units are restricted to words.
- A topic is usually defined by a discrete distribution over many polysemous words.
- ...


These limitations make the learned topics lack practical significance in many cases, and prevent the topic models from further applications.


Latent Topic-Semantic Indexing

Our assumptions:
- Words in each observation window are indexed by the same topic, where a window in a text corpus may be a word, a sentence, or even a paragraph.
- $K$: the number of topics. Each topic $\phi$ is a discrete distribution over the semantic categories $1, \dots, L$.
- The topics $\phi_1, \dots, \phi_K$ satisfy the prior $\phi_{1:K} \sim \mathrm{Dir}(\beta)$.


- We proposed two deep probabilistic models for topic-driven summarization:

(Figure: plate notations of the two models: (a) $\psi_{1:L}$ are given; (b) $\psi_{1:L} \sim \mathrm{Dir}(\gamma)$.)


- The probability of observing a word $v$ in the semantic category $l$, say $\psi_{lv}$, is given in the following two ways:


  (a) a given prior semantic matrix $\Psi^T = (\psi_{lv})_{L \times V}$, where $V$ is the number of words in the vocabulary, and $L$ the number of semantic labels;
  (b) a non-informative prior semantic matrix $\psi_{1:L} \sim \mathrm{Dir}(\gamma)$, where $\psi_l = (\psi_{l1}, \dots, \psi_{lV})^T$ is a discrete distribution over all words in the vocabulary, $l = 1, \dots, L$.


Latent Topic-Semantic Indexing (cont.)

Model (a): $\psi_{1:L}$ are given.

Assumption of the TSI model: in each window $(m, n)$, the semantics of the words $w^{(1)}_{mn}, \dots, w^{(D_{mn})}_{mn}$ are drawn from the same but unknown topic $z_{mn}$.


For a TSI model, the m-th document is generated as follows:
(1) Choose $\theta_m \sim \mathrm{Dir}(\alpha)$, where $\theta_{1:M} \sim \mathrm{Dir}(\alpha)$.


(2) $z_{mn}$, the topic of window $(m, n)$, is drawn from $z_{mn} \sim \langle\theta_m\rangle$.


(3) The semantics in window $(m, n)$ are generated via $s^{(1)}_{mn}, \dots, s^{(d)}_{mn}, \dots, s^{(D_{mn})}_{mn} \sim \langle\phi_{z_{mn}}\rangle$.


(4) The word $w^{(d)}_{mn}$ is drawn from the semantic category $s^{(d)}_{mn}$ independently.
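
Putting steps (1) to (4) together, a minimal sketch of the model-(a) generative process in Python (NumPy only; all sizes, and the random stand-in semantic matrix, are toy values rather than the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; the paper uses L = 2017 WordNet categories)
M, K, L, V = 3, 4, 10, 100   # documents, topics, semantic categories, vocabulary
N = [20, 30, 25]             # N_m: observation windows in document m
D = 5                        # D_mn: words per window (fixed here for simplicity)
alpha, beta = 0.1, 0.1

# Model (a): the semantic matrix Psi is given; row l is P(word | category l).
# A random stand-in here; the paper builds it from WordNet SynSet counts.
Psi = rng.dirichlet(np.full(V, 1.0), size=L)

# phi_{1:K} ~ Dir(beta): each topic is a distribution over the L categories
phi = rng.dirichlet(np.full(L, beta), size=K)

docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))    # (1) theta_m ~ Dir(alpha)
    windows = []
    for n in range(N[m]):
        z_mn = rng.choice(K, p=theta_m)           # (2) z_mn ~ <theta_m>
        s = rng.choice(L, size=D, p=phi[z_mn])    # (3) semantics of the window
        w = [rng.choice(V, p=Psi[l]) for l in s]  # (4) each word from its category
        windows.append(w)
    docs.append(windows)
```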


TSI vs LDA

LDA is a special case of TSI when:
- the observation window is a word,
- the semantic labels are the words themselves, and
- the semantic matrix is an identity matrix (a quick check of this reduction follows).


Experiments Setup

- Topic-based summarizers are tested on the Brown corpus in the public dataset of SemCor-3.0, which contains 186 documents classified into 15 categories.
- The semantic indexing is restricted to nouns and noun phrases. For this, all the fourth-level noun SynSets in the hypernymy tree of WordNet-3.0 are taken as semantic categories.
- A total of $L = 2017$ semantic categories are used in the TSI model.


- The prior semantic matrix is set by $\psi_{lv} = n_{lv}/n_l$, where $n_l$ is the total number of SynSets in semantic category $l$, and $n_{lv}$ is the number of SynSets of word $v$ in category $l$ (see the sketch below).
- We set $\alpha = (0.1, \dots, 0.1)^T$ and $\beta = (0.1, \dots, 0.1)^T$, which are commonly used default values in many applications.
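
A minimal sketch of this construction (the counts below are fabricated toy numbers; a real version would collect $n_{lv}$ from the fourth-level noun SynSets of WordNet-3.0, e.g. via NLTK's wordnet corpus):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (the paper uses L = 2017 categories over the full vocabulary)
L, V = 4, 6
# n_lv[l, v]: number of SynSets of word v falling in semantic category l
n_lv = rng.integers(0, 5, size=(L, V)).astype(float)

n_l = n_lv.sum(axis=1, keepdims=True)      # n_l: total SynSets in category l
Psi = np.divide(n_lv, n_l,                 # psi_lv = n_lv / n_l
                out=np.zeros_like(n_lv), where=n_l > 0)

# Each nonempty row of Psi is a distribution over words
assert np.allclose(Psi[n_l[:, 0] > 0].sum(axis=1), 1.0)
```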


Evaluation of Summarizers

- Suppose there are $T$ summarizers under test, $M$ documents to review, and a number of reviewers.
- Which summarizer has the best performance?
- We use one-way analysis of variance (ANOVA).


- $A_{M \times T}$: the index matrix, where $M$, $T$ are the numbers of documents and summarizers;
  - the m-th row of $A_{M \times T}$, i.e., $(a_{m1}, \dots, a_{mt}, \dots, a_{mT})$, indicates the ordering of the $T$ summary results of document $m$;
  - the results are scored by $1, 2, \dots, T$, from the worst to the best.
- $B_{M \times T}$: one human review matrix, in which $b_{mt}$ is the score of summarizer $[a_{mt}]$.
- $C_{M \times T}$: the feedback matrix, in which $c_{mt}$ is the score of summarizer $[t]$ on document $m$; it is recovered by $c_{m,a_{mt}} = b_{mt}$, where $m = 1, \dots, M$, $t = 1, \dots, T$.
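
A small sketch of the recovery step $c_{m,a_{mt}} = b_{mt}$ (toy data; summarizers are 0-indexed here, while the slides use 1-based indices):

```python
import numpy as np

# Toy data for M = 2 documents and T = 3 summarizers.
# A[m] records which summarizer produced the 1st, 2nd, ... summary shown;
# B[m] holds the reviewer's scores in that shown (shuffled) order.
A = np.array([[2, 0, 1],
              [1, 2, 0]])
B = np.array([[3, 1, 2],
              [2, 3, 1]])

M, T = A.shape
C = np.zeros_like(B)
for m in range(M):
    C[m, A[m]] = B[m]    # c_{m, a_mt} = b_mt: undo the shuffling

print(C)  # C[m, t] is now the score of summarizer t on document m
```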


Evaluation by One-way ANOVA

1: Input: data $A_{M \times T}$, $B_{M \times T}$, and significance level $\alpha$.
2: Output: ranks of the summarizers.
3: The comparison matrix $H_{T \times T}$ is initialized as zero.
4: We get the feedback matrix $C$ and the mean scores of the $T$ summarizers, then initialize $s_1 \preceq \cdots \preceq s_T$ (ordered by mean score).
5: for all possible pairs $(i, j)$ satisfying $i < j$ do
6:   if $H_0^{(i,j)}$ is rejected at the given level $\alpha$ then
7:     $s_i \prec s_j$, where $\prec$ means "is worse than".
8:     Let $h_{it} = 1$ for all $t \geq j$.
9:   end if
10: end for
11: The summarizer $s_t$ is ranked by the sum of the t-th column of $H$, where $t = 1, \dots, T$.
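
A sketch of this procedure in Python, using SciPy's `f_oneway` for the pairwise tests (with only two groups per test, the one-way ANOVA reduces to a two-sample F-test; how exactly the slides test each $H_0^{(i,j)}$ is our assumption):

```python
import numpy as np
from scipy.stats import f_oneway

def rank_summarizers(C, alpha=0.05):
    """Rank T summarizers from a feedback matrix C (M documents x T summarizers)."""
    M, T = C.shape
    order = np.argsort(C.mean(axis=0))   # s_1 (worst mean) ... s_T (best mean)
    H = np.zeros((T, T), dtype=int)      # comparison matrix, initialized as zero
    for i in range(T):
        for j in range(i + 1, T):
            # H_0^{(i,j)}: summarizers s_i and s_j have equal mean scores
            _, p = f_oneway(C[:, order[i]], C[:, order[j]])
            if p < alpha:                # H_0 rejected: s_i is worse than s_j
                H[i, j:] = 1             # h_it = 1 for all t >= j
    # s_t is ranked by the sum of the t-th column of H
    return {int(order[t]): int(H[:, t].sum()) for t in range(T)}
```

The column sum for $s_t$ counts how many summarizers are significantly worse than it, which is the kind of count reported in the per-reviewer evaluation table later in the talk.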


Mean scores of the four summarizers on the testing Brown corpus:

Reviewer    LDA        TSI        FS         OTS
1           2.541667   3.416667   2.083333   1.958333
2           2.375000   2.916667   2.250000   2.458333
3           2.916667   2.500000   2.125000   2.458333
4           2.791667   3.125000   2.208333   1.875000
5           2.791667   2.708333   2.416667   2.125000
6           2.416667   2.750000   2.666667   2.166667
7           2.333333   3.500000   2.291667   2.041667

- Four summarizers: the topic-based methods (LDA and TSI) and two non-topic-driven references, Open Text Summarizer (OTS) and Free Summarizer (FS).
- Seven volunteers participated in the evaluation.
- For each document, the ordering of the summaries is shuffled.


- From the viewpoint of the first reviewer: $\mathrm{OTS} \prec \mathrm{FS} \prec \mathrm{LDA} \prec \mathrm{TSI}$.
- That the TSI-based summarization outperforms the other methods can be verified by one-way ANOVA of the mean scores.


Evaluation of the overall performance of the four summarizers at the significance level of 0.05:

        1   2   3   4   5   6   7   (reviewer)
LDA     1   0   1   1   1   0   0
TSI     3   1   0   2   0   0   3
FS      0   0   0   0   0   0   0
OTS     0   0   0   0   0   0   0

For example, (TSI, 4) = 2 means that there are 2 summarizers that are significantly worse than the TSI-based method, from the viewpoint of the 4th reviewer.


The results show that:
- the topic-based summarizers are better than the non-topic-based methods, and
- the TSI-based method achieves the best performance.


Conclusions

- We proposed a novel deep probabilistic approach to:
  - index the latent topics and semantics of words in a collection of documents, and
  - apply the topic-semantic indexing (TSI) model to automatic summarization.


- The topic-based summarizers, together with two other non-topic-driven summarizers, FS and OTS, are tested on the Brown corpus in the public dataset of SemCor-3.0.
- The summaries are reviewed by humans.
- The performance of summarization is analyzed by a well-designed blind experiment:
  - each summarizer is evaluated by ranks derived from hypothesis tests of one-way ANOVA.
- The experimental results show that TSI is a promising method for topic-driven summarization.
- In the present TSI-based summarization, each observation window is a word.


- Future work includes more experiments on several distinct sizes of observation windows, efficient extraction strategies, and their ensemble learning.
