Latent Topic-semantic Indexing based Automatic Text Summarization


Transcript of Latent Topic-semantic Indexing based Automatic Text Summarization


Latent Topic-semantic Indexing based Automatic Text Summarization

Jiangsheng Yu, Xue-wen Chen
Presenter: Elaheh Barati

Futurewei Technologies - Wayne State University

December 18, 2016


Outline:
- Introduction
  - Automatic summarization
  - Latent Dirichlet Allocation
- Method
- Experiments
- Conclusions


An Introduction to Automatic Summarization (AS)

- Automatic summarization (AS), or text summarization, is a challenging task of natural language processing (NLP) and machine learning.
- It transforms the source text into a summary text while retaining the most important information in the source.


- Many extraction methods have been proposed in the literature, and some of them are implemented as open-source tools or online services.
- In the last decade, topic-driven approaches became popular, and some work based on pLSI and LDA has achieved significantly better performance.


An Introduction to Latent Dirichlet Allocation

(Figure: the plate notation of LDA, a three-level hierarchical Bayesian (HB) model, in which $\theta_{1:M} \sim \mathrm{Dir}(\alpha)$, $\phi_{1:K} \sim \mathrm{Dir}(\beta)$, $z_{m,1:N_m} \sim \langle\theta_m\rangle$, and $w_{mn} \sim \langle\phi_{z_{mn}}\rangle$.)


For the n-th word in the m-th document, denoted by $w_{mn}$, where $m = 1, \dots, M$ and $n = 1, \dots, N_m$, its topic $z_{mn}$ is a latent variable varying in the set $\{1, \dots, K\}$, satisfying $w_{mn} \sim \langle\phi_{z_{mn}}\rangle$.


The discrete distribution of words: $w_{mn} \sim \mathrm{Multin}(1; \phi_{z_{mn}})$.


The $N_m$ latent topics in the m-th document: $z_{m,1:N_m} \sim \langle\theta_m\rangle$, where $\theta_m$, the (vector) parameter of the multinomial distribution of topics for the m-th document, is itself Dirichlet-distributed: $\theta_{1:M} \sim \mathrm{Dir}(\alpha)$.


LDA models adopt the Dirichlet distribution, the conjugate prior of the multinomial distribution, to describe the priors of the multinomial parameters.
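
As an illustration of this generative story, here is a minimal sketch in Python (NumPy only; all sizes are made-up toy values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only)
M, K, V = 3, 4, 100          # documents, topics, vocabulary size
N = [50, 80, 60]             # N_m: number of words in document m
alpha, beta = 0.1, 0.1       # symmetric Dirichlet hyperparameters

# phi_{1:K} ~ Dir(beta): each topic is a distribution over the vocabulary
phi = rng.dirichlet(np.full(V, beta), size=K)

docs = []
for m in range(M):
    # theta_m ~ Dir(alpha): topic mixture of document m
    theta_m = rng.dirichlet(np.full(K, alpha))
    words = []
    for n in range(N[m]):
        z_mn = rng.choice(K, p=theta_m)       # z_mn ~ <theta_m>
        w_mn = rng.choice(V, p=phi[z_mn])     # w_mn ~ Multin(1; phi_{z_mn})
        words.append(w_mn)
    docs.append(words)
```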


Limitations on LDA

While topic models have been successfully applied to automatic summarization, they are limited in several aspects:


- In LDA-type models, the observation units are restricted to words.
- A topic is usually defined by a discrete distribution over many polysemous words.
- ...


These limitations make the learned topics lack practical significance in many cases, and prevent the topic models from further applications.


Latent Topic-Semantic Indexing

Our assumptions:
- Words in each observation window are indexed by the same topic, where a window in a text corpus may be a word, a sentence, or even a paragraph.
- $K$: the number of topics. Each topic $\phi$ is a discrete distribution over the semantic categories $1, \dots, L$.
- The topics $\phi_1, \dots, \phi_K$ satisfy the prior $\phi_{1:K} \sim \mathrm{Dir}(\beta)$.


- We proposed two deep probabilistic models for topic-driven summarization:

(Figure: plate notations of the two models: (a) $\psi_{1:L}$ are given; (b) $\psi_{1:L} \sim \mathrm{Dir}(\gamma)$.)


- The probability of observing a word $v$ in the semantic category $l$, say $\psi_{lv}$, is given in the following two ways:


  (a) a given prior semantic matrix $\Psi^T = (\psi_{lv})_{L \times V}$, where $V$ is the number of words in the vocabulary, and $L$ the number of semantic labels;
  (b) a non-informative prior semantic matrix $\psi_{1:L} \sim \mathrm{Dir}(\gamma)$, where $\psi_l = (\psi_{l1}, \dots, \psi_{lV})^T$ is a discrete distribution over all words in the vocabulary, $l = 1, \dots, L$.


Latent Topic-Semantic Indexing (cont.)

Model (a): $\psi_{1:L}$ are given.

Assumption of the TSI model: in each window $(m, n)$, the semantics of the words $w^{(1)}_{mn}, \dots, w^{(D_{mn})}_{mn}$ are drawn from the same but unknown topic $z_{mn}$.


For a TSI model, the m-th document is generated as follows:
(1) Choose $\theta_m \sim \mathrm{Dir}(\alpha)$, where $\theta_{1:M} \sim \mathrm{Dir}(\alpha)$.


(2) $z_{mn}$, the topic of window $(m, n)$, is drawn from $z_{mn} \sim \langle\theta_m\rangle$.


(3) The semantics in window $(m, n)$ are generated via $s^{(1)}_{mn}, \dots, s^{(d)}_{mn}, \dots, s^{(D_{mn})}_{mn} \sim \langle\phi_{z_{mn}}\rangle$.


(4) The word $w^{(d)}_{mn}$ is drawn from the semantic category $s^{(d)}_{mn}$ independently.
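
Putting steps (1) to (4) together, a minimal sketch of the model-(a) generative process in Python (NumPy only; all sizes, and the random stand-in semantic matrix, are toy values rather than the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; the paper uses L = 2017 WordNet categories)
M, K, L, V = 3, 4, 10, 100   # documents, topics, semantic categories, vocabulary
N = [20, 30, 25]             # N_m: observation windows in document m
D = 5                        # D_mn: words per window (fixed here for simplicity)
alpha, beta = 0.1, 0.1

# Model (a): the semantic matrix Psi is given; row l is P(word | category l).
# A random stand-in here; the paper builds it from WordNet SynSet counts.
Psi = rng.dirichlet(np.full(V, 1.0), size=L)

# phi_{1:K} ~ Dir(beta): each topic is a distribution over the L categories
phi = rng.dirichlet(np.full(L, beta), size=K)

docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))    # (1) theta_m ~ Dir(alpha)
    windows = []
    for n in range(N[m]):
        z_mn = rng.choice(K, p=theta_m)           # (2) z_mn ~ <theta_m>
        s = rng.choice(L, size=D, p=phi[z_mn])    # (3) semantics of the window
        w = [rng.choice(V, p=Psi[l]) for l in s]  # (4) each word from its category
        windows.append(w)
    docs.append(windows)
```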


TSI vs LDA

LDA is a special case of TSI when:
- the observation window is a word,
- the semantic labels are the words themselves, and
- the semantic matrix is an identity matrix (a quick check of this reduction follows).


Experiments Setup

- Topic-based summarizers are tested on the Brown corpus in the public dataset of SemCor-3.0, which contains 186 documents classified into 15 categories.
- The semantic indexing is restricted to nouns and noun phrases. For this, all the fourth-level noun SynSets in the hypernymy tree of WordNet-3.0 are taken as semantic categories.
- A total of $L = 2017$ semantic categories are used in the TSI model.


- The prior semantic matrix is set by $\psi_{lv} = n_{lv}/n_l$, where $n_l$ is the total number of SynSets in semantic category $l$, and $n_{lv}$ is the number of SynSets of word $v$ in category $l$ (see the sketch below).
- We set $\alpha = (0.1, \dots, 0.1)^T$ and $\beta = (0.1, \dots, 0.1)^T$, which are commonly used default values in many applications.
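
A minimal sketch of this construction (the counts below are fabricated toy numbers; a real version would collect $n_{lv}$ from the fourth-level noun SynSets of WordNet-3.0, e.g. via NLTK's wordnet corpus):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (the paper uses L = 2017 categories over the full vocabulary)
L, V = 4, 6
# n_lv[l, v]: number of SynSets of word v falling in semantic category l
n_lv = rng.integers(0, 5, size=(L, V)).astype(float)

n_l = n_lv.sum(axis=1, keepdims=True)      # n_l: total SynSets in category l
Psi = np.divide(n_lv, n_l,                 # psi_lv = n_lv / n_l
                out=np.zeros_like(n_lv), where=n_l > 0)

# Each nonempty row of Psi is a distribution over words
assert np.allclose(Psi[n_l[:, 0] > 0].sum(axis=1), 1.0)
```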


Evaluation of Summarizers

- Suppose there are $T$ summarizers under test, $M$ documents to review, and a number of reviewers.
- Which summarizer has the best performance?
- We use one-way analysis of variance (ANOVA).


- $A_{M \times T}$: the index matrix, where $M$, $T$ are the numbers of documents and summarizers;
  - the m-th row of $A_{M \times T}$, i.e., $(a_{m1}, \dots, a_{mt}, \dots, a_{mT})$, indicates the ordering of the $T$ summary results of document $m$;
  - the results are scored by $1, 2, \dots, T$, from the worst to the best.
- $B_{M \times T}$: one human review matrix, in which $b_{mt}$ is the score of summarizer $[a_{mt}]$.
- $C_{M \times T}$: the feedback matrix, in which $c_{mt}$ is the score of summarizer $[t]$ on document $m$; it is recovered by $c_{m,a_{mt}} = b_{mt}$, where $m = 1, \dots, M$, $t = 1, \dots, T$.
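
A small sketch of the recovery step $c_{m,a_{mt}} = b_{mt}$ (toy data; summarizers are 0-indexed here, while the slides use 1-based indices):

```python
import numpy as np

# Toy data for M = 2 documents and T = 3 summarizers.
# A[m] records which summarizer produced the 1st, 2nd, ... summary shown;
# B[m] holds the reviewer's scores in that shown (shuffled) order.
A = np.array([[2, 0, 1],
              [1, 2, 0]])
B = np.array([[3, 1, 2],
              [2, 3, 1]])

M, T = A.shape
C = np.zeros_like(B)
for m in range(M):
    C[m, A[m]] = B[m]    # c_{m, a_mt} = b_mt: undo the shuffling

print(C)  # C[m, t] is now the score of summarizer t on document m
```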


Evaluation by One-way ANOVA

1: Input: data $A_{M \times T}$, $B_{M \times T}$, and significance level $\alpha$.
2: Output: ranks of the summarizers.
3: The comparison matrix $H_{T \times T}$ is initialized as zero.
4: We get the feedback matrix $C$ and the mean scores of the $T$ summarizers, then initialize $s_1 \preceq \cdots \preceq s_T$ (ordered by mean score).
5: for all possible pairs $(i, j)$ satisfying $i < j$ do
6:   if $H_0^{(i,j)}$ is rejected at the given level $\alpha$ then
7:     $s_i \prec s_j$, where $\prec$ means "is worse than".
8:     Let $h_{it} = 1$ for all $t \geq j$.
9:   end if
10: end for
11: The summarizer $s_t$ is ranked by the sum of the t-th column of $H$, where $t = 1, \dots, T$.
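
A sketch of this procedure in Python, using SciPy's `f_oneway` for the pairwise tests (with only two groups per test, the one-way ANOVA reduces to a two-sample F-test; how exactly the slides test each $H_0^{(i,j)}$ is our assumption):

```python
import numpy as np
from scipy.stats import f_oneway

def rank_summarizers(C, alpha=0.05):
    """Rank T summarizers from a feedback matrix C (M documents x T summarizers)."""
    M, T = C.shape
    order = np.argsort(C.mean(axis=0))   # s_1 (worst mean) ... s_T (best mean)
    H = np.zeros((T, T), dtype=int)      # comparison matrix, initialized as zero
    for i in range(T):
        for j in range(i + 1, T):
            # H_0^{(i,j)}: summarizers s_i and s_j have equal mean scores
            _, p = f_oneway(C[:, order[i]], C[:, order[j]])
            if p < alpha:                # H_0 rejected: s_i is worse than s_j
                H[i, j:] = 1             # h_it = 1 for all t >= j
    # s_t is ranked by the sum of the t-th column of H
    return {int(order[t]): int(H[:, t].sum()) for t in range(T)}
```

The column sum for $s_t$ counts how many summarizers are significantly worse than it, which is the kind of count reported in the per-reviewer evaluation table later in the talk.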


Mean scores of the four summarizers on the testing Brown corpus:

Reviewer    LDA        TSI        FS         OTS
1           2.541667   3.416667   2.083333   1.958333
2           2.375000   2.916667   2.250000   2.458333
3           2.916667   2.500000   2.125000   2.458333
4           2.791667   3.125000   2.208333   1.875000
5           2.791667   2.708333   2.416667   2.125000
6           2.416667   2.750000   2.666667   2.166667
7           2.333333   3.500000   2.291667   2.041667

- Four summarizers: the topic-based methods (LDA and TSI) and two non-topic-driven references, Open Text Summarizer (OTS) and Free Summarizer (FS).
- Seven volunteers participated in the evaluation.
- For each document, the ordering of the summaries is shuffled.


- From the viewpoint of the first reviewer: $\mathrm{OTS} \prec \mathrm{FS} \prec \mathrm{LDA} \prec \mathrm{TSI}$.
- That the TSI-based summarization outperforms the other methods can be verified by one-way ANOVA of the mean scores.


Evaluation of the overall performance of the four summarizers at the significance level of 0.05:

        1   2   3   4   5   6   7   (reviewer)
LDA     1   0   1   1   1   0   0
TSI     3   1   0   2   0   0   3
FS      0   0   0   0   0   0   0
OTS     0   0   0   0   0   0   0

For example, (TSI, 4) = 2 means that there are 2 summarizers that are significantly worse than the TSI-based method, from the viewpoint of the 4th reviewer.


The results show that:
- the topic-based summarizers are better than the non-topic-based methods, and
- the TSI-based method achieves the best performance.


Conclusions

- We proposed a novel deep probabilistic approach to:
  - index the latent topics and semantics of words in a collection of documents, and
  - apply the topic-semantic indexing (TSI) model to automatic summarization.


- The topic-based summarizers, together with two other non-topic-driven summarizers, FS and OTS, are tested on the Brown corpus in the public dataset of SemCor-3.0.
- The summaries are reviewed by humans.
- The performance of summarization is analyzed by a well-designed blind experiment:
  - each summarizer is evaluated by ranks derived from hypothesis tests of one-way ANOVA.
- The experimental results show that TSI is a promising method for topic-driven summarization.
- In the present TSI-based summarization, each observation window is a word.


- Future work includes more experiments on several distinct sizes of observation windows, efficient extraction strategies, and their ensemble learning.
