Semantic Wordification of Document Collections. Presenter: Yingyu Wu.

Outline

• Introduction
• ProjCloud Technique
• Results and Comparisons
• Discussion and Limitations
• Conclusion

Introduction

• Word Cloud

• Two issues with word clouds:

• (1) Existing methods do not yet provide an intuitive visual representation that lets the viewer link words in the layout to the documents they are meant to represent.

• (2) Constructing word clouds inside general polygons while preserving the semantic relationships between words.

• Contributions:

• A novel word-cloud-based visualization technique, named ProjCloud.

• (1) Combines multidimensional projection and word clouds, making it possible to visualize the similarity among documents together with their corresponding word clouds and extending the exploratory capabilities of word clouds.

• (2) A new approach for building word clouds inside polygons while still preserving the semantic relationship among keywords.

• (3) A mechanism based on spectral sorting that allows arranging words according to their semantic relationship as well as highlighting the most important words in the cloud.

ProjCloud Technique

• Overview of the sequence of steps

• Steps:

• (1) The document collection is mapped into the visual space using a multidimensional projection technique (LSP).

• (2) Points in the visual space are clustered (into polygons). Two versions: automatic and user-interactive.

• (3) Keywords (the most frequent words) are extracted, and their relevance is computed to guide the semantic-preserving placement of words.

• (4) In the scaling step, keywords are sized based on their relevance and on the area of the containing polygon.

• (5) The optimization algorithm runs to generate the word cloud.
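
As a rough illustration of step (3), the sketch below picks the most frequent words of a collection as keywords and builds a document × term frequency matrix from them. The function name `term_frequency_matrix`, the lowercase `[a-z]+` tokenizer, and the absence of stop-word filtering are assumptions made for this example and are not taken from the slides.

```python
import re
from collections import Counter

def term_frequency_matrix(documents, n_keywords):
    """Return (keywords, M): the n most frequent words and their per-document counts."""
    # Assumed tokenizer: lowercase alphabetic runs only.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    totals = Counter(tok for doc in tokenized for tok in doc)
    keywords = [word for word, _ in totals.most_common(n_keywords)]
    per_doc = [Counter(doc) for doc in tokenized]
    # Row = document, column = keyword.
    M = [[counts[word] for word in keywords] for counts in per_doc]
    return keywords, M

keywords, M = term_frequency_matrix(["word cloud word", "cloud layout", "word"], 2)
```

In practice a stop-word list would probably be needed so that function words do not dominate the keyword set.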

Keyword Relevance and Semantic Relation

• Let M be the document × term frequency matrix.

• The covariance matrix C is obtained from M.

• Build a graph G where each node corresponds to a keyword and an edge eij connects two keywords (Wi and Wj) if and only if the covariance Cij is among the k largest.

• Assuming that edge eij has weight Cij, the Fiedler vector assigns a scalar value ai to each keyword Wi so as to minimize the sum over all edges eij of Cij (ai - aj)^2.

• If Cij is large, Wi and Wj are closely related and receive similar values.
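
The construction above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the presenter's implementation: the function names are hypothetical, "k-largest" is read as the k largest off-diagonal covariances over the whole matrix, and negative covariances are clamped to zero so that edge weights stay non-negative. The Fiedler vector is taken as the eigenvector of the weighted graph Laplacian for the second-smallest eigenvalue.

```python
import numpy as np

def covariance_graph(M, k):
    """Return (C, W): term covariance and the k-largest-covariance graph weights."""
    C = np.cov(np.asarray(M, dtype=float), rowvar=False)  # columns are terms
    n = C.shape[0]
    rows, cols = np.triu_indices(n, k=1)        # every keyword pair once
    top = np.argsort(C[rows, cols])[::-1][:k]   # the k largest covariances
    W = np.zeros((n, n))
    for i, j in zip(rows[top], cols[top]):
        # Assumption: negative covariances are clamped so weights stay >= 0.
        W[i, j] = W[j, i] = max(C[i, j], 0.0)
    return C, W

def fiedler_vector(W):
    """Eigenvector of the graph Laplacian for the second-smallest eigenvalue."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W              # weighted Laplacian D - W
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    return vecs[:, 1]
```

If the graph is disconnected, the second-smallest eigenvalue is zero and the returned vector only separates components, so k should be large enough to keep the keyword graph connected.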

• The most relevant keyword:

• Cijmax is the largest covariance in C, and Wi and Wj are the corresponding words.

• The most relevant keyword is Wi if the average covariance between Wi and Wk (k = 1, 2, 3, ..., n) is larger than the average covariance of Wj; otherwise it is Wj.

• Once the most relevant keyword (Wr) is found, the keywords are sorted in increasing order according to

• In ProjCloud, the order given by the Fiedler vector dictates the position of words in the cloud.
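
The selection of Wr and the subsequent ordering might look like the sketch below. The sort key did not survive in this transcript, so the example assumes keywords are ordered by the distance |ai - ar| between their Fiedler values and that of Wr; this is a guess, as are the function names. The average covariance is taken over the full row of C, diagonal included, which is also an assumption.

```python
import numpy as np

def most_relevant_keyword(C):
    """Pick Wr from the pair of keywords with the largest off-diagonal covariance."""
    C = np.asarray(C, dtype=float)
    off = C.copy()
    np.fill_diagonal(off, -np.inf)              # ignore variances on the diagonal
    i, j = np.unravel_index(np.argmax(off), off.shape)
    # Compare the average covariance of each candidate with all keywords.
    return int(i) if C[i].mean() > C[j].mean() else int(j)

def order_keywords(a, r):
    """Assumed ordering: increasing |a_i - a_r|, so Wr itself comes first."""
    a = np.asarray(a, dtype=float)
    return [int(i) for i in np.argsort(np.abs(a - a[r]), kind="stable")]
```

With this key, words that are semantically close to Wr in the Fiedler ordering are placed first and more distant words follow.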

• Sizing keywords

• (1) Bounding boxes.

• (2) The size of each keyword is set to its scaled value, which fits in the interval [fmin, fmax] (12, 50).

• (3) If the total area of all keyword bounding boxes is smaller than the area of polygon P, fmax is increased and the values are re-scaled. This process is repeated until the sum of the areas of the keywords exceeds the area of P.
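
The sizing loop can be sketched as below. The bounding-box estimate (each character is `aspect * size` wide and `size` tall), the growth factor `step`, and the iteration cap are hypothetical choices made for this example; a real implementation would measure boxes from actual font metrics.

```python
def scale_sizes(relevance, fmin, fmax):
    """Linearly map relevance values into the interval [fmin, fmax]."""
    lo, hi = min(relevance), max(relevance)
    span = (hi - lo) or 1.0                     # avoid division by zero
    return [fmin + (r - lo) / span * (fmax - fmin) for r in relevance]

def bbox_area(word, size, aspect=0.6):
    # Hypothetical estimate: len(word) glyphs, each aspect*size wide and size tall.
    return len(word) * aspect * size * size

def fit_sizes(words, relevance, polygon_area,
              fmin=12.0, fmax=50.0, step=1.1, max_iter=200):
    """Grow fmax until the summed box areas exceed the polygon area."""
    sizes = scale_sizes(relevance, fmin, fmax)
    for _ in range(max_iter):
        total = sum(bbox_area(w, s) for w, s in zip(words, sizes))
        if total > polygon_area:
            break
        fmax *= step                            # enlarge the upper bound
        sizes = scale_sizes(relevance, fmin, fmax)
    return sizes, fmax
```

Because the loop stops only once the boxes slightly exceed the area of P, the final sizes are an upper bound that the later placement step still has to fit inside the polygon.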

• The optimization Problem

Results

Comparisons

Discussion and Limitations

• ProjCloud is largely dependent on the clustering process.

• If the clustering performs poorly, the word cloud becomes very hard to fit and read.

• Empty space between clusters.

Conclusion

Thank you