Semantic Wordification of Document Collections

Presenter: Yingyu Wu

Outline

• Introduction

• ProjCloud Technique

• Results and Comparisons

• Discussion and Limitations

• Conclusion

Introduction

• Word Cloud

• Two issues with word clouds:

• (1) Existing methods do not yet provide an intuitive visual representation that allows users to link words in the layout to the documents they are meant to represent.

• (2) It is still difficult to construct word clouds inside general polygons while preserving the semantic relationships between words.

• Contributions:

• A novel word cloud-based visualization technique, named ProjCloud.

• (1) It combines multidimensional projection and word clouds, which makes it possible to visualize the similarity among documents together with their corresponding word clouds, extending the exploratory capabilities of word clouds.

• (2) A new approach for building word clouds inside polygons while still preserving the semantic relationship among keywords.

• (3) A mechanism based on spectral sorting that allows arranging words according to their semantic relationship as well as highlighting the most important words in the cloud.

ProjCloud Technique

• Overview of the sequence of steps

• Steps:

• (1) The document collection is mapped into the visual space using a multidimensional projection technique (LSP).

• (2) Points in the visual space are clustered (polygons). Two versions: automatic and user-interactive.

• (3) Keywords (the most frequent words) are extracted, and their relevance is computed to guide the semantic-preserving placement of words.

• (4) The scaling step takes place: keywords are sized based on their relevance and on the area of the containing polygon.

• (5) The optimization algorithm runs to generate the word cloud.
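As an illustration of step (3), the sketch below picks the most frequent words in the documents of one cluster. It is only a minimal sketch: the function name `top_keywords`, the whitespace tokenization, the punctuation stripping, and the stopword list are assumptions made for this example and are not taken from the slides.

```python
from collections import Counter


def top_keywords(cluster_docs, n_keywords=5, stopwords=frozenset()):
    """Return the n most frequent words over the documents of one cluster.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    counts = Counter()
    for text in cluster_docs:
        for token in text.lower().split():
            word = token.strip(".,;:!?()\"'")
            if word and word not in stopwords:
                counts[word] += 1
    return counts.most_common(n_keywords)


# Toy cluster of two short "documents".
docs = ["Word clouds summarize text.", "Clouds of words show frequent words."]
keywords = top_keywords(docs, n_keywords=2, stopwords={"of"})
print(keywords)
```

A real pipeline would likely add stemming and a proper stopword list before counting.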

Keyword Relevance and Semantic Relation

• Let M be the document × term frequency matrix.

• A covariance matrix C is obtained from M.

• Build a graph G where each node corresponds to a keyword and an edge eij connects two keywords (Wi and Wj) if and only if the covariance Cij is among the k largest.

• Assuming that edge eij has weight Cij, the Fiedler vector of G assigns a scalar value ai to each keyword such that the values minimize $\sum_{e_{ij}} C_{ij}\,(a_i - a_j)^2$, subject to $\sum_i a_i = 0$ and $\sum_i a_i^2 = 1$.

• If Cij is large, that is, Wi and Wj are closely related, the two keywords receive similar values.
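The covariance graph and the Fiedler vector described above can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the presented implementation: the function names are hypothetical, the graph is assumed to be connected with non-negative edge weights, and the Fiedler vector is approximated by a shifted power iteration instead of a full eigen-decomposition. The covariance values in the usage part are made up.

```python
def covariance_matrix(M):
    """Sample covariance between the term columns of a document x term matrix."""
    n_docs, n_terms = len(M), len(M[0])
    means = [sum(row[j] for row in M) / n_docs for j in range(n_terms)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in M) / (n_docs - 1)
            for j in range(n_terms)
        ]
        for i in range(n_terms)
    ]


def k_largest_edges(C, k):
    """Keep the k keyword pairs (i < j) with the largest covariance."""
    n = len(C)
    pairs = [(C[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    return sorted(pairs, reverse=True)[:k]


def fiedler_vector(n, edges, iterations=2000):
    """Approximate the Fiedler vector of a weighted graph by power iteration.

    Assumes a connected graph with non-negative weights (w, i, j).
    """
    # Weighted graph Laplacian L = D - W.
    L = [[0.0] * n for _ in range(n)]
    for w, i, j in edges:
        L[i][j] -= w
        L[j][i] -= w
        L[i][i] += w
        L[j][j] += w
    # Twice the largest degree bounds every eigenvalue of L (Gershgorin).
    shift = 2.0 * max(L[i][i] for i in range(n))
    a = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(a) / n
        a = [x - mean for x in a]  # project out the constant eigenvector
        # Multiply by (shift * I - L); its dominant remaining eigenvector
        # is the one with the second-smallest eigenvalue of L.
        a = [shift * a[i] - sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in a) ** 0.5
        a = [x / norm for x in a]
    return a


# Made-up covariance values for four keywords.
keywords = ["word", "cloud", "layout", "polygon"]
C = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.5, 0.1],
    [0.1, 0.5, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
edges = k_largest_edges(C, k=3)
a = fiedler_vector(len(keywords), edges)
order = sorted(range(len(keywords)), key=lambda i: a[i])
print([keywords[i] for i in order])
```

Sorting by the Fiedler values places strongly covarying keywords next to each other; the overall direction of the ordering is arbitrary because the vector is only defined up to sign.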

• The most relevant keyword:

• Let Cijmax be the largest covariance in C, and let Wi and Wj be the corresponding words.

• The most relevant keyword is Wi if the average covariance between Wi and Wk (k = 1, 2, ..., n) is larger than the average covariance of Wj; otherwise it is Wj.
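The selection rule for the most relevant keyword can be sketched as below. This is a minimal sketch: the function name is hypothetical, the largest covariance is searched only among pairs of distinct keywords (an assumption, since the rule refers to two corresponding words), the average is taken over the full row of C, and ties go to Wj. The matrix values are made up.

```python
def most_relevant_keyword(C):
    """Return the index of the most relevant keyword for covariance matrix C."""
    n = len(C)
    # Pair of distinct keywords (i, j) with the largest covariance.
    _, i, j = max((C[r][c], r, c) for r in range(n) for c in range(r + 1, n))

    def average_covariance(r):
        # Average covariance between keyword r and every keyword k = 1..n.
        return sum(C[r]) / n

    return i if average_covariance(i) > average_covariance(j) else j


# Made-up covariance values for four keywords.
C = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.5, 0.1],
    [0.1, 0.5, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
r = most_relevant_keyword(C)
print(r)
```

Here the largest covariance links keywords 0 and 1, and keyword 1 has the larger average covariance, so it is selected.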

• Once the most relevant keyword (Wr) is found, the keywords are sorted in increasing order according to

• In ProjCloud, the order given by the Fiedler vector dictates the position of words in the cloud.

• Sizing keywords

• (1) Each keyword is represented by its bounding box.

• (2) The size of a keyword is set to its scale value, which lies in the interval [fmin, fmax] = [12, 50].

• (3) If the sum of the areas of all keyword bounding boxes is smaller than the area of polygon P, fmax is increased and the values are re-scaled. This process is repeated until the sum of the keyword areas exceeds the area of P.
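The sizing loop can be sketched as follows. Several details are assumptions made only for this example: relevance values are taken to lie in [0, 1] and are mapped linearly onto [fmin, fmax], the bounding box of a word is estimated as 0.6 × font size per character in width and one font size in height, and fmax grows by a fixed step. None of these choices come from the slides.

```python
def scale_font_sizes(keywords, polygon_area, fmin=12.0, fmax=50.0, step=1.0):
    """Grow fmax until the keyword boxes no longer fit below the polygon area.

    keywords: list of (word, relevance) pairs, relevance assumed in [0, 1].
    Returns (font size per word, final fmax).
    """
    if not any(relevance > 0 for _, relevance in keywords):
        raise ValueError("at least one keyword needs a positive relevance")

    def sizes_for(current_fmax):
        # Linear map of relevance onto [fmin, current_fmax] (assumed scaling).
        return {w: fmin + r * (current_fmax - fmin) for w, r in keywords}

    def total_area(font_sizes):
        # Hypothetical box model: width ~ 0.6 * size per character, height ~ size.
        return sum(0.6 * len(w) * s * s for w, s in font_sizes.items())

    font_sizes = sizes_for(fmax)
    while total_area(font_sizes) < polygon_area:
        fmax += step
        font_sizes = sizes_for(fmax)
    return font_sizes, fmax


words = [("cloud", 1.0), ("word", 0.5)]
font_sizes, final_fmax = scale_font_sizes(words, polygon_area=20000.0)
print(font_sizes, final_fmax)
```

The resulting sizes overshoot the polygon area slightly by design, which matches the stopping rule stated above; fitting the words inside the polygon is then left to the optimization step.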

• The optimization Problem

Results

Comparisons

Discussion and Limitations

• ProjCloud is largely dependent on the clustering process.

• If the clustering performs poorly, the word clouds become very hard to fit and read.

• Empty space between clusters.

Conclusion

Thank you
