Semantic Wordification of Document Collections
Presenter: Yingyu Wu
Outline
• Introduction
• ProjCloud Technique
• Results and Comparisons
• Discussion and Limitations
• Conclusion
Introduction
• Word Cloud
• Two issues with word clouds:
• (1) Existing methods do not yet provide an intuitive visual representation that lets users link the words in the layout to the documents they are meant to represent.
• (2) Constructing word clouds inside general polygons while preserving the semantic relationships between words.
• Contributions:
• A novel word cloud-based visualization technique, named ProjCloud.
• (1) It combines multidimensional projection and word clouds, which makes it possible to visualize the similarity among documents together with their corresponding word clouds and extends the exploratory capabilities of word clouds.
• (2) A new approach for building word clouds inside polygons while still preserving the semantic relationship among keywords.
• (3) A mechanism based on spectral sorting that allows arranging words according to their semantic relationship as well as highlighting the most important words in the cloud.
ProjCloud Technique
• Overview of the sequence of steps
• Steps:
• (1) The document collection is mapped into the visual space using a multidimensional projection technique (LSP).
• (2) Points in the visual space are clustered, with each cluster delimited by a polygon. Two versions: automatic and user-interactive.
• (3) Keywords (the most frequent words) are extracted, and their relevance is computed to guide the semantic-preserving placement of words.
• (4) The scaling step takes place: keywords are sized based on their relevance and on the area of the containing polygon.
• (5) The optimization algorithm runs to generate the word cloud.
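Step (3) selects keywords by frequency. A minimal sketch of that selection for one cluster is shown here; the tokenizer, the tiny stop-word list, and the function name `top_keywords` are assumptions made for illustration and are not taken from ProjCloud itself.

```python
from collections import Counter
import re

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def top_keywords(documents, n=5):
    """Return the n most frequent non-stop-words across a cluster's documents."""
    counts = Counter()
    for text in documents:
        # Lowercase and keep alphabetic runs only (a deliberately simple tokenizer).
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Toy cluster of three short documents.
cluster = [
    "Word clouds summarize the content of documents",
    "Projection of documents preserves similarity",
    "Clouds of words inside polygons",
]
print(top_keywords(cluster, n=3))
```

The returned (word, count) pairs would then feed the relevance computation and the sizing step.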
Keyword Relevance and Semantic Relation
• Let M be the document × term frequency matrix.
• The covariance matrix C is obtained from M.
• Build a graph G where each node corresponds to a keyword and an edge eij connects two keywords (Wi and Wj) if and only if the covariance Cij is among the k largest ones.
• Assuming that edge eij has weight Cij, the Fiedler vector assigns a scalar value ai to each keyword Wi so as to minimize $\sum_{e_{ij}} C_{ij}\,(a_i - a_j)^2$ under the usual constraints $\sum_i a_i = 0$ and $\sum_i a_i^2 = 1$.
• If Cij is large, i.e. Wi and Wj are closely related, the two keywords receive similar values.
• The most relevant keyword:
• Cijmax is the largest covariance in C, and Wi and Wj are the corresponding words.
• The most relevant keyword is Wi if the average covariance between Wi and Wk (k = 1, 2, 3, ..., n) is larger than the average covariance of Wj; otherwise it is Wj.
• Once we have the most relevant keyword (Wr), the keywords are sorted in increasing order according to
• In ProjCloud, the order given by the Fiedler vector dictates the position of words in the cloud.
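The spectral ordering described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the ProjCloud implementation: negative covariances are clipped to zero when used as edge weights, the graph is assumed to be connected, the average covariance includes the diagonal entry, and the final sort key (distance to Wr in the Fiedler vector) is an assumption, since the slides do not show the sorting expression. The toy matrix `M` and the keyword names are made up.

```python
import numpy as np


def fiedler_order(M, k):
    """Order keywords (columns of M) using the Fiedler vector of a covariance graph.

    M is a documents x terms frequency matrix; k is the number of largest
    covariances kept as graph edges. Returns (r, a, order), where r is the
    index of the most relevant keyword, a is the Fiedler vector and order
    lists keyword indices sorted by |a_i - a_r|.
    """
    C = np.cov(M, rowvar=False)            # terms x terms covariance matrix
    n = C.shape[0]
    iu = np.triu_indices(n, k=1)           # every unordered keyword pair once
    top = np.argsort(C[iu])[::-1][:k]      # positions of the k largest covariances

    W = np.zeros((n, n))
    for i, j in zip(iu[0][top], iu[1][top]):
        # Assumption: clip negative covariances so edge weights stay non-negative.
        W[i, j] = W[j, i] = max(C[i, j], 0.0)

    L = np.diag(W.sum(axis=1)) - W         # weighted graph Laplacian
    _, vecs = np.linalg.eigh(L)            # eigenvalues come back in ascending order
    a = vecs[:, 1]                         # Fiedler vector (assumes a connected graph)

    # Most relevant keyword: of the pair with the largest covariance, take
    # the one whose average covariance with all keywords is larger.
    i, j = iu[0][top[0]], iu[1][top[0]]
    r = i if C[i].mean() >= C[j].mean() else j

    # Assumption: sort keywords by their distance to W_r along the Fiedler vector.
    order = np.argsort(np.abs(a - a[r]), kind="stable")
    return r, a, order


# Toy data: 8 documents x 4 keywords; the last five documents use none of them.
keywords = ["cloud", "word", "layout", "polygon"]
M = np.array([
    [3, 2, 0, 0],
    [0, 2, 2, 0],
    [0, 0, 2, 2],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
], dtype=float)

r, a, order = fiedler_order(M, k=3)
print("most relevant:", keywords[r])
print("placement order:", [keywords[i] for i in order])
```

Because the sort key uses absolute differences, the result does not depend on the arbitrary sign of the eigenvector returned by `eigh`.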
• Sizing keywords
• (1) Compute bounding boxes for the keywords.
• (2) The size of each keyword is set to a scaled value that fits in the interval [fmin, fmax], with fmin = 12 and fmax = 50.
• (3) If the sum of the areas of all keyword bounding boxes is smaller than the area of polygon P, fmax is increased and the values are re-scaled. This process is repeated until the sum of the areas of the keywords exceeds the area of P.
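The sizing loop described above can be sketched as follows. The bounding-box model (width proportional to the number of characters times the font size), the linear mapping from relevance to size, the increment `step`, and the names `box_area` and `size_keywords` are all assumptions made for this illustration; the slides only give the interval (12, 50) and the stopping rule.

```python
def box_area(word, size, char_aspect=0.6):
    """Approximate bounding-box area of a word drawn at a given font size.

    Assumption: each character is char_aspect * size wide and size tall.
    """
    return (char_aspect * size * len(word)) * size


def size_keywords(relevance, polygon_area, fmin=12.0, fmax=50.0, step=2.0):
    """Map relevance scores to font sizes in [fmin, fmax], increasing fmax
    until the summed bounding-box area exceeds polygon_area.

    relevance maps keyword -> relevance score. Returns (sizes, final_fmax).
    """
    lo, hi = min(relevance.values()), max(relevance.values())

    def sizes_for(current_fmax):
        sizes = {}
        for word, score in relevance.items():
            # Linear rescaling; if all scores are equal, every word gets fmax.
            frac = (score - lo) / (hi - lo) if hi > lo else 1.0
            sizes[word] = fmin + frac * (current_fmax - fmin)
        return sizes

    sizes = sizes_for(fmax)
    while sum(box_area(w, s) for w, s in sizes.items()) <= polygon_area:
        fmax += step                 # grow the upper bound and re-scale
        sizes = sizes_for(fmax)
    return sizes, fmax


# Toy example: three keywords inside a polygon of area 200 x 150.
relevance = {"cloud": 1.0, "word": 0.6, "polygon": 0.2}
sizes, final_fmax = size_keywords(relevance, polygon_area=200 * 150)
print(final_fmax, sizes)
```

Only fmax is adjusted, so the least relevant keyword stays at fmin while the others grow with each pass.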
• The optimization Problem
Results
Comparisons
Discussion and Limitations
• ProjCloud is largely dependent on the clustering process.
• If the clustering performs poorly, the word cloud becomes very hard to fit and read.
• Empty space between clusters.
Conclusion
Thank you