eClassifier: Tool for Taxonomies
Scott Spangler, spangles@almaden.ibm.com
IBM Almaden Research Center, San Jose, CA
Assertions on Taxonomy Generation
• Manual methods are too labor intensive, limit scope and scale, and are not maintainable
• Canned taxonomies are a niche solution
• There are many “natural” or “right” taxonomies, even on the same collection
• Clustering, canned taxonomies and other methods are good starting points, but not enough
Salient Features of eClassifier
• Clustering algorithm independent
  – bias towards speed for interaction
• Classification algorithm independent
  – evaluate multiple algorithms for a given taxonomy
  – pick the best algorithm for each level in the taxonomy
• Multiple methods to seed a taxonomy:
  – import, clustering, query based
• Multiple methods for evaluating, editing and validating taxonomies
• Given a taxonomy, analysis/discovery against structured and unstructured information
eClassifier Principles
• Apply multiple text mining algorithms to textual data sets in a practical manner.
• Provide consistently good results; the goal is not perfection.
• Utilize domain expertise by giving the user control over the mining process.
• Provide tools, metrics and reports to draw useful conclusions from the analysis.
The Mining Process
• Create a dictionary of terms (words and phrases)
• Prune dictionary (prune irrelevant terms)
• Cluster documents based on this dictionary
• Examine the resulting taxonomy, modifying based on domain expertise
• Create multiple taxonomies (divide and conquer)
• Do deeper analysis by creating keyword classifications, comparing taxonomies, inspecting dictionary co-occurrence, examining recent trends
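The first three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the stop list, the `min_docs` threshold and the small k-means-style loop are all hypothetical, and the slides do not say which clustering algorithm eClassifier uses.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stop list (not taken from the slides).
DOCS = [
    "printer driver install failed",
    "cannot install printer driver",
    "password reset email not received",
    "reset my password please",
]
STOP_WORDS = {"my", "not", "please", "cannot"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(docs, stop_words, min_docs=2):
    """Create and prune: keep terms found in at least min_docs documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return sorted(t for t, n in doc_freq.items()
                  if n >= min_docs and t not in stop_words)


def vectorize(doc, dictionary):
    counts = Counter(tokenize(doc))
    return [counts[t] for t in dictionary]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, seeds, rounds=5):
    """Assumed stand-in clusterer: k-means-style loop with cosine similarity."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(rounds):
        # Assign each document to its most similar centroid.
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


dictionary = build_dictionary(DOCS, STOP_WORDS)
vectors = [vectorize(d, dictionary) for d in DOCS]
labels = cluster(vectors, seeds=[0, 2])
print(dictionary)  # ['driver', 'install', 'password', 'printer', 'reset']
print(labels)      # [0, 0, 1, 1]
```

The resulting labels are the starting taxonomy that a domain expert would then examine and modify.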
The Class Table
For viewing and understanding each level in a taxonomy
Understanding Class Metrics
• Class Naming Convention
  – Shortest possible name that covers the examples
  – “,” => OR
  – “&” => AND
  – X_Y => X followed by Y
  – NONE => no useful text
  – Miscellaneous => no easy description
• Cohesion
  – A measure of similarity between documents in the same class (0 = different terms, 100 = same terms)
• Distinctness
  – A measure of how different a class's documents are from those in other classes (0 = very similar, 100 = very distinct)
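The slides give only the 0 to 100 scale for these two metrics, not their formulas. One plausible way to compute scores with the same orientation is shown below; the use of cosine similarity and class centroids is an assumption made for illustration, not eClassifier's documented definition.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cohesion(vectors):
    """Assumed formula: mean similarity of each document to its class centroid."""
    center = centroid(vectors)
    return 100.0 * sum(cosine(v, center) for v in vectors) / len(vectors)


def distinctness(name, classes):
    """Assumed formula: 100 minus similarity to the nearest other class centroid."""
    own = centroid(classes[name])
    others = [centroid(vs) for other, vs in classes.items() if other != name]
    if not others:
        return 100.0
    return 100.0 * (1.0 - max(cosine(own, o) for o in others))


# Hypothetical term-count vectors over a three-term dictionary.
classes = {
    "printer": [[1, 1, 0], [1, 1, 0]],
    "password": [[0, 0, 1], [0, 1, 1]],
}
for class_name, class_vectors in classes.items():
    print(class_name, round(cohesion(class_vectors), 1),
          round(distinctness(class_name, classes), 1))
```

Under these assumed formulas, a class whose documents all use the same terms scores a cohesion near 100, and a class sharing few terms with its nearest neighbour scores a high distinctness.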
Dictionary Tool
• Edit -> Dictionary Tool
• Use this to edit the features on which the taxonomy is based
• Delete irrelevant or ambiguous terms
• Generate and edit synonyms
Dictionary Generation Files
• StopWords
  – words excluded from the dictionary
• Synonyms
  – different forms of the same semantic term
• IncludeWords
  – words that always appear in the dictionary
• Stock Phrases
  – text to be ignored in creating the dictionary
• Synonyms and Stock Phrases can be automatically generated and then edited
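How the four inputs might interact during dictionary generation is sketched below. The on-disk file formats are not described in the slides, so the lists are passed in as plain Python collections; the function name, the `min_docs` threshold and the order of the steps are assumptions.

```python
import re
from collections import Counter


def build_dictionary(docs, stop_words, synonyms, include_words,
                     stock_phrases, min_docs=2):
    """Hypothetical dictionary builder combining the four kinds of input."""
    doc_freq = Counter()
    for doc in docs:
        text = doc.lower()
        # Stock phrases: strip text that should be ignored entirely.
        for phrase in stock_phrases:
            text = text.replace(phrase.lower(), " ")
        # Synonyms: map each variant form onto one canonical term.
        terms = {synonyms.get(t, t) for t in re.findall(r"[a-z]+", text)}
        doc_freq.update(terms)
    # Stop words: excluded even when they are frequent.
    kept = {t for t, n in doc_freq.items()
            if n >= min_docs and t not in stop_words}
    # Include words: always present, even when they are rare.
    return sorted(kept | set(include_words))


docs = [
    "Thank you for calling. The printers are offline.",
    "Thank you for calling. The printer will not print.",
    "Thank you for calling. VPN token expired.",
]
dictionary = build_dictionary(
    docs,
    stop_words={"the", "are", "will", "not"},
    synonyms={"printers": "printer"},
    include_words=["vpn"],
    stock_phrases=["thank you for calling"],
)
print(dictionary)  # ['printer', 'vpn']
```

Without the stock phrase entry, "thank", "you", "for" and "calling" would appear in every document and crowd the dictionary with terms that carry no meaning.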
Refinement of Classes
• Subclass Classes
  – Subdivide an existing class into multiple subclasses at the next level in the taxonomy
• Merge Classes
• Delete Classes
• Rename Class
• Undo
• Don’t be afraid to try things
• Save
  – .obj files contain all information eClassifier uses
  – .class files contain class membership
• Read
Class View
• For understanding the concepts and contents of a given class
• View the text
  – Most typical
  – Least typical
• View the source Web page
• View distinguishing terms
• View deduced rules for classification and related documents
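The slides do not define "most typical" and "least typical". A common reading, assumed here, is to rank a class's documents by their similarity to the class centroid; the function name and the sample vectors are hypothetical.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_typicality(docs):
    """docs maps a document id to its term-count vector; most typical first."""
    vectors = list(docs.values())
    center = [sum(col) / len(vectors) for col in zip(*vectors)]
    return sorted(docs, key=lambda d: cosine(docs[d], center), reverse=True)


# Hypothetical class with two similar documents and one outlier.
docs = {"a": [1, 1, 0], "b": [2, 1, 0], "c": [0, 0, 1]}
ranking = rank_by_typicality(docs)
print(ranking[0], ranking[-1])  # most typical, least typical
```

Reading the least typical documents first is a quick way to spot items that may belong in a different class.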
Keyword Searching
• Edit->Keyword Search
• Search for Dictionary terms
• Use “and”, “or” and “_”
• Searching within a class
• Related Words
• Look at Trends
• Create new Classes
• See where the matching documents occur via Class Table
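A small matcher for the three search operators is sketched below. It assumes that “_” means the two words appear consecutively (following the X_Y naming convention on the class metrics slide), that “and” binds tighter than “or”, and that there are no parentheses; none of these details is confirmed by the slides.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def term_matches(term, tokens):
    """A term is one word, or words joined by "_" that must appear in sequence."""
    words = term.split("_")
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def matches(query, text):
    tokens = tokenize(text)
    # Assumed precedence: split on "or" first, then require every "and" term.
    clauses = re.split(r"\s+or\s+", query.strip().lower())
    return any(
        all(term_matches(t, tokens) for t in re.split(r"\s+and\s+", clause))
        for clause in clauses
    )


text = "Cannot install printer driver"
print(matches("printer_driver and install", text))  # True
print(matches("driver_printer", text))              # False
print(matches("password or printer", text))         # True
```

Restricting the search to one class would amount to running `matches` over that class's documents only.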
Document/Page Viewer
• Sorting Documents
• Most typical
• Least typical
• View distinguishing terms
• Representative use of important words
• Moving documents
• Trend
• Reports
Keyword Class Generation
• Execute->Classify by Keywords
• Open queries (KCG files)
• One query per line
• .AND. , .OR., (, )
• Add, Rename, Delete queries
• Prioritize – Move up and down
• Multiple/only one class
• Ambiguous/first matching class
• Run Queries
• Save Queries
• Run eClassifier
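The query-driven classification above can be illustrated with a small recursive-descent evaluator for `.AND.`, `.OR.` and parentheses. The KCG file layout is not described in the slides, so queries are given here as a prioritized list of hypothetical (class name, query) pairs. The three modes are one reading of the "Multiple/only one class" and "Ambiguous/first matching class" options, and malformed queries are not handled.

```python
import re


def parse(query):
    """Turn a query string into a predicate over a set of document terms."""
    tokens = re.findall(r"\.AND\.|\.OR\.|\(|\)|[A-Za-z0-9_]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        left = parse_and()
        while peek() == ".OR.":
            take()
            right = parse_and()
            left = (lambda a, b: lambda terms: a(terms) or b(terms))(left, right)
        return left

    def parse_and():
        left = parse_atom()
        while peek() == ".AND.":
            take()
            right = parse_atom()
            left = (lambda a, b: lambda terms: a(terms) and b(terms))(left, right)
        return left

    def parse_atom():
        tok = take()
        if tok == "(":
            inner = parse_or()
            take()  # consume the closing ")"
            return inner
        word = tok.lower()
        return lambda terms: word in terms

    return parse_or()


def classify(text, queries, mode="first"):
    """queries is ordered by priority; mode is 'first', 'multiple' or 'ambiguous'."""
    terms = set(re.findall(r"[a-z0-9_]+", text.lower()))
    hits = [name for name, query in queries if parse(query)(terms)]
    if mode == "multiple" or len(hits) <= 1:
        return hits
    return hits[:1] if mode == "first" else ["Ambiguous"]


# Hypothetical prioritized queries, highest priority first.
QUERIES = [
    ("Drivers", "(printer .OR. scanner) .AND. driver"),
    ("Printing", "printer"),
]
print(classify("Printer driver crashed", QUERIES))                   # ['Drivers']
print(classify("Printer driver crashed", QUERIES, mode="multiple"))  # ['Drivers', 'Printing']
```

Moving a query up or down in the list changes which class wins in `first` mode, which is why the ordering matters.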
Comparing Taxonomies
• File->Compare Taxonomies
• File->Read Structured Information
• Co-occurrence counts and affinities
• Trend
• View documents
• Transpose
• Report (CSV)
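A co-occurrence table between two taxonomies can be built by counting how many documents fall into each pair of classes. The slides do not define "affinity"; the sketch below assumes it is the ratio of the observed count to the count expected if the two taxonomies were independent. The function name and sample data are hypothetical.

```python
from collections import Counter


def cooccurrence(tax_a, tax_b):
    """tax_a and tax_b map a document id to its class name in each taxonomy."""
    shared = tax_a.keys() & tax_b.keys()
    counts = Counter((tax_a[d], tax_b[d]) for d in shared)
    rows = Counter(tax_a[d] for d in shared)
    cols = Counter(tax_b[d] for d in shared)
    total = len(shared)
    table = {}
    for (a, b), observed in counts.items():
        # Assumed affinity: observed count over the count expected by chance.
        expected = rows[a] * cols[b] / total
        table[(a, b)] = (observed, observed / expected)
    return table


# Hypothetical class assignments for four documents.
by_topic = {1: "printer", 2: "printer", 3: "password", 4: "password"}
by_team = {1: "hardware", 2: "hardware", 3: "account", 4: "hardware"}
table = cooccurrence(by_topic, by_team)
for pair, (count, affinity) in sorted(table.items()):
    print(pair, count, round(affinity, 2))
```

A ratio above 1 marks a pair of classes that share more documents than chance would suggest; a ratio below 1 marks a pair that shares fewer.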
Dictionary Co-occurrence
• View->Dictionary Co-occurrence
• Type ahead searching
• Co-occurrence counts and affinities
• Trend
• View documents
• Zoom in
• Change Metric -> dependency
Advanced Features
• Visualization
• Subclass from Structured Information
• Make Classifier
• Read Template
• Import Category
  – Add a category from another saved taxonomy
• Select Metrics
  – Add other columns to the Class table
• BIW
Visualization
• Look at relationships between selected classes
• Discover sub-clusters
• Find “borderline” examples
• View/Move Documents
• Navigator
• Touring