
Post on 16-Dec-2015


Automating Creation of Hierarchical Faceted Metadata Structures

Emilia Stoica, Marti Hearst and Megan Richardson*

School of Information, Berkeley *Dept. of Mathematical Sciences, NMSU

Focus: Browse Large Datasets

- The standard search interface (query box + retrieved results) is not suited for browsing and navigation
- User interfaces need to group and organize the results

How do we Create Faceted Hierarchies?

Goals:

- Help an information architect create the hierarchy (currently they do it all by hand!)
- Balance depth and breadth: avoid “skinny” paths; don’t go too deep or too broad
- Choose understandable labels
- Disambiguate between word senses

Related Work

- Automated text categorization: LOTS of work on this, but it assumes that a set of categories has already been created
- Little if any work on building facet hierarchies

Castanet

Carves out a structure from the hypernym (IS-A) relations within WordNet

Semi-automatic algorithm for creating hierarchical faceted metadata

Produces surprisingly good results for a wide range of subjects, e.g., recipes, medicine, math, news, fine arts image descriptions

WordNet Challenges

A word may have more than one sense

- Fine granularity of word sense distinctions, e.g.,
  newspaper (#1): daily publication on folded sheets
  newspaper (#3): physical object

- Ambiguity for the same sense, e.g.,
  tuna (#1): cactus
  tuna (#2): fish (food fish, bony fish)

WordNet Challenges (cont.)

The hypernym path may be quite long (e.g., sense #3 of tuna has 14 nodes)

Sparse coverage of proper names and noun phrases (not addressed)

Our Approach

[Pipeline diagram: Documents and WordNet are the inputs. Steps: Select terms → Build core tree → Augment core tree → Compress tree → Remove top-level categories → Divide into facets]

1. Select Terms

- Select well-distributed terms from the collection
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold (default: top 10%)
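The term-selection step could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that a term's "distribution" means the number of documents containing it, that terms are whitespace-separated tokens, and that the stopword list and the `select_terms` name are placeholders invented for the example.

```python
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "and", "in", "of", "the", "with"}

def select_terms(docs, top_fraction=0.10, stopwords=STOPWORDS):
    """Return (kept_terms, doc_freq): the top fraction of terms by document frequency."""
    doc_freq = Counter()
    for doc in docs:
        # A term counts once per document, however often it occurs there.
        doc_freq.update(set(doc.lower().split()) - stopwords)
    # Most widely distributed first; ties broken alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))
    keep = max(1, round(len(ranked) * top_fraction))
    return ranked[:keep], doc_freq

# Toy collection (made up for the example).
docs = ["sugar and butter", "sugar syrup", "sugar with parsley", "butter in the pan"]
terms, doc_freq = select_terms(docs, top_fraction=0.4)
```

The returned document frequencies can be reused later, when counts are added along each term's hypernym path.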


2. Build Core Tree

- Get the hypernym path of a term if the term:
  - has only one sense, or
  - matches a pre-selected WordNet domain
- Adding a new term increases the count at each node on its path by the number of documents containing the term

Example hypernym paths:

sundae: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
sherbet: entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet

- Build a “backbone”
- Create paths from unambiguous terms only
- Bias the structure towards appropriate senses of words


2. Build Core Tree (cont.)

Merge hypernym paths to build a tree

Paths:

sundae: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
sherbet: entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet

Merged tree:

entity
  substance, matter
    nutriment
      dessert
        frozen dessert
          ice cream sundae
            sundae
          sherbet, sorbet
            sherbet
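Merging paths and accumulating counts might look like the sketch below. The nested-dict tree, the `add_path` helper, and the document counts (12 and 5) are assumptions made for illustration; only the two hypernym paths come from the slides.

```python
SUNDAE = ["entity", "substance, matter", "nutriment", "dessert",
          "frozen dessert", "ice cream sundae", "sundae"]
SHERBET = ["entity", "substance, matter", "nutriment", "dessert",
           "frozen dessert", "sherbet, sorbet", "sherbet"]

def add_path(tree, counts, path, n_docs):
    """Merge one root-to-term path into the tree and add n_docs to every node on it."""
    node = tree
    for depth, label in enumerate(path, start=1):
        # Shared prefixes reuse the existing node; new labels create a branch.
        node = node.setdefault(label, {})
        key = tuple(path[:depth])  # key nodes by their full path to avoid label clashes
        counts[key] = counts.get(key, 0) + n_docs

tree, counts = {}, {}
add_path(tree, counts, SUNDAE, n_docs=12)   # 12 and 5 are made-up document counts
add_path(tree, counts, SHERBET, n_docs=5)

frozen = tree["entity"]["substance, matter"]["nutriment"]["dessert"]["frozen dessert"]
```

After both calls the two paths share every node down to "frozen dessert", whose count is the sum of the two terms' document counts.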

3. Augment Core Tree

Attach the terms with more than one sense to the core tree

Favor the more common path over other alternatives


Augment Core Tree (cont.)

The term "date" has two candidate paths (item counts in parentheses):

Date (p1): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date (sibling: berries)
Date (p2): abstraction > measure, quantity > fundamental quantity > time period > calendar day (18) > date (sibling: Sunday)

Choose p1, since it has more items assigned (78 vs. 18)
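One way to express "favor the more common path" is sketched here. It assumes each candidate path is scored by the item count at the node where the term would attach (its parent); the `choose_path` name and the shortened paths are illustrative, while the counts 78 and 18 are the ones shown on the slide.

```python
def choose_path(candidate_paths, counts):
    """Return the candidate whose attachment point (the term's parent) has the highest count."""
    return max(candidate_paths, key=lambda path: counts.get(tuple(path[:-1]), 0))

# Shortened versions of the two "date" paths; only the last levels matter here.
P1 = ["food", "edible fruit", "date"]
P2 = ["time period", "calendar day", "date"]

# Item counts already assigned in the core tree (78 and 18 on the slide).
counts = {("food", "edible fruit"): 78, ("time period", "calendar day"): 18}

chosen = choose_path([P1, P2], counts)
```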

Optional Step: Domains

- To disambiguate, use domains
- WordNet has 212 domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini (2000); it assigns a domain to every noun synset
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use, or may add their own
- Paths for terms that match the selected domains are added to the core tree

Using Domains

dip glosses:

Sense 1: A depression in an otherwise level surface
Sense 2: The angle that a magnetic needle makes with the horizon
Sense 3: Tasty mixture into which bite-size foods are dipped

dip hypernyms:

Sense 1: solid => concave shape => depression
Sense 2: shape, form => space => angle
Sense 3: food => ingredient, fixings => flavorer

Given domain “food”, choose sense 3
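The "dip" example can be mimicked with a small sketch. Note the simplification: a real system would look up per-synset domain labels from a domain resource, whereas this stand-in merely checks whether a selected domain name occurs in a sense's hypernym chain. The `pick_sense` function is hypothetical; the chains are the ones listed above.

```python
DIP_HYPERNYMS = {
    1: ["solid", "concave shape", "depression"],
    2: ["shape, form", "space", "angle"],
    3: ["food", "ingredient, fixings", "flavorer"],
}

def pick_sense(hypernyms_by_sense, selected_domains):
    """Return the single sense matching a selected domain, or None if zero or several match."""
    matches = [
        sense
        for sense, chain in hypernyms_by_sense.items()
        # Stand-in for a domain lookup: the domain name appears in the chain.
        if any(domain in chain for domain in selected_domains)
    ]
    return matches[0] if len(matches) == 1 else None

sense = pick_sense(DIP_HYPERNYMS, {"food"})
```

Returning None when no sense, or more than one sense, matches leaves the term ambiguous, so it would be handled by the augmentation step instead.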

4. Compress Tree

Rule 1: Eliminate a parent with fewer than k children, unless it is the root or its distribution is larger than 0.1*maxdist

Before:

dessert
  frozen dessert
    ice cream sundae
      sundae
    sherbet, sorbet
      sherbet
    parfait

After:

dessert
  frozen dessert
    sundae
    parfait
    sherbet
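Rule 1 could be sketched as a bottom-up pass over a nested-dict tree. The function name, the choice k = 2, and the distribution values are assumptions made for the example; the tree is the dessert example from the slide. Promoted children that share a label with an existing sibling would overwrite it in this simple version.

```python
def compress_rule1(label, subtree, dist, k, max_dist, is_root=False):
    """Return the (label, subtree) pairs that replace this node after compression."""
    compressed = {}
    for child, grandchildren in subtree.items():
        # Compress bottom-up; an eliminated child is replaced by its own children.
        compressed.update(compress_rule1(child, grandchildren, dist, k, max_dist))
    is_leaf = not compressed  # leaves are terms, not parents, so they are kept
    well_distributed = dist.get(label, 0) > 0.1 * max_dist
    if is_root or is_leaf or len(compressed) >= k or well_distributed:
        return [(label, compressed)]
    return list(compressed.items())

subtree = {
    "frozen dessert": {
        "ice cream sundae": {"sundae": {}},
        "sherbet, sorbet": {"sherbet": {}},
        "parfait": {},
    }
}
# Made-up distributions; only values above 0.1 * max_dist protect a node.
dist = {"dessert": 100, "frozen dessert": 40, "ice cream sundae": 3, "sherbet, sorbet": 2}

result = compress_rule1("dessert", subtree, dist, k=2, max_dist=100, is_root=True)
```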


4. Compress Tree (cont.)

Rule 2: Eliminate a child whose name appears within the parent’s name

Before:

dessert
  frozen dessert
    sundae
    parfait
    sherbet

After:

dessert
  sundae
  parfait
  sherbet
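A sketch of Rule 2 follows. Because the slide's example removes "frozen dessert" under "dessert", this version assumes a child is redundant when either label's words are contained in the other's; that reading, the word-set comparison, and the function names are assumptions rather than the authors' exact test. A redundant leaf would be dropped outright in this simple version.

```python
import re

def words(label):
    """Lowercased set of words in a node label."""
    return set(re.findall(r"\w+", label.lower()))

def names_overlap(parent, child):
    """True when one label's words are fully contained in the other's."""
    p, c = words(parent), words(child)
    return p <= c or c <= p

def compress_rule2(label, subtree):
    """Return the subtree of `label` with redundant children removed."""
    compressed = {}
    for child, grandchildren in subtree.items():
        cleaned = compress_rule2(child, grandchildren)
        if names_overlap(label, child):
            compressed.update(cleaned)   # drop the child, promote its children
        else:
            compressed[child] = cleaned
    return compressed

before = {"frozen dessert": {"sundae": {}, "parfait": {}, "sherbet": {}}}
after = compress_rule2("dessert", before)
```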


5. Divide into Facets (Remove top levels)

Before:

entity
  substance, matter
    food, nutriment
      food stuff, food product
        ingredient, fixings
          flavorer
            sweetening
              sugar
              syrup
            herb
              parsley
              oregano

After:

flavorer
  sweetening
    sugar
    syrup
  herb
    parsley
    oregano

Rule 1: Eliminate the top t levels (t = 4 for the recipe collection).

Rule 2: For each resulting tree, test whether it has at least n children (n = 2). If yes, stop. If not, delete the root and repeat.

Manual cleaning: remove facets that don’t make sense
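The two rules above can be sketched on a nested-dict tree as follows. The helper names are invented for the example, and the tree is an assumed layout of the slide's flavorer example, so the sketch shows the mechanics rather than the authors' code.

```python
def strip_top_levels(tree, t):
    """Rule 1: remove the top t levels and return the forest of subtrees that remain."""
    forest = tree
    for _ in range(t):
        next_level = {}
        for subtree in forest.values():
            next_level.update(subtree)
        forest = next_level
    return forest

def split_into_facets(tree, t=4, n=2):
    """Rule 2: keep a tree as a facet once its root has at least n children."""
    facets = {}
    stack = list(strip_top_levels(tree, t).items())
    while stack:
        label, subtree = stack.pop()
        if len(subtree) >= n:
            facets[label] = subtree
        else:
            stack.extend(subtree.items())  # delete the root and retest its children
    return facets

flavorer = {"sweetening": {"sugar": {}, "syrup": {}},
            "herb": {"parsley": {}, "oregano": {}}}
tree = {"entity": {"substance, matter": {"food, nutriment": {
    "food stuff, food product": {"ingredient, fixings": {"flavorer": flavorer}}}}}}

facets = split_into_facets(tree, t=4, n=2)
```

With t = 4 the tree rooted at "ingredient, fixings" remains; it has a single child, so its root is deleted and "flavorer" becomes the facet root.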

Example: Recipes (13,500 docs)

Castanet Output (shown in Flamenco)

Castanet Output

Castanet Evaluation

This is a tool for information architects (IAs), so information architects did the evaluation

Each IA compared Castanet to other state-of-the-art algorithms:
- LDA (Blei et al. ’04)
- Subsumption (Sanderson & Croft ’99)

Baseline: most frequent terms in the collection

Dataset: 13,000 recipes from Southwestcooking.com

Subsumption Output

Subsumption Output

LDA Output

LDA Output

Evaluation Method

For each of 2 systems’ output:
- Examined and commented on the top level
- Examined and commented on two sub-levels

Then commented on overall properties:
- Meaningful?
- Systematic?
- Likely to use in your work?

[Diagram: system pairings LDA + Castanet (16) and Subsumption + Castanet (18)]

Evaluation (cont.)

Sample questions for top-level categories:
- Would you add/remove/rename any category?
- Did this category match your expectations?

Sample questions for a specific category:
- Would you add/move/remove any sub-categories?
- Would you promote any sub-category to top level?

General questions:
- Would you use Castanet?
- Would you use LDA?
- Would you use Subsumption?
- Would you use a list of the most frequent terms?

Evaluation Results

“Would you use this system in your work?”

Answers of “yes, definitely” or “yes, in some cases”:

Castanet     85%
LDA           0%
Subsumption  37%
Baseline     74%

Average response to questions about quality (4 = “strongly agree”, 3 = “agree somewhat”, 2 = “disagree somewhat”, 1 = “strongly disagree”)

Evaluation Results

Average responses for top-level categories (4= “no changes”, 3 = “one or two”, 2 = “a few”, 1 = “many”)

Average responses for 2 subcategories

Needed Improvements

- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve the algorithm for assigning documents to categories

Conclusions

Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet.

The method has been tested on various domains: medicine, recipes, math, news, description of images

Usability study shows:
- Castanet is preferred to other state-of-the-art solutions
- Information architects want to use the tool in their work

Future work: apply to tags (flickr, delicious)

Learn More

Funding: This work was supported in part by NSF (IIS-9984741)

For more information: Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007

See http://flamenco.berkeley.edu