Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata Anon...

28
Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata Anon Plangprasopchok 1 , Kristina Lerman 1 , Lise Getoor 2 1 USC Information Sciences Institute 2 Department of Computer Science, University of Maryland SIGKDD 2010 2010. 12. 17. Summarized and Presented by Kim Chung Rim, IDS Lab., Seoul National University

Transcript of Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata Anon...

Growing a Tree in the Forest: Constructing Folk-sonomies by Integrating Structured Metadata

Anon Plangprasopchok1, Kristina Lerman1, Lise Getoor2

1USC Information Sciences Institute 2Department of Computer Science, University of Maryland

SIGKDD 2010

2010. 12. 17.

Summarized and Presented by Kim Chung Rim, IDS Lab., Seoul National University

Copyright 2010 by CEBT

Contents

Introduction

Goal

Challenges

Learning Folksonomies

SAP: Growing a Tree

Evaluation

Conclusion & Discussion

2

Copyright 2010 by CEBT

Introduction

The social Web has changed the way people create and use information

Users organize their content via tags

Sites such as Flickr, Del.icio.us and YouTube allows users to tag their contents with descriptive keywords

3

Copyright 2010 by CEBT

Social Metadata

captures collective knowledge of Social Web users

Once mined from many users,

Such collective knowledge will add a rich semantic layer to the content of Social Web

Could be used for information discovery

4

Copyright 2010 by CEBT

Structured Social Metadata

Some Social Websites allow users to organize tags hierarchically

Delicious

Users can group related tagsinto bundles

Flickr

Users can group related photos into sets and related sets into collections

5

Copyright 2010 by CEBT

Folksonomy

Folksonomy – a common taxonomy that is learned from social metadata.

There exist predefined hierarchies such as WordNet or Linnaean classification system.

Benefits of Folksonomy over already existing hierarchies

represent collective agreement of many individuals

Are relatively inexpensive to obtain

Can adapt to evolving vocabularies and community’s in-formation needs

Directly tied to the annotated content

6

Copyright 2010 by CEBT

Goal

The goal of this paper is to

Extract users’ knowledge (folk knowledge) from user-gen-erated semantics

To build up a tree of communal taxonomy

7

animal

fish cat dog

dog

biggle jindo

animal

reptile cat

animal

fish cat dog

biggle jindo

reptile

Copyright 2010 by CEBT

Challenges

Sparseness

Social metadata is very sparse – only 3.74 tags per photo

Nosiy vocabulary

Spelling errors and meaningless tags – not sure, mykid

Ambiguity

Individual tag is often ambiguous – jaguar

8

Copyright 2010 by CEBT

Challenges

Structural noise and conflicts

Users use different term to build their own hierarchy

Varying granularity level

Users’ level of expertise and expressiveness differ

9

Copyright 2010 by CEBT

Learning Folksonomies - intuition

roots with similar term should be aggregated horizontally

A Root and a leaf with similar term should be aggregated vertically

10

animal

fish cat dog

animal

reptile cat

animal

fish cat dog reptile

animal

fish cat dog

dog

biggle jindo

animal

fish cat dog

biggle jindo

Copyright 2010 by CEBT

Learning Folksonomy

Relational Clustering

Two nodes should be clustered if their similarity is above a certain threshold

Similarity is computed using local similarity & structural similarity

– localSim(a,b) measures the similarity of node names and their tag distribution

– structSim(a,b) measures the similarity of leaves

– is a weight for adjusting contributions from localSim and structSim ( )

11

),(),()1(),( bastructSimbalocalSimbanodesim

10

Copyright 2010 by CEBT

Learning Folksonomy - terms

A personal hierarchy is called a sapling

A sapling is composed of a root node and its leaves

Tag statistic of a node x is defined as

Where and are tag and its frequency

is aggregated from all

= {(canada, 1), (vacation, 1), … }

= {(bc,1), (canada, 2), (vacation, 1), … }

12

ir ij

i ll ,...,1

)},(),...,,(),,{(:21 21 ktkttx ftftft

kt ktf

ir i

jl

vicl1

vicr

Copyright 2010 by CEBT

Local Similarity

The local similarity of nodes a and b is composed of

Name similarity

Tag distribution similarity

nameSim can be any string similarity metric, which returns a value from 0 to 1

tagSim can be any function for measuring the similarity of distributions – in this paper, tagSim counts the number of common tags n, in the top K tag of a and b. if n ≥ J (a cer-tain threshold), it returns 1, else it returns n/J.

13

),()1(),(),( batagSimbanameSimbalocalSim

Copyright 2010 by CEBT

Structural Similarity

Compute structural similarity in two cases

Similarity between two root nodes

Similarity between a root node and a leaf node

14

Copyright 2010 by CEBT

Root-to-Root Similarity

Similarity based on

How many of their children have common name

The tag distribution similarity of those that do not have the same name

– Where ,

– is an aggregation of tag distributions of all

15

),()1(),( Btag

Atag

BA LLtagSimCLCLrrRstructSimR

))(),((1

,ji

Bi

Ai lnamelname

ZCL |)||,min(| YX llZ

AtagL

Ail

Copyright 2010 by CEBT

Root-to-Leaf Similarity

Takes into account when a sapling contains the same leaf as the compared node

When merging UK->{Scotland, Glasgow, Edinburgh, Lon-don} and Scotland->{Glasgow, Shetland}, it can be said that Scotland and UK are similar because UK contains a shortcut to Glasgow

Measures overlap between siblings of and children of

16

),(),( BABAi rrRstructSimRrlRstructSimL

Ail

Br

Copyright 2010 by CEBT

Relational Clustering

When the nodeSim(a,b)

exceeds a certain threshold, two nodes are clustered(merged)

17

),(),()1(),( bastructSimbalocalSimbanodesim

animal

fish cat dog

animal

reptile cat

animal

fish cat dog reptile

Copyright 2010 by CEBT

SAP: Growing a Tree

Using Incremental Relational Clustering,

18

canada

canada canada

canada

canada

victoria

victoria

victoria victoria

victoria victoria

Horizontal aggrega-

tion

Vertical ag-gregation

Copyright 2010 by CEBT

Handling Problems

Shortcuts

Take the longer route

Noisy vocabularies

If a name is used by less than 1% of the user who contrib-uted in the sapling, it is considered as noisy and discarded

19

Copyright 2010 by CEBT

Evaluation

20,759 saplings from 7,121 users are selected from Flickr

Values for several parameters are chosen

20

Parame-ters

Value

K 40

J 4

RR 0.1

LR 0.8 0.5

Copyright 2010 by CEBT

Evaluation Methodologies

Against the reference hierarchy (ODP)

Taxonomic Overlap – measuring structure similarity be-tween two trees. Measures how many ancestor and de-scendant nodes overlap to those in the reference tree for each node and take the harmonic mean (fmTO)

Lexical Recall – measuring how well an approach can dis-cover concepts in the reference hierarchy (LR)

Structural evaluation

Area Under Tree – a measurement to define how bushy and deep a tree is (AUT)

Manual evaluation (for concepts not in ODP)

Asking 5 judges whether the path is correct

21

Copyright 2010 by CEBT

Evaluation metric - AUT

AUT for below tree is:

1*(3+1)/2 + 1*(3+2)/2 = 4.5

22

animal

fish cat dog

biggle jindo0 0.5 1 1.5 2

0

1

2

3

nodes per depth

Copyright 2010 by CEBT

Baseline of comparison

SIG – past work of the authors

Assume nodes having the same name refer to the same concept

Keep the relations between node pairs

Combine all relations into a tree

23

Copyright 2010 by CEBT

Results

24

Copyright 2010 by CEBT

Results

Simple comparison

25

Copyright 2010 by CEBT

Some of the Learned Folksonomies

26

Copyright 2010 by CEBT

Conclusion & Discussion

This work can create more accurate and more detailed folksonomies than the current state-of-the-art approach by exploiting structural information

This work is more scalable – it incrementally grows the folksonomies

Lacks comparison with other state-of-the-art approaches

27

Copyright 2010 by CEBT

Thank you

28