Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach (SemTech 2006)

Presentation to STC 2006 by Brad Allen, Founder and CTO, Siderean Software, Inc.


Copyright © 2006 Siderean Software, Inc. All rights reserved.

Page 2: Preface

• This is not rocket science
• This is appropriate semantic technology
• What Jim Hendler said: it’s about linking things so the whole is greater than the sum of the parts

Page 3: Disclaimer

• We will be viewing uncontrolled vocabulary from the Web live
• Sometimes it’s not pretty
• Please don’t be offended

Page 4: The problem

• Associating subject metadata with content and data is an old technique for improving precision and recall in search
• Traditionally, subject languages have been expressed as highly governed taxonomies (e.g., thesauri and controlled vocabularies) that entail substantial costs in creation and use
• User tagging and the emergence of folksonomies have changed the economics of subject metadata creation, but at the cost of quality
• Can the two approaches to subject metadata be combined to yield the advantages of both while addressing their shortcomings?

Page 5: Taxonomies

• A taxonomy is a controlled subject language whose terms exist in explicit relation to one another
• Advantages
  • Authoritative reference for terms and their relational semantics
  • Can support reasoning and classification
• Disadvantages
  • Creation requires training and discipline
  • Expensive and slow to track changes in usage
• Adoption
  • Pervasive for decades throughout the information science and IT communities

Page 6: Folksonomies

• A folksonomy is an uncontrolled subject language whose tags have no explicit relation to one another
• Advantages
  • The cost of creation can be shared across many untrained users
  • Can track changes in usage in real time
• Disadvantages
  • Lexical variations (misspellings, inconsistent case or white space)
  • Lack of relational semantics
  • Sense ambiguity
• Adoption
  • Rapid growth on the Web (del.icio.us, Flickr) and emerging in enterprise pilots (IBM, DKW)

Page 7: It’s an old story: neats vs. scruffies

• The taxonomy/thesaurus tradition is solid
• But user-generated metadata is gold
• A good solution should leverage aspects of both approaches

Page 8: Bridging the gap

• The key ideas
  • User tagging gets tags into the repository as “author keywords”
  • Ingested through RSS feeds with tagged items
  • Tags are related to terms in (separately defined) taxonomies
  • Users can search using one or the other or both
• Result
  • Folksonomies make taxonomies more responsive
  • Taxonomies make folksonomies more responsible
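The key ideas above can be illustrated with a minimal Python sketch of searching by tag, by taxonomy term, or by both. It is not taken from the talk: the item IDs, tags, terms, and the `TAG_TO_TERM` crosswalk are all hypothetical, and a real repository would hold this data as metadata rather than in dictionaries.

```python
# Hypothetical items carrying free-form user tags ("author keywords").
ITEM_TAGS = {
    "item1": {"sf", "travel"},
    "item2": {"san francisco", "food"},
    "item3": {"scifi", "books"},
}

# Hypothetical crosswalk relating tags to terms in a separately
# defined taxonomy. Tags without an entry simply have no term.
TAG_TO_TERM = {
    "sf": "San Francisco",
    "san francisco": "San Francisco",
    "scifi": "Science fiction",
}


def terms_for(item):
    """Taxonomy terms reachable from an item's tags through the crosswalk."""
    return {TAG_TO_TERM[t] for t in ITEM_TAGS[item] if t in TAG_TO_TERM}


def search(tag=None, term=None):
    """Return item IDs matching a tag, a taxonomy term, or both (AND)."""
    hits = []
    for item in sorted(ITEM_TAGS):
        if tag is not None and tag not in ITEM_TAGS[item]:
            continue
        if term is not None and term not in terms_for(item):
            continue
        hits.append(item)
    return hits
```

Searching by the term finds items whose users chose different tags for the same subject, while searching by tag alone still works for tags that no taxonomy term covers yet.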

Page 9: Example from DCMI Conference Thesaurus

Page 10: Building the bridge with ontologies

• SKOS
  • Lexical vs. concept-based thesauri
  • Modeling taxonomies in SKOS
    • skos:Concept
    • skos:broader/skos:narrower
    • skos:related
• Dublin Core (DC)
  • Basic asset metadata for modeling content creation
    • dc:creator
    • dc:dateSubmitted
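To make the vocabulary on this slide concrete, here is a small Python sketch that holds a taxonomy as plain (subject, predicate, object) tuples and walks skos:broader links upward. The `ex:` URIs, labels, and the sample asset are made up for illustration, and a production system would presumably use an RDF store rather than a Python set.

```python
# Triples are (subject, predicate, object) tuples. All ex: names are
# hypothetical; only the skos: and dc: property names come from the slide.
TRIPLES = {
    ("ex:metadata", "rdf:type", "skos:Concept"),
    ("ex:metadata", "skos:prefLabel", "Metadata"),
    ("ex:subject-metadata", "rdf:type", "skos:Concept"),
    ("ex:subject-metadata", "skos:prefLabel", "Subject metadata"),
    ("ex:subject-metadata", "skos:broader", "ex:metadata"),
    ("ex:thesauri", "rdf:type", "skos:Concept"),
    ("ex:thesauri", "skos:prefLabel", "Thesauri"),
    ("ex:thesauri", "skos:broader", "ex:subject-metadata"),
    ("ex:thesauri", "skos:related", "ex:taxonomies"),
    # Basic asset metadata about a piece of content.
    ("ex:paper42", "dc:creator", "A. Author"),
    ("ex:paper42", "dc:dateSubmitted", "2006-03-01"),
}


def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def broader_closure(concept):
    """Every concept reachable by following skos:broader links upward."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in objects(stack.pop(), "skos:broader"):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Walking the broader chain is what lets a search for a general term also retrieve content indexed with its narrower terms.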

Page 11: Modeling folksonomies in SKOS and DC

• Represent each tag as a skos:Concept
• The prefLabel of the concept is the tag
• The item is skos:subjectOf the concept
• The concept is skos:inScheme a scheme associated with the RSS channel
• No broader/narrower/related relationships (at least initially)
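The modeling rules above might be sketched in Python as follows. This is an illustration under several assumptions, not the speaker's implementation: the feed is a made-up RSS 2.0 document whose `<category>` elements stand in for user tags, the channel link is used as the scheme identifier, and the `#tag-` URI pattern is invented. The skos:subjectOf predicate is written exactly as the slide names it.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; <category> elements play the role of tags.
RSS = """<rss version="2.0"><channel>
  <link>http://example.org/feed</link>
  <item>
    <link>http://example.org/a</link>
    <category>semweb</category>
    <category>SKOS</category>
  </item>
  <item>
    <link>http://example.org/b</link>
    <category>semweb</category>
  </item>
</channel></rss>"""


def feed_to_triples(rss_text):
    """Turn tagged RSS items into SKOS-style (s, p, o) triples."""
    channel = ET.fromstring(rss_text).find("channel")
    scheme = channel.findtext("link")  # scheme associated with the channel
    triples = set()
    for item in channel.findall("item"):
        item_uri = item.findtext("link")
        for category in item.findall("category"):
            tag = category.text.strip()
            concept = f"{scheme}#tag-{tag}"  # invented URI pattern
            triples.add((concept, "rdf:type", "skos:Concept"))
            triples.add((concept, "skos:prefLabel", tag))
            triples.add((concept, "skos:inScheme", scheme))
            triples.add((item_uri, "skos:subjectOf", concept))
    return triples
```

Because the triples are collected in a set, a tag used on several items yields one concept, and no broader, narrower, or related links are created at this stage.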

Page 12: Addressing the shortcomings

• Reduce/eliminate lexical variation
  • Merge variants into a single concept using skos:prefLabel and skos:altLabel
• Relate tags to terms and other tags
  • Tag the tags with categories
  • Tags are related to other tags through shared skos:subjectOf relationships with items
• Place tags in time and space
  • The dc:dateSubmitted of the item is associated with its tags
  • Geolocation metadata can be added to concepts representing physical locations
• Compensate for ambiguous tags with term indexing
  • Index items tagged with ambiguous tags with unambiguous terms based on context (e.g. the tag “SF”)
• Allow users to exploit tags and terms concurrently
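One simple way to merge lexical variants into a single concept is sketched below in Python. The normalization rule (lower-casing and removing white space) and the choice of the most frequent spelling as the preferred label are assumptions made for this example; the talk does not say how variants were detected, and misspellings would need something stronger, such as edit distance.

```python
from collections import Counter, defaultdict


def normalize(tag):
    """Collapse case and white-space differences (a deliberately simple key)."""
    return "".join(tag.lower().split())


def merge_variants(tag_uses):
    """Group raw tags by normalized key.

    The most frequent spelling becomes skos:prefLabel and the remaining
    spellings become skos:altLabel values.
    """
    groups = defaultdict(Counter)
    for tag in tag_uses:
        groups[normalize(tag)][tag] += 1
    concepts = {}
    for key, counts in groups.items():
        # Descending frequency, then alphabetical order for a stable result.
        ranked = sorted(counts, key=lambda t: (-counts[t], t))
        concepts[key] = {
            "skos:prefLabel": ranked[0],
            "skos:altLabel": ranked[1:],
        }
    return concepts


# Hypothetical tag uses collected from several users.
USES = ["semantic web", "SemanticWeb", "semantic web", "Semantic Web",
        "rdf", "RDF", "RDF"]
CONCEPTS = merge_variants(USES)
```

Keeping the variants as alternate labels means a search on any spelling can still resolve to the merged concept.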

Page 13: Social aspects

• The role of the community of interest and focused collections of edge content
• A virtuous circle where navigation and tagging continuously improve quality of subject indexing
• A disruptive impact on the economics of knowledge management

[Figure: a Community of Interest in which content producers and content consumers are linked through (indexed) content, tagged content, and navigation and tagging]

Page 14: Case studies and demonstrations

• Environmental Health News
  • RSS item categorization
• Fac.etio.us
  • RSS/Atom into SKOS/FOAF/DC
• BBC Rushes
  • Crosswalks

Page 15: Case study: Environmental Health News

• Aggregating content from hundreds of Web pages daily
  • 10^5 Web pages
  • 10^3 originating sites
  • 10^1 editors
  • 10^4 subscribers
• Adding value at the metadata level to the Web at large for a focused community of interest
  • Policy makers
  • Activists
  • Researchers

Page 16: Case study: fac.etio.us

• Aggregating feeds from the del.icio.us social bookmarking site
  • 10^5 Web pages
  • 10^4 tags
  • 10^4 contributors
  • 10^4 originating sites
• Combining user tagging with faceted navigation
• “In 3 clicks, I drilled down through 9700+ sites, to a more specific set of 98 things, down to one I found useful.”
• “… the most comprehensive tool for searching the database of del.icio.us.”
• “Siderean’s half-year test makes the narrowness of the del.icio.us service evident.”
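The drill-down behavior described in the first quote can be approximated with a short Python sketch of faceted navigation over tagged bookmarks. The bookmarks, facet names, and values are hypothetical, and this makes no claim about how fac.etio.us was actually built; it only shows the general pattern of counting facet values and narrowing the result set one selection at a time.

```python
from collections import Counter

# Hypothetical bookmarks; every facet holds a list of values.
BOOKMARKS = [
    {"url": "u1", "tags": ["rdf", "skos"], "contributor": ["alice"], "site": ["w3.org"]},
    {"url": "u2", "tags": ["rdf"], "contributor": ["bob"], "site": ["example.org"]},
    {"url": "u3", "tags": ["skos", "tagging"], "contributor": ["alice"], "site": ["example.org"]},
    {"url": "u4", "tags": ["tagging"], "contributor": ["carol"], "site": ["example.com"]},
]


def drill_down(bookmarks, selections):
    """Keep bookmarks that carry every selected facet value."""
    return [
        b for b in bookmarks
        if all(value in b[facet] for facet, value in selections.items())
    ]


def facet_counts(bookmarks, facet):
    """Count how many of the given bookmarks carry each value of a facet."""
    counts = Counter()
    for b in bookmarks:
        counts.update(b[facet])
    return counts
```

Each click corresponds to adding one entry to `selections`; recomputing `facet_counts` on the narrowed set shows which further refinements are still available.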

Page 17: Case study: BBC rushes

• Joint work with Accenture Technology Labs for the TRECVID program
• BBC Rushes: 49.3 hours of raw video
• 4 issues of “Summer Holiday” (~2 hours)
• BBC One News (30’) + fragment (~3’)
• Faceted navigation using both textual and visual features

• Faceted navigation using both textual and visual features

Page 18: Future work

• (Semi)automatic folksonomy/taxonomy crosswalk generation
  • The notion of “relatedness”
    • By co-occurrence
    • By explicit warrant
• Machine learning for tag sense disambiguation
  • Co-training using content that is simultaneously tagged and indexed
• Tag spam filtering
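Relatedness by co-occurrence could be estimated along the following lines. This Python sketch is only one plausible reading of that bullet: the items and tags are invented, and the `min_count` threshold is an arbitrary assumption standing in for whatever evidence level a real crosswalk generator would require.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(item_tags):
    """Count how often each unordered pair of tags appears on the same item."""
    pair_counts = Counter()
    for tags in item_tags.values():
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def related(tag, pair_counts, min_count=2):
    """Tags that co-occur with `tag` at least `min_count` times."""
    out = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count and tag in (a, b):
            out[b if a == tag else a] = n
    return out


# Hypothetical tagged items.
ITEM_TAGS = {
    "i1": ["rdf", "skos", "semweb"],
    "i2": ["rdf", "semweb"],
    "i3": ["skos", "semweb"],
    "i4": ["rdf", "photos"],
}
PAIRS = cooccurrence(ITEM_TAGS)
```

Pairs that clear the threshold are candidates for skos:related links or for review by an editor, which is where relatedness by explicit warrant would come in.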

Siderean Software, Inc.
390 North Sepulveda Blvd., Suite 2070
El Segundo, CA 90245-4475 USA
+1 310 647-4266
http://www.siderean.com
ballen at siderean dot com