Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web

May 11, 2005 WWW 2005 -- Chiba, Japan

Andrew Hogue (Google, MIT CSAIL)

Acknowledgments

• David Karger
• Haystack Group (http://haystack.csail.mit.edu)

Agenda

• Overview
• Demo
• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics

Unwrapping the Web

• Majority of semantic content is in the “deep web”
• Transformed into human-readable HTML by scripts
• HTML is difficult for automated agents to understand
• Little incentive for content providers to provide RDF markup
• How to “unwrap” this content?

Thresher

• Simple UI for wrapper induction on structured web content
• “Demonstrate” examples of objects
• Induce wrapper, or pattern, based on DOM
• User may also label properties with RDF

Thresher

• Built on Haystack Semantic Web client
• Everything is RDF
• Everything has context menus
• Thresher brings RDF into the web browser
• Wrappers reify web objects for full interaction

Thresher

• Underlying wrapper algorithm based on tree edit distance
• Align user’s examples
• Keep aligned nodes (layout elements)
• Wildcard non-aligned nodes (content)
• Pattern matching is also alignment

Agenda

• Overview
• Demo
• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics

Wrapper Induction

• Wrapper: pattern created from examples
• User provides positive examples
• Generalize examples into reusable pattern
• Existing techniques:
  – Head-left-right-tail (HLRT) descriptors
  – Hidden Markov models
  – Support vector machines
  – Other machine learning

Wrapper Induction

• Our approach: take advantage of hierarchical structure of HTML
• Each example picks out a subtree of DOM
• Calculate tree edit distance between examples
• Least-cost edit distance gives best mapping
• Remove unmapped nodes to make pattern
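
The generalization step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Thresher's implementation: trees are hypothetical (label, children) tuples, text nodes are leaves labelled with their text, and the child alignment uses difflib's label-based sequence matching as a stand-in for the least-cost tree edit mapping. Each run of unmapped nodes is replaced by a single wildcard. The example rows are invented for illustration.

```python
from difflib import SequenceMatcher

WILDCARD = ("*", ())  # hypothetical marker for "any content here"


def node(label, *children):
    """Build a tree as a (label, children) tuple."""
    return (label, tuple(children))


def generalize(a, b):
    """Merge two example subtrees into one pattern.

    Nodes that align in both examples are kept; everything else
    collapses into a wildcard.
    """
    (label_a, kids_a), (label_b, kids_b) = a, b
    if label_a != label_b:
        return WILDCARD
    matcher = SequenceMatcher(
        None, [k[0] for k in kids_a], [k[0] for k in kids_b], autojunk=False
    )
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same labels on both sides: recurse into each aligned pair.
            out.extend(
                generalize(x, y) for x, y in zip(kids_a[i1:i2], kids_b[j1:j2])
            )
        elif not out or out[-1] != WILDCARD:
            # Unaligned run (content that differs): one wildcard.
            out.append(WILDCARD)
    return (label_a, tuple(out))


# Two made-up table rows with the same layout but different text.
ex1 = node("tr", node("td", node("b", node("Talk A"))), node("td", node("1:00 PM")))
ex2 = node("tr", node("td", node("b", node("Talk B"))), node("td", node("2:30 PM")))
pattern = generalize(ex1, ex2)
```

Here the layout elements (tr, td, b) survive in `pattern` and the differing text leaves become wildcards.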

Tree Edit Distance

• Calculate cost of sequence of operations to transform one tree into the other
• Operations: insert, delete, change a node
• Cost of an operation = size of subtree it affects
• Least-cost set of operations gives best mapping between elements
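
As a rough sketch of the cost computation, the Python below implements a simplified top-down variant of tree edit distance. The assumptions are mine, not the slides': the two roots always map to each other, children are aligned left to right with a standard sequence-edit table, inserting or deleting a child costs the size of its subtree (as the slide states), and changing a node's label costs 1. Trees are hypothetical (label, children) tuples with every text node labelled "#text".

```python
from functools import lru_cache


def node(label, *children):
    """Build a tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def size(tree):
    """Number of nodes in the subtree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Least cost of turning tree a into tree b, roots mapped to each other."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    n, m = len(kids_a), len(kids_b)
    # d[i][j]: cost of turning the first i children of a into the first j of b
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])  # delete whole subtrees
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])  # insert whole subtrees
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids_a[i - 1]),
                d[i][j - 1] + size(kids_b[j - 1]),
                d[i - 1][j - 1] + edit_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    change = 0 if label_a == label_b else 1  # assumed cost of a label change
    return change + d[n][m]


row_bold = node("tr", node("td", node("b", node("#text"))), node("td", node("#text")))
row_plain = node("tr", node("td", node("#text")), node("td", node("#text")))
cost = edit_distance(row_bold, row_plain)
```

Recording which branch of the `min` was taken at each cell would recover the mapping between nodes; the sketch returns only the cost to stay short.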

Mapping Examples

Agenda

• Overview
• Demo
• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics

Pattern Matching

• Look for document subtrees with similar structure
• Find alignments of wrapper in tree
• Require every node in wrapper be mapped to some node in document subtree
• Wildcards match zero or more times
• Each valid alignment is a match
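
The matching rules above can be illustrated with a small recursive matcher. This is a sketch under assumptions of my own: wrappers and documents are hypothetical (label, children) tuples, a wildcard is the node ("*", ()), and a wildcard among a node's children may absorb zero or more sibling subtrees. Every non-wildcard wrapper node must map to a document node with the same label.

```python
WILDCARD = ("*", ())  # hypothetical wildcard marker


def node(label, *children):
    """Build a tree as a (label, children) tuple."""
    return (label, tuple(children))


def matches(pattern, tree):
    """True if the wrapper pattern aligns with this document subtree."""
    if pattern == WILDCARD:
        return True
    (p_label, p_kids), (t_label, t_kids) = pattern, tree
    return p_label == t_label and _match_children(p_kids, t_kids)


def _match_children(pats, kids):
    """Align a sequence of pattern children with a sequence of document children."""
    if not pats:
        return not kids  # no leftover document nodes allowed
    head, rest = pats[0], pats[1:]
    if head == WILDCARD:
        # The wildcard absorbs zero or more leading siblings.
        return any(_match_children(rest, kids[k:]) for k in range(len(kids) + 1))
    return bool(kids) and matches(head, kids[0]) and _match_children(rest, kids[1:])


def subtrees(tree):
    """Yield every subtree of the document, root first."""
    yield tree
    for child in tree[1]:
        yield from subtrees(child)


def find_matches(pattern, document):
    """Each subtree the wrapper aligns with is one match."""
    return [t for t in subtrees(document) if matches(pattern, t)]


# Made-up wrapper and page: two data rows and one row with a different layout.
wrapper = node("tr", node("td", node("b", WILDCARD)), node("td", WILDCARD))
page = node(
    "table",
    node("tr", node("td", node("b", node("X"))), node("td", node("1"))),
    node("tr", node("td", node("b", node("Y"))), node("td", node("2"))),
    node("tr", node("td", node("header"))),
)
found = find_matches(wrapper, page)
```

The backtracking over wildcard positions is exponential in the worst case, which is acceptable for a sketch but would need memoization on large pages.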

Matching Example

Agenda

• Overview
• Demo
• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics

Adding Semantics

• How to tie wrappers to semantic content?
• Assert RDF statements about unwrapped objects
• Tied to wrapper structure
• Classes bound to wrappers
• Properties bound to wildcards
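
One way to picture the binding of classes to wrappers and properties to wildcards is the Python sketch below. It is illustrative only and makes several simplifying assumptions: each labelled wildcard (the hypothetical `slot` helper) binds exactly one document subtree, leaves are treated as text, statements are plain (subject, predicate, object) tuples rather than a real RDF store, and the blank-node name "_:talk" and the row contents are invented. The class and property names follow the talk-announcement example in these slides.

```python
def node(label, *children):
    """Build a tree as a (label, children) tuple."""
    return (label, tuple(children))


def slot(prop):
    """A wildcard bound to an RDF property name (hypothetical helper)."""
    return ("*", prop)


def text_of(tree):
    """Concatenate the text under a subtree; leaves are assumed to be text."""
    label, children = tree
    return label if not children else " ".join(text_of(c) for c in children)


def extract(pattern, tree, out):
    """Walk pattern and tree together, filling out[property] at each slot."""
    if pattern[0] == "*":
        out[pattern[1]] = text_of(tree)
        return True
    (p_label, p_kids), (t_label, t_kids) = pattern, tree
    if p_label != t_label or len(p_kids) != len(t_kids):
        return False
    return all(extract(p, t, out) for p, t in zip(p_kids, t_kids))


def unwrap(rdf_class, pattern, tree, subject="_:talk"):
    """Return RDF-style statements for one matched object, or None on mismatch."""
    props = {}
    if not extract(pattern, tree, props):
        return None
    return [(subject, "rdf:type", rdf_class)] + [
        (subject, prop, value) for prop, value in props.items()
    ]


wrapper = node(
    "tr",
    node("td", node("b", slot("series"))),
    node("td", slot("dc:title")),
    node("td", slot("time")),
)
row = node(
    "tr",
    node("td", node("b", node("Example Series"))),
    node("td", node("Example Talk")),
    node("td", node("3:30 PM")),
)
statements = unwrap("TalkAnnouncement", wrapper, row)
```

The class is asserted once per matched object, while each property statement comes from the wildcard position it was bound to.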

Semantic Labels

Semantic Matching

[ <rdf:type> <TalkAnnouncement> ;
  <series> “Dertouzos Lect…” ;
  <dc:title> “Distributed Hash…” ;
  <time> “3:30 PM” ]

Agenda

• Overview
• Demo
• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics

Automatically Adding Examples

• Find additional examples automatically
• Consider nodes neighboring the example
• Require low normalized cost
• Often allows us to create wrappers with a single example
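
The neighbor heuristic above might look roughly like the following Python sketch. The normalization formula did not survive in this transcript, so the one used here (edit cost divided by the combined size of the two subtrees) is an assumption, as is the 0.3 threshold. The edit cost is the same simplified top-down distance used for illustration elsewhere: insert or delete costs the subtree size, a label change costs 1. Trees are hypothetical (label, children) tuples with text nodes labelled "#text".

```python
from functools import lru_cache


def node(label, *children):
    """Build a tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def size(tree):
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def dist(a, b):
    """Simplified top-down edit cost, computed one table row at a time."""
    kids_a, kids_b = a[1], b[1]
    row = [0]
    for y in kids_b:
        row.append(row[-1] + size(y))  # insert every child of b
    for x in kids_a:
        new = [row[0] + size(x)]  # delete every child of a so far
        for j, y in enumerate(kids_b, 1):
            new.append(
                min(row[j] + size(x), new[j - 1] + size(y), row[j - 1] + dist(x, y))
            )
        row = new
    return (a[0] != b[0]) + row[-1]


def normalized_cost(a, b):
    """Assumed normalization: cost relative to the combined subtree sizes."""
    return dist(a, b) / (size(a) + size(b))


def similar_siblings(parent, index, threshold=0.3):
    """Indices of siblings structurally close to the example at parent[index]."""
    kids = parent[1]
    example = kids[index]
    return [
        i
        for i, sibling in enumerate(kids)
        if i != index and normalized_cost(example, sibling) <= threshold
    ]


header = node("tr", node("th", node("#text")))
row1 = node("tr", node("td", node("#text")), node("td", node("#text")))
row2 = node("tr", node("td", node("b", node("#text"))), node("td", node("#text")))
row3 = node("tr", node("td", node("#text")), node("td", node("#text")))
table = node("table", header, row1, row2, row3)
extra = similar_siblings(table, 1)
```

With the user pointing at `row1` only, the two other data rows fall under the threshold and the header row does not, which is how a single example can be enough to induce a wrapper.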

Automatically Adding Examples

[Figure: DOM tree diagram; surviving node labels: TR, T]

List Collapse

• Current wrappers generalize well for single elements
• Will not recognize variable length lists
• Collapse neighboring nodes with low normalized cost
• For matching, allow nodes to match more than once
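
A compact way to see list collapse is the sketch below. It simplifies in two labelled ways: neighboring pattern nodes are collapsed only when they are structurally identical (standing in for "low normalized cost"), and a wildcard here matches exactly one subtree. Nodes are hypothetical (label, children, repeat) tuples, where `repeat` marks a node that may match one or more consecutive siblings.

```python
def node(label, *children, repeat=False):
    """Build a pattern or document node; repeat marks a collapsed list item."""
    return (label, tuple(children), repeat)


def collapse(pattern):
    """Merge runs of identical neighboring children into one repeatable node."""
    label, children, repeat = pattern
    out = []
    for child in map(collapse, children):
        if out and out[-1][:2] == child[:2]:
            out[-1] = (child[0], child[1], True)  # same shape as previous sibling
        else:
            out.append(child)
    return (label, tuple(out), repeat)


def matches(pattern, tree):
    (p_label, p_kids, _), (t_label, t_kids, _) = pattern, tree
    return p_label == "*" or (p_label == t_label and match_children(p_kids, t_kids))


def match_children(pats, kids):
    if not pats:
        return not kids
    head, rest = pats[0], pats[1:]
    if not kids or not matches(head, kids[0]):
        return False
    if match_children(rest, kids[1:]):
        return True
    # A repeatable node may go on to match further siblings.
    return head[2] and match_children(pats, kids[1:])


two_items = node("ul", node("li", node("*")), node("li", node("*")))
wrapper = collapse(two_items)
three_items = node(
    "ul", node("li", node("a")), node("li", node("b")), node("li", node("c"))
)
```

The uncollapsed `two_items` pattern only matches lists of exactly two items, while the collapsed `wrapper` accepts any list with at least one item.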

Wrapper Wrap-up

• Gather user example(s)
• Automatically find additional examples
• Generalize examples using best mapping
• Add semantic labels
• Match by finding alignments
• Overlay objects on the page for interaction

Additional Tools

• Wrapper Sharing
• RSS
• Web Operations

Our Contributions

• End-user wrapper induction

• Few examples required

• Bring object interaction into the browser

• Wrappers bridge syntactic-semantic gap

Future Work and Applications

• Document-level classes
• Page reformatting
• Autonomous agent interaction
• Negative examples
• Automatic wrapper induction

http://haystack.csail.mit.edu