AMBER presentation

Little Knowledge Rules The Web: Domain-Centric Result Page Extraction. Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang. Department of Computer Science, University of Oxford. [email protected]

Transcript of AMBER presentation

Page 1: AMBER presentation

Little Knowledge Rules The Web: Domain-Centric Result Page Extraction

Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang

Department of Computer Science, University of Oxford

[email protected]

Page 2: AMBER presentation

Result Page Understanding

Page 3: AMBER presentation

Outline

Adaptable Model-Based Extraction of Result Pages (AMBER)

• System Overview

• Experiments

• Current Work

Page 4: AMBER presentation

Part of DIADEM | Domain-centric Intelligent Automated Data Extraction Methodology

AMBER: System Overview

Needs only one clue

Implemented in rules

Very high precision & recall

Domain-parameterized tool, currently aimed at UK real estate

Adaptable Model-Based Extraction of Result Pages

Page 5: AMBER presentation

AMBER: System Overview

Page 6: AMBER presentation

Fact Generation & Annotation

• Live browser (Mozilla XUL-Runner)

• Extract DOM tree

• CSS box information

• Textual annotation with GATE (domain-dependent)

– Gazetteers

– Regular-expression-like rules

• All represented as facts in the Page Model
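
As a rough illustration of "all represented as facts", the sketch below turns a small HTML fragment into tuples for DOM nodes, text, and annotations. It is not AMBER's implementation: the slides describe a live browser, GATE, and a logic-based Page Model, whereas this uses only Python's standard `html.parser`. The predicate names (`node`, `text`, `annotation`), the gazetteer entries, and the price pattern are all assumptions made for the example, and CSS box information is left out because it needs a rendering engine.

```python
import re
from html.parser import HTMLParser

# Hypothetical gazetteer and pattern; the slides do not list the real ones.
LOCATION_GAZETTEER = {"Oxford", "London"}
PRICE_PATTERN = re.compile(r"£\s?\d[\d,]*")


class FactCollector(HTMLParser):
    """Walks HTML and records (predicate, args...) tuples as facts."""

    def __init__(self):
        super().__init__()
        self.facts = []
        self._stack = []   # ids of the elements that are currently open
        self._next_id = 0

    def handle_starttag(self, tag, attrs):
        node_id = self._next_id
        self._next_id += 1
        parent = self._stack[-1] if self._stack else None
        self.facts.append(("node", node_id, tag, parent))
        self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Assumes well-nested input without void elements such as <br>.
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        owner = self._stack[-1]
        self.facts.append(("text", owner, text))
        # Regular-expression-like rule for prices.
        if PRICE_PATTERN.search(text):
            self.facts.append(("annotation", owner, "price"))
        # Gazetteer lookup for locations.
        if any(place in text for place in LOCATION_GAZETTEER):
            self.facts.append(("annotation", owner, "location"))


def page_facts(html):
    collector = FactCollector()
    collector.feed(html)
    collector.close()
    return collector.facts


facts = page_facts("<ul><li><span>£250,000</span><b>Oxford</b></li></ul>")
for fact in facts:
    print(fact)
```

Keeping every observation as a flat tuple mirrors the idea that later stages only need to query facts, not re-parse the page.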

Page 7: AMBER presentation

Phenomenological Mapping

Fact → Attribute

• Attribute Model:

– Types & constraints

• DOM node and attribute

• Attribute Creation Constraints:

– Required Annotations

– Disallowed Annotations
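
The creation constraints above can be read as a simple filter: a node becomes an attribute of some type only if it carries every required annotation and none of the disallowed ones. The sketch below shows that reading in Python. The `AttributeRule` class, the `create_attributes` function, and the annotation names (including `phone_number`) are hypothetical; the slides say the real rules are implemented in DLV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeRule:
    name: str              # attribute type to create
    required: frozenset    # annotations that must all be present
    disallowed: frozenset  # annotations that must all be absent


def create_attributes(node_annotations, rules):
    """Map {node_id: set of annotations} to a list of (node_id, attribute)."""
    attributes = []
    for node_id, annotations in sorted(node_annotations.items()):
        for rule in rules:
            has_required = rule.required <= annotations
            has_disallowed = bool(rule.disallowed & annotations)
            if has_required and not has_disallowed:
                attributes.append((node_id, rule.name))
    return attributes


# Hypothetical rules: a price must not also look like a phone number.
rules = [
    AttributeRule("price", frozenset({"price"}), frozenset({"phone_number"})),
    AttributeRule("location", frozenset({"location"}), frozenset()),
]
node_annotations = {2: {"price"}, 5: {"price", "phone_number"}, 7: {"location"}}
print(create_attributes(node_annotations, rules))
# prints [(2, 'price'), (7, 'location')]
```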

Page 8: AMBER presentation

Segmentation Mapping: Identification

Attribute → Data area

• From bottom phenomena to data area

• Little knowledge rules the web

• Only one domain concept (mandatory attribute)

– Price

– Location

– Title
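
One way to picture going from a single mandatory attribute to a data area: collect the nodes carrying that attribute, keep the largest group that sits at the same tree depth, and take the lowest common ancestor of that group. This is a deliberately simplified sketch and not the rule set AMBER uses. The function names and the `parent` dictionary representation are assumptions, and it finds only one data area per page.

```python
from collections import defaultdict


def path_to_root(node, parent):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def identify_data_area(mandatory_nodes, parent):
    """Lowest common ancestor of the largest same-depth group, or None."""
    if not mandatory_nodes:
        return None
    by_depth = defaultdict(list)
    for node in mandatory_nodes:
        by_depth[len(path_to_root(node, parent))].append(node)
    group = max(by_depth.values(), key=len)
    if len(group) < 2:
        return None  # one occurrence does not suggest a repeated structure
    # Intersect the root paths; the deepest shared node is the data area.
    common = set(path_to_root(group[0], parent))
    for node in group[1:]:
        common &= set(path_to_root(node, parent))
    return max(common, key=lambda n: len(path_to_root(n, parent)))


# Hypothetical page: three prices in a list, one stray price in the navigation.
parent = {
    "html": None, "body": "html", "nav": "body", "ul": "body",
    "li1": "ul", "li2": "ul", "li3": "ul",
    "p1": "li1", "p2": "li2", "p3": "li3", "promo": "nav",
}
print(identify_data_area(["p1", "p2", "p3", "promo"], parent))
# prints ul
```

The stray price in the navigation bar sits at a different depth, so it does not pull the data area up to `body`.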

Page 9: AMBER presentation

Segmentation Mapping: Identification

• Multi data area identification

Page 10: AMBER presentation

Segmentation Mapping: Understanding

• Data area → Record

• Domain independent

• Identify leading nodes

• Two problems

– Superfluous nodes

– Correct shift
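
The two problems can be illustrated with a small segmentation sketch. It assumes one possible reading of the slide: leading nodes are the children of the data area that contain the mandatory attribute, superfluous nodes are children such as headers and footers that belong to no record, and the shift is how many siblings before a leading node a record actually starts. The function name `segment_records`, the `shift` parameter, and the regular-record-length assumption are all made up for the example.

```python
def segment_records(children, leading, shift=0):
    """Split the ordered children of a data area into records.

    children: ordered child ids of the data area
    leading:  set of children that contain the mandatory attribute
    shift:    number of siblings before each leading node where a record starts
    """
    starts = [max(i - shift, 0) for i, child in enumerate(children) if child in leading]
    if not starts:
        return []
    # Assumed regularity: the last record is as long as the one before it,
    # so trailing superfluous nodes are not swallowed by the final record.
    if len(starts) > 1:
        last_len = starts[-1] - starts[-2]
    else:
        last_len = len(children) - starts[-1]
    bounds = starts[1:] + [min(starts[-1] + last_len, len(children))]
    return [children[start:end] for start, end in zip(starts, bounds)]


children = ["header", "title1", "price1", "title2", "price2", "footer"]
leading = {"price1", "price2"}

print(segment_records(children, leading, shift=0))
# prints [['price1', 'title2'], ['price2', 'footer']]
print(segment_records(children, leading, shift=1))
# prints [['title1', 'price1'], ['title2', 'price2']]
```

With `shift=0` each title is attached to the wrong record and the footer leaks in; with `shift=1` the records line up and the header and footer drop out as superfluous.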

Page 11: AMBER presentation

Segmentation Mapping: Understanding

Page 12: AMBER presentation

Segmentation Mapping: Understanding

Page 13: AMBER presentation

Experiments

[Bar chart: precision, recall, and F-measure for Data Area, Record, Attribute, Price, and Location; vertical axis from 95.0% to 100.0%]

Page 14: AMBER presentation

Summary

• AMBER - Adaptable Model-based Extraction of Result Pages

– Domain knowledge: simple heuristic

– Using DLV: compact & easy implementation

– Understanding phase: only one domain clue, so quickly adaptable to new domains

– Very high precision (99.4%) and recall (99.0%)
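
For reference, precision and recall can be combined into a single F-measure. The snippet below assumes the balanced F1 score (the harmonic mean), which the slides do not state explicitly, and the resulting value is derived here from the two figures above rather than reported in the presentation.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the summary slide.
print(f"{f_measure(0.994, 0.990):.1%}")
# prints 99.2%
```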

Page 15: AMBER presentation

Current Work

• Testing AMBER on another domain

• Integrate visual information in understanding phase

• Use probabilistic logic programming to improve the whole system

Page 16: AMBER presentation

Thanks!