AMBER presentation

Post on 22-Nov-2014

Little Knowledge Rules The Web: Domain-Centric Result Page Extraction

Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang

Department of Computer Science, University of Oxford

Cheng.wang@trinity.ox.ac.uk

Result Page Understanding

Outline

Adaptable Model-Based Extraction of Result Pages (AMBER)

• System Overview

• Experiments

• Current Work

Part of DIADEM | Domain-centric Intelligent Automated Data Extraction Methodology

AMBER: System Overview

Needs only one clue

Implemented in rules

Very high precision & recall

Domain-parameterized tool, currently aimed at UK real estate

Adaptable Model-Based Extraction of Result Pages

AMBER: System Overview

Fact Generation & Annotation

• Live browser (Mozilla XUL-Runner)

• Extract DOM tree

• CSS box information

• Textual annotation with GATE (domain dep.)

– Gazetteers

– Regular-expression-like rules

• All represented as facts in the Page Model
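
As a rough illustration of "all represented as facts", the sketch below turns a small HTML fragment into tuples. It is not AMBER's implementation: the slides name a live browser and GATE, while this sketch substitutes Python's standard `html.parser`, a regular expression, and a two-entry gazetteer. The fact names (`node`, `child`, `text`, `annotation`), `PRICE_RE`, and `LOCATIONS` are all assumptions made for the example, and CSS box information is left out.

```python
import re
from html.parser import HTMLParser

PRICE_RE = re.compile(r"£\s?\d[\d,]*")   # hypothetical stand-in for a GATE rule
LOCATIONS = {"Oxford", "Headington"}     # hypothetical stand-in gazetteer


class FactCollector(HTMLParser):
    """Collects DOM and annotation facts as plain tuples.

    Assumes well-nested markup without unclosed void elements such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.facts = []
        self.stack = []      # ids of the currently open elements
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        node_id = self.next_id
        self.next_id += 1
        self.facts.append(("node", node_id, tag))
        if self.stack:
            self.facts.append(("child", self.stack[-1], node_id))
        self.stack.append(node_id)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        node_id = self.stack[-1]
        self.facts.append(("text", node_id, text))
        # Textual annotation: regex rule and gazetteer lookup.
        if PRICE_RE.search(text):
            self.facts.append(("annotation", node_id, "price"))
        if any(loc in text for loc in LOCATIONS):
            self.facts.append(("annotation", node_id, "location"))


def page_facts(html):
    collector = FactCollector()
    collector.feed(html)
    collector.close()
    return collector.facts


facts = page_facts("<ul><li><b>£250,000</b> <span>Flat in Oxford</span></li></ul>")
for fact in facts:
    print(fact)
```

Flat tuples keep the example close to the rule-based setting of the slides, where DOM structure and annotations are queried uniformly.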

Phenomenological Mapping

Fact → Attribute

• Attribute Model:

– Types & constraints

• DOM node and attribute

• Attribute Creation Constraints:

– Required Annotations

– Disallowed Annotations
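
One way to read the creation constraints: a node yields an attribute only if it carries every required annotation and none of the disallowed ones. The sketch below encodes that reading with sets. `AttributeType`, `create_attributes`, and the `monthly_fee` label are hypothetical names chosen for the example, not taken from AMBER.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeType:
    """Hypothetical attribute model entry with its creation constraints."""
    name: str
    required: frozenset
    disallowed: frozenset = frozenset()


def create_attributes(annotations, attribute_types):
    """annotations maps a node id to the set of annotation labels on that node."""
    attributes = []
    for node_id, labels in sorted(annotations.items()):
        for attr in attribute_types:
            has_required = attr.required <= labels
            has_disallowed = bool(attr.disallowed & labels)
            if has_required and not has_disallowed:
                attributes.append((attr.name, node_id))
    return attributes


PRICE = AttributeType("price", frozenset({"price"}), frozenset({"monthly_fee"}))
LOCATION = AttributeType("location", frozenset({"location"}))

annotations = {
    2: {"price"},
    3: {"location"},
    5: {"price", "monthly_fee"},   # rejected: carries a disallowed annotation
}
attributes = create_attributes(annotations, [PRICE, LOCATION])
print(attributes)
```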

Segmentation Mapping: Identification

Attribute → Data area

• From bottom phenomena to data area

• Little knowledge rules the web

• Only one domain concept (mandatory attribute)

– Price

– Location

– Title
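
A minimal sketch of how a single mandatory attribute could locate a data area, assuming a simplified heuristic that the slides do not spell out: take the nodes carrying the mandatory attribute (called "pivots" here, a hypothetical term), keep the largest group at the same DOM depth, and use their lowest common ancestor as the data area root. The `parent` map and the function names are illustrative.

```python
from collections import defaultdict


def ancestors(node, parent):
    """Path from node up to the root, inclusive of node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def find_data_area(pivots, parent):
    """Return (root, group) for the largest same-depth group of pivots, or None."""
    if not pivots:
        return None
    by_depth = defaultdict(list)
    for p in pivots:
        by_depth[len(ancestors(p, parent))].append(p)
    group = max(by_depth.values(), key=len)
    if len(group) < 2:
        return None   # a single occurrence gives no evidence of a repeated area
    common = set(ancestors(group[0], parent))
    for p in group[1:]:
        common &= set(ancestors(p, parent))
    if not common:
        return None
    # The lowest common ancestor is the deepest node shared by all paths.
    root = max(common, key=lambda n: len(ancestors(n, parent)))
    return root, group


# Toy DOM: 0=body, 1=ul, 2 and 4=li, 3 and 5=price nodes, 6=footer, 7=stray price.
parent = {1: 0, 2: 1, 3: 2, 4: 1, 5: 4, 6: 0, 7: 6}
area = find_data_area([3, 5, 7], parent)
print(area)
```

The stray price in the footer sits at a different depth, so it does not pull the data area up to the page body.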

Segmentation Mapping: Identification

• Multi data area identification

Segmentation Mapping: Understanding

• Data area → Record

• Domain independent

• Identify leading nodes

• Two problems

– Superfluous nodes

– Correct shift
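
The two problems can be illustrated with a purely structural sketch. It assumes, beyond what the slides state, that leading nodes are evenly spaced among the children of the data area and that the right shift is the one under which all records have the same tag sequence. Children not covered by any record are reported as superfluous. `segment_records` and its inputs are hypothetical.

```python
def segment_records(tags, leading):
    """Split the children of a data area into records.

    tags:    tag names of the data area's children, in document order.
    leading: sorted indices of children that contain a mandatory attribute;
             assumed evenly spaced.
    Returns (spans, superfluous), with spans as half-open (start, end) pairs.
    """
    everything = list(range(len(tags)))
    if len(leading) < 2:
        return [], everything
    period = leading[1] - leading[0]
    best = None
    for shift in range(period):
        # A record starts `shift` children before its leading node.
        spans = [(i - shift, i - shift + period) for i in leading]
        if spans[0][0] < 0 or spans[-1][1] > len(tags):
            continue
        shapes = [tuple(tags[a:b]) for a, b in spans]
        score = sum(shape == shapes[0] for shape in shapes)
        if best is None or score > best[0]:
            best = (score, spans)
    if best is None:
        return [], everything
    spans = best[1]
    covered = {i for a, b in spans for i in range(a, b)}
    superfluous = [i for i in everything if i not in covered]
    return spans, superfluous


tags = ["h2", "h3", "div", "p", "h3", "div", "p", "hr"]
spans, superfluous = segment_records(tags, leading=[2, 5])
print(spans, superfluous)
```

In this toy input the leading `div` nodes are not the first node of their records; shifting by one child aligns both records as `h3, div, p` and leaves the `h2` header and trailing `hr` as superfluous.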

Experiments

[Figure: bar chart of precision, recall, and F-measure (vertical axis 95.0% to 100.0%) for data area, record, attribute, price, and location]

Summary

• AMBER: Adaptable Model-Based Extraction of Result Pages

– Domain knowledge: simple heuristics

– Using DLV: compact and easy implementation

– Understanding phase: only one domain clue, so quickly adaptable to new domains

– Very high precision (99.4%) and recall (99.0%)
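
For reference, the precision and recall above can be combined into a single score. The slides do not say which F-measure variant the experiments use; the sketch assumes the balanced F1, the harmonic mean of precision and recall.

```python
def f_measure(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f_measure(0.994, 0.990)
print(f"{f1:.3f}")   # prints 0.992
```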

Current Work

• Testing AMBER on another domain

• Integrate visual information into the understanding phase

• Use probabilistic logic programming to improve the whole system

Thanks!