
Ontology-based Web Information Extraction in Practice

eRecruitment – eTourism - eProcurement

Japan-Austria Joint Workshop on “ICT”

Tokyo, October 18-19, 2010

Institute for a.Univ.-Prof. Dr. DI Birgit Pröll

Application Oriented Knowledge Processing [email protected]


Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 2

Contents

• Motivation
• Web Information Extraction (WebIE) by Examples
• General Architecture
• Web Crawler
• Ontology Aware WebIE
• Structure Analysis: Page Segmentation, Table Extraction
• Evaluation & Manual Correction of Results
• Lessons Learned & Future Work


Web Information Extraction (WebIE)

… extracting structured data from Web pages

Templates
• accommodation's name: Alpenrose
• address: A-6212 Maurach
• phone: ++43 (0)524352930
• pool facility: -


WebIE Projects in cooperation with Austrian Industry


Projects' Requirements and Approach Taken

WebIE Approaches
• Screen scraping approaches (wrapper generation)
• Automatically trainable systems (machine learning)
• Knowledge-engineering approach

Some WebIE peculiarities in the given projects
• Heterogeneously designed Web pages
• Mixture of (semi-)structured data and full text
• Significant structural aspects, e.g.,
  • location of information on Web page
  • information "hidden" in Web tables
• Information scattered over several Web pages
• Web site evolution

Approach taken
• Knowledge-engineering approach + Web crawler + structural analysis + …

[Appelt et al., 1999]


Overall Architecture

[Figure: Web sites → Crawler → Pre-Processing → Information Extraction → Post-Processing → Output]

• IE-Pipeline (GATE *): Tokenizer, Gazetteer-Lists, Sentence-Splitter (e.g.), Ontology-Plugin, Transducers
• Knowledge Base: Domain Ontology, Gazetteer lists, Rules
• Intermediate result: annotated Web pages
• Output: XML, for example:

<?xml version="1.0"?>
<masterdata>
  <accname>Alpengasthof XYZ</accname>
  <adress>Baumbachstr. 4040 Linz</adress>
  ….
  ….
</masterdata>

*) [Cunningham et al, 2006]
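The output step can be illustrated with a minimal Python sketch that serialises a filled template into the XML shape shown on the slide. This is an illustration only, not the projects' post-processing code: the function name `template_to_xml` and the flat slot-to-string dictionary are assumptions, while the element names `masterdata`, `accname` and `adress` are taken from the slide.

```python
import xml.etree.ElementTree as ET


def template_to_xml(template: dict) -> str:
    """Serialise a flat template (slot name -> extracted string) as XML.

    Hypothetical helper: the slide only shows the resulting XML,
    not how it is produced.
    """
    root = ET.Element("masterdata")
    for slot, value in template.items():
        child = ET.SubElement(root, slot)
        child.text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


# Slot values copied from the slide's sample output
record = {"accname": "Alpengasthof XYZ", "adress": "Baumbachstr. 4040 Linz"}
xml_output = template_to_xml(record)
print(xml_output)
```

A real pipeline would probably also need nested elements and a schema, which this sketch leaves out.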


Web Crawler

• Collects relevant Web pages
• Classifies Web pages
  • Home page, price pages, location pages, etc.
  • Based on Support Vector Machine
• Recognises language
  • Using meta-tags and an n-gram based algorithm
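The language recognition step could look roughly like the following Python sketch: it first looks for a declared language in the page markup and otherwise falls back to comparing character trigrams against small language profiles. Everything here is an assumption for illustration (the class `LangMetaParser`, the function `guess_language`, the toy sample sentences and the overlap score); the slide does not say which meta-tags or which n-gram algorithm the crawler actually uses.

```python
from collections import Counter
from html.parser import HTMLParser


class LangMetaParser(HTMLParser):
    """Picks up a declared language from <html lang=...> or a content-language meta tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if self.lang is not None:
            return
        a = dict(attrs)
        if tag == "html" and a.get("lang"):
            self.lang = a["lang"][:2].lower()
        elif tag == "meta" and (a.get("http-equiv") or "").lower() == "content-language":
            if a.get("content"):
                self.lang = a["content"][:2].lower()


def trigrams(text):
    """Count character trigrams of the whitespace-normalised, lower-cased text."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


# Toy training sentences (assumption); real profiles would be built from large corpora.
SAMPLES = {
    "en": "the hotel offers rooms with a view of the lake and a swimming pool for our guests",
    "de": "das hotel bietet zimmer mit blick auf den see und ein schwimmbad für unsere gäste",
}
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(html, text):
    """Return the declared language if present, else the best trigram match."""
    parser = LangMetaParser()
    parser.feed(html)
    if parser.lang:
        return parser.lang
    grams = trigrams(text)

    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))
```

Checking the markup first is cheap, and the n-gram fallback covers pages that declare no language.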


Overall Architecture

[Figure: the architecture diagram (Web sites → Crawler → Pre-Processing → Information Extraction with the GATE IE-Pipeline → Post-Processing → XML Output; Knowledge Base with Domain Ontology, Gazetteer lists, Rules; annotated Web pages), extended with Manual Evaluation and Manual Correction.]

Types of annotations
• syntactical, morphological
• ontological
• structural
• relevance judging


Regular Expressions & Gazetteer Lookup

Rule: Phone1
(
  {Token.string == "+"}
  {Token.kind == number}
  ({SpaceToken.kind == space})*
  {Token.string == "("}
  {Token.kind == number}
  {Token.string == ")"}
  (({SpaceToken.kind == space})*
   {Token.kind == number})+
):phone
-->
:phone.MyPhone = {}

Gazetteer list 'phone keywords'
• Phone
• Telephone
• Tel.
• Tel:
• Tel.:
• Telefon
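For readers who do not use GATE, the rule and the gazetteer list above can be approximated in plain Python. The sketch below is a rough translation, not the projects' code: the regular expression mirrors the token sequence of rule Phone1 (plus sign, number, optional spaces, bracketed number, one or more further numbers), and the keyword lookup is a simple alternation over the list entries. The function names are assumptions.

```python
import re

# Entries copied from the 'phone keywords' gazetteer list on the slide
PHONE_KEYWORDS = ["Phone", "Telephone", "Tel.", "Tel:", "Tel.:", "Telefon"]

# Approximation of rule Phone1: "+" number, optional spaces, "(" number ")",
# then one or more numbers, each optionally preceded by spaces.
PHONE_PATTERN = re.compile(r"\+\d+[ \t]*\(\d+\)(?:[ \t]*\d+)+")


def find_phones(text):
    """Return all substrings that match the phone pattern."""
    return [m.group(0) for m in PHONE_PATTERN.finditer(text)]


def find_phone_keywords(text):
    """Return gazetteer keywords found in the text, preferring longer entries."""
    ordered = sorted(PHONE_KEYWORDS, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.findall(alternation, text)


sample = "Tel.: ++43 (0)524352930"
print(find_phones(sample))
print(find_phone_keywords(sample))
```

As with the rule on the slide, a number written with two leading plus signs is matched from its second plus sign onward.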


Ontology-Aware Entity Recognition (1/2)

[Figure: ontology fragment]
• pool facilities rdfs:subClassOf hotel amenities
• rdfs:Label "swimming pool" (lang=en)
• rdfs:Label "Schwimmbad" (lang=de)
• owl:Synonym "Hallenbad"
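The idea behind this slide, recognising a concept through any of its labels or synonyms, can be sketched in Python as follows. This is a simplified illustration: the class `OntClass`, the helper functions and the plain substring matching are assumptions, and only the class names, labels and synonym come from the slide. The actual systems use a GATE ontology plug-in rather than a dictionary.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntClass:
    """Hypothetical in-memory stand-in for an ontology class."""
    name: str
    parent: Optional["OntClass"] = None
    labels: dict = field(default_factory=dict)    # language code -> label
    synonyms: list = field(default_factory=list)


hotel_amenities = OntClass("hotel amenities")
pool_facilities = OntClass(
    "pool facilities",
    parent=hotel_amenities,
    labels={"en": "swimming pool", "de": "Schwimmbad"},
    synonyms=["Hallenbad"],
)


def build_index(classes):
    """Map every lower-cased name, label and synonym to its class."""
    index = {}
    for cls in classes:
        for term in [cls.name, *cls.labels.values(), *cls.synonyms]:
            index[term.lower()] = cls
    return index


def ancestors(cls):
    """Return the names of all superclasses, nearest first."""
    names = []
    while cls.parent is not None:
        cls = cls.parent
        names.append(cls.name)
    return names


def annotate(text, index):
    """Return (matched term, class name) pairs; naive substring matching."""
    lowered = text.lower()
    return [(term, cls.name) for term, cls in index.items() if term in lowered]


index = build_index([hotel_amenities, pool_facilities])
print(annotate("Unser Hotel hat ein Hallenbad.", index))
```

Because every surface form points to the same class, English and German pages end up with the same ontological annotation.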


Ontology-Aware Entity Recognition (2/2)

[Figure: ontology fragment]
• pool facilities rdfs:subClassOf hotel amenities
• pool facilities (e.g., pool) hasAttribute (ObjectProperty) pool attributes (e.g., heated)

Example text:
"We offer a wonderful 2500m2 wellness area, lead by a trained wellness team. Indoor swimming pools, new heated natural outdoor pool with sandy beach, open air whirlpool with a wonderful view of lake Caldaro, large sauna world, and our private beach directly at lake Caldaro, full fill all wishes!"
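A very small Python sketch can show how an attribute such as "heated" might be attached to a nearby pool mention in text like the example above. The fixed look-back window, the term sets and the function name `pool_attributes` are assumptions made for illustration; the slide does not describe how the ontology-aware rules actually link attributes to entities.

```python
import re

# Assumed surface forms; in the described systems these would come from the ontology
POOL_TERMS = {"pool", "pools"}
POOL_ATTRIBUTES = {"heated"}


def pool_attributes(text, window=3):
    """For each pool mention, list attribute words among the preceding `window` tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    results = []
    for i, token in enumerate(tokens):
        if token in POOL_TERMS:
            context = tokens[max(0, i - window):i]
            attrs = [t for t in context if t in POOL_ATTRIBUTES]
            results.append((token, attrs))
    return results


sentence = "Indoor swimming pools, new heated natural outdoor pool with sandy beach"
print(pool_attributes(sentence))
```

A fixed token window is crude: it ignores sentence boundaries and attributes that follow the noun, which is one reason hand-written rules over richer annotations are used in practice.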


Structure Analysis: Web Page Segmentation

[Figure: a job advertisement page segmented into three parts]
• Top part
• Content part
• Bottom part (screenshot text: "powered by Typo3")

Templates
• job title: Senior Java Developer
• IT skills + level: JAVA + perfect; MySQL + basic
• operation area: SW programming, testing
• language skills: English fluently
• contact: -

[Debnath et al., 2005][Chakrabarti et al., 2007]


Structure Analysis: Block Identification

[Figure: the content part of the page divided into five blocks; labelled block types: Requirements, Offer, Responsibilities]


Structure Analysis: Table Data Extraction in Marlies


[Yang et al., 2002][Gatterbauer et al., 2007]


Structure Analysis: Table Data Extraction in TourIE


Evaluation: TourIE

• Evaluation results were satisfactory with respect to the preliminary study.
• Pool facility extraction quality was poor because the ontology was incomplete.


Evaluation: JobOlize

[Bar chart: Precision, Recall, and F-Measure (axis 0.0 to 1.2) for the templates Operation Area, IT-Skill, and Language Skill, comparing Context-driven Extraction with Non Context-driven Extraction.]

• Page segmentation & block identification considerably raise precision.
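For reference, the three measures shown in the chart can be computed as in the following Python sketch. It assumes the balanced F-measure (F1) and exact matching of extracted values against a manually created gold standard; the slide does not state which F-measure variant or matching criterion was used, and the example values are invented.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall and balanced F-measure over sets of values."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: four IT skills extracted, two of them correct
p, r, f = precision_recall_f1(["Java", "MySQL", "Typo3", "PHP"], ["Java", "MySQL"])
print(p, r, f)
```

In this invented example the spurious value "Typo3" lowers precision but not recall, the kind of error that restricting extraction to the right page segment is meant to avoid.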


Evaluation: Marlies

• Work in progress (e.g., table extraction).

Marlies Ontology
• Classes: 2313
• Instances: 2661
• Assignments of object properties to instances: 42791

Preliminary results for recall:

[Bar chart: recall per template (axis 0.00 to 1.00, labelled %). Templates: Gesamt (total), Name, Adresse (address), Mail, Telefon (phone), Fax, UID, Bezeichnung (designation), Attribute, Material, Hauptm, H-Einheit, H-Wert, XYZ, Material.]


Manual Correction via Rich Client GUI


Lessons Learned

• Today's Web pages do not adhere to standards or semantic Web proposals.
  • Only a few RDF resources available; proposed microformats rarely used
  • Poor HTML, e.g., tables used for layout purposes
  • Web 2.0 coded Web pages in progress; content-based image retrieval & OCR
• Development & maintenance of knowledge-based WebIE systems is expensive.
  • Domain experts & knowledge engineers are needed.
  • Rule-coding is tedious and error-prone.
  • Evaluation of numerous methods & algorithms; multiplied due to multilinguality
  • Manual evaluation is time consuming.
• WebIE performance considerably depends on quality of domain ontology.
• We have to observe (evolving) legal issues
  • Robots exclusion standard, Sitemap etc.
  • Further processing of extracted data
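As one concrete example of the points above, a crawler can check the Robots exclusion standard before fetching a page. The Python sketch below uses the standard library module `urllib.robotparser`; the robots.txt content, the user agent name "ExampleCrawler" and the URLs are made up for illustration, and the slides do not say how the projects' crawler handles this.

```python
from urllib.robotparser import RobotFileParser


def make_checker(robots_txt: str, user_agent: str):
    """Return a function that tells whether `user_agent` may fetch a URL."""
    parser = RobotFileParser()
    # parse() takes the lines of an already downloaded robots.txt file
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Hypothetical robots.txt of a site to be crawled
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

allowed = make_checker(ROBOTS_TXT, "ExampleCrawler")
print(allowed("https://example.org/rooms.html"))
print(allowed("https://example.org/private/prices.html"))
```

Respecting robots.txt covers only the crawling side; whether extracted data may be processed further is a separate legal question, as the last bullet notes.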


Future Work: Ontosophia

Ontology-driven IE Supported by (Semi-)Automatic Corrective Feedback

[Figure: Ontosophia architecture. Components: Domain Expert(s); Validation GUI; Documents; Domain Ontology; Ontology-driven Information Extraction with Rule-Generator, IE-Rules, Information Extraction Pipeline (Ontology Lookup, Document Annotation, Template Filling) and IE-Evaluation; Ontology Optimization (Ontology Learning, -Population, -Evaluation); Visualization of IE-Process and Error Trace Back.]

Numbered parts of the figure:
(1) Extraction
(2) Ontology Optimization
(3) Visual Error Trace Back
(4) Evaluation Support


Thank you for your Attention!

Acknowledgements to:
• Christina Feilmayr
• Stefan Parzer
• Christina Buttinger
• Michael Guttenbrunner