Ontology-based Web Information Extraction in Practice · Development & maintenance of...
Transcript of Ontology-based Web Information Extraction in Practice · Development & maintenance of...
Ontology-based Web Information Extraction in Practice
eRecruitment – eTourism - eProcurement
Japan-Austria Joint Workshop on “ICT”
Tokyo, October 18-19, 2010
Institute for a.Univ.-Prof. Dr. DI Birgit Pröll
Application Oriented Knowledge Processing [email protected]
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 2
Contents
� Motivation
� Web Information Extraction (WebIE) by Examples
� General Architecture
� Web Crawler
� Ontology Aware WebIE
� Structure Analysis: Page Segementation, Table Extraction
� Evaluation & Manual Correction of Results
� Lessons Learned & Future Work
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 3
Web Information Extraction (WebIE)…extracting structured data from Web pages
accomodation’s name address phone pool facility templatesaccomodation’s nameAlpenrose
addressA-6212 Maurach
phone++43 (0)524352930
pool facility-
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 4
WebIE Projects in cooperation with Austrian Industry
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 5
Projects‘ Requirements and Approach Taken
WebIE Approaches
• Screen scraping approaches (wrapper generation)
• Automatically trainable systems (machine learning)
• Knowledge-engineering approach
Some WebIE pecularities in the given projects
• Heterogeneously designed Web pages
• Mixture of (semi-)structured data and full text
• Significant structural aspects, e.g.,
• location of information on Web page
• information „hidden“ in Web tables
• Information scattered over several Web pages
• Web site evolution
• Knowledge-engineering approach
+ Web crawler + structural analysis + …
[Appelt et al., 1999]
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 6
Contents
� Motivation
� Web Information Extraction (WebIE) by Examples
� General Architecture
� Web Crawler
� Ontology Aware WebIE
� Structure Analysis: Page Segementation, Table Extraction
� Evaluation & Manual Correction of Results
� Lessons Learned & Future Work
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 7
To
ke
niz
er
IE-Pipeline (GATE *)
Ga
ze
tte
er-
Lis
ts
Se
nte
nce
-S
pli
tte
r(e
.g.)
On
tolo
gy-
Plu
gin
Tra
ns
du
ce
rs
Crawler
Pre-Processing Information Extraction
KnowledgeBase
Post-Processing
Output
Web sites
AnnotatedWeb pages
<?xml version=1.0“>
<masterdata>
<accname>
Alpengasthof
XYZ
</accname>
<adress>
Baumbachstr.
4040 Linz
</adress>
….
….
</masterdata>
XML
DomainOntology
Overall Architecture
Gazetteerlists
Rules
*) [Cunningham et al, 2006]
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 8
Web Crawler
� Collects relevant Web pages
� Classifies Web pages
� Home page, price pages, location pages, etc.
� Based on Support Vector Machine
� Recognises language
� Using meta-tags and an n-gram based algorithm
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 9
To
ke
niz
er
IE-Pipeline (GATE)
Ga
ze
tte
er-
Lis
ts
Se
nte
nce
-S
pli
tte
r(e
.g.)
On
tolo
gy-
Plu
gin
Tra
ns
du
ce
rs
Crawler
Pre-Processing Information Extraction
KnowledgeBase
Post-Processing
Output
Web sites
AnnotatedWeb pages
<?xml version=1.0“>
<masterdata>
<accname>
Alpengasthof
XYZ
</accname>
<adress>
Baumbachstr.
4040 Linz
</adress>
….
….
</masterdata>
XML
DomainOntology
Overall ArchitectureTypes of annotations- syntactical, morphological- ontological- structural- relevance judging
Ma
nu
al
Ev
alu
ati
on
Ma
nu
al
Co
rre
ctio
n
Gazetteerlists
Rules
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 10
Regular Expressions & Gazetteer Lookup
Rule: Phone1
(
{Token.string=="+"}
{Token.kind==number}
({SpaceToken.kind==space})*
{Token.string=="("}
{Token.kind==number}
{Token.string==")"}
(({SpaceToken.kind==space})*
{Token.kind==number})+
):phone
-->
:phone.MyPhone={}
Gazetteer list ‘phone keywords’
Phone
Telephone
Tel.
Tel:
Tel.:
Telefon
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 11
Ontology-Aware Entity Recognition (1/2)
pool
facilities
swimming pool
hotel
amenities
rdfs:subClassOf
lang=en
rdfs:Label
Schwimmbad
rdfs:Label
lang=de
Hallenbad
owl:Synonym
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 12
Ontology-Aware Entity Recognition (2/2)
hasAttibute
(ObjectProperty)
pool
facilitiespool
pool
attributesheated
hotel
amenities
rdfs:subClassOf
We offer a wonderful 2500m2 wellness area, lead by a trained wellness team. Indoorswimming pools, new heated natural outdoor pool with sandy beach, open airwhirlpool with a wonderful view of lake Caldaro, large sauna world, and our privatebeach directly at lake Caldaro, full fill all wishes!
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 13
Structure Analysis: Web Page Segmentation
Top part
Content part
Bottom part
job title
Senior Java Developer
templates
IT skills + levelJAVA + perfect
MySQL + basic
operation areaSW programming, testing
language skillsEnglish fluently
contact-
powered by Typo3
[Debnath et al., 2005][Chakrabarti et al., 2007]
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 14
Structure Analysis: Block Identification
Requirements
Offer
Responsibilities
Block
Block
Block
Block
Block
Content part
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 15
Structure Analysis: Table Data Extraction in Marlies
Japanese-Austrian Workshop on Natural Language and Spatio-Temporal Information, 1st-2nd Oct. 2009 – Birgit Pröll
[Yang et al., 2002][Gatterbauer et al., 2007]
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 16
Structure Analysis: Table Data Extraction in TourIE
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 17
Contents
� Motivation
� Web Information Extraction (WebIE) by Examples
� General Architecture
� Web Crawler
� Ontology Aware WebIE
� Structure Analysis: Page Segementation, Table Extraction
� Evaluation & Manual Correction of Results
� Lessons Learned & Future Work
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 18
Evaluation: TourIE
� Evaluation results were satisfactory with respect to the preliminary study.
� Pool facility extraction quality was poor because of incomplete ontology.
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 19
Evaluation: JobOlize
0,0000
0,2000
0,4000
0,6000
0,8000
1,0000
1,2000P
recis
ion
Re
call
F-M
eas.
Pre
cis
ion
Reca
ll
F-M
ea
s.
Pre
cis
ion
Re
ca
ll
F-M
ea
s.
Operation
Area
IT-Skill Language
SkillContext-driven Extraction
Non Context-driven Extraction
� Page segmentation & block identification considerably rises precision.
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 20
Gesamt
0,00
0,10
0,20
0,30
0,40
0,50
0,60
0,70
0,80
0,90
1,00
Gesam
t
Nam
e
Adresse
Telefon
Fax
UID
Bezeichnung
Attribute
Material
Hauptm
aß
H-E
inheit
H-W
ert
XYZ
Material
Templates
%
Evaluation: Marlies
� Work in progress (e.g., table extraction).
Marlies OntologyClasses: 2313
Instances: 2661
Assignments of object properties to instances: 42791
Preliminary results for recall:
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 21
Manual Correction via Rich Client GUI
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 22
Contents
� Motivation
� Web Information Extraction (WebIE) by Examples
� General Architecture
� Web Crawler
� Ontology Aware WebIE
� Structure Analysis: Page Segementation, Table Extraction
� Evaluation & Manual Correction of Results
� Lessons Learned & Future Work
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 23
Lessons Learned
� Today‘s Web pages do not adhere to standards or semantic Web proposals.
� Only a few RDF resouces available; proposed microformats rarely used
� Poor HTML, e.g, tables used for layout purposes
� Web 2.0 coded Web pages in progress; content-based image retrieval & OCR
� Development & maintenance of knowledge-based WebIE systems is expensive.
� Domain experts & knowledge engineers are needed.
� Rule-coding is tedious and errorprone.
� Evaluation of numerous methods & algorithms; multiplied due to multilnguality
� Manual evaluation is time consuming.
� WebIE performance considerably depends on quality of domain ontology.
� We have to observe (evolving) legal issues
� Robots exclusion standard, Sitemap etc.
� Further processing of extracted data
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 24
Future Work: OntosophiaOntology-driven IE Supported by (Semi-) Automatic Corrective Feedback
Domain-
Expert(s)
Validation GUI
Ontology-driven Information Extraction
Ru
le-G
en
era
tor
Information Extraction
Pipeline
IE-E
va
lua
tio
n
Documents Ontology OptimizationOntology Learning, -Population, -Evaluation
Ontology
Lookup
Document
Annotation
Template
Filling
Visualization of IE-ProcessError Trace Back
IE-Rules Ontology Lookup
Extraction
Domain
Ontology
Domain-
Expert(s)
Domain-
Expert(s)
Validation GUI
Ontology-driven Information Extraction
Ru
le-G
en
era
tor
Information Extraction
Pipeline
IE-E
va
lua
tio
n
Documents Ontology OptimizationOntology Learning, -Population, -Evaluation
Ontology
Lookup
Document
Annotation
Template
Filling
Visualization of IE-ProcessError Trace Back
IE-Rules Ontology Lookup
Extraction
Domain
Ontology
(3) Visual Error Trace Back
(2) OntologyOptimization
(4) Evaluation Support
(1) Extraction
Domain Ontology
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 25
Thank you for your Attention!
Acknowledgements to:
Christina Stefan Christina MichaelFeilmayr Parzer Buttinger Guttenbrunner