Post on 26-Jan-2015
WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Information Extraction for
Building Knowledge Bases
Steffen Staab
Saqib Mir – European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo – Univ. Calabria, Italy
Steffen Staab staab@uni-koblenz.de
WeST – Web Science & Technologies
Slide 2
A FEW SLIDES WHERE WEST COMES FROM
Slide 3
Slide 4
Semantic Web
Web Retrieval
Social Web
Multimedia Web
Software Web
Institut WeST – Web Science & Technologies
GESIS
Slide 5
We (co-)organize conferences and schools
Slide 6
We build applications and develop methods…
BTC 1st Prize 2011
1st Prize, German Linked Open Gov Data Competition 2012
BTC 1st Prize 2008
German KM 1st Prize 2011
Slide 7
We teach Web Science
Master in Web Science@Koblenz: free tuition, start Fall 2012, in English
2012 Web Science Summer School
Lorentz Center, Leiden, The Netherlands,
9-13 July 2012
Master in eGov@Koblenz: free tuition, start Fall 2012, in English
Slide 8
We are active in joint projects
• EU Integrated Project ROBUST (10 partners): Risk and Opportunity management of huge-scale BUSiness communiTy cooperation
• EU Live+Gov: Reality Sensing, Mining and Augmentation for Mobile Citizen–Government Dialogue
• EU WeGov: where eGovernment meets the eSociety
• EU IP SocialSensor: Sensing User Generated Input for Improved Media Discovery and Experience
• EU Net2: a network for networked knowledge
• EU MOST: Marrying Ontologies and Software Technologies
Slide 9
INFORMATION EXTRACTION FOR BUILDING KNOWLEDGE BASES
Steffen Staab,
Saqib Mir, European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo, Univ. Calabria, Italy
Slide 10
GENERAL MOTIVATION
Slide 11
General objective: Extracting to LOD
hasLivedIn
useAsExample
Slide 12
General objective: Analysing LOD
hasLivedIn
useAsExample
Slide 13
http://lisa.west.uni-koblenz.de/lisa-demo/
Family's analysis of Munich LOD + Open Street Map data
Slide 14
http://lisa.west.uni-koblenz.de/lisa-demo/
Entrepreneur's analysis of Munich LOD + Open Street Map data
Slide 15
OBSERVATIONS ON INFORMATION EXTRACTION
Slide 16
Challenges & Opportunities for IE
Not all web pages are created equal
Slide 17
Challenges & Opportunities for IE
Some challenges are the same, e.g. finding type instances
Slide 18
Challenges & Opportunities for IE
Some challenges are the same, e.g. finding relation instances
Slide 19
Challenges & Opportunities for IE
Some contain concepts and their descriptions, some don't
No types here, few relation types
Slide 20
Challenges & Opportunities for IE
Knowing that they are instances and of which type
Textual indication
Positional indication
Slide 21
Challenges & Opportunities for IE
To some extent, positional and layout indications work across languages and sites
Slide 22
Challenges & Opportunities for IE
owl:sameAs
We should not only think about Web pages, but about Web sites
Slide 24
Comparing related work to our objectives
Related work objectives
• IE on Web pages
• Acquiring instances and relationship instances
• IE based on linear text
Our objectives
• IE on Web sites
• Acquiring items
• Classifying items into instances, concepts, relation instances, and relationships
• IE also based on spatial position
There is overlap and there are few exceptions in related work
Slide 25
Outline
The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  – Spatial Data Model
  – Syntax & Semantics
  – Complexity
• Implementation
• Evaluation
Slide 26
Presentation-oriented documents
Music band profile
band photo
band name
Acquiring a music band profile: a music band photo that has its descriptive information to the east
Slide 27
Presentation-oriented documents
Slide 28
Presentation-oriented documents
• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)
Slide 29
Related Work
Web query languages
• XPath 1.0 and XQuery 1.0
  – Established
  – Too difficult to use for scraping from intricate DOM structures
Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]
Web wrapper induction exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
• generate XPath location paths of DOM nodes
• can benefit from using Spatial XPath
Slide 31
Idea: Use Spatial Relations among DOM Nodes
Slide 32
Idea: Use Spatial Relations among DOM Nodes
Slide 33
Idea: Use Spatial Relations among DOM Nodes
Slide 34
Spatial DOM (SDOM)
Slide 35
Spatial Relations Among Nodes
Rectangular Cardinal Relations (RCR)
Topological Relations
r1 E:NE r2
Spatial models allow for expressing disjunctive relations among regions
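A relation such as "r1 E:NE r2" can be computed from the bounding boxes of two rendered nodes. The following is a minimal sketch, not the SXPath implementation: the `Rect` type, the function names, the output order of tile names, and the convention that y grows towards north are all assumptions made for illustration (browser coordinates usually grow downward, so y would have to be flipped).

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x1: float  # left edge
    y1: float  # bottom edge (assumption: y grows upward, i.e. towards north)
    x2: float  # right edge
    y2: float  # top edge


# Tile names keyed by (column, row) relative to the reference rectangle.
TILES = {
    (0, 0): "B", (0, 1): "N", (1, 0): "E", (0, -1): "S", (-1, 0): "W",
    (1, 1): "NE", (1, -1): "SE", (-1, -1): "SW", (-1, 1): "NW",
}
# The order in which tile names are printed is a free choice of this sketch.
ORDER = ["B", "N", "E", "S", "W", "NE", "SE", "SW", "NW"]


def bands(lo, hi, ref_lo, ref_hi):
    """Bands of the reference interval that [lo, hi] overlaps:
    -1 = before it, 0 = inside it, 1 = after it."""
    found = []
    if lo < ref_lo:
        found.append(-1)
    if hi > ref_lo and lo < ref_hi:
        found.append(0)
    if hi > ref_hi:
        found.append(1)
    return found


def rcr(r1, r2):
    """Rectangular cardinal relation of r1 with respect to reference r2."""
    cols = bands(r1.x1, r1.x2, r2.x1, r2.x2)
    rows = bands(r1.y1, r1.y2, r2.y1, r2.y2)
    tiles = {TILES[(c, r)] for c in cols for r in rows}
    return ":".join(t for t in ORDER if t in tiles)


# r1 lies right of r2 and sticks out above its top edge.
print(rcr(Rect(12, 5, 20, 15), Rect(0, 0, 10, 10)))  # E:NE
```

A relation with several tiles, such as E:NE, states that r1 spreads over more than one tile of r2.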
Slide 37
XPath Example
Slide 38
SXPath Example
Slide 39
Slide 40
From XPath 1.0 towards Spatial Querying with SXPath
SXPath features
• adopts intuitive path notation: axis::nodetest [pred]*
• adds to XPath
  – spatial axes
  – spatial position functions
• natural semantics for spatial querying
• maintains polynomial time combined complexity
Slide 41
Why SXPath?
an XPath for Information extraction
web applications
familiarity
Simplicity
resilient wrappers
human oriented
efficiency
Slide 43
Spatial DOM (SDOM)
Slide 44
Spatial Navigation Axes
Slide 45
Spatial Navigation Axes
Slide 46
Syntax of SXPath
Slide 50
Complexity Results
Slide 52
SXPath System Architecture
Slide 53
SXPath System
Slide 54
Results of Experiments
Slide 55
Formative User Study
Slide 56
Summative User Study
Slide 57
Summative User Study
Slide 58
Summative User Study
Slide 59
Existing Extensions to PDF
Slide 60
Table
Page Header
Page Footer
Text Area and Paragraphs
Item List
Page Number
Slide 61
Outline
The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions
  – Page-level wrapper induction
  – Site-wide wrapper generation
  – Error Correction by Mutual Reinforcement
• Conclusions and Future Directions
The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  – Spatial Data Model
  – Syntax & Semantics
  – Complexity
• Implementation
• Evaluation
Slide 62
>1000 Life Science DBs, number growing quickly
Slide 63
Biochemical Web Sites: Observations - 1
Labeled Data
Total  Labeled  Unlabeled  Unlabeled (Redundant)
754    719      19         16
Table 1: Data fields across 20 Biochemical Web sites
Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)
Slide 64
Biochemical Web Sites: Observations - 2
Dynamic Web Pages
Slide 65
Biochemical Web Sites: Observations - 3
Rich Site Structure
Slide 66
Biochemical Web Sites: Observations - 4
Web Services
• Survey: 11 of 100 databases¹ provide APIs
• Incomplete coverage
• Varying granularity
• No semantics in the service description
¹ Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey available at http://sabiork.villa-bosch.de/index.html/survey.html
Slide 67
Biochemical Web Sites: Implications
Induce Wrapper
Induce Wrapper
Induce Wrapper
Slide 68
Contributions
Unsupervised Page-Level Wrapper Induction
Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)
Automatic Error Detection and Correction by Mutual Reinforcement
Slide 69
Page-Level Wrapper Induction – 1
D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, …}
O1 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}
D2 = {C00185, Cellobiose, …, R00306, 1.1.99.18, …}
O2 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}
//*[text()]
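The split into D and O sets can be sketched as follows. This reads the slide as: collect every text entry (the effect of //*[text()]) from two sample pages, put entries that differ between the pages into D, and entries that occur on both pages into O. That reading is an assumption, since the slide does not define D and O. The page markup below is invented; only the text values are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample pages (markup is an assumption).
PAGE_1 = """<html><table>
<tr><th>Entry</th><td>C00221</td></tr>
<tr><th>Name</th><td>beta-D-Glucose</td></tr>
<tr><th>Reaction</th><td><a>R00026</a><a>R01520</a></td></tr>
<tr><th>Enzyme</th><td><a>1.1.1.47</a><a>3.2.1.21</a></td></tr>
</table></html>"""

PAGE_2 = """<html><table>
<tr><th>Entry</th><td>C00185</td></tr>
<tr><th>Name</th><td>Cellobiose</td></tr>
<tr><th>Reaction</th><td><a>R00026</a><a>R00306</a></td></tr>
<tr><th>Enzyme</th><td><a>1.1.99.18</a><a>3.2.1.21</a></td></tr>
</table></html>"""


def text_entries(page):
    """Text of every element that carries non-blank text, in document order."""
    root = ET.fromstring(page)
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]


def split_entries(page_a, page_b):
    """Return (D_a, D_b, O): entries unique to each page, and shared entries."""
    a, b = text_entries(page_a), text_entries(page_b)
    common = set(a) & set(b)
    d_a = [t for t in a if t not in common]
    d_b = [t for t in b if t not in common]
    o = [t for t in a if t in common]
    return d_a, d_b, o


d1, d2, o = split_entries(PAGE_1, PAGE_2)
print("D1 =", d1)
print("D2 =", d2)
print("O  =", o)
```

Note that shared data values such as R00026 and 3.2.1.21 land in O at this stage, which matches the O1 and O2 sets on the slide.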
Slide 70
Page-Level Wrapper Induction - 2
Reclassify – Growing Data Regions
Slide 71
Page-Level Wrapper Induction - 3
D1´ = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21, …}
O1´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}
D2´ = {C00185, Cellobiose, …, R00306, 1.1.99.18, 3.2.1.21, …}
O2´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}
Slide 72
Page-Level Wrapper Induction - 4
Selecting Labels for Data
html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] (“1.1.1.47”)
html/…./table[1]/tr[6]/th[1]/…/code[1]/ (“Reaction”)
html/…./table[1]/tr[8]/th[1]/…/code[1]/ (“Enzyme”)
Slide 73
Page-Level Wrapper Induction - 5
Anchor the Path
Enzyme - html/table[1]/tr[8]/th[1]/code[1]/
html/table[1]/tr[8]/td[1]/code[1]/a[1]
html/table[1]/tr[8]/td[1]/code[1]/a[2]
//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()≥2]/text()
Steps: Pivot, Relative, Generalize
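The three steps (pivot on the label, make the data path relative to it, generalize over sibling data nodes) can be sketched with plain string handling. The function names are hypothetical, and the generalization used here simply drops the positional index of the last step, which is a simplification of the position() predicate shown on the slide.

```python
import re


def steps(path):
    """Split a location path into its steps, ignoring empty segments."""
    return [s for s in path.strip("/").split("/") if s]


def relative_path(label_path, data_path):
    """Path from the label node up to the common ancestor, then down to the data node."""
    lab, dat = steps(label_path), steps(data_path)
    i = 0
    while i < min(len(lab), len(dat)) and lab[i] == dat[i]:
        i += 1
    ups = [".."] * (len(lab) - i)  # one step up per non-shared label step
    return "/".join(ups + dat[i:])


def generalize(paths):
    """Merge paths that differ only in the index of their final step."""
    if len(paths) == 1:
        return paths[0]
    stripped = {re.sub(r"\[\d+\]$", "", p) for p in paths}
    if len(stripped) != 1:
        raise ValueError("paths differ in more than the final index")
    return stripped.pop()


def anchored_xpath(label_text, label_path, data_paths):
    """Build an XPath that pivots on the label text instead of an absolute path."""
    rel = generalize([relative_path(label_path, p) for p in data_paths])
    return f"//*[text()='{label_text}']/{rel}/text()"


print(anchored_xpath(
    "Enzyme",
    "html/table[1]/tr[8]/th[1]/code[1]/",
    ["html/table[1]/tr[8]/td[1]/code[1]/a[1]",
     "html/table[1]/tr[8]/td[1]/code[1]/a[2]"],
))
# //*[text()='Enzyme']/../../td[1]/code[1]/a/text()
```

Anchoring on the label text keeps the wrapper valid when rows above the label are added or omitted, because the absolute row index tr[8] no longer appears in the expression.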
Slide 74
Selected Sources
KEGG, ChEBI, MSDChem
• Basic qualitative data
• Popular
• Overlapping/complementary data
Slide 75
Wrapper Induction - Evaluation
SOURCE         #L  #D   #S  TP   FN   FP  P     R
KEGG Compound  10  762  3   411  351  46  89.9  53.9
                        15  759  3    0   100   99.6
KEGG Reaction  10  205  3   173  32   0   100   84.4
                        15  205  0    0   100   100
ChEBI          22  831  3   595  236  41  93.5  71.6
                        15  829  2    0   100   99.7
MSDChem        30  600  3   600  0    20  96.7  100
                        15  600  0    20  96.7  100
Average (based on final wrappers for each source): P = 99.1, R = 99.8
~9 samples – ~99% P, ~98% R
Sources:
KEGG Compound: http://www.genome.jp/kegg/compound/
KEGG Reaction: http://www.genome.jp/kegg/reaction/
ChEBI: http://www.ebi.ac.uk/chebi
MSDChem: http://www.ebi.ac.uk/msd-srv/msdchem/
Table 2: Page-level wrapper induction results, 20 test pages
(L=Labels, D=Data entries, S=Training pages)
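The P and R columns are consistent with the usual definitions of precision and recall computed from TP, FN and FP. The slides do not spell the formulas out, so the standard definitions are assumed here. A small sketch, checked against the KEGG Compound rows:

```python
def precision_recall(tp, fn, fp):
    """Precision and recall in percent from true positives, false negatives, false positives."""
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else float("nan")
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# KEGG Compound with 3 training pages: TP=411, FN=351, FP=46
p3, r3 = precision_recall(411, 351, 46)
# KEGG Compound with 15 training pages: TP=759, FN=3, FP=0
p15, r15 = precision_recall(759, 3, 0)

print(f"S=3:  P={p3:.1f}  R={r3:.1f}")
print(f"S=15: P={p15:.1f}  R={r15:.1f}")
```

In each row, TP + FN equals the number of data entries #D (for example 411 + 351 = 762), so recall is the share of data entries that the wrapper extracted.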
Slide 76
Site-Wide Wrapper Induction: Observations
Not all pages contain data (e.g. legal disclaimers, contact pages, navigational menus)
• An efficient approach should ignore these pages
• We don't need to learn the entire site structure
Slide 77
Site-Wide Wrapper Induction: Observations - 2
Classified Link-Collections point to data-intensive pages of the same class.
Slide 78
Site-Wide Wrapper Induction: Observations - 3
Pages belonging to the same class describe the same concepts
• Some concepts are sometimes omitted
• Ordering is always the same
Slide 79
Site-Wide Wrapper Induction
1. Start with C0
2. Follow all classified link-collections
3. Generate wrappers for each set of target pages
4. Determine if new class is formed
5. Add navigation step
6. Repeat 2 – 5 for each new class formed in 4
[Figure: start class C0 with link collections L1, L2, L3 leading to page classes C1, C2, C3]
S = {C0}
If C0 != Ci (i>0): S = S + Ci
Navigation steps: W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
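The six steps can be sketched as a breadth-first loop over page classes. Everything concrete below is an assumption made for illustration: the toy `SITE` dictionary, the page identifiers, the label sets, and the stand-in `induce_wrapper`, which simply returns the labels shared by the target pages. In particular, two wrappers count as the same class only if their label sets are identical; a real system would need a more tolerant comparison, since some concepts are sometimes omitted.

```python
from collections import deque

# Hypothetical toy site: page id -> (labels on the page, {link collection: target page ids}).
SITE = {
    "compound/1": ({"Entry", "Name", "Reaction", "Enzyme"},
                   {"L1": ["compound/2"], "L2": ["reaction/1"], "L3": ["enzyme/1"]}),
    "compound/2": ({"Entry", "Name", "Reaction", "Enzyme"}, {}),
    "reaction/1": ({"Entry", "Definition", "Equation"}, {}),
    "enzyme/1": ({"Entry", "Name", "Class"}, {}),
}


def induce_wrapper(page_ids):
    """Stand-in for page-level wrapper induction: labels shared by all given pages."""
    return frozenset(set.intersection(*(set(SITE[p][0]) for p in page_ids)))


def site_wide_induction(start_page):
    c0 = induce_wrapper([start_page])           # step 1: start with C0
    classes = [c0]                              # S = {C0}
    navigation = []                             # W, as (source class, link collection, target class)
    queue = deque([(c0, start_page)])
    while queue:
        source, page = queue.popleft()
        for link_name, targets in SITE[page][1].items():    # step 2: follow link collections
            if not targets:
                continue
            wrapper = induce_wrapper(targets)               # step 3: wrapper for target pages
            if wrapper not in classes:                      # step 4: new class formed?
                classes.append(wrapper)
                queue.append((wrapper, targets[0]))         # step 6: repeat for the new class
            navigation.append(                              # step 5: add navigation step
                (classes.index(source), link_name, classes.index(wrapper)))
    return classes, navigation


classes, navigation = site_wide_induction("compound/1")
for src, link, dst in navigation:
    print(f"C{src} -> {link} -> C{dst}")
```

In this toy run L1 leads back to the start class, so it contributes a navigation step without adding a class, in the same way as (C0 → L1 → C0) in the navigation steps W. Classes are numbered in order of discovery here, which differs from the numbering on the slide.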
Slide 80
Site-Wide Wrapper Induction – Evaluation
SOURCE   #C  #C'  #D    TP    FN    FP   P     R
MSDChem  1   1    N/A   N/A   N/A   N/A  N/A   N/A
ChEBI    3   1    1711  1195  516   0    100   69.8
KEGG     10  7    6223  5044  1179  188  97    81.1
Average                                  98.5  75.5
Table 3: Site-wide wrapper induction results, 20 test pages for each class
(C=Classes, C'=Classes discovered, D=Data entries)
Slide 81
Error Detection and Correction: Mutual Reinforcement
Observation: Certain data reappear on more than one class of pages
Slide 82
Error Detection and Correction: Mutual Reinforcement
Reinforcement if reappearing data is correctly classified as Data
Otherwise it points to a misclassification
Label-Data Mismatch
• Correction: Introduce more samples
Label-Label Mismatch
• Cannot be detected
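A minimal sketch of this check, assuming each page class comes with a mapping from text value to its assigned role ("data" or "label"). The data structure, the function name and the example classifications are hypothetical. A value that reappears across classes is reinforced when every class treats it as data, and flagged when the classes disagree. Values that every class treats as labels are left alone, which is why a label-label mismatch stays undetected.

```python
def find_mismatches(classifications):
    """classifications: {page_class: {text_value: 'data' | 'label'}}.
    Returns (reinforced, mismatched) lists of values seen on more than one class."""
    seen = {}
    for page_class, entries in classifications.items():
        for value, role in entries.items():
            seen.setdefault(value, {})[page_class] = role

    reinforced, mismatched = [], []
    for value, roles in seen.items():
        if len(roles) < 2:
            continue  # value does not reappear, nothing to compare
        if set(roles.values()) == {"data"}:
            reinforced.append(value)   # all classes agree it is data
        elif "data" in roles.values():
            mismatched.append(value)   # label on one class, data on another
    return reinforced, mismatched


# Hypothetical classifications for two page classes.
example = {
    "compound": {"Enzyme": "label", "1.1.1.47": "data", "3.2.1.21": "label"},
    "enzyme": {"Enzyme": "label", "1.1.1.47": "data", "3.2.1.21": "data"},
}
reinforced, mismatched = find_mismatches(example)
print("reinforced:", reinforced)
print("mismatched:", mismatched)
```

A flagged value marks the page class where more samples should be introduced.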
Slide 83
Where to go next?
Reverse engineering production
1. LOD: emitting RDF & RDFS
2. Navigation model: what belongs to what
3. Interaction model (not treated at all by us so far)
4. Layout model: spatial positioning
Capture this generative model using machine learning
Relational learning
• Markov logic programmes?
• …?
Slide 84
Bibliography
Linda d’Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.
S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.
Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.
WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Thank you for your attention!