1
Lessons from the TSIMMIS Project
Yannis Papakonstantinou
Department of Computer Science & Engineering
University of California, San Diego
2
Overview
• TSIMMIS’ goals, technical challenges, and solutions
• Insufficiencies of the TSIMMIS framework
• Going forward
3
Information Resides on Heterogeneous Information Sources
• different interfaces
• different data representations
• redundant and conflicting information
Diagram labels: WWW; TickerTape; Personal database; Dialog
4
Goal: System Providing Integrated View of Heterogeneous Data
• collects and combines information
• provides integrated view, uniform user interface
Diagram labels: Integration System; WWW; Personal database; TickerTape; Dialog
5
The Wrapper and Mediator Architecture
Diagram labels: Client; Mediator; Wrapper; Wrapper; TickerTape; Dialog; Common Data Model
Data labels: business reports; portfolios for each company; stock market prices
6
The Data Warehousing Approach to Integration
Diagram labels: Client; Mediator; Stored Integrated View; Wrapper; Wrapper; TickerTape; Dialog
7
The Lazy Integration Approach
Diagram labels: Client; Mediator; Wrapper; Wrapper; TickerTape; Dialog
Data labels: IBM portfolio; IBM price; IBM related reports (in common model); IBM related reports
Query Decomposition, Translation and Result Fusion
8
Wrappers & Mediators from High-Level Specifications
Diagram labels: Client; Mediator; Mediator Specification Interpreter; Mediator Specification; Wrapper; Wrapper; Wrapper Generator; Wrapper Specification; Source; Source
9
Challenge: Sources Without a Well-Structured Schema
• semistructured
  – irregular
  – deeply nested
  – cross-referenced
• incomplete schema knowledge
  – autonomous
  – dynamic
Examples
• HTML pages
• SGML documents
• genome data
• chemical structures
• bibliographic information
• results of the integration process
10
Challenge: Different and Limited Source Capabilities
Diagram labels: Client; Mediator (U = A + B); Wrapper (A); Wrapper (B)
Query labels: retrieve IBM data; retrieve IBM data; retrieve IBM data
11
Mediator has to Adapt to Query Capabilities of Sources
Diagram labels: Client; Mediator (U = A + B); Wrapper (A); Wrapper (B)
Query labels: retrieve everything; retrieve IBM data; retrieve IBM data; retrieve IBM data
(A) does not allow selection
12
Part B
• Semistructured Data Representation
• Mediator Generation
• Wrapper Generation
• Capabilities-Based Rewriting
13
Representation of Semistructured Information using OEM
Diagram labels: semantic object-id; structural object-id; label; Atomic Value; Set Value
<http://www/~doe, faculty, {&f1,&l1,&r1}>
<&f1, first_name, “John”>
<&l1, last_name, “Doe”>
<&r1, rank, “professor”>
14
Graph Representation of OEM Data
Graph labels: http://www/~doe; faculty; first_name “John”; last_name “Doe”; rank “professor”
<http://www/~doe, faculty, {&f1,&l1,&r1}>
<&f1, first_name, “John”>
<&l1, last_name, “Doe”>
<&r1, rank, “professor”>
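The OEM triples above can be held in a small in-memory store. The following is a minimal Python sketch, not TSIMMIS code: the class name `OemObject`, the `store` dictionary, and the helper `children` are hypothetical names chosen for illustration. Each object carries an object-id, a label, and either an atomic value or a set of subobject ids.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class OemObject:
    oid: str                      # object-id, e.g. "&f1" or a URL
    label: str                    # e.g. "first_name"
    value: Union[str, List[str]]  # atomic value, or ids of subobjects


# The four objects from the slide, keyed by object-id.
store: Dict[str, OemObject] = {
    o.oid: o
    for o in [
        OemObject("http://www/~doe", "faculty", ["&f1", "&l1", "&r1"]),
        OemObject("&f1", "first_name", "John"),
        OemObject("&l1", "last_name", "Doe"),
        OemObject("&r1", "rank", "professor"),
    ]
}


def children(oid: str) -> Dict[str, str]:
    """Return {label: atomic value} for the atomic subobjects of a set object."""
    obj = store[oid]
    if not isinstance(obj.value, list):
        return {}  # atomic objects have no subobjects
    return {
        store[c].label: store[c].value
        for c in obj.value
        if isinstance(store[c].value, str)
    }


print(children("http://www/~doe"))
```

Following subobject ids through the store is what turns the flat list of triples into the labeled graph shown on the slide.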
15
OEM Structures Represent Arbitrary Labeled Graphs
Graph labels:
http://www/~doe: faculty; first_name “John”; last_name “Doe”; rank “professor”
http://www/~smith: faculty; name “Mary Smith”; project “Air DB”; paper
paper: author name “John Doe”; author name “Mary Smith”; title “Thin Air DB”
16
Overview
• Semistructured Data Representation
• Mediator Generation
  • Example of mediator specification
  • Language expressiveness
  • Implementation and performance
• Wrapper Generation
• Capabilities-Based Rewriting
17
Merge Information Relating to a Faculty
s1: faculty name “John Doe” rank “professor” papers ...
s2: person name “John Doe” birthday “April 1”
Merged: faculty name “John Doe” rank “professor” birthday “April 1” papers ...
18
Mediator Specification Example
<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
<N faculty {<L V>}> :- <person {<name N> <L V>}>@s2
s1: faculty name “John Doe” rank “professor” papers ...
s2: person name “John Doe” birthday “April 1”
View: faculty name “John Doe” rank “professor” birthday “April 1” papers ...
19
Mediator Specification Example: Semantics of Rule Bodies
<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
<N faculty {<L V>}> :- <person {<name N> <L V>}>@s2
s1: faculty name “John Doe” rank “professor” papers ...
s2: person name “John Doe” birthday “April 1”
View: faculty name “John Doe” rank “professor” birthday “April 1” papers ...
20
Mediator Specification Example: Semantics of Rule Heads
<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
<N faculty {<L V>}> :- <person {<name N> <L V>}>@s2
s1: faculty name “John Doe” rank “professor” papers ...
s2: person name “John Doe” birthday “April 1”
View object “John Doe”: faculty name “John Doe” rank “professor” birthday “April 1” papers ...
21
Incrementally Add to Semantically Identified Object
<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
<N faculty {<L V>}> :- <person {<name N> <L V>}>@s2
s1: faculty name “John Doe” rank “professor” papers ...
s2: person name “John Doe” birthday “April 1”
View object “John Doe”: faculty name “John Doe” rank “professor” birthday “April 1” papers ...
22
Irregularities & Incomplete Schema Knowledge
<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
s1: faculty name “John Doe” rank “professor” papers
s1: faculty name “Mary Smith” project “Air DB”
s2: person name “John Doe” birthday “April 1”
View object “John Doe”: faculty name “John Doe” rank “professor” birthday “April 1” papers
View object “Mary Smith”: faculty name “Mary Smith” project “Air DB”
23
Second Rule Attaches More Subobjects to View Objects
<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
<N faculty {<L V>}> :- <person {<name N> <L V>}>@s2
s1: faculty name “John Doe” rank “professor” papers ...
s2: person name “John Doe” birthday “April 1”
View object “John Doe”: faculty name “John Doe” rank “professor” birthday “April 1” papers ...
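The effect of the two rules can be imitated in a few lines of Python. This is only a sketch of the fusion semantics described on the preceding slides, not the MSL interpreter: the lists `s1` and `s2` and the functions `apply_rule` and `build_view` are hypothetical stand-ins. The key point is that the name N acts as the semantic object-id, so every rule that produces the same N adds subobjects to the same view object.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two wrapped sources.
# Each top-level object is (label, [(sublabel, value), ...]).
s1 = [
    ("faculty", [("name", "John Doe"), ("rank", "professor")]),
    ("faculty", [("name", "Mary Smith"), ("project", "Air DB")]),
]
s2 = [
    ("person", [("name", "John Doe"), ("birthday", "April 1")]),
]


def apply_rule(view, source, top_label):
    """Imitates: <N faculty {<L V>}> :- <top_label {<name N> <L V>}>@source"""
    for label, subobjects in source:
        if label != top_label:
            continue
        names = [v for (l, v) in subobjects if l == "name"]
        for n in names:              # N ranges over every name subobject
            for l, v in subobjects:  # <L V> ranges over every subobject
                view[n].add((l, v))  # same N means same view object


def build_view():
    view = defaultdict(set)  # semantic object-id N -> set of (label, value)
    apply_rule(view, s1, "faculty")  # first rule
    apply_rule(view, s2, "person")   # second rule attaches more subobjects
    return view


for oid, subobjects in build_view().items():
    print(oid, sorted(subobjects))
```

Because the rule body uses label and value variables (`<L V>`), the sketch never names `rank`, `project`, or `birthday`; irregular objects such as Mary Smith's pass through without schema knowledge.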
24
Language Expressiveness
• Information fusion problems solved by MSL
  – Irregularities
  – Incomplete knowledge of source structure
  – Transformation of cross-referenced structures
  – Inconsistent and redundant data
  – Use of arbitrary matching criteria
• Theoretical analysis of expressiveness
  – Consider the relational representation of OEM graphs. Then MSL is equivalent to “SQL + special form of transitive closure”
25
Inconsistent and Redundant Information
<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
<N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 AND NOT <faculty {<name N> <L V1>}>@s1
s1: faculty name “John Doe” rank “associate”
s2: person name “John Doe” rank “assistant”
View object “John Doe”: faculty name “John Doe” rank “associate”; rank “assistant”
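One way to read the AND NOT clause: a subobject with label L is taken from s2 only when s1 holds no subobject with the same label for the same name. The Python sketch below illustrates that reading under simplifying assumptions (one value per label, sources given as dictionaries keyed by name). The function name `fuse_preferring_s1` and the extra `birthday` entry are hypothetical additions for illustration.

```python
# Hypothetical source contents; "birthday" is added to show a non-conflicting label.
s1 = {"John Doe": {"name": "John Doe", "rank": "associate"}}
s2 = {"John Doe": {"name": "John Doe", "rank": "assistant", "birthday": "April 1"}}


def fuse_preferring_s1(s1, s2):
    view = {}
    # Rule 1: every subobject from s1 goes into the view.
    for name, subobjects in s1.items():
        view.setdefault(name, {}).update(subobjects)
    # Rule 2: subobjects from s2, guarded by the AND NOT condition.
    for name, subobjects in s2.items():
        for label, value in subobjects.items():
            if label not in s1.get(name, {}):  # no <L V1> for this N in s1
                view.setdefault(name, {})[label] = value
    return view


print(fuse_preferring_s1(s1, s2))
```

Under this reading the conflicting rank “assistant” from s2 is dropped, while the birthday, which s1 does not provide, still reaches the view.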
26
Overview
• Semistructured Data Representation
• Mediator Generation
  • Example of mediator specification
  • Language expressiveness
  • Implementation and performance
• Wrapper Generation
• Capabilities-Based Rewriting
27
Mediator Specification Interpreter Architecture
Diagram labels: Query; Mediator Specification; Query Rewriter; logical datamerge program; Cost-Based Optimizer; plan; Datamerge Engine; Queries to Wrappers; Results; Result
28
Query Rewriting When Known Origins of Information
• <N faculty {<salary S>}> :- <faculty {<name N> <salary S>}>@s1
  <N faculty {<rank R>}> :- <person {<name N> <rank R>}>@s2
• <well-paid {<name N> <salary X>}> :- <N faculty {<salary X> <rank assistant>}> AND X>65000
29
Query Rewriter Pushes Conditions to Sources
• <N faculty {<salary S>}> :- <faculty {<name N> <salary S>}>@s1
  <N faculty {<rank R>}> :- <person {<name N> <rank R>}>@s2
• <well-paid {<name N> <salary X>}> :- <N faculty {<salary X> <rank assistant>}> AND X>65000
• logical datamerge program
  <well-paid {<name N> <salary X>}> :- (<faculty {<name N> <salary X>}> AND X>65000)@s1 AND <person {<name N> <rank assistant>}>@s2
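The benefit of pushing conditions to the sources can be seen in a small simulation. The Python sketch below is an illustration under assumptions, not the TSIMMIS rewriter: the sample rows, `query_source`, `well_paid_naive`, and `well_paid_pushed` are all hypothetical. It counts how many objects each plan ships from the sources to the mediator.

```python
# Hypothetical source contents.
S1_FACULTY = [
    {"name": "Ann", "salary": 70000},
    {"name": "Bob", "salary": 50000},
    {"name": "Cy", "salary": 90000},
]
S2_PERSON = [
    {"name": "Ann", "rank": "assistant"},
    {"name": "Bob", "rank": "assistant"},
    {"name": "Cy", "rank": "professor"},
]


def query_source(rows, predicate=None):
    """Stand-in for a wrapper call; returns only rows satisfying the predicate."""
    return [r for r in rows if predicate is None or predicate(r)]


def well_paid_naive():
    # Fetch everything, then apply all conditions at the mediator.
    faculty = query_source(S1_FACULTY)
    persons = query_source(S2_PERSON)
    shipped = len(faculty) + len(persons)
    ranks = {p["name"]: p["rank"] for p in persons}
    result = [
        (f["name"], f["salary"])
        for f in faculty
        if f["salary"] > 65000 and ranks.get(f["name"]) == "assistant"
    ]
    return result, shipped


def well_paid_pushed():
    # Send X>65000 to s1 and rank=assistant to s2; join on name at the mediator.
    faculty = query_source(S1_FACULTY, lambda r: r["salary"] > 65000)
    persons = query_source(S2_PERSON, lambda r: r["rank"] == "assistant")
    shipped = len(faculty) + len(persons)
    assistants = {p["name"] for p in persons}
    result = [(f["name"], f["salary"]) for f in faculty if f["name"] in assistants]
    return result, shipped


print(well_paid_naive())
print(well_paid_pushed())
```

Both plans return the same answer; the pushed plan ships fewer objects because each source filters before sending.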
30
Passing Bindings & Local Join Plans
Passing Bindings
s2: <name N> :- <person {<rank assistant>}>
N (bindings passed from one query to the other)
s1: <salary X> :- <faculty {<name $N> <salary X>}> AND X>65000
Local Join
s2: <name N> :- <person {<rank assistant>}>
s1: <a {<s X> <n N>}> :- <faculty {<name N> <salary X>}> AND X>65000
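The two plan shapes can be contrasted with a short Python sketch. It assumes, as the previous slide suggests, that person data lives at s2 and faculty data at s1; the sample rows and all function names are hypothetical. In the passing-bindings plan the mediator sends one parameterized query to s1 per name obtained from s2; in the local-join plan it sends one query to each source and joins the results itself.

```python
# Hypothetical source contents.
S1_FACULTY = [
    {"name": "Ann", "salary": 70000},
    {"name": "Bob", "salary": 50000},
    {"name": "Cy", "salary": 90000},
]
S2_PERSON = [
    {"name": "Ann", "rank": "assistant"},
    {"name": "Bob", "rank": "assistant"},
    {"name": "Cy", "rank": "professor"},
]


def s2_assistant_names():
    """Query sent to s2: names of persons with rank assistant."""
    return [p["name"] for p in S2_PERSON if p["rank"] == "assistant"]


def s1_salary_for(name):
    """Query sent to s1 with $N bound: salary of that faculty member if X>65000."""
    return [f["salary"] for f in S1_FACULTY
            if f["name"] == name and f["salary"] > 65000]


def s1_well_paid():
    """Query sent to s1 without bindings: (salary, name) pairs with X>65000."""
    return [(f["salary"], f["name"]) for f in S1_FACULTY if f["salary"] > 65000]


def passing_bindings_plan():
    result = []
    for n in s2_assistant_names():  # one s1 query per binding of N
        for x in s1_salary_for(n):
            result.append((n, x))
    return result


def local_join_plan():
    assistants = set(s2_assistant_names())
    # The join on name happens at the mediator.
    return [(n, x) for (x, n) in s1_well_paid() if n in assistants]


print(passing_bindings_plan())
print(local_join_plan())
```

Which plan is cheaper depends on how many bindings s2 returns and how selective the s1 condition is, which is the kind of choice a cost-based optimizer has to make.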
31
Query Decomposition When Unknown Origins of Information
<X faculty {<S Y>}> :- <X faculty {<birthday “1/20”> <S Y>}>
<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
<N faculty {<L V>}> :- <person {<name N> <L V>}>@s2
32
Plan Considers All Possible Sources of birthday
<X faculty {<S Y>}> :- <X faculty {<birthday “1/20”> <S Y>}>
<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
<N faculty {<L V>}> :- <person {<name N> <L V>}>@s2
Diagram labels: s1: name, birthday; s2: name, birthday
33
Overview
• Semistructured-Data Representation
• Mediator Generation
• Wrapper Generation
• Capabilities-Based Rewriting
34
Query Translation in Wrappers
Diagram labels: Wrapper; Query Translator; Result Translator; Source
Queries: SELECT * FROM person; SELECT * FROM person WHERE name=“Smith”
Source commands: find -all; find -n Smith
35
Rapid Query Translation Using Templates and Actions
Diagram labels: Template Interpreter; Result Translator; Source
Queries: SELECT * FROM person; SELECT * FROM person WHERE name=“Smith”
Source commands: find -all; find -n Smith
Templates:
SELECT * FROM person {emit “find -all”}
SELECT * FROM person WHERE name=$N {emit “find -n $N”}
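A template interpreter of this kind can be approximated with regular expressions. The Python sketch below is a simplified illustration, not the TSIMMIS wrapper toolkit: `TEMPLATES`, `compile_template`, and `translate` are hypothetical names, and it assumes that a `$VAR` placeholder stands for a string constant in straight double quotes.

```python
import re

# (query template, action) pairs mirroring the two templates on the slide.
TEMPLATES = [
    ("SELECT * FROM person", "find -all"),
    ("SELECT * FROM person WHERE name=$N", "find -n $N"),
]


def compile_template(template):
    """Turn a template into a regex; each $VAR becomes a named capture group."""
    parts = re.split(r"\$(\w+)", template)  # literal, var, literal, var, ...
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)               # literal query text
        else:
            regex += '"(?P<%s>[^"]*)"' % part      # quoted constant bound to VAR
    return re.compile(regex)


def translate(query):
    """Return the source command for a supported query, or None otherwise."""
    for template, action in TEMPLATES:
        match = compile_template(template).fullmatch(query)
        if match:
            command = action
            for var, value in match.groupdict().items():
                command = command.replace("$" + var, value)
            return command
    return None  # the wrapper does not support this query


print(translate("SELECT * FROM person"))
print(translate('SELECT * FROM person WHERE name="Smith"'))
print(translate("SELECT * FROM person WHERE age=3"))
```

Adding support for another query form then amounts to adding one more (template, action) pair, which is what makes wrapper construction rapid.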
36
Description of Infinite Sets of Supported Queries
• uses recursive nonterminals
• Example:
  – job description contains word w1 and word w2 and ...
  – SELECT subset(person) FROM person WHERE \CJob
    \CJob : job LIKE $W AND \CJob
    \CJob : TRUE
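The recursive nonterminal can be checked with a recursive function. The Python sketch below follows the two productions literally, so an accepted condition is a conjunction of `job LIKE` tests that ends in TRUE. The function names, the single-quoted string syntax, and the splitting on " AND " are assumptions made for illustration.

```python
import re

# One conjunct of the form: job LIKE 'word' (quoting style is an assumption).
LIKE = re.compile(r"job LIKE '([^']*)'")


def match_cjob(conjuncts):
    """Return the matched words if the conjuncts derive from CJob, else None."""
    if conjuncts == ["TRUE"]:             # production  CJob : TRUE
        return []
    if conjuncts:
        m = LIKE.fullmatch(conjuncts[0])  # production  CJob : job LIKE $W AND CJob
        if m:
            rest = match_cjob(conjuncts[1:])
            if rest is not None:
                return [m.group(1)] + rest
    return None


def parse_where(where):
    """Split a WHERE condition on AND and test it against the CJob nonterminal."""
    return match_cjob([c.strip() for c in where.split(" AND ")])


print(parse_where("job LIKE 'db' AND job LIKE 'web' AND TRUE"))
print(parse_where("salary > 5 AND TRUE"))
```

Two productions are enough to describe conditions with any number of words, which is how a finite description covers an infinite set of supported queries.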
37
Overview
• Semistructured-Data Representation
• Mediator Generation
• Wrapper Generation
• Capabilities-Based Rewriting
38
Capabilities-Based Rewriter in Mediator Architecture
Diagram labels: Query; Mediator Specification; Query Rewriter; logical datamerge program; Capabilities-Based Rewriter; Wrapper Supported Queries Description; supported plans; Cost-Based Optimizer; optimal plan; Datamerge Engine
39
Capabilities-Based Rewriter Finds Supported Plans
Supported Queries
SELECT * FROM A WHERE salary>65000
SELECT * FROM A
40
Capabilities-Based Rewriter Finds Most-Selective Supported Plans
Supported Queries
SELECT * FROM B WHERE salary>65000
SELECT * FROM B
SELECT * FROM B WHERE salary>65000
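The idea behind the last two slides can be sketched in Python: send a source the most selective query it supports and let the mediator apply whatever conditions remain. This is an illustration under assumptions, not the TSIMMIS rewriter. The `CAPABILITIES` table (A supports no selections, B supports selection on salary), the sample rows, and the functions `plan` and `execute` are hypothetical.

```python
# Assumed capabilities: attributes on which each source can evaluate a selection.
CAPABILITIES = {"A": set(), "B": {"salary"}}

# Hypothetical source contents.
DATA = {
    "A": [{"name": "Ann", "salary": 70000}, {"name": "Bob", "salary": 50000}],
    "B": [{"name": "Cy", "salary": 90000}, {"name": "Dee", "salary": 40000}],
}


def plan(source, conditions):
    """Split conditions into those the source evaluates and those left to the mediator."""
    pushed = {a: p for a, p in conditions.items() if a in CAPABILITIES[source]}
    residual = {a: p for a, p in conditions.items() if a not in CAPABILITIES[source]}
    return pushed, residual


def execute(source, conditions):
    pushed, residual = plan(source, conditions)
    # Simulated source-side evaluation: only the pushed conditions are applied.
    shipped = [r for r in DATA[source]
               if all(p(r[a]) for a, p in pushed.items())]
    # Mediator-side evaluation of the conditions the source could not handle.
    result = [r for r in shipped
              if all(p(r[a]) for a, p in residual.items())]
    return result, len(shipped)


conditions = {"salary": lambda s: s > 65000}
print(execute("A", conditions))  # A ships everything, mediator filters
print(execute("B", conditions))  # B filters, mediator receives only matches
```

Both sources yield correct answers; the difference is only in how much data crosses from wrapper to mediator.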
41
Capabilities-Based Rewriter Architecture
Diagram labels: Query; Query Capabilities Description; Component SubQuery Discovery; Component SubQueries; Plan Construction; Plans (not fully optimized); Plan Refinement; Algebraically optimal plans
42
What TSIMMIS Achieved
• system for integration of heterogeneous sources
• challenges and solutions
  – semistructured data & incomplete schema knowledge
    • appropriate specification language and query processing algorithms
  – limited and different query capabilities
    • query translation algorithm
    • capabilities-based query rewriting algorithm
43
Overview
• TSIMMIS’ goals, technical challenges, and solutions
• Insufficiencies of the TSIMMIS framework
• Going forward
44
Insufficiencies of the TSIMMIS framework
• OEM was really unstructured data
  – some loose and partial schematic info may pay off tremendously
• too “databasy” user/mediator/source interaction
45
Overview
• TSIMMIS’ goals, technical challenges, and solutions
• Insufficiencies of the TSIMMIS framework
• Going forward
46
Web emerges as a Distributed DB and XML as its Data Model
Diagram labels: Data Source; Native XML Database; Wrapper; Legacy Source; XML View Document(s); XML View Document(s); XML View Document(s); XMAS Query Language
Also export:
1. Schemas & Metadata (XML-Data, RDF,…)
2. Description of supported queries
47
Definition of Integrated Views
Diagram labels: Data Source; Data Source; Data Source; XML View Document(s); XML View Document(s); XML View Document(s); Mediator; View Definition in XMAS; Integrated XML View
48
Non-Materialized Views in the MIX mediator system
Diagram labels: Blended Browsing & Querying (BBQ) GUI; Application; DOM for Virtual XML Doc’s; MIX Mediator; XMAS query; XML document; Query Processor; View Definition in XMAS; DTD Inference; Integrated View DTD; Source DTD; XML Source; XML Source
49
Diagram labels: Application; Blended Browsing & Querying (BBQ) GUI; DOM (VXD) Client API; MIX Mediator; XMAS Mediator View Definition; View DTD; DTD Inference; XMAS Query; Resolution; Unfolded Query; Simplification; Translation to Algebra; Optimization; Execution; XML Document Fragments; XMAS Query; XML Document Fragments; RDB2XML Wrapper; XML Source 1; DTD