Post on 19-Jan-2016
Extracting Information from Heterogeneous Information
Sources Using Ontologically Specified Target Views
Joachim Biskup
Universität Dortmund
and
David W. Embley
Brigham Young University
Funded by NSF
Information Exchange
Source Target
Information Extraction
Schema Matching
Leverage this …
… to do this
Presentation Outline
• Overview
• Matching (Direct)
• Matching (Derived)
• Matching Algorithm
• Summary
Requirements
1. f is an injective function.
2. f maps object sets to object sets and relationship sets to relationship sets.
3. f respects relationship-set arities.
4. f respects referential integrity.
5. f respects types.
6. f respects real-world identity.
7. f’s coercions are G/S compatible.
8. f respects subset constraints.
9. f respects mutual-exclusion constraints.
10. f respects union constraints.
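As a rough illustration of the first three requirements only, the following sketch checks a candidate mapping for injectivity, kind preservation, and arity agreement. The `SchemaElement` class and `check_mapping` function are hypothetical names introduced here; they are not part of the framework described in the slides, and requirements 4 through 10 are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaElement:
    """Hypothetical stand-in for an object set or relationship set."""
    name: str
    kind: str        # "object" or "relationship"
    arity: int = 1   # number of connected object sets; 1 for an object set


def check_mapping(f: dict) -> list:
    """Return the numbers of violated requirements (1-3 only).

    f maps target elements to source elements.
    """
    violations = []
    sources = list(f.values())
    # Requirement 1: no two target elements map to the same source element.
    if len(set(sources)) != len(sources):
        violations.append(1)
    # Requirement 2: object sets map to object sets, relationship sets
    # map to relationship sets.
    if any(t.kind != s.kind for t, s in f.items()):
        violations.append(2)
    # Requirement 3: matched relationship sets have the same arity.
    if any(
        t.kind == s.kind == "relationship" and t.arity != s.arity
        for t, s in f.items()
    ):
        violations.append(3)
    return violations


airport = SchemaElement("Airport", "object")
city = SchemaElement("City", "object")
place = SchemaElement("Place", "object")
serves_binary = SchemaElement("serves", "relationship", 2)
serves_ternary = SchemaElement("serves", "relationship", 3)

print(check_mapping({airport: place, city: place}))     # both map to Place
print(check_mapping({serves_binary: serves_ternary}))   # arity mismatch
```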
User Interaction (IDS Statements)
• Issue
– Explains the issue
– Example: units, may need transformation
• Default
– Explains the default option
– Example: if no transformation, no conversion
• Suggestion
– Gives a suggestion about how to resolve the issue
– Example: if needed, specify the conversion
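One possible way to represent an IDS statement is a three-field record. The `IDSStatement` class and its `render` method below are assumptions made for illustration; the slides do not specify a concrete representation.

```python
from dataclasses import dataclass


@dataclass
class IDSStatement:
    """Hypothetical record for an Issue/Default/Suggestion statement."""
    issue: str
    default: str
    suggestion: str

    def render(self) -> str:
        # Present the three parts in the order the slide lists them.
        return (
            f"Issue: {self.issue}\n"
            f"Default: {self.default}\n"
            f"Suggestion: {self.suggestion}"
        )


units = IDSStatement(
    issue="units, may need transformation",
    default="if no transformation, no conversion",
    suggestion="if needed, specify the conversion",
)
print(units.render())
```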
Theorem
Let f be the generated mapping from target t to source s, populated such that s has a valid interpretation. Let t’ be the submodel of t populated from s by f. Then t’ has a valid interpretation.
Proof: the paper is the proof …
Target (Graphical View)
Target (Textual View)
Source Example (Assumed to be Populated)
Matching (Direct)
• Object Sets
• Relationship Sets
Object-Set Type Compatibility
<a, b>
1. type(a) = type(b)
2. type(a) ⊃ type(b)
3. type(a) ⊂ type(b)
4. type(a) ≠ type(b)
type(a) = type(b)
• Same type
– string = string, but Airport ≠ Head Of State
– Need better matching techniques
• Same type, different units
– Size vs. Nr Sq Km
– Need unit conversion
• Same type, different format
– Date = Date, but 01/02/2002 ≠ Jan 2, 2002
– Need format conversion
• Same type, same units and format, different assumptions
– Altitude = Altitude, but altitude of aircraft and spacecraft differ
– Need same assumptions
• Same type, same units and format, same assumption, OIDs
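The format and unit conversions mentioned above could look roughly like the following sketch. It assumes that the source date 01/02/2002 is month/day/year (consistent with the slide's Jan 2, 2002) and, purely for illustration, that the source Size is in square miles; the slides do not state the source unit. The function names are hypothetical.

```python
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# 1 mile = 1.609344 km exactly, so 1 sq mile = 1.609344 ** 2 sq km.
SQ_KM_PER_SQ_MILE = 2.589988110336


def to_target_date(source_value: str) -> str:
    """Format conversion: 'MM/DD/YYYY' (assumed) to 'Mon D, YYYY'."""
    d = datetime.strptime(source_value, "%m/%d/%Y")
    return f"{MONTHS[d.month - 1]} {d.day}, {d.year}"


def sq_miles_to_sq_km(size: float) -> float:
    """Unit conversion: square miles (assumed source unit) to Nr Sq Km."""
    return size * SQ_KM_PER_SQ_MILE


print(to_target_date("01/02/2002"))  # Jan 2, 2002
print(sq_miles_to_sq_km(1.0))
```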
type(a) ⊃ type(b) and type(a) ⊂ type(b)
• Real ⊃ Integer or Video ⊃ Image
– Target has greater discriminating power
– Can add .0 or make a video of a single image (?)
• Integer ⊂ Real or Image ⊂ Video
– Source has greater discriminating power
– Can round off or select one of the frames (?)
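The coercions suggested on this slide can be sketched as four small functions. Representing a video as a list of frames is an assumption for illustration, as are the function names; the slide itself marks these coercions with a question mark.

```python
def widen_integer_to_real(value: int) -> float:
    """Target is Real, source is Integer: 'add .0'."""
    return float(value)


def narrow_real_to_integer(value: float) -> int:
    """Target is Integer, source is Real: round off (information is lost)."""
    return round(value)


def widen_image_to_video(image):
    """Target is Video, source is Image: a video of a single image."""
    return [image]


def narrow_video_to_image(frames, index=0):
    """Target is Image, source is Video: select one of the frames."""
    return frames[index]


print(widen_integer_to_real(2))
print(narrow_real_to_integer(2.7))
print(widen_image_to_video("frame-a"))
print(narrow_video_to_image(["frame-a", "frame-b"], index=1))
```

Which frame to select, and whether rounding is acceptable, are decisions the sketch leaves to the caller.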
type(a) ≠ type(b)
• Image ≠ String
– Mismatch, even if same attribute (e.g. both City)
– Types can help discard potential matches
• String(5) ≠ Integer
– But suppose the integer is 2
– Might work, but is “2.000” ok?
Relationship Match Requirements
• Referential integrity
• Constraints
– Cardinality
– Mandatory/Optional
Referential Integrity
[Figure: Target and Source diagrams with object sets a, b, a’, b’, and a’’]
The types of a, a’, and a’’ can all be different, but not arbitrary. Example: a (String), a’ (Integer), a’’ (Real).
Relationship-Set Constraint Compatibility
<a, b>
1. constr(a) <=> constr(b)
2. (constr(a) <= constr(b)) ∧ ¬(constr(a) => constr(b))
3. ¬(constr(a) <= constr(b)) ∧ (constr(a) => constr(b))
4. ¬(constr(a) <= constr(b)) ∧ ¬(constr(a) => constr(b))
constr(a) <=> constr(b)
[Figure: Person and Car connected by relationship sets “owns” and “drives”; Person and Car connected by an unnamed relationship set “?”]
Need more information to resolve: Perhaps “?” is “purchased.”
(constr(a) <= constr(b)) ∧ ¬(constr(a) => constr(b))
[Figure: City and City Map, with relationship sets a and b]
The target (a) expects many maps, but the source can’t supply them.
¬(constr(a) <= constr(b)) ∧ (constr(a) => constr(b))
[Figure: City and City Map, with relationship sets a and b]
The target (a) expects one map, but the source can supply many.
¬(constr(a) <= constr(b)) ∧ ¬(constr(a) => constr(b))
[Figure: City and City Map, with relationship sets a and b]
The target (a) expects at least one and potentially many maps, but the source may have none or at most one.
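The City / City Map cases above can be illustrated by modelling each participation constraint as a (min, max) range and testing implication in both directions. This is only a sketch under that assumption: the `Range` encoding, the function names, and the concrete ranges chosen for each case are introduced here and are not taken from the slides.

```python
from typing import Optional, Tuple

# (min, max) number of maps per city; max=None means "many".
Range = Tuple[int, Optional[int]]


def implies(c1: Range, c2: Range) -> bool:
    """True if every count allowed by c1 is also allowed by c2."""
    lo1, hi1 = c1
    lo2, hi2 = c2
    lower_ok = lo2 <= lo1
    upper_ok = hi2 is None or (hi1 is not None and hi1 <= hi2)
    return lower_ok and upper_ok


def compare(target: Range, source: Range) -> str:
    """Classify how the target constraint relates to the source constraint."""
    source_implies_target = implies(source, target)
    target_implies_source = implies(target, source)
    if source_implies_target and target_implies_source:
        return "equivalent"
    if source_implies_target:
        return "source stricter"   # e.g. target allows many, source at most one
    if target_implies_source:
        return "target stricter"   # e.g. target expects one, source supplies many
    return "incomparable"


print(compare((0, None), (0, 1)))  # target allows many, source at most one
print(compare((0, 1), (0, None)))  # target at most one, source many
print(compare((1, None), (0, 1)))  # target one or more, source none or one
```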
Matching (Derived)
• Generalization/Specialization
• Composite Values
• Derived Relationship Sets
• Displayable/Nondisplayable Object Sets
Generalization/Specialization
• For a target object set, a source object set may:
– have no overlap (just ignore)
– have a proper subset (accept or find missing generalization)
– have the same values (direct match)
– have a proper superset (hard, except for roles)
– overlap (like proper subset and proper superset)
• Consider roles and missing generalizations
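The five cases listed above correspond to standard set comparisons. The following sketch classifies a source object set's values against a target's using Python sets; `classify_overlap` and the sample values are hypothetical.

```python
def classify_overlap(target_values: set, source_values: set) -> str:
    """Describe the source value set relative to the target value set."""
    if target_values.isdisjoint(source_values):
        return "no overlap"          # also returned when either set is empty
    if source_values == target_values:
        return "same values"
    if source_values < target_values:
        return "proper subset"
    if source_values > target_values:
        return "proper superset"
    return "overlap"


target = {"Paris", "Rome", "Oslo"}
print(classify_overlap(target, {"Lima"}))
print(classify_overlap(target, {"Paris", "Rome"}))
print(classify_overlap(target, {"Paris", "Rome", "Oslo"}))
print(classify_overlap(target, {"Paris", "Rome", "Oslo", "Lima"}))
print(classify_overlap(target, {"Paris", "Lima"}))
```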
Roles
[Figure: target and source diagrams with object sets City Travel Video, City Clip: Video, and Video With City Scene]
Missing Generalization
[Figure: target object sets City Map and Country Map; source object sets City Map: Image and Country Map: Image; generalization Map: Image]
Composite Values
• Composite in Source (split)
• Composite in Target (merge)
• Examples of Derived Relationships
Composite in Source
[Figure: target Video with Nr Hours and Nr Minutes; source Video with Time, which is split into Nr Hours and Nr Minutes]
Note also that we generated a source path.
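A split of a composite source value could be sketched as follows. The "H:MM" encoding of Time is an assumption made for illustration, since the slides do not show the source format, and `split_time` is a hypothetical name.

```python
def split_time(time_value: str) -> tuple:
    """Split a composite source Time (assumed 'H:MM') into target values.

    Returns (Nr Hours, Nr Minutes).
    """
    hours, minutes = time_value.split(":")
    return int(hours), int(minutes)


print(split_time("2:05"))
```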
Composite in Target
[Figure: target Video with Nr Hours and Nr Minutes, which are merged into Time; source Video with Time]
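The merge direction could look like the sketch below, which combines Nr Hours and Nr Minutes into a single Time value. The "H:MM" output format and the name `merge_time` are assumptions for illustration only.

```python
def merge_time(nr_hours: int, nr_minutes: int) -> str:
    """Merge Nr Hours and Nr Minutes into one Time value (assumed 'H:MM')."""
    if not 0 <= nr_minutes < 60:
        raise ValueError("Nr Minutes must be between 0 and 59")
    return f"{nr_hours}:{nr_minutes:02d}"


print(merge_time(2, 5))
```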
Displayable/Nondisplayable Object-Set Matches
• Nondisplayable in Source: find a key
• Nondisplayable in Target: create a key
Nondisplayable in Source
[Figure: target Airport; source Airport with object sets City and Airline and relationship sets “flies to” and “serves”]
No Key: Discard Match
Nondisplayable in Source
[Figure: target Airport; source Airport with object sets City, Airline, and Airport Name and relationship sets “flies to” and “serves”]
One Key: Choose it
Nondisplayable in Source
[Figure: target Airport; source Airport with object sets City, Airline, Airport Name, and Airport Code and relationship sets “flies to” and “serves”]
Two or more Keys: Choose One
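The three Nondisplayable in Source cases (no key, one key, two or more keys) can be summarized in a small selection function. The slides do not say how one key is chosen when several exist, so the optional preference list below is an assumption, as is the name `choose_key`.

```python
from typing import Optional, Sequence


def choose_key(candidate_keys: Sequence[str],
               preferred: Sequence[str] = ()) -> Optional[str]:
    """Pick a displayable key for a nondisplayable source object set.

    Returns None when there is no key, meaning the match is discarded.
    """
    if not candidate_keys:
        return None                   # No key: discard match
    if len(candidate_keys) == 1:
        return candidate_keys[0]      # One key: choose it
    # Two or more keys: choose one. The preference order is an assumption.
    for name in preferred:
        if name in candidate_keys:
            return name
    return candidate_keys[0]


print(choose_key([]))
print(choose_key(["Airport Name"]))
print(choose_key(["Airport Name", "Airport Code"], preferred=["Airport Code"]))
```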
Matching Algorithm
Sample Match Table
Pictorial View of Match Table
[Figure: match table relating target and source]
Summary
Let f be the generated mapping from target t to source s, populated such that s has a valid interpretation. Let t’ be the submodel of t populated from s by f. Then t’ has a valid interpretation.
Proof: the paper is the proof …
Pictorial View of Match Table
t = target
s = source
f = the mapping
t’ has a valid interpretation
t’ = submodel
Concluding Remarks
• QED (the theorem holds)
• Merge (several sources)
– All sources extracted to same view
– Union merge
• Object identity problems
• Constraint problems
• Source Modeling (convert to OSM)
• Framework defined, but not implemented