
Efficient Selection & Integration of Data Sources for Answering Semantic Web Queries

Abir Qasem¹, Dimitre Dimitrov², Jeff Heflin¹

¹ Lehigh University, ² Tech-X Corporation

11/11/07


Outline

• Challenges

• Desiderata and overview of our approach

• OWLII: the subset of OWL that our system supports

• OBII: Ontology Based Information Integrator

• Evaluation

• Wrap up


Challenge 1: Scalability

• The (Semantic) Web is too big for any given system, regardless of advances in algorithms and/or smart hacks

• We need to somehow identify a suitable subset that is relevant to a query and “work” on it
– Sampling and refinement, as Fensel and van Harmelen (IEEE IC 2007) suggest?
– Get a good enough subset in one shot?


Challenge 2: Heterogeneity

[Figure: ontologies O1, O2, O3, O4, ..., ON stored in different repositories (Sesame, OWLIM, DLDB), each with its own query; question marks appear between them]

• Need alignments

• Mapping tools

• Third-party alignments

• Need tools that exploit them


Desiderata

1. Rely as much as possible on existing infrastructure.

2. Answer a query using any ontology, not just a globally accepted “query” ontology.

3. Identify a good enough subset of data sources that will get useful answers.

4. Be able to “discover” alignment information even when the ontologies are not directly mapped to one another.

5. Account for the dynamic nature of the Web, where the content of data sources changes rapidly.


Our approach (Get a good enough subset in one shot)

• Introduced a concept of source relevance (“REL” statements)
– Allows data providers to advertise the relevance of their data to a query
– If a source can express that it has relevant information, we can choose to query it as opposed to other sources that do not express this information

• Adapted an information integration algorithm to select relevant sources for a query, given relevance metadata and ontology alignments

• Implemented and evaluated the system on synthetic data


PDMS

• Fast and proven algorithm for query reformulation from the database community (Halevy et al., ICDE 2003)

• Uses the LAV and GAV information integration formalisms to describe maps and data sources
– GAV
    • In first-order logic, it is an implication with multiple antecedents and a single consequent
    • Usually written like: O3:BigMonitor(x) :- O2:LCD(x), screen(x, big)
– LAV
    • In first-order logic, it is an implication with a single antecedent and multiple consequents
    • Usually written like: O1:CinemaDisplay(x) ⊑ O2:LCD(x), screen(x, big)
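As a rough illustration of how a GAV rule can be used during reformulation, the Python sketch below unfolds a query atom by replacing a rule head with its body. The tuple encoding of atoms and the `unfold` helper are assumptions made for this example; they are not taken from the PDMS algorithm as published.

```python
# Atoms are (predicate, args) tuples; this encoding is hypothetical.
GAV_RULES = {
    # O3:BigMonitor(x) :- O2:LCD(x), screen(x, big)
    "O3:BigMonitor": (("x",), [("O2:LCD", ("x",)), ("screen", ("x", "big"))]),
}

def unfold(atom, rules=GAV_RULES):
    """Replace an atom by the body of its GAV rule, renaming head variables."""
    pred, args = atom
    if pred not in rules:
        return [atom]  # no rule for this predicate: keep the atom unchanged
    head_vars, body = rules[pred]
    binding = dict(zip(head_vars, args))
    # Substitute head variables in each body atom; constants such as "big" pass through.
    return [(p, tuple(binding.get(a, a) for a in body_args)) for p, body_args in body]

query = [("O3:BigMonitor", ("?m",))]
reformulated = [a for atom in query for a in unfold(atom)]
print(reformulated)
```

A query over O3:BigMonitor is thus rewritten into a conjunction over the O2 vocabulary, which is the single-consequent, multiple-antecedent shape described for GAV.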


OWL for information integration (OWLII)

• The ontology language our system supports

• A subset of OWL DL (therefore, decidable)

• To represent LAV and GAV in OWL
– We have extended the DHL language (Grosof et al. 03)

• REL statements are modeled as LAV statements

• Details in a tech report: http://www3.lehigh.edu/images/userImages/jgs2/Page_7287/LU-CSE-07-007.pdf


REL example

<meta:RelStatement>
  <meta:source rdf:resource="http://sourceURL2"/>
  <meta:contained>
    <owl:Class rdf:about="TV" />
  </meta:contained>
  <meta:container>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#madeBy" />
      <owl:hasValue rdf:resource="#Sony" />
    </owl:Restriction>
  </meta:container>
</meta:RelStatement>
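To make the structure of a REL statement concrete, here is a minimal Python sketch that reads the statement above with the standard library. The slide omits namespace declarations, so the `rdf:RDF` wrapper and the `meta` namespace URI below are assumptions added only so the fragment parses; `parse_rel` is a hypothetical helper, not part of OBII.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "meta": "http://example.org/meta#",  # hypothetical URI; not given on the slide
}

REL_XML = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:meta="http://example.org/meta#">
  <meta:RelStatement>
    <meta:source rdf:resource="http://sourceURL2"/>
    <meta:contained><owl:Class rdf:about="TV" /></meta:contained>
    <meta:container>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#madeBy" />
        <owl:hasValue rdf:resource="#Sony" />
      </owl:Restriction>
    </meta:container>
  </meta:RelStatement>
</rdf:RDF>
"""

RDF = "{%s}" % NS["rdf"]  # ElementTree stores rdf:about as "{namespace}about"

def parse_rel(xml_text):
    """Extract (source, contained class, container restriction) from each RelStatement."""
    root = ET.fromstring(xml_text.strip())
    rels = []
    for stmt in root.findall("meta:RelStatement", NS):
        restriction = stmt.find("meta:container/owl:Restriction", NS)
        rels.append({
            "source": stmt.find("meta:source", NS).get(RDF + "resource"),
            "contained": stmt.find("meta:contained/owl:Class", NS).get(RDF + "about"),
            "container": (
                restriction.find("owl:onProperty", NS).get(RDF + "resource"),
                restriction.find("owl:hasValue", NS).get(RDF + "resource"),
            ),
        })
    return rels

print(parse_rel(REL_XML))
```

One reading of the result is that the source at http://sourceURL2 advertises TV instances that fall within the class of things made by Sony.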


Map example

<owl:Class rdf:about="&a;NovelAuthor">
  <rdfs:subClassOf rdf:resource="&b;Author"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&b;writes"/>
      <owl:someValuesFrom rdf:resource="&b;Novel"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>


Not so simple maps!

• Maps are not always straightforward

• For example: datatype property to object property
– “profession” is a datatype property in O1
– “Profession” is a class and hasProfession is an object property in O2 (domain Person and range Profession)

– O1:Person ⊓ O1:profession.{“teacher”} O2:Person ⊓ O2:hasProfession.{teacher}
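The effect of such a map on data can be sketched in Python as a rewrite over triples. This is only an illustration: it assumes the map is applied from O1 to O2, that the literal "teacher" corresponds to an O2 individual of the same name, and that triples are plain tuples. The function name and the `ex:` identifiers are hypothetical.

```python
def translate_profession(triples):
    """Rewrite O1 profession literals as O2 hasProfession individuals.

    Assumes the O1 -> O2 direction and a same-name literal/individual match.
    """
    persons = {s for s, p, o in triples if p == "rdf:type" and o == "O1:Person"}
    out = []
    for s, p, o in triples:
        if p == "O1:profession" and s in persons:
            out.append((s, "rdf:type", "O2:Person"))
            # Datatype value becomes a reference to an individual in O2.
            out.append((s, "O2:hasProfession", "O2:" + o))
    return out

data = [
    ("ex:alice", "rdf:type", "O1:Person"),
    ("ex:alice", "O1:profession", "teacher"),
]
translated = translate_profession(data)
print(translated)
```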


OBII

• Ontology Based Information Integrator

• Input
– Domain ontologies (class and property hierarchy only)
– Map ontologies (OWL files that import two ontologies and establish alignments using OWLII)
– REL files (RDF files; a set of RDF triples enclosed by RelStatement describes a source's relevance)
– Data sources (OWL files that contain only individual and property assertions, i.e., an ABox, or Sesame repositories that contain similar data)
– SPARQL query

• Output
– Variable bindings in XML


OBII


Evaluation

• A baseline system: we load all the ontologies, all the maps, and all the sources into a DL reasoner and issue a query to get a sound and complete answer

• Basic PDMS to select sources (without any taxonomic reasoning)

• OBII


Metrics

• Response time
– For the baseline system, we add the time to load all the data and the reasoning time to get the answers.
– For the other two systems, load time is calculated as the time to load the ontologies that have been used in the reformulation (map and domain ontologies) and the selected data sources.
– The response time for these two systems is then the sum of load time, reformulation time, and reasoning time.

• Percentage of complete responses to queries
– In determining the completeness of queries, we consider the baseline system's answers to be the reference set.
– This is reasonable because it has all the data available to it and uses KAON2, a DL reasoner.
– We only consider queries that entail at least one answer.
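The two metrics can be written down as a short Python sketch. The function names, the dictionary layout, and the sample answers are assumptions for illustration; in particular, a response is treated here as complete when it contains every answer in the baseline's reference set, which is one plausible reading of the slide.

```python
def response_time(load, reasoning, reformulation=0.0):
    """Sum of load, reformulation, and reasoning time (the baseline has no reformulation)."""
    return load + reformulation + reasoning

def percent_complete(system_answers, baseline_answers):
    """Percentage of queries whose answers cover the baseline's reference set.

    Both arguments map a query id to a set of answers. Queries for which the
    baseline entails no answer are skipped.
    """
    considered = [q for q, ref in baseline_answers.items() if ref]
    if not considered:
        return 0.0
    complete = sum(
        1 for q in considered if baseline_answers[q] <= system_answers.get(q, set())
    )
    return 100.0 * complete / len(considered)

# Hypothetical answers: q3 is ignored because the baseline finds nothing for it.
baseline = {"q1": {"a", "b"}, "q2": {"c"}, "q3": set()}
system = {"q1": {"a", "b"}, "q2": set(), "q3": set()}
print(percent_complete(system, baseline))
print(response_time(load=1.5, reasoning=0.25, reformulation=0.25))
```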


Data

• Real-world data is limited
– Cannot be used to test the system completely

• We decided to use synthetic data

• We developed a workload generator, MOST (Maps Ontologies Sources Tester)

• We plan to use some real data soon in the ISENS project


Results (1)

• Response time for each system as we vary the number of ontologies and the number of sources

• Both basic PDMS and OBII are significantly faster than the baseline system (note: the chart uses a logarithmic scale)

• Additionally, the basic PDMS is typically twice as fast as OBII

• Similar trend in other configurations

[Chart: response time per system; x-axis: # of ontologies - diameter - # of sources]


Results (2)

• Contribution of load time to response time

• The main performance difference between OBII and basic PDMS is due to load time.

• OBII identifies more sources as it uses taxonomic reasoning

• Since PDMS fails to identify these sources, it is incomplete for many queries (next chart)


Results (3)

• The percentage of complete query responses decreases in basic PDMS as we increase the number of data sources and the number of ontologies.

• OBII is 100% complete for all the queries with respect to the baseline system


Wrap up!

• The Semantic Web needs to be connected in order for the “Semantics” to really pay off

• We have implemented a fast algorithm for selecting and integrating Semantic Web data sources

• Our initial evaluation shows promise, but there is a lot to be done
– Complex ontologies, expressive RELs, ...


• Backups


OWLII description

Axiom type | Subject | Object
owl:equivalentClass | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue
rdfs:subClassOf | All of the above + owl:unionOf | All of the above + owl:allValuesFrom
owl:equivalentProperty, rdfs:subPropertyOf | Named properties, owl:inverseOf | Named properties, owl:inverseOf
owl:inverseOf | Named properties | Named properties
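One way to read the table is as a lookup of which constructors may appear on each side of an axiom. The Python sketch below encodes that reading; the `is_owlii` helper and the use of the label "named" for named classes and properties are assumptions for illustration, and the tech report remains the authoritative definition of OWLII.

```python
CLASS_BASE = {"named", "owl:intersectionOf", "owl:someValuesFrom", "owl:hasValue"}
PROP_BASE = {"named", "owl:inverseOf"}

# axiom type -> (constructors allowed in the subject, constructors allowed in the object)
OWLII = {
    "owl:equivalentClass": (CLASS_BASE, CLASS_BASE),
    "rdfs:subClassOf": (CLASS_BASE | {"owl:unionOf"}, CLASS_BASE | {"owl:allValuesFrom"}),
    "owl:equivalentProperty": (PROP_BASE, PROP_BASE),
    "rdfs:subPropertyOf": (PROP_BASE, PROP_BASE),
    "owl:inverseOf": ({"named"}, {"named"}),
}

def is_owlii(axiom_type, subject_constructors, object_constructors):
    """Check that every constructor used on each side is permitted for the axiom type."""
    if axiom_type not in OWLII:
        return False
    subject_ok, object_ok = OWLII[axiom_type]
    return set(subject_constructors) <= subject_ok and set(object_constructors) <= object_ok

# A named class as a subclass of a named class and an existential restriction.
print(is_owlii("rdfs:subClassOf", ["named"], ["named", "owl:someValuesFrom"]))
# owl:allValuesFrom is only listed for the object position of rdfs:subClassOf.
print(is_owlii("rdfs:subClassOf", ["owl:allValuesFrom"], ["named"]))
```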


Source Selection Algorithm

subPred returns the subclasses (or subproperties) of a predicate: an enhanced “match” that allows us to consider subclasses (subproperties). This is an improvement over the PDMS algorithm.
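The idea of matching through the taxonomy can be sketched in Python as follows. This is not the authors' source selection algorithm: the taxonomy, the REL table, and the function names are hypothetical, and `sub_pred` here returns the predicate itself along with its descendants purely for convenience.

```python
TAXONOMY = {  # child -> parent; a hypothetical class hierarchy
    "LCD": "Monitor",
    "CRT": "Monitor",
    "Monitor": "Display",
}

def sub_pred(pred, taxonomy=TAXONOMY):
    """Return pred together with all of its transitive subclasses/subproperties."""
    result = {pred}
    changed = True
    while changed:  # repeat until no new descendants are found
        changed = False
        for child, parent in taxonomy.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

def select_sources(query_pred, rel, taxonomic=True):
    """rel maps a source to the set of predicates it advertises as relevant."""
    targets = sub_pred(query_pred) if taxonomic else {query_pred}
    return sorted(s for s, preds in rel.items() if preds & targets)

REL = {"src1": {"Monitor"}, "src2": {"LCD"}, "src3": {"Printer"}}
print(select_sources("Monitor", REL, taxonomic=False))  # exact match only
print(select_sources("Monitor", REL))                   # also matches subclasses
```

With exact matching only src1 is selected, while the taxonomic match also picks up src2, which illustrates why a selector without taxonomic reasoning can miss relevant sources.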


How was MOST used?

• OntGenerator
– An average of 20 classes and 20 properties
– The class and property taxonomies have an average branching factor of 4 and an average depth of 3

• MapGenerator
– An even distribution of various mapping axioms (it can be controlled)
– Chose to map about 30% of the classes and 30% of the properties of a given domain ontology
– The resulting map views contain an average of 5 conjuncts, with some maps containing up to 11 conjuncts

• SourceGenerator
– Create instances of 30% of the classes and 30% of the properties of the domain ontology that a source commits to
– On average, each data source contains 50 triples

• QueryGenerator
– Generate 200 random queries with 1 to 3 conjuncts (75% of conjuncts are properties as opposed to classes)
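The QueryGenerator settings above can be mimicked with a few lines of Python. This is a sketch under assumptions, not MOST's actual code: the function name, the class and property names, and the fixed seed are all hypothetical, and each conjunct is drawn as a property with probability 0.75 rather than enforcing an exact 75% share.

```python
import random

def generate_queries(classes, properties, n=200, max_conjuncts=3, p_property=0.75, seed=0):
    """Generate n random conjunctive queries with 1..max_conjuncts conjuncts each."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    queries = []
    for _ in range(n):
        conjuncts = []
        for _ in range(rng.randint(1, max_conjuncts)):
            if rng.random() < p_property:
                conjuncts.append(("property", rng.choice(properties)))
            else:
                conjuncts.append(("class", rng.choice(classes)))
        queries.append(conjuncts)
    return queries

# Hypothetical ontology with 20 classes and 20 properties, as in OntGenerator's averages.
classes = [f"C{i}" for i in range(20)]
properties = [f"p{i}" for i in range(20)]
queries = generate_queries(classes, properties)
print(len(queries), queries[0])
```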