
Efficient Selection & Integration of Data Sources for Answering Semantic Web Queries

Abir Qasem1, Dimitre Dimitrov2, Jeff Heflin1

1 Lehigh University, 2 Tech-X Corporation

11/11/07

Outline

• Challenges

• Desiderata and overview of our approach

• OWLII: the subset of OWL that our system supports

• OBII: Ontology Based Information Integrator

• Evaluation

• Wrap up

Challenge 1: Scalability

• The (Semantic) Web is too big for any given system, regardless of advances in algorithms and/or smart hacks

• We need to somehow identify a suitable subset that is relevant to a query and “work” on it
– Sampling and refinement, as Fensel and van Harmelen (IEEE IC 2007) suggest?
– Get a good enough subset in one shot?

Challenge 2: Heterogeneity

[Figure: ontologies O1, O2, O3, O4, ..., ON held in different repositories (Sesame, OWLIM, DLDB), each receiving its own query, with "?" marks between them]

• Need alignments
– Mapping tools
– Third party alignments

• Need tools that exploit them

Desiderata

1. Rely as much as possible on existing infrastructure.
2. Answer a query using any ontology, not just a globally accepted “query” ontology.
3. Identify a good enough subset of data sources that will get useful answers.
4. Be able to “discover” alignment information even when the ontologies are not directly mapped with one another.
5. Account for the dynamic nature of the Web, where the content of the data sources changes rapidly.

Our approach (Get a good enough subset in one shot)

• Introduced a concept of source relevance (“REL” statements)
– Allows data providers to advertise the relevance of their data to a query
– If a source can express that it has relevant information, we can choose to query it as opposed to other sources that do not express this information

• Adapted an information integration algorithm to select relevant sources for a query, given relevance metadata and ontology alignments

• Implemented and evaluated the system on synthetic data

PDMS

• Fast and proven algorithm for query reformulation from the database community (Halevy et al. ICDE 2003)

• Uses the LAV and GAV information integration formalisms to describe maps and data sources
– GAV
  • In first order logic it is an implication with multiple antecedents and a single consequent
  • Usually written like: O3:BigMonitor (x) :- O2:LCD (x), screen (x, big)
– LAV
  • In first order logic it is an implication with a single antecedent and multiple consequents
  • Usually written like: O1:CinemaDisplay (x) ⊑ O2:LCD (x), screen (x, big)
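To make the GAV case concrete, here is a minimal sketch of one unfolding step, where a query atom that matches the head of a GAV rule is replaced by the rule body. The tuple representation of atoms and the names GAV_RULES, unfold and reformulate are assumptions made for illustration; this is not the full PDMS reformulation algorithm of Halevy et al.

```python
from typing import Dict, List, Tuple

# An atom is (predicate, arguments), e.g. ("O2:LCD", ("x",)).
Atom = Tuple[str, Tuple[str, ...]]

# GAV rule from the slide: O3:BigMonitor(x) :- O2:LCD(x), screen(x, big)
# Stored as head predicate -> (head variables, body atoms). Hypothetical layout.
GAV_RULES: Dict[str, Tuple[Tuple[str, ...], List[Atom]]] = {
    "O3:BigMonitor": (("x",), [("O2:LCD", ("x",)), ("screen", ("x", "big"))]),
}


def unfold(atom: Atom) -> List[Atom]:
    """Replace an atom by the body of its GAV rule, if one exists."""
    predicate, args = atom
    if predicate not in GAV_RULES:
        return [atom]  # nothing to unfold, keep the atom as it is
    head_vars, body = GAV_RULES[predicate]
    binding = dict(zip(head_vars, args))  # head variable -> query term
    # Terms that are not head variables (such as the constant "big") are kept.
    return [(p, tuple(binding.get(t, t) for t in terms)) for p, terms in body]


def reformulate(query: List[Atom]) -> List[Atom]:
    """Apply a single unfolding step to every atom of a conjunctive query."""
    result: List[Atom] = []
    for atom in query:
        result.extend(unfold(atom))
    return result


print(reformulate([("O3:BigMonitor", ("?m",))]))
```

A query over O3:BigMonitor is thereby rewritten into a conjunction over O2:LCD and screen, which can then be matched against sources that describe their content in terms of O2.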

OWL for information integration (OWLII)

• The ontology language our system supports

• A subset of OWL DL (therefore, decidable)

• To represent LAV and GAV in OWL
– We have extended the DHL language (Grosof et al. 03)

• REL statements are modeled as LAV statements

• Details in a tech report
http://www3.lehigh.edu/images/userImages/jgs2/Page_7287/LU-CSE-07-007.pdf

REL example

<meta:RelStatement>
  <meta:source rdf:resource="http://sourceURL2"/>
  <meta:contained>
    <owl:Class rdf:about="TV" />
  </meta:contained>
  <meta:container>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#madeBy" />
      <owl:hasValue rdf:resource="#Sony" />
    </owl:Restriction>
  </meta:container>
</meta:RelStatement>
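A REL statement like the one above can be read with a standard XML parser. The following is a minimal sketch, assuming the statement is wrapped in an rdf:RDF element with namespace declarations. The meta namespace URI (http://example.org/meta#) is hypothetical, since the slide does not give it, and parse_rel_statements is an illustrative helper, not part of OBII.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "meta": "http://example.org/meta#",  # hypothetical: not given in the slides
}

# The REL example from the slide, wrapped in rdf:RDF so the prefixes resolve.
REL_XML = f"""<rdf:RDF xmlns:rdf="{NS['rdf']}" xmlns:owl="{NS['owl']}" xmlns:meta="{NS['meta']}">
  <meta:RelStatement>
    <meta:source rdf:resource="http://sourceURL2"/>
    <meta:contained><owl:Class rdf:about="TV"/></meta:contained>
    <meta:container>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#madeBy"/>
        <owl:hasValue rdf:resource="#Sony"/>
      </owl:Restriction>
    </meta:container>
  </meta:RelStatement>
</rdf:RDF>"""

RDF = "{" + NS["rdf"] + "}"  # ElementTree spells qualified attributes as {uri}name


def parse_rel_statements(xml_text):
    """Return one dict per RelStatement: source, contained class, restriction."""
    root = ET.fromstring(xml_text)
    statements = []
    for statement in root.findall("meta:RelStatement", NS):
        restriction = statement.find("meta:container/owl:Restriction", NS)
        statements.append({
            "source": statement.find("meta:source", NS).get(RDF + "resource"),
            "contained": statement.find("meta:contained/owl:Class", NS).get(RDF + "about"),
            "on_property": restriction.find("owl:onProperty", NS).get(RDF + "resource"),
            "has_value": restriction.find("owl:hasValue", NS).get(RDF + "resource"),
        })
    return statements


for rel in parse_rel_statements(REL_XML):
    print(rel)
```

The sketch only handles the shape shown in the example (a named class contained in a single hasValue restriction); other container expressions would need additional cases.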

Map example

<owl:Class rdf:about="&a;NovelAuthor">
  <rdfs:subClassOf rdf:resource="&b;Author"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&b;writes"/>
      <owl:someValuesFrom rdf:resource="&b;Novel"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

Not so simple maps!

• Maps are not always straightforward

• For example: datatype property to object property
– “profession” is a datatype property in O1
– “Profession” is a class and hasProfession is an object property in O2 (domain Person and range Profession)

– O1:Person ⊓ O1:profession.{“teacher”} O2:Person ⊓ O2:hasProfession.{teacher}

OBII

• Ontology Based Information Integrator

• Input
– Domain ontologies (class and property hierarchy only)
– Map ontologies (OWL files that import two ontologies and establish alignments using OWLII)
– REL files (RDF files; a set of RDF triples enclosed by RelStatement describes a source's relevance)
– Data sources (OWL files that contain only individual and property assertions, i.e. an ABox, or Sesame repositories that contain similar data)
– SPARQL query

• Output
– Variable bindings in XML

Evaluation

• A baseline system: we load all the ontologies, all the maps and all the sources into a DL reasoner and issue a query to get a sound and complete answer

• Basic PDMS to select sources (without any taxonomic reasoning)

• OBII

Metrics

• Response time
– For the baseline system we add the time to load all the data and the reasoning time to get the answers.
– For the other two systems, load time is calculated as the time to load the ontologies that have been used in the reformulation (map and domain ontologies) and the selected data sources.
– The response time for these two systems is then the sum of load time, reformulation time and reasoning time.

• Percentage of complete responses to queries
– In determining the completeness of queries, we consider the baseline system's answers to be the reference set.
– This is reasonable because it has all the data available to it and uses KAON2, a DL reasoner.
– We only consider queries that entail at least one answer.
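The two metrics can be written down as a short sketch. It assumes that answers are kept as one set per query and that a response counts as complete when it contains every answer in the baseline's reference set; the function names and the sample data are illustrative only.

```python
def response_time(load, reasoning, reformulation=0.0):
    """Sum of load, reformulation and reasoning time (same unit for all).

    The baseline system has no reformulation step, so that term defaults to 0.
    """
    return load + reformulation + reasoning


def percent_complete(baseline_answers, system_answers):
    """Percentage of queries for which the system returns all baseline answers.

    Both arguments map a query id to a set of answers. Queries for which the
    baseline entails no answer are skipped, as stated on the slide.
    """
    considered = [q for q, answers in baseline_answers.items() if answers]
    if not considered:
        return 0.0
    complete = sum(
        1 for q in considered
        if baseline_answers[q] <= system_answers.get(q, set())  # subset test
    )
    return 100.0 * complete / len(considered)


# Made-up example: q3 is ignored because the baseline has no answer for it.
baseline = {"q1": {"a", "b"}, "q2": {"c"}, "q3": set()}
system = {"q1": {"a", "b"}, "q2": set(), "q3": set()}
print(percent_complete(baseline, system))  # 50.0
print(response_time(load=2.0, reasoning=1.0, reformulation=0.5))  # 3.5
```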

Data

• Real world data is limited
– Cannot be used to test the system completely

• We decided to use synthetic data

• We developed a workload generator, MOST (Maps Ontologies Sources Tester)

• We plan to use some real data soon in the ISENS project

Results (1)

• Response time for each system as we vary the number of ontologies and the number of sources

• Both basic PDMS and OBII are significantly faster than the baseline system (note: the chart is in logarithmic scale)

• Additionally, the basic PDMS is typically twice as fast as OBII

• Similar trend in other configurations

[Chart: response time per system; x-axis: # of Onts - Diameter - # of sources]

Results (2)

• Contribution of load time to response time

• The main performance difference between OBII and basic PDMS is due to load time.

• OBII identifies more sources because it uses taxonomic reasoning

• Since basic PDMS fails to identify these sources, it is incomplete for many queries (next chart)

Results (3)

• The percentage of complete query responses decreases in basic PDMS as we increase the number of data sources and the number of ontologies.

• OBII is 100% complete for all the queries with respect to the baseline system

Wrap up!

• The Semantic Web needs to be connected in order for the “Semantics” to really pay off

• We have implemented a fast algorithm for selecting and integrating Semantic Web data sources

• Our initial evaluation shows promise but there is a lot to be done
– Complex ontologies, expressive RELs ...

• Backups

OWLII description

Axiom type | Subject | Object
owl:equivalentClass | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue
rdfs:subClassOf | All of the above + owl:unionOf | All of the above + owl:allValuesFrom
owl:equivalentProperty, rdfs:subPropertyOf | Named properties, owl:inverseOf | Named properties, owl:inverseOf
owl:inverseOf | Named properties | Named properties

Source Selection Algorithm

subPred returns the subclasses (or subproperties) of a predicate. The enhanced “match” uses it so that we can also consider subclasses (subproperties). This is an improvement over the basic PDMS.
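The effect of subPred on source selection can be illustrated with a small sketch. It assumes a taxonomy stored as a child-to-parent dictionary, a subPred that returns the predicate together with all of its direct and indirect sub-predicates, and sources that advertise plain predicate names; the taxonomy, the source names and both function names are hypothetical and are not the algorithm from the slide.

```python
def sub_pred(predicate, taxonomy):
    """Return the predicate plus all of its direct and indirect sub-predicates.

    taxonomy maps each child predicate to its parent (single parent assumed).
    """
    found = {predicate}
    frontier = [predicate]
    while frontier:
        current = frontier.pop()
        for child, parent in taxonomy.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def select_sources(query_predicate, advertised, taxonomy, use_taxonomy=True):
    """Pick the sources whose advertised predicates match the query predicate.

    With use_taxonomy=False only exact matches count, which mimics a basic
    match without taxonomic reasoning.
    """
    wanted = sub_pred(query_predicate, taxonomy) if use_taxonomy else {query_predicate}
    return sorted(src for src, preds in advertised.items() if preds & wanted)


# Hypothetical class taxonomy and relevance advertisements.
taxonomy = {"LCD": "Monitor", "CRT": "Monitor", "Monitor": "Display"}
advertised = {"s1": {"Monitor"}, "s2": {"LCD"}, "s3": {"TV"}}

print(select_sources("Monitor", advertised, taxonomy))                      # ['s1', 's2']
print(select_sources("Monitor", advertised, taxonomy, use_taxonomy=False))  # ['s1']
```

In this toy setting the exact match misses s2, which only advertises the subclass LCD. That is the kind of source an approach without taxonomic reasoning would fail to select.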

How was MOST used?

• OntGenerator
– An average of 20 classes and 20 properties.
– The class and property taxonomies have an average branching factor of 4 and an average depth of 3.

• MapGenerator
– An even distribution of various mapping axioms (it can be controlled).
– We chose to map about 30% of the classes and 30% of the properties of a given domain ontology.
– The resulting map views contain an average of 5 conjuncts, with some maps containing up to 11 conjuncts.

• SourceGenerator
– Create instances of 30% of the classes and 30% of the properties of the domain ontology that a source commits to.
– On average each data source contains 50 triples.

• QueryGenerator
– Generate 200 random queries with 1 to 3 conjuncts (75% of conjuncts are properties as opposed to classes).
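The QueryGenerator settings can be mimicked with a short sketch. This is not the MOST implementation: the function names, the variable naming scheme and the sample class and property names are assumptions, and only the parameters (200 queries, 1 to 3 conjuncts, properties chosen with probability 0.75) come from the slide.

```python
import random


def generate_query(classes, properties, rng, property_ratio=0.75):
    """Build one conjunctive query with 1 to 3 conjuncts.

    A property conjunct is (property, subject_var, object_var);
    a class conjunct is (class, var).
    """
    conjuncts = []
    for i in range(rng.randint(1, 3)):
        if rng.random() < property_ratio:
            conjuncts.append((rng.choice(properties), f"?x{i}", f"?x{i + 1}"))
        else:
            conjuncts.append((rng.choice(classes), f"?x{i}"))
    return conjuncts


def generate_workload(classes, properties, count=200, seed=0):
    """Generate `count` random queries; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [generate_query(classes, properties, rng) for _ in range(count)]


# Hypothetical vocabulary standing in for a generated domain ontology.
workload = generate_workload(["TV", "LCD"], ["madeBy", "screen"])
print(len(workload), workload[0])
```

Because each conjunct is drawn independently, the share of property conjuncts in a given run is only approximately 75%.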
