SELF ORGANIZING SEMANTIC

TOPOLOGIES IN PEER DATABASE

SYSTEMS

AMI EYAL

SELF ORGANIZING SEMANTIC TOPOLOGIES

IN PEER DATABASE SYSTEMS

RESEARCH THESIS

SUBMITTED IN PARTIAL FULFILLMENT OF THE

REQUIREMENTS

FOR THE DEGREE OF MASTER OF SCIENCE

IN INFORMATION MANAGEMENT ENGINEERING

AMI EYAL

SUBMITTED TO THE SENATE OF THE TECHNION — ISRAEL INSTITUTE OF TECHNOLOGY

TAMMUZ, 5767 HAIFA JUNE, 2007

THIS RESEARCH THESIS WAS SUPERVISED BY DR. AVIGDOR GAL

UNDER THE AUSPICES OF THE INDUSTRIAL ENGINEERING AND

MANAGEMENT DEPARTMENT

ACKNOWLEDGMENT

I would like to express my deepest gratitude to my supervisor, Professor Avigdor Gal, for his devoted guidance and wise counsel. My sincere thanks go to the faculty personnel for their help in all practical and administrative matters during my studies; special thanks are given to Judith Ish-Lev. Additional thanks go to my colleagues, Haggai, Inbal, Victor and others, for helpful discussions, motivation and support when I most needed it. Last and most important, I am deeply indebted to my dear family and friends, whose endless love and support enabled the completion of this work.

THE GENEROUS FINANCIAL HELP OF THE EUROPEAN COMMISSION

SIXTH FRAMEWORK IST PROJECT QUALEG AND THE TECHNION IS

GRATEFULLY ACKNOWLEDGED

Contents

Abstract
List of Symbols
1 Introduction
    1.1 Related Work
        1.1.1 Schema Matching
        1.1.2 Peer Database Systems
    1.2 Thesis Outline
    1.3 Contributions
2 Model Definition
    2.1 The Data Model
    2.2 The Network Model
        2.2.1 Schema Mappings
        2.2.2 Query Dissemination
        2.2.3 Semantic Topology
    2.3 The Matching Model
        2.3.1 Mapping Accuracy
        2.3.2 Mapping Accuracy Preservation
    2.4 Evaluation of semantic topologies
        2.4.1 Self-Interest Based Topology Evaluation
        2.4.2 Cooperative Interest Based Topology Evaluation
3 On Optimal Semantic Topologies
    3.1 Optimal Self-Interest Based Topologies
    3.2 Optimal Cooperative-Interest Based Topologies
        3.2.1 Degree Bounded Maximum Minimal Product Paths Tree (db-MMPT)
        3.2.2 Single Peer Single Query (SPSQ) Optimal Topology Problem
    3.3 Discussion
4 Dynamic Self-Organizing Topologies
    4.1 Semantic Acquaintance
    4.2 Semantic Replacement
5 Experiments
    5.1 Simulation Architecture
    5.2 Data and parameters
    5.3 Experimental setup
    5.4 Evaluation
    5.5 Results
        5.5.1 Good Initial Topologies
        5.5.2 Initial Bad Topologies
        5.5.3 Randomly Generated Topologies
    5.6 Discussion
6 Discussion
    6.1 Conclusions
    6.2 Future Work
References
Hebrew Abstract

List of Figures

2.1 A query reformulation example.
2.2 PDMS model description.
2.3 A semantic network graph, where peers' schemata are interlinked by schema mappings provided by the peers.
2.4 Semantic Network Model: Query translation layers and a Topology with a limit of $K_p = 2$ neighbors.
2.5 An example of mapping accuracy.
2.6 An example of mapping preservation.
2.7 An example for query reformulation graph.
2.8 An example for accuracy oriented semantic topology evaluation.
3.1 Classification of the optimal CIV topology problem.
3.2 Example for maximum minimal product paths tree (MMPT) and maximum product paths tree (MPT).
3.3 Example of transformation from MPT to SPT.
3.4 Example of MMPT Vs. db-MMPT.
3.5 Example of transformation from ATSP to db-MMPT.
3.6 Example of transformation from db-MMPT to SPSQ.
4.1 Semantically disconnected components.
4.2 Acquaintance policies example.
4.3 Bad replacement example.
5.1 Simulation Model: domain, schemata, and query sets.
5.2 Simulation Model: semantic topology and query translation layers.
5.3 Simulation Model: sequence Diagram of a single query cycle.
5.4 Domain attributes probability for participation in peer schemas and queries.
5.5 Attributes mapping accuracies distributions for similar and different attributes.
5.6 Network topology: out degree Vs. peer rank following power law.
5.7 Replacement policies comparison: convergence in initial good topologies
5.8 Replacement policies comparison: topology changes in initial good topologies
5.9 Replacement policies comparison: SIV change in initial good topologies
5.10 Replacement policies comparison: CIV change in initial good topologies
5.11 Acquaintance policies comparison: convergence in initial bad topologies
5.12 Acquaintance policies comparison: topology changes in initial bad topologies
5.13 Acquaintance policies comparison: SIV change in initial bad topologies
5.14 Acquaintance policies comparison: CIV change in initial bad topologies
5.15 Acquaintance policies comparison: reachability change in initial bad topologies
5.16 Acquaintance policies comparison: average CIV measure change in initial bad topologies
5.17 Replacement policies comparison: convergence in initial bad topologies
5.18 Replacement policies comparison: topology changes in initial bad topologies
5.19 Replacement policies comparison: SIV change in initial bad topologies
5.20 Replacement policies comparison: CIV change in initial bad topologies
5.21 Replacement policies comparison: average CIV measure change
5.22 Replacement policies comparison: reachability change in initial bad topologies
5.23 Acquaintance policies comparison: convergence in randomly generated topologies
5.24 Acquaintance policies comparison: topology changes in randomly generated topologies
5.25 Acquaintance policies comparison: SIV change in randomly generated topologies
5.26 Acquaintance policies comparison: CIV change in randomly generated topologies
5.27 Acquaintance policies comparison: average CIV change in randomly generated topologies
5.28 Acquaintance policies comparison: reachability change in randomly generated topologies
5.29 Replacement policies comparison: convergence in randomly generated topologies
5.30 Replacement policies comparison: number of topology changes in randomly generated topologies
5.31 Replacement policies comparison: SIV change in random topologies
5.32 Replacement policies comparison: CIV change in random topologies
5.33 Replacement policies comparison: average CIV change in random topologies
5.34 Replacement policies comparison: reachability change in random topologies
5.35 Replacement policies comparison: topology change visualization
5.36 SIV Vs. Average CIV in randomly generated topologies

List of Tables

4.1 Acquaintance policies evaluation measures
4.2 Mapping accuracies for selected candidates using different acquaintance policies
4.3 Replacement policies evaluation measures
5.1 Simulation parameters
5.2 Summary of experimental setup parameters

Abstract

Peer database management systems (PDMS) combine the decentralized setting and autonomy of peer-to-peer systems with the rich semantic context of database systems. In a PDMS, members use schema matching techniques to establish schema mappings as the basis for peer querying. The large-scale and dynamic environments of peer-to-peer networks dictate the use of automatic schema matching, which has been shown to carry a degree of uncertainty.

In the first part of this thesis, we introduce a model for a PDMS that considers the inherent uncertainty of automatic schema matching and the growth of this uncertainty over transitive matching in a decentralized environment. We examine how the quality of query reformulation in our model is influenced by the choice of semantic network topology. We analyze semantic topologies with respect to both local interest and global welfare.

Next, we consider (offline) problems of finding optimal semantic topologies that maximize query reformulation quality. We present an algorithm that finds optimal topologies maximizing peers' local interest. We also show that, even with complete offline network knowledge, the problem of finding a topology that maximizes global welfare is NP-Complete, even in a very simple setting.

In the third part, we consider the (online) setting of a peer-to-peer system where no single peer obtains complete network knowledge. We present several heuristic (online) algorithms for topology self-organization in the absence of full network knowledge.

In the final part, we present a PDMS simulation using our model and our online algorithms for topology self-organization. We compare our algorithms with similar algorithms from the context of peer file-sharing systems. Our results indicate that our algorithms better exploit interest-based locality in a PDMS environment. We also show experimental results indicating that local interest interferes with global welfare in the context of optimal topology self-establishment. Finally, we show that optimal topologies maximizing global welfare cannot be reached by independent peers individually applying self-organization algorithms, but rather require collaborative algorithms applied by peers in cooperation.

List of Symbols

$P$ : Set of peers in a PDMS
$p$ : Single peer
$DB_p$ : Peer $p$'s database
$S_p$ : Peer $p$'s database descriptive schema
$S$ : Set of peers' schemata in a PDMS
$A$ : Schema attribute
$A^I$ : Set of attribute interpretations
$\Delta^I$ : Global domain of attribute interpretations
$q(S_i)$ : Query formed in terms of schema $S_i$
$Q_p$ : Set of queries issued by peer $p$
$\lambda_q$ : Query $q$ appearance frequency
$ID_p$ : Peer $p$'s network identifier
$m_{A_i \to A_j}$ : Attribute mapping from $A_i$ to $A_j$
$M_{S_i \to S_j}$ : Schema mapping from $S_i$ to $S_j$
$M$ : Set of schema mappings in a PDMS
$G(P, M)$ : Semantic graph with a set of peers $P$ and a set of mappings $M$
$T(P, M)$ : Semantic topology with a set of peers $P$ and a set of mappings $M$
$N_p$ : Set of peer $p$'s neighbors in a semantic topology
$K_p$ : Peer $p$'s maximum neighbors boundary
$\mu$ : Mapping accuracy confidence measure
$Path_{S_i \to S_j}$ : Path of transitive schema mappings
$\alpha$ : Accuracy preservation over a path of transitive schema mappings
SIV : Self-interest value measure for semantic topology evaluation
CIV : Cooperative-interest value measure for semantic topology evaluation

Chapter 1

Introduction

Peer-to-peer (P2P) systems have been the subject of a wide research effort over the past few years. Advantages such as scalability, autonomy and robustness have made them useful in various domains, and many practical applications of this technology have proved highly successful. Originally, P2P systems supported only a simple data model and limited query expressiveness. Later, research efforts were made to enrich P2P data models with metadata and to enhance query expressiveness [50, 60, 36]. Recently, a revolutionary approach suggested an integration of P2P and database management system (DBMS) technologies.

Peer database management systems (PDMS) combine the decentralized setting and autonomy of P2P systems with the rich semantic context of database management systems. Each peer maintains a local database and a descriptive schema exposing its database to the other peers. Information sharing is done by means of query dissemination, the iterative propagation of queries among connected peers. Expressive query languages of the type used in database management systems (e.g., SQL, XQuery) may be used to compose complex queries. In a PDMS, members use schema matching techniques to establish schema mappings as the basis for peer querying. Queries are reformulated from the schema of a source peer to the schema of a target peer using these mappings.

Schema matching is the process of matching between concepts describing the meaning of data in heterogeneous schemata. A schema mapping, the outcome of a matching process, is a translation between similar concepts in a source and a target schema, and may be used to reformulate queries issued in terms of one schema into the terms of another. Due to its complexity, matching between two heterogeneous schemata was originally performed by human experts [14, 37]. However, the large-scale and dynamic environments of peer-to-peer networks dictate the use of automatic schema matching. The automatic schema matching process has been shown to carry a degree of uncertainty, and its outcome may contain inaccurate, possibly erroneous, mappings.

The presence of uncertain mappings between the peers impacts the quality of reformulated queries and their returned results. Queries using inaccurate mappings may return irrelevant results. Consider, for example, a schema mapping where attribute FamilyName in one schema is inaccurately mapped to attribute FirstName in another schema. The simple query “return all persons with family name = 'Smith'” would then be translated to “return all persons with first name = 'Smith'”, thus yielding results irrelevant to the original query. The selection of schema mappings by individual peers therefore influences the overall quality of the query process. In a P2P environment such as a PDMS, mapping selection is derived from the network topology, i.e., the selection of neighbors by individual peers. A wise choice of neighbors may reduce the uncertainty of schema mappings and, as a direct result, may reduce the inaccuracy of queries and increase the quality of their outcome.
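
The FamilyName example can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that a query is a dictionary of attribute-value equality conditions and that a schema mapping is a dictionary from source attribute names to target attribute names. The names reformulate, evaluate, LastName and the sample rows are hypothetical and are not part of the thesis model.

```python
def reformulate(query, mapping):
    """Rewrite the attribute names of a query using a source-to-target mapping.

    Conditions on attributes that have no mapping are dropped (an assumption
    of this sketch, not a rule taken from the thesis).
    """
    return {mapping[attr]: value for attr, value in query.items() if attr in mapping}


def evaluate(query, rows):
    """Return the rows that satisfy every equality condition of the query."""
    return [row for row in rows if all(row.get(a) == v for a, v in query.items())]


# Query posed in terms of the source schema.
query = {"FamilyName": "Smith"}

# Hypothetical mappings to a target schema with attributes FirstName and LastName.
accurate_mapping = {"FamilyName": "LastName"}
inaccurate_mapping = {"FamilyName": "FirstName"}  # the erroneous mapping from the text

# Hypothetical content of the target peer's database.
target_rows = [
    {"FirstName": "John", "LastName": "Smith"},
    {"FirstName": "Smith", "LastName": "Jones"},
]

relevant = evaluate(reformulate(query, accurate_mapping), target_rows)
irrelevant = evaluate(reformulate(query, inaccurate_mapping), target_rows)
```

With the accurate mapping the target peer returns the person whose last name is Smith; with the inaccurate mapping it returns the person whose first name is Smith, which is irrelevant to the original query.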


In this thesis we consider a setting of a PDMS with matching uncertainty. Schema mappings created by matching between peers are inaccurate to some degree. We consider a dynamic setting where a peer connects to some arbitrary peers upon joining the network and can later change its selection of neighbors. Given this setting, we focus on the following questions:

• Can we efficiently identify “good” topologies, those that reduce the uncertainty in the network?

• Can such “good” topologies emerge through self-organization by self-interested peers acting individually?

1.1 Related Work

We divide this section into two main subsections: in Section 1.1.1 we review related work in the field of schema matching, and in Section 1.1.2 we discuss related work in the field of P2P systems, and specifically PDMSs.

1.1.1 Schema Matching

Schema matching is recognized as one of the basic operations required by the process of data and schema integration [8, 42, 12], and thus has a great impact on its outcome. Schema mappings can serve in tasks of targeted content delivery, view integration, database integration, query rewriting over heterogeneous sources, duplicate data elimination, and automatic streamlining of workflow activities that involve heterogeneous data sources. As such, schema matching has an impact on numerous modern applications that currently suffer from the inability to organize their dataspaces easily and effectively [26]. It impacts business, where company data sources continuously realign due to changing markets. It also impacts the way businesses and other information consumers seek information over the Web. Finally, it impacts the life sciences, where scientific workflows cross system boundaries more often than not.

Research into schema matching has been going on for more than 25 years now (see surveys [8, 59, 55, 61] and various online lists, e.g., OntologyMatching¹, Ziegler², DigiCULT³, and SWgr⁴), first as part of a broader effort of schema integration and then as a standalone research topic. Due to its cognitive complexity, schema matching has traditionally been considered AI-complete and was performed by human experts [14, 37]. For obvious reasons, manual concept reconciliation in large-scale and/or dynamic environments (with or without computer-aided tools) is inefficient and at times close to impossible. The move from manual to semi-automatic schema matching has been justified in the literature using arguments of scalability (especially for matching between large schemata [34]) and by the need to speed up the matching process. Researchers also argue for moving to fully automatic (that is, unsupervised) schema matching in settings where a human expert is absent from the decision process. In particular, such situations characterize numerous emerging applications triggered by the vision of the Semantic Web and machine-understandable Web resources [10, 63]. In these applications, schema matching is no longer a preliminary task to the data integration effort, but rather ad hoc and incremental.

¹ http://www.ontologymatching.org/
² http://www.ifi.unizh.ch/~pziegler/IntegrationProjects.html
³ http://www.digicult.info/pages/resources.php?t=10
⁴ http://www.semanticweb.gr/modules.php?name=News&file=categories&op=newindex&catid=17

The AI-complete nature of the problem dictates that semi-automatic and automatic algorithms for schema matching will be of a heuristic nature at best. Over the years, a significant body of work has been devoted to the identification of schema matchers, i.e., heuristics for schema matching. Examples of algorithmic tools providing means for schema matching include COMA [19], Cupid [45], OntoBuilder [29], Autoplex [9], Similarity Flooding [47], Clio [48], and Glue [21], to name just a few. The main objective of schema matchers is to provide schema mappings that are effective from the user's point of view, yet computationally efficient (or at least not disastrously expensive). Such research has evolved in different research communities, including databases, information retrieval, information sciences, data semantics, and others.

Automatic matching algorithms, based on syntactic rather than semantic means, may carry a degree of uncertainty. The authors of [30] used a fuzzy framework to model the uncertainty of the matching process outcome. They introduced a confidence measure associated with a matching outcome, indicating the matching algorithm's belief in the accuracy of the resulting mapping. A high confidence value indicates an accurate mapping, close to the perfect outcome of matching by a human expert. In addition, they demonstrated through theoretical and empirical analysis that, for a certain family of “well-behaved” mappings termed monotonic, one can safely interpret a high confidence measure as a good semantic mapping. Thus, automatic matching algorithms that produce mappings maintaining the monotonicity principle can be trusted to associate confidence measures that truly reflect the accuracy of their outcome.

1.1.2 Peer Database Systems

P2P networks suggest a model where participants communicating via ad hoc connections share resources to offer some collaborative service. In contrast to classical client/server applications, all the peer nodes in the network simultaneously function as both clients and servers to the other nodes in the network. Advantages of this decentralized setting, such as scalability, autonomy and robustness, have made it useful for various domains and applications. USENET [35] was an early P2P system for the propagation of news articles. BitTorrent⁵ is a P2P network for sharing file content (e.g., audio, video, data), Skype⁶ is a P2P-based Internet telephony system, and TVants⁷ is a video streaming (TV) distribution system based on P2P technology.

Peers in P2P networks are typically organized in an overlay network, a structure built on top of another network such as the Internet. P2P networks can be classified according to their overlay network organization. Unstructured P2P systems such as Gnutella [1] are constructed by peers establishing arbitrary links with a fixed number of other peers. In an unstructured P2P network, queries for data have to be flooded through the network in order to find peers sharing the data. Query propagation is regulated by a Time-To-Live (TTL) value, indicating the period of time or the number of forwarding iterations after which a query is discarded; this simple, robust mechanism restricts query broadcast to a certain radius. Hence, the main disadvantages of such networks are that the search mechanism is highly inefficient due to flooding and that it may fail to retrieve relevant results under the TTL limitation.
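
The TTL mechanism described above can be sketched as a hop-bounded breadth-first flood. This is a simplified illustration rather than the Gnutella protocol itself: the adjacency-dictionary overlay, the function name flood, and the interpretation of the TTL as a hop count are assumptions made for the example.

```python
from collections import deque


def flood(overlay, source, ttl):
    """Return the set of peers that receive a query flooded from `source`.

    `overlay` maps each peer to the list of its neighbors, and `ttl` is the
    number of hops a query may travel before it is discarded.
    """
    reached = {source}
    frontier = deque([(source, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue  # TTL exhausted: the query is not forwarded any further
        for neighbor in overlay.get(peer, ()):
            if neighbor not in reached:
                reached.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return reached


# Hypothetical overlay: a chain a - b - c - d.
overlay = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

within_two_hops = flood(overlay, "a", ttl=2)
within_three_hops = flood(overlay, "a", ttl=3)
```

In this chain, peer d holds data that a query from peer a can reach only with a TTL of at least 3, which illustrates how relevant results may be missed under a TTL limitation.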

⁵ http://www.bittorrent.com
⁶ http://www.skype.com/
⁷ http://www.tvants.com

Research efforts addressing this problem in file-sharing P2P systems suggested the use of topology self-organization as a possible solution. Under this approach, peers apply lightweight algorithms to estimate their semantic relation to other peers, and links in the overlay network are adjusted in order to improve search performance. In [68], lightweight policies such as LRU and History were suggested to identify and maintain links to a list of semantically close neighbors. The authors of [62] suggested the creation of interest-based shortcuts, i.e., direct links to peers with a high likelihood of sharing similar interests. Finally, [16] used routing indices, tables of information providing a list of neighbors that are most likely to be “in the direction” of the content corresponding to a query.

Structured P2P networks overcome these search disadvantages by maintaining a preset structure, allocating peers according to their content in a structure that minimizes flooding while still producing query-relevant results. The P-Grid P2P system [2, 3] organizes peers in a structured virtual binary search tree. CAN [56] allocates peers into “zones” in an n-dimensional Cartesian coordinate space, and peers in adjacent zones maintain links. Chord [64] organizes peers around a circle. These systems use Distributed Hash Tables (DHTs), providing hash table functionality that enables efficient distributed search.
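
As a rough illustration of how a DHT can assign content to peers, the sketch below places peers and keys on a hash ring and assigns each key to the first peer at or after its position, wrapping around at the end of the ring. It is a simplification under stated assumptions: SHA-1 is used as the hash function, the full peer list is known locally, and routing between peers is omitted entirely. The names ring_position and Ring are hypothetical.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a peer name or a content key to an integer position on the ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


class Ring:
    """A hash ring with full local knowledge of all peers (no routing)."""

    def __init__(self, peers):
        self._points = sorted((ring_position(peer), peer) for peer in peers)
        self._positions = [position for position, _ in self._points]

    def owner(self, key):
        """Return the first peer at or after the key's position, wrapping around."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        return self._points[index % len(self._points)][1]


# Hypothetical peers and key.
peers = ["peer-a", "peer-b", "peer-c"]
ring = Ring(peers)
responsible = ring.owner("some-file.mp3")
```

Because every peer computes the same position for a given key, a lookup can be directed toward the responsible peer instead of being flooded through the network.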

Early P2P systems dealt with very simple data and query models: queries were composed of a single keyword or string representing a file name. Query results indicated only the existence of items with a similar name, and a positive reply from a single peer was sufficient for content location. Later, systems with richer and more expressive data models evolved. Edutella [50] is a P2P system for exchanging metadata in RDF. Originally built on top of JXTA⁸, it later evolved to support publish-subscribe functionality for RDF and RDF Schema data on a super-peer architecture [51]. RDFPeers [13] indexes RDF and RDF Schema data in a DHT. PeerDB [60] is based on the BestPeer [52] P2P system and allows the sharing of relational data through attribute-keyword matching. PIER [36] is a full-blown, distributed, relational database system built on top of a DHT.

⁸ http://www.jxta.org/

Visionary papers [31, 11] appearing in 2002 suggested a new type of P2P system, named the peer database management system (PDMS). Harnessing the power of both P2P and database management technologies, they introduced a vision of a decentralized network of autonomous information sources, each maintaining a rich, expressive data model and query capabilities. Integrating these two worlds, they offered a large-scale, robust network of peers with rich heterogeneous schemata, where schema mappings, used as a semantic glue connecting peers in the network, enable peers to cooperate and share information by means of query reformulation. Query dissemination in a PDMS is done by means of gossiping [5], i.e., iterative query reformulation between connected peers, similar to query propagation in unstructured P2P systems.

The Piazza project [33, 32, 66] introduced a PDMS network where peers' schemata are interlinked by GLAV mappings. Piazza focused on the logical structure, algorithmic, and implementation aspects of peer data management, such as the definition and creation of mappings, query reformulation and propagation, and methods to improve their efficiency [65]. The Hyperion project [7, 40] presented another PDMS, relying on the Local Relational Model (LRM) [58] and using instance-level mappings and coordination rules to share data in decentralized environments. It also focused on implementation aspects, such as the definition of mappings using mapping tables [46] and algorithms for computing query reformulations [39]. Both Piazza and Hyperion treat schema mappings as fully accurate, as if created by a human expert. The authors of [6] were the first to recognize the effect of matching uncertainty on query reformulation accuracy, and offered an extended PDMS model that includes confidence measures representing mapping accuracies. They suggest using the mapping accuracy measure for selective query routing. However, they do not assume the use of matching algorithms following the monotonicity principle, and hence provide algorithms, integrated into the query mechanism, for the analysis and update of mapping confidence measures. Their approach can be viewed as complementary to ours.


1.2 Thesis Outline

In this thesis, we model a PDMS as a network of peers connected by schema mapping links associated with mapping confidence measures. We assume that matching algorithms follow the monotonicity principle, and take the confidence measure as truly reflecting mapping accuracy. We model the deterioration of accuracy over transitive mappings and its impact on query processing in a PDMS. We define a PDMS semantic overlay structure, namely the semantic topology, and show the influence of topology selection on the quality of query reformulation in a PDMS. Finally, we adopt the approach of overlay (topology) self-organization in search of semantic topologies that maximize reformulation quality.
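
To give an intuition for the deterioration of accuracy over transitive mappings, the following sketch assumes that the accuracy of a path is the product of the confidence measures of its mappings. This multiplicative rule is an assumption made here for illustration only; the formal treatment of accuracy preservation appears in Section 2.3.2. The function name path_accuracy and all confidence values are invented.

```python
from math import prod


def path_accuracy(confidences):
    """Accuracy of a path of schema mappings.

    ASSUMPTION of this sketch: confidences multiply along the path.
    """
    return prod(confidences)


direct = path_accuracy([0.6])          # one direct but weak mapping
two_hop = path_accuracy([0.9, 0.8])    # two stronger mappings in sequence
long_path = path_accuracy([0.9] * 5)   # five mappings of confidence 0.9
```

Under this assumption, a two-hop path through accurate mappings (about 0.72) preserves more accuracy than a direct but weak mapping (0.6), while a chain of five mappings of confidence 0.9 already falls to about 0.59. This is the kind of trade-off that makes the choice of neighbors matter.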

The remainder of this thesis is organized as follows:

In Chapter 2 we present our formal PDMS model. We describe the local database maintained by each peer and the semantic network of peers connected via schema mappings, and elaborate on the characteristics of schema mappings and their influence on the query mechanism. Additionally, we introduce the concept of a semantic topology, the organization of matched peers in the semantic network, and suggest evaluation measures for semantic topologies in the context of uncertain mappings in a PDMS.

In Chapter 3 we consider the problem of finding optimal semantic topologies in an offline setting. This problem interests us as a baseline for evaluating online self-organizing topologies. We focus on two types of optimal topologies: (1) topologies that maximize the selfish interest of each peer and (2) topologies that maximize global welfare.

Chapter 4 deals with dynamic self-organization of semantic topologies. We introduce two related problems: the acquaintance problem of identifying semantically related peers, and the replacement problem of local neighbor selection preferences. We suggest several lightweight heuristic algorithms for each problem and analyze the differing characteristics of each algorithm.

Chapter 5 describes a simulation we constructed to examine our model empirically. We implemented our suggested acquaintance and replacement algorithms, as well as other algorithms taken from the context of file-sharing P2P systems. We ran experiments using various combinations of acquaintance and replacement algorithms and compared their results.

In Chapter 6 we summarize our conclusions from this work and our suggestions for future work.

1.3 Contributions

The main contribution of this work is the definition of a model for the evaluation of semantic topologies in a PDMS with uncertain schema mappings, and a framework for the self-organization of such topologies. In detail, our contributions include:

• Definition of semantic topologies and evaluation measures for their quality. We suggest different measures for representing peers' self-interest and the network's global welfare.

• Presentation of an algorithm for finding optimal semantic topologies maximizing peers' self-interest in an offline setting.

• Provision of a proof that the problem of finding optimal topologies maximizing global welfare in an offline setting is NP-Complete.

• Demonstration through empirical analysis that optimal topologies maximizing global welfare cannot be reached by means of self-organization algorithms applied autonomously by individual peers, but rather require collaborative algorithms.

Chapter 2

Model Definition

In this chapter we present a generic model for PDMSs that will be used throughout the rest of this thesis. Our model, which partially relies on the model of [6], consists of a data model (Section 2.1), describing the local databases of the peers; a network model (Section 2.2), outlining the semantic relations and the organization of the peers; and a matching model (Section 2.3), describing the structure and characteristics of the semantic connections between the peers and their relation to the network structure (topology). In the final part of this chapter (Section 2.4), we present measures for the evaluation of semantic topologies in the context of a PDMS. The novelty of our model is a formal representation of the uncertainty of schema mappings and its impact on the quality of queries in a PDMS. We demonstrate the influence of the choice of semantic topology on this quality.


2.1 The Data Model

We model each information system as a peer p ∈ P . A peer stores data in a database

DBp according to a structured schema Sp taken from a global set of schemata S. As we

wish to present an approach as generic as possible, we do not make any assumptions

on the exact data model used by the databases in the following. We only require that the
schemata store information using attributes, where each attribute A ∈ Sp may be an
attribute in a relational schema, an element or an attribute in XML, or a class or a
property in RDF.

Each local attribute is assigned a set of fixed interpretations $A^I$ from an
abstract and global domain of interpretations $\Delta^I$, with $A^I \in \Delta^I$. Arbitrary peers are
not aware of such assignments. We say that two attributes $A_i$ and $A_j$ are equivalent,
and write $A_i \equiv A_j$, if and only if $A_i^I = A_j^I$. Even if equivalent attributes theoretically
have the same extensions, some tuples might be missing in practice (open-world
assumption), i.e., $DB_{p_i}$ is not always equivalent to $DB_{p_j}$ even if $p_i$ and $p_j$ share

identical or equivalent schemata. Those sets of interpretations are used to ground the

semantics of the various attributes in the PDMS from an external and human-centered

point of view.

Attributes may have complex data types and NULL-values are possible. We do

not consider more sophisticated data models to avoid diluting the discussion of the

main ideas through technicalities related to mastering complex data models. More-

over, many practical applications, in particular in P2P systems, digital libraries or

scientific databases, use exactly the type of data model we have introduced, at least

at the meta-data level. A query language for querying and transforming databases

(e.g. SQL, XQuery or SPARQL) builds on basic relational algebra operators (e.g.


Projection, Selection and Renaming). We write q(Si) = {Aj|Aj ∈ Si} to denote a

query formulated in terms of a particular schema Si. Each peer p is associated with

a set of queries Qp, where the frequency of issuing query q is denoted by λq.

2.2 The Network Model

Let us now consider a (potentially big) set of peers P with their related schemata

and data. We assume that a peer p ∈ P can be identified by a unique identifier

IDp (e.g., an IP address or a peer ID in a P2P network). Each peer has a basic

communication mechanism that allows it to establish connection to other peers. We

assume in the following that it is based on an unstructured P2P access structure a

la Gnutella. Thus, peers send ping messages with a certain Time-To-Live value and

receive pong messages in order to learn about the network structure. Extending the

Gnutella protocol, a peer also sends its schema Sp as part of a pong message.

2.2.1 Schema Mappings

Peers can define schema mappings $M_{S_i \to S_j}$ between a source schema $S_i$ and a target
schema $S_j$. Such mappings can be created manually, semi-automatically, or fully automatically,
depending on the peers and the setting. A schema mapping $M_{S_i \to S_j}$ allows the reformulation
of a query over $S_i$ into a new query over the target schema $S_j$. Schema mappings
can be expressed in a variety of ways; in our case, following [22], we consider a schema
mapping $M_{S_i \to S_j}$ that is given as a set of attribute mappings $m_{A_i \to A_j}$ between source
schema $S_i$ and target schema $S_j$:

$$M_{S_i \to S_j} = \{\, m_{A_i \to A_j} \mid A_i \in S_i,\ A_j \in S_j \,\} \tag{2.1}$$


where source attributes Ai ∈ Si are mapped into target attributes Aj ∈ Sj. A

mapping defines a surjective operation from the set of target attributes onto the set

of source attributes, where source attributes that do not appear in the mappings are

mapped by an implicit attribute mapping onto a null value. Using schema mapping
$M_{S_i \to S_j}$ we can reformulate a source query $q(S_i)$ into a target query $q(S_j)$ using only
attributes from $S_j$:

$$q(S_j) \equiv M_{S_i \to S_j}(q(S_i)) \tag{2.2}$$
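The reformulation step in Equation 2.2 can be illustrated with a minimal sketch. It assumes, purely for illustration, that a query is a list of attribute names and that a schema mapping is a dictionary from source attributes to target attributes; the function name `reformulate` and the variable names are hypothetical, and the attribute names are borrowed from the examples later in this chapter.

```python
from typing import Dict, List, Optional

# Attribute mapping from a source schema to a target schema, modelled as a
# plain dict: source attribute name -> target attribute name (an assumption).
AttributeMapping = Dict[str, str]


def reformulate(query: List[Optional[str]],
                mapping: AttributeMapping) -> List[Optional[str]]:
    """Rewrite a query (a list of attributes) in terms of the target schema.

    Attributes without an explicit mapping are sent to None, which stands in
    for the implicit attribute mapping onto a null value.
    """
    return [mapping.get(attr) if attr is not None else None for attr in query]


# Hypothetical mappings; attribute names follow Examples 1 and 2.
m_s1_s2 = {"A1": "B2", "B1": "A2", "C1": "C2"}
m_s2_s3 = {"A2": "A3", "B2": "B3"}  # C2 has no counterpart in S3

q_s1 = ["A1", "B1", "C1"]
q_s2 = reformulate(q_s1, m_s1_s2)  # the query expressed over S2
q_s3 = reformulate(q_s2, m_s2_s3)  # the query expressed over S3, via S2
print(q_s2, q_s3)
```

Applying `reformulate` twice corresponds to chaining two schema mappings; an attribute that is lost at one hop stays null for the rest of the chain.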

Figure 2.1: A query reformulation example.

Figure 2.1, taken from [17], gives an example of query reformulation in an XML/XQuery
context. Query $q_i$ is reformulated into query $q_j$ using the mapping $M_{p_i \to p_j}$, composed
of seven attribute mappings that map target attributes onto source attributes.

Figure 2.2 describes our proposed PDMS model. Attributes from different domains
are spread across independent peers' schemata. Links between schemata represent
schema mappings (to be described in detail in Section 2.3). Additionally, each
peer's query set contains only attributes from its own schema. Numbers on the links
represent mapping accuracies, to be discussed later in Section 2.3.

Figure 2.2: PDMS model description.

2.2.2 Query Dissemination

Queries are disseminated in the PDMS network in an unstructured and collaborative

way (see Chapter 1). A peer receiving a reformulated query may decide to reformulate

it in turn for further dissemination. Thus, queries can be reformulated several times

iteratively:

$$q(S_N) \equiv M_{S_{N-1} \to S_N}(M_{S_{N-2} \to S_{N-1}} \cdots (M_{S_1 \to S_2}(q(S_1)))) \tag{2.3}$$

This way, queries might traverse several peers through a succession of schema mappings.
Figure 2.3 shows an example of a semantic network graph $G(P, M)$, where
nodes represent peers, and directed edges represent schema mappings created by individual
peers and used to reformulate queries. Note that a pair of nodes can be
related through opposite directed edges, whenever two peers are cross-linked.

Figure 2.3: A semantic network graph, where peers' schemata are interlinked by schema mappings provided by the peers.

Queries can be propagated through the semantic network in various ways, depend-

ing on the query forwarding paradigm in use. Forwarding a query irrespective of its

content throughout the network is highly inefficient. In addition to undesired network
flooding, a query may be forwarded through many inaccurate mappings,
which results in retrieving many irrelevant results (low precision). The TTL mechanism,
common in peer-to-peer networks, mitigates this problem by limiting the broadcast
of queries to a certain radius. However, TTL in a PDMS network causes

a low recall [57] as the system cannot reach all the databases relevant to the query.

Hence, a desired semantic network structure is organized in such a way that every

peer is connected within a small radius to other peers most related to it.


2.2.3 Semantic Topology

Continuing our discussion from Section 2.2.2, consider a semantic network graph

where each peer is mapped against all other peers’ schemata (clique). Potentially,

a query can follow all possible reformulations reaching all peers, thus yielding every

possible answer. However, this architecture does not scale to large networks, as queries
will flood the network and yield redundant and inaccurate answers. In addition,
matching and mapping maintenance on such a large scale consume both time and
storage.

In what follows, we assume a semantic network topology $T(P, M)$ where every
peer $p_i$ maintains a list of neighbors $N(p_i)$ and, for each member $p_j$ of $N(p_i)$, a
mapping $M_{S_i \to S_j}$. Each peer $p$ has a bound $K_p$ on its number of neighbors, according
to its communication and storage capacity. Our network model fits nicely with typical
network models in the context of peer-to-peer networks, such as power-law networks
[25] and small-world networks [18], that suggest short average path lengths between
peers and a limited number of neighbors distributed according to some power law.

Figure 2.4 presents a visualization of the semantic network separated into layers.

The lowermost layer represents the network topology with links between the peers

representing schemata mappings, limited by fixed number of mappings per peer. The

upper layers represent query translation graphs. Each layer represents a single query

translation between all peers. Query translations and their accuracies (numbers on

the edges) will be discussed in detail in Section 2.3.

The suggested topology can be dynamic. New peers can be discovered by means of
random ping messages as well as through answers to query propagation. By matching
against new peers, a peer can add neighbors or replace existing ones (if $K_p$ would be
exceeded), thus possibly improving its ability to obtain answers to queries. In the
following we introduce mapping oriented techniques for discovery and replacement of
semantic neighbors in a PDMS setting.

Figure 2.4: Semantic Network Model: Query translation layers and a topology with a limit of $K_p = 2$ neighbors.

2.3 The Matching Model

The query reformulation mechanism is based on the assumption that the schema

mappings are semantically correct [66, 7], i.e., accurate, which might not be the case

for various reasons. As PDMSs target large scale, decentralized, and heterogeneous

environments where autonomous parties have full control over the design of the local

schemata, it is not always possible to create correct mappings between schemata.

In many situations, an approximate mapping relating two similar but semantically

slightly divergent concepts might be more beneficial than no mapping at all. Also,


given the vibrant activity in the area of (semi) automatic schemata alignment [24],

we can expect some (most?) of the mappings to be generated automatically in large-

scale settings, with all the associated issues in terms of quality. In this section we

model schema mapping uncertainty and its amplification over transitive mappings

in the context of PDMS query reformulation. We define an estimation measure for

mapping quality, namely matching accuracy, extended to the setting of PDMS where

chained transitive mappings are used, in the form of accuracy preservation.

2.3.1 Mapping Accuracy

As introduced earlier (see Chapter 1), automatic matching may carry with it a degree

of uncertainty, as it is based on syntactic, rather than semantic, means. We intro-

duce the notion of mapping accuracy to characterize the confidence of the mappings

connecting semantically related schemata. We adopt the proposed model of [30], uti-

lizing a fuzzy framework to model the uncertainty of the matching process outcome.

Given a mapping $m_{A_i \to A_j}$ between two attributes, we associate a confidence measure
$\mu_m$, normalized between 0 and 1, to specify our belief in the mapping quality. We assume
that a manual matching is a perfect process, resulting in a crisp matching, with
confidence measure of 1.¹ As for automatic matching, a hybrid of algorithms, such

as presented in [20, 38, 27], or adaptation of relevant work in proximity queries (e.g.,

[69, 67]) and query rewriting over mismatched domains (e.g., [44, 43]) can determine

the level of this attribute mapping accuracy estimator.

¹ This is, obviously, not always the case. In the absence of sufficient background information, human observers are bound to err as well. However, since our methodology is based on comparing machine-generated mappings with a mapping as conceived by a human expert, and the latter is based on human interpretation, we keep this assumption.


Identifying a confidence measure in and of itself is insufficient for matching pur-

poses. One may claim, and justly so, that the use of syntactic means to identify

semantic equivalence may be misleading in that a mapping with a high confidence

measure can be less precise, as conceived by an expert, than a mapping with a lower

confidence measure. In this work, we assume the use of monotonic automatic seman-

tic reconciliation algorithms, where one can safely interpret a resulting mapping with

a high confidence measure as a good semantic mapping. Therefore, high mapping

accuracy suggests (but does not guarantee) a sound mapping that will produce rel-

evant results for queries. Low accuracy on the other hand, implies a mapping with

low confidence level that will most likely produce some irrelevant results.

Suppose we have a schema mapping $M_{S_i \to S_j}$ from $p_i$ to $p_j$, composed of a set
of attribute mappings between the corresponding schemata $S_i$ and $S_j$. The schema
mapping confidence measure is a compound confidence measure that we calculate
from the attribute mappings' confidence measures. In our work, it is computed as
their average, following works such as [29]:

$$\mu_{M_{S_i \to S_j}} = \frac{1}{|M_{S_i \to S_j}|} \sum_{m \in M_{S_i \to S_j}} \mu_m \tag{2.4}$$

We calculate query translation accuracy in a similar manner, averaging the
mapping accuracies of the attributes participating in the query:

$$\mu_{q(S_j)} = \mu_{M_{S_i \to S_j}(q(S_i))} = \frac{1}{|q|} \sum_{m \in M_{S_i \to S_j}(q(S_i))} \mu_m \tag{2.5}$$

We use query translation accuracy rather than entire schema mapping accuracy
to evaluate the benefit of a peer's neighbors from a practical point of view, i.e., how
well a neighbor can translate the queries issued by the peer. To evaluate accuracy over
a set of queries, we use a weighted average with the queries' appearance frequencies as weights:

$$\mu_{Q_{p_i}} = \frac{1}{|Q_{p_i}|} \sum_{q \in Q_{p_i}} \lambda_q * \mu_q \tag{2.6}$$

Figure 2.5: An example of mapping accuracy.

Example 1 We illustrate the compound mapping accuracy via an example. Assume

a peer p1 is connected to peer p2 and p3 as illustrated in Figure 2.5. Schema mappings

in Figure 2.5 are defined in terms of pairwise directed bipartite graphs whose nodes

represent schema attributes and whose edges represent attribute mappings. Attribute

mapping accuracies are given as edge weights. First, we calculate the schema mapping

accuracies of the matching between p1 and its two neighbors:

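$$\mu_{M_{S_1 \to S_2}} = \frac{1}{3} * (\mu_{m_{A_1 \to B_2}} + \mu_{m_{B_1 \to A_2}} + \mu_{m_{C_1 \to C_2}}) = \frac{1}{3} * (0.2 + 0.3 + 0.5) \cong 0.333 \tag{2.7}$$

$$\mu_{M_{S_1 \to S_3}} = \frac{1}{3} * (\mu_{m_{A_1 \to A_3}} + \mu_{m_{B_1 \to B_3}} + \mu_{m_{C_1 \to \mathit{Null}}}) = \frac{1}{3} * (0.9 + 0.8 + 0.0) \cong 0.566 \tag{2.8}$$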

We note that p3 has higher schema mapping accuracy, making it a preferred can-

didate neighbor for p1.

$p_1$ sends a query to its neighbors:

$$q_1(S_1) = \pi_{A_1, B_1}(S_1) \tag{2.9}$$

It evaluates as follows against $p_2$ and $p_3$:

$$q_1(S_2) = \pi_{m_{A_1 \to B_2},\, m_{B_1 \to A_2}}(S_2) \tag{2.10}$$

$$q_1(S_3) = \pi_{m_{A_1 \to A_3},\, m_{B_1 \to B_3}}(S_3) \tag{2.11}$$

with the corresponding translation accuracies:

$$\mu_{q_1(S_2)} = \frac{1}{2} * (\mu_{m_{A_1 \to B_2}} + \mu_{m_{B_1 \to A_2}}) = \frac{1}{2} * (0.2 + 0.3) = 0.25 \tag{2.12}$$

$$\mu_{q_1(S_3)} = \frac{1}{2} * (\mu_{m_{A_1 \to A_3}} + \mu_{m_{B_1 \to B_3}}) = \frac{1}{2} * (0.9 + 0.8) = 0.85 \tag{2.13}$$

The query translation accuracies support our assumption that $p_3$ is the preferred
neighbor.

$p_1$ issues another query to its neighbors:

$$q_2(S_1) = \pi_{C_1}(S_1) \tag{2.14}$$

evaluating as follows:

$$q_2(S_2) = \pi_{m_{C_1 \to C_2}}(S_2) \tag{2.15}$$

$$q_2(S_3) = \pi_{m_{C_1 \to \mathit{Null}}}(S_3) \tag{2.16}$$

with corresponding translation accuracies:

$$\mu_{q_2(S_2)} = \mu_{m_{C_1 \to C_2}} = 0.5 \tag{2.17}$$

$$\mu_{q_2(S_3)} = \mu_{m_{C_1 \to \mathit{Null}}} = 0.0 \tag{2.18}$$

Despite its higher schema mapping accuracy, $p_3$ is the less preferred neighbor for $q_2$, thus
demonstrating that neighbor preference is query-dependent. Assuming that both $q_1$ and $q_2$
are issued with the same frequency, they have equal weights for $p_1$ and the total
query translation accuracies are:

$$\mu_{Q_{p_1}}(S_2) = \frac{1}{2} * (\mu_{q_1(S_2)} + \mu_{q_2(S_2)}) = 0.375 \tag{2.19}$$

$$\mu_{Q_{p_1}}(S_3) = \frac{1}{2} * (\mu_{q_1(S_3)} + \mu_{q_2(S_3)}) = 0.425 \tag{2.20}$$

Therefore p3 is preferred over p2 considering both queries. Note that for different

weights of query importance, i.e. if q2 was issued much more frequently than q1, the

result might have been opposite and p2 would have been the preferred neighbor.
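The calculations of Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute mapping confidences are stored in dictionaries keyed by the source attribute, a query is a list of source attributes, and the function names (`schema_accuracy`, `query_accuracy`, `query_set_accuracy`) are hypothetical helpers that follow Equations 2.4 to 2.6.

```python
from typing import Dict, List

Confidences = Dict[str, float]  # source attribute -> confidence of its mapping

# Attribute mapping confidences of Example 1; an attribute mapped onto a null
# value carries confidence 0.0.
mu_s1_s2: Confidences = {"A1": 0.2, "B1": 0.3, "C1": 0.5}
mu_s1_s3: Confidences = {"A1": 0.9, "B1": 0.8, "C1": 0.0}


def schema_accuracy(mu: Confidences) -> float:
    """Average over all attribute mappings (Equation 2.4)."""
    return sum(mu.values()) / len(mu)


def query_accuracy(query: List[str], mu: Confidences) -> float:
    """Average over the attributes that participate in the query (Equation 2.5)."""
    return sum(mu[attr] for attr in query) / len(query)


def query_set_accuracy(queries: List[List[str]], freqs: List[float],
                       mu: Confidences) -> float:
    """Frequency-weighted sum divided by the number of queries (Equation 2.6)."""
    weighted = sum(f * query_accuracy(q, mu) for q, f in zip(queries, freqs))
    return weighted / len(queries)


q1, q2 = ["A1", "B1"], ["C1"]
for name, mu in (("p2", mu_s1_s2), ("p3", mu_s1_s3)):
    print(name,
          round(schema_accuracy(mu), 3),
          round(query_accuracy(q1, mu), 3),
          round(query_accuracy(q2, mu), 3),
          round(query_set_accuracy([q1, q2], [1.0, 1.0], mu), 3))
```

Changing the frequency list, for instance to `[1.0, 5.0]`, shows how a more frequent $q_2$ shifts the preference from $p_3$ to $p_2$.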

2.3.2 Mapping Accuracy Preservation

Being a decentralized environment, query reformulation in PDMS relies on the abil-

ity to evaluate transitive mappings among peers’ schemata. When a query is posed

over the schema of a peer, the network will utilize data from any peer that is transi-

tively connected by schema mappings, by chaining mappings. Recall that automatic

semantic matching between two schemata may involve a degree of uncertainty. For
transitive chained mappings, this degree of uncertainty may be amplified, since the
uncertainty of each translation in the composition affects the accuracy of the
translations that follow it, resulting in mapping accuracy decay.

We introduce the notion of mapping accuracy preservation to characterize the
confidence in (chained) mappings (transitively) connecting semantically related
schemata. Consider a path of transitively connected peers $p_1 \ldots p_N$ composed of a
sequence of schema mappings between the corresponding schemata $S_1 \ldots S_N$:

$$Path_{S_1 \to S_N} = M_{S_{N-1} \to S_N}(\ldots(M_{S_1 \to S_2})) \tag{2.21}$$

We associate a confidence measure $\alpha_{Path_{S_1 \to S_N}}$, normalized between 0 and 1, to specify
our belief in the mapping chain quality.

We assume that a chain of mappings resulting from manual matchings will maintain
the perfect confidence measure of 1, while a chain of mappings that contains
even one completely imperfect matching will have the lowest confidence measure of 0.
Furthermore, we assume the mapping accuracy preservation measure for a chain of
transitive mappings to be bounded from below by the mapping accuracy of the least
accurate direct mapping, and from above by that of the most accurate direct mapping.
Two other desired characteristics of this confidence measure are commutativity and
monotonicity. Commutativity means that two mapping chains with the same number of
mappings, whose mappings can be paired so that each pair has equal accuracy, have
equal preservation measures. Monotonicity means that if, for every such pair, the
mapping from the first chain has higher accuracy than the mapping from the second
chain, then the first chain has a higher preservation measure.

The mapping accuracy preservation measure $\alpha$ for a chain of matchings is a function
we calculate using the mapping accuracies of the neighboring schemata in the chain.
Natural candidates for $\alpha$ are functions from the family of triangular norms
(e.g., minimum, product), extended to multiple arguments using their associativity
property. We refer the interested reader to [23] for an exhaustive treatment
of triangular norms. In our work, the preservation of chained matchings is
computed using the product function:

$$\alpha_{Path_{S_1 \to \ldots \to S_N}} = \alpha_{M_{S_{N-1} \to S_N}(\ldots(M_{S_1 \to S_2}))} = \prod_{M_{S_i \to S_j} \in Path_{S_1 \to S_N}} \mu_{M_{S_i \to S_j}} \tag{2.22}$$

Under the monotonicity assumption, high preservation suggests a sequence of sound

mappings, while low preservation implies either a path containing one or more inac-

curate mappings that spoil the entire path accuracy or an accuracy decay along a

chain of non-perfect mappings.

Similarly to a schema mapping path, we define a query reformulation path as a
sequence of transitive query translations:

$$Path_{q(S_N)} = M_{S_{N-1} \to S_N}(M_{S_{N-2} \to S_{N-1}} \cdots (M_{S_1 \to S_2}(q(S_1)))) \tag{2.23}$$


Query reformulation path preservation is calculated like schema mapping preservation,
using the product function over the accuracy measurements of the transitive query
translations:

$$\alpha_{Path_{q(S_N)}} = \alpha_{M_{S_{N-1} \to S_N}(M_{S_{N-2} \to S_{N-1}} \cdots (M_{S_1 \to S_2}(q(S_1))))} = \prod_{q(S_i) \in Path_{q(S_N)}} \mu_{M_{S_{i-1} \to S_i}(q(S_{i-1}))} \tag{2.24}$$

Similarly to the accuracy measure calculation, we use query translation accuracy rather
than entire schema mapping accuracy. Preservation over a set of queries is calculated
using a weighted average with the queries' appearance frequencies as weights:

$$\alpha_{Path_{Q_{p_i}}} = \frac{1}{|Q_{p_i}|} \sum_{q \in Q_{p_i}} \lambda_q * \alpha_{Path_q} \tag{2.25}$$

Figure 2.6: An example of mapping preservation.

Example 2 We continue with Example 1 above and illustrate the mapping accuracy

preservation calculation. Given the schema mappings depicted in Figure 2.5, assume

an additional connection between p2 and p3 as illustrated in Figure 2.6. First, we


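calculate schema mapping accuracy of the additional matching between $p_2$ and $p_3$:

$$\mu_{M_{S_2 \to S_3}} = \frac{1}{3} * (\mu_{m_{A_2 \to A_3}} + \mu_{m_{B_2 \to B_3}} + \mu_{m_{C_2 \to \mathit{Null}}}) = \frac{1}{3} * (0.9 + 0.8 + 0.0) \cong 0.566 \tag{2.26}$$

We can now calculate preservation over the transitive connection of $p_1 \to p_2 \to p_3$:

$$\alpha_{Path_{S_1 \to S_2 \to S_3}} = \mu_{M_{S_2 \to S_3}} * \mu_{M_{S_1 \to S_2}} \cong 0.566 * 0.333 \cong 0.188 \tag{2.27}$$

Note that the preservation of the direct connection between $p_1$ and $p_3$ equals the accuracy
of this matching:

$$\alpha_{Path_{S_1 \to S_3}} = \mu_{M_{S_1 \to S_3}} \cong 0.566 \tag{2.28}$$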

and we see that direct matching is preferred over transitive matching. We further
calculate the preservation of $q_1$ through the transitive mappings path. We start with
the transitive translation of $q_1$ from $p_2$ to $p_3$:

$$q_1(S_3) = M_{S_2 \to S_3}(M_{S_1 \to S_2}(q_1(S_1))) = \pi_{m_{B_2 \to B_3},\, m_{A_2 \to A_3}}(S_3) \tag{2.29}$$

and the corresponding query accuracy calculation:

$$\mu_{q_1(S_3)} = \frac{1}{2} * (\mu_{m_{B_2 \to B_3}} + \mu_{m_{A_2 \to A_3}}) = \frac{1}{2} * (0.8 + 0.9) = 0.85 \tag{2.30}$$

We can now calculate the query path preservation:

$$\alpha_{Path_{q_1(S_3)}} = \alpha_{M_{S_2 \to S_3}(M_{S_1 \to S_2}(q_1(S_1)))} = \mu_{M_{S_2 \to S_3}(q_1(S_2))} * \mu_{M_{S_1 \to S_2}(q_1(S_1))} = 0.85 * 0.25 \cong 0.21 \tag{2.31}$$

As peers perform query reformulations during the propagation process, accuracies

may be calculated on the fly. Calculated accuracies passed along reformulation path

can be used to incrementally calculate accuracy preservation, which in turn can serve

as an indicator in a quality feedback mechanism. We demonstrate such a mechanism

in Example 3.
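As a small illustration of the preservation measure, the sketch below multiplies the per-hop accuracies of Example 2, using the rounded values printed in Equations 2.27 and 2.31. The minimum function is included only for comparison, as the other triangular norm named above; the function names are hypothetical.

```python
from math import prod
from typing import Sequence


def preservation_product(accuracies: Sequence[float]) -> float:
    """Product t-norm over the accuracies along a path (the choice made in this work)."""
    return prod(accuracies)


def preservation_minimum(accuracies: Sequence[float]) -> float:
    """Minimum t-norm, shown only for comparison."""
    return min(accuracies)


# Per-hop accuracies along p1 -> p2 -> p3 in Example 2 (rounded as in the text).
schema_path = [0.333, 0.566]  # schema mapping accuracies
query_path = [0.25, 0.85]     # translation accuracies of q1

print(round(preservation_product(schema_path), 3))
print(round(preservation_product(query_path), 4))
print(preservation_minimum(query_path))
```

With the product, every imperfect hop lowers the result, so longer chains decay; with the minimum, only the weakest hop matters.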


Figure 2.7: An example of a query reformulation graph.

Example 3 Figure 2.7 shows a query reformulation graph for query q issued by p1.

Directed edges represent query translations from peer to peer, and weights represent

translation accuracies. The dashed line is not part of the query graph, but rather a

virtual mapping, representing the direct translation accuracy of q between p1 and p7,

not known to p1. Peer p1 calculates the translation of q using its mappings to p3 and

propagates the translated query along with its calculated preservation measure:

$$\alpha_{Path_{q(S_3)}} = \mu_{M_{S_1 \to S_3}(q(S_1))} = 0.8 \tag{2.32}$$

Peer $p_3$, in turn, further translates $q(S_3)$ using its mappings to $p_4$, calculates the accumulated
preservation using the received preservation of $q(S_3)$ and the newly translated
query accuracy, and propagates the result to $p_4$:

$$\alpha_{Path_{q(S_4)}} = \alpha_{Path_{q(S_3)}} * \mu_{q(S_4)} = 0.8 * 1.0 = 0.8 \tag{2.33}$$

In a similar manner, $p_4$ translates and propagates $q(S_4)$ to $p_7$:

$$\alpha_{Path_{q(S_7)}} = \alpha_{Path_{q(S_4)}} * \mu_{q(S_7)} = 0.8 * 0.6 = 0.48 \tag{2.34}$$


Now note that if $p_1$ and $p_7$ were directly connected, i.e., $p_1$ had matched against $p_7$'s
schema and added it to its neighbor list, the query translation accuracy preservation
of $q$ from $p_1$ to $p_7$ would have been:

$$\alpha_{Path_{q(S_7)}} = \mu_{M_{S_1 \to S_7}(q(S_1))} = 0.8 \tag{2.35}$$

The direct mapping preservation is higher than the preserved accuracy calculated
over the transitive path. We conclude that $p_1$ is better off with $p_7$ as a neighbor than
with the transitive connection through other peers.
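The on-the-fly calculation of Example 3 can be sketched as follows. The message format is an assumption made for illustration only: each forwarded query carries its accumulated preservation and the path travelled so far, and `QueryMessage` and `forward` are hypothetical names, not part of any protocol described in this work.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QueryMessage:
    """Hypothetical message: a query plus its accumulated preservation."""
    query: str
    preservation: float = 1.0
    path: Tuple[str, ...] = ()


def forward(msg: QueryMessage, next_peer: str,
            translation_accuracy: float) -> QueryMessage:
    """Translate for the next peer and fold the new accuracy into the product."""
    return QueryMessage(msg.query,
                        msg.preservation * translation_accuracy,
                        msg.path + (next_peer,))


# Hops and translation accuracies of Example 3: p1 -> p3 -> p4 -> p7.
msg = QueryMessage("q", path=("p1",))
for peer, accuracy in (("p3", 0.8), ("p4", 1.0), ("p7", 0.6)):
    msg = forward(msg, peer, accuracy)

direct_accuracy = 0.8  # virtual direct mapping from p1 to p7 in Example 3
prefer_direct = direct_accuracy > msg.preservation
print(msg.path, round(msg.preservation, 2), prefer_direct)
```

A peer that receives such feedback can compare the accumulated value with the accuracy of a direct mapping, which is the comparison made at the end of Example 3.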

In what follows we assume, without loss of generality, that direct matching be-

tween peers is always better, i.e., more accurate, than having connection through

transitive mappings. Hence, peers naturally strive to shortcut mapping paths and

create direct mappings against other peers. Recall that in our model, peers have lim-

ited resources to devote to neighbors maintenance, so acquiring new neighbors may

be at the cost of existing ones. Regardless of any other considerations, peers will be

interested in choosing new neighbors that improve their queries reformulation quality.

2.4 Evaluation of Semantic Topologies

Considering a dynamic topology where peers periodically update their neighbors,

we need some measures for semantic topology evaluation in the context of mapping

accuracy. We present two approaches for semantic topology evaluation, namely self-

interest based and cooperative-interest based, each representing a topology evaluation

from a different point of view by setting a different “goodness” measure. Using these

measures, we are able to compare different topologies.


2.4.1 Self-Interest Based Topology Evaluation

As its name implies, self-interest based topology evaluation is a measure of
topology quality, in the context of mapping accuracy, from the narrow point of view
of a single peer. The basic assumption underlying this approach is that each peer acts as
an individual according to its self-centered interest [49]. In the decentralized setting
of a PDMS, a peer has no knowledge of other peers' mappings, nor can
it force other peers to create mapping links. Under these restrictions, peers may

choose to couple according to their best private knowledge, i.e., by generating a set of

neighbors that maximizes their direct benefit regardless of outside mappings between

other peers.

Let $p_i$ be a peer with a set of neighbors $N_{p_i}$ and a limit $K_{p_i}$ on its number of
neighbors. Given a set of queries $Q_{p_i}$ that $p_i$ issues, we calculate the peer self-interest value
($SIV$) measure as:

$$SIV_{p_i} = \frac{1}{K_{p_i}} \sum_{p_j \in N_{p_i}} \frac{1}{|Q_{p_i}|} \sum_{q \in Q_{p_i}} \lambda_q * \mu_{q(S_j)} \tag{2.36}$$

We use $\mu_{q(S_j)}$, which measures the translation accuracy of query $q$ from $p_i$ to $p_j$, and
multiply it by $\lambda_q$, which represents the importance of the query to $p_i$. We average over all
of $p_i$'s queries and get a weighted average of the peer's query set translation accuracy
against a single neighbor $p_j$. We sum this accuracy over all existing neighbors and
divide by the maximal number of neighbors $K_{p_i}$. Peers that are connected to as
many neighbors as they are allowed, and whose queries are translated with high accuracy
against those neighbors, receive a high $SIV_{p_i}$ value.

Example 4 We demonstrate SIV calculation using the following simple example.
Figure 2.8(a) presents a partial semantic network graph. For ease of presentation,
we consider a single query $q_{p_1}$. Edge weights represent the semantic query translation
accuracies of $q_{p_1}$. Assuming that each peer $p_i$ has a limit of $K_{p_i} = 2$ neighbors, Figures 2.8(b)
and 2.8(c) show two semantic topologies for 2.8(a) that are valid under the $K_{p_i}$
constraint.

Figure 2.8: An example of accuracy-oriented semantic topology evaluation.

We calculate $SIV_{p_1}$ for the first topology (Figure 2.8(b)) to be:

$$SIV_{p_1}(T_b) = \frac{1}{2} * (\mu_{q_{p_1}(S_3)} + \mu_{q_{p_1}(S_4)}) = \frac{1}{2} * (1.0 + 0.9) = 0.95 \tag{2.37}$$

and similarly, for the second topology (Figure 2.8(c)):

$$SIV_{p_1}(T_c) = \frac{1}{2} * (\mu_{q_{p_1}(S_3)} + \mu_{q_{p_1}(S_2)}) = \frac{1}{2} * (1.0 + 0.8) = 0.9 \tag{2.38}$$

p1, unaware of the connection between p2 and p4, would prefer the first topology with

N (p1) = {p3, p4} over the second topology with N (p1) = {p3, p2}.

An SIV measure of a topology is calculated by averaging over the peers in the network:

$$SIV_{T(P,M)} = \frac{1}{|P|} \sum_{p \in P} SIV_p \tag{2.39}$$

where $T(P, M)$ is a semantic topology $T$ with a set of peers $P$ and a set of query
translation mappings $M$.
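The SIV calculation of Example 4 can be sketched as below. The data layout is an assumption: each chosen neighbor is mapped to the list of translation accuracies of the peer's queries, in the same order as the frequency list. The names `siv_peer` and `siv_topology` are hypothetical and follow Equations 2.36 and 2.39.

```python
from typing import Dict, List


def siv_peer(neighbor_accuracies: Dict[str, List[float]],
             freqs: List[float], k: int) -> float:
    """Self-interest value of one peer (Equation 2.36).

    neighbor_accuracies maps each chosen neighbor to the translation
    accuracies of the peer's queries; k is the peer's neighbor limit.
    """
    if len(neighbor_accuracies) > k:
        raise ValueError("neighbor limit exceeded")
    total = 0.0
    for accuracies in neighbor_accuracies.values():
        total += sum(f * a for f, a in zip(freqs, accuracies)) / len(freqs)
    return total / k


def siv_topology(peer_values: List[float]) -> float:
    """Average of the per-peer values (Equation 2.39)."""
    return sum(peer_values) / len(peer_values)


# Example 4: a single query, so one accuracy per neighbor and one frequency.
siv_b = siv_peer({"p3": [1.0], "p4": [0.9]}, freqs=[1.0], k=2)
siv_c = siv_peer({"p3": [1.0], "p2": [0.8]}, freqs=[1.0], k=2)
print(round(siv_b, 2), round(siv_c, 2))
```

Because the sum is divided by the limit `k` rather than by the actual number of neighbors, a peer with unused neighbor slots scores lower than one that fills them with accurate mappings.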

2.4.2 Cooperative Interest Based Topology Evaluation

Cooperative-interest based topology evaluation represents a wider point of view of a

collaborative network of peers trying to achieve global welfare. This approach relies

on the assumption that peers are willing to cooperate in order to achieve a mutually

beneficial topology. Peers may choose to share their knowledge and act in cooperation

to create globally beneficial mappings.

Given a single query $q$ issued by peer $p_i$, we present an evaluation measure that
considers the entire semantic topological structure and calculates the cooperative-interest
value ($CIV$) as follows:

$$CIV_{q_{p_i}} = \min_{p_j \in P - p_i} \left\{ \max_{Path_{q(S_j)} \in T(P,M)} \left\{ \alpha_{Path_{q(S_j)}} \right\} \right\} \tag{2.40}$$

CIV evaluation measure relates to all the semantically connected peers in the topol-

ogy, thus reflecting their ability to form connections in such a way that would benefit

other peers as well. We measure $\alpha_{Path_{q(S_j)}}$, reflecting the query translation accuracy to
a (transitively) connected peer. There may be more than a single query translation
path for $q$ from $p_i$ to $p_j$ in the topology. We evaluate $p_j$'s value by the best translation
path to it, and thus we maximize over the paths between $p_i$ and $p_j$ in the topology.
We then choose the minimal among the peers' values, representing the peer that is worst
placed to answer $q$ under the given topology. Topologies with high preservation for the least
accurate translation path to any peer will receive a high $CIV$.


Example 5 Continuing our example from Section 2.4.1, we demonstrate CIV calculation
over the topologies depicted in Figure 2.8:

$$CIV_{q_{p_1}}(T_b) = \min_{p_2, p_3, p_4} \{\alpha_{Path_{q(S_2)}}, \alpha_{Path_{q(S_3)}}, \alpha_{Path_{q(S_4)}}\} = \min\{0, 1.0, 0.9\} = 0 \tag{2.41}$$

$$CIV_{q_{p_1}}(T_c) = \min_{p_2, p_3, p_4} \{\alpha_{Path_{q(S_2)}}, \alpha_{Path_{q(S_3)}}, \alpha_{Path_{q(S_4)}}\} = \min\{0.8, 1.0, 0.8 * 0.1\} = 0.08 \tag{2.42}$$

Unlike the self-interest based evaluation, the cooperative-interest based evaluation rates
the second topology higher than the first one, as $q$ can reach all peers using this
topology, whereas $p_2$ is unreachable in the first one.

We calculate the $CIV$ measure for a topology as follows:

$$CIV_{T(P,M)} = \frac{1}{|P|} \sum_{p \in P} \frac{1}{|Q_p|} \sum_{q \in Q_p} \lambda_{q_p} * CIV_{q_p} \tag{2.43}$$

We calculate weighted average over each peer’s query set using query appearance

frequencies as weights reflecting their importance. We then average the values over

all the peers in the network to get the topology value.

Average CIV measure

Although CIV is a good measure for evaluating semantic topologies from a
global welfare perspective, it cannot discriminate among bad topologies. Given two
topologies with semantic disconnections (some peers are unreachable), the CIV measure
assigns a value of 0 to both and cannot distinguish between them.
Therefore, we suggest an alternative CIV measure, to be used later for comparison in
our experiments. The average CIV measure evaluates a topology by averaging over
path accuracies rather than by taking the minimal path accuracy. Formally, the average
CIV measure is defined as:

$$CIV_{T(P,M)} = \frac{1}{|P|} \sum_{p_i \in P} \frac{1}{|Q_{p_i}|} \sum_{q \in Q_{p_i}} \lambda_{q_{p_i}} * \frac{1}{|P - p_i|} \sum_{p_j \in P - p_i} \max_{Path_{q(S_j)} \in T(P,M)} \left\{ \alpha_{Path_{q(S_j)}} \right\} \tag{2.44}$$

By averaging over the paths rather than considering the worst (minimal preservation)

path in the topology, we get a measure less sensitive to semantic disconnections.
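The two cooperative measures for a single query can be sketched as follows. Several assumptions are made for illustration: a topology is a dictionary of directed edges whose weights are the translation accuracies of the query on each hop, the best path is found with a Dijkstra-style search (valid here because accuracies lie in [0, 1], so a product never grows along a path), and unreachable peers receive preservation 0. The edge from p2 to p4 with accuracy 0.1 is inferred from the figures quoted in Example 5; its direction and its presence in both topologies are assumptions. All function names are hypothetical.

```python
import heapq
from typing import Dict, List

Topology = Dict[str, Dict[str, float]]  # peer -> {neighbor: translation accuracy}


def best_preservation(topology: Topology, source: str) -> Dict[str, float]:
    """Highest product of accuracies from source to every reachable peer."""
    best: Dict[str, float] = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap via negated preservation
    settled = set()
    while heap:
        neg_alpha, peer = heapq.heappop(heap)
        if peer in settled:
            continue
        settled.add(peer)
        for neighbor, accuracy in topology.get(peer, {}).items():
            candidate = -neg_alpha * accuracy
            if candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    return best


def civ_query(topology: Topology, source: str, peers: List[str]) -> float:
    """Worst best-path preservation over all other peers (Equation 2.40)."""
    best = best_preservation(topology, source)
    return min(best.get(p, 0.0) for p in peers if p != source)


def average_civ_query(topology: Topology, source: str, peers: List[str]) -> float:
    """Mean best-path preservation over all other peers (inner part of Equation 2.44)."""
    best = best_preservation(topology, source)
    others = [p for p in peers if p != source]
    return sum(best.get(p, 0.0) for p in others) / len(others)


peers = ["p1", "p2", "p3", "p4"]
t_b = {"p1": {"p3": 1.0, "p4": 0.9}, "p2": {"p4": 0.1}}
t_c = {"p1": {"p3": 1.0, "p2": 0.8}, "p2": {"p4": 0.1}}
for name, topo in (("T_b", t_b), ("T_c", t_c)):
    print(name, civ_query(topo, "p1", peers), average_civ_query(topo, "p1", peers))
```

The minimum-based value drops to 0 as soon as one peer is unreachable, while the averaged value still reflects how well the remaining peers are reached.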

Chapter 3

On Optimal Semantic Topologies

In Chapter 2 we introduced a model for PDMS, a decentralized network of indepen-

dent peers sharing information through semantic relations, where no peer obtains

complete network structure knowledge. We also presented the influence of the se-

mantic network structure (topology) on the quality of query reformulation. In this

chapter, we assume a centralized setting with complete network knowledge and try

to find optimal semantic topologies. This problem interests us as a baseline for eval-

uating online self organizing topologies. We divide this chapter according to the two

evaluation measures introduced in Chapter 2: in Section 3.1 we present an algorithm

for finding optimal self-interest based topologies, and in Section 3.2 we show that

the problem of finding optimal cooperative-interest based topologies is NP-Complete

even for a very simple case.


3.1 Optimal Self-Interest Based Topologies

Recall that self-interest based topology evaluation is a measurement representing the

selfish nature of peers in the network. It reflects the fact that each peer strives to use

its private knowledge in order to obtain a set of neighbors with the highest possible

schema compatibility. Using the self-interest evaluation measure (SIV ) presented in

Chapter 2, we give a formal representation of the optimal self interest based topology

problem:

Optimal Self Interest Topology Problem

Find a semantic topology T (P ,M) such that:

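$$\max_{T(P,M) \subseteq G(P,M)} \left[ \frac{1}{|P|} \sum_{p_i \in P} \frac{1}{K_{p_i}} \sum_{p_j \in N_{p_i}} \frac{1}{|Q_{p_i}|} \sum_{q \in Q_{p_i}} \lambda_q * \mu_{q(S_j)} \right] \quad \text{s.t.} \quad \forall p_i \in P,\ |N(p_i)| \le K_{p_i} \tag{3.1}$$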

Given a complete semantic network graph (clique), the objective of the problem is

finding a topology (subgraph of G) that maximizes SIV subject to the peers neighbors

limit constraints.

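Recall that the SIV measure is calculated by averaging over all the peers in the
network, and the measure for a single peer (\(SIV_{p_i}\)) is:

\[
SIV_{p_i} = \frac{1}{K_{p_i}} \sum_{p_j \in \mathcal{N}_{p_i}} \frac{1}{|Q_{p_i}|} \sum_{q \in Q_{p_i}} \lambda_q \ast \mu_q(S_j)
\tag{3.2}
\]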

We use the characteristic of the SIV measure that each of its averaged elements \(SIV_{p_i}\) is
calculated using only the direct mappings of \(p_i\) to its neighbors, and is independent
of the neighbor set selection of other peers. Since each such element does not depend
on the other averaged elements, the problem is separable in the peers, i.e., we can
maximize each element separately and average later. Therefore, we can write the goal
function as:

\[
\frac{1}{|\mathcal{P}|} \sum_{p_i \in \mathcal{P}} \max_{T(\mathcal{P},\mathcal{M}) \subseteq G(\mathcal{P},\mathcal{M})}
\frac{1}{K_{p_i}} \sum_{p_j \in \mathcal{N}_{p_i}} \frac{1}{|Q_{p_i}|} \sum_{q \in Q_{p_i}} \lambda_q \ast \mu_q(S_j)
\tag{3.3}
\]

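And taking the constants out of the maximization expression we get:

\[
\frac{1}{|\mathcal{P}|} \sum_{p_i \in \mathcal{P}} \frac{1}{K_{p_i}} \ast \frac{1}{|Q_{p_i}|} \ast
\max_{T(\mathcal{P},\mathcal{M}) \subseteq G(\mathcal{P},\mathcal{M})}
\sum_{p_j \in \mathcal{N}_{p_i}} \sum_{q \in Q_{p_i}} \lambda_q \ast \mu_q(S_j)
\tag{3.4}
\]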

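Therefore we can find an optimal self-interest based topology by solving the following
subproblem for each \(p_i \in \mathcal{P}\):

Find a set of neighbors \(\mathcal{N}_{p_i}\) such that:

\[
\max_{\mathcal{N}_{p_i} \subseteq \mathcal{P}}
\left[ \sum_{p_j \in \mathcal{N}_{p_i}} \sum_{q \in Q_{p_i}} \lambda_q \ast \mu_q(S_j) \right]
\quad \text{s.t.} \quad |\mathcal{N}(p_i)| \le K_{p_i}
\tag{3.5}
\]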

Then we can compose links according to resulting neighbor lists for all peers, to get

an optimal self-interest based topology. Following the analysis above we suggest a

formal representation of a simple algorithm for finding such topology.

Algorithm 1 Self-Interest Optimal Topology Algorithm

     1: T(P, M = ∅)
     2: for all pi ∈ P do
     3:     for all pj ∈ P, j ≠ i do
     4:         SIVij = 0
     5:         for all qk ∈ Qpi do
     6:             SIVij = SIVij + λqk ∗ µqk(Sj)
     7:     sort {pj, j ≠ i} by SIVij in a non-ascending order
     8:     for all pj ∈ TOP-Kpi {SIVij, j ≠ i} do
     9:         M = M ∪ MSi→Sj
    10: return T(P, M)

The algorithm starts with a full network graph and an empty topology with all

peers and no mapping links. It then loops over all peers, and for each peer it solves


the subproblem of finding the neighbors with highest SIV value. This is done by

traversing all other peers (potential neighbors), calculating SIV for each potential

neighbor (lines 5-6), and sorting peers in a non-increasing order of their SIV . Peers

are then added to the topology one by one until reaching the neighbor limit for the

selected peer. The complexity of the \(SIV_{ij}\) calculation depends on the size of the
query set of peer \(p_i\) and the size of the mapping set used for each query \(q_k \in Q_{p_i}\).
This complexity can be calculated for each peer \(p_i\) as
\(\sum_{q_k \in Q_{p_i}} |\{m \mid m \in M_{S_i \to S_j}(q_k)\}|\).
Assuming that each attribute \(A \in S_{p_i}\) appears in only a single mapping \(m \in M_{S_i \to S_j}\),
we can bound this complexity from above by \(|Q_{p_i}| \ast |S_{p_i}|\), and the total
algorithm complexity is bounded by
\(O(|\mathcal{P}|^2 \ast \max_{p_i \in \mathcal{P}} |Q_{p_i}| \ast \max_{p_i \in \mathcal{P}} |S_{p_i}|)\).
For large networks with many peers, if we assume that the \(SIV_{ij}\) calculation costs no more than
looping through all the peers, the overall complexity of the algorithm is bounded by
\(O(|\mathcal{P}|^3)\).
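To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1. The input encoding is an assumption of this sketch and not part of the model: `queries` maps each peer to a list of (query id, weight λq) pairs, `accuracy` maps a (peer, candidate neighbor, query id) triple to µq(Sj), and missing accuracy entries are treated as 0.

```python
from typing import Dict, Hashable, List, Set, Tuple

Peer = Hashable
Link = Tuple[Peer, Peer]


def self_interest_optimal_topology(
    peers: List[Peer],
    queries: Dict[Peer, List[Tuple[str, float]]],
    accuracy: Dict[Tuple[Peer, Peer, str], float],
    limit: Dict[Peer, int],
) -> Set[Link]:
    """For every peer, keep the K candidates with the highest summed weighted accuracy.

    queries[p]        -> (query id, weight lambda_q) pairs of peer p (hypothetical encoding)
    accuracy[p, n, q] -> mu_q(S_n) for query q of p translated to n (missing means 0)
    limit[p]          -> neighbor limit K_p
    """
    links: Set[Link] = set()  # the mapping set M starts empty
    for pi in peers:
        scores: Dict[Peer, float] = {}
        for pj in peers:
            if pj == pi:
                continue
            # Accumulate lambda_q * mu_q(S_j) over all queries of p_i.
            scores[pj] = sum(
                weight * accuracy.get((pi, pj, q), 0.0)
                for q, weight in queries.get(pi, [])
            )
        # Non-ascending order of score, then keep the top K_pi candidates.
        ranked = sorted(scores, key=lambda pj: scores[pj], reverse=True)
        for pj in ranked[: limit[pi]]:
            links.add((pi, pj))
    return links


# Toy input, invented for illustration only.
example_peers = ["p1", "p2", "p3"]
example_queries = {"p1": [("q1", 1.0)], "p2": [("q2", 0.5), ("q3", 0.5)], "p3": []}
example_accuracy = {
    ("p1", "p2", "q1"): 0.9, ("p1", "p3", "q1"): 0.4,
    ("p2", "p1", "q2"): 0.2, ("p2", "p3", "q2"): 0.8,
    ("p2", "p1", "q3"): 0.6, ("p2", "p3", "q3"): 0.6,
}
example_limit = {"p1": 1, "p2": 1, "p3": 0}
print(sorted(self_interest_optimal_topology(
    example_peers, example_queries, example_accuracy, example_limit)))
```

The constant factors 1/Kpi and 1/|Qpi| are omitted because, as in Equation 3.5, they do not change the ranking of candidates for a fixed peer.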

3.2 Optimal Cooperative-Interest Based Topologies

Recall that cooperative-interest based topology evaluation is a measurement repre-

senting the mutual interest of peers to achieve global welfare. It reflects the fact

that peers are willing to work in collaboration and share knowledge in order to reach

a semantic agreement in the form of topology that maximizes overall queries span

and accuracy potential. Using the cooperative-evaluation measure (CIV ) presented

in Chapter 2, we give a formal representation of the optimal cooperative interest

topology problem:


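Optimal Cooperative Interest Topology Problem

\[
\max_{T(\mathcal{P},\mathcal{M}) \subseteq G(\mathcal{P},\mathcal{M})}
\left[ \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \frac{1}{|Q_p|} \sum_{q \in Q_p} \lambda_{q_p} \ast
\min_{p_j \in \mathcal{P} - p_i} \left\{ \max_{Path_q(S_j) \in T(\mathcal{P},\mathcal{M})} \left\{ \alpha_{Path_q(S_j)} \right\} \right\} \right]
\quad \text{s.t.} \quad \forall p_i \in \mathcal{P},\ |\mathcal{N}(p_i)| \le K_{p_i}
\tag{3.6}
\]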

Given a complete semantic network graph (clique), the objective of the problem is
finding a topology (subgraph of G) that maximizes CIV subject to the peers' neighbor
limit constraints.

Unlike self-interest based evaluation, the cooperative-interest based evaluation

measure for each peer is highly dependent on the neighbor set selection of other peers.

Other peers' selections dictate the available reformulation paths for which cooperative

value elements are calculated. We therefore cannot divide the problem and solve a

subproblem for each peer. In an attempt to solve this problem, we classify it into

several simpler cases as illustrated in Figure 3.1.

Figure 3.1: Classification of the optimal CIV topology problem.


Our primary classification presents the general case where multiple peers issue

queries to the network vs. a simple case where only a single peer issues queries to the

network. Our secondary classification further divides these cases into another two

sub-cases, one where peers issue multiple queries vs. a simpler case where peer(s)

issue a single query only.

The rest of this section is divided as follows: in Section 3.2.1, we introduce the

degree bounded maximum minimal product paths tree (db-MMPT) problem and show

that it is NP-Complete, and in Section 3.2.2 we show using reduction from db-MMPT,

that finding an optimal cooperative-interest based topology, even for the simplest of
cases, where a single peer issues a single query to the network, is NP-Complete.

3.2.1 Degree Bounded Maximum Minimal Product Paths

Tree (db-MMPT)

We begin with the formal definition of the maximum minimal product paths

tree (MMPT): Given a directed graph G(V, E), with positive edge weights ∀e ∈

E, a(e) > 0, a maximum minimal product paths tree (MMPT) rooted at some node

s ∈ V is a directed subgraph G′(V ′, E ′), where V ′ ⊆ V and E ′ ⊆ E, such that:

1. V ′ is the set of all vertices reachable from s in G,

2. G′ forms a rooted tree with root s, and

3. The minimal weighted product unique simple path from s to any node v ∈ V ′

in G′, is a maximal weighted product path from s to v in G.

Informally, we measure paths in the graph by the product of their edge weights, and an
MMPT is a spanning tree of G rooted at s in which the worst path (with minimal
weighted product value), leading to some node v, is the best (with maximal weighted
product value) of all the paths from s to v in the spanned graph G.

We continue with a formal definition of the maximal product paths tree

(MPT): Given a directed graph G(V, E), with positive edge weights ∀e ∈ E, a(e) > 0,

a maximal product path tree (MPT) rooted at some node s ∈ V is a directed subgraph

G′(V ′, E ′), where V ′ ⊆ V and E ′ ⊆ E, such that:

1. V ′ is the set of vertices reachable from s in G,

2. G′ forms a rooted tree with root s, and

3. For all v ∈ V ′, the unique simple path from s to v in G′ is a maximal weighted
product path from s to v in G.

While MPT requires that every path from the root to any node is a maximal
weighted product path in the original graph, MMPT enforces this requirement only
on the path with the least maximum weighted product from the root to any node in the
original graph.

Lemma 1 Given a directed graph G(V, E) where all edge weights are positive ∀e ∈

E, a(e) > 0, an MPT rooted at some node s over G is also an MMPT rooted at s

over G.

Proof: By the definition of MPT, an MPT rooted at s over G is a tree rooted at s
spanning all the vertices reachable from s in G, and every unique simple path from s
to any node v in it is a maximal weighted product path in G. In particular, the minimal
weighted product path from s to any node v in the MPT is also a maximal weighted
product path from s to v in G. Hence, by the definition of MMPT, the MPT is also an
MMPT rooted at s over G.


Figure 3.2: Example of a maximum minimal product paths tree (MMPT) and a maximum product paths tree (MPT).

Figure 3.2 shows an example of a graph (part (a)), an MMPT over this graph

(part (b)), and an MPT over this graph (part (c)). We see that the MPT and the

MMPT differ in the path from p1 to p2. However, since the maximal weighted product

path from p1 to p2 is not the minimal of the maximum weighted product paths in

the original graph (the maximal weighted product path from p1 to p5 has a lower
product), both trees are MMPTs. Maximal product paths are not necessarily unique

and neither are maximal product paths trees (and hence, maximum minimal product

paths trees).

Next, we present a solution to the problem of finding an MPT over a given graph
G with edge weights 0 ≤ a(e) ≤ 1, using a transformation to the shortest path tree
problem. Recall that the formal definition of a shortest path tree is as follows:

Given a graph G(V, E), with positive edge weights ∀e ∈ E, a(e) > 0, a shortest

path tree (SPT) rooted at some node s ∈ V is a directed subgraph G′(V ′, E ′),

where V ′ ⊆ V and E ′ ⊆ E, such that:

1. V ′ is the set of vertices reachable from s in G,

2. G′ forms a rooted tree with root s, and

3. for all v ∈ V ′, the unique simple path from s to v in G′ is a least sum of weights
path from s to v in G.

The SPT problem of finding a shortest path tree in a given graph as described above
is well researched and can be solved using classical algorithms such as the Bellman-Ford
algorithm and Dijkstra’s algorithm [15], with asymptotic complexity of O(|V| · |E|)
and O(|V|²), respectively.

Lemma 2 Given a graph G(V, E) where all edge weights satisfy 0 ≤ a(e) ≤ 1, an SPT with
s as a root over the graph G′(V, E) with new edge weights¹ a′(e) = log(1/a(e)) is an
MPT rooted at s over G.

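Proof: Let path1 be the path from s to v in the SPT rooted at s in graph G′, and let
path2 be an arbitrary path between s and v in G′. From the definition of SPT:

\[
\sum_{e \in \mathit{path}_1} a'(e) \le \sum_{e \in \mathit{path}_2} a'(e)
\tag{3.7}
\]

Therefore,

\[
\sum_{e \in \mathit{path}_1} \log\frac{1}{a(e)} \le \sum_{e \in \mathit{path}_2} \log\frac{1}{a(e)}
\tag{3.8}
\]

Using the well-known equivalence \(\sum_i \log x_i = \log \prod_i x_i\):

\[
\log \prod_{e \in \mathit{path}_1} \frac{1}{a(e)} \le \log \prod_{e \in \mathit{path}_2} \frac{1}{a(e)}
\tag{3.9}
\]

Since \(\prod \frac{1}{a(e)} > 0\) for any set of edges in G′, and from the monotonicity of the log
function,

\[
\prod_{e \in \mathit{path}_1} \frac{1}{a(e)} \le \prod_{e \in \mathit{path}_2} \frac{1}{a(e)}
\tag{3.10}
\]

and therefore

\[
\frac{1}{\prod_{e \in \mathit{path}_1} \frac{1}{a(e)}} \ge \frac{1}{\prod_{e \in \mathit{path}_2} \frac{1}{a(e)}}
\tag{3.11}
\]

and using the product equivalence \(\frac{1}{\prod_i x_i} = \prod_i \frac{1}{x_i}\):

\[
\prod_{e \in \mathit{path}_1} a(e) \ge \prod_{e \in \mathit{path}_2} a(e)
\tag{3.12}
\]

Since E(G) = E(G′), all the paths in G exist in G′ and vice versa. We have shown
that a path with least weighted sum in G′ is a path with maximal weighted product
in G. Hence, an SPT over G′ is an MPT over G.

¹ We add a very small value ε to edges with a(e) = 0.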

Figure 3.3: Example of transformation from MPT to SPT.

Figure 3.3 shows an example of applying the transformation suggested above from
MPT to SPT. The bold edges on the left hand side graph mark an MPT over the
graph. The right hand side graph is created by applying the log(1/a(e)) transformation to the
edges of the left hand side graph, and we see that the SPT over the right hand side graph
(marked again by the bold edges) contains the same edge set as the MPT over the
left hand side graph.
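The transformation of Lemma 2 can be sketched in a few lines of Python: we run a heap-based variant of Dijkstra's algorithm on the costs log(1/a(e)) and convert the resulting distances back to products. The adjacency-dictionary encoding, the function name, and the concrete value chosen for ε are assumptions of this sketch.

```python
import heapq
import math
from typing import Dict, Hashable, Tuple

Node = Hashable
Graph = Dict[Node, Dict[Node, float]]  # graph[u][v] = a(u, v), with 0 <= a <= 1

EPSILON = 1e-12  # stand-in for the "very small value" that replaces a(e) = 0


def max_product_tree(graph: Graph, root: Node) -> Dict[Node, Tuple[Node, float]]:
    """Return {v: (parent of v, maximal path product from root to v)}.

    Only nodes reachable from root (other than root itself) appear in the result.
    """
    dist = {root: 0.0}  # shortest distance under the cost log(1 / a(e))
    parent: Dict[Node, Node] = {}
    done = set()
    heap = [(0.0, 0, root)]
    counter = 1  # tie-breaker so that nodes themselves are never compared
    while heap:
        d, _, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, a in graph.get(u, {}).items():
            cost = -math.log(a if a > 0 else EPSILON)  # log(1/a) >= 0 for a <= 1
            if d + cost < dist.get(v, math.inf):
                dist[v] = d + cost
                parent[v] = u
                heapq.heappush(heap, (d + cost, counter, v))
                counter += 1
    # exp(-distance) recovers the product of the original edge weights.
    return {v: (parent[v], math.exp(-dist[v])) for v in parent}


# Toy graph, invented for illustration: the two-hop path s -> a -> b
# (0.9 * 0.8) beats the direct edge s -> b (0.5).
toy = {"s": {"a": 0.9, "b": 0.5}, "a": {"b": 0.8}, "b": {}}
print(max_product_tree(toy, "s"))
```

Because all costs log(1/a(e)) are non-negative when a(e) ≤ 1, Dijkstra's algorithm is applicable here; the Bellman-Ford algorithm would work equally well.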

Using the formal definition of MMPT, we continue with the definition of a de-

gree bounded maximum minimal product paths tree (db-MMPT): Given a


directed graph G(V, E), with positive edge weights ∀e ∈ E, a(e) > 0 and out-degree

bounds ∀v ∈ V, outDB(v) ≥ 0, a degree bounded maximum minimal product paths

tree (db-MMPT) rooted at some node s ∈ V is a directed subgraph G′(V ′, E ′), where

V ′ ⊆ V and E ′ ⊆ E, such that:

1. G′ forms an MMPT with root s, and

2. ∀v ∈ V ′, out-degree(v) ≤ outDB(v).

Figure 3.4: Example of MMPT vs. db-MMPT.

Figure 3.4(a) shows a graph G and an MMPT over the graph. Figure 3.4(b) shows

the same graph and a db-MMPT over G with outDB(v) = 2. We note that, due to the
degree constraints, the least weighted product path in the db-MMPT (from p1 to p6)
has a lower weighted product than the least weighted product paths in the MMPT
(from p1 to peers p5 and p6).

Finally, we show that the problem of finding a db-MMPT is NP-Complete by
reduction from the asymmetric traveling salesperson problem. Our proof outline
partially relies on a similar outline presented in [53]. Recall that the formulation of
the asymmetric traveling salesperson (ATSP) problem in graph theory terms is
as follows:

Given a complete directed graph G(V, E), with positive edge weights ∀e ∈ E, a(e) > 0

(where the vertices would represent the cities, the edges would represent the roads,

and the weights would be the cost or traveling distance on that road), find a least

weight Hamiltonian cycle (a round-trip route that visits each city exactly once) over

G.

Theorem 1 Given a directed graph G(V, E), with edge weights 0 ≤ a(e) ≤ 1 and

out-degree bounds ∀v ∈ V, outDB(v) ≥ 0, finding a db-MMPT rooted at some node s

over G is NP-Complete.

Proof: Given an instance of the ATSP problem on graph G, we extend and transform
G(V, E) into a new graph G′(V ′, E ′) in the following way: we choose an arbitrary node
s ∈ V to be the root. We copy the nodes and edges of G to G′ with new edge weights
\(a'(e) = \frac{1}{2^{a(e)}}\). We then extend G′ with a copy of the root, called s′, and a copy of all
the incoming edges of the root, \(e_{i,s}\) to \(e_{i,s'}\), such that \(V' = V \cup \{s'\}\) and
\(E' = E \cup \{e_{i,s'} \mid e_{i,s} \in E\}\).
The ATSP problem on graph G can then be solved by finding a db-MMPT on graph
G′ with out-degree bounds ∀v ∈ V, outDB(v) = 1. The spanning tree formed by the
db-MMPT is a simple path from s to s′, since each node has exactly one successor,
and therefore a Hamiltonian cycle in G (considering s and s′ as one node). We now
show that this simple path in G′ corresponds to a least weight Hamiltonian cycle in
G: Let path1 be the path from s to s′ in the db-MMPT rooted at s in graph G′
and let path2 be an arbitrary path between s and s′ in G′. From the definition of
db-MMPT:

\[
\prod_{e \in \mathit{path}_1} a'(e) \ge \prod_{e \in \mathit{path}_2} a'(e)
\tag{3.13}
\]

Therefore,

\[
\prod_{e \in \mathit{path}_1} \frac{1}{2^{a(e)}} \ge \prod_{e \in \mathit{path}_2} \frac{1}{2^{a(e)}}
\tag{3.14}
\]

and using the product equivalence \(\prod_i \frac{1}{x_i} = \frac{1}{\prod_i x_i}\):

\[
\frac{1}{\prod_{e \in \mathit{path}_1} 2^{a(e)}} \ge \frac{1}{\prod_{e \in \mathit{path}_2} 2^{a(e)}}
\tag{3.15}
\]

therefore,

\[
\prod_{e \in \mathit{path}_1} 2^{a(e)} \le \prod_{e \in \mathit{path}_2} 2^{a(e)}
\tag{3.16}
\]

Using the well-known equivalence \(\prod_i a^{x_i} = a^{\sum_i x_i}\):

\[
2^{\sum_{e \in \mathit{path}_1} a(e)} \le 2^{\sum_{e \in \mathit{path}_2} a(e)}
\tag{3.17}
\]

Since \(\sum a(e) > 0\), and from the monotonicity of the power function,

\[
\sum_{e \in \mathit{path}_1} a(e) \le \sum_{e \in \mathit{path}_2} a(e)
\tag{3.18}
\]

and we have shown that the simple path from s to s′ in the db-MMPT for G′ is a least
weighted path (Hamiltonian cycle) in G, i.e., this path gives the solution to ATSP in
G. The path is the most restricted form of a (db-MMPT) tree, and hence the other
(db-MMPT) trees are generalizations of the above problem and are harder to solve.

Figure 3.5 shows an example of the transformation from the ATSP problem to the db-MMPT
problem. On the left side, a graph G is presented and a least weight Hamiltonian

cycle over G is drawn (marked by bold edges). On the right, we see the graph G′

resulting from the above suggested transformation of ATSP to db-MMPT. p′4 (marked

bold and shaded) is the extended copy of the selected root p4 in G′, and extended

incoming edges to p′4 are marked by dashed lines. We see that the db-MMPT rooted at

p4 over G′ (marked by bold edges) forms a simple path from p4 to p′4 and corresponds

to the least weighted Hamiltonian cycle in G.


Figure 3.5: Example of transformation from ATSP to db-MMPT.
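The construction used in the proof of Theorem 1 can be sketched as follows. The code only builds the db-MMPT instance from an ATSP cost table and checks that the product of the transformed weights along a path equals 2 raised to minus the tour cost; it does not solve either problem. The function names, the edge-dictionary encoding, and the label used for the root copy s′ are assumptions of this sketch.

```python
from typing import Dict, Hashable, List, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def atsp_to_db_mmpt(
    cost: Dict[Edge, float], root: Node, root_copy: Node
) -> Tuple[Dict[Edge, float], Dict[Node, int]]:
    """Build edge weights a'(e) = 1 / 2^a(e) and out-degree bounds for G'."""
    weights: Dict[Edge, float] = {}
    for (u, v), a in cost.items():
        weights[(u, v)] = 2.0 ** (-a)  # original edge, transformed weight
        if v == root:
            # Every edge entering the root s is duplicated towards its copy s'.
            weights[(u, root_copy)] = 2.0 ** (-a)
    # outDB(v) = 1 for every original node; s' has no outgoing edges at all.
    out_bound = {node: 1 for edge in cost for node in edge}
    return weights, out_bound


def path_product(weights: Dict[Edge, float], path: List[Node]) -> float:
    """Product of the transformed weights along a node sequence."""
    product = 1.0
    for u, v in zip(path, path[1:]):
        product *= weights[(u, v)]
    return product


# Toy ATSP instance on three cities, invented for illustration.
toy_cost = {
    ("A", "B"): 1, ("B", "C"): 2, ("C", "A"): 3,
    ("A", "C"): 4, ("C", "B"): 1, ("B", "A"): 2,
}
toy_weights, toy_bounds = atsp_to_db_mmpt(toy_cost, root="A", root_copy="A'")
# Tour A-B-C-A costs 6 and tour A-C-B-A costs 7; the cheaper tour maps to
# the path from A to A' with the larger product.
print(path_product(toy_weights, ["A", "B", "C", "A'"]))
print(path_product(toy_weights, ["A", "C", "B", "A'"]))
```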

3.2.2 Single Peer Single Query (SPSQ) Optimal Topology

Problem

Consider the simplest case of Figure 3.1, where only a single peer \(p_i\) issues a
single query, \(Q_{p_i} = \{q\}\), to the network while the rest of the peers merely answer and
propagate q. We formalize this problem as follows:

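SPSQ Problem

\[
\max_{T(\mathcal{P},\mathcal{M}) \subseteq G(\mathcal{P},\mathcal{M})}
\left[ \min_{p_j \in \mathcal{P} - p_i} \left\{ \max_{Path_q(S_j) \in T(\mathcal{P},\mathcal{M})} \left\{ \alpha_{Path_q(S_j)} \right\} \right\} \right]
\quad \text{s.t.} \quad \forall p_i \in \mathcal{P},\ |\mathcal{N}(p_i)| \le K_{p_i}
\tag{3.19}
\]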

Given a complete semantic network graph (clique), the objective of the problem
is finding a topology (subgraph of G) that maximizes \(CIV_{q_{p_i}}\) subject to the peers'
neighbor limit constraints.

Theorem 2 Given a semantic network graph G(P ,M), with associated mapping

accuracies 0 ≤ µM ≤ 1 and neighbor set constraints Kp for each peer p, the SPSQ

problem for a single peer pi with a single query q is NP-Complete.


Proof: Given an instance of a db-MMPT problem on graph G with edge weights
0 ≤ a(e) ≤ 1, out-degree bounds ∀v ∈ V, outDB(v) ≥ 0, and some node s ∈ V as
root, we transform the problem to an SPSQ problem as follows: G remains the same,
with the nodes V taken to be the set of peers P, the edges E taken to be the set of schema
mappings M, the root node s taken to be pi, the single peer issuing the single query q, and the edge
weights a(e) taken to be the translation accuracies µq of query q. The db-MMPT problem on graph
G can then be solved by solving the SPSQ problem on G, finding some optimal topology
T(P, M), and then solving MPT on the resulting topology T.

Consider the optimal topology T found by solving SPSQ on G. By the definition

of SPSQ, T spans all the peers connected to pi (all the nodes reachable from s) and

satisfies the neighbor set constraints |N (p)| ≤ Kp,∀p ∈ P (out-degree constraints

out-degree(v) ≤ outDB(v),∀v ∈ V ). Running MPT on T can be done in polynomial

time by transformation to SPT (see Section 3.2.1). By the definition of MPT, the

result of running MPT over T is a tree T ′ spanning all the nodes in T and therefore

spanning all the reachable peers from pi (connected nodes from s) in G. Since T ′

spans a graph that satisfies |N (p)| ≤ Kp,∀p ∈ P (out-degree(v) ≤ outDB(v),∀v ∈ V )

constraints, T ′ satisfies these constraints as well. Let pj (v) be some peer (node)

reachable from pi (s) in G, from the definition of MPT the unique simple path from

pi (s) to pj (v) in T ′ is maximal weighted product path from pi (s) to pj (v) in T , i.e.

the minimum of the maximal weighted product paths in T exists in T ′ and is also

the minimum of the unique simple weighted product paths from pi (s) to any peer pj

(node v) in T ′. By the definition of SPSQ, this path is a maximal weighted product

path from pi (s) to pj (v) in G, hence T ′ is a tree rooted at pi (s) spanning all the

reachable peers (nodes) from pi (s) in G, satisfying G’s neighbors limit (out-degree)

constraints, and the minimum of the weighted product paths from pi (s) to any peer


pj (node v) in T ′ is a maximal weighted product path from pi (s) to pj (v) in G.

Therefore by the definition of db-MMPT, T ′ is a solution for db-MMPT rooted at s

in G.

Figure 3.6: Example of transformation from db-MMPT to SPSQ.

Figure 3.6 shows an example of a solution to a db-MMPT problem by transfor-

mation to SPSQ problem. On the left side, a graph G is presented with edge weights

0 ≤ a(e) ≤ 1 and out-degree constraints ∀v ∈ V, outDB(v) = 2. For this graph, an
SPSQ optimal topology with p1 as query issuer is presented (marked by bold edges).

On the right side, we see the result of running MPT on this optimal topology (marked

by bold edges). Note that the resulting tree is a db-MMPT rooted at p1 satisfying

the out-degree constraints over the original graph G.
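For very small instances, the SPSQ objective of Equation 3.19 can be evaluated exhaustively. The Python sketch below enumerates every topology that respects the neighbor limits and keeps the one whose worst best-path product is largest. It is exponential in the number of peers and is meant only to illustrate the objective, not as a practical method. The encoding (every peer is a key of the adjacency dictionary), the function names, and the convention that an unreachable peer contributes a value of 0 are assumptions of this sketch.

```python
import itertools
from typing import Dict, Hashable, Tuple

Peer = Hashable
Graph = Dict[Peer, Dict[Peer, float]]  # graph[u][v] = accuracy of q over M_{u->v}


def best_path_products(topology: Graph, source: Peer) -> Dict[Peer, float]:
    """Maximal product of accuracies from source to each reachable peer.

    Bellman-Ford style relaxation; with weights in [0, 1] a cycle can never
    improve a product, so |P| rounds are sufficient.
    """
    best = {source: 1.0}
    for _ in range(len(topology)):
        changed = False
        for u, neighbors in topology.items():
            if u not in best:
                continue
            for v, a in neighbors.items():
                if best[u] * a > best.get(v, 0.0):
                    best[v] = best[u] * a
                    changed = True
        if not changed:
            break
    return best


def spsq_value(topology: Graph, source: Peer) -> float:
    """Minimum over all other peers of their best path product (0 if unreachable)."""
    best = best_path_products(topology, source)
    return min(best.get(p, 0.0) for p in topology if p != source)


def brute_force_spsq(
    graph: Graph, source: Peer, limit: Dict[Peer, int]
) -> Tuple[float, Graph]:
    """Try every neighbor selection within the limits; return the best value and topology."""
    peers = list(graph)
    options = []
    for p in peers:
        candidates = list(graph[p])
        k = min(limit[p], len(candidates))
        options.append([
            subset
            for size in range(k + 1)
            for subset in itertools.combinations(candidates, size)
        ])
    best_value, best_topology = -1.0, {}
    for choice in itertools.product(*options):
        topology = {
            p: {v: graph[p][v] for v in chosen} for p, chosen in zip(peers, choice)
        }
        value = spsq_value(topology, source)
        if value > best_value:
            best_value, best_topology = value, topology
    return best_value, best_topology


# Toy network, invented for illustration: a tighter limit on p1 forces a
# two-hop path and lowers the optimal value.
toy = {"p1": {"p2": 0.9, "p3": 0.9}, "p2": {"p3": 0.5}, "p3": {"p2": 0.5}}
print(brute_force_spsq(toy, "p1", {"p1": 1, "p2": 1, "p3": 1})[0])
print(brute_force_spsq(toy, "p1", {"p1": 2, "p2": 1, "p3": 1})[0])
```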

3.3 Discussion

In the search for optimal offline topologies, we presented a simple algorithm
that calculates optimal self-interest based topologies in polynomial time. In
addition, we have shown that even for the very simple case of a single peer issuing a single
query to the network, the problem of finding an optimal cooperative-interest based


topology is NP-Complete. This problem, named SPSQ, is the most restricted form of

the single peer multiple queries (SPMQ) problems and hence we conclude that

optimal SPMQ topologies are harder to find. Considering the more general problems

of multiple peer single query (MPSQ) and multiple peer multiple queries

(MPMQ) where multiple peers issue queries: under the assumptions that each query

is issued by exactly one peer and no query is equal to or containing another query,

we can show by reduction from SPSQ extended with “virtual” queries that do not

change the optimal topology value, that these problems are also hard.

Chapter 4

Dynamic Self-Organizing

Topologies

In Chapter 2, we presented a model for PDMS and a method to calculate the semantic

accuracy of a query reformulated through a schema mapping. We extended the notion

of semantic accuracy for transitive query translations and presented a method to

calculate accuracy preservation of queries reformulated over a path of peers connected

through schema mappings. We demonstrated how the calculation of these measures may
be integrated into the query translation and propagation mechanism and can serve as
feedback on the quality of the process. In this chapter, we show how to take advantage

of this feedback to modify the network topology in an automatic manner. Thus,

we make a step towards self-learning networks of peers collaboratively establishing

semantic interoperability in an automated fashion [4].

In the following, we expect peers to perform several tasks: (1) Upon propagating a
query, a peer has to calculate the reformulation accuracy and preservation and
forward these measures along with the new query. (2) Upon receiving query results or


other feedback (measures), it has to analyze them and update its view of the overall

semantic agreement. (3) Periodically, it has to use its local gained knowledge to

adjust its semantic mappings.

As demonstrated in Chapter 2, peers can calculate and propagate accuracy measures
along with queries. This lightweight mechanism can be extended to pass on
additional informative measures. Forwarding queries with accompanying reformulation
measures serves as input for peers to assess semantic similarity. Section 4.1 deals

with semantic similarity assessment in the context of a PDMS. Specifically, we define

the semantic acquaintance problem which is a preliminary phase for filtering suitable

candidates for semantic evaluation. We introduce a methodology and some metrics

for the identification of “good” candidates for matching.

In Section 4.2, we discuss the usage of assessed similarity to perform mapping

adjustments. We introduce the semantic replacement problem which deals with the

practical application of similarity for neighbor list maintenance. If the calculated

similarity measure truly reflects human judgment of similarity, we expect the network

to self-organize into a state where queries get disseminated to the subset of the peers

most likely to return relevant results, where the correct mappings are increasingly

used and where incorrect mappings are neglected. Implicitly, this is a state where a

global agreement on the semantics of the different schemata has been reached.

4.1 Semantic Acquaintance

One of the key conditions for establishing a self-organizing network is the peers' ability
to find and connect with new neighbors. Basically, peers can meet other peers using

one of two means: (1) random connection requests through ping messages as part of


the underlying network protocol or (2) acquaintance through queries, i.e., connecting

with peers that produce results to queries issued by the peer.

Upon joining the network, a new peer randomly connects to a set of arbitrary

neighbors. Later on, queries issued by the peer would be reformulated and passed

along to semantically connected peers. However, there may be some peers that are not

included in the group of semantically connected peers, yet are still capable of answering
the peer's queries.

Figure 4.1: Semantically disconnected components.

Example 6 Consider the network in Figure 4.1, representing a partial semantic

translation network for query q issued by peer p1. Peer p1 joined the network and

established connection with N (p1) = {p2, p3}, the dashed edge is a virtual connection,

non-existing in the network, demonstrating the potential mapping MS1→S5.

The only connection between the peers in the left side of the graph (p1, p2, p3)

and those on the right side of the graph (p4, p5, p6, p7) is by the mapping MS3→S4.

Unfortunately, q cannot be reformulated over this mapping (µq = 0), maybe due to


some missing attributes in p4’s schema, and therefore query q will not reach the peers

on the right side. If there was an actual connection MS1→S4, the query q could have

been reformulated over this mapping and further propagated to other peers as well.

In the absence of a central mechanism to identify semantically disconnected compo-

nents in the network, it is up to each peer to try and bridge over such disconnections

and expand its semantic connected component. Imitating the procedure performed

when joining the network, peers can periodically match against random peers that
they are not familiar with, i.e., peers that never answer their queries.

Assuming a well connected topology, semantic disconnections will not be that

common. Peers can therefore focus on selecting preferred neighbors among the tran-

sitively connected peers discovered during query propagation. Neighbors are chosen

according to their semantic similarity to the selecting peer, reflecting their ability to

translate its queries. Schema matching is the basic operation used to assess semantic
similarity in our model.

Schema matching is a complex [14] time-consuming operation. In [30], it is demon-

strated through empirical analysis that there is no single dominant schema matcher

that performs best, regardless of the data model and application domain. Therefore,

more complex and time-consuming approaches such as matchers ensemble [19, 29] and

top-ranked matchings evaluation [28] are required to establish correct mappings, and

there is no evidence that these approaches can reach a “consensus” of valid schema

matching at a low complexity cost [22].

Recall that the main objective of peers in a PDMS is to share knowledge by means

of queries and therefore the effort dedicated for examination of potential neighbors

should be minimal. The process of finding new neighbors should be light-weight,

complexity wise, for it to scale under such settings where peers perform the process


frequently.

We present the concept of semantic acquaintance to address this problem. Semantic
acquaintance involves the selection of “good” candidates for schema matching

among the peers discovered through a query mechanism. The idea behind semantic

acquaintance is to apply light-weight decision policies to narrow the list of potential

candidates to include only peers anticipated to have high semantic similarity. Ac-

quaintance policies can make use of measures returned as feedback from the query

mechanism. Light weight acquaintance policies, such as least recently used (LRU)

[62], History, and Popularity [68] to name a few, were formerly suggested and proved

effective in the context of file sharing P2P networks. However, file sharing networks

are different from PDMS in several respects. Firstly, a query answer is given in the

form of yes/no for the existence of a desired file, and a positive reply from a single

source is sufficient. In addition, the context of matching links with their accompany-

ing accuracies does not exist in file-sharing systems. Therefore, despite the fact that

mechanisms from file sharing networks can be generalized and used in the context of

PDMS, accuracy preservation oriented policies may better fit our setting.

In what follows, we introduce several acquaintance policies taking into account

the special characteristics of PDMSs:

Highest Path Length Acquaintance (HPLa) Policy

Highest path length policy takes into account the decay of path accuracy along a

path of transitive mappings. The longer the path is, the higher are the chances for

accuracy decay. For this policy, peers have to forward along with reformulated query

the measure δPath, counting the number of query translations along a path. This

measure is easily calculated, as each peer translating a query increments its value and
forwards it further. Formally:

\[
HPLa(p_j, q) = \delta_{Path_{p_i \to \dots \to p_j}}
\tag{4.1}
\]

Highest Accuracy Preservation Acquaintance (HAPa) Policy

Highest Accuracy Preservation policy follows the principle of “a friend of a friend,

is also a friend,” which means under PDMS setting that a well matched neighbor

of a well matched neighbor is a good candidate for matching and more generally,

transitively connected peers that maintain high preservation mappings paths are likely

to match a peer with high accuracy. This policy requires the propagation of the path

preservation measure αPath as demonstrated in Chapter 2, and formally:

\[
HAPa(p_j, q) = \alpha_{M_{S_l \to S_j} \cdots (M_{S_i \to S_k}(q))}
\tag{4.2}
\]

Nearest Neighbors Accuracy Acquaintance (NNAa) Policy

Nearest neighbors accuracy policy tries to take advantage of both semantic connec-

tivity and connections quality of peers. Assuming that a peer is well connected to

other peers, matching against it would earn the benefit of queries reaching its neigh-

bors. In this policy, each peer summarizes the accuracy of its closest neighbors and

returns this measure along with query results. Note that this measure can
take a simple form, the accuracy of the nearest neighbors for a given schema/query relevant
to the receiving peer, or a generalized form that covers the nearest neighbors within a given radius
(number of hops). The latter requires a cooperative calculation protocol. We define
it formally as:

\[
NNAa(p_j, q) = \sum_{p_k \in \mathcal{N}(p_j),\ k \neq i} \mu_{M_{S_j \to S_k}}(q)
\tag{4.3}
\]


Figure 4.2: Acquaintance policies example.

Example 7 We demonstrate the application of the above policies in the following

example. Figure 4.2 presents a partial network of semantically connected peers. Num-

bers on the edges represent the translations accuracies of query q, issued by peer p1.

Assume that q reaches the peers in the following order p3, p4, p2, p6, p7, p5 and that dur-

ing a single query session, peers answer only the first time the query reaches them,

i.e., p2 will answer q upon receiving it from p4 and will not answer it upon receiving

it the second time from p7.

We now demonstrate the calculation of the candidate peers' evaluation measures
according to the policies introduced in this section for peer p4:

\[
HPLa(p_4, q) = \delta_{Path_{p_1 \to p_3 \to p_4}} = 2
\tag{4.4}
\]

\[
HAPa(p_4, q) = \alpha_{M_{S_3 \to S_4}(M_{S_1 \to S_3}(q))} = 0.8 \ast 1.0 = 0.8
\tag{4.5}
\]

\[
NNAa(p_4, q) = \sum_{p_j \in \mathcal{N}(p_4),\ j \neq 1} \mu_{M_{S_4 \to S_j}}(q)
= \mu_{M_{S_4 \to S_2}}(q) + \mu_{M_{S_4 \to S_6}}(q) = 0.8 + 0.9 = 1.7
\tag{4.6}
\]

In a similar manner, we calculated these measures for the rest of the peers. The
results are summarized in Table 4.1:

pj             p4     p2     p6     p7     p5
HPLa(pj, q)    2      3      3      4      4
HAPa(pj, q)    0.8    0.64   0.72   0.72   0.648
NNAa(pj, q)    1.7    0.7    1.8    0.9    0.6

Table 4.1: Acquaintance policies evaluation measures

Note that different policies may yield different rankings: p5 and p7 are ranked first

according to HPLa policy, p4 according to HAPa policy, and p6 according to NNAa

Policy. While HPLa policy explicitly prioritizes longer paths, HAPa implicitly prefers

shorter ones, where preservation is often higher. The NNAa policy completely
ignores path length.
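The three measures can be reproduced with a few lines of Python. The sketch below recomputes the values of Equations 4.4 to 4.6 for peer p4, using only the accuracies stated in Example 7 (1.0 for p1 to p3, 0.8 for p3 to p4, and 0.8 and 0.9 from p4 to p2 and p6). The function names and the list and dictionary encodings are assumptions of this sketch, and the path preservation is taken to be the product of the accuracies along the path, following the arithmetic of Equation 4.5.

```python
from math import prod
from typing import Dict, List


def hpla(path_accuracies: List[float]) -> int:
    """Path length: the number of query translations along the path."""
    return len(path_accuracies)


def hapa(path_accuracies: List[float]) -> float:
    """Path preservation, computed here as the product of the translation accuracies."""
    return prod(path_accuracies)


def nnaa(neighbor_accuracies: Dict[str, float], issuer: str) -> float:
    """Sum of the candidate's accuracies towards its neighbors, excluding the issuer."""
    return sum(acc for peer, acc in neighbor_accuracies.items() if peer != issuer)


# Values taken from Example 7 for candidate p4 and query issuer p1.
path_to_p4 = [1.0, 0.8]                  # p1 -> p3, then p3 -> p4
neighbors_of_p4 = {"p2": 0.8, "p6": 0.9}

print(hpla(path_to_p4))
print(hapa(path_to_p4))
print(round(nnaa(neighbors_of_p4, issuer="p1"), 3))
```

Ranking candidates then amounts to sorting them by the chosen measure in descending order.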

4.2 Semantic Replacement

In Section 4.1 we presented some policies for the identification of semantically similar peers.
However, considering the decentralized setting and the absence of complete network
knowledge, none of the suggested (or any other) policies guarantees a selection of good
candidates for matching; they merely offer lightweight heuristics to avoid exhaustive
matchings. Furthermore, even a selected candidate with high matching accuracy will

not necessarily be a good neighbor. Neighbors can be matched well and still translate

some queries inaccurately thus spoiling the peer’s self-interest. Others may offer

accurate translation but are not well connected to other peers thus possibly spoiling

the global welfare as queries may not reach many peers through them. We illustrate

these problems in the following example:


Example 8 Consider Example 7, presented in Section 4.1. Figure 4.2 presents the

network graph and Table 4.1 summarizes the results of applying HPLa, HAPa and

NNAa acquaintance policies on the peers discovered through query process of q issued

by p1. Assume now that p1, applying all the three policies, decides to match against

the top ranked peer of each policy,1 the matching accuracies are given in Table 4.2:

Policy        HPLa          HAPa    NNAa
pj            p7     p5     p4      p6
µMS1→Sj(q)    1.0    0.9    0.92    0.95

Table 4.2: Mapping accuracies for selected candidates using different acquaintance policies

Assume now that p1 has a limit of Kp1 = 1 neighbor. Comparing the matching results of all four peers, p1 discovers that p7 has the highest accuracy and, in fact, a perfect mapping. Since µMS1→S7(q) > µMS1→S3(q), p1 decides that p7 is a better neighbor than p3 and makes a replacement. The new topology after the replacement is given in Figure 4.3. In the new topology, p1 achieves a (possibly) better mapping with p7, but at the cost of losing good mapping connections to four other peers (circled by a dashed line). Clearly, replacing p3 with p7 is not a good option. Replacement with any of the other candidate peers would not have spoiled the connectivity and thus could be considered.

Example 8 demonstrates the need for a replacement policy. Replacement deals with the maintenance of a valuable neighbor list by providing the decision policy for accepting or rejecting new peers. We examine the simple event of a single replacement candidate; an event with multiple candidates can be separated into a sequence of discrete events, each dealing with a single candidate. As each peer is limited to a finite number of allowed neighbors, there are two possible decision scenarios: (1) there is an open slot in the list, in which case the replacement policy becomes a placement policy and the new candidate is added to the list; or (2) the list is full and any placement would exceed the neighbor limit. In the latter case, a replacement policy needs to decide whether to accept the new candidate to the list, and at the expense of which existing neighbor.

¹ In this example, we consider both top-ranked peers yielded by the HPLa policy (p7 and p5).

Figure 4.3: Bad replacement example.

Acquaintance and replacement can be viewed as two parts of a single problem: the identification and maintenance of semantic neighbors. Acquaintance and replacement policies can be coupled to run serially as a single global policy, and in some cases a single policy can serve both purposes. In what follows, we introduce a few examples of policies, complementary to the acquaintance policies, that may serve for neighbor replacement decisions in a PDMS:


Highest Accuracy Preservation Replacement (HAPr) Policy

The highest accuracy replacement policy ranks neighbors according to their schema matching accuracy or query set translation accuracy. This policy is in fact similar to the HAPa acquaintance policy, where preservation for directly connected neighbors is calculated as their matching accuracy. HAPr is a self-interest oriented replacement policy, as it guarantees that no replacement spoils the accuracy of the peer’s closest neighbors. We calculate HAPr as:

$$\mathrm{HAP}_r(p_j) = \sum_{q \in Q_{p_i}} \lambda_q \ast \mu_q(S_j) \qquad (4.7)$$

Nearest Neighbors Accuracy Replacement (NNAr) Policy

This policy is a good example of a policy that can serve for both acquaintance and replacement, following the same principle of connecting with neighbors that have a high total mapping accuracy. We can either add the matching accuracy of the new candidate to the sum of its nearest neighbors’ accuracies, or multiply it by that sum to obtain the nearest neighbors’ accuracy preservation. We use the second option:

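$$\mathrm{NNA}_r(p_j) = \sum_{q \in Q_{p_i}} \lambda_q \ast \Big( \mu_q(S_j) + \sum_{p_k \in \mathcal{N}(p_j),\, k \neq i} \alpha_{M_{S_j \to S_k}}\big(M_{S_i \to S_j}(q)\big) \Big) = \sum_{q \in Q_{p_i}} \lambda_q \ast \mu_q(S_j) \ast \Big( 1 + \sum_{p_k \in \mathcal{N}(p_j),\, k \neq i} \mu_q(S_k) \Big) \qquad (4.8)$$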

Near Highest Accuracy Preservation Replacement (HAP80%r) Policy

This replacement policy is a variation on the HAPr policy and is calculated in the same manner, but its acceptance rule is different. Rather than replacing the least accurate neighbor with a new candidate only if the new peer has a higher HAP value, we perform the replacement whenever the new peer’s value exceeds 80% of the value of the least accurate neighbor. Since HAPr is a self-interest based policy, this weakened form of HAPr accepts neighbors that may contribute to the cooperative-interest value of the network while still not spoiling much of its self-interest value.

Example 9 We continue with Examples 7 and 8 and demonstrate the outcome of

applying the above suggested replacement policies on the network in Figure 4.2. First,

we calculate the different replacement measures for the current neighbor p3, in order

to compare it with the new candidates.

$$\mathrm{HAP}_r(p_3) = \mu_q(S_3) = 0.8 \qquad (4.9)$$

$$\mathrm{HAP80\%}_r(p_3) = 0.8 \ast \mu_q(S_3) = 0.8 \ast 0.8 = 0.64 \qquad (4.10)$$

$$\mathrm{NNA}_r(p_3) = \mu_q(S_3) \ast (1 + \mu_q(S_4)) = 0.8 \ast (1 + 1) = 1.6 \qquad (4.11)$$
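The measures computed for p3 and the acceptance rules of HAPr and HAP80%r can be sketched as follows. This is a minimal illustration for a single query with weight λq = 1; the function names and the `factor` parameter are assumptions introduced here, and the candidate accuracies are those listed in Table 4.2.

```python
import math

def hap_r(mu_candidate):
    """HAPr for a single query with weight 1: the matching accuracy itself."""
    return mu_candidate

def nna_r(mu_candidate, neighbor_accuracies):
    """NNAr for a single query with weight 1 (product form of Eq. 4.8)."""
    return mu_candidate * (1 + sum(neighbor_accuracies))

def accepts(candidate_value, worst_neighbor_value, factor=1.0):
    """Accept the candidate if it beats factor * the worst neighbor's value.

    factor=1.0 corresponds to HAPr, factor=0.8 to HAP80%r.
    """
    return candidate_value > factor * worst_neighbor_value

hap_p3 = hap_r(0.8)           # Eq. 4.9
nna_p3 = nna_r(0.8, [1.0])    # Eq. 4.11, the single neighbor has accuracy 1
candidates = {"p7": 1.0, "p5": 0.9, "p4": 0.92, "p6": 0.95}
decisions = {p: accepts(hap_r(mu), hap_p3) for p, mu in candidates.items()}
print(hap_p3, nna_p3, decisions)
```

With these inputs every candidate is accepted under HAPr, since each accuracy exceeds 0.8. A hypothetical candidate with accuracy 0.7 would be rejected by HAPr but accepted by HAP80%r, whose threshold is 0.64.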

Similarly, we calculate the HAPr and NNAr measures for the top-ranked peers of the different acquaintance policies, as detailed in Table 4.2. The results are given in Table 4.3:

Policy    HPLa           HAPa     NNAa
pj        p7     p5      p4       p6
HAPr      1.0    0.9     0.92     0.95
NNAr      1.9    1.44    2.484    2.755

Table 4.3: Replacement policies evaluation measures

According to all the suggested policies, p3 would be replaced by any of the candidate peers. The NNAr replacement policy ranks p6, retrieved by the NNAa acquaintance policy, first, while the HAPr replacement policy ranks p7, retrieved by the HPLa acquaintance policy, first. As we saw in Example 8, replacing p3 with p7 is a bad option: although it improves the self-interest value of p1, it disconnects a group of other peers, which become non-reachable. As for p6, we compute CIV before and after the replacement, as follows:

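$$\begin{aligned}
\mathrm{CIV}_{pre} = \min\{ & \max\{0.8 \ast 1 \ast 0.8,\; 0.8 \ast 1 \ast 0.9 \ast 1 \ast 0.9\}, \\
& 0.8,\; 1 \ast 0.8, \\
& 0.8 \ast 1 \ast 0.9 \ast 0.9, \\
& 0.8 \ast 1 \ast 0.9, \\
& 0.8 \ast 1 \ast 0.9 \ast 1 \} = 0.576
\end{aligned} \qquad (4.12)$$

$$\begin{aligned}
\mathrm{CIV}_{post} = \min\{ & \max\{0.95 \ast 1 \ast 0.9,\; 0.95 \ast 0.6 \ast 1 \ast 0.8,\; 0.95 \ast 0.9 \ast 0.6 \ast 1 \ast 0.8\}, \\
& \max\{0.95 \ast 0.6,\; 0.95 \ast 0.9 \ast 0.6\}, \\
& \max\{0.95 \ast 0.6 \ast 1,\; 0.95 \ast 0.9 \ast 0.6 \ast 1\}, \\
& 0.95, \\
& 0.95 \ast 1 \} = 0.57
\end{aligned} \qquad (4.13)$$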

We see that replacing p3 with p6 improves the self-interest value, yet spoils the cooperative-interest value. However, CIV decreases only slightly, so this option may be considered a good one.
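The structure of the CIV computation after the replacement (Equation 4.13) can be sketched as a minimum over peers of the maximum preservation over the paths reaching each peer. The sketch below only transcribes the terms of Equation 4.13; the list layout and the function name `civ` are assumptions made for illustration.

```python
import math

# Per-peer lists of path preservations after replacing p3 with p6,
# transcribed term by term from Eq. 4.13.
paths_post = [
    [0.95 * 1 * 0.9, 0.95 * 0.6 * 1 * 0.8, 0.95 * 0.9 * 0.6 * 1 * 0.8],
    [0.95 * 0.6, 0.95 * 0.9 * 0.6],
    [0.95 * 0.6 * 1, 0.95 * 0.9 * 0.6 * 1],
    [0.95],
    [0.95 * 1],
]

def civ(paths_per_peer):
    """Worst peer's best path: min over peers of max over that peer's paths."""
    return min(max(paths) for paths in paths_per_peer)

civ_post = civ(paths_post)
print(round(civ_post, 3))
```

The minimum is attained by the peers whose best path has preservation 0.95 ∗ 0.6, which matches the value reported in Equation 4.13.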

In Chapter 5 we simulate a PDMS network and evaluate the performance of our different acquaintance and replacement policies.

Chapter 5

Experiments

In this chapter we describe the simulation of a PDMS implemented according to our model. We run experiments using various combinations of acquaintance and replacement algorithms and compare their results. This chapter is organized as follows: Section 5.1 describes the simulation architecture. Next, in Section 5.2 we give details on the data and parameters used in our simulation and the methods we used to generate them. Then, in Section 5.3 we present the setup for our experiments. In Section 5.4 we detail the measures used to evaluate our results. Section 5.5 summarizes the results of our experiments and finally, in Section 5.6 we provide a brief discussion and the main conclusions regarding our results.

5.1 Simulation Architecture

Figure 5.1 illustrates our simulation architecture. We now provide a detailed descrip-

tion of its components.


Figure 5.1: Simulation Model: domain, schemata, and query sets.

Configuration: The input of the simulation is a configuration file, allowing us to control the distribution parameters, the number of domains and peers, the average sizes of query sets and queries, and additional simulation parameters.

Domains: Domains are a representation of the set of concepts existing in the world. We generate an assortment of elements representing available concepts. In this chapter, we run experiments with a single domain.

Schemata: A schema is a set of elements representing a peer’s knowledge base as exposed to other peers. Schemata are composed of a selection of elements from a single or multiple semantic domains.

Queries: Each peer is assigned a set of queries that it issues to the network. Queries are generated using elements from peers’ schemata.


Schema mappings: A schema mapping is composed of a set of element mappings

and their associated accuracies. We generate mappings between all element pairs in

source and target schemata, and then calculate maximum accuracy matches between

the schemata.

Query translation: A query translation represents a query reformulation from a

source peer schema to a target peer schema. Using schema mappings, we generate

query translations between all peers and calculate their corresponding accuracies.

Query translations supply the accuracies (edge weights) for the semantic matchings

layer of the network.

Topology: A topology represents the semantic network layer of the system. By specifying a neighbor list for each peer, a topology simulates the semantic connections between the peers. A topology spans a partial network (subgraph) of a fully connected network (clique) that respects the limit on the number of neighbors (out-degree) of each peer. Queries are executed over an initial topology, and their results are analyzed and used to adjust the topology according to the various policies discussed in this thesis.

Query sequence generator: A query sequence generator generates a sequence of

pairs (peer, query), representing a serialization of parallel queries issued by peers in

the network. A generated sequence will be used to run queries in a given order for a

variety of different settings.


Figure 5.2: Simulation Model: semantic topology and query translation layers.

Query process generator: A query process generator takes a given pair (peer, query) and a topology, and generates a query process over this topology, using a probabilistic hop-count mechanism to limit the query span. The query process uses the corresponding query translations to calculate preservation along query paths. It results in a list of non-neighbor candidate peers with their corresponding query measures, such as path length and path preservation.

Acquaintance algorithm: An acquaintance algorithm takes a list of potential

candidate neighbors and ranks them according to some acquaintance policy. Top-

ranked peers are selected as candidates for replacement.


Replacement algorithm: A replacement algorithm takes a candidate peer and a list of neighbors and evaluates them according to some replacement policy. The replacement algorithm may modify a given topology by redirecting a link from an existing neighbor to the new one.

Figure 5.3: Simulation Model: sequence diagram of a single query cycle.

Figure 5.3 presents a sequence diagram for a single query cycle. First, a single peer issues a single query to the network. The query process results in a list of non-neighbor peers and their associated measurements, as required by the acquaintance and replacement algorithms. On this list, we run an acquaintance algorithm to identify the best matching candidate. We then run a replacement algorithm with the selected candidate peer and the list of the query issuer’s neighbors, and adjust the current topology according to the algorithm’s decision.


5.2 Data and parameters

General configuration: Table 5.1 summarizes a set of fixed simulation parameters:

Parameter                   Value
Number of domains           1
Attributes per domain       20
Number of peers             25
Maximum queries per peer    3

Table 5.1: Simulation parameters

Attributes and queries distribution: We model the distribution of domain attributes over peers using a Zipf distribution. We rank the attributes and then generate attributes for each peer separately using this ranking and Zipf(4,1). We define a maximal number of queries per peer and assume a uniform distribution Uniform(1,MaxQueries) for the size of a peer’s query set. For each query, we assume that attributes participate according to their rank and Zipf(2,0.75) over the peer’s attributes. Figure 5.4 shows the probability of attributes being selected for a schema and for a single query of a single peer, according to their rank. We assume this model so that most peers share similar attributes and thus can translate queries, while some attributes are rarer and can be translated only over several semantically related peers. The Zipf distribution parameters were set according to the number of attributes in the domain such that the average schema size is 11 and the average query size is 6.

Figure 5.4: Domain attributes probability for participation in peer schemas and queries.

Schema matchings: We generate attribute matching accuracies using two normal distributions: accuracies of correct attribute matchings are distributed N(0.8,0.2), and accuracies of wrong attribute matchings are distributed N(0.2,0.2). Both distributions are trimmed at 1, where all accuracies above 1 are set to 1 (perfect match), and at 0, where all accuracies below 0 are set to 0 (worst match). Figure 5.5 presents the two distributions graphically. We generate schema mappings by constructing a bipartite graph with source schema attributes and target schema attributes as nodes and the generated attribute matching accuracies as edge weights. We calculate a 1:1 matching between two schemata by running the Abest maximum weighted bipartite graph algorithm [41] over the constructed graph; the result of such a matching is a set of attribute mappings. Using these mapping sets and the previously generated accuracies, we calculate query translation accuracies.
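The generation of trimmed accuracies and the 1:1 matching step can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [41]: the second parameter of N(·,·) is treated as a standard deviation, attributes with equal names are treated as correct matches, and the maximum-weight matching is found by exhaustive search, which is only practical for very small schemata. All names are hypothetical.

```python
import itertools
import random

def trimmed_normal(rng, mean, std):
    """Draw from a normal distribution and clip to [0, 1] (std is an assumption)."""
    return min(1.0, max(0.0, rng.gauss(mean, std)))

def generate_accuracies(source, target, rng):
    """Equal attribute names count as a correct match (assumption)."""
    return {
        (s, t): trimmed_normal(rng, 0.8 if s == t else 0.2, 0.2)
        for s in source for t in target
    }

def best_one_to_one(source, target, acc):
    """Exhaustive maximum-weight 1:1 matching over (source, target) pairs."""
    small, large = (source, target) if len(source) <= len(target) else (target, source)
    flip = small is not source  # True when the smaller side is the target schema
    best, best_weight = [], -1.0
    for perm in itertools.permutations(large, len(small)):
        pairs = [(l, s) if flip else (s, l) for s, l in zip(small, perm)]
        weight = sum(acc[p] for p in pairs)
        if weight > best_weight:
            best, best_weight = pairs, weight
    return best, best_weight

rng = random.Random(0)
src, tgt = ["a", "b", "c"], ["b", "c", "d", "a"]
acc = generate_accuracies(src, tgt, rng)
matching, weight = best_one_to_one(src, tgt, acc)
print(matching, round(weight, 3))
```

A polynomial-time assignment algorithm would replace the exhaustive search for schemata of realistic size; the brute-force version is used here only to keep the sketch self-contained.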

Figure 5.5: Attributes mapping accuracies distributions for similar and different attributes.

Topologies: We assume that the network topology follows power-law rules, similar to structures discovered in research on Internet topology [25]. We use the power-law out-degree (PLOD) generator [54] to generate initial network topologies. Figure 5.6 presents peers’ out-degree according to their rank and the corresponding log-log chart for a 25-peer network. The highest ranked peer is connected to about half of the network and the lowest ranked peer is connected to a single neighbor. The log-log chart shows that the topology follows the desired power-law characteristic.

Figure 5.6: Network topology: out-degree vs. peer rank, following a power law.

Simulation sequences: We generate random query sequences using a Uniform(1,peersNumber) distribution to select query issuers and a Uniform(1,peerQueries) distribution to select issued queries. We repeat the process until we reach the desired sequence size.
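A query sequence of this kind can be sketched in a few lines. The function name, the `peer_queries` mapping and the `seed` parameter are assumptions introduced for illustration; the seed makes it possible to replay the same sequence under different policy settings.

```python
import random

def generate_sequence(peer_queries, length, seed=None):
    """peer_queries maps a peer id to the list of queries that peer may issue."""
    rng = random.Random(seed)
    peers = sorted(peer_queries)
    sequence = []
    while len(sequence) < length:
        peer = rng.choice(peers)                # uniform choice of the issuer
        query = rng.choice(peer_queries[peer])  # uniform choice among its queries
        sequence.append((peer, query))
    return sequence

example = generate_sequence({"p1": ["q1", "q2"], "p2": ["q3"]}, 5, seed=1)
print(example)
```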


5.3 Experimental setup

We implemented six acquaintance policies. The first two are LRUa and Historya, taken from the context of file-sharing peer-to-peer systems [62, 68]. The next three are our suggested HPLa, HAPa, and NNAa policies. Last, we implemented a Randoma policy as a baseline for comparison. Additionally, we implemented six replacement policies. The first two, LRUr and Historyr, correspond to the acquaintance policies of the same names. Next are our three policies, HAPr, HAP80%r and NNAr. Once again, we implemented a Randomr policy to be used as a baseline. We implemented our simulation in the Java 2 JDK version 1.4.2 environment, and ran experiments on a laptop with an Intel Centrino Dual core T2300 1.66GHz CPU, 1GB of RAM and the Windows XP Professional OS.

Parameter                  Value
Matching configurations    10
Initial topologies         10
Query sequences            10
Queries per sequence       5000

Table 5.2: Summary of experimental setup parameters

As described in Table 5.2, we generate initial sets of ten matching configurations, ten initial topologies, and ten query sequences. We run experiments, each using a different combination of matching configuration, initial topology, and query sequence, selected from the corresponding generated sets. In the rest of this chapter, reported results are averages over the results of the different experiments. We divide our experiments according to the following three settings:


• “Good” topologies¹: we generate initial optimal semantic topologies and apply different policies for their reorganization.

• “Bad” topologies: we generate initial far-from-optimal semantic topologies and apply different policies for their reorganization.

• Random topologies: we generate random semantic topologies and apply differ-

ent policies for their reorganization.

The first two settings serve as sanity checks, enabling us to examine the sensitivity of different policies and evaluation metrics to extreme situations. In the last setting, we examine the effect of applying different policies to “average”, randomly generated topologies.

Under each setting, we run the following two types of experiments:

• Acquaintance policies comparison using a fixed replacement policy.

• Replacement policies comparison using a fixed acquaintance policy.

5.4 Evaluation

In our experiments we evaluate acquaintance and replacement policies using the fol-

lowing metrics:

¹ We generate “good” and “bad” topologies by setting high and low query translation accuracies between source and target peers. We use an N(0.9,0.05) distribution for high translation accuracies and an N(0.1,0.05) distribution for low translation accuracies, both trimmed at 0 from below and at 1 from above. Good neighbors are associated with high translation accuracies and bad neighbors with low translation accuracies.

Convergence: We measure convergence in terms of steps, i.e., the number of queries after which there is no change in the topology. Formally, we say that a policy converges at step t − d if, for some (threshold) number of queries d with t ≥ d,

$$T(P, M)_t = T(P, M)_{t-1} = \cdots = T(P, M)_{t-d} \qquad (5.1)$$
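The convergence criterion of Equation 5.1 can be sketched as a scan over the recorded topologies. The representation is an assumption: `topologies[t]` is any equality-comparable snapshot of the topology after t queries (index 0 being the initial topology), and the threshold d is assumed to be at least 1.

```python
def convergence_step(topologies, d):
    """Return the earliest step t - d with topologies[t-d..t] all equal, else None."""
    run_start = 0  # first index of the current run of identical topologies
    for t in range(1, len(topologies)):
        if topologies[t] != topologies[t - 1]:
            run_start = t
        if t - run_start >= d:
            return run_start
    return None

history = ["A", "B", "B", "C", "C", "C", "C"]
print(convergence_step(history, 3))
```

A topology that never changes yields step 0, which is how convergence "at 0" is to be read under this sketch.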

Topology changes: We measure effective topology changes performed by a policy.

Formally, given an initial topology $T(P, M)_0$ and n issued queries, we calculate topology changes as:

$$\sum_{t=1}^{n} I_{\{T(P,M)_t \neq T(P,M)_{t-1}\}} \qquad (5.2)$$
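Equation 5.2 counts the steps at which the topology differs from the previous one. A minimal sketch, assuming the same list-of-snapshots representation (index 0 is the initial topology, names are illustrative):

```python
def topology_changes(topologies):
    """Count steps t in 1..n whose topology differs from the one at step t - 1."""
    return sum(1 for prev, cur in zip(topologies, topologies[1:]) if cur != prev)

print(topology_changes(["A", "B", "B", "C"]))
```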

SIV change: We measure the effect of applying policies on the semantic topology

self-interest improvement in terms of change in the SIV measure. Formally, given an

initial topology T (P ,M)0 and n issued queries, we calculate SIV change as:

$$\Delta SIV = \frac{SIV_{T(P,M)_n} - SIV_{T(P,M)_0}}{SIV_{T(P,M)_0}} \qquad (5.3)$$

CIV change: We measure the effect of applying policies on the semantic topology

global welfare improvement in terms of change in the CIV measure. Formally, given

an initial topology T (P ,M)0 and n issued queries, we calculate CIV change as:

$$\Delta CIV = \frac{CIV_{T(P,M)_n} - CIV_{T(P,M)_0}}{CIV_{T(P,M)_0}} \qquad (5.4)$$
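Both the SIV change and the CIV change are relative changes with respect to the initial value. The sketch below is illustrative; the function name is an assumption, and raising an error for a zero initial value is a choice made here, since the formula is undefined in that case. As an example it applies the formula to the CIV values of Example 9 (0.576 before and 0.57 after the replacement).

```python
def relative_change(initial, final):
    """(final - initial) / initial, as in Equations 5.3 and 5.4."""
    if initial == 0:
        raise ValueError("relative change is undefined for a zero initial value")
    return (final - initial) / initial

delta_civ_example = relative_change(0.576, 0.57)
print(f"{delta_civ_example:.2%}")
```

A value of 4.0 corresponds to an improvement of 400%, and a negative value to a deterioration.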

Reachability: We also measure the effect of our policies on the network structure

in the form of changes in the in-degree of peers, reflecting the reachability of peers.

Formally, given an initial topology $T(P, M)_0$ and n issued queries, we calculate the change in each in-degree level 0 ≤ i ≤ |P| − 1 as:

$$\Delta\text{in-degree}(i) = \sum_{p \in T(P,M)_n} I_{\{\text{in-degree}(p) = i\}} - \sum_{p \in T(P,M)_0} I_{\{\text{in-degree}(p) = i\}} \qquad (5.5)$$
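Equation 5.5 compares in-degree histograms of the final and initial topologies. A minimal sketch, assuming a topology is given as a mapping from each peer to the set of its outgoing neighbors, and that every neighbor also appears as a key (names are illustrative):

```python
from collections import Counter

def in_degree_histogram(topology):
    """Count how many peers have each in-degree value."""
    in_deg = {p: 0 for p in topology}
    for neighbors in topology.values():
        for n in neighbors:
            in_deg[n] += 1
    return Counter(in_deg.values())

def in_degree_change(initial, final):
    """Change in the number of peers at each in-degree level 0 .. |P| - 1."""
    h0, hn = in_degree_histogram(initial), in_degree_histogram(final)
    return {i: hn[i] - h0[i] for i in range(len(initial))}

before = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
after = {"a": {"b"}, "b": {"a"}, "c": {"a"}}
print(in_degree_change(before, after))
```

In this toy example peer c loses its only incoming link, so the count at in-degree 0 (non-reachable peers) increases by one.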


5.5 Results

In this section we present the outcome of our experiments, divided into three subsec-

tions according to the settings described in Section 5.3: Section 5.5.1 describes our

results given good initial topologies, in Section 5.5.2 we detail our results given bad

initial topologies, and in Section 5.5.3 we give our results using randomly generated

topologies.

5.5.1 Good Initial Topologies

We have created a set of optimal topologies where each peer maintains high mapping accuracy with its initial neighbors and low mapping accuracy with all the other peers. In addition, we made sure that all query translations from peers that do not issue them maintain low translation accuracies. In this way, we created a topology where each peer is connected to the neighbor set that is best for it, and the transitive connection to other peers is less relevant (since its associated accuracy is low); hence, there is no motivation for peers to change their neighbors. We calculated the average measurements for our good initial topologies, with the following results: the SIV measure value is 0.899556702, close to the average generated mapping accuracy between good neighbors (0.9). The CIV measure value is 0.000113392, reflecting the preservation of the worst query translation paths, and the average CIV measure value is 0.141919149, higher than CIV since it averages rather than minimizes preservation over query translation paths.

Acquaintance Policies Comparison

We have tested the different acquaintance policies using a fixed HAPr replacement policy given a good initial topology. Our results indicate that there were no changes in the topologies and that the metrics did not change. These results fit our setting of initial optimal topologies.

Replacement Policies Comparison

We present a comparison between different replacement policies using fixed HAPa

acquaintance given a good initial topology. We get similar results using other fixed

acquaintance policies.

Figure 5.7: Replacement policies comparison: convergence in initial good topologies

Convergence: Figure 5.7 shows the convergence of different replacement policies given a good initial topology. The LRU and History policies, which replace only peers that do not translate queries, converge at 0 since all the initial neighbors translate all the queries. The HAP and HAP80% policies, which replace peers with low SIV, converge at 0 since all non-neighbors maintain a lower SIV than any neighbor. Random, making coincidental replacements, does not converge and is therefore not shown in the graph. The NNA policy converges after 428 steps. Recall that NNA measures peers according to the sum of their neighbors’ accuracies, so that peers with a larger number of neighbors than others may be ranked higher, causing NNA to make wrong replacement decisions in this case of an initial optimal topology.

Figure 5.8: Replacement policies comparison: topology changes in initial good topologies

Topology changes: Figure 5.8 shows the number of topology changes each policy makes given a good initial topology. Figure 5.8 (left) shows that the Random policy, making coincidental replacements, performs (wrong) changes in 20% of the queries. LRU, History, HAP and HAP80% perform no changes (converging after 0 steps). Figure 5.8 (right) shows the same data excluding the Random policy, and we can see that the NNA policy performs changes in only about 0.06% of the queries.

SIV change: Figure 5.9 shows the SIV change for each policy given a good initial topology. The Random policy, making many wrong replacements, spoils SIV by 74%. The LRU, History, HAP, and HAP80% policies, making no topology changes, do not change SIV. The NNA policy, making around 3 wrong replacements, spoils SIV by 3%. These results fit well with the topology being an SIV-optimal topology, where each change can only spoil SIV.


Figure 5.9: Replacement policies comparison: SIV change in initial good topologies

Figure 5.10: Replacement policies comparison: CIV change in initial good topologies


CIV change: Figure 5.10 shows the CIV change for each policy given a good initial topology. The Random policy, making many wrong replacements, spoils CIV by 58%. The LRU, History, HAP, and HAP80% policies, making no topology changes, do not change CIV. The NNA policy, making around 3 wrong replacements, spoils CIV by 5%. These results fit well with the topology being a CIV-optimal topology, where each change will spoil CIV.

5.5.2 Initial Bad Topologies

We have created a set of far-from-optimal topologies where each peer maintains low mapping accuracy with its initial neighbors and high mapping accuracy with some other non-neighbor peers. In addition, we made sure that all query translations from peers that do not issue them maintain low translation accuracies. In this way, we created a topology where each peer is connected to a neighbor set that is bad for it, and since the translation accuracy of transitive connections is low, peers increase both self-interest value and global welfare by improving their neighbor set. We calculated the average measurements for our bad initial topologies, with the following results: the SIV measure value is 0.10025426, close to the average generated mapping accuracy between bad neighbors (0.1). The CIV measure value is 1.2772E−05, reflecting the preservation of the worst query translation paths, and the average CIV measure value is 0.015854044, higher than CIV since it averages rather than minimizes preservation over query translation paths.


Acquaintance Policies Comparison

We present a comparison between different acquaintance policies using fixed HAPr replacement given a bad initial topology. We get similar results using other fixed replacement policies.

Figure 5.11: Acquaintance policies comparison: convergence in initial bad topologies

Convergence: Figure 5.11 shows the convergence of different acquaintance policies

given a bad initial topology. Unlike convergence for initial optimal topologies, we see

that the topology changes under this setting. Random and NNA policies converge

latest at around 3500 steps, History and HPL before them at around 3000 steps, and

LRU and HAP converge fastest at around 2000 steps.

Topology changes: Figure 5.12 shows the number of topology changes each ac-

quaintance policy makes given a bad initial topology. Random, HPL and NNA policies

lead to topology changes in about 2.1% of the queries, and LRU, History and HAP

policies lead to topology changes in about 1.8% of the queries.

Figure 5.12: Acquaintance policies comparison: topology changes in initial bad topologies

SIV change: Figure 5.13 shows the SIV change for each policy given a bad initial topology. Since we used the HAPr replacement policy, which makes replacements according to their contribution to the peer’s self-interest, the results follow the same pattern as the number of replacements. The Random, HPL and NNA policies improve SIV by about 415%, and the LRU, History and HAP policies improve SIV by about 340%.

Figure 5.13: Acquaintance policies comparison: SIV change in initial bad topologies

Figure 5.14: Acquaintance policies comparison: CIV change in initial bad topologies

CIV change: Figure 5.14 shows the CIV change for each policy given a bad initial topology. Recall that we generated a bad topology by connecting each peer with a set of badly mapped neighbors and by generating mapping accuracies such that global welfare may be achieved by each peer connecting with the set of its well mapped neighbors, i.e., by improving its self-interest value. However, the chart shows that all the policies, although improving SIV, spoil CIV. In fact, all the policies achieve a final CIV of 0. Recall that CIV is sensitive to peers with poor mappings, and assigns a value of 0 to bad topologies containing non-reachable peers. Initial topologies containing a non-reachable peer have a CIV of 0 and maintain this value, and initially reachable topologies that lose reachability during self-organization reach a CIV of 0 as well.


Figure 5.15: Acquaintance policies comparison: reachability change in initial bad topologies

Reachability change: Figure 5.15 shows the reachability change for each policy given a bad initial topology. We see that the change in the number of non-reachable peers is positive (around 20% of the peers), and hence the final topologies include non-reachable peers. These results reinforce our conclusions about the CIV change results.

Figure 5.16: Acquaintance policies comparison: average CIV measure change in initial bad topologies


Average CIV change: We use the average CIV measure defined in Chapter 2 for comparison with the CIV results, since it is less sensitive to topologies containing non-reachable peers. Figure 5.16 shows the change in average CIV given a bad initial topology. The results are similar to the SIV change results, which makes sense, as we generated the initial topology such that an improvement in self-interest goes together with an improvement in global welfare. In the rest of this chapter, we shall use the average CIV measure alongside our original CIV measure to represent change in global welfare.


Replacement Policies Comparison

We present a comparison between different replacement policies using fixed HAPa acquaintance given a bad initial topology. We get similar results using other fixed acquaintance policies.

Figure 5.17: Replacement policies comparison: convergence in initial bad topologies

Convergence: Figure 5.17 shows the convergence of different replacement policies given a bad initial topology. The LRU and History policies, which replace until all peers can translate queries with positive accuracy, converge fastest, at around 275 and 172 steps respectively. HAP converges next at about 1900 steps, and NNA after it at about 2470 steps. HAP80%, which replaces according to a weakened SIV criterion, does not really converge. This policy is exposed to repetitive replacements of peers with close SIV values, resulting in a large number of steps without convergence. Random, replacing coincidentally, does not converge either. Neither policy is presented in the chart.

Figure 5.18: Replacement policies comparison: topology changes in initial bad topologies

Topology changes: Figure 5.18 shows the topology changes each policy makes given a bad initial topology. As the left chart shows, the Random policy, making coincidental replacements, performs changes in about 20% of the queries. HAP80%, exposed to repetitive replacements, performs changes in about 18% of the queries. LRU and History perform the fewest changes (only about 0.05% of the queries). The right chart presents the same data excluding the Random and HAP80% policies, and we see that HAP makes changes in about 1.8% of the queries and NNA slightly more, in about 2.4% of the queries.


Figure 5.19: Replacement policies comparison: SIV change in initial bad topologies

SIV change: Figure 5.19 shows the SIV change for each policy given a bad initial topology. The HAP policy, being self-interest oriented, makes the largest improvement (about 425%). HAP80% and NNA follow with 384% and 344% respectively. Even Random is able to improve SIV by almost 100%. LRU and History, on the other hand, do not improve SIV much (9% and 4% respectively). The fact that neither policy considers mapping accuracy makes them inefficient in cases such as this one, where queries are translated with low accuracy. In this case, the LRU and History policies cease making changes once every neighbor can answer all the queries of the peer connected to it, even if the translation is very inaccurate and possibly erroneous.

CIV change: Figure 5.20 shows the CIV change for each policy given a bad initial topology. The chart shows that all the policies, although improving SIV, spoil CIV. The LRU and History policies spoil CIV by only 16% and 8% respectively, and all the other policies spoil CIV by 58%. The reason for these results is that all the final CIV values achieved by the Random, HAP, HAP80% and NNA policies are equal to 0, while for the LRU and History policies only some are equal to 0. The LRU and History policies, performing a small number of replacements, do not spoil reachability in some of the cases, while all the other policies spoil reachability and hence achieve a lower CIV.

Figure 5.20: Replacement policies comparison: CIV change in initial bad topologies

Figure 5.21: Replacement policies comparison: average CIV measure change

Average CIV change: Figure 5.21 shows the change in average CIV given a good initial
topology (left chart) and a bad initial topology (right chart). Given a good initial
topology, the Random policy, which makes many coincidental replacements, degrades average
CIV by about 75%, and NNA, which makes few replacements, degrades it by around 3%. The
other policies make no changes to the topology and therefore do not change average CIV.
Given a bad initial topology, all the policies improve average CIV. However, LRU and
History, being insensitive to translation accuracy, achieve only a minor improvement (12%
and 7%, respectively), while even Random achieves an 87% improvement. HAP, HAP80% and NNA
outperform the other policies and achieve large improvements. HAP, the most self-interest
oriented, achieves a 331% improvement; HAP80%, which compromises on self-interest,
achieves 357%; and NNA, which considers both accuracy and connectivity, achieves the
largest improvement (414%).

Figure 5.22: Replacement policies comparison: reachability change in initial bad topologies

Reachability change: Figure 5.22 shows the reachability change for each policy given a bad
initial topology. Random decreases reachability the most, with 48% of the peers becoming
non-reachable on average. NNA, HAP80% and HAP decrease reachability by making about 36%,
28% and 22% of the peers non-reachable, respectively. LRU and History, which make a small
number of replacements, reduce average reachability by making only about 1.2% and 0.7% of
the peers non-reachable, meaning that in many cases reachability is not harmed at all.
These results support our explanation of the CIV change results.

5.5.3 Randomly Generated Topologies

This section compares the influence of applying different replacement policies on randomly
generated topologies. We generate random semantic topologies using the methodology
described in Section 5.2. We calculated the average measurements for our bad initial
topologies, with the following results: the SIV measure value is 0.620427669; the CIV
measure value is 0.014513132, reflecting the preservation of the worst query translation
paths; and the average CIV measure value is 0.254284597, higher than CIV since it averages
rather than minimizes preservation over query translation paths.
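The gap between the two measures follows directly from how they aggregate over paths, which the following minimal sketch illustrates. It is not the formal definition of CIV or average CIV: the function names and the per-path preservation values are hypothetical, and the sketch only contrasts taking the minimum with taking the mean.

```python
from statistics import mean


def worst_path_preservation(path_preservations):
    """CIV-style aggregation: the lowest preservation over all paths."""
    return min(path_preservations)


def average_path_preservation(path_preservations):
    """Average-CIV-style aggregation: the mean preservation over all paths."""
    return mean(path_preservations)


# Hypothetical preservation values for four query translation paths.
paths = [0.75, 0.5, 0.25, 0.0625]

print(worst_path_preservation(paths))    # 0.0625
print(average_path_preservation(paths))  # 0.390625
```

A single poorly preserved path pulls the minimum close to 0 while barely moving the mean, so the averaged measure can never be lower than the minimum-based one.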


We first compare different acquaintance policies under fixed replacement policies, given a
randomly generated topology. Where the results are similar across the fixed replacement
policies, we present results for only some of them.

Convergence: Figure 5.23 shows the number of steps until convergence. Random replacement,
which makes continuous replacements, does not converge in general, and therefore shows no
difference between acquaintance policies. For the other replacement policies, we see
differences in convergence between acquaintance policies but were unable to establish
dominance. We observe that History has low convergence variance, while HAP and NNA have
higher variance.

Figure 5.23: Acquaintance policies comparison: convergence in randomly generated topologies

Figure 5.24: Acquaintance policies comparison: topology changes in randomly generated topologies

Topology changes: Figure 5.24 shows the number of topology changes under each policy. The
number of topology changes performed by the Random replacement policy is unrelated to the
selected acquaintance policy. For the other replacement policies, we were likewise unable
to identify a major impact of any particular acquaintance policy. We do see, however, that
the number of replacements performed by each replacement policy is quite stable across
acquaintance policies. We therefore conclude that acquaintance policies have a lower impact
than replacement policies on the number of topology changes.

Figure 5.25: Acquaintance policies comparison: SIV change in randomly generated topologies

SIV change: Figure 5.25 shows the SIV change for each policy given an initial random
topology. We see that, with respect to SIV change, replacement policies are insensitive to
the choice of acquaintance policy. We also see that HAPr replacement improves SIV more than
Historyr replacement.

CIV change: Figure 5.26 shows the CIV change for each policy given an initial random
topology. Replacement policies are insensitive to the choice of acquaintance policy with
respect to CIV change as well. We also see that HAPr replacement degrades CIV more than
Historyr replacement. These results are also affected by the sensitivity of the CIV measure
to peer reachability.


Figure 5.26: Acquaintance policies comparison: CIV change in randomly generated topologies

Figure 5.27: Acquaintance policies comparison: average CIV change in randomly generated topologies


Average CIV change: Figure 5.27 shows the average CIV change for each policy given an
initial random topology. We notice minor differences in the average CIV change achieved by
the replacement policies. Historyr replacement, which makes fewer replacements, harms
reachability less than HAPr replacement, and hence it does not degrade CIV while HAPr does.

Figure 5.28: Acquaintance policies comparison: reachability change in randomly generated topologies

Reachability change: Figure 5.28 shows the reachability changes for each acquaintance
policy given an initial random topology. We see that reachability depends on the
replacement policy and that the choice of acquaintance policy has no effect on it. We also
see that HAPr replacement changes the topology structure such that the number of peers
reachable by 2-6 peers decreases, while the number of non-reachable peers and the number of
highly mapped peers increase.


We next compare different replacement policies under fixed acquaintance policies, given a
randomly generated topology. Where the results are similar across the fixed acquaintance
policies, we present results for only some of them.

Figure 5.29: Replacement policies comparison: convergence in randomly generated topologies


Convergence: Figure 5.29 shows the number of steps until convergence for the different
policies. The Random and HAP80% policies perform worst and do not converge. This is to be
expected, as both policies perform repetitive replacements, reconnecting with previously
replaced neighbors. History converges fastest, at approximately 1000 steps, and NNA
converges faster than LRU, both between 1500 and 2500 steps. HAP convergence varies between
1300 and 3000 steps, exhibiting inconsistent performance compared to LRU and NNA.

Figure 5.30: Replacement policies comparison: number of topology changes in randomly generated topologies


Topology changes: Figure 5.30 shows the number of topology changes each policy performed.
HAP80%, appearing in the top two graphs, performs the highest number of changes (about 36%
of the queries), three times or more than the second highest (Random, with about 12% of the
queries) and about 60 times more than the lowest (History, with only about 0.6% of the
queries). Averaging the number of changes over the number of peers, we note that HAP80%
performs around 70 changes per peer and Random around 20 changes per peer. These values
are, respectively, higher than and close to the number of potential neighbors per peer
(24), implying repetitive peer replacements. LRU and History perform the lowest number of
changes. Recall that neither policy considers translation accuracy; both consider only
whether a query can be translated through a mapping link, and both stop performing changes
as soon as each peer’s neighbors are able to translate all of the peer’s queries,
regardless of the translation accuracy. HAP and NNA perform around 2 and 3 times more
changes than the former two, but the amount remains rather small (about 2-3% of the number
of queries).
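The inference from changes per peer to repeated replacements is a pigeonhole argument, sketched below. The function name is hypothetical, and the sketch assumes that every change adds exactly one neighbor drawn from the other peers in a 25-peer network; it gives only a lower bound and says nothing about which peers were re-added.

```python
import math


def min_repeated_additions(changes_per_peer, n_peers):
    """Lower bound on how often the most frequently added neighbor was added.

    Assumes each change adds exactly one neighbor drawn from the other
    n_peers - 1 peers (pigeonhole argument).
    """
    potential_neighbors = n_peers - 1
    return math.ceil(changes_per_peer / potential_neighbors)


# Values reported in the text: 25 peers, about 70 (HAP80%) and about 20
# (Random) changes per peer.
print(min_repeated_additions(70, 25))  # 3
print(min_repeated_additions(20, 25))  # 1
```

For HAP80% the bound shows that some neighbor must have been added at least three times. For Random the bound alone does not force a repetition, so the conclusion there rests on the count being close to the 24 potential neighbors rather than on this argument.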

Figure 5.31: Replacement policies comparison: SIV change in random topologies


SIV change: Figure 5.31 shows the change in the SIV measure achieved by each replacement
policy. As expected, HAP replacement, which is a self-interest oriented policy, performs
best and improves SIV by almost 35%, around twice the improvement of the LRU and History
policies. HAP80%, which is more flexible regarding the highest accuracy improvement
threshold, performs slightly worse than HAP, and the NNA policy performs better than LRU
and History but not as well as the HAP oriented policies. Note that the Random policy
barely improves SIV and in fact achieved negative improvement in some of our experiments.
LRU and History, which replace neighbors that are not capable of answering queries with
ones that are, achieve some positive improvement, as expected. Both policies, however, are
vulnerable to a scenario in which some peers can translate all queries but with very low
accuracy, in which case those peers will not be replaced.

Figure 5.32: Replacement policies comparison: CIV change in random topologies

CIV change: Figure 5.32 shows the change in the CIV measure achieved by each replacement
policy. We notice that all the policies degrade CIV by almost the same percentage. This
occurs because all the final CIV values are 0, due to the existence of non-reachable peers
in the topology. The change is not 100% negative because some topologies have a CIV value
of 0 to begin with, i.e., some initial topologies already contain non-reachable peers.
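The arithmetic behind this observation can be sketched as follows. The averaging scheme is an assumption of the example (a topology that starts and ends at 0 is counted as unchanged), and the function names and CIV values are hypothetical; the experiments may aggregate the change differently.

```python
def relative_change(initial, final):
    """Relative change of a measure; a topology that starts at 0 and ends
    at 0 is counted as unchanged (an assumption of this sketch)."""
    if initial == 0:
        return 0.0 if final == 0 else float("inf")
    return (final - initial) / initial


def mean_relative_change(initial_values, final_values):
    """Average the per-topology relative changes."""
    changes = [relative_change(i, f) for i, f in zip(initial_values, final_values)]
    return sum(changes) / len(changes)


# Hypothetical CIV values for four topologies; two already start at 0.
initial_civ = [0.0, 0.25, 0.0, 0.5]
final_civ = [0.0, 0.0, 0.0, 0.0]

print(mean_relative_change(initial_civ, final_civ))  # -0.5
```

With every final value at 0, the average change depends only on the share of topologies that started above 0, which is why different policies end up with almost the same percentage.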

Figure 5.33: Replacement policies comparison: average CIV change in random topologies

Average CIV change: Figure 5.33 shows the change in the CIV measure, using the average over
paths, achieved by each replacement policy. Random performs worst and decreases average CIV
by 65%. HAP, being self-interest oriented, degrades it by 20% and HAP80% by 35%. NNA, which
considers peer connectivity in the form of neighbor count and not only mapping accuracy, is
more cooperation oriented and degrades average CIV by only 10%. LRU and History perform
best, ranging between a slight deterioration and a slight improvement of the measure.

Reachability change: Figure 5.34 shows the reachability changes for each replacement policy
given an initial random topology. We see a phenomenon in which the medium reachability
range decreases while the non-reachable and highly reachable ranges increase. This leads to
a topology with a small group of highly connected peers and some other peers that are
non-reachable, such that global welfare deteriorates. The effect is most pronounced under
Random replacement, followed by NNA and HAP80%. HAP causes fewer changes, and LRU and
History make the least change to the topology.

Figure 5.34: Replacement policies comparison: reachability change in random topologies

Figure 5.35: Replacement policies comparison: topology change visualization

We used the JUNG framework (http://jung.sourceforge.net/) to visualize topology changes
under each policy. Figure 5.35 shows a graphic visualization of an initial network topology
and the corresponding final topologies obtained by applying different replacement policies.
Circles in the graph represent peers and directed edges represent mapping links between
peers. In the initial topology (top left), 2 peers are non-reachable by the other peers.
All the other peers, in the inner circle, are well connected, i.e., each one is a selected
neighbor of more than a single peer. Examining the final History topology (top right), we
see that the number of non-reachable peers (in the outermost circle, close to the figure
edges) increased to 5, meaning that 12% of the peers became non-reachable. Additionally, 3
more peers (another 12% of the peers) moved from being well connected to a state of
reachability hazard, i.e., only a single peer is mapped to them. Next, the final HAP
topology (bottom left) contains 8 non-reachable peers (a change of 24%) and 3 peers in
danger of non-reachability (a change of 12%). Last, the final Random topology (bottom
right) contains 15 non-reachable peers (a change of 52%) and 2 peers in danger of
non-reachability (a change of 8%). We conclude that topology self organization algorithms,
selfishly applied by individual peers, cannot manage the topology structure and maintain
peer reachability, and therefore damage the global welfare of the network.
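The three reachability states used in this discussion (non-reachable, reachability hazard, well connected) correspond to the in-degree of a peer in the directed mapping graph. The sketch below shows one way to count them. It is not the simulator's code: the function name, the edge-list representation and the five-peer example are hypothetical.

```python
from collections import Counter


def classify_reachability(peers, mapping_links):
    """Split peers by how many distinct other peers map to them (in-degree).

    mapping_links is an iterable of (source, target) pairs, meaning that
    source selected target as a neighbor.
    """
    in_degree = Counter(
        target for source, target in set(mapping_links) if source != target
    )
    non_reachable = sorted(p for p in peers if in_degree[p] == 0)
    at_risk = sorted(p for p in peers if in_degree[p] == 1)
    well_connected = sorted(p for p in peers if in_degree[p] > 1)
    return non_reachable, at_risk, well_connected


# Hypothetical five-peer topology.
peers = ["A", "B", "C", "D", "E"]
links = [("A", "B"), ("C", "B"), ("B", "C"), ("D", "C"), ("E", "A")]

non_reachable, at_risk, well_connected = classify_reachability(peers, links)
print(non_reachable)   # ['D', 'E']
print(at_risk)         # ['A']
print(well_connected)  # ['B', 'C']
```

Running the same classification on the initial and final topologies and comparing the sizes of the three groups gives the kind of percentage changes quoted above.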

5.6 Discussion

Based on our results in Section 5.5 we summarize the following conclusions:

Acquaintance Policies

We were unable to clearly identify the influence of applying different acquaintance
policies on the network. We attribute this to the following possible reasons:

• We ran our experiments on a small-scale network (25 peers), leading to small sets of
available candidate non-neighbors discovered during query processing. In the absence of a
large variety of candidates, acquaintance algorithms do not have much impact.

• Our matching configuration generation methodology assigns peers similar schemata and
queries, so the acquaintance algorithms were unable to distinguish between peers.

Figure 5.36: SIV vs. average CIV in randomly generated topologies

Replacement Policies

• We conclude that applying a “rational” replacement policy is effective for semantic
topology self adjustment, as all the policies performed better than the Random policy.

• Our replacement policies were capable of identifying good self-interest oriented
topologies and made small or no changes in the presence of such topologies.

• Our replacement policies were also capable of identifying bad self-interest oriented
topologies, and their use led to an improvement in the peers’ self-interest state, while
policies adopted from file sharing P2P systems were insensitive to translation accuracy and
sometimes failed to identify and improve such topologies.

• On average, our replacement policies provide larger improvements to the self-interest
value of peers in the topology than the other policies. In randomly generated topologies,
our HAPr replacement policy improved SIV by 35%, versus only 18% achieved by LRUr
replacement. In the extreme case of an initial bad topology, HAPr improved SIV by 425%,
whereas LRUr improved it by only 9%.

• Our results indicate a trade-off between the peers’ self-interest and the global welfare
of the network: policies that improve self-interest also degrade global welfare. Figure
5.36 demonstrates this; for the policies across the front, average CIV decreases as SIV
increases.

• We investigated the deterioration of global welfare and found an interesting phenomenon
involving peer reachability. Topologies deteriorate to a state where a small group of peers
remains connected and many peers become non-reachable and cannot answer or propagate
queries (though they can still issue queries), thus damaging global welfare. Topology
deterioration grows with the number of replacements, as the number of reachable peers
available for replacement constantly decreases.

• We conclude that improving global welfare by means of selfish replacement policies is
unrealistic and that cooperative algorithms are required. We leave this topic for future
research.

Chapter 6

Discussion

6.1 Conclusions

We considered the problem of identifying semantic topologies that reduce the uncertainty of
query reformulations in a PDMS. First, we presented a formal model for PDMS, extending
existing models with the concept of mapping accuracy preservation, which reflects the
impact of schema mapping uncertainty on the quality of the query reformulation process. We
demonstrated the influence of the choice of semantic topology on the quality of queries in
the network. Additionally, we proposed measures (SIV, CIV, and average CIV) for the
evaluation of semantic topologies, representing different perspectives of the peers’ local
interest and the network’s global welfare.

Next, we considered the problem of finding optimal semantic topologies in an offline
setting. We presented an efficient algorithm for finding optimal topologies that maximize
the SIV measure. For the problem of finding optimal topologies that maximize the CIV
measure, we proved that it is NP-Complete even in its simplest case, SPSQ
(Single-Peer-Single-Query).


Then, we studied the problem of finding optimal semantic topologies in an online setting.
We proposed a framework for topology self organization by means of self-interested peers
individually applying algorithms for semantic link adjustment. In detail, we introduced the
semantic acquaintance and replacement problems and demonstrated their impact on the
topology self organization process. We also presented several acquaintance (HAPa, HPLa, and
NNAa) and replacement (HAPr, HAP80%r, and NNAr) policies suitable for the context of our
model, and analysed and demonstrated their different characteristics.

Finally, we presented a simulation architecture that we constructed according to our PDMS
model, including a framework for topology self organization using semantic acquaintance and
replacement. We implemented our policies, as well as other policies taken from the field of
file sharing P2P systems, and presented an empirical analysis of the effectiveness of their
application. Our results indicated that policies that consider the uncertainty of mappings
perform better in achieving optimal topologies that maximize the SIV measure. We also
showed a trade-off between the SIV and CIV measures, representing a trade-off between the
peers’ selfish interest and the network’s global welfare. We demonstrated by graphic
visualization the impact of non-collaborative replacement algorithms on the deterioration
of peer reachability, and concluded that cooperative algorithms are required for the self
organization of topologies that maximize the CIV measure.

6.2 Future Work

For future work, we consider the following directions:

• Re-examination of the influence of acquaintance algorithms in larger-scale network
settings.


• Search for new selfish replacement algorithms that increase self-interest while
maintaining global welfare.

• Composition of cooperative protocols for improving the global welfare of the topology.

• Development of an approximate optimal solution for cooperative-interest semantic
topologies and its comparison with online solutions.

• Proof that the problem of finding optimal cooperative-interest semantic topologies that
maximize the average CIV measure is also hard.


Abstract

Peer database systems combine properties of peer-to-peer networks, such as a decentralized environment and the autonomy of network members, with properties of database systems, such as semantic richness and the ability to express complex queries. Each peer in the network manages a local database and exposes a schema that describes this database to the other peers. Information is shared in such a network by iteratively propagating queries among connected peers. One of the prominent advantages of the content-rich model of database systems lies in the ability to use the query languages available in such systems (e.g., SQL, XQuery), whose expressive power allows complex queries to be formulated.

Members of such a network use automatic schema matching techniques to create semantic mappings between the schemas that describe their databases. These mappings serve as the basis for sharing information through queries. Each peer can answer only queries formulated in terms taken from its own schema, that is, terms that exist in its own database. Therefore, given a query formulated in terms of some peer's schema (hereafter the source schema), that peer must translate it into an equivalent query formulated solely in terms of another peer's schema (hereafter the target schema) if it wishes to forward the query to that peer and receive meaningful answers. Such a translation can be performed using a semantic mapping between the source schema and the target schema.

Schema matching is a process that pairs concepts describing information in schemas that differ in structure and/or vocabulary. A schema mapping, the product of this process, describes a translation between concepts in a source schema and concepts with the same meaning in a target schema. Using such a translation, a query expressed with concepts from the source schema can be converted into an equivalent query formulated with concepts from the target schema. Because of its complexity, schema matching was originally performed only by human experts [14, 37]. However, the scale of peer-to-peer networks requires automated methods for performing semantic matching. Automated matching has been shown to carry a certain degree of uncertainty, and the resulting mappings may be imprecise or even wrong.

Previous work [30] proposed a model for assessing the uncertainty inherent in schema matching, in which each mapping is assigned a correctness measure that reflects its reliability as perceived by the matching algorithm. A mapping with a high correctness measure is a reliable mapping, close to the perfect mapping that a human expert would produce. That work also presented, through theoretical and empirical analysis, a family of mappings that behave in an orderly manner, termed "monotonic", for which a high correctness measure can be shown to indeed reflect a reliable mapping. Consequently, algorithms whose mappings satisfy the monotonicity property can be regarded as reliable algorithms that reflect quite accurately the correctness of the mappings they propose.

The presence of uncertain mappings between network members affects the quality of the queries propagated through the network and of the results returned to the querying peer. Queries translated using imprecise mappings may take on a different meaning and return results unrelated to the original query. The set of mappings held by each member is therefore of great importance, since these mappings strongly affect the quality of query propagation in the network. Each peer's set of mappings follows directly from its own choice of neighbors and, more generally, the set of mappings in the network follows from the structure of its connections (its topology). An informed choice of the neighbors connected to each peer can reduce the uncertainty of the mappings between participants and, as a result, reduce the imprecision of query translations and improve the quality of the results obtained.

In this work we address peer database systems in which the mappings between participants carry a certain degree of uncertainty. We assume a dynamic environment in which each peer connects at random to several members when it joins the network and may change and update these connections during its lifetime in the network. Given such an environment, the main questions we pose are:

• Can "good" topologies, those that reduce the level of uncertainty in the network, be found efficiently?

• Can such "good" topologies be reached through self-organization carried out by network members that act independently and on the basis of self-interest?

We divide the thesis into four main parts. In the first part we present a formal model of peer database systems that accounts for mapping uncertainty and its effect on queries in the network. In the second part we discuss the problem of finding optimal semantic topologies given full knowledge of the network, considering both topologies that maximize a peer's self-interest and topologies that maximize the overall quality of queries in the network. In the third part we discuss self-organization of semantic topologies and the attempt to reach optimal topologies without full knowledge of the network; we present methods for self-organization and address the problems of identifying and selecting good neighbors in the network. In the last part we present an experiment that simulates databases in a peer-to-peer network conforming to the model we presented, and we report results comparing different policies for self-organization of semantic topologies.

Model definition:

Our model extends existing models [6] and offers a formal account of how the uncertainty present in semantic mappings affects the quality of queries in the network. We propose a measure of how well mapping correctness is preserved along a path of consecutive mappings, as occurs during query translation in the network. This measure quantifies the decay of translation correctness as a query advances along its translation path, a phenomenon that can degrade the quality of the returned results. Using the translation correctness measure and the correctness preservation measure, we propose measures for evaluating the structure (topology) of a semantic network. These measures capture the quality of query translation in the network both from each member's individual point of view and from the point of view of a common interest.

Finding optimal topologies given full network knowledge:

We are interested in the offline problem of finding optimal topologies as a baseline for evaluating the topologies obtained through online self-organization. We present an efficient algorithm for finding topologies that maximize the measure representing the self-interest of each member of the network. In contrast, we show that finding topologies that maximize the measure representing the common interest of the network's members is a hard problem. We classify the problem into several cases and show that, even in the simplest case, the problem cannot be solved in polynomial time.

Self-organization of semantic topologies:

In the absence of central management in peer-to-peer networks, no member has full knowledge of the whole network; we therefore focus on the online problem of finding optimal topologies under this limitation. We propose a framework for self-organization of topologies that uses heuristic methods to identify and select neighbors in the semantic network. Within this framework we present the acquaintance problem, which concerns identifying potential neighbors that are semantically close, and the replacement problem, which concerns preferring certain members over others according to semantic proximity measures. For each of these problems we propose and analyze several possible implementation methods.

Results:

We present a simulation environment that we developed for databases over a peer-to-peer network conforming to our model. In this environment we implemented the heuristic methods we proposed for identifying and selecting neighbors in self-organizing semantic topologies, as well as several methods borrowed from peer-to-peer file sharing. Our experiments show that self-organization methods are effective at finding semantic topologies that maximize the self-interest value of each member of the network, and that methods that account for the uncertainty in semantic mappings outperform other methods for this purpose. We also show that changing the network to improve each member's self-interest measure may conflict with, and harm, the common-interest measure and the overall quality of query translation in the network. Finally, we present an analysis showing that self-organization heuristics applied independently by network members impair the overall reachability of members in the network and therefore also the network's ability to translate queries. It follows that a shared protocol is required in order to reach topologies that maximize the common interest through dynamic organization of the network.
