Faculty of Science and Technology
Ph.D. Thesis 2011
Semantic Service Discovery in the Service Ecosystem
Christian Werner Prokopp, BSc (Hons), MCom
N6201393
Principal Supervisor
Professor Peter Bruza
Associate Supervisor
Professor Alistair Barros
Keywords
Semantic Space, Vector Space Model, Web Service, Conceptual Space, Ecosystem,
Categorization, Clustering, Machine Learning, Information Retrieval, Text Mining,
Text Classification, Inverted Index
Abstract
Electronic services are a leitmotif in ‘hot’ topics like Software as a Service, Service
Oriented Architecture (SOA), Service-oriented Computing, Cloud Computing,
application markets and smart devices. We propose to consider these in what has
been termed the Service Ecosystem (SES). The SES encompasses all levels of
electronic services and their interaction, with human consumption and initiation on
its periphery in much the same way the ‘Web’ describes a plethora of technologies
that together connect information and expose it to humans.
Presently, the SES is heterogeneous, fragmented and confined to semi-closed
systems. A key issue hampering the emergence of an integrated SES is Service
Discovery (SD). A SES will be dynamic with areas of structured and unstructured
information within which service providers and ‘lay’ human consumers interact;
until now the two have been disjoint, e.g., SOA-enabled organisations, industries and
domains are choreographed by domain experts or ‘hard-wired’ to smart device
application markets and web applications. In a SES, services are accessible,
comparable and exchangeable by human consumers, closing the gap to the providers.
This requires a new SD with which humans can discover services transparently and
effectively without special knowledge or training. We propose two modes of
discovery: directed search, which follows an agenda, and exploratory search, which
speculatively expands knowledge of an area of interest by means of categories.
Inspired by conceptual space theory from cognitive science, we propose to
implement the modes of discovery using concepts to map a lay consumer’s service
need to terminologically sophisticated descriptions of services. To this end, we
reframe SD as an information retrieval task on the information attached to services,
such as descriptions, reviews, documentation and web sites: the Service
Information Shadow. The Semantic Space model transforms the shadow's
unstructured semantic information into a geometric, concept-like representation. We
introduce an improved and extended Semantic Space that includes categorization,
which we call the Semantic Service Discovery model.
We evaluate our model on a highly relevant, service-related corpus simulating a
Service Information Shadow, including manually constructed complex service
agendas as well as manual groupings of services. We compare our model against
state-of-the-art information retrieval systems and clustering algorithms. By means of
an extensive series of empirical evaluations, we establish optimal parameter settings
for the semantic space model. The evaluations demonstrate the model’s effectiveness
for SD in terms of retrieval precision over state-of-the-art information retrieval
models (directed search) and the meaningful, automatic categorization of service
related information, which shows potential to form the basis of a useful, cognitively
motivated map of the SES for exploratory search.
Table of Contents
Keywords ...................................................................................................................... I
Abstract ....................................................................................................................... II
Table of Contents ....................................................................................................... IV
List of Figures ............................................................................................................ VI
List of Tables ........................................................................................................... VIII
List of Equations ......................................................................................................... X
List of Abbreviations ................................................................................................ XII
Conventions ............................................................................................................. XIII
Statement of Original Authorship ........................................................................... XIV
Acknowledgements .................................................................................................. XV
1 Introduction .......................................................................................................... 1
1.1 Service Ecosystem ......................................................................................... 1
1.2 Service Discovery .......................................................................................... 9
1.3 Research Questions ...................................................................................... 17
1.4 Contributions ............................................................................................... 18
1.5 Thesis Structure ........................................................................................... 19
2 Literature Review ............................................................................................... 21
2.1 Service Discovery ........................................................................................ 21
2.2 Information Retrieval .................................................................................. 27
2.3 Semantic Spaces .......................................................................................... 40
2.4 Cluster Analysis ........................................................................................... 46
2.5 Discussion .................................................................................................... 55
3 Semantic Service Discovery Model ................................................................... 57
3.1 Semantic Information Shadow .................................................................... 57
3.2 Semantic Space Generation ......................................................................... 58
3.3 Semantic Categorization .............................................................................. 63
3.4 Innovations .................................................................................................. 69
3.5 Modes of Discovery ..................................................................................... 72
3.6 Software Prototype ...................................................................................... 75
3.7 Evaluation .................................................................................................... 78
3.8 Discussion ................................................................................................... 83
4 Semantic Service Discovery Evaluation ............................................................ 85
4.1 SAP ES Wiki as a Service Information Shadow ......................................... 85
4.2 Experimental Evaluation ............................................................................. 89
4.3 Baseline IR systems ..................................................................................... 91
4.4 Results ......................................................................................................... 96
4.5 Discussion ................................................................................................. 109
5 Semantic Service Categorisation Evaluation ................................................... 111
5.1 Experiment ................................................................................................ 111
5.2 Baseline clustering algorithms .................................................................. 119
5.3 Semantic Categorization ............................................................................ 121
5.4 Discussion ................................................................................................. 132
6 Discussion ........................................................................................................ 135
6.1 Service Discovery by Directed Search ...................................................... 136
6.2 Exploring the Space by Semantic Categories ............................................ 138
6.3 Singular factor ........................................................................................... 140
6.4 Link-weight ............................................................................................... 141
6.5 Default Parameters .................................................................................... 141
6.6 Discovery ................................................................................................... 145
7 Future Work ..................................................................................................... 146
7.1 Scientific .................................................................................................... 146
7.2 Applied ...................................................................................................... 148
Conclusion ............................................................................................................... 150
Appendix A SAP ES Wiki Grouping .................................................................. A-1
Appendix B Example Semantic Categorization by Bundles .............................. B-1
Appendix C CLUTO ........................................................................................... C-1
References ..................................................................................................................... i
List of Figures
Figure 1: App sales projection before Apple iPad release ........................................... 4
Figure 2: Emergence of Service Ecosystem ................................................................. 6
Figure 3: USA.gov services section ............................................................................. 7
Figure 4: Directgov.uk homepage ................................................................................ 8
Figure 5: Service consumer to service ....................................................................... 13
Figure 6: SD as an IR task .......................................................................................... 15
Figure 7: Search activities .......................................................................................... 28
Figure 8: A taxonomy of IR systems ......................................................................... 30
Figure 9: Content bearing terms by DF ...................................................................... 33
Figure 10: Classic IR system ...................................................................................... 34
Figure 11: Three levels of cognition .......................................................................... 42
Figure 12: Singular Value Decomposition in Latent Semantic Analysis ................... 43
Figure 13: Steps in Semantic Space generation ......................................................... 59
Figure 14: Example corpus structure ......................................................................... 59
Figure 15: SVD approximation of word co-occurrence matrix M ............................. 61
Figure 16: SS from word co-occurrence matrix (no singular values) ........................ 62
Figure 17: SS from word co-occurrence matrix (with singular values) ..................... 62
Figure 18: Semantic core expand to categories (simplified) ...................................... 65
Figure 19: Tessellation around core concepts (simplified) ........................................ 66
Figure 20: Categories through tessellation example .................................................. 69
Figure 21: Singular Factor in SS generation .............................................................. 70
Figure 22: LDV example ............................................................................................ 71
Figure 23: SSD graphical user interface main screen ................................................ 76
Figure 24: SSD configuration screen ......................................................................... 77
Figure 25: ES Wiki structure ...................................................................................... 87
Figure 26: Example of bundle page (excerpt) ............................................................ 88
Figure 27: Example use-case ...................................................................................... 89
Figure 28: Use-case query results .............................................................................. 97
Figure 29: SSD query results with varying LDV weights ....................................... 100
Figure 30: Improvements in AAR from no to optimal LDV ................................... 101
Figure 31: Singular Factor influence on AAR ......................................................... 102
Figure 32: Improvements from sf=1 to 0.0 and 0.5 ................................................. 103
Figure 33: Difference between unique and frequency queries ................................. 104
Figure 34: Combined Query vs. Text Query ............................................................ 105
Figure 35: Query factors’ influence on AAR ........................................................... 106
Figure 36: SVD reduction to k dimensions .............................................................. 107
Figure 37: Gap ......................................................................................................... 107
Figure 38: Left window ............................................................................................ 108
Figure 39: Right window ......................................................................................... 108
Figure 40: Practical topical structuring of different corpora .................................... 113
Figure 41: Measurement Cardinality Bias ............................................................... 118
Figure 42: Singular Factor and Perspective ............................................................. 120
Figure 43: Link Weight and Perspective .................................................................. 121
Figure 44: Maximum AMI according to perspective and sf for run 1 ..................... 124
Figure 45: Maximum AMI according to density and sf in run 4 ............................. 125
Figure 46: Link-weight results combined from run 1, 2 and 4 ................................ 128
Figure 47: Cut-off result selection from combined runs 1, 2 and 4 ......................... 128
Figure 48: Maximum and Average AMI according to number of categories .......... 129
Figure 49: Interface dummy for search by browsing of categories ......................... 147
Figure 50: Criterion functions by methods .............................................................. C-3
Figure 51: Methods by criterion functions ............................................................... C-4
Figure 52: Perspective and methods ........................................................................ C-4
Figure 53: Perspective and criterion functions......................................................... C-5
List of Tables
Table 1: Boolean term document matrix .................................................................... 31
Table 2: Term Frequency to term document matrix .................................................. 31
Table 3: Contingency table ........................................................................................ 36
Table 4: Term co-occurrence matrix .......................................................................... 38
Table 5: Term co-occurrence matrix with gap ........................................................... 38
Table 6: Local fitness (Equation 16) example for varying densities .......................... 67
Table 7: Fitness example for fixed cluster with changing distance ........................... 68
Table 8: Parameters for Semantic Space and Semantic Categories ........................... 78
Table 9: Comparison of sorting and term weight influence ....................................... 80
Table 10: Window size impact .................................................................... 81
Table 11: Columns to SVD reduction impact ............................................. 81
Table 12: Singular factor impact ................................................................. 82
Table 13: Rows to Columns impact ............................................................ 82
Table 14: Gap impact .................................................................................. 83
Table 15: Top 10 results for TASA/TOEFL SSD ...................................................... 83
Table 16: Use-cases Semantic Space parameters exploratory run ............................. 94
Table 17: Use-cases Semantic Space parameters refinement run ............................... 94
Table 18: SSD optimal query experiments parameters .............................................. 95
Table 19: Significance of results by paired, two tailed t-test ..................................... 98
Table 20: CLUTO - ES Wiki Semantic Space parameters ...................................... 115
Table 21: Semantic Categorization experiments parameter settings ....................... 122
Table 22: Best SC result by perspectives ................................................................. 123
Table 23: Maximum AMI according to distance, density and sf in run 4 ................ 126
Table 24: Maximum AMI for run 4 - Bundles, density to distance at sf=0 ............. 127
Table 25: Maximum AMI for run 4 - Term, density to distance at sf=0.5 .............. 127
Table 26: Semantic category example ..................................................................... 130
Table 27: Wiki Sales bundle group .......................................................................... 131
Table 28: Top results (AMI) for CLUTO and Semantic Categorization ................. 132
Table 29: CLUTO main criterion functions ............................................................. C-2
List of Equations
Equation 1: Zipf's Law ............................................................................................... 32
Equation 2: Inverse Document Frequency ................................................................. 32
Equation 3: TF-IDF of term i in document z for corpus of N .................................... 33
Equation 4: Probabilistic similarity by relevance ratio .............................................. 35
Equation 5: Probabilistic similarity by contingency table ......................................... 36
Equation 6: BM25 ...................................................................................................... 37
Equation 7: Minkowski distance ................................................................................ 39
Equation 8: Euclidean distance .................................................................................. 39
Equation 9: Cosine similarity measure ....................................................................... 39
Equation 10: SVD ...................................................................................................... 44
Equation 11: Truncated SVD ..................................................................................... 44
Equation 12: Row vector as a combination of U and S ............................................. 63
Equation 13: Row vector from U ............................................................................... 63
Equation 14: Term based document vector ................................................................ 63
Equation 15: Sum of similarities ................................................................................ 66
Equation 16: Local Fitness ......................................................................................... 66
Equation 17: Fitness of cluster with medoid c with j members ................................. 67
Equation 18: Term/row vector as a combination of U, S and a scaling factor ........... 70
Equation 19: Linked vector of document ................................................................... 71
Equation 20: Combined query from terms ................................................................. 73
Equation 21: Combined query from objects of different types .................................. 73
Equation 22: Gram-Schmidt algorithm applied for vector negation .......................... 74
Equation 23: Query Factor ......................................................................................... 74
Equation 24: Average Rank ....................................................................................... 91
Equation 25: Adjusted Average Rank ........................................................................ 91
Equation 26: Rand Index .......................................................................................... 115
Equation 27: Adjusted Rand Index .......................................................................... 115
Equation 28: Mutual Information ............................................................................. 116
Equation 29: Probability of random object to be in cluster i ................................... 116
Equation 30: Probability of random object to be in Ui and Vj ................................. 116
Equation 31: Entropy of cluster U ........................................................................... 116
Equation 32: Mutual Information between clustering U and V ............................... 116
Equation 33: Normalized Mutual Information ......................................................... 117
Equation 34: Adjusted Mutual Information ............................................................. 117
List of Abbreviations
AJAX Asynchronous JavaScript and XML
CS Conceptual Space
CSS Cascading Style Sheets
DF Document Frequency
DV Document Vector
HAL Hyperspace Analogue to Language
HTML Hypertext Mark-up Language
IDF Inverse Document Frequency
LDV Linked Document Vector
LSA Latent Semantic Analysis
OASIS Organization for the Advancement of Structured Information Standards
SaaS Software as a Service
SC Semantic Categorization
SD Service Discovery
SES Service Ecosystem
SIS Service Information Shadow
SLVM Structured Link Vector Model
SME Small and Medium Enterprises
SOA Service Oriented Architecture
SS Semantic Space
SSD Semantic Service Discovery
SVD Singular Value Decomposition
SWS Semantic Web Services
TF Term Frequency
TF-IDF Term Frequency-Inverse Document Frequency
TV Term Vector
UDDI Universal Description Discovery and Integration
VSM Vector Space Model
WSDL Web Service Definition Language
WWW World Wide Web
XML Extensible Mark-up Language
Conventions
Vector
| · | Norm/length of a vector
Statement of Original Authorship
“The work contained in this thesis has not been previously submitted to meet
requirements for an award at this or any other higher education institution. To the
best of my knowledge and belief, the thesis contains no material previously
published or written by another person except where due reference is made.”
Signature Date
Acknowledgements
Australian Research Council
This research was supported by the Service Ecosystems Management for
Collaborative Process Improvement project (ARC Linkage Grant LP0669244). We thank the
participants in the project and the partners from Queensland Government Department
of Public Works and SAP Research Brisbane for their feedback and support.
Queensland University of Technology High Performance Computing &
Research Support
We would like to recognize the support of the QUT HPC in providing computing
facilities for the numerous experiments.
The Institute of Cognitive Science (ICS) at University of Colorado at Boulder
The ICS at University of Colorado provided the data for the TASA/TOEFL
experiment.
1 Introduction
Before the rise of modern search engines, the Internet was little more than
CompuServe and AOL to the average person, and FTP, Gopher, Usenet and email to
scientists and sophisticated users. This changed profoundly with the Hypertext
Mark-up Language and the Hypertext Transfer Protocol, resulting in the World Wide
Web (WWW). Initially, information on the WWW was sparse; website addresses were
known to few and were sourced from newsgroups or mailbox listings. The first step
toward wider access to web sites was manually constructed directories such as
Yahoo!. These in turn increased the popularity of the web, and the growing number
of sites eventually required an automated approach: search engines. Initially these
were simple indexes that over time evolved into sophisticated systems, culminating
in advanced search engines like Google and Bing1. From the inception of manually
constructed directories, discoverability became the catalyst for the growth of the
WWW. This gave it unexpected utility despite, or maybe because of, its feral nature.
Anyone was able to publish, search and be found on the web. We are standing before
a similar development today with electronic services. Not long ago, services were
accessible only to specific groups in entrenched systems or domains with special
privileges and expertise. By way of analogy, this is changing with the onset of the
“Service Ecosystem”.
1.1 Service Ecosystem
The Service Ecosystem (SES) is a concept around “services”, much as the WWW is
around the “web”. The SES utilizes existing technologies and networks flexibly to
provide and consume services electronically. It corresponds to business networks and
communities that are global in nature, created for the core purpose of exploiting
services. Aspects of the emergence of a Service Ecosystem are found in Software as
a Service, application marketplaces, business process outsourcing, B2B
integrators, cloud computing and business collaboration networks.
1 See http://www.google.com and http://www.bing.com for more details.
A service is “work done by one person or group that benefits another”2. We extend
this definition to “(physical or electronic) work done by one entity that benefits
another”. We employ the term ‘ecosystem’ to emphasize the main attributes of the
system we describe here. It is unregulated and feral in the sense that anything that
constitutes services and interacts with the system is part of it. This does not prescribe
that parts of the SES (networks, domains and systems exposing and consuming
services) cannot be semi- or even highly regulated, closed and dependent. These
business networks already exist and some organizations and industries will continue
to rely on specific functionalities and trust available in these (semi-)closed
environments.
Entities in the SES consume and provide services ranging from atomic ones to highly
complex service orchestrations that self-adjust with respect to demand and supply.
Demand for a service creates a niche in the ecosystem filled by a provider. If the
demand is great enough, many providers will enter that niche and further
development and diversification will occur until an equilibrium between provider
and consumer is achieved. Similarly, a tapering demand may result in reduction of
service provision. There is no prescription of what types of services there are or will
be, or how to deliver and consume them. This greater flexibility leads to
diversification in the services exposed, in how they are reprovisioned or repurposed,
in how they are channelled and consumed, as well as in the mechanisms for their delivery (Cardoso,
Barros, May, & Kylau, 2010).
1.1.1 Electronic Services
Electronic services are the latest evolution in the drive by computer science to reuse
coded functionality. This started with structured programming followed by software
libraries, object oriented theory, middleware concepts and finally electronic services.
Initially, they were little more than encapsulated procedure calls exposing functionality
by means of the Internet utilizing Internet Protocol, Domain Name Service,
Hypertext Transfer Protocol, eXtensible Markup Language, Simple Object Access
2 See http://wordnetweb.princeton.edu/perl/webwn?s=service for a detailed definition.
Protocol, Representational State Transfer and other tools to anyone, anywhere, at
any time.
1.1.2 Service Oriented Architecture
In the context of SOA, electronic services are the essential, loosely coupled building
blocks with each performing a simple function accessible through a well-defined
interface. Orchestration of these electronic services by a (human) designer results in
an application that performs more complex procedures. SOA focuses on
increasing reusability and decreasing redundancy, and aims particularly at large
organisations facing these issues. Services described by an open standard like the
Web Service Description Language (WSDL) are consumable by anyone adhering to
the standard, which theoretically enables service provision and consumption between
previously separated entities, e.g., outside departments or even organisations. This
development led to commoditizing services (Barros & Dumas, 2006) and the
inception of Service-oriented Computing (Papazoglou, Traverso, Dustdar, &
Leymann, 2008), which in combination with cloud computing lowered the technical
and capital investment hurdle for service providers, intermediaries, stakeholders and
consumers. Unfortunately, SOA originated as a functional approach to software
reusability and lacked business aspects like service level agreements, payment,
advertisement, orchestration, discoverability or bundling.
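The building-block character of SOA described above can be sketched in a few lines of code. The sketch below is purely illustrative and not from the thesis: the services, names and rates are hypothetical, and real electronic services would be invoked over a network (e.g., via WSDL/SOAP or REST) rather than as local functions. Each "service" exposes a simple, well-defined interface, and a designer-written orchestration composes them into a more complex procedure.

```python
# Illustrative sketch only: hypothetical services standing in for
# network-accessible electronic services with well-defined interfaces.

def currency_service(amount, rate):
    """A simple service: convert an amount using a given exchange rate."""
    return amount * rate

def tax_service(amount, tax_rate=0.1):
    """Another simple service: add tax to an amount."""
    return amount * (1 + tax_rate)

def quote_orchestration(amount, rate):
    """A designer-composed orchestration: convert a price, then apply tax.

    Neither underlying service knows about the other; the orchestration
    alone wires their interfaces together, which is the loose coupling
    SOA aims for.
    """
    converted = currency_service(amount, rate)
    return tax_service(converted)

print(round(quote_orchestration(100, 0.75), 2))  # 82.5
```

Because each service is self-contained, a provider could replace, reprovision or reuse any one of them without changing the others, only the orchestration layer needs to know the interfaces.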
1.1.3 Software as a Service
A different approach is Software as a Service (SaaS), which, instead of providing
commoditized services, aims to provide complete software solutions for common
tasks online. This concept, too, has a long history, ranging from early timesharing to
application service providers. The ubiquity of data communication, constantly
increasing bandwidth, the penetration of every aspect of business by data processing
devices with the associated cost savings, as well as the instant readiness of (virtual)
computing hardware through cloud computing make SaaS a successful business
model. It outsources risk, expertise and capital investment, and at the same time gives
access to standardized and function-rich software. An example of this is Salesforce3
3 See http://www.salesforce.com and http://appexchange.salesforce.com/home for more.
that successfully penetrated the small to medium enterprise market for customer
relationship management.
1.1.4 Application Marketplaces
In between SOA and SaaS lies the application market. Salesforce provides
AppExchange3 where third parties can sell applications integrated with the
Salesforce Customer Relationship Management solution. Salesforce uses a
community driven approach to provide a fertile, flexible and innovative source for
applications. Depending on their complexity, such applications constitute a complex
service or simple software.
Figure 1: App sales projection before Apple iPad release
Currently, device dependent application markets, e.g., iTunes Store, Windows Phone
Marketplace and Android Market4, together with smart phones and devices are
becoming a prominent channel to deliver applications and services to private users
(Figure 1). They build communities that provide, consume and evaluate
applications, limited only by the operators’ regulations and technologies. Some
examples of these applications are flight booking, restaurant guides or online
banking. Many applications are simply composite services focusing on a single
service need, a part of a greater agenda, e.g., booking a flight for a holiday. They
differ from SaaS in that they aim at private consumers, have a smaller set of
functions and have platform specific deliver mechanisms. At the same time,
4 See http://Android.com/market, http://marketplace.windowsphone.com, Apple.com/iphone/apps-for-iphone and Apple.com/ipad/apps-for-ipad/ for more.
[Figure 1 data: app sales of 2.5 billion (2009), 4.5 billion (2010 estimated) and 21.6 billion (2013 estimated); app revenue of $4.2 billion, $6.8 billion and $29.5 billion respectively]
increasingly professional applications are becoming available, making SaaS
functionality accessible on smart devices. The line between SaaS and applications is
blurring. The International Data Corporation predicts in a market analysis (Ellison,
2010) that providers will make every conceivable service available as apps. They
forecast that the app market will grow by 60% annually between 2010 and 2014 to 76.9
billion dollars5.
These marketplaces are entering the personal computer market, e.g., with the Mac App
Store6. Today’s platform dependencies and restrictions should be a temporary
phenomenon given increasing pressure from open standards like HTML5, CSS3,
WebM7 and AJAX as platform-independent web-service/-application frontends. The
first evidence of this move came recently (October 2010) with the
announcement of the Mozilla Open Web Applications framework8. Eventually, the
type of device with which someone consumes or engages a service will become
largely irrelevant.
1.1.5 In the Cloud
Cloud computing is the other important paradigm shift besides service
commoditization, orchestration and distribution at different levels. Heroku9, a cloud
application and service platform, is a comprehensive example of how the abstraction
of hardware (virtual servers) and software (a service API for Ruby) reduces the
required expertise and manpower while allowing highly flexible
resource allocation and billing. A customer can add and remove vast computing
resources and services, billed on an hourly basis, within seconds and in turn provide
services to their own customers. Cloud computing is fast becoming a conventional
platform with prominent and potent providers like Amazon Web Services and S3
5 See http://www.idc.com/about/viewpressrelease.jsp?containerId=prUS22617910 for more about the analysis.
6 See http://www.apple.com/mac/app-store/ for more.
7 See http://www.w3.org/TR/html5, http://www.w3.org/TR/css3-roadmap and http://www.webMProject.org for details.
8 See https://apps.mozillalabs.com/ for details.
9 See http://www.heroku.com for more details.
storage10, Windows Azure11 and Rackspace Cloud Hosting12. This empowers
individuals and small companies to compete with large organizations, beyond simple
applications in restricted marketplaces, by utilizing highly flexible and cost-effective
computation and data storage facilities. Just as the rise of the World Wide Web
(WWW) dramatically altered the publishing and media industries through the near
cost-free delivery channel it provides, cloud computing in combination with SOA,
SaaS and application markets can do the same for services.
1.1.6 The emergence of the SES
The development towards a SES is already underway in the three market segments of
private users, small and medium-sized enterprises (SME) and (large)
enterprises/organizations. Figure 2 illustrates the converging development from
insulated mainframes and personal computers to app-driven smart devices and SOA-
oriented virtual data and computing centres providing services to other organizations
and private consumers alike.
[Figure 2 labels: over time, standalone systems become networked; software evolves via the WWW to apps on smart devices; closed systems evolve via middleware to SOA; mainframes evolve via servers to virtual infrastructure; all converging on the Service Ecosystem]
Figure 2: Emergence of Service Ecosystem
10 See http://aws.amazon.com/ and https://s3.amazonaws.com/ for more details.
11 See http://www.microsoft.com/windowsazure/windowsazure/ for more details.
12 See http://www.rackspacecloud.com for more details.
SMEs are moving to SaaS platforms like Salesforce and StrikeIron13 (Barros,
Dumas, & P. Bruza, 2005) and, more recently, Google Apps Marketplace14 or SAP
Business ByDesign15. The great benefit for SMEs is that the SaaS business model gives
them access to enterprise features in software and hardware with very
low or no upfront capital investment and pay-what-you-use billing. This contrasts
with previous models of very expensive software licenses, consulting services and
inefficient, large hardware. Gartner16 estimates that SaaS will grow from 10% of the
combined enterprise software markets in 2009 to 16% in 2014, measured by revenue. Of this,
currently 75% is delivered as a cloud service, with the potential to grow to 90% by 2014.
Figure 3: USA.gov services section17
Large organizations such as enterprises or governments also follow the trend towards on-
demand rather than “on-the-premises” software (services), using enterprise solutions
like SAP CRM18, Workday19 or Salesforce. SaaS blurs the line between SME and
enterprise products, and often only packaging, features and support separate them
13 See http://www.salesforce.com and StrikeIron.com for more.
14 See http://www.salesforce.com and StrikeIron.com for more.
15 See http://www.sap.com/sme/solutions/businessmanagement/businessbydesign for more.
16 See http://www.gartner.com/it/page.jsp?id=1406613 for more.
17 See http://www.usa.gov/Citizen/Services.shtml for more.
18 See http://www.sap.com/solutions/business-suite/crm for more.
19 See http://www.workday.com/ for more.
while the core services remain the same. Some very large organizations like
governments invest in “self-made” SOA solutions to break down the barriers between
departmental silos and expose their services, and occasionally data, to
customers/citizens and third parties. Examples are Directgov.uk (Figure 4) from the
United Kingdom and USA.gov (Figure 3) from the United States of America. They
provide composite services, e.g., to pay car tax or renew licenses. These services can
involve a number of small or atomic electronic services orchestrated and exposed to
the citizen via the web. These implementations encompass the lifecycle of the
services offered and consumed, essentially functioning as a domain-specific
platform corresponding to the open commercial alternatives mentioned before.
Services exposed on these government platforms differ from SaaS: the
platforms act more as mediators and backends for applications consuming and
extending their services rather than focusing on complex service delivery and value
adding.
Figure 4: Directgov.uk homepage
In summary, we can observe a convergence of SaaS on the professional side around
software service providers (e.g. SAP, Salesforce CRM or Oracle) and platform
providers (e.g. Salesforce AppExchange or Google Apps Marketplace). Application stores
dominate the private market, with Apple being the leading force (99.4% market share in
2009 according to Gartner4). These stores increasingly expose professional services in
the form of applications and are under pressure from open standards and the threat of
fragmentation. At the end of this development, we anticipate the establishment of a
heterogeneous Service Ecosystem consolidating private, SME and organisational
services and adopting open standards for operator-, platform- and device-independent
deployment and orchestration of services. Everyone from individuals to large
organizations will provide and consume these services, either directly or through
secondary interfaces like the web, apps or software. As with the WWW, there will not be one
technology or system identifiable as the SES; it constitutes the
conceptual framework around transparent service provision and consumption,
agnostic to industry, domain and actor type.
1.2 Service Discovery
The challenges for an open and independent SES are the ones that faced, and still
face, the WWW. How can it guarantee an open platform? How can one find a
suitable service? What is the quality/reliability of such a service? How can providers
and consumers exchange payments, accumulate discounts, advertise, etc.? The SES
also has to address functional challenges like interoperability, runtime, state
awareness and process orchestration. Depending on the use case, the demands
differ, e.g., an enterprise may require a long-running
service interacting with suppliers and customer services over its lifetime, with varying
billing and complex rights management. At the other end of the spectrum, a private
user may query a free, stateless service through an app for information, e.g., a flight
status.
A shared strategic hurdle is the discoverability of services (Papazoglou et al., 2008)
by any type of service consumer from the flood of offerings by individuals, small
companies, corporations and government departments across all domain and
industry boundaries. The development of the WWW has shown that the ability for
anyone to find relevant web sites among an ever-increasing number is a catalyst and
enabler for further development, which in turn leads to more growth and value.
The discovery process is still central to the function of the WWW, as the popularity
of Google shows. The task for service discovery is similarly challenging, ranging
from personal to professional services across all domains and industries. It
encompasses anything from organic shopping to automotive supply chain
management, reflecting the material and the virtual world, and reaches from service
offerings through applications down to atomic services described in domain-specific
ontologies hidden inside a government department, corporate unit or specialised
small service provider.
1.2.1 Traditional Service Discovery
We group existing Service Discovery mechanisms into three areas: interface indices,
communities and ontological deductive systems. We briefly present them and
their limitations here, followed by an introduction to a novel approach that uses statistical
semantics to address the discovery challenges mentioned. A detailed review of
Service Discovery is available in section 2.1 of the Literature Review.
UDDI
The Universal Description, Discovery and Integration (UDDI) specification20 is often
associated with SD. It is an open industry standard (OASIS, 2004a) for a service
registry through which service providers expose information about their business and their
services via functional and meta-descriptions for Service Oriented
Architecture (SOA) software design. We define “functional description” as
information about the service interface and technical details like protocols and data
format, not the purpose of the service! UDDI is not restricted to web or
electronic services, but registries often focus on web services described in the Web
Service Definition Language (WSDL)21. Registries can be fully private, partly restricted or
open, with the ability to publish and replicate data between nodes and privacy levels
(OASIS, 2004a; figure 4). This allows registries to satisfy a variety of needs ranging
from the organizational to the public consumer. The specification permits freely
definable, multiple and overlapping taxonomies, even within the same registry.
UDDI’s shortcomings are two-fold. It addresses SOA problems, and its
tremendous flexibility results in a strong functional orientation with few
restrictions or defined practices. This means that discovery (OASIS, 2004b; chapter
1.6) reduces to matching abstract service interface descriptions, searching according
to a freely defined, shared and implicitly known classification system, or via
20 See http://uddi.xml.org/ and http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uddi-spec for more details.
21 See http://www.w3.org/TR/wsdl for details.
keywords. Keyword search matches the optional description field in the standard’s
tModel, which in practice is often poorly utilized, e.g., in the UDDI Business Registry
(UBR) it was wholly ignored (Bachlechner, Siorpaes, Lausen, & Fensel, 2006). This
resulted in very limited adoption, mostly because plain registries in closed
environments suit expert users or systems that know ahead of a search what
they expect to find.
Web Search
Public alternatives to UDDI are (web) search and communities. Web search engines
can find WSDL files describing web services, e.g., using the filetype option in Google.
WSDL files, like UDDI, focus on functional aspects and provide only optional
descriptive information. The service designer usually enters the description (like the
rest of the WSDL information), and it is often of poor quality, merely repeating
functional information. Furthermore, searching this information depends on the
search engines’ algorithms, which are little more than full keyword indices, at best
accounting for typographical errors and divergences (Baeza-Yates & Ribeiro-Neto,
2011; chapter 9). Finding a file relating to a service agenda can therefore be
challenging, particularly if the agenda is complex. Additionally, a considerable
number of indexed files are orphans relating to non-existing services.
Communities
In answer to the deficiencies of UDDI, web communities and aggregators, e.g.,
WebserviceX.NET or XMethods22, sprang up, compiling and sometimes extending
service-interface-oriented information with non-functional descriptions, reviews,
marketing, billing and uptime measures. These attempts similarly suffer from simple
keyword search, small size, quality issues and limited reach. Salesforce's
AppExchange23 is an exception in this group, with an active community providing,
consuming and reviewing services. It is exemplary of an
intermediate step in the transition from closed, domain- and function-oriented
systems to the open, environmentally driven Service Ecosystem. It focuses on
22 See http://www.webservicex.net and http://www.xmethods.net/ for details.
23 See http://sites.force.com/AppExchange for details.
the Salesforce CRM domain24 and is open to third parties. Nevertheless, like the other
communities, it uses a simple keyword-based search.
Ontologies
The most formal SD mechanisms proposed use ontologies, the most prominent being
Semantic Web Services (SWS), which extend web services with formal
annotations describing them and their relationships according to a prescribed
ontology. SWS allow reasoning and automatic service orchestration, selection,
optimization, protection and execution. Its drawbacks are that a) everyone in the
system has to agree on, know and understand the predefined ontology, b) the
ontology has to describe the 'service world' precisely, c) providers and consumers
have to be able and willing to describe their services and needs according to the
ontology, and d) deductive inference based on first-order logic does not always
translate into effective search and is computationally expensive (Grüninger, Hull, &
McIlraith, 2008).
1.2.2 Agenda and Service Need
We divide Service Discovery into the How and the What. The technical and functional
aspects are the How, e.g., interoperability and parameter matching (Sanchez &
Sheremetov, 2008). This is not the focus of this work. We are interested in the
problem of matching the What of a service from a conceptual point of view. We
propose that a transformative Service Discovery (SD) extracts and matches,
transparently and effectively, the concepts underlying service provision and service
need.
A complete automated Service Discovery system, of course, has to solve the
comprehensive problem of how to combine services on a technical level, matching
parameters and protocols or even deploying ad-hoc wrappers to enable connectivity. At
the same time, such a system will need to match the conceptual information of
services. This work focuses on conceptual service information matching only. It
is a long way to such a system, and likely many different technologies and insights,
24 Offering 292 services as of 11.10.2010.
including the ones gained in this work, will be necessary to achieve a completely
automatic Service Discovery process.
Figure 5: Service consumer to service
We can assume that a service consumer has an agenda (Figure 5), for instance,
opening a coffee shop. This agenda translates into several service needs, e.g.,
applying for a business number, registering a business name, requesting permits for
footpath usage and so on. Each service need may be fulfilled by one, several or
combined actual services. In a SES, a service consumer has no expert knowledge of
the services available because of their abundance and the ever-changing alternatives.
She faces an increasing breadth of options and decreasing knowledge when
translating an agenda into needs and trying to fulfil them with appropriate services.
Traditional SD like UDDI does not address this since it assumes a service consumer
to be knowledgeable about the service offering, e.g., it expects her to look it up by an
interface, business name, implicitly known classification or keyword(s). Web search
suffers from similar problems, indexing only functional information. Community-
driven discovery is not an option either, unless the service consumer’s service needs
are trivial, fit into the community’s domain and she is able to express her need
in the community’s terminology. An even worse example of specific semantics
is the ontology-driven approach, which adds a potentially opaque layer of
complexity. It requires the service consumer to know her service needs exactly and
then translate them with great precision into the conceptual structure represented by
the ontology.
A service provider trying to describe a service in these systems faces a similar
challenge. In a UDDI, community or search engine scenario, the provider can
describe the functionality well, but the conceptual description of the service is limited to
a free text field and possibly an arbitrary classification. If every provider used the
free text field to add an expressive, humanly comprehensible description, it would
improve SD. The service provider, though, cannot anticipate all
circumstances of the service’s potential use and the associated variations in linguistic
expression. Therefore, as in traditional search, discovering the service is challenging
due to the vocabulary mismatch between the service description and a query description
for a service. If providers were to remedy this with exhaustive descriptions, the result
would be a lack of precision in the retrieved services. In an ontology SD system, the
provider has the burden of describing the service appropriately. If the description is
too general, the service does not rank as an optimal solution in many relevant
situations; at the same time, if the description is specialised,
deductive inference used for discovery may utilize it only rarely. Moreover, the
provider depends on the competence and ability of all users to comprehend and apply
the ontology in the intended fashion and with a shared understanding.
A transparent and effective SD should support a service consumer by suggesting
services relevant to her agenda and allowing her to search in her own words.
Consider the example of opening a coffee shop. It may not be obvious to the consumer
what all the relevant issues are, and hence she cannot translate the agenda into
appropriate queries. In such cases, search is an explorative process. A SD system
therefore has to be flexible and able to anticipate, or at least approximate, a service
need from a poor description. The searcher should not have to perform
semantically challenging tasks like ontological translation or guessing domain-
specific terminologies and keywords when she is potentially unaware of relevant
aspects of the agenda. We imagine a process like “presumptive attainment” (P.
Bruza, Barros, & Kaiser, 2009) may fulfil these requirements. A searcher has some
incomplete knowledge of her needs and makes a correspondingly imprecise query to the
SD system. The results should not only be meaningful, they should enable the
searcher to expand her knowledge with conjecture and informed guesses, permitting
her to refine her understanding of her service need. Such a SD system would be
fundamentally different in the sense that it does not attempt to choose a service by
deductive logic or lookup in a shared terminology but rather by a more abductive
approximation, mimicking human-like associational reasoning in a way similar to
the automatic query expansion used in IR. We therefore consider it useful to reframe SD
as an Information Retrieval task in which the initial query description is potentially
very imprecise.
1.2.3 An Information Retrieval Task
The existing SD paradigms focus on explicit (ontology) or implicit (shared
terminology) formal information about services. For example, keyword-based
systems are traditionally limited to functional information (UDDI, WSDL search)
and implicitly use a shared terminology even when applied beyond functional
descriptions (communities). Communities have more information through richer
descriptions, reviews and comments but exploit it poorly through
simple keyword-matching searches. Ontology-based systems can be very expressive
but require consumer and provider to map information to a preconceived
conceptualization of the world (the ontology). The development towards the SES
provides additional sources of information about services in descriptions, reviews,
comments, relationships/links, documentation and the like, much as we observe
with the service community web sites. We call this loose corpus of service-
related information the Service Information Shadow (SIS). It normally takes the form
of free text and is a rich, naturally occurring source of service information from
consumers and providers, available for the purposes of SD.
We propose to frame the discovery of a service as an Information Retrieval task
(Figure 6). Let the consumer’s agenda description be an incomplete service need
query. Identifying the relevant service is then equivalent to retrieving a service-
associated document.
Figure 6: SD as an IR task
This Information Retrieval challenge (see more about IR in 2.2) appears to be simple
and classical since it is concerned with unstructured text. However, as we stated earlier,
we cannot assume that service descriptions and consumers use the same
terminology. In fact, we have put this aspect forward as a requirement for SD, since it
is unreasonable to demand that a consumer engage the specific terminology of a service
domain she may not be aware of when formulating her agenda-based query. For
example, a user may search for “car insurance”, unaware that the service domain
uses the term “automotive insurance”. Simple traditional pattern-matching search,
as employed in SD registries, is unable to resolve this mismatch. Employing the SIS
may lessen the terminological and vocabulary mismatch between the
consumer and the service description, so we cannot outright dismiss traditional IR systems,
such as modern keyword-matching systems with more advanced matching
algorithms, as a possible solution. Nevertheless, we propose that a SD system
using advanced IR methods to assist the consumer’s presumptive attainment of
knowledge is potentially superior given the described SD challenge.
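The vocabulary mismatch above can be made concrete with a small sketch. Everything in it is invented for illustration: the service descriptions, the identifiers and the association table are not data from any real registry, and the hand-coded associations merely stand in for the statistically derived ones a Semantic Space would supply.

```python
# Minimal sketch of the vocabulary-mismatch problem. All service
# descriptions and associations below are invented examples.

services = {
    "svc-001": "automotive insurance quotes for private vehicles",
    "svc-002": "flight status lookup for major airlines",
}

def keyword_match(query, corpus):
    """Return ids of services whose description contains every query term."""
    terms = query.lower().split()
    return [sid for sid, text in corpus.items()
            if all(t in text.lower().split() for t in terms)]

# Plain keyword search misses the relevant service entirely:
print(keyword_match("car insurance", services))  # -> []

# Hand-coded associations stand in for statistically derived ones:
associations = {"car": ["automotive", "vehicle"]}

def expanded_match(query, corpus):
    """Retry the query with each associated substitute for each term."""
    results = set(keyword_match(query, corpus))
    for term in query.lower().split():
        for alt in associations.get(term, []):
            alt_query = query.lower().replace(term, alt)
            results.update(keyword_match(alt_query, corpus))
    return sorted(results)

print(expanded_match("car insurance", services))  # -> ['svc-001']
```

The expanded search recovers the service the exact match misses, which is precisely the behaviour the statistical approach proposed in this work aims to provide automatically.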
Conceptual space theory (Gärdenfors, 2004) describes conceptual reasoning in a
geometric space defined by quality dimensions, where the distance between concepts
indicates their relatedness. A SD system mimicking such conceptual reasoning could
make human-like inferences when relating a consumer's imprecisely expressed service
need to related concepts, thus promoting more effective retrieval as well as
helping the consumer learn about the problem space around her agenda. Such an
approach is effective in enhancing traditional IR (D. Song & P. D. Bruza, 2003).
Furthermore, the geometric representation of concepts contains prototypical areas of
meaning around which sub-spaces represent conceptual categories. These can
provide the basis of a taxonomy, i.e., an organization of the space based on the
inherent relationships of concepts. This opens the door to a conceptual map of the
SIS with which the user can interact to gain an overview without needing a
detailed query or agenda.
Semantic Spaces (Lowe, 2001) are vector space models sourced from corpora of
unstructured text, akin to primitive computational approximations of conceptual
spaces (McArthur, 2007). They represent documents and terms in a high-dimensional
space, with the distance between vectors simulating their semantic relatedness.
We propose to employ Semantic Spaces for conceptual representation and inference,
inspired by conceptual space theory. The goal is to exploit Semantic Spaces to
promote effective SD.
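The core mechanism of a Semantic Space can be sketched in a few lines. The vectors below are invented toy values over three made-up context dimensions; real systems derive thousands of dimensions from corpus co-occurrence statistics and then reduce them, so this is only an illustration of how vector distance simulates semantic relatedness.

```python
import math

# Toy Semantic Space: each term is a vector over invented context
# dimensions. The numbers are illustrative only, not corpus-derived.

space = {
    "car":        [8.0, 6.0, 0.5],
    "automotive": [7.0, 5.0, 1.0],
    "flight":     [0.5, 1.0, 9.0],
}

def cosine(u, v):
    """Cosine similarity: near 1 for similar directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related terms sit close together in the space:
print(cosine(space["car"], space["automotive"]))  # high (close to 1)
print(cosine(space["car"], space["flight"]))      # low
```

Under this representation, "car" and "automotive" are close even though they share no characters, which is the property exploited to bridge the vocabulary mismatch discussed above.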
1.3 Research Questions
We envisage that, with the growing proliferation of electronic services, consumers
require a mechanism to address their service needs. In the expected vast and feral
SES, a consumer will not be able to anticipate which services will fulfil her service
needs or in which ontology they are described. We conceive Service Discovery as the
process that matches an informally expressed service need to relevant service(s).
There will be a Service Information Shadow providing a rich source of unstructured
information. Our first and foremost hypothesis is that Service Discovery underpinned
by a Semantic Space representation of the Service Information Shadow will
outperform state-of-the-art information retrieval. In addition, we hypothesize that
semantic categories extracted from the semantic space representation of the SES may
assist in exploring the SES.
Research Question 1
We suggest that Semantic Spaces are an effective computational means of
representing conceptual knowledge to promote service discovery. The first research
question is then as follows:
Do Semantic Spaces promote effective service retrieval in a Service
Ecosystem?
Research Question 2
Service Discovery in a service ecosystem can be aided by a map of the ecosystem.
This is important since a consumer may not be able to express her service need, e.g.,
when confronted with an unusual or novel agenda. The map is an abstraction of
the service space and will allow a consumer to orientate herself and refine her service
need. Considering the dynamic nature and size of the SES, such an approach has to be
automated while at the same time producing a map that aligns with the consumer.
We propose that Semantic Categories, derived from prototypical areas in the
semantic space representation of the service ecosystem, provide a meaningful and
effective abstraction, since they are inspired by conceptual space theory and thus may
align with how humans process concepts. The second research question is:
Do Semantic Categories provide an automatic, meaningful and effective map
of the Service Ecosystem for exploration?
1.4 Contributions
Semantic Service Discovery
The first contribution is the inception of the Semantic Service Discovery model,
which employs a Semantic Space over a Service Information Shadow for Service
Discovery. We will ground this model in conceptual space theory.
Semantic Space Innovations
The literature has established the benefit of matrix factorization and dimensional
reduction of a SS for improved semantic representation of terms and concepts. In the
course of this work, we will review some of the parameters in more detail, in
particular the so-called singular values, which the literature has yet to assess to
establish whether they further enhance semantic representations. We also propose to
extend the Semantic Space (SS) model by adding cross-document relationship
information to the traditional vector space model. The SD evaluation experiments
will provide evidence of its value.
Based on conceptual space theory, we will motivate an alternative to traditional
clustering algorithms. Such an algorithm is grounded in cognitive science and
makes use of the inherent structure of a SS. We will evaluate it against state-of-the-art
clustering to provide the foundation for further development of a flexible, human-like
categorization of the SS.
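The prototype-driven categorization motivated here can be sketched as follows. This is a hedged illustration only: in conceptual space theory, categories form around prototypes, partitioning the space into a Voronoi tessellation, and every vector, category label and item below is an invented example rather than part of the algorithm developed later in this thesis.

```python
import math

# Hedged sketch of prototype-based categorization: each category is a
# prototype vector, and every item goes to its nearest prototype
# (a Voronoi tessellation of the space). All values are invented.

prototypes = {
    "finance": [9.0, 1.0],
    "travel":  [1.0, 9.0],
}

items = {
    "tax return service": [8.0, 2.0],
    "flight booking":     [2.0, 8.5],
    "travel insurance":   [5.0, 6.0],
}

def categorize(vec):
    """Assign a vector to the category of the closest prototype."""
    return min(prototypes, key=lambda cat: math.dist(vec, prototypes[cat]))

for name, vec in items.items():
    print(name, "->", categorize(vec))
```

Note that borderline items such as "travel insurance" still receive a single, defensible assignment based purely on geometry, without any externally imposed taxonomy.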
Relevant Data and Experiment
The envisaged SES and the reframing of SD as an IR task pose two unique
challenges. No SES-like SIS exposing all desired qualities exists as a data source in
the literature. Consequently, an experiment measuring and comparing the
performance of SD with such data does not exist either.
We plan to identify a suitable data source and design a SD experiment for the
proposed SD scenario. It will compare our SD model against state-of-the-art IR
systems. The semantic categorization algorithm uses the same data source to assess
its potential to provide a map of the SES as an alternative to state-of-the-art
clustering.
1.5 Thesis Structure
In this chapter, we introduced the notion of the Service Ecosystem. The emergence
of a vast and open SES poses the vital challenge of adding flexibility to Service
Discovery while maintaining effectiveness, which current SD frameworks address
inadequately. We proposed to reframe SD as an IR task. Our hypothesis is that
Semantic Spaces utilizing the SIS form an effective SD system superior to
traditional IR.
The next chapter is a literature review. It highlights current SD trends from
traditional registries to modern ontology-driven approaches. A brief overview of
IR methods precedes an in-depth review of Semantic Space models, including an
overview of conceptual space theory.
Chapter 3 presents the Semantic Service Discovery (SSD) model. We begin with a
detailed review of SS generation and then introduce two extensions to the SS aiming to
improve its performance. Next, we provide details of the semantic categorization
algorithm inspired by conceptual space theory. We also present a novel enhancement
to categorization and clustering in general through typed vectors called
“Perspectives”.
This is followed by a description of the two discovery modes we envision utilizing
the SS. The chapter ends with a description of a software prototype that we
implemented to test the SSD model, including a quantitative evaluation of the
semantic associations generated by the model and software through a well-
established synonym test.
Chapter 4 evaluates the first research question. It introduces a SIS-like corpus for the
following experiments. The effectiveness of the SSD is evaluated by simulating
service need queries of differing quality and measuring how well the SSD model retrieves
relevant related service documents. State-of-the-art IR and alternative SS systems
provide a baseline against which to position the SSD model’s results. We also test the
applicable novel SS features (introduced in chapter 3) to establish their value.
Chapter 5 investigates the second research question by comparing a range of state-of-
the-art clustering methods and criterion functions, as well as Semantic Categorization, an
algorithm inspired by conceptual space theory, against a baseline manual
categorization of the SIS-like corpus. This quantitative evaluation precedes a
qualitative review of the Semantic Categorization results to provide a better
understanding of its potential to map out the service ecosystem.
Chapter 6 discusses the results of the experiments and their outcomes for the research
questions. The thesis then closes with a look at future research to answer the questions
raised by this work and at how to continue the development of the Semantic Space
based Service Discovery model.
2 Literature Review
This chapter provides an overview of Service Discovery frameworks dividing them
into functional, social and ontology types following roughly its historical
development. This provides background to the abilities and limitations of current SD
if we were to apply it to a SES.
The next section focuses on IR systems, particularly on the prominent inverted
index model, which is still widely used in SD because of its balance of simplicity
and effectiveness. Since highly effective implementations are available, making them
candidates for underpinning SD in a SIS, it will be the main comparison baseline for
the SS based SD model.
The last section of this chapter explores Semantic Spaces. We begin the section
with an introduction to conceptual space (CS) theory to motivate the choice of SS.
We then introduce the basic SS model and its factorization by Singular Value
Decomposition. The section closes with a review of some attempts at combining
information structure and SS, followed by an introduction to clustering. Clustering
will form the baseline for the CS-inspired semantic categorization of the SS
introduced later.
2.1 Service Discovery
Current SD systems are generally modest lookup mechanisms to identify services by
name or function parameters in a closed setting like a distinct domain, industry or
department. They are used by expert consumers or simple automated systems with
clearly defined needs and knowledge of the available services, sharing a terminology
either implicitly, e.g., through domain-specific jargon, or explicitly, e.g., through
appropriate documentation. We present here three well-known modes of SD, and
presumptive attainment as an alternative to facilitate discovery in a SES.
2.1.1 Function oriented SD
The initial and most prominent service lookup system is the Universal Description,
Discovery and Integration (UDDI) registry supported by the Organization for the
Advancement of Structured Information Standards (OASIS)25 as an open industry
standard to expose services on the Internet or networks. It was thought of "[...] to
become the de facto standard for web services management on the web" (Sabbouh,
Jolly, Allen, Silvey, & Denning, 2001) and to possibly develop into an equivalent of
search engines for services. Its SOA focus on the functional aspects of services,
using merely service interfaces, arbitrary classifications and keyword lookups
designed for automated systems and expert consumers, prevented this development
(Atkinson, Bostan, Deneva, & Schumacher, 2009; SAP News Desk, 2005). For
example, a search for zip-code may return services containing the keys zip or postal
code but not zipcode (Dong, Halevy, Madhavan, Nemes, & J. Zhang, 2004).
UDDI remains primarily a solution for closed systems (Atkinson et al., 2009). The
termination of the public UDDI Business Registry (UBR) in 2006 showed that UDDI
is mostly a supporting technology for SOA and a failure as a SD system (Atkinson
et al., 2009; Bachlechner et al., 2006).
A similar approach is the adoption of web search engines to retrieve WSDL files.
WSDL is function centric, and service designers rarely make good use of the optional
free-text description in its definition. Web search engines moreover use modest full-
text indices and hypertext link relationships (Baeza-Yates & Ribeiro-Neto, 2011;
Hagemann, Letz, & Vossen, 2007), concentrating on efficiency more than
effectiveness. Without the benefit of link ranking, WSDL file search amounts to a
plain keyword lookup over mostly functional service information. Crawling WSDL
files via Google (Al-Masri & Mahmoud, 2007) returned only 340 services, with 77%
having no or inadequate documentation and descriptions. Many files referred
to inactive or non-existent services.
The effectiveness of UDDI or WSDL search as a SD for a SES is not simply a
“given”. Firstly, the service provider and consumer would have to use a global,
unambiguous, predetermined service terminology that does not exist. Secondly,
consumers require a detailed conception of the need together with the ability to
express and match it through functional parameters or names. This process promotes
a lookup-style search requiring intimate knowledge of the SES from the consumer,
25 See http://www.oasis-open.org for more details.
instead of a more discovery-oriented approach. A range of improvements has been
attempted by employing descriptive and relationship information, clustering, vector
space models and signature matching of the services (Bose, Nayak, & P. Bruza,
2008; Dong et al., 2004; Peng, 2007; Sajjanhar, Hou, & Y. Zhang, 2004; Stroulia &
Wang, 2005; Studholme, Hill, & Hawkes, 1999; Wang & Stroulia, 2003). These
works constitute an encouraging development with some promising results.
However, they all restrict themselves to the limited descriptive information available
within the realms of WSDL and UDDI.
2.1.2 Social-oriented SD
More recently, communities and portals have developed around service provision and
consumption that include functions like discovery, review and marketing
(Bachlechner et al., 2006; Rambold, Kasinger, Lautenbacher, & Bauer, 2009). They
are limited in size, offer a small range of services of varying quality, have domain-
specific foci, and rely primarily on keyword-based search to find services. The open-
platform ones lack size and professional services. Some professional and
platform-related examples are thriving. SalesForce's AppExchange community is
such an example, offering access to a sizable group of potential consumers, the
SalesForce CRM customers, and a stable platform including billing options. It is a
domain-specific solution and demonstrates that a third-party-driven service
environment can provide, consume and manage its own services. It extends the
keyword search to reviews and filters of tags and categories originating from within
the community as a kind of informal, crowd-driven terminology. This enhances the
search process, but at its core it relies on implicitly shared terminologies encoded in
the keywords, tags and categories. Discovery in the sense of finding previously
unknown information remains as unmet as in the UDDI and WSDL scenarios. The
communities are functioning because they focus on domain specifics and the
consumers are knowledgeable about the domain.
2.1.3 Ontology-based SD
The Semantic Web uses ontologies and annotations to describe information and
relationships. A subset of it is “Semantic Web Services” (SWS; McIlraith, Son, & Zeng,
2001), which proposes automatic service discovery, orchestration and invocation
through deduction (Rambold et al., 2009). However, SWS are not broadly used or
standardized despite several years of research (Du, Shin, & Lee, 2008; McIlraith et
al., 2001; Verma & Sheth, 2007) and there is not a single agreed upon ontology (see
OWL-S,26 WSDL-S27 and WSMO28). The European Union, for example, recently
initiated the Semantic Evaluation At Large Scale (SEALS) project29 to develop
evaluations for semantic technologies, their tools and inter-operability, including SWS.
A key problem is that the complexity of an ontology and the demand on its users grow
with the size of the ‘world’ it describes. When the ontology describes a particular
domain, the meaning of the vocabulary is easily conveyed. Once the ontology grows,
synonymy and polysemy become an issue. Synonymy refers to the fact that “[…]
people choose the same key word for a single well-known object less than 20% of
the time” (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Furnas,
Landauer, Gomez, & Dumais, 1983). On the other hand, polysemy refers to the
various meanings a term can have depending on the context in which it is used, e.g.,
the term chip. As a result, an ontology that assigns singular meanings to its
vocabulary acts contrary to natural language: the larger the domain, the greater the
potential for synonymy and polysemy.
The use of terms in ontologies to expand queries has limitations (Voorhees, 1994)
and is cognitively contentious (P. Bruza et al., 2009). This is part of the symbolic
grounding problem (Dietze, Gugliotta, & Domingue, 2008) where the meanings of
symbols or words are dependent on the consumer and the context of use:
[T]asks are highly dependent on the situational context in which they occur,
SWS technology does not explicitly encourage the representation of domain
situations. Moreover, describing the complex notion of a specific situation in
all its facets is a costly task and may never reach semantic completeness.
Simple vector spaces using quality dimensions have been proposed (Gärdenfors,
2004) to contextualize SD and reduce the complexity of the ontology and the
contextual ambiguity. The dimensions are predetermined in the framework and
26 See http://www.ai.sri.com/daml/services/owl-s/ for more.
27 See http://www.w3.org/2005/04/FSWS/Submissions/17/WSDL-S.htm for more.
28 See http://www.wsmo.org for more.
29 See http://www.seals-project.eu and
http://cordis.europa.eu/fetch?CALLER=EN_NEWS&ACTION=D&RCN=31509 for more.
effectively transfer the complexity and grounding problem from the ontology to the
quality dimensions without solving it. Ontology mapping is an alternative approach
to this problem (Pathak, Koul, Caragea, & Honavar, 2005), using smaller, manageable
ontologies for different domains and translating between them. It shifts the ontology
complexity into an equally fraught ontology translation problem.
The Semantic Web can only be useful in SD outside of semi-closed environments
with a general ontology when intermediaries in the form of service brokers take over
the translation/complexity problem (Bachlechner et al., 2006). Current SWS
discovery systems' capabilities are fragmented (Rambold et al., 2009), and even if
complete working systems and annotated services were available, they would not
enable automated orchestration and consumption in their current form.
“[T]he employment of semantic technologies and related tools for service discovery
in pervasive environments comes with a major handicap: the underlying semantic
reasoning is particularly costly in terms of computational resources and not intended
for use in highly dynamic and interactive environments”30 (Mokhtar, Preuveneers,
Georgantas, Issarny, & Berbers, 2007), which makes their efficient and
effective application in a SES highly doubtful.
2.1.4 Presumptive Attainment
The current modes of SD do not help a consumer expand incomplete
knowledge by suggesting information related, and possibly relevant, to the consumer’s
agenda. The introduction argued that this is a central challenge of the Service
Ecosystem in light of its scale, and a catalyst for its inception and operation. A SES
reaches across domain, industry and organisational boundaries, and cannot presume a
shared terminology. Such a SD needs to cover all service domains and make services
discoverable by consumers who lack knowledge of applicable services, their
description and who have an imperfect or incomplete conception of their service
need. A discovery process therefore has to use a kind of inference to extrapolate the
service need from inadequate information. This can be a deductive mechanism
extrapolating an initial inadequate description of a service need. Alternatively,
30 Semantic technologies in this context refer to ontological methods, not the statistical methods we propose as an alternative.
abduction induces appropriate related concepts aligned with the service need. The
distinction between the two is that deduction will infer concepts implied by the
initial service need description, whereas abduction furnishes suggestions of possibly
related concepts. For example, “concept abduction” has recently been proposed to
hypothesize unstated related concepts for the benefit of search on the semantic web
(Colucci, Noia, Sciascio, Mongiello, & Donini, 2004).
Presumptive attainment (P. Bruza et al., 2009) has been proposed as a possible
approach to the abduction of information to extend incomplete knowledge. It states that
a consumer with an agenda but a lack of (complete relevant) knowledge has three
options. The first two are to capitulate or to extend the knowledge to
encompass everything relevant to the agenda. The first is not desirable, and the
second is challenging since the costs are often high, or it is hard to identify what
is relevant and then to ‘learn’ it. There is a third option, however. The consumer can
use conjecture to presume that some information may be relevant to the agenda. It is
important to note that while this could loosely be described as guessing, it is indeed
informed guessing. The difference is important. The consumer is willing to invest in
an action resulting from this information since she identifies it as conceivably
relevant to the agenda in the context of her knowledge.
The rich information surrounding services is valuable to consumers trying to close
an agenda and to identify relevant services. This information is largely unknown and
inaccessible, or too costly to process, for consumers. Semantic Spaces can unearth
latent relationships from this information and make them easily accessible to
consumers searching the SES. The SS does not require a consumer to have complete
or formal knowledge of the agenda to pose a query as a starting point to explore the
space. The consumer can utilize the SS to extract a small set of possibly related
information and services. This facilitates presumptive attainment by the consumer
since she faces a manageable, related and potentially relevant subset of the space,
filtered to match her (incomplete) knowledge of the agenda. Semantic Service
Discovery can therefore provide a mode of discovery in which, with little knowledge, a
consumer can find related information, abduct relevant information and services
unknown at the time of forming the agenda, and ultimately extend her knowledge.
2.2 Information Retrieval
The task of retrieving information arose at the time humans started to write down and
collect information. The first systems organized early scriptures in stone, clay,
papyrus and later paper (Baeza-Yates & Ribeiro-Neto, 2011, chapter 1.1.1). They
evolved into the modern library systems using a combination of alphanumeric and
keyword based indices to make information accessible by reference, removing the
need to search an entire collection sequentially.
There were two major events in recent IR history after this millennia-old
development. The first was electronic data processing and the subsequent
development of evaluation methodologies for IR in the 1950s and 1960s (Cleverdon,
1967; Kent, Berry, Luehrs, & Perry, 1955). From there on, the field matured with
continued improvements (Baeza-Yates & Ribeiro-Neto, 2011; Van Rijsbergen, 1979;
Gerard Salton, 1968, 1983). The second noteworthy event was the rise of the World
Wide Web. A network of loosely related, unstructured information connected by
hyperlinks, provided and consumed by anyone with access to the Internet, is
dependent on IR systems. In no small part, developments in Information Retrieval
(Page, Brin, Motwani, & Winograd, 1999) aided the Internet and the emergence of a
global information society.
In the last decades, the types of electronic data, coded facts, we create, collect, store
and subsequently search have increased. Initially, text was the only type before all
sorts of data followed, like medical, environmental, financial, multimedia, linked and
structured text. IDC31 for example estimated for 2010 that the total electronic data
stored was 1.2 zettabytes, or 1,200,000 petabytes, or 1.2×10^21 bytes. They further
estimate that it will grow 44-fold by 2020. This incomprehensible amount of data
requires sophisticated tools, such as IR systems, to extract information, i.e., unique,
useful and contextualized data, from it.
31 See http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm for more (accessed 25.03.2011).
2.2.1 Two modes of IR
“Exploratory search makes us all pioneers and adventurers in a new world of
information riches awaiting discovery along with new pitfalls and costs.”
(Marchionini, 2006)
The challenge for IR is to find information in the sea of data. Here exists an important
division between modes of IR. On the one hand, IR can be the mere lookup of information
according to rules and keys, as it is in libraries and simple keyword-based IR systems.
Marchionini (2006) identifies an alternative, exploratory search (Figure 7), where
learning and investigation lead to a refinement and re-evaluation of the search in a
feedback loop.
Figure 7: Search activities32
The outcome of this discovery process is not only to find information but also to
acquire knowledge that informs the ongoing search. This kind of
exploratory search is not novel, e.g., O'Day & Jeffries (1993) identified it when
reviewing search styles used by librarians. They noted that in the knowledge
acquisition phase, the librarians using traditional indices occasionally had to learn
domain-specific terminologies and significant entities like places, companies and
persons to fulfil their task. In their opinion, a technique to annotate information items
to give better access to the relationships between them would be greatly beneficial.
This aligns with our suggestion that abductive reasoning over concepts and their
associations can make relationships transparent, with the potential to alleviate the
semantic gap a consumer faces when searching for a service.
32 Based on Marchionini (2006; figure 1)
2.2.2 Taxonomy of IR
In its 60-year-long modern development, IR as a field has diversified. Figure 8
illustrates a modern taxonomy of information retrieval systems. They divide into three
main groups depending on the information content. We are primarily concerned with
the unstructured text in the Service Information Shadow and thus the classical IR
models. The three models are Boolean, Vector and Probabilistic. We suggested
previously that a lack of knowledge about the SES and the underlying terminologies
on the part of the searcher will drive the SD challenge in the future. The resulting
incomplete, unstructured and possibly symbolically mismatched query posed by the
searcher has to be comprehensible to an IR system returning meaningful results. In
the best case, it
would answer a detailed request with the relevant result(s). In the worst case, it
should return at least related information to a vague request providing the searcher
with enough to enhance her understanding of her needs and refine her query in the
mode of exploratory search. Furthermore, the IR system has to be able to adjust to
the flexible nature of a SES with ever-changing service offerings and changes in
semantics in the corpus over time.
Figure 8: A taxonomy of IR systems33
Librarians define a taxonomy reflecting the world of books (and other media) they
organize. They classify a book with this taxonomy independent of whether the index
terms occur or share the same meaning in the books. A library index thus is an
artificial index. Originally, this was no problem since the indexed works covered a
small, slowly changing and expanding body of knowledge provided and consumed by
a small circle of learned people. Today, as we have mentioned, the quantity of
information to index is growing rapidly. As a result, a rigid index that demands that the
searcher and information item conform to it is incompatible with the dynamics of
this information growth. This disqualifies the millennia-old library indices, since
searchers have to learn and adhere to a particular symbol-bound encoding of
information. The semantic burden of transferring a need to a query lies solely with the
searcher, and a change in the index structure, i.e., a change in semantics, would
require all searchers to change their understanding of the index.
33 Based on Modern Information Retrieval (Baeza-Yates & Ribeiro-Neto, 2011, p. 60)
“[U]sing just human generated categories for indexing […] might lead to a
poor search experience, particularly if the users are not specialists with
detailed knowledge of the document collection.” (Baeza-Yates & Ribeiro-Neto,
2011, p. 64)
2.2.3 Basic concepts
We need to introduce some basic IR concepts before we investigate the suitability of
the three main classic IR models to index and search an unstructured text corpus. Let
C = {d1, d2, d3, …, dj} with j the number of documents in a corpus C and dj a document
in the corpus. Let Vj = {t1, t2, t3, …, tk} with k being the number of unique terms in a
document and tk one of the terms. Vj is the vocabulary of dj. Furthermore, let
W = {w1,1, w1,2, …, wk,j} be the weights each term has in each document.
d1 d2 d3
t1 1 1 0
t2 1 1 1
t3 0 1 1
Table 1: Boolean term document matrix
With such knowledge, we can build a matrix with documents as columns and terms
as rows and note their weight in each table cell. The most basic weight is whether a term
occurs (1) or not (0). Table 1 demonstrates such a Boolean term document matrix.
The challenge is to find a more expressive term weight that reflects
how well a term identifies a document. The simplest form is Term Frequency (TF).
Table 2 revisits the previous table; this time, the TF, or number of occurrences
of a term in a document, is the weight noted in the respective table cell. The remaining
problem is that meaningless terms, e.g., ‘the’ or ‘a’, occur frequently in all
documents and overshadow expressive terms.
d1 d2 d3
t1 3 1 0
t2 5 7 1
t3 0 1 4
Table 2: Term Frequency to term document matrix
An expressive term therefore has to distinguish a document from the remainder of
the corpus. So a weight should reflect that a term identifies well with the topic of the
document, is not widely used in the corpus, and is at the same time a probable query
term, i.e., not a typo or an uncommon term. This requires us to understand the
distribution of terms better.
f(i; a, N) = \frac{1/i^a}{\sum_{n=1}^{N} 1/n^a}

Equation 1: Zipf's Law
Empirical evidence about term distribution inspired Zipf’s Law (Zipf, 1935). It states
that the rank of a term in a corpus is inversely proportional to its frequency, following
a power-law distribution (Equation 1). N is the number of words in the language and i
is the rank of the i-th most frequent word. The original Zipf's law used a = 1; in
general, a is chosen according to the corpus. In the simplest case, Zipf’s law is a
harmonic series, e.g., the most frequent term occurs twice as often as the second most
frequent and thrice as often as the third, and so on. For a > 1 the series converges. For
a ≤ 1 the series diverges and the vocabulary grows indefinitely, although progressively
more slowly. It has been shown (Araujo, Navarro, & Ziviani, 1997) that a value for a
between 1.5 and 2.0 fits the natural distribution best.
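As a toy illustration of the harmonic case (a = 1) of Equation 1, the following sketch normalizes the rank frequencies; the function name `zipf_frequencies` and the vocabulary size are ours, not from the thesis:

```python
# Zipfian rank-frequency sketch: with exponent a = 1, the i-th most
# frequent word occurs 1/i times as often as the most frequent word.
def zipf_frequencies(n_words, a=1.0):
    # Unnormalized frequency for each rank 1..n_words.
    raw = [1.0 / (i ** a) for i in range(1, n_words + 1)]
    total = sum(raw)  # the normalizing sum from Equation 1
    return [f / total for f in raw]

freqs = zipf_frequencies(5)
# Rank 1 occurs twice as often as rank 2 and thrice as often as rank 3.
print(round(freqs[0] / freqs[1], 2))  # 2.0
print(round(freqs[0] / freqs[2], 2))  # 3.0
```

For a > 1 the tail decays faster and the normalizing sum converges even as the vocabulary grows, mirroring the convergence remark above.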
idf_i = \log \frac{N}{n_i}

Equation 2: Inverse Document Frequency
A ‘heavy’ or content-bearing term therefore strikes a balance between high and low
Document Frequency (DF; Figure 9), a count of how many documents contain the
term. The TF is a local measure and does not say much about the corpus-wide
distinctiveness of a term. Since we know the power-law distribution of terms, we can
use it to discriminate terms corpus-wide. This can be achieved by the Inverse
Document Frequency or IDF (Gerard Salton, C. S Yang, & Yu, 1975; Sparck-Jones,
1972), the log of the inverse fraction of documents containing the term (Equation 2),
with N being the number of documents in the corpus and ni the number of documents
in which term i occurs. In practice, we often use ni+1 instead of ni to prevent a
division by zero when a term occurs in no document.
Figure 9: Content bearing terms by DF34
The choice of which words to treat as content bearing is usually made by DF,
IDF or TF-IDF (Gerard Salton, 1968), with a term of relatively high frequency
appearing in only a small number of documents receiving a high value. TF-IDF
multiplies the TF35, how often a term i appears in a document z relative to the total
number of terms in z, with the IDF, the log of the total number of
documents N divided by the number of documents containing i (see Equation 3).
tfidf_{i,z} = \frac{f_{i,z}}{|z|} \times \log \frac{N}{n_i}

Equation 3: TF-IDF of term i in document z for a corpus of N documents
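A minimal sketch of Equation 3, assuming documents are simple term lists; the toy corpus and the helper name `tf_idf` are illustrative only:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF per Equation 3: relative frequency of the term in the
    document times the log of the inverse document frequency."""
    tf = doc.count(term) / len(doc)            # f_{i,z} / |z|
    n_i = sum(1 for d in corpus if term in d)  # documents containing i
    idf = math.log(len(corpus) / n_i)          # log(N / n_i)
    return tf * idf

corpus = [["service", "discovery", "semantic"],
          ["service", "oriented", "computing"],
          ["semantic", "space", "model"]]
# "discovery" appears in 1 of 3 documents: tf = 1/3, idf = log(3).
print(round(tf_idf("discovery", corpus[0], corpus), 3))  # 0.366
```

Note the sketch presumes the term occurs somewhere in the corpus; the ni+1 smoothing mentioned above would guard the unseen-term case.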
2.2.4 Basic IR System
The classical IR systems for unstructured text we are about to discuss are structurally
similar, and we provide an overview in Figure 10. A system separates into two main
parts: indexing and querying. Indexing is a one-time or periodic process that converts
the corpus of text into an inverted index. The corpus is parsed and tokenized, building
a vocabulary of indexed terms; various processing steps may be included, e.g.,
removing stop words or stemming words to their grammatical root. The transformed
corpus is converted into a reverse index that is similar to the term document matrices
but more space efficient. The simplest form is a list of indexed terms pointing to lists
of documents containing them. This is comparable to a non-sparse version of the
Boolean term document matrix. These indices can also be more sophisticated: a
term, or even a multi-word phrase, a so-called n-gram, may point to a list of lists,
which point to positional occurrences inside documents and may even include
different weight measures.
34 Based on (Salton, Yang, & Yu, 1974, fig. 7)
35 Note that TF from hereon is different from the previously naïve version of purely counting term occurrence in a document.
Figure 10: Classic IR system. On the indexing side, unstructured documents from the
corpus pass through punctuation filtering, stemming and stop-word removal into a
bag-of-words vocabulary and a reverse index mapping terms to document lists (e.g.,
t1 -> d2, d1). On the querying side, a free-text, Boolean-logic or relevance-selection
query is transformed by extracting query items and grammar, mapping them to the
vocabulary and generating a query representation, which is matched against the
index, and the documents are ranked to return the (best) matches.
The querying side of the IR system takes a query, which in its simplest form would
be a single word, and transforms it into a representation for the IR system. The
transformation can range from a simple exact pattern match to sophisticated lexical
analysis, Boolean grammar or even phrase detection. The result is fed into a
ranking algorithm, which matches it with the index and returns a list of results,
possibly ranked by how well each matches the query.
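The indexing side can be sketched minimally as follows; the naive whitespace tokenization and the helper name `build_inverted_index` are our simplifying assumptions, standing in for the stemming and stop-word steps of a real pipeline:

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each indexed term to the sorted list of document ids
    containing it, mirroring the 't1 -> d2, d1' style term lists."""
    index = defaultdict(set)
    for doc_id, text in enumerate(corpus):
        for term in text.lower().split():  # naive tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

corpus = ["Semantic service discovery",
          "Service oriented computing",
          "Semantic space model"]
index = build_inverted_index(corpus)
print(index["semantic"])  # [0, 2]
print(index["service"])   # [0, 1]
```

A query then touches only the postings lists of its terms instead of scanning the whole corpus, which is the space and time advantage over the full term document matrix.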
2.2.5 Boolean
The Boolean information retrieval model (Lancaster & Fayen, 1974) is the first
under consideration for Service Discovery (utilizing a Service Information
Shadow from a Service Ecosystem). Under the Boolean model, the IR system
indexes a set of documents by noting which terms occur in which documents,
expressible in a Boolean term document matrix. The documents are equivalent to
bags of words, with the position or frequency of the words/terms in the document or
corpus being irrelevant. A searcher can express an information need as a set of
terms combined with the Boolean operators NOT, AND and OR. NOT requires the
subsequent term/expression to not occur (be false) in a document, AND requires both
surrounding terms/expressions to occur (be true) in a document, and OR requires
one of two terms/expressions to occur for a document to be considered relevant.
So, for Table 1, the query q1={t1} would return d1 and d2 as relevant, q2={t1 AND
t3} would return d2, and q3={t1 NOT t3} would return d1. In its basic form, the
Boolean model only knows relevant and irrelevant documents, matching or not
matching a query. This absolute semantics does not allow for partial matching and
nuanced querying. It often results in simple queries and excessively large result
sets.
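The query examples above can be reproduced with plain set operations; the `postings` structure is an illustrative stand-in for an inverted index over Table 1:

```python
# Boolean retrieval over the term document matrix of Table 1,
# representing each term by the set of documents containing it.
postings = {
    "t1": {"d1", "d2"},
    "t2": {"d1", "d2", "d3"},
    "t3": {"d2", "d3"},
}
all_docs = {"d1", "d2", "d3"}

q1 = postings["t1"]                                # {t1}
q2 = postings["t1"] & postings["t3"]               # {t1 AND t3}
q3 = postings["t1"] & (all_docs - postings["t3"])  # {t1 NOT t3}

print(sorted(q1))  # ['d1', 'd2']
print(sorted(q2))  # ['d2']
print(sorted(q3))  # ['d1']
```

Note that each query yields an unranked set: every returned document is equally "relevant", which is exactly the absolute semantics criticized above.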
For Information Retrieval, the Boolean model is ineffective and rarely used in
professional IR settings, except perhaps for patent retrieval. It is, however, popular as
a simple and quick search method. In particular, online presences build on the
prevalent combination of a relational database system and a scripting language. Most
database systems offer automatic full-text index generation, and information
retrieval on these web sites is simply a lookup in these indices. Advanced
implementations may extend this to typographical distance computations, stemming
or basic Boolean logic. We referred to these basic lookup systems as keyword
systems in the previous sections.
2.2.6 Probabilistic
The probabilistic IR framework (S. E. Robertson & Sparck-Jones, 1976) posits that
for an information need there is a set of relevant documents R in C. A query q, a set
of indexed terms, expresses such a need. The challenge is to identify R by q, which
supposedly contains the properties to do so, without the system knowing them.
Consequently, a guess and approximation are necessary, and documents with a certain
probability of being relevant are identified as the answer to q. A document is a vector
of binary term weights, i.e., each dimension represents one indexed term, with a 1/true
if the term is contained in the document and a 0/false otherwise. The consumer can
give feedback to the system identifying (non-)relevant documents and thus improve
the model and answer.
sim(d_j, q) = \frac{P(\vec{d}_j \mid R, q)}{P(\vec{d}_j \mid \overline{R}, q)}

Equation 4: Probabilistic similarity by relevance ratio
The ranking of the results measures the similarity of a document dj to a query q.
Equation 4 illustrates the similarity measure, also known as the relevance ratio
(Baeza-Yates & Ribeiro-Neto, 2011, p. 81). It computes the probability of retrieving
the vector representation of dj given q, divided by its complement, the probability of
the document being non-relevant to the query. Using a contingency
table (Table 3), Equation 4 can alternatively be approximated as Equation 5 (Baeza-
Yates & Ribeiro-Neto, 2011, p. 83) using the Robertson-Sparck Jones equation (S. E.
Robertson & Sparck-Jones, 1976). N is the number of all documents in C, ni is the
number of documents containing ti and ri the number of relevant documents
containing ti, while R is still the number of all documents relevant to the query q.
This approximation assumes R=ri=0 to remove the need for human interaction and
results in a DF-based ranking. The addition of 0.5 to the numerator and denominator
of the equation ensures that the log does not fail for the two extremes of ni=N and ni=0.
                       relevant   non-relevant        total
Documents with ti      ri         ni − ri             ni
Documents without ti   R − ri     N − ni − (R − ri)   N − ni
All documents          R          N − R               N

Table 3: Contingency table
The original probabilistic model and ranking did not take into account term
frequency or document length (long documents are more likely to be relevant since
they regularly contain a larger part of the vocabulary). The modern BM2536 model
(S. Robertson, Zaragoza, & Taylor, 2004) remedies this with a combination of the
earlier BM11 and BM15 models (S. E. Robertson, Walker, Jones, Hancock-Beaulieu,
& Gatford, 1994). BM25 is effectively a weighting scheme utilizing a TF-IDF
variant, document length normalization and two variables to adjust to corpus
features. This introduced a fully automatic ranking independent of consumer
feedback and, in addition, relieved the deficiencies of the original probabilistic model.
sim(d_j, q) \sim \sum_{t_i \in q \cap d_j} \log \frac{N - n_i + 0.5}{n_i + 0.5}

Equation 5: Probabilistic similarity by contingency table
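A sketch of the Equation 5 weighting under the R = ri = 0 approximation; the helper names `rsj_weight` and `rank` are ours:

```python
import math

def rsj_weight(N, n_i):
    """Robertson-Sparck Jones term weight under the R = r_i = 0
    approximation (Equation 5); the 0.5 smoothing keeps the log
    defined for the extremes n_i = 0 and n_i = N."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

def rank(query_terms, doc_terms, doc_freq, N):
    # Sum the weights of the query terms that occur in the document.
    return sum(rsj_weight(N, doc_freq[t])
               for t in query_terms if t in doc_terms)
```

As expected of a DF-based ranking, rarer terms contribute larger weights: with N = 1000, a term in 10 documents outweighs one in 500.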
Equation 6 represents a common BM25 variant. On the right-hand side, the log is
identical to the DF-based ranking from Equation 5. It extends it by TF (fi,j) in the
numerator on the left side of the equation. The denominator includes document
36 BM stands for Best Matching.
normalization by dividing the document's length in number of words (|dj|) by the
average document length. The scalars K1 and b are adjustable factors to fine-tune for
corpus characteristics.
sim(d_j, q) \sim \sum_{t_i \in q \cap d_j} \frac{(K_1 + 1)\, f_{i,j}}{K_1 \left[(1 - b) + b\, \frac{|d_j|}{\mathrm{avg}(|d|)}\right] + f_{i,j}} \times \log \frac{N - n_i + 0.5}{n_i + 0.5}

Equation 6: BM25
The BM25 formula has been very successful, and many state-of-the-art IR systems
employ it or close variants.
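A sketch of the Equation 6 scoring function, assuming documents are term lists; the defaults k1 = 1.2 and b = 0.75 are commonly cited starting values, not values prescribed by the thesis:

```python
import math

def bm25(query, doc, corpus, k1=1.2, b=0.75):
    """BM25 score of doc for query (Equation 6)."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query:
        f = doc.count(term)  # f_{i,j}
        if f == 0:
            continue
        n_i = sum(1 for d in corpus if term in d)
        idf = math.log((N - n_i + 0.5) / (n_i + 0.5))  # as in Equation 5
        norm = k1 * ((1 - b) + b * len(doc) / avg_len)  # length normalization
        score += idf * (k1 + 1) * f / (norm + f)
    return score

corpus = [["semantic", "service", "discovery"],
          ["service", "oriented", "computing"],
          ["semantic", "space", "model"]]
# A document containing the query term outscores one without it.
print(bm25(["discovery"], corpus[0], corpus) >
      bm25(["discovery"], corpus[1], corpus))  # True
```

The saturation in the left factor means repeated occurrences of a term add progressively less, while b controls how strongly longer documents are penalized.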
2.2.7 Vector
The basic vector space model (G. Salton, Wong, & C. S. Yang, 1975) uses a term
document matrix similar to Table 2, with the column vectors representing documents
in a k-dimensional Euclidean space, with k being the number of indexed terms.
Commonly, a non-binary/Boolean term weight is used. The presumption that
documents are topical establishes term context; thus, co-occurrence of terms in a
document implies a shared meaning of the terms, while documents with similar topics
will contain similar terms. Consequently, there are two ways to interpret the matrix.
The similarity between row vectors relates to the similarity between the terms they
represent. Likewise, document column vectors relate to document similarities. A
query containing indexed terms can be represented as a vector in the same k-
dimensional space, and the similarity between the query vector and the document
vectors is used as a measure and ranking of the similarity between the query and each
document.
Term co-occurrence Matrix
The term document matrix assumes that terms are fully independent. An alternative
to this view is the term co-occurrence matrix, a prominent model of which is
Hyperspace Analogue to Language or HAL (Lund & Burgess, 1996). It uses a term-
to-term matrix (Table 4) to accumulate term weights for each co-occurrence of two
terms while parsing the corpus. The parsing is done with a sliding context window of
a predefined length, usually 8-10 words, moving from term to term; each time, the
neighbouring terms are noted and their weights (discounted by distance to the centre
of the window) added to the co-occurrence matrix.
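As an illustration, a symmetric variant of this accumulation can be sketched as follows (a sketch; the linear distance discount and the default window length are illustrative choices, and HAL proper records preceding and following context in separate row and column roles rather than symmetrically):

```python
from collections import defaultdict

def hal_cooccurrence(tokens, window=8):
    """Accumulate distance-discounted co-occurrence weights with a sliding window."""
    cooc = defaultdict(float)
    for i, centre in enumerate(tokens):
        for d in range(1, window + 1):          # look back up to `window` positions
            j = i - d
            if j < 0:
                break
            weight = (window - d + 1) / window  # discount by distance to the centre
            cooc[(centre, tokens[j])] += weight
            cooc[(tokens[j], centre)] += weight  # symmetric variant of the accumulation
    return cooc
```

Adjacent terms receive the full weight, while terms at the far edge of the window contribute only a fraction of it.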
      t1    t2    t3    t4    t5
t1     –   0.99  0.02  0.02  0.02
t2   0.99    –   0.05  0.09  0.01
t3   0.02  0.05    –   0.33  0.19
t4   0.02  0.09  0.33    –   0.88
t5   0.01  0.01  0.19  0.88    –
Table 4: Term co-occurrence matrix
Terms co-occurring frequently will have similar vectors and thus be close in the
resulting space. The proximity of the vectors in the naïve implementation also
depends on similar frequencies, which is often not required or desired. To counter
this effect, we can normalize vectors to unit length. A variant of the model called
Wordspace (Sahlgren, 2006; Schütze, 1998) does not use a square matrix. It uses the
row vectors as term representations and chooses only the most content-bearing terms
for the columns using a weight, e.g., DF or IDF. Furthermore, a gap (Table 5) may be
used (Takayama, Flournoy, Kaufmann, & Peters, 1999) to remove terms that are
either too frequent or too discriminating (Gerard Salton, C. S. Yang, et al., 1975). In
the former case, the terms would not carry much discriminating value; in the latter,
they would not utilize the columns optimally, being very sparse with little co-
occurrence relevance.
gap=2
      t1    t2    t3    t4    t5
t1     –   0.99  0.02  0.02  0.02
t2   0.99    –   0.05  0.09  0.01
t3   0.02  0.05    –   0.33  0.19
t4   0.02  0.09  0.33    –   0.88
t5   0.01  0.01  0.19  0.88    –
Table 5: Term co-occurrence matrix with gap
The word context matrix allows for a fine definition of proximity through the length
of the sliding window. It also makes it possible to parse a corpus with few or even
only one (large) document into a meaningful representation. Documents and queries
map into the space through a combination, i.e., summation, of their indexed terms.
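The mapping by summation is straightforward (a sketch; the two-dimensional toy term vectors in the usage below are illustrative):

```python
def document_vector(tokens, term_vectors):
    """Map a document or query into the space by summing its indexed term vectors."""
    dims = len(next(iter(term_vectors.values())))
    total = [0.0] * dims
    for token in tokens:
        vec = term_vectors.get(token)
        if vec is None:      # terms without an index entry are skipped
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total
```

For example, a query over the vectors {"web": [1, 0], "service": [0, 1]} maps to [1.0, 1.0], and unindexed terms contribute nothing.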
Similarity
Lund & Burgess (1996) proposed to use the Minkowski distance (Equation 7) in
general, and the Euclidean distance (Equation 8) in particular, to measure the
geometric distance between the two points represented by the vectors u and v. A
simple vector length normalization of the original matrix beforehand removes the
document length-weighting problem. Their work provides evidence that the automatic
first-order co-occurrence analysis provides a good approximation of the semantic
relatedness of words, including similar words like "street" and "road" that do not
occur with each other but in comparable circumstances.
\[ d_r(\vec{u}, \vec{v}) = \left( \sum_{i=1}^{k} |u_i - v_i|^r \right)^{1/r} \]
Equation 7: Minkowski distance
\[ d(\vec{u}, \vec{v}) = \sqrt{ \sum_{i=1}^{k} (u_i - v_i)^2 } \]
Equation 8: Euclidean distance
Another similarity measure is the cosine of the angle between two vectors (Equation
9), i.e., the normalized dot product of the two vectors. It has several advantages over
the plain scalar product because of its normalization. The measure is contained
between 0 and 1; identical vectors measure 1 and orthogonal vectors that do not
co-occur 0; document length is naturally disregarded since only the angle is used.
\[ \cos(\vec{u}, \vec{v}) = \frac{ \sum_{i=1}^{k} u_i v_i }{ \sqrt{\sum_{i=1}^{k} u_i^2} \; \sqrt{\sum_{i=1}^{k} v_i^2} } \]
Equation 9: Cosine similarity measure
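The three measures translate directly into code (a sketch over plain Python lists; r is the Minkowski order):

```python
import math

def minkowski(u, v, r=2):
    """Equation 7: Minkowski distance of order r between two points."""
    return sum(abs(a - b) ** r for a, b in zip(u, v)) ** (1 / r)

def euclidean(u, v):
    """Equation 8: the Minkowski distance with r = 2."""
    return minkowski(u, v, r=2)

def cosine(u, v):
    """Equation 9: normalized dot product; 1 for identical directions, 0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Note that cosine([1, 0], [2, 0]) and cosine([1, 0], [1, 0]) both yield 1, illustrating that vector length, and hence document length, is disregarded.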
2.2.8 Summary
The review of the field of Information Retrieval, which is mainly concerned with
identifying information relevant to an information need, reinforced our intuition to
reframe the SD challenge as an IR task (see 1.2.3). The taxonomy of modern IR
identifies the classical models as the ones most related to our task of searching
unstructured text. Within that area, we considered the three main models, namely the
Boolean, Probabilistic and Vector models.
The Boolean model is the oldest and most basic. It is limited: while its absolute
reasoning about relevance is desirable, it comes at the cost of shifting the semantic
reasoning to the searcher, and at the same time it simplifies and restricts the
expressiveness of a search. It is not common anymore since it does not allow for
ranking the relevance of results and regularly returns too many or too few results.
The probabilistic model is based on the assumption that there is an optimal answer to
the information need of a searcher, which it guesses and, with the help of the
searcher, increasingly approximates. This compares well with our SD challenge and
in fact with many information-searching tasks today. The introduction of algorithmic
improvements, particularly BM25, which provides useful rankings of results even
without consumer feedback, has made it a viable IR model that is widely used and
considered state-of-the-art. It does not, however, address the semantic gap we
identified as an essential problem for the discovery process. Since it is widely used
and available in mature IR systems, we consider it an excellent choice as a baseline
system against which to compare alternative systems' performance.
Lastly, the general vector model presents an interesting solution to the SD challenge.
Firstly, its performance on general corpora is excellent (Baeza-Yates & Ribeiro-Neto,
2011). Secondly, the intrinsic fuzziness of the ranking and similarity measures,
combined with the relatedness of comparable terms even when not co-occurring, is
intriguing. This may address the semantic gap between a searcher's query, reflecting
her information need imprecisely, and service information expressed in disparate
terminology. This prompts us to investigate the vector model further in the
following section.
2.3 Semantic Spaces
A Semantic Space or SS (Lowe, 2001; Turney & Pantel, 2010) is the general term for
a vector space model as found in natural language processing and Information
Retrieval (see 2.2.7) and stems from the distributional hypothesis (Firth, 1957;
Harris, 1954; Weaver, 1955; Wittgenstein, 1953). The hypothesis, put simply, states
that the meaning of a term derives from its co-occurrence with other terms.
We propose that humans will be important service consumers and searchers. They
will query the SES, which poses the questions of how to bridge the semantic gap and
how to enable the searcher to obtain services of whose need she is ignorant at the
start of the search. We propose in this section that (advanced) Semantic Spaces
mimic human conceptual reasoning and can help to answer both questions.
We will first introduce conceptual space theory and then explain how Semantic
Spaces relate to it. The aim is to establish that we can guide a searcher to
meaningful and relevant services despite terminological differences between a query
and service-related information. We will further propose that this process allows the
searcher to attain information presumptively through conjecture or informed
guessing, based on her knowledge of her agenda and the relevant selection of services
or service information presented in response to a query (P. Bruza et al., 2009).
Overall, this process fits within the ambit of exploratory search. There have been
attempts in this direction (Bose et al., 2008; Dong et al., 2004; Peng, 2007; Sajjanhar
et al., 2004; Stroulia & Wang, 2005; Studholme et al., 1999; Wang & Stroulia, 2003).
The success of such a SD system depends on both the semantic wealth in the corpus
and its ability to imitate human conceptualization of it. We will show that the latter
can be attained through a Semantic Space. The former is difficult with current service
descriptions. There is some semantic content in the UDDI description and inside
WSDL files in comments, optional descriptions and naming conventions, which have
utility in SD (Zhuang, Mitra, & Jaiswal, 2005). This content, provided by technical
developers, emphasises the functional and technical aspects of the services. Richer
information exists in secondary service-related documents like reviews, descriptions,
advertisements and documentation. We propose to utilize this secondary service
information corpus, the Service Information Shadow, to enable SD on a conceptual
level.
2.3.1 Conceptual Space
Gärdenfors (2004) suggests a three-level representation of cognition (Figure 11). The
most abstract is the symbolic level, followed by the conceptual and then the
connectionist level. The symbolic level uses symbols and grammar to express
information. Keyword- and ontology-based systems operate on this level. Deduction
based on this level, as used by SWS, is highly abstract and specific, with precise and
strict inference. It requires great effort for humans to express and comprehend
information on this level, but it enables them to transfer complex ideas between
individuals. An important help with this is context, e.g., when someone refers to 'the
chair', a conversation, text, or senses like vision, locality or gesturing establish the
context to identify the instance of chair meant. At the other end of cognition, the
lowest level, connectionism reflects biological processes in a neural network. It
processes and stores information in a connectionist representation which can be
simulated by artificial neural networks.
Figure 11: Three levels of cognition
In between these two extremes lies the conceptual level. Within the conceptual level,
knowledge has a geometrical structure. For example, three dimensions (hue,
chromaticity and brightness) can represent the properties of colour. Gärdenfors (2004)
argues that a property is like a convex region in a geometric space. In terms of the
example, the property red is a convex region within the tri-dimensional space made
up of hue, chromaticity and brightness. The property blue would occupy a different
region of this space. A domain is a set of integral dimensions in the sense that a value
in one dimension determines or affects the values in the other dimensions. For
example, the three dimensions defining the colour space are integral since the
brightness of a colour will affect both its saturation (chromaticity) and hue.
Gärdenfors extends the notion of properties into concepts based on domains. The
concept apple may have the domains taste, shape, colour, etc. Context is modelled as
a weighting function on the domains; for example, when eating an apple, the taste
domain will be prominent, but when playing with it, the shape domain (i.e., its
roundness) will be heavily weighted. Observe the distinction between
representations at the symbolic and conceptual levels. At the symbolic level, apple
can be represented as the atomic proposition apple(x); within a conceptual space
(conceptual level), however, it has a representation involving multiple inter-related
dimensions and domains. Colloquially speaking, the token apple (symbolic level) is
the tip of an iceberg with a rich underlying representation at the conceptual level.
Gärdenfors points out that the symbolic and conceptual representations of
information are not in conflict with each other, but are “different perspectives on
how information is described”.
If a discovery system is able to mimic a conceptual space for service-related
information and map a consumer's need into it, then reasoning based on proximity
can achieve discovery based on conceptual relatedness rather than deductive
reasoning. Furthermore, approximating concepts in the space can guide a consumer
meaningfully even with a vague understanding of her need. Semantic Spaces provide
models that bridge from the symbolic to the conceptual, generating geometric
representations grounded in cognitive science. "[S]emantics is a relation between
linguistic expressions and a cognitive structure" (Gärdenfors, 2004, p. 159). This
thesis will use Semantic Spaces as the basic computational model to drive effective
service discovery in the SES.
2.3.2 Singular Value Decomposition
A common problem with word-context and word-document matrices is their size,
sparseness and noisiness. Vector models are good at handling synonymy, but an
increasing index and matrix size can introduce noise that results in increasing
ambiguity in the form of weak polysemy. A corpus can easily exceed millions of
words, resulting in matrices of tens of thousands to hundreds of thousands of rows
and columns (Lund & Burgess, 1996), with most cells empty. Latent Semantic
Analysis or LSA (Deerwester et al., 1990) applied a Singular Value Decomposition
(Golub & van Loan, 1996) to a word-document matrix to address these issues by
computing latent semantic factors and removing noise.
Figure 12: Singular Value Decomposition in Latent Semantic Analysis
Assuming M is the word-document matrix of rank m with w word rows and d
document columns, then an SVD of M results in a left singular matrix U, a square
diagonal matrix S and a right singular matrix V (Figure 12 and Equation 10). The
singular matrices have an orthonormal (column) basis of size m. U has w rows and V
has d rows. S contains only non-zero values along its diagonal. The multiplication of
U with S and the transpose of V reproduces M. One characteristic of the
decomposition is the ordering of U's and V's columns, as well as S's values, in
decreasing importance to the error of the re-composition of M. For example, let k be
m-1 and the rank of M greater than k. Then remove (or set to zero) the last column
and value of U, V and S, calling them Uk, Vk and Sk (Figure 12 and Equation 11). If
we then attempt to re-compose M, we create a least-error approximation, M*, of rank
k. This lossy compression of the matrix content not only removes noise but also
amplifies significant and higher-order relationships (Deerwester et al., 1990;
Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998; Schütze, 1998). In
short, it leads to improved and effective term representations.
\[ M = U S V^{T} \]
Equation 10: SVD
\[ M^{*} = U_k S_k V_k^{T} \]
Equation 11: Truncated SVD
The second characteristic of the SVD is that the dot product (also used in the cosine
measure) between rows or columns of M is equivalent to the dot product between the
rows of U·S or V·S respectively (or of Uk·Sk and Vk·Sk for M*). This is a result of S
being diagonal and the columns in U and V forming an orthonormal basis. It in turn
allows calculation of the dot products between rows without V and between columns
without U (Deerwester et al., 1990).
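The truncation and the dot-product equivalence can be illustrated with NumPy (a sketch; the toy word-document matrix is purely illustrative):

```python
import numpy as np

# Toy word-document matrix M: 4 word rows, 3 document columns.
M = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [1., 0., 2.]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U * diag(s) * V^T
assert np.allclose(U @ np.diag(s) @ Vt, M)        # full re-composition reproduces M

k = 2                                             # keep only the k largest singular values
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
M_star = Uk @ np.diag(sk) @ Vtk                   # least-error rank-k approximation M*

# Dot products between rows of M* equal those between rows of Uk * Sk,
# so term (row) similarities can be computed without V.
rows = Uk @ np.diag(sk)
assert np.allclose(M_star @ M_star.T, rows @ rows.T)
```

NumPy returns the singular values already sorted in decreasing order, so the truncation is a simple slice.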
A variation of LSA is the Wordspace model (Schütze, 1998; Takayama et al., 1999),
which uses a modified HAL word co-occurrence matrix and SVD-based dimensional
reduction. Its use of the SVD differs by not employing the S values when
reconstructing the row relationships and relying only on U. The motivation may
relate to the idea of using a symmetric co-occurrence matrix (Schütze & Pedersen,
1997; chapter 2.1), although it uses rectangular matrices with only content-bearing
columns instead of the sparse and computationally expensive square ones. The
Infomap software from Stanford University (Takayama et al., 1999; Widdows, 2003)
is a direct implementation of the model. Its source code documentation shows that it
used the S values but removed them in favour of only using U in 2001, claiming that
the S values contain no significant information, without a detailed reference or clear
explanation. This raises the question of whether Uk·Sk, Uk or perhaps a variation
thereof is the optimal representation for the row relationships.
37 See http://infomap-nlp.sourceforge.net/ for more.
38 See the source code (http://infomap-nlp.sourceforge.net/), version 0.8.6, file encode_wordvec.c,
lines 201-206.
Consequently, to measure relatedness by cosine similarity, the smaller k-reduced
matrices can be employed, decreasing memory and computing requirements, because
an optimal k is k ≪ m in most cases. Removing the less information-bearing columns
amplifies the similarity between word/document vectors, even of higher-degree co-
occurrence, because the small differentiations between them are the least important
information (Landauer & Dumais, 1997), and it can even rectify outliers (Landauer et
al., 1998). The relatedness of words and documents in the resulting matrices is
strikingly similar to human cognition and "[i]t is hard to imagine that LSA could
have simulated the impressive range of meaning-based human cognitive phenomena
that it has unless it is doing something analogous to what humans do" (Landauer et
al., 1998). This conclusion is further supported by experiments introducing word
meaning negation (Widdows, 2003), which align well with the conceptual level,
where adding or removing a qualitative dimension gives or removes context and
meaning. Overall, SVD has been shown to improve semantic representation,
justifying its adoption in our model despite its computational cost.
2.3.3 Structured Link Vector Model
The proposed secondary Service Information Shadow has no predefined structure. It
will consist of web documents and as such will contain links between the documents
in the form of Uniform Resource Locators (URLs). This information can be utilized
as the sole basis for a vector space model (Milne, 2007) or, in semi-structured
environments, extend the model to a Structured Link Vector Model or SLVM
(Jianwu & Xiaoou, 2002). Jianwu and Xiaoou mapped the node structure and the in-
and out-links in an XML document into a Vector Space Model utilising TF-IDF.
They achieved improvements in an exemplary k-means clustering task. In the next
step, the addition of Latent Semantic Indexing further enhanced the results (J. Yang,
Cheung, & Chen, 2005). These results encourage us to add a similar extension to a
Semantic Space model for the Service Ecosystem. Unfortunately, the SLVM, with its
XML/schema-oriented requirements, is too strict for our loose corpus. We envisage a
more flexible, simpler and similarly effective model extending the Semantic Space
with the most basic structural information, the outwards-directed (hyper-)link (see
chapter 3.4.2).
2.4 Cluster Analysis
The parallels between Conceptual and Semantic Spaces encourage categories
modelled on conceptual space theory. The task of generating categories involves
partitioning empirical data, which is an established topic in machine learning.
Machine learning usually solves such a task with supervised or unsupervised
learning. The former requires a-priori knowledge about the categories and exemplary
data with known desired outcomes to optimize a (learning) algorithm to classify
future data. The alternative, unsupervised learning, proposes to identify a latent or
unknown structure in the data. The emerging properties of the Service Ecosystem
align with the unsupervised learning method, and we focus on it in this section.
A common and successful method in unsupervised learning is cluster analysis, or
clustering, of data. It identifies commonalities and structures in data sets without
explicit a-priori knowledge of the emerging structure. However, we do make implicit
assumptions about the data when we choose a cluster analysis algorithm and its
parameters. In this section we motivate and review cluster analysis and identify how
we might choose an algorithm that benefits from the Semantic Space properties.
2.4.1 Intuition
We identify the necessity for cluster analysis before we describe it in more detail.
The abstraction of data (complexity) for human consumption and comprehension is
motivated by the common human behaviour of identifying shared qualities in patterns
(Gärdenfors, 2004). We do it to comprehend and communicate about our experiences
more easily. Examples of successful and very explicit data abstractions are
classifications and taxonomies, the latter of which originates in biology and the study
of species. The word taxonomy is rooted in the Greek words taxis "arrangement" and
nomia "method"39. Such an explicit organisation of patterns is hard when the
qualities are not obvious, discrete and emerging, or when the full set of data is
unknown ahead of time, as in the case of the SES. In such situations, humans are able
to do an ad-hoc classification based on shared features. These features ideally occur
more frequently than by chance, yet are infrequent and relevant enough to the context
to abstract the problem meaningfully. The transparent process of organising our
experiences has given rise to the strict classifications and taxonomies we use and
share today. We propose to use unsupervised learning to identify clusters of data
patterns sharing properties, giving rise to groups of similar patterns in a Semantic
Space, to mimic humans' natural ability to organise and comprehend experiences.
Since the intuition of the Semantic Space is based on Conceptual Space theory, we
also put forward the proposition that a similarly geometrically oriented cluster
analysis is worth investigating.
39 See http://www.etymonline.com/index.php?term=taxonomy for details.
2.4.2 Application
Grouping patterns by commonalities allows humans and computers to reason about
and process patterns in a highly efficient way. Humans are able to identify relevant
features easily but are unable to process huge numbers of patterns. Machines, on the
other hand, can process considerable numbers of patterns easily but often fail to
identify the most relevant features automatically.
For example, if we confront a person with a small number of services and their
descriptions, she will be able to identify the most generic/prototypical aspects and
organize them accordingly. If we raise the number of services dramatically, the time
needed and the complexity of the task increase to a level where a person would have
to sample the data and make a best guess, because it becomes either infeasible or
impossible to process the data. Now if we assume the data size to be several orders of
magnitude above easy human processing capacity, then it becomes clear that the
sample a person could take would not allow any conclusions to be drawn about the
data. A cluster analysis, while inferior in its feature selection and grouping abilities,
can process all services and identify clusters of shared properties, returning a much
better and more complete view of the data. This view can be utilized by a human to
browse and search for the most relevant cluster, providing her with a list of services,
the cluster members, which are more similar to each other than to others in the data
set. This in turn can be analysed by the human for further action. To be effective, the
structure of the clusters and how they divide a feature space should ideally mimic a
human-made partitioning of the data. In this way, cluster analysis can effectively
reduce the problem of large data sets, in a human-like way, to a humanly
comprehensible complexity.
Besides improving human comprehension of large data sets, such data abstraction,
e.g., in the form of cluster centroids, also benefits computation. We do not have to
compare a query to the whole data set but only to the cluster representatives to
identify the items most relevant to the query, which reduces the computational
expense by approximating the query neighbourhood (G. Salton, 1991). Traditional
search can utilize this (find relevant clusters first, then compare cluster members with
the query and return a ranked list), as can a cluster identification task (find the
relevant cluster(s) and return a ranked list of members according to cluster
relatedness). In today's highly distributed computational environments, this is one
possible strategy to distribute computational and memory load between processing
nodes to scale data processing in a near-linear fashion. Different cluster analysis
methods lend themselves to different optimizations and distribution architectures.
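The two-stage strategy can be sketched as follows (a sketch; the cluster dictionaries, the pluggable `similarity` parameter and the toy vectors are illustrative assumptions, not a prescribed data model):

```python
def cluster_search(query_vec, clusters, similarity, top_n=5):
    """Two-stage search: pick the most relevant cluster by its centroid,
    then rank only that cluster's members against the query."""
    best = max(clusters, key=lambda c: similarity(query_vec, c["centroid"]))
    ranked = sorted(best["members"],
                    key=lambda m: similarity(query_vec, m["vector"]),
                    reverse=True)
    return ranked[:top_n]

def dot(u, v):
    """Unnormalized dot-product similarity, sufficient for the illustration."""
    return sum(a * b for a, b in zip(u, v))
```

The query is compared against as many vectors as there are clusters plus the members of one cluster, rather than against the whole data set.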
2.4.3 The steps of cluster analysis
Clustering or cluster analysis is a mature field that is applied in many data-driven
domains like data mining, information retrieval, image and signal processing, and
genetics, to name only a few. Consequently, a complete review is outside the scope of
this thesis and we refer the reader to comprehensive works (Gan, Ma, & Wu, 2007;
Anil K. Jain & Dubes, 1988; Theodoridis & Koutroumbas, 2006; Xu & Wunsch,
2008). We review the major traditional aspects of clustering that are relevant for the
presented problem. Within this context, we will choose and propose an alternative
clustering analysis for the SES task based on the SS properties and the CS theory
intuition.
Cluster analysis divides into five steps (Anil K Jain & Dubes, 1988):
1. Pattern representation
2. Definition of a similarity measure
3. Process of clustering
4. Cluster representation (optional)
5. Cluster validation (optional)
The first step requires the data to be patterns containing a feature selection relevant
for the clustering. Consequently, an additional phase of processing, feature extraction
and feature selection, e.g., co-occurrence frequencies in text, may be included in or
precede the first step. A common feature processing is matrix factorization, to
convert discrete and sparse features into an abstract, compressed feature space
revealing latent relationships.
The feature spaces are commonly a (high-dimensional) space or a graph, in which
case the definition of similarity (step 2) is usually a measure of proximity. Examples
of similarity measures are the Euclidean distance between two patterns, the cosine of
the angle between two pattern vector representations, or the number of edges in a
graph. The choice of pattern representation and similarity measure requires or
assumes a certain knowledge or intuition about the availability and importance of
features. It is a key step affecting their relationship. The similarity measure,
consequently, has a strong influence on the outcome of the further processing. The
measure is ideally informed by what constitutes similarity in the data set and how
this translates into a measure in the feature space (A. K. Jain, Murty, & Flynn, 1999;
Anil K. Jain & Dubes, 1988).
The clustering stage itself (step 3) proceeds in one of several ways. It generally
attempts to minimize an error (dissimilarity) or conversely optimize a local and/or
global measure of similarity. We investigate this core step in more detail in the next
section (see 2.4.4).
The data abstraction (step 4) provides a view on the data to enhance human
comprehension and/or computation. The choice of cluster representation depends on
the cluster analysis objective, the cluster shapes and the clustering algorithm. For
example, a hyper-spherical cluster in a high-dimensional space is easily (and well)
represented by a centroid (a combination of cluster members) or possibly a medoid
(a representative cluster member). Alternative shapes, e.g., elliptical or irregular
shapes, may require different representations, e.g., outer or distant points of a cluster.
Lastly, depending on the feature space, alternative representations in the form of
conjunctive statements or positions in a classification tree may be useful (A. K. Jain
et al., 1999).
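For a hyper-spherical cluster, both representatives are easy to compute (a sketch; plain coordinate lists as members and Euclidean distance for the medoid are illustrative choices):

```python
import math

def centroid(members):
    """Mean of the members: a combination, not necessarily a member itself."""
    dims = len(members[0])
    return [sum(m[d] for m in members) / len(members) for d in range(dims)]

def medoid(members):
    """The actual member with the minimal summed distance to all others."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(members, key=lambda m: sum(dist(m, o) for o in members))
```

Note that the centroid may lie between members, while the medoid is always one of them.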
The clustering ends with a validation of the outcome (step 5). This is challenging
since there is no absolute truth or optimal clustering, or at least we have no access to
that information; otherwise, we would have used it in the clustering. We can develop
training and evaluation sets to test and select our algorithms. Different problems
require different solutions, and the training/evaluation of algorithms indicates their
performance in a real-world setting only as far as it aligns with it. The outcome
depends on the data and the algorithm, but the quality/value relies on human
judgement wherever humans are involved as consumers of the outcome or a
derivation of it. The judgement itself depends on context and can suffer from bias. A
complete mode of evaluation to balance the various issues is advisable (A. K. Jain et
al., 1999).
Firstly, we can compare the cluster analysis result to an optimal solution, which in
our case should be 'human-made', possibly by persons who are knowledgeable in the
application domain and/or would be potential consumers of the result. Since human
evaluation is subjective and contextual, approximating the real-world scenario well is
essential to the evaluation's credibility. Secondly, we can investigate the outcome
critically and argue the validity of the computed partitioning. This should not be the
sole basis of evaluation, since it is prone to bias and subjectivity on the part of the
person evaluating the clusters. Nevertheless, an informed and critical evaluation can
provide context to the previously mentioned quantitative approach. Lastly, we have
the option to compare two clustering outcomes algorithmically, e.g., based on their
information-theoretic distance (Vinh, Epps, & Bailey, 2009). A combination of this
and the first evaluation method, a domain expert review, could be a comparison of
cluster analysis results and a domain expert's partitioning of a data set, to identify
which cluster analysis mimics human judgement best. The algorithmic comparison of
two computed cluster analysis outcomes provides no insight in our context.
2.4.4 The three core clustering processes
A basic taxonomy of clustering approaches (A. K. Jain et al., 1999; p. 275, fig. 7)
divides clustering algorithms, based on the type of partitioning they achieve, into
hierarchical and partitional at the top level. Hierarchical clustering produces clusters
organised in a tree, a connected acyclic simple graph. Each cluster is a parent and/or
child of another cluster (presuming there are at least two clusters). Partitional
clustering does not provide any links between the clusters. The result is a collection
of clusters, each being a collection of patterns. The clusters are either exclusive,
where a pattern belongs to only one cluster, or overlapping, where a pattern belongs
to any number of clusters, possibly with varying degrees.
Hierarchical and partitional clustering are outcomes of a wide range of clustering
algorithms, which consist of a combination of clustering processes and similarity
measures. We focus our review on generalizable characteristics of clustering
algorithms, the processes and the similarity measures. This allows us to position our
own clustering approach accordingly.
The three common clustering processes (step 3 in section 2.4.3) are agglomerative,
divisive/bisecting and expectation maximisation (Chidananda Gowda & Krishna,
1978; Dempster, Laird, & Rubin, 1977). Expectation maximisation (EM), like other
clustering approaches, presumes hidden or latent information in a data set and tries to
uncover it. The EM algorithm differs from other approaches by iterating two steps,
expectation and maximisation. In the first step, a clustering solution is
proposed/guessed (the 'expectation' of the model's parameters), and in the second
step, the algorithm approximates the solution. It incrementally improves the solution
('maximizing' the log-likelihood of the data) by using the previous iteration's output
as the next iteration's input.
A frequently used, simple and successful implementation of the EM algorithm is the
k-means algorithm (Manning, Raghavan, & Schütze, 2008). It receives as a
parameter the number k of desired clusters and attempts to find the optimal position
of these clusters. K-means guesses the cluster centroids/positions in the initial step
(E-step) and then computes optimised centroids from the attributed cluster members
(M-step). It iterates these two steps (using the centroids from the M-step for the next
E-step) until it achieves no improvement, falls below a minimum delta in change or
exhausts a maximum number of iterations. Despite its simplicity, k-means and EM
have proven themselves to be good and fast cluster analysis algorithms for a wide range of
problems including text-based Semantic Spaces (Manning et al., 2008). K-means has
some inherent properties that affect the outcome of the clustering. K-means
clusters tend to be similar in extent and (hyper-)spherical in shape; the algorithm is
susceptible to local optima (outliers), requires knowledge about the data to choose k, and its
outcome depends heavily on the initial centroid seeds. Where these attributes are
undesired, we can often counterbalance them by selecting alternative similarity measures,
seeding centroids cleverly and/or pairing k-means with additional algorithms. Selecting the
optimal number of clusters is a challenging problem with various solutions, ranging
from informed guessing to identifying diminishing variance change with increasing k
(also known as the elbow method) to using information theory, to name only some.
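The E/M iteration described above can be sketched as follows; this is a minimal, self-contained Python illustration using Euclidean distance, and the function name and the simple stopping criteria are our own, not those of any particular library.

```python
import random

def kmeans(points, k, max_iters=100, tol=1e-9, seed=0):
    """Minimal k-means sketch: the E-step assigns points to the nearest
    centroid, the M-step recomputes centroids as cluster means; it stops
    on convergence (minimum delta in change) or after max_iters."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)  # initial centroid 'guess'
    for _ in range(max_iters):
        # E-step: attribute each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # M-step: recompute each centroid as the mean of its members
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        shift = sum(sum((a - b) ** 2 for a, b in zip(c0, c1))
                    for c0, c1 in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # no further improvement
            break
    return centroids, clusters
```

On well-separated data the two-step loop converges quickly, but as noted above the result depends on the initial seeds.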
Agglomerative and divisive algorithms can be thought of as bottom-up and top-down
algorithms, respectively. Agglomerative algorithms usually consider all patterns as single
clusters with one member each. They iteratively merge the clusters according to the
algorithm's distinctive similarity measure and continue until they converge or reach a
limit of iterations. A hierarchical cluster structure is available from agglomerative
algorithms when we retain the merging steps as a tree hierarchy. The agglomerative
algorithms' most noticeable drawback is their performance. Let n be the number of
patterns. The minimal time complexity of O(n² log n) and space complexity of O(n²) are
significantly higher than those of k-means, for example, which requires O(n·k·l) time,
with k being the number of clusters and l the number of iterations, and O(k+n) space.
The benefit of agglomerative algorithms is their versatility in choosing different measures
for merging clusters. The most prominent are single and complete link (A K Jain et al.,
1999; Manning et al., 2008), but there are many alternatives besides these two (Zhao
& George Karypis, 2004). Single link uses the distance between the two closest
patterns in two clusters to measure inter-cluster similarity. This 'grows' clusters
along paths, allowing it to identify irregular cluster shapes. Complete link uses the
distance between the two farthest patterns of two clusters, resulting in clusters that are
more compact, which is commonly desirable despite being less versatile (A K Jain et al.,
1999).
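The merging process with interchangeable single- and complete-link criteria can be sketched naively as below; this illustration favours clarity over the complexity bounds discussed above, and all names are our own.

```python
def agglomerate(points, target_clusters, linkage="single"):
    """Naive agglomerative clustering sketch: start with singleton clusters
    and repeatedly merge the closest pair under the chosen linkage."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        pair_dists = [dist(p, q) for p in c1 for q in c2]
        # single link: distance of the two closest patterns;
        # complete link: distance of the two farthest patterns
        return min(pair_dists) if linkage == "single" else max(pair_dists)

    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # find the pair of clusters with minimal inter-cluster distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Recording each merge instead of discarding it would yield the tree hierarchy mentioned above.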
Divisive algorithms proceed from the opposite direction. They consider the whole
pattern set a single cluster and divide it, trying to achieve high dissimilarity
between the new clusters and high similarity within them (inter- and intra-
cluster measures). The process continues until it converges, achieves the desired
number of clusters and/or reaches a limit of iterations. They too can create
hierarchical clusters if we consider the dividing steps as branching out in a tree.
Divisive (also known as partitional) algorithms are generally faster than
agglomerative ones (Cutting, Karger, Pedersen, & Tukey, 1992; Larsen & Aone,
1999; Steinbach, G Karypis, & Kumar, 2000).
An additional view on the clustering process can be taken when we separate the
similarity measures in the form of criterion functions, since they can be used in either
agglomerative or divisive approaches (Zhao & George Karypis, 2004). A software
implementation of common criterion functions and clustering methods is the CLUTO
software40. We discuss the functions made available in CLUTO in Appendix C.
2.4.5 Clustering and Semantic Spaces
While we have a wide choice of clustering algorithms available (Zhao & George
Karypis, 2002), we have to be mindful of what features we cluster. An early attempt
at clustering a SS was a k-means based word clustering of an LSA-generated space
(Bellegarda, Butzberger, Coccaro, & Naik, 1996), investigating it as a complement to
the word classification of the time. The resulting clusters confirmed a semantic
association between clusters of words with close vector representations. Their results
further indicated that words of the same root potentially have enough polysemy to
justify placing them in different clusters. Bellegarda (2000) extended this work with
semantic inference for automatic speech recognition. It describes the clustering of
documents representing consumer actions as the training of a SS. The cluster centres
classify future actions by attributing their textual representations, as document
vectors, to the closest document cluster centre. Semantic inference removes formal
semantic representations by relating co-occurrences through the SS model, allowing
flexible consumer input. Cao, Song, & P. Bruza (2004) used a fuzzy k-means
clustered HAL space to evaluate an automatic organization of a SS motivated by
conceptual space theory. Their results provided further evidence that vectors
representing words with similar meanings clump in the space. They also investigated
the polysemy of words by allowing overlapping clusters and found some words can
belong to more than one cluster41 when they share meaning between them.
This aligns with semantic cores and attributing new instances to prototypes. The idea of
prototypes, or prototype theory (Johnson, 1982), proposes that out of a set of
patterns/data/experiences some are more central and representative of a group that
shares certain aspects. For example, a wooden chair with four legs and a back would
be prototypical of chairs, at the heart of the category, while a three-legged stool
without a back would be more peripheral.
40 See http://glaros.dtc.umn.edu/gkhome/views/cluto for more details.
41 Using Reuters data, their example was Reagan, which appeared in a cluster relating to the Iran-Contra affair and another relating to the U.S.A. Presidency.
Canopy clustering (McCallum, Nigam, & Ungar, 2000) is an interesting
cluster analysis technique to identify cluster cores, or canopies. It intends to
reduce the computational expense of cluster algorithms, identify the number of potential
clusters and remove outlier problems. It is a pre-clustering step for more expensive
cluster analyses. It does this by using two distance thresholds t1 and t2 with t1 > t2.
It begins by adding all patterns to a list of candidate canopy centres. It then randomly
selects a pattern from this list and merges all points within t1 into a canopy around it.
Any pattern within the tighter distance t2 of the centre is removed from the list of
candidate canopy centres. This process iterates until no more canopy centres are
available in the list. In a further post-processing step, close canopies may be merged.
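The loop just described can be sketched as follows; this is an illustration following the description in McCallum et al. (2000), and the function name, the random selection policy and the distance callable are our own.

```python
import random

def canopies(points, t1, t2, dist, seed=0):
    """Canopy clustering sketch (after McCallum et al., 2000), with t1 > t2:
    points within the loose threshold t1 join a canopy; candidates within
    the tight threshold t2 of a centre are removed from the candidate list."""
    assert t1 > t2
    rnd = random.Random(seed)
    candidates = list(points)
    result = []
    while candidates:
        centre = candidates[rnd.randrange(len(candidates))]
        canopy = [p for p in points if dist(centre, p) < t1]
        # drop every candidate within the tight threshold of this centre
        candidates = [p for p in candidates if dist(centre, p) >= t2]
        result.append((centre, canopy))
    return result
```

Because t1 is looser than t2, a point may fall into several canopies, which is the overlapping, semi-exclusive behaviour discussed in the next section.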
2.4.6 Semantic Category Analysis
We propose to extend canopy clustering based on the prototype intuition from
conceptual space theory. Canopy clustering is a rough, locally optimised, greedy
agglomerative cluster analysis. If we call the areas around the semantic cores
semantic categories, we can see a similarity to canopy clustering. Semantic
categories require a full cluster analysis to establish, since they
should combine local and global measures for intra- and inter-cluster evaluation. Our
intuition is that semantic categories form around dense clusters equivalent to a
prototypical core (intra-cluster measure), but we have to evaluate them in a wider
context (inter-cluster). Unlike canopy clustering, we require flexible evaluation
of the global and local aspects of the categories. We cannot expect all categories to
be homogeneous in their makeup and distribution; expecting this would conflict with the way
humans establish, contextualise, use and interpret categories. A cluster with many
patterns may occupy, to a certain extent, a larger part of the space than a smaller cluster,
effectively using a measure of density. We can relate such clusters to categories of different
breadth. There has to be a limit to this measure, of course, to prevent singular or
excessively large clusters.
On the inter-cluster level, we have to encourage dissimilar clusters and penalise
similar/nearby clusters. What constitutes proximity ideally depends on the local
cluster features instead of a rigid external setting, as in the case of canopy clusters.
Canopy cluster analysis, with its intent to quickly approximate cluster distributions,
does not provide these qualities. Furthermore, we presume that some noise and
imprecision is inherent to the process of feature selection and extraction, as well as
realistically occurring in real-world data sources. We therefore propose to use exclusive
but not complete clustering, effectively identifying meaningful semantic prototypes in
the data and attributing ambiguous patterns to the established prototypes. This again
differs from the overlapping, semi-exclusive method used by canopy clustering. We
introduce an agglomerative algorithm with local and global measures implementing
the discussed attributes in section 3.3.
2.5 Discussion
We have shown that the current SD methods are separable into two groups. The first
uses a small corpus of service information consisting of one or a combination of
functional information, short descriptions and community-sourced unstructured annotations. It
employs naive IR models. The searcher therefore has to have a good understanding
of the corpus she is searching and of the indexed keywords to be successful. These
models are computationally inexpensive, but a growing corpus and decreasing user
sophistication impinge significantly on their effectiveness.
The second group of SD methods is ontological, enforcing a formal predefined
vocabulary for service annotation, i.e., the SWS. The advantages of this model are
deductive reasoning and a well-defined terminology. This comes at the cost of
abstracting the described 'world' while inflicting a semantic burden on the searcher.
Furthermore, such a system is inflexible, since the established ontology cannot
change readily or adapt easily to reflect changes in the 'world'. Lastly, the
ontological method does not scale semantically, since its complexity becomes a
hurdle for searchers and reasoning with it is computationally expensive.
We therefore returned to the intuition from the introduction of the thesis of reframing
SD as an IR task. Indeed, traditional SD is little more than a simple IR
system where keywords are used in an inverted index over mostly functional information
about services. The IR domain considers the simple Boolean model ineffective. The
two alternative classical IR models applicable to unstructured text are the
probabilistic and vector models. They have shown good results in IR settings and we
review them later in this thesis.
We also introduced in the previous chapter that a searcher or service consumer in a
SES has an agenda from which service need(s) originate. At the same time, she is
unlikely to be knowledgeable about the SES and its services. Subsequently, the searcher
may poorly understand and express the service need since she knows little or nothing
of the service offerings. This led us to focus on the mode of IR that is highly
flexible in how it extracts and compares information from a query and the corpus.
We discussed how the vector model, and in particular the CS-inspired SS, describes
conceptual representation at a sub-symbolic level of cognition. At this level of
cognition, reasoning is not deductive but more associational or abductive in nature.
This in turn supports presumptive attainment of information, i.e., informed guessing
of related concepts as in concept abduction. This, however, faces one challenge. To
extract an effective SS we require a semantically rich natural text corpus beyond the
functional descriptions originating from electronic services and SOA.
Lastly, we tie the representation and scale of the data involved to the need to abstract
the data in a meaningful way. The SES will be a system with emergent properties,
and we represent the services in a high-dimensional space, a problem
commonly solved by machine learning. We identified unsupervised learning in the form
of cluster analysis as the ideal solution. The flexibility and great choice of clustering
approaches requires us to review several approaches in real-world experiments and
does not allow us to select a single solution for all situations. We do have some
intuitions about the Semantic Space, which we investigate in this work as a new cluster
analysis approach alongside a wide range of off-the-shelf solutions.
The next chapter will introduce a model for SS based service discovery, reviewing the
need for an expressive corpus, describing the details of SS generation, introducing
SS innovations, describing a Semantic Categorization algorithm, explaining
discovery in a SS and evaluating the model's software implementation by means of a
well-established synonym experiment.
3 Semantic Service Discovery Model
In this chapter we introduce a model for SS based service discovery, the Semantic
Service Discovery (SSD) model. The first section reviews the details of a suitable
corpus for the model. Afterwards, we introduce the Semantic Space model, followed
by a section detailing some innovations this work introduces to the SS. The
subsequent section details a Semantic Categorization algorithm inspired by
conceptual space theory. We then discuss the two modes of discovery. The last two
sections introduce the software prototype that implements the model and evaluate the
quality of conceptual representation by means of a known synonym experiment,
which assesses the quality of semantic representation in Semantic Space models.
3.1 Semantic Information Shadow
We introduced the term Service Information Shadow, or SIS, in chapter 1.2.3 in
conjunction with the reframing of the SD problem as an IR task. We discussed
in the previous chapter that the field of IR has established models for unstructured
text retrieval tasks. We further provided insight into the most promising model,
the Semantic Space, since it aligns with the particular problem of matching imprecise service
needs with (to the searcher) unknown services, as well as its potential to
deal with the vocabulary mismatch problem. The model requires a rich semantic
corpus written by and for humans to build a geometric representation of concepts.
Conventional electronic web services in the tradition of SOA have largely been
described in a functional way, focusing on how a service interacts rather than
what it does or which purpose it serves. Nevertheless, using the modest informal
semantic information from WSDL files in a SS has proven beneficial for
service matching (Bose et al., 2008). We propose to expand this semantic base with
secondary documents associable with services.
We know that human interaction with services leads to associated human-readable
information to advertise, describe, review, organize and discuss the services.
Community web sites and application markets reflect this. Let these documents be the
Service Information Shadow. Their content details services and relevant ancillary
information. Let us further assume that a document in the SIS directly points to a
service, e.g., by linking to its WSDL file. The document then acts as a proxy for the
concept(s) relating to the service. We can then reframe SD as an Information
Retrieval task, where a service need of a consumer fulfilled by one or several services
is equivalent to a query expressing the need and retrieving one or several documents
associated with the relevant service(s). Let SIS = {D1..Dx} with services S = {S1..Sy}.
For example, service S1 links to {D2, D4, D9}. S1 satisfying service need SN1 expressed
as query Q1 is then equivalent to retrieving D2|D4|D9 in response to Q1. In fact, the
keyword search engines using UDDI and WSDL employ a similar approach. Their
limitation is the small and technical corpus in the form of UDDI/WSDL descriptions and
fields, and the inflexibility of a Boolean IR system using keyword matching.
The benefits of utilizing the naturally occurring and often ignored SIS are manifold.
Firstly, a SS does not require a particular structure of the corpus and can thus utilize
legacy information and various sources of information. The automatic generation of
a SS provides an unbiased representation of the SIS, and it provides the flexibility to
add information by recomputing the space when necessary, without human
interaction. The SS furthermore provides a rich source of concepts relevant to
services. Lastly, the conceptual representations in the SS facilitate an explorative
mode of discovery even in cases of poorly understood and/or expressed service
needs. Higher-order co-occurrence in an SVD-reduced SS expresses recognizable
similarities in the vector representations by matching concepts. We anticipate that
searching using conceptual representations will counter terminological mismatch
between a query and service descriptions, leading to relevant results.
3.2 Semantic Space Generation
Building a Semantic Space (SS) starts with parsing and tokenizing documents, identifying
terms and generating a vocabulary. In these models, there are only two types of
objects, terms and documents. We cater for differentiation in the latter, thus extending
traditional conceptions of SS. The purpose of this will become apparent later in the
Semantic Categorization (see 3.4.3). Furthermore, we also extract links between
documents for another extension of SS models (see 3.4.2).
The text corpus, the SIS of the SES, consists of a list of document types, each
containing a list of documents. A document type could be, for example, a comment or
(the description of) a service operation. The documents are plain text and can contain link
information, e.g., the service operation (description) 'sell share' relates to the
(service) bundle 'share trading' (Figure 14). The simplest corpus possible is a single
document of a default type with no link information.
Figure 13: Steps in Semantic Space generation
The semantic space generation (Figure 13) starts with tokenizing the corpus, followed
by parsing it into a word co-occurrence matrix, and then reduces the matrix by means
of Singular Value Decomposition. At this point, we have a Semantic Space
consisting only of tokens/terms, which we can query and explore. Through combination
of the term vectors, we subsequently map the documents into the space. The final
step is a categorization by clustering and tessellating the space.
Figure 14: Example corpus structure
The following subsections explain the various steps in more detail opening with a
discussion of the version of vector space model we have chosen as the foundation of
our SS.
3.2.1 The Vector Space Model
There is a variety of ways to compute a Semantic Space. The initial choice is
whether a term-document or term co-occurrence matrix should serve as the basis for
the model. The former assumes that a document is topical and that the word order in
the document is insignificant. We anticipate that documents are topical but cannot
presume that the documents are of similar topic granularity since the SIS by
definition originates from a broad range of documents. We therefore chose a term co-
occurrence matrix with a gap (see 2.2.7) as the base for our SS with a variable sliding
window.
The term weights we use are either the maximum TF-IDF of a term or a fixed scalar of one.
In the former case, the highest TF-IDF of the term across all documents in the SIS is
the term weight. In this way, the weighting is motivated by a document retrieval
approach to SD. Over the years, TF-IDF has proven to be an
effective term weight. The scalar term weight is a baseline and, together with
frequency ordering of the content-bearing columns, may perform better in situations
where exploiting term relationships is more important for SD.
The order of the matrix columns and rows is in decreasing order of term weight or
Document Frequency (DF). The DF we use is modified to count not only each
document the term occurs in, but also the frequency with which it appears. In effect,
it is equivalent to a TF over the corpus treated as one document. We offer this option for
applications where the broad semantic base of the corpus, i.e., term relationships, is
more in focus than the identification and retrieval of documents, e.g., in a synonym
test. In the case where the scalar term weight is used, the order of columns
and rows is simply the order in which the parser encounters the terms.
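The maximum-TF-IDF term weight described above can be sketched as follows; a simple tf · log(N/df) formulation is assumed here, since the exact TF-IDF variant is not fixed in this passage, and the function name is our own.

```python
import math
from collections import Counter

def max_tfidf_weights(docs):
    """For each term, take the highest TF-IDF it attains in any single
    document as the term weight, assuming a tf * log(N/df) formulation."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    weights = {}
    for doc in docs:
        tf = Counter(doc)
        for term, f in tf.items():
            w = f * math.log(n / df[term])
            weights[term] = max(w, weights.get(term, 0.0))
    return weights
```

A term occurring in every document, such as 'share' below, receives a weight of zero under this formulation; as noted in section 3.2.2, such zero weights are later rounded up to the smallest representable number.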
The optional gap removes columns in the matrix corresponding to high-frequency
terms without discriminative power, in the case where DF determines the order, or to
extremely discriminating terms, in the case where the sort order is by term weight in
conjunction with maximum TF-IDF. In the latter case, the top results
are terms that are highly frequent in a tiny set of documents but very infrequent in
the corpus; they are excellent identifiers for an insignificant number of documents,
but otherwise introduce sparseness in the matrix, reducing its overall information
content.
The matrix is further processed by reducing its dimensionality through SVD (see
2.3.2) and employing the left side of the truncated matrix decomposition. It contains
the approximation of the row vector relationships and, when truncated correctly,
reduces the noise in the data and amplifies the low- and higher-order co-occurrences
(Bellegarda, 2000). A similar Semantic Space model has been successfully
implemented in Infomap (Takayama et al., 1999).
3.2.2 Tokenizing
Tokenizing identifies recognizable items with information value in the document and
indexes them for easier processing. In the following, we refer to tokens as terms;
they are generally words, but basic email addresses, URLs and abbreviations are also
recognized. A term is longer than one character and not part of a 'stop list' of 765
common words like 'a', 'you', 'the' or 'it', which we sourced from the Infomap source
code42. The system excludes them since their high frequency results in a low
discriminating information value. After tokenizing the documents, the term weight
and DF/corpus-wide TF are calculated and stored for each term. A term weight of zero
is rounded up to the smallest positive number computable on the system to ensure
that each term has a weight, even if it is minuscule and normally outside the
system's number range. This reduces computational error later, when exceptionally
sparse vectors might otherwise become incomputable.
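A minimal tokenizer along these lines can be sketched as below; the ten-word stop set is a tiny stand-in for the 765-word Infomap list, and the regular expression is a simplification of the word, email and URL rules described above.

```python
import re

# Tiny stand-in for the 765-word stop list sourced from Infomap.
STOP_WORDS = {"a", "an", "the", "it", "you", "is", "and", "or", "of", "to", "in"}

# URLs, simple email addresses, then plain words; a simplification of the
# recognition rules described in the text.
TOKEN_RE = re.compile(r"https?://[^\s,]+|[\w.+-]+@[\w-]+\.[\w.]+|[a-zA-Z]+")

def tokenize(text):
    """Lower-case the text, match candidate tokens, then drop
    one-character terms and stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
```
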
3.2.3 Singular Value Decomposition
Deerwester, S. Dumais, Furnas, T. Landauer, & Harshman (1990), in the original
application of SVD to a term-document matrix, explain how the dot
product (necessary for cosine similarity) between terms is given by the left singular
vectors multiplied by the singular values as dimensional scaling factors (see also 2.3.2).
They furthermore establish that a query consisting of terms is comparable to a
pseudo-document and as such mapped into the column space of the matrix.
Figure 15: SVD approximation of word co-occurrence matrix M
We employ a variant of the term-document SVD-reduced matrix introduced by
Infomap (Takayama et al., 1999). It applies the dimensional reduction to a term co-occurrence
matrix, using the rows to index a large part of the corpus and the
columns for a smaller, content-bearing selection of terms (Figure 15).
42 Publicly available at http://sourceforge.net/projects/infomap-nlp/files/.
Figure 16: SS from word co-occurrence matrix (no singular values)
The Infomap method uses the left of the three resulting reduced matrices for the term
vector representation (Figure 16). This differs from the LSA-described left
singular vector reconstruction, which includes the singular values, as illustrated by Figure 17.
We combine and extend the two approaches in section 3.4.1 with the
introduction of the singular factor.
Figure 17: SS from word co-occurrence matrix (with singular values)
Either approach is highly efficient, storing only the term side of the SS. A
document in the space is a combination of its term vectors, and similarly a query
maps into the space as a pseudo-document. All three, terms, documents and
queries, are present in the same space, using the same similarity measure to compare
them.
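The two variants, plain left singular vectors (Infomap style) versus rows of U scaled by the singular values (LSA style), can be sketched with numpy on a toy co-occurrence matrix; the matrix values and variable names are illustrative only.

```python
import numpy as np

# Toy 4x3 term co-occurrence matrix
# (rows: indexed terms, columns: content-bearing terms).
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                        # truncate to k dimensions
U_k, s_k = U[:, :k], s[:k]

terms_plain = U_k            # Infomap-style term vectors (Equation 13)
terms_scaled = U_k * s_k     # rows of U scaled by singular values (Equation 12)

def cosine(a, b):
    """Cosine similarity between two term vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In this toy space, the first two rows (which co-occur with the same content-bearing terms) remain far more similar to each other than to the last row after truncation.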
3.2.4 Term Vectors
The first step in generating the space is to create a term co-occurrence matrix as
explained in chapter 3.2.1, populating it by parsing all documents with a sliding
window. The window moves from term to term in the document, using the current term as a row
reference and incrementing the columns of the row by the term weight of the (column)
terms found left and right of it. Once finished, we smooth the resulting row vectors by
applying the square root to the matrix cell values. The matrix is sparse since many
terms do not co-occur.
$\vec{t} = U_{t,*} \cdot S$
Equation 12: Row vector as a combination of U and S
$\vec{t} = U_{t,*}$
Equation 13: Row vector from U
We subsequently decompose the term matrix by SVD (Figure 15). The cosine
similarity between the rows of the left singular vectors combined with the singular
value diagonal (Figure 17 and Equation 12), or just the rows of U (Figure 16 and Equation
13), measures the semantic similarity between two terms. The resulting row vectors
each represent a term t. For ease of notation, we will use $\vec{t}$ to represent such a term
vector.
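The sliding-window population and square-root smoothing described above can be sketched as follows; the sparse dictionary representation, the default window of two and the weight lookup are illustrative choices of this sketch.

```python
import math
from collections import defaultdict

def cooccurrence(docs, weights, window=2):
    """Populate a sparse term co-occurrence matrix: for each focus term,
    increment the cells of terms found within `window` positions left and
    right of it by their term weight, then smooth with a square root."""
    matrix = defaultdict(float)
    for doc in docs:
        for i, focus in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[(focus, doc[j])] += weights.get(doc[j], 1.0)
    # square-root smoothing of the raw co-occurrence values
    return {cell: math.sqrt(v) for cell, v in matrix.items()}
```

The resulting sparse matrix would then be decomposed by SVD as in the previous subsection.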
3.2.5 Document Vectors
We map documents into the SS by adding up and normalizing their terms' vector
representations after generating the reduced left singular matrix. The final Semantic
Space (bottom parts of Figure 16 and Figure 17) consists of the k-reduced (and possibly
scaled), row-normalized term vectors and a number of documents of different types
represented by their normalized, summed term vectors. A Document Vector (DV), $\vec{d}$,
is the sum of its term vectors normalized to unit length (Equation 14).
$\vec{d} = \dfrac{\sum_{t \in d} \vec{t}}{\left\| \sum_{t \in d} \vec{t} \right\|}$
Equation 14: Term based document vector
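Equation 14 amounts to a few lines of code; the following sketch assumes term vectors are stored in a plain dictionary and silently skips terms absent from the space, both illustrative choices.

```python
def document_vector(term_vectors, doc_terms):
    """Document vector per Equation 14: sum the vectors of the document's
    terms, then normalize the result to unit length."""
    dims = len(next(iter(term_vectors.values())))
    summed = [0.0] * dims
    for t in doc_terms:
        if t in term_vectors:  # terms outside the space are skipped
            for i, v in enumerate(term_vectors[t]):
                summed[i] += v
    norm = sum(v * v for v in summed) ** 0.5
    return [v / norm for v in summed] if norm > 0 else summed
```

A query is mapped into the space the same way, as a pseudo-document of its terms.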
3.3 Semantic Categorization
Semantic Categories are inspired by conceptual space theory (Gärdenfors, 2004), in
the tradition of Aristotle's work on the topic. Algorithmically, we obtained
insight from our review of cluster analysis in section 2.4 and in particular from canopy
clustering (McCallum et al., 2000). The main premise is that categories are regions in
a space spanned by sub-conceptual dimensions around prototypical cores. Instances
of a concept belonging to a category fall within that subspace, with their distance to the
core relating to their similarity with the prototype that seeds the category. We argued
that Semantic Spaces, with their SVD-generated abstract feature space in which
geometric distances between terms and documents indicate their semantic relatedness,
approximate such a conceptual space. With the same intuition, we propose to
construct Semantic Categories resembling conceptual space categories. To this end,
we introduce Semantic Categories in this section and an algorithm to generate them.
3.3.1 Semantic Category
We pointed out in relation to research question 2, though, that a searcher will not
always have adequate awareness of a service need, requiring her to explore and learn
about the SES first before being able to understand and describe her need. For this
case, it is desirable to organize the SS to provide a conceptual, discoverable and
plausible view of the SES as described by the Service Information Shadow. Humans
have an ability to observe, generalize and abstract by way of abductive inference to
make sense of the world around them and reason about it (Gabbay & Woods, 2005).
Such reasoning is not deductive but highly pragmatic. It is 'good' reasoning if it
helps to close the agent's given agenda. At the same time, such reasoning is resource
bound: constraints such as time, information and cognitive processing power govern
it.
We can assume that a searcher with an ill-defined service need has a limited amount
of resources/time to achieve her aim of fulfilling the service need. The limit is a
result either of an external prescription or of the value of achieving the agenda. For
example, if a searcher has the agenda of an entertaining evening and the
ideal service would be ordering concert tickets, there is a limit to how much time the
searcher would expend searching for a service. If categories meaningful to the
searcher organize the service space, then she needs little time to identify the
appropriate categories and can use the remaining time to find the optimal service
by further exploring the categories or forming queries related to the service need.
Otherwise, the searcher is required to spend most of her time browsing
through a large part of the service space to learn slowly of its offerings, structuring
the offerings herself to orient herself and inform the service need.
This in turn helps her to form appropriate queries or to guess potential alternative
service offerings. In effect, she will spend more time/effort on exploring the space
instead of refining a service need and query to optimize her outcome.
For a better understanding of what a human-like abstraction could be, we revisit the
conceptual space example of an apple (see 2.3.1). A particular apple described by
symbols gives context equivalent to a point or area in the conceptual space inside the
apple concept. The symbol apple can also refer to the whole concept of apple in all
its variations (green, red, sour, sweet, ripe, small, round, etc.) or to a prototypical
apple. For example, “Give me an apple” requests the passing of something that falls
within the apple concept. “It looked like an apple” refers to the ‘apple-ness’ of
something, indicating that it had the usual (in this case visual) characteristics of an
apple. Gärdenfors (2004) identifies these prototypes as subspaces inside a concept.
They contain or are close to the most common expression of a concept, with more
unusual ones being more distant, e.g., because of atypical expression in one or more
dimensions like shape or colour, such as a “striped apple”.
We propose to implement the idea of a concept subspace around a prototypical core
in a conceptual space in the SS, calling it a category. The cosine similarity of
semantic relatedness parallels the geometric closeness in a conceptual space based on
quality dimensions. We suggest that high-density areas of semantic representations
identify prototypical areas in a SS. Similar items clump together, forming a semantic core,
because they co-occur in comparable circumstances in the corpus. Unusual instances
of a concept have a higher variation of co-occurrence and therefore appear close to,
but not as part of, the semantic core (Figure 18).
Figure 18: Semantic core expand to categories (simplified)
We propose to identify categories through their prototypical semantic cores, high-density
areas of vectors in the space, in the form of partial, flat, exclusive clusters. We
can extend these clusters through tessellation (Voronoi, 1907) to form full categories
spanning a subspace, distributing ambiguous objects to categories based on their
proximity to core concepts (Figure 19).
Figure 19: Tessellation around core concepts (simplified)
3.3.2 Cluster Definition and Fitness
We define a cluster to consist of two or more vectors. The vector closest to the centre
of the cluster is a medoid (Kaufman & Rousseeuw, 1987), a pseudo-centroid.
Medoids are part of the original data and act as centroid proxies. The remaining
vectors are cluster members. For comparability, we introduce a fitness measure
evaluating local and global qualities of clusters.
Fitness
Let C be a cluster of j > 0 (vector) members $\vec{m}_i$ with $\vec{c}$ (not counted as a member) as
the medoid. A minimal cluster consists of at least the medoid and one member.
Multiplying the sum of all members' similarities with the medoid (Equation 15)
by the average similarity raised to a density factor (Equation 16) establishes the local fitness
$ss_C$ of cluster C. A greater density factor gives preference to density over numbers in
a cluster.
$S_C = \sum_{i=1}^{j} \mathrm{sim}(\vec{m}_i, \vec{c})$
Equation 15: Sum of similarities
$ss_C = S_C \cdot \left(\frac{S_C}{j}\right)^{d}$
Equation 16: Local Fitness
Table 6 illustrates this effect by 'growing' a cluster from left to right with members
of decreasing similarity to the medoid. After a certain point, the addition of another
member (with lower than average similarity) no longer outweighs the drop in density.
Where this happens depends on the density factor. The highlighted cells indicate the
maximum local fitness, the tipping point. For example, for a density factor of 0.25
the cluster reaches its maximum fitness of 3.7892 with the eighth member, at an
average similarity of 0.55. If we increase the density factor, this tipping occurs
'earlier', resulting in denser (smaller) clusters.
Members 1 2 3 4 5 6 7 8 9
Avg‐Sim 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5
Sum‐Sim 0.9 1.7 2.4 3 3.5 3.9 4.2 4.4 4.5
Density
0.125 0.8882 1.6658 2.3340 2.8940 3.3474 3.6955 3.9402 4.0832 4.1265
0.25 0.8766 1.6323 2.2698 2.7918 3.2014 3.5018 3.6965 3.7892 3.7840
0.5 0.8538 1.5673 2.1466 2.5981 2.9283 3.1443 3.2533 3.2631 3.1820
1 0.8100 1.4450 1.9200 2.2500 2.4500 2.5350 2.5200 2.4200 2.2500
2 0.7290 1.2283 1.5360 1.6875 1.7150 1.6478 1.5120 1.3310 1.1250
4 0.5905 0.8874 0.9830 0.9492 0.8404 0.6962 0.5443 0.4026 0.2812
8 0.3874 0.4632 0.4027 0.3003 0.2018 0.1243 0.0705 0.0368 0.0176
16 0.1668 0.1262 0.0676 0.0301 0.0116 0.0040 0.0012 0.0003 0.0001
Table 6: Local fitness (Equation 16) example for varying densities
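The local fitness computation can be sketched as follows. This is an illustrative Python sketch, not part of the thesis prototype (which is implemented in C#/.NET); the function and variable names are our own. It reproduces the Table 6 cell for eight members at a density factor of 0.25:

```python
def local_fitness(similarities, density):
    # Equation 15: sum of the members' cosine similarities to the medoid
    s = sum(similarities)
    # Equation 16: the sum multiplied by the average similarity
    # raised to the density factor
    return s * (s / len(similarities)) ** density

# Eight members with similarities 0.9 down to 0.2 (sum 4.4, average 0.55)
sims = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(round(local_fitness(sims, 0.25), 4))  # 3.7892, the Table 6 tipping point
```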
We define the global factor as one minus the cosine similarity between this cluster's
medoid and the next closest one, raised to a distance factor. The fitness of the
current cluster decreases if there is an existing cluster with a similarity greater than
zero. The distance factor weights the separability of the categories by penalising
proximity between clusters.
$f_C = \left(1 - \max\left(0, \max_{c' \neq c} \cos(c, c')\right)\right)^{df} \cdot ss_C$

Equation 17: Fitness of cluster C with medoid c and j members
The final fitness f_C for cluster C is the global factor multiplied by the local measure
(Equation 17). Table 7 shows how the final fitness of a cluster changes over a range
of distance factors and varying similarity to the next closest cluster. As intended,
clusters with no close neighbours (cosine similarity of 0 between medoids) always
achieve maximum fitness.
Next closest cluster
0 0.2 0.4 0.6 0.8 1
Distance
0.125 12.2500 11.9130 11.4922 10.9243 10.0176 0.0000
0.25 12.2500 11.5853 10.7814 9.7421 8.1921 0.0000
0.5 12.2500 10.9567 9.4888 7.7476 5.4784 0.0000
1 12.2500 9.8000 7.3500 4.9000 2.4500 0.0000
2 12.2500 7.8400 4.4100 1.9600 0.4900 0.0000
4 12.2500 5.0176 1.5876 0.3136 0.0196 0.0000
8 12.2500 2.0552 0.2058 0.0080 0.0000 0.0000
16 12.2500 0.3448 0.0035 0.0000 0.0000 0.0000
Table 7: Fitness example for fixed cluster with changing distance
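The global fitness can be sketched in the same illustrative style (again our own naming, not the prototype's); it reproduces the Table 7 cell for a distance factor of 2 and a next-closest-medoid similarity of 0.2, given the fixed local fitness of 12.25 used in the table:

```python
def fitness(local_fitness, nearest_medoid_sim, distance):
    # Equation 17: the global factor (1 minus the similarity to the next
    # closest medoid, floored at 0) raised to the distance factor,
    # multiplied by the local fitness
    return (1.0 - max(0.0, nearest_medoid_sim)) ** distance * local_fitness

print(round(fitness(12.25, 0.2, 2), 2))  # 7.84, as in Table 7
```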
3.3.3 Cut-Off
The clustering starts by considering each vector of the space a possible medoid and
generating the best local cluster around it. We add the globally43 'fittest' cluster
candidate to the list of final clusters, removing its members and medoid from the
remaining cluster candidates to prevent overlapping. All remaining candidates then
evaluate their distance to the confirmed cluster, updating their global fitness
accordingly. We repeat the process until no candidate clusters remain.
Additionally, a cut-off value can be set to remove a potential tail of mini-clusters. The
cut-off is a percentage of the highest (first) cluster's fitness, e.g., if the first cluster's
fitness is 120 and the cut-off 10%, then any cluster with a (global) fitness below 12 is
invalid.
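The greedy selection loop can be sketched as follows. This is a simplified illustration under our own naming: overlapping candidates are dropped outright rather than re-grown without the taken vectors, which the full procedure would do.

```python
def select_clusters(candidates, sim, distance=1.0, cut_off=0.0):
    # candidates: list of (medoid, members, local_fitness) tuples
    # sim(a, b): cosine similarity between two medoids
    confirmed, remaining = [], list(candidates)
    first = None
    while remaining:
        def global_fitness(cand):
            medoid, _, local = cand
            nearest = max((sim(medoid, m) for m, _, _ in confirmed), default=0.0)
            return (1.0 - max(0.0, nearest)) ** distance * local
        best = max(remaining, key=global_fitness)
        f = global_fitness(best)
        if first is None:
            first = f
        elif f < cut_off * first:
            break  # the cut-off removes the tail of mini-clusters
        confirmed.append(best)
        taken = {best[0]} | set(best[1])
        # drop candidates sharing vectors with the confirmed cluster
        remaining = [c for c in remaining
                     if c is not best and not (taken & ({c[0]} | set(c[1])))]
    return confirmed
```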
3.3.4 Tessellation
We divide the remainder of the space by tessellation (Voronoi, 1907) with each
cluster medoid as the centre of a convex region, a tessellate. Any remaining vectors
of the clustering type, as well as those of any other type, belong to the tessellated
area of the closest (by cosine similarity) medoid. Together, a cluster and its
tessellated vectors form a category (Figure 20).
43 Because the first cluster has no neighbouring cluster the global factor will be 1.
Figure 20: Categories through tessellation example
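The tessellation step reduces to assigning each remaining vector to its most similar medoid. A minimal sketch (illustrative naming, unit-length vectors assumed so cosine reduces to the dot product):

```python
def categorize(vectors, medoids, cos):
    # Assign each remaining vector to the category of its closest medoid
    # (by cosine similarity), forming Voronoi-style regions around the cores
    categories = {m: [] for m in medoids}
    for v in vectors:
        categories[max(medoids, key=lambda m: cos(v, m))].append(v)
    return categories

def cos(a, b):
    # for unit-length tuples the cosine is just the dot product
    return sum(x * y for x, y in zip(a, b))
```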
3.4 Innovations
We provide two innovations for the Semantic Space model and one for
clustering/categorizing Semantic Spaces beyond Semantic Categories. Firstly, we
investigate an alternative recombination of the SVD-truncated matrix; secondly, we
propose introducing relationship (link) information into the traditionally “pure”
semantic space of vectors. Lastly, we introduce typing of the SIS documents to
enhance categories/clusters.
3.4.1 Singular Factor
Deerwester, S. Dumais, Furnas, T. Landauer, & Harshman (1990) based the SVD
reduction on the argument that they do not intend to reconstruct the original matrix
but to extract latent semantic structure. They propose that the reduced singular
vectors contain some kind of feature space representation but do not attempt to
compare the decomposition with how latent semantic structure is present in human
cognition. We agree that only the resulting effect compares to human cognition, i.e.,
the conceptual level, and the underlying space is merely a means to that end.
Following this argument, we propose that the scaling of the features in the
decomposition by the singular values may not be optimal for our purposes. We are
interested in uncovering and amplifying the semantic features, which is not
necessarily equivalent to reproducing a smallest-error approximation of the original
matrix.
We propose to vary the influence of S, the singular values. Raising the singular
values by a singular factor for scaling purposes (Figure 21 and Equation 18) is our
proposition for adjusting the singular values’ influence. The intent is to explore
whether the ordering and scaling of features, in the form of columns of the left
singular vectors, is optimal or whether a scaling of the singular diagonal values may
further optimize the semantic associations. Originally, Deerwester et al. (1990)
argued that S is part of constructing the row and column relationships, but there is
the alternative view that it can be ignored (Schütze, 1997, 1998; Takayama et al.,
1999).
$\hat{t}_i = u_i \cdot S^{sf} = u_i \cdot \mathrm{diag}(\sigma_1^{sf}, \ldots, \sigma_k^{sf})$

Equation 18: Term/row vector as a combination of U, S and a scaling factor
The effect of the new singular factor parameter, denoted sf, is as follows (Equation
18). A singular factor of less than 0 reverses the order, 0 removes the singular values’
influence, 0 to 1 smooths them, 1 results in the traditional combination, and greater
than 1 amplifies the difference between the singular diagonal values. We hope that a
smoothing of the latent factors in the reduced matrix M* by means of the singular
values (e.g., a singular factor of 0.5) might improve the resulting SS despite
introducing a greater error distance between it and the original M. This thesis will
report the benefit of employing the matrix S in the Semantic Space re-composition
and scaled variations of it.
Figure 21: Singular Factor in SS generation
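The recombination with a singular factor can be sketched with a standard SVD routine; this is an illustrative sketch (the prototype itself is C#/.NET), with `term_vectors` and the toy matrix being our own:

```python
import numpy as np

def term_vectors(M, k, sf):
    # Truncated SVD of the co-occurrence matrix M, with the singular
    # values raised to the singular factor sf before recombination
    # (Equation 18); sf=1 is the classic LSA combination, sf=0 drops S
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k] ** sf

M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
rows = term_vectors(M, 2, 0.5)  # smoothed singular values
```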
3.4.2 Linked Document Vectors
In the traditional SS models, a vector representation of a document is based solely on
the document's text. In a SIS, documents may contain link information relating them
to other documents, e.g., through URL or XML information. We have learned that
word co-occurrence relates to semantic relatedness between words. We propose that
documents that are adjacent, i.e., connected through direct links, are close in
meaning, similar to how hyperlinks are used on the web, e.g., on Wikipedia44. Links
relate a word or section of a document to a related section, document or web site. In
sum, the links on a page are a collection of related topics. We therefore argue that
incorporating links into the representation of documents can enhance it. We propose
to extend the DV, the traditional term-based one, to the Linked Document Vector
(LDV), a hybrid adding the link information to the vector representation.
Figure 22: LDV example
A LDV is a combination of its DV, $v_d$, and the DVs of the documents it links to,
$\sum_{e \in L(d)} v_e$, as described by Equation 19. Note that we presume all
document vectors to be unit length. The weighting of the two is adjustable through a
scaling factor $\alpha$. To prevent circular references, only the DV and not the LDV
of the linked-to documents is used. For example, in Figure 22 document1 links to
document2 and document3. LDV1, the linked vector representation, is as a result a
combination of DV1, DV2 and DV3.

$\mathrm{LDV}_d = \frac{v_d + \alpha \sum_{e \in L(d)} v_e}{\left\| v_d + \alpha \sum_{e \in L(d)} v_e \right\|}$

Equation 19: Linked vector of document d
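A minimal sketch of the LDV combination (illustrative naming, unit-length DVs assumed as stated above):

```python
import math

def ldv(dv, linked_dvs, alpha):
    # Own DV plus alpha-weighted DVs of the linked documents,
    # renormalized to unit length (all input DVs assumed unit length)
    combined = [x + alpha * sum(l[i] for l in linked_dvs)
                for i, x in enumerate(dv)]
    norm = math.sqrt(sum(x * x for x in combined))
    return [x / norm for x in combined]

# A document linking to a single document whose DV points along the y-axis
print(ldv([1.0, 0.0], [[0.0, 1.0]], 1.0))  # ~[0.7071, 0.7071]
```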
44 See www.wikipedia.org for more.
3.4.3 Perspectives
The conventional SS distinguishes two types of vectors, term and document. In turn,
clustering of the space is based on term (Cao et al., 2004) or document vectors
(W. Song & Park, 2007). We propose to source the SS from a SIS, and while we do
not prescribe what information a SIS has to contain, we anticipate various document
types describing:
Service Bundles
Service Operations
Process Components
Business Objects
Use-cases
Reviews
…
We propose to retain document typing if made available to the SS. Documents of
different types remain the same unstructured text with optional links and differ only
in the attached type. The intention behind the typing is to search or organize the
space along types of information objects. Instead of applying a clustering or
categorization algorithm to all documents of a SIS, we can choose to organize the
space by relevant document types. For example, if a searcher is looking for complex
services, then a clustering by service bundles may be more appropriate than by
service operations or by all documents. We call this selective view of a space by
means of a type a “perspective”. At the time of SS generation, a broad view of all
service-related information is desirable to provide a complete semantic base. At the
time of querying or categorizing, this may not be optimal since only a subset of
vectors is potentially of interest, resulting in different semantic cores and categories.
An additional benefit is the opportunity for the searcher to select specific information
types of interest at query time.
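A perspective amounts to filtering the space by document type before querying or categorizing; a minimal sketch with our own illustrative names:

```python
def perspective(documents, wanted_types):
    # Restrict the space to documents of the selected types;
    # documents are (vector, type) pairs
    return [vec for vec, doc_type in documents if doc_type in wanted_types]

docs = [("v1", "Bundle"), ("v2", "ServiceOperation"), ("v3", "Bundle")]
print(perspective(docs, {"Bundle"}))  # ['v1', 'v3']
```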
3.5 Modes of Discovery
The discovery of an information object in the SS can proceed either by querying the
space with one or a combination of known information objects, or by exploring an
abstraction of the space to obtain an overview or drill down to a particular piece of
information.
3.5.1 The Query mode of discovery
A consumer or system can query the semantic space by and for different types of
information. A query can be a combination of any information objects represented in
the space: for example, a single word represented by its vector, a combination of
words, or a combination of different information object types. A combined query
consists of a combination of words or a mix of information objects, e.g., through
vector addition and normalization (to remove query length bias). In Equation 20, the
normalized sum of all term vectors of query q is the query vector $\hat{q}$. In
Equation 21, we extend this by summing over types, a list of object types with vector
representations that are part of the query. They contribute equally, and the resulting
query is the normalized vector sum. A more sophisticated query system may also add
user, type-dependent or algorithmic weighting. We leave this as a possible future
improvement.
$\hat{q} = \frac{\sum_{t \in q} v_t}{\left\| \sum_{t \in q} v_t \right\|}$

Equation 20: Combined query from terms

$\hat{q} = \frac{\sum_{T \in \mathrm{types}} \sum_{o \in T} v_o}{\left\| \sum_{T \in \mathrm{types}} \sum_{o \in T} v_o \right\|}$

Equation 21: Combined query from objects of different types
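The normalized vector sum underlying both query forms can be sketched as follows (illustrative naming, not the prototype's API):

```python
import math

def query_vector(vectors):
    # Normalized sum of all query item vectors; normalization
    # removes the query length bias
    dim = len(vectors[0])
    q = [sum(v[i] for v in vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in q))
    return [x / norm for x in q]

print(query_vector([[1.0, 0.0], [0.0, 1.0]]))  # ~[0.7071, 0.7071]
```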
Querying retrieves a list of objects ranked by decreasing cosine similarity with the
query vector. Similar to creating document vectors through term combination,
combined query vectors seek to merge the meanings of the single vectors to express
various aspects. For example, a consumer could express her query using the term
(vectors) incentives and sales and improve the query precision further using the
bundle sales_incentive_and_commission_management. The resulting query vector is
a combination of the two term vectors and the bundle vector, with geometrically
nearby vectors representing related types of information from the SIS such as legal
documents, web sites, reviews, combined services, bundles or terms.
Negation
The query combination can also contain negative items that are subtracted from the
query vector through orthogonal negation (Widdows, 2003). The process takes two
vectors, a and b, with a containing a mixed meaning, e.g., the term apple, which
could refer to the fruit or the company. The vector b can then negate or remove a
particular meaning from a. For example, if b is the word fruit, or the combined vector
of fruit and tree, then removing that meaning/vector from a is equivalent to a
becoming orthogonal to b. The cosine similarity between the two becomes 0, and a
becomes disconnected from the fruit meaning while retaining the company meaning
(or at least the part that is not fruit related), effectively disambiguating it. The
negation can be achieved with the Gram-Schmidt algorithm45 (Arfken, 2005, pages
516-520), orthogonally projecting a on b and subtracting the projection from a to
arrive at a′, which is orthogonal to b (Equation 22).
$a' = a - \frac{a \cdot b}{b \cdot b}\, b$

Equation 22: Gram-Schmidt algorithm applied for vector negation
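The single Gram-Schmidt step is short enough to sketch directly (illustrative naming; the Infomap project referenced in footnote 45 implements the same step in C):

```python
def negate(a, b):
    # Project a onto b and subtract the projection,
    # leaving a' orthogonal to b (Equation 22)
    scale = sum(x * y for x, y in zip(a, b)) / sum(y * y for y in b)
    return [x - scale * y for x, y in zip(a, b)]

apple_not_fruit = negate([1.0, 1.0], [0.0, 2.0])
print(apple_not_fruit)  # [1.0, 0.0] -- cosine with b is now 0
```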
Frequency
By default, we add up query items such as terms as we encounter them. That means a
repeat of the same term or document counts several times in the query. This may be
desirable when the query originates from a text, or when the query shall emphasize a
particular query item over others through repetition or synonyms. The alternative is
to add each distinct item only once to the query, independent of its frequency. We
call this option query uniqueness.
Factor
$f_t = qf^{\,tw(t)}$

Equation 23: Query Factor
Another query option is the query factor. It amplifies a query term according to its
term weight. The term weight of interest in our model is the TF-IDF, which is
usually small (less than 1) and approaches 0. We therefore propose to pass a query
factor qf as a parameter and raise it to the term weight (Equation 23). In the case of
the TF-IDF, the resulting factor ranges between qf and 1 (considering that the
TF-IDF mostly ranges between 0 and 1). In a last step, we add the term vector
multiplied by the final factor to the query.
45 This has been used in the Infomap project; see file query.c, lines 142 to 166, from http://sourceforge.net/projects/infomap-nlp/files/.
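The scaling in Equation 23 can be sketched as follows (illustrative naming):

```python
def query_term_factor(query_factor, term_weight):
    # The query factor raised to the term's TF-IDF weight; with
    # weights in [0, 1] the result lies between 1 and query_factor
    return query_factor ** term_weight

print(query_term_factor(4.0, 0.5))  # 2.0
```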
3.5.2 The Browsing mode of discovery
Instead of directed search, the consumer can also use the categories to browse the
SES. Semantic Categorization (see section 3.3) can provide categories for a Semantic
Space sourced from a Service Ecosystem’s Service Information Shadow. The level
of abstraction depends on the settings for the categorization. For example, lowering
the cluster density while raising the distance factor will produce large clusters giving
a bird's eye view, and vice versa. A consumer can then drill down from an abstract
level to detailed categories, using informed inference to extend her knowledge and
arrive at a useful service solution.
Finally, a combination of the two modes is possible, where a query returns not a list
of close objects but close categories, or the contents of a category, thereby providing
a more selective view of the SIS.
3.6 Software Prototype
The proposed SS generation and discovery have been implemented using C#,
Microsoft .NET and mono46. The software executes under Microsoft Windows as
well as Linux, with the core functionality encapsulated in a system-independent
library. For interactive exploration and creation of simple SSs by the researcher, a
form-based graphical user interface for Microsoft Windows has been developed
(Figure 23). The SS can be loaded from a file or generated from a text corpus or a
specially formatted XML file containing the text corpus, document types and
relationships. A configuration screen (Figure 24) gives access to the parameters of a
Semantic Space’s computation.
46 See http://www.microsoft.com/net and http://www.mono-project.com for more.
Figure 23: SSD graphical user interface main screen
The interface allows combining three different object types using Boolean AND and
NOT to create complex queries, utilizing vector addition for AND and vector
negation for NOT (see section 3.5.1). In the example in Figure 23, the query consists
of incentives AND sales (Linked/Terms),
sales_incentive_and_commission_management (Bundle) and
confirm_commission_case_creation_as_bulk NOT find_opportunity_by_elements
(ServiceOperations). The screenshot shows 0 cut-off, 7 distance and density as
parameters for the SC. Each of the three object type fields shows the most similar
results of that type for a particular query. The second field from the right lists all
category medoids with their similarity to the query, highlighting the most similar
medoid/category. The right field shows the highlighted category’s members with
their similarity to the medoid.
Figure 24: SSD configuration screen
The computationally expensive tasks execute in parallel using a shell program
adaptation running on Linux on the Queensland University of Technology High
Performance Computing facilities47.
3.6.1 Parameters
The parameters manipulate the different areas of space creation and interaction. The
following table names them for future reference and gives a short description.
47 See http://www.itservices.qut.edu.au/hpc/ for more.
Type / Name / Description / Default
Co-Occurrence Matrix
  rows      Number of rows                                                       -
  cols      Number of columns                                                    -
  left      Left sliding window                                                 15
  right     Right sliding window                                                15
  gap       Column gap                                                          50
  tw        Term-weight: TF-IDF or 1                                        TF-IDF
  tt        Ordering of columns: tw or DF                                       tw
SVD
  u         Number of reduced columns                                            -
  sf        Singular factor by which S values are raised                         1
Query
  qf        Query factor, multiplying query vectors with qf raised by TF-IDF     1
  uq        Unique query switch: use query terms only once or by frequency    true
Cat.
  density   Greater number gives preference to denser clusters.                  1
  distance  Greater number penalises proximity of clusters.                      1
  cut-off   Percentage fitness of fittest cluster as lower fitness bound.        0
LDV
  lnkwght   Weight of linked documents in linked document vector.               0%
Table 8: Parameters for Semantic Space and Semantic Categories
3.7 Evaluation
Before we continue with specific experiments to answer the research question, we
want to evaluate the proposed Semantic Space model and its implementation by
testing the quality of the term vector representations, which are essential both for
facilitating effective querying and for producing a useful conceptual abstraction of
the SES. In the literature, the synonym section of the Test of English as a Foreign
Language (TOEFL) on the Touchstone Applied Science Associates (TASA) corpus
(Landauer et al., 1998; Turney & Pantel, 2010) is a respected and widely used
evaluation of semantic vector representations. The TOEFL is part of an entry test for
foreign students to colleges in the United States of America.
3.7.1 Data
The test is multiple choice, comprising 80 words and, for each of these, four possible
synonyms, one of which is the correct answer. The corpus is the TASA corpus,
containing 44,486 short plain-text documents of "General Reading up to 1st year
college" totalling 73,132,886 characters, an average of 1,644 per document. Students
learn college-entry-relevant vocabulary and language usage from these readings.
3.7.2 Experiment/Methodology
Results are measured as the percentage of correct answers; twenty correct answers
would, for example, be 25%. If the SS does not contain the queried word or any of
the synonyms, then the question counts as answered with a 25% chance, similar to
Landauer and Dumais (1997). For example, if the system answered 20 correctly but
did not have enough information in the corpus to answer three other questions, then
the total result would be 0.2594 (25.94%), or 20.75 correct answers.
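The scoring rule can be sketched as follows (illustrative naming); it reproduces the worked example of 20 correct answers plus 3 unanswerable questions:

```python
def toefl_score(correct, unanswerable, total=80):
    # Questions the SS cannot answer count as a 25% chance success,
    # following Landauer and Dumais (1997)
    return (correct + 0.25 * unanswerable) / total

print(round(toefl_score(20, 3), 4))  # 0.2594, i.e. 20.75 correct answers
```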
3.7.3 Results
In the process of developing this model, we also evaluated some of the SS
parameters and collected 508,032 results. DF sorting of the co-occurrence matrix’s
columns and rows and a fixed scalar of one as the term weight consistently
performed better than TF-IDF sorting and weighting (see Table 9). Therefore, the
following results focus on the 127,008-result subset using DF sorting and a fixed
term weight of one. The single-word queries made the qf and uq settings irrelevant.
We inserted the words from the TOEFL test as rows at the beginning of the
co-occurrence matrix, overriding the row ordering, to ensure they are in the
vocabulary.
Avg Max Sorting TW
0.4816 0.7875 DF 1
0.4228 0.7000 DF TF-IDF
0.3745 0.7031 None 1
0.3843 0.6250 TF-IDF TF-IDF
Table 9: Comparison of sorting and term weight influence4849
We evaluated matrices of 3,000 rows by 3,000, 6,000, 9,000, 12,000 and 15,000
columns, as well as 15,000, 12,000, 9,000 and 6,000 rows by 3,000 columns, reduced
to 100, 250, 500 and 1,000 dimensions by SVD, together with combinations50 of
gap, left and right window sizes of 0, 2, 4, 8, 16, 32, 64 and 128 and singular factors
of -1, -0.5, 0, 0.5, 1, 2 and 4. Pilot experiments established these settings beforehand
as a reasonable range for exploring the effectiveness of the various parameters.
48 Based on 282,240 results from 3x3, 3x6, 3x9, 6x3 and 9x3 (each thousands) rows by columns matrices.
49 Values are the ratio of correct answers, e.g., 0.5 means 50% answered correctly. See 3.7.2 for details.
50 Except left and right window both 0.
Avg right
left 0 2 4 8 16 32 64 128
0 0.499 0.496 0.497 0.487 0.472 0.449 0.432
2 0.428 0.532 0.533 0.527 0.5 0.486 0.454 0.436
4 0.434 0.495 0.537 0.532 0.516 0.493 0.459 0.442
8 0.474 0.522 0.541 0.545 0.519 0.501 0.462 0.449
16 0.454 0.48 0.5 0.508 0.499 0.502 0.465 0.454
32 0.461 0.472 0.487 0.496 0.489 0.483 0.467 0.465
64 0.459 0.469 0.479 0.477 0.481 0.48 0.47 0.474
128 0.455 0.457 0.462 0.464 0.471 0.472 0.471 0.47
Max right
left 0 2 4 8 16 32 64 128
0 0.625 0.65 0.725 0.7 0.7 0.638 0.613
2 0.563 0.7 0.775 0.738 0.75 0.713 0.663 0.613
4 0.613 0.675 0.75 0.763 0.763 0.713 0.663 0.613
8 0.688 0.763 0.775 0.788 0.775 0.725 0.663 0.663
16 0.663 0.7 0.713 0.725 0.75 0.75 0.675 0.663
32 0.675 0.7 0.7 0.725 0.75 0.725 0.688 0.675
64 0.663 0.675 0.7 0.713 0.7 0.713 0.7 0.725
128 0.688 0.688 0.688 0.688 0.7 0.688 0.675 0.663
Table 10: Window size impact49
Table 10 shows that a symmetric window size of 8 on both sides provided the best
average and maximum results. The difference between the worst and best settings
was -28.6% for the maximum and -21.5% for the average results.
U
Average Maximum
100 250 500 1000 100 250 500 1000
Columns
3000 0.4903 0.4850 0.4676 0.4380 0.7250 0.7500 0.7250 0.7625
6000 0.5033 0.5005 0.4857 0.4600 0.7250 0.7625 0.7500 0.7625
9000 0.5069 0.5047 0.4978 0.4707 0.7500 0.7875 0.7500 0.7375
12000 0.5113 0.5096 0.5020 0.4741 0.7375 0.7750 0.7375 0.7375
15000 0.5114 0.5110 0.5069 0.4781 0.7375 0.7750 0.7500 0.7375
Table 11: Columns to SVD reduction impact49
Table 11 shows the effect of the dimension reduction of the U matrix in terms of
both average and maximum test scores. The reduction of columns to u gives a split
result: on average, a reduction to 100 dimensions is the best choice, but 250
dimensions achieved the maximum result. This indicates that under the right
circumstances there is information that benefits from a representation larger than
100 dimensions. A dimension reduction to u=100 appears to be resilient, generally
providing a good balance between retaining and amplifying features while reducing
noise, as the average results indicate.
sf Avg Max
‐1 0.5153 0.7875
‐0.5 0.5442 0.7875
0 0.5770 0.7750
0.5 0.5726 0.7625
1 0.4251 0.6375
2 0.3863 0.5500
4 0.3508 0.4750
Table 12: Singular factor impact49
The singular factor acts in a similar fashion to the dimensional reduction. Recall that
the singular factor sf scales the singular values (see Equation 18). Using an sf of 0,
which effectively ignores the S values, has the best average outcome (Table 12). On
the maximum results, a negative sf has a slight benefit. Ignoring S (Takayama et al.,
1999) or inverting its order produces the best results, while the traditional sf=1
(Deerwester et al., 1990) has a negative impact.
Avg Max
rows\cols 3000 6000 9000 12000 15000 3000 6000 9000 12000 15000
3000 0.4673 0.4874 0.4950 0.4992 0.5018 0.7375 0.7625 0.7875 0.7750 0.7750
6000 0.4711 0.7625
9000 0.4711 0.7500
12000 0.4710 0.7375
15000 0.4706 0.7375
Table 13: Rows to Columns impact49
Surprisingly, the number of columns does not have a tremendous impact (Table 13).
The maximum number of columns results in the highest average result, but roughly
half that was enough for the maximum result. We expected the number of rows to
have no significant impact because they do not add additional information to the
synonym test: the synonym words are the first rows51 in the co-occurrence matrix,
and their co-occurrence with the column words, not the rows, defines their
relationship.
51 80 questions with 4 answers each, with some re-occurring words, resulted in just under 400 rows inserted at the beginning.
gap Avg Max
0 0.4835 0.7625
2 0.4838 0.7625
4 0.4813 0.7750
8 0.4814 0.7750
16 0.4784 0.7875
32 0.4804 0.7625
64 0.4820 0.7750
128 0.4822 0.7750
Table 14: Gap impact49
We recall that the gap is the number of highest-order columns we ignore in order to
remove high-frequency terms that may be undiscriminating. The use of DF, the
corpus-wide term frequency, is an efficient and resilient way of selecting the
content-bearing words, as the minimal change in results when varying the gap
indicates (Table 14). The difference between no gap and a gap of 128, for example,
is insignificant.
3.8 Discussion
Landauer and Dumais (1997) reported 64.4%, with foreign students scoring 64.5%
on the same test. An implementation using random indexing achieved 70-72%
(Kanerva, Kristofersson, & Holst, 2000). The presented SSD implementation
accomplished 78.75% (Table 15). There have been results beyond 90% (Rapp, 2003),
but such systems use a variety of bells and whistles, such as external data sources
(Deerwester et al., 1990).
Correct49 row col u gap left right sf
0.7875 3000 9000 250 16 8 8 -1
0.7875 3000 9000 250 16 8 8 -0.5
0.775 3000 9000 250 128 8 16 0
0.775 3000 15000 250 8 8 4 0
0.775 3000 15000 250 64 8 4 -0.5
0.775 3000 15000 250 4 8 4 -0.5
0.775 3000 12000 250 8 8 4 -0.5
0.775 3000 12000 250 8 8 4 0
0.775 3000 12000 250 64 2 4 -0.5
0.7625 6000 3000 1000 32 8 2 0.5
Table 15: Top 10 results for TASA/TOEFL SSD
The experience gained from these experiments informed the settings in the
large-scale evaluation reported in the next section, even though adjustments are
beneficial because of the different experimental setup. Furthermore, the semantic
vectors produced by SSD are competitive when compared to state-of-the-art “no
frills” systems on TOEFL. “No frills” is important, as such systems are more easily
deployable in an application setting. We have provided an implementation of
Semantic Space Discovery grounded in conceptual space theory and extended by
semantic querying, semantic categorization and relationship information. The
TASA/TOEFL experiment demonstrates the quality of the semantic vector
representation arrived at by this model. The next two chapters evaluate the SD
model to address the two research questions and position the model against
alternatives.
4 Semantic Service Discovery Evaluation
We presented Service Discovery as a key challenge in the emerging Service
Ecosystem in the introduction and reframed it as an Information Retrieval task on the
Service Information Shadow (see 3.1). The central issue in the discovery process is
the likely imprecision in the expression of a consumer’s service need, as she will not
be aware of all the services available to her in the SES that may address her need,
nor of the terminologies describing them. Moreover, a consumer may have a vague
agenda and a subsequently poor understanding of her service need. This requires a
discovery system to be highly flexible, approximating meaningful results from
incomplete queries that possibly mismatch the services’ terminologies. A SD system
should therefore either return a collection of alternative solutions where the query is
expressive enough or otherwise approximate it conceptually to foster presumptive
attainment of knowledge by the searcher. We proposed that the SSD model
discussed in the previous chapter could achieve these objectives by imitating
abductive inference of concepts from a SIS through statistical semantics to find or
suggest meaningful services.
In this chapter, we evaluate the search and discovery of services with varying service
need knowledge by introducing a SIS-resembling data source and creating a
discovery scenario describing a complex service need. We simulate the need by
use-cases transformed into long, expressive queries (see next section). We degraded
the queries to simulate imprecise service need understanding. The queries to the SSD
system return ranked lists of service documents; the rank of the relevant one in the
list is the measure of the system's performance. The baselines for comparison for our
model are state-of-the-art IR systems, including vector space, probabilistic and
alternative semantic space systems, that perform the same tasks. We investigate
their performance and review details of the SSD model before closing the chapter
with a discussion.
4.1 SAP ES Wiki as a Service Information Shadow
Evaluating the SSD requires a Service Information Shadow of a Service Ecosystem,
a corpus related to services, and a number of discovery scenarios we can execute and
analyse. The SES is a heterogeneous, emerging system in an early stage of
development with high fragmentation, as we discussed in the first chapter. On the
professional side, SOA and SaaS dominate: private registries in governments,
corporations and industries, emerging online communities and research projects
compete with online/cloud SaaS solutions, while on the end-consumer side
application markets and web-based offerings grow with the surge in smart phones
and devices. The SOA solutions are domain specific, provide little natural semantic
content and struggle to break domain and industry barriers. SaaS solutions are
generally provider bound with closely linked application/service markets, while
application marketplaces are still limited to platforms with more than only
service-oriented software. Therefore, substitutes or partial data sets are the only ones
available for an evaluation until a single system or open standard for integration and
data access surfaces as a SES platform.
We propose to use the SAP Enterprise Service Wiki52, a web site dedicated to
describing service operations and bundles and to organizing related information, as a
data source to imitate a SIS. It resembles a SIS of a future SES because it is built by
a variety of sources/individuals (SAP employees, customers and guests), involves
services from many domains, includes secondary service-related information and
does not enforce a terminology or ontology, but does provide a loose (hyper-)link
structure (Figure 25). The wiki describes software objects like service operations,
service interfaces, process components and business objects. Each object has a web
page with a short description in the wiki or links to the SAP ES Workplace53, which
gives a view with object-related information from SAP databases and additional
user-provided information. The wiki home page organizes the 125 bundles in 30
groups. Bundles are user-provided collections of related objects. The bundles are
represented by web pages containing descriptions of the bundle (Figure 26) and links
to lists with (links to the) related objects. Bundle pages also contain one or several
use-cases54 that describe example application(s) of contained service operations
with a short text and a step-by-step list.
52 See https://wiki.sdn.sap.com/wiki/display/ESpackages/Home for details.
53 See http://www.sdn.sap.com/irj/bpx/esworkplace for details.
54 448 in the 125 bundles.
Figure 25: ES Wiki structure
The wiki by its nature is dynamic and constantly changing55, and not all wiki pages
describing objects have information beyond a template page with some links leading
to inaccessible SAP database views. The available data of 1,114 documents (not
including use-cases) is sufficient to constitute a corpus of text documents relating to
the different objects, including services, comparable to what can be expected from a
SIS, even with the missing/inaccessible descriptions. Another benefit is the wiki’s
hyperlinks, which can facilitate SD.
55 The wiki data used in this experiment has been downloaded on the 20th July 2009.
Figure 26: Example of bundle page (excerpt)
We view the service bundles as optimal entry points to search for or engage
combined services as described in the use-cases, i.e., as humans would engage them.
Consequently, we focus the discovery process on them instead of on the atomic,
functionally described service operations, which relate merely to SOA scenarios.
Service bundles combine related atomic services and thus provide entry points for
tasks reflecting service needs, much like the combined services we expect humans to
consume. Furthermore, bundles by their nature contain rich semantics in
unstructured texts and exemplary use-cases relating to service needs as they would
stem from an agenda. We therefore focus on the service bundles for our experiments
and on the use-cases as the source for service-oriented queries.
The wiki was downloaded by a simple web crawler56 and parsed by a purpose-built
program to split use-cases from bundles, extract links, convert the text to plain
ASCII (American Standard Code for Information Interchange) and remove non-word
characters, excluding abbreviation, email and URL information. We did not capture
all information, since the flexible structure of a wiki as well as human input errors
limit automated extraction, and some SAP database views were not freely accessible.
We saved the data in XML files containing texts with type
56 See http://www.gnu.org/software/wget/ for details.
information (e.g. bundle, service operation, etc.) and (link) relationships as well as
plain text documents.
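The cleaning step described above can be sketched as follows. This is a hypothetical approximation, not the purpose-built parser used in the experiments, and the regular expressions are illustrative:

```python
import re

def clean_text(raw: str) -> str:
    """Reduce wiki text to plain ASCII and strip non-word characters,
    keeping URL and e-mail tokens intact (a sketch of the pre-processing
    described above, not the original parser)."""
    # Drop any character outside the ASCII range.
    text = raw.encode("ascii", errors="ignore").decode("ascii")
    kept = []
    for token in text.split():
        if re.match(r"^(https?://\S+|\S+@\S+\.\S+)$", token):
            kept.append(token)  # preserve URL / e-mail tokens verbatim
        else:
            # Keep word characters, dots (abbreviations) and hyphens.
            cleaned = re.sub(r"[^\w.\-]", "", token)
            if cleaned:
                kept.append(cleaned)
    return " ".join(kept)
```

Applied to every downloaded wiki page, a routine of this kind yields the plain text documents stored alongside the XML files.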
4.2 Experimental Evaluation
4.2.1 Use-cases as Text Queries
The service bundles in the SAP ES Wiki, like Sales or Banking, group related service
operations, process components and business objects. The use-cases in the bundles
describe representative scenarios and simple tasks, e.g., requesting a postal pickup and
shipping service (Figure 27), for the bundle's services and objects. We suggest that
each of the 448 use-cases is analogous to the description of a service need, or task,
within an agenda, which we can use to discover the originating service bundle from
which the use-case derives, in the same way a SD query identifies a relevant
service-related document. As a first step, we remove the use-cases from the corpus
before the IR systems index it, to avoid a bias towards them. The resulting bundle
documents have an average length of 3,682 characters. Punctuation and Boolean
words (NOT, AND, OR) were removed from the queries to prevent errors or
confusion in the evaluated IR systems, which handle these in different ways.
Figure 27: Example use-case
The bundle from which the use-case/service need originates is the optimal solution.
In a first experiment, the use-cases are interpreted as long/full queries (named 100p)
describing the service need when searching the SIS. This should achieve high
rankings for the relevant bundles because of the rich semantics in the query. A 100p
use-case is on average 302 words long. In a second experiment, to simulate
incomplete user knowledge of a service need, we degrade the queries by randomly
selecting only 25% of the words from each use-case to make up the query (named
25p), reducing the
average length to 75 words. Please note that we performed the random selection only
once and all systems use the same 25p queries to ensure comparable results. Lastly,
we query for the bundles using only the titles (which we have ignored so far) of the
use-cases (named Titles). We were able to extract 413 use-case titles from the 448
use-cases from 123 of the 125 bundles. The shortest title is Tendering and the longest
Enable Sales Service Professionals to Provide Real Time Information on Product
Configuration Using a Third Party System to Confirm Configuration. The average
length of a title is 6 words after removing Boolean words.
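The 25p degradation can be illustrated with a short sketch. The sampling below is illustrative; as noted above, the thesis performed the random selection once and reused the same 25p queries across all systems, which the fixed seed mimics:

```python
import random

def degrade_query(query: str, fraction: float = 0.25, seed: int = 42) -> str:
    """Keep a random `fraction` of the query's words (the 25p setting),
    preserving their original order. The fixed seed mimics performing the
    selection once and sharing it across all evaluated systems."""
    words = query.split()
    rng = random.Random(seed)
    k = max(1, round(len(words) * fraction))
    keep = sorted(rng.sample(range(len(words)), k))  # sampled word positions
    return " ".join(words[i] for i in keep)
```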
4.2.2 Combined Query
The use-cases extracted from the bundles usually contain a paragraph description and
a table with the steps to execute the use-case including optional, related service
operation(s) to invoke with a step (Figure 27). In traditional information retrieval
scenarios and classical systems, the additional information of the service operations
is unusable beyond a possible keyword match. We have described before how to
compare or combine the vector representations of documents representing objects or
actions with other documents or queries. For example, if a consumer knows of a
service operation, bundle or business object relevant to her query, she can add its
vector representation to the query instead of approximating that information by
keywords, as she has to do in conventional IR systems. We propose to use the service
operations, where available, to expand the query by their document vector
representations.
This is an additional source of information not available to the conventional systems;
neither Infomap nor Semantic Vectors contains such functionality, though their SS
models would theoretically permit it. For comparability, we query the SSD model in
two modes. The first, called text query (TQ), is the classical text-only representation
of the use-case as a query. For the 25p and 100p queries, we also present an
alternative mode called combined query (CQ), where the sum of the service
operation representations (as explained in chapter 3.5.1) extends the text query
vector. Our intention is to demonstrate that additional query information that is
sometimes available to a searcher but difficult to express in conventional systems
can further enhance discovery.
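The idea can be sketched as adding service operation document vectors to the text query vector and ranking by cosine similarity. This is a minimal illustration under assumed toy vectors, not the model of chapter 3:

```python
import numpy as np

def combined_query(text_query_vec, operation_vecs):
    """CQ mode: extend the text query vector by the sum of the document
    vectors of known service operations (a sketch of the idea above)."""
    return text_query_vec + np.sum(operation_vecs, axis=0)

def rank(query_vec, doc_vecs):
    """Rank documents by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = {name: float(q @ (v / np.linalg.norm(v)))
            for name, v in doc_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)
```

Adding a known operation vector can pull the ranking towards the bundle containing that operation even when the query keywords point elsewhere.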
4.2.3 Performance Measures
The queries return a ranked list, which contains only one document relevant to the
query. The rank of the single relevant bundle object in the list of bundle objects
returned is the measure of precision with which a system retrieves the appropriate
solution. Since there is only one relevant document and it usually is within the result
set, the traditional IR performance measures of mean average precision and recall do
not apply.
Each system returns for each query a ranked list of documents containing bundle and
other related documents57. We retrieve the first 1,000 documents and filter them for
the top 100 bundles. For each query qi out of n queries, the rank ri of the correct
bundle in the list of 100 bundles is noted. The averaged result is the measure of
Average Rank (AR; Equation 24). Sometimes the correct result is not in the top 100.
We therefore extend the measure to the AAR (Adjusted Average Rank), which
counts each of the m missing bundles as retrieved at the next best rank of 101, as
shown in Equation 25. The AAR thus approximates the best possible AR an IR
system could have achieved if all bundles missing from its top 100 had a rank of
101.
Equation 24: Average Rank

$AR = \frac{1}{n}\sum_{i=1}^{n} r_i$

Equation 25: Adjusted Average Rank

$AAR = \frac{1}{n}\Big(\sum_{i=1}^{n-m} r_i + m \cdot 101\Big)$

where $r_i$ is the rank of the correct bundle for query $q_i$ and $m$ is the number of queries whose correct bundle is missing from the top 100.
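The AAR can be computed directly from the observed ranks; a straightforward sketch of the measure:

```python
def adjusted_average_rank(ranks, n_queries, miss_rank=101):
    """AAR: average rank of the correct bundle over n_queries queries.
    `ranks` holds the ranks of queries whose correct bundle appeared in
    the top 100; every missing bundle counts at the next best rank, 101."""
    m = n_queries - len(ranks)  # queries whose bundle is not in the top 100
    return (sum(ranks) + m * miss_rank) / n_queries
```

For example, three queries at ranks 1, 1 and 2 plus one miss give (4 + 101) / 4 = 26.25.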
4.3 Baseline IR systems
We can now measure the performance of the SSD model with the AAR. We propose
to compare it with state-of-the-art IR systems in order to determine its relative
performance. All the baseline systems are classical IR models (see 2.2.2 on page 28)
applying to unstructured text corpora and follow the basic IR system model (see
57 SSD actually returns documents by type and was set to return the top 100 bundles directly.
2.2.4 on page 32) indexing the corpus and querying their index with different ranking
functions. We established in the literature review that competitive unstructured text
models are the probabilistic, the traditional vector space and the dimensionally
reduced Semantic Space models, which we compare in the following.
The probabilistic model is represented by the research software Zettair58 from the
search engine group of the Royal Melbourne Institute of Technology (Billerbeck et
al., 2004; Garcia, Lester, Scholer, & Shokouhi, 2006) utilizing the popular and
widely employed BM25 ranking. The state-of-the-art open source Apache Lucene
project59, which is widely used in commercial applications60, represents the
prominent vector space model:
“Lucene scoring uses a combination of the Vector Space Model (VSM) of
Information Retrieval and the Boolean model to determine how relevant a
given Document is to a User's query. [...] It uses the Boolean model to first
narrow down the documents that need to be scored based on the use of boolean
logic in the Query specification. Lucene also adds some capabilities and
refinements onto this model to support boolean and fuzzy searching, but it
essentially remains a VSM based system at the heart.” (Ingersoll, 2009)
Since we aim to provide evidence for the advantages of the SS model for the SD
task, we provide our own model as introduced in the previous chapter. Furthermore,
we review two alternative SS systems to identify any benefits that may result from
our particular implementation and algorithm over established SS systems. The two
alternative SS models are Infomap and Semantic Vectors61 (Widdows & Ferraro,
2008). The Center for the Study of Language and Information (CSLI) at Stanford
University developed the research software Infomap utilizing a SVD reduced HAL
based SS inspiring the SSD model introduced in the previous chapter. A novel SS
system is Semantic Vectors using the fast Random Indexing (Kanerva et al., 2000) as
an alternative to the computationally more expensive SVD. The Office of
58 See http://www.seg.rmit.edu.au/zettair/ for details.
59 See http://lucene.apache.org/ for more details.
60 See http://wiki.apache.org/lucene-java/PoweredBy for details.
61 See http://code.google.com/p/semanticvectors for details.
Technology Management at the University of Pittsburgh began the development of
the SV package, which is now under active development as an open source project
with support of Google.
We tested all systems with a wide range of parameters and report only the best
results and parameter settings for the experiments here. The systems use the same
corpus and query data, pre-processed to remove potentially problematic characters
(corpus and queries) and Boolean words (queries only) to prevent ill-formatted or
misleading inputs.
4.3.1 Semantic Service Discovery (SSD)
We established the SSD optimal parameter settings (Table 18) in three steps.
Exploratory tests estimated parameters based on previous experience with Infomap.
We recognized that TF-IDF as a term weight for parsing and row/column order
generally achieves the better results and decided to keep tt and tw fixed to it. We
also established a broad range of parameters to explore (Table 16), covering
27,484,800 individual results.
We set rows and columns to a maximum of 6,000 since the SSD identifies fewer
than 7,000 individual terms. We expected the rows to perform better at the larger
setting since it would enhance document representation. The TF-IDF filtering of
terms for the columns might not necessarily profit from a full representation but may
rather act as a kind of noise reduction with fewer than 6,000 columns. This prompted
us to explore the column setting more than the rows.
We varied the gap and window sizes from 0 to 150 to capture a broad spectrum, to
be refined in the next run if a significant subrange could be identified. The SVD
reduction was set between 100 and 400 dimensions. Infomap usually arrives at much
fewer, but we have found that larger values can have positive effects since we use a
precise SVD algorithm instead of a converging one as used by Infomap. The singular
factor uses the same range for all experiments. The negative settings may not have
positive effects but are included for comprehensiveness. The most interesting
settings are 0 (no S value), 0.5 (smoothing effect), 1 (original S values) and
amplification (beyond 1) of the S value scaling effect. We used link weight settings
of 0% and 50% to examine whether it has a noteworthy influence on the optimal
parameter range selection. We tested the query factor in settings from 0 to 3.
SS Parameter  Value
rows          4,000, 5,000, 6,000
cols          2,000, 3,000, 4,000, 5,000, 6,000
cg, lw, rw62  0, 25, 50, 100, 150
tw, tt        TF-IDF
u             100, 150, 200, 250, 300, 350, 400
sf            -1, -0.5, 0, 0.5, 1, 2, 4
lnkwght       0%, 50%
qf            0.0, 0.2, 0.4, …, 3.0
Table 16: Use-cases Semantic Space parameters exploratory run63
The results from the first parameter evaluation experiment led to a second one
(Table 17), with 7,392,000 individual results processed during parameter exploration.
We focused on the maximum rows and columns as they had shown the most
promise. The gap was more effective from 25 upwards, possibly including 175. The
window size returned no conclusive results and the same parameter range was
retained. The effect of the singular factor is of key interest and we maintained the
full parameter range. The final experiment setting includes a full investigation of the
link weight influence from 0% (no link weight) to 90%. A setting of 100% was not
possible, since not all documents contain a link and thus 100% would be undefined
for these documents. We changed the query factor range to 1 to 3.
SS Parameter  Value
rows          6,000
cols          6,000
cg            25, 50, 100, 150, 175
lw, rw64      0, 25, 50, 100, 150
tw, tt        TF-IDF
u             150, 200, 250, 300, 350, 400, 450, 500
sf            -1, -0.5, 0, 0.5, 1, 2, 4
lnkwght       0%, 10%, …, 90%
qf            1.0, 1.2, 1.4, …, 3.0
Table 17: Use-cases Semantic Space parameters refinement run63
The second parameter range (Table 17) is the basis for the results section (see 4.4).
Table 18 lists the optimal results from the second parameter range.
62 Excluding lw and rw equal 0.
63 See section 3.6.1 for details on parameters.
64 Excluding lw and rw equal 0.
Experiment  LnkWght  Query65  Row    Col    U    Gap  LW   RW   SF   UQ  QF
Titles      30%      TQ       6,000  6,000  200  25   150  100  0.5  F   1.2
Titles      0%       TQ       6,000  6,000  450  25   150  150  0.5  F   3
100p        20%      CQ       6,000  6,000  200  50   50   100  0    F   1
100p        0%       CQ       6,000  6,000  200  50   25   100  0    F   2
100p        20%      TQ       6,000  6,000  200  50   25   100  0    F   1
100p        0%       TQ       6,000  6,000  200  50   25   100  0    F   2
25p         20%      CQ       6,000  6,000  200  25   50   100  0    F   1.2
25p         0%       CQ       6,000  6,000  200  25   100  100  0    F   3
25p         20%      TQ       6,000  6,000  200  25   50   100  0    F   1.2
25p         0%       TQ       6,000  6,000  200  25   100  100  0    F   2.6
Table 18: SSD optimal query experiments parameters
4.3.2 Zettair
Zettair (version 0.9.3) indexed the corpus as a list of text documents. We queried
with the default settings as well as with Okapi BM25 term weighting enabled, using
the top 1,000 results. Across all runs, BM25 was superior and we report it here
instead of the default setting.
4.3.3 Lucene
Lucene is a mature, state-of-the-art IR system, tested in real-world applications,
reviewed and fine-tuned by professional developers, so we chose to use it in the
default settings. We used the version 2.4.1 which was current at the time of the
experiments.
4.3.4 Semantic Vectors
Lucene (version 2.4.1) generated the indices for Semantic Vectors (version 1.2.3).
We tested the default index and the Semantic Vectors library's66 positional index.
The default Lucene index is a bag of words, i.e., an inverted index, while the
positional index uses a sliding window that considers in-document word positions,
like the one used in the term co-occurrence matrix. The window sizes used were 1, 3
and 9, which cover the optimal range (P. Bruza & Sitbon, 2008). We processed the
default index with 2, 4
65 TQ refers to text queries and CQ to combined queries including service operation vectors.
66 See http://code.google.com/p/semanticvectors/wiki/PositionalIndexes for details.
and 8 training cycles67. Training cycles rerun the SV algorithm in the hope of
improving results. We tested no more than 8 cycles since results degraded strongly
with more. The querying included the default, training-cycle and positional indices
using the default, subspace, sparsesum and maxsim query settings68. The default
Lucene index queried with default settings achieved the best results in the top 1,000
and we report these as the SV results.
4.3.5 Infomap
Infomap in the latest version 0.8.6 indexed the wiki as a multi-document text corpus.
The query was set to return the top 1,000 documents. The co-occurrence matrix size
was 20,000 by 5,000 with a window size of 50 on the left and right each. Larger row,
column and window settings did not improve performance, while smaller ones
slowly degraded it. We reduced the matrix with a maximum of 500 SVD iterations
(SVD_ITER) and to a maximum of 500 columns (SINGVALS). Infomap uses a
Lanczos SVD algorithm (Golub & van Loan, 1996). The algorithm converged within
these iterations and with lower dimensionality, thus larger SVD values are
ineffective, and we used the optimal settings.
4.4 Results
4.4.1 IR systems comparison
We compiled the results of the three different experiments in Figure 28. The measure
used to evaluate performance is the AAR. The AAR ranges from 1 (all queries
returned the correct result at rank 1) to 101 (all queries failed to return the correct
result in the top 100). The rank of the correct result is important since only the first
few results in a ranked list are likely to receive attention from a searcher (Granka,
Joachims, & Gay, 2004; Moffat & Zobel, 2008). Great differences in the AAR
indicate superiority of one method over another. Establishing the significance of an
AAR difference, however, requires a statistical evaluation. We chose a paired,
two-tailed t-test, which has been shown to be a resilient and strong statistical evaluation to
67 See http://code.google.com/p/semanticvectors/wiki/TrainingCycles for details.
68 See http://code.google.com/p/semanticvectors/wiki/SearchOptions for details.
identify significance in IR (Sanderson & Zobel, 2005; Smucker, Allan, & Carterette,
2007). We compare the sets of query results between the SSD variations and the
baseline IR systems in Table 19 later in the section.
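The paired t statistic underlying this evaluation can be computed from the per-query rank differences of two systems. A stdlib sketch; in practice a library routine such as scipy.stats.ttest_rel yields the two-tailed p-value directly:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(ranks_a, ranks_b):
    """Paired t statistic over the per-query ranks of two systems; the
    two-tailed p-value is then read from a t distribution with n-1
    degrees of freedom."""
    diffs = [a - b for a, b in zip(ranks_a, ranks_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```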
Experiment  System            AAR
Titles      SSD, TQ, 30% LDV  4.419
Titles      SSD, TQ           5.061
Titles      Infomap           6.269
Titles      Lucene            6.312
Titles      Zettair           5.738
Titles      SV                7.341
25p         SSD, CQ, 20% LDV  1.795
25p         SSD, TQ, 20% LDV  2.199
25p         SSD, CQ           2.362
25p         SSD, TQ           2.406
25p         Infomap           2.946
25p         Lucene            3.984
25p         Zettair           4.313
25p         SV                10.167
100p        SSD, CQ, 20% LDV  1.275
100p        SSD, TQ, 20% LDV  1.288
100p        SSD, CQ           1.350
100p        SSD, TQ           1.346
100p        Infomap           1.547
100p        Lucene            1.931
100p        Zettair           2.703
100p        SV                7.252
Figure 28: Use-case query results
In Figure 28, we immediately see that with decreasing query length the AAR
increases for all but the SV system. This confirms the expectation that longer queries
are more expressive. Only SV, a Semantic Space based on a Lucene index and
random projection, is unable to utilize the richer query details. In all experiments SV
has the noticeably highest AAR and performs significantly worse than any SSD
variant. This may be because SV is not optimized for document retrieval.
Zettair has a noticeably higher AAR than Lucene, Infomap and the SSD variants in
the 100p experiment. In the 25p experiment, Zettair is closer to Lucene but still fares
much worse than Infomap and SSD. In both experiments, the Zettair result is
significantly inferior to the SSD ones. In the Titles experiment, Zettair, Infomap and
Lucene all have a higher AAR than both SSD variations. Interestingly, the Zettair
and Lucene results are not significantly different from the plain text query SSD
result.
Lucene performs better than Zettair in the 100p and 25p experiments. In both,
Lucene is significantly inferior to the SSD variants, similarly to Zettair. Just like
Zettair, Lucene does not perform significantly worse than the SSD TQ system in the
Titles experiment, despite a higher AAR than SSD TQ and Zettair. The SSD TQ
with link weight outperforms Lucene and Zettair, though.
                      SSD TQ            SSD CQ
100p      0% LDV  20% LDV  0% LDV  20% LDV
Infomap   0.0006  0.0000   0.0009  0.0000
Lucene    0.0001  0.0000   0.0001  0.0000
Z Okapi   0.0001  0.0000   0.0001  0.0000
SV        0.0000  0.0000   0.0000  0.0000
25p       0% LDV  20% LDV  0% LDV  20% LDV
Infomap   0.0069  0.0002   0.0091  0.0001
Lucene    0.0001  0.0000   0.0001  0.0000
Z Okapi   0.0003  0.0001   0.0002  0.0000
SV        0.0000  0.0000   0.0000  0.0000
Titles    0% LDV  30% LDV  (TQ only)
Infomap   0.0158  0.0012
Lucene    0.0555  0.0058
Z Okapi   0.1908  0.0191
SV        0.0001  0.0000
Table 19: Significance of results by paired, two-tailed t-test69
The SSD model returns superior results in nearly all situations. It utilises long
queries particularly well. In all experiments, the SSDs' AARs are lower than the
baseline systems'. Nevertheless, in the case of short queries, SSD with plain text
queries does not achieve significantly better results than Lucene and Zettair. These
short-query situations are typically the domain of these inverted index systems and
their performance does not come as a surprise. It is encouraging that the SSD in its
simplest form can compete with them. The utilization of link weights does provide a
69 Cells contain p-value with bold results significantly different (p<0.05).
significant advantage to the SSD, though, and it significantly outperforms all
systems in the Titles experiment. In all experiments, the addition of a modest
(20-30%) link weight has shown improvements. We achieved the best results (in 25p
and 100p) when we added combined queries and link weight to the SSD model. Due
to the nature of the data source, we were not able to reliably extract combined
queries for the use-case titles with reasonable effort and therefore only present the
text queries for that experiment.
4.4.2 SSD in detail
The previous section compared the various IR systems in the orthodox unstructured
text IR model with the Semantic Service Discovery system on the exemplary use-
case scenario simulating Service Discovery as directed search. In this section, we
review how some of the parameters in the model, particularly link weight and
singular factor, influence the SSD outcome. To this end, we analyse the results from
the second parameter range SSD experimental run covering 7,392,000 variations
using the fixed 6,000 rows and columns.
LDV weighting
The use-case experiments illustrated the benefit of Linked Document Vectors. The
link weight in all queries in the experiment ranged from 0% to 90% in steps of 10%.
To provide an overview of the impact of the weights, we present the minimum,
median and average AAR for the Titles, 25p and 100p queries over the 10
weightings (Figure 29).
Figure 29: SSD query results with varying LDV weights
Link weight influences all queries in a similar manner across median, average and
minimum AAR. The worst results are at 90% link weight, which largely replaces the
document's original (text) vector with a combination of the linked documents' text
vectors. There is a recurring trend with an optimum around 20-40% and degrading
AAR surrounding it.
The baseline for the LDV weighting is 0%, which is equivalent to the traditional
text-only document vectors used in Semantic Spaces to date. Since 20% to 40%
displayed the best improvements, we provide a detailed view of them in Figure 30,
showing the percentage improvements with the 0% weighting as a baseline. For
example, an AAR of 3 at 0% and of 2 at 30% lnkwght would be an improvement of
33.3%. The diagram illustrates that all query types benefit strongly from the LDV.
Particularly the
average and median results for medium and long query lengths benefit strongly (up
to 32%). Nevertheless, the 25p and Titles minimum AARs improved considerably
too (12-24%). The minimum 100p result was the least impacted by the LDV; since
this result was nearly optimal (close to an AAR of 1) to begin with, the possibility of
further improvement was limited.
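The LDV construction can be sketched as a weighted blend of a document's own text vector with those of the documents it links to. The combination rule below is illustrative, not the exact formulation of chapter 3:

```python
import numpy as np

def linked_document_vector(text_vec, linked_vecs, lnkwght=0.2):
    """Blend a document's text vector with the average vector of its
    linked documents; a lnkwght in the 20-40% range was optimal above.
    At 90% the text vector is largely replaced; at 0% it is unchanged."""
    if not linked_vecs:
        return text_vec
    link_part = np.mean(linked_vecs, axis=0)
    return (1.0 - lnkwght) * text_vec + lnkwght * link_part
```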
Experiment  Result   20%    30%    40%
Titles      Minimum  11.3%  12.7%  10.9%
Titles      Average  14.1%  15.7%  14.6%
Titles      Median   12.8%  15.3%  14.7%
25p         Minimum  24.0%  22.6%  19.8%
25p         Average  22.7%  28.4%  30.9%
25p         Median   25.5%  29.9%  29.9%
100p        Minimum  5.3%   3.2%   3.2%
100p        Average  22.2%  29.0%  32.5%
100p        Median   26.5%  29.8%  29.5%
Figure 30: Improvements in AAR from no to optimal LDV
Singular Factor
Figure 31: Singular Factor influence on AAR
Figure 31 provides an overview of how the singular factor influences the SSD
outcome. The immediately identifiable shared characteristic across the three query
experiments and across the average, median and minimum results is an optimum sf
of 0 or 0.5. A closer investigation (Figure 32), with the unmodified singular values
(sf=1) as recommended (Deerwester et al., 1990) as a baseline, reveals that for 25p
and 100p the best result is achieved with sf=0. This is equivalent to ignoring the
singular values in the Semantic Space creation, much like in the Infomap and
Wordspace models (Schütze, 1998; Takayama et al., 1999). The improvements range
from 58% to 81% on average and still reach an impressive 40% to 50% on the minimum.
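The singular factor can be sketched as an exponent on the singular values when building the reduced document vectors. This is an illustrative reading: sf=1 corresponds to classical LSA-style scaling, sf=0 to ignoring S as in Infomap/Wordspace, and sf=0.5 to smoothing:

```python
import numpy as np

def reduced_vectors(matrix, k=200, sf=0.0):
    """SVD-reduce a co-occurrence matrix to k dimensions, scaling the
    result by the singular values raised to the power sf (sf=0 ignores
    them, sf=0.5 smooths them, sf=1 keeps them as in classical LSA)."""
    U, S, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * (S[:k] ** sf)  # S**sf broadcast over the k columns
```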
Experiment  Result   sf=0.0  sf=0.5
Titles      Minimum  7.5%    13.1%
Titles      Average  5.9%    13.1%
Titles      Median   6.2%    14.4%
25p         Minimum  50.6%   32.9%
25p         Average  54.9%   37.2%
25p         Median   58.9%   38.1%
100p        Minimum  40.1%   29.2%
100p        Average  76.0%   49.1%
100p        Median   81.3%   54.7%
Figure 32: Improvements from sf=1 to 0.0 and 0.5
The singular values can be beneficial, however, as the Titles experiment shows.
Applying them with an sf of 0.5, i.e., a smoothed square root of the original singular
values, yields an improvement of 13% on the minimum and 13% on the average,
which is better than what is achievable with either sf=1 or sf=0.
Query Term Frequency
The query parameter uq instructs the system to either use or ignore term frequency
in a query. When uq is set to on/true, the system uses every term vector only once in
constructing the query vector, independent of the frequency of a query term in the
query. Let us call the alternative setting, when uq is 'off', fq for frequency query. In
this case, the system adds every occurrence of a term in the query to the query
vector.
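The difference between uq and fq can be sketched in a few lines; the term vectors are illustrative toy values, while the real system uses the Semantic Space term vectors:

```python
import numpy as np

def query_vector(terms, term_vecs, unique_terms=True):
    """Build a query vector from term vectors. With unique_terms (uq)
    each distinct term contributes once; otherwise (fq) every occurrence
    is added, so repeated terms pull the query towards them."""
    if unique_terms:
        terms = list(dict.fromkeys(terms))  # de-duplicate, keep order
    vecs = [term_vecs[t] for t in terms if t in term_vecs]
    return np.sum(vecs, axis=0)
```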
Experiment  Result   FQ      UQ
Titles      Minimum  4.419   4.448
Titles      Average  11.387  11.390
Titles      Median   8.874   8.838
25p         Minimum  1.795   1.911
25p         Average  8.761   10.260
25p         Median   5.853   7.174
100p        Minimum  1.275   1.304
100p        Average  6.496   11.232
100p        Median   3.161   7.127
Figure 33: Difference between unique and frequency queries
Figure 33 displays the difference in the results between the two options. We
recognize that for long and medium-sized queries fq is the better choice. The minute
differences in the Titles experiment are not surprising, since the brevity of the title
queries makes recurring query terms unlikely. This comparison shows that
emphasizing a particular aspect of a query through repetition of a term is an effective
means to weight and focus the query. More importantly, the average and median
results indicate that it is also an effective means to counter sub-optimal space
parameters. The improvements on the near-optimal minimum results are modest.
Over the sum of space parameter variations, a long query that permits weighting
through repetitive term use can roughly halve the average result's rank. When we
compare the 100p and Titles average uq results, we recognize that on average the
length of a query is only beneficial with the weighting of frequent (and thus
important) terms.
Combined Queries
We were not able to source a reliable set of service operation links for the Titles
experiment from the SAP ES Wiki due to the data quality. Therefore, we only
present the Combined Queries for the 100p and 25p experiments in Figure 34.
Experiment  Result   TQ to CQ improvement
25p         Minimum  18.4%
25p         Average  9.6%
25p         Median   18.5%
100p        Minimum  1.0%
100p        Average  4.4%
100p        Median   15.8%
Figure 34: Combined Query vs. Text Query
We can clearly identify the benefit the combined queries provide on the median
results. The average improvement is weaker and implies that the CQ is sensitive to
highly suboptimal parameter settings. The combined query cannot significantly
improve on the optimal 100p result; since this particular result is already close to an
AAR of 1, the margin for optimisation is expectedly narrow. The 25p minimum
result, however, improves strongly when utilizing combined queries.
Trends
The remaining variables have shown only minor trends on the average and median
results, with no definite influence on the optimal/minimum AAR results. The query
factor (Figure 35) displays a slight preference towards a setting of 2.0 on the average
and median results. The minimum results tend to be better with a neutral (1.0) qf.
Figure 35: Query factors’ influence on AAR
The dimensional reduction (SVD Figure 36) shows a discernible benefit from smaller
(k=200) settings on average and median results across all three experiments. The
minimum at the 25p queries shows a light preference to k=300 not reflected in the
other two minimums.
Figure 36: SVD reduction to k dimensions
The gap parameter (Figure 37) generally provides better results with a smaller
setting (25) for the median and average results. There is no observable trend for the
minimums.
Figure 37: Gap
The left and right windows (Figure 38 and Figure 39) both favour smaller settings
(25 to 0) for the median and average results, without any certain preference for the
minimums.
Figure 38: Left window
Figure 39: Right window
4.5 Discussion
In this chapter, we evaluated the ability of traditional IR and different SS systems to
identify service-related information, and by extension services, effectively by means
of text queries of varying detail. The variation in query detail simulated how well the
systems cope with a decreasing precision of the service need description. We
reviewed how some of the SSD parameters, in particular the novel LDV and the
singular values, influence the experimental outcomes.
There is a clear trend that with increasing query length all systems (except for SV)
improve the ranking of the correct bundle. Semantic Vectors, based on Random
Projection, performed poorly across all experiments. The conventional probabilistic
and vector space systems nearly always performed worse than the SVD-based
Semantic Spaces. Lucene performs better than Zettair with increasing query length,
but Zettair outperformed Lucene and Infomap in the Titles experiment. SSD
consistently provided the best AAR with TQ and CQ, with and without LDV. This is
most likely due to the Semantic Space being a more expressive representation of
underlying concepts that are transparently accessible through the document and term
vectors. The fact that the performance of SSD degrades less than that of the other
systems when the precision of the query degrades supports this conclusion.
It is clear from the results that Semantic Spaces are a more effective means to search for services via service-related information, independent of query quality. We also demonstrated that expanding queries beyond text to utilize vector representations of known relevant information, i.e., CQ with service operation vectors, further improves retrieval performance. The LDV results emphasize the benefit of exploiting relationship information in a corpus in combination with traditional statistical semantics. As would be expected, it has little benefit in the optimal situation of a long, expressive query in a space with near-perfect parameter settings for a specific corpus, where the AAR is already close to 1. In all other, sub-optimal situations of imperfect space parameters or less expressive queries, the utilization of LDV, and often CQ, boosts the precision of the query and the space. Since a real-world setting is more often than not sub-optimal, e.g., through default parameters, changing corpora or brief queries, the addition of LDV provides a very noticeable gain in resilience and overall performance. Interestingly, the addition of service operation vectors in the form of combined queries provides a visible boost in performance only in connection with LDVs. Many service operation documents contain little to no text but some links to related documents. The consistent boost from using LDVs with combined queries was greater than anticipated, considering that the queried bundle documents are in general detailed texts containing rich semantic information. It may be that they benefited indirectly from an increased disambiguation of semantically poorly expressed but linked documents, such as the service operations, which improved the quality of the (document) space overall. We identified an optimal range around an LDV weighting of 20% to 40%, with a preference towards 20%.
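As a hedged illustration of the LDV idea, the sketch below blends a document's text vector with the centroid of the vectors of the documents it links to, at a configurable link weight. The blending rule and the function name are our assumptions for illustration only; the actual LDV construction is defined in section 3.4.2. The default of 0.2 matches the optimal range of roughly 20% to 40% reported above.

```python
def linked_document_vector(text_vec, linked_vecs, lnkwght=0.2):
    # Illustrative blend (assumed, not the thesis's exact formula):
    # (1 - lnkwght) * own text vector + lnkwght * centroid of linked docs.
    if not linked_vecs:
        return list(text_vec)
    n = len(linked_vecs)
    centroid = [sum(col) / n for col in zip(*linked_vecs)]
    return [(1 - lnkwght) * t + lnkwght * c for t, c in zip(text_vec, centroid)]

# a text-poor document pulled towards its single linked neighbour
print(linked_document_vector([1.0, 0.0], [[0.0, 1.0]], 0.2))  # [0.8, 0.2]
```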
The review of the singular factors challenges both current notions about S values. Deerwester et al. (1990) argued that they are necessary in the reconstruction of the row relationships. If this is true, then the original row relationships are less effective in representing the semantic associations than the sole left matrix of the SVD or, in the case of the Titles, the left matrix with a smoothed version of the singular values. The Titles experiment likewise challenges the Wordspace approach of ignoring S values. The experiments showed that there is value in incorporating (smoothed) S values in some instances and in ignoring them in others. Overall, reproducing only a least-squares-error approximation of the original co-occurrence matrix is not optimal according to our findings. We cannot settle the meaning and optimal use of S values with this research but put forward our results to encourage future investigations.
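One way to picture the singular factor is as an exponent on the singular values when forming row vectors. The sketch below assumes row vectors are built as U·S^sf; under this reading, sf=1 gives the classic LSA weighting, sf=0 discards the singular values (the Wordspace approach), 0 < sf < 1 smooths them and sf < 0 inverts their influence. This is an illustrative assumption; the exact construction is defined in chapter 3.

```python
def apply_singular_factor(U, S, sf):
    # Reweight each column of the left matrix U by the corresponding
    # singular value raised to sf (assumed reading of the sf parameter).
    weights = [s ** sf if s > 0 else 0.0 for s in S]
    return [[u * w for u, w in zip(row, weights)] for row in U]

U = [[1.0, 2.0], [3.0, 4.0]]
S = [4.0, 1.0]
# sf=0.5 smooths the singular values: weights become sqrt(S) = [2.0, 1.0]
print(apply_singular_factor(U, S, 0.5))  # [[2.0, 2.0], [6.0, 4.0]]
```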
In summary, we have demonstrated that Semantic Spaces promote effective service
retrieval in a Service Ecosystem as stated in the first research question. Furthermore,
we showed that (SVD-based) Semantic Spaces outperform keyword systems for long, degraded and title queries. We established that LDVs and the addition of non-textual query information further improve a Semantic Space's query performance, and provided a review of LDV weighting. The exact benefit of S values is unclear
and warrants further research.
5 Semantic Service Categorisation Evaluation
The previous chapter investigated the first research question by evaluating SD as an
IR task by means of a Service Information Shadow simulated by the SAP ES Wiki.
We compared state-of-the-art IR systems with Semantic Space systems and
examined the Semantic Space innovations of manipulating the singular values and
adding SIS structure information.
In this chapter, we investigate the second research question: whether Semantic Categories, inspired by conceptual space theory, provide a meaningful and effective map of the Service Ecosystem for exploration. We explore this by comparing the Semantic Categories to manual groupings using a sophisticated quantitative measure and a qualitative review of the results. The former provides an objective means of comparison and the latter some insight into how meaningful the Semantic Categories may be from the user's perspective. The quantitative measure is a comparison against state-of-the-art clustering algorithms. Moreover, we review the influence of the perspectives, the Linked Document Vectors and the manipulation of the singular values through the singular factor.
The chapter's structure is as follows. We start with a section on the experimental setup. It presents the manual groupings used as a baseline, the two perspectives on the space and details of the choice of measure for comparing the categorization and clustering with the manual baseline. The subsequent section reviews the state-of-the-art clustering algorithms, their experimental setup and their results. The Semantic Categorization section follows, describing its experimental setup and results as well as a qualitative review of the results. The chapter ends with a discussion and comparison of the clustering, categorization and manual grouping.
5.1 Experiment
The experiment splits into two parts: the state-of-the-art clustering algorithms and the Semantic Categorization. Preceding them, in this section, we identify what information we want to organize, what would constitute an optimal solution and how to compare the solutions.
We recall the SES scenario where a service consumer is searching for combined or related services to address a service need originating from an agenda. We are not anticipating widespread atomic service operation invocation or selection by the consumer. Instead, the combination and brokerage of services, or the provision of whole service bundles, addresses the lack of business aspects in current service delivery (Cardoso et al., 2010). The SAP ES Wiki contains bundles provided by the wiki's users that aggregate information about tightly related service operations and objects. The bundles include use-cases describing tasks as we expect them to arise from a complex service need agenda. We propose that the wiki bundles relate well to SES service bundles. In short, these bundles are meaningful clusters of services, which together address some service agenda. In addition, the bundles have associated categories, which are topical and relevant to the intention of the bundle to address the related agenda.
5.1.1 Data
The second scenario we introduced in this thesis assumed the need for an overview of the Service Ecosystem or a part of it. This would be the case when a searcher has a poor understanding of the agenda or the tasks it entails, and may be useful in additional situations, e.g., ontology design or product management. Important in this scenario is to minimize the effort the searcher has to expend when faced with a huge and
dynamic system like the SES. This scenario is a pragmatic assumption when we recall the recurring pattern of categorizing, tagging, grouping and otherwise organizing larger sets of information by humans, e.g., in libraries or service registries, to provide that kind of overview. Moreover, the SAP ES Wiki itself has such an overview page organizing the 125 bundles into 30 bundle groups of related topics70. The wiki users create and maintain the page as a quick and easy way to navigate the wiki. They rely on a shared (human) model (see Conceptual Space theory, 2.3.1) of organizing the underlying concepts instead of each creating their own 'mental map' of the wiki. They accept differences in how the model applies to them because of personal biases and experiences, since the effort and time 'saved' by not building a personally optimal model exceeds this imprecision 'cost'. We anticipate this kind of explorative search to be useful in the future SES.
70 See SAP ES Wiki Grouping in the appendix for details.
The size and ever-changing nature of the SES prohibit a manual categorization. This is comparable to the web catalogues, e.g., as attempted by Yahoo, that were popular in the early stages of the WWW and have declined since, because the web became too large and fast-paced. Consequently, the web requires computationally time- and space-efficient algorithms; it cannot utilize sophisticated methods like clustering or categorization, forgoing precision in favour of 'speed/space' (Baeza-Yates & Ribeiro-Neto, 2011, chapter 11). As a result, the prevalent mode of exploring the web is directed search, e.g., Google's web search, which is efficient and scalable, and with the support of computational facilities captures a large and timely picture of the web. To gain an overview of a topic area with such a system, a user needs various directed searches and evaluations of the results. Its effectiveness depends on the user's ability to anticipate and accordingly formulate queries that cover the topic of interest. Unfortunately, exploratory search by this means is very time-consuming and not truly possible if the user has a poor understanding of the topic of interest.
Figure 40: Practical topical structuring of different corpora
We propose that the SES, or more precisely the associated SIS, unlike the web, will not grow to an unorganisable size (Figure 40). Traditional, manual means are not applicable for organizing such an intermediate corpus topically, but automated means are valid since the SIS will be several orders of magnitude smaller than the web. The semantic search discussed in the previous chapter and Semantic Categorization can handle the millions of documents of a SIS.
In summary, we assume that the usage scenarios and scope of the SES require an automated categorization of the space, and that its bundles and bundle groups are suitable representations of what a searcher would look for and how a human, or service broker, would organize them. We propose that the Semantic Categories can provide a meaningful map of the bundles for exploratory search. To establish this, we apply Semantic Categorization to the SAP ES Wiki and compare its automatically generated categories with the manual (bundle) groups. Most conventional methods for organizing information rely on clustering algorithms of one form or another. We therefore position the Semantic Categorization against state-of-the-art clustering of the space.
5.1.2 Clustering Perspectives
Traditionally, clustering of Semantic Spaces involves clustering its elementary semantic vectors, which are term and/or document vectors (see 3.4.3 for details). In our model, the orthodox clustering approach would therefore utilize the term vectors. We proposed that the optimal perspective for categorization and clustering of the space is through the most relevant objects. The objects of interest are the bundle documents, which accordingly should be the basis for the categories. We therefore chose to review the (traditional) term vector perspective and the specific bundle perspective in our experiments. The comparison of their performance will evaluate whether an alternative perspective improves clustering and categorization results.
5.1.3 Semantic Space Parameters
All experiments use the same rows, cols, g, lnkwght, rw, tw, tt and u parameter
combination (see section 3.6.1 on page 77 for parameter details) to generate the basic
Semantic Space (Table 20). We chose the parameters based on experience from the use-case and TASA/TOEFL experiments. A wide window size captures broad topic relationships rather than narrow term relationships, and a large u allows for resilient performance, with a strong gap to ignore overly specific terms. These settings aim not to overfit the model while still providing a good result71. This permits a focus on the sf, lnkwght and perspective parameters as well as on clustering-specific parameters. The combination of sf and lnkwght yields 70 different Semantic Space variations, which we evaluate from the two different perspectives – term and bundle – in the context of Semantic Categorization as well as with the various clustering algorithms.
71 The top use-case AARs for these settings are 1.3 (100p), 2.07 (25p) and 4.48 (Titles).
SS Parameter Value
rows, cols 6,000
cg, lw, rw 150
tw, tt TF-IDF
u 400
sf -1, -0.5, 0, 0.5, 1, 2 ,4
lnkwght 0%, 10%, … 90%
Perspective Bundle, Terms
Table 20: CLUTO - ES Wiki Semantic Space parameters
5.1.4 Performance Measures
We are comparing flat, exclusive clusterings of a data set and need to choose an appropriate measure. Let us define clusterings U and V of a set S containing N data points {s1, s2, ..., sN}. A popular measure is pair counting as implemented in the Rand index (Rand, 1971), based on a contingency matrix with the number of pairs:
N00 that are in different clusters in both U and V
N11 that are in the same cluster in both U and V
N01 that are in different clusters in U but in the same cluster in V
N10 that are in different clusters in V but in the same cluster in U
The Rand index RI (Equation 26) is bounded between 0 and 1; however, it mostly returns values between 0.5 and 1. The value of 0 is only achieved in the exceptional situation of one clustering consisting of a single cluster and the other entirely of atomic clusters with one member each. It is therefore not a very intuitive measure of the similarity between two clusterings.

RI(U, V) = (N11 + N00) / (N11 + N00 + N01 + N10)

Equation 26: Rand Index
The adjusted Rand index (Hubert & Arabie, 1985) addresses the instability and chance bias of the Rand index. It is 0 in expectation for random clusterings and 1 for identical clusterings. It uses a hyper-geometric distribution to model randomness and adjusts the result for chance (Equation 27).

ARI(U, V) = [ Σij C(nij, 2) − Σi C(ai, 2) Σj C(bj, 2) / C(N, 2) ] / [ ½ (Σi C(ai, 2) + Σj C(bj, 2)) − Σi C(ai, 2) Σj C(bj, 2) / C(N, 2) ]

where nij is the number of objects shared by clusters Ui and Vj, ai = Σj nij, bj = Σi nij and C(n, 2) denotes the binomial coefficient.

Equation 27: Adjusted Rand Index
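The pair-counting measures can be sketched in a few lines. The following is a straightforward implementation of the published definitions, not code from the experiments; the toy labelings are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def rand_index(u, v):
    # Fraction of object pairs on which the two clusterings agree,
    # i.e. (N11 + N00) / (N choose 2).
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[a] == u[b]) == (v[a] == v[b]) for a, b in pairs)
    return agree / len(pairs)

def adjusted_rand_index(u, v):
    # Pair counts from the contingency table, corrected for chance
    # under the hypergeometric model (Hubert & Arabie, 1985).
    n = len(u)
    sum_ij = sum(math.comb(c, 2) for c in Counter(zip(u, v)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(u).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(v).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

u = [0, 0, 1, 1, 2, 2]
v = [0, 0, 1, 1, 2, 2]
print(rand_index(u, v), adjusted_rand_index(u, v))  # identical clusterings: 1.0 1.0
```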
The second popular group of measures is information-theoretically motivated. Examples are Normalized Mutual Information (Studholme et al., 1999) and Adjusted Mutual Information (Vinh et al., 2009). They are based on the Mutual Information (MI, Equation 28) between two variables X and Y, with p(x, y) being the joint and p(x), p(y) the marginal probability distribution functions.

I(X, Y) = Σx Σy p(x, y) log( p(x, y) / (p(x) p(y)) )

Equation 28: Mutual Information
Mutual Information reflects how much the two variables depend on each other, or how much information they share. A MI of 0 indicates that X and Y are independent, and knowing about one does not change the knowledge about the other. A maximal MI (equal to the shared entropy) indicates that they are identical and knowing one is equal to knowing both. We can use it to measure how similar two clusterings are. The probability of a random object from S being in a cluster Ui is P(i) (Equation 29), and the entropy of U is H(U) (Equation 31). The entropy of U is bounded below by 0, reached in the case of a single cluster containing all items (log(P(i)) would be 0). The MI of U and V is I(U, V) (Equation 32), with P(i, j) (Equation 30) being the probability of a random object being in both Ui and Vj.

P(i) = |Ui| / N

Equation 29: Probability of a random object being in cluster Ui

P(i, j) = |Ui ∩ Vj| / N

Equation 30: Probability of a random object being in Ui and Vj

H(U) = −Σi P(i) log P(i)

Equation 31: Entropy of clustering U

I(U, V) = Σi Σj P(i, j) log( P(i, j) / (P(i) P(j)) )

Equation 32: Mutual Information between clusterings U and V
One problem with MI is that its upper bound is equal to or less than the smaller of the two clustering entropies H(U) and H(V). The NMI addresses this by fixing the lower bound to 0 and the upper bound to 1. A common normalization is to divide MI by the square root of the product of the clustering entropies (Equation 33).

NMI(U, V) = I(U, V) / √( H(U) H(V) )

Equation 33: Normalized Mutual Information
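Equations 29 to 33 translate directly into code. The sketch below computes entropy, MI and square-root-normalized NMI from two label lists; it is an illustration of the definitions, not the evaluation code used in the experiments.

```python
import math
from collections import Counter

def entropy(labels):
    # Equation 31: H(U) = -sum_i P(i) log P(i)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    # Equation 32: I(U, V) = sum_ij P(i, j) log( P(i, j) / (P(i) P(j)) )
    n = len(u)
    cu, cv = Counter(u), Counter(v)
    return sum((c / n) * math.log((c / n) / ((cu[i] / n) * (cv[j] / n)))
               for (i, j), c in Counter(zip(u, v)).items())

def nmi(u, v):
    # Equation 33: normalization by sqrt(H(U) * H(V))
    hu, hv = entropy(u), entropy(v)
    return mutual_information(u, v) / math.sqrt(hu * hv) if hu and hv else 0.0

print(round(nmi([0, 0, 1, 1], [0, 0, 1, 1]), 6))  # identical clusterings -> 1.0
```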
The trouble with NMI is that it is cardinality biased. If, for example, a clustering solution W is compared with two random clusterings A and B with |A| > |W| > |B|, then NMI(A, W) is likely to be greater than NMI(B, W) despite both clusterings being random, because the entropy does not increase 'fast' enough to counter the 'by chance' shared information, or 'accidental' MI. This becomes significant for small datasets (see the next sub-section, 'NMI versus AMI'). A solution is a correction for chance, as in Adjusted Mutual Information or AMI (Vinh et al., 2009). It calculates and removes the 'by chance' expected MI by means of a contingency table of the mutual information of all possible pairings between U and V (Equation 34). We do not review AMI in detail because of its complexity and refer the reader to Vinh et al. (2009). AMI ranges between 0 and 1 like NMI and removes the cardinality bias, resulting in a more expressive and intuitively meaningful measure.

AMI(U, V) = [ I(U, V) − E{I(U, V)} ] / [ √( H(U) H(V) ) − E{I(U, V)} ]

Equation 34: Adjusted Mutual Information
NMI versus AMI
The selected measure has to return a value for a clustering/categorization of the 125 bundles against the SAP ES Wiki's man-made 30 bundle groups. Initially we intended to use the popular Normalized Mutual Information (Studholme et al., 1999) measure, but unusual results and some investigation identified a cardinality bias (Figure 41). NMI adjusts for an increase in Mutual Information with rising cardinality by using entropy. Entropy is an accepted measure for evaluating clustering results (Zhao & George Karypis, 2004). However, this fails for small samples with a relatively large number of categories, with entropy failing to account for chance. We therefore adopt Adjusted Mutual Information as our measure: it has all the qualities of NMI, using information theory to measure mutual information with a normalization to make results comparable, and additionally removes the cardinality bias by accounting for chance (Vinh et al., 2009).
Figure 41: Measurement Cardinality Bias
Figure 41 illustrates the difference. We generated random categorisations of the 125 bundles with 1 to 125 categories. Each setting (1, 2, 3, … 125 categories) was run a hundred times, using the average to smooth the result. We measured the NMI and AMI against the 30 SAP ES Wiki groups and plotted the results, with 0 meaning no shared information and 1 meaning the results are identical. NMI shows a strong bias towards a greater number of clusters, flattening towards 0.8 despite the categorizations being random. AMI remains around zero, measuring only non-chance mutual information and providing a resilient measure. NMI or entropy-based methods are acceptable measures if the data source contains considerably more data points than the number of clusters/categories; for our experiment, however, AMI is necessary to give an unbiased result. An alternative measure would be the Adjusted Rand Index, but we did not investigate it further since its behaviour is comparable to AMI (Vinh & Epps, 2009; Vinh et al., 2009).
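The bias experiment is easy to reproduce in outline. The sketch below measures the average NMI of purely random categorisations against a fixed 30-group labelling of 125 items; the grouping is a uniform stand-in for the wiki bundle groups, not the actual wiki data, and the NMI implementation follows Equation 33.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(u, v):
    n = len(u)
    cu, cv = Counter(u), Counter(v)
    mi = sum((c / n) * math.log((c / n) / ((cu[i] / n) * (cv[j] / n)))
             for (i, j), c in Counter(zip(u, v)).items())
    hu, hv = entropy(u), entropy(v)
    return mi / math.sqrt(hu * hv) if hu and hv else 0.0

random.seed(1)
groups = [i % 30 for i in range(125)]  # stand-in for the 30 wiki bundle groups

def avg_random_nmi(k, runs=100):
    # average NMI of k purely random categories against the fixed grouping
    return sum(nmi(groups, [random.randrange(k) for _ in range(125)])
               for _ in range(runs)) / runs

# NMI for 100 random categories is far above that for 5,
# despite zero real agreement in both cases.
print(round(avg_random_nmi(5), 2), round(avg_random_nmi(100), 2))
```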
[Chart for Figure 41: NMI and AMI similarity to the 30 SAP ES Wiki groups plotted against the number of random categories (1 to 125, each averaged over 100 runs)]
5.2 Baseline clustering algorithms
We sourced state-of-the-art clustering algorithms from the popular CLUTO72
software package (Zhao & George Karypis, 2002, 2004) (version 2.1.1). CLUTO
divides clustering algorithms into a criterion function that evaluates and optimizes
the clusters and the clustering method that produces the clusters. The combination of
different criterion functions and methods provide a wide range of modern clustering
solutions. The criterion functions and clustering methods of CLUTO are an extensive
topic. We refer the reader to Appendix C, where we concisely describe the functions and methods, or to the CLUTO website73 and manual, and to the specific literature (Zhao & George Karypis, 2002, 2004) for an in-depth discussion.
5.2.1 Setup
The input to CLUTO (besides the clustering parameters) is a text file containing a
matrix of vectors or a matrix of similarities of the clustering objects. The output is a
list of clusters corresponding to the input vectors or objects. We provide the various
Semantic Spaces (see 5.1.2) as two vector matrix files in the CLUTO input format
containing the term and the bundle vectors. We process these using the vcluster
CLUTO executable employing all combinations of the criterion functions and
clustering methods resulting in 48 different clustering algorithmic approaches. The
desired number of clusters is set to 30, which is equivalent to the manual wiki
grouping. We also compute solutions for 6 supplementary criterion functions that are only applicable to agglomerative clustering methods, and for the graph clustering method, which does not utilize exchangeable criterion functions. The remaining CLUTO parameters were left at the software's default settings.
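For orientation, the sketch below writes a small dense matrix in what we understand to be CLUTO's input format (a header line with row and column counts, followed by one whitespace-separated row of values per line) and shows how vcluster might then be invoked. The file name and parameter choices are illustrative, not the settings of the experiments above.

```python
import os
import tempfile

def write_cluto_dense(path, matrix):
    # Dense-matrix input format as we read it from the CLUTO manual:
    # "<rows> <cols>" header, then one row of values per line.
    with open(path, "w") as f:
        f.write(f"{len(matrix)} {len(matrix[0])}\n")
        for row in matrix:
            f.write(" ".join(f"{v:.6f}" for v in row) + "\n")

path = os.path.join(tempfile.gettempdir(), "bundles.mat")
write_cluto_dense(path, [[0.1, 0.2], [0.3, 0.4]])
# vcluster could then be run on this file with a chosen method and
# criterion function and the desired number of clusters, e.g.:
#   vcluster -clmethod=rbr -crfun=i2 bundles.mat 30
with open(path) as f:
    print(f.readline().strip())  # header: "2 2"
```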
5.2.2 Results
The clustering experiments by means of CLUTO returned 7,212 results for the 48
combinations of methods, criterion functions, as well as link weight, singular factor
and perspectives. We refer the reader to Appendix C for detailed analysis of the
results. In this section, we focus only on the optimal clustering results reviewing the
72 See http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview for more. 73 See http://glaros.dtc.umn.edu/gkhome/views/cluto for more details on CLUTO.
120
influence of the novel link weight (see section 3.4.2 page 70), singular factor (see
section 3.4.1 page 69) and perspective parameter (see section 3.4.3 page 72)
influences. We did not choose one particular clustering algorithm that performed best
on average. Instead, in every case we report the best possible combination of method
and function. These highly optimized and selective results provide the baseline for
the Semantic Categorization results in the next section as well as a broad evaluation
of the mentioned novel features.
Singular Factor and Perspective
Figure 42 illustrates the best results across the sf range and contrasts them between the two perspectives, summarizing the influence of the Singular Factor (sf) parameter.
Figure 42: Singular Factor and Perspective
The overall best results and those of the Bundle perspective are the same, since the latter performs best in all cases. Besides the general difference in performance, a difference in the distribution of results across the sf range between the two perspectives is apparent. The Bundle results rise sharply, peaking at sf=0, and then drop off smoothly with increasing sf. The Term results are seemingly irregular, peaking at sf=0.5. We note that neither performs best at sf=1 as proposed by Deerwester et al. (1990). In fact, the Term perspective's result is almost the worst in this case, at only 30% (0.1055) of the sf=0.5 result (0.3573). Bundle at sf=1 is just below 90% (0.4272) of its best result (0.4762 at sf=0), supporting the alternative view of singular values
(Takayama et al., 1999). These results are in line with our experience from the use-
case experiments.
Link Weight and Perspective
The lnkwght parameter shows a noteworthy trend when contrasted between the two perspectives (Figure 43). Overall, Bundle performs far better than Term (0.4762 vs. 0.3473), as established previously. However, the influence of lnkwght shows opposite behaviour. Bundle improves slightly from 0% to 30% (0.4562 vs. 0.4762, or +4.4%) and then drops off fast from 50% to 90%. Term, on the other hand, starts flat and then rises from 40%, peaking at 70% with an improvement of +24.8% over 0% (0.2782 to 0.3473).
Figure 43: Link Weight and Perspective
5.3 Semantic Categorization
The Semantic Categorization (see 3.3) derives from Conceptual Space theory; it identifies non-overlapping semantic prototypical cores along one perspective and expands them by tessellation into categories that include all vector types.
5.3.1 Setup
The Semantic Categorization is performed with the same basic Semantic Space
parameters (see 5.1.3) as the state-of-the-art clustering (CLUTO) experiments in the
previous section, with sf values of -1, -0.5, 0, 0.5, 1, 2 and 4 and lnkwght from 0% to 90% in
10% steps. Unlike the clustering approach, Semantic Categorization does not require
the user to know or guess the optimal number of final clusters/categories. SC instead uses parameters describing the desired attributes of the final categories: distance, density and cut-off (see 3.3 for details). Density is a local parameter that, when increased, gives preference to denser semantic category cluster cores. Distance is a global parameter which, with increasing value, penalizes cluster core proximity. Cut-off is a global parameter that removes the long tail of tiny clusters that arises especially if the space is very sparse; it sets the minimum fitness for a cluster as a percentage of the fittest cluster. Besides the SS and SC parameters, the two perspectives, Term and Bundle, are tested.
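To illustrate the roles of the three parameters, the following toy sketch (emphatically not the actual SC algorithm of section 3.3) greedily selects dense, mutually distant cores, applies the cut-off relative to the fittest core and tessellates all vectors to their nearest surviving core. The cosine-based fitness formula and the near-duplicate threshold are our own illustrative assumptions.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_categorize(vectors, density=1.0, distance=1.0, cutoff=0.1):
    # Local density of each candidate core: summed similarity to all vectors.
    local = [sum(cos(v, w) for w in vectors) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: -local[i])
    cores, fitness = [], []
    for i in order:
        prox = max((cos(vectors[i], vectors[c]) for c in cores), default=0.0)
        if prox >= 0.999:  # keep cores non-overlapping (illustrative threshold)
            continue
        # `density` rewards dense cores; `distance` penalizes proximity
        # to cores that have already been selected.
        cores.append(i)
        fitness.append(density * local[i] - distance * prox * local[i])
    top = max(fitness)
    # cut-off: discard cores whose fitness falls below a fraction of the fittest
    keep = [c for c, f in zip(cores, fitness) if f >= cutoff * top]
    # tessellation: every vector joins the category of its most similar core
    return [max(keep, key=lambda c: cos(v, vectors[c])) for v in vectors]

vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(semantic_categorize(vecs))  # two categories: [1, 1, 3, 3]
```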
Run 1 (Bundle, Term):
  distance: 0.5, 1, 2, 4, 8, 16, 32, 64
  density: 0.5, 1, 2, 4, 8, 16, 32, 64
  cut-off: 0%, 5%, 10%, 20%, 40%
Run 2 (Bundle):
  distance: 2, 4, 8, 16, 32, 64, 128
  density: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096
  cut-off: 0%, 5%, 10%, 15%, 20%, 25%
Run 2 (Term):
  sf: -1, -0.5, 0, 0.5, 1, 2, 4
  distance: 0.5, 0.75, …, 2, 5, 10, 15, 20 / 0.5, 1, 2, 4, 8, 16, 32, 64
  density: 10, 20, 30, 40, 50 / 10^3, 10^4, …, 10^9
  cut-off: 0%, 1%, 2%, 3%, 4%, 5% / 0%, 5%, 10%, 20%, 40%
Run 4 (Bundle, Term):
  distance: 0, 0.5, 1, 5, 10, 50, 100, 500, 10^3, 10^4, 10^5
  density: 0.5, 1, 5, 10^1, 10^2, …, 10^7
  cut-off: 0%, 10%, 20%
Table 21: Semantic Categorization experiments parameter settings
Pilot tests established the parameter ranges. The first detailed run based on that experience (see Table 21, Run 1) returned 44,800 results. These results informed run 2, which we optimized for each perspective respectively and which returned 54,680 results74. Run 3 was informal, with numerous minor ad-hoc variations tested to establish whether significantly better results could be achieved by detailed optimization; the results are not noteworthy and are not reported here. The last experiment was Run 4, with a wider range of distance and density parameters to
74 Some Bundle experiments failed and could not converge due to a combination of sparseness of the space and parameter settings.
establish their applicable ranges and cross-relationships with the singular factor, which totalled 44,425 results.
We note here that the number of experiments is largely due to the number of parameters explored, the novelty of the algorithm requiring the establishment of parameter ranges (Run 1) and the behaviour of parameters in extremes (Run 3). In fact, the optimization (Run 2) did not yield much improvement on the best results of the first run (only +5.5% on the Term perspective). This is despite SC not depending on external knowledge of the optimal outcome of 30 categories. Some of the CLUTO algorithms would be able to process an input without the anticipated number of clusters; the implementation of CLUTO, however, requires this information as a means to select the optimal outcome75.
5.3.2 Results
Table 22 presents the top results for the two perspectives. The Bundle perspective continues to provide superior results to the Term perspective. The number of categories for both results is close to the manual optimum of 30. The following discusses the various parameters and their influence.
AMI Categories density distance cut‐off sf lnkwght
Bundle 0.4368 36 17 0 0 0 0.4
Term 0.3682 34 128 4 0.15 0.5 0
Table 22: Best SC result by perspectives
75 See the CLUTO manual at http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf section 3.1.
Figure 44: Maximum AMI according to perspective and sf for run 1
At first impression from run 1 (Figure 44), the singular factor indicates that the Bundle perspective performs best without singular values (sf=0) or with smoothing (sf=0.5), while the Term perspective performs worse overall and best with smoothed singular values, just as in the previous experiments with CLUTO and the use-cases. Singular factors beyond 1 seem of no use.
Figure 45: Maximum AMI according to density and sf in run 4
The data from run 4 (Figure 45 and Table 23) paints a more complicated picture. There appears to be a relationship between the density, the distance and the singular factor parameters. The higher the sf, the better large density settings perform, particularly for Bundles. The optimal combination remains at sf=0 and sf=0.5 for Bundle and Term respectively. The distance measure seems to have no predictable beneficial effect, particularly for the sparse distribution of the Bundles. Term results can benefit somewhat from the distance, but in an unpredictable manner. Large distance settings are generally detrimental with increasing sf.
We recall that the SVD orders singular values decreasingly with the first one or few
containing a significantly larger weight than the tail of minuscule values. This is in
line with the factorization by SVD, which attempts to identify and extract the most
significant information/factor(s). An sf greater than 1 consequently amplifies this, emphasizing the main factor and collapsing the left matrix of the SVD towards the leading 'heavy' columns, eliminating the finer details in the row vectors gained from the remaining columns. It seems that large density settings counteract this shift in weighting, which indicates that the higher-order columns contain information about the smaller differences between the vectors, which can be somewhat recovered by raising the local weighting. The global distance parameter is less important for the apparently already well differentiated distribution of Bundle categories. It does play a beneficial role for the Term perspective. The fading of the higher columns/dimensions under larger singular factors indicates that the area in which the data resides collapses, and thus large distance measures become counterproductive.
sf
Bundle Term
‐1 ‐0.5 0 0.5 1 2 4 ‐1 ‐0.5 0 0.5 1 2 4
Density
0.5 0.023 0.027 0.000 0.000 0.000 0.000 0.000 0.036 0.069 0.134 0.016 0.000 0.000 0.000
1 0.133 0.130 0.001 0.000 0.000 0.000 0.000 0.122 0.101 0.215 0.302 0.000 0.000 0.000
5 0.306 0.328 0.416 0.012 0.000 0.000 0.000 0.101 0.161 0.122 0.274 0.005 0.000 0.000
10 0.267 0.310 0.424 0.062 0.011 0.000 0.000 0.125 0.174 0.151 0.340 0.222 0.000 0.000
1.E+02 0.224 0.287 0.372 0.367 0.226 0.000 0.000 0.121 0.145 0.158 0.364 0.172 0.011 0.000
1.E+03 0.226 0.226 0.378 0.346 0.366 0.084 0.000 0.091 0.117 0.171 0.270 0.115 0.136 0.000
1.E+04 0.000 0.000 0.013 0.205 0.396 0.404 0.000 0.102 0.121 0.183 0.252 0.080 0.171 0.000
1.E+05 0.000 0.000 0.176 0.362 0.000 0.087 0.119 0.192 0.249 0.195 0.178 0.088
1.E+06 0.000 0.398 0.083 0.115 0.117 0.200 0.293 0.173 0.085 0.173
1.E+07 0.135 0.296 0.104 0.117 0.200 0.256 0.213 0.176 0.224
Distance
0 0.250 0.325 0.424 0.361 0.361 0.398 0.266 0.122 0.119 0.191 0.291 0.155 0.171 0.224
0.5 0.294 0.328 0.407 0.346 0.364 0.382 0.266 0.111 0.131 0.184 0.302 0.151 0.157 0.220
1 0.291 0.328 0.412 0.346 0.364 0.382 0.270 0.111 0.131 0.184 0.302 0.151 0.157 0.220
5 0.289 0.309 0.393 0.346 0.364 0.404 0.266 0.120 0.127 0.200 0.310 0.179 0.157 0.173
10 0.306 0.310 0.407 0.346 0.366 0.390 0.294 0.108 0.110 0.166 0.364 0.205 0.157 0.196
1.E+02 0.266 0.319 0.416 0.367 0.370 0.377 0.296 0.118 0.135 0.145 0.274 0.149 0.178 0.203
1.E+03 0.228 0.306 0.416 0.324 0.396 0.345 0.000 0.112 0.126 0.138 0.185 0.146 0.176 0.000
1.E+04 0.234 0.294 0.406 0.347 0.039 0.000 0.000 0.125 0.132 0.151 0.201 0.222 0.040 0.000
1.E+05 0.213 0.294 0.406 0.231 0.000 0.000 0.000 0.117 0.145 0.215 0.201 0.213 0.012 0.000
1.E+06 0.154 0.262 0.137 0.000 0.000 0.000 0.000 0.116 0.174 0.176 0.293 0.101 0.000 0.000
1.E+07 0.071 0.053 0.000 0.000 0.000 0.000 0.000 0.055 0.069 0.109 0.148 0.061 0.000 0.000
Table 23: Maximum AMI according to distance, density and sf in run 476
Table 24 and Table 25 offer a different view of density and distance, looking at the optimal sf results for the two perspectives and their interaction. They reaffirm that the sparseness of the Bundle perspective removes the value of a global distance parameter. The Term perspective indicates that the global feature of the categorization can nevertheless be very important: it reaches its optimum at a distance of 10 and a density of 100, and without a distance measure (distance=0) it would have achieved less than 80% of that.
76 The cell shadings are visual guides to identify trends.
Distance \ Density   0.5    1      5      10     1.E+02  1.E+03  1.E+04  1.E+05
0                    0.000  0.001  0.295  0.424  0.340   0.378   0.011   0.000
0.5                  0.000  0.001  0.294  0.407  0.372   0.368   0.011   0.000
1                    0.000  0.001  0.294  0.412  0.372   0.368   0.011   0.000
5                    0.000  0.001  0.322  0.393  0.372   0.368   0.011   0.000
10                   0.000  0.001  0.350  0.407  0.372   0.368   0.011   0.000
50                   0.000  0.001  0.416  0.384  0.372   0.359   0.011   0.000
100                  0.000  0.001  0.416  0.342  0.372   0.359   0.011   0.000
500                  0.000  0.001  0.406  0.348  0.341   0.320   0.013   0.000
1000                 0.000  0.001  0.406  0.368  0.343   0.214   0.013   0.000
10000                0.000  0.000  0.038  0.137  0.129   0.032   0.000   0.000
100000               0.000  0.000  0.000  0.000  0.000   0.000   0.000   0.000
Table 24: Maximum AMI for run 4 - Bundles, density to distance at sf=0
Distance \ Density   0.5    1      5      10     1.E+02  1.E+03  1.E+04  1.E+05  1.E+06  1.E+07
0                    0.010  0.291  0.121  0.136  0.225   0.270   0.228   0.225   0.231   0.231
0.5                  0.009  0.302  0.120  0.114  0.230   0.257   0.218   0.206   0.193   0.193
1                    0.009  0.302  0.115  0.127  0.257   0.265   0.217   0.206   0.193   0.193
5                    0.010  0.289  0.224  0.275  0.310   0.260   0.241   0.222   0.222   0.222
10                   0.010  0.218  0.257  0.340  0.364   0.230   0.229   0.215   0.185   0.185
50                   0.016  0.234  0.274  0.266  0.161   0.087   0.068   0.104   0.121   0.167
100                  0.010  0.180  0.174  0.181  0.185   0.082   0.082   0.104   0.122   0.175
500                  0.010  0.201  0.107  0.148  0.046   0.078   0.136   0.116   0.121   0.175
1000                 0.010  0.201  0.107  0.148  0.036   0.078   0.133   0.123   0.123   0.172
10000                0.010  0.153  0.150  0.166  0.243   0.265   0.252   0.249   0.293   0.256
100000               0.000  0.059  0.107  0.148  0.128   0.111   0.108   0.091   0.097   0.074
Table 25: Maximum AMI for run 4 - Term, density to distance at sf=0.5
The maximum-AMI results for the lnkwght parameter, combined over runs 1, 2 and 4 (Figure 46), indicate a slight preference for a 0% link-weight in the Term perspective and 40% in the Bundle perspective. The use-case results support the latter, but the benefit for categorization is much weaker, if we can claim it at all. Maximum AMI declines for both perspectives with higher link-weights.
Figure 46: Link-weight results combined from run 1, 2 and 4
A selection of cut-off parameters across runs 1, 2 and 4 in Figure 47 illustrates the difference in the parameter's effect on the two perspectives. The Bundles perform best at a cut-off of 0 while Term gains from a cut-off of 5-10%. This is likely due to the difference in size and distribution of the two sets: the Bundles number only 125 items while the Terms, the semantic base, total 6,000. As a result, Term tends towards a 'tail' of minuscule semantic cores unless a stopping criterion such as a relative fitness measure, i.e., a cut-off, is introduced.
Figure 47: Cut-off result selection from combined runs 1, 2 and 4
Figure 48 plots the average and maximum AMI for both perspectives against the number of categories on the horizontal axis. The Bundle plots end around 50 categories since a medoid and at least one member of the categorizing type (in this case a Bundle vector) define a minimal category. The 125 bundles and their distribution therefore limit the number of possible categories to fewer than 63. A category in the
Term perspective counts if it contains at least one Bundle. Consequently, the Term perspective can produce a maximum of 125 categories.
A peak of the AMI plots occurs just after 30 categories, which is the number of manual groupings, with minima at 1 and 125 categories. This reconfirms that the chosen measure does not favour either a minimum or a maximum number of categories.
Figure 48: Maximum and Average AMI according to number of categories
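This chance-adjustment property of AMI can be illustrated with a short sketch. We use scikit-learn's implementation here as a stand-in; the thesis does not prescribe a particular library, and the labelings below are synthetic:

```python
# Sketch: AMI is adjusted for chance, so random categorizations score near
# zero regardless of how many categories they produce. Synthetic data with
# 125 items and 30 gold groups, mirroring the bundle setting only in size.
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
gold = rng.integers(0, 30, size=125)     # stand-in for the 30 manual groups

# Random categorizations, coarse and fine, both score near 0 ...
ami_few = adjusted_mutual_info_score(gold, rng.integers(0, 2, size=125))
ami_many = adjusted_mutual_info_score(gold, rng.integers(0, 100, size=125))
# ... while a perfect reproduction of the gold standard scores 1.
ami_self = adjusted_mutual_info_score(gold, gold)
print(ami_few, ami_many, ami_self)
```

Unadjusted mutual information would reward the fine-grained random labeling; the adjustment removes this artefact, which is why the measure favours neither extreme of the category count.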
5.3.3 Qualitative analysis by example
Semantic Categorization achieved a maximum AMI of 0.4368 with the Bundle perspective77. This gives us an information-theoretic measure, but does it constitute a comparable and meaningful grouping for a human, and at what AMI would that be reasonable to assume? We present a qualitative review of the best result to illuminate this question and establish whether SC accomplished a meaningful categorization.
77 An XML formatted representation of the categories is available in appendix B.
Similarity  Bundle Name  Manual Bundle Group
medoid interactive_selling Sales
0.52 quote_to_order_for_configurable_products Sales
0.52 vehicle_management_system Automotive
0.50 customer_fact_sheet Sales
0.48 sales_incentive_and_commission_management Sales
0.48 product_master_data_management Sales
0.47 opportunity_management Sales
0.38 activity_management Sales
0.40 sales_contract_management Sales
0.25 territory_management Sales
0.29 trade_and_commodity_management Sourcing
0.25 global_data_synchronization Retail
Table 26: Semantic category example
Table 26, an example from the mentioned Semantic Categorization, contains 11 members and a medoid, for a total of 12 bundles. Nine of them come from the SAP ES Wiki Sales bundle group (see Table 27 for the full group). Based on their naming, the topic of these 9 bundles appears strongly sales-related. The three additional bundles in the category seem odd at first sight. Two of them have a low similarity measure. The vehicle_management_system bundle shows a stronger relationship, and in its text we find that it:
“[…] enables the interaction between the dealers and SAP Vehicle
Management for Automotive, which importers and distribution centers run.”
(SAP ES Wiki Vehicle Management System Bundle)
Furthermore, its main audience is system administrators, sales and importer representatives, and dealers. The bundle describes in detail the import, order and particularly the sales process. The latter is a strong focus, including three use-cases, all concerned with the sales and ordering processes. The use-cases are not part of the Semantic Space since we use the same corpus as in chapter 4. Consequently, their focus further validates the semantic association with the sales bundle group extracted from the description.
The trade_and_commodity_management bundle, one of the two more loosely related 'odd' bundles, concentrates, for example, on the trading of commodities on exchanges. This relates to the purchase and transfer of goods, although it is not sales in the typical sense. The global_data_synchronization bundle, the last undiscussed bundle, appears unrelated by name. When we review the bundle, we find its focus to be data exchange for retailers, receiving data from manufacturers more efficiently than through the traditional electronic data interfaces. Again, this is not the usual sales meaning, but it relates to the purchase and transfer of goods. In summary, we find that the category around the sales bundle group is conceptually coherent, including the newly added and less obvious bundles provided by SC.
Manual Group In new ‘Sales’ category
account_and_contact_management
activity_management X
customer_fact_sheet X
customer_quote_management
interactive_selling X
opportunity_management X
order_to_cash
order_to_cash_with_crm
product_master_data_management X
quote_to_order_for_configurable_products X
rebate_management
sales_contract_management X
sales_incentive_and_commission_management X
territory_management X
Table 27: Wiki Sales bundle group
Lastly, we review what happened to the 5 sales group members that SC did not attribute to the new category. The account_and_contact_management bundle moved to a category around the customer_information_management_-_business_operations medoid. The wiki's attribution to the sales group is sensible, since the account and contact information of business partners/customers is part of sales, but the semantic categorization that organizes the business partner/customer data and data-management bundles together is also conceptually sound. This example illustrates how a view or bias, e.g., from experience and daily interaction, influences the wiki users' organisation of the data. This is neither wrong nor right, since it reflects their personal organization of the data, but if their view is not the single or overwhelming one, then this organization may be suboptimal. The different organization by the Semantic Categorization, based on the statistical distribution of the language and the relationships of documents, is a sensible alternative.
Another such alternative is the attribution of the order_to_cash, order_to_cash_with_crm and customer_quote_management bundles to a category around the order_to_cash medoid, which established a category around the order and billing concept. This is clearly related to sales but large and independent enough to warrant a category of its own. The three bundles are highly related in both the users' and the Semantic Categorisation's interpretation, despite the difference in the overall space's organisation.
The remaining bundle, rebate_management, is part of a small category with agency_business. It illustrates the limits of Semantic Categorization. The rebate bundle handles accumulated discounts but describes its purpose poorly, with only a short text. Conceptually it may relate better to the billing and invoice category. The agency_business bundle aggregates invoice process data for high-volume scenarios and relates only remotely to the rebate bundle.
5.4 Discussion
In this chapter we presented the SAP ES Wiki bundle groups as an example of a manually made grouping of service-related information, which someone exploring the service space can utilize to gain an overview without searching and reviewing large quantities of services. We discussed an optimal measure to compare the manual groups with generated ones and decided to employ the AMI measure. We introduced the two perspectives, Term and Bundle, under which the space is organized.
        CLUTO  SC
Term    0.347  0.368
Bundle  0.476  0.437
Table 28: Top results (AMI) for CLUTO and Semantic Categorization
Our hypothesis from the introduction states that the category model from Conceptual Space theory could be an effective model to organize the Semantic Space to allow exploratory search. We provided an algorithm to identify semantic cores and tessellate categories around them in chapter 3.3. We delivered its evaluation in this chapter by comparing it with the manual bundle groups by means of AMI and a qualitative investigation. We also ran a wide range of state-of-the-art clustering algorithms in exhaustive experiments (see Appendix C), choosing the best possible individual results as a baseline. We compared them with the manual groups to position the Semantic Categorization results against contemporary methods.
We furthermore reviewed in detail two novel Semantic Space contributions through their parameters, singular factor and link-weight, to establish their possible value. The top results of the traditional clustering and the Semantic Categorization are close, and the difference is probably not discernible by the user (Table 28). SC is slightly better in the Term perspective, while CLUTO achieved a slightly higher result in the Bundle clustering. The influence of the singular factor is equal for both clustering and categorizing: the Bundle perspective benefits from removing the singular values, while the Terms achieve best results with smoothed (sf=0.5) values. The link-weight has a more varied influence. The Bundle perspective does not change much with a small to medium link-weight and deteriorates with higher weights. The Term perspective benefits strongly from a high link-weight in the CLUTO setting, while its influence on Semantic Categorization is either not noticeable or detrimental.
We have established that Semantic Categorization returns results comparable to traditional clustering when measured by information-theoretic metrics. The qualitative review of the top SC Bundle result provides assurance that this measure is useful and the underlying categories are conceptually relevant. In its current form, we cannot claim that Semantic Categorization is more effective or computationally efficient. We have shown, though, that the Conceptual Space inspired categorization is appropriate and comparable with traditional clustering approaches. We note that the examined state-of-the-art clustering algorithms are mature and have undergone extensive development and revision, while the SC algorithm is novel. Furthermore, the reported results for the baseline clustering are only the very best in each situation from a large array of evaluated algorithms. Lastly, the clustering approaches reviewed here by their nature depend on external knowledge of how many clusters are optimal. Semantic Categorization does not require such an external parameter but depends solely on the general information of what constitutes a semantic core, and establishes the categories and their number based on that.
Qualitative analysis showed evidence that Semantic Categorization computes conceptually coherent categories even when these categories differ from the baseline solution. In summary, we claim that Semantic Categorization has the potential to produce an automatic, meaningful and effective map of the Service Ecosystem for exploration, as stated in the second research question. For further research we propose exploring the topic with larger and less homogeneous corpora. This would complete our understanding for real-world applications, where the semantics of the data source and user base may be even broader.
6 Discussion
This thesis began with an overview of the various streams of service-related developments, which we tied together into the background of an emerging Service Ecosystem. We identified Service Discovery to be of strategic importance for the functioning of the SES. An overview of traditional and proposed Service Discovery mechanisms revealed substantial shortcomings in addressing agenda-based discovery scenarios with uncertain service need knowledge on the searcher's part. This becomes increasingly important since we observe a shift from functional SOA-oriented service consumption to complex human selection and consumption against the backdrop of an ever growing and changing SES. We therefore argued that a suitable Service Discovery framework and system is required. We assumed that an effective discovery system is one that is sensitive to human conceptualization of the service domain. Therefore, we used Conceptual Space theory, a theory of conceptual representation from cognitive science, as a background motivating theory to compute concepts that may align with equivalent human representations. This, and the fact that besides functional descriptions the second main source of service information is unstructured text, led us to reframe the discovery process as an Information Retrieval problem. We hypothesised that a Semantic Space based on the Service Information Shadow is an effective Service Discovery mechanism for direct and indirect search.
We modelled a Semantic Space Discovery system including novel features to enhance discovery. We set out to answer the hypothesis by simulating a SIS and SD scenario using the SAP ES Wiki. The presented experiments utilize a small corpus compared to what is common in the IR domain, e.g., in the Text Retrieval Conference78. On the other hand, research in service discovery has thus far employed small corpora (Bose et al., 2008; Klusch, Fries, & Sycara, 2006; Mokhtar et al., 2007; Peng, 2007; Stroulia & Wang, 2005; Zhuang et al., 2005), down to even single-digit numbers of services (Sanchez & Sheremetov, 2008). There are no standardized, open, reliable service corpora with associated unstructured information available. We
78 See http://trec.nist.gov/ and http://trec.nist.gov/data.html for more details.
hope that the emergence of SES-like architectures (Cardoso et al., 2010) will provide the foundation to build corpora for service discovery similar to the corpora used in IR research. Our research identified a unique service repository in the SAP ES Wiki, used to empirically validate the service discovery model presented in this thesis. Its size is in line with the 'larger' service discovery experimental corpora used so far, and our work emphasizes the importance for such corpora to include all types of information, unstructured text among them.
6.1 Service Discovery by Directed Search
Chapter 3 introduced the Semantic Service Discovery model. We employed the
TASA/TOEFL experiment to evaluate the quality of the vector representations in the
dimensionally reduced term co-occurrence matrix and at the same time reviewed the
impact of the modified S values and other parameters in the model. We established
that with the given corpus our model was resilient to parameter variations and able to
provide highly relevant semantic associations when compared to existing research
and the baseline. We furthermore obtained the first evidence that the S values may be
of little or negative value when constructing a Semantic Space.
The subsequent chapter presented a novel data source and experimental setup to evaluate query-based service discovery in a Service Ecosystem. The overall comparison was against contemporary IR systems, with the SD scenario reframed as an IR problem. Nonetheless, a traditional IR evaluation would not have done justice to the complex interrelationships and would not have been as valid as a real-world data source and scenario. Consequently, we undertook the extensive work to find, extract and distil a new and relevant corpus. The choice of the SAP ES Wiki with its use-cases and bundles provided highly applicable queries and documents. In particular, the hyperlinks enabled the extension of the model by the new Linked Document Vectors, and the Service Operations provided the basis for the test of the Combined Queries. We chose the baseline IR systems to provide a solid and broad comparison. They included modern BM25 and VSM systems as well as a selection of alternative Semantic Space systems. Three different levels of query precision sourced from the use-cases simulated variations in the expressiveness of the service information need. This comprehensive evaluation, including combined queries on the SSD model, considers various scenarios of search, from the current short, iterative querying to more thoughtful and expressive querying, possibly even supported by documents, usage history or search by example.
The performance difference of the SSD model over alternative systems in the three use-case experiments, containing 448, 448 and 413 queries, is nearly always significant. Even in the two cases where the basic SSD model's result is not significantly different from the Zettair and Lucene results, its AAR is noticeably lower, i.e., better. The results across all three experiments are consistent, indicating that (SVD-sourced) Semantic Spaces are generally superior to the state-of-the-art probabilistic and VSM systems. Our SSD model and its innovative modifications proved particularly successful and performed best in all settings, improving on the SS system that inspired the SSD prototype. In optimal circumstances, we were able to achieve an average rank of 1.275 over 448 queries. This means that a searcher describing a service need in detail is very likely to find the most relevant result at rank 1 or very close to it. Even in a worst-case scenario, when just a few words poorly express a service need as in the Titles experiment, we provided the best result with an AAR of 4.42. This means that the easily comprehensible and popular format of top-10 search results would be highly relevant in this situation. We therefore claim to confirm our hypothesis from the first research question, which states that Semantic Spaces promote effective service retrieval in a Service Ecosystem. We extend this statement and claim that we provide a more effective service discovery mechanism than achieved with contemporary IR systems. A key factor in this effectiveness is the extrapolation of tacit relationships. When the SVD maps the sparse matrix with term vectors into a lower-dimensional space, a “smoothing” of vector representations occurs. This lossy compression entails a removal of weak relationships and a strengthening of the remaining, including hidden, ones. This is akin to the “guessing” of implicit term associations. On this point, the authors of LSA, the original application of SVD for semantic purposes, have this to say:
“The relationships inferred by LSA are also not logically defined, nor are they
assumed to be consciously rationalizable as these could be. Instead, they are
relations only of similarity or of context sensitive similarity but they
nevertheless have mutual entailments of the same general nature, and also give
rise to fuzzy indirect inferences that may be weak or strong and logically right
or wrong.” (Landauer et al., 1998)
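Returning to the evaluation metric: the Average Answer Rank (AAR) used above can be reproduced schematically as follows. The similarity matrix and relevance assignments here are invented for illustration and are not the experimental data:

```python
# Sketch: AAR over a query set. For each query, the rank (1-based) of the
# single relevant document in the similarity-sorted result list is taken,
# and AAR is the mean of those ranks; lower is better, 1.0 is perfect.
import numpy as np

def answer_rank(similarities: np.ndarray, relevant: int) -> int:
    """1-based rank of the relevant document when sorted by similarity."""
    order = np.argsort(-similarities)        # indices, descending similarity
    return int(np.where(order == relevant)[0][0]) + 1

# Illustrative scores for three queries over five documents.
sims = np.array([[0.9, 0.2, 0.1, 0.4, 0.3],
                 [0.1, 0.8, 0.7, 0.2, 0.0],
                 [0.3, 0.1, 0.2, 0.6, 0.5]])
relevant_docs = [0, 2, 3]                    # relevant document per query
aar = float(np.mean([answer_rank(s, r)
                     for s, r in zip(sims, relevant_docs)]))
print(aar)
```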
Naturally, the resulting Semantic Space lends itself to extension with conceptually related information, as we have shown very successfully with LDV and CQ. The linked document vectors are a simple and effective method to introduce the relationship explicitly encoded by a hyperlink between two documents. Wherever such a relationship is available, e.g., in XML or HTML, the LDV offers a weighted method to include this (semi-)structured information in the design and refinement of the previously purely unstructured and empirically sourced Semantic Space. The dramatic improvements in the use-case experiments employing LDVs substantiate this. We therefore claim that the extension by Linked Document Vectors has proven very effective, and this extension, where applicable, should be considered when building future SS models. The Combined Queries are a vehicle to extend a query easily and concisely with relevant information, i.e., existing vector-encoded objects like service operations. They can benefit the search experience very noticeably, as shown in the 25p experiment, where they improved the median and minimum results by over 18%. The 100p experiment also showed an impressive enhancement in AARs, i.e., 15.8% in the median results, despite the absence of further improvement of the near perfect minimum result. The lack of data for the Titles experiment prevents us from identifying a possible trend towards improving deteriorating queries. The fact that the Combined Queries add additional information effortlessly for the user and accurately for the system, paired with the promising results we presented, encourages further research into combined queries and their application in future models.
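A minimal sketch of a Combined Query, under the assumption that it can be read as a weighted blend of the free-text query vector with the centroid of the attached object vectors (e.g. service operations); the exact combination scheme in the SSD model may differ:

```python
# Sketch: a Combined Query blends the query vector with the centroid of
# already-encoded object vectors. The blend weight and normalization are
# illustrative assumptions, not the thesis implementation.
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v / n if n else v

def combined_query(query_vec, object_vecs, object_weight=0.5):
    """Blend the query with the centroid of the attached object vectors."""
    centroid = normalize(np.mean([normalize(o) for o in object_vecs], axis=0))
    return normalize((1 - object_weight) * normalize(query_vec)
                     + object_weight * centroid)

q = np.array([1.0, 0.0, 0.0])                          # toy query vector
ops = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.8, 0.6])]
cq = combined_query(q, ops)
print(cq)
```

The combined vector is then matched against document vectors exactly like a plain query, so no change to the retrieval machinery is required.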
6.2 Exploring the Space by Semantic Categories
We proposed in chapter 5 the manual bundle grouping in the SAP ES Wiki as a baseline for alternative organizations of the bundles, which relate to combined or related services. This provides the underpinning for an examination of the second research question, which states that Semantic Categories provide an automatic, meaningful and effective map of the Service Ecosystem for exploration. We investigated the question by comparing Semantic Categories with manual groupings. We discussed various similarity measures and identified the information-theoretically motivated and chance-adjusted AMI as the best choice of performance measure. The SSD model includes the novel feature of perspectives on the Semantic Space. It is applicable to any automatic organization of the space, and we reviewed it with the state-of-the-art clustering algorithms as well as with Semantic Categorization.
We provided a baseline by means of a broad assortment of clustering algorithms, such as k-means and others based on repeated bisectioning, agglomerative, direct and nearest-neighbour clustering, in combination with a great number of criterion functions. We employed the popular CLUTO software package for this task and describe details of the evaluated algorithms in Appendix C. We compared the effectiveness of SC by measuring its performance against only the very best outcomes (see chapter 5.2) of this extensive evaluation. This means that we did not compare against one particular clustering algorithm but, in each comparison, e.g., a perspective under variations of the singular factor, against the best possible one from the pool of algorithms.
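The pooling protocol can be sketched as follows. scikit-learn algorithms stand in for the CLUTO implementations here, and the vectors and gold labels are synthetic, so this is an assumption-laden illustration of the protocol, not the original setup:

```python
# Sketch of the baseline protocol: run several clustering algorithms on
# the same vectors and report only the best AMI per comparison.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(1)
vectors = rng.normal(size=(125, 20))     # stand-in bundle vectors
gold = rng.integers(0, 30, size=125)     # stand-in manual groups

candidates = [
    KMeans(n_clusters=30, n_init=10, random_state=0),
    AgglomerativeClustering(n_clusters=30),
]
best = max(adjusted_mutual_info_score(gold, algo.fit_predict(vectors))
           for algo in candidates)
print(best)                              # the single figure reported as baseline
```

In the actual evaluation the candidate pool is far larger (all CLUTO methods and criterion functions), which makes the baseline deliberately hard to beat.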
The Semantic Categorization and the baseline clustering algorithms utilized the same Semantic Space and the same manually determined gold standard. Their optimal results are close to each other, with SC outperforming CLUTO slightly in the Term perspective, while CLUTO outperforms SC in the overall better performing Bundle perspective. The difference between the perspectives provided evidence for the value of typing vectors in a Semantic Space, where possible, to identify relevant types and use them primarily for organizing an overview of the service space. The qualitative review of the best SC result illustrated the meaningfulness of the categories in the sense of being conceptually coherent. The review also showed that alternative categorizations made by the SC, which may have resulted in a lower AMI score, are conceptually relevant and possibly less biased than the groupings present in the wiki's manually produced gold standard. We cannot, based solely on this insight, claim a better performance than the traditional clustering, but we encourage future research to employ sophisticated evaluations beyond purely quantitative measures.
Overall, the results showed that the performance of the baseline clustering algorithms and SC is comparable. Qualitative investigation of the semantic categories provided evidence that they are meaningful, substantiating our second hypothesis that semantic categories can provide a useful basis for mapping the service space. Considering the novelty of the SC algorithm and the maturity and vast selection of the baseline clustering algorithms, we do not discount the possibility that with further research SC may become superior to contemporary clustering algorithms.
This work addresses the first step towards a map-like exploration of the SES. Our empirical results suggest that traditional clustering and SC are equally effective at creating a map for this purpose. However, effective navigation of the SES would, in the next step, require viewing the SES at various levels of abstraction and allowing the user to move between these abstraction layers. We anticipate SC to perform well in this respect too. It was designed to be flexible and utilizes both global and local features to produce categories of any size and abstraction. SC avoids the agglomerative and divisive processing of the space commonly employed in clustering, which will enable SC to provide views at each level of abstraction independently from other views. This next step will require a review in its own right with a more complex data source and (similar to this work) a multifaceted evaluation of all aspects of the space's partitioning.
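The tessellation step at the heart of SC can be sketched as nearest-medoid assignment once semantic cores are identified; core detection itself and the interplay of global and local features are omitted here, so this is an illustrative simplification, not the exact SC algorithm:

```python
# Sketch of the tessellation step only: every non-medoid vector joins the
# category of its most similar medoid, and a medoid with at least one
# member defines a minimal category (empty ones are dropped).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def tessellate(vectors: np.ndarray, medoid_ids: list[int]) -> dict:
    categories = {m: [] for m in medoid_ids}
    for i, v in enumerate(vectors):
        if i in medoid_ids:
            continue
        best = max(medoid_ids, key=lambda m: cosine(v, vectors[m]))
        categories[best].append(i)
    return {m: members for m, members in categories.items() if members}

rng = np.random.default_rng(2)
vecs = rng.normal(size=(12, 5))          # toy vector space
cats = tessellate(vecs, medoid_ids=[0, 3, 7])
print(cats)
```

Because each vector is assigned independently of any hierarchy, views at different abstraction levels can be produced without re-running an agglomerative or divisive pass.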
6.3 Singular factor
The singular factor in this research was a vehicle to evaluate the two propositions in existing research of either ignoring or utilizing the S values when creating the semantic associations in a reduced matrix or Semantic Space. The singular factor enabled us to vary how we employ the S values beyond the two proposed settings.
All experiments refuted the original notion that singular values are beneficial for the reconstruction of the left side's row relationships (Deerwester et al., 1990). The TASA experiment showed a strong preference for no singular values (sf=0). The use-case and categorization experiments partially supported that and showed improvements of up to 85% in some situations over the direct use of S values (sf=1). However, the singular values do sometimes improve the quality of semantic associations, contrary to the second traditional view that they are of no benefit (Schütze, 1997, 1998; Takayama et al., 1999). Both the use-case and categorization results show that under certain circumstances smoothed singular values (sf=0.5) can return better results than sf=1 and even sf=0. The literature has not provided conclusive evidence or explanations about the optimal use of singular values. The unexpected utility of smoothed S values for enhancing the quality of semantic vectors warrants further research detailing their effect. Until then, SVD-grounded SS models should consider and evaluate their application of singular values or, when in doubt, ignore them (sf=0).
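Reading the singular factor as an exponent on the singular values, which is our interpretation of the parameter stated here as an assumption, the three settings compare as follows:

```python
# Sketch: the singular factor sf as an exponent on the singular values when
# building reduced term vectors. sf=0 ignores S (plain U columns), sf=1
# applies S fully (U·S), sf=0.5 gives the smoothed variant.
import numpy as np

def reduced_term_vectors(cooc: np.ndarray, k: int, sf: float) -> np.ndarray:
    u, s, _vt = np.linalg.svd(cooc, full_matrices=False)
    return u[:, :k] * (s[:k] ** sf)      # scale the k kept columns by S^sf

rng = np.random.default_rng(3)
m = rng.normal(size=(50, 40))            # toy co-occurrence matrix
vecs_sf0 = reduced_term_vectors(m, k=10, sf=0.0)    # ignore S
vecs_sf05 = reduced_term_vectors(m, k=10, sf=0.5)   # smoothed S
vecs_sf1 = reduced_term_vectors(m, k=10, sf=1.0)    # full S
print(vecs_sf0.shape, vecs_sf05.shape, vecs_sf1.shape)
```

With sf=0 every reduced dimension contributes equally to row similarities; increasing sf progressively re-weights the dimensions by their singular values, which is why sf=0.5 acts as a compromise between the two traditional positions.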
6.4 Link-weight
The Linked Document Vectors are a flexible means to model a relationship, or link, between pieces of information, which we can find, for example, in hyperlinked and XML documents, ontologies or even in crowd-sourced recommendation systems. We have used uniform types of links in this work to establish their value. In more complex settings, qualitatively different links with different weights may also be an option to extend and develop the model further. For example, the Universal Service Description Language, or USDL (Cardoso et al., 2010), offers a flexible approach to include structured and unstructured information without describing a means of Service Discovery. The presented SSD model can easily exploit both types of information available in such a language. For example, the wide range of dependencies and provider-defined service capabilities offers a plethora of opportunities to extend the SSD model and add weighted extensions to the LDV.
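A simplified reading of the LDV mechanism, assuming a document vector is blended with the centroid of its linked documents' vectors under the link-weight w; the thesis's exact formulation may differ:

```python
# Sketch: a Linked Document Vector blends a document's own vector with the
# centroid of the documents it links to, controlled by the link-weight w
# (0% to 90% in the experiments). Uniform link types, as in this work.
import numpy as np

def linked_document_vector(doc_vec, linked_vecs, w=0.4):
    """Return the document vector enriched by its outgoing links."""
    if not linked_vecs:
        return doc_vec                   # no links: vector is unchanged
    centroid = np.mean(linked_vecs, axis=0)
    return (1 - w) * doc_vec + w * centroid

doc = np.array([1.0, 0.0, 0.0])
links = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
ldv = linked_document_vector(doc, links, w=0.4)
print(ldv)                               # [0.6, 0.2, 0.2]
```

Qualitatively different link types, as suggested for USDL-style descriptions, would simply replace the single w with per-type weights in the same blend.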
The use-case experiments showed in a definitive manner how valuable the additional relationship information is for the semantic associations. We were able to improve results across all scenarios by up to 32%. It is noteworthy that the improvements were evident in all but the minimum 100p results. We attribute the minor improvement in this instance to the fact that the results were already near an optimum with an AAR of 1.27. In all other cases, where the Semantic Space parameters might not be optimal or the query less expressive, or both, the LDV improved results remarkably.
Since the bundles, the focus of the use-case experiments, are semantically rich, we hypothesize that the reason for the superior outcome is the secondary effect of disambiguating the linked but poorly described documents, e.g., the service operation descriptions, and thereby only indirectly the bundles. Additional corpora with various degrees of semantic depth and linking have to be explored to fully understand the effect, but the large performance improvements warrant this investigation and endorse LDVs as a simple and effective option for enhancing the representational capabilities and resilience of Semantic Space models.
6.5 Default Parameters
The extensive review of the SSD model allows us to suggest optimal default parameter ranges. We acknowledge that different data sources may benefit from different settings, so this guide may be the first in a line of refinements until a solid understanding of all variables, like data source size, is established. We also note that while we chose optimal parameters for the various experiments from a very large parameter space, we do not consider this to weaken the claims we are making. We have observed that the SS model behaves very well within a band of parameter variations, as shown for example by the similarity of parameters in the use-case experiment. The extensive modelling illustrated that improvements beyond this band of very good results are both hard to achieve and not noteworthy. Consequently, we can confidently expect similar results from SSD systems once we standardize the parameters. However, we do make a distinct difference between applying the model for term comprehension, as in the TASA/TOEFL experiment, and for topical search and organisation, as in the SSD experiments. These two variants with different foci benefit from noticeably different parameter ranges, particularly in the window sizes.
6.5.1 Term “Semantics”
We experienced the most significant gain from maximizing the column size of the
original co-occurrence matrix. If the focus is on term comprehension, as in a
synonym test, then the rows need only reflect the relevant vocabulary, as we have
shown with the TASA/TOEFL experiment. In the same experiment, we identified
that a corpus-wide frequency (DF) as the matrix row and column sort order, a fixed
scalar term weight and a small sliding window (8 on each side) are optimal. The small
window provides optimal co-occurrence information within a sentence’s reach. The
benefit of the fixed term weight and DF sorting, combined with the ineffectiveness of
the gap feature, suggests that, contrary to established understanding (Gerard Salton, C.
S Yang, et al., 1975), where term-to-term relationships are the sole
focus, corpus-wide frequency in combination with a modest stop word list is a good
discriminator for information value. Ignoring the S values proved effective. The
dimensional reduction conformed to orthodox results and was optimal between 100
and 500 columns.
In general, this thesis shows that for an information task reliant on semantic
associations, e.g., a synonym task, the Semantic Space should comprise a large number
of columns to differentiate between the row vectors. The number of rows only needs
to represent the relevant vocabulary and does not benefit from any additional rows
beyond it. Furthermore, a fixed scalar term weight, corpus-wide term frequency
as the matrix order, a comprehensive stop list and a dimensional reduction towards 250
columns without S values are effective settings.
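These settings can be condensed into a short sketch (assumptions: alphabetical rather than DF column ordering, which does not change the resulting row vectors since all columns are kept, and a dense matrix for brevity; the function and parameter names are illustrative):

```python
import numpy as np

def build_term_space(docs, vocab, window=8, k=250, sf=0.0):
    """Sketch of the term-semantics settings: co-occurrence counts with a
    fixed scalar weight and a small sliding window, rows restricted to the
    relevant vocabulary, as many columns as the corpus offers, then a
    truncated SVD; sf=0 ignores the singular values, sf=0.5 smooths them."""
    col_terms = sorted({t for d in docs for t in d})
    col = {t: i for i, t in enumerate(col_terms)}
    row = {t: i for i, t in enumerate(vocab)}
    M = np.zeros((len(vocab), len(col_terms)))
    for doc in docs:
        for i, t in enumerate(doc):
            if t not in row:
                continue
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    M[row[t], col[doc[j]]] += 1.0  # fixed scalar weight
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    k = min(k, len(S))
    return U[:, :k] * (S[:k] ** sf)  # term row vectors in the reduced space
```

Term similarity is then a cosine between the returned row vectors.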
6.5.2 Information Retrieval theory bearing on Semantic Space
If the application of the Semantic Space is supposed to include document and topical
features, then the optimal parameters change. Firstly, TF-IDF is a better discriminator
for matrix order than term frequency, and TF-IDF is a better choice than a fixed
scalar for term weight. A high number of columns remains advisable for a rich
representation of row terms, as does a high number of rows, since a larger
vocabulary increases the detail of the document and query/pseudo-document
representation. We found that a combined window size of 100 to 300 yields good
results and that the distribution between the left and right side is less important. This
suggests a wider, more topical association of terms as an optimal solution rather than
narrower co-occurrence. This is an important result and suggests that a simpler bag-
of-words approach may yield comparable results. We find that ignoring the S values is
an optimal solution for comprehensive queries. Short, keyword-like queries can benefit
from a smoothed use of the S value. We did not find evidence that the sometimes-
suggested use of unmodified S values is desirable.
The novel addition of weighted links and combined queries returned encouraging
results. Particularly in the common case of suboptimal parameter settings for a
corpus, the utilization of links and the extension of queries by document vectors showed
strong improvements in average and median results. A low weight of 20% for the
links performed best in general. The combined queries required the addition of
the link weight in the model to be a noticeable improvement. Regarding the query
parameters, the query fact has shown little promise. The use of term frequency in the
queries, on the other hand, performed substantially better than using unique term
representations.
In general, this thesis shows that for an IR-like task utilizing a Semantic Space with a
modest-sized corpus with link information, as in the SD experiments, the
following settings are effective: a large number of rows and columns, a dimensional
reduction towards 200 columns, no S values, LDVs and CQs with a weight of 20%,
and query term frequency.
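A minimal sketch of the two novel features under these settings (the dictionary layout and names are illustrative assumptions, not the prototype’s interfaces):

```python
import numpy as np

def pseudo_doc(terms, term_vecs):
    """Query/pseudo-document vector: sum of term vectors, weighted by the
    term frequency in the query (which outperformed unique-term weighting)."""
    counts = {}
    for t in terms:
        counts[t] = counts.get(t, 0) + 1
    return sum(term_vecs[t] * c for t, c in counts.items() if t in term_vecs)

def linked_doc_vector(doc_id, doc_vecs, links, weight=0.2):
    """Linked Document Vector: the document's own vector plus the vectors
    of its link targets scaled by a low weight (20% performed best)."""
    v = doc_vecs[doc_id].copy()
    for target in links.get(doc_id, []):
        v = v + weight * doc_vecs[target]
    return v
```

A combined query then adds the matching document vectors, scaled by the same weight, to the pseudo-document before ranking by cosine similarity.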
6.5.3 Categorization
We did not test extensive parameter settings for optimal Semantic Spaces for
clustering and categorization. However, we did investigate the impact of the link
weight and the singular values. Interestingly, both the clustering and the Semantic
Categorization benefit from a removal of the S values (sf=0) for the Bundle perspective,
which is also the best performing one. Similarly, the Term perspective benefited greatly
from using smoothed S values (sf=0.5) for both the baseline and SC. This reinforces
the finding from the use-case experiments that there is some useful information
contained in the S values, but only under some circumstances and only in the
formerly unexplored smoothed form. The link weight parameter gives a less clear
picture. For the clustering baseline, the Bundle perspective benefits slightly from a
mild link weight (30-40%). The Term perspective (again for clustering), on the other
hand, benefits greatly from a strong weight (70%). The SC shows only negative
impact with link weights. Overall, the experiments in this thesis show that the link
weight in a clustering situation should be employed cautiously, and based on current
experience there are no grounds to use it in a categorization setting.
The Semantic Categorization parameters of density, distance and cut-off have to be
set according to the application. Density, the local category parameter, has shown
good performance around a setting of 10. It does scale with higher sf settings, but
these settings are irrelevant since they do not provide an improvement compared to
the manual baseline. Distance, the global parameter, depends on the perspective. If a
vector type is sparsely distributed, like the bundles, this alone can be distinguishing
enough for distance to have little impact. On the more tightly packed Term vectors,
distance was useful and a setting around 10 exhibited good results. The number and
density of the targeted vectors similarly influence the cut-off parameter. There is no
‘long tail’ of minute clusters in the Bundle perspective and thus no need for a cut-off.
The denser and more distributed Term perspective, though, does improve with a cut-off
of 5-10%.
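As an illustration of how the three parameters could interact, consider the following simplified sketch. It is not the thesis’ algorithm: the neighbour-distance density proxy, the Euclidean metric and the core-selection order are assumptions made for the example.

```python
import numpy as np

def categorize(vecs, density=10, distance=10.0, cutoff=0.05):
    """Illustrative sketch: `density` ranks vectors by how tightly their
    nearest neighbours pack around them (local parameter), `distance`
    keeps category cores apart globally, and `cutoff` drops categories
    smaller than a fraction of all vectors."""
    n = len(vecs)
    d = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=2)
    k = min(density, n - 1)
    local = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)  # density proxy
    cores = []
    for i in np.argsort(local):                # densest candidates first
        if all(d[i, c] > distance for c in cores):
            cores.append(i)                    # core far from existing ones
    members = np.argmin(d[:, cores], axis=1)   # nearest-core assignment
    cats = {int(c): np.flatnonzero(members == j) for j, c in enumerate(cores)}
    return {c: m for c, m in cats.items() if len(m) >= cutoff * n}
```

Raising `distance` merges nearby cores and coarsens the categories; raising `cutoff` removes the long tail of minute categories, mirroring the Term-perspective observation above.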
145
6.6 Discovery
The results for the directed search and categorization substantiate the proposition that
Semantic Service Discovery is effective. We recall that a consumer faced with
incomplete knowledge of her agenda may discover new information and
extend her knowledge (see section 2.1.4). However, the SD system has to facilitate
this process. We have shown that SSD achieves excellent results in identifying
relevant service information with decreasing query information. This aligns with the
challenge identified in this work of presenting related information to a consumer
when she, due to a lack of knowledge, is describing her agenda incompletely. The
quality of this selection of information presented to the agent has direct influence on
her choice and cost of investigating and obtaining the relevant information. The
strong performance of SS models and the improvement of results with the additional
link information and combined queries, and the qualitative results in the category
experiment illustrate the value of the SS results for presumptive attainment.
In particular, the qualitative results emphasize that beyond traditional SD there are
latent features available which a human agent, when presented with them, may use to
discover new information and extend her knowledge. We therefore consider Semantic
Spaces valuable for comprehensive Semantic Service Discovery systems in the future.
7 Future Work
We have concluded that Semantic Spaces are effective means for Service Discovery.
We present two aspects, scientific and practical, for future work. The positive
outcome from the SSD model and experiments may motivate real world applications
while at the same time further research and improvements are desirable.
7.1 Scientific
We propose that future research investigate the role and identify the exact
mathematical effect of the singular values in the extraction of semantic associations
between terms in a word co-occurrence matrix. We were able to establish that they
can be advantageous in a smoothed form and that the factor weighting in the
decomposition is probably the source of the gain, but for a predictable, optimal
application an exact model has to be established.
Additional research into the Linked Document Vectors should be undertaken to
provide further evidence of their value. The impressive gains we achieved encourage
us to propose that the LDVs are generally useful, but conclusive proof requires
further results across a large spectrum of discovery tasks and scenarios involving
various corpora. Such research should investigate corpora ranging from poorly to
highly linked and from poor to rich semantic descriptions, to gain an understanding
of how direct and indirect disambiguation in the space occurs from the links versus
the semantic content. We could also imagine typing the links, e.g., weighting
different types of links, to optimize the effect in various settings.
The underlying Conceptual Space theory and algorithms of this research are, in our
opinion, universally applicable. The conceptual aspect of this work is transferable to
domains beyond service discovery that are the traditional home of the applied
methods from information retrieval and cluster analysis, like data mining or text
classification. This would entail a desirable cross-validation with the corpora and
test scenarios of the related scientific domains, and would enrich our understanding
and validation of the presented work further. Additionally, some of the insights
particular to Service Discovery are possibly applicable to software discovery and
categorization problems, which struggle with similar challenges (Delo, Haar,
Larsson, & Parulekar, 2002; Tian, Revelle, & Poshyvanyk, 2009).
The apparent challenge for the near future in respect to the Service Discovery
domain is to capitalize on the extensive structural information available in the
enterprise domain of service provision and consumption, as well as the unstructured
secondary information. Approaches like the Universal Service Description Language
or USDL (Cardoso et al., 2010) offer a flexible and extensible way forward to
capture much of both. Bridging the divide between structured and unstructured
information, and utilizing both concurrently, is a great challenge. Combined queries
and linked document vectors are means to extend the orthodox Semantic Space with
structured information. They utilize explicitly encoded information transparently for
untrained users in directed search and browsing, and can even inform the design of
ontologies or taxonomies through semantic categories.
Figure 49: Interface dummy for search by browsing of categories
We established the Semantic Categorization as effective and meaningful, but the
algorithm introduced is computationally inefficient. Consequently, an obvious
research area is to improve the computational performance by developing
alternative, more efficient and effective algorithms based on Conceptual Space
theory to outperform the mature clustering methods. An important and unexplored
attribute of the presented Semantic Categorization algorithm is its ability to change
the granularity of categories through the manipulation of the core density and distance
preference. Since these are conceptually inspired parameters, they may return more
stable and relatable results than orthodox clustering. An example would be to
represent the space at different levels of conceptual granularity, which would allow
for a “drilling down” kind of exploratory search that need not rely on a hierarchical
structure, allowing for vertical knowledge abduction and discovery (Figure 49).
Lastly, an evaluation of SC in a large corpus would be desirable, even though manual
groupings may well not be available and qualitative evaluation may be the only choice.
7.2 Applied
We have shown the benefit of Semantic Spaces in Service Discovery. We propose
that it is ready for real-world application in Service Discovery and possibly for IR
tasks of a similar nature. The SD application has two challenges to overcome, one
conceptual and one of implementation.
The conceptual problem is to integrate it with the current state of the Service
Ecosystem. An open, wide-reaching Service Ecosystem does not yet exist. A
practical approach to applying the SSD model is to introduce it into promising
solutions in the service domain that may become supporting pillars of the future SES.
For example, the Universal Service Description Language (Cardoso et al., 2010),
which industry, e.g., SAP, strongly supports, shows promise to become such a pillar.
This work focused on the conceptual side of service discovery, but this does not
discount the need for structured and functional approaches in the lower, machine-
oriented SOA layers of the SES. An integration of these functional aspects with SSD
may lead to various applications. Preliminary research shows this potential (Bose et
al., 2008). We propose that the SSD model can be the human interface to the search
and discovery of services of a system that internally is well structured, but that also
contains, and through human interaction gains, unstructured information. We have
shown that we can exploit some structural information. Furthermore, simple
functional matching of services in orchestration is rarely meaningful, and a
conceptual support system founded in Semantic Spaces can help in the development
of combined and complex services and processes. Moreover, we can imagine
supporting specialists designing or mapping ontologies with a conceptual
recommendation system that instils statistical semantic validity and encourages the
resulting ontologies and taxonomies to be close to natural language.
Lastly, there is an implementation challenge. The presented Semantic Space depends
on the computationally expensive SVD, and the future SES and attached SIS can be
expected to be large. The experience from the development of the software prototype
for this research suggests that future research utilizing large corpora, and applications,
should utilize off-the-shelf components as far as possible to scale and focus efforts.
Within the time of this research, these components have gradually become available
with the development of HBase, Hadoop, Solr/Lucene and Mahout79, focusing on the
map-reduce framework. Distributed or parallel implementations of Lanczos-based
SVD algorithms (Baglama & Reichel, 2007) promise to solve a computational
bottleneck of Semantic Spaces through a two-pronged approach, reducing the memory
and computation time needed as well as dividing the problem into easily computable
tasks. This development makes a widespread application of Semantic Spaces for
Service Discovery and other scenarios likely in the near future. Many application
challenges contain considerable scientific and further research aspects, e.g., selecting
optimal semantic training sets to minimize computational load, or how best to fold
new information into an existing Semantic Space and when to recompute it.
79 See http://hbase.apache.org/, http://hadoop.apache.org/, http://lucene.apache.org/solr/ and http://mahout.apache.org/.
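One of these questions, folding new information into an existing space, can be sketched with the classic folding-in projection from the LSA literature (an assumption carried over from that literature, not a procedure evaluated in this thesis; `Vt` and `S` come from the original decomposition):

```python
import numpy as np

def fold_in(row, Vt, S, k=200, sf=0.0):
    """Classic LSA-style folding-in: project a new co-occurrence row into
    an existing k-dimensional space without recomputing the SVD. The sf
    exponent matches the smoothed use of the singular values."""
    k = min(k, len(S))
    return (row @ Vt[:k].T) * (S[:k] ** (sf - 1.0))
```

Folding-in is cheap but does not update the space itself, so the space still has to be recomputed periodically as folded-in material accumulates.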
Conclusion
Overall, we provided evidence to support our hypothesis that Semantic Spaces
based on a Service Ecosystem’s Service Information Shadow facilitate effective
Service Discovery. We also investigated the novel features of linked document
vectors, combined queries and perspectives, and the importance of singular values. We
demonstrated significant gains using LDVs and perspectives, as well as promising
benefits from combined queries. The case for singular values is complex, and while in
most cases ignoring them is the optimal setting, we did provide substantiation that
they are useful in a smoothed form in some situations.
Appendix A SAP ES Wiki Grouping
Overview SAP Core (70)
Sales (14)
Service (5)
Marketing (2)
Management (1)
Human Capital Management (4)
Corporate Services (9)
E-Commerce (1)
Supply Planning (2)
Financials (5)
Procurement (6)
Supply Network Collaboration (3)
Order Fulfillment (2)
Supply Chain Visibility (1)
Product Development and
Manufacturing (5)
Transportation, Warehousing (6)
RFID Enablement (3)
Overview Industries (56)
Banking (12)
Higher Education & Research (2)
Insurance (7)
Automotive (1)
Public Sector (7)
Defense (2)
Healthcare (5)
Consumer Products (2)
Oil & Gas (1)
Travel & Logistics Services (2)
Media (3)
Wholesale Distribution (2)
Retail (5)
Utilities (5)
Experiments
Appendix B Example Semantic Categorization by
Bundles
<?xml version="1.0"?>
<Categories Type="Bundle">
<Category Name="supplier_collaboration_for_the_supply_chain">
<Type Name="Bundle">
<Member Sim="0.747314593203148">customer_collaboration_for_the_supply_chain</Member>
<Member Sim="0.429137953659488">outsourced_manufacturing</Member>
</Type>
</Category>
<Category Name="order_to_cash">
<Type Name="Bundle">
<Member Sim="0.736248072501191">order_to_cash_with_crm</Member>
<Member Sim="0.692977378246265">order_to_cash_for_fashion</Member>
<Member Sim="0.373950237494739">customer_quote_management</Member>
<Member Sim="0.2707411271289">dispute_management</Member>
<Member Sim="0.418494613262352">integration_of_transportation_management_system</Member>
<Member Sim="0.349931330554136">convergent_invoicing</Member>
<Member Sim="0.438597390282901">bill-to-cash</Member>
<Member Sim="0.524114671311319">supply_chain_operations_and_execution_in_the_oil_and_gas_industry</Member>
</Type>
</Category>
<Category Name="financial_accounting_-_results_integration">
<Type Name="Bundle">
<Member Sim="0.704381137116215">management_accounting_-_results_integration</Member>
<Member Sim="0.426158777021448">financial_accounting_-_financial_instrument_accounting_integration</Member>
</Type>
</Category>
<Category Name="credit_risk_management_-_financial_instrument_pricing">
<Type Name="Bundle">
<Member Sim="0.700301118871451">financial_accounting_-_financial_instrument_pricing</Member>
</Type>
</Category>
<Category Name="maintenance_service_collaboration">
<Type Name="Bundle">
<Member Sim="0.672546102465906">asset_configuration</Member>
<Member Sim="0.582778542992418">maintenance_processing</Member>
<Member Sim="0.453804044325149">compliance_relevant_data_exchange_-_elogbook</Member>
</Type>
</Category>
<Category Name="customer_information_management_-_business_operations">
<Type Name="Bundle">
<Member Sim="0.652944892794494">account_and_contact_management</Member>
<Member Sim="0.592996736591359">complaint_management</Member>
<Member Sim="0.526080488614963">request_for_registration_processing</Member>
<Member Sim="0.29857951063867">investigative_case_management</Member>
<Member Sim="0.46981354895926">multi-channel_tax_and_revenue_management</Member>
<Member Sim="0.394555062110302">permit_application_and_approval</Member>
</Type>
</Category>
<Category Name="insurance_external_reporting">
<Type Name="Bundle">
<Member Sim="0.625886398149305">insurance_claims_handling</Member>
<Member Sim="0.608479509333417">insurance_external_claims_investigation</Member>
<Member Sim="0.59628375288497">insurance_document_vendor</Member>
</Type>
</Category>
<Category Name="procure_to_pay">
<Type Name="Bundle">
<Member Sim="0.645259178864215">procure_to_pay_for_fashion</Member>
<Member Sim="0.43985149413654">project_system</Member>
<Member Sim="0.508069813919222">external_requirement_processing</Member>
<Member Sim="0.549944421030025">supplier_order_collaboration_with_srm</Member>
<Member Sim="0.433598406229265">item_unique_identification</Member>
</Type>
</Category>
<Category Name="cross-industry_rfid-enabled_core_logistics_processes">
<Type Name="Bundle">
<Member Sim="0.624072460428367">management_of_tag_ids_and_tag_observations</Member>
<Member Sim="0.594803217452026">management_of_devices_through_enterprise_services</Member>
<Member Sim="0.282271326720487">yard_and_storage_management_processes</Member>
</Type>
</Category>
<Category Name="service_order_management">
<Type Name="Bundle">
<Member Sim="0.616591880740113">customer_service_execution</Member>
<Member Sim="0.402063426288014">installed_base_management</Member>
<Member Sim="0.395323749949864">service_contract_management</Member>
<Member Sim="0.477168619669685">service_parts_management</Member>
</Type>
</Category>
<Category Name="integration_of_quality_management_systems">
<Type Name="Bundle">
<Member Sim="0.60073315331758">easy_inspection_planning</Member>
</Type>
</Category>
<Category Name="manufacturing_work_instructions">
<Type Name="Bundle">
<Member Sim="0.571046919528631">integration_of_manufacturing_execution_systems</Member>
<Member Sim="0.386279698214566">batch_traceability_and_analytics</Member>
<Member Sim="0.333185136258181">responsive_product_development_and_launch</Member>
</Type>
</Category>
<Category Name="resource_and_supply_chain_planning_for_healthcare_providers">
<Type Name="Bundle">
<Member Sim="0.560687478031439">resource_planning_and_scheduling</Member>
</Type>
</Category>
<Category Name="inventory_management">
<Type Name="Bundle">
<Member Sim="0.552632457434165">inventory_lookup</Member>
<Member Sim="0.180389494715461">environment_health_and_safety</Member>
</Type>
</Category>
<Category Name="interactive_selling">
<Type Name="Bundle">
<Member Sim="0.519117489743124">quote_to_order_for_configurable_products</Member>
<Member Sim="0.515982401457111">vehicle_management_system</Member>
<Member Sim="0.501992803379323">customer_fact_sheet</Member>
<Member Sim="0.48321546603633">sales_incentive_and_commission_management</Member>
<Member Sim="0.477825223709467">product_master_data_management</Member>
<Member Sim="0.474312766188881">opportunity_management</Member>
<Member Sim="0.376897803358854">activity_management</Member>
<Member Sim="0.398928439401607">sales_contract_management</Member>
<Member Sim="0.248885230587523">territory_management</Member>
<Member Sim="0.289330139849137">trade_and_commodity_management</Member>
<Member Sim="0.250670916338527">global_data_synchronization</Member>
</Type>
</Category>
<Category Name="hcm_organizational_management">
<Type Name="Bundle">
<Member Sim="0.540502425771903">hcm_master_data</Member>
<Member Sim="0.325895238061658">hcm_time_management</Member>
<Member Sim="0.324054536895044">information_system_integration</Member>
</Type>
</Category>
<Category Name="atp_check">
<Type Name="Bundle">
<Member Sim="0.534233973694007">availability_issue_resolution_and_backorder_processing</Member>
</Type>
</Category>
<Category Name="loans_management_-_business_operations">
<Type Name="Bundle">
<Member Sim="0.532013102182625">financial_accounting_-_loans_integration</Member>
</Type>
</Category>
<Category Name="demand_management">
<Type Name="Bundle">
<Member Sim="0.521591671772741">demand_planning</Member>
<Member Sim="0.48043289870201">in-store_food_production_integration</Member>
</Type>
</Category>
<Category Name="campaign_management">
<Type Name="Bundle">
<Member Sim="0.520201004123336">lead_management</Member>
</Type>
</Category>
<Category Name="sales_and_service_-_account_origination">
<Type Name="Bundle">
<Member Sim="0.513039527801658">current_account_management_-_business_operations</Member>
</Type>
</Category>
<Category Name="patient_administration">
<Type Name="Bundle">
<Member Sim="0.477684943636421">medical_activities_x002C__patient_billing_and_invoicing</Member>
<Member Sim="0.474679551978179">foundation_for_collaborative_health_networks</Member>
</Type>
</Category>
<Category Name="market_communication">
<Type Name="Bundle">
<Member Sim="0.48295093078888">customer_communication</Member>
<Member Sim="0.456408780265494">advanced_meter_infrastructure</Member>
</Type>
</Category>
<Category Name="central_contract_management">
<Type Name="Bundle">
<Member Sim="0.480423417571924">service_procurement</Member>
<Member Sim="0.331160654709027">trade_price_specification_contract</Member>
</Type>
</Category>
<Category Name="subscription_management">
<Type Name="Bundle">
<Member Sim="0.441558309825736">advertising_management</Member>
<Member Sim="0.236343934699565">integration_of_rights_management</Member>
</Type>
</Category>
<Category Name="external_cash_desk">
<Type Name="Bundle">
<Member Sim="0.439073794719508">electronic_bill_presentment_and_payment</Member>
<Member Sim="0.352799823824433">bank_communication_management</Member>
</Type>
</Category>
<Category Name="credit_risk_management_-_credit_portfolio_management">
<Type Name="Bundle">
<Member Sim="0.4243914572109">credit_management</Member>
<Member Sim="0.303607776750799">credit_risk_-_modeling</Member>
</Type>
</Category>
<Category Name="hcm_enterprise_learning">
<Type Name="Bundle">
<Member Sim="0.42265501136075">integration_of_external_warehouse_management_system</Member>
<Member Sim="0.25933846756841">product_catalogue_processing_with_crm</Member>
<Member Sim="0.331422634088728">course_approval_processes</Member>
<Member Sim="0.253453857329117">integration_of_learning_management_systems</Member>
</Type>
</Category>
<Category Name="planning_to_shelf_optimization_integration">
<Type Name="Bundle">
<Member Sim="0.417821710855239">merchandise_and_assortment_planning_integration</Member>
</Type>
</Category>
<Category Name="rebate_management">
<Type Name="Bundle">
<Member Sim="0.411977564377045">agency_business</Member>
</Type>
</Category>
<Category Name="records_and_document_management">
<Type Name="Bundle">
<Member Sim="0.401095611843341">technical_document_management_connectivity</Member>
</Type>
</Category>
<Category Name="kanban_processing">
<Type Name="Bundle">
<Member Sim="0.387018065784234">business_event_handling_for_process_tracking</Member>
</Type>
</Category>
<Category Name="insurance_credentialing">
<Type Name="Bundle">
<Member Sim="0.362541380984386">commissioning</Member>
</Type>
</Category>
<Category Name="legal_dunning_and_external_collections">
<Type Name="Bundle">
<Member Sim="0.335544124551552">insurance_billing_and_payment</Member>
</Type>
</Category>
<Category Name="public_sector_budget_management">
<Type Name="Bundle">
<Member Sim="0.327642110799172">public_sector_accounting_structures</Member>
<Member Sim="0.238230273919996">funds_commitment_processing</Member>
</Type>
</Category>
<Category Name="real_estate_-_room_reservation">
<Type Name="Bundle">
<Member Sim="0.166656722622809">travel_management</Member>
</Type>
</Category>
</Categories>
Appendix C CLUTO
Methods
The first set of CLUTO’s clustering methods implements repeated bisection, which
considers the set of objects as one cluster and then repeatedly selects and splits one
cluster into two until a stopping criterion is met. CLUTO provides two variations,
called rb and rbr, with the first being the described implementation and the latter
additionally attempting a post-clustering optimization80 not further described in
CLUTO’s manual. The direct method attempts to compute all desired clusters
simultaneously instead of using bisections. The reverse approach to bisection is
agglomerative clustering (Chidananda Gowda & Krishna, 1978), available in two
methods, agglo and bagglo. It assumes each object to be a cluster and then merges
clusters to optimize a criterion function’s result. Bagglo is a variation which uses an
initial rb clustering on the square root of the desired cluster number to extend the
feature space before an agglo method run. Lastly, there is the graph method, which
uses a nearest-neighbour graph and the min-cut algorithm (Hao & Orlin, 1994) to
partition/cluster the graph.
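The repeated-bisection scheme can be sketched in a few lines (a simplification: CLUTO chooses and performs each split by optimizing a criterion function, whereas this sketch uses a naive 2-means step with crude initial centres):

```python
import numpy as np

def two_means(X, iters=20):
    """Crude 2-means used for each bisection step (naive initial centres)."""
    c = X[[0, len(X) - 1]].astype(float)
    lab = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - c[None], axis=2)
        lab = d.argmin(axis=1)
        for j in (0, 1):
            if (lab == j).any():
                c[j] = X[lab == j].mean(axis=0)
    return lab

def repeated_bisection(X, k):
    """rb-style clustering: start from one cluster and repeatedly split
    the largest cluster in two until k clusters exist."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        big = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(big)
        lab = two_means(X[idx])
        clusters += [idx[lab == 0], idx[lab == 1]]
    return clusters
```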
Criterion functions
The criterion functions describe the measure the clustering method optimizes. This
can be to maximize the distance between clusters (inter-cluster), minimize the
distance within a cluster (intra-cluster) and/or a combination of these. CLUTO
provides 7 criterion functions (Table 29) that can be applied to 5 clustering methods
plus an additional 6 criterion functions that are applicable to agglomerative methods.
Criterion Function | Optimization | Function
I1  | maximize | Σ_{i=1..k} (1/n_i) · Σ_{v,u ∈ S_i} sim(v,u)
I2  | maximize | Σ_{i=1..k} √( Σ_{v,u ∈ S_i} sim(v,u) )
E1  | minimize | Σ_{i=1..k} n_i · ( Σ_{v ∈ S_i, u ∈ S} sim(v,u) ) / √( Σ_{v,u ∈ S_i} sim(v,u) )
G1  | minimize | Σ_{i=1..k} ( Σ_{v ∈ S_i, u ∈ S−S_i} sim(v,u) ) / ( Σ_{v,u ∈ S_i} sim(v,u) )
G1p | minimize | Σ_{i=1..k} ( Σ_{v ∈ S_i, u ∈ S−S_i} sim(v,u) ) / n_i²
H1  | maximize | I1 / E1
H2  | maximize | I2 / E1
Table 29: CLUTO main criterion functions80
Table 29 lists the 7 main criterion functions and what they optimize, with:
k the total number of clusters
S all objects to cluster
Si objects in cluster i
ni number of objects in cluster i
v, u two objects
sim(v,u) the similarity81 between objects
I1 and I2 locally optimize the intra-cluster similarity, ignoring other clusters in the
process. I1 is mathematically equivalent to the k-means algorithm, seeking to
minimize the sum of squared errors of the Euclidean distance (Zhao & George Karypis,
2002). I2 is a vector space variation of I1, using the square root rather than the number
of objects in the cluster to scale the measure.
E1 uses a global optimization maximizing the distance of cluster centroids from the
centroid of the whole collection. It also weights larger clusters as more important.
The G1 and G1p are graph-based approaches viewing the similarity as a weight on the
edge between two objects/vertices. The intuition behind the graph inspired criterion
functions is to minimize the edge-cut of each cluster/partition.
80 See CLUTO manual http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf for more.
81 We used CLUTO’s default cosine measure.
H1 and H2 are hybrid functions using a combination of the previously discussed ones.
Both hybrids are divided by E1, and thus to increase H, E1 has to be as small as
possible. Maximizing the distance of the clusters from the global centroid achieves
this. The numerator can be either I function for intra-cluster optimization.
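Using the notation of Table 29, the main criterion functions follow directly from their definitions (a minimal sketch: `sim` is a precomputed pairwise similarity matrix and `clusters` a list of index lists):

```python
import numpy as np

def i1(sim, clusters):
    """I1: sum over clusters of the size-scaled intra-cluster similarity."""
    return sum(sim[np.ix_(c, c)].sum() / len(c) for c in clusters)

def i2(sim, clusters):
    """I2: square root of the intra-cluster similarity sum per cluster."""
    return sum(np.sqrt(sim[np.ix_(c, c)].sum()) for c in clusters)

def e1(sim, clusters):
    """E1: size-weighted similarity of each cluster to the whole collection,
    scaled by the square root of the intra-cluster similarity."""
    n = sim.shape[0]
    return sum(len(c) * sim[np.ix_(c, range(n))].sum()
               / np.sqrt(sim[np.ix_(c, c)].sum())
               for c in clusters)

def h1(sim, clusters):
    """H1 hybrid: I1 divided by E1 (H2 uses I2 in the numerator)."""
    return i1(sim, clusters) / e1(sim, clusters)
```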
The agglomerative methods additionally have the Unweighted Pair Group Method with
Arithmetic Mean (UPGMA), single and complete link functions, as well as their
weighted variants, available to them. The single link function merges the two clusters
minimizing the distance of the closest members between the clusters. The complete
link minimizes the distance of the two most distant members of the clusters.
UPGMA minimizes the mean distance between all members of two clusters.
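The three linkage functions can be stated in a few lines (a sketch; `dist` is any pairwise distance matrix, `A` and `B` index lists of the two clusters):

```python
def single_link(dist, A, B):
    """Single link: distance between the closest members of two clusters."""
    return min(dist[a][b] for a in A for b in B)

def complete_link(dist, A, B):
    """Complete link: distance between the two most distant members."""
    return max(dist[a][b] for a in A for b in B)

def upgma(dist, A, B):
    """UPGMA: mean distance over all member pairs of the two clusters."""
    return sum(dist[a][b] for a in A for b in B) / (len(A) * len(B))
```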
Methods and Criterion Functions Results
The direct method (0.4762) achieved the best result (Figure 50) as well as the highest
average (of maxima) result across all criterion functions (0.4486). Graph without
alternative criterion functions was the worst (0.2384). The agglomerative methods
were generally more volatile and performed particularly poorly with the (w)slink
criterion functions. The rb methods were slightly worse than the direct approach but
performed well overall.
Figure 50: Criterion functions by methods
The G1p achieved the best single criterion function result (Figure 51). On average,
though, E1 was slightly better (0.4190 vs. 0.4095) and not far ahead of I1, I2, H1 and H2
(0.3958, 0.3885, 0.3963 and 0.4060), with the exception of G1 (0.3417). The six
agglomerative-only criterion functions (slink, wslink, clink, wclink, upgma and
wupgma) were not competitive, with the exception of wupgma in combination with
agglomeration (0.3970 agglo and 0.4103 bagglo).
Figure 51: Methods by criterion functions
Perspective Results
The comparison of Term to Bundle perspective reveals that for all methods (Figure
52) and all criterion functions (Figure 53) the Bundle perspective achieved a
strikingly better performance. The methods improve between ~23% (agglo, graph)
to ~47% (direct, rbr). Even if we compare the best method and criterion function
from both perspectives (Term – agglo G1p 0.3473 vs. Bundle - direct G1p 0.4762) the
difference remains striking (+37%).
Figure 52: Perspective and methods
The poor performance of the graph method and the agglomerative-specific criterion functions (except wupgma) persists. They all profited from the Bundle perspective, but comparatively little (except wclink and wupgma), and their Term results were poor to begin with.
Figure 53: Perspective and criterion functions