Faculty of Science and Technology
Ph.D. Thesis 2011
Semantic Service Discovery in the Service Ecosystem
Christian Werner Prokopp, BSc (Hons), MCom
N6201393
Principal Supervisor
Professor Peter Bruza
Associate Supervisor
Professor Alistair Barros
Keywords
Semantic Space, Vector Space Model, Web Service, Conceptual Space, Ecosystem,
Categorization, Clustering, Machine Learning, Information Retrieval, Text Mining,
Text Classification, Inverted Index
Abstract
Electronic services are a leitmotif in ‘hot’ topics like Software as a Service, Service
Oriented Architecture (SOA), Service-oriented Computing, Cloud Computing,
application markets and smart devices. We propose to consider these in what has
been termed the Service Ecosystem (SES). The SES encompasses all levels of
electronic services and their interaction, with human consumption and initiation on
its periphery in much the same way the ‘Web’ describes a plethora of technologies
that together connect information and expose it to humans.
Presently, the SES is heterogeneous, fragmented and confined to semi-closed
systems. A key issue hampering the emergence of an integrated SES is Service
Discovery (SD). A SES will be dynamic with areas of structured and unstructured
information within which service providers and ‘lay’ human consumers interact;
until now the two have been disjoint, e.g., SOA-enabled organisations, industries and
domains are choreographed by domain experts or ‘hard-wired’ to smart device
application markets and web applications. In a SES, services are accessible,
comparable and exchangeable by human consumers, closing the gap to the providers.
This requires a new SD with which humans can discover services transparently and
effectively without special knowledge or training. We propose two modes of
discovery: directed search, which follows an agenda, and exploratory search, which
speculatively expands knowledge of an area of interest by means of categories.
Inspired by conceptual space theory from cognitive science, we propose to
implement the modes of discovery using concepts to map a lay consumer’s service
need to terminologically sophisticated descriptions of services. To this end, we
reframe SD as an information retrieval task on the information attached to services,
such as descriptions, reviews, documentation and web sites: the Service
Information Shadow. The Semantic Space model transforms the shadow's
unstructured semantic information into a geometric, concept-like representation. We
introduce an improved and extended Semantic Space that includes categorization,
which we call the Semantic Service Discovery model.
We evaluate our model on a highly relevant, service-related corpus simulating a
Service Information Shadow, including manually constructed complex service
agendas as well as manual groupings of services. We compare our model against
state-of-the-art information retrieval systems and clustering algorithms. By means of
an extensive series of empirical evaluations, we establish optimal parameter settings
for the semantic space model. The evaluations demonstrate the model’s effectiveness
for SD in terms of retrieval precision over state-of-the-art information retrieval
models (directed search) and the meaningful, automatic categorization of service
related information, which shows potential to form the basis of a useful, cognitively
motivated map of the SES for exploratory search.
Table of Contents
Keywords ...................................................................................................................... I
Abstract ....................................................................................................................... II
Table of Contents ....................................................................................................... IV
List of Figures ............................................................................................................ VI
List of Tables ........................................................................................................... VIII
List of Equations ......................................................................................................... X
List of Abbreviations ................................................................................................ XII
Conventions ............................................................................................................. XIII
Statement of Original Authorship ........................................................................... XIV
Acknowledgements .................................................................................................. XV
1 Introduction .......................................................................................................... 1
1.1 Service Ecosystem ......................................................................................... 1
1.2 Service Discovery .......................................................................................... 9
1.3 Research Questions ...................................................................................... 17
1.4 Contributions ............................................................................................... 18
1.5 Thesis Structure ........................................................................................... 19
2 Literature Review ............................................................................................... 21
2.1 Service Discovery ........................................................................................ 21
2.2 Information Retrieval .................................................................................. 27
2.3 Semantic Spaces .......................................................................................... 40
2.4 Cluster Analysis ........................................................................................... 46
2.5 Discussion .................................................................................................... 55
3 Semantic Service Discovery Model ................................................................... 57
3.1 Semantic Information Shadow .................................................................... 57
3.2 Semantic Space Generation ......................................................................... 58
3.3 Semantic Categorization .............................................................................. 63
3.4 Innovations .................................................................................................. 69
3.5 Modes of Discovery ..................................................................................... 72
3.6 Software Prototype ...................................................................................... 75
3.7 Evaluation .................................................................................................... 78
3.8 Discussion ................................................................................................... 83
4 Semantic Service Discovery Evaluation ............................................................ 85
4.1 SAP ES Wiki as a Service Information Shadow ......................................... 85
4.2 Experimental Evaluation ............................................................................. 89
4.3 Baseline IR systems ..................................................................................... 91
4.4 Results ......................................................................................................... 96
4.5 Discussion ................................................................................................. 109
5 Semantic Service Categorisation Evaluation ................................................... 111
5.1 Experiment ................................................................................................ 111
5.2 Baseline clustering algorithms .................................................................. 119
5.3 Semantic Categorization ............................................................................ 121
5.4 Discussion ................................................................................................. 132
6 Discussion ........................................................................................................ 135
6.1 Service Discovery by Directed Search ...................................................... 136
6.2 Exploring the Space by Semantic Categories ............................................ 138
6.3 Singular factor ........................................................................................... 140
6.4 Link-weight ............................................................................................... 141
6.5 Default Parameters .................................................................................... 141
6.6 Discovery ................................................................................................... 145
7 Future Work ..................................................................................................... 146
7.1 Scientific .................................................................................................... 146
7.2 Applied ...................................................................................................... 148
Conclusion ............................................................................................................... 150
Appendix A SAP ES Wiki Grouping .................................................................. A-1
Appendix B Example Semantic Categorization by Bundles .............................. B-1
Appendix C CLUTO ........................................................................................... C-1
References ..................................................................................................................... i
List of Figures
Figure 1: App sales projection before Apple iPad release ........................................... 4
Figure 2: Emergence of Service Ecosystem ................................................................. 6
Figure 3: USA.gov services section ............................................................................. 7
Figure 4: Directgov.uk homepage ................................................................................ 8
Figure 5: Service consumer to service ....................................................................... 13
Figure 6: SD as an IR task .......................................................................................... 15
Figure 7: Search activities .......................................................................................... 28
Figure 8: A taxonomy of IR systems ......................................................................... 30
Figure 9: Content bearing terms by DF ...................................................................... 33
Figure 10: Classic IR system ...................................................................................... 34
Figure 11: Three levels of cognition .......................................................................... 42
Figure 12: Singular Value Decomposition in Latent Semantic Analysis ................... 43
Figure 13: Steps in Semantic Space generation ......................................................... 59
Figure 14: Example corpus structure ......................................................................... 59
Figure 15: SVD approximation of word co-occurrence matrix M ............................. 61
Figure 16: SS from word co-occurrence matrix (no singular values) ........................ 62
Figure 17: SS from word co-occurrence matrix (with singular values) ..................... 62
Figure 18: Semantic core expand to categories (simplified) ...................................... 65
Figure 19: Tessellation around core concepts (simplified) ........................................ 66
Figure 20: Categories through tessellation example .................................................. 69
Figure 21: Singular Factor in SS generation .............................................................. 70
Figure 22: LDV example ............................................................................................ 71
Figure 23: SSD graphical user interface main screen ................................................ 76
Figure 24: SSD configuration screen ......................................................................... 77
Figure 25: ES Wiki structure ...................................................................................... 87
Figure 26: Example of bundle page (excerpt) ............................................................ 88
Figure 27: Example use-case ...................................................................................... 89
Figure 28: Use-case query results .............................................................................. 97
Figure 29: SSD query results with varying LDV weights ....................................... 100
Figure 30: Improvements in AAR from no to optimal LDV ................................... 101
Figure 31: Singular Factor influence on AAR ......................................................... 102
Figure 32: Improvements from sf=1 to 0.0 and 0.5 ................................................. 103
Figure 33: Difference between unique and frequency queries ................................. 104
Figure 34: Combined Query vs. Text Query ............................................................ 105
Figure 35: Query factors’ influence on AAR ........................................................... 106
Figure 36: SVD reduction to k dimensions .............................................................. 107
Figure 37: Gap ......................................................................................................... 107
Figure 38: Left window ............................................................................................ 108
Figure 39: Right window ......................................................................................... 108
Figure 40: Practical topical structuring of different corpora .................................... 113
Figure 41: Measurement Cardinality Bias ............................................................... 118
Figure 42: Singular Factor and Perspective ............................................................. 120
Figure 43: Link Weight and Perspective .................................................................. 121
Figure 44: Maximum AMI according to perspective and sf for run 1 ..................... 124
Figure 45: Maximum AMI according to density and sf in run 4 ............................. 125
Figure 46: Link-weight results combined from run 1, 2 and 4 ................................ 128
Figure 47: Cut-off result selection from combined runs 1, 2 and 4 ......................... 128
Figure 48: Maximum and Average AMI according to number of categories .......... 129
Figure 49: Interface dummy for search by browsing of categories ......................... 147
Figure 50: Criterion functions by methods .............................................................. C-3
Figure 51: Methods by criterion functions ............................................................... C-4
Figure 52: Perspective and methods ........................................................................ C-4
Figure 53: Perspective and criterion functions......................................................... C-5
List of Tables
Table 1: Boolean term document matrix .................................................................... 31
Table 2: Term Frequency to term document matrix .................................................. 31
Table 3: Contingency table ........................................................................................ 36
Table 4: Term co-occurrence matrix .......................................................................... 38
Table 5: Term co-occurrence matrix with gap ........................................................... 38
Table 6: Local fitness (Equation 16) example for varying densities .......................... 67
Table 7: Fitness example for fixed cluster with changing distance ........................... 68
Table 8: Parameters for Semantic Space and Semantic Categories ........................... 78
Table 9: Comparison of sorting and term weight influence ....................................... 80
Table 10: Window size impact .................................................................... 81
Table 11: Columns to SVD reduction impact ............................................. 81
Table 12: Singular factor impact ................................................................. 82
Table 13: Rows to Columns impact ............................................................ 82
Table 14: Gap impact .................................................................................. 83
Table 15: Top 10 results for TASA/TOEFL SSD ...................................................... 83
Table 16: Use-cases Semantic Space parameters exploratory run ............................. 94
Table 17: Use-cases Semantic Space parameters refinement run ............................... 94
Table 18: SSD optimal query experiments parameters .............................................. 95
Table 19: Significance of results by paired, two tailed t-test ..................................... 98
Table 20: CLUTO - ES Wiki Semantic Space parameters ...................................... 115
Table 21: Semantic Categorization experiments parameter settings ....................... 122
Table 22: Best SC result by perspectives ................................................................. 123
Table 23: Maximum AMI according to distance, density and sf in run 4 ................ 126
Table 24: Maximum AMI for run 4 - Bundles, density to distance at sf=0 ............. 127
Table 25: Maximum AMI for run 4 - Term, density to distance at sf=0.5 .............. 127
Table 26: Semantic category example ..................................................................... 130
Table 27: Wiki Sales bundle group .......................................................................... 131
Table 28: Top results (AMI) for CLUTO and Semantic Categorization ................. 132
Table 29: CLUTO main criterion functions ............................................................. C-2
List of Equations
Equation 1: Zipf's Law ............................................................................................... 32
Equation 2: Inverse Document Frequency ................................................................. 32
Equation 3: TF-IDF of term i in document z for corpus of N .................................... 33
Equation 4: Probabilistic similarity by relevance ratio .............................................. 35
Equation 5: Probabilistic similarity by contingency table ......................................... 36
Equation 6: BM25 ...................................................................................................... 37
Equation 7: Minkowski distance ................................................................................ 39
Equation 8: Euclidean distance .................................................................................. 39
Equation 9: Cosine similarity measure ....................................................................... 39
Equation 10: SVD ...................................................................................................... 44
Equation 11: Truncated SVD ..................................................................................... 44
Equation 12: Row vector as a combination of U and S ............................................. 63
Equation 13: Row vector from U ............................................................................... 63
Equation 14: Term based document vector ................................................................ 63
Equation 15: Sum of similarities ................................................................................ 66
Equation 16: Local Fitness ......................................................................................... 66
Equation 17: Fitness of cluster with medoid c with j members ................................. 67
Equation 18: Term/row vector as a combination of U, S and a scaling factor ........... 70
Equation 19: Linked vector of document ................................................................... 71
Equation 20: Combined query from terms ................................................................. 73
Equation 21: Combined query from objects of different types .................................. 73
Equation 22: Gram-Schmidt algorithm applied for vector negation .......................... 74
Equation 23: Query Factor ......................................................................................... 74
Equation 24: Average Rank ....................................................................................... 91
Equation 25: Adjusted Average Rank ........................................................................ 91
Equation 26: Rand Index .......................................................................................... 115
Equation 27: Adjusted Rand Index .......................................................................... 115
Equation 28: Mutual Information ............................................................................. 116
Equation 29: Probability of random object to be in cluster i ................................... 116
Equation 30: Probability of random object to be in Ui and Vj ................................. 116
Equation 31: Entropy of cluster U ........................................................................... 116
Equation 32: Mutual Information between clustering U and V ............................... 116
Equation 33: Normalized Mutual Information ......................................................... 117
Equation 34: Adjusted Mutual Information ............................................................. 117
List of Abbreviations
AJAX Asynchronous JavaScript and XML
CS Conceptual Space
CSS Cascading Style Sheets
DF Document Frequency
DV Document Vector
HAL Hyperspace Analogue to Language
HTML Hypertext Mark-up Language
IDF Inverse Document Frequency
LDV Linked Document Vector
LSA Latent Semantic Analysis
OASIS Organization for the Advancement of Structured Information Standards
SaaS Software as a Service
SC Semantic Categorization
SD Service Discovery
SES Service Ecosystem
SIS Service Information Shadow
SLVM Structured Link Vector Model
SME Small and Medium Enterprises
SOA Service Oriented Architecture
SS Semantic Space
SSD Semantic Service Discovery
SVD Singular Value Decomposition
SWS Semantic Web Services
TF Term Frequency
TF-IDF Term Frequency-Inverse Document Frequency
TV Term Vector
UDDI Universal Description Discovery and Integration
VSM Vector Space Model
WSDL Web Service Definition Language
WWW World Wide Web
XML Extensible Mark-up Language
Conventions
Vector
| · | Norm/length of a vector
Statement of Original Authorship
“The work contained in this thesis has not been previously submitted to meet
requirements for an award at this or any other higher education institution. To the
best of my knowledge and belief, the thesis contains no material previously
published or written by another person except where due reference is made.”
Signature Date
Acknowledgements
Australian Research Council
This research was supported by the Service Ecosystems Management for
Collaborative Process Improvement project (ARC Linkage Grant LP0669244). We thank the
participants in the project and the partners from Queensland Government Department
of Public Works and SAP Research Brisbane for their feedback and support.
Queensland University of Technology High Performance Computing &
Research Support
We would like to recognize the support of the QUT HPC in providing computing
facilities for the numerous experiments.
The Institute of Cognitive Science (ICS) at University of Colorado at Boulder
The ICS at University of Colorado provided the data for the TASA/TOEFL
experiment.
1 Introduction
Before the rise of modern search engines, the Internet was little more than
CompuServe and AOL to the average person, and FTP, Gopher, Usenet and email to
scientists and sophisticated users. This changed profoundly with the Hypertext
Mark-up Language and the Hypertext Transfer Protocol, resulting in the World Wide
Web (WWW). Initially, information on the WWW was sparse; website addresses were
known to few and were sourced from newsgroups or mailbox listings. The first step
toward wider access to web sites was manually constructed directories such as
Yahoo!. These in turn increased the popularity of the web, and the growing number
of sites eventually required an automated approach: search engines. Initially these
were simple indexes that over time evolved into sophisticated systems, culminating
in advanced search engines like Google and Bing1. From the inception of manually
constructed directories, discoverability became the catalyst for the growth of the
WWW. This gave it unexpected utility despite, or maybe because of, its feral nature.
Anyone was able to publish, search and be found on the web. We are standing before
a similar development today with electronic services. Not long ago, services were
accessible only to specific groups in entrenched systems or domains with special
privileges and expertise. By way of analogy, this is changing with the onset of the
“Service Ecosystem”.
1.1 Service Ecosystem
The Service Ecosystem (SES) is a concept around “services”, much as the WWW is
around the “web”. The SES utilizes existing technologies and networks flexibly to
provide and consume services electronically. It corresponds to business networks and
communities that are global in nature, created for the core purpose of exploiting
services. Aspects of the emergence of a Service Ecosystem are found in Software as
a Service, application marketplaces, business process outsourcing, B2B
integrators, cloud computing and business collaboration networks.
1 See http://www.google.com and http://www.bing.com for more details.
A service is “work done by one person or group that benefits another”2. We extend
this definition to “(physical or electronic) work done by one entity that benefits
another”. We employ the term ‘ecosystem’ to emphasize the main attributes of the
system we describe here. It is unregulated and feral in the sense that anything that
constitutes services and interacts with the system is part of it. This does not prescribe
that parts of the SES (networks, domains and systems exposing and consuming
services) cannot be semi- or even highly regulated, closed and dependent. These
business networks already exist and some organizations and industries will continue
to rely on specific functionalities and trust available in these (semi-)closed
environments.
Entities in the SES consume and provide services ranging from atomic ones to highly
complex service orchestrations that self-adjust with respect to demand and supply.
Demand for a service creates a niche in the ecosystem filled by a provider. If the
demand is great enough, many providers will enter that niche and further
development and diversification will occur until an equilibrium between provider
and consumer is achieved. Similarly, a tapering demand may result in reduction of
service provision. There is no prescription of what types of services there are or will
be, or how to deliver and consume them. This greater flexibility leads to
diversification in the services exposed, in how they are reprovisioned or repurposed,
in how they are channelled and consumed, as well as in the mechanisms for their delivery (Cardoso,
Barros, May, & Kylau, 2010).
1.1.1 Electronic Services
Electronic services are the latest evolution in the drive by computer science to reuse
coded functionality. This started with structured programming followed by software
libraries, object oriented theory, middleware concepts and finally electronic services.
Initially, they were little more than encapsulated procedure calls exposing functionality
by means of the Internet utilizing Internet Protocol, Domain Name Service,
Hypertext Transfer Protocol, eXtensible Markup Language, Simple Object Access
2 See http://wordnetweb.princeton.edu/perl/webwn?s=service for a detailed definition.
Protocol, Representational State Transfer and other tools to anyone, anywhere, at
any time.
1.1.2 Service Oriented Architecture
In the context of SOA, electronic services are the essential, loosely coupled building
blocks with each performing a simple function accessible through a well-defined
interface. Orchestration of these electronic services by a (human) designer results in
an application that performs more complex procedures. SOA focuses on
increasing reusability and decreasing redundancy, and aims particularly at large
organisations facing these issues. Services described by an open standard like the
Web Service Description Language (WSDL) are consumable by anyone adhering to
the standard, which theoretically enables service provision and consumption between
previously separated entities, e.g., outside departments or even organisations. This
development led to commoditizing services (Barros & Dumas, 2006) and the
inception of Service-oriented Computing (Papazoglou, Traverso, Dustdar, &
Leymann, 2008), which in combination with cloud computing lowered the technical
and capital investment hurdle for service providers, intermediaries, stakeholders and
consumers. Unfortunately, SOA originated as a functional approach to software
reusability and lacked business aspects like service level agreements, payment,
advertisement, orchestration, discoverability or bundling.
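The building-block character of SOA described above can be sketched in a few lines of code. The sketch below is purely illustrative and not from the thesis: the services, names and rates are hypothetical, and real electronic services would be invoked over a network (e.g., via WSDL/SOAP or REST) rather than as local functions. Each "service" exposes a simple, well-defined interface, and a designer-written orchestration composes them into a more complex procedure.

```python
# Illustrative sketch only: hypothetical services standing in for
# network-accessible electronic services with well-defined interfaces.

def currency_service(amount, rate):
    """A simple service: convert an amount using a given exchange rate."""
    return amount * rate

def tax_service(amount, tax_rate=0.1):
    """Another simple service: add tax to an amount."""
    return amount * (1 + tax_rate)

def quote_orchestration(amount, rate):
    """A designer-composed orchestration: convert a price, then apply tax.

    Neither underlying service knows about the other; the orchestration
    alone wires their interfaces together, which is the loose coupling
    SOA aims for.
    """
    converted = currency_service(amount, rate)
    return tax_service(converted)

print(round(quote_orchestration(100, 0.75), 2))  # 82.5
```

Because each service is self-contained, a provider could replace, reprovision or reuse any one of them without changing the others, only the orchestration layer needs to know the interfaces.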
1.1.3 Software as a Service
A different approach is Software as a Service (SaaS), which, instead of providing
commoditized services, aims to provide complete software solutions for common
tasks online. This concept, too, has a long history, ranging from early timesharing to
application service providers. The ubiquity of data communication, constantly
increasing bandwidth, the penetration of every aspect of business by data processing
devices with the associated cost savings, as well as the instant readiness of (virtual)
computing hardware through cloud computing make SaaS a successful business
model. It outsources risk, expertise and capital investment, and at the same time gives
access to standardized and function-rich software. An example of this is Salesforce3
3 See http://www.salesforce.com and http://appexchange.salesforce.com/home for more.
that successfully penetrated the small to medium enterprise market for customer
relationship management.
1.1.4 Application Marketplaces
In between SOA and SaaS lies the application market. Salesforce provides
AppExchange3 where third parties can sell applications integrated with the
Salesforce Customer Relationship Management solution. Salesforce uses a
community driven approach to provide a fertile, flexible and innovative source for
applications. Depending on their complexity, such applications constitute a complex
service or simple software.
Figure 1: App sales projection before Apple iPad release
Currently, device dependent application markets, e.g., iTunes Store, Windows Phone
Marketplace and Android Market4, together with smart phones and devices are
becoming a prominent channel to deliver applications and services to private users
(Figure 1). They build communities that provide, consume and evaluate
applications, limited only by the operators’ regulations and technologies. Some
examples of these applications are flight booking, restaurant guides or online
banking. Many applications are simply composite services focusing on a single
service need, a part of a greater agenda, e.g., booking a flight for a holiday. They
differ from SaaS in that they aim at private consumers, have a smaller set of
functions and have platform specific deliver mechanisms. At the same time,
4 See http://Android.com/market, http://marketplace.windowsphone.com, Apple.com/iphone/apps-for-iphone and Apple.com/ipad/apps-for-ipad/ for more.
[Figure 1 data: app sales of 2.5 billion (2009), 4.5 billion (2010 estimated) and 21.6 billion (2013 estimated); app revenue of $4.2 billion, $6.8 billion and $29.5 billion respectively]
increasingly professional applications are becoming available, making SaaS
functionality accessible on smart devices. The line between SaaS and applications is
blurring. The International Data Corporation predicts in a market analysis (Ellison,
2010) that providers will make every conceivable service available as apps. They
forecast that the app market will grow by 60% annually between 2010 and 2014 to 76.9
billion dollars5.
These marketplaces are entering the personal computer market, e.g., with the Mac App
Store6. Today’s platform dependencies and restrictions should be a temporary
phenomenon given increasing pressure from open standards like HTML5, CSS3,
WebM7 and AJAX as platform-independent web-service/-application frontends. The
first evidence of this move came recently (October 2010) with the
announcement of the Mozilla Open Web Applications framework8. Eventually, the
type of device with which someone consumes or engages a service will become
largely irrelevant.
1.1.5 In the Cloud
Cloud computing is the other important paradigm shift besides service
commoditization, orchestration and distribution at different levels. Heroku9, a cloud
application and service platform, is a comprehensive example of how the abstraction
of hardware (virtual servers) and software (a service API for Ruby) reduces the
required expertise and manpower while allowing highly flexible
resource allocation and billing. A customer can add and remove vast computing
resources and services, billed on an hourly basis, within seconds and in turn provide
services to their own customers. Cloud computing is fast becoming a conventional
platform with prominent and potent providers like Amazon Web Services and S3
5 See http://www.idc.com/about/viewpressrelease.jsp?containerId=prUS22617910 for more about the analysis.
6 See http://www.apple.com/mac/app-store/ for more.
7 See http://www.w3.org/TR/html5, http://www.w3.org/TR/css3-roadmap and http://www.webMProject.org for details.
8 See https://apps.mozillalabs.com/ for details.
9 See http://www.heroku.com for more details.
storage10, Windows Azure11 and Rackspace Cloud Hosting12. This empowers
individuals and small companies to compete with large organizations, beyond simple
applications in restricted marketplaces, by utilizing highly flexible and cost-effective
computation and data storage facilities. Just as the rise of the World Wide Web
(WWW) dramatically altered the publishing and media industries through the near
cost-free delivery channel it provides, cloud computing in combination with SOA,
SaaS and application markets can do the same for services.
1.1.6 The emergence of the SES
The development towards a SES is already underway in the three market segments of
private users, small and medium-sized enterprises (SME) and (large)
enterprises/organizations. Figure 2 illustrates the converging development from
insulated mainframes and personal computers to app-driven smart devices and SOA-
oriented virtual data and computing centres providing services to other organizations
and private consumers alike.
[Figure 2 labels: over time, standalone systems become networked; software evolves via the WWW to apps on smart devices; closed systems evolve via middleware to SOA; mainframes evolve via servers to virtual infrastructure; all converging on the Service Ecosystem]
Figure 2: Emergence of Service Ecosystem
10 See http://aws.amazon.com/ and https://s3.amazonaws.com/ for more details.
11 See http://www.microsoft.com/windowsazure/windowsazure/ for more details.
12 See http://www.rackspacecloud.com for more details.
SMEs are moving to SaaS platforms like Salesforce and StrikeIron13 (Barros,
Dumas, & P. Bruza, 2005) and, more recently, Google Apps Marketplace14 or SAP
Business ByDesign15. The great benefit for SMEs is that the SaaS business model gives
them access to enterprise features in software and hardware with very
low or no upfront capital investment and pay-what-you-use billing. This contrasts
with previous models of very expensive software licenses, consulting services and
inefficient, large hardware. Gartner16 estimates that SaaS will grow from 10% of the
combined enterprise software markets in 2009 to 16% in 2014, measured by revenue. Of this,
currently 75% is delivered as a cloud service, with the potential to grow to 90% by 2014.
Figure 3: USA.gov services section17
Large organizations such as enterprises or governments also follow the trend towards on-
demand rather than “on-the-premises” software (services), using enterprise solutions
like SAP CRM18, Workday19 or Salesforce. SaaS blurs the line between SME and
enterprise products, and often only packaging, features and support separate them
13 See http://www.salesforce.com and StrikeIron.com for more.
14 See http://www.salesforce.com and StrikeIron.com for more.
15 See http://www.sap.com/sme/solutions/businessmanagement/businessbydesign for more.
16 See http://www.gartner.com/it/page.jsp?id=1406613 for more.
17 See http://www.usa.gov/Citizen/Services.shtml for more.
18 See http://www.sap.com/solutions/business-suite/crm for more.
19 See http://www.workday.com/ for more.
while the core services remain the same. Some very large organizations like
governments invest in “self-made” SOA solutions to break down the barriers between
departmental silos and expose their services, and occasionally data, to
customers/citizens and third parties. Examples are Directgov.uk (Figure 4) from the
United Kingdom and USA.gov (Figure 3) from the United States of America. They
provide composite services, e.g., to pay car tax or renew licenses. These services can
involve a number of small or atomic electronic services orchestrated and exposed to
the citizen via the web. These implementations encompass the lifecycle of the
services offered and consumed, essentially functioning as a domain-specific
platform corresponding to the open commercial alternatives mentioned before.
Services exposed on these government platforms differ from SaaS: the
platforms act more as mediators and backends for applications consuming and
extending their services rather than focusing on complex service delivery and value
adding.
Figure 4: Directgov.uk homepage
In summary, we can observe a convergence of SaaS on the professional side around
software service providers (e.g. SAP, Salesforce CRM or Oracle) and platform
providers (e.g. Salesforce AppExchange or Google Apps Marketplace). Application stores
dominate the private market, with Apple being the leading force (99.4% market share in
2009 according to Gartner4). These stores increasingly expose professional services in
the form of applications and are under pressure from open standards and the threat of
fragmentation. At the end of this development, we anticipate the establishment of a
heterogeneous Service Ecosystem consolidating private, SME and organisational
services and adopting open standards for operator-, platform- and device-independent
deployment and orchestration of services. Everyone from individuals to large
organizations will provide and consume these services, either directly or through
secondary interfaces like the web, apps or software. As with the WWW, there will not be one
technology or system identifiable as the SES; it constitutes the
conceptual framework around transparent service provision and consumption,
agnostic to industry, domain and actor type.
1.2 Service Discovery
The challenges for an open and independent SES are the ones that faced, and still
face, the WWW. How can it guarantee an open platform? How can one find a
suitable service? What is the quality/reliability of such a service? How can providers
and consumers exchange payments, accumulate discounts, advertise, etc.? The SES
also has to address functional challenges like interoperability, runtime, state
awareness and process orchestration. Depending on the use case, the demands
differ, e.g., an enterprise may require a long-running
service interacting with suppliers and customer services over its lifetime, with varying
billing and complex rights management. At the other end of the spectrum, a private
user may query a free, stateless service through an app for information, e.g., a flight
status.
A shared strategic hurdle is the discoverability of services (Papazoglou et al., 2008)
by any type of service consumer from the flood of offerings by individuals, small
companies, corporations and government departments across all domain and
industry boundaries. The development of the WWW has shown that the ability for
anyone to find relevant web sites among an ever-increasing number is a catalyst and
enabler for further development, which in turn leads to more growth and value.
The discovery process is still central to the function of the WWW, as the popularity
of Google shows. The task for service discovery is similarly challenging, ranging
from personal to professional services across all domains and industries. It
encompasses anything from organic shopping to automotive supply chain
management, reflecting the material and the virtual world, and reaches from service
offerings through applications down to atomic services described in domain-specific
ontologies hidden inside a government department, corporate unit or specialised
small service provider.
1.2.1 Traditional Service Discovery
We group existing Service Discovery mechanisms into three areas: interface indices,
communities and ontological deductive systems. We briefly present them and
their limitations here, followed by an introduction to a novel approach that uses statistical
semantics to address the discovery challenges mentioned. A detailed review of
Service Discovery is available in section 2.1 of the Literature Review.
UDDI
The Universal Description, Discovery and Integration (UDDI) specification20 is often
associated with SD. It is an open industry standard (OASIS, 2004a) for a service
registry through which service providers expose information about their business and their
services via functional and meta-descriptions for Service Oriented
Architecture (SOA) software design. We define “functional description” as
information about the service interface and technical details like protocols and data
format, not the purpose of the service! UDDI is not restricted to web or
electronic services, but registries often focus on web services described in the Web
Service Definition Language (WSDL)21. Registries can be fully private, partly restricted or
open, with the ability to publish and replicate data between nodes and privacy levels
(OASIS, 2004a; figure 4). This allows registries to satisfy a variety of needs ranging
from the organizational to the public consumer. The specification permits freely
definable, multiple and overlapping taxonomies, even within the same registry.
UDDI’s shortcomings are two-fold. It addresses SOA problems, and its
tremendous flexibility results in a strong functional orientation with few
restrictions or defined practices. This means that discovery (OASIS, 2004b; chapter
1.6) reduces to matching abstract service interface descriptions, searching according
to a freely defined, shared and implicitly known classification system, or via
20 See http://uddi.xml.org/ and http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uddi-spec for more details.
21 See http://www.w3.org/TR/wsdl for details.
keywords. Keyword search matches the optional description field in the standard’s
tModel, which in practice is often poorly utilized, e.g., in the UDDI Business Registry
(UBR) it was wholly ignored (Bachlechner, Siorpaes, Lausen, & Fensel, 2006). This
resulted in very limited adoption, mostly because plain registries in closed
environments suit expert users or systems that know ahead of a search what
they expect to find.
Web Search
Public alternatives to UDDI are (web) search and communities. Web search engines
can find WSDL files describing web services, e.g., using the filetype option in Google.
WSDL files, like UDDI, focus on functional aspects and provide only optional
descriptive information. The service designer usually enters the description (like the
rest of the WSDL information), and it is often of poor quality, merely repeating
functional information. Furthermore, searching this information depends on the
search engines’ algorithms, which are little more than full keyword indices, at best
accounting for typographical errors and divergences (Baeza-Yates & Ribeiro-Neto,
2011; chapter 9). Finding a file relating to a service agenda can therefore be
challenging, particularly if the agenda is complex. Additionally, a considerable
number of indexed files are orphans relating to non-existing services.
Communities
In answer to the deficiencies of UDDI, web communities and aggregators, e.g.,
WebserviceX.NET or XMethods22, sprang up, compiling and sometimes extending
service-interface-oriented information with non-functional descriptions, reviews,
marketing, billing and uptime measures. These attempts similarly suffer from simple
keyword search, small size, quality issues and limited reach. Salesforce's
AppExchange23 is an exception in this group, with an active community providing,
consuming and reviewing services. It is exemplary of an
intermediate step in the transition from closed, domain- and function-oriented
systems to the open, environmentally driven Service Ecosystem. It focuses on
22 See http://www.webservicex.net and http://www.xmethods.net/ for details.
23 See http://sites.force.com/AppExchange for details.
the Salesforce CRM domain24 and is open to third parties. Nevertheless, like the other
communities, it uses a simple keyword-based search.
Ontologies
The most formal SD mechanisms proposed use ontologies, the most prominent being
Semantic Web Services (SWS), which extend web services with formal
annotations describing them and their relationships according to a prescribed
ontology. SWS allow reasoning and automatic service orchestration, selection,
optimization, protection and execution. Its drawbacks are that a) everyone in the
system has to agree on, know and understand the predefined ontology, b) the
ontology has to describe the 'service world' precisely, c) providers and consumers
have to be able and willing to describe their services and needs according to the
ontology, and d) deductive inference based on first-order logic does not always
translate into effective search and is computationally expensive (Grüninger, Hull, &
McIlraith, 2008).
1.2.2 Agenda and Service Need
We divide Service Discovery into the How and the What. The technical and functional
aspects are the How, e.g., interoperability and parameter matching (Sanchez &
Sheremetov, 2008). This is not the focus of this work. We are interested in the
problem of matching the What of a service from a conceptual point of view. We
propose that a transformative Service Discovery (SD) extracts and matches,
transparently and effectively, the concepts underlying service provision and service
need.
A complete automated Service Discovery system, of course, has to solve the
comprehensive problem of how to combine services on a technical level, matching
parameters and protocols or even deploying ad-hoc wrappers to enable connectivity. At
the same time, such a system will need to match the conceptual information of
services. This work focuses on conceptual service information matching only. It
is a long way to such a system, and likely many different technologies and insights,
24 Offering 292 services as of 11.10.2010.
including the ones gained in this work, will be necessary to achieve a completely
automatic Service Discovery process.
Figure 5: Service consumer to service
We can assume that a service consumer has an agenda (Figure 5), for instance,
opening a coffee shop. This agenda translates into several service needs, e.g.,
applying for a business number, registering a business name, requesting permits for
footpath usage and so on. Each service need may be fulfilled by one, several or
combined actual services. In a SES, a service consumer has no expert knowledge of
the services available because of their abundance and the ever-changing alternatives.
She faces an increasing breadth of options and decreasing knowledge when
translating an agenda into needs and trying to fulfil them with appropriate services.
Traditional SD like UDDI does not address this since it assumes a service consumer
to be knowledgeable about the service offering, e.g., it expects her to look it up by an
interface, business name, implicitly known classification or keyword(s). Web search
suffers from similar problems, indexing only functional information. Community-
driven discovery is not an option either, unless the service consumer’s service needs
are trivial, fit into the community’s domain and she is able to express her need
in the community’s terminology. An even worse example of specific semantics
is the ontology-driven approach, which adds a potentially opaque layer of
complexity. It requires the service consumer to know her service needs exactly and
then translate them with great precision into the conceptual structure represented by
the ontology.
A service provider trying to describe a service in these systems faces a similar
challenge. In a UDDI, community or search engine scenario, the provider can
describe the functionality well, but the conceptual description of the service is limited to
a free text field and possibly an arbitrary classification. If every provider used the
free text field to add an expressive, humanly comprehensible description, it would
improve SD. The service provider, though, cannot anticipate all
circumstances of the service’s potential use and the associated variations in linguistic
expression. Therefore, as in traditional search, discovering the service is challenging
due to the vocabulary mismatch between the service description and a query description
for a service. If providers were to remedy this with exhaustive descriptions, the result
would be a lack of precision in the retrieved services. In an ontology SD system, the
provider has the burden of describing the service appropriately. If the description is
too general, the service does not rank as an optimal solution in many relevant
situations; at the same time, if the description is specialised,
deductive inference used for discovery may utilize it only rarely. Moreover, the
provider depends on the competence and ability of all users to comprehend and apply
the ontology in the intended fashion and with a shared understanding.
A transparent and effective SD should support a service consumer by suggesting
services relevant to her agenda and allowing her to search in her own words.
Consider the example of opening a coffee shop. It may not be obvious to the consumer
what all the relevant issues are, and hence she cannot translate the agenda into
appropriate queries. In such cases, search is an explorative process. A SD system
therefore has to be flexible and able to anticipate, or at least approximate, a service
need from a poor description. The searcher should not have to perform
semantically challenging tasks like ontological translation or guessing domain-
specific terminologies and keywords when she is potentially unaware of relevant
aspects of the agenda. We imagine a process like “presumptive attainment” (P.
Bruza, Barros, & Kaiser, 2009) may fulfil these requirements. A searcher has some
incomplete knowledge of her needs and makes a correspondingly imprecise query to the
SD system. The results should not only be meaningful, they should enable the
searcher to expand her knowledge with conjecture and informed guesses, permitting
her to refine her understanding of her service need. Such a SD system would be
fundamentally different in the sense that it does not attempt to choose a service by
deductive logic or lookup in a shared terminology but rather by a more abductive
approximation, mimicking human-like associational reasoning in a way similar to
the automatic query expansion used in IR. We therefore consider it useful to reframe SD
as an Information Retrieval task in which the initial query description is potentially
very imprecise.
1.2.3 An Information Retrieval Task
The existing SD paradigms focus on explicit (ontology) or implicit (shared
terminology) formal information about services. For example, keyword-based
systems are traditionally limited to functional information (UDDI, WSDL search)
and implicitly use a shared terminology even when applied beyond functional
descriptions (communities). Communities have more information through richer
descriptions, reviews and comments but exploit it poorly through
simple keyword-matching searches. Ontology-based systems can be very expressive
but require consumer and provider to map information to a preconceived
conceptualization of the world (the ontology). The development towards the SES
provides additional sources of information about services in descriptions, reviews,
comments, relationships/links, documentation and the like, much as we observe
with the service community web sites. We call this loose corpus of service-
related information the Service Information Shadow (SIS). It normally takes the form
of free text and is a rich, naturally occurring source of service information from
consumers and providers, available for the purposes of SD.
We propose to frame the discovery of a service as an Information Retrieval task
(Figure 6). Let the consumer’s agenda description be an incomplete service need
query. Identifying the relevant service is then equivalent to retrieving a service-
associated document.
Figure 6: SD as an IR task
This Information Retrieval challenge (see more about IR in 2.2) appears to be simple
and classical since it is concerned with unstructured text. However, as we stated earlier,
we cannot assume that service descriptions and consumers use the same
terminology. In fact, we have put this aspect forward as a requirement for SD, since it
is unreasonable to demand that a consumer engage the specific terminology of a service
domain she may not be aware of when formulating her agenda-based query. For
example, a user may search for “car insurance”, unaware that the service domain
uses the term “automotive insurance”. Simple traditional pattern-matching search,
as employed in SD registries, is unable to resolve this mismatch. Employing the SIS
may lessen the terminological and vocabulary mismatch between the
consumer and the service description, so we cannot outright dismiss traditional IR systems,
such as modern keyword-matching systems with more advanced matching
algorithms, as a possible solution. Nevertheless, we propose that a SD system
using advanced IR methods to assist the consumer’s presumptive attainment of
knowledge is potentially superior given the described SD challenge.
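The vocabulary mismatch above can be made concrete with a small sketch. Everything in it is invented for illustration: the service descriptions, the identifiers and the association table are not data from any real registry, and the hand-coded associations merely stand in for the statistically derived ones a Semantic Space would supply.

```python
# Minimal sketch of the vocabulary-mismatch problem. All service
# descriptions and associations below are invented examples.

services = {
    "svc-001": "automotive insurance quotes for private vehicles",
    "svc-002": "flight status lookup for major airlines",
}

def keyword_match(query, corpus):
    """Return ids of services whose description contains every query term."""
    terms = query.lower().split()
    return [sid for sid, text in corpus.items()
            if all(t in text.lower().split() for t in terms)]

# Plain keyword search misses the relevant service entirely:
print(keyword_match("car insurance", services))  # -> []

# Hand-coded associations stand in for statistically derived ones:
associations = {"car": ["automotive", "vehicle"]}

def expanded_match(query, corpus):
    """Retry the query with each associated substitute for each term."""
    results = set(keyword_match(query, corpus))
    for term in query.lower().split():
        for alt in associations.get(term, []):
            alt_query = query.lower().replace(term, alt)
            results.update(keyword_match(alt_query, corpus))
    return sorted(results)

print(expanded_match("car insurance", services))  # -> ['svc-001']
```

The expanded search recovers the service the exact match misses, which is precisely the behaviour the statistical approach proposed in this work aims to provide automatically.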
Conceptual space theory (Gärdenfors, 2004) describes conceptual reasoning in a
geometric space defined by quality dimensions, where the distance between concepts
indicates their relatedness. A SD system mimicking such conceptual reasoning could
make human-like inferences when relating a consumer's imprecisely expressed service
need to related concepts, thus promoting more effective retrieval as well as
helping the consumer learn about the problem space around her agenda. Such an
approach is effective in enhancing traditional IR (D. Song & P. D. Bruza, 2003).
Furthermore, the geometric representation of concepts contains prototypical areas of
meaning around which sub-spaces represent conceptual categories. These can
provide the basis of a taxonomy, i.e., an organization of the space based on the
inherent relationships of concepts. This opens the door to a conceptual map of the
SIS with which the user can interact to gain an overview without needing a
detailed query or agenda.
Semantic Spaces (Lowe, 2001) are vector space models sourced from corpora of
unstructured text, akin to primitive computational approximations of conceptual
spaces (McArthur, 2007). They represent documents and terms in a high-dimensional
space, with the distance between vectors simulating their semantic relatedness.
We propose to employ Semantic Spaces for conceptual representation and inference,
inspired by conceptual space theory. The goal is to exploit Semantic Spaces to
promote effective SD.
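The core mechanism of a Semantic Space can be sketched in a few lines. The vectors below are invented toy values over three made-up context dimensions; real systems derive thousands of dimensions from corpus co-occurrence statistics and then reduce them, so this is only an illustration of how vector distance simulates semantic relatedness.

```python
import math

# Toy Semantic Space: each term is a vector over invented context
# dimensions. The numbers are illustrative only, not corpus-derived.

space = {
    "car":        [8.0, 6.0, 0.5],
    "automotive": [7.0, 5.0, 1.0],
    "flight":     [0.5, 1.0, 9.0],
}

def cosine(u, v):
    """Cosine similarity: near 1 for similar directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related terms sit close together in the space:
print(cosine(space["car"], space["automotive"]))  # high (close to 1)
print(cosine(space["car"], space["flight"]))      # low
```

Under this representation, "car" and "automotive" are close even though they share no characters, which is the property exploited to bridge the vocabulary mismatch discussed above.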
1.3 Research Questions
We envisage that, with the growing proliferation of electronic services, consumers
require a mechanism to address their service needs. In the expected vast and feral
SES, a consumer will not be able to anticipate which services will fulfil her service
needs or in which ontology they are described. We conceive Service Discovery as the
process that matches an informally expressed service need to relevant service(s).
There will be a Service Information Shadow providing a rich source of unstructured
information. Our first and foremost hypothesis is that Service Discovery underpinned
by a Semantic Space representation of the Service Information Shadow will
outperform state-of-the-art information retrieval. In addition, we hypothesize that
semantic categories extracted from the semantic space representation of the SES may
assist in exploring the SES.
Research Question 1
We suggest that Semantic Spaces are an effective computational means of
representing conceptual knowledge to promote service discovery. The first research
question is then as follows:
Do Semantic Spaces promote effective service retrieval in a Service
Ecosystem?
Research Question 2
Service Discovery in a service ecosystem can be aided by a map of the ecosystem.
This is important since a consumer may not be able to express her service need, e.g.,
when confronted with an unusual or novel agenda. The map is an abstraction of
the service space and will allow a consumer to orientate herself and refine her service
need. Considering the dynamic nature and size of the SES, such an approach has to be
automated while at the same time producing a map that aligns with the consumer.
We propose that Semantic Categories, derived from prototypical areas in the
semantic space representation of the service ecosystem, provide a meaningful and
effective abstraction, since they are inspired by conceptual space theory and thus may
align with how humans process concepts. The second research question is:
Do Semantic Categories provide an automatic, meaningful and effective map
of the Service Ecosystem for exploration?
1.4 Contributions
Semantic Service Discovery
The first contribution is the inception of the Semantic Service Discovery model,
which employs a Semantic Space over a Service Information Shadow for Service
Discovery. We will ground this model in conceptual space theory.
Semantic Space Innovations
The literature has established the benefit of matrix factorization and dimensional
reduction of a SS for improved semantic representation of terms and concepts. In the
course of this work, we will review some of the parameters in more detail, in
particular the so-called singular values, which the literature has yet to assess to
establish whether they further enhance semantic representations. We also propose to
extend the Semantic Space (SS) model by adding cross-document relationship
information to the traditional vector space model. The SD evaluation experiments
will provide evidence of its value.
Based on conceptual space theory, we will motivate an alternative to traditional
clustering algorithms. Such an algorithm is grounded in cognitive science and
makes use of the inherent structure of a SS. We will evaluate it against state-of-the-art
clustering to provide the foundation for further development of a flexible, human-like
categorization of the SS.
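The prototype-driven categorization motivated here can be sketched as follows. This is a hedged illustration only: in conceptual space theory, categories form around prototypes, partitioning the space into a Voronoi tessellation, and every vector, category label and item below is an invented example rather than part of the algorithm developed later in this thesis.

```python
import math

# Hedged sketch of prototype-based categorization: each category is a
# prototype vector, and every item goes to its nearest prototype
# (a Voronoi tessellation of the space). All values are invented.

prototypes = {
    "finance": [9.0, 1.0],
    "travel":  [1.0, 9.0],
}

items = {
    "tax return service": [8.0, 2.0],
    "flight booking":     [2.0, 8.5],
    "travel insurance":   [5.0, 6.0],
}

def categorize(vec):
    """Assign a vector to the category of the closest prototype."""
    return min(prototypes, key=lambda cat: math.dist(vec, prototypes[cat]))

for name, vec in items.items():
    print(name, "->", categorize(vec))
```

Note that borderline items such as "travel insurance" still receive a single, defensible assignment based purely on geometry, without any externally imposed taxonomy.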
Relevant Data and Experiment
The envisaged SES and the reframing of SD as an IR task pose two unique
challenges. No SES-like SIS exposing all desired qualities exists as a data source in
the literature. Consequently, an experiment measuring and comparing the
performance of SD with such data does not exist either.
We plan to identify a suitable data source and design a SD experiment for the
proposed SD scenario. It will compare our SD model against state-of-the-art IR
systems. The semantic categorization algorithm uses the same data source to assess
its potential to provide a map of the SES as an alternative to state-of-the-art
clustering.
1.5 Thesis Structure
In this chapter, we introduced the notion of the Service Ecosystem. The emergence
of a vast and open SES poses the vital challenge of adding flexibility to Service
Discovery while maintaining effectiveness, which current SD frameworks address
inadequately. We proposed to reframe SD as an IR task. Our hypothesis is that
Semantic Spaces utilizing the SIS form an effective SD system superior to
traditional IR.
The next chapter is a literature review. It highlights current SD trends from
traditional registries to modern ontology-driven approaches. A brief overview of
IR methods precedes an in-depth review of Semantic Space models, including an
overview of conceptual space theory.
Chapter 3 presents the Semantic Service Discovery (SSD) model. We begin with a
detailed review of SS generation and then introduce two extensions to the SS aiming to
improve its performance. Next, we provide details of the semantic categorization
algorithm inspired by conceptual space theory. We also present a novel enhancement
to categorization and clustering in general through typed vectors called
“Perspectives”.
This is followed by a description of the two discovery modes we envision utilizing
the SS. The chapter ends with a description of a software prototype that we
implemented to test the SSD model, including a quantitative evaluation of the
semantic associations generated by the model and software through a well-
established synonym test.
Chapter 4 evaluates the first research question. It introduces a SIS-like corpus for the
following experiments. The effectiveness of the SSD is evaluated by simulating
service need queries of differing quality and measuring how well the SSD model retrieves
relevant related service documents. State-of-the-art IR and alternative SS systems
provide a baseline against which to position the SSD model’s results. We also test the
applicable novel SS features (introduced in chapter 3) to establish their value.
Chapter 5 investigates the second research question by comparing a range of state-of-
the-art clustering methods and criterion functions, as well as Semantic Categorization, an
algorithm inspired by conceptual space theory, against a baseline manual
categorization of the SIS-like corpus. This quantitative evaluation precedes a
qualitative review of the Semantic Categorization results to provide a better
understanding of its potential to map out the service ecosystem.
Chapter 6 discusses the results of the experiments and their outcomes for the research
questions. The thesis then closes with a look at future research to answer the questions
raised by this work and at how to continue the development of the Semantic Space
based Service Discovery model.
2 Literature Review
This chapter provides an overview of Service Discovery frameworks dividing them
into functional, social and ontology types following roughly its historical
development. This provides background to the abilities and limitations of current SD
if we were to apply it to a SES.
The next section focuses on IR systems, particularly on the prominent inverted
index model, which is still widely used in SD because of its balance of simplicity
and effectiveness. Since highly effective implementations are available, making them
candidates for underpinning SD in a SIS, it will be the main comparison baseline for
the SS based SD model.
The last section of this chapter explores Semantic Spaces. We begin the section
with an introduction to conceptual space (CS) theory to motivate the choice of SS.
We then introduce the basic SS model and its factorization by Singular Value
Decomposition. The section closes with a review of some attempts at combining
information structure and SS, followed by an introduction to clustering. Clustering
will form the baseline for the CS-inspired semantic categorization of the SS
introduced later.
2.1 Service Discovery
Current SD systems are generally modest lookup mechanisms to identify services by
name or function parameters in a closed setting like a distinct domain, industry or
department. They are used by expert consumers or simple automated systems with
clearly defined needs and knowledge of the available services, sharing a terminology
either implicitly, e.g., through domain-specific jargon, or explicitly, e.g., through
appropriate documentation. We present here three well-known modes of SD, and
presumptive attainment as an alternative to facilitate discovery in a SES.
2.1.1 Function oriented SD
The initial and most prominent service lookup system is the Universal Description,
Discovery and Integration (UDDI) registry supported by the Organization for the
Advancement of Structured Information Standards (OASIS)25 as an open industry
standard to expose services on the Internet or networks. It was thought of "[...] to
become the de facto standard for web services management on the web" (Sabbouh,
Jolly, Allen, Silvey, & Denning, 2001) and to possibly develop into an equivalent of
search engines for services. Its SOA focus on the functional aspects of services,
using merely service interfaces, arbitrary classifications and keyword lookups
designed for automated systems and expert consumers, prevented this development
(Atkinson, Bostan, Deneva, & Schumacher, 2009; SAP News Desk, 2005). For
example, a search for zip-code may return services containing the keys zip or postal
code but not zipcode (Dong, Halevy, Madhavan, Nemes, & J. Zhang, 2004).
UDDI remains primarily a solution for closed systems (Atkinson et al., 2009). The
termination of the public UDDI Business Registry (UBR) in 2006 showed that UDDI
is mostly a supporting technology for SOA and a failure as a SD system (Atkinson
et al., 2009; Bachlechner et al., 2006).
A similar approach is the adoption of web search engines to retrieve WSDL files.
WSDL is function centric, and service designers rarely make good use of the optional
free-text description in its definition. Web search engines moreover use modest full-
text indices and hypertext link relationships (Baeza-Yates & Ribeiro-Neto, 2011;
Hagemann, Letz, & Vossen, 2007), concentrating on efficiency more than
effectiveness. Without the benefit of link ranking, WSDL file search amounts to a
plain keyword lookup over mostly functional service information. Crawling WSDL
files via Google (Al-Masri & Mahmoud, 2007) returned only 340 services, with 77%
having no or inadequate documentation and descriptions. Many files referred
to inactive or non-existent services.
The effectiveness of UDDI or WSDL search as a SD for a SES is not simply a
“given”. Firstly, the service provider and consumer would have to use a global,
unambiguous, predetermined service terminology that does not exist. Secondly,
consumers require a detailed conception of the need together with the ability to
express and match it through functional parameters or names. This process promotes
a lookup-style search requiring intimate knowledge of the SES from the consumer,
25 See http://www.oasis-open.org for more details.
instead of a more discovery-oriented approach. A range of improvements has been
attempted by employing descriptive and relationship information, clustering, vector
space models and signature matching of the services (Bose, Nayak, & P. Bruza,
2008; Dong et al., 2004; Peng, 2007; Sajjanhar, Hou, & Y. Zhang, 2004; Stroulia &
Wang, 2005; Studholme, Hill, & Hawkes, 1999; Wang & Stroulia, 2003). These
works constitute an encouraging development with some promising results.
However, they all restrict themselves to the limited descriptive information available
within the realms of WSDL and UDDI.
2.1.2 Social-oriented SD
More recently, communities and portals have developed around service provision and
consumption that include functions like discovery, review and marketing
(Bachlechner et al., 2006; Rambold, Kasinger, Lautenbacher, & Bauer, 2009). They
are limited in size, offer a small range of services of varying quality, have domain-
specific foci, and rely primarily on keyword-based search to find services. The open-
platform ones lack size and professional services. Some professional and
platform-related examples are thriving. SalesForce's AppExchange community is
such an example, offering access to a sizable group of potential consumers, the
SalesForce CRM customers, and a stable platform including billing options. It is a
domain-specific solution and demonstrates that a third-party-driven service
environment can provide, consume and manage its own services. It extends the
keyword search to reviews and filters of tags and categories originating from within
the community as a kind of informal, crowd-driven terminology. This enhances the
search process, but at its core it relies on implicitly shared terminologies encoded in
the keywords, tags and categories. Discovery in the sense of finding previously
unknown information remains as unmet as in the UDDI and WSDL scenarios. The
communities are functioning because they focus on domain specifics and the
consumers are knowledgeable about the domain.
2.1.3 Ontology-based SD
The Semantic Web uses ontologies and annotations to describe information and
relationships. A subset of it is “Semantic Web Services” (SWS; McIlraith, Son, & Zeng,
2001), which proposes automatic service discovery, orchestration and invocation
through deduction (Rambold et al., 2009). However, SWS are not broadly used or
standardized despite several years of research (Du, Shin, & Lee, 2008; McIlraith et
al., 2001; Verma & Sheth, 2007) and there is not a single agreed upon ontology (see
OWL-S,26 WSDL-S27 and WSMO28). The European Union, for example, recently
initiated the Semantic Evaluation At Large Scale (SEALS) project29 to develop
evaluations for semantic technologies, their tools and inter-operability, including SWS.
A key problem is that the complexity of an ontology and the demand on its users grow
with the size of the ‘world’ it describes. When the ontology describes a particular
domain, the meaning of the vocabulary is easily conveyed. Once the ontology grows,
synonymy and polysemy become an issue. Synonymy refers to the fact that “[…]
people choose the same key word for a single well-known object less than 20% of
the time” (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Furnas,
Landauer, Gomez, & Dumais, 1983). On the other hand, polysemy refers to the
various meanings a term can have depending on the context in which it is used, e.g.,
the term chip. As a result, an ontology that assigns singular meanings to its
vocabulary acts contrary to natural language: the larger the domain, the greater the
potential for synonymy and polysemy.
The use of terms in ontologies to expand queries has limitations (Voorhees, 1994)
and is cognitively contentious (P. Bruza et al., 2009). This is part of the symbolic
grounding problem (Dietze, Gugliotta, & Domingue, 2008) where the meanings of
symbols or words are dependent on the consumer and the context of use:
[T]asks are highly dependent on the situational context in which they occur,
SWS technology does not explicitly encourage the representation of domain
situations. Moreover, describing the complex notion of a specific situation in
all its facets is a costly task and may never reach semantic completeness.
Simple vector spaces using quality dimensions have been proposed (Gärdenfors,
2004) to contextualize SD and reduce the complexity of the ontology and the
contextual ambiguity. The dimensions are predetermined in the framework and
26 See http://www.ai.sri.com/daml/services/owl-s/ for more.
27 See http://www.w3.org/2005/04/FSWS/Submissions/17/WSDL-S.htm for more.
28 See http://www.wsmo.org for more.
29 See http://www.seals-project.eu and
http://cordis.europa.eu/fetch?CALLER=EN_NEWS&ACTION=D&RCN=31509 for more.
effectively transfer the complexity and grounding problem from the ontology to the
quality dimensions without solving it. Ontology mapping is an alternative approach
to this problem (Pathak, Koul, Caragea, & Honavar, 2005), using smaller, manageable
ontologies for different domains and translating between them. It shifts the ontology
complexity into an equally fraught ontology translation problem.
The Semantic Web can only be useful in SD outside of semi-closed environments
with a general ontology when intermediaries in the form of service brokers take over
the translation/complexity problem (Bachlechner et al., 2006). Current SWS
discovery systems' capabilities are fragmented (Rambold et al., 2009), and even if
complete working systems and annotated services were available, they would not
enable automated orchestration and consumption in their current form.
“[T]he employment of semantic technologies and related tools for service discovery
in pervasive environments comes with a major handicap: the underlying semantic
reasoning is particularly costly in terms of computational resources and not intended
for use in highly dynamic and interactive environments”30 (Mokhtar, Preuveneers,
Georgantas, Issarny, & Berbers, 2007), which makes their efficient and
effective application in a SES highly doubtful.
2.1.4 Presumptive Attainment
The current modes of SD do not help a consumer expand incomplete
knowledge by suggesting information related, and possibly relevant, to the consumer’s
agenda. The introduction argued that this is a central challenge of the Service
Ecosystem in light of its scale, and a catalyst for its inception and operation. A SES
reaches across domain, industry and organisational boundaries, and cannot presume a
shared terminology. Such a SD needs to cover all service domains and make services
discoverable by consumers who lack knowledge of applicable services, their
description and who have an imperfect or incomplete conception of their service
need. A discovery process therefore has to use a kind of inference to extrapolate the
service need from inadequate information. This can be a deductive mechanism
extrapolating an initial inadequate description of a service need. Alternatively,
30 Semantic technologies in this context refer to ontological methods, not the statistical methods we propose as an alternative.
abduction induces appropriate related concepts aligned with the service need. The
distinction between the two is that deduction will infer concepts implied by the
initial service need description, whereas abduction furnishes suggestions of possibly
related concepts. For example, “concept abduction” has recently been proposed to
hypothesize unstated related concepts for the benefit of search on the semantic web
(Colucci, Noia, Sciascio, Mongiello, & Donini, 2004).
Presumptive attainment (P. Bruza et al., 2009) has been proposed as a possible
approach to the abduction of information to extend incomplete knowledge. It states that
a consumer with an agenda but a lack of (complete relevant) knowledge has three
options. The first two are to capitulate or to extend the knowledge to
encompass everything relevant to the agenda. The first is not desirable, and the
second is challenging since the costs are often high, or it is hard to identify what
is relevant and then to ‘learn’ it. There is a third option, however. The consumer can
use conjecture to presume that some information may be relevant to the agenda. It is
important to note that while this could loosely be described as guessing, it is indeed
informed guessing. The difference is important. The consumer is willing to invest in
an action resulting from this information since she identifies it as conceivably
relevant to the agenda in the context of her knowledge.
The rich information surrounding services is valuable to consumers trying to close
an agenda and to identify relevant services. This information is largely unknown and
inaccessible, or too costly to process, for consumers. Semantic Spaces can unearth
latent relationships from this information and make them easily accessible to
consumers searching the SES. The SS does not require a consumer to have complete
or formal knowledge of the agenda to pose a query as a starting point to explore the
space. The consumer can utilize the SS to extract a small set of possibly related
information and services. This facilitates presumptive attainment by the consumer
since she faces a manageable, related and potentially relevant subset of the space,
filtered to match her (incomplete) knowledge of the agenda. Semantic Service
Discovery can therefore provide a mode of discovery in which, with little knowledge, a
consumer can find related information, abduct relevant information and services
unknown at the time of forming the agenda, and ultimately extend her knowledge.
2.2 Information Retrieval
The task of retrieving information arose at the time humans started to write down and
collect information. The first systems organized early scriptures in stone, clay,
papyrus and later paper (Baeza-Yates & Ribeiro-Neto, 2011, chapter 1.1.1). They
evolved into the modern library systems using a combination of alphanumeric and
keyword based indices to make information accessible by reference, removing the
need to search an entire collection sequentially.
There were two major events in recent IR history after this millennia-old
development. The first was electronic data processing and the subsequent
development of evaluation methodologies for IR in the 1950s and 1960s (Cleverdon,
1967; Kent, Berry, Luehrs, & Perry, 1955). From there on, the field matured with
continued improvements (Baeza-Yates & Ribeiro-Neto, 2011; Van Rijsbergen, 1979;
Gerard Salton, 1968, 1983). The second noteworthy event was the rise of the World
Wide Web. A network of loosely related, unstructured information connected by
hyperlinks, provided and consumed by anyone with access to the Internet, is
dependent on IR systems. In no small part, developments in Information Retrieval
(Page, Brin, Motwani, & Winograd, 1999) aided the Internet and the emergence of a
global information society.
In the last decades, the types of electronic data, coded facts, we create, collect, store
and subsequently search have increased. Initially, text was the only type before all
sorts of data followed, like medical, environmental, financial, multimedia, linked and
structured text. IDC31 for example estimated for 2010 that the total electronic data
stored was 1.2 zettabytes, or 1,200,000 petabytes, or 1.2×10^21 bytes. They further
estimate that it will grow 44-fold by 2020. This incomprehensible amount of data
requires sophisticated tools, such as IR systems, to extract information, i.e., unique,
useful and contextualized data, from it.
31 See http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm for more (accessed 25.03.2011).
2.2.1 Two modes of IR
“Exploratory search makes us all pioneers and adventurers in a new world of
information riches awaiting discovery along with new pitfalls and costs.”
(Marchionini, 2006)
The challenge for IR is to find information in the sea of data. Here exists an important
division between modes of IR. On the one hand, IR can be the mere lookup of information
according to rules and keys, as it is in libraries and simple keyword-based IR systems.
Marchionini (2006) identifies an alternative, exploratory search (Figure 7), where
learning and investigation lead to a refinement and re-evaluation of the search in a
feedback loop.
Figure 7: Search activities32
The outcome of this discovery process is not only to find information but also to
acquire knowledge that informs the ongoing search. This kind of
exploratory search is not novel, e.g., O'Day & Jeffries (1993) identified it when
reviewing search styles used by librarians. They noted that in the knowledge
acquisition phase, the librarians using traditional indices occasionally had to learn
domain-specific terminologies and significant entities like places, companies and
persons to fulfil their task. In their opinion, a technique to annotate information items
to give better access to the relationships between them would be greatly beneficial.
This aligns with our suggestion that abductive reasoning over concepts and their
associations can make relationships transparent, with the potential to alleviate the
semantic gap a consumer faces when searching for a service.
32 Based on Marchionini (2006; figure 1)
2.2.2 Taxonomy of IR
In its 60-year-long modern development, IR as a field has diversified. Figure 8
illustrates a modern taxonomy of information retrieval systems. They divide into three
main groups depending on the information content. We are primarily concerned with
the unstructured text in the Service Information Shadow and thus the classical IR
models. The three models are Boolean, Vector and Probabilistic. We suggested
previously that a lack of knowledge about the SES and the underlying terminologies
on the part of the searcher will drive the SD challenge in the future. The resulting
incomplete, unstructured and possibly symbolically mismatched query posed by the
searcher has to be comprehensible to an IR system returning meaningful results. In
the best case, it
would answer a detailed request with the relevant result(s). In the worst case, it
should return at least related information to a vague request providing the searcher
with enough to enhance her understanding of her needs and refine her query in the
mode of exploratory search. Furthermore, the IR system has to be able to adjust to
the flexible nature of a SES with ever-changing service offerings and changes in
semantics in the corpus over time.
Figure 8: A taxonomy of IR systems33
Librarians define a taxonomy reflecting the world of books (and other media) they
organize. They classify a book with this taxonomy independent of whether the index
terms occur or share the same meaning in the books. A library index thus is an
artificial index. Originally, this was no problem since the indexed works covered a
small, slowly changing and expanding body of knowledge provided and consumed by
a small circle of learned people. Today, as we have mentioned, the quantity of
information to index is growing rapidly. As a result, a rigid index that demands that the
searcher and information item conform to it is incompatible with the dynamics of
this information growth. This disqualifies the millennia-old library indices, since
searchers have to learn and adhere to a particular symbol-bound encoding of
information. The semantic burden of transferring a need to a query lies solely with the
searcher, and a change in the index structure, i.e., a change in semantics, would
require all searchers to change their understanding of the index.
33 Based on Modern Information Retrieval (Baeza-Yates & Ribeiro-Neto, 2011, p. 60)
“[U]sing just human generated categories for indexing […] might lead to a
poor search experience, particularly if the users are not specialists with
detailed knowledge of the document collection.” (Baeza-Yates & Ribeiro-Neto,
2011, p. 64)
2.2.3 Basic concepts
We need to introduce some basic IR concepts before we investigate the suitability of
the three main classic IR models to index and search an unstructured text corpus. Let
C = {d1, d2, d3, …, dj} with j the number of documents in a corpus C and dj a document
in the corpus. Let Vj = {t1, t2, t3, …, tk} with k being the number of unique terms in a
document and tk one of the terms. Vj is the vocabulary of dj. Furthermore, let
W = {w1,1, w1,2, …, wk,j} be the weights each term has in each document.
d1 d2 d3
t1 1 1 0
t2 1 1 1
t3 0 1 1
Table 1: Boolean term document matrix
With such knowledge, we can build a matrix with documents as columns and terms
as rows and note their weight in each table cell. The most basic weight is whether a term
occurs (1) or not (0). Table 1 demonstrates such a Boolean term document matrix.
The challenge is to find a more expressive term weight that reflects
how well a term identifies a document. The simplest form is Term Frequency (TF).
Table 2 revisits the previous table; this time, the TF, or number of occurrences
of a term in a document, is the weight noted in the respective table cell. The remaining
problem is that meaningless terms, e.g., ‘the’ or ‘a’, occur frequently in all
documents and overshadow expressive terms.
d1 d2 d3
t1 3 1 0
t2 5 7 1
t3 0 1 4
Table 2: Term Frequency to term document matrix
An expressive term therefore has to distinguish a document from the remainder of
the corpus. So a weight should reflect that a term identifies well with the topic of the
document, is not widely used in the corpus, and is at the same time a probable query
term, i.e., not a typo or an uncommon term. This requires us to understand the
distribution of terms better.
f(i; a, N) = \frac{1/i^a}{\sum_{n=1}^{N} 1/n^a}

Equation 1: Zipf's Law
Empirical evidence about term distribution inspired Zipf’s Law (Zipf, 1935). It states
that the rank of a term in a corpus is inversely proportional to its frequency, following
a power-law distribution (Equation 1). N is the number of words in the language and i
is the rank of the i-th most frequent word. The original Zipf's law used a = 1; in
general, a is chosen according to the corpus. In the simplest case, Zipf’s law is a
harmonic series, e.g., the most frequent term occurs twice as often as the second most
frequent and thrice as often as the third, and so on. For a > 1 the series converges. For
a ≤ 1 the series diverges and the vocabulary grows indefinitely, although progressively
more slowly. It has been shown (Araujo, Navarro, & Ziviani, 1997) that a value for a
between 1.5 and 2.0 fits the natural distribution best.
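As a toy illustration of the harmonic case (a = 1) of Equation 1, the following sketch normalizes the rank frequencies; the function name `zipf_frequencies` and the vocabulary size are ours, not from the thesis:

```python
# Zipfian rank-frequency sketch: with exponent a = 1, the i-th most
# frequent word occurs 1/i times as often as the most frequent word.
def zipf_frequencies(n_words, a=1.0):
    # Unnormalized frequency for each rank 1..n_words.
    raw = [1.0 / (i ** a) for i in range(1, n_words + 1)]
    total = sum(raw)  # the normalizing sum from Equation 1
    return [f / total for f in raw]

freqs = zipf_frequencies(5)
# Rank 1 occurs twice as often as rank 2 and thrice as often as rank 3.
print(round(freqs[0] / freqs[1], 2))  # 2.0
print(round(freqs[0] / freqs[2], 2))  # 3.0
```

For a > 1 the tail decays faster and the normalizing sum converges even as the vocabulary grows, mirroring the convergence remark above.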
idf_i = \log \frac{N}{n_i}

Equation 2: Inverse Document Frequency
A ‘heavy’ or content-bearing term therefore strikes a balance between high and low
Document Frequency (DF; Figure 9), a count of how many documents contain the
term. The TF is a local measure and does not say much about the corpus-wide
distinctiveness of a term. Since we know the power-law distribution of terms, we can
use it to discriminate terms corpus-wide. This can be achieved by the Inverse
Document Frequency or IDF (Gerard Salton, C. S Yang, & Yu, 1975; Sparck-Jones,
1972), the log of the inverse fraction of documents containing the term (Equation 2),
with N being the number of documents in the corpus and ni the number of documents
in which term i occurs. In practice, we often use ni+1 instead of ni to prevent a
division by zero when a term occurs in no document.
Figure 9: Content bearing terms by DF34
The choice of which words to treat as content bearing is usually made by DF,
IDF or TF-IDF (Gerard Salton, 1968), with a term of relatively high frequency
appearing in only a small number of documents receiving a high value. TF-IDF
multiplies the TF35, how often a term i appears in a document z relative to the total
number of terms in z, with the IDF, the log of the total number of
documents N divided by the number of documents containing i (see Equation 3).
tfidf_{i,z} = \frac{f_{i,z}}{|z|} \times \log \frac{N}{n_i}

Equation 3: TF-IDF of term i in document z for a corpus of N documents
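A minimal sketch of Equation 3, assuming documents are simple term lists; the toy corpus and the helper name `tf_idf` are illustrative only:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF per Equation 3: relative frequency of the term in the
    document times the log of the inverse document frequency."""
    tf = doc.count(term) / len(doc)            # f_{i,z} / |z|
    n_i = sum(1 for d in corpus if term in d)  # documents containing i
    idf = math.log(len(corpus) / n_i)          # log(N / n_i)
    return tf * idf

corpus = [["service", "discovery", "semantic"],
          ["service", "oriented", "computing"],
          ["semantic", "space", "model"]]
# "discovery" appears in 1 of 3 documents: tf = 1/3, idf = log(3).
print(round(tf_idf("discovery", corpus[0], corpus), 3))  # 0.366
```

Note the sketch presumes the term occurs somewhere in the corpus; the ni+1 smoothing mentioned above would guard the unseen-term case.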
2.2.4 Basic IR System
The classical IR systems for unstructured text we are about to discuss are structurally
similar, and we provide an overview in Figure 10. A system separates into two main
parts: indexing and querying. Indexing is a one-time or periodic process that converts
the corpus of text into an inverted index. The corpus is parsed and tokenized, building
a vocabulary of indexed terms; various processing steps may be included, e.g.,
removing stop words or stemming words to their grammatical root. The transformed
corpus is converted into a reverse index that is similar to the term document matrices
but more space efficient. The simplest form is a list of indexed terms pointing to lists
of documents containing them. This is comparable to a non-sparse version of the
Boolean term document matrix. These indices can also be more sophisticated: a
term, or even a multi-word phrase, a so-called n-gram, may point to a list of lists,
which point to positional occurrences inside documents and may even include
different weight measures.
34 Based on (Salton, Yang, & Yu, 1974, fig. 7)
35 Note that TF from hereon is different from the previously naïve version of purely counting term occurrence in a document.
Figure 10: Classic IR system. On the indexing side, unstructured documents from the
corpus pass through punctuation filtering, stemming and stop-word removal into a
bag-of-words vocabulary and a reverse index mapping terms to document lists (e.g.,
t1 -> d2, d1). On the querying side, a free-text, Boolean-logic or relevance-selection
query is transformed by extracting query items and grammar, mapping them to the
vocabulary and generating a query representation, which is matched against the
index, and the documents are ranked to return the (best) matches.
The querying side of the IR system takes a query, which in its simplest form would
be a single word, and transforms it into a representation for the IR system. The
transformation can range from a simple exact pattern match to sophisticated lexical
analysis, Boolean grammar or even phrase detection. The result is fed into a
ranking algorithm, which matches it with the index and returns a list of results,
possibly ranked by how well each matches the query.
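The indexing side can be sketched minimally as follows; the naive whitespace tokenization and the helper name `build_inverted_index` are our simplifying assumptions, standing in for the stemming and stop-word steps of a real pipeline:

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each indexed term to the sorted list of document ids
    containing it, mirroring the 't1 -> d2, d1' style term lists."""
    index = defaultdict(set)
    for doc_id, text in enumerate(corpus):
        for term in text.lower().split():  # naive tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

corpus = ["Semantic service discovery",
          "Service oriented computing",
          "Semantic space model"]
index = build_inverted_index(corpus)
print(index["semantic"])  # [0, 2]
print(index["service"])   # [0, 1]
```

A query then touches only the postings lists of its terms instead of scanning the whole corpus, which is the space and time advantage over the full term document matrix.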
2.2.5 Boolean
The Boolean information retrieval model (Lancaster & Fayen, 1974) is the first
under consideration for Service Discovery (utilizing a Service Information
Shadow from a Service Ecosystem). Under the Boolean model, the IR system
indexes a set of documents by noting which terms occur in which documents,
expressible in a Boolean term document matrix. The documents are equivalent to
bags of words, with the position or frequency of the words/terms in the document or
corpus being irrelevant. A searcher can express an information need as a set of
terms combined with the Boolean operators NOT, AND and OR. NOT requires the
subsequent term/expression to not occur (be false) in a document, AND requires both
surrounding terms/expressions to occur (be true) in a document, and OR requires
one of two terms/expressions to occur for a document to be considered relevant.
So, for Table 1, the query q1={t1} would return d1 and d2 as relevant, q2={t1 AND
t3} would return d2, and q3={t1 NOT t3} would return d1. In its basic form, the
Boolean model only knows relevant and irrelevant documents, matching or not
matching a query. This absolute semantics does not allow for partial matching and
nuanced querying. It often results in simple queries and excessively large result
sets.
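The query examples above can be reproduced with plain set operations; the `postings` structure is an illustrative stand-in for an inverted index over Table 1:

```python
# Boolean retrieval over the term document matrix of Table 1,
# representing each term by the set of documents containing it.
postings = {
    "t1": {"d1", "d2"},
    "t2": {"d1", "d2", "d3"},
    "t3": {"d2", "d3"},
}
all_docs = {"d1", "d2", "d3"}

q1 = postings["t1"]                                # {t1}
q2 = postings["t1"] & postings["t3"]               # {t1 AND t3}
q3 = postings["t1"] & (all_docs - postings["t3"])  # {t1 NOT t3}

print(sorted(q1))  # ['d1', 'd2']
print(sorted(q2))  # ['d2']
print(sorted(q3))  # ['d1']
```

Note that each query yields an unranked set: every returned document is equally "relevant", which is exactly the absolute semantics criticized above.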
For Information Retrieval, the Boolean model is ineffective and rarely used in
professional IR settings, except perhaps for patent retrieval. It is, however, popular as
a simple and quick search method. In particular, online presences build on the
prevalent combination of a relational database system and a scripting language. Most
database systems offer automatic full-text index generation, and information
retrieval on these web sites is simply a lookup in these indices. Advanced
implementations may extend this to typographical distance computations, stemming
or basic Boolean logic. We referred to these basic lookup systems as keyword
systems in the previous sections.
2.2.6 Probabilistic
The probabilistic IR framework (S. E. Robertson & Sparck-Jones, 1976) posits that
for an information need there is a set of relevant documents R in C. A query q, a set
of indexed terms, expresses such a need. The challenge is to identify R by q, which
supposedly contains the properties to do so, without the system knowing them.
Consequently, a guess and approximation are necessary, and documents with a certain
probability of being relevant are identified as the answer to q. A document is a vector
of binary term weights, i.e., each dimension represents one indexed term, with a 1/true
if the term is contained in the document and a 0/false otherwise. The consumer can
give feedback to the system identifying (non-)relevant documents and thus improve
the model and answer.
sim(d_j, q) = \frac{P(\vec{d}_j \mid R, q)}{P(\vec{d}_j \mid \overline{R}, q)}

Equation 4: Probabilistic similarity by relevance ratio
The ranking of the results measures the similarity of a document dj to a query q.
Equation 4 illustrates the similarity measure, also known as the relevance ratio
(Baeza-Yates & Ribeiro-Neto, 2011, p. 81). It computes the probability of retrieving
the vector representation of dj given q, divided by its complement, the probability of
the document being non-relevant to the query. Using a contingency
table (Table 3), Equation 4 can alternatively be approximated as Equation 5 (Baeza-
Yates & Ribeiro-Neto, 2011, p. 83) using the Robertson-Sparck Jones equation (S. E.
Robertson & Sparck-Jones, 1976). N is the number of all documents in C, ni is the
number of documents containing ti and ri the number of relevant documents
containing ti, while R is still the number of all documents relevant to the query q.
This approximation assumes R=ri=0 to remove the need for human interaction and
results in a DF-based ranking. The addition of 0.5 to the numerator and denominator
of the equation ensures that the log does not fail for the two extremes of ni=N and ni=0.
                       relevant   non-relevant        total
Documents with ti      ri         ni − ri             ni
Documents without ti   R − ri     N − ni − (R − ri)   N − ni
All documents          R          N − R               N

Table 3: Contingency table
The original probabilistic model and ranking did not take into account term
frequency or document length (long documents are more likely to be relevant since
they regularly contain a larger part of the vocabulary). The modern BM2536 model
(S. Robertson, Zaragoza, & Taylor, 2004) remedies this with a combination of the
earlier BM11 and BM15 models (S. E. Robertson, Walker, Jones, Hancock-Beaulieu,
& Gatford, 1994). BM25 is effectively a weighting scheme utilizing a TF-IDF
variant, document length normalization and two variables to adjust to corpus
features. This introduced a fully automatic ranking independent of consumer
feedback and, in addition, relieved the deficiencies of the original probabilistic model.
sim(d_j, q) \sim \sum_{t_i \in q \cap d_j} \log \frac{N - n_i + 0.5}{n_i + 0.5}

Equation 5: Probabilistic similarity by contingency table
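A sketch of the Equation 5 weighting under the R = ri = 0 approximation; the helper names `rsj_weight` and `rank` are ours:

```python
import math

def rsj_weight(N, n_i):
    """Robertson-Sparck Jones term weight under the R = r_i = 0
    approximation (Equation 5); the 0.5 smoothing keeps the log
    defined for the extremes n_i = 0 and n_i = N."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

def rank(query_terms, doc_terms, doc_freq, N):
    # Sum the weights of the query terms that occur in the document.
    return sum(rsj_weight(N, doc_freq[t])
               for t in query_terms if t in doc_terms)
```

As expected of a DF-based ranking, rarer terms contribute larger weights: with N = 1000, a term in 10 documents outweighs one in 500.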
Equation 6 represents a common BM25 variant. On the right-hand side, the log is
identical to the DF-based ranking from Equation 5. It extends it by TF (fi,j) in the
numerator on the left side of the equation. The denominator includes document
36 BM stands for Best Matching.
normalization by dividing the document's length in number of words (|dj|) by the
average document length. The scalars K1 and b are adjustable factors to fine-tune for
corpus characteristics.
sim(d_j, q) \sim \sum_{t_i \in q \cap d_j} \frac{(K_1 + 1)\, f_{i,j}}{K_1 \left[(1 - b) + b\, \frac{|d_j|}{\mathrm{avg}(|d|)}\right] + f_{i,j}} \times \log \frac{N - n_i + 0.5}{n_i + 0.5}

Equation 6: BM25
The BM25 formula has been very successful, and many state-of-the-art IR systems
employ it or close variants.
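A sketch of the Equation 6 scoring function, assuming documents are term lists; the defaults k1 = 1.2 and b = 0.75 are commonly cited starting values, not values prescribed by the thesis:

```python
import math

def bm25(query, doc, corpus, k1=1.2, b=0.75):
    """BM25 score of doc for query (Equation 6)."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query:
        f = doc.count(term)  # f_{i,j}
        if f == 0:
            continue
        n_i = sum(1 for d in corpus if term in d)
        idf = math.log((N - n_i + 0.5) / (n_i + 0.5))  # as in Equation 5
        norm = k1 * ((1 - b) + b * len(doc) / avg_len)  # length normalization
        score += idf * (k1 + 1) * f / (norm + f)
    return score

corpus = [["semantic", "service", "discovery"],
          ["service", "oriented", "computing"],
          ["semantic", "space", "model"]]
# A document containing the query term outscores one without it.
print(bm25(["discovery"], corpus[0], corpus) >
      bm25(["discovery"], corpus[1], corpus))  # True
```

The saturation in the left factor means repeated occurrences of a term add progressively less, while b controls how strongly longer documents are penalized.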
2.2.7 Vector
The basic vector space model (G. Salton, Wong, & C. S. Yang, 1975) uses a term
document matrix similar to Table 2, with the column vectors representing documents
in a k-dimensional Euclidean space, with k being the number of indexed terms.
Commonly, a non-binary/Boolean term weight is used. The presumption that
documents are topical establishes term context; thus, co-occurrence of terms in a
document implies a shared meaning of the terms, while documents with similar topics
will contain similar terms. Consequently, there are two ways to interpret the matrix.
The similarity between row vectors relates to the similarity between the terms they
represent. Likewise, document column vectors relate to document similarities. A
query containing indexed terms can be represented as a vector in the same k-
dimensional space, and the similarity between the query vector and the document
vectors is used as a measure and ranking of the similarity between the query and each
document.
Term co-occurrence Matrix
The term document matrix assumes that terms are fully independent. An alternative
to this view is the term co-occurrence matrix, a prominent model of which is
Hyperspace Analogue to Language or HAL (Lund & Burgess, 1996). It uses a term-
to-term matrix (Table 4) to accumulate term weights for each co-occurrence of two
terms while parsing the corpus. The parsing is done with a sliding context window of
a predefined length, usually 8-10 words, moving from term to term; each time, the
neighbouring terms are noted and their weights (discounted by distance to the centre
of the window) added to the co-occurrence matrix.
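As an illustration, a symmetric variant of this accumulation can be sketched as follows (a sketch; the linear distance discount and the default window length are illustrative choices, and HAL proper records preceding and following context in separate row and column roles rather than symmetrically):

```python
from collections import defaultdict

def hal_cooccurrence(tokens, window=8):
    """Accumulate distance-discounted co-occurrence weights with a sliding window."""
    cooc = defaultdict(float)
    for i, centre in enumerate(tokens):
        for d in range(1, window + 1):          # look back up to `window` positions
            j = i - d
            if j < 0:
                break
            weight = (window - d + 1) / window  # discount by distance to the centre
            cooc[(centre, tokens[j])] += weight
            cooc[(tokens[j], centre)] += weight  # symmetric variant of the accumulation
    return cooc
```

Adjacent terms receive the full weight, while terms at the far edge of the window contribute only a fraction of it.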
      t1    t2    t3    t4    t5
t1     –   0.99  0.02  0.02  0.02
t2   0.99    –   0.05  0.09  0.01
t3   0.02  0.05    –   0.33  0.19
t4   0.02  0.09  0.33    –   0.88
t5   0.01  0.01  0.19  0.88    –
Table 4: Term co-occurrence matrix
Terms co-occurring frequently will have similar vectors and thus be close in the
resulting space. The proximity of the vectors in the naïve implementation also
depends on similar frequencies, which is often not required or desired. To counter
this effect, we can normalize vectors to unit length. A variant of the model called
Wordspace (Sahlgren, 2006; Schütze, 1998) does not use a square matrix. It uses the
row vectors as term representations and chooses only the most content-bearing terms
for the columns using a weight, e.g., DF or IDF. Furthermore, a gap (Table 5) may be
used (Takayama, Flournoy, Kaufmann, & Peters, 1999) to remove terms that are
either too frequent or too discriminating (Gerard Salton, C. S. Yang, et al., 1975). In
the former case, the terms would not carry much discriminating value; in the latter,
they would not utilize the columns optimally, being very sparse with little co-
occurrence relevance.
gap=2
      t1    t2    t3    t4    t5
t1     –   0.99  0.02  0.02  0.02
t2   0.99    –   0.05  0.09  0.01
t3   0.02  0.05    –   0.33  0.19
t4   0.02  0.09  0.33    –   0.88
t5   0.01  0.01  0.19  0.88    –
Table 5: Term co-occurrence matrix with gap
The word context matrix allows for a fine definition of proximity through the length
of the sliding window. It also makes it possible to parse a corpus with few or even
only one (large) document into a meaningful representation. Documents and queries
map into the space through a combination, i.e., summation, of their indexed terms.
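The mapping by summation is straightforward (a sketch; the two-dimensional toy term vectors in the usage below are illustrative):

```python
def document_vector(tokens, term_vectors):
    """Map a document or query into the space by summing its indexed term vectors."""
    dims = len(next(iter(term_vectors.values())))
    total = [0.0] * dims
    for token in tokens:
        vec = term_vectors.get(token)
        if vec is None:      # terms without an index entry are skipped
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total
```

For example, a query over the vectors {"web": [1, 0], "service": [0, 1]} maps to [1.0, 1.0], and unindexed terms contribute nothing.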
Similarity
Lund & Burgess (1996) proposed to use the Minkowski distance (Equation 7) in
general, and the Euclidean distance (Equation 8) in particular, to measure the
geometric distance between the two points represented by the vectors u and v. A
simple vector length normalization of the original matrix beforehand removes the
document length-weighting problem. Their work provides evidence that the automatic
first-order co-occurrence analysis provides a good approximation of the semantic
relatedness of words, including similar words like "street" and "road" that do not
occur with each other but in comparable circumstances.
\[ d_r(\vec{u}, \vec{v}) = \left( \sum_{i=1}^{k} |u_i - v_i|^r \right)^{1/r} \]
Equation 7: Minkowski distance
\[ d(\vec{u}, \vec{v}) = \sqrt{ \sum_{i=1}^{k} (u_i - v_i)^2 } \]
Equation 8: Euclidean distance
Another similarity measure is the cosine of the angle between two vectors (Equation
9), i.e., the normalized dot product of the two vectors. It has several advantages over
the plain scalar product because of its normalization. The measure is contained
between 0 and 1; identical vectors measure 1 and orthogonal vectors that do not
co-occur 0; document length is naturally disregarded since only the angle is used.
\[ \cos(\vec{u}, \vec{v}) = \frac{ \sum_{i=1}^{k} u_i v_i }{ \sqrt{\sum_{i=1}^{k} u_i^2} \; \sqrt{\sum_{i=1}^{k} v_i^2} } \]
Equation 9: Cosine similarity measure
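The three measures translate directly into code (a sketch over plain Python lists; r is the Minkowski order):

```python
import math

def minkowski(u, v, r=2):
    """Equation 7: Minkowski distance of order r between two points."""
    return sum(abs(a - b) ** r for a, b in zip(u, v)) ** (1 / r)

def euclidean(u, v):
    """Equation 8: the Minkowski distance with r = 2."""
    return minkowski(u, v, r=2)

def cosine(u, v):
    """Equation 9: normalized dot product; 1 for identical directions, 0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Note that cosine([1, 0], [2, 0]) and cosine([1, 0], [1, 0]) both yield 1, illustrating that vector length, and hence document length, is disregarded.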
2.2.8 Summary
The review of the field of Information Retrieval, which is mainly concerned with
identifying information relevant to an information need, reinforced our intuition to
reframe the SD challenge as an IR task (see 1.2.3). The taxonomy of modern IR
identifies the classical models as the ones most related to our task of searching
unstructured text. Within that area, we considered the three main models, namely the
Boolean, Probabilistic and Vector models.
The Boolean model is the oldest and most basic. It is limited: while its absolute
reasoning about relevance is desirable, it comes at the cost of shifting the semantic
reasoning to the searcher, and at the same time it simplifies and restricts the
expressiveness of a search. It is not common anymore since it does not allow for
ranking the relevance of results and regularly returns too many or too few results.
The probabilistic model is based on the assumption that there is an optimal answer to
the information need of a searcher, which it guesses and, with the help of the
searcher, increasingly approximates. This compares well with our SD challenge and
in fact with many information-searching tasks today. The introduction of algorithmic
improvements, particularly BM25, which provides useful rankings of results even
without consumer feedback, has made it a viable IR model that is widely used and
considered state-of-the-art. It does not, however, address the semantic gap we
identified as an essential problem for the discovery process. Since it is widely used
and available in mature IR systems, we consider it an excellent choice as a baseline
system against which to compare alternative systems' performance.
Lastly, the general vector model presents an interesting solution to the SD challenge.
Firstly, its performance on general corpora is excellent (Baeza-Yates & Ribeiro-Neto,
2011). Secondly, the intrinsic fuzziness of the ranking and similarity measures,
combined with the relatedness of comparable terms even when not co-occurring, is
intriguing. This may address the semantic gap between a searcher's query, reflecting
her information need imprecisely, and service information expressed in disparate
terminology. This prompts us to investigate the vector model further in the
following section.
2.3 Semantic Spaces
A Semantic Space or SS (Lowe, 2001; Turney & Pantel, 2010) is the general term for
a vector space model as found in natural language processing and Information
Retrieval (see 2.2.7) and stems from the distributional hypothesis (Firth, 1957;
Harris, 1954; Weaver, 1955; Wittgenstein, 1953). The hypothesis, put simply, states
that the meaning of a term derives from its co-occurrence with other terms.
We propose that humans will be important service consumers and searchers. They
will query the SES, which poses the questions of how to bridge the semantic gap and
how to enable the searcher to obtain services of whose need she is ignorant at the
start of the search. We propose in this section that (advanced) Semantic Spaces
mimic human conceptual reasoning and can help to answer both questions.
We will first introduce conceptual space theory and then explain how Semantic
Spaces relate to it. The aim is to establish that we can guide a searcher to
meaningful and relevant services despite terminological differences between a query
and service-related information. We will further propose that this process allows the
searcher to attain information presumptively through conjecture or informed
guessing, based on her knowledge of her agenda and the relevant selection of services
or service information presented in response to a query (P. Bruza et al., 2009).
Overall, this process fits within the ambit of exploratory search. There have been
attempts in this direction (Bose et al., 2008; Dong et al., 2004; Peng, 2007; Sajjanhar
et al., 2004; Stroulia & Wang, 2005; Studholme et al., 1999; Wang & Stroulia, 2003).
The success of such a SD system depends on both the semantic wealth in the corpus
and its ability to imitate human conceptualization of it. We will show that the latter
can be attained through a Semantic Space. The former is difficult with current service
descriptions. There is some semantic content in the UDDI description and inside
WSDL files in comments, optional descriptions and naming conventions, which have
utility in SD (Zhuang, Mitra, & Jaiswal, 2005). This content, provided by technical
developers, emphasises the functional and technical aspects of the services. Richer
information exists in secondary service-related documents like reviews, descriptions,
advertisements and documentation. We propose to utilize this secondary service
information corpus, the Service Information Shadow, to enable SD on a conceptual
level.
2.3.1 Conceptual Space
Gärdenfors (2004) suggests a three-level representation of cognition (Figure 11). The
most abstract is the symbolic level, followed by the conceptual and then the
connectionist level. The symbolic level uses symbols and grammar to express
information. Keyword- and ontology-based systems operate on this level. Deduction
based on this level, as used by SWS, is highly abstract and specific, with precise and
strict inference. It requires great effort for humans to express and comprehend
information on this level, but it enables them to transfer complex ideas between
individuals. An important help with this is context, e.g., when someone refers to 'the
chair', a conversation, text, or senses like vision, locality or gesturing establish the
context to identify the instance of chair meant. At the other end of cognition, the
lowest level, connectionism reflects biological processes in a neural network. It
processes and stores information in a connectionist representation which can be
simulated by artificial neural networks.
Figure 11: Three levels of cognition
In between these two extremes lies the conceptual level. Within the conceptual level,
knowledge has a geometrical structure. For example, three dimensions (hue,
chromaticity and brightness) can represent the properties of colour. Gärdenfors (2004)
argues that a property is like a convex region in a geometric space. In terms of the
example, the property red is a convex region within the tri-dimensional space made
up of hue, chromaticity and brightness. The property blue would occupy a different
region of this space. A domain is a set of integral dimensions in the sense that a value
in one dimension determines or affects the values in the other dimensions. For
example, the three dimensions defining the colour space are integral since the
brightness of a colour will affect both its saturation (chromaticity) and hue.
Gärdenfors extends the notion of properties into concepts based on domains. The
concept apple may have the domains taste, shape, colour, etc. Context is modelled as
a weighting function on the domains; for example, when eating an apple, the taste
domain will be prominent, but when playing with it, the shape domain (i.e., its
roundness) will be heavily weighted. Observe the distinction between
representations at the symbolic and conceptual levels. At the symbolic level, apple
can be represented as the atomic proposition apple(x); within a conceptual space
(conceptual level), however, it has a representation involving multiple inter-related
dimensions and domains. Colloquially speaking, the token apple (symbolic level) is
the tip of an iceberg with a rich underlying representation at the conceptual level.
Gärdenfors points out that the symbolic and conceptual representations of
information are not in conflict with each other, but are “different perspectives on
how information is described”.
If a discovery system is able to mimic a conceptual space for service-related
information and map a consumer's need into it, then reasoning based on proximity
can achieve discovery based on conceptual relatedness rather than deductive
reasoning. Furthermore, approximating concepts in the space can guide a consumer
meaningfully even with a vague understanding of her need. Semantic Spaces provide
models that bridge from the symbolic to the conceptual, generating geometric
representations grounded in cognitive science. "[S]emantics is a relation between
linguistic expressions and a cognitive structure" (Gärdenfors, 2004, p. 159). This
thesis will use Semantic Spaces as the basic computational model to drive effective
service discovery in the SES.
2.3.2 Singular Value Decomposition
A common problem with word-context and word-document matrices is their size,
sparseness and noisiness. Vector models are good at handling synonymy, but an
increasing index and matrix size can introduce noise that results in increasing
ambiguity in the form of weak polysemy. A corpus can easily exceed millions of
words, resulting in matrices of tens of thousands to hundreds of thousands of rows
and columns (Lund & Burgess, 1996), with most cells empty. Latent Semantic
Analysis or LSA (Deerwester et al., 1990) applied a Singular Value Decomposition
(Golub & van Loan, 1996) to a word-document matrix to address these issues by
computing latent semantic factors and removing noise.
Figure 12: Singular Value Decomposition in Latent Semantic Analysis
Assuming M is the word-document matrix of rank m with w word rows and d
document columns, then an SVD of M results in a left singular matrix U, a square
diagonal matrix S and a right singular matrix V (Figure 12 and Equation 10). The
singular matrices have an orthonormal (column) basis of size m. U has w rows and V
has d rows. S contains only non-zero values along its diagonal. The multiplication of
U with S and the transpose of V reproduces M. One characteristic of the
decomposition is the ordering of U's and V's columns, as well as S's values, in
decreasing importance to the error of the re-composition of M. For example, let k be
m-1 and the rank of M greater than k. Then remove (or set to zero) the last column
and value of U, V and S, calling them Uk, Vk and Sk (Figure 12 and Equation 11). If
we then attempt to re-compose M, we create a least-error approximation, M*, of rank
k. This lossy compression of the matrix content not only removes noise but also
amplifies significant and higher-order relationships (Deerwester et al., 1990;
Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998; Schütze, 1998). In
short, it leads to improved and effective term representations.
\[ M = U S V^{T} \]
Equation 10: SVD
\[ M^{*} = U_k S_k V_k^{T} \]
Equation 11: Truncated SVD
The second characteristic of the SVD is that the dot product (also used in the cosine
measure) between rows or columns of M is equivalent to the dot product between the
rows of U·S or V·S respectively (or of Uk·Sk and Vk·Sk for M*). This is a result of S
being diagonal and the columns in U and V forming an orthonormal basis. It in turn
allows calculation of the dot products between rows without V and between columns
without U (Deerwester et al., 1990).
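The truncation and the dot-product equivalence can be illustrated with NumPy (a sketch; the toy word-document matrix is purely illustrative):

```python
import numpy as np

# Toy word-document matrix M: 4 word rows, 3 document columns.
M = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [1., 0., 2.]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U * diag(s) * V^T
assert np.allclose(U @ np.diag(s) @ Vt, M)        # full re-composition reproduces M

k = 2                                             # keep only the k largest singular values
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
M_star = Uk @ np.diag(sk) @ Vtk                   # least-error rank-k approximation M*

# Dot products between rows of M* equal those between rows of Uk * Sk,
# so term (row) similarities can be computed without V.
rows = Uk @ np.diag(sk)
assert np.allclose(M_star @ M_star.T, rows @ rows.T)
```

NumPy returns the singular values already sorted in decreasing order, so the truncation is a simple slice.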
A variation of LSA is the Wordspace model (Schütze, 1998; Takayama et al., 1999),
which uses a modified HAL word co-occurrence matrix and SVD-based dimensional
reduction. Its use of the SVD differs by not employing the S values when
reconstructing the row relationships and relying only on U. The motivation may
relate to the idea of using a symmetric co-occurrence matrix (Schütze & Pedersen,
1997; chapter 2.1), although it uses rectangular matrices with only content-bearing
columns instead of the sparse and computationally expensive square ones. The
Infomap software from Stanford University (Takayama et al., 1999; Widdows, 2003)
is a direct implementation of the model. Its source code documentation shows that it
used the S values but removed them in favour of only using U in 2001, claiming that
the S values contain no significant information, without a detailed reference or clear
explanation. This raises the question of whether Uk·Sk, Uk or perhaps a variation
thereof is the optimal representation for the row relationships.
37 See http://infomap-nlp.sourceforge.net/ for more.
38 See the source code (http://infomap-nlp.sourceforge.net/), version 0.8.6, file encode_wordvec.c,
lines 201-206.
Consequently, to measure relatedness by cosine similarity, the smaller k-reduced
matrices can be employed, decreasing memory and computing requirements, because
an optimal k is k ≪ m in most cases. Removing the less information-bearing columns
amplifies the similarity between word/document vectors, even of higher-degree co-
occurrence, because the small differentiations between them are the least important
information (Landauer & Dumais, 1997), and it can even rectify outliers (Landauer et
al., 1998). The relatedness of words and documents in the resulting matrices is
strikingly similar to human cognition and "[i]t is hard to imagine that LSA could
have simulated the impressive range of meaning-based human cognitive phenomena
that it has unless it is doing something analogous to what humans do" (Landauer et
al., 1998). This conclusion is further supported by experiments introducing word
meaning negation (Widdows, 2003), which align well with the conceptual level,
where adding or removing a qualitative dimension gives or removes context and
meaning. Overall, SVD has been shown to improve semantic representation,
justifying its adoption in our model despite its computational cost.
2.3.3 Structured Link Vector Model
The proposed secondary Service Information Shadow has no predefined structure. It
will consist of web documents and as such will contain links between the documents
in the form of Uniform Resource Locators (URLs). This information can be utilized
as the sole basis for a vector space model (Milne, 2007) or, in semi-structured
environments, extend the model to a Structured Link Vector Model or SLVM
(Jianwu & Xiaoou, 2002). Jianwu and Xiaoou mapped the node structure and the in-
and out-links in an XML document into a Vector Space Model utilising TF-IDF.
They achieved improvements in an exemplary k-means clustering task. In the next
step, the addition of Latent Semantic Indexing further enhanced the results (J. Yang,
Cheung, & Chen, 2005). These results encourage us to add a similar extension to a
Semantic Space model for the Service Ecosystem. Unfortunately, the SLVM, with its
XML/schema-oriented requirements, is too strict for our loose corpus. We envisage a
more flexible, simpler and similarly effective model extending the Semantic Space
with the most basic structural information, the outwards-directed (hyper-)link (see
chapter 3.4.2).
2.4 Cluster Analysis
The parallels between Conceptual and Semantic Spaces encourage categories
modelled on conceptual space theory. The task of generating categories involves
partitioning empirical data, which is an established topic in machine learning.
Machine learning usually solves such a task with supervised or unsupervised
learning. The former requires a-priori knowledge about the categories and exemplary
data with known desired outcomes to optimize a (learning) algorithm to classify
future data. The alternative, unsupervised learning, proposes to identify a latent or
unknown structure in the data. The emerging properties of the Service Ecosystem
align with the unsupervised learning method, and we focus on it in this section.
A common and successful method in unsupervised learning is cluster analysis, or
clustering, of data. It identifies commonalities and structures in data sets without
explicit a-priori knowledge of the emerging structure. However, we do make implicit
assumptions about the data when we choose a cluster analysis algorithm and its
parameters. In this section we motivate and review cluster analysis and identify how
we might choose an algorithm that benefits from the Semantic Space properties.
2.4.1 Intuition
We identify the necessity for cluster analysis before we describe it in more detail.
The abstraction of data (complexity) for human consumption and comprehension is
motivated by the common human behaviour of identifying shared qualities in patterns
(Gärdenfors, 2004). We do it to comprehend and communicate about our experiences
more easily. Examples of successful and very explicit data abstractions are
classifications and taxonomies, the latter of which originates in biology and the study
of species. The word taxonomy is rooted in the Greek words taxis "arrangement" and
nomia "method"39. Such an explicit organisation of patterns is hard when the
qualities are not obvious, discrete and emerging, or when the full set of data is
unknown ahead of time, as in the case of the SES. In such situations, humans are able
to do an ad-hoc classification based on shared features. These features ideally occur
more frequently than by chance, yet are infrequent and relevant enough to the context
to abstract the problem meaningfully. The transparent process of organising our
experiences has given rise to the strict classifications and taxonomies we use and
share today. We propose to use unsupervised learning to identify clusters of data
patterns sharing properties, giving rise to groups of similar patterns in a Semantic
Space, to mimic humans' natural ability to organise and comprehend experiences.
Since the intuition of the Semantic Space is based on Conceptual Space theory, we
also put forward the proposition that a similarly geometrically oriented cluster
analysis is worth investigating.
39 See http://www.etymonline.com/index.php?term=taxonomy for details.
2.4.2 Application
Grouping patterns by commonalities allows humans and computers to reason about
and process patterns in a highly efficient way. Humans are able to identify relevant
features easily but are unable to process huge numbers of patterns. Machines, on the
other hand, can process considerable numbers of patterns easily but often fail to
identify the most relevant features automatically.
For example, if we confront a person with a small number of services and their
descriptions, she will be able to identify the most generic/prototypical aspects and
organize them accordingly. If we raise the number of services dramatically, the time
needed and the complexity of the task increase to a level where a person would have
to sample the data and make a best guess, because it becomes either infeasible or
impossible to process the data. Now if we assume the data size to be several orders of
magnitude above easy human processing capacity, then it becomes clear that the
sample a person could take would not allow any conclusions to be drawn about the
data. A cluster analysis, while inferior in its feature selection and grouping abilities,
can process all services and identify clusters of shared properties, returning a much
better and more complete view of the data. This view can be utilized by a human to
browse and search for the most relevant cluster, providing her with a list of services,
the cluster members, which are more similar to each other than to others in the data
set. This in turn can be analysed by the human for further action. To be effective, the
structure of the clusters and how they divide a feature space should ideally mimic a
human-made partitioning of the data. In this way, cluster analysis can effectively
reduce the problem of large data sets, in a human-like way, to a humanly
comprehensible complexity.
Besides improving human comprehension of large data sets, such data abstraction,
e.g., in the form of cluster centroids, also benefits computation. We do not have to
compare a query to the whole data set but only to the cluster representatives to
identify the items most relevant to the query, which reduces the computational
expense by approximating the query neighbourhood (G. Salton, 1991). Traditional
search can utilize this (find relevant clusters first, then compare cluster members with
the query and return a ranked list), as can a cluster identification task (find the
relevant cluster(s) and return a ranked list of members according to cluster
relatedness). In today's highly distributed computational environments, this is one
possible strategy to distribute computational and memory load between processing
nodes to scale data processing in a near-linear fashion. Different cluster analysis
methods lend themselves to different optimizations and distribution architectures.
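The two-stage strategy can be sketched as follows (a sketch; the cluster dictionaries, the pluggable `similarity` parameter and the toy vectors are illustrative assumptions, not a prescribed data model):

```python
def cluster_search(query_vec, clusters, similarity, top_n=5):
    """Two-stage search: pick the most relevant cluster by its centroid,
    then rank only that cluster's members against the query."""
    best = max(clusters, key=lambda c: similarity(query_vec, c["centroid"]))
    ranked = sorted(best["members"],
                    key=lambda m: similarity(query_vec, m["vector"]),
                    reverse=True)
    return ranked[:top_n]

def dot(u, v):
    """Unnormalized dot-product similarity, sufficient for the illustration."""
    return sum(a * b for a, b in zip(u, v))
```

The query is compared against as many vectors as there are clusters plus the members of one cluster, rather than against the whole data set.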
2.4.3 The steps of cluster analysis
Clustering or cluster analysis is a mature field that is applied in many data-driven
domains like data mining, information retrieval, image and signal processing, and
genetics, to name only a few. Consequently, a complete review is outside the scope of
this thesis and we refer the reader to comprehensive works (Gan, Ma, & Wu, 2007;
Anil K. Jain & Dubes, 1988; Theodoridis & Koutroumbas, 2006; Xu & Wunsch,
2008). We review the major traditional aspects of clustering that are relevant for the
presented problem. Within this context, we will choose and propose an alternative
clustering analysis for the SES task based on the SS properties and the CS theory
intuition.
Cluster analysis divides into five steps (Anil K Jain & Dubes, 1988):
1. Pattern representation
2. Definition of a similarity measure
3. Process of clustering
4. Cluster representation (optional)
5. Cluster validation (optional)
The first step requires the data to be patterns containing a feature selection relevant
for the clustering. Consequently, an additional phase of processing, feature extraction
and feature selection, e.g., co-occurrence frequencies in text, may be included in or
precede the first step. A common feature processing is matrix factorization, to
convert discrete and sparse features into an abstract, compressed feature space
revealing latent relationships.
The feature spaces are commonly a (high-dimensional) space or a graph, in which
case the definition of similarity (step 2) is usually a measure of proximity. Examples
of similarity measures are the Euclidean distance between two patterns, the cosine of
the angle between two pattern vector representations, or the number of edges in a
graph. The choice of pattern representation and similarity measure requires or
assumes a certain knowledge or intuition about the availability and importance of
features. It is a key step affecting their relationship. The similarity measure,
consequently, has a strong influence on the outcome of the further processing. The
measure is ideally informed by what constitutes similarity in the data set and how
this translates into a measure in the feature space (A. K. Jain, Murty, & Flynn, 1999;
Anil K. Jain & Dubes, 1988).
The clustering stage itself (step 3) proceeds in one of several ways. It generally
attempts to minimize an error (dissimilarity) or conversely optimize a local and/or
global measure of similarity. We investigate this core step in more detail in the next
section (see 2.4.4).
The data abstraction (step 4) provides a view on the data to enhance human
comprehension and/or computation. The choice of cluster representation depends on
the cluster analysis objective, the cluster shapes and the clustering algorithm. For
example, a hyper-spherical cluster in a high-dimensional space is easily (and well)
represented by a centroid (a combination of cluster members) or possibly a medoid
(a representative cluster member). Alternative shapes, e.g., elliptical or irregular
shapes, may require different representations, e.g., outer or distant points of a cluster.
Lastly, depending on the feature space, alternative representations in the form of
conjunctive statements or positions in a classification tree may be useful (A. K. Jain
et al., 1999).
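For a hyper-spherical cluster, both representatives are easy to compute (a sketch; plain coordinate lists as members and Euclidean distance for the medoid are illustrative choices):

```python
import math

def centroid(members):
    """Mean of the members: a combination, not necessarily a member itself."""
    dims = len(members[0])
    return [sum(m[d] for m in members) / len(members) for d in range(dims)]

def medoid(members):
    """The actual member with the minimal summed distance to all others."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(members, key=lambda m: sum(dist(m, o) for o in members))
```

Note that the centroid may lie between members, while the medoid is always one of them.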
The clustering ends with a validation of the outcome (step 5). This is challenging
since there is no absolute truth or optimal clustering, or at least we have no access to
that information; otherwise, we would have used it in the clustering. We can develop
training and evaluation sets to test and select our algorithms. Different problems
require different solutions, and the training/evaluation of algorithms indicates their
performance in a real-world setting only as far as it aligns with it. The outcome
depends on the data and the algorithm, but the quality/value relies on human
judgement wherever humans are involved as consumers of the outcome or a
derivation of it. The judgement itself depends on context and can suffer from bias. A
complete mode of evaluation to balance the various issues is advisable (A. K. Jain et
al., 1999).
Firstly, we can compare the cluster analysis result to an optimal solution, which in
our case should be 'human-made', possibly by persons who are knowledgeable in the
application domain and/or would be potential consumers of the result. Since human
evaluation is subjective and contextual, approximating the real-world scenario well is
essential to the evaluation's credibility. Secondly, we can investigate the outcome
critically and argue the validity of the computed partitioning. This should not be the
sole basis of evaluation, since it is prone to bias and subjectivity on the part of the
person evaluating the clusters. Nevertheless, an informed and critical evaluation can
provide context to the previously mentioned quantitative approach. Lastly, we have
the option to compare two clustering outcomes algorithmically, e.g., based on their
information-theoretic distance (Vinh, Epps, & Bailey, 2009). A combination of this
and the first evaluation method, a domain expert review, could be a comparison of
cluster analysis results and a domain expert's partitioning of a data set, to identify
which cluster analysis mimics human judgement best. The algorithmic comparison of
two computed cluster analysis outcomes provides no insight in our context.
2.4.4 The three core clustering processes
A basic taxonomy of clustering approaches (A. K. Jain et al., 1999; p. 275, fig. 7)
divides clustering algorithms, based on the type of partitioning they achieve, into
hierarchical and partitional at the top level. Hierarchical clustering produces clusters
organised in a tree, a connected acyclic simple graph. Each cluster is a parent and/or
child of another cluster (presuming there are at least two clusters). Partitional
clustering does not provide any links between the clusters. The result is a collection
of clusters, each being a collection of patterns. The clusters are either exclusive,
where a pattern belongs to only one cluster, or overlapping, where a pattern belongs
to any number of clusters, possibly with varying degrees.
Hierarchical and partitional clustering are outcomes of a wide range of clustering
algorithms, which consist of a combination of clustering processes and similarity
measures. We focus our review on generalizable characteristics of clustering
algorithms, the processes and the similarity measures. This allows us to position our
own clustering approach accordingly.
The three common clustering processes (step 3 in section 2.4.3) are agglomerative,
divisive/bisecting and expectation maximisation (Chidananda Gowda & Krishna,
1978; Dempster, Laird, & Rubin, 1977). Expectation maximisation (EM), like other
clustering approaches, presumes hidden or latent information in a data set and tries to
uncover it. The EM algorithm differs from other approaches by iterating two steps,
expectation and maximisation. In the first step, a clustering solution is
proposed/guessed (the 'expectation' of the model's parameters), and in the second
step, the algorithm approximates the solution. It incrementally improves the solution
('maximizing' the log-likelihood of the data) by using the previous iteration's output
as the next iteration's input.
A frequently used, simple and successful implementation of the EM algorithm is the
k-means algorithm (Manning, Raghavan, & Schütze, 2008). It receives as a
parameter the number k of desired clusters and attempts to find the optimal position
of these clusters. K-means guesses the cluster centroids/positions in the initial step
(E-step) and then computes optimised centroids from the attributed cluster members
(M-step). It iterates these two steps (using the centroids from the M-step for the next
E-step) until it achieves no improvement, falls below a minimum delta in change or
exhausts a maximum number of iterations. Despite its simplicity, k-means and EM
have proven themselves to be good and fast cluster analysis algorithms for a wide range of
problems including text-based Semantic Spaces (Manning et al., 2008). K-means has
some inherent properties that affect the outcome of the clustering. K-means
clusters tend to be similar in extent and (hyper-)spherical in shape; the algorithm is
susceptible to local optima (outliers), requires knowledge about the data to choose k, and its
outcome depends heavily on the initial centroid seeds. Where these attributes are
undesired, we can often counterbalance them by selecting alternative similarity measures,
seeding centroids cleverly and/or pairing k-means with additional algorithms. Selecting the
optimal number of clusters is a challenging problem with various solutions, ranging
from informed guessing to identifying diminishing variance change with increasing k
(also known as the elbow method) to using information theory, to name only some.
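The E/M iteration described above can be sketched as follows; this is a minimal, self-contained Python illustration using Euclidean distance, and the function name and the simple stopping criteria are our own, not those of any particular library.

```python
import random

def kmeans(points, k, max_iters=100, tol=1e-9, seed=0):
    """Minimal k-means sketch: the E-step assigns points to the nearest
    centroid, the M-step recomputes centroids as cluster means; it stops
    on convergence (minimum delta in change) or after max_iters."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)  # initial centroid 'guess'
    for _ in range(max_iters):
        # E-step: attribute each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # M-step: recompute each centroid as the mean of its members
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        shift = sum(sum((a - b) ** 2 for a, b in zip(c0, c1))
                    for c0, c1 in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # no further improvement
            break
    return centroids, clusters
```

On well-separated data the two-step loop converges quickly, but as noted above the result depends on the initial seeds.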
Agglomerative and divisive algorithms can be thought of as bottom-up and top-down
algorithms, respectively. Agglomerative algorithms usually consider all patterns as single
clusters with one member each. They iteratively merge the clusters according to the
algorithm's distinctive similarity measure and continue until they converge or reach a
limit of iterations. A hierarchical cluster structure is available from agglomerative
algorithms when we retain the merging steps as a tree hierarchy. The agglomerative
algorithms' most noticeable drawback is their performance. Let n be the number of
patterns. The minimal time complexity of O(n² log n) and space complexity of O(n²) are
significantly higher than those of k-means, for example, which requires O(n·k·l) time,
with k being the number of clusters and l the number of iterations, and O(k+n) space.
The benefit of agglomerative algorithms is their versatility in choosing different measures
for merging clusters. The most prominent are single and complete link (A K Jain et al.,
1999; Manning et al., 2008), but there are many alternatives besides these two (Zhao
& George Karypis, 2004). Single link uses the distance between the two closest
patterns in two clusters to measure inter-cluster similarity. This 'grows' clusters
along paths, allowing it to identify irregular cluster shapes. Complete link uses the
distance between the two farthest patterns of two clusters, resulting in clusters that are
more compact, which is commonly desirable despite being less versatile (A K Jain et al.,
1999).
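The merging process with interchangeable single- and complete-link criteria can be sketched naively as below; this illustration favours clarity over the complexity bounds discussed above, and all names are our own.

```python
def agglomerate(points, target_clusters, linkage="single"):
    """Naive agglomerative clustering sketch: start with singleton clusters
    and repeatedly merge the closest pair under the chosen linkage."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        pair_dists = [dist(p, q) for p in c1 for q in c2]
        # single link: distance of the two closest patterns;
        # complete link: distance of the two farthest patterns
        return min(pair_dists) if linkage == "single" else max(pair_dists)

    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # find the pair of clusters with minimal inter-cluster distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Recording each merge instead of discarding it would yield the tree hierarchy mentioned above.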
Divisive algorithms proceed from the opposite direction. They consider the whole
pattern set a single cluster and divide it, trying to achieve high dissimilarity
between the new clusters and high similarity within them (inter- and intra-
cluster measures). The process continues until it converges, achieves the desired
number of clusters and/or reaches a limit of iterations. They too can create
hierarchical clusters if we consider the dividing steps as branching out in a tree.
Divisive (also known as partitional) algorithms are generally faster than
agglomerative ones (Cutting, Karger, Pedersen, & Tukey, 1992; Larsen & Aone,
1999; Steinbach, G Karypis, & Kumar, 2000).
An additional view on the clustering process can be taken when we separate the
similarity measures in the form of criterion functions, since they can be used in either
agglomerative or divisive approaches (Zhao & George Karypis, 2004). A software
implementation of common criterion functions and clustering methods is the CLUTO
software40. We discuss the functions made available in CLUTO in Appendix C.
2.4.5 Clustering and Semantic Spaces
While we have a wide choice of clustering algorithms available (Zhao & George
Karypis, 2002), we have to be mindful of what features we cluster. An early attempt
at clustering a SS was a k-means based word clustering of an LSA-generated space
(Bellegarda, Butzberger, Coccaro, & Naik, 1996), investigating it as a complement to
the word classification of the time. The resulting clusters confirmed a semantic
association between clusters of words with close vector representations. Their results
further indicated that words of the same root potentially have enough polysemy to
justify placing them in different clusters. Bellegarda (2000) extended this work with
semantic inference for automatic speech recognition. It describes the clustering of
documents representing consumer actions as the training of a SS. The cluster centres
classify future actions by attributing their textual representations, as document
vectors, to the closest document cluster centre. Semantic inference removes formal
semantic representations by relating co-occurrences through the SS model, allowing
flexible consumer input. Cao, Song, & P. Bruza (2004) used a fuzzy k-means
clustered HAL space to evaluate an automatic organization of a SS motivated by
conceptual space theory. Their results provided further evidence that vectors
representing words with similar meanings clump in the space. They also investigated
the polysemy of words by allowing overlapping clusters and found some words can
belong to more than one cluster41 when they share meaning between them.
This aligns with semantic cores and attributing new instances to prototypes. The idea of
prototypes, or prototype theory (Johnson, 1982), proposes that out of a set of
patterns/data/experiences some are more central and representative of a group that
shares certain aspects. For example, a wooden chair with four legs and a back would
be prototypical of chairs, at the heart of the category, while a three-legged stool
without a back would be more peripheral.
40 See http://glaros.dtc.umn.edu/gkhome/views/cluto for more details.
41 Using Reuters data, their example was Reagan, which appeared in a cluster relating to the Iran-Contra affair and another relating to the U.S.A. Presidency.
Canopy clustering (McCallum, Nigam, & Ungar, 2000) is an interesting
cluster analysis technique to identify cluster cores, or canopies. It intends to
reduce the computational expense of cluster algorithms, identify the number of potential
clusters and remove outlier problems. It is a pre-clustering step for more expensive
cluster analyses. It does this by using two distance thresholds t1 and t2 with t1 > t2.
It begins by adding all patterns to a list of candidate canopy centres. It then randomly
selects a pattern from this list and merges all points within t1 into a canopy around it.
Any pattern within the tighter distance t2 of the centre is removed from the list of
candidate canopy centres. This process iterates until no more canopy centres are
available in the list. In a further post-processing step, close canopies may be merged.
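The loop just described can be sketched as follows; this is an illustration following the description in McCallum et al. (2000), and the function name, the random selection policy and the distance callable are our own.

```python
import random

def canopies(points, t1, t2, dist, seed=0):
    """Canopy clustering sketch (after McCallum et al., 2000), with t1 > t2:
    points within the loose threshold t1 join a canopy; candidates within
    the tight threshold t2 of a centre are removed from the candidate list."""
    assert t1 > t2
    rnd = random.Random(seed)
    candidates = list(points)
    result = []
    while candidates:
        centre = candidates[rnd.randrange(len(candidates))]
        canopy = [p for p in points if dist(centre, p) < t1]
        # drop every candidate within the tight threshold of this centre
        candidates = [p for p in candidates if dist(centre, p) >= t2]
        result.append((centre, canopy))
    return result
```

Because t1 is looser than t2, a point may fall into several canopies, which is the overlapping, semi-exclusive behaviour discussed in the next section.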
2.4.6 Semantic Category Analysis
We propose to extend canopy clustering based on the prototype intuition from
conceptual space theory. Canopy clustering is a rough, locally optimised, greedy
agglomerative cluster analysis. If we call the areas around the semantic cores
semantic categories, we can see a similarity to canopy clustering. Semantic
categories require a full cluster analysis to establish, since they
should combine local and global measures for intra- and inter-cluster evaluation. Our
intuition is that semantic categories form around dense clusters equivalent to a
prototypical core (intra-cluster measure), but we have to evaluate them in a wider
context (inter-cluster). Unlike canopy clustering, we require flexible evaluation
of the global and local aspects of the categories. We cannot expect all categories to
be homogeneous in their makeup and distribution; expecting this would conflict with the way
humans establish, contextualise, use and interpret categories. A cluster with many
patterns may occupy, to a certain extent, a larger part of the space than a smaller cluster,
effectively using a measure of density. We can relate such clusters to categories of different
breadth. There has to be a limit to this measure, of course, to prevent singular or
excessively large clusters.
On the inter-cluster level, we have to encourage dissimilar clusters and penalise
similar/nearby clusters. What constitutes proximity ideally depends on the local
cluster features instead of a rigid external setting, as in the case of canopy clusters.
Canopy cluster analysis, with its intent to quickly approximate cluster distributions,
does not provide these qualities. Furthermore, we presume that some noise and
imprecision is inherent to the process of feature selection and extraction, as well as
realistically occurring in real-world data sources. We therefore propose to use exclusive
but not complete clustering, effectively identifying meaningful semantic prototypes in
the data and attributing ambiguous patterns to the established prototypes. This again
differs from the overlapping, semi-exclusive method used by canopy clustering. We
introduce an agglomerative algorithm with local and global measures implementing
the discussed attributes in section 3.3.
2.5 Discussion
We have shown that the current SD methods are separable into two groups. The first
uses a small corpus of service information consisting of one or a combination of
functional information, short descriptions and community-sourced unstructured annotations. It
employs naive IR models. The searcher therefore has to have a good understanding
of the corpus she is searching and of the indexed keywords to be successful. These
models are computationally inexpensive, but a growing corpus and decreasing user
sophistication impinge significantly on their effectiveness.
The second group of SD methods is ontological, enforcing a formal predefined
vocabulary for service annotation, i.e., the SWS. The advantages of this model are
deductive reasoning and a well-defined terminology. This comes at the cost of
abstracting the described 'world' while inflicting a semantic burden on the searcher.
Furthermore, such a system is inflexible, since the established ontology cannot
change readily or adapt easily to reflect changes in the 'world'. Lastly, the
ontological method does not scale semantically, since its complexity becomes a
hurdle for searchers and reasoning with it is computationally expensive.
We therefore returned to the intuition from the introduction of the thesis of reframing
SD as an IR task. Indeed, traditional SD is little more than a simple IR
system where keywords are used in an inverted index over mostly functional information
about services. The IR domain considers the simple Boolean model ineffective. The
two alternative classical IR models applicable to unstructured text are the
probabilistic and vector models. They have shown good results in IR settings and we
review them later in this thesis.
We also introduced in the previous chapter that a searcher or service consumer in a
SES has an agenda from which service need(s) originate. At the same time, she is
unlikely to be knowledgeable about the SES and its services. Subsequently, the searcher
may poorly understand and express the service need since she knows little or nothing
of the service offerings. This led us to focus on the mode of IR that is highly
flexible in how it extracts and compares information from a query and the corpus.
We discussed how the vector model, and in particular the CS-inspired SS, describes
conceptual representation at a sub-symbolic level of cognition. At this level of
cognition, reasoning is not deductive but more associational or abductive in nature.
This in turn supports presumptive attainment of information, i.e., informed guessing
of related concepts as in concept abduction. This, however, faces one challenge. To
extract an effective SS we require a semantically rich natural text corpus beyond the
functional descriptions originating from electronic services and SOA.
Lastly, we tie the representation and scale of the data involved to the need to abstract
the data in a meaningful way. The SES will be a system with emergent properties,
and we represent the services in a high-dimensional space, a problem
commonly solved by machine learning. We identified unsupervised learning in the form
of cluster analysis as the ideal solution. The flexibility and great choice of clustering
approaches requires us to review several approaches in real-world experiments and
does not allow us to select a single solution for all situations. We do have some
intuitions about the Semantic Space, which we investigate in this work as a new cluster
analysis approach alongside a wide range of off-the-shelf solutions.
The next chapter will introduce a model for SS based service discovery, reviewing the
need for an expressive corpus, describing the details of SS generation, introducing
SS innovations, describing a Semantic Categorization algorithm, explaining
discovery in a SS and evaluating the model's software implementation by means of a
well-established synonym experiment.
3 Semantic Service Discovery Model
In this chapter we introduce a model for SS based service discovery, the Semantic
Service Discovery (SSD) model. The first section reviews the details of a suitable
corpus for the model. Afterwards, we introduce the Semantic Space model, followed
by a section detailing some innovations this work introduces to the SS. The
subsequent section details a Semantic Categorization algorithm inspired by
conceptual space theory. We then discuss the two modes of discovery. The last two
sections introduce the software prototype that implements the model and evaluate the
quality of conceptual representation by means of a known synonym experiment,
which assesses the quality of semantic representation in Semantic Space models.
3.1 Semantic Information Shadow
We introduced the term Service Information Shadow, or SIS, in chapter 1.2.3 in
conjunction with the reframing of the SD problem as an IR task. We discussed
in the previous chapter that the field of IR has established models for unstructured
text retrieval tasks. We further provided insight into the most promising model,
the Semantic Space, since it aligns with the particular problem of matching imprecise service
needs with (to the searcher) unknown services, as well as its potential to
deal with the vocabulary mismatch problem. The model requires a rich semantic
corpus written by and for humans to build a geometric representation of concepts.
Conventional electronic web services in the tradition of SOA have largely been
described in a functional way, focusing on how a service interacts rather than
what it does or which purpose it serves. Nevertheless, using the modest informal
semantic information from WSDL files in a SS has proven beneficial for
service matching (Bose et al., 2008). We propose to expand this semantic base with
secondary documents associable with services.
We know that human interaction with services leads to associated human-readable
information to advertise, describe, review, organize and discuss the services.
Community web sites and application markets reflect this. Let these documents be the
Service Information Shadow. Their content details services and relevant ancillary
information. Let us further assume that a document in the SIS directly points to a
service, e.g., by linking to its WSDL file. The document then acts as a proxy for the
concept(s) relating to the service. We can then reframe SD as an Information
Retrieval task, where a service need of a consumer fulfilled by one or several services
is equivalent to a query expressing the need and retrieving one or several documents
associated with the relevant service(s). Let SIS = {D1..Dx} with services S = {S1..Sy}.
For example, service S1 links to {D2, D4, D9}. S1 satisfying service need SN1 expressed
as query Q1 is then equivalent to retrieving D2|D4|D9 in response to Q1. In fact, the
keyword search engines using UDDI and WSDL employ a similar approach. Their
limitation is the small and technical corpus in the form of UDDI/WSDL descriptions and
fields, and the inflexibility of a Boolean IR system using keyword matching.
The benefits of utilizing the naturally occurring and often ignored SIS are manifold.
Firstly, a SS does not require a particular structure of the corpus and can thus utilize
legacy information and various sources of information. The automatic generation of
a SS provides an unbiased representation of the SIS, and it provides the flexibility to
add information by recomputing the space when necessary, without human
interaction. The SS furthermore provides a rich source of concepts relevant to
services. Lastly, the conceptual representations in the SS facilitate an explorative
mode of discovery even in cases of poorly understood and/or expressed service
needs. Higher-order co-occurrence in an SVD-reduced SS expresses recognizable
similarities in the vector representations by matching concepts. We anticipate that
searching using conceptual representations will counter terminological mismatch
between a query and service descriptions, leading to relevant results.
3.2 Semantic Space Generation
Building a Semantic Space (SS) starts with parsing and tokenizing documents, identifying
terms and generating a vocabulary. In these models, there are only two types of
objects, terms and documents. We cater for differentiation in the latter, thus extending
traditional conceptions of SS. The purpose of this will become apparent later in the
Semantic Categorization (see 3.4.3). Furthermore, we also extract links between
documents for another extension of SS models (see 3.4.2).
The text corpus, the SIS of the SES, consists of a list of document types, each
containing a list of documents. A document type could be, for example, a comment or
(the description of) a service operation. The documents are plain text and can contain link
information, e.g., the service operation (description) 'sell share' relates to the
(service) bundle 'share trading' (Figure 14). The simplest corpus possible is a single
document of a default type with no link information.
Figure 13: Steps in Semantic Space generation
The semantic space generation (Figure 13) starts with tokenizing the corpus, followed
by parsing it into a word co-occurrence matrix, and then reduces the matrix by means
of Singular Value Decomposition. At this point, we have a Semantic Space
consisting only of tokens/terms, which we can query and explore. Through combination
of the term vectors, we subsequently map the documents into the space. The final
step is a categorization by clustering and tessellating the space.
Figure 14: Example corpus structure
The following subsections explain the various steps in more detail opening with a
discussion of the version of vector space model we have chosen as the foundation of
our SS.
3.2.1 The Vector Space Model
There is a variety of ways to compute a Semantic Space. The initial choice is
whether a term-document or term co-occurrence matrix should serve as the basis for
the model. The former assumes that a document is topical and that the word order in
the document is insignificant. We anticipate that documents are topical but cannot
presume that the documents are of similar topic granularity since the SIS by
definition originates from a broad range of documents. We therefore chose a term co-
occurrence matrix with a gap (see 2.2.7) as the base for our SS with a variable sliding
window.
The term weights we use are either the maximum TF-IDF of a term or a fixed scalar of one.
In the former case, the highest TF-IDF of the term across all documents in the SIS is
the term weight. In this way, the weighting is motivated by a document retrieval
approach to SD. Over the years, TF-IDF has proven to be an
effective term weight. The scalar term weight is a baseline and, together with
frequency ordering of the content-bearing columns, may perform better in situations
where exploiting term relationships is more important for SD.
The order of the matrix columns and rows is in decreasing order of term weight or
Document Frequency (DF). The DF we use is modified to count not only each
document the term occurs in, but also the frequency with which it appears. In effect,
it is equivalent to a TF over the corpus treated as one document. We offer this option for
applications where the broad semantic base of the corpus, i.e., term relationships, is
more in focus than the identification and retrieval of documents, e.g., in a synonym
test. In the case where the scalar term weight is used, the order of columns
and rows is simply the order in which the parser encounters the terms.
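The maximum-TF-IDF term weight described above can be sketched as follows; a simple tf · log(N/df) formulation is assumed here, since the exact TF-IDF variant is not fixed in this passage, and the function name is our own.

```python
import math
from collections import Counter

def max_tfidf_weights(docs):
    """For each term, take the highest TF-IDF it attains in any single
    document as the term weight, assuming a tf * log(N/df) formulation."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    weights = {}
    for doc in docs:
        tf = Counter(doc)
        for term, f in tf.items():
            w = f * math.log(n / df[term])
            weights[term] = max(w, weights.get(term, 0.0))
    return weights
```

A term occurring in every document, such as 'share' below, receives a weight of zero under this formulation; as noted in section 3.2.2, such zero weights are later rounded up to the smallest representable number.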
The optional gap removes columns in the matrix corresponding to high-frequency
terms without discriminative power, in the case where DF determines the order, or to
extremely discriminating terms, in the case where the sort order is by term weight in
conjunction with maximum TF-IDF. In the latter case, the top results
are terms that are highly frequent in a tiny set of documents but very infrequent in
the corpus; they are excellent identifiers for an insignificant number of documents,
but otherwise introduce sparseness in the matrix, reducing its overall information
content.
The matrix is further processed by reducing its dimensionality through SVD (see
2.3.2) and employing the left side of the truncated matrix decomposition. It contains
the approximation of the row vector relationships and, when truncated correctly,
reduces the noise in the data and amplifies the low- and higher-order co-occurrences
(Bellegarda, 2000). A similar Semantic Space model has been successfully
implemented in Infomap (Takayama et al., 1999).
3.2.2 Tokenizing
Tokenizing identifies recognizable items with information value in the document and
indexes them for easier processing. In the following, we refer to tokens as terms;
they are generally words, but basic email addresses, URLs and abbreviations are also
recognized. A term is longer than one character and not part of a 'stop list' of 765
common words like 'a', 'you', 'the' or 'it', which we sourced from the Infomap source
code42. The system excludes them since their high frequency results in a low
discriminating information value. After tokenizing the documents, the term weight
and DF/corpus-wide TF are calculated and stored for each term. A term weight of zero
is rounded up to the smallest positive number computable on the system to ensure
that each term has a weight, even if it is minuscule and normally outside the
system's number range. This reduces computational error later, when exceptionally
sparse vectors might otherwise become incomputable.
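A minimal tokenizer along these lines can be sketched as below; the ten-word stop set is a tiny stand-in for the 765-word Infomap list, and the regular expression is a simplification of the word, email and URL rules described above.

```python
import re

# Tiny stand-in for the 765-word stop list sourced from Infomap.
STOP_WORDS = {"a", "an", "the", "it", "you", "is", "and", "or", "of", "to", "in"}

# URLs, simple email addresses, then plain words; a simplification of the
# recognition rules described in the text.
TOKEN_RE = re.compile(r"https?://[^\s,]+|[\w.+-]+@[\w-]+\.[\w.]+|[a-zA-Z]+")

def tokenize(text):
    """Lower-case the text, match candidate tokens, then drop
    one-character terms and stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
```
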
3.2.3 Singular Value Decomposition
Deerwester, S. Dumais, Furnas, T. Landauer, & Harshman (1990), in the original
application of SVD to a term-document matrix, explain how the dot
product (necessary for cosine similarity) between terms is given by the left singular
vectors multiplied by the singular values as dimensional scaling factors (see also 2.3.2).
They furthermore establish that a query consisting of terms is comparable to a
pseudo-document and as such mapped into the column space of the matrix.
Figure 15: SVD approximation of word co-occurrence matrix M
We employ a variant of the term-document SVD-reduced matrix introduced by
Infomap (Takayama et al., 1999). It applies the dimensional reduction to a term co-occurrence
matrix, using the rows to index a large part of the corpus and the
columns for a smaller, content-bearing selection of terms (Figure 15).
42 Publicly available at http://sourceforge.net/projects/infomap-nlp/files/.
Figure 16: SS from word co-occurrence matrix (no singular values)
The Infomap method uses the left of the three resulting reduced matrices for the term
vector representation (Figure 16). This differs from the LSA-described left
singular vector reconstruction, which includes the singular values, as illustrated by Figure 17.
We combine and extend the two approaches in section 3.4.1 with the
introduction of the singular factor.
Figure 17: SS from word co-occurrence matrix (with singular values)
Either approach is highly efficient, storing only the term side of the SS. A
document in the space is a combination of its term vectors, and similarly a query
maps into the space as a pseudo-document. All three, terms, documents and
queries, are present in the same space, using the same similarity measure to compare
them.
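The two variants, plain left singular vectors (Infomap style) versus rows of U scaled by the singular values (LSA style), can be sketched with numpy on a toy co-occurrence matrix; the matrix values and variable names are illustrative only.

```python
import numpy as np

# Toy 4x3 term co-occurrence matrix
# (rows: indexed terms, columns: content-bearing terms).
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                        # truncate to k dimensions
U_k, s_k = U[:, :k], s[:k]

terms_plain = U_k            # Infomap-style term vectors (Equation 13)
terms_scaled = U_k * s_k     # rows of U scaled by singular values (Equation 12)

def cosine(a, b):
    """Cosine similarity between two term vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In this toy space, the first two rows (which co-occur with the same content-bearing terms) remain far more similar to each other than to the last row after truncation.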
3.2.4 Term Vectors
The first step in generating the space is to create a term co-occurrence matrix as
explained in chapter 3.2.1, populating it by parsing all documents with a sliding
window. The window moves from term to term in the document, using the current term as a row
reference and incrementing the columns of the row by the term weight of the (column)
terms found left and right of it. Once finished, we smooth the resulting row vectors by
applying the square root to the matrix cell values. The matrix is sparse since many
terms do not co-occur.
$\vec{t} = U_{t,*} \cdot S$
Equation 12: Row vector as a combination of U and S
$\vec{t} = U_{t,*}$
Equation 13: Row vector from U
We subsequently decompose the term matrix by SVD (Figure 15). The cosine
similarity between the rows of the left singular vectors combined with the singular
value diagonal (Figure 17 and Equation 12), or just the rows of U (Figure 16 and Equation
13), measures the semantic similarity between two terms. The resulting row vectors
each represent a term t. For ease of notation, we will use $\vec{t}$ to represent such a term
vector.
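The sliding-window population and square-root smoothing described above can be sketched as follows; the sparse dictionary representation, the default window of two and the weight lookup are illustrative choices of this sketch.

```python
import math
from collections import defaultdict

def cooccurrence(docs, weights, window=2):
    """Populate a sparse term co-occurrence matrix: for each focus term,
    increment the cells of terms found within `window` positions left and
    right of it by their term weight, then smooth with a square root."""
    matrix = defaultdict(float)
    for doc in docs:
        for i, focus in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[(focus, doc[j])] += weights.get(doc[j], 1.0)
    # square-root smoothing of the raw co-occurrence values
    return {cell: math.sqrt(v) for cell, v in matrix.items()}
```

The resulting sparse matrix would then be decomposed by SVD as in the previous subsection.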
3.2.5 Document Vectors
We map documents into the SS by adding up and normalizing their terms' vector
representations after generating the reduced left singular matrix. The final Semantic
Space (bottom parts of Figure 16 and Figure 17) consists of the k-reduced (and possibly
scaled), row-normalized term vectors and a number of documents of different types
represented by their normalized, summed term vectors. A Document Vector (DV), $\vec{d}$,
is the sum of its term vectors normalized to unit length (Equation 14).
$\vec{d} = \dfrac{\sum_{t \in d} \vec{t}}{\left\| \sum_{t \in d} \vec{t} \right\|}$
Equation 14: Term based document vector
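Equation 14 amounts to a few lines of code; the following sketch assumes term vectors are stored in a plain dictionary and silently skips terms absent from the space, both illustrative choices.

```python
def document_vector(term_vectors, doc_terms):
    """Document vector per Equation 14: sum the vectors of the document's
    terms, then normalize the result to unit length."""
    dims = len(next(iter(term_vectors.values())))
    summed = [0.0] * dims
    for t in doc_terms:
        if t in term_vectors:  # terms outside the space are skipped
            for i, v in enumerate(term_vectors[t]):
                summed[i] += v
    norm = sum(v * v for v in summed) ** 0.5
    return [v / norm for v in summed] if norm > 0 else summed
```

A query is mapped into the space the same way, as a pseudo-document of its terms.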
3.3 Semantic Categorization
Semantic Categories are inspired by conceptual space theory (Gärdenfors, 2004), in
the tradition of Aristotle's work on the topic. Algorithmically, we obtained
insight from our review of cluster analysis in section 2.4 and in particular from canopy
clustering (McCallum et al., 2000). The main premise is that categories are regions in
a space spanned by sub-conceptual dimensions around prototypical cores. Instances
of a concept belonging to a category fall within that subspace, with their distance to the
core relating to their similarity with the prototype that seeds the category. We argued
that Semantic Spaces, with their SVD-generated abstract feature space in which
geometric distances between terms and documents indicate their semantic relatedness,
approximate such a conceptual space. With the same intuition, we propose to
construct Semantic Categories resembling conceptual space categories. To this end,
we introduce Semantic Categories in this section and an algorithm to generate them.
3.3.1 Semantic Category
We pointed out in relation to research question 2, though, that a searcher will not
always have adequate awareness of a service need, requiring her to explore and learn
about the SES first before being able to understand and describe her need. For this
case, it is desirable to organize the SS to provide a conceptual, discoverable and
plausible view of the SES as described by the Service Information Shadow. Humans
have an ability to observe, generalize and abstract by way of abductive inference to
make sense of the world around them and reason about it (Gabbay & Woods, 2005).
Such reasoning is not deductive but highly pragmatic. It is 'good' reasoning if it
helps to close the agent's given agenda. At the same time, such reasoning is resource
bound: constraints such as time, information and cognitive processing power govern
it.
We can assume that a searcher with an ill-defined service need has a limited amount
of resources/time to achieve her aim of fulfilling the service need. The limit is a
result either of an external prescription or of the value of achieving the agenda. For
example, if a searcher has the agenda of an entertaining evening and the
ideal service would be ordering concert tickets, there is a limit to how much time the
searcher would expend searching for a service. If categories meaningful to the
searcher organize the service space, then she needs little time to identify the
appropriate categories and can use the remaining time to find the optimal service
by further exploring the categories or forming queries related to the service need.
Otherwise, the searcher is required to spend most of her time browsing
through a large part of the service space to learn slowly of its offerings, structuring
the offerings herself to orient herself and inform the service need.
This in turn helps her to form appropriate queries or to guess potential alternative
service offerings. In effect, she will spend more time/effort on exploring the space
instead of refining a service need and query to optimize her outcome.
For a better understanding of what a human-like abstraction could be, we revisit the
conceptual space example of an apple (see 2.3.1). A particular apple described by
symbols gives context equivalent to a point or area in the conceptual space inside the
apple concept. The symbol apple can also refer to the whole concept of apple in all
its variations (green, red, sour, sweet, ripe, small, round, etc.) or to a prototypical
apple. For example, “Give me an apple” requests the passing of something that falls
within the apple concept. “It looked like an apple” refers to the ‘apple-ness’ of
something, indicating that it had the usual (in this case visual) characteristics of an
apple. Gärdenfors (2004) identifies these prototypes as subspaces inside a concept.
They contain or are close to the most common expression of a concept, with more
unusual ones being more distant, e.g., because of atypical expression in one or more
dimensions like shape or colour, such as a “striped apple”.
We propose to implement the idea of a concept subspace around a prototypical core
in a conceptual space in the SS, calling it a category. The cosine similarity of
semantic relatedness parallels the geometric closeness in a conceptual space based on
quality dimensions. We suggest that high-density areas of semantic representations
identify prototypical areas in a SS. Similar items clump together, forming a semantic core,
because they co-occur in comparable circumstances in the corpus. Unusual instances
of a concept have a higher variation of co-occurrence and therefore appear close to,
but not as part of, the semantic core (Figure 18).
Figure 18: Semantic core expand to categories (simplified)
We propose to identify categories through their prototypical semantic cores, high-density
areas of vectors in the space, in the form of partial, flat, exclusive clusters. We
can extend these clusters through tessellation (Voronoi, 1907) to form full categories
spanning a subspace, distributing ambiguous objects to categories based on their
proximity to core concepts (Figure 19).
Figure 19: Tessellation around core concepts (simplified)
3.3.2 Cluster Definition and Fitness
We define a cluster to consist of two or more vectors. The vector closest to the centre
of the cluster is a medoid (Kaufman & Rousseeuw, 1987), a pseudo-centroid.
Medoids are part of the original data and act as centroid proxies. The remaining
vectors are cluster members. For comparability, we introduce a fitness measure
evaluating local and global qualities of clusters.
Fitness
Let C be a cluster of j > 0 (vector) members $\vec{m}_i$ with $\vec{c}$ (not counted as a member) as
the medoid. A minimal cluster consists of at least the medoid and one member.
Multiplying the sum of all members' similarities with the medoid (Equation 15)
by the average similarity raised to a density factor (Equation 16) establishes the local fitness
$ss_C$ of cluster C. A greater density factor gives preference to density over numbers in
a cluster.
$S_C = \sum_{i=1}^{j} \mathrm{sim}(\vec{m}_i, \vec{c})$
Equation 15: Sum of similarities
$ss_C = S_C \cdot \left(\frac{S_C}{j}\right)^{d}$
Equation 16: Local Fitness
Table 6 illustrates this effect by 'growing' a cluster from left to right with members
of decreasing similarity to the medoid. After a certain point, the addition of another
member (with lower than average similarity) no longer outweighs the drop in density.
Where this happens depends on the density factor. The highlighted cells indicate the
maximum local fitness, the tipping point. For example, for a density factor of 0.25
the cluster reaches its maximum fitness of 3.7892 with the eighth member, at an
average similarity of 0.55. If we increase the density factor, this tipping occurs
'earlier', resulting in denser (smaller) clusters.
Members 1 2 3 4 5 6 7 8 9
Avg‐Sim 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5
Sum‐Sim 0.9 1.7 2.4 3 3.5 3.9 4.2 4.4 4.5
Density
0.125 0.8882 1.6658 2.3340 2.8940 3.3474 3.6955 3.9402 4.0832 4.1265
0.25 0.8766 1.6323 2.2698 2.7918 3.2014 3.5018 3.6965 3.7892 3.7840
0.5 0.8538 1.5673 2.1466 2.5981 2.9283 3.1443 3.2533 3.2631 3.1820
1 0.8100 1.4450 1.9200 2.2500 2.4500 2.5350 2.5200 2.4200 2.2500
2 0.7290 1.2283 1.5360 1.6875 1.7150 1.6478 1.5120 1.3310 1.1250
4 0.5905 0.8874 0.9830 0.9492 0.8404 0.6962 0.5443 0.4026 0.2812
8 0.3874 0.4632 0.4027 0.3003 0.2018 0.1243 0.0705 0.0368 0.0176
16 0.1668 0.1262 0.0676 0.0301 0.0116 0.0040 0.0012 0.0003 0.0001
Table 6: Local fitness (Equation 16) example for varying densities
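The local fitness computation can be sketched as follows. This is an illustrative Python sketch, not part of the thesis prototype (which is implemented in C#/.NET); the function and variable names are our own. It reproduces the Table 6 cell for eight members at a density factor of 0.25:

```python
def local_fitness(similarities, density):
    # Equation 15: sum of the members' cosine similarities to the medoid
    s = sum(similarities)
    # Equation 16: the sum multiplied by the average similarity
    # raised to the density factor
    return s * (s / len(similarities)) ** density

# Eight members with similarities 0.9 down to 0.2 (sum 4.4, average 0.55)
sims = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(round(local_fitness(sims, 0.25), 4))  # 3.7892, the Table 6 tipping point
```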
We define the global factor as one minus the cosine similarity between this cluster's
medoid and the next closest one, raised to a distance factor. The fitness of the
current cluster decreases if there is an existing cluster with a similarity greater than
zero. The distance factor weights the separability of the categories by penalising
proximity between clusters.
$f_C = \left(1 - \max\left(0, \max_{c' \neq c} \cos(c, c')\right)\right)^{df} \cdot ss_C$

Equation 17: Fitness of cluster C with medoid c and j members
The final fitness f_C for cluster C is the global factor multiplied by the local measure
(Equation 17). Table 7 shows how the final fitness of a cluster changes over a range
of distance factors and varying similarity to the next closest cluster. As intended,
clusters with no close neighbours (cosine similarity of 0 between medoids) always
achieve maximum fitness.
Next closest cluster
0 0.2 0.4 0.6 0.8 1
Distance
0.125 12.2500 11.9130 11.4922 10.9243 10.0176 0.0000
0.25 12.2500 11.5853 10.7814 9.7421 8.1921 0.0000
0.5 12.2500 10.9567 9.4888 7.7476 5.4784 0.0000
1 12.2500 9.8000 7.3500 4.9000 2.4500 0.0000
2 12.2500 7.8400 4.4100 1.9600 0.4900 0.0000
4 12.2500 5.0176 1.5876 0.3136 0.0196 0.0000
8 12.2500 2.0552 0.2058 0.0080 0.0000 0.0000
16 12.2500 0.3448 0.0035 0.0000 0.0000 0.0000
Table 7: Fitness example for fixed cluster with changing distance
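The global fitness can be sketched in the same illustrative style (again our own naming, not the prototype's); it reproduces the Table 7 cell for a distance factor of 2 and a next-closest-medoid similarity of 0.2, given the fixed local fitness of 12.25 used in the table:

```python
def fitness(local_fitness, nearest_medoid_sim, distance):
    # Equation 17: the global factor (1 minus the similarity to the next
    # closest medoid, floored at 0) raised to the distance factor,
    # multiplied by the local fitness
    return (1.0 - max(0.0, nearest_medoid_sim)) ** distance * local_fitness

print(round(fitness(12.25, 0.2, 2), 2))  # 7.84, as in Table 7
```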
3.3.3 Cut-Off
The clustering starts by considering each vector of the space a possible medoid and
generating the best local cluster around it. We add the globally43 'fittest' cluster
candidate to the list of final clusters, removing its members and medoid from the
remaining cluster candidates to prevent overlapping. All remaining candidates then
evaluate their distance to the confirmed cluster, updating their global fitness
accordingly. We repeat the process until no candidate clusters remain.
Additionally, a cut-off value can be set to remove a potential tail of mini-clusters. The
cut-off is a percentage of the highest (first) cluster's fitness, e.g., if the first cluster's
fitness is 120 and the cut-off 10%, then any cluster with a (global) fitness below 12 is
invalid.
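The greedy selection loop can be sketched as follows. This is a simplified illustration under our own naming: overlapping candidates are dropped outright rather than re-grown without the taken vectors, which the full procedure would do.

```python
def select_clusters(candidates, sim, distance=1.0, cut_off=0.0):
    # candidates: list of (medoid, members, local_fitness) tuples
    # sim(a, b): cosine similarity between two medoids
    confirmed, remaining = [], list(candidates)
    first = None
    while remaining:
        def global_fitness(cand):
            medoid, _, local = cand
            nearest = max((sim(medoid, m) for m, _, _ in confirmed), default=0.0)
            return (1.0 - max(0.0, nearest)) ** distance * local
        best = max(remaining, key=global_fitness)
        f = global_fitness(best)
        if first is None:
            first = f
        elif f < cut_off * first:
            break  # the cut-off removes the tail of mini-clusters
        confirmed.append(best)
        taken = {best[0]} | set(best[1])
        # drop candidates sharing vectors with the confirmed cluster
        remaining = [c for c in remaining
                     if c is not best and not (taken & ({c[0]} | set(c[1])))]
    return confirmed
```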
3.3.4 Tessellation
We divide the remainder of the space by tessellation (Voronoi, 1907) with each
cluster medoid as the centre of a convex region, a tessellate. Any remaining vectors
of the clustering type, as well as those of any other type, belong to the tessellated
area of the closest (by cosine similarity) medoid. Together, a cluster and its
tessellated vectors form a category (Figure 20).
43 Because the first cluster has no neighbouring cluster the global factor will be 1.
Figure 20: Categories through tessellation example
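The tessellation step reduces to assigning each remaining vector to its most similar medoid. A minimal sketch (illustrative naming, unit-length vectors assumed so cosine reduces to the dot product):

```python
def categorize(vectors, medoids, cos):
    # Assign each remaining vector to the category of its closest medoid
    # (by cosine similarity), forming Voronoi-style regions around the cores
    categories = {m: [] for m in medoids}
    for v in vectors:
        categories[max(medoids, key=lambda m: cos(v, m))].append(v)
    return categories

def cos(a, b):
    # for unit-length tuples the cosine is just the dot product
    return sum(x * y for x, y in zip(a, b))
```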
3.4 Innovations
We provide two innovations for the Semantic Space model and one for
clustering/categorizing Semantic Spaces beyond Semantic Categories. Firstly, we
investigate an alternative recombination of the SVD-truncated matrix; secondly, we
propose introducing relationship (link) information into the traditionally “pure”
semantic space of vectors. Lastly, we introduce typing of the SIS documents to
enhance categories/clusters.
3.4.1 Singular Factor
Deerwester, S. Dumais, Furnas, T. Landauer, & Harshman (1990) based the SVD
reduction on the argument that they do not intend to reconstruct the original matrix
but to extract latent semantic structure. They propose that the reduced singular
vectors contain some kind of feature space representation but do not attempt to
compare the decomposition with how latent semantic structure is present in human
cognition. We agree that only the resulting effect compares to human cognition, i.e.,
the conceptual level, and the underlying space is merely a means to that end.
Following this argument, we propose that the scaling of the features in the
decomposition by the singular values may not be optimal for our purposes. We are
interested in uncovering and amplifying the semantic features, which is not
necessarily equivalent to reproducing a smallest-error approximation of the original
matrix.
We propose to vary the influence of S, the singular values. Raising the singular
values by a singular factor for scaling purposes (Figure 21 and Equation 18) is our
proposition for adjusting the singular values’ influence. The intent is to explore
whether the ordering and scaling of features, in the form of columns of the left
singular vectors, is optimal or whether a scaling of the singular diagonal values may
further optimize the semantic associations. Originally, Deerwester et al. (1990)
argued that S is part of constructing the row and column relationships, but there is
the alternative view that it can be ignored (Schütze, 1997, 1998; Takayama et al.,
1999).
$\hat{t}_i = u_i \cdot S^{sf} = u_i \cdot \mathrm{diag}(\sigma_1^{sf}, \ldots, \sigma_k^{sf})$

Equation 18: Term/row vector as a combination of U, S and a scaling factor
The effect of the new singular factor parameter, denoted sf, is as follows (Equation
18). A singular factor of less than 0 reverses the order, 0 removes the singular values’
influence, 0 to 1 smooths them, 1 results in the traditional combination, and greater
than 1 amplifies the difference between the singular diagonal values. We hope that a
smoothing of the latent factors in the reduced matrix M* by means of the singular
values (e.g., a singular factor of 0.5) might improve the resulting SS despite
introducing a greater error distance between it and the original M. This thesis will
report the benefit of employing the matrix S in the Semantic Space re-composition
and scaled variations of it.
Figure 21: Singular Factor in SS generation
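The recombination with a singular factor can be sketched with a standard SVD routine; this is an illustrative sketch (the prototype itself is C#/.NET), with `term_vectors` and the toy matrix being our own:

```python
import numpy as np

def term_vectors(M, k, sf):
    # Truncated SVD of the co-occurrence matrix M, with the singular
    # values raised to the singular factor sf before recombination
    # (Equation 18); sf=1 is the classic LSA combination, sf=0 drops S
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k] ** sf

M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
rows = term_vectors(M, 2, 0.5)  # smoothed singular values
```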
3.4.2 Linked Document Vectors
In the traditional SS models, a vector representation of a document is based solely on
the document's text. In a SIS, documents may contain link information relating them
to other documents, e.g., through URL or XML information. We have learned that
word co-occurrence relates to semantic relatedness between words. We propose that
documents that are adjacent, i.e., connected through direct links, are close in
meaning, similar to how hyperlinks are used on the web, e.g., on Wikipedia44. Links
relate a word or section of a document to a related section, document or web site. In
sum, the links on a page are a collection of related topics. We therefore argue that
incorporating links into the representation of documents can enhance it. We propose
to extend the DV, the traditional term-based one, to the Linked Document Vector
(LDV), a hybrid adding the link information to the vector representation.
Figure 22: LDV example
A LDV is a combination of its DV, $v_d$, and the DVs of the documents it links to,
$\sum_{e \in L(d)} v_e$, as described by Equation 19. Note that we presume all
document vectors to be unit length. The weighting of the two is adjustable through a
scaling factor $\alpha$. To prevent circular references, only the DV and not the LDV
of the linked-to documents is used. For example, in Figure 22 document1 links to
document2 and document3. LDV1, the linked vector representation, is as a result a
combination of DV1, DV2 and DV3.

$\mathrm{LDV}_d = \frac{v_d + \alpha \sum_{e \in L(d)} v_e}{\left\| v_d + \alpha \sum_{e \in L(d)} v_e \right\|}$

Equation 19: Linked vector of document d
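A minimal sketch of the LDV combination (illustrative naming, unit-length DVs assumed as stated above):

```python
import math

def ldv(dv, linked_dvs, alpha):
    # Own DV plus alpha-weighted DVs of the linked documents,
    # renormalized to unit length (all input DVs assumed unit length)
    combined = [x + alpha * sum(l[i] for l in linked_dvs)
                for i, x in enumerate(dv)]
    norm = math.sqrt(sum(x * x for x in combined))
    return [x / norm for x in combined]

# A document linking to a single document whose DV points along the y-axis
print(ldv([1.0, 0.0], [[0.0, 1.0]], 1.0))  # ~[0.7071, 0.7071]
```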
44 See www.wikipedia.org for more.
3.4.3 Perspectives
The conventional SS distinguishes two types of vectors, term and document. In turn,
clustering of the space is based on term (Cao et al., 2004) or document vectors
(W. Song & Park, 2007). We propose to source the SS from a SIS, and while we do
not prescribe what information a SIS has to contain, we anticipate various document
types describing:
Service Bundles
Service Operations
Process Components
Business Objects
Use-cases
Reviews
…
We propose to retain document typing if made available to the SS. Documents of
different types remain the same unstructured text with optional links and differ only
in the attached type. The intention behind the typing is to search or organize the
space along types of information objects. Instead of applying a clustering or
categorization algorithm to all documents of a SIS, we can choose to organize the
space by relevant document types. For example, if a searcher is looking for complex
services, then a clustering by service bundles may be more appropriate than by
service operations or by all documents. We call this selective view of a space by
means of a type a “perspective”. At the time of SS generation, a broad view of all
service-related information is desirable to provide a complete semantic base. At the
time of querying or categorizing, this may not be optimal since only a subset of
vectors is potentially of interest, resulting in different semantic cores and categories.
An additional benefit is the opportunity for the searcher to select specific information
types of interest at query time.
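A perspective amounts to filtering the space by document type before querying or categorizing; a minimal sketch with our own illustrative names:

```python
def perspective(documents, wanted_types):
    # Restrict the space to documents of the selected types;
    # documents are (vector, type) pairs
    return [vec for vec, doc_type in documents if doc_type in wanted_types]

docs = [("v1", "Bundle"), ("v2", "ServiceOperation"), ("v3", "Bundle")]
print(perspective(docs, {"Bundle"}))  # ['v1', 'v3']
```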
3.5 Modes of Discovery
The discovery of an information object in the SS can proceed either by querying the
space with one or a combination of known information objects, or by exploring an
abstraction of the space to obtain an overview or drill down to a particular piece of
information.
3.5.1 The Query mode of discovery
A consumer or system can query the semantic space by and for different types of
information. A query can be a combination of any information objects represented in
the space: for example, a single word represented by its vector, a combination of
words, or a combination of different information object types. A combined query
consists of a combination of words or a mix of information objects, e.g., through
vector addition and normalization (to remove query length bias). In Equation 20, the
normalized sum of all term vectors of query q is the query vector $\hat{q}$. In
Equation 21, we extend this by summing over types, a list of object types with vector
representations that are part of the query. They contribute equally, and the resulting
query is the normalized vector sum. A more sophisticated query system may also add
user, type-dependent or algorithmic weighting. We leave this as a possible future
improvement.
$\hat{q} = \frac{\sum_{t \in q} v_t}{\left\| \sum_{t \in q} v_t \right\|}$

Equation 20: Combined query from terms

$\hat{q} = \frac{\sum_{T \in \mathrm{types}} \sum_{o \in T} v_o}{\left\| \sum_{T \in \mathrm{types}} \sum_{o \in T} v_o \right\|}$

Equation 21: Combined query from objects of different types
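The normalized vector sum underlying both query forms can be sketched as follows (illustrative naming, not the prototype's API):

```python
import math

def query_vector(vectors):
    # Normalized sum of all query item vectors; normalization
    # removes the query length bias
    dim = len(vectors[0])
    q = [sum(v[i] for v in vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in q))
    return [x / norm for x in q]

print(query_vector([[1.0, 0.0], [0.0, 1.0]]))  # ~[0.7071, 0.7071]
```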
Querying retrieves a list of objects ranked by decreasing cosine similarity with the
query vector. Similar to creating document vectors through term combination,
combined query vectors seek to merge the meanings of the single vectors to express
various aspects. For example, a consumer could express her query using the term
(vectors) incentives and sales and improve the query precision further using the
bundle sales_incentive_and_commission_management. The resulting query vector is
a combination of the two term vectors and the bundle vector, with geometrically
nearby vectors representing related types of information from the SIS such as legal
documents, web sites, reviews, combined services, bundles or terms.
Negation
The query combination can also contain negative items that are subtracted from the
query vector through orthogonal negation (Widdows, 2003). The process takes two
vectors, a and b, with a containing a mixed meaning, e.g., the term apple, which
could refer to the fruit or the company. The vector b can then negate or remove a
particular meaning from a. For example, if b is the word fruit, or the combined vector
of fruit and tree, then removing that meaning/vector from a is equivalent to a
becoming orthogonal to b. The cosine similarity between the two becomes 0, and a
becomes disconnected from the fruit meaning while retaining the company meaning
(or at least the part that is not fruit related), effectively disambiguating it. The
negation can be achieved with the Gram-Schmidt algorithm45 (Arfken, 2005, pages
516-520), orthogonally projecting a on b and subtracting the projection from a to
arrive at a′, which is orthogonal to b (Equation 22).
$a' = a - \frac{a \cdot b}{b \cdot b}\, b$

Equation 22: Gram-Schmidt algorithm applied for vector negation
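The single Gram-Schmidt step is short enough to sketch directly (illustrative naming; the Infomap project referenced in footnote 45 implements the same step in C):

```python
def negate(a, b):
    # Project a onto b and subtract the projection,
    # leaving a' orthogonal to b (Equation 22)
    scale = sum(x * y for x, y in zip(a, b)) / sum(y * y for y in b)
    return [x - scale * y for x, y in zip(a, b)]

apple_not_fruit = negate([1.0, 1.0], [0.0, 2.0])
print(apple_not_fruit)  # [1.0, 0.0] -- cosine with b is now 0
```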
Frequency
By default, we add up query items such as terms as we encounter them. That means a
repeat of the same term or document counts several times in the query. This may be
desirable when the query originates from a text, or when the query shall emphasize a
particular query item over others through repetition or synonyms. The alternative is
to add each distinct item only once to the query, independent of its frequency. We
call this option query uniqueness.
Factor
$f_t = qf^{\,tw(t)}$

Equation 23: Query Factor
Another query option is the query factor. It amplifies a query term according to its
term weight. The term weight of interest in our model is the TF-IDF, which is
usually small (less than 1) and approaches 0. We therefore propose to pass a query
factor qf as a parameter and raise it to the term weight (Equation 23). In the case of
the TF-IDF, the resulting factor ranges between qf and 1 (considering that the
TF-IDF mostly ranges between 0 and 1). In a last step, we add the term vector
multiplied by the final factor to the query.
45 This has been used in the Infomap project; see file query.c, lines 142 to 166, from http://sourceforge.net/projects/infomap-nlp/files/.
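The scaling in Equation 23 can be sketched as follows (illustrative naming):

```python
def query_term_factor(query_factor, term_weight):
    # The query factor raised to the term's TF-IDF weight; with
    # weights in [0, 1] the result lies between 1 and query_factor
    return query_factor ** term_weight

print(query_term_factor(4.0, 0.5))  # 2.0
```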
3.5.2 The Browsing mode of discovery
Instead of directed search, the consumer can also use the categories to browse the
SES. Semantic Categorization (see section 3.3) can provide categories for a Semantic
Space sourced from a Service Ecosystem’s Service Information Shadow. The level
of abstraction depends on the settings for the categorization. For example, lowering
the cluster density while raising the distance factor will produce large clusters giving
a bird's eye view, and vice versa. A consumer can then drill down from an abstract
level to detailed categories, using informed inference to extend her knowledge and
arrive at a useful service solution.
Finally, a combination of the two modes is possible, where a query returns not a list
of close objects but close categories, or the contents of a category, thereby providing
a more selective view of the SIS.
3.6 Software Prototype
The proposed SS generation and discovery have been implemented using C#,
Microsoft .NET and mono46. The software executes under Microsoft Windows as
well as Linux, with the core functionality encapsulated in a system-independent
library. For interactive exploration and creation of simple SSs by the researcher, a
form-based graphical user interface for Microsoft Windows has been developed
(Figure 23). The SS can be loaded from a file or generated from a text corpus or a
specially formatted XML file containing the text corpus, document types and
relationships. A configuration screen (Figure 24) gives access to the parameters of a
Semantic Space’s computation.
46 See http://www.microsoft.com/net and http://www.mono-project.com for more.
Figure 23: SSD graphical user interface main screen
The interface allows combining three different object types using Boolean AND and
NOT to create complex queries, utilizing vector addition for AND and vector
negation for NOT (see section 3.5.1). In the example in Figure 23, the query consists
of incentives AND sales (Linked/Terms),
sales_incentive_and_commission_management (Bundle) and
confirm_commission_case_creation_as_bulk NOT find_opportunity_by_elements
(ServiceOperations). The screenshot shows 0 cut-off, 7 distance and density as
parameters for the SC. Each of the three object type fields shows the most similar
results of that type for a particular query. The second field from the right lists all
category medoids with their similarity to the query, highlighting the most similar
medoid/category. The right field shows the highlighted category’s members with
their similarity to the medoid.
Figure 24: SSD configuration screen
The computationally expensive tasks execute in parallel using a shell program
adaptation running on Linux on the Queensland University of Technology High
Performance Computing facilities47.
3.6.1 Parameters
The parameters manipulate the different areas of space creation and interaction. The
following table names them for future reference and gives a short description.
47 See http://www.itservices.qut.edu.au/hpc/ for more.
Type / Name / Description / Default
Co-Occurrence Matrix
  rows      Number of rows                                                       -
  cols      Number of columns                                                    -
  left      Left sliding window                                                 15
  right     Right sliding window                                                15
  gap       Column gap                                                          50
  tw        Term-weight: TF-IDF or 1                                        TF-IDF
  tt        Ordering of columns: tw or DF                                       tw
SVD
  u         Number of reduced columns                                            -
  sf        Singular factor by which S values are raised                         1
Query
  qf        Query factor, multiplying query vectors with qf raised by TF-IDF     1
  uq        Unique query switch: use query terms only once or by frequency    true
Cat.
  density   Greater number gives preference to denser clusters.                  1
  distance  Greater number penalises proximity of clusters.                      1
  cut-off   Percentage fitness of fittest cluster as lower fitness bound.        0
LDV
  lnkwght   Weight of linked documents in linked document vector.               0%
Table 8: Parameters for Semantic Space and Semantic Categories
3.7 Evaluation
Before we continue with specific experiments to answer the research question, we
want to evaluate the proposed Semantic Space model and its implementation by
testing the quality of the term vector representations, which are essential both for
facilitating effective querying and for producing a useful conceptual abstraction of
the SES. In the literature, the synonym section of the Test of English as a Foreign
Language (TOEFL) on the Touchstone Applied Science Associates (TASA) corpus
(Landauer et al., 1998; Turney & Pantel, 2010) is a respected and widely used
evaluation of semantic vector representations. The TOEFL is part of an entry test for
foreign students to colleges in the United States of America.
3.7.1 Data
The test is multiple choice, comprising 80 words and, for each of these, four possible
synonyms, one of which is the correct answer. The corpus is the TASA corpus,
containing 44,486 short plain-text documents of "General Reading up to 1st year
college" totalling 73,132,886 characters, an average of 1,644 per document. Students
learn college-entry-relevant vocabulary and language usage from these readings.
3.7.2 Experiment/Methodology
Results are measured as the percentage of correct answers; twenty correct answers
would, for example, be 25%. If the SS does not contain the queried word or any of
the synonyms, then the question counts as answered with a 25% chance, similar to
Landauer and Dumais (1997). For example, if the system answered 20 correctly but
did not have enough information in the corpus to answer three other questions, then
the total result would be 0.2594 (25.94%), or 20.75 correct answers.
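The scoring rule can be sketched as follows (illustrative naming); it reproduces the worked example of 20 correct answers plus 3 unanswerable questions:

```python
def toefl_score(correct, unanswerable, total=80):
    # Questions the SS cannot answer count as a 25% chance success,
    # following Landauer and Dumais (1997)
    return (correct + 0.25 * unanswerable) / total

print(round(toefl_score(20, 3), 4))  # 0.2594, i.e. 20.75 correct answers
```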
3.7.3 Results
In the process of developing this model, we also evaluated some of the SS
parameters and collected 508,032 results. DF sorting of the co-occurrence matrix’s
columns and rows and a fixed scalar of one as the term weight consistently
performed better than TF-IDF sorting and weighting (see Table 9). Therefore, the
following results focus on the 127,008-result subset using DF sorting and a fixed
term weight of one. The single-word queries made the qf and uq settings irrelevant.
We inserted the words from the TOEFL test as rows at the beginning of the
co-occurrence matrix, overriding the row ordering, to ensure they are in the
vocabulary.
Avg Max Sorting TW
0.4816 0.7875 DF 1
0.4228 0.7000 DF TF-IDF
0.3745 0.7031 None 1
0.3843 0.6250 TF-IDF TF-IDF
Table 9: Comparison of sorting and term weight influence4849
We evaluated matrices of 3,000 rows by 3,000, 6,000, 9,000, 12,000 and 15,000
columns, as well as 15,000, 12,000, 9,000 and 6,000 rows by 3,000 columns, reduced
to 100, 250, 500 and 1,000 dimensions by SVD, together with combinations50 of
gap, left and right window sizes of 0, 2, 4, 8, 16, 32, 64 and 128 and singular factors
of -1, -0.5, 0, 0.5, 1, 2 and 4. Pilot experiments established these settings beforehand
as a reasonable range for exploring the effectiveness of the various parameters.
48 Based on 282,240 results from 3x3, 3x6, 3x9, 6x3 and 9x3 (each thousands) rows by columns matrices.
49 Values are the ratio of correct answers, e.g., 0.5 means 50% answered correctly. See 3.7.2 for details.
50 Except left and right window both 0.
Avg right
left 0 2 4 8 16 32 64 128
0 0.499 0.496 0.497 0.487 0.472 0.449 0.432
2 0.428 0.532 0.533 0.527 0.5 0.486 0.454 0.436
4 0.434 0.495 0.537 0.532 0.516 0.493 0.459 0.442
8 0.474 0.522 0.541 0.545 0.519 0.501 0.462 0.449
16 0.454 0.48 0.5 0.508 0.499 0.502 0.465 0.454
32 0.461 0.472 0.487 0.496 0.489 0.483 0.467 0.465
64 0.459 0.469 0.479 0.477 0.481 0.48 0.47 0.474
128 0.455 0.457 0.462 0.464 0.471 0.472 0.471 0.47
Max right
left 0 2 4 8 16 32 64 128
0 0.625 0.65 0.725 0.7 0.7 0.638 0.613
2 0.563 0.7 0.775 0.738 0.75 0.713 0.663 0.613
4 0.613 0.675 0.75 0.763 0.763 0.713 0.663 0.613
8 0.688 0.763 0.775 0.788 0.775 0.725 0.663 0.663
16 0.663 0.7 0.713 0.725 0.75 0.75 0.675 0.663
32 0.675 0.7 0.7 0.725 0.75 0.725 0.688 0.675
64 0.663 0.675 0.7 0.713 0.7 0.713 0.7 0.725
128 0.688 0.688 0.688 0.688 0.7 0.688 0.675 0.663
Table 10: Window size impact49
Table 10 shows that a symmetric window size of 8 on both sides provided the best
average and maximum results. The difference between the worst and best settings
was -28.6% for the maximum and -21.5% for the average results.
U
Average Maximum
100 250 500 1000 100 250 500 1000
Columns
3000 0.4903 0.4850 0.4676 0.4380 0.7250 0.7500 0.7250 0.7625
6000 0.5033 0.5005 0.4857 0.4600 0.7250 0.7625 0.7500 0.7625
9000 0.5069 0.5047 0.4978 0.4707 0.7500 0.7875 0.7500 0.7375
12000 0.5113 0.5096 0.5020 0.4741 0.7375 0.7750 0.7375 0.7375
15000 0.5114 0.5110 0.5069 0.4781 0.7375 0.7750 0.7500 0.7375
Table 11: Columns to SVD reduction impact49
Table 11 shows the effect of the dimension reduction of the U matrix in terms of
both average and maximum test scores. The reduction of columns to u gives a split
result: on average, a reduction to 100 dimensions is the best choice, but 250
dimensions achieved the maximum result. This indicates that under the right
circumstances there is information that benefits from a representation larger than
100 dimensions. A dimension reduction to u=100 appears to be resilient, generally
providing a good balance between retaining and amplifying features while reducing
noise, as the average results indicate.
sf Avg Max
‐1 0.5153 0.7875
‐0.5 0.5442 0.7875
0 0.5770 0.7750
0.5 0.5726 0.7625
1 0.4251 0.6375
2 0.3863 0.5500
4 0.3508 0.4750
Table 12: Singular factor impact49
The singular factor acts in a similar fashion to the dimensional reduction. Recall that
the singular factor sf scales the singular values (see Equation 18). Using an sf of 0,
which effectively ignores the S values, has the best average outcome (Table 12). On
the maximum results, a negative sf has a slight benefit. Ignoring S (Takayama et al.,
1999) or inverting its order produces the best results, while the traditional sf=1
(Deerwester et al., 1990) has a negative impact.
Avg Max
rows\cols 3000 6000 9000 12000 15000 3000 6000 9000 12000 15000
3000 0.4673 0.4874 0.4950 0.4992 0.5018 0.7375 0.7625 0.7875 0.7750 0.7750
6000 0.4711 0.7625
9000 0.4711 0.7500
12000 0.4710 0.7375
15000 0.4706 0.7375
Table 13: Rows to Columns impact49
Surprisingly, the number of columns does not have a tremendous impact (Table 13).
The maximum number of columns results in the highest average result, but roughly
half that was enough for the maximum result. We expected the number of rows to
have no significant impact because they do not add additional information to the
synonym test: the synonym words are the first rows51 in the co-occurrence matrix,
and their co-occurrence with the column words, not the rows, defines their
relationship.
51 80 questions with 4 answers each, with some re-occurring words, resulted in just under 400 rows inserted at the beginning.
gap Avg Max
0 0.4835 0.7625
2 0.4838 0.7625
4 0.4813 0.7750
8 0.4814 0.7750
16 0.4784 0.7875
32 0.4804 0.7625
64 0.4820 0.7750
128 0.4822 0.7750
Table 14: Gap impact49
We recall that the gap is the number of highest-order columns we ignore in order to
remove high-frequency terms that may be undiscriminating. The use of DF, the
corpus-wide term frequency, is an efficient and resilient way of selecting the
content-bearing words, as the minimal change in results when varying the gap
indicates (Table 14). The difference between no gap and a gap of 128, for example,
is insignificant.
3.8 Discussion
Landauer and Dumais (1997) reported 64.4%, with foreign students scoring 64.5%
on the same test. An implementation using random indexing achieved 70-72%
(Kanerva, Kristofersson, & Holst, 2000). The presented SSD implementation
accomplished 78.75% (Table 15). There have been results beyond 90% (Rapp, 2003),
but such systems use a variety of bells and whistles, such as external data sources
(Deerwester et al., 1990).
Correct49 row col u gap left right sf
0.7875 3000 9000 250 16 8 8 -1
0.7875 3000 9000 250 16 8 8 -0.5
0.775 3000 9000 250 128 8 16 0
0.775 3000 15000 250 8 8 4 0
0.775 3000 15000 250 64 8 4 -0.5
0.775 3000 15000 250 4 8 4 -0.5
0.775 3000 12000 250 8 8 4 -0.5
0.775 3000 12000 250 8 8 4 0
0.775 3000 12000 250 64 2 4 -0.5
0.7625 6000 3000 1000 32 8 2 0.5
Table 15: Top 10 results for TASA/TOEFL SSD
The experience gained from these experiments informed the settings in the
large-scale evaluation reported in the next section, even though adjustments are
beneficial because of the different experimental setup. Furthermore, the semantic
vectors produced by SSD are competitive when compared to state-of-the-art “no
frills” systems on TOEFL. “No frills” is important, as such systems are more easily
deployable in an application setting. We have provided an implementation of
Semantic Space Discovery grounded in conceptual space theory and extended by
semantic querying, semantic categorization and relationship information. The
TASA/TOEFL experiment demonstrates the quality of the semantic vector
representation arrived at by this model. The next two chapters evaluate the SD
model to address the two research questions and position the model against
alternatives.
4 Semantic Service Discovery Evaluation
We presented Service Discovery as a key challenge in the emerging Service
Ecosystem in the introduction and reframed it as an Information Retrieval task on the
Service Information Shadow (see 3.1). The central issue in the discovery process is
the likely imprecision in the expression of a consumer’s service need, as she will not
be aware of all the services available to her in the SES that may address her need,
nor of the terminologies describing them. Moreover, a consumer may have a vague
agenda and a subsequently poor understanding of her service need. This requires a
discovery system to be highly flexible, approximating meaningful results from
incomplete queries that possibly mismatch the services’ terminologies. A SD system
should therefore either return a collection of alternative solutions where the query is
expressive enough or otherwise approximate it conceptually to foster presumptive
attainment of knowledge by the searcher. We proposed that the SSD model
discussed in the previous chapter could achieve these objectives by imitating
abductive inference of concepts from a SIS through statistical semantics to find or
suggest meaningful services.
In this chapter, we evaluate the search and discovery of services with varying service
need knowledge by introducing a SIS-resembling data source and creating a
discovery scenario describing a complex service need. We simulate the need by
use-cases transformed into long, expressive queries (see next section). We degraded
the queries to simulate imprecise service need understanding. The queries to the SSD
system return ranked lists of service documents; the rank of the relevant one in the
list is the measure of the system's performance. The baselines for comparison for our
model are state-of-the-art IR systems, including vector space, probabilistic and
alternative semantic space systems, that perform the same tasks. We investigate
their performance and review details of the SSD model before closing the chapter
with a discussion.
4.1 SAP ES Wiki as a Service Information Shadow
Evaluating the SSD requires a Service Information Shadow of a Service Ecosystem,
a corpus related to services, and a number of discovery scenarios we can execute and
analyse. The SES is a heterogeneous, emerging system in an early stage of
development with high fragmentation, as we discussed in the first chapter. On the
professional side, SOA and SaaS dominate: private registries in governments,
corporations and industries, emerging online communities and research projects
compete with online/cloud SaaS solutions, while on the end-consumer side
application markets and web-based offerings grow with the surge in smart phones
and devices. The SOA solutions are domain specific, provide little natural semantic
content and struggle to break domain and industry barriers. SaaS solutions are
generally provider bound with closely linked application/service markets, while
application marketplaces are still limited to platforms with more than only
service-oriented software. Therefore, substitutes or partial data sets are the only ones
available for an evaluation until a single system or open standard for integration and
data access surfaces as a SES platform.
We propose to use the SAP Enterprise Service Wiki52, a web site dedicated to
describing service operations and bundles and to organizing related information, as a
data source to imitate a SIS. It resembles a SIS of a future SES because it is built by
a variety of sources/individuals (SAP employees, customers and guests), involves
services from many domains, includes secondary service-related information and
does not enforce a terminology or ontology, but does provide a loose (hyper-)link
structure (Figure 25). The wiki describes software objects like service operations,
service interfaces, process components and business objects. Each object has a web
page with a short description in the wiki or links to the SAP ES Workplace53, which
gives a view with object-related information from SAP databases and additional
user-provided information. The wiki home page organizes the 125 bundles in 30
groups. Bundles are user-provided collections of related objects. The bundles are
represented by web pages containing descriptions of the bundle (Figure 26) and links
to lists with (links to the) related objects. Bundle pages also contain one or several
use-cases54 that describe example application(s) of contained service operations
with a short text and a step-by-step list.
52 See https://wiki.sdn.sap.com/wiki/display/ESpackages/Home for details.
53 See http://www.sdn.sap.com/irj/bpx/esworkplace for details.
54 448 in the 125 bundles.
Figure 25: ES Wiki structure
The wiki by its nature is dynamic and constantly changing55, and not all wiki pages
describing objects have information beyond a template page with some links leading
to inaccessible SAP database views. The available data of 1,114 documents (not
including use-cases) is sufficient to constitute a corpus of text documents relating to
the different objects, including services, comparable to what can be expected from a
SIS, even with the missing/inaccessible descriptions. Another benefit is the wiki’s
hyperlinks, which can facilitate SD.
55 The wiki data used in this experiment has been downloaded on the 20th July 2009.
Figure 26: Example of bundle page (excerpt)
We view the service bundles as optimal entry points to search for or engage
combined services as described in the use-cases, i.e., as humans would engage them.
Consequently, we focus the discovery process on them instead of on the atomic,
functionally described service operations, which relate merely to SOA scenarios.
Service bundles combine related atomic services and thus provide entry points for
tasks reflecting service needs, much like the combined services we expect humans to
consume. Furthermore, bundles by their nature contain rich semantics in
unstructured texts and exemplary use-cases relating to service needs as they would
stem from an agenda. We therefore focus on the service bundles for our experiments
and on the use-cases as the source for service-oriented queries.
The wiki was downloaded by a simple web crawler56 and parsed by a purpose-built
program to split use-cases from bundles, extract links, convert the text to plain
ASCII (American Standard Code for Information Interchange) and remove non-word
characters, excluding abbreviation, email and URL information. We did not capture
all information, since the flexible structure of a wiki as well as human input errors
limit automated extraction, and some SAP database views were not freely accessible.
We saved the data in XML files containing texts with type
56 See http://www.gnu.org/software/wget/ for details.
information (e.g. bundle, service operation, etc.) and (link) relationships as well as
plain text documents.
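The cleaning step described above can be sketched as follows. This is a hypothetical approximation, not the purpose-built parser used in the experiments, and the regular expressions are illustrative:

```python
import re

def clean_text(raw: str) -> str:
    """Reduce wiki text to plain ASCII and strip non-word characters,
    keeping URL and e-mail tokens intact (a sketch of the pre-processing
    described above, not the original parser)."""
    # Drop any character outside the ASCII range.
    text = raw.encode("ascii", errors="ignore").decode("ascii")
    kept = []
    for token in text.split():
        if re.match(r"^(https?://\S+|\S+@\S+\.\S+)$", token):
            kept.append(token)  # preserve URL / e-mail tokens verbatim
        else:
            # Keep word characters, dots (abbreviations) and hyphens.
            cleaned = re.sub(r"[^\w.\-]", "", token)
            if cleaned:
                kept.append(cleaned)
    return " ".join(kept)
```

Applied to every downloaded wiki page, a routine of this kind yields the plain text documents stored alongside the XML files.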
4.2 Experimental Evaluation
4.2.1 Use-cases as Text Queries
The service bundles in the SAP ES Wiki, like Sales or Banking, group related service
operations, process components and business objects. The use-cases in the bundles
describe representative scenarios and simple tasks, e.g., requesting a postal pickup and
shipping service (Figure 27), for the bundle's services and objects. We suggest that
each of the 448 use-cases is analogous to the description of a service need, or task,
within an agenda, which we can use to discover the originating service bundle from
which the use-case derives, in the same way a SD query identifies a relevant
service-related document. As a first step, we remove the use-cases from the corpus
before the IR systems index it, to avoid a bias towards them. The resulting bundle
documents have an average length of 3,682 characters. Punctuation and Boolean
words (NOT, AND, OR) were removed from the queries to prevent errors or
confusion in the evaluated IR systems, which handle these in different ways.
Figure 27: Example use-case
The bundle from which the use-case/service need originates is the optimal solution.
In a first experiment, the use-cases are interpreted as long/full queries (named 100p)
describing the service need when searching the SIS. This should achieve high
rankings for the relevant bundles because of the rich semantics in the query. A 100p
use-case is on average 302 words long. In a second experiment, to simulate
incomplete user knowledge of a service need, we degrade the queries by randomly
selecting only 25% of the words from each use-case to make up the query (named
25p), reducing the
average length to 75 words. Please note that we performed the random selection only
once and all systems use the same 25p queries to ensure comparable results. Lastly,
we query for the bundles using only the titles (which we have ignored so far) of the
use-cases (named Titles). We were able to extract 413 use-case titles from the 448
use-cases from 123 of the 125 bundles. The shortest title is Tendering and the longest
Enable Sales Service Professionals to Provide Real Time Information on Product
Configuration Using a Third Party System to Confirm Configuration. The average
length of a title is 6 words after removing Boolean words.
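The 25p degradation can be illustrated with a short sketch. The sampling below is illustrative; as noted above, the thesis performed the random selection once and reused the same 25p queries across all systems, which the fixed seed mimics:

```python
import random

def degrade_query(query: str, fraction: float = 0.25, seed: int = 42) -> str:
    """Keep a random `fraction` of the query's words (the 25p setting),
    preserving their original order. The fixed seed mimics performing the
    selection once and sharing it across all evaluated systems."""
    words = query.split()
    rng = random.Random(seed)
    k = max(1, round(len(words) * fraction))
    keep = sorted(rng.sample(range(len(words)), k))  # sampled word positions
    return " ".join(words[i] for i in keep)
```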
4.2.2 Combined Query
The use-cases extracted from the bundles usually contain a paragraph description and
a table with the steps to execute the use-case including optional, related service
operation(s) to invoke with a step (Figure 27). In traditional information retrieval
scenarios and classical systems, the additional information of the service operations
is unusable beyond a possible keyword match. We have described before how to
compare or combine the vector representations of documents representing objects or
actions with other documents or queries. For example, if a consumer knows of a
service operation, bundle or business object relevant to her query, she can add its
vector representation to the query instead of approximating that information by
keywords, as she has to do in conventional IR systems. We propose to use the service
operations, where available, to expand the query by their document vector
representations.
This is an additional source of information not available to the conventional systems;
neither Infomap nor Semantic Vectors contains such functionality, though their SS
models would theoretically permit it. For comparability, we query the SSD model in
two modes. The first, called text query (TQ), is the classical text-only representation
of the use-case as a query. For the 25p and 100p queries, we also present an
alternative mode called combined query (CQ), where the sum of the service
operation representations (as explained in chapter 3.5.1) extends the text query
vector. Our intention is to demonstrate that additional query information that is
sometimes available to a searcher but difficult to express in conventional systems
can further enhance discovery.
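The idea can be sketched as adding service operation document vectors to the text query vector and ranking by cosine similarity. This is a minimal illustration under assumed toy vectors, not the model of chapter 3:

```python
import numpy as np

def combined_query(text_query_vec, operation_vecs):
    """CQ mode: extend the text query vector by the sum of the document
    vectors of known service operations (a sketch of the idea above)."""
    return text_query_vec + np.sum(operation_vecs, axis=0)

def rank(query_vec, doc_vecs):
    """Rank documents by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = {name: float(q @ (v / np.linalg.norm(v)))
            for name, v in doc_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)
```

Adding a known operation vector can pull the ranking towards the bundle containing that operation even when the query keywords point elsewhere.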
4.2.3 Performance Measures
The queries return a ranked list, which contains only one document relevant to the
query. The rank of the single relevant bundle object in the list of bundle objects
returned is the measure of precision with which a system retrieves the appropriate
solution. Since there is only one relevant document and it usually is within the result
set, the traditional IR performance measures of mean average precision and recall do
not apply.
Each system returns for each query a ranked list of documents containing bundle and
other related documents57. We retrieve the first 1,000 documents and filter them for
the top 100 bundles. For each query qi out of n queries, the rank ri of the correct
bundle in the list of 100 bundles is noted. The averaged result is the measure of
Average Rank (AR; Equation 24). Sometimes the correct result is not in the top 100.
We therefore extend the measure to the AAR (Adjusted Average Rank), which
counts each of the m missing bundles as retrieved at the next best rank of 101, as
shown in Equation 25. The AAR thus approximates the best possible AR an IR
system could have achieved if all bundles missing from its top 100 had a rank of
101.
Equation 24: Average Rank

$AR = \frac{1}{n}\sum_{i=1}^{n} r_i$

Equation 25: Adjusted Average Rank

$AAR = \frac{1}{n}\Big(\sum_{i=1}^{n-m} r_i + m \cdot 101\Big)$

where $r_i$ is the rank of the correct bundle for query $q_i$ and $m$ is the number of queries whose correct bundle is missing from the top 100.
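The AAR can be computed directly from the observed ranks; a straightforward sketch of the measure:

```python
def adjusted_average_rank(ranks, n_queries, miss_rank=101):
    """AAR: average rank of the correct bundle over n_queries queries.
    `ranks` holds the ranks of queries whose correct bundle appeared in
    the top 100; every missing bundle counts at the next best rank, 101."""
    m = n_queries - len(ranks)  # queries whose bundle is not in the top 100
    return (sum(ranks) + m * miss_rank) / n_queries
```

For example, three queries at ranks 1, 1 and 2 plus one miss give (4 + 101) / 4 = 26.25.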
4.3 Baseline IR systems
We can now measure the performance of the SSD model with the AAR. We propose
to compare it with state-of-the-art IR systems in order to determine its relative
performance. All the baseline systems are classical IR models (see 2.2.2 on page 28)
applying to unstructured text corpora and follow the basic IR system model (see
57 SSD actually returns documents by type and was set to return the top 100 bundles directly.
2.2.4 on page 32) indexing the corpus and querying their index with different ranking
functions. We established in the literature review that competitive unstructured text
models are the probabilistic, the traditional vector space and the dimensionally
reduced Semantic Space models, which we compare in the following.
The probabilistic model is represented by the research software Zettair58 from the
search engine group of the Royal Melbourne Institute of Technology (Billerbeck et
al., 2004; Garcia, Lester, Scholer, & Shokouhi, 2006) utilizing the popular and
widely employed BM25 ranking. The state-of-the-art open source Apache Lucene
project59, which is widely used in commercial applications60, represents the
prominent vector space model:
“Lucene scoring uses a combination of the Vector Space Model (VSM) of
Information Retrieval and the Boolean model to determine how relevant a
given Document is to a User's query. [...] It uses the Boolean model to first
narrow down the documents that need to be scored based on the use of boolean
logic in the Query specification. Lucene also adds some capabilities and
refinements onto this model to support boolean and fuzzy searching, but it
essentially remains a VSM based system at the heart.” (Ingersoll, 2009)
Since we aim to provide evidence for the advantages of the SS model for the SD
task, we provide our own model as introduced in the previous chapter. Furthermore,
we review two alternative SS systems to identify any benefits that may result from
our particular implementation and algorithm over established SS systems. The two
alternative SS models are Infomap and Semantic Vectors61 (Widdows & Ferraro,
2008). The Center for the Study of Language and Information (CSLI) at Stanford
University developed the research software Infomap utilizing a SVD reduced HAL
based SS inspiring the SSD model introduced in the previous chapter. A novel SS
system is Semantic Vectors using the fast Random Indexing (Kanerva et al., 2000) as
an alternative to the computationally more expensive SVD. The Office of
58 See http://www.seg.rmit.edu.au/zettair/ for details.
59 See http://lucene.apache.org/ for more details.
60 See http://wiki.apache.org/lucene-java/PoweredBy for details.
61 See http://code.google.com/p/semanticvectors for details.
Technology Management at the University of Pittsburgh began the development of
the SV package, which is now under active development as an open source project
with support of Google.
We tested all systems with a wide range of parameters and report only the best
results and parameter settings for the experiments here. The systems use the same
corpus and query data, pre-processed to remove potentially problematic characters
(corpus and queries) and Boolean words (queries only) to prevent ill-formatted or
misleading inputs.
4.3.1 Semantic Service Discovery (SSD)
We established the SSD optimal parameter settings (Table 18) in three steps.
Exploratory tests estimated parameters based on previous experience with Infomap.
We recognized that TF-IDF as a term weight for parsing and row/column order
generally achieves the better results and decided to keep tt and tw fixed to it. We
also established a broad range of parameters to explore (Table 16), covering
27,484,800 individual results.
We set rows and columns to a maximum of 6,000 since the SSD identifies fewer
than 7,000 individual terms. We expected the rows to perform better at the larger
setting since it would enhance document representation. The TF-IDF filtering of
terms for the columns might not necessarily profit from a full representation but may
rather act as a kind of noise reduction with fewer than 6,000 columns. This prompted
us to explore the column setting more than the rows.
We varied the gap and window sizes from 0 to 150 to capture a broad spectrum, to
be refined in the next run if a significant subrange could be identified. The SVD
reduction was set between 100 and 400 dimensions. Infomap usually arrives at much
fewer, but we have found that larger values can have positive effects since we use a
precise SVD algorithm instead of a converging one as used by Infomap. The singular
factor uses the same range for all experiments. The negative settings may not have
positive effects but are included for comprehensiveness. The most interesting
settings are 0 (no S value), 0.5 (smoothing effect), 1 (original S values) and
amplification (beyond 1) of the S value scaling effect. We used link weight settings
of 0% and 50% to examine whether it has a noteworthy influence on the optimal
parameter range selection. We tested the query factor in settings from 0 to 3.
SS Parameter  Value
rows          4,000, 5,000, 6,000
cols          2,000, 3,000, 4,000, 5,000, 6,000
cg, lw, rw62  0, 25, 50, 100, 150
tw, tt        TF-IDF
u             100, 150, 200, 250, 300, 350, 400
sf            -1, -0.5, 0, 0.5, 1, 2, 4
lnkwght       0%, 50%
qf            0.0, 0.2, 0.4, …, 3.0
Table 16: Use-cases Semantic Space parameters exploratory run63
The results from the first parameter evaluation experiment led to a second one
(Table 17), with 7,392,000 individual results processed during parameter exploration.
We focused on the maximum rows and columns as they had shown the most
promise. The gap was more effective from 25 upwards, possibly including 175. The
window size returned no conclusive results and the same parameter range was
retained. The effect of the singular factor is of key interest and we maintained the
full parameter range. The final experiment setting includes a full investigation of the
link weight influence from 0% (no link weight) to 90%. A setting of 100% was not
possible, since not all documents contain a link and thus 100% would be undefined
for these documents. We changed the query factor range to 1 to 3.
SS Parameter  Value
rows          6,000
cols          6,000
cg            25, 50, 100, 150, 175
lw, rw64      0, 25, 50, 100, 150
tw, tt        TF-IDF
u             150, 200, 250, 300, 350, 400, 450, 500
sf            -1, -0.5, 0, 0.5, 1, 2, 4
lnkwght       0%, 10%, …, 90%
qf            1.0, 1.2, 1.4, …, 3.0
Table 17: Use-cases Semantic Space parameters refinement run63
The second parameter range (Table 17) is the basis for the results section (see 4.4).
Table 18 lists the optimal results from the second parameter range.
62 Excluding lw and rw equal 0.
63 See section 3.6.1 for details on parameters.
64 Excluding lw and rw equal 0.
Experiment  LnkWght  Query65  Row    Col    U    Gap  LW   RW   SF   UQ  QF
Titles      30%      TQ       6,000  6,000  200  25   150  100  0.5  F   1.2
Titles      0%       TQ       6,000  6,000  450  25   150  150  0.5  F   3
100p        20%      CQ       6,000  6,000  200  50   50   100  0    F   1
100p        0%       CQ       6,000  6,000  200  50   25   100  0    F   2
100p        20%      TQ       6,000  6,000  200  50   25   100  0    F   1
100p        0%       TQ       6,000  6,000  200  50   25   100  0    F   2
25p         20%      CQ       6,000  6,000  200  25   50   100  0    F   1.2
25p         0%       CQ       6,000  6,000  200  25   100  100  0    F   3
25p         20%      TQ       6,000  6,000  200  25   50   100  0    F   1.2
25p         0%       TQ       6,000  6,000  200  25   100  100  0    F   2.6
Table 18: SSD optimal query experiments parameters
4.3.2 Zettair
Zettair (version 0.9.3) indexed the corpus as a list of text documents. We queried
with the default settings as well as with Okapi BM25 term weighting enabled, using
the top 1,000 results. Across all runs, BM25 was superior and we report it here
instead of the default setting.
4.3.3 Lucene
Lucene is a mature, state-of-the-art IR system, tested in real-world applications,
reviewed and fine-tuned by professional developers, so we chose to use it in the
default settings. We used the version 2.4.1 which was current at the time of the
experiments.
4.3.4 Semantic Vectors
Lucene (version 2.4.1) generated the indices for Semantic Vectors (version 1.2.3).
We tested the default index and the Semantic Vectors library's66 positional index.
The default Lucene index is a bag of words, i.e., an inverted index, while the
positional index uses a sliding window that considers in-document word positions,
like the one used in the term co-occurrence matrix. The window sizes used were 1, 3
and 9, which cover the optimal range (P. Bruza & Sitbon, 2008). We processed the
default index with 2, 4
65 TQ refers to text queries and CQ to combined queries including service operation vectors.
66 See http://code.google.com/p/semanticvectors/wiki/PositionalIndexes for details.
and 8 training cycles67. Training cycles rerun the SV algorithm in the hope of
improving results. We tested no more than 8 cycles since results degraded strongly
with more. The querying included the default, training-cycle and positional indices
using the default, subspace, sparsesum and maxsim query settings68. The default
Lucene index queried with default settings achieved the best results in the top 1,000
and we report these as the SV results.
4.3.5 Infomap
Infomap in the latest version 0.8.6 indexed the wiki as a multi-document text corpus.
The query was set to return the top 1,000 documents. The co-occurrence matrix size
was 20,000 by 5,000 with a window size of 50 on the left and right each. Larger row,
column and window settings did not improve performance, while smaller ones
slowly degraded it. We reduced the matrix with a maximum of 500 SVD iterations
(SVD_ITER) and to a maximum of 500 columns (SINGVALS). Infomap uses a
Lanczos SVD algorithm (Golub & van Loan, 1996). The algorithm converged within
these iterations and with lower dimensionality, thus larger SVD values are
ineffective, and we used the optimal settings.
4.4 Results
4.4.1 IR systems comparison
We compiled the results of the three different experiments in Figure 28. The measure
used to evaluate performance is the AAR. The AAR ranges from 1 (all queries
returned the correct result at rank 1) to 101 (all queries failed to return the correct
result in the top 100). The rank of the correct result is important since only the first
few results in a ranked list are likely to receive attention from a searcher (Granka,
Joachims, & Gay, 2004; Moffat & Zobel, 2008). Great differences in the AAR
indicate superiority of one method over another. Establishing the significance of an
AAR difference, however, requires a statistical evaluation. We chose a paired,
two-tailed t-test, which has been shown to be a resilient and strong statistical evaluation to
67 See http://code.google.com/p/semanticvectors/wiki/TrainingCycles for details.
68 See http://code.google.com/p/semanticvectors/wiki/SearchOptions for details.
identify significance in IR (Sanderson & Zobel, 2005; Smucker, Allan, & Carterette,
2007). We compare the sets of query results between the SSD variations and the
baseline IR systems in Table 19 later in the section.
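The paired t statistic underlying this evaluation can be computed from the per-query rank differences of two systems. A stdlib sketch; in practice a library routine such as scipy.stats.ttest_rel yields the two-tailed p-value directly:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(ranks_a, ranks_b):
    """Paired t statistic over the per-query ranks of two systems; the
    two-tailed p-value is then read from a t distribution with n-1
    degrees of freedom."""
    diffs = [a - b for a, b in zip(ranks_a, ranks_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```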
Experiment  System            AAR
Titles      SSD, TQ, 30% LDV  4.419
Titles      SSD, TQ           5.061
Titles      Infomap           6.269
Titles      Lucene            6.312
Titles      Zettair           5.738
Titles      SV                7.341
25p         SSD, CQ, 20% LDV  1.795
25p         SSD, TQ, 20% LDV  2.199
25p         SSD, CQ           2.362
25p         SSD, TQ           2.406
25p         Infomap           2.946
25p         Lucene            3.984
25p         Zettair           4.313
25p         SV                10.167
100p        SSD, CQ, 20% LDV  1.275
100p        SSD, TQ, 20% LDV  1.288
100p        SSD, CQ           1.350
100p        SSD, TQ           1.346
100p        Infomap           1.547
100p        Lucene            1.931
100p        Zettair           2.703
100p        SV                7.252
Figure 28: Use-case query results
In Figure 28, we immediately see that with decreasing query length the AAR
increases for all but the SV system. This confirms the expectation that longer queries
are more expressive. Only SV, a Semantic Space based on a Lucene index and
random projection, is unable to utilize the richer query details. In all experiments SV
has the noticeably highest AAR and performs significantly worse than any SSD
variant. This may be because SV is not optimized for document retrieval.
Zettair has a noticeably higher AAR than Lucene, Infomap and the SSD variants in
the 100p experiment. In the 25p experiment, Zettair is closer to Lucene but still fares
much worse than Infomap and SSD. In both experiments, the Zettair result is
significantly inferior to the SSD ones. In the Titles experiment, Zettair, Infomap and
Lucene all have a higher AAR than both SSD variations. Interestingly, the Zettair
and Lucene results are not significantly different from the plain text query SSD
result.
Lucene performs better than Zettair in the 100p and 25p experiments. In both,
Lucene is significantly inferior to the SSD variants, similarly to Zettair. Just like
Zettair, Lucene does not perform significantly worse than the SSD TQ system in the
Titles experiment, despite a higher AAR than SSD TQ and Zettair. The SSD TQ
with link weight outperforms Lucene and Zettair, though.
                      SSD TQ            SSD CQ
100p      0% LDV  20% LDV  0% LDV  20% LDV
Infomap   0.0006  0.0000   0.0009  0.0000
Lucene    0.0001  0.0000   0.0001  0.0000
Z Okapi   0.0001  0.0000   0.0001  0.0000
SV        0.0000  0.0000   0.0000  0.0000
25p       0% LDV  20% LDV  0% LDV  20% LDV
Infomap   0.0069  0.0002   0.0091  0.0001
Lucene    0.0001  0.0000   0.0001  0.0000
Z Okapi   0.0003  0.0001   0.0002  0.0000
SV        0.0000  0.0000   0.0000  0.0000
Titles    0% LDV  30% LDV  (TQ only)
Infomap   0.0158  0.0012
Lucene    0.0555  0.0058
Z Okapi   0.1908  0.0191
SV        0.0001  0.0000
Table 19: Significance of results by paired, two-tailed t-test69
The SSD model returns superior results in nearly all situations. It utilises long
queries particularly well. In all experiments, the SSDs' AARs are lower than the
baseline systems'. Nevertheless, in the case of short queries, SSD with plain text
queries does not achieve significantly better results than Lucene and Zettair. These
short-query situations are typically the domain of these inverted index systems and
their performance does not come as a surprise. It is encouraging that the SSD in its
simplest form can compete with them. The utilization of link weights does provide a
69 Cells contain p-value with bold results significantly different (p<0.05).
significant advantage to the SSD, though, and it significantly outperforms all
systems in the Titles experiment. In all experiments, the addition of a modest
(20-30%) link weight has shown improvements. We achieved the best results (in 25p
and 100p) when we added combined queries and link weight to the SSD model. Due
to the nature of the data source, we were not able to reliably extract combined
queries for the use-case titles with reasonable effort and therefore only present the
text queries for that experiment.
4.4.2 SSD in detail
The previous section compared the various IR systems in the orthodox unstructured
text IR model with the Semantic Service Discovery system on the exemplary use-
case scenario simulating Service Discovery as directed search. In this section, we
review how some of the parameters in the model, particularly link weight and
singular factor, influence the SSD outcome. To this end, we analyse the results from
the second parameter range SSD experimental run covering 7,392,000 variations
using the fixed 6,000 rows and columns.
LDV weighting
The use-case experiments illustrated the benefit of Linked Document Vectors. The
link weight in all queries in the experiment ranged from 0% to 90% in steps of 10%.
To provide an overview of the impact of the weights, we present the minimum,
median and average AAR for the Titles, 25p and 100p queries over the 10
weightings (Figure 29).
Figure 29: SSD query results with varying LDV weights
Link weight influences all queries in a similar manner across median, average and
minimum AAR. The worst results are at 90% link weight, which largely replaces the
document's original (text) vector with a combination of the linked documents' text
vectors. There is a recurring trend with an optimum around 20-40% and degrading
AAR surrounding it.
The baseline for the LDV weighting is 0%, which is equivalent to the traditional
text-only document vectors used in Semantic Spaces to date. Since 20% to 40%
displayed the best improvements, we provide a detailed view of them in Figure 30,
showing the percentage improvements with the 0% weighting as a baseline. For
example, an AAR of 3 at 0% and of 2 at 30% lnkwght would be an improvement of
33.3%. The diagram illustrates that all query types benefit strongly from the LDV.
Particularly the
average and median results for medium and long query lengths benefit strongly (up
to 32%). Nevertheless, the 25p and Titles minimum AARs improved considerably
too (12-24%). The minimum 100p result was the least impacted by the LDV; since
this result was nearly optimal (close to an AAR of 1) to begin with, the possibility of
further improvement was limited.
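The LDV construction can be sketched as a weighted blend of a document's own text vector with those of the documents it links to. The combination rule below is illustrative, not the exact formulation of chapter 3:

```python
import numpy as np

def linked_document_vector(text_vec, linked_vecs, lnkwght=0.2):
    """Blend a document's text vector with the average vector of its
    linked documents; a lnkwght in the 20-40% range was optimal above.
    At 90% the text vector is largely replaced; at 0% it is unchanged."""
    if not linked_vecs:
        return text_vec
    link_part = np.mean(linked_vecs, axis=0)
    return (1.0 - lnkwght) * text_vec + lnkwght * link_part
```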
Experiment  Result   20%    30%    40%
Titles      Minimum  11.3%  12.7%  10.9%
Titles      Average  14.1%  15.7%  14.6%
Titles      Median   12.8%  15.3%  14.7%
25p         Minimum  24.0%  22.6%  19.8%
25p         Average  22.7%  28.4%  30.9%
25p         Median   25.5%  29.9%  29.9%
100p        Minimum  5.3%   3.2%   3.2%
100p        Average  22.2%  29.0%  32.5%
100p        Median   26.5%  29.8%  29.5%
Figure 30: Improvements in AAR from no to optimal LDV
Singular Factor
Figure 31: Singular Factor influence on AAR
Figure 31 provides an overview of how the singular factor influences the SSD
outcome. The immediately identifiable shared characteristic across the three query
experiments and across the average, median and minimum results is an optimum sf
of 0 or 0.5. A closer investigation (Figure 32), with the unmodified singular values
(sf=1) as recommended (Deerwester et al., 1990) as a baseline, reveals that for 25p
and 100p the best result is achieved with sf=0. This is equivalent to ignoring the
singular values in the Semantic Space creation, much like in the Infomap and
Wordspace models (Schütze, 1998; Takayama et al., 1999). The improvements range
from 58% to 81% on average and still reach an impressive 40% to 50% on the minimum.
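The singular factor can be sketched as an exponent on the singular values when building the reduced document vectors. This is an illustrative reading: sf=1 corresponds to classical LSA-style scaling, sf=0 to ignoring S as in Infomap/Wordspace, and sf=0.5 to smoothing:

```python
import numpy as np

def reduced_vectors(matrix, k=200, sf=0.0):
    """SVD-reduce a co-occurrence matrix to k dimensions, scaling the
    result by the singular values raised to the power sf (sf=0 ignores
    them, sf=0.5 smooths them, sf=1 keeps them as in classical LSA)."""
    U, S, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * (S[:k] ** sf)  # S**sf broadcast over the k columns
```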
Experiment  Result   sf=0.0  sf=0.5
Titles      Minimum  7.5%    13.1%
Titles      Average  5.9%    13.1%
Titles      Median   6.2%    14.4%
25p         Minimum  50.6%   32.9%
25p         Average  54.9%   37.2%
25p         Median   58.9%   38.1%
100p        Minimum  40.1%   29.2%
100p        Average  76.0%   49.1%
100p        Median   81.3%   54.7%
Figure 32: Improvements from sf=1 to 0.0 and 0.5
The singular values can be beneficial, however, as the Titles experiment shows.
Applying them with an sf of 0.5, i.e., a smoothed square root of the original singular
values, yields an improvement of 13% on the minimum and 13% on the average,
which is better than what is achievable with either sf=1 or sf=0.
Query Term Frequency
The query parameter uq instructs the system to either use or ignore term frequency
in a query. When uq is set to on/true, the system uses every term vector only once in
constructing the query vector, independent of the frequency of a query term in the
query. Let us call the alternative setting, when uq is 'off', fq for frequency query. In
this case, the system adds every occurrence of a term in the query to the query
vector.
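The difference between uq and fq can be sketched in a few lines; the term vectors are illustrative toy values, while the real system uses the Semantic Space term vectors:

```python
import numpy as np

def query_vector(terms, term_vecs, unique_terms=True):
    """Build a query vector from term vectors. With unique_terms (uq)
    each distinct term contributes once; otherwise (fq) every occurrence
    is added, so repeated terms pull the query towards them."""
    if unique_terms:
        terms = list(dict.fromkeys(terms))  # de-duplicate, keep order
    vecs = [term_vecs[t] for t in terms if t in term_vecs]
    return np.sum(vecs, axis=0)
```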
Experiment  Result   FQ      UQ
Titles      Minimum  4.419   4.448
Titles      Average  11.387  11.390
Titles      Median   8.874   8.838
25p         Minimum  1.795   1.911
25p         Average  8.761   10.260
25p         Median   5.853   7.174
100p        Minimum  1.275   1.304
100p        Average  6.496   11.232
100p        Median   3.161   7.127
Figure 33: Difference between unique and frequency queries
Figure 33 displays the difference in the results between the two options. We
recognize that for long and medium-sized queries fq is the better choice. The minute
differences in the Titles experiment are not surprising, since the brevity of the title
queries makes recurring query terms unlikely. This comparison shows that
emphasizing a particular aspect of a query through repetition of a term is an effective
means to weight and focus the query. More importantly, the average and median
results indicate that it is also an effective means to counter sub-optimal space
parameters. The improvements on the near-optimal minimum results are modest.
Over the sum of space parameter variations, a long query that permits weighting
through repetitive term use can roughly halve the average result's rank. When we
compare the 100p and Titles average uq results, we recognize that on average the
length of a query is only beneficial with the weighting of frequent (and thus
important) terms.
Combined Queries
We were not able to source a reliable set of service operation links for the Titles
experiment from the SAP ES Wiki due to the data quality. Therefore, we only
present the Combined Queries for the 100p and 25p experiments in Figure 34.
Experiment  Result   TQ to CQ improvement
25p         Minimum  18.4%
25p         Average  9.6%
25p         Median   18.5%
100p        Minimum  1.0%
100p        Average  4.4%
100p        Median   15.8%
Figure 34: Combined Query vs. Text Query
We can clearly identify the benefit the combined queries provide on the median
results. The average improvement is weaker and implies that the CQ is sensitive to
highly suboptimal parameter settings. The combined query cannot significantly
improve on the optimal 100p result; since this particular result is already close to an
AAR of 1, the margin for optimisation is expectedly narrow. The 25p minimum
result, however, improves strongly when utilizing combined queries.
Trends
The remaining variables have shown only minor trends on the average and median
results, with no definite influence on the optimal/minimum AAR results. The query
factor (Figure 35) displays a slight preference towards a setting of 2.0 on the average
and median results. The minimum results tend to be better with a neutral (1.0) qf.
Figure 35: Query factors’ influence on AAR
The dimensional reduction (SVD Figure 36) shows a discernible benefit from smaller
(k=200) settings on average and median results across all three experiments. The
minimum at the 25p queries shows a light preference to k=300 not reflected in the
other two minimums.
Figure 36: SVD reduction to k dimensions
The gap parameter (Figure 37) generally provides better results with a smaller
setting (25) for the median and average results. There is no observable trend for the
minimums.
Figure 37: Gap
The left and right windows (Figure 38 and Figure 39) both favour smaller settings
(25 to 0) for the median and average results, without any certain preference for the
minimums.
Figure 38: Left window
Figure 39: Right window
4.5 Discussion
In this chapter, we evaluated the ability of traditional IR and different SS systems to
identify service-related information, and by extension services, effectively by means
of text queries of varying detail. The variation in query detail simulated how well the
systems cope with a decreasing precision of the service need description. We
reviewed how some of the SSD parameters, in particular the novel LDV and the
singular values, influence the experimental outcomes.
There is a clear trend that with increasing query length all systems (except for SV)
improve the ranking of the correct bundle. Semantic Vectors, based on Random
Projection, performed poorly across all experiments. The conventional probabilistic
and vector space systems nearly always performed worse than the SVD-based
Semantic Spaces. Lucene performs better than Zettair with increasing query length,
but Zettair outperformed Lucene and Infomap in the Titles experiment. SSD
consistently provided the best AAR with TQ and CQ, with and without LDV. This is
most likely due to the Semantic Space being a more expressive representation of
underlying concepts that are transparently accessible through the document and term
vectors. The fact that the performance of SSD degrades less than that of the other
systems when the precision of the query degrades supports this conclusion.
It is clear from the results that Semantic Spaces are a more effective means to search for services via service-related information, independent of query quality. We also demonstrated that expanding queries beyond text to utilize vector representations of known relevant information, i.e., CQ with service operation vectors, further improves retrieval performance. The LDV results emphasize the benefit of exploiting relationship information in a corpus in combination with traditional statistical semantics. As would be expected, it has little benefit in the optimal situation of a long, expressive query in a space with near-perfect parameter settings for a specific corpus, where the AAR is already close to 1. In all other, sub-optimal situations of imperfect space parameters or less expressive queries, the utilization of LDV, and often CQ, boosts the precision of the query and the space. Since a real-world setting is more often than not sub-optimal, e.g., through default parameters, changing corpora or brief queries, the addition of LDV provides a very noticeable gain in resilience and overall performance. Interestingly, the addition of service operation vectors in the form of combined queries provides a visible boost in performance only in connection with LDVs. Many service operation documents contain little to no text but some links to related documents. The consistent boost from using LDVs with combined queries was greater than anticipated, considering that the queried bundle documents are in general detailed texts containing rich semantic information. It may be that they benefited indirectly from an increased disambiguation of semantically poorly expressed but linked documents, such as the service operations, which improved the quality of the (document) space overall. We identified an optimal range around an LDV weighting of 20% to 40%, with a preference towards 20%.
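As a hedged illustration of the LDV idea, the sketch below blends a document's text vector with the centroid of the vectors of the documents it links to, at a configurable link weight. The blending rule and the function name are our assumptions for illustration only; the actual LDV construction is defined in section 3.4.2. The default of 0.2 matches the optimal range of roughly 20% to 40% reported above.

```python
def linked_document_vector(text_vec, linked_vecs, lnkwght=0.2):
    # Illustrative blend (assumed, not the thesis's exact formula):
    # (1 - lnkwght) * own text vector + lnkwght * centroid of linked docs.
    if not linked_vecs:
        return list(text_vec)
    n = len(linked_vecs)
    centroid = [sum(col) / n for col in zip(*linked_vecs)]
    return [(1 - lnkwght) * t + lnkwght * c for t, c in zip(text_vec, centroid)]

# a text-poor document pulled towards its single linked neighbour
print(linked_document_vector([1.0, 0.0], [[0.0, 1.0]], 0.2))  # [0.8, 0.2]
```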
The review of the singular factors challenges both current notions about S values. Deerwester et al. (1990) argued that they are necessary in the reconstruction of the row relationships. If this is true, then the original row relationships are less effective in representing the semantic associations than the sole left matrix of the SVD or, in the case of the Titles, the left matrix with a smoothed version of the singular values. The Titles experiment likewise challenges the Wordspace approach of ignoring S values. The experiments showed that there is value in incorporating (smoothed) S values in some instances and in ignoring them in others. Overall, reproducing only a least-squares-error approximation of the original co-occurrence matrix is not optimal according to our findings. We cannot settle the meaning and optimal use of S values with this research but put forward our results to encourage future investigations.
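One way to picture the singular factor is as an exponent on the singular values when forming row vectors. The sketch below assumes row vectors are built as U·S^sf; under this reading, sf=1 gives the classic LSA weighting, sf=0 discards the singular values (the Wordspace approach), 0 < sf < 1 smooths them and sf < 0 inverts their influence. This is an illustrative assumption; the exact construction is defined in chapter 3.

```python
def apply_singular_factor(U, S, sf):
    # Reweight each column of the left matrix U by the corresponding
    # singular value raised to sf (assumed reading of the sf parameter).
    weights = [s ** sf if s > 0 else 0.0 for s in S]
    return [[u * w for u, w in zip(row, weights)] for row in U]

U = [[1.0, 2.0], [3.0, 4.0]]
S = [4.0, 1.0]
# sf=0.5 smooths the singular values: weights become sqrt(S) = [2.0, 1.0]
print(apply_singular_factor(U, S, 0.5))  # [[2.0, 2.0], [6.0, 4.0]]
```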
In summary, we have demonstrated that Semantic Spaces promote effective service
retrieval in a Service Ecosystem as stated in the first research question. Furthermore,
we showed that (SVD-based) Semantic Spaces outperform keyword systems for long, degraded and title queries. We established that LDVs and the addition of non-textual query information further improve a Semantic Space's query performance, and provided a review of LDV weighting. The exact benefit of S values is unclear
and warrants further research.
5 Semantic Service Categorisation Evaluation
The previous chapter investigated the first research question by evaluating SD as an
IR task by means of a Service Information Shadow simulated by the SAP ES Wiki.
We compared state-of-the-art IR systems with Semantic Space systems and
examined the Semantic Space innovations of manipulating the singular values and
adding SIS structure information.
In this chapter, we investigate the second research question: whether Semantic Categories, inspired by conceptual space theory, provide a meaningful and effective map of the Service Ecosystem for exploration. We explore this by comparing the Semantic Categories to manual groupings using a sophisticated quantitative measure and a qualitative review of the results. The former provides an objective means of comparison and the latter some insight into how meaningful the Semantic Categories may be from the user's perspective. The quantitative measure is a comparison against state-of-the-art clustering algorithms. Moreover, we review the influence of the perspectives, the Linked Document Vectors and the manipulation of the singular values through the singular factor.
The chapter's structure is as follows. We start with a section on the experimental setup. It presents the manual groupings used as a baseline, the two perspectives on the space and details of the choice of measure for comparing the categorization and clustering with the manual baseline. The subsequent section reviews the state-of-the-art clustering algorithms, their experimental setup and their results. The Semantic Categorization section follows, describing its experimental setup and results as well as a qualitative review of the results. The chapter ends with a discussion and comparison of the clustering, categorization and manual grouping.
5.1 Experiment
The experiment splits into two parts: the state-of-the-art clustering algorithms and the Semantic Categorization. Preceding them, in this section, we identify what information we want to organize, what would constitute an optimal solution and how to compare the solutions.
We recall the SES scenario where a service consumer is searching for combined or related services to address a service need originating from an agenda. We are not anticipating widespread atomic service operation invocation or selection by the consumer. Instead, the combination and brokerage of services, or the provision of whole service bundles, addresses the lack of business aspects in current service delivery (Cardoso et al., 2010). The SAP ES Wiki contains bundles provided by the wiki's users that aggregate information about tightly related service operations and objects. The bundles include use-cases describing tasks as we expect them to arise from a complex service need agenda. We propose that the wiki bundles relate well to SES service bundles. In short, these bundles are meaningful clusters of services, which together address some service agenda. In addition, the bundles have associated categories, which are topical and relevant to the intention of the bundle to address the related agenda.
5.1.1 Data
The second scenario we introduced in this thesis assumed the need for an overview of the Service Ecosystem or a part of it. This would be the case when a searcher has a poor understanding of the agenda or the tasks it entails, and may be useful in additional situations, e.g., ontology design or product management. Important in this scenario is to minimize the effort the searcher has to expend when faced with a huge and
dynamic system like the SES. This scenario is a pragmatic assumption when we recall the recurring pattern of categorizing, tagging, grouping and otherwise organizing larger sets of information by humans, e.g., in libraries or service registries, to provide that kind of overview. Moreover, the SAP ES Wiki itself has such an overview page organizing the 125 bundles into 30 bundle groups of related topics70. The wiki users create and maintain the page as a quick and easy way to navigate the wiki. They rely on a shared (human) model (see Conceptual Space theory, 2.3.1) of organizing the underlying concepts instead of each creating their own 'mental map' of the wiki. They accept differences in how the model applies to them because of personal biases and experiences, since the effort and time 'saved' by not building a personally optimal model exceeds this imprecision 'cost'. We anticipate this kind of explorative search to be useful in the future SES.
70 See SAP ES Wiki Grouping in the appendix for details.
The size and ever-changing nature of the SES prohibit a manual categorization. This is comparable to the web catalogues, e.g., as attempted by Yahoo, that were popular in the early stages of the WWW and have declined since, because the web became too large and fast-paced. Consequently, the web requires computationally time- and space-efficient algorithms; it cannot utilize sophisticated methods like clustering or categorization, forgoing precision in favour of 'speed/space' (Baeza-Yates & Ribeiro-Neto, 2011, chapter 11). As a result, the prevalent mode of exploring the web is directed search, e.g., Google's web search, which is efficient and scalable, and with the support of computational facilities captures a large and timely picture of the web. To gain an overview of a topic area with such a system, a user needs various directed searches and evaluations of the results. Its effectiveness depends on the user's ability to anticipate and accordingly formulate queries that cover the topic of interest. Unfortunately, exploratory search by this means is very time-consuming and not truly possible if the user has a poor understanding of the topic of interest.
Figure 40: Practical topical structuring of different corpora
We propose that the SES, or more precisely the associated SIS, unlike the web, will not grow to an unorganisable size (Figure 40). Traditional, manual means are not applicable for organizing such an intermediate corpus topically, but automated means are valid since the SIS will be several orders of magnitude smaller than the web. The semantic search discussed in the previous chapter and Semantic Categorization can handle the millions of documents of a SIS.
In summary, we assume that the usage scenarios and scope of the SES require an automated categorization of the space, and that its bundles and bundle groups are suitable representations of what a searcher would look for and how a human, or service broker, would organize them. We propose that the Semantic Categories can provide a meaningful map of the bundles for exploratory search. To establish this, we apply Semantic Categorization to the SAP ES Wiki and compare its automatically generated categories with the manual (bundle) groups. Most conventional methods for organizing information rely on clustering algorithms of one form or another. We therefore position the Semantic Categorization against state-of-the-art clustering of the space.
5.1.2 Clustering Perspectives
Traditionally, clustering of Semantic Spaces involves clustering its elementary semantic vectors, which are term and/or document vectors (see 3.4.3 for details). In our model, the orthodox clustering approach would therefore utilize the term vectors. We proposed that the optimal perspective for categorization and clustering of the space is through the most relevant objects. The objects of interest are the bundle documents, which accordingly should be the basis for the categories. We therefore chose to review the (traditional) term vector perspective and the specific bundle perspective in our experiments. The comparison of their performance will evaluate whether an alternative perspective improves clustering and categorization results.
5.1.3 Semantic Space Parameters
All experiments use the same rows, cols, g, lnkwght, rw, tw, tt and u parameter
combination (see section 3.6.1 on page 77 for parameter details) to generate the basic
Semantic Space (Table 20). We chose the parameters based on experience from the use-case and TASA/TOEFL experiments. A wide window size captures broad topic relationships rather than narrow term relationships, and a large u allows for resilient performance, with a strong gap to ignore overly specific terms. These settings aim not to overfit the model while still providing a good result71. This permits a focus on the sf, lnkwght and perspective parameters as well as on clustering-specific parameters. The combination of sf and lnkwght yields 70 different Semantic Space variations, which we evaluate from the two different perspectives – term and bundle – in the context of Semantic Categorization as well as with the various clustering algorithms.
71 The top use-case AARs for these settings are 1.3 (100p), 2.07 (25p) and 4.48 (Titles).
SS Parameter Value
rows, cols 6,000
cg, lw, rw 150
tw, tt TF-IDF
u 400
sf -1, -0.5, 0, 0.5, 1, 2 ,4
lnkwght 0%, 10%, … 90%
Perspective Bundle, Terms
Table 20: CLUTO - ES Wiki Semantic Space parameters
5.1.4 Performance Measures
We are comparing flat, exclusive clusterings of a data set and need to choose an appropriate measure. Let us define clusterings U and V of a set S containing N data points {s1, s2, ..., sN}. A popular measure is pair counting as implemented in the Rand index (Rand, 1971), based on a contingency matrix with the number of pairs:
N00 that are in different clusters in both U and V
N11 that are in the same cluster in both U and V
N01 that are in different clusters in U but in the same cluster in V
N10 that are in different clusters in V but in the same cluster in U
The Rand index RI (Equation 26) is bounded between 0 and 1; however, it mostly returns values between 0.5 and 1. The value of 0 is only achieved in the exceptional situation of one clustering consisting of a single cluster and the other entirely of atomic clusters with one member each. It is therefore not a very intuitive measure of the similarity between two clusterings.

RI(U, V) = (N11 + N00) / (N11 + N00 + N01 + N10)

Equation 26: Rand Index
The adjusted Rand index (Hubert & Arabie, 1985) addresses the instability and chance bias of the Rand index. It is 0 in expectation for random clusterings and 1 for identical clusterings. It uses a hyper-geometric distribution to model randomness and adjusts the result for chance (Equation 27).

ARI(U, V) = [ Σij C(nij, 2) − Σi C(ai, 2) Σj C(bj, 2) / C(N, 2) ] / [ ½ (Σi C(ai, 2) + Σj C(bj, 2)) − Σi C(ai, 2) Σj C(bj, 2) / C(N, 2) ]

where nij is the number of objects shared by clusters Ui and Vj, ai = Σj nij, bj = Σi nij and C(n, 2) denotes the binomial coefficient.

Equation 27: Adjusted Rand Index
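The pair-counting measures can be sketched in a few lines. The following is a straightforward implementation of the published definitions, not code from the experiments; the toy labelings are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def rand_index(u, v):
    # Fraction of object pairs on which the two clusterings agree,
    # i.e. (N11 + N00) / (N choose 2).
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[a] == u[b]) == (v[a] == v[b]) for a, b in pairs)
    return agree / len(pairs)

def adjusted_rand_index(u, v):
    # Pair counts from the contingency table, corrected for chance
    # under the hypergeometric model (Hubert & Arabie, 1985).
    n = len(u)
    sum_ij = sum(math.comb(c, 2) for c in Counter(zip(u, v)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(u).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(v).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

u = [0, 0, 1, 1, 2, 2]
v = [0, 0, 1, 1, 2, 2]
print(rand_index(u, v), adjusted_rand_index(u, v))  # identical clusterings: 1.0 1.0
```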
The second popular group of measures is information-theoretically motivated. Examples are Normalized Mutual Information (Studholme et al., 1999) and Adjusted Mutual Information (Vinh et al., 2009). They are based on the Mutual Information (MI, Equation 28) between two variables X and Y, with p(x, y) being the joint and p(x), p(y) the marginal probability distribution functions.

I(X, Y) = Σx Σy p(x, y) log( p(x, y) / (p(x) p(y)) )

Equation 28: Mutual Information
Mutual Information reflects how much the two variables depend on each other, or how much information they share. A MI of 0 indicates that X and Y are independent, and knowing about one does not change the knowledge about the other. A maximal MI (equal to the shared entropy) indicates that they are identical and knowing one is equal to knowing both. We can use it to measure how similar two clusterings are. The probability of a random object from S being in a cluster Ui is P(i) (Equation 29), and the entropy of U is H(U) (Equation 31). The entropy of U is bounded below by 0, reached in the case of a single cluster containing all items (log(P(i)) would be 0). The MI of U and V is I(U, V) (Equation 32), with P(i, j) (Equation 30) being the probability of a random object being in both Ui and Vj.

P(i) = |Ui| / N

Equation 29: Probability of a random object being in cluster Ui

P(i, j) = |Ui ∩ Vj| / N

Equation 30: Probability of a random object being in Ui and Vj

H(U) = −Σi P(i) log P(i)

Equation 31: Entropy of clustering U

I(U, V) = Σi Σj P(i, j) log( P(i, j) / (P(i) P(j)) )

Equation 32: Mutual Information between clusterings U and V
One problem with MI is that its upper bound is equal to or less than the smaller of the two clustering entropies H(U) and H(V). The NMI addresses this by fixing the lower bound to 0 and the upper bound to 1. A common normalization is to divide MI by the square root of the product of the clustering entropies (Equation 33).

NMI(U, V) = I(U, V) / √( H(U) H(V) )

Equation 33: Normalized Mutual Information
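Equations 29 to 33 translate directly into code. The sketch below computes entropy, MI and square-root-normalized NMI from two label lists; it is an illustration of the definitions, not the evaluation code used in the experiments.

```python
import math
from collections import Counter

def entropy(labels):
    # Equation 31: H(U) = -sum_i P(i) log P(i)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    # Equation 32: I(U, V) = sum_ij P(i, j) log( P(i, j) / (P(i) P(j)) )
    n = len(u)
    cu, cv = Counter(u), Counter(v)
    return sum((c / n) * math.log((c / n) / ((cu[i] / n) * (cv[j] / n)))
               for (i, j), c in Counter(zip(u, v)).items())

def nmi(u, v):
    # Equation 33: normalization by sqrt(H(U) * H(V))
    hu, hv = entropy(u), entropy(v)
    return mutual_information(u, v) / math.sqrt(hu * hv) if hu and hv else 0.0

print(round(nmi([0, 0, 1, 1], [0, 0, 1, 1]), 6))  # identical clusterings -> 1.0
```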
The trouble with NMI is that it is cardinality biased. If, for example, a clustering solution W is compared with two random clusterings A and B with |A| > |W| > |B|, then NMI(A, W) is likely to be greater than NMI(B, W) despite both clusterings being random, because the entropy does not increase 'fast' enough to counter the 'by chance' shared information, or 'accidental' MI. This becomes significant for small datasets (see the next sub-section, 'NMI versus AMI'). A solution is a correction for chance, as in Adjusted Mutual Information or AMI (Vinh et al., 2009). It calculates and removes the 'by chance' expected MI by means of a contingency table of the mutual information of all possible pairings between U and V (Equation 34). We do not review AMI in detail because of its complexity and refer the reader to Vinh et al. (2009). AMI ranges between 0 and 1 like NMI and removes the cardinality bias, resulting in a more expressive and intuitively meaningful measure.

AMI(U, V) = [ I(U, V) − E{I(U, V)} ] / [ √( H(U) H(V) ) − E{I(U, V)} ]

Equation 34: Adjusted Mutual Information
NMI versus AMI
The selected measure has to return a value for a clustering/categorization of the 125 bundles against the SAP ES Wiki's man-made 30 bundle groups. Initially we intended to use the popular Normalized Mutual Information (Studholme et al., 1999) measure, but unusual results and some investigation identified a cardinality bias (Figure 41). NMI adjusts for an increase in Mutual Information with rising cardinality by using entropy. Entropy is an accepted measure for evaluating clustering results (Zhao & George Karypis, 2004). However, this fails for small samples with a relatively large number of categories, with entropy failing to account for chance. We therefore adopt Adjusted Mutual Information as our measure: it has all the qualities of NMI, using information theory to measure mutual information with a normalization to make results comparable, and additionally removes the cardinality bias by accounting for chance (Vinh et al., 2009).
Figure 41: Measurement Cardinality Bias
Figure 41 illustrates the difference. We generated random categorisations of the 125 bundles with 1 to 125 categories. Each setting (1, 2, 3, … 125 categories) was run a hundred times, using the average to smooth the result. We measured the NMI and AMI against the 30 SAP ES Wiki groups and plotted the results, with 0 meaning no shared information and 1 meaning the results are identical. NMI shows a strong bias towards a greater number of clusters, flattening towards 0.8 despite the categorizations being random. AMI remains around zero, measuring only non-chance mutual information and providing a resilient measure. NMI or entropy-based methods are acceptable measures if the data source contains considerably more data points than the number of clusters/categories; for our experiment, however, AMI is necessary to give an unbiased result. An alternative measure would be the Adjusted Rand Index, but we did not investigate it further since its behaviour is comparable to AMI (Vinh & Epps, 2009; Vinh et al., 2009).
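The bias experiment is easy to reproduce in outline. The sketch below measures the average NMI of purely random categorisations against a fixed 30-group labelling of 125 items; the grouping is a uniform stand-in for the wiki bundle groups, not the actual wiki data, and the NMI implementation follows Equation 33.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(u, v):
    n = len(u)
    cu, cv = Counter(u), Counter(v)
    mi = sum((c / n) * math.log((c / n) / ((cu[i] / n) * (cv[j] / n)))
             for (i, j), c in Counter(zip(u, v)).items())
    hu, hv = entropy(u), entropy(v)
    return mi / math.sqrt(hu * hv) if hu and hv else 0.0

random.seed(1)
groups = [i % 30 for i in range(125)]  # stand-in for the 30 wiki bundle groups

def avg_random_nmi(k, runs=100):
    # average NMI of k purely random categories against the fixed grouping
    return sum(nmi(groups, [random.randrange(k) for _ in range(125)])
               for _ in range(runs)) / runs

# NMI for 100 random categories is far above that for 5,
# despite zero real agreement in both cases.
print(round(avg_random_nmi(5), 2), round(avg_random_nmi(100), 2))
```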
[Chart for Figure 41: NMI and AMI similarity to the 30 SAP ES Wiki groups plotted against the number of random categories (1 to 125, each averaged over 100 runs)]
5.2 Baseline clustering algorithms
We sourced state-of-the-art clustering algorithms from the popular CLUTO72
software package (Zhao & George Karypis, 2002, 2004) (version 2.1.1). CLUTO
divides clustering algorithms into a criterion function that evaluates and optimizes
the clusters and the clustering method that produces the clusters. The combination of
different criterion functions and methods provide a wide range of modern clustering
solutions. The criterion functions and clustering methods of CLUTO are an extensive
topic. We refer the reader to Appendix C, where we concisely describe the functions and methods, or to the CLUTO website73 and manual, and to the specific literature (Zhao & George Karypis, 2002, 2004) for an in-depth discussion.
5.2.1 Setup
The input to CLUTO (besides the clustering parameters) is a text file containing a
matrix of vectors or a matrix of similarities of the clustering objects. The output is a
list of clusters corresponding to the input vectors or objects. We provide the various
Semantic Spaces (see 5.1.2) as two vector matrix files in the CLUTO input format
containing the term and the bundle vectors. We process these using the vcluster
CLUTO executable employing all combinations of the criterion functions and
clustering methods resulting in 48 different clustering algorithmic approaches. The
desired number of clusters is set to 30, which is equivalent to the manual wiki
grouping. We also compute solutions for 6 supplementary criterion functions that are only applicable to agglomerative clustering methods, and for the graph clustering method, which does not utilize exchangeable criterion functions. The remaining CLUTO parameters were left at the software's default settings.
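For orientation, the sketch below writes a small dense matrix in what we understand to be CLUTO's input format (a header line with row and column counts, followed by one whitespace-separated row of values per line) and shows how vcluster might then be invoked. The file name and parameter choices are illustrative, not the settings of the experiments above.

```python
import os
import tempfile

def write_cluto_dense(path, matrix):
    # Dense-matrix input format as we read it from the CLUTO manual:
    # "<rows> <cols>" header, then one row of values per line.
    with open(path, "w") as f:
        f.write(f"{len(matrix)} {len(matrix[0])}\n")
        for row in matrix:
            f.write(" ".join(f"{v:.6f}" for v in row) + "\n")

path = os.path.join(tempfile.gettempdir(), "bundles.mat")
write_cluto_dense(path, [[0.1, 0.2], [0.3, 0.4]])
# vcluster could then be run on this file with a chosen method and
# criterion function and the desired number of clusters, e.g.:
#   vcluster -clmethod=rbr -crfun=i2 bundles.mat 30
with open(path) as f:
    print(f.readline().strip())  # header: "2 2"
```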
5.2.2 Results
The clustering experiments by means of CLUTO returned 7,212 results for the 48
combinations of methods, criterion functions, as well as link weight, singular factor
and perspectives. We refer the reader to Appendix C for detailed analysis of the
results. In this section, we focus only on the optimal clustering results reviewing the
72 See http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview for more. 73 See http://glaros.dtc.umn.edu/gkhome/views/cluto for more details on CLUTO.
120
influence of the novel link weight (see section 3.4.2 page 70), singular factor (see
section 3.4.1 page 69) and perspective parameter (see section 3.4.3 page 72)
influences. We did not choose one particular clustering algorithm that performed best
on average. Instead, in every case we report the best possible combination of method
and function. These highly optimized and selective results provide the baseline for
the Semantic Categorization results in the next section as well as a broad evaluation
of the mentioned novel features.
Singular Factor and Perspective
Figure 42 illustrates the best results across the sf range and contrasts them between the two perspectives, summarizing the influence of the Singular Factor (sf) parameter.
Figure 42: Singular Factor and Perspective
The overall best results and those of the Bundle perspective are the same, since the latter performs best in all cases. Besides the general difference in performance, a difference in the distribution of results across the sf range between the two perspectives is apparent. The Bundle results rise sharply, peaking at sf=0, and then drop off smoothly with increasing sf. The Term results are seemingly irregular, peaking at sf=0.5. We note that neither performs best at sf=1 as proposed by Deerwester et al. (1990). In fact, the Term perspective's result is almost the worst in this case, at only 30% (0.1055) of the sf=0.5 result (0.3573). Bundle at sf=1 is just below 90% (0.4272) of its best result (0.4762 at sf=0), supporting the alternative view of singular values
(Takayama et al., 1999). These results are in line with our experience from the use-
case experiments.
Link Weight and Perspective
The lnkwght parameter shows a noteworthy trend when contrasted between the two perspectives (Figure 43). Overall, Bundle performs far better than Term (0.4762 vs. 0.3473), as established previously. However, the influence of lnkwght shows opposite behaviour. Bundle improves slightly from 0% to 30% (0.4562 vs. 0.4762, or +4.4%) and then drops off fast from 50% to 90%. Term, on the other hand, starts flat and then rises from 40%, peaking at 70% with an improvement of +24.8% over 0% (0.2782 to 0.3473).
Figure 43: Link Weight and Perspective
5.3 Semantic Categorization
The Semantic Categorization (see 3.3) derives from Conceptual Space theory; it identifies non-overlapping semantic prototypical cores along one perspective and expands them by tessellation into categories that include all vector types.
5.3.1 Setup
The Semantic Categorization is performed with the same basic Semantic Space
parameters (see 5.1.3) as the state-of-the-art clustering (CLUTO) experiments in the
previous section, with sf values of -1, -0.5, 0, 0.5, 1, 2 and 4 and lnkwght from 0% to 90% in
10% steps. Unlike the clustering approach, Semantic Categorization does not require
the user to know or guess the optimal number of final clusters/categories. SC instead uses parameters describing the desired attributes of the final categories: distance, density and cut-off (see 3.3 for details). Density is a local parameter that, when increased, gives preference to denser semantic category cluster cores. Distance is a global parameter which, with increasing value, penalizes cluster core proximity. Cut-off is a global parameter that removes the long tail of tiny clusters that arises especially if the space is very sparse; it sets the minimum fitness for a cluster as a percentage of the fittest cluster. Besides the SS and SC parameters, the two perspectives, Term and Bundle, are tested.
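To illustrate the roles of the three parameters, the following toy sketch (emphatically not the actual SC algorithm of section 3.3) greedily selects dense, mutually distant cores, applies the cut-off relative to the fittest core and tessellates all vectors to their nearest surviving core. The cosine-based fitness formula and the near-duplicate threshold are our own illustrative assumptions.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_categorize(vectors, density=1.0, distance=1.0, cutoff=0.1):
    # Local density of each candidate core: summed similarity to all vectors.
    local = [sum(cos(v, w) for w in vectors) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: -local[i])
    cores, fitness = [], []
    for i in order:
        prox = max((cos(vectors[i], vectors[c]) for c in cores), default=0.0)
        if prox >= 0.999:  # keep cores non-overlapping (illustrative threshold)
            continue
        # `density` rewards dense cores; `distance` penalizes proximity
        # to cores that have already been selected.
        cores.append(i)
        fitness.append(density * local[i] - distance * prox * local[i])
    top = max(fitness)
    # cut-off: discard cores whose fitness falls below a fraction of the fittest
    keep = [c for c, f in zip(cores, fitness) if f >= cutoff * top]
    # tessellation: every vector joins the category of its most similar core
    return [max(keep, key=lambda c: cos(v, vectors[c])) for v in vectors]

vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(semantic_categorize(vecs))  # two categories: [1, 1, 3, 3]
```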
Run 1 (Bundle, Term):
  distance: 0.5, 1, 2, 4, 8, 16, 32, 64
  density: 0.5, 1, 2, 4, 8, 16, 32, 64
  cut-off: 0%, 5%, 10%, 20%, 40%
Run 2 (Bundle):
  distance: 2, 4, 8, 16, 32, 64, 128
  density: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096
  cut-off: 0%, 5%, 10%, 15%, 20%, 25%
Run 2 (Term):
  sf: -1, -0.5, 0, 0.5, 1, 2, 4
  distance: 0.5, 0.75, …, 2, 5, 10, 15, 20 / 0.5, 1, 2, 4, 8, 16, 32, 64
  density: 10, 20, 30, 40, 50 / 10^3, 10^4, …, 10^9
  cut-off: 0%, 1%, 2%, 3%, 4%, 5% / 0%, 5%, 10%, 20%, 40%
Run 4 (Bundle, Term):
  distance: 0, 0.5, 1, 5, 10, 50, 100, 500, 10^3, 10^4, 10^5
  density: 0.5, 1, 5, 10^1, 10^2, …, 10^7
  cut-off: 0%, 10%, 20%
Table 21: Semantic Categorization experiments parameter settings
Pilot tests established the parameter ranges. The first detailed run based on that experience (see Table 21, Run 1) returned 44,800 results. These results informed run 2, which we optimized for each perspective respectively and which returned 54,680 results74. Run 3 was informal, with numerous minor ad-hoc variations tested to establish whether significantly better results could be achieved by detailed optimization; the results are not noteworthy and are not reported here. The last experiment was Run 4, with a wider range of distance and density parameters to
74 Some Bundle experiments failed and could not converge due to a combination of sparseness of the space and parameter settings.
establish their applicable ranges and cross-relationships with the singular factor, which totalled 44,425 results.
We note here that the number of experiments is largely due to the number of parameters explored, the novelty of the algorithm requiring the establishment of parameter ranges (Run 1) and the behaviour of parameters in extremes (Run 3). In fact, the optimization (Run 2) did not yield much improvement on the best results of the first run (only +5.5% on the Term perspective). This is despite SC not depending on external knowledge of the optimal outcome of 30 categories. Some of the CLUTO algorithms would be able to process an input without the anticipated number of clusters; the implementation of CLUTO, however, requires this information as a means to select the optimal outcome75.
5.3.2 Results
Table 22 presents the top results for the two perspectives. The Bundle perspective continues to provide superior results to the Term perspective. The number of categories for both results is close to the manual optimum of 30. The following discusses the various parameters and their influence.
AMI Categories density distance cut‐off sf lnkwght
Bundle 0.4368 36 17 0 0 0 0.4
Term 0.3682 34 128 4 0.15 0.5 0
Table 22: Best SC result by perspectives
75 See the CLUTO manual at http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf section 3.1.
Figure 44: Maximum AMI according to perspective and sf for run 1
At first impression from run 1 (Figure 44), the singular factor indicates that the Bundle perspective performs best without singular values (sf=0) or with smoothing (sf=0.5), while the Term perspective performs worse overall and best with smoothed singular values, just as in the previous experiments with CLUTO and the use-cases. Singular factors beyond 1 seem of no use.
Figure 45: Maximum AMI according to density and sf in run 4
The data from run 4 (Figure 45 and Table 23) paints a more complicated picture. There appears to be a relationship between the density, the distance and the singular factor parameters. The higher the sf, the better large density settings perform, particularly for Bundles. The optimal combination remains at sf=0 and sf=0.5 for Bundle and Term respectively. The distance measure seems to have no predictable beneficial effect, particularly for the sparse distribution of the Bundles. Term results can benefit somewhat from the distance, but in an unpredictable manner. Large distance settings are generally detrimental with increasing sf.
We recall that the SVD orders singular values decreasingly with the first one or few
containing a significantly larger weight than the tail of minuscule values. This is in
line with the factorization by SVD, which attempts to identify and extract the most
significant information/factor(s). An sf greater than 1 consequently amplifies this, emphasizing the main factor and collapsing the left matrix of the SVD towards the leading 'heavy' columns, eliminating the finer details in the row vectors gained from the remaining columns. It seems that large density settings counteract this shift in weighting, which indicates that the higher-order columns contain information about the smaller differences between the vectors, which can be somewhat recovered by raising the local weighting. The global distance parameter is less important for the apparently already well differentiated distribution of Bundle categories. It does play a beneficial role for the Term perspective. The fading of the higher columns/dimensions under larger singular factors indicates that the area in which the data resides collapses, and thus large distance measures become counterproductive.
sf
Bundle Term
‐1 ‐0.5 0 0.5 1 2 4 ‐1 ‐0.5 0 0.5 1 2 4
Density
0.5 0.023 0.027 0.000 0.000 0.000 0.000 0.000 0.036 0.069 0.134 0.016 0.000 0.000 0.000
1 0.133 0.130 0.001 0.000 0.000 0.000 0.000 0.122 0.101 0.215 0.302 0.000 0.000 0.000
5 0.306 0.328 0.416 0.012 0.000 0.000 0.000 0.101 0.161 0.122 0.274 0.005 0.000 0.000
10 0.267 0.310 0.424 0.062 0.011 0.000 0.000 0.125 0.174 0.151 0.340 0.222 0.000 0.000
1.E+02 0.224 0.287 0.372 0.367 0.226 0.000 0.000 0.121 0.145 0.158 0.364 0.172 0.011 0.000
1.E+03 0.226 0.226 0.378 0.346 0.366 0.084 0.000 0.091 0.117 0.171 0.270 0.115 0.136 0.000
1.E+04 0.000 0.000 0.013 0.205 0.396 0.404 0.000 0.102 0.121 0.183 0.252 0.080 0.171 0.000
1.E+05 0.000 0.000 0.176 0.362 0.000 0.087 0.119 0.192 0.249 0.195 0.178 0.088
1.E+06 0.000 0.398 0.083 0.115 0.117 0.200 0.293 0.173 0.085 0.173
1.E+07 0.135 0.296 0.104 0.117 0.200 0.256 0.213 0.176 0.224
Distance
0 0.250 0.325 0.424 0.361 0.361 0.398 0.266 0.122 0.119 0.191 0.291 0.155 0.171 0.224
0.5 0.294 0.328 0.407 0.346 0.364 0.382 0.266 0.111 0.131 0.184 0.302 0.151 0.157 0.220
1 0.291 0.328 0.412 0.346 0.364 0.382 0.270 0.111 0.131 0.184 0.302 0.151 0.157 0.220
5 0.289 0.309 0.393 0.346 0.364 0.404 0.266 0.120 0.127 0.200 0.310 0.179 0.157 0.173
10 0.306 0.310 0.407 0.346 0.366 0.390 0.294 0.108 0.110 0.166 0.364 0.205 0.157 0.196
1.E+02 0.266 0.319 0.416 0.367 0.370 0.377 0.296 0.118 0.135 0.145 0.274 0.149 0.178 0.203
1.E+03 0.228 0.306 0.416 0.324 0.396 0.345 0.000 0.112 0.126 0.138 0.185 0.146 0.176 0.000
1.E+04 0.234 0.294 0.406 0.347 0.039 0.000 0.000 0.125 0.132 0.151 0.201 0.222 0.040 0.000
1.E+05 0.213 0.294 0.406 0.231 0.000 0.000 0.000 0.117 0.145 0.215 0.201 0.213 0.012 0.000
1.E+06 0.154 0.262 0.137 0.000 0.000 0.000 0.000 0.116 0.174 0.176 0.293 0.101 0.000 0.000
1.E+07 0.071 0.053 0.000 0.000 0.000 0.000 0.000 0.055 0.069 0.109 0.148 0.061 0.000 0.000
Table 23: Maximum AMI according to distance, density and sf in run 476
Table 24 and Table 25 offer a different view of density and distance, looking at the optimal sf results for the two perspectives and their interaction. They reaffirm that the sparseness of the Bundle perspective removes the value of a global distance parameter. The Term perspective indicates that the global feature of the categorization can nevertheless be very important: it reaches its optimum at a distance of 10 and a density of 100, and without a distance measure (distance=0) it would have achieved less than 80% of that.
76 The cell shadings are visual guides to identify trends.
Distance \ Density   0.5    1      5      10     1.E+02  1.E+03  1.E+04  1.E+05
0                    0.000  0.001  0.295  0.424  0.340   0.378   0.011   0.000
0.5                  0.000  0.001  0.294  0.407  0.372   0.368   0.011   0.000
1                    0.000  0.001  0.294  0.412  0.372   0.368   0.011   0.000
5                    0.000  0.001  0.322  0.393  0.372   0.368   0.011   0.000
10                   0.000  0.001  0.350  0.407  0.372   0.368   0.011   0.000
50                   0.000  0.001  0.416  0.384  0.372   0.359   0.011   0.000
100                  0.000  0.001  0.416  0.342  0.372   0.359   0.011   0.000
500                  0.000  0.001  0.406  0.348  0.341   0.320   0.013   0.000
1000                 0.000  0.001  0.406  0.368  0.343   0.214   0.013   0.000
10000                0.000  0.000  0.038  0.137  0.129   0.032   0.000   0.000
100000               0.000  0.000  0.000  0.000  0.000   0.000   0.000   0.000
Table 24: Maximum AMI for run 4 - Bundles, density to distance at sf=0
Distance \ Density   0.5    1      5      10     1.E+02  1.E+03  1.E+04  1.E+05  1.E+06  1.E+07
0                    0.010  0.291  0.121  0.136  0.225   0.270   0.228   0.225   0.231   0.231
0.5                  0.009  0.302  0.120  0.114  0.230   0.257   0.218   0.206   0.193   0.193
1                    0.009  0.302  0.115  0.127  0.257   0.265   0.217   0.206   0.193   0.193
5                    0.010  0.289  0.224  0.275  0.310   0.260   0.241   0.222   0.222   0.222
10                   0.010  0.218  0.257  0.340  0.364   0.230   0.229   0.215   0.185   0.185
50                   0.016  0.234  0.274  0.266  0.161   0.087   0.068   0.104   0.121   0.167
100                  0.010  0.180  0.174  0.181  0.185   0.082   0.082   0.104   0.122   0.175
500                  0.010  0.201  0.107  0.148  0.046   0.078   0.136   0.116   0.121   0.175
1000                 0.010  0.201  0.107  0.148  0.036   0.078   0.133   0.123   0.123   0.172
10000                0.010  0.153  0.150  0.166  0.243   0.265   0.252   0.249   0.293   0.256
100000               0.000  0.059  0.107  0.148  0.128   0.111   0.108   0.091   0.097   0.074
Table 25: Maximum AMI for run 4 - Term, density to distance at sf=0.5
The maximum-AMI results for the lnkwght parameter, combined over runs 1, 2 and 4 (Figure 46), indicate a slight preference for a 0% link-weight in the Term perspective and 40% in the Bundle perspective. The use-case results support the latter, but the benefit for categorization is much weaker, if we can claim it at all. Maximum AMI declines for both perspectives with higher link-weights.
Figure 46: Link-weight results combined from run 1, 2 and 4
A selection of cut-off parameters across runs 1, 2 and 4 in Figure 47 illustrates the difference in the parameter's effect on the two perspectives. The Bundles perform best at a cut-off of 0 while Term gains from a cut-off of 5-10%. This is likely due to the difference in size and distribution of the two sets: the Bundles number only 125 items while the Terms, the semantic base, total 6,000. As a result, Term tends towards a 'tail' of minuscule semantic cores unless a stopping criterion such as a relative fitness measure, i.e., a cut-off, is introduced.
Figure 47: Cut-off result selection from combined runs 1, 2 and 4
Figure 48 plots the average and maximum AMI for both perspectives against the number of categories on the horizontal axis. The Bundle plots end around 50 categories since a medoid and at least one member of the categorizing type (in this case a Bundle vector) define a minimal category. The 125 bundles and their distribution therefore limit the number of possible categories to fewer than 63. A category in the
Term perspective counts if it contains at least one Bundle. Consequently, the Term perspective can produce a maximum of 125 categories.
A peak of the AMI plots occurs just after 30 categories, which is the number of manual groupings, with minima at 1 and 125 categories. This reconfirms that the chosen measure does not favour either a minimum or a maximum number of categories.
Figure 48: Maximum and Average AMI according to number of categories
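This chance-adjustment property of AMI can be illustrated with a short sketch. We use scikit-learn's implementation here as a stand-in; the thesis does not prescribe a particular library, and the labelings below are synthetic:

```python
# Sketch: AMI is adjusted for chance, so random categorizations score near
# zero regardless of how many categories they produce. Synthetic data with
# 125 items and 30 gold groups, mirroring the bundle setting only in size.
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
gold = rng.integers(0, 30, size=125)     # stand-in for the 30 manual groups

# Random categorizations, coarse and fine, both score near 0 ...
ami_few = adjusted_mutual_info_score(gold, rng.integers(0, 2, size=125))
ami_many = adjusted_mutual_info_score(gold, rng.integers(0, 100, size=125))
# ... while a perfect reproduction of the gold standard scores 1.
ami_self = adjusted_mutual_info_score(gold, gold)
print(ami_few, ami_many, ami_self)
```

Unadjusted mutual information would reward the fine-grained random labeling; the adjustment removes this artefact, which is why the measure favours neither extreme of the category count.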
5.3.3 Qualitative analysis by example
Semantic Categorization achieved a maximum AMI of 0.4368 with the Bundle perspective77. This gives us an information-theoretic measure, but does it constitute a comparable and meaningful grouping for a human, and at what AMI would that be reasonable to assume? We present a qualitative review of the best result to illuminate this question and establish whether SC accomplished a meaningful categorization.
77 An XML formatted representation of the categories is available in appendix B.
Similarity  Bundle Name  Manual Bundle Group
medoid interactive_selling Sales
0.52 quote_to_order_for_configurable_products Sales
0.52 vehicle_management_system Automotive
0.50 customer_fact_sheet Sales
0.48 sales_incentive_and_commission_management Sales
0.48 product_master_data_management Sales
0.47 opportunity_management Sales
0.38 activity_management Sales
0.40 sales_contract_management Sales
0.25 territory_management Sales
0.29 trade_and_commodity_management Sourcing
0.25 global_data_synchronization Retail
Table 26: Semantic category example
Table 26, an example from the mentioned Semantic Categorization, contains 11 members and a medoid, for a total of 12 bundles. Nine of them come from the SAP ES Wiki Sales bundle group (see Table 27 for the full group). Based on their naming, the topic of these 9 bundles appears strongly sales-related. The three additional bundles in the category seem odd at first sight. Two of them have a low similarity measure. The vehicle_management_system bundle shows a stronger relationship, and in its text we find that it:
“[…] enables the interaction between the dealers and SAP Vehicle
Management for Automotive, which importers and distribution centers run.”
(SAP ES Wiki Vehicle Management System Bundle)
Furthermore, its main audience is system administrators, sales and importer representatives, and dealers. The bundle describes in detail the import, order and particularly the sales process. The latter is a strong focus, including three use-cases, all concerned with the sales and ordering processes. The use-cases are not part of the Semantic Space since we use the same corpus as in chapter 4. Consequently, their focus further validates the semantic association with the sales bundle group extracted from the description.
The trade_and_commodity_management bundle, one of the two more loosely related 'odd' bundles, concentrates, for example, on the trading of commodities on exchanges. This relates to the purchase and transfer of goods, although it is not sales in the typical sense. The global_data_synchronization bundle, the last undiscussed bundle, appears unrelated by name. When we review the bundle, we find its focus to be data exchange for retailers, receiving data from manufacturers more efficiently than through the traditional electronic data interfaces. Again, this is not the usual sales meaning, but it relates to the purchase and transfer of goods. In summary, we find that the category around the sales bundle group is conceptually coherent, including the newly added and less obvious bundles provided by SC.
Manual Group In new ‘Sales’ category
account_and_contact_management
activity_management X
customer_fact_sheet X
customer_quote_management
interactive_selling X
opportunity_management X
order_to_cash
order_to_cash_with_crm
product_master_data_management X
quote_to_order_for_configurable_products X
rebate_management
sales_contract_management X
sales_incentive_and_commission_management X
territory_management X
Table 27: Wiki Sales bundle group
Lastly, we review what happened to the 5 sales group members that SC did not attribute to the new category. The account_and_contact_management bundle moved to a category around the customer_information_management_-_business_operations medoid. The wiki's attribution to the sales group is sensible, since the account and contact information of business partners/customers is part of sales, but the semantic categorization that organizes the business partner/customer data and data-management bundles together is also conceptually sound. This example illustrates how a view or bias, e.g., from experience and daily interaction, influences the wiki users' organisation of the data. This is neither wrong nor right, since it reflects their personal organization of the data, but if their view is not the single or overwhelming one, then this organization may be suboptimal. The different organization by the Semantic Categorization, based on the statistical distribution of the language and the relationships of documents, is a sensible alternative.
Another such alternative is the attribution of the order_to_cash, order_to_cash_with_crm and customer_quote_management bundles to a category around the order_to_cash medoid, which established a category around the order and billing concept. This is clearly related to sales but large and independent enough to warrant a category of its own. The three bundles are highly related in both the users' and the Semantic Categorisation's interpretation, despite the difference in the overall space's organisation.
The remaining bundle, rebate_management, is part of a small category with agency_business. It illustrates the limits of Semantic Categorization. The rebate bundle handles accumulated discounts but describes its purpose poorly, with only a short text. Conceptually it may relate better to the billing and invoice category. The agency_business bundle aggregates invoice process data for high-volume scenarios and relates only remotely to the rebate bundle.
5.4 Discussion
In this chapter we presented the SAP ES Wiki bundle groups as an example of a manually made grouping of service-related information, which someone exploring the service space can utilize to gain an overview without searching and reviewing large quantities of services. We discussed an optimal measure to compare the manual groups with generated ones and decided to employ the AMI measure. We introduced the two perspectives, Term and Bundle, under which the space is organized.
        CLUTO  SC
Term    0.347  0.368
Bundle  0.476  0.437
Table 28: Top results (AMI) for CLUTO and Semantic Categorization
Our hypothesis from the introduction states that the category model from Conceptual Space theory could be an effective model to organize the Semantic Space to allow exploratory search. We provided an algorithm to identify semantic cores and tessellate categories around them in chapter 3.3. We delivered its evaluation in this chapter by comparing it with the manual bundle groups by means of AMI and a qualitative investigation. We also ran a wide range of state-of-the-art clustering algorithms in exhaustive experiments (see Appendix C), choosing the best possible individual results as a baseline. We compared them with the manual groups to position the Semantic Categorization results against contemporary methods.
We furthermore reviewed in detail two novel Semantic Space contributions through their parameters, singular factor and link-weight, to establish their possible value. The top results of the traditional clustering and the Semantic Categorization are close, and the difference is probably not discernible by the user (Table 28). SC is slightly better in the Term perspective, while CLUTO achieved a slightly higher result in the Bundle clustering. The influence of the singular factor is equal for both clustering and categorizing: the Bundle perspective benefits from removing the singular values, while the Terms achieve best results with smoothed (sf=0.5) values. The link-weight has a more varied influence. The Bundle perspective does not change much with a small to medium link-weight and deteriorates with higher weights. The Term perspective benefits strongly from a high link-weight in the CLUTO setting, while its influence on Semantic Categorization is either not noticeable or detrimental.
We have established that Semantic Categorization returns results comparable to traditional clustering when measured by information-theoretic metrics. The qualitative review of the top SC Bundle result provides assurance that this measure is useful and the underlying categories are conceptually relevant. In its current form, we cannot claim that Semantic Categorization is more effective or computationally efficient. We have shown, though, that the Conceptual Space inspired categorization is appropriate and comparable with traditional clustering approaches. We note that the examined state-of-the-art clustering algorithms are mature and have undergone extensive development and revision, while the SC algorithm is novel. Furthermore, the reported results for the baseline clustering are only the very best in each situation from a large array of evaluated algorithms. Lastly, the clustering approaches reviewed here by their nature depend on external knowledge of how many clusters are optimal. Semantic Categorization does not require such an external parameter but depends solely on the general information of what constitutes a semantic core, and establishes the categories and their number based on that.
Qualitative analysis showed evidence that Semantic Categorization computes conceptually coherent categories even when these categories differ from the baseline solution. In summary, we claim that Semantic Categorization has the potential to produce an automatic, meaningful and effective map of the Service Ecosystem for exploration, as stated in the second research question. For further research we propose exploring the topic with larger and less homogeneous corpora. This would complete our understanding for real-world applications, where the semantics of the data source and user base may be even broader.
6 Discussion
This thesis began with an overview of the various streams of service-related developments, which we tied together into the background of an emerging Service Ecosystem. We identified Service Discovery to be of strategic importance for the functioning of the SES. An overview of traditional and proposed Service Discovery mechanisms revealed substantial shortcomings in addressing agenda-based discovery scenarios with uncertain service need knowledge on the searcher's part. This becomes increasingly important since we observe a shift from functional SOA-oriented service consumption to complex human selection and consumption against the backdrop of an ever growing and changing SES. We therefore argued that a suitable Service Discovery framework and system is required. We assumed that an effective discovery system is one that is sensitive to human conceptualization of the service domain. Therefore, we used Conceptual Space theory, a theory of conceptual representation from cognitive science, as a background motivating theory to compute concepts that may align with equivalent human representations. This, and the fact that besides functional descriptions the second main source of service information is unstructured text, led us to reframe the discovery process as an Information Retrieval problem. We hypothesised that a Semantic Space based on the Service Information Shadow is an effective Service Discovery mechanism for direct and indirect search.
We modelled a Semantic Space Discovery system including novel features to enhance discovery. We set out to answer the hypothesis by simulating a SIS and SD scenario using the SAP ES Wiki. The presented experiments utilize a small corpus compared to what is common in the IR domain, e.g., in the Text Retrieval Conference78. On the other hand, research in service discovery has thus far employed small corpora (Bose et al., 2008; Klusch, Fries, & Sycara, 2006; Mokhtar et al., 2007; Peng, 2007; Stroulia & Wang, 2005; Zhuang et al., 2005), down to even single-digit numbers of services (Sanchez & Sheremetov, 2008). There are no standardized, open, reliable service corpora with associated unstructured information available. We
78 See http://trec.nist.gov/ and http://trec.nist.gov/data.html for more details.
hope that the emergence of SES-like architectures (Cardoso et al., 2010) will provide the foundation to build corpora for service discovery similar to the corpora used in IR research. Our research identified a unique service repository in the SAP ES Wiki, used to empirically validate the service discovery model presented in this thesis. Its size is in line with the 'larger' service discovery experimental corpora used so far, and our work emphasizes the importance for such corpora to include all types of information, unstructured text among them.
6.1 Service Discovery by Directed Search
Chapter 3 introduced the Semantic Service Discovery model. We employed the
TASA/TOEFL experiment to evaluate the quality of the vector representations in the
dimensionally reduced term co-occurrence matrix and at the same time reviewed the
impact of the modified S values and other parameters in the model. We established
that with the given corpus our model was resilient to parameter variations and able to
provide highly relevant semantic associations when compared to existing research
and the baseline. We furthermore obtained the first evidence that the S values may be
of little or negative value when constructing a Semantic Space.
The subsequent chapter presented a novel data source and experimental setup to evaluate query-based service discovery in a Service Ecosystem. The overall comparison was against contemporary IR systems, with the SD scenario reframed as an IR problem. Nonetheless, a traditional IR evaluation would not have done justice to the complex interrelationships and would not have been as valid as a real-world data source and scenario. Consequently, we undertook the extensive work to find, extract and distil a new and relevant corpus. The choice of the SAP ES Wiki with its use-cases and bundles provided highly applicable queries and documents. In particular, the hyperlinks enabled the extension of the model by the new Linked Document Vectors, and the Service Operations provided the basis for the test of the Combined Queries. We chose the baseline IR systems to provide a solid and broad comparison. They included modern BM25 and VSM systems as well as a selection of alternative Semantic Space systems. Three different levels of query precision sourced from the use-cases simulated variations in the expressiveness of the service information need. This comprehensive evaluation, including combined queries on the SSD model, considers various scenarios of search, from the current short, iterative querying to more thoughtful and expressive querying, possibly even supported by documents, usage history or search by example.
The performance difference of the SSD model over alternative systems in the three use-case experiments, containing 448, 448 and 413 queries, is nearly always significant. Even in the two cases where the basic SSD model's result is not significantly different from the Zettair and Lucene results, its AAR is noticeably lower, i.e., better. The results across all three experiments are consistent, indicating that (SVD-sourced) Semantic Spaces are generally superior to the state-of-the-art probabilistic and VSM systems. Our SSD model and its innovative modifications proved particularly successful and performed best in all settings, improving on the SS system that inspired the SSD prototype. In optimal circumstances, we were able to achieve an average rank of 1.275 over 448 queries. This means that a searcher describing a service need in detail is very likely to find the most relevant result at rank 1 or very close to it. Even in a worst-case scenario, when just a few words poorly express a service need as in the Titles experiment, we provided the best result with an AAR of 4.42. This means that the easily comprehensible and popular format of top-10 search results would be highly relevant in this situation. We therefore claim to confirm our hypothesis from the first research question, which states that Semantic Spaces promote effective service retrieval in a Service Ecosystem. We extend this statement and claim that we provide a more effective service discovery mechanism than achieved with contemporary IR systems. A key factor in this effectiveness is the extrapolation of tacit relationships. When the SVD maps the sparse matrix with term vectors into a lower-dimensional space, a “smoothing” of vector representations occurs. This lossy compression entails a removal of weak relationships and a strengthening of the remaining, including hidden, ones. This is akin to the “guessing” of implicit term associations. On this point, the authors of LSA, the original application of SVD for semantic purposes, have this to say:
“The relationships inferred by LSA are also not logically defined, nor are they
assumed to be consciously rationalizable as these could be. Instead, they are
relations only of similarity or of context sensitive similarity but they
nevertheless have mutual entailments of the same general nature, and also give
rise to fuzzy indirect inferences that may be weak or strong and logically right
or wrong.” (Landauer et al., 1998)
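Returning to the evaluation metric: the Average Answer Rank (AAR) used above can be reproduced schematically as follows. The similarity matrix and relevance assignments here are invented for illustration and are not the experimental data:

```python
# Sketch: AAR over a query set. For each query, the rank (1-based) of the
# single relevant document in the similarity-sorted result list is taken,
# and AAR is the mean of those ranks; lower is better, 1.0 is perfect.
import numpy as np

def answer_rank(similarities: np.ndarray, relevant: int) -> int:
    """1-based rank of the relevant document when sorted by similarity."""
    order = np.argsort(-similarities)        # indices, descending similarity
    return int(np.where(order == relevant)[0][0]) + 1

# Illustrative scores for three queries over five documents.
sims = np.array([[0.9, 0.2, 0.1, 0.4, 0.3],
                 [0.1, 0.8, 0.7, 0.2, 0.0],
                 [0.3, 0.1, 0.2, 0.6, 0.5]])
relevant_docs = [0, 2, 3]                    # relevant document per query
aar = float(np.mean([answer_rank(s, r)
                     for s, r in zip(sims, relevant_docs)]))
print(aar)
```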
Naturally, the resulting Semantic Space lends itself to extension with conceptually related information, as we have shown very successfully with LDV and CQ. The linked document vectors are a simple and effective method to introduce the relationship explicitly encoded by a hyperlink between two documents. Wherever such a relationship is available, e.g., in XML or HTML, the LDV offers a weighted method to include this (semi-)structured information in the design and refinement of the previously purely unstructured and empirically sourced Semantic Space. The dramatic improvements in the use-case experiments employing LDVs substantiate this. We therefore claim that the extension by Linked Document Vectors has proven very effective, and this extension, where applicable, should be considered when building future SS models. The Combined Queries are a vehicle to extend a query easily and concisely with relevant information, i.e., existing vector-encoded objects like service operations. They can benefit the search experience very noticeably, as shown in the 25p experiment, where they improved the median and minimum results by over 18%. The 100p experiment also showed an impressive enhancement in AARs, i.e., 15.8% in the median results, despite the absence of further improvement of the near perfect minimum result. The lack of data for the Titles experiment prevents us from identifying a possible trend towards improving deteriorating queries. The fact that the Combined Queries add additional information effortlessly for the user and accurately for the system, paired with the promising results we presented, encourages further research into combined queries and their application in future models.
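A minimal sketch of a Combined Query, under the assumption that it can be read as a weighted blend of the free-text query vector with the centroid of the attached object vectors (e.g. service operations); the exact combination scheme in the SSD model may differ:

```python
# Sketch: a Combined Query blends the query vector with the centroid of
# already-encoded object vectors. The blend weight and normalization are
# illustrative assumptions, not the thesis implementation.
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v / n if n else v

def combined_query(query_vec, object_vecs, object_weight=0.5):
    """Blend the query with the centroid of the attached object vectors."""
    centroid = normalize(np.mean([normalize(o) for o in object_vecs], axis=0))
    return normalize((1 - object_weight) * normalize(query_vec)
                     + object_weight * centroid)

q = np.array([1.0, 0.0, 0.0])                          # toy query vector
ops = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.8, 0.6])]
cq = combined_query(q, ops)
print(cq)
```

The combined vector is then matched against document vectors exactly like a plain query, so no change to the retrieval machinery is required.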
6.2 Exploring the Space by Semantic Categories
We proposed in chapter 5 the manual bundle grouping in the SAP ES Wiki as a baseline for alternative organizations of the bundles, which relate to combined or related services. This provides the underpinning for an examination of the second research question, which states that Semantic Categories provide an automatic, meaningful and effective map of the Service Ecosystem for exploration. We investigated the question by comparing Semantic Categories with manual groupings. We discussed various similarity measures and identified the information-theoretically motivated and chance-adjusted AMI as the best choice of performance measure. The SSD model includes the novel feature of perspectives on the Semantic Space. It is applicable to any automatic organization of the space, and we reviewed it with the state-of-the-art clustering algorithms as well as with Semantic Categorization.
We provided a baseline by means of a broad assortment of clustering algorithms, such as k-means and others based on repeated bisectioning, agglomerative, direct and nearest-neighbour clustering, in combination with a great number of criterion functions. We employed the popular CLUTO software package for this task and describe details of the evaluated algorithms in Appendix C. We compared the effectiveness of SC by measuring its performance against only the very best outcomes (see chapter 5.2) of this extensive evaluation. This means that we did not compare against one particular clustering algorithm but, in each comparison, e.g., a perspective under variations of the singular factor, against the best possible one from the pool of algorithms.
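The pooling protocol can be sketched as follows. scikit-learn algorithms stand in for the CLUTO implementations here, and the vectors and gold labels are synthetic, so this is an assumption-laden illustration of the protocol, not the original setup:

```python
# Sketch of the baseline protocol: run several clustering algorithms on
# the same vectors and report only the best AMI per comparison.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(1)
vectors = rng.normal(size=(125, 20))     # stand-in bundle vectors
gold = rng.integers(0, 30, size=125)     # stand-in manual groups

candidates = [
    KMeans(n_clusters=30, n_init=10, random_state=0),
    AgglomerativeClustering(n_clusters=30),
]
best = max(adjusted_mutual_info_score(gold, algo.fit_predict(vectors))
           for algo in candidates)
print(best)                              # the single figure reported as baseline
```

In the actual evaluation the candidate pool is far larger (all CLUTO methods and criterion functions), which makes the baseline deliberately hard to beat.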
The Semantic Categorization and the baseline clustering algorithms utilized the same Semantic Space and the same manually determined gold standard. Their optimal results are close to each other, with SC outperforming CLUTO slightly in the Term perspective, while CLUTO outperforms SC in the overall better performing Bundle perspective. The difference between the perspectives provided evidence for the value of typing vectors in a Semantic Space, where possible, to identify relevant types and use them primarily for organizing an overview of the service space. The qualitative review of the best SC result illustrated the meaningfulness of the categories in the sense of being conceptually coherent. The review also showed that alternative categorizations made by the SC, which may have resulted in a lower AMI score, are conceptually relevant and possibly less biased than the groupings present in the wiki's manually produced gold standard. We cannot, based solely on this insight, claim a better performance than the traditional clustering, but we encourage future research to employ sophisticated evaluations beyond purely quantitative measures.
Overall, the results showed that the performance of the baseline clustering algorithms and SC is comparable. Qualitative investigation of the semantic categories provided evidence that they are meaningful, substantiating our second hypothesis that semantic categories can provide a useful basis for mapping the service space. Considering the novelty of the SC algorithm and the maturity and vast selection of the baseline clustering algorithms, we do not discount the possibility that with further research SC may become superior to contemporary clustering algorithms.
This work addresses the first step towards a map-like exploration of the SES. Our empirical results suggest that traditional clustering and SC are equally effective at creating a map for this purpose. However, effective navigation of the SES would, in the next step, require viewing the SES at various levels of abstraction and allowing the user to move between these abstraction layers. We anticipate SC to perform well in this respect too. It was designed to be flexible and utilizes both global and local features to produce categories of any size and abstraction. SC avoids the agglomerative and divisive processing of the space commonly employed in clustering, which will enable SC to provide views at each level of abstraction independently from other views. This next step will require a review in its own right with a more complex data source and (similar to this work) a multifaceted evaluation of all aspects of the space's partitioning.
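The tessellation step at the heart of SC can be sketched as nearest-medoid assignment once semantic cores are identified; core detection itself and the interplay of global and local features are omitted here, so this is an illustrative simplification, not the exact SC algorithm:

```python
# Sketch of the tessellation step only: every non-medoid vector joins the
# category of its most similar medoid, and a medoid with at least one
# member defines a minimal category (empty ones are dropped).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def tessellate(vectors: np.ndarray, medoid_ids: list[int]) -> dict:
    categories = {m: [] for m in medoid_ids}
    for i, v in enumerate(vectors):
        if i in medoid_ids:
            continue
        best = max(medoid_ids, key=lambda m: cosine(v, vectors[m]))
        categories[best].append(i)
    return {m: members for m, members in categories.items() if members}

rng = np.random.default_rng(2)
vecs = rng.normal(size=(12, 5))          # toy vector space
cats = tessellate(vecs, medoid_ids=[0, 3, 7])
print(cats)
```

Because each vector is assigned independently of any hierarchy, views at different abstraction levels can be produced without re-running an agglomerative or divisive pass.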
6.3 Singular factor
The singular factor in this research was a vehicle to evaluate the two propositions in existing research of either ignoring or utilizing the S values when creating the semantic associations in a reduced matrix or Semantic Space. The singular factor enabled us to vary how we employ the S values beyond the two proposed settings.
All experiments refuted the original notion that singular values are beneficial for the reconstruction of the left side's row relationships (Deerwester et al., 1990). The TASA experiment showed a strong preference for no singular values (sf=0). The use-case and categorization experiments partially supported that and showed improvements of up to 85% in some situations over the direct use of S values (sf=1). However, the singular values do sometimes improve the quality of semantic associations, contrary to the second traditional view that they are of no benefit (Schütze, 1997, 1998; Takayama et al., 1999). Both the use-case and categorization results show that under certain circumstances smoothed singular values (sf=0.5) can return better results than sf=1 and even sf=0. The literature has not provided conclusive evidence or explanations about the optimal use of singular values. The unexpected utility of smoothed S values for enhancing the quality of semantic vectors warrants further research detailing their effect. Until then, SVD-grounded SS models should consider and evaluate their application of singular values or, when in doubt, ignore them (sf=0).
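Reading the singular factor as an exponent on the singular values, which is our interpretation of the parameter stated here as an assumption, the three settings compare as follows:

```python
# Sketch: the singular factor sf as an exponent on the singular values when
# building reduced term vectors. sf=0 ignores S (plain U columns), sf=1
# applies S fully (U·S), sf=0.5 gives the smoothed variant.
import numpy as np

def reduced_term_vectors(cooc: np.ndarray, k: int, sf: float) -> np.ndarray:
    u, s, _vt = np.linalg.svd(cooc, full_matrices=False)
    return u[:, :k] * (s[:k] ** sf)      # scale the k kept columns by S^sf

rng = np.random.default_rng(3)
m = rng.normal(size=(50, 40))            # toy co-occurrence matrix
vecs_sf0 = reduced_term_vectors(m, k=10, sf=0.0)    # ignore S
vecs_sf05 = reduced_term_vectors(m, k=10, sf=0.5)   # smoothed S
vecs_sf1 = reduced_term_vectors(m, k=10, sf=1.0)    # full S
print(vecs_sf0.shape, vecs_sf05.shape, vecs_sf1.shape)
```

With sf=0 every reduced dimension contributes equally to row similarities; increasing sf progressively re-weights the dimensions by their singular values, which is why sf=0.5 acts as a compromise between the two traditional positions.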
6.4 Link-weight
The Linked Document Vectors are a flexible means to model a relationship, or link, between pieces of information, which we can find, for example, in hyperlinked and XML documents, ontologies or even in crowd-sourced recommendation systems. We have used uniform types of links in this work to establish their value. In more complex settings, qualitatively different links with different weights may also be an option to extend and develop the model further. For example, the Universal Service Description Language, or USDL (Cardoso et al., 2010), offers a flexible approach to include structured and unstructured information without describing a means of Service Discovery. The presented SSD model can easily exploit both types of information available in such a language. For example, the wide range of dependencies and provider-defined service capabilities offers a plethora of opportunities to extend the SSD model and add weighted extensions to the LDV.
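A simplified reading of the LDV mechanism, assuming a document vector is blended with the centroid of its linked documents' vectors under the link-weight w; the thesis's exact formulation may differ:

```python
# Sketch: a Linked Document Vector blends a document's own vector with the
# centroid of the documents it links to, controlled by the link-weight w
# (0% to 90% in the experiments). Uniform link types, as in this work.
import numpy as np

def linked_document_vector(doc_vec, linked_vecs, w=0.4):
    """Return the document vector enriched by its outgoing links."""
    if not linked_vecs:
        return doc_vec                   # no links: vector is unchanged
    centroid = np.mean(linked_vecs, axis=0)
    return (1 - w) * doc_vec + w * centroid

doc = np.array([1.0, 0.0, 0.0])
links = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
ldv = linked_document_vector(doc, links, w=0.4)
print(ldv)                               # [0.6, 0.2, 0.2]
```

Qualitatively different link types, as suggested for USDL-style descriptions, would simply replace the single w with per-type weights in the same blend.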
The use-case experiments showed in a definitive manner how valuable the additional relationship information is for the semantic associations. We were able to improve results across all scenarios by up to 32%. It is noteworthy that the improvements were evident in all but the minimum 100p results. We attribute the minor improvement in this instance to the fact that the results were already near an optimum with an AAR of 1.27. In all other cases, where the Semantic Space parameters might not be optimal or the query less expressive, or both, the LDV improved results remarkably.
Since the bundles, the focus of the use-case experiments, are semantically rich, we hypothesize that the reason for the superior outcome is the secondary effect of disambiguating the linked but poorly described documents, e.g., the service operation descriptions, and thereby only indirectly the bundles. Additional corpora with various degrees of semantic depth and linking have to be explored to fully understand the effect, but the large performance improvements warrant this investigation and endorse LDVs as a simple and effective option for enhancing the representational capabilities and resilience of Semantic Space models.
6.5 Default Parameters
The extensive review of the SSD model allows us to suggest optimal default parameter ranges. We acknowledge that different data sources may benefit from different settings, so this guide may be the first in a line of refinements until a solid understanding of all variables, like data source size, is established. We also note that while we chose optimal parameters for the various experiments from a very large parameter space, we do not consider this to weaken the claims we are making. We have observed that the SS model behaves very well within a band of parameter variations, as shown for example by the similarity of parameters in the use-case experiment. The extensive modelling illustrated that improvements beyond this band of very good results are both hard to achieve and not noteworthy. Consequently, we can confidently expect similar results from SSD systems once we standardize the parameters. However, we do make a distinct difference between applying the model for term comprehension, as in the TASA/TOEFL experiment, and for topical search and organisation, as in the SSD experiments. These two variants with different foci benefit from noticeably different parameter ranges, particularly in the window sizes.
6.5.1 Term “Semantics”
We experienced the most significant gain from maximizing the column size of the
original co-occurrence matrix. If the focus is on term comprehension, as in a
synonym test, then the rows need only reflect the relevant vocabulary, as we have
shown with the TASA/TOEFL experiment. In the same experiment, we identified
that a corpus-wide frequency (DF) as the matrix row and column sort order, a fixed
scalar term weight and a small sliding window (8 on each side) are optimal. The small
window provides optimal co-occurrence information within a sentence’s reach. The
benefit of the fixed term weight and DF sorting, combined with the ineffectiveness of
the gap feature, suggests that, contrary to established understanding (Gerard Salton, C.
S Yang, et al., 1975), where term-to-term relationships are the sole
focus, corpus-wide frequency in combination with a modest stop word list is a good
discriminator for information value. Ignoring the S values proved effective. The
dimensional reduction conformed to orthodox results and was optimal between 100
and 500 columns.
In general, this thesis shows that for an information task reliant on semantic
associations, e.g., a synonym task, the Semantic Space should comprise a large number
of columns to differentiate between the row vectors. The number of rows only needs
to represent the relevant vocabulary and does not benefit from any additional rows
beyond it. Furthermore, a fixed scalar term weight, corpus-wide term frequency
as the matrix order, a comprehensive stop list and a dimensional reduction towards 250
columns without S values are effective settings.
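These settings can be condensed into a short sketch (assumptions: alphabetical rather than DF column ordering, which does not change the resulting row vectors since all columns are kept, and a dense matrix for brevity; the function and parameter names are illustrative):

```python
import numpy as np

def build_term_space(docs, vocab, window=8, k=250, sf=0.0):
    """Sketch of the term-semantics settings: co-occurrence counts with a
    fixed scalar weight and a small sliding window, rows restricted to the
    relevant vocabulary, as many columns as the corpus offers, then a
    truncated SVD; sf=0 ignores the singular values, sf=0.5 smooths them."""
    col_terms = sorted({t for d in docs for t in d})
    col = {t: i for i, t in enumerate(col_terms)}
    row = {t: i for i, t in enumerate(vocab)}
    M = np.zeros((len(vocab), len(col_terms)))
    for doc in docs:
        for i, t in enumerate(doc):
            if t not in row:
                continue
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    M[row[t], col[doc[j]]] += 1.0  # fixed scalar weight
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    k = min(k, len(S))
    return U[:, :k] * (S[:k] ** sf)  # term row vectors in the reduced space
```

Term similarity is then a cosine between the returned row vectors.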
6.5.2 Information Retrieval theory bearing on Semantic Space
If the application of the Semantic Space is supposed to include document and topical
features, then the optimal parameters change. Firstly, TF-IDF is a better discriminator
for matrix order than term frequency, and TF-IDF is a better choice than a fixed
scalar for term weight. A high number of columns remains advisable for a rich
representation of row terms, as does a high number of rows, since a larger
vocabulary increases the detail of the document and query/pseudo-document
representation. We found that a combined window size of 100 to 300 yields good
results and that the distribution between the left and right side is less important. This
suggests a wider, more topical association of terms as an optimal solution rather than
narrower co-occurrence. This is an important result and suggests that a simpler bag-
of-words approach may yield comparable results. We find that ignoring the S values is
an optimal solution for comprehensive queries. Short, keyword-like queries can benefit
from a smoothed use of the S value. We did not find evidence that the sometimes-
suggested use of unmodified S values is desirable.
The novel addition of weighted links and combined queries returned encouraging
results. Particularly in the common case of suboptimal parameter settings for a
corpus, the utilization of links and the extension of queries by document vectors showed
strong improvements in average and median results. A low weight of 20% for the
links performed best in general. The combined queries required the addition of
the link weight in the model to be a noticeable improvement. Regarding the query
parameters, the query fact has shown little promise. The use of term frequency in the
queries, on the other hand, performed substantially better than using unique term
representations.
In general, this thesis shows that for an IR-like task utilizing a Semantic Space with a
modest-sized corpus with link information, as in the SD experiments, the
following settings are effective: a large number of rows and columns, a dimensional
reduction towards 200 columns, no S values, LDVs and CQs with a weight of 20%,
and query term frequency.
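A minimal sketch of the two novel features under these settings (the dictionary layout and names are illustrative assumptions, not the prototype’s interfaces):

```python
import numpy as np

def pseudo_doc(terms, term_vecs):
    """Query/pseudo-document vector: sum of term vectors, weighted by the
    term frequency in the query (which outperformed unique-term weighting)."""
    counts = {}
    for t in terms:
        counts[t] = counts.get(t, 0) + 1
    return sum(term_vecs[t] * c for t, c in counts.items() if t in term_vecs)

def linked_doc_vector(doc_id, doc_vecs, links, weight=0.2):
    """Linked Document Vector: the document's own vector plus the vectors
    of its link targets scaled by a low weight (20% performed best)."""
    v = doc_vecs[doc_id].copy()
    for target in links.get(doc_id, []):
        v = v + weight * doc_vecs[target]
    return v
```

A combined query then adds the matching document vectors, scaled by the same weight, to the pseudo-document before ranking by cosine similarity.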
6.5.3 Categorization
We did not test extensive parameter settings for optimal Semantic Spaces for
clustering and categorization. However, we did investigate the impact of the link
weight and the singular values. Interestingly, both the clustering and the Semantic
Categorization benefit from a removal of the S values (sf=0) for the Bundle perspective,
which is also the best performing one. Similarly, the Term perspective benefited greatly
from using smoothed S values (sf=0.5) for both the baseline and SC. This reinforces
the finding from the use-case experiments that there is some useful information
contained in the S values, but only under some circumstances and only in the
formerly unexplored smoothed form. The link weight parameter gives a less clear
picture. For the clustering baseline, the Bundle perspective benefits slightly from a
mild link weight (30-40%). The Term perspective (again for clustering), on the other
hand, benefits greatly from a strong weight (70%). The SC shows only negative
impact with link weights. Overall, the experiments in this thesis show that the link
weight in a clustering situation should be employed cautiously, and based on current
experience there are no grounds to use it in a categorization setting.
The Semantic Categorization parameters of density, distance and cut-off have to be
set according to the application. Density, the local category parameter, has shown
good performance around a setting of 10. It does scale with higher sf settings, but
these settings are irrelevant since they do not provide an improvement compared to
the manual baseline. Distance, the global parameter, depends on the perspective. If a
vector type is sparsely distributed, like the bundles, this alone can be distinguishing
enough for distance to have little impact. On the more tightly packed Term vectors,
distance was useful and a setting around 10 exhibited good results. The number and
density of the targeted vectors similarly influence the cut-off parameter. There is no
‘long tail’ of minute clusters in the Bundle perspective and thus no need for a cut-off.
The denser and more distributed Term perspective, though, does improve with a cut-off
of 5-10%.
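As an illustration of how the three parameters could interact, consider the following simplified sketch. It is not the thesis’ algorithm: the neighbour-distance density proxy, the Euclidean metric and the core-selection order are assumptions made for the example.

```python
import numpy as np

def categorize(vecs, density=10, distance=10.0, cutoff=0.05):
    """Illustrative sketch: `density` ranks vectors by how tightly their
    nearest neighbours pack around them (local parameter), `distance`
    keeps category cores apart globally, and `cutoff` drops categories
    smaller than a fraction of all vectors."""
    n = len(vecs)
    d = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=2)
    k = min(density, n - 1)
    local = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)  # density proxy
    cores = []
    for i in np.argsort(local):                # densest candidates first
        if all(d[i, c] > distance for c in cores):
            cores.append(i)                    # core far from existing ones
    members = np.argmin(d[:, cores], axis=1)   # nearest-core assignment
    cats = {int(c): np.flatnonzero(members == j) for j, c in enumerate(cores)}
    return {c: m for c, m in cats.items() if len(m) >= cutoff * n}
```

Raising `distance` merges nearby cores and coarsens the categories; raising `cutoff` removes the long tail of minute categories, mirroring the Term-perspective observation above.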
145
6.6 Discovery
The results for the directed search and categorization substantiate the proposition that
Semantic Service Discovery is effective. We recall that a consumer faced with
incomplete knowledge of her agenda may discover new information and
extend her knowledge (see section 2.1.4). However, the SD system has to facilitate
this process. We have shown that SSD achieves excellent results in identifying
relevant service information with decreasing query information. This aligns with the
challenge identified in this work of presenting related information to a consumer
when she, due to a lack of knowledge, is describing her agenda incompletely. The
quality of this selection of information presented to the agent has direct influence on
her choice and cost of investigating and obtaining the relevant information. The
strong performance of SS models and the improvement of results with the additional
link information and combined queries, and the qualitative results in the category
experiment illustrate the value of the SS results for presumptive attainment.
In particular, the qualitative results emphasize that beyond traditional SD there are
latent features available which a human agent, when presented with them, may use to
discover new information and extend her knowledge. We therefore consider Semantic
Spaces valuable for comprehensive Semantic Service Discovery systems in the future.
7 Future Work
We have concluded that Semantic Spaces are effective means for Service Discovery.
We present two aspects, scientific and practical, for future work. The positive
outcome from the SSD model and experiments may motivate real world applications
while at the same time further research and improvements are desirable.
7.1 Scientific
We propose that future research investigate the role and identify the exact
mathematical effect of the singular values in the extraction of semantic associations
between terms in a word co-occurrence matrix. We were able to establish that they
can be advantageous in a smoothed form and that the factor weighting in the
decomposition is probably the source of the gain, but for a predictable, optimal
application an exact model has to be established.
Additional research into the Linked Document Vectors should be undertaken to
provide further evidence of their value. The impressive gains we achieved encourage
us to propose that the LDVs are generally useful, but conclusive proof requires
further results across a large spectrum of discovery tasks and scenarios involving
various corpora. Such research should investigate corpora ranging from poorly to
highly linked and from poor to rich semantic descriptions, to gain an understanding
of how direct and indirect disambiguation in the space occurs from the links versus
the semantic content. We could also imagine typing the links, e.g., weighting
different types of links, to optimize the effect in various settings.
The underlying Conceptual Space theory and algorithms of this research are, in our
opinion, universally applicable. The conceptual aspect of this work is transferable to
domains beyond service discovery that are the traditional home of the applied
methods from information retrieval and cluster analysis, like data mining or text
classification. This would entail a desirable cross-validation with the corpora and
test scenarios of the related scientific domains, and would enrich our understanding
and validation of the presented work further. Additionally, some of the insights
particular to Service Discovery are possibly applicable to software discovery and
categorization problems, which struggle with similar challenges (Delo, Haar,
Larsson, & Parulekar, 2002; Tian, Revelle, & Poshyvanyk, 2009).
The apparent challenge for the near future in respect to the Service Discovery
domain is to capitalize on the extensive structural information available in the
enterprise domain of service provision and consumption, as well as the unstructured
secondary information. Approaches like the Universal Service Description Language
or USDL (Cardoso et al., 2010) offer a flexible and extensible way forward to
capture much of both. Bridging the divide between structured and unstructured
information, and utilizing both concurrently, is a great challenge. Combined queries
and linked document vectors are means to extend the orthodox Semantic Space with
structured information. They utilize explicitly encoded information transparently for
untrained users in directed search and browsing, and can even inform the design of
ontologies or taxonomies through semantic categories.
Figure 49: Interface dummy for search by browsing of categories
We established the Semantic Categorization as effective and meaningful, but the
algorithm introduced is computationally inefficient. Consequently, an obvious
research area is to improve the computational performance by developing
alternative, more efficient and effective algorithms based on Conceptual Space
theory to outperform the mature clustering methods. An important and unexplored
attribute of the presented Semantic Categorization algorithm is its ability to change
the granularity of categories through the manipulation of the core density and distance
preference. Since these are conceptually inspired parameters, they may return more
stable and relatable results than orthodox clustering. An example would be to
represent the space at different levels of conceptual granularity, which would allow
for a “drilling down” kind of exploratory search that need not rely on a hierarchical
structure, allowing for vertical knowledge abduction and discovery (Figure 49).
Lastly, an evaluation of SC in a large corpus would be desirable, even though manual
groupings may well not be available and qualitative evaluation may be the only choice.
7.2 Applied
We have shown the benefit of Semantic Spaces in Service Discovery. We propose
that it is ready for real-world application in Service Discovery and possibly for IR
tasks of a similar nature. The SD application has two challenges to overcome, one
conceptual and one of implementation.
The conceptual problem is to integrate it with the current state of the Service
Ecosystem. An open, wide-reaching Service Ecosystem does not yet exist. A
practical approach to applying the SSD model is to introduce it into promising
solutions in the service domain that may become supporting pillars of the future SES.
For example, the Universal Service Description Language (Cardoso et al., 2010),
which industry, e.g., SAP, strongly supports, shows promise to become such a pillar.
This work focused on the conceptual side of service discovery, but this does not
discount the need for structured and functional approaches in the lower, machine-
oriented SOA layers of the SES. An integration of these functional aspects with SSD
may lead to various applications. Preliminary research shows this potential (Bose et
al., 2008). We propose that the SSD model can be the human interface to the search
and discovery of services of a system that internally is well structured, but that also
contains, and through human interaction gains, unstructured information. We have
shown that we can exploit some structural information. Furthermore, simple
functional matching of services in orchestration is rarely meaningful, and a
conceptual support system founded in Semantic Spaces can help in the development
of combined and complex services and processes. Moreover, we can imagine
supporting specialists designing or mapping ontologies with a conceptual
recommendation system that instils statistical semantic validity and encourages the
resulting ontologies and taxonomies to be close to natural language.
Lastly, there is an implementation challenge. The presented Semantic Space depends
on the computationally expensive SVD, and the future SES and attached SIS can be
expected to be large. The experience from the development of the software prototype
for this research suggests that future research utilizing large corpora, and applications,
should utilize off-the-shelf components as far as possible to scale and focus efforts.
Within the time of this research, these components have gradually become available
with the development of HBase, Hadoop, Solr/Lucene and Mahout79, focusing on the
map-reduce framework. Distributed or parallel implementations of Lanczos-based
SVD algorithms (Baglama & Reichel, 2007) promise to solve a computational
bottleneck of Semantic Spaces through a two-pronged approach, reducing the memory
and computation time needed as well as dividing the problem into easily computable
tasks. This development makes a widespread application of Semantic Spaces for
Service Discovery and other scenarios likely in the near future. Many application
challenges contain considerable scientific and further research aspects, e.g., selecting
optimal semantic training sets to minimize computational load, or how best to fold
new information into an existing Semantic Space and when to recompute it.
79 See http://hbase.apache.org/, http://hadoop.apache.org/, http://lucene.apache.org/solr/ and http://mahout.apache.org/.
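One of these questions, folding new information into an existing space, can be sketched with the classic folding-in projection from the LSA literature (an assumption carried over from that literature, not a procedure evaluated in this thesis; `Vt` and `S` come from the original decomposition):

```python
import numpy as np

def fold_in(row, Vt, S, k=200, sf=0.0):
    """Classic LSA-style folding-in: project a new co-occurrence row into
    an existing k-dimensional space without recomputing the SVD. The sf
    exponent matches the smoothed use of the singular values."""
    k = min(k, len(S))
    return (row @ Vt[:k].T) * (S[:k] ** (sf - 1.0))
```

Folding-in is cheap but does not update the space itself, so the space still has to be recomputed periodically as folded-in material accumulates.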
Conclusion
Overall, we provided evidence to support our hypothesis that Semantic Spaces
based on a Service Ecosystem’s Service Information Shadow facilitate effective
Service Discovery. We also investigated the novel features of linked document
vectors, combined queries and perspectives, and the importance of singular values. We
demonstrated significant gains using LDVs and perspectives, as well as promising
benefits from combined queries. The case for singular values is complex, and while in
most cases ignoring them is the optimal setting, we did provide substantiation that
they are useful in a smoothed form in some situations.
Appendix A SAP ES Wiki Grouping
Overview SAP Core (70)
Sales (14)
Service (5)
Marketing (2)
Management (1)
Human Capital Management (4)
Corporate Services (9)
E-Commerce (1)
Supply Planning (2)
Financials (5)
Procurement (6)
Supply Network Collaboration (3)
Order Fulfillment (2)
Supply Chain Visibility (1)
Product Development and
Manufacturing (5)
Transportation, Warehousing (6)
RFID Enablement (3)
Overview Industries (56)
Banking (12)
Higher Education & Research (2)
Insurance (7)
Automotive (1)
Public Sector (7)
Defense (2)
Healthcare (5)
Consumer Products (2)
Oil & Gas (1)
Travel & Logistics Services (2)
Media (3)
Wholesale Distribution (2)
Retail (5)
Utilities (5)
Experiments
Appendix B Example Semantic Categorization by
Bundles
<?xml version="1.0"?>
<Categories Type="Bundle">
<Category Name="supplier_collaboration_for_the_supply_chain">
<Type Name="Bundle">
<Member Sim="0.747314593203148">customer_collaboration_for_the_supply_chain</Member>
<Member Sim="0.429137953659488">outsourced_manufacturing</Member>
</Type>
</Category>
<Category Name="order_to_cash">
<Type Name="Bundle">
<Member Sim="0.736248072501191">order_to_cash_with_crm</Member>
<Member Sim="0.692977378246265">order_to_cash_for_fashion</Member>
<Member Sim="0.373950237494739">customer_quote_management</Member>
<Member Sim="0.2707411271289">dispute_management</Member>
<Member Sim="0.418494613262352">integration_of_transportation_management_system</Member>
<Member Sim="0.349931330554136">convergent_invoicing</Member>
<Member Sim="0.438597390282901">bill-to-cash</Member>
<Member Sim="0.524114671311319">supply_chain_operations_and_execution_in_the_oil_and_gas_industry</Member>
</Type>
</Category>
<Category Name="financial_accounting_-_results_integration">
<Type Name="Bundle">
<Member Sim="0.704381137116215">management_accounting_-_results_integration</Member>
<Member Sim="0.426158777021448">financial_accounting_-_financial_instrument_accounting_integration</Member>
</Type>
</Category>
<Category Name="credit_risk_management_-_financial_instrument_pricing">
<Type Name="Bundle">
<Member Sim="0.700301118871451">financial_accounting_-_financial_instrument_pricing</Member>
</Type>
</Category>
<Category Name="maintenance_service_collaboration">
<Type Name="Bundle">
<Member Sim="0.672546102465906">asset_configuration</Member>
<Member Sim="0.582778542992418">maintenance_processing</Member>
<Member Sim="0.453804044325149">compliance_relevant_data_exchange_-_elogbook</Member>
</Type>
</Category>
<Category Name="customer_information_management_-_business_operations">
<Type Name="Bundle">
<Member Sim="0.652944892794494">account_and_contact_management</Member>
<Member Sim="0.592996736591359">complaint_management</Member>
<Member Sim="0.526080488614963">request_for_registration_processing</Member>
<Member Sim="0.29857951063867">investigative_case_management</Member>
<Member Sim="0.46981354895926">multi-channel_tax_and_revenue_management</Member>
<Member Sim="0.394555062110302">permit_application_and_approval</Member>
</Type>
</Category>
<Category Name="insurance_external_reporting">
<Type Name="Bundle">
<Member Sim="0.625886398149305">insurance_claims_handling</Member>
<Member Sim="0.608479509333417">insurance_external_claims_investigation</Member>
<Member Sim="0.59628375288497">insurance_document_vendor</Member>
</Type>
</Category>
<Category Name="procure_to_pay">
<Type Name="Bundle">
<Member Sim="0.645259178864215">procure_to_pay_for_fashion</Member>
<Member Sim="0.43985149413654">project_system</Member>
<Member Sim="0.508069813919222">external_requirement_processing</Member>
<Member Sim="0.549944421030025">supplier_order_collaboration_with_srm</Member>
<Member Sim="0.433598406229265">item_unique_identification</Member>
</Type>
</Category>
<Category Name="cross-industry_rfid-enabled_core_logistics_processes">
<Type Name="Bundle">
<Member Sim="0.624072460428367">management_of_tag_ids_and_tag_observations</Member>
<Member Sim="0.594803217452026">management_of_devices_through_enterprise_services</Member>
<Member Sim="0.282271326720487">yard_and_storage_management_processes</Member>
</Type>
</Category>
<Category Name="service_order_management">
<Type Name="Bundle">
<Member Sim="0.616591880740113">customer_service_execution</Member>
<Member Sim="0.402063426288014">installed_base_management</Member>
<Member Sim="0.395323749949864">service_contract_management</Member>
<Member Sim="0.477168619669685">service_parts_management</Member>
</Type>
</Category>
<Category Name="integration_of_quality_management_systems">
<Type Name="Bundle">
<Member Sim="0.60073315331758">easy_inspection_planning</Member>
</Type>
</Category>
<Category Name="manufacturing_work_instructions">
<Type Name="Bundle">
<Member Sim="0.571046919528631">integration_of_manufacturing_execution_systems</Member>
<Member Sim="0.386279698214566">batch_traceability_and_analytics</Member>
<Member Sim="0.333185136258181">responsive_product_development_and_launch</Member>
</Type>
</Category>
<Category Name="resource_and_supply_chain_planning_for_healthcare_providers">
<Type Name="Bundle">
<Member Sim="0.560687478031439">resource_planning_and_scheduling</Member>
</Type>
</Category>
<Category Name="inventory_management">
<Type Name="Bundle">
<Member Sim="0.552632457434165">inventory_lookup</Member>
<Member Sim="0.180389494715461">environment_health_and_safety</Member>
</Type>
</Category>
<Category Name="interactive_selling">
<Type Name="Bundle">
<Member Sim="0.519117489743124">quote_to_order_for_configurable_products</Member>
<Member Sim="0.515982401457111">vehicle_management_system</Member>
<Member Sim="0.501992803379323">customer_fact_sheet</Member>
<Member Sim="0.48321546603633">sales_incentive_and_commission_management</Member>
<Member Sim="0.477825223709467">product_master_data_management</Member>
<Member Sim="0.474312766188881">opportunity_management</Member>
<Member Sim="0.376897803358854">activity_management</Member>
<Member Sim="0.398928439401607">sales_contract_management</Member>
<Member Sim="0.248885230587523">territory_management</Member>
<Member Sim="0.289330139849137">trade_and_commodity_management</Member>
<Member Sim="0.250670916338527">global_data_synchronization</Member>
</Type>
</Category>
<Category Name="hcm_organizational_management">
<Type Name="Bundle">
<Member Sim="0.540502425771903">hcm_master_data</Member>
<Member Sim="0.325895238061658">hcm_time_management</Member>
<Member Sim="0.324054536895044">information_system_integration</Member>
</Type>
</Category>
<Category Name="atp_check">
<Type Name="Bundle">
<Member Sim="0.534233973694007">availability_issue_resolution_and_backorder_processing</Member>
</Type>
</Category>
<Category Name="loans_management_-_business_operations">
<Type Name="Bundle">
<Member Sim="0.532013102182625">financial_accounting_-_loans_integration</Member>
</Type>
</Category>
<Category Name="demand_management">
<Type Name="Bundle">
<Member Sim="0.521591671772741">demand_planning</Member>
<Member Sim="0.48043289870201">in-store_food_production_integration</Member>
</Type>
</Category>
<Category Name="campaign_management">
<Type Name="Bundle">
<Member Sim="0.520201004123336">lead_management</Member>
</Type>
</Category>
<Category Name="sales_and_service_-_account_origination">
<Type Name="Bundle">
<Member Sim="0.513039527801658">current_account_management_-_business_operations</Member>
</Type>
</Category>
<Category Name="patient_administration">
<Type Name="Bundle">
<Member Sim="0.477684943636421">medical_activities_x002C__patient_billing_and_invoicing</Member>
<Member Sim="0.474679551978179">foundation_for_collaborative_health_networks</Member>
</Type>
</Category>
<Category Name="market_communication">
<Type Name="Bundle">
<Member Sim="0.48295093078888">customer_communication</Member>
<Member Sim="0.456408780265494">advanced_meter_infrastructure</Member>
</Type>
</Category>
<Category Name="central_contract_management">
<Type Name="Bundle">
<Member Sim="0.480423417571924">service_procurement</Member>
<Member Sim="0.331160654709027">trade_price_specification_contract</Member>
</Type>
</Category>
<Category Name="subscription_management">
<Type Name="Bundle">
<Member Sim="0.441558309825736">advertising_management</Member>
<Member Sim="0.236343934699565">integration_of_rights_management</Member>
</Type>
</Category>
<Category Name="external_cash_desk">
<Type Name="Bundle">
<Member Sim="0.439073794719508">electronic_bill_presentment_and_payment</Member>
<Member Sim="0.352799823824433">bank_communication_management</Member>
</Type>
</Category>
<Category Name="credit_risk_management_-_credit_portfolio_management">
<Type Name="Bundle">
<Member Sim="0.4243914572109">credit_management</Member>
<Member Sim="0.303607776750799">credit_risk_-_modeling</Member>
</Type>
</Category>
<Category Name="hcm_enterprise_learning">
<Type Name="Bundle">
<Member Sim="0.42265501136075">integration_of_external_warehouse_management_system</Member>
<Member Sim="0.25933846756841">product_catalogue_processing_with_crm</Member>
<Member Sim="0.331422634088728">course_approval_processes</Member>
<Member Sim="0.253453857329117">integration_of_learning_management_systems</Member>
</Type>
</Category>
<Category Name="planning_to_shelf_optimization_integration">
<Type Name="Bundle">
<Member Sim="0.417821710855239">merchandise_and_assortment_planning_integration</Member>
</Type>
</Category>
<Category Name="rebate_management">
<Type Name="Bundle">
<Member Sim="0.411977564377045">agency_business</Member>
</Type>
</Category>
<Category Name="records_and_document_management">
<Type Name="Bundle">
<Member Sim="0.401095611843341">technical_document_management_connectivity</Member>
</Type>
</Category>
<Category Name="kanban_processing">
<Type Name="Bundle">
<Member Sim="0.387018065784234">business_event_handling_for_process_tracking</Member>
</Type>
</Category>
<Category Name="insurance_credentialing">
<Type Name="Bundle">
<Member Sim="0.362541380984386">commissioning</Member>
</Type>
</Category>
<Category Name="legal_dunning_and_external_collections">
<Type Name="Bundle">
<Member Sim="0.335544124551552">insurance_billing_and_payment</Member>
</Type>
</Category>
<Category Name="public_sector_budget_management">
<Type Name="Bundle">
<Member Sim="0.327642110799172">public_sector_accounting_structures</Member>
<Member Sim="0.238230273919996">funds_commitment_processing</Member>
</Type>
</Category>
<Category Name="real_estate_-_room_reservation">
<Type Name="Bundle">
<Member Sim="0.166656722622809">travel_management</Member>
</Type>
</Category>
</Categories>
Appendix C CLUTO
Methods
The first set of CLUTO’s clustering methods implements repeated bisection, which
considers the set of objects as one cluster and then repeatedly selects and splits one
cluster into two until a stopping criterion is met. CLUTO provides two variations,
called rb and rbr, with the first being the described implementation and the latter
additionally attempting a post-clustering optimization80 not further described in
CLUTO’s manual. The direct method attempts to compute all desired clusters
simultaneously instead of using bisections. The reverse approach to bisection is
agglomerative clustering (Chidananda Gowda & Krishna, 1978), available in two
methods, agglo and bagglo. It assumes each object to be a cluster and then merges
clusters to optimize a criterion function’s result. Bagglo is a variation which uses an
initial rb clustering on the square root of the desired cluster number to extend the
feature space before an agglo method run. Lastly, there is the graph method, which
uses a nearest-neighbour graph and the min-cut algorithm (Hao & Orlin, 1994) to
partition/cluster the graph.
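The repeated-bisection scheme can be sketched in a few lines (a simplification: CLUTO chooses and performs each split by optimizing a criterion function, whereas this sketch uses a naive 2-means step with crude initial centres):

```python
import numpy as np

def two_means(X, iters=20):
    """Crude 2-means used for each bisection step (naive initial centres)."""
    c = X[[0, len(X) - 1]].astype(float)
    lab = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - c[None], axis=2)
        lab = d.argmin(axis=1)
        for j in (0, 1):
            if (lab == j).any():
                c[j] = X[lab == j].mean(axis=0)
    return lab

def repeated_bisection(X, k):
    """rb-style clustering: start from one cluster and repeatedly split
    the largest cluster in two until k clusters exist."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        big = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(big)
        lab = two_means(X[idx])
        clusters += [idx[lab == 0], idx[lab == 1]]
    return clusters
```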
Criterion functions
The criterion functions describe the measure the clustering method optimizes. This
can be to maximize the distance between clusters (inter-cluster), minimize the
distance within a cluster (intra-cluster) and/or a combination of these. CLUTO
provides 7 criterion functions (Table 29) that can be applied to 5 clustering methods
plus an additional 6 criterion functions that are applicable to agglomerative methods.
Criterion Function | Optimization | Function
I1  | maximize | Σ_{i=1..k} (1/n_i) · Σ_{v,u ∈ S_i} sim(v,u)
I2  | maximize | Σ_{i=1..k} √( Σ_{v,u ∈ S_i} sim(v,u) )
E1  | minimize | Σ_{i=1..k} n_i · ( Σ_{v ∈ S_i, u ∈ S} sim(v,u) ) / √( Σ_{v,u ∈ S_i} sim(v,u) )
G1  | minimize | Σ_{i=1..k} ( Σ_{v ∈ S_i, u ∈ S−S_i} sim(v,u) ) / ( Σ_{v,u ∈ S_i} sim(v,u) )
G1p | minimize | Σ_{i=1..k} ( Σ_{v ∈ S_i, u ∈ S−S_i} sim(v,u) ) / n_i²
H1  | maximize | I1 / E1
H2  | maximize | I2 / E1
Table 29: CLUTO main criterion functions80
Table 29 lists the 7 main criterion functions and what they optimize, with:
k the total number of clusters
S all objects to cluster
Si objects in cluster i
ni number of objects in cluster i
v, u two objects
sim(v,u) the similarity81 between objects
I1 and I2 locally optimize the intra-cluster similarity, ignoring other clusters in the
process. I1 is mathematically equivalent to the k-means algorithm, seeking to
minimize the sum of squared errors of the Euclidean distance (Zhao & George Karypis,
2002). I2 is a vector space variation of I1, using the square root rather than the number
of objects in the cluster to scale the measure.
E1 uses a global optimization maximizing the distance of cluster centroids from the
centroid of the whole collection. It also weights larger clusters as more important.
The G1 and G1p are graph-based approaches viewing the similarity as a weight on the
edge between two objects/vertices. The intuition behind the graph inspired criterion
functions is to minimize the edge-cut of each cluster/partition.
80 See CLUTO manual http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf for more.
81 We used CLUTO’s default cosine measure.
H1 and H2 are hybrid functions using a combination of the previously discussed ones.
Both hybrids are divided by E1, and thus to increase H, E1 has to be as small as
possible. Maximizing the distance of the clusters from the global centroid achieves
this. The numerator can be either I function for intra-cluster optimization.
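Using the notation of Table 29, the main criterion functions follow directly from their definitions (a minimal sketch: `sim` is a precomputed pairwise similarity matrix and `clusters` a list of index lists):

```python
import numpy as np

def i1(sim, clusters):
    """I1: sum over clusters of the size-scaled intra-cluster similarity."""
    return sum(sim[np.ix_(c, c)].sum() / len(c) for c in clusters)

def i2(sim, clusters):
    """I2: square root of the intra-cluster similarity sum per cluster."""
    return sum(np.sqrt(sim[np.ix_(c, c)].sum()) for c in clusters)

def e1(sim, clusters):
    """E1: size-weighted similarity of each cluster to the whole collection,
    scaled by the square root of the intra-cluster similarity."""
    n = sim.shape[0]
    return sum(len(c) * sim[np.ix_(c, range(n))].sum()
               / np.sqrt(sim[np.ix_(c, c)].sum())
               for c in clusters)

def h1(sim, clusters):
    """H1 hybrid: I1 divided by E1 (H2 uses I2 in the numerator)."""
    return i1(sim, clusters) / e1(sim, clusters)
```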
The agglomerative methods additionally have the Unweighted Pair Group Method with
Arithmetic Mean (UPGMA), single and complete link functions, as well as their
weighted variants, available to them. The single link function merges the two clusters
minimizing the distance of the closest members between the clusters. The complete
link minimizes the distance of the two most distant members of the clusters.
UPGMA minimizes the mean distance between all members of two clusters.
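The three linkage functions can be stated in a few lines (a sketch; `dist` is any pairwise distance matrix, `A` and `B` index lists of the two clusters):

```python
def single_link(dist, A, B):
    """Single link: distance between the closest members of two clusters."""
    return min(dist[a][b] for a in A for b in B)

def complete_link(dist, A, B):
    """Complete link: distance between the two most distant members."""
    return max(dist[a][b] for a in A for b in B)

def upgma(dist, A, B):
    """UPGMA: mean distance over all member pairs of the two clusters."""
    return sum(dist[a][b] for a in A for b in B) / (len(A) * len(B))
```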
Methods and Criterion Functions Results
The direct method (0.4762) achieved the best result (Figure 50) as well as the highest
average (of maxima) result across all criterion functions (0.4486). Graph without
alternative criterion functions was the worst (0.2384). The agglomerative methods
were generally more volatile and performed particularly poorly with the (w)slink
criterion functions. The rb methods were slightly worse than the direct approach but
performed well overall.
Figure 50: Criterion functions by methods
The G1p achieved the best single criterion function result (Figure 51). On average,
though, E1 was slightly better (0.4190 vs. 0.4095) and not far ahead of I1, I2, H1 and H2
(0.3958, 0.3885, 0.3963 and 0.4060), with the exception of G1 (0.3417). The six
agglomerative-only criterion functions (slink, wslink, clink, wclink, upgma and
wupgma) were not competitive, with the exception of wupgma in combination with
agglomeration (0.3970 agglo and 0.4103 bagglo).
Figure 51: Methods by criterion functions
Perspective Results
The comparison of Term to Bundle perspective reveals that for all methods (Figure
52) and all criterion functions (Figure 53) the Bundle perspective achieved a
strikingly better performance. The methods improve between ~23% (agglo, graph)
to ~47% (direct, rbr). Even if we compare the best method and criterion function
from both perspectives (Term – agglo G1p 0.3473 vs. Bundle - direct G1p 0.4762) the
difference remains striking (+37%).
Figure 52: Perspective and methods
The poor performance of the graph method and the agglomerative-specific criterion functions (except wupgma) persists. They all profited from the Bundle perspective, but comparatively little (except wclink and wupgma), and their Term results were poor to begin with.
Figure 53: Perspective and criterion functions