Web Personalization in Intelligent Environments




Giovanna Castellano, Lakhmi C. Jain, and Anna Maria Fanelli (Eds.)



Studies in Computational Intelligence, Volume 229

Editor-in-Chief: Prof. Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland. E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 207. Santo Fortunato, Giuseppe Mangioni, Ronaldo Menezes, and Vincenzo Nicosia (Eds.): Complex Networks, 2009. ISBN 978-3-642-01205-1

Vol. 208. Roger Lee, Gongzu Hu, and Huaikou Miao (Eds.): Computer and Information Science 2009, 2009. ISBN 978-3-642-01208-2

Vol. 209. Roger Lee and Naohiro Ishii (Eds.): Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, 2009. ISBN 978-3-642-01202-0

Vol. 210. Andrew Lewis, Sanaz Mostaghim, and Marcus Randall (Eds.): Biologically-Inspired Optimisation Methods, 2009. ISBN 978-3-642-01261-7

Vol. 211. Godfrey C. Onwubolu (Ed.): Hybrid Self-Organizing Modeling Systems, 2009. ISBN 978-3-642-01529-8

Vol. 212. Viktor M. Kureychik, Sergey P. Malyukov, Vladimir V. Kureychik, and Alexander S. Malyoukov: Genetic Algorithms for Applied CAD Problems, 2009. ISBN 978-3-540-85280-3

Vol. 213. Stefano Cagnoni (Ed.): Evolutionary Image Analysis and Signal Processing, 2009. ISBN 978-3-642-01635-6

Vol. 214. Been-Chian Chien and Tzung-Pei Hong (Eds.): Opportunities and Challenges for Next-Generation Applied Intelligence, 2009. ISBN 978-3-540-92813-3

Vol. 215. Habib M. Ammari: Opportunities and Challenges of Connected k-Covered Wireless Sensor Networks, 2009. ISBN 978-3-642-01876-3

Vol. 216. Matthew Taylor: Transfer in Reinforcement Learning Domains, 2009. ISBN 978-3-642-01881-7

Vol. 217. Horia-Nicolai Teodorescu, Junzo Watada, and Lakhmi C. Jain (Eds.): Intelligent Systems and Technologies, 2009. ISBN 978-3-642-01884-8

Vol. 218. Maria do Carmo Nicoletti and Lakhmi C. Jain (Eds.): Computational Intelligence Techniques for Bioprocess Modelling, Supervision and Control, 2009. ISBN 978-3-642-01887-9

Vol. 219. Maja Hadzic, Elizabeth Chang, Pornpit Wongthongtham, and Tharam Dillon: Ontology-Based Multi-Agent Systems, 2009. ISBN 978-3-642-01903-6

Vol. 220. Bettina Berendt, Dunja Mladenic, Marco de Gemmis, Giovanni Semeraro, Myra Spiliopoulou, Gerd Stumme, Vojtech Svatek, and Filip Zelezny (Eds.): Knowledge Discovery Enhanced with Semantic and Social Information, 2009. ISBN 978-3-642-01890-9

Vol. 221. Tassilo Pellegrini, Soren Auer, Klaus Tochtermann, and Sebastian Schaffert (Eds.): Networked Knowledge - Networked Media, 2009. ISBN 978-3-642-02183-1

Vol. 222. Elisabeth Rakus-Andersson, Ronald R. Yager, Nikhil Ichalkaranje, and Lakhmi C. Jain (Eds.): Recent Advances in Decision Making, 2009. ISBN 978-3-642-02186-2

Vol. 223. Zbigniew W. Ras and Agnieszka Dardzinska (Eds.): Advances in Data Management, 2009. ISBN 978-3-642-02189-3

Vol. 224. Amandeep S. Sidhu and Tharam S. Dillon (Eds.): Biomedical Data and Applications, 2009. ISBN 978-3-642-02192-3

Vol. 225. Danuta Zakrzewska, Ernestina Menasalvas, and Liliana Byczkowska-Lipinska (Eds.): Methods and Supporting Technologies for Data Analysis, 2009. ISBN 978-3-642-02195-4

Vol. 226. Ernesto Damiani, Jechang Jeong, Robert J. Howlett, and Lakhmi C. Jain (Eds.): New Directions in Intelligent Interactive Multimedia Systems and Services - 2, 2009. ISBN 978-3-642-02936-3

Vol. 227. Jeng-Shyang Pan, Hsiang-Cheh Huang, and Lakhmi C. Jain (Eds.): Information Hiding and Applications, 2009. ISBN 978-3-642-02334-7

Vol. 228. Lidia Ogiela and Marek R. Ogiela: Cognitive Techniques in Visual Data Interpretation, 2009. ISBN 978-3-642-02692-8

Vol. 229. Giovanna Castellano, Lakhmi C. Jain, and Anna Maria Fanelli (Eds.): Web Personalization in Intelligent Environments, 2009. ISBN 978-3-642-02793-2


Giovanna Castellano, Lakhmi C. Jain and Anna Maria Fanelli (Eds.)

Web Personalization in Intelligent Environments



Prof. Giovanna Castellano, Computer Science Department, University of Bari, Via Orabona, 4, 70125 Bari, Italy. E-mail: [email protected]

Prof. Lakhmi C. Jain, University of South Australia, Adelaide, Mawson Lakes Campus, South Australia, Australia

E-mail: [email protected]

Prof. Anna Maria Fanelli, Computer Science Department, University of Bari, Via Orabona, 4, 70125 Bari, Italy. E-mail: [email protected]

ISBN 978-3-642-02793-2 e-ISBN 978-3-642-02794-9

DOI 10.1007/978-3-642-02794-9

Studies in Computational Intelligence ISSN 1860-949X

Library of Congress Control Number: Applied for

© 2009 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com


Foreword

At first sight, the concept of web personalization looks deceivingly simple. A web personalization system is a software component that collects information on visitors to a web site and leverages this knowledge to deliver them the right content, tailoring presentation to the user's needs. All over the world, web designers and web content managers rely on web personalization solutions to improve the effectiveness and usability of their web-based applications.

Still, the scientific foundation of web personalization remains a controversial issue. Practitioners know very well that when properly implemented, personalization delivers a much better user experience; but when it is poorly implemented, personalization may backfire and even distract the user's attention away from some useful (and costly-to-develop) enriched content.

In other words, tailoring content, and varying it routinely, may make a site more attractive; but an unstable site look can have a negative impact on the overall message. Everybody seems to agree that this is a real danger; but there are specific questions that are much harder to answer convincingly.

For example, when does excessive customization become noise? How can we measure the effects of content tailoring on users' experience and cognitive gain? Without a clear answer to these questions, organizations that extensively use personalization in their content management projects have to take the risk of compromising the effectiveness of the underlying message. Historically, this factor kept the number of adopters low: most businesses are reluctant to risk jeopardizing their core message in exchange for some non-quantified future benefit of personalization.

A sound scientific approach is needed to reverse this trend; but until quite recently, web personalization had little to do with scientific research. As a communication strategy, it was considered more an art than a science. This book provides an entirely different point of view, advocating a scientific approach to web personalization without forgetting the interdisciplinary nature of this field and its practical goals.

Editors Giovanna Castellano, Lakhmi Jain and Anna Maria Fanelli, themselves outstanding researchers in this area, successfully put together a book which is self-contained: it provides a comprehensive view of the state of the art, including a description of the personalization process and a classification of the current approaches to Web personalization. Also, the book delves very deeply into current investigation on intelligent techniques in the realm of Web personalization.


I leave it to the Editors' introduction to comment individually on the excellent selected chapters, which are authored by some of the leading international research teams working in this field. Here, it is more important to remark that these chapters collectively show what intelligent techniques can do to tackle two open research problems:

• discovering useful knowledge about users from the (uncertain) information collected during interactions.

• using such knowledge to deliver customized recommendations, tailor-made to the needs of the users.

Solving the first problem means providing a scientifically sound definition of user model. To put it simply, such models are composed of a visitor profile and a visitor segment. A visitor profile is a collection of attributes that must be known or guessed in order to support personalization. Explicit profile attributes are the easier part: they are data about the user, coming from online surveys, registration forms, integrated CRM or sales automation tools, and legacy or existing databases. Still, this multiplicity of sources poses uncertainty problems in case of conflicts (in which age group do we classify a user who declared that her age is 15 but also provided her driving license number?) and limited trustworthiness (e.g. due to data aging) of some information sources. Implicit profile attributes are much more uncertain than explicit ones: they are derived from browsing patterns, cookies, and other sources, i.e. from watching or interpreting customer behavior, a process which may be slow and is subject to error. Here, however, one must clarify how uncertainty arises.

There is little uncertainty in the data collection process: personalization systems are probes, not sensors, and exactly register user behavior in terms of clicks and page visits. Uncertainty comes in when mapping profile attributes to profile segments. A segment is just a collection of users with matching profiles; so segment membership is usually uncertain, or better a matter of degree. Visitor segments have different granularity depending on the applications, and are crucial for developing and maintaining classification rules. How organizations collect and store visitor segments is a sensitive topic, as it gives rise to a number of privacy issues. Finally, gaming, i.e. intentionally attacking the classification system by providing wrong information or acting erratically, is also not unheard-of on the Web and can worsen the situation.

The second problem is the holy grail of web personalization. Web-based recommendation systems aggregate the online behavior of many people to find trends, and then make recommendations based on them. This involves some sophisticated mathematical modeling to compute how much one user's behavior is similar to another's. Once again, uncertainty mostly comes from the interaction between recommendation and segmentation: recommender systems will try to advise us based on past behavior of our peers, but their notion of “peer” is only as good as their profile segment construction algorithm. When segmentation fails (e.g. due to gaming, or wrong interpretation of implicit parameters) sometimes recommendations turn up plainly wrong, and in some extreme cases they can even be offensive to the users.

Intelligent techniques map the above issues to data mining and machine learning problems. Namely, they use mining and learning to build intelligent (e.g., neuro-fuzzy or temporal) models of user behavior that can be applied to the task of predicting user needs and adapting future interactions. The techniques described in this book are flexible enough to handle the various sources of data available to personalization systems; also, they lend themselves to experimental validation.

Thanks to the combined effort of the volume's editors and of its outstanding authorship, this book demonstrates that intelligent approaches can provide a much needed hybrid solution to both these problems, smoothly putting together symbolic representation of categories and segments with quantitative computations.

While much work remains to be done, the chapters in this volume provide convincing evidence that intelligent techniques can actually pave the way to a scientifically sounder (and commercially more effective) notion of Web personalization.

Ernesto Damiani Università di Milano, Italy


Preface

The Web emerges as both a technical and a social phenomenon. It affects business and everybody's life, and leads to considerable social implications. In this scenario, Web personalization arises as a powerful tool to meet the needs of daily users and make the Web a friendlier environment. Web personalization includes any action that adapts the information or services provided by a Web site to the needs of users, taking advantage of the knowledge gained from the users' navigational behavior and individual interests, in combination with the content and the structure of the Web site. In other words, the aim of a Web personalization system is to provide users with the information they want or need, without expecting them to ask for it explicitly. The personalization process plays a fundamental role in an increasing number of application domains such as e-commerce, e-business, adaptive Web systems, and information retrieval.

Depending on the application context, personalization functions may change, ranging from improving the organization and presentation of Web sites to enabling better searches. Regardless of the particular application domain, the development of Web personalization systems gives rise to two main challenging problems: how to discover useful knowledge about the user's preferences from the uncertain Web data collected during the interactions of users with the Web site, and how to deliver intelligent recommendations, tailor-made to the needs of the users, by exploiting the discovered knowledge.

The book aims to provide a comprehensive view of Web personalization and investigate the potential of intelligent techniques in the realm of Web personalization. The book includes six chapters. Chapter one provides an introduction to innovations in Web personalization. A roadmap of Web personalization is delineated, emphasizing the different personalization functions and the variety of approaches proposed for the realization of personalization systems. In this chapter, a Web personalization process is presented as a particular data mining application with the goal of acquiring all possible information about users accessing the Web site in order to deliver personalized functionalities. In particular, according to the general scheme of a data mining process, the main steps of a Web personalization process are distinguished, namely Web data collection, Web data preprocessing, pattern discovery and personalization. This chapter provides a detailed description of each of these steps. To complete the introduction, the different techniques proposed in the literature for each personalization step are reviewed, providing a survey of works in this field.


Chapter two by Pasquale Lops et al. investigates the potential of folksonomies as a source of information about user interests for recommendation. The authors introduce a semantic content-based recommender system integrating folksonomies for personalized access. The main contribution is a novel integrated strategy that enables a content-based recommender to infer user interests by applying machine learning techniques, both on official item descriptions provided by a publisher and on tags which users adopt to freely annotate relevant items.

Chapter three by John Garofalakis and Theodoula Giannakoudi shows how to exploit ontologies for Web search personalization. Ontologies are used to provide a semantic profiling of users' interests, based on the implicit logging of their behavior and the on-the-fly semantic analysis and annotation of the Web result summaries.

Chapter four by Giovanna Castellano and M. Alessandra Torsello shows how to derive user categories for Web personalization. It presents a Web Usage Mining (WUM) approach based on fuzzy clustering to categorize users by grouping together users sharing similar interests. Unlike conventional fuzzy clustering approaches that employ distance-based metrics (such as the Euclidean measure) to evaluate similarity between user interests, the approach described in this chapter makes use of a fuzzy similarity measure that enables identification of user categories by capturing the semantic information incorporated in the original Web usage data.

Chapter five by Fabián P. Lousame and Eduardo Sánchez presents an overview of recommender systems based on collaborative filtering, which represents one of the most successful recommendation techniques to date. The chapter contributes a general taxonomy useful to classify algorithms and approaches according to a set of relevant features, and finally provides some guidelines to decide which algorithm best fits a given recommendation problem or domain.

In Chapter six, Corrado Mencar et al. present a user profile modeling approach conceived to be applicable in various contexts, with the aim of providing personalized contents to different categories of users. The proposed approach is based on fuzzy logic techniques and exploits the flexibility of fuzzy sets to define an innovative scheme of metadata. Along with the modeling approach, the design of a software system based on a Service Oriented Architecture is presented. The system exposes a number of services to be consumed by information systems for personalized content access.

We are grateful to the authors and reviewers for their excellent contribution. Thanks are due to Springer-Verlag and the SCI Data Processing Team of Scientific Publishing Services for their assistance during the preparation of the manuscript.

May 2009

Giovanna Castellano
Lakhmi C. Jain
Anna Maria Fanelli


Editors

Giovanna Castellano is Assistant Professor at the Department of Computer Science of the University of Bari, Italy. She received a Ph.D. in Computer Science at the same University in 2001.

Her recent research interests focus on the study of Computational Intelligence paradigms and their applications in Web-based systems, image processing and multimedia information retrieval.

Professor Lakhmi C. Jain is a Director/Founder of the Knowledge-Based Intelligent Engineering Systems (KES) Centre, located in the University of South Australia. He is a Fellow of the Institution of Engineers Australia.

His interests focus on the artificial intelligence paradigms and their applications in complex systems, art-science fusion, e-education, e-healthcare, unmanned air vehicles and intelligent agents.


Professor Anna Maria Fanelli is Full Professor at the Department of Computer Science of the University of Bari, Italy, where she plays several roles. She is Director of the Computer Science Department, Director of the PhD School in Computer Science and chair of the CILab (Computational Intelligence Laboratory).

Her recent research interests focus on the analysis, synthesis, and application of Computational Intelligence techniques with emphasis on the interpretability of fuzzy rule-based classifiers and Web Intelligence.


Contents

Chapter 1
Innovations in Web Personalization
Giovanna Castellano, Anna Maria Fanelli, Maria Alessandra Torsello, Lakhmi C. Jain . . . . . 1

Chapter 2
A Semantic Content-Based Recommender System Integrating Folksonomies for Personalized Access
Pasquale Lops, Marco de Gemmis, Giovanni Semeraro, Cataldo Musto, Fedelucio Narducci, Massimo Bux . . . . . 27

Chapter 3
Exploiting Ontologies for Web Search Personalization
John Garofalakis, Theodoula Giannakoudi . . . . . 49

Chapter 4
How to Derive Fuzzy User Categories for Web Personalization
Giovanna Castellano, Maria Alessandra Torsello . . . . . 65

Chapter 5
A Taxonomy of Collaborative-Based Recommender Systems
Fabian P. Lousame, Eduardo Sanchez . . . . . 81

Chapter 6
A System for Fuzzy Items Recommendation
Corrado Mencar, Ciro Castiello, Danilo Dell’Agnello, Anna Maria Fanelli . . . . . 119

Author Index . . . . . 141


Innovations in Web Personalization

Giovanna Castellano1, Anna Maria Fanelli1, Maria Alessandra Torsello1, and Lakhmi C. Jain2

1 Computer Science Department, University of Bari, Via Orabona, 4 - 70125 Bari, Italy

2 University of South Australia, Mawson Lakes Campus, South Australia, Australia

Abstract. The diffusion of the Web and the huge amount of information available online have given rise to the urgent need for systems able to intelligently assist users when they browse the network. Web personalization offers this invaluable opportunity, representing one of the most important technologies required by an ever increasing number of real-world applications. This chapter presents an overview of Web personalization in the endeavor of intelligent systems.

1 Introduction

With the explosive growth of the Internet and the easy availability of information on the Web, we have entered a new information age. Today, the Web provides a new medium for communication, changing the traditional way of gathering, presenting, sharing and using information. In the era of the Web, the problem of information overload is continuously expanding. When browsing the Web, users are very often overwhelmed by the huge amount of information available online. Indeed, the ever more complex structure of sites, combined with the heterogeneous nature of the Web, makes Web navigation difficult for ordinary users, who are often faced with the challenging problem of finding the desired information at the right time. An important step in the direction of alleviating the problem of information overload is represented by Web personalization.

Web personalization can be simply defined as the task of adapting the information or services provided by a Web site to the needs and interests of users, exploiting the knowledge gained from the users' navigational behavior and individual interests, in combination with the content and the structure of the Web site. The need to offer personalized services to users and to provide them with information tailored to their needs has prompted the development of new intelligent systems able to collect knowledge about the interests of users and adapt their services in order to meet the users' needs.

Web personalization is a fundamental task in an increasing number of application domains, such as e-commerce, e-business and information retrieval. Depending on the context, the personalization functions may change. In e-commerce, for example, personalization can offer the useful function of suggesting interesting products or advertising on the basis of the interests of online customers. This function is generally realized through recommendation systems, which represent one of the most popular approaches to Web personalization. In information retrieval, personalization makes it possible to tailor the search process to the user needs, by providing more appropriate results to their queries. These are only a few examples among the variety of personalization functions that could be offered.

Web personalization has received the interest of the scientific community. Many research efforts have been devoted to the investigation of new techniques for the development of systems endowed with personalization functionalities. This has led to the growth of a new flourishing research area, known as Web Intelligence (WI), which has been recognized as the research branch that applies principles of Artificial Intelligence and Information Technology in the Web domain. The main objective of WI is the development of Intelligent Web Information Systems, i.e. systems endowed with intelligent mechanisms associated with human intelligence, such as reasoning, learning, and so on. The growing development of WI is strongly related to the complexity and the heterogeneity of the Web, due to the variety of objects included in the network and the complex way in which these are connected. Indeed, Web data are characterized by uncertainty and are fuzzy in nature. In this context, a big challenge is how to develop intelligent techniques able to cope with uncertainty and complexity.

This chapter provides a comprehensive view of Web personalization, which is presented as a particular data mining application with the goal of acquiring all possible information about users accessing the Web site in order to deliver them personalized functionalities. In particular, according to the general scheme of a data mining process, the main steps of a Web personalization process are distinguished, namely Web data collection, Web data preprocessing, pattern discovery and personalization. This chapter provides a detailed description of each of these steps. To complete the introductory treatment of the Web personalization topic, the different techniques which have been proposed in the literature for each personalization step are inspected, providing a review of works in this field. Once the motivations of the need for Web personalization have been explained, a roadmap of Web personalization is delineated, emphasizing the different personalization functions which can be offered and the variety of approaches proposed for the realization of personalization systems. Subsequently, the Web personalization process is described as a data mining application and the ideas behind Web usage mining and its use for Web personalization are presented. Hence, the stages involved in a usage-based Web personalization system are discussed in detail, with reference to the majority of the existing methods.

2 Web Personalization Roadmap

Web personalization can be defined as any set of actions that can tailor the Web experience to a particular user or set of users. The actions can range from simply making the presentation more pleasing to anticipating the needs of a user and providing customized and relevant information. As a consequence, a Web personalization system can be developed in order to offer a variety of personalization functions, making the Web a friendlier environment for its individual users and hence creating trustworthy relationships between a Web site and its visitors. However, different approaches have been proposed to develop effective Web personalization systems. In the following subsections, firstly, the variety of functions that can be offered by a Web personalization system are described. Then, the different approaches which have been proposed to develop several kinds of personalization forms are discussed.

2.1 Web Personalization Functions

According to Pierrakos et al. [2003], four basic classes of personalization functions can be distinguished, namely memorization, guidance, customization and task performance support. Each of these functions is examined below in more detail, starting from the simplest to the most complicated ones.

Memorization
Memorization represents the simplest and the most widespread class of personalization functions. In this form of personalization, the system records in its memory information about users accessing the Web site (e.g. using cookies), such as the name, the browsing history, and so on. Then, this information is used by the personalization system as a reminder of the user's past behavior. In fact, when the user returns to the Web site, the stored information, without further processing, is exploited to recognize and to greet the returning user. Memorization is not offered as a stand-alone function but it is usually part of a more complete personalization solution. Examples of personalization functions belonging to this class are listed below:

• User Salutation: The Web personalization system recognizes the returning user and visualizes a personalized message, generally including the user's name together with a welcome sentence (a minimal sketch of this function is given after this list). Though user salutation is one of the simplest forms of personalization, it represents a first step to increase the user's loyalty in most commercial Web applications. In fact, users feel more comfortable accessing Web sites that recognize them as individuals, rather than as regular visitors.

• Bookmarking: In this personalization function, the system is able to record the pages that a user has visited during his/her past accesses. The lists of the visited pages will be used in the successive visits of the same user. In fact, when the user returns to visit the Web site, the personalization system presents them by means of a personalized bookmarking scheme for that site, supporting the user in the navigation activity.

• Personalized access rights: A Web site can define personalized access rights that make it possible to distinguish different types of users (for example, common users and authorized users). Different access rights are useful to differentiate the category of information that users may access (product prices, reports) or to establish the set of operations that a category of users may execute (download files, e-mail).
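The following minimal Python sketch illustrates the memorization functions listed above (cookie-based user salutation and per-category access rights). The cookie names, the user categories and the action names are illustrative assumptions, not taken from this chapter; a real site would rely on its Web server or application framework to set and read the cookies.

from http import cookies

# Access rights associated with each user category (illustrative values).
ACCESS_RIGHTS = {
    "common": {"browse"},
    "authorized": {"browse", "download", "view_prices"},
}

def greet_and_authorize(cookie_header, requested_action):
    # Read the cookies previously stored in the visitor's browser.
    jar = cookies.SimpleCookie()
    jar.load(cookie_header)
    name = jar["username"].value if "username" in jar else None
    category = jar["category"].value if "category" in jar else "common"
    # User salutation: greet returning users by name.
    greeting = "Welcome back, %s!" % name if name else "Welcome, new visitor."
    # Personalized access rights: allow or deny the requested operation.
    allowed = requested_action in ACCESS_RIGHTS.get(category, set())
    return "%s Action '%s' %s." % (greeting, requested_action,
                                   "allowed" if allowed else "denied")

# A returning, authorized user recognized from the cookie header sent by the browser.
print(greet_and_authorize("username=Anna; category=authorized", "download"))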


Guidance
Guidance (or a recommender system) represents the class of personalization functions consisting in the ability of a Web site to assist users by quickly providing them with the relevant information according to their interests or suggesting them alternative browsing options [Mobasher et al., 2000, Nasraoui et al., 2002, Schafer et al., 1999]. In this case, the personalization system relies on data that reflects the user preferences, collected both implicitly (user browsing history stored by the Web server in access log files) and explicitly (through the fulfillment of questionnaires or apposite registration forms). In the following, examples of guidance functions are described.

• Recommendation of hyperlinks: This function consists in the dynamic recommendation of links deemed to be interesting according to the user preferences. The suggested links can be presented in the form of a recommendation list displayed in a separate frame of the current Web page or listed in an apposite pop-up window. In Kobsa et al. [2001], recommendation of links is presented as one of the most developed personalization functionalities, for the suggestion of links to topics of information or to an advised navigational path that the user might follow. Recommender systems are especially employed in the e-business field and in many e-commerce applications in order to suggest products useful to the clients/users and to increase their loyalty.

• User Tutoring: For this guidance function, basic notions of Adaptive Educational Systems have been applied to personalize Web sites. A Web site can offer personalized guidance functions to an individual in each step of the user interaction with the site, taking into account the knowledge and the interests of the same user. This functionality is achieved by the Web site, for example, by recommending to the user other Web pages or adding explanatory content to the Web pages. An application of this personalization function can be found in Webinars (Web seminars), which are live or replayed multimedia presentations conducted from a Web site.

Customization
In general, in this form of personalization, the system takes as input the user preferences (generally collected by means of registration forms) and exploits these to customize the content or the structure of a Web page. This process generally tends to be manual or semi-automatic. The major goal of this personalization function is the efficient management of the information load by alleviating and facilitating the user interactions with the site. Examples of this class of personalization functions are:

• Personalized Layout: This customization function refers to the change of the Web pages in the layout, color or local information according to the profile of the connected user. Personalized layout is usually exploited by Web portals, such as Yahoo and Altavista, which offer customized functionalities in order to create personalized "My-Portals" Web sites.


• Content Customization: The content of Web pages is modified in order to meet the interests and the preferences of the users. For example, this personalization function makes it possible to display a Web page in different ways (summarized or in an extended form), depending on the type of user accessing the site. To make the appropriate modifications to the Web page content, the user knowledge is also taken into account. An example of a Web site with content customization functions can be found in Schwarzkopf [2001].

• Customization of hyperlinks: A Web site can also offer customized functionalities by adding or removing links within a particular page. In this way, unusual links are eliminated, changing the topology of the Web site and improving its usability. This way of customization is described in Chignoli et al. [1999].

• Personalized pricing scheme: Together with the recommendation of hyperlinks, this personalization functionality can be employed in e-commerce applications in order to attract users who are not usual visitors or to confirm the client/user loyalty. For example, a personalized pricing scheme allows special discount percentages for users that have been recognized as loyal customers. Acquisti and Varian [2005] present a model which allows sellers to offer enhanced services to previous customers by conditioning their price offers on the prior purchase behavior of consumers.

• Personalized product differentiation: The aim of this form of personalization is to satisfy the customer needs by transforming a standard product into a personalized solution for an individual. This personalization function proves to be a powerful method especially in the field of marketing. Voloper Global Merchant (VGM) represents an example of a Web site which offers services of multiple pricing levels and product differentiation according to the user needs. A description of these last two kinds of personalization functions can be found in Choudhary et al. [2005].

Task performance support
Task performance support represents the most advanced personalization function, inherited from a category of Adaptive Systems known as personal assistants [Mitchell et al., 1994]. In these client-side personalization systems, a personal assistant executes actions on behalf of the user, in order to facilitate the access to relevant information. This approach requires the involvement of the user, including access, installation and maintenance of the personal assistant software. Examples of personalization functions included in this class are described below.

• Personalized errands: A Web personalization system offers this form of personalization by executing a number of actions in order to assist the work of the users, such as sending an e-mail, downloading various items, and so on. Depending on the sophistication of the personalization system, these errands may vary from simple routine actions to more complex ones that take into account the personal circumstances of the user.

• Personalized query completion: This personalization function is generally used to improve the performance of information retrieval systems. In fact, a system can add terms to the user queries submitted to a search engine or to a Web database system with the aim of enhancing or completing the user requests and making them more comprehensible.

• Personalized negotiations: This represents one of the most advanced task performance support functions and it requires a high degree of sophistication by the personalization system in order to earn the trust of the user. Here, the system can play the role of negotiator on behalf of a user and it may participate in Web auctions [Bouganis et al., 1999].

2.2 Approaches to Web Personalization

Web personalization has been recognized as one of the major remedies to the information overload problem and as a way to increase the loyalty of Web site users. Due to the importance of providing personalized Web services, different approaches have been proposed in the past few years in order to develop systems provided with personalization functionalities.

Starting from architectural and algorithmic considerations, Mobasher et al. [2000] have categorized the approaches and techniques used to develop the existing personalization systems into three general groups: rule-based systems, content-based filtering systems and collaborative filtering systems.

However, a great deal of work has been addressed to developing hybrid personalization systems, arising from the combination of various elements which characterize the previously distinguished approaches. In the following, a brief description and overview of the most influential approaches proposed for the development of personalization systems is presented.

Rule-based personalization systems
Rule-based personalization systems are able to recommend items to their users by generating a number of decision rules, either automatically or manually. Many e-commerce Web sites that are provided with recommendation technologies employ manual rule-based systems to offer personalized services to their customers. In such systems, decision rules are manually generated by the Web site administrator on the basis of demographic and other personal information about users. These rules are exploited to modify, for example, the content served to a user whose profile satisfies one or more decision rules.

A first drawback of personalization systems based on decision rules is the knowledge engineering bottleneck problem. In fact, in such systems the type of personalization highly depends on the knowledge engineering of the system designers to create a rule base taking into account the specific characteristics of the domain or the market research. A further drawback of these kinds of systems lies in the methods used for the generation of user profiles. Here, user profiles are generally created explicitly, during the interactions of users with the site. To classify users into different categories (or user profiles) and to derive rules which have to be used for personalization, research has mainly focused on the employment of machine learning techniques. In these tasks, the input is usually affected by the subjective description of users or their interests given by the users themselves. Moreover, generated user profiles are often static, and the performance of personalization systems based on this approach decreases over time as the profiles age.

Examples of products which adopt this kind of approach are the personalization engine of Yahoo [Manber et al., 2000], Websphere Personalization of IBM (www-306.ibm.com/software/websphere/) and Broadvision (www.bradvision.com).

Content-based filtering personalization systems
Personalization systems which fall into this category exploit various elements concerning the Web content in order to discover the personal preferences of the current user. The basic assumption of this approach is that the choices a user will make in the immediate future are very similar to the choices made by the same user in his/her immediate past. In content-based filtering personalization systems, the recommendation generation is based around the analysis of items previously rated by a user and the derivation of a profile for a user, based on the content descriptions of these items. The content description of the items generally includes a set of features or attributes that characterize the corresponding items. In particular, in such systems, the content description of the items for which the user has previously expressed interest represents the user profile. Then, the user profile is used to predict a rating for previously unseen items, and those deemed as being potentially interesting are recommended to the user. In content-based filtering systems, the task of recommendation generation involves the comparison between the extracted features of unseen or unrated items and the content descriptions characterizing the user profile. Items that are deemed sufficiently similar to the identified user profile are recommended to the current user.

In most e-commerce applications, or in other Web-based applications where personalization functions are developed through the content-based filtering approach, the content descriptions of the items are usually represented by textual features extracted from the Web pages or product descriptions.

In such kinds of personalization systems, well-known techniques of document modeling, together with other principles derived from research in the fields of information retrieval and information filtering, are exploited. Generally, user profiles are expressed in the form of vectors, where each component represents a weight or an interest degree related to each item. Predictions about the user interest for a particular item can be derived through the computation of vector similarities, based on the employment of different methods such as the cosine similarity measure, or using probabilistic approaches such as Bayesian classification. In content-based filtering personalization systems, the constructed user profiles do not have a collective (or aggregate) nature: each profile refers to an individual user, built only on the basis of characteristics of items previously seen or rated by the active user.
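As a concrete illustration of the vector-based profile matching described above, the following Python sketch builds a user profile by aggregating the textual feature vectors of items the user has rated, and then ranks unseen items by cosine similarity to that profile. The item names, features and weights are invented for the example; they are not taken from this chapter.

import math

# Hypothetical textual feature vectors (term -> weight) describing a few items.
rated_items = [
    ({"laptop": 0.9, "review": 0.4}, 1.0),    # item the user liked
    ({"laptop": 0.7, "battery": 0.6}, 0.5),   # item the user mildly liked
]
unseen_items = {
    "item_A": {"laptop": 0.8, "price": 0.3},
    "item_B": {"recipe": 0.9, "pasta": 0.5},
}

def cosine(u, v):
    # Cosine similarity between two sparse term-weight vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_profile(items):
    # Aggregate the rated item descriptions into a single profile vector,
    # weighting each description by the rating the user gave it.
    profile = {}
    for features, rating in items:
        for term, weight in features.items():
            profile[term] = profile.get(term, 0.0) + rating * weight
    return profile

profile = build_profile(rated_items)
scores = {name: cosine(profile, feats) for name, feats in unseen_items.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
# item_A (about laptops) ranks above item_B, so it would be recommended first.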

Examples of early systems which use the content-based filtering approach to implement personalization functions are NewsWeeder [Lang, 1995], Letizia [Lieberman, 1995], PersonalWebWatcher [Mladenic, 1996], InfoFinder [Krulwich and Burkey, 1996], Syskill and Webert [Pazzani and Billsus, 1997], and the naive Bayes nearest neighbour approach proposed in Schwab et al. [2000].


NewsWeeder is a tool which is able to adaptively construct user models starting from the browsing behavior of a user, based on the similarity between Web documents containing news items. The constructed models can be useful to filter news items taking into account the requirements of each user.

Syskill and Webert generates user profiles from previously rated Web pages on a particular topic in order to distinguish between interesting and irrelevant pages. To learn user profiles, it uses the 128 most informative words from a page and trains a naive Bayes classifier to predict, among the unseen pages, the interesting and the uninteresting pages for the user. This system requires the user to initially rate Web pages.

To avoid requiring the user to explicitly rate Web documents, Letizia defines implicit interest indicators to compute content similarity between previously visited pages and candidate pages for recommendation. The naive Bayes nearest neighbor approach, proposed by Schwab et al. [2000], is used to build user profiles from implicit observations. In their recommendation system, they modify the use of nearest neighbor and naive Bayes to deal with only positive observations by using distance and probability thresholds.

The content-based filtering approach for personalization suffers from several limitations. The primary drawback of personalization systems based on this approach is strictly related to the method of generation of user profiles. In fact, these are derived by considering only the descriptions of items previously rated or seen by the user. In this way, user profiles turn out to be overspecialized and they may often miss important pragmatic relationships between Web objects, such as their common utility in the context of a particular task. Also, the system highly depends on the availability of content descriptions of the items being recommended. Moreover, approaches based on individual profiles lack serendipity, as recommendations are very focused on the past preferences of the users. In addition, given the heterogeneous nature of Web data, the extraction of textual features in the derivation of the content descriptions of items is not always a simple task.

Collaborative filtering personalization systems
To overcome the limitations of content-based filtering systems, Goldberg et al. [1992] introduced the collaborative filtering approach for generating a personalized Web experience for a user. Collaborative (also named social) filtering personalization systems aim to personalize a service without considering features referring to the Web content. This personalization approach is based on a basic idea: the interests of a current user are considered similar to the interests of users who have made similar choices in the past, referred to as the current user neighborhood. Hence, in this kind of system, personalization is achieved by searching for common features in the preferences of different users, generally expressed explicitly by the users in the form of item ratings stored by the system. More particularly, personalization systems based on this approach perform the matching between the ratings of a current user for items and those expressed by similar users to produce recommendations for items not yet rated or seen by the current user. One of the primary techniques to accomplish the task of recommendation generation is the standard memory-based k-Nearest-Neighbour (kNN) classification approach. This approach consists in the comparison of the current user profile with the historical user profiles stored by the system in order to find the top k users who have expressed preferences most similar to those expressed by the current user. The kNN classification approach gives rise to an important limitation of collaborative filtering techniques, namely their lack of scalability. Essentially, kNN requires that the neighborhood formation phase is performed as an online process. As the number of users and items increases, this approach may lead to unacceptable latency for providing recommendations during the interaction of users. The sparsity of the available Web data represents another relevant point of weakness of the collaborative filtering approach for personalization. In fact, as the number of items increases, the density of each user record decreases, often containing a low number of rating values in correspondence to the rated or visited items. As a consequence, establishing the similarity among pairs of users becomes a complicated task, decreasing the likelihood of a significant overlap of visited or rated items included in the corresponding user records.
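A minimal sketch of the memory-based kNN approach just described is given below, assuming a toy user-item rating matrix. It computes Pearson correlations on co-rated items, keeps the k most similar positively correlated neighbors who rated the target item, and predicts a rating as a similarity-weighted average. All user names and rating values are illustrative assumptions.

import math

# Hypothetical user-item rating matrix, stored as sparse nested dictionaries.
ratings = {
    "alice": {"item1": 5.0, "item2": 3.0, "item4": 4.0},
    "bob":   {"item1": 4.0, "item2": 3.0, "item3": 5.0, "item4": 4.0},
    "carol": {"item2": 2.0, "item3": 4.0, "item4": 1.0},
}

def pearson(a, b):
    # Pearson correlation between two users, computed on co-rated items only.
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    da = math.sqrt(sum((a[i] - ma) ** 2 for i in common))
    db = math.sqrt(sum((b[i] - mb) ** 2 for i in common))
    return num / (da * db) if da and db else 0.0

def predict_rating(user, item, k=2):
    # Keep the k most similar, positively correlated users who rated the item.
    sims = [(pearson(ratings[user], r), u) for u, r in ratings.items()
            if u != user and item in r]
    neighbors = sorted((s, u) for s, u in sims if s > 0)[-k:]
    num = sum(s * ratings[u][item] for s, u in neighbors)
    den = sum(s for s, _ in neighbors)
    return num / den if den else None

print(predict_rating("alice", "item3"))  # rating predicted from alice's neighborhood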

The collaborative filtering approach suffers from additional disadvantages. The ratings for every item have to be available prior to its recommendation. This is referred to as the new item rating problem. Another disadvantage is referred to as the new user problem: a new user has to rate a certain number of items before he/she can obtain appropriate recommendations from the system. A number of optimization strategies have been proposed in order to remedy these shortcomings [Aggarwal et al., 1999, O'Connor and Herlocker, 1999, Sarwar et al., 2000]. The proposed strategies are characterized by dimensionality reduction to alleviate the sparsity problem of the data, as well as the offline categorization of the user records by means of different clustering techniques, allowing the online component of the personalization system to search only within a matching cluster. A growing body of work has also been performed to enhance collaborative filtering by integrating data from other sources such as content and user demographics [Claypool et al., 1999, Pazzani and Billsus, 2006].

Among all the proposed strategies, model-based collaborative filtering systems have been developed as one of the most relevant variants of the traditional collaborative filtering approach for Web personalization. A representative example of model-based variants of collaborative filtering is known as item-based collaborative filtering. In item-based collaborative filtering systems, the offline component builds, starting from the user rating database, the item-item similarity matrix, where each entry expresses the similarity existing between each pair of the considered items. The item similarity is not based on content descriptions of the items but only on the user ratings.

Each item is generally represented in the form of an m-dimensional vector (m is the number of users) and the similarities between pairs of items are computed by using different similarity measures such as cosine similarity or correlation-based similarity. The item similarity matrix is used in the online prediction phase of the system to generate recommendations by predicting the ratings for items not previously seen by the current user. The predicted rating values are calculated as a weighted sum of the ratings of items in the neighborhood of the target item, consisting of only those items that have been previously rated by the current user. As the number of considered items increases, storing the item similarity matrix may require a huge quantity of memory.

Rather than considering all item similarity values, a proposed alternative consists in storing only the similarity values for the k most similar items. Here, k represents the model size, which affects the accuracy of the recommendation approach; as k decreases, the coverage as well as the recommendation accuracy will decrease.
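In symbols, a common formulation of the item-based prediction just described (stated here in a standard textbook form, not quoted from this chapter) is

\hat{r}_{u,i} = \frac{\sum_{j \in N_k(i) \cap I_u} \mathrm{sim}(i,j)\, r_{u,j}}{\sum_{j \in N_k(i) \cap I_u} \left| \mathrm{sim}(i,j) \right|}

where N_k(i) is the set of the k items most similar to the target item i, I_u is the set of items already rated by user u, r_{u,j} is the rating given by u to item j, and sim(i,j) is the corresponding entry of the item-item similarity matrix.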

Collaborative filtering personalization systems have gained popularity and commercial success in a huge number of e-commerce applications for recommending products. An example of such a system is represented by GroupLens [Konstan et al., 1997]. In this recommendation system, a user profile is defined as an n-dimensional vector, where n is the number of netnews articles. If an article has been rated, its corresponding element in the user profile vector contains the specified rating value. Articles not rated by the current user but highly rated by the neighborhood users are candidates to be recommended to the current user.

3 The Web Personalization Process

Generally speaking, the ability of a Web personalization system to tailor content and recommend items to a user assumes that it must be able to infer what the needs of a user are, based on previous or current interactions with that user, and possibly considering other users. This assumes that the system obtains information on the user and infers his/her needs, exploiting this information. Hence, central to any personalization system is a user-centric data model. Information about user interests may be collected implicitly or explicitly, but in either case such information should be attributable to a specific user. The association of Web data to a specific user is not always a simple task, especially when data is implicitly collected. This is one of the first problems to be addressed in the Web personalization process. The successive analysis of data characterizing the user interests has the aim of learning user profiles that are used to predict future interests of connected users.

Thus, in terms of the learning task, personalization can be viewed as a:

• Prediction Task: a model has to be built in order to predict ratings for items not currently seen or rated by the user.

• Selection Task: a model has to be built in order to select the N most interesting items that the current user has not already rated (both tasks are sketched below).
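The two learning tasks can be connected by a small sketch: whatever model solves the prediction task (for example, the kNN predictor sketched earlier) can be wrapped to solve the selection task by ranking the unseen items and keeping the top N. The function name and the toy scores below are illustrative assumptions.

def select_top_n(user, unseen_items, predict, n=3):
    # Selection task: rank the items the user has not rated yet by predicted score.
    scored = [(predict(user, item), item) for item in unseen_items]
    scored = [(score, item) for score, item in scored if score is not None]
    return [item for score, item in sorted(scored, reverse=True)[:n]]

# Toy predictor standing in for whatever model solves the prediction task.
toy_scores = {("alice", "item3"): 4.2, ("alice", "item5"): 2.7, ("alice", "item6"): 3.9}
predict = lambda user, item: toy_scores.get((user, item))

print(select_top_n("alice", ["item3", "item5", "item6"], predict, n=2))
# ['item3', 'item6']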

The incorporation of machine learning techniques in the context of Web personalization can provide a complete solution to the overall adaptation task. It proves to be an appropriate way to analyze data collected on the Web and extract useful knowledge from it. The effort carried out in this direction has led to the growth of a new research area, named Web mining [Arotariteia and Mitra, 2004, Furnkranz, 2005, Kosala and Blockeel, 2000], which refers to the application of Data Mining methods to automatically discover and extract knowledge from data generated by the Web. Commonly, according to the different types of Web data which can be considered in the process of personalization, Web mining can be split into three different categories, namely Web content mining, Web structure mining, and Web usage mining.

Web content mining [Cimiano and Staab, 2004, Liu and Chang, 2004] is concerned with the discovery of useful information from the Web contents. Web content can encompass a very broad range of data, such as text, images, audio, video and metadata, as well as hyperlinks. Moreover, Web content data can be represented by unstructured data such as free texts, semi-structured data such as HTML documents, and more structured data such as data in tables or database-generated HTML pages. Recently, research in this field has been focusing on mining multiple types of data, leading to a new branch called multimedia data mining, representing an instance of Web content mining.

Web structure mining [Costa and Gong, 2005, Furnkranz, 2002] discovers the model underlying the link structures of the Web. The model is based on the topology of the hyperlinks, characterizing the structure of the Web graph. This can be used to categorize Web pages and to generate information about the relationships or the similarity degrees existing among different Web pages.

Web usage mining [Facca and Lanzi, 2005, Mobasher, 2005, Pierrakos et al., 2003, Zhou et al., 2005] aims at discovering interesting patterns from usage data generated during the interactions of users with the Web site, generally characterizing the navigational behavior of users. Web usage data includes the data from Web server access logs, proxy server logs, registration data, mouse clicks, and any other data which is the result of the user interactions. Web usage mining can be a valuable and important source of ideas and solutions toward realizing Web personalization. It provides an approach to the collection and preprocessing of usage data, and constructs models representing the behavior and the interests of users. These models can be used by a personalization system automatically, i.e. without the intervention of any human expert, for realizing the required personalization functions. Web usage mining represents the most employed approach for the development of personalization systems, as also demonstrated by a large number of research papers published on this topic [Abraham, 2003, Facca and Lanzi, 2005, Mobasher, 2006, Pierrakos et al., 2003].

In this chapter, the attention is mainly focused on the Web personalization process based on the adoption of the Web usage mining approach. In general, a usage-based Web personalization process, being essentially a data mining process as asserted before, consists of the following basic data mining stages [Mobasher et al., 2000]:

• Web data collection: Web data are gathered from various sources using different techniques that make it possible to attain efficient collections of user data for personalization.

• Web data preprocessing: Web data are preprocessed to obtain data in a form suitable for analysis in the next step. In particular, in this stage, data are cleaned from noise, inconsistencies are resolved, and finally data are organized in an integrated and consolidated manner.

Page 24: Web Personalization in Intelligent Environments


• Pattern discovery: the collected data are analyzed in order to extract correlations between data and discover usage patterns corresponding to the behavior and interests of users. In this stage, learning methods such as clustering, association rule discovery, and sequential pattern discovery are applied in order to automate the construction of user models.

• Personalization: the extracted knowledge is employed to implement the actual personalization functions. The knowledge extracted in the previous stage is evaluated, and the set of actions necessary for generating recommendations is determined. In a final step, the generated recommendations are presented to the users using proper visualization techniques.

In the overall process of usage-based Web personalization, two principal and related modules can be identified: an offline module and an online module.

In the offline module, Web usage data are collected and preprocessed. Subsequently, the specific usage mining tasks are performed in order to derive the knowledge useful for the implementation of personalization functions. Hence, the offline module generally covers the first three stages previously identified: Web data collection, Web data preprocessing and pattern discovery.

The online module mainly comprises a personalization engine which exploits the knowledge derived by the offline activities in order to provide users with interesting information according to their needs and interests.

Fig. 1. The scheme of a usage-based Web personalization process

Page 25: Web Personalization in Intelligent Environments


Figure 1 depicts a generalized framework for the entire Web personalization process based on Web usage mining. In the following sub-sections, a comprehensive view of this process is presented, providing a detailed discussion of each involved activity. Additionally, an overview of works and methods proposed to address each stage is presented.

3.1 Web Data Collection

As in any data mining application, including the Web personalization process, data collection represents the primary task; it has to be performed with the aim of gathering the relevant Web data, which will be analyzed to provide useful information about user behavior [Srivastava et al., 2000].

There are two main sources of data for Web usage mining, corresponding to the two software systems interacting during a Web session: the Web server side and the client side. When intermediaries occur in the client-server communication, such as proxy servers and packet sniffers, they become another important source of usage data. Usage data collected at the different sources represent the navigation patterns of different segments of the overall Web traffic on the site. In the following, each source of usage data is examined.

Server Side Data
Web servers represent the richest and most common source of Web data because they can explicitly record large amounts of information characterizing the browsing behavior of site visitors.

Data collected at the server side principally include various types of log files generated by the Web server. Data recorded in the server log files reflect the (possibly concurrent) accesses to a Web site by multiple users in chronological order. These log files can be stored in various formats. Most Web servers support as a default option the Common Log Format (CLF), which typically includes information such as the IP address of the connected user, the time stamp of the request (date and time of access), the URL of the requested page, the request protocol, a code indicating the status of the request, and the size of the page (if the request is successful).
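To make the CLF structure concrete, the following sketch (an illustration of ours, not part of the original chapter) parses one Common Log Format entry with a regular expression; the field names follow the standard CLF layout summarized above.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict with the CLF fields of one log line, or None if malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request field bundles method, URL and protocol, e.g. "GET /index.html HTTP/1.0".
    parts = entry["request"].split()
    entry["method"], entry["url"], entry["protocol"] = (parts + [None] * 3)[:3]
    return entry

line = '192.168.0.1 - - [10/Oct/2008:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse_clf_line(line)["url"])   # -> /index.html
```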

Other log file formats are the W3C Extended Log Format, supported by Web servers such as Apache and Netscape, and the very similar W3SVC format, supported by Microsoft Internet Information Server. These formats include additional information about the user requests, such as the referring URL of the requested page, the name and version of the browser used for navigation, and the operating system of the host machine.

Data recorded in log files may not always be entirely reliable. The unreliability of these data sources is mainly due to the presence of various levels of caching within the Web environment and to the misinterpretation of user IP addresses.

Page 26: Web Personalization in Intelligent Environments


Web caching is a mechanism developed to reduce latency and Web traffic. It consists in keeping track of the Web pages requested by users and saving a copy of these pages for a certain period of time.

Web caches can be configured either at the level of the client's local browser or at an intermediate proxy server. Requests for cached Web pages are not recorded in log files: when a user accesses the same Web page again, the cached copy is returned to the user rather than a new request being made to the server. In this way, the user request does not reach the Web server holding the page and, as a result, the server is not aware of the actions and page accesses made by the users. Cache-busting represents one solution to this problem. It involves the use of special headers, defined either in Web servers or Web pages, that include directives establishing which objects should be cached and for how long.
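As an illustration of cache-busting (ours, not taken from the chapter), a server can attach standard HTTP caching directives to its responses; the header names below are the usual Cache-Control, Pragma and Expires fields, while the response dictionaries themselves are hypothetical.

```python
# Hypothetical response headers a server might emit to discourage caching of a
# personalized page, so that every access reaches the server and is logged.
no_cache_headers = {
    "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
    "Pragma": "no-cache",                       # for HTTP/1.0 caches
    "Expires": "Thu, 01 Jan 1970 00:00:00 GMT"  # already expired
}

# Conversely, a purely static object can be declared cacheable for one hour.
cacheable_headers = {"Cache-Control": "public, max-age=3600"}
```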

The second problem, IP address misinterpretation, is essentially caused by two factors. When an intermediate proxy server assigns the same IP address to all its users, the requests from different host machines passing through the proxy are recorded in the log files with the same IP. The same problem occurs when different users share the same host machine. Dynamic IP allocation gives rise to the opposite situation, where different addresses may be assigned to the same user. Both these problems can cause serious complications in the whole Web personalization process, where it is fundamental to identify individual users in order to discover their interests.

The Web server can also collect other kinds of usage data through the dispensation and tracking of cookies. Cookies are tokens (short strings) generated by the Web server for individual client browsers in order to automatically track the site users. Through this mechanism, the Web server can store its own information about the user in a cookie log within the client machine. This information usually consists of a unique ID, created by the server, which will be used by the same server to recognize the user on subsequent visits to the site. The use of cookies has raised growing concerns regarding user privacy and security. In fact, cookies require the cooperation of users, who, for different reasons, may choose to disable the option for accepting them.
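A minimal sketch of this tracking mechanism, assuming Python's standard http.cookies module and a cookie simply named uid (both choices are ours, not the chapter's): the server reuses the ID sent by the browser if present, otherwise it issues a fresh one.

```python
import uuid
from http.cookies import SimpleCookie

def identify_user(cookie_header):
    """Return (user_id, set_cookie_header_or_None) for an incoming request.

    If the browser already sent our tracking cookie, reuse its value;
    otherwise generate a fresh ID and ask the browser to store it.
    """
    cookies = SimpleCookie(cookie_header or "")
    if "uid" in cookies:
        return cookies["uid"].value, None
    new_id = uuid.uuid4().hex
    response = SimpleCookie()
    response["uid"] = new_id
    response["uid"]["max-age"] = 60 * 60 * 24 * 365   # keep it for one year
    return new_id, response.output(header="Set-Cookie:")
```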

Another kind of data useful for Web personalization that the Web server can collect are the data explicitly supplied by users during their interactions with the site. This kind of data is typically obtained by filling in registration forms, which can provide important demographic and personal information as well as knowledge about user preferences. However, these data are not always reliable, since users often provide incomplete and inaccurate information. Additional explicit user data collected at the server side are the query data generated by online visitors while searching for pages relevant to their information needs [Buchner and Mulvenna, 1999].

Client Side Data
Usage data collected at the client side are data originated by the host accessing the Web site.

Page 27: Web Personalization in Intelligent Environments


A first method to collect client side data consists in the use of remote agents (generally implemented in Java or JavaScript) that are embedded in Web pages, for example as Java applets [Shahabi et al., 2001]. These agents make it possible to collect information directly from the client, such as the user browsing history, the pages visited before the current page, the sites visited before and after the current site, and the times at which the user accesses the site and leaves it. This mechanism of client side data collection provides more reliable data, since it overcomes the limitations of Web caching and IP misinterpretation (seen before) that affect the use of server log files for collecting information about the user browsing behavior. However, this method of usage data collection requires user cooperation in enabling JavaScript and Java applets on their machines. In fact, since the employment of remote agents may affect client system performance, introducing additional overhead whenever users access the Web site, users may choose to disable these functionalities on their systems.

An older mechanism used to collect usage data from the client host consists in modifying the source code of an existing browser, such as Mosaic or Mozilla, to enhance its data collection capabilities [Cunha et al., 1995].

Browsers are modified so that they can record information about the user navigational behavior, such as the Web pages visited by users, the access time, and the response time of the server. As with remote agents, in this case too user cooperation is necessary.

Modified versions of browsers are often considered a threat to user privacy. Thus, one of the main difficulties inherent in this method of data collection consists in convincing users to adopt these modified browser versions. A way often used to overcome this difficulty is to offer incentives to users, such as additional software or services like those offered by the AllAdvantage (www.alladvantage.com) and NetZero (www.netzero.com) companies. Moreover, modifying a modern browser is not a simple task, even when its source is available.

Intermediary Data
Another important source of data reflecting the user browsing behavior is the proxy server. A proxy server is a software system which acts as an intermediary between the client browser and the Web server, providing security, administrative control and caching services. Proxy caching represents a way to reduce the loading time of a Web page as well as the network traffic load at the server and client sides [Cohen et al., 1998].

This intermediary uses logs with a format similar to server log files for storing the Web page requests and the corresponding responses from the server. This is the main advantage of using these logs: since proxy caching reveals the requests of multiple clients to multiple servers, it can be considered a valuable source of data characterizing the navigational behavior of a group of anonymous users sharing a common proxy server [Srivastava et al., 2000].

Packet sniffers provide an alternative method of intermediary usage data collection. A packet sniffer is a piece of software (sometimes a hardware device)

Page 28: Web Personalization in Intelligent Environments


which is able to monitor the network traffic coming to a Web server and to extract usage data directly from TCP/IP packets. On the one hand, the use of packet sniffers has the advantage that data are collected and analyzed in real time. On the other hand, since data are not logged, data can be lost due to failures of the packet sniffer or during data transmission.

3.2 Web Data Preprocessing

The second stage in any usage-based Web personalization process is the preprocessing of Web data. Web data collected from the various sources seen above are usually voluminous and characterized by noise, ambiguity and incompleteness. As in most data mining applications, these data need to be assembled into data collections expressed in a consistent and integrated manner, suitable as input to the next step of pattern discovery. To accomplish this, a preliminary activity of data preprocessing is necessary. Data preprocessing involves a set of operations such as the elimination of noise, the resolution of inconsistencies, the filling in of missing values, the removal of redundant or irrelevant data, and so on. In the particular context of Web personalization, the goal of data preprocessing is to transform and aggregate the raw data into different levels of abstraction which can properly be employed to characterize the behavior of users in the overall process of personalization. Among the various levels, the pageview represents the most basic level of data abstraction.

A pageview is a set of Web objects or resources corresponding to a single user event, such as frames, graphics, and scripts. In Mobasher [2007], the author identifies the session as the most basic level of behavioral abstraction and defines it as a sequence of pageviews referring to a single user during a single visit. He also states that a session could be used directly as a user profile, being able to capture the user behavior over time. To construct significant data abstractions, the data preprocessing stage typically includes three main activities, namely data filtering, user identification and user session identification.

Data preprocessing is strongly related to the problem domain and to the quality and type of available data. Hence, this step requires an accurate analysis of the data and constitutes one of the hardest tasks in the overall Web personalization process. An additional facet to be taken into account is the trade-off inherent in the preprocessing step. On the one hand, insufficient preprocessing could make the subsequent pattern analysis task more difficult. On the other hand, excessive preprocessing could remove data carrying implicit knowledge useful for the successive steps of the personalization process. As a consequence, the success of pattern discovery is highly dependent on the correct application of data preprocessing tasks. An extensive description of data preparation and preprocessing methods can be found in Cooley et al. [1999]. In the following, a brief description of the activities involved in the data preprocessing stage is given, focusing on the techniques applied to perform the respective tasks.

Page 29: Web Personalization in Intelligent Environments


Data Filtering
Data filtering is the first activity included in the data preprocessing stage. It represents a fundamental task devoted to cleaning raw Web data from noise. This activity mainly concerns server side data, since these can be particularly noisy. Hence, the rest of the discussion about the data filtering activity will focus on log files.

Since Web log files record all the interactions between a Web site and its users, they may also contain information useless for describing the navigational behavior of visitors, and they often contain a large amount of noise. The aim of data filtering is to clean Web data by analyzing the available data and removing from log files those records corresponding to irrelevant and redundant requests. Redundant records in log files are mainly due to the model used by the HTTP protocol, which issues a separate access request for every file, image or multimedia object embedded in the Web page requested by the user. In this way, a single user request for a Web page may often result in several log entries corresponding to files automatically downloaded without an explicit request by the same user. Since these records do not represent the actual browsing activity of the connected user, they are deemed redundant and have to be removed.

Elimination of these items can be reasonably accomplished by checking the suffix of the URL name. For example, all log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG and map can be removed. The list can be modified depending on the type of site being analyzed: for a site consisting mainly of multimedia content, the elimination of requests to these types of files would cause the loss of important and useful information [Cooley, 2000]. In addition, records corresponding to failed user requests, for example those with an error status code, are also filtered out.
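A minimal sketch of this filtering step (our own illustration; the suffix list is only an example and, as noted above, would be adapted to the site being analyzed), applied to entries shaped like the dicts produced by the CLF parser sketched earlier:

```python
# Filename suffixes treated as implicit (non user-initiated) requests.
IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png", ".css", ".js", ".map")

def keep_entry(entry):
    """Keep only successful, explicitly requested pageviews."""
    url = entry["url"].lower()
    status = int(entry["status"])
    if url.endswith(IGNORED_SUFFIXES):
        return False                      # embedded object, not a pageview
    if status >= 400:
        return False                      # failed request (error status code)
    return True

def filter_log(entries):
    return [e for e in entries if keep_entry(e)]
```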

Another crucial task of data filtering is the identification and elimination of accesses generated by Web robots. Web robots (also known as Web crawlers or Web spiders) are programs that traverse the Web in a methodical and automated manner, downloading complete Web sites in order to update the index of a search engine. The entries generated by these programs are not considered usage data representative of user browsing behavior, so they are filtered out from the log files. In conventional techniques, Web robot sessions are detected in different ways: by examining sessions that access a specially formatted file called robots.txt, by exploiting the User Agent field of log files wherein most crawlers identify themselves, or by matching the IP address of sessions with those of known robot clients.

A robust technique to detect spider sessions has been proposed by Tan and Kumar [2002]. Based on the assumption that the behavior of robots differs from that of human users, they recognized Web robots with high accuracy by using a set of relevant features extracted from access logs (percentage of media files requested, percentage of requests made by HTTP method, average time between requests). Another simple method to recognize robots is to monitor the navigational behavior pattern of the user.

Page 30: Web Personalization in Intelligent Environments


In particular, if a user accesses all the links of all the pages of a Web site, it will be considered a crawler.
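The conventional heuristics mentioned above can be combined in a simple filter, sketched below under our own assumptions: the bot-agent keywords and the (here empty) list of known robot IPs are illustrative, and the user_agent field is only available when an extended log format is used.

```python
KNOWN_BOT_AGENTS = ("googlebot", "bingbot", "slurp", "crawler", "spider")
KNOWN_BOT_IPS = set()     # would be loaded from a list of known robot clients

def is_robot_session(session):
    """session: list of log-entry dicts attributed to one candidate user."""
    for entry in session:
        agent = (entry.get("user_agent") or "").lower()
        if any(bot in agent for bot in KNOWN_BOT_AGENTS):
            return True                      # crawler identifies itself
        if entry["url"].endswith("robots.txt"):
            return True                      # access to robots.txt
        if entry["host"] in KNOWN_BOT_IPS:
            return True                      # known robot IP address
    return False

def remove_robot_sessions(sessions):
    return [s for s in sessions if not is_robot_session(s)]
```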

User Identification
User identification is one of the most delicate and complicated steps in the overall Web personalization process. In fact, the identification of a single user is fundamental in order to distinguish his/her corresponding browsing behavior. Various methods have been proposed to automatically recognize a user. Some of the most important techniques employed are illustrated below.

Many Web applications require explicit user registration. However, a potential problem with such methods is the reluctance of users to share personal information. Moreover, this approach presents another important limitation: the burden it places on users, which on many Web sites discourages navigation and visits. As a consequence, a number of methods able to automatically identify users have been developed.

Among the proposed methods, the simplest and also the most widely adopted approach consists in assigning a user to each different IP address present in the log files [Nasraoui and Petenes, 2003, Suryavanshi et al., 2005]. However, this method is not very accurate because, for example, a visitor may access the Web from different computers, or many users may share the same IP address (if a proxy is used).

Other Web usage mining tools use more accurate approaches for the a priori identification of unique visitors, such as cookies [Kamdar and Joshi, 2000]. The use of cookies is not without problems: as already illustrated above, users may disable cookies on their systems.

An alternative method of user identification is that proposed by Pitkow [1997]. This method consists in the use of special Internet services, such as inetd and fingerd, which provide the user name and other information about the user accessing the Web server. However, as with cookies, these services can also be disabled by users. To overcome this limitation, further methods have been proposed in the literature.

In Cooley et al. [1999], the authors have proposed two different heuristics for user identification. The first method analyzes Web log files expressed in the Extended Log Format, searching for different browsers or different operating systems even when the IP address is the same; this information suggests that the requests originate from different users. The second method exploits knowledge about the topology of the Web site to recognize requests of different users. More precisely, if a request for a Web page comes from the same IP address as requests for other Web pages, but no link exists between these pages, a new user is recognized.
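A simplified sketch of the first heuristic, under our own assumptions (the user_agent field comes from an extended log format, and the second, topology-based heuristic is omitted): entries sharing both IP address and user agent are attributed to the same user, while a new agent seen for a known IP is treated as a new user.

```python
def identify_users(entries):
    """Assign a synthetic user id to each log entry (simplified heuristic)."""
    user_ids = {}
    for entry in entries:
        key = (entry["host"], entry.get("user_agent", ""))
        if key not in user_ids:
            user_ids[key] = len(user_ids)    # next fresh user id
        entry["user_id"] = user_ids[key]
    return entries
```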

User Session Identification
In personalization systems based on Web usage mining techniques, the usage data are analyzed in order to discover the user browsing behavior on a specific Web site, which is embedded, as specified above, in user sessions. For this reason, the

Page 31: Web Personalization in Intelligent Environments


identification of user sessions represents a fundamental task for the subsequent development of personalization functions and constitutes another important step in Web data preprocessing.

Based on the definitions found in the scientific literature, a user session can be defined as a delimited set of URLs corresponding to the pages visited by a user from the moment the user enters a Web site to the moment the same user leaves it [Suryavanshi et al., 2005]. Starting from this definition, we can state that the problem of user session identification is closely related to the previous problem of identifying a single user. Assuming a user has been identified, following one of the methods previously described, the next step of Web data preprocessing is to perform user session identification by dividing the clickstream of each user into sessions. Concerning the problem of user session identification, Spiliopoulou [1999] has divided the existing approaches into two main categories: time-based and context-based methods.

In time-based methods, the usual solution is to set a minimum timeout and assume that consecutive accesses within it belong to the same session, or to set a maximum timeout, so that two consecutive accesses whose distance exceeds it belong to different sessions. Different values have been chosen for this timeout, depending on the content of the examined site and on the particular purpose of the personalization process. On the other hand, context-based methods consider the access to specific kinds of pages, or they rely on the definition of conceptual units of work, to identify the different user sessions. Here, transactions are recognized, where a transaction represents a subset of pages occurring in a user session. Based on the assumption that transactions depend on contextual information, Web pages are classified as auxiliary, content and hybrid pages. Auxiliary pages contain links to other pages of the site; content pages contain the information of interest for the user; finally, hybrid pages are a combination of the two previous kinds.
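The time-based approach can be sketched as follows (our illustration, not the chapter's): a single user's chronologically ordered accesses are split into sessions whenever the gap between two consecutive accesses exceeds a maximum timeout; the 30-minute value is only an example and would be tuned to the site, as noted above.

```python
from datetime import timedelta

MAX_GAP = timedelta(minutes=30)   # example maximum timeout between two accesses

def sessionize(entries, max_gap=MAX_GAP):
    """Split one user's chronologically ordered entries into sessions.

    Two consecutive accesses whose distance exceeds max_gap are assigned
    to different sessions (time-based heuristic).
    """
    sessions = []
    current = []
    last_time = None
    for entry in entries:                 # entries sorted by entry["time"] (a datetime)
        if last_time is not None and entry["time"] - last_time > max_gap:
            sessions.append(current)
            current = []
        current.append(entry)
        last_time = entry["time"]
    if current:
        sessions.append(current)
    return sessions
```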

Starting from this classification, Cooley et al. [1999] have distinguished content-only transactions from auxiliary-content transactions. The former include all the content pages visited by the user, whereas the latter refer to the paths followed to reach a content page. Several methods have been developed to identify transactions, but none of them is without problems.

3.3 Pattern Discovery

Once Web data have been preprocessed, the next stage of the Web personalization process consists in discovering patterns of usage of the Web site through the application of Web usage mining techniques. To achieve this aim, methods and algorithms belonging to several fields, such as statistics, data mining, machine learning and pattern recognition, are applied to discover knowledge useful for the final personalization process.

Most commercial applications derive knowledge about users by performing statistical analysis on session data. Many Web traffic mining tools produce periodic reports including important statistical information describing user browsing patterns, such as the most frequently accessed pages,

Page 32: Web Personalization in Intelligent Environments


average view time, and average length of navigational paths. This kind of extracted knowledge may be useful to improve system performance and facilitate site modification. In the context of knowledge discovery techniques specifically designed for the analysis of Web usage data, research effort has mainly focused on three distinct paradigms: association rules, sequential patterns and clustering. Han and Kamber [2001] give an exhaustive review of these techniques. The most straightforward technique employed in Web usage mining is association rules, which express associations among Web pages that frequently appear together in user sessions. Typically, an association rule is expressed in the following form:

A.html, B.html ⇒ C.html

which states that if a user has visited page A.html and page B.html, it is very likely that in the same session the same user has also visited page C.html. This kind of approach has been used in [Joshi et al., 2003, Nanopoulos et al., 2002], while some interestingness measures to evaluate association rules mined from Web usage data have been proposed by Huang et al. [2002].
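As a concrete illustration (ours, not the chapter's; a full Apriori-style miner is omitted), the standard support and confidence of the rule above can be estimated directly from a set of sessions:

```python
def support(itemset, sessions):
    """Fraction of sessions containing every page of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for s in sessions if itemset <= set(s))
    return hits / len(sessions)

def confidence(antecedent, consequent, sessions):
    """Estimated probability of the consequent pages given the antecedent pages."""
    return (support(set(antecedent) | set(consequent), sessions)
            / support(antecedent, sessions))

sessions = [["A.html", "B.html", "C.html"],
            ["A.html", "B.html"],
            ["A.html", "C.html"]]
print(confidence(["A.html", "B.html"], ["C.html"], sessions))   # -> 0.5
```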

Fuzzy association rules, obtained by combining association rules and fuzzy logic, have been extracted in Wong and Pal [2001]. Sequential pattern discovery turns out to be particularly useful for the identification of navigational patterns in Web usage data. In this kind of approach, the element of time is introduced into the process of discovering patterns which frequently appear in user sessions. To extract sequential patterns, two main classes of algorithms are employed: methods based on association rule mining and methods based on the use of tree structures and Markov chains. Some well-known algorithms for mining association rules have been modified to obtain sequential patterns. For example, the Apriori algorithm has been extended to derive two new algorithms, AprioriAll and GSP, as discussed in Huang et al. [2002] and Mortazavi-Asl [2001]. An alternative algorithm based on the use of a tree structure has been presented in Pei et al. [2000]. Tree structures have also been used in Menasalvas et al. [2002].

Clustering is the most widely employed technique in the pattern discovery process. Clustering techniques look for groups of similar items among large amounts of data, based on a distance function which computes the similarity between items. Vakali et al. [2004] provide an exhaustive overview of Web data clustering methods used in different research works in this area. Following the classification suggested by Vakali, in the Web usage domain two kinds of interesting clusters can be discovered: usage clusters and Web document clusters. Xie and Phoha [2001] were the first to suggest that the focus of Web usage mining should be shifted from single user sessions to groups of user sessions. Subsequently, in a large number of works, usage clustering techniques have been used in the Web usage mining process for grouping together similar sessions [Banerjee and Ghosh, 2001, Heer and Chi, 2002, Huang et al., 2002]. Clustering of Web documents aims to discover groups of pages having related content. In general, a Web document can be considered as a collection of Web pages (a set of related Web resources, such as HTML files, XML files, images, applets, and multimedia resources). In this framework, the Web topology can be regarded as a directed graph, where the nodes

Page 33: Web Personalization in Intelligent Environments


represent the Web pages identified by URL addresses and the edges among nodes represent the hyperlinks among Web pages. In this context, the concepts of compound documents [Eiron and McCurley, 2003] and logical information units [Tajima et al., 1999] have been introduced. A compound document is a set of Web pages with the fundamental property that its link graph contains a vertex from which a path leads to every other part of the document. Moreover, a Web community is defined as a set of Web pages that link to more Web pages in the community than to pages outside the community [Greco et al., 2004].

The main benefits derived from clustering include increasing Web information accessibility, understanding users' navigation behavior, identifying user profiles, and improving information retrieval in search engines and content delivery on the Web.
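A minimal sketch of usage clustering (our own, assuming scikit-learn is available; the works cited above use a variety of more specialized algorithms): sessions are encoded as binary page vectors and grouped with k-means, so that each centroid can be read as a prototypical usage profile.

```python
import numpy as np
from sklearn.cluster import KMeans

def sessions_to_matrix(sessions, pages):
    """Binary session-page matrix: 1 if the session visited the page."""
    index = {p: j for j, p in enumerate(pages)}
    matrix = np.zeros((len(sessions), len(pages)))
    for i, session in enumerate(sessions):
        for page in session:
            matrix[i, index[page]] = 1.0
    return matrix

sessions = [["/home", "/products", "/cart"],
            ["/home", "/products"],
            ["/home", "/about", "/contact"]]
pages = sorted({p for s in sessions for p in s})
X = sessions_to_matrix(sessions, pages)

# Group similar sessions into usage clusters.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)
```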

3.4 Personalization

The knowledge extracted through the process of knowledge discovery has to be exploited in the actual, final personalization process. Personalization functions can be accomplished either manually or in a manner that is automatic and transparent to the user. In the first case, the discovered knowledge has to be expressed in a form comprehensible to humans, so that it can be analyzed to support human experts in making decisions. To accomplish this task, different approaches have been introduced in order to provide useful information for personalization.

An effective method for presenting comprehensible information to humans is the use of visualization tools such as WebViz [Pitkow and Bharat, 1994], which represents navigational patterns as graphs. Reports are also a good means to synthesize and visualize the statistical information previously generated. Personalization systems such as WUM [Spiliopoulou and Faulstich, 1998] and WebMiner [Cooley et al., 1997] use SQL-like query mechanisms for extracting rules from navigational patterns. Nevertheless, decisions requiring human intervention may introduce delay and loss of information. As a consequence, a more interesting approach consists in integrating Web usage mining techniques into the personalization process. In particular, the knowledge extracted from Web data is automatically exploited in a personalization process which adapts the Web-based application according to the discovered patterns. The discovered knowledge is subsequently delivered to the users by means of one or more personalization functions. Thus, the activities performed in the actual personalization step strongly depend on the personalization functions which the system offers. For instance, if the system offers the personalization function of adapting the content of the Web site to the needs of the current users, the content of Web pages is adapted to the interests of users, possibly modifying the graphical interface as well. In the case of link suggestion, a list of links considered interesting for users is displayed in the page currently visited. In e-commerce applications, a list of products is recommended to the online customer, taking into account the user interests. These are only a few examples of personalization tasks performed in the actual personalization step.

Page 34: Web Personalization in Intelligent Environments


Following the scheme of a general usage-based Web personalization system, this final phase is included in the online module, which is aimed at realizing the personalization functionalities offered by the Web site. All the other steps involved in the Web personalization system, i.e. Web data collection, Web data preprocessing and pattern discovery, are periodically performed in the offline module.

4 Conclusions

This chapter provided a comprehensive view of Web personalization, especially focusing on the different steps involved in a general usage-based Web personalization system and the variety of approaches to Web personalization.

In the last few years, research has achieved encouraging results in the field of Web personalization. However, a number of challenges and open research questions still have to be addressed. One of the key aspects of a personalization process is the derivation of user models able to encode the preferences and needs of users. In this context, much work still has to be done toward deriving adaptive user models able to dynamically capture the continuous changes in the interests of users.

Another important aspect that needs to be investigated concerns the definition of more appropriate metrics for evaluating user satisfaction with respect to the generated recommendations. Also, the exploitation of relevance feedback (explicitly expressed by the users or implicitly derived by observing the behavior of users once they receive recommendations) could be useful not only to dynamically adapt user models to the changing interests of users, but also to provide indicators to quantify the quality of the provided suggestions.

A further, extremely interesting aspect that could certainly be developed further in the literature concerns the identification of suitable measures able to estimate the benefits that can be obtained by endowing Web applications with personalization functionalities. This would make it possible to justify the considerable research effort devoted to developing adaptive Web applications that incorporate personalization processes able to support their users by providing them with the right contents or services at the right time.

References

Abraham, A.: Business intelligence from web usage mining. Journal of Information & Knowledge Management 2(4), 375–390 (2003)

Acquisti, A., Varian, H.: Conditioning prices on purchase history. Marketing Science 24(3), 367–381 (2005)

Aggarwal, C.C., Wolf, J., Yu, P.S.: A new method for similarity indexing for market data. In: Proceedings of the 1999 ACM SIGMOD Conference, Philadelphia, PA, pp. 407–418 (1999)

Arotariteia, D., Mitra, S.: Web mining: a survey in the fuzzy framework. Fuzzy Sets and Systems 148(1), 5–19 (2004)

Banerjee, A., Ghosh, J.: Clickstream clustering using weighted longest common subsequences. In: Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining (2001)

Page 35: Web Personalization in Intelligent Environments


Bouganis, C., Koukopoulos, D., Kalles, D.: A real time auction system over the www. In: Proceedings of the Conference on Communication Networks and Distributed Systems Modeling and Simulation, San Francisco, CA, USA (1999)

Buchner, A.G., Mulvenna, M.D.: Discovering internet marketing intelligence through online analytical web usage mining. SIGMOD Record 27(4), 54–61 (1999)

Chignoli, R., Crescenzo, P., Lahire, P.: Customization of links between classes. Technical report, Laboratoire d'Informatique, Signaux et Systèmes de Sophia-Antipolis (1999)

Choudhary, V., Ghose, A., Mukhopadhyay, T., Rajan, U.: Personalized pricing and quality differentiation. Management Science 51(7), 1120–1130 (2005)

Cimiano, P., Staab, S.: Learning by googling. SIGKDD Explorations special issue on Web Content Mining 6(2), 24–33 (2004)

Claypool, M., Gokhale, A., Miranda, T., Murnikov, P., Netes, D., Sartin, M.: Combining content-based and collaborative filters in an online newspaper. In: Proceedings of the ACM SIGIR 1999 Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, California (1999)

Cohen, E., Krishnamurthy, B., Rexford, J.: Improving end-to-end performance of the web using server volumes and proxy filters. In: Proceedings of ACM SIGCOMM (1998)

Cooley, R.: Web usage mining: discovery and application of interesting patterns from Web data. PhD thesis, University of Minnesota (2000)

Cooley, R., Mobasher, B., Srivastava, J.: Grouping Web page references into transactions for mining world wide web browsing patterns. Technical report TR 97-021, Dept. of Computer Science, University of Minnesota, Minneapolis, USA (1997)

Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems 1(1), 32–55 (1999)

Costa, M., Gong, Z.: Web structure mining: an introduction. In: Proceedings of the IEEE International Conference on Information Acquisition (2005)

Cunha, C., Bestavros, A., Crovella, M.E.: Characteristics of www client-based traces. Technical report TR-95-010, Boston University, Department of Computer Science (1995)

Eiron, N., McCurley, K.: Untangling compound documents on the web. In: Proceedings of ACM Hypertext, pp. 85–94 (2003)

Facca, F.M., Lanzi, P.: Mining interesting knowledge from weblogs: a survey. Data & Knowledge Engineering 53, 225–241 (2005)

Furnkranz, J.: Web structure mining - exploiting the graph structure of the world-wide web. ÖGAI-Journal 21(2), 17–26 (2002)

Furnkranz, J.: Web mining. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook. Springer, Heidelberg (2005)

Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to weave an information tapestry. Communications of the ACM 35(12), 61–70 (1992)

Greco, G., Greco, S., Zumpano, E.: Web communities: models and algorithms. World Wide Web 7(1), 58–82 (2004)

Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)

Heer, J., Chi, E.: Mining the structure of user activity using cluster stability. In: Proceedings of the Workshop on Web Analytics (2002)

Huang, X., Cercone, N., An, A.: Comparison of interestingness functions for learning web usage patterns. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 617–620 (2002)

Page 36: Web Personalization in Intelligent Environments


Kamdar, T., Joshi, A.: On creating adaptive web sites using web log mining. Technical report TR-CS-00-05, Department of Computer Science and Electrical Engineering, University of Maryland (2000)

Kobsa, A., Koenemann, J., Pohl, W.: Personalized hypermedia presentation techniques for improving online customer relationships. The Knowledge Engineering Review 16(2), 111–155 (2001)

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L., Riedl, J.: GroupLens: Applying collaborative filtering to usenet news. Communications of the ACM 40(3), 77–87 (1997)

Kosala, R., Blockeel, H.: Web mining research: a survey. ACM SIGKDD Explorations Newsletter 2, 1–15 (2000)

Krulwich, B., Burkey, C.: Learning user information interests through extraction of semantically significant phrases. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, California (1996)

Joshi, K., Joshi, A., Yesha, Y.: On using a warehouse to analyse web logs. Distributed and Parallel Databases 13(2), 161–180 (2003)

Lang, K.: NewsWeeder: Learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning (1995)

Lieberman, H.: Letizia: An agent that assists web browsing. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 1995), Montreal, Quebec, Canada, pp. 924–929 (1995)

Liu, B., Chang, K.C.C.: Editorial: Special issue on web content mining. SIGKDD Explorations special issue on Web Content Mining 6(2), 1–4 (2004)

Manber, U., Patel, A., Robison, J.: Experience with personalization on Yahoo. Communications of the ACM 43(8), 35–39 (2000)

Menasalvas, E., Millan, S., Pena, J., Hadjimichael, M., Marban, O.: Subsessions: a granular approach to click path analysis. In: Proceedings of the FUZZ-IEEE Fuzzy Sets and Systems Conference, at the World Congress on Computational Intelligence, pp. 12–17 (2002)

Mladenic, D.: Personal WebWatcher: Implementation and design. Technical report, Department of Intelligent Systems, J. Stefan Institute, Slovenia (1996)

Mitchell, T., Caruana, R., Freitag, D., McDermott, J., Zabowski, D.: Experience with a learning personal assistant. Communications of the ACM 37(7), 81–91 (1994)

Mobasher, B.: Web usage mining and personalization. In: Singh, M.P. (ed.) Practical Handbook of Internet Computing. CRC Press, Boca Raton (2005)

Mobasher, B.: Web usage mining. In: Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer, Heidelberg (2006)

Mobasher, B.: Data mining for personalization. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) The Adaptive Web 2007. LNCS, vol. 4321, pp. 90–135. Springer, Heidelberg (2007)

Mobasher, B., Cooley, R., Srivastava, J.: Automatic personalization based on web usage mining. Communications of the ACM 43(8), 142–151 (2000)

Mortazavi-Asl, B.: Discovering and mining user web-page traversal patterns. Master's thesis, Simon Fraser University (2001)

Nanopoulos, A., Katsaros, D., Manolopoulos, Y.: Exploiting web log mining for web cache enhancement. In: Kohavi, R., Masand, B., Spiliopoulou, M., Srivastava, J. (eds.) WebKDD 2001. LNCS, vol. 2356, pp. 68–87. Springer, Heidelberg (2002)

Nasraoui, O., Krishnapuram, R., Joshi, A., Kamdar, T.: Automatic web user profiling and personalization using robust fuzzy relational clustering. In: Segovia, J., Szczepaniak, P., Niedzwiedzinski, M. (eds.) E-Commerce and Intelligent Methods, Studies in Fuzziness and Soft Computing. Springer, Heidelberg (2002)

Page 37: Web Personalization in Intelligent Environments


Nasraoui, O., Petenes, C.: Combining web usage mining and fuzzy inference for website personalization. In: Proceedings of WEBKDD 2003: Web Mining as a Premise to Effective Web Applications, pp. 37–46 (2003)

O'Connor, M., Herlocker, J.: Clustering items for collaborative filtering. In: Proceedings of the ACM SIGIR 1999 Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, California (1999)

Pazzani, M., Billsus, D.: Learning and revising user profiles: The identification of interesting web sites. Machine Learning 27, 313–331 (1997)

Pazzani, M., Billsus, D.: Content-based recommendation systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) The Adaptive Web 2007. LNCS, vol. 4321, pp. 325–341. Springer, Heidelberg (2007)

Pei, J., Han, J., Mortazavi-Asl, B., Zhu, H.: Mining access patterns efficiently from web logs. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 396–407 (2000)

Pierrakos, D., Paliouras, G., Papatheodorou, C., Spyropoulos, C.D.: Web usage mining as a tool for personalization: a survey. User Modeling and User-Adapted Interaction 13(4), 311–372 (2003)

Pitkow, J.: In search of reliable usage data on the www. In: Proceedings of the 6th International World Wide Web Conference, Santa Clara, CA (1997)

Pitkow, J., Bharat, K.: WebViz: A tool for world wide web access log visualization. In: Proceedings of the 1st International World Wide Web Conference, pp. 271–277 (1994)

Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Application of dimensionality reduction in recommender system - a case study. In: Proceedings of the WebKDD 2000 Web Mining for E-Commerce Workshop at ACM SIGKDD 2000, Boston (2000)

Schafer, J.B., Konstan, J., Riedl, J.: Recommender systems in E-commerce. In: Proceedings of the ACM Conference on E-commerce, pp. 158–166 (1999)

Schwab, I., Kobsa, A., Koychev, I.: Learning about users from observation. In: Adaptive User Interfaces. AAAI Press, Menlo Park (2000)

Schwarzkopf, E.: An adaptive web site for the UM 2001 conference. In: Proceedings of the UM 2001 Workshop on Machine Learning for User Modelling (2001)

Shahabi, C., Banaei-Kashani, F., Faruque, J.: A reliable, efficient, and scalable system for web usage data acquisition. In: Proceedings of the WebKDD 2001 Workshop, in conjunction with ACM SIGKDD (2001)

Spiliopoulou, M.: Data mining for the web. In: Zytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS, vol. 1704, pp. 588–589. Springer, Heidelberg (1999)

Spiliopoulou, M., Faulstich, L.: WUM: A web utilization miner. In: Proceedings of the International Workshop on the Web and Databases, Valencia, Spain, pp. 109–115 (1998)

Srivastava, J., Cooley, R., Deshpande, M., Tan, P.-N.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1(2), 1–12 (2000)

Suryavanshi, B., Shiri, N., Mudur, S.: An efficient technique for mining usage profiles using relational fuzzy subtractive clustering. In: Proceedings of the 2005 Int. Workshop on Challenges in Web Information Retrieval and Integration (WIRI 2005), pp. 23–29 (2005)

Tajima, K., Hatano, K., Matsukura, T., Sano, R., Tanaka, K.: Discovery and retrieval of logical information units in web. In: Proceedings of the Workshop on Organizing Web Space, WOWS 1999 (1999)

Tan, P.N., Kumar, V.: Discovery of web robot sessions based on their navigational patterns. Data Mining and Knowledge Discovery 6(1), 9–35 (2002)

Page 38: Web Personalization in Intelligent Environments


Vakali, A., Pokorny, J., Dalamagas, T.: An overview of web data clustering practices. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds.) EDBT 2004. LNCS, vol. 3268, pp. 597–606. Springer, Heidelberg (2004)

Wong, S., Pal, S.: Mining fuzzy association rules for web access case adaptation. In: Proceedings of the Workshop on Soft Computing in Case-Based Reasoning (2001)

Xie, Y., Phoha, V.V.: Web user clustering from access log using belief function. In: Proceedings of the First International Conference on Knowledge Capture, K-CAP 2001 (2001)

Zhou, B., Hui, S.C., Fong, A.C.M.: Web usage mining for semantic web personalization. In: Proceedings of the Workshop on Personalization on the Semantic Web, PerSWeb 2005 (2005)

Page 39: Web Personalization in Intelligent Environments

2

A Semantic Content-Based Recommender

System Integrating Folksonomies for Personalized Access

Pasquale Lops, Marco de Gemmis, Giovanni Semeraro, Cataldo Musto, Fedelucio Narducci, and Massimo Bux

Department of Computer Science, University of Bari “Aldo Moro” - Bari, Italy
{lops,degemmis,semeraro,musto,narducci,bux}@di.uniba.it

Summary. Basic content personalization consists in matching up the attributes of a user profile, in which preferences and interests are stored, against the attributes of a content object. The Web 2.0 (r)evolution and the advent of user generated content (UGC) have changed the game for personalization, since the role of people has evolved from passive consumers of information to that of active contributors. One of the forms of UGC that has drawn most attention from the research community is the folksonomy, a taxonomy generated by users who collaboratively annotate and categorize resources of interest with freely chosen keywords called tags.

In this chapter, we intend to investigate whether folksonomies might be a valuable source of information about user interests for a recommender system. In order to achieve that goal, folksonomies have been included in ITR (ITem Recommender), a content-based recommender system developed at the University of Bari [7]. Specifically, static content consisting of the descriptions of the items in a collection has been enriched with dynamic UGC through social tagging techniques.

The new recommender system, called FIRSt (Folksonomy-based Item Recommender syStem), extends the original ITR system by integrating UGC management, letting users express their preferences for items by entering a numerical rating as well as annotate rated items with free tags.

The main contribution of the chapter is an integrated strategy that enables a content-based recommender to infer user interests by applying machine learning techniques, both on official item descriptions provided by a publisher and on tags which users adopt to freely annotate relevant items.

Static content and tags are analyzed in advance by advanced linguistic techniques in order to capture the semantics of the user interests, often hidden behind keywords. The proposed approach has been evaluated in the domain of cultural heritage personalization. Experiments involving 40 real users show an improvement in the predictive accuracy of the tag-augmented recommender compared to the pure content-based one.

Keywords: Content-based Recommender Systems, Web 2.0, Folksonomy, Machine Learning, Semantics.


Page 40: Web Personalization in Intelligent Environments


1 Introduction

The amount of information available on the Web and in Digital Libraries is increasing over time. In this context, the role of user modeling and personalized information access is becoming crucial: users need personalized support in sifting through large amounts of retrieved information according to their interests. Information filtering systems, relying on this idea, adapt their behavior to individual users by learning their preferences during the interaction in order to construct a profile of the user that can later be exploited in selecting relevant items. Indeed, content personalization basically consists in matching up the attributes of a user profile, in which preferences and interests are stored, against the attributes of a content object.

Recent developments at the intersection of Information Filtering, Machine Learning, User Modeling and Natural Language Processing offer novel solutions for personalized information access. Most work focuses on the use of Machine Learning algorithms for the automated induction of a structured model of user interests and preferences from text documents, referred to as the user profile. If a profile accurately reflects user preferences, it is of tremendous advantage for the effectiveness of an information access process. For instance, it could be used to filter search results, by deciding whether a user is interested in a specific Web page or not and, in the negative case, preventing it from being displayed.

The problem with this approach is that traditional keyword-based profiles are unable to capture the semantics of user interests because they are primarily driven by a string matching operation. If a string, or some morphological variant, is found in both the profile and the document, a match is made and the document is considered relevant. String matching suffers from the problems of:

• polysemy, the presence of multiple meanings for one word;
• synonymy, multiple words with the same meaning.

The result is that, due to synonymy, relevant information can be missed if the profile does not contain the exact keywords occurring in the documents, while, due to polysemy, wrong documents could be deemed relevant.
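A toy illustration of both failure modes (the profile and document strings below are invented for this example and are not part of the chapter):

```python
def keyword_match(profile_keywords, document_text):
    """Naive string matching: a document is relevant if any profile keyword occurs in it."""
    words = set(document_text.lower().split())
    return any(keyword.lower() in words for keyword in profile_keywords)

profile = ["automobile"]
doc_about_cars = "review of the latest electric car models"
doc_about_finance = "the bank raised interest rates"          # 'bank' is polysemous

print(keyword_match(profile, doc_about_cars))     # False: the synonym 'car' is missed
print(keyword_match(["bank"], doc_about_finance)) # True even if the user meant 'river bank'
```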

Semantic analysis and its integration in personalization models is one of the most innovative and interesting approaches currently proposed in the literature to solve these problems. Semantic analysis is the key to learning more accurate profiles that capture concepts expressing user interests from relevant documents. These semantic profiles contain references to concepts defined in lexicons or ontologies.

The Web 2.0 (r)evolution and the advent of user generated content (UGC) have changed the game for personalization, since the role of people has evolved from passive consumers of information to that of active contributors. UGC refers to various kinds of media content, publicly available, that are produced by end-users. For example, on Amazon.com the majority of content is prepared by administrators, but numerous user reviews of the products being sold are submitted by regular visitors to the site.

One of the forms of UGC that has drawn most attention from the research community is the folksonomy, a taxonomy generated by users who collaboratively

Page 41: Web Personalization in Intelligent Environments


annotate and categorize resources of interest with freely chosen keywords called tags. Therefore, it should be investigated whether folksonomies might be a valuable source of information about user interests and whether they could be included in semantic user profiles.

The main contribution of this chapter is a strategy to infer user profiles by applying machine learning techniques both on the “official” item descriptions provided by a publisher and on the tags which users adopt to freely annotate relevant items. Static content and tags are analyzed in advance by advanced linguistic techniques in order to capture the semantics of the user interests, often hidden behind keywords. The goal of the paper can be formulated in the form of the following research question:

• Does the integration of tags cause an increase in the prediction accuracy in the process of filtering relevant items for users?

This research has been conducted within the CHAT project (Cultural Heritage fruition and e-learning applications of new Advanced multimodal Technologies), which aims at developing new systems and services for multimodal fruition of cultural heritage content. Data have been gathered from the collections of the Vatican picture-gallery, for which both images and detailed textual information about the paintings are available, by letting the users involved in the study both rate the paintings and annotate them with tags.

The paper is structured as follows. Section 2 briefly introduces Information Filtering and Recommender Systems. Section 3 provides details about the strategies adopted by the content-based recommender for performing semantic document indexing and profile learning, and about how users' tagging activity is handled by the recommender when building user profiles. Section 4 presents the experimental sessions carried out to evaluate the proposed idea and discusses the main findings of the study. Related work is briefly analyzed in Section 5, while conclusions and directions for future work are drawn in Section 6.

2 Information Filtering at Work: Recommender Systems

Starting from a corpus containing all the informative content, Information Filtering techniques perform a progressive removal of non-relevant content according to information about user interests, previously acquired and stored in a user profile [12]. Recommender Systems represent the main area where principles and techniques of Information Filtering are applied.

Nowadays many web sites embody recommender systems as a way of personalizing their content for users [25]. Recommender systems have the effect of guiding users in a personalized way to interesting or useful objects in a large space of possible options [4]. Recommendation algorithms use input about customers' interests to generate a list of recommended items. At Amazon.com, recommendation algorithms are used to personalize the online store for each customer, for example showing programming titles to a software engineer and baby toys to a new mother [18].

Page 42: Web Personalization in Intelligent Environments


Among the different recommendation techniques that have already been put forward in studies on this matter, the content-based and the collaborative filtering approaches are the most widely adopted to date.

Systems implementing the content-based approach analyze a set of documents, usually textual descriptions of the items previously rated by an individual user, and build a model or profile of user interests based on the features of the objects rated by that user [24]. In this approach, the static content associated with items (the plot of a film, the description of an artwork, etc.) is usually exploited. The profile is then exploited to recommend new relevant items.

Collaborative recommender systems differ from content-based ones in that user opinions are used instead of content. User ratings of objects are gathered and stored in a centralized or distributed database. To provide recommendations to user X, the system first computes the neighborhood of that user (i.e. the subset of users who have a taste similar to X). Similarity in taste is measured by computing the closeness of ratings for objects that were rated by both users. The system then recommends objects that users in X's neighborhood indicated they like, provided that they have not yet been rated by X.
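A minimal sketch of this neighborhood scheme (our own illustration; the ratings dictionary, the cosine similarity choice, and the rating threshold are assumptions, not the chapter's specification): similarity is computed over co-rated items, and items liked by the closest neighbors but unseen by X are returned.

```python
import math

def cosine_sim(r1, r2):
    """Similarity of two users computed over the items rated by both."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(r1[i] ** 2 for i in common))
    n2 = math.sqrt(sum(r2[i] ** 2 for i in common))
    return dot / (n1 * n2)

def recommend(target, ratings, k=2, min_rating=4):
    """Items rated at least min_rating by the k nearest neighbors and unseen by target."""
    neighbors = sorted((u for u in ratings if u != target),
                       key=lambda u: cosine_sim(ratings[target], ratings[u]),
                       reverse=True)[:k]
    seen = set(ratings[target])
    return {item for u in neighbors
            for item, r in ratings[u].items()
            if r >= min_rating and item not in seen}

ratings = {"X":  {"item1": 5, "item2": 4},
           "U1": {"item1": 5, "item2": 5, "item3": 4},
           "U2": {"item1": 1, "item2": 2, "item4": 5}}
print(recommend("X", ratings, k=1))   # -> {'item3'}
```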

Each type of filtering method has its own weaknesses and strengths [31, 1, 17]. This work is focused on content-based recommender systems. In the next section we will introduce FIRSt (Folksonomy-based Item Recommender syStem), a content-based recommender system that implements the proposed idea of building user profiles by exploiting both static and dynamic content (UGC).

3 FIRSt (Folksonomy-Based Item Recommender syStem)

FIRSt is a semantic content-based recommender system integrating UGC (tags) in the process of learning user profiles. FIRSt is built upon ITem Recommender (ITR), a system capable of providing recommendations for items in several domains (e.g., movies, music, books), provided that descriptions of items are available as text documents (e.g., plot summaries, reviews, short abstracts) [19, 7, 29]. In the following, we will refer to documents as the textual descriptions of the items to be recommended. FIRSt adds new functionalities to ITR for processing tags in order to include them in semantic profiles.

Sections 3.1 through 3.3 describe the general architecture of ITR, providing details about the strategies adopted for semantic document indexing and profile learning. The evolution of ITR towards FIRSt is presented in Section 3.4, which describes how users' tagging activity is handled for building user profiles.

3.1 ITR General Architecture

The general architecture of ITR is depicted in Figure 1. The recommendation process is performed in three steps, each of which is handled by a separate component:

Fig. 1. ITR General Architecture

• Content Analyzer – it allows introducing semantics in the recommendation process by analyzing documents in order to identify relevant concepts representing the content. This process selects, among all the possible meanings (senses) of each polysemous word, the correct one according to the context in which the word occurs. In this way, documents are represented using concepts instead of keywords, in an attempt to overcome the problems due to natural language ambiguity. The final outcome of the preprocessing step is a repository of disambiguated documents. This semantic indexing is strongly based on natural language processing techniques, such as Word Sense Disambiguation (WSD) [20], and heavily relies on linguistic knowledge stored in the WordNet lexical ontology [23]. Details are provided in Section 3.2.

• Profile Learner – it implements a supervised learning technique for learning a probabilistic model of user interests from disambiguated documents rated according to her interests. This model represents the semantic profile, which includes those concepts that turn out to be the most indicative of the user's preferences. Details are provided in Section 3.3.

• Recommender – it exploits the user profile to suggest relevant documents by matching concepts contained in the semantic profile against those contained in the documents to be recommended. Details are provided in Section 3.3.

3.2 Semantic Indexing of Documents

Semantic indexing of documents is performed by the Content Analyzer, which relies on META (Multi Language Text Analyzer) [2], a natural language processing tool developed at the University of Bari, able to deal with documents in English or Italian.

The goal of the semantic indexing step is to obtain a concept-based document representation. To this purpose the text is first tokenized, then for each word possible lemmas as well as their morpho-syntactic features are collected. Part-of-speech ambiguities are solved before assigning the proper sense (concept) to each word. This last step requires the identification of a repository for word senses and the design of an automated procedure for performing the word-concept association.

As regards the first issue, WordNet version 2.0 has been embodied in the semantic indexing module. The basic building block of WordNet is the synset (SYNonym SET), a structure containing sets of words with synonymous meanings, which represents a specific meaning of a word.

As regards the second issue, we designed a WSD algorithm called JIGSAW [3]. It takes as input a document d = [w1, w2, ..., wh] encoded as a list of words in order of their appearance, and returns a list of WordNet synsets X = [s1, s2, ..., sk] (k ≤ h), in which each element sj is obtained by disambiguating the target word wi based on the semantic similarity of wi with the words in its context, that is, a set of words that precede and follow wi. Notice that k ≤ h because some words, such as most proper names, might not be found in WordNet, or because of bigram recognition.
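
As a rough illustration of the idea (not the actual JIGSAW implementation), the sketch below picks, for each noun, the WordNet synset that is most similar on average to the synsets of the surrounding words, using NLTK's WordNet interface and its Leacock-Chodorow measure (the measure adopted here, described next); the window size and the noun-only restriction are simplifying assumptions.

```python
# Context-based word sense disambiguation sketch: for each word, choose the
# candidate synset whose average similarity to the context synsets is highest.
from nltk.corpus import wordnet as wn

def disambiguate(words, window=3):
    result = []
    for i, w in enumerate(words):
        candidates = wn.synsets(w, pos=wn.NOUN)
        if not candidates:                       # e.g. proper names not in WordNet
            continue
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        context_synsets = [s for c in context for s in wn.synsets(c, pos=wn.NOUN)]

        def score(sense):
            sims = [sense.lch_similarity(cs) or 0.0 for cs in context_synsets]
            return sum(sims) / len(sims) if sims else 0.0

        result.append(max(candidates, key=score))
    return result

print(disambiguate(["museum", "painting", "artist", "gallery"]))
```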

Semantic similarity computes the relatedness of two words. We adopted the Leacock-Chodorow measure [16], which is based on the length of the path between concepts in an IS-A hierarchy. The complete description of the adopted WSD strategy is not given here, because it has already been published in [30]. What we would like to point out here is that the WSD procedure allows us to obtain a synset-based vector space representation, called bag-of-synsets (BOS), that is an extension of the classical bag-of-words (BOW) model. In the BOS model a synset vector, rather than a word vector, corresponds to a document. ITR is able to suggest potentially relevant items to users, as long as item properties can be represented in the form of textual slots. The adoption of slots does not jeopardize the generality of the approach, since the case of documents not structured into slots corresponds to having just a single slot in our document representation strategy. The text in each slot is represented by the BOS model by counting separately the occurrences of a synset in the slots in which it appears. More formally, assume that we have a collection of N documents structured in M slots. Let s be the index of the slot; the n-th document is reduced to M bags of synsets, one for each slot:

$$ d_n^s = \langle t_{n1}^s, t_{n2}^s, \ldots, t_{nD_{ns}}^s \rangle $$

where $t_{nk}^s$ is the k-th synset in slot s of document $d_n$ and $D_{ns}$ is the total number of synsets in slot s of document $d_n$. For all n, k and s, $t_{nk}^s \in V_s$, which is the vocabulary for the slot s (the set of all different synsets found in slot s). Document $d_n$ is finally represented in the vector space by M synset-frequency vectors:

$$ f_n^s = \langle w_{n1}^s, w_{n2}^s, \ldots, w_{nD_{ns}}^s \rangle $$

where $w_{nk}^s$ is the weight of the synset $t_k$ in the slot s of document $d_n$, which can be computed in different ways: it can be the frequency of synset $t_k$ in s or a more complex feature weighting score.

By invoking META on a text t, we get META(t) = (x, y), where x is the BOS containing the synsets obtained by applying JIGSAW on t, and y is the corresponding synset-frequency vector. BOS-indexed documents are used in a content-based information filtering scenario for learning accurate sense-based user profiles, as discussed in the following section.
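
A minimal sketch of the resulting bag-of-synsets representation is given below; the slot names and synset identifiers are invented for illustration and do not come from the actual META output.

```python
# Bag-of-synsets (BOS) with slots: one synset-frequency vector per slot.
from collections import Counter
from typing import Dict, List

def to_bos(disambiguated_item: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Map slot -> list of synset ids into slot -> synset frequencies."""
    return {slot: Counter(synsets) for slot, synsets in disambiguated_item.items()}

artwork = {
    "title": ["school.n.01", "athens.n.01"],
    "artist": ["raphael.n.01"],
    "description": ["fresco.n.01", "philosopher.n.01", "fresco.n.01"],
}
print(to_bos(artwork))
# e.g. {'description': Counter({'fresco.n.01': 2, 'philosopher.n.01': 1}), ...}
```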

3.3 Multivariate Poisson Model for Learning User Profiles

The problem of learning user profiles can be cast as a binary Text Categorization task [28], since each document has to be classified as interesting or not with respect to the user preferences. Therefore, the set of categories is restricted to c+, the positive class (user-likes), and c−, the negative one (user-dislikes). The algorithm for inferring user profiles is naïve Bayes text learning, widely adopted in content-based recommenders [24]. Although the performance of naïve Bayes is not as good as that of some other statistical learning methods, such as nearest-neighbor classifiers or support vector machines, it has been shown that it can perform surprisingly well in classification tasks where the computed probability is not important [10]. Another advantage of the naïve Bayes approach is that it is very efficient and easy to implement compared to other learning methods.

There are two different probabilistic models in common use, both of which assume that all features are independent of each other, given the context of the class. In the multivariate Bernoulli model a document is a binary feature vector over the space of words, representing whether each word is present or absent. In contrast, the multinomial model captures word frequency information in documents: when calculating the probability of a document, the probabilities of the words that occur are multiplied. Although the classifiers based on the multinomial model significantly outperform those based on the multivariate model at large vocabulary sizes [21], their performance is unsatisfactory when: 1) documents in the training set have different lengths, thus resulting in a rough parameter estimation; 2) handling rare categories (few training documents available).

These conditions frequently occur in the user profiling task, where no assumptions can be made on the length of training documents, and where obtaining an appropriate set of negative examples (i.e., examples of the user-dislikes class) is problematic. Indeed, since users do not perceive immediate benefits from giving negative feedback to the system [27], the training set for the class user-likes might often be larger than the one for the class user-dislikes.

In [14], the authors propose a multivariate Poisson model for naïve Bayes text classification that allows a more reasonable parameter estimation under the above mentioned conditions. We adapt this approach to the user profiling task. The probability that a document d_j belongs to a class c (user-likes/user-dislikes) is calculated by Bayes' theorem as follows:

$$ P(c|d_j) = \frac{P(d_j|c)\,P(c)}{P(d_j|c)\,P(c) + P(d_j|\bar{c})\,P(\bar{c})} = \frac{\frac{P(d_j|c)}{P(d_j|\bar{c})}\,P(c)}{\frac{P(d_j|c)}{P(d_j|\bar{c})}\,P(c) + P(\bar{c})} \qquad (1) $$


If we set:

$$ z_{jc} = \log \frac{P(d_j|c)}{P(d_j|\bar{c})} \qquad (2) $$

then Eq. (1) can be rewritten as:

$$ P(c|d_j) = \frac{e^{z_{jc}}\,P(c)}{e^{z_{jc}}\,P(c) + P(\bar{c})} \qquad (3) $$

Using Eq. (3) we can get the posterior probability P(c|d_j) by calculating z_{jc}. In the Poisson model proposed in [14] for learning the naïve Bayes text classifier:

$$ z_{jc} = \sum_{i=1}^{|V|} w_{ij} \cdot \log \frac{\lambda_{ic}}{\mu_{ic}} \qquad (4) $$

where |V| is the vocabulary size, $w_{ij}$ is the frequency of term $t_i$ in $d_j$, and $\lambda_{ic}$ ($\mu_{ic}$) is the Poisson parameter that indicates the average number of occurrences of $t_i$ in the positive (negative) training documents. The flexibility of this model relies on the fact that it can be extended by adopting various methods to estimate $w_{ij}$, $\lambda_{ic}$ and $\mu_{ic}$.
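
A small numeric illustration of Eqs. (2)-(4) follows; the term frequencies, Poisson parameters and priors below are made-up values used only to show how the posterior is obtained from $z_{jc}$.

```python
# Toy computation of z_jc (Eq. 4) and of the posterior P(c|d_j) (Eq. 3).
import math

w = [2, 0, 1]            # frequencies of the vocabulary terms in document d_j
lam = [0.8, 0.1, 0.5]    # lambda_ic: average occurrences in positive documents
mu = [0.2, 0.4, 0.3]     # mu_ic: average occurrences in negative documents
p_c, p_not_c = 0.6, 0.4  # prior probabilities P(c) and P(c-bar)

z = sum(w_i * math.log(l / m) for w_i, l, m in zip(w, lam, mu))   # Eq. (4)
posterior = math.exp(z) * p_c / (math.exp(z) * p_c + p_not_c)     # Eq. (3)
print(z, posterior)
```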

In the following, the strategies to adapt this model to the specific task of user profiling are described.

The first adaptation is needed because, as described in Section 3.2, documents are subdivided into slots; therefore the model should take into account that $d_j$ is the concatenation of M documents $d_j^s$, M being the number of slots, s = 1, ..., M. According to the naïve assumption of feature independence, slots are independent of each other given the class (i.e., the token probabilities for one slot are independent of the tokens that occur in other slots), therefore:

$$ P(d_j|c) = \prod_{s=1}^{M} P(d_j^s|c) \qquad (5) $$

then Eq. (1) can be rewritten as:

$$ P(c|d_j) = \frac{\prod_{s=1}^{M}\frac{P(d_j^s|c)}{P(d_j^s|\bar{c})}\,P(c)}{\prod_{s=1}^{M}\frac{P(d_j^s|c)}{P(d_j^s|\bar{c})}\,P(c) + P(\bar{c})} \qquad (6) $$

If we set:

$$ z_{jc}^s = \log \frac{P(d_j^s|c)}{P(d_j^s|\bar{c})} \qquad (7) $$

then Eq. (6) can be rewritten as:

$$ P(c|d_j) = \frac{\prod_{s=1}^{M} e^{z_{jc}^s}\,P(c)}{\prod_{s=1}^{M} e^{z_{jc}^s}\,P(c) + P(\bar{c})} \qquad (8) $$


In the Poisson model with slots, Eq. (4) becomes:

$$ z_{jc}^s = \sum_{i=1}^{|V|} w_{ij}^s \cdot \log \frac{\lambda_{ic}^s}{\mu_{ic}^s} \qquad (9) $$

where $w_{ij}^s$ is the frequency of term $t_i$ in the slot s of $d_j$.

Using Eqs. (6) and (9), the posterior probability P(c|d_j) can be computed by estimating the Poisson parameters $\lambda_{ic}^s$ and $\mu_{ic}^s$. Since we want to normalize term frequencies according to document lengths, we compute $\lambda_{ic}^s$ ($\mu_{ic}^s$) as the average of the normalized frequency of $t_i$ in the slot s over the number of documents in class c ($\bar{c}$):

$$ \lambda_{ic}^s = \frac{1}{|D_c|} \sum_{j=1}^{|D_c|} \hat{w}_{ij}^s \qquad \mu_{ic}^s = \frac{1}{|D_{\bar{c}}|} \sum_{j=1}^{|D_{\bar{c}}|} \hat{w}_{ij}^s \qquad s = 1, \ldots, M \qquad (10) $$

where $|D_c|$ ($|D_{\bar{c}}|$) is the number of documents in class c ($\bar{c}$), and the normalized frequency $\hat{w}_{ij}^s$ is computed as

$$ \hat{w}_{ij}^s = \frac{w_{ij}^s}{\alpha \cdot avgtf^s + (1 - \alpha) \cdot avgtf_j^s} \qquad (11) $$

where $avgtf_j^s$ is the average frequency of a token in the slot s of $d_j$, while $avgtf^s$ is the average frequency of a token in the slot s over the whole collection. This linear combination smoothes the term frequency using the characteristics of the entire document collection.

For the training step we assume that each user provided ratings on items using a discrete scale ranging from MIN (strongly dislikes) to MAX (strongly likes). Items whose ratings are greater than or equal to (MIN + MAX)/2 are supposed to be liked by the user and included in the positive training set, while items with lower ratings are included in the negative training set. The user profile is learned from the rated items by adopting the above described approach.

Therefore, given a new document $d_j$, the recommendation step consists in computing the a-posteriori classification scores P(c+|d_j) and P(c−|d_j) (Eq. 6), using the Poisson parameters for synsets estimated in the training step as in Eq. (10). Classification scores for the class c+ are used to produce a ranked list of potentially interesting items, from which the items to be recommended can be selected.
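
The following Python sketch puts the training and recommendation steps together under simplifying assumptions (a 1-5 rating scale, α = 0.5, equal priors, a small ε to avoid zero parameters, and invented toy data); it is an illustration of Eqs. (8)-(11), not the actual ITR/FIRSt code.

```python
# Documents are dicts mapping slot -> Counter of synset frequencies.
from collections import Counter, defaultdict
import math

MIN_R, MAX_R, ALPHA, EPS = 1, 5, 0.5, 1e-3

def normalize(doc, avgtf_collection):
    """Eq. (11): smooth raw slot frequencies with document/collection averages."""
    out = {}
    for slot, counts in doc.items():
        avgtf_doc = sum(counts.values()) / len(counts)
        denom = ALPHA * avgtf_collection[slot] + (1 - ALPHA) * avgtf_doc
        out[slot] = {t: f / denom for t, f in counts.items()}
    return out

def estimate(docs, avgtf_collection):
    """Eq. (10): per-slot Poisson parameters as average normalized frequencies."""
    params = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        for slot, freqs in normalize(doc, avgtf_collection).items():
            for t, f in freqs.items():
                params[slot][t] += f / len(docs)
    return params

def posterior_likes(doc, lam, mu, prior=0.5):
    """Eqs. (8)-(9): combine per-slot z scores into P(user-likes | d_j)."""
    z = sum(f * math.log((lam[slot].get(t, 0) + EPS) / (mu[slot].get(t, 0) + EPS))
            for slot, counts in doc.items() for t, f in counts.items())
    return math.exp(z) * prior / (math.exp(z) * prior + (1 - prior))

# Toy training data: (document, rating); the threshold (MIN+MAX)/2 = 3 splits them.
rated = [
    ({"description": Counter({"fresco.n.01": 2, "saint.n.01": 1})}, 5),
    ({"description": Counter({"fresco.n.01": 1, "madonna.n.01": 2})}, 4),
    ({"description": Counter({"battle.n.01": 3})}, 1),
]
avgtf = {"description": 1.5}                       # assumed collection averages
positive = [d for d, r in rated if r >= (MIN_R + MAX_R) / 2]
negative = [d for d, r in rated if r < (MIN_R + MAX_R) / 2]
lam, mu = estimate(positive, avgtf), estimate(negative, avgtf)

candidate = {"description": Counter({"fresco.n.01": 1, "saint.n.01": 1})}
print(posterior_likes(candidate, lam, mu))          # rank candidates by this score
```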

3.4 From ITR to FIRSt: Integrating Folksonomies into Semantic Profiles

In order to involve folksonomies in the processing performed by ITR, the static content describing the items is integrated with dynamic UGC (tags). Tags are collected during the training step by letting users:

1. express their preferences for items through a numerical rating;
2. annotate rated items with free tags.


Given an item I, the set of tags provided by all the users who rated I is denoted as SocialTags(I), while the set of tags provided by a specific user U on I is denoted by PersonalTags(U,I). In addition, PersonalTags(U) denotes the set of tags provided by U on all the items in the collection. Tags are stored in an additional slot, different from those containing static content.

For example, in the context of cultural heritage personalization an artwork can be generally represented by at least three slots, namely artist, title, and description. Provided that users have a digital support to annotate artifacts, tags can be easily stored in a fourth slot, say tags, which is not static like the other three slots, because tags evolve over time.
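
A toy illustration of how the dynamic tags slot could sit next to the static slots is sketched below; the annotation triples and slot contents are invented for the example.

```python
# SocialTags(I) collects all tags on item I; PersonalTags(U, I) only user U's.
annotations = [                      # (user, item, tag)
    ("u1", "school_of_athens", "fresco"),
    ("u2", "school_of_athens", "philosophy"),
    ("u1", "school_of_athens", "raphael"),
]

def social_tags(item):
    return {t for _, i, t in annotations if i == item}

def personal_tags(user, item):
    return {t for u, i, t in annotations if u == user and i == item}

item_slots = {
    "title": "The School of Athens",
    "artist": "Raffaello Sanzio",
    "description": "Fresco in the Apostolic Palace ...",
    "tags": social_tags("school_of_athens"),   # the dynamic fourth slot
}
print(item_slots["tags"], personal_tags("u1", "school_of_athens"))
```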

The distinction between personal and social tags aims at evaluating whether including either just personal tags or social tags in user profiles produces beneficial effects on the recommendations. The inclusion of social tags in the personal profile of a user also allows extending the pure content-based recommendation paradigm, previously adopted by ITR, toward a hybrid content-collaborative paradigm [4].

The architecture described in Figure 1 has been modified in order to include tags in the recommendation process. The main adaptation was due to the need of defining an appropriate indexing strategy for the slot containing tags, in addition to that already defined for static slots (Figure 2).

Fig. 2. Architecture of FIRSt

Since tags are freely chosen by users and their actual meaning is usually not very clear, the identification of user interests from tags is a challenging task. We face this problem by applying WSD to tags as well. This process allows us to enhance the document model from representing tags as mere keywords or strings, to exploiting tags as pointers to WordNet synsets (semantic tags).


Semantic tags are obtained by disambiguating the tags in a folksonomy, thus producing as a result a synset-based folksonomy. More specifically, we denote as SemanticSocialTags(I) the set of synsets obtained by disambiguating SocialTags(I). In fact, META applied to SocialTags(I) produces the synset-based folksonomy corresponding to SocialTags(I). SemanticPersonalTags(U,I) is the set of synsets obtained by disambiguating the tags given by U on I, thus it is the result of invoking META on PersonalTags(U,I).

The algorithm used by META for tag disambiguation is JIGSAW, with a different setting for the context compared to that adopted for disambiguating static content. Indeed, while for static content the context for the target word is the text in the slot in which it occurs, this strategy is not suitable for tags, since the number of tags provided by users is generally low. This may result in a poor context and consequently in a high percentage of WSD errors on tags. The intent is to exploit a more reliable context, when available. Therefore, if the target tag occurs in one of the static slots, the text in that slot is used as the context; otherwise we are forced to accept all the other tags as the context.
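
The context-selection rule can be sketched as follows; the tokenization and matching are deliberately simplified and the data is invented.

```python
# If the tag also occurs in a static slot, use that slot's text as context;
# otherwise fall back to the remaining tags provided on the item.
def wsd_context_for_tag(tag, static_slots, all_tags):
    for slot_text in static_slots.values():
        if tag.lower() in slot_text.lower().split():
            return slot_text.split()            # reliable context: the slot text
    return [t for t in all_tags if t != tag]    # fallback: the other tags

static_slots = {"title": "The School of Athens",
                "description": "A fresco painted by Raphael in the Vatican"}
print(wsd_context_for_tag("fresco", static_slots, ["fresco", "philosophy"]))
```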

Semantic tags are exploited by the Profile Learner to include information about tags in the user profiles. The profile learning process for user U starts by selecting all items (disambiguated documents) and the corresponding ratings provided by U. Each item falls into either the positive or the negative training set depending on the user rating, in the same way as described in Section 3.3. Let TR+ and TR− be the positive and negative training sets, respectively, for user U. Several options for generating the user profile can be chosen at this point, depending on the type of content involved in the process.

If we would like to infer a user profile strictly related to personal preferences (one-to-one user profile), all the semantic tags obtained from the personal tags provided by U on all the items she rated should be exploited in the learning step. This means that, for each dj ∈ TR+ ∪ TR−, the additional slot for dj is SemanticPersonalTags(U,dj).

On the other hand, if we would like to build a content-collaborative profile for U, the semantic tags obtained from the social tags provided by users on all the items rated by U should be exploited in the learning step. This means that, for each dj ∈ TR+ ∪ TR−, the additional slot for dj is SemanticSocialTags(dj).

The generation of the user profile is performed by the Profile Learner, which infers the profile as a binary text classifier, as described in Section 3.3.

The profile contains the user identifier and the a-priori probabilities of liking or disliking an item. Moreover, the profile is structured in two main parts: profile-like contains features describing the concepts able to deem items relevant, while features in profile-dislike should help in filtering out non-relevant items. Each part of the profile is structured in four slots, mirroring the representation adopted for items, which in this case are artworks represented by title, artist, description and tags. Each slot reports the features (WordNet identifiers) occurring in the training examples, whose frequencies are computed in the training step. Frequencies are used by the Bayesian learning algorithm to induce the classification model (i.e., the user profile) exploited to suggest relevant items in the recommendation phase.

4 Experimental Evaluation of FIRSt

The goal of the experimental evaluation was to measure the predictive accuracy of FIRSt when different types of content are used in the training step. Preliminary experiments have been presented in [8]. As a matter of fact, in order to properly investigate the effects of including social tagging in the recommendation process, a distinction has to be made between considering, for an artifact I rated as interesting by a user, either the whole folksonomy SocialTags(I), or only the tags entered by that user for that artifact, i.e. PersonalTags(U,I).

Moreover, tags produced by expert users are distinguished from those of non-expert users, with the aim of investigating the impact of a more specific lexicon in producing recommendations. In the context of the cultural heritage domain, expert users are supposed to have specific knowledge of the art domain, such as museum curators, while non-expert users are supposed to be naïve museum visitors.

4.1 Users and Dataset

The dataset considered for the experiments consists of 45 paintings chosen from the collection of the Vatican picture-gallery. The dataset was collected using screen-scraping bots, which captured the required information from the official website of the Vatican picture-gallery. In particular, for each element in the dataset an image of the artifact was collected, along with three textual properties, namely its title, artist, and description.

Fig. 3. Collecting users’ ratings and tags


30 non-expert users and 10 expert users voluntarily took part in the experiments. Notice that users were selected according to the availability sampling strategy. Even though random sampling is the best way of obtaining a representative sample, that strategy requires a great deal of time and money. Therefore much research in psychology is based on samples obtained through non-random selection, such as availability sampling, i.e. a sampling of convenience based on subjects available to the researcher, often used when the population source is not completely defined [26]. According to this strategy, non-expert users were selected among young people holding a master's degree in Computer Science or Humanities, while expert users were selected among teachers of Arts and Humanities disciplines.

Users were requested to interact with a web application (Figure 3) in order to express their preferences for all the 45 paintings in the collection. A preference was expressed as a numerical vote on a 5-point scale (1 = strongly dislike, 5 = strongly like). Moreover, users were left free to annotate the paintings with as many tags as they wished.

For the overall 45 paintings in the dataset, 4300 tags were provided by non-expert users, while 1877 were provided by expert users. Some statistics about the tag distribution are reported in Table 1.

Table 1. Tag distribution in the dataset

Type of tags          Avg. expert users   Avg. non-expert users
PersonalTags(U,I)     4.17                3.18
PersonalTags(U)       187.7               143.33
SocialTags(I)         41.71               95.55

Each user provided about 3 to 4 tags for each rated item, thus the additional workload due to the tagging activity is quite moderate. The average number of tags associated with each painting is about 95 for non-expert users and about 41 for expert users, thus the experiments relied on a sufficient number of user annotations.

4.2 Design of the Experiments and Evaluation Metrics

Since FIRSt is conceived as a text classifier, its effectiveness can be evaluated by classification accuracy measures, namely Precision and Recall [28].

Precision (Pr) is defined as the number of relevant selected items divided by the number of selected items. Recall (Re) is defined as the number of relevant selected items divided by the total number of relevant items. The Fβ measure, a combination of precision and recall, is also used to have an overall measure of predictive accuracy (β sets the relative degree of importance attributed to Pr and Re):

$$ F_\beta = \frac{(1+\beta^2) \cdot Pr \cdot Re}{\beta^2 \cdot Pr + Re} $$


For the evaluation of recommender systems, these measures have been used in [13]. Since users should trust the recommender, it is important to reduce false positives. It is also desirable to provide users with a short list of relevant items (even if not all the possible relevant items are suggested), rather than a long list containing a greater number of relevant items mixed up with non-relevant ones. Therefore, we set β = 0.5 for the Fβ measure in order to give more weight to precision.

These classification measures do not consider predictions and their deviations from actual ratings; rather, they compute the frequency with which a recommender system makes correct or incorrect decisions about whether a painting is advisable for a user. These specific measures were adopted because we are interested in measuring how relevant a set of recommendations is for a user. In the experiment, a painting is considered relevant for a user if the rating is greater than or equal to 4, while FIRSt considers a painting relevant for a user if the a-posteriori probability of the class likes is greater than 0.5.
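
A minimal numeric sketch of these measures, with the thresholds just mentioned (rating ≥ 4 for relevance, posterior > 0.5 for selection) and β = 0.5, is given below; the ratings and scores are invented.

```python
# Precision, recall and F_{0.5} over a toy set of rated/scored paintings.
def f_beta(precision, recall, beta=0.5):
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

ratings = {"p1": 5, "p2": 2, "p3": 4, "p4": 3}          # user ratings
scores = {"p1": 0.9, "p2": 0.7, "p3": 0.4, "p4": 0.2}   # P(likes | painting)

relevant = {p for p, r in ratings.items() if r >= 4}
selected = {p for p, s in scores.items() if s > 0.5}

precision = len(relevant & selected) / len(selected)
recall = len(relevant & selected) / len(relevant)
print(precision, recall, f_beta(precision, recall))
```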

We organized three different experimental sessions, each one with the aim of evaluating the accuracy of FIRSt for a specific community of users:

1. SESSION#1: non-expert user community – All paintings are rated and tagged by 30 non-expert users, for whom recommendations are computed.

2. SESSION#2: whole user community – All paintings are rated and tagged both by expert and non-expert users. Recommendations are provided for the whole set of 40 users.

3. SESSION#3: non-expert user community supported by experts' tags – In this session we evaluate whether tags provided by experts have positive effects on recommendations generated for non-expert users. All paintings are rated solely by non-expert users, but the tags used for generating non-expert user profiles are provided by expert users.

For SESSION#1 and SESSION#2, 5 different experiments were designed, depending on the type of content used for training the system:

• Exp#1: Static Content – only the title, artist and description of the paintings, as collected from the official website of the Vatican picture-gallery
• Exp#2: SemanticPersonalTags(U,I)
• Exp#3: SemanticSocialTags(I)
• Exp#4: Static Content + SemanticPersonalTags(U,I)
• Exp#5: Static Content + SemanticSocialTags(I)

For example, SemanticSocialTags(I) in SESSION#1 includes the set of synsets obtained by disambiguating tags provided by all non-expert users who rated I, while in SESSION#2 it includes the set of synsets obtained by disambiguating tags provided by both expert and non-expert users who rated I.

For SESSION#3, 2 different experiments were designed, depending on the type of content used for training the system:

• Exp#1: SemanticSocialTags(I) – SemanticSocialTags(I) includes the set of synsets obtained by disambiguating tags provided by all experts on I. In this way tags provided by experts contribute to the profiles of non-expert users. The aim of the experiment is to measure whether the accuracy of recommendations for non-expert users is improved by tags provided by expert users.

• Exp#2: Static Content + SemanticSocialTags(I) – SemanticSocialTags(I), as intended in Exp#1 of this session, is combined with static content.

All experiments were carried out using the same methodology, consisting in performing one run for each user, scheduled as follows:

1. select the appropriate content depending on the experiment being executed;
2. split the selected data into a training set Tr and a test set Ts;
3. use Tr for learning the corresponding user profile;
4. evaluate the predictive accuracy of the induced profile on Ts.

The methodology adopted for obtaining Tr and Ts was K-fold cross-validation [15], with K = 5. Given the size of the dataset (45 paintings), applying a 5-fold cross-validation technique means that the dataset is divided into 5 disjoint partitions, each containing 9 paintings. The learning of profiles and the test of predictions were performed in 5 steps. At each step, K−1 = 4 partitions were used as the training set Tr, whereas the remaining partition was used as the test set Ts. The steps were repeated until each of the 5 disjoint partitions had been used as Ts. Results were averaged over the 5 runs.
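
A sketch of this 5-fold protocol on a 45-item dataset is shown below; the item identifiers are placeholders.

```python
# 5-fold cross-validation: each fold of 9 paintings serves once as the test
# set while the remaining 36 form the training set; results are then averaged.
paintings = [f"painting_{i}" for i in range(45)]   # placeholder item ids
K = 5
folds = [paintings[i::K] for i in range(K)]        # 5 disjoint partitions of 9

for k in range(K):
    test = folds[k]
    train = [p for j, f in enumerate(folds) if j != k for p in f]
    assert len(test) == 9 and len(train) == 36
    # learn the user profile on `train`, evaluate Pr/Re/F on `test`, then average
```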

4.3 Results

Table 2 reports the results for Exp#1-Exp#5 in SESSION#1. Table 3 reports the results for Exp#1-Exp#5 in SESSION#2.

Table 2. Results of Exp#1-Exp#5 in SESSION #1

Exp.   Type of Content                               Precision  Recall  Fβ=0.5
Exp#1  Static Content                                77.01      93.54   79.83
Exp#2  SemanticPersonalTags(U,I)                     77.63      86.57   79.27
Exp#3  SemanticSocialTags(I)                         77.40      91.87   79.92
Exp#4  Static Content+SemanticPersonalTags(U,I)      78.63      92.79   81.11
Exp#5  Static Content+SemanticSocialTags(I)          77.78      93.35   80.46

Table 3. Results of Exp#1-Exp#5 in SESSION #2

Exp.   Type of Content                               Precision  Recall  Fβ=0.5
Exp#1  Static Content                                75.17      92.63   78.11
Exp#2  SemanticPersonalTags(U,I)                     76.60      89.86   78.93
Exp#3  SemanticSocialTags(I)                         74.91      89.93   77.50
Exp#4  Static Content+SemanticPersonalTags(U,I)      77.31      90.61   79.65
Exp#5  Static Content+SemanticSocialTags(I)          76.60      91.58   79.19


4.4 Discussion of Results

The first outcome of the experiments in SESSION#1 is that the integration of social or personal tags causes an increase of precision in the process of recommending artifacts to users. More specifically, the precision of profiles learned from both static content and tags (hereafter, augmented profiles) outperformed the precision of profiles learned from either static content alone (hereafter, content-based profiles) or tags alone (hereafter, tag-based profiles). The improvement of augmented profiles with personal tags (Exp#4) is 1.62 with respect to content-based profiles (Exp#1), while it is about 1 with respect to tag-based profiles (Exp#2 and Exp#3). Lower improvements are observed by comparing the results of Exp#5 with those of Exp#2 and Exp#3.

The increase in precision of augmented profiles corresponds to a slight and physiological loss of recall. The lowest recall has been observed for Exp#2. This result is not surprising, since personal tags summarize cultural interests and represent them in a deeper and "more precise" way compared to static content, which, on the other hand, allows covering a broader range of user preferences.

To sum up, by observing the Fβ figures, we can conclude that for non-expert users the highest accuracy is achieved by augmented profiles with personal tags.

Similar results are observed in SESSION#2, where the community also includes expert users. It is interesting to compare the results of Exp#1, Exp#2 and Exp#4 in SESSION#1 with those of the same experiments in SESSION#2, in order to evaluate the accuracy of recommendations provided by content-based profiles, tag-based profiles built using just personal tags, and augmented profiles with personal tags in both communities. The values of Fβ in SESSION#2 are lower than those observed in SESSION#1; thus we can conclude that it is more difficult to provide recommendations for expert users.

Another interesting finding regards profiles built by using social tags (Exp#3). A comparison between the results obtained in SESSION#1 and SESSION#2 highlights a significant loss both in precision and recall when expert users are included in the community. Since social tags represent the lexicon of the community, this result might be interpreted as indicating that tagging with a more specific and technical lexicon does not bring a significant improvement of the system's predictive accuracy.

SESSION#3 provides more insight on the impact of the lexicon introduced by expert users on the recommendations provided to non-expert users (Table 4).

Table 4. Results of Exp#1-Exp#2 in SESSION #3

Exp.   Type of Content                          Precision  Recall  Fβ=0.5
Exp#1  SemanticSocialTags(I)                    76.98      92.40   79.64
Exp#2  Static Content+SemanticSocialTags(I)     77.47      93.51   80.22

By analyzing the results of Exp#1, we observed that precision and recall of tag-based profiles do not outperform those obtained in Exp#3 in SESSION#1, thus suggesting that the specific lexicon adopted by expert users does not positively affect recommendations for non-expert users. Anyway, the slight improvement in recall (+0.53) suggests that the more technical tags adopted by experts might help to select relevant items missed by profiles built with simple tags.

Even integrating social tags provided by experts with content does not improve the accuracy of recommendations for non-expert users. Indeed, precision and recall observed in Exp#2 do not significantly change compared to the results of Exp#5 in SESSION#1.

The general conclusion is that the expertise of the users contributing to the folksonomy does not actually affect the accuracy of recommendations.

5 Related Work

To the best of our knowledge, few studies have investigated how to exploit tag annotations in order to build user profiles.

In [9], the user profile is represented in the form of a tag vector, with each element indicating the number of times a tag has been assigned to a document by that user. A more sophisticated approach is proposed in [22], which takes into account tag co-occurrence. The matching of profiles to information sources is achieved by using simple string matching. As the authors themselves foresee, the matching could be enhanced by adopting WordNet, as in the semantic document indexing strategy proposed in this work.

In the work by Szomszor et al. [33], the authors describe a movie recommendation system built purely on the keywords assigned to movies via collaborative tagging. Recommendations for the active user are produced by algorithms based on the similarity between the keywords of a movie and those of the tag-clouds of the movies she rated. As the authors themselves state, their recommendation algorithms can be improved by combining tag-based profiling techniques with more traditional content-based recommender strategies, as in the approach we have proposed. In [11], different strategies are proposed to build tag-based user profiles and to exploit them for producing music recommendations. Tag-based user profiles are defined as collections of tags, which have been chosen by a user to annotate tracks, together with corresponding scores representing the user's interest in each of these tags, inferred from tag usage and the frequencies of listened tracks.

While in the above described approaches only a single set of popular tags represents user interests, in [36] it is observed that this may not be the most suitable representation of a user profile, since it is not able to reflect the multiple interests of users. Therefore, the authors propose a network analysis technique (based on clustering), performed on the personal tags of a user, to identify her different interests.

Regarding tag interpretation, Cantador et al. [5] proposed a methodology to select "meaningful" tags from an initial set of raw tags by exploiting WordNet, Wikipedia and Google. If a tag has an exact match in WordNet, it is accepted; otherwise possible misspellings and compound nouns are discovered by using the Google "did you mean" mechanism (for example the tag sanfrancisco or san farncisco is corrected to san francisco). Finally, tags are correlated to their appropriate Wikipedia entries.

The main differences between the tag-based profiling process we proposed in this chapter and the previously discussed ones are:

1. we propose a hybrid strategy that learns the profile of the user U from both static content and the tags associated with the items rated by U, instead of relying on tags only;

2. we elaborate on including in the profile of user U not only her personal tags, but also the tags adopted by other users who rated the same items as U. This aspect is particularly important when the users who contribute to the folksonomy have different expertise in the domain;

3. we propose a solution to the challenging task of identifying user interests from tags. Since the main problem lies in the fact that tags are freely chosen by users and their actual meaning is usually not very clear, we have suggested to semantically interpret tags by means of WordNet. Indeed, some ideas on how to analyze tags by means of WordNet in order to capture their intended meanings are reported in [6], but the suggested ideas are not supported by empirical evaluations. Another approach in which tags are semantically interpreted by means of WordNet is the one proposed in [37]. The authors demonstrated the usefulness of tags in collaborative filtering by designing an algorithm for neighbor selection that exploits a WordNet-based semantic distance between tags assigned by different users.

When focusing on the application of personalization techniques in the context of cultural heritage, it is worth noticing that museums have recognized the importance of providing visitors with personalized access to artifacts. The projects PEACH (Personal Experience with Active Cultural Heritage) [32] and CHIP (Cultural Heritage Information Personalization) [35] are only two examples of the research effort devoted to supporting visitors in enjoying a personalized experience and tour when visiting artwork collections. In particular, the recommender system developed within CHIP aims at providing personalized access to the collections of the Rijksmuseum in Amsterdam. It combines Semantic Web technologies and content-based algorithms for inferring visitors' preferences from a set of scored artifacts and then recommending other artworks and related content topics.

The Steve.museum consortium [34] has begun to explore the use of social tagging and folksonomies in cultural heritage personalization scenarios, to increase audience engagement with museums' collections. Supporting social tagging of artifacts and providing access based on the resulting folksonomy opens museum collections to new interpretations, which reflect visitors' perspectives rather than curators' ones, and helps to bridge the gap between the professional language of the curator and the popular language of the museum visitor. Preliminary explorations conducted at the Metropolitan Museum of Art in New York have shown that professional perspectives differ significantly from those of naïve visitors. Hence, if tags are associated with artworks, the resulting folksonomy can be used as a different and valuable source of information to be carefully taken into account when providing recommendations to museum visitors.

6 Conclusions and Future Work

The research question we have tried to answer in this chapter was: does the integration of tags cause an increase of the prediction accuracy in the process of filtering relevant items for users? The main contribution of the chapter is a technique to infer user profiles from both static content, as in classical content-based recommender systems, and tags provided by users to freely annotate items. Being free annotations, tags also tend to suffer from problems such as polysemy and synonymy. We faced this problem by applying WSD to content as well as to tags. Static content and tags, semantically indexed using a WordNet-based WSD procedure, are exploited by a naïve Bayes learning algorithm able to infer user profiles in the form of binary text classifiers. As a proof of concept, we developed the FIRSt recommender system, whose recommendations were evaluated in a cultural heritage scenario.

The experiments aimed at evaluating the predictive accuracy of FIRSt when different types of content were used in the training step (pure content, personal tags, social tags, content combined with tags). We also distinguished tags provided by non-expert users from those provided by expert ones. The main outcomes of the experiments are:

• the highest overall accuracy is reached when profiles learned from both content and personal tags are exploited in the recommendation process;

• the expertise of the users contributing to the folksonomy does not actually affect the accuracy of recommendations.

We are currently working on the integration of FIRSt into an adaptive platform for multimodal and personalized access to museum collections. In this context, specific recommendation services, based upon augmented profiles, are being developed. Each visitor is supposed to be equipped with a mobile terminal supporting her during the visit to the museum. For example, the intelligent guide provided by the terminal might help the visitor to find the most interesting artworks according to her profile and contextual information, such as her current location in the museum.

References

1. Balabanovic, M., Shoham, Y.: Fab: content-based, collaborative recommendation. Commun. ACM 40(3), 66–72 (1997)

2. Basile, P., de Gemmis, M., Gentile, A., Iaquinta, L., Lops, P., Semeraro, G.: META - MultilanguagE Text Analyzer. In: Proc. of the Language and Speech Technology Conference, pp. 137–140 (2008)


3. Basile, P., Degemmis, M., Gentile, A., Lops, P., Semeraro, G.: UNIBA: JIGSAW algorithm for Word Sense Disambiguation. In: Proceedings of the 4th ACL 2007 International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, June 23-24. Association for Computational Linguistics, pp. 398–401 (2007)

4. Burke, R.: Hybrid recommender systems: survey and experiments. User Model. User-Adapt. Interact. 12(4), 331–370 (2002)

5. Cantador, I., Szomszor, M., Alani, H., Fernandez, M., Castells, P.: Enriching Ontological User Profiles with Tagging History for Multi-Domain Recommendations. In: Proc. of the Collective Semantics: Collective Intelligence and the Semantic Web, CISWeb2008, Tenerife, Spain (2008)

6. Carmagnola, F., Cena, F., Cortassa, O., Gena, C., Torre, I.: Towards a tag-based user model: How can user model benefit from tags? In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS, vol. 4511, pp. 445–449. Springer, Heidelberg (2007)

7. Degemmis, M., Lops, P., Semeraro, G.: A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Model. User-Adapt. Interact. 17(3), 217–255 (2007)

8. Degemmis, M., Lops, P., Semeraro, G., Basile, P.: Integrating tags in a semantic content-based recommender. In: Pu, P., Bridge, D.G., Mobasher, B., Ricci, F. (eds.) Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys 2008, Lausanne, Switzerland, October 23-25, 2008, pp. 163–170. ACM, New York (2008)

9. Diederich, J., Iofciu, T.: Finding communities of practice from user profiles based on folksonomies. In: Innovative Approaches for Learning and Knowledge Sharing, EC-TEL Workshop Proc., pp. 288–297 (2006)

10. Domingos, P., Pazzani, M.J.: On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning 29(2-3), 103–130 (1997)

11. Firan, C.S., Nejdl, W., Paiu, R.: The benefit of using tag-based profiles. In: Proc. of the Latin American Web Conference, Washington, DC, USA, pp. 32–41. IEEE Computer Society, Los Alamitos (2007)

12. Hanani, U., Shapira, B., Shoval, P.: Information Filtering: Overview of Issues, Research and Systems. User Model. User-Adapt. Interact. 11(3), 203–259 (2001)

13. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), 5–53 (2004)

14. Kim, S.-B., Han, K.-S., Rim, H.-C., Myaeng, S.-H.: Some effective techniques for naive bayes text classification. IEEE Trans. Knowl. Data Eng. 18(11), 1457–1466 (2006)

15. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. of IJCAI-1995, pp. 1137–1145 (1995)

16. Leacock, C., Chodorow, M., Miller, G.: Using corpus statistics and wordnet relations for sense identification. Computational Linguistics 24(1), 147–165 (1998)

17. Lee, W.S.: Collaborative learning for recommender systems. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 314–321. Morgan Kaufmann, San Francisco (2001)

18. Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Comp. 7(1), 76–80 (2003)

19. Lops, P., Degemmis, M., Semeraro, G.: Improving Social Filtering Techniques Through WordNet-Based User Profiles. In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS, vol. 4511, pp. 268–277. Springer, Heidelberg (2007)

20. Manning, C., Schutze, H.: Foundations of Statistical Natural Language Processing, ch. 7: Word Sense Disambiguation, pp. 229–264. The MIT Press, Cambridge (1999)


21. McCallum, A., Nigam, K.: A comparison of event models for naive bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization, pp. 41–48 (1998)

22. Michlmayr, E., Cayzer, S.: Learning User Profiles from Tagging Data and Leveraging them for Personal(ized) Information Access. In: Proc. of the Workshop on Tagging and Metadata for Social Information Organization, Int. WWW Conf. (2007)

23. Miller, G.: Wordnet: An on-line lexical database. International Journal of Lexicography 3(4) (Special Issue) (1990)

24. Mladenic, D.: Text-learning and related intelligent agents: a survey. IEEE Intelligent Systems 14(4), 44–54 (1999)

25. Resnick, P., Varian, H.: Recommender systems. Communications of the ACM 40(3), 56–58 (1997)

26. Royce, S.A., Straits, B.C.: Approaches to Social Research, 3rd edn. Oxford University Press, New York (1999)

27. Schwab, I., Kobsa, A., Koychev, I.: Learning user interests through positive examples using content analysis and collaborative filtering (2001)

28. Sebastiani, F.: Machine learning in automated text categorization. ACM Comp. Surveys 34(1), 1–47 (2002)

29. Semeraro, G., Basile, P., de Gemmis, M., Lops, P.: User Profiles for Personalizing Digital Libraries. In: Theng, Y.-L., Foo, S., Lian, D.G.H., Na, J.-C. (eds.) Handbook of Research on Digital Libraries: Design, Development and Impact, pp. 149–158. IGI Global (2009) ISBN 978-159904879-6

30. Semeraro, G., Degemmis, M., Lops, P., Basile, P.: Combining learning and word sense disambiguation for intelligent user profiling. In: Proc. of IJCAI 2007, pp. 2856–2861. M. Kaufmann, California (2007)

31. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating "word of mouth". In: Proceedings of ACM CHI 1995 Conference on Human Factors in Computing Systems, Denver, Colorado, United States, vol. 1, pp. 210–217 (1995)

32. Stock, O., Zancanaro, M., Busetta, P., Callaway, C., Kruger, A., Kruppa, M., Kuflik, T., Not, E., Rocchi, C.: Adaptive, intelligent presentation of information for the museum visitor in PEACH. User Modeling and User-Adapted Interaction 17(3), 257–304 (2007)

33. Szomszor, M., Cattuto, C., Alani, H., O'Hara, K., Baldassarri, A., Loreto, V., Servedio, V.D.P.: Folksonomies, the semantic web, and movie recommendation. In: Proc. of the Workshop on Bridging the Gap between Semantic Web and Web 2.0 at the 4th ESWC (2007)

34. Trant, J., Wyman, B.: Investigating social tagging and folksonomy in art museums with steve.museum. In: Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland (May 2006)

35. Wang, Y., Aroyo, L., Stash, N., Rutledge, L.: Interactive user modeling for personalized access to museum collections: The Rijksmuseum case study. In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS, vol. 4511, pp. 385–389. Springer, Heidelberg (2007)

36. Yeung, C.M.A., Gibbins, N., Shadbolt, N.: A study of user profile generation from folksonomies. In: Proc. of the Workshop on Social Web and Knowledge Management, WWW Conf. (2008)

37. Zhao, S., Du, N., Nauerz, A., Zhang, X., Yuan, Q., Fu, R.: Improved recommendation based on collaborative tagging behaviors. In: Proc. of Int. Conf. on Intelligent User Interfaces. ACM Press, New York (2008)


3

Exploiting Ontologies for Web Search Personalization

John Garofalakis1,2 and Theodoula Giannakoudi2

1 RA Computer Technology Institute, Telematics Center Department, N. Kazantzaki str., 26500, Greece

2 University of Patras, Computer Engineering and Informatics Dept., 26500 Patras, [email protected], [email protected]

Summary. In this work, we present an approach for web search personalization by exploiting ontologies. Our approach aims to provide personalization in web search engines by coupling data mining techniques with the underlying semantics of the web content. To this purpose, we exploit reference ontologies that emerge from web catalogs (such as ODP - Open Directory Project), which can scale to the growth of the web. Our methodology uses ontologies to provide the semantic profiling of users' interests, based on the implicit logging of their behavior and the on-the-fly semantic analysis and annotation of the web results' summaries. Experimental evaluation of our approach shows that the objectives expected from semantic users' clustering in search engines are achievable.

Keywords: Web Usage Mining, Semantic Annotation, Clustering, Ontology, User Profiles, Web Search, Personalization.

1 Introduction

While the Web is constantly growing, web search has to deal with many challenges. The collection of web documents expands rapidly and users demand to find the desired information directly. The vital question is what the right information for a specific user is, and how this information could be efficiently delivered, saving the web user from consecutively submitted queries and time-consuming navigation through numerous web results.

Most existing Web search engines return a list of results based on the query without paying any attention to the underlying user's interests or even to the searching behavior of other users with common interests. There is no prediction of the user's information needs, and problems of polysemy and synonymy often arise. Thus, when a user submits searching keywords with multiple meanings (polysemy), or when several words have the same meaning as the submitted keyword (synonymy), he will probably get a large number of web results and most of them will not meet his need. For example, a user submitting the term "opera" may be interested in arts or in computers, but the results will be the same regardless of what he looks for.



Fig. 1. The overall personalization methodology

Some current search engines such as Google or Yahoo! have hierarchies of categories to provide users with the opportunity to explicitly specify their interests. However, these hierarchies are usually very large; therefore, they discourage the user from browsing them in order to define the paths of interest. To overcome these overloads in the users' searching tasks, user interests may be implicitly detected by tracking search history and personalizing the web results.

In this work, we propose a personalization method (Figure 1) which couples data mining techniques with the underlying semantics of the web content in order to build semantically enhanced clusters of user profiles. In our methodology, apart from exploiting a specific user's search history, we further exploit the search history of other users with similar interests. The user is assigned to relevant conceptual classes of common interest, so as to predict the relevance score of the results with respect to the user goal and finally re-rank them. To this purpose, we exploit reference ontologies that emerge from web catalogs (such as ODP - Open Directory Project, http://www.dmoz.org/), which can scale to the growth of the web. Ontologies provide for the semantic profiling of users' interests based on the implicit logging of their behavior and the on-the-fly semantic analysis and annotation of the summaries of the web results.

Regarding the semantic clusters, they actually comprise taxonomical subsets of a general category hierarchy, such as ODP, representing the categories of interest for groups of web users with similar search tasks.

Specifically, our methodology consists of five tasks: (1) it gathers the user's search history, (2) processes the user activity, taking into consideration other users' activities and constructing clusters of commonly preferred concepts, (3) defines ontology-based profiles for the active user based on the interests detected from his current activity and the interests depicted by the semantic cluster to which he has been assigned from previous searching sessions, (4) re-ranks the web results combining the above information with the semantics of the delivered results, and (5) constantly re-organizes the conceptual clusters in order to keep them up-to-date with the users' interests.

Our approach has been experimentally evaluated by utilizing the Google Web Service and delivering a transparent Google search web site, and the results show that semantically clustering users, in terms of detecting commonly interesting ODP categories in search engines, is effective.

The remainder of the paper is structured as follows: Section 2 discusses related work. In Section 3, we describe the reference ontology that our approach uses, based on the ODP categorization. Using this ontology, we outline the semantic annotation of web results to the ontology classes. Moreover, we present how the user profiles are defined over the reference ontology, referred to earlier as task (2), and how the semantic user clusters are formed, referred to as task (3). In Section 4, we discuss what sort of ontology we can discover from a set of positive documents, present an ontology mining algorithm, and propose a novel technique for web search personalization combining profiles of semantic clusters with the emerging profile of the active user, referred to as tasks (4) and (5). In Section 5, we exhibit and discuss our experiments to show the performance of the proposed approach for web search personalization; in this section we also describe task (1) and the experimental results of the implementation. Section 6 presents the conclusions and gives an outlook on further work.

2 Related Work

In this section, we present work that has been conducted in similar contexts, such as personalized web searching, usage-based personalization and semantic-aware personalization.

Several ontology-based approaches have been proposed for user profiling in personalization systems, taking advantage of the knowledge contained in ontologies ([6], [13]). In [5], an aggregation scheme towards more general concepts is presented. Clustering of the user sessions is provided to identify related concepts at different levels of abstraction in a recommender system.

Significant studies have been conducted for personalization based on user search history. A general framework for personalization based on aggregate usage profiles is presented in [15]. This work distinguishes between the offline tasks of data preparation and usage mining and the online personalization components. [17] suggests learning a user's preferences automatically based on their past click history and shows how to use this learning for result personalization.

Many researchers have proposed several ways to personalize web search through biasing ranking algorithms towards pages of possible interest to the user. For example, [18] extends the HITS algorithm to promote pages marked "relevant" by the user in previous searches. A great step towards biased ranking is performed in [9], where a topic-oriented PageRank is built, considering the first-level topics listed in the Open Directory. The authors show this algorithm outperforms the standard PageRank if the search engine can effectively estimate the query topic.

Specifically, regarding the exploitation of large-scale taxonomies in personalized search, a number of interesting works have been presented. In [4], several ways of extending ODP metadata for personalized search are explored. In [12], users' browsing history is exploited to construct a much smaller subset of user-specific categories than the entire ODP, and a novel ranking logic is implemented among categories. In [9], sets of known user interests are automatically mapped onto a group of categories in ODP, and manually edited data of ODP are used for training text classifiers to perform search results categorization and personalization.

Our work differs from previous works in several respects. We exploit large-scale taxonomies, such as ODP, to construct combinative semantic user profiles. In our emerging profiles, both user browsing history and automatically created clusters of user categories are incorporated in personalizing web results. In this way, we re-rank search results taking into consideration, apart from the active user's tasks, the subsets of "interesting" taxonomy categories that co-occur in other users' searches, in the case that these users exhibit similar behavior to the active one.

3 Ontology-Based User Clusters

The general aim of this work is to introduce a method for personalizing the results of web searching. For this reason we focused on constructing user profiles implicitly and automatically, according to the users' interests and their previous searching behavior. In this direction, we built on the work described in [3].

3.1 Reference Ontology

Our first goal was to create a reference ontology upon which we will base the user profiles. The profile of each user will be represented by a weighted ontology, depicting the user's interest for every class of the reference ontology. Rather than creating a new ontology from scratch, we decided to base our reference ontology on already existing subject hierarchies. Some of them are Yahoo.com (http://www.yahoo.com), About.com (http://www.about.com), Lycos (http://www.lycos.com) and the Open Directory Project, which provide manually-created online subject hierarchies. Our implementation of the reference ontology was finally based on the Open Directory Project. In Figure 2 there is a depiction of some of the concepts of the first three levels of the ODP taxonomy.

Fig. 2. A depiction of the ODP

The choice of the Open Directory Project instead of the other directories for the construction of the reference ontology made no difference because there is a correspondence among them.

The ontology created is actually a directed acyclic graph (DAG). Since we wish to create a relatively concise user profile that identifies the general areas of a user's interests, we created our reference ontology by using concepts from only the first three levels of the Open Directory Project [19], which are the directories used by the Google search engine. In addition, since we want concepts that are related by a generalization-specialization relationship, we removed subjects that were linked based on other criteria, e.g. alphabetic or geographic associations. The ontology was created with Protégé (http://protege.stanford.edu), the free, open source ontology editor and knowledge-base framework, and the language used for development was OWL.

3.2 Semantic Annotation

The construction of the profile, i.e. the weighted ontology, for every user includes the semantic annotation of the user's previous choices. The semantic characterization of the user choices is based on the methodology proposed in [7]. Therefore, the user's previous choices are analyzed into keywords extracted from the visited web pages, and the keywords are semantically characterized. The calculation of the semantic similarity between each keyword and each term of the ontology was performed by using semantic similarity measures with WordNet [13]. In WordNet [14], English nouns, verbs, adjectives and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept. The measure applied in our methodology is the Wu and Palmer one [19]. This measure calculates the relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of their LCS (least common subsumer):

Score = 2 × depth(lcs) / (depth(k) + depth(c))    (1)

where k = user keyword, c = ontology class, and lcs = their nearest common ancestor.

This means that Score ∈ (0, 1]. The score can never be zero because the depth of the LCS is never zero (the depth of the root of a taxonomy is one). The score is one if the two input synsets are exactly the same.

The assignment process is time-expensive, therefore we have implemented a caching policy to improve system response. The assignments of instance words are kept in cache, to minimize response time in case these words are met again. Every time this process is executed, the only previous choices that are semantically annotated are the users' choices that had not yet been annotated at the last execution of this step of the methodology. This saves execution time, since semantic annotation is a quite time-consuming step of the overall method applied.

As a result, after the completion of the ontology assignment step in the proposed method, the keywords and, consequently, the users' choices are assigned to relevant classes of the ontology whenever the score is over a threshold (e.g. 0.7). Experimentation and fine tuning using different threshold values resulted in the choice of 0.7 as the concept similarity threshold.
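To make the assignment step more concrete, the following Python fragment sketches how a keyword could be matched against ontology class labels with the Wu and Palmer measure. It assumes NLTK's WordNet interface; the class labels, helper names and the way the 0.7 threshold is applied are illustrative rather than the system's actual code.

from nltk.corpus import wordnet as wn   # requires the WordNet corpus (nltk.download('wordnet'))

SIMILARITY_THRESHOLD = 0.7  # concept similarity threshold chosen by fine tuning

def wu_palmer_score(keyword, concept_label):
    # Best Wu and Palmer score over all synset pairs of the two words,
    # i.e. 2 * depth(lcs) / (depth(k) + depth(c)) as in relation (1).
    best = 0.0
    for s1 in wn.synsets(keyword):
        for s2 in wn.synsets(concept_label):
            best = max(best, s1.wup_similarity(s2) or 0.0)
    return best

def assign_keyword(keyword, ontology_classes):
    # Return the ontology classes whose similarity to the keyword exceeds the threshold.
    return [c for c in ontology_classes
            if wu_palmer_score(keyword, c) >= SIMILARITY_THRESHOLD]

print(assign_keyword("piano", ["music", "computer", "sport"]))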

3.3 Definition of User Profiles

In this step, our methodology uses the semantic annotations of the users' choices so as to construct the profile for every user.

After the semantic characterization of the user's choices to the ontology concepts, our methodology moves on to profile creation. From the web access logs kept in the web server, our method extracts the user's previous choices, which have already been semantically annotated. Therefore, for every user we extract the concepts and their frequency of appearance from the previous choices that the specific user has made. At the end of the execution of this step, there is an accumulation of the preferences for every user and of the frequency for every concept, which is the weight for every class (preference) in the ontology.


In this step of the proposed methodology, apart from the accumulation of the concepts for which the user has shown interest, we construct the vector that represents each user's profile. The vector's size is the number of concepts that the ontology consists of. The value of each element of the vector corresponds to the weight of the user interest for this concept.

We propose that the weight of a concept i for a user u is calculated as:

wiu = cfiu / sum(cfu)    (2)

where cfiu = the number of times that the concept i has been assigned to the user u, and sum(cfu) = the total number of times that all the concepts of the ontology have been assigned to the user u.

For the concepts to which none of the user's previous choices has been assigned, the value is set to zero.

So for a user u the profile is represented as follows:

pu =< w1u, w2u, ..., wnu > (3)

where n is the number of concepts in the ontology and

wiu = weight(concepti, u) if cfiu > 0, and wiu = 0 otherwise.

Therefore, it is obvious that the weight of each concept is the relative frequency of the concept among all concepts of the ontology. The sum of all weights is equal to one, representing the percentage of the user's interest for every concept. Moreover, for each user we create a file that holds the profile vector.
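A short Python sketch of this step is given below; the concept list and the assignment counts are toy values used only to illustrate how the weights of relation (2) yield the profile vector of relation (3).

from collections import Counter

ontology_concepts = ["music", "computers", "sports", "games"]   # toy reference ontology

def build_profile(assigned_concepts):
    # assigned_concepts: concepts assigned to the user's previous choices (with repetitions)
    counts = Counter(assigned_concepts)
    total = sum(counts.values()) or 1                        # sum(cfu); avoid division by zero
    return [counts[c] / total for c in ontology_concepts]    # wiu = cfiu / sum(cfu)

profile = build_profile(["music", "music", "computers", "games"])
print(profile)   # [0.5, 0.25, 0.0, 0.25]; the weights sum to one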

3.4 Semantic Clustering of User Profiles

After creating each user profile, the suggested methodology moves on to profile clustering. From the profile creation step, a profile for every user is stored in the database and a file with the user's vector of the weighted ontology is created. At this step of the methodology, the profiles of all the users that have interacted with the search engine are accumulated and are grouped into clusters with similar interests. This procedure is done for the users that have already interacted with the search engine and whose previous interaction has been stored in the web access logs.

The clustering algorithm that has been applied in the profile clustering step of the proposed methodology is the K-Means algorithm [11]. K-Means is one of the most common clustering algorithms; it groups data with similar characteristics or features into clusters. The data in a cluster will have similar features or characteristics, which will be dissimilar from the data in other clusters.

The K-Means algorithm accepts as input values the number of clusters to group data into and the dataset to cluster. It then creates the first K initial clusters (K = number of clusters needed) from the dataset by choosing K rows of data randomly from the dataset. It calculates the Arithmetic Mean of each cluster formed in the dataset. The Arithmetic Mean of a cluster is the mean of all the individual records in the cluster.

Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster to which it is most similar) using a measure of distance or similarity, such as the Euclidean distance measure, which was used in this module. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when new iterations of the K-Means clustering algorithm do not create new clusters, i.e. the cluster centre or Arithmetic Mean of each cluster formed is the same as the old cluster centre.

At the end of the execution of this step, the users are grouped into clusters with similar interests and the clusters are stored in the database. Thus, a cluster profile is built, utilizing the sum of preferences of all cluster members:

pc =< w1, w2, ..., wn > (4)

We should note that every time this step is executed, the clusters are constructed from scratch and the users are grouped again. Thus, the clustering procedure is not based on the previously constructed clusters. This has been chosen as a way of developing the methodology, considering that a user's choices will alter periodically and he may no longer have interests similar to those of the users in the cluster into which he was placed in a previous execution of the clustering procedure.
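The following sketch illustrates this clustering step using scikit-learn's KMeans; the chapter's own implementation is not based on scikit-learn, and the profile vectors below are toy data. The cluster profile of relation (4) is then built from the preferences of the cluster members.

import numpy as np
from sklearn.cluster import KMeans

profiles = np.array([            # one row per user: weights over the ontology concepts
    [0.6, 0.2, 0.2, 0.0],
    [0.5, 0.3, 0.1, 0.1],
    [0.0, 0.1, 0.8, 0.1],
    [0.1, 0.0, 0.7, 0.2],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)

# Cluster profile pc: normalized sum of the preferences of all cluster members.
for c in range(kmeans.n_clusters):
    members = profiles[kmeans.labels_ == c]
    pc = members.sum(axis=0) / members.sum()
    print("cluster", c, "profile", np.round(pc, 2))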

The construction of the semantic users’ profiles clusters is presented inFigure 3.

Fig. 3. Creation of the clusters with the semantic users’ profiles


4 Personalization Algorithm

The preprocessed user’s choices, their semantic characterization and the users’clusters are used for processing and personalizing the results from a search en-gine. At this point every user that has reacted previously with the online searchengine has been put in one cluster.

This cluster consists of users with similar interests and can be depicted as a weighted ontology, like the profiles. This weighted ontology will be represented as a vector, too. Each element of the vector representing the weighted ontology is the sum of the interests of all the users belonging to the cluster for a concept, divided by the sum of the interests of all the users of the cluster for all the concepts of the ontology. The formulation is the same as that followed for the users' profiles described in paragraph 3.3.

The personalized search includes the calculation of the similarity of each result returned by the search engine with the cluster's interests. This calculation requires the execution of all the steps of the ontology-based user clusters for each result returned by the search engine.

Therefore, for every query that is submitted to the search engine, the proposed methodology follows these steps:

1. Extracts the keywords from the users’ previous choices, i.e the users previousvisited rsults pages

2. Applies the semantic annotation step, with the difference that in this assignment the ontology is not the reference ontology but the part of the ontology which consists of the concepts for which the cluster that the user belongs to has a non-zero weight. The output of this step is a vector containing the similarity values of the keywords with the concepts of the ontology and is depicted as:

result_simjc = < simj1, simj2, ..., simjm >    (5)

where j is the j-th result of the search engine and m is the number of concepts in the cluster.

3. Since we have calculated the similarity of each result to the cluster, we calculate the score value for each result. This score is calculated as the inner product of the cluster vector represented in relation (4) and the similarity vector represented in relation (5). So the score will be:

Score = pc · result_simjc    (6)

The above three steps are executed for every result and the score value is kept in cache. Afterwards, the results of the search engine are organized for presentation to the user according to the score that has been calculated, beginning with the one with the highest score (Figure 4).
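A minimal sketch of this re-ranking step is given below: each result's score is the inner product of the cluster profile vector with the result's concept-similarity vector, and the results are sorted by decreasing score. The vectors and result titles are toy values, not output of the real system.

import numpy as np

cluster_profile = np.array([0.6, 0.3, 0.1])          # pc over the cluster's concepts

results = [                                           # (title, result_sim vector), relation (5)
    ("Download Opera Web Browser", np.array([0.1, 0.9, 0.0])),
    ("The Metropolitan Opera",     np.array([0.9, 0.1, 0.2])),
    ("Opera in the Ozarks",        np.array([0.8, 0.0, 0.1])),
]

scored = [(title, float(cluster_profile @ sim)) for title, sim in results]   # relation (6)
for title, score in sorted(scored, key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}  {title}")                    # highest-scoring result first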

During the interaction of the user with the search engine, the choices of the user are stored in the database so as to be processed in the next run of the method.


Fig. 4. The Personalization algorithm

5 Testing and Evaluation

5.1 Experimental Implementation of the Methodology

We developed a WWW search engine utilizing the Google Search API (http://code.google.com/apis) so as to test our methodology. The Search API returns the URL, the title and a short summary for each of the first ten results of the Google search engine. At first we ran this limited search engine without personalizing the results but accumulating the users' choices. Next, we applied the proposed method and compared the results of the personalized presentation with the non-personalized presentation.

Logging Search History

The Google search API, used for the experimental implementation, returns the URL, the title and a small summary for every result, just like the results of the Google search engine. For our experimental implementation, we use a database for storing the users' choices for every query applied in the limited search engine used for testing.

Through the website of this limited search engine we store the IP address, the domain name and the user agent for the identification of each user. Every time a user enters the search engine, the IP address, the agent and the domain name are identified, preventing the multiple storage of a user in the database. Moreover, the search engine stores in the database the query and the choices of the user for every query. So, for every result that is clicked by the user, the search engine stores the title, the URL and the short summary returned in the database. This database constitutes the history of the requests and is therefore used as the web access logs in this methodology.

Next, we apply the steps of the methodology proposed earlier to the web access logs for the creation of the clusters of semantic user profiles. In the web access logs, i.e. in the database, there are the choices of all the users. For every choice that has been selected we extract the keywords. For the experimental implementation, the methodology for the keyword extraction is similar to the one proposed in [3] for the keywords of the pages that have a link to a specific page. The keywords that are extracted for every URL are accumulated from the title of the URL and the short summary returned by the Google search API. The title and the summary are parsed and cleaned of the HTML tags, and the stop words (very common words, numbers, symbols, articles) are removed, since they are considered not to contribute to the semantic denotation of the web page's content. The words that remain are considered the keywords for every URL, since their number is small and no frequency is taken into consideration. After the running of this step, the keywords for every URL are stored in the database. Next, the keywords are semantically characterized according to the way described in paragraph 3.2. Afterwards, the profiles of the users are created as analyzed in paragraph 3.3 and finally the users are grouped into clusters as described in paragraph 3.4, according to the methodology proposed.
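The keyword-extraction step can be sketched as follows: strip the HTML tags from the title and summary of a clicked result and drop stop words, numbers and symbols. The stop-word list below is illustrative; the real implementation follows the approach of [3].

import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "on"}

def extract_keywords(title, summary):
    text = re.sub(r"<[^>]+>", " ", f"{title} {summary}")    # remove HTML tags
    words = re.findall(r"[a-zA-Z]+", text.lower())           # keep alphabetic tokens only
    return [w for w in words if w not in STOP_WORDS]         # drop stop words

print(extract_keywords("Welcome to <b>LA Opera</b>",
                       "The official site of the LA Opera company"))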

5.2 Experimental Results and Evaluation

In order to evaluate the proposed method and demonstrate the efficient behavior of our personalization method, we performed some queries with polysemy, expecting the personalized results to be reordered according to the profile of the cluster in which the user has been placed, and aiming to verify that our method can improve the results' ranking quality as desired. We applied the queries in the experimental implementation, which returns the first ten results from the Google search engine through the Google search API. In one case we applied our personalization methodology, whereas in the other case we extracted the results as they were returned by the search API.

We evaluated the use of our automatically created user profiles for personalized search using a ranking-based approach. A function is applied to the document-query match values and the rank orders returned by the search engine. The relevant documents are moved higher in the result set and non-relevant documents are demoted.

Our experimental implementation was online for one month and twenty users interacted with it. The choices that they made for every query were stored in the database. The choices were processed and the user profiles were created. Next, we grouped the users into three clusters. The user that made the queries has already been put in a cluster, and the reference ontology of the cluster upon which the score of the results will be based has been created. We should note that the cluster has users that are interested in Acting, Advertising, American, Animation, Apple, Appliances, Artists, Audio, Ballet, Ballroom, Biography, Bonsai, Buses, Cables, Choices, Companies, Darwin, DEC, Exploits, Flowers, Fraud, Games, Journals, Licenses, Mach, Mainframe, Morris, Mosaics, Music, Oceania, Opera, Painters, People, Pick, Programs, Quotations, Reference, Representatives, Roleplaying, Security, Series, Soaps, Sports, Sun, Supplies, Syllable, Telephony, Test Equipment, Youth, Assemblage, Characters, Christian, Computer, Cracking, Creativity, Creators, Drawing, Editorial, Home, Instruments, Internet, Organizations, Radio, Searching, Unix, with various weights for each concept of the reference ontology.

Methodology Performance under Polysemy Queries

An example query that was applied in the search engine was "opera". The word "opera" has a twofold meaning. Opera is a form of musical and dramatic work, and it is also a very commonly used web browser. Thus, it is a query for which the results of the search engines will refer both to music and to computers. The user that is giving the query to the search engine asks for information about opera as a kind of music and expects results related to music. In the following table we can see the results of the search engine. The first column represents the order of the results of the search API without the application of the personalization methodology, while in the second column we can see the order of the personalized results of the experimental application, according to the score of each result. In Table 1 we can see the titles of the results for the query "Opera".

Table 1. Personalized and non-personalized results for the query "Opera" for a user that is searching for opera related to music, where the cluster he belongs to has interest in Arts but in Computers as well

Non-personalized results | Personalized results
1. Download Opera Web Browser (computers) | 1. Opera Software - Company (computers)
2. Opera Software - Company (computers) | 2. Welcome to LA Opera - LA Opera (music)
3. Opera - Wikipedia, the free encyclopedia (music) | 3. Opera - Wikipedia, the free encyclopedia (music)
4. Opera (Internet suite) - Wikipedia, the free encyclopedia (computers) | 4. Opera Community (computers)
5. Opera Mini - Free mobile Web browser for your phone (computers) | 5. Opera (Internet suite) - Wikipedia, the free encyclopedia (computers)
6. Welcome to LA Opera - LA Opera (music) | 6. Opera in the Ozarks (music)
7. OperaGlass (computers) | 7. Opera Mini - Free mobile Web browser for your phone (computers)
8. The Metropolitan Opera (music) | 8. The Metropolitan Opera (music)
9. Opera in the Ozarks (music) | 9. OperaGlass (computers)
10. Opera Community (computers) | 10. Download Opera Web Browser (computers)

Next to each title we give in parenthesis the general concept of the result, which we have concluded after reading the summary. The user searches for results related to music. The first column represents the results that are returned from the search API without personalization. In this column the results that the user searches for are in places 3, 6, 8, 9. On the other hand, the second column has the personalized results, and the results related to music are in places 2, 3, 6, 8. It is obvious that after the application of the proposed personalization methodology the results related to music are pushed to places closer to the top. The cluster into which the user has been placed, as we have mentioned, has many interests that include music, and this has been taken into consideration while calculating the score of each result, pushing the results related to music to a higher place in the list of results. Also, because the returned results have high similarity with the concepts of the cluster reference ontology, the music-related results are pushed closer to the top.

Apart from this query, we have tested the proposed methodology on a second query, "Apple Company". The Apple Company has many meanings. The "Apple Company" is the name of a company that develops and sells products related to computers. Moreover, Apple is the name of the record company that the Beatles created, and the name of another company related to music, the "Mountain Apple Company".

Table 2. Personalized and non-personalized results for the query "Apple Company" for a user that is searching for a company related to music, where the cluster he belongs to has interest in Arts but in Computers as well

Non-personalized results | Personalized results
1. Apple Inc. - Wikipedia, the free encyclopedia (computers) | 1. Apple Moving Company, Austin, Texas (moving company)
2. Welcome to the Apple Company Store (computers) | 2. Hawaiian Music - The Mountain Apple Company (music)
3. Apple-Quicktime (computers) | 3. Little Apple Brewing Company - Something New is Brewing (entertainment)
4. Apple Inc. and the Beatles' Apple Corps Ltd. Enter into New Agreement (music) | 4. Apple Inc. - Wikipedia, the free encyclopedia (computers)
5. Apple company and contact information (computers) | 5. Apple Canyon Company - Specialty Foods From the Heart of New Mexico (food)
6. Hawaiian Music - The Mountain Apple Company (music) | 6. Apple company and contact information (computers)
7. Green Apple Co. Inc. (handcraft) | 7. Green Apple Co. Inc. (handcraft)
8. Little Apple Brewing Company - Something New is Brewing (entertainment) | 8. Welcome to the Apple Company Store (computers)
9. Apple Moving Company, Austin, Texas (moving company) | 9. Apple Inc. and the Beatles' Apple Corps Ltd. Enter into New Agreement (music)
10. Apple Canyon Company - Specialty Foods From the Heart of New Mexico (food) | 10. Apple-Quicktime (computers)


Also, there is a company named "Green Apple Co." which is related to handcraft, a company named "Apple Canyon Company" related to food, a company named "Little Apple Brewing Company" related to entertainment, and a company named "Apple Moving Company" which is a moving company. In Table 2, the first column contains the results as they are returned by the search API, whereas the second column contains the results as they are reorganized according to the score calculated by the personalization methodology we propose. For each result, a general description is given in parenthesis next to the title.

The user keeps on searching for results related to music. The results related to music in the non-personalized presentation are in places 4, 6, 8, while in the personalized presentation the places are 2, 3, 9. The personalization methodology has pushed the desired results towards the first places of the list of results returned by the search engine.

In both examples, the cluster that the user belongs to shows, apart from the interest in music, an interest in computers as well, and this interest is reflected in the results of the applied personalization methodology. The first result in both queries was about computers because the weighted ontology depicting the cluster has higher weights for concepts related to computers than for concepts related to arts. However, given the relatedness of the results with the cluster's preferences, the methodology has pushed the desired results to places higher than the places they occupied without personalization.

Precision Evaluation

The twenty users were asked to characterize the top five results in the personalized and non-personalized result sets as being "relevant" or "non-relevant". On average, before re-ranking, only 40% of the top retrieved pages were found to be relevant. This amount is remarkably lower than the findings in [1], which reports that roughly 50% of documents retrieved by search engines are irrelevant. The reason is that the queries tested by the users had polysemy, thus the probability of retrieving irrelevant results was higher. The re-ranking of the results, by promoting those that classify into concepts that belong to the user's cluster profile, produced an overall performance increase, as shown in Figure 5.

Fig. 5. Average precision of the semantic personalization search engine compared with the non-personalization search engine
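For reference, the evaluation measure used here can be sketched as precision at the top five results, averaged over users; the relevance judgments below are invented and serve only to show the computation.

def precision_at_k(judgments, k=5):
    # judgments: list of booleans, True if the i-th ranked result was judged relevant
    top = judgments[:k]
    return sum(top) / len(top)

non_personalized = [[False, True, False, True, False],   # one judgment list per user
                    [True, False, False, False, True]]
personalized     = [[True, True, False, True, False],
                    [True, True, False, True, True]]

for name, runs in [("non-personalized", non_personalized), ("personalized", personalized)]:
    avg = sum(precision_at_k(j) for j in runs) / len(runs)
    print(f"{name}: average precision@5 = {avg:.2f}")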

We see that the ontology-based system consistently outperforms the simple search, validating our approach of using a reference ontology for clustering user profiles in semantic search.

6 Conclusions and Future Work

We presented a personalization methodology which is based on clustering semantic user profiles. The method analyzes and semantically annotates the web access logs. Next, it organizes the users' profiles and groups the users into clusters. The personalization of the results returned by the search engine is done through an on-the-fly semantic characterization, and the score of each result is calculated. The scores of the results are kept in cache and the results are reorganized and presented to the user according to this score, putting the one with the highest score first. Through the experimental implementation we showed that the proposed personalized method has considerable potential to change the scene in personalization. Future work includes the use of Fuzzy K-Means [2], which allows the creation of overlapping clusters, so that a user may belong to different cluster profiles with different weights. Further directions are the development of a reference ontology with more levels, and the alteration of factors such as the score of each result by taking into consideration the active user's preferences with greater weight than those of the rest of the cluster's users.

References

1. Casasola, E.: ProFusion Personal Assistant: An Agent for Personalized Information Filtering on the WWW. Master's thesis, The University of Kansas (1998)

2. Castellano, G., Torsello, A.: Categorization of web users by fuzzy clustering. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part II. LNCS, vol. 5178, pp. 222–229. Springer, Heidelberg (2008)

3. Gauch, S., Chaffee, J., Pretschner, A.: Ontology-Based User Profiles for Search and Browsing. Web Intelligence and Agent Systems 1(3-4), 219–234 (2003)

4. Chirita, P.A., Nejdl, W., Paiu, R., Kohlschutter, C.: Using ODP metadata to personalize search. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil (2005)

5. Dai, H., Mobasher, B.: Using Ontologies to Discover Domain-Level Web Usage Profiles. In: Proceedings of the 2nd Workshop on Semantic Web Mining at PKDD 2002, Helsinki, Finland (2002)

6. Eirinaki, M., Vazirgiannis, M., Varlamis, I.: SEWeP: Using Site Semantics and a Taxonomy to Enhance the Web Personalization Process. In: Proceedings of the 9th SIGKDD Conference (2003)

7. Garofalakis, J., Giannakoudi, T., Sakkopoulos, E.: An Integrated Technique for Web Site Usage Semantic Analysis: The ORGAN System. Journal of Web Engineering (JWE), Special Issue on Logging Traces of Web Activity 6(3), 261–280 (2007)


8. Gauch, S., Madrid, J., Induri, S., Ravindran, D., Chadlavada, S.: KeyConcept: A Conceptual Search Engine. Information and Telecommunication Technology Center, Technical Report ITTC-FY2004-TR-8646-37, University of Kansas

9. Haveliwala, T.: Topic-Sensitive PageRank. In: Proceedings of the Eleventh International World Wide Web Conference (2002)

10. Ma, Z., Pant, G., Sheng, O.: Interest-based personalized search. ACM Transactions on Information Systems 25(1) (2007)

11. MacQueen, J.B.: Some Methods for Classification and Analysis of Multivariate Observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, vol. 1, pp. 281–297. University of California Press (1967)

12. Makris, C., Panagis, Y., Sakkopoulos, E., Tsakalidis, A.: Category ranking for personalized search. Data and Knowledge Engineering Journal (DKE) 60(1), 109–125 (2007)

13. Middleton, S.E., Shadbolt, N.R., De Roure, D.C.: Ontological User Profiling in Recommender Systems. ACM Transactions on Information Systems 22(1), 54–88 (2004)

14. Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38(11), 39–41 (1995)

15. Mobasher, B., Cooley, R., Srivastava, J.: Automatic Personalization based on Web Usage Mining. Communications of the ACM 43(8), 142–151 (2000)

16. Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet::Similarity - Measuring the Relatedness of Concepts. In: Proceedings of the Nineteenth National Conference on Artificial Intelligence, pp. 1024–1025. AAAI, San Jose (2004)

17. Qiu, F., Cho, J.: Automatic identification of user interest for personalized search. In: Proceedings of the 15th International World Wide Web Conference, Edinburgh, Scotland, U.K. ACM Press, New York (2006)

18. Tanudjaja, F., Mui, L.: Persona: A contextualized and personalized web search. In: Proceedings of the 35th Annual Hawaii International Conference on System Sciences (2002)

19. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, pp. 133–138 (1994)


4 How to Derive Fuzzy User Categories for Web Personalization

Giovanna Castellano and Maria Alessandra Torsello

University of Bari, Department of Informatics, Via Orabona, 4 - 70126 Bari, Italy
{castellano,fanelli,torsello}@di.uniba.it

Summary. Today, Web personalization offers valid tools for the development of applications that have the attractive property to meet in a more effective manner the needs of their users. To do this, Web developers have to address an important challenge concerning the discovery of knowledge about the interests that users exhibit during their interactions with Web sites. Web Usage Mining (WUM) is an active research area aimed at the discovery of useful patterns of typical user behaviors by exploiting usage data. Among the different proposed techniques for WUM, clustering has been widely employed in order to categorize users by grouping together users sharing similar interests. In particular, fuzzy clustering reveals to be an approach especially suitable to derive user categories from Web usage data available in log files. Usually, fuzzy clustering is based on the use of distance-based metrics (such as the Euclidean measure) to evaluate similarity between user preferences. However, the use of such measures may lead to ineffective results by identifying user categories that do not capture the semantic information incorporated in the original Web usage data. In particular, in this chapter, we propose an approach based on a relational fuzzy clustering algorithm equipped with a fuzzy similarity measure to derive user categories. As an application example, we apply the proposed approach on usage data extracted from log files of a real Web site. A comparison with the results obtained using the cosine measure is shown to demonstrate the effectiveness of the fuzzy similarity measure.

Keywords: fuzzy similarity measures, relational fuzzy clustering, Web personalization, Web user categorization, Web Usage Mining.

1 Introduction

The growing diffusion of the Internet as a new medium of information dissemination and the increased number of users that daily browse the network have led more and more organizations to publish their information and to provide their services on the Web. However, the explosive growth in the use and the size of the Internet has increased the difficulties in managing this information and has originated a growing interest in the development of personalized Web applications, i.e. applications able to adapt their content or services to the user interests. Today, Web personalization represents one of the most powerful tools for the improvement of Web-based applications by allowing to provide contents tailor-made to the needs of users, satisfying in this way their actual desires without asking for them. Hence, one of the main challenges that Web applications have to face consists in understanding user preferences and interests in order to provide personalized functions that appeal to the users. As a result, knowledge discovery about user interests reveals to be a crucial activity in the overall process of personalization. In particular, such activity is aimed at the identification of user behavior patterns, i.e. the discovery of common behaviors exhibited by groups of users during their visits to Web sites. Advanced technologies, such as those coming from data mining and Web mining, may offer valid tools to reach this aim. Among these, Web Usage Mining (WUM) [15], [7] is an important branch of Web mining that is devoted to the discovery of interesting patterns in the user browsing behavior through the analysis of Web usage data characterizing the interactions of users with sites. Since access log files store a huge amount of data about user access patterns, they represent the most important source of usage data. Of course, if properly exploited, log files can reveal useful information about the browsing behavior of users in a site. As a consequence, these data can be employed to derive categories of users capturing common interests and trends among users accessing the site. The discovered user categories can be exploited in order to deliver personalized functions to currently connected users.

In the absence of any a priori knowledge, unsupervised classification or clustering seems to be the most promising way for learning user behavior patterns and identifying user categories by grouping together users with common browsing behavior [24], [25]. In the choice of an effective clustering method for WUM, several factors have to be considered. Early research efforts have relied on clustering techniques that often revealed to be inadequate to deal with the noise typically present in Web usage data. In this context, desirable techniques should be able to handle the uncertainty and vagueness underlying data about the interactions of users with the sites. Another important aspect to be considered is the possibility to obtain overlapping clusters, so that a user can belong to more than one group. In effect, the browsing behavior of users is highly uncertain and fuzzy in nature. A Web site is generally visited by a huge number of users having a variety of needs. Moreover, a user may access the same page of a site for different purposes and may have several goals whenever he/she visits a site. Such overlapping interests cannot be adequately captured through crisp partitions obtained by hard clustering techniques that assign each object exclusively to a single cluster. Thanks to their capacities of deriving clusters with hazy boundaries, where objects may have characteristics of different classes with certain degrees, fuzzy clustering methods result particularly suitable for usage mining [17], [10], [23]. The main advantage of fuzzy clustering over hard clustering is that it allows to yield more detailed information about the underlying structure of the data. Another main challenge in the use of clustering for the categorization of Web users is the definition of an appropriate measure that is able to capture the similarity between user interests. In fact, the choice of the distance measure to be incorporated in clustering algorithms highly affects the quality of the obtained partitions.


In this chapter, we focus on the adoption of fuzzy clustering for the categorization of users visiting a Web site. In particular, to extract user categories, we propose the employment of CARD+, a fuzzy relational clustering algorithm that works on data quantifying similarity between user interests.

Instead of using standard similarity measures, such as the cosine-based similarity, we equip CARD+ with a fuzzy distance measure in order to evaluate the similarity degree among each pair of Web users. The adopted measure is directly derived from the similarity quantification of fuzzy sets.

The adoption of similarity metrics based on fuzzy logic theory reveals to be particularly effective to evaluate the similarity among Web users for different reasons. A first advantage deriving from the use of the fuzzy paradigm concerns the possibility to define a measure that is able to deal with data that can have a symbolic nature. In fact, while measures based on the distance concept in metric spaces reveal to be inefficient to deal with this kind of data, fuzzy similarity measures permit to reflect the semantics of the employed data and, hence, to apply clustering processes also to data with a hybrid nature (e.g. numerical, ordinal, and categorical). Moreover, the use of similarity metrics based on fuzzy logic theory is especially appropriate to deal with the vague and imprecise nature characterizing Web usage data. Classical distance-based metrics could not permit to effectively face the uncertainty and the ambiguity that underlie Web interaction data.

The chapter is articulated as follows. Section 2 briefly overviews works that employ different fuzzy clustering techniques for user categorization. Section 3 describes our approach for the categorization of Web users. Firstly, we detail the process of creation of the relation matrix through the computation of the similarity degree among users. Then, we describe CARD+, the clustering algorithm that we employ to extract user categories. In Section 4, we present the results obtained by applying CARD+ on real-world data and we show the values obtained for some validity metrics in order to evaluate the effectiveness of the proposed approach. Finally, Section 5 concludes the chapter by summarizing the key points.

2 Fuzzy Clustering for User Categorization

One active area of research in WUM is represented by clustering of users based on their Web access patterns. User clustering provides groups of users that seem to behave similarly when they browse a Web site. The knowledge discovered by analyzing the characteristics of the identified clusters can be properly exploited in a variety of application domains. For example, in e-commerce applications, clustering of Web users can be used to perform market segmentation. In the e-learning context, user categories discovered by applying clustering algorithms can be employed in order to suggest learning objects that meet the information needs of users or to provide personalized learning courses. Also, clusters of users can be exploited in the personalization process of a Web site, where the aims can be different. For example, user clustering results can help to re-organize the Web portal by restructuring the site content more efficiently, or even to build adaptive Web portals, i.e. portals whose organization and presentation of content change depending on the specific user needs.

Clustering is a well known data mining technique which has been widely used in WUM to categorize the preprocessed Web log data. More precisely, user clustering groups users having similar navigational behavior (and, hence, having common interests) in the same cluster (or user category) and puts users exhibiting dissimilar browsing behavior in different clusters. In WUM, among the different clustering techniques adopted to extract user categories, fuzzy clustering reveals to be particularly effective for mining significant browsing patterns from usage data, thanks to its capacity to handle the uncertain and vague nature underlying Web data. In this section, we give an overview of different works that employ fuzzy clustering methods for the categorization of Web users.

In the literature, surveys of works that propose the employment of fuzzy clustering techniques to support the WUM methodology are presented in [11] and [13]. In [16], different kinds of fuzzy clustering techniques are used to discover user categories. The well-known Fuzzy C-Means (FCM) has been employed in [14] for mining user profiles by partitioning user sessions identified from log data. Here, a user session is defined as the set of the consecutive accesses made by a user within a predefined time period. The FCM algorithm has been successfully applied to Web mining in different works such as [2] and [9]. In [1], the authors proposed a novel 'intelligent miner' that exploits the combination of a fuzzy clustering algorithm and a fuzzy inference system to analyze the trends of the network traffic flow. Specifically, a hybrid evolutionary FCM approach is adopted to identify groups of users with similar interests. Clustering results are then used to analyze the trends by using a Takagi-Sugeno fuzzy inference system learned through a combination of an evolutionary algorithm and neural network learning.

Lazzerini and Marcelloni [12] presented a system based on the use of a fuzzy clustering approach to derive a small number of profiles of typical Web site users, starting from the analysis of Web access log files, and to associate each user to the proper profile. The system is composed of two subsystems: the profiler and the classifier. In the profiler subsystem, the authors applied an Unsupervised Fuzzy Divisive Hierarchical Clustering (UFDHC) algorithm to cluster the users of the Web portal into a hierarchy of fuzzy groups characterized by a set of common interests. Each user group is represented by a cluster prototype which defines the profile of the group members. To identify the profile a specific user belongs to, the classifier employs a classification method which completely exploits the information contained in the hierarchy. In particular, a user is associated with a profile by visiting the tree from the root to the deepest node to which the user belongs with a membership value higher than a fixed threshold. The profile corresponding to this last node is assigned to the user.

In [3], Runkler and Bezdek focused on the use of a relational fuzzy clustering approach for Web mining. This approach results particularly suitable for the management of datasets including non-numerical patterns. In fact, this kind of data can be properly represented numerically by relations among pairs of objects. The obtained relational datasets can subsequently be clustered by means of appropriate clustering algorithms. Specifically, as an application, the authors proposed the use of the Relational Alternating Cluster Estimation (RACE) for the identification of prototypes that can be interpreted as typical user interests.

In [19], the authors proposed an extension of the Competitive Agglomeration clustering algorithm so that it can work on relational data. The resulting Competitive Agglomeration for Relational Data (CARD) algorithm is able to automatically partition session data into an optimal number of clusters. Moreover, CARD can deal with complex and subjective distance/similarity measures which are not restricted to be Euclidean.

Another relational fuzzy clustering method was proposed in [10] for grouping user sessions. In their work, each session includes the pages of a certain traversal path. Here, the Web site topology is considered as a bias in the calculation of the similarity between the sessions, depending on the relative position of the corresponding pages in the site.

In [18], the Relational Fuzzy Clustering-Maximal Density Estimator (RFC-MDE) algorithm was employed to categorize user sessions identified by the analysis process of the Web log data. The authors demonstrated that this algorithm is robust and can deal with the outliers that are typically present in this application. RFC-MDE was applied on real-world examples for the extraction of user profiles from log data.

Many other fuzzy relational clustering algorithms have been used for mining Web usage profiles. Among these, we mention the Fuzzy c-Trimmed Medoids algorithm [9], the Fuzzy c-Medoids (FCMdd) algorithm [20], and the Relational Fuzzy Subtractive clustering algorithm [23].

In the present work we propose an approach based on the use of relational fuzzy clustering for the categorization of Web site users. In particular, we propose the use of CARD+, a relational fuzzy clustering algorithm derived from a modified version of CARD. CARD+ permits to incorporate a similarity measure based on fuzzy logic theory, which enables to better capture similarity degrees among user interests. In the following sections, we describe in more detail the approach that we propose for the identification of fuzzy user categories.

3 Categorization of Web Users

To discover Web user categories encoding interests shared by groups of users, a preliminary activity has to be performed to extract a collection of patterns that model user browsing behaviors.

In our work, the information contained in access log files is exploited to derive such data. Log files are important sources of information in the process of knowledge discovery about user browsing behavior, since they store in chronological order all the information concerning the accesses made by all the users to the Web site. However, access log files contain a huge and noisy amount of data, often comprising a high number of irrelevant and useless records. As a consequence, a preprocessing phase of log files has to be performed so as to retain only data that can be effectively exploited in order to model user navigational behavior.

In this work, the preprocessing of log files is performed by means of LODAP, a software tool that we have implemented for the analysis of Web log files in order to derive models characterizing the user browsing behaviors. To achieve this aim, based on information stored in log files, LODAP executes a first process, known in the literature as sessionization [6], aimed at the derivation of a set of user sessions. More precisely, for each user, LODAP determines the sequence of pages accessed during a predefined time period. User sessions are then exploited to create models expressing the interest degree exhibited by each user for each visited page of the site.

Briefly speaking, log file preprocessing is performed through four main steps:

1. Data Cleaning that removes all redundant and useless records contained in the Web log file (e.g. accesses to multimedia objects, robots' requests, etc.) so as to retain only information concerning accesses to pages of the Web site.

2. Data Structuration that groups the significant requests into user sessions. Each user session contains the sequence of pages accessed by the same user during an established time period.

3. Data Filtering that selects only significant pages accessed in the Web site. In this step, the least visited pages, as well as the most visited ones, are removed.

4. Interest degree computation that exploits information about accessed pages to create a model of the visitor behavior by evaluating a degree of interest of each user for each accessed page.

Main details about the working scheme of LODAP can be found in [5]. As a result, LODAP extracts data which are synthesized in a behavior matrix

B = [bij], where the rows i = 1, . . . , n represent the users and the columns j = 1, . . . , m correspond to the Web pages of the site. Each component bij of the matrix indicates the interest degree of the i-th user for the j-th page. The i-th user behavior vector bi (i-th row of the behavior matrix) characterizes the browsing behavior of the i-th user.
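As a toy illustration of this data structure (the records below are invented; the real matrix is produced by the LODAP preprocessing steps), the behavior matrix can be assembled from (user, page, interest degree) triples as follows.

import numpy as np

users = ["u1", "u2", "u3"]
pages = ["p1", "p2", "p3", "p4"]

records = [("u1", "p1", 0.8), ("u1", "p3", 0.2),    # (user, page, interest degree)
           ("u2", "p2", 0.5), ("u2", "p3", 0.9),
           ("u3", "p1", 0.4), ("u3", "p4", 0.7)]

B = np.zeros((len(users), len(pages)))
for user, page, interest in records:
    B[users.index(user), pages.index(page)] = interest

print(B)   # each row bi is the behavior vector of one user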

Starting from the derived behavior data, CARD+ can be applied to categorize users. In the categorization process, two main activities can be distinguished:

• The creation of the relation matrix containing the dissimilarity values among all pairs of users;

• The categorization of users by grouping similar users into categories.

In the following subsections, we detail the activities performed in the categorization process of Web users.

3.1 Computing Similarity among Web Users

Once the log file preprocessing step has been completed and behavior data are available, the effective categorization process of Web users can start. The first activity in the categorization process of similar users based on the use of relational fuzzy clustering consists in the creation of the relation matrix including the dissimilarity values among all pairs of users. To create the relation matrix, an essential task consists in the evaluation of the (dis)similarities between two generic users on the basis of a proper measure. In our case, based on the behavior matrix, the similarity between two generic users is expressed by the similarity between the two corresponding user behavior vectors.

In the literature, different metrics have been proposed to measure the similarity degree between two generic objects. One of the most common measures employed to this aim is the angle cosine measure [21]. In the specific context of user category extraction, the cosine measure computes the similarity between any two behavior vectors bx and by as follows:

SimCos(bx, by) = (bx b'y) / (‖bx‖ ‖by‖) = ( Σ_{j=1}^{m} bxj byj ) / ( √(Σ_{j=1}^{m} bxj²) √(Σ_{j=1}^{m} byj²) )    (1)

The use of the cosine measure might be ineffective to define the similarity between two users visiting a Web site. In effect, to evaluate the similarity between two generic users (rows of the available matrix), the cosine measure takes into account only the common pages visited by the considered users. This approach may produce ineffective results, leading to the loss of the semantic information underlying Web usage data, related to the relevance of each page for each user.

To better capture the similarity between two generic Web users, we propose the use of a fuzzy similarity measure. Specifically, two generic users are modeled as two fuzzy sets and the similarity between these users is expressed as the similarity between the corresponding fuzzy sets. To do so, the user behavior matrix B is converted into a matrix M = [μij] which expresses the interest degree of each user for each page in a fuzzy way. A very simple characterization of the matrix M is provided as follows:

μij = 0                                     if bij < IDmin
μij = (bij − IDmin) / (IDmax − IDmin)       if bij ∈ [IDmin, IDmax]
μij = 1                                     if bij > IDmax        (2)

where IDmin is a minimum threshold for the interest degree, under which the interest for a page is considered null, and IDmax is a maximum threshold of the interest degree, after which the page is considered surely preferred by the user.

Starting from this fuzzy characterization, the rows of the new matrix M are interpreted as fuzzy sets defined on the set of Web pages. Each fuzzy set μi is related to a user bi and it is simply characterized by the following membership function:

μi(j) = μij,   ∀ j = 1, 2, ..., m    (3)

In this way, the similarity of two generic users is intuitively defined as the similarity between the corresponding fuzzy sets. The similarity among fuzzy sets can be evaluated in different ways [26]. One of the most common measures to evaluate similarity between two fuzzy sets is the following:

σ(μ1, μ2) = |μ1 ∩ μ2| / |μ1 ∪ μ2|    (4)

According to this measure, the similarity between two fuzzy sets is given by the ratio of two quantities: the cardinality of the intersection of the fuzzy sets and the cardinality of the union of the fuzzy sets. The intersection of two fuzzy sets is defined by the minimum operator:

(μ1 ∩ μ2)(j) = min{ μ1(j), μ2(j) }    (5)

The union of two fuzzy sets is defined by the maximum operator:

(μ1 ∪ μ2)(j) = max{ μ1(j), μ2(j) }    (6)

The cardinality of a fuzzy set (also called "σ-count") is computed by summing up all its membership values:

|μ| = Σ_{j=1}^{m} μ(j)    (7)

Summarizing, the similarity between any two users bx and by is defined as follows:

Simfuzzy(bx, by) = ( Σ_{j=1}^{m} min{μbxj, μbyj} ) / ( Σ_{j=1}^{m} max{μbxj, μbyj} )    (8)

This fuzzy similarity measure permits to embed the semantic information incorporated in the user behavior data. In this way, a better estimation of the true similarity degree between two user behaviors is obtained. Similarity values are mapped into the similarity matrix Sim = [Simij], i, j = 1, . . . , n, where each component Simij expresses the similarity value between the user behavior vectors bi and bj calculated by using the fuzzy similarity measure. Starting from the similarity matrix, the dissimilarity values are simply computed as Dissij = 1 − Simij, for i, j = 1, . . . , n. These are mapped into an n × n matrix R = [Dissij], i, j = 1, . . . , n, representing the relation matrix.
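A minimal sketch of relations (2)-(8) is given below: the behavior matrix is fuzzified, the fuzzy (min/max) similarity is computed for every pair of users, and the relation matrix R of dissimilarities is built. The thresholds and the data are toy values.

import numpy as np

ID_MIN, ID_MAX = 0.1, 0.8

def fuzzify(B):
    # Relation (2): map interest degrees to membership values in [0, 1].
    return np.clip((B - ID_MIN) / (ID_MAX - ID_MIN), 0.0, 1.0)

def sim_fuzzy(mu_x, mu_y):
    # Relation (8): sigma-count of the intersection over sigma-count of the union.
    return np.minimum(mu_x, mu_y).sum() / np.maximum(mu_x, mu_y).sum()

B = np.array([[0.9, 0.0, 0.3],
              [0.7, 0.1, 0.4],
              [0.0, 0.8, 0.0]])
M = fuzzify(B)
n = len(M)
R = np.array([[1.0 - sim_fuzzy(M[i], M[j]) for j in range(n)] for i in range(n)])
print(np.round(R, 2))   # relation matrix of pairwise dissimilarities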

3.2 Grouping Users by Fuzzy Clustering

Once the relation matrix has been created, the next activity is the categorization of user behaviors in order to group users with similar interests into a number of user categories. To this aim, we adopt the fuzzy relational clustering approach. In particular, in this work, we employ CARD+, which we proposed in [4] as an improved version of the CARD (Competitive Agglomeration Relational Data) clustering algorithm [17]. A key feature of CARD+ is its ability to automatically categorize the available data into an optimal number of clusters starting from an initial random number. In [17], the authors stated that CARD was able to determine a final partition containing an optimal number of clusters. However, in our experience, CARD proved very sensitive to the initial number of clusters, often providing different final partitions and thus failing to find the actual number of clusters buried in the data. Indeed, we observed that CARD produces redundant partitions, with clusters having a high overlapping degree (very low inter-cluster distance). CARD+ overcomes this limitation by adding a post-clustering process to the CARD algorithm in order to remove redundant clusters.

Like common relational clustering approaches, CARD+ obtains an implicit partition of the object data by deriving the distances from the relational data to a set of C implicit prototypes that summarize the data objects belonging to each cluster in the partition. Specifically, starting from the relation matrix R, the following implicit distances are computed at each iteration step of the algorithm:

dci = (Rzc)i − zcRzc/2 (9)

for all behavior vectors i = 1, . . . , n and for all implicit clusters c = 1, . . . , C, where zc is the membership vector for the c-th cluster, defined on the basis of the fuzzy membership values zci that describe the degree of belongingness of the i-th behavior vector to the c-th cluster. Once the implicit distance values dci

have been computed, the fuzzy membership values zci are updated to optimize the clustering criterion, resulting in a new fuzzy partition of behavior vectors. The process is iterated until the membership values stabilize.

Finally, a crisp assignment of behavior vectors to the identified clusters is performed in order to derive a prototype vector for each cluster, representing a user category. Precisely, each behavior vector is crisply assigned to the closest cluster, creating C clusters:

χ_c = { b_i ∈ B | d_ci < d_ki ∀ k ≠ c },  1 ≤ c ≤ C.    (10)

Then, for each cluster χ_c a prototype vector v_c = (v_c1, v_c2, ..., v_cm) is derived, where

v_cj = ( Σ_{b_i ∈ χ_c} b_ij ) / |χ_c|,   j = 1, ..., m.    (11)

The values v_cj represent the significance (in terms of relevance degree) of a given page p_j to the c-th user category.

Summarizing, CARD+ mines a collection of C clusters from the behavior data, representing categories of users that have accessed the Web site under analysis. Each category prototype v_c = (v_c1, v_c2, ..., v_cm) describes the typical browsing behavior of a group of users with similar interests in the most visited pages of the Web site.
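As an illustration of the post-clustering steps only (eqs. 9-11), the sketch below computes the implicit distances, performs the crisp assignment and derives the category prototypes; it assumes the relation matrix R and the cluster membership vectors produced by the CARD+ iterations are already available, and it is not a full implementation of the algorithm.

import numpy as np

def implicit_distances(R, Z):
    """Eq. (9): d_ci = (R z_c)_i - z_c^T R z_c / 2, for each cluster c (rows of Z)
    and each behavior vector i. R is the n x n relation (dissimilarity) matrix."""
    D = np.empty((Z.shape[0], R.shape[0]))
    for c, z in enumerate(Z):
        Rz = R @ z
        D[c] = Rz - (z @ Rz) / 2.0
    return D

def category_prototypes(B, D):
    """Eqs. (10)-(11): crisply assign each behavior vector to its closest cluster
    and average the assigned vectors to obtain one prototype per user category."""
    labels = D.argmin(axis=0)
    prototypes = []
    for c in range(D.shape[0]):
        members = B[labels == c]
        prototypes.append(members.mean(axis=0) if len(members) else np.zeros(B.shape[1]))
    return np.array(prototypes), labels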

4 Simulation Results

To show the suitability of CARD+ equipped with the fuzzy measure to identify Web user categories, we carried out an experimental simulation. We used the access logs from a Web site targeted at young users (average age 12 years), namely the Italian Web site of the Japanese movie Dragon Ball (www.dragonballgt.it). This site was chosen because of its high daily number of accesses (thousands of visits each day).

The preprocessing of log files

Firstly, the preprocessing of log files was executed to derive models of user behavior. To this aim, LODAP was used to identify user behavior vectors from the log data collected during a period of 12 hours (from 10:00 to 22:00). Once the four steps of LODAP were executed, a 200 × 42 behavior matrix was derived. The 42 pages in the Web site were labeled with a number (see table 1) to facilitate the analysis of results by specifying the content of the Web pages.

Table 1. Description of the retained pages in the Web site

Pages Content

1, ..., 8                      Pictures of characters
9, ..., 13                     Various kinds of pictures related to the movie
14, ..., 18                    General information about the main character
19, 26, 27                     Matches
20, 21, 36                     Services (registration, login, ...)
22, 23, 24, 25, 28, ..., 31    General information about the movie
32, ..., 37                    Entertainment (games, videos, ...)
38, ..., 42                    Description of characters

Categorization of Web users

Starting from the available behavior matrix, the relation matrix was created by using the fuzzy similarity measure.

Next, the CARD+ algorithm (implemented in the Matlab 6.5 environment) was applied to the behavior matrix in order to obtain clusters of users with similar browsing behavior. We carried out several runs, setting a different initial number of clusters Cmax = (5, 10, 15). To establish the goodness of the derived partitions of behavior vectors, two indexes were calculated at the end of each run: the Dunn's index and the Davies-Bouldin index [8]. These have been used in different works to evaluate the compactness of the partitions obtained by several clustering algorithms. Good partitions correspond to large values of the Dunn's index and low values of the Davies-Bouldin index. We observed that CARD+ with the fuzzy similarity measure provided data partitions with the same final number of clusters C = 5, independently of the initial number of clusters Cmax. The validity indexes took the same values in all runs. In particular, the Dunn's index was always equal to 1.35 and the Davies-Bouldin index was 0.13. As a consequence, the CARD+ algorithm equipped with the fuzzy similarity measure proved quite stable, partitioning the available behavior data into 5 clusters corresponding to the identified user categories.


Fig. 1. Comparison of the Dunn's index obtained by the employed algorithms and similarity measures

Fig. 2. Comparison of the Davies-Bouldin index obtained by the employed algorithms and similarity measures

Evaluation results

To evaluate the effectiveness of the employed fuzzy similarity measure, we compared it to the cosine measure within the CARD+ algorithm, carrying out the same trials as in the previous experiments. Moreover, to establish the suitability of CARD+ for the task of user categorization, we applied the original CARD algorithm to categorize user behaviors, employing both the cosine measure and the fuzzy similarity measure for the computation of the relation matrix. In figures 1 and 2, the obtained values for the validity indexes are compared. For each trial, the final number of clusters extracted by the employed clustering algorithm is also indicated. As can be observed, CARD+ with the cosine measure derived partitions that categorized data into 4 or 5 clusters, proving less stable than CARD+ equipped with the fuzzy similarity measure. Moreover, the CARD algorithm showed an unstable behavior with both similarity measures, providing data partitions with a different final number of clusters in each trial.

Analyzing the results obtained over the different runs, we can conclude that CARD+ with the fuzzy similarity measure was able to derive the best partition in terms of compactness; hence, it proved to be a valid approach for the identification of user categories.


The information about the user categories extracted by CARD+ equipped with the fuzzy similarity measure is summarized in table 2. In particular, for each user category (labeled with numbers 1, 2, ..., 5) the pages with the highest degree of interest are indicated. It can be noted that some pages (e.g. P1, P2, P3, P10, P11, and P12) are included in more than one user category, showing how different categories of users may exhibit common interests.

Table 2. User categories identified on real-world data

User category    Relevant pages (interest degrees)

1    P1(55), P2(63), P3(54), P5(52), P7(48), P8(43), P14(66), P28(56), P29(52), P30(37)

2    P1(72), P2(59), P3(95), P6(65), P7(57), P10(74), P11(66), P13(66)

3    P1(50), P2(50), P3(45), P4(46), P5(42), P6(42), P8(34), P9(37), P12(40), P15(41), P16(41), P17(38), P18(37), P19(36)

4    P2(49), P10(47), P11(38), P12(36), P14(27), P31(36), P32(29), P33(39), P34(36), P35(26), P36(20), P37(37), P38(29), P39(30), P40(34), P41(28), P42(24)

5    P4(70), P5(65), P20(64), P21(62), P22(54), P23(63), P24(54), P25(41), P26(47), P27(47)

We can give an interpretation of the identified user categories by identifying the interests of the users belonging to each of them. The interpretation is given below.

• Category 1. Users in this category are mainly interested in information about the movie characters.

• Category 2. Users in this category are interested in the history of the movie and in pictures of the movie and its characters.

• Category 3. These users are mostly interested in the main character of the movie.

• Category 4. These users prefer pages that link to entertainment objects (games and videos).

• Category 5. Users in this category prefer pages containing general information about the movie.

The extracted user categories may be used to implement personalization functions in the considered Web site.

5 Conclusions

The implicit discovery of knowledge about the interests and preferences of users through the analysis of their navigational behavior has become a crucial task for the development of personalized Web applications able to provide information or services adapted to the needs of their users.


To discover significant patterns in user browsing behavior, the WUM methodology has been widely used in the literature. Based on this methodology, knowledge about user interests is discovered by analyzing the usage data describing the interactions of users with the considered Web site. To do this, among the different techniques proposed in the literature, clustering has been largely employed. Specifically, user clustering derives groups of users sharing similar interests, also called user categories. In WUM, fuzzy clustering techniques have proved especially suitable, since they make it possible to capture the overlapping interests that users exhibit when they visit a Web site. In this way, the same user may fall into different categories with a certain membership degree, reflecting the fact that a user may have different kinds of interests or needs when visiting a site. In addition, fuzzy clustering allows a more efficient management of data permeated by uncertainty and ambiguity, which are characteristics of Web interaction data.

In this chapter, to derive user categories from access log files, we proposed an approach based on relational fuzzy clustering. In particular, we presented CARD+, a fuzzy clustering algorithm that works on relational data (expressed in terms of dissimilarities among all pairs of users) to partition user behavior data. To evaluate the similarity between Web users, a fuzzy measure has been proposed. Differently from the traditional distance-based measures typically used in the literature, such as the cosine measure, the fuzzy similarity measure allows the semantic information embedded in the data to be incorporated, better reflecting the concept of similarity among the interests expressed by two generic Web users. In particular, we showed through comparative results that CARD+ equipped with the proposed fuzzy similarity measure outperforms CARD+ equipped with the standard cosine similarity measure. We also showed that it outperforms the original CARD algorithm, whatever the adopted measure. Clusters derived by CARD+ using the fuzzy measure are sufficiently separated and correspond to actual user categories embedded in the available log data. The identified user categories will be exploited to realize personalization functionalities in the considered Web site, such as the dynamic suggestion of links to pages considered interesting for the current user, according to his category membership.

This chapter was intended to provide a contribution to research in the WUM field, emphasizing the suitability and effectiveness of fuzzy clustering techniques in the process of discovering typical patterns in user navigational behavior. In particular, this work focused on the importance of defining new and more appropriate measures for the evaluation of similarity between Web users, in order to obtain more robust clustering results (and, hence, more significant user categories). We highlighted the advantages deriving from the use of fuzzy logic in the definition of similarity measures. Indeed, the employment of similarity measures based on fuzzy logic theory may provide the additional value of introducing a bias into the clustering process, through the definition of a measure embedding context-specific a priori knowledge expressed in linguistic terms. Additionally, a fuzzy definition of the similarity concept may be much more interpretable, since it is more intuitive and closer to the human way of perceiving and understanding. This could enable a better comprehension of the clustering results and their translation into natural language constructs.

Other important facets may be addressed in the process of deriving Web user categories. For example, one of the most interesting aspects concerns the possibility of creating adaptive models of user categories that are able to identify continuous changes in the interests or needs of users and dynamically adapt the user categories according to these changes. This opens a new challenge in WUM and a promising research direction for the development of Web applications equipped with even more refined and effective personalization functions.

References

1. Abraham, A., Wang, X.: i-Miner: A Web Usage Mining Framework Using Hierarchical Intelligent Systems. In: The IEEE Int. Conf. on Fuzzy Systems, pp. 1129–1134. IEEE Press, Los Alamitos (2003)

2. Arotaritei, D., Mitra, S.: Web Mining: a survey in the fuzzy framework. Fuzzy Sets and Systems 148, 5–19 (2004)

3. Bezdek, J.C.: Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York (1981)

4. Castellano, G., Fanelli, A.M., Torsello, M.A.: Relational Fuzzy approach for Mining User Profiles. LNCI, pp. 175–179. WSEAS Press (2007)

5. Castellano, G., Fanelli, A.M., Torsello, M.A.: LODAP: A Log Data Preprocessor for mining Web browsing patterns. In: Proc. of the 6th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED 2007), Corfu Island, Greece (2007)

6. Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems 1, 5–32 (1999)

7. Facca, F.M., Lanzi, P.L.: Mining interesting knowledge from weblogs: a survey. Data and Knowledge Engineering 53, 225–241 (2005)

8. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Cluster Validity Methods: Part II. SIGMOD Record (2002)

9. Krishnapuram, R., Joshi, A., Nasraoui, O., Yi, L.: Low-complexity fuzzy relational clustering algorithms for web mining. Journal IEEE-FS 9, 595–607 (2001)

10. Joshi, A., Joshi, K.: On mining Web access logs. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 63–69 (2000)

11. Joshi, A., Krishnapuram, R.: Robust Fuzzy Clustering Methods to Support Web Mining. In: Proc. ACM SIGMOD Workshop on Data Mining and Knowledge Discovery (August 1998)

12. Lazzerini, B., Marcelloni, F.: A hierarchical fuzzy clustering-based system to create user profiles. International Journal on Soft Computing 11, 157–168 (2007)

13. Liu, M., Lui, Y., Hu, H.: Web Fuzzy Clustering and its applications in Web Usage Mining. In: 9th International Symposium on Future Software Technology (ISFST), Xi'an, China (October 20–23, 2004)

14. Martin-Bautista, M.J., Vila, M.A., Escbar-Jeria, V.H.: In: IADIS European Conference Data Mining, pp. 73–76 (2008)


15. Mobasher, B., Cooley, R., Srivastava, J.: Automatic personalization based on Web usage mining. TR-99010, Department of Computer Science, DePaul University (1999)

16. Mobasher, B.: Web Usage Mining and Personalization. In: Practical Handbook of Internet Computing. CRC Press LLC, Boca Raton (2005)

17. Nasraoui, O., Frigui, H., Joshi, A., Krishnapuram, R.: Mining Web access logs using relational competitive fuzzy clustering. In: Proc. of the Eighth International Fuzzy Systems Association World Congress (1999)

18. Nasraoui, O., Krishnapuram, R., Joshi, A.: Relational Clustering based on a new robust estimator with application to Web mining. In: Proc. of the North American Fuzzy Information Society, pp. 705–709 (1999)

19. Nasraoui, O., Krishnapuram, R., Frigui, H., Joshi, A.: Extracting Web user profiles using relational competitive fuzzy clustering. International Journal on Artificial Intelligence Tools 9(4), 509–526 (2000)

20. Nasraoui, O., Krishnapuram, R., Joshi, A., Kamdar, T.: Automatic Web User Profiling and Personalization using a Robust Fuzzy Relational Clustering. In: E-Commerce and Intelligent Methods, Studies in Fuzziness and Soft Computing (2002)

21. Rossi, F., De Carvalho, F., Lechevallier, Y., Da Silva, A.: Dissimilarities for Web Usage Mining. In: Data Science and Classification, Studies in Classification, Data Analysis and Knowledge Organization, pp. 39–46 (2006)

22. Runkler, T.A., Bezdek, J.C.: Web mining with relational clustering. International Journal of Approximate Reasoning 32, 217–236 (2003)

23. Suryavanshi, B.S., Shiri, N., Mudur, S.P.: An efficient technique for mining usage profiles using Relational Fuzzy Subtractive Clustering. In: Proc. of WIRI 2005, Tokyo, Japan (2005)

24. Vakali, A., Pokorny, J., Dalamagas, T.: An Overview of Web Data Clustering Practices. In: EDBT Workshops, pp. 597–606 (2004)

25. Wang, X., Abraham, A., Smith, K.A.: Intelligent web traffic mining and analysis. Journal of Network and Computer Applications 28, 147–165 (2005)

26. Zhizhen, L., Pengfei, S.: Similarity measures on intuitionistic fuzzy sets. Pattern Recognition Letters 24, 2687–2693 (2003)



5

A Taxonomy of Collaborative-Based Recommender Systems

Fabian P. Lousame and Eduardo Sanchez

1 Introduction

The explosive growth in the amount of information available on the WWW and the emergence of e-commerce in recent years have demanded new ways to deliver personalized content. Recommender systems [51] have emerged in this context as a solution based on collective intelligence to either predict whether a particular user will like a particular item or identify the collection of items that will be of interest to a certain user. Recommender systems have an excellent ability to characterize and recommend items within huge collections of data, which makes them a computerized alternative to human recommendations. Since useful personalized recommendations can add value to the user experience, some of the largest e-commerce web sites include recommender engines. Three well-known examples are Amazon.com [1], LastFM [4] and Netflix [6].

Although the first studies can be traced back to cognitive science, approximation theory and information retrieval, among other fields, recommender systems became an independent research area in the mid-1990s when Resnick et al. [50], Hill et al. [29] and Shardanand et al. [56] proposed recommendation techniques explicitly based on user rating information. Since then, numerous approaches have been developed that use content or historical information: user-item interactions, explicit ratings, or web logs, among others. Nowadays, recommender systems are typically classified into the following categories:

• content-based, if the user is recommended items that are content-similar to the items the user already liked;

• collaborative, if the user is recommended items that people with similar tastes and preferences liked in the past;

• hybrid, if the user is recommended items based on a combination of both collaborative and content-based methods.

This chapter presents a study focused on recommender systems based on collaborative filtering, the most successful recommendation technique to date. The chapter provides the reader with an overview of recommender systems based on collaborative filtering, contributes a general taxonomy to classify the algorithms and approaches according to a set of relevant features, and finally provides some guidelines to decide which algorithm best fits a given recommendation problem or domain.

2 Recommending Based on Collaborative Filtering

The term Collaborative Filtering (CF) was first introduced by Goldberg et al. [23]. They presented Tapestry, an experimental mail system that combined both content-based filtering and collaborative annotations. Although the system was enriched with collaborative information, users were required to write complex queries. The first system that automated recommendations was the GroupLens system [50, 37], which helped users find relevant netnews from a huge stream of articles using ratings given by other similar users. Since then, many relevant research projects have been developed (Ringo [56], Video Recommender [29], Movielens [19, 5], Jester [24]) and the results have positioned CF techniques as the most successful ones for building recommender engines. Popular e-commerce systems, such as Amazon [1], CDNow [3] or LastFM [4], are taking advantage of these engines.

CF relies on the assumption that finding users similar to a new one and examining their usage patterns leads to useful recommendations for the new user. Users usually prefer items that like-minded users prefer, or even that dissimilar users don't prefer. This technology does not rely on the content descriptions of the items, but depends on preferences expressed by a set of users. These preferences can either be expressed explicitly by numeric ratings or be indicated implicitly by user behaviors, such as clicking on a hyperlink, purchasing a book or reading a particular news article. CF requires no domain knowledge and offers the potential to uncover patterns that would be difficult or impossible to detect using content-based techniques. Besides that, collaborative filtering has proved its ability to identify the most appropriate item for each user, and the quality of recommendations improves over time as the user database gets larger.

Two different approaches have been explored for building pure CF recommenders. The first approach, referred to as memory-based [56, 37, 15, 50, 27, 54], essentially makes rating predictions based on the entire collection of rated items. Items frequently selected by users of the same group can be used to form the basis for building a list of recommended items. Memory-based methods produce high-quality recommendations but suffer serious scalability problems as the numbers of users and items grow. The other approach, known as model-based [56, 14, 15, 9], analyzes historical interaction information to build a model of the relations between different items/users, which is then used to find the recommended items. Model-based schemes produce faster recommendations than memory-based ones do, but require a significant amount of time to build the models and lead to lower quality recommendations.

Definitions and Notation

In the context of recommender systems, a dataset is defined as the collection of all transactions about the items that have been selected by a collection of users.


Symbols n and m will be used in this text to denote the number of distinct users and items in a particular dataset, respectively. Each dataset will be represented formally by an n × m matrix that will be referred to as the user-item matrix, A = U × I. U denotes the set of all users and I the set of all items available in the database. The value of element a_{k,i} ∈ {1, 0} denotes whether an interaction between user k and item i has been observed or not.

In a recommendation problem, there usually exists additional information about the utility of the user-item interactions, commonly captured as a rating that indicates how much a particular user liked a particular item. This rating information is represented in a different n × m matrix that will be denoted R. The rating that user k expressed for item i is in general a real number and will be referred to as r_{k,i}. r_k denotes the vector of all ratings of user k. In recommender systems terminology, the active user is the user that queries the recommender system for recommendations on some items. The symbol a will be used to refer to the active user's rating vector. By convention, if d_i denotes the vector that results from taking row i from a certain matrix D, d_j^T will be used to denote the vector that results from taking column j from that matrix.

The symbol A_k refers to the set of items the user has already experienced and R_k is the set of items for which user k has actually given ratings. Note that R_k ⊆ I and R_k ⊆ A_k.

Problem Formulation

In its most common formulation, the CF recommendation problem is reduced to the problem of estimating, using collaborative features, the utility of the items that have not been selected by the active user. Once these utilities for unseen items are estimated, a top-N recommendation can be built for every user by recommending the user the items with the highest estimated values. This estimation is usually computed from the ratings explicitly given by the active user to a specific set of items (rating-based filtering), but ratings could also be derived from historical data (purchases, ...) or from other sources of information. In the rest of the chapter we will assume, without loss of generality, that interactions are based on rating activity. In movie recommendation, for instance, the input to the recommender engine would be a set of movies the user has seen, with some numerical rating associated with each of these movies. The output of the recommender system would be another set of movies, not yet rated by the user, that the recommender predicts to be highly rated by the user.

More formally, given the user-item rating matrix R and the set of ratings a specified by the active user, the recommender engine tries to identify an ordered set of items X such that X ∩ R_k = ∅. To achieve this, the recommendation engine defines a function

ν : U × I → ℝ
(k, j) → ν(k, j) = E(r_{k,j})    (1)


Fig. 1. Illustration of the recommendation process. Given the vector of ratings of the active user, the collaborative filtering algorithm produces a recommendation by selecting the N items with the highest estimated predictions.

that predicts the utility of the interactions between each user k and every item j. Note that for a given user k, the utilities need to be computed only for items j ∈ I − R_k. Once all utilities are predicted, recommendations to the active user are made by selecting the items with the highest estimated utility (see figure 1). The prediction computation is usually performed on a sparse user-item matrix. Typical values of sparsity are in the order of 98%, which means an almost empty interaction matrix.

In addition to recommender systems that predict the absolute values of ratings, there are other proposals focused on preference-based filtering, i.e., predicting the relative preferences of users [18, 35, 36]. These techniques predict the correct relative order of the items, rather than their individual ratings.

2.1 Memory-Based Collaborative Filtering

Memory-based collaborative filtering is motivated by the observation that users usually trust the recommendations of like-minded neighbors. These methods are aimed at computing unknown relations between users and items by means of nearest neighbor schemes that either identify pairs of items that tend to be rated similarly or users with a similar rating history. Memory-based collaborative filtering methods became very popular because they are easy to implement, very intuitive, avoid the need to train and tune many parameters, and the user can easily understand the rationale behind each recommendation.

Three components characterize this approach: (1) data preprocessing, in which the input data to the recommender engine is preprocessed to remove global effects, normalize ratings, etc.; (2) neighborhood selection, which consists in selecting the set of K users [items] that are most similar to the active user [to the set of items already rated by the active user]; and (3) prediction computation, which generates predictions and aggregates items into a top-N recommendation. Table 1 summarizes the different memory-based algorithms that are briefly explained in the next subsections.


Table 1. Summary of memory-based algorithms based on the different components of the recommendation process

User-based. Data (preprocessing): ratings (default voting); Neighborhood selection: Pearson correlation, vector similarity (inverse user frequency), mean squared difference; Prediction computation: rating aggregation, most frequent item

Predictability paths. Data: ratings; Neighborhood selection: predictability condition heuristics; Prediction computation: linear rating transformation

Item-based. Data (preprocessing): ratings (adjusted ratings); Neighborhood selection: vector similarity, Pearson correlation, conditional probability-based similarity; Prediction computation: rating aggregation, regression based

Item-to-item. Neighborhood selection: co-occurrence

Cluster-based smoothing. Data (preprocessing): ratings (cluster-based smoothing); Neighborhood selection: Pearson correlation; Prediction computation: rating aggregation

Trust inferences. Data: ratings; Neighborhood selection: trust computation between users (Pearson correlation), weighted average composition; Prediction computation: rating aggregation

Improved neighborhood. Data (preprocessing): ratings (remove global effects); Neighborhood selection: weight optimization; Prediction computation: rating aggregation

User-Based

This CF approach estimates unknown ratings based on recorded ratings of like-minded users. The predicted rating of the active user for item j is a weighted sum of the ratings of other users,

ν_{k,j} = r̄_k + ( Σ_{l∈U_k} w_{k,l} · (r_{l,j} − r̄_l) ) / ( Σ_{l∈U_k} |w_{k,l}| )    (2)

where U_k denotes the set of users in the database that satisfy w_{k,l} ≠ 0. These weights can reflect distance, correlation or similarity between each user and the active user. r̄_k and r̄_l represent the mean ratings of the active user k and user l, respectively.

Different weighting functions can be considered; Pearson correlation, cosine vector similarity, Spearman correlation, entropy-based uncertainty and mean-square difference are some examples. The Pearson correlation (eq. 3)¹ was the first measure used to compute these weights [50]. Breese et al. [15] and Herlocker et al. [27] showed that Pearson correlation performs better than other metrics.

w_{k,l} = Σ_{i∈R_k∩R_l} (r_{k,i} − r̄_k)(r_{l,i} − r̄_l) / ( √(Σ_{i∈R_k∩R_l} (r_{k,i} − r̄_k)²) · √(Σ_{i∈R_k∩R_l} (r_{l,i} − r̄_l)²) )    (3)

¹ Note that the Pearson correlation is defined in [−1, +1]; in order to make sense when using negative weights, ratings should be re-scaled to fit [−r, +r].
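A minimal sketch of the user-based scheme of equations (2) and (3), assuming a dense rating matrix with np.nan marking missing ratings; taking each user's mean over the whole rating vector is one common reading of r̄_k and is an assumption of this sketch.

import numpy as np

def pearson_weight(rk, rl):
    """Eq. (3): Pearson correlation restricted to the items co-rated by k and l."""
    both = ~np.isnan(rk) & ~np.isnan(rl)
    if both.sum() < 2:
        return 0.0
    dk = rk[both] - np.nanmean(rk)
    dl = rl[both] - np.nanmean(rl)
    denom = np.sqrt((dk ** 2).sum()) * np.sqrt((dl ** 2).sum())
    return float(dk @ dl / denom) if denom > 0 else 0.0

def predict_user_based(R, k, j):
    """Eq. (2): prediction for user k on item j as a weighted sum of the
    mean-centred ratings of the users who rated j."""
    num, den = 0.0, 0.0
    for l in range(R.shape[0]):
        if l == k or np.isnan(R[l, j]):
            continue
        w = pearson_weight(R[k], R[l])
        if w != 0.0:
            num += w * (R[l, j] - np.nanmean(R[l]))
            den += abs(w)
    return np.nanmean(R[k]) + (num / den if den > 0 else 0.0)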


Vector similarity is another weighting function that can be used to measure the similarity between users:

w_{k,l} = Σ_{i∈R_k∩R_l} r_{k,i} · r_{l,i} / ( √(Σ_{i∈R_k∩R_l} r_{k,i}²) · √(Σ_{i∈R_k∩R_l} r_{l,i}²) )    (4)

Though Pearson correlation and vector similarity are the most popular, other metrics are also used. For instance, Shardanand and Maes [56] used a Mean Squared Difference to compute the degree of dissimilarity between users k and l; predictions were made by considering all users whose dissimilarity to the active user was less than a certain threshold and computing the weighted average of the ratings provided by the most similar users, where weights were inversely proportional to this dissimilarity. They also presented a Constrained Pearson correlation to take into account the positivity and negativity of ratings in absolute scales.

Most frequent item recommendation. Instead of using equation 2 to compute predictions and then constructing a top-N recommendation by selecting the highest predicted items, each similar item could be ranked according to how many similar users selected it

s_{k,j} = Σ_{l∈U_k : a_{l,j}=1} 1    (5)

and the recommendation list would then be computed by sorting the N most frequently selected items.

Weighting Schemes

Breese et al. [15] investigated different modifications to the weighting function that have been shown to improve the performance of this memory-based approach:

Default voting was proposed as an extension of the Pearson correlation (equation 3) that improves the similarity measure in cases in which either the active user or the matching user has relatively few ratings (R_k ∩ R_l has very few items). Refer to [15] for a mathematical formulation.

Inverse user frequency tries to reduce the weights of commonly selected items, based on the idea that commonly selected items are not as useful in characterizing the user as those items that are selected less frequently. Following the original concepts from the domain of information retrieval [10], the inverse user frequency can be defined as:

f_i = log( |{u_k}| / |{u_k : i ∈ B_k}| ) = log( n / n_i )    (6)

where n_i is the number of users who rated item i and n is the total number of users in the database. To use the inverse user frequency in equation 4, the transformed rating is simply the original rating multiplied by the inverse user frequency. It can also be used in the correlation, but the transformation is not direct (see Breese et al. [15] for a detailed description).
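A short sketch of equation (6); A is assumed to be the binary user-item interaction matrix introduced earlier, and the transformed ratings are obtained by element-wise multiplication, as the text describes for the vector similarity case.

import numpy as np

def inverse_user_frequency(A):
    """Eq. (6): f_i = log(n / n_i), where n_i is the number of users who
    selected item i and n the total number of users."""
    n = A.shape[0]
    n_i = A.sum(axis=0)
    f = np.zeros(A.shape[1])
    nonzero = n_i > 0
    f[nonzero] = np.log(n / n_i[nonzero])
    return f

# Transformed ratings for use in the vector similarity of eq. (4):
# R_transformed = R * inverse_user_frequency(A)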


Predictability Paths

Aggarwal et al. [9] proposed a graph-based recommendation algorithm in which the users are represented as nodes of a graph and the edges between the nodes indicate the degree of similarity between the users. The recommendations for a user are computed by traversing nearby nodes in this graph. The graph representation has the ability to capture transitive relations which cannot be captured by nearest neighborhood algorithms. The authors reported better performance than the user-based schemes.

The approach is based on the concepts of horting and predictability. The horting condition states whether there is enough overlap between each pair of users (k, l) to decide whether the behavior of one user could predict the behavior of the other or not. By definition, user k horts user l if the following equation is satisfied:

card(R_k ∩ R_l) ≥ min(F · card(R_k), G)    (7)

where F ≤ 1 and G is some predefined threshold. The predictability condition establishes that user l predicts the behavior of user k if there exists a linear rating transformation

T_{s_{k,l}, t_{k,l}} : x_{k,j} = s · r_{l,j} + t    (8)

that carries ratings r_{l,j} of user l into ratings x_{k,j} of user k with an acceptable error. The (s, t) pair of real numbers is chosen so that transformation 8 keeps at least one value in the rating domain (see [9] for further details on the restrictions on the s-t value pair). More formally, user l predicts user k if user k horts user l (eq. 7) and if there exists a linear rating transformation T_{s,t} such that expression 9 is satisfied, with β a positive real number.

Σ_{j∈R_k∩R_l} |r_{k,j} − x_{k,j}| / card(R_k ∩ R_l) < β    (9)

Each arc between users k and l indicates that user l predicts user k and therefore has an associated linear transformation T_{s_{k,l}, t_{k,l}}. Using an appropriate graph search algorithm, a set of optimal directed paths between user k and any user l that selected item j can be constructed. Each directed path allows a rating prediction to be computed based on the composition of transformations (eq. 8). For instance, given the directed graph k → l1 → ... → ln with predictor values (s_{k,1}, t_{k,1}), (s_{1,2}, t_{1,2}), ..., (s_{n−1,n}, t_{n−1,n}), the predicted rating of item j will be T_{s_{k,1},t_{k,1}} ◦ (T_{s_{1,2},t_{1,2}} ◦ (... ◦ T_{s_{n−1,n},t_{n−1,n}}(r_{n,j}) ...)). Since different paths may exist, the average of these predicted ratings is computed as the final prediction. A top-N recommendation is constructed by aggregating the N items with the highest predicted ratings.
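The following sketch illustrates the horting and predictability tests of equations (7)-(9) for a single pair of users; the (s, t) pair is fitted here by ordinary least squares, which is only one admissible choice, since [9] restricts the pair so that the transformation stays within the rating domain. The thresholds F, G and beta are free parameters of the sketch.

import numpy as np

def predicts(r_k, r_l, F=0.5, G=5, beta=1.0):
    """Return the fitted (s, t) of eq. (8) if user l predicts user k, else None.
    Rating vectors use np.nan for missing items."""
    common = ~np.isnan(r_k) & ~np.isnan(r_l)
    rated_k = (~np.isnan(r_k)).sum()
    if common.sum() < min(F * rated_k, G):           # horting condition, eq. (7)
        return None
    if common.sum() < 2:
        return None
    s, t = np.polyfit(r_l[common], r_k[common], 1)   # x_{k,j} = s * r_{l,j} + t, eq. (8)
    err = np.abs(r_k[common] - (s * r_l[common] + t)).mean()
    return (s, t) if err < beta else None            # predictability condition, eq. (9)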

Item-Based

The item-based algorithm is an analogous alternative to the user-based approach that was proposed by Sarwar et al. [53] to address the scalability problems of the user-based approach. The algorithm, in its original formulation, generates a list of recommendations for the active user by selecting new items that are similar to the collection of items already rated by the user. As for the user-based approach, the item-based approach consists of two different components: the similarity computation and the prediction computation.

There are different ways to compute the similarity between items. Here we present four of these methods: vector similarity, Pearson correlation, adjusted vector similarity and conditional probability-based similarity.

Vector similarity. One way to compute the similarity between items is to consider each item i as a vector in the n-dimensional user space. The similarity between any two items i and j is measured by computing the cosine of the angle between these two vectors:

w_{i,j} = Σ_{k∈U_i∩U_j} r_{k,i} · r_{k,j} / ( √(Σ_{k∈U_i∩U_j} r_{k,i}²) · √(Σ_{k∈U_i∩U_j} r_{k,j}²) )    (10)

where the summation extends over the users who rated both items, k ∈ U_i ∩ U_j.

Pearson correlation. Similarly to equation 3, the Pearson correlation between items i and j is given by:

w_{i,j} = Σ_{k∈U_i∩U_j} (r_{k,i} − r̄_i)(r_{k,j} − r̄_j) / ( √(Σ_{k∈U_i∩U_j} (r_{k,i} − r̄_i)²) · √(Σ_{k∈U_i∩U_j} (r_{k,j} − r̄_j)²) )    (11)

where r̄_i and r̄_j denote the average ratings of items i and j, respectively.

Adjusted vector similarity. Computing the similarity between items using the vector similarity has one important drawback: the difference in the rating scale between different users is not taken into account. This similarity measure addresses the problem by subtracting the corresponding user average from each rating:

w_{i,j} = Σ_{k∈U_i∩U_j} (r_{k,i} − r̄_k)(r_{k,j} − r̄_k) / ( √(Σ_{k∈U_i∩U_j} (r_{k,i} − r̄_k)²) · √(Σ_{k∈U_i∩U_j} (r_{k,j} − r̄_k)²) )    (12)

Conditional probability-based similarity. An alternative way to compute the similarity between each pair of items is to use a measure based on the conditional probability of selecting one of the items given that the other item was selected. This probability can be expressed as the number of users that selected both items i and j divided by the total number of users that selected item i:

w_{i,j} = P(j|i) = |{u_k : i, j ∈ R_k}| / |{u_k : i ∈ R_k}|    (13)

Note that this similarity measure is not symmetric: P(j|i) ≠ P(i|j).


To compute predictions using the item-based approach, a recommendation list is generated by ranking items with a prediction measure computed by taking a weighted average over all of the active user's ratings for items in the collection R_k:

ν_{k,j} = Σ_{i∈R_k} r_{k,i} · w_{i,j} / Σ_{i∈R_k} |w_{i,j}|    (14)

Model Based Interpretation

Since similarities among items do not change frequently, the relations between items can be stored in a model M. This is why some researchers consider the item-based scheme a model-based approach to collaborative filtering. Model M could contain all relations between pairs of items, but one common approach is to store, for each item i, its top-K similar items only. This parameterization of M by K is motivated by performance considerations. By using a small value of K, M is very sparse and the similarity information can be stored in memory even in situations in which the number of items in the dataset is very large. However, if K is very small, the resulting model will contain limited information and could potentially lead to low quality recommendations (see [53] for further reading).
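A compact sketch of the item-based scheme with the adjusted vector similarity (eq. 12), a top-K model M, and the prediction of eq. (14). The rating matrix R is assumed dense with np.nan for missing entries; as a simplification, the norms below run over all of an item's ratings rather than only the co-rated users.

import numpy as np

def adjusted_cosine(R):
    """Eq. (12): item-item similarity computed on user-mean-centred ratings."""
    centred = R - np.nanmean(R, axis=1, keepdims=True)
    X = np.nan_to_num(centred)                  # missing ratings contribute zero
    norms = np.sqrt((X ** 2).sum(axis=0))
    W = (X.T @ X) / (np.outer(norms, norms) + 1e-12)
    np.fill_diagonal(W, 0.0)
    return W

def top_k_model(W, K=20):
    """Keep only the K most similar items per item (the sparse model M)."""
    M = np.zeros_like(W)
    for i in range(W.shape[0]):
        keep = np.argsort(W[i])[-K:]
        M[i, keep] = W[i, keep]
    return M

def predict_item_based(R, M, k, j):
    """Eq. (14): weighted average of user k's ratings over items similar to j."""
    rated = ~np.isnan(R[k])
    w = M[rated, j]
    denom = np.abs(w).sum()
    return float(R[k, rated] @ w / denom) if denom > 0 else float("nan")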

Item-to-Item Extension

Linden et al. [41] proposed this extension to the item-based approach, which is capable of producing recommendations in real time, scaling to massive datasets and generating high-quality recommendations. The algorithm is essentially an item-based approach but includes several refinements that make the item-to-item algorithm faster than the item-based one: (1) the similarity computation is extended only to item pairs with common users (co-occurring items) and (2) the recommendation list is computed by looking into a small set that aggregates items found similar to a certain basket of user selections.

To determine the most similar match for a given item, the algorithm builds a co-occurrence matrix by finding items that users tend to select together. The similarity between two items i and j is non-zero if at least q + 1 users have selected the pair (i, j), with q ≥ 0 some predefined threshold. The similarity between two items satisfying this property can be computed in various ways, but a common method is to use the cosine similarity described in equation 10. Predictions for new items are computed with equation 14 (see [41] for further details).

Cluster-Based Smoothing

Xue et al. [59] proposed a collaborative filtering algorithm that provides higher accuracy as well as increased efficiency in recommendations. The algorithm is a user-based algorithm that has been enhanced with clustering and a rating smoothing mechanism based on the clustering results. Clustering was performed using the K-means algorithm with the Pearson correlation coefficient (eq. 3) as the distance metric between users.

Data smoothing is a mechanism to fill in the missing values of the rating matrix. To do data smoothing, Xue et al. [59] made explicit use of the clusters as smoothing mechanisms. Based on the clustering results, they applied the following smoothing strategy

r′_{k,j} = { r_{k,j}    if r_{k,j} ≠ ∅
           { r̂_{k,j}    otherwise    (15)

where r̂_{k,j} denotes the smoothed value of user k's rating towards an item j. Taking the diversity of the users into account, Xue et al. [59] proposed the following equation to compute the smoothed rating:

r̂_{k,j} = r̄_k + Δr_{C(k),j} = r̄_k + (1 / |C(k, j)|) Σ_{l∈C(k,j)} (r_{l,j} − r̄_l)    (16)

where C(k) denotes the cluster of user k and C(k, j) the subset of users in cluster C(k) that rated item j. Smoothed ratings are used to compute a pre-selection of neighbors. Basically, given the active user k, a set of most similar clusters is selected to build a neighborhood of similar users. After this preselection, the similarity between each user l in the neighborhood and the active user is computed using the smoothed representation of the user ratings,

w_{k,l} = Σ_{j∈R_k} δ_{l,j} · (r′_{k,j} − r̄_k)(r′_{l,j} − r̄_l) / ( √(Σ_{j∈R_k} (r′_{k,j} − r̄_k)²) · √(Σ_{j∈R_k} δ²_{l,j} (r′_{l,j} − r̄_l)²) )    (17)

where

δ_{l,j} = { 1 − λ    if r_{l,j} ≠ ∅
          { λ        otherwise    (18)

represents the confidence weight for user l on item j. λ ∈ [0, 1] is a parameter for tuning the weight between the user rating and the cluster rating.

Predictions for the active user are computed by aggregating the ratings of the top-K most similar users, in the same manner as for the user-based algorithm (see equation 2):

ν_{k,j} = r̄_k + Σ_{l∈U_k} δ_{l,j} · w_{k,l} · (r_{l,j} − r̄_l) / Σ_{l∈U_k} δ_{l,j} · |w_{k,l}|    (19)

By assigning different values to λ, Xue et al. [59] adjusted the weighting scheme. For instance, if λ = 0 the algorithm only uses the original rating information for the similarity computation and prediction, whereas if λ = 1 the algorithm is a cluster-based CF that uses the average cluster ratings for similarity and prediction computation.
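The smoothing step of equations (15)-(16) can be sketched as follows; plain KMeans on mean-centred, zero-filled ratings stands in for the Pearson-based K-means clustering used in [59], so this is an illustrative simplification rather than the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def smooth_ratings(R, n_clusters=10, seed=0):
    """Eqs. (15)-(16): keep observed ratings and fill missing ones with the
    user's mean plus the average deviation of the user's cluster on that item."""
    means = np.nanmean(R, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(np.nan_to_num(R - means))
    R_smooth = R.copy()
    deviations = R - means                           # r_{l,j} - mean(r_l)
    for c in range(n_clusters):
        members = labels == c
        cluster_devs = deviations[members]
        counts = (~np.isnan(cluster_devs)).sum(axis=0)
        sums = np.nansum(cluster_devs, axis=0)
        dev = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)   # Delta r_{C(k),j}
        for k in np.where(members)[0]:
            missing = np.isnan(R[k])
            R_smooth[k, missing] = means[k, 0] + dev[missing]
    return R_smooth, labels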

Trust Inferences

This approach focuses on developing a computational model that makes it possible to explore transitive user similarities based on trust inferences. Papagelis et al. [46] presented the concept of associations between users as an expression of the trust established between them. This trust is defined in the context of similarity conditions and is computed by means of the Pearson correlation (see equation 3). The more similar two users are, the greater their established trust becomes.

While the computation of trust in direct associations is based on user-to-user similarity, for length-K associations a transitive rule is adopted. According to this rule, trust is propagated in the network and associations between users are built even if they have no co-rated items. If V = {V_i; i = 1, 2, ..., K} is the set of all intermediate nodes in a trust path that connects user k with user l, then their associated inferred trust is given by:

T^{V_1→...→V_K}_{k→l} = (((T_{k→V_1} ⊕ T_{V_1→V_2}) ⊕ ...) ⊕ T_{V_{K−1}→V_K}) ⊕ T_{V_K→l}    (20)

The symbol ⊕ denotes a special operation that can best be understood for the case of only one intermediate node Z:

T^Z_{k→l} = T_{k→Z} ⊕ T_{Z→l}
          = δ · ( ( |B_{k,Z}| / (|B_{k,Z}| + |B_{Z,l}|) ) · |T_{k→Z}| + ( |B_{Z,l}| / (|B_{k,Z}| + |B_{Z,l}|) ) · |T_{Z→l}| )

where B_{k,Z} = R_k ∩ R_Z, B_{Z,l} = R_Z ∩ R_l and

δ = { +1    if T_{k→Z} > 0 and T_{Z→l} > 0
    { −1    if T_{k→Z} · T_{Z→l} < 0

The inferred trust is not applicable if T_{k→Z} < 0 and T_{Z→l} < 0; in this case the length of the path between users k and l is taken to be infinite.

To build a recommendation for the active user, a collection of paths between the user and other trusted users is selected in a first step. Papagelis et al. [46] proposed different selection mechanisms, but one of the best approaches was Weighted Average Composition, which computes the trust between any two unconnected users k and l using the following equation:

T_{k→l} = ( 1 / Σ_{i=1}^{|P|} C^{P_i}_{k→l} ) · Σ_{i=1}^{|P|} C^{P_i}_{k→l} · T^{P_i}_{k→l}    (21)

where C^{P_i}_{k→l} expresses the confidence of the association k → l through the path P_i,

C^{V_1→...→V_K}_{k→l} = ((C_{k→V_1} · C_{V_1→V_2} · ...) · C_{V_{K−1}→V_K}) · C_{V_K→l}    (22)

and the confidence of each direct association k → l is assumed to be directly related to the number of co-rated items between the users:

C_{k→l} = |R_k ∩ R_l| / |R_k ∩ R_{u_max}|    (23)

where u_max represents the user who rated the most items in common with user k. Predictions for unseen items can be computed using equation 2, in which each weight w_{k,l} is given by equation 21.


Improved Neighborhood-Based

The success of neighborhood-based algorithms depends on the choice of the interpolation weights (equations 2, 14) used to compute unknown ratings from neighboring known ones. But the aforementioned user- and item-oriented approaches lack a rigorous way to derive these weights. Different algorithms use different heuristics to compute them, and there is no fundamental justification for choosing one or another. Bell and Koren [13] proposed a method to learn interpolation weights directly from the ratings. Their approach improved prediction accuracy by means of two mechanisms: (1) preprocessing the user-item rating matrix to remove global effects and make the different ratings more comparable, and (2) deriving the interpolation weights from the rating matrix.

The preprocessing step consists of a set of rating transformations that prepare the input data: removing systematic user or item effects (to account, for example, for items that were mostly rated by users who tend to rate high), adjusting ratings using item variables (such as the number of ratings given to an item, the average rating of an item, etc.) or adjusting ratings by analyzing characteristics (such as the date of rating) that may explain some of the variation in ratings². The interpolation weights are computed by modeling the relations between item j and its neighbors through the following optimization problem:

min_w Σ_{k, j∉R_k} ( r_{k,j} − Σ_{i∈R_k} w_{i,j} · r_{k,i} )²    (24)

and are used with equation 14 in order to predict r_{k,j}. The authors reported that this approach can be very successful when combined with model-based approaches that use matrix factorization techniques (see section 2.2). An alternative user-based formulation can be derived analogously by simply switching the roles of users and items.
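A bare-bones sketch of the optimization in eq. (24) for a single target item, solved here as an ordinary least-squares problem over users; it ignores the neighborhood restriction and the shrinkage used by Bell and Koren [13] and simply treats missing ratings as zeros, so it is only meant to show the shape of the problem.

import numpy as np

def interpolation_weights(R, j):
    """Learn weights w_{i,j} such that each user's ratings of the other items
    best reproduce that user's rating of item j in the least-squares sense
    (eq. 24). R is a dense n x m rating matrix with 0 for missing ratings."""
    others = [i for i in range(R.shape[1]) if i != j]
    X = R[:, others]                    # predictors: ratings of the other items
    y = R[:, j]                         # target: ratings of item j
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    weights = np.zeros(R.shape[1])
    weights[others] = w
    return weights                      # plug into eq. (14) to predict r_{k,j}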

2.2 Model-Based Collaborative Filtering

Model-based collaborative filtering first learns a descriptive model of user preferences and then uses it for predicting ratings. Many of these methods are inspired by machine learning algorithms: neural-network classifiers [14], induction rule learning [61], Bayesian networks [15], dependency networks [26], latent class models [31, 38], principal component analysis [24] and association rule mining [39]. Table 2 synthesizes some of the model-based algorithms that are described in the next subsections.

Cluster Models and Bayesian Classifiers

From a probabilistic perspective, the collaborative filtering task can be viewed as calculating the expected value of the active user's rating on an item given what we know about the user:

² Further information about the mathematical formulation of these preprocessing steps can be found in [13].


Table 2. Different model-based algorithms based on the different components of the recommendation process: data preprocessing, model building and prediction computation

Bayesian networks. Data processing: instance-based representation; Model building: probabilistic clustering (EM fitting), Bayesian classifier, dependency networks; Prediction computation: probabilistic aggregation

Latent class models. Data processing: binary preference representation; Model building: latent class models (EM fitting); Prediction computation: probabilistic selection

SVD. Data processing: low dimensional representation (SVD); Prediction computation: neighborhood formation in the reduced space, user-based

Simple Bayesian classifier. Data processing: instance-based representation; Model building: naive Bayes classifier; Prediction computation: probabilistic classification

Association rule mining. Data processing: binary rating representation, instance-based representation; Model building: association rule mining; Prediction computation: selection based on support and confidence of rules

Eigentaste. Data processing: PCA rating transformation; Model building: low dimensionality reduction, recursive rectangular clustering; Prediction computation: most frequent item

PMCF. Model building: generative probabilistic model; Prediction computation: probabilistic aggregation

ν_{k,j} = Σ_x p(r_{k,j} = x | r_k) · x    (25)

where the probability expression is the probability that the active user will give a particular rating to item j given the previously observed ratings r_k = {r_{k,i}, i ∈ R_k}. The symbol x denotes rating values in the interval [r_min, r_max].

Breese et al. [15] presented two different probabilistic models for computing p(r_{k,j} = x | r_{k,i}, i ∈ R_k). In the first algorithm, users are clustered using the conditional Bayesian probability, based on the idea that there are certain groups that capture common sets of user preferences. The probability of observing a user belonging to a particular cluster c_s ∈ C = {C_1, C_2, ..., C_K}, given a certain set of item ratings r_k, is estimated from the probability distribution of ratings in each cluster:

p(c_s, r_k) = p(c_s) Π_i p(r_{k,i} | c_s)    (26)

The clustering solution (parameters p(c_s) and p(r_{k,i} | c_s)) is computed from data using the expectation maximization (EM) algorithm.

The second algorithm is based on Bayesian network models where each item in the database is modeled as a node having states corresponding to the rating of that item. The learning problem consists of building a network on these nodes such that each node has a set of parent nodes that are the best predictors for the child's rating. They presented a detailed comparison of these two model-based approaches with the user-based approach and showed that the Bayesian network model outperformed the clustering model as well as the user-based scheme.

A related algorithm was proposed by Heckerman et al. [26] based on dependency networks instead of Bayesian networks. Although the accuracy of dependency networks is lower than that of Bayesian networks, they learn faster and have smaller memory requirements.

Latent Class Models

Latent class models can be used in collaborative filtering to produce recommendations. This approach is similar to probabilistic models, but the resulting recommendations are generated based on a probabilistic classification scheme. Using latent class models, a latent class z ∈ Z = {z_1, z_2, ..., z_K} is associated with each observation (x, y). The key assumption made is that x and y are independent given z. In the context of collaborative filtering, observations are transactions and the probability of observing a transaction between user k and item j can be modeled via latent class models as follows³:

p(k, j) = Σ_{z∈Z} p(z) p(k|z) p(j|z)    (27)

where p(k|z) denotes the probability of observing user k given the latent variable z and p(j|z) represents the probability of observing item j given z. The standard procedure to compute the probabilities p(k|z) and p(j|z) is to use an EM algorithm (see [31] for further details). Recommendation is performed by simply selecting the most probable latent classes given the active user k and, for each latent class, the most probable observations p(j|z) such that j ∉ R_k.

Hofmann et al. [31] extended this formulation by introducing an additional random variable that captures binary preferences (like and dislike).
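Assuming the model parameters p(z), p(k|z) and p(j|z) have already been fitted with EM as in [31], the recommendation step described above can be sketched as follows; all array names are illustrative.

import numpy as np

def recommend_latent(p_z, p_user_given_z, p_item_given_z, k, rated, top_n=10):
    """p_z: (K,), p_user_given_z: (n, K), p_item_given_z: (K, m).
    Select the unseen items with the highest probability under the latent
    classes most associated with user k (eq. 27 factorization)."""
    post = p_z * p_user_given_z[k]          # p(z|k) up to a constant
    post = post / post.sum()
    scores = (post @ p_item_given_z).copy() # expected item probabilities
    scores[list(rated)] = -np.inf           # exclude items already in R_k
    return np.argsort(scores)[::-1][:top_n]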

Singular Value Decomposition

Singular Value Decomposition (SVD) is a matrix-factorization technique that factors an m × n matrix R into three matrices:

R = V · E · W^T    (28)

where V and W^T are two orthogonal matrices of size m × r and r × n, respectively, with r the rank of the matrix R. E is a diagonal matrix containing all the singular values of matrix R. The matrices obtained by performing SVD are particularly useful for computing recommendations and have been used in different research works to address the problem of sparsity in the user-item matrix [54, 22].

If the r × r matrix E is reduced to keep only the q largest diagonal values, E_q, and the matrices V and W^T are reduced accordingly, the reconstructed matrix R_q is the closest rank-q matrix to R. If R is the original user-item rating matrix, SVD will produce a low dimensional representation of the user-item matrix that can be used as a basis to compute recommendations.

³ For a detailed description of latent class models refer to Hofmann et al. [31, 30].


Sarwar et al. [52] used SVD to build recommendations following a user-based-like approach. They successfully applied SVD to obtain an m × q representation of the users, V_q · E_q^{1/2}, and computed the user similarity from that low dimensional representation. Compared to correlation-based systems, results showed good quality predictions and the potential to provide better online performance. Drineas et al. [22] further studied SVD and showed, from a mathematical point of view, that this approach can produce competitive recommendations.
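A minimal sketch of the reduced representation used in [52]: truncate the SVD of eq. (28) to rank q and keep V_q · E_q^{1/2} as the user representation. Users are assumed to be in the rows and missing ratings to be filled beforehand (e.g. with item averages); both choices are assumptions of this sketch.

import numpy as np

def svd_user_representation(R, q=10):
    """Rank-q truncation of R = V E W^T; returns the q-dimensional user
    vectors V_q E_q^{1/2} from which user similarities can be computed."""
    V, s, Wt = np.linalg.svd(R, full_matrices=False)
    return V[:, :q] * np.sqrt(s[:q])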

Simple Bayesian classifier

Most collaborative filtering systems adopt numerical ratings and try to predict numerical ratings. However, there are other systems that produce recommendations by accurately classifying items and selecting those that are predicted to be relevant to the user. The simple Bayesian classifier [44] is one of the most successful algorithms in many classification domains (text categorization, content-based filtering, etc.) and has been shown to be competitive for collaborative filtering.

To use this algorithm for CF, a special representation that merges both the interaction matrix and the rating matrix R is required. Suppose that D is a 2n × m matrix in which each user rating vector r_l is divided into two binary vectors, d^{lik}_l and d^{dis}_l, which hold boolean values indicating whether the user liked the item and did not like the item, respectively.

Making the naive assumption that features are independent given the class label, the probability of observing that an item belongs to c_s ∈ {lik, dis}, given its 2(n − 1) feature values, is:

p(c_s, d^T_i) = p(c_s) Π_{l=1}^{2(n−1)} p(d_{l,i} | c_s)    (29)

where both the probability of observing the active user labeling item i with c_s, p(c_s), and the probability of having feature d_{l,i} if the active user labeled the item with class c_s, p(d_{l,i} | c_s), are estimated from the database:

p(c_s) = |{d_{k,i} = c_s}| / m ;    p_{k,i}(d_{l,i} | c_s) = |{d_{l,i} = 1; d_{k,i} = c_s}| / |{d_{k,i} = c_s}|    (30)

To determine the most likely class of a new item for the active user, the probability of each class is computed and the item is assigned to the class with the highest probability. Items that are classified into the like class are aggregated in a recommendation list.
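A sketch of the classification step of equations (29)-(30) for one item and the active user; D_like and D_dis are assumed to be the two halves of the binary matrix D described above, and Laplace smoothing (alpha) is added here only to keep the probability estimates non-zero, which is an addition with respect to the formulas in the text.

import numpy as np

def classify_item(D_like, D_dis, active, i, alpha=1.0):
    """Return 'lik' or 'dis' for item i according to the naive Bayes rule.
    D_like and D_dis are n x m binary matrices (1 = user liked / disliked)."""
    others = [u for u in range(D_like.shape[0]) if u != active]
    scores = {}
    for label, own in (("lik", D_like[active]), ("dis", D_dis[active])):
        in_class = own == 1                        # items the active user put in c_s
        n_class = in_class.sum()
        log_p = np.log((n_class + alpha) / (own.size + 2 * alpha))      # p(c_s)
        for u in others:
            for D in (D_like, D_dis):              # the 2(n-1) binary features
                p_feat = (np.logical_and(D[u] == 1, in_class).sum() + alpha) \
                         / (n_class + 2 * alpha)   # p(d_{l,i} = 1 | c_s)
                log_p += np.log(p_feat if D[u, i] == 1 else 1.0 - p_feat)
        scores[label] = log_p
    return max(scores, key=scores.get)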

Association Rule Mining

Within the context of using association rules to derive top-N recommendations,Lin et al. [39] developed a method for collaborative recommendation based onan association rule mining. Given a set of user transactions, an association ruleis a rule of the form X → Y where both X and Y are sets of items. The standardproblem of mining association rules is to find all association rules that are above

Page 106: Web Personalization in Intelligent Environments

96 F.P. Lousame and E. Sanchez

a certain minimum support and confidence for the user4. The recommendationstrategy is based on mining two types of associations: user associations (whereboth X and Y are sets of users) and item associations (if X and Y are sets ofitems). To produce recommendations, user and item associations are combinedin the following way: if user association rule mining gives a minimum support,recommendations are based on user associations, otherwise item associations areused to compute recommendations.

Mobasher et al. [45] also presented an algorithm for recommending additionalwebpages to be visited by a user based on association rules. In this approach,the historical information about users and their web-access patterns were minedusing a frequent itemset discovery algorithm and were used to generate a set ofhigh confidence association rules. The recommendations were computed as theunion of the consequent of the rules that were supported by the pages visited bythe user.

In the same context, Demiriz [20] studied the problem of how to weight the different rules that are supported by the active user in order to generate recommendations. Each item the user did not select was scored by finding the corresponding rules and aggregating the scores between the rules and the active user. These scores are computed by multiplying the similarity measure between the active user and the rules by the confidence of the rule; the similarity between the active user and the rules was measured with a Euclidean distance. He compared this approach with both the user-based scheme and the dependency-network-based algorithm [26]. Experiments showed that the proposed association-rule-based scheme is superior to dependency networks but inferior to the user-based schemes.
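The following toy sketch illustrates the rule-weighting idea in this spirit; the hand-built rules, the similarity function and the max-aggregation are assumptions (a real system would mine the rules with an algorithm such as Apriori and might aggregate scores differently).

import math

# Hand-built item association rules: (antecedent set, consequent item, confidence).
rules = [
    ({"bread", "butter"}, "milk", 0.9),
    ({"bread"}, "jam", 0.7),
    ({"beer"}, "chips", 0.8),
]

active_basket = {"bread", "butter"}

def rule_similarity(basket, antecedent):
    """Similarity between the user's basket and a rule antecedent.

    Demiriz used a Euclidean distance over binary vectors; over sets this
    reduces to counting mismatches, turned here into a similarity in (0, 1].
    """
    mismatches = len(basket.symmetric_difference(antecedent))
    return 1.0 / (1.0 + math.sqrt(mismatches))

scores = {}
for antecedent, consequent, confidence in rules:
    if consequent in active_basket:
        continue  # only score items the user has not selected yet
    s = rule_similarity(active_basket, antecedent) * confidence
    # Aggregating by max is an assumption; sums or means are also possible.
    scores[consequent] = max(scores.get(consequent, 0.0), s)

print(sorted(scores.items(), key=lambda kv: -kv[1]))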

Eigentaste

Goldberg et al. [24] proposed a collaborative filtering algorithm that applies a dimensionality reduction technique (Principal Component Analysis, PCA) for clustering users and fast computation of recommendations. PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. It can be applied to collaborative filtering to find a transformed representation of the user-item matrix:

R′ = R ·W (31)

where W is an orthogonal matrix. By keeping the q lower-order principal components and ignoring the higher-order ones, Goldberg et al. [24] used the resulting 'principal' transformation matrix W_q to cluster users in a low-dimensional space and computed recommendations by aggregating ratings from users in the same cluster.

The resulting algorithm, Eigentaste, is essentially a user-based approach in which users are clustered based on their representations in the transformed space. It is as accurate as the classical user-based approach, but the computation of recommendations is much faster and more scalable.
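A rough sketch of the PCA step is shown below: users are projected onto the first q principal components and then clustered in that reduced space. The random toy data and the plain k-means step are illustrative only; Eigentaste itself uses a dense 'gauge' item set and a recursive rectangular clustering.

import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(50, 20))          # toy dense rating matrix (50 users, 20 items)

# PCA via eigen-decomposition of the covariance of the centered ratings.
Rc = R - R.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Rc, rowvar=False))
order = np.argsort(eigvals)[::-1]
q = 2
Wq = eigvecs[:, order[:q]]             # 'principal' transformation matrix

users_2d = Rc @ Wq                     # users projected into the PCA plane

# Very small k-means in the projected space, purely for illustration.
k, centers = 4, users_2d[:4].copy()
for _ in range(20):
    labels = np.argmin(((users_2d[:, None] - centers) ** 2).sum(-1), axis=1)
    for c in range(k):
        if (labels == c).any():
            centers[c] = users_2d[labels == c].mean(axis=0)

# A prediction for a user is then an aggregate of the ratings in his/her cluster.
print(labels[:10])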

Probabilistic Memory-Based

This algorithm was proposed by Yu et al. [60] as an efficient approach that generates predictions from a carefully selected small subset of the overall database of user ratings (the profile space). The algorithm is similar to a memory-based approach but uses a probabilistic formulation to build a compact model from which recommendations are generated. The probabilistic model assumes that user k's real ratings can be described as a vector x_k = (x_{k,i}; i = 1, 2, ..., m) that encodes the underlying, 'true' preferences of the user (i.e., his/her personality). Assuming a generative probabilistic model, the ratings of an active user a are generated according to a probability density given by:

p(a \mid P) = \sum_{l=1}^{|P|} p(a \mid x_l)\, p(x_l \mid P) = \frac{1}{|P|} \sum_{l=1}^{|P|} p(a \mid x_l)    (32)

where P is the profile space, which consists of a subset of rows of the original rating matrix R. Assuming that ratings on individual items are independent given a profile x_l, the probability of observing the active user's ratings a if the user has the prototype profile x_l is

p(a \mid x_l) = \prod_{j=1}^{m} p(r_{k,j} = a_j \mid r^{true}_{k,j} = x_{l,j})    (33)

Both Yu et al. [60] and Pennock et al. [48] assume that users report ratings for the items they have selected with Gaussian noise. This means that user k's reported rating for item i is drawn from an independent normal distribution with mean r^{true}_{k,i}:

p(r_{k,i} = x \mid r^{true}_{k,i} = y) \propto e^{-(x-y)^2 / 2\sigma^2}    (34)

where σ is a free parameter. The posterior density of the active user's ratings on not-yet-rated items, a_n, given the ratings the user has already specified, a_r, can be computed using equation 32 and gives:

p(a_n \mid a_r, P) = \frac{p(a_n, a_r \mid P)}{p(a_r \mid P)} = \frac{\sum_{l=1}^{|P|} p(a_n \mid x_l)\, p(a_r \mid x_l)}{\sum_{l=1}^{|P|} p(a_r \mid x_l)}    (35)

With this probabilistic model, predictions for the active user are computed by combining the predictions based on the prototype users x_l, weighted by their degree of like-mindedness to the active user.
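A compact sketch of this prediction scheme under the Gaussian noise model of equation 34 might look as follows; the toy profile space, σ and the use of the posterior mean as the final prediction are assumptions made for the example.

import numpy as np

profiles = np.array([[5, 4, 1, 2],     # profile space P: prototype rating vectors
                     [1, 2, 5, 4],
                     [4, 4, 2, 1]], dtype=float)

sigma = 1.0
active = {0: 5, 1: 4}                  # item -> rating already given by the active user
target_item = 3

# p(a_r | x_l): Gaussian noise model on each observed rating (eq. 34).
def likelihood(profile):
    return np.prod([np.exp(-(r - profile[i]) ** 2 / (2 * sigma ** 2))
                    for i, r in active.items()])

weights = np.array([likelihood(x) for x in profiles])
weights /= weights.sum()               # posterior over prototype profiles (eq. 35)

# Expected rating of the target item under that posterior.
prediction = float(weights @ profiles[:, target_item])
print(round(prediction, 2))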

2.3 Limitations of Collaborative Filtering

Pure collaborative filtering does not suffer from some of the problems that content-based recommenders do. For instance, content recommenders require explicit textual information, which may not be available in some domains (multimedia recommendation, etc.). Since collaborative filtering systems use other users' ratings, they can deal with any kind of item, whether or not content information is available. Besides, content-based systems generally recommend items that score highly against the user's profile, so only items that are very similar to those already rated highly will be recommended. In contrast, CF recommenders are able to recommend items that are very dissimilar to those already seen in the past. Despite their popularity and advantages over content-based filtering, pure CF has several shortcomings:

Sparsity. This problem has been identified as one of the main technical limitations of CF. Commercial recommender systems typically handle very large collections of items [1, 3, 4], in which even very active users may have purchased less than 1% of the items (1% of 2 million movies is still 20,000 movies!). This implies that memory-based recommender systems may be unable to make any recommendations and their accuracy may be poor. Even very active users rate only a small fraction of the available items and, conversely, even very popular items may end up being rated by only a few users. As a consequence, the similarity between two users may be undefined, making CF useless. Even when the evaluation of similarity is feasible, it may not be reliable if there is not enough information.

Cold start problem. CF requires users to rate a sufficient number of items before they receive accurate and reliable recommendations. Therefore, unless the user rates a substantial number of items, the recommender system will not provide accurate results. This problem applies to new users but also to non-regular users (with rare tastes), for whom similarities cannot be computed with sufficient reliability.

New item problem. Collaborative filtering algorithms rely only on users' preferences to make recommendations. Therefore, in a situation in which new items are added regularly, they cannot be recommended until they have been rated by a certain number of users.

Scalability. The computational complexity of collaborative, memory-based methods grows linearly with the number of users, which in typical commercial applications can reach several million. In this situation the recommender may suffer serious scalability problems, and algorithms may have performance problems with individual users for whom the system holds large amounts of information.

Different memory-based algorithms have been proposed to address the problems of scalability and sparsity. For instance, Sarwar et al. [53] proposed the item-based algorithm to address the scalability problems of the user-based approaches, and Aggarwal et al. [9] and Papagelis et al. [46] proposed different graph-based approaches to exploit transitive relations among users. To address the new user problem, Rashid et al. [49] and Yu et al. [60] proposed different techniques based on item popularity, item entropy and user personalization to determine the best items for a new user to rate. Dimensionality reduction techniques such as Singular Value Decomposition can reduce the dimensionality of the original sparse matrix [14, 52] and provide faster recommendations. Therefore, model-based approaches can partially address some limitations of memory-based collaborative filtering, such as sparsity and scalability, but others, such as the new item problem, remain unsolved.

3 Hybrid Filtering

Different experiments have shown that collaborative filtering systems can be enhanced by incorporating content-based characteristics. Hybrid recommender systems combine different types of recommender systems, usually collaborative and content-based filtering methods, and are essentially intended to avoid the limitations of both technologies.

There are different ways in which content-based and collaborative filtering methods can be combined. For instance, collaborative filtering could be enhanced with content-based characteristics, results from separate collaborative and content-based recommenders could be merged into a unique recommendation, or recommendations may be generated from a unifying recommendation model. There are also recommender systems that are basically content recommenders whose recommendations are enhanced via collaborative features, but they are outside the scope of this text. Table 3 summarizes some of the hybrid approaches that are explained here.

Table 3. Summary of different hybrid-based algorithms based on the different components of the recommendation process

Enhance collaborative filtering with content-based characteristics

  Content-boosted CF
    Input data:    user-item ratings, item features
    CBF component: Bayesian text classifier (builds a pseudo-rating matrix using content features)
    CF component:  memory-based CF (builds predictions from the pseudo-rating matrix using a user-based approach)

  Feature-based CF
    Input data:    user-item matrix, item-feature matrix
    CBF component: content matching (neighborhood formation based on item features)
    CF component:  memory-based CF (filters recommended items using an item-based approach)

Combine separate recommenders

  Weighted CBF-CF
    Recommender components: CBF: content matching (matches user profiles to item contents); CF: memory-based (builds predictions from the user-rating matrix using a user-based approach)
    Prediction computation: linear combination (combination weights adjusted from data)

  Similarity fusion
    Recommender components: CF: memory-based (probabilistic user-based); CF: memory-based (probabilistic item-based)
    Prediction computation: linear combination (combination weights adjusted from data)

Develop a unifying recommendation model

  Spread activation
    Input data:             user-item matrix, item contents, demographic data
    Background model:       2-layer graph (enhanced graph with content features)
    Prediction computation: direct retrieval, association mining, spread-activation → Hopfield net algorithm


3.1 Enhance Collaborative Filtering with Content-Based Characteristics

Content-based recommender systems evolved from information retrieval [10] and information filtering [12] systems and are designed mostly to recommend text-based items. In content-based filtering, items are recommended to a certain user based on similarities between new items and the corresponding user profile. The content of these items is usually described by keywords. User profiles contain information about the users' tastes, preferences and needs, which can be extracted from different types of information: the collection of items the user has rated highly in the past, keywords that represent topics of interest, text queries, transactional information from web logs, etc. Despite the significant and early advancements made in information retrieval and information filtering, the importance of several text-based applications and improvements such as the use of user profiles, content-based recommenders suffer from several limitations; limited understanding of users and items and overspecialization are some examples.

But content-based filtering may be used in conjunction with collaborative filtering to enhance recommendations. Several hybrid recommender systems use essentially collaborative filtering techniques while maintaining content-based user profiles that store useful information and from which user similarities are computed. This helps to overcome problems such as sparsity and provides a mechanism to recommend new items to users not only when they are rated highly by similar users, but also when they score highly against the user profile, so that both the new item and cold start problems can be tackled.

Content-Boosted CF

Melville et al. [43] proposed a system to overcome two of the main limitations of pure collaborative filtering, namely sparsity and the new user problem. Their method, content-boosted collaborative filtering (CBCF), uses a pure content-based predictor to convert a sparse user matrix into a full ratings matrix and then uses pure collaborative filtering to provide recommendations.

The content-based predictor was implemented using a Bayesian text classifier that learned a user model from a set of rated items. The user model was used to predict ratings for unrated items and create a pseudo-ratings matrix as follows,

r'_{k,j} = \begin{cases} r_{k,j} & \text{if } r_{k,j} \neq \emptyset \\ c_{k,j} & \text{if } r_{k,j} = \emptyset \end{cases}    (36)

where c_{k,j} is the rating of item j for user k predicted by the pure content recommender. The collaborative filtering component was implemented following the user-oriented approach (equation 2), with a slightly modified version of the Pearson correlation (equation 3) to compute user similarity from the dense representation R' (see footnote 5). Further details can be found in [43].

5 They multiplied the correlation by a significance weighting factor (see [27]) that gives less confidence to correlations computed from users with few co-rated items.
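A minimal sketch of the pseudo-rating construction of equation 36 is given below; the hard-coded content-based predictions stand in for the Bayesian text classifier of [43], and the plain Pearson-weighted aggregation omits the significance weighting mentioned in the footnote.

import numpy as np

R = np.array([[5, np.nan, 3, np.nan],
              [np.nan, 2, np.nan, 4],
              [4, 1, np.nan, np.nan]])

# Stand-in content-based predictions c_{k,j} (in CBCF these come from a
# Bayesian text classifier trained on each user's rated items).
C = np.array([[5, 2, 3, 4],
              [3, 2, 2, 4],
              [4, 1, 3, 3]], dtype=float)

# Pseudo-rating matrix of eq. (36): keep real ratings, fill gaps with C.
R_pseudo = np.where(np.isnan(R), C, R)

# User-based CF on the dense pseudo-ratings, with Pearson correlation weights.
sims = np.corrcoef(R_pseudo)
active = 0
others = [u for u in range(R_pseudo.shape[0]) if u != active]
mean_a = R_pseudo[active].mean()
num = sum(sims[active, u] * (R_pseudo[u] - R_pseudo[u].mean()) for u in others)
den = sum(abs(sims[active, u]) for u in others)
print(np.round(mean_a + num / den, 2))   # predictions for all items of the active user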


Soboroff et al. [57] described a similar hybrid filtering technique that combined collaborative data with content descriptions of items to generate recommendations. The approach used Latent Semantic Indexing (LSI) with SVD to create a simplified view of a user-profile matrix built from relevant item contents.

Feature-Based CF

Han and Karypis [25] presented several feature-based recommendation algorithms to enhance collaborative filtering with content-based filtering in contexts in which there is not enough historical data for measuring similarity between items, i.e., frequently changing items and product catalogs with tailored items.

In the first context, using content-based filtering, a set of similar items was computed by matching the set of items selected by the active user against the items in the catalog. Using an item-oriented approach to collaborative filtering, recommended items were selected and the collection of most representative features was extracted as the recommended features. From the real catalog of items, a top-N recommendation was generated by selecting products with these recommended features. An alternative method, using association rules, was proposed to generate recommendations in this context. A similar approach, based on feature recommendation, was presented for the context of product catalogs with custom items (see [25] for details).

3.2 Combine Separate Recommenders

Weighted CBF-CF

One of the first approaches that combined recommenders was proposed by Claypool et al. [17]. Rating predictions were obtained from separate content-based and collaborative recommenders and merged into one recommendation using a linear combination of ratings, keeping the basis of each approach separate. To perform the content-based filtering, each user is represented with a three-component profile that gathers information about user preferences for items, explicit keywords from search queries, and implicit keywords extracted from highly rated items. Content-based filtering is performed by matching the active user's profile to the textual representation of new items. Collaborative filtering is performed following a user-based approach (see equation 2) with weights computed using the Pearson correlation (equation 3). The weights of the linear combination are dynamically adjusted to minimize past rating prediction errors. The approach retains the strengths of content-based filtering and mitigates the effects of both the sparsity and the new item problem. The combination of content-based and collaborative filtering results can be tuned both to avoid the cold start problem, by giving more weight to the content-based component for new users, and to weight the collaborative component more heavily as the number of users and ratings for each item increases.
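A tiny sketch of such a weighted combination is shown below; the error-driven weight update and the learning rate are assumptions, not the exact adjustment rule of [17].

# Per-user combination weights for the two predictors (start equal).
w_cbf, w_cf, lr = 0.5, 0.5, 0.05

def combined_prediction(p_cbf, p_cf):
    """Linear combination of the content-based and collaborative predictions."""
    return (w_cbf * p_cbf + w_cf * p_cf) / (w_cbf + w_cf)

def update_weights(p_cbf, p_cf, true_rating):
    """Shift weight toward the component with the smaller absolute error."""
    global w_cbf, w_cf
    err_cbf, err_cf = abs(p_cbf - true_rating), abs(p_cf - true_rating)
    if err_cbf < err_cf:
        w_cbf, w_cf = min(1.0, w_cbf + lr), max(0.0, w_cf - lr)
    elif err_cf < err_cbf:
        w_cf, w_cbf = min(1.0, w_cf + lr), max(0.0, w_cbf - lr)

# Example: content predictor says 4.5, collaborative says 3.0, user later rated 3.5.
print(combined_prediction(4.5, 3.0))
update_weights(4.5, 3.0, 3.5)
print(combined_prediction(4.5, 3.0))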

A similar approach was presented by Pazzani [47], whose hybrid recommender combined recommendation results from three different approaches: content-based, demographic-based and collaborative. Content-based filtering was performed by applying a content-based learning algorithm, called Winnow [42], that estimated the relative weight of each keyword in the content model of an item so that the aggregation of these weights was highly correlated with the rating assigned by the user. Similarly, demographic-based recommendations were computed by applying the Winnow algorithm to the demographic features that represent users. Finally, collaborative filtering was performed following a pure user-oriented approach (equation 2 combined with equation 3). The combination was shown to have the potential to improve the precision of recommendations.

Similarity Fusion

Most collaborative recommenders [15, 53] produce recommendations based only on partial information from the data in the user-item matrix (using either correlation between user data or correlation between item data). Wang et al. [58] recently proposed a probabilistic approach to exploit more of the data available in the user-item matrix by combining all ratings with predictive value into a single recommendation. The confidence of each individual prediction can be estimated by considering its similarity towards both the test user and the test item. The overall prediction is made by averaging the individual ratings weighted by their confidence. The confidence of each rating is computed using a probabilistic approach (equation 25) that combines three different probabilistic models estimating predictions based on user similarity, item similarity and rating similarity. Two linear combination weights, λ and δ, control the importance of the different prediction sources and were determined experimentally. This similarity fusion scheme was shown to improve prediction accuracy in collaborative filtering and, at the same time, to be more robust against data sparsity. For further details about implementations and results, see [58].
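The confidence-weighted averaging step can be sketched as follows; the example predictions and confidences are invented, and the sketch mirrors only the weighting idea, not the underlying probabilistic models of [58].

def fused_prediction(predictions, confidences):
    """Average the individual predictions, weighted by their estimated confidence."""
    total = sum(confidences)
    return sum(p * c for p, c in zip(predictions, confidences)) / total

# Three prediction sources: user-similarity, item-similarity and rating-similarity based.
print(fused_prediction([4.2, 3.8, 4.0], confidences=[0.5, 0.3, 0.2]))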

3.3 Develop a Unifying Recommendation Model

Spread-Activation

This graph-based algorithm was proposed to provide a more comprehensive representation of the data gathered in the user-item matrix and to support flexible recommendations by using different strategies [34, 33, 32]. The approach is hybrid in the sense that both collaborative and content features are merged to generate recommendations, but also in the way that different collaborative filtering strategies can be combined to find relevant items.

Recommendations are generated from a background two-layer graph-theoretic representation of the user-item matrix. Nodes represent users and items. Input information about users (demographic data, answers to questionnaires, query inputs, web usage patterns, etc.), items (textual descriptions, etc.) and transactions (purchase history, explicit ratings, browsing behavior, etc.) is transformed into links between nodes that capture user similarity, item similarity or associations between users and items, respectively. This results in a very flexible recommendation engine that may combine different recommendation methods, different types of information to model the links, and different measures to compute the strength of these relations:

• Direct retrieval. Generates recommendations by retrieving items similar to the active user's previous selections and items selected by users similar to the active user. Depending on the algorithm used to form neighbors from the graph, the engine can generate content-based, collaborative or hybrid recommendations.

• Association mining. Generates recommendations by first building a model of association rules computed from the transaction history. Two different types of association rules are generated: content-based rules, built from content similarity among items; and transaction-based rules, built from transaction history data. Depending on the type of association rules considered, the engine can produce content-based, collaborative or hybrid recommendations.

• High-degree association. Recommendations are generated from a graph that combines information from the previous approaches and uses the Hopfield net algorithm [16] to produce recommendations. By setting the activation level that corresponds to the active user to \mu_{u_k} = 1, the algorithm repeatedly performs the following activation procedure

\mu_j(t+1) \propto \sum_{i=0}^{n-1} t_{ij}\, \mu_i(t)    (37)

until the activation levels of all nodes converge, where t_{ij} represents the weight of the link between nodes i and j. Depending on the nature of the links that are enabled, the algorithm can produce content-based, collaborative or hybrid recommendations (a minimal sketch of this activation loop is given right after this list).
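A minimal sketch of the activation loop of equation 37 on a toy graph is given below; the link weights, the max-normalisation and the convergence test are assumptions (the full Hopfield net algorithm also applies a transfer function and clamps source nodes).

import numpy as np

# Symmetric link weights of a tiny two-layer graph (nodes 0-1: users, 2-4: items).
T = np.array([[0.0, 0.3, 0.8, 0.0, 0.2],
              [0.3, 0.0, 0.0, 0.7, 0.4],
              [0.8, 0.0, 0.0, 0.5, 0.0],
              [0.0, 0.7, 0.5, 0.0, 0.6],
              [0.2, 0.4, 0.0, 0.6, 0.0]])

mu = np.zeros(T.shape[0])
mu[0] = 1.0                      # activate the node of the active user

for _ in range(100):
    new_mu = T @ mu              # eq. (37): mu_j(t+1) proportional to sum_i t_ij * mu_i(t)
    new_mu /= new_mu.max()       # normalisation keeps activations bounded
    if np.allclose(new_mu, mu, atol=1e-6):
        break
    mu = new_mu

# Item nodes with the highest final activation are recommended.
print(np.argsort(-mu[2:]) + 2)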

4 Evaluation of Recommender Systems

4.1 Datasets

To evaluate the performance of recommender systems, a number of different datasets have been reviewed:

• EachMovie was one of the most widely used datasets in recommender systems, but it is no longer available for download. It contained 2,811,983 ratings (discrete values from 0 to 5) entered by 72,916 users for 1,628 different movies.

• MovieLens has over 10 million ratings and 100,000 tags for 10,681 movies by 71,567 users. Ratings are on a scale from 1 to 5. It contains additional data about movie titles and genres. Tags are user-generated metadata about the movies.

• Jester contains about 4.1 million continuous ratings (ranging from -10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003.


• Book-Crossing was collected between August and September 2007 from the Book-Crossing community [2]. It contains 278,858 users providing 1,149,780 ratings about 271,379 books. User demographic data and content information such as title, author and year of publication are also provided. Ratings may be explicit (expressed on a scale from 1 to 10) or implicit.

• Netflix is a movie rating dataset collected between October 1998 and December 2005 that contains over 100 million ratings from 480,000 randomly chosen Netflix [6] users over 17,000 movie titles. Ratings are on a scale from 1 to 5. It also contains the title and year of release of each movie.

Some researchers [9, 21] have also evaluated recommender systems using synthetic datasets in order to characterize the proposed recommendation algorithms in a controlled setting.

4.2 Accuracy Evaluation Metrics

Research methods in recommender systems include several types of measures for evaluating the quality of recommendations. Measures can be mainly categorized into two classes: predictive accuracy metrics and decision-support accuracy metrics.

• Predictive accuracy metrics evaluate the accuracy of a system by comparing the numerical recommendation scores (predictions) against the real user ratings for each user-item interaction in the test dataset. Mean Absolute Error (MAE) is one of the most frequently used.

• Decision-support accuracy metrics evaluate how effective a recommendation engine is at helping a user select high-quality items from the set of all items. These metrics consider the prediction process as a binary operation (items are predicted as either relevant or not). The most commonly used decision-support accuracy metrics are Precision and Recall.

Mean Absolute Error and Related Measures

MAE is a widely popular measure of the deviation of recommendations from their true user-specified values and is computed by averaging the absolute errors |r_i − ν_i| corresponding to each rating-prediction pair,

MAE = \frac{1}{N} \sum_{i=1}^{N} |r_i - \nu_i|    (38)

The lower the MAE, the better the accuracy of the generated predictions. Some research papers compute the Normalized MAE, or NMAE, which is the regular MAE divided by the width of the rating scale. Similar measures are the Mean Squared Error (MSE), computed by averaging the squared errors, and the Root Mean Squared Error (RMSE), computed from the MSE by taking the square root.
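These error measures are straightforward to compute; in the sketch below the NMAE normalisation constant (the width of the rating scale) is an assumption that depends on the dataset.

import math

def mae(ratings, predictions):
    return sum(abs(r - p) for r, p in zip(ratings, predictions)) / len(ratings)

def nmae(ratings, predictions, rating_range):
    # rating_range is the width of the rating scale, e.g. 4 for a 1-to-5 scale.
    return mae(ratings, predictions) / rating_range

def rmse(ratings, predictions):
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(ratings, predictions))
                     / len(ratings))

r = [4, 3, 5, 2]       # true ratings
p = [3.5, 3, 4, 2.5]   # predicted ratings
print(mae(r, p), nmae(r, p, rating_range=4), rmse(r, p))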


Precision/Recall Measures

Precision and recall are the most popular metrics for evaluating Information Retrieval systems, and they have also been used in collaborative filtering by many authors. If L = L_r + L_{nr} is the list of items that are recommended to the active user and H = H_r + H_{nr} denotes the rest of the items in the dataset, the Precision and Recall measures are computed as

Precision = \frac{L_r}{L_r + L_{nr}}, \qquad Recall = \frac{L_r}{H_r + L_r}.    (39)

The subscripts 'r' and 'nr' stand for 'relevant' and 'not relevant', respectively.
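For a top-N recommendation list the two measures can be computed as in the sketch below; the toy lists are invented.

def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against the set of relevant items."""
    hits = len(set(recommended) & set(relevant))          # L_r: relevant items in the list
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall(recommended=["a", "b", "c"], relevant=["b", "c", "d", "e"]))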

4.3 Other Quality Metrics

The first recommender systems primarily focused on exploring different techniques to improve prediction accuracy. Other important aspects, like scalability, adaptation to incoming data, and comprehensibility, have received little attention. Recommender systems must provide not only accuracy, but also usefulness. These quality aspects can be quantified through different measures [28] such as coverage (the rate of items for which the system is capable of making recommendations), adaptation/learning rate (how the recommender improves as new data is gathered), novelty/serendipity (how good the recommender is at giving non-obvious results) or confidence (measured, for instance, as the percentage of recommendations that are accepted by users).

5 A Taxonomy for CF

Several works have proposed taxonomies to classify recommender systems according to different aspects. Huang et al. [33] presented a taxonomy of recommender systems based on three dimensions: the system input, the representation methods and the recommendation approach. Table 4 summarizes this taxonomy. Adomavicius et al. [7] categorized recommender systems using only two dimensions: the recommendation approach and the recommendation technique. Based on the recommendation approach, recommender systems were classified as content-based, collaborative or hybrid; based on the type of recommendation technique used for rating estimation, they were classified as heuristic-based or model-based. Table 5 shows this second classification.

But the classification schemes presented so far do not clearly differentiate systems by their real contributions and originality, but rather by their recommendation approach or technique (which in most cases is irrelevant for the user). Aspects such as the associations that are modeled among the entities and how they are built are essential to gain a deep understanding of how these systems work and what their real benefits and requirements are. In this section, the aforementioned classification schemes are extended by proposing a taxonomy that classifies algorithms according to four main aspects: (1) the entities involved and their representation, (2) the associations among the entities, (3) the techniques used to build the relations, and (4) the recommendation method.

Table 4. Recommender systems' taxonomy according to Huang et al. [33]. Recommender systems are classified in terms of the input data, its representation and the recommendation approach

System input
  User:        factual data
  Item:        factual data
  Transaction: transactional data (acquired through explicit or implicit feedback)

Data representation
  User:        user attributes, items associated, transactions, item attributes
  Item:        item attributes, users associated
  Transaction: transaction attributes, items

Recommendation approach
  Knowledge engineering
  Content-based: kNN, classification
  Collaborative: user-based, item-based, transaction-based; kNN, association rule mining, machine learning
  Hybrid:        CBF + CF (merge results from different approaches, CF augmented with content information, CBF augmented with CF, comprehensive model); CF + knowledge engineering

Table 5. Recommender systems' taxonomy presented by Adomavicius et al. [7]. Recommender systems are classified according to the recommendation approach and the recommendation technique

  Content-based
    Heuristic-based: TF-IDF, clustering
    Model-based:     Bayesian classifiers, clustering, decision trees, artificial neural networks

  Collaborative
    Heuristic-based: kNN, clustering, graph theory
    Model-based:     Bayesian networks, clustering, artificial neural networks, linear regression, probabilistic models

  Hybrid
    Heuristic-based: CBF+CF: linear combination of predicted ratings, various voting schemes, incorporating CBF as part of the heuristic for CF
    Model-based:     CBF+CF: incorporating CBF as part of the model for the other, building one unifying model


5.1 Entities and Representation

The recommender systems studied so far generate recommendations by using information modeled in two different entities, user and item (see footnote 6), and in their relations. The entity user contains characteristics that differentiate the users of the system. The entity item models information that somehow characterizes and identifies each single item.

In a recommendation problem, entities may be represented with different types of information, depending on the requirements of the recommendation technique. Users are usually represented with a unique id, but some recommender systems may use additional factual information such as demographic information (name, gender, date of birth, address, etc.), textual preferences about the features of the items, or keywords that describe general user interests. Depending on the recommendation approach, items may be represented only by a unique id (which is the most common approach in CF) or by content information, usually in the form of textual attributes (for content-based or hybrid recommenders) such as brand, price, title or description.

5.2 Associations among Entities

The term association or relation describes a certain degree of dependence between entities. The majority of the approaches to the recommendation problem assume a data representation for each entity and focus on a single relation between the entities, commonly the one derived from rating activity, but other relations may be examined to build richer models. In the context of recommender systems, relations may record the user's explicit expression of interest in an item, such as a rating or a comment, or the implicit interaction between users and items, including, for instance, examination (selection, purchase, etc.), retention (annotation, printing, ...) and reference. These relations are explored in order to infer information about user tastes, item similarities, etc., and to generate recommendations. Table 6 summarizes some examples of associations.

5.3 Association Building Techniques

Recommender systems can be distinguished by the methods involved in building the associations among entities. Associations can be obtained via two mechanisms: (1) explicit, using the information provided by users directly, such as ratings or comments, which is usually stored in the user-item matrix; or (2) implicit, by computing new associations from existing ones or from sources such as purchase history or user behavior patterns. Implicit associations are derived using different techniques: knowledge engineering (case-based reasoning, ...), neighborhood formation techniques (kNN, clustering, ...), association rule mining, machine learning, etc.

6 An entity is defined as an object that has a distinct, separate existence. It models a fictitious or a real thing and may have stated relations to other entities.


Table 6. Examples of relations among entities. Symbols E, I and D denote an explicit, implicit and derived association, respectively.

  U-I  Associated items
    E: explicit expression of interest for items (e.g. user ratings and comments)
    I: implicit interaction between users and items (examination of items (selection, purchase), retention of items (save, annotate, print), reference to items)

  U-I  Item attributes
    D: expression of user preferences or satisfaction

  U-U  User similarity
    D: expression of user similarity, trust or confidence

  I-I  Item similarity
    D: expression of item similarity or dependence

5.4 Recommendation Method

Different recommender systems produce recommendations based on different techniques. As a result, recommendations may carry slightly different semantics. For instance, the user-based approach produces recommendations by 'recommending items selected (liked) by users similar to the active user', whereas an item-oriented approach will produce recommendations based on 'items similar to those the active user already selected (liked)'. Therefore, recommender systems could be further classified depending on the meaning of the recommendations they produce:

• User similarity. Recommendations are generated by exploiting user similarity patterns, which are computed using different metrics and sources of information (the way items are rated, the profiles of preferences and tastes of the users, etc.).

• Item similarity. In this case recommendations are computed by selecting a neighborhood of items with a certain degree of similarity. Again, the similarity between items can be computed using different metrics and sources of information (item ratings, selections made by the users, inherent item features).

• Item features. Recommendations are generated by matching textual item features and textual user preferences stored in user profiles.

• Item association. Recommendations are produced by exploring item association rules, which are frequently derived from user selection patterns.

• Item relevance. This method is not used much, since it does not produce personalized recommendations, but it may be useful to address the cold-start problem or to obtain a kind of 'smart' set of items from which the recommender can start building the collaborative user profile. Recommendations are built from relevance statistics of the items: the most popular items, the top-N rated items, etc. could be recommended to new users.

• Expert relevance. This method builds recommendations by analyzing user statistics to identify experts in recommending to other users. Following this method, a top-N list of items could be built from the items liked by users that usually are good mentors (experts) to other users.

• Hybrid method. In this case, recommendations are built by combining some of the previous methods.

Table 7. Proposed taxonomy, which classifies recommender systems according to the entities and their representation, the associations among these entities, the association building techniques and the recommendation method

Entities and representation
  User
    Factual data:        demographic information (name, gender, birth date, address, etc.)
    Textual preferences: features of the items or keywords that describe general user interests
  Item
    Content information: textual attributes (brand, price, title or description)

Associations among entities
  - see Table 6 -

Association building
  Behavior based
    Explicit: user-item matrix (interactions (binary), satisfaction (ratings))
    Implicit: behavior patterns (examination, retention, reference)
  Inferred
    Knowledge engineering
    Neighborhood formation (kNN, clustering)
    Association rule mining
    Probabilistic models

Recommendation method
  User similarity:  items selected (liked) by users similar to the active user
  Item similarity:  items that are similar to those selected (liked) by the active user
  Item features:    recommend items based on the similarity between the active user's profile and the textual content of the items
  Item association: items highly associated with items selected (liked) by the active user
  Item relevance:   most popular items, the top-N rated items, etc., to the active user
  Expert relevance: items from popular users, whose recommendations are widely accepted
  Hybrid method:    recommend items by combining some of the previous methods

Depending on the type of associations explored to compute recommendations and on the information used to build the relations, the association building techniques can lead to the different recommendation approaches: knowledge engineering, collaborative filtering, content-based filtering or hybrid filtering. Following this taxonomy definition, Table 8 summarizes some of the recommender systems previously explained.


Table 8. Entities and representation, associations among entities, association building techniques and recommendation method for several recommender systems. Unless stated otherwise, both user and item entities are represented with a unique id; where a richer entity representation is used, it is shown after '→'.

User-based (Resnick et al., 1994; Shardanand et al., 1995; Breese et al., 1998)
  U-I: numeric ratings (explicit)
  U-U: user similarity, based on ratings (memory - heuristic: vector similarity, mean squared difference, Pearson correlation)
  Recommendation method: user similarity - weighted aggregation of ratings from similar users

Predictability paths (Aggarwal et al., 1999)
  U-I: numeric ratings (explicit)
  U-U: predictability conditions, based on interactions (memory - predictability condition estimation)
  Recommendation method: user similarity - linear rating transformations and aggregation of ratings from similar users

Item-based (Shardanand et al., 1995; Sarwar et al., 2001)
  U-I: numeric ratings (explicit)
  I-I: item similarity, based on ratings (memory - heuristic: vector similarity, constrained Pearson correlation)
  Recommendation method: item similarity - weighted aggregation of similar item ratings

Cluster-based smoothing (Xue et al., 2005)
  U-I: numeric ratings (explicit)
  U-I2: smoothed ratings (memory - K-means clustering)
  U-U: user similarity, based on smoothed ratings (memory - heuristic: vector similarity, mean squared difference, Pearson correlation)
  Recommendation method: user similarity - weighted aggregation of ratings from similar users

Trust inferences (Papagelis et al., 2005)
  U-I: numeric ratings (explicit)
  U-U: user similarity, based on ratings (memory - heuristic: propagation of trust and confidence)
  Recommendation method: user similarity - weighted aggregation of ratings from trusted users

Improved neighborhood (Bell et al., 2007)
  U-I: numeric ratings (explicit)
  I-I: item similarity, based on ratings (memory - optimization of weights)
  Recommendation method: item similarity - weighted aggregation of ratings from similar items

Bayesian networks (Breese et al., 1998)
  I-U: instance-based representation (model - probabilistic Bayesian classifier)
  Recommendation method: item similarity - classification

Association rule mining (Lin et al., 2000)
  U-U: user associations; I-I: item associations (model - association rule mining)
  Recommendation method: item association

Eigentaste (Goldberg et al., 2001)
  U-I: numeric ratings (explicit)
  U-U: user clustering, based on ratings (model - PCA)
  Recommendation method: user similarity - cluster selection + aggregation of ratings from similar users

Content-boosted (Melville et al., 2002)
  U-I: numeric ratings (explicit)
  U-U: user similarity (memory - heuristic: Pearson correlation based on pseudo-ratings)
  U-I2: pseudo-ratings → item content features (model - Bayesian classifier)
  Recommendation method: user similarity - weighted aggregation of ratings from similar users

Similarity fusion (Wang et al., 2006)
  U-I: numeric ratings (explicit)
  U-U: user similarity, based on ratings; I-I: item similarity, based on ratings (model - probabilistic Bayesian model)
  Recommendation method: user similarity - cluster selection + aggregation of ratings from similar users

Spread-activation (Huang et al., 2004)
  U-I: transaction history → binary transactions (implicit)
  U-U: user similarity, based on demographic data → demographic data (memory - vector similarity)
  I-I: item similarity, based on content features → content features (memory - vector similarity)
  I-I2: item similarity, based on content features → content features (model - association rule mining)
  I-I3: item similarity, based on transactions → binary transactions (model - association rule mining)
  Recommendation method: hybrid method - user similarity, item similarity and item association, combined via the Hopfield net algorithm


6 Conclusion

The selection of the appropriate algorithm may depend on different aspects such as the type of information available to represent both users and items, or scalability restrictions. In this section, general guidelines for deciding which algorithms are better are provided on the basis of the following key aspects: accuracy, meaning of recommendations, scalability and performance, new data, application domain, user activity and prior information.

Accuracy. As a central issue in CF research, prediction accuracy has received considerable attention and various methods have been proposed to improve it. Still, conventional memory-based methods using the Pearson correlation coefficient remain among the most successful. In domains where content information is available, hybrid methods can provide more accurate recommendations than pure collaborative or content-based approaches (see [11, 47, 57, 43] for empirical comparisons). Figure 2 shows some experimental NMAE results compiled from different research works in different domains.

Fig. 2. Experimental accuracy (NMAE) results from different research works. Results are shown for different datasets with colored bars.

Meaning of recommendations. As shown in the proposed taxonomy, recommendations can stand for slightly different semantics. While user and item similarity are probably the most frequently used recommendation strategies, other methods, such as item association, may be interesting in a recommendation engine as well.

Scalability and performance. Memory-based CF often suffers from slow response times, since each single prediction requires scanning a whole database of user ratings. This is a clear disadvantage when compared to the typically fast responses of model-based CF. Recommending items in real time requires the underlying engine to be highly scalable. To achieve this, recommendation algorithms usually divide the recommendation generation into two parts: the off-line and the on-line component. The first is the part of the algorithm that requires an enormous amount of operations, and the second is the part that is computed dynamically to provide predictions using data from the stored component. In this sense, model-based approaches may be more suitable in terms of scalability and performance than hybrid and neighborhood-based ones.

New data. In the case of high volumes of new data, model-based approaches have to be trained and updated too often, which makes them computationally expensive and intractable. In this situation, memory-based solutions can easily accommodate new data by simply storing it.

Application domain. Depending on the application domain, one algorithm may fit better than another. For instance, in domains such as music recommendation, approaches that rely on content-based filtering are useless and pure collaborative filtering is still the only way to perform personalization. On the contrary, in domains such as movie recommendation, where content information is available, the quality of the recommender will probably be enhanced by adding content-based features.

User activity/sparsity. Users do not present the same degree of activity in all domains. For instance, a movie/music recommendation site may have thousands of transactions per day, while in other domains, such as tourism, users may be less active, thus emphasizing the sparsity problem. As a result, in low-activity domains, either content-based filtering or hybrid filtering would come up with more accurate results than pure collaborative filtering approaches.

Prior information. If an initial preference/rating database is not available, only content-based or hybrid recommenders can face both the new user and the new item problems. Learning extensions are essential to select informative query items that the user is likely to rate and thus keep the information gathering stage as short as possible. To address the limitations of collaborative filtering, it is often a good idea to ask each newcomer to create a user profile. This ensures that the new user has the opportunity to rate items which others have also rated, so that there is some commonality among users' profiles.

6.1 Future Directions of CF

Better methods for representing user behavior and product items, more advanced recommendation modeling methods, the introduction of various kinds of contextual information into the recommendation process, the utilization of multicriteria ratings, or the provision of more flexible and less intrusive types of recommendations are some ways to improve recommender systems [55, 21, 7]. The most promising research lines are discussed here:

Context-aware recommenders. Most CF methods use neither user nor item profiles during the recommendation process. Hybrid methods incorporate user and item profiles, but these profiles are still quite simple. New research in context-aware recommenders essentially tries to model additional information that may be relevant to recommendations in different senses: (1) for identifying pertinent subsets of data when computing recommendations, (2) for building richer rating estimation models, or (3) for providing constraints on recommendation outcomes. There are different active research directions in context-aware recommenders, such as: (1) establishing relevant contextual features, (2) advanced techniques for learning context from data, (3) contextual modeling techniques, and (4) developing richer interaction capabilities for context-aware recommender systems (recommendation query languages, intelligent user interfaces).

Flexibility. Flexibility stands for the ability of the recommender system to allow the user to query the system with his/her specific needs in real time. REQUEST (REcommendation QUEry STatements) [8] is a language that allows users to customize recommendations to fit individual needs more accurately. The language is based on a multidimensional data model in which users, items, ratings and other contextually relevant information are represented together following the OLAP-based paradigm. In this sense, the flexibility of recommenders is closely related to context-rich applications. For instance, the query 'recommend me and my girlfriend the top-3 movies and moments based on my personal ratings' could be expressed:

RECOMMEND Movie,Time TO Peter, Lara

USING MovieRecommender

BASED ON PersonalRating

RESTRICT Companion.Type=’Girlfriend’

SHOW TOP 3

Non-Intrusiveness. Many recommender systems are intrusive in the sense that they obtain ratings explicitly from users. Other systems get implicit feedback from users, but non-intrusive ratings are often inaccurate and not as reliable as the explicit ratings provided by users. Minimizing intrusiveness while maintaining the accuracy of recommendations is a critical issue in designing recommender systems: if the system demands greater user involvement, users are more likely to reject it. Methods aimed at reducing either the required user feedback, by means of attentive interfaces, or the set of item ratings required to maintain a representative user model, while keeping a reasonable degree of confidence in the predictions, could be promising directions.

References

1. Amazon.com (March 2008)
2. Book-crossing site (March 2008)
3. Cdnow.com (March 2008)
4. Lastfm site (March 2008)
5. Movielens site (March 2008)
6. Netflix site (March 2008)
7. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowl. and Data Eng. 17(6), 734–749 (2005)
8. Adomavicius, G., Tuzhilin, A., Zheng, R.: RQL: A query language for recommender systems. Information Systems Working Papers Series (2005)
9. Aggarwal, C.C., Wolf, J.L., Wu, K.-L., Yu, P.S.: Horting hatches an egg: a new graph-theoretic approach to collaborative filtering. In: KDD 1999: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 201–212. ACM, New York (1999)
10. Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval. ACM Press / Addison-Wesley (1999)
11. Balabanovic, M., Shoham, Y.: Fab: content-based, collaborative recommendation. ACM Commun. 40(3), 66–72 (1997)
12. Belkin, N.J., Croft, W.B.: Information filtering and information retrieval: two sides of the same coin? ACM Commun. 35(12), 29–38 (1992)
13. Bell, R., Koren, Y.: Improved neighborhood-based collaborative filtering. In: KDDCup 2007: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, San Jose, California, USA, pp. 7–14. ACM, New York (2007)
14. Billsus, D., Pazzani, M.J.: Learning collaborative information filters. In: ICML 1998: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 46–54. Morgan Kaufmann Publishers Inc., San Francisco (1998)
15. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: UAI 1998: Proceedings of the fourteenth conference on uncertainty in artificial intelligence, pp. 43–52 (1998)
16. Chen, H., Ng, T.: An algorithmic approach to concept exploration in a large knowledge network (automatic thesaurus consultation): symbolic branch-and-bound search vs. connectionist hopfield net activation. J. Am. Soc. Inf. Sci. 46(5), 348–369 (1995)
17. Claypool, M., Gokhale, A., Mir, T., Murnikov, P., Netes, D., Sartin, M.: Combining content-based and collaborative filters in an online newspaper. In: Proceedings of ACM SIGIR Workshop on Recommender Systems (1999)
18. Cohen, W.W., Schapire, R.E., Singer, Y.: Learning to order things. In: NIPS 1997: Proceedings of the 1997 conference on Advances in neural information processing systems, vol. 10, pp. 451–457. MIT Press, Cambridge (1998)
19. Dahlen, B.J., Konstan, J.A., Herlocker, J.L., Good, N., Borchers, A., Riedl, J.: Jump-starting movielens: User benefits of starting a collaborative filtering system with "dead-data". University of Minnesota TR 98-017 (1998)
20. Demiriz, A.: Enhancing product recommender systems on sparse binary data. Data Min. Knowl. Discov. 9(2), 147–170 (2004)
21. Deshpande, M., Karypis, G.: Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst. 22(1), 143–177 (2004)
22. Drineas, P., Kerenidis, I., Raghavan, P.: Competitive recommendation systems. In: STOC 2002: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp. 82–90. ACM, New York (2002)
23. Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to weave an information tapestry. ACM Commun. 35(12), 61–70 (1992)
24. Goldberg, K., Roeder, T., Gupta, D., Perkins, C.: Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval 4(2), 133–151 (2001)
25. Han, E.-H.(S.), Karypis, G.: Feature-based recommendation system. In: CIKM 2005: Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 446–452. ACM, New York (2005)
26. Heckerman, D., Chickering, D.M., Meek, C., Rounthwaite, R., Kadie, C.: Dependency networks for inference, collaborative filtering, and data visualization. J. Mach. Learn. Res. 1, 49–75 (2001)
27. Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: SIGIR 1999: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 230–237. ACM, New York (1999)
28. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), 5–53 (2004)
29. Hill, W., Stead, L., Rosenstein, M., Furnas, G.: Recommending and evaluating choices in a virtual community of use. In: CHI 1995: Proceedings of the SIGCHI conference on Human factors in computing systems, New York, USA, pp. 194–201. ACM Press/Addison-Wesley Publishing Co. (1995)
30. Hofmann, T.: Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst. 22(1), 89–115 (2004)
31. Hofmann, T., Puzicha, J.: Latent class models for collaborative filtering. In: IJCAI 1999: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 688–693. Morgan Kaufmann Publishers Inc., San Francisco (1999)
32. Huang, Z., Chen, H., Zeng, D.: Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems 22(1), 116–142 (2004)
33. Huang, Z., Chung, W., Chen, H.: A graph model for E-commerce recommender systems. Journal of the American Society for Information Science and Technology 55(3), 259–274 (2004)
34. Huang, Z., Chung, W., Ong, T.-H., Chen, H.: A graph-based recommender system for digital library. In: JCDL 2002: Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, pp. 65–73. ACM, New York (2002)
35. Jin, R., Si, L., Zhai, C.: Preference-based graphic models for collaborative filtering. In: UAI 2003: Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, pp. 329–336 (2003)
36. Jin, R., Si, L., Zhai, C.X., Callan, J.: Collaborative filtering with decoupled models for preferences and ratings. In: CIKM 2003: Proceedings of the twelfth international conference on Information and knowledge management, pp. 309–316. ACM Press, New York (2003)
37. Konstan, J.A., Miller, B.N., Maltz, D., Herlocker, J.L., Gordon, L.R., Riedl, J.: Grouplens: Applying collaborative filtering to usenet news. Communications of the ACM 40(3), 77–87 (1997)
38. Lee, W.S.: Collaborative learning and recommender systems. In: ICML 2001: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 314–321. Morgan Kaufmann Publishers Inc., San Francisco (2001)
39. Lin, W., Alvarez, S.A., Ruiz, C.: Collaborative recommendation via adaptive association rule mining. In: Data Mining and Knowledge Discovery, vol. 6, pp. 83–105 (2000)
40. Lin, W., Ruiz, C., Alvarez, S.A.: A new adaptive-support algorithm for association rule mining. Technical report (2000)
41. Linden, G., Smith, B., York, J.: Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7(1), 76–80 (2003)
42. Littlestone, N.: Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Mach. Learn. 2(4), 285–318 (1988)
43. Melville, P., Mooney, R.J., Nagarajan, R.: Content-boosted collaborative filtering for improved recommendations. In: Eighteenth national conference on Artificial intelligence, pp. 187–192. AAAI, Menlo Park (2002)
44. Miyahara, K., Pazzani, M.J.: Collaborative filtering with the simple bayesian classifier. In: Proceedings of the 6th Pacific Rim International Conference on Artificial Intelligence, pp. 679–689 (2000)
45. Mobasher, B., Dai, H., Luo, T., Nakagawa, M.: Discovery and evaluation of aggregate usage profiles for web personalization. Data Mining and Knowledge Discovery 6, 61–82 (2002)
46. Papagelis, M., Plexousakis, D., Kutsuras, T.: Alleviating the sparsity problem of collaborative filtering using trust inferences. In: Herrmann, P., Issarny, V., Shiu, S.C.K. (eds.) iTrust 2005. LNCS, vol. 3477, pp. 224–239. Springer, Heidelberg (2005)
47. Pazzani, M.J.: A framework for collaborative, content-based and demographic filtering. Artif. Intell. Rev. 13(5-6), 393–408 (1999)
48. Pennock, D.M., Horvitz, E., Lawrence, S., Lee Giles, C.: Collaborative filtering by personality diagnosis: A hybrid memory and model-based approach. In: UAI 2000: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 473–480. Morgan Kaufmann Publishers Inc., San Francisco (2000)
49. Rashid, A.M., Albert, I., Cosley, D., Lam, S.K., McNee, S.M., Konstan, J.A., Riedl, J.: Getting to know you: learning new user preferences in recommender systems. In: IUI 2002: Proceedings of the 7th international conference on Intelligent user interfaces, pp. 127–134. ACM, New York (2002)
50. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: Grouplens: an open architecture for collaborative filtering of netnews. In: CSCW 1994: Proceedings of the 1994 ACM conference on Computer supported cooperative work, pp. 175–186. ACM, New York (1994)
51. Resnick, P., Varian, H.R.: Recommender systems. Communications of the ACM 40(3), 56–58 (1997)
52. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality reduction in recommender systems - a case study. In: ACM WebKDD Workshop (2000)
53. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: WWW 2001: Proceedings of the 10th international conference on World Wide Web, pp. 285–295. ACM, New York (2001)
54. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Analysis of recommendation algorithms for E-commerce. In: ACM Conference on Electronic Commerce, pp. 158–167 (2000)
55. Schafer, J.B., Konstan, J., Riedl, J.: Recommender systems in E-commerce. In: EC 1999: Proceedings of the 1st ACM conference on Electronic commerce, pp. 158–166. ACM, New York (1999)
56. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating "word of mouth". In: CHI 1995: Proceedings of the SIGCHI conference on Human factors in computing systems, New York, USA, pp. 210–217. ACM Press/Addison-Wesley Publishing Co. (1995)
57. Soboroff, I.M., Nicholas, C.K.: Combining content and collaboration in text filtering. In: Proceedings of the IJCAI 1999 Workshop on Machine Learning for Information Filtering, pp. 86–91 (1999)
58. Wang, J., de Vries, A.P., Reinders, M.J.T.: Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In: SIGIR 2006: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 501–508. ACM Press, New York (2006)
59. Xue, G.-R., Lin, C., Yang, Q., Xi, W., Zeng, H.-J., Yu, Y., Chen, Z.: Scalable collaborative filtering using cluster-based smoothing. In: SIGIR 2005: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 114–121. ACM, New York (2005)
60. Yu, K., Schwaighofer, A., Tresp, V., Xu, X., Kriegel, H.-P.: Probabilistic memory-based collaborative filtering. IEEE Transactions on Knowledge and Data Engineering 16(1), 56–69 (2004)
61. Zhang, T., Iyengar, V.S.: Recommender systems using linear classifiers. J. Mach. Learn. Res. 2, 313–334 (2002)

Page 128: Web Personalization in Intelligent Environments

6 A System for Fuzzy Items Recommendation

Corrado Mencar, Ciro Castiello, Danilo Dell’Agnello, and Anna Maria Fanelli

Università degli Studi di Bari
[email protected], {mencar,castiello,fanelli}@uniba.it

Summary. This contribution presents a user profile modelling approach based on fuzzy logic techniques. The proposed approach is conceived to find application in various contexts, with the aim of providing personalised contents to different categories of users. Both contents and users are described by metadata, so a description language is introduced along with a formal model defining their association mechanism. The strength of the model lies in the expressive flexibility of fuzzy sets, exploited by an innovative scheme of metadata. Along with the formal presentation of the profile modelling approach, the design of a software system based on a Service Oriented Architecture is presented. The system exposes a number of services to be consumed by information systems for personalised content access. In this way the system can be used in different application contexts.

1 Introduction

Personalisation is one of the key issues pervading most of the technological applications designed for content provision, such as e-commerce systems, web portals, e-learning platforms and so on [1]. In the diverse contexts where they find application, personalisation mechanisms are mainly based on the definition of user profiles. These are formal structures representing different pieces of information related to the user, ranging from her expressed preferences or previous knowledge to her specific role within the area of interest. Usually, profiles are defined to represent categories of users sharing common features; in this way, user profiles act as filters which favour the allocation of personalised contents.

Generally speaking, the definition of user profiles determines a specific granularity level to be introduced inside the area of interest. This kind of information granulation can be established on a range where the roughest case refers to the definition of a single profile for the whole ensemble of users (no personalisation allowed), while the finest case refers to the definition of a distinct profile for each user (maximum level of personalisation). Inside this range, the choice of a particular granularity level is mainly driven by a number of factors concerning the trade-off between the involved costs and the produced benefits.


More fine-grained profiles can be achieved through automated profiling systems, which rely on data mining and machine learning techniques [2]. Nevertheless, this kind of approach requires a significant learning time, during which the user behaviour must be monitored by the system in order to build customised profiles.

The work activity described in this chapter refers to personalisation processes whose final aim consists in providing items to the users, so as to satisfy their needs and goals at best [3]. The proposed approach aims at combining the benefits of automated profiling mechanisms with some form of available a priori knowledge about the domain. In practice, the users are assigned to pre-established user profiles, shared among the entire user community. At the same time, each user is also associated with individual profiles, which can be used to track her specific behaviour. The definition of complex profile structures (on the basis of simpler profile components) makes it possible to build up any kind of user profile, responding to the articulated conditions of real world applications.

Another peculiarity of the proposed approach consists in the introduction of fuzzy logic for modelling the association of users and profiles. In typical real world situations, a user can hardly be characterised in terms of a single profile: the specification of fuzzy degrees of membership makes it possible to associate a single user with multiple profiles. Moreover, the employment of fuzzy logic is also useful for defining a suitable metadata specification, to be adopted for the description of the items (such as Learning Objects, items in an e-commerce platform and so on). Actually, metadata are largely used in profiling systems to characterise the objects involved in the personalisation process [4]. The usual association mechanisms based on common metadata schemes produce the simple identification of a number of items to be connected with the demanding users. Obviously, this means that a great deal of items are left out of the association process. Fuzzy logic allows for a more comprehensive metadata specification, including the description of imprecise properties of the items. Consequently, a gradual association between users and items can be realised, configured as a ranking where degrees of compatibility are used to identify the most suitable items for each user, without excluding those characterised by lower degrees of compatibility.

The chapter is organised as follows. In the next section a brief overview of the state of the art is presented. In Section 3 the profile modelling approach is introduced. In Section 4 the model for the description of an item is formalised, while Section 5 is devoted to the formalisation of the model describing the profiles of the actors involved in the item fruition process. Section 6 describes a proposal for a software system implementing the described model. A metric-based evaluation of a prototype of this system is provided in Section 7, along with some architectural remarks. Finally, Section 8 closes the chapter with some conclusive considerations.

2 Related Work

In the last decade Soft Computing techniques (including Fuzzy Logic, Neural Networks, Probabilistic Reasoning, Genetic Algorithms etc.) have been successfully applied in user modeling [5, 6].


Fuzzy Logic is usually employed in user modeling for its intrinsic ability to represent and manipulate imprecise and graded concepts. Its usefulness is generally recognized when – as in many real world cases – user models cannot be precisely defined without arbitrary approximations (for a survey on user modeling with several paradigms, including Fuzzy Logic, see [7]).

A noteworthy application of Fuzzy Logic to user modeling in e-learning systems is given in [8]. Here, fuzzy sets are used to model the user knowledge and are dynamically adapted while the user learns new concepts from the e-learning platform. In [9] fuzzy rules are employed to register user actions and to refine the strength of the relationship between the user model attributes and the concepts of the knowledge domain. In [10] fuzzy sets are used to model beliefs about the interactions that students have with items and quizzes; in this way the educational system is able to evaluate how plausible it is that a student actually studied her assigned items.

Fuzzy Logic has also been used for user modeling in several areas other than e-learning systems. As an example, in [11] a fuzzy nearest neighbor approach is used in a collaborative filtering system to guess user preferences on the basis of historical records. In [12] a Fuzzy Logic based approach has been adopted for the modeling of users to improve the interaction between the user and information retrieval systems. In [13] fuzzy logic techniques applied to recommender systems have been presented. In [14] Fuzzy Multiple Criteria Analysis has been used as a tool for user modeling in a Sales Assistance software.

In most works on user modeling with Fuzzy Logic, the representation of a user model is flat (i.e. usually based on a collection of fuzzy sets, or a vector of fuzzy values). However, in some industrial contexts, users of an e-learning system may require more complex representations that better capture the role, the knowledge and the preferences of each user in her professional context. Besides, the roles of a user could be described in complex terms, such as a composition of sub-roles. In the subsequent sections an approach is proposed to account for these complexities by providing a very flexible framework for representing user profiles.

3 Rationale of the Profile Modelling Approach

The activity described in this contribution moves from the assumption that the proposed approach for profile modelling can find application in different contexts. Our investigation is thus addressed to formalising the association process between a set of items (Its), each describing an object, and users, on the basis of suitable metadata specifications.

The main concern of our approach is to provide a modelling strategy independent of the system that owns the items. Concerning the independence requirements, we intend to preserve:

• the independence from the actual representation of the items inside the platform;

• the independence from the actual representation of the users inside the platform;


• the independence from the specific technologies adopted for the representation of metadata inside the platform.

By conforming to these independence requirements it is possible to devote particular attention to the management of the profile modelling process, regardless of the constraints related to the practical realisation of the platform. This kind of approach makes it possible to set aside a proper definition of users and items: they can be simply acknowledged as class instances, without additional specifications.

Concerning the capability requirements, we intend to realise:

• the capability to employ metadata specifications allowing for the representation of imprecise properties;

• the capability to formalise profiles of high complexity;

• the capability to perform (possibly partial) associations of a single user with several profiles.

The key to the association between users and items is metadata, which connect an item with an attribute and its respective value. Our approach differs from usual metadata specifications since we assume that the value for an attribute, far from being simply an element inside the attribute domain, can be specified as a fuzzy set. The theory of fuzzy sets basically modifies the concept of membership: the classical binary membership leaves room for a more comprehensive variety of membership degrees, defined in terms of a mathematical function (as we are going to detail in the next section). In this way, fuzzy sets allow for a partial membership of their elements [15]. The employment of a fuzzy metadata characterisation enables the definition of different properties related to an item. In particular, we can distinguish among: simple properties (regarding the punctual evaluation of an attribute by determining a single value inside the set of infinite possible values); collective properties (regarding the extensional specification of a discrete set of values for an attribute); imprecise properties (regarding the intensional definition of a qualitative value for an attribute). It should be noted how this kind of approach produces a granulation of the attribute domains, where fuzzy sets are adopted to represent each information granule. This favours a mechanism of elaboration of concepts that is in agreement with human reasoning schemes [16].

Actually, the formalisation of imprecise properties is included in our model to cope with the intrinsic difficulties related to some metadata characteristics, which cannot be described in terms of simple or collective values. Attempts to formalise such properties by means of discretisation processes lead to arbitrariness, resulting in a poor management of the involved items. The introduction of fuzzy sets is intended to overcome this kind of difficulty, together with the adoption of particular mathematical operators that are especially suitable for handling imprecise information. In this way, gradual associations can be realised between users and items, on the basis of a compatibility ranking. As a result, each user can be ultimately addressed to the most compatible item, without arbitrarily discarding those characterised by a lower degree of compatibility.


The user profiles are used to represent stereotypical categories of learners. In order to take into account stereotypes of high complexity, in this work the user profiles are formalised as collections of profile components. Analogously to the metadata specification, the profile components are characterised in terms of fuzzy sets: this homogeneity expedites the comparison process aiming at defining a compatibility degree between profile components and items. The aggregation of such compatibility degrees produces the final association of a profile with an item.

Users are characterised by their corresponding profiles; however, a single user hardly finds a full representation inside a single profile. For that reason, the proposed modelling approach allows a partial membership of users to different profiles, as happens in real world situations. Therefore, the final association of a user with an item is evaluated by considering the compatibility degrees related to the different profiles of belonging.

In the following sections we are going to detail the profile modelling approach by distinguishing the characterisation of the items from the description of the user profiling mechanisms. All the involved entities are formally defined in terms of mathematical concepts and suitable examples are provided to illustrate the working scheme of the modelling approach.

4 Modelling Items

To provide a general way to describe a generic item, a description by means of metadata is considered. Regardless of the context in which the item is employed (e.g. e-learning, e-commerce, item recommendation and so on), the model deals only with the item description.

4.1 Items and Attributes

An item (It) is any object owned by the platform which a user can be interested in. The proposed model leaves aside the peculiar structure of an object description, which is simply defined as an element of a set. Let O be a non-empty set of physical objects, namely the item space.

Definition 1. An item is an element o in the item space O, i.e. o ∈ O.

With reference to a particular scenario, an item may be represented by a multimedia support, a learning object in an e-learning platform, a document file, a presentation, a book, a hardware component and so on.

Each item can be associated with a set of attributes. Generally speaking, an attribute may be numeric or symbolic and it is related to a (possibly infinite) number of distinct values. Let A be a non-empty set, namely the attribute space.

Definition 2. An attribute is an element A in the attribute space A, i.e. A ∈ A. In particular, an attribute A is a set of values a ∈ A.

Example 1. If we consider an item represented by a book in an e-commerce platform, a list of related attributes may include:


1. the name of the item;
2. the difficulty level of the item (e.g. undergraduate, professional etc.);
3. the publishing year of the item;
4. the author of the item;
5. the topic of the item (e.g. fiction, scientific and so on).

Again, if we consider an item represented by a Learning Object (LO) in an e-learning system, a list of related attributes may include:

1. the name of the LO;
2. the difficulty level of the LO;
3. the fruition time of the LO.

The peculiarity of the proposed modelling approach consists in associating a particular item with the imprecise values of its attributes. To manage these associations, the concept of fuzzy set is employed, standing as a generalisation of the classical concept of mathematical set. By defining a fuzzy set over a domain, it is possible to extend the membership evaluation process to every element inside the domain, thus moving from binary membership values (0/1) to a gradation of membership values over a continuous range. For our purposes, we define a fuzzy set over each attribute of an item as follows. Let A ∈ A be an attribute.

Definition 3. A fuzzy set defined over A is a function:

FA : a ∈ A → FA(a) ∈ [0, 1] (1)

FA(a) is called the membership degree of the value a in the fuzzy set FA.

The definition of fuzzy sets enables the characterisation of items in terms of the correspondence between the attributes and their possible values. This kind of relationship can be defined in terms of a set of Attribute-Value pairs, specified as follows. Let A be the attribute space.

Definition 4. An Attribute-Value pair is an ordered pair (A, FA), being A ∈ A and FA a fuzzy set defined over the attribute A. An Attribute-Value set is a set of Attribute-Value pairs:

f = {(A, FA) | A ∈ A}.

Remark 1. An Attribute-Value set can be formalised as the function:

f : A ∈ A −→ FA ∈ FA,

being FA the space of all the possible fuzzy sets which can be defined over the attribute A.
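Since the model is defined purely in terms of sets and functions, a concrete encoding is not prescribed by the chapter; the following minimal Python sketch (an assumption of ours, not part of the formal model) shows one possible way to represent a discrete fuzzy set and an Attribute-Value set.

# A discrete fuzzy set F_A can be encoded as a dictionary mapping each
# attribute value to its membership degree in [0, 1]; values not listed
# are assumed to have membership 0.
complexity_value = {"Easy": 1.0, "Average": 0.3, "Expert": 0.0}

def membership(fuzzy_set, value):
    """Return F_A(value), i.e. the membership degree of 'value'."""
    return fuzzy_set.get(value, 0.0)

# An Attribute-Value set f associates each attribute with a fuzzy set
# defined over that attribute (Definition 4).
attribute_value_set = {
    "Complexity": complexity_value,
    "Scope": {"ICT": 0.7, "Word Processing": 1.0},
}

print(membership(complexity_value, "Average"))  # 0.3
print(membership(complexity_value, "Hard"))     # 0.0 (unlisted value)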

4.2 Metadata and Item Description

Attributes and values are strictly connected with an item. Therefore, it is useful to introduce the metadata concept (to be defined for every attribute), associating an item with a fuzzy set which represents the attribute value. Let A ∈ A be an attribute of an item o ∈ O.


Definition 5. A metadata mA is a function associating the item o with a fuzzy set defined on A:

mA : O → FA.

In order to obtain a thorough description of an item, it is necessary to refer to its attributes and their related values. A straightforward mechanism to generate an item description is the simple enumeration of the attributes, together with the fuzzy sets reporting the corresponding values. This kind of description is based on the set of metadata that it is possible to define for an item. Let A be the attribute space.

Definition 6. The description of an item o ∈ O, with respect to A, is the set of all the Attribute-Metadata pairs associated to o:

D(o) = {(A, mA(o))|A ∈ A}. (2)

Remark 2. The description of an item o ∈ O can be formalised as the function:

D(o) : A ∈ A → FA ∈ FA.

Remark 3. The description D(o) is an Attribute-Value set.

Remark 4. We admit the presence of attributes associated with the entire set of values, i.e. when mA(o) = A. This condition is verified when no values are specified for the attribute A in the characterisation of the item o.

Example 2. Inside the illustrative scenario introduced in Example 1, the item description (here a LO representation has been considered) can be expressed by listing the attributes together with the fuzzy sets reporting the corresponding values:

1. Name → {“Introduction to word processing”/1};
2. Fruition time → about 10′ = T[8, 10, 15];
3. Creation date → {“07-06-07”/1};
4. Complexity → {“Easy”/1, “Average”/0.3, “Expert”/0};
5. Scope → {“ICT”/0.7, “Word Processing”/1}.

It can be observed that the fuzzy sets reporting the values for the attributes «Name» and «Creation date» refer to simple properties of the item. They assign the maximum membership degree (equal to 1) only to one of the infinite values the attributes may assume. All the other values are not reported inside the characterisations of the fuzzy sets, since their membership degree is equal to zero. This peculiar condition can be graphically represented by means of fuzzy singletons, as depicted in Fig. 1. The fuzzy sets reporting the values for the attributes «Complexity» and «Scope» refer to collective properties of the item. They are defined over discrete sets and assign a membership degree to each one of the possible values, as depicted in Fig. 2. Finally, the fuzzy set reporting the value for the attribute «Fruition time» refers to an imprecise property of the item. It is defined over a continuous set and assigns a membership degree to each one of the possible values by means of a triangular function, as depicted in Fig. 3. (It should be noted that different kinds of membership functions may be adopted, such as trapezoidal or Gaussian functions.)

Fig. 1. Fuzzy singletons representing simple properties of an item: the «Name» attribute (a) and the «Creation date» attribute (b)

Fig. 2. Fuzzy sets representing collective properties of an item: the «Complexity» attribute (a) and the «Scope» attribute (b)

Fig. 3. Fuzzy set representing an imprecise property of an item: the «Fruition time» attribute
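As an illustration, the sketch below encodes the item description of Example 2 in Python: discrete fuzzy sets are plain dictionaries, while the imprecise «Fruition time» value is a triangular membership function. The helper names are illustrative assumptions only, not taken from the actual system.

def triangular(a, b, c):
    """Return the membership function of the triangular fuzzy set T[a, b, c]."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# Description D(o) of the learning object of Example 2: each attribute is
# paired with a fuzzy set reporting its (possibly imprecise) value.
lo1_description = {
    "Name": {"Introduction to word processing": 1.0},            # simple property
    "Creation date": {"07-06-07": 1.0},                           # simple property
    "Complexity": {"Easy": 1.0, "Average": 0.3, "Expert": 0.0},   # collective property
    "Scope": {"ICT": 0.7, "Word Processing": 1.0},                # collective property
    "Fruition time": triangular(8, 10, 15),                       # imprecise, about 10'
}

print(lo1_description["Fruition time"](10))  # 1.0, full membership at 10 minutes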

Definition 7. An item collection O is a subset of the space O, i.e. O ⊆ O.

The description of an item collection can be further specified with reference to the definition of the single item description as follows.

Definition 8. The description of an item collection O is the union set of all the Attribute-Value sets defined by the description of each item:

D(O) = ⋃_{o∈O} {D(o)}. (3)

Example 3. The mathematical formalisation of the item description can be extended to manage several distinct items. In this case, the formula in Definition 6 should be properly generalised by means of a matrix representation, where rows and columns correspond to the items and their attributes, respectively. To this aim, we introduce the concept of item collection. The previously described illustrative scenario can be expanded by involving a number of different items. The information reported in Table 1 represents the matrix describing a sample item collection.

Table 1. The matrix describing a sample item collection

LO1: Name → {“Introduction to Word”/1}; Fruition time → about 10’; Creation date → {“07-06-07”/1}; Complexity → {“Easy”/1, “Average”/0.3, “Expert”/0}; Scope → {“ICT”/0.7, “Word Processing”/1}

LO2: Name → {“Introduction to Latex”/1}; Fruition time → 1/2 hour; Creation date → {“22-05-07”/1}; Complexity → {“Easy”/0.4, “Average”/0.7, “Expert”/0.1}; Scope → {“ICT”/0.7, “Word Processing”/1}

LO3: Name → {“HTML for Dummies”/1}; Fruition time → about 40’; Creation date → {“22-04-07”/1}; Complexity → {“Easy”/0.3, “Average”/0.8, “Expert”/0.2}; Scope → {“ICT”/0.7, “Web”/1}

5 Modelling the Actors of the Item Fruition Process

In the previous section a way to model items by means of a set of metadata has been provided. In the same way, in this section a description of the user by means of a set of metadata is presented. The structure of the user description is rather more complex than that of the items, reflecting the fact that a user can assume diverse roles at the same time.

5.1 Profile Components and Compatibility Degrees

The user profiles are regarded as complex concepts whose analysis can be performed on the basis of simpler elements, that are the profile components. Each of them is formalised in terms of the previously introduced Attribute-Value pairs, so that the fuzzy valorisation of attributes can be replicated. Let A be the attribute space.

Definition 9. A profile component c is defined as the set of ordered pairs:

c = {(A, FA) |A ∈ A}. (4)

Remark 5. A profile component can be formalised as the function:

c : A ∈ A → FA ∈ FA.

Remark 6. The ensemble of the profile components spans the set C, namely the space of the profile components.

The formalisation of the profile components is useful to define the concept of user profiles.

Definition 10. A user profile p is a set of profile components, i.e. p ⊆ C.

Remark 7. The ensemble of the user profiles spans the power set P = 2^C of user profiles.

Example 4. A specific user profile can be constituted by a number of profile components. As an example, we refer to a couple of profile components. The first one (c1) is characterised by the following Attribute-Value pairs:

1. Fruition time → short = T[0, 15, 30];
2. Complexity → {“Easy”/1, “Average”/1, “Expert”/0.5};
3. Scope → {“ICT”/0.5, “Word Processing”/0.8}.

The second profile component (c2) is characterised by the following Attribute-Value pairs:

1. Complexity → {“Easy”/0.5, “Average”/1, “Expert”/1};
2. Scope → {“Management”/1}.

Such a user profile can be properly associated to a «secretary» profile and it is defined in terms of the same attributes employed for the item descriptions reported in the previous examples. Here the «Fruition time», «Complexity» and «Scope» attributes refer to the characteristics of the items that the user is supposed to be addressed to. The attributes not appearing in this example are not deemed useful for describing the profile components. The pieces of information reported in the example are quite illustrative of the usefulness of profile components. In fact, the first component c1 is related to the ICT competence of the secretary, with special reference to the use of word processing software. This kind of competence can be reasonably regarded as a non-priority issue for the secretary profile; for that reason the related items are characterised by a low complexity level and a short fruition time. Conversely, the secretary profile is fully qualified in terms of management activities, as represented by the maximum membership degree associated to the value of the «Scope» attribute in the second component c2. As a consequence, more complex items are to be considered, without a specification for the «Fruition time» attribute: in this case the user should be addressed to items requiring any time of fruition.
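Since profile components share the Attribute-Value structure of item descriptions, the «secretary» profile of Example 4 can be encoded with the same kind of dictionaries, as in the following illustrative Python sketch (the triangular fuzzy set for «Fruition time» is kept as a parameter triple for brevity; all names are our own).

# Profile components c1 and c2 of the «secretary» profile (Example 4).
# Discrete attributes are dictionaries of membership degrees; the triangular
# fuzzy set "short" = T[0, 15, 30] is kept as a parameter triple here.
c1 = {
    "Fruition time": ("triangular", 0, 15, 30),
    "Complexity": {"Easy": 1.0, "Average": 1.0, "Expert": 0.5},
    "Scope": {"ICT": 0.5, "Word Processing": 0.8},
}
c2 = {
    "Complexity": {"Easy": 0.5, "Average": 1.0, "Expert": 1.0},
    "Scope": {"Management": 1.0},
}

# A user profile is simply a collection (here: a list) of profile components.
secretary_profile = [c1, c2]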


The homogeneity between the item description and the profile components is straightforward, as results from the comparison of Definitions 6 and 9. The common structure of these elements allows the definition of a compatibility degree among them, which is actually evaluated between a couple of Attribute-Value sets. For this purpose, it is possible to exploit the possibility measure among fuzzy sets and the aggregation operators. Particularly, the possibility measure [17], [18] verifies the existence of an attribute value both in the profile component and in the item description; the aggregation process, performed over the evaluated possibility measures, produces a compatibility degree between the profile component and the item.

Definition 11. The possibility degree between two fuzzy sets F′A, F″A, defined on the same attribute A, is defined as follows:

Π(F′A, F″A) = sup_{a∈A} min{F′A(a), F″A(a)}.

An example is shown in Fig. 4. The possibility degree provides a measure of the compatibility of two granular values defined on the same attribute. It is hence the basic operation for the definition of the compatibility degree between an item and a profile component. The calculation of the possibility degree spans all the attributes in A. As a consequence, given two Attribute-Value sets f1, f2 the related possibility degree can be specified.

Definition 12. The possibility degree between two Attribute-Value sets f1, f2 is defined as follows:

Ψ(f1, f2) : A ∈ A −→ Π(f1(A), f2(A)) ∈ [0, 1].

The definition of the compatibility degree of the two Attribute-Value sets f1, f2 requires the aggregation of the possibility degrees attained for each attribute.

Definition 13. The compatibility degree between f1 and f2 is defined as:

Kω(f1, f2) = ω(Ψ(f1, f2)).

Function ω is an OWA (Ordered Weighted Average, [19]) aggregation operator:

ω : [0, 1]^|A| → [0, 1],

defined as:

ω(π_1, π_2, . . . , π_|A|) = Σ_{j=1}^{|A|} π_{i_j} · w_j,

where π_{i_1} ≤ π_{i_2} ≤ · · · ≤ π_{i_|A|} and w_1, w_2, . . . , w_|A| ∈ [0, 1] are weight factors such that Σ_{j=1}^{|A|} w_j = 1.


Remark 8. By changing the weight factors, several OWA operators can be defined, such as the minimum function (by setting w_1 = 1 and w_j = 0 for j > 1) or the mean value function (by setting w_j = 1/|A| for all j). The choice of a specific OWA operator is a matter of design.
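A minimal Python sketch of the OWA operator follows, with the possibility degrees sorted in ascending order before being combined with the weight vector, so that the weight choices of Remark 8 reproduce the minimum and the mean; the function and variable names are our own assumptions.

def owa(possibility_degrees, weights):
    """Ordered Weighted Average: sort the degrees in ascending order and
    take the weighted sum with the given weight vector (summing to 1)."""
    ordered = sorted(possibility_degrees)
    return sum(p * w for p, w in zip(ordered, weights))

degrees = [0.8, 1.0, 1.0]
print(owa(degrees, [1.0, 0.0, 0.0]))    # minimum -> 0.8
print(owa(degrees, [1/3, 1/3, 1/3]))    # mean    -> about 0.93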

Remark 9. The compatibility degree Kω(c, D(o)) between a profile component c and an item description D(o) can be defined in terms of the compatibility degree between a couple of Attribute-Value sets introduced in Definition 13.

Generally speaking, a user profile is compatible with an item if at least one of its profile components is compatible with the item. Since we are dealing with fuzzy evaluations, it is necessary to refer to the maximum compatibility degree evaluated for each profile component.

Definition 14. The compatibility degree between a profile p and an item o is defined as the maximum compatibility degree of the profile components:

Kω(p, D(o)) = max_{c∈p} Kω(c, D(o)).

Example 5. It is possible to evaluate the compatibility degree between the user profile defined in Example 4 and the item description reported in Example 2. The compatibility degree is equal to the maximum compatibility degree between one of its profile components (namely c1 and c2) and the item description. By considering the profile component c1, the evaluation of the possibility measures among the fuzzy sets defined for the attributes «Scope», «Complexity» and «Fruition time» is illustrated in Fig. 4, with the assistance of the graphical representations of the involved fuzzy sets.

Fig. 4. Evaluation of the possibility measures among fuzzy sets


By adopting the minimum function as the OWA aggregation function, the compatibility degree Kω(c1, LO) between the profile component c1 and the item can be properly evaluated as:

Kω(c1, LO) = ω(0.8, 1, 1) = 0.8.

An analogous process can be performed with reference to the profile component c2, yielding the compatibility degree:

Kω(c2, LO) = ω(0, 1) = 0.

According to Definition 14, the final degree of compatibility between the «secretary» user profile and the item is equal to:

max(Kω(c1, LO), Kω(c2, LO)) = max(0.8, 0) = 0.8.
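The computation of Example 5 can be reproduced with the short Python sketch below. For brevity only the discrete attributes of c1 and c2 are considered (the continuous «Fruition time» attribute, whose possibility degree equals 1 in the example, is omitted) and the minimum is adopted as the OWA operator; all names are illustrative assumptions.

def possibility(f1, f2):
    """Possibility degree between two discrete fuzzy sets (Definition 11):
    sup over the domain of the minimum of the two membership degrees."""
    domain = set(f1) | set(f2)
    return max(min(f1.get(a, 0.0), f2.get(a, 0.0)) for a in domain)

def compatibility(component, description):
    """Compatibility degree K(c, D(o)) using the minimum as OWA operator,
    computed over the attributes shared by component and description."""
    degrees = [possibility(component[attr], description[attr])
               for attr in component if attr in description]
    return min(degrees) if degrees else 0.0

lo1 = {"Complexity": {"Easy": 1.0, "Average": 0.3, "Expert": 0.0},
       "Scope": {"ICT": 0.7, "Word Processing": 1.0}}
c1 = {"Complexity": {"Easy": 1.0, "Average": 1.0, "Expert": 0.5},
      "Scope": {"ICT": 0.5, "Word Processing": 0.8}}
c2 = {"Complexity": {"Easy": 0.5, "Average": 1.0, "Expert": 1.0},
      "Scope": {"Management": 1.0}}

# Definition 14: the profile-item compatibility is the maximum over components.
print(max(compatibility(c, lo1) for c in (c1, c2)))  # 0.8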

5.2 Users and User Profiles

The items are intended to be demanded by users. Each user can be associated with multiple profiles: these associations are characterised by fuzzy membership degrees. Three kinds of profiles have been conceived in our profile modelling approach:

1. competence profiles (characterising the users in terms of their specific roles or working activities);

2. preference profiles (characterising the users in terms of their specific choices during the interaction with the system);

3. acquaintance profiles (characterising the users in terms of the specific information they have collected during the interaction with the system).

In any case, the structure of the profiles is the same as defined in the previous section for all the above specified categories. A user can be defined in terms of the membership degree with reference to a profile base. Let U be a non-empty set of users.

Definition 15. A user is an element u in the set of users U , i.e., u ∈ U .

Definition 16. A profile base is a subset P of the profile space P, i.e. P ⊆ P.

Let P ⊆ P be a profile base and let u ∈ U be a user.

Definition 17. The description of the user u is defined by the fuzzy set:

DP (u) : p ∈ P −→ [0, 1]

Example 6. It could be possible to further detail the scenario illustrated in Example 4 by supposing that the «secretary» user profile may be compatible with some other user profile (possibly corresponding to some other working function). As an example, we could think of a person inside a company who plays the different roles of secretary and (to a lesser extent) of tax consultant. In the context of the profile modelling approach, such a user u is represented by the following description:

D(u) = {“secretary”/0.8, “tax consultant”/0.2}.

The above formalisation is based on the assumption that there exist both the user profile «secretary» and the user profile «tax consultant»: the latter may be described in a similar way as illustrated in Example 4.

The compatibility degree between a user and an item can be defined on the basis of the compatibility degree between the description of the item and the profiles associated with the user. In practice, several degrees of compatibility should be taken into account, weighted by the user membership degrees with respect to the profile base. Let u ∈ U be a user and let P ⊆ P be a profile base.

Definition 18. The compatibility degree between the description of the user u and the description of the item o ∈ O is defined as:

Kω(DP(u), D(o)) = max_{p∈P} min{Kω(p, D(o)), DP(u)(p)}.

Example 7. With reference to Example 6, the compatibility degree between the user and the item can be evaluated through the maximum compatibility degree between the item and the user profiles (namely, the «secretary» and the «tax consultant» profiles), each bounded by the corresponding membership degree. Concerning the «secretary» profile, we have already evaluated its compatibility degree with the item, which is equal to 0.8. By supposing a compatibility degree equal to 0.1 for the (undefined) «tax consultant» profile, the ultimate compatibility degree between the user and the item would be equal to 0.8, i.e. the maximum value of the profile compatibility degrees.

Finally, it is possible to formalise the different roles of the previously specified profile categories for the user characterisation. Let u ∈ U be a user and let C (Competence), A (Acquaintance) and P (Preference) be three profile bases.

Definition 19. The compatibility degree between the description of the user u and the description of the item o ∈ O is defined as:

Kω(u, o) = min{ max{Kω(DP(u), D(o)), Kω(DC(u), D(o))}, 1 − Kω(DA(u), D(o))}. (5)

The relationship expressed by (5) represents the logical property associating an item to a specific user on the basis of her competence, the preferences she has expressed during the interaction process and the items she has had the opportunity to get acquainted with. Specifically, relationship (5) expresses the logical property that associates an item to a user if the latter has competence or preference on the item, but is not yet acquainted with it.
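Definitions 18 and 19 can be sketched in Python as follows, assuming that the per-profile compatibility degrees with the item have already been computed; the numbers reuse Examples 6 and 7 where available, while the competence and acquaintance values are purely hypothetical.

def user_item_compatibility(profile_memberships, profile_item_compat):
    """Definition 18: max over profiles of min(profile-item compatibility,
    membership of the user in that profile)."""
    return max(min(profile_item_compat[p], membership)
               for p, membership in profile_memberships.items())

# Examples 6 and 7: the user is a secretary (0.8) and a tax consultant (0.2).
preference = user_item_compatibility(
    {"secretary": 0.8, "tax consultant": 0.2},
    {"secretary": 0.8, "tax consultant": 0.1})
print(preference)  # 0.8

# Definition 19: combine the preference, competence and acquaintance degrees.
# The competence and acquaintance values below are purely hypothetical.
competence, acquaintance = 0.6, 0.1
k_user_item = min(max(preference, competence), 1 - acquaintance)
print(k_user_item)  # 0.8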


6 Defining the System Architecture

The main issue in designing a system conforming to the model discussed in the previous sections is to deal with its very general nature. Since the model provides a tool to associate items to users regardless of the context of application, the architecture must also reflect this focal point. One can imagine the need of such a mechanism of association in an e-learning system, or in an e-commerce platform, and so on. The main concern is to develop a component that can be used as a service provider by existing systems, so that the integration effort can be minimized. The proposed architecture has three layers, each of them related to a specific function:

1. a Frontend acting as a request dispatcher;
2. a Backend layer dealing with the implementation of the model;
3. a Persistence Abstraction Layer dealing with data stored on the physical system (i.e. item and profile descriptions).

The first one accepts incoming requests for services and sends back the result of the system computation, the second one performs operations according to the incoming requests and the third one is responsible for the management of database transactions. Each component offers an interface used by the other components in the interaction. In Fig. 5 an overview of the system components is provided.

Fig. 5. A general overview of the system architecture. The system is highlighted in the boxed area. Each external component, namely databases and service consumers, has a label reflecting its stereotype.
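As an illustration of how the three layers and their interfaces could be laid out, the following Python sketch defines one minimal class per layer; all class and method names are hypothetical and are not taken from the actual prototype.

class PersistenceAbstractionLayer:
    """Retrieves stored descriptions and translates them to the internal format."""
    def load_user_description(self, user_id):
        ...  # query the users database and translate the result
    def load_item_descriptions(self):
        ...  # query the items database and translate the results

class BackendLayer:
    """Implements the matching model on data in the internal format."""
    def __init__(self, persistence):
        self.persistence = persistence
    def rank_items_for_user(self, user_id):
        ...  # compute compatibility degrees and return a ranked list of items

class FrontendLayer:
    """Decodes incoming service requests and encodes the results."""
    def __init__(self, backend):
        self.backend = backend
    def handle_request(self, soap_envelope):
        ...  # decode the envelope, forward to the backend, encode the reply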


Fig. 6. The architecture of the Frontend Layer. The bounded region identifies the boundary of the Frontend Layer.

6.1 The Frontend Layer

The task of this layer is to provide an external interface to the system. Requests from service consumers are decoded and forwarded to the Backend layer, while the result of the processing performed by the system is encoded and sent back to the requester. Due to the need of designing a Service Provider, a Service Oriented Architecture paradigm has been chosen. There are several advantages with this approach:

1. the use of a mature protocol for communications between service provider and service consumer;

2. the use of an easy, up-to-date architecture;
3. the implementation of a platform-independent system.

To manage requests for system services the SOAP protocol has been chosen (for details on the SOAP specification see http://www.w3.org/TR/soap/). Every request incoming from clients is encapsulated into a SOAP envelope and delivered to the system. Every envelope has a standard format with a header and a body section. An envelope conveys information over the net through the HTTP protocol, its body embedding both information on the service request and the data to deal with. A Dispatcher component is responsible for receiving these requests and sending back the replies. An Encoder component is responsible for encoding and decoding, respectively, outgoing and incoming messages. A Forwarder component is responsible for forwarding the decoded request to the Backend layer. A result is encapsulated into a SOAP envelope and sent back to the service consumer. A diagram showing the architecture of this layer is shown in Fig. 6.

6.2 The Backend Layer

The task of this layer is to take care of the computational effort of the system. It provides the mechanism to associate items to the user conforming to the model formalized in the previous sections. Knowledge about matching strategies, the internal description of the objects involved in the matching process and the fuzzy operators is possessed by the components in this layer. This layer is uncoupled from the other layers and operates on data translated into an appropriate internal format. This makes possible the realization of the general purpose matching strategy formalized by the model. A Matcher component is responsible for the association between user and item descriptions, while a Fuzzy Inference Engine component must take care of the semantics expressed by the fuzzy operators presented in the model. Other components can be inserted in this layer, one for each further service exposed by the system. In Fig. 7 the architecture of this layer is shown.

Fig. 7. The architecture of the Backend Layer. The bounded region identifies the boundary of the Backend Layer.

6.3 The Persistence Abstraction Layer

This layer is responsible for database connections and transactions. In this layer there are components that deal with the conversion of data between the format in which they are physically stored in databases and an internal format that the system can process. The rationale underlying this choice is strictly connected to the need of providing a general way to process information regardless of the format in which it is stored. It may be possible to use a relational database, an XML file or any other support to store the data of users and items. A Translator component is responsible for the adaptation of data between this layer and the Backend Layer above. A mechanism to uncouple the Persistence Abstraction Layer implementation from the underlying database has been designed. For this reason the responsibility to interact with the database is delegated to only one component. This component, namely the Data Access Manager, has the knowledge about the format in which data are stored on the physical system and about the mechanism to retrieve them. Fig. 8 shows the architecture of this layer.

Fig. 8. The architecture of the Persistence Abstraction Layer. Connections to both the items and the users databases are shown. The bounded region identifies the boundary of the Persistence Abstraction Layer.


7 Evaluating the Prototype

The implemented prototype has been tested in order to evaluate some remarkable characteristics. The main subject of interest was to test how much the system conforms to the model. In order to inspect this focal point two aspects have been considered: functionality and efficiency. Functionality measures how well the software satisfies the needs expressed in the analysis phase, while efficiency measures the association time between users and items.

A testing environment has been built by populating the item database with learning objects (hereafter LOs) and the user description database with user profiles (hereafter UPs). The sets of metadata describing both UPs and LOs were bound to have a non-empty intersection, so that at least one attribute in the UP descriptions could match its corresponding one in the LO descriptions. At the end of the building process the item database stored five items representing LOs whose sets of describing metadata had various cardinalities. In the same way the user description database stored nine UPs with metadata sets of various cardinalities.

The test phase consisted of the computation of a score for each of the subjects of the above-mentioned analysis. To inspect the functionality of the system, the Semantic Consistency of the Matching Operator (hereafter SCMO) indicator has been defined. The efficiency is computed by means of the Profile-Items Association indicator (hereafter PIA), which measures the time the system needs to perform an association between a user description and a set of items. Of this indicator, the average value (PIA_AVG) and the standard deviation of the values (PIA_STD) over a battery of tests have been considered.

7.1 The Estimation of the SCMO Indicator

To estimate the SCMO the following process has been defined:

1. a set I of items and a set U of users are considered;
2. for each user in the set we manually define a list of items, ordered with an empirical criterion that estimates the order of preferences on the basis of the semantics of the item descriptions;
3. an association test with the system is performed for each user in the set, so that a set of ordered lists of items is obtained;
4. differences in element ordering between the manually defined and the system-obtained lists are evaluated, by assigning a score Si with respect to the successful comparisons;
5. the average of the scores is evaluated in order to obtain the value of the SCMO indicator by means of the formula

   SCMO = (Σ_{i=1}^{|U|} Si) / |U|


After the testing phase the value of this indicator was estimated to be SCMO ≈ 86%, showing a high index of functionality.
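A possible way to compute the SCMO indicator is sketched below in Python. Since the chapter does not fix how the per-user score Si is derived from the comparison of the two rankings, the pairwise-agreement score used here is our own assumption.

from itertools import combinations

def ranking_agreement(reference, obtained):
    """Hypothetical score Si: fraction of item pairs whose relative order in the
    system-obtained ranking agrees with the manually defined reference ranking."""
    pos = {item: i for i, item in enumerate(obtained)}
    pairs = list(combinations(reference, 2))
    agreed = sum(1 for a, b in pairs if pos[a] < pos[b])
    return agreed / len(pairs)

def scmo(reference_rankings, obtained_rankings):
    """SCMO = average of the per-user scores Si over the |U| test users."""
    scores = [ranking_agreement(r, o)
              for r, o in zip(reference_rankings, obtained_rankings)]
    return sum(scores) / len(scores)

print(scmo([["LO1", "LO2", "LO3"]], [["LO1", "LO3", "LO2"]]))  # about 0.67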

7.2 The Estimation of the PIA Indicators

To estimate the efficiency the following process has been defined:

1. a set of resources and a set U of users, of cardinality |U|, are considered;
2. an association test with the system is performed for each user in the set and the association times Ti are registered;
3. the average time

   T̄ = (Σ_{i=1}^{|U|} Ti) / |U|

   is evaluated;
4. the PIA_AVG value is evaluated with the formula:

   PIA_AVG = 100 · exp(−a · T̄) (6)

   where the parameter a is obtained with the formula a = ln(2)/T_FAIR and T_FAIR = 500 msec, so that (6) yields a value of 50 when T̄ = T_FAIR; T_FAIR is the maximum time considered acceptable for the system to provide a result;
5. the standard deviation of the times

   S = sqrt( Σ_{i=1}^{|U|} (Ti − T̄)² / |U| )

   is evaluated;
6. the PIA_STD value is evaluated with the formula:

   PIA_STD = 100 · exp(−b · S) (7)

   where the parameter b is obtained with the formula b = ln(2)/S_FAIR and S_FAIR = 100 msec, so that (7) attains its maximum value of 100 when the measured times show no dispersion; S_FAIR is the maximum dispersion of times considered acceptable for the system to provide a result.

After the testing phase the values of these indicators were estimated to be PIA_AVG = 63 and PIA_STD = 91, showing a high index of efficiency.
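The two PIA indicators can be computed as in the following Python sketch, which assumes the formulas reconstructed above (including the factor 100 in (7)); the timing values are purely illustrative.

import math

T_FAIR = 500.0   # msec, maximum acceptable average association time
S_FAIR = 100.0   # msec, maximum acceptable dispersion of association times

def pia_indicators(times_ms):
    """Return (PIA_AVG, PIA_STD) for a list of measured association times."""
    n = len(times_ms)
    mean = sum(times_ms) / n
    std = math.sqrt(sum((t - mean) ** 2 for t in times_ms) / n)
    pia_avg = 100.0 * math.exp(-math.log(2) / T_FAIR * mean)
    pia_std = 100.0 * math.exp(-math.log(2) / S_FAIR * std)
    return pia_avg, pia_std

# Purely illustrative timings (in milliseconds).
print(pia_indicators([310, 340, 320, 335, 330]))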


8 Conclusions

In this contribution a profile modelling approach has been proposed, to be applied in every context in which a system has to provide an item to a user on the basis of an estimated preference. The peculiarity of the illustrated approach consists in the employment of fuzzy logic for modelling the descriptions of the items to be provided by a system and of the user profiles. In this way, it is possible to formalise a mathematical scheme of metadata to describe simple as well as complex attributes characterised by collective and imprecise properties. That is done by defining a fuzzy set over each attribute, so that a fuzzy attribute valorisation can be determined. Moreover, the profiling mechanism benefits from the use of fuzzy membership values, since each user can be partially associated with more than a single profile. Finally, the adoption of fuzzy operators provides further association mechanisms, enabling the evaluation of compatibility degrees, which constitute the basis for building up a ranking of items to be associated with a specific user. A system architecture based on this model has also been designed. The aim of providing a general system is reflected by the use of a Service Oriented Architecture for the design of a service provider component. A test of a prototype has also been carried out with respect to the evaluation of functionality and efficiency. The results show high values for the defined indexes. Future work is to be addressed to a more comprehensive study of the fuzzy operators involved in the association mechanisms, in order to define the most suitable functions for modelling the different semantics of the personalisation process. In fact, the model currently considers only a possibilistic semantics associated with the compatibility among the metadata describing attributes. As a future direction, the veristic semantics [18] should also be explored to provide a more flexible way to express the relationships occurring among metadata.

References

1. Riecken, D.: Introduction: personalized views of personalization. Communications of the ACM 43(8), 26–28 (2000)

2. Eirinaki, M., Vazirgiannis, M.: Web mining for web personalization. ACM Transactions on Internet Technology 3(1), 1–27 (2003)

3. De Bra, P., Brusilovsky, P., Houben, G.: Adaptive hypermedia: from systems to framework. ACM Computing Surveys, Article No. 12, 31(4es) (1999)

4. Neven, F., Duval, E.: Reusable learning objects: a survey of LOM-based repositories. In: MULTIMEDIA 2002: Proceedings of the tenth ACM international conference on Multimedia, pp. 291–294. ACM, New York (2002)

5. Azvine, B., Wobcke, W.: Human-centred intelligent systems and soft computing. BT Technology Journal 16(3), 125–133 (1998)

6. Frías-Martínez, E., Magoulas, G., Chen, S., Macredie, R.: Recent soft computing approaches to user modeling in adaptive hypermedia. In: De Bra, P.M.E., Nejdl, W. (eds.) AH 2004. LNCS, vol. 3137, pp. 104–114. Springer, Heidelberg (2004)

7. Brusilovsky, P., Millán, E.: User models for adaptive hypermedia and adaptive educational systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 3–53. Springer, Heidelberg (2007)


8. Kavcic, A.: Fuzzy user modeling for adaptation in educational hypermedia. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34(4), 439–449 (2004)

9. Martinovska, C.: A fuzzy-based approach to user model refinement in adaptive hypermedia systems. In: De Bra, P., Brusilovsky, P., Conejo, R. (eds.) AH 2002. LNCS, vol. 2347, pp. 411–414. Springer, Heidelberg (2002)

10. Kosba, E., Dimitrova, V., Boyle, R.: Using fuzzy techniques to model students in web-based learning environments. In: Palade, V., Howlett, R.J., Jain, L. (eds.) KES 2003. LNCS, vol. 2773, pp. 222–229. Springer, Heidelberg (2003)

11. Suryavanshi, B.S., Shiri, N., Mudur, S.P.: A fuzzy hybrid collaborative filtering technique for web personalization. In: Proceedings of the 3rd Workshop on Intelligent Techniques for Web Personalization (ITWP 2005), pp. 1–8 (2005)

12. John, R.I., Mooney, G.J.: Fuzzy user modeling for information retrieval on the world wide web. Knowl. Inf. Syst. 3(1), 81–95 (2001)

13. Yager, R.R.: Fuzzy logic methods in recommender systems. Fuzzy Sets and Systems 136, 133–149 (2003)

14. Popp, H., Lödel, D.: Fuzzy techniques and user modeling in sales assistants. User Modeling and User-Adapted Interaction 5(3), 349–370 (1995)

15. Zadeh, L.: Fuzzy sets. Information and Control 8, 338–353 (1965)

16. Zadeh, L.: A note on web intelligence, world knowledge and fuzzy logic. Data & Knowledge Engineering 50, 291–304 (2004)

17. Dubois, D., Prade, H.: Possibility theory: an approach to computerized processing of uncertainty. Plenum Press (1988)

18. Yager, R.R.: Veristic variables. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 30, 71–84 (2000)

19. Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man and Cybernetics 18, 183–190 (1988)


Author Index

Bux, Massimo 27

Castellano, Giovanna 1, 65
Castiello, Ciro 119

de Gemmis, Marco 27
Dell’Agnello, Danilo 119

Fanelli, Anna Maria 1, 119

Garofalakis, John 49
Giannakoudi, Theodoula 49

Jain, Lakhmi C. 1

Lops, Pasquale 27
Lousame, Fabian P. 81

Mencar, Corrado 119
Musto, Cataldo 27

Narducci, Fedelucio 27

Sanchez, Eduardo 81
Semeraro, Giovanni 27

Torsello, Maria Alessandra 1, 65