
142 August 2000/Vol. 43, No. 8 COMMUNICATIONS OF THE ACM

Automatic Personalization Based on Web Usage Mining

Bamshad Mobasher, Robert Cooley, and Jaideep Srivastava

Web usage mining can help improve the scalability, accuracy, and flexibility of recommender systems.

The ease and speed with which business transactions can be carried out over the Web have been a key driving force in the rapid growth of e-commerce. The ability to track user browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before. It is now possible for vendors to personalize their product messages for individual customers on a massive scale, a phenomenon referred to as “mass customization.” Of course, this type of personalization is applicable to any Web browsing activity, not just e-commerce. Web personalization can be defined as any action that tailors the Web experience to a particular user, or set of users. The experience can be something as casual as browsing a Web site or as (economically) significant as trading stocks or purchasing a car. The actions can range from simply making the presentation more pleasing to anticipating the needs of a user and providing customized information.

To date, most personalization systems for the Web have fallen into three major categories: manual decision rule systems, collaborative filtering systems, and content-based filtering agents. Manual decision rule systems, such as Broadvision (www.broadvision.com), allow Web site administrators to specify rules based on user demographics or static profiles (collected through a registration process), or session history. The rules are used to affect the content served to a particular user. Collaborative filtering systems, such as Firefly [11] and Net Perceptions (www.netperceptions.com), typically take explicit information in the form of user ratings or preferences, and, through a correlation engine, return information that is predicted to closely match the users’ preferences. Content-based filtering approaches such as those used by WebWatcher [5] rely on content similarity of Web documents to personal profiles obtained explicitly or implicitly from users.

Increasingly, the new generation of Web personalization tools is attempting to incorporate techniques for pattern discovery from Web usage data. For example, some collaborative filtering systems such as Net Perceptions are experimenting with obtaining implicit user ratings from usage data. Web usage mining systems run any number of data mining algorithms on usage or clickstream data gathered from one or more Web sites in order to discover user profiles. The increasing focus on Web usage data is due to several factors. The input is not a subjective description of the users by the users themselves, and thus is not prone to biases. The profiles are dynamically obtained from user patterns, and thus the system performance does not degrade over time as the profiles age. Furthermore, using content similarity alone as a way to obtain aggregate profiles may result in missing important semantic relationships among Web objects. Thus, Web usage mining can reduce the need for obtaining subjective user ratings or registration-based personal preferences.

Mining Usage Data for Web Personalization

Principal elements of Web personalization include the modeling of Web objects (for example, products or pages) and subjects (users), categorization of objects and subjects, matching between and across objects and/or subjects, and determination of the set of actions to be recommended for personalization. As depicted in Figure 1, the overall process of usage-based Web personalization is divided into two components. The offline component is comprised of the data preparation and specific usage mining tasks. The data preparation tasks result in a server session file, where each session is a sequence of pageviews each represented by a unique Uniform Resource Identifier (URI) reference attributed to a particular user. In addition, only URIs that represent meaningful or relevant pageviews are included in a server session file (see the sidebar “Data Preparation for Web Usage Mining”). The usage mining tasks can involve the discovery of association rules, sequential patterns, pageview clusters, user clusters, or any other pattern discovery method. The discovered patterns are used by the online component to provide personalized content to users based on their current navigational activity. The personalized content can take the form of recommended links or products, targeted advertisements, or text and graphics tailored to the user’s preferences. The Web server keeps track of the active server session as the user’s browser makes HTTP requests. The recommendation engine considers the active server session in conjunction with the discovered patterns to provide personalized content.

Figure 1. A general architecture for usage-based Web personalization. [Diagram labels. Batch process: Data Preparation (Usage Preprocessing: data cleaning, session identification, pageview identification, episode identification, support filtering), Server Logs, Site Files, Domain Knowledge, Session/Episode File, Usage Mining (session clustering, pageview clustering, association rule discovery), Usage Profiles, Frequent Itemsets. Online process: Recommendation Engine, HTTP Server, Active Session, Recommendations, Client Browser.]

Data preparation. The prerequisite step to any type of usage mining is the identification of a set of server sessions from the raw usage data. Ideally, each server session gives an exact accounting of who accessed the Web site, what pages were requested and in what order, and how long each page was viewed. Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into various data abstractions (see the sidebar “Data Preparation for Web Usage Mining”). The practical difficulties in performing preprocessing are a moving target. As the technology used to deliver content over the Web changes, so do the preprocessing challenges. While each of the basic preprocessing steps remains constant, the difficulty in completing certain steps has changed dramatically as Web sites have moved from static HTML served directly by a Web server to dynamic scripts created from sophisticated content servers and personalization tools. Both client-side tools (browsers) and server-side tools (content servers) have undergone several generations of improvements since the inception of the Web.

Figure 2. Main page for the demonstration site. Initially, no recommendations are provided as the active user session does not contain a sufficient number of references.

Figure 3. Dynamic recommendations after the user has navigated through the “President’s Column” and “Online Archives” pages.

Figure 4. The system provides specific recommendations related to conferences based on user navigation through the “Conference Update,” “Call for Papers,” and “Asia Pacific Conference” pages.

Discovery of usage profiles. The session file obtained in the data preparation stage can be used as the input to a variety of data mining algorithms such as the discovery of association rules or sequential patterns, clustering, and classification. At this point in the process, the results of the pattern discovery can be tailored toward several different aspects of Web usage mining. For example, Perkowitz and Etzioni [8] have proposed the idea of dynamically creating multiple index pages for a site based on co-occurrence patterns of pages among user sessions. Schechter et al. [10] have developed techniques for using the path profiles of users to predict future HTTP requests, which can be used for network and proxy caching. Spiliopoulou et al. [9], Cooley et al. [2], and Buchner and Mulvenna [1] have applied data mining techniques to extract usage patterns from Web logs for the purpose of deriving marketing intelligence. Shahabi et al. [12] and Nasraoui et al. [6] have proposed clustering of user sessions to predict future user behavior.

However, the discovery of patterns from usage data by itself is not sufficient for performing the personalization tasks. The critical step is the effective derivation of good quality and useful (that is, actionable) “aggregate profiles” from these patterns. Ideally, profiles capture aggregate views of the behavior of subsets of users based on their interests and/or information needs. In particular, aggregate profiles must exhibit three important characteristics. They should:

• Capture possibly overlapping interests of users, since many users may have common interests up to a point (in their navigational history) beyond which their interests diverge;

• Provide the capability to distinguish among pageviews in terms of their significance within the profile; and

• Have a uniform representation that allows for the recommendation engine to easily integrate different kinds of profiles (multiple profiles based on different pageview types, or obtained via different mining techniques).

Given these requirements, we have found that representing usage profiles as weighted collections of URIs provides a great deal of flexibility. Each item in a usage profile is a URI uniquely representing a relevant pageview, and can have an associated weight representing its significance within the profile. The usage profiles can be viewed as ordered collections (if the goal is to capture the navigational path profiles followed by users [9]), or as unordered collections (if the focus is on capturing associations among specified content or product pages). Based on the information collected for each pageview during preprocessing, other types of constraints can also be imposed on profiles (for example, we may wish to focus the personalization effort only on certain types of products or pages related to specific content categories). Another advantage of this representation is that the profiles themselves can be viewed as vectors, thus facilitating the task of matching a current user session with similar profiles using standard vector operations.

Table 1. User behavior profiles.

Profile    Weight  Pageview URI
Profile 1  0.78    Call for Papers
           0.67    CFP: ACR 1999 Asia-Pacific Conference
           0.64    CFP: Society For Consumer Psychology Conference
           0.61    ACR 1999 Annual Conference
           0.55    CFP: ACR 1999 European Conference
           0.52    CFP: Int'l Conference on Marketing and Development
           0.50    Conference Update
Profile 2  0.82    CFP: Journal of Psychology and Marketing II
           0.71    CFP: Society For Consumer Psychology Conference
           0.68    Conference Update
           0.68    CFP: Journal of Consumer Psychology II
           0.56    CFP: Conference on Gender, Marketing and Consumer Behavior
           0.52    Online Archives

Table 2. Commonly used data abstractions for Web usage mining.

Term            Definition
User            Single individual that is accessing files from one or more Web servers through a browser.
Page File       File that is served through HTTP protocol to a user.
Pageview        Set of page files that contribute to a single display in a Web browser.
Server Session  Set of pageviews served due to a series of HTTP requests from a single user to a single Web server.
Episode         Subset of pageviews from a single server session.
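To make the weighted-URI representation concrete, the following is a minimal Python sketch, not the authors' implementation. The URI paths are hypothetical stand-ins for pages like those in Table 1, the binary session weights are an assumption, and cosine similarity is used only as one example of a "standard vector operation."

```python
from math import sqrt

# Hypothetical profile: URI -> weight. The paths are invented for illustration;
# the weights mirror three Profile 1 rows of Table 1.
profile = {
    "/cfp.html": 0.78,
    "/cfp-asia-pacific.html": 0.67,
    "/conference-update.html": 0.50,
}

def to_vector(weights, uri_index):
    """Lay out a weighted URI collection as a dense vector over uri_index."""
    return [weights.get(uri, 0.0) for uri in uri_index]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# The dimensions are the URIs appearing in the session file (assumed list).
uri_index = [
    "/cfp.html",
    "/cfp-asia-pacific.html",
    "/conference-update.html",
    "/archives.html",
]

# A current session with binary weights (1.0 = visited); an assumption.
session = {"/cfp.html": 1.0, "/conference-update.html": 1.0}

profile_vec = to_vector(profile, uri_index)
session_vec = to_vector(session, uri_index)
score = cosine(session_vec, profile_vec)
```

Because profiles and sessions share one vector layout, profiles obtained from different mining techniques can be compared with the same function, which is the uniformity the third requirement above asks for.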

Traditional collaborative filtering techniques are often based on real-time matching of the current user’s profile against similar records (nearest neighbors) obtained by the system over time from other users. However, as noted in recent studies [7], it becomes hard to scale collaborative filtering techniques to a large number of items (for example, pages or products), while maintaining reasonable prediction performance and accuracy. Part of this is due to the increasing sparsity in the data as the number of items increases. One potential solution to this problem is to first cluster user records with similar characteristics, and focus the search for nearest neighbors only in the matching clusters. In the context of Web personalization, this task involves clustering user sessions identified in the preprocessing stage.

A variety of clustering techniques can be used for clustering similar sessions based on occurrence patterns of URI references. User sessions can be mapped into a multidimensional space as vectors of URI references (so the dimensions, or features, are the URIs appearing in the session file). Standard clustering algorithms generally partition this space into groups of items that are close to each other based on a measure of distance or similarity. Dimensionality reduction techniques may be employed to focus only on relevant or significant features. For example, support filtering discussed earlier (see the sidebar “Data Preparation for Web Usage Mining”) can provide an effective dimensionality reduction method while actually improving clustering results. Ideally, each cluster represents a group of users with similar navigational patterns. However, session clusters by themselves are not an effective means of capturing an aggregated view of common user profiles. Each session cluster may potentially contain thousands of user sessions involving hundreds of URI references. In our Web usage mining framework, the ultimate goal in clustering user sessions is to obtain actionable usage profiles which, as noted previously, can be represented as weighted collections of URIs. We discuss one method for obtaining useful profiles from session clusters in the discussion of the WebPersonalizer.

The representation of user sessions as vectors of URI references can provide a number of advantages and a great deal of flexibility. For instance, the distance or similarity among sessions can be computed using standard vector operations. Furthermore, depending on the goals of Web usage mining, a variety of weights can be chosen for each URI in a session vector. Weights can be based on the amount of time users spend on pages referenced by each URI, or they can be based on prior domain knowledge specified by the site owner (for example, in an online catalog, the site owner may wish to weigh product pages referenced by URIs more heavily than other informational pages within the site).

For example, consider the two usage profiles derived from session clusters of the site for Association for Consumer Research shown in Table 1 (also see the sidebar “Experiments with the WebPersonalizer System”). In Table 1, Profile 1 captures the behavior of users interested in current and upcoming conferences during 1999 related to consumer research. On the other hand, Profile 2 captures the behavior of users who are more specifically interested in conferences and journals related to consumer psychology. Note that the behavior of a single user may match both profiles during the same or different sessions.

Sidebar: Data Preparation for Web Usage Mining

When dealing with server-side data collection from a Web server log or packet-sniffer, the major difficulties in usage preprocessing are due to the incompleteness of the available data. The required high-level tasks are data cleaning, user identification, session identification, pageview identification, and path completion. In addition, episode identification can be optionally performed as a final preprocessing step prior to pattern discovery. Some amount of content and structure preprocessing is almost always necessary. A list of commonly used terms associated with preprocessing is shown in Table 2; the figure appearing here summarizes the preprocessing steps.

Data cleaning is usually site-specific, and involves tasks such as merging logs from multiple servers, removing graphics file accesses, and parsing of the logs. The difficulties involved in identifying users and sessions depend greatly on the server-side technologies used for the Web site. For Web sites using cookies or embedded session IDs, user and session identification is trivial. Web sites without the benefit of additional information for user and session identification must rely on heuristics, such as those presented in [2].

Pageview identification is the task of determining which page file accesses contribute to a single pageview, and is heavily dependent on the intrapage structure. For a single frame site, each HTML file has a one-to-one correlation with a pageview. However, for multiframed sites, several files make up a given pageview. Without detailed site structure information, it is very difficult, if not impossible, to infer pageviews from a Web server log. Not all pageviews are relevant for specific mining tasks, and among the relevant pageviews some may be more significant than others. The significance of a pageview may depend on usage, content and structural characteristics of the site, as well as on prior domain knowledge specified by the site designer and the data analyst. For example, in an e-commerce site pageviews corresponding to product-oriented events (for example, shopping cart changes or product information views) may be considered more significant than others. Similarly, in a site designed to provide content, content pages may be weighted higher than navigational pages. In order to provide a flexible framework for a variety of data mining activities a number of attributes must be recorded with each pageview. These attributes include the pageview id (normally a URI uniquely representing the pageview), duration, static pageview type (information page, product view, index page, and so forth), and other metadata, such as content attributes.

Path completion involves inferring cached pageviews based on the referring information from the logs. The extent of the problem created by caching is dependent on both the server-side and client-side technologies. Dynamic content with unique URIs for each server session (an embedded session ID is a common method for making URIs unique) will not be subject to proxy level caching. However, any type of content can be cached at the client level. The amount of client level caching is set by the client-side browser.

It is also possible to obtain a further level of granularity by identifying episodes within the sessions [2] (episodes are referred to as transactions in [2]). The goal of episode identification is to dynamically create meaningful clusters of references for each user, based on an underlying model of the user’s browsing behavior. This allows each page reference to be categorized as a content or navigational reference for a particular user. Content references can be further classified according to page types or the type of user activity (for example, product purchases). For the purpose of this article, however, we focus on server sessions as the units of user activity to which various data mining techniques are applied.

Content preprocessing consists of converting the text, image, scripts, and other multimedia files into forms that are useful for the Web Usage Mining process. Often, this consists of performing content mining such as classification or clustering. While applying data mining to the content of Web sites is an interesting area of research in its own right, in the context of Web Usage Mining, the content of a site can be used to filter the input to, or output from the pattern discovery algorithms. For example, results of a classification algorithm could be used to limit the discovered patterns to those containing pageviews about a certain subject or class of products. In addition to classifying or clustering pageviews based on topics, pageviews can also be classified according to their intended use. Pageviews can be intended to convey information (through text, graphics, or other multimedia), gather information from the user, allow navigation (through a list of hypertext links), or some combination of uses. The intended use of a pageview can also filter the sessions before or after pattern discovery.

The structure of a site is created by the hypertext links between pageviews. The structure can be obtained and preprocessed in the same manner as the content of a site. This is necessary for handling pageviews that have multiple frames, dynamic pages that have the same template name for multiple pageviews, as well as dealing with extraneous references such as to image or sound files. It may also be necessary to filter the log files by mapping the references to the site topology induced by physical links between pages. This is particularly important for usage-based personalization, since the recommendation engine should not provide dynamic links to “out-of-date” or non-existent pages.

Finally, the session file can be filtered by removing very small server sessions or episodes, and low-support URI references (references to those URIs that do not appear in a sufficient number of sessions). This type of support filtering can be useful in eliminating noise from the data, such as that generated by shallow navigational patterns of “non-active” users, and URI references with minimal knowledge value for the purpose of personalization.

Figure. Summary of the preprocessing steps. [Diagram labels: Raw Usage Data; Data Cleaning; User/Session Identification; Pageview Identification; Path Completion; Server Session File; Episode Identification; Episode File; Site Structure and Content; Usage Statistics.]
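The support filtering step described in the sidebar can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the default thresholds, and the choice to filter URIs before dropping short sessions are all assumptions, since the text does not fix them.

```python
from collections import Counter

def support_filter(sessions, min_support=0.05, min_session_len=2):
    """Remove low-support URI references, then very small sessions.

    sessions: list of sessions, each a list of URI strings.
    min_support: minimum fraction of sessions a URI must appear in (assumed value).
    min_session_len: minimum number of references a session must keep (assumed value).
    """
    if not sessions:
        return []
    total = len(sessions)
    # Count each URI at most once per session.
    counts = Counter(uri for session in sessions for uri in set(session))
    frequent = {uri for uri, c in counts.items() if c / total >= min_support}

    filtered = []
    for session in sessions:
        kept = [uri for uri in session if uri in frequent]
        if len(kept) >= min_session_len:
            filtered.append(kept)
    return filtered

# Hypothetical session file with four sessions.
sessions = [
    ["/a", "/b"],
    ["/a", "/b", "/c"],
    ["/a"],
    ["/b", "/a"],
]
cleaned = support_filter(sessions, min_support=0.5, min_session_len=2)
```

Here "/c" appears in only one of four sessions and is dropped, and the single-reference session is removed as too small to be informative.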

Another approach for obtaining aggregate usage profiles is to directly compute (overlapping) clusters of pageview references based on how often they occur together across user sessions (rather than clustering sessions themselves). We call the usage profiles obtained in this way pageview clusters. In general, this technique will result in a different type of aggregate profile as compared to the session clustering technique. The usage profiles derived from session clusters group together pages that co-occur commonly across similar sessions. On the other hand, pageview clusters tend to group together frequently co-occurring items across sessions, even if these sessions are themselves not deemed to be similar. This technique allows one to obtain clusters that potentially capture overlapping interests of different types of users. The question of which type of cluster is most appropriate for personalization tasks is an open research issue. However, the answer to this question, in part, depends on the structure and content of the specific site, as well as the goals of personalization actions.

The difficulty in clustering URIs directly comes from the high dimensionality of the feature space. The user sessions, measured in tens to hundreds of thousands in a typical application, must be used instead of the URIs as features. Traditional clustering techniques, such as distance-based methods, generally cannot handle this type of clustering. Furthermore, dimensionality reduction in this context may not be appropriate, as removing a significant number of sessions as features may result in losing too much information. In the next section we discuss an approach based on Association Rule Hypergraph Partitioning, which has been found to be particularly suitable for this task. Another approach for clustering URIs directly may be based on the cluster mining technique of Perkowitz and Etzioni (see their article “Adaptive Web Sites” in this issue).

Sidebar: Experiments with the WebPersonalizer System

To demonstrate the feasibility of the proposed techniques and architecture, we conducted a series of experiments using the site for the newsletter of the Association for Consumer Research (acr-news.org). The site contains a variety of news items, including President’s columns, conference announcements, and calls-for-papers for a number of conferences and journals. The site was used to implement a demonstration version of the WebPersonalizer system based on the architecture presented here. A local version of the demonstration site is available from aztec.cs.depaul.edu/scripts/ACR2.

We used a subset of the ACR logs (from June 1998 to June 1999), and used session clustering to derive usage profiles. The session file for this experiment contained 18,430 user sessions with a total of 192 unique URIs. Based on the average session size, a session window of size 3 was chosen. The session clustering process yielded 28 usage profiles representing different types of user access patterns. A threshold of 0.5 was used to derive usage profiles from session clusters (that is, usage profiles contained only those URI references appearing in at least 50% of sessions). We used a recommendation threshold of 0.3 as a cutoff point to ensure capturing overlapping user interests.

The recommendation engine was implemented as a set of CGI scripts, using cookies to keep track of the user’s active session. The figures appearing here depict a typical interaction of a user with the site. The top frame in each window contains the actual page contents from the site, while the bottom frame contains the recommended links. When the user clicks on a link in either frame, the top frame will display the content of the requested page, and the bottom frame is dynamically updated to include the new recommendations. As seen in Figure 2, initially the system does not provide any recommendations until the user has navigated through more pages. Figure 3 shows the recommendations resulting after the user has followed a path to “President’s Column” and then to “Online Archives.” The recommendations include past President’s Columns and Editor’s Notes (as well as other pages) often visited by users who have shown similar access patterns. Figure 4 shows the results of the user navigation through “Conference Update,” “Call for Papers,” and then “1999 Asia Pacific Conference.” As can be seen in these figures, a user’s intention of looking for more specific information will result in more specific recommendations. For example, when the user accesses a specific conference page (Figure 4), other specific conference information is presented as potentially interesting (for example, “Winter 2000 SCP Conference” and “Int’l Conference on Marketing and Development”).

Additional experimental results with other datasets comparing the various techniques described in this article can be found at maya.cs.depaul.edu/~mobasher/personalization/.

From profiles to recommendations. The recommendation engine is the online component of a Web personalization system. The task of the recommendation engine is to compute a recommendation set for the current (active) user session, consisting of the objects (links, ads, text, products, and so forth) that most closely match the current user profile. The essential aspect of computing a recommendation set for a user is matching the current user’s activity against aggregate usage profiles. The recommendation engine must be an online process, providing results quickly enough to avoid any perceived delay by the users (beyond what is considered normal for a given Web site and connection speed).

If the data collection procedures in the system include the capability to track users across visits, then the recommendation set can represent a longer term view of potentially useful links based on the user’s activity history within the site. On the other hand, if profiles are derived from anonymous user sessions contained in log files, then the recommendations provide a short-term view of the user’s navigational history. As depicted in Figure 1, these recommended objects are then added to the last page in the active session accessed by the user before that page is sent to the browser.

In general there are several design factors that can be taken into account in determining the recommendation set. These factors may include:

• A short-term history depth for the current user representing the portion of the user’s activity history that should be considered relevant for the purpose of making recommendations;

• The mechanism used for matching aggregate usage profiles and the active session; and

• A measure of significance for each recommendation (in addition to its prediction value), which may be based on prior domain knowledge or structural characteristics of the site.

Maintaining a history depth is important because most users navigate several paths leading to independent pieces of information within a session. In many cases these episodes have a length of no more than two or three references. In such a situation, it may not be appropriate to use references a user made in a previous episode to make recommendations during the current episode. It is possible to capture the user history depth within a sliding window over the current session.

A variety of techniques can be used to match the active user session with one or more of the discovered usage profiles. For instance, standard classification techniques can be employed to automatically assign the new user session to a class determined based on aggregate profiles. It is also possible to directly use patterns discovered as part of the association rule (or sequential pattern) discovery to provide recommendations (see the sidebar “Mining Association Rules for Personalization”). In the architecture described in this article, the aggregate profiles are represented as weighted URI collections. This will allow for both the active session and the profiles to be treated as n-dimensional URI vectors, where n is the number of URI references appearing in the session file. In this case, standard measures of distance or similarity can be utilized to match the active session and the usage profiles, and the recommendations can be ranked according to a matching score. This is the method we have used in the WebPersonalizer system.
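The sliding window and the profile matching step can be sketched together as follows. This is a hedged illustration rather than the WebPersonalizer code: the function names are hypothetical, cosine similarity stands in for the unspecified "standard measure," and ranking by match score times profile weight is an assumed scoring rule. The default window size of 3 and cutoff of 0.3 echo the values reported in the experiments sidebar.

```python
from math import sqrt

def active_window(session, size=3):
    """Keep only the last `size` references of the active session."""
    return session[-size:]

def match_score(window, profile):
    """Cosine similarity between a binary window vector and a weighted profile."""
    visited = set(window)
    dot = sum(weight for uri, weight in profile.items() if uri in visited)
    norm_window = sqrt(len(visited))
    norm_profile = sqrt(sum(weight * weight for weight in profile.values()))
    if norm_window == 0 or norm_profile == 0:
        return 0.0
    return dot / (norm_window * norm_profile)

def recommend(session, profiles, window_size=3, threshold=0.3):
    """Rank URIs from matching profiles that are not already in the window.

    Assumed scoring rule: match score of the profile times the URI's weight.
    """
    window = active_window(session, window_size)
    visited = set(window)
    scores = {}
    for profile in profiles:
        match = match_score(window, profile)
        for uri, weight in profile.items():
            if uri in visited:
                continue
            score = match * weight
            # Keep the best score when a URI occurs in several profiles.
            if score >= threshold and score > scores.get(uri, 0.0):
                scores[uri] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical profile and active session.
profiles = [{"/a": 1.0, "/b": 1.0, "/c": 0.8}]
recommendations = recommend(["/x", "/a", "/b"], profiles)
```

With this input only "/c" is recommended, since "/a" and "/b" are already in the window. A smaller window makes the engine react faster to a change of topic, at the cost of ignoring earlier context.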

Finally, structural characteristics of the site or prior domain knowledge can be used to associate an additional measure of significance with each recommendation. For instance, the site owner or the site designer may wish to consider certain page types (content versus navigational) or product categories as having more significance in terms of their recommendation value. In this case, significance weights can be specified as part of the domain knowledge. Or, it may be desirable to consider pages that are farther away from the current user location within the site as being better recommendations. In this case, structural information such as the link distances can be used to provide significance weighting for recommendations.
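One way to picture link-distance weighting is the sketch below. It assumes the site structure is available as a dictionary from each page to the pages it links to, and it scales each score linearly by distance from the current page. Both the data layout and the linear scaling are illustrative assumptions, not a formula given in this article.

```python
from collections import deque

def link_distances(links, start):
    """Breadth-first search over the site link graph.

    links: dict mapping a page to the list of pages it links to (assumed layout).
    Returns the number of clicks from `start` to every reachable page.
    """
    distances = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances

def reweight_by_distance(recommendations, links, current_page):
    """Scale each (uri, score) pair so that farther pages rank higher."""
    distances = link_distances(links, current_page)
    max_distance = max(distances.values()) or 1  # guard against division by zero
    reweighted = []
    for uri, score in recommendations:
        # Unreachable pages are treated as maximally distant (assumption).
        distance = distances.get(uri, max_distance)
        reweighted.append((uri, score * distance / max_distance))
    return sorted(reweighted, key=lambda item: item[1], reverse=True)

# Hypothetical site topology and recommendation list.
links = {"/home": ["/a", "/b"], "/a": ["/c"]}
ranked = reweight_by_distance([("/a", 0.8), ("/c", 0.6)], links, "/home")
```

In this example "/c" overtakes "/a" because it is two clicks away rather than one, so recommending it saves the user more navigation.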



The WebPersonalizer System

The WebPersonalizer system uses the architecture shown in Figure 1 to provide a list of recommended hypertext links to a user while browsing through a Web site. Currently, the WebPersonalizer system relies solely on anonymous usage data provided by Web server logs and the hypertext structure of a site. The preprocessing steps outlined in [2] are used to convert the server logs into server sessions. Two different methods, each with its own characteristics, are used to discover aggregate usage profiles represented by a set of URIs. The first method involves the computation of session clusters and the derivation of useful aggregate user profiles from these session clusters. In the second method, we use frequent itemsets discovered as part of association rule discovery to directly obtain clusters of URIs based on their usage characteristics (pageview clusters). Once the representative usage profiles have been computed, a partial session for the current user (the active session) can be assigned to one or more matching usage profiles. The matching profiles are used as the basis for providing the user with additional recommendations.

In order to derive usage profiles from each session cluster, the cluster centroids (the mean vectors) are computed. The mean value for each URI in the mean vector is computed by finding the ratio of the number of occurrences of that URI across all sessions to the total number of sessions in the cluster. Then, the low-support URIs (those with mean value below a certain threshold) are filtered out. For example, if the threshold is set at 0.5, then each usage profile will contain only those URI references that appear in at least 50% of the sessions within its associated session cluster.
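The centroid computation and threshold filter just described can be sketched as below. The function name is hypothetical, and the sketch assumes binary session vectors, so a URI is counted at most once per session.

```python
from collections import Counter

def profile_from_cluster(cluster_sessions, threshold=0.5):
    """Derive a weighted usage profile from one session cluster.

    cluster_sessions: list of sessions, each a list of URI strings.
    Each URI's weight is the fraction of sessions in the cluster that contain it
    (its mean value in the centroid); URIs below `threshold` are filtered out.
    """
    total = len(cluster_sessions)
    if total == 0:
        return {}
    counts = Counter(uri for session in cluster_sessions for uri in set(session))
    return {
        uri: count / total
        for uri, count in counts.items()
        if count / total >= threshold
    }

# Hypothetical cluster of four sessions.
cluster = [
    ["/a", "/b"],
    ["/a", "/c"],
    ["/a", "/b", "/d"],
    ["/b"],
]
usage_profile = profile_from_cluster(cluster, threshold=0.5)
```

For this cluster the profile keeps "/a" and "/b", each with weight 0.75, and drops "/c" and "/d", which appear in only a quarter of the sessions.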

Mining Association Rules for Personalization

Association rules¹ capture the relationships among items based on their patterns of co-occurrence across transactions in transactional databases such as point-of-sale data collected in supermarkets. In the case of Web transactions, association rules capture relationships among URI references based on the navigational patterns of users. For example, an association rule

{A.html, B.html} -> {C.html} [support = 0.01, confidence = 0.75]

represents the relationship that users who access pages A.html and B.html also tend to (with a confidence of 75%) access the page C.html. The support value represents the fact that the itemset {A.html, B.html, C.html} was present in 1% of user sessions. Association rule discovery methods initially find groups of items (which in this case are the URIs appearing in the preprocessed log) that occur together frequently across transactions and satisfy a minimum support criterion. Such itemsets are referred to as frequent itemsets.

It is possible to consider the frequent itemsets discovered as part of association rule mining directly as usage profiles. The current user session can be matched against frequent itemsets to find candidate recommendations: if the corresponding rule satisfies a specified confidence threshold, then the candidate URI is added to the recommendation set. It is also possible to extend this technique to sequential patterns², particularly when the focus is capturing user navigational paths (see also “Personalizing a Site with Web Usage Mining” in this issue). The problem with this method is that it might be difficult to find large enough itemsets with sufficient support to match the current session. This is particularly true for sites with very small average session sizes. An alternative to reducing the support threshold in such cases would be to reduce the session window size. This latter choice may itself lead to some undesired effects since we may not be taking enough of the user’s activity history into account.

¹ Agrawal, R. and Srikant, R. Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference (Santiago, Chile, 1994), 487–499.
² Agrawal, R. and Srikant, R. Mining sequential patterns. In Proceedings of the International Conference on Data Engineering (ICDE), (Taipei, Taiwan, Mar. 1995).

For the second method (computing usage profiles directly), the WebPersonalizer system uses the Association Rule Hypergraph Partitioning (ARHP) technique [4]. ARHP is well-suited for this task since it can efficiently cluster high-dimensional data sets without requiring dimensionality reduction as a preprocessing step. Furthermore, ARHP provides automatic filtering capabilities and does not require distance computations. ARHP has been used successfully in a variety of domains, including the categorization of Web documents [3]. In this method, the set of frequent itemsets is used as hyperedges to form a hypergraph. A hypergraph is an extension of a graph in the sense that each hyperedge can connect more than two vertices. The weights associated with each hyperedge are computed based on the confidence of the association rules involving the items in the frequent itemset. The hypergraph is then recursively partitioned into a set of clusters. The similarity among items is captured implicitly by the frequent itemsets. Each cluster represents a group of items (URIs) that are very frequently accessed together across sessions. The connectivity value of a vertex (a URI appearing in the frequent itemset) with respect to a cluster measures the percentage of edges with which the vertex is associated. The significance weight of the URI within the resulting profile is obtained as a function of the connectivity value for that URI.
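
To make the quantities above concrete, the following Python sketch computes the support and confidence of an association rule over a handful of sessions, and the connectivity value of a URI with respect to a cluster. It is an illustration under stated assumptions, not the ARHP algorithm itself (no hypergraph partitioning is performed). The function names, the sample sessions, and the sample hyperedges are hypothetical, and connectivity is interpreted here as the fraction of hyperedges lying entirely inside the cluster that contain the URI.

```python
def support(itemset, sessions):
    """Fraction of sessions that contain every item in the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= s for s in sessions) / len(sessions)

def confidence(antecedent, consequent, sessions):
    """support(antecedent and consequent together) / support(antecedent)."""
    ante = support(antecedent, sessions)
    both = support(set(antecedent) | set(consequent), sessions)
    return both / ante if ante else 0.0

def connectivity(uri, cluster, hyperedges):
    """Assumed reading: share of the hyperedges inside the cluster that contain uri."""
    inside = [edge for edge in hyperedges if edge <= cluster]
    if not inside:
        return 0.0
    return sum(uri in edge for edge in inside) / len(inside)

# Hypothetical preprocessed sessions (each a set of URIs).
sessions = [
    {"A.html", "B.html", "C.html"},
    {"A.html", "B.html", "C.html"},
    {"A.html", "B.html", "C.html"},
    {"A.html", "B.html"},
    {"A.html"},
    {"C.html"},
    {"B.html", "C.html"},
    {"D.html"},
]
rule_support = support({"A.html", "B.html", "C.html"}, sessions)
rule_confidence = confidence({"A.html", "B.html"}, {"C.html"}, sessions)

# Hypothetical frequent itemsets used as hyperedges, and one cluster of URIs.
hyperedges = [
    frozenset({"A.html", "B.html"}),
    frozenset({"A.html", "C.html"}),
    frozenset({"A.html", "B.html", "C.html"}),
    frozenset({"C.html", "D.html"}),
]
cluster = {"A.html", "B.html", "C.html"}
conn_a = connectivity("A.html", cluster, hyperedges)
conn_b = connectivity("B.html", cluster, hyperedges)
```

In this toy data the rule {A.html, B.html} -> {C.html} has support 3/8 and confidence 0.75. A.html appears in all three hyperedges inside the cluster, so its connectivity is 1.0, while B.html appears in two of the three.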

In the case of usage profiles derived from session clustering, the weight for a URI is its mean value in the cluster mean session vector. In the case of pageview clusters obtained using the ARHP method, the weight is the connectivity value of the item within the cluster. In computing the matching scores, the system normalizes for the size of the clusters and the active session. This corresponds to the intuitive notion that we should see more of the user’s active session before obtaining a better match with a larger cluster. Furthermore, a candidate URI is considered to be a better recommendation if it is farther away from the current active session. To capture this notion, the physical link distance between the active session and a URI is measured (this is the shortest path in the site graph between the URI and any of the URIs in the session).
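
As a rough illustration of these two ingredients, the sketch below computes a link distance by breadth-first search and a normalized matching score. Several details are assumptions, since the text does not spell them out: the site graph is taken to be a directed adjacency mapping, the active session is treated as binary, and the normalization is a cosine-style one that divides by the square root of the session size times the sum of squared profile weights. The names `link_distance` and `match_score` and the sample site are hypothetical.

```python
import math
from collections import deque

def link_distance(site_graph, session, target):
    """Fewest links from any page in the session to target (multi-source BFS).

    site_graph: {page: [pages it links to]}. Returns None if unreachable.
    """
    seen = set(session)
    queue = deque((page, 0) for page in session)
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in site_graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def match_score(session, profile):
    """Cosine-style match between a binary session and a weighted profile.

    Assumed normalization: dividing by sqrt(|session| * sum of squared weights)
    penalizes large profiles until more of the session overlaps with them.
    """
    overlap = sum(w for uri, w in profile.items() if uri in session)
    norm = math.sqrt(len(session) * sum(w * w for w in profile.values()))
    return overlap / norm if norm else 0.0

# Hypothetical site structure and active session.
site_graph = {
    "/home": ["/products", "/about"],
    "/products": ["/p1", "/p2"],
    "/p1": ["/p1/specs"],
}
active_session = ["/home", "/products"]
dist_specs = link_distance(site_graph, active_session, "/p1/specs")
dist_about = link_distance(site_graph, active_session, "/about")
score = match_score(set(active_session), {"/products": 1.0, "/p1": 0.5})
```

Here `/p1/specs` is two links away from the nearest session page and `/about` is one link away, so under the distance heuristic `/p1/specs` would be weighted as the more valuable recommendation of the two.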

The full recommendation set for the current active session is computed by collecting, from each matching profile, all URIs whose recommendation score satisfies a minimum threshold requirement. The URIs in the recommendation set are ranked according to their recommendation score when presented to the user. Details of the specific techniques used in the recommendation process, as well as a set of experiments comparing them, can be found at maya.cs.depaul.edu/~mobasher/personalization/.
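
The collection and ranking step can be sketched as follows. The sketch assumes the recommendation scores have already been computed per matching profile and simply applies the threshold and the ranking. The function name `recommendation_set`, the sample scores, and the rule of keeping the highest score when a URI is proposed by more than one profile are assumptions, not details given in the article.

```python
def recommendation_set(profile_scores, threshold):
    """Collect and rank candidate URIs across all matching profiles.

    profile_scores: list of {uri: recommendation score}, one dict per
    matching profile. A URI proposed by several profiles keeps its best
    score (an assumption made for this sketch).
    """
    best = {}
    for scores in profile_scores:
        for uri, score in scores.items():
            if score >= threshold and score > best.get(uri, float("-inf")):
                best[uri] = score
    # Highest recommendation score first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores from two matching profiles.
profile_scores = [
    {"/p1": 0.8, "/p2": 0.4, "/about": 0.2},
    {"/p2": 0.6, "/cart": 0.55},
]
ranked = recommendation_set(profile_scores, threshold=0.5)
```

With a threshold of 0.5, `/about` is dropped, `/p2` is kept on the strength of its score in the second profile, and the ranked list is `/p1`, `/p2`, `/cart`.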

Conclusion

The Web is providing a direct communication medium between the vendors of products and services, and their clients. Coupled with the ability to collect detailed data at the granularity of individual mouse clicks, this provides a tremendous opportunity for personalizing the Web experience for clients. In e-commerce parlance this is being termed mass customization. Even outside of e-commerce, the idea of Web personalization has many applications. Recently there has been an increasing amount of research activity on various aspects of the personalization problem. Most current approaches to personalization by various Web-based companies rely heavily on human participation to collect profile information about users. This suffers from the problems of the profile data being subjective, as well as getting out of date as user preferences change over time.

We have provided several techniques in which user preferences are automatically learned from Web usage data by using data mining techniques. This has the potential of eliminating subjectivity from profile data as well as keeping it updated. We have described a general architecture for automatic Web personalization based on the proposed techniques, and discussed solutions to the problems of usage data preprocessing, usage knowledge extraction, and making recommendations based on the extracted knowledge.

References

1. Buchner, A. and Mulvenna, M.D. Discovering Internet marketing intelligence through online analytical Web usage mining. SIGMOD Record 4, 27 (1999).

2. Cooley, R., Mobasher, B., and Srivastava, J. Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems 1, 1 (1999).

3. Han, E., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., and More, J. Document categorization and query generation on the World Wide Web using WebACE. Journal of Artificial Intelligence Review, January 1999.

4. Han, E., Karypis, G., Kumar, V., and Mobasher, B. Clustering based on association rule hypergraphs. In Proceedings of SIGMOD’97 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD’97), May 1997.

5. Joachims, T., Freitag, D., and Mitchell, T. WebWatcher: A tour guide for the World Wide Web. In Proceedings of the International Joint Conference in AI (IJCAI97), August 1997.

6. Nasraoui, O., Frigui, H., Joshi, A., and Krishnapuram, R. Mining Web access logs using relational competitive fuzzy clustering. In Proceedings of the Eighth International Fuzzy Systems Association World Congress, August 1999.

7. O’Conner, M. and Herlocker, J. Clustering items for collaborative filtering. In Proceedings of the ACM SIGIR Workshop on Recommender Systems, Berkeley, CA, 1999.

8. Perkowitz, M. and Etzioni, O. Adaptive Web sites: automatically synthesizing Web pages. In Proceedings of Fifteenth National Conference on Artificial Intelligence, Madison, WI, 1998.

9. Spiliopoulou, M. and Faulstich, L.C. WUM: A Web Utilization Miner. In Proceedings of EDBT Workshop WebDB98, Valencia, Spain, LNCS 1590, Springer Verlag, 1999.

10. Schechter, S., Krishnan, M., and Smith, M.D. Using path profiles to predict HTTP requests. In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, 1998.

11. Shardanand, U. and Maes, P. Social information filtering: algorithms for automating “word of mouth.” In Proceedings of the ACM CHI Conference, 1995.

12. Shahabi, C., Zarkesh, A. M., Adibi, J., and Shah, V. Knowledge discovery from users Web-page navigation. In Proceedings of Workshop on Research Issues in Data Engineering, Birmingham, England, 1997.

Bamshad Mobasher is an assistant professor of Computer Science and the director of the Center for Web and E-Commerce Intelligence at DePaul University in Chicago, IL. Robert Cooley is an independent consultant in e-commerce data analysis. Jaideep Srivastava is an associate professor of Computer Science in the Department of Computer Science and Engineering at the University of Minnesota in Minneapolis.

© 2000 ACM 0002-0782/00/0800 $5.00
