Visual Cluster Exploration of Web Clickstream...

Visual Cluster Exploration of Web Clickstream Data

Jishang Wei∗

University of California, Davis

Zeqian Shen†

eBay Research Labs

Neel Sundaresan‡

eBay Research Labs

Kwan-Liu Ma§

University of California, Davis

ABSTRACT

Web clickstream data are routinely collected to study how usersbrowse the web or use a service. It is clear that the ability to recog-nize and summarize user behavior patterns from such data is valu-able to e-commerce companies. In this paper, we introduce a visualanalytics system to explore the various user behavior patterns re-flected by distinct clickstream clusters. In a practical analysis sce-nario, the system first presents an overview of clickstream clustersusing a Self-Organizing Map with Markov chain models. Then theanalyst can interactively explore the clusters through an intuitiveuser interface. He can either obtain summarization of a selectedgroup of data or further refine the clustering result. We evaluatedour system using two different datasets from eBay. Analysts whowere working on the same data have confirmed the system’s effec-tiveness in extracting user behavior patterns from complex datasetsand enhancing their ability to reason.

1 INTRODUCTION

Clickstreams record user clicking actions on the web. Analyz-ing clickstream data provides insights into user behavior patterns,which are extremely valuable for e-commerce businesses like eBay.For example, knowing different user behavior patterns helps con-duct market segmentations in order to develop marketing strategies,or enhance personalized shopping experience. In practice, how-ever, learning the various user behavior patterns is nontrivial. An-alysts often have little knowledge but many questions about whatuser behaviors are hidden in a clickstream dataset. By interviewinganalysts at eBay, we found the following questions are frequentlyasked.

- What are the most frequent user behavior patterns?

- What are the demographics of the users who follow a specificbehavior pattern?

- How do the behavior patterns correlate with the performanceof the online service?

An interactive data exploration environment provides a fairly ef-fective way of finding answers to these questions. In our work, wedesign a visual analytics system to support such an answer-seekingprocess. This visual analytics system enables analysts to inspectdifferent user behavior patterns and examine the associated patternprofiles.

Nevertheless, clickstream data have a number of characteris-tics [5] that make the visual analysis task challenging. For ex-ample, clickstreams are inherently heterogeneous, and uncertaintynaturally arises during automatic clustering which makes it inde-terminate to partition the data. In addition, it is uneasy to visually

∗e-mail: [email protected]†e-mail: [email protected]‡e-mail:[email protected]§e-mail:[email protected]

summarize a cluster of clickstreams. It might be easy to describethe group behavior verbally, but difficult to present visually.

To handle these challenges, in our work, we utilize a derivedSelf-Organizing Map(SOM) to map and cluster the clickstreams;design an enlightening visualization from which analysts can seethe clear cluster structure; and provide intuitive interaction tools toenable detailed data examination and cluster structure refinement.The major contributions of our work include:

- An SOM with Markov chain models is derived to map click-streams to a 2D space.

- A 2D layout algorithm is introduced to reduce visual clutter.

- An interactive cluster exploration process is designed to en-able user-guided clustering.

We have used real clickstream data to test and evaluate ourmethod with assistance from analysts and product managers ateBay. Our method proves to be helpful in discovering user behaviorpatterns and the corresponding demographics.

2 RELATED WORK

The visual cluster exploration of clickstreams is related to data clus-tering, visualization and interactive exploration.

Clustering web clickstream data to discover user behavior pat-terns has drawn great attention in web usage mining [22, 24]. Thealgorithms for clickstream data clustering mainly fall into two cat-egories: similarity-based and model-based methods. Similarity-based methods define and utilize similarity metrics to group sim-ilar clickstreams. An appropriate similarity metric is the core ofthis type of methods. The edit distance, related longest commonsubsequence and their variations are the most frequently used met-rics [2,7,9,13]. Model-based clustering employs probabilistic mod-els to cluster and describe clickstream data [5, 17]. This methodassumes that user behaviors are generated by a set of probabilisticmodels and each model corresponds to a cluster. The clusteringprocess is to recover the model parameters and assign clickstreamsto the cluster where the model best describes them. Choosing an ap-plication suitable model is critical. There have been many modelsproposed to describe user behaviors [5, 17, 19, 21, 28], and Markovmodels (e.g. first order Markov models, or Hidden Markov mod-els) are the most commonly utilized ones. Compared to similarity-based methods, model-based methods offer better interpretabilitysince the resulting model of each cluster directly characterizes thatcluster. Moreover, model-based clustering algorithms often havea computational complexity that is “linear” to the number of dataobjects under certain practical assumptions.

A proper visualization of clickstream clusters reveals data pat-terns. One direct strategy is to use separate windows with eachshowing one cluster. For example, Cadez et al [5] and Manavogluet al. [17] visualized the clusters of clickstreams with multiple win-dows. Each window shows the clickstream traces of one kind. Thismethod emphasizes individual clickstream patterns, but cuts theconnections among clusters. Space-filling approaches are a widelyused type of visualization technique, which arrange and display thedata in a constrained space. The treemap one popular exampleamong them. For instance, Xu et al. [27] applied treemap to dis-play large digital collections. The treemap was improved through

3

IEEE Conference on Visual Analytics Science and Technology 2012October 14 - 19, Seattle, WA, USA 978-1-4673-4753-2/12/$31.00 ©2012 IEEE

pixel-based rendering and glyph-based techniques to show addi-tional information dimensions. Wattenberg presented four gener-ally desirable properties of a “perfect” layout for the space-fillingvisualization [26]. He then constructed such a layout, jigsaw map,and visualized real-world data with this map.

Interactive, visually-aided cluster analysis has been studied tohandle clickstream data. Makanju et al. [16] introduced a visualanalysis tool, LogView, to cluster server log event sequences andvisualize the clustering result by treemaps [10]. LogView providesinteraction tools to enable searching, filtering and selection of databased on the treemaps. Lee and Podlaseck [15] described an inter-active visualization system that supports interpreting and exploringclickstream data of online stores. Their system includes interactiontools such as zooming, filtering, color coding, dynamic querying,and data sampling, in addition to visualization options such as par-allel coordinates and scatter plot. Takada and Koike [23] developeda highly interactive visual log browser named MieLog. MieLoguses interactive visualization and statistical analysis for manual loginspection tasks. Lam et al. [14] built Session Viewer, a visualiza-tion tool to facilitate statistical and detailed analysis of web sessiondata and bridge gaps between them. Shen and Sundaresan [20] in-troduced TrailExplorer to analyze temporal user behavior patternsin webpage flows. TrailExplorer is particularly tailored to assistvisual analysis of preprocessed web session data.

3 DATA AND APPROACH OVERVIEW

A clickstream is an ordered sequence of predefined actions. Forexample, consider a seller listing items for sale on eBay, they needto complete the Sell-Your-Item page. The page requests extensiveselling related information, which are categorized into eight sec-tions: Category: select a category where the item will appear; Title:write a title for the item; Picture: select or upload pictures of theitem; Description: describe the item; Pricing: set a price for theitem; Payment: set the payment options; Shipping: choose shippingmethods and set the shipping cost; OtherOptions: set other informa-tion about the listing, such as tax and return policy. The clickstreamdata from the Sell-Your-Item page captures user actions in terms ofthe sections they edit. A sample of the data is shown in Figure 1,where each row represents one clickstream. We can see that usersedited the sections in different ordering, skipped some sections andrevisit some in the same clickstream. User demographics and otherselling related information are collected as well for understandingtheir correlations with user behaviors.

Figure 2 shows the general visual cluster exploration process us-ing our system. First, clickstreams are mapped to a 2D plane, whilethe topological relations among data are preserved. Second, theclickstreams are visually encoded, and their placement are adjustedto reduce visual clutter. This 2D visualization makes it possible toboth observe an overview of the cluster structure and perceive in-dividual clickstream patterns. Third, analysts interpret the visualrepresentation, interact with data to obtain detailed information andselect representative clickstreams as cluster prototypes. Fourth, theselected groups can be either shown with statistical summarizationor used to refine the clustering result. Users can iterate betweenthe third and fourth steps until a satisfactory result is achieved. Therest of the paper is organized as follows. Our methods for datamapping and visualization are introduced in Section 4 and Section5, respectively. Section 6 addresses how our system supports inter-active cluster exploration. Finally, two case studies are presented todemonstrate how the system is used in practice.

4 DATA MAPPING AND CLUSTERING

Practically speaking, it is intuitive for people to perceive and in-teract with data in a lower dimensional space. In our work, wechoose to map the clickstreams on a 2D plane to facilitate dataperception. Meanwhile, we hope to partition data into clusters

Figure 1: A sample of the clickstream data on the Sell-Your-Itempage of the eBay website. There are eight predefined actions basedon the eight sections on the page, including Category, Title, Picture,Description, Pricing, Payment, Shipping and OtherOptions. Each rowrepresents one clickstream.

so as to reveal different behavior patterns. The Self-OrganizingMaps (SOM) [11, 12] is such a method that can achieve our goal.However, the conventional SOM is designed to handle data of thesame dimensionality, which are quite different from clickstreamsof different lengths. To solve this problem, we utilize probabilis-tic models within the SOM framework to accommodate the click-stream data mapping and clustering. In this section, we will firstgive a brief introduction to SOM and then elaborate on how proba-bilistic models are integrated into the SOM framework.

4.1 Self-Organizing Map

The Self-Organizing Map is a well-known neural network modelfor high-dimensional data mapping and clustering. It consists ofcomponents called nodes or neurons which are usually arranged ona regular 2D grid. One node is associated with a vector prototyperepresenting a cluster of input data, and nearby nodes contain sim-ilar vector prototypes. A trained SOM can map high-dimensionaldata into a 2D space while maintaining topological relations amongdata objects.

The SOM is usually trained by the competitive learning method,which iteratively updates the vector prototypes. For instance, abatch algorithm trains SOM iteratively by applying the followingtwo steps,

• Matching: for each prototype m, collect a list of clickstreams,whose best matching (most similar) prototype is within theneighborhood of m;

• Adjusting: adjust the prototypes using their respective collec-tions of matching clickstreams.

During the training process, the neighborhood size decreases ateach iteration. The neighborhood relation h(i, j) between two pro-totypes i and j are determined by the geometric relations of theircorresponding grid points oi (xi, yi) and o j (x j, y j). The commonlyused neighborhood relation is the Gaussian function, i.e.,

h(i, j) =1√2πδ

exp(−||oi −o j||22×δ 2

) (1)

After the SOM training, a set of vector prototypes are obtainedrepresenting the input data, with similar prototypes staying closer.The input data are also projected into a low-dimensional space, i.e.,a 2D regular grid.

4.2 SOM with Markov Chain Models

Regarding the conventional SOM, the vector prototypes have thesame dimensionality as input data. At the matching step, the simi-larity between an input data item and a prototype is measured by a

4

Figure 2: The major steps of our system for visual cluster exploration of clickstream data include data mapping, data visualization, interactivecluster exploration, and clusters verification and validation.

pre-defined similarity metric, such as Euclidean distance. At the ad-justing step, the prototypes are simply updated by taking the meanvalue over their respective lists of matching data items. However,because the clickstream data are heterogeneous and of differentlengths, it is not trivial to design a feasible similarity metric andto use specific vectors to represent clickstreams. To deal with thisissue, we take advantage of the probability models as prototypes todescribe the clickstreams. The “similarity” between a clickstreamand a probability model is measured by the probability at whichthe clickstream fits in the model. As such, SOM with probabilitymodels can be applied to map the clickstream data onto a 2D plane.

In the literature, there have been many models proposed to de-scribe clickstreams [5, 17, 19, 21, 28]. We use a first-order Markovchain as in [5] since the Markov chain is a simple but efficientmodel to describe user behaviors. Other models can also be adaptedin the SOM framework provided that the model can capture the datacharacteristics well. Regarding clickstreams, the set of M prede-fined actions, A = {a1, . . . , am, . . . , aM}, corresponds to the setof states of the Markov chain model. We define a dataset of N click-streams as U, where U = {u1, . . . , un, . . . , uN}. A clickstream un

is an ordered sequence with ln actions: (un,1, . . . , un,i, . . . , un,ln),where un,i ∈ A. The set of K Markov chain models across theSOM grid points is represented as Θ = {θ1, · · · , θk, · · · , θK}.A Markov chain model is determined by its parameters, θk ={θ I

k (un,1 = am), θ Tk (un,i = as|un,i−1 = am)}, where θ I

k (un,1 = am)denotes the probability of a clickstream starting with an action am

and θ Tk (un,i = as|un,i−1 = am) denotes the transition probability of

two consecutive actions am and as.

The training method for SOM with Markov chain models is sim-ilar to that of training a conventional SOM with vector prototypes.Nevertheless, at each SOM training iteration, rather than simplyfinding the best matching data points and calculating a mean vec-tor, the Expectation-Maximization (EM) algorithm [18] is appliedto recover the optimal parameters of SOM with Markov chain mod-els. Algorithm 1 gives an overview of the competitive learning al-gorithm for training SOM with Markov chain models.

Specifically speaking, at each iteration of the SOM training, theEM algorithm searches the domain of model parameters and up-dates Θ in order to maximize the coupling likelihood Lc(Θ|U) [6],which measures how well the models fit the dataset.

Lc(Θ|U) =N

∑n=1

logK

∑k=1

1

Kpc(un|θk) (2)

pc(un|θk) represents the coupling likelihood between a click-stream un and a model θk. It is defined as the joint probabilityof un fitting in θk and models in its neighborhood, which is definedby h(k,r).

pc(un|θk) = p(un|θk) ∏r 6=k

p(un|θr)h(k,r) (3)

Algorithm 1 Competitive Learning Algorithm for Training SOMwith Markov Chains

Input: a clickstream dataset U; the number of SOM grid points K;a minimum neighborhood size δmin and a decreasing factor δt .

1: while the neighborhood size is bigger than δmin do2: Decrease the neighborhood size by δt ;3: // EM Algorithm4: while the likelihood calculated by function 2 increases do5: E-Step: calculate the expectation value using function

5, update the probabilities according to functions 4 and3;

6: M-Step: update Markov chain model parameters usingfunctions 6 and 7;

7: end while8: end while

where p(un|θk) represents the probability of one clickstream un

fitting in a Markov model θk, i.e.,

p(un|θk) = θ Ik (un,1)

ln

∏i=2

θ Tk (un,i|un,i−1) (4)

Once the objective likelihood function is defined, the EM algo-rithm iterates between two key steps, E-step and M-step, to find theoptimal Markov chain parameters.

• E-step: Calculate the posterior probability pc(un|θ oldk ) that

gives the probability of the n-th data object fitting in the model

k with parameters θ oldk calculated in the last iteration. The

posterior expectation of Lc, the so-called Q-function, is cal-culated as follows,

Q(Θ;Θold) =K

∑l=1

N

∑n=1

K

∑k=1

pc(un|θ oldk )h(k, l)logp(un|θl) (5)

• M-step: Maximize the Q-function with respect to each subsetof parameters Θ. The update rules for each set of parametersare shown below, and it guarantees to increase the coupling-likelihood Lc:

Initial state probability,

θ Ik (un,1 = am)

=∑N

n=1 ∑Kl=1 pc(un|θ old

k )h(k, l)δ (un,1,am)

∑Mr=1[∑

Nn=1 ∑K

l=1 pc(un|θ oldk

)h(k, l)δ (un,1,ar)]

(6)

where δ (un,1,am) is an indicator function that is equal to 1 ifun,1 = am and 0 otherwise.

Transition probability,

5

Title

OtherOptions

Shipping

Pricing

Description

Picture

Figure 3: Visualization of an example clickstream from the Sell-Your-Item page on eBay website. Different colors denote different actions.In this example, a user first edits the title for the item to sell, uploadspictures, writes descriptions, and then sets the prices, shipping meth-ods and other options.

θ Tk (un,i = as|un,i−1 = am)

=∑N

n=1 ∑Kl=1 pc(un|θ old

k )h(k, l)β (un,i = as|un,i−1 = am)

∑Mr=1[∑

Nn=1 ∑K

l=1 pc(un|θ oldk

)h(k, l)β (un,i = ar|un,i−1 = am)](7)

where β (un,i = ar|un,i−1 = am) is an indicator function thatequals to 1 if action ar follows right after action am in theclickstream un and 0 otherwise.

After the SOM with a set of Markov chain models is trained,each clickstream is then mapped to a 2D position pn determinedby the coordinates of a model, ok(xk, yk) and the probabilitiesp(un|θk) of that clickstream fitting into the respective models,

pn =K

∑k=1

ok p(un|θk) (8)

In other words, a clickstream is placed close to the models that itfits better.

5 VISUALIZATION

Although the clickstreams are successfully projected onto a 2Dspace after the data mapping step, creating a visualization that canclearly present the clickstream clusters is yet to be resolved. In thissection, we address this problem by introducing a self-illustrativevisual representation of clickstreams and an effective layout algo-rithm.

5.1 Visual Representation of Clickstreams

Clickstreams are sequences of user actions, which are of variouslengths. We encode each click action as a rectangle, and color eachaction differently. Thus, one clickstream is represented by a se-quence of colored rectangles. Take Sell-Your-Item for example: aseller lists an item for sale on eBay and carries out a series of ac-tions: “edit title, upload pictures, write description, set prices, selectshipping methods, set other options”. The corresponding visualiza-tion is shown in Figure 3. In order to help users identify the mostfrequent behavior patterns, the size of a rectangle is proportional tothe frequency of the clickstream pattern’s existence in the data set.

5.2 Clickstreams Placement

Considering that visual metaphors take up space unlike mappeddata points, it would cause a serious overlapping problem if wenaively place visual rectangles of all clickstreams where they are

Figure 4: The spiral path along which each clickstream is looking fora position to stay without overlap. The starting point is in the centerand the clickstream rectangle marches to the outer end.

Algorithm 2 Layout Generation by Randomized Greedy Algorithm

Input: clickstream rectangles V= {v1, . . . ,vN}, the correspondingmapped positions P= {p1, . . . ,pN}, the associated significancemeasure S = {s1, . . . ,sN}, and the flag signs F = { f1, . . . , fN}to indicate whether a clickstream is representative.

1: Sort rectangles according to S, and move the ordered represen-tative clickstreams to the beginning of the list;

2: for each rectangle vn in the sorted list do3: while vn doesn’t reach the outer end of the spiral do4: Move vn a bit along the spiral path and try to place it;5: if vn doesn’t intersect with other placed rectangles then6: Place vn;7: break;8: end if9: end while

10: if vn is representative, but did’t find a placement then11: Place vn at pn;12: end if13: end for

mapped (See Figure 5(a)). Although we can see the cluster struc-ture in the visualization, it is impossible to tell the representativebehavior patterns of each cluster because of the overlapping. Thus,the layout has to be adjusted to reduce visual clutter. In addition,it is unscalable and unnecessary to display all clickstreams on alimited screen, especially when the data size is large. A properplacement strategy is expected to satisfy the following principles:

- Uncluttered, the clutter ought to be at a low level whichdoesn’t affect visual pattern perception;

- Consistent, the topological relations among the mapped click-streams should be preserved;

- Representative, important clickstream patterns should beguaranteed to be presented.

We fully considered the above three principles during the place-ment strategy design. First, we move clickstreams to avoid over-lapping as much as possible. Second, regarding the consistencyprinciple, when moving a clickstream, we constrain the placementin its surrounding area. Lastly, we employ a significance factor tomeasure the representativeness of each clickstream. For importantclickstreams, we guarantee that they have higher priorities so thatthey are placed first. The significance factor of each clickstream un

contributing to a model k is defined as

sn,k = fn p(un|θk) (9)

where fn represents the frequency of a clickstream pattern un’s ex-istence, and p(un|θk) is the probability of un fitting in model k.

We define the associated significance of one clickstream as,

sn = maxk

sn,k (10)

6

(a) (b)

Figure 5: Comparison of the visualization before and after the clickstream layout is adjusted. (a) presents the visualization of clickstreams byplacing them where they are mapped on the 2D plane; (b) presents the visualization of clickstreams after the data has been sorted according tosignificance measure and then placed by searching along the spiral path.

The layout algorithm is illustrated in Algorithm 2. All click-streams are first sorted in a descending order according to theirmaximum significance values across all models. Then, a random-ized greedy algorithm is applied to place the clickstream rectan-gles. Every clickstream rectangle is trying to be placed along aspiral path starting at the clickstream’s mapped position, as shownin Figure 4. After a limited number of trials, the rectangle is eitherplaced or discarded finally. However, a set of clickstreams that bestfit each model based on the significance values are selected as rep-resentative clickstreams. They are guaranteed to be placed even ifoverlapping cannot be avoided. We use a Boolean sign b to indicatewhether one clickstream is representative, B = {b1, . . . ,bN}, for allclickstreams. A little clutter would not prohibit the patterns percep-tion because human eyes are “low-pass” filters [4]. When similardata objects are grouped and form a significant pattern to attractusers’ attention, they can discover the clustered patterns and ignorehigh-frequency “noise”.

This layout generation approach is straightforward, and the finalvisualization shows distinct patterns with little clutter. Addition-ally, we evaluate the completeness of placement of significant click-

streams by the factor CoS = ∑significance of the placed clickstreams

∑significance of all clickstreams . Fig-

ure 5 shows the comparison between the visualizations before andafter our placement method is applied to the Sell-Your-Item dataset.In Figure 5(b), the CoS value is 93.2%, which means the majorityof the clickstream patterns are displayed.

6 INTERACTIVE EXPLORATION

Interaction is the key to exploratory data analysis. It provides theopportunity for people to communicate with data. Our system en-ables examining details about one clickstream or any chosen groupof clickstreams. Based on the visualization, analysts may want todivide the whole clickstream dataset into a number of clusters basedon their perception. Our system supports interactive cluster analysisunder the analysts’ supervision.

6.1 Data Profile Exploration

Visualization is a way of communicating messages from data. An-alysts observe, interpret and make sense of what they see to un-derstand the clickstreams. Figure 6 shows the system interface fordata visualization and exploration. The clickstreams are visualizedon the left side (Figure 6(a)) with the legends of actions displayed

at the upper-right corner (Figure 6(b)). The analyst can move themouse over one interesting clickstream pattern to check the dataprofile. For example, the clickstream “I” highlighted in Figure 6(a)is an interesting clickstream. Figure 6(c) shows the visual pattern ofclickstream “I” and the existence frequency of this clickstream pat-tern in the dataset. Figure 6(d) presents the demographic informa-tion and statistical summary about the corresponding clickstreampattern by using histograms. The blue histograms indicate the sta-tistical distribution of each data attribute for the specific clickstreampattern, while the background grey histograms show the overall dis-tribution of the entire dataset. The analyst can also freely select agroup of clickstreams (the highlighted group “G” in Figure 6(a)) tocheck the group statistical information. This intuitive explorationapproach assists analysts to learn about data details from multipleaspects and at different scales.

6.2 Interactive Cluster Analysis

Although SOM projects clickstreams onto the 2D plane with sim-ilar data staying together, deciding which clickstreams belong tothe same group still depends on domain knowledge. Since the dis-played clickstreams are only representative samples, it is also nec-essary to support cluster analysis of the whole dataset in order toobtain a thorough statistical summary. We introduce an interactivecluster analysis approach, the semi-supervised K-means, to meetthis need. The analyst can specify distinct groups of clickstreams onthe visualization and then cluster the whole dataset using the speci-fied groups. During the process of cluster analysis, the clickstreamsare represented by their 2D mapped coordinates. Because the orig-inal inter-clickstream topological relations are preserved while datamapping, it is reasonable to use the 2D coordinates to cluster click-streams.

People can easily perceive and verify clickstream patterns by us-ing the provided visualization and interaction tools. We introducean interactive cluster analysis method by combining the automaticK-means algorithm [8] and experts’ domain knowledge throughinteractive visualization. We utilize the analyst’s input as initial-ization and constraints in the K-means algorithm. Considering K-means has one drawback that it is only feasible for searching hyper-spherical-shaped clusters [1, 3, 25], we adopt a centroid chain tech-nique to deal with this problem. In our method, each cluster is rep-resented by a centroid chain instead of a centroid as in the standard

7

Figure 6: The interface for clickstream pattern exploration. (a) Clickstreams visualization; (b) action legends; (c) the selected clickstream; (d)data profile of one single clickstream or a data group. Whenever the analyst selects a single clickstream(indicated as “I”) or a group(indicated as“G”) in area (a), the corresponding information summary is shown in area (b).

Algorithm 3 Semi-supervised K-means Using Centroid Chains

Input: A set of labeled clickstreams by the analyst and unlabeledones. K groups of centroid chains C = ∪K

k=1Ck.1: Initialize the cluster number as K, and Ck as the centroid chain

of the cluster k;2: while WCSS is reducing do3: // Clickstreams assignment4: for each clickstream pn do5: if pn was labeled then6: Assign pn to the user specified cluster;7: else8: Assign pn to the cluster that has pn’s closest cen-

troid chain using the distance measure as Equation12, record the closest clickstreams to each centroidchain nodes;

9: end if10: end for11: // Centroid chains updating12: Update centroid chains by taking the mean over each cen-

troid chain node and its corresponding closest clickstreams;13: Calculate within-cluster sum of squares (WCSS);14: end while

K-means algorithm.When the analyst starts the interactive cluster analysis, she

sketches loops to include clickstreams. All clickstreams in one loopbelong to a group. The clickstreams lying on the loop are connectedas a centroid chain. After the interaction, the semi-supervised K-means algorithm initializes the number of clusters as K, the num-ber of clickstream groups, and extracts the centroid chain from eachgroup to represent the cluster. Then the algorithm proceeds by al-ternating between two steps, the assignment step and the updatingstep, until the within-cluster sum of squares (WCSS) is minimized.The WCSS is defined as,

WCSS =K

∑k=1

∑pn∈Sk

Dist(pn,Ck) (11)

where Sk corresponds to the cluster set K and pn is a data sam-ple within Sk, Ck is the centroid chain of cluster K with Ck ={p1, . . . ,pm, . . . ,pM}. The distance Dist(pn,Ck) between a datapoint and a centroid chain is defined as,

Dist(pn,Ck) = Minpm∈Ck

||pn −pm|| (12)

In each iteration of the K-means algorithm, at the assignmentstep, the selected clickstreams by the analyst are assigned to thespecified group, and the unselected data are assigned to its clos-est cluster. Meanwhile, each centroid pm along the centroid chainrecords a collection of data points that are most close to it. At theupdating step, the centroid chain of each cluster is updated by tak-ing the mean over pm and its associated collection of close pointsrecorded at the assignment step. Details of our semi-supervised K-means algorithm are shown in Algorithm 3. The final clusteringresults not only are presented in the visualization immediately butalso can be exported to files for further analysis.

7 CASE STUDIES

Understanding how people use the website is critical to the successof an e-commerce business like eBay. As mentioned earlier, ourwork was motivated by the fact that analysts and product managersat eBay have difficulties in obtaining insights into user behaviorpatterns from the clickstream data. We invited some of them to useour system, explore data in their domains, and give feedback. Inthis section, we present our observations of how the system wasused in two case studies.

7.1 Understanding Behavior Patterns on Sell-Your-Item

Page

For the eBay marketplace, listing items for sale is the beginning ofall seller activities. Thus, making the listing process intuitive andefficient is important. As described in Section 3, sellers are requiredto fill eight sections on the Sell-Your-Item page. From our collabo-ration with the product team we learned that the layout and orderingof these sections are critical to the website usability. Currently, thesections are laid out in a sequential order from top down, which wasdesigned and evaluated by user experience designers based on userstudies conducted in usability labs. They would like to understandhow users interact with the page in real-world scenarios. The twoanalysts who were invited to this study would like to seek answersparticularly to the following questions:

Do users fill all the information requested on the page?

Do they follow the pre-defined order to fill in information?

8

Figure 7: Clickstream patterns with pictures uploaded first. Com-pared to the average category distribution in the data, these activi-ties are more likely to happen in the Fashion category. One possibleexplanation is that pictures are very important for selling clothes orshoes.

Figure 8: Clickstream patterns that do not include uploading a pic-ture. Most likely these activities happen in the Media category, whichhas a complete catalog, and default pictures of products are oftenprovided to users during listing.

What are the scenarios when the answers to the questions aboveare ”No”?

In preparing the data, we sampled from one day clickstream dataon the Sell-Your-Item page of eBay US site. Each visit is a sequenceof actions in terms of the sections they edit as described in Section3. In order to answer the last question, the following selling relatedinformation is collected based on the analysts’ recommendations.

• Seller segment

• User gender

• Years being an eBay user

• Selling category

Figure 5 (b) shows the generated visualization. As the partici-pants expected, most users follow the default ordering and fill mostof the sections on the Sell-Your-Item page. Although no cluster withvery distinct patterns stands out, the visualization effectively showsthe variation in the data. The participants pointed out interestingbehavior patterns as outlined in the figure. For example, the pat-terns included in the red box show that users start filling the pageby uploading pictures rather than editing the title. There are also

Figure 9: Clickstream patterns without uploading a picture or writinga description. The corresponding users are less experienced, andmight not understand the importance of the item descriptions for sell-ing on eBay.

patterns in which certain actions are not performed. For example,in the green box users do not upload pictures, and those in the bluebox do not write a description. The participants then would liketo investigate in what scenarios such behaviors happen in order toinfer potential causes. They conducted such analysis by selectingthe patterns of interests one by one and investigating correspondingsummary statistics.

First, the participants noticed in the visualization that a signifi-cant number of clickstreams start by uploading pictures rather thanediting the title. They considered this interesting, because the ti-tle is one of the most critical parts of a listing on eBay, and thetitle input box is placed at the beginning of the page. By selectingthese clickstreams, they can inspect related demographics and sell-ing information on the side (See Figure 7). Please note that due tothe paper space limitation, in Figure 7, Figure 8 and Figure 9, wechoose to show only part of the visualization that contains the click-streams of interest and the statistics summary panel for illustrationpurposes. The statistical information on the side reveals that suchactivities are more likely to happen in the Fashion category and lesslikely in the Tech category compared to the selling category distri-bution of the entire dataset. They considered this to be reasonablebecause pictures are generally more effective than text in describingclothes and shoes.

Next, the analysts investigated the scenario where users do notupload a picture (See Figure 8). It turned out that the majorityof this behavior happens in the Media category, which includesbooks, CDs, DVDs, and etc. For such products, a pre-filled Sell-Your-Item page with standard product pictures are often providedby eBay, while others are not, e.g., clothes in the Fashion categoryand antiques in the Collectibles category. Therefore, most users usethe provided product pictures instead of uploading their own. Be-fore the study, the analysts thought that user behavior during listingcould correlate with various characteristics of the users and listings.Therefore, they recommended that we include seller segments, gen-der, years being a eBay user, and selling category for study. Thesefindings suggest that among these factors, the selling category ismost correlated to user behavior.

The participants continue to examine other clickstream patternsthat do not contain picture uploads. They noticed that a large num-ber of users not only do not upload a picture, but also skip the de-scription section (See Figure 9). For a listing on eBay, users areencouraged to write a detailed description of the items they areselling. The participants mentioned that previous studies showedthat the quality of the description plays a critical role in the buyer’spurchase decision. These listings are also more likely to belong

9

Figure 10: Interactive clustering of clickstream patterns in eBay shopping cart. The analyst groups similar patterns using the interactive lassotool. The result shows strong correlation between user behavior and cart drop-offs. The histograms show corresponding statistical informationof the clusters. Red bars denote the drop-offs, and blue ones denote the successful visits.

to the Media category. However, the analysts observed that thedistribution of years being an eBay user is different from that ofthe group selected in Figure 8. These users are less experienced.The analysts’ intuition was that these inexperienced users have notlearned the importance of descriptions in selling. Based on thisobservation, they considered providing more explicit messaging onthe Sell-Your-Item page to encourage especially the inexperiencedusers to write a description.

Overall, the analysts from the selling team liked the system,which enabled them to explore the data interactively. They ex-pected more diverse behavior patterns on the Sell-Your-Item page.After the study, they realized that most users followed the designedprocess, except several unique categories, e.g., Media and Fashion.This was a significant finding for them. They considered investi-gating the feasibility of providing customized listing processes forthese categories.

7.2 Finding Factors That Correlate to Drop-offs in Shop-ping Cart

When users make purchases at eBay, items are first added into theshopping cart. Users can view their items in the cart, add moreitems, remove items, update item quantities and save items for fu-ture purchase. When users finally decide to buy the items in thecart, they proceed to checkout, which is considered a successfulcart visit. When users end their visit without proceeding to check-out, we call it a drop-off. The performance of the cart is measuredby the drop-off rate, which is defined as the percentage of drop-offsin all cart visits. The lower the drop-off rate, the better the shoppingcart performs. As the shopping cart is considered the most criticalstep in the buying process, optimizing user experience in the cartand reducing drop-off rate will greatly benefit eBay. Furthermore,predicting cart drop-offs provides opportunities for eBay to preventdrop-offs and boost sales. For example, alternative merchandisethat might suit users’ demands and incentives can be offered to en-courage users to checkout when users’ intentions to drop-off aredetected by eBay. Understanding factors that correlate with cartdrop-offs is the key to all of the above. Knowing that we had de-veloped the visual analytics system, an analyst and a productionmanager from the cart team reached out for help. Based on theirdefinitions, a user behavior in the cart is composed of any of the 5possible actions mentioned above. In addition, other factors, suchas users’ buyer segment, gender, years being an eBay user, time ofthe day when the visit happens, and the number of items in the cart,

are also captured based on their recommendations.

After a simple walkthrough demo of the system, we let themexplore by themselves. Their goal was to investigate the correlationbetween cart drop-offs and various factors, including user behaviorpatterns in the clickstream data. Since the visualization providesa clear overview of the clickstream data, they immediately spottedgroups of different patterns. They then started to circle these groupsand investigate summary statistics of them. We noticed that theparticipants quickly went through many iterations of merging andsplitting the groups. Finally, six groups were created, as illustratedin Figure 10. We can see that clickstreams in the same group havesimilar patterns. The overall drop-off rate varies among groups,which suggests a strong correlation between user behavior and cartdrop-offs. The number of items in the cart also shows correlationwith the behavior pattern and drop-off rate. Group F has the lowestdrop-off rate. Most clickstreams in this group have only one itemin the cart. As the overall drop-off rate increases, the distribution ofthe cart size shows a larger variance.

In Figure 10, irregular circles were drawn to group clickstreamsthat were not spatially close to each other, e.g., group B. In otherwords, our automatic data mapping algorithm did not consider themto be similar patterns. We asked the participants their reasons forsuch groupings and which criteria they used to evaluate similarity.They demonstrated their reasoning as illustrated in Figure 11. Ourmodel-based SOM did a good job in mapping clickstreams in groupB and those in group A together, because they both have patterns ofalternating between View and Remove. The difference is that groupB starts with an Add while group A does not. Clickstreams in groupC also start with Add, but do not have the alternating pattern. How-ever, when the analysts investigated the statistics of these groups,they found that the drop-off rate of group B was almost twice thatof group A, but was close to group C. Based on their knowledgeof the shopping cart, the participants knew that when users enterthe shopping cart, their intentions are very different depending onwhether they added an item before. Therefore, they chose to groupB and C together. Such domain knowledge is hard to summarizeand implement directly in the clustering algorithm. It is most effec-tive to capture it through interacting with a visual analytics system.

Finally, our system performed a semi-supervised clusteringbased on the grouping in Figure 10, and the results are illustratedin Figure 12. Six clusters with different patterns were created. Theresults were exported and used by the analysts in further statisticalanalysis. Both participants appreciated the insights they obtained

10

Figure 11: Interactive exploration of clickstream patterns in eBayshopping carts. The analyst groups similar patterns using the inter-active lasso tool. The result shows strong correlations between userbehavior and cart drop-offs. The histograms show correspondingstatistical information about the clusters. Red bars denote drop-offs,and blue ones denote successful visits.

through the study. The analyst learned that user behavior and thenumber of items in the cart can be good features for predicting cartdrop-offs. She said that this definitely will help her develop betterprediction models. The production manager considered developinga user segmentation based on the clustering results.

7.3 Feedback

All the participants, including analysts and product managers, hadlittle experience with information visualization besides basic charts.Therefore, we conducted short training sessions before they startedto use the system in both studies. In the training, we explained thevisual encoding of patterns and the layout. They were able to un-derstand the visualization and identify clusters of similar patternsafter the training. However, one product manager did mention thathe would have difficulties interpreting the visualization by himself.One participant who considered the visualization intuitive said thatthe analogy between this and tag clouds helped. After the initiallearning period, the participants all found the visualization veryuseful. One analyst said, “Being able to see all the user actions isso powerful. I now immediately know not only the most commonbehavior patterns but also the outliers.”

Figure 12: Final clustering results obtained by semi-supervised clus-tering based on input from the analysts. Different colors indicate dif-ferent clusters. Six clusters were created, and each cluster of click-streams showed its own distinct behavior patterns. The results werealso exported and used by the analysts in further statistical analysis.

Most participants found the interactions of the system intuitive.Some of them even started to select groups of patterns by circlingwithout being told. They really liked that they could investigate thesummary statistics of the selected groups. One analyst said, “Thistotally enables us to correlate users’ behavior with their demograph-ics. More important, I can do it interactively.” The interactivity ofour system enables very effective data exploration and reasoning.

8 CONCLUSIONS AND FUTURE WORK

We present a visual cluster exploration approach to analyze valu-able web clickstream data. Our approach maps the heterogeneousclickstreams in a 2D space, visualizes representative data samples,and enables user-guided data exploration and clustering. The visualexploration approach helps analysts uncover user behavior patterns,learn the pattern demographics and make sense of the interrelation-ships between the patterns and their demographics. This knowledgewill help managers make better business strategies, leading to bet-ter services. More importantly, our problem solving framework isnot constrained to analyzing the web clickstream data. It can beextended to deal with a broader class of categorical sequences datain many other fields.

While our approach is effective, there are a few aspects to im-prove. For instance, the current visual encoding is suitable for click-streams with a small number of predefined action options. Whenthe number of action options increases to tens or hundreds, the vi-sual encoding method should be reconsidered. Because there isalways a mismatch between the ever increasing data size and thelimited display screen space, we need to investigate how to takeadvantage of level-of-detail layout techniques to handle very largedatasets.

ACKNOWLEDGEMENTS

The first author was a summer intern at eBay Research Labs whilethis work was done. This work has also been supported in partby the U.S. National Science Foundation through grants CCF-0811422, IIS-1147363, CCF-0808896, and CCF-1025269.

11

REFERENCES

[1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful

seeding. In Proceedings of the eighteenth annual ACM-SIAM sympo-

sium on Discrete algorithms, SODA ’07, pages 1027–1035, Philadel-

phia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

[2] A. Banerjee and J. Ghosh. Clickstream clustering using weighted

longest common subsequences. In In Proceedings of the Web Mining

Workshop at the 1st SIAM Conference on Data Mining, pages 33–40,

2001.

[3] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering

by seeding. In Proceedings of the Nineteenth International Confer-

ence on Machine Learning, ICML ’02, pages 27–34, San Francisco,

CA, USA, 2002. Morgan Kaufmann Publishers Inc.

[4] W. L. Braje, B. S. Tjan, and G. E. Legge. Human-efficiency for rec-

ognizing and detecting low-pass filtered objects. Vision Research,

35(21):2955–2966, 1995.

[5] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. Visualiza-

tion of navigation patterns on a web site using model-based clustering.

In Proceedings of the sixth ACM SIGKDD international conference

on Knowledge discovery and data mining, KDD ’00, pages 280–284,

New York, NY, USA, 2000. ACM.

[6] S.-S. Cheng, H.-C. Fu, and H.-M. Wang. Model-based clustering by

probabilistic self-organizing maps. IEEE Transactions on Neural Net-

works, 20(5):805–826, 2009.

[7] Y. Fu, K. Sandhu, and M.-Y. Shih. Clustering of web users based on

access patterns. In In Proceedings of the 1999 KDD Workshop on Web

Mining. Springer-Verlag, 1999.

[8] J. A. Hartigan and M. A. Wong. A k-means clustering algorithm.

JSTOR: Applied Statistics, 28(1):100–108, 1979.

[9] B. Hay, G. Wets, and K. Vanhoof. Clustering navigation patterns on

a website using a sequence alignment method. In In Proceedings of

17th International Joint Conference on Artificial Intelligence, pages

1–6, 2001.

[10] B. Johnson and B. Shneiderman. Tree-maps: a space-filling approach

to the visualization of hierarchical information structures. In Proceed-

ings of the 2nd conference on Visualization ’91, VIS ’91, pages 284–

291, Los Alamitos, CA, USA, 1991. IEEE Computer Society Press.

[11] T. Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1–6,

1998.

[12] T. Kohonen, M. R. Schroeder, and T. S. Huang, editors. Self-

Organizing Maps. Springer-Verlag New York, Inc., Secaucus, NJ,

USA, 3rd edition, 2001.

[13] R. Kothari, P. A. Mittal, V. Jain, and M. K. Mohania. On using page

cooccurrences for computing clickstream similarity. In SDM, 2003.

[14] H. Lam, D. Russell, D. Tang, and T. Munzner. Session viewer: Visual

exploratory analysis of web session logs. In Proceedings of the 2007

IEEE Symposium on Visual Analytics Science and Technology, pages

147–154, Washington, DC, USA, 2007. IEEE Computer Society.

[15] J. Lee, M. Podlaseck, E. Schonberg, and R. Hoch. Visualization and

analysis of clickstream data of online stores for understanding web

merchandising. Data Min. Knowl. Discov., 5:59–84, January 2001.

[16] A. Makanju, S. Brooks, A. N. Zincir-Heywood, and E. E. Milios.

Logview: Visualizing event log clusters. In Proceedings of the 2008

Sixth Annual Conference on Privacy, Security and Trust, pages 99–

108, Washington, DC, USA, 2008. IEEE Computer Society.

[17] E. Manavoglu, D. Pavlov, and C. Giles. Probabilistic user behavior

models. In Data Mining, 2003. ICDM 2003. Third IEEE International

Conference on, pages 203 – 210, nov. 2003.

[18] G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions

(Wiley Series in Probability and Statistics). Wiley-Interscience, 2 edi-

tion, Mar. 2008.

[19] A. L. Montgomery, S. Li, K. Srinivasan, and J. C. Liechty. Modeling

online browsing and path analysis using clickstream data. Marketing

Science, 23:579–595, September 2004.

[20] Z. Shen and N. Sundaresan. Trail explorer: Understanding user expe-

rience in webpage flows. In IEEE VisWeek Discovery Exhibition, Salt

Lake City, Utah, USA, 2010.

[21] P. Smyth. Clustering sequences with hidden markov models. In Ad-

vances in Neural Information Processing Systems, pages 648–654.

MIT Press, 1997.

[22] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan. Web usage

mining: discovery and applications of usage patterns from web data.

SIGKDD Explor. Newsl., 1:12–23, January 2000.

[23] T. Takada and H. Koike. Mielog: A highly interactive visual log

browser using information visualization and statistical analysis. In

Proceedings of the 16th USENIX conference on System administra-

tion, pages 133–144, Berkeley, CA, USA, 2002. USENIX Associa-

tion.

[24] A. Vakali, J. Pokorny, and T. Dalamagas. An overview of web data

clustering practices. In EDBT Workshops, pages 597–606, 2004.

[25] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrodl. Constrained k-

means clustering with background knowledge. In Proceedings of

the Eighteenth International Conference on Machine Learning, ICML

’01, pages 577–584, San Francisco, CA, USA, 2001. Morgan Kauf-

mann Publishers Inc.

[26] M. Wattenberg. A note on space-filling visualizations and space-filling

curves. In Proceedings of the Proceedings of the 2005 IEEE Sympo-

sium on Information Visualization, INFOVIS ’05, pages 24–, Wash-

ington, DC, USA, 2005. IEEE Computer Society.

[27] W. Xu, M. Esteva, S. D. Jain, and V. Jain. Analysis of large digital

collections with interactive visualization. In IEEE VAST, pages 241–

250, 2011.

[28] A. Ypma and T. Heskes. Automatic categorization of web pages and

user clustering with mixtures of hidden markov models. In WEBKDD,

pages 35–49, 2002.

12

Visual Cluster Exploration of Web Clickstream...

Documents

Transcript of Visual Cluster Exploration of Web Clickstream...