
  • Learning User Clicks in Web Search

    Ding Zhou, Levent Bolelli, Jia Li, C. Lee Giles, Hongyuan Zha

    Department of Computer Science and Engineering

    Department of Statistics

    College of Information Sciences and Technology

    The Pennsylvania State University, University Park, PA 16802

    Abstract

Machine learning for predicting user clicks in Web-based search offers automated explanation of user activity. We address click prediction in the Web search scenario by introducing a method for click prediction based on observations of past queries and the clicked documents. Due to the sparsity of the problem space, commonly encountered when learning for Web search, new approaches to learn the probabilistic relationship between documents and queries are proposed. Two probabilistic models are developed, which differ in the interpretation of the query-document co-occurrences. A novel technique, namely, conditional probability hierarchy, flexibly adjusts the level of granularity in parsing queries and, as a result, leverages the advantages of both models.

    1 Introduction

Predicting the next click of a user has gained increasing importance. A successful prediction strategy makes it possible to perform both prefetching and recommendations. In addition, the measurement of the likelihood of clicks can infer a user's judgement of search results and improve ranking.

As an important goal of Web usage mining [Srivastava et al., 2000], predicting user clicks in Web site browsing has been extensively studied. Browse click prediction typically breaks down the process into three steps: (1) clean and prepare the Web server log data; (2) extract usage patterns; and (3) create a predictive model. There have been a variety of techniques for usage pattern extraction: user session clustering [Banerjee and Ghosh, 2001; Gunduz and Ozsu, 2003], page association rule discovery [Gunduz and Ozsu, 2003], Markov modeling [Halvey et al., 2005], and implicit relevance feedback [Joachims, 2002].

Contrary to the rich research on click prediction in Web site browsing, the prediction of user clicks in Web search has not been well addressed. The fundamental difference between predicting clicks in Web search and in Web site browsing is the scale of the problem space. The modeling of site browsing typically assumes a tractable number of pages with a reasonable number of observations of the associations among these pages. This assumption, however, no longer holds in the Web search scenario. The vast number of documents quickly results in a very high-dimensional space and hence sparsifies the problem space (or, equivalently, leads to a lack of observations), which affects model training. In order to reduce the problem space, some previous work first classifies the large document collection into a small number of categories so as to predict topic transitions in Web search [Shen et al., 2005]. However, since the prediction of user clicks requires the granularity of a single document, we do not consider issues in document clustering but rather work on the full space of the document collection.

Consider the problem of learning from Web search logs for click prediction. We define the task as learning the statistical relationship between queries and documents. The primary assumption is that the clicks by users indicate their feedback on the quality of query-document matching. Hypothesize that the vocabulary of queries, in terms of words, remains stable over a period of time; denote this vocabulary by W. Ideally, if we were able to collect sufficient instances for every combination of words (in 2^W) and the documents clicked with these queries, we would be able to estimate the probability of future clicks based on past observations. The feasibility of the approach, however, relies on the assumption that the training data exhausts all the possible queries.

However, since the number of different queries in Web search explodes with emerging concepts in user knowledge and randomness in the formation of queries, the tracking of all possible queries becomes infeasible both practically and computationally. For example, Fig. 1 illustrates the increase in the number of distinct queries submitted to CiteSeer over a period of three months. The linear growth of distinct queries over time indicates that we can hardly match even a small fraction of the new queries exactly with the old queries. As a result, prediction cannot be performed for new queries, yielding low predictability. Furthermore, many new documents are being clicked even after documents clicked on over a considerably long time have been accumulated. Learning the relationship between the complete queries and the documents is consequently a highly challenging problem.

An alternative naive solution to the lack of training instances is to break down queries into words. An observation of a query and the corresponding clicked document (a query-document pair) is transformed into several independent observations of a word and the document, i.e., word-document pairs.
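This decomposition is mechanical; a minimal sketch (the function name and toy data are illustrative, not from the paper):

```python
# Transform each (query, clicked document) observation into several
# independent (word, document) observations.
def to_word_doc_pairs(query_doc_pairs):
    word_doc_pairs = []
    for query, doc in query_doc_pairs:
        for word in query.split():
            word_doc_pairs.append((word, doc))
    return word_doc_pairs

pairs = [("machine learning", "doc1"), ("search engine", "doc2")]
print(to_word_doc_pairs(pairs))
# [('machine', 'doc1'), ('learning', 'doc1'), ('search', 'doc2'), ('engine', 'doc2')]
```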

IJCAI-07 1162

[Figure 1 plots the number of observations (up to about 9 × 10^5) against observation time in weeks, for the number of different queries and the number of different words.]

Figure 1: Number of distinct queries and words in CiteSeer over 12 weeks.

This solution can predict unknown queries as long as the new query contains some known words. However, this solution suffers in prediction accuracy due to the discarding of word-proximity information in existing queries.

We first propose two probabilistic models, namely the full model and the independent model, in order to capture the ideas behind the above intuition; the two models interpret query-document pairs differently. The full model tends to achieve high prediction accuracy if sufficient training instances are provided, but it cannot be applied when new queries are encountered. On the other hand, the independent model yields high predictability while suffering in accuracy. In order to trade off prediction accuracy against predictability, we suggest a new conditional probability hierarchy technique that combines the two models. We are not aware of any previous work studying the click prediction problem for Web search on such large-scale search logs. In addition, as a by-product of the new combination approach, n-grams of words are discovered incrementally.

    2 Related Work

User click prediction is based on understanding the navigational patterns of users and, in turn, modeling observed past behavior to predict future behavior. Usage pattern modeling has traditionally been achieved by session clustering, Markov models, association rule generation, collaborative filtering, and sequential pattern generation.

Markov-model-based prediction methods generally suffer from high order, usually needing clustering of user clicks to reduce the limitations stemming from high state-space complexity and decreased coverage. On the other hand, lower-order Markov models are not very accurate in predicting users' browsing behavior, since these models keep only a small window to look back in the history, which is not sufficient to correctly discriminate observed patterns [Deshpande and Karypis, 2004]. Markov models seem to be more suitable for mobile applications [Halvey et al., 2005], where the number of states is low due to the few links a user can navigate.

User session clustering [Banerjee and Ghosh, 2001; Gunduz and Ozsu, 2003] identifies users' navigational paths from web server logs and defines session similarity metrics based on the similarity of the paths and the time spent on each page of a path. Clustering is employed on the similarity graph, which is utilized for predicting users' requests.

Beeferman and Berger [Beeferman and Berger, 2000] proposed query clustering based on click-through data. Each (query, clicked URL) pair is used to construct a bipartite graph, and queries that result in clicks on the same URL are clustered. The content features in the queries and the documents are discarded. Joachims [Joachims, 2002] introduces a supervised learning method that trains a retrieval function based on click-through data, taking the relative positions of clicks in a ranking as the training data set. In this paper we pursue a different approach that seeks to learn the statistical relationship between queries and user actions.

    3 Problem Statement

The problem is formalized as follows. Let Ω be the full document set. D = {d_i} ⊆ Ω denotes the set of documents that have ever been shown to users in search results, and C ⊆ D is the set of documents that have been clicked on. W = {w_i} denotes the word vocabulary. The query set is Q = {q_j}, where q_j = {w_{j1}, ..., w_{jk}} ∈ 2^W. Our observation of the Web search log is abstracted as a sequence of query-document pairs O ⊆ Q × C. The posterior probability of observing a click on a certain document d is P(d | q, d_q), where d_q represents the document list returned with query q.

The problem is to predict P(d | q, d_q, O), which measures the probability of a user clicking on document d for query q when presented with the result list d_q. The prediction of the user click for query q becomes d* = argmax_d P(d | q, d_q, O), where d ∈ d_q and O is the observation of query-document clicks so far.

    4 Probabilistic Models

The two probabilistic models we propose are for acquiring estimations of P(d|q) for each d in d_q, given query q. Depending on the interpretation of a click on d for q, two types of models are introduced, namely, the independent model and the full model.

    4.1 Independent model

When we observe a query-document pair ⟨d, q⟩, how do we interpret it? The independent model we propose assumes that the words in q are independent of each other. Formally, the independent model interprets an instance ⟨d, q⟩ as observing d and, given d, observing the words w_1, ..., w_k independently.

Let us consider how P(d|q) is estimated under the independent model. In the case where the query consists of k words, q = [w_1, ..., w_k], and the clicked document is d, we measure the probability P(d | w_1, ..., w_k) as:

P(d | w_1, ..., w_k) = P(d, w_1, ..., w_k) / P(w_1, ..., w_k)

                    = P(d) P(w_1|d) ... P(w_k|d) / Σ_{d'∈D} P(d', w_1, ..., w_k)    (1)

                    = P(d) Π_{i=1}^{k} P(w_i|d) / Σ_{d'∈D} P(d') Π_{i=1}^{k} P(w_i|d')    (2)

Eq. 1 is obtained using the Bayes formula. The transition from Eq. 1 to Eq. 2 assumes conditional independence between w_i and w_j, according to the model definition. P(d) is the marginal probability of document d, which is proportional to the number of occurrences of d. The above derivations show that, for the purpose of calculation, each ⟨d, q⟩ can be broken down into a collection of independent instances ⟨d, w_i⟩, where i = 1, ..., k.

As indicated in Eq. 2, the estimation requires calculation of P(d) and P(w_i|d). Since we are able to keep track of P(d) easily, the computation for Eq. 2 then reduces to estimating P(w_i|d). In Section 6, we discuss the Bayesian estimation of P(w_i|d) to address the sparsity in training data.
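As a sketch (not the authors' implementation), Eq. 2 can be computed directly from click counts. Here P(w|d) is taken as a simple maximum-likelihood ratio of counts, whereas the paper ultimately uses the Bayesian estimate of Section 6; the toy data and names are illustrative:

```python
from collections import Counter

# Toy click log of (document, query) observations.
clicks = [("d1", "machine learning"), ("d1", "machine"), ("d2", "learning theory")]

doc_count = Counter(d for d, _ in clicks)                                # for P(d)
word_doc_count = Counter((w, d) for d, q in clicks for w in q.split())   # for P(w|d)

def p_w_given_d(w, d):
    # MLE of P(w|d): fraction of clicks on d whose query contains w.
    return word_doc_count[(w, d)] / doc_count[d]

def independent_model(query, candidates):
    # Eq. 2: P(d|w1..wk) proportional to P(d) * prod_i P(wi|d),
    # normalized over the candidate documents.
    words = query.split()
    scores = {}
    for d in candidates:
        p = doc_count[d] / len(clicks)   # P(d)
        for w in words:
            p *= p_w_given_d(w, d)
        scores[d] = p
    z = sum(scores.values())
    return {d: (s / z if z > 0 else 0.0) for d, s in scores.items()}

print(independent_model("machine learning", ["d1", "d2"]))
```

On this toy log, "d2" was never clicked for "machine", so its score is zero and all mass goes to "d1".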

    4.2 Full model

While the independent model treats each query as a set of independent single words, the full model reflects the other extreme, where all words within a query are treated as a group. In our definition, the full model treats the q in an instance ⟨d, q⟩ as a singleton: the combination of words in q is regarded as a single entity.

The full model emphasizes a query as a group and hence yields high prediction accuracy, provided that a large number of queries have been observed with large supports. However, as noted before, the number of different queries grows so quickly that the full model always suffers from a lack of training data in practice. In the next section, we introduce the method to combine the full model and the independent model in order to address this sparsity issue.
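A minimal sketch of the full-model estimate, which simply counts whole-query co-occurrences (names and toy data are illustrative, not the authors' code):

```python
from collections import Counter

# Toy click log of (document, query) pairs.
clicks = [("d1", "machine learning"), ("d2", "machine learning"),
          ("d1", "machine learning")]

pair_count = Counter(clicks)                   # n(d, q)
query_count = Counter(q for _, q in clicks)    # n(q)

def full_model(query, doc):
    # P(d|q) estimated as n(d, q) / n(q); zero for unseen queries,
    # which is exactly the full model's predictability weakness.
    if query_count[query] == 0:
        return 0.0
    return pair_count[(doc, query)] / query_count[query]

print(full_model("machine learning", "d1"))   # 2 of the 3 clicks for this query hit d1
print(full_model("deep learning", "d1"))      # unseen query: no prediction possible
```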

    5 Probability Hierarchy

Ideally, if we were able to observe all future queries with the documents being clicked in the log file, especially with a decent number of instances, the full model alone would be sufficient. However, new queries keep emerging, which dramatically enlarges the problem space if we simply learn from the co-occurrences ⟨d, q⟩. In order to address this sparsity issue, we introduce the conditional probability hierarchy (CPH) method, which recursively generates multiple intermediate combinations between the full and independent models. The shrinkage rate is tunable according to the supports of the full model.

    5.1 Hierarchical shrinkage

The conditional probability hierarchy (CPH) we propose combines the full model and the independent model. It starts by treating a query as a set of independent words, i.e., obtaining P(d|w_j). Hierarchically, words are merged into n-grams 1 (or units) of increasing length, yielding P(d|u_k). The final estimation of P(d|q) is the hierarchical combination across several levels.

Fig. 2 illustrates an example estimation of P(d | a, b, c, d, e). Suppose we have a document d and a word sequence u (u can be either a single word or a multiple-word unit). Let P_f(d|u) denote the estimation of P(d|u) under the full model and P_i(d|u) that under the independent model. We use P_h(d|u) to denote the estimation of P(d|u)

1 The n-gram (or unit) in our case refers to a word sequence of length n.

[Figure 2 shows a binary parse tree over the words a, b, c, d, e, with internal nodes P(d|a,b) and P(d|c,d) at level L2, P(d|c,d,e) at L3, and P(d|a,b,c,d,e) at L4.]

Figure 2: CPH: hierarchical combination of conditional probabilities at different levels. To estimate P(d | a, b, c, d, e) for document d and query a, b, c, d, e, P(d | a, b) and P(d | c, d, e) are combined. P(d | a, b) is the combination of P(d | a), P(d | b), and n(d, a, b), where n(d, a, b) denotes the number of clicks on d with queries containing a, b.

with the CPH. A query q = a, b, c, d, e consists of five individual words a, b, c, d, and e. Suppose we already have the full model estimation of P(d|w) for w = a, b, c, d, and e. At level L1 we set P_h(d|w) equal to the full model estimation (throughout, P_f, P_i, and P_h denote the full model, independent model, and CPH estimates, respectively):

L1 : P_h(d|w) = P_f(d|w) = P_i(d|w)    (3)

where w = a, b, c, d, e. Note that at the single-word level the full model and the independent model are equivalent; thus we have P_f(d|w) = P_i(d|w).

Then we arrive at the combination of probabilities from level L1 to L2:

L2 : P_h(d|a, b) = (1 − λ) P_i(d|a, b) + λ P_f(d|a, b)    (4)

L2 : P_h(d|c, d) = (1 − λ) P_i(d|c, d) + λ P_f(d|c, d)    (5)

where λ is the shrinkage rate, set according to our confidence in the full model for this case, and P_i and P_f are the independent-model and full-model estimates. Note that λ can vary for every case and is tunable according to the observation supports of long units.

Similarly, the estimation P_h(d|c, d, e) at L3 is expressed as:

L3 : P_h(d|c, d, e) = (1 − λ) P_i(d|c, d, e) + λ P_f(d|c, d, e)    (6)

where the independent-model estimation of P(d|c, d, e) is the combination of P_h(d|c, d) and P_h(d|e), obtained by setting:

P_i(d|c, d, e) = P(d) P_h(c, d | d) P_h(e | d) / Σ_{d'} P(d') P_h(c, d | d') P_h(e | d').    (7)

Finally, the estimation P_h(d|a, b, c, d, e) becomes:

L4 : P_h(d|a, b, c, d, e) = (1 − λ) P_i(d|a, b, c, d, e) + λ P_f(d|a, b, c, d, e)    (8)

where, again, P_i(d|a, b, c, d, e) is estimated using the independent model on P_h(d|a, b) and P_h(d|c, d, e).

In general, for a query q and document d, the CPH estimate P_h(d|q) is:

P_h(d|q) = (1 − λ) P_i(d|q) + λ P_f(d|q).    (9)

Then P_i(d|q) is estimated using:

P_i(d|q) = P(d) P_h(q_l|d) P_h(q_r|d) / Σ_{d'} P(d') P_h(q_l|d') P_h(q_r|d')    (10)

where q_l ∪ q_r = q, i.e., q_l and q_r are a split of q; P_h(q_l|d) is obtained as P_h(d, q_l)/P(d); P_f(d|q) is the full-model estimation; and λ is the shrinkage rate. The structure of the tree reflects a particular recursive parsing of the queries into smaller word combinations and ultimately single words. With the structure of the tree fixed, the probability P_h(d|q) can be computed recursively by a bottom-up procedure; Eq. 9 and Eq. 10 constitute the recursion. We refer to the tree structure, exemplified in Fig. 2, as the conditional probability hierarchy (CPH).

The construction of the tree has much to do with the overall performance. Note that we use only a 2-way (binary) tree here. A greedy algorithm is used to produce the hierarchy. In particular, we iteratively search, at each level, for the adjacent binary units (each of which can be a word) with the largest support, and feed the merged units to the higher level.
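The bottom-up recursion of Eqs. 9 and 10 over a fixed binary parse tree can be sketched as follows. This is a simplified illustration, not the authors' implementation: the independent-model step multiplies the child CPH scores and renormalizes over a toy candidate set, and the support-based shrinkage weight uses the form λ = n/(μ + n) from Section 5; all names and numbers are assumptions:

```python
from collections import Counter

# Toy statistics: marginal P(d) and click counts n(d, unit); illustrative only.
docs = ["d1", "d2"]
p_doc = {"d1": 0.5, "d2": 0.5}
n_clicks = Counter({
    ("d1", "machine"): 4, ("d2", "machine"): 1,
    ("d1", "learning"): 3, ("d2", "learning"): 3,
    ("d1", "machine learning"): 2,     # observed support for the merged unit
})

def flatten(tree):
    # Leaves are words; internal nodes are (left, right) pairs.
    return [tree] if isinstance(tree, str) else flatten(tree[0]) + flatten(tree[1])

def p_full(d, unit):
    # Full-model estimate P_f(d|unit) from counts; 0 if the unit is unseen.
    total = sum(n_clicks[(dd, unit)] for dd in docs)
    return n_clicks[(d, unit)] / total if total else 0.0

def p_cph(d, tree, mu=2.0):
    if isinstance(tree, str):
        return p_full(d, tree)                     # Eq. 3: leaves use the full model
    left, right = tree
    unit = " ".join(flatten(tree))
    # Independent-model combination of the child CPH estimates, in the
    # spirit of Eq. 10: child scores multiplied and renormalized over docs.
    scores = {dd: p_doc[dd] * p_cph(dd, left, mu) * p_cph(dd, right, mu)
              for dd in docs}
    z = sum(scores.values())
    p_ind = scores[d] / z if z else 0.0
    # Support-based shrinkage: lambda = n(unit) / (mu + n(unit)).
    support = sum(n_clicks[(dd, unit)] for dd in docs)
    lam = support / (mu + support)
    return (1 - lam) * p_ind + lam * p_full(d, unit)   # Eq. 9

print(p_cph("d1", ("machine", "learning")))
```

With the toy counts above, the merged unit has support 2, so λ = 0.5 and the estimate is an even blend of the independent and full components.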

Now we need to determine the parameter λ in Eq. 9. Intuitively, λ should depend on the number of instances in which w_1, ..., w_k appear together. It is natural to give higher weight to the full model when there are many such observations, since the full model tends to be more accurate. Accordingly, we weight λ as:

λ = n(w_1, ..., w_k) / (μ + n(w_1, ..., w_k)),

where μ > 0 and n(w_1, ..., w_k) is the number of such observations. A large μ indicates a higher trust in previous estimations and a slower update rate.

    6 Bayesian estimation

So far, we have seen that both models depend on obtaining an estimation of the probability P(d|q) from the observations of ⟨d, q⟩ or ⟨d, w⟩. In the estimation for the independent model, the estimation of P(w_i|d) is required, but it boils down to the estimation of P(d|w_i) via the Bayes formula. For brevity, we focus only on deriving the estimation for P(d|w). We apply Bayesian estimation due to the lack of sufficient observations. Let θ = P(d|w); we need to estimate θ. Let n be the number of times that w has been clicked on with any document, and x be the number of clicks on d.

We assume users carry out queries and clicks independently. Then P(x|θ) is a binomial distribution. If we use the beta distribution Beta(α, β) as the conjugate prior for θ, we easily see that P(θ|x) also follows a beta distribution, parametrized as:

P(θ|x) ∝ Beta(α + x, β + n − x)    (11)

Now that P(θ|x) ∝ Beta(α + x, β + n − x), we obtain the estimation of θ, conditioned on x, as the expectation of P(θ|x):

θ̂ = ∫_0^1 θ P(θ|x) dθ = (α + x) / (α + β + n).    (12)

The estimation of θ in Eq. 12 will serve as our estimation of P(d|w) in the problem. Eq. 12 gets around the sparsity issue in the training data and is capable of providing a non-zero value for P(d|w) even with no previous observation of ⟨d, w⟩.

The only question left for the Bayesian estimation of P(θ|x) is the parametrization of the conjugate prior P(θ), i.e., the determination of α and β for P(θ). Assume each document is equally likely to be clicked on 2. We want to set the expectation E(P(θ)) to 1/m, where m is the number of distinct documents in the whole collection. Since we have E(Beta(α, β)) = α/(α + β), we set α/(α + β) = 1/m, obtaining α = β/(m − 1). The Bayesian estimation for P(d|w) becomes:

θ̂ = P(d|w) = (β/(m − 1) + x) / (β/(m − 1) + β + n)    (13)

where, again, x is the number of clicks on d, m represents the number of candidate documents, and n is the number of times that w has been clicked on with any document. We need α, β > 0. In the experiments, we tune β.
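Eq. 13 amounts to additive smoothing of the raw click ratio x/n toward the uniform prior mean 1/m; a direct transcription (variable names are illustrative):

```python
def bayes_p_d_given_w(x, n, m, beta=5.0):
    # Eq. 13: posterior-mean estimate of P(d|w) under a Beta prior whose
    # mean is 1/m (uniform over the m candidate documents).
    # x: clicks on d with word w; n: clicks on any document with w.
    alpha = beta / (m - 1)
    return (alpha + x) / (alpha + beta + n)

# With no observations at all, the estimate falls back to the prior mean 1/m.
print(bayes_p_d_given_w(x=0, n=0, m=100))   # ≈ 0.01
print(bayes_p_d_given_w(x=8, n=10, m=100))  # pulled from 0.8 toward the prior
```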

    7 Experiments

For evaluation, we study the properties of our approach from three perspectives: (a) prediction accuracy; (b) query segmentation quality; and (c) prediction power, i.e., predictability. The accuracy in prediction is evaluated using both MLE and Bayesian estimation. The query segmentation quality examines the semantic structure in queries.

    7.1 Data preparation

We apply our click model to the search environment in CiteSeer (citeseer.ist.psu.edu), a popular online search engine and digital library. We collect the Apache access logs at one CiteSeer server over a 90-day period, comprising in total 56,452,022 requests.

We remove the robots by their agent field in the Apache logs and by time constraints. We further identify the queries performed at CiteSeer, obtaining a total of 886,957 distinct queries and 1,826,817 query-click pairs. There are in all 510,409 distinct documents ever shown in search results, 301,052 of which have been clicked on. For each query and the document being clicked, we collect the first 20 documents from which this document was picked.

    7.2 Evaluation metrics

    Two important quantitative metrics for evaluation are (a) pre-diction accuracy and (b) predictability.

We define prediction accuracy as the proportion of correct predictions of clicked documents from the candidate list. Formally, prediction accuracy is defined as n_c / n_s, where n_c is the number of correct predictions and n_s is the size of the tested sample of queries. For each query, the originally returned list of documents is provided as the candidates.

We define the predictability metric as a measurement of the model's robustness to new queries. When the model estimates P(d|q) as 0 for all candidate documents d, the prediction fails; we denote the percentage of such failures as P_f. Quantitatively, the predictability of a model equals 1 − P_f.
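The two metrics can be computed jointly in one pass over a test sample; a sketch with illustrative names, where accuracy is measured over the predictable pairs only, as in the experiments below:

```python
def evaluate(model_scores, test_pairs):
    # model_scores(query, candidates) -> {doc: P(d|q)}; test_pairs is a list
    # of (query, clicked_doc, candidates) triples. Names are illustrative.
    correct, failures = 0, 0
    for query, clicked, candidates in test_pairs:
        scores = model_scores(query, candidates)
        if all(s == 0 for s in scores.values()):
            failures += 1                  # P(d|q) = 0 for all candidates: failure
            continue
        if max(scores, key=scores.get) == clicked:
            correct += 1
    n = len(test_pairs)
    accuracy = correct / (n - failures) if n > failures else 0.0
    predictability = 1 - failures / n      # 1 - P_f
    return accuracy, predictability

tests = [("q1", "d1", ["d1", "d2"]), ("q2", "d2", ["d1", "d2"])]
fake = lambda q, cands: {d: (1.0 if (q, d) == ("q1", "d1") else 0.0) for d in cands}
acc, pred = evaluate(fake, tests)
print(acc, pred)   # 1.0 0.5  (one correct prediction, one unpredictable query)
```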

2 We may change the assumption to take into consideration the ranking of documents in the result list.


[Figure 3 plots prediction accuracy against training-set size (10k to 900k query-document pairs) for the Hierarchy, Full, and Independent models.]

Figure 3: Prediction accuracy w.r.t. training size. (a) Maximum likelihood estimation. (b) Bayesian estimation.

    7.3 Prediction accuracy

We train the independent model, the full model, and the CPH model over subsets of the collection of query-click pairs of different sizes. For each round of testing, we randomly choose 1,000 query-clicks from the complement of the training set. In each test, we evaluate the accuracy using the two metrics defined above. Since we study predictability separately later, the accuracy we show here is for predictable query-click pairs.

Fig. 3(a) and Fig. 3(b) give the experimental evaluation of the accuracy of our three models w.r.t. training size. Subsets of the whole collection with sizes from 10K to 900K query-document pair instances are used. Fig. 3(a) presents the accuracy of prediction using MLE for P(d|w). Comparatively, Fig. 3(b) presents the accuracy comparison using Bayesian estimation. In both figures, the shrinkage rate λ for the CPH model is set to 0.6, so that we give the full model higher weight in the combination. For the Bayesian estimation prediction, β is set to 5.

We can see that the full model usually outperforms the independent model in terms of prediction accuracy, usually by 15%. MLE works better in prediction than Bayesian estimation, but MLE leads to lower predictability (we discuss predictability in Sec. 7.5). The performance of the new CPH technique is slightly lower than the full model with MLE but better than the full model with Bayesian estimation. The CPH technique attains significantly higher prediction accuracy than the independent model.

In Fig. 4, we show the impact of the setting of the shrinkage rate λ on accuracy and predictability. We use the training set of size 500K. As expected from Eq. 9, the accuracy tilts up as λ increases while the predictability goes down.

[Figure 4 plots predictability and prediction accuracy (as percentages) against shrinkage rate λ from 0.1 to 0.9.]

Figure 4: Prediction accuracy and power w.r.t. shrinkage rate.

The sensitivity of β for the Bayesian estimation of the three models is also tested. In Table 1, we present the accuracy of

Table 1: Prediction accuracy w.r.t. β setting.

    β     Full model   Independent model   Hierarchy
    1     0.56         0.47                0.61
    5     0.57         0.47                0.62
    10    0.58         0.46                0.61
    100   0.57         0.44                0.59

    Table 2: Hierarchies discovered in sample queries.

    1 [ tutorial, [target, tracking] ]

    2 [ [machine, learning], [search, engine] ]

    3 [ partial, [ [least, square], regression ] ]

    4 [ [k, means], [cluster, analysis] ]

    5 [ [markov, chain], [monte, carlo] ]

    6 [ [spectral, clustering], jordan ]

7 [ [energy, efficient], [matrix, multiplication], fpgas ]

    8 [ distributed, [cache, architecture] ]

    9 [ [ [code, red], worm ], paper ]

    10 [[dynamic, [channel, allocation]], [cellular, telephone]]

the CPH model w.r.t. the setting of β for the priors in the hierarchy. As can be seen, the accuracy remains relatively stable, but smaller β gives higher accuracy.

    7.4 Query segmentation

In this section, we present the quality of the query segmentation formed in the discovered query hierarchies. A good segmentation of queries detects the n-grams in queries that provide the basis for the full model. Due to space limits, we present 10 randomly picked queries issued to CiteSeer and their segmentations performed in the CPH.

The discovered hierarchies in queries are presented by nesting square brackets in Table 2. We can see, from this limited sample, that the hidden structure of words in plain-text queries is well discovered using the n-gram frequency. With proper segmentations, discovered n-grams in queries are fed to the full model, and the hierarchical structures are followed while evaluating Eq. 9 for the probabilistic hierarchy. As we have seen in Sec. 7.3, the use of n-grams in the full model boosts prediction accuracy over the independent model. We will see in Sec. 7.5 that the probabilistic hierarchy also improves predictability over the full model.


Table 3: Predictability of the CPH model w.r.t. shrinkage rate λ.

    S \ λ    0.1    0.3    0.5    0.7    0.9
    100k     0.20   0.19   0.18   0.18   0.16
    300k     0.28   0.27   0.27   0.26   0.24
    500k     0.31   0.30   0.29   0.29   0.26

    7.5 Predictability

The definition of predictability is given in Sec. 7.2. The predictability of the three models is compared. Fig. 5 compares the predictability of the three models w.r.t. the size of the training set. Generally, the predictability increases as the training size grows. In particular, the full model has the worst predictability and the independent model has the best; the CPH model is between the two.

[Figure 5 plots predictability (0 to 0.5) against training-set size (10k to 700k) for the Full, Independent, and Hierarchy models.]

Figure 5: Predictability w.r.t. training size.

We also present the predictability of the CPH model w.r.t. the shrinkage rate λ in Table 3, where S is the training size. Note that the predictability drops slightly as λ grows.

One might expect an overall evaluation of the three models combining prediction accuracy and predictability. We sum up the comparisons among the three models in Fig. 6, where we plot the product of accuracy and predictability w.r.t. the training size. We can see that the CPH outperforms both other models overall.

[Figure 6 plots accuracy × predictability (0 to 0.35) against training-set size (10K to 700K) for the Independent, Full, and Hierarchy models.]

Figure 6: Combined evaluation of model performances w.r.t. training size.

    7.6 Computation complexity

Finally, the training times for the independent model and the full model are compared. Analytically, the complexity of training the full model should be the training time of the independent model times the complexity of breaking queries. Let the training size be N and the average length of queries be L. The complexity of independent model training is O(LN). Suppose the maximum length of units for the full model is set to K. The complexity of full model training is therefore bounded by O(LKN). Note that since K and L are normally small integers, training for both models requires only linear time. Our experiments show that the CPU time of training the full model is 3-4 times that of the independent model.

    8 Conclusions and Future work

In this paper, we address click prediction in the Web search scenario and explore the sparsity of the problem space that often confronts machine learning methods in information retrieval research. Our experiments on the CiteSeer data show that large-scale evaluation gives promising results for our three models in terms of prediction accuracy and predictability.

For future work, we will propose more sophisticated formulation methods for the conditional probability hierarchy (CPH). It would also be useful to improve our methods by considering and modeling the intent of users.

    References

[Banerjee and Ghosh, 2001] A. Banerjee and J. Ghosh. Clickstream clustering using weighted longest common subsequences. In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, 2001.

[Beeferman and Berger, 2000] Doug Beeferman and Adam Berger. Agglomerative clustering of a search engine query log. In KDD '00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.

[Deshpande and Karypis, 2004] Mukund Deshpande and George Karypis. Selective Markov models for predicting web page accesses. ACM Trans. Inter. Tech., 4(2):163-184, 2004.

[Gunduz and Ozsu, 2003] Sule Gunduz and M. Tamer Ozsu. A web page prediction model based on click-stream tree representation of user behavior. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[Halvey et al., 2005] Martin Halvey, Mark T. Keane, and Barry Smyth. Predicting navigation patterns on the mobile-internet using time of the week. In WWW '05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, pages 958-959, 2005.

[Joachims, 2002] Thorsten Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

[Shen et al., 2005] Xuehua Shen, Susan Dumais, and Eric Horvitz. Analysis of topic dynamics in web search. In WWW '05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, pages 1102-1103, 2005.

[Srivastava et al., 2000] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: discovery and applications of usage patterns from web data. SIGKDD Explor. Newsl., 1(2):12-23, 2000.
