NetBox: A Probabilistic Method for Analyzing Market Basket Data

José Miguel Hernández-Lobato
joint work with Zoubin Ghahramani

Department of Engineering, Cambridge University

October 22, 2012

J. M. Hernández-Lobato (UC) NetBox: A Probabilistic Method for Analyzing Market Basket Data October 22, 2012 1 / 25

Market Basket Data

A store sells a large set of products $P = \{p_1, \dots, p_d\}$.

A transaction (basket) $t_i \subseteq P$ contains the products bought by a customer during a particular visit to the store.

The transactions $t_1, \dots, t_n$ can be encoded as a binary matrix $\mathbf{X}$ of size $n \times d$.

$\mathbf{X}$ can be very large, e.g. $10^8 \times 10^4$.
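
The encoding of transactions as a binary matrix can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `encode_transactions` and the toy products are hypothetical, and a dense list of lists is used only for readability.

```python
def encode_transactions(transactions, products):
    """Return the n x d binary matrix X with X[i][j] = 1 iff product j is in basket i."""
    index = {p: j for j, p in enumerate(products)}
    X = [[0] * len(products) for _ in transactions]
    for i, basket in enumerate(transactions):
        for p in basket:
            X[i][index[p]] = 1
    return X


# Toy data (hypothetical): d = 4 products, n = 3 transactions.
products = ["bread", "jelly", "milk", "peanut butter"]
transactions = [
    {"bread", "peanut butter", "jelly"},
    {"milk"},
    {"bread", "milk"},
]
X = encode_transactions(transactions, products)
```

At the sizes quoted above a dense representation is not practical, so a sparse format would be needed in any real setting.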

Market Basket Analysis (MBA) and Association Rules

MBA allows us to identify patterns in customer purchases.

Ideally we would like to answer questions like:

What products are usually bought together?
What products may benefit from promotion?
What are the best cross-selling opportunities?

Association rules are a popular method for MBA [Agrawal and Srikant, 1994].

The method generates rules of the form $A \to B$, where $A, B \subseteq P$ and $A \cap B = \emptyset$.

$A \to B$ means that if $A \subseteq t$ holds, then we should expect $B \subseteq t$ to hold as well, with high probability.

{peanut butter, jelly} → {bread}

Problem: the number of possible rules grows exponentially with $d$.

Solution: filter the rules using minimum support and confidence thresholds.

$\mathrm{support}(A \to B) = P(A \cup B \subseteq t)$.
$\mathrm{confidence}(A \to B) = P(B \subseteq t \mid A \subseteq t)$.
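
As a rough illustration of the two definitions, support and confidence can be estimated by counting over a list of transactions. The function names and the toy baskets below are assumptions made for this sketch; the probabilities are replaced by empirical frequencies.

```python
def support(itemset, transactions):
    """Fraction of transactions t that contain every product in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(B in t | A in t) as support(A union B) / support(A)."""
    denom = support(antecedent, transactions)
    if denom == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / denom


# Toy baskets (hypothetical).
transactions = [
    frozenset({"peanut butter", "jelly", "bread"}),
    frozenset({"peanut butter", "jelly"}),
    frozenset({"bread", "milk"}),
    frozenset({"peanut butter", "jelly", "bread", "milk"}),
]
A, B = {"peanut butter", "jelly"}, {"bread"}
rule_support = support(A | B, transactions)       # 2 of the 4 baskets contain A and B
rule_confidence = confidence(A, B, transactions)  # 2 of the 3 baskets with A also contain B
```

A rule is kept only if both quantities exceed the user-chosen thresholds.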

Some Disadvantages of Association Rules (ARules)

There is no obvious procedure for selecting the support and confidence thresholds.
Too large, and many interesting associations can be missed.
Too small, and we obtain an explosion of non-significant rules.

ARules usually generates a very large number of rules. Identifying the few interesting rules among the many obvious or redundant ones can be difficult.

Importantly, ARules, as an unsupervised learning method, is usually outperformed by other techniques when making predictions.

This means that there are some patterns in the data which are not fully captured by ARules.

NetBox: A Probabilistic Method for MBA I

NetBox addresses the previous disadvantages of ARules as follows:

NetBox follows a Bayesian approach.

Any hyper-parameter value is either marginalized out or tuned automatically to the data without any human supervision.

Instead of rules, NetBox generates a network of products [Raeder and Chawla, 2011].

The networks generated often contain several connected components or clusters of products.
By focusing on these clusters, we avoid examining huge lists with many redundant or uninteresting rules.

NetBox has better predictive performance than ARules, and it is competitive with or better than alternative state-of-the-art methods at a lower computational cost.

NetBox: A Probabilistic Method for MBA II

Let $P_{\mathbf{x}}$ be an ideal distribution such that any arbitrary row $\mathbf{x} = (x_1, \dots, x_d)^T$ of the transaction matrix $\mathbf{X}$ is sampled from $P_{\mathbf{x}}$.

We want to specify a model for $P_{\mathbf{x}}$ that can be adjusted to the available data.

For this, we follow the framework of dependency networks [Heckerman et al., 2001] and attempt to learn the conditional distributions $P(x_1 \mid \mathbf{x}_{-1}), \dots, P(x_d \mid \mathbf{x}_{-d})$.

We assume that each conditional $P(x_i \mid \mathbf{x}_{-i})$ is a mixture of the predictive distributions of different models.

In its current form, NetBox mixes the predictions of two models:

A sparse binary classifier (NetBox-SBC).
A conditional model based on matrix factorizations (NetBox-CMF).

NetBox-SBC

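$$P(x_i \mid \mathbf{w}, \epsilon, \mathbf{x}_{-i}) = \epsilon + (1 - 2\epsilon)\,\Theta\big[(2x_i - 1)(\mathbf{x}_{-i}\mathbf{w}_{-d} + w_d)\big],$$

$$P(\mathbf{w} \mid \mathbf{z}) = \prod_{i=1}^{d} \big[z_i\,\mathcal{N}(w_i \mid 0, v) + (1 - z_i)\,\delta(w_i)\big],$$

$$P(\mathbf{z}) = \prod_{i=1}^{d} \mathrm{Bern}(z_i \mid p_i),$$

$$P(\epsilon) = \mathrm{Beta}(\epsilon \mid a_0, b_0),$$

where $a_0 = 1$, $b_0 = 9$, $p_1, \dots, p_{d-1} = 0.5$ and $p_d = 1$.

The posterior distribution is approximated by

$$Q(\mathbf{w}, \epsilon, \mathbf{z}) = \mathrm{Beta}(\epsilon \mid a, b) \prod_{i=1}^{d} \big[\mathcal{N}(w_i \mid m_i, v_i)\,\mathrm{Bern}(z_i \mid p_i)\big]$$

using assumed density filtering [Opper, 1998].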
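
To make the likelihood term concrete, the sketch below evaluates it for a fixed weight vector. It assumes that Θ is the Heaviside step function (with the convention that the step at zero equals 1) and that the last entry of the weight vector plays the role of the bias $w_d$; neither is spelled out on the slide. The function names and numeric values are hypothetical, and no posterior inference is performed here.

```python
def heaviside(a):
    # Assumed convention for this sketch: step(0) = 1.
    return 1.0 if a >= 0 else 0.0


def sbc_likelihood(x_i, x_rest, w, eps):
    """eps + (1 - 2 eps) * step((2 x_i - 1) * (x_rest . w[:-1] + w[-1]))."""
    activation = sum(xj * wj for xj, wj in zip(x_rest, w[:-1])) + w[-1]
    return eps + (1.0 - 2.0 * eps) * heaviside((2 * x_i - 1) * activation)


w = [1.5, 0.0, -2.0, -0.5]  # three weights plus a bias term (hypothetical values)
x_rest = [1, 0, 0]          # the other d - 1 entries of one transaction
p_one = sbc_likelihood(1, x_rest, w, eps=0.1)
p_zero = sbc_likelihood(0, x_rest, w, eps=0.1)
```

Under this reading, ε acts as a label-noise level: the hard decision of the linear classifier is trusted with probability 1 − ε.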

NetBox-CMF

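$$P(\mathbf{X} \mid \mathbf{U}, \mathbf{V}) = \prod_{i=1}^{n} \prod_{j=1}^{d} \mathcal{N}(x_{i,j} \mid \mathbf{u}_i \mathbf{v}_j^T, \sigma^2),$$

$$P(\mathbf{U}) = \prod_{i=1}^{n} \prod_{j=1}^{k} \mathcal{N}(u_{i,j} \mid 0, t_j^{U}),$$

$$P(\mathbf{V}) = \prod_{i=1}^{d} \prod_{j=1}^{k} \mathcal{N}(v_{i,j} \mid 0, s_j^{V}).$$

The posterior distribution is approximated by

$$Q(\mathbf{U}, \mathbf{V}) = \left[\prod_{i=1}^{n} \prod_{j=1}^{k} \mathcal{N}(u_{i,j} \mid m_{i,j}^{U}, v_{i,j}^{U})\right] \left[\prod_{i=1}^{d} \prod_{j=1}^{k} \mathcal{N}(v_{i,j} \mid m_{i,j}^{V}, v_{i,j}^{V})\right]$$

using variational Bayes and the analytic method of Nakajima et al. 2010.

The conditional is modeled assuming $P(x_i \mid \mathbf{x}_{-i}, \mathbf{w}') = \mathcal{N}(x_i \mid \mathbf{x}_{-i}\mathbf{w}', \sigma^2)$.

The posterior of $\mathbf{w}'$ is approximated with

$$Q(\mathbf{w}') = \prod_{i=1}^{d-1} \mathcal{N}(w'_i \mid m'_i, v'_i)$$

by matching the predictive mean and variance of the MF model.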
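
The Gaussian conditional used by NetBox-CMF can be evaluated as in the sketch below. It plugs in a point estimate for the weight vector (for instance the posterior means $m'_i$) and ignores the posterior variances $v'_i$, which is a simplification made for this illustration. Function names and numbers are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def cmf_conditional_logpdf(x_i, x_rest, w_prime, sigma2):
    """log N(x_i | x_rest . w_prime, sigma2) for one transaction."""
    mean = sum(xj * wj for xj, wj in zip(x_rest, w_prime))
    return gaussian_logpdf(x_i, mean, sigma2)


w_prime = [0.6, 0.1, 0.0]  # hypothetical point estimate of the d - 1 weights
x_rest = [1, 1, 0]
mean = sum(xj * wj for xj, wj in zip(x_rest, w_prime))
lp_one = cmf_conditional_logpdf(1, x_rest, w_prime, sigma2=0.25)
lp_zero = cmf_conditional_logpdf(0, x_rest, w_prime, sigma2=0.25)
```

With a predicted mean of 0.7, the observation $x_i = 1$ receives a higher log density than $x_i = 0$.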

Model Mixing

We compute the average log marginal likelihood on the available data:

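$$\ell_i^{\text{SBC}} = n^{-1} \sum_{j=1}^{n} \log\big[x_{j,i}\,P_{\text{SBC}}(x_{j,i} = 1 \mid \mathbf{x}_{j,-i}) + (1 - x_{j,i})\big(1 - P_{\text{SBC}}(x_{j,i} = 1 \mid \mathbf{x}_{j,-i})\big)\big]$$

$$\ell_i^{\text{CMF}} = n^{-1} \sum_{j=1}^{n} \log P_{\text{CMF}}(x_{j,i} \mid \mathbf{x}_{j,-i})$$

Let $\pi_i$ be the mixing weight for NetBox-SBC. Then, we estimate $\pi_i$ as

$$\pi_i = \exp(\ell_i^{\text{SBC}})\big[\exp(\ell_i^{\text{SBC}}) + \exp(\ell_i^{\text{CMF}})\big]^{-1}.$$

Finally, we generate predictions using

$$P_{\text{NetBox}}(x_i = 1 \mid \mathbf{x}_{-i}) = \pi_i\,P_{\text{SBC}}(x_i = 1 \mid \mathbf{x}_{-i}) + (1 - \pi_i)\,\mathbf{x}_{-i}^T\mathbf{m}'.$$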
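
The mixing step can be sketched as below. The mixing weight is written as a logistic function of the difference of the two average log-likelihoods, which is algebraically the same as the ratio of exponentials. All function names and numeric inputs are hypothetical; in particular, the CMF log-likelihood is passed in as a given number rather than computed.

```python
import math


def avg_log_lik_sbc(x_col, p_sbc):
    """Mean Bernoulli log-likelihood of one column under the SBC predictions."""
    total = sum(math.log(x * p + (1 - x) * (1 - p)) for x, p in zip(x_col, p_sbc))
    return total / len(x_col)


def mixing_weight(ell_sbc, ell_cmf):
    """exp(ell_sbc) / (exp(ell_sbc) + exp(ell_cmf)), rewritten as a logistic."""
    return 1.0 / (1.0 + math.exp(ell_cmf - ell_sbc))


def netbox_predict(pi, p_sbc, x_rest, m_prime):
    """pi * P_SBC(x_i = 1 | x_rest) + (1 - pi) * (x_rest . m_prime)."""
    cmf_mean = sum(xj * mj for xj, mj in zip(x_rest, m_prime))
    return pi * p_sbc + (1.0 - pi) * cmf_mean


x_col = [1, 0, 1, 1]            # observed column i (hypothetical)
p_sbc = [0.9, 0.1, 0.9, 0.1]    # SBC predictions P(x_ji = 1 | rest) per transaction
ell_sbc = avg_log_lik_sbc(x_col, p_sbc)
ell_cmf = -0.8                  # hypothetical average log-likelihood of the CMF model
pi = mixing_weight(ell_sbc, ell_cmf)
p = netbox_predict(pi, 0.9, [1, 1, 0], [0.6, 0.1, 0.0])
```

The model with the higher average log-likelihood on product $i$ receives the larger share of the final prediction for that product.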

Generating a Network of Products

We assign a weight $w(j, i)$ to the edge connecting products $j$ and $i$ as

$$w(j, i) = P_{\text{NetBox}}(x_i = 1 \mid x_j = 1, \mathbf{x}_{-j} = \mathbf{0}) - P_{\text{NetBox}}(x_i = 1 \mid \mathbf{x}_{-i} = \mathbf{0}).$$

We identify the relevant connections using a statistical test:

We generate $\mathbf{X}_{\text{rand}}$ with the same marginals as $\mathbf{X}$ but independent entries.

NetBox is run on $\mathbf{X}_{\text{rand}}$ to obtain a collection of weights $w_{\text{rand}}(j, i)$.

Critical values are obtained by fitting a GPD to $\{w_{\text{rand}}(k, i) : k = 1, \dots, d\}$.

We set the non-significant weights to zero.

Finally, we prune edges to maximize the number of connected components in the network.

[Figure: density plot; vertical axis "Density" (0 to 100), horizontal axis from −0.02 to 0.04.]
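
The significance test and the resulting clusters can be sketched roughly as follows. Several simplifications are assumed here and are not taken from the talk: the randomized matrix is built by permuting each column independently (one possible way to keep the marginals while breaking dependencies), the critical value is a plain empirical quantile used as a stand-in for the GPD fit, a single null sample is used instead of one per product, and the final pruning step is not implemented. All names and numbers are hypothetical.

```python
import random


def shuffle_columns(X, rng):
    """Permute each column independently: column sums are kept, dependencies are destroyed."""
    n, d = len(X), len(X[0])
    cols = []
    for j in range(d):
        col = [X[i][j] for i in range(n)]
        rng.shuffle(col)
        cols.append(col)
    return [[cols[j][i] for j in range(d)] for i in range(n)]


def empirical_critical_value(null_weights, level=0.95):
    """Empirical quantile of the null weights (stand-in for the GPD fit)."""
    ordered = sorted(null_weights)
    return ordered[min(len(ordered) - 1, int(level * len(ordered)))]


def connected_components(d, edges):
    """Connected components of an undirected graph on nodes 0..d-1 (union-find)."""
    parent = list(range(d))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for j, i in edges:
        parent[find(j)] = find(i)
    groups = {}
    for node in range(d):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


rng = random.Random(0)
X = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 1, 1]]
X_rand = shuffle_columns(X, rng)

# Hypothetical edge weights w(j, i) and null weights; a real run would obtain
# them from NetBox fitted on X and on X_rand respectively.
weights = {(0, 1): 0.30, (1, 0): 0.28, (0, 2): 0.01, (2, 3): 0.25}
null_weights = [0.00, 0.01, -0.01, 0.02, 0.015, 0.005, -0.005, 0.01, 0.0, 0.02]
threshold = empirical_critical_value(null_weights)
significant = [edge for edge, w in weights.items() if w > threshold]
components = connected_components(4, significant)
```

In this toy example the weak edge between products 0 and 2 is discarded, which leaves two clusters of products.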

Evaluation of the Prediction Accuracy of NetBox

Data split into disjoint sets of training and test transactions.

15% of the products in the test transactions are removed.

We try to identify the products missing from each test transaction.

Performance measure: recall at 10.

Benchmark methods:

Association rules (ARules).
Asymmetric matrix factorization (AMF) [Pan and Scholz, 2009].
Rank-optimized matrix factorization (ROMF) [Rendle et al., 2009].
Ranking based on frequency (Freq).
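
Recall at k for a single test transaction can be computed as in the sketch below. The exact protocol is not stated on the slide, so this follows one common definition: products still present in the basket are excluded from the ranking, and recall is the fraction of removed products found among the k highest-scoring candidates. The function name and toy scores are hypothetical.

```python
def recall_at_k(scores, observed, held_out, k=10):
    """Fraction of held-out products among the k top-scoring products not in the basket."""
    candidates = [p for p in scores if p not in observed]
    top_k = sorted(candidates, key=lambda p: scores[p], reverse=True)[:k]
    return len(set(top_k) & set(held_out)) / len(held_out)


# Toy example (hypothetical scores from some model).
scores = {"bread": 0.9, "milk": 0.2, "jelly": 0.7, "eggs": 0.4, "tea": 0.1}
observed = {"milk"}             # products left in the test transaction
held_out = {"bread", "eggs"}    # products removed from it
r2 = recall_at_k(scores, observed, held_out, k=2)
r3 = recall_at_k(scores, observed, held_out, k=3)
```

Averaging this quantity over all test transactions gives a single score per method.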

Results

Networks of Products

Rules Generated by ARules in the Small Netflix Dataset

ARules generated more than 100,000 rules.

We list the top rules according to lift.

More Rules...

And More Rules...

Top Connected Components NetBox Netflix Dataset

Top Frequent Itemsets MaxEnt Netflix Dataset

Top Connected Components NetBox Pubmed Dataset

Top Frequent Itemsets MaxEnt Pubmed Dataset

Top Connected Components NetBox Books Dataset

Top Frequent Itemsets MaxEnt Books Dataset

Conclusions

NetBox is a probabilistic method for market basket analysis which:

Follows a Bayesian approach and does not require the user to specify any hyper-parameter value.

Produces a network of products in which related items are connected to each other.

These networks are easier to interpret than a list of rules.

Obtains very good predictive performance.

Identifies patterns whose support is too low to be identified by frequent itemset methods based on entropy measures.

References

Agrawal, Rakesh and Srikant, Ramakrishnan. Fast algorithms for mining association rules in large databases. In VLDB, pp. 487–499, 1994.

Raeder, Troy and Chawla, Nitesh. Market basket analysis with networks. Social Network Analysis and Mining, 1:97–113, 2011.

Pan, Rong and Scholz, Martin. Mind the gaps: weighting the unknown in large-scale one-class collaborative filtering. In KDD, pp. 667–676, 2009.

Heckerman, David, Chickering, David Maxwell, Meek, Christopher, Rounthwaite, Robert, and Kadie, Carl. Dependency networks for inference, collaborative filtering, and data visualization. The Journal of Machine Learning Research, 1:49–75, 2001.

Opper, Manfred. A Bayesian approach to on-line learning. In On-line Learning in Neural Networks, pp. 363–378. Cambridge University Press, New York, NY, USA, 1998.

Nakajima, Shinichi, Sugiyama, Masashi, and Tomioka, Ryota. Global analytic solution for variational Bayesian matrix factorization. In NIPS, pp. 1768–1776, 2010.

Rendle, Steffen, Freudenthaler, Christoph, Gantner, Zeno, and Schmidt-Thieme, Lars. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pp. 452–461, 2009.

De Bie, Tijl. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 23:407–446, 2011.

Thank you for your attention!
