Big Data Challenges in Recommendation Systems

Dik Lun Lee

Department of Computer Science and Engineering

The Hong Kong University of Science and Technology

November 29, 2018

Big Data Challenges 2018 1

Outline of this talk

• Background of recommendation systems

• Billion-scale recommendation in e-commerce
  • Modeling user’s sequential interest with item graph and graph embedding

• Side information
  • Heterogeneous information network (HIN)

• Experiments

• Conclusion

Applications of Recommendation Systems

• The famous “Customers who bought this item also bought …”
  • Recommend “Hadoop” from a “Java” book, linked by purchase transactions

Applications of Recommendation Systems

• “Customers who searched for ‘desktop computer’ ultimately bought …”
  • Recommendations linked by actual purchase transactions

Applications of Recommendation Systems

• “Inspired by your browsing history …”

Recommendations are based on user behavior more than content similarity

• Many small items do not have much “content”

Outline of this talk

• Background of recommendation systems

• Billion-scale recommendation in e-commerce

• Side information

• Experiments

• Conclusion

Alibaba’s Billion Scale Business

• Billion-scale data
  • 1 billion users x 2 billion items

• Sparsity problem
  • Most users interact with a tiny fraction of the items
  • Likewise, most items are accessed by a tiny fraction of the users

• Cold-start problem
  • Millions of items are uploaded every hour (some existing items are in fact updated but uploaded as new items)

• Challenge: How to improve recommendation accuracy and verify it in a production environment?

Application to Taobao Mobile Apps

• The highlighted areas are personalized for 1 billion users

• The mobile homepage contributes 40% of the total recommendation traffic in Taobao

Alibaba Recommendation Architecture

• Two stages, matching and ranking, to cope with billion-scale data

[Figure: two-stage architecture. Matching: interacted items of users i, j, k feed an item-item similarity matrix. Ranking: a user model drives personalized ranking, producing personalized ranked recommendations for user i.]

• Matching: For each item, identify its similar items based on user behavior

• Ranking: For each user accessing an item, personalize recommendation based on user history
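
The two stages can be sketched as follows, assuming a precomputed item-item similarity table. The function names `match` and `rank`, the toy similarity values, and the mean-similarity ranking score are all illustrative assumptions; the slides describe the production ranking stage as a deep neural network, not this heuristic.

```python
def match(item, similarity, k):
    """Matching stage: return the k items most similar to `item`."""
    neighbours = similarity.get(item, {})
    # Sort by descending similarity; break ties by item name for determinism.
    return sorted(neighbours, key=lambda other: (-neighbours[other], other))[:k]


def rank(candidates, history, similarity):
    """Ranking stage (toy): order candidates by mean similarity to the user's history."""
    def score(candidate):
        sims = [similarity.get(h, {}).get(candidate, 0.0) for h in history]
        return sum(sims) / len(sims) if sims else 0.0

    return sorted(candidates, key=lambda c: (-score(c), c))


# Hypothetical similarity table: similarity[a][b] is how close item b is to item a.
similarity = {
    "java": {"hadoop": 0.9, "spark": 0.8, "cookbook": 0.1},
    "scala": {"spark": 0.9, "hadoop": 0.2},
}

candidates = match("java", similarity, k=2)                    # same for every user
personalized = rank(candidates, history=["scala"], similarity=similarity)
```

Matching is user-independent and can be precomputed per item, which is what makes it tractable at billion scale; only the short candidate list is personalized per user.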

Methodology: Overview

a) Users navigate website from item to item; access sequences can be broken down into sessions of local sequential access patterns, or local SAPs

b) SAPs are aggregated into item graph

c) Use random walk to sample the item graph into global SAPs

d) Generate node embeddings, which are vectors, from global SAPs
  • Dot product of two nodes’ embeddings indicates similarity of the two nodes

Sequential Access Patterns (SAPs)

• Users navigate the website from item to item, generating access sequences along the way, which are broken into sessions (sequential access patterns, or SAPs)

• Premise: Access sequences represent users’ behaviors, which have commonality that can be exploited in prediction (as in collaborative filtering)

• Data cleansing:
  • Remove all transitions in which the user stays on an item for < 1 sec
  • Remove all users who in 3 months purchased > 1000 items and generated 3500 clicks
  • Remove all items that have been updated so much that they have become different items
    • Vendors want to reuse the same item ID even for new products to accumulate a large number of comments and purchases
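
The first two cleansing rules can be sketched as simple filters. The event layout (a dict with a `dwell_seconds` field), the function names, and the exact comparison operators are assumptions made for illustration; the slide gives only the thresholds, and combining the purchase and click conditions with "and" is one reading of it.

```python
def clean_transitions(events, min_dwell_seconds=1.0):
    """Drop click events where the user stayed on the item for less than the threshold."""
    return [e for e in events if e["dwell_seconds"] >= min_dwell_seconds]


def is_abnormal_user(purchases, clicks, max_purchases=1000, max_clicks=3500):
    """Flag a user whose 3-month activity exceeds both thresholds (assumed reading)."""
    return purchases > max_purchases and clicks >= max_clicks


# Hypothetical click log for one session.
events = [
    {"item": "A", "dwell_seconds": 0.4},   # accidental click, removed
    {"item": "B", "dwell_seconds": 5.0},   # kept
]
kept = clean_transitions(events)
```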

Item Graph

• Represent the SAPs in an item graph

• All items in SAPs are nodes in item graph

• Two successively accessed items are connected with an edge
  • The weight of edge (i,j) is the number of times the pair (i,j) occurs in the SAPs

• The item graph is large, aggregating all SAPs, and represents the global user transition behavior

• In this small example, all edges have weight 1
  • We can derive longer sequences, e.g., ABECB and DECB, that do not exist in individual SAPs
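
A minimal sketch of aggregating SAPs into a weighted, directed item graph. The five toy sessions below are hypothetical (the slide's own example figure is not reproduced here); they were chosen so that every edge has weight 1 and the paths ABECB and DECB exist in the graph even though no single session contains them.

```python
from collections import defaultdict


def build_item_graph(sessions):
    """Aggregate sessions (sequences of item IDs) into {src: {dst: weight}}."""
    graph = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        # Each pair of successively accessed items adds 1 to the edge weight.
        for src, dst in zip(session, session[1:]):
            graph[src][dst] += 1
    return {src: dict(dsts) for src, dsts in graph.items()}


def path_exists(graph, path):
    """True if every consecutive pair in `path` is an edge of the graph."""
    return all(dst in graph.get(src, {}) for src, dst in zip(path, path[1:]))


# Hypothetical local SAPs (sessions).
toy_sessions = [list("DAB"), list("BE"), list("DEF"), list("ECB"), list("BA")]
item_graph = build_item_graph(toy_sessions)
```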

Random Walk on Item Graph

• Since the item graph is very large, we cannot consider all possible paths

• Perform random walks on the item graph to generate sampled sequences; the transition probability from node v_i to a child v_j is:

$$P(v_j \mid v_i) = \frac{M_{ij}}{\sum_{k \in N_{+}(v_i)} M_{ik}}, \qquad v_j \in N_{+}(v_i)$$

• M_ij is the weight of link ij and N+(v_i) is the set of children of v_i

• Use random walk to sample the graph into longer sequences representing users’ global access behavior
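
A weighted random walk of this kind can be sketched as below, where each step picks a child with probability proportional to its edge weight. The graph layout (`{src: {dst: weight}}`), the function names, and the walk length are assumptions for illustration, not details taken from the talk.

```python
import random


def transition_probs(graph, node):
    """Normalize outgoing edge weights of `node` into transition probabilities."""
    out = graph.get(node, {})
    total = sum(out.values())
    return {nbr: w / total for nbr, w in out.items()} if total else {}


def random_walk(graph, start, length, rng):
    """Sample a walk of at most `length` nodes; stop early at a node with no children."""
    walk = [start]
    while len(walk) < length:
        probs = transition_probs(graph, walk[-1])
        if not probs:
            break
        nodes = list(probs)
        weights = [probs[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Hypothetical weighted item graph.
graph = {"A": {"B": 3, "C": 1}, "B": {"A": 1}}
walk = random_walk(graph, "A", length=5, rng=random.Random(0))
```

Running many such walks from many start nodes yields the sampled "global SAPs" that are then fed to the embedding step.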

Basic Graph Embedding (BGE)

• Each path represents items users tend to access in sequence
  • Recommend items that are “near” the item the user is accessing

• Graph embedding learns a low-dimensional vector of each node

• The embedding shall maximize the co-occurrence probability of two nodes in a window of the sampled sequences

• If node i is in the same sequence as node j, similarity between embedding(i) and embedding(j) should be higher than with embedding(k) where k is not in the sequence

• Node embeddings are (latent) vectors, and similarity can be computed by inner product

• CF only considers co-occurrence (direct link between user and item) but random walk sequences are paths which can capture higher-order relations
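
As an illustration of the training signal, the sketch below extracts the (center, context) pairs that co-occur within a window of a sampled sequence, and scores two embeddings by inner product. It deliberately omits the optimization itself; the function names, the window size, and the hand-written vectors are assumptions, and a skip-gram style trainer would be one common way to fit embeddings to such pairs.

```python
def window_pairs(sequence, window):
    """Return all (center, context) pairs whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


def inner_product(u, v):
    """Similarity of two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


pairs = window_pairs(["A", "B", "C"], window=1)

# Hand-written 2-D vectors standing in for learned embeddings.
embedding = {"A": [1.0, 0.0], "B": [0.5, 0.5], "C": [0.0, 1.0]}
sim_ab = inner_product(embedding["A"], embedding["B"])
sim_ac = inner_product(embedding["A"], embedding["C"])
```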

Outline of this talk

• Background of recommendation systems

• Billion-scale recommendation in e-commerce

• Side information

• Experiments

• Conclusion

Recommendation with Side Information

• BGE does not solve the cold-start problem: if an item is new, it would not appear in the item graph

• Side information is information besides user-item interactions

• Side information is typically semantics rich
  • User side: user attributes, social connections, etc.
  • Item side: product categories, brands, shops, etc.
  • User-item: reviews written and ratings given by users for products, etc.

• Alibaba keeps 20+ types of side information beyond user-item interactions, e.g., product categories, shops, and prices of items
  • Two dresses in the same category and from the same shop are alike

• Items with “similar” side information should be closer in embedding space

• Represent all side information as vectors and learn dense embeddings

• Represent all side information in a Heterogeneous Information Network (HIN), and use graph connectivity to infer relationships/similarities

Integrating Side Information with HIN

• A HIN is a graph-based data model

• A HIN instance graph for Royal House and its relations to other entities in Yelp

• Nodes and edges have different types

• A HIN schema for Yelp (like a database schema)
  • A: aspect extracted from reviews (e.g., food quality, price, service)
  • U: users
  • B: businesses
  • R: reviews
  • Cat: category
  • Ci: city

[Figure: HIN schema for Yelp, with regions labeled Businesses, User generated content, Social network, and User activities]

Connections in HIN Imply Relationships

• Metapath and metagraph instances are generated by traversing the HIN

• The three metapaths all connect U1 to B2, indicating strong relationships between them

• There are many more paths

• U1 has a friend U2 who checked-in B2 => U1 may like (check-in) B2

• U1 wrote review R1 mentioning aspect A1; U2 wrote review R2 mentioning aspect A1; U2 checked-in B2 => U1 may like (check-in) B2
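
The two inferences above can be reproduced on a toy HIN by following a sequence of typed edges from U1. The storage layout, the relation names (`friend`, `checked_in`, `wrote`, `mentions`, and their `inv_` inverses), and the function name `follow_metapath` are hypothetical choices for this sketch.

```python
from collections import defaultdict


def follow_metapath(hin, start, relations):
    """Return all nodes reachable from `start` by following `relations` in order."""
    frontier = {start}
    for relation in relations:
        frontier = {dst for node in frontier for dst in hin.get((node, relation), ())}
    return frontier


# Toy HIN: keys are (node, relation); values are sets of target nodes.
hin = defaultdict(set)
for src, relation, dst in [
    ("U1", "friend", "U2"),
    ("U2", "checked_in", "B2"),
    ("U1", "wrote", "R1"),
    ("U2", "wrote", "R2"),
    ("R1", "mentions", "A1"),
    ("R2", "mentions", "A1"),
]:
    hin[(src, relation)].add(dst)
    hin[(dst, "inv_" + relation)].add(src)   # store the inverse edge as well

# U -friend-> U -checked_in-> B
via_friend = follow_metapath(hin, "U1", ["friend", "checked_in"])
# U -wrote-> R -mentions-> A <-mentions- R <-wrote- U -checked_in-> B
via_aspect = follow_metapath(
    hin, "U1", ["wrote", "mentions", "inv_mentions", "inv_wrote", "checked_in"]
)
```

Both metapaths reach B2 from U1, which is the kind of connectivity evidence the slide uses to suggest that U1 may like B2.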

Metapaths and Metagraphs

• Metapath and metagraph instances are generated by traversing the HIN

• M9 is a metagraph capturing a ternary relation, which cannot be replaced with 2 binary relations (e.g., M8) without losing information

• U1 wrote review R1 which rated B1 and mentioned aspect A1; U2 wrote review R2 which rated B1 and mentioned aspect A1; U2 checked-in B2 => U1 may like (check-in) B2

• You can “loop” the friendOf relationship to find indirect friends

Outline of this talk

• Background of recommendation systems

• Billion-scale recommendation in e-commerce

• Side information

• Experiments
  • Evaluation of item graph embedding
  • Evaluation of Side Information Effectiveness

• Conclusion

Datasets in Offline Experiments

Offline Experiments: Link Prediction

• One third of the links in the item graph are hidden

• The model is trained on the remaining two thirds of the links

• Obtain item embeddings and predict the hidden links
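
The hold-out protocol can be sketched as a random split of the edge list, followed by a ranking metric over scores for hidden links versus non-links. The function names are hypothetical, and the use of AUC as the metric is an assumption here, since the slide does not say how predictions are scored.

```python
import random


def split_links(edges, hidden_fraction, rng):
    """Randomly hide a fraction of the edges; return (train_edges, hidden_edges)."""
    edges = list(edges)
    rng.shuffle(edges)
    n_hidden = round(len(edges) * hidden_fraction)
    return edges[n_hidden:], edges[:n_hidden]


def auc(pos_scores, neg_scores):
    """Probability that a hidden link outscores a non-link (ties count as half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical edge list of a small item graph.
edges = [(i, i + 1) for i in range(9)]
train, hidden = split_links(edges, hidden_fraction=1 / 3, rng=random.Random(0))

# Scores would come from inner products of embeddings learned on `train`;
# the numbers below are made up to show the metric only.
example_auc = auc(pos_scores=[0.9, 0.8], neg_scores=[0.1, 0.8])
```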

Online A/B Test

• Deployed in Nov 2017; CTRs over 7 days are collected

• “Base” is traditional item-based CF (no graph embedding) with well-tuned heuristics

• Base is better than BGE, meaning graph embedding without SI is not enough

[Figure: CTR comparison of four methods: no side information, side information (weighted), side information (unweighted), and tuned collaborative filtering]

Case Study: Visualization of Item Embeddings

• Item embeddings are projected into 3-D

• (a) Shoes of 7 categories; clusters can be seen

• (b) badminton, table tennis, football shoes; badminton and table tennis shoes are close to each other, because people who play table tennis would likely play badminton

Case Study: Cold Start Items

• Cold items have no embedding learnt from item graph

• Cold item embedding is the average of its side information embeddings

• Useful SI’s: Category, Shop, Style …
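
The averaging rule for cold items is simple enough to show directly. The function name and the 2-D toy vectors below are made up for illustration; real side-information embeddings would be learned and much higher-dimensional.

```python
def cold_start_embedding(side_info_vectors):
    """Element-wise average of an item's side-information embeddings."""
    n = len(side_info_vectors)
    dim = len(side_info_vectors[0])
    return [sum(vec[d] for vec in side_info_vectors) / n for d in range(dim)]


# Hypothetical embeddings of a new item's category and shop.
category_vec = [1.0, 0.0]
shop_vec = [3.0, 2.0]
cold_vec = cold_start_embedding([category_vec, shop_vec])
```

Because the result lives in the same space as the learned item embeddings, the new item can be matched against existing items by inner product right away.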

Case Study: Weights of Side Information

• Weights of 13 SI’s for 8 items are shown

• The item embedding is the most important; the second most important is “shop”

• Weights of SI’s vary for different items
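
One way to combine an item's own embedding with its side-information embeddings under per-item weights is a normalized weighted average, sketched below. Normalizing the raw weights with an exponential (softmax) is an assumption made for this sketch, as are the function name and the toy numbers; the slide states only that the weights differ per item.

```python
import math


def weighted_embedding(vectors, raw_weights):
    """Combine vectors using softmax-normalized weights (assumed normalization)."""
    exp_w = [math.exp(w) for w in raw_weights]
    total = sum(exp_w)
    dim = len(vectors[0])
    return [
        sum(e * vec[d] for e, vec in zip(exp_w, vectors)) / total
        for d in range(dim)
    ]


# Hypothetical item embedding and "shop" embedding for one item.
item_vec = [1.0, 0.0]
shop_vec = [3.0, 2.0]

equal = weighted_embedding([item_vec, shop_vec], raw_weights=[0.0, 0.0])
shop_heavy = weighted_embedding([item_vec, shop_vec], raw_weights=[0.0, 50.0])
```

With equal raw weights this reduces to the plain average; a much larger raw weight makes the corresponding vector dominate the aggregate.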

System Deployment and Operation

• 600 billion entries from the most recent 3 months are extracted from the logs

• Item sub-graph has 50 million nodes

• Random walk produces 150 billion sequences

• 6 hours on 100 GPUs

• RSP (ranking) is a deep neural network

[Figure: system components labeled Ranking Service Platform, Taobao Personality Platform, and XTensorflow]

Outline of this talk

• Background of recommendation systems

• Billion-scale recommendation in e-commerce

• Side information

• Experiments
  • Evaluation of item graph embedding
  • Evaluation of Side Information Effectiveness

• Conclusion

Datasets

Performance

Which Metagraphs are better?

• M2, M3, M8 and M9 are user-based CF (U→*U→B) and are more effective than item-based CF (U→B→*B)

Take-Home Message

• Recommendation systems have important commercial value and are used in major e-commerce websites like Alibaba, Amazon, Netflix, etc.

• Community behavior and side information improve recommendation prediction accuracy

• Fully utilize side information (SI may become more and more “central”) to address cold start and sparsity problems
  • Websites do not just log user-item interactions but a lot of other things
  • Many e-commerce websites have reviews and built-in social and trust/follow relations

• Illustrate feasibility of applying machine learning to billion-scale data

• Give interesting observations from real-life experiments

For Further Information

• J. Wang, P. Huang, H. Zhao, Z. Zhang, B. Zhao, and D. L. Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18), 839-848.

• H. Zhao, Q. Yao, J. Li, Y. Song, and D. L. Lee. 2017. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), 635-644.

• H. Zhao, Q. Yao, J. T. Kwok, and D. L. Lee. 2017. Collaborative Filtering with Social Local Models. In 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, 645-654.
