µ HAWQ ¾ MADlib O N 5xunzhangthu.org/talk/Develop_RecSys_with_HAWQ.pdf · Meetup @Shanghai,...
Transcript of µ HAWQ ¾ MADlib O N 5xunzhangthu.org/talk/Develop_RecSys_with_HAWQ.pdf · Meetup @Shanghai,...
HAWQ MADlib
Hong Wu: [email protected] Ruilong Huo: [email protected] @Shanghai, 2016/04/28
Agenda
• RecSysReview
• RecommendationwithHAWQ
• Demo
• Q&A
2
Agenda
• RecSysReview
3
What is RecSys
• Recommendation
• Morethansearchengine
• exploretheworldwithoutquery
• Personalization
• discoveritemsbasedonpreference
• Self-evolving
Industry Trends
System Design
• Content-basedrecommendation
• featuresfromitemitself
• Collaborativefiltering
• user-usersimilarity&item-itemsimilarity
• Model-basedrecommendation
• factormodel
System Design
• Content-basedrecommendation
• featuresfromitemitself
• Collaborativefiltering
• user-usersimilarity&item-itemsimilarity
• Model-basedrecommendation
• factormodel
Agenda
• RecSysReview
• RecommendationwithHAWQ
8
System Overview
Import Training Data Matrix Factorization
Compute Item-Item
SimilaritiesMake Recommendation
Mod
el P
ipel
ine
Preparation Map original interactive data into user-item-rating tuples. Import into HAWQ Database.
Latent Factor Model Use lmf_igd_run function in MADLib to train user factors and item factors.
Similarity Apply CF strategy to compute similarity.
Recommendation Three ways to return(see later slides).
Preparation
• Userdata
• Itemdata
• Interactivedata
• Userdata
• Itemdata
• Interactivedata• {(user,item):preferenceweight}
Preparation
Latent Factor Model
• Matrixfactorization• modeleachuser-itemasavectoroffactors
Latent Factor Model
Matrix Factorization in MADlib
SELECTmadlib.lmf_igd_run(output,—resulttabletostoremodelinput,—inputtablerow,—userattributenamecol,—itemattributenamevalue,—preferenceweightrow_dim,—thenumberofuserscol_dim,—thenumberofitemsmax_rank,—thenumberoffactorsstepsize,—learningrateiterations,—trainingroundstolerance);
Similarity
• WehaveuserfactorsUanditemfactorsV
• Applycollaborativefilteringstrategy
• computeitem-itemsimilarities
• lesscomputation,largescale
• gooddiversityvsoriginalCF
Similarity
CREATE TABLE item_sim AS ( SELECT ifacA.iid, ifacB.iid, madlib.cosine_similarity(ifacA.rating, ifacB.rating) as sim FROM ifactor AS ifacA, ifactor AS ifacB WHERE ifacA.iid < ifacB.iid sim DESC LIMIT K );
Recommendation
• Useitemsimilaritydirectly
radioFMbooklike-like
Recommendation
• Useuserfactoranditemfactordirectly
• dotproduct
Recommendation
• Useitemsimilaritytopredictrating
• weightedsum
20
FeedStream
Playlist
MovieGuess
Agenda
• RecSysReview
• RecommendationwithHAWQ
• Demo
21
Try it Out
• https://github.com/xunzhang/xz_talk/tree/master/develop_recsys_with_hawq
Why Choose HAWQ to Build Machine Learning Application
• Simple&Efficient
• highlevelabstraction,notransfercost
• focusmoreonapplication’sownlogic
• dataflowthinking
• Highscalability&performance
• distributedmachinelearningwithMADlib
• parallelqueryenginewithHAWQ
23
Summary
• BriefintroofRecSys
• AhybridmodelpipelinebasedonHAWQ
24
Agenda
• RecSysReview
• RecommendationwithHAWQ
• Demo
• Q&A
25
Thanks Questions
Backup Slide
SolutionofCFinlargescale:AB_similarity.
Storagedisasterofdotproductbasedonfactormodel:BuildBalltreeindex.