The $1,000,000 Netflix Contest


Transcript of The $1,000,000 Netflix Contest

  • The $1,000,000 Netflix Contest is to develop a "ratings prediction program" that can beat Netflix's (called Cinematch) by 10% in predicting what rating users gave to movies. I.e., predict rating(M,U) where (M,U) ∈ QUALIFYING(MovieID, UserID).

    Netflix uses Cinematch to decide which movies a user will probably like next (based on all past rating history). All ratings are "5-star" ratings (5 is highest, 1 is lowest; caution: 0 means did not rate).

    Unfortunately rating=0 does not mean that the user "disliked" that movie, but that it wasn't rated at all. Most ratings are 0. Therefore, the ratings data sets are NOT vector spaces!
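The sparsity point above can be made concrete with a small sketch (illustrative only; the variable names and sample IDs are made up). Storing each user as a dense 17,770-slot vector would fill almost every slot with 0, and would conflate "did not rate" with a genuine low rating:

```python
# Hypothetical illustration: Netflix ratings are sparse, so a dense
# 17,770-dimensional vector per user mostly holds 0s, and 0 would
# wrongly look like a rating. A sparse dict keeps only real ratings.

NUM_MOVIES = 17770  # number of movies in the Netflix data set

# Sparse representation: only movies this user actually rated appear.
user_ratings = {30878: 4, 2647871: 1, 1283744: 5}  # movieID -> rating (1..5)

def rating(user, movie_id):
    """Return the rating, or None for 'did not rate' -- never 0."""
    return user.get(movie_id)

dense_zeros = NUM_MOVIES - len(user_ratings)
print(rating(user_ratings, 30878))   # 4
print(rating(user_ratings, 99999))   # None (unrated, NOT a rating of 0)
print(dense_zeros)                   # 17767 wasted slots in a dense vector
```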

    One can approach the Netflix contest problem as a data mining Classification or Prediction problem.

    A history of ratings by users of movies, TRAINING(MovieID, UserID, Rating, Date), is given with which to train your predictor, which will predict the ratings given to QUALIFYING movie-user pairs. (Netflix knows the rating given to QUALIFYING pairs, but you don't.)

    Since TRAINING is very large, Netflix also provides a smaller but representative subset of TRAINING, PROBE(MovieID, UserID) (~2 orders of magnitude smaller than TRAINING).

    Netflix gives 5 years to submit QUALIFYING predictions. That contest window is about 1/2 gone now.

    A team can submit as many solutions as it wishes, at any time. Each October, Netflix gives $50,000 to the team at the top of the so-called Netflix Leaderboard. BellKor has won that twice.

  • The Netflix Contest (USER versus MOVIE voting)

    One can address the prediction or classification problem using several different "approaches".

    USER VOTERs (approach 1): To predict the rating of a pair (M,U), we take TRAINING as a vector space of user rating vectors. The users are the points in the vector space and the movies are the dimensions of that vector space. Since there are 17,770 movies, each user is a tuple of 17,770 ratings if all movies are used as dimensions. That's too many dimensions! The first dimension pruning: restrict to only those movies that U has rated (= supportU). We also allow another round of dimension pruning based on correlation with M.

    Once the dimension movie set is pruned, we pick a set of near-neighbor users to U (NNS) from the users V who have rated M (= supportM). "Near" is defined by correlation with U. One can think of this step as the voter pruning step. Note: most correlation calculations involve the other variable as well; i.e., the result of a user pruning depends on the pruned movie set and vice versa. Thus, theoretically, the movie/user pruning steps could be alternated ad infinitum! Our current approach is to allow an initial global dimension prune, then the voter prune, then a final dimension prune. You will see these 3 prune steps in the .config files.

    We then let voters vote, but they don't necessarily cast the straightforward rating(M,V) vote.

    The best way to think about the 3 pruning steps (and there could be more!) is: We prune down the dimensions so that vector space methods are tractable, ameliorating the curse of dimensionality. The first prune, which may be turned off, is a global dimension prune (not based on individual voters). The second is the voter prune, based on the currently pruned dimensions. The third is a final dimension prune (different for each voter) which gives the final vector space over which the vote by that voter is calculated. Then we let those VOTERS vote as to the best rating prediction to be made. There are many ways to prune, vote, tally, and decide on the final prediction. These choices make up the .config file.

    MOVIE VOTERs (approach 2) is identical, with the roles of Movies (voters) and Users (dimensions) reversed.
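The prune-then-vote flow above can be sketched in a few lines. This is a minimal illustration, not the actual mpp code: it assumes sparse per-user rating dicts, uses plain Pearson correlation for "near", and takes a correlation-weighted average as the tally (ignoring negatively correlated voters). All function and variable names here are invented for the sketch:

```python
# Sketch of the USER-VOTER approach: prune dimensions to supportU,
# pick near-neighbor voters from supportM by correlation with U,
# then let the surviving voters cast weighted votes.
from math import sqrt

def pearson(a, b, dims):
    """Pearson correlation of two sparse rating dicts over common dims."""
    common = [d for d in dims if d in a and d in b]
    n = len(common)
    if n < 2:
        return 0.0
    ma = sum(a[d] for d in common) / n
    mb = sum(b[d] for d in common) / n
    num = sum((a[d] - ma) * (b[d] - mb) for d in common)
    den = sqrt(sum((a[d] - ma) ** 2 for d in common) *
               sum((b[d] - mb) ** 2 for d in common))
    return num / den if den else 0.0

def predict(M, U, training, k=20):
    """training: userID -> {movieID: rating}. Predict rating(M, U)."""
    dims = set(training[U])                              # dimension prune: supportU
    voters = [v for v in training if v != U and M in training[v]]  # supportM
    voters.sort(key=lambda v: pearson(training[U], training[v], dims),
                reverse=True)                            # voter prune (NNS)
    votes = [(training[v][M], pearson(training[U], training[v], dims))
             for v in voters[:k]]
    wsum = sum(max(w, 0.0) for _, w in votes)            # drop anti-correlated voters
    if wsum == 0:
        return 3.0                                       # fallback: mid-scale guess
    return sum(r * max(w, 0.0) for r, w in votes) / wsum
```

With a tiny hand-made training set where voter A rates exactly like U, A dominates the vote and the prediction equals A's rating of M.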

  • The Netflix Contest (Using SLURM to generate a clustering)

    SLURM has been set up to run on the Penryn Cluster2 (32 8-processor machines, 1 terabyte of main memory) so that one can create a .config file (must end in .config) which specifies all the parameters for the program. Issuing:

    ./mpp-submit -S -i Data/probe-full.txt -c pf.0001/u.00.00/u.00.00.config -t .0001 -d ./pf.0001

    The program pulls parameters from the .config file: -t .0001 means SquareError threshold = .0001; -d ./pf.0001 means results go to the ./pf.0001 directory. The program takes as input the file Data/probe-full.txt (which is not quite the full probe, but close). General form:

    mpp-submit -S -i InputFile.txt -c ConfigFile.config -t SqErrThrshld -d Dir

    Takes as input:
    InputFile.txt (MovieID with interleaved UserIDs format, or .txt format; see next slide)
    ConfigFile.config (shows which program to run, in .config format; see next slide)
    SqErrThrshld (if PredictionSqErr ≤ SqErrThrshld, put the pair in Dir/lo-InputFile.txt, else put it in Dir/hi-InputFile.txt)
    Dir (existing directory for the output)

    Puts as output (in Dir):
    lo-InputFileName.txt
    hi-InputFileName.txt
    InputFileName.config
    InputFileName.rmse
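The lo/hi routing step can be sketched as follows. This is a hedged illustration of the behavior described above, not the internals of mpp-submit; the function name and file layout are assumptions:

```python
# Sketch: pairs whose squared prediction error is at or below the
# threshold go to Dir/lo-<Input>, the rest to Dir/hi-<Input>.
import os

def route_pairs(pairs, predictor, actual, threshold, out_dir, input_name):
    """pairs: [(movieID, userID)]; predictor(m, u) -> float;
    actual: {(m, u): true rating}. Returns the two output paths."""
    os.makedirs(out_dir, exist_ok=True)
    lo_path = os.path.join(out_dir, f"lo-{input_name}")
    hi_path = os.path.join(out_dir, f"hi-{input_name}")
    with open(lo_path, "w") as lo, open(hi_path, "w") as hi:
        for (m, u) in pairs:
            sq_err = (predictor(m, u) - actual[(m, u)]) ** 2
            dest = lo if sq_err <= threshold else hi
            dest.write(f"{m} {u}\n")
    return lo_path, hi_path
```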

  • The Netflix Contest (Using SLURM to generate a clustering)

    ./mpp-submit -S -i Data/probe-full.txt -c pf.0001/u.00.00/u.00.00.config -t .0001 -d ./pf.0001

    ConfigFile: pf.0001/u.00.00/u.00.00.config
    InputFile: Data/probe-full.txt

    1: 30878 2647871 1283744 2488120 317050 1904905 1989766 14756 1027056 1149588 1394012 1406595 2529547 1682104 2625019 2603381 1774623 470861 712610 1772839 1059319 2380848 548064
    2: 1959936 748922 1131325 1312846 2314531 1636093 584750 2418486 715897 1172326 etc.

    where 1: and 2: are movieIDs and the others are userIDs. Note, this is an interleaved format of a 2-column DB file, probe-full(movieID, userID). The program sets parameters as specified in the .config:
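A small helper shows how the interleaved format maps back to the flat 2-column (movieID, userID) file; this parser is an illustrative sketch, not the project's actual reader:

```python
# Sketch: flatten the "MovieID: interleaved UserIDs" format into
# explicit (movieID, userID) pairs. A token ending in ':' starts
# a new movie; every other token is a userID under that movie.
def parse_interleaved(lines):
    pairs, movie = [], None
    for tok in " ".join(lines).split():
        if tok.endswith(":"):
            movie = int(tok[:-1])        # e.g. "1:" -> current movie is 1
        else:
            pairs.append((movie, int(tok)))
    return pairs

print(parse_interleaved(["1: 30878 2647871", "2: 1959936"]))
# [(1, 30878), (1, 2647871), (2, 1959936)]
```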

    user_voting = enabled movie_voting = disabled user_vote_weight = 1

    # processed only if user voting enabled.
    [user_voting]
    Prune_Movie_in_SupU = disabled
    Prune_Users_in_SupM = enabled
    Prune_Movies_in_CoSupUV = enabled

    [Prune_Movies_in_SupU]
    method=MoviePrune
    leftside = 0
    width = 30000
    mstrt = 0
    mstrt_mult = 0
    ustrt = 0
    ustrt_mult = 0
    TSa = -100
    TSb = -100
    Tdvp = -1
    Tdvs = -1
    Tvdp = -1
    Tvds = -1
    TD = -1
    TP = -1
    PPm = .1
    TV = -1
    TSD = -1
    Ch = 1
    Ct = 1

    [Prune_Movies_in_CoSupUV]
    method=MovieCommonCoSupportPrune
    leftside = 0
    width = 2000
    (remaining parameters identical to the block above)

    [Prune_Users_in_SupM]
    method=UserCommonCoSupportPrune
    leftside = 0
    width = 30000
    (remaining parameters identical to the first block)

    (An identical parameter block appears for the movie-voting parameters.)

    Only the method, leftside, width, Ch=Choice, and Ct=Count parameters are used at this time.

    Using this program, the many "lo-u.xx.xx" and, if movie voting is also enabled, "lo-m.yy.yy" files constitute what we have called a clustering (though they're not mutually exclusive). Once we have {lo-z.xx.yy | z = u or m}, we can make a submission: for each qualifying pair (m,u), use correlations to pick the program to make that prediction.

  • The Netflix Contest (Using this scheme to predict Qualifying pair ratings)

    The above prediction scheme requires the existence of Square Errors (SqErr); e.g., the cluster files lo-u.vv.nn.txt and lo-m.nn.vv.txt are composed of all input pairs such that SqErr ≤ .0001.

    To predict rating(M,U) for pairs from Qualifying, we won't have answers, so we won't have SqErrs of our predictions relative to those answers.

    So how can we form good clusters then?

    Once that's decided, what matchup algorithm should we use to match a cluster (program) to a Qualifying pair to be predicted?

    After the clusters are created, we can try the matchup algorithms that worked best for Probe predictions, but

    We may want to develop new ones because the performance of those matchup algorithms may depend on the way the clusters were created.

    We could use the same 288 configs to generate a new config-subset-collection of Qualifying pairs using, e.g., some kind of prediction variation instead of thresholded prediction SqErr.

    lo-u.vv.nn.txt could be constructed to consist of Qualifying pairs as follows (a variation-based method): Set all answers in Qualifying to 1. Use ./mpp-submit to create clusters as above (threshold=.0001) in a directory, q1. Set all answers in Qualifying to 2. Use ./mpp-submit to create clusters as above (threshold=.0001) in a directory, q2, etc. This will create a clustering of 288*5=1440 cluster sets (but, of course, only 288 different program configs).

    One could match up a Qualifying pair using count-based correlations, Pearson correlations, 1-perpendicular correlations, or something else. One could match (M,U) with the cluster in which the sum of the M and U counts (or counts relative to cluster size) is maximal. Other ideas?
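The count-based matchup idea in the last bullet can be sketched as follows. The cluster representation is an assumption for illustration (a list of (movieID, userID) pairs per cluster, as in the lo-*.txt files); the scoring rule is the "sum of M and U counts relative to cluster size" variant:

```python
# Sketch of a count-based matchup: assign a Qualifying pair (M, U) to
# the cluster where (#occurrences of M + #occurrences of U) / cluster
# size is largest.
def matchup(M, U, clusters):
    """clusters: dict clusterName -> list of (movieID, userID) pairs."""
    def score(pairs):
        m_count = sum(1 for (m, _) in pairs if m == M)
        u_count = sum(1 for (_, u) in pairs if u == U)
        return (m_count + u_count) / max(len(pairs), 1)  # relative to size
    return max(clusters, key=lambda name: score(clusters[name]))
```

Dropping the division gives the raw-count variant mentioned above; both are cheap enough to run over all 288 (or 1440) cluster files per Qualifying pair.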

  • The Netflix Files

    {Mi} i=1..17770, given by Netflix as:

    TRAINING as an M-U interaction cube (Rolodex Model, m\u).
    TRAINING in MySQL with key (mID, uID); 11-bit day numbers starting at 1 = 1/1/99 and ending at 2922 = 12/31/06.
    Mi(uID, Rating, Date): for each MovieID, Mi, this is a file of all users who rated it, the rating, and the rating date.
    bit-sliced TRAINING: M-U interaction cube (Rolodex Model, m\u).
    TRAINING in MySQL with key (uID, mID); 11-bit day numbers starting at 1 = 1/1/99 and ending at 2922 = 12/31/06.
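The day-number encoding above is easy to verify: 1999 through 2006 contain exactly 2922 days, so day 1 = 1/1/99 and day 2922 = 12/31/06 are consistent. (Incidentally, 2922 does not fit in 11 bits, since 2^11 = 2048, so the "11-bit" figure on the slide may be off by one.) A minimal converter, with names invented for the sketch:

```python
# Convert a calendar date to the Netflix day number: day 1 = 1999-01-01.
from datetime import date

EPOCH = date(1999, 1, 1)

def day_number(d):
    """Days since the epoch, 1-based, as used in the TRAINING tables."""
    return (d - EPOCH).days + 1

print(day_number(date(1999, 1, 1)))    # 1
print(day_number(date(2006, 12, 31)))  # 2922
```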

  • The Program: Code Structure - the main modules

    mpp-mpred.C reads a Netflix PROBE file, Mi(Uid), and passes Mi and ProbeSupport(Mi) to mpp-user.C to make predictions for each pair (Mi,U), for each U ∈ ProbeSupport(Mi). It can also call separate instances of mpp-user.C for many Us, to be processed in parallel (governed by the number of "slots" specified in the 1st code line). mpp-user.C loops thru ProbeSupport(M), the ULOOP, reading in the designated (matched-up) conf