MSGD: A Novel Matrix Factorization Approach for Large-Scale Collaborative Filtering Recommender Systems on GPUs

Hao Li, Kenli Li, Senior Member, IEEE, Jiyao An, Member, IEEE, and Keqin Li, Fellow, IEEE

Abstract—Real-time and accurate recommendation for large-scale recommender systems is a challenging task. Matrix factorization (MF), as one of the most accurate and scalable techniques to predict missing ratings, has become popular in the collaborative filtering (CF) community. Currently, stochastic gradient descent (SGD) is one of the most famous approaches for MF. However, it is non-trivial to parallelize SGD for large-scale CF MF problems due to the dependence on the user and item pair, which can cause parallelization over-writing. To remove the dependence on the user and item pair, we propose a multi-stream SGD (MSGD) approach, for which the update process is theoretically convergent. On that basis, we propose a Compute Unified Device Architecture (CUDA) parallelization of MSGD (CUMSGD). CUMSGD can obtain high parallelism and scalability on Graphics Processing Units (GPUs). On Tesla K20m and K40c GPUs, the experimental results show that CUMSGD outperforms prior works that accelerated MF on shared memory systems, e.g., DSGD, FPSGD, Hogwild!, and CCD++. For large-scale CF problems, we propose a multiple-GPU (multi-GPU) CUMSGD (MCUMSGD). The experimental results show that MCUMSGD can improve MSGD performance further. With a K20m GPU card, CUMSGD can be 5-10 times as fast as the state-of-the-art approaches on shared memory platforms.

Index Terms—Collaborative filtering (CF), CUDA parallelization algorithm, matrix factorization (MF), multi-GPU implementation, stochastic gradient descent (SGD)


1 INTRODUCTION

In the era of e-commerce, recommender systems have been applied to all aspects of commercial fields. Two challenges arise in the development of personalized recommender systems: (1) making recommendations in real time, i.e., making recommendations when they are needed, and (2) making recommendations that suit users' tastes, i.e., recommendation accuracy. CF, as one of the most popular recommendation techniques, relies on past user behavior and does not need domain knowledge or extensive and specialized data collection [1]. The main task of CF is to estimate the missing data of user-interested items, e.g., scores, clicks, purchase records, etc. However, data sparsity caused by missing data makes it hard to provide accurate prediction [2]. MF, as a dimensionality reduction technique, maps both users and items into the same latent factor space, updates the user/item feature vectors on the existing ratings, and then predicts the unknown ratings via the inner products of the related user-item feature vector pairs [3]. It can address the problem of sparsity and has enjoyed a surge in research after the Netflix competition [4], [5], [6].

Several MF techniques have been applied to CF recommender systems, e.g., Maximum Margin Matrix Factorization (MMMF) [3], [7], Alternating Least Squares (ALS) [8], [9], Cyclic Coordinate Descent (CCD++) [10], [11], [12], Non-negative Matrix Factorization (NMF) [13], and Stochastic Gradient Descent (SGD) [4], [5]. Among them, SGD is a popular and efficient approach due to its low time complexity and convenient implementation, which has drawn wide attention [5]. Many recent works integrate factors and neighbors, and SGD can fuse the factors and neighbors into the update process [4], [5].

Although SGD has been applied to MF recommender systems successfully due to its high scalability and accuracy, it is difficult to parallelize because of its inherently sequential nature, which makes SGD infeasible for large-scale CF data sets. With the rapid development of graphics processing hardware, GPUs are becoming common; moreover, CUDA programming makes it possible to harness the computational power of GPUs efficiently. However, given the sequential essence of SGD, there is a great obstacle preventing SGD from exploiting the efficiency of GPU computation.

In this paper, we find through theoretical analysis that the main obstacle to parallelizing SGD is the dependence on the user and item pair. We apply MSGD to remove the dependence on the user and item pair. On that basis, we propose a CUDA parallelization approach on GPU, namely

• H. Li, K. Li, and J. An are with the College of Computer Science and Electronic Engineering and National Supercomputing Center in Changsha, Hunan University, Hunan 410082, China. E-mail: {lihao123, lkl, jt_anbob}@hnu.edu.cn.

• K. Li is with the College of Computer Science and Electronic Engineering and National Supercomputing Center in Changsha, Hunan University, Hunan 410082, China, and the Department of Computer Science, State University of New York, New Paltz, NY 12561. E-mail: [email protected].

Manuscript received 29 Dec. 2016; revised 17 May 2017; accepted 5 June 2017. Date of publication 22 June 2017; date of current version 13 June 2018. (Corresponding author: Kenli Li.) Recommended for acceptance by G. Tan. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2017.2718515


CUMSGD. CUMSGD divides the task into coarse sub-tasks that are mapped to independent thread blocks and solved by those thread blocks independently. Each sub-task is divided into finer pieces that map to threads within the thread block and are solved cooperatively by those threads in parallel.

MSGD splits the MF optimization objective into many separable optimization sub-objectives, and these independent optimization sub-objectives are distributed to CUDA thread blocks. There is no data dependence between those thread blocks. Thus, CUMSGD does not require thread block synchronization. CUMSGD updates the user and item feature matrices alternately. Thus, in a multi-GPU environment, MCUMSGD can update a user feature sub-matrix while emitting and receiving item feature sub-matrices concurrently. The main contributions of this paper are summarized as follows:

• Theoretical convergence and local minimum escaping: MSGD updates the user feature matrix and the item feature matrix alternately under a convergence guarantee. Stochastic gradient (SG) based approaches can be used to escape local minima. Thus, MSGD can obtain reasonable accuracy.

• CUDA parallelization: Considering the fine-grained parallelism of MSGD, we propose the CUMSGD approach. CUMSGD can be deployed on CUDA-supported GPUs and can be applied to industrial applications of MF.

• Multi-GPU implementation for large-scale data sets: For web-scale CF data sets, we propose MCUMSGD. We adopt an asynchronous communication strategy to hide the data access overhead among multiple GPUs, and a cyclic update strategy to allocate rating sub-matrices and item feature sub-matrices.

The extensive experimental results show that CUMSGD not only outperforms the state-of-the-art parallelized SGD MF algorithms in a shared memory setting, e.g., fast parallel SGD (FPSGD) [14], [15], distributed SGD (DSGD) [16], [17], and Hogwild! [18], but also outperforms the state-of-the-art non-SG algorithm CCD++ [10], [11].

The paper is organized as follows. We introduce related work in Section 2. We describe the MF problem and related notions in Section 3. We derive the update rule of MSGD and propose a parallelization strategy in Section 4. We provide the parallelization strategies of CUMSGD and MCUMSGD in Section 5. We report experimental results of MSGD and the performance comparison with state-of-the-art parallel approaches in Section 6. Finally, in Section 7, conclusions are drawn.

2 RELATED WORK

MF-based techniques have been proposed for collaborative prediction. The idea of MF-based CF has been applied to many related areas, and there are a large number of studies on this topic. Sarwar et al. [19] proposed MF-based dimensionality reduction in CF recommender systems, which can address several limitations of neighbor-based techniques, e.g., sparsity, scalability, and synonymy. Weng et al. [20] presented CF semantic video indexing by adopting MF, which takes into account the correlation of concepts and the similarity of shots within the score matrix. Zheng et al. [21] proposed neighborhood-integrated MF for quality-of-service (QoS) prediction in service computing, which can avoid time-consuming web service invocation. Lian et al. [22] proposed point-of-interest (PoI) recommendation by adopting weighted MF to cope with the challenges of the extreme sparsity of user-PoI matrices.

SGD has been extensively studied due to its easy implementation and reasonable accuracy. Langford et al. [23] proved that the online learning approach (delayed SGD) is convergent. Zinkevich et al. [24] proposed parallelized SGD with convergence guarantees. Agarwal and Duchi [25] proposed distributed delayed SGD, in which the master nodes update parameters and the other nodes perform SGD in parallel. Many works have been conducted on parallelized and distributed SGD to accelerate MF in CF recommender systems in the literature [14], [15], [16], [17], [18], [26], [27]. Chin et al. [14] focused on a shared memory multi-core environment to parallelize SGD MF. FPSGD cannot solve large-scale CF problems because FPSGD must load the CF data into shared memory. Gemulla et al. [16] focused on DSGD in a distributed environment. DSGD exploits the property that some blocks of the rating matrix share no common columns or common rows. DSGD incurs data communication overhead between nodes. Because of the non-even distribution of the rating matrix, the workloads on different nodes may differ. Thus, each node requires synchronization. Yun et al. [26] proposed asynchronous and decentralized SGD (NOMAD) in distributed and shared memory multi-core environments. This approach can hide communication time within SGD execution via an asynchronous, non-blocking mechanism, and can handle different hardware and system loads by balancing the workload dynamically. NOMAD does not need extra time for data exchanges. However, when the number of nodes increases, a large number of extra uniform sampling instructions is executed on each node or core, and the cost of communication becomes larger than the cost of computation. Niu et al. [18] proposed Hogwild! to parallelize SGD MF. This approach assumes that the rating matrix is highly sparse and deduces that, for two randomly sampled ratings, the two serial updates are likely to be independent. Jin et al. [27] presented GPUSGD to accelerate MF. This approach must load the whole data set into global memory, and it is unsuited to large-scale CF problems.

Owens et al. [28] introduced general-purpose computing on GPU and listed GPU computing applications, e.g., game physics and computational biophysics. Pratx and Xing [29] reviewed GPU computing applications in medical physics, including three areas: dose calculation, treatment plan optimization, and image processing. Some works analyze and predict GPU performance. Guo et al. [30] presented a performance modeling and optimization analysis tool that can provide optimal SpMV solutions based on the CSR, ELL, COO, and HYB formats. To improve the performance of SpMV, Li et al. [31] considered the probability distribution of the number of non-zero elements per row to use four sparse matrix storage formats efficiently. In CF recommender systems, Gao et al. [32] proposed item-based CF recommender systems on GPU. Kato and Hosino [33] proposed singular value decomposition (SVD) based CF on GPU. However, this approach is not suitable for large-scale CF problems.


To solve the problem of large-scale decision variables, known as the curse of dimensionality, Cano and Garcia-Martinez [34] proposed distributed GPU computing for large-scale global optimization. To solve the problem of the non-even distribution of a sparse matrix, which can lead to load imbalance on GPUs and multi-core CPUs, Yang et al. [35] proposed a probability-based partition strategy. Stuart and Owens [36] proposed MapReduce-based GPU clustering. MapReduce focuses on parallelization rather than the details of communication and resource allocation. Thus, this approach integrates large numbers of map and reduce chunks and uses partial reductions and accumulations to harness GPU clusters. Chen et al. [37] proposed an in-memory heterogeneous CPU-GPU computing architecture to process highly data-intensive applications.

3 PRELIMINARY

In this section, we give the problem definition of MF and set up some notation in Section 3.1. We present the state-of-the-art parallelized and distributed SGD and non-SG approaches in Section 3.2. CUDA and multi-GPU details are given in Section 3.3.

3.1 Problem and Notation

We denote matrices by uppercase letters and vectors by bold-faced lowercase letters. Let $A \in \mathbb{R}^{m \times n}$ be the rating matrix in a recommender system, where $m$ and $n$ are the numbers of users and items, respectively. $a_{i,j}$ denotes the $(i,j)$ entry of $A$. We denote $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ as the user feature matrix and item feature matrix, respectively. Because of the sparsity of CF problems, we use $\Omega$ to denote the set of indices of the observed ratings. We use $\Omega_i$ and $\Omega_j$ to denote the column indices observed in the $i$th row and the row indices observed in the $j$th column, respectively. We also denote the $k$th column of $U$ by $u_k$, and the $k$th column of $V$ by $v_k$. The $k$th elements of $u_i$ and $v_j$ are represented by $u_{i,k}$ and $v_{j,k}$, respectively. The $i$th element of $u_k$ is $u_{k,i}$, and the $j$th element of $v_k$ is $v_{k,j}$. In CF MF recommenders, most algorithms in the related literature take $UV^T$ as the low-rank approximation matrix to predict the non-rated or zero entries of the sparse rating matrix $A$, and the approximation process is accomplished by minimizing a Euclidean distance function that measures the approximation degree [3], [4], [5], [7]. The MF problem for recommender systems is defined as follows [3], [4], [5], [7]:

$$\arg\min_{U,V} d = \sum_{(i,j)\in\Omega} \Big[ (a_{i,j} - \hat{a}_{i,j})^2 + \lambda_U \|u_i\|^2 + \lambda_V \|v_j\|^2 \Big], \tag{1}$$

where $\hat{a}_{i,j} = u_i v_j^T$, $\|\cdot\|$ is the Euclidean norm, and $\lambda_U$ and $\lambda_V$ are regularization parameters to avoid over-fitting. We observe that problem (1) involves two parameters, $U$ and $V$. Problem (1) is non-convex. Thus, the distance function $d$ may fall into local minima. The SG-based approach can escape local minima by randomly selecting an $(i,j)$ entry. Once $a_{i,j}$ is chosen, the objective function in problem (1) is reduced to

$$d_{i,j} = (a_{i,j} - \hat{a}_{i,j})^2 + \lambda_U \|u_i\|^2 + \lambda_V \|v_j\|^2. \tag{2}$$

After calculating the sub-gradient over $u_i$ and $v_j$, and selecting the regularization parameters $\lambda_U$ and $\lambda_V$, the update rule is formulated as

$$\begin{cases} v_j \leftarrow v_j + \gamma (e_{i,j} u_i - \lambda_V v_j), \\ u_i \leftarrow u_i + \gamma (e_{i,j} v_j - \lambda_U u_i), \end{cases} \tag{3}$$

where

$$e_{i,j} = a_{i,j} - \hat{a}_{i,j} \tag{4}$$

is the error between the existing rating and the predicted rating of the $(i,j)$ entry, and $\gamma$ is the learning rate.
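As a concrete illustration (a minimal sketch, not the authors' code), one serial SGD step implementing update rules (3) and (4) for a single observed rating could look as follows; the function name and the row-major layout of $U$ and $V$ are our own assumptions:

```cuda
// Minimal sketch of one serial SGD step for a single observed rating a_ij,
// following update rules (3)-(4). Ui and Vj point to the r-dimensional feature
// vectors u_i and v_j.
void sgd_step(float* Ui, float* Vj, float a_ij, int r,
              float gamma, float lambdaU, float lambdaV) {
  float pred = 0.0f;                            // \hat{a}_ij = u_i v_j^T
  for (int k = 0; k < r; ++k) pred += Ui[k] * Vj[k];
  const float e = a_ij - pred;                  // e_ij, Eq. (4)
  for (int k = 0; k < r; ++k) {
    const float u = Ui[k], v = Vj[k];
    Vj[k] = v + gamma * (e * u - lambdaV * v);  // item update, Eq. (3)
    Ui[k] = u + gamma * (e * v - lambdaU * u);  // user update, Eq. (3)
  }
}
```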

The update rule (3) is inherently sequential. We randomly select three entry indices $\{(i_0,j_0), (i_0,j_1), (i_0,j_2)\}$, where $\{j_0, j_1, j_2\} \subseteq \Omega_{i_0}$. We select three threads $\{T_0, T_1, T_2\}$ to update the specific feature vectors of $\{a_{i_0,j_0}, a_{i_0,j_1}, a_{i_0,j_2}\}$, respectively. The update process is as follows:

$$T_0: \; v_{j_0} \leftarrow v_{j_0} + \gamma (e_{i_0,j_0} u_{i_0} - \lambda_V v_{j_0}), \quad u_{i_0} \leftarrow u_{i_0} + \gamma (e_{i_0,j_0} v_{j_0} - \lambda_U u_{i_0});$$
$$T_1: \; v_{j_1} \leftarrow v_{j_1} + \gamma (e_{i_0,j_1} u_{i_0} - \lambda_V v_{j_1}), \quad u_{i_0} \leftarrow u_{i_0} + \gamma (e_{i_0,j_1} v_{j_1} - \lambda_U u_{i_0});$$
$$T_2: \; v_{j_2} \leftarrow v_{j_2} + \gamma (e_{i_0,j_2} u_{i_0} - \lambda_V v_{j_2}), \quad u_{i_0} \leftarrow u_{i_0} + \gamma (e_{i_0,j_2} v_{j_2} - \lambda_U u_{i_0}).$$

We observe that the three threads $\{T_0, T_1, T_2\}$ update $u_{i_0}$ simultaneously, which is the parallelization over-writing problem. The update process cannot be separated into independent parts because of the dependence on the user and item pair.

3.2 The State-of-the-Art Parallelization Approaches

3.2.1 SG Based

Limited by the over-writing problem, i.e., the dependence on the user-item pair, prior works focus on splitting the rating matrix into several independent blocks that do not share the same rows and columns, and on applying SGD to those independent blocks.

DSGD. In the following discussion, the independent blocks share neither common columns nor common rows of the rating matrix. Given a 2-by-2 partitioned rating matrix and 2 nodes, the cyclic update approach of DSGD is illustrated in Fig. 1. The first training iteration assigns the 2 diagonal matrix blocks to the 2 nodes {Node 0, Node 1}; Node 0 updates {U0, V0}, and Node 1 updates {U1, V1}. In the last iteration, Node 0 and Node 1 update {U0, V1} and {U1, V0}, respectively. In distributed systems, data communication, synchronization, and data imbalance prevent DSGD from obtaining scalable speedup.
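The cyclic schedule just described can be made explicit with a short sketch (our own illustration, not from the paper): in sub-epoch l of a training epoch, node q processes rating block A(q, (q+l) mod p), so no two nodes share a row block of U or a column block of V:

```cuda
// Sketch of DSGD-style cyclic block assignment for a p-by-p partition.
#include <cstdio>

int main() {
  const int p = 2;                       // number of nodes (and partitions), as in Fig. 1
  for (int l = 0; l < p; ++l) {          // sub-epochs within one training epoch
    for (int q = 0; q < p; ++q) {        // nodes working in parallel
      int j = (q + l) % p;               // column block assigned to node q
      std::printf("sub-epoch %d: node %d updates {U%d, V%d} on block A(%d,%d)\n",
                  l, q, q, j, q, j);
    }
    // DSGD synchronizes all nodes here before the next sub-epoch.
  }
  return 0;
}
```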

FPSGD. Fig. 2 illustrates FPSGD. FPSGD splits the rating matrix into several independent blocks and applies SGD within each independent block. The first iteration assigns 2 diagonal matrix blocks to 2 threads {T0, T1}; thread T0 updates {U0, V0}, and thread T1 updates {U1, V1}. Suppose that thread T0 finishes its update first; thread T0 then has three choices for the next iteration, e.g., {U2, V0}, {U0, V2}, or {U2, V2}.

Hogwild!. The intuition of Hogwild! is that the rating matrix is very sparse. Thus, the occurrence probability of the over-writing problem is low.


3.2.2 Non-SG Based

CCD++. Based on the idea of coordinate descent, CCD++ updates $(u_1, v_1)$ through $(u_r, v_r)$ cyclically. Updating $u_k$ can be converted into updating the entries $u_{k,i}$ in parallel, and updating $v_k$ can be converted into updating the entries $v_{k,j}$ in parallel. The one-variable subproblem of updating $u_{k,i}$ is reformulated as

$$u_{k,i} \leftarrow \arg\min_{z} \sum_{j \in \Omega_i} \Big\{ \big( a_{i,j} - (\hat{a}_{i,j} - u_{k,i} v_{k,j}) - z\, v_{k,j} \big)^2 + \lambda_U z^2 \Big\}, \tag{5}$$

and the one-variable subproblem of updating $v_{k,j}$ is reformulated as

$$v_{k,j} \leftarrow \arg\min_{w} \sum_{i \in \Omega_j} \Big\{ \big( a_{i,j} - (\hat{a}_{i,j} - u_{k,i} v_{k,j}) - w\, u_{k,i} \big)^2 + \lambda_V w^2 \Big\}. \tag{6}$$

The solution of (5) is

$$z \leftarrow \frac{\sum_{j \in \Omega_i} \big( a_{i,j} - (\hat{a}_{i,j} - u_{k,i} v_{k,j}) \big) v_{k,j}}{\lambda_U + \sum_{j \in \Omega_i} v_{k,j}^2}, \qquad u_{k,i} \leftarrow z, \tag{7}$$

and the solution of (6) is

$$w \leftarrow \frac{\sum_{i \in \Omega_j} \big( a_{i,j} - (\hat{a}_{i,j} - u_{k,i} v_{k,j}) \big) u_{k,i}}{\lambda_V + \sum_{i \in \Omega_j} u_{k,i}^2}, \qquad v_{k,j} \leftarrow w. \tag{8}$$
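For concreteness, a minimal sketch (our own illustration, not the CCD++ implementation) of the one-variable update (7) for a single $u_{k,i}$ is given below; it assumes the residual $a_{i,j} - \hat{a}_{i,j}$ is maintained for the observed entries of row $i$, and all names are hypothetical:

```cuda
// One-variable coordinate update of u_{k,i} following Eq. (7).
// cols[t] and resid[t] give, for the t-th observed rating of user i, the item
// index j and the current residual a_ij - \hat{a}_ij; v_k is the kth column of V.
float ccdpp_update_uki(float u_ki, const int* cols, const float* v_k,
                       const float* resid, int nnz_i, float lambdaU) {
  float num = 0.0f, den = lambdaU;
  for (int t = 0; t < nnz_i; ++t) {
    const float vkj = v_k[cols[t]];
    // a_ij - (\hat{a}_ij - u_{k,i} v_{k,j}) = resid[t] + u_{k,i} * v_{k,j}
    num += (resid[t] + u_ki * vkj) * vkj;
    den += vkj * vkj;
  }
  return num / den;   // the new value z assigned to u_{k,i}
}
```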

3.3 GPU-Based Computing

The GPU resides on a device, and a GPU consists of many streaming multiprocessors (SMs). Each SM contains a number of streaming processors (SPs). CUDA is a programming interface that enables GPUs to be compatible with various programming languages and applications. In CUDA, kernels are functions that are executed on the GPU. A kernel function is executed by a batch of threads. The batch of threads is organized as a grid of thread blocks. The thread blocks are mapped to SMs. As shown in Fig. 3, the greater the number of SMs a GPU has, the higher the degree of parallelism the GPU has. Thus, a GPU with more SMs will execute a CUDA program in less time than a GPU with fewer SMs. Threads in a block are organized into small groups of 32 called warps for execution on the processors, and warps are implicitly synchronous; however, threads in different blocks are asynchronous. CUDA assumes that a CUDA kernel, i.e., a CUDA program, is executed on the GPU (device) and that the rest of the C program is executed on the CPU (host). CUDA threads access data from multiple memory hierarchies. Each thread has private local memory, and each thread block has shared memory that is visible to all threads within the thread block. All threads can access global memory.
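To make the kernel/grid/block/thread hierarchy concrete, the toy kernel below (our own illustration, not part of CUMSGD) launches one thread block per matrix row and one thread per feature dimension:

```cuda
// Toy CUDA kernel: block blockIdx.x handles one row, thread threadIdx.x
// handles one of the r columns of that row.
__global__ void scaleRows(float* M, const float* rowScale, int r) {
  int row = blockIdx.x;
  int col = threadIdx.x;
  if (col < r) M[row * r + col] *= rowScale[row];
}

// Host side: a grid of m thread blocks, each with r threads.
// scaleRows<<<m, r>>>(d_M, d_rowScale, r);
```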

Multiple GPUs are connected to the host by the Peripheral Component Interconnect Express (PCIe) bus, as shown in Fig. 4. CUDA can perform computation on the device and data transfers between the device and host concurrently. More recent GPUs (K20m, K40c) have two copy engines, which can perform data transfers to and from a device, and among

Fig. 2. An illustration of FPSGD.

Fig. 3. CUDA kernel and thread batching.

Fig. 4. Multi-GPU parallel computing model.

Fig. 1. An illustration of DSGD synchronous cyclic update.


devices concurrently. CUDA provides synchronization instructions, which can ensure correct execution on multiple GPUs, and the GPUs in a multi-GPU system have a consumer/producer relationship. Memory copies between two different devices can be performed after the instruction enabling CUDA peer-to-peer memory access has been issued.
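A minimal host-side sketch (ours, not the paper's code) of enabling peer-to-peer access and issuing an asynchronous device-to-device copy with the CUDA runtime API is shown below; the buffer names are illustrative:

```cuda
#include <cuda_runtime.h>

// Copy an item feature sub-matrix from GPU 0 to GPU 1, overlapping with
// computation launched on other streams.
void copyVblockPeer(float* dstOnGpu1, const float* srcOnGpu0,
                    size_t bytes, cudaStream_t stream) {
  int canAccess = 0;
  cudaDeviceCanAccessPeer(&canAccess, 1, 0);   // can GPU 1 access GPU 0 directly?
  if (canAccess) {
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);          // one-time enable of P2P access
  }
  // Asynchronous peer copy: dst on device 1, src on device 0.
  cudaMemcpyPeerAsync(dstOnGpu1, 1, srcOnGpu0, 0, bytes, stream);
}
```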

4 A MSGD MF APPROACH

In Section 4.1, we propose the MSGD MF approach, which can remove the dependence on the user and item pair. We present the parallelization design of MSGD in Section 4.2.

4.1 Proposed MSGD Approach

The essence of the SGD parallelization over-writing problem is that the variables selected by multiple threads may share the same row or column identity. If, in a training epoch, multiple threads select entries $a_{i,j}$ belonging to the same row of $A$ and SGD only updates $V$, and, correspondingly, multiple threads select entries $a_{i,j}$ belonging to the same column of $A$ and SGD only updates $U$, then the parallelization over-writing problem can be solved. However, how can convergence and accuracy be guaranteed? In the following section, we give the derivation of the MSGD update rules.

MSGD splits the distance function d of problem (1) into

$$d = \sum_{i=1}^{m} |\Omega_i| \, d_i, \tag{9}$$

and

$$d = \sum_{j=1}^{n} |\Omega_j| \, d_j, \tag{10}$$

where

$$d_i = \frac{1}{|\Omega_i|} \sum_{j \in \Omega_i} \Big( e_{i,j}^2 + \lambda_U \|u_i\|^2 + \lambda_V \|v_j\|^2 \Big) = \frac{1}{|\Omega_i|} \sum_{j \in \Omega_i} c_j(u_i), \tag{11}$$

and

$$d_j = \frac{1}{|\Omega_j|} \sum_{i \in \Omega_j} \Big( e_{i,j}^2 + \lambda_U \|u_i\|^2 + \lambda_V \|v_j\|^2 \Big) = \frac{1}{|\Omega_j|} \sum_{i \in \Omega_j} c_i(v_j). \tag{12}$$

Thus, minimizing $d$ is equivalent to minimizing $\sum_{i} |\Omega_i| d_i$ and $\sum_{j} |\Omega_j| d_j$ alternately. Furthermore, minimizing $|\Omega_i| d_i$ is equivalent to minimizing $d_i$, and minimizing $|\Omega_j| d_j$ is equivalent to minimizing $d_j$. If we fix $v_j, j \in \Omega_i$, for $d_i$, then the $d_i$, $i \in \{1, 2, \ldots, m\}$, are independent of each other. Therefore, applying gradient descent with learning rate $\gamma$ to search for the optimal solution that minimizes $d_i$ and $d_j$ is derived as follows:

$$\arg\min_{u_i} d_i \;\Rightarrow\; u_i \leftarrow u_i - \gamma \nabla d_i(u_i) = u_i - \frac{\gamma}{|\Omega_i|} \sum_{j \in \Omega_i} \nabla c_j(u_i) = u_i - \frac{2\gamma}{|\Omega_i|} \sum_{j \in \Omega_i} \big( -e_{i,j} v_j + \lambda_U u_i \big), \tag{13}$$

and

$$\arg\min_{v_j} d_j \;\Rightarrow\; v_j \leftarrow v_j - \gamma \nabla d_j(v_j) = v_j - \frac{\gamma}{|\Omega_j|} \sum_{i \in \Omega_j} \nabla c_i(v_j) = v_j - \frac{2\gamma}{|\Omega_j|} \sum_{i \in \Omega_j} \big( -e_{i,j} u_i + \lambda_V v_j \big). \tag{14}$$

However, at each step, gradient descent requires the evaluation of all $|\Omega_i|$ and $|\Omega_j|$ terms to update $u_i$ and $v_j$, respectively, which is expensive. SGD is a popular modification of the gradient descent update rule (13) [38], [39], [40], [41], [42], [43], where at each training epoch $t = 1, 2, \ldots$, we randomly draw $j_t \in \Omega_i$, and the modification of the gradient descent update rule (13) to update $U$ is

$$u_i^t = u_i^{t-1} - \gamma \nabla c_{j_t}(u_i^{t-1}) = u_i^{t-1} + \gamma \big( e_{i,j_t} v_{j_t}^{t-1} - \lambda_U u_i^{t-1} \big). \tag{15}$$

The same modification of the gradient descent update rule (14) to update $V$ is

$$v_j^t = v_j^{t-1} - \gamma \nabla c_{i_t}(v_j^{t-1}) = v_j^{t-1} + \gamma \big( e_{i_t,j} u_{i_t}^{t-1} - \lambda_V v_j^{t-1} \big), \tag{16}$$

where at each epoch $t = 1, 2, \ldots$, we randomly draw $i_t \in \Omega_j$. The expectation $E[u_i^t \mid u_i^{t-1}]$ is identical to (13), and the expectation $E[v_j^t \mid v_j^{t-1}]$ is identical to (14) [38], [39], [40], [41], [42]. The algorithm of MSGD is described in Algorithm 1.

Algorithm 1. MSGD

Input: Initial U and V, rating matrix A, learning rate γ, regularization parameters λ_V and λ_U, training epochs epo.
Output: U, V.
1: for loop from epo to 0 do
2:   for j from 0 to n − 1 do
3:     for i from the first to the last element of Ω_j do
4:       Update v_j by update rule (16).
5:     end for
6:   end for
7:   for i from 0 to m − 1 do
8:     for j from the first to the last element of Ω_i do
9:       Update u_i by update rule (15).
10:    end for
11:  end for
12: end for
13: return (U, V).
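A sequential sketch of Algorithm 1 (our own C-style rendering, not the authors' code) is given below; it assumes the ratings are stored in CSC form for the V pass and in CSR form for the U pass, with U and V stored row-major with rank r:

```cuda
// One MSGD training epoch: update V with U fixed (rule (16)), then U with V
// fixed (rule (15)). col_ptr/row_idx/vals_csc is the CSC view of A and
// row_ptr/col_idx/vals_csr is the CSR view.
void msgd_epoch(float* U, float* V, int m, int n, int r,
                const int* col_ptr, const int* row_idx, const float* vals_csc,
                const int* row_ptr, const int* col_idx, const float* vals_csr,
                float gamma, float lambdaU, float lambdaV) {
  for (int j = 0; j < n; ++j)                       // V pass
    for (int t = col_ptr[j]; t < col_ptr[j + 1]; ++t) {
      int i = row_idx[t];
      float e = vals_csc[t];                        // e_ij = a_ij - u_i v_j^T
      for (int k = 0; k < r; ++k) e -= U[i * r + k] * V[j * r + k];
      for (int k = 0; k < r; ++k)
        V[j * r + k] += gamma * (e * U[i * r + k] - lambdaV * V[j * r + k]);
    }
  for (int i = 0; i < m; ++i)                       // U pass
    for (int t = row_ptr[i]; t < row_ptr[i + 1]; ++t) {
      int j = col_idx[t];
      float e = vals_csr[t];
      for (int k = 0; k < r; ++k) e -= U[i * r + k] * V[j * r + k];
      for (int k = 0; k < r; ++k)
        U[i * r + k] += gamma * (e * V[j * r + k] - lambdaU * U[i * r + k]);
    }
}
```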

Since problems (11) and (12) are symmetric, we only consider the example of problem (11). In general, under two constraint conditions, namely that $c_j$ is smooth and Lipschitz-continuous with constant $L$ and that $d_i$ is strongly convex with constant $\mu$, MSGD is convergent, and as long as we pick a small constant step size $\gamma < 1/L$, gradient descent on $d_i$ obtains a convergence rate of $O(1/t)$ [39], [44]. We present the convergence analysis of MSGD in the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/


TPDS.2017.2718515. Furthermore, MSGD has the same time complexity as SGD, i.e., $O(r|\Omega|)$.

4.2 Parallelization Design

We report two parallelization approaches of MSGD, i.e., a decentralized approach and a centralized approach.

We select three entry indices $\{(i_0,j_0), (i_0,j_1), (i_0,j_2)\}$, where $\{j_0, j_1, j_2\} \subseteq \Omega_{i_0}$. We select three threads $\{T_0, T_1, T_2\}$ to update $\{v_{j_0}, v_{j_1}, v_{j_2}\}$, respectively. In the decentralized approach, threads $\{T_0, T_1, T_2\}$ update $\{v_{j_0}, v_{j_1}, v_{j_2}\}$ in parallel as follows:

$$\begin{bmatrix} v_{j_0}^t \\ v_{j_1}^t \\ v_{j_2}^t \end{bmatrix} = \begin{bmatrix} v_{j_0}^{t-1} + \gamma (e_{i_0,j_0} u_{i_0}^{t-1} - \lambda_V v_{j_0}^{t-1}) \\ v_{j_1}^{t-1} + \gamma (e_{i_0,j_1} u_{i_0}^{t-1} - \lambda_V v_{j_1}^{t-1}) \\ v_{j_2}^{t-1} + \gamma (e_{i_0,j_2} u_{i_0}^{t-1} - \lambda_V v_{j_2}^{t-1}) \end{bmatrix}. \tag{17}$$

In the $(i_0+1)$th iteration, threads $\{T_0, T_1, T_2\}$ select three different column indices from $\Omega_{i_0}$ and update $\{v_{j_0}, v_{j_1}, v_{j_2}\}$ in parallel.

We select three entry indices $\{(i_0,j_0), (i_1,j_1), (i_2,j_2)\}$, where $i_0 \in \Omega_{j_0}$, $i_1 \in \Omega_{j_1}$, $i_2 \in \Omega_{j_2}$. We select three threads $\{T_0, T_1, T_2\}$ to update $\{v_{j_0}, v_{j_1}, v_{j_2}\}$, respectively. In the centralized approach, threads $\{T_0, T_1, T_2\}$ update $\{v_{j_0}, v_{j_1}, v_{j_2}\}$ in parallel as follows:

$$\begin{bmatrix} v_{j_0}^t \\ v_{j_1}^t \\ v_{j_2}^t \end{bmatrix} = \begin{bmatrix} v_{j_0}^{t-1} + \gamma (e_{i_0,j_0} u_{i_0}^{t-1} - \lambda_V v_{j_0}^{t-1}) \\ v_{j_1}^{t-1} + \gamma (e_{i_1,j_1} u_{i_1}^{t-1} - \lambda_V v_{j_1}^{t-1}) \\ v_{j_2}^{t-1} + \gamma (e_{i_2,j_2} u_{i_2}^{t-1} - \lambda_V v_{j_2}^{t-1}) \end{bmatrix}. \tag{18}$$

Furthermore, we report two approaches to update $U$ in parallel.

We select three entry indices $\{(i_0,j_0), (i_1,j_0), (i_2,j_0)\}$, where $\{i_0, i_1, i_2\} \subseteq \Omega_{j_0}$. We select three threads $\{T_0, T_1, T_2\}$ to update $\{u_{i_0}, u_{i_1}, u_{i_2}\}$, respectively. The decentralized approach for threads $\{T_0, T_1, T_2\}$ to update $\{u_{i_0}, u_{i_1}, u_{i_2}\}$ is as follows:

$$\begin{bmatrix} u_{i_0}^t \\ u_{i_1}^t \\ u_{i_2}^t \end{bmatrix} = \begin{bmatrix} u_{i_0}^{t-1} + \gamma (e_{i_0,j_0} v_{j_0}^{t-1} - \lambda_U u_{i_0}^{t-1}) \\ u_{i_1}^{t-1} + \gamma (e_{i_1,j_0} v_{j_0}^{t-1} - \lambda_U u_{i_1}^{t-1}) \\ u_{i_2}^{t-1} + \gamma (e_{i_2,j_0} v_{j_0}^{t-1} - \lambda_U u_{i_2}^{t-1}) \end{bmatrix}. \tag{19}$$

We select three entry indices $\{(i_0,j_0), (i_1,j_1), (i_2,j_2)\}$, where $j_0 \in \Omega_{i_0}$, $j_1 \in \Omega_{i_1}$, $j_2 \in \Omega_{i_2}$. We select three threads $\{T_0, T_1, T_2\}$ to update $\{u_{i_0}, u_{i_1}, u_{i_2}\}$, respectively. The centralized approach for threads $\{T_0, T_1, T_2\}$ to update $\{u_{i_0}, u_{i_1}, u_{i_2}\}$ is as follows:

$$\begin{bmatrix} u_{i_0}^t \\ u_{i_1}^t \\ u_{i_2}^t \end{bmatrix} = \begin{bmatrix} u_{i_0}^{t-1} + \gamma (e_{i_0,j_0} v_{j_0}^{t-1} - \lambda_U u_{i_0}^{t-1}) \\ u_{i_1}^{t-1} + \gamma (e_{i_1,j_1} v_{j_1}^{t-1} - \lambda_U u_{i_1}^{t-1}) \\ u_{i_2}^{t-1} + \gamma (e_{i_2,j_2} v_{j_2}^{t-1} - \lambda_U u_{i_2}^{t-1}) \end{bmatrix}. \tag{20}$$

There are two parallelization approaches to update both $U$ and $V$. We only discuss MSGD updating $V$ in the following sections due to limited space.

Based on the idea of the MSGD parallelization designs, we consider the following CUDA programming problems:

• Which CUDA parallelization approach achieves the highest efficiency?

• How can GPU memory accommodate the data sets of large-scale recommender systems?

5 CUDA PARALLELIZATION DESIGN

In this section, we report two scheduling strategies of CUDA thread blocks in Section 5.1. We introduce CUMSGD coalesced memory access in Section 5.2. We present the details of the multi-GPU implementation in Section 5.3.

5.1 Two Parallelization Approaches

In this section, we show the scheduling strategies of thread blocks. The scheduling strategy determines how thread blocks select entries of the rating matrix $A$. It is necessary to pay attention to the difference between the two approaches, i.e., the decentralized and centralized approaches. In the decentralized approach, thread blocks select entries within a row of $A$. In the centralized approach, a thread block first selects a column index $j$; then, the thread block selects an entry index $(i,j)$, where $i \in \Omega_j$.

Algorithm 2. CUDA Decentralized Updating V

Input: Initial U and V, rating matrix A, learning rate γ, regularization parameter λ_V, training epochs epo, and total number of thread blocks C.
Output: V.
1: T ← thread block id.
2: for loop from epo to 0 do
3:   for i = 0, 1, ..., m − 1 do
4:     parallel: % Each thread block T updates its own v_j by the for loop independently
5:       Select column index j ∈ Ω_i.
6:       Update v_j by update rule (16).
7:     end parallel
8:   end for
9: end for
10: return V.

Fig. 5 illustrates a toy example of decentralized updating of V. As shown in Fig. 5, a training epoch is separated into 6 steps. Taking the 1st and 2nd steps as an example, after thread block 0 updates v_0, 0 ∈ Ω_0, thread block 0 selects v_1, 1 ∈ Ω_1, which has not yet been updated, in the 2nd step. We observe that a rating matrix has a large number of zero entries, whereas only the non-zero rating entries are typically stored in GPU global memory together with their indices. The distribution of non-zero entries in a rating matrix is random. CUMSGD cannot guarantee that a thread block selects the same column index in two neighbouring update steps. Meanwhile, the shared memory of a thread block can hold only a small part of the data. Thus, after a thread block updates v_j, the thread block writes v_j back to GPU global memory for the

Fig. 5. CUDA decentralized updating V .


next update step. Algorithm 2 describes the thread block scheduling strategy of the decentralized approach.

Algorithm 3. CUDA Centralized Updating V

Input: Initial U and V, rating matrix A, learning rate γ, regularization parameter λ_V, training epochs epo, and total number of thread blocks C.
Output: V.
1: T ← thread block id.
2: for loop from epo to 0 do
3:   for j from T to n − 1 with stride C do
4:     parallel: % Each thread block T updates its own v_j by the for loop independently
5:       Select i ∈ Ω_j.
6:       Update v_j by update rule (16).
7:     end parallel
8:   end for
9: end for
10: return V.
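A possible CUDA kernel for the centralized update of V (Algorithm 3) is sketched below; this is our own illustration rather than the released CUMSGD code, it assumes the ratings are stored in CSC format, and the shared-memory reduction is kept deliberately simple (the optimized reduction is discussed in Section 6.2.1):

```cuda
// Centralized updating of V: one thread block per column j (strided by
// gridDim.x), r threads per block, ratings in CSC (col_ptr/row_idx/vals).
// Dynamic shared memory must hold 2*r + 1 floats.
__global__ void centralizedUpdateV(const int* col_ptr, const int* row_idx,
                                   const float* vals, const float* U, float* V,
                                   int n, int r, float gamma, float lambdaV) {
  extern __shared__ float sh[];     // sh[0..r-1]: v_j, sh[r..2r-1]: products, sh[2r]: e_ij
  int k = threadIdx.x;              // this thread owns feature index k
  for (int j = blockIdx.x; j < n; j += gridDim.x) {
    sh[k] = V[j * r + k];           // load v_j into shared memory (coalesced)
    for (int t = col_ptr[j]; t < col_ptr[j + 1]; ++t) {
      int i = row_idx[t];
      float u_ik = U[i * r + k];    // coalesced load of u_i
      sh[r + k] = u_ik * sh[k];     // element-wise product for u_i v_j^T
      __syncthreads();
      if (k == 0) {                 // naive serial reduction, for clarity only
        float dot = 0.0f;
        for (int x = 0; x < r; ++x) dot += sh[r + x];
        sh[2 * r] = vals[t] - dot;  // e_ij
      }
      __syncthreads();
      float e = sh[2 * r];
      sh[k] += gamma * (e * u_ik - lambdaV * sh[k]);   // update rule (16)
      __syncthreads();
    }
    V[j * r + k] = sh[k];           // write v_j back to global memory
  }
}
// Launch sketch: centralizedUpdateV<<<blocks, r, (2 * r + 1) * sizeof(float)>>>(...);
```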

Fig. 6 illustrates a toy example of centralized updating of V. The three differently colored boxes on the three columns {0, 2, 4} of the rating matrix A represent three thread blocks that update the corresponding {v_0, v_2, v_4} using {{a_{i,0}, u_i, v_0 : i ∈ Ω_0}, {a_{i,2}, u_i, v_2 : i ∈ Ω_2}, {a_{i,4}, u_i, v_4 : i ∈ Ω_4}}, respectively. After one training iteration, a thread block selects a v_j which has not yet been updated in the current training epoch. Algorithm 3 shows the thread block scheduling strategy of the centralized update approach.

5.2 Coalesced Memory Access

We set the parameter r as an integral multiple of 32 (the warp size) for warp-synchronous execution and coalesced global memory access.

Fig. 7 illustrates the coalesced access of CUMSGD to U and V in global memory. As shown in Fig. 7, a thread block can access v_j, j ∈ Ω_i, contiguously owing to coalesced access to GPU global memory via 32-, 64-, or 128-byte memory transactions [45]. A warp (warp size = 32) accesses 128 successive bytes in global memory at the same time if the memory access is aligned. For a 128-byte memory transaction, a warp fetches 16 aligned double-type elements from global memory to local memory two times. We set the number of threads per thread block as r ∈ {32, 64, 128, 256, 512} to synchronize warp execution.
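The access pattern can be sketched as follows (our own illustration): with r threads per block and row-major feature matrices, thread k reads the kth element of u_i and v_j, so consecutive threads of a warp touch consecutive addresses and each request maps to full memory transactions:

```cuda
// Coalesced loading of u_i and v_j into shared memory by r threads of a block.
__device__ void loadFeaturePair(const float* U, const float* V, int i, int j,
                                int r, float* u_shared, float* v_shared) {
  int k = threadIdx.x;             // k in [0, r), with r a multiple of 32
  u_shared[k] = U[i * r + k];      // each warp reads 32 consecutive floats of u_i
  v_shared[k] = V[j * r + k];      // each warp reads 32 consecutive floats of v_j
  __syncthreads();
}
```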

5.3 Multi-GPU Approach

We extend CUMSGD to MCUMSGD when the size of a data set is larger than the memory capacity of a single GPU. We present the multi-GPU data allocation and scheduling strategies, and provide a space and time complexity analysis of MCUMSGD on p GPUs.

Algorithm 4. The Basic MCUMSGD Approach

Input: Initial U and V, rating matrix A, learning rate γ, column index matrix Q, the number of GPUs p, training epochs epo, and regularization parameters λ_U, λ_V.
Output: U, V.
1: for loop from epo to 0 do
2:   for q ∈ {0, 1, ..., p − 1} do
3:     Load {A_q, U^q, V^q} to GPU q. % GPUs' initial workload
4:   end for
5:   parallel: % GPU q, q ∈ {0, ..., p − 1}
6:   for l ∈ {0, 1, ..., p − 1} do
7:     j ← Q_{l,q}.
8:     Update V^j by {A_{q,j}, U^q, V^j}.
9:     Synchronization on all GPUs.
10:    Update U^q by {A_{q,j}, U^q, V^j}.
11:    Load V^j to GPU ((q − 1 + p) mod p).
12:  end for
13:  end parallel
14: end for
15: return (U, V).

To improve the speedup performance on multiple GPUs, we consider how to reduce the task dependence. The rating matrix A is divided into p² sub-matrix blocks, and then the cyclic update strategy is adopted. This is very different from the DSGD cyclic update strategy in Fig. 1, which needs data communication between the p nodes. Each GPU updates an item feature sub-matrix, and then updates its user feature sub-matrix while emitting the item feature sub-matrix to another GPU concurrently, which can hide the communication overhead of loading the item feature sub-matrix. Thus, each GPU needs a cache area to store the received item feature sub-matrix. MCUMSGD can issue multi-GPU synchronization and data access instructions over PCIe. Thus, it can extend the data splitting strategy of DSGD with an asynchronous communication mechanism to MCUMSGD on multiple GPUs.

Given p GPUs, we divide A into p parts {A_0, ..., A_{p−1}} by rows, and the qth part is divided into p sub-parts {A_{q,0}, ..., A_{q,p−1}}. We define S = {S_0, S_1, ..., S_{p−1}} as a partition of the row indices of U, and we define G = {G_0, G_1, ..., G_{p−1}} as a partition of the column indices of V. We divide U into p parts {U^0, U^1, ..., U^{p−1}}, where U^q is a vector set, and the row indices of U^q correspond to S_q. We divide V into p parts {V^0, V^1, ..., V^{p−1}}, where V^q is a vector set, and

Fig. 6. CUDA centralized updating V.

Fig. 7. Coalesced access on V and U in global memory.


the column indices of V^q correspond to G_q. Then, we load {A_q, U^q, V^q} to GPU q in the initial step.

We define a column index matrix $Q \in \mathbb{R}^{p \times p}$, where

$$Q = \begin{bmatrix} 0 & 1 & \cdots & p-2 & p-1 \\ 1 & 2 & \cdots & p-1 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ p-2 & p-1 & \cdots & p-4 & p-3 \\ p-1 & 0 & \cdots & p-3 & p-2 \end{bmatrix}.$$

For example, in a training epoch, in the (l+1)th iteration, l ∈ {0, 1, ..., p − 1}, GPU q updates (U^q, V^j), j = Q_{l,q}. We now formally define MCUMSGD (see Algorithm 4). GPU q takes the index j = Q_{l,q} from its own processing queue (line 7) and updates V^j (line 8). Then, GPU q loads V^j to GPU ((q − 1 + p) mod p) while updating U^q asynchronously (lines 10-11). The computation time of each GPU differs because the p GPUs own different numbers of non-zero rating entries, and this asynchrony problem is handled by synchronization (line 9). We assume two GPUs {GPU 0, GPU 1}. Fig. 8 illustrates how the two GPUs update {V^0, V^1} cyclically, as follows (a host-side sketch of this cyclic schedule is given after the two steps):

Step 1. GPU 0 updates V^0 by {A_{0,0}, U^0, V^0}. GPU 1 updates V^1 by {A_{1,1}, U^1, V^1}.

Step 2. GPU 0 updates U^0 by {A_{0,0}, U^0, V^0} while emitting V^0 to GPU 1 asynchronously. GPU 1 updates U^1 by {A_{1,1}, U^1, V^1} while emitting V^1 to GPU 0 asynchronously.
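A host-side skeleton of this cyclic schedule (our own sketch; the kernel and buffer names are hypothetical placeholders) is shown below. In sub-epoch l, GPU q works on the column block j = Q_{l,q} = (l + q) mod p, then ships V^j to GPU (q − 1 + p) mod p on a copy stream while updating U^q on its compute stream:

```cuda
#include <cuda_runtime.h>

// dV[q]: item feature block currently held by GPU q; dVcache[q]: receive buffer
// for the block arriving from another GPU. Kernel launches are left as comments
// because updateVBlock/updateUBlock are placeholders.
void mcumsgd_epoch(int p, size_t vBlockBytes, cudaStream_t* compute,
                   cudaStream_t* copy, float** dV, float** dVcache) {
  for (int l = 0; l < p; ++l) {
    for (int q = 0; q < p; ++q) {
      int j = (l + q) % p;                         // j = Q_{l,q}
      cudaSetDevice(q);
      // updateVBlock<<<grid, r, shmem, compute[q]>>>(/* A_{q,j}, U^q, V^j */);
      (void)j;
    }
    for (int q = 0; q < p; ++q) {                  // line 9: synchronize all GPUs
      cudaSetDevice(q);
      cudaDeviceSynchronize();
    }
    for (int q = 0; q < p; ++q) {
      int dst = (q - 1 + p) % p;
      cudaSetDevice(q);
      // updateUBlock<<<grid, r, shmem, compute[q]>>>(/* A_{q,j}, U^q, V^j */);
      // Overlap: send V^j to the next GPU while U^q is being updated (lines 10-11).
      cudaMemcpyPeerAsync(dVcache[dst], dst, dV[q], q, vBlockBytes, copy[q]);
    }
  }
}
```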

We consider the complexity analysis in the following cases:

Case 1. The rating matrix and the two feature matrices {A, U, V} are distributed among the p GPUs. Each GPU stores a 1/p fraction of the non-zero entries of the rating matrix A, as well as a 1/p fraction of the user feature matrix U and of the item feature matrix V. Because storing a row of U or V requires r space, the cache area on each GPU needs O(nr/p) space. The maximum space complexity per GPU is O((mr + 2nr + 2|Ω|)/p). The communication time complexity is O(nr/p). It is difficult to define the time complexity of CUDA kernels because the execution time of a CUDA kernel depends on many elements, e.g., the number of thread blocks, the space complexity, and the two update approaches mentioned in Section 4.2.

Case 2. The number of GPUs, p, increases. We distribute {A, U, V} across the p GPUs. As expected, when p increases, nr/p decreases more slowly than |Ω|/p². Thus, the cost of communication will overwhelm the cost of updating V^j, which will lead to a slowdown.

Case 3. MCUMSGD integrates the two approaches mentioned in Section 4.2 to save space. We consider two sparse matrix storage formats, i.e., CSR and CSC. CUMSGD can use the user-oriented CSR format for decentralized updating of V and centralized updating of U, and the item-oriented CSC format for decentralized updating of U and centralized updating of V. Thus, the minimum space complexity of the hybrid approach is O((mr + 2nr + |Ω|)/p).

6 EXPERIMENTS

In this section, the experimental results of MSGD and the comparison with state-of-the-art approaches are presented. We provide the details of the experimental settings in the supplementary material, available online. We compare MSGD and SGD in Section 6.1. We present the speedup performance of CUMSGD in Section 6.2. We compare CUMSGD with state-of-the-art approaches in Section 6.3. Finally, in Sections 6.4 and 6.5, we answer the two questions raised in Section 4.2.

6.1 MSGD versus SGD

We compare the sequential approaches of MSGD and SGD on a CPU, and the comparison is illustrated in Fig. 9. As shown in Fig. 9, the time of the sequential part increases linearly as r increases, and MSGD takes longer than SGD because of the increased number of fetches of U and V. Thus, we conclude that SGD outperforms MSGD on one CPU core due to fewer fetches of U and V. However, this advantage cannot guarantee higher performance in the parallel case. Because of the SGD parallelization over-writing problem, SGD suffers from low parallelism. Furthermore, we present the accuracy comparison of MSGD and SGD in the supplementary material, available online, to demonstrate that MSGD achieves accuracy comparable to SGD.

6.2 CUMSGD Speedup

In this section, we report the GPU speedup performance and give some ideas on how to improve the speedup by optimizing the reduction, tuning parameters, e.g., r and the number of thread blocks, and considering GPU hardware parameters.

6.2.1 Optimizing Reduction

Increasing the rank r of the feature matrices can improve the accuracy of CF MF to a certain degree [4], [5], in addition to the three parameters {λ_U, λ_V, γ}. The time

Fig. 8. Two GPUs asynchronous cyclic update.


complexity of the sequential multiplication of two vectors is O(r). The two-vector multiplication in CUMSGD only needs O(log₂(r)). Thus, a larger r can improve the GPU speedup performance further. Meanwhile, improving the efficiency of the vector multiplication and reduction can improve the efficiency of CUMSGD significantly. From the CUDA update process described in Algorithms 2 and 3, a thread block that has r threads can update a v_j only once by {e_{i,j}, u_i, v_j}. The value of e_{i,j} needs the pre-computation of u_i v_j^T, and the vector dot multiplication needs the cooperation of the r threads within the thread block. Thus, u_i and v_j are stored in the shared memory of the thread block. The number of thread synchronization instructions can be reduced from log₂(r) to log₂(r) − 5 by the warp synchronization mechanism presented in [45]. Fig. 10 illustrates the comparison between the maximum synchronization instruction method (before optimizing) and the reduced synchronization instruction method (after optimizing). As shown in Fig. 10, the reduced synchronization instruction method improves the speedup once r is larger than 128, because the saved cost of warp synchronization is larger than the cost of warp scheduling.
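A sketch of this reduction (ours, following the classic Kepler-era pattern rather than the authors' exact kernel) is shown below: a barrier is issued only while the active width is above one warp, and the last five halving steps run warp-synchronously, which is where the log₂(r) − 5 saving comes from. The shared buffer prod must hold at least max(r, 64) floats, with entries beyond r set to zero:

```cuda
// Block-wide dot-product reduction: prod[k] initially holds u_i[k] * v_j[k].
// Returns the dot product u_i v_j^T to every thread of the block.
__device__ float blockDot(volatile float* prod, int r) {
  int k = threadIdx.x;
  for (int s = r / 2; s > 32; s >>= 1) {   // inter-warp stages need a barrier
    if (k < s) prod[k] += prod[k + s];
    __syncthreads();
  }
  if (k < 32) {                            // intra-warp stages, no barrier (pre-Volta;
    if (r >= 64) prod[k] += prod[k + 32];  //  newer GPUs would add __syncwarp())
    prod[k] += prod[k + 16];
    prod[k] += prod[k + 8];
    prod[k] += prod[k + 4];
    prod[k] += prod[k + 2];
    prod[k] += prod[k + 1];
  }
  __syncthreads();
  return prod[0];                          // e_ij can then be computed as a_ij - prod[0]
}
```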

6.2.2 Speedup on K20m GPU

For comparability of the speedup with various state-of-the-art parallel MF methods, we compare CUMSGD on a single K20m GPU with MSGD on a single CPU core.

Fig. 11 illustrates the comparison of $t_U + t_V$ versus $t_{h2d} + t_{d2h}$. As shown in Fig. 11, $t_U + t_V$ is larger than $t_{h2d} + t_{d2h}$. Thus, we can draw the conclusion that the time cost of updating U and V is much larger than the cost of data copies between the CPU (host) and the GPU (device). CUMSGD needs more than one training epoch to obtain a reasonable RMSE, and we suppose that CUMSGD needs epo training epochs. Thus, the speedup (S) is approximated by

$$S = \frac{epo \cdot T_1}{epo \, (t_U + t_V)} = \frac{T_1}{t_U + t_V}. \tag{21}$$

Fig. 10. A comparison between the maximum synchronization instruction method (before optimizing) and the reduced synchronization instruction method (after optimizing).

Fig. 9. The time of sequential MSGD versus sequential SGD on CPU.


The choice of an appropriate r for the feature matrices can guarantee reasonable accuracy and reduce the training time of the CF MF problem [4], [5], [46]. In our work, because we focus on the speedup performance for various r of the feature matrices rather than the choice of the appropriate r, we run five sets of experiments with r ∈ {32, 64, 128, 256, 512}. GPU occupancy is the ratio of the number of active threads to the maximum number of threads, and high occupancy means that the GPU is working at high efficiency. The GPU occupancy is calculated by the CUDA Occupancy Calculator [45]. In CUMSGD, a thread block has r threads, and the number of thread blocks is tunable, which controls the occupancy. Table 1 lists the occupancy under various conditions, i.e., r ∈ {32, 64, 128, 256, 512}.

As shown in Fig. 12, in the case of 100 percent occupancy, r ∈ {128, 256, 512} can obtain optimal speedup performance. In the case of 52 thread blocks, the speedup (S) increases as r increases, because the occupancy is not full until r = 512. In fact, when r = 512, the case with 52 thread blocks can obtain a 40x speedup, which is almost the best in this setting. Table 1 and Fig. 12 demonstrate that the optimal number of thread blocks for achieving the best speedup on a GPU is the total number of SMs × the number of active thread blocks per SM.
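The rule of thumb above (grid size = number of SMs × active thread blocks per SM) can be computed at run time with the CUDA occupancy API, as in the sketch below; centralizedUpdateV refers to the kernel sketched in Section 5.1 and is only a placeholder:

```cuda
#include <cuda_runtime.h>

__global__ void centralizedUpdateV(const int*, const int*, const float*,
                                   const float*, float*, int, int, float, float);

// Returns SMs * active blocks per SM for a block of r threads, e.g.,
// 13 SMs x 4 blocks = 52 blocks for r = 512 on a K20m (cf. Table 1).
int optimalGridSize(int r, size_t sharedBytesPerBlock) {
  int device = 0, numSMs = 0, blocksPerSM = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocksPerSM, centralizedUpdateV, /*blockSize=*/r, sharedBytesPerBlock);
  return numSMs * blocksPerSM;
}
```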

6.2.3 Speedup on K40c GPU

Scalability is an important evaluation indicator in industrial applications [47], [48]. It is difficult to port prior parallel

TABLE 1
The Occupancy of CUMSGD on K20m and K40c GPUs

Rank (threads per block)          32     64     128    256    512
Registers per thread              21     21     21     21     21
Shared memory per block (bytes)   256    512    1,024  2,048  4,096
Active threads per SM             512    1,024  2,048  2,048  2,048
Active warps per SM               16     32     64     64     64
Active thread blocks per SM       16     16     16     8      4
Occupancy per SM                  25%    50%    100%   100%   100%

Fig. 11. $t_U + t_V$ versus $t_{h2d} + t_{d2h}$ of CUMSGD.

Fig. 12. Speedup on the K20m GPU. The K20m GPU has 13 SMs. We test S for five values of r, i.e., r ∈ {32, 64, 128, 256, 512}. For each fixed r, we test S for four different numbers of thread blocks, i.e., {52, 104, 208, 416}.


approaches onto a GPU, because block partitioning prevents the GPU from allocating the optimal number of thread blocks dynamically, which keeps the GPU from reaching its optimal computing power. CUMSGD removes the dependence on the user-item pair, which enables SGD to acquire high scalability and high parallelism. Fig. 13 illustrates the speedup performance of CUMSGD on the K40c GPU. Because of its higher bandwidth, increased memory clock rate, and larger number of SMs compared with the K20m GPU, the K40c GPU obtains better speedup performance than the K20m GPU. Thus, we confirm that CUMSGD has high scalability and applicability to industrial applications.

6.3 Comparisons with State-of-the-Art Methods

We compare CUMSGD with various parallel matrix factorization methods at r = 128, including two SG-based methods, i.e., DSGD and Hogwild!, and a non-SG-based method, CCD++, in Fig. 14. Fig. 14 illustrates the test RMSE versus training time. Among the three parallel stochastic gradient methods, CUMSGD is faster than DSGD and Hogwild!. We believe that this result is because MSGD has high parallelism and the GPU has high computing power.

The reasons why we compare RMSE versus computation time for the various MF parallelizations are as follows:

• In Hogwild!, each thread selects ratings randomly. It is difficult to identify a full training epoch.

• FPSGD may not have a full training epoch because each thread selects an unprocessed block randomly.

As shown in Fig. 14, CCD++ can obtain the same speed in the beginning and then becomes more stable than DSGD and CUMSGD. We suspect that CCD++ may converge to some local minimum. Fig. 14 shows that Hogwild! performs relatively poorer than DSGD and CUMSGD. We suspect that this result is because Hogwild! randomly selects ratings and blocks, and the over-writing problem might slow the convergence of Hogwild!. SG-based methods can escape local minima and obtain slightly better test RMSE than the non-SG method, CCD++. The speedup performance on a shared memory platform of the prior methods, e.g., DSGD, FPSGD, Hogwild!, and CCD++, is presented in Fig. 15. We observe that the speedup of the four methods grows

Fig. 14. RMSE versus computation time on a 16-core system for Hogwild!, DSGD, and CCD++, and a K20m GPU for CUMSGD (time in seconds), using double-precision floating point arithmetic.

Fig. 13. Speedup on the K40c GPU. The K40c GPU has 15 SMs. We test S for five values of r, i.e., r ∈ {32, 64, 128, 256, 512}. For each fixed r, we test S for four different numbers of thread blocks, i.e., {60, 120, 240, 480}.


more slowly as the number of cores increases linearly. We present the comparison of CUMSGD and FPSGD in the supplementary material, available online.

6.4 User-Oriented or Item-Oriented Setting?

MSGD updates U and V alternately. We consider which of the user-oriented ordering and the item-oriented ordering can obtain higher performance on GPU. Recall that, as described in Sections 4 and 5, both orderings can be adopted by CUMSGD. There are four update combinations in an update epoch:

• Updating U with user-oriented ordering (centralized approach) and updating V with user-oriented ordering (decentralized approach).

• Updating U with item-oriented ordering (decentralized approach) and updating V with item-oriented ordering (centralized approach).

• Updating U with item-oriented ordering (decentralized approach) and updating V with user-oriented ordering (decentralized approach).

• Updating U with user-oriented ordering (centralized approach) and updating V with item-oriented ordering (centralized approach).

Fig. 16 illustrates the GPU running time of a training epoch for the four combinations. We test the GPU running time with the optimal number of thread blocks mentioned in Section 6.2. We suspect that the reasons for the gaps between the GPU running times of the four approaches are as follows:

• As illustrated in Section 4, decentralized updating of V needs more data accesses to GPU global memory than centralized updating of V.

• The data load of decentralized updating of V is more balanced than that of centralized updating of V.

From the experimental results, the combination of user-oriented and item-oriented settings can obtain higher speedup performance than a single user-oriented or item-oriented setting; however, a single setting can save space.

6.5 Multi-GPU Implementation

Multiple GPUs can solve the problem that large-scale data sets cannot be loaded into a single GPU. The dependence on {(A_{i,j}, U^i, V^j) | i, j ∈ {0, 1, ..., p − 1}} may be the bottleneck for improving multi-GPU speedup performance, due to the huge cost of synchronization and communication among multiple GPUs. As shown in Fig. 8, MCUMSGD hides the data access time within the updating of U, but the synchronization cost still exists. Compared with Fig. 12, the speedup performance on a single K20m GPU, Fig. 17 illustrates the speedup performance on two K20m GPUs. We observe that two K20m GPUs can obtain higher speedup performance than a single K20m GPU, whereas the improvement of the speedup performance does not increase linearly with the number of GPUs. We recall the cyclic update process in Section 5.3. We observe that the

Fig. 16. User-oriented update and item-oriented update. The user-oriented and item-oriented settings can be combined in four ways.

Fig. 15. Speedup of various MF methods on a 16-core system.


parallelism is reduced when the two GPUs update {V^0, V^1} cyclically. Furthermore, the reduced parallelism and the unbalanced load of the SMs, owing to the irregular distribution of non-zero entries in the rating matrix A, may leave some SMs idle during the update process, which may lead to non-linear speedup.

7 CONCLUSION AND FUTURE WORKS

In this paper, we propose a CUDA parallelization approach of SGD, named MSGD, to accelerate MF. MSGD removes the dependence on the user-item pair and splits the optimization objective of MF into multiple independent sub-objectives. For large-scale data sets, we propose a multi-GPU strategy to solve large-scale MF problems. From the experimental results, we observe that CUMSGD is faster than FPSGD, DSGD, Hogwild!, and CCD++. CUMSGD can obtain a 50x speedup on a K20m GPU, a 60x speedup on a K40c GPU, and a 70x speedup on two K20m GPUs.

The multi-GPU approach cannot obtain linear speedup. We attribute this to reduced parallelism and load imbalance. We attempt to solve this problem by combining the centralized and decentralized update approaches.

The centralized approach of CUMSGD is based on evenly dividing the rows and columns of the sparse rating matrix, which does not necessarily balance the load among SMs. Thus, we are interested in using an intelligent partitioning approach to obtain an even distribution among SMs. We would also like to improve the convergence rate of MSGD and integrate MSGD with state-of-the-art learning models for CF MF. Online learning, or incremental learning, is a widely used approach for real-time systems [49]. Thus, we would like to extend MSGD to an online learning model for the CF MF problem.

To solve the large-scale MF problem, we can extend MCUMSGD to heterogeneous CPU-GPU platforms. MSGD has inherent parallelism, so we plan to extend MSGD to parallel and distributed platforms, e.g., OpenMP, MapReduce, and Hadoop.

ACKNOWLEDGMENTS

The research was partially funded by the Key Program of the National Natural Science Foundation of China (Grant No. 61432005), the National Outstanding Youth Science Program of the National Natural Science Foundation of China (Grant No. 61625202), the National Natural Science Foundation of China (Grant Nos. 61370095, 61370097, 61472124, 61572175, and 61672224), the International Science & Technology Cooperation Program of China (Grant Nos. 2015DFA11240 and 2014DFB30010), and the National High-tech R&D Program of China (Grant No. 2015AA015305).

REFERENCES

[1] Y. Cai, H.-F. Leung, Q. Li, H. Min, J. Tang, and J. Li, “Typicality-based collaborative filtering recommendation,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 3, pp. 766–779, Mar. 2014.

[2] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 734–749, Jun. 2005.

[3] N. Srebro, J. Rennie, and T. S. Jaakkola, “Maximum-margin matrix factorization,” in Proc. Advances Neural Inf. Process. Syst., 2004, pp. 1329–1336.

[4] Y. Koren, “Factor in the neighbors: Scalable and accurate collaborative filtering,” ACM Trans. Knowl. Discovery Data, vol. 4, no. 1, 2010, Art. no. 1.

[5] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, Aug. 2009.

[6] J. Bennett and S. Lanning, “The Netflix prize,” in Proc. KDD Cup Workshop, 2007, vol. 2007, Art. no. 35.

[7] N. Srebro, “Learning with matrix factorizations,” Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, MA, USA, 2004.

[8] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, “Large-scale parallel collaborative filtering for the Netflix prize,” in Algorithmic Aspects in Information and Management. Berlin, Germany: Springer, 2008, pp. 337–348.

[9] W. Tan, L. Cao, and L. Fong, “Faster and cheaper: Parallelizing large-scale matrix factorization on GPUs,” in Proc. 25th ACM Int. Symp. High-Performance Parallel Distrib. Comput., 2016, pp. 219–230.

[10] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, “Scalable coordinate descent approaches to parallel matrix factorization for recommender systems,” in Proc. IEEE 12th Int. Conf. Data Mining, 2012, pp. 765–774.

[11] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon, “Parallel matrix factorization for recommender systems,” Knowl. Inf. Syst., vol. 41, no. 3, pp. 793–819, 2014.

[12] C.-J. Hsieh and I. S. Dhillon, “Fast coordinate descent methods with variable selection for non-negative matrix factorization,” in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 1064–1072.

Fig. 17. Multi-GPU: two-GPU cyclic update.



[13] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proc. Advances Neural Inf. Process. Syst., 2001, pp. 556–562.

[14] W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin, “A fast parallel stochastic gradient method for matrix factorization in shared memory systems,” ACM Trans. Intell. Syst. Technol., vol. 6, no. 1, 2015, Art. no. 2.

[15] W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin, “A learning-rate schedule for stochastic gradient methods to matrix factorization,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2015, pp. 442–455.

[16] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix factorization with distributed stochastic gradient descent,” in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 69–77.

[17] C. Teflioudi, F. Makari, and R. Gemulla, “Distributed matrix completion,” in Proc. IEEE 12th Int. Conf. Data Mining, 2012, pp. 655–664.

[18] F. Niu, B. Recht, C. Re, and S. Wright, “HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent,” in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 693–701.

[19] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Application of dimensionality reduction in recommender system - A case study,” Dept. of Computer Science, Univ. of Minnesota, Minneapolis, 2000, http://www.dtic.mil/docs/citations/ADA439541

[20] M.-F. Weng and Y.-Y. Chuang, “Collaborative video reindexing via matrix factorization,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 8, no. 2, 2012, Art. no. 23.

[21] Z. Zheng, H. Ma, M. R. Lyu, and I. King, “Collaborative web service QoS prediction via neighborhood integrated matrix factorization,” IEEE Trans. Services Comput., vol. 6, no. 3, pp. 289–299, Jul.–Sep. 2013.

[22] D. Lian, C. Zhao, X. Xie, G. Sun, E. Chen, and Y. Rui, “GeoMF: Joint geographical modeling and matrix factorization for point-of-interest recommendation,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, pp. 831–840.

[23] J. Langford, M. Zinkevich, and A. J. Smola, “Slow learners are fast,” in Proc. Advances Neural Inf. Process. Syst., 2009, pp. 2331–2339.

[24] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized stochastic gradient descent,” in Proc. Advances Neural Inf. Process. Syst., 2010, pp. 2595–2603.

[25] A. Agarwal and J. C. Duchi, “Distributed delayed stochastic optimization,” in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 873–881.

[26] H. Yun, H.-F. Yu, C.-J. Hsieh, S. Vishwanathan, and I. Dhillon, “NOMAD: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion,” Proc. VLDB Endowment, vol. 7, no. 11, pp. 975–986, 2014.

[27] J. Jin, S. Lai, S. Hu, J. Lin, and X. Lin, “GPUSGD: A GPU-accelerated stochastic gradient descent algorithm for matrix factorization,” Concurrency Comput.: Practice Experience, vol. 28, pp. 3844–3865, 2015.

[28] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, “GPU computing,” Proc. IEEE, vol. 96, no. 5, pp. 879–899, May 2008.

[29] G. Pratx and L. Xing, “GPU computing in medical physics: A review,” Med. Physics, vol. 38, no. 5, pp. 2685–2697, 2011.

[30] P. Guo, L. Wang, and P. Chen, “A performance modeling and optimization analysis tool for sparse matrix-vector multiplication on GPUs,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 5, pp. 1112–1123, May 2014.

[31] K. Li, W. Yang, and K. Li, “Performance analysis and optimization for SpMV on GPU using probabilistic modeling,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 1, pp. 196–205, Jan. 2015.

[32] Z. Gao, Y. Liang, and Y. Jiang, “Implement of item-based recommendation on GPU,” in Proc. IEEE 2nd Int. Conf. Cloud Comput. Intell. Syst., 2012, vol. 2, pp. 587–590.

[33] K. Kato and T. Hosino, “Singular value decomposition for collaborative filtering on a GPU,” in Proc. IOP Conf. Series: Mater. Sci. Eng., 2010, pp. 012–017.

[34] A. Cano and C. Garcia-Martinez, “100 million dimensions large-scale global optimization using distributed GPU computing,” in Proc. IEEE Congr. Evol. Comput., 2016, pp. 3566–3573.

[35] W. Yang, K. Li, Z. Mo, and K. Li, “Performance optimization using partitioned SpMV on GPUs and multicore CPUs,” IEEE Trans. Comput., vol. 64, no. 9, pp. 2623–2636, Sep. 2015.

[36] J. A. Stuart and J. D. Owens, “Multi-GPU MapReduce on GPU clusters,” in Proc. IEEE Int. Parallel Distrib. Process. Symp., 2011, pp. 1068–1079.

[37] C. Chen, K. Li, A. Ouyang, and K. Li, “GFlink: An in-memory computing architecture on heterogeneous CPU-GPU clusters for big data,” in Proc. IEEE 45th Int. Conf. Parallel Process., 2016, pp. 542–551.

[38] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. Int. Conf. Comput. Statist., 2010, pp. 177–186.

[39] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Proc. Advances Neural Inf. Process. Syst., 2013, pp. 315–323.

[40] T. Zhang, “Solving large scale linear prediction problems using stochastic gradient descent algorithms,” in Proc. 21st Int. Conf. Mach. Learn., 2004, Art. no. 116.

[41] N. L. Roux, M. Schmidt, and F. R. Bach, “A stochastic gradient method with an exponential convergence rate for finite training sets,” in Proc. Advances Neural Inf. Process. Syst., 2012, pp. 2663–2671.

[42] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochastic variance reduction for nonconvex optimization,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 314–323.

[43] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM J. Optimization, vol. 19, no. 4, pp. 1574–1609, 2009.

[44] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Berlin, Germany: Springer, 2013.

[45] NVIDIA, “CUDA C Programming Guide, v7.,” Oct. 2015.

[46] S. D. Babacan, M. Luessi, R. Molina, and A. K. Katsaggelos, “Sparse Bayesian methods for low-rank matrix estimation,” IEEE Trans. Signal Process., vol. 60, no. 8, pp. 3964–3977, Aug. 2012.

[47] J. Ye, J.-H. Chow, J. Chen, and Z. Zheng, “Stochastic gradient boosted distributed decision trees,” in Proc. 18th ACM Conf. Inf. Knowl. Manage., 2009, pp. 2061–2064.

[48] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, 2008.

[49] E. Hazan, “Introduction to online convex optimization,” Found. Trends Optimization, vol. 2, no. 3/4, pp. 157–325, 2016.

Hao Li is currently working toward the PhD degree at Hunan University, China. His research interests mainly include large-scale sparse matrix and tensor factorization, recommender systems, social networks, data mining, machine learning, and GPU and multi-GPU computing.

Kenli Li received the PhD degree in computer science from the Huazhong University of Science and Technology, China, in 2003. He was a visiting scholar with the University of Illinois at Urbana-Champaign from 2004 to 2005. He is currently a full professor of computer science and technology with Hunan University and deputy director of the National Supercomputing Center in Changsha. His major research areas include parallel computing, high-performance computing, grid, and cloud computing. He has published more than 130 research papers in international conferences and journals, such as the IEEE Transactions on Computers, the IEEE Transactions on Parallel and Distributed Systems, the IEEE Transactions on Signal Processing, the Journal of Parallel and Distributed Computing, ICPP, and CCGrid. He is an outstanding member of the CCF. He serves on the editorial board of the IEEE Transactions on Computers. He is a senior member of the IEEE.



Jiyao An received the PhD degree in mechanical engineering from Hunan University, China, in 2012. He was a visiting scholar in the Department of Applied Mathematics, University of Waterloo, Ontario, Canada. He is currently a full professor in the College of Computer Science and Electronic Engineering, Hunan University, Changsha, China. His research interests include cyber-physical systems (CPS), Takagi-Sugeno fuzzy systems, parallel and distributed computing, and computational intelligence. He has published more than 50 papers in international and domestic journals and refereed conference proceedings. He is an active reviewer for international journals. He is a member of the IEEE and the ACM, and a senior member of the CCF.

Keqin Li is a SUNY distinguished professor of computer science. His current research interests include parallel computing and high-performance computing, distributed computing, energy-efficient computing and communication, heterogeneous computing systems, cloud computing, big data computing, CPU-GPU hybrid and cooperative computing, multicore computing, and storage and file systems. He has published more than 480 journal articles, book chapters, and refereed conference papers, and has received several best paper awards. He is currently serving or has served on the editorial boards of the IEEE Transactions on Parallel and Distributed Systems, the IEEE Transactions on Computers, the IEEE Transactions on Cloud Computing, the IEEE Transactions on Services Computing, and the IEEE Transactions on Sustainable Computing. He is a fellow of the IEEE.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.
