
HybridSVD: When Collaborative Information is Not Enough

Evgeny Frolov
Skolkovo Institute of Science and Technology, Moscow, Russia
[email protected]

Ivan Oseledets∗
Skolkovo Institute of Science and Technology, Moscow, Russia
[email protected]

∗ Also with Institute of Numerical Mathematics, Russian Academy of Sciences.

ABSTRACT
We propose a new hybrid algorithm that allows incorporating both user and item side information within the standard collaborative filtering technique. One of its key features is that it naturally extends a simple PureSVD approach and inherits its unique advantages, such as a highly efficient Lanczos-based optimization procedure, simplified hyper-parameter tuning and a quick folding-in computation for generating recommendations instantly even in highly dynamic online environments. The algorithm utilizes a generalized formulation of the singular value decomposition, which adds flexibility to the solution and allows imposing the desired structure on its latent space. Conveniently, the resulting model also admits an efficient and straightforward solution for the cold start scenario. We evaluate our approach on a diverse set of datasets and show its superiority over similar classes of hybrid models.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Collaborative Filtering; Matrix Factorization; Hybrid Recommenders; Cold Start; Top-N Recommendation; PureSVD

ACM Reference Format:
Evgeny Frolov and Ivan Oseledets. 2019. HybridSVD: When Collaborative Information is Not Enough. In Thirteenth ACM Conference on Recommender Systems (RecSys '19), September 16–20, 2019, Copenhagen, Denmark. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3298689.3347055

1 INTRODUCTION
Singular value decomposition (SVD) [24] is a well-established and useful tool with a wide range of applications in various domains of information retrieval, natural language processing, and data analysis. The first SVD-based collaborative filtering (CF) models were proposed in the late 90's and early 00's and were successfully adapted to a wide range of tasks [10, 28, 43]. It has also anticipated active research devoted to alternative matrix factorization (MF) techniques [30]. However, despite the development of many sophisticated and accurate methods, the simplest SVD-based approach called PureSVD was later shown to outperform other algorithms in standard top-n recommendation tasks [18].

PureSVD offers a unique set of practical advantages, such as global convergence with deterministic output, lightweight hyper-parameter tuning via simple rank truncation, an analytical expression for instant online recommendations (see Section 3.3), and scalable modifications based on randomized algorithms [26]. It has highly optimized implementations in many programming languages based on BLAS and LAPACK routines and is included in many modern machine learning libraries and frameworks, some of which can handle nearly billion-scale problems¹.

However, like any other conventional collaborative filtering technique, PureSVD relies solely on the knowledge about user preferences expressed in the form of ratings, likes, purchases or other types of feedback, either explicit or implicit. On the other hand, a user's choice may be influenced by intrinsic properties of items. For example, users may prefer products of a particular category/brand or products with certain characteristics. Similarly, additional knowledge about users, such as demographic information or occupation, may also help to explain their choice.

These influencing factors are typically assumed to be well represented by the model's latent features. However, in situations when users interact with a small number of items from a large assortment (e.g., in online stores), it may become difficult to build reliable models from the observed behavior without considering side information. This additional knowledge may also help in the extreme cold start [21] case, when preference data for some items and/or users are not available at all.

Addressing the described problems of insufficient preference data has led to the development of hybrid models [12]. A plethora of ways for building hybrid recommenders has been explored to date, and a vast amount of work is devoted specifically to incorporating side information into MF methods (more on this in Section 6). Surprisingly, despite many of its advantages, the PureSVD model has received much less attention in this regard. To the best of our knowledge, there have been no attempts to develop an integrated hybrid approach in which interaction data and side information are jointly factorized with the help of SVD, and the obtained result is used as an end model for generating recommendations immediately, as in the PureSVD case.

With this work, we aim to fill that gap and extend the family of hybrid algorithms with a new approach based on a modification of PureSVD. The main contributions of this paper are:

• We generalize PureSVD by allowing it to incorporate and to jointly factorize side information along with collaborative data. We call this approach HybridSVD.

¹ https://github.com/criteo/Spark-RSVD

arXiv:1802.06398v4 [cs.LG] 13 Aug 2019


• We propose an efficient computational scheme that does not break the simplicity of the algorithm, requires minimal effort in parameter tuning, and leads to only a minor increase in overall complexity, governed by the structure of the input data.
• We introduce a simple yet surprisingly capable approach for the cold start scenario, which can be implemented for both the HybridSVD and PureSVD models.

The rest of this paper is organized as follows. We walk through SVD internals in Section 2 and show what exactly leads to the limitations of its standard implementation. In Section 3 we demonstrate how to remove these limitations by switching to an algebraic formulation of the generalized SVD. Section 4 presents the evaluation methodology and model selection process, and the results are reported in Section 5. Section 6 describes the related research and Section 7 concludes the work.

2 UNDERSTANDING SVD LIMITATIONS
One of the greatest advantages of many MF methods is their flexibility in terms of problem formulation. It allows specifying very fine-grained models that incorporate both interaction data and additional knowledge within a single optimization objective (some of these methods are described in Sec. 6). It often results in the formation of a latent feature space with a particular inner structure, controlled by side information.

This technique, however, is not available off-the-shelf in the PureSVD approach due to its classical formulation with a fixed least squares objective. In this work, we aim to find a new way to formulate the optimization problem so that, while staying within the computational paradigm of SVD, it allows us to account for additional sources of information during the optimization process.

2.1 When PureSVD does not work
Consider the following simple example with fictitious interaction data depicted in Table 1. Initially, we have 3 users (Alice, Bob and Carol) and 5 items, with only the first 4 items observed in interactions (the first 3 rows and 4 columns of the table). The item in the last column (Item5) plays the role of a cold start (i.e., previously unobserved) item. We use this toy data to build a PureSVD model of rank 2 and generate recommendations for a new user Tom (the New user row in the table), who has already interacted with Item1, Item4, and Item5.

Suppose that in addition to the interaction data we know that Item3 and Item5 are more similar to each other than to other items in terms of some set of features. In that case, since Tom has expressed an interest in Item5, it seems natural to expect a sound recommender system to favor Item3 over Item2 in its recommendations. This, however, does not happen with PureSVD.

With the help of the folding-in technique [21] (also see Eq. (11)) it can be easily verified that SVD assigns uniform scores to both items, as shown in the PureSVD row at the bottom of the table. This example illustrates a general limitation of the PureSVD approach related to the lack of preference data, which cannot be resolved without considering side information (or unless more data is collected). In contrast, our approach assigns a higher score to Item3 (see the Our approach row in Table 1 as an example), which reflects the relation between Item3 and Item5.

Table 1: An example of the insufficient preference data problem

                   Item1   Item2   Item3   Item4   Item5
Observed interactions
  Alice              1       1       1
  Bob                        1       1       1
  Carol                      1       1
New user
  Tom                1       ?       ?       1       1

PureSVD:                    0.3     0.3
Our approach:               0.1     0.6

Table note: Item3 and Item5 are similar to each other in terms of side features. PureSVD fails to take that into account and assigns uniform scores to Item2 and Item3 for Tom. Our approach favors Item3, as it is similar to Tom's previous choice. The code to reproduce this result can be found at https://gist.github.com/Evfro/c6ec2b954adfff6aaa356f9b3124b1d2.

2.2 Why PureSVD does not work
A formal explanation of the observed result requires an understanding of how exactly computations are performed in the SVD algorithm. A rigorous mathematical analysis is performed by the authors of the EIGENREC model [36]. As the authors note, the latent factor model of PureSVD can be viewed as an eigendecomposition of a scaled user-based or item-based cosine similarity matrix. We would therefore expect it to depend on scalar products between rows and columns of the user-item interactions matrix.

Indeed, in the user-based case, it solves an eigendecomposition problem for the Gram matrix

G = R R⊤,  (1)

where R ∈ ℝ^{M×N} is a matrix of interactions between M users and N items with unknown elements imputed by zeros (as required by PureSVD). The elements of G are defined by scalar products of the corresponding rows of R that represent user preferences:

g_ij = r_i⊤ r_j.  (2)

This observation immediately suggests that any cross-item relations are simply ignored by PureSVD, as it takes only item co-occurrence into account. That is, the contribution of a particular item to the user similarity score g_ij is counted only if the item is present in the preferences of both user i and user j. A similar conclusion also holds for the item-based case. This explains the uniform scores assigned by PureSVD in our fictitious example in Table 1.
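To make this concrete, here is a minimal sketch (our illustration, not the authors' published gist) that reproduces the Table 1 effect with NumPy, assuming the toy-data layout reconstructed in the table above:

```python
# Minimal sketch of the Table 1 example: a rank-2 PureSVD cannot
# distinguish Item2 from Item3 for Tom, since their columns coincide.
import numpy as np

# Rows: Alice, Bob, Carol; columns: Item1..Item4 (Item5 is cold and has
# no column in the interaction matrix).
R = np.array([[1., 1., 1., 0.],
              [0., 1., 1., 1.],
              [0., 1., 1., 0.]])

# PureSVD of rank 2: unknown entries are already imputed with zeros.
_, _, Vt = np.linalg.svd(R, full_matrices=False)
V = Vt[:2].T                     # item factors, shape (4, 2)

# Folding-in (see Eq. (11)): Tom interacted with Item1 and Item4;
# Item5 has no column in R and cannot contribute.
p = np.array([1., 0., 0., 1.])
scores = V @ (V.T @ p)
print(scores[1], scores[2])      # both approximately 0.31: uniform scores
```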

3 PROPOSED MODEL
In order to account for cross-entity relations, we have to find a different similarity measure that would consider all possible pairs of entities and allow injecting side information. This can be achieved by replacing the scalar product in Eq. (2) with a bilinear form:

g_ij = r_i⊤ S r_j,  (3)

where the symmetric matrix S ∈ ℝ^{N×N} reflects auxiliary relations between items based on side information. Effectively, this matrix creates "virtual" links between users even if they have no common items in their preferences. Occasional links will be filtered out by dimensionality reduction, whereas more frequent ones will help to reveal valuable consumption patterns.

In a similar fashion we can introduce a matrix K ∈ ℝ^{M×M} to incorporate user-related information. We will use the term side similarity to denote these matrices. Their off-diagonal entries encode similarities between users or items based on side information, such as user attributes or item features. We discuss the properties of such matrices and possible ways of constructing them in Sec. 3.2.

Our primary goal is to bring them together into a joint problem formulation with a single solution based on SVD.

3.1 HybridSVD
Replacing scalar products with bilinear forms as per Eq. (3) for all users and all items generates the following matrix cross-products:

R S R⊤,  R⊤ K R,  (4)

where R is the same as in PureSVD. By analogy with standard SVD, this induces the following eigendecomposition problem:

R S R⊤ = U Σ² U⊤,
R⊤ K R = V Σ² V⊤,  (5)

where matrices U ∈ ℝ^{M×k} and V ∈ ℝ^{N×k} represent embeddings of users and items onto a common k-dimensional latent feature space. This system of equations has a close connection to the Generalized SVD [24] and can be solved via the standard SVD of an auxiliary matrix R̂ [1] of the form:

R̂ ≡ K^{1/2} R S^{1/2} = Û Σ V̂⊤,  (6)

where the matrices Û, Σ and V̂ represent singular triplets of R̂, similarly to the PureSVD model. Connections between the original and the auxiliary latent space for both users and items are established by U = K^{−1/2} Û and V = S^{−1/2} V̂ respectively.

We note, however, that finding the square root of an arbitrary matrix and its inverse are computationally intensive operations. To overcome that difficulty we will assume for now that the matrices K and S are symmetric positive definite (more details on that in Sec. 3.2) and can therefore be represented in the Cholesky decomposition form: S = L_S L_S⊤, K = L_K L_K⊤, where L_S and L_K are lower triangular real matrices. This decomposition can be computed much more efficiently than a matrix square root.

By a direct substitution, it can be verified that the following auxiliary matrix

R̂ ≡ L_K⊤ R L_S = Û Σ V̂⊤  (7)

also provides a solution to Eq. (5) and can be used to replace Eq. (6) with its expensive square root computation. The connection between the auxiliary latent space and the original one now becomes

U = L_K^{−⊤} Û,  V = L_S^{−⊤} V̂.  (8)

As can be seen, the matrix R̂ "absorbs" the additional relations encoded by the matrices K and S simultaneously. This shows that solving the joint problem in Eq. (5) is as simple as computing the standard SVD of the auxiliary matrix from Eq. (7). We call this model HybridSVD.

The orthogonality of the singular vectors, Û⊤Û = V̂⊤V̂ = I, combined with Eq. (8) also sheds light on the effect of side information on the structure of the hybrid latent space, revealing its central property:

U⊤ K U = V⊤ S V = I,  (9)

i.e., the columns of the matrices U and V are orthogonal under the constraints imposed by the matrices K and S respectively².
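As an illustration of Eqs. (7)–(9), the following dense toy sketch (ours; all names and sizes are arbitrary) factorizes the auxiliary matrix and checks the K- and S-orthogonality property:

```python
# Dense toy sketch of HybridSVD: build the auxiliary matrix L_K^T R L_S,
# compute its SVD, recover the original factors via triangular solves,
# and verify Eq. (9).
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
M, N, k = 6, 5, 3
R = (rng.random((M, N)) < 0.4).astype(float)   # toy binary interactions

# Symmetric positive definite side similarities (Eq. (10) with small alpha
# and all-ones similarity matrices, purely for illustration).
K = 0.9 * np.eye(M) + 0.1 * np.ones((M, M))
S = 0.9 * np.eye(N) + 0.1 * np.ones((N, N))

L_K = cholesky(K, lower=True)                  # K = L_K L_K^T
L_S = cholesky(S, lower=True)                  # S = L_S L_S^T

R_aux = L_K.T @ R @ L_S                        # Eq. (7)
U_hat, sigma, Vt_hat = np.linalg.svd(R_aux, full_matrices=False)
U_hat, V_hat = U_hat[:, :k], Vt_hat[:k].T

# Eq. (8): triangular solves instead of explicit inverses.
U = solve_triangular(L_K.T, U_hat)             # U = L_K^{-T} U_hat
V = solve_triangular(L_S.T, V_hat)             # V = L_S^{-T} V_hat

# Eq. (9): columns are orthonormal under K and S.
assert np.allclose(U.T @ K @ U, np.eye(k))
assert np.allclose(V.T @ S @ V, np.eye(k))
```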

3.2 Side similarity
In order to improve the quality of collaborative models, side similarity matrices must capture representative structure from side information. Constructing such matrices is a domain-specific task, and there are numerous ways of achieving this, ranging from a direct application of various similarity measures over the feature space [32] to more specialized techniques based on kernel methods, Graph Laplacians, matrix factorization, and deep learning [13].

Independently of the way similarity matrices are obtained, we impose the following structure on them:

S = (1 − α) I + α Z,
K = (1 − β) I + β Y,  (10)

where Z and Y are real symmetric matrices of conforming size, whose entries encode actual feature-based similarities and take values in the [−1, 1] interval. The coefficients α, β ∈ [0, 1] are free model parameters determined empirically. Adjusting these coefficients changes the balance between collaborative and side information: higher values put more emphasis on side features, whereas lower values gravitate the model towards simple co-occurrence patterns.

Note that in the case of α = β = 0 the model turns back into PureSVD. At the other extreme, when α = β = 1, the model would heavily rely on feature-based similarities, which can be an undesired behavior. Indeed, if the underlying structure of side features is less expressive than that of collaborative information (which is not a rare case), the model is likely to suffer from overspecialization. Hence, the values of α and β will most often be lower than 1. Moreover, setting them below a certain threshold³ would ensure positive definiteness of S and K, giving a unique Cholesky decomposition.

In this work, we explore one of the most direct possible approaches for constructing similarity matrices. We transform categorical side features (e.g., movie genre or product brand, see Table 2) into one-hot encoded vectors and compute the Common Neighbors similarity [32] scaled by a normalization constant. More specifically, in the item case, we form a sparse binary matrix F ∈ {0, 1}^{N×f} whose rows are one-hot feature vectors: if a feature belongs to an item, then the corresponding entry in the row is 1, otherwise 0. The feature-based similarity matrix is then computed as Z = (1/m) F F⊤, where m denotes the largest element of F F⊤. We additionally enforce the diagonal elements of Z to be all ones: Z ← Z + I − diag(diag Z).
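For illustration, here is a small sketch (ours; the feature matrix is hypothetical) of this construction:

```python
# Sketch of the Sec. 3.2 construction: scaled Common Neighbors similarity
# from one-hot item features, followed by the mixing of Eq. (10).
import numpy as np
import scipy.sparse as sp

# Hypothetical one-hot item-feature matrix F: 4 items x 3 features.
F = sp.csr_matrix([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 0],
                   [0, 1, 1]], dtype=float)

cooc = (F @ F.T).toarray()       # counts of shared features per item pair
Z = cooc / cooc.max()            # scale by the largest element of F F^T
np.fill_diagonal(Z, 1.0)         # Z <- Z + I - diag(diag Z)

alpha = 0.5
S = (1 - alpha) * np.eye(Z.shape[0]) + alpha * Z   # Eq. (10)
```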

3.3 Efficient computations
Cholesky Decomposition. Sparse feature representation allows having sparse matrices K and S. This can be exploited for computing a sparse Cholesky decomposition or, even better, an incomplete Cholesky decomposition [24], which additionally allows skipping negligibly small similarity values. The corresponding triangular parts of the Cholesky factors will be sparse as well. Moreover, there is no need to explicitly calculate the inverses of L_S⊤ and L_K⊤ in Eq. (8), as it only requires solving the corresponding triangular systems of equations, which can be performed very efficiently [24].

² This property is called K- and S-orthogonality [48].
³ An upper bound can be estimated from the matrix diagonal dominance condition [24].


Furthermore, Cholesky decomposition is fully deterministic and admits symbolic factorization. This feature makes tuning HybridSVD even more efficient. Indeed, changing the values of α and β in the (0, 1) interval does not affect the sparsity structure of the matrices in Eq. (10). Therefore, once L_S and L_K are computed for some values of α and β, the resulting symbolic sparsity pattern can be reused later to perform quick calculations with other values of these coefficients. We used the CHOLMOD library [15] for that purpose, available in the scikit-sparse package⁴. Computing the sparse Cholesky decomposition was orders of magnitude faster than computing the model itself.
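A hedged sketch of this symbolic reuse, assuming the sksparse.cholmod API (analyze for the symbolic step, Factor.cholesky for numeric refactorization); note that CHOLMOD factorizes a fill-reducing permutation of the matrix, which must be tracked if the triangular factor is used explicitly:

```python
# Sketch: reuse one symbolic analysis across several values of alpha.
# The toy feature matrix and sizes are arbitrary.
import scipy.sparse as sp
from sksparse.cholmod import analyze

n = 1000
F = sp.random(n, 50, density=0.05, random_state=0, format="csr")
F.data[:] = 1.0                              # toy binary feature matrix

cooc = (F @ F.T).tocsc()
Z = cooc / cooc.max()                        # scaled Common Neighbors
Z = (Z + sp.diags(1.0 - Z.diagonal())).tocsc()  # force unit diagonal
I = sp.identity(n, format="csc")

symbolic = analyze((I + Z).tocsc())          # symbolic step, done once
for alpha in (0.1, 0.3, 0.5):
    S = ((1 - alpha) * I + alpha * Z).tocsc()
    factor = symbolic.cholesky(S)            # cheap numeric refactorization
```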

In general, however, side features do not have to be categorical or even sparse. For example, if item features are represented by compact embeddings encoded as rows of a dense matrix F ∈ ℝ^{N×f} with f ≪ N, one can utilize the "identity plus low rank" structure of the similarity matrices and apply fast symmetric factorization techniques [6] to efficiently compute matrix roots. The inverses can then be computed via the Sherman–Morrison–Woodbury formula. We leave the investigation of this alternative for future research.

Computing SVD. There is also no need to directly compute the product of the Cholesky factors and the rating matrix in Eq. (7), which could potentially lead to a dense result. Similarly to the standard SVD, assuming the rank of decomposition k ≪ min(M, N), one can exploit the Lanczos procedure [24, 31], which invokes a Krylov subspace method. The key benefit of such an approach is that in order to find the k principal singular triplets it is sufficient to provide a rule for multiplying the auxiliary matrix from Eq. (7) by an arbitrary dense vector from the left and from the right. This can be implemented as a sequence of three matrix-vector multiplications. Hence, the added computational cost of the algorithm over standard SVD is controlled by the complexity of multiplying the Cholesky factors by a dense vector.

More specifically, given the number of non-zero elements nnz(R) of R, which corresponds to the number of observed interactions, the overall computational complexity of the HybridSVD algorithm is O(nnz(R)·k) + O((M + N)·k²) + O((J_K + J_S)·k), where the first two terms correspond to PureSVD's complexity and the last term depends on the complexities J_K and J_S of multiplying the matrices L_K and L_S by a dense vector. In our case J_K and J_S are determined by the numbers of non-zero elements nnz(L_K) and nnz(L_S) of the sparse Cholesky factors. Therefore, the total complexity amounts to O(nnz_tot·k) + O((M + N)·k²), where nnz_tot = nnz(R) + nnz(L_K) + nnz(L_S).
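A sketch (our illustration) of this matrix-free computation with SciPy's Lanczos-based svds; identity matrices stand in for the actual Cholesky factors:

```python
# Sketch: the auxiliary matrix of Eq. (7) is never formed explicitly;
# only matrix-vector products with it are supplied to the sparse SVD solver.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

M, N, k = 500, 400, 10
R = sp.random(M, N, density=0.02, random_state=0, format="csr")
L_K = sp.identity(M, format="csr")     # stand-ins for the Cholesky factors;
L_S = sp.identity(N, format="csr")     # in HybridSVD these are sparse
                                       # lower triangular matrices

def matvec(x):                         # (L_K^T R L_S) x
    return L_K.T @ (R @ (L_S @ x))

def rmatvec(y):                        # (L_K^T R L_S)^T y
    return L_S.T @ (R.T @ (L_K @ y))

A = LinearOperator((M, N), matvec=matvec, rmatvec=rmatvec, dtype=np.float64)
U_hat, sigma, Vt_hat = svds(A, k=k)    # Lanczos-style truncated SVD
```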

Generating recommendations. One of the greatest advantages of the SVD-based approach is the analytical form of folding-in [21]. In contrast to many other MF methods, it is stable and does not require any additional optimization steps to calculate recommendations in the warm start regime. This makes SVD-based models especially suitable for highly dynamic online environments. For example, in the case of a new "warm" user with a vector of known preferences p, the vector of predicted item relevance scores r reads:

r ≈ VV⊤ p, (11)

where V corresponds to the right singular vectors of the PureSVD model. Moreover, the approximation turns into a strict equality in the case of a known user [18]. Therefore, it provides a single solution for generating recommendations for both known and new users.

⁴ https://github.com/scikit-sparse/scikit-sparse

This result can also be generalized to the hybrid case. Following the same folding-in idea and taking into account the special structure of the latent space given by Eq. (8), one arrives at the following expression for the vector r of predicted item scores:

r ≈ L_S^{−⊤} V̂ V̂⊤ L_S⊤ p = V_l V_r⊤ p,  (12)

where V_l = L_S^{−⊤} V̂ and V_r = L_S V̂. As with PureSVD, it is suitable for predicting the preferences of both known and warm-start users.

There is no K matrix in Eq. (12) due to the nature of the folding-in approximation. If needed, one can put more emphasis on user features by combining the folding-in vector with the cold start representation (see Sec. 3.4). We, however, do not investigate this option and leave it for future work. In our setup, described in Sec. 4, only item features are available. Hence, no modification is required.
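A minimal sketch (ours; dense factors, hypothetical names) of the hybrid folding-in of Eq. (12), using a triangular solve in place of the explicit inverse:

```python
# Sketch of Eq. (12): r = L_S^{-T} V_hat V_hat^T L_S^T p, where V_hat holds
# the right singular vectors of the auxiliary matrix of Eq. (7).
import numpy as np
from scipy.linalg import solve_triangular

def hybrid_folding_in(V_hat, L_S, p):
    y = V_hat @ (V_hat.T @ (L_S.T @ p))   # V_hat V_hat^T L_S^T p
    return solve_triangular(L_S.T, y)     # applies L_S^{-T} without inversion

# In practice, V_l = L_S^{-T} V_hat and V_r = L_S V_hat can be precomputed
# once, reducing online folding-in to two dense matrix-vector products.
```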

3.4 Cold start settings
Unlike the warm start, in the cold start settings information about preferences is not known at all. Handling this situation depends on the ability to translate known user attributes and item characteristics into the latent feature space. We will focus on the item cold start scenario; user cold start can be managed in a similar fashion. The task is to find, among known users, the ones who would be most interested in a new cold item. Hence, in order to utilize the standard prediction scheme via scalar products of latent vectors, one needs to obtain the latent features of the cold item.

We propose to solve the task in two steps. We start by finding a linear map W between the real features of known items and their latent representations. We define the corresponding problem as

VW = F, (13)

where the rows of F encode item features. This step is performed right after the model is computed, and the result is stored for later computations. Secondly, once the mapping W is obtained, we use it to transform any cold item represented by its feature vector f into the corresponding latent feature vector v by solving the linear system:

W⊤v = f . (14)

Finally, the prediction vector can be computed as

r = UΣv = RVv. (15)

The issue with Eq. (13) is that solving it for W can be a challenging task. Hence, many hybrid methods incorporate the feature mapping into the main optimization objective, which allows solving the two problems simultaneously. However, the orthogonality property of HybridSVD defined by Eq. (9) admits a direct solution to Eq. (13):

W = V⊤ S F.  (16)

This result also provides a solution for PureSVD by setting S = I. We use this technique in our cold start experiments to verify whether HybridSVD provides any benefit over the PureSVD approach.
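The whole cold start pipeline then takes a few lines. The sketch below (ours; function and argument names are hypothetical) follows Eqs. (14)–(16), using least squares for the generally overdetermined system in Eq. (14):

```python
# Sketch of Sec. 3.4: map a cold item's features into the latent space and
# score known users. V: item factors (N x k), S: side similarity (N x N),
# F: item feature matrix (N x f), f_new: feature vector of the cold item.
import numpy as np

def cold_item_scores(R, V, S, F, f_new):
    W = V.T @ (S @ F)                                # Eq. (16), shape (k, f)
    v, *_ = np.linalg.lstsq(W.T, f_new, rcond=None)  # Eq. (14): W^T v = f
    return R @ (V @ v)                               # Eq. (15): r = R V v
```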

3.5 Matrix scaling
According to the EIGENREC model [36], a simple scaling trick

R ← R D^{d−1},  (17)

where D = diag{∥r₁∥, . . . , ∥r_N∥} contains the Euclidean norms of the columns r_i of R, can significantly improve the quality of PureSVD. In the default setting the scaling factor d is equal to 1, giving the standard model. Varying its value governs popularity biases: higher values emphasize the significance of popular items, whereas lower values increase sensitivity to rare ones. Empirically, values slightly below 1 performed best. In our experiments, we employ this technique for both HybridSVD and PureSVD and report results in addition to the original non-scaled models.
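A sketch (our illustration) of this scaling trick for a sparse R:

```python
# Sketch of Eq. (17): rescale the columns of R by their Euclidean norms
# raised to the power d - 1 (d = 1 recovers the unscaled model).
import numpy as np
import scipy.sparse as sp

def scale_columns(R, d=0.4):
    norms = np.sqrt(np.asarray(R.multiply(R).sum(axis=0))).ravel()
    norms[norms == 0] = 1.0              # guard against empty columns
    return R @ sp.diags(norms ** (d - 1))
```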

4 EXPERIMENTS
We conduct two sets of experiments: a standard top-n recommendation scenario and an item cold start scenario. Every experiment starts with a hyper-parameter tuning phase with n = 10 fixed. Once the optimal parameter values are found, they are used for the final evaluation of recommendation quality over a range of values of n. Experiments were performed using the Polara framework⁵. The source code for reproducing all results is available in our public Github repository⁶.

4.1 Evaluation methodology
In the standard scenario we consecutively mark every 20% of users for testing. Each 20% partition contains only those users who have not been tested yet. We randomly withdraw a single item from every test user and collect all such items in a holdout set. After that, the test users are merged back with the remaining 80% of users and form a training set. During the evaluation phase, we generate a ranked list of top-n recommendations for every test user based on their known preferences and evaluate it against the holdout.

In the cold start scenario we perform an 80%/20% partitioning of the list of all unique items. We select items from the 20% partition and mark them as cold start items. Users with at least one cold start item in their preferences are marked as test users. Users with no items in their preferences other than cold start items are filtered out. The remaining users form a training set with all cold start items excluded. Evaluation, in that case, is performed as follows: for every cold start item, we generate a ranked list of the most pertinent users and evaluate it against one of the test users chosen randomly among those who have interacted with the item.

In both the standard and cold start experiments we conventionally perform 5-fold cross-validation and average the results, providing 95% confidence intervals based on the paired t-test criterion. The quality of recommendations is measured with the mean reciprocal rank (MRR) [29] metric. We also report the hit rate (HR) [19] and coverage, i.e., the fraction of all unique recommended entities relative to the total number of unique entities in the training set. The latter characterizes the overall diversity of recommendations.
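For reference, a sketch (ours; assuming a single holdout item per test user) of how these metrics can be computed:

```python
# Sketch of the Sec. 4.1 metrics for top-n recommendation lists.
import numpy as np

def mrr_hr(recommended, holdout):
    """recommended: (n_users, n) array of item ids; holdout: (n_users,)."""
    hits = recommended == holdout[:, None]   # boolean hit matrix
    found = hits.any(axis=1)
    ranks = hits.argmax(axis=1) + 1          # 1-based position of the hit
    mrr = np.where(found, 1.0 / ranks, 0.0).mean()
    return mrr, found.mean()                 # MRR and HR

def coverage(recommended, n_train_items):
    return np.unique(recommended).size / n_train_items
```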

4.2 Datasets
We have used 6 different publicly available datasets: MovieLens-1M (ML1M) and MovieLens-10M (ML10M) [27]; BookCrossing (BX) [52]; Amazon Electronics (AMZe) and Amazon Video Games (AMZvg) [33]; and the large-scale R2 Yahoo! Music (YaMus) dataset [50].

In all datasets, we ensure that every item has been interacted with by at least 5 users, and every user has interacted with at least 5 items. We also remove items without any side information. As we are not interested in rating prediction, only the setting with binary feedback is considered in our experiments.

⁵ https://github.com/Evfro/polara
⁶ https://github.com/Evfro/recsys19_hybridsvd

Table 2: Description of the datasets after pre-processing.

          # users   # items   nnz      side information
ML-1M     6038      3522      2.7%     genres, cast, directors, writers
ML-10M    69797     10258     0.7%     genres, cast, directors, writers
BX        7160      16273     0.18%    publishers, authors
AMZe      124895    44483     0.02%    categories, brands
AMZvg     14251     6858      0.13%    categories, brands
YaMus     183003    134059    0.1%     artists, genres, albums

We use only categorical item features for side similarity computation, following the technique described at the end of Sec. 3.2. The main dataset characteristics after data preprocessing are provided in Table 2.

Accordingly, both the Movielens and Amazon datasets are binarized with a threshold value of 4, i.e., lower ratings are removed, and the remaining ratings are set to 1. In the case of the Movielens datasets, we have extended the genres data with cast, directors and writers information crawled from the TMDB database⁷. For the Amazon datasets, we used information about the category hierarchy and brands. In order to avoid overly dense similarity matrices, we use only the lowest level of the hierarchy. In the case of the AMZvg dataset, we additionally remove the "Games" category due to its redundancy.

In the BX dataset, we select only the part with implicit feedback. We additionally filter out users with more than 1000 rated items. Information about authors and publishers provided within the dataset is used to build the side similarity matrices. In the case of the Yahoo! Music dataset, due to its size, we take only 10% of the data corresponding to the first CV split provided by Yahoo Labs. We select only interactions with rating 5 and binarize them. Side similarity is computed using information about genres, artists and albums.

4.3 Baseline algorithms
We compare our approach against several standard baseline models as well as against two hybrid models that generalize many existing techniques. Below is a brief description of each competing method.

PureSVD is a simple SVD-based model that replaces unobserved entries with zeros and then computes a truncated SVD [18]. We also adapt this model to the cold start settings, as described in Sec. 3.4. In addition to that, we implement its scaled variant according to Eq. (17). Once the best performing scaling is identified, it is used without further adjustments in HybridSVD as well. We denote the scaled models as PureSVDs and HybridSVDs.

Factorization Machines (FM) [41] encapsulate side information into interactions via a uniform one-hot encoding framework. This is one of the most general and expressive hybrid models. We use the implementation by Turi Create⁸ adapted for binary data. It implements a binary prediction objective based on a sigmoid function and includes negative sampling. The optimization task is performed by stochastic gradient descent (SGD) [11] with learning rate scheduling via ADAGRAD [20]. We also perform standard MF by disabling side feature handling in FM. We denote this approach as SGD, after the name of the optimization algorithm. It is used to verify whether FM actually benefits from utilizing side features.

⁷ https://www.themoviedb.org
⁸ https://github.com/apple/turicreate


Figure 1: Standard scenario for the Amazon Electronics (top row) and Yahoo! Music (bottom row) datasets. [Panels show MRR, HR, and COVERAGE versus top-n ∈ {1, 3, 10, 30} for the models RND, SIM, MP, PureSVD, HybridSVD, FM, LCE, SGD, PureSVDs, and HybridSVDs.] Generally, the scaled versions of the SVD-based algorithms outperform the competing methods. Side information is less important than scaling.

The Local Collective Embeddings model (LCE) [44] is another generic approach that combines the ideas of coupled or collective matrix factorization [2, 46] with Graph Laplacian-based regularization to additionally impose locality constraints. This enforces entities that are close to each other in terms of side features to remain close to each other in the latent feature space as well. We use the variant of LCE where only side features are used to form the Laplacian. We adapted an open-source implementation available online⁹.

We additionally introduce a heuristic similarity-based hybrid approach (SIM), inspired by Eq. (11), where the orthogonal projector VV⊤ can be viewed as an item similarity in the latent space. The SIM model simply replaces it with the side similarity. In the standard case, it predicts item scores as r = S p. Likewise, viewing Eq. (15) in the same sense gives r = R s for user score prediction in the cold start, where s denotes the similarity of a cold start item to the known items.

Finally, we include two non-personalized baselines that recommend either the most popular (MP) or random (RND) entities.

4.4 Hyper-parameter tuning
We test all factorization models on a wide range of rank values (i.e., the number of latent features) up to 3000. We use the MRR@10 score for selecting the optimal configuration.

The HybridSVD model is evaluated for 5 different values of α from {0.1, 0.3, 0.5, 0.7, 0.9}. Similarly to the PureSVD case and unlike other competing MF methods, once the model is computed for some rank k_max with a fixed value of α, we immediately get any model with a lower rank value k < k_max by a simple rank truncation without any additional optimization steps, as the sketch below illustrates. This significantly simplifies the hyper-parameter tuning procedure, as it eliminates the need for expensive model recomputation during the grid search.
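A sketch (ours) of this truncation shortcut for factors returned by a sparse SVD solver:

```python
# Sketch of the rank-truncation shortcut: a k_max-rank model yields any
# lower-rank model by slicing the leading singular triplets.
import numpy as np

def truncate(U, sigma, Vt, k):
    order = np.argsort(sigma)[::-1]    # e.g., scipy's svds returns ascending
    idx = order[:k]
    return U[:, idx], sigma[idx], Vt[idx]
```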

In the case of PureSVD we also perform experiments with different values of the scaling parameter d (see Sec. 3.5), set to 0.2, 0.4, and 0.6. The best performing value is then used for an additional set of experiments with HybridSVD.

⁹ https://github.com/abhishekkrthakur/LCE

Tuning of the FM model includes only the regularization coefficients for the bias terms and for the interaction terms. The number of negative samples is fixed to the default value of 4. The number of epochs is set to 25. The initial rank value is set to 100 on the Yahoo dataset and 40 on the others. Once an optimal configuration for the fixed rank is found, we perform rank optimization. The hyper-parameter space of the FM model quickly becomes infeasible to explore with increased granularity of the parameter grid, as we do not have the luxury of the simplified rank optimization available in the case of HybridSVD. In order to deal with this issue, we employ a Random Search strategy [9] and limit the number of possible hyper-parameter combinations to 60 (except on the Yahoo! Music dataset, where we set it to 30 due to very long computation times).

Tuning the LCE model is similar to FM. In all experiments, we run the algorithm for 75 epochs and use 10 nearest neighbors for generating the Graph Laplacian. The initial rank values used for tuning are the same as for FM. We first perform a grid search with regularization values in the range from 1 to 30 and with the LCE model's coefficients α and β fixed to 0.1 and 0.05 respectively. After this step we tune α and β in a range of values from the [0, 1] interval (excluding α = 0). Once the optimal configuration is determined, we finally perform rank tuning.

5 RESULTS AND DISCUSSION
Results for the standard and the cold start scenarios are depicted in Figures 1 and 2 respectively. Confidence intervals are reported as black vertical bars. For the sake of visual clarity, we do not show all the results and only report the most descriptive parts. The complete set of figures, as well as the set of optimal hyper-parameters for each model, can be found in our Github repository¹⁰.

¹⁰ See the Jupyter notebooks named View_all_results and View_optimal_parameters.


Figure 2: Cold start scenario for the BookCrossing (top row) and Yahoo! Music (bottom row) datasets. [Panels show MRR, HR, and COVERAGE in the cold start versus top-n ∈ {1, 3, 10, 30} for the models RND, MP, LCE, FM, SIM, PureSVD, PureSVDs, HybridSVD, and HybridSVDs.] In contrast to the standard case, the use of both scaling and side information improves prediction quality. The HybridSVD approach significantly outperforms all other methods. Surprisingly, even the cold start variant of PureSVD outperforms some hybrid models.

5.1 Standard scenario
Let us first consider the non-scaled versions of the algorithms. We observe that, in the standard scenario, there is no absolute winner among the hybrid recommenders. Pure collaborative models provide comparable results on 5 out of 6 datasets.

Nevertheless, HybridSVD significantly outperforms the other (non-scaled) algorithms on the YaMus dataset. On closer inspection of the baselines, we note that this is the only dataset where the SIM model gets remarkably higher relevance scores than the popularity-based model. This can probably be explained by users' strong attachment to their favorite artists or genres. Giving more weight to side features captures this effect and improves predictions, while CF models suffer from popularity biases. After proper scaling (i.e., debiasing) the effect vanishes, which is indicated by the comparable performance of PureSVDs and HybridSVDs. Surprisingly, the FM model was unable to converge to any reasonable solution on this dataset (we repeated the experiment 3 times on 3 different machines and obtained the same unsatisfactory result).

Generally, we observed a tight competition between FM, LCE, and HybridSVD. It is interesting to note that in some cases the non-SVD-based hybrid models slightly underperformed the non-hybrid PureSVD (see the ML1M and ML10M results for FM and the ML1M and BX results for LCE). In contrast, HybridSVD exhibited more reliable behavior. This can be explained by better control over the weight assigned to side information during the fine-tuning phase, effectively lowering the contribution of side features if they do not aid the learning process. Overall, HybridSVD performed better than its competitors on 3 datasets (ML1M, YaMus, and BX), LCE was better than the others on 2 datasets (ML10M and AMZe), and FM was better on AMZvg.

A significant boost to the recommendation quality of the SVD-based models is provided by the scaling trick (Sec. 3.5). On almost all datasets the top scores were achieved by the scaled versions of either PureSVD or HybridSVD. The only exception was the BX dataset, where the non-scaled version of HybridSVD outperformed the scaled version of PureSVD. The difference, however, was not statistically significant, and the top-performing model was still the scaled version of HybridSVD. Our results suggest that, in the standard scenario, debiasing the data is often more important than adding side information. The effect is especially pronounced in the AMZe and YaMus cases. This also resonates with the conclusions in [39]. As the authors argue, "even a few ratings are more valuable than metadata".

5.2 Cold start scenario
In contrast to the standard case, in the cold start scenario HybridSVD and its scaled version clearly demonstrate superior quality. The scaled version outperforms all other models on 5 out of 6 datasets; the exception is AMZvg, where all SVD-based models achieve similar scores (which are still the highest among all competitors). Even the non-scaled version of HybridSVD significantly outperforms both the LCE and FM models on almost all datasets, excluding AMZe, where FM performs better than the others.

Moreover, unlike in the standard scenario, HybridSVD also demonstrates better quality than the scaled version of PureSVD on half of the datasets (YaMus, BX, ML1M) and performs comparably well on ML10M and AMZvg. We note that this result could be anticipated by inspecting the baseline models, similarly to the standard case: the scores of both the scaled and non-scaled variants of HybridSVD are higher than those of PureSVD on the datasets where the SIM model significantly outperforms MP.

Surprisingly, and in contrast to HybridSVD, both the FM and LCE models significantly underperform even the heuristic-based SIM model on 4 datasets, excluding ML10M and AMZe. This does not happen even with the cold start adaptation of the PureSVD approach, which is not a hybrid model in the first place. The LCE model also has the worst coverage. The SVD-based models, in turn, typically provide more diverse recommendations without sacrificing recommendation quality. In general, however, the diversity of recommendations is lower than in the standard case.

We note that the performance of the HybridSVD approach is consistent, mostly favoring higher values of α (except in the AMZvg case). Unsurprisingly, in the cold start regime side features are as important as scaling, and HybridSVD makes the best use of both. Overall, the proposed HybridSVD approach provides a flexible tool to control the contribution of side features to the model's predictions and to adjust recommendations based on the meaningfulness of side information.

6 RELATED WORK
Many hybrid recommender systems have been designed and explored to date, and a great amount of work has been devoted to incorporating side information into matrix factorization algorithms in particular. We group the various hybridization approaches into several categories, depending on the particular choice of data preprocessing steps and optimization techniques.

A broad class of hybrid factorization methods maps real attributes and properties to latent features with the help of some linear transformation. A simple way to achieve this is to describe an entity as the weighted sum of the latent vectors of its characteristics [47]. In the majority of models of this class, the feature mapping is learned as part of the optimization process [14, 39, 42]. The FM model described earlier also belongs to this group. Some authors have also proposed to learn the mapping as a post-processing step using either the entire data [23] or only a representative fraction of it [17].

Alternatively, in the aggregation approach, feature-based relations are imposed on interaction data and are used for learning aggregated representations. Among the notable models of this class are Sparse Linear Methods with Side Information (SSLIM) [37]. In some cases, a simpler approach based on the augmentation technique can also help to account for additional sources of information. With this approach, features are represented as new dummy (or virtual) entities that extend the original interaction data [3, 34]. Another wide class of methods uses regularization-based techniques to enforce the proximity of entities in the latent feature space based on their actual features. Some of these models are based on probabilistic frameworks [25, 40]; others directly extend the standard MF objective with additional regularization terms [16, 35].

One of the most straightforward and well-studied regularization-based approaches is collective MF [46]. In its simplest variant, sometimes also called coupled MF [2], the parametrization of side features is constrained only by squared difference terms, similarly to the main optimization objective of standard MF [22]. There are also several variations of the coupled factorization technique where regularization is driven by side information-based similarity between users or items [8, 45] rather than by the side features themselves. As we noted in Sec. 4.3, the authors of the LCE model extend this approach further with additional Graph Laplacian-based regularization, which is, in turn, related to kernelized MF [38, 51].

All these methods are based on a general problem formulation, which provides a high level of flexibility for solving hybrid recommendation problems. On the other hand, it sacrifices many benefits of the SVD-based approach, such as global convergence guarantees, direct folding-in computation, and quick rank value tuning achieved by simple truncation of the factor matrices. Considering these advantages, the SVD-based approach has received surprisingly little attention from the hybrid systems perspective. It was shown to be a convenient intermediate tool for factorizing combined representations of feature matrices and collaborative data [7, 49]. However, to the best of our knowledge, there have been no attempts to develop an integrated hybrid SVD-based approach where interaction data and side information are jointly factorized without violating the computational paradigm of classical SVD.

It is worth mentioning that SVD generalizations have been explored in other disciplines, including studies of the genome [5], neuroimaging, signal processing, and environmental data analysis (see [4] and references therein). Moreover, the authors of [4] propose an elegant computational framework for reducing the dimensionality of structured, dense datasets without explicitly involving square roots or Cholesky factors or any of their inverses. Even though it can potentially be adapted for sparse data, it is not designed for quick online inference, which requires computing matrix roots and the corresponding inverses in any case due to Eq. (12).

7 CONCLUSIONS AND FUTURE DIRECTIONS
We have generalized PureSVD to support side information by virtually augmenting collaborative data with additional feature-based relations. This allows imposing the desired structure on the solution and, in certain cases, improves the quality of recommendations.

The model is especially suitable for the cold start regime. The orthogonality property of the singular vectors admits an easy extraction of the latent representation of side features, required in the cold start, for both PureSVD and HybridSVD. The latter, however, consistently generates predictions of higher quality. As a result, the proposed method outperforms all other competing algorithms, sometimes by a significant margin.

Conversely, in the standard case, a simple scaling trick often allows achieving superior quality even without side information. A sufficient amount of collaborative information seems to hinder the positive effect of side knowledge and render its use redundant. Nonetheless, HybridSVD generally provides much more reliable predictions across a variety of cases.

We have proposed an efficient computational scheme for both model construction and recommendation generation in online settings, including warm and cold start scenarios. The model can be further improved by replacing standard SVD with its randomized counterpart. This would not only speed up computations but would also enable support for highly distributed environments.

We have identified several directions for further research. The first one is to add both user- and item-based similarity information into the folding-in setting. One possible way is to combine folding-in vectors with the cold start approximation. Another interesting direction is to relax the strict positive definiteness constraint and also support compact, dense feature representations, which can be achieved with the help of fast symmetric factorization techniques.

ACKNOWLEDGMENTS
The work is supported by RFBR, research project No. 18-29-03187.


REFERENCES
[1] Hervé Abdi. 2007. Singular value decomposition (SVD) and generalized singular value decomposition. Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage (2007), 907–912.
[2] Evrim Acar, Tamara G Kolda, and Daniel M Dunlavy. 2011. All-at-once optimization for coupled matrix and tensor factorizations. arXiv preprint arXiv:1105.3422 (2011).
[3] Marat Akhmatnurov and Dmitry I Ignatov. 2015. Context-Aware Recommender System Based on Boolean Matrix Factorisation. In CLA. 99–110.
[4] Genevera I Allen, Logan Grosenick, and Jonathan Taylor. 2014. A generalized least-square matrix decomposition. J. Amer. Statist. Assoc. 109, 505 (2014), 145–159.
[5] Orly Alter, Patrick O Brown, and David Botstein. 2003. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proceedings of the National Academy of Sciences 100, 6 (2003), 3351–3356.
[6] Sivaram Ambikasaran, Michael O'Neil, and Karan Raj Singh. 2014. Fast symmetric factorization of hierarchical matrices with applications. arXiv preprint arXiv:1405.0223 (2014).
[7] Yusuke Ariyoshi and Junzo Kamahara. 2010. A hybrid recommendation method with double SVD reduction. In International Conference on Database Systems for Advanced Applications. Springer, 365–373.
[8] Iman Barjasteh, Rana Forsati, Farzan Masrour, Abdol-Hossein Esfahanian, and Hayder Radha. 2015. Cold-start item and user recommendation with decoupled completion and transduction. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM, 91–98.
[9] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, Feb (2012), 281–305.
[10] Daniel Billsus and Michael J Pazzani. 1998. Learning Collaborative Information Filters. In ICML, Vol. 98. 46–54.
[11] Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade. Springer, 421–436.
[12] Robin Burke. 2002. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction 12, 4 (2002), 331–370.
[13] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9 (2018), 1616–1637.
[14] Tianqi Chen, Linpeng Tang, Qin Liu, Diyi Yang, Saining Xie, Xuezhi Cao, Chunyang Wu, Enpeng Yao, Zhengyang Liu, Zhansheng Jiang, et al. 2012. Combining factorization model and additive forest for collaborative followee recommendation. KDD Cup (2012).
[15] Yanqing Chen, Timothy A Davis, William W Hager, and Sivasankaran Rajamanickam. 2008. Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Transactions on Mathematical Software (TOMS) 35, 3 (2008), 22.
[16] Yifan Chen and Xiang Zhao. 2017. Leveraging High-Dimensional Side Information for Top-N Recommendation. arXiv preprint arXiv:1702.01516 (2017).
[17] Deborah Cohen, Michal Aharon, Yair Koren, Oren Somekh, and Raz Nissim. 2017. Expediting exploration by attribute-to-feature mapping for cold-start recommendations. In Proceedings of the 11th ACM Conference on Recommender Systems. ACM, 184–192.
[18] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys '10).
[19] Mukund Deshpande and George Karypis. 2004. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 143–177.
[20] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
[21] Michael D Ekstrand, John T Riedl, Joseph A Konstan, et al. 2011. Collaborative filtering recommender systems. Foundations and Trends in Human–Computer Interaction 4, 2 (2011), 81–173.
[22] Yi Fang and Luo Si. 2011. Matrix co-factorization for recommendation with rich side information and implicit feedback. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems. ACM, 65–69.
[23] Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, and Lars Schmidt-Thieme. 2010. Learning attribute-to-feature mappings for cold-start recommendations. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 176–185.
[24] Gene H Golub and Charles F Van Loan. 2012. Matrix Computations (4th ed.). The Johns Hopkins University Press.
[25] Asela Gunawardana and Christopher Meek. 2009. A unified approach to building hybrid recommender systems. In Proceedings of the Third ACM Conference on Recommender Systems. ACM, 117–124.
[26] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53, 2 (2011), 217–288.
[27] F Maxwell Harper and Joseph A Konstan. 2016. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst. (TIIS) 5, 4 (2016), 19.
[28] Dohyun Kim and Bong-Jin Yum. 2005. Collaborative filtering based on iterative principal component analysis. Expert Systems with Applications 28, 4 (2005), 823–830.
[29] Daniel Kluver, Michael D Ekstrand, and Joseph A Konstan. 2018. Rating-based collaborative filtering: algorithms and evaluation. In Social Information Access. Springer, 344–390.
[30] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).

[31] Cornelius Lanczos. 1950. An iteration method for the solution of the eigenvalueproblem of linear differential and integral operators. United States Governm. PressOffice Los Angeles, CA.

[32] Linyuan Lü and Tao Zhou. 2011. Link prediction in complex networks: A survey.Physica A: statistical mechanics and its applications 390, 6 (2011), 1150–1170.

[33] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel.2015. Image-based recommendations on styles and substitutes. In Proceedingsof the 38th International ACM SIGIR Conference on Research and Development inInformation Retrieval. ACM, 43–52.

[34] Amir Hossein Nabizadeh, Alípio Mário Jorge, Suhua Tang, and Yi Yu. 2016.Predicting User Preference Based on Matrix Factorization by Exploiting MusicAttributes. In Proceedings of the Ninth International C* Conference on ComputerScience & Software Engineering. ACM, 61–66.

[35] Jennifer Nguyen and Mu Zhu. 2013. Content-boosted matrix factorization tech-niques for recommender systems. Statistical Analysis and Data Mining 6, 4 (2013),286–301.

[36] Athanasios N Nikolakopoulos, Vassilis Kalantzis, Efstratios Gallopoulos, andJohn D Garofalakis. 2017. EigenRec: generalizing PureSVD for effective andefficient top-N recommendations. Knowledge and Information Systems (2017),1–23.

[37] Xia Ning and George Karypis. 2012. Sparse linear methods with side informa-tion for top-n recommendations. In Proceedings of the sixth ACM conference onRecommender systems. ACM, 155–162.

[38] Bithika Pal and Mamata Jenamani. 2018. Kernelized probabilistic matrix factor-ization for collaborative filtering: exploiting projected user and item graph. InProceedings of the 12th ACM Conference on Recommender Systems. ACM, 437–440.

[39] István Pilászy and Domonkos Tikk. 2009. Recommending new movies: even afew ratings are more valuable than metadata. In Proceedings of the third ACMconference on Recommender systems. ACM, 93–100.

[40] Ian Porteous, Arthur U Asuncion, and Max Welling. 2010. Bayesian MatrixFactorization with Side Information and Dirichlet Process Mixtures.. In AAAI.

[41] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE 10th InternationalConference on Data Mining (ICDM). IEEE, 995–1000.

[42] Sujoy Roy and Sharat Chandra Guntuku. 2016. Latent factor representations forcold-start video recommendation. In Proceedings of the 10th ACM Conference onRecommender Systems. ACM, 99–106.

[43] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2000. Applicationof dimensionality reduction in recommender system-a case study. Technical Report.Minnesota Univ Minneapolis Dept of Computer Science.

[44] Martin Saveski and Amin Mantrach. 2014. Item cold-start recommendations:learning local collective embeddings. In Proceedings of the 8th ACM Conferenceon Recommender systems. ACM, 89–96.

[45] Yue Shi, Martha Larson, and Alan Hanjalic. 2010. Mining mood-specific moviesimilarity with matrix factorization for context-aware recommendation. In Pro-ceedings of the workshop on context-aware movie recommendation. ACM, 34–40.

[46] Ajit P Singh and Geoffrey J Gordon. 2008. Relational learning via collective matrixfactorization. In Proceedings of the 14th ACM SIGKDD international conference onKnowledge discovery and data mining. ACM, 650–658.

[47] David H Stern, Ralf Herbrich, and Thore Graepel. 2009. Matchbox: large scale on-line bayesian recommendations. In Proceedings of the 18th international conferenceon World wide web. ACM, 111–120.

[48] Gilbert Strang. 2006. Linear Algebra and Its Applications (4th ed.). Brooks Cole.[49] Panagiotis Symeonidis. 2008. Content-based dimensionality reduction for recom-

mender systems. In Data Analysis, Machine Learning and Applications. Springer,619–626.

[50] Yahoo Labs Webscope 2014. R2 - Yahoo! Music. Retrieved 02-10-2018 fromhttps://webscope.sandbox.yahoo.com/

[51] Tinghui Zhou, Hanhuai Shan, Arindam Banerjee, and Guillermo Sapiro. 2012.Kernelized probabilistic matrix factorization: Exploiting graphs and side infor-mation. In Proceedings of the 2012 SIAM international Conference on Data mining.SIAM, 403–414.

[52] Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. 2005.Improving recommendation lists through topic diversification. In Proceedings ofthe 14th international conference on World Wide Web. ACM, 22–32.