{"title": "Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm", "book": "Advances in Neural Information Processing Systems", "page_first": 2056, "page_last": 2064, "abstract": "We show that matrix completion with trace-norm regularization can be significantly hurt when entries of the matrix are sampled non-uniformly, but that a properly weighted version of the trace-norm regularizer works well with non-uniform sampling. We show that the weighted trace-norm regularization indeed yields significant gains on the highly non-uniformly sampled Netflix dataset.", "full_text": "Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm\n\nRuslan Salakhutdinov\nBrain and Cognitive Sciences and CSAIL, MIT\nCambridge, MA 02139\nrsalakhu@mit.edu\n\nNathan Srebro\nToyota Technological Institute at Chicago\nChicago, Illinois 60637\nnati@ttic.edu\n\nAbstract\n\nWe show that matrix completion with trace-norm regularization can be significantly hurt when entries of the matrix are sampled non-uniformly, but that a properly weighted version of the trace-norm regularizer works well with non-uniform sampling. We show that the weighted trace-norm regularization indeed yields significant gains on the highly non-uniformly sampled Netflix dataset.\n\n1 Introduction\n\nTrace-norm regularization is a popular approach for matrix completion and collaborative filtering, motivated both as a convex surrogate to the rank [7, 6] and in terms of a regularized infinite factor model with connections to large-margin norm-regularized learning [17, 1, 15].\n\nCurrent theoretical guarantees on using the trace-norm for matrix completion assume a uniform sampling distribution over entries of the matrix [18, 6, 5, 13]. In a collaborative filtering setting, where rows of the matrix represent e.g. users and columns represent e.g. 
movies, this corresponds to assuming all users are equally likely to rate movies and all movies are equally likely to be rated. This of course could not be further from the truth, as invariably some users are more active than others and some movies are rated by many people while others are rarely rated.\n\nIn this paper we show, both analytically and through simulations, that this is not a deficiency of the proof techniques used to establish the above guarantees. Indeed, a non-uniform sampling distribution can lead to a significant deterioration in prediction quality and an increase in the sample complexity. Under non-uniform sampling, as many as $\\Omega(n^{4/3})$ samples might be needed for learning even a simple (e.g. orthogonal low rank) $n \\times n$ matrix. This is in sharp contrast to the uniform sampling case, in which $\\tilde{O}(n)$ samples are enough. It is important to note that if the rank could be minimized directly, which is in general not computationally tractable, $\\tilde{O}(n)$ samples would be enough to learn a low-rank model even under an arbitrary non-uniform distribution.\n\nOur analysis further suggests a weighted correction to the trace-norm regularizer that takes into account the sampling distribution. Although it appears counter-intuitive at first, and is indeed the opposite of a previously suggested weighting [21], this weighting is well motivated by our analysis, and we discuss how it corrects the problems that the unweighted trace-norm has with non-uniform sampling. We show how the weighted trace-norm indeed yields a significant improvement on the highly non-uniformly sampled Netflix dataset.\n\nThe only other work we are aware of that studies matrix completion under non-uniform sampling is work on exact completion (i.e. when the matrix is assumed to be exactly low rank) under power-law sampling [12]. 
Other than being limited to one specific distribution, the requirement of the matrix being exactly low rank is central to this work, and the results cannot be directly applied in the presence of even small noise. Empirically, the approach leads to deterioration in predictive performance on the Netflix data [12].\n\n2 Complexity Control in terms of Matrix Factorizations\n\nConsider the problem of predicting the entries of some unknown target matrix $Y \\in \\mathbb{R}^{n \\times m}$ based on a random subset $S$ of observed entries $Y_S$. For example, $n$ and $m$ may represent the number of users and the number of movies, and $Y$ may represent a matrix of partially observed rating values. Predicting elements of $Y$ can be done by finding a matrix $X$ minimizing the training error, here measured as a squared error, and some measure $c(X)$ of complexity. That is, minimizing either:\n\n$$\\min_X \\|X_S - Y_S\\|_F^2 + \\lambda c(X) \\tag{1}$$\n\nor:\n\n$$\\min_{c(X) \\le C} \\|X_S - Y_S\\|_F^2, \\tag{2}$$\n\nwhere $Y_S$, and similarly $X_S$, denotes the matrix \u201cmasked\u201d by $S$, where $(Y_S)_{i,j} = Y_{i,j}$ if $(i, j) \\in S$ and 0 otherwise. For now we ignore possible repeated entries in $S$ and we also assume that $n \\le m$ without loss of generality. The two formulations (1) and (2) are equivalent up to some (unknown) correspondence between $\\lambda$ and $C$, and we will be referring to them interchangeably.\n\nA basic measure of complexity is the rank of $X$, corresponding to the minimal dimensionality $k$ such that $X = U^\\top V$ for some $U \\in \\mathbb{R}^{k \\times n}$ and $V \\in \\mathbb{R}^{k \\times m}$. Directly constraining the rank of $X$ forms one of the most popular approaches to collaborative filtering. However, the rank is non-convex and hard to minimize. 
It is also not clear if a strict dimensionality constraint is most appropriate for measuring the complexity.\n\nTrace-norm Regularization\nLately, methods regularizing the norm of the factorization $U^\\top V$, rather than its dimensionality, have been advocated and were shown to enjoy considerable empirical success [14, 15]. This corresponds to measuring complexity in terms of the trace-norm of $X$, which can be defined equivalently either as the sum of the singular values of $X$, or as [7]:\n\n$$\\|X\\|_{\\mathrm{tr}} = \\min_{X = U^\\top V} \\frac{1}{2}\\left(\\|U\\|_F^2 + \\|V\\|_F^2\\right), \\tag{3}$$\n\nwhere the dimensionality of $U$ and $V$ is not constrained. Beyond the modeling appeal of norm-based, rather than dimension-based, regularization, the trace-norm is a convex function of $X$ and so can be minimized by either local search or more sophisticated convex optimization techniques.\n\nScaling\nThe rank, as a measure of complexity, does not scale with the size of the matrix. That is, even very large matrices can have low rank. Viewing the rank as a complexity measure corresponding to the number of underlying factors, if data is explained by e.g. two factors, then no matter how many rows (\u201cusers\u201d) and columns (\u201cmovies\u201d) we consider, the data will still have rank two. The trace-norm, however, does scale with the size of the matrix. To see this, note that the trace-norm is the $\\ell_1$ norm of the spectrum, while the Frobenius norm is the $\\ell_2$ norm of the spectrum, yielding:\n\n$$\\|X\\|_F \\le \\|X\\|_{\\mathrm{tr}} \\le \\|X\\|_F \\sqrt{\\mathrm{rank}(X)} \\le \\sqrt{n}\\,\\|X\\|_F. \\tag{4}$$\n\nThe Frobenius norm certainly increases with the size of the matrix, since the magnitude of each element does not decrease when we have more elements, and so the trace-norm will also increase. The above suggests measuring the trace-norm relative to the Frobenius norm. 
Without loss of generality, consider each target entry to be of roughly unit magnitude, and so in order to fit $Y$ each entry of $X$ must also be of roughly unit magnitude. This suggests scaling the trace-norm by $\\sqrt{nm}$. More specifically, we study the trace-norm through the complexity measure:\n\n$$\\mathrm{tc}(X) = \\frac{\\|X\\|_{\\mathrm{tr}}^2}{nm}, \\tag{5}$$\n\nwhich puts the trace-norm on a comparable scale to the rank. In particular, when each entry of $X$ is, on average, of unit magnitude (i.e. has unit variance) we have $1 \\le \\mathrm{tc}(X) \\le \\mathrm{rank}(X)$.\n\nThe relationship between $\\mathrm{tc}(X)$ and the rank is tight for \u201corthogonal\u201d low-rank matrices, i.e. low-rank matrices $X = U^\\top V$ where the rows of $U$ and also the rows of $V$ are orthogonal and of equal magnitudes. In order for the entries in $Y$ to have unit magnitude, i.e. $\\|Y\\|_F^2 = nm$, we have that rows in $U$ have norm $\\sqrt{n/\\sqrt{k}}$ and rows in $V$ have norm $\\sqrt{m/\\sqrt{k}}$, yielding precisely $\\mathrm{tc}(X) = \\mathrm{rank}(X)$. Such an orthogonal low-rank matrix can be obtained, e.g., when entries of $U$ and $V$ are zero-mean i.i.d. Gaussian with variance $1/\\sqrt{k}$, corresponding to unit-variance entries in $X$.\n\nGeneralization Guarantees\nAnother place where we can see that $\\mathrm{tc}(X)$ plays a similar role to $\\mathrm{rank}(X)$ is in the generalization and sample complexity guarantees that can be obtained for low-rank and low-trace-norm learning. If there is a low-rank matrix $X^*$ achieving low average error relative to $Y$ (e.g. if $Y = X^* + \\text{noise}$), then by minimizing the training error subject to a rank constraint (a computationally intractable task), $|S| = \\tilde{O}(\\mathrm{rank}(X^*)(n + m))$ samples are enough in order to guarantee learning a matrix $X$ whose overall average error is close to that of $X^*$ [16]. 
Similarly, if there is a low-trace-norm matrix $X^*$ achieving low average error, then by minimizing the training error and the trace-norm (a convex optimization problem), $|S| = \\tilde{O}(\\mathrm{tc}(X^*)(n + m))$ samples are enough in order to guarantee learning a matrix $X$ whose overall average error is close to that of $X^*$ [18]. In these bounds $\\mathrm{tc}(X)$ plays precisely the same role as the rank, up to logarithmic factors.\n\nIn order to get some intuitive understanding of low-rank learning guarantees, it is enough to consider the number of parameters in the rank-$k$ factorization $X = U^\\top V$. It is easy to see that the number of parameters in the factorization is roughly $k(m + n)$ (perhaps a bit less due to rotational invariance). We therefore would expect to be able to learn $X$ when we have roughly this many samples, as is indeed confirmed by the rigorous sample complexity bounds.\n\nFor low-trace-norm learning, consider a sample $S$ of size $|S| \\le Cn$, for some constant $C$. Taking entries of $Y$ to be of unit magnitude, we have $\\|Y_S\\|_F = \\sqrt{|S|} \\le \\sqrt{Cn}$ (recall that $Y_S$ is defined to be zero outside $S$). From (4) we therefore have $\\|Y_S\\|_{\\mathrm{tr}} \\le \\sqrt{Cn} \\cdot \\sqrt{n} = \\sqrt{C}\\,n$ and so $\\mathrm{tc}(Y_S) \\le C$. That is, we can \u201cshatter\u201d any sample of size $|S| \\le Cn$ with $\\mathrm{tc}(X) = C$: no matter what the underlying matrix $Y$ is, we can always perfectly fit the training data with a low trace-norm matrix $X$ s.t. $\\mathrm{tc}(X) \\le C$, without generalizing at all outside $S$. On the other hand, we must allow matrices with $\\mathrm{tc}(X) = \\mathrm{tc}(X^*)$, otherwise we cannot hope to find $X^*$, and so we can only constrain $\\mathrm{tc}(X) \\le C = \\mathrm{tc}(X^*)$. We therefore cannot expect to learn with fewer than $n\\,\\mathrm{tc}(X^*)$ samples. It turns out that this is essentially the largest random sample that can be shattered with $\\mathrm{tc}(X) \\le C = \\mathrm{tc}(X^*)$. If we have more than this many samples we can start learning.\n\n3 Trace-Norm Under a Non-Uniform Distribution\n\nIn this section, we analyze trace-norm regularized learning when the sampling distribution is not uniform. That is, when there is some, known or unknown, non-uniform distribution $D$ over entries of the matrix $Y$ (i.e. over index pairs $(i, j)$) and our sample $S$ is sampled i.i.d. from $D$. Our objective is to get low average error with respect to the distribution $D$. That is, we measure generalization performance in terms of the weighted sum-squared-error:\n\n$$\\|X - Y\\|_D^2 = \\mathbb{E}_{(i,j) \\sim D}\\left[(X_{ij} - Y_{ij})^2\\right] = \\sum_{ij} D(i, j)(X_{ij} - Y_{ij})^2. \\tag{6}$$\n\nWe first point out that when using the rank for complexity control, i.e. when minimizing the training error subject to a low-rank constraint, non-uniformity does not pose a problem. The same generalization and learning guarantees that can be obtained in the uniform case also hold under an arbitrary distribution $D$. In particular, if there is some low-rank $X^*$ such that $\\|X^* - Y\\|_D^2$ is small, then $\\tilde{O}(\\mathrm{rank}(X^*)(n + m))$ samples are enough in order to learn (by minimizing training error subject to a rank constraint) a matrix $X$ with $\\|X - Y\\|_D^2$ almost as small as $\\|X^* - Y\\|_D^2$ [16].\n\nHowever, the same does not hold when learning using the trace-norm. To see this, consider an orthogonal rank-$k$ square $n \\times n$ matrix, and a sampling distribution which is uniform over an $n_A \\times n_A$ sub-matrix $A$, with $n_A = n^a$. That is, the row (e.g. \u201cuser\u201d) is selected uniformly among the first $n_A$ rows, and the column (e.g. \u201cmovie\u201d) is selected uniformly among the first $n_A$ columns. We will use $A$ to denote the subset of entries in the submatrix, i.e. $A = \\{(i, j) \\mid 1 \\le i, j \\le n_A\\}$. For any sample $S$, we have:\n\n$$\\mathrm{tc}(Y_S) = \\frac{\\|Y_S\\|_{\\mathrm{tr}}^2}{n^2} \\le \\frac{\\|Y_S\\|_F^2\\,\\mathrm{rank}(Y_S)}{n^2} \\le \\frac{|S|\\,n^a}{n^2} = \\frac{|S|}{n^{2-a}}, \\tag{7}$$\n\nwhere we again take the entries in $Y$ to be of unit magnitude. In the second inequality above we use the fact that $Y_S$ is zero outside of $A$, and so we can bound the rank of $Y_S$ by the dimensionality $n_A = n^a$ of $A$.\n\nSetting $a < 1$, we see that we can shatter any sample of size $kn^{2-a} = \\tilde{\\omega}(n)$ (footnote 1) with a matrix $X$ for which $\\mathrm{tc}(X) < k$. When $a \\le 1/2$, the total number of entries in $A$ is at most $n$. In this case $\\tilde{O}(n)$ observations are enough in order to memorize $Y_A$ (footnote 2). But when $1/2 < a < 1$, with $\\tilde{O}(n)$ observations, restricting to even $\\mathrm{tc}(X) < 1$, we can neither learn $Y$, since we can shatter $Y_S$, nor memorize it. For example, when $a = 2/3$ and so $n_A = n^{2/3}$, we need roughly $n^{4/3}$ samples to start learning by constraining $\\mathrm{tc}(X)$ to a constant, the same as we would need in order to memorize $Y_A$. This is a factor of $n^{1/3}$ greater than the sample size needed to learn a matrix with constant $\\mathrm{tc}(X)$ in the uniform case.\n\nThe above arguments establish that restricting the complexity to $\\mathrm{tc}(X) < k$ might not lead to generalization with $\\tilde{O}(kn)$ samples in the non-uniform case. But does this mean that we cannot learn a rank-$k$ matrix by minimizing the trace-norm using $\\tilde{O}(kn)$ samples when the sampling distribution is concentrated on a small submatrix? Of course this is not the case. Since the samples are uniform on a small submatrix, we can just think of the submatrix $A$ as our entire space. The target matrix still has low rank, even when restricted to $A$, and we are back in the uniform sampling scenario. The only issue here is that $\\mathrm{tc}(X) \\le k$, i.e. $\\|X\\|_{\\mathrm{tr}} \\le n\\sqrt{k}$, is the right constraint in the uniform observation scenario. When samples are concentrated in $A$, we actually need to restrict to a much smaller trace norm, $\\|X\\|_{\\mathrm{tr}} \\le n^a\\sqrt{k}$, which will allow learning with $\\tilde{O}(kn^a)$ samples.\n\nWe can, however, modify the example and construct a sampling distribution under which $\\Omega(n^{4/3})$ samples are required in order to learn even an \u201corthogonal\u201d low-rank matrix, no matter what constraint is placed on the trace-norm. This is a significantly larger sample complexity than $\\tilde{O}(kn)$, which is what we would expect, and what is required for learning by constraining the rank directly.\n\nTo do so, consider another submatrix $B$ of size $n_B \\times n_B$ with $n_B = n/2$, such that the rows and columns of $A$ and $B$ do not overlap (see figure). Now, consider a sampling distribution $D$ which is uniform over $A$ with probability half, and uniform over $B$ with probability half. Consider fitting a noisy matrix $Y = X^* + \\text{noise}$ where $X^*$ is \u201corthogonal\u201d rank-$k$. In order to fit on $B$, we need to allow a trace-norm of at least $\\|X^*_B\\|_{\\mathrm{tr}} = \\frac{n}{2}\\sqrt{k}$, i.e. allow $\\mathrm{tc}(X) = k/4$. But as discussed above, with such a generous constraint on the trace-norm, we will be able to shatter $S \\cap A$ whenever $|S \\cap A| = |S|/2 \\le kn^{2-a}/4$. Since there is no overlap in rows and columns, and so values in the sub-matrices $A$ and $B$ are independent, shattering $S \\cap A$ means we cannot hope to learn in $A$. Setting $a = 2/3$ as before, with $o(n^{4/3})$ samples, we cannot learn in $A$ and $B$ jointly: either we constrain to a trace-norm which is too low to fit $X^*_B$ (we under-fit on $B$), or we allow a trace-norm which is high enough to overfit $Y_{S \\cap A}$. In any case, we will make errors on at least half the mass of $D$ (footnote 3).\n\nEmpirical Example\nLet us consider a simple simulation experiment that will help us illustrate this phenomenon. 
Consider a simple synthetic example, where we used $n_A = 300$ and $n_B = 4700$, with an orthogonal rank-2 matrix $X^*$ and $Y = X^* + \\mathcal{N}(0, 1)$ (in case of repeated entries, the noise is independent for each appearance in the sample). The training sample size was set to $|S| = 140{,}000$.\n\nThe three curves of Fig. 1 measure the excess (test) error $\\|X - X^*\\|_D^2 = \\|X - Y\\|_D^2 - \\|Y - X^*\\|_D^2$ of the learned model, as well as the error contribution from $A$ and from $B$, as a function of the constraint on $\\mathrm{tc}(X)$, for the sampling distribution discussed above and a specific sample size. As can be seen, although it is possible to constrain $\\mathrm{tc}(X)$ so as to achieve squared-error of less than 0.8 on $B$, this constraint is too lax for $A$ and allows for over-fitting. Constraining $\\mathrm{tc}(X)$ so as to avoid overfitting $A$ (achieving almost zero excess test error) leads to a suboptimal fit on $B$.\n\nFootnote 1: Recall that $f(n) = \\tilde{\\omega}(g(n))$ iff for all $p$, $\\frac{g(n) \\log^p g(n)}{f(n)} \\to 0$.\nFootnote 2: The algorithm saw all (or most) entries of the matrix and does not need to predict any unobserved entries.\nFootnote 3: More accurately, if we do allow high enough trace-norm to fit $B$, and $|S| = o(n^{4/3})$, then the \u201ccost\u201d of overfitting $Y_{S \\cap A}$ is negligible compared to the cost of fitting $X^*_B$. For large enough $n$, we would be tempted to very slightly deteriorate the fit of $X^*_B$ in order to \u201cfree up\u201d enough trace-norm and completely overfit $Y_{S \\cap A}$.\n\n[Figure 1 plot: three panels with y-axis \u201cMean Squared Error\u201d and x-axes $\\mathrm{tc}(X)$ (left), $\\mathrm{tc}_{pq}(X)$ (middle) and regularization parameter $\\lambda$ (right); curves labeled A, B and A+B, with additional \u201cshift A\u201d and \u201cshift A+B\u201d curves in the right panel.]\n\nFigure 1: Left and middle: Mean squared error (MSE) of the learned model as a function of the constraint on $\\mathrm{tc}(X)$ (left) and $\\mathrm{tc}_{pq}(X)$ (middle). Right: The solid curves show the optimum of the mean squared error objective (9) (unweighted trace-norm), as a function of the regularization parameter $\\lambda$. The dashed curves show the weighted trace-norm. The black (middle) curve is the overall MSE, the red (bottom) curve measures only the contribution from $A$, and the blue (top) curve measures only the contribution from $B$.\n\nPenalty Formulation\nUntil now we discussed learning by constraining the trace-norm, i.e. using the formulation (2). It is also insightful to consider the penalty view (1), i.e. learning by minimizing\n\n$$\\min_X \\|Y_S - X_S\\|_F^2 + \\lambda \\|X\\|_{\\mathrm{tr}}. \\tag{8}$$\n\nFirst observe that the characterization (3) allows us to decompose $\\|X\\|_{\\mathrm{tr}} = \\|X_A\\|_{\\mathrm{tr}} + \\|X_B\\|_{\\mathrm{tr}}$, where w.l.o.g. we take all columns of $U$ and $V$ outside $A$ and $B$ to be zero. Since we also have $\\|Y_S - X_S\\|_F^2 = \\|Y_{A \\cap S} - X_{A \\cap S}\\|_F^2 + \\|Y_{B \\cap S} - X_{B \\cap S}\\|_F^2$, we can decompose the training objective (8) as:\n\n$$\\begin{aligned} \\|Y_S - X_S\\|_F^2 + \\lambda \\|X\\|_{\\mathrm{tr}} &= \\left(\\|Y_{A \\cap S} - X_{A \\cap S}\\|_F^2 + \\lambda \\|X_A\\|_{\\mathrm{tr}}\\right) + \\left(\\|Y_{B \\cap S} - X_{B \\cap S}\\|_F^2 + \\lambda \\|X_B\\|_{\\mathrm{tr}}\\right) \\\\ &= \\left(\\|Y_{A \\cap S} - X_{A \\cap S}\\|_F^2 + \\lambda n_A \\sqrt{\\mathrm{tc}_A(X_A)}\\right) + \\left(\\|Y_{B \\cap S} - X_{B \\cap S}\\|_F^2 + \\lambda n_B \\sqrt{\\mathrm{tc}_B(X_B)}\\right), \\end{aligned} \\tag{9}$$\n\nwhere $\\mathrm{tc}_A(X_A) = \\|X_A\\|_{\\mathrm{tr}}^2 / n_A^2$ (and similarly $\\mathrm{tc}_B(X_B)$) refers to the complexity measure $\\mathrm{tc}(\\cdot)$ measured relative to the size of $A$ (similarly $B$). We see that the training objective decomposes into objectives over $A$ and $B$. Each one of these corresponds to a trace-norm regularized learning problem, under a uniform sampling distribution (in the corresponding submatrix) of a noisy low-rank \u201corthogonal\u201d matrix, and can therefore be learned with $\\tilde{O}(kn_A)$ and $\\tilde{O}(kn_B)$ samples respectively. In other words, $\\tilde{O}(kn)$ samples should be enough to learn both inside $A$ and inside $B$.\n\nHowever, the regularization tradeoff parameter $\\lambda$ compounds the two problems. When the objective is expressed in terms of $\\mathrm{tc}(\\cdot)$, as in (9), the regularization tradeoff is scaled differently in each part of the training objective. With $\\tilde{O}(kn)$ samples, it is possible to learn in $A$ with some setting of $\\lambda$, and it is possible to learn in $B$ with some other setting of $\\lambda$, but from the discussion above we learn that no single value of $\\lambda$ will allow learning in both $A$ and $B$. Either $\\lambda$ is too high, yielding too strict regularization in $B$ (where it is scaled by $n_B \\gg n_A$), so that learning on $B$ is not possible, or $\\lambda$ is too small and does not provide enough regularization in $A$.\n\nReturning to our simulation experiment, the solid curves of Fig. 1, right panel, show the excess test error for the minimizer of the training objective (9), as a function of the regularization tradeoff parameter $\\lambda$. 
Note that these are essentially the same curves as displayed in the left panel of Fig. 1, except the path of regularized solutions is now parameterized by $\\lambda$ rather than by the bound on $\\mathrm{tc}(X)$. Not surprisingly, we see the same phenomena: different values of $\\lambda$ are required for optimal learning on $A$ and on $B$. Forcing the same $\\lambda$ on both parts of the training objective (9) yields a deterioration in the generalization performance.\n\n4 Weighted Trace Norm\n\nThe decomposition (9) and the discussion in the previous section suggest weighting the trace-norm by the frequency of rows and columns. For a sampling distribution $D$, denote by $p(i)$ the row marginal, i.e. the probability of observing row $i$, and similarly denote by $q(j)$ the column marginal. We propose using the following weighted version of the trace-norm as a regularizer:\n\n$$\\|X\\|_{\\mathrm{tr}(p,q)} = \\left\\|\\mathrm{diag}(\\sqrt{p})\\, X\\, \\mathrm{diag}(\\sqrt{q})\\right\\|_{\\mathrm{tr}} = \\min_{X = U^\\top V} \\frac{1}{2}\\Big(\\sum_i p(i) \\|U_i\\|^2 + \\sum_j q(j) \\|V_j\\|^2\\Big), \\tag{10}$$\n\nwhere $\\mathrm{diag}(\\sqrt{p})$ is a diagonal matrix with $\\sqrt{p(i)}$ on its diagonal (similarly $\\mathrm{diag}(\\sqrt{q})$). The corresponding normalized complexity measure is given by $\\mathrm{tc}_{pq}(X) = \\|X\\|_{\\mathrm{tr}(p,q)}^2$. Note that for a uniform distribution we have that $\\mathrm{tc}_{pq}(X) = \\mathrm{tc}(X)$. Furthermore, it is easy to verify that for an \u201corthogonal\u201d rank-$k$ matrix $X$ we have $\\mathrm{tc}_{pq}(X) = k$ for any sampling distribution.\n\nEquipped with the weighted trace-norm as a regularizer, let us revisit the problematic sampling distribution studied in the previous section. In order to fit the \u201corthogonal\u201d rank-$k$ $X^*$, we need a weighted trace-norm of $\\|X^*\\|_{\\mathrm{tr}(p,q)} = \\sqrt{\\mathrm{tc}_{pq}(X^*)} = \\sqrt{k}$. How large a sample $S \\cap A$ can we now shatter using such a weighted trace-norm? We can shatter a sample if $\\|Y_{S \\cap A}\\|_{\\mathrm{tr}(p,q)} \\le \\sqrt{k}$. We can calculate:\n\n$$\\|Y_{S \\cap A}\\|_{\\mathrm{tr}(p,q)} = \\frac{\\|Y_{S \\cap A}\\|_{\\mathrm{tr}}}{2 n_A} \\le \\frac{\\sqrt{|S \\cap A|\\, n_A}}{2 n_A} = \\sqrt{\\frac{|S|}{8 n_A}}. \\tag{11}$$\n\nThat is, we can shatter a sample of size up to $|S| = 8kn_A < 8kn$. 
The calculation for $B$ is identical. It seems that now, with a fixed constraint on the weighted trace-norm, we have enough capacity to both fit $X^*$ and, with $\\tilde{O}(kn)$ samples, avoid overfitting on $A$.\n\nReturning to the penalization view (1) we can again decompose the training objective as:\n\n$$\\begin{aligned} \\|Y_S - X_S\\|_F^2 + \\lambda \\|X\\|_{\\mathrm{tr}(p,q)} = {} & \\left(\\|Y_{A \\cap S} - X_{A \\cap S}\\|_F^2 + \\frac{\\lambda}{2} \\sqrt{\\mathrm{tc}_A(X_A)}\\right) \\\\ & + \\left(\\|Y_{B \\cap S} - X_{B \\cap S}\\|_F^2 + \\frac{\\lambda}{2} \\sqrt{\\mathrm{tc}_B(X_B)}\\right), \\end{aligned} \\tag{12}$$\n\navoiding the scaling by the block sizes which we encountered in (9).\n\nReturning to the synthetic experiments of Fig. 1 (right panel), and comparing (9) with (12), we see that introducing the weighting corresponds to a relative change of $n_A/n_B$ in the correspondence of the regularization tradeoff parameters used for $A$ and for $B$. This corresponds to a shift of $\\log\\frac{n_A}{n_B}$ in the log-domain used in the figure. Shifting the solid red (bottom) curve by this amount yields the dashed red (bottom) curve. The solid blue (top) curve and the dashed red (bottom) curve thus represent the excess error on $B$ and on $A$ when the weighted trace norm is used, i.e. the training objective (12) is minimized. The dashed black (middle) curve is the overall excess error when using this training objective. As can be seen, the weighting aligns the excess errors on $A$ and on $B$ much better, and yields a lower overall error. The weighted trace-norm achieves the lowest MSE of 0.4301 with corresponding $\\lambda = 0.11$. This is compared to the lowest MSE of 0.4981 with $\\lambda = 0.80$, achieved by the unweighted trace-norm.\n\nIt is also interesting to observe that the weighted trace-norm outperforms its unweighted counterpart for a wide range of regularization parameters $\\lambda \\in [0.01, 0.6]$. 
This may also suggest that in practice, particularly when working with large and imbalanced datasets, it may be easier to search for regularization parameters using the weighted trace-norm.\n\nFinally, Fig. 1, right panel, also suggests that the optimal shift might actually be smaller than $n_A/n_B$. We can consider a smaller shift by using the partially-weighted trace-norm:\n\n$$\\|X\\|_{\\mathrm{tr}(p,q,\\alpha)} = \\left\\|\\mathrm{diag}(p^{\\alpha/2})\\, X\\, \\mathrm{diag}(q^{\\alpha/2})\\right\\|_{\\mathrm{tr}} = \\min_{X = U^\\top V} \\frac{1}{2}\\Big(\\sum_i p(i)^\\alpha \\|U_i\\|^2 + \\sum_j q(j)^\\alpha \\|V_j\\|^2\\Big),$$\n\nand the corresponding normalized complexity measure $\\mathrm{tc}_\\alpha(X) = \\|X\\|^2_{\\mathrm{tr}(p^\\alpha/n^{1-\\alpha},\\, q^\\alpha/m^{1-\\alpha})}$.\n\nOther Weightings and Bayesian Perspective\nThe weighted trace-norm motivated by the analysis here (with $\\alpha = 1$) implies that the frequent users (equivalently movies) get regularized much more strongly than the rare users (equivalently movies). This might at first seem quite counter-intuitive, as the natural weighting might seem to be the opposite. Indeed, Weimer et al. [21] speculated that with a uniform weighting ($\\alpha = 0$) frequent users are regularized too heavily compared to infrequent users, and so suggested regularizing frequent users (and movies) with a lower weight, corresponding to $\\alpha = -1$. Although this might seem natural, we saw here that the reverse is actually true: the Weimer et al. weighting ($\\alpha = -1$) would only make things worse. Indeed, consistent with the analysis here, Weimer et al. actually observed a deterioration in prediction quality when using their weighting. This is also demonstrated in the experiments on the Netflix data in Section 6.\n\nThe weighted regularization motivated here (with $\\alpha = 1$) is also quite unusual from a Bayesian perspective. The trace-norm can be viewed as a negative-log-prior for the Probabilistic Matrix Factorization model [15], where entries of $U$, $V$ are taken to be i.i.d. Gaussian. 
The two terms of (8) can then be interpreted as a log-likelihood and log-prior, and minimizing (8) corresponds to finding the MAP parameters. Introducing weighting (with $\\alpha = 1$) effectively states that the effect of the prior becomes stronger as we observe more data. Yet, our analysis strongly suggests that in the non-uniform setting, such \u201cunorthodox\u201d regularization is crucial for achieving good generalization performance.\n\n5 Practical Implementation\n\nWhen dealing with large datasets, such as the Netflix data, the most practical way to fit trace-norm regularized models is through stochastic gradient descent [15, 8]. Let $n_i = \\sum_j S_{ij}$ and $m_j = \\sum_i S_{ij}$ denote the number of observed ratings for user $i$ and movie $j$ respectively. The training objective using a partially-weighted trace-norm (10) can be written as:\n\n$$\\sum_{\\{i,j\\} \\in S} \\left( \\left(Y_{ij} - U_i^\\top V_j\\right)^2 + \\frac{\\lambda}{2}\\left( \\frac{p(i)^\\alpha}{n_i} \\|U_i\\|^2 + \\frac{q(j)^\\alpha}{m_j} \\|V_j\\|^2 \\right) \\right),$$\n\nwhere $U \\in \\mathbb{R}^{k \\times n}$ and $V \\in \\mathbb{R}^{k \\times m}$. We can optimize this objective using stochastic gradient descent by picking one training pair $(i, j)$ at random at each iteration, and taking a step in the direction opposite the gradient of the term corresponding to the chosen $(i, j)$.\n\nNote that even though the objective (13) as a function of $U$ and $V$ is non-convex, there are no non-global local minima if we set $k$ to be large enough, i.e. $k > \\min(n, m)$ [2]. However, in practice using very large values of $k$ becomes computationally expensive. Instead, we consider truncated trace-norm minimization by restricting $k$ to smaller values. In the next section we demonstrate that even when using the truncated trace-norm, its weighted version significantly improves the model\u2019s prediction performance.\n\nIn our experiments, we also replace the unknown row marginals $p(i)$ and column marginals $q(j)$ in (13) by their empirical estimates $\\hat{p}(i) = n_i/|S|$ and $\\hat{q}(j) = m_j/|S|$. 
This results in the following objective:\n\n$$\\sum_{\\{i,j\\} \\in S} \\left( \\left(Y_{ij} - U_i^\\top V_j\\right)^2 + \\frac{\\lambda}{2|S|}\\left( n_i^{\\alpha-1} \\|U_i\\|^2 + m_j^{\\alpha-1} \\|V_j\\|^2 \\right) \\right). \\tag{13}$$\n\nSetting $\\alpha = 1$, corresponding to the weighted trace-norm (10), results in stochastic gradient updates that do not involve the row and column counts at all and are in some sense the simplest. Strangely, and likely originating as a \u201cbug\u201d in calculating the stochastic gradients by one of the participants, these steps match the stochastic training used by many practitioners on the Netflix dataset, without explicitly considering the weighted trace-norm [8, 19, 15].\n\n6 Experimental results\n\nWe tested the weighted trace-norm on the Netflix dataset, which is the largest publicly available collaborative filtering dataset. The training set contains 100,480,507 ratings from 480,189 anonymous users on 17,770 movie titles. Netflix also provides a qualification set, containing 1,408,395 ratings, out of which we set aside 100,000 ratings for validation. The \u201cqualification set\u201d pairs were selected by Netflix from the most recent ratings for a subset of the users. Due to the special selection scheme, ratings from users with few ratings are overrepresented in the qualification set, relative to the training set. To be able to report results where the train and test sampling distributions are the same, we also created a \u201ctest set\u201d by randomly selecting and removing 100,000 ratings from the training set. All ratings were normalized to be zero-mean by subtracting 3.6. The dataset is very imbalanced: it includes users with over 10,000 ratings as well as users who rated fewer than 5 movies.\n\nFor various values of $\\alpha$, we learned a factorization $U^\\top V$ with $k = 30$ and with $k = 100$ dimensions (factors) using stochastic gradient descent as in (13). 
For each value of $\\alpha$ and $k$ we selected the regularization tradeoff $\\lambda$ by minimizing the error on the 100,000 qualification set examples set aside for validation. Results on both the Netflix qualification set and on the test set we created are reported in Table 1. Recall that the sampling distribution of the \u201ctest set\u201d matches that of the training data, while the qualification set is sampled differently, explaining the large difference in generalization between the two.\n\nTable 1: Root Mean Squared Error (RMSE) on the Netflix qualification set and on a test set that was held out from the training data, for training by minimizing (13). We report $\\lambda/|S|$ minimizing the error on the validation set (held out from the qualification set), qualification and test errors using this tradeoff, and $\\mathrm{tc}_\\alpha(X)$ at the optimum. Last row of each block: training by regularizing the max-norm.\n\n\u03b1          k     \u03bb/|S|   tc_\u03b1(X)         Test     Qual\n1          30    0.05    4.34            0.7607   0.9105\n0.9        30    0.07    4.27            0.7573   0.9091\n0.75       30    0.2     5.04            0.7723   0.9128\n0.5        30    0.5     7.32            0.7823   0.9159\n0          30    2.5     10.36           0.7889   0.9235\n-1         30    450     11.41           0.7913   0.9256\nmax-norm   30    -       mc(X) = 5.06    0.7692   0.9131\n\n1          100   0.08    5.47            0.7412   0.9071\n0.9        100   0.1     5.23            0.7389   0.9062\n0.75       100   0.3     6.24            0.7491   0.9098\n0.5        100   0.8     9.65            0.7613   0.9127\n0          100   3.0     21.23           0.7667   0.9203\n-1         100   700     23.31           0.7713   0.9221\nmax-norm   100   -       mc(X) = 5.77    0.7432   0.9092\n\nFor both $k = 30$ and $k = 100$, the weighted trace-norm ($\\alpha = 1$) significantly outperformed the unweighted trace-norm ($\\alpha = 0$). Interestingly, the optimal weighting (setting of $\\alpha$) was a bit lower than, but very close to, $\\alpha = 1$. For completeness, we also evaluated the weighting suggested by Weimer et al. [21], corresponding to $\\alpha = -1$. 
Unsurprisingly, given our analysis, this seemingly\nintuitive weighting hurts predictive performance.\n\nFor both k = 30 and k = 100, we also observed that for the weighted trace-norm (\u03b1 = 1) good\ngeneralization is possible with a wide range of \u03bb settings, while for the unweighted trace-norm\n(\u03b1 = 0), the results were much more sensitive to the setting of \u03bb. This confirms our previous results\non the synthetic experiment and strongly suggests that it may be far easier to search for regularization\nparameters using the weighted trace-norm.\n\nComparison with the Max-Norm\nWe also compared the predictive performance on Netflix to predictions based on max-norm\nregularization. The max-norm is defined as:\n\n$$\\|X\\|_{\\max} = \\min_{X = U'V} \\frac{1}{2} \\left( \\max_i \\|U_i\\|^2 + \\max_j \\|V_j\\|^2 \\right). \\tag{14}$$\n\nSimilarly to the rank, but unlike the trace-norm, generalization and learning guarantees based on the\nmax-norm hold also under an arbitrary, non-uniform, sampling distribution. Specifically, defining\nmc(X) = \u2016X\u2016_max^2 (no normalization is necessary here), \u00d5(mc(X)(n + m)) samples are enough for\ngeneralization w.r.t. any sampling distribution (just like the rank) [18]. This suggests that perhaps the\nmax-norm can be used as an alternative factorization-regularization in the presence of non-uniform\nsampling. Indeed, as evident in Table 1, max-norm based regularization does perform much better\nthan the unweighted trace-norm. The differences between the max-norm and the weighted\ntrace-norm are small, but it seems that using the weighted trace-norm is slightly but consistently better.\n\n7 Summary\nIn this paper we showed both analytically and empirically that under non-uniform sampling,\ntrace-norm regularization can lead to significant performance deterioration and an increase in sample\ncomplexity. Our analysis suggests a non-intuitive weighting for the trace-norm in order to\ncorrect the problem. 
Our results on both synthetic data and the highly imbalanced Netflix dataset further\ndemonstrate that the weighted trace-norm yields significant improvements in prediction quality.\n\nIn terms of optimization, we focused on stochastic gradient descent, both since it is a simple and\npractical method for very large-scale trace-norm optimization [15, 8], and since the weighting was\noriginally stumbled upon through this optimization approach. However, most recently proposed\nmethods for trace-norm optimization (e.g. [3, 10, 9, 11, 20]) can also be easily modified for the\nweighted trace-norm.\n\nWe hope that the weighted trace-norm, and the discussions in Sections 3 and 4, will be helpful\nin deriving theoretical learning guarantees for arbitrary non-uniform sampling distributions, both in\nthe form of generalization error bounds as in [18], and generalizing the compressed-sensing inspired\nwork on recovery of noisy low-rank matrices as in [4, 13].\n\nAcknowledgments RS is supported by NSERC, Shell, and NTT Communication Sciences Laboratory.\n\nReferences\n[1] J. Abernethy, F. Bach, T. Evgeniou, and J.P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803\u2013826, 2009.\n[2] S. Burer and R.D.C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427\u2013444, 2005.\n[3] J.F. Cai, E.J. Cand\u00e8s, and Z. Shen. A Singular Value Thresholding Algorithm for Matrix Completion. SIAM Journal on Optimization, 20:1956, 2010.\n[4] E.J. Cand\u00e8s and Y. Plan. Matrix completion with noise. Proceedings of the IEEE (to appear), 2009.\n[5] E.J. Cand\u00e8s and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9, 2009.\n[6] E.J. Cand\u00e8s and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. 
Theory (to appear), 2009.\n[7] M. Fazel, H. Hindi, and S.P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings American Control Conference, volume 6, 2001.\n[8] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In ACM SIGKDD, pages 426\u2013434, 2008.\n[9] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235\u20131256, 2009.\n[10] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, pages 1\u201333, 2009.\n[11] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral Regularization Algorithms for Learning Large Incomplete Matrices. Journal of Machine Learning Research, 11:2287\u20132322, 2010.\n[12] R. Meka, P. Jain, and I. S. Dhillon. Matrix completion from power-law distributed samples. In Advances in Neural Information Processing Systems, volume 21, 2009.\n[13] B. Recht. A simpler approach to matrix completion. preprint, available from author\u2019s webpage, 2009.\n[14] J.D.M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, page 719, 2005.\n[15] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.\n[16] N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances In Neural Information Processing Systems 17, 2005.\n[17] N. Srebro, J. Rennie, and T. Jaakkola. Maximum margin matrix factorization. In Advances In Neural Information Processing Systems 17, 2005.\n[18] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. 
In COLT, 2005.\n[19] G\u00e1bor Tak\u00e1cs, Istv\u00e1n Pil\u00e1szy, Botty\u00e1n N\u00e9meth, and Domonkos Tikk. Scalable collaborative filtering approaches for large recommender systems. Journal of Machine Learning Research, 10:623\u2013656, 2009.\n[20] R. Tomioka, T. Suzuki, M. Sugiyama, and H. Kashima. A fast augmented Lagrangian algorithm for learning low-rank matrices. In ICML, pages 1087\u20131094, 2010.\n[21] M. Weimer, A. Karatzoglou, and A. Smola. Improving maximum margin matrix factorization. Machine Learning, 72(3):263\u2013276, 2008.", "award": [], "sourceid": 779, "authors": [{"given_name": "Nathan", "family_name": "Srebro", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}