{"title": "Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 1321, "page_last": 1328, "abstract": null, "full_text": " Generalization Error Bounds for Collaborative\n Prediction with Low-Rank Matrices\n\n\n\n Nathan Srebro Noga Alon\n Department of Computer Science School of Mathematical Sciences\n University of Toronto Tel Aviv University\n Toronto, ON, Canada Ramat Aviv, Israel\n nati@cs.toronto.edu nogaa@tau.ac.il\n\n\n Tommi S. Jaakkola\n Computer Science and Artificial Intelligence Laboratory\n Massachusetts Institute of Technology\n Cambridge, MA, USA\n tommi@csail.mit.edu\n\n\n\n\n Abstract\n\n We prove generalization error bounds for predicting entries in a partially\n observed matrix by fitting the observed entries with a low-rank matrix. In\n justifying the analysis approach we take to obtain the bounds, we present\n an example of a class of functions of finite pseudodimension such that\n the sums of functions from this class have unbounded pseudodimension.\n\n\n1 Introduction\n\n\"Collaborative filtering\" refers to the general task of providing users with information on\nwhat items they might like, or dislike, based on their preferences so far and how they relate\nto the preferences of other users. This approach contrasts with a more traditional feature-\nbased approach where predictions are made based on features of the items.\n\nFor feature-based approaches, we are accustomed to studying prediction methods in terms\nof probabilistic post-hoc generalization error bounds. Such results provide us a (proba-\nbilistic) bound on the performance of our predictor on future examples, in terms of its\nperformance on the training data. These bounds hold without any assumptions on the true\n\"model\", that is the true dependence of the labels on the features, other than the central\nassumptions that the training examples are drawn i.i.d. 
from the distribution of interest.

In this paper we suggest studying the generalization ability of collaborative prediction
methods. By "collaborative prediction" we indicate that the objective is to be able to pre-
dict user preferences for items, that is, entries in some unknown target matrix Y of user-
item "ratings", based on observing a subset YS of the entries in this matrix1. We present

 1In other collaborative filtering tasks, the objective is to be able to provide each user with a few
items that overlap his top-rated items, while it is not important to be able to correctly predict the
user's ratings for other items. Note that it is possible to derive generalization error bounds for this
objective based on bounds for the "prediction" objective.

 arbitrary source distribution target matrix Y
 random training set random set S of observed entries
 hypothesis predicted matrix X
 training error observed discrepancy DS(X; Y)
 generalization error true discrepancy D(X; Y)

Figure 1: Correspondence with post-hoc bounds on the generalization error for standard
feature-based prediction tasks

bounds on the true average overall error D(X; Y) = (1/nm) Σ_{i=1}^{n} Σ_{a=1}^{m} loss(Xia; Yia) of
the predictions X in terms of the average error over the observed entries DS(X; Y) =
(1/|S|) Σ_{ia∈S} loss(Xia; Yia), without making any assumptions on the true nature of the pref-
erences Y. What we do assume is that the subset S of entries that we observe is chosen
uniformly at random. This strong assumption parallels the i.i.d. source assumption for
feature-based prediction.

In particular, we present generalization error bounds on prediction using low-rank models.

Collaborative prediction using low-rank models is fairly straightforward. A low-rank ma-
trix X is sought that minimizes the average observed error DS(X; Y). Unobserved entries
in Y are then predicted according to X. 
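To make these quantities concrete, here is a minimal NumPy sketch of the setting: a synthetic ±1 target Y, a uniformly chosen observed subset S, and the observed and true discrepancies under the zero-one sign agreement loss. The matrix sizes, the random rank-k predictor, and the way Y is generated are all hypothetical choices for illustration; the analysis in the paper assumes nothing about how Y arises.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 30, 40, 2

# Synthetic +/-1 target matrix Y (an arbitrary illustration; no model assumed).
Y = np.where(rng.standard_normal((n, m)) >= 0, 1.0, -1.0)

# Observe a subset S of |S| entries chosen uniformly among all subsets.
s = 200
S = np.unravel_index(rng.choice(n * m, size=s, replace=False), (n, m))

# An arbitrary rank-k prediction X = U V.
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))

def zero_one_loss(x, y):
    # loss(X_ia; Y_ia) = 1 exactly when Y_ia * X_ia <= 0 (sign disagreement).
    return (y * x <= 0).astype(float)

D_S = zero_one_loss(X[S], Y[S]).mean()  # observed discrepancy D_S(X; Y)
D = zero_one_loss(X, Y).mean()          # true discrepancy D(X; Y)
```

The bounds in this paper control the gap between these two averages uniformly over all rank-k matrices X.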
The premise behind such a model is that there
are only a small number of factors influencing the preferences, and that a user's preference
vector is determined by how each factor applies to that user. Different methods differ in
how they relate real-valued entries in X to preferences in Y, and in the associated measure
of discrepancy. For example, entries in X can be seen as parameters for probabilistic
models of the entries in Y, either mean parameters [1] or natural parameters [2], with a
maximum likelihood criterion used. Alternatively, other loss functions, such as squared
error [3, 2], or zero-one loss versus the signs of entries in X, can be minimized.


Prior Work Previous results bounding the error of collaborative prediction using a low-
rank matrix all assume the true target matrix Y is well-approximated by a low-rank matrix.
This corresponds to a large eigengap between the top few singular values of Y and the
remaining singular values. Azar et al [3] give asymptotic results on the convergence of
the predictions to the true preferences, assuming an eigengap. Drineas et al [4] analyze
the sample complexity needed to predict a matrix with an eigengap, and suggest strategies
for actively querying entries in the target matrix. To our knowledge, this is the first
analysis of the generalization error of low-rank methods that does not make any
assumptions on the true target matrix.

Generalization error bounds (and related online learning bounds) were previously discussed
for collaborative prediction applications, but only when prediction was done for each user
separately, using a feature-based method, with the other users' preferences as features [5,
6]. Although these address a collaborative prediction application, the learning setting is a
standard feature-based setting. 
These methods are also limited, in that learning must be
performed separately for each user.

Shawe-Taylor et al [7] discuss assumption-free post-hoc bounds on the residual errors of
low-rank approximation. These results apply to a different setting, where a subset of the
rows are fully observed, and bound a different quantity--the distance between rows and
the learned subspace, rather than the distance to predicted entries.


Organization In Section 2 we present a generalization error bound for zero-one loss,
based on a combinatorial result which we prove in Section 3. In Section 4 we generalize
the bound to arbitrary loss functions. Finally, in Section 5 we justify the combinatorial
approach taken, by considering an alternate approach (viewing rank-k matrices as combi-
nations of k rank-1 matrices) and showing why it does not work.


2 Generalization Error Bound for Zero-One Error

We begin by considering binary labels Yia and a zero-one sign agreement loss:

 loss(Xia; Yia) = 1_{Yia·Xia ≤ 0}   (1)

Theorem 1. For any matrix Y ∈ {±1}^{n×m}, n, m > 2, δ > 0 and integer k, with proba-
bility at least 1 - δ over choosing a subset S of entries in Y uniformly among all subsets
of |S| entries, the discrepancy with respect to the zero-one sign agreement loss satisfies2:

 D(X; Y) ≤ DS(X; Y) + √((k(n + m) log(16em/k) - log δ) / (2|S|))   for all X, rank X ≤ k

Proof. Since the zero-one sign agreement loss depends only on the signs of the entries of
X, we consider the equivalence classes of rank-k matrices with the same element-wise
signs, and denote the number of such classes

 f(n, m, k) = |{sign X | X ∈ R^{n×m}, rank X ≤ k}|

where sign X denotes the element-wise sign matrix, (sign X)ia = +1 if Xia > 0; 0 if
Xia = 0; -1 if Xia < 0.

For all matrices in an equivalence class, the random variable DS(X; Y) is the same. For
any single equivalence class, Hoeffding's inequality (which holds also for sampling without
replacement) gives

 Pr_S( D(X; Y) ≥ DS(X; Y) + ε ) ≤ e^{-2ε²|S|}   (2)

and taking a union bound of the events D(X; Y) ≥ DS(X; Y) + ε for each of these f(n, m, k)
random variables we have:

 Pr_S( ∃X, rank X ≤ k : D(X; Y) ≥ DS(X; Y) + √((log f(n, m, k) - log δ) / (2|S|)) ) ≤ δ   (3)

by using (2) and setting ε = √((log f(n, m, k) - log δ) / (2|S|)). 
The proof of Theorem 1 rests on bounding
f(n, m, k), which we will do in the next section.

Note that since the equivalence classes we defined do not depend on the sample set, no
symmetrization argument is necessary.

 2All logarithms are base two

3 Sign Configurations of a Low-Rank Matrix

In this section, we bound the number f(n, m, k) of sign configurations of n × m rank-
k matrices over the reals. Such a bound was previously considered in the context of
unbounded error communication complexity. Alon, Frankl and Rödl [8] showed that
f(n, m, k) ≤ min_h (8⌈nm/h⌉)^{(n+m)k+h+m}, and used counting arguments to establish
that some (in fact, most) binary matrices can only be realized by high-rank matrices, and
therefore correspond to functions with high unbounded error communication complexity.

Here, we follow a general course outlined by Alon [9] to obtain a simpler, and slightly
tighter, bound based on the following result due to Warren:

Let P1, . . . , Pr be real polynomials in q variables, and let C be the complement of the
variety defined by Π_i Pi, i.e. the set of points in which all r polynomials are non-zero:

 C = {x ∈ R^q | ∀i Pi(x) ≠ 0}

Theorem 2 (Warren [10]). If all r polynomials are of degree at most d, then the number
of connected components of C is at most:

 c(C) ≤ 2(2d)^q Σ_{i=0}^{q} 2^i (r choose i) ≤ (4edr/q)^q

where the second inequality holds when r > q > 2.

The signs of the polynomials P1, . . . , Pr are fixed inside each connected component of C.
And so, c(C) bounds the number of sign configurations of P1, . . . , Pr that do not contain
zeros. To bound the overall number of sign configurations the polynomials are modified
slightly (see Appendix), yielding:

Corollary 3 ([9, Proposition 5.5]). 
The number of -/0/+ sign configurations of r polyno-
mials, each of degree at most d, over q variables, is at most (8edr/q)^q (for r > q > 2).

In order to apply these bounds to low-rank matrices, recall that any matrix X of rank at
most k can be written as a product X = UV where U ∈ R^{n×k} and V ∈ R^{k×m}. Consider
the k(n+m) entries of U, V as variables, and the nm entries of X as polynomials of degree
two over these variables:

 Xia = Σ_{α=1}^{k} U_iα V_αa

Applying Corollary 3 we obtain:

Lemma 4. f(n, m, k) ≤ (8e·2·nm / (k(n+m)))^{k(n+m)} ≤ (16em/k)^{k(n+m)}

Substituting this bound in (3) establishes Theorem 1. The upper bound on f(n, m, k) is
tight up to a multiplicative factor in the exponent:

Lemma 5. For m > k², f(n, m, k) ≥ m^{(k-1)n/2}

Proof. Fix any matrix V ∈ R^{m×k} with rows in general position, and consider the number
f(n, V, k) of sign configurations of matrices UV⊤, where U varies over all n × k matrices.
Focusing only on +/- sign configurations (no zeros in UV⊤), each row of sign UV⊤ is a
homogeneous linear classification of the rows of V, i.e. of m vectors in general position
in R^k. There are exactly 2 Σ_{i=0}^{k-1} (m-1 choose i) possible homogeneous linear
classifications of m vectors in general position in R^k, and so this many options for each
row of sign UV⊤. We can therefore bound:

 f(n, m, k) ≥ f(n, V, k) ≥ (2 Σ_{i=0}^{k-1} (m-1 choose i))^n ≥ (m-1 choose k-1)^n
 ≥ ((m-1)/(k-1))^{(k-1)n} ≥ m^{(k-1)n/2}

4 Generalization Error Bounds for Other Loss Functions

In Section 2 we considered generalization error bounds for a zero-one loss function. More
commonly, though, other loss functions are used, and it is desirable to obtain generalization
error bounds for general loss functions.

When dealing with other loss functions, the magnitudes of the entries in the matrix are
important, and not only their signs. It is therefore no longer enough to bound the number
of sign configurations. 
Instead, we will bound not only the number of ways low-rank
matrices behave with regard to a threshold of zero, but the number of possible ways low-
rank matrices can behave relative to any set of thresholds. That is, for any threshold matrix
T ∈ R^{n×m}, we will show that the number of possible sign configurations of (X - T),
where X is low-rank, is small. Intuitively, this captures the complexity of the class of
low-rank matrices not only around zero, but throughout all possible values.

We then use standard results from statistical machine learning to obtain generalization error
bounds from the bound on the number of relative sign configurations. The bound on the
number of relative sign configurations yields a bound on the pseudodimension--the maximum
number of entries for which there exists a set of thresholds such that every relative sign
configuration (limited to these entries) is possible. The pseudodimension can in turn be
used to show the existence of a small ε-net, which is used to obtain generalization error
bounds.

Recall the definition of the pseudodimension of a class of real-valued functions:

Definition 1. A class F of real-valued functions pseudo-shatters the points x1, . . . , xn
with thresholds t1, . . . , tn if for every binary labeling of the points (s1, . . . , sn) ∈ {+, -}^n
there exists f ∈ F s.t. f(xi) < ti iff si = -. The pseudodimension of a class F is the
supremum over n for which there exist n points and thresholds that can be shattered.

In order to apply known results linking the pseudodimension to covering numbers, we
consider matrices X ∈ R^{n×m} as real-valued functions X : [n] × [m] → R from index
pairs to entries in the matrix. The class Xk of rank-k matrices can now be seen as a class
of real-valued functions over the domain [n] × [m]. 
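As a toy illustration of Definition 1, pseudo-shattering can be checked by brute force: enumerate which labelings are realized relative to the thresholds. The two small function classes below (affine and constant functions on the line) are hypothetical examples for illustration, not objects from the paper.

```python
from itertools import product

def pseudo_shatters(functions, points, thresholds):
    # Definition 1: every labeling s in {+,-}^n must be realized by some f,
    # with f(x_i) >= t_i exactly at the points labeled '+'.
    n = len(points)
    realized = {
        tuple(f(x) >= t for x, t in zip(points, thresholds))
        for f in functions
    }
    return len(realized) == 2 ** n

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Affine functions x -> a*x + b realize all four labelings of two points...
affine = [lambda x, a=a, b=b: a * x + b for a, b in product(grid, grid)]
# ...while constant functions realize only the all-'+' and all-'-' labelings.
constants = [lambda x, c=c: c for c in grid]

print(pseudo_shatters(affine, [0.0, 1.0], [0.0, 0.0]))     # True
print(pseudo_shatters(constants, [0.0, 1.0], [0.0, 0.0]))  # False
```

So the affine class pseudo-shatters these two points (its pseudodimension is at least 2), while the constant class does not.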
We bound the pseudodimension of
this class by bounding, for any threshold matrix T ∈ R^{n×m}, the number of relative sign
matrices:

 FT(n, m, k) = {sign(X - T) ∈ {-, 0, +}^{n×m} | X ∈ R^{n×m}, rank X ≤ k}
 fT(n, m, k) = |FT(n, m, k)|

Lemma 6. For any T ∈ R^{n×m}, we have fT(n, m, k) ≤ (16em/k)^{k(n+m)}.


Proof. We take a similar approach to that of Lemma 4, writing rank-k matrices as a product
X = UV where U ∈ R^{n×k} and V ∈ R^{k×m}. Consider the k(n + m) entries of U, V as
variables, and the nm entries of X - T as polynomials of degree two over these variables:

 (X - T)ia = Σ_{α=1}^{k} U_iα V_αa - T_ia

Applying Corollary 3 yields the desired bound.


Corollary 7. The pseudodimension of the class Xk of n × m matrices over the reals of
rank at most k, is at most k(n + m) log(16em/k).


We can now invoke standard generalization error bounds in terms of the pseudodimension
(Theorem 11 in the Appendix) to obtain:

Theorem 8. For any monotone loss function with |loss| ≤ M, any matrix Y ∈ {±1}^{n×m},
n, m > 2, δ > 0 and integer k, with probability at least 1 - δ over choosing a subset S of
entries in Y uniformly among all subsets of |S| entries:

 D(X; Y) < DS(X; Y) + √((k(n + m) log(16em/k) log(M|S|/(k(n+m))) - log δ) / |S|)
   for all X, rank X ≤ k

> 0 such that
(2xB + 1) > 2xA and 2yB < (2yA + 1). It follows that for any A < B and any
α, β > 0, on an initial segment (possibly empty) of N we have gB fB gA fA
while on the rest of N we have gA fA < gB fB. In particular, any pair of functions
(fA, fB) or (fA, gB) or (gA, gB) in F that are not associated with the same subset
(i.e. A ≠ B) cross each other at most once. This holds also when α or β are negative,
as the functions never change signs.

 3We use A to refer both to a positive integer and the finite set it maps to.

For any six naturals x1 < x2 < · · · < x6 and six thresholds, consider the three labellings
(+, -, +, -, +, -), (-, +, -, +, -, +), (+, +, -, -, +, +). 
The three functions realizing
these labellings must cross each other at least twice, but by the above arguments, there are
no three functions in F such that every pair crosses each other at least twice.4


6 Discussion

Alon, Frankl and Rödl [8] use a result of Milnor similar to Warren's Theorem 2. Milnor's
and Warren's theorems were previously used for bounding the VC-dimension of certain
geometric classes [13], and of general concept classes parametrized by real numbers, in
terms of the complexity of the boolean formulas over polynomials used to represent them
[14]. This last general result can be used to bound the VC-dimension of signs of n × m rank-
k matrices by 2k(n + m) log(48enm), yielding a bound similar to Theorem 1 with an extra
log |S| term. In this paper, we take a simpler path, applying Warren's theorem directly, and
thus avoiding the log |S| term and reducing the other logarithmic term. Applying Warren's
theorem directly also enables us to bound the pseudodimension and obtain the bound of
Theorem 8 for general loss functions.

Another notable application of Milnor's result, which likely inspired these later uses, is for
bounding the number of configurations of n points in R^d with different possible linear clas-
sifications [15, 16]. Viewing signs of rank-k n × m matrices as n linear classifications of m
points in R^k, this bound can be used to bound f(n, m, k) < 2^{km log 2n + k(k+1)n log n} with-
out using Warren's Theorem directly [8, 12]. The bound of Lemma 4 avoids the quadratic
dependence on k in the exponent.


Acknowledgments We would like to thank Peter Bartlett for pointing out [13, 14]. N.S. and
T.J. would like to thank Erik Demaine for introducing them to oriented matroids.


A Proof of Corollary 3

Consider a set R ⊂ R^q containing one variable configuration for each possible sign pattern.
Set ε = (1/2) min{ |Pi(x)| : 1 ≤ i ≤ r, x ∈ R, Pi(x) ≠ 0 } > 0. 
Now consider the 2r polynomials Pi+(x) = Pi(x) + ε
and Pi-(x) = Pi(x) - ε, and let C' = {x ∈ R^q | ∀i Pi+(x) ≠ 0, Pi-(x) ≠ 0}. Different points in R
(representing all sign configurations) lie in different connected components of C'. Invoking Theorem
2 on C' establishes Corollary 3.

The count in Corollary 3 differentiates between positive, negative and zero signs. However, we are
only concerned with the positivity of YiaXia (in the proof of Theorem 1) or of Xia - Tia (in the
proof of Theorem 8), and do not need to differentiate between zero and negative values. Invoking
Theorem 2 on C+ = {x ∈ R^q | ∀i Pi+(x) ≠ 0}, yields:

Corollary 10. The number of -/+ sign configurations (where zero is considered negative) of r poly-
nomials, each of degree at most d, over q variables, is at most (4edr/q)^q (for r > q > 2).


Applying Corollary 10 to the nm degree-two polynomials Yia Σ_{α=1}^{k} U_iα V_αa establishes that for
any Y, the number of configurations of sign agreements of rank-k matrices with Y is bounded by
(8em/k)^{k(n+m)} and yields a constant of 8 instead of 16 inside the logarithm in Theorem 1. Applying
Corollary 10 instead of Corollary 3 allows us to similarly tighten the bounds in Corollary 7 and in
Theorem 8.

 4A more careful analysis shows that F has pseudodimension three.

B Generalization Error Bound in terms of the Pseudodimension

Theorem 11. Let F be a class of real-valued functions f : X → R with pseudodimension d, and
loss : R × Y → R be a bounded monotone loss function (i.e. for all y, loss(x, y) is mono-
tone in x), with |loss| < M. For any joint distribution over (X, Y), consider an i.i.d. sample
S = (X1, Y1), . . . , (Xn, Yn). 
Then for any ε > 0:

 Pr_S( ∃f ∈ F : E_{X,Y}[loss(f(X), Y)] > (1/n) Σ_{i=1}^{n} loss(f(Xi), Yi) + ε )
   < 4e(d + 1) (32eM/ε)^d e^{-ε²n/(32M²)}

The bound is a composition of a generalization error bound in terms of the L1 covering number [17,
Theorem 17.1], a bound on the L1 covering number in terms of the pseudodimension [18] and the
observation that composition with a monotone function does not increase the pseudodimension [17,
Theorem 12.3].


References

 [1] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst.,
 22(1):89–115, 2004.

 [2] Nathan Srebro and Tommi Jaakkola. Weighted low rank approximation. In 20th International
 Conference on Machine Learning, 2003.

 [3] Yossi Azar, Amos Fiat, Anna R. Karlin, Frank McSherry, and Jared Saia. Spectral analysis of
 data. In ACM Symposium on Theory of Computing, pages 619–626, 2001.

 [4] Petros Drineas, Iordanis Kerenidis, and Prabhakar Raghavan. Competitive recommendation
 systems. In ACM Symposium on Theory of Computing, 2002.

 [5] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Pro-
 cessing Systems, volume 14, 2002.

 [6] Sanjoy Dasgupta, Wee Sun Lee, and Philip M. Long. A theoretical analysis of query selection
 for collaborative filtering. Machine Learning, 51(3):283–298, 2003.

 [7] John Shawe-Taylor, Nello Cristianini, and Jaz Kandola. On the concentration of spectral prop-
 erties. In Advances in Neural Information Processing Systems, volume 14, 2002.

 [8] N. Alon, P. Frankl, and V. Rödl. Geometrical realization of set systems and probabilistic
 communication complexity. In Foundations of Computer Science (FOCS), 1985.

 [9] Noga Alon. Tools from higher algebra. In R.L. Graham, M. Grötschel, and L. Lovász, editors,
 Handbook of Combinatorics, chapter 32, pages 1749–1783. North Holland, 1995.

[10] H. E. Warren. Lower bounds for approximation by nonlinear manifolds. 
Transactions of the
 American Mathematical Society, 133:167–178, 1968.

[11] Nathan Srebro, Jason Rennie, and Tommi Jaakkola. Maximum margin matrix factorization. In
 Advances in Neural Information Processing Systems, volume 17, 2005.

[12] Nathan Srebro. Learning with Matrix Factorization. PhD thesis, Massachusetts Institute of
 Technology, 2004.

[13] Shai Ben-David and Michael Lindenbaum. Localization vs. identification of semi-algebraic
 sets. Machine Learning, 32(3):207–224, 1998.

[14] Paul Goldberg and Mark Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept
 classes parameterized by real numbers. Machine Learning, 18(2-3):131–148, 1995.

[15] Jacob Goodman and Richard Pollack. Upper bounds for configurations and polytopes in R^d.
 Discrete and Computational Geometry, 1:219–227, 1986.

[16] Noga Alon. The number of polytopes, configurations and real matroids. Mathematika, 33:62–
 71, 1986.

[17] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations.
 Cambridge University Press, 1999.

[18] David Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded
 Vapnik-Chervonenkis dimension. J. Comb. Theory, Ser. A, 69(2):217–232, 1995.
", "award": [], "sourceid": 2700, "authors": [{"given_name": "Nathan", "family_name": "Srebro", "institution": null}, {"given_name": "Noga", "family_name": "Alon", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}