{"title": "A Denoising View of Matrix Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 334, "page_last": 342, "abstract": "In matrix completion, we are given a matrix where the values of only some of the entries are present, and we want to reconstruct the missing ones. Much work has focused on the assumption that the data matrix has low rank. We propose a more general assumption based on denoising, so that we expect that the value of a missing entry can be predicted from the values of neighboring points. We propose a nonparametric version of denoising based on local, iterated averaging with mean-shift, possibly constrained to preserve local low-rank manifold structure. The few user parameters required (the denoising scale, number of neighbors and local dimensionality) and the number of iterations can be estimated by cross-validating the reconstruction error. Using our algorithms as a postprocessing step on an initial reconstruction (provided by e.g. a low-rank method), we show consistent improvements with synthetic, image and motion-capture data.", "full_text": "A Denoising View of Matrix Completion\n\nWeiran Wang\n\nMiguel \u00b4A. Carreira-Perpi\u02dcn\u00b4an\n\nEECS, University of California, Merced\n\nZhengdong Lu\n\nMicrosoft Research Asia, Beijing\n\nhttp://eecs.ucmerced.edu\n\nzhengdol@microsoft.com\n\nAbstract\n\nIn matrix completion, we are given a matrix where the values of only some of the\nentries are present, and we want to reconstruct the missing ones. Much work has\nfocused on the assumption that the data matrix has low rank. We propose a more\ngeneral assumption based on denoising, so that we expect that the value of a miss-\ning entry can be predicted from the values of neighboring points. We propose a\nnonparametric version of denoising based on local, iterated averaging with mean-\nshift, possibly constrained to preserve local low-rank manifold structure. 
The few user parameters required (the denoising scale, number of neighbors and local dimensionality) and the number of iterations can be estimated by cross-validating the reconstruction error. Using our algorithms as a postprocessing step on an initial reconstruction (provided by e.g. a low-rank method), we show consistent improvements with synthetic, image and motion-capture data.

Completing a matrix from a few given entries is a fundamental problem with many applications in machine learning, computer vision, network engineering, and data mining. Much interest in matrix completion has been caused by recent theoretical breakthroughs in compressed sensing [1, 2] as well as by the now celebrated Netflix challenge on practical prediction problems [3, 4]. Since completion of arbitrary matrices is not a well-posed problem, it is often assumed that the underlying matrix comes from a restricted class. Matrix completion models almost always assume a low-rank structure of the matrix, which is partially justified through factor models [4] and fast convex relaxation [2], and often works quite well when the observations are sparse and/or noisy. The low-rank structure of the matrix essentially asserts that all the column vectors (or the row vectors) live on a low-dimensional subspace. This assumption is arguably too restrictive for problems with richer structure, e.g. when each column of the matrix represents a snapshot of a seriously corrupted motion capture sequence (see section 3), for which a more flexible model, namely a curved manifold, is more appropriate.

In this paper, we present a novel view of matrix completion based on manifold denoising, which conceptually generalizes the low-rank assumption to curved manifolds. Traditional manifold denoising is performed on fully observed data [5, 6], aiming to send the data corrupted by noise back to the correct surface (defined in some way).
However, with a large proportion of missing entries, we may not have a good estimate of the manifold. Instead, we start with a poor estimate and improve it iteratively. Therefore the "noise" may be due not just to intrinsic noise, but mostly to inaccurately estimated missing entries. We show that our algorithm can be motivated from an objective purely based on denoising, and prove its convergence under some conditions. We then consider a more general case with a nonlinear low-dimensional manifold and use a stopping criterion that works successfully in practice. Our model reduces to a low-rank model when we require the manifold to be flat, showing a relation with a recent thread of matrix completion models based on alternating projection [7]. In our experiments, we show that our denoising-based matrix completion model can make better use of the latent manifold structure on both artificial and real-world data sets, and yields superior recovery of the missing entries.

The paper is organized as follows: section 1 reviews nonparametric denoising methods based on mean-shift updates, section 2 extends this to matrix completion by using denoising with constraints, section 3 gives experimental results, and section 4 discusses related work.

1 Denoising with (manifold) blurring mean-shift algorithms (GBMS/MBMS)

In Gaussian blurring mean-shift (GBMS), denoising is performed in a nonparametric way by local averaging: each data point moves to the average of its neighbors (to a certain scale), and the process is repeated. We follow the derivation in [8]. Consider a dataset $\{x_n\}_{n=1}^N \subset \mathbb{R}^D$ and define a Gaussian kernel density estimate

$$p(x) = \frac{1}{N} \sum_{n=1}^N G_\sigma(x, x_n) \qquad (1)$$

with bandwidth $\sigma > 0$ and kernel $G_\sigma(x, x_n) \propto \exp\big({-\tfrac{1}{2}(\|x - x_n\|/\sigma)^2}\big)$ (other kernels may be used, such as the Epanechnikov kernel, which results in sparse affinities). The (non-blurring) mean-shift algorithm rearranges the stationary point equation $\nabla p(x) = 0$ into the iterative scheme $x^{(\tau+1)} = f(x^{(\tau)})$ with

$$x^{(\tau+1)} = f(x^{(\tau)}) = \sum_{n=1}^N p(n|x^{(\tau)})\, x_n, \qquad p(n|x^{(\tau)}) = \frac{\exp\big({-\tfrac{1}{2}\|(x^{(\tau)} - x_n)/\sigma\|^2}\big)}{\sum_{n'=1}^N \exp\big({-\tfrac{1}{2}\|(x^{(\tau)} - x_{n'})/\sigma\|^2}\big)}. \qquad (2)$$

This converges to a mode of $p$ from almost every initial $x \in \mathbb{R}^D$, and can be seen as taking self-adapting step sizes along the gradient (since the mean shift $f(x) - x$ is parallel to $\nabla p(x)$). This iterative scheme was originally proposed by [9] and it or variations of it have found widespread application in clustering [8, 10-12] and denoising of 3D point sets (surface fairing; [13, 14]) and manifolds in general [5, 6].

The blurring mean-shift algorithm applies one step of the previous scheme, initialized from every point, in parallel for all points. That is, given the dataset $X = \{x_1, \ldots, x_N\}$, for each $x_n \in X$ we obtain a new point $\tilde{x}_n = f(x_n)$ by applying one step of the mean-shift algorithm, and then we replace $X$ with the new dataset $\tilde{X}$, which is a blurred (shrunk) version of $X$. By iterating this process we obtain a sequence of datasets $X^{(0)}, X^{(1)}, \ldots$ (and a corresponding sequence of kernel density estimates $p^{(0)}(x), p^{(1)}(x), \ldots$) where $X^{(0)}$ is the original dataset and $X^{(\tau)}$ is obtained by blurring $X^{(\tau-1)}$ with one mean-shift step.
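As a concrete illustration, a minimal NumPy sketch of one blurring mean-shift iteration (the function name and interface are our own, not from the paper): each point is replaced by the Gaussian-weighted average of the whole dataset, and iterating shrinks the dataset toward a single point.

```python
import numpy as np

def gbms_step(X, sigma):
    """One GBMS iteration on a D x N dataset (one point per column):
    every point moves to the Gaussian-weighted average of all points."""
    sqd = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # N x N squared distances
    W = np.exp(-0.5 * sqd / sigma**2)                         # w_nm = G_sigma(x_n, x_m)
    P = W / W.sum(axis=0, keepdims=True)                      # p(n|x_m): columns sum to 1
    return X @ P                                              # x_m <- sum_n p(n|x_m) x_n

# Iterating fully erases all structure (all points become coincident),
# so in practice the process must be stopped early.
rng = np.random.default_rng(0)
X = rng.uniform(size=(2, 50))
for _ in range(30):
    X = gbms_step(X, sigma=1.0)
```

With a bandwidth comparable to the data spread, a few dozen iterations already collapse the point set almost exactly onto one location.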
We can see this process as maximizing the following objective function [10] by taking parallel steps of the form (2) for each point:

$$E(X) = \sum_{n=1}^N p(x_n) = \frac{1}{N} \sum_{n,m=1}^N G_\sigma(x_n, x_m) \propto \sum_{n,m=1}^N e^{-\frac{1}{2}\left\|\frac{x_n - x_m}{\sigma}\right\|^2}. \qquad (3)$$

This process eventually converges to a dataset $X^{(\infty)}$ where all points are coincident: a completely denoised dataset where all structure has been erased. As shown by [8], this process can be stopped early to return clusters (= locally denoised subsets of points); the number of clusters obtained is controlled by the bandwidth $\sigma$. However, here we are interested in the denoising behavior of GBMS.

The GBMS step can be formulated in a matrix form reminiscent of spectral clustering [8] as $\tilde{X} = XP$ where $X = (x_1, \ldots, x_N)$ is a $D \times N$ matrix of data points; $W$ is the $N \times N$ matrix of Gaussian affinities $w_{nm} = G_\sigma(x_n, x_m)$; $D = \mathrm{diag}\big(\sum_{n=1}^N w_{nm}\big)$ is the degree matrix; and $P = WD^{-1}$ is an $N \times N$ stochastic matrix: $p_{nm} = p(n|x_m) \in (0,1)$ and $\sum_{n=1}^N p_{nm} = 1$. $P$ (or rather its transpose) is the stochastic matrix of the random walk in a graph [15], which in GBMS represents the posterior probabilities of each point under the kernel density estimate (1). $P$ is similar to the matrix $N = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$ derived from the normalized graph Laplacian commonly used in spectral clustering, e.g. in the normalized cut [16]. Since, by the Perron-Frobenius theorem [17, ch. 8], all left eigenvalues of $P(X)$ have magnitude less than 1 except for one that equals 1 and is associated with an eigenvector of constant entries, iterating $\tilde{X} = XP(X)$ converges to the stationary distribution of each $P(X)$, where all points coincide.

From this point of view, the product $\tilde{X} = XP(X)$ can be seen as filtering the dataset $X$ with a data-dependent low-pass filter $P(X)$, which makes clear the denoising behavior.
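The low-pass behavior can be checked numerically by building $P = WD^{-1}$ for a small dataset and inspecting its spectrum; a small sketch (our own illustration, with our own helper name):

```python
import numpy as np

def random_walk_matrix(X, sigma):
    """P = W D^{-1}: column-stochastic random-walk matrix of the Gaussian graph."""
    sqd = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    W = np.exp(-0.5 * sqd / sigma**2)
    return W / W.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 40))
P = random_walk_matrix(X, sigma=1.0)
mags = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
# One eigenvalue equals 1 (with the constant left eigenvector); every other
# eigenvalue is strictly smaller in magnitude, so iterating X <- X P damps
# all non-constant components of the dataset: a low-pass filter.
```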
This also suggests using other filters [12] $\tilde{X} = X\phi(P(X))$ as long as $\phi(1) = 1$ and $|\phi(r)| < 1$ for $r \in [0,1)$, such as explicit schemes $\phi(P) = (1-\eta)I + \eta P$ for $\eta \in (0,2]$, power schemes $\phi(P) = P^n$ for $n = 1, 2, 3, \ldots$ or implicit schemes $\phi(P) = ((1+\eta)I - \eta P)^{-1}$ for $\eta > 0$.

One important problem with GBMS is that it denoises equally in all directions. When the data lies on a low-dimensional manifold, denoising orthogonally to it removes out-of-manifold noise, but denoising tangentially to it perturbs intrinsic degrees of freedom of the data and causes shrinkage of the entire manifold (most strongly near its boundary). To prevent this, the manifold blurring mean-shift algorithm (MBMS) [5] first computes a predictor averaging step with GBMS, and then for each point $x_n$ a corrector projective step removes the step direction that lies in the local tangent space of $x_n$ (obtained from local PCA run on its $k$ nearest neighbors). In practice, both GBMS and MBMS must be stopped early to prevent excessive denoising and manifold distortions.

2 Blurring mean-shift denoising algorithms for matrix completion

We consider the natural extension of GBMS to the matrix completion case by adding the constraints given by the present values. We use the subindex notation $X_{\mathcal{M}}$ and $X_{\mathcal{P}}$ to indicate selection of the missing or present values of the matrix $X_{D \times N}$, where $\mathcal{P} \subset \mathcal{U}$, $\mathcal{M} = \mathcal{U} \setminus \mathcal{P}$ and $\mathcal{U} = \{(d,n)\colon d = 1, \ldots, D,\ n = 1, \ldots, N\}$. The indices $\mathcal{P}$ and values $X_{\mathcal{P}}$ of the present matrix entries are the data of the problem. Then we have the following constrained optimization problem:

$$\max_X \; E(X) = \sum_{n,m=1}^N G_\sigma(x_n, x_m) \quad \text{s.t.} \quad X_{\mathcal{P}} = X_{\mathcal{P}} \qquad (4)$$

(i.e., the present entries are fixed to their given values). This is similar to low-rank formulations for matrix completion that have the same constraints but use as objective function the reconstruction error with a low-rank assumption, e.g.
$\|X - ABX\|^2$ with $A_{D \times L}$, $B_{L \times D}$ and $L < D$.

We initialize $X_{\mathcal{M}}$ to the output of some other method for matrix completion, such as singular value projection (SVP; [7]). For simple constraints such as ours, gradient projection algorithms are attractive. The gradient of $E$ wrt $X$ is a matrix of $D \times N$ whose $n$th column is:

$$\nabla_{x_n} E(X) = \frac{2}{\sigma^2} \sum_{m=1}^N e^{-\frac{1}{2}\left\|\frac{x_n - x_m}{\sigma}\right\|^2} (x_m - x_n) \propto \frac{2}{\sigma^2}\, p(x_n) \Big( -x_n + \sum_{m=1}^N p(m|x_n)\, x_m \Big) \qquad (5)$$

and its projection on the constraint space is given by zeroing its entries having indices in $\mathcal{P}$; call $\Pi_{\mathcal{P}}$ this projection operator. Then, we have the following step of length $\alpha \ge 0$ along the projected gradient:

$$X^{(\tau+1)} = X^{(\tau)} + \alpha\, \Pi_{\mathcal{P}}\big(\nabla_X E(X^{(\tau)})\big) \iff X^{(\tau+1)}_{\mathcal{M}} = X^{(\tau)}_{\mathcal{M}} + \alpha \big( \Pi_{\mathcal{P}}(\nabla_X E(X^{(\tau)})) \big)_{\mathcal{M}} \qquad (6)$$

which updates only the missing entries $X_{\mathcal{M}}$. Since our search direction is ascent and makes an angle with the gradient that is bounded away from $\pi/2$, and $E$ is lower bounded, continuously differentiable and has bounded Hessian (thus a Lipschitz continuous gradient) in $\mathbb{R}^{NL}$, by carrying out a line search that satisfies the Wolfe conditions, we are guaranteed convergence to a local stationary point, typically a maximizer [18, th. 3.2]. However, as reasoned later, we do not perform a line search at all; instead we fix the step size to the GBMS self-adapting step size, which results in a simple and faster algorithm consisting of carrying out a GBMS step on $X$ (i.e., $X^{(\tau+1)} = X^{(\tau)} P(X^{(\tau)})$) and then refilling $X_{\mathcal{P}}$ to the present values. While we describe the algorithm in this way for ease of explanation, in practice we do not actually compute the GBMS step for all $x_{dn}$ values, but only for the missing ones, which is all we need. Thus, our algorithm carries out GBMS denoising steps within the missing-data subspace.
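In code, the resulting update is just a denoising step followed by refilling the present values; a minimal sketch under our own naming conventions (for clarity it denoises every entry and then overwrites the present ones, whereas a real implementation would compute the step only for the missing entries):

```python
import numpy as np

def gbms_complete_step(X, present, values, sigma):
    """One matrix-completion iteration: a GBMS step with the self-adapting
    step size, then restore the present entries (the constraint on X_P).
    X: D x N current reconstruction; present: boolean D x N mask;
    values: the given values at the present entries."""
    sqd = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    W = np.exp(-0.5 * sqd / sigma**2)
    P = W / W.sum(axis=0, keepdims=True)
    Xn = X @ P                 # denoise every entry ...
    Xn[present] = values       # ... then refill the present ones
    return Xn

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 20))
present = rng.random((3, 20)) < 0.5
values = X[present]
Xn = gbms_complete_step(X, present, values, sigma=1.0)
```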
We can derive this result in a different way by starting from the unconstrained optimization problem $\max_{X_{\mathcal{M}}} E(X) = \sum_{n,m=1}^N G_\sigma(x_n, x_m)$ (equivalent to (4)), computing its gradient wrt $X_{\mathcal{M}}$, equating it to zero and rearranging (in the same way the mean-shift algorithm is derived) to obtain a fixed-point iteration identical to our update above.

Fig. 1 shows the pseudocode for our denoising-based matrix completion algorithms (using three nonparametric denoising algorithms: GBMS, MBMS and LTP).

Convergence and stopping criterion As noted above, we have guaranteed convergence by simply satisfying standard line search conditions, but a line search is costly. At present we do not have a proof that the GBMS step size satisfies such conditions, or indeed that the new iterate $X^{(\tau+1)}_{\mathcal{M}}$ increases or leaves unchanged the objective, although we have never encountered a counterexample. In fact, it turns out that none of the work about GBMS that we know about proves that either: [10] proves that $\mathrm{diam}(X^{(\tau+1)}) \le \rho\, \mathrm{diam}(X^{(\tau)})$ for $0 < \rho < 1$, where $\mathrm{diam}(\cdot)$ is the set diameter, while [8, 12] notes that $P(X)$ has a single eigenvalue of value 1 and all others of magnitude less than 1. While this shows that all points converge to the same location, which indeed is the global maximum of (3), it does not necessarily follow that each step increases E.

GBMS (k, σ) with full or k-nn graph: given X_{D×N}, M
  repeat
    for n = 1, ..., N
      N_n ← {1, ..., N} (full graph) or k nearest neighbors of x_n (k-nn graph)
      ∂x_n ← −x_n + Σ_{m∈N_n} [G_σ(x_n, x_m) / Σ_{m'∈N_n} G_σ(x_n, x_{m'})] x_m    (mean-shift step)
    end
    X_M ← X_M + (∂X)_M    (move points' missing entries)
  until validation error increases
  return X

MBMS (L, k, σ) with full or k-nn graph: given X_{D×N}, M
  repeat
    for n = 1, ..., N
      N_n ← {1, ..., N} (full graph) or k nearest neighbors of x_n (k-nn graph)
      ∂x_n ← −x_n + Σ_{m∈N_n} [G_σ(x_n, x_m) / Σ_{m'∈N_n} G_σ(x_n, x_{m'})] x_m    (mean-shift step)
      X_n ← k nearest neighbors of x_n
      (µ_n, U_n) ← PCA(X_n, L)    (estimate L-dim tangent space at x_n)
      ∂x_n ← (I − U_n U_n^T) ∂x_n    (subtract parallel motion)
    end
    X_M ← X_M + (∂X)_M    (move points' missing entries)
  until validation error increases
  return X

LTP (L, k) with k-nn graph: given X_{D×N}, M
  repeat
    for n = 1, ..., N
      X_n ← k nearest neighbors of x_n
      (µ_n, U_n) ← PCA(X_n, L)    (estimate L-dim tangent space at x_n)
      ∂x_n ← (I − U_n U_n^T)(µ_n − x_n)    (project point onto tangent space)
    end
    X_M ← X_M + (∂X)_M    (move points' missing entries)
  until validation error increases
  return X

Figure 1: Our denoising matrix completion algorithms, based on Manifold Blurring Mean Shift (MBMS) and its particular cases Local Tangent Projection (LTP, k-nn graph, σ = ∞) and Gaussian Blurring Mean Shift (GBMS, L = 0); see [5] for details. N_n contains all N points (full graph) or only x_n's nearest neighbors (k-nn graph). The index M selects the components of its input corresponding to missing values. Parameters: denoising scale σ, number of neighbors k, local dimensionality L.

However, the question of convergence as τ → ∞ has no practical interest in a denoising setting, because achieving a total denoising almost never yields a good matrix completion. What we want is to achieve just enough denoising and stop the algorithm, as was the case with GBMS clustering, and as is the case in algorithms for image denoising. We propose to determine the optimal number of iterations, as well as the bandwidth σ and any other parameters, by cross-validation.
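A minimal sketch of this cross-validated stopping loop (our own helper names and interface; the denoiser is passed in as a function, and the held-out entries are crudely re-initialized to the row means, which stands in for whatever base completion is used in practice):

```python
import numpy as np

def complete_until_val_error_rises(X0, present, denoise, val_frac=0.1,
                                   max_iter=100, seed=0):
    """Iterate: denoise, refill present values, stop when the error on a
    held-out subset of the present entries stops improving. X0 is D x N
    with the present entries at their given values."""
    rng = np.random.default_rng(seed)
    val = present & (rng.random(X0.shape) < val_frac)   # held-out present entries
    train = present & ~val
    truth = X0[val]

    X = X0.copy()
    X[val] = np.broadcast_to(X0.mean(axis=1, keepdims=True), X0.shape)[val]
    best_err, best_X = np.inf, X.copy()
    for _ in range(max_iter):
        Xn = denoise(X)
        Xn[train] = X0[train]                 # refill the (training) present values
        err = np.sqrt(((Xn[val] - truth) ** 2).sum())
        if err >= best_err:                   # validation error stopped improving
            break
        best_err, best_X, X = err, Xn.copy(), Xn
    return best_X
```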
Specifically, we select a held-out set by picking a random subset of the present entries and considering them as missing; this allows us to evaluate an error between our completion for them and the ground truth. We stop iterating when this error increases.

This argument justifies an algorithmic, as opposed to an optimization, view of denoising-based matrix completion: apply a denoising step, refill the present values, iterate until the validation error increases. This allows very general definitions of denoising, and indeed a low-rank projection is a form of denoising where points are not allowed outside the linear manifold. Our formulation using the objective function (4) is still useful in that it connects our denoising assumption with the more usual low-rank assumption that has been used in much matrix completion work, and justifies the refilling step as resulting from the present-data constraints under a gradient-projection optimization.

MBMS denoising for matrix completion Following our algorithmic-based approach to denoising, we could consider generalized GBMS steps of the form $\tilde{X} = X\phi(P(X))$. For clustering, Carreira-Perpiñán [12] found an overrelaxed explicit step $\phi(P) = (1-\eta)I + \eta P$ with $\eta \approx 1.25$ to achieve similar clusterings but faster. Here, we focus instead on the MBMS variant of GBMS that allows only for orthogonal, not tangential, point motions (defined wrt their local tangent space as estimated by local PCA), with the goal of preserving low-dimensional manifold structure. MBMS has 3 user parameters: the bandwidth σ (for denoising), and the latent dimensionality L and the number of neighbors k (for the local tangent space and the neighborhood graph).
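An MBMS step can be sketched as follows (our own minimal implementation, following the predictor/corrector description above; for clarity it uses a full graph for the mean-shift average and recomputes neighbors at every call):

```python
import numpy as np

def mbms_step(X, sigma, L, k):
    """One MBMS iteration on a D x N dataset: a GBMS (predictor) step per
    point, then a corrector that removes the motion component lying in the
    local tangent space, estimated by PCA on the k nearest neighbors.
    L = 0 recovers plain GBMS."""
    D, N = X.shape
    sqd = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    Xnew = X.copy()
    for n in range(N):
        w = np.exp(-0.5 * sqd[:, n] / sigma**2)
        step = (X * w).sum(axis=1) / w.sum() - X[:, n]       # mean-shift step
        if L > 0:
            nbrs = np.argsort(sqd[:, n])[:k + 1]             # x_n and its k neighbors
            Xk = X[:, nbrs]
            U = np.linalg.svd(Xk - Xk.mean(axis=1, keepdims=True),
                              full_matrices=False)[0][:, :L]  # local tangent basis
            step = step - U @ (U.T @ step)                   # keep orthogonal motion only
        Xnew[:, n] = X[:, n] + step
    return Xnew
```

On data lying exactly on a linear manifold of dimension L, the corrector cancels the whole mean-shift step, so no denoising (and no shrinkage) occurs, whereas the L = 0 case shrinks the data tangentially.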
A special case of MBMS called local tangent projection (LTP) results by using a neighborhood graph and setting σ = ∞ (so only two user parameters are needed: L and k). LTP can be seen as doing a low-rank matrix completion locally. LTP was found in [5] to have nearly as good performance as the best σ in several problems. MBMS also includes as particular cases GBMS (L = 0), PCA (k = N, σ = ∞), and no denoising (σ = 0 or L = D).

Note that if we apply MBMS to a dataset that lies on a linear manifold of dimensionality d using L ≥ d, then no denoising occurs whatsoever because the GBMS updates lie on the d-dimensional manifold and are removed by the corrector step. In practice, even if the data are assumed noiseless, the reconstruction from a low-rank method will lie close to but not exactly on the d-dimensional manifold. However, this suggests using largish ranks for the low-rank method used to reconstruct X and lower L values in the subsequent MBMS run.

In summary, this yields a matrix completion algorithm where we apply an MBMS step, refill the present values, and iterate until the validation error increases. Again, in an actual implementation we compute the MBMS step only for the missing entries of X. The shrinking problem of GBMS is less pronounced in our matrix completion setting, because we constrain some values not to change. Still, in agreement with [5], we find MBMS to be generally superior to GBMS.

Computational cost With a full graph, the cost per iteration of GBMS and MBMS is $O(N^2 D)$ and $O(N^2 D + N(D+k)\min(D,k)^2)$, respectively. In practice with high-dimensional data, best denoising results are obtained using a neighborhood graph [5], so that the sums over points in eqs. (3) or (4) extend only to the neighbors.
With a k-nearest-neighbor graph and if we do not update the neighbors at each iteration (which affects the result little), the respective cost per iteration is $O(NkD)$ and $O(NkD + N(D+k)\min(D,k)^2)$, thus linear in N. The graph is constructed on the initial X we use, consisting of the present values and an imputation for the missing ones achieved with a standard matrix completion method, and has a one-off cost of $O(N^2 D)$. The cost when we have a fraction $\mu = |\mathcal{M}|/(ND) \in [0,1]$ of missing data is simply the above times µ. Hence the run time of our mean-shift-based matrix completion algorithms is faster the more present data we have, and thus faster than the usual GBMS or MBMS case, where all data are effectively missing.

3 Experimental results

We compare with representative methods of several approaches: a low-rank matrix completion method, singular value projection (SVP [7], whose performance we found similar to that of alternating least squares, ALS [3, 4]); fitting a D-dimensional Gaussian model with EM and imputing the missing values of each $x_n$ as the conditional mean $\mathrm{E}\{x_{n,\mathcal{M}_n} | x_{n,\mathcal{P}_n}\}$ (we use the implementation of [19]); and the nonlinear method of [20] (nlPCA). We initialize GBMS and MBMS from some or all of these algorithms. For methods with user parameters, we set them by cross-validation in the following way: we randomly select 10% of the present entries and pretend they are missing as well, we run the algorithm on the remaining 90% of the present values, and we evaluate the reconstruction at the 10% entries we kept earlier. We repeat this over different parameters' values and pick the one with lowest reconstruction error.
We then run the algorithm with these parameters' values on the entire present data and report the (test) error with the ground truth for the missing values.

100D Swissroll We created a 3D swissroll data set with 3,000 points and lifted it to 100D with a random orthonormal mapping, and added a little noise (spherical Gaussian with stdev 0.1). We selected uniformly at random 6.76% of the entries to be present. We use the Gaussian model and SVP (fixed rank = 3) as initialization for our algorithm. We typically find that these initial X are very noisy (fig. 3), with some reconstructed points lying between different branches of the manifold and causing a big reconstruction error. We fixed L = 2 (the known dimensionality) for MBMS and cross-validated the other parameters: σ and k for MBMS and GBMS (both using k-nn graph), and the number of iterations τ to be used. Table 1 gives the performance of MBMS and GBMS for testing, along with their optimal parameters. Fig. 3 shows the results of different methods at a few iterations. MBMS initialized from the Gaussian model gives the most remarkable denoising effect. To show that there is a wide range of σ and number of iterations τ that give good performance with GBMS and MBMS, we fix k = 50 and run the algorithm with varying σ values and plot the reconstruction error for missing entries over iterations in fig. 2.
Both GBMS and MBMS can achieve good denoising (and reconstruction), but MBMS is more robust, with good results occurring for a wide range of iterations, indicating it is able to preserve the manifold structure better.

Table 1: Swissroll data set: reconstruction errors obtained by different algorithms along with their optimal parameters (σ, k, L, no. iterations τ). The three columns show the root sum of squared errors on missing entries, the mean, and the standard deviation of the pointwise reconstruction error, resp.

Methods                   RSSE   mean  stdev
Gaussian                  168.1  2.63  1.59
 + GBMS (∞, 10, 0, 1)     165.8  2.57  1.61
 + MBMS (1, 20, 2, 25)    157.2  2.36  1.63
SVP                       156.8  1.94  2.10
 + GBMS (3, 50, 0, 1)     151.4  1.89  2.02
 + MBMS (3, 50, 2, 2)     151.8  1.87  2.05

Table 2: MNIST-7 data set: errors of the different algorithms and their optimal parameters (σ, k, L, no. iterations τ). The three columns show the root sum of squared errors on missing entries (×10⁻⁴), the mean, and the standard deviation of pixel errors, respectively.

Methods                    RSSE  mean  stdev
nlPCA                      7.77  26.1  42.6
SVP                        6.99  21.8  39.3
 + GBMS (400, 140, 0, 1)   6.54  18.8  37.7
 + MBMS (500, 140, 9, 5)   6.03  17.0  34.9

Figure 2: Reconstruction error of GBMS/MBMS over iterations (each curve is a different σ value). [Four panels: Gaussian + GBMS, Gaussian + MBMS, SVP + GBMS, SVP + MBMS; error (RSSE) vs. iteration τ, for σ ∈ {0.3, 0.5, 1, 2, 3, 5, 8, 10, 15, 25, ∞}.]

Mocap data We use the running-motion sequence 09_01 from the CMU mocap database with 148 samples (≈ 1.7 cycles) with 150 sensor
readings (3D positions of 50 joints on a human body). The motion is intrinsically 1D, tracing a loop in 150D. We compare nlPCA, SVP, the Gaussian model, and MBMS initialized from the first three algorithms. For nlPCA, we do a grid search for the weight decay coefficient while fixing its structure to be 2 × 10 × 150 units, and use an early stopping criterion. For SVP, we do grid search on {1, 2, 3, 5, 7, 10} for the rank. For MBMS (L = 1) and GBMS (L = 0), we do grid search for σ and k.

We report the reconstruction error as a function of the proportion of missing entries from 50% to 95%. For each missing-data proportion, we randomly select 5 different sets of present values and run all algorithms for them. Fig. 4 gives the mean errors of all algorithms. All methods perform well when the missing-data proportion is small. nlPCA, being prone to local optima, is less stable than SVP and the Gaussian model, especially when the missing-data proportion is large. The Gaussian model gives the best and most stable initialization. At 95%, all methods fail to give an acceptable reconstruction, but up to 90% missing entries, MBMS and GBMS always beat the other algorithms. Fig. 4 shows selected reconstructions from all algorithms.

MNIST digit '7' The MNIST digit '7' data set contains 6,265 greyscale (0–255) images of size 28 × 28. We create missing entries in a way reminiscent of run-length errors in transmission. We generate 16 to 26 rectangular boxes of an area approximately 25 pixels at random locations in each image and use them to black out pixels. In this way, we create a high-dimensional data set (784 dimensions) with about 50% entries missing on average. Because of the loss of spatial correlations within the blocks, this missing data pattern is harder than random.

The Gaussian model cannot handle such a big data set because it involves inverting large covariance matrices.
nlPCA is also very slow and we cannot afford cross-validating its structure or the weight decay coefficient, so we picked a reasonable structure (10 × 30 × 784 units), used the default weight decay parameter in the code (10⁻³), and allowed up to 500 iterations. We only use SVP as initialization for our algorithm. Since the intrinsic dimension of MNIST is suspected to be not very high, we used rank 10 for SVP and L = 9 for MBMS. We also use the same k = 140 as in [5]. So we only had to choose σ and the number of iterations via cross-validation.

Figure 3: Denoising effect of the different algorithms. [Six 3D scatter panels: SVP (τ = 0), SVP + GBMS (τ = 1), SVP + MBMS (τ = 2), Gaussian (τ = 0), Gaussian + GBMS (τ = 1), Gaussian + MBMS (τ = 25).] For visualization, we project the 100D data to 3D with the projection matrix used for creating the data. Present values are refilled for all plots.

Figure 4: Left: mean of errors (RSSE) of 5 runs obtained by different algorithms for varying percentage of missing values. Errorbars shown only for Gaussian + MBMS to avoid clutter. Right: sample reconstructions when 85% of the data is missing (frame 2: leg distance; frame 10: foot pose; frame 147: leg pose). Row 1: initialization. Row 2: init + GBMS. Row 3: init + MBMS. Color indicates different initialization: black, original data; red, nlPCA; blue, SVP; green, Gaussian.

Table 2 shows the methods and their corresponding error. Fig. 5 shows some representative reconstructions from different algorithms, with present values refilled. The mean-shift averaging among closeby neighbors (a soft form of majority voting) helps to eliminate noise, unusual strokes and other artifacts created by SVP, which by their nature tend to occur in different image locations over the neighborhood of images.

4 Related work

Matrix completion is widely studied in theoretical compressed sensing [1, 2] as well as practical recommender systems [3, 4]. Most matrix completion models rely on a low-rank assumption, and cannot fully exploit a more complex structure of the problem, such as curved manifolds. Related work is on multi-task learning in a broad sense, which extracts the common structure shared by multiple related objects and achieves simultaneous learning on them.
This includes applications such as alignment of noise-corrupted images [21], recovery of images with occlusion [22], and even learning of multiple related regressors or classifiers [23]. Again, all these works are essentially based on a subspace assumption, and do not generalize to more complex situations.

A line of work based on a nonlinear low-rank assumption (with a latent variable $z$ of dimensionality $L < D$) involves setting up a least-squares error function

$$\min_{f, Z} \sum_{n=1}^N \|x_n - f(z_n)\|^2 = \sum_{n,d=1}^{N,D} (x_{dn} - f_d(z_n))^2$$

where one ignores the terms for which $x_{dn}$ is missing, and estimates the function $f$ and the low-dimensional data projections $Z$ by alternating optimization. Linear functions $f$ have been used in the homogeneity analysis literature [24], where this approach is called "missing data deleted". Nonlinear functions $f$ have been used recently (neural nets [20]; Gaussian processes for collaborative filtering [25]). Better results are obtained if adding a projection term $\sum_{n=1}^N \|z_n - F(x_n)\|^2$ and optimizing over the missing data as well [26].

Figure 5: Selected reconstructions of MNIST block-occluded digits '7' with different methods. [Each group shows: Orig, Missing, nlPCA, SVP, GBMS, MBMS.]

Prior to our denoising-based work there have been efforts to extend the low-rank models to smooth manifolds, mostly in the context of compressed sensing. Baraniuk and Wakin [27] show that certain random measurements, e.g. random projection to a low-dimensional subspace, can preserve the metric of the manifold fairly well, if the intrinsic dimension and the curvature of the manifold are both small enough. However, these observations are not suitable for matrix completion and no algorithm is given for recovering the signal. Chen et al. [28] explicitly model a pre-determined manifold, and use this to regularize the signal when recovering the missing values. They estimate the
They estimate the manifold from complete data, whereas no complete data is assumed in our matrix completion setting. Another related work is [29], where a manifold modeled with Isomap is used to estimate the positions of satellite cameras in an iterative manner.

Finally, our expectation that the value of a missing entry can be predicted from the values of neighboring points is similar to that of one category of collaborative filtering methods, which essentially use similar users/items to predict missing values [3, 4].

5 Conclusion

We have proposed a new paradigm for matrix completion, denoising, which generalizes the commonly used assumption of low rank. Assuming low rank implies a restrictive form of denoising where the data is forced to have zero variance away from a linear manifold. More general definitions of denoising can potentially handle data that lives on a low-dimensional manifold that is nonlinear, or whose dimensionality varies (e.g. a set of manifolds), or that does not have low rank at all, and they naturally handle noise in the data. Denoising works because of the fundamental fact that a missing value can be predicted by averaging nearby present values.

Although we motivate our framework from a constrained-optimization point of view (denoise subject to respecting the present data), we argue for an algorithmic view of denoising-based matrix completion: apply a denoising step, refill the present values, and iterate until the validation error increases. In turn, this allows different forms of denoising, such as low-rank projection (earlier work) or local averaging with blurring mean-shift (this paper). Our nonparametric choice of mean-shift averaging further relaxes assumptions about the data and results in a simple algorithm with very few user parameters, which afford user control (denoising scale, local dimensionality) but can be set automatically by cross-validation.
Our algorithms are intended to be used as a postprocessing step over a user-provided initialization of the missing values, and we show that they consistently improve upon existing algorithms.

The MBMS-based algorithm bridges the gap between pure denoising (GBMS) and local low rank. Other definitions of denoising should be possible, for example using temporal as well as spatial neighborhoods, and even applicable to discrete data if we consider denoising as majority voting among the neighbors of a vector (with suitable definitions of votes and neighborhoods).

Acknowledgments Work supported by NSF CAREER award IIS-0754089.

References
[1] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, December 2009.
[2] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Information Theory, 56(5):2053-2080, April 2010.
[3] Yehuda Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. SIGKDD 2008, pages 426-434, Las Vegas, NV, August 24-27, 2008.
[4] Robert Bell and Yehuda Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. ICDM 2007, pages 43-52, October 28-31, 2007.
[5] Weiran Wang and Miguel Á. Carreira-Perpiñán. Manifold blurring mean shift algorithms for manifold denoising. CVPR 2010, pages 1759-1766, San Francisco, CA, June 13-18, 2010.
[6] Matthias Hein and Markus Maier. Manifold denoising. NIPS 2006, 19:561-568. MIT Press, 2007.
[7] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Guaranteed rank minimization via singular value projection. NIPS 2010, 23:937-945. MIT Press, 2011.
[8] Miguel Á. Carreira-Perpiñán. Fast nonparametric clustering with Gaussian blurring mean-shift.
ICML 2006, pages 153-160, Pittsburgh, PA, June 25-29, 2006.
[9] Keinosuke Fukunaga and Larry D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Information Theory, 21(1):32-40, January 1975.
[10] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. PAMI, 17(8):790-799, 1995.
[11] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. PAMI, 24(5):603-619, May 2002.
[12] Miguel Á. Carreira-Perpiñán. Generalised blurring mean-shift algorithms for nonparametric clustering. CVPR 2008, Anchorage, AK, June 23-28, 2008.
[13] Gabriel Taubin. A signal processing approach to fair surface design. SIGGRAPH 1995, pages 351-358.
[14] Mathieu Desbrun, Mark Meyer, Peter Schröder, and Alan H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. SIGGRAPH 1999, pages 317-324.
[15] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, RI, 1997.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888-905, August 2000.
[17] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1986.
[18] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer-Verlag, New York, second edition, 2006.
[19] Tapio Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5):853-871, March 2001.
[20] Matthias Scholz, Fatma Kaplan, Charles L. Guy, Joachim Kopka, and Joachim Selbig. Non-linear PCA: A missing data approach. Bioinformatics, 21(20):3887-3895, October 15, 2005.
[21] Yigang Peng, Arvind Ganesh, John Wright, Wenli Xu, and Yi Ma.
RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. CVPR 2010, pages 763-770, 2010.
[22] A. M. Buchanan and A. W. Fitzgibbon. Damped Newton algorithms for matrix factorization with missing data. CVPR 2005, pages 316-322, San Diego, CA, June 20-25, 2005.
[23] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. NIPS 2006, 19:41-48. MIT Press, 2007.
[24] Albert Gifi. Nonlinear Multivariate Analysis. John Wiley & Sons, 1990.
[25] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. ICML 2009, Montreal, Canada, June 14-18, 2009.
[26] Miguel Á. Carreira-Perpiñán and Zhengdong Lu. Manifold learning and missing data recovery through unsupervised regression. ICDM 2011, December 11-14, 2011.
[27] Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51-77, February 2009.
[28] Minhua Chen, Jorge Silva, John Paisley, Chunping Wang, David Dunson, and Lawrence Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Trans. Signal Processing, 58(12):6140-6155, December 2010.
[29] Michael B. Wakin. A manifold lifting algorithm for multi-view compressive imaging. In Proc. 27th Conference on Picture Coding Symposium (PCS'09), pages 381-384, 2009.