{"title": "Robust Tensor Decomposition with Gross Corruption", "book": "Advances in Neural Information Processing Systems", "page_first": 1422, "page_last": 1430, "abstract": "In this paper, we study the statistical performance of robust tensor decomposition with gross corruption. The observations are noisy realization of the superposition of a low-rank tensor $\\mathcal{W}^*$ and an entrywise sparse corruption tensor $\\mathcal{V}^*$. Unlike conventional noise with bounded variance in previous convex tensor decomposition analysis, the magnitude of the gross corruption can be arbitrary large. We show that under certain conditions, the true low-rank tensor as well as the sparse corruption tensor can be recovered simultaneously. Our theory yields nonasymptotic Frobenius-norm estimation error bounds for each tensor separately. We show through numerical experiments that our theory can precisely predict the scaling behavior in practice.", "full_text": "Robust Tensor Decomposition with Gross Corruption\n\nQuanquan Gu\u2217\n\nDepartment of Operations Research\n\nand Financial Engineering\n\nPrinceton University\nPrinceton, NJ 08544\n\nqgu@princeton.edu\n\nHuan Gui\u2217 Jiawei Han\n\nDepartment of Computer Science\n\nUniversity of Illinois\nat Urbana-Champaign\n\nUrbana, IL 61801\n\n{huangui2,hanj}@illinois.edu\n\nAbstract\n\nIn this paper, we study the statistical performance of robust tensor decomposition\nwith gross corruption. The observations are noisy realization of the superposition\nof a low-rank tensor W\u2217 and an entrywise sparse corruption tensor V\u2217. Unlike\nconventional noise with bounded variance in previous convex tensor decomposition\nanalysis, the magnitude of the gross corruption can be arbitrary large. We show\nthat under certain conditions, the true low-rank tensor as well as the sparse cor-\nruption tensor can be recovered simultaneously. Our theory yields nonasymptotic\nFrobenius-norm estimation error bounds for each tensor separately. 
We show through numerical experiments that our theory can precisely predict the scaling behavior in practice.\n\n1 Introduction\n\nTensor data analysis has witnessed increasing applications in machine learning, data mining and computer vision. For example, an ensemble of face images can be modeled as a tensor, whose modes correspond to pixels, subjects, illumination and viewpoint [23]. Traditional tensor decomposition methods such as Tucker decomposition and CANDECOMP/PARAFAC (CP) decomposition [14, 13] aim to factorize an input tensor into a number of low-rank factors. However, they are prone to local optima because they are solving essentially non-convex optimization problems. In order to address this problem, [15] [20] extended the trace norm of matrices [19] to tensors, and generalized convex matrix completion [8] [7] and matrix decomposition [6] to convex tensor completion/decomposition. For example, tensor decomposition aims to accurately estimate a low-rank tensor $\\mathcal{W}^* \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$ from a noisy observation tensor $\\mathcal{Y} \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$ that is contaminated by dense noise, i.e., $\\mathcal{Y} = \\mathcal{W}^* + \\mathcal{E}$, where $\\mathcal{W}^*$ is a low-rank tensor and $\\mathcal{E} \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$ is a noise tensor whose entries are i.i.d. Gaussian with zero mean and bounded variance $\\sigma^2$, i.e., $\\mathcal{E}_{i_1,\\ldots,i_K} \\sim N(0, \\sigma^2)$. [22] [21] analyzed the statistical performance of convex tensor decomposition under different extensions of the trace norm. They showed that, under certain conditions, the estimation error scales with the rank of the true tensor $\\mathcal{W}^*$. Furthermore, they demonstrated that given a noisy tensor, the true low-rank tensor can be recovered under a restricted strong convexity assumption [18]. However, all these algorithms [15] [20] and theoretical results [22] [21] rely on the assumption that the observation noise has a bounded variance $\\sigma^2$. 
Without this assumption, we are not able to identify the rank of $\\mathcal{W}^*$, and therefore the estimated low-rank tensor $\\widehat{\\mathcal{W}}$ could be very far from the true tensor $\\mathcal{W}^*$.\n\nOn the other hand, in many practical applications such as face recognition and image/video denoising, a portion of the observation tensor $\\mathcal{Y}$ might be contaminated by gross error due to illumination, occlusion or salt-and-pepper noise. This scenario is not covered by the finite-variance noise assumption, therefore new mathematical models are needed to address this problem. This motivates us to study convex tensor decomposition with gross corruption.\n\n\u2217Equal Contribution\n\nIt is clear that if all the entries of a tensor are corrupted by large error, there is no hope of recovering the underlying low-rank tensor. To overcome this problem, one common assumption is that the gross corruption is sparse. Under this assumption, together with the previous low-rank assumption, we formalize the noisy linear observation model as follows:\n\n$$\\mathcal{Y} = \\mathcal{W}^* + \\mathcal{V}^* + \\mathcal{E}, \\quad (1)$$\n\nwhere $\\mathcal{W}^* \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$ is a low-rank tensor, $\\mathcal{V}^* \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$ is a sparse corruption tensor, where the locations of the nonzero entries are unknown and the magnitudes of the nonzero entries can be arbitrarily large, and $\\mathcal{E} \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$ is a noise tensor whose entries are i.i.d. Gaussian with zero mean and bounded variance $\\sigma^2$, and thus dense. Our goal is to recover the low-rank tensor $\\mathcal{W}^*$, as well as the sparse corruption tensor $\\mathcal{V}^*$. 
Note that in some applications, the corruption tensor is of independent interest and needs to be recovered.\n\nGiven the observation model in (1), and the low-rank and sparse assumptions on $\\mathcal{W}^*$ and $\\mathcal{V}^*$ respectively, we propose the following convex minimization to estimate the unknown low-rank tensor $\\mathcal{W}^*$ and the sparse corruption tensor $\\mathcal{V}^*$ simultaneously:\n\n$$\\arg\\min_{\\mathcal{W},\\mathcal{V}} |||\\mathcal{Y} - \\mathcal{W} - \\mathcal{V}|||_F^2 + \\lambda_M |||\\mathcal{W}|||_{S_1} + \\mu_M |||\\mathcal{V}|||_1, \\quad (2)$$\n\nwhere $|||\\cdot|||_{S_1}$ is the tensor Schatten-1 norm [22], $|||\\cdot|||_1$ is the entrywise $\\ell_1$ norm of tensors, and $\\lambda_M$ and $\\mu_M$ are positive regularization parameters. We call this optimization Robust Tensor Decomposition, which can be seen as a generalization of the convex tensor decomposition in [15] [20] [22]. The regularization associated with $\\mathcal{V}$ encourages sparsity of the corruption tensor, where the parameter $\\mu_M$ controls the sparsity level. In this paper, we focus on the following questions: under what conditions on the size of the tensor, the rank of the tensor, and the fraction (sparsity level) of the corruption (i) is (2) able to recover $\\mathcal{W}^*$ and $\\mathcal{V}^*$ with small estimation error, and (ii) is (2) able to recover the exact rank of $\\mathcal{W}^*$ and the support of $\\mathcal{V}^*$? We will present nonasymptotic error bounds to answer these questions. Experiments on synthetic datasets validate our theoretical results.\n\nThe rest of this paper is arranged as follows. Related work is discussed in Section 2. Section 3 introduces the background and notation. Section 4 presents the main results. Section 5 provides an ADMM algorithm to solve the problem, followed by the numerical experiments in Section 6. Section 7 concludes this work with remarks.\n\n2 Related Work\n\nThe problem of recovering data under gross error has gained much attention recently in matrix decomposition. A large body of work has been proposed and analyzed statistically. 
For example, [9] considered the problem of recovering an unknown low-rank and an unknown sparse matrix, given the sum of the two matrices. [5] proposed a similar problem, namely robust principal component analysis (RPCA), which studies the problem of recovering the low-rank and sparse matrices by solving a convex program. [10] studied multi-task regression, which decomposes the coefficient matrix into two matrices, and imposes different group sparse regularization on the two matrices. [25] considered a more general case, where the parameter matrix can be the superposition of more than two matrices with different structural constraints. Our paper extends [5] in two respects: we extend the problem from matrices to higher-order tensors, and we consider the additional noise setting.\n\nWe notice that [16] extended RPCA to tensors, aiming to recover the low-rank and sparse tensors by solving a constrained convex program. However, our formulation departs from [16] in that we consider not only the sparse corruption, but also the dense noise. We also note that low-rank noisy matrix completion [17] and robust matrix decomposition [1] [12] have been studied in the high-dimensional setting as well. Our model can be seen as the higher-order extension of robust matrix decomposition. This extension is nontrivial, because the treatment of the tensor trace norm (Schatten-1 norm) is more complicated. More importantly, for the robust matrix decomposition problem considered in [1], only a bound on the sum of the errors of the two matrices (the low-rank matrix and the sparse corruption matrix) can be obtained under the assumption of restricted strong convexity. In contrast, under a different condition, our analysis provides an error bound for each tensor component (the low-rank tensor and the sparse corruption tensor) separately, making our results more appealing in practice and of independent theoretical interest. 
Since the problem in [1] is a special case of our problem, our technical tool can be directly applied to their problem and yields new error bounds on the low-rank matrix as well as the sparse corruption matrix separately.\n\n3 Notation and Background\n\nBefore proceeding, we define our notation and state assumptions that will appear in various parts of the analysis. For more details about tensor algebra, please refer to [14].\n\nScalars are denoted by lower case letters ($a, b, \\ldots$), vectors by bold lower case letters ($\\mathbf{a}, \\mathbf{b}, \\ldots$), matrices by bold upper case letters ($\\mathbf{A}, \\mathbf{B}, \\ldots$), and higher-order tensors by calligraphic upper case letters ($\\mathcal{A}, \\mathcal{B}, \\ldots$). A tensor is a higher-order generalization of a vector (first-order tensor) and a matrix (second-order tensor). From a multi-linear algebra view, a tensor is a multi-linear mapping over a set of vector spaces. The order of a tensor $\\mathcal{A} \\in \\mathbb{R}^{n_1 \\times n_2 \\times \\ldots \\times n_K}$ is $K$, where $n_k$ is the dimensionality of the $k$-th mode. Elements of $\\mathcal{A}$ are denoted as $\\mathcal{A}_{i_1 \\ldots i_k \\ldots i_K}$, $1 \\le i_k \\le n_k$. We denote the number of elements in $\\mathcal{A}$ by $N = \\prod_{k=1}^K n_k$.\n\nThe mode-$k$ vectors of a $K$-th order tensor $\\mathcal{A}$ are the $n_k$-dimensional vectors obtained from $\\mathcal{A}$ by varying index $i_k$ while keeping the other indices fixed. The mode-$k$ vectors are the column vectors of the mode-$k$ flattening matrix $\\mathbf{A}_{(k)} \\in \\mathbb{R}^{n_k \\times (n_1 \\ldots n_{k-1} n_{k+1} \\ldots n_K)}$ that results from mode-$k$ flattening of the tensor $\\mathcal{A}$. For example, matrix column vectors are referred to as mode-1 vectors and matrix row vectors are referred to as mode-2 vectors.\n\nThe scalar product of two tensors $\\mathcal{A}, \\mathcal{B} \\in \\mathbb{R}^{n_1 \\times n_2 \\times \\ldots \\times n_K}$ is defined as $\\langle \\mathcal{A}, \\mathcal{B} \\rangle = \\sum_{i_1} \\ldots \\sum_{i_K} \\mathcal{A}_{i_1 \\ldots i_K} \\mathcal{B}_{i_1 \\ldots i_K} = \\mathrm{vec}(\\mathcal{A})^\\top \\mathrm{vec}(\\mathcal{B})$, where $\\mathrm{vec}(\\cdot)$ is a vectorization. The Frobenius norm of a tensor $\\mathcal{A}$ is $|||\\mathcal{A}|||_F = \\sqrt{\\langle \\mathcal{A}, \\mathcal{A} \\rangle}$.\n\nThere are multiple ways to define tensor rank. In this paper, following [22], we define the rank of a tensor based on the mode-$k$ rank. More specifically, the mode-$k$ rank of a tensor $\\mathcal{X}$, denoted by $\\mathrm{rank}_k(\\mathcal{X})$, is the rank of the mode-$k$ unfolding $\\mathbf{X}_{(k)}$ (note that $\\mathbf{X}_{(k)}$ is a matrix, so its rank is well-defined). Based on the mode-$k$ rank, we define the rank of a tensor $\\mathcal{X}$ as $r(\\mathcal{X}) = (r_1, \\ldots, r_K)$ if the mode-$k$ rank is $r_k$ for $k = 1, \\ldots, K$. Note that the mode-$k$ rank can be computed in polynomial time, because it boils down to computing a matrix rank, whereas computing the tensor rank [14] is NP-complete.\n\nSimilarly, we extend the trace norm (a.k.a. nuclear norm) of matrices [19] to tensors. The overlapped Schatten-1 norm is defined as $|||\\mathcal{X}|||_{S_1} = \\frac{1}{K} \\sum_{k=1}^K \\|\\mathbf{X}_{(k)}\\|_{S_1}$, where $\\mathbf{X}_{(k)}$ is the mode-$k$ unfolding of $\\mathcal{X}$, and $\\|\\cdot\\|_{S_1}$ is the Schatten-1 norm for a matrix, $\\|\\mathbf{X}\\|_{S_1} = \\sum_{j=1}^r \\sigma_j(\\mathbf{X})$, where $\\sigma_j(\\mathbf{X})$ is the $j$-th largest singular value of $\\mathbf{X}$. The dual norm of the Schatten-1 norm is the Schatten-$\\infty$ norm (a.k.a. spectral norm), $\\|\\mathbf{X}\\|_{S_\\infty} = \\max_{j=1,\\ldots,r} \\sigma_j(\\mathbf{X})$.\n\nBy H\u00f6lder's inequality, we have $|\\langle \\mathbf{W}, \\mathbf{X} \\rangle| \\le \\|\\mathbf{W}\\|_{S_1} \\|\\mathbf{X}\\|_{S_\\infty}$. It is easy to prove a similar result for the overlapped Schatten-1 norm and its dual norm. We have the following H\u00f6lder-like inequality [22]:\n\n$$|\\langle \\mathcal{W}, \\mathcal{X} \\rangle| \\le |||\\mathcal{W}|||_{S_1} |||\\mathcal{X}|||_{S_1^*} \\le |||\\mathcal{W}|||_{S_1} |||\\mathcal{X}|||_{\\mathrm{mean}}, \\quad (3)$$\n\nwhere $|||\\mathcal{X}|||_{\\mathrm{mean}} := \\frac{1}{K} \\sum_{k=1}^K \\|\\mathbf{X}_{(k)}\\|_{S_\\infty}$.\n\nMoreover, we define the $\\ell_1$-norm and $\\ell_\\infty$-norm for tensors: $|||\\mathcal{X}|||_1 = \\sum_{i_1=1}^{n_1} \\ldots \\sum_{i_K=1}^{n_K} |\\mathcal{X}_{i_1,\\ldots,i_K}|$ and $|||\\mathcal{X}|||_\\infty = \\max_{1 \\le i_1 \\le n_1} \\ldots \\max_{1 \\le i_K \\le n_K} |\\mathcal{X}_{i_1,\\ldots,i_K}|$. 
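As a concrete illustration of these definitions (a minimal NumPy sketch, not the authors' code; `unfold`, `mode_k_rank`, and `overlapped_schatten1` are our own illustrative names), the mode-$k$ unfolding, the mode-$k$ rank, and the overlapped Schatten-1 norm can be computed as follows:

```python
import numpy as np

def unfold(A, k):
    """Mode-k flattening: rows indexed by mode k, columns by all remaining modes."""
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1)

def mode_k_rank(A, k):
    """rank_k(A) = rank of the mode-k unfolding (an ordinary matrix rank)."""
    return np.linalg.matrix_rank(unfold(A, k))

def overlapped_schatten1(A):
    """(1/K) * sum over modes of the nuclear norm of the mode-k unfolding."""
    K = A.ndim
    return sum(np.linalg.norm(unfold(A, k), 'nuc') for k in range(K)) / K

# A rank-(1, 1, 1) tensor: the outer product of three vectors.
rng = np.random.default_rng(0)
u, v, w = rng.standard_normal(4), rng.standard_normal(5), rng.standard_normal(6)
A = np.einsum('i,j,k->ijk', u, v, w)
print([mode_k_rank(A, k) for k in range(3)])  # -> [1, 1, 1]
```

For this rank-one example each unfolding has a single nonzero singular value, so the overlapped Schatten-1 norm equals the product of the three vector norms.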
By H\u00f6lder's inequality, we have $|\\langle \\mathcal{W}, \\mathcal{X} \\rangle| \\le |||\\mathcal{W}|||_1 |||\\mathcal{X}|||_\\infty$, and the following inequality relates the overlapped Schatten-1 norm to the Frobenius norm:\n\n$$|||\\mathcal{X}|||_{S_1} \\le \\frac{1}{K}\\sum_{k=1}^K \\sqrt{r_k}\\, |||\\mathcal{X}|||_F. \\quad (4)$$\n\nLet $\\mathcal{W}^* \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$ be the low-rank tensor that we wish to recover. We assume that $\\mathcal{W}^*$ is of rank $(r_1, \\ldots, r_K)$. Thus, for each $k$, we have $\\mathbf{W}^*_{(k)} = \\mathbf{U}_k \\mathbf{S}_k \\mathbf{V}_k^\\top$, where $\\mathbf{U}_k \\in \\mathbb{R}^{n_k \\times r_k}$ and $\\mathbf{V}_k \\in \\mathbb{R}^{\\bar{N}_{\\backslash k} \\times r_k}$ (with $\\bar{N}_{\\backslash k} = \\prod_{j \\ne k} n_j$) are orthogonal matrices, which consist of the left and right singular vectors of $\\mathbf{W}^*_{(k)}$, and $\\mathbf{S}_k \\in \\mathbb{R}^{r_k \\times r_k}$ is a diagonal matrix whose diagonal elements are singular values. Let $\\Delta \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$ be an arbitrary tensor; we define the mode-$k$ orthogonal complement $\\Delta''_k$ of its mode-$k$ unfolding $\\Delta_{(k)} \\in \\mathbb{R}^{n_k \\times \\bar{N}_{\\backslash k}}$ with respect to the true low-rank tensor $\\mathcal{W}^*$ as follows:\n\n$$\\Delta''_k = (\\mathbf{I}_{n_k} - \\mathbf{U}_k \\mathbf{U}_k^\\top)\\, \\Delta_{(k)}\\, (\\mathbf{I}_{\\bar{N}_{\\backslash k}} - \\mathbf{V}_k \\mathbf{V}_k^\\top). \\quad (5)$$\n\nIn addition, $\\Delta'_k = \\Delta_{(k)} - \\Delta''_k$ is the component which has overlapped row/column space with the unfolding $\\mathbf{W}^*_{(k)}$ of the true tensor. Note that the decomposition $\\Delta_{(k)} = \\Delta'_k + \\Delta''_k$ is defined for each mode.\n\nIn [18], the concept of decomposability and a large class of decomposable norms are discussed at length. Of particular relevance to us is the decomposability of the Schatten-1 norm and the $\\ell_1$-norm. We have the following equality, i.e., mode-$k$ decomposability of the Schatten-1 norm: $\\|\\mathbf{W}^*_{(k)} + \\Delta''_k\\|_{S_1} = \\|\\mathbf{W}^*_{(k)}\\|_{S_1} + \\|\\Delta''_k\\|_{S_1}$, $k = 1, \\ldots, K$. Note that the decomposability is defined on each mode. It is also easy to check the decomposability of the $\\ell_1$-norm.\n\nLet $\\mathcal{V}^* \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$ be the gross corruption tensor that we wish to recover. We assume that the gross corruption is sparse, in that the cardinality $s = |\\mathrm{supp}(\\mathcal{V}^*)|$ of its support $S = \\mathrm{supp}(\\mathcal{V}^*) = \\{(i_1, i_2, \\ldots, i_K) \\in [n_1] \\times \\ldots \\times [n_K] \\mid \\mathcal{V}^*_{i_1,\\ldots,i_K} \\ne 0\\}$ is small. This assumption leads to the inequality between the $\\ell_1$ norm and the Frobenius norm that $|||\\mathcal{V}^*|||_1 \\le \\sqrt{s}\\, |||\\mathcal{V}^*|||_F$. Moreover, we have $|||\\mathcal{V}^*|||_1 = |||\\mathcal{V}^*_S|||_1$. For any $\\mathcal{D} \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$, we have $|||\\mathcal{D}|||_1 = |||\\mathcal{D}_S|||_1 + |||\\mathcal{D}_{S^c}|||_1$.\n\n4 Main Results\n\nTo get a deep theoretical insight into the recovery property of robust tensor decomposition, we now present a set of estimation error bounds. Unlike the analysis in [1], where only the sum of the estimation errors on the low-rank matrix and the gross corruption matrix is analyzed, we aim to obtain the estimation error bounds on each tensor (the low-rank tensor and the corruption tensor) separately. All the proofs can be found in the longer version of this paper.\n\nInstead of considering the observation model in (1), we consider the following more general observation model:\n\n$$y_i = \\langle \\mathcal{W}^*, \\mathcal{X}_i \\rangle + \\langle \\mathcal{V}^*, \\mathcal{X}_i \\rangle + \\epsilon_i, \\quad i = 1, \\ldots, M, \\quad (6)$$\n\nwhere $\\mathcal{X}_i$ can be seen as an observation operator, and the $\\epsilon_i$'s are i.i.d. zero-mean Gaussian noise with variance $\\sigma^2$. Our goal is to estimate the unknown rank-$(r_1, \\ldots, r_K)$ tensor $\\mathcal{W}^* \\in \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$, as well as the unknown support of the tensor $\\mathcal{V}^*$, from the observations $y_i$, $i = 1, \\ldots, M$. 
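As a sketch of the general observation model (6) (illustrative code, not the authors'; `apply_X` and `adjoint_X` are hypothetical helper names, and the stand-in tensors carry no low-rank or sparse structure), the linear operator $X$ and its adjoint $X^*(\\epsilon) = \\sum_{i=1}^M \\epsilon_i \\mathcal{X}_i$ can be written as:

```python
import numpy as np

def apply_X(W, Xs):
    """X(W): the vector of inner products <W, X_i>, i = 1, ..., M."""
    return np.array([np.vdot(Xi, W) for Xi in Xs])

def adjoint_X(eps, Xs):
    """X*(eps) = sum_i eps_i * X_i, a tensor of the same shape as each X_i."""
    return sum(e * Xi for e, Xi in zip(eps, Xs))

rng = np.random.default_rng(0)
shape, M, sigma = (4, 5, 3), 10, 0.1
W_star = rng.standard_normal(shape)   # stand-ins only: no structure imposed here
V_star = rng.standard_normal(shape)
Xs = [rng.standard_normal(shape) for _ in range(M)]
eps = sigma * rng.standard_normal(M)
y = apply_X(W_star, Xs) + apply_X(V_star, Xs) + eps  # y_i = <W*,X_i> + <V*,X_i> + eps_i
```

The pair satisfies the adjoint identity $\\langle X(\\mathcal{W}), \\epsilon \\rangle = \\langle \\mathcal{W}, X^*(\\epsilon) \\rangle$, which is what the conditions on the regularization parameters $\\lambda_M$ and $\\mu_M$ exploit.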
We propose the following convex minimization to estimate the unknown low-rank tensor $\\mathcal{W}^*$ and the sparse corruption tensor $\\mathcal{V}^*$ simultaneously, with composite regularizers on $\\mathcal{W}$ and $\\mathcal{V}$ as follows:\n\n$$(\\widehat{\\mathcal{W}}, \\widehat{\\mathcal{V}}) = \\arg\\min_{\\mathcal{W},\\mathcal{V}} \\frac{1}{2M}\\|\\mathbf{y} - X(\\mathcal{W} + \\mathcal{V})\\|_2^2 + \\lambda_M |||\\mathcal{W}|||_{S_1} + \\mu_M |||\\mathcal{V}|||_1, \\quad (7)$$\n\nwhere $\\mathbf{y} = (y_1, \\ldots, y_M)^\\top$ is the collection of observations, and $X(\\mathcal{W}) = [\\langle \\mathcal{W}, \\mathcal{X}_1 \\rangle, \\ldots, \\langle \\mathcal{W}, \\mathcal{X}_M \\rangle]^\\top$ is the linear observation operator. Note that (2) is a special case of (7) where the linear operator is the identity; then $y_i$ is the observation of each element in the sum $\\mathcal{W}^* + \\mathcal{V}^*$. We also define $\\mathbf{y}^* = (y^*_1, \\ldots, y^*_M)^\\top$, where $y^*_i = \\langle \\mathcal{W}^* + \\mathcal{V}^*, \\mathcal{X}_i \\rangle$ is the true evaluation. Due to the noise of the observation model, we have $\\mathbf{y} = \\mathbf{y}^* + \\boldsymbol{\\epsilon}$. In addition, we define the adjoint operator of $X$ as $X^*: \\mathbb{R}^M \\to \\mathbb{R}^{n_1 \\times \\ldots \\times n_K}$, $X^*(\\boldsymbol{\\epsilon}) = \\sum_{i=1}^M \\epsilon_i \\mathcal{X}_i$.\n\n4.1 Deterministic Bounds\n\nThis section is devoted to obtaining deterministic bounds on the residual low-rank tensor $\\Delta = \\widehat{\\mathcal{W}} - \\mathcal{W}^*$ and the residual corruption tensor $\\mathcal{D} = \\widehat{\\mathcal{V}} - \\mathcal{V}^*$ separately, which makes our analysis unique. We begin with a key technical lemma on the residual tensors $\\Delta$ and $\\mathcal{D}$, obtained from the convex problem in (7).\n\nLemma 1. Let $\\widehat{\\mathcal{W}}$ and $\\widehat{\\mathcal{V}}$ be the solution of the minimization problem (7) with $\\lambda_M \\ge 2|||X^*(\\boldsymbol{\\epsilon})|||_{\\mathrm{mean}}/M$ and $\\mu_M \\ge 2|||X^*(\\boldsymbol{\\epsilon})|||_\\infty/M$. Then: (1) $\\mathrm{rank}(\\Delta'_k) \\le 2r_k$; (2) there exist $\\beta_1 \\ge 3$ and $\\beta_2 \\ge 3$ such that $\\sum_{k=1}^K \\|\\Delta''_k\\|_{S_1} \\le \\beta_1 \\sum_{k=1}^K \\|\\Delta'_k\\|_{S_1}$ and $|||\\mathcal{D}_{S^c}|||_1 \\le \\beta_2 |||\\mathcal{D}_S|||_1$.\n\nThe lemma can be obtained by utilizing the optimality of $\\widehat{\\mathcal{W}}$ and $\\widehat{\\mathcal{V}}$, as well as the decomposability of the Schatten-1 norm and the $\\ell_1$-norm of tensors. We also obtain the key property of the optimal solution of (7), presented in the following theorem.\n\nTheorem 1. Let $\\widehat{\\mathcal{W}}$ and $\\widehat{\\mathcal{V}}$ be the solution of the minimization problem (7) with $\\lambda_M \\ge 2|||X^*(\\boldsymbol{\\epsilon})|||_{\\mathrm{mean}}/M$ and $\\mu_M \\ge 2|||X^*(\\boldsymbol{\\epsilon})|||_\\infty/M$. Then\n\n$$\\frac{1}{2M}\\|X(\\Delta + \\mathcal{D})\\|_2^2 \\le \\frac{3\\lambda_M}{2K}\\sum_{k=1}^K \\|\\Delta'_k\\|_{S_1} + \\frac{3\\mu_M}{2}|||\\mathcal{D}_S|||_1. \\quad (8)$$\n\nTheorem 1 provides a deterministic prediction error bound for model (7). This is a very general result, and can be applied to any linear operator $X$, including the robust tensor decomposition case that we are particularly interested in in this paper. It also covers, for example, tensor regression and tensor compressive sensing, to mention a few.\n\nFurthermore, we impose an assumption on the linear operator and the residual low-rank tensor and residual sparse corruption tensor, which generalizes the restricted eigenvalue assumption [2] [10].\n\nAssumption 1. Defining $\\Omega = \\{(\\Delta, \\mathcal{D}) \\mid \\sum_{k=1}^K \\|\\Delta''_k\\|_{S_1} \\le \\beta_1 \\sum_{k=1}^K \\|\\Delta'_k\\|_{S_1}, \\; |||\\mathcal{D}_{S^c}|||_1 \\le \\beta_2 |||\\mathcal{D}_S|||_1\\}$, we assume there exist positive scalars $\\kappa_1, \\kappa_2$ such that\n\n$$\\kappa_1 = \\min_{(\\Delta,\\mathcal{D}) \\in \\Omega} \\frac{\\|X(\\Delta + \\mathcal{D})\\|_2}{\\sqrt{M}\\, |||\\Delta|||_F} > 0, \\qquad \\kappa_2 = \\min_{(\\Delta,\\mathcal{D}) \\in \\Omega} \\frac{\\|X(\\Delta + \\mathcal{D})\\|_2}{\\sqrt{M}\\, |||\\mathcal{D}|||_F} > 0.$$\n\nNote that Assumption 1 is also related to the restricted strong convexity assumption, which is proposed in [18] to analyze the statistical properties of general M-estimators in the high-dimensional setting. Combining the results in Theorem 1 and Assumption 1, we have the following theorem, which summarizes our main result.\n\nTheorem 2. Let $\\widehat{\\mathcal{W}}, \\widehat{\\mathcal{V}}$ be an optimal solution of (7), and take the regularization parameters $\\lambda_M \\ge 2|||X^*(\\boldsymbol{\\epsilon})|||_{\\mathrm{mean}}/M$, $\\mu_M \\ge 2|||X^*(\\boldsymbol{\\epsilon})|||_\\infty/M$. Then the following results hold:\n\n$$|||\\widehat{\\mathcal{W}} - \\mathcal{W}^*|||_F \\le \\frac{3}{\\kappa_1}\\left(\\frac{1}{K}\\sum_{k=1}^K \\frac{\\lambda_M\\sqrt{2r_k}}{\\kappa_1} + \\frac{\\mu_M\\sqrt{s}}{\\kappa_2}\\right), \\quad (9)$$\n\n$$|||\\widehat{\\mathcal{V}} - \\mathcal{V}^*|||_F \\le \\frac{3}{\\kappa_2}\\left(\\frac{1}{K}\\sum_{k=1}^K \\frac{\\lambda_M\\sqrt{2r_k}}{\\kappa_1} + \\frac{\\mu_M\\sqrt{s}}{\\kappa_2}\\right). \\quad (10)$$\n\nTheorem 2 provides us with the error bounds of each tensor separately. Specifically, these bounds not only measure how well our decomposition model can approximate the observation model defined in (6), but also measure how well the decomposition of the true low-rank tensor and gross corruption tensor is. 
When $s = 0$, our theoretical results reduce to those in [22], which addresses a special case of our problem, i.e., noisy low-rank tensor decomposition without corruption.\n\nOn the other hand, the results obtained in Theorem 2 are very appealing both practically and theoretically. From the perspective of applications, this result is quite useful as it helps us better understand the behavior of each tensor separately. From the theoretical point of view, this result is novel, and is incomparable with previous results [1] [17] or simple generalizations of previous results.\n\nThough Theorem 2 provides estimation error bounds for $\\widehat{\\mathcal{W}}$ and $\\widehat{\\mathcal{V}}$, it is unclear whether the rank of $\\mathcal{W}^*$ and the support of $\\mathcal{V}^*$ can be exactly recovered. We show that under some assumptions about the true tensors, both of them can be exactly recovered.\n\nCorollary 1. Under the same conditions as Theorem 2, if the following condition holds:\n\n$$\\sigma_{r_k}(\\mathbf{W}^*_{(k)}) > \\frac{6(1+\\beta_1)\\sum_{k=1}^K \\sqrt{2r_k}}{\\kappa_1 M K}\\left(\\frac{1}{K}\\sum_{k=1}^K \\frac{\\lambda_M\\sqrt{2r_k}}{\\kappa_1} + \\frac{\\mu_M\\sqrt{s}}{\\kappa_2}\\right), \\quad (11)$$\n\nwhere $\\sigma_{r_k}(\\mathbf{W}^*_{(k)})$ is the $r_k$-th largest singular value of $\\mathbf{W}^*_{(k)}$, then\n\n$$\\widehat{r}_k = \\arg\\max_r \\left\\{\\sigma_r(\\widehat{\\mathbf{W}}_{(k)}) > \\frac{3(1+\\beta_1)\\sum_{k=1}^K \\sqrt{2r_k}}{\\kappa_1 M K}\\left(\\frac{1}{K}\\sum_{k=1}^K \\frac{\\lambda_M\\sqrt{2r_k}}{\\kappa_1} + \\frac{\\mu_M\\sqrt{s}}{\\kappa_2}\\right)\\right\\}$$\n\nrecovers the rank of $\\mathbf{W}^*_{(k)}$ for all $k$. Furthermore, if the following condition holds:\n\n$$\\min_{i_1,\\ldots,i_K} |\\mathcal{V}^*_{i_1,\\ldots,i_K}| > \\frac{6(1+\\beta_2)\\sqrt{s}}{\\kappa_2 M}\\left(\\frac{1}{K}\\sum_{k=1}^K \\frac{\\lambda_M\\sqrt{2r_k}}{\\kappa_1} + \\frac{\\mu_M\\sqrt{s}}{\\kappa_2}\\right), \\quad (12)$$\n\nthen\n\n$$\\widehat{S} = \\left\\{(i_1, i_2, \\ldots, i_K) : \\widehat{\\mathcal{V}}_{i_1,\\ldots,i_K} > \\frac{3(1+\\beta_2)\\sqrt{s}}{\\kappa_2 M}\\left(\\frac{1}{K}\\sum_{k=1}^K \\frac{\\lambda_M\\sqrt{2r_k}}{\\kappa_1} + \\frac{\\mu_M\\sqrt{s}}{\\kappa_2}\\right)\\right\\}$$\n\nrecovers the true support of $\\mathcal{V}^*$.\n\nCorollary 1 basically states that, under the assumption that the singular values of the low-rank tensor $\\mathcal{W}^*$ and the entry values of the corruption tensor $\\mathcal{V}^*$ are above the noise level (e.g., (11) and (12)), we can recover the rank and the support successfully.\n\n4.2 Noisy Tensor Decomposition\n\nWe now go back to robust tensor decomposition with corruption in (2), which is a special case of (7) where the linear operator is the identity. As the linear operator $X$ is a vectorization, we have $M = N$ and $\\|X(\\Delta + \\mathcal{D})\\|_2 = |||\\Delta + \\mathcal{D}|||_F$. In addition, it is easy to show that Assumption 1 holds with $\\kappa_1 = \\kappa_2 = O(1/\\sqrt{N})$. It remains to bound $|||X^*(\\boldsymbol{\\epsilon})|||_{\\mathrm{mean}}$ and $|||X^*(\\boldsymbol{\\epsilon})|||_\\infty$, as shown in the following lemma [1] [24].\n\nLemma 2. Suppose that $X: \\mathbb{R}^{n_1 \\times \\cdots \\times n_K} \\to \\mathbb{R}^N$ is a vectorization of a tensor. Then we have with probability at least $1 - 2\\exp(-C(n_k + \\bar{N}_{\\backslash k})) - 1/N$ that\n\n$$|||X^*(\\boldsymbol{\\epsilon})|||_{\\mathrm{mean}} \\le \\frac{\\sigma}{K}\\sum_{k=1}^K \\left(\\sqrt{n_k} + \\sqrt{\\bar{N}_{\\backslash k}}\\right), \\qquad |||X^*(\\boldsymbol{\\epsilon})|||_\\infty \\le 4\\sigma\\sqrt{\\log N},$$\n\nwhere $C$ is a universal constant.\n\nWith Theorem 2 and Lemma 2, we immediately have the following estimation error bounds for robust tensor decomposition.\n\nTheorem 3. Suppose that $X: \\mathbb{R}^{n_1 \\times \\cdots \\times n_K} \\to \\mathbb{R}^N$ is a vectorization of a tensor. 
Then for the regularization constants $\\lambda_N \\ge 2\\sigma\\sum_{k=1}^K (\\sqrt{n_k} + \\sqrt{\\bar{N}_{\\backslash k}})/(NK)$ and $\\mu_N > 8\\sigma\\sqrt{\\log N}/N$, with probability at least $1 - 2\\exp(-C(n_k + \\bar{N}_{\\backslash k})) - 1/N$, any solution of (2) has the following error bounds:\n\n$$|||\\widehat{\\mathcal{W}} - \\mathcal{W}^*|||_F \\le \\frac{6}{\\kappa_1}\\left(\\frac{1}{K}\\sum_{k=1}^K \\frac{\\sigma(\\sqrt{n_k} + \\sqrt{\\bar{N}_{\\backslash k}})\\sqrt{2r_k}}{\\kappa_1 N K} + \\frac{4\\sigma\\sqrt{s \\log N}}{\\kappa_2 N}\\right),$$\n\n$$|||\\widehat{\\mathcal{V}} - \\mathcal{V}^*|||_F \\le \\frac{6}{\\kappa_2}\\left(\\frac{1}{K}\\sum_{k=1}^K \\frac{\\sigma(\\sqrt{n_k} + \\sqrt{\\bar{N}_{\\backslash k}})\\sqrt{2r_k}}{\\kappa_1 N K} + \\frac{4\\sigma\\sqrt{s \\log N}}{\\kappa_2 N}\\right).$$\n\nIn the special case that $n_1 = \\ldots = n_K = n$ and $r_1 = \\ldots = r_K = r$, we have $|||\\widehat{\\mathcal{W}} - \\mathcal{W}^*|||_F = O(\\sigma\\sqrt{r n^{K-1}} + \\sigma\\sqrt{Ks \\log n})$ and $|||\\widehat{\\mathcal{V}} - \\mathcal{V}^*|||_F = O(\\sigma\\sqrt{r n^{K-1}} + \\sigma\\sqrt{Ks \\log n})$, which matches the error bound of robust matrix decomposition [1] when $K = 2$.\n\nNote that the high-probability support and rank recovery guarantee for the special case of tensor decomposition follows immediately from Corollary 1. 
Due to the space limit, we omit the result here.\n\n5 Algorithm\n\nIn this section, we present an algorithm to solve (2). Since (2) is a special case of (7), we consider the more general problem (7). It is easy to show that (7) is equivalent to the following problem with auxiliary variables $\\Psi_k, \\Phi_k$:\n\n$$\\min_{\\mathcal{W},\\mathcal{V},\\{\\Psi_k\\},\\{\\Phi_k\\}} \\frac{1}{2M}\\|\\mathbf{y} - \\mathbf{x}^\\top(\\mathbf{w} + \\mathbf{v})\\|_2^2 + \\frac{\\lambda_M}{K}\\sum_{k=1}^K |||\\Psi_k|||_{S_1} + \\frac{\\mu_M}{K}\\sum_{k=1}^K |||\\Phi_k|||_1 \\quad \\text{subject to } \\mathbf{P}_k\\mathbf{w} = \\boldsymbol{\\psi}_k, \\; \\mathbf{P}_k\\mathbf{v} = \\boldsymbol{\\phi}_k,$$\n\nwhere $\\mathbf{x}, \\mathbf{w}, \\mathbf{v}, \\boldsymbol{\\psi}_k, \\boldsymbol{\\phi}_k$ are the vectorizations of $\\sum_{i=1}^M \\mathcal{X}_i, \\mathcal{W}, \\mathcal{V}, \\Psi_k, \\Phi_k$ respectively, and $\\mathbf{P}_k$ is the transformation matrix that changes the order of rows and columns so that $\\mathbf{P}_k\\mathbf{w} = \\boldsymbol{\\psi}_k$.\n\nThe augmented Lagrangian (AL) function of the above minimization problem with respect to the primal variables is given as follows:\n\n$$L_\\eta(\\mathcal{W},\\mathcal{V},\\{\\Psi_k\\},\\{\\Phi_k\\},\\{\\boldsymbol{\\alpha}_k\\},\\{\\boldsymbol{\\beta}_k\\}) = \\frac{1}{2}\\|\\mathbf{y} - \\mathbf{x}^\\top(\\mathbf{w}+\\mathbf{v})\\|_2^2 + \\frac{\\lambda_M M}{K}\\sum_{k=1}^K |||\\Psi_k|||_{S_1} + \\frac{\\mu_M M}{K}\\sum_{k=1}^K |||\\Phi_k|||_1 + \\eta\\left(\\sum_k \\left(\\boldsymbol{\\alpha}_k^\\top(\\mathbf{P}_k\\mathbf{w} - \\boldsymbol{\\psi}_k) + \\frac{1}{2}\\|\\mathbf{P}_k\\mathbf{w} - \\boldsymbol{\\psi}_k\\|_2^2\\right) + \\sum_k \\left(\\boldsymbol{\\beta}_k^\\top(\\mathbf{P}_k\\mathbf{v} - \\boldsymbol{\\phi}_k) + \\frac{1}{2}\\|\\mathbf{P}_k\\mathbf{v} - \\boldsymbol{\\phi}_k\\|_2^2\\right)\\right),$$\n\nwhere $\\boldsymbol{\\alpha}_k, \\boldsymbol{\\beta}_k$ are Lagrangian multiplier vectors, and $\\eta > 0$ is a penalty parameter.\n\nWe then apply the alternating direction method of multipliers (ADMM) [3, 20] to solve the above optimization problem. Starting from initial points $(\\mathbf{w}^0, \\mathbf{v}^0, \\{\\Psi_k^0\\}, \\{\\Phi_k^0\\}, \\{\\boldsymbol{\\alpha}_k^0\\}, \\{\\boldsymbol{\\beta}_k^0\\})$, ADMM performs the following updates iteratively:\n\n$$\\mathbf{w}^{t+1} = \\left(\\mathbf{x}^\\top\\mathbf{y} - \\mathbf{x}^\\top\\mathbf{x}\\mathbf{v}^t + \\eta\\sum_{k=1}^K \\mathbf{P}_k^\\top(\\boldsymbol{\\psi}_k^t - \\boldsymbol{\\alpha}_k^t)\\right) / (1 + \\eta K),$$\n$$\\mathbf{v}^{t+1} = \\left(\\mathbf{x}^\\top\\mathbf{y} - \\mathbf{x}^\\top\\mathbf{x}\\mathbf{w}^{t+1} + \\eta\\sum_{k=1}^K \\mathbf{P}_k^\\top(\\boldsymbol{\\phi}_k^t - \\boldsymbol{\\beta}_k^t)\\right) / (1 + \\eta K),$$\n$$\\Psi_k^{t+1} = \\mathrm{prox}^{\\mathrm{tr}}_{\\lambda_M/(\\eta K)}(\\mathbf{P}_k\\mathbf{w}^{t+1} + \\boldsymbol{\\alpha}_k^t), \\qquad \\Phi_k^{t+1} = \\mathrm{prox}^{\\ell_1}_{\\mu_M/(\\eta K)}(\\mathbf{P}_k\\mathbf{v}^{t+1} + \\boldsymbol{\\beta}_k^t), \\qquad k = 1, \\ldots, K,$$\n$$\\boldsymbol{\\alpha}_k^{t+1} = \\boldsymbol{\\alpha}_k^t + (\\mathbf{P}_k\\mathbf{w}^{t+1} - \\boldsymbol{\\psi}_k^{t+1}), \\qquad \\boldsymbol{\\beta}_k^{t+1} = \\boldsymbol{\\beta}_k^t + (\\mathbf{P}_k\\mathbf{v}^{t+1} - \\boldsymbol{\\phi}_k^{t+1}),$$\n\nwhere $\\mathrm{prox}^{\\mathrm{tr}}_\\gamma(\\cdot)$ is the soft-thresholding operator for the trace norm, and $\\mathrm{prox}^{\\ell_1}_\\gamma(\\cdot)$ is the soft-thresholding operator for the $\\ell_1$ norm [4, 11]. The stopping criterion is that all the partial (sub)gradients are (near) zero, under which condition we obtain the saddle point of the augmented Lagrangian function. Since (7) is strictly convex, the saddle point is the global optimum of the primal problem.\n\n6 Experiments\n\nIn this section, we conduct numerical experiments to confirm our analysis in the previous sections. The experiments are conducted under the setting of robust noisy tensor decomposition.\n\nWe follow the procedure described in [22] for the experimental part. We randomly generate low-rank tensors of dimensions $n^{(1)} = (50, 50, 20)$ (results are shown in Figure 1(a, b, c)) and $n^{(2)} = (100, 100, 50)$ (results are shown in Figure 1(d, e, f)) for various ranks $(r_1, r_2, \\ldots, r_K)$. Given a specific rank, we first generated the core tensor with $r_1 \\times \\ldots \\times r_K$ elements from the standard normal distribution, and then multiplied each mode of the core tensor with an orthonormal factor randomly drawn from the Haar measure. For the gross corruption, we randomly generated the sparsity $s$ of the corruption tensor, and then randomly selected $s$ elements, in which we put values randomly generated from the uniform distribution. 
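A sketch of this generation procedure (our reconstruction under stated assumptions, not the authors' code; the function names and the corruption range $[-10, 10]$ are arbitrary illustrative choices):

```python
import numpy as np

def random_low_rank_tensor(dims, ranks, rng):
    """Tucker-style construction: a Gaussian core tensor multiplied along each mode
    by an orthonormal factor (Q from the QR of a Gaussian matrix)."""
    T = rng.standard_normal(ranks)              # core tensor of shape (r_1, ..., r_K)
    for k, (n, r) in enumerate(zip(dims, ranks)):
        Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
        # Multiply mode k of T by Q: move mode k to the front, contract, move back.
        T = np.moveaxis(np.tensordot(Q, np.moveaxis(T, k, 0), axes=(1, 0)), 0, k)
    return T

def sparse_corruption(shape, s, scale, rng):
    """s entries at random positions with (potentially large) uniform values."""
    V = np.zeros(shape)
    idx = rng.choice(V.size, size=s, replace=False)
    V.flat[idx] = rng.uniform(-scale, scale, size=s)
    return V

rng = np.random.default_rng(0)
W_star = random_low_rank_tensor((50, 50, 20), (3, 3, 3), rng)   # rank-(3, 3, 3)
V_star = sparse_corruption(W_star.shape, s=100, scale=10.0, rng=rng)
```

Each mode-$k$ unfolding of `W_star` then has rank at most $r_k$, and `V_star` has exactly $s$ nonzero entries whose magnitudes are unrelated to the noise level.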
The additive independent Gaussian noise with variance $\\sigma^2$ was added to the observations of the elements. We use the alternating direction method of multipliers (ADMM) to solve the minimization problem (2). The whole experiment was repeated 50 times and the averaged results are reported.\n\n[Figure 1: Results of robust noisy tensor decomposition with corruption, under different sizes. Panels (a, d): mean error of the low-rank tensor against $N_s$ for sizes $n^{(1)}$ and $n^{(2)}$; panels (b, e): mean error of the low-rank tensor against $N_r$; panels (c, f): $\\kappa_1$ against $\\kappa_2$.]\n\nThe results are shown in Figure 1, where $N_r = \\sum_{k=1}^K \\sqrt{r_k}/K$ and $N_s = \\sqrt{s}$. In Figure 1(a, d), we first fix $N_r$ at different values, and then plot $|||\\widehat{\\mathcal{W}} - \\mathcal{W}^*|||_F/N$ against $N_s$. Similarly, in Figure 1(b, e), we first fix $N_s$ at different values, and then plot $|||\\widehat{\\mathcal{W}} - \\mathcal{W}^*|||_F/N$ against $N_r$. In Figure 1(c, f), we study the values of $\\kappa_1$ and $\\kappa_2$ at various settings. We can see that $|||\\widehat{\\mathcal{W}} - \\mathcal{W}^*|||_F/N$ scales linearly with both $N_s$ and $N_r$. Similar scalings of $|||\\widehat{\\mathcal{V}} - \\mathcal{V}^*|||_F/N$ can be observed, hence we omit them due to the space limitation. We can also observe from Figure 1(c, f) that, under various settings, $\\kappa_1 \\approx \\kappa_2$; this finding is consistent with the fact that $|||\\widehat{\\mathcal{W}} - \\mathcal{W}^*|||_F/N \\approx |||\\widehat{\\mathcal{V}} - \\mathcal{V}^*|||_F/N$. All these results are consistent with each other, validating our theoretical analysis.\n\n7 Conclusions\n\nIn this paper, we analyzed the statistical performance of robust noisy tensor decomposition with corruption. Our goal is to recover a pair of tensors, based on observing a noisy contaminated version of their sum. Our method is based on solving a convex optimization problem with composite regularization by the Schatten-1 norm and the $\\ell_1$ norm defined on tensors. We provided general nonasymptotic estimation error bounds on the underlying low-rank tensor and the sparse corruption tensor. Furthermore, the error bound we obtained in this paper is new, and not comparable with previous theoretical analyses.\n\nAcknowledgement\n\nWe would like to thank the anonymous reviewers for their helpful comments. Research was sponsored in part by the Army Research Lab, under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), the Army Research Office under Cooperative Agreement No. 
W911NF-13-1-0193, National Science Foundation grants IIS-1017362, IIS-1320617, and IIS-1354329, HDTRA1-10-1-0120, and MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.

[2] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.

[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[4] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.

[6] E. J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

[7] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.

[8] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.

[9] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

[10] P. Gong, J. Ye, and C. Zhang. Robust multi-task feature learning. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 895–903. ACM, 2012.

[11] E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008.

[12] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221–7234, 2011.

[13] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[14] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000.

[15] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.

[16] C. Mu, B. Huang, J. Wright, and D. Goldfarb. Square deal: Lower bounds and improved relaxations for tensor recovery. arXiv preprint arXiv:1307.5870, 2013.

[17] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.

[18] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

[19] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In COLT, pages 545–560, 2005.

[20] R. Tomioka, K. Hayashi, and H. Kashima. Estimation of low-rank tensors via convex optimization. 2010.

[21] R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regularization. In NIPS, pages 1331–1339, 2013.

[22] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor decomposition. In NIPS, pages 972–980, 2011.

[23] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In ECCV (1), pages 447–460, 2002.

[24] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[25] E. Yang and P. D. Ravikumar. Dirty statistical models. In NIPS, pages 611–619, 2013.