{"title": "Statistical Performance of Convex Tensor Decomposition", "book": "Advances in Neural Information Processing Systems", "page_first": 972, "page_last": 980, "abstract": "We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that the mean squared error of the convex method scales linearly with the quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.", "full_text": "Statistical Performance of Convex Tensor\n\nDecomposition\n\nRyota Tomioka\u2020\n\nTaiji Suzuki\u2020\n\u2020Department of Mathematical Informatics,\n\nThe University of Tokyo\nTokyo 113-8656, Japan\n\nKohei Hayashi\u2021\n\n\u2021Graduate School of Information Science,\nNara Institute of Science and Technology\n\nNara 630-0192, Japan\n\ntomioka@mist.i.u-tokyo.ac.jp\ns-taiji@stat.t.u-tokyo.ac.jp\n\nkohei-h@is.naist.jp\n\nHisashi Kashima\u2020,\u2217\n\n\u2217Basic Research Programs PRESTO,\n\nSynthesis of Knowledge for Information Oriented Society, JST\n\nTokyo 102-8666, Japan\n\nkashima@mist.i.u-tokyo.ac.jp\n\nAbstract\n\nWe analyze the statistical performance of a recently proposed convex tensor de-\ncomposition algorithm. Conventionally tensor decomposition has been formu-\nlated as non-convex optimization problems, which hindered the analysis of their\nperformance. We show under some conditions that the mean squared error of\nthe convex method scales linearly with the quantity we call the normalized rank\nof the true tensor. The current analysis naturally extends the analysis of convex\nlow-rank matrix estimation to tensors. 
Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.

1 Introduction

Tensors (multi-way arrays) generalize matrices and naturally represent data having more than two modalities. For example, multivariate time-series, such as electroencephalography (EEG) recordings collected from multiple subjects under various conditions, naturally form a tensor. Moreover, in collaborative filtering, users' preferences on products, conventionally represented as a matrix, can be represented as a tensor when the preferences change over time or context.

For the analysis of tensor data, various models and methods for the low-rank decomposition of tensors have been proposed (see Kolda & Bader [12] for a recent survey). These techniques have recently become increasingly popular in data mining [1, 14] and computer vision [25, 26]. Besides, they have proven useful in chemometrics [4], psychometrics [24], and signal processing [20, 7, 8].

Despite this empirical success, the statistical performance of tensor decomposition algorithms has not been fully elucidated. The difficulty lies in the non-convexity of the conventional tensor decomposition algorithms (e.g., alternating least squares [6]). In addition, studies have revealed many discrepancies (see [12]) between matrix rank and tensor rank, which make it challenging to extend studies on the performance of low-rank matrix models (e.g., [9]) to tensors.

Recently, several authors [21, 10, 13, 23] have focused on the notion of the tensor mode-k rank (instead of the tensor rank), which is related to the Tucker decomposition [24]. They discovered that regularized estimation based on the Schatten 1-norm, which is a popular technique for recovering low-rank matrices via convex optimization, can also be applied to tensor decomposition.
In particular, the study in [23] showed that there is a clear transition at a certain number of samples where the error drops dramatically from no generalization to perfect generalization (see Figure 1).

Figure 1: Result of estimation of a rank-(7, 8, 9) tensor of dimensions 50 × 50 × 20 from partial measurements; see [23] for the details. The estimation error |||Ŵ − W*|||_F is plotted against the fraction of observed elements m = M/N. Error bars over 10 repetitions are also shown. "Convex" refers to the convex tensor decomposition based on the minimization problem (7). "Tucker (exact)" refers to the conventional (non-convex) Tucker decomposition [24] at the correct rank. The gray dashed line shows the optimization tolerance 10^{-3}. The question is how we can predict the point where the generalization begins (roughly m = 0.35 in this plot).

In this paper, motivated by the above recent work, we mathematically analyze the performance of convex tensor decomposition. The new convex formulation for tensor decomposition allows us to generalize recent results on Schatten 1-norm-regularized estimation of matrices (see [17, 18, 5, 19]). Under a general setting we show how the estimation error scales with the mode-k ranks of the true tensor. Furthermore, we analyze the specific settings of (i) noisy tensor decomposition and (ii) random Gaussian design. In the first setting, we assume that all the elements of a low-rank tensor are observed with noise and the goal is to recover the underlying low-rank structure. This is the most common setting in which a tensor decomposition algorithm is used. In the second setting, we assume that the unknown tensor is the coefficient of a tensor-input scalar-output regression problem and the input tensors (design) are randomly drawn from independent Gaussian distributions.
Surprisingly, it turns out that the random Gaussian setting can precisely predict the phase-transition-like behaviour in Figure 1. To the best of our knowledge, this is the first paper that rigorously studies the performance of a tensor decomposition algorithm.

2 Notation

In this section, we introduce the notation we use in this paper. Moreover, we introduce a Hölder-like inequality (3) and the notion of mode-k decomposability (5), which play central roles in our analysis.

Let X ∈ R^{n_1 × ··· × n_K} be a K-way tensor. We denote the number of elements in X by N = ∏_{k=1}^K n_k. The inner product between two tensors ⟨W, X⟩ is defined as ⟨W, X⟩ = vec(W)^T vec(X), where vec is the vectorization. In addition, we define the Frobenius norm of a tensor as |||X|||_F = √⟨X, X⟩.

The mode-k unfolding X_(k) is the n_k × n̄_{\k} (n̄_{\k} := ∏_{k'≠k} n_{k'}) matrix obtained by concatenating the mode-k fibers (the vectors obtained by fixing every index of X but the kth) of X as column vectors. The mode-k rank of a tensor X, denoted by rank_k(X), is the rank of the mode-k unfolding X_(k) (as a matrix). Note that when K = 2, X is actually a matrix, and X_(2) = X_(1)^T. We say a tensor X is rank (r_1, ..., r_K) when r_k = rank_k(X) for k = 1, ..., K. Note that the mode-k rank can be computed in polynomial time, because it boils down to computing a matrix rank, whereas computing the tensor rank is NP-complete [11].
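As a concrete illustration (a sketch of ours, not code from the paper; the function names are our own), the mode-k unfolding and the mode-k ranks can be computed in a few lines of numpy:

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: move axis k to the front and flatten the rest.

    Returns an n_k x (N / n_k) matrix whose columns are mode-k fibers.
    """
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_k_ranks(X):
    """Tuple of mode-k ranks (r_1, ..., r_K), each a plain matrix rank."""
    return tuple(np.linalg.matrix_rank(unfold(X, k)) for k in range(X.ndim))

# A rank-(2, 2, 2) tensor built Tucker-style from a 2x2x2 core
# and random orthonormal factors (Q from a QR decomposition).
rng = np.random.default_rng(0)
core = rng.standard_normal((2, 2, 2))
factors = [np.linalg.qr(rng.standard_normal((n, 2)))[0] for n in (5, 6, 7)]
W = np.einsum('abc,ia,jb,kc->ijk', core, *factors)
print(mode_k_ranks(W))  # -> (2, 2, 2)
```

The column ordering of the unfolding is irrelevant here, since only the rank of the resulting matrix is used.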
See [12] for more details.

Since for each k the convex envelope of the mode-k rank is given by the Schatten 1-norm [18] (known as the trace norm [22] or the nuclear norm [3]), it is natural to consider the following overlapped Schatten 1-norm of a tensor W ∈ R^{n_1 × ··· × n_K} (see also [21]):

    |||W|||_{S1} = (1/K) Σ_{k=1}^K ||W_(k)||_{S1},    (1)

where W_(k) is the mode-k unfolding of W. Here ||·||_{S1} is the Schatten 1-norm for a matrix,

    ||W||_{S1} = Σ_{j=1}^r σ_j(W),

where σ_j(W) is the jth largest singular value of W. The dual norm of the Schatten 1-norm is the Schatten ∞-norm (known as the spectral norm):

    ||X||_{S∞} = max_{j=1,...,r} σ_j(X).

Since the two norms ||·||_{S1} and ||·||_{S∞} are dual to each other, we have the following inequality:

    |⟨W, X⟩| ≤ ||W||_{S1} ||X||_{S∞},    (2)

where ⟨W, X⟩ is the inner product of W and X. The same inequality holds for the overlapped Schatten 1-norm (1) and its dual norm. The dual norm of the overlapped Schatten 1-norm can be characterized by the following lemma.

Lemma 1. The dual norm of the overlapped Schatten 1-norm, denoted |||·|||*_{S1}, is given by the infimum of the maximum mode-k spectral norm over the tensors whose average equals the given tensor X:

    |||X|||*_{S1} = inf_{(1/K)(Y^(1) + Y^(2) + ··· + Y^(K)) = X}  max_{k=1,...,K} ||Y^(k)_(k)||_{S∞},

where Y^(k)_(k) is the mode-k unfolding of Y^(k). Moreover, the following upper bound on the dual norm is valid:

    |||X|||*_{S1} ≤ |||X|||_mean := (1/K) Σ_{k=1}^K ||X_(k)||_{S∞}.

Proof. The first part can be shown by solving the dual of the maximization problem sup ⟨W, X⟩ s.t. |||W|||_{S1} ≤ 1. The second part is obtained by setting Y^(k) = (K / Σ_{k'=1}^K 1/c_{k'}) · X / c_k, where c_k = ||X_(k)||_{S∞}, and using Jensen's inequality (the resulting harmonic mean of c_1, ..., c_K is bounded by their arithmetic mean).

According to Lemma 1, we have the following Hölder-like inequality:

    |⟨W, X⟩| ≤ |||W|||_{S1} |||X|||*_{S1} ≤ |||W|||_{S1} |||X|||_mean.    (3)

Note that the above bound is tighter than the more intuitive relation |⟨W, X⟩| ≤ |||W|||_{S1} |||X|||_{S∞} (with |||X|||_{S∞} := max_{k=1,...,K} ||X_(k)||_{S∞}), which one might come up with as an analogy to the matrix case (2), because the average is no larger than the maximum.

Finally, let W* ∈ R^{n_1 × ··· × n_K} be the low-rank tensor that we wish to recover. We assume that W* is rank (r_1, ..., r_K). Thus, for each k we have the singular value decomposition

    W*_(k) = U_k S_k V_k^T    (k = 1, ..., K),

where U_k ∈ R^{n_k × r_k} and V_k ∈ R^{n̄_{\k} × r_k} are orthogonal, and S_k ∈ R^{r_k × r_k} is diagonal. Let Δ ∈ R^{n_1 × ··· × n_K} be an arbitrary tensor. We define the mode-k orthogonal complement Δ''_k of an unfolding Δ_(k) ∈ R^{n_k × n̄_{\k}} of Δ with respect to the true low-rank tensor W* as follows:

    Δ''_k = (I_{n_k} − U_k U_k^T) Δ_(k) (I_{n̄_{\k}} − V_k V_k^T).    (4)

In addition, Δ'_k := Δ_(k) − Δ''_k is the component having overlapped row/column space with the unfolding W*_(k) of the true tensor. Note that the decomposition Δ_(k) = Δ'_k + Δ''_k is defined for each mode; thus we use the subscript k instead of (k).

Using the decomposition defined above, we have the following equality, which we call the mode-k decomposability of the Schatten 1-norm:

    ||W*_(k) + Δ''_k||_{S1} = ||W*_(k)||_{S1} + ||Δ''_k||_{S1}    (k = 1, ..., K).    (5)

The above decomposition is defined for each mode and is thus weaker than the notion of decomposability discussed by Negahban et al. [15].

3 Theory

In this section, we first present a deterministic result that holds under a certain choice of the regularization constant λ_M and an assumption called restricted strong convexity. Then, we focus on special cases to justify the choice of regularization constant and the restricted strong convexity assumption. We analyze the settings of (i) noisy tensor decomposition and (ii) random Gaussian design in Section 3.2 and Section 3.3, respectively.

3.1 Main result

Our goal is to estimate an unknown rank-(r_1, ..., r_K) tensor W* ∈ R^{n_1 × ··· × n_K} from observations

    y_i = ⟨X_i, W*⟩ + ε_i    (i = 1, ..., M).    (6)

Here the noise ε_i follows the independent zero-mean Gaussian distribution with variance σ².

We employ the regularized empirical risk minimization problem proposed in [21, 10, 13, 23] for the estimation of W:

    minimize_{W ∈ R^{n_1 × ··· × n_K}}  (1/2M) ||y − X(W)||_2^2 + λ_M |||W|||_{S1},    (7)

where y = (y_1, ..., y_M)^T is the collection of observations, and X : R^{n_1 × ··· × n_K} → R^M is a linear operator that maps W to the M-dimensional output vector X(W) = (⟨X_1, W⟩, ..., ⟨X_M, W⟩)^T ∈ R^M. The Schatten 1-norm term penalizes every mode of W to be jointly low-rank (see Equation (1)); λ_M > 0 is the regularization constant. Accordingly, the solution of the minimization problem (7) is typically a low-rank tensor when λ_M is sufficiently large. In addition, we denote the adjoint operator of X as X* : R^M → R^{n_1 × ··· × n_K}; that is, X*(ε) = Σ_{i=1}^M ε_i X_i ∈ R^{n_1 × ··· × n_K}.

The first step in our analysis is to characterize the particularity of the residual tensor Δ := Ŵ − W* as in the following lemma.

Lemma 2. Let Ŵ be the solution of the minimization problem (7) with λ_M ≥ 2 |||X*(ε)|||_mean / M, and let Δ := Ŵ − W*, where W* is the true low-rank tensor. Let Δ_(k) = Δ'_k + Δ''_k be the decomposition defined in Equation (4). Then we have the following inequalities:

1. rank(Δ'_k) ≤ 2 r_k for each k = 1, ..., K.
2. Σ_{k=1}^K ||Δ''_k||_{S1} ≤ 3 Σ_{k=1}^K ||Δ'_k||_{S1}.

Proof.
The proof uses the mode-k decomposability (5) and is analogous to that of Lemma 1 in [17].

The second ingredient of our analysis is restricted strong convexity. Although "strong" may sound like a strong assumption, the point is that we require this assumption to hold only for the particular residual tensors characterized in Lemma 2. The assumption can be stated as follows.

Assumption 1 (Restricted strong convexity). We suppose that there is a positive constant κ(X) such that the operator X satisfies the inequality

    (1/M) ||X(Δ)||_2^2 ≥ κ(X) |||Δ|||_F^2    (8)

for all Δ ∈ R^{n_1 × ··· × n_K} such that, for each k = 1, ..., K, rank(Δ'_k) ≤ 2 r_k and Σ_{k=1}^K ||Δ''_k||_{S1} ≤ 3 Σ_{k=1}^K ||Δ'_k||_{S1}, where Δ'_k and Δ''_k are defined through the decomposition (4).

Now, using the above two ingredients, we are ready to prove the following deterministic guarantee on the performance of the estimation procedure (7).

Theorem 1. Let Ŵ be the solution of the minimization problem (7) with λ_M ≥ 2 |||X*(ε)|||_mean / M. Suppose that the operator X satisfies the restricted strong convexity condition. Then the following bound holds:

    |||Ŵ − W*|||_F ≤ (32 λ_M / (κ(X) K)) Σ_{k=1}^K √r_k.    (9)

Proof. Let Δ = Ŵ − W*. Combining the fact that the objective value for Ŵ is smaller than that for W*, the Hölder-like inequality (3), the triangular inequality |||W*|||_{S1} − |||Ŵ|||_{S1} ≤ |||Δ|||_{S1}, and the assumption |||X*(ε)|||_mean / M ≤ λ_M / 2, we obtain

    (1/2M) ||X(Δ)||_2^2 ≤ 2 λ_M |||Δ|||_{S1}.    (10)

Now the left-hand side can be lower-bounded using the restricted strong convexity (8). On the other hand, using Lemma 2, the right-hand side can be upper-bounded as follows:

    |||Δ|||_{S1} = (1/K) Σ_{k=1}^K (||Δ'_k||_{S1} + ||Δ''_k||_{S1}) ≤ (4/K) Σ_{k=1}^K ||Δ'_k||_{S1} ≤ (4/K) Σ_{k=1}^K √(2 r_k) |||Δ|||_F,    (11)

where the last inequality follows because |||Δ|||_F = ||Δ_(k)||_F for k = 1, ..., K. Combining inequalities (8), (10), and (11), we obtain our claim (9).

Negahban et al. [15] (see also [17]) pointed out that the key properties for establishing a sharp convergence result for a regularized M-estimator are the decomposability of the regularizer and restricted strong convexity.
What we have shown suggests that the weaker mode-k decomposability (5) suffices to obtain the above convergence result for the overlapped Schatten 1-norm (1) regularization.

3.2 Noisy Tensor Decomposition

In this subsection, we consider the setting where all the elements are observed (with noise) and the goal is to recover the underlying low-rank tensor without noise.

Since all the elements are observed exactly once, X is simply a vectorization (M = N), and the left-hand side of inequality (10) gives the quantity of interest ||X(Δ)||_2^2 = |||Ŵ − W*|||_F^2. Therefore, the remaining task is to bound |||X*(ε)|||_mean, as in the following lemma.

Lemma 3. Suppose that X : R^{n_1 × ··· × n_K} → R^N is a vectorization of a tensor. With high probability the quantity |||X*(ε)|||_mean is concentrated around its mean, which can be bounded as follows:

    E |||X*(ε)|||_mean ≤ (σ/K) Σ_{k=1}^K (√n_k + √n̄_{\k}).    (12)

Setting the regularization constant as λ_M = c_0 E |||X*(ε)|||_mean / N, we obtain the following theorem.

Theorem 2. Suppose that X : R^{n_1 × ··· × n_K} → R^N is a vectorization of a tensor. There are universal constants c_0 and c_1 such that, with high probability, any solution of the minimization problem (7) with regularization constant λ_M = c_0 σ Σ_{k=1}^K (√n_k + √n̄_{\k}) / (KN) satisfies the following bound:

    |||Ŵ − W*|||_F^2 ≤ c_1 σ² ((1/K) Σ_{k=1}^K (√n_k + √n̄_{\k}))² ((1/K) Σ_{k=1}^K √r_k)².

Proof. Combining Equations (10)–(11) with the fact that X is simply a vectorization and M = N, we have

    (1/N) |||Ŵ − W*|||_F ≤ (16 √2 λ_M / K) Σ_{k=1}^K √r_k.

Substituting the choice of the regularization constant λ_M and squaring both sides, we obtain our claim.

We can simplify the result of Theorem 2 by noting that n̄_{\k} = N/n_k ≫ n_k when the dimensions are of the same order. Introducing the notation ||r||_{1/2} = ((1/K) Σ_{k=1}^K √r_k)² and n^{-1} := (1/n_1, ..., 1/n_K), we have

    (1/N) |||Ŵ − W*|||_F^2 ≤ O_p(σ² ||n^{-1}||_{1/2} ||r||_{1/2}).    (13)

We call the quantity r̄ = ||n^{-1}||_{1/2} ||r||_{1/2} the normalized rank, because r̄ = r/n when the dimensions are balanced (n_k = n and r_k = r for all k = 1, ..., K).

3.3 Random Gaussian Design

In this subsection, we consider the case where the elements of the input tensors X_i (i = 1, ..., M) in the observation model (6) are independently drawn from the standard Gaussian distribution.
We call this setting the random Gaussian design.

First we show an upper bound on the norm |||X*(ε)|||_mean, which we use to specify the scaling of the regularization constant λ_M in Theorem 1.

Lemma 4. Let X : R^{n_1 × ··· × n_K} → R^M be a random Gaussian design. In addition, we assume that the noise ε_i is sampled independently from N(0, σ²). Then with high probability the quantity |||X*(ε)|||_mean is concentrated around its mean, which can be bounded as follows:

    E |||X*(ε)|||_mean ≤ (σ √M / K) Σ_{k=1}^K (√n_k + √n̄_{\k}).

Next, the following lemma, which is a generalization of a result presented in Negahban and Wainwright [17, Proposition 1], provides a ground for the restricted strong convexity assumption (8).

Lemma 5. Let X : R^{n_1 × ··· × n_K} → R^M be a random Gaussian design. Then it satisfies

    ||X(Δ)||_2 / √M ≥ (1/4) |||Δ|||_F − (1/K) Σ_{k=1}^K (√(n_k/M) + √(n̄_{\k}/M)) |||Δ|||_{S1},

with probability at least 1 − 2 exp(−N/32).

Proof. The proof is analogous to that of Proposition 1 in [17], except that we use the Hölder-like inequality (3) for tensors instead of inequality (2) for matrices.

Finally, we obtain the following convergence bound.

Theorem 3. Under the random Gaussian design setup, there are universal constants c_0, c_1, and c_2 such that for a sample size M ≥ c_1 ((1/K) Σ_{k=1}^K (√n_k + √n̄_{\k}))² ((1/K) Σ_{k=1}^K √r_k)², any solution of the minimization problem (7) with regularization constant λ_M = c_0 σ Σ_{k=1}^K (√n_k + √n̄_{\k}) / (K √M) satisfies the following bound with high probability:

    |||Ŵ − W*|||_F^2 ≤ c_2 σ² ((1/K) Σ_{k=1}^K (√n_k + √n̄_{\k}))² ((1/K) Σ_{k=1}^K √r_k)² / M.

Again we can simplify the result of Theorem 3 as follows: for sample size M ≥ c_1 N r̄, we have

    |||Ŵ − W*|||_F^2 ≤ O_p(σ² N ||n^{-1}||_{1/2} ||r||_{1/2} / M),    (14)

where r̄ = ||n^{-1}||_{1/2} ||r||_{1/2} is the normalized rank. Note that the condition on the number of samples M does not depend on the noise variance σ². Therefore, in the limit σ² → 0, the bound (14) becomes arbitrarily small but is only valid for a sample size M that exceeds c_1 N r̄, which implies a threshold behaviour as in Figure 1.

Note also that in the matrix case (K = 2), with r_1 = r_2 = r, we have N ||n^{-1}||_{1/2} = O(n_1 + n_2). Therefore we can restate the above result as follows: for sample size M ≥ c_1 r(n_1 + n_2), we have |||Ŵ − W*|||_F^2 ≤ O_p(σ² r(n_1 + n_2) / M), which is compatible with the results in [17, 18].

4 Experiments

In this section, we conduct two numerical experiments to confirm our analysis in Section 3.2 and Section 3.3.

Figure 2: Result of noisy tensor decomposition for tensors of size 50 × 50 × 20 and 100 × 100 × 50. (a) Small noise (σ = 0.01). (b) Large noise (σ = 0.1).

4.1 Noisy Tensor Decomposition

We randomly generated low-rank tensors of dimensions n^(1) = (50, 50, 20) and n^(2) = (100, 100, 50) for various ranks (r_1, ..., r_K). For a specific rank, we generated the true tensor by drawing the elements of the r_1 × ··· × r_K "core tensor" from the standard normal distribution and multiplying each of its modes by an orthonormal factor randomly drawn from the Haar measure. As described in Section 3.2, the observation y consists of all the elements of the original tensor observed once (M = N) with additive independent Gaussian noise with variance σ². We used the alternating direction method of multipliers (ADMM) for the "constraint" approach described in [23, 10] to solve the minimization problem (7). The whole experiment was repeated 10 times and averaged.

The results are shown in Figure 2. The mean squared error |||Ŵ − W*|||_F^2 / N is plotted against the normalized rank r̄ = ||n^{-1}||_{1/2} ||r||_{1/2} (of the true tensor) defined in Equation (13).
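As a quick illustrative computation (our own sketch, not code from the paper; the function names are ours), the normalized rank r̄ = ||n^{-1}||_{1/2} ||r||_{1/2} can be evaluated directly from the definitions around Equation (13):

```python
import numpy as np

def norm_half(v):
    """The averaged ell_{1/2}-type quantity  ((1/K) * sum_k sqrt(v_k))**2."""
    v = np.asarray(v, dtype=float)
    return np.sqrt(v).mean() ** 2

def normalized_rank(dims, ranks):
    """Normalized rank  r_bar = ||n^{-1}||_{1/2} * ||r||_{1/2}  as in Eq. (13)."""
    return norm_half(1.0 / np.asarray(dims, dtype=float)) * norm_half(ranks)

# The rank-(7, 8, 9) tensor of size 50 x 50 x 20 from Figure 1:
print(round(normalized_rank((50, 50, 20), (7, 8, 9)), 3))  # -> 0.227
```

In the balanced case (n_k = n, r_k = r) this reduces to r/n; for example, `normalized_rank((50, 50, 50), (10, 10, 10))` gives 0.2, matching the value quoted in Section 4.2.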
Since the choice of the regularization constant λ_M only depends on the size of the tensor and not on the ranks of the underlying tensor in Theorem 2, we fix the regularization constant to several different values and report the dependency of the estimation error on the normalized rank r̄ of the true tensor.

Figure 2(a) shows the result for small noise (σ = 0.01) and Figure 2(b) shows the result for large noise (σ = 0.1). As predicted by Theorem 2, the squared error |||Ŵ − W*|||_F^2 grows linearly with the normalized rank r̄. This behaviour is consistently observed not only around the preferred regularization constant value (triangles) but also in the over-fitting case (circles) and the under-fitting case (crosses). Moreover, as predicted by Theorem 2, the preferred regularization constant value scales linearly and the squared error scales quadratically with the noise standard deviation σ. As predicted by Lemma 3, the curves for the smaller 50 × 50 × 20 tensor and those for the larger 100 × 100 × 50 tensor seem to agree when the regularization constant is scaled by a factor of two. Note that the dominant term in inequality (12) is the second term √n̄_{\k}, which is roughly scaled by a factor of two from 50 × 50 × 20 to 100 × 100 × 50.

4.2 Tensor completion from partial observations

In this subsection, we repeat the simulation originally done by Tomioka et al. [23] and demonstrate that our results in Section 3.3 can precisely predict the empirical scaling behaviour with respect to both the size and the rank of a tensor.

We present results for both matrix completion (K = 2) and tensor completion (K = 3). For the matrix case, we randomly generated low-rank matrices of dimensions 50 × 20, 100 × 40, and 250 × 200. For the tensor case, we randomly generated low-rank tensors of dimensions 50 × 50 × 20 and 100 × 100 × 50. We generated the matrices or tensors as in the previous subsection for various ranks. We randomly selected some elements of the true matrix/tensor for training and kept the remaining elements for testing. No observation noise is added. We used the ADMM for the "as a matrix" and "constraint" approaches described in [23] to solve the minimization problem (7) for matrix completion and tensor completion, respectively. Since there is no observation noise, we chose the regularization constant λ → 0. A single experiment for a specific size and rank can be visualized as in Figure 1.

Figure 3: Scaling behaviour of matrix/tensor completion with respect to the size n and the rank r. (a) Matrix completion (K = 2). (b) Tensor completion (K = 3).

In Figure 3, we plot the minimum fraction of observations m = M/N that achieved error |||Ŵ − W*|||_F smaller than 0.01 against the normalized rank r̄ = ||n^{-1}||_{1/2} ||r||_{1/2} (of the true tensor) defined in Equation (13). The matrix case is plotted in Figure 3(a) and the tensor case is plotted in Figure 3(b).
Each series (blue crosses or red circles) corresponds to different matrix/tensor size\nand each data-point corresponds to a different core size (rank). We can see that the fraction of obser-\nvations m = M/N scales linearly against the normalized rank \u00afr, which agrees with the condition\nM/N \u2265 c1\u2225n\n\u22121\u22251/2\u2225r\u22251/2 = c1\u00afr in Theorem 3 (see Equation (14)). The agreement is especially\ngood for tensor completion (Figure 3(b)), where the two series almost overlap. Interestingly, we\ncan see that when compared at the same normalized rank, tensor completion is easier than matrix\ncompletion. For example, when nk = 50 and rk = 10 for each k = 1, . . . , K, the normalized rank\nis 0.2. From Figure 3, we can see that we only need to see 30% of the entries in the tensor case to\nachieve error smaller than 0.01, whereas we need about 60% of the entries in the matrix case.\n\n5 Conclusion\n\nWe have analyzed the statistical performance of a tensor decomposition algorithm based on the\noverlapped Schatten 1-norm regularization (7). Numerical experiments show that our theory can\npredict the empirical scaling behaviour well. The fraction of observation m = M/N at the threshold\npredicted by our theory is proportional to the quantity we call the normalized rank, which re\ufb01nes\nconjecture (sum of the mode-k ranks) in [23].\nThere are numerous directions that the current study can be extended. In this paper, we have focused\non the convergence of the estimation error; it would be meaningful to also analyze the condition for\nthe consistency of the estimated rank as in [2]. Second, although we have succeeded in predicting\nthe empirical scaling behaviour, the setting of random Gaussian design does not match the tensor\ncompletion setting in Section 4.2. In order to analyze the latter setting, the notion of incoherence in\n[5] or spikiness in [16] might be useful. 
This might also explain why tensor completion is easier than matrix completion at the same normalized rank. Moreover, when the target tensor is low-rank in only a certain mode, Schatten 1-norm regularization fails badly (as predicted by the high normalized rank). It would be desirable to analyze the “Mixture” approach that aims at this case [23]. In a broader context, we believe that the current paper could serve as a basis for re-examining the concept of tensor rank and low-rank approximation of tensors based on convex optimization.

Acknowledgments. We would like to thank Franz Király and Hiroshi Kajino for their valuable comments and discussions. This work was supported in part by MEXT KAKENHI 22700138, 23240019, 23120004, 22700289, and NTT Communication Science Laboratories.

[Figure 3 plots the fraction of observations achieving error ≤ 0.01 against the normalized rank ∥n^{−1}∥_{1/2} ∥r∥_{1/2}, for matrix sizes 50×20, 100×40, 250×200 and tensor sizes 50×50×20, 100×100×50.]

References
[1] E. Acar and B. Yener. Unsupervised multiway data analysis: A literature survey. IEEE T. Knowl. Data. En., 21(1):6–20, 2009.
[2] F. R. Bach. Consistency of trace norm minimization. J. Mach. Learn. Res., 9:1019–1048, 2008.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] R. Bro. PARAFAC. Tutorial and applications. Chemometr. Intell. Lab., 38(2):149–171, 1997.
[5] E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717–772, 2009.
[6] J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition. Psychometrika, 35(3):283–319, 1970.
[7] P. Comon. Tensor decompositions. In J. G. McWhirter and I. K. Proudler, editors, Mathematics in signal processing V.
Oxford University Press, 2002.
[8] L. De Lathauwer and J. Vandewalle. Dimensionality reduction in higher-order signal processing and rank-(r1, r2, . . . , rn) reduction in multilinear algebra. Linear Algebra Appl., 391:31–55, 2004.
[9] K. Fukumizu. Generalization error of linear neural networks in unidentifiable cases. In Algorithmic Learning Theory, pages 51–62. Springer, 1999.
[10] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27:025010, 2011.
[11] J. Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11(4):644–654, 1990.
[12] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[13] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. In Proc. ICCV, 2009.
[14] M. Mørup. Applications of tensor (multiway array) factorizations and decompositions in data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):24–40, 2011.
[15] S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in NIPS 22, pages 1348–1356, 2009.
[16] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Technical report, arXiv:1009.2118, 2010.
[17] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist., 39(2), 2011.
[18] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[19] A. Rohde and A. B. Tsybakov.
Estimation of high-dimensional low-rank matrices. Ann. Statist., 39(2):887–930, 2011.
[20] N. D. Sidiropoulos, R. Bro, and G. B. Giannakis. Parallel factor analysis in sensor array processing. IEEE T. Signal Proces., 48(8):2377–2388, 2000.
[21] M. Signoretto, L. De Lathauwer, and J. A. K. Suykens. Nuclear norms for tensors and their use for convex multilinear estimation. Technical Report 10-186, ESAT-SISTA, K.U.Leuven, 2010.
[22] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in NIPS 17, pages 1329–1336. MIT Press, Cambridge, MA, 2005.
[23] R. Tomioka, K. Hayashi, and H. Kashima. Estimation of low-rank tensors via convex optimization. Technical report, arXiv:1010.0789, 2011.
[24] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
[25] M. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: Tensorfaces. Computer Vision (ECCV 2002), pages 447–460, 2002.
[26] H. Wang and N. Ahuja. Facial expression decomposition. In Proc. 9th ICCV, pages 958–965, 2003.