{"title": "Mixed Linear Regression with Multiple Components", "book": "Advances in Neural Information Processing Systems", "page_first": 2190, "page_last": 2198, "abstract": "In this paper, we study the mixed linear regression (MLR) problem, where the goal is to recover multiple underlying linear models from their unlabeled linear measurements. We propose a non-convex objective function which we show is {\\em locally strongly convex} in the neighborhood of the ground truth. We use a tensor method for initialization so that the initial models are in the local strong convexity region. We then employ general convex optimization algorithms to minimize the objective function. To the best of our knowledge, our approach provides first exact recovery guarantees for the MLR problem with $K \\geq 2$ components. Moreover, our method has near-optimal computational complexity $\\tilde O (Nd)$ as well as near-optimal sample complexity $\\tilde O (d)$ for {\\em constant} $K$. Furthermore, we show that our non-convex formulation can be extended to solving the {\\em subspace clustering} problem as well. In particular, when initialized within a small constant distance to the true subspaces, our method converges to the global optima (and recovers true subspaces) in time {\\em linear} in the number of points. Furthermore, our empirical results indicate that even with random initialization, our approach converges to the global optima in linear time, providing speed-up of up to two orders of magnitude.", "full_text": "Mixed Linear Regression with Multiple Components\n\nKai Zhong 1\n\nPrateek Jain 2\n\nInderjit S. Dhillon 3\n\n1,3 University of Texas at Austin\n1 zhongkai@ices.utexas.edu,\n\n2 Microsoft Research India\n2 prajain@microsoft.com\n\n3 inderjit@cs.utexas.edu\n\nAbstract\n\nIn this paper, we study the mixed linear regression (MLR) problem, where the\ngoal is to recover multiple underlying linear models from their unlabeled linear\nmeasurements. 
We propose a non-convex objective function which we show is locally strongly convex in the neighborhood of the ground truth. We use a tensor method for initialization so that the initial models are in the local strong convexity region. We then employ general convex optimization algorithms to minimize the objective function. To the best of our knowledge, our approach provides first exact recovery guarantees for the MLR problem with $K \geq 2$ components. Moreover, our method has near-optimal computational complexity $\tilde O(Nd)$ as well as near-optimal sample complexity $\tilde O(d)$ for constant $K$. Furthermore, we show that our non-convex formulation can be extended to solving the subspace clustering problem as well. In particular, when initialized within a small constant distance to the true subspaces, our method converges to the global optima (and recovers true subspaces) in time linear in the number of points. Furthermore, our empirical results indicate that even with random initialization, our approach converges to the global optima in linear time, providing speed-up of up to two orders of magnitude.

1 Introduction

The mixed linear regression (MLR) [7, 9, 29] models each observation as being generated from one of $K$ unknown linear models; the identity of the generating model for each data point is also unknown. MLR is a popular technique for capturing non-linear measurements while still keeping the models simple and computationally efficient. Several widely-used variants of linear regression, such as piecewise linear regression [14, 28] and locally linear regression [8], can be viewed as special cases of MLR. MLR has also been applied in time-series analysis [6], trajectory clustering [15], health care analysis [11] and phase retrieval [4].
See [27] for more applications.

In general, MLR is NP-hard [29], with the hardness arising from the lack of information about the model labels (the model from which a point is generated) as well as the model parameters. However, under certain statistical assumptions, several recent works have provided poly-time algorithms for solving MLR [2, 4, 9, 29]. But most of the existing recovery guarantees are restricted either to mixtures with $K = 2$ components [4, 9, 29] or require $\mathrm{poly}(1/\varepsilon)$ samples/time to achieve an $\varepsilon$-approximate solution [7, 24] (the analysis of [29] for two components can obtain an $\varepsilon$-approximate solution in $\log(1/\varepsilon)$ samples). Hence, solving the MLR problem with $K \geq 2$ mixtures while using a near-optimal number of samples and computational time is still an open question.

In this paper, we resolve the above question under standard statistical assumptions for a constant number of mixture components $K$. To this end, we propose the following smooth objective function as a surrogate to solve MLR:

$$f(w_1, w_2, \cdots, w_K) := \sum_{i=1}^{N} \prod_{k=1}^{K} (y_i - x_i^T w_k)^2, \qquad (1)$$

where $\{(x_i, y_i) \in \mathbb{R}^{d+1}\}_{i=1,2,\cdots,N}$ are the data points and $\{w_k\}_{k=1,2,\cdots,K}$ are the model parameters. The intuition for this objective is that the objective value is zero when $\{w_k\}_{k=1,2,\cdots,K}$ is the global optimum and the $y$'s do not contain any noise. Furthermore, the objective function is smooth and hence less prone to getting stuck in arbitrary saddle points or oscillating between two points. The standard EM algorithm [29] instead makes a "sharp" selection of the mixture component, and hence is more likely to oscillate or get stuck.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
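As a concrete illustration, objective (1) and its gradient can be evaluated in $O(NKd)$ time. The following is a minimal NumPy sketch (the function names `mlr_objective`/`mlr_gradient` and the NumPy implementation are ours, not from the paper, whose experiments use Matlab):

```python
import numpy as np

def mlr_objective(W, X, y):
    """f(w_1,...,w_K) = sum_i prod_k (y_i - x_i^T w_k)^2  -- Eq. (1).

    W: (K, d) stacked model parameters; X: (N, d) covariates; y: (N,) responses.
    """
    R = y[:, None] - X @ W.T            # residuals r_{ik} = y_i - x_i^T w_k, shape (N, K)
    return np.sum(np.prod(R ** 2, axis=1))

def mlr_gradient(W, X, y):
    """Gradient of Eq. (1): df/dw_k = sum_i -2 r_{ik} (prod_{j != k} r_{ij}^2) x_i."""
    R = y[:, None] - X @ W.T
    R2 = R ** 2
    G = np.zeros_like(W)
    for k in range(W.shape[0]):
        others = np.prod(np.delete(R2, k, axis=1), axis=1)   # prod over j != k
        G[k] = (-2.0 * R[:, k] * others) @ X
    return G
```

At a noiseless global optimum every sample has a zero residual for its own component, so each product term vanishes and the objective is exactly zero, matching the intuition above.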
This intuition is reflected in Figure 1(d), which shows that with random initialization the EM algorithm routinely gets stuck at poor solutions, while our proposed method based on the above objective still converges to the global optima.

Unfortunately, the above objective function is non-convex and is in general prone to poor saddle points and local minima. However, under certain standard assumptions, we show that the objective is locally strongly convex (Theorem 1) in a small basin of attraction near the optimal solution. Moreover, the objective function is smooth. Hence, we can use the gradient descent method to achieve a linear rate of convergence to the global optima. But we will need to initialize the optimization algorithm with an iterate which lies in a small ball around the optima. To this end, we modify the tensor method in [2, 7] to obtain a "good" initialization point. Typically, tensor methods require computation of third and higher order moments, which leads to significantly worse sample complexity in terms of the data dimensionality $d$. However, for the special case of MLR, we provide a small modification of the standard tensor method that achieves nearly optimal sample and time complexity bounds for constant $K$ (see Theorem 3). More concretely, our approach requires $\tilde O(d (K \log d)^K)$ samples and $\tilde O(Nd)$ computational time; note the exponential dependence on $K$. Also, for constant $K$, the method has nearly optimal sample and time complexity.

Subspace clustering: MLR can be viewed as a special case of subspace clustering (SC), since each regressor-response pair lies in the subspace determined by this pair's model parameters. However, solving MLR using SC approaches is intractable because the dimension of each subspace is only one less than the ambient dimension, which easily violates the conditions for the recovery guarantees of most methods (see, e.g.,
Table 1 in [23] for the conditions of different methods). Nonetheless, our objective for MLR easily extends to the subspace clustering problem. That is, given data points $\{z_i \in \mathbb{R}^d\}_{i=1,2,\cdots,N}$, the goal is to minimize the following objective w.r.t. $K$ subspaces (each of dimension at most $r$):

$$\min_{U_k \in O_{d\times r},\, k=1,2,\cdots,K} \ f(U_1, U_2, \cdots, U_K) = \sum_{i=1}^{N} \prod_{k=1}^{K} \big\langle I_d - U_k U_k^T,\ z_i z_i^T \big\rangle. \qquad (2)$$

$U_k$ denotes the basis spanned by the $k$-th estimated subspace and $O_{d\times r} \subset \mathbb{R}^{d\times r}$ denotes the set of orthonormal matrices, i.e., $U^T U = I$ if $U \in O_{d\times r}$. We propose a power-method style algorithm to alternately optimize (2) w.r.t. $\{U_k\}_{k=1,2,\cdots,K}$, which takes only $O(rdN)$ time compared with $O(dN^2)$ for the state-of-the-art methods, e.g., [13, 22, 23].

Although EM with the power method [4] shares the same computational complexity as ours, to the best of our knowledge there is no convergence guarantee for EM. In contrast, we provide a local convergence guarantee for our method. That is, if $N = \tilde O(r K^K)$ and the data satisfies certain standard assumptions, then starting from an initial point $\{U_k\}_{k=1,\cdots,K}$ that lies in a small ball of constant radius around the globally optimal solution, our method converges super-linearly to the globally optimal solution. Unfortunately, our existing analyses do not provide a global convergence guarantee, and we leave this as a topic for future work. Interestingly, our empirical results indicate that even with randomly initialized $\{U_k\}_{k=1,\cdots,K}$, our method is able to recover the true subspaces exactly using nearly $O(rK)$ samples.

We summarize our contributions below:
(1) MLR: We propose a non-convex continuous objective function for solving the mixed linear regression problem.
To the best of our knowledge, our algorithm is the first that can handle $K \geq 2$ components with a global convergence guarantee in the noiseless case (Theorem 4). Our algorithm has near-optimal linear (in $d$) sample complexity and near-optimal computational complexity; however, the dependence of our sample complexity on $K$ is exponential.

(2) Subspace Clustering: We extend our objective function to subspace clustering, which can be optimized efficiently in $O(rdN)$ time compared with $O(dN^2)$ for state-of-the-art methods. We also provide a small basin of attraction in which our iterates converge to the global optima at a super-linear rate (Theorem 5).

2 Related Work

Mixed Linear Regression:
The EM algorithm without careful initialization is only guaranteed to converge locally [4, 21, 29]. [29] proposed a grid search method for initialization; however, it is limited to the two-component case and seems non-trivial to extend to multiple components. It is known that exact minimization in each step of EM is not scalable due to its $O(d^2 N + d^3)$ complexity. Alternatively, we can use EM with gradient updates, whose local convergence is also guaranteed by [4], but only in the two-symmetric-component case, i.e., when $w_2 = -w_1$.

Tensor Methods for MLR were studied by [7, 24]. [24] approximated the third-order moment directly from samples with Gaussian distribution, while [7] learned the third-order moment from a low-rank linear regression problem. Tensor methods can obtain the model parameters to any precision $\varepsilon$, but require $1/\varepsilon^2$ time/samples. Tensor methods can also handle multiple components, but suffer from high sample complexity and high computational complexity. For example, the sample complexity required by [7] and [24] is $O(d^6)$ and $O(d^3)$ respectively. On the other hand, the computational burden mainly comes from operations on tensors, which cost at least $O(d^3)$ even for a very simple tensor evaluation.
[7] also suffers from slow nuclear norm minimization when estimating the second and third order moments. In contrast, we use the tensor method only for initialization, i.e., we require $\varepsilon$ to be a certain constant. Moreover, with a simple trick, we can ensure that the sample and time complexity of our initialization step is only linear in $d$ and $N$.

Convex Formulation. Another approach to guarantee recovery of the parameters is to relax the non-convex problem to a convex one. [9] proposed a convex formulation of MLR with two components. The authors provide upper bounds on the recovery errors in the noisy case and show their algorithm is information-theoretically optimal. However, the convex formulation needs to minimize a nuclear norm objective under linear constraints, which leads to high computational cost. The extension of this formulation from two components to multiple components is also not straightforward.

Subspace Clustering:
Subspace clustering [13, 17, 22, 23] is an important data clustering problem arising in many research areas. The most popular subspace clustering algorithms, such as [13, 17, 23], are based on a two-stage procedure: first finding a neighborhood for each data point and then clustering the points given the neighborhoods. The first stage usually takes at least $O(dN^2)$ time, which is prohibitive when $N$ is large. On the other hand, several methods such as K-subspaces clustering [18], K-SVD [1] and online subspace clustering [25] do have linear time complexity $O(rdN)$ per iteration; however, they come with no global or local convergence guarantees. In contrast, we show a locally superlinear convergence result for an algorithm with computational complexity $O(rdN)$.
Our empirical results indicate that random initialization also suffices to reach the global optima; we leave further investigation of such an algorithm for future work.

3 Mixed Linear Regression with Multiple Components

In this paper, we assume the dataset $\{(x_i, y_i) \in \mathbb{R}^{d+1}\}_{i=1,2,\cdots,N}$ is generated by

$$y_i = x_i^T w^*_{z_i}, \quad z_i \sim \mathrm{multinomial}(p), \quad x_i \sim \mathcal{N}(0, I_d), \qquad (3)$$

where $p$ is the proportion of the different components, satisfying $p^T \mathbf{1} = 1$, and $\{w^*_k \in \mathbb{R}^d\}_{k=1,2,\cdots,K}$ are the ground truth parameters. The goal is to recover $\{w^*_k\}_{k=1,2,\cdots,K}$ from the dataset. Our analysis is for the noiseless case, but we also illustrate the empirical performance of our algorithm in the noisy case, where $y_i = x_i^T w^*_{z_i} + e_i$ for some noise $e_i$ (see Figure 1).

Notation. We use $[N]$ to denote the set $\{1, 2, \cdots, N\}$ and $S_k \subset [N]$ to denote the index set of the samples that come from the $k$-th component. Define $p_{\min} := \min_{k\in[K]}\{p_k\}$ and $p_{\max} := \max_{k\in[K]}\{p_k\}$. Define $\Delta w_j := w_j - w^*_j$ and $w^*_{kj} := w^*_k - w^*_j$. Define $\Delta_{\min} := \min_{j\neq k}\{\|w^*_{jk}\|\}$ and $\Delta_{\max} := \max_{j\neq k}\{\|w^*_{jk}\|\}$. We assume $\Delta_{\max}/\Delta_{\min}$ is independent of the dimension $d$. Define $w := [w_1; w_2; \cdots; w_K] \in \mathbb{R}^{Kd}$. We denote by $w^{(t)}$ the parameters at the $t$-th iteration and by $w^{(0)}$ the initial parameters. For simplicity, we assume there are $p_k N$ samples from the $k$-th model in any random subset of $N$ samples. We use $\mathbb{E}[X]$ to denote the expectation of a random variable $X$. Let $T \in \mathbb{R}^{d\times d\times d}$ be a tensor and $T_{ijk}$ be its $(i,j,k)$-th entry. We say a tensor is supersymmetric if $T_{ijk}$ is invariant under any permutation of $i, j, k$. We also use the same notation $T$ to denote the multilinear map from three matrices $A, B, C \in \mathbb{R}^{d\times r}$ to a new tensor: $[T(A, B, C)]_{i,j,k} = \sum_{p,q,l} T_{pql} A_{pi} B_{qj} C_{lk}$.
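For concreteness, data from model (3) can be generated as follows. This is a minimal sketch; the helper name `sample_mlr` is ours, not from the paper:

```python
import numpy as np

def sample_mlr(W_true, p, N, rng):
    """Draw (x_i, y_i) from model (3): z_i ~ multinomial(p),
    x_i ~ N(0, I_d), y_i = x_i^T w*_{z_i}.

    W_true: (K, d) ground-truth parameters; p: mixture proportions summing to 1.
    """
    K, d = W_true.shape
    z = rng.choice(K, size=N, p=p)              # latent component labels
    X = rng.standard_normal((N, d))             # Gaussian covariates
    y = np.einsum('nd,nd->n', X, W_true[z])     # y_i = <x_i, w*_{z_i}>
    return X, y, z
```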
We say a tensor $T$ is rank-one if $T = a \otimes b \otimes c$, i.e., $T_{ijk} = a_i b_j c_k$. We use $\|A\|$ to denote the spectral norm of a matrix $A$ and $\sigma_i(A)$ to denote its $i$-th largest singular value. For tensors, we use $\|T\|_{op}$ to denote the operator norm of a supersymmetric tensor $T$, $\|T\|_{op} := \max_{\|a\|=1} |T(a, a, a)|$. We use $T_{(1)} \in \mathbb{R}^{d\times d^2}$ to denote the matricization of $T$ in the first mode, i.e., $[T_{(1)}]_{i,(j-1)d+k} = T_{ijk}$. Throughout the paper, we use $\tilde O(d)$ to denote $O(d \cdot \mathrm{polylog}(d))$. We assume $K$ is a constant in general. However, if some quantity depends on $K^K$, we present it explicitly in the big-O notation. For simplicity, we keep only the higher-order terms in $K$ and ignore lower-order terms; e.g., $O((2K)^{2K})$ may be replaced by $O(K^K)$.

3.1 Local Strong Convexity

In this section, we analyze the Hessian of objective (1).

Theorem 1 (Local Positive Definiteness). Let $\{x_i, y_i\}_{i=1,2,\cdots,N}$ be sampled from the MLR model (3). Let $\{w_k\}_{k=1,2,\cdots,K}$ be independent of the samples and lie in the neighborhood of the optimal solution, i.e.,

$$\|w_k - w^*_k\| \leq c_m \Delta_{\min}, \ \forall k \in [K], \qquad (4)$$

where $c_m = O\big(p_{\min} (3K)^{-K} (\Delta_{\min}/\Delta_{\max})^{2K-2}\big)$, $\Delta_{\min} = \min_{j\neq k}\{\|w^*_j - w^*_k\|\}$ and $\Delta_{\max} = \max_{j\neq k}\{\|w^*_j - w^*_k\|\}$. Let $P \geq 1$ be a constant. Then if $N \geq O((PK)^K d \log^{K+2}(d))$, w.p. $1 - O(K d^{-P})$ we have

$$\tfrac{1}{8}\, p_{\min} N \Delta_{\min}^{2K-2}\, I \preceq \nabla^2 f(w + \Delta w) \preceq 10 N (3K)^K \Delta_{\max}^{2K-2}\, I, \qquad (5)$$

for any $\Delta w := [\Delta w_1; \Delta w_2; \cdots; \Delta w_K]$ satisfying $\|\Delta w_k\| \leq c_f \Delta_{\min}$, where $c_f = O\big(p_{\min} (3K)^{-K} d^{-(K+1)} (\Delta_{\min}/\Delta_{\max})^{2K-2}\big)$.

The above theorem shows that the Hessians in a small neighborhood around a fixed $\{w_k\}_{k=1,2,\cdots,K}$, which is close enough to the optimum, are positive definite (PD). The conditions on $\{w_k\}_{k=1,\cdots,K}$ and $\{\Delta w_k\}_{k=1,\cdots,K}$ are different: $\{w_k\}_{k=1,\cdots,K}$ are required to be independent of the samples and to lie in a ball of radius $c_m \Delta_{\min}$ centered at the optimal solution.
On the other hand, $\{\Delta w_k\}_{k=1,2,\cdots,K}$ can depend on the samples, but are required to lie in a smaller ball of radius $c_f \Delta_{\min}$. The conditions are natural: if $\Delta_{\min}$ is very small, then distinguishing between $w^*_k$ and $w^*_{k'}$ is not possible, and hence the Hessian will not be PD w.r.t. both components.

To prove the theorem, we decompose the Hessian of Eq. (1) into multiple blocks, $(\nabla^2 f)_{jl} = \frac{\partial^2 f}{\partial w_j \partial w_l} \in \mathbb{R}^{d\times d}$. When $w_k \to w^*_k$ for all $k \in [K]$, the diagonal blocks of the Hessian become strictly positive definite, while the off-diagonal blocks become close to zero. The blocks are approximated by the samples using the matrix Bernstein inequality. The detailed proof can be found in Appendix A.2.

Traditional analysis of optimization methods on strongly convex functions, such as gradient descent, requires the Hessians at all parameter values to be PD. Theorem 1 implies that when $w_k = w^*_k$ for all $k = 1, 2, \cdots, K$, a small basin of attraction around the optimum is strongly convex, as formally stated in the following corollary.

Corollary 1 (Strong Convexity near the Optimum). Let $\{x_i, y_i\}_{i=1,2,\cdots,N}$ be sampled from the MLR model (3). Let $\{w_k\}_{k=1,2,\cdots,K}$ lie in the neighborhood of the optimal solution, i.e.,

$$\|w_k - w^*_k\| \leq c_f \Delta_{\min}, \ \forall k \in [K], \qquad (6)$$

where $c_f = O\big(p_{\min} (3K)^{-K} d^{-(K+1)} (\Delta_{\min}/\Delta_{\max})^{2K-2}\big)$. Then, for any constant $P \geq 1$, if $N \geq O((PK)^K d \log^{K+2}(d))$, the objective function $f(w_1, w_2, \cdots, w_K)$ in Eq. (1) is strongly convex w.p. $1 - O(K d^{-P})$. In particular, w.p. $1 - O(K d^{-P})$, for all $w$ satisfying Eq. (6),

$$\tfrac{1}{8}\, p_{\min} N \Delta_{\min}^{2K-2}\, I \preceq \nabla^2 f(w) \preceq 10 N (3K)^K \Delta_{\max}^{2K-2}\, I. \qquad (7)$$

The strong convexity in Corollary 1 only holds in a basin of attraction near the optimum whose diameter is of order $O(d^{-(K+1)})$, which is too small to be reached by our initialization method (Sec. 3.2) using $\tilde O(d)$ samples.
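The local positive definiteness asserted in Corollary 1 can be checked numerically on a toy instance by finite-differencing the Hessian of objective (1) at the ground truth. This is an illustrative sketch under our own toy dimensions, not part of the paper's proof:

```python
import numpy as np

def f_obj(w_flat, X, y, K, d):
    """Objective (1) evaluated at the stacked parameter vector w."""
    W = w_flat.reshape(K, d)
    R = y[:, None] - X @ W.T
    return np.sum(np.prod(R ** 2, axis=1))

def numeric_hessian(w_flat, X, y, K, d, h=1e-4):
    """Central-difference Hessian of Eq. (1) in the stacked parameter w."""
    n = w_flat.size
    H = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            wpp = w_flat.copy(); wpp[a] += h; wpp[b] += h
            wpm = w_flat.copy(); wpm[a] += h; wpm[b] -= h
            wmp = w_flat.copy(); wmp[a] -= h; wmp[b] += h
            wmm = w_flat.copy(); wmm[a] -= h; wmm[b] -= h
            H[a, b] = (f_obj(wpp, X, y, K, d) - f_obj(wpm, X, y, K, d)
                       - f_obj(wmp, X, y, K, d) + f_obj(wmm, X, y, K, d)) / (4 * h * h)
    return H
```

At $w = w^*$ each point has a zero residual for its own component, so the off-diagonal Hessian blocks (which carry a factor $r_{i1} r_{i2}$) vanish, while the diagonal blocks are sums of positive-semidefinite rank-one terms; numerically the smallest eigenvalue comes out strictly positive, in line with the left inequality of Eq. (7).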
Next, we show by a simple construction that linear convergence of gradient descent (GD) with resampling is still guaranteed when the solution is initialized in a much larger neighborhood.

Theorem 2 (Convergence of Gradient Descent). Let $\{x_i, y_i\}_{i=1,2,\cdots,N}$ be sampled from the MLR model (3). Let $\{w_k\}_{k=1,2,\cdots,K}$ be independent of the samples and lie in the neighborhood of the optimal solution defined in Eq. (4). One iteration of gradient descent can be described as $w^+ = w - \eta \nabla f(w)$, where $\eta = 1/(10 N (3K)^K \Delta_{\max}^{2K-2})$. Then, if $N \geq O(K^K d \log^{K+2}(d))$, w.p. $1 - O(K d^{-2})$,

$$\|w^+ - w^*\|^2 \leq \Big(1 - \frac{p_{\min} \Delta_{\min}^{2K-2}}{80 (3K)^K \Delta_{\max}^{2K-2}}\Big) \|w - w^*\|^2. \qquad (8)$$

Remark. The linear convergence in Eq. (8) requires resampling of the data points at each iteration. In Sec. 3.3, we combine it with Corollary 1, which does not require resampling once the iterate is sufficiently close to the optimum, to show that there is an algorithm using a finite number of samples that achieves any solution precision.

To prove Theorem 2, we establish the PD property on the line between the current iterate and the optimum by constructing a set of anchor points, and then apply the traditional analysis for linear convergence of gradient descent. The detailed proof can be found in Appendix A.3.

3.2 Initialization via Tensor Method

In this section, we propose a tensor method to initialize the parameters. We define the second-order moment $M_2 := \mathbb{E}\big[y^2 (x \otimes x - I)\big]$ and the third-order moment $M_3 := \mathbb{E}\big[y^3\, x \otimes x \otimes x\big] - \sum_{j\in[d]} \mathbb{E}\big[y^3 (e_j \otimes x \otimes e_j + e_j \otimes e_j \otimes x + x \otimes e_j \otimes e_j)\big]$. According to Lemma 6 in [24], $M_2 = \sum_{k\in[K]} 2 p_k\, w^*_k \otimes w^*_k$ and $M_3 = \sum_{k\in[K]} 6 p_k\, w^*_k \otimes w^*_k \otimes w^*_k$. Therefore, by computing the eigendecomposition of the estimated moments, we are able to recover the parameters to any precision, provided enough samples.
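The second-moment identity above can be verified empirically. The following sketch estimates $M_2$ from samples and compares it with $\sum_k 2 p_k w^*_k (w^*_k)^T$ (helper names are ours):

```python
import numpy as np

def estimate_M2(X, y):
    """Empirical M2_hat = (1/N) sum_i y_i^2 (x_i x_i^T - I)."""
    N, d = X.shape
    w2 = y ** 2
    return (X.T @ (w2[:, None] * X)) / N - w2.mean() * np.eye(d)

def population_M2(W_true, p):
    """Population M2 = sum_k 2 p_k w*_k (w*_k)^T (Lemma 6 of [24])."""
    return sum(2 * pk * np.outer(wk, wk) for pk, wk in zip(p, W_true))
```

The key Gaussian fact is $\mathbb{E}[(x^T w)^2 x x^T] = \|w\|^2 I + 2 w w^T$, so subtracting $\mathbb{E}[y^2] I$ leaves exactly the rank-$K$ term $\sum_k 2 p_k w^*_k (w^*_k)^T$.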
Theorem 8 of [24] needs $O(d^3)$ samples to obtain the model parameters to a certain precision. This high sample complexity comes from the tensor concentration bound. However, we find that the tensor eigendecomposition problem in MLR can be reduced to the $\mathbb{R}^{K\times K\times K}$ space, so that its sample complexity and computational complexity become $O(\mathrm{poly}(K))$. Our method is similar to the whitening process in [7, 19]. However, [7] needs $O(d^6)$ samples due to its nuclear-norm minimization problem, while ours requires only $\tilde O(d)$. For this sample complexity, we need the following assumption.

Assumption 1. The quantities $\sigma_K(M_2)$, $\|M_2\|$, $\|M_3\|_{op}^{2/3}$, $\sum_{k\in[K]} p_k \|w^*_k\|^2$ and $\big(\sum_{k\in[K]} p_k \|w^*_k\|^3\big)^{2/3}$ all have the same order in $d$, i.e., the ratio between any two of them is independent of $d$.

The above assumption holds, for example, when $\{w^*_k\}_{k=1,2,\cdots,K}$ are orthonormal to each other.

We formally present the tensor method in Algorithm 1 and its theoretical guarantee in Theorem 3.

Theorem 3. Under Assumption 1, if $|\Omega| \geq O(d \log^2(d) + \log^4(d))$, then w.p. $1 - O(d^{-2})$, Algorithm 1 outputs $\{w^{(0)}_k\}_{k=1}^K$ satisfying

$$\|w^{(0)}_k - w^*_k\| \leq c_m \Delta_{\min}, \ \forall k \in [K],$$

which falls in the locally PD region, Eq. (4), of Theorem 1.

The proof can be found in Appendix B.2. Forming $\hat M_2$ explicitly would cost $O(N d^2)$ time, which is expensive when $d$ is large. Instead, we can compute each step of the power method without explicitly forming $\hat M_2$: we alternately compute $\hat Y^{(t+1)} = \sum_{i\in\Omega_{M_2}} y_i^2 \big(x_i (x_i^T Y^{(t)}) - Y^{(t)}\big)$ and let $Y^{(t+1)} = QR(\hat Y^{(t+1)})$. Each power method iteration then needs only $O(KNd)$ time. Furthermore, the number of iterations needed is a constant, since the power method has a linear convergence rate and we do not need a very accurate solution. For the proof of this claim, we refer
For the proof of this claim, we refer\n\ny2\ni (xi(xT\n\n5\n\n\fk=1\n\n2\n\nAlgorithm 1 Initialization for MLR via Tensor Method\nInput: {xi, yi}i2\u2326\nOutput: {w(0)\nk }K\n1: Partition the dataset \u2326 into \u2326=\u2326 M2[\u23262[\u23263 with |\u2326M2| = O(d log2(d)), |\u23262| = O(d log2(d))\nand |\u23263| = O(log4(d))\n2: Compute the approximate top-K eigenvectors, Y 2 Rd\u21e5K, of the second-order moment, \u02c6M2 :=\n|\u2326M2|Pi2\u2326M2\ni (xi \u2326 xi I), by the power method.\ny2\n2|\u23262|Pi2\u23262\n3: Compute \u02c6R2 = 1\ni (Y T xi \u2326 Y T xi I).\ny2\n4: Compute the whitening matrix \u02c6W 2 RK\u21e5K of \u02c6R2, i.e., \u02c6W = \u02c6U2 \u02c6\u21e41/2\n6|\u23263|Pi2\u23263\ni (ri \u2326 ri \u2326 ri Pj2[K] ej \u2326 ri \u2326 ej Pj2[K] ej \u2326 ej \u2326 ri \n5: Compute \u02c6R3 = 1\nPj2[K] ri \u2326 ej \u2326 ej), where ri = Y T xi for all i 2 \u23263.\n6: Compute the eigenvalues {\u02c6ak}K\nk=1 and the eigenvectors {\u02c6vk}K\n\u02c6R3( \u02c6W , \u02c6W , \u02c6W ) 2 RK\u21e5K\u21e5K by using the robust tensor power method [2].\n7: Return the estimation of the models, w(0)\n\n2 is the eigendecomposition of \u02c6R2.\n\nk=1 of the whitened tensor\n\n\u02c6U T\n2 , where \u02c6R2 =\n\nk = Y ( \u02c6W T )\u2020(\u02c6ak \u02c6vk)\n\n1\n\n\u02c6U2 \u02c6\u21e42 \u02c6U T\n\ny3\n\nto the proof of Lemma 10 in Appendix B. Next we compute \u02c6R2 using O(KN d) and compute \u02c6W\nin O(K3) time. Computing \u02c6R3 takes O(KN d + K3N ) time. The robust tensor power method\ntakes O(poly(K)polylog(d)) time. In summary, the computational complexity for the initialization\n\nis O(KdN + K3N + poly(K)polylog(d)) = eO(dN ).\n\n3.3 Global Convergence Algorithm\nWe are now ready to show the complete algorithm, Algorithm 2, that has global convergence guarantee.\nWe use f\u2326(w) to denote the objective function Eq. (1) generated from a subset of the dataset \u2326,\n\ni.e.,f\u2326(w) =Pi2\u2326 \u21e7K\nTheorem 4 (Global Convergence Guarantee). 
Let $\{x_i, y_i\}_{i=1,2,\cdots,N}$ be sampled from the MLR model (3) with $N \geq O(d (K \log(d))^{2K+3})$. Let the step size $\eta$ be smaller than a positive constant. Then, given any precision $\varepsilon > 0$, after $T = O(\log(d/\varepsilon))$ iterations, w.p. $1 - O(K d^{-2} \log(d))$, the output of Algorithm 2 satisfies

$$\|w^{(T)} - w^*\| \leq \varepsilon\, \Delta_{\min}.$$

The detailed proof is in Appendix B.3. The computational complexity required by our algorithm is near-optimal: (a) the tensor method (Algorithm 1) is carefully employed so that only $O(dN)$ computation is needed; (b) gradient descent with resampling is run for $\log(d)$ iterations to push the iterate into the next phase; (c) gradient descent without resampling is finally executed to achieve any precision within $\log(1/\varepsilon)$ iterations. The total computational complexity is therefore $O(dN \log(d/\varepsilon))$. As the theorem shows, our algorithm can achieve any precision $\varepsilon > 0$ without any dependence of the sample complexity on $\varepsilon$. This follows from Corollary 1, which shows local strong convexity of objective (1) with a fixed set of samples. By contrast, tensor methods [7, 24] require $O(1/\varepsilon^2)$ samples and the EM algorithm requires $O(\log(1/\varepsilon))$ samples [4, 29].

4 Subspace Clustering (SC)

The mixed linear regression problem can be viewed as clustering $N$ $(d+1)$-dimensional data points, $z_i = [x_i, y_i]^T$, into one of $K$ subspaces, $\{z : [w^*_k, -1]^T z = 0\}$ for $k \in [K]$. Assume we have data points $\{z_i\}_{i=1,2,\cdots,N}$ sampled from the following model:

$$a_i \sim \mathrm{multinomial}(p), \quad s_i \sim \mathcal{N}(0, I_r), \quad z_i = U^*_{a_i} s_i, \qquad (9)$$

where $p$ is the proportion of samples from the different subspaces, satisfying $p^T \mathbf{1} = 1$, and $\{U^*_k\}_{k=1,2,\cdots,K}$ are the bases of the ground truth subspaces. We can solve Eq.
(2) by alternately minimizing over $U_k$ while fixing the others, which is equivalent to finding the top-$r$ eigenvectors of $\sum_{i=1}^N \alpha^k_i z_i z_i^T$, where $\alpha^k_i = \prod_{j\neq k} \langle I_d - U_j U_j^T, z_i z_i^T \rangle$. When the dimension is high, it is very expensive to compute the exact top-$r$ eigenvectors. A more efficient way is to use one iteration of the power method (a.k.a. subspace iteration), which takes only $O(KdN)$ computational time per iteration. We present our algorithm in Algorithm 3.

We show that Algorithm 3 converges to the ground truth when the initial subspaces are sufficiently close to the underlying subspaces. Define $D(\hat U, \hat V) := \frac{1}{\sqrt 2}\|U U^T - V V^T\|_F$ for $\hat U, \hat V \in \mathbb{R}^{d\times r}$, where $U, V$ are orthogonal bases of $\mathrm{Span}(\hat U)$, $\mathrm{Span}(\hat V)$ respectively. Define $D_{\max} := \max_{j\neq q} D(U^*_q, U^*_j)$ and $D_{\min} := \min_{j\neq q} D(U^*_q, U^*_j)$.

Theorem 5. Let $\{z_i\}_{i=1,2,\cdots,N}$ be sampled from the subspace clustering model (9). If $N \geq O(r (K \log(r))^{2K+2})$ and the initial parameters $\{U^0_k\}_{k\in[K]}$ satisfy

$$\max_k \{D(U^*_k, U^0_k)\} \leq c_s D_{\min}, \qquad (10)$$

where $c_s = O\big((p_{\min}/p_{\max})(3K)^{-K}(D_{\min}/D_{\max})^{2K-3}\big)$, then w.p. $1 - O(K r^{-2})$, the sequence $\{U^t_1, U^t_2, \cdots, U^t_K\}_{t=1,2,\cdots}$ generated by Algorithm 3 converges to the ground truth superlinearly. In particular, for $\Lambda_t := \max_k \{D(U^*_k, U^t_k)\}$,

$$\Lambda_{t+1} \leq \Lambda_t^2/(2 c_s D_{\min}) \leq \tfrac{1}{2} \Lambda_t.$$

We refer to Appendix C.2 for the proof. Compared to other methods, our sample complexity depends only linearly on the dimension of each subspace. We refer to Table 1 in [23] for a comparison of the conditions of different methods. Note that if $D_{\min}/D_{\max}$ is independent of $r$ and $d$, then the initialization radius $c_s$ is a constant. However, initialization within the required distance to the optima is still an open question; tensor methods do not apply in this case.
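One sweep of the weighted subspace iteration described above (the update that Algorithm 3 formalizes) can be sketched as follows. Note that $\langle I - U U^T, z z^T \rangle = \|z\|^2 - \|U^T z\|^2$, so all weights cost $O(KdN)$ per sweep. The implementation and helper names are ours, not the paper's code:

```python
import numpy as np

def residual_weights(Us, Z):
    """a_i(U_k) = <I - U_k U_k^T, z_i z_i^T> = ||z_i||^2 - ||U_k^T z_i||^2 for each k."""
    norms2 = np.sum(Z ** 2, axis=1)
    return np.stack([norms2 - np.sum((Z @ U) ** 2, axis=1) for U in Us], axis=1)

def power_iteration_step(Us, Z):
    """One sweep of the alternating power-method update for objective (2).

    Us: list of K orthonormal (d, r) bases; Z: (N, d) data points.
    """
    A = residual_weights(Us, Z)                      # shape (N, K), all weights from U^t
    new_Us = []
    for k, U in enumerate(Us):
        # alpha_i^k = prod_{j != k} a_i(U_j): points near the other subspaces get
        # weight ~0, so each update is dominated by points from subspace k.
        alpha = np.prod(np.delete(A, k, axis=1), axis=1)
        M = Z.T @ (alpha[:, None] * (Z @ U))         # sum_i alpha_i^k z_i z_i^T U_k
        Q, _ = np.linalg.qr(M)                       # one step of subspace iteration
        new_Us.append(Q)
    return new_Us
```

Starting from bases close to the truth on noiseless data, repeating this sweep drives each estimated subspace onto the corresponding ground-truth subspace, consistent with the local convergence described above.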
Interestingly, our experiments suggest that our proposed method converges to the global optima even with random initialization (in the setting considered in the above theorem).

Algorithm 2 Gradient Descent for MLR
Input: $\{x_i, y_i\}_{i=1,2,\cdots,N}$, step size $\eta$.
Output: $w$
1: Partition the dataset into $\{\Omega^{(t)}\}_{t=0,1,\cdots,T_0+1}$.
2: Initialize $w^{(0)}$ by Algorithm 1 with $\Omega^{(0)}$.
3: for $t = 1, 2, \cdots, T_0$ do
4:   $w^{(t)} = w^{(t-1)} - \eta \nabla f_{\Omega^{(t)}}(w^{(t-1)})$
5: for $t = T_0+1, T_0+2, \cdots, T_0+T_1$ do
6:   $w^{(t)} = w^{(t-1)} - \eta \nabla f_{\Omega^{(T_0+1)}}(w^{(t-1)})$

Algorithm 3 Power Method for SC
Input: data points $\{z_i\}_{i=1,2,\cdots,N}$
Output: $\{U_k\}_{k\in[K]}$
1: Initialize $\{U^0_k\}_{k\in[K]}$.
2: Partition the data into $\{\Omega^{(t)}\}_{t=0,1,2,\cdots,T}$.
3: for $t = 0, 1, 2, \cdots, T$ do
4:   $\alpha_i = \prod_{j=1}^K \langle I_d - U^t_j U^{tT}_j, z_i z_i^T \rangle$, $i \in \Omega^{(t)}$
5:   for $k = 1, 2, \cdots, K$ do
6:     $\alpha^k_i = \alpha_i / \langle I_d - U^t_k U^{tT}_k, z_i z_i^T \rangle$
7:     $U^{t+1}_k \leftarrow QR\big(\sum_{i\in\Omega^{(t)}} \alpha^k_i z_i z_i^T U^t_k\big)$

5 Numerical Experiments

5.1 Mixed Linear Regression

In this section, we use synthetic data to illustrate the properties of our algorithm that minimizes Eq. (1), which we call LOSCO (LOcally Strongly Convex Objective). We generate data points and parameters from the standard normal distribution. We set $K = 3$ and $p_k = \frac{1}{3}$ for all $k \in [K]$. The error is defined as $\varepsilon^{(t)} = \min_{\pi \in \mathrm{Perm}([K])} \{\max_{k\in[K]} \|w^{(t)}_{\pi(k)} - w^*_k\| / \|w^*_k\|\}$, where $\mathrm{Perm}([K])$ is the set of all permutation functions on the set $[K]$. The errors reported in the paper are averaged over 10 trials. In our experiments, we find no difference between resampling and not resampling; hence, for simplicity, we use the original dataset for all steps. We set both parameters of the robust tensor power method (denoted as $N$ and $L$ in Algorithm 1 of [2]) to 100.
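The permutation-invariant error $\varepsilon^{(t)}$ defined above can be computed by enumerating permutations, which is fine for small $K$. A minimal sketch (helper name ours):

```python
import itertools
import numpy as np

def recovery_error(W_est, W_true):
    """eps = min over permutations pi of max_k ||w_{pi(k)} - w*_k|| / ||w*_k||."""
    K = W_true.shape[0]
    norms = np.linalg.norm(W_true, axis=1)
    best = np.inf
    for perm in itertools.permutations(range(K)):
        err = max(np.linalg.norm(W_est[perm[k]] - W_true[k]) / norms[k]
                  for k in range(K))
        best = min(best, err)
    return best
```

The enumeration costs $K!$ comparisons; for the constant $K$ regime considered here this is negligible next to a single gradient evaluation.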
The experiments are conducted in Matlab. After the initialization, we use alternating minimization (i.e., block coordinate descent) to exactly minimize the objective over $w_k$ for $k = 1, 2, \cdots, K$ cyclically.

Fig. 1(a) shows the recovery rate for different dimensions and sample sizes. We call the result of a trial a successful recovery if $\varepsilon^{(t)} < 10^{-6}$ for some $t < 100$. The recovery rate is the proportion of the 10 trials with successful recovery. As shown in the figure, the sample complexity for exact recovery is nearly linear in $d$. Fig. 1(b) shows the behavior of our algorithm in the noisy case. The noise is drawn i.i.d. from $e_i \sim \mathcal{N}(0, \sigma^2)$, and $d$ is fixed at 100. As we can see from the figure, the solution error is almost proportional to the noise standard deviation. Comparing different values of $N$, the solution error decreases as $N$ increases, so the estimator appears consistent in the presence of unbiased noise. We also illustrate the performance of our tensor initialization method in Fig.

[Figure 1. Panels: (a) Sample complexity; (b) Noisy case; (c) $d = 100$, $N = 6$k; (d) $d = 1$k, $N = 60$k. Curves compare LOSCO-ALT and EM under tensor and random initialization.]
Figure 1: (a),(b): Empirical performance of our method. (c),(d): Performance of our methods vs. the EM method. Our method with random initialization is significantly better than EM with random initialization. Performance of the two methods is comparable when initialized with the tensor method.
2(a) in Appendix D.
We next compare with the EM algorithm [29], in which we alternately assign labels to points and exactly solve for each model parameter given the labels. EM has been shown to be very sensitive to initialization [29]. The grid-search initialization proposed in [29] is not feasible here, because it only handles two components of the same magnitude. Therefore, we use random initialization and tensor initialization for EM. We compare our method with EM on convergence speed under different dimensions and initialization methods. We use exact alternating minimization (LOSCO-ALT) to optimize our objective (1), which has computational complexity similar to EM's. Figs. 1(c) and 1(d) show that our method is competitive with EM in computational time when EM converges to the optima. In case (d), EM with random initialization does not converge to the optima, while our method still converges. More experimental results are given in Appendix D.

Table 1: Time (sec.) comparison for different subspace clustering methods

N/K    SSC       SSC-OMP   LRR      TSC      NSN+spectral   NSN+GSR   PSC
200    22.08     31.83     4.01     2.76     3.28           5.90      0.41
400    152.61    60.74     11.18    8.45     11.51          15.90     0.32
600    442.29    99.63     33.36    30.09    36.04          33.26     0.60
800    918.94    159.91    79.06    75.69    85.92          54.46     0.73
1000   1738.82   258.39    154.89   151.64   166.70         83.96     0.76

5.2 Subspace Clustering
In this section, we compare our subspace clustering method, which we call PSC (Power method for Subspace Clustering), with the state-of-the-art methods SSC [13], SSC-OMP [12], LRR [22], TSC [17], NSN+spectral [23] and NSN+GSR [23] in terms of computational time. We fix K = 5, r = 30 and d = 50. The ground truth U*_k is generated from Gaussian matrices. Each data point is a normalized Gaussian vector in its own subspace. We set p_k = 1/K. The initial subspace estimates are generated by orthonormalizing Gaussian matrices.
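Under this setup, one iteration of the power method (Algorithm 3) can be sketched as follows. This is an illustrative re-implementation run on a single partition rather than fresh samples per step; the initialization step is replaced by a small perturbation of the true subspaces, and the function name is ours:

```python
import numpy as np

def psc_iteration(U_list, Z):
    """One power-method iteration for subspace clustering.
    Z: (N, d) points; U_list: K orthonormal bases of shape (d, r).
    The weights alpha^k_i = prod_{j != k} <I_d - U_j U_j^T, z_i z_i^T> are
    computed directly (equivalent to dividing alpha_i by the k-th factor,
    but without dividing by a near-zero residual)."""
    # squared residual of each point to each subspace: ||z||^2 - ||U_k^T z||^2
    res = np.stack([(Z ** 2).sum(1) - ((Z @ U) ** 2).sum(1) for U in U_list],
                   axis=1)                                # shape (N, K)
    new_U = []
    for k in range(len(U_list)):
        alpha = np.delete(res, k, axis=1).prod(axis=1)    # alpha^k_i
        M = (Z * alpha[:, None]).T @ (Z @ U_list[k])      # sum_i alpha^k_i z_i z_i^T U_k
        new_U.append(np.linalg.qr(M)[0])                  # QR (orthonormalize)
    return new_U

rng = np.random.default_rng(1)
K, d, r, N = 3, 20, 3, 600
U_true = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(K)]
labels = rng.integers(K, size=N)
Z = np.stack([U_true[l] @ rng.standard_normal(r) for l in labels])
Z /= np.linalg.norm(Z, axis=1, keepdims=True)             # unit-norm points

# a small perturbation of the truth stands in for the initialization step
U = [np.linalg.qr(Ut + 0.1 * rng.standard_normal((d, r)))[0] for Ut in U_true]
for _ in range(30):
    U = psc_iteration(U, Z)
```

Each iteration of this sketch is linear in the number of points N (roughly O(N d r K) work), which is the behavior behind the running times of PSC in Table 1.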
The stopping criterion for our algorithm is that every point is clustered correctly, i.e., the clustering error (CE) (defined in [23]) is zero. We use publicly available code for all the other methods (see [23] for the links).
As shown in Table 1, our method is much faster than all the other methods, especially when N is large. Almost all the CE's corresponding to the results in Table 1 are very small; they are listed in Appendix D. We also illustrate the CE's of our method for different N, d and r with K = 5 fixed in Fig. 6 of Appendix D, from which we see that, whatever the ambient dimension d, the clusters are exactly recovered once N is proportional to r.
Acknowledgement: This research was supported by NSF grants CCF-1320746, IIS-1546459 and CCF-1564000.

References
[1] Amir Adler, Michael Elad, and Yacov Hel-Or. Linear-time subspace clustering via bipartite graph modeling. IEEE Transactions on Neural Networks and Learning Systems, 26(10):2234–2246, 2015.
[2] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. JMLR, 15:2773–2832, 2014.
[3] Peter Arbenz, Daniel Kressner, and D-MATH ETH Zürich. Lecture notes on solving large scale eigenvalue problems. http://people.inf.ethz.ch/arbenz/ewp/Lnotes/lsevp2010.pdf.
[4] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 2015.
[5] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
[6] Alexandre X. Carvalho and Martin A. Tanner. Mixtures-of-experts of autoregressive time series: asymptotic normality and model specification. Neural Networks, 16(1):39–56, 2005.
[7] Arun T. Chaganty and Percy Liang.
Spectral experts for estimating mixtures of linear regressions. In ICML, pages 1040–1048, 2013.
[8] Xiujuan Chai, Shiguang Shan, Xilin Chen, and Wen Gao. Locally linear regression for pose-invariant face recognition. IEEE Transactions on Image Processing, 16(7):1716–1725, 2007.
[9] Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. In COLT, 2014.
[10] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[11] P. Deb and A. M. Holmes. Estimates of use and costs of behavioural health care: a comparison of standard and finite mixture models. Health Economics, 9(6):475–489, 2000.
[12] Eva L. Dyer, Aswin C. Sankaranarayanan, and Richard G. Baraniuk. Greedy feature selection for subspace clustering. The Journal of Machine Learning Research, 14(1):2487–2517, 2013.
[13] Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In CVPR, pages 2790–2797, 2009.
[14] Giancarlo Ferrari-Trecate and Marco Muselli. A new learning method for piecewise linear regression. In Artificial Neural Networks — ICANN 2002, pages 444–449. Springer, 2002.
[15] Scott Gaffney and Padhraic Smyth. Trajectory clustering with mixtures of regression models. In KDD. ACM, 1999.
[16] Jihun Hamm and Daniel D. Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. In ICML, pages 376–383. ACM, 2008.
[17] Reinhard Heckel and Helmut Bolcskei. Subspace clustering via thresholding and spectral clustering. In Acoustics, Speech and Signal Processing (ICASSP), pages 3263–3267. IEEE, 2013.
[18] Jeffrey Ho, Ming-Hsuan Yang, Jongwoo Lim, Kuang-Chih Lee, and David Kriegman. Clustering appearances of objects under varying illumination conditions. In CVPR, volume 1, pages I–11.
IEEE, 2003.
[19] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In ITCS, pages 11–20. ACM, 2013.
[20] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17(52):1–6, 2012.
[21] Abbas Khalili and Jiahua Chen. Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 2012.
[22] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 663–670, 2010.
[23] Dohyung Park, Constantine Caramanis, and Sujay Sanghavi. Greedy subspace clustering. In NIPS, pages 2753–2761, 2014.
[24] Hanie Sedghi and Anima Anandkumar. Provable tensor methods for learning mixtures of generalized linear models. arXiv preprint arXiv:1412.3046, 2014.
[25] Jie Shen, Ping Li, and Huan Xu. Online low-rank subspace clustering by basis dictionary pursuit. In ICML, 2016.
[26] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[27] Kert Viele and Barbara Tong. Modeling with mixtures of linear regressions. Statistics and Computing, 12(4):315–330, 2002.
[28] Elisabeth Vieth. Fitting piecewise linear regression functions to biological responses. Journal of Applied Physiology, 67(1):390–396, 1989.
[29] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. In ICML, pages 613–621, 2014.