{"title": "On Distributed Averaging for Stochastic k-PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 11026, "page_last": 11035, "abstract": "In the stochastic k-PCA problem, we are given i.i.d. samples from an unknown distribution over vectors, and the goal is to compute the top k eigenvalues and eigenvectors of the moment matrix. In the simplest distributed variant, we have 'm' machines each of which receives 'n' samples. Each machine performs some computation and sends an O(k) size summary of the local dataset to a central server. The server performs an aggregation and computes the desired eigenvalues and vectors. The goal is to achieve the same effect as the server computing using m*n samples by itself. The main choices in this framework are the choice of the summary, and the method of aggregation. We consider a slight variant of the well-studied \"distributed averaging\" approach, and prove that this leads to significantly better bounds on the dependence between 'n' and the eigenvalue gaps. Our method can also be applied directly to a setting where the \"right\" value of the parameter k (i.e., one for which there is a non-trivial eigenvalue gap) is not known exactly. This is a common issue in practice which prior methods were unable to address.", "full_text": "On Distributed Averaging for Stochastic k-PCA\n\nAditya Bhaskara\n\nSchool of Computing\nUniversity of Utah\n\nbhaskara@cs.utah.edu\n\nMaheshakya Wijewardena\n\nSchool of Computing\nUniversity of Utah\n\npmaheshakya4@gmail.com\n\nAbstract\n\nIn the stochastic k-PCA problem, we are given i.i.d. samples from an unknown\ndistribution over vectors, and the goal is to compute the top k eigenvalues and\neigenvectors of the moment matrix. In the simplest distributed variant, we have\nm machines each of which receives n samples. Each machine performs some\ncomputation and sends an O(k)-size summary of the local dataset to a central\nserver. 
The server performs an aggregation and computes the desired eigenvalues and vectors. The goal is to achieve the same effect as the server computing using mn samples by itself. The main choices in this framework are the choice of the summary, and the method of aggregation. We consider a slight variant of the well-studied distributed averaging approach, and prove that this leads to significantly better bounds on the dependence between n and the eigenvalue gaps. Our method can also be applied directly to a setting where the 'right' value of the parameter k (i.e., one for which there is a non-trivial eigenvalue gap) is not known exactly. This is a common issue in practice which prior methods were unable to address.

1 Introduction

Principal Component Analysis (PCA) is one of the classic tools for the analysis of high dimensional data. It is used in applications ranging from data visualization, to dimension reduction, to signal de-noising [16, 10]. Formally, the problem is the following: given a collection of data points $x_1, x_2, \ldots, x_n$, the aim is to find a subspace U of dimension precisely k that captures the most mass of the points. Specifically, the goal is to find a $d \times k$ matrix U with orthonormal columns (corresponding to a basis for the desired subspace) so as to maximize $\|\Sigma U\|_F$, where Σ is the covariance matrix of the data, defined as $\sum_i x_i x_i^T$. This problem can be solved efficiently by computing the singular value decomposition (SVD) (see [9]).

In the stochastic version of the problem, the data is viewed as samples from an unknown distribution D over points in $\mathbb{R}^d$, and the goal is to find the top k singular directions of the distribution covariance matrix (or the second moment matrix) $\Sigma = \mathbb{E}_{x \sim D}[xx^T]$. The question of how many samples from D
The question of how many samples from D\nare needed to \ufb01nd a good estimate for the k-PCA is extensively studied ([15, 2, 20], and tight bounds\nthat involve the gap between the kth and the (k + 1)th eigenvalues of \u03c3 can be obtained using matrix\nconcentration inequalities [1, 22].\nIn this paper, we consider distributed algorithms for stochastic PCA, where the samples from D\nare distributed across machines, and the goal is to use a small amount of communication and \ufb01nd\na solution that approximates the PCA of the distribution. Our focus will be on the simplest model,\nwhere we have m machines that each has access to n i.i.d. samples of the data. Each machine\nsends one summary to a central server. The server, using the summaries from the different machines,\ncomputes the estimate of the PCA (this will be known as the aggregation step).\nThis distributed procedure is well-studied for various optimization problems [14, 24, 25]. The\nmost well-known example is distributed convex optimization, where the goal is to optimize the loss\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fL(\u03b8) = Ex\u2208Df (x, \u03b8) for some convex function f. Here, it turns out that a simple procedure known\nas distributed averaging yields good guarantees. Machines simply optimize the objective on their\nlocal dataset and send the solution \u03b8 to the server, and the central server averages the local solutions.\nTight bounds are known for distributed averaging for various convex objectives (see [24]).\nA natural problem not covered by the general results on convex optimization is PCA. Garber et al. [8]\nstudied the power of distributed averaging for PCA. They showed that for the problem of computing\nthe top eigenvector, simply averaging the best vectors for the different machines does not work (the\nissue being one of having the right signs). 
However, it turns out that averaging with appropriately chosen signs works well, as long as there is a sufficient gap between $\lambda_1$ and $\lambda_2$. Fan et al. [7] extended this idea to the case of finding the top k principal subspaces of the covariance matrix. They show that as long as every machine has sufficiently many samples (a quantity that depends on the gap between $\lambda_k$ and $\lambda_{k+1}$), distributed averaging of the projection matrices output by different machines (followed by a k-SVD) yields a good approximation.

1.1 Problem setup and motivation

Let us start with some basic notation. For a real symmetric matrix M, we use $\lambda_j(M)$ to refer to its j-th largest eigenvalue. We denote by $M_k$ the best rank-k approximation of M. Also, the trace of M will be denoted as $\mathrm{Tr}(M) = \sum_j \lambda_j(M)$. The Frobenius norm is $\|M\|_F := \sqrt{\sum_j \lambda_j(M)^2}$. For $r \in [d]$, we define $\Delta_r$ to be the eigenvalue gap $\lambda_r - \lambda_{r+1}$.

Formal setting for distributed stochastic PCA. Let D be an (unknown) sub-Gaussian distribution over vectors in $\mathbb{R}^d$.¹ We have m machines, each of which receives n i.i.d. samples from D. A denotes the covariance matrix of the distribution D, i.e., $A = \mathbb{E}_{x \sim D}[xx^T]$. Let the spectral decomposition of A be denoted $U \Lambda U^T$, where $U \in \mathbb{R}^{d \times d}$ is a matrix with orthonormal columns and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$. The aim is to find the vectors $u_1, u_2, \ldots, u_k$ (the first k columns of U) and the corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k$.

Motivation. The works [7, 8] have two key limitations. First, they are aimed at finding the k-PCA subspace, for a given k. Even modifying the goal slightly, e.g., requiring the algorithm to output each of the top k PCA directions individually, requires more communication, and a sample complexity (per machine) that has a quadratic dependence on the individual gaps, as we explain below.
Second, and more importantly, these works assume the knowledge of a number k for which there exists an eigenvalue gap. This is quite unrealistic in practice. Can we design algorithms that can work with only a rough idea of the location of the gap? Our main contribution is to handle these two issues. We provide novel estimation bounds and validate them using experiments on real and synthetic data.

The first restriction above is quite serious at a quantitative level. Ignoring other terms, the works of [7, 8] require (in order to estimate the k-subspace) a value of $n \geq 1/\Delta_k^2$. Intuitively, this corresponds to a requirement of each machine having a rough estimate of the top-k PCA subspace. Their main results can then be interpreted as saying that under this assumption, the server can obtain a significantly better estimate of the top-k subspace than the individual machines. Further, the estimation errors are only obtained for the matrix $U_k U_k^T$ (i.e., the projection matrix to the top-k subspace). If one needs each of the top k singular directions, the procedure needs to be re-done for each index (and this requires having $n \geq 1/\Delta_{\min}^2$, where $\Delta_{\min} = \min_{i \leq k} \Delta_i$, which could be tiny).

Note that our setting is slightly different from the deterministic case of distributed PCA, where we have a matrix whose columns are arbitrarily distributed across machines, and the goal is to find the best k-subspace. Moreover, the objective there is not always to find the right eigenvalues/vectors, but to approximate the value of the low-rank error (see [4, 11] and references therein). These results extend to distributed settings the works of [19, 5], where the power of subspace embeddings in matrix approximation is shown. In this case, in order to obtain a $(1 + \epsilon)$ low-rank error in Frobenius norm for A, each machine needs to communicate $O(k/\epsilon)$ vectors.
It is evident from their work that the sketching methods perform better as the sketch size grows. In contrast to these results, a part of our goal is to discuss the trade-off between n and the quality of the approximations. Applying these sketching methods in our setting will yield error terms that depend only on the sketch size and cannot be controlled by n or m, thus making them undesirable for this setting.

¹As in the prior works [7, 8], the data distribution is assumed to have sub-Gaussian tails ([12, 18]), i.e., there exists a constant C > 0 such that $\|(u^T x)^2\|_{\psi_1} \leq C\,\mathbb{E}[(u^T x)^2]$ for all $u \in \mathbb{R}^d$. The $\psi_1$ norm of a random variable X is $\|X\|_{\psi_1} = \sup_{p \geq 1} (\mathbb{E}|X|^p)^{1/p}/p$ (see [23]).

1.2 Our contributions

In Theorem 1 and its corollaries (Section 2.1), we show that as long as $n \geq \Omega(1/\Delta_k^2)$, we can (using a "single sketch") compute estimates $\widehat{v}_i$ of each of the vectors $v_1, v_2, \ldots, v_k$ up to an error
$$\|\widehat{v}_i - v_i\| \leq \frac{1}{\delta_i}\left(\frac{\kappa_1}{n} + \frac{\kappa_2}{\sqrt{mn}}\right), \quad \text{where } \delta_i := \min(\lambda_{i-1} - \lambda_i,\ \lambda_i - \lambda_{i+1}),$$
where $\kappa_1, \kappa_2$ are factors that are not dominant when $\Delta_k$ is large enough. Further, the amount of communication per machine is O(kd), i.e., each machine communicates k vectors in $\mathbb{R}^d$.

Remark. Note that if each individual machine is to achieve $\|\widehat{v}_i - v_i\| \leq 1/4$, the above requirement translates to $\min(n, \sqrt{mn}) \geq 4/\delta_i$. To achieve this error using prior work, one needs $n \geq \Omega(1/\delta_i^2)$. Our result can be a significant improvement as m grows.
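To make the comparison in the remark above concrete, here is a small numerical sketch. The constants and the κ factors are ignored here, which is our own simplification for illustration, not a claim from the paper:

```python
import math

def n_required_prior(delta_i):
    """Per-machine samples under the prior-work requirement n >= 1/delta_i^2
    (constants dropped)."""
    return math.ceil(1.0 / delta_i**2)

def n_required_averaged(delta_i, m):
    """Per-machine samples under min(n, sqrt(m*n)) >= 4/delta_i, i.e.
    n >= max(4/delta_i, (4/delta_i)^2 / m) -- the averaged-estimator requirement."""
    target = 4.0 / delta_i
    return math.ceil(max(target, target**2 / m))

delta_i, m = 0.05, 100
print(n_required_prior(delta_i))        # prior work: order 1/delta^2 per machine
print(n_required_averaged(delta_i, m))  # averaging: far fewer once m is large
```

With $\delta_i = 0.05$ and m = 100 machines, the prior requirement is of order 400 samples per machine, while the averaged bound only needs 80, and the advantage grows with m.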
Specifically, in our setting the individual machines need not be able to obtain any estimate of $v_i$, but the corresponding average is still accurate.

Our next result (Theorem 9) is in a setting in which we only approximately know the location of the gap in the eigenvalues (as is common in practice). In particular, suppose that there exists a $k \in (k_0, k_1)$ such that $\Delta_k$ is large enough. Then, using a single sketch of $k_1$ vectors (i.e., space $O(k_1 d)$), we show an efficient way to find k, and thus also achieve the same guarantees as our first result (described above).

Using prior results (at least directly) to solve this problem leads to two issues. First, we need to run the averaging procedures for each k in the range $(k_0, k_1)$. And more importantly, it is not clear how to determine which of the PCA directions obtained are accurate and which are not (because accuracy guarantees depend on the consecutive gaps, and we do not know which gap is large).

Finally, we run experiments on both real and synthetic datasets (the latter gives us a way to control the eigenvalue gaps), and establish that our theoretical bounds are reflected accurately in practice.

2 Spectral approximation via distributed averaging

We start by introducing notation that we will use for the rest of the paper and stating the theorems formally. Recall that the jth machine gets n i.i.d. vectors from a sub-Gaussian distribution D, and let $\widehat{A}^{(j)}$ denote the empirical covariance matrix. Also, for the jth machine, let $\widehat{A}^{(j)}_k$ denote the best rank-k approximation of $\widehat{A}^{(j)}$, and denote its SVD by $\widehat{U}^{(j)}_k \widehat{\Lambda}^{(j)}_k (\widehat{U}^{(j)}_k)^T$. We also define $\widehat{V}^{(j)}_k = \widehat{U}^{(j)}_k (\widehat{\Lambda}^{(j)}_k)^{1/2}$.

Algorithm 1: Distributed Averaging (parameter k)
The columns of this matrix are what each machine sends to the central server. We also define the average across machines: $\widehat{A}_k = \frac{1}{m}\sum_{j \in [m]} \widehat{A}^{(j)}_k$.

Local: On each machine, compute the rank-k SVD of the empirical covariance matrix $\widehat{A}^{(j)}$, and send $\widehat{V}^{(j)}_k$ (as defined above) to the server.

Server: On the central server, compute $\widehat{A}_k = \frac{1}{m}\sum_{j=1}^{m} \widehat{V}^{(j)}_k (\widehat{V}^{(j)}_k)^T$. Then output the top k eigenvalues and the corresponding eigenvectors of $\widehat{A}_k$.

The procedure above differs from the prior works [8, 7] in the choice of the summary (i.e., what the individual machines send to the central server). Algorithm 1 uses the eigenvectors weighted by the square root of the eigenvalues, while unweighted vectors are used in the prior work. This turns out to give us three advantages: (i) an efficient way to obtain the individual eigenvectors $v_1, v_2, \ldots, v_k$; (ii) the ability to use the summary for different values of k, as we will see in Section 4; and (iii) improved bounds on the parameters m, n (especially in experiments).

2.1 Guarantees for eigenvalue and eigenvector estimation

We now formally state the guarantees obtained by the procedure above in order to estimate the eigenvalues and eigenvectors of A. We start with a general theorem about approximating $A_k$, the best rank-k approximation of A, which will imply both of these statements.

Theorem 1. There exist constants $C_1$ and $C_2$ such that
$$\left\| \|\widehat{A}_k - A_k\|_F \right\|_{\psi_1} \leq C_1\,\frac{\kappa_1}{n} + C_2\,\frac{\kappa_2}{\sqrt{mn}},$$
where $\kappa_1 = \sqrt{k}\,\lambda_1^2 \cdot \mathrm{Tr}(A)/\Delta_k^2$ and $\kappa_2 = \lambda_1 \sqrt{k\lambda_1 \cdot \mathrm{Tr}(A)}/\Delta_k$.

The statement uses the sub-Gaussian norms described in Section 1.1. By definition, we can rephrase the theorem as a concentration bound: the probability of $\|\widehat{A}_k - A_k\|_F$ exceeding $\log(1/\delta)$ times the RHS is at most δ.
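As an illustration, Algorithm 1 can be simulated in a few lines of NumPy. This is a sketch under our own choices (a diagonal Gaussian model and hypothetical parameter values), not the authors' code:

```python
import numpy as np

def local_summary(X, k):
    """One machine: empirical covariance, then the sqrt-eigenvalue-weighted
    top-k eigenvectors V = U_k diag(lambda_k)^{1/2} (the Algorithm 1 summary)."""
    n = X.shape[0]
    A_hat = X.T @ X / n
    vals, vecs = np.linalg.eigh(A_hat)        # eigh returns ascending order
    idx = np.argsort(vals)[::-1][:k]          # indices of the top-k eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def server_aggregate(summaries, k):
    """Server: average V V^T across machines, return its top-k eigenpairs."""
    A_bar = sum(V @ V.T for V in summaries) / len(summaries)
    vals, vecs = np.linalg.eigh(A_bar)
    idx = np.argsort(vals)[::-1][:k]
    return vals[idx], vecs[:, idx]

rng = np.random.default_rng(0)
d, k, m, n = 20, 3, 30, 500                   # hypothetical sizes for the demo
lam = 0.8 ** np.arange(d)                     # ground-truth spectrum 1, 0.8, 0.64, ...
samples = lambda: rng.normal(size=(n, d)) * np.sqrt(lam)   # N(0, diag(lam)) samples

est_vals, est_vecs = server_aggregate(
    [local_summary(samples(), k) for _ in range(m)], k)
# est_vals should be close to the true top eigenvalues 1.0, 0.8, 0.64
```

Here the true eigenvectors are the coordinate axes, so the quality of the output can be read off directly; swapping in any rotated covariance leaves the procedure unchanged.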
Using known perturbation bounds, we can show the following corollaries.

Corollary 2. For all $i \leq k$, there exist constants $C_1$ and $C_2$ such that
$$\left\| |\lambda_i(\widehat{A}_k) - \lambda_i(A)| \right\|_{\psi_1} \leq C_1\,\frac{\kappa_1}{n} + C_2\,\frac{\kappa_2}{\sqrt{mn}}.$$

The corollary follows from Theorem 1 using Weyl's inequality [21].

Corollary 3. Define $\delta_i = \min\{\lambda_{i-1} - \lambda_i,\ \lambda_i - \lambda_{i+1}\}$.² For $i \leq k$, there exist constants $C_1$ and $C_2$ such that
$$\left\| 1 - (\widehat{u}_i^T u_i)^2 \right\|_{\psi_1} \leq C_1\,\frac{\kappa_1^2}{\delta_i^2 n^2} + C_2\,\frac{\kappa_2^2}{\delta_i^2 mn},$$
where $\widehat{u}_i$ is the eigenvector corresponding to the ith largest eigenvalue of $\widehat{A}_k$.

The proof follows from the Davis-Kahan sin-Θ theorem [21].

As outlined in Section 1.2, when the gap $\delta_i \ll \Delta_k$, using the summary corresponding to k has a significant advantage over using the one for i. This results in a better guarantee (compared to the procedures of [8, 7]) when recovering $u_i$, for $1 \leq i \leq k$.

3 Analysis: estimating the rank-k approximation of A

Our goal in this section will be to show that $\widehat{A}_k$ approximates $A_k$ accurately, thereby proving Theorem 1.

Outline of the argument. The key step is to define the matrix $A^*$, which is the expectation of $\widehat{A}^{(j)}_k$. As all the machines receive inputs drawn from D, this is independent of j. The argument proceeds in two steps, similar to the works of [7] and [8]. The first step is showing that $\|A^* - A_k\|$ (in other words, the bias) is small. This is the harder step, and involves showing that one obtains non-trivial cancellations. In other words, even though $\|\widehat{A}^{(j)}_k - A_k\|$ is of the order $O(1/\sqrt{n})$, we will show that $\|A^* - A_k\|$ is of the order $O(1/n)$.
The second step is to show that $\widehat{A}_k$, which is the empirical average of $\widehat{A}^{(j)}_k$ over the m machines, is close to $A^*$. This is proved using a matrix concentration bound, originally due to [3].

To summarize, let us define (noting that the RHS is independent of j)
$$A^* = \mathbb{E}[\widehat{A}^{(j)}_k].$$
We have $\|\widehat{A}_k - A_k\| \leq \|A^* - A_k\| + \|\widehat{A}_k - A^*\|$. The first term will be referred to as the bias and the second as the variance. In what follows we will bound the terms separately.

3.1 Analyzing the bias term

We now show the following theorem about the bias term.

Theorem 4. There is a constant C such that
$$\|A^* - A_k\|_F \leq C\,\frac{\sqrt{k}\,\lambda_1^2 \cdot \mathrm{Tr}(A)}{\Delta_k^2\, n}.$$

²To deal with the border cases i = 1 and i = d, define $\lambda_0 = +\infty$ and $\lambda_{d+1} = -\infty$.

In what follows, we abuse notation slightly and denote $\widehat{A} = \widehat{A}^{(j)}$ for some machine j. As we are finally interested in the expectation, the choice of j will not matter. Define $\widehat{A} = A + E$ and let $\epsilon = \|E\|_2/\Delta_k$. By definition, $A = \mathbb{E}[\widehat{A}]$. Let us also define the projection matrices $\Pi = U_k U_k^T$ and $\widehat{\Pi} = \widehat{U}_k \widehat{U}_k^T$.

The main idea behind the proof of Theorem 4 is that we express $\widehat{A}_k - A_k$ on a single machine using linear and quadratic terms of E. Once we take the expectation of this error, the linear terms in E (which are of order $O(1/\sqrt{n})$ and dominant in magnitude) vanish, thus giving the O(1/n) bound on the bias in Theorem 4. The first lemma gives a coarse bound, which we will use when $\|E\|$ is large.

Lemma 5. Let $\widehat{A}_k$ be the rank-k approximation on one of the machines, and let E be as defined above.
Then
$$\widehat{A}_k - A_k = \Pi E + H, \quad \text{where } \|H\|_F \leq \frac{2\sqrt{k}\,\lambda_1\,\|E\|_2}{\Delta_k} + \frac{2\sqrt{k}\,\|E\|_2^2}{\Delta_k}.$$

The next lemma shows that when $\epsilon = \|E\|_2/\Delta_k$ is small, we have a much better bound.

Lemma 6. Let $\widehat{A}$ satisfy the condition $\epsilon = \|A - \widehat{A}\|_2/\Delta_k \leq 1/10$. There exists a linear function $f : \mathbb{R}^{d \times k} \to \mathbb{R}^{d \times k}$ and a constant c such that
$$\widehat{A}_k - A_k = \Pi E + \left(f(EU_k)U_k^T + U_k f(EU_k)^T\right)A + H, \quad \text{where } \|H\|_F \leq \frac{c\sqrt{k}\,\|E\|_2^2\,(\lambda_1 + \|E\|_2)}{\Delta_k^2}.$$

The lemma is a consequence of a result in [7] showing that in this case, $\widehat{\Pi}$ has a sufficiently good first order approximation in terms of E.

Proof. Lemma 2 of [7] shows that
$$\widehat{\Pi} = \Pi + f(EU_k)U_k^T + U_k f(EU_k)^T + E',$$
where (a) f is a linear function as in the statement of the theorem that also satisfies $\|f(\cdot)\|_F \leq \|\cdot\|_F/\Delta_k$, and (b) $E'$ is a matrix with $\|E'\|_F \leq 24\sqrt{k}\,\|E\|_2^2/\Delta_k^2$ (this is only true under the assumption we have, i.e., $\epsilon \leq 1/10$). Using this,
$$\widehat{A}_k - A_k = \widehat{\Pi}(A + E) - A_k = \left(\Pi + f(EU_k)U_k^T + U_k f(EU_k)^T + E'\right)(A + E) - A_k$$
$$= \Pi E + \left(f(EU_k)U_k^T + U_k f(EU_k)^T\right)A + \left(f(EU_k)U_k^T + U_k f(EU_k)^T\right)E + E'A + E'E.$$
Thus to show the lemma, the error term is
$$H = \left(f(EU_k)U_k^T + U_k f(EU_k)^T\right)E + E'A + E'E.$$
To bound the first term, note that $\|f(EU_k)\|_F \leq \|EU_k\|_F/\Delta_k \leq \sqrt{k}\,\|E\|_2/\Delta_k$. Thus we have
$$\left\|\left(f(EU_k)U_k^T + U_k f(EU_k)^T\right)E\right\|_F \leq \left\|f(EU_k)U_k^T + U_k f(EU_k)^T\right\|_F \|E\|_2 \leq \frac{2\sqrt{k}\,\|E\|_2^2}{\Delta_k}.$$
The second term can be bounded (using the bound on $E'$ above) by $24\sqrt{k}\,\lambda_1\,\|E\|_2^2/\Delta_k^2$. Using the bound on $E'$ again completes the proof of the lemma.

Note that the two lemmas give different linear approximations of $\widehat{A}_k - A_k$.
However, in order to take expectation, we need the same linear function in both regimes. Luckily, we observe that the function from Lemma 6 can be used in place of the one from before, with small error. To this end, note that
$$\left\|\left(f(EU_k)U_k^T + U_k f(EU_k)^T\right)A\right\|_F \leq 2\,\|f(EU_k)\|_F\,\|A\|_2 \leq \frac{2\sqrt{k}\,\lambda_1\,\|E\|_2}{\Delta_k}, \tag{1}$$
from the property of f mentioned earlier (shown in [7]).

We can now prove Theorem 4.

Proof of Theorem 4. Using the observation in (1), Lemma 5 implies that for all E, we have
$$\widehat{A}_k - A_k = \Pi E + \left(f(EU_k)U_k^T + U_k f(EU_k)^T\right)A + H, \quad \|H\|_F \leq \frac{4\sqrt{k}\,\|E\|_2^2\,(\lambda_1 + \|E\|_2)}{\Delta_k^2}. \tag{2}$$
Using this expression when $\epsilon \geq 1/10$, and the one from Lemma 6 when ε is smaller, we can now take the expected value of $\widehat{A}_k - A_k$. The linear terms in E will evaluate to zero. Thus we have
$$\|\mathbb{E}[\widehat{A}_k - A_k]\|_F \leq \mathbb{E}[Q_1 \mid \epsilon \geq 1/10] + \mathbb{E}[Q_2 \mid \epsilon < 1/10],$$
where $Q_1$ and $Q_2$ are the bounds on $\|H\|_F$ from (2) and Lemma 6, respectively.
Now, conditioned on $\epsilon = \|E\|_2/\Delta_k \geq 1/10$, it is trivially true that $\|E\|_2/\Delta_k \leq 10\,\|E\|_2^2/\Delta_k^2$. Thus we can simplify the above as
$$\|\mathbb{E}[\widehat{A}_k - A_k]\|_F \leq \mathbb{E}\left[\frac{C\sqrt{k}\,\|E\|_2^2\,(\lambda_1 + \|E\|_2)}{\Delta_k^2}\right].$$
Using the sub-Gaussian property of the moments of our distribution, we have that the expectation above is dominated by the $\mathbb{E}[\|E\|_2^2]$ term (due to the multiplier $\lambda_1$), and this expectation is $O(\lambda_1 \mathrm{Tr}(A)/n)$. This gives
$$\|\mathbb{E}[\widehat{A}_k - A_k]\|_F \leq \frac{C\sqrt{k}\,\lambda_1^2 \cdot \mathrm{Tr}(A)}{n\,\Delta_k^2}.$$
This completes the proof of the theorem.

3.2 Analyzing the variance term

We now need to show that the average of the matrices $\widehat{A}^{(j)}_k$ is close to the expectation (which is $A^*$). The main idea is to use the concentration inequality due to Bosq [3] (see also Lemma 4 of [7]). The inequality lets us bound the $\psi_1$ norm of the average of i.i.d. random variables using the $\psi_1$ norm of the individual variables. To this end, we first show the following.

Lemma 7. Suppose each machine receives n points, where $n \geq \lambda_1 \mathrm{Tr}(A)/\Delta_k^2$. Then, there is a constant C such that
$$\left\| \|\widehat{A}^{(j)}_k - A^*\|_F \right\|_{\psi_1} \leq C\,\frac{\lambda_1}{\Delta_k}\sqrt{\frac{k\lambda_1 \cdot \mathrm{Tr}(A)}{n}}.$$

Now we analyze the average of the $\widehat{A}^{(j)}_k$.

Theorem 8. There exists a constant C such that the matrix $\widehat{A}_k$, i.e., the average of the matrices $\widehat{A}^{(i)}_k$, satisfies
$$\left\| \|\widehat{A}_k - A^*\|_F \right\|_{\psi_1} \leq C\,\frac{\lambda_1}{\Delta_k}\sqrt{\frac{k\lambda_1 \cdot \mathrm{Tr}(A)}{mn}}.$$

Theorems 4 and 8 together complete the proof of our main approximation result, Theorem 1.
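The cancellation that drives Theorem 4 can also be checked numerically. The sketch below is our own construction, not code from the paper: it averages the rank-k truncations at perturbations +sE and −sE, mimicking the vanishing of the terms linear in E under expectation, and confirms that the residual scales like $s^2$ rather than s:

```python
import numpy as np

def rank_k(M, k):
    """Best rank-k approximation of a symmetric matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    idx = np.argsort(vals)[::-1][:k]
    return (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T

rng = np.random.default_rng(1)
d, k = 12, 3
A = np.diag(0.5 ** np.arange(d))          # spectrum 1, 0.5, 0.25, ... (gap 0.125 at k=3)
E = rng.normal(size=(d, d))
E = (E + E.T) / 2
E /= np.linalg.norm(E, 2)                 # normalize so that ||E||_2 = 1

def bias(s):
    # Averaging +sE and -sE kills the terms linear in E,
    # playing the role of the expectation over the sampling noise.
    sym = (rank_k(A + s * E, k) + rank_k(A - s * E, k)) / 2
    return np.linalg.norm(sym - rank_k(A, k), ord='fro')

# bias(s)/s^2 should be roughly constant, i.e. the residual is quadratic in s
ratios = [bias(s) / s**2 for s in (0.01, 0.005, 0.0025)]
```

The chosen s values keep $\epsilon = \|sE\|_2/\Delta_k \leq 1/10$, the regime of Lemma 6; without the symmetric averaging, the same difference shrinks only linearly in s.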
As observed in Section 2.1, this also completes the proofs of Corollaries 2 and 3.

4 Algorithm for imprecise k

We now consider the setting in which we do not exactly know the value of k for which $\lambda_k - \lambda_{k+1}$ is "large". Knowing that some k in the interval $(k_0, k_1)$ satisfies an appropriate gap assumption, we will give an algorithm that can, using $O(k_1)$ columns of communication per machine, (a) find such a k, and (b) compute all the eigenvalues and eigenvectors with guarantees matching the ones from the case in which we know k (i.e., Theorem 1).

Our algorithm relies on the following theorem. As defined earlier, for any $t \geq 1$, denote $\widehat{A}_t := \frac{1}{m}\sum_i \widehat{A}^{(i)}_t$ (i.e., the empirical average of the rank-t approximations on the individual machines). Now, the main advantage of having every machine send across the eigenvectors $\widehat{v}_i$ scaled by $\sqrt{\widehat{\lambda}_i}$ (Algorithm 1) is that if a machine sends this information for $1 \leq i \leq k_1$, then the central server can compute $\widehat{A}_t$ for every $1 \leq t \leq k_1$.

Theorem 9. Let $\widehat{A}_t$ be defined as above, and let $\delta > 0$ be a given parameter. Let k be an integer for which $\Delta_k = \lambda_k - \lambda_{k+1}$ is sufficiently large, in particular, so that for the given m, n, δ, we have
$$\Delta_k \geq C\left(\frac{\kappa_1}{n} + \frac{\kappa_2}{\sqrt{mn}}\right)\log(1/\delta),$$
where $\kappa_1$ and $\kappa_2$ are as defined in Theorem 1, and C is an appropriate constant.
Then with probability at least $1 - \delta$, for all $t \geq k$, the matrix $\widehat{A}_t$ has its top k+1 eigenvalues $\theta_1 \geq \theta_2 \geq \cdots \geq \theta_k \geq \theta_{k+1}$ that satisfy
$$|\theta_i - \lambda_i| \leq O\left(\frac{\kappa_1}{n} + \frac{\kappa_2}{\sqrt{mn}}\right)\log(1/\delta), \quad \text{for } 1 \leq i \leq k, \text{ and} \tag{3}$$
$$\theta_{k+1} \leq \lambda_{k+1} + O\left(\frac{\kappa_1}{n} + \frac{\kappa_2}{\sqrt{mn}}\right)\log(1/\delta). \tag{4}$$

The proof of this theorem is deferred to Section A.4 of the supplementary material.

Estimating the location of the gap. The theorem shows that one can use any value of $t \geq k$ in order to estimate all the eigenvalues up to k, and also the gap between k and k+1. Thus, if we only know the approximate location of a gap (some k between $k_0$ and $k_1$), we can use Algorithm 1 with $k = k_1$, and using the above result, find the k with the desired gap. Knowing k, the server can then compute $\widehat{A}_k$ (using only the information it has), and this leads to a finer estimate of the matrix $A_k$.

Estimating the eigenvectors. The theorem above can also be used (together with the sin-Θ theorem) to show estimates on eigenvector estimation (as in Corollary 3). While the bounds are qualitatively similar to those in Corollary 3, we observe that in practice, using $t > k$ is significantly better for approximating the top-k eigenspace to a good accuracy.

5 Experiments

We validate our results with experiments using synthetic and real datasets. We simulated a distributed environment on a single machine.

5.1 Synthetic dataset

We generated vectors in $\mathbb{R}^d$ from a multivariate Gaussian distribution with mean 0 and covariance matrix $A = U\Lambda U^T$, with $\Lambda(1,1) = 1$ and $\Lambda(i,i) = 0.9\,\Lambda(i-1,i-1)$ for $i = 2, \ldots, 6$. We set $\Lambda(7,7) = \Lambda(6,6) - 0.3$, and for $7 < i \leq 50$ we set $\Lambda(i,i) = 0.9\,\Lambda(i-1,i-1)$. So there is a gap of $\Delta_6 = 0.3$.
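For concreteness, the spectrum above can be generated as follows. This is a sketch; we take U = I for simplicity (an assumption of this snippet, which loses nothing here since the reported errors are rotation invariant):

```python
import numpy as np

d = 50
lam = np.empty(d)
lam[0] = 1.0
for i in range(1, 6):            # Lambda(i,i) = 0.9 * Lambda(i-1,i-1) for i = 2..6
    lam[i] = 0.9 * lam[i - 1]
lam[6] = lam[5] - 0.3            # Lambda(7,7) = Lambda(6,6) - 0.3, creating the gap
for i in range(7, d):            # geometric decay resumes for i > 7
    lam[i] = 0.9 * lam[i - 1]

gap6 = lam[5] - lam[6]           # Delta_6 = 0.3 by construction

# n points on one machine, drawn from N(0, diag(lam)) (taking U = I):
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, d)) * np.sqrt(lam)
```

The sixth eigenvalue is $0.9^5 \approx 0.59$, so the planted gap of 0.3 is large relative to the surrounding geometric decay, which is what makes k = 6 the "right" location.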
In this experiment we fixed the number of machines to m = 50. We computed the top 3 eigenvectors using Algorithm 1 for increasing numbers of points n per machine, and compared them with the top 3 eigenvectors of the population covariance matrix. Note that it is not possible to use prior methods for computing individual eigenvectors. We first computed the eigenvectors by communicating only the top k = 3 weighted vectors. We then computed them by communicating k = 7 weighted vectors, so that the summary includes the eigengap $\Delta_6$. We computed the error $1 - (u_i^T \tilde{u}_i)^2$ for $i = 1, 2, 3$ in both these cases. These results are averaged over 200 iterations.

As observed in Figure 1, these results are consistent with our theoretical bounds for the case where the correct eigengap is not known (but it is located within the k we communicate).

5.2 Real datasets

We used 3 real datasets to evaluate our methods (Table 1) [13, 17, 6]. Each dataset has N points and d features. In these experiments we compute the error of the subspace spanned by the top r eigenvectors, $\|U_r U_r^T - \tilde{U}_r \tilde{U}_r^T\|_F$. We consider each dataset X as the population matrix; the population covariance matrix is then $A = XX^T$. On each machine we sampled n columns from these X matrices uniformly at random. For these experiments we fixed the number of machines to m = 50. Each result is averaged over 200 iterations.

We compare the prior method of [7] (unweighted) with our methods.
In one of the cases we communicate exactly r vectors (weighted r), and in the other case we communicate a slightly higher (t > r) number of vectors (weighted t). This is towards the end of demonstrating our theoretical results for the case where we do not know the exact location of a reasonable eigengap. Note that it is not possible to compute the correct eigenspace using prior methods if we do not communicate the exact number of required vectors. Similar to the synthetic dataset experiments, we computed the error of each method varying the sample size n per machine (Figure 2).

Figure 1: Estimation errors of first eigenvector (left), second eigenvector (middle), and third eigenvector (right) for k = 3 and k = 7 vs. sample size n per machine.

Table 1: Dataset information

Dataset     | N     | d   | r  | t
MNIST-small | 20000 | 196 | 5  | 15
NIPS-papers | 11463 | 150 | 5  | 15
FMA-music   | 21314 | 518 | 10 | 70

Figure 2: Estimation errors of top r subspace of MNIST-small dataset (left), NIPS-papers dataset (middle), FMA-music dataset (right) vs. unweighted, weighted r, weighted t averaging.

References

[1] R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569-579, March 2002.

[2] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3174-3182, USA, 2013. Curran Associates Inc.

[3] Denis Bosq. Stochastic processes and random variables in function spaces. In Linear Processes in Function Spaces, pages 15-42. Springer, 2000.

[4] Christos Boutsidis, David P Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 236-249.
ACM, 2016.

[5] Kenneth L Clarkson and David P Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the forty-fifth annual ACM symposium on Theory of Computing, pages 81-90. ACM, 2013.

[6] Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. FMA: A dataset for music analysis. 2017.

[7] Jianqing Fan, Dong Wang, Kaizheng Wang, and Ziwei Zhu. Distributed estimation of principal eigenspaces. arXiv preprint arXiv:1702.06488, 2017.

[8] Dan Garber, Ohad Shamir, and Nathan Srebro. Communication-efficient algorithms for distributed stochastic principal component analysis. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1203-1212. JMLR.org, 2017.

[9] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996.

[10] Ian Jolliffe. Principal component analysis. Springer, 2011.

[11] Ravi Kannan, Santosh Vempala, and David Woodruff. Principal component analysis and higher correlations for distributed data. In Conference on Learning Theory, pages 1040-1057, 2014.

[12] Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance operators. arXiv preprint arXiv:1405.2468, 2014.

[13] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[14] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 456-464, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[15] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267-273, November 1982.

[16] Karl Pearson. LIII.
On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559-572, 1901.

[17] Valerio Perrone, Paul A Jenkins, Dario Spano, and Yee Whye Teh. Poisson random fields for dynamic feature models. arXiv preprint arXiv:1611.07460, 2016.

[18] Markus Reiß and Martin Wahl. Non-asymptotic upper bounds for the reconstruction error of PCA. arXiv preprint arXiv:1609.03779, 2016.

[19] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 143-152. IEEE, 2006.

[20] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 144-152. JMLR.org, 2015.

[21] Gilbert W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, 1990.

[22] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, Aug 2012.

[23] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[24] Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502-1510, 2012.

[25] Martin A. Zinkevich, Markus Weimer, Alex Smola, and Lihong Li. Parallelized stochastic gradient descent. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2, NIPS'10, pages 2595-2603, USA, 2010.
Curran Associates Inc.", "award": [], "sourceid": 5905, "authors": [{"given_name": "Aditya", "family_name": "Bhaskara", "institution": "University of Utah"}, {"given_name": "Pruthuvi Maheshakya", "family_name": "Wijewardena", "institution": "University of Utah"}]}