{"title": "Efficient Globally Convergent Stochastic Optimization for Canonical Correlation Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 766, "page_last": 774, "abstract": "We study the stochastic optimization of canonical correlation analysis (CCA), whose objective is nonconvex and does not decouple over training samples. Although several stochastic gradient based optimization algorithms have been recently proposed to solve this problem, no global convergence guarantee was provided by any of them. Inspired by the alternating least squares/power iterations formulation of CCA, and the shift-and-invert preconditioning method for PCA, we propose two globally convergent meta-algorithms for CCA, both of which transform the original problem into sequences of least squares problems that need only be solved approximately. We instantiate the meta-algorithms with state-of-the-art SGD methods and obtain time complexities that significantly improve upon that of previous work. Experimental results demonstrate their superior performance.", "full_text": "Ef\ufb01cient Globally Convergent Stochastic\n\nOptimization for Canonical Correlation Analysis\n\nWeiran Wang1\u2217\n1Toyota Technological Institute at Chicago\n{weiranwang,dgarber,nati}@ttic.edu\n\nJialei Wang2\u2217\n\nDan Garber1\n\nNathan Srebro1\n\n2University of Chicago\njialei@uchicago.edu\n\nAbstract\n\n\u03a3\u2212 1\n\n2\n\n\u03a3\u2212 1\n\n2\n\n2\n\nWe study the stochastic optimization of canonical correlation analysis (CCA),\nwhose objective is nonconvex and does not decouple over training samples. Al-\nthough several stochastic gradient based optimization algorithms have been re-\ncently proposed to solve this problem, no global convergence guarantee was pro-\nvided by any of them. 
Inspired by the alternating least squares/power iterations formulation of CCA, and the shift-and-invert preconditioning method for PCA, we propose two globally convergent meta-algorithms for CCA, both of which transform the original problem into sequences of least squares problems that need only be solved approximately. We instantiate the meta-algorithms with state-of-the-art SGD methods and obtain time complexities that significantly improve upon those of previous work. Experimental results demonstrate their superior performance.

1 Introduction

Canonical correlation analysis (CCA, [1]) and its extensions are ubiquitous techniques in scientific research areas for revealing the common sources of variability in multiple views of the same phenomenon. In CCA, the training set consists of paired observations from two views, denoted $(x_1, y_1), \dots, (x_N, y_N)$, where N is the training set size, $x_i \in \mathbb{R}^{d_x}$ and $y_i \in \mathbb{R}^{d_y}$ for $i = 1, \dots, N$. We also denote the data matrices for each view2 by $X = [x_1, \dots, x_N] \in \mathbb{R}^{d_x \times N}$ and $Y = [y_1, \dots, y_N] \in \mathbb{R}^{d_y \times N}$, and $d := d_x + d_y$. The objective of CCA is to find linear projections of each view such that the correlation between the projections is maximized:

$$\max_{u,v} \; u^\top \Sigma_{xy} v \qquad \text{s.t.} \qquad u^\top \Sigma_{xx} u = v^\top \Sigma_{yy} v = 1, \qquad (1)$$

where $\Sigma_{xy} = \frac{1}{N} X Y^\top$ is the cross-covariance matrix, $\Sigma_{xx} = \frac{1}{N} X X^\top + \gamma_x I$ and $\Sigma_{yy} = \frac{1}{N} Y Y^\top + \gamma_y I$ are the auto-covariance matrices, and $(\gamma_x, \gamma_y) \ge 0$ are regularization parameters [2].

We denote by $(u^*, v^*)$ the global optimum of (1), which can be computed in closed form. Define

$$T := \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} \in \mathbb{R}^{d_x \times d_y}, \qquad (2)$$

and let $(\phi, \psi)$ be the (unit-length) left and right singular vector pair associated with T's largest singular value $\rho_1$.
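Before turning to the stochastic algorithms, it may help to see the closed-form route in code. The sketch below (a minimal NumPy illustration with our own function and variable names, not part of the paper's algorithms) forms T explicitly and reads off its top singular pair; this is exactly the computation that becomes infeasible for large d and N:

```python
import numpy as np

def closed_form_cca(X, Y, gx=1e-4, gy=1e-4):
    # X: (dx, N), Y: (dy, N), assumed centered; gx, gy play the role
    # of the regularization parameters (gamma_x, gamma_y).
    N = X.shape[1]
    Sxx = X @ X.T / N + gx * np.eye(X.shape[0])
    Syy = Y @ Y.T / N + gy * np.eye(Y.shape[0])
    Sxy = X @ Y.T / N

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S is PD).
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Rx, Ry = inv_sqrt(Sxx), inv_sqrt(Syy)
    T = Rx @ Sxy @ Ry                 # T = Sxx^{-1/2} Sxy Syy^{-1/2}
    U, s, Vt = np.linalg.svd(T)
    # Top singular pair (phi, psi) gives (u*, v*) and rho_1.
    return Rx @ U[:, 0], Ry @ Vt[0], s[0]

# Two 3-dimensional views sharing one latent coordinate z.
rng = np.random.default_rng(0)
z = rng.standard_normal((1, 2000))
X = np.vstack([z, rng.standard_normal((2, 2000))])
Y = np.vstack([z, rng.standard_normal((2, 2000))])
u_star, v_star, rho1 = closed_form_cca(X, Y)
```

Because the views share one coordinate exactly, the recovered canonical correlation is close to 1 (slightly below it due to the regularizers), and the returned directions satisfy the constraints of (1) by construction.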
Then the optimal objective value, i.e., the canonical correlation between the views, is $\rho_1$, achieved by $(u^*, v^*) = (\Sigma_{xx}^{-1/2} \phi, \; \Sigma_{yy}^{-1/2} \psi)$. Note that

$$\rho_1 = \|T\| \le \left\| \tfrac{1}{\sqrt{N}} \Sigma_{xx}^{-1/2} X \right\| \cdot \left\| \tfrac{1}{\sqrt{N}} \Sigma_{yy}^{-1/2} Y \right\| \le 1.$$

Furthermore, we are guaranteed to have $\rho_1 < 1$ if $(\gamma_x, \gamma_y) > 0$.

* The first two authors contributed equally.
2 We assume that X and Y are centered at the origin for notational simplicity; if they are not, we can center them as a pre-processing operation.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Table 1: Time complexities of different algorithms for achieving an $\eta$-suboptimal solution (u, v) to CCA, i.e., $\min\left((u^\top \Sigma_{xx} u^*)^2, (v^\top \Sigma_{yy} v^*)^2\right) \ge 1 - \eta$. GD = gradient descent, AGD = accelerated GD, SVRG = stochastic variance reduced gradient, ASVRG = accelerated SVRG. Note ASVRG provides speedup over SVRG only when $\tilde{\kappa} > N$, and we show the dominant term in its complexity.

Algorithm | Least squares solver | Time complexity
AppGrad [3] | GD | $\tilde{O}\!\left(d N \tilde{\kappa} \cdot \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \cdot \log\frac{1}{\eta}\right)$ (local)
CCALin [6] | AGD | $\tilde{O}\!\left(d N \sqrt{\tilde{\kappa}} \cdot \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \cdot \log\frac{1}{\eta}\right)$
This work: alternating least squares (ALS) | AGD | $\tilde{O}\!\left(d N \sqrt{\tilde{\kappa}} \cdot \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \cdot \log^2\frac{1}{\eta}\right)$
 | SVRG | $\tilde{O}\!\left(d (N + \tilde{\kappa}) \cdot \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \cdot \log^2\frac{1}{\eta}\right)$
 | ASVRG | $\tilde{O}\!\left(d \sqrt{N \tilde{\kappa}} \cdot \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \cdot \log^2\frac{1}{\eta}\right)$
This work: shift-and-invert preconditioning (SI) | AGD | $\tilde{O}\!\left(d N \sqrt{\tilde{\kappa} \cdot \frac{1}{\rho_1 - \rho_2}} \cdot \log^2\frac{1}{\eta}\right)$
 | SVRG | $\tilde{O}\!\left(d \left(N + \tilde{\kappa} \cdot \frac{1}{\rho_1 - \rho_2}\right) \cdot \log^2\frac{1}{\eta}\right)$
 | ASVRG | $\tilde{O}\!\left(d \left(N + N^{3/4} \sqrt{\tilde{\kappa} \cdot \frac{1}{\rho_1 - \rho_2}}\right) \cdot \log^2\frac{1}{\eta}\right)$

For large and high dimensional datasets, it is time and memory consuming to first explicitly form the matrix T (which requires eigen-decomposition of the covariance matrices) and then compute its singular value decomposition (SVD). For such datasets, it is desirable to develop stochastic algorithms that have efficient updates, converge fast, and take advantage of the input sparsity. There have been recent attempts to solve (1) based on stochastic gradient descent (SGD) methods [3, 4, 5], but none of these works provides rigorous convergence analysis for their stochastic CCA algorithms.

The main contribution of this paper is the proposal of two globally convergent meta-algorithms for solving (1), namely, alternating least squares (ALS, Algorithm 2) and shift-and-invert preconditioning (SI, Algorithm 3), both of which transform the original problem (1) into sequences of least squares problems that need only be solved approximately.
We instantiate the meta-algorithms with state-of-the-art SGD methods and obtain efficient stochastic optimization algorithms for CCA.

In order to measure the alignments between an approximate solution (u, v) and the optimum $(u^*, v^*)$, we assume that T has a positive singular value gap $\Delta := \rho_1 - \rho_2 \in (0, 1]$, so its top left and right singular vector pair is unique (up to a change of sign).

Table 1 summarizes the time complexities of several algorithms for achieving $\eta$-suboptimal alignments, where $\tilde{\kappa} = \frac{\max_i \max(\|x_i\|^2, \|y_i\|^2)}{\min(\sigma_{\min}(\Sigma_{xx}), \sigma_{\min}(\Sigma_{yy}))}$ is the upper bound of the condition numbers of the least squares problems solved in all cases.3 We use the notation $\tilde{O}(\cdot)$ to hide poly-logarithmic dependencies (see Sec. 3.1.1 and Sec. 3.2.3 for the hidden factors). Each time complexity may be preferable in a certain regime depending on the parameters of the problem.

Notations. We use $\sigma_i(A)$ to denote the i-th largest singular value of a matrix A, and use $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ to denote the largest and smallest singular values of A respectively.

2 Motivation: Alternating least squares

Our solution to (1) is inspired by the alternating least squares (ALS) formulation of CCA [7, Algorithm 5.2], as shown in Algorithm 1. Let the nonzero singular values of T be $1 \ge \rho_1 \ge \rho_2 \ge \cdots \ge \rho_r > 0$, where $r = \mathrm{rank}(T) \le \min(d_x, d_y)$, and the corresponding (unit-length) left and right singular vector pairs be $(a_1, b_1), \dots, (a_r, b_r)$, with $a_1 = \phi$ and $b_1 = \psi$. Define

$$C = \begin{bmatrix} 0 & T \\ T^\top & 0 \end{bmatrix} \in \mathbb{R}^{d \times d}. \qquad (3)$$

3 For the ALS meta-algorithm, it is enough to consider a per-view conditioning.
And when using AGD as the least squares solver, the time complexities depend on $\sigma_{\max}(\Sigma_{xx})$ instead, which is less than $\max_i \|x_i\|^2$.

Algorithm 1 Alternating least squares for CCA.
Input: Data matrices $X \in \mathbb{R}^{d_x \times N}$, $Y \in \mathbb{R}^{d_y \times N}$, regularization parameters $(\gamma_x, \gamma_y)$.
Initialize $\tilde{u}_0 \in \mathbb{R}^{d_x}$, $\tilde{v}_0 \in \mathbb{R}^{d_y}$.   {$\tilde{\phi}_0, \tilde{\psi}_0$}
$u_0 \leftarrow \tilde{u}_0 / \sqrt{\tilde{u}_0^\top \Sigma_{xx} \tilde{u}_0}$, $v_0 \leftarrow \tilde{v}_0 / \sqrt{\tilde{v}_0^\top \Sigma_{yy} \tilde{v}_0}$   {$\phi_0 \leftarrow \tilde{\phi}_0 / \|\tilde{\phi}_0\|$, $\psi_0 \leftarrow \tilde{\psi}_0 / \|\tilde{\psi}_0\|$}
for t = 1, 2, ..., T do
  $\tilde{u}_t \leftarrow \Sigma_{xx}^{-1} \Sigma_{xy} v_{t-1}$   {$\tilde{\phi}_t \leftarrow \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} \psi_{t-1}$}
  $\tilde{v}_t \leftarrow \Sigma_{yy}^{-1} \Sigma_{xy}^\top u_{t-1}$   {$\tilde{\psi}_t \leftarrow \Sigma_{yy}^{-1/2} \Sigma_{xy}^\top \Sigma_{xx}^{-1/2} \phi_{t-1}$}
  $u_t \leftarrow \tilde{u}_t / \sqrt{\tilde{u}_t^\top \Sigma_{xx} \tilde{u}_t}$, $v_t \leftarrow \tilde{v}_t / \sqrt{\tilde{v}_t^\top \Sigma_{yy} \tilde{v}_t}$   {$\phi_t \leftarrow \tilde{\phi}_t / \|\tilde{\phi}_t\|$, $\psi_t \leftarrow \tilde{\psi}_t / \|\tilde{\psi}_t\|$}
end for
Output: $(u_T, v_T) \to (u^*, v^*)$ as $T \to \infty$.   {$(\phi_T, \psi_T) \to (\phi, \psi)$}

It is straightforward to check that the nonzero eigenvalues of C are

$$\rho_1 \ge \cdots \ge \rho_r \ge -\rho_r \ge \cdots \ge -\rho_1,$$

with corresponding eigenvectors $\frac{1}{\sqrt{2}}\begin{bmatrix} a_1 \\ b_1 \end{bmatrix}, \dots, \frac{1}{\sqrt{2}}\begin{bmatrix} a_r \\ b_r \end{bmatrix}, \frac{1}{\sqrt{2}}\begin{bmatrix} a_r \\ -b_r \end{bmatrix}, \dots, \frac{1}{\sqrt{2}}\begin{bmatrix} a_1 \\ -b_1 \end{bmatrix}$.

The key observation is that Algorithm 1 effectively runs a variant of power iterations on C to extract its top eigenvector. To see this, make the following change of variables:

$$\phi_t = \Sigma_{xx}^{1/2} u_t, \quad \tilde{\phi}_t = \Sigma_{xx}^{1/2} \tilde{u}_t, \quad \psi_t = \Sigma_{yy}^{1/2} v_t, \quad \tilde{\psi}_t = \Sigma_{yy}^{1/2} \tilde{v}_t. \qquad (4)$$

Then we can equivalently rewrite the steps of Algorithm 1 in the new variables, as shown in the {} of each line. Observe that the iterates are updated as follows from step t − 1 to step t:

$$\begin{bmatrix} \tilde{\phi}_t \\ \tilde{\psi}_t \end{bmatrix} \leftarrow \begin{bmatrix} 0 & T \\ T^\top & 0 \end{bmatrix} \begin{bmatrix} \phi_{t-1} \\ \psi_{t-1} \end{bmatrix}, \qquad \begin{bmatrix} \phi_t \\ \psi_t \end{bmatrix} \leftarrow \begin{bmatrix} \tilde{\phi}_t / \|\tilde{\phi}_t\| \\ \tilde{\psi}_t / \|\tilde{\psi}_t\| \end{bmatrix}. \qquad (5)$$

Except for the special normalization steps, which rescale the two sets of variables separately, Algorithm 1 is very similar to the power iterations [8].

We show the convergence rate of ALS below (see its proof in Appendix A). The first measure of progress is the alignment of $\phi_t$ to $\phi$ and the alignment of $\psi_t$ to $\psi$, i.e., $(\phi_t^\top \phi)^2 = (u_t^\top \Sigma_{xx} u^*)^2$ and $(\psi_t^\top \psi)^2 = (v_t^\top \Sigma_{yy} v^*)^2$. The maximum value for such alignments is 1, achieved when the iterates completely align with the optimal solution.
The second natural measure of progress is the objective of (1), i.e., $u_t^\top \Sigma_{xy} v_t$, with the maximum value being $\rho_1$.

Theorem 1 (Convergence of Algorithm 1). Let $\mu := \min\left((u_0^\top \Sigma_{xx} u^*)^2, (v_0^\top \Sigma_{yy} v^*)^2\right) > 0$.4 Then for $t \ge \left\lceil \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \log\frac{2}{\mu\eta} \right\rceil$, we have in Algorithm 1 that $\min\left((u_t^\top \Sigma_{xx} u^*)^2, (v_t^\top \Sigma_{yy} v^*)^2\right) \ge 1 - \eta$, and $u_t^\top \Sigma_{xy} v_t \ge \rho_1 (1 - 2\eta)$.

Remarks. We have assumed a nonzero singular value gap in Theorem 1 to obtain linear convergence in both the alignments and the objective. When there exists no singular value gap, the top singular vector pair is not unique and it is no longer meaningful to measure the alignments. Nonetheless, it is possible to extend our proof to obtain sublinear convergence for the objective in this case.

Observe that, besides the steps of normalization to unit length, the basic operation in each iteration of Algorithm 1 is of the form $\tilde{u}_t \leftarrow \Sigma_{xx}^{-1} \Sigma_{xy} v_{t-1} = \left(\frac{1}{N} X X^\top + \gamma_x I\right)^{-1} \frac{1}{N} X Y^\top v_{t-1}$, which is equivalent to solving the following regularized least squares (ridge regression) problem:

$$\min_u \; \frac{1}{2N} \left\| u^\top X - v_{t-1}^\top Y \right\|^2 + \frac{\gamma_x}{2} \|u\|^2 \;\equiv\; \min_u \; \frac{1}{N} \sum_{i=1}^N \frac{1}{2} \left( u^\top x_i - v_{t-1}^\top y_i \right)^2 + \frac{\gamma_x}{2} \|u\|^2. \qquad (6)$$

In the next section, we show that, to maintain the convergence of ALS, it is unnecessary to solve the least squares problems exactly.
This enables us to use state-of-the-art SGD methods for solving (6) to sufficient accuracy, and to obtain a globally convergent stochastic algorithm for CCA.

4 One can show that $\mu$ is bounded away from 0 with high probability using random initialization $(u_0, v_0)$.

Algorithm 2 The alternating least squares (ALS) meta-algorithm for CCA.
Input: Data matrices $X \in \mathbb{R}^{d_x \times N}$, $Y \in \mathbb{R}^{d_y \times N}$, regularization parameters $(\gamma_x, \gamma_y)$.
Initialize $\tilde{u}_0 \in \mathbb{R}^{d_x}$, $\tilde{v}_0 \in \mathbb{R}^{d_y}$.
$\tilde{u}_0 \leftarrow \tilde{u}_0 / \sqrt{\tilde{u}_0^\top \Sigma_{xx} \tilde{u}_0}$, $\tilde{v}_0 \leftarrow \tilde{v}_0 / \sqrt{\tilde{v}_0^\top \Sigma_{yy} \tilde{v}_0}$, $u_0 \leftarrow \tilde{u}_0$, $v_0 \leftarrow \tilde{v}_0$
for t = 1, 2, ..., T do
  Solve $\min_u f_t(u) := \frac{1}{2N}\left\|u^\top X - v_{t-1}^\top Y\right\|^2 + \frac{\gamma_x}{2}\|u\|^2$ with initialization $\tilde{u}_{t-1}$, and output an approximate solution $\tilde{u}_t$ satisfying $f_t(\tilde{u}_t) \le \min_u f_t(u) + \epsilon$.
  Solve $\min_v g_t(v) := \frac{1}{2N}\left\|v^\top Y - u_{t-1}^\top X\right\|^2 + \frac{\gamma_y}{2}\|v\|^2$ with initialization $\tilde{v}_{t-1}$, and output an approximate solution $\tilde{v}_t$ satisfying $g_t(\tilde{v}_t) \le \min_v g_t(v) + \epsilon$.
  $u_t \leftarrow \tilde{u}_t / \sqrt{\tilde{u}_t^\top \Sigma_{xx} \tilde{u}_t}$, $v_t \leftarrow \tilde{v}_t / \sqrt{\tilde{v}_t^\top \Sigma_{yy} \tilde{v}_t}$
end for
Output: $(u_T, v_T)$ is the approximate solution to CCA.

3 Our algorithms

3.1 Algorithm I: Alternating least squares (ALS) with variance reduction

Our first algorithm consists of two nested loops. The outer loop runs inexact power iterations, while the inner loop uses advanced stochastic optimization methods, e.g., stochastic variance reduced gradient (SVRG, [9]), to obtain approximate matrix-vector multiplications. A sketch of our algorithm is provided in Algorithm 2.
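To make the nested-loop structure concrete, here is a minimal NumPy sketch of the ALS meta-algorithm (our own code; for illustration the inner least squares problems are solved exactly by direct linear solves, whereas the paper solves them only approximately with SGD methods and warm-starts):

```python
import numpy as np

def als_cca(X, Y, gx=1e-3, gy=1e-3, n_iters=50, seed=0):
    # Sketch of Algorithm 2. Each outer iteration solves the two ridge
    # regression subproblems f_t, g_t (here: exactly, via linear solves)
    # and then normalizes with respect to Sxx, Syy.
    (dx, N), dy = X.shape, Y.shape[0]
    Sxx = X @ X.T / N + gx * np.eye(dx)
    Syy = Y @ Y.T / N + gy * np.eye(dy)
    Sxy = X @ Y.T / N
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(dx)
    v = rng.standard_normal(dy)
    u /= np.sqrt(u @ Sxx @ u)
    v /= np.sqrt(v @ Syy @ v)
    for _ in range(n_iters):
        u_t = np.linalg.solve(Sxx, Sxy @ v)    # argmin_u f_t(u)
        v_t = np.linalg.solve(Syy, Sxy.T @ u)  # argmin_v g_t(v)
        u = u_t / np.sqrt(u_t @ Sxx @ u_t)     # exact normalization
        v = v_t / np.sqrt(v_t @ Syy @ v_t)
    return u, v, u @ Sxy @ v

# Synthetic two-view data sharing one latent coordinate.
rng = np.random.default_rng(1)
z = rng.standard_normal((1, 2000))
X = np.vstack([z, rng.standard_normal((2, 2000))])
Y = np.vstack([z, rng.standard_normal((2, 2000))])
u, v, corr = als_cca(X, Y)
```

On this data the top canonical correlation is close to 1, and the iterates converge in a handful of outer iterations since the singular value gap of T is large.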
We make the following observations from this algorithm.

Connection to previous work. At step t, if we optimize $f_t(u)$ and $g_t(v)$ crudely by a single batch gradient descent step from the initialization $(\tilde{u}_{t-1}, \tilde{v}_{t-1})$, we obtain the following update rule:

$$\tilde{u}_t \leftarrow \tilde{u}_{t-1} - 2\xi\, X (X^\top \tilde{u}_{t-1} - Y^\top v_{t-1})/N, \qquad u_t \leftarrow \tilde{u}_t / \sqrt{\tilde{u}_t^\top \Sigma_{xx} \tilde{u}_t},$$
$$\tilde{v}_t \leftarrow \tilde{v}_{t-1} - 2\xi\, Y (Y^\top \tilde{v}_{t-1} - X^\top u_{t-1})/N, \qquad v_t \leftarrow \tilde{v}_t / \sqrt{\tilde{v}_t^\top \Sigma_{yy} \tilde{v}_t},$$

where $\xi > 0$ is the stepsize (assuming $\gamma_x = \gamma_y = 0$). This coincides with the AppGrad algorithm of [3, Algorithm 3], for which only local convergence is shown. Since the objectives $f_t(u)$ and $g_t(v)$ decouple over training samples, it is convenient to apply SGD methods to them. This observation motivated the stochastic CCA algorithms of [3, 4]. We note, however, that no global convergence guarantee was shown for these stochastic CCA algorithms, and the key to our convergent algorithm is to solve the least squares problems to sufficient accuracy.

Warm-start. Observe that for different t, the least squares problems $f_t(u)$ only differ in their targets, as $v_t$ changes over time. Since $v_{t-1}$ is close to $v_t$ (especially when near convergence), we may use $\tilde{u}_t$ as initialization for minimizing $f_{t+1}(u)$ with an iterative algorithm.

Normalization. At the end of each outer loop, Algorithm 2 implements exact normalization of the form $u_t \leftarrow \tilde{u}_t / \sqrt{\tilde{u}_t^\top \Sigma_{xx} \tilde{u}_t}$ to ensure the constraints, where $\tilde{u}_t^\top \Sigma_{xx} \tilde{u}_t = \frac{1}{N} (\tilde{u}_t^\top X)(\tilde{u}_t^\top X)^\top + \gamma_x \|\tilde{u}_t\|^2$ requires computing the projection of the training set $\tilde{u}_t^\top X$. However, this does not introduce extra computation, because we also compute this projection for the batch gradient used by SVRG (at the beginning of time step t + 1).
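In code, the exact normalization amounts to reusing the projection of the training set (a small sketch under our own naming; `gx` plays the role of $\gamma_x$):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 400))   # view-1 data matrix, (dx, N)
u_tilde = rng.standard_normal(5)    # inexact least squares solution
gx = 1e-3

proj = u_tilde @ X                  # projections u~'X: also needed for
                                    # SVRG's next batch gradient
# u~' Sxx u~ = ||u~'X||^2 / N + gx ||u~||^2: no extra pass over the data
norm_sq = proj @ proj / X.shape[1] + gx * (u_tilde @ u_tilde)
u = u_tilde / np.sqrt(norm_sq)      # exact normalization

Sxx = X @ X.T / X.shape[1] + gx * np.eye(5)
```

The identity used in the comment is exactly the one in the text, so the normalized `u` satisfies the constraint with respect to the explicit covariance matrix, without ever forming that matrix.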
In contrast, the stochastic algorithms of [3, 4] (possibly adaptively) estimate the covariance matrix from a minibatch of training samples and use the estimated covariance for normalization. This is because their algorithms perform normalizations after each update and thus need to avoid computing the projection of the entire training set frequently. But as a result, their inexact normalization steps introduce noise into the algorithms.

Input sparsity. For high dimensional sparse data (such as those used in natural language processing [10]), an advantage of gradient based methods over the closed-form solution is that the former take into account the input sparsity. For sparse inputs, the time complexity of our algorithm depends on nnz(X, Y), i.e., the total number of nonzeros in the inputs, instead of dN.

Canonical ridge. When $(\gamma_x, \gamma_y) > 0$, $f_t(u)$ and $g_t(v)$ are guaranteed to be strongly convex due to the $\ell_2$ regularization, in which case SVRG converges linearly. It is therefore beneficial to use small nonzero regularization for improved computational efficiency, especially for high dimensional datasets where the inputs X and Y are approximately low-rank.

Convergence. By the analysis of inexact power iterations, where the least squares problems are solved (or the matrix-vector multiplications are computed) only up to the necessary accuracy, we provide the following theorem for the convergence of Algorithm 2 (see its proof in Appendix B).
The key to our analysis is to bound the distances between the iterates of Algorithm 2 and those of Algorithm 1 at all time steps: when the errors of the least squares problems are sufficiently small (at the level of $\eta^2$), the iterates of the two algorithms have the same quality.

Theorem 2 (Convergence of Algorithm 2). Fix $T \ge \left\lceil \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \log\frac{2}{\mu\eta} \right\rceil$, and set $\epsilon(T) \le \frac{\eta^2 \rho_r^2}{128} \cdot \frac{(2\rho_1/\rho_r) - 1}{(2\rho_1/\rho_r)^T - 1}$ in Algorithm 2. Then we have $u_T^\top \Sigma_{xx} u_T = v_T^\top \Sigma_{yy} v_T = 1$, $\min\left((u_T^\top \Sigma_{xx} u^*)^2, (v_T^\top \Sigma_{yy} v^*)^2\right) \ge 1 - \eta$, and $u_T^\top \Sigma_{xy} v_T \ge \rho_1 (1 - 2\eta)$.

3.1.1 Stochastic optimization of regularized least squares

We now discuss the inner loop of Algorithm 2, which approximately solves problems of the form (6). Owing to the finite-sum structure of (6), several stochastic optimization methods, such as SAG [11], SDCA [12] and SVRG [9], provide linear convergence rates. All these algorithms can be readily applied to (6); we choose SVRG since it is memory efficient and easy to implement. We also apply the recently developed acceleration techniques for first order optimization methods [13, 14] to obtain an accelerated SVRG (ASVRG) algorithm. We give the sketch of SVRG for (6) in Appendix C.

Note that $f(u) = \frac{1}{N} \sum_{i=1}^N f^i(u)$, where each component $f^i(u) = \frac{1}{2}\left(u^\top x_i - v^\top y_i\right)^2 + \frac{\gamma_x}{2}\|u\|^2$ is $\|x_i\|^2$-smooth, and $f(u)$ is $\sigma_{\min}(\Sigma_{xx})$-strongly convex5 with $\sigma_{\min}(\Sigma_{xx}) \ge \gamma_x$. We show in Appendix D that the initial suboptimality for minimizing $f_t(u)$ is upper-bounded by a constant when using the warm-starts. We quote the convergence rates of SVRG [9] and ASVRG [14] below.

Lemma 3.
The SVRG algorithm [9] finds a vector $\tilde{u}$ satisfying6 $\mathbb{E}[f(\tilde{u})] - \min_u f(u) \le \epsilon$ in time $O\!\left(d_x (N + \kappa_x) \log\frac{1}{\epsilon}\right)$, where $\kappa_x = \frac{\max_i \|x_i\|^2}{\sigma_{\min}(\Sigma_{xx})}$. The ASVRG algorithm [14] finds such a solution in time $O\!\left(d_x \sqrt{N \kappa_x} \log\frac{1}{\epsilon}\right)$.

Remarks. As mentioned in [14], the accelerated version provides speedup over normal SVRG only when $\kappa_x > N$, and we only show the dominant term in the above complexity.

By combining the iteration complexity of the outer loop (Theorem 2) and the time complexity of the inner loop (Lemma 3), we obtain the total time complexity of $\tilde{O}\!\left(d (N + \kappa) \cdot \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \cdot \log^2\frac{1}{\eta}\right)$ for ALS+SVRG and $\tilde{O}\!\left(d \sqrt{N \kappa} \cdot \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \cdot \log^2\frac{1}{\eta}\right)$ for ALS+ASVRG, where $\kappa := \max\left(\frac{\max_i \|x_i\|^2}{\sigma_{\min}(\Sigma_{xx})}, \frac{\max_i \|y_i\|^2}{\sigma_{\min}(\Sigma_{yy})}\right)$, and $\tilde{O}(\cdot)$ hides poly-logarithmic dependences on $\frac{1}{\mu}$ and $\frac{1}{\rho_r}$. Our algorithm does not require the initialization to be close to the optimum, and converges globally. For comparison, the locally convergent AppGrad has a time complexity [3, Theorem 2.1] of $\tilde{O}\!\left(d N \kappa' \cdot \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \cdot \log\frac{1}{\eta}\right)$, where $\kappa' := \max\left(\frac{\sigma_{\max}(\Sigma_{xx})}{\sigma_{\min}(\Sigma_{xx})}, \frac{\sigma_{\max}(\Sigma_{yy})}{\sigma_{\min}(\Sigma_{yy})}\right)$. Note that in this complexity the dataset size N and the least squares condition number $\kappa'$ are multiplied together, because AppGrad essentially uses batch gradient descent as the least squares solver. Within our framework, we can use accelerated gradient descent (AGD, [15]) instead and obtain a globally convergent algorithm with a total time complexity of $\tilde{O}\!\left(d N \sqrt{\kappa'} \cdot \frac{\rho_1^2}{\rho_1^2 - \rho_2^2} \cdot \log^2\frac{1}{\eta}\right)$.

3.2 Algorithm II: Shift-and-invert preconditioning (SI) with variance reduction

The second algorithm is inspired by the shift-and-invert preconditioning method for PCA [16, 17]. Instead of running power iterations on C as defined in (3), we will be running power iterations on

$$M_\lambda = (\lambda I - C)^{-1} = \begin{bmatrix} \lambda I & -T \\ -T^\top & \lambda I \end{bmatrix}^{-1} \in \mathbb{R}^{d \times d}, \qquad (7)$$

where $\lambda > \rho_1$. It is straightforward to check that $M_\lambda$ is positive definite and its eigenvalues are

$$\frac{1}{\lambda - \rho_1} \ge \cdots \ge \frac{1}{\lambda - \rho_r} \ge \frac{1}{\lambda + \rho_r} \ge \cdots \ge \frac{1}{\lambda + \rho_1},$$

with eigenvectors $\frac{1}{\sqrt{2}}\begin{bmatrix} a_1 \\ b_1 \end{bmatrix}, \dots, \frac{1}{\sqrt{2}}\begin{bmatrix} a_r \\ b_r \end{bmatrix}, \frac{1}{\sqrt{2}}\begin{bmatrix} a_r \\ -b_r \end{bmatrix}, \dots, \frac{1}{\sqrt{2}}\begin{bmatrix} a_1 \\ -b_1 \end{bmatrix}$.

5 We omit the regularization in these constants, which are typically very small, to have concise expressions.
6 The expectation is taken over the random sampling of component functions. High probability error bounds can be obtained using Markov's inequality.

The main idea behind shift-and-invert power iterations is that when $\lambda - \rho_1 = c(\rho_1 - \rho_2)$ with $c \sim O(1)$, the relative eigenvalue gap of $M_\lambda$ is large, and so power iterations on $M_\lambda$ converge quickly. Our shift-and-invert preconditioning (SI) meta-algorithm for CCA is sketched in Algorithm 3 (in Appendix E due to the space limit) and proceeds in two phases.

3.2.1 Phase I: shift-and-invert preconditioning for eigenvectors of $M_\lambda$

Using an estimate $\tilde{\Delta}$ of the singular value gap and starting from an over-estimate of $\rho_1$ ($1 + \tilde{\Delta}$ suffices), the algorithm gradually shrinks $\lambda^{(s)}$ towards $\rho_1$ by crudely estimating the leading eigenvector/eigenvalue of each $M_{\lambda^{(s)}}$ along the way and shrinking the gap $\lambda^{(s)} - \rho_1$, until we reach a $\lambda^{(f)} \in (\rho_1, \; \rho_1 + c(\rho_1 - \rho_2))$ where $c \sim O(1)$. Afterwards, the algorithm fixes $\lambda^{(f)}$ and runs inexact power iterations on $M_{\lambda^{(f)}}$ to obtain an accurate estimate of its leading eigenvector. Note that in this phase, power iterations implicitly operate on the concatenated variables $\frac{1}{\sqrt{2}}\begin{bmatrix} \Sigma_{xx}^{1/2} \tilde{u}_t \\ \Sigma_{yy}^{1/2} \tilde{v}_t \end{bmatrix}$ and $\frac{1}{\sqrt{2}}\begin{bmatrix} \Sigma_{xx}^{1/2} u_t \\ \Sigma_{yy}^{1/2} v_t \end{bmatrix}$ in $\mathbb{R}^d$ (but without ever computing $\Sigma_{xx}^{1/2}$ and $\Sigma_{yy}^{1/2}$).

Matrix-vector multiplication. The matrix-vector multiplications in Phase I have the form

$$\begin{bmatrix} \tilde{u}_t \\ \tilde{v}_t \end{bmatrix} \leftarrow \begin{bmatrix} \lambda \Sigma_{xx} & -\Sigma_{xy} \\ -\Sigma_{xy}^\top & \lambda \Sigma_{yy} \end{bmatrix}^{-1} \begin{bmatrix} \Sigma_{xx} & \\ & \Sigma_{yy} \end{bmatrix} \begin{bmatrix} u_{t-1} \\ v_{t-1} \end{bmatrix}, \qquad (8)$$

where $\lambda$ varies over time in order to locate $\lambda^{(f)}$.
This is equivalent to solving

$$\begin{bmatrix} \tilde{u}_t \\ \tilde{v}_t \end{bmatrix} \leftarrow \arg\min_{u,v} \; \frac{1}{2} \begin{bmatrix} u \\ v \end{bmatrix}^\top \begin{bmatrix} \lambda \Sigma_{xx} & -\Sigma_{xy} \\ -\Sigma_{xy}^\top & \lambda \Sigma_{yy} \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} - u^\top \Sigma_{xx} u_{t-1} - v^\top \Sigma_{yy} v_{t-1}.$$

And as in ALS, this least squares problem can be further written as a finite sum:

$$\min_{u,v} \; h_t(u, v) = \frac{1}{N} \sum_{i=1}^N h_t^i(u, v), \quad \text{where} \quad h_t^i(u, v) = \frac{1}{2} \begin{bmatrix} u \\ v \end{bmatrix}^\top \begin{bmatrix} \lambda (x_i x_i^\top + \gamma_x I) & -x_i y_i^\top \\ -y_i x_i^\top & \lambda (y_i y_i^\top + \gamma_y I) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} - u^\top \Sigma_{xx} u_{t-1} - v^\top \Sigma_{yy} v_{t-1}. \qquad (9)$$

We could directly apply SGD methods to this problem as before.

Normalization. The normalization steps in Phase I have the form

$$\begin{bmatrix} u_t \\ v_t \end{bmatrix} \leftarrow \sqrt{2} \begin{bmatrix} \tilde{u}_t \\ \tilde{v}_t \end{bmatrix} \Big/ \sqrt{\tilde{u}_t^\top \Sigma_{xx} \tilde{u}_t + \tilde{v}_t^\top \Sigma_{yy} \tilde{v}_t},$$

and so the following remains true for the normalized iterates in Phase I:

$$u_t^\top \Sigma_{xx} u_t + v_t^\top \Sigma_{yy} v_t = 2, \quad \text{for} \quad t = 1, \dots, T. \qquad (10)$$

Unlike the normalizations in ALS, the iterates $u_t$ and $v_t$ in Phase I do not satisfy the original CCA constraints; this is taken care of in Phase II.

We have the following convergence guarantee for Phase I (see its proof in Appendix F).

Theorem 4 (Convergence of Algorithm 3, Phase I). Let $\Delta = \rho_1 - \rho_2 \in (0, 1]$, $\tilde{\mu} := \frac{1}{4}\left(u_0^\top \Sigma_{xx} u^* + v_0^\top \Sigma_{yy} v^*\right)^2 > 0$, and $\tilde{\Delta} \in [c_1 \Delta, \; c_2 \Delta]$ where $0 < c_1 \le c_2 \le 1$. Set the numbers of power iterations $m_1 = O\!\left(\log\frac{1}{\tilde{\mu}}\right)$ and $m_2 = O\!\left(\log\frac{1}{\tilde{\mu}\eta^2}\right)$ of the two stages and the least squares accuracy $\tilde{\epsilon}$ as specified in Algorithm 3 (see Appendix F for the exact constants). Then the $(u_T, v_T)$ output by Phase I of Algorithm 3 satisfies (10) and

$$\frac{1}{4}\left(u_T^\top \Sigma_{xx} u^* + v_T^\top \Sigma_{yy} v^*\right)^2 \ge 1 - \frac{\eta^2}{64}, \qquad (11)$$

and the number of calls to the least squares solver of $h_t(u, v)$ is $O\!\left(\log\frac{1}{\tilde{\mu}} \log\frac{1}{\Delta} + \log\frac{1}{\tilde{\mu}\eta^2}\right)$.

3.2.2 Phase II: final normalization

In order to satisfy the CCA constraints, we perform a last normalization

$$\hat{u} \leftarrow u_T / \sqrt{u_T^\top \Sigma_{xx} u_T}, \qquad \hat{v} \leftarrow v_T / \sqrt{v_T^\top \Sigma_{yy} v_T}, \qquad (12)$$

and we output $(\hat{u}, \hat{v})$ as our final approximate solution to (1). We show that this step does not cause much loss in the alignments, as stated below (see its proof in Appendix G).

Theorem 5 (Convergence of Algorithm 3, Phase II). Let Phase I of Algorithm 3 output $(u_T, v_T)$ satisfying (11). Then after (12), we obtain an approximate solution $(\hat{u}, \hat{v})$ to (1) such that $\hat{u}^\top \Sigma_{xx} \hat{u} = \hat{v}^\top \Sigma_{yy} \hat{v} = 1$, $\min\left((\hat{u}^\top \Sigma_{xx} u^*)^2, (\hat{v}^\top \Sigma_{yy} v^*)^2\right) \ge 1 - \eta$, and $\hat{u}^\top \Sigma_{xy} \hat{v} \ge \rho_1 (1 - 2\eta)$.

3.2.3 Time complexity

We have shown in Theorem 4 that Phase I only approximately solves a small number of instances of (9). The normalization steps (10) require computing the projections of the training set, which are reused for computing the batch gradients of (9). The final normalization (12) is done only once and costs O(dN).
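Putting the two phases together, here is a minimal NumPy sketch of shift-and-invert power iterations for CCA (our own code; the shift is held fixed and the linear systems are solved exactly, whereas Algorithm 3 locates $\lambda^{(f)}$ adaptively and uses SGD solvers for the subproblems):

```python
import numpy as np

def si_cca_fixed_shift(X, Y, lam, gx=1e-3, gy=1e-3, n_iters=30, seed=0):
    # Phase I with a fixed shift lam > rho_1 and exact solves, followed
    # by the final normalization of Phase II. The schedule that locates
    # lam, and the stochastic least squares solver, are omitted.
    (dx, N), dy = X.shape, Y.shape[0]
    Sxx = X @ X.T / N + gx * np.eye(dx)
    Syy = Y @ Y.T / N + gy * np.eye(dy)
    Sxy = X @ Y.T / N
    # Q_lam and the block-diagonal covariance, in the original variables.
    Q = np.block([[lam * Sxx, -Sxy], [-Sxy.T, lam * Syy]])
    D = np.block([[Sxx, np.zeros((dx, dy))], [np.zeros((dy, dx)), Syy]])
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dx + dy)
    w *= np.sqrt(2.0 / (w @ D @ w))
    for _ in range(n_iters):
        w = np.linalg.solve(Q, D @ w)       # power iteration with M_lam
        w *= np.sqrt(2.0 / (w @ D @ w))     # keep u'Sxx u + v'Syy v = 2
    u, v = w[:dx], w[dx:]
    u /= np.sqrt(u @ Sxx @ u)               # Phase II: final normalization
    v /= np.sqrt(v @ Syy @ v)
    return u, v, u @ Sxy @ v

rng = np.random.default_rng(3)
z = rng.standard_normal((1, 2000))
X = np.vstack([z, rng.standard_normal((2, 2000))])
Y = np.vstack([z, rng.standard_normal((2, 2000))])
u, v, corr = si_cca_fixed_shift(X, Y, lam=1.1)
```

Here the shared coordinate makes $\rho_1 \approx 1$, so `lam=1.1` is a valid shift; because the other correlations are near zero, the relative eigenvalue gap of $M_\lambda$ is large and a few iterations suffice.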
Therefore, the time complexity of our algorithm mainly comes from solving the least squares problems (9) using SGD methods in a black-box fashion, and the time complexity of SGD methods depends on the condition number of (9). Denote

$$Q_\lambda = \begin{bmatrix} \lambda \Sigma_{xx} & -\Sigma_{xy} \\ -\Sigma_{xy}^\top & \lambda \Sigma_{yy} \end{bmatrix} = \begin{bmatrix} \Sigma_{xx}^{1/2} & \\ & \Sigma_{yy}^{1/2} \end{bmatrix} \begin{bmatrix} \lambda I & -T \\ -T^\top & \lambda I \end{bmatrix} \begin{bmatrix} \Sigma_{xx}^{1/2} & \\ & \Sigma_{yy}^{1/2} \end{bmatrix}. \qquad (13)$$

It is clear that

$$\sigma_{\max}(Q_\lambda) \le (\lambda + \rho_1) \cdot \max\left(\sigma_{\max}(\Sigma_{xx}), \sigma_{\max}(\Sigma_{yy})\right), \qquad \sigma_{\min}(Q_\lambda) \ge (\lambda - \rho_1) \cdot \min\left(\sigma_{\min}(\Sigma_{xx}), \sigma_{\min}(\Sigma_{yy})\right).$$

We have shown in the proof of Theorem 4 that $\frac{\lambda + \rho_1}{\lambda - \rho_1} \le \frac{9}{\tilde{\Delta}} \le \frac{9/c_1}{\rho_1 - \rho_2}$ throughout Algorithm 3 (cf. Lemma 10, Appendix F.2), and thus the condition number for AGD is $\frac{\sigma_{\max}(Q_\lambda)}{\sigma_{\min}(Q_\lambda)} \le \frac{9/c_1}{\rho_1 - \rho_2} \tilde{\kappa}'$, where $\tilde{\kappa}' := \frac{\max(\sigma_{\max}(\Sigma_{xx}), \sigma_{\max}(\Sigma_{yy}))}{\min(\sigma_{\min}(\Sigma_{xx}), \sigma_{\min}(\Sigma_{yy}))}$. For SVRG/ASVRG, the relevant condition number depends on the gradient Lipschitz constant of the individual components. We show in Appendix H (Lemma 12) that the relevant condition number is at most $\frac{9/c_1}{\rho_1 - \rho_2} \tilde{\kappa}$, where $\tilde{\kappa} := \frac{\max_i \max(\|x_i\|^2, \|y_i\|^2)}{\min(\sigma_{\min}(\Sigma_{xx}), \sigma_{\min}(\Sigma_{yy}))}$. An interesting issue for SVRG/ASVRG is that, depending on the value of $\lambda$, the individual components $h_t^i(u, v)$ may be nonconvex. If $\lambda \ge 1$, each component is still guaranteed to be convex; otherwise, some components might be nonconvex, with the overall average $\frac{1}{N} \sum_{i=1}^N h_t^i$ being convex. In the latter case, we use the modified analysis of SVRG [16, Appendix B] for its time complexity.
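The two spectral bounds on $Q_\lambda$ are easy to check numerically (a quick NumPy verification on random data; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
N, dx, dy = 500, 4, 3
X = rng.standard_normal((dx, N))
Y = rng.standard_normal((dy, N))
Sxx = X @ X.T / N + 1e-2 * np.eye(dx)
Syy = Y @ Y.T / N + 1e-2 * np.eye(dy)
Sxy = X @ Y.T / N

def inv_sqrt(S):
    # Inverse matrix square root via eigendecomposition (S is PD).
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

rho1 = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy),
                     compute_uv=False)[0]
lam = rho1 + 0.1                       # any shift with lam > rho_1
Q = np.block([[lam * Sxx, -Sxy], [-Sxy.T, lam * Syy]])
evals = np.linalg.eigvalsh(Q)          # Q is symmetric positive definite

# Bounds from the factorization (13) of Q_lam.
upper = (lam + rho1) * max(np.linalg.eigvalsh(Sxx)[-1],
                           np.linalg.eigvalsh(Syy)[-1])
lower = (lam - rho1) * min(np.linalg.eigvalsh(Sxx)[0],
                           np.linalg.eigvalsh(Syy)[0])
```

Both inequalities follow from sandwiching the middle factor of (13) between the extreme eigenvalues $\lambda \pm \rho_1$, so they hold for any valid shift, as the check confirms.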
We use warm-start in SI as in ALS, and the initial suboptimality for each subproblem can be bounded similarly. The total time complexities of our SI meta-algorithm are given in Table 1. Note that κ̃ (or κ̃′) and 1/(ρ1 − ρ2) are multiplied together, giving the effective condition number. When using SVRG as the least squares solver, we obtain the total time complexity of Õ(d(N + κ̃ · 1/(ρ1 − ρ2)) · log²(1/η)) if all components are convex, and Õ(d(N + (κ̃ · 1/(ρ1 − ρ2))²) · log²(1/η)) otherwise. When using ASVRG, we have Õ(d(N + √N · √(κ̃ · 1/(ρ1 − ρ2))) · log²(1/η)) if all components are convex, and Õ(d(N + N^(3/4) · √(κ̃ · 1/(ρ1 − ρ2))) · log²(1/η)) otherwise. Here Õ(·) hides poly-logarithmic dependences on 1/Δ. It is remarkable that the SI meta-algorithm is able to separate the dependence on the dataset size N from the other parameters in the time complexities.

Parallel work   In a parallel work [6], the authors independently proposed a similar ALS algorithm⁷, and they solve the least squares problems using AGD. The time complexity of their algorithm for extracting the first canonical correlation is Õ(dN√κ′ · ρ1²/(ρ1² − ρ2²) · log(1/η)), which has linear dependence on ρ1²/(ρ1² − ρ2²) (so their algorithm is linearly convergent, but our complexity for ALS+AGD has quadratic dependence on this factor), but typically worse dependence on N and κ′ (see remarks in Section 3.1.1). Moreover, our SI algorithm tends to significantly outperform ALS theoretically and empirically. It is future work to remove the extra log(1/η) dependence in our analysis.

⁷ Our arXiv preprint for the ALS meta-algorithm was posted before their paper was accepted by ICML 2016.

[Figure 1 here: suboptimality vs. # passes for S-AppGrad, AppGrad, CCALin, ALS-VR, ALS-AVR, SI-VR, and SI-AVR; one row of panels per dataset (Mediamill, JW11, MNIST) and one column per regularization level γx = γy ∈ {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻²}. Per-panel values:
Mediamill: (γ = 10⁻⁵) κ′ = 53340, δ = 5.345; (10⁻⁴) κ′ = 5335, δ = 4.924; (10⁻³) κ′ = 534.4, δ = 4.256; (10⁻²) κ′ = 54.34, δ = 2.548.
JW11: (10⁻⁵) κ′ = 2699000, δ = 11.22; (10⁻⁴) κ′ = 332800, δ = 11.10; (10⁻³) κ′ = 34070, δ = 10.58; (10⁻²) κ′ = 3416, δ = 9.082.
MNIST: (10⁻⁵) κ′ = 2235000, δ = 12.82; (10⁻⁴) κ′ = 223500, δ = 12.75; (10⁻³) κ′ = 22350, δ = 12.30; (10⁻²) κ′ = 2236, δ = 9.874.]

Figure 1: Comparison of suboptimality vs. # passes for different algorithms.
For each dataset and regularization parameters (γx, γy), we give κ′ = max(σmax(Σxx)/σmin(Σxx), σmax(Σyy)/σmin(Σyy)) and δ = ρ1²/(ρ1² − ρ2²).

Extension to multi-dimensional projections   To extend our algorithms to L-dimensional projections, we can extract the dimensions sequentially and remove the explained correlation from Σxy each time we extract a new dimension [18]. For the ALS meta-algorithm, a cleaner approach is to extract the L dimensions simultaneously using (inexact) orthogonal iterations [8], in which case the subproblems become multi-dimensional regressions and our normalization steps are of the form Ut ← Ũt(Ũt⊤Σxx Ũt)^(−1/2) (the same normalization is used by [3, 4]). Such normalization involves the eigenvalue decomposition of an L × L matrix and can be solved exactly, since we typically look for low-dimensional projections. Our analysis for L = 1 can be extended to this scenario, and the convergence rate of ALS will depend on the gap between ρL and ρL+1.

4 Experiments
We demonstrate the proposed algorithms, namely ALS-VR, ALS-AVR, SI-VR, and SI-AVR, abbreviated as "meta-algorithm – least squares solver" (VR for SVRG, and AVR for ASVRG), on three real-world datasets: Mediamill [19] (N = 3 × 10⁴), JW11 [20] (N = 3 × 10⁴), and MNIST [21] (N = 6 × 10⁴). We compare our algorithms with batch AppGrad and its stochastic version s-AppGrad [3], as well as the CCALin algorithm from the parallel work [6]. For each algorithm, we compare the canonical correlation estimated by the iterates at different numbers of passes over the data with that of the exact solution by SVD.
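For reference, the "exact solution by SVD" used as ground truth can be computed in closed form from the definitions above: take the top singular pair (φ, ψ) of T = Σxx^(−1/2) Σxy Σyy^(−1/2) and recover u∗ = Σxx^(−1/2)φ, v∗ = Σyy^(−1/2)ψ. A minimal NumPy sketch (the synthetic data and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
dx, dy, N = 6, 5, 500

# Synthetic paired views sharing a latent signal z (illustrative only).
z = rng.standard_normal(N)
X = np.outer(rng.standard_normal(dx), z) + rng.standard_normal((dx, N))
Y = np.outer(rng.standard_normal(dy), z) + rng.standard_normal((dy, N))
gx = gy = 1e-3
Sxx = X @ X.T / N + gx * np.eye(dx)
Syy = Y @ Y.T / N + gy * np.eye(dy)
Sxy = X @ Y.T / N

def psd_inv_sqrt(A):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

# Closed-form CCA: top singular pair (phi, psi) of T, mapped back by
# u* = Sxx^{-1/2} phi and v* = Syy^{-1/2} psi.
T = psd_inv_sqrt(Sxx) @ Sxy @ psd_inv_sqrt(Syy)
Phi, rho, PsiT = np.linalg.svd(T)
rho1 = rho[0]  # the largest canonical correlation
u_star = psd_inv_sqrt(Sxx) @ Phi[:, 0]
v_star = psd_inv_sqrt(Syy) @ PsiT[0, :]

# (u*, v*) is feasible for (1) and attains objective value rho1.
assert np.isclose(u_star @ Sxx @ u_star, 1.0)
assert np.isclose(v_star @ Syy @ v_star, 1.0)
assert np.isclose(u_star @ Sxy @ v_star, rho1)
```

This is the baseline against which the suboptimality curves in Figure 1 are measured; it is exact but costs a dense SVD, which is what the stochastic algorithms avoid.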
For each dataset, we vary the regularization parameters γx = γy over {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻²} to vary the least squares condition numbers; larger regularization leads to better conditioning. We plot the suboptimality in objective vs. # passes for each algorithm in Figure 1. Experimental details (e.g., SVRG parameters) are given in Appendix I.
We make the following observations from the results. First, the proposed stochastic algorithms significantly outperform the batch gradient based methods AppGrad/CCALin. This is because the least squares condition numbers for these datasets are large, and SVRG enables us to decouple the dependences on the dataset size N and the condition number κ in the time complexity. Second, SI-VR converges faster than ALS-VR, as it further decouples the dependence on N and the singular value gap of T. Third, inexact normalizations keep the s-AppGrad algorithm from converging to an accurate solution. Finally, ASVRG improves over SVRG when the condition number is large.

Acknowledgments
Research partially supported by NSF BIGDATA grant 1546500.

References

[1] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
[2] H. D. Vinod. Canonical ridge and econometrics of joint production. J. Econometrics, 1976.
[3] Z. Ma, Y. Lu, and D. Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In ICML, 2015.
[4] W. Wang, R. Arora, N. Srebro, and K. Livescu. Stochastic optimization for deep CCA via nonlinear orthogonal iterations. In ALLERTON, 2015.
[5] B. Xie, Y. Liang, and L. Song. Scale up nonlinear component analysis with doubly stochastic gradients. In NIPS, 2015.
[6] R. Ge, C. Jin, S. Kakade, P. Netrapalli, and A. Sidford. Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. arXiv, April 2016.
[7] G. Golub and H. Zha.
Linear Algebra for Signal Processing, chapter The Canonical Correlations of Matrix Pairs and their Numerical Computation, pages 27–49. 1995.
[8] G. Golub and C. van Loan. Matrix Computations. Third edition, 1996.
[9] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
[10] Y. Lu and D. Foster. Large scale canonical correlation analysis with iterative least squares. In NIPS, 2014.
[11] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Technical Report HAL 00860051, École Normale Supérieure, 2013.
[12] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 2013.
[13] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: Approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, 2015.
[14] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In NIPS, 2015.
[15] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
[16] D. Garber and E. Hazan. Fast and simple PCA via convex optimization. arXiv, 2015.
[17] C. Jin, S. Kakade, C. Musco, P. Netrapalli, and A. Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. 2015.
[18] D. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 2009.
[19] C. Snoek, M. Worring, J. van Gemert, J. Geusebroek, and A. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In MULTIMEDIA, 2006.
[20] J. Westbury. X-Ray Microbeam Speech Production Database User's Handbook, 1994.
[21] Y. LeCun, L.
Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
[22] M. Warmuth and D. Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 2008.
[23] R. Arora, A. Cotter, K. Livescu, and N. Srebro. Stochastic optimization for PCA and PLS. In ALLERTON, 2012.
[24] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In NIPS, 2013.
[25] O. Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In ICML, 2015.
[26] F. Yger, M. Berar, G. Gasso, and A. Rakotomamonjy. Adaptive canonical correlation analysis based on matrix manifolds. In ICML, 2012.