{"title": "The Scaling Limit of High-Dimensional Online Independent Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 6638, "page_last": 6647, "abstract": "We analyze the dynamics of an online algorithm for independent component analysis in the high-dimensional scaling limit. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measure of the target feature vector and the estimates provided by the algorithm will converge weakly to a deterministic measured-valued process that can be characterized as the unique solution of a nonlinear PDE. Numerical solutions of this PDE, which involves two spatial variables and one time variable, can be efficiently obtained. These solutions provide detailed information about the performance of the ICA algorithm, as many practical performance metrics are functionals of the joint empirical measures. Numerical simulations show that our asymptotic analysis is accurate even for moderate dimensions. In addition to providing a tool for understanding the performance of the algorithm, our PDE analysis also provides useful insight. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm will be asymptotically \u201cdecoupled\u201d, with each coordinate independently solving a 1-D effective minimization problem via stochastic gradient descent. Exploiting this insight to design new algorithms for achieving optimal trade-offs between computational and statistical efficiency may prove an interesting line of future research.", "full_text": "The Scaling Limit of High-Dimensional Online\n\nIndependent Component Analysis\n\nChuang Wang and Yue M. Lu\n\nJohn A. 
Paulson School of Engineering and Applied Sciences
Harvard University
33 Oxford Street, Cambridge, MA 02138, USA
{chuangwang,yuelu}@seas.harvard.edu

Abstract

We analyze the dynamics of an online algorithm for independent component analysis in the high-dimensional scaling limit. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measure of the target feature vector and the estimates provided by the algorithm will converge weakly to a deterministic measure-valued process that can be characterized as the unique solution of a nonlinear PDE. Numerical solutions of this PDE, which involves two spatial variables and one time variable, can be efficiently obtained. These solutions provide detailed information about the performance of the ICA algorithm, as many practical performance metrics are functionals of the joint empirical measures. Numerical simulations show that our asymptotic analysis is accurate even for moderate dimensions. In addition to providing a tool for understanding the performance of the algorithm, our PDE analysis also provides useful insight. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm will be asymptotically "decoupled", with each coordinate independently solving a 1-D effective minimization problem via stochastic gradient descent. Exploiting this insight to design new algorithms for achieving optimal trade-offs between computational and statistical efficiency may prove an interesting line of future research.

1 Introduction

Online learning methods based on stochastic gradient descent are widely used in many learning and signal processing problems.
Examples include the classical least mean squares (LMS) algorithm [1] in adaptive filtering, principal component analysis [2, 3], independent component analysis (ICA) [4], and the training of shallow or deep artificial neural networks [5-7]. Analyzing the convergence rate of stochastic gradient descent has long been the subject of a vast literature (see, e.g., [8-11]). Unlike existing work that analyzes the behavior of these algorithms in finite dimensions, we present in this paper a framework for studying the exact dynamics of stochastic gradient algorithms in the high-dimensional limit, using online ICA as a concrete example. Instead of minimizing a generic function as considered in the optimization literature, the stochastic algorithm we analyze here is solving an estimation problem. The extra assumptions on the ground truth (e.g., the feature vector) and the generative models for the observations allow us to obtain the exact asymptotic dynamics of the algorithm.

As the main result of this work, we show that, as the ambient dimension n → ∞ and with proper time-scaling, the time-varying joint empirical measure of the true underlying independent component ξ and its estimate x converges weakly to the unique solution of a nonlinear partial differential equation (PDE) [see (6)]. Since many performance metrics, such as the correlation between ξ and x and the support recovery rate, are functionals of the joint empirical measure, knowledge about the asymptotics of the latter allows us to easily compute the asymptotic limits of various performance metrics of the algorithm.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

This work is an extension of a recent analysis of the dynamics of online sparse PCA [12] to more general settings.
The idea of studying the scaling limits of online learning algorithms first appeared in a series of works, mostly from the statistical physics community [3, 5, 13-16], in the 1990s. Similar to our setting, those early papers studied the dynamics of various online learning algorithms in high dimensions. In particular, they showed that the mean-squared error (MSE) of the estimation, together with a few other "order parameters", can be characterized as the solution of a deterministic system of coupled ordinary differential equations (ODEs) in the large-system limit. One limitation of such ODE-level analysis is that it cannot provide information about the distributions of the estimates. The latter are often needed when one wants to understand more general performance metrics beyond the MSE. Another limitation is that the ODE analysis cannot handle cases where the algorithms have non-quadratic regularization terms (e.g., the incorporation of ℓ1 norms to promote sparsity). In this paper, we show that both limitations can be eliminated by using our PDE-level analysis, which tracks the asymptotic evolution of the probability distributions of the estimates given by the algorithm. In a recent paper [10], the dynamics of an ICA algorithm were studied via a diffusion approximation. As an important distinction, the analysis in [10] keeps the ambient dimension n fixed and studies the scaling limit of the algorithm as the step size tends to 0. The resulting PDEs involve O(n) spatial variables. In contrast, our analysis studies the limit as the dimension n → ∞, with a constant step size. The resulting PDEs involve only 2 spatial variables.
This low-dimensional characterization makes our limiting results more practical to use, especially when the dimension is large.

The basic idea underlying our analysis traces its roots to the early work of McKean [17, 18], who studied the statistical mechanics of Markovian-type mean-field interacting particles. The mathematical foundation of this line of research was further established in the 1980s (see, e.g., [19, 20]). This theoretical tool has also been used in the analysis of high-dimensional MCMC algorithms [21]. In our work, we study algorithms through the lens of high-dimensional stochastic processes. Interestingly, the analysis does not explicitly depend on whether the underlying optimization problem is convex or nonconvex. This feature makes the presented techniques a potentially very useful tool for understanding the effectiveness of low-complexity iterative algorithms for solving high-dimensional nonconvex estimation problems, a line of research that has recently attracted much attention (see, e.g., [22-25]).

The rest of the paper is organized as follows. We first describe in Section 2 the observation model and the online ICA algorithm studied in this work. The main convergence results are given in Section 3, where we show that the time-varying joint empirical measure of the target independent component and its estimates given by the algorithm can be characterized, in the high-dimensional limit, by the solution of a deterministic PDE. Due to space constraints, we only provide in the appendix a formal derivation leading to the PDE, and leave the rigorous proof of the convergence to a follow-up paper. Finally, in Section 4 we present some insights obtained from our asymptotic analysis.
In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm will be asymptotically "decoupled", with each coordinate independently solving a 1-D effective minimization problem via stochastic gradient descent.

Notations and Conventions: Throughout this paper, we use boldfaced lowercase letters, such as ξ and x_k, to represent n-dimensional vectors. The subscript k in x_k denotes the discrete-time iteration step. The ith components of the vectors ξ and x_k are written as ξ_i and x_{k,i}, respectively.

2 Data model and online ICA

We consider a generative model where a stream of sample vectors y_k ∈ R^n, k = 1, 2, ..., is generated according to

    y_k = (1/√n) ξ c_k + a_k,   (1)

where ξ ∈ R^n is a unique feature vector we want to recover. (For simplicity, we consider the case of recovering a single feature vector, but our analysis technique can be generalized to study cases involving a finite number of feature vectors.) Here c_k ∈ R is an i.i.d. random variable drawn from an unknown non-Gaussian distribution P_c with zero mean and unit variance, and a_k ~ N(0, I − (1/n) ξξ^T) models background noise. We use the normalization ‖ξ‖² = n so that in the large-n limit, all elements ξ_i of the vector are O(1) quantities. The observation model (1) is equivalent to the standard sphered data model y_k = A [c_k; s_k], where A ∈ R^{n×n} is an orthonormal matrix whose first column is ξ/√n and s_k is an i.i.d. (n − 1)-dimensional standard Gaussian random vector.

To establish the large-n limit, we shall assume that the empirical measure of ξ, defined by μ(ξ) = (1/n) Σ_{i=1}^n δ(ξ − ξ_i), converges weakly to a deterministic measure μ*(ξ) with finite moments. Note that this assumption can be satisfied in a stochastic setting, where each element of ξ is an i.i.d. random variable drawn from μ*(ξ), or in a deterministic setting [e.g., ξ_i = √2 (i mod 2), in which case μ*(ξ) = ½ δ(ξ) + ½ δ(ξ − √2)].

We use an online learning algorithm to extract the non-Gaussian component ξ from the data stream {y_k}_{k≥1}. Let x_k be the estimate of ξ at step k. Starting from an initial estimate x_0, the algorithm updates x_k by

    x̃_k = x_k + (τ_k/√n) f((1/√n) y_k^T x_k) y_k − (τ_k/n) φ(x_k)
    x_{k+1} = (√n/‖x̃_k‖) x̃_k,   (2)

where f(·) is a given twice differentiable function and φ(·) is an element-wise nonlinear mapping introduced to enforce prior information about ξ, e.g., sparsity. The scaling factor 1/√n in the above equations makes sure that each component x_{k,i} of the estimate is of size O(1) in the large-n limit. The above online learning scheme can be viewed as a projected stochastic gradient algorithm for solving the optimization problem

    min_{‖x‖²=n}  −(1/K) Σ_{k=1}^K F((1/√n) y_k^T x) + (1/n) Σ_{i=1}^n Φ(x_i),   (3)

where F(x) = ∫ f(x) dx and

    Φ(x) = ∫ φ(x) dx   (4)

is a regularization function. In (2), we update x_k using an instantaneous noisy estimate, (1/√n) f((1/√n) y_k^T x_k) y_k, in place of the true gradient (1/K) Σ_{k=1}^K (1/√n) f((1/√n) y_k^T x_k) y_k, once a new sample y_k is received.

In practice, one can use f(x) = ±x³ or f(x) = ±tanh(x) to extract symmetric non-Gaussian signals (for which E c_k³ = 0 and E c_k⁴ ≠ 3) and use f(x) = ±x² to extract asymmetric non-Gaussian signals.
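For concreteness, one iteration of the sampling model (1) and the update rule (2) can be sketched in a few lines of pure Python. The cubic nonlinearity f(u) = u³ and the sparsity-promoting choice φ(u) = β·sgn(u) are taken from the text; the uniform choice for P_c and all parameter values below are illustrative assumptions, not taken from the paper.

```python
import math
import random

def ica_step(x, xi, tau=0.05, beta=0.1):
    # One iteration of the online update (2) with f(u) = u**3 and
    # phi(u) = beta * sign(u) (the L1-type regularizer).  Pure-Python sketch.
    n = len(x)
    rt_n = math.sqrt(n)
    # Model (1): y = xi * c / sqrt(n) + a, with a ~ N(0, I - xi xi^T / n),
    # obtained by projecting an i.i.d. Gaussian vector (uses ||xi||^2 = n).
    c = random.uniform(-math.sqrt(3.0), math.sqrt(3.0))  # illustrative P_c: zero mean, unit variance
    g = [random.gauss(0.0, 1.0) for _ in range(n)]
    p = sum(gi * xii for gi, xii in zip(g, xi)) / n
    y = [xii * c / rt_n + gi - p * xii for gi, xii in zip(g, xi)]
    # Stochastic-gradient step of (2).
    u = sum(yi * xj for yi, xj in zip(y, x)) / rt_n      # (1/sqrt(n)) y^T x
    fu = u ** 3
    xt = [xj + (tau / rt_n) * fu * yi - (tau / n) * beta * math.copysign(1.0, xj)
          for xj, yi in zip(x, y)]
    # Projection back onto the sphere ||x|| = sqrt(n).
    nrm = math.sqrt(sum(v * v for v in xt))
    return [rt_n * v / nrm for v in xt]
```

Iterating this map over the data stream {y_k} and tracking (1/n) ξ^T x_k gives exactly the quantity Q_t^n studied in Section 3.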
The algorithm in (2) with f(x) = x³ can also be regarded as implementing a low-rank tensor decomposition related to the empirical kurtosis tensor of y_k [10, 11]. For the nonlinear mapping φ(x), the choice φ(x) = βx for some β > 0 corresponds to using an L2 norm in the regularization term Φ(x). If the feature vector is known to be sparse, we can set φ(x) = β sgn(x), which is equivalent to adding an L1 regularization term.

3 Main convergence result

We provide an exact characterization of the dynamics of the online learning algorithm (2) when the ambient dimension n goes to infinity. First, we define the joint empirical measure of the feature vector ξ and its estimate x_k as

    μ_t^n(ξ, x) = (1/n) Σ_{i=1}^n δ(ξ − ξ_i, x − x_{k,i}),   (5)

with t defined by k = ⌊tn⌋. Here we rescale (i.e., "accelerate") the time by a factor of n.

The joint empirical measure defined above carries a lot of information about the performance of the algorithm. For example, as both ξ and x_k have the same norm √n by definition, the normalized correlation between ξ and x_k, defined by

    Q_t^n = (1/n) ξ^T x_k,

can be computed as Q_t^n = E_{μ_t^n}[ξx], i.e., the expectation of ξx taken with respect to the empirical measure. More generally, any separable performance metric H_t^n = (1/n) Σ_{i=1}^n h(ξ_i, x_{k,i}), with some function h(·,·), can be expressed as an expectation with respect to the empirical measure μ_t^n, i.e., H_t^n = E_{μ_t^n} h(ξ, x).

Directly computing Q_t^n = E_{μ_t^n}[ξx] is challenging, as μ_t^n is a random probability measure. We bypass this difficulty by investigating the limiting behavior of the joint empirical measure μ_t^n defined in (5). Our main contribution is to show that, as n → ∞, the sequence of random probability measures {μ_t^n}_n converges weakly to a deterministic measure μ_t. Note that the limiting value of Q_t^n can then be computed from the limiting measure μ_t via the identity lim_{n→∞} Q_t^n = E_{μ_t}[ξx].

Let P_t(ξ, x) be the density function of the limiting measure μ_t(ξ, x) at time t. We show that it is characterized as the unique solution of the following nonlinear PDE:

    ∂/∂t P_t(ξ, x) = −∂/∂x [Γ(x, ξ, Q_t, R_t) P_t(ξ, x)] + ½ Λ(Q_t) ∂²/∂x² P_t(ξ, x),   (6)

where

    Q_t = ∬_{R²} ξ x P_t(ξ, x) dx dξ   (7)
    R_t = ∬_{R²} x φ(x) P_t(ξ, x) dx dξ,   (8)

and the two functions Λ(Q) and Γ(x, ξ, Q, R) are defined as

    Λ(Q) = τ² ⟨f²(cQ + e√(1 − Q²))⟩   (9)
    Γ(x, ξ, Q, R) = x [Q G(Q) + τR − ½ Λ(Q)] − ξ G(Q) − τ φ(x),   (10)

with

    G(Q) = −τ ⟨f(cQ + e√(1 − Q²)) c⟩ + τ Q ⟨f′(cQ + e√(1 − Q²))⟩.   (11)

In the above equations, e and c denote two independent random variables, with e ~ N(0, 1) and c ~ P_c, the non-Gaussian distribution of c_k introduced in (1); the notation ⟨·⟩ denotes the expectation over e and c; and f(·) and φ(·) are the two functions used in the online learning algorithm (2). When φ(x) = 0 (and therefore R_t = 0), we can derive a simple ODE for Q_t from (6) and (7):

    d/dt Q_t = (Q_t² − 1) G(Q_t) − ½ Q_t Λ(Q_t).

Example 1 As a concrete example, we consider the case when c_k is drawn from a symmetric non-Gaussian distribution. Due to symmetry, E c_k³ = 0. Write E c_k⁴ = m₄ and E c_k⁶ = m₆. We use f(x) = x³ in (2) to detect the feature vector ξ. Substituting this specific f(x) into (9) and (11), we obtain

    G(Q) = τ Q³ (m₄ − 3)   (12)
    Λ(Q) = τ² [15 + 15 Q⁴ (1 − Q²)(m₄ − 3) + Q⁶ (m₆ − 15)],   (13)

and Γ(x, ξ, Q, R) can be computed by substituting (12) and (13) into (10). Moreover, for the case φ(x) = 0, we derive a simple ODE for q_t = Q_t²:

    dq_t/dt = −2 τ_t q_t² (1 − q_t)(m₄ − 3) − τ_t² q_t [15 q_t² (1 − q_t)(m₄ − 3) + q_t³ (m₆ − 15) + 15].   (14)

Numerical verifications of the ODE results are shown in Figure 1(a). In our experiment, the ambient dimension is set to n = 5000 and we plot the averaged results as well as error bars (corresponding to one standard deviation) over 10 independent trials. Two different initial values of q₀ = Q₀² are used. In both cases, the asymptotic theoretical predictions match the numerical results very well.

The ODE in (14) can be solved analytically. Next we briefly discuss its stability. The right-hand side of (14) is plotted in Figure 1(b) as a function of q_t.

Figure 1: (a) Comparison between the analytical prediction given by the ODE in (14) and numerical simulations of the online ICA algorithm. We consider two different initial values for the algorithm. The top curve, which starts from a better initial guess, converges to an informative estimate, whereas the bottom one, with a worse initial guess, converges to a non-informative solution. (b) The stability of the ODE in (14). We draw g(q) = (1/τ) dq/dt for different values of τ = 0.02, 0.04, 0.06, 0.08, from top to bottom.

It is clear that the ODE (14) always admits a solution q_t = 0, which corresponds to a trivial, non-informative solution. Moreover, this trivial solution is always a stable fixed point. When the step size τ > τ_c for some constant τ_c, q_t = 0 is also the unique stable fixed point. When τ < τ_c, however, two additional solutions of the ODE emerge. One is a stable fixed point denoted by q*_s and the other is an unstable fixed point denoted by q*_u, with q*_u < q*_s. Thus, in order to reach an informative solution, one must initialize the algorithm with Q₀² > q*_u. This insight agrees with a previous stability analysis done in [26], where the authors investigated the dynamics near q_t = 0 via a small-q_t expansion.

Example 2 In this experiment, we verify the accuracy of the asymptotic predictions given by the PDE (6). The settings are similar to those in Example 1. In addition, we assume that the feature vector ξ is sparse, consisting of ρn nonzero elements, each of which is equal to 1/√ρ. Figure 2 shows the asymptotic conditional density P_t(x|ξ) for ξ = 0 and ξ = 1/√ρ at two different times. These theoretical predictions are obtained by solving the PDE (6) numerically. Also shown in the figure are the empirical conditional densities associated with one realization of the ICA algorithm. Again, we observe that the theoretical predictions and numerical results are in excellent agreement.

To demonstrate the usefulness of the PDE analysis in providing detailed information about the performance of the algorithm, we show in Figure 3 the performance of sparse support recovery using a simple hard-thresholding scheme on the estimates provided by the algorithm. By changing the threshold values, one can trade off the true positive rate against the false positive rate.
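The bistability just described can be reproduced with a minimal forward-Euler integration of the ODE (14). The moments used below correspond to c uniform on [−√3, √3] (so m₄ = 1.8 and m₆ = 27/7), an illustrative sub-Gaussian choice of P_c; the step size τ is also an illustrative value, and τ is held constant rather than time-varying.

```python
def q_ode_rhs(q, tau, m4, m6):
    # Right-hand side of the ODE (14) for q_t = Q_t^2.
    return (-2.0 * tau * q * q * (1.0 - q) * (m4 - 3.0)
            - tau * tau * q * (15.0 * q * q * (1.0 - q) * (m4 - 3.0)
                               + q ** 3 * (m6 - 15.0) + 15.0))

def solve_q(q0, tau, m4, m6, t_max=200.0, dt=0.01):
    # Forward-Euler integration of (14); q = Q^2 is clamped to its valid range [0, 1].
    q = q0
    for _ in range(int(t_max / dt)):
        q = min(max(q + dt * q_ode_rhs(q, tau, m4, m6), 0.0), 1.0)
    return q
```

With these moments and τ = 0.02, a trajectory started above the unstable fixed point q*_u climbs to the informative stable branch, while one started below it decays to the trivial solution q = 0, matching the behavior shown in Figure 1(a).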
As we can see from the figure, this precise trade-off can be accurately predicted by our PDE analysis.

Figure 2: (a) A demonstration of the accuracy of our PDE analysis. See the discussion in Example 2 for details. (b) Effective 1-D cost functions.

Figure 3: Trade-offs between the true positive and false positive rates in sparse support recovery. In our experiment, n = 10⁴, and the sparsity level is set to ρ = 0.3. The theoretical results obtained by our PDE analysis can accurately predict the actual performance at any run-time of the algorithm.

4 Insights given by the PDE analysis

In this section, we present some insights that can be gained from our high-dimensional analysis. To simplify the PDE in (6), we can assume that the two functions Q_t and R_t in (7) and (8) are given to us in an oracle way. Under this assumption, the PDE (6) describes the limiting empirical measure of the following stochastic process:

    z_{k+1,i} = z_{k,i} + (1/n) Γ(z_{k,i}, ξ_i, Q_{k/n}, R_{k/n}) + √(Λ(Q_{k/n})/n) w_{k,i},   i = 1, 2, ..., n,   (15)

where w_{k,i} is a sequence of independent standard Gaussian random variables. Unlike the original online learning update equation (2), where different coordinates of x_k are coupled, the above process is uncoupled. Each component z_{k,i} for i = 1, 2, ..., n evolves independently when conditioned on Q_t and R_t. The continuous-time limit of (15) is described by a stochastic differential equation (SDE)

    dZ_t = Γ(Z_t, ξ, Q_t, R_t) dt + √(Λ(Q_t)) dB_t,

where B_t is the standard Brownian motion.

We next have a closer look at equation (15). Given a scalar ξ, Q_t and R_t, we can define a time-varying 1-D regularized quadratic optimization problem min_{x∈R} E_t(x, ξ) with the effective potential

    E_t(x, ξ) = ½ d_t (x − b_t ξ)² + τ Φ(x),   (16)

where d_t = ½ Λ(Q_t) − Q_t G(Q_t) − τ R_t, b_t = −G(Q_t)/d_t, and Φ(x) is the regularization term defined in (4). Then the stochastic process (15) can be viewed as stochastic gradient descent for solving this 1-D problem with a step size equal to 1/n. One can verify that the exact gradient of (16) is −Γ(x, ξ, Q_t, R_t). The third term, √(Λ(Q_k)/n) w_k, in (15) adds stochastic noise to the true gradient. Interestingly, although the original optimization problem (3) is non-convex, its 1-D effective optimization problem is always convex for convex regularizers Φ(x) (e.g., Φ(x) = β|x|). This provides an intuitive explanation for the practical success of online ICA.

To visualize this 1-D effective optimization problem, we plot in Figure 2(b) the effective potential E_t(x, ξ) at t = 0 and t = 100, respectively. From Figure 2, we can see that the L1 norm always introduces a bias in the estimation for all nonzero ξ_i, as the minimum point of the effective 1-D cost function is always shifted towards the origin. It is hoped that the insights gained from the 1-D effective optimization problem can guide the design of a better regularization function Φ(x) that achieves smaller estimation errors without sacrificing the convergence speed. This may prove an interesting line of future work.

This uncoupling phenomenon is a typical consequence of mean-field dynamics, e.g., the Sherrington-Kirkpatrick model [27] in statistical physics.
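The decoupling can be made concrete in code: conditioned on the scalars (Q_t, R_t), every coordinate of (15) runs the same scalar recursion independently. The sketch below uses the Example 1 closed forms (12)-(13) (so f(u) = u³), takes φ = 0 (hence R = 0), and treats Q as an oracle-supplied constant; the values of τ, m₄, m₆ and Q used at call time are illustrative assumptions.

```python
import math
import random

def G(Q, tau, m4):
    # Closed form (12) for f(u) = u**3.
    return tau * Q ** 3 * (m4 - 3.0)

def Lam(Q, tau, m4, m6):
    # Closed form (13).
    return tau * tau * (15.0 + 15.0 * Q ** 4 * (1.0 - Q ** 2) * (m4 - 3.0)
                        + Q ** 6 * (m6 - 15.0))

def decoupled_step(z, xi_i, Q, tau, m4, m6, n):
    # One step of the uncoupled process (15) with phi = 0 (hence R = 0):
    # drift Gamma(z, xi_i, Q, 0) from (10), Gaussian noise of variance Lambda(Q)/n.
    g = G(Q, tau, m4)
    lam = Lam(Q, tau, m4, m6)
    drift = z * (Q * g - 0.5 * lam) - xi_i * g
    return z + drift / n + math.sqrt(lam / n) * random.gauss(0.0, 1.0)
```

Run long enough at fixed Q, each coordinate fluctuates around the minimizer of its own 1-D effective potential (16), with no reference to any other coordinate; the coupling enters only through the shared scalars Q and R.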
Similar phenomena have been observed or proved for other high-dimensional algorithms, especially those related to approximate message passing (AMP) [28-30]. However, for those algorithms, which use batch updating rules with an Onsager reaction term, the limiting densities of the iterates are Gaussian, and the evolution of such densities can therefore be characterized by tracking a few scalar parameters in discrete time. In our case, the limiting densities are typically non-Gaussian and cannot be parametrized by finitely many scalars. Thus the PDE limit (6) is required.

Appendix: A formal derivation of the PDE

In this appendix, we present a formal derivation of the PDE (6). We first note that (x_k, ξ_k)_k with ξ_k = ξ forms an exchangeable Markov chain on R^{2n}, driven by the random variable c_k ~ P_c and the Gaussian random vector a_k. The drift coefficient Γ(x, ξ, Q, R) and the diffusion coefficient Λ(Q) in the PDE (6) are determined, respectively, by the conditional mean and variance of the increment x_{k+1,i} − x_{k,i}, conditioned upon the previous state vectors x_k and ξ_k.

Let the increment of the gradient-descent step in the learning rule (2) be

    Δ̃_{k,i} = x̃_{k,i} − x_{k,i} = (τ_k/√n) f((1/√n) y_k^T x_k) y_{k,i} − (τ_k/n) φ(x_{k,i}),   (17)

where x̃_{k,i} is the ith component of the output x̃_k. Let E_k denote the conditional expectation with respect to c_k and a_k given x_k and ξ_k.

We first compute E_k[Δ̃_{k,i}] and E_k[Δ̃²_{k,i}]. From (1) and (17) we have

    E_k[Δ̃_{k,i}] = (τ_k/√n) E_k[f(Q_k^n c_k + ẽ_{k,i} + (1/√n) a_{k,i} x_{k,i})((1/√n) ξ_i c_k + a_{k,i})] − (τ_k/n) φ(x_{k,i}),

where Q_k^n = (1/n) ξ^T x_k and ẽ_{k,i} = (1/√n)(a_k^T x_k − a_{k,i} x_{k,i}). We use the Taylor expansion of f around Q_k^n c_k + ẽ_{k,i} up to the first order and get

    E_k[Δ̃_{k,i}] = (τ_k/√n) { E_k[f(Q_k^n c_k + ẽ_{k,i})((1/√n) ξ_i c_k + a_{k,i})]
                    + (1/√n) x_{k,i} E_k[f′(Q_k^n c_k + ẽ_{k,i})((1/√n) ξ_i c_k + a_{k,i}) a_{k,i}] }
                    − (τ_k/n) φ(x_{k,i}) + δ_{k,i},

where δ_{k,i} collects all higher-order terms. As n → ∞, the random variable Q_k^n converges to the deterministic quantity Q_k. Moreover, ẽ_{k,i} and a_{k,i} are both zero-mean Gaussian with the covariance matrix

    [ 1 − Q_k² + O(1/n)     −(1/√n) ξ_i Q_k ]
    [ −(1/√n) ξ_i Q_k       1 + O(1/n)     ].

We thus have

    E_k[f′(Q_k^n c_k + ẽ_{k,i})((1/√n) ξ_i c_k + a_{k,i}) a_{k,i}] = ⟨f′(Q_k c + √(1 − Q_k²) e)⟩ + o(1)

and

    E_k[f(Q_k^n c_k + ẽ_{k,i})((1/√n) ξ_i c_k + a_{k,i})]
        = ⟨f(Q_k c + √(1 − Q_k²) e − (ξ_i/√n) Q_k a)((1/√n) ξ_i c + a)⟩
        = (1/√n) ξ_i [⟨c f(Q_k c + √(1 − Q_k²) e)⟩ − Q_k ⟨f′(Q_k c + √(1 − Q_k²) e)⟩] + o(1/√n),

where in the last line we use the Taylor expansion again to expand f around Q_k c + √(1 − Q_k²) e, and the bracket ⟨·⟩ denotes the average over two independent random variables c ~ P_c and e ~ N(0, 1). Thus, we have

    E_k[Δ̃_{k,i}] = (1/n) [ −ξ_i G(Q_k) + τ_k x_{k,i} ⟨f′(Q_k c + √(1 − Q_k²) e)⟩ − τ_k φ(x_{k,i}) ] + o(1/n),

where the function G(Q) is defined in (11).

To compute the (conditional) variance, we have

    E_k[Δ̃²_{k,i}] = (τ_k²/n) E_k[f²(Q_k^n c_k + ẽ_{k,i})] + o(1/n) = (τ_k²/n) ⟨f²(Q_k c + √(1 − Q_k²) e)⟩ + o(1/n).

Next, we deal with the normalization step. Again, we use the Taylor expansion for the term

    x_{k+1} = √n (x_k + Δ̃_k) / ‖x_k + Δ̃_k‖

up to the first order, which yields

    x_{k+1} = x_k − (1/n) x_k (x_k^T Δ̃_k + ½ Δ̃_k^T Δ̃_k) + Δ̃_k + δ_k,

where δ_k collects all higher-order terms. Note that (1/n) x_k^T Δ̃_k ≈ (1/n) Σ_{i=1}^n x_{k,i} E_k[Δ̃_{k,i}] and (1/n) Δ̃_k^T Δ̃_k ≈ (1/n) Σ_{i=1}^n E_k[Δ̃²_{k,i}]. Since (1/n) x_k^T φ(x_k) = R_k^n → R_k, we have

    E_k[x_{k+1,i} − x_{k,i}] = (1/n) Γ(x_{k,i}, ξ_i, Q_k, R_k) + o(1/n).

Finally, the normalization step does not change the variance term, and thus

    E_k[(x_{k+1,i} − x_{k,i})²] = E_k[Δ̃²_{k,i}] + o(1/n) = (1/n) Λ(Q_k) + o(1/n).

The above computation of E_k[x_{k+1,i} − x_{k,i}] and E_k[(x_{k+1,i} − x_{k,i})²] connects the dynamics (2) to (15). In fact, both (2) and (15) have the same limiting empirical measure, described by (6).

A rigorous proof of our asymptotic result is built on the weak convergence approach for measure-valued processes. Details will be presented in an upcoming paper.
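The leading-order conditional mean computed above lends itself to a direct Monte-Carlo sanity check: sampling (ẽ, a) from a bivariate Gaussian with the covariance displayed above, the bracket ⟨cf⟩ − Q⟨f′⟩ evaluates to Q³(m₄ − 3) for f(u) = u³. The sketch below assumes c uniform on [−√3, √3] (zero mean, unit variance, m₄ = 1.8); this choice of P_c, and all parameter values, are illustrative. It checks only this one Taylor/Stein step, not the full PDE.

```python
import math
import random

def mc_drift_check(n=100, Q=0.9, xi_i=1.0, m4=1.8, trials=1000000, seed=1):
    # Monte-Carlo check of the leading-order conditional mean: for f(u) = u**3,
    #   E[ f(Q c + e~) (xi_i c / sqrt(n) + a) ]
    #     = (xi_i/sqrt(n)) [<c f> - Q <f'>] + o(1/sqrt(n))
    #     = (xi_i/sqrt(n)) Q**3 (m4 - 3) + o(1/sqrt(n)),
    # where (e~, a) are jointly Gaussian with Var(e~) = 1 - Q^2, Var(a) = 1,
    # Cov(e~, a) = -xi_i Q / sqrt(n), and c is uniform on [-sqrt(3), sqrt(3)].
    random.seed(seed)
    rt_n = math.sqrt(n)
    var_e = 1.0 - Q * Q
    cov = -xi_i * Q / rt_n
    acc = 0.0
    for _ in range(trials):
        c = random.uniform(-math.sqrt(3.0), math.sqrt(3.0))
        et = math.sqrt(var_e) * random.gauss(0.0, 1.0)            # e~
        # a given e~: mean (cov/var_e) * e~, variance 1 - cov^2/var_e
        a = (cov / var_e) * et + math.sqrt(1.0 - cov * cov / var_e) * random.gauss(0.0, 1.0)
        acc += (Q * c + et) ** 3 * (xi_i * c / rt_n + a)
    return acc / trials, xi_i / rt_n * Q ** 3 * (m4 - 3.0)
```

The empirical average and the closed-form prediction agree to within Monte-Carlo error, confirming that the correlation between ẽ and a contributes exactly the −Q⟨f′⟩ correction used in the drift computation.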
Here we only provide a sketch of the general proof strategy. First, we prove the tightness of the measure-valued stochastic process (μ_t^n)_{0≤t≤T} on D([0, T], M(R²)), where D denotes the space of càdlàg processes taking values in the space of probability measures. This implies that any sequence of measure-valued processes {(μ_t^n)_{0≤t≤T}}_n (indexed by n) must have a weakly converging subsequence. Second, we prove that any converging (sub)sequence must converge weakly to a solution of the weak form of the PDE (6). Third, we prove the uniqueness of the solution of the weak form of the PDE (6) by constructing a contraction mapping. Combining these three statements, we conclude that any sequence must converge to this unique solution.

Acknowledgments This work is supported by the US Army Research Office under contract W911NF-16-1-0265 and by the US National Science Foundation under grants CCF-1319140 and CCF-1718698.

References

[1] Simon Haykin and Bernard Widrow. Least-mean-square adaptive filters, volume 31. John Wiley & Sons, 2003.

[2] Erkki Oja and Juha Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl., 106(1):69-84, 1985.

[3] Michael Biehl and E. Schlösser. The dynamics of on-line principal component analysis. J. Phys. A. Math. Gen., 31(5):97-103, 1998.

[4] Aapo Hyvärinen and Erkki Oja. One-unit learning rules for independent component analysis. In Adv. Neural Inf. Process. Syst., pages 480-486, 1997.

[5] M. Biehl. An exactly solvable model of unsupervised learning. Europhys. Lett., 25(5):391-396, 1994.

[6] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes.
In International Conference on Machine Learning, pages 71\u201379,\n2013.\n\n[7] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436\u2013444, 2015.\n\n[8] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to Escape Saddle\n\nPoints Ef\ufb01ciently. arXiv:1703.00887, 2017.\n\n[9] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory Limited , Streaming PCA. In Adv.\n\nNeural Inf. Process. Syst., 2013.\n\n[10] Chris Junchi Li, Zhaoran Wang, and Han Liu. Online ICA: Understanding Global Dynamics of Nonconvex\n\nOptimization via Diffusion Processes. In Adv. Neural Inf. Process. Syst., pages 4961\u20134969, 2016.\n\n[11] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping From Saddle Points ? Online Stochastic\n\nGradient for Tensor Decomposition. In JMLR Work. Conf. Proc., volume 40, pages 1\u201346, 2015.\n\n[12] Chuang Wang and Yue M. Lu. Online Learning for Sparse PCA in High Dimensions: Exact Dynamics and\n\nPhase Transitions. In Inf. Theory Work. (ITW), 2016 IEEE, pages 186\u2013190, 2016.\n\n[13] David Saad and Sara A Solla. Exact Solution for On-Line Learning in Multilayer Neural Networks. Phys.\n\nRev. Lett., 74(21):4337\u20134340, 1995.\n\n[14] David Saad and Magnus Rattray. Globally optimal parameters for online learning in multilayer neural\n\nnetworks. Phys. Rev. Lett., 79(13):2578, 1997.\n\n[15] Magnus Rattray and Gleb Basalyga. Scaling laws and local minima in Hebbian ICA. In Adv. Neural Inf.\n\nProcess. Syst., pages 495\u2013502, 2002.\n\n[16] G. Basalyga and M. Rattray. Statistical dynamics of on-line independent component analysis. J. Mach.\n\nLearn. Res., 4(7-8):1393\u20131410, 2004.\n\n[17] Henry P McKean. Propagation of chaos for a class of non-linear parabolic equations. Stoch. Differ.\n\nEquations (Lecture Ser. Differ. Equations, Sess. 7, Cathol. Univ., 1967), pages 41\u201357, 1967.\n\n[18] Henry P McKean. 
A class of Markov processes associated with nonlinear parabolic equations. Proc. Natl.\n\nAcad. Sci., 56(6):1907\u20131911, 1966.\n\n[19] Sylvie M\u00e9l\u00e9ard and Sylvie Roelly-Coppoletta. A propagation of chaos result for a system of particles with\n\nmoderate interaction. Stoch. Process. their Appl., 26:317\u2013332, 1987.\n\n[20] Alain-Sol Sznitman. Topics in progagation of chaos. In Paul-Louis Hennequin, editor, Ec. d\u2019{\\\u2019e}t{\\\u2019e}\n\nProbab. Saint-Flour XIX\u20131989, pages 165\u2013251. Springer Berlin Heidelberg, 1991.\n\n[21] G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk\n\nMetropolis algorithms. Ann. Appl. Probab., 7(1):110\u2013120, 1997.\n\n[22] Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In\n\nAdv. Neural Inf. Process. Syst., pages 2796\u20132804, 2013.\n\n[23] Emmanuel J. Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger \ufb02ow: Theory\n\nand algorithms. IEEE Trans. Inf. Theory, 61(4):1985\u20132007, 2015.\n\n[24] Huishuai Zhang, Yuejie Chi, and Yingbin Liang. Provable non-convex phase retrieval with outliers: Median\n\ntruncated wirtinger \ufb02ow. arXiv:1603.03805, 2016.\n\n[25] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution\n\nvia nonconvex optimization. arXiv:1606.04933, 2016.\n\n9\n\n\f[26] Magnus Rattray. Stochastic trapping in a solvable model of on-line independent component analysis.\n\nNeural Comput., 14(2):17, 2002.\n\n[27] L. F. Cugliandolo and J. Kurchan. On the out-of-equilibrium relaxation of the Sherrington-Kirkpatrick\n\nmodel. J. Phys. A. Math. Gen., 27(17):5749\u20135772, 1994.\n\n[28] Jean Barbier, Mohamad Dia, Nicolas Macris, Florent Krzakala, Thibault Lesieur, and Lenka Zdeborov\u00e1.\nMutual information for symmetric rank-one matrix estimation: A proof of the replica formula. In Adv.\nNeural Inf. Process. 
Syst., pages 424\u2013432, 2016.\n\n[29] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applica-\n\ntions to compressed sensing. IEEE Trans. Inf. Theory, 57(2):764\u2013785, 2011.\n\n[30] David Donoho and Andrea Montanari. High dimensional robust M-estimation: asymptotic variance via\n\napproximate message passing. Probab. Theory Relat. Fields, 166(3-4):935\u2013969, 2016.\n\n10\n\n\f", "award": [], "sourceid": 3321, "authors": [{"given_name": "Chuang", "family_name": "Wang", "institution": "Harvard University"}, {"given_name": "Yue", "family_name": "Lu", "institution": "Harvard University"}]}