{"title": "Convergence rates of sub-sampled Newton methods", "book": "Advances in Neural Information Processing Systems", "page_first": 3052, "page_last": 3060, "abstract": "We consider the problem of minimizing a sum of $n$ functions via projected iterations onto a convex parameter set $\\C \\subset \\reals^p$, where $n\\gg p\\gg 1$. In this regime, algorithms which utilize sub-sampling techniques are known to be effective. In this paper, we use sub-sampling techniques together with low-rank approximation to design a new randomized batch algorithm which possesses comparable convergence rate to Newton's method, yet has much smaller per-iteration cost. The proposed algorithm is robust in terms of starting point and step size, and enjoys a composite convergence rate, namely, quadratic convergence at start and linear convergence when the iterate is close to the minimizer. We develop its theoretical analysis which also allows us to select near-optimal algorithm parameters. Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling based algorithms as well. We demonstrate how our results apply to well-known machine learning problems. Lastly, we evaluate the performance of our algorithm on several datasets under various scenarios.", "full_text": "Convergence rates of sub-sampled Newton methods\n\nMurat A. Erdogdu\n\nDepartment of Statistics\n\nStanford University\n\nerdogdu@stanford.edu\n\nAndrea Montanari\n\nDepartment of Statistics\nand Electrical Engineering\n\nStanford University\n\nmontanari@stanford.edu\n\nAbstract\n\nWe consider the problem of minimizing a sum of $n$ functions via projected iterations onto a convex parameter set $\\mathcal{C} \\subset \\mathbb{R}^p$, where $n \\gg p \\gg 1$. In this regime, algorithms which utilize sub-sampling techniques are known to be effective. 
In this paper, we use sub-sampling techniques together with low-rank approximation to design a new randomized batch algorithm which possesses comparable convergence rate to Newton\u2019s method, yet has much smaller per-iteration cost. The proposed algorithm is robust in terms of starting point and step size, and enjoys a composite convergence rate, namely, quadratic convergence at start and linear convergence when the iterate is close to the minimizer. We develop its theoretical analysis which also allows us to select near-optimal algorithm parameters. Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling based algorithms as well. We demonstrate how our results apply to well-known machine learning problems. Lastly, we evaluate the performance of our algorithm on several datasets under various scenarios.\n\n1 Introduction\n\nWe focus on the following minimization problem,\n\n$$\\text{minimize} \\quad f(\\theta) := \\frac{1}{n} \\sum_{i=1}^{n} f_i(\\theta), \\qquad (1.1)$$\n\nwhere $f_i : \\mathbb{R}^p \\to \\mathbb{R}$. Most machine learning models can be expressed as above, where each function $f_i$ corresponds to an observation. Examples include logistic regression, support vector machines, neural networks and graphical models.\n\nMany optimization algorithms have been developed to solve the above minimization problem [Bis95, BV04, Nes04]. For a given convex set $\\mathcal{C} \\subset \\mathbb{R}^p$, we denote the Euclidean projection onto this set by $\\mathcal{P}_{\\mathcal{C}}$. We consider the updates of the form\n\n$$\\hat\\theta^{t+1} = \\mathcal{P}_{\\mathcal{C}}\\left(\\hat\\theta^t - \\eta_t Q^t \\nabla_\\theta f(\\hat\\theta^t)\\right), \\qquad (1.2)$$\n\nwhere $\\eta_t$ is the step size and $Q^t$ is a suitable scaling matrix that provides curvature information.\n\nUpdates of the form Eq. (1.2) have been extensively studied in the optimization literature (for simplicity, we assume $\\mathcal{C} = \\mathbb{R}^p$ throughout the introduction). The case where $Q^t$ is equal to the identity matrix corresponds to Gradient Descent (GD) which, under smoothness assumptions, achieves linear convergence rate with $O(np)$ per-iteration cost. More precisely, GD with ideal step size yields $\\|\\hat\\theta^{t+1} - \\theta_*\\|_2 \\le \\xi^t_{1,\\mathrm{GD}} \\|\\hat\\theta^t - \\theta_*\\|_2$, where $\\lim_{t\\to\\infty} \\xi^t_{1,\\mathrm{GD}} = 1 - (\\lambda^*_p/\\lambda^*_1)$, and $\\lambda^*_i$ is the $i$-th largest eigenvalue of the Hessian of $f(\\theta)$ at the minimizer $\\theta_*$.\n\nSecond order methods such as Newton\u2019s Method (NM) and Natural Gradient Descent (NGD) [Ama98] can be recovered by taking $Q^t$ to be the inverse Hessian and the Fisher information evaluated at the current iterate, respectively. Such methods may achieve quadratic convergence rates with $O(np^2 + p^3)$ per-iteration cost [Bis95, Nes04]. In particular, for $t$ large enough, Newton\u2019s method yields $\\|\\hat\\theta^{t+1} - \\theta_*\\|_2 \\le \\xi_{2,\\mathrm{NM}} \\|\\hat\\theta^t - \\theta_*\\|_2^2$, and it is insensitive to the condition number of the Hessian. However, when the number of samples grows large, computing $Q^t$ becomes extremely expensive.\n\nA popular line of research tries to construct the matrix $Q^t$ in a way that the update is computationally feasible, yet still provides sufficient second order information. Such attempts resulted in Quasi-Newton methods, in which only gradients and iterates are utilized, resulting in an efficient update on $Q^t$. A celebrated Quasi-Newton method is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm which requires $O(np + p^2)$ per-iteration cost [Bis95, Nes04].\n\nAn alternative approach is to use sub-sampling techniques, where the scaling matrix $Q^t$ is based on a randomly selected set of data points [Mar10, BCNN11, VP12, Erd15]. Sub-sampling is widely used in first order methods, but is not as well studied for approximating the scaling matrix. 
In particular, theoretical guarantees are still missing.\n\nA key challenge is that the sub-sampled Hessian is close to the actual Hessian along the directions corresponding to large eigenvalues (large curvature directions in $f(\\theta)$), but is a poor approximation in the directions corresponding to small eigenvalues (flatter directions in $f(\\theta)$). In order to overcome this problem, we use low-rank approximation. More precisely, we treat all the eigenvalues below the $r$-th as if they were equal to the $(r+1)$-th. This yields the desired stability with respect to the sub-sample: we call our algorithm NewSamp. In this paper, we establish the following:\n\n1. NewSamp has a composite convergence rate: quadratic at start and linear near the minimizer, as illustrated in Figure 1. Formally, we prove a bound of the form $\\|\\hat\\theta^{t+1} - \\theta_*\\|_2 \\le \\xi_1^t \\|\\hat\\theta^t - \\theta_*\\|_2 + \\xi_2^t \\|\\hat\\theta^t - \\theta_*\\|_2^2$ with coefficients that are explicitly given (and are computable from data).\n2. The asymptotic behavior of the linear convergence coefficient is $\\lim_{t\\to\\infty} \\xi_1^t = 1 - (\\lambda^*_p/\\lambda^*_{r+1}) + \\delta$, for $\\delta$ small. The condition number $(\\lambda^*_1/\\lambda^*_p)$, which controls the convergence of GD, has been replaced by the milder $(\\lambda^*_{r+1}/\\lambda^*_p)$. For datasets with strong spectral features, this can be a large improvement, as shown in Figure 1.\n3. The above results are achieved without tuning the step size, in particular, by setting $\\eta_t = 1$.\n4. The complexity per iteration of NewSamp is $O(np + |S|p^2)$ with $|S|$ the sample size.\n5. Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling algorithms.\n\nThe rest of the paper is organized as follows: Section 1.1 surveys the related work. In Section 2, we describe the proposed algorithm and provide the intuition behind it. 
Next, we present our theoretical results in Section 3, i.e., convergence rates corresponding to different sub-sampling schemes, followed by a discussion on how to choose the algorithm parameters. Two applications of the algorithm are discussed in Section 4. We compare our algorithm with several existing methods on various datasets in Section 5. Finally, in Section 6, we conclude with a brief discussion.\n\n1.1 Related Work\n\nEven a synthetic review of optimization algorithms for large-scale machine learning would go beyond the page limits of this paper. Here, we emphasize that the method of choice depends crucially on the amount of data to be used, and their dimensionality (i.e., respectively, on the parameters $n$ and $p$). In this paper, we focus on a regime in which $n$ and $p$ are large but not so large as to make gradient computations (of order $np$) and matrix manipulations (of order $p^3$) prohibitive.\n\nOnline algorithms are the option of choice for very large $n$ since the computation per update is independent of $n$. In the case of Stochastic Gradient Descent (SGD), the descent direction is formed by a randomly selected gradient. Improvements to SGD have been developed by incorporating the previous gradient directions in the current update equation [SRB13, Bot10, DHS11].\n\nBatch algorithms, on the other hand, can achieve faster convergence and exploit second order information. They are competitive for intermediate $n$. Several methods in this category aim at quadratic, or at least super-linear convergence rates. In particular, Quasi-Newton methods have proven effective [Bis95, Nes04]. Another approach towards the same goal is to utilize sub-sampling to form an approximate Hessian [Mar10, BCNN11, VP12, Erd15]. If the sub-sampled Hessian is close to the true Hessian, these methods can approach NM in terms of convergence rate; nevertheless, they enjoy much smaller complexity per update. No convergence rate analysis is available for these methods: this analysis is the main contribution of our paper. To the best of our knowledge, the best result in this direction is proven in [BCNN11], which establishes asymptotic convergence without quantitative bounds (exploiting general theory from [GNS09]).\n\nOn the further improvements of the sub-sampling algorithms, a common approach is to use Conjugate Gradient (CG) methods and/or Krylov sub-spaces [Mar10, BCNN11, VP12]. Lastly, there are various hybrid algorithms that combine two or more techniques to increase the performance. Examples include sub-sampling and Quasi-Newton [BHNS14], SGD and GD [FS12], NGD and NM [LRF10], NGD and low-rank approximation [LRMB08].\n\nAlgorithm 1 NewSamp\nInput: $\\hat\\theta^0$, $r$, $\\epsilon$, $\\{\\eta_t\\}_t$, $t = 0$.\n1. Define: $\\mathcal{P}_{\\mathcal{C}}(\\theta) = \\mathrm{argmin}_{\\theta' \\in \\mathcal{C}} \\|\\theta - \\theta'\\|_2$ is the Euclidean projection onto $\\mathcal{C}$,\n   $[U_k, \\Lambda_k] = \\mathrm{TruncatedSVD}_k(H)$ is rank-$k$ truncated SVD of $H$ with $\\Lambda_{ii} = \\lambda_i$.\n2. while $\\|\\hat\\theta^{t+1} - \\hat\\theta^t\\|_2 > \\epsilon$ do\n   Sub-sample a set of indices $S_t \\subset [n]$.\n   Let $H_{S_t} = \\frac{1}{|S_t|} \\sum_{i \\in S_t} \\nabla^2_\\theta f_i(\\hat\\theta^t)$, and $[U_{r+1}, \\Lambda_{r+1}] = \\mathrm{TruncatedSVD}_{r+1}(H_{S_t})$,\n   $Q^t = \\lambda_{r+1}^{-1} I_p + U_r \\left(\\Lambda_r^{-1} - \\lambda_{r+1}^{-1} I_r\\right) U_r^T$,\n   $\\hat\\theta^{t+1} = \\mathcal{P}_{\\mathcal{C}}\\left(\\hat\\theta^t - \\eta_t Q^t \\nabla_\\theta f(\\hat\\theta^t)\\right)$,\n   $t \\leftarrow t + 1$.\n3. end while\nOutput: $\\hat\\theta^t$.\n\n2 NewSamp: Newton-Sampling method via rank thresholding\n\nIn the regime we consider, $n \\gg p$, there are two main drawbacks associated with the classical second order methods such as Newton\u2019s method. The dominant issue is the computation of the Hessian matrix, which requires $O(np^2)$ operations, and the other issue is inverting the Hessian, which requires $O(p^3)$ computation. Sub-sampling is an effective and efficient way of tackling the first issue. 
Recent empirical studies show that sub-sampling the Hessian provides significant improvement in terms of computational cost, yet preserves the fast convergence rate of second order methods [Mar10, VP12]. If a uniform sub-sample is used, the sub-sampled Hessian will be a random matrix with expected value at the true Hessian, which can be considered as a sample estimator to the mean. Recent advances in statistics have shown that the performance of various estimators can be significantly improved by simple procedures such as shrinkage and/or thresholding [CCS10, DGJ13]. To this extent, we use low-rank approximation as the important second order information is generally contained in the largest few eigenvalues/vectors of the Hessian.\n\nNewSamp is presented as Algorithm 1. At iteration step $t$, the sub-sampled set of indices, its size and the corresponding sub-sampled Hessian are denoted by $S_t$, $|S_t|$ and $H_{S_t}$, respectively. Assuming that the functions $f_i$ are convex, eigenvalues of the symmetric matrix $H_{S_t}$ are non-negative. Therefore, SVD and eigenvalue decomposition coincide. The operation $\\mathrm{TruncatedSVD}_k(H_{S_t}) = [U_k, \\Lambda_k]$ is the best rank-$k$ approximation, i.e., takes $H_{S_t}$ as input and returns the largest $k$ eigenvalues $\\Lambda_k \\in \\mathbb{R}^{k \\times k}$ with the corresponding $k$ eigenvectors $U_k \\in \\mathbb{R}^{p \\times k}$. This procedure requires $O(kp^2)$ computation [HMT11]. Operator $\\mathcal{P}_{\\mathcal{C}}$ projects the current iterate to the feasible set $\\mathcal{C}$ using Euclidean projection. We assume that this projection can be done efficiently. To construct the curvature matrix $[Q^t]^{-1}$, instead of using the basic rank-$r$ approximation, we fill its 0 eigenvalues with the $(r+1)$-th eigenvalue of the sub-sampled Hessian which is the largest eigenvalue below the threshold. If we compute a truncated SVD with $k = r+1$ and $\\Lambda_{ii} = \\lambda_i$, the described operation results in\n\n$$Q^t = \\lambda_{r+1}^{-1} I_p + U_r \\left(\\Lambda_r^{-1} - \\lambda_{r+1}^{-1} I_r\\right) U_r^T, \\qquad (2.1)$$\n\nwhich is simply the sum of a scaled identity matrix and a rank-$r$ matrix. Note that the low-rank approximation that is suggested to improve the curvature estimation has been further utilized to reduce the cost of computing the inverse matrix. Final per-iteration cost of NewSamp will be $O\\left(np + (|S_t| + r)p^2\\right) \\approx O\\left(np + |S_t|p^2\\right)$. NewSamp takes the parameters $\\{\\eta_t, |S_t|\\}_t$ and $r$ as inputs. We discuss in Section 3.4 how to choose them optimally, based on the theory in Section 3.\n\n[Figure 1. Left panel, Convergence Rate: log(Error) against Iterations for NewSamp with sub-sample sizes $S_t$ = 100, 200 and 500. Right panel, Convergence Coefficients: Value against Rank for $\\xi_1$ (linear) and $\\xi_2$ (quadratic).]\n\nFigure 1: Left plot demonstrates convergence rate of NewSamp, which starts with a quadratic rate and transitions into linear convergence near the true minimizer. The right plot shows the effect of eigenvalue thresholding on the convergence coefficients up to a scaling constant. x-axis shows the number of kept eigenvalues. Plots are obtained using Covertype dataset.\n\nBy the construction of $Q^t$, NewSamp will always be a descent algorithm. It enjoys a quadratic convergence rate at start which transitions into a linear rate in the neighborhood of the minimizer. This behavior can be observed in Figure 1. The left plot in Figure 1 shows the convergence behavior of NewSamp over different sub-sample sizes. We observe that large sub-samples result in better convergence rates as expected. As the sub-sample size increases, slope of the linear phase decreases, getting closer to that of quadratic phase. We will explain this phenomenon in Section 3, by Theorems 3.2 and 3.3. The right plot in Figure 1 demonstrates how the coefficients of two phases depend on the thresholded rank. 
Coefficient of the quadratic phase increases with the rank threshold, whereas for the linear phase, the relation is reversed.\n\n3 Theoretical results\n\nIn this section, we provide the convergence analysis of NewSamp based on two different sub-sampling schemes:\n\nS1: Independent sub-sampling: At each iteration $t$, $S_t$ is uniformly sampled from $[n] = \\{1, 2, ..., n\\}$, independently from the sets $\\{S_\\tau\\}_{\\tau<t}$, with or without replacement.\nS2: Sequentially dependent sub-sampling: At each iteration $t$, $S_t$ is sampled from $[n]$, based on a distribution which might depend on the previous sets $\\{S_\\tau\\}_{\\tau<t}$, but not on any randomness in the data.\n\nThe first sub-sampling scheme is simple and commonly used in optimization. One drawback is that the sub-sampled set at the current iteration is independent of the previous sub-samples, hence does not consider which of the samples were previously used to form the approximate curvature information. In order to prevent cycles and obtain better performance near the optimum, one might want to increase the sample size as the iteration advances [Mar10], including previously unused samples. This process results in a sequence of dependent sub-samples which falls into the sub-sampling scheme S2. In our theoretical analysis, we make the following assumptions:\n\nAssumption 1 (Lipschitz continuity). For any subset $S \\subset [n]$, $\\exists M_{|S|}$ depending on the size of $S$, such that $\\forall \\theta, \\theta' \\in \\mathcal{C}$,\n\n$$\\|H_S(\\theta) - H_S(\\theta')\\|_2 \\le M_{|S|} \\|\\theta - \\theta'\\|_2.$$\n\nAssumption 2 (Bounded Hessian). $\\forall i \\in [n]$, $\\nabla^2_\\theta f_i(\\theta)$ is upper bounded by a constant $K$, i.e.,\n\n$$\\max_{i \\le n} \\|\\nabla^2_\\theta f_i(\\theta)\\|_2 \\le K.$$\n\n3.1 Independent sub-sampling\n\nIn this section, we assume that $S_t \\subset [n]$ is sampled according to the sub-sampling scheme S1. In fact, many stochastic algorithms assume that $S_t$ is a uniform subset of $[n]$, because in this case the sub-sampled Hessian is an unbiased estimator of the full Hessian. That is, $\\forall \\theta \\in \\mathcal{C}$, $\\mathbb{E}[H_{S_t}(\\theta)] = H_{[n]}(\\theta)$, where the expectation is over the randomness in $S_t$. We next show that for any scaling matrix $Q^t$ that is formed by the sub-samples $S_t$, iterations of the form Eq. (1.2) will have a composite convergence rate, i.e., a combination of linear and quadratic phases.\n\nLemma 3.1. Assume that the parameter set $\\mathcal{C}$ is convex and $S_t \\subset [n]$ is based on sub-sampling scheme S1 and sufficiently large. Further, let the Assumptions 1 and 2 hold and $\\theta_* \\in \\mathcal{C}$. Then, for an absolute constant $c > 0$, with probability at least $1 - 2/p$, the updates of the form Eq. (1.2) satisfy\n\n$$\\|\\hat\\theta^{t+1} - \\theta_*\\|_2 \\le \\xi_1^t \\|\\hat\\theta^t - \\theta_*\\|_2 + \\xi_2^t \\|\\hat\\theta^t - \\theta_*\\|_2^2,$$\n\nfor coefficients $\\xi_1^t$ and $\\xi_2^t$ defined as\n\n$$\\xi_1^t = \\left\\|I - \\eta_t Q^t H_{S_t}(\\hat\\theta^t)\\right\\|_2 + \\eta_t c K \\|Q^t\\|_2 \\sqrt{\\frac{\\log(p)}{|S_t|}}, \\qquad \\xi_2^t = \\eta_t \\frac{M_n}{2} \\|Q^t\\|_2.$$\n\nRemark 1. If the initial point $\\hat\\theta^0$ is close to $\\theta_*$, the algorithm will start with a quadratic rate of convergence which will transform into linear rate later in the close neighborhood of the optimum.\n\nThe above lemma holds for any matrix $Q^t$. In particular, if we choose $Q^t = H_{S_t}^{-1}$, we obtain a bound for the simple sub-sampled Hessian method. In this case, the coefficients $\\xi_1^t$ and $\\xi_2^t$ depend on $\\|Q^t\\|_2 = 1/\\lambda_p^t$ where $\\lambda_p^t$ is the smallest eigenvalue of the sub-sampled Hessian. Note that $\\lambda_p^t$ can be arbitrarily small which might blow up both of the coefficients. In the following, we will see how NewSamp remedies this issue.\n\nTheorem 3.2. Let the assumptions in Lemma 3.1 hold. Denote by $\\lambda_i^t$ the $i$-th eigenvalue of $H_{S_t}(\\hat\\theta^t)$ where $\\hat\\theta^t$ is given by NewSamp at iteration step $t$. If the step size satisfies\n\n$$\\eta_t \\le \\frac{2}{1 + \\lambda_p^t/\\lambda_{r+1}^t}, \\qquad (3.1)$$\n\nthen we have, with probability at least $1 - 2/p$,\n\n$$\\|\\hat\\theta^{t+1} - \\theta_*\\|_2 \\le \\xi_1^t \\|\\hat\\theta^t - \\theta_*\\|_2 + \\xi_2^t \\|\\hat\\theta^t - \\theta_*\\|_2^2,$$\n\nfor an absolute constant $c > 0$, where the coefficients $\\xi_1^t$ and $\\xi_2^t$ are defined as\n\n$$\\xi_1^t = 1 - \\eta_t \\frac{\\lambda_p^t}{\\lambda_{r+1}^t} + \\eta_t \\frac{cK}{\\lambda_{r+1}^t} \\sqrt{\\frac{\\log(p)}{|S_t|}}, \\qquad \\xi_2^t = \\eta_t \\frac{M_n}{2\\lambda_{r+1}^t}.$$\n\nNewSamp has a composite convergence rate where $\\xi_1^t$ and $\\xi_2^t$ are the coefficients of the linear and the quadratic terms, respectively (see the right plot in Figure 1). We observe that the sub-sampling size has a significant effect on the linear term, whereas the quadratic term is governed by the Lipschitz constant. We emphasize that the case $\\eta_t = 1$ is feasible for the conditions of Theorem 3.2.\n\n3.2 Sequentially dependent sub-sampling\n\nHere, we assume that the sub-sampling scheme S2 is used to generate $\\{S_\\tau\\}_{\\tau \\ge 1}$. Distribution of sub-sampled sets may depend on each other, but not on any randomness in the dataset. Examples include fixed sub-samples as well as sub-samples of increasing size, sequentially covering unused data. In addition to Assumptions 1-2, we assume the following.\n\nAssumption 3 (i.i.d. observations). Let $z_1, z_2, ..., z_n \\in \\mathcal{Z}$ be i.i.d. observations from a distribution $\\mathcal{D}$. For a fixed $\\theta \\in \\mathbb{R}^p$ and $\\forall i \\in [n]$, we assume that the functions $\\{f_i\\}_{i=1}^n$ satisfy $f_i(\\theta) = \\varphi(z_i, \\theta)$, for some function $\\varphi : \\mathcal{Z} \\times \\mathbb{R}^p \\to \\mathbb{R}$.\n\nMost statistical learning algorithms can be formulated as above, e.g., in classification problems, one has access to i.i.d. samples $\\{(y_i, x_i)\\}_{i=1}^n$ where $y_i$ and $x_i$ denote the class label and the covariate, and $\\varphi$ measures the classification error (see Section 4 for examples). 
For sub-sampling scheme S2, an analogue of Lemma 3.1 is stated in Appendix as Lemma B.1, which leads to the following result.\n\nTheorem 3.3. Assume that the parameter set $\\mathcal{C}$ is convex and $S_t \\subset [n]$ is based on the sub-sampling scheme S2. Further, let the Assumptions 1, 2 and 3 hold, almost surely. Conditioned on the event $\\mathcal{E} = \\{\\theta_* \\in \\mathcal{C}\\}$, if the step size satisfies Eq. 3.1, then for $\\hat\\theta^t$ given by NewSamp at iteration $t$, with probability at least $1 - c_{\\mathcal{E}} e^{-p}$ for $c_{\\mathcal{E}} = c/\\mathbb{P}(\\mathcal{E})$, we have\n\n$$\\|\\hat\\theta^{t+1} - \\theta_*\\|_2 \\le \\xi_1^t \\|\\hat\\theta^t - \\theta_*\\|_2 + \\xi_2^t \\|\\hat\\theta^t - \\theta_*\\|_2^2,$$\n\nfor the coefficients $\\xi_1^t$ and $\\xi_2^t$ defined as\n\n$$\\xi_1^t = 1 - \\eta_t \\frac{\\lambda_p^t}{\\lambda_{r+1}^t} + \\eta_t \\frac{c'K}{\\lambda_{r+1}^t} \\sqrt{\\frac{p}{|S_t|} \\log\\left(\\frac{\\mathrm{diam}(\\mathcal{C})^2 \\left(M_n + M_{|S_t|}\\right)^2 |S_t|}{K^2}\\right)}, \\qquad \\xi_2^t = \\eta_t \\frac{M_n}{2\\lambda_{r+1}^t},$$\n\nwhere $c, c' > 0$ are absolute constants and $\\lambda_i^t$ denotes the $i$-th eigenvalue of $H_{S_t}(\\hat\\theta^t)$.\n\nCompared to the Theorem 3.2, we observe that the coefficient of the quadratic term does not change. This is due to Assumption 1. However, the bound on the linear term is worse, since we use the uniform bound over the convex parameter set $\\mathcal{C}$.\n\n3.3 Dependence of coefficients on $t$ and convergence guarantees\n\nThe coefficients $\\xi_1^t$ and $\\xi_2^t$ depend on the iteration step which is an undesirable aspect of the above results. However, these constants can be well approximated by their analogues $\\xi_1^*$ and $\\xi_2^*$ evaluated at the optimum which are defined by simply replacing $\\lambda_j^t$ with $\\lambda_j^*$ in their definition, where the latter is the $j$-th eigenvalue of full-Hessian at $\\theta_*$. For the sake of simplicity, we only consider the case where the functions $\\theta \\to f_i(\\theta)$ are quadratic.\n\nTheorem 3.4. 
Assume that the functions $f_i(\\theta)$ are quadratic, $S_t$ is based on scheme S1 and $\\eta_t = 1$. Let the full Hessian at $\\theta_*$ be lower bounded by $k$. Then for sufficiently large $|S_t|$ and absolute constants $c_1, c_2$, with probability $1 - 2/p$\n\n$$\\left|\\xi_1^t - \\xi_1^*\\right| \\le \\frac{c_1 K \\sqrt{\\log(p)/|S_t|}}{k\\left(k - c_2 K \\sqrt{\\log(p)/|S_t|}\\right)} := \\delta.$$\n\nTheorem 3.4 implies that, when the sub-sampling size is sufficiently large, $\\xi_1^t$ will concentrate around $\\xi_1^*$. Generalizing the above theorem to non-quadratic functions is straightforward, in which case, one would get additional terms involving the difference $\\|\\hat\\theta^t - \\theta_*\\|_2$. In the case of scheme S2, if one uses fixed sub-samples, then the coefficient $\\xi_1^t$ does not depend on $t$. The following corollary gives a sufficient condition for convergence. A detailed discussion on the number of iterations until convergence and further local convergence properties can be found in [Erd15, EM15].\n\nCorollary 3.5. Assume that $\\xi_1^t$ and $\\xi_2^t$ are well-approximated by $\\xi_1^*$ and $\\xi_2^*$ with an error bound of $\\delta$, i.e., $\\xi_i^t \\le \\xi_i^* + \\delta$ for $i = 1, 2$, as in Theorem 3.4. For the initial point $\\hat\\theta^0$, a sufficient condition for convergence is\n\n$$\\|\\hat\\theta^0 - \\theta_*\\|_2 < \\frac{1 - \\xi_1^* - \\delta}{\\xi_2^* + \\delta}.$$\n\n3.4 Choosing the algorithm parameters\n\nStep size: Let $\\gamma = O(\\log(p)/|S_t|)$. We suggest the following step size for NewSamp at iteration $t$,\n\n$$\\eta_t(\\gamma) = \\frac{2}{1 + \\lambda_p^t/\\lambda_{r+1}^t + \\gamma}. \\qquad (3.2)$$\n\nNote that $\\eta_t(0)$ is the upper bound in Theorems 3.2 and 3.3 and it minimizes the first component of $\\xi_1^t$. The other terms in $\\xi_1^t$ and $\\xi_2^t$ linearly depend on $\\eta_t$. To compensate for that, we shrink $\\eta_t(0)$ towards 1. Contrary to most algorithms, optimal step size of NewSamp is larger than 1. A rigorous derivation of Eq. 
3.2 can be found in [EM15].\n\nSample size: By Theorem 3.2, a sub-sample of size $O((K/\\lambda_p^*)^2 \\log(p))$ should be sufficient to obtain a small coefficient for the linear phase. Also note that sub-sample size $|S_t|$ scales quadratically with the condition number.\n\nRank threshold: For a full-Hessian with effective rank $R$ (trace divided by the largest eigenvalue), it suffices to use $O(R \\log(p))$ samples [Ver10]. Effective rank is upper bounded by the dimension $p$. Hence, one can use $p \\log(p)$ samples to approximate the full-Hessian and choose a rank threshold which retains the important curvature information.\n\n4 Examples\n\n4.1 Generalized Linear Models (GLM)\n\nMaximum likelihood estimation in a GLM setting is equivalent to minimizing the negative log-likelihood $\\ell(\\theta)$,\n\n$$\\underset{\\theta \\in \\mathcal{C}}{\\text{minimize}} \\quad f(\\theta) = \\frac{1}{n} \\sum_{i=1}^{n} \\left[\\Phi(\\langle x_i, \\theta \\rangle) - y_i \\langle x_i, \\theta \\rangle\\right], \\qquad (4.1)$$\n\nwhere $\\Phi$ is the cumulant generating function, $x_i \\in \\mathbb{R}^p$ denote the rows of design matrix $X \\in \\mathbb{R}^{n \\times p}$, and $\\theta \\in \\mathbb{R}^p$ is the coefficient vector. Here, $\\langle x, \\theta \\rangle$ denotes the inner product between the vectors $x$, $\\theta$. The function $\\Phi$ defines the type of GLM, e.g., $\\Phi(z) = z^2$ gives ordinary least squares (OLS) and $\\Phi(z) = \\log(1 + e^z)$ gives logistic regression (LR). Using the results from Section 3, we perform a convergence analysis of our algorithm on a GLM problem.\n\nCorollary 4.1. Let $S_t \\subset [n]$ be a uniform sub-sample, and $\\mathcal{C} = \\mathbb{R}^p$ be the parameter set. Assume that the second derivative of the cumulant generating function, $\\Phi^{(2)}$, is bounded by 1, and it is Lipschitz continuous with Lipschitz constant $L$. Further, assume that the covariates are contained in a ball of radius $\\sqrt{R_x}$, i.e. $\\max_{i \\in [n]} \\|x_i\\|_2 \\le \\sqrt{R_x}$. 
Then, for $\\hat\\theta^t$ given by NewSamp with constant step size $\\eta_t = 1$ at iteration $t$, with probability at least $1 - 2/p$, we have\n\n$$\\|\\hat\\theta^{t+1} - \\theta_*\\|_2 \\le \\xi_1^t \\|\\hat\\theta^t - \\theta_*\\|_2 + \\xi_2^t \\|\\hat\\theta^t - \\theta_*\\|_2^2,$$\n\nfor constants $\\xi_1^t$ and $\\xi_2^t$ defined as\n\n$$\\xi_1^t = 1 - \\frac{\\lambda_i^t}{\\lambda_{r+1}^t} + \\frac{c R_x}{\\lambda_{r+1}^t} \\sqrt{\\frac{\\log(p)}{|S_t|}}, \\qquad \\xi_2^t = \\frac{L R_x^{3/2}}{2\\lambda_{r+1}^t},$$\n\nwhere $c > 0$ is an absolute constant and $\\lambda_i^t$ is the $i$-th eigenvalue of $H_{S_t}(\\hat\\theta^t)$.\n\n4.2 Support Vector Machines (SVM)\n\nA linear SVM provides a separating hyperplane which maximizes the margin, i.e., the distance between the hyperplane and the support vectors. Although the vast majority of the literature focuses on the dual problem [SS02], SVMs can be trained using the primal as well. Since the dual problem does not scale well with the number of data points (some approaches get $O(n^3)$ complexity) the primal might be better-suited for optimization of linear SVMs [Cha07]. The primal problem for the linear SVM can be written as\n\n$$\\underset{\\theta \\in \\mathcal{C}}{\\text{minimize}} \\quad f(\\theta) = \\frac{1}{2} \\|\\theta\\|_2^2 + \\frac{1}{2} C \\sum_{i=1}^{n} \\ell(y_i, \\langle \\theta, x_i \\rangle), \\qquad (4.2)$$\n\nwhere $(y_i, x_i)$ denote the data samples, $\\theta$ defines the separating hyperplane, $C > 0$ and $\\ell$ could be any loss function. The most commonly used loss functions include Hinge-p loss, Huber loss and their smoothed versions [Cha07]. Smoothing or approximating such losses with more stable functions is sometimes crucial in optimization. In the case of NewSamp which requires the loss function to be twice differentiable (almost everywhere), we suggest either smoothed Huber loss, or Hinge-2 loss [Cha07]. 
In the case of Hinge-2 loss, i.e., $\\ell(y, \\langle \\theta, x \\rangle) = \\max\\{0, 1 - y \\langle \\theta, x \\rangle\\}^2$, by combining the offset and the normal vector of the hyperplane into a single parameter vector $\\theta$, and denoting by $SV_t$ the set of indices of all the support vectors at iteration $t$, we may write the Hessian,\n\n$$\\nabla^2_\\theta f(\\theta) = \\frac{1}{|SV_t|} \\Big\\{ I + C \\sum_{i \\in SV_t} x_i x_i^T \\Big\\}, \\quad \\text{where} \\quad SV_t = \\{i : y_i \\langle \\theta^t, x_i \\rangle < 1\\}.$$\n\nWhen $|SV_t|$ is large, the problem falls into our setup and can be solved efficiently using NewSamp. Note that unlike the GLM setting, the Lipschitz condition of our theorems does not apply here. However, we empirically demonstrate that NewSamp works regardless of such assumptions.\n\n5 Experiments\n\nIn this section, we validate the performance of NewSamp through numerical studies. We experimented on two optimization problems, namely, Logistic Regression (LR) and SVM. LR minimizes Eq. 4.1 for the logistic function, whereas SVM minimizes Eq. 4.2 for the Hinge-2 loss. In the following, we briefly describe the algorithms that are used in the experiments:\n\n1. Gradient Descent (GD), at each iteration, takes a step proportional to the negative of the full gradient evaluated at the current iterate. Under certain regularity conditions, GD exhibits a linear convergence rate.\n2. Accelerated Gradient Descent (AGD) was proposed by Nesterov [Nes83]; it improves over gradient descent by using a momentum term.\n3. Newton\u2019s Method (NM) achieves a quadratic convergence rate by utilizing the inverse Hessian evaluated at the current iterate.\n4. Broyden-Fletcher-Goldfarb-Shanno (BFGS) is the most popular and stable Quasi-Newton method. $Q^t$ is formed by accumulating the information from iterates and gradients.\n5. Limited Memory BFGS (L-BFGS) is a variant of BFGS, which uses only the recent iterates and gradients to construct $Q^t$, providing improvement in terms of memory usage.\n6. 
Stochastic Gradient Descent (SGD) is a simplified version of GD where, at each iteration, a randomly selected gradient is used. We follow the guidelines of [Bot10] for the step size.\n7. Adaptive Gradient Scaling (AdaGrad) uses an adaptive learning rate based on the previous gradients. AdaGrad significantly improves the performance and stability of SGD [DHS11].\n\n[Figure 2. Panels: Logistic Regression and SVM on the Synthetic (rank=3), CT Slices (rank=60) and MSD (rank=60) datasets. Each panel plots log(Error) against Time(sec) for NewSamp, BFGS, LBFGS, Newton, GD, AGD, SGD and AdaGrad.]\n\nFigure 2: Performance of several algorithms on different datasets. NewSamp is represented with red color.\n\nFor batch algorithms, we used constant step size and for all the algorithms, the step size that provides the fastest convergence is chosen. For stochastic algorithms, we optimized over the parameters that define the step size. 
Parameters of NewSamp are selected following the guidelines in Section 3.4.\nWe experimented over various datasets that are given in Table 1. Each dataset consists of a design matrix $X \\in \\mathbb{R}^{n \\times p}$ and the corresponding observations (classes) $y \\in \\mathbb{R}^n$. Synthetic data is generated through a multivariate Gaussian distribution. As a methodological choice, we selected moderate values of p, for which Newton\u2019s method can still be implemented, and nevertheless we can demonstrate an improvement. For larger values of p, comparison is even more favorable to our approach.\nThe effects of sub-sampling size $|S_t|$ and rank threshold are demonstrated in Figure 1. A thorough comparison of the aforementioned optimization techniques is presented in Figure 2. In the case of LR, we observe that stochastic methods enjoy fast convergence at start, but slow down after several epochs. The algorithm that comes close to NewSamp in terms of performance is BFGS. In the case of SVM, NM is the closest algorithm to NewSamp. Note that the global convergence of BFGS is not better than that of GD [Nes04]. The condition for super-linear rate is $\\sum_t \\|\\theta^t - \\theta_*\\|_2 < \\infty$, for which an initial point close to the optimum is required [DM77]. This condition can rarely be satisfied in practice, which also affects the performance of other second order methods. For NewSamp, even though rank thresholding provides a level of robustness, we found that the initial point is still an important factor. Details about Figure 2 and additional experiments can be found in Appendix C.\n\nDataset     n       p    r   Reference\nCT slices   53500   386  60  [GKS+11, Lic13]\nCovertype   581012  54   20  [BD99, Lic13]\nMSD         515345  90   60  [MEWL, Lic13]\nSynthetic   500000  300  3   \u2013\n\nTable 1: Datasets used in the experiments.\n\n6 Conclusion\nIn this paper, we proposed a sub-sampling based second order method utilizing low-rank Hessian estimation. 
The proposed method has the target regime $n \\gg p$ and has $O(np + |S|p^2)$ complexity per-iteration. We showed that the convergence rate of NewSamp is composite for two widely used sub-sampling schemes, i.e., starts as quadratic convergence and transforms to linear convergence near the optimum. Convergence behavior under other sub-sampling schemes is an interesting line of research. Numerical experiments demonstrate the performance of the proposed algorithm which we compared to the classical optimization methods.\n\nReferences\n[Ama98] Shun-Ichi Amari, Natural gradient works efficiently in learning, Neural computation 10 (1998).\n[BCNN11] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal, On the use of stochastic hessian information in optimization methods for machine learning, SIAM Journal on Optimization (2011).\n[BD99] Jock A Blackard and Denis J Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Compag (1999).\n[BHNS14] Richard H Byrd, SL Hansen, Jorge Nocedal, and Yoram Singer, A stochastic quasi-newton method for large-scale optimization, arXiv preprint arXiv:1401.7020 (2014).\n[Bis95] Christopher M. Bishop, Neural networks for pattern recognition, Oxford University Press, 1995.\n[Bot10] L\u00e9on Bottou, Large-scale machine learning with stochastic gradient descent, COMPSTAT, 2010.\n[BV04] Stephen Boyd and Lieven Vandenberghe, Convex optimization, Cambridge University Press, 2004.\n[CCS10] Jian-Feng Cai, Emmanuel J Cand\u00e8s, and Zuowei Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization 20 (2010), no. 4, 1956\u20131982.\n[Cha07] Olivier Chapelle, Training a support vector machine in the primal, Neural Computation (2007).\n[DE15] Lee H Dicker and Murat A Erdogdu, Flexible results for quadratic forms with applications to variance components estimation, arXiv preprint arXiv:1509.04388 (2015).\n[DGJ13] David L Donoho, Matan Gavish, and Iain M Johnstone, Optimal shrinkage of eigenvalues in the spiked covariance model, arXiv preprint arXiv:1311.0851 (2013).\n[DHS11] John Duchi, Elad Hazan, and Yoram Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (2011), 2121\u20132159.\n[DM77] John E Dennis, Jr and Jorge J Mor\u00e9, Quasi-newton methods, motivation and theory, SIAM review 19 (1977), 46\u201389.\n[EM15] Murat A Erdogdu and Andrea Montanari, Convergence rates of sub-sampled Newton methods, arXiv preprint arXiv:1508.02810 (2015).\n[Erd15] Murat A. Erdogdu, Newton-Stein Method: A second order method for GLMs via Stein\u2019s lemma, NIPS, 2015.\n[FS12] Michael P Friedlander and Mark Schmidt, Hybrid deterministic-stochastic methods for data fitting, SIAM Journal on Scientific Computing 34 (2012), no. 3, A1380\u2013A1405.\n[GKS+11] Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian P\u00f6lsterl, and Alexander Cavallaro, 2d image registration in ct images using radial image descriptors, MICCAI 2011, Springer, 2011.\n[GN10] David Gross and Vincent Nesme, Note on sampling without replacing from a finite collection of matrices, arXiv preprint arXiv:1001.2738 (2010).\n[GNS09] Igor Griva, Stephen G Nash, and Ariela Sofer, Linear and nonlinear optimization, Siam, 2009.\n[HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, no. 2, 217\u2013288.\n[Lic13] M. Lichman, UCI machine learning repository, 2013.\n[LRF10] Nicolas Le Roux and Andrew W Fitzgibbon, A fast natural newton method, ICML, 2010.\n[LRMB08] Nicolas Le Roux, Pierre-A Manzagol, and Yoshua Bengio, Topmoumoute online natural gradient algorithm, NIPS, 2008.\n[Mar10] James Martens, Deep learning via hessian-free optimization, ICML, 2010, pp. 735\u2013742.\n[MEWL] Thierry B. Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere, The million song dataset, ISMIR-11.\n[Nes83] Yurii Nesterov, A method for unconstrained convex minimization problem with the rate of convergence o(1/k^2), Doklady AN SSSR, vol. 269, 1983, pp. 543\u2013547.\n[Nes04] Yurii Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer, 2004.\n[SRB13] Mark Schmidt, Nicolas Le Roux, and Francis Bach, Minimizing finite sums with the stochastic average gradient, arXiv preprint arXiv:1309.2388 (2013).\n[SS02] Bernhard Sch\u00f6lkopf and Alexander J Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT press, 2002.\n[Tro12] Joel A Tropp, User-friendly tail bounds for sums of random matrices, Foundations of Computational Mathematics (2012).\n[Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv:1011.3027 (2010).\n[VP12] Oriol Vinyals and Daniel Povey, Krylov Subspace Descent for Deep Learning, AISTATS, 2012.", "award": [], "sourceid": 1722, "authors": [{"given_name": "Murat", "family_name": "Erdogdu", "institution": "Stanford University"}, {"given_name": "Andrea", "family_name": "Montanari", "institution": "Stanford"}]}