{"title": "Preconditioned Spectral Descent for Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2971, "page_last": 2979, "abstract": "Deep learning presents notorious computational challenges. These challenges include, but are not limited to, the non-convexity of learning objectives and estimating the quantities needed for optimization algorithms, such as gradients. While we do not address the non-convexity, we present an optimization solution that ex- ploits the so far unused \u201cgeometry\u201d in the objective function in order to best make use of the estimated gradients. Previous work attempted similar goals with preconditioned methods in the Euclidean space, such as L-BFGS, RMSprop, and ADA-grad. In stark contrast, our approach combines a non-Euclidean gradient method with preconditioning. We provide evidence that this combination more accurately captures the geometry of the objective function compared to prior work. We theoretically formalize our arguments and derive novel preconditioned non-Euclidean algorithms. The results are promising in both computational time and quality when applied to Restricted Boltzmann Machines, Feedforward Neural Nets, and Convolutional Neural Nets.", "full_text": "Preconditioned Spectral Descent for Deep Learning\n\nDavid E. Carlson,1 Edo Collins,2 Ya-Ping Hsieh,2 Lawrence Carin,3 Volkan Cevher2\n\n1 Department of Statistics, Columbia University\n\n2 Laboratory for Information and Inference Systems (LIONS), EPFL\n3 Department of Electrical and Computer Engineering, Duke University\n\nAbstract\n\nDeep learning presents notorious computational challenges. These challenges in-\nclude, but are not limited to, the non-convexity of learning objectives and estimat-\ning the quantities needed for optimization algorithms, such as gradients. 
While we\ndo not address the non-convexity, we present an optimization solution that exploits\nthe so far unused \u201cgeometry\u201d in the objective function in order to best make use\nof the estimated gradients. Previous work attempted similar goals with precon-\nditioned methods in the Euclidean space, such as L-BFGS, RMSprop, and ADA-\ngrad. In stark contrast, our approach combines a non-Euclidean gradient method\nwith preconditioning. We provide evidence that this combination more accurately\ncaptures the geometry of the objective function compared to prior work. We theo-\nretically formalize our arguments and derive novel preconditioned non-Euclidean\nalgorithms. The results are promising in both computational time and quality\nwhen applied to Restricted Boltzmann Machines, Feedforward Neural Nets, and\nConvolutional Neural Nets.\n\n1\n\nIntroduction\n\nIn spite of the many great successes of deep learning, ef\ufb01cient optimization of deep networks re-\nmains a challenging open problem due to the complexity of the model calculations, the non-convex\nnature of the implied objective functions, and their inhomogeneous curvature [6]. It is established\nboth theoretically and empirically that \ufb01nding a local optimum in many tasks often gives compara-\nble performance to the global optima [4], so the primary goal is to \ufb01nd a local optimum quickly. It\nis speculated that an increase in computational power and training ef\ufb01ciency will drive performance\nof deep networks further by utilizing more complicated networks and additional data [14].\nStochastic Gradient Descent (SGD) is the most widespread algorithm of choice for practitioners\nof machine learning. However, the objective functions typically found in deep learning problems,\nsuch as feed-forward neural networks and Restricted Boltzmann Machines (RBMs), have inhomo-\ngeneous curvature, rendering SGD ineffective. 
A common technique for improving ef\ufb01ciency is to\nuse adaptive step-size methods for SGD [25], where each layer in a deep model has an independent\nstep-size. Quasi-Newton methods have shown promising results in networks with sparse penalties\n[16], and factorized second order approximations have also shown improved performance [18]. A\npopular alternative to these methods is to use an element-wise adaptive learning rate, which has\nshown improved performance in ADAgrad [7], ADAdelta [30], and RMSprop [5].\nThe foundation of all of the above methods lies in the hope that the objective function can be well-\napproximated by Euclidean (e.g., Frobenius or (cid:96)2) norms. However, recent work demonstrated that\nthe matrix of connection weights in an RBM has a tighter majorization bound on the objective\nfunction with respect to the Schatten-\u221e norm compared to the Frobenius norm [1]. A majorization-\nminimization approach with the non-Euclidean majorization bound leads to an algorithm denoted\nas Stochastic Spectral Descent (SSD), which sped up the learning of RBMs and other probabilistic\n\n1\n\n\fmodels. However, this approach does not directly generalize to other deep models, as it can suffer\nfrom loose majorization bounds.\nIn this paper, we combine recent non-Euclidean gradient methods with element-wise adaptive learn-\ning rates, and show their applicability to a variety of models. Speci\ufb01cally, our contributions are:\n\ni) We demonstrate that the objective function in feedforward neural nets is naturally bounded by\nthe Schatten-\u221e norm. This motivates the application of the SSD algorithm developed in [1],\nwhich explicitly treats the matrix parameters with matrix norms as opposed to vector norms.\n\nii) We develop a natural generalization of adaptive methods (ADAgrad, RMSprop) to the non-\nEuclidean gradient setting that combines adaptive step-size methods with non-Euclidean gra-\ndient methods. 
These algorithms have robust tuning parameters and greatly improve the convergence and the solution quality of the SSD algorithm via local adaptation. We denote these new algorithms as RMSspectral and ADAspectral to mark their relationships to Stochastic Spectral Descent and to RMSprop and ADAgrad.

iii) We develop a fast approximation to our algorithm iterates based on the randomized SVD algorithm [9]. This greatly reduces the per-iteration overhead when using the Schatten-∞ norm.

iv) We empirically validate these ideas by applying them to RBMs, deep belief nets, feedforward neural nets, and convolutional neural nets. We demonstrate major speedups on all models, and demonstrate improved fit for the RBM and the deep belief net.

We denote vectors as bold lower-case letters, and matrices as bold upper-case letters. Operations ⊙ and ⊘ denote element-wise multiplication and division, and √X the element-wise square root of X. 1 denotes the matrix with all 1 entries. ||x||_p denotes the standard ℓ_p norm of x. ||X||_{S_p} denotes the Schatten-p norm of X, which is ||s||_p with s the singular values of X. ||X||_{S_∞} is the largest singular value of X, which is also known as the matrix 2-norm or the spectral norm.

2 Preconditioned Non-Euclidean Algorithms

We first review non-Euclidean gradient descent algorithms in Section 2.1. Section 2.2 motivates and discusses preconditioned non-Euclidean gradient descent. Dynamic preconditioners are discussed in Section 2.3, and fast approximations are discussed in Section 2.4.

2.1 Non-Euclidean Gradient Descent

Unless otherwise mentioned, proofs for this section may be found in [13]. Consider the minimization of a closed proper convex function F(x) with Lipschitz gradient ||∇F(x) − ∇F(y)||_q ≤ L_p ||x − y||_p for all x, y, where p and q are dual to each other, and L_p > 0 is the smoothness constant. 
This Lipschitz gradient implies the following majorization bound, which is useful in optimization:

F(y) ≤ F(x) + ⟨∇F(x), y − x⟩ + (L_p/2) ||y − x||_p^2.    (1)

A natural strategy to minimize F(x) is to iteratively minimize the right-hand side of (1). Defining the #-operator as s# ∈ arg max_x {⟨s, x⟩ − (1/2)||x||_p^2}, this approach yields the algorithm:

x_{k+1} = x_k − (1/L_p) [∇F(x_k)]#,  where k is the iteration count.    (2)

For p = q = 2, (2) is simply gradient descent, and s# = s. In general, (2) can be viewed as gradient descent in a non-Euclidean norm.
To explore which norm ||x||_p leads to the fastest convergence, we note the convergence rate of (2) is F(x_k) − F(x*) = O(L_p ||x_0 − x*||_p^2 / k), where x* is a minimizer of F(·). If we have an L_p such that (1) holds and L_p ||x_0 − x*||_p^2 ≪ L_2 ||x_0 − x*||_2^2, then (2) can lead to superior convergence. One such example is presented in [13], where the authors proved that L_∞ ||x_0 − x*||_∞^2 improves a dimension-dependent factor over gradient descent for a class of problems in computer science. Moreover, they showed that the algorithm in (2) demands very little computational overhead for their problems, and hence ||·||_∞ is favored over ||·||_2.

2

Figure 1: Updates from parameters W_k for a multivariate logistic regression. (Left) 1st order approximation error at parameter W_k + s_1 u_1 v_1^T + s_2 u_2 v_2^T, with {u_1, u_2, v_1, v_2} singular vectors of ∇_W f(W). (Middle) 1st order approximation error at parameter W_k + s_1 ũ_1 ṽ_1^T + s_2 ũ_2 ṽ_2^T, with {ũ_1, ũ_2, ṽ_1, ṽ_2} singular vectors of √D ⊙ ∇_W f(W), with D a preconditioner matrix. (Right) Shape of the error implied by the Frobenius norm and the Schatten-∞ norm. After preconditioning, the error surface matches the shape implied by the Schatten-∞ norm and not the Frobenius norm.

As noted in [1], for the log-sum-exp function, lse(α) = log Σ_{i=1}^N ω_i exp(α_i), the constant L_2 is ≤ 1/2 and Ω(1/log(N)), whereas the constant L_∞ is ≤ 1. If α are (possibly dependent) N zero-mean sub-Gaussian random variables, the convergence for the log-sum-exp objective function is improved by at least N/log^2(N) (see Supplemental Section A.1 for details). As well, non-Euclidean gradient descent can be adapted to the stochastic setting [2].
The log-sum-exp function reoccurs frequently in the cost function of deep learning models. Analyzing the majorization bounds that are dependent on the log-sum-exp function with respect to the model parameters in deep learning reveals majorization functions dependent on the Schatten-∞ norm. This was shown previously for the RBM in [1], and we show a general approach in Supplemental Section A.2 and specific results for feed-forward neural nets in Section 3.2. Hence, we propose to optimize these deep networks with the Schatten-∞ norm.

2.2 Preconditioned Non-Euclidean Gradient Descent

It has been established that the loss functions of neural networks exhibit pathological curvature [19]: the loss function is essentially flat in some directions, while it is highly curved in others. The regions of high curvature dominate the step-size in gradient descent. A solution to the above problem is to rescale the parameters so that the loss function has similar curvature along all directions. 
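To make the ℓ∞ case of iteration (2) concrete before turning to preconditioning, the sketch below runs ℓ∞-norm descent on a small log-sum-exp objective, using the ℓ∞ sharp operator x♭ = ||x||_1 × sign(x) given later in Section 2.4. The added quadratic term (which keeps the toy problem bounded below), the step-size, and the starting point are illustrative assumptions, not values from the paper.

```python
import numpy as np

def flat_linf(g):
    # l-infinity sharp operator (written x-flat in Section 2.4): ||g||_1 * sign(g)
    return np.abs(g).sum() * np.sign(g)

def f(x):
    # toy objective: log-sum-exp plus a small quadratic to bound it below
    m = x.max()
    return m + np.log(np.exp(x - m).sum()) + 0.5 * (x @ x)

def grad_f(x):
    p = np.exp(x - x.max())
    return p / p.sum() + x

x = np.array([3.0, -1.0, 0.5, 2.0])
f0 = f(x)
for _ in range(200):
    x = x - 0.01 * flat_linf(grad_f(x))   # iteration (2) with p = infinity
f_final = f(x)
```

Since ⟨g, g♭⟩ = ||g||_1^2 ≥ 0, the ♭ direction is always a descent direction, so a small enough fixed step decreases the objective.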
The basis of recent adaptive methods (ADAgrad, RMSprop) is in preconditioned gradient descent, with iterates

x_{k+1} = x_k − ε_k D_k^{−1} ∇F(x_k).    (3)

We restrict without loss of generality the preconditioner D_k to a positive definite diagonal matrix, and ε_k > 0 is a chosen step-size. Letting ⟨y, x⟩_D ≜ ⟨y, Dx⟩ and ||x||_D^2 ≜ ⟨x, x⟩_D, we note that the iteration in (3) corresponds to the minimizer of

F̃(y) ≜ F(x_k) + ⟨∇F(x_k), y − x_k⟩ + (1/(2ε_k)) ||y − x_k||_{D_k}^2.    (4)

Consequently, for (3) to perform well, F̃(y) has to either be a good approximation or a tight upper bound of F(y), the true function value. This is equivalent to saying that the first order approximation error F(y) − F(x_k) − ⟨∇F(x_k), y − x_k⟩ is better approximated by the scaled Euclidean norm. The preconditioner D_k controls the scaling, and the choice of D_k depends on the objective function.
As we are motivated to use Schatten-∞ norms for our models, the above reasoning leads us to consider a variable metric non-Euclidean approximation. For a matrix X, let us denote D to be an element-wise preconditioner. Note that D is not a diagonal matrix in this case. Because the operations here are element-wise, this would correspond to the case above with a vectorized form of X and a preconditioner of diag(vec(D)). Let ||X||_{D,S_∞} = ||√D ⊙ X||_{S_∞}. We consider the following surrogate of F,

F(Y) ≈ F(X_k) + ⟨∇F(X_k), Y − X_k⟩ + (1/(2ε_k)) ||Y − X_k||_{D_k,S_∞}^2.    (5)

3

Using the #-operator from Section 2.1, the minimizer of (5) takes the form (see Supplementary Section C for the proof):

X_{k+1} = X_k − ε_k [∇F(X_k) ⊘ √D_k]# ⊘ √D_k.    (6)

We note that classification with a softmax link naturally operates on the Schatten-∞ norm. As an illustrative example of the applicability of this norm, we show the first order approximation error for the objective function in this model, where the distribution on the class y depends on covariates x, y ∼ categorical(softmax(Wx)). Figure 1 (left) shows the error surfaces on W without the preconditioner, where the uneven curvature will lead to poor updates. The Jacobi (diagonal of the Hessian) preconditioned error surface is shown in Figure 1 (middle), where the curvature has been made homogeneous. However, the shape of the error does not follow the Euclidean (Frobenius) norm, but instead the geometry from the Schatten-∞ norm shown in Figure 1 (right). Since many deep networks use the softmax and log-sum-exp to define a probability distribution over possible classes, adapting to the inherent geometry of this function can benefit learning in deeper layers.

2.3 Dynamic Learning of the Preconditioner

Our algorithms amount to choosing an ε_k and preconditioner D_k. We propose to use the preconditioners from ADAgrad [7] and RMSprop [5]. 
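A minimal NumPy sketch of the pieces just introduced: the Schatten-∞ #-operator (defined in Section 2.4), the two squared-gradient recursions that build the preconditioner, and the preconditioned spectral update (6). Function names and tuning values are illustrative, not from the paper.

```python
import numpy as np

def sharp(G):
    # Schatten-infinity #-operator (Section 2.4): G# = ||s||_1 * U V^T
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return s.sum() * (U @ Vt)

def update_preconditioner(V, G, method="rmsprop", alpha=0.9, lam=1e-4):
    # Section 2.3 recursions: accumulate element-wise squared gradients
    if method == "rmsprop":
        V = alpha * V + (1.0 - alpha) * G * G   # exponential moving average
    else:                                       # "adagrad": running sum
        V = V + G * G
    D = lam + np.sqrt(V)                        # D_{k+1} = lambda*1 + sqrt(V_{k+1})
    return V, D

def preconditioned_spectral_step(X, G, D, eps):
    # update (6): X_{k+1} = X_k - eps * [G / sqrt(D)]# / sqrt(D), '/' element-wise
    rD = np.sqrt(D)
    return X - eps * sharp(G / rD) / rD
```

With D set to the all-ones matrix, the step reduces to the plain Stochastic Spectral Descent update X − ε G#.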
These preconditioners are given below:

D_{k+1} = λ1 + √V_{k+1},  where
V_{k+1} = αV_k + (1 − α) (∇f(X_k)) ⊙ (∇f(X_k))   (RMSprop),
V_{k+1} = V_k + (∇f(X_k)) ⊙ (∇f(X_k))            (ADAgrad).

The λ term is a tuning parameter controlling the extremes of the curvature in the preconditioner. The updates in ADAgrad have provably improved regret bound guarantees for convex problems over gradient descent with the iterates in (3) [7]. ADAgrad and ADAdelta [30] have been applied successfully to neural nets. The updates in RMSprop were shown in [5] to approximate the equilibration preconditioner, and have also been successfully applied in autoencoders and supervised neural nets. Both methods require a tuning parameter λ, and RMSprop also requires a term α that controls historical smoothing.
We propose two novel algorithms that both use the iterate in (6). The first uses the ADAgrad preconditioner, which we call ADAspectral. The second uses the RMSprop preconditioner, which we call RMSspectral.

2.4 The #-Operator and Fast Approximations

Letting X = U diag(s) V^T be the SVD of X, the #-operator for the Schatten-∞ norm (also known as the spectral norm) can be computed as follows [1]: X# = ||s||_1 U V^T.
Depending on the cost of the gradient estimation, this computation may be relatively cheap [1] or quite expensive. In situations where the gradient estimate is relatively cheap, the exact #-operator demands significant overhead. Instead of calculating the full SVD, we utilize a randomized SVD algorithm [9, 22]. For N ≤ M, this reduces the cost from O(MN^2) to O(Mk^2 + MN log(k)), with k the number of projections used in the algorithm. 
Letting Ũ diag(s̃) Ṽ^T ≈ X represent the rank-(k+1) approximate SVD, the approximate #-operator corresponds to the low-rank approximation and the reweighted remainder, X# ≈ ||s̃||_1 ( Ũ_{1:k} Ṽ_{1:k}^T + s̃_{k+1}^{−1} (X − Ũ_{1:k} diag(s̃_{1:k}) Ṽ_{1:k}^T) ).
We note that the #-operator is also defined for the ℓ_∞ norm; however, for notational clarity, we will denote this as x♭ and leave the # notation for the Schatten-∞ case. This x♭ solution was given in [13, 1] as x♭ = ||x||_1 × sign(x). Pseudocode for these operations is in the Supplementary Materials.

3 Applicability of Schatten-∞ Bounds to Models

3.1 Restricted Boltzmann Machines (RBM)

RBMs [26, 11] are bipartite Markov Random Field models that form probabilistic generative models over a collection of data. They are useful both as generative models and for “pre-training” deep networks [11, 8]. In the binary case, the observations are binary v ∈ {0, 1}^M with connections to latent (hidden) binary units, h ∈ {0, 1}^J. The probability for each state {v, h} is defined

4

by parameters θ = {W, c, b} with the energy −E_θ(v, h) ≜ c^T v + v^T W h + h^T b and probability p_θ(v, h) ∝ exp(−E_θ(v, h)). The maximum likelihood estimator implies the objective function min_θ F(θ) = −(1/N) Σ_{n=1}^N log Σ_h exp(−E_θ(v_n, h)) + log Σ_{v,h} exp(−E_θ(v, h)).
This objective function is generally intractable, although an accurate but computationally intensive estimator is given via Annealed Importance Sampling (AIS) [21, 24]. The gradient can be comparatively quickly estimated by taking a small number of Gibbs sampling steps in a Monte Carlo Integration scheme (Contrastive Divergence) [12, 28]. 
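The approximate #-operator above can be sketched as follows. For brevity, the rank-(k+1) factors here come from a truncated deterministic SVD rather than the randomized SVD of [9, 22]; when k+1 equals the full rank, the approximation recovers the exact #-operator, which the test below checks.

```python
import numpy as np

def sharp_exact(X):
    # exact #-operator: ||s||_1 * U V^T
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return s.sum() * (U @ Vt)

def sharp_approx(X, k):
    # approximate #-operator from a rank-(k+1) SVD: the top-k part is used
    # exactly and the remainder is rescaled by 1 / s_{k+1}
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # stand-in for randomized SVD
    low_rank = U[:, :k] @ Vt[:k]
    remainder = (X - U[:, :k] @ np.diag(s[:k]) @ Vt[:k]) / s[k]
    return s[:k + 1].sum() * (low_rank + remainder)
```

In practice the randomized SVD makes this per-iteration overhead small relative to gradient estimation.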
Due to the noisy nature of the gradient estimation and the intractable objective function, second order methods and line search methods are inappropriate and SGD has traditionally been used [16]. [1] proposed an upper bound on perturbations to W of

F({W + U, b, c}) ≤ F({W, b, c}) + ⟨∇_W F({W, b, c}), U⟩ + (MJ/2) ||U||_{S_∞}^2.

Algorithm 1 RMSspectral for RBMs
Inputs: ε_1, . . ., λ, α, N_b
Parameters: θ = {W, b, c}
History terms: V_W, v_b, v_c
for i = 1, . . . do
    Sample a minibatch of size N_b
    Estimate gradient (dW, db, dc)
    % Update matrix parameter
    V_W = α V_W + (1 − α) dW ⊙ dW
    D_W^{1/2} = √(λ1 + √V_W)
    W = W − ε_i (dW ⊘ D_W^{1/2})# ⊘ D_W^{1/2}
    % Update bias term b
    v_b = α v_b + (1 − α) db ⊙ db
    d_b^{1/2} = √(λ1 + √v_b)
    b = b − ε_i (db ⊘ d_b^{1/2})♭ ⊘ d_b^{1/2}
    % Same for c
end for

This majorization motivated the Stochastic Spectral Descent (SSD) algorithm, which uses the #-operator in Section 2.4. In addition, bias parameters b and c were bound on the ℓ_∞ norm and use the ♭ updates from Section 2.4 [1]. In their experiments, this method showed significantly improved performance over competing algorithms for mini-batches of 2J and CD-25 (number of Gibbs sweeps), where the computational cost of the #-operator is insignificant. This motivates using the preconditioned spectral descent methods, and we show our proposed RMSspectral method in Algorithm 1.
When the RBM is used to “pre-train” deep models, CD-1 is typically used (1 Gibbs sweep). One such model is the Deep Belief Net, where parameters are effectively learned by repeatedly learning RBM models [11, 24]. In this case, the SVD operation adds significant overhead. Therefore, the fast approximation of Section 2.4 and the adaptive methods result in vast improvements. These enhancements naturally extend to the Deep Belief Net, and results are detailed in Section 4.1.

3.2 Supervised Feedforward Neural Nets

Algorithm 2 RMSspectral for FNN
Inputs: ε_1, . . ., λ, α, N_b
Parameters: θ = {W_0, . . ., W_L}
History terms: V_0, . . ., V_L
for i = 1, . . . do
    Sample a minibatch of size N_b
    Estimate gradient by backprop (dW_ℓ)
    for ℓ = 0, . . ., L do
        V_ℓ = α V_ℓ + (1 − α) dW_ℓ ⊙ dW_ℓ
        D_ℓ^{1/2} = √(λ1 + √V_ℓ)
        W_ℓ = W_ℓ − ε_i (dW_ℓ ⊘ D_ℓ^{1/2})# ⊘ D_ℓ^{1/2}
    end for
end for

Feedforward Neural Nets are widely used models for classification problems. We consider L layers of hidden variables with deterministic nonlinear link functions and a softmax classifier at the final layer. Ignoring bias terms for clarity, an input x is mapped through a linear transformation and a nonlinear link function η(·) to give the first layer of hidden nodes, α_1 = η(W_0 x). This process continues with α_ℓ = η(W_{ℓ−1} α_{ℓ−1}). At the last layer, we set h = W_L α_L and a J-dimensional class vector is drawn y ∼ categorical(softmax(h)). The standard approach for parameter learning is to minimize the objective function that corresponds to the (penalized) maximum likelihood objective function over the parameters θ = {W_0, . . ., W_L} and data examples {x_1, . . ., x_N}, which is given by:

θ_ML = arg min_θ f(θ) = (1/N) Σ_{n=1}^{N} ( −y_n^T h_{n,θ} + log Σ_{j=1}^{J} exp(h_{n,θ,j}) ).    (7)

While there have been numerous recent papers detailing different optimization approaches to this objective [7, 6, 5, 16, 19], we are unaware of any approaches that attempt to derive non-Euclidean bounds. 
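Algorithms 1 and 2 share one core update: accumulate an RMSprop-style squared-gradient history, form D^{1/2}, and take the preconditioned #-step of (6) for each weight matrix. A sketch of the per-layer loop of Algorithm 2, with all sizes and tuning values illustrative:

```python
import numpy as np

def rmsspectral_step(Ws, dWs, Vs, eps=1e-3, lam=1e-4, alpha=0.9):
    # one iteration of the inner loop of Algorithm 2 (RMSspectral for FNN):
    # the preconditioned #-step of (6), applied to each layer independently
    for l in range(len(Ws)):
        Vs[l] = alpha * Vs[l] + (1.0 - alpha) * dWs[l] * dWs[l]
        D_half = np.sqrt(lam + np.sqrt(Vs[l]))          # D_l^{1/2}
        U, s, Vt = np.linalg.svd(dWs[l] / D_half, full_matrices=False)
        Ws[l] = Ws[l] - eps * s.sum() * (U @ Vt) / D_half
    return Ws, Vs
```

The bias vectors of Algorithm 1 would use the ♭-step (||g||_1 sign(g)) in place of the SVD-based #-step.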
As a result, we explore the properties of this objective function. We show the key results here and provide further details on the general framework in Supplemental Section A.2 and the specific derivation in Supplemental Section D.

5

Figure 2: A normalized time unit is 1 SGD iteration. (Left) Reconstruction error from training the MNIST dataset using CD-1. (Middle) Log-likelihood of training MNIST using Persistent CD-25. (Right) Log-likelihood of training Caltech-101 Silhouettes using Persistent CD-25.

By using properties of the log-sum-exp function from [1, 2], the objective function from (7) has an upper bound,

f(φ) ≤ f(θ) + ⟨∇_θ f(θ), φ − θ⟩ + (1/N) Σ_{n=1}^{N} ( (1/2) max_j (h_{n,φ,j} − h_{n,θ,j})^2 + 2 max_j |h_{n,φ,j} − h_{n,θ,j} − ⟨∇_θ h_{n,θ,j}, φ − θ⟩| ).    (8)

We note that this implicitly requires the link function to have a Lipschitz continuous gradient. Many commonly used links, including logistic, hyperbolic tangent, and smoothed rectified linear units, have Lipschitz continuous gradients, but rectified linear units do not. In this case, we will just proceed with the subgradient. 
A strict upper bound on these parameters is highly pessimistic, so instead we propose to take a local approximation around the parameter W_ℓ in each layer individually. Considering a perturbation U around W_ℓ, the terms in (8) have the following upper bounds:

(h_{φ,j} − h_{θ,j})^2 ≲ ||U||_{S_∞}^2 ||α_ℓ||_2^2 ||∇_{α_{ℓ+1}} h_j||_2^2 max_x η′(x)^2,
|h_{φ,j} − h_{θ,j} − ⟨∇_θ h_{θ,j}, φ − θ⟩| ≲ (1/2) ||U||_{S_∞}^2 ||α_ℓ||_2^2 ||∇_{α_{ℓ+1}} h_j||_∞ ||∇_{α_ℓ} h_j||_∞ max_x |η″(x)|,

where η′(x) = (d/dt) η(t)|_{t=x} and η″(x) = (d²/dt²) η(t)|_{t=x}. Because both α_ℓ and ∇_{α_{ℓ+1}} h_j can easily be calculated during the standard backpropagation procedure for gradient estimation, this can be calculated without significant overhead. Since these equations are bounded on the Schatten-∞ norm, this motivates using the Stochastic Spectral Descent algorithm with the #-operator applied to the weight matrix for each layer individually.
However, the proposed updates require the calculation of many additional terms; as well, they are pessimistic and do not consider the inhomogeneous curvature. Instead of attempting to derive the step-sizes, both RMSspectral and ADAspectral will learn appropriate element-wise step-sizes by using the gradient history. Then, the preconditioned #-operator is applied to the weights from each layer individually. The RMSspectral method for feed-forward neural nets is shown in Algorithm 2.
It is unclear how to use non-Euclidean geometry for convolution layers [14], as the pooling and convolution create alternative geometries. 
However, the ADAspectral and RMSspectral algorithms can be applied to convolutional neural nets by using the non-Euclidean steps on the dense layers and linear updates from ADAgrad and RMSprop on the convolutional filters. The benefits from the dense layers then propagate down to the convolutional layers.

4 Experiments

4.1 Restricted Boltzmann Machines

To show the use of the approximate #-operator from Section 2.4 as well as RMSspectral and ADAspectral, we first perform experiments on the MNIST dataset. The dataset was binarized as in [24]. We detail the algorithmic settings used in these experiments in Supplemental Table 1, which are chosen to match previous literature on the topic. The batch size was chosen to be 1000 data points, which matches [1]. This is larger than is typical in the RBM literature [24, 10], but we found that all algorithms improved their final results with larger batch-sizes due to reduction in sampling noise.

6

The analysis supporting the SSD algorithm does not directly apply to the CD-1 learning procedure, so it is of interest to examine how well it generalizes to this framework. To examine the effect of CD-1 learning, we used reconstruction error with J = 500 hidden, latent variables. Reconstruction error is a standard heuristic for analyzing convergence [10], and is defined by taking ||v − v̂||_2, where v is an observation and v̂ is the mean value for a CD-1 pass from that sample. This result is shown in Figure 2 (left), with all algorithms normalized to the amount of time it takes for a single SGD iteration. 
The full #-operator in the SSD algorithm adds signi\ufb01cant overhead to each\niteration, so the SSD algorithm does not provide competitive performance in this situation. The\nSSD-F, ADAspectral, and RMSspectral algorithms use the approximate #-operator. Combining\nthe adaptive nature of RMSprop with non-Euclidean optimization provides dramatically improved\nperformance, seemingly converging faster and to a better optimum.\nHigh CD orders are necessary to \ufb01t the ML estimator of an RBM [24]. To this end, we use the\nPersistent CD method of [28] with 25 Gibbs sweeps per iteration. We show the log-likelihood of the\ntraining data as a function of time in Figure 2(middle). The log-likelihood is estimated using AIS\nwith the parameters and code from [24]. There is a clear divide with improved performance from the\nSchatten-\u221e based methods. There is further improved performance by including preconditioners.\nAs well as showing improved training, the test set has an improved log-likelihood of -85.94 for\nRMSspec and -86.04 for SSD.\nFor further exploration, we trained a Deep Belief Net with two hidden layers of size 500-2000 to\nmatch [24]. We trained the \ufb01rst hidden layer with CD-1 and RMSspectral, and the second layer\nwith PCD-25 and RMSspectral. We used the same model sizes, tuning parameters, and evaluation\nparameters and code from [24], so the only change is due to the optimization methods. Our estimated\nlower-bound on the performance of this model is -80.96 on the test set. This compares to -86.22 from\n[24] and -84.62 for a Deep Boltzmann Machine from [23]; however, we caution that these numbers\nno longer re\ufb02ect true performance on the test set due to bias from AIS and repeated over\ufb01tting [23].\nHowever, this is a fair comparison because we use the same settings and the evaluation code.\nFor further evidence, we performed the same maximum-likelihood experiment on the Caltech-101\nSilhouettes dataset [17]. 
This dataset was previously used to demonstrate the effectiveness of an\nadaptive gradient step-size and Enhanced Gradient method for Restricted Boltzmann Machines [3].\nThe training curves for the log-likelihood are shown in Figure 2 (right). Here, the methods based on\nthe Schatten-\u221e norm give state-of-the-art results in under 1000 iterations, and thoroughly dominate\nthe learning. Furthermore, both ADAspectral and RMSspectral saturate to a higher value on the\ntraining set and give improved testing performance. On the test set, the best result from the non-\nEuclidean methods gives a testing log-likelihood of -106.18 for RMSspectral, and a value of -109.01\nfor RMSprop. These values all improve over the best reported value from SGD of -114.75 [3].\n\n4.2 Standard and Convolutional Neural Networks\nCompared to RBMs and other popular machine learning models, standard feed-forward neural nets\nare cheap to train and evaluate. The following experiments show that even in this case where the\ncomputation of the gradient is ef\ufb01cient, our proposed algorithms produce a major speed up in con-\nvergence, in spite of the per-iteration cost associated with approximating the SVD of the gradient.\nWe demonstrate this claim using the well-known MNIST and Cifar-10 [15] image datasets.\nBoth datasets are similar in that they pose a classi\ufb01cation task over 10 possible classes. However,\nCIFAR-10, consisting of 50K RGB images of vehicles and animals, with an additional 10K images\nreserved for testing, poses a considerably more dif\ufb01cult problem than MNIST, with its 60K greyscale\nimages of hand-written digits, plus 10K test samples. 
This fact is indicated by the state-of-the-art accuracy on the MNIST test set reaching 99.79% [29], with the same architecture achieving “only” 90.59% accuracy on CIFAR-10.
To obtain the state-of-the-art performance on these datasets, it is necessary to use various types of data pre-processing methods, regularization schemes and data augmentation, all of which have a big impact on model generalization [14]. In our experiments we only employ ZCA whitening on the CIFAR-10 data [15], since these methods are not the focus of this paper. Instead, we focus on the comparative performance of the various algorithms on a variety of models.
We trained neural networks with zero, one and two hidden layers, with various hidden layer sizes, and with both logistic and rectified linear unit (ReLU) non-linearities [20]. Algorithm parameters can be found in Supplemental Table 2.

7

Figure 3: (Left) Log-likelihood of the current training batch on the MNIST dataset. (Middle) Log-likelihood of the current training batch on CIFAR-10. (Right) Accuracy on the CIFAR-10 test set.

We observed fairly consistent performance across the various configurations, with spectral methods yielding greatly improved performance over their Euclidean counterparts. Figure 3 shows convergence curves in terms of log-likelihood on the training data as learning proceeds. For both MNIST and CIFAR-10, SSD with estimated Lipschitz steps outperforms SGD. Also clearly visible is the big impact of using local preconditioning to fit the local geometry of the objective, amplified by using the spectral methods.
Spectral methods also improve the convergence of convolutional neural nets (CNNs). In this setting, we apply the #-operator only to fully connected linear layers. Preconditioning is performed for all layers, i.e., when using RMSspectral for linear layers, the convolutional layers are updated via RMSprop. 
We applied our algorithms to CNNs with one, two and three convolutional layers, followed by two fully-connected layers. Each convolutional layer was followed by max pooling and a ReLU non-linearity. We used 5 × 5 filters, ranging from 32 to 64 filters per layer.
We evaluated the MNIST test set using a two-layer convolutional net with 64 kernels. The best generalization performance on the test set after 100 epochs was achieved by both RMSprop and RMSspectral, with an accuracy of 99.15%. RMSspectral obtained this level of accuracy after only 40 epochs, less than half of what RMSprop required.
To further demonstrate the speed-up, we trained on CIFAR-10 using a deeper net with three convolutional layers, following the architecture used in [29]. In Figure 3 (Right) the test set accuracy is shown as training proceeds with both RMSprop and RMSspectral. While they eventually achieve similar accuracy rates, RMSspectral reaches that rate four times faster.

5 Discussion
In this paper we have demonstrated that many deep models naturally operate with non-Euclidean geometry, and that exploiting this yields remarkable improvements in training efficiency, as well as improved local optima. Also, by using adaptive methods, algorithms can use the same tuning parameters across different model sizes and configurations. We find that in the RBM and DBN, improving the optimization can give dramatic performance improvements on both the training and the test set. For feedforward neural nets, the training efficiency of the proposed methods gives staggering improvements to the training performance.
While the training performance is drastically better via the non-Euclidean quasi-Newton methods, the performance on the test set is improved for RBMs and DBNs, but not in feedforward neural networks.
However, because our proposed algorithms fit the model significantly faster, they can help improve Bayesian optimization schemes [27] to learn appropriate penalization strategies and model configurations. Furthermore, these methods can be adapted to dropout [14] and other recently proposed regularization schemes to help achieve state-of-the-art performance.

Acknowledgements The research reported here was funded in part by ARO, DARPA, DOE, NGA and ONR, and in part by the European Commission under grants MIRG-268398 and ERC Future Proof, by the Swiss Science Foundation under grants SNF 200021-146750, SNF CRSII2-147633, and the NCCR Marvel. We thank the reviewers for their helpful comments.

References
[1] D. Carlson, V. Cevher, and L. Carin. Stochastic Spectral Descent for Restricted Boltzmann Machines. AISTATS, 2015.
[2] D. Carlson, Y.-P. Hsieh, E. Collins, L. Carin, and V. Cevher. Stochastic Spectral Descent for Discrete Graphical Models. IEEE J. Special Topics in Signal Processing, 2016.
[3] K. Cho, T. Raiko, and A. Ilin. Enhanced Gradient for Training Restricted Boltzmann Machines. Neural Computation, 2013.
[4] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The Loss Surfaces of Multilayer Networks. AISTATS, 2015.
[5] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv:1502.04390, 2015.
[6] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, 2014.
[7] J. Duchi, E. Hazan, and Y. Singer.
Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2010.
[8] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why Does Unsupervised Pre-training Help Deep Learning? JMLR, 2010.
[9] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 2011.
[10] G. Hinton. A Practical Guide to Training Restricted Boltzmann Machines. U. Toronto Technical Report, 2010.
[11] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[12] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
[13] J. A. Kelner, Y. T. Lee, L. Orecchia, and A. Sidford. An Almost-Linear-Time Algorithm for Approximate Max Flow in Undirected Graphs, and its Multicommodity Generalizations. 2013.
[14] A. Krizhevsky and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
[15] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. University of Toronto, Tech. Rep., 2009.
[16] Q. V. Le, A. Coates, B. Prochnow, and A. Y. Ng. On Optimization Methods for Deep Learning. ICML, 2011.
[17] B. Marlin and K. Swersky. Inductive principles for restricted Boltzmann machine learning. ICML, 2010.
[18] J. Martens and R. Grosse. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. arXiv:1503.05671, 2015.
[19] J. Martens and I. Sutskever. Parallelizable Sampling of Markov Random Fields. AISTATS, 2010.
[20] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[21] R. M. Neal. Annealed Importance Sampling. U. Toronto Technical Report, 1998.
[22] V. Rokhlin, A. Szlam, and M. Tygert. A Randomized Algorithm for Principal Component Analysis.
SIAM Journal on Matrix Analysis and Applications, 2010.
[23] R. Salakhutdinov and G. Hinton. Deep Boltzmann Machines. AISTATS, 2009.
[24] R. Salakhutdinov and I. Murray. On the Quantitative Analysis of Deep Belief Networks. ICML, 2008.
[25] T. Schaul, S. Zhang, and Y. LeCun. No More Pesky Learning Rates. arXiv:1206.1106, 2012.
[26] P. Smolensky. Information Processing in Dynamical Systems: Foundations of Harmony Theory. 1986.
[27] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In NIPS, 2012.
[28] T. Tieleman and G. Hinton. Using fast weights to improve persistent contrastive divergence. ICML, 2009.
[29] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.
[30] M. D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701, 2012.