{"title": "Hamiltonian descent for composite objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 14470, "page_last": 14480, "abstract": "In optimization the duality gap between the primal and the dual problems is a measure of the suboptimality of any primal-dual point. In classical mechanics the equations of motion of a system can be derived from the Hamiltonian function, which is a quantity that describes the total energy of the system. In this paper we consider a convex optimization problem consisting of the sum of two convex functions, sometimes referred to as a composite objective, and we identify the duality gap to be the 'energy' of the system. In the Hamiltonian formalism the energy is conserved, so we add a contractive term to the standard equations of motion so that this energy decreases linearly (i.e., geometrically) with time. This yields a continuous-time ordinary differential equation (ODE) in the primal and dual variables which converges to zero duality gap, i.e., optimality. This ODE has several useful properties: it induces a natural operator splitting; at convergence it yields both the primal and dual solutions; and it is invariant to affine transformation despite only using first order information. We provide several discretizations of this ODE, some of which are new algorithms and others correspond to known techniques, such as the alternating direction method of multipliers (ADMM). We conclude with some numerical examples that show the promise of our approach. We give an example where our technique can solve a convex quadratic minimization problem orders of magnitude faster than several commonly-used gradient methods, including conjugate gradient, when the conditioning of the problem is poor.
Our framework provides new insights into previously known algorithms in the literature as well as providing a technique to generate new primal-dual algorithms.", "full_text": "Hamiltonian descent for composite objectives\n\nBrendan O’Donoghue\n\nDeepMind\n\nChris J. Maddison\n\nDeepMind / University of Oxford\n\nAbstract\n\nIn optimization the duality gap between the primal and the dual problems is a measure of the suboptimality of any primal-dual point. In classical mechanics the equations of motion of a system can be derived from the Hamiltonian function, which is a quantity that describes the total energy of the system. In this paper we consider a convex optimization problem consisting of the sum of two convex functions, sometimes referred to as a composite objective, and we identify the duality gap to be the ‘energy’ of the system. In the Hamiltonian formalism the energy is conserved, so we add a contractive term to the standard equations of motion so that this energy decreases linearly (i.e., geometrically) with time. This yields a continuous-time ordinary differential equation (ODE) in the primal and dual variables which converges to zero duality gap, i.e., optimality. This ODE has several useful properties: it induces a natural operator splitting; at convergence it yields both the primal and dual solutions; and it is invariant to affine transformation despite only using first order information. We provide several discretizations of this ODE, some of which are new algorithms and others correspond to known techniques, such as the alternating direction method of multipliers (ADMM).
We conclude with some numerical examples that show the promise of our approach. We give an example where our technique can solve a convex quadratic minimization problem orders of magnitude faster than several commonly-used gradient methods, including conjugate gradient, when the conditioning of the problem is poor. Our framework provides new insights into previously known algorithms in the literature as well as providing a technique to generate new primal-dual algorithms.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction and prior work\n\nIn physics the Hamiltonian function represents the total energy of a system in some set of coordinates (loosely speaking). In the most typical case the coordinates are the position x ∈ R^n and momentum p ∈ R^n, and the Hamiltonian is the sum of the potential energy, a function of the position, and the kinetic energy, a function of the momentum. The equations of motion for the system can be derived from the Hamiltonian. Let us denote the Hamiltonian as H : R^n × R^n → R, which we assume is differentiable; then the equations of motion [1] are given by\n\nẋ_t = ∇_p H(x_t, p_t),   ṗ_t = −∇_x H(x_t, p_t),\n\nwhere we use the notation ẋ_t := dx_t/dt. For ease of notation we shall sometimes use z := (x, p) ∈ R^{2n} to denote the concatenation of the position and momentum into a single quantity, in which case we can write the Hamiltonian flow as\n\nż_t = J∇H(z_t),   J = [0 I; −I 0] (block rows separated by the semicolon),   (1)\n\nand note that J^T J = I and that J is skew symmetric, that is J = −J^T, and so v^T J v = 0 for any v. It is easy to show that these equations of motion conserve the Hamiltonian since Ḣ(z_t) = ∇_z H(z_t)^T ż_t = ∇H(z_t)^T J∇H(z_t) = 0.
This conservation property is required for anything that models the energy of a system in the physical universe, but is not directly useful in optimization, where the goal is convergence to an optimum. By adding a contractive term to the Hamiltonian flow we derive an ordinary differential equation (ODE) whose solutions converge to a minimum of the Hamiltonian. We call the resulting flow “Hamiltonian descent”.\n\nIn optimization there has been a lot of recent interest in continuous-time ordinary differential equations (ODEs) that when discretized yield known or interesting novel algorithms [2, 3, 4]. In particular Su et al. [5] derived a simple ODE that corresponds to Nesterov’s accelerated gradient scheme [6], see also [7]. That work was extended in [8] where the authors derived a “Bregman Lagrangian” framework that generates a family of continuous-time ODEs corresponding to several discrete-time algorithms, including Nesterov’s accelerated gradient. This was extended in [9] to derive a novel acceleration algorithm. In [10] the authors used Lyapunov functions to analyze the convergence properties of continuous and discrete-time systems. There is a natural Hamiltonian perspective on the Bregman Lagrangian, which was exploited in [11] to derive optimization methods from symplectic integrators.\n\nIn a similar vein, the authors of [12] used a conformal Hamiltonian system to expand the class of functions for which linear convergence of first-order methods can be obtained by encoding information about the convex conjugate into a kinetic energy.
Follow-up work analyzed the properties of conformal symplectic integrators for these conformal Hamiltonian systems [13].\n\nHamiltonian mechanics have previously been applied to several areas outside of classical mechanics [14], most notably in Hamiltonian Monte Carlo (HMC), where the goal is to sample from a target distribution and Hamiltonian mechanics are used to propose moves in a Metropolis-Hastings algorithm; see [15] for a good survey. More recently Hamiltonian mechanics has been discussed in the context of game theory [16], where a symplectic gradient algorithm was developed that converges to stable fixed points of general games.\n\n1.1 The convex conjugate\n\nThe Hamiltonian as used in physics is derived by taking the Legendre transform (or convex conjugate) of one of the terms in the Lagrangian describing the system, which for a function f : R^n → R is defined as\n\nf*(p) = sup_x (x^T p − f(x)).\n\nThe function f* is always convex, even if f is not. When f is closed, proper, and convex, then (f*)* = f, and (∂f)^{−1} = ∂f*, where ∂f denotes the subdifferential of f, which for differentiable functions is just the gradient, i.e., ∂f = ∇f (or more precisely ∂f = {∇f}) [17].\n\n2 Hamiltonian descent\n\nA modification to the Hamiltonian flow equation (1) yields an ordinary differential equation whose solutions decrease the Hamiltonian linearly:\n\nż_t = J∇H(z_t) + z⋆ − z_t,   (2)\n\nwhere z⋆ ∈ argmin_z H(z). This departs from the standard Hamiltonian flow equations by the addition of the term involving the difference between z⋆ and z_t. One can view the Hamiltonian descent equation as a flow in a field consisting of the sum of a standard Hamiltonian field and the negative gradient field of the function (1/2)‖z_t − z⋆‖_2^2.
Solutions to this differential equation descend the level sets of the Hamiltonian and so we refer to (2) as Hamiltonian descent equations. Note that this flow is different to the dissipative flows using conformal Hamiltonian mechanics studied in [12, 13], which are also Hamiltonian descent methods but employ a different dissipative force. We shall show the linear convergence of solutions of (2) to a minimum of the Hamiltonian function; first we will state a necessary assumption:\n\nAssumption 1. The Hamiltonian H together with a point (x⋆, p⋆) = z⋆ ∈ argmin_z H(z) satisfy the following:\n\n• z⋆ = argmin_z H(z) is unique,\n\n• H(z) ≥ H(z⋆) = 0 for all z ∈ R^{2n},\n\n• H is proper, closed, convex,\n\n• H is continuously differentiable.\n\nTheorem 1. If z_t is following the equations of motion in (2) where z⋆ and the Hamiltonian function satisfy assumption 1, then the Hamiltonian converges to zero linearly (i.e., geometrically). Furthermore, z_t converges to z⋆ and ż_t converges to zero.\n\nProof. Consider the time derivative of the Hamiltonian:\n\nḢ(z_t) = ∇H(z_t)^T ż_t = ∇H(z_t)^T (J∇H(z_t) + z⋆ − z_t) ≤ −H(z_t),   (3)\n\nsince J is skew-symmetric, H(z⋆) = 0 and H is convex. Grönwall’s inequality [18] then implies that 0 ≤ H(z_t) ≤ H(z_0) exp(−t) and so H(z_t) → 0 linearly. Consider M = {z ∈ R^{2n} : ∇H(z)^T (z⋆ − z) = 0}. It is not too hard to see that M = {z⋆} and that M is an invariant set, since ∇H(z)^T (z⋆ − z) ≤ H(z⋆) − H(z) = −H(z) by convexity, so any z ∈ M has H(z) = 0. Because H has a unique minimum, its sublevel sets are bounded. Thus, we can apply Theorem 3.4 of [19] (Local Invariant Set Theorem) to argue that all solutions z_t → z⋆. Further, we have ∇H(z_t) → 0 by continuity and thus ż_t → 0.\n\nIn contrast, consider the gradient descent flow ż_t = −∇H(z_t), which also converges since\n\nḢ(z_t) = ∇H(z_t)^T ż_t = −‖∇H(z_t)‖_2^2 ≤ 0.\n\nIn this case, linear convergence is only guaranteed when H has some other property, such as strong convexity, which Hamiltonian descent does not require.\n\nIt may appear that these equations of motion are unrealizable without knowledge of a minimum of the Hamiltonian z⋆, which would defeat the goal of finding such a point. However, by a judicious choice of the Hamiltonian we can cancel the terms involving z⋆, and make the system realizable. For example, take the problem of minimizing convex f : R^n → R, and consider the following Hamiltonian\n\nH(x, p) = f(x) + f*(p) − p^T x⋆,\n\nwhere x⋆ is any minimizer of f. Note that (x⋆, 0) ∈ argmin_{(x,p)} H(x, p). Assuming f and f* are continuously differentiable and (x⋆, 0) is a unique minimum of H, then it is readily verified that this Hamiltonian satisfies assumption 1. So the solutions of the equations of motion will converge to a minimum of H linearly. In this case the flow is given by\n\nẋ_t = ∇_p H(x_t, p_t) + x⋆ − x_t = ∇f*(p_t) − x_t\nṗ_t = −∇_x H(x_t, p_t) + p⋆ − p_t = −∇f(x_t) − p_t,\n\nsince p⋆ = 0, and note that theorem 1 implies that ẋ_t → 0, ṗ_t → 0 and in the limit these equations reduce to the optimality condition for the problem, namely ∇f(x) = 0. However, this system requires the ability to evaluate ∇f*, which is as hard as the original problem (since x⋆ = ∇f*(0)).
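As a rough illustration of the realizable flow above (it is not taken from the paper's experiments), the sketch below applies a plain forward-Euler step to the two equations for a small quadratic f(x) = (1/2) x^T Q x - b^T x, for which grad f*(p) = Q^{-1} (p + b). The diagonal matrix Q, the vector b, the step size eps and the iteration count are all assumptions chosen for this toy case; the division by Q is exactly the expensive inverse that the text warns about, and it is cheap here only because Q is diagonal.

```python
# Forward-Euler sketch of the flow
#   dx/dt = grad f*(p) - x,   dp/dt = -grad f(x) - p
# for f(x) = 0.5 * x^T Q x - b^T x with a diagonal Q (assumed toy data).

Q = [4.0, 0.25]          # diagonal entries of Q (positive)
b = [1.0, -2.0]
eps = 0.1                # step size, picked by hand for this toy problem

def grad_f(x):
    return [Q[i] * x[i] - b[i] for i in range(2)]

def grad_f_conj(p):
    # grad f*(p) = Q^{-1} (p + b); trivial only because Q is diagonal
    return [(p[i] + b[i]) / Q[i] for i in range(2)]

def hamiltonian_descent(x, p, steps):
    for _ in range(steps):
        gx, gp = grad_f(x), grad_f_conj(p)
        x_new = [x[i] + eps * (gp[i] - x[i]) for i in range(2)]
        p_new = [p[i] + eps * (-gx[i] - p[i]) for i in range(2)]
        x, p = x_new, p_new
    return x, p

x_star = [b[i] / Q[i] for i in range(2)]   # minimizer of f, for comparison
x, p = hamiltonian_descent([5.0, 5.0], [1.0, 1.0], steps=500)
```

In this diagonal toy case each coordinate of the iteration contracts at the same rate whatever the entries of Q, so x approaches x_star and p approaches zero even though the two coordinates are scaled very differently.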
In the sequel we shall exploit the structure of composite optimization problems to avoid this requirement.\n\n2.1 Affine invariance\n\nThe Hamiltonian descent equations of motion (2) are invariant to a set of affine transformations. This property is very useful since it means that the performance of an algorithm based on these equations will be much less sensitive to the conditioning of the problem than, for example, gradient descent which does not enjoy affine invariance.\n\nTo show this property, consider a non-singular matrix M that satisfies M J M^T = J and consider the Hamiltonian in the new coordinate system,\n\nH̄(y) = H(M^{−1} y),\n\nwhere clearly y⋆ = M z⋆. At time τ we have the point y_τ, and let z_τ = M^{−1} y_τ. Running Hamiltonian descent in the transformed coordinates we obtain\n\nẏ_τ = J∇H̄(y_τ) + y⋆ − y_τ\n= J M^{−T} ∇H(M^{−1} y_τ) + M z⋆ − M z_τ\n= M (J∇H(z_τ) + z⋆ − z_τ)\n= M ż_τ.\n\nNow let z_0 = M^{−1} y_0, then we have y_t = y_0 + ∫_0^t ẏ_τ dτ = M z_0 + ∫_0^t M ż_τ dτ = M z_t for all t, and therefore H̄(y_t) = H(M^{−1} M z_t) = H(z_t), i.e., the original and transformed Hamiltonians have exactly the same value for all t and thus the rate of convergence is unchanged by the transformation. The condition on M is not too onerous; for example any M of the form\n\nM = [R 0; 0 R^{−T}]\n\nfor nonsingular R ∈ R^{n×n} satisfies the condition. Contrast this to vanilla gradient flow,\n\nẏ_τ = −∇H̄(y_τ) = −M^{−T} ∇H(M^{−1} y_τ) = M^{−T} ż_τ.\n\nAgain setting z_0 = M^{−1} y_0 we obtain y_t = y_0 + ∫_0^t ẏ_τ dτ = M z_0 + ∫_0^t M^{−T} ż_τ dτ ≠ M z_t except in the case that M^T M = I, i.e., M is orthogonal.\n\n2.2 Discretizations\n\nThere are many possible ways to discretize the Hamiltonian descent equations, see, e.g., [20]. Here we present two simple approaches and prove their convergence under certain conditions. Later we shall show that other discretizations correspond to already known algorithms.\n\n2.2.1 Implicit\n\nConsider the following implicit discretization of (2), for some ε > 0 we take\n\nz_{k+1} = z_k + ε(J∇H(z_{k+1}) + z⋆ − z_{k+1}).   (4)\n\nConsider the change in Hamiltonian value at iteration k, Δ_k = H(z_{k+1}) − H(z_k):\n\nΔ_k ≤ ∇H(z_{k+1})^T (z_{k+1} − z_k) = ε∇H(z_{k+1})^T (J∇H(z_{k+1}) + z⋆ − z_{k+1}) ≤ −εH(z_{k+1})\n\nsince J is skew-symmetric, H(z⋆) = 0 and H is convex. From this we have H(z_k) ≤ (1 + ε)^{−k} H(z_0). Thus the implicit discretization exhibits linear convergence in discrete-time, without restriction on the step-size ε. However, this scheme is very difficult to implement in practice, since it requires solving a non-linear equation for z_{k+1} at every step.\n\n2.2.2 Explicit\n\nNow consider the explicit discretization\n\nz_{k+1} = z_k + ε(J∇H(z_k) + z⋆ − z_k),   (5)\n\nthis differs from the implicit discretization in that the right hand side depends solely on z_k rather than z_{k+1}, and therefore is much more practical to implement. If we assume that the gradient of H is L-Lipschitz, then we can show that this sequence converges and that the Hamiltonian converges to zero like O(1/k). The proof of this result is included in the appendix.
If, in addition, H is μ-strongly convex with μ > 0, then we can show that the Hamiltonian converges to zero like O(λ^k) for some λ < 1. The proof of this result, along with the explicit dependence of λ on L and μ, is given in the appendix.\n\nWe must mention here that both proofs are somewhat lacking. For example, under the assumptions of L-Lipschitzness of ∇H and μ-strong convexity of H, our analysis requires that the step-size ε depend on both L and μ. This is a stronger requirement than the classical gradient descent analysis. Moreover, the rate λ scales poorly with the condition number L/μ as compared to gradient descent. This may be due to the fact that both analyses depend strongly on the values of L or μ, which are not invariant to affine transformation even though the equations of motion are. We suspect that a tighter analysis is possible under assumptions whose structure mirrors the affine invariance structure of the dynamics.\n\n3 Composite optimization\n\nNow we come to the main problem we investigate in this paper. Consider a convex optimization problem consisting of the sum of two convex, closed, proper functions h : R^n → R and g : R^m → R:\n\nminimize f(y) := h(Ay) + g(y)   (6)\n\nover variable y ∈ R^m, with data matrix A ∈ R^{n×m}. This problem is sometimes referred to as a composite optimization problem, see, e.g., [21]. The dual problem is given by\n\nmaximize d(p) := −h*(−p) − g*(A^T p),   (7)\n\nover p ∈ R^n. We assume that h and g* are both differentiable, which will help ensure that the Hamiltonian we derive satisfies assumption 1. Weak duality tells us that for any y, p we have f(y) ≥ d(p), with equality if and only if y and p are primal-dual optimal, since strong duality always holds for this problem (under mild technical conditions [22, §5.2.3]).
We can rewrite the primal and dual problems in equality constrained form:\n\nminimize h(x) + g(y) subject to x = Ay,\n\nmaximize −h*(−p) − g*(q) subject to q = A^T p,   (8)\n\nand obtain necessary and sufficient optimality conditions in terms of all four variables:\n\n∇g*(q⋆) − y⋆ = 0\nAy⋆ − x⋆ = 0\n−∇h(x⋆) − p⋆ = 0\nA^T p⋆ − q⋆ = 0,   (9)\n\nthe proof of which is included in the appendix.\n\n3.1 Duality gap as Hamiltonian\n\nIn this section we derive a partial duality gap for problem (8) and use it as our Hamiltonian function to derive equations of motion. Then we shall show that in the limit the equations we derive satisfy the conditions necessary and sufficient for optimality (9). We start by introducing dual variable p for the equality constraint in the primal problem (8) to obtain h(x) + g(y) + p^T (x − Ay), and taking the Legendre transform of g we get the ‘full’ Lagrangian in terms of all four primal and dual variables:\n\nL(x, y, p, q) = h(x) − g*(q) + y^T q + p^T (x − Ay),\n\nwhich is convex-concave in (x, y) and (p, q). We refer to this as the full Lagrangian, because if we maximize over (p, q) we recover the primal problem in (8) and if we minimize over (x, y) we recover the dual problem in (8). Denote by (y⋆, p⋆) any primal-dual optimal point and let x⋆ = Ay⋆, q⋆ = A^T p⋆, and f⋆ = f(y⋆) = d(p⋆), then a simple calculation yields\n\nL(x⋆, y⋆, p, q) ≤ max_{p,q} L(x⋆, y⋆, p, q) = f⋆ = min_{x,y} L(x, y, p⋆, q⋆) ≤ L(x, y, p⋆, q⋆).\n\nThis is due to strong duality holding for this problem. In other words, if we substitute in the optimal primal or dual variables into the Lagrangian, then we obtain valid lower and upper bounds respectively.
Then maximizing and minimizing these bounds over the remaining variables yields the optimal objective value, f⋆. Thus, the difference between these two functions is a partial duality gap (though uncomputable without knowledge of a primal-dual optimal point),\n\ngap(x, q) = L(x, y, p⋆, q⋆) − L(x⋆, y⋆, p, q)\n= h(x) − h(x⋆) + g*(q) − g*(q⋆) + x^T p⋆ − q^T y⋆\n≥ 0,   (10)\n\nwith equality only when the Lagrangians are equal, i.e., are optimal. Note that the gap only depends on x, q, because the effect of y and p is cancelled out. This gap can also be written in terms of Bregman divergences, where the Bregman divergence between points u and v induced by a differentiable convex function h is defined as D_h(u, v) = h(u) − h(v) − ∇h(v)^T (u − v), which is always nonnegative due to the convexity of h. Though not a true distance metric, it does have some useful ‘distance-like’ properties [23, 24]. We show in the appendix that our partial duality gap can be rewritten as\n\ngap(x, q) = D_h(x, x⋆) + D_{g*}(q, q⋆).\n\nIn other words, the gap also corresponds to a ‘distance’ between the current iterates and their optimal values, as induced by the functions h and g*. Furthermore, we show in the appendix that this partial duality gap is a lower bound on the full duality gap, i.e.,\n\nf(y) − d(p) ≥ gap(Ay, A^T p).\n\nThe gap is not in the form of a Hamiltonian, since the variables x and q are of different dimension. We can reparameterize q = A^T p or x = Ay, which yields two possible Hamiltonians, one in dimension n and one in dimension m.
The first of which is\n\nH(x, p) = gap(x, A^T p) = h(x) − h(x⋆) + g*(A^T p) − g*(A^T p⋆) + x^T p⋆ − p^T x⋆.   (11)\n\nDue to the assumptions on h and g* we know that H is convex and differentiable, and evidently H(x, p) ≥ H(x⋆, p⋆) = 0. This Hamiltonian function combined with the equations of motion in equation (2) yields dynamics\n\nẋ_t = ∇_p H(x_t, p_t) + x⋆ − x_t = A∇g*(A^T p_t) − x_t\nṗ_t = −∇_x H(x_t, p_t) + p⋆ − p_t = −∇h(x_t) − p_t.   (12)\n\nWe could rewrite these equations as\n\n∇g*(q_t) − y_t = 0\nAy_t − x_t = ẋ_t\n−∇h(x_t) − p_t = ṗ_t\nA^T p_t − q_t = 0.\n\nIf ẋ_t → 0 and ṗ_t → 0, then the above equations converge to the conditions necessary and sufficient for optimality, as given in equation (9). This convergence could be guaranteed by theorem 1, when H has a unique minimum (and thus satisfies all of assumption 1). Still, we suspect it is possible to prove the convergence of the system without this requirement on H’s minima.\n\nThe second Hamiltonian is given by\n\nH(y, q) = gap(Ay, q) = h(Ay) − h(Ay⋆) + g*(q) − g*(q⋆) + y^T q⋆ − q^T y⋆   (13)\n\nwhich yields equations of motion\n\nẏ_t = ∇_q H(y_t, q_t) + y⋆ − y_t = ∇g*(q_t) − y_t\nq̇_t = −∇_y H(y_t, q_t) + q⋆ − q_t = −A^T ∇h(Ay_t) − q_t,   (14)\n\nor equivalently\n\n∇g*(q_t) − y_t = ẏ_t\nAy_t − x_t = 0\n−∇h(x_t) − p_t = 0\nA^T p_t − q_t = q̇_t.\n\nAgain, if ẏ_t → 0 and q̇_t → 0, this system will also satisfy the optimality conditions of (9).
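To make the second set of dynamics (14) concrete, here is a minimal sketch of a forward-Euler (explicit) discretization on a tiny ridge-regression instance, i.e. h(x) = (1/2)||x - b||_2^2 and g(y) = (lam/2)||y||_2^2, so that grad h(x) = x - b and grad g*(q) = q / lam. The data A and b, the value of lam, the step size eps and the helper names are assumptions made for this illustration only; it is not the experimental setup of the paper.

```python
# Explicit (forward-Euler) discretization of the flow (14):
#   y <- y + eps * (grad g*(q) - y)
#   q <- q + eps * (-A^T grad h(A y) - q)
# Toy data below are assumptions made for this sketch only.

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # n = 3 rows, m = 2 columns
b = [1.0, 0.0, -1.0]
lam = 1.0
eps = 0.01    # hand-picked small step; e.g. 0.05 diverges on this data

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def rmatvec(M, v):
    # computes M^T v
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def grad_h(x):
    return [x[i] - b[i] for i in range(len(x))]

def grad_g_conj(q):
    return [qi / lam for qi in q]

def explicit_hd(y, q, steps):
    for _ in range(steps):
        dy = [u - yi for u, yi in zip(grad_g_conj(q), y)]
        dq = [-u - qi for u, qi in zip(rmatvec(A, grad_h(matvec(A, y))), q)]
        y = [yi + eps * d for yi, d in zip(y, dy)]
        q = [qi + eps * d for qi, d in zip(q, dq)]
    return y, q

y, q = explicit_hd([0.0, 0.0], [0.0, 0.0], steps=10000)

# Residual of the optimality condition (A^T A + lam I) y = A^T b
residual = [u + lam * yi for u, yi in zip(rmatvec(A, grad_h(matvec(A, y))), y)]
```

At a fixed point of the two updates, y = grad g*(q) and q = -A^T grad h(A y), which for this choice of h and g is the usual ridge-regression normal equation; the residual computed at the end checks that condition numerically.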
Finally, theorem 1 implies that both of these ODEs exhibit linear convergence of the Hamiltonian, i.e., linear convergence of the partial duality gap (10), to zero.\n\n4 Connection to other methods\n\n4.1 ADMM\n\nIn this section we show how a particular discretization of our ODE yields the well-known alternating direction method of multipliers algorithm (ADMM) [25, 26] when applied to problem (6). We should note that in related work the authors of [27] derive a different ODE that when discretized also yields ADMM, as well as a related ODE that corresponds to accelerated ADMM [28]. There is no contradiction here since many ODEs can correspond to the same procedure when discretized.\n\nIn order to prove that ADMM is equivalent to a discretization of Hamiltonian descent we will require the generalized Moreau decomposition, which we present next. In the statement of the lemma we use the notation (A∂fA^T) to represent the multi-valued operator defined as (A∂fA^T)(x) = A(∂f(A^T x)) = {Az | z ∈ ∂f(A^T x)}.\n\nLemma 1. For convex, closed, proper function f : R^m → R and matrix A ∈ R^{n×m}, any point x ∈ R^n satisfies\n\nx = (I + ρA∂fA^T)^{−1} x + ρA(∂f* + ρA^T A)^{−1} A^T x.\n\nWe defer the proof to the appendix.
To derive ADMM we employ a standard trick in discretizing differential equations: we add and subtract a term to the dynamics which we shall discretize at different points, which in the limit of infinitesimal step size will vanish, recovering the original ODE. Starting from equation (12) and for any ρ > 0 the modified ODE is\n\nṗ_t = −∇h(x_t) − p_t − ρ(x_t − x_t)\nẋ_t = A∇g*(A^T p_t) − x_t + (1/ρ)(p_t − p_t).\n\nNow we discretize as follows:\n\n(p_k − p_{k−1})/ε = −∇h(x_{k+1}) − p_k − ρ(x_{k+1} − x_k)\n(x_{k+1} − x_k)/ε = A∇g*(A^T p_{k+1}) − x_k + (1/ρ)(p_{k+1} − p_k).\n\nSetting ε = 1 yields\n\nx_{k+1} = (ρI + ∇h)^{−1}(ρx_k − 2p_k + p_{k−1})\np_{k+1} = (I + ρA∇g*A^T)^{−1}(p_k + ρx_{k+1})\n= p_k + ρx_{k+1} − ρA(∂g + ρA^T A)^{−1} A^T (p_k + ρx_{k+1})\n= p_k + ρx_{k+1} − ρAy_{k+1}\n\nwhere we used the generalized Moreau decomposition and introduced variable sequence y_k ∈ R^m, and note that from the last equation we have that ρx_k − 2p_k + p_{k−1} = ρAy_k − p_k. Finally this brings us to ADMM; from any initial y_0, p_0 iterate\n\nx_{k+1} = (ρI + ∇h)^{−1}(ρAy_k − p_k)\ny_{k+1} ∈ (ρA^T A + ∂g)^{−1} A^T (p_k + ρx_{k+1})\np_{k+1} = p_k + ρ(x_{k+1} − Ay_{k+1}).\n\nEvidently we have lost the affine invariance property of our ODE. However we might expect ADMM to be somewhat more robust to conditioning than gradient descent, which appears to be true empirically [25].\n\n4.2 PDHG\n\nThe primal-dual hybrid gradient technique (PDHG), also called Chambolle-Pock, is another operator splitting technique with a slightly different form to ADMM. In particular, PDHG only requires multiplications by A and A^T rather than requiring A in the proximal step [29, 30, 31].
When applied to problem (6) PDHG yields the following iterates\n\np_{k+1} = −(I + ρ∂h*)^{−1}(ρAy_k − p_k)\ny_{k+1} = (I + σ∂g)^{−1}(σA^T p_{k+1} + y_k).\n\nIn the appendix we show that this corresponds to a particular discretization of Hamiltonian descent, with step size ε = 1. Note that the sign of the dual variable p_k is different when compared to [31]; this is due to the fact that the dual problem they consider negates the dual variable when compared to ours, so this is fixed by rewriting the iterations in terms of −p_k.\n\n5 Numerical experiments\n\nIn this section we present two numerical examples where we compare the explicit discretization of Hamiltonian descent flow to gradient descent. Due to the affine invariance property of Hamiltonian descent we expect our technique to outperform when the conditioning of the problem is poor, so we generate examples with bad conditioning to test that.\n\n5.1 Regularized least-squares\n\nConsider the following ℓ2-regularized least-squares problem\n\nminimize (1/2)‖Ay − b‖_2^2 + (λ/2)‖By‖_2^2,   (15)\n\nover variable y ∈ R^m, where A ∈ R^{n×m}, B ∈ R^{m×m}, and λ ≥ 0 are data. In the notation of problem (6) we let h(x) = (1/2)‖x − b‖_2^2 and g(y) = (λ/2)‖By‖_2^2, and so ∇g*(q) = argmax_y (y^T q − (λ/2)‖By‖_2^2), which we assume is always well-defined (i.e., B^T B is invertible). We apply the explicit discretization (5) of the dynamics given in equation (14) to this problem. To demonstrate the practical effect of affine invariance, we randomly generate a nonsingular matrix M and solve a sequence of optimization problems where A is replaced with Â_j = AM^j and B is replaced with B̂_j = BM^j for j = 0, 1, . . . , j_max.
Note that the optimal objective value of this perturbed problem is unchanged from the original, and the solution for each perturbed problem can be obtained by (ŷ⋆)_j = M^{−j} y⋆, where y⋆ solves the original problem (i.e., with j = 0). However, the conditioning of the problem is changed: M is selected so that the conditioning of the data worsens with increasing j. We compare our algorithm to vanilla gradient descent, to proximal gradient descent [32] (where the prox-step is on the g term so it is of a similar cost to our method), and to restarted accelerated gradient descent [6, 33], and observe the effect of the worsening conditioning.\n\nWe chose m = n = 1000 and for simplicity we chose B = I, λ = 1, and randomly generated each entry in A to be IID N(0, 1). The best step size was chosen via exhaustive search for all three algorithms. The matrix M was randomly generated but chosen in such a way so as to be close to the identity. For j = 0 the condition number of the matrix Â_j^T Â_j + λ B̂_j^T B̂_j was 4.0 × 10^3, and for j = j_max = 20 the condition number had grown to 2.2 × 10^14, a dramatic increase. Figure (1a) shows the performance of both our technique and gradient descent on this sequence of problems. The gradient descent traces are in orange, with a different trace for each j. The fastest converging trace corresponds to j = 0, the best conditioned problem. As the conditioning deteriorates the convergence is impacted, getting slower with each increase in j. In the appendix we additionally include Figure 3 which compares our technique to proximal gradient, restarted accelerated gradient, and conjugate gradient. All three additional techniques display the same deterioration as the conditioning worsens. By problem j = 20 no variant of gradient descent or conjugate gradient has reduced the primal objective error, defined as min_k (f(y_k) − f⋆), to under O(100).
By contrast, our technique is completely unaffected by the changing data, with every trace essentially identical (up to some numerical tolerances). Furthermore, we used the exact same step size for every run of our method. This is because the discretization procedure preserved the affine invariance of the continuous ODE it is approximating, so the changing conditioning of the data has no effect. In Figure (1b) we plot the Hamiltonian (13) (i.e., the partial duality gap) and the full duality gap: f(y_k) − d(p_k), for Hamiltonian descent for each value of j. Once again the traces lie directly on top of each other, until numerical errors start to have an impact. We note that the Hamiltonian decreases at each iteration, and converges linearly. The duality gap and the objective values do not necessarily decrease at each iteration, but do appear to enjoy linear convergence for each j.\n\n(a) Primal objective value.\n\n(b) Hamiltonian value and duality gap for HD.\n\nFigure 1: Comparison of Hamiltonian descent (HD) and Gradient descent (GD) for problem (15).\n\n5.2 Elastic net regularized logistic regression\n\nIn logistic regression the goal is to learn a classifier to separate a set of data points based on their labels, which we take to be either 1 or −1. The elastic net is a type of regularization that promotes sparsity and small weights in the solution [34]. Given data points a_i ∈ R^m with corresponding label l_i ∈ {−1, 1} for i = 1, . . . , n, the elastic net regularized logistic regression problem is given by\n\nminimize (1/n) Σ_{i=1}^n log(1 + exp(l_i a_i^T y)) + λ_1‖y‖_1 + (λ_2/2)‖y‖_2^2   (16)\n\nover the variable y ∈ R^m, where λ_1 ≥ 0, and λ_2 ≥ 0 control the strength of the regularization. In the notation of problem (6) we take h(x) = (1/n) Σ_{i=1}^n log(1 + exp(l_i x_i)) and g(y) = λ_1‖y‖_1 + (λ_2/2)‖y‖_2^2. We have a closed form expression for the gradient of g* given by the soft-thresholding operator:\n\n(∇g*(q))_i = (1/λ_2)(q_i − λ_1) if q_i ≥ λ_1,\n(∇g*(q))_i = 0 if |q_i| ≤ λ_1,\n(∇g*(q))_i = (1/λ_2)(q_i + λ_1) if q_i ≤ −λ_1.\n\nWe compare the explicit discretization (5) of Hamiltonian descent in equation (14) to proximal gradient descent [32], which in this case has the exact same per-iteration cost since it also relies on taking the gradient of h and applying the soft-thresholding operator. We chose dimension m = 500 and n = 1000 data points and we set λ_1 = λ_2 = 0.01. The data were generated randomly, and then perturbed so as to give a high condition number, which was 1.0 × 10^8. The best step size for both algorithms was found using exhaustive search. In Figure 2 we show the primal objective value error for both algorithms, where the true solution was found using convex cone solver SCS [35, 36]. Hamiltonian descent dramatically outperforms gradient descent on this problem, despite having the same per-iteration cost. This is unsurprising because we would expect Hamiltonian descent to be less sensitive to the poor conditioning of the data, due to the affine invariance property.\n\nFigure 2: Comparison of Hamiltonian descent (HD) and Gradient descent (GD) for problem (16).\n\n6 Conclusion\n\nStarting from Hamiltonian mechanics in classical physics, we derived a Hamiltonian descent continuous ODE that converges linearly to a minimum of the Hamiltonian function. We applied Hamiltonian descent to a convex composite optimization problem, and proved linear convergence of the duality gap, a measure of how far from optimal a primal-dual point is. In some sense applying Hamiltonian descent to this problem is natural, since we can identify one of the terms in the objective as being the ‘potential’ energy and the other as the ‘kinetic’ energy.
We provided two discretizations that are\nguaranteed to converge to the optimum under certain assumptions, and also demonstrated that some\nwell-known algorithms correspond to other discretizations of our ODE. In particular, we showed that\none such discretization yields ADMM. We concluded with two numerical examples that show our\nmethod is much more robust to numerical conditioning than standard gradient methods.\n\nReferences\n\n[1] Sir William Rowan Hamilton. On a general method in dynamics. Philosophical Transactions of\nthe Royal Society, 2:247\u2013308, 1834.\n\n[2] Juan Peypouquet and Sylvain Sorin. Evolution equations for maximal monotone operators:\nAsymptotic analysis in continuous and discrete time. arXiv preprint arXiv:0905.1270, 2009.\n\n[3] Pascal Bianchi, Walid Hachem, and Adil Salim. A constant step forward-backward algorithm\ninvolving random maximal monotone operators. arXiv preprint arXiv:1702.04144, 2017.\n\n[4] Laurent Condat. A primal\u2013dual splitting method for convex optimization involving Lipschitzian,\nproximable and linear composite terms. Journal of Optimization Theory and Applications,\n158(2):460\u2013479, 2013.\n\n[5] Weijie Su, Stephen Boyd, and Emmanuel J Cand\u00e8s. A differential equation for modeling\nNesterov\u2019s accelerated gradient method: theory and insights. Journal of Machine Learning\nResearch, 17(1):5312\u20135354, 2016.\n\n[6] Yurii Nesterov. A method of solving a convex programming problem with convergence rate\n$O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372\u2013376, 1983.\n\n[7] Hedy Attouch, Zaki Chbani, and Hassan Riahi. Rate of convergence of the Nesterov accelerated\ngradient method in the subcritical case \u03b1 \u2264 3. ESAIM: Control, Optimisation and Calculus of\nVariations, 25:2, 2019.\n\n[8] Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated\nmethods in optimization. 
Proceedings of the National Academy of Sciences, 113(47):E7351\u2013E7358, 2016.\n\n[9] Ashia Wilson, Lester Mackey, and Andre Wibisono. Accelerating rescaled gradient descent.\narXiv preprint arXiv:1902.08825, 2019.\n\n[10] Ashia C Wilson, Benjamin Recht, and Michael I Jordan. A Lyapunov analysis of momentum\nmethods in optimization. arXiv preprint arXiv:1611.02635, 2016.\n\n[11] Michael Betancourt, Michael I Jordan, and Ashia C Wilson. On symplectic optimization. arXiv\npreprint arXiv:1802.03653, 2018.\n\n[12] Chris J Maddison, Daniel Paulin, Yee Whye Teh, Brendan O\u2019Donoghue, and Arnaud Doucet.\nHamiltonian descent methods. arXiv preprint arXiv:1809.05042, 2018.\n\n[13] Guilherme Fran\u00e7a, Jeremias Sulam, Daniel P Robinson, and Ren\u00e9 Vidal. Conformal symplectic\nand relativistic optimization. arXiv preprint arXiv:1903.04100, 2019.\n\n[14] RT Rockafellar. Saddle points of Hamiltonian systems in convex problems of Lagrange. Journal\nof Optimization Theory and Applications, 12(4):367\u2013390, 1973.\n\n[15] Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte\nCarlo, pages 113\u2013162, 2011.\n\n[16] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore\nGraepel. The mechanics of n-player differentiable games. In International Conference on\nMachine Learning, pages 363\u2013372, 2018.\n\n[17] Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 1970.\n\n[18] Thomas Hakon Gronwall. Note on the derivatives with respect to a parameter of the solutions\nof a system of differential equations. Annals of Mathematics, pages 292\u2013296, 1919.\n\n[19] Jean-Jacques E Slotine, Weiping Li, et al. Applied nonlinear control, volume 199. Prentice hall\nEnglewood Cliffs, NJ, 1991.\n\n[20] Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric numerical integration:\nstructure-preserving algorithms for ordinary differential equations, volume 31. 
Springer\nScience & Business Media, 2006.\n\n[21] Yu Nesterov. Gradient methods for minimizing composite functions. Mathematical\nProgramming, 140(1):125\u2013161, 2013.\n\n[22] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press,\n2004.\n\n[23] Lev M Bregman. The relaxation method of finding the common point of convex sets and\nits application to the solution of problems in convex programming. USSR computational\nmathematics and mathematical physics, 7(3):200\u2013217, 1967.\n\n[24] Heinz H Bauschke and Jonathan M Borwein. Legendre functions and the method of random\nBregman projections. Journal of Convex Analysis, 4(1):27\u201367, 1997.\n\n[25] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed\noptimization and statistical learning via the alternating direction method of multipliers. Foundations\nand Trends\u00ae in Machine learning, 3(1):1\u2013122, 2011.\n\n[26] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas\u2013Rachford\nalternating direction method. SIAM Journal on Numerical Analysis, 50(2):700\u2013709, 2012.\n\n[27] Guilherme Franca, Daniel Robinson, and Rene Vidal. ADMM and accelerated ADMM as continuous\ndynamical systems. In International Conference on Machine Learning, pages 1554\u20131562,\n2018.\n\n[28] Tom Goldstein, Brendan O\u2019Donoghue, Simon Setzer, and Richard Baraniuk. Fast alternating\ndirection optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588\u20131623, 2014.\n\n[29] Mingqiang Zhu and Tony Chan. An efficient primal-dual hybrid gradient algorithm for total\nvariation image restoration. UCLA CAM report, 34, 2008.\n\n[30] Ernie Esser, Xiaoqun Zhang, and Tony Chan. A general framework for a class of first order\nprimal-dual algorithms for TV minimization. 2009.\n\n[31] Antonin Chambolle and Thomas Pock. 
A first-order primal-dual algorithm for convex problems\nwith applications to imaging. Journal of mathematical imaging and vision, 40(1):120\u2013145,\n2011.\n\n[32] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends\u00ae in\nOptimization, 1(3):127\u2013239, 2014.\n\n[33] Brendan O\u2019Donoghue and Emmanuel Candes. Adaptive restart for accelerated gradient schemes.\nFoundations of computational mathematics, 15(3):715\u2013732, 2015.\n\n[34] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of\nthe royal statistical society: series B (statistical methodology), 67(2):301\u2013320, 2005.\n\n[35] B. O\u2019Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting\nand homogeneous self-dual embedding. Journal of Optimization Theory and Applications,\n169(3):1042\u20131068, June 2016.\n\n[36] B. O\u2019Donoghue, E. Chu, N. Parikh, and S. Boyd. SCS: Splitting conic solver, version 2.1.0.\nhttps://github.com/cvxgrp/scs, November 2017.", "award": [], "sourceid": 8196, "authors": [{"given_name": "Brendan", "family_name": "O'Donoghue", "institution": "DeepMind"}, {"given_name": "Chris", "family_name": "Maddison", "institution": "Institute for Advanced Study, Princeton"}]}