{"title": "The Implicit Bias of AdaGrad on Separable Data", "book": "Advances in Neural Information Processing Systems", "page_first": 7761, "page_last": 7769, "abstract": "We study the implicit bias of AdaGrad on separable linear classification problems. \nWe show that AdaGrad converges to a direction that can be characterized as the solution of a quadratic optimization problem with the same feasible set as the hard SVM problem. \nWe also give a discussion about how different choices of the hyperparameters of AdaGrad may impact this direction. \nThis provides a deeper understanding of why adaptive methods do not seem to have the generalization ability as good as gradient descent does in practice.", "full_text": "The Implicit Bias of AdaGrad on Separable Data\n\nQian Qian\n\nDepartment of Statistics\nOhio State University\n\nColumbus, OH 43210, USA\n\nqian.216@osu.edu\n\nXiaoyuan Qian\n\nSchool of Mathematical Sciences\nDalian University of Technology\nDalian, Liaoning 116024, China\n\nxyqian@dlut.edu.cn\n\nAbstract\n\nWe study the implicit bias of AdaGrad on separable linear classi\ufb01cation problems.\nWe show that AdaGrad converges to a direction that can be characterized as the\nsolution of a quadratic optimization problem with the same feasible set as the\nhard SVM problem. We also give a discussion about how different choices of the\nhyperparameters of AdaGrad might impact this direction. This provides a deeper\nunderstanding of why adaptive methods do not seem to have the generalization\nability as good as gradient descent does in practice.\n\n1\n\nIntroduction\n\nIn recent years, implicit regularization from various optimization algorithms plays a crucial role\nin the generalization abilities in training deep neural networks (Salakhutdinov and Srebro [2015],\nNeyshabur et al. [2015], Keskar et al. [2016], Neyshabur et al. [2017], Zhang et al. [2017]). 
For example, in underdetermined problems where the number of parameters is larger than the number of training examples, many global optima fail to exhibit good generalization properties; however, a specific optimization algorithm (such as gradient descent) does converge to a particular optimum that generalizes well, although no explicit regularization is enforced when training the model. In other words, the optimization technique itself \"biases\" the training towards a certain model in an implicit way (Soudry et al. [2018]). This motivates a line of work investigating the implicit biases of various algorithms (Telgarsky [2013], Soudry et al. [2018], Gunasekar et al. [2017, 2018a,b]).\n\nThe choice of algorithm affects the implicit regularization introduced in the learned model. In underdetermined least squares problems, where finite minimizers exist, gradient descent yields the minimum L2 norm solution, whereas coordinate descent might give a different solution. Another example is logistic regression with separable data. While gradient descent converges in the direction of the hard margin support vector machine solution (Soudry et al. [2018]), coordinate descent converges to the maximum L1 margin solution (Telgarsky [2013], Gunasekar et al. [2018a]). Unlike the squared loss, the logistic loss does not admit a finite global minimizer on separable data: the iterates must diverge in order to drive the loss to zero. As a result, instead of characterizing the convergence of the iterates w(t), it is the asymptotic direction of these iterates, i.e., lim_{t→∞} w(t)/‖w(t)‖, that is important and has therefore been characterized (Soudry et al. [2018], Gunasekar et al. [2018b]).\n\nMoreover, it has attracted much attention that different adaptive variants of gradient descent, and different hyperparameter choices within an adaptive method, exhibit different biases, thus leading to different generalization performance in deep learning (Salakhutdinov and Srebro [2015], Keskar et al. [2016], Wilson et al. [2017], Hoffer et al. [2017]). Among those findings is that the vanilla SGD algorithm demonstrates better generalization than its adaptive variants (Wilson et al. [2017]), such as AdaGrad (Duchi et al. [2010]) and Adam (Kingma and Ba [2015]). It is therefore important to precisely characterize how different adaptive methods induce different biases. A natural question to ask is: can we explain this observation by characterizing the implicit bias of AdaGrad, which is a paradigm of adaptive methods, in a binary classification setting with separable data using logistic regression? And how does the implicit bias depend on the choice of the hyperparameters of this specific algorithm, such as initialization, step sizes, etc.?\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f1.1 Our Contribution\n\nIn this work we study AdaGrad applied to logistic regression with separable data.
Our contribution is three-fold, as follows.\n\n• We prove that the directions of AdaGrad iterates, with a sufficiently small constant step size, always converge.\n• We formulate the asymptotic direction as the solution of a quadratic optimization problem. This achieves a theoretical characterization of the implicit bias of AdaGrad, which also provides insights into why and how the factors involved, such as certain intrinsic properties of the dataset, the initialization, and the learning rate, affect the implicit bias.\n• We introduce a novel approach to studying the bias of AdaGrad. It is mainly based on a geometric estimate of the directions of the updates, and does not depend on any calculation of the convergence rate.\n\n1.2 Paper Organization\n\nThis paper is organized as follows. In Section 2 we explain our problem setup. The main theory is developed in Section 3, including convergence of the adaptive learning rates of AdaGrad, existence of the asymptotic direction of AdaGrad iterates, and relations between the asymptotic directions of AdaGrad and gradient descent iterates. We conclude our paper in Section 4 with a review of our results and some questions left to future research.\n\n2 Problem Setup\n\nLet {(xn, yn) : n = 1, ···, N} be a training dataset with features xn ∈ Rp and labels yn ∈ {−1, 1}. To simplify the notation, we redefine yn xn as xn, n = 1, ···, N, and consider learning the logistic regression model over the empirical loss:\n\nL(w) = Σ_{n=1}^{N} l(w^T xn), w ∈ Rp,\n\nwhere l : R → R. We focus on the following case, the same as proposed in Soudry et al. [2018]:\n\nAssumption 1. There exists a vector w∗ such that w∗^T xn > 0 for all n.\n\nAssumption 2. l(u) is continuously differentiable, β-smooth, and strictly decreasing to zero.\n\nAssumption 3. There exist positive constants a, b, c, and d such that\n\n|l′(u) + c e^{−au}| ≤ e^{−(a+b)u}, for u > d.\n\nIt is easy to see that the exponential loss l(u) = e^{−u} and the logistic loss l(u) = log(1 + e^{−u}) both meet these assumptions.\n\nGiven two hyperparameters ε, η > 0 and an initial point w(0) ∈ Rp, we consider the diagonal AdaGrad iterates\n\nw(t + 1) = w(t) − η h(t) ⊙ g(t), t = 0, 1, 2, ···, (1)\n\nwhere\n\ng(t) = (g1(t), ···, gp(t)), gi(t) = (∂L/∂wi)(w(t)),\nh(t) = (h1(t), ···, hp(t)), hi(t) = 1 / √(gi(0)² + ··· + gi(t)² + ε), i = 1, ···, p,\n\nand ⊙ is the element-wise multiplication of two vectors, e.g.\n\na ⊙ b = (a1 b1, ···, ap bp)^T for a = (a1, ···, ap)^T, b = (b1, ···, bp)^T.\n\nTo analyze the convergence of the algorithm, we put an additional restriction on the hyperparameter η.\n\nAssumption 4. The hyperparameter η is not too large; specifically,\n\nη < (2/β) min_{i ∈ {1,···,p}} √(gi(0)² + ε). (2)\n\nWe are interested in the asymptotic behavior of the AdaGrad iteration scheme (1). The main problem is: does there exist some vector wA such that\n\nlim_{t→∞} w(t)/‖w(t)‖ = wA ?\n\nWe will provide an affirmative answer to this question in the following section.\n\n3 The Asymptotic Direction of AdaGrad Iterates\n\n3.1 Convergence of the Adaptive Learning Rates\n\nWe first provide some elementary facts about AdaGrad iterates (1), under all the assumptions (1–4) proposed in Section 2.\n\nLemma 3.1. L(w(t + 1)) < L(w(t)) (t = 0, 1, ···).\n\nLemma 3.2. Σ_{t=0}^{∞} ‖g(t)‖² < ∞.\n\nWe notice that Gunasekar et al.
[2018a] showed a similar result (Lemma 6, in Section 3.3 of their work) for the exponential loss only, under slightly different assumptions. However, their approach depends on some specific properties of the exponential function, and thus cannot be extended to Lemma 3.2 in a trivial manner.\n\nLemma 3.3. The following statements hold:\n\n(i) ‖g(t)‖ → 0 (t → ∞).\n(ii) ‖w(t)‖ → ∞ (t → ∞).\n(iii) L(w(t)) → 0 (t → ∞).\n(iv) ∀n, lim_{t→∞} w(t)^T xn = ∞.\n(v) ∃t0, ∀t > t0, ∀n, w(t)^T xn > 0.\n\nTheorem 3.1. The sequence {h(t)}_{t=0}^{∞} converges as t → ∞ to a vector h∞ = (h∞,1, ···, h∞,p) satisfying h∞,i > 0 (i = 1, ···, p).\n\n3.2 Convergence of the Directions of AdaGrad Iterates\n\nIn the remainder of the paper we denote h∞ = lim_{t→∞} h(t) and ξn = h∞^{1/2} ⊙ xn (n = 1, ···, N). Since, by Theorem 3.1, the components of h∞ have a positive lower bound, we can define\n\nβ(t) = h∞^{−1} ⊙ h(t) (t = 0, 1, 2, ···).\n\nHere the square root and the inverse of vectors are defined element-wise. We call the function\n\nLind : Rp → R, Lind(v) = Σ_{n=1}^{N} l(v^T ξn)\n\nthe induced loss with respect to AdaGrad (1).
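The induced loss is just the original loss in rescaled coordinates: with ξn = h∞^{1/2} ⊙ xn and v = h∞^{−1/2} ⊙ w, one has Lind(v) = L(w). This identity can be sanity-checked numerically; the sketch below uses random data and an arbitrary positive stand-in for h∞ (both our own choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 3
X = rng.normal(size=(N, p))              # rows are the (label-folded) x_n
h_inf = rng.uniform(0.5, 2.0, size=p)    # any positive vector works as a stand-in

def L(w):
    # Logistic empirical loss L(w) = sum_n log(1 + exp(-w^T x_n)).
    return np.sum(np.log1p(np.exp(-X @ w)))

Xi = X * np.sqrt(h_inf)                  # xi_n = h_inf^{1/2} ⊙ x_n, row-wise

def L_ind(v):
    # Induced loss over the rescaled features xi_n.
    return np.sum(np.log1p(np.exp(-Xi @ v)))

w = rng.normal(size=p)
v = w / np.sqrt(h_inf)                   # v = h_inf^{-1/2} ⊙ w
```

With these definitions `L(w)` and `L_ind(v)` agree up to floating-point rounding, since (h∞^{−1/2} ⊙ w)^T (h∞^{1/2} ⊙ xn) = w^T xn.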
Note that\n\nL(w) = Σ_{n=1}^{N} l(w^T xn) = Σ_{n=1}^{N} l((h∞^{−1/2} ⊙ w)^T (h∞^{1/2} ⊙ xn)) = Σ_{n=1}^{N} l((h∞^{−1/2} ⊙ w)^T ξn) = Lind(h∞^{−1/2} ⊙ w).\n\nThus if we set\n\nv(t) = h∞^{−1/2} ⊙ w(t) (t = 0, 1, 2, ···), (3)\n\nthen v(0) = h∞^{−1/2} ⊙ w(0), and\n\nh∞^{1/2} ⊙ v(t + 1) = w(t + 1) = w(t) − η h(t) ⊙ ∇L(w(t))\n= h∞^{1/2} ⊙ v(t) − η h(t) ⊙ ∇L(h∞^{1/2} ⊙ v(t))\n= h∞^{1/2} ⊙ v(t) − η h(t) ⊙ h∞^{−1/2} ⊙ ∇Lind(v(t))\n= h∞^{1/2} ⊙ (v(t) − η β(t) ⊙ ∇Lind(v(t))),\n\nor\n\nv(t + 1) = v(t) − η β(t) ⊙ ∇Lind(v(t)) (t = 0, 1, ···). (4)\n\nWe refer to (4) as the induced form of AdaGrad (1).\n\nThe following result for the induced form is a simple corollary of Lemma 3.3.\n\nLemma 3.4.
The following statements hold:\n\n(i) ‖∇Lind(v(t))‖ → 0 (t → ∞).\n(ii) ‖v(t)‖ → ∞ (t → ∞).\n(iii) Lind(v(t)) → 0 (t → ∞).\n(iv) ∀n, lim_{t→∞} v(t)^T ξn = ∞.\n(v) ∃t0, ∀t > t0, ∀n, v(t)^T ξn > 0.\n\nFor the induced loss Lind, consider the GD iterates\n\nu(t + 1) = u(t) − η ∇Lind(u(t)) (t = 0, 1, ···). (5)\n\nAccording to Theorem 3 in Soudry et al. [2018], we have\n\nlim_{t→∞} u(t)/‖u(t)‖ = û/‖û‖,\n\nwhere\n\nû = arg min_{u^T ξn ≥ 1, ∀n} ‖u‖².\n\nNoting that β(t) → 1 (t → ∞), we can obtain the GD iterates (5) by taking the limit of β(t) in (4). Therefore it is reasonable to expect that these two iterative processes have similar asymptotic behaviors, and especially a common limiting direction.\n\nDifferent from the case of the GD method discussed in Soudry et al. [2018], however, it is difficult to obtain an effective estimate of the convergence rate of w(t). Instead, we introduce an orthogonal decomposition approach to obtain the asymptotic direction of the original AdaGrad process (1).\n\nIn the remainder of the paper, we denote by P the projection onto the 1-dimensional subspace spanned by û, and by Q the projection onto the orthogonal complement. Without any loss of generality we may assume ‖û‖ = 1. Thus we have the orthogonal decomposition\n\nv = P v + Qv (v ∈ Rp),\n\nwhere P v = ‖P v‖ û = (v^T û) û.
Moreover, we denote\n\nδ(t) = −η ∇Lind(v(t)), d(t) = β(t) ⊙ δ(t). (6)\n\n\fUsing this notation we can rewrite the iteration scheme (4) as\n\nv(t + 1) = v(t) + d(t) (t = 0, 1, ···).\n\nBy reformulating (6) as\n\nd(t) = δ(t) + (β(t) − 1) ⊙ δ(t),\n\nwhere β(t) − 1 → 0 as t → ∞, we regard δ(t) as the decisive part of d(t) and acquire properties of d(t) through exploring analogues of δ(t).\n\nFirst, we can show a basic estimate:\n\nδ(t)^T û = ‖P δ(t)‖ ≥ ‖δ(t)‖ / (max_n ‖ξn‖) (t = 0, 1, 2, ···).\n\nThe projection property of δ(t) is easily passed on to d(t). In fact, for sufficiently large t,\n\nd(t)^T û = ‖P d(t)‖ ≥ ‖d(t)‖ / (4 max_n ‖ξn‖). (7)\n\nInequality (7) provides a cumulative effect on the projection of v(t) as t increases:\n\n‖P v(t)‖ ≥ ‖v(t)‖ / (8 max_n ‖ξn‖),\n\nfor sufficiently large t.\n\nThe following lemma reveals a crucial characteristic of the iterative process (4): as t tends to infinity, the contribution of δ(t) to the increment of the deviation from the direction of û, compared to its contribution to the increment in the direction of û, becomes more and more insignificant.\n\nLemma 3.5. Given ε > 0, let a, b, c be the positive constants defined in Assumption 3 in Section 2. If ‖Qv(t)‖ > 2N(c + 1)(aceε)^{−1}, then for sufficiently large t,\n\n(Qv(t))^T δ(t) < ε ‖Qv(t)‖ ‖δ(t)‖.\n\nThis property can be translated into a more convenient version for d(t).\n\nLemma 3.6.
For any ε > 0, there exists R > 0 such that for sufficiently large t with ‖Qv(t)‖ ≥ R,\n\n‖Qv(t + 1)‖ − ‖Qv(t)‖ ≤ ε ‖d(t)‖.\n\nTherefore, over a long period, the cumulative increment of v(t) in the direction of û will overwhelm the deviation from it, yielding the existence of an asymptotic direction for v(t).\n\nLemma 3.7.\n\nlim_{t→∞} v(t)/‖v(t)‖ = û. (8)\n\nBy the relation (3) between v(t) and w(t), our main result directly follows from (8).\n\nTheorem 3.2. The AdaGrad iterates (1) have an asymptotic direction:\n\nlim_{t→∞} w(t)/‖w(t)‖ = w̃/‖w̃‖,\n\nwhere\n\nw̃ = arg min_{w^T xn ≥ 1, ∀n} ‖h∞^{−1/2} ⊙ w‖². (9)\n\n3.3 Factors Affecting the Asymptotic Direction\n\nTheorem 3.2 confirms that the AdaGrad iterates (1) have an asymptotic direction w̃/‖w̃‖, where w̃ is the solution to the optimization problem (9). Since the objective function ‖h∞^{−1/2} ⊙ w‖² is determined by the limit vector h∞, it is easy to see that the asymptotic direction may depend on the choices of the dataset {(xn, yn)}_{n=1}^{N}, the hyperparameters ε, η, and the initial point w(0). In the following we will discuss this varied dependency in several respects.\n\n\f3.3.1 Difference from the Asymptotic Direction of GD Iterates\n\nWhen the classic gradient descent method is applied to minimize the same loss, it is known (see Theorem 3, Soudry et al.
[2018]) that the GD iterates\n\nwG(t + 1) = wG(t) − η ∇L(wG(t)) (t = 0, 1, 2, ···) (10)\n\nhave an asymptotic direction ŵ/‖ŵ‖, where ŵ is the solution of the hard max-margin SVM problem\n\narg min_{w^T xn ≥ 1, ∀n} ‖w‖². (11)\n\nThe two optimization problems (9) and (11) have the same feasible set\n\n{w ∈ Rp : w^T xn ≥ 1, for n = 1, ···, N},\n\nbut they take on different objective functions. It is natural to expect that their solutions w̃ and ŵ yield different directions, as shown in the following toy example.\n\nExample 3.1. Let x1 = (cos θ, sin θ)^T and L(w) = e^{−w^T x1}. Suppose 0 < θ < π/2. In this setting we simply have ŵ = x1. Selecting w(0) = (a, b)^T and ε = 0, we have\n\n−g(0) = e^{−w(0)^T x1} x1 = e^{−a cos θ − b sin θ} (cos θ, sin θ)^T,\nh(0) = (h1(0), h2(0))^T = e^{a cos θ + b sin θ} (1/cos θ, 1/sin θ)^T.\n\nIn general we can show there is a sequence of positive numbers p(t) such that\n\n−g(t) = p(t) (cos θ, sin θ)^T,\n\nand\n\nh∞ = lim_{t→∞} (1/√(p(0)² + p(1)² + ··· + p(t)²)) (1/cos θ, 1/sin θ)^T = (1/ρ) (1/cos θ, 1/sin θ)^T,\n\nwhere ρ = √(Σ_{t=0}^{∞} p(t)²). Now\n\nw̃ = arg min_{w^T x1 ≥ 1} ‖h∞^{−1/2} ⊙ w‖² = arg min_{w^T x1 ≥ 1} ρ(w1² cos θ + w2² sin θ) = arg min_{w^T x1 ≥ 1} (w1² cos θ + w2² sin θ) = (1/(cos θ + sin θ), 1/(cos θ + sin θ))^T,\n\nand we have w̃/‖w̃‖ = (√2/2, √2/2)^T. Note that this direction is invariant as θ ranges between 0 and π/2, i.e., irrelevant to x1. These two directions coincide only when θ = π/4.\n\n3.3.2 Sensitivity to Small Coordinate System Rotations\n\nIf we consider the same setting as in Example 3.1, but take θ ∈ (π/2, π), then the asymptotic direction w̃/‖w̃‖ becomes (−√2/2, √2/2)^T. This implies, however, that if x1 is close to the direction of the y-axis, then a small rotation of the coordinate system may result in a large change of the asymptotic direction, up to a right angle; i.e., in this case the asymptotic direction is highly unstable even under a small perturbation of its x-coordinate.\n\n3.3.3 Effects of the Initialization and Hyperparameter η\n\nIt is reasonable to believe that the asymptotic direction of AdaGrad depends on the initial conditions, including initialization and step size (see Section 3.3, Gunasekar et al. [2018a]). Theorem 3.2 yields a geometric interpretation of this dependency, as shown in Figure 1, where the red arrows indicate x1 = (cos(3π/8), sin(3π/8)) and x2 = (cos(9π/20), sin(9π/20)), and the cyan arrow indicates the max-margin separator, which points at m, the corner of the feasible set {w : w^T xn ≥ 1, ∀n = 1, 2} (the yellow shadowed area).\n\n\fFigure 1: A case where the asymptotic directions of AdaGrad and GD are different.\n\nSince the isolines of the function ‖h∞^{−1/2} ⊙ w‖² are ellipses (the green dashed curves) centered at the origin, the unique minimizer of the function in the feasible set must be the tangency point p (pointed at by the magenta arrow) between the tangent ellipse and the boundary of the feasible set. If h∞ varies, then the eccentricity of the tangent ellipses may change.
It makes the tangency point move along the boundary, indicating the change of the asymptotic direction.\n\nNumerical simulations also reveal the differences among the asymptotic directions of AdaGrad iterates with various learning rates, as shown in Figure 2. On the left-hand diagram, x1 = (cos(π/8), sin(π/8)) and x2 = (cos(π/20), sin(π/20)) are two support vectors. dm denotes the direction of the max-margin separator. d01 and d05 denote the directions of AdaGrad iterates computed after 10^8 steps, with η = 0.1 and 0.5, respectively. The small angle between the two may indicate that the asymptotic direction depends on η. However, all the asymptotic directions apparently diverge from the max-margin separator. On the right-hand diagram, the red and blue curves plot ‖w(t)/‖w(t)‖ − dm‖ vs. the number of iterates, with η = 0.1 and 0.5, respectively. It illustrates that the two sequences of directions of AdaGrad iterates slowly converge to their own asymptotic directions, slightly different from each other.\n\nFigure 2: Numerical simulations with η = 0.1 and 0.5.\n\n3.3.4 Cases Where the Asymptotic Direction is Stable\n\nAbove we have observed that the asymptotic direction of AdaGrad iterates can be very different from the asymptotic direction of GD iterates, which is robust with respect to different choices of initialization and learning rate η. It is natural to ask under what conditions the two asymptotic directions coincide. The following proposition provides a sufficient one.\n\nProposition 3.1. Let a = (a1, ···, ap)^T be a vector satisfying a^T xn ≥ 1 (n = 1, ···, N) and a1 ··· ap ≠ 0. Suppose that every w = (w1, ···, wp)^T satisfying w^T xn ≥ 1 (n = 1, ···, N) also satisfies\n\nai (wi − ai) ≥ 0 (i = 1, ···, p).\n\nThen for any b = (b1, ···, bp)^T such that b1 ··· bp ≠ 0,\n\narg min_{w^T xn ≥ 1, ∀n} ‖b ⊙ w‖² = arg min_{w^T xn ≥ 1, ∀n} ‖w‖² = a,\n\nand therefore the asymptotic directions of AdaGrad (1) and GD (10) are equal.\n\nSuch a condition may seem at first sight quite harsh to satisfy. However, a significant proportion of datasets {xn : n = 1, ···, N} meet the requirement, as shown in the following result.\n\nProposition 3.2. Suppose N ≥ p and X = [x1, ···, xN] ∈ Rp×N is sampled from any distribution whose density function is nonzero almost everywhere. Then with positive probability the asymptotic directions of AdaGrad (1) and GD (10) are equal.\n\nExample 3.2. Let r1, r2 > 0,\n\nx1 = r1 (cos θ1, sin θ1)^T, π/2 ≤ θ1 < π,\nx2 = r2 (cos θ2, sin θ2)^T, θ1 − π < θ2 ≤ 0,\n\nand L(w) = l(w^T x1) + l(w^T x2). The system of equations\n\nw^T xi = 1 (i = 1, 2)\n\nhas a unique solution (α, β)^T, where\n\nα = (r2^{−1} sin θ1 − r1^{−1} sin θ2) / sin(θ1 − θ2) > 0, β = (r1^{−1} cos θ2 − r2^{−1} cos θ1) / sin(θ1 − θ2) > 0.\n\nIt is easy to check that if w = (w1, w2)^T satisfies w^T xi ≥ 1 (i = 1, 2), then w1 ≥ α and w2 ≥ β. Thus any quadratic form b1 w1² + b2 w2² (b1, b2 > 0) takes its minimum at (α, β)^T over the feasible set {w : w^T xi ≥ 1 (i = 1, 2)}.
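This componentwise bound can be verified numerically on a concrete instance of the example; the angles, radii, and sampling scheme below are our own choices:

```python
import numpy as np

# One instance of the two-point configuration above: x1 in the second
# quadrant (pi/2 <= th1 < pi), x2 in the fourth (th1 - pi < th2 <= 0).
th1, th2 = 5 * np.pi / 8, -np.pi / 8
r1 = r2 = 1.0
x1 = r1 * np.array([np.cos(th1), np.sin(th1)])
x2 = r2 * np.array([np.cos(th2), np.sin(th2)])

# Closed-form corner (alpha, beta): the unique solution of w^T x_i = 1.
s = np.sin(th1 - th2)
alpha = (np.sin(th1) / r2 - np.sin(th2) / r1) / s
beta = (np.cos(th2) / r1 - np.cos(th1) / r2) / s
corner = np.array([alpha, beta])

# Random feasible points: check w1 >= alpha, w2 >= beta, and that any
# positive-diagonal quadratic b1*w1^2 + b2*w2^2 is no smaller than at the corner.
rng = np.random.default_rng(1)
feasible = 0
for _ in range(5000):
    w = rng.uniform(-1.0, 6.0, size=2)
    if w @ x1 >= 1.0 and w @ x2 >= 1.0:
        feasible += 1
        b = rng.uniform(0.1, 10.0, size=2)
        assert w[0] >= alpha - 1e-9 and w[1] >= beta - 1e-9
        assert b @ (w ** 2) >= b @ (corner ** 2) - 1e-9
```

Since every feasible point dominates the corner coordinatewise, the minimizer is independent of the diagonal weights, which is exactly why AdaGrad and GD share a direction here.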
Hence the asymptotic direction of AdaGrad (1) applied to this problem is always equal to (α, β)^T/‖(α, β)‖, which is also the asymptotic direction of GD (10).\n\nA geometric perspective of this example is given in Figure 3, where the red arrows indicate x1 = (cos(5π/8), sin(5π/8)) and x2 = (cos(−π/8), sin(−π/8)), and the magenta arrow indicates (α, β)^T. It is easy to see that the isoline (the thick ellipse drawn in green) along which the function ‖h∞^{−1/2} ⊙ w‖² equals its minimum must intersect the feasible set (the grey shadowed area) at the corner (α, β)^T, no matter what h∞ is.\n\nFigure 3: A case where the asymptotic directions of AdaGrad and GD are equal.\n\n4 Conclusion\n\nWe proved that the basic diagonal AdaGrad, when minimizing a smooth monotone loss function with an exponential tail, has an asymptotic direction, which can be characterized as the solution of a quadratic optimization problem. In this respect AdaGrad is similar to GD, even though their asymptotic directions are usually different. The difference between them also lies in the stability of their asymptotic directions. The asymptotic direction of GD is uniquely determined by the predictors xn and is independent of the initialization and the learning rate, as well as of rotations of the coordinate system, while the asymptotic direction of AdaGrad is likely to be affected by those factors.\n\nIn spite of all these findings, we still do not know whether the asymptotic direction of AdaGrad changes under various initializations or different learning rates. Furthermore, we hope our approach can be applied to the research on the implicit biases of other adaptive methods such as AdaDelta, RMSProp, and Adam.\n\nReferences\n\nB. Neyshabur, R. R. Salakhutdinov, and N. Srebro.
Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.\n\nB. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015.\n\nNitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR, 2016.\n\nB. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Geometry of optimization and implicit regularization in deep learning, 2017. URL https://arxiv.org/pdf/1705.03071.pdf.\n\nC. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.\n\nD. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data, 2018.\n\nM. Telgarsky. Margins, shrinkage and boosting. Proceedings of the 30th International Conference on Machine Learning, PMLR, 28(2):307–315, 2013.\n\nSuriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization, 2017.\n\nSuriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In Proceedings of the 35th International Conference on Machine Learning, 2018a.\n\nS. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. In Proceedings of the 35th International Conference on Machine Learning, 2018b.\n\nAshia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv, pages 1–14, 2017.\n\nE. Hoffer, I. Hubara, and D. Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1–13, 2017.\n\nJohn Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2010.\n\nDiederik P. Kingma and Jimmy Lei Ba. Adam: a method for stochastic optimization. International Conference on Learning Representations, pages 1–13, 2015.\n", "award": [], "sourceid": 4204, "authors": [{"given_name": "Qian", "family_name": "Qian", "institution": "Ohio State University"}, {"given_name": "Xiaoyuan", "family_name": "Qian", "institution": "Dalian University of Technology"}]}