{"title": "Fine-grained Optimization of Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1454, "page_last": 1464, "abstract": "In recent studies, several asymptotic upper bounds on generalization errors on deep neural networks (DNNs) are theoretically derived. These bounds are functions of several norms of weights of the DNNs, such as the Frobenius and spectral norms, and they are computed for weights grouped according to either input and output channels of the DNNs. In this work, we conjecture that if we can impose multiple constraints on weights of DNNs to upper bound the norms of the weights, and train the DNNs with these weights, then we can attain empirical generalization errors closer to the derived theoretical bounds, and improve accuracy of the DNNs. \n\nTo this end, we pose two problems. First, we aim to obtain weights whose different norms are all upper bounded by a constant number. To achieve these bounds, we propose a two-stage renormalization procedure; (i) normalization of weights according to different norms used in the bounds, and (ii) reparameterization of the normalized weights to set a constant and finite upper bound of their norms. In the second problem, we consider training DNNs with these renormalized weights. To this end, we first propose a strategy to construct joint spaces (manifolds) of weights according to different constraints in DNNs. Next, we propose a fine-grained SGD algorithm (FG-SGD) for optimization on the weight manifolds to train DNNs with assurance of convergence to minima. Experimental analyses show that image classification accuracy of baseline DNNs can be boosted using FG-SGD on collections of manifolds identified by multiple constraints.", "full_text": "Fine-grained Optimization of Deep Neural Networks\n\nMete Ozay\u2217\n\nAbstract\n\nIn recent studies, several asymptotic upper bounds on generalization errors on deep\nneural networks (DNNs) are theoretically derived. 
These bounds are functions of several norms of the weights of the DNNs, such as the Frobenius and spectral norms, and they are computed for weights grouped according to either the input or the output channels of the DNNs. In this work, we conjecture that if we can impose multiple constraints on the weights of DNNs to upper bound the norms of the weights, and train the DNNs with these weights, then we can attain empirical generalization errors closer to the derived theoretical bounds, and improve the accuracy of the DNNs.
To this end, we pose two problems. First, we aim to obtain weights whose different norms are all upper bounded by a constant. To achieve these bounds, we propose a two-stage renormalization procedure: (i) normalization of the weights according to the different norms used in the bounds, and (ii) reparameterization of the normalized weights to set a constant, finite upper bound on their norms. For the second problem, we consider training DNNs with these renormalized weights. To this end, we first propose a strategy to construct joint spaces (manifolds) of weights according to the different constraints in DNNs. Next, we propose a fine-grained SGD algorithm (FG-SGD) for optimization on the weight manifolds, to train DNNs with assurance of convergence to minima. Experimental analyses show that the image classification accuracy of baseline DNNs can be boosted using FG-SGD on collections of manifolds identified by multiple constraints.

1 Introduction

Understanding the generalization behavior of DNNs is an open problem [1]. Recent works [2–8] addressed this problem by extending the early results proposed for shallow linear neural networks (NNs) [9] to a more general class of DNNs (e.g. neural networks with ReLU), recurrent neural networks [10, 11], and convolutional neural networks (CNNs) (see Table 1 for a comparison).
The proposed asymptotic bounds were obtained by defining the weight matrices of DNNs as random matrices and applying concentration inequalities to them. Thereby, the bounds were computed as functions of several ℓ_p norms of these matrices, for 1 ≤ p ≤ ∞.

In this work, we conjecture that if we can impose multiple constraints on the weights of DNNs to set upper bounds on the norms of the weight matrices, and train the DNNs with these weights, then the DNNs can achieve empirical generalization errors closer to the proposed theoretical bounds, and we can improve their accuracy on various tasks. We pose two problems in order to achieve this goal:

1. Renormalization of weights to upper bound the norms of their matrices.
2. Training DNNs with renormalized weights, with assurance of convergence to minima.

1.1 Background

Spaces of normalized weights can be identified with different Riemannian manifolds [12]: (i) unit-norm weights reside on the sphere Sp(A_l B_l − 1), (ii) orthonormal weights belong to the Stiefel manifold St(A_l, B_l), and (iii) weights with orthogonal columns reside on the oblique manifold Ob(A_l, B_l), at each lth layer of a DNN (formal mathematical definitions are given in the supplemental material).

∗meteozay@gmail.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Challenge of training DNNs with multiple constraints: DNNs can be trained with multiple constraints using optimization methods proposed for training shallow algorithms [13, 14] and on individual manifolds [12, 15]. If we employ these methods on products of weight manifolds (POMs) to train DNNs, then we observe early divergence, and vanishing and exploding gradients, due to the nonlinear geometry of products of different manifolds.
More precisely, the assumption of a bound on the operator norm of the Hessian of geodesics in POMs, which is required for assurance of convergence, fails while performing 
Stochastic Gradient Descent (SGD) with backpropagation on products of different weight manifolds. Therefore, a non-increasing bound on the probability of failure of the optimization algorithm cannot be computed, and a convergence bound cannot be obtained.

We consider training DNNs in a more general setting, employing groups of weights that can be normalized according to different normalization constraints. Group-wise operations are implemented by concatenating the weight matrices ω^i_{g,l} belonging to each gth group into ω_{g,l} = (ω^1_{g,l}, ω^2_{g,l}, …), ∀g = 1, 2, …, G_l, where i = 1, 2, … indexes the weights of the gth group. For the corresponding group, a space of concatenated weights is identified by the Cartesian product of the manifolds of the weights ω^i_{g,l}. In addition, if we renormalize weights using the standard deviation of the features obtained at each epoch, then the geometry of the manifolds of weights also changes. Therefore, we address the second subproblem, which is optimization on dynamically changing product manifolds of renormalized weights.

In order to solve these problems, we first propose a mathematical framework that makes use of the geometric relationship between weight manifolds determined by different constraints (Section 5). Then, we suggest an approach for training DNNs using multiple constraints on weights to improve their performance under the proposed framework. To this end, we propose a new algorithm that we call fine-grained stochastic gradient descent (FG-SGD) to train DNNs using POMs. We elucidate the geometric properties of POMs to assure convergence of FG-SGD to global minima while training nonlinear DNNs under particular assumptions on their architectures, and to local minima while training a more generic class of nonlinear DNNs.

2 A Conceptual Overview of the Contributions

(1) Bounding norms of weights: We propose a two-stage renormalization procedure.
First, we normalize weights according to the Euclidean, Frobenius and spectral norms, since these are the norms used in the bounds on generalization errors [2–8]. Second, we reparameterize the normalized weights to set a finite, constant upper bound on the norms of the weight matrices. For this purpose, we could use a parameter-learning approach as utilized in batch normalization (BN) [16]. However, such an approach substantially increases the running time of DNNs during training. In addition, it is not efficient to estimate the parameters using a small number of samples in batch training. Therefore, we reparameterize weights according to (a) geometric properties of the weight spaces, and (b) statistical properties (the standard deviation) of the features on which the weights are applied. Using (b), we can also decrease the computation time of the statistical properties, especially when we apply separable operations such as channel/depth-wise separable convolutions [17–19].

The proposed reparameterization method enables us to set the upper bound of each of the different norms of the weight matrices to 1.0. In addition, the proposed renormalization procedure enables us to control the variance of weights during training of DNNs, and thereby assures that the DNNs do not have spurious local minima [20]. Employing the standard deviation in the reparameterization also makes optimization landscapes significantly smoother, by bounding the change of the norms of gradients during training. This property has recently been studied to analyze the effect of BN on the optimization landscape [21]. We use this property to develop a new optimization method for weight renormalization in this paper, as explained in the next problem.

(2) Training DNNs with renormalized weights: Note that there is no single procedure that normalizes weights jointly according to all of the different norms.
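As a concrete aside, stage (i) of the renormalization, normalizing a weight matrix according to a chosen norm, can be sketched in a few lines of NumPy. The function name, shapes, and the menu of norms below are illustrative assumptions, not the exact implementation:

```python
import numpy as np

def normalize_weight(w, norm="frobenius"):
    """Hypothetical sketch of stage (i): rescale a weight matrix so the
    chosen norm equals 1."""
    if norm == "frobenius":
        return w / np.linalg.norm(w)                          # ||w||_F = 1
    if norm == "spectral":
        return w / np.linalg.norm(w, 2)                       # top singular value = 1
    if norm == "columns":                                     # oblique-style unit columns
        return w / np.linalg.norm(w, axis=0, keepdims=True)
    raise ValueError(f"unknown norm: {norm}")

rng = np.random.default_rng(0)
w = rng.standard_normal((5, 3))
print(round(float(np.linalg.norm(normalize_weight(w, "frobenius"))), 6))  # 1.0
```

Stage (ii), the reparameterization that keeps the bound constant during training, then acts on top of such normalized weights.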
Thereby, we normalize weights in groups, such that similar or different norms can be used to normalize the matrices of weights belonging to each group.

We can mathematically prove that the procedure proposed to solve problem (1) sets an upper bound for all of the aforementioned norms. However, we do not have a mathematical proof that explains whether weights normalized according to a single norm can provide the best generalization bound, nor one that determines its type. We examine this question in detail in various experiments in the supp. mat. Since we cannot mathematically verify this observation, we conjecture that using a diverse set of weights normalized with different constraints improves the generalization error, compared to using weights normalized according to a single constraint. We consider the mathematical characterization of this property an open problem.

Our contributions are summarized as follows:

1. DNNs trained using weights renormalized by the proposed method can achieve tighter bounds on theoretical generalization errors compared to using unnormalized weights (see Proposition 1 in the supp. mat. for the derivation). These DNNs do not have spurious local minima [20] (see the next section for a detailed discussion). The proposed scaling method generalizes the scaling method proposed in [22] for weight normalization by incorporating geometric properties of weight manifolds.
2. We explicate the geometry of weight manifolds defined by multiple constraints in DNNs. For this purpose, we explore the relationship between geometric properties of POMs (i.e. sectional curvature), gradients computed on POMs (Theorem 1), and those of the component manifolds of weights in DNNs, in Section 5 (please see Lemma 1 in the supp. mat. for more precise results).
3. We propose an algorithm (FG-SGD) for optimization on different collections of POMs (Section 5) by generalizing SGD methods employed on weight manifolds [12, 23].
Next, we explore the effect of geometric properties of the POMs on the convergence of FG-SGD using our theoretical results. In the proofs of the convergence theorems, we observe that gradients of weights should satisfy a particular normalization requirement, and we employ this requirement for adaptive computation of the step size of FG-SGD (see (5) in Section 5.2.2). We also provide an example of the computation of a step-size function for optimization on POMs identified by the sphere (Corollary 2 in the supp. mat.).
4. We prove that loss functions of DNNs trained using the proposed FG-SGD converge to minima almost surely (see Theorem 2 and Corollary 1 in the supplemental material).

3 Construction of Sets of POMs in DNNs

Let S = {s_i = (I_i, y_i)}_{i=1}^N be a set of training samples, where y_i is the class label of the ith image I_i. We consider an L-layer DNN consisting of a set of tensors W = {W_l}_{l=1}^L, where W_l = {W_{d,l} ∈ R^{A_l×B_l×C_l}}_{d=1}^{D_l}, and W_{d,l} = [W_{c,d,l} ∈ R^{A_l×B_l}]_{c=1}^{C_l} is a tensor² of weight matrices W_{c,d,l}, ∀l = 1, 2, …, L, for each cth channel c = 1, 2, …, C_l and each dth weight d = 1, 2, …, D_l. In popular DNNs, weights with A_l = 1 and B_l = 1 are used at fully connected layers, and those with A_l > 1 or B_l > 1 are used at convolutional layers. At each lth layer, a feature representation f_l(X_l; W_l) is computed by compositionally employing non-linear functions by

f_l(X_l; W_l) = f_l(⋅; W_l) ◦ f_{l−1}(⋅; W_{l−1}) ◦ ⋯ ◦ f_1(X_1; W_1),   (1)

where X_l = [X_{c,l}]_{c=1}^{C_l}, and X_1 := I is an image at the first layer (l = 1). The cth channel of the data matrix X_{c,l} is convolved with the kernel W_{c,d,l} to obtain the dth feature map X_{d,l+1} := q(X̂_{d,l}) by X̂_{d,l} = W_{c,d,l} ∗ X_{c,l}, ∀c, d, l, where q(⋅) is a non-linear function, such as the ReLU.

Definition 1 (Products of weight manifolds and their collections).
Suppose that G_l = {M_{ι,l} : ι ∈ I_{G_l}} is a set of weight manifolds³ M_{ι,l} of dimension n_{ι,l}, identified by a set of indices I_{G_l}, ∀l = 1, 2, …, L. More concretely, I_{G_l} contains indices, each of which represents an identity number (ι) of a weight that resides on a manifold M_{ι,l} at the lth layer. In addition, a subset I^g_l ⊆ I_{G_l}, g = 1, 2, …, G_l, is used to determine a subset G^g_l ⊆ G_l of weight manifolds which will be aggregated to construct a product of weight manifolds (POM). Each M_{ι,l} ∈ G^g_l is called a component manifold of a product of weight manifolds, which is denoted by M_{g,l}. A weight ω_{g,l} ∈ M_{g,l} is obtained by concatenating the weights belonging to M_{ι,l}, ∀ι ∈ I^g_l, using ω_{g,l} = (ω_1, ω_2, …, ω_{|I^g_l|}), where |I^g_l| is the cardinality of I^g_l. A collection {M_{g,l}}_{g=1}^{G_l} is called a collection of POMs.

²We use shorthand notation for matrix concatenation, such that [W_{c,d,l}]_{c=1}^{C_l} ≜ [W_{1,d,l}, W_{2,d,l}, …, W_{C_l,d,l}].
³In this work, we consider Riemannian manifolds of normalized weights defined in the previous section. Formal definitions are given in the supp. mat.

Previous works [12, 23] employ SGD using weights each of which resides on a single manifold³ at each layer of a DNN. We extend this approach by considering that each weight can reside on an individual manifold or on a collection of products of manifolds, as introduced in Definition 1.

Table 1: Comparison of generalization bounds. O denotes big-O and Õ is soft-O. δ_{l,F}, δ_{l,2}, and δ_{l,2→1} denote upper bounds on the Frobenius norm ‖ω_l‖_F ≤ δ_{l,F}, the spectral norm ‖ω_l‖_2 ≤ δ_{l,2}, and the sum of the Euclidean norms of all rows ‖ω_l‖_{2→1} ≤ δ_{l,2→1} (ℓ_{2→1}) of the weights ω_l at the lth layer of an L-layer DNN using N samples.

Neyshabur et al. [24]:  O( 2^L ∏_{l=1}^L ∏_{g=1}^{G_l} δ_{g,l,F} / √N )
Bartlett et al. [3]:    Õ( (∏_{l=1}^L ∏_{g=1}^{G_l} δ_{g,l,2}) ( Σ_{l=1}^L ∏_{g=1}^{G_l} (δ_{g,l,2→1} / δ_{g,l,2})^{2/3} )^{3/2} / √N )
Neyshabur et al. [8]:   Õ( (∏_{l=1}^L ∏_{g=1}^{G_l} δ_{g,l,2}) √( L² h Σ_{l=1}^L ∏_{g=1}^{G_l} δ²_{g,l,F} / δ²_{g,l,2} ) / √N )
DNNs (dynamic group scaling; this work): the above bounds with every δ_{g,l,F}, δ_{g,l,2}, δ_{g,l,2→1} upper bounded by 1 (Section 4).

Table 2: Comparison of norms of weights belonging to different weight manifolds. Suppose that the weights ω^i_{g,l} ∈ R^{A_l×B_l} belonging to the gth group, g = 1, 2, …, G_l, ∀l, have the same size A_l × B_l for simplicity, and σ(ω^i_{g,l}) denotes the top singular value of ω^i_{g,l}. ‖ω^i_{g,l}‖_F, ‖ω^i_{g,l}‖_2, and ‖ω^i_{g,l}‖_{2→1} denote respectively the Frobenius, spectral and ℓ_{2→1} norms of the weight ω^i_{g,l}.

Norms               | (i) Sphere     | (ii) Stiefel  | (iii) Oblique
‖ω^i_{g,l}‖_2       | σ(ω^i_{g,l})   | 1.0           | σ(ω^i_{g,l})
‖ω^i_{g,l}‖_F       | 1.0            | (B_l)^{1/2}   | (B_l)^{1/2}
‖ω^i_{g,l}‖_{2→1}   | 1.0            | (B_l)^{1/4}   | (B_l)^{1/4}

We propose three schemes, called POMs for input channels (PI), for output channels (PO), and for input/output channels (PIO), to construct the index sets.
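A minimal sketch of constructing such index sets follows; the channel and group counts are hypothetical, and the random draw without replacement mirrors the selection described here:

```python
import numpy as np

def make_group_indices(num_channels, num_groups, seed=0):
    """Hypothetical sketch: partition channel indices into disjoint groups.
    Indices are drawn randomly without replacement (a random permutation)
    once at initialization and then kept fixed for the rest of training."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_channels)      # draw without replacement
    return np.array_split(perm, num_groups)   # one index set per group

groups = make_group_indices(num_channels=8, num_groups=3)
print(sorted(int(i) for g in groups for i in g))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

In the PI/PO/PIO schemes, such index sets would be built over input channels, output channels, or both, respectively.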
Indices of the sets are selected randomly, using a hypergeometric distribution without replacement, at the initialization of training, and they are fixed for the rest of the training. Implementation details and experimental analyses are given in the supp. mat.

4 Bounding Generalization Errors using Fine-grained Weights

Mathematically, the norms of the concatenated weights ω_{g,l}, ∀g, are lower bounded by products of the norms of their component weights ω^i_{g,l}, ∀i. Weights are rescaled dynamically at each tth epoch of an optimization method proposed to train DNNs using R^t_{i,l} = γ_{i,l} / λ^t_{i,l}, where γ_{i,l} > 0 is a geometric scaling parameter and λ^t_{i,l} is the standard deviation of the features input to the ith weight ω^i_{g,l} in the gth group, ∀i, g. The scaling parameter R^t_{i,l} enables us to upper bound the norms of weights by 1 (see Table 1). Suppose that all layers have the same width h, and that weights have the same length K and the same stride s. Then, generalization bounds are obtained for DNNs using these fixed parameters by ‖ω_l‖_2 = K/s, ‖ω_l‖_F = √h and ‖ω_l‖_{2→1} = h. We compute a concatenated weight matrix ω_{g,l} = (ω^1_{g,l}, ω^2_{g,l}, …) for the gth weight group, g = 1, 2, …, G_l, ∀l, using a weight grouping strategy. Then, we have upper bounds on the norms, ‖ω_{g,l}‖_F ≤ δ_{g,l,F} ≤ 1, ‖ω_{g,l}‖_2 ≤ δ_{g,l,2} ≤ 1 and ‖ω_{g,l}‖_{2→1} ≤ δ_{g,l,2→1} ≤ 1, g = 1, 2, …, G_l, which are defined in Table 2.

4.1 The Proof Strategy for Computation of the Upper Bounds

The upper bounds are computed in Proposition 1 in the supplemental material.
The proof strategy is summarized as follows:

• Let b_{i,l} be the product of the number of input channels and the size of the receptive field of the unit that employs ω^i_{g,l}, and let b̂_{i,l} be the product of the dimension of the output feature maps and the number of output channels used at the lth layer, respectively. Then, the geometric scaling γ_{i,l} of the weight space of ω^i_{g,l} is computed by

γ_{i,l} = √( 1 / (b_{i,l} + b̂_{i,l}) ).   (2)

• We can consider that the standard deviations of features satisfy λ^t_{i,l} ≥ 1 using two approaches. First, by employing the central limit theorem for weighted summations of the random variables of features, we can prove that λ^t_{i,l} converges to 1 asymptotically, as popularly employed in previous works. Second, we can assume that we apply batch normalization (BN), setting the re-scaling parameter of the BN to 1. Thereby, we can obtain 1/λ^t_{i,l} ≤ 1. By definition, γ²_{i,l} < B_l, ∀i, l. In order to show that σ(ω^i_{g,l}) ≤ (γ_{i,l})^{−1}, ∀i, l, we apply the Bai–Yin law [25, 26]. Thereby, we conclude that the norms of the concatenated weights belonging to the groups given in Table 1 are upper bounded by 1, if the corresponding component weights given in Table 2 are rescaled by R^t_{i,l}, ∀i, l, t, during training.

Remark 1. We compute the norms of weights belonging to each different manifold in Table 2. Following Proposition 1 given in the supplemental material, we have ‖ω_{g,l}‖_F ≥ (∏_i ‖ω^i_{g,l}‖_F)^{1/|I^g_l|}, ‖ω_{g,l}‖_2 ≥ (∏_i ‖ω^i_{g,l}‖_2)^{1/|I^g_l|} and ‖ω_{g,l}‖_{2→1} ≥ (∏_i ‖ω^i_{g,l}‖_{2→1})^{1/|I^g_l|}, where the products run over the weights i of the gth group.

4.2 Generalization of Scaled Weight Initialization Methods

Note that scaling by R^t_{i,l} computed using (2) is different from the scaling method suggested in [12], in that our proposed method assures a tighter upper bound on the norms of weights. Our method also generalizes the scaling method given in [27] (a.k.a. Xavier initialization) in two ways. First, we use the size of the input receptive fields and output feature spaces, which determine the dimension of the weight manifolds, as well as the number of input and output dimensions, which determine the number of manifolds used in the groups.

Second, we perform scaling not just at initialization, but at each tth epoch of the optimization method. Therefore, the diversity of weights is controlled, and we can obtain weights uniformly distributed on the corresponding manifolds, whose geometric properties change dynamically at each epoch. Applying this property with the results given in [20], we can prove that NNs applying the proposed scaling have no spurious local minima⁴. In addition, our method generalizes the scaling method proposed in [22] for weight normalization by incorporating geometric properties of weight manifolds.

5 Optimization using Fine-Grained Stochastic Gradient Descent in DNNs

In this section, we address the problem of optimization of DNNs, considering constraints on weight matrices according to the upper bounds on their norms given in Section 4.

5.1 Optimization on POMs in DNNs: Challenges

Employment of a vanilla SGD on POMs, with assurance of convergence to local or global minima, for training DNNs using back-propagation (BP) with collections of POMs is challenging.
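As a brief aside before examining these challenges, the dynamic rescaling of Section 4 can be made concrete in a few lines; the fan-in/fan-out arguments stand in for b_{i,l} and b̂_{i,l} of Eq. (2), and the feature batch below is a synthetic stand-in:

```python
import numpy as np

def geometric_scale(b, b_hat):
    """gamma_{i,l} = sqrt(1 / (b_{i,l} + b_hat_{i,l})), as in Eq. (2)."""
    return np.sqrt(1.0 / (b + b_hat))

def rescale_weight(w, features, b, b_hat):
    """Dynamic rescaling R^t_{i,l} = gamma_{i,l} / lambda^t_{i,l}, where
    lambda^t is the empirical standard deviation of the features fed to
    this weight at epoch t (synthetic here)."""
    lam = float(features.std())
    return (geometric_scale(b, b_hat) / lam) * w

print(float(geometric_scale(3, 1)))  # 0.5
```

Unlike Xavier initialization, this rescaling would be re-applied at every epoch, since the feature statistics change during training.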
More precisely, we observe early divergence of SGD, and exploding and vanishing gradients in practice, due to the following theoretical properties of collections of POMs:

⁴We omit the formal theorem and the proof of this result in this work, to focus on our main goal and novelty, i.e. optimization with multiple weight manifolds.

Algorithm 1 Optimization using FG-SGD on product manifolds of fine-grained weights.
1: Input: T (number of iterations), S (training set), Θ (set of hyperparameters), L (a loss function), I^g_l ⊆ I_{G_l}, ∀g, l.
2: Initialization: Construct a collection of products of weight manifolds G_l, initialize the re-scaling parameters R^t_l, and initialize the weights ω^t_{g,l} ∈ M_{g,l} with I^g_l ⊆ I_{G_l}, ∀g, l.
3: for each iteration t = 1, 2, …, T do
4:   for each layer l = 1, 2, …, L do
5:     gradL(ω^t_{g,l}) := Π_{ω^t_{g,l}}(grad_E L(ω^t_{g,l}), Θ, R^t_l), ∀G_l.
6:     v_t := h(gradL(ω^t_{g,l}), r(t, Θ)), ∀G_l.
7:     ω^{t+1}_{g,l} := φ_{ω^t_{g,l}}(v_t, R^t_l), ∀ω^t_{g,l}, ∀G_l.
8:   end for
9: end for
10: Output: A set of estimated weights {ω^T_{g,l}}_{l=1}^L, ∀g.

• Geometric properties of a POM M_{g,l} can be different from those of its component manifolds M_ι, even if the component manifolds are identical. For example, we observe locally varying curvatures when we construct POMs of unit spheres. Weight manifolds with more complicated geometric properties can be obtained using the proposed PIO strategy, especially by constructing collections of POMs of non-identical manifolds. Therefore, the assumption of existence of compact weight subsets in POMs may fail, due to locally varying metrics within a nonlinear component manifold and among different component manifolds.

• When we optimize weights using SGD in DNNs, we first obtain the gradient computed for each weight ω_{g,l} ∈ M_{g,l} at the lth layer from the (l+1)st layer using backpropagation (BP). Then, each weight ω_{g,l} moves on M_{g,l} according to its gradient. However, the curvatures and metrics of M_{g,l} can vary locally, and they may be different from those of the component manifolds of M_{g,l}, as explained above.

This geometric drawback causes two critical problems. First, weights can be moved incorrectly if we move them using only the gradients computed for each individual component of the weights, as popularly employed for Euclidean linear weight spaces. Second, due to incorrect employment of gradients and movement of weights, the probability of failure of the SGD cannot be bounded, and convergence cannot be achieved (see the proofs of Theorem 2, Corollary 1 and Corollary 2 for details). In practice, this causes unbounded increase or decrease of the values of gradients and weights.

5.2 A Geometric Approach to Optimization on POMs in DNNs

In order to address these problems for training DNNs, we first analyze the relationship between geometric properties of POMs and those of their component manifolds.
Remark 2. (See Lemma 1 given in the supp. mat. for the complete proofs of the following propositions.) Our main theoretical results regarding geometric properties of POMs are summarized as follows:
1. Computation of metrics: A metric defined on a product weight manifold M_{g,l} can be computed by superposition (i.e. linear combination) of the Riemannian metrics of its component manifolds.
2. 
Lower bounds of sectional curvatures: The sectional curvature of a product weight manifold M_{g,l} is lower bounded by 0.

5.2.1 Development of FG-SGD employing the Geometry of POMs

We use the first result (1) for the projection of Euclidean gradients obtained using BP onto product weight manifolds. More precisely, in FG-SGD we can compute the norms of gradients at weights on a product weight manifold by linear superposition of those computed on its component manifolds. Thereby, we can move a weight on a product weight manifold by (i) retraction of the components of the weight on the component manifolds of the product weight manifold, and (ii) concatenation of the projected weight components.

The second result (2) shows that some sectional curvatures vanish on a product weight manifold M_{g,l}. For instance, suppose that each component weight manifold M_{ι,l} of M_{g,l} is a unit two-sphere S², ∀ι ∈ I_{G_l}. Then, M_{g,l} has unit curvature along the two-dimensional subspaces of its tangent spaces, called two-planes. However, M_{g,l} has zero curvature along all two-planes spanning exactly two distinct spheres. In addition, weights can always move according to a non-negative bound on the sectional curvature of compact product weight manifolds on their tangent spaces. Therefore, we do not need to worry about the varying positive and negative curvatures observed at different component manifolds.

The second result also suggests that learning rates need to be computed adaptively, as a function of the norms of gradients and of the bounds on sectional curvatures, at each layer of the DNN and at each epoch of FG-SGD, for each weight ω on each product weight manifold M_{g,l}.
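For the product-of-spheres case discussed above, a minimal sketch of the superposition of component gradients follows; the shapes and random data are illustrative, and the tangent projection is the standard one for the unit sphere:

```python
import numpy as np

def sphere_tangent_project(w, g):
    """Project a Euclidean gradient g onto the tangent space of the unit
    sphere at w: g - <g, w> w."""
    return g - np.dot(g, w) * w

def product_grad_norm(ws, gs):
    """Gradient norm on a product of unit spheres, obtained by superposing
    the squared norms of the component tangent gradients (a sketch of the
    superposition result above)."""
    comps = [sphere_tangent_project(w, g) for w, g in zip(ws, gs)]
    return np.sqrt(sum(np.dot(c, c) for c in comps)), comps

rng = np.random.default_rng(1)
ws = [w / np.linalg.norm(w) for w in rng.standard_normal((3, 4))]
gs = list(rng.standard_normal((3, 4)))
norm, comps = product_grad_norm(ws, gs)
print(all(abs(np.dot(c, w)) < 1e-9 for c, w in zip(comps, ws)))  # True
```

Each component gradient is tangent to its own sphere, and the product-manifold norm is just the root of the sum of the component squared norms.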
We employ these results to analyze the convergence of FG-SGD and to compute its adaptive step size in the following sections.

5.2.2 Optimization on POMs using FG-SGD in DNNs

An algorithmic description of our proposed fine-grained SGD (FG-SGD) is given in Algorithm 1. At the initialization of FG-SGD, we identify the component weight manifolds M_{ι,l} of each product weight manifold M_{g,l} according to the constraints that will be applied to the weights ω_ι ∈ M_{ι,l}, for each gth group at each lth layer. For t = 1, each manifold M_{ι,l} is scaled by R^{t=1}_{ι,l} using λ^{t=1}_{ι,l} = 1, ∀ι, l. For t > 1, each M_{ι,l} is re-scaled by R^t_{ι,l} ∈ R^t_l, computing the empirical standard deviation λ^t_ι of the features input to each weight of M_{ι,l}, where R^t_l is the set of all re-scaling parameters computed at the tth epoch at each lth layer. When we employ FG-SGD on a product weight manifold M_{g,l}, each weight ω^t_{g,l} ∈ M_{g,l} is moved on M_{g,l} in the descent direction of the gradient of the loss, at each tth step of FG-SGD, by the following steps:

Line 5 (Projection of gradients onto tangent spaces): The gradient grad_E L(ω^t_{g,l}), obtained using back-propagation from the upper layer, is projected onto the tangent space T_{ω^t_{g,l}} M_{g,l} to compute gradL(ω^t_{g,l}) at the weight ω^t_{g,l}, using the results given in Remark 2, where T_{ω^t_{g,l}} M_{g,l} = ⊕_{ι ∈ I^g_l} T_{ω^t_{ι,l}} M_{ι,l}, and T_{ω^t_{ι,l}} M_{ι,l} is the tangent space at ω^t_{ι,l} on the component manifold M_{ι,l} of M_{g,l}.

Line 6 (Movement of weights on tangent spaces): The weight ω^t_{g,l} is moved on T_{ω^t_{g,l}} M_{g,l} using

h(gradL(ω^t_{g,l}), r(t, Θ)) = − (r(t, Θ) / ϱ(ω^t_{g,l})) gradL(ω^t_{g,l}),   (3)

where r(t, Θ) is the learning rate, which satisfies

Σ_{t=0}^∞ r(t, Θ) = +∞ and Σ_{t=0}^∞ r(t, Θ)² < ∞,   (4)

and

ϱ(ω^t_{g,l}) = (max{1, Γ^t_1})^{1/2}, with Γ^t_1 = (R^t_{g,l})² ‖gradL(ω^t_{g,l})‖²_2 Γ^t_2,   (5)
Γ^t_2 = max{(2ρ^t_{g,l} + R^t_{g,l})², (1 + c_{g,l}(ρ^t_{g,l} + R^t_{g,l}))²},

where c_{g,l} is the sectional curvature of M_{g,l}, and ρ^t_{g,l} ≜ ρ(ω^t_{g,l}, ω̂_{g,l}) is the geodesic distance between ω^t_{g,l} and a local minimum ω̂_{g,l} on M_{g,l}.

The following result is used for computation of the ℓ₂ norm of gradients.

Theorem 1 (Computation of gradients on tangent spaces). The ℓ₂ norm ‖gradL(ω^t_{g,l})‖₂ of the gradient gradL(ω^t_{g,l}) residing on T_{ω^t_{g,l}} M_{g,l} at the tth epoch and the lth layer can be computed by

‖gradL(ω^t_{g,l})‖₂ = ( Σ_{ι ∈ I^g_l} ‖gradL(ω^t_{ι,l})‖₂² )^{1/2},   (6)

where gradL(ω^t_{ι,l}) is the gradient computed for ω^t_{ι,l} on the tangent space T_{ω^t_{ι,l}} M_ι, ∀ι ∈ I^g_l.

We compute the norms of gradients on tangent spaces of product manifolds by simply superposing the gradients computed on the tangent spaces of the component manifolds, using (6). The norms of gradients, the sectional curvatures of the product manifolds, and their metric properties affect the convergence of weights on the product manifolds to local or global minima. More precisely, they define an upper bound on the divergence of the weights from local or global minima (please see the proof of Theorem 2 in the supplemental material for details). This upper bound is further bounded by normalizing the gradients in (3).

Line 7 (Projection of moved weights onto products of manifolds): The moved weight, located at v_t, is projected onto M_{g,l} re-scaled by R^t_l, using φ_{ω^t_{g,l}}(v_t, R^t_l), to compute ω^{t+1}_{g,l}, where φ_{ω^t_{g,l}} is an exponential map, or a retraction, i.e. an approximation of the exponential map [28]. The function ϱ(ω^t_{g,l}) used for computing the step size in (3) is employed as a regularizer to control the change of the gradient gradL(ω^t_{g,l}) at each step of FG-SGD. This property is examined in the experimental analyses in the supp. mat. For the computation of ϱ(ω^t_{g,l}), we use (6) with Theorem 1.
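Putting the projection, movement, and retraction steps together for a product of unit spheres, one FG-SGD step can be sketched as below. The normalizer follows the sphere case of the step size (see (7) in the next subsection), and the shapes, learning rate, and value of R are illustrative assumptions:

```python
import numpy as np

def step_normalizer(R):
    """Sphere case of the step-size normalizer:
    rho(w) = (max{1, R^2 (2 + R)^2})^(1/2)."""
    return np.sqrt(max(1.0, R**2 * (2.0 + R)**2))

def fg_sgd_step(ws, euclid_grads, lr, R=1.0):
    """One hypothetical FG-SGD step on a product of unit spheres:
    project the Euclidean gradient onto each component tangent space,
    move with the normalized step size, and retract onto each sphere."""
    new_ws = []
    for w, g in zip(ws, euclid_grads):
        tg = g - np.dot(g, w) * w                  # tangent projection
        v = w - (lr / step_normalizer(R)) * tg     # move on tangent space
        new_ws.append(v / np.linalg.norm(v))       # retraction onto the sphere
    return new_ws

rng = np.random.default_rng(2)
ws = [w / np.linalg.norm(w) for w in rng.standard_normal((2, 5))]
ws = fg_sgd_step(ws, list(rng.standard_normal((2, 5))), lr=0.1)
print([round(float(np.linalg.norm(w)), 6) for w in ws])  # [1.0, 1.0]
```

The renormalization in the last step plays the role of the retraction φ, keeping every component weight on its manifold after the move.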
In FG-SGD, the weights residing on each POM are moved and projected jointly on the POMs, by which we can employ their interaction, using the corresponding gradients and considering the nonlinear geometry of the manifolds, unlike the SGD methods studied in the literature. FG-SGD can consider interactions between component manifolds, as well as those between POMs, in groups of weights. Employment of (3) and (4) at line 6, and of retractions at line 7, is essential for the assurance of convergence, as explained next.

5.3 Convergence Properties of FG-SGD

The convergence properties of the proposed FG-SGD used to train DNNs are summarized as follows:

Convergence to local minima: The loss function of a non-linear DNN which employs the proposed FG-SGD converges to a local minimum, and the corresponding gradient converges to zero, almost surely (a.s.). The formal theorem and proof are given in Theorem 2 in the supplemental material.

Convergence to global minima: Loss functions of particular DNNs, such as linear DNNs, one-hidden-layer CNNs, one-hidden-layer Leaky ReLU networks, and nonlinear DNNs with specific network structures (e.g. pyramidal networks), trained using FG-SGD, converge to a global minimum a.s. under mild assumptions on the data (e.g. being distributed according to a Gaussian distribution, normalized, and realized by DNNs). The formal theorem and proof of this result are given in Corollary 1 in the supp. mat.
The proof idea is to use the property that, under these assumptions, local minima of the loss functions of these networks are global minima, by employing the results given in the recent works [29–38].

An example for adaptive computation of step size: Suppose that each $\mathcal{M}_\iota$ is identified by an $n_\iota \geq 2$ dimensional unit sphere, or a sphere scaled by the proposed scaling method. If the step size is computed using (3) with
$$\varrho(\omega^t_{G_m,l}) = \Big(\max\big\{1, (R^t_{G_m,l})^2 \, (2 + R^t_{G_m,l})^2\big\}\Big)^{\frac{1}{2}}, \qquad (7)$$
then the loss function converges to local minima for a generic class of nonlinear DNNs, and to global minima for the DNNs characterized in Corollary 1. The formal theorem and proof of this result are given in Corollary 2 in the supp. mat. The proof idea follows from the property that the unit sphere has positive sectional curvature, and a product of unit spheres has non-negative curvature.

Using these results for the sphere, the normalizing function (5) is computed by the maximum of 1 and a polynomial function of the norm of gradients (6). Note that variations of this function (7) have been used in practice to train large DNNs successfully, i.e. to avoid exploding and vanishing gradients. Our mathematical framework elucidates the underlying theory behind the success of gradient normalization methods for convergence of DNNs to local and global minima.

6 Conclusion and Discussion

We introduced and elucidated a problem of training CNNs using multiple constraints imposed on convolution weights, with convergence guarantees. Following our theoretical results, we proposed the FG-SGD algorithm and adaptive step size estimation methods for optimization on collections of POMs that are identified by the constraints. Due to the page limit, experimental analyses are given in the supplemental material. In these analyses, we observe that our proposed methods can improve the convergence properties and classification performance of CNNs.
Overall, the results show that employing collections of POMs with FG-SGD can boost the performance of various CNNs on multiple datasets. A future research direction is to investigate how far local minima are from global minima in the search spaces of FG-SGD, using products of weight manifolds with nonlinear DNNs, together with the corresponding convergence rates.

We believe that our proposed mathematical framework and results will be useful and inspiring for researchers studying geometric properties of parameter spaces of deep networks, and will improve our understanding of deep feature representations.

References

[1] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proc. of the 5th Int. Conf. on Learn. Rep. (ICLR), 2017.

[2] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems (NIPS), pages 5947–5956, 2017.

[3] Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 6240–6249, 2017.

[4] Taiji Suzuki. Fast generalization error bound of deep learning from a kernel perspective. In Proc. of the 21st Int. Conf. on Artificial Intelligence and Statistics (AISTATS), volume 84, pages 1397–1406, 2018.

[5] Pan Zhou and Jiashi Feng. Understanding generalization and optimization performance of deep CNNs. In Proc. of the 35th Int. Conf. on Mach. Learn. (ICML), volume 80, pages 5960–5969, 2018.

[6] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In Proc. of the 35th Int. Conf. on Mach. Learn. (ICML), volume 80, pages 254–263, 2018.

[7] Noah Golowich, Alexander Rakhlin, and Ohad Shamir.
Size-independent sample complexity of neural networks. In Proc. of the 31st Conf. on Learning Theory (COLT), pages 297–299, 2018.

[8] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In Proc. of the 6th Int. Conf. on Learn. Rep. (ICLR), 2018.

[9] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, New York, NY, USA, 1st edition, 2009.

[10] Mario Lezcano-Casado and David Martinez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In Proc. of the 36th Int. Conf. on Mach. Learn. (ICML), 2019.

[11] Mario Lezcano-Casado. Trivializations for gradient-based optimization on manifolds. In Advances in Neural Information Processing Systems (NIPS), 2019.

[12] Mete Ozay and Takayuki Okatani. Training CNNs with normalized kernels. In AAAI Conference on Artificial Intelligence, 2018.

[13] Haoran Chen, Yanfeng Sun, Junbin Gao, Yongli Hu, and Baocai Yin. Partial least squares regression on Riemannian manifolds and its application in classifications. CoRR, abs/1609.06434, 2016.

[14] Yui Man Lui. Human gesture recognition on product manifolds. J. Mach. Learn. Res., 13(1):3297–3321, Nov 2012.

[15] Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems (NIPS), 2017.

[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of the 32nd Int. Conf. on Mach. Learn. (ICML), pages 448–456, 2015.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.

[18] J.
Luo, H. Zhang, H. Zhou, C. Xie, J. Wu, and W. Lin. ThiNet: Pruning CNN filters for a thinner net. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018.

[19] Guotian Xie, Ting Zhang, Kuiyuan Yang, Jianhuang Lai, and Jingdong Wang. Decoupled convolutions for CNNs, 2018.

[20] Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. In Proc. of the 20th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), volume 54, pages 1216–1224, 2017.

[21] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? (No, it is not about internal covariate shift). In Advances in Neural Information Processing Systems (NIPS), 2018.

[22] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

[23] Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems (NIPS), pages 5231–5241, 2017.

[24] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proc. of the 28th Conf. on Learning Theory (COLT), volume 40, pages 1376–1401, 2015.

[25] Z. D. Bai and Y. Q. Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3):1275–1294, 1993.

[26] Z. D. Bai, Jack W. Silverstein, and Y. Q. Yin. A note on the largest eigenvalue of a large dimensional sample covariance matrix. Journal of Multivariate Analysis, 26(2):166–168, 1988.

[27] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. of the 13th Int. Conf.
on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 249–256, 2010.

[28] P.-A. Absil and Jerome Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.

[29] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems (NIPS), pages 586–594, 2016.

[30] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In Proc. of the 34th Int. Conf. on Mach. Learn. (ICML), pages 605–614, 2017.

[31] Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabás Póczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. In Proc. of the 35th Int. Conf. on Mach. Learn. (ICML), 2018.

[32] Simon S. Du and Jason D. Lee. On the power of over-parametrization in neural networks with quadratic activation. In Proc. of the 35th Int. Conf. on Mach. Learn. (ICML), 2018.

[33] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Global optimality conditions for deep neural networks. In Proc. of the 6th Int. Conf. on Learn. Rep. (ICLR), 2018.

[34] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In Proc. of the 5th Int. Conf. on Learn. Rep. (ICLR), 2017.

[35] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In Proc. of the 5th Int. Conf. on Learn. Rep. (ICLR), 2017.

[36] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. A critical view of global optimality in deep learning. CoRR, abs/1802.03487, 2018.

[37] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proc. of the 34th Int. Conf. on Mach.
Learn. (ICML), volume 70, pages 2847–2854, 2017.

[38] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Proc. of the 34th Int. Conf. on Mach. Learn. (ICML), volume 70, pages 2603–2612, 2017.

[39] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recog. (CVPR), 2018.

[40] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of the 32nd Int. Conf. on Mach. Learn. (ICML), volume 37, pages 448–456, 2015.