{"title": "Convergent Block Coordinate Descent for Training Tikhonov Regularized Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1721, "page_last": 1730, "abstract": "By lifting the ReLU function into a higher dimensional space, we develop a smooth multi-convex formulation for training feed-forward deep neural networks (DNNs). This allows us to develop a block coordinate descent (BCD) training algorithm consisting of a sequence of numerically well-behaved convex optimizations. Using ideas from proximal point methods in convex analysis, we prove that this BCD algorithm will converge globally to a stationary point with R-linear convergence rate of order one. In experiments with the MNIST database, DNNs trained with this BCD algorithm consistently yielded better test-set error rates than identical DNN architectures trained via all the stochastic gradient descent (SGD) variants in the Caffe toolbox.", "full_text": "Convergent Block Coordinate Descent for Training\n\nTikhonov Regularized Deep Neural Networks\n\nZiming Zhang and Matthew Brand\n\nMitsubishi Electric Research Laboratories (MERL)\n\nCambridge, MA 02139-1955\n\n{zzhang, brand}@merl.com\n\nAbstract\n\nBy lifting the ReLU function into a higher dimensional space, we develop a smooth\nmulti-convex formulation for training feed-forward deep neural networks (DNNs).\nThis allows us to develop a block coordinate descent (BCD) training algorithm\nconsisting of a sequence of numerically well-behaved convex optimizations. Using\nideas from proximal point methods in convex analysis, we prove that this BCD\nalgorithm will converge globally to a stationary point with R-linear convergence\nrate of order one. 
In experiments with the MNIST database, DNNs trained with this BCD algorithm consistently yielded better test-set error rates than identical DNN architectures trained via all the stochastic gradient descent (SGD) variants in the Caffe toolbox.

1 Introduction

Feed-forward deep neural networks (DNNs) are function approximators wherein weighted combinations of inputs are filtered through nonlinear activation functions organized into a cascade of fully connected (FC) hidden layers. In recent years DNNs have become the tool of choice in many research areas such as machine translation and computer vision.

The objective function for training a DNN is highly non-convex, leading to numerous obstacles to global optimization [10], notably the proliferation of saddle points [11] and the prevalence of local extrema that generalize poorly off the training sample [8]. These observations have motivated regularization schemes that smooth or simplify the energy surface, either explicitly, such as weight decay [23], or implicitly, such as dropout [32] and batch normalization [19], so that the solutions are more robust, i.e. generalize better to test data.

Training algorithms face many numerical difficulties that can make it hard to even find a local optimum. One well-known issue is the vanishing gradient in back propagation (chain-rule differentiation) [18]: the long dependency chains between hidden layers (and the corresponding variables) tend to drive gradients toward zero far from the optimum. This leads to very slow improvement of the model parameters, a problem that becomes increasingly serious in deeper networks [16]. The vanishing gradient problem can be partially ameliorated by using non-saturating activation functions such as the rectified linear unit (ReLU) [25], and by network architectures with shorter input-to-output paths such as ResNet [17].
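As a small numerical illustration of this effect (ours, not from the paper): in a deep chain the backpropagated gradient is a product of per-layer activation derivatives. For the logistic sigmoid that derivative is at most 0.25, so the product decays geometrically with depth, while a ReLU chain with positive pre-activations passes the gradient through unchanged:

```python
import numpy as np

def chain_gradient(depth, act_deriv, z=1.0):
    """Gradient of a depth-layer scalar chain with unit weights,
    i.e. the product of activation derivatives along the chain."""
    g = 1.0
    for _ in range(depth):
        g *= act_deriv(z)
    return g

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))  # <= 0.25 everywhere
drelu = lambda z: 1.0 if z > 0 else 0.0               # passes gradient intact

g_sig = chain_gradient(50, dsigmoid)   # geometrically small
g_relu = chain_gradient(50, drelu)     # exactly 1.0 for positive inputs
```

With 50 layers the sigmoid-chain gradient is numerically negligible, while the ReLU chain preserves it.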
The saddle-point problem has been addressed by switching from deterministic gradient descent to stochastic gradient descent (SGD), which can achieve weak convergence in probability [6]. Classic proximal-point optimization methods such as the alternating direction method of multipliers (ADMM) have also shown promise for DNN training [34; 41], but in the DNN setting their convergence properties remain unknown.

Contributions: In this paper,

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1. We propose a novel Tikhonov regularized multi-convex formulation for deep learning, which can be used to learn both dense and sparse DNNs;

2. We propose a corresponding block coordinate descent (BCD) based learning algorithm, which is guaranteed to converge globally to stationary points with R-linear convergence rate of order one;

3. We demonstrate empirically that DNNs estimated with BCD can produce better representations than DNNs estimated with SGD, in the sense of yielding better test-set classification rates.

Our Tikhonov regularization is motivated by the fact that applying the ReLU activation function is equivalent to solving a smoothly penalized projection problem in a higher-dimensional Euclidean space. We use this to build a Tikhonov regularization matrix which encodes all the information of the network, i.e. the architecture as well as its associated weights. In this way our training objective can be divided into three sub-problems, namely, (1) a Tikhonov regularized inverse problem [37], (2) least-squares regression, and (3) learning classifiers.
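For readers less familiar with sub-problem (1): a generic Tikhonov regularized inverse problem min_u ‖Au − b‖² + ‖Γu‖² has the closed-form solution u = (AᵀA + ΓᵀΓ)⁻¹Aᵀb, equivalently an ordinary least-squares solve on an augmented system. A minimal sketch (a generic ridge example with made-up data, not the paper's specific Q matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
Gamma = np.sqrt(0.1) * np.eye(5)   # Tikhonov matrix (here: a scaled identity)

# Closed form: u = (A^T A + Gamma^T Gamma)^{-1} A^T b
u_closed = np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

# Same answer via least squares on the augmented (stacked) system
A_aug = np.vstack([A, Gamma])
b_aug = np.concatenate([b, np.zeros(5)])
u_aug, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
```

The two solves agree, which is the standard equivalence exploited when such sub-problems are solved in closed form.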
Since each sub-problem is convex and coupled with the other two, our overall objective is multi-convex.

Block coordinate descent (BCD) is often used for problems where finding an exact solution of a sub-problem with respect to a subset (block) of variables is much simpler than finding the solution for all variables simultaneously [27]. In our case, each sub-problem isolates a block of variables that can be solved easily (e.g. closed-form solutions exist). One advantage of our decomposition into sub-problems is that the long-range dependency between hidden layers is captured within a sub-problem whose solution helps propagate information between inputs and outputs and thereby stabilize the network (i.e. aid convergence). Therefore, it does not suffer from vanishing gradients at all. In our experiments, we demonstrate the effectiveness and efficiency of our algorithm by comparing with SGD based solvers.

1.1 Related Work

(1) Stochastic Regularization (SR) vs. Local Regularization vs. Tikhonov Regularization: SR is a widely-used technique in deep learning to prevent training from overfitting. The basic idea in SR is to multiply the network weights by some random variables so that the learned network is more robust and generalizes better to test data. Dropout [32] and its variants such as [22] are classic examples of SR. Gal & Ghahramani [14] showed that SR in deep learning can be considered as approximate variational inference in Bayesian neural networks.

Recently Baldassi et al. [2] proposed smoothing non-convex functions with local entropy, and later Chaudhari et al. [8] proposed Entropy-SGD for training DNNs. The idea behind such methods is to locate solutions within large flat regions of the energy landscape that favor good generalization. In [9] Chaudhari et al.
provided a mathematical justification for these methods from the perspective of partial differential equations (PDEs).

In contrast, our Tikhonov regularization smooths the non-convex loss explicitly, globally, and data-dependently. We deterministically learn the Tikhonov matrix as well as the auxiliary variables in the ill-posed inverse problems. The Tikhonov matrix encodes all the information in the network, and the auxiliary variables represent the ideal outputs of the data from each hidden layer that minimize our objective. Conceptually these variables play a role similar to target propagation [4].

(2) SGD vs. BCD: In [6] Bottou et al. proved weak convergence of SGD for non-convex optimization. Ghadimi & Lan [15] showed that SGD can achieve convergence rates that scale as O(t^{-1/2}) for non-convex loss functions if the stochastic gradient is unbiased with bounded variance, where t denotes the number of iterations.

For non-convex optimization, the BCD based algorithm in [39] was proven to converge globally to stationary points. For parallel computing another BCD based algorithm, namely Parallel Successive Convex Approximation (PSCA), was proposed in [31] and proven to be convergent.

(3) ADMM vs. BCD: The alternating direction method of multipliers (ADMM) is a proximal-point optimization framework from the 1970s, recently championed by Boyd [7]. It breaks a nearly-separable problem into loosely-coupled smaller problems, some of which can be solved independently and thus in parallel. ADMM offers linear convergence for strictly convex problems, and for certain special non-convex optimization problems ADMM can also converge [29; 36].
Unfortunately, thus far there is no evidence or mathematical argument that DNN training is one of these special cases. Therefore, even though ADMM has been applied successfully to DNN training in practice [34; 41], it still lacks a convergence guarantee. Our BCD-based DNN training algorithm is also amenable to ADMM-like parallelization. More importantly, as we prove in Sec. 4, it converges globally to stationary points with R-linear convergence.

2 Tikhonov Regularization for Deep Learning

2.1 Problem Setup

Key Notations: We denote x_i ∈ R^{d_0} as the i-th training sample, y_i ∈ Y as its corresponding class label from the label set Y, u_{i,n} ∈ R^{d_n} as the output feature for x_i from the n-th (1 ≤ n ≤ N) hidden layer in our network, W_{n,m} ∈ R^{d_n × d_m} as the weight matrix between the n-th and m-th hidden layers, M_n as the input-layer index set for the n-th hidden layer, V ∈ R^{d_{N+1} × d_N} as the weight matrix between the last hidden layer and the output layer, U, V, W as nonempty closed convex sets, and ℓ(·,·) as a convex loss function.

Network Architectures: In our networks we only consider ReLU as the activation function. To provide short paths through the DNN, we allow multi-input ReLU units which can take the outputs of multiple previous layers as their inputs. Fig. 1 illustrates a network architecture that we consider, where the third hidden layer (with ReLU activations), for instance, takes the input data and the outputs of the first and second hidden layers as its inputs. Mathematically, we define our multi-input ReLU function at layer n for data x_i as:

u_{i,n} = x_i, if n = 0;  u_{i,n} = max{0, Σ_{m∈M_n} W_{n,m} u_{i,m}}, otherwise,    (1)

where max denotes the entry-wise max operator and 0 denotes a d_n-dimensional zero vector.

Figure 1: Illustration of the DNN architectures that we consider in this paper.
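The recursion in Eq. 1 can be sketched as follows (a minimal numpy illustration of the forward pass, not the authors' code; the layer sizes, random weights, and index sets M_n are made up for the example):

```python
import numpy as np

def multi_input_relu_forward(x, weights, M, N):
    """Forward pass of the multi-input ReLU recursion (Eq. 1).

    x       : input vector, playing the role of u_{i,0}
    weights : dict mapping (n, m) -> weight matrix W_{n,m}
    M       : dict mapping n -> list of input layer indices M_n
    N       : number of hidden layers
    Returns the list [u_0, u_1, ..., u_N].
    """
    u = [x]
    for n in range(1, N + 1):
        pre = sum(weights[(n, m)] @ u[m] for m in M[n])
        u.append(np.maximum(0.0, pre))   # entry-wise max{0, .}
    return u

rng = np.random.default_rng(1)
d = 4
# A 3-hidden-layer net where layer 3 sees the input and layers 1, 2 (cf. Fig. 1).
M = {1: [0], 2: [0, 1], 3: [0, 1, 2]}
weights = {(n, m): rng.standard_normal((d, d)) for n in M for m in M[n]}
u = multi_input_relu_forward(rng.standard_normal(d), weights, M, N=3)
```

Setting the weight matrices on the extra inputs to identity matrices recovers an ordinary ReLU network with skip connections.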
Note that multi-input ReLUs can be thought of as conventional ReLUs with skip layers [17] where the corresponding W's are set to identity matrices.

Conventional Objective for Training DNNs with ReLU: For clarity we write down the general objective¹ recursively, as in [41]:

min_{V∈V, W̃⊆W} Σ_i ℓ(y_i, V u_{i,N}), s.t. u_{i,n} = max{0, Σ_{m∈M_n} W_{n,m} u_{i,m}}, u_{i,0} = x_i, ∀i, ∀n,    (2)

where W̃ = {W_{n,m}}. Note that we intentionally separate the last FC layer (with weight matrix V) from the remaining hidden layers (with weight matrices in W̃), because V is for learning classifiers while W̃ is for learning useful features. The network architectures we use in this paper are mainly for extracting features, on top of which an arbitrary classifier can be learned further.

Our goal is to optimize Eq. 2. To that end, we propose a novel BCD based algorithm which solves a relaxation of Eq. 2 using Tikhonov regularization, with a convergence guarantee.

2.2 Reinterpretation of ReLU

The ReLU, ordinarily defined as u = max{0, x} for x ∈ R^d, can be viewed as a projection onto a convex set (POCS) [3], and thus rewritten as a simple smooth convex optimization problem,

max{0, x} ≡ arg min_{u∈U} ‖u − x‖²₂,    (3)

where ‖·‖₂ denotes the ℓ₂ norm of a vector and U here is the nonnegative closed half-space. This non-negative least-squares problem becomes the basis of our lifted objective.

¹For simplicity, in this paper we always presume that the domain of each variable contains the regularization, e.g. the ℓ₂ norm, without showing it in the objective explicitly.

2.3 Our Tikhonov Regularized Objective

We use Eq. 3 to lift and unroll the general training objective in Eq.
2, obtaining the relaxation:

min_{Ũ⊆U, V∈V, W̃⊆W} f(Ũ, V, W̃) ≜ Σ_i ℓ(y_i, V u_{i,N}) + Σ_{i,n} (γ_n/2) ‖u_{i,n} − Σ_{m∈M_n} W_{n,m} u_{i,m}‖²₂,
s.t. u_{i,n} ≥ 0, u_{i,0} = x_i, ∀i, ∀n ≥ 1,    (4)

where Ũ = {u_{i,n}} and γ_n ≥ 0, ∀n, denote predefined regularization constants. Larger γ_n values force u_{i,n}, ∀i, to more closely approximate the output of the ReLU at the n-th hidden layer. Arranging the u and γ terms into a matrix Q, we rewrite Eq. 4 in familiar form as a Tikhonov regularized objective:

min_{Ũ⊆U, V∈V, W̃⊆W} f(Ũ, V, W̃) ≡ Σ_i { ℓ(y_i, V P u_i) + (1/2) u_iᵀ Q(W̃) u_i }.    (5)

Here u_i, ∀i, denotes the concatenating vector of all hidden outputs as well as the input data, i.e. u_i = [u_{i,n}]_{n=0}^N, ∀i, P is a predefined constant matrix such that P u_i = u_{i,N}, ∀i, and Q(W̃) denotes a matrix constructed from the weight-matrix set W̃.

Proposition 1. Q(W̃) is positive semidefinite, leading to the following Tikhonov regularization:

u_iᵀ Q(W̃) u_i ≡ (Γu_i)ᵀ(Γu_i) = ‖Γu_i‖²₂, ∃Γ, ∀i,

where Γ is the Tikhonov matrix.

Definition 1 (Block Multi-Convexity [38]). A function f is block multi-convex if for each block variable x_i, ∀i, f is a convex function of x_i while all the other blocks are fixed.

Proposition 2. f(Ũ, V, W̃) is block multi-convex.

3 Block Coordinate Descent Algorithm

3.1 Training

Eq.
4 can be minimized using alternating optimization, which decomposes the problem into the following three convex sub-problems based on Prop. 2:

• Tikhonov regularized inverse problem: min_{u_i∈U} ℓ(y_i, V P u_i) + (1/2) u_iᵀ Q(W̃) u_i, ∀i;
• Least-squares regression: min_{W_{n,m}∈W̃} Σ_n (γ_n/2) Σ_i ‖u_{i,n} − Σ_{m∈M_n} W_{n,m} u_{i,m}‖²₂;
• Classification using learned features: min_{V∈V} Σ_i ℓ(y_i, V P u_i).

All three sub-problems can be solved efficiently due to their convexity. In fact the inverse sub-problem alleviates the vanishing gradient issue in traditional deep learning, because it directly estimates the output feature of each hidden layer, and these estimates are coupled to one another through the Tikhonov matrix. This functionality is similar to that of target propagation [4] (where the targets are the estimated outputs of each layer), namely, propagating information between input data and output labels.

Unfortunately, a simple alternating optimization scheme cannot guarantee convergence to stationary points when solving Eq. 4. Therefore we propose a novel BCD based algorithm for training DNNs based on Eq. 4, listed in Alg. 1. Essentially we sequentially solve each sub-problem with an extra quadratic term. These extra terms, together with the convex combination rule, guarantee the global convergence of the algorithm (see Sec. 4 for details).

Our algorithm involves solving a sequence of quadratic programs (QP), whose computational complexity is, in general, cubic in the input dimension [28]. In this paper we focus on the theoretical development of the algorithm, and leave fast implementations to future work.

3.2 Testing

Given a test sample x and the learned network weights W̃*, V*, based on Eq. 4 the ideal decision function for classification would be y* = arg min_{y∈Y} { min_u f(u, V*, W̃*) }.
This indicates that for each pair of test data and potential label we have to solve an optimization problem, leading to unaffordably high computational complexity that prevents us from using it.

Algorithm 1: Block Coordinate Descent (BCD) Algorithm for Training DNNs
Input: training data {(x_i, y_i)} and regularization parameters {γ_n}
Output: network weights W̃
Randomly initialize Ũ^(0) ⊆ U, V^(0) ∈ V, W̃^(0) ⊆ W;
Set the sequence {θ_t}_{t=1}^∞ so that 0 ≤ θ_t ≤ 1, ∀t, and the sequence {Σ_{k=t}^∞ θ_k/(1−θ_k)}_{t=1}^∞ converges to zero, e.g. θ_t = 1/t²;
for t = 1, 2, ··· do
  u_i* ← arg min_{u_i∈U} ℓ(y_i, V^(t−1) P u_i) + (1/2) u_iᵀ Q(W̃^(t−1)) u_i + (1/2)(1−θ_t)² ‖u_i − u_i^(t−1)‖²₂, ∀i;
  u_i^(t) ← u_i^(t−1) + θ_t (u_i* − u_i^(t−1)), ∀i;
  V* ← arg min_{V∈V} Σ_i ℓ(y_i, V P u_i^(t)) + (1/2)(1−θ_t)² ‖V − V^(t−1)‖²_F;
  V^(t) ← V^(t−1) + θ_t (V* − V^(t−1));
  W̃* ← arg min_{W̃⊆W} Σ_i (1/2) [u_i^(t)]ᵀ Q(W̃) u_i^(t) + (1/2)(1−θ_t)² Σ_n Σ_{m∈M_n} ‖W_{n,m} − W_{n,m}^(t−1)‖²_F;
  W_{n,m}^(t) ← W_{n,m}^(t−1) + θ_t (W_{n,m}* − W_{n,m}^(t−1)), ∀n, ∀m ∈ M_n, W_{n,m}* ∈ W̃*;
end
return W̃;

Recall that our goal is to train feed-forward DNNs using the BCD algorithm in Alg. 1. Considering this, we utilize the network weights W̃* to construct the network for extracting deep features. Since these features are approximations of Ũ in Eq.
4 (in fact they are a feasible solution of the extreme case where γ_n = +∞, ∀n), the learned classifier V* cannot be reused at test time. Therefore, we retain the architecture and weights of the trained network and replace the classification layer (i.e. the last layer, with weights V) with a linear support vector machine (SVM).

3.3 Experiments

3.3.1 MNIST Demonstration

To demonstrate the effectiveness and efficiency of our BCD based algorithm in Alg. 1, we conduct comprehensive experiments on the MNIST [26] dataset using its 28 × 28 = 784 raw pixels as input features. We refer to our algorithm for learning dense networks as "BCD" and that for learning sparse networks as "BCD-S", respectively. For sparse learning, we define the convex set W = {W | ‖W_k‖₁ ≤ 1, ∀k}, where W_k denotes the k-th row in matrix W and ‖·‖₁ denotes the ℓ₁ norm of a vector. All comparisons are performed on the same PC. We implement our algorithms in MATLAB with GPU support, without optimizing the code.

We compare our algorithms with the six SGD based solvers in Caffe [20], i.e. SGD [5], AdaDelta [40], AdaGrad [12], Adam [21], Nesterov [33], and RMSProp [35], which are coded in Python. The network architecture that we implemented is illustrated in Fig. 2. This network has three hidden layers (with ReLU) with 784 nodes per layer, four FC layers, and three skip layers inside.
Therefore, the mapping function from input x_i to output y_i defined by the network is:

f(x_i) = V u_{i,3},  u_{i,3} = max{0, x_i + u_{i,1} + W_{3,2} u_{i,2}},
u_{i,2} = max{0, x_i + W_{2,1} u_{i,1}},  u_{i,1} = max{0, W_{1,0} x_i}.

Figure 2: The network architecture for algorithm/solver comparison.

For simplicity and without loss of generality, we use MSE as the loss function, and learn the network parameters using the different solvers with the same inputs and the same random initial weights for each FC layer. Without fine-tuning the regularization parameters, we simply set γ_n = 0.1, ∀n, in Eq. 4 for both the BCD and BCD-S algorithms. For the Caffe solvers, we modify the demo code in Caffe for MNIST and run the comparison with careful parameter tuning to achieve the best performance that we can. We report the results within 100 epochs, averaged over three trials, because at this point the training of all methods appears to have converged. For all competing algorithms, in each epoch the entire training data is passed through once to update parameters. Therefore, for our algorithms each epoch is equivalent to one iteration, and there are 100 iterations in total.

Figure 3: (a) Illustration of convergence for BCD and BCD-S. (b) Test error comparison. (c) Running time comparison. (d) Sparseness comparison for BCD and BCD-S.

Convergence: Fig. 3(a) shows how the training objective changes with the number of epochs for BCD and BCD-S, respectively. Both curves decrease monotonically and eventually flatten, indicating that both algorithms converge. BCD-S converges much faster than BCD, but its objective is higher, because BCD-S learns sparse models that may not fit the data as well as the dense models learned by BCD.

Testing Error: As mentioned in Sec.
3.2, here we utilize linear SVMs on the last-layer hidden features extracted from the training data to retrain the classifier. Based on the network in Fig. 2, the feature extraction function is u_{i,3} = max{0, x_i + max{0, W_{1,0} x_i} + W_{3,2} max{0, x_i + W_{2,1} max{0, W_{1,0} x_i}}}. For a fair comparison, we retrain the classifiers for all the algorithms, and summarize the test-time results after 100 epochs in Fig. 3(b). Our BCD algorithm, which learns dense architectures like the SGD based solvers, performs best, while our BCD-S algorithm still outperforms the SGD competitors even though it learns much sparser networks. These results are also consistent with the training objectives in Fig. 3(a).

Computational Time: We compare the training time in Fig. 3(c). Our BCD implementation is significantly faster than the Caffe solvers; for instance, BCD achieves about a 2.5× speed-up over the competitors while achieving the best classification performance at test time.

Sparseness: To compare the weights of the dense and sparse networks learned by BCD and BCD-S, respectively, we compare the percentage of nonzero weights in each FC layer, and show the results in Fig. 3(d). Except for the last FC layer (corresponding to the classifier parameters V), BCD-S is able to learn much sparser networks for deep feature extraction. In our case BCD-S learns a network with 2.42% nonzero weights², on average, with classification accuracy 1.34% lower than that of BCD, which learns a network with 97.15% nonzero weights. This ability could potentially be very useful in scenarios such as embedded systems where sparse networks are desired.

3.3.2 Supervised Hashing

To further demonstrate the utility of our approach, we compare with [41]³, the state of the art in the literature, for the application of supervised hashing.
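To make the prox-plus-averaging structure of Alg. 1 concrete, the following toy sketch (our illustration, not the authors' implementation) runs the three block updates on a single lifted layer with a squared loss. For simplicity we drop the nonnegativity constraint on u (the set U of the paper), so every block sub-problem is an exact linear solve, and we take θ_t = 1/(t+1)² (a valid choice for the required sequence) so the proximal weight (1−θ_t)² stays strictly positive. The lifted objective is then non-increasing across iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
dx, du, dy, gamma = 3, 4, 2, 0.1
x = rng.standard_normal(dx)          # one training sample (made up)
y = rng.standard_normal(dy)          # its regression target (made up)
W = rng.standard_normal((du, dx))    # lifted layer weights
V = rng.standard_normal((dy, du))    # classifier/regressor weights
u = rng.standard_normal(du)          # auxiliary (lifted) layer output

def objective(u, V, W):
    # Lifted objective: squared loss + Tikhonov-style lifting penalty.
    return 0.5 * np.sum((y - V @ u) ** 2) + 0.5 * gamma * np.sum((u - W @ x) ** 2)

objs = [objective(u, V, W)]
I_u = np.eye(du)
for t in range(1, 51):
    th = 1.0 / (t + 1) ** 2          # theta_t
    rho = (1.0 - th) ** 2            # weight of the extra proximal term
    # u-step: min_u 0.5||y-Vu||^2 + 0.5*gamma||u-Wx||^2 + 0.5*rho||u-u_prev||^2
    u_star = np.linalg.solve(V.T @ V + (gamma + rho) * I_u,
                             V.T @ y + gamma * (W @ x) + rho * u)
    u = u + th * (u_star - u)        # convex combination update
    # V-step: min_V 0.5||y-Vu||^2 + 0.5*rho||V-V_prev||_F^2
    V_star = np.linalg.solve((np.outer(u, u) + rho * I_u).T,
                             (np.outer(y, u) + rho * V).T).T
    V = V + th * (V_star - V)
    # W-step: min_W 0.5*gamma||u-Wx||^2 + 0.5*rho||W-W_prev||_F^2
    W_star = np.linalg.solve((gamma * np.outer(x, x) + rho * np.eye(dx)).T,
                             (gamma * np.outer(u, x) + rho * W).T).T
    W = W + th * (W_star - W)
    objs.append(objective(u, V, W))
```

Each prox sub-problem is minimized exactly, so by the 3-point property (Lemma 1) and convexity of each block, the objective never increases; this is the mechanism the convergence proof in Sec. 4 formalizes.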
[41] proposed an ADMM based optimization algorithm to train DNNs with a relaxed objective that is closely related to ours. We train the same DNN on MNIST as used in [41], i.e. with 48 hidden layers and 256 nodes per layer that are sequentially and fully connected (see [41] for more details on the network). Using the same image features, we consistently observe marginal improvements over the results (i.e. precision, recall, mAP) reported in [41]. Moreover, on the same PC we can finish training within 1 hour with our implementation, while training with the MATLAB code for [41] needs about 9 hours. Similar observations can be made on CIFAR-10 as used in [41], using a network with 16 hidden layers and 1024 nodes per layer.

²Since we retrain the classifiers afterwards, here we do not take the nonzeros in the last FC layer into account.
³MATLAB code is available at https://zimingzhang.wordpress.com/publications/.

4 Convergence Analysis

4.1 Preliminaries

Definition 2 (Lipschitz Continuity [13]). We say that a function f is Lipschitz continuous with Lipschitz constant L_f on X if there is a (necessarily nonnegative) constant L_f such that |f(x₁) − f(x₂)| ≤ L_f |x₁ − x₂|, ∀x₁, x₂ ∈ X.

Definition 3 (Global Convergence [24]). Let X be a set and x₀ ∈ X a given point. Then an algorithm A with initial point x₀ is a point-to-set map A : X → P(X) which generates a sequence {x_k}_{k=0}^∞ via the rule x_{k+1} ∈ A(x_k), k = 0, 1, ···.
A is said to be globally convergent if for any chosen initial point x₀, the sequence {x_k}_{k=0}^∞ generated by x_{k+1} ∈ A(x_k) (or a subsequence) converges to a point for which a necessary condition of optimality holds.

Definition 4 (R-linear Convergence Rate [30]). Let {x_k} be a sequence in R^n that converges to x*. We say that the convergence is R-linear if there is a sequence of nonnegative scalars {v_k} such that ‖x_k − x*‖ ≤ v_k, ∀k, and {v_k} converges Q-linearly to zero.

Lemma 1 (3-Point Property [1]). If the function φ(w) is convex and ŵ = arg min_{w∈R^d} φ(w) + (1/2)‖w − w₀‖²₂, then for any w ∈ R^d,

φ(ŵ) + (1/2)‖ŵ − w₀‖²₂ ≤ φ(w) + (1/2)‖w − w₀‖²₂ − (1/2)‖w − ŵ‖²₂.

4.2 Theoretical Results

Definition 5 (Assumptions on f in Eq. 4). Let f₁(Ũ) ≜ f(Ũ, ·, ·), f₂(V) ≜ f(·, V, ·), f₃(W̃) ≜ f(·, ·, W̃) be the objectives of the three sub-problems, respectively. We assume that f is lower-bounded and that f₁, f₂, f₃ are Lipschitz continuous with constants L_{f₁}, L_{f₂}, L_{f₃}, respectively.

Proposition 3. Let x, y, x̂ ∈ X and y = (1 − θ)x + θx̂. Then (1/2)‖x̂ − y‖²₂ = (1/2)(1 − θ)²‖x̂ − x‖²₂.

Lemma 2. Let X be a nonempty closed convex set, let φ : X → R be convex and Lipschitz continuous with constant L, and let the scalar θ satisfy 0 ≤ θ ≤ 1. Suppose that ∀x ∈ X, x̂ = arg min_{z∈X} φ(z) + (1/2)‖z − z₀‖²₂ with z₀ = y = (1 − θ)x + θx̂. Then we have

((1 − θ)/θ) ‖y − x‖²₂ ≤ φ(x) − φ(y) ≤ L‖y − x‖₂ ⇒ ‖y − x‖₂ ≤ Lθ/(1 − θ).

Proof. Based on the convexity of φ, Prop. 3, and Lemma 1, we have

φ(x) − φ(y) ≥ φ(x) − [(1 − θ)φ(x) + θφ(x̂)] = θ[φ(x) − φ(x̂)]
≥ θ[(1/2)‖x − x̂‖²₂ + (1/2)‖x̂ − z₀‖²₂ − (1/2)‖x − z₀‖²₂] = θ(1 − θ)‖x − x̂‖²₂ = ((1 − θ)/θ)‖y − x‖²₂,

where ‖y − x‖²₂ = 0 if and only if x̂ = x (equivalently φ(x) = φ(y)); otherwise ‖y − x‖²₂ is bounded away from 0 provided that θ ≠ 1. Based on Def. 2, we also have φ(x) − φ(y) ≤ L‖y − x‖₂, which completes the proof.

Theorem 1. Let {(Ũ^(t), V^(t), W̃^(t))}_{t=1}^∞ ⊆ U × V × W be an arbitrary sequence from a closed convex set generated by Alg. 1. Suppose that 0 ≤ θ_t ≤ 1, ∀t, and that the sequence {Σ_{k=t}^∞ θ_k/(1−θ_k)}_{t=1}^∞ converges to zero. Then we have:

1. (Ũ^(∞), V^(∞), W̃^(∞)) is a stationary point;
2. {(Ũ^(t), V^(t), W̃^(t))}_{t=1}^∞ converges to (Ũ^(∞), V^(∞), W̃^(∞)) globally with R-linear convergence rate.

Proof. 1. Suppose that for Ũ^(∞) there exists a ΔŨ ≠ ∅ such that f₁(Ũ^(∞) + ΔŨ) = f₁(Ũ^(∞)) (otherwise, this conflicts with the fact that Ũ^(∞) is the limit point). From Lemma 2, f₁(Ũ^(∞) + ΔŨ) = f₁(Ũ^(∞)) is equivalent to Ũ^(∞) + ΔŨ = Ũ^(∞), and thus ΔŨ = ∅, which conflicts with the assumption ΔŨ ≠ ∅. Therefore, there is no direction that can decrease f₁(Ũ^(∞)), i.e. ∇f₁(Ũ^(∞)) = 0. Similarly, ∇f₂(V^(∞)) = 0 and ∇f₃(W̃^(∞)) = 0. Therefore, (Ũ^(∞), V^(∞), W̃^(∞)) is a stationary point.

2. Based on Def. 5 and Lemma 2, we have

√( Σ_{u_{i,n}∈Ũ} ‖u_{i,n}^(t) − u_{i,n}^(∞)‖²₂ + ‖V^(t) − V^(∞)‖²_F + Σ_{W_{n,m}∈W̃} ‖W_{n,m}^(t) − W_{n,m}^(∞)‖²_F )
≤ Σ_{u_{i,n}∈Ũ} ‖u_{i,n}^(t) − u_{i,n}^(∞)‖₂ + ‖V^(t) − V^(∞)‖_F + Σ_{W_{n,m}∈W̃} ‖W_{n,m}^(t) − W_{n,m}^(∞)‖_F
= Σ_{u_{i,n}∈Ũ} ‖Σ_{k=t}^∞ (u_{i,n}^(k) − u_{i,n}^(k+1))‖₂ + ‖Σ_{k=t}^∞ (V^(k) − V^(k+1))‖_F + Σ_{W_{n,m}∈W̃} ‖Σ_{k=t}^∞ (W_{n,m}^(k) − W_{n,m}^(k+1))‖_F
≤ Σ_{k=t}^∞ [ Σ_{u_{i,n}∈Ũ} ‖u_{i,n}^(k) − u_{i,n}^(k+1)‖₂ + ‖V^(k) − V^(k+1)‖_F + Σ_{W_{n,m}∈W̃} ‖W_{n,m}^(k) − W_{n,m}^(k+1)‖_F ]
≤ Σ_{k=t}^∞ [ L_{f₁}θ_k/(1−θ_k) + L_{f₂}θ_k/(1−θ_k) + L_{f₃}θ_k/(1−θ_k) ] = O( Σ_{k=t}^∞ θ_k/(1−θ_k) ).

Combining this with Def. 3 and Def. 4 completes the proof.

Corollary 1. Let θ_t = (1/t)^p, ∀t. Then for p > 1, Alg. 1 converges globally with order one.

Proof.

Σ_{k=t}^∞ θ_k/(1−θ_k) = Σ_{k=t}^∞ 1/(k^p − 1) ≤ ∫_{t^p−1}^∞ (1/x) d(x+1)^{1/p} = (1/p) ∫_{t^p−1}^∞ (1/x)(x+1)^{(1/p)−1} dx
≤ (since p > 1) (1/p) ∫_{t^p−1}^∞ x^{(1/p)−2} dx = (p−1)^{−1}(t^p − 1)^{(1/p)−1}.    (6)

Since the sequence {(t^p − 1)^{(1/p)−1}}_{t=1}^∞, ∀p > 1, converges to zero sublinearly with order one, combining this with Def. 4 and Thm. 1 completes the proof.

5 Conclusion

In this paper we first propose a novel Tikhonov regularization for training DNNs with ReLU activation functions. The Tikhonov matrix encodes the network architecture as well as its parameterization.
With its help we reformulate the network training as a block multi-convex minimization problem. Accordingly we further propose a novel block coordinate descent (BCD) based algorithm, which is proven to converge globally to stationary points with an R-linear convergence rate of order one. Our empirical results suggest that our algorithm does converge, is suitable for learning both dense and sparse networks, and may work better than traditional SGD-based deep learning solvers.
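As a numerical sanity check of the tail estimate (6) in the proof of Corollary 1, the short sketch below (illustrative, not the authors' code; the function names and the factor-of-2 slack absorbing the sum-versus-integral gap at small $t$ are our own) compares a truncated partial sum of $\sum_{k \ge t} 1/(k^p - 1)$ against the closed-form rate $(p-1)^{-1}(t^p-1)^{1/p-1}$ for $\theta_t = (1/t)^p$ with $p > 1$:

```python
# Sanity check of Eq. (6): for theta_t = (1/t)^p with p > 1, the tail
#   sum_{k=t}^inf theta_k / (1 - theta_k) = sum_{k=t}^inf 1/(k^p - 1)
# decays polynomially to zero like (p - 1)^{-1} (t^p - 1)^{1/p - 1}.

def tail_sum(t, p, K=300_000):
    """Truncated partial sum of 1/(k^p - 1) over k = t, ..., K - 1."""
    return sum(1.0 / (k ** p - 1.0) for k in range(t, K))

def rate_bound(t, p):
    """Closed-form rate from Eq. (6): (p - 1)^{-1} (t^p - 1)^{1/p - 1}."""
    return (t ** p - 1.0) ** (1.0 / p - 1.0) / (p - 1.0)

for p in (1.5, 2.0, 3.0):
    prev = float("inf")
    for t in (2, 5, 10, 50):
        s, b = tail_sum(t, p), rate_bound(t, p)
        # The tail tracks the closed-form rate up to a small constant factor
        # (factor 2 is empirical slack), and decreases monotonically to zero.
        assert s <= 2.0 * b and s < prev
        prev = s
        print(f"p={p}, t={t}: tail ~ {s:.5f}, rate bound = {b:.5f}")
```

Both quantities vanish at the polynomial rate $t^{1-p}$, which is why larger $p$ (a faster-shrinking $\theta_t$) yields a faster-vanishing tail but the convergence order stays one.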