{"title": "End-to-End Kernel Learning with Supervised Convolutional Kernel Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1399, "page_last": 1407, "abstract": "In this paper, we introduce a new image representation based on a multilayer kernel machine. Unlike traditional kernel methods where data representation is decoupled from the prediction task, we learn how to shape the kernel with supervision. We proceed by first proposing improvements of the recently-introduced convolutional kernel networks (CKNs) in the context of unsupervised learning; then, we derive backpropagation rules to take advantage of labeled training data. The resulting model is a new type of convolutional neural network, where optimizing the filters at each layer is equivalent to learning a linear subspace in a reproducing kernel Hilbert space (RKHS). We show that our method achieves reasonably competitive performance for image classification on some standard ``deep learning'' datasets such as CIFAR-10 and SVHN, and also for image super-resolution, demonstrating the applicability of our approach to a large variety of image-related tasks.", "full_text": "End-to-End Kernel Learning with\n\nSupervised Convolutional Kernel Networks\n\nJulien Mairal\n\nInria\u2217\n\njulien.mairal@inria.fr\n\nAbstract\n\nIn this paper, we introduce a new image representation based on a multilayer kernel\nmachine. Unlike traditional kernel methods where data representation is decoupled\nfrom the prediction task, we learn how to shape the kernel with supervision. We\nproceed by \ufb01rst proposing improvements of the recently-introduced convolutional\nkernel networks (CKNs) in the context of unsupervised learning; then, we derive\nbackpropagation rules to take advantage of labeled training data. 
The resulting\nmodel is a new type of convolutional neural network, where optimizing the \ufb01lters\nat each layer is equivalent to learning a linear subspace in a reproducing kernel\nHilbert space (RKHS). We show that our method achieves reasonably competitive\nperformance for image classi\ufb01cation on some standard \u201cdeep learning\u201d datasets\nsuch as CIFAR-10 and SVHN, and also for image super-resolution, demonstrating\nthe applicability of our approach to a large variety of image-related tasks.\n\n1\n\nIntroduction\n\nIn the past years, deep neural networks such as convolutional or recurrent ones have become highly\npopular for solving various prediction problems, notably in computer vision and natural language\nprocessing. Conceptually close to approaches that were developed several decades ago (see, [13]),\nthey greatly bene\ufb01t from the large amounts of labeled data that have been made available recently,\nallowing to learn huge numbers of model parameters without worrying too much about over\ufb01tting.\nAmong other reasons explaining their success, the engineering effort of the deep learning community\nand various methodological improvements have made it possible to learn in a day on a GPU complex\nmodels that would have required weeks of computations on a traditional CPU (see, e.g., [10, 12, 23]).\nBefore the resurgence of neural networks, non-parametric models based on positive de\ufb01nite kernels\nwere one of the most dominant topics in machine learning [22]. These approaches are still widely\nused today because of several attractive features. Kernel methods are indeed versatile; as long as a\npositive de\ufb01nite kernel is speci\ufb01ed for the type of data considered\u2014e.g., vectors, sequences, graphs,\nor sets\u2014a large class of machine learning algorithms originally de\ufb01ned for linear models may be\nused. 
This family includes supervised formulations such as support vector machines and unsupervised ones such as principal or canonical component analysis, or K-means and spectral clustering. The problem of data representation is thus decoupled from that of learning theory and algorithms. Kernel methods also admit natural mechanisms to control the learning capacity and reduce overfitting [22].

On the other hand, traditional kernel methods suffer from several drawbacks. The first one is their computational complexity, which grows quadratically with the sample size due to the computation of the Gram matrix. Fortunately, significant progress has been achieved to solve this scalability issue, either by exploiting low-rank approximations of the kernel matrix [28, 31], or with random sampling techniques for shift-invariant kernels [21]. The second disadvantage is more critical: by decoupling learning and data representation, kernel methods seem by nature incompatible with end-to-end learning, that is, learning a representation of the data adapted to the task at hand, which is the cornerstone of deep neural networks and one of the main reasons of their success. The main objective of this paper is precisely to tackle this issue in the context of image modeling.

*Thoth team, Inria Grenoble, Laboratoire Jean Kuntzmann, CNRS, Univ. Grenoble Alpes, France.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Specifically, our approach is based on convolutional kernel networks, which have been recently introduced in [18]. Similar to hierarchical kernel descriptors [3], local image neighborhoods are mapped to points in a reproducing kernel Hilbert space via the kernel trick. Then, hierarchical representations are built via kernel compositions, producing a sequence of "feature maps" akin to convolutional neural networks, but of infinite dimension. 
To make the image model computationally\ntractable, convolutional kernel networks provide an approximation scheme that can be interpreted as\na particular type of convolutional neural network learned without supervision.\nTo perform end-to-end learning given labeled data, we use a simple but effective principle consisting\nof learning discriminative subspaces in RKHSs, where we project data. We implement this idea\nin the context of convolutional kernel networks, where linear subspaces, one per layer, are jointly\noptimized by minimizing a supervised loss function. The formulation turns out to be a new type of\nconvolutional neural network with a non-standard parametrization. The network also admits simple\nprinciples to learn without supervision: learning the subspaces may be indeed achieved ef\ufb01ciently\nwith classical kernel approximation techniques [28, 31].\nTo demonstrate the effectiveness of our approach in various contexts, we consider image classi\ufb01cation\nbenchmarks such as CIFAR-10 [12] and SVHN [19], which are often used to evaluate deep neural\nnetworks; then, we adapt our model to perform image super-resolution, which is a challenging inverse\nproblem. On the SVHN and CIFAR-10 datasets, we obtain a competitive accuracy, with about 2% and\n10% error rates, respectively, without model averaging or data augmentation. For image up-scaling,\nwe outperform recent approaches based on classical convolutional neural networks [7, 8].\nWe believe that these results are highly promising. Our image model achieves competitive perfor-\nmance in two different contexts, paving the way to many other applications. Moreover, our results are\nalso subject to improvements. In particular, we did not use GPUs yet, which has limited our ability\nto exhaustively explore model hyper-parameters and evaluate the accuracy of large networks. 
We\nalso did not investigate classical regularization/optimization techniques such as Dropout [12], batch\nnormalization [11], or recent advances allowing to train very deep networks [10, 23]. To gain more\nscalability and start exploring these directions, we are currently working on a GPU implementation,\nwhich we plan to publicly release along with our current CPU implementation.\n\nRelated Deep and Shallow Kernel Machines. One of our goals is to make a bridge between kernel\nmethods and deep networks, and ideally reach the best of both worlds. Given the potentially attractive\nfeatures of such a combination, several attempts have been made in the past to unify these two schools\nof thought. A \ufb01rst proof of concept was introduced in [5] with the arc-cosine kernel, which admits an\nintegral representation that can be interpreted as a one-layer neural network with random weights\nand in\ufb01nite number of recti\ufb01ed linear units. Besides, a multilayer kernel may be obtained by kernel\ncompositions [5]. Then, hierarchical kernel descriptors [3] and convolutional kernel networks [18]\nextend a similar idea in the context of images leading to unsupervised representations [18].\nMultiple kernel learning [24] is also related to our work since is it is a notable attempt to introduce\nsupervision in the kernel design. It provides techniques to select a combination of kernels from a pre-\nde\ufb01ned collection, and typically requires to have already \u201cgood\u201d kernels in the collection to perform\nwell. More related to our work, the backpropagation algorithm for the Fisher kernel introduced in [25]\nlearns the parameters of a Gaussian mixture model with supervision. In comparison, our approach\ndoes not require a probabilistic model and learns parameters at several layers. 
Finally, we note that a\nconcurrent effort to ours is conducted in the Bayesian community with deep Gaussian processes [6],\ncomplementing the Frequentist approach that we follow in our paper.\n\n2 Learning Hierarchies of Subspaces with Convolutional Kernel Networks\n\nIn this section, we present the principles of convolutional kernel networks and a few generalizations\nand improvements of the original approach of [18]. Essentially, the model builds upon four ideas that\nare detailed below and that are illustrated in Figure 1 for a model with a single layer.\n\n2\n\n\fIdea 1: use the kernel trick to represent local image neighborhoods in a RKHS.\nGiven a set X , a positive de\ufb01nite kernel K : X \u00d7 X \u2192 R implicitly de\ufb01nes a Hilbert space H, called\nreproducing kernel Hilbert space (RKHS), along with a mapping \u03d5 : X \u2192 H. This embedding is\nsuch that the kernel value K(x, x(cid:48)) corresponds to the inner product (cid:104)\u03d5(x), \u03d5(x(cid:48))(cid:105)H. Called \u201ckernel\ntrick\u201d, this approach can be used to obtain nonlinear representations of local image patches [3, 18].\nMore precisely, consider an image I0 : \u21260 \u2192 Rp0, where p0 is the number of channels, e.g., p0 = 3\nfor RGB, and \u21260 \u2282 [0, 1]2 is a set of pixel coordinates, typically a two-dimensional grid. 
Given two image patches x, x' of size e_0 x e_0, represented as vectors in R^{p_0 e_0^2}, we define a kernel K_1 as

    K_1(x, x') = \|x\| \, \|x'\| \, \kappa_1\!\left( \left\langle \frac{x}{\|x\|}, \frac{x'}{\|x'\|} \right\rangle \right) \ \text{if } x, x' \neq 0, \ \text{and } 0 \text{ otherwise},    (1)

where \|.\| and \langle ., . \rangle denote the usual Euclidean norm and inner-product, respectively, and \kappa_1(\langle ., . \rangle) is a dot-product kernel on the sphere. Specifically, \kappa_1 should be smooth and its Taylor expansion should have non-negative coefficients to ensure positive definiteness [22]. For example, the arc-cosine [5] or the Gaussian (RBF) kernels may be used: given two vectors y, y' with unit l2-norm, choose for instance

    \kappa_1(\langle y, y' \rangle) = e^{\alpha_1 (\langle y, y' \rangle - 1)} = e^{-\frac{\alpha_1}{2} \|y - y'\|_2^2}.    (2)

Then, we have implicitly defined the RKHS H_1 associated to K_1 and a mapping \varphi_1 : R^{p_0 e_0^2} -> H_1.

Idea 2: project onto a finite-dimensional subspace of the RKHS with convolution layers.
The representation of patches in a RKHS requires finite-dimensional approximations to be computationally manageable. The original model of [18] does that by exploiting an integral form of the RBF kernel. Specifically, given two patches x and x', convolutional kernel networks provide two vectors \psi_1(x), \psi_1(x') in R^{p_1} such that the kernel value \langle \varphi_1(x), \varphi_1(x') \rangle_{H_1} is close to the Euclidean inner product \langle \psi_1(x), \psi_1(x') \rangle. 
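For concreteness, the patch kernel of Eqs. (1)-(2) can be sketched numerically; the following is a minimal NumPy sketch under our own naming (the functions `kappa1` and `K1` are ours, not from the authors' implementation), combining the homogeneous kernel (1) with the RBF dot-product kernel (2):

```python
import numpy as np

def kappa1(u, alpha=1.0):
    # RBF dot-product kernel on the sphere: kappa1(<y,y'>) = exp(alpha * (<y,y'> - 1)),
    # which equals exp(-alpha/2 * ||y - y'||^2) for unit-norm y, y'.
    return np.exp(alpha * (u - 1.0))

def K1(x, xp, alpha=1.0):
    # Homogeneous patch kernel: K1(x,x') = ||x|| ||x'|| kappa1(<x/||x||, x'/||x'||>),
    # with the convention K1 = 0 whenever one of the patches is zero.
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    if nx == 0.0 or nxp == 0.0:
        return 0.0
    return nx * nxp * kappa1(np.dot(x / nx, xp / nxp), alpha)
```

On unit-norm inputs the two expressions in Eq. (2) coincide, which gives a quick sanity check of the implementation.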
After applying this transformation to all overlapping patches of the input image I_0, a spatial map M_1 : \Omega_0 -> R^{p_1} may be obtained such that for all z in \Omega_0, M_1(z) = \psi_1(x_z), where x_z is the e_0 x e_0 patch from I_0 centered at pixel location z.[2] With the approximation scheme of [18], M_1 can be interpreted as the output feature map of a one-layer convolutional neural network.
A conceptual drawback of [18] is that data points \varphi_1(x_1), \varphi_1(x_2), ... are approximated by vectors that do not live in the RKHS H_1. This issue can be solved by using variants of the Nystrom method [28], which consists of projecting data onto a subspace of H_1 with finite dimension p_1. For this task, we have adapted the approach of [31]: we build a database of n patches x_1, ..., x_n randomly extracted from various images and normalized to have unit l2-norm, and perform a spherical K-means algorithm to obtain p_1 centroids z_1, ..., z_{p_1} with unit l2-norm. Then, a new patch x is approximated by its projection onto the p_1-dimensional subspace F_1 = Span(\varphi(z_1), ..., \varphi(z_{p_1})).
The projection of \varphi_1(x) onto F_1 admits a natural parametrization \psi_1(x) in R^{p_1}. The explicit formula is classical (see [28, 31] and Appendix A), leading to

    \psi_1(x) := \|x\| \, \kappa_1(Z^\top Z)^{-1/2} \, \kappa_1\!\left( Z^\top \frac{x}{\|x\|} \right) \ \text{if } x \neq 0, \ \text{and } 0 \text{ otherwise},    (3)

where we have introduced the matrix Z = [z_1, ..., z_{p_1}], and, by an abuse of notation, the function \kappa_1 is applied pointwise to its arguments. 
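The projection formula of Eq. (3) is straightforward to implement once the filters Z are fixed. Below is a minimal NumPy sketch; the function name `encode`, the eigendecomposition route to the inverse square root, and the small regularization `eps` (mirroring the one mentioned later in Section 3.1) are our choices, not the authors' code:

```python
import numpy as np

def encode(X, Z, alpha=1.0, eps=1e-3):
    # Nystrom-type encoding of Eq. (3): psi1(x) = ||x|| kappa1(Z^T Z)^{-1/2} kappa1(Z^T x/||x||).
    # X: (d, n) patches as columns; Z: (d, p1) unit-norm filters (e.g. spherical K-means centroids).
    kappa = lambda u: np.exp(alpha * (u - 1.0))
    norms = np.linalg.norm(X, axis=0)
    Xn = X / np.maximum(norms, 1e-10)                # l2-normalized patches; zero patches map to 0
    G = kappa(Z.T @ Z) + eps * np.eye(Z.shape[1])    # (regularized) kernel matrix of the filters
    w, V = np.linalg.eigh(G)                         # G^{-1/2} via eigendecomposition
    G_isqrt = (V / np.sqrt(w)) @ V.T
    return norms * (G_isqrt @ kappa(Z.T @ Xn))       # (p1, n) encoded patches
```

A sanity check of the projection: encoding the filters themselves (with `eps=0`) reproduces their kernel matrix, i.e. the inner products between encodings equal kappa1 of the inner products between filters.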
Then, the spatial map M_1 : \Omega_0 -> R^{p_1} introduced above can be obtained by (i) computing the quantities Z^\top x for all patches x of the image I (spatial convolution after mirroring the filters z_j); (ii) contrast-normalization involving the norm \|x\|; (iii) applying the pointwise non-linear function \kappa_1; (iv) applying the linear transform \kappa_1(Z^\top Z)^{-1/2} at every pixel location (which may be seen as a 1x1 spatial convolution); (v) multiplying by the norm \|x\|, making \psi_1 homogeneous. In other words, we obtain a particular convolutional neural network with a non-standard parametrization. Note that learning requires only performing a K-means algorithm and computing the inverse square-root matrix \kappa_1(Z^\top Z)^{-1/2}; therefore, the training procedure is very fast.
It is worth noting that the encoding function \psi_1 with kernel (2) is reminiscent of radial basis function networks (RBFNs) [4], whose hidden layer resembles (3) without the matrix \kappa_1(Z^\top Z)^{-1/2} and with no normalization. The difference between RBFNs and our model is nevertheless significant. The RKHS mapping, which is absent from RBFNs, is a key to the multilayer construction presented shortly: a network layer takes points from the previous layer's RKHS as input and uses the corresponding RKHS inner-product. To the best of our knowledge, there is no similar multilayer and/or convolutional construction in the radial basis function network literature.

[2] To simplify, we use zero-padding when patches are close to the image boundaries, but this is optional.

Figure 1: Our variant of convolutional kernel networks, illustrated between layers 0 and 1. Local patches (receptive fields) are mapped to the RKHS H_1 via the kernel trick and then projected to the finite-dimensional subspace F_1 = Span(\varphi(z_1), ..., \varphi(z_{p_1})). 
The small blue crosses on the right represent the points \varphi(z_1), ..., \varphi(z_{p_1}). With no supervision, optimizing F_1 consists of minimizing projection residuals. With supervision, the subspace is optimized via back-propagation. Going from layer k to layer k+1 is achieved by stacking the model described here and shifting indices.

Idea 3: linear pooling in F_1 is equivalent to linear pooling on the finite-dimensional map M_1.
The previous steps transform an image I_0 : \Omega_0 -> R^{p_0} into a map M_1 : \Omega_0 -> R^{p_1}, where each vector M_1(z) in R^{p_1} encodes a point in F_1 representing information of a local image neighborhood centered at location z. Then, convolutional kernel networks involve a pooling step to gain invariance to small shifts, leading to another finite-dimensional map I_1 : \Omega_1 -> R^{p_1} with smaller resolution:

    I_1(z) = \sum_{z' \in \Omega_0} M_1(z') \, e^{-\beta_1 \|z' - z\|_2^2}.    (4)

The Gaussian weights act as an anti-aliasing filter for downsampling the map M_1, and \beta_1 is set according to the desired subsampling factor (see [18]), which does not need to be an integer. Then, every point I_1(z) in R^{p_1} may be interpreted as a linear combination of points in F_1, which is itself in F_1 since F_1 is a linear subspace. Note that the linear pooling step was originally motivated in [18] as an approximation scheme for a match kernel, but this point of view is not critically important here.

Idea 4: build a multilayer image representation by stacking and composing kernels.
By following the first three principles described above, the input image I_0 : \Omega_0 -> R^{p_0} is transformed into another one I_1 : \Omega_1 -> R^{p_1}. It is then straightforward to apply again the same procedure to obtain another map I_2 : \Omega_2 -> R^{p_2}, then I_3 : \Omega_3 -> R^{p_3}, etc. 
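The pooling step of Eq. (4) is just a fixed linear map applied to the columns of M_1. A minimal one-dimensional NumPy sketch follows (our simplification: real images pool over a 2-D grid, typically with a separable Gaussian filter; the function name `gaussian_pool` is ours):

```python
import numpy as np

def gaussian_pool(M, beta, stride):
    # Linear pooling of Eq. (4): I1(z) = sum_{z'} M1(z') exp(-beta * ||z' - z||^2),
    # evaluated on a subsampled grid of output locations z (1-D spatial layout for clarity).
    # M: (p1, n) feature map with n spatial positions.
    p, n = M.shape
    centers = np.arange(0, n, stride)                            # output locations
    z = np.arange(n)                                             # input locations
    W = np.exp(-beta * (z[None, :] - centers[:, None]) ** 2)     # (n_out, n) Gaussian weights
    return M @ W.T   # (p1, n_out): each output is a Gaussian-weighted sum, hence still in F1
```

Because pooling is linear, each pooled vector is a linear combination of columns of M, which is the point made in the text: the result still encodes an element of the subspace F_1.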
By going up in the hierarchy, the vectors I_k(z) in R^{p_k} represent larger and larger image neighborhoods (a.k.a. receptive fields) with more invariance gained by the pooling layers, akin to classical convolutional neural networks.
The multilayer scheme produces a sequence of maps (I_k)_{k >= 0}, where each vector I_k(z) encodes a point, say f_k(z), in the linear subspace F_k of H_k. Thus, we implicitly represent an image at layer k as a spatial map f_k : \Omega_k -> H_k such that \langle I_k(z), I'_k(z') \rangle = \langle f_k(z), f'_k(z') \rangle_{H_k} for all z, z'. As mentioned previously, the mapping to the RKHS is a key to the multilayer construction. Given I_k, larger image neighborhoods are represented by patches of size e_k x e_k that can be mapped to a point in the Cartesian product space H_k^{e_k x e_k} endowed with its natural inner-product; finally, the kernel K_{k+1} defined on these patches can be seen as a kernel on larger image neighborhoods than K_k.

3 End-to-End Kernel Learning with Supervised CKNs

In the previous section, we have described a variant of convolutional kernel networks where linear subspaces are learned at every layer. This is achieved without supervision by a K-means algorithm leading to small projection residuals. It is thus natural to also introduce a discriminative approach.

3.1 Backpropagation Rules for Convolutional Kernel Networks

We now consider a prediction task, where we are given a training set of images I_0^1, I_0^2, ..., I_0^n with respective scalar labels y_1, ..., y_n living either in {-1; +1} for binary classification or in R for regression. For simplicity, we only present these two settings here, but extensions to multiclass classification and multivariate regression are straightforward. We also assume that we are given a smooth convex loss function L : R x R -> R that measures the fit of a prediction to the true label y. Given a positive definite kernel K on images, the classical empirical risk minimization formulation consists of finding a prediction function in the RKHS H associated to K by minimizing the objective

    \min_{f \in \mathcal{H}} \ \frac{1}{n} \sum_{i=1}^n L(y_i, f(I_0^i)) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2,    (5)

where the parameter \lambda controls the smoothness of the prediction function f with respect to the geometry induced by the kernel, hence regularizing and reducing overfitting [22]. After training a convolutional kernel network with k layers, such a positive definite kernel may be defined as

    K_{\mathcal{Z}}(I_0, I'_0) = \sum_{z \in \Omega_k} \langle f_k(z), f'_k(z) \rangle_{\mathcal{H}_k} = \sum_{z \in \Omega_k} \langle I_k(z), I'_k(z) \rangle,    (6)

where I_k, I'_k are the k-th finite-dimensional feature maps of I_0, I'_0, respectively, and f_k, f'_k the corresponding maps in \Omega_k -> H_k, which have been defined in the previous section. The kernel is also indexed by \mathcal{Z}, which represents the network parameters, that is, the subspaces F_1, ..., F_k, or equivalently the set of filters Z_1, ..., Z_k from Eq. (3). Then, formulation (5) becomes equivalent to

    \min_{W \in \mathbb{R}^{p_k \times |\Omega_k|}} \ \frac{1}{n} \sum_{i=1}^n L(y_i, \langle W, I_k^i \rangle) + \frac{\lambda}{2} \|W\|_F^2,    (7)

where \|.\|_F is the Frobenius norm that extends the Euclidean norm to matrices, and, with an abuse of notation, the maps I_k^i are seen as matrices in R^{p_k x |\Omega_k|}. Then, the supervised convolutional kernel network formulation consists of jointly minimizing (7) with respect to W in R^{p_k x |\Omega_k|} and with respect to the set of filters Z_1, ..., Z_k, whose columns are constrained to be on the Euclidean sphere.

Computing the derivative with respect to the filters Z_1, ..., Z_k.
Since we consider a smooth loss function L, e.g., logistic, squared hinge, or square loss, optimizing (7) with respect to W can be achieved with any gradient-based method. Moreover, when L is convex, we may also use fast dedicated solvers (see, e.g., [16], and references therein). Optimizing with respect to the filters Z_j, j = 1, ..., k is more involved because of the lack of convexity. Yet, the objective function is differentiable, and there is hope to find a "good" stationary point by using classical stochastic optimization techniques that have been successful for training deep networks. For that, we need to compute the gradient by using the chain rule, also called "backpropagation" [13]. We instantiate this rule in the next lemma, which we have found useful to simplify the calculation.

Lemma 1 (Perturbation view of backpropagation.)
Consider an image I_0, represented here as a matrix in R^{p_0 x |\Omega_0|}, associated to a label y in R, and call I_k^{\mathcal{Z}} the k-th feature map obtained by encoding I_0 with the network parameters \mathcal{Z}. Then, consider a perturbation E = {\epsilon_1, ..., \epsilon_k} of the set of filters \mathcal{Z}. Assume that we have for all j >= 0,

    I_j^{\mathcal{Z}+E} = I_j^{\mathcal{Z}} + \Delta I_j^{\mathcal{Z},E} + o(\|E\|),    (8)

where \|E\| is equal to \sum_{l=1}^k \|\epsilon_l\|_F, and \Delta I_j^{\mathcal{Z},E} is a matrix in R^{p_j x |\Omega_j|} such that for all matrices U of the same size,

    \langle \Delta I_j^{\mathcal{Z},E}, U \rangle = \langle \epsilon_j, g_j(U) \rangle + \langle \Delta I_{j-1}^{\mathcal{Z},E}, h_j(U) \rangle,    (9)

where the inner-product is the Frobenius one and g_j, h_j are linear functions. Then,

    \nabla_{Z_j} L(y, \langle W, I_k^{\mathcal{Z}} \rangle) = L'(y, \langle W, I_k^{\mathcal{Z}} \rangle) \, g_j(h_{j+1}(\ldots h_k(W))),    (10)

where L' denotes the derivative of the smooth function L with respect to its second argument.

The proof of this lemma is straightforward and follows from the definition of the Frechet derivative. Nevertheless, it is useful to derive the closed form of the gradient in the next proposition.

Proposition 1 (Gradient of the loss with respect to the filters Z_1, ..., Z_k.)
Consider the quantities introduced in Lemma 1, but denote I_j^{\mathcal{Z}} by I_j for simplicity. By construction, we have for all j >= 1,

    I_j = A_j \, \kappa_j(Z_j^\top E_j(I_{j-1}) S_j^{-1}) \, S_j P_j,    (11)

where I_j is seen as a matrix in R^{p_j x |\Omega_j|}; E_j is the linear operator that extracts all overlapping e_{j-1} x e_{j-1} patches from a map such that E_j(I_{j-1}) is a matrix of size p_{j-1} e_{j-1}^2 x |\Omega_{j-1}|; S_j is a diagonal matrix whose diagonal entries carry the l2-norm of the columns of E_j(I_{j-1}); A_j is short for \kappa_j(Z_j^\top Z_j)^{-1/2}; and P_j is a matrix of size |\Omega_{j-1}| x |\Omega_j| performing the linear pooling operation. Then, the gradient of the loss with respect to the filters Z_j, j = 1, ..., k is given by (10) with

    g_j(U) = E_j(I_{j-1}) B_j^\top - \frac{1}{2} Z_j \left( \kappa_j'(Z_j^\top Z_j) \odot (C_j + C_j^\top) \right),
    h_j(U) = E_j^\star\!\left( Z_j B_j + E_j(I_{j-1}) \left( S_j^{-2} \odot \left( M_j^\top U P_j^\top - E_j(I_{j-1})^\top Z_j B_j \right) \right) \right),    (12)

where U is any matrix of the same size as I_j, M_j = A_j \kappa_j(Z_j^\top E_j(I_{j-1}) S_j^{-1}) S_j is the j-th feature map before the pooling step, \odot is the Hadamard (elementwise) product, E_j^\star is the adjoint of E_j, and

    B_j = \kappa_j'\!\left( Z_j^\top E_j(I_{j-1}) S_j^{-1} \right) \odot \left( A_j U P_j^\top \right) \quad \text{and} \quad C_j = A_j^{1/2} I_j U^\top A_j^{3/2}.    (13)

The proof is presented in Appendix B. Most quantities that appear above admit physical interpretations: multiplication by P_j performs downsampling; multiplication by P_j^\top performs upsampling; multiplication of E_j(I_{j-1}) on the right by S_j^{-1} performs l2-normalization of the columns; Z_j^\top E_j(I_{j-1}) can be seen as a spatial convolution of the map I_{j-1} by the filters Z_j; finally, E_j^\star "combines" a set of patches into a spatial map by adding to each pixel location the respective patch contributions.
Computing the gradient requires a forward pass to obtain the maps I_j through (11) and a backward pass that composes the functions g_j, h_j as in (10). The complexity of the forward step is dominated by the convolutions Z_j^\top E_j(I_{j-1}), as in convolutional neural networks. The cost of the backward pass is the same as the forward one up to a constant factor. Assuming p_j <= |\Omega_{j-1}|, which is typical for lower layers that require more computation than upper ones, the most expensive cost is due to E_j(I_{j-1}) B_j^\top and Z_j B_j, which is the same as Z_j^\top E_j(I_{j-1}). 
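Putting Eqs. (3) and (4) together, the forward recursion of Eq. (11), I_j = A_j kappa_j(Z_j^T E_j(I_{j-1}) S_j^{-1}) S_j P_j, can be sketched as follows. This is a minimal NumPy version under our own conventions: patch extraction E_j is left to the caller, pooling is passed as an explicit matrix P, and the regularization eps and the small 1e-5 offset on S_j follow the text; the function name is ours.

```python
import numpy as np

def layer_forward(EI, Z, P, alpha=1.0, eps=1e-3):
    # One CKN layer: I_j = A_j kappa_j(Z_j^T E_j(I_{j-1}) S_j^{-1}) S_j P_j   (Eq. 11).
    # EI: (d, n) matrix of extracted patches E_j(I_{j-1}), one column per location;
    # Z: (d, p_j) unit-norm filters; P: (n, n_out) linear pooling matrix.
    kappa = lambda u: np.exp(alpha * (u - 1.0))
    s = np.linalg.norm(EI, axis=0) + 1e-5            # diagonal of S_j, with the small offset
    G = kappa(Z.T @ Z) + eps * np.eye(Z.shape[1])    # regularized kernel matrix of the filters
    w, V = np.linalg.eigh(G)
    A = (V / np.sqrt(w)) @ V.T                       # A_j = kappa_j(Z_j^T Z_j)^{-1/2}
    M = (A @ kappa(Z.T @ (EI / s))) * s              # feature map M_j before pooling
    return M @ P                                     # pooled map I_j, of size (p_j, n_out)
```

In a full implementation, this forward pass would be followed by the backward pass composing g_j and h_j as in Eq. (10); automatic differentiation applied to a function like this one would produce the same gradients as the closed forms of Proposition 1.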
We also pre-compute A_j^{1/2} and A_j^{3/2} by eigenvalue decompositions, whose cost is reasonable when performed only once per minibatch. Off-diagonal elements of M_j^\top U P_j^\top - E_j(I_{j-1})^\top Z_j B_j are also not computed, since they are set to zero after elementwise multiplication with a diagonal matrix. In practice, we also replace A_j by (\kappa_j(Z_j^\top Z_j) + \epsilon I)^{-1/2} with \epsilon = 0.001, which corresponds to performing a regularized projection onto F_j (see Appendix A). Finally, a small offset of 0.00001 is added to the diagonal entries of S_j.

Optimizing hyper-parameters for RBF kernels. When using the kernel (2), the objective is differentiable with respect to the hyper-parameters \alpha_j. When large amounts of training data are available and overfitting is not an issue, optimizing the training loss by taking gradient steps with respect to these parameters seems appropriate instead of using a canonical parameter value. Otherwise, more involved techniques may be needed; we plan to investigate other strategies in future work.

3.2 Optimization and Practical Heuristics

The backpropagation rules of the previous section have set the stage for using a stochastic gradient descent method (SGD). We now present a few strategies to accelerate it in our context.
Hybrid convex/non-convex optimization. Recently, many incremental optimization techniques have been proposed for solving convex optimization problems of the form (7) when n is large but finite (see [16] and references therein). These methods usually provide a great speed-up over the stochastic gradient descent algorithm without suffering from the burden of choosing a learning rate. The price to pay is that they rely on convexity and require storing the full training set in memory. For solving (7) with fixed network parameters \mathcal{Z}, this means storing the n maps I_k^i, which is often reasonable if we do not use data augmentation. 
To partially leverage these fast algorithms for our non-convex problem, we have adopted a minimization scheme that alternates between two steps: (i) fix \mathcal{Z}, then make a forward pass on the data to compute the n maps I_k^i and minimize the convex problem (7) with respect to W using the accelerated MISO algorithm [16]; (ii) fix W, then make one pass of a projected stochastic gradient algorithm to update the k sets of filters Z_j. The set of network parameters \mathcal{Z} is initialized with the unsupervised learning method described in Section 2.

Preconditioning on the sphere. The kernels \kappa_j are defined on the sphere; therefore, it is natural to constrain the filters, that is, the columns of the matrices Z_j, to have unit l2-norm. As a result, a classical stochastic gradient descent algorithm updates each filter z at iteration t as follows: z <- Proj_{\|.\|_2=1}[z - \eta_t \nabla_z L_t], where \nabla_z L_t is an estimate of the gradient computed on a minibatch and \eta_t is a learning rate. In practice, we found that convergence could be accelerated by preconditioning, which consists of optimizing after a change of variable to reduce the correlation of gradient entries. For unconstrained optimization, this heuristic involves choosing a symmetric positive definite matrix Q and replacing the update direction \nabla_z L_t by Q \nabla_z L_t, or, equivalently, performing the change of variable z = Q^{1/2} z' and optimizing over z'. When constraints are present, the case is not as simple since Q \nabla_z L_t may not be a descent direction. Fortunately, it is possible to exploit the manifold structure of the constraint set (here, the sphere) to perform an appropriate update [1]. Concretely, (i) we choose a matrix Q per layer equal to the inverse covariance matrix of the patches from the same layer, computed after the initialization of the network parameters. 
(ii) We perform stochastic gradient descent steps on the sphere manifold after the change of variable z = Q^{1/2} z', leading to the update

    z <- Proj_{\|.\|_2=1}\left[ z - \eta_t \left( I - \frac{1}{z^\top Q z} Q z z^\top \right) Q \nabla_z L_t \right].

Because this heuristic is not a critical component, but simply an improvement of SGD, we relegate mathematical details to Appendix C.

Automatic learning rate tuning. Choosing the right learning rate in stochastic optimization is still an important issue despite the large amount of work existing on the topic (see, e.g., [13] and references therein). In our paper, we use the following basic heuristic: the initial learning rate \eta_t is chosen "large enough"; then, the training loss is evaluated after each update of the weights W. When the training loss increases between two epochs, we simply divide the learning rate by two and perform "back-tracking" by replacing the current network parameters by the previous ones.

Active-set heuristic. For classification tasks, "easy" samples often have a negligible contribution to the gradient (see, e.g., [13]). For instance, for the squared hinge loss L(y, \hat{y}) = max(0, 1 - y\hat{y})^2, the gradient vanishes when the margin y\hat{y} is greater than one. This motivates the following heuristic: we consider a set of active samples, initially all of them, and remove a sample from the active set as soon as we obtain zero when computing its gradient. In the subsequent optimization steps, only active samples are considered, and after each epoch, we randomly reactivate 10% of the inactive ones.

4 Experiments

We now present experiments on image classification and super-resolution. 
All experiments were conducted on 8-core and 10-core 2.4GHz Intel CPUs using C++ and Matlab.

4.1 Image Classification on “Deep Learning” Benchmarks

We consider the datasets CIFAR-10 [12] and SVHN [19], which contain 32 × 32 images from 10 classes. CIFAR-10 is medium-sized, with 50 000 training samples and 10 000 test ones. SVHN is larger, with 604 388 training examples and 26 032 test ones. We evaluate the performance of a 9-layer network, designed with few hyper-parameters: for each layer, we learn 512 filters and choose the RBF kernels κj defined in (2) with initial parameters αj = 1/(0.5²). Layers 1, 3, 5, 7, and 9 use 3×3 patches and a subsampling pooling factor of √2, except for layer 9, where the factor is 3; layers 2, 4, 6, and 8 simply use 1×1 patches and no subsampling. For CIFAR-10, the parameters αj are kept fixed during training, and for SVHN, they are updated in the same way as the filters. We use the squared hinge loss in a one-vs-all setting to perform multi-class classification (with filters Z shared between classes). The input of the network is pre-processed with the local whitening procedure described in [20]. We use the optimization heuristics from the previous section, notably the automatic learning rate scheme, and a gradient momentum with parameter 0.9, following [12]. The regularization parameter λ and the number of epochs are set by first running the algorithm on an 80/20 validation split of the training set. λ is chosen near the canonical parameter λ = 1/n, in the range 2^i/n with i = −4, …, 4, and the number of epochs is at most 100. The initial learning rate is 10 with a minibatch size of 128. We present our results in Table 1 along with the performance achieved by a few recent methods without data augmentation or model voting/averaging.
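As an aside, the squared hinge loss used above is the one that drives the active-set heuristic of Section 3: its gradient vanishes as soon as the margin exceeds one, so "easy" samples can be dropped from the pass. A minimal NumPy illustration (the function name and sample values are ours):

```python
import numpy as np

def sq_hinge_grad(y, y_hat):
    """d/dy_hat of L(y, y_hat) = max(0, 1 - y*y_hat)^2.
    The gradient is exactly zero whenever the margin y*y_hat >= 1."""
    margin = y * y_hat
    return np.where(margin >= 1.0, 0.0, -2.0 * y * (1.0 - margin))

# Samples with margin >= 1 contribute nothing to the gradient; the active-set
# heuristic keeps only the others (reactivating 10% of the rest each epoch).
y = np.array([1.0, 1.0, -1.0, -1.0])
y_hat = np.array([1.5, 0.2, -2.0, 0.3])
grad = sq_hinge_grad(y, y_hat)
active = grad != 0.0   # only the second and fourth samples remain active
```

The same computation extends to the one-vs-all setting by applying it independently to each class score.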
In this context, the best published results are obtained by the generalized pooling scheme of [14]. We achieve about 2% test error on SVHN and about 10% on CIFAR-10, which positions our method as a reasonably “competitive” one, in the same ballpark as the deeply-supervised nets of [15] or the network-in-network of [17].

Table 1: Test error in percent reported by a few recent publications on the CIFAR-10 and SVHN datasets without data augmentation or model voting/averaging.

           Stoch P. [29]  MaxOut [9]  NiN [17]  DSN [15]  Gen P. [14]  SCKN (Ours)
CIFAR-10       15.13         11.68      10.41      9.69       7.62        10.20
SVHN            2.80          2.47       2.35      1.92       1.69         2.04

Due to lack of space, the results reported here only include a single supervised model. Preliminary experiments with no supervision also show that one may obtain competitive accuracy with wide, shallow architectures. For instance, a two-layer network with (1024-16384) filters achieves 14.2% error on CIFAR-10. Note also that our unsupervised model outperforms the original CKNs [18]: the best single model from [18] gives 21.7%, while training the same architecture with our approach is two orders of magnitude faster and gives 19.3%. Another aspect we did not study is model complexity. Here as well, preliminary experiments are encouraging: reducing the number of filters to 128 per layer yields 11.95% error on CIFAR-10 and 2.15% on SVHN. A more precise comparison with no supervision and with various network complexities will be presented in another venue.

4.2 Image Super-Resolution from a Single Image

Image up-scaling is a challenging problem, where convolutional neural networks have achieved significant success [7, 8, 27]. Here, we follow [8] and replace traditional convolutional neural networks by our supervised kernel machine.
Specifically, RGB images are converted to the YCbCr color space and the upscaling method is applied to the luminance channel only, to make the comparison with previous work possible. The problem is then formulated as a multivariate regression one. We build a database of 200 000 patches of size 32 × 32 randomly extracted from the BSD500 dataset [2], after removing image 302003.jpg, which overlaps with one of the test images. 16 × 16 versions of the patches are built using the Matlab function imresize and upscaled back to 32 × 32 by bicubic interpolation; the goal is then to predict high-resolution images from blurry bicubic interpolations. The blurry estimates are processed by a 9-layer network, with 3 × 3 patches and 128 filters at every layer, without linear pooling and zero-padding. Pixel values are predicted with a linear model applied to the 128-dimensional vectors present at every pixel location of the last layer, and we use the square loss to measure the fit. The optimization procedure and the kernels κj are identical to the ones used for processing the SVHN dataset in the classification task. The pipeline also includes a pre-processing step, where we remove from input images a local mean component obtained by convolving the images with a 5 × 5 averaging box filter; the mean component is added back after up-scaling.

For the evaluation, we consider three datasets: Set5 and Set14 are standard for super-resolution; Kodim is the Kodak Image database, available at http://r0k.us/graphics/kodak/, which contains high-quality images with no compression or demosaicing artefacts. The evaluation procedure follows [7, 8, 26, 27] by using the code from the author's web page. We present quantitative results in Table 2. For x3 upscaling, we simply used our model learned for x2 upscaling twice, followed by a 3/4 downsampling.
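The arithmetic behind this scheme is simply 2 × 2 × 3/4 = 3. A minimal sketch of the factor composition, with a nearest-neighbour stand-in for the learned x2 model and a crude index-dropping 3/4 downsampling (both are our illustrative substitutes, not the resampling used in the paper):

```python
import numpy as np

def upscale_x2(img):
    """Stand-in for the learned x2 model: nearest-neighbour upsampling."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def downsample_3_4(img):
    """Crude 3/4 downsampling that drops every fourth row and column;
    it only illustrates the scaling factor, not a proper resampling."""
    keep_r = np.arange(img.shape[0]) % 4 != 3
    keep_c = np.arange(img.shape[1]) % 4 != 3
    return img[keep_r][:, keep_c]

low_res = np.random.rand(8, 8)
x3 = downsample_3_4(upscale_x2(upscale_x2(low_res)))  # net factor 2 * 2 * 3/4 = 3
```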
This is clearly suboptimal, since our model is not trained to up-scale by a factor of 3, but this naive approach still outperforms other baselines [7, 8, 27] that are trained end-to-end. Note that [27] also proposes a data augmentation scheme at test time that slightly improves their results. In Appendix D, we also present a visual comparison between our approach and [8], whose pipeline is the closest to ours, up to the use of a supervised kernel machine instead of CNNs.

Table 2: Reconstruction accuracy for super-resolution in PSNR (the higher, the better). All CNN approaches are without data augmentation at test time. See Appendix D for the SSIM quality measure.

Fact.  Dataset  Bicubic  SC [30]  ANR [26]  A+ [26]  CNN1 [7]  CNN2 [8]  CSCN [27]  SCKN
x2     Set5       33.66    35.78     35.83    36.54     36.34     36.66      36.93  37.07
x2     Set14      30.23    31.80     31.79    32.28     32.18     32.45      32.56  32.76
x2     Kodim      30.84    32.19     32.23    32.71     32.62     32.80      32.94  33.21
x3     Set5       30.39    31.90     31.92    32.58     32.39     32.75      33.10  33.08
x3     Set14      27.54    28.67     28.65    29.13     29.00     29.29      29.41  29.50
x3     Kodim      28.43    29.21     29.21    29.57     29.42     29.64      29.76  29.88

Acknowledgments

This work was supported by ANR (MACARON project ANR-14-CE23-0003-01).

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE T. Pattern Anal., 33(5):898–916, 2011.
[3] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In CVPR, 2011.
[4] D. S. Broomhead and D. Lowe. Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, DTIC Document, 1988.
[5] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Adv. NIPS, 2009.
[6] A. Damianou and N. Lawrence. Deep Gaussian processes. In Proc.
AISTATS, 2013.
[7] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Proc. ECCV, 2014.
[8] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE T. Pattern Anal., 38(2):295–307, 2016.
[9] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In Proc. ICML, 2013.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, 2015.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Adv. NIPS, 2012.
[13] Y. Le Cun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524, 1998.
[14] C.-Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Proc. AISTATS, 2016.
[15] C.-Y. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Proc. AISTATS, 2015.
[16] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Adv. NIPS, 2015.
[17] M. Lin, Q. Chen, and S. Yan. Network in network. In Proc. ICLR, 2013.
[18] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Adv. NIPS, 2014.
[19] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning, 2011.
[20] M. Paulin, M. Douze, Z. Harchaoui, J. Mairal, F. Perronin, and C. Schmid. Local convolutional features with unsupervised training for image retrieval. In Proc. ICCV, 2015.
[21] A. Rahimi and B. Recht.
Random features for large-scale kernel machines. In Adv. NIPS, 2007.
[22] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
[24] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. J. Mach. Learn. Res., 7:1531–1565, 2006.
[25] V. Sydorov, M. Sakurada, and C. Lampert. Deep Fisher kernels — end to end learning of the Fisher kernel GMM parameters. In Proc. CVPR, 2014.
[26] R. Timofte, V. Smet, and L. van Gool. Anchored neighborhood regression for fast example-based super-resolution. In Proc. ICCV, 2013.
[27] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang. Deep networks for image super-resolution with sparse prior. In Proc. ICCV, 2015.
[28] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Adv. NIPS, 2001.
[29] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In Proc. ICLR, 2013.
[30] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In Curves and Surfaces, pages 711–730, 2010.
[31] K. Zhang, I. W. Tsang, and J. T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proc. ICML, 2008.