{"title": "Layer-wise analysis of deep networks with Gaussian kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1678, "page_last": 1686, "abstract": "Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers.", "full_text": "Layer-wise analysis of deep networks with Gaussian\n\nkernels\n\nGr\u00b4egoire Montavon\n\nMachine Learning Group\n\nTU Berlin\n\nMikio L. Braun\n\nMachine Learning Group\n\nTU Berlin\n\nKlaus-Robert M\u00a8uller\nMachine Learning Group\n\nTU Berlin\n\ngmontavon@cs.tu-berlin.de\n\nmikio@cs.tu-berlin.de\n\nkrm@cs.tu-berlin.de\n\nAbstract\n\nDeep networks can potentially express a learning problem more ef\ufb01ciently than lo-\ncal learning machines. While deep networks outperform local learning machines\non some problems, it is still unclear how their nice representation emerges from\ntheir complex structure. We present an analysis based on Gaussian kernels that\nmeasures how the representation of the learning problem evolves layer after layer\nas the deep network builds higher-level abstract representations of the input. 
We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers.\n\n1 Introduction\n\nLocal learning machines such as nearest neighbor classifiers, radial basis function (RBF) kernel machines or linear classifiers predict the class of new data points from their neighbors in the input space. A limitation of local learning machines is that they cannot generalize beyond the notion of continuity in the input space. This limitation becomes detrimental when the Bayes classifier has more variations (ups and downs) than the number of labeled samples available. This situation typically occurs on problems where an instance \u2014 let\u2019s say, a handwritten digit \u2014 can take various forms due to irrelevant variation factors such as its position, its size, its thickness and more complex deformations. These multiple factors of variation can greatly increase the complexity of the learning problem (Bengio, 2009).\n\nThis limitation motivates the creation of learning machines that can map the input space into a higher-level representation where regularities of higher order than simple continuity in the input space can be expressed. Engineered feature extractors, nonlocal kernel machines (Zien et al., 2000) or deep networks (Rumelhart et al., 1986; LeCun et al., 1998; Hinton et al., 2006; Bengio et al., 2007) can implement these more complex regularities. Deep networks implement them by distorting the input space so that initially distant points in the input space appear closer. Also, their multilayered nature acts as a regularizer, allowing them to reuse at a given layer features computed at the previous layer (Bengio, 2009). 
Understanding how the representation is built in a deep network and how to train it efficiently has received a lot of attention (Goodfellow et al., 2009; Larochelle et al., 2009; Erhan et al., 2010). However, it is still unclear how their nice representation emerges from their complex structure, in particular, how the representation evolves from layer to layer.\n\nThe main contribution of this paper is to introduce an analysis based on RBF kernels and on the kernel principal component analysis (kPCA, Sch\u00f6lkopf et al., 1998) that can capture and quantify the layer-wise evolution of the representation in a deep network. In practice, for each layer 1 \u2264 l \u2264 L of the deep network, we take a small labeled dataset D, compute its image D(l) at the layer l of the deep network and measure what dimensionality the local model built on top of D(l) must have in order to solve the learning problem with a certain accuracy.\n\n[Figure 1: schematic of the forward path x \u2192 f1(x) \u2192 f2(f1(x)) \u2192 f3(f2(f1(x))) for layers l = 0, 1, 2, 3, next to plots of the error e(d) against the dimensionality d for each layer and of the error e(do) against the layer l.]\n\nFigure 1: As we move from the input to the output of the deep network, better representations of the learning problem are built. We measure this improvement with the layer-wise RBF analysis presented in Section 2 and Section 3.2. This analysis relates the prediction error e(d) to the dimensionality d of a local model built at each layer of the deep network. As the data is propagated through the deep network, lower errors are obtained with lower-dimensional local models. 
The plots on the right illustrate this dynamic, where the thick gray arrows indicate the forward path of the deep network and where do is a fixed number of dimensions.\n\nWe apply this novel analysis to a multilayer perceptron (MLP), a pretrained multilayer perceptron (PMLP) and a convolutional neural network (CNN). We observe in each case that the error and the dimensionality of the local model decrease as we propagate the dataset through the deep network. This reveals that the deep network improves the representation of the learning problem layer after layer. This progressive layer-wise simplification is illustrated in Figure 1. In addition, we observe that the CNN and the PMLP tend to postpone the discrimination to the last layers, leading to more transferable features and better-generalizing representations than for the simple MLP. This result suggests that the structure of a deep network, by enforcing a separation of concerns between low-level generic features and high-level task-specific features, plays an important role in building good representations.\n\n2 RBF analysis of a learning problem\n\nWe would like to quantify the complexity of a learning problem p(y | x) where samples are drawn independently from a probability distribution p(x, y). A simple way to do so is to measure how many degrees of freedom (or dimensionality d) a local model must have in order to solve the learning problem with a certain error e. This analysis relates the dimensionality d of the local model to its prediction error e(d).\n\nIn practice, there are many ways to define the dimensionality of a model, for example: (1) the number of samples given to the learning machine, (2) the number of required hidden nodes of a neural network (Murata et al., 1994), (3) the number of support vectors of an SVM or (4) the number of leading kPCA components of the input distribution p(x) used in the model. 
The last option is chosen for the following two reasons:\n\nFirst, the kPCA components are added cumulatively to the prediction model as the dimensionality of the model increases, thus offering stability, while in the case of support vector machines, previously chosen support vectors might be dropped in favor of other support vectors in higher-dimensional models.\n\nSecond, the leading kPCA components obtained with a finite and typically small number of samples n are similar to those that would be obtained in the asymptotic case where p(x, y) is fully observed (n \u2192 \u221e). This property is shown by Braun (2006) and Braun et al. (2008) in the case of a single kernel, and by extension, in the case of a finite set of kernels.\n\nThis last property is particularly useful since p(x, y) is unknown and only a finite number of observations are available. The analysis presented here is strongly inspired by the relevant dimensionality estimation (RDE) method of Braun et al. (2008) and is illustrated in Figure 2 for a small two-dimensional toy example.\n\n[Figure 2: six panels showing the decision boundary for d = 1, . . . , 6, with errors e(d) = 0.5, 0.25, 0.25, 0, 0, 0.]\n\nFigure 2: Illustration of the RBF analysis on a toy dataset of 12 samples. As we add more and more leading kPCA components, the model becomes more flexible, creating a better decision boundary. Note that with four leading kPCA components out of the 12 kPCA components, all the samples are already classified perfectly.\n\nIn the next lines, we present the computation steps required to estimate the error as a function of the dimensionality.\n\nLet {(x1, y1), . . . , (xn, yn)} be a dataset of n points drawn independently from p(x, y) where yi is an indicator vector having value 1 at the index corresponding to the class of xi and 0 elsewhere. Let X = (x1, . . . , xn) and Y = (y1, . . . , yn) be the matrices associated to the inputs and labels of the dataset. We compute the kernel matrix K associated to the dataset:\n\n[K]ij = k(xi, xj)   where   k(x, x\u2032) = exp(\u2212||x \u2212 x\u2032||\u00b2 / (2\u03c3\u00b2)).\n\nThe kPCA components u1, . . . , un are obtained by performing an eigendecomposition of K where the eigenvectors u1, . . . , un have unit length and the eigenvalues \u03bb1, . . . , \u03bbn are sorted by decreasing magnitude:\n\nK = (u1| . . . |un) \u00b7 diag(\u03bb1, . . . , \u03bbn) \u00b7 (u1| . . . |un)\u22a4\n\nLet \u02c6U = (u1| . . . |ud) and \u02c6\u039b = diag(\u03bb1, . . . , \u03bbd) be a d-dimensional approximation of the eigendecomposition. We fit a linear model \u03b2\u22c6 that maps the projection on the d leading components of the training data to the log-likelihood of the classes:\n\n\u03b2\u22c6 = argmin_\u03b2 || exp(\u02c6U \u02c6U\u22a4 \u03b2) \u2212 Y ||\u00b2_F\n\nwhere \u03b2 is a matrix of the same size as Y and where the exponential function is applied element-wise. The predicted class log-probability log(\u02c6y) of a test point (x, y) is computed as\n\nlog(\u02c6y) = k(x, X) \u02c6U \u02c6\u039b\u207b\u00b9 \u02c6U\u22a4 \u03b2\u22c6 + C\n\nwhere k(x, X) is a matrix of size 1 \u00d7 n computing the similarities between the new point and each training point and where C is a normalization constant. The test error is defined as:\n\ne(d) = Pr(argmax \u02c6y \u2260 argmax y)\n\nThe training and test error can be used as an approximation bound for the asymptotic case n \u2192 \u221e where the data would be projected on the real eigenvectors of the input distribution. In the next sections, the training and test error are depicted respectively as dotted and solid lines in Figure 3 and as the bottom and the top of error bars in Figure 4. For each dimension, the kernel scale parameter \u03c3 that minimizes e(d) is retained, leading to a different kernel for each dimensionality. 
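The computation above can be condensed into a short numerical sketch. Two simplifications relative to the text are made, and the function names are ours: the class scores are fitted by plain least squares on the d leading kPCA projections (rather than through the element-wise exponential model for the class log-likelihoods), and a single kernel scale sigma is used instead of retaining the best scale per dimensionality.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """Gaussian kernel matrix k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rde_error_curve(X_tr, Y_tr, X_te, Y_te, sigma, dims):
    """Test error e(d) of models restricted to the d leading kPCA components
    of the training kernel matrix (least-squares simplification of the fit)."""
    K = rbf_kernel(X_tr, X_tr, sigma)
    lam, U = np.linalg.eigh(K)          # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]      # re-sort by decreasing magnitude
    K_te = rbf_kernel(X_te, X_tr, sigma)
    errors = []
    for d in dims:
        Ud, lam_d = U[:, :d], lam[:d]
        # coefficients U_d diag(lambda_d)^-1 U_d^T Y, so that the training
        # prediction K @ coef equals the label projection U_d U_d^T Y
        coef = Ud @ ((Ud.T @ Y_tr) / lam_d[:, None])
        pred = K_te @ coef
        errors.append(float(np.mean(pred.argmax(1) != Y_te.argmax(1))))
    return errors
```

On a well-separated toy problem, the curve drops to a low error once d reaches the number of components needed to express the decision boundary, mirroring the behavior illustrated in Figure 2.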
The rationale for taking a different kernel for each model is that the optimal scale parameter typically shrinks as more leading components of the input distribution are observed.\n\n3 Methodology\n\nIn order to test our two hypotheses (the progressive emergence of good representations in deep networks and the role of the structure for postponing discrimination), we consider three deep networks of interest, namely a convolutional neural network (CNN), a multilayer perceptron (MLP) and a variant of the multilayer perceptron pretrained in an unsupervised fashion with a deep belief network (PMLP). These three deep networks are chosen in order to evaluate how the two types of regularizers implemented respectively by the CNN and the PMLP affect the evolution of the representation layer after layer. We describe how they are built, how they are trained and how they are analyzed layer-wise with the RBF analysis described in Section 2.\n\nThe multilayer perceptron (MLP) is a deep network obtained by alternating linear transformations and element-wise nonlinearities. Each layer maps an input vector of size m into an output vector of size n and consists of (1) a linear transformation linear_{m\u2192n}(x) = w \u00b7 x + b where w is a weight matrix of size n \u00d7 m learned from the data and (2) a nonlinearity applied element-wise to the output of the linear transformation. 
Our implementation of the MLP maps two-dimensional images of 28 \u00d7 28 pixels into a vector of size 10 (the 10 possible digits) by applying successively the following functions:\n\nf1(x) = tanh(linear_{28\u00d728\u2192784}(x))\nf2(x) = tanh(linear_{784\u2192784}(x))\nf3(x) = tanh(linear_{784\u2192784}(x))\nf4(x) = softmax(linear_{784\u219210}(x))\n\nThe pretrained multilayer perceptron (Hinton et al., 2006), abbreviated PMLP in this paper, is a variant of the MLP where the weights are initialized with a deep belief network (DBN, Hinton et al., 2006) using an unsupervised greedy layer-wise pretraining procedure. This particular weight initialization acts as a regularizer, allowing it to learn a better-generalizing representation of the learning problem than the simple MLP.\n\nThe convolutional neural network (CNN, LeCun et al., 1998) is a deep network obtained by alternating convolution filters y = convolve^{a\u00d7b}_{m\u2192n}(x), transforming a set of m input feature maps {x1, . . . , xm} into a set of n output feature maps {yi = \u2211_{j=1..m} wij \u22c6 xj + bi , i = 1, . . . , n} where the convolution filters wij of size a \u00d7 b are learned from data, and pooling units subsampling each feature map by a factor of two. Our implementation maps images of 32 \u00d7 32 pixels into a vector of size 10 (the 10 possible digits) by applying successively the following functions:\n\nf1(x) = tanh(pool(convolve^{5\u00d75}_{1\u219236}(x)))\nf2(x) = tanh(pool(convolve^{5\u00d75}_{36\u219236}(x)))\nf3(x) = tanh(linear_{5\u00d75\u00d736\u2192400}(x))\nf4(x) = softmax(linear_{400\u219210}(x))\n\nThe CNN is inspired by the structure of biological visual systems (Hubel and Wiesel, 1962). 
It combines three ideas into a single architecture: (1) only local connections between neighboring pixels are allowed, (2) the convolution operator applies the same filter over the whole feature map and (3) a pooling mechanism at the top of each convolution filter adds robustness to input distortion. These mechanisms act as a regularizer on images and other types of sequential data, and learn well-generalizing models from few data points.\n\n3.1 Training the deep networks\n\nEach deep network is trained on the MNIST handwritten digit recognition dataset (LeCun et al., 1998). The task is to predict the digit 0\u20139 from scanned handwritten digits of 28 \u00d7 28 pixels. We randomly partition the MNIST training set into three subsets of 45000, 5000 and 10000 samples that are used respectively for training the deep network, selecting the parameters of the deep network and performing the RBF analysis.\n\nWe consider three training procedures:\n\n1. No training: the weights of the deep network are left at their initial value. If the deep network hasn\u2019t received unsupervised pretraining, the weights are set randomly according to a normal distribution N(0, \u03b3\u207b\u00b9) where \u03b3 denotes, for a given layer, the number of input nodes that are connected to a single output node.\n\n2. Training on an alternate task: the deep network is trained on a binary classification task that consists of determining whether the digit is original (positive example) or whether it has been transformed by one of the 11 possible rotation/flip combinations that differ from the original (negative example). This problem therefore has 540000 labeled samples (45000 positives and 495000 negatives). The goal of training a deep network on an alternate task is to learn features on a problem where labeled samples are abundant and then reuse these features to learn the target task, which typically has few labels. 
In the alternate task described earlier, negative examples form a cloud around the manifold of positive examples, and learning this manifold potentially allows the deep network to learn features that can be transferred to the digit recognition task.\n\n3. Training on the target task: the deep network is trained on the digit recognition task using the 45000 labeled training samples.\n\nThese procedures are chosen in order to assess the forming of good representations in deep networks and to test the role of the structure of deep networks on different aspects of learning, such as the effectiveness of random projections, the transferability of features from one task to another and the generalization to new samples of the same distribution.\n\n3.2 Applying the RBF analysis to deep networks\n\nIn this section, we explain how the RBF analysis described in Section 2 is applied to analyze layer-wise the deep networks presented in Section 3.\n\nLet f = fL \u25e6 \u00b7\u00b7\u00b7 \u25e6 f1 be the trained deep network of depth L. Let D be the analysis dataset containing the 10000 samples of the MNIST dataset on which the deep network hasn\u2019t been trained. For each layer, we build a new dataset D(l) corresponding to the mapping of the original dataset D through the l first layers of the deep network. Note that by definition, the index zero corresponds to the raw input data (mapped through zero layers):\n\nD(l) = D                                        if l = 0,\nD(l) = {(fl \u25e6 \u00b7\u00b7\u00b7 \u25e6 f1(x), t) | (x, t) \u2208 D}     if 1 \u2264 l \u2264 L.\n\nThen, for each dataset D(0), . . . , D(L) we perform the RBF analysis described in Section 2. We use n = 2500 samples for computing the eigenvectors and the remaining 7500 samples to estimate the prediction error of the model. This analysis yields for each dataset D(l) the error as a function of the dimensionality of the model e(d). 
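The construction of the layer-wise datasets D(0), . . . , D(L), together with the kernel-scale grid described in Section 3.2 (quantiles of the distribution of pairwise distances), can be sketched as follows. This is an illustrative sketch, not the authors' code; the function names layer_datasets and quantile_sigmas are our own.

```python
import numpy as np

def layer_datasets(layers, X):
    """Images of the dataset X under the first l layers, for l = 0..L.

    `layers` is a list of callables [f1, ..., fL]; the returned list starts
    with the raw input (l = 0, i.e. mapped through zero layers)."""
    datasets = [X]
    h = X
    for f in layers:
        h = f(h)
        datasets.append(h)
    return datasets

def quantile_sigmas(X, quantiles=(0.01, 0.05, 0.10, 0.25,
                                  0.5, 0.75, 0.9, 0.95, 0.99)):
    """Candidate RBF scale parameters: quantiles of the pairwise distances."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    dists = np.sqrt(d2[np.triu_indices(len(X), k=1)])  # distinct pairs only
    return np.quantile(dists, quantiles)
```

Each dataset D(l) would then be analyzed with the RBF procedure of Section 2, retaining for each dimensionality d the scale parameter in the grid that minimizes e(d).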
A typical evolution of e(d) is depicted in Figure 1.\nThe goal of this analysis is to observe the evolution of e(d) layer after layer for the deep networks\nand training procedures presented in Section 3 and to test the two hypotheses formulated in Section 1\n(the progressive emergence of good representations in deep networks and the role of the structure\nfor postponing discrimination). The interest of using a local model to solve the learning problem\nis that the local models are blind with respect to possibly better representations that could be ob-\ntained in previous or subsequent layers. This local scoping property allows for \ufb01ne isolation of the\nrepresentations in the deep network. The need for local scoping also arises when \u201cdebugging\u201d deep\narchitectures. Sometimes, deep architectures perform reasonably well even when the \ufb01rst layers do\nsomething wrong. This analysis is therefore able to detect these \u201cbugs\u201d.\n\nThe size n of the dataset is selected so that it is large enough to approximate well the asymptotic\ncase (n \u2192 \u221e) but also be small enough so that computing the eigendecomposition of the kernel\nmatrix of size n \u00d7 n is fast. We choose a set of scale parameters for the RBF kernel corresponding\nto the 0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 0.95 and 0.99 quantiles of the distribution of distances\nbetween pairs of data points.\n\n4 Results\n\nLayer-wise evolution of the error e(d) is plotted in Figure 3 in the supervised training case. The\nlayer-wise evolution of the error when d is \ufb01xed to 16 dimensions is plotted in Figure 4. 
Both figures capture the simultaneous reduction of error and dimensionality performed by the deep network when trained on the target task. In particular, they illustrate that in the last layers, a few dimensions are sufficient to build a good model of the target task.\n\nFigure 3: Layer-wise evolution of the error e(d) when the deep network has been trained on the target task. The solid line and the dotted line represent respectively the test error and the training error. As the data distribution is mapped through more and more layers, more accurate and lower-dimensional models of the learning problem can be obtained.\n\nFrom these results, we first demonstrate some properties of deep networks trained on an \u201casymptotically\u201d large number of samples. Then, we demonstrate the important role of structure in deep networks.\n\n4.1 Asymptotic properties of deep networks\n\nWhen the deep network is trained on the target task with an \u201casymptotically\u201d large number of samples (45000 samples) compared to the number of dimensions of the local model, the deep network builds representations layer after layer in which a low number of dimensions can create more accurate models of the learning problem.\n\nThis asymptotic property of deep networks should not be thought of as a statistical superiority of deep networks over local models. Indeed, it is still possible that a higher-dimensional local model applied directly on the raw data performs as well as a local model applied at the output of the deep network. Instead, this asymptotic property has the following consequence:\n\nDespite the internal complexity of deep networks, a local interpretation of the representation is possible at each stage of the processing. 
This means that deep networks do not explode the original data\ndistribution into a statistically intractable distribution before recombining everything at the output,\nbut instead, apply controlled distortions and reductions of the input space that preserve the statistical\ntractability of the data distribution at every layer.\n\n4.2 Role of the structure of deep networks\n\nWe can observe in Figure 4 (left) that even when the convolutional neural network (CNN) and the\npretrained MLP (PMLP) have not received supervised training, the \ufb01rst layers slightly improve the\nrepresentation with respect to the target task. On the other hand, the representation built by a simple\nMLP with random weights degrades layer after layer. This observation highlights the structural\nprior encoded by the CNN: by convolving the input with several random convolution \ufb01lters and\nsubsampling subsequent feature maps by a factor two, we obtain a random projection of the input\ndata that outperforms the implicit projection performed by an RBF kernel in terms of task relevance.\nThis observation closely relates to results obtained in (Ranzato et al., 2007; Jarrett et al., 2009) where\nit is observed that training the deep network while keeping random weights in the \ufb01rst layers still\nallows for good predictions by the subsequent layers. In the case of the PMLP, the successive layers\nprogressively disentangle the factors of variation (Hinton and Salakhutdinov, 2006; Bengio, 2009)\nand simplify the learning problem.\n\nWe can observe in Figure 4 (middle) that the phenomenon is even clearer when the CNN and the\nPMLP are trained on an alternate task: they are able to create generic features in the \ufb01rst layers\nthat transfer well to the target task. 
This observation suggests that the structure embedded in the CNN and the PMLP enforces a separation of concerns between the first layers that encode low-level features, for example, edge detectors, and the last layers that encode high-level task-specific features. On the other hand, the standard MLP trained on the alternate task leads to a degradation of representations. This degradation is even higher than in the case of random weights, despite all the prior knowledge on pixel neighborhood contained implicitly in the alternate task.\n\nFigure 4: Evolution of the error e(do) as a function of the layer l when do has been fixed to 16 dimensions. The top and the bottom of the error bars represent respectively the test error and the training error of the local model.\n\n[Panel labels: MLP, alternate task; MLP, target task; PMLP, alternate task; PMLP, target task; CNN, alternate task; CNN, target task.]\n\nFigure 5: Leading components of the weights (receptive fields) obtained in the first layer of each architecture. The filters learned by the CNN and the pretrained MLP are richer than the filters learned by the MLP. The first component of the MLP trained on the alternate task dominates all other components and prevents good transfer on the target task.\n\nFigure 5 shows that the MLP builds receptive fields that are spatially informative but dissimilar between the two tasks. The fact that the receptive fields are different for each task indicates that the MLP tries to discriminate already in the first layers. 
The absence of a built-in separation of concerns between low-level and high-level feature extractors seems to be a reason for the inability to learn transferable features. It indicates that end-to-end transfer learning on unstructured learning machines is in general not appropriate, and it supports the recent success of transfer learning on restricted portions of the deep network (Collobert and Weston, 2008; Weston et al., 2008) or on structured deep networks (Mobahi et al., 2009).\n\nWhen the deep networks are trained on the target task, the CNN and the PMLP solve the problem differently from the MLP. In Figure 4 (right), we can observe that the CNN and the PMLP tend to postpone the discrimination to the last layers while the MLP starts to discriminate already in the first layers. This result suggests that again, the structure contained in the CNN and the PMLP enforces a separation of concerns between the first layers encoding low-level generic features and the last layers encoding high-level task-specific features. This separation of concerns might explain the better generalization of the CNN and the PMLP observed respectively in (LeCun et al., 1998) and (Hinton et al., 2006). It also echoes the findings of Larochelle et al. (2009) showing that the pretraining of the PMLP must be unsupervised rather than supervised in order to build well-generalizing representations.\n\n5 Conclusion\n\nWe present a layer-wise analysis of deep networks based on RBF kernels. 
This analysis estimates for each layer of the deep network the number of dimensions that is necessary in order to model well a learning problem based on the representation obtained at the output of this layer.\n\nWe observe that a properly trained deep network creates representations layer after layer in which a more accurate and lower-dimensional local model of the learning problem can be built.\n\nWe also observe that despite a steady improvement of representations for each architecture of interest (the CNN, the MLP and the pretrained MLP), they do not solve the problem in the same way: the CNN and the pretrained MLP seem to separate concerns by building low-level generic features in the first layers and high-level task-specific features in the last layers, while the MLP does not enforce this separation. This observation emphasizes the limitations of black-box transfer learning and, more generally, of black-box training of deep architectures.\n\nReferences\n\nY. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pages 153\u2013160. MIT Press, 2007.\n\nYoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1\u2013127, 2009.\n\nMikio L. Braun. Accurate bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7:2303\u20132328, Nov 2006.\n\nMikio L. Braun, Joachim Buhmann, and Klaus-Robert M\u00fcller. On relevant dimensions in kernel feature spaces. Journal of Machine Learning Research, 9:1875\u20131908, Aug 2008.\n\nR. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, ICML, 2008.\n\nDumitru Erhan, Yoshua Bengio, Aaron C. Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625\u2013660, 2010.\n\nIan Goodfellow, Quoc Le, Andrew Saxe, and Andrew Y. Ng. Measuring invariances in deep networks. In Advances in Neural Information Processing Systems 22, pages 646\u2013654, 2009.\n\nG. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504\u2013507, July 2006.\n\nGeoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527\u20131554, 2006.\n\nD. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat\u2019s visual cortex. The Journal of Physiology, 160:106\u2013154, January 1962.\n\nKevin Jarrett, Koray Kavukcuoglu, Marc\u2019Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV\u201909). IEEE, 2009.\n\nHugo Larochelle, Yoshua Bengio, J\u00e9r\u00f4me Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1\u201340, 2009.\n\nY. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, November 1998.\n\nHossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In L\u00e9on Bottou and Michael Littman, editors, Proceedings of the 26th International Conference on Machine Learning, pages 737\u2013744, Montreal, June 2009. Omnipress.\n\nNoboru Murata, Shuji Yoshizawa, and Shun-ichi Amari. Network information criterion \u2013 determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5:865\u2013872, 1994.\n\nGenevieve B. Orr and Klaus-Robert M\u00fcller, editors. Neural Networks: Tricks of the Trade (an outgrowth of a 1996 NIPS workshop), volume 1524 of Lecture Notes in Computer Science. Springer, 1998.\n\nM. A. Ranzato, Fu J. Huang, Y. L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201907), pages 1\u20138, 2007.\n\nD. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533\u2013536, 1986.\n\nBernhard Sch\u00f6lkopf, Alexander Smola, and Klaus-Robert M\u00fcller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299\u20131319, 1998.\n\nJason Weston, Fr\u00e9d\u00e9ric Ratle, and Ronan Collobert. Deep learning via semi-supervised embedding. In ICML \u201908: Proceedings of the 25th International Conference on Machine Learning, pages 1168\u20131175, 2008.\n\nAlexander Zien, Gunnar R\u00e4tsch, Sebastian Mika, Bernhard Sch\u00f6lkopf, Thomas Lengauer, and Klaus-Robert M\u00fcller. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16(9):799\u2013807, 2000.\n", "award": [], "sourceid": 206, "authors": [{"given_name": "Gr\u00e9goire", "family_name": "Montavon", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Mikio", "family_name": "Braun", "institution": null}]}