{"title": "Differentiable Sparse Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 113, "page_last": 120, "abstract": null, "full_text": "Differentiable Sparse Coding\n\nDavid M. Bradley\nRobotics Institute\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\ndbradley@cs.cmu.edu\n\nJ. Andrew Bagnell\nRobotics Institute\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\ndbagnell@ri.cmu.edu\n\nAbstract\n\nPrior work has shown that features which appear to be biologically plausible as\nwell as empirically useful can be found by sparse coding with a prior such as\na laplacian (L1) that promotes sparsity. We show how smoother priors can pre-\nserve the bene\ufb01ts of these sparse priors while adding stability to the Maximum\nA-Posteriori (MAP) estimate that makes it more useful for prediction problems.\nAdditionally, we show how to calculate the derivative of the MAP estimate ef\ufb01-\nciently with implicit differentiation. One prior that can be differentiated this way\nis KL-regularization. We demonstrate its effectiveness on a wide variety of appli-\ncations, and \ufb01nd that online optimization of the parameters of the KL-regularized\nmodel can signi\ufb01cantly improve prediction performance.\n\n1 Introduction\n\nSparse approximation is a key technique developed in engineering and the sciences which approxi-\nmates an input signal, X, in terms of a \u201csparse\u201d combination of \ufb01xed bases B. Sparse approximation\nrelies on an optimization algorithm to infer the Maximum A-Posteriori (MAP) weights \u02c6W that best\nreconstruct the signal, given the model X \u2248 f(BW ). In this notation, each input signal forms\na column of an input matrix X, and is generated by multiplying a set of basis vectors B, and a\ncolumn from a coef\ufb01cient matrix W , while f(z) is an optional transfer function. This relationship\nis only approximate, as the input data is assumed to be corrupted by random noise. 
Priors which produce sparse solutions for W, especially L1 regularization, have gained attention because of their usefulness in ill-posed engineering problems [1], their ability to elucidate certain neuro-biological phenomena [2, 3], and their ability to identify useful features for classification from related unlabeled data [4].\n\nSparse coding [2] is closely connected to Independent Component Analysis as well as to certain approaches to matrix factorization. It extends sparse approximation by learning a basis matrix B which represents well a collection of related input signals (the input matrix X), in addition to performing optimization to compute the best set of weights Ŵ. Unfortunately, existing sparse coding algorithms that leverage an efficient, convex sparse approximation step to perform inference on the latent weight vector [4] are difficult to integrate into a larger learning architecture. It has been convincingly demonstrated that back-propagation is a crucial tool for tuning an existing generative model's output in order to improve supervised performance on a discriminative task. For example, greedy layer-wise strategies for building deep generative models rely upon a back-propagation step to achieve excellent model performance [5]. Unfortunately, existing sparse coding architectures produce a latent representation Ŵ that is an unstable, discontinuous function of the inputs and bases; an arbitrarily small change in input can lead to the selection of a completely different set of latent weights.\n\nWe present an advantageous new approach to coding that uses smoother priors which preserve the sparsity benefits of L1-regularization while allowing efficient convex inference and producing stable latent representations Ŵ. 
In particular we examine a prior based on minimizing KL-divergence to the uniform distribution, which has long been used for approximation problems [6, 7]. We show this increased stability leads to better semi-supervised classification performance across a wide variety of applications for classifiers using the latent representation Ŵ as input. Additionally, because of the smoothness of the KL-divergence prior, B can be optimized discriminatively for a particular application by gradient descent, leading to outstanding empirical performance.\n\n2 Notation\n\nUppercase letters, X, denote matrices and lowercase letters, x, denote vectors. For matrices, superscripts and subscripts denote rows and columns respectively: X_j is the jth column of X, X^i is the ith row of X, and X^i_j is the element in the ith row and jth column. Elements of vectors are indicated by subscripts, x_j, and superscripts on vectors are used for time indexing, x^t. X^T is the transpose of matrix X.\n\n3 Generative Model\n\nSparse coding fits a generative model (1) to unlabeled data, and the MAP estimates of the latent variables of this model have been shown to be useful as input for prediction problems [4]. Equation (1) divides the latent variables into two independent groups, the coefficients W and the basis B, which combine to form the matrix of input examples X. Different examples (columns of X) are assumed to be independent of each other. 
The Maximum A Posteriori (MAP) approximation replaces the integration over W and B in (1) with the maximum value of P(X|W, B)P(W)P(B), and the values of the latent variables at the maximum, Ŵ and B̂, are the MAP estimates. Finding Ŵ given B is an approximation problem; solving for Ŵ and B̂ simultaneously over a set of independent examples is a coding problem.\n\nP(X) = ∫_B ∫_W P(X|W, B) P(W) P(B) dW dB = ∫_B P(B) ∏_i [∫ P(X_i|W_i, B) P(W_i) dW_i] dB   (1)\n\nGiven B, the negative log of the generative model can be optimized independently for each example, and it is denoted for a generic example x by L in (2). L decomposes into the sum of two terms: a loss function D_L(x‖f(Bw)) between an input example and the reconstruction produced by the transfer function f, and a regularization function D_P(w‖p) that measures a distance between the coefficients for the example w and a parameter vector p. A regularization constant λ controls the relative weight of these two terms. For fixed B, minimizing (2) with respect to w separately for each example is equivalent to maximizing (1).\n\nL = D_L(x‖f(Bw)) + λ D_P(w‖p)   (2)\nŵ = argmin_w L   (3)\n\nIn many applications, the anticipated distribution of x after being corrupted by noise can be modeled by an exponential family distribution. 
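The objective (2)-(3) is easy to state in code. A sketch with squared loss as D_L and the regularizer left as a pluggable function; the helper names L and l1 are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(6, 3))
x = rng.normal(size=6)
lam = 0.1  # regularization constant lambda from eq (2)

def L(w, D_P, f=lambda z: z):
    # L = D_L(x || f(Bw)) + lambda * D_P(w), with squared loss as D_L
    return np.sum((x - f(B @ w)) ** 2) + lam * D_P(w)

l1 = lambda w: np.sum(np.abs(w))             # one possible regularizer, for illustration
print(L(np.zeros(3), l1) == np.sum(x ** 2))  # at w = 0 only the loss term remains
```

MAP inference (3) minimizes this function over w for each example separately.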
Every exponential family distribution defines a Bregman divergence which serves as a matching loss function for estimating the parameters of the distribution¹. One common choice for the loss/transfer functions is the squared loss function with its matching linear transfer function, D_L(x‖f(Bw)) = Σ_i (x_i − B^i w)², which is the matching Bregman divergence for x drawn from a multidimensional Gaussian distribution.\n\nThe regularization function D_P(w‖p) is also often a Bregman divergence, but may be chosen for other features such as the sparsity of the resulting MAP estimate ŵ. A vector is commonly called sparse if many elements are exactly zero. The entropy [9, 10] and L_p^p-norm², p ≤ 1, regularization functions [2, 3, 4] promote this form of sparsity, and all of them have shown the ability to learn bases containing interesting structure from unlabeled data. However, of these only L1 leads to an efficient, convex procedure for inference, and even this prior does not produce differentiable MAP estimates.\n\n¹The maximum likelihood parameter estimate for any regular exponential family distribution can be found by minimizing the corresponding Bregman divergence for that family, and every Bregman divergence has a matching transfer function which leads to a convex minimization problem [8]. That matching transfer function is the gradient ∇φ of the function φ which is associated with the Bregman divergence D_φ(x‖y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩.\n²L_p^p(x) = Σ_i |x_i|^p corresponds to the negative log of a generalized Gaussian prior.\n\nWe argue that if the latent weight vector ŵ is to be used as input to a classifier, a better definition of “sparsity” is that most elements in ŵ can be replaced by elements in a constant vector p without significantly increasing the loss. 
One regularization function that produces this form of pseudo-sparsity is the KL-divergence KL(w‖p). This regularization function has long been used for approximation problems in geophysics, crystallography, astronomy, and physics, where it is commonly referred to as Maximum Entropy on the Mean (MEM) [7], and has been shown in the online setting to compete with low L1-norm solutions in terms of regret [11, 12].\n\nL1 regularization provides sparse solutions because its Fenchel dual [13] is the max function, meaning only the most useful basis vectors participate in the reconstruction. A differentiable approximation to max_i x_i is a sum of exponentials, Σ_i e^{x_i}, whose dual is the KL-divergence (4). Regularization with KL has proven useful in online learning, where it is the implicit prior of the exponentiated gradient descent (EGD) algorithm. EGD has been shown to be “sparse” in the sense that it can select a few relevant features to use for a prediction task from many irrelevant ones.\n\nThe form of KL we use (4) is the full Bregman divergence of the negative entropy function³. Often KL is used to compute distances between probability distributions, and for this case the KL we use reduces to the standard form. For sparse coding, however, it is inconvenient to assume that ‖ŵ‖₁ = ‖p‖₁ = 1, so we use the full unnormalized KL instead.\n\nD_P(w‖p) = Σ_i [w_i log(w_i/p_i) − w_i + p_i]   (4)\n\nFor the prior vector p we use a uniform vector whose L1 magnitude equals the expected L1 magnitude of w. p has an effect analogous to the q parameter in Lq-norm regularization: p → 0 approximates L1 and p → ∞ approximates L2. 
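The unnormalized KL in (4) can be checked numerically: it is zero exactly at w = p, positive elsewhere, and requires no normalization of w or p. A small sketch (the helper name is assumed):

```python
import numpy as np

def kl_unnormalized(w, p):
    # D_P(w || p) = sum_i [ w_i log(w_i / p_i) - w_i + p_i ]   -- eq (4)
    w, p = np.asarray(w, float), np.asarray(p, float)
    return float(np.sum(w * np.log(w / p) - w + p))

p = np.full(4, 0.25)
print(kl_unnormalized(p, p))                          # 0.0 exactly at w = p
print(kl_unnormalized([1.0, 0.1, 0.1, 0.1], p) > 0)   # positive elsewhere
```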
Changing p affects the magnitude of the KL term, so λ in (2) must be adjusted to balance the loss term in the sparse coding objective function (small values of p require small values of λ).\n\nBelow we provide a) an efficient procedure for inferring ŵ in this model, b) an algorithm for iteratively updating the bases B, and c) a demonstration that this model leads to differentiable estimates of ŵ. We also provide the general form of the derivative for arbitrary Bregman losses.\n\n4 Implementation\n\nTo compute ŵ with KL-regularization, we perform the minimization in (3) using exponentiated gradient descent (EGD) with backtracking until convergence (5). EGD automatically enforces positivity constraints on the coefficient vector w, and is particularly efficient for optimization because it is the natural mirror descent rule for KL-regularization [12]. The gradient of the objective function (2) with respect to the coefficient for the jth basis vector, w_j, is given in (6) for matching loss/transfer function pairs.\n\nw_j^{t+1} = w_j^t e^{−α ∂L/∂w_j}   (5)\n\n∂L/∂w_j = (f(Bw) − x)^T B_j + λ log(w_j/p_j)   (6)\n\nThis iterative update is run until the maximum gradient element is less than a threshold, which is estimated by periodically running a random set of examples to the limits of machine precision, and selecting the largest gradient threshold that produces ŵ within ε of the exact solution. The α parameter is continuously updated to balance the number of successful steps and the number of backtracking steps⁴. 
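A minimal sketch of the EGD inference loop (5)-(6) for the squared-loss, linear-transfer case; it uses a fixed step size and a plain gradient-norm stopping rule, omitting the backtracking and threshold-estimation details described above:

```python
import numpy as np

def egd_infer(x, B, p, lam=0.1, alpha=0.01, tol=1e-6, max_iter=20000):
    # MAP inference of w by exponentiated gradient descent, eqs (5)-(6),
    # for the squared-loss / linear-transfer case.
    w = p.copy()  # start at the prior; the multiplicative update keeps w > 0
    for _ in range(max_iter):
        grad = (B @ w - x) @ B + lam * np.log(w / p)  # eq (6)
        if np.max(np.abs(grad)) < tol:                # gradient-threshold stop
            break
        w = w * np.exp(-alpha * grad)                 # eq (5)
    return w

rng = np.random.default_rng(2)
B = rng.normal(size=(8, 4))
p = np.full(4, 0.5)
x = B @ np.array([1.0, 0.2, 0.2, 0.2])  # a signal actually built from the basis
w_hat = egd_infer(x, B, p)
print(np.round(w_hat, 3))
```

In practice the adaptive α with backtracking matters for speed, but not for the fixed point reached.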
Because L1-regularization produces both positive and negative weights, to compare L1 and KL regularization on the same basis we expand the basis used for KL by adding the negation of each basis vector, which is equivalent to allowing negative weights (see Appendix B).\n\nDuring sparse coding the basis matrix B is updated by Stochastic Gradient Descent (SGD), giving the update rule B^{t+1} = B^t − η ∂L/∂B. This update equation does not depend on the prior chosen for w and is given in (7) for matching loss/transfer function pairs. SGD implements an implicit L2 regularizer and is suitable for online learning; however, because the magnitude of w is explicitly penalized, the columns of B were constrained to have unit L2 norm to prevent the trivial solution of infinitely large B and infinitely small w. The step size was adjusted for the magnitude of ŵ in each application, and was then decayed over time as η ∝ 1/√t. The same SGD procedure was also used to optimize B through backpropagation, as explained in the next section.\n\n∂L/∂B^i_j = w_j (f(B^i w) − x_i)   (7)\n\n³−H(x) = x log(x)\n⁴In our experiments, if the ratio of backtracking steps to total steps was more than 0.6, α was decreased by 10%. Similarly, α was increased by 10% if the ratio fell below 0.3.\n\n5 Modifying a Generative Model For A Discriminative Task\n\nSparse coding builds a generative model from unlabeled data that captures structure in that data by learning a basis B. Our hope is that the MAP estimate of basis coefficients ŵ produced for each input vector x will be useful for predicting a response y associated with x. However, the sparse coding objective function only cares about reconstructing the input well, and does not attempt to make ŵ useful as input for any particular task. 
Fortunately, since priors such as KL-divergence regularization produce solutions that are smooth with respect to small changes in B and x, B can be modified through back-propagation to make ŵ more useful for prediction.\n\nThe key to computing the derivatives required for backpropagation is noting that the gradient with respect to w of the optimization (3) at its minimum ŵ can be written as a set of fixed point equations where the gradient of the loss term equals the gradient of the regularization:\n\n∇D_P(ŵ‖p) = −(1/λ) ∇D_L(x‖f(Bŵ)).   (8)\n\nThen if the regularization function is twice differentiable with respect to w, we can use implicit differentiation on (8) to compute the gradient of ŵ with respect to B and x [14]. For KL-regularization and the simple case of a linear transfer function with squared loss, ∂ŵ/∂B is given in (9), where e⃗_i is a unit vector whose ith element is 1. A general derivation for matched loss/transfer function pairs as defined before is provided in Appendix C. Note that the ability to compute ∂ŵ/∂x means that multiple layers of sparse coding could be used.\n\n∂ŵ/∂B^k_i = −(B^T B + diag(λ/ŵ))^{−1} ((B^k ŵ_i)^T + e⃗_i (f(B^k ŵ) − x_k))   (9)\n\n6 Experiments\n\nWe verify the performance of KL-sparse coding on several benchmark tasks including the MNIST handwritten digit recognition data-set, handwritten lowercase English character classification, movie review sentiment regression, and music genre classification (Appendix E). 
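The implicit-differentiation idea of Section 5 can be checked numerically for ∂ŵ/∂x: differentiating the fixed-point condition B^T(Bŵ − x) + λ log(ŵ/p) = 0 gives ∂ŵ/∂x = (B^T B + diag(λ/ŵ))^{−1} B^T, which a finite difference should confirm. A sketch; the Newton inner solver is our stand-in for illustration, not the paper's EGD implementation:

```python
import numpy as np

def solve_w(x, B, p, lam):
    # Newton iterations on the fixed-point condition B^T (Bw - x) + lam*log(w/p) = 0
    w = p.copy()
    for _ in range(100):
        g = B.T @ (B @ w - x) + lam * np.log(w / p)
        H = B.T @ B + np.diag(lam / w)
        w = np.clip(w - np.linalg.solve(H, g), 1e-10, None)  # keep w positive
    return w

rng = np.random.default_rng(3)
B = rng.normal(size=(6, 3))
p = np.full(3, 0.5)
lam = 0.1
x = B @ np.array([0.8, 0.3, 0.5]) + 0.01 * rng.normal(size=6)

w_hat = solve_w(x, B, p, lam)
# implicit gradient: dw/dx = (B^T B + diag(lam / w))^{-1} B^T
J = np.linalg.solve(B.T @ B + np.diag(lam / w_hat), B.T)
# finite-difference check of the column for the first input coordinate
eps = 1e-6
dx = np.zeros(6)
dx[0] = eps
J_fd = (solve_w(x + dx, B, p, lam) - w_hat) / eps
print(np.allclose(J[:, 0], J_fd, atol=1e-3))
```

The same pattern, with the chain rule, yields the ∂ŵ/∂B term in (9) used for back-propagation.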
In each application, the ŵ produced using KL-regularization were more useful for prediction than those produced with L1 regularization, due to the stability and differentiability provided by KL.\n\n6.1 Sparsity\n\nKL-regularization retained the desirable pseudo-sparsity characteristics of L1, namely that each example, x, produces only a few large elements in ŵ. Figure 1 compares the mean sorted and normalized coefficient distribution over the 10,000 digit MNIST test set for KL-divergence and several L_p^p regularization functions, and shows that although the KL regularization function is not sparse in the traditional sense of setting many elements of ŵ to zero, it is sparse in the sense that ŵ contains only a few large elements in each example, lending support to the idea that this sense of sparsity is more important for classification.\n\n6.2 Stability\n\nBecause the gradient of the KL-divergence regularization function goes to ∞ with increasing w, it produces MAP estimates ŵ that change smoothly with x and B (see Appendix A for more details).\n\nFigure 1: Left: Mean coefficient distribution over the 10,000 digit MNIST test set for various regularization functions. Each example ŵ was sorted by magnitude and normalized by ‖ŵ‖∞ before computing the mean over all examples. Right: test set classification performance. Regularization functions that produced few large values in each example (such as KL and L1) performed the best. Forcing small coefficients to be exactly 0 was not necessary for good performance. 
Note the log scale on the horizontal axis.\n\nRegularization | Gaussian Noise (std 0.01) | Gaussian Noise (std 0.1) | Translation (0.1 pixels) | Translation (1 pixel)\nL1 | 0.0283±0.0069 | 0.285±0.056 | 0.138±0.026 | 1.211±0.213\nKL | 0.0172±0.0016 | 0.164±0.015 | 0.070±0.011 | 0.671±0.080\n\nTable 1: The 10,000 images of handwritten digits in the MNIST test set were used to show the stability benefits of KL-regularization. Shown is the distance (in L1) between the representation ŵ for x and the representation after adding noise, divided by ‖ŵ‖₁. KL-regularization provides representations that are significantly more stable with respect to both uncorrelated additive Gaussian noise (left) and correlated noise from translating the digit image in a random direction (right).\n\nTable 1 quantifies how KL regularization significantly reduces the effect on ŵ of adding noise to the input x. This stability improves the usefulness of ŵ for prediction. Figure 2 shows the most-discriminative 2-D subspace (as calculated by Multiple Discriminant Analysis [15]) for the input space, the L1 and KL coefficient spaces, and the KL coefficient space after it has been specialized by back-propagation. The L1 coefficients tame the disorder of the input space so that clusters for each class are apparent, although noisy and overlapping. The switch to KL regularization makes these clusters more distinct, and applying back-propagation further separates the clusters.\n\nFigure 2: Shown is the distribution of the eight most confusable digit classes in the input space and in the coefficient spaces produced by sparse approximation. Multiple Discriminant Analysis was used to compute the most discriminative 2-D projection of each space. The PCA-whitened input space (left) contains a lot of overlap between the classes. 
L1 regularization (center) discovers structure in the unlabeled data, but still produces more overlap between classes than KL sparse approximation (right) does with the same basis trained with L1 sparse coding. Figure best seen in color.\n\n6.3 Improved Prediction Performance\n\nOn all applications, the stability provided by KL-regularization improved performance over L1, and back-propagation further improved performance when the training set had residual error after an output classifier was trained.\n\n6.3.1 Handwritten Digit Classification\n\nWe tested our algorithm on the benchmark MNIST handwritten digits dataset [16]. 10,000 of the 60,000 training examples were reserved for validation, and classification performance was evaluated on the separate 10,000 example test set. Each example was first reduced to 180D from 784D by PCA, and then sparse coding was performed using a linear transfer function and squared loss⁵. The validation set was used to pick the regularization constant, λ, and the prior mean for KL, p.\n\nMaxent classifiers⁶ [17] were then learned on randomly sampled subsets of the training set of various sizes. Switching from L1-regularized to KL-regularized sparse approximation improved performance in all cases (Table 2). When trained on all 50,000 training examples, the test set classification error of KL coefficients, 2.21%, was 37% lower than the 3.53% error rate obtained on the L1-regularized coefficients. As shown in Table 3, this increase in performance was consistent across a diverse set of classification algorithms. 
After running back-propagation with the KL-prior, the test set error was reduced to 1.30%, which improves on the best results reported⁷ for other shallow-architecture permutation-invariant classifiers operating on the same data set without prior knowledge about the problem⁸ (see Table 4).\n\nTraining Set Size | 1000 | 2000 | 10000 | 20000 | 50000\nL1 (Test Set) | 7.72% | 6.63% | 4.74% | 4.16% | 3.53%\nKL (Test Set) | 5.87% | 5.06% | 3.00% | 2.51% | 2.21%\nKL After Backprop (Test Set) | 5.66% | 4.46% | 2.31% | 1.78% | 1.30%\nImprovement from Backprop | 3.6% | 11.9% | 23.0% | 29.1% | 43.0%\nKL (Training Set) | 0.00% | 0.05% | 1.01% | 1.50% | 1.65%\n\nTable 2: The ability to optimize the generative model with back-propagation leads to significant performance increases when the training set is not separable by the model learned on the unlabeled data. Shown is the misclassification rate on the MNIST digit classification task. Larger training sets with higher residual error benefit more from back-propagation.\n\nClassifier | PCA | L1 | KL | KL+backprop\nMaxent | 7.49% | 3.53% | 2.21% | 1.30%\n2-layer NN | 2.23% | 2.13% | 1.40% | 1.36%\nSVM (Linear) | 5.55% | 3.95% | 2.16% | 1.34%\nSVM (RBF) | 1.54% | 1.94% | 1.28% | 1.31%\n\nTable 3: The stability afforded by the KL-prior improves the performance of all classifier types over the L1 prior. 
In addition, back-propagation allows linear classifiers to do as well as more complicated non-linear classifiers.\n\nAlgorithm | L1 | KL | KL+backprop | SVM | 2-layer NN [18] | 3-layer NN\nTest Set Error | 3.53% | 2.21% | 1.30% | 1.4% | 1.6% | 1.53%\n\nTable 4: Test set error of various classifiers on the MNIST handwritten digits database.\n\n⁵This methodology was chosen to match [4].\n⁶Also known as multi-class logistic regression.\n⁷An extensive comparison of classification algorithms for this dataset can be found on the MNIST website, http://yann.lecun.com/exdb/mnist/\n⁸Better results have been reported when more prior knowledge about the digit recognition problem is provided to the classifier, either through specialized preprocessing, or by giving the classifier a model of how digits are likely to be distorted by expanding the data set with random affine and elastic distortions of the training examples or training with vicinal risk minimization. Convolutional Neural Networks produce the best results on this problem, but they are not invariant to permutations in the input since they contain a strong prior about how pixels are connected.\n\n6.3.2 Transfer to Handwritten Character Classification\n\nIn [4], a basis learned by L1-regularized sparse coding on handwritten digits was shown to improve classification performance when used for the related problem of handwritten character recognition with small training data sets (< 5000 examples). The handwritten English characters dataset⁹ they used consists of 16x8 pixel images of lowercase letters. In keeping with their work, we padded and scaled the images to match the 28x28 pixel size of the MNIST data, projected onto the same PCA basis that was used for the MNIST digits, and learned a basis from the MNIST digits by L1-regularized sparse coding. 
This basis was then used for sparse approximation of the English characters, along with a linear transfer function and squared loss. In this application as well, Table 5 shows that simply switching from an L1 to a KL prior for sparse approximation significantly improves the performance of a maxent classifier. Furthermore, the KL prior allows online improvement of the sparse coding basis as more labeled data for the character-recognition task becomes available. This improvement increases with the size of the training set, as more information becomes available about the target character recognition task.\n\nTraining Set Size | Raw | PCA | L1 | KL | KL+backprop\n100 | 46.9 | 44.3 | 44.0 | 49.4 | 50.7\n500 | 61.2 | 60.4 | 63.7 | 69.2 | 69.9\n1000 | 66.7 | 66.3 | 69.5 | 75.0 | 76.4\n5000 | 76.0 | 75.1 | 78.9 | 82.5 | 84.2\n20000 | 79.7 | 79.3 | 83.3 | 86.0 | 89.1\n\nTable 5: Classification accuracy on the 26-way English character classification task.\n\n6.3.3 Comparison to sLDA: Movie Review Sentiment Regression\n\nKL-regularized sparse coding bears some similarities to the supervised LDA (sLDA) model introduced in [19], and we provide results for the movie review sentiment classification task [20] used in that work. To match [19] we use vectors of normalized counts for the 5000 words with the highest tf-idf score among the 5006 movie reviews in the data set, use 5-fold cross validation, compute predictions with linear regression on ŵ, and report our performance in terms of predictive R² (the fraction of variability in the out-of-fold response values which is captured by the out-of-fold predictions ŷ: pR² := 1 − (Σ(y − ŷ)²)/(Σ(y − ȳ)²)). Since the input is a probability distribution, we use a normalized exponential transfer function, f(B, w) = e^{Bw}/‖e^{Bw}‖₁, to compute the reconstruction of the input. 
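Both quantities just defined are short to express in code; a sketch with toy numbers (not the paper's data):

```python
import numpy as np

def softmax_reconstruction(B, w):
    # normalized exponential transfer: f(B, w) = e^{Bw} / ||e^{Bw}||_1
    z = np.exp(B @ w - np.max(B @ w))  # max-shift for numerical stability
    return z / np.sum(z)

def predictive_r2(y, y_hat):
    # pR^2 = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

rng = np.random.default_rng(4)
B = rng.normal(size=(5, 3))
f = softmax_reconstruction(B, np.array([0.5, 0.1, 0.4]))
print(np.isclose(np.sum(f), 1.0))                  # reconstruction is a distribution
print(predictive_r2([1, 2, 3, 4], [1, 2, 3, 4]))   # 1.0 for perfect predictions
```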
For sparse coding we use KL-divergence for both the loss and the regularization functions, as minimizing the KL-divergence between the empirical probability distribution of the document given by each input vector x and f(B, w) is equivalent to maximizing the “constrained Poisson distribution” used to model documents in [21] (details given in Appendix D). Table 6 shows that the sparse coding generative model we use is competitive with, and perhaps slightly better than, LDA. After back-propagation, its performance is superior to the supervised version of LDA, sLDA¹⁰.\n\nAlgorithm | predictive R²\nLDA [19] | 0.263\n64D unsupervised KL sparse coding | 0.264\n256D unsupervised KL sparse coding | 0.281\nL1-regularized regression [19] | 0.457\nsLDA [19] | 0.500\nL2-regularized regression | 0.507\n256D KL-regularized coding with backprop | 0.534\n\nTable 6: Movie review sentiment prediction task. KL-regularized sparse coding compares favorably with LDA and sLDA.\n\n⁹Available at http://ai.stanford.edu/~btaskar/ocr/\n¹⁰Given that the word counts used as input are very sparse to begin with, classifiers whose regret bounds depend on the L2 norm of the gradient of the input (such as L2-regularized least squares) do quite well, achieving a predictive R² value on this application of 0.507.\n\n7 Conclusion\n\nThis paper demonstrates on a diverse set of applications the advantages of using a differentiable, smooth prior for sparse coding. In particular, a KL-divergence regularization function has significant advantages over other sparse priors such as L1 because it retains the important aspects of sparsity, while adding stability and differentiability to the MAP estimate ŵ. Differentiability in particular is shown to lead to state-of-the-art performance by allowing the generative model learned from unlabeled data by sparse coding to be adapted to a supervised loss function.\n\nAcknowledgments\n\nDavid M. 
Bradley is supported by an NDSEG fellowship provided by the Army Research Office. The authors would also like to thank David Blei, Rajat Raina, and Honglak Lee for their help.\n\nReferences\n\n[1] J. A. Tropp, “Algorithms for simultaneous sparse approximation. Part II: Convex relaxation,” Signal Process., vol. 86, no. 3, pp. 589–602, 2006.\n[2] B. Olshausen and D. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?” Vision Research, 1997.\n[3] Y. Karklin and M. S. Lewicki, “A hierarchical Bayesian model for learning non-linear statistical regularities in non-stationary natural signals,” Neural Computation, vol. 17, no. 2, pp. 397–423, 2005.\n[4] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learning from unlabeled data,” in ICML '07: Proceedings of the 24th International Conference on Machine Learning, 2007.\n[5] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. Cambridge, MA: MIT Press, 2007, pp. 153–160.\n[6] E. Rietsch, “The maximum entropy approach to inverse problems,” Journal of Geophysics, vol. 42, pp. 489–506, 1977.\n[7] G. Besnerais, J. Bercher, and G. Demoment, “A new look at entropy for solving linear inverse problems,” IEEE Trans. on Information Theory, vol. 45, no. 5, pp. 1565–1578, July 1999.\n[8] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005.\n[9] M. Brand, “Pattern discovery via entropy minimization,” in AISTATS 99, 1999.\n[10] M. Shashanka, B. Raj, and P. 
Smaragdis, “Sparse overcomplete latent variable decomposition of counts data,” in NIPS, 2007.\n[11] J. Kivinen and M. Warmuth, “Exponentiated gradient versus gradient descent for linear predictors,” Information and Computation, pp. 1–63, 1997.\n[12] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.\n[13] R. Rifkin and R. Lippert, “Value regularization and Fenchel duality,” Journal of Machine Learning Research, vol. 8, pp. 441–479, 2007.\n[14] D. Widder, Advanced Calculus, 2nd ed. Dover Publications, 1989.\n[15] R. Duda, P. Hart, and D. Stork, Pattern Classification. Wiley, New York, 2001.\n[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.\n[17] K. Nigam, J. Lafferty, and A. McCallum, “Using maximum entropy for text classification,” 1999. [Online]. Available: citeseer.ist.psu.edu/article/nigam99using.html\n[18] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in ICDAR '03: Proceedings of the Seventh International Conference on Document Analysis and Recognition. Washington, DC, USA: IEEE Computer Society, 2003, p. 958.\n[19] D. M. Blei and J. D. McAuliffe, “Supervised topic models,” in NIPS 19, 2007.\n[20] B. Pang and L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales,” in Proceedings of the ACL, 2005, pp. 115–124.\n[21] R. Salakhutdinov and G. 
Hinton, “Semantic hashing,” in SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.\n", "award": [], "sourceid": 3538, "authors": [{"given_name": "J.", "family_name": "Bagnell", "institution": null}, {"given_name": "David", "family_name": "Bradley", "institution": null}]}