{"title": "Supervised Dictionary Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1040, "abstract": "It is now well established that sparse signal models are well suited to restoration tasks and can effectively be learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary and multiple decision functions. It is shown that the linear variant of the model admits a simple probabilistic interpretation, and that its most general variant also admits a simple interpretation in terms of kernels. An optimization framework for learning all the components of the proposed model is presented, along with experiments on standard handwritten digit and texture classification tasks.", "full_text": "Supervised Dictionary Learning\n\nJulien Mairal\nINRIA-Willow project\njulien.mairal@inria.fr\n\nFrancis Bach\nINRIA-Willow project\nfrancis.bach@inria.fr\n\nJean Ponce\nEcole Normale Sup\u00e9rieure\njean.ponce@ens.fr\n\nGuillermo Sapiro\nUniversity of Minnesota\nguille@ece.umn.edu\n\nAndrew Zisserman\nUniversity of Oxford\naz@robots.ox.ac.uk\n\nAbstract\n\nIt is now well established that sparse signal models are well suited for restoration tasks and can be effectively learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary and discriminative class models. 
The linear version of the proposed model admits a simple probabilistic interpretation, while its most general variant admits an interpretation in terms of kernels. An optimization framework for learning all the components of the proposed model is presented, along with experimental results on standard handwritten digit and texture classification tasks.\n\n1 Introduction\n\nSparse and overcomplete image models were first introduced in [1] for modeling the spatial receptive fields of simple cells in the human visual system. The linear decomposition of a signal using a few atoms of a learned dictionary, instead of predefined ones such as wavelets, has recently led to state-of-the-art results for numerous low-level image processing tasks such as denoising [2], showing that sparse models are well adapted to natural images. Unlike principal component analysis decompositions, these models are in general overcomplete, with a number of basis elements greater than the dimension of the data. Recent research has shown that sparsity helps to capture higher-order correlations in data. In [3, 4], sparse decompositions are used with predefined dictionaries for face and signal recognition. In [5], dictionaries are learned for a reconstruction task, and the corresponding sparse models are used as features in an SVM. In [6], a discriminative method is introduced for various classification tasks, learning one dictionary per class; the classification process itself is based on the corresponding reconstruction error, and does not exploit the actual decomposition coefficients. In [7], a generative model for documents is learned at the same time as the parameters of a deep network structure. In [8], multi-task learning is performed by learning features, and tasks are selected using a sparsity criterion. 
The framework we present in this paper extends these approaches by simultaneously learning a single shared dictionary and models for different signal classes in a mixed generative and discriminative formulation (see also [9], where a different discriminative term is added to the classical reconstructive one). Similar joint generative/discriminative frameworks have started to appear in probabilistic approaches to learning, e.g., [10, 11, 12, 13, 14], and in neural networks [15], but not, to the best of our knowledge, in the sparse dictionary learning framework. Section 2 presents a formulation for learning a dictionary tuned for a classification task, which we call supervised dictionary learning, and Section 3 its interpretation in terms of probability and kernel frameworks. The optimization procedure is detailed in Section 4, and experimental results are presented in Section 5.\n\n2 Supervised dictionary learning\n\nWe present in this section the core of the proposed model. In classical sparse coding tasks, one considers a signal x in R^n and a fixed dictionary D = [d_1, . . . , d_k] in R^{n\u00d7k} (allowing k > n, making the dictionary overcomplete). In this setting, sparse coding with an \u21131 regularization (see footnote 1) amounts to computing\n\nR\u22c6(x, D) = min_{\u03b1 \u2208 R^k} ||x \u2212 D\u03b1||_2^2 + \u03bb_1 ||\u03b1||_1.    (1)\n\nIt is well known in the statistics, optimization, and compressed sensing communities that the \u21131 penalty yields a sparse solution, i.e., one with very few non-zero coefficients in \u03b1, although there is no explicit analytic link between the value of \u03bb_1 and the effective sparsity that this model yields. Other sparsity penalties using the \u21130 regularization (see footnote 2) can be used as well. 
Since it uses a proper norm, the \u21131 formulation of sparse coding is a convex problem, which makes the optimization tractable with algorithms such as those introduced in [16, 17], and has proven in practice to be more stable than its \u21130 counterpart, in the sense that the resulting decompositions are less sensitive to small perturbations of the input signal x. Note that sparse coding with an \u21130 penalty is an NP-hard problem and is often approximated using greedy algorithms.\n\nIn this paper, we consider a setting where the signal may belong to any of p different classes. We first consider the case of p = 2 classes and later discuss the multiclass extension. We consider a training set of m labeled signals (x_i)_{i=1}^{m} in R^n, associated with binary labels (y_i \u2208 {\u22121, +1})_{i=1}^{m}. Our goal is to learn jointly a single dictionary D adapted to the classification task and a function f which should be positive for any signal in class +1 and negative otherwise. We consider in this paper two different models to use the sparse code \u03b1 for the classification task:\n(i) linear in \u03b1: f(x, \u03b1, \u03b8) = w^T \u03b1 + b, where \u03b8 = {w \u2208 R^k, b \u2208 R} parametrizes the model.\n(ii) bilinear in x and \u03b1: f(x, \u03b1, \u03b8) = x^T W\u03b1 + b, where \u03b8 = {W \u2208 R^{n\u00d7k}, b \u2208 R}. In this case, the model is bilinear and f acts on both x and its sparse code \u03b1.\n\nThe number of parameters in (ii) is greater than in (i), which allows for richer models. 
Note that one can interpret W as a linear filter encoding the input signal x into a model for the coefficients \u03b1, which has a role similar to the encoder in [18] but for a discriminative task.\n\nA classical approach to obtain \u03b1 for (i) or (ii) is to first adapt D to the data, solving\n\nmin_{D,\u03b1} \u2211_{i=1}^{m} ||x_i \u2212 D\u03b1_i||_2^2 + \u03bb_1 ||\u03b1_i||_1.    (2)\n\nNote also that since the reconstruction errors ||x_i \u2212 D\u03b1_i||_2^2 are invariant to scaling simultaneously D by a scalar and \u03b1_i by its inverse, we need to constrain the \u21132 norm of the columns of D. Such a constraint is classical in sparse coding [2]. This reconstructive approach (dubbed REC in this paper) provides sparse codes \u03b1_i for each signal x_i, which can be used a posteriori in a regular classifier such as logistic regression, which would require solving\n\nmin_{\u03b8} \u2211_{i=1}^{m} C(y_i f(x_i, \u03b1_i, \u03b8)) + \u03bb_2 ||\u03b8||_2^2,    (3)\n\nwhere C is the logistic loss function (C(x) = log(1 + e^{\u2212x})), which enjoys properties similar to that of the hinge loss from the SVM literature, while being differentiable, and \u03bb_2 is a regularization parameter, which prevents overfitting. This is the approach chosen in [5] (with SVMs). However, our goal is to learn jointly D and the model parameters \u03b8. 
To that effect, we propose the formulation\n\nmin_{D,\u03b8,\u03b1} ( \u2211_{i=1}^{m} [ C(y_i f(x_i, \u03b1_i, \u03b8)) + \u03bb_0 ||x_i \u2212 D\u03b1_i||_2^2 + \u03bb_1 ||\u03b1_i||_1 ] ) + \u03bb_2 ||\u03b8||_2^2,    (4)\n\nwhere \u03bb_0 controls the importance of the reconstruction term, and the loss for a pair (x_i, y_i) is\n\nS\u22c6(x_i, D, \u03b8, y_i) = min_{\u03b1} S(\u03b1, x_i, D, \u03b8, y_i),    (5)\n\nwhere S(\u03b1, x_i, D, \u03b8, y_i) = C(y_i f(x_i, \u03b1, \u03b8)) + \u03bb_0 ||x_i \u2212 D\u03b1||_2^2 + \u03bb_1 ||\u03b1||_1.\n\nIn this setting, the classification procedure of a new signal x with an unknown label y, given a learned dictionary D and parameters \u03b8, involves supervised sparse coding:\n\nmin_{y \u2208 {\u22121, +1}} S\u22c6(x, D, \u03b8, y).    (6)\n\nThe learning procedure of Eq. (4) minimizes the sum of the costs for the pairs (x_i, y_i)_{i=1}^{m} and corresponds to a generative model. We will refer later to this model as SDL-G (supervised dictionary learning, generative). Note the explicit incorporation of a discriminative component into sparse coding, in addition to the classical reconstructive term (see [9] for a different classification component).\n\nFigure 1: Graphical model for the proposed generative/discriminative learning framework (node labels recovered from the figure: D, w, \u03b1_i, x_i, y_i, with i = 1, . . . , m).\n\nFootnote 1: The \u21131 norm of a vector x of size n is defined as ||x||_1 = \u2211_{i=1}^{n} |x[i]|.\nFootnote 2: The \u21130 pseudo-norm of a vector x is the number of nonzero coefficients of x. Note that it is not a norm.\n\nHowever, since the classification procedure from Eq. (6) compares the different costs S\u22c6(x, D, \u03b8, y) of a given signal for each class y = \u22121, +1, a more discriminative approach is to not only make the costs S\u22c6(x_i, D, \u03b8, y_i) small, as in (4), but also make the value of S\u22c6(x_i, D, \u03b8, \u2212y_i) greater than S\u22c6(x_i, D, \u03b8, y_i), which is the purpose of the logistic loss function C. This leads to:\n\nmin_{D,\u03b8} ( \u2211_{i=1}^{m} C( S\u22c6(x_i, D, \u03b8, \u2212y_i) \u2212 S\u22c6(x_i, D, \u03b8, y_i) ) ) + \u03bb_2 ||\u03b8||_2^2.    (7)\n\nAs detailed below, this problem is more difficult to solve than (4), and therefore we adopt instead a mixed formulation between the minimization of the generative Eq. (4) and its discriminative version (7) (see also [13]), that is,\n\nmin_{D,\u03b8} ( \u2211_{i=1}^{m} [ \u00b5 C( S\u22c6(x_i, D, \u03b8, \u2212y_i) \u2212 S\u22c6(x_i, D, \u03b8, y_i) ) + (1 \u2212 \u00b5) S\u22c6(x_i, D, \u03b8, y_i) ] ) + \u03bb_2 ||\u03b8||_2^2,    (8)\n\nwhere \u00b5 controls the trade-off between the reconstruction from Eq. (4) and the discrimination from Eq. (7). This is the proposed generative/discriminative model for sparse signal representation and classification from learned dictionary D and model \u03b8. We will refer to this mixed model as SDL-D (supervised dictionary learning, discriminative). Note also that, again, we constrain the norm of the columns of D to be less than or equal to one.\n\nAll of these formulations admit a straightforward multiclass extension, using softmax discriminative cost functions C_i(x_1, ..., x_p) = log(\u2211_{j=1}^{p} e^{x_j \u2212 x_i}), which are multiclass versions of the logistic function, and learning one model \u03b8_i per class. Other approaches such as one-vs-all or one-vs-one are of course possible, and the question of choosing the best approach among these possibilities is still open. 
Compared with earlier work using one dictionary per class [6], our model has the advantage of letting multiple classes share some features, and uses the coefficients \u03b1 of the sparse representations as part of the classification procedure, thereby following the works from [3, 4, 5], but with learned representations optimized for the classification task similar to [9, 10].\n\nBefore presenting the optimization procedure, we provide below two interpretations of the linear and bilinear versions of our formulation in terms of a probabilistic graphical model and a kernel.\n\n3 Interpreting the model\n\n3.1 A probabilistic interpretation of the linear model\n\nLet us first construct a graphical model which gives a probabilistic interpretation to the training and classification criteria given above when using a linear model with zero bias (no constant term) on the coefficients, that is, f(x, \u03b1, \u03b8) = w^T \u03b1. It consists of the following components (Figure 1):\n\n\u2022 The matrix D and the vector w are parameters of the problem, with a Gaussian prior on w, p(w) \u221d e^{\u2212\u03bb_2 ||w||_2^2}, and a constraint on the columns of D, that is, ||d_j||_2^2 = 1 for all j. All the d_j\u2019s are considered independent of each other.\n\u2022 The coefficients \u03b1_i are latent variables with a Laplace prior, p(\u03b1_i) \u221d e^{\u2212\u03bb_1 ||\u03b1_i||_1}.\n\u2022 The signals x_i are generated according to a Gaussian probability distribution conditioned on D and \u03b1_i, p(x_i | \u03b1_i, D) \u221d e^{\u2212\u03bb_0 ||x_i \u2212 D\u03b1_i||_2^2}. All the x_i\u2019s are considered independent from each other.\n\u2022 The labels y_i are generated according to a probability distribution conditioned on w and \u03b1_i, and given by p(y_i = \u03b5 | \u03b1_i, w) = e^{\u2212\u03b5 w^T \u03b1_i} / (e^{\u2212w^T \u03b1_i} + e^{w^T \u03b1_i}). 
Given D and w, all the triplets (\u03b1_i, x_i, y_i) are independent.\n\nWhat is commonly called \u201cgenerative training\u201d in the literature (e.g., [12, 13]) amounts to finding the maximum likelihood estimates for D and w according to the joint distribution p({x_i, y_i}_{i=1}^{m}, D, w), where the x_i\u2019s and the y_i\u2019s are the training signals and their labels respectively. It can easily be shown (details omitted due to space limitations) that there is an equivalence between this generative training and our formulation in Eq. (4) under MAP approximations (see footnote 3). Although joint generative modeling of x and y through a shared representation has shown great promise [10], we show in this paper that a more discriminative approach is desirable. \u201cDiscriminative training\u201d is slightly different and amounts to maximizing p({y_i}_{i=1}^{m}, D, w | {x_i}_{i=1}^{m}) with respect to D and w: Given some input data, one finds the best parameters that will predict the labels of the data. The same kind of MAP approximation relates this discriminative training formulation to the discriminative model of Eq. (7) (again, details omitted due to space limitations). The mixed approach from Eq. (8) is a classical trade-off between generative and discriminative (e.g., [12, 13]), where generative components are often added to discriminative frameworks to add robustness, e.g., to noise and occlusions (see examples of this for the model in [9]).\n\nFootnote 3: We are also investigating how to properly estimate D by marginalizing over \u03b1 instead of maximizing with respect to \u03b1.\n\n3.2 A kernel interpretation of the bilinear model\n\nOur bilinear model with f(x, \u03b1, \u03b8) = x^T W\u03b1 + b does not admit a straightforward probabilistic interpretation. On the other hand, it can easily be interpreted in terms of kernels: Given two signals x_1 and x_2, with coefficients \u03b1_1 and \u03b1_2, using the kernel K(x_1, x_2) = \u03b1_1^T \u03b1_2 x_1^T x_2 in a logistic regression classifier amounts to finding a decision function of the same form as f. It is a product of two linear kernels, one on the \u03b1\u2019s and one on the input signals x. Interestingly, Raina et al. [5] learn a dictionary adapted to reconstruction on a training set, then train an SVM a posteriori on the decomposition coefficients \u03b1. They derive and use a Fisher kernel, which can be written as K\u2032(x_1, x_2) = \u03b1_1^T \u03b1_2 r_1^T r_2 in this setting, where the r\u2019s are the residuals of the decompositions. In simple experiments, which are not reported in this paper, we have observed that the kernel K, where the signals x replace the residuals r, generally yields a level of performance similar to K\u2032 and often actually does better when the number of training samples is small or the data are noisy.\n\n4 Optimization procedure\n\nClassical dictionary learning techniques (e.g., [1, 5, 19]) address the problem of learning a reconstructive dictionary D in R^{n\u00d7k} well adapted to a training set, which is presented in Eq. (2). It can be seen as an optimization problem with respect to the dictionary D and the coefficients \u03b1. Although not jointly convex in (D, \u03b1), it is convex with respect to each unknown when the other one is fixed. This is why block coordinate descent on D and \u03b1 performs reasonably well [1, 5, 19], although not necessarily providing the global optimum. Training when \u00b5 = 0 (generative case), i.e., from Eq. (4), enjoys similar properties and can be addressed with the same optimization procedure. Equation (4) can be rewritten as:\n\nmin_{D,\u03b8,\u03b1} ( \u2211_{i=1}^{m} S(\u03b1_i, x_i, D, \u03b8, y_i) ) + \u03bb_2 ||\u03b8||_2^2,  s.t. \u2200 j = 1, . . . , k, ||d_j||_2 \u2264 1.    (9)\n\nBlock coordinate descent consists therefore of iterating between supervised sparse coding, where D and \u03b8 are fixed and one optimizes with respect to the \u03b1\u2019s, and supervised dictionary update, where the coefficients \u03b1_i are fixed, but D and \u03b8 are updated. Details on how to solve these two problems are given in Sections 4.1 and 4.2. The discriminative version SDL-D from Eq. (7) is more problematic. To reach a local minimum for this difficult non-convex optimization problem, we have chosen a continuation method, starting from the generative case and ending with the discriminative one as in [6]. The algorithm is presented in Figure 2, and details on the hyperparameter settings are given in Section 5.\n\nInput: n (signal dimensions); (x_i, y_i)_{i=1}^{m} (training signals); k (size of the dictionary); \u03bb_0, \u03bb_1, \u03bb_2 (parameters); 0 \u2264 \u00b5_1 \u2264 \u00b5_2 \u2264 . . . \u2264 \u00b5_m \u2264 1 (increasing sequence).\nOutput: D \u2208 R^{n\u00d7k} (dictionary); \u03b8 (parameters).\nInitialization: Set D to a random Gaussian matrix with normalized columns. Set \u03b8 to zero.\nLoop: For \u00b5 = \u00b5_1, . . . , \u00b5_m,\n  Loop: Repeat until convergence (or a fixed number of iterations),\n    \u2022 Supervised sparse coding: Solve, for all i = 1, . . . , m,\n      \u03b1\u22c6_{i,\u2212} = arg min_{\u03b1} S(\u03b1, x_i, D, \u03b8, \u22121),\n      \u03b1\u22c6_{i,+} = arg min_{\u03b1} S(\u03b1, x_i, D, \u03b8, +1).    (10)\n    \u2022 Dictionary and parameters update: Solve\n      min_{D,\u03b8} ( \u2211_{i=1}^{m} [ \u00b5 C( S(\u03b1\u22c6_{i,\u2212}, x_i, D, \u03b8, \u2212y_i) \u2212 S(\u03b1\u22c6_{i,+}, x_i, D, \u03b8, y_i) ) + (1 \u2212 \u00b5) S(\u03b1\u22c6_{i,y_i}, x_i, D, \u03b8, y_i) ] + \u03bb_2 ||\u03b8||_2^2 )  s.t. \u2200 j, ||d_j||_2 \u2264 1.    (11)\n\nFigure 2: SDL: Supervised dictionary learning algorithm.\n\n4.1 Supervised sparse coding\n\nThe supervised sparse coding problem from Eq. (6) (D and \u03b8 are fixed in this step) amounts to minimizing a convex function under an \u21131 penalty. The fixed-point continuation method (FPC) from [17] achieves good results in terms of convergence speed for this class of problems. For our specific problem, denoting by g the convex function to minimize, this method only requires \u2207g and a bound on the spectral norm of its Hessian H_g. Since we have chosen models f which are both linear in \u03b1, there exists, for each supervised sparse coding problem, a vector a in R^k and a scalar c in R such that\n\ng(\u03b1) = C(a^T \u03b1 + c) + \u03bb_0 ||x \u2212 D\u03b1||_2^2,\n\u2207g(\u03b1) = \u2207C(a^T \u03b1 + c) a \u2212 2\u03bb_0 D^T (x \u2212 D\u03b1),\n\nand it can be shown that, if ||U||_2 denotes the spectral norm of a matrix U (which is the magnitude of its largest eigenvalue), then we can obtain the following bound: ||H_g(\u03b1)||_2 \u2264 |H_C(a^T \u03b1 + c)| ||a||_2^2 + 2\u03bb_0 ||D^T D||_2.\n\n4.2 Dictionary update\n\nThe problem of updating D and \u03b8 in Eq. (11) is not convex in general (except when \u00b5 is close to 0), but a local minimum can be obtained using projected gradient descent (as in the general literature on dictionary learning, this local minimum has experimentally been found to be good enough in terms of classification performance). Denoting by E(D, \u03b8) the function we want to minimize in Eq. (11), we just need the partial derivatives of E with respect to D and the parameters \u03b8. 
When considering the linear model for the \u03b1\u2019s, f(x, \u03b1, \u03b8) = w^T \u03b1 + b, and \u03b8 = {w \u2208 R^k, b \u2208 R}, we obtain\n\n\u2202E/\u2202D = \u22122\u03bb_0 ( \u2211_{i=1}^{m} \u2211_{z \u2208 {\u22121,+1}} \u03c9_{i,z} (x_i \u2212 D\u03b1\u22c6_{i,z}) \u03b1\u22c6_{i,z}^T ),\n\u2202E/\u2202w = \u2211_{i=1}^{m} \u2211_{z \u2208 {\u22121,+1}} \u03c9_{i,z} z \u2207C(w^T \u03b1\u22c6_{i,z} + b) \u03b1\u22c6_{i,z},\n\u2202E/\u2202b = \u2211_{i=1}^{m} \u2211_{z \u2208 {\u22121,+1}} \u03c9_{i,z} z \u2207C(w^T \u03b1\u22c6_{i,z} + b),    (12)\n\nwhere \u03c9_{i,z} = \u2212\u00b5 z \u2207C( S(\u03b1\u22c6_{i,\u2212}, x_i, D, \u03b8, \u2212y_i) \u2212 S(\u03b1\u22c6_{i,+}, x_i, D, \u03b8, y_i) ) + (1 \u2212 \u00b5) 1_{z=y_i}.\n\nPartial derivatives when using our model with multiple classes or with the bilinear models f(x, \u03b1, \u03b8) = x^T W\u03b1 + b are not presented in this paper due to space limitations.\n\n5 Experimental validation\n\nWe compare in this section the reconstructive approach, dubbed REC, which consists of learning a reconstructive dictionary D as in [5] and then learning the parameters \u03b8 a posteriori; SDL with generative training (dubbed SDL-G); and SDL with discriminative learning (dubbed SDL-D). We also compare the performance of the linear (L) and bilinear (BL) models.\n\nDataset | REC L | SDL-G L | SDL-D L | REC BL | k-NN, \u21132 | SVM-Gauss\nMNIST | 4.33 | 3.56 | 1.05 | 3.41 | 5.0 | 1.4\nUSPS | 6.83 | 6.67 | 3.54 | 4.38 | 5.2 | 4.2\n\nTable 1: Error rates on the MNIST and USPS datasets in percent for the REC, SDL-G L and SDL-D L approaches, compared with k-nearest neighbor and SVM with a Gaussian kernel [20].\n\nBefore presenting experimental results, let us briefly discuss the choice of the five model parameters \u03bb_0, \u03bb_1, \u03bb_2, \u00b5 and k (size of the dictionary). 
Tuning all of them using cross-validation is cumbersome and unnecessary since some simple choices can be made, some of which can be made sequentially. We define first the sparsity parameter \u03ba = \u03bb_1/\u03bb_0, which dictates how sparse the decompositions are. When the input data points have unit \u21132 norm, choosing \u03ba = 0.15 was empirically found to be a good choice. For reconstructive tasks, a typical value often used in the literature (e.g., [19]) is k = 256 for m = 100 000 signals. Nevertheless, for discriminative tasks, increasing the number of parameters is likely to lead to overfitting, and smaller values like k = 64 or k = 32 are preferred. The scalar \u03bb_2 is a regularization parameter for preventing the model from overfitting the input data. As in logistic regression or support vector machines, this parameter is crucial when the number of training samples is small. Performing cross-validation with the fast method REC quickly provides a reasonable value for this parameter, which can be used afterward for SDL-G or SDL-D.\n\nOnce \u03ba, k and \u03bb_2 are chosen, let us see how to find \u03bb_0, which plays the important role of controlling the trade-off between reconstruction and discrimination. First, we perform cross-validation for a few iterations with \u00b5 = 0 to find a good value for SDL-G. Then, a scale factor making the costs S\u22c6 discriminative for \u00b5 > 0 can be chosen during the optimization process: Given a set of computed costs S\u22c6, one can compute a scale factor \u03b3\u22c6 such that \u03b3\u22c6 = arg min_{\u03b3} \u2211_{i=1}^{m} C( \u03b3 (S\u22c6(x_i, D, \u03b8, \u2212y_i) \u2212 S\u22c6(x_i, D, \u03b8, y_i)) ). 
We therefore propose the following strategy, which has proven to be effective in our experiments: Starting from small values for \u03bb_0 and a fixed \u03ba, we apply the algorithm in Figure 2, and after a supervised sparse coding step, we compute the best scale factor \u03b3\u22c6, and replace \u03bb_0 and \u03bb_1 by \u03b3\u22c6\u03bb_0 and \u03b3\u22c6\u03bb_1. Typically, applying this procedure during the first 10 iterations has proven to lead to reasonable values for these parameters. Since we are following a continuation path from \u00b5 = 0 to \u00b5 = 1, the optimal value of \u00b5 is found along the path by measuring the classification performance of the model on a validation set during the optimization.\n\n5.1 Digit recognition\n\nIn this section, we present experiments on the popular MNIST [20] and USPS handwritten digit datasets. MNIST is composed of 70 000 images of size 28 \u00d7 28, 60 000 for training and 10 000 for testing, each of them containing one handwritten digit. USPS is composed of 7291 training images and 2007 test images of size 16 \u00d7 16. As is often done in classification, we have chosen to learn pairwise binary classifiers, one for each pair of digits. Although our framework extends to a multiclass formulation, pairwise binary classifiers have resulted in slightly better performance in practice. Five-fold cross-validation is performed to find the best pair (k, \u03ba). The tested values for k are {24, 32, 48, 64, 96}, and for \u03ba, {0.13, 0.14, 0.15, 0.16, 0.17}. We keep the three best pairs of parameters and use them to train three sets of pairwise classifiers. For a given image x, the test procedure consists of selecting the class which receives the most votes from the pairwise classifiers. All the other parameters are obtained using the procedure explained above. Classification results are presented in Table 1 using the linear model. 
We see that for the linear model L, SDL-D L performs the best. REC BL offers a larger feature space and performs better than REC L, but we have observed no gain by using SDL-G BL or SDL-D BL instead of REC BL (these results are not reported in this table). Since the linear model is already performing very well, one side effect of using BL instead of L is to increase the number of free parameters and thus to cause overfitting. Note that our method is competitive since the best error rates published on these datasets (without any modification of the training set) are 0.60% [18] for MNIST and 2.4% [21] for USPS, using methods tailored to these tasks, whereas ours is generic and has not been tuned for the handwritten digit classification domain.\n\nThe purpose of our second experiment is not to measure the raw performance of our algorithm, but to answer the question \u201care the obtained dictionaries D discriminative per se?\u201d. To do so, we have trained on the USPS dataset 10 binary classifiers, one per digit in a one-vs-all fashion on the training set. For a given value of \u00b5, we obtain 10 dictionaries D and 10 sets of parameters \u03b8, learned by the SDL-D L model.\n\nTo evaluate the discriminative power of the dictionaries D, we discard the learned parameters \u03b8 and use the dictionaries as if they had been learned in a reconstructive REC model: For each dictionary, we decompose each image from the training set by solving the simple sparse reconstruction problem from Eq. (1) instead of using supervised sparse coding. This provides us with some coefficients \u03b1, which we use as features in a linear SVM. Repeating the sparse decomposition procedure on the test set permits us to evaluate the performance of these learned linear SVMs. We plot the average error rate of these classifiers in Figure 3 for each value of \u00b5. We see that using the dictionaries obtained with discriminative learning (\u00b5 > 0, SDL-D L) dramatically improves the performance of the basic linear classifier learned a posteriori on the \u03b1\u2019s, showing that our learned dictionaries are discriminative per se. Figure 3 also shows a dictionary adapted to the reconstruction of the MNIST dataset and a discriminative one, adapted to \u201c9 vs all\u201d.\n\n[Figure 3 panels: (a) REC, MNIST; (b) SDL-D, MNIST; plot of the average error rate (vertical axis, 0 to 2.5) against \u00b5 (horizontal axis, 0 to 1.0).]\n\nFigure 3: On the left, a reconstructive and a discriminative dictionary. On the right, average error rate in percent obtained by our dictionaries learned in a discriminative framework (SDL-D L) for various values of \u00b5, when used at test time in a reconstructive framework (REC-L).\n\nm | REC L | SDL-G L | SDL-D L | REC BL | SDL-G BL | SDL-D BL | Gain\n300 | 48.84 | 47.34 | 44.84 | 26.34 | 26.34 | 26.34 | 0%\n1 500 | 46.8 | 46.3 | 42 | 22.7 | 22.3 | 22.3 | 2%\n3 000 | 45.17 | 45.1 | 40.6 | 21.99 | 21.22 | 21.22 | 4%\n6 000 | 45.71 | 43.68 | 39.77 | 19.77 | 18.75 | 18.61 | 6%\n15 000 | 47.54 | 46.15 | 38.99 | 18.2 | 17.26 | 15.48 | 15%\n30 000 | 47.28 | 45.1 | 38.3 | 18.99 | 16.84 | 14.26 | 25%\n\nTable 2: Error rates for the texture classification task using various methods and sizes m of the training set. The last column indicates the gain between the error rate of REC BL and SDL-D BL.\n\n5.2 Texture classification\n\nIn the digit recognition task, our bilinear framework did not perform better than the linear one L. We believe that one of the main reasons is the simplicity of the task, where a linear model is rich enough. 
The purpose of our next experiment is to answer the question \u201cWhen is BL worth using?\u201d. We have chosen to consider two texture images from the Brodatz dataset, presented in Figure 4, and to build two classes, composed of 12 \u00d7 12 patches taken from these two textures. We have compared the classification performance of all our methods, including BL, for a dictionary of size k = 64 and \u03ba = 0.15. The training set was composed of patches from the left half of each texture and the test set of patches from the right half, so that there is no overlap between the training and test sets. Error rates are reported in Table 2 for varying sizes of the training set. This experiment shows that in some cases, the linear model performs very poorly where BL does better. Discrimination helps especially when the size of the training set is large. Note that we did not perform any cross-validation to optimize the parameters k and \u03ba for this experiment. Dictionaries obtained with REC and SDL-D BL are presented in Figure 4. Note that though they are visually quite similar, they lead to very different performances.\n\n6 Conclusion\n\nWe have introduced in this paper a discriminative approach to supervised dictionary learning that effectively exploits the corresponding sparse signal decompositions in image classification tasks, and have proposed an effective method for learning a shared dictionary and multiple (linear or bilinear) models. Future work will be devoted to adapting the proposed framework to shift-invariant models that are standard in image processing tasks, but not readily generalized to the sparse dictionary learning setting. We are also investigating extensions to unsupervised and semi-supervised learning and applications to natural image classification.\n\n[Figure 4 panels: (a) Texture 1; (b) Texture 2; (c) REC; (d) SDL-D BL.]\n\nFigure 4: Left: test textures. 
Right: reconstructive and discriminative dictionaries.\n\nAcknowledgments\n\nThis paper was supported in part by ANR under grant MGA. Guillermo Sapiro would like to thank\nFernando Rodriguez for insights into the learning of discriminatory sparsity patterns. His work is\npartially supported by NSF, NGA, ONR, ARO, and DARPA.\n\nReferences\n\n[1] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 1997.\n\n[2] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. IP, 54(12), 2006.\n\n[3] K. Huang and S. Aviyente. Sparse representation for signal classi\ufb01cation. In NIPS, 2006.\n\n[4] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. PAMI, 2008. To appear.\n\n[5] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In ICML, 2007.\n\n[6] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Learning discriminative dictionaries for local image analysis. In CVPR, 2008.\n\n[7] M. Ranzato and M. Szummer. Semi-supervised learning of compact document representations with deep networks. In ICML, 2008.\n\n[8] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2006.\n\n[9] F. Rodriguez and G. Sapiro. Sparse representations for image classi\ufb01cation: Learning discriminative and reconstructive non-parametric dictionaries. IMA Preprint 2213, 2007.\n\n[10] D. Blei and J. McAuliffe. Supervised topic models. In NIPS, 2007.\n\n[11] A. Holub and P. Perona. A discriminative framework for modeling object classes. In CVPR, 2005.\n\n[12] J.A. Lasserre, C.M. Bishop, and T.P. Minka. Principled hybrids of generative and discriminative models. In CVPR, 2006.\n\n[13] R. Raina, Y. Shen, A. Y. Ng, and A. McCallum. 
Classi\ufb01cation with hybrid generative/discriminative models. In NIPS, 2004.\n\n[14] R. R. Salakhutdinov and G. E. Hinton. Learning a non-linear embedding by preserving class neighbourhood structure. In AI and Statistics, 2007.\n\n[15] H. Larochelle and Y. Bengio. Classi\ufb01cation using discriminative restricted Boltzmann machines. In ICML, 2008.\n\n[16] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Stat., 32(2), 2004.\n\n[17] E. T. Hale, W. Yin, and Y. Zhang. A \ufb01xed-point continuation method for l1-regularized minimization with applications to compressed sensing. CAAM Tech Report TR07-07, 2007.\n\n[18] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Ef\ufb01cient learning of sparse representations with an energy-based model. In NIPS, 2006.\n\n[19] M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representations. IEEE Trans. SP, 54(11), 2006.\n\n[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11), 1998.\n\n[21] B. Haasdonk and D. Keysers. Tangent distance kernels for support vector machines. In ICPR, 2002.\n", "award": [], "sourceid": 775, "authors": [{"given_name": "Julien", "family_name": "Mairal", "institution": null}, {"given_name": "Jean", "family_name": "Ponce", "institution": null}, {"given_name": "Guillermo", "family_name": "Sapiro", "institution": null}, {"given_name": "Andrew", "family_name": "Zisserman", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}]}