{"title": "Dimensionality Reduction Using the Sparse Linear Model", "book": "Advances in Neural Information Processing Systems", "page_first": 271, "page_last": 279, "abstract": "We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals.", "full_text": "Dimensionality Reduction\n\nUsing the Sparse Linear Model\n\nIoannis Gkioulekas\n\nHarvard SEAS\n\nCambridge, MA 02138\n\nTodd Zickler\nHarvard SEAS\n\nCambridge, MA 02138\n\nigkiou@seas.harvard.edu\n\nzickler@seas.harvard.edu\n\nAbstract\n\nWe propose an approach for linear unsupervised dimensionality reduction, based\non the sparse linear model that has been used to probabilistically interpret sparse\ncoding. We formulate an optimization problem for learning a linear projection\nfrom the original signal domain to a lower-dimensional one in a way that approxi-\nmately preserves, in expectation, pairwise inner products in the sparse domain. We\nderive solutions to the problem, present nonlinear extensions, and discuss relations\nto compressed sensing. 
Our experiments using facial images, texture patches, and\nimages of object categories suggest that the approach can improve our ability to\nrecover meaningful structure in many classes of signals.\n\nIntroduction\n\n1\nDimensionality reduction methods are important for data analysis and processing, with their use\nmotivated mainly from two considerations: (1) the impracticality of working with high-dimensional\nspaces along with the deterioration of performance due to the curse of dimensionality; and (2) the\nrealization that many classes of signals reside on manifolds of much lower dimension than that\nof their ambient space. Linear methods in particular are a useful sub-class, for both the reasons\nmentioned above, and their potential utility in resource-constrained applications like low-power\nsensing [1, 2]. Principal component analysis (PCA) [3], locality preserving projections (LPP) [4],\nand neighborhood preserving embedding (NPE) [5] are some common approaches. They seek to\nreveal underlying structure using the global geometry, local distances, and local linear structure,\nrespectively, of the signals in their original domain; and have been extended in many ways [6\u20138].1\nOn the other hand, it is commonly observed that geometric relations between signals in their origi-\nnal domain are only weakly linked to useful underlying structure. To deal with this, various feature\ntransforms have been proposed to map signals to different (typically higher-dimensional) domains,\nwith the hope that geometric relations in these alternative domains will reveal additional structure,\nfor example by distinguishing image variations due to changes in pose, illumination, object class,\nand so on. 
These ideas have been incorporated into methods for dimensionality reduction by first mapping the input signals to an alternative (higher-dimensional) domain and then performing dimensionality reduction there, for example by treating signals as tensors instead of vectors [9, 10] or by using kernels [11]. In the latter case, however, it can be difficult to design a kernel that is beneficial for a particular signal class, and ad hoc selections are not always appropriate.\n\nIn this paper, we also address dimensionality reduction through an intermediate higher-dimensional space: we consider the case in which input signals are samples from an underlying dictionary model. This generative model naturally suggests using the hidden covariate vectors as intermediate features, and learning a linear projection (of the original domain) that approximately preserves the Euclidean geometry of these vectors. Throughout the paper, we emphasize a particular instance of this model that is related to sparse coding, motivated by studies suggesting that data-adaptive sparse representations are appropriate for signals such as natural images and facial images [12, 13], and enable state-of-the-art performance for denoising, deblurring, and classification tasks [14-19].\n\n1Other linear methods, most notably linear discriminant analysis (LDA), exploit class labels to learn projections. In this paper, we focus on the unsupervised setting.\n\nFormally, we assume our input signals to be well-represented by a sparse linear model [20], previously used for probabilistic sparse coding. Based on this generative model, we formulate the learning of a linear projection as an optimization problem whose objective is to preserve, in expectation, the pairwise inner products between sparse codes, without having to explicitly obtain the sparse representation of each new sample.
We study the solutions of this optimization problem, and we discuss how they are related to techniques proposed for compressed sensing. We also discuss the applicability of our results to general dictionary models, as well as nonlinear extensions. Finally, by applying our method to the visualization, clustering, and classification of facial images, texture patches, and general images, we show experimentally that it improves our ability to uncover useful structure. Omitted proofs and additional results can be found in the accompanying supplementary material.\n\n2 The sparse linear model\nWe use $\mathbb{R}^N$ to denote the ambient space of the input signals, and assume that each signal $x \in \mathbb{R}^N$ is generated as the sum of a noise term $\varepsilon \in \mathbb{R}^N$ and a linear combination of the columns, or atoms, of an $N \times K$ dictionary matrix $D = [d_1, \ldots, d_K]$, with the coefficients arranged as a vector $a \in \mathbb{R}^K$,\n\n$x = Da + \varepsilon. \quad (1)$\n\nWe assume the noise to be white Gaussian, $\varepsilon \sim \mathcal{N}(0_{N \times 1}, \sigma^2 I_{N \times N})$. We are interested in the sparse linear model [20], according to which the elements of $a$ are a priori independent of $\varepsilon$, and are independently and identically drawn from a Laplace distribution,\n\n$p(a) = \prod_{i=1}^{K} p(a_i), \qquad p(a_i) = \frac{1}{2\tau} \exp\left\{ -\frac{|a_i|}{\tau} \right\}. \quad (2)$\n\nIn the context of this model, $D$ is usually overcomplete ($K > N$), and in practice it is often learned in an unsupervised manner from training data.
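As a concrete illustration of the generative model (1)-(2), the sketch below draws signals $x = Da + \varepsilon$ with i.i.d. Laplace codes and white Gaussian noise. This is illustrative NumPy code rather than part of the paper; the function name and the random toy dictionary are our own assumptions.

```python
import numpy as np

def sample_sparse_linear_model(D, tau, sigma, n_samples, rng=None):
    """Draw signals x = D a + eps with a_i ~ Laplace(0, tau) and
    eps ~ N(0, sigma^2 I), per the sparse linear model (1)-(2)."""
    rng = np.random.default_rng(rng)
    N, K = D.shape
    A = rng.laplace(loc=0.0, scale=tau, size=(K, n_samples))  # codes, one column per sample
    E = rng.normal(scale=sigma, size=(N, n_samples))          # white Gaussian noise
    X = D @ A + E                                             # observed signals
    return X, A

# Toy usage with a random overcomplete dictionary (K > N).
rng = np.random.default_rng(0)
D = rng.normal(size=(16, 32))
X, A = sample_sparse_linear_model(D, tau=0.1, sigma=0.01, n_samples=5, rng=1)
print(X.shape, A.shape)  # (16, 5) (32, 5)
```

In practice $D$ would be learned from data rather than drawn at random, as the text notes.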
Several efficient algorithms exist for dictionary learning [21-23], and we assume in our analysis that a dictionary $D$ adapted to the signals of interest is given.\n\nOur adoption of the sparse linear model is motivated by significant empirical evidence that it is accurate for certain signals of interest, such as natural and facial images [12, 13], as well as by the fact that it enables high performance in such diverse tasks as denoising and inpainting [14, 24], deblurring [15], and classification and clustering [13, 16-19]. Typically, the model (1) with an appropriate dictionary $D$ is employed as a means for feature extraction, in which input signals $x$ in $\mathbb{R}^N$ are mapped to higher-dimensional feature vectors $a \in \mathbb{R}^K$. When inferring the features $a$ (termed sparse codes) through maximum-a-posteriori (MAP) estimation, they are solutions to\n\n$\min_{a} \; \frac{1}{\sigma^2} \|x - Da\|_2^2 + \frac{1}{\tau} \|a\|_1. \quad (3)$\n\nThis problem, known as the lasso [25], is a convex relaxation of the more general problem of sparse coding [26] (in the rest of the paper we use both terms interchangeably). A number of efficient algorithms for computing $a$ exist, with both MAP [21, 27] and fully Bayesian [20] procedures.\n\n3 Preserving inner products\nLinear dimensionality reduction from $\mathbb{R}^N$ to $\mathbb{R}^M$, $M < N$, is completely specified by a projection matrix $L$ that maps each $x \in \mathbb{R}^N$ to $y = Lx$, $y \in \mathbb{R}^M$, and different algorithms for linear dimensionality reduction correspond to different methods for finding this matrix. Typically, we are interested in projections that reveal useful structure in a given set of input signals.\n\nAs mentioned in the introduction, structure is often better revealed in a higher-dimensional space of features, say $a \in \mathbb{R}^K$. When a suitable feature transform can be found, this structure may exist as simple Euclidean geometry and be encoded in the pairwise Euclidean distances or inner products between feature vectors.
This is used, for example, in support vector machines and nearest-neighbor classifiers based on Euclidean distance, as well as in k-means and spectral clustering based on pairwise inner products. For the problem of dimensionality reduction, this motivates learning a projection matrix $L$ such that, for any two input samples, the inner product between their resulting low-dimensional representations is close to that of their corresponding high-dimensional features.\n\nMore formally, for two samples $x_k$, $k = 1, 2$, with corresponding low-dimensional representations $y_k = L x_k$ and feature vectors $a_k$, we define $\delta p = y_1^T y_2 - a_1^T a_2$ as a quantity whose magnitude we want, on average, to be small. Assuming that an accurate probabilistic generative model for the samples $x$ and features $a$ is available, we propose learning $L$ by solving the optimization problem ($E$ denoting expectation with respect to the subscripted variables)\n\n$\min_{L \in \mathbb{R}^{M \times N}} \; E_{x_1, x_2, a_1, a_2}\left[ \delta p^2 \right]. \quad (4)$\n\nSolving (4) may in general be a hard optimization problem, depending on the model used for $a_k$ and $x_k$. Here we solve it for the case of the sparse linear model of Section 2, under which the feature vectors are the sparse codes. Using (1) and denoting $S = L^T L$, (4) becomes\n\n$\min_{L} \; E_{a_1, a_2, \varepsilon_1, \varepsilon_2}\left[ \left( a_1^T (D^T S D - I) a_2 + \varepsilon_1^T S D a_2 + \varepsilon_2^T S D a_1 + \varepsilon_1^T S \varepsilon_2 \right)^2 \right]. \quad (5)$\n\nAssuming that $x_1$ and $x_2$ are drawn independently, we prove that (5) is equivalent to the problem\n\n$\min_{L} \; 4\tau^4 \left\| D^T S D - I \right\|_F^2 + 4\tau^2 \sigma^2 \left\| S D \right\|_F^2 + \sigma^4 \left\| S \right\|_F^2, \quad (6)$\n\nwhere $\|\cdot\|_F$ is the Frobenius norm, and which has the closed-form solution (up to an arbitrary rotation)\n\n$L = \mathrm{diag}(f(\lambda_M)) \, V_M^T. \quad (7)$\n\nHere, $\lambda_M = (\lambda_1, \ldots
, \lambda_M)$ is an $M \times 1$ vector composed of the $M$ largest eigenvalues of the $N \times N$ matrix $D D^T$, and $V_M$ is the $N \times M$ matrix with the corresponding eigenvectors as columns. The function $f(\cdot)$ in (7) is applied element-wise to the vector $\lambda_M$, such that\n\n$f(\lambda_i) = \sqrt{ \frac{4\tau^4 \lambda_i}{\sigma^4 + 4\tau^2 \sigma^2 \lambda_i + 4\tau^4 \lambda_i^2} }, \quad (8)$\n\nand $\mathrm{diag}(f(\lambda_M))$ is the $M \times M$ diagonal matrix formed from $f(\lambda_M)$. This solution assumes that $D D^T$ has full rank $N$, which in practice is almost always true, as $D$ is overcomplete.\n\nThrough comparison with (5), we observe that (6) is a trade-off between bringing the inner products of sparse codes and of their projections close (first term) and suppressing noise (second and third terms). Their relative influence is controlled by the variances of $\varepsilon$ and $a$, through the constants $\sigma$ and $\tau$ respectively. It is interesting to compare their roles in (3) and (6): as $\sigma$ increases relative to $\tau$, data fitting in (3) becomes less important, and (7) emphasizes noise suppression. As $\tau$ increases, the $l_1$-regularization in (3) is weighted less, and the first term in (6) more. In the extreme case of $\sigma = 0$, the data term in (3) becomes a hard constraint, whereas (6) and (7) simplify, respectively, to\n\n$\min_{L} \left\| D^T S D - I \right\|_F^2, \quad \text{and} \quad L = \mathrm{diag}(\lambda_M)^{-1/2} V_M^T. \quad (9)$\n\nInterestingly, in this noiseless case an ambiguity arises in the solution of (9), as a minimizer is obtained for any subset of $M$ eigenpairs, and not necessarily the $M$ largest ones.\n\nThe solution (7) is similar, and in the noiseless case identical, to the whitening transform of the atoms of $D$.
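The closed-form solution (7)-(8) amounts to a single eigendecomposition of $D D^T$. The following is a minimal NumPy sketch, under the paper's assumption that $D D^T$ has full rank; the function name and toy dictionary are our own. At $\sigma = 0$ the weights reduce to $\lambda_i^{-1/2}$, recovering the whitening solution of (9).

```python
import numpy as np

def learn_projection(D, M, tau, sigma):
    """Closed-form minimizer (7)-(8): L = diag(f(lam_M)) V_M^T, built from
    the M largest eigenpairs of D D^T (assumed to have full rank)."""
    lam, V = np.linalg.eigh(D @ D.T)            # eigenvalues in ascending order
    lam, V = lam[::-1][:M], V[:, ::-1][:, :M]   # keep the M largest eigenpairs
    f = np.sqrt(4 * tau**4 * lam /
                (sigma**4 + 4 * tau**2 * sigma**2 * lam + 4 * tau**4 * lam**2))
    return np.diag(f) @ V.T                     # M x N projection matrix

# Toy usage with a random overcomplete dictionary.
rng = np.random.default_rng(0)
D = rng.normal(size=(16, 32))
L = learn_projection(D, M=4, tau=0.1, sigma=0.01)
print(L.shape)  # (4, 16)
```

With sigma set to 0, the returned matrix whitens the dictionary atoms, i.e. $L (D D^T) L^T = I_M$.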
When the atoms are centered at the origin, this essentially means that solving (4) for the sparse linear model amounts to performing PCA on dictionary atoms learned from training samples, instead of on the training samples themselves. The above result can also be interpreted in the setting of [28]: dimensionality reduction in the case of the sparse linear model, with the objective of (4), corresponds to kernel PCA using the kernel $D D^T$, modulo centering and normalization.\n\n3.1 Other dictionary models\nEven though we have presented our results using the sparse linear model described in Section 2, it is important to realize that our analysis is not limited to this model. The assumptions required for deriving (5) are that signals are generated by a linear dictionary model such as (1), where the coefficients of each of the noise and code vectors are independent and identically distributed according to some zero-mean distribution, with the two vectors also independent of each other.\n\nThe above assumptions apply to several other popular dictionary models. Examples include the models used implicitly by ridge and bridge regression [29] and the elastic net [30], where the Laplace prior on the code coefficients is replaced by a Gaussian, and by priors of the form $\exp(-\lambda \|a\|_q^q)$ and $\exp(-\lambda \|a\|_1 - \gamma \|a\|_2^2)$, respectively. In the context of sparse coding, other sparsity-inducing priors that have been proposed in the literature, such as Student's t-distribution [31], also fall into the same framework. We choose to emphasize the sparse linear model, however, due to the apparent structure present in dictionaries learned using this model, and its empirical success in diverse applications.\n\nIt is possible to derive similar results for a more general model.
Specifically, we make the same assumptions as above, except that we only require that the elements of $a$ be zero-mean and not necessarily identically distributed, and similarly for $\varepsilon$. Then, we prove that (4) becomes\n\n$\min_{L \in \mathbb{R}^{M \times N}} \; \left\| \left( D^T S D - I \right) \odot \sqrt{W_1} \right\|_F^2 + \left\| (S D) \odot \sqrt{W_2} \right\|_F^2 + \left\| S \odot \sqrt{W_3} \right\|_F^2, \quad (10)$\n\nwhere $\odot$ denotes the Hadamard product and $(\sqrt{W})_{ij} = \sqrt{(W)_{ij}}$. The elements of the weight matrices $W_1$, $W_2$, and $W_3$ in (10), of sizes $K \times K$, $N \times K$, and $N \times N$ respectively, are\n\n$(W_1)_{ij} = E\left[ a_{1i}^2 a_{2j}^2 \right], \quad (W_2)_{ij} = E\left[ \varepsilon_{1i}^2 a_{2j}^2 \right] + E\left[ \varepsilon_{2i}^2 a_{1j}^2 \right], \quad (W_3)_{ij} = E\left[ \varepsilon_{1i}^2 \varepsilon_{2j}^2 \right]. \quad (11)$\n\nProblem (10) can still be solved efficiently; see for example [32].\n\n3.2 Extension to the nonlinear case\nWe consider a nonlinear extension of the above analysis through the use of kernels. We denote by $\Phi : \mathbb{R}^N \to \mathcal{H}$ a mapping from the signal domain to a reproducing kernel Hilbert space $\mathcal{H}$ associated with a kernel function $k : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ [33]. Using a set $D = \{ \tilde{d}_i \in \mathcal{H}, i = 1, \ldots, K \}$ as dictionary, we extend the sparse linear model of Section 2 by replacing (1), for each $x \in \mathbb{R}^N$, with\n\n$\Phi(x) = Da + \tilde{\varepsilon}, \quad (12)$\n\nwhere $Da \equiv \sum_{i=1}^{K} a_i \tilde{d}_i$.
For $a \in \mathbb{R}^K$ we make the same assumptions as in the sparse linear model (12), where $\Phi(x) = Da + \tilde{\varepsilon}$ and $Da \equiv \sum_{i=1}^{K} a_i \tilde{d}_i$. The term $\tilde{\varepsilon}$ denotes a Gaussian process over the domain $\mathbb{R}^N$ whose sample paths are functions in $\mathcal{H}$ and whose covariance operator is $C_{\tilde{\varepsilon}} = \sigma^2 \mathcal{I}$, where $\mathcal{I}$ is the identity operator on $\mathcal{H}$ [33, 34].\n\nThis nonlinear extension of the sparse linear model is valid only in finite-dimensional spaces $\mathcal{H}$. In the infinite-dimensional case, constructing a Gaussian process with both sample paths in $\mathcal{H}$ and identity covariance operator is not possible, as that would imply that the identity operator on $\mathcal{H}$ has finite Hilbert-Schmidt norm [33, 34]. Related problems arise in the construction of cylindrical Gaussian measures on infinite-dimensional spaces [35]. We define $\tilde{\varepsilon}$ this way to obtain a probabilistic model for which MAP inference of $a$ corresponds to the kernel extension of the lasso (3) [36],\n\n$\min_{a \in \mathbb{R}^K} \; \frac{1}{2\sigma^2} \|\Phi(x) - Da\|_{\mathcal{H}}^2 + \frac{1}{\tau} \|a\|_1, \quad (13)$\n\nwhere $\|\cdot\|_{\mathcal{H}}$ is the norm of $\mathcal{H}$ defined through $k$. In the supplementary material, we discuss an alternative to (12) that resolves these problems by requiring that all $\Phi(x)$ be in the subspace spanned by the atoms of $D$. Our results can be extended to this alternative; however, in the following we adopt (12) and limit ourselves to finite-dimensional spaces $\mathcal{H}$, unless mentioned otherwise.\n\nIn the kernel case, the equivalent of the projection matrix $L$ (transposed) is a compact linear operator $\mathcal{V} : \mathcal{H} \to \mathbb{R}^M$ that maps an element $x \in \mathbb{R}^N$ to $y = \mathcal{V} \Phi(x) \in \mathbb{R}^M$. We denote by $\mathcal{V}^* : \mathbb{R}^M \to \mathcal{H}$ the adjoint of $\mathcal{V}$, and by $\mathcal{S} : \mathcal{H} \to \mathcal{H}$ the self-adjoint, positive semi-definite linear operator of rank $M$ formed from their synthesis, $\mathcal{S} = \mathcal{V}^* \mathcal{V}$.
If we consider optimizing over $\mathcal{S}$, we prove that (4) reduces to\n\n$\min_{\mathcal{S}} \; 4\tau^4 \sum_{i=1}^{K} \sum_{j=1}^{K} \left( \langle \tilde{d}_i, \mathcal{S} \tilde{d}_j \rangle_{\mathcal{H}} - \delta_{ij} \right)^2 + 4\tau^2 \sigma^2 \sum_{i=1}^{K} \langle \mathcal{S} \tilde{d}_i, \mathcal{S} \tilde{d}_i \rangle_{\mathcal{H}} + \sigma^4 \|\mathcal{S}\|_{HS}^2, \quad (14)$\n\nwhere $\|\cdot\|_{HS}$ is the Hilbert-Schmidt norm. Assuming that $K_{DD}$ has full rank (which is almost always true in practice, due to the very large dimension of the Hilbert spaces used), we extend the representer theorem of [37] to prove that all solutions of (14) can be written in the form\n\n$\mathcal{S} = (DB) \otimes (DB), \quad (15)$\n\nwhere $\otimes$ denotes the tensor product between all pairs of elements of its operands, and $B$ is a $K \times M$ matrix. Then, denoting $Q = B B^T$, problem (14) becomes\n\n$\min_{B \in \mathbb{R}^{K \times M}} \; 4\tau^4 \left\| K_{DD} Q K_{DD} - I \right\|_F^2 + 4\tau^2 \sigma^2 \left\| K_{DD} Q K_{DD}^{1/2} \right\|_F^2 + \sigma^4 \left\| K_{DD}^{1/2} Q K_{DD}^{1/2} \right\|_F^2, \quad (16)$\n\nwhere $K_{DD}(i, j) = \langle \tilde{d}_i, \tilde{d}_j \rangle_{\mathcal{H}}$, $i, j = 1, \ldots, K$.\n\nFigure 1: Two-dimensional projection of the CMU PIE dataset, colored by identity. Shown at high resolution, and at their respective projections, are identity-averaged faces across the dataset for various illuminations, poses, and expressions. Insets show projections of samples from only two distinct identities. (Best viewed in color.)
We can replace $\tilde{L} = B^T K_{DD}^{1/2}$ to turn (16) into an equivalent problem over $\tilde{L}$ of the form (6), with $K_{DD}^{1/2}$ instead of $D$, and thus use (8) to obtain\n\n$B = V_M \, \mathrm{diag}(g(\lambda_M)), \quad (17)$\n\nwhere, similar to the linear case, $\lambda_M$ and $V_M$ are composed of the $M$ largest eigenpairs of the matrix $K_{DD}$, and\n\n$g(\lambda_i) = \frac{1}{\sqrt{\lambda_i}} f(\lambda_i) = \sqrt{ \frac{4\tau^4}{\sigma^4 + 4\tau^2 \sigma^2 \lambda_i + 4\tau^4 \lambda_i^2} }. \quad (18)$\n\nUsing the derived solution, a vector $x \in \mathbb{R}^N$ is mapped to $y = B^T K_D(x)$, where $K_D(x) = [\langle \tilde{d}_1, \Phi(x) \rangle_{\mathcal{H}}, \ldots, \langle \tilde{d}_K, \Phi(x) \rangle_{\mathcal{H}}]^T$. As in the linear case, this is similar to the result of applying kernel PCA to the dictionary $D$ instead of the training samples. Note that, in the noiseless case $\sigma = 0$, the above analysis is also valid for infinite-dimensional spaces $\mathcal{H}$. Expression (17) simplifies to $B = V_M \, \mathrm{diag}(\lambda_M)^{-1}$ where, as in the linear case, any subset of $M$ eigenvalues may be selected. Even though in the infinite-dimensional case selecting the $M$ largest eigenvalues cannot be justified probabilistically, it is a reasonable heuristic given the analysis in the finite-dimensional case.\n\n3.3 Computational considerations\nIt is interesting to compare the proposed method in the nonlinear case with kernel PCA, in terms of computational and memory requirements. If we require dictionary atoms to have pre-images in $\mathbb{R}^N$, that is, $D = \{ \Phi(d_i), d_i \in \mathbb{R}^N, i = 1, \ldots, K \}$ [36], then the proposed algorithm requires calculating and decomposing the $K \times K$ kernel matrix $K_{DD}$ when learning $\mathcal{V}$, and performing $K$ kernel evaluations for projecting a new sample $x$. For kernel PCA, on the other hand, the $S \times S$ matrix $K_{XX}$ and $S$ kernel evaluations are needed respectively, where $X = \{ \Phi(x_i), x_i \in \mathbb{R}^N, i = 1, \ldots, S \}$, $x_i$ are the representations of the training samples in $\mathcal{H}$, and $S \gg K$.
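In code, the kernel solution (17)-(18) likewise reduces to an eigendecomposition of the $K \times K$ matrix $K_{DD}$, which is the source of the savings discussed here. Below is a minimal NumPy sketch; the Gaussian kernel, toy atoms with pre-images, and all names are our own illustrative assumptions.

```python
import numpy as np

def learn_kernel_projection(K_DD, M, tau, sigma):
    """Kernel solution (17)-(18): B = V_M diag(g(lam_M)), built from the
    M largest eigenpairs of the atom kernel matrix K_DD."""
    lam, V = np.linalg.eigh(K_DD)               # ascending eigenvalues
    lam, V = lam[::-1][:M], V[:, ::-1][:, :M]   # keep the M largest eigenpairs
    g = np.sqrt(4 * tau**4 /
                (sigma**4 + 4 * tau**2 * sigma**2 * lam + 4 * tau**4 * lam**2))
    return V @ np.diag(g)                       # K x M matrix B

# Toy usage: K = 8 atoms with pre-images in R^5, Gaussian kernel.
rng = np.random.default_rng(0)
atoms = rng.normal(size=(8, 5))
sq = ((atoms[:, None] - atoms[None]) ** 2).sum(-1)
K_DD = np.exp(-sq)                              # k(d_i, d_j) = exp(-||d_i - d_j||^2)
B = learn_kernel_projection(K_DD, M=3, tau=0.1, sigma=0.01)

x = rng.normal(size=5)
k_x = np.exp(-((atoms - x) ** 2).sum(-1))       # K_D(x): kernel with each atom
y = B.T @ k_x                                   # low-dimensional representation
print(B.shape, y.shape)  # (8, 3) (3,)
```

Projecting a new sample thus costs $K$ kernel evaluations plus one small matrix-vector product, matching the comparison with kernel PCA made in the text.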
If the pre-image constraint is dropped, and the usual alternating procedure [21] is used for learning $D$, then the representer theorem of [38] implies that $D = X F$, where $F$ is an $S \times K$ matrix. In this case, the proposed method also requires calculating $K_{XX}$ during learning and $S$ kernel evaluations for out-of-sample projections, but it only requires the eigendecomposition of the $K \times K$ matrix $F^T K_{XX}^2 F$.\n\nOn the other hand, we have assumed so far, in both the linear and nonlinear cases, that a dictionary is given. When this is not true, we need to take into account the cost of learning a dictionary, which greatly outweighs the computational savings described above, despite advances in dictionary learning algorithms [21, 22]. In the kernel case, whereas imposing the pre-image constraint has the advantages we mentioned, it also makes dictionary learning a harder nonlinear optimization problem, due to the need to evaluate kernel derivatives. In the linear case, the computational savings from applying (linear) PCA to the dictionary instead of the training samples are usually negligible, and therefore the difference in required computation becomes even more severe.\n\nFigure 2: Classification accuracy results. From left to right: CMU PIE (varying value of M); CMU PIE (varying number of training samples); Brodatz texture patches; Caltech-101. (Best viewed in color.)\n\n4 Experimental validation\nIn order to evaluate our proposed method, we compare it with other unsupervised dimensionality reduction methods on visualization, clustering, and classification tasks.
We use facial images in the linear case, and texture patches and images of object categories in the kernel case.\n\nFacial images: We use the CMU PIE [39] benchmark dataset of faces under pose, illumination, and expression changes, and specifically the subset used in [8].2 We visualize the dataset by projecting all face samples to M = 2 dimensions using LPP and the proposed method, as shown in Figure 1. Also shown are identity-averaged faces over the dataset, for various illumination, pose, and expression combinations, at the locations of their projections. We observe that our method recovers a very clear geometric structure, with changes in illumination corresponding to an ellipsoid, changes in pose to moving towards its interior, and changes in expression accounting for the density on the horizontal axis. We separately show the projections of samples from two distinct individuals, and see that different identities are mapped to parallel, shifted ellipsoids, easily separated by a nearest-neighbor classifier. On the other hand, such structure is not apparent when using LPP. A larger version of Figure 1 and the corresponding figure for PCA are provided in the supplementary material.\n\nTo assess how well identity structure is recovered for increasing values of the target dimension M, we also perform face recognition experiments. We compare against three baseline methods (PCA, NPE, and LPP), linear extensions (spectral regression "SRLPP" [7], spatially smooth LPP "SmoothLPP" [8]), and random projections (see Section 5). We produce 20 random splits into training and testing sets, learn a dictionary and projection matrices from the training set, and use the obtained low-dimensional representations with a k-nearest-neighbor classifier (k = 4) to classify the test samples, as is common in the literature.
In Figure 2, we show the average recognition accuracy for the various methods as the number of projections is varied, when using 100 training samples for each of the 68 individuals in the dataset. We also compare the proposed method with the best performing alternative when the number of training samples per individual is varied from 40 to 120. We observe that the proposed method outperforms all others by a wide margin, in many cases even when trained with fewer samples. However, it can only be used when there are enough training samples to learn a dictionary, a limitation that does not apply to the other methods. For this reason, we do not experiment with cases of 5-20 samples per individual, as is commonly done in the literature.\n\nTexture patches: We perform classification experiments on texture patches, using the Brodatz dataset [40], and specifically classes 4, 5, 8, 12, 17, 84, and 92 from the 2-texture images. We extract 12 x 12 patches and use those from the training images to learn dictionaries and projections for the Gaussian kernel.3 We classify the low-dimensional representations using a one-versus-all linear SVM. In Figure 2, we compare the classification accuracy of the proposed method ("ker.dict") with the kernel variants of PCA and LPP ("KPCA" and "KLPP", respectively), for varying M. KLPP and the proposed method both outperform KPCA. Our method achieves much higher accuracy at small values of M, and KLPP is better at large values; otherwise, they perform similarly.\n\nThis dataset provides an illustrative example for the discussion in Section 3.3. For 20000 training samples, KPCA and KLPP require storing and processing a 20000 x 20000 kernel matrix, as opposed to 512 x 512 for our method.
On the other hand, training a dictionary with K = 512 for this dataset takes approximately 2 hours on an 8-core machine, using a C++ implementation of the learning algorithm, as opposed to the few minutes required for the eigendecompositions in KPCA and KLPP.\n\n2Images are pre-normalized to unit length. We use the algorithm of [21] to learn dictionaries, with K equal to the number of pixels N = 1024, due to the limited amount of training data, and $\lambda = \sigma^2 / \tau = 0.05$ as in [19].\n3Following [36], we set the kernel parameter $\gamma = 8$, and use their method for dictionary learning with K = 512 and $\lambda = 0.30$, but with a conjugate gradient optimizer for the dictionary update step.\n\nMethod | Accuracy | NMI | Rand Index\nKPCA (k-means) | 0.6217 | 0.6380 | 0.4279\nKLPP (spectral clustering) | 0.6900 | 0.6788 | 0.5143\nker.dict (k-means) | 0.7233 | 0.7188 | 0.5275\n\nTable 1: Clustering results on Caltech-101.\n\nImages of object categories: We use the Caltech-101 [41] object recognition dataset, with the average of the 39 kernels used in [42]. First, we use 30 training samples from each class to learn a dictionary4 and projections using KPCA, KLPP, and the proposed method. In Figure 2, we plot the classification accuracy achieved using a linear SVM for each method and varying M. We see that the proposed method and KPCA perform similarly, and both outperform KLPP. Our algorithm performs consistently well in both of the datasets we experiment with in the kernel case.\n\nWe also perform unsupervised clustering experiments, where we randomly select 30 samples from each of the 20 classes used in [43] to learn projections with the three methods, over a range of values for M between 10 and 150. We combine each with three clustering algorithms: k-means, spectral clustering [44], and affinity propagation [43] (using negative Euclidean distances of the low-dimensional representations as similarities).
In Table 1, we report for each method the best overall result in terms of accuracy, normalized mutual information, and Rand index [45], along with the clustering algorithm for which these are achieved. We observe that the low-dimensional representations from the proposed method produce the best quality clusterings under all three measures.\n\n5 Discussion and future directions\nAs we remarked in Section 3, the proposed method uses the available training samples to learn D and ignores them afterwards, relying exclusively on the assumed generative model and the correlation information in D. To see how this approach could fail, consider the degenerate case when D is the identity matrix, that is, when the signal and sparse domains coincide. Then, to discover structure, we need to examine the training samples directly. Better use of the training samples within our framework can be made by adopting a richer probabilistic model, using the available data to train it (naturally, with appropriate regularization to avoid overfitting), and then minimizing (4) for the learned model. For example, we can use the more general model of Section 3.1, and assume that each $a_i$ follows a Laplace distribution with a different $\tau_i$. Doing so agrees with empirical observations that, when D is learned, the average magnitude of the coefficients $a_i$ varies significantly with i. An orthogonal approach is to forgo adopting a generative model, and instead learn a projection matrix directly from the training samples using an appropriate empirical loss function.
One possibility is minimizing $\|A^T A - X^T L^T L X\|_F^2$, where the columns of $X$ and $A$ are the training samples and the corresponding sparse code estimates; this is an instance of multidimensional scaling [46] (as modified to achieve linear induction).\n\nFor the sparse linear model case, objective function (4) is related to the Restricted Isometry Property (RIP) [47], used in the compressed sensing literature as a condition enabling reconstruction of a sparse vector $a \in \mathbb{R}^K$ from linear measurements $y \in \mathbb{R}^M$ when $M \ll K$. The RIP is a worst-case condition, requiring approximate preservation, in the low-dimensional domain, of the pairwise Euclidean distances of all $a$, and is therefore stronger than the expectation condition (4). Verifying the RIP for an arbitrary matrix is a hard problem, but it is known to hold for the equivalent dictionary $\tilde{D} = LD$ with high probability, if $L$ is drawn from certain random distributions and $M$ is of the order of only $O(k \log(K/k))$ [48]. Despite this property, our experiments demonstrate that a learned matrix $L$ is in practice more useful than random projections (see left of Figure 2). The formal guarantees that preservation of the Euclidean geometry of sparse codes is possible with few linear projections are unique to the sparse linear model, further justifying our choice to emphasize this model throughout the paper.\n\nAnother quantity used in compressed sensing is the mutual coherence of $\tilde{D}$ [49], and its approximate minimization has been proposed as a way of learning $L$ for signal reconstruction [50, 51]. One of the optimization problems arrived at in this context [51] is the same as problem (9), which we derived in the noiseless case and whose solution, as we mentioned in Section 3, is not unique.
This ambiguity has been addressed heuristically by weighting the objective function with appropriate multiplicative terms, so that it becomes $\|\Lambda - \Lambda V^T L^T L V \Lambda\|_F^2$, where $\Lambda$ and $V$ are the eigenpairs of $D D^T$ [51]. This problem admits as its only minimizer the one corresponding to the $M$ largest eigenvalues. Our analysis addresses the above issue naturally by incorporating noise, thus providing formal justification for the heuristic. Also, the closed-form solution of (9) is not shown in [51], though its existence is mentioned, and the (weighted) problem is instead solved through an iterative procedure.\n\n4We use a kernel extension of the algorithm of [21] without pre-image constraints. We select K = 300 and $\lambda = 0.1$ from a range of values, to achieve about 10% non-zero coefficients in the sparse codes and small reconstruction error for the training samples. Using K = 150 or 600 affected accuracy by less than 1.5%.\n\nIn Section 3, we motivated preserving inner products in the sparse domain by considering existing algorithms that employ sparse codes. As our understanding of sparse coding continues to improve [52], there is motivation for considering other structure in $\mathbb{R}^K$. Possibilities include preservation of linear subspace structure (as determined by the support of the sparse codes) or of local group relations in the sparse domain. Extending our analysis to also incorporate supervision is another important future direction.\n\nLinear dimensionality reduction has traditionally been used for data preprocessing and visualization, but we are also beginning to see its utility for low-power sensors. A sensor can be designed to record linear projections of an input signal, instead of the signal itself, with the projections implemented through a low-power physical process like optical filtering.
In these cases, methods like the ones proposed in this paper can be used to obtain a small number of informative projections, thereby reducing the power and size of the sensor while maintaining its effectiveness for tasks like recognition. An example for visual sensing is described in [2], where a heuristically-modified version of our linear approach is employed to select projections for face detection. Rigorously extending our analysis to this domain will require accounting for noise and constraints on the projections (for example, non-negativity and limited resolution) induced by fabrication processes. We view this as a research direction worth pursuing.

Acknowledgments
This research was supported by NSF award IIS-0926148, ONR award N000140911022, and the US Army Research Laboratory and the US Army Research Office under contract/grant number 54262-CI.

References

[1] M.A. Davenport, P.T. Boufounos, M.B. Wakin, and R.G. Baraniuk. Signal processing with compressive measurements. IEEE JSTSP, 2010.
[2] S.J. Koppal, I. Gkioulekas, T. Zickler, and G.L. Barrows. Wide-angle micro sensors for vision on a tight budget. CVPR, 2011.
[3] I. Jolliffe. Principal component analysis. Wiley, 1986.
[4] X. He and P. Niyogi. Locality preserving projections. NIPS, 2003.
[5] X. He, D. Cai, S. Yan, and H.J. Zhang. Neighborhood preserving embedding. ICCV, 2005.
[6] D. Cai, X. He, J. Han, and H.J. Zhang. Orthogonal laplacianfaces for face recognition. IEEE IP, 2006.
[7] D. Cai, X. He, and J. Han. Spectral regression for efficient regularized subspace learning. ICCV, 2007.
[8] D. Cai, X. He, Y. Hu, J. Han, and T. Huang. Learning a spatially smooth subspace for face recognition. CVPR, 2007.
[9] X. He, D. Cai, and P. Niyogi. Tensor subspace analysis. NIPS, 2006.
[10] J. Ye, R. Janardan, and Q. Li. Two-dimensional linear discriminant analysis. NIPS, 2004.
[11] B. Schölkopf, A. Smola, and K.R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 1998.
[12] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 1997.
[13] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma. Robust face recognition via sparse representation. PAMI, 2008.
[14] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE IP, 2006.
[15] J.F. Cai, H. Ji, C. Liu, and Z. Shen. Blind motion deblurring from a single image using sparse approximation. CVPR, 2009.
[16] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: Transfer learning from unlabeled data. ICML, 2007.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. NIPS, 2008.
[18] I. Ramirez, P. Sprechmann, and G. Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. CVPR, 2010.
[19] J. Yang, K. Yu, and T. Huang. Supervised translation-invariant sparse coding. CVPR, 2010.
[20] M.W. Seeger. Bayesian inference and optimal design for the sparse linear model. JMLR, 2008.
[21] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. NIPS, 2007.
[22] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 2010.
[23] M. Zhou, H. Chen, J. Paisley, L. Ren, G. Sapiro, and L. Carin. Non-parametric Bayesian dictionary learning for sparse image representations. NIPS, 2009.
[24] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. ICCV, 2009.
[25] R. Tibshirani. Regression shrinkage and selection via the lasso. JRSS-B, 1996.
[26] A.M. Bruckstein, D.L. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 2009.
[27] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 2004.
[28] J. Ham, D.D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction of manifolds. ICML, 2004.
[29] W.J. Fu. Penalized regressions: the bridge versus the lasso. JCGS, 1998.
[30] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. JRSS-B, 2005.
[31] S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE SP, 2008.
[32] N. Srebro and T. Jaakkola. Weighted low-rank approximations. ICML, 2003.
[33] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Kluwer, 2004.
[34] V.I. Bogachev. Gaussian measures. AMS, 1998.
[35] J. Kuelbs, F.M. Larkin, and J.A. Williamson. Weak probability distributions on reproducing kernel Hilbert spaces. Rocky Mountain J. Math, 1972.
[36] S. Gao, I. Tsang, and L.T. Chia. Kernel sparse representation for image classification and face recognition. ECCV, 2010.
[37] J. Abernethy, F. Bach, T. Evgeniou, and J.P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. JMLR, 2009.
[38] B. Schölkopf, R. Herbrich, and A. Smola. A generalized representer theorem. COLT, 2001.
[39] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. IEEE ICAFGR, 2002.
[40] T. Randen and J.H. Husoy. Filtering for texture classification: A comparative study. PAMI, 2002.
[41] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. CVPR Workshops, 2004.
[42] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. ICCV, 2009.
[43] D. Dueck and B.J. Frey. Non-metric affinity propagation for unsupervised image categorization. ICCV, 2007.
[44] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 2000.
[45] N.X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. JMLR, 2010.
[46] T.F. Cox and M.A.A. Cox. Multidimensional scaling. Chapman & Hall, 2000.
[47] E.J. Candès and T. Tao. Decoding by linear programming. IEEE IT, 2005.
[48] H. Rauhut, K. Schnass, and P. Vandergheynst. Compressed sensing and redundant dictionaries. IEEE IT, 2008.
[49] D.L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE IT, 2001.
[50] M. Elad. Optimized projections for compressed sensing. IEEE SP, 2007.
[51] J.M. Duarte-Carvajalino and G. Sapiro. Learning to sense sparse signals: Simultaneous sensing matrix and sparsifying dictionary optimization. IEEE IP, 2009.
[52] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. NIPS, 2009.