{"title": "Generalized Lasso based Approximation of Sparse Coding for Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 181, "page_last": 189, "abstract": "Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, L1 regularized sparse coding is combined with spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding onto every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents \"Generalized Lasso based Approximation of Sparse coding\" (GLAS). By representing the distribution of sparse coefficients with slice transform, we fit a piece-wise linear mapping function with generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases which is essential for an overcomplete setting. The experiments show that GLAS obtains comparable performance to L1 regularized sparse coding, yet achieves significant speed up demonstrating its effectiveness for large-scale visual recognition problems.", "full_text": "Generalized Lasso based Approximation of Sparse Coding for Visual Recognition\n\nNobuyuki Morioka\n\nThe University of New South Wales & NICTA\n\nSydney, Australia\n\nnmorioka@cse.unsw.edu.au\n\nShin\u2019ichi Satoh\n\nNational Institute of Informatics\n\nTokyo, Japan\n\nsatoh@nii.ac.jp\n\nAbstract\n\nSparse coding, a method of explaining sensory data with as few dictionary bases\nas possible, has attracted much attention in computer vision. For visual object category\nrecognition, \u21131 regularized sparse coding is combined with the spatial pyramid\nrepresentation to obtain state-of-the-art performance. 
However, because of its\niterative optimization, applying sparse coding onto every local feature descriptor\nextracted from an image database can become a major bottleneck. To overcome\nthis computational challenge, this paper presents \u201cGeneralized Lasso based Approximation\nof Sparse coding\u201d (GLAS). By representing the distribution of sparse\ncoefficients with slice transform, we fit a piece-wise linear mapping function with\nthe generalized lasso. We also propose an efficient post-refinement procedure to\nperform mutual inhibition between bases which is essential for an overcomplete\nsetting. The experiments show that GLAS obtains a comparable performance to\n\u21131 regularized sparse coding, yet achieves a significant speed-up, demonstrating its\neffectiveness for large-scale visual recognition problems.\n\n1 Introduction\n\nRecently, sparse coding [3, 18] has attracted much attention in computer vision research. Its applications\nrange from image denoising [23] to image segmentation [17] and image classification\n[10, 24], achieving state-of-the-art results. Sparse coding interprets an input signal x \u2208 R^{D\u00d71} with\na sparse vector u \u2208 R^{K\u00d71} whose linear combination with an overcomplete set of K bases (i.e.,\nD \u226a K), also known as dictionary B \u2208 R^{D\u00d7K}, reconstructs the input as precisely as possible. To\nenforce sparseness on u, the \u21131 norm is a popular choice due to its computational convenience and its\ninteresting connection with the NP-hard \u21130 norm in compressed sensing [2]. Several efficient \u21131 regularized\nsparse coding algorithms have been proposed [4, 14] and are adopted in visual recognition\n[10, 24]. In particular, Yang et al. [24] compute the sparse codes of many local feature descriptors\nwith sparse coding. However, because the \u21131 norm is convex but non-smooth, the sparse coding algorithm\nneeds to optimize iteratively until convergence. 
Therefore, the local feature descriptor coding\nstep becomes a major bottleneck for large-scale problems like visual recognition.\nThe goal of this paper is to achieve state-of-the-art performance on large-scale visual recognition\nthat is comparable to the work of Yang et al. [24], but with a significant improvement in efficiency.\nTo this end, we propose \u201cGeneralized Lasso based Approximation of Sparse coding\u201d, GLAS for\nshort. Specifically, we encode the distribution of each dimension in sparse codes with the slice\ntransform representation [9] and learn a piece-wise linear mapping function with the generalized\nlasso to obtain the best fit [21] to approximate \u21131 regularized sparse coding. We further propose\nan efficient post-refinement procedure to capture the dependency between overcomplete bases. The\neffectiveness of our approach is demonstrated with several challenging object and scene category\ndatasets, showing a comparable performance to Yang et al. [24] and performing better than other\nfast algorithms that obtain sparse codes [22]. While there have been several supervised dictionary\nlearning methods for sparse coding to obtain more discriminative sparse representations [16, 25],\nthey have not been evaluated on visual recognition with many object categories due to their computational\nchallenges. Furthermore, Ranzato et al. [19] have empirically shown that unsupervised\nlearning of visual features can obtain a more general and effective representation. Therefore, in this\npaper, we focus on learning a fast approximation of sparse coding in an unsupervised manner.\nThe paper is organized as follows: Section 2 reviews some related work including the linear spatial\npyramid combined with sparse coding and other fast algorithms to obtain sparse codes. Section 3\npresents GLAS. This is followed by the experimental results on several challenging categorization\ndatasets in Section 4. 
Section 5 concludes the paper with discussion and future work.\n\n2 Related Work\n\n2.1 Linear Spatial Pyramid Matching Using Sparse Coding\n\nThis section reviews the linear spatial pyramid matching based on sparse coding by Yang et al. [24].\nGiven a collection of N local feature descriptors randomly sampled from training images X =\n[x_1, x_2, . . . , x_N] \u2208 R^{D\u00d7N}, an over-complete dictionary B = [b_1, b_2, . . . , b_K] \u2208 R^{D\u00d7K} is learned\nby\n\n\\min_{B,U} \\sum_{i=1}^{N} \\|x_i - B u_i\\|_2^2 + \\lambda \\|u_i\\|_1 \\quad \\text{s.t.} \\quad \\|b_k\\|_2^2 \\le 1, \\; k = 1, 2, \\ldots, K. \\qquad (1)\n\nThe cost function above is a combination of the reconstruction error and the \u21131 sparsity penalty\nwhich is controlled by \u03bb. The \u21132 norm of each b_k is constrained to be less than or equal to 1\nto avoid a trivial solution. Since both B and U = [u_1, u_2, . . . , u_N] are unknown a priori, an alternating\noptimization technique is often used [14] to optimize over the two parameter sets.\nUnder the spatial pyramid matching framework, each image is divided into a set of sub-regions\nr = [r_1, r_2, . . . , r_R]. For example, if 1\u00d71, 2\u00d72 and 4\u00d74 partitions are used on an image, we have\n21 sub-regions. Then, we compute the sparse solutions of all local feature descriptors, denoted as\nU_{r_j}, appearing in each sub-region r_j by\n\n\\min_{U_{r_j}} \\|X_{r_j} - B U_{r_j}\\|_2^2 + \\lambda \\|U_{r_j}\\|_1. \\qquad (2)\n\nThe sparse solutions are max pooled for each sub-region and concatenated with other sub-regions to\nbuild a statistic of the image by\n\nh = [\\max(|U_{r_1}|)^\\top, \\max(|U_{r_2}|)^\\top, \\ldots, \\max(|U_{r_R}|)^\\top]^\\top, \\qquad (3)\n\nwhere max(.) is a function that finds the maximum value at each row of a matrix and returns a\ncolumn vector. 
Finally, a linear SVM is trained on a set of image statistics for classification.\nThe main advantage of using sparse coding is that state-of-the-art results can be achieved with a\nsimple linear classifier as reported in [24]. Compared to kernel-based methods, this dramatically\nspeeds up training and testing time of the classifier. However, the step of finding a sparse code for\neach local descriptor with sparse coding now becomes a major bottleneck. Using the efficient sparse\ncoding algorithm based on feature-sign search [14], the time to compute the solution u for one local\ndescriptor is O(KZ) where Z is the number of non-zeros in u. This paper proposes an approximation\nmethod whose time complexity reduces to O(K). With the post-refinement procedure, its\ntime complexity is O(K + Z^2) which is still much lower than O(KZ).\n\n2.2 Predictive Sparse Decomposition\n\nPredictive sparse decomposition (PSD) described in [10, 11] is a feedforward network that applies a\nnon-linear mapping function on linearly transformed input data to match the optimal sparse coding\nsolution as accurately as possible. Such a feedforward network is defined as: \u00fb_i = G g(W x_i, \u03b8), where\ng(z, \u03b8) denotes a non-linear parametric mapping function which can be of any form, but to name\na few there are hyperbolic tangent, tanh(z + \u03b8) and soft shrinkage, sign(z) max(|z| \u2212 \u03b8, 0). The\nfunction is applied to linearly transformed data W x_i and subsequently scaled by a diagonal matrix\nG. Given training samples {x_i}_{i=1}^{N}, the parameters can be estimated either jointly or separately from\nthe dictionary B. When learning jointly, we minimize the cost function given below:\n\n\\min_{B,G,W,\\theta,U} \\sum_{i=1}^{N} \\|x_i - B u_i\\|_2^2 + \\lambda \\|u_i\\|_1 + \\gamma \\|u_i - G g(W x_i, \\theta)\\|_2^2. \\qquad (4)\n\nWhen learning separately, B and U are obtained with Eqn. (1) first. 
Then, the remaining parameters\nG, W and \u03b8 are estimated by minimizing the last term of Eqn. (4) only. Gregor and LeCun [7] have\nlater proposed a better, but iterative, approximation scheme for \u21131 regularized sparse coding.\nOne downside of the parametric approach is that its accuracy depends largely on how well its parametric\nfunction fits the target statistical distribution, as argued by Hel-Or and Shaked [9]. This paper\nexplores a non-parametric approach which can fit any distribution as long as the available data samples\nare representative. The advantage of our approach over the parametric approach is that we do not\nneed to seek an appropriate parametric function for each distribution. This is particularly useful in\nvisual recognition that uses multiple feature types, as it automatically estimates the function form\nfor each feature type from data. We demonstrate this with two different local descriptor types in our\nexperiments.\n\n2.3 Locality-constrained Linear Coding\n\nAnother notable work that overcomes the bottleneck of the local descriptor coding step is locality-constrained\nlinear coding (LLC) proposed by Wang et al. [22], a fast version of local coordinate\ncoding [26]. Given a local feature descriptor x_i, LLC searches for its M nearest dictionary bases;\nthese nearest bases, stacked in columns, are denoted as B_{\u03c6_i} \u2208 R^{D\u00d7M},\nwhere \u03c6_i indicates the index list of the bases. Then, the coefficients u_{\u03c6_i} \u2208 R^{M\u00d71} whose linear\ncombination with B_{\u03c6_i} reconstructs x_i are found by solving:\n\n\\min_{u_{\\phi_i}} \\|x_i - B_{\\phi_i} u_{\\phi_i}\\|_2^2 \\quad \\text{s.t.} \\quad \\mathbf{1}^\\top u_{\\phi_i} = 1. \\qquad (5)\n\nThis is a least squares problem which can be solved quite efficiently. The final sparse code u_i is\nobtained by setting its elements indexed at \u03c6_i to u_{\u03c6_i}. 
The time complexity of LLC is O(K + M^2).\nThis excludes the time required to find M nearest neighbours. 
While it is fast, the resulting sparse\nsolutions are not as discriminative as the ones obtained by sparse coding. This may be due\nto the fact that M is fixed across all local feature descriptors. Some descriptors may need more bases\nfor accurate representation and others may need fewer bases for more distinctiveness. In contrast, the\nnumber of bases selected with our post-refinement procedure to handle the mutual inhibition is\ndifferent for each local descriptor.\n\n3 Generalized Lasso based Approximation of Sparse Coding\n\nThis section describes GLAS. We first learn a dictionary from a collection of local feature descriptors\nas given in Eqn. (1). Then, based on slice transform representation, we fit a piece-wise linear mapping\nfunction with the generalized lasso to approximate the optimal sparse solutions of the local feature\ndescriptors under \u21131 regularized sparse coding. Finally, we propose an efficient post-refinement\nprocedure to perform the mutual inhibition.\n\n3.1 Slice Transform Representation\n\nSlice transform representation was introduced by Hel-Or and Shaked [9] as a way to discretize a function space so as to fit a piece-wise\nlinear function for the purpose of image denoising. It was later\nadopted by Adler et al. [1] for single image super resolution. In this paper, we utilise the representation\nto approximate sparse coding to obtain sparse codes for local feature descriptors as fast as\npossible.\nGiven a local descriptor x, we compute z = B^\u22a4 x. For the moment, we just\nconsider one dimension of the vector z, denoted as the scalar z, which is a real value and lies in a half open interval\n[a, b). The interval is divided into Q \u2212 1 equal-sized bins whose boundaries form a vector q =\n[q_1, q_2, . . . , q_Q]^\u22a4 such that a = q_1 < q_2 < \u00b7\u00b7\u00b7 < q_Q = b.\n\nFigure 1: Different approaches to fit a piece-wise linear mapping function. 
Regularized least squares\n(RLS) in red (see Eqn. (8)). \u21131-regularized sparse coding (L1-SC) in magenta (see Eqn. (9)). GLAS\nin green (see Eqn. (10)). (a) All three methods achieving a good fit. (b) A case when L1-SC fails to\nextrapolate well at the end and RLS tends to align itself to q in black. (c) A case when data samples\nat around 0.25 are removed artificially to illustrate that RLS fails to interpolate as no neighboring\nprior is used. In contrast, GLAS can both interpolate and extrapolate well in the case of missing or\nnoisy data.\n\nThe interval into which the value of z falls is expressed as: \u03c0(z) = j if z \u2208 [q_{j\u22121}, q_j), and its\ncorresponding residue is calculated by\n\nr(z) = \\frac{z - q_{\\pi(z)-1}}{q_{\\pi(z)} - q_{\\pi(z)-1}}.\n\nBased on the above, we can re-express z as\n\nz = (1 - r(z)) q_{\\pi(z)-1} + r(z) q_{\\pi(z)} = S_q(z) q, \\qquad (6)\n\nwhere S_q(z) = [0, . . . , 0, 1 \u2212 r(z), r(z), 0, . . . , 0].\nIf we now come back to the multivariate case of z = B^\u22a4 x, then we have the following: z =\n[S_q(z_1) q, S_q(z_2) q, . . . , S_q(z_K) q]^\u22a4, where z_k denotes the kth dimension of z. Then, we replace the\nboundary vector q with p = {p_1, p_2, . . . , p_K} such that the resulting vector approximates the optimal\nsparse solution of x obtained by \u21131 regularized sparse coding as closely as possible. This is written as\n\n\\hat{u} = [S_q(z_1) p_1, S_q(z_2) p_2, \\ldots, S_q(z_K) p_K]^\\top. \\qquad (7)\n\nHel-Or and Shaked [9] have formulated the problem of learning each p_k as regularized least squares\neither independently in a transform domain or jointly in a spatial domain. Unlike their setting,\nwe have a significantly larger number of bases which makes joint optimization of all p_k difficult.\nMoreover, since we are interested in approximating the sparse solutions which are in the transform\ndomain, we learn each p_k independently. Given N local descriptors X = [x_1, x_2, . . . 
, x_N] \u2208 R^{D\u00d7N}\nand their corresponding sparse solutions U = [u_1, u_2, . . . , u_N] = [y_1, y_2, . . . , y_K]^\u22a4 \u2208 R^{K\u00d7N}\nobtained with \u21131 regularized sparse coding, we have an optimization problem given as\n\n\\min_{p_k} \\|y_k - S_k p_k\\|_2^2 + \\alpha \\|q - p_k\\|_2^2, \\qquad (8)\n\nwhere S_k = S_q(z_k). The regularization of the second term is essential to avoid singularity when\ncomputing the inverse and its consequence is that p_k is encouraged to align itself to q when not many\ndata samples are available. This might have been a reasonable prior for image denoising [9], but is not\ndesirable for the purpose of approximating sparse coding, as we would like to suppress most of the\ncoefficients in u to zero. Figure 1 shows the distribution of one dimension of sparse coefficients\nobtained from a collection of SIFT descriptors, plotted against z, and q does not look similar to the distribution. This\nmotivates us to look at the generalized lasso [21] as an alternative for obtaining a better fit of the\ndistribution of the coefficients.\n\n3.2 Generalized Lasso\n\nIn the previous section, we have argued that regularized least squares stated in Eqn. (8) does not give\nthe desired result. Instead most intervals need to be set to zero. This naturally leads us to consider\n\u21131 regularized sparse coding, also known as the lasso, which is formulated as:\n\n\\min_{p_k} \\|y_k - S_k p_k\\|_2^2 + \\alpha \\|p_k\\|_1. \\qquad (9)\n\nHowever, the drawback of this is that the learnt piece-wise linear function may become unstable in\ncases when training data is noisy or missing as illustrated in Figure 1 (b) and (c). It turns out that \u21131\ntrend filtering [12], generally known as the generalized lasso [21], can overcome this problem. 
This\nis expressed as\n\n\\min_{p_k} \\|y_k - S_k p_k\\|_2^2 + \\alpha \\|D p_k\\|_1, \\qquad (10)\n\nwhere D \u2208 R^{(Q\u22122)\u00d7Q} is referred to as a penalty matrix and defined as\n\nD = \\begin{bmatrix} -1 & 2 & -1 & & & \\\\ & -1 & 2 & -1 & & \\\\ & & \\ddots & \\ddots & \\ddots & \\\\ & & & -1 & 2 & -1 \\end{bmatrix}. \\qquad (11)\n\nTo solve the above optimization problem, we can turn it into a sparse coding problem [21]. Since\nD is not invertible, the key is to augment D with A \u2208 R^{2\u00d7Q} to build a square matrix D\u0303 = [D; A] \u2208\nR^{Q\u00d7Q} such that rank(D\u0303) = Q and the rows of A are orthogonal to the rows of D. To satisfy such\nconstraints, A can for example be set to [1, 2, . . . , Q; 2, 3, . . . , Q + 1]. If we let \u03b8 = [\u03b8_1; \u03b8_2] = D\u0303 p_k,\nwhere \u03b8_1 = D p_k and \u03b8_2 = A p_k, then S_k p_k = S_k D\u0303^{\u22121} \u03b8 = S_{k1} \u03b8_1 + S_{k2} \u03b8_2, where S_k D\u0303^{\u22121} is partitioned column-wise as [S_{k1}, S_{k2}]. After some substitutions,\nwe see that \u03b8_2 can be solved by \u03b8_2 = (S_{k2}^\u22a4 S_{k2})^{\u22121} S_{k2}^\u22a4 (y_k \u2212 S_{k1} \u03b8_1), given that \u03b8_1 is solved already. Now,\nto solve \u03b8_1, we have the following sparse coding problem:\n\n\\min_{\\theta_1} \\|(I - P) y_k - (I - P) S_{k1} \\theta_1\\|_2^2 + \\alpha \\|\\theta_1\\|_1, \\qquad (12)\n\nwhere P = S_{k2} (S_{k2}^\u22a4 S_{k2})^{\u22121} S_{k2}^\u22a4. Having computed both \u03b8_1 and \u03b8_2, we can recover the solution of p_k\nby D\u0303^{\u22121} \u03b8. Further details can be found in [21].\nGiven the learnt p, we can approximate the sparse solution of x by Eqn. (7). However, explicitly computing\nS_q(z) and multiplying it by p is somewhat redundant. 
Thus, we can alternatively compute\neach component of \u00fb as follows:\n\n\\hat{u}_k = (1 - r(z_k)) \\times p_k(\\pi(z_k) - 1) + r(z_k) \\times p_k(\\pi(z_k)), \\qquad (13)\n\nwhose time complexity becomes O(K). In Eqn. (13), since we are essentially using p_k as a lookup\ntable, the complexity is independent of Q. 
This is followed by \u21131 normalization on \u00fb.\nWhile \u00fb can readily be used for the spatial max pooling as stated in Eqn. (3), it does not yet capture\nany \u201cexplaining away\u201d effect where the corresponding coefficients of correlated bases are mutually\ninhibited to remove redundancy. This is because each p_k is estimated independently in the transform\ndomain [9]. In the next section, we propose an efficient post-refinement technique to perform mutual inhibition\nbetween the bases.\n\n3.3 Capturing Dependency Between Bases\n\nTo handle the mutual inhibition between overcomplete bases, this section explains how to refine the\nsparse codes by solving regularized least squares on a much smaller active basis set. Given a\nlocal descriptor x and its initial sparse code \u00fb estimated with the above method, we set the non-zero\ncomponents of the code to be active. By denoting the set of these active components as \u03c6, we have\n\u00fb_\u03c6 and B_\u03c6 which are the subsets of the sparse code and dictionary bases respectively. The goal is\nto compute the refined code of \u00fb_\u03c6, denoted as v\u0302_\u03c6, such that B_\u03c6 v\u0302_\u03c6 reconstructs x as accurately as\npossible. We formulate this as regularized least squares given below:\n\n\\min_{\\hat{v}_\\phi} \\|x - B_\\phi \\hat{v}_\\phi\\|_2^2 + \\beta \\|\\hat{v}_\\phi - \\hat{u}_\\phi\\|_2^2, \\qquad (14)\n\nwhere \u03b2 is the weight parameter of the regularization. This is convex and has the following analytical\nsolution: v\u0302_\u03c6 = (B_\u03c6^\u22a4 B_\u03c6 + \u03b2I)^{\u22121} (B_\u03c6^\u22a4 x + \u03b2 \u00fb_\u03c6).\nThe intuition behind the above formulation is that the initial sparse code \u00fb is considered as a good\nstarting point for refinement to further reduce the reconstruction error by allowing redundant bases to\ncompete against each other. 
Empirically, the number of active components for each \u00fb is substantially\nsmaller than the whole basis set. Hence, the linear system to be solved becomes much smaller,\nwhich is computationally cheap. We also make sure that we do not deviate too much from the initial\nsolution by introducing the regularization on v\u0302_\u03c6. This refinement procedure may be similar to LLC\n[22]. However, in our case, we do not preset the number of active bases; it is determined by the non-zero\ncomponents of \u00fb. More importantly, we base our final solution on \u00fb and do not perform nearest\nneighbor search. With this refinement procedure, the total time complexity becomes O(K + Z^2).\nWe refer to GLAS with this post-refinement procedure as GLAS+.\n\nSIFT (128 Dim.) [15]\nMethods | KM | LLC [22] | PSD [11] | SC [24] | GLAS | GLAS+\n15 Train | 55.5\u00b11.2 | 62.7\u00b11.0 | 64.0\u00b11.2 | 65.2\u00b11.2 | 64.4\u00b11.2 | 65.1\u00b11.1\n30 Train | 63.0\u00b11.2 | 69.6\u00b10.8 | 70.6\u00b10.9 | 71.6\u00b10.7 | 71.6\u00b11.0 | 72.3\u00b10.7\nTime (sec) | 0.06 | 0.25 | 0.06 | 3.53 | 0.15 | 0.23\n\nLocal Self-Similarity (30 Dim.) [20]\nMethods | KM | LLC [22] | PSD [11] | SC [24] | GLAS | GLAS+\n15 Train | 60.1\u00b11.3 | 62.4\u00b10.8 | 59.7\u00b10.8 | 64.8\u00b10.9 | 62.3\u00b11.2 | 63.8\u00b10.9\n30 Train | 63.0\u00b11.2 | 69.7\u00b11.3 | 67.2\u00b10.9 | 72.5\u00b11.6 | 69.8\u00b11.4 | 71.0\u00b11.1\nTime (sec) | 0.05 | 0.24 | 0.05 | 1.97 | 0.13 | 0.18\n\nTable 1: Recognition accuracy on Caltech-101. The dictionary sizes for all methods are set to 1024.\nWe also report the time taken to process 1000 local descriptors for each method.\n\n4 Experimental Results\n\nThis section evaluates GLAS and GLAS+ on several challenging categorization datasets. To learn\nthe mapping function, we have used 50,000 local descriptors as data samples. 
The parameters Q,\n\u03b1 and \u03b2 are fixed to 10, 0.1 and 0.25 respectively for all experiments, unless otherwise stated. For\ncomparison, we have implemented the methods discussed in Section 2. SC is our re-implementation\nof Yang et al. [24]. LLC is locality-constrained linear coding proposed by Wang et al. [22]. The\nnumber of nearest neighbors to consider is set to 5. PSD is predictive sparse decomposition [11].\nThe shrinkage function is used as its parametric mapping function. We also include KM which builds\nits codebook with k-means clustering and adopts hard-assignment as its local descriptor coding.\nFor all methods, exactly the same local feature descriptors, spatial max pooling technique and linear\nSVM are used, so that only the local feature descriptor coding techniques differ.\nAs for the descriptors, SIFT [15] and Local Self-Similarity [20] are used. SIFT is a histogram of\ngradient directions computed over an image patch, capturing appearance information. We have\nsampled a 16\u00d716 patch every 8 pixels. In contrast, Local Self-Similarity computes correlation\nbetween a small image patch of interest and its surrounding region which captures the geometric\nlayout of a local region. Spatial max pooling with 1\u00d71, 2\u00d72 and 4\u00d74 image partitions is used.\nThe implementation is all done in MATLAB for fair comparison.\n\n4.1 Caltech-101\n\nThe Caltech-101 dataset [5] consists of 9144 images which are divided into 101 object categories.\nThe images are scaled down to 300\u00d7300 preserving their aspect ratios. We train with 15/30 images\nper class and test with 15 images per class. The dictionary size of each method is set to 1024 for\nboth SIFT and Local Self-Similarity.\nThe results are averaged over eight random training and testing splits and are reported in Table\n1. 
For SIFT, GLAS+ is consistently better than GLAS, demonstrating the effectiveness of mutual\ninhibition by the post-refinement procedure. Both GLAS and GLAS+ perform better than other\nfast algorithms that produce sparse codes. In addition, GLAS and GLAS+ perform competitively\nagainst SC. In fact, GLAS+ is slightly better when 30 training images per class are used. While\nsparse codes for both GLAS and GLAS+ are learned from the solutions of SC, the approximated\ncodes are not exactly the same as the ones of SC. Moreover, SC sometimes produces unstable codes\ndue to the non-smooth convex property of the \u21131 norm as previously observed in [6]. In contrast, GLAS+\napproximates its sparse codes with a relatively smooth piece-wise linear mapping function learned\nwith the generalized lasso (note that the \u21131 norm penalizes changes in the shape of the function)\nand performs smooth post-refinement. We suspect these differences may be contributing to the\nslightly better results of GLAS+ on this dataset.\n\nFigure 2: (a) Q, the number of bins to quantize the interval of each sparse code component. (b) \u03b1,\nthe parameter that controls the weight of the norm used for the generalized lasso. (c) When some\ndata samples are missing GLAS is more robust than regularized least squares given in Eqn. (8).\n\nAlthough PSD performs quite close to GLAS for SIFT, this is not the case for Local Self-Similarity.\nGLAS outperforms PSD, probably because the distribution of sparse codes is not captured well by a\nsimple shrinkage function. Therefore, GLAS might be more effective for a wide range of distributions.\nThis is useful for recognition using multiple feature types where speed is critical. GLAS\nperforms worse than SC, but GLAS+ closes the gap between GLAS and SC. We suspect that because\nLocal Self-Similarity (30 dim.) is lower-dimensional than SIFT (128 dim.), the mutual\ninhibition becomes more important. 
This might also explain why LLC has performed reasonably\nwell for this descriptor.\nTable 1 also reports computational time taken to process 1000 local descriptors for each method.\nGLAS and GLAS+ are slower than KM and PSD, but are slightly faster than LLC and significantly\nfaster than SC. This demonstrates the practical importance of our approach where competitive recognition\nresults are achieved with fast computation.\nDifferent values for Q, \u03b1 and \u03b2 are evaluated one parameter at a time. Figure 2 (a) shows the results\nof different Q. The results are very stable after 10 bins. As sparse codes are computed by Eqn. (13),\nthe time complexity is not affected by the choice of Q. Figure 2 (b) shows the results for different\n\u03b1 which look very stable. We also observe similar stability for \u03b2.\nWe also validate whether the generalized lasso given in Eqn. (10) is more robust than the regularized least\nsquares solution given in Eqn. (8) when some data samples are missing. When learning each p_k,\nwe artificially remove data samples from an interval centered around a randomly sampled point,\nas also illustrated in Figure 1 (c). We evaluate with different numbers of data samples removed in\nterms of percentages of the whole data sample set. The results are shown in Figure 2 (c) where the\nperformance of RLS drops significantly as the amount of missing data increases. However, neither\nGLAS nor GLAS+ is affected much.\n\n4.2 Caltech-256\n\nCaltech-256 [8] contains 30,607 images and 256 object categories in total. Like Caltech-101, we\nscale the images down to 300\u00d7300 preserving their aspect ratios. The results are averaged over\neight random training and testing splits and are reported in Table 2. We use 25 testing images per\nclass. 
This time, for SIFT, GLAS performs slightly worse than SC, but GLAS+ outperforms SC\nprobably due to the same argument given in the previous experiments on Caltech-101. 
For Local\nSelf-Similarity, results similar to Caltech-101 are obtained. PSD performs close to KM\nand is outperformed by GLAS, suggesting an inadequate fit of the sparse codes. LLC performs\nslightly better than GLAS, but could not perform better than GLAS+. While SC performed the best,\nthe performance of GLAS+ is quite close to SC. We also plot a graph of the computational time\ntaken for each method with its achieved accuracy on SIFT and Local Self-Similarity in Figure 3 (a)\nand (b) respectively.\n\nSIFT (128 Dim.) [15]\nMethods | KM | LLC [22] | PSD [11] | SC [24] | GLAS | GLAS+\n15 Train | 22.7\u00b10.4 | 28.1\u00b10.5 | 30.4\u00b10.6 | 30.7\u00b10.4 | 30.4\u00b10.4 | 32.1\u00b10.4\n30 Train | 27.4\u00b10.5 | 34.0\u00b10.6 | 36.3\u00b10.5 | 36.8\u00b10.4 | 36.1\u00b10.4 | 38.2\u00b10.4\n\nLocal Self-Similarity (30 Dim.) [20]\nMethods | KM | LLC [22] | PSD [11] | SC [24] | GLAS | GLAS+\n15 Train | 23.7\u00b10.4 | 26.3\u00b10.5 | 24.3\u00b10.6 | 28.7\u00b10.5 | 26.0\u00b10.5 | 27.6\u00b10.6\n30 Train | 28.5\u00b10.4 | 31.9\u00b10.5 | 29.3\u00b10.5 | 34.7\u00b10.4 | 31.2\u00b10.5 | 33.3\u00b10.5\n\nTable 2: Recognition accuracy on Caltech-256. The dictionary sizes are all set to 2048 for SIFT and\n1024 for Local Self-Similarity.\n\nFigure 3: Plotting computational time vs. average recognition. (a) and (b) are SIFT and Local Self-Similarity\nrespectively evaluated on Caltech-256 with 30 training images. The dictionary size is set\nto 2048. (c) is SIFT evaluated on 15 Scenes. The dictionary size is set to 1024.\n\n4.3 15 Scenes\n\nThe 15 Scenes [13] dataset contains 4485 images divided into 15 scene classes ranging from indoor\nscenes to outdoor scenes. 
100 training images per class are used for training and the rest for testing.\nWe used SIFT to learn 1024 dictionary bases for each method. The results are plotted with computational\ntime taken in Figure 3 (c). The result of GLAS+ (80.6%) is very similar to that of SC\n(80.7%), yet the former is significantly faster. In summary, we show that our approach works well\non three different challenging datasets.\n\n5 Conclusion\n\nThis paper has presented an approximation of \u21131 sparse coding based on the generalized lasso called\nGLAS. This is further extended with the post-refinement procedure to handle mutual inhibition\nbetween bases which is essential in an overcomplete setting. The experiments have shown competitive\nperformance of GLAS against SC with a significant computational speed-up. We have\nalso demonstrated the effectiveness of GLAS on two local descriptor types, namely SIFT and\nLocal Self-Similarity, where LLC and PSD only perform well on one type. GLAS is not restricted\nto approximating \u21131 sparse coding, but should be applicable to other variations of sparse coding\nin general. For example, it may be interesting to try GLAS on Laplacian sparse coding [6] that\nachieves smoother sparse codes than \u21131 sparse coding.\n\nAcknowledgment\n\nNICTA is funded by the Australian Government as represented by the Department of Broadband, Communications\nand the Digital Economy and the Australian Research Council through the ICT Centre of Excellence\nprogram.\n\nReferences\n[1] A. Adler, Y. Hel-Or, and M. Elad. A Shrinkage Learning Approach for Single Image Super-Resolution\nwith Overcomplete Representations. In ECCV, 2010.\n\n[2] D.L. Donoho. 
For Most Large Underdetermined Systems of Linear Equations the Minimal L1-norm Solution is also the Sparse Solution. Communications on Pure and Applied Mathematics, 2006.\n[3] D.L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via L1 minimization. PNAS, 100(5):2197\u20132202, 2003.\n[4] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least Angle Regression. Annals of Statistics, 2004.\n[5] L. Fei-Fei, R. Fergus, and P. Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In CVPR Workshop, 2004.\n[6] S. Gao, W. Tsang, L. Chia, and P. Zhao. Local Features Are Not Lonely - Laplacian Sparse Coding for Image Classification. In CVPR, 2010.\n[7] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In ICML, 2010.\n[8] G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset. Technical Report, California Institute of Technology, 2007.\n[9] Y. Hel-Or and D. Shaked. A Discriminative Approach for Wavelet Denoising. TIP, 2008.\n[10] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the Best Multi-Stage Architecture for Object Recognition. In ICCV, 2009.\n[11] K. Kavukcuoglu, M. Ranzato, and Y. LeCun. Fast inference in sparse coding algorithms with applications to object recognition. Technical Report CBLL-TR-2008-12-01, Computational and Biological Learning Lab, Courant Institute, NYU, 2008.\n[12] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. L1 trend filtering. SIAM Review, 2009.\n[13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In CVPR, 2006.\n[14] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. In NIPS, 2006.\n[15] D.G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.\n[16] J. Mairal, F. Bach, J. 
Ponce, G. Sapiro, and A. Zisserman. Supervised Dictionary Learning. In NIPS, 2008.\n[17] J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce. Discriminative Sparse Image Models for Class-Specific Edge Detection and Image Interpretation. In ECCV, 2008.\n[18] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 1997.\n[19] M. Ranzato, F.J. Huang, Y. Boureau, and Y. LeCun. Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition. In CVPR, 2007.\n[20] E. Shechtman and M. Irani. Matching Local Self-Similarities across Images and Videos. In CVPR, 2007.\n[21] R. Tibshirani and J. Taylor. The Solution Path of the Generalized Lasso. The Annals of Statistics, 2010.\n[22] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained Linear Coding for Image Classification. In CVPR, 2010.\n[23] J. Yang, J. Wright, T. Huang, and Y. Ma. Image Super-Resolution via Sparse Representation. TIP, 2010.\n[24] J. Yang, K. Yu, Y. Gong, and T.S. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.\n[25] J. Yang, K. Yu, and T. Huang. Supervised Translation-Invariant Sparse Coding. In CVPR, 2010.\n[26] K. Yu, T. Zhang, and Y. Gong. Nonlinear Learning using Local Coordinate Coding. In NIPS, 2009.\n", "award": [], "sourceid": 141, "authors": [{"given_name": "Nobuyuki", "family_name": "Morioka", "institution": null}, {"given_name": "Shin'ichi", "family_name": "Satoh", "institution": null}]}