{"title": "Dimensionality Reduction for Data in Multiple Feature Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 961, "page_last": 968, "abstract": "In solving complex visual learning tasks, adopting multiple descriptors to more precisely characterize the data has been a feasible way for improving performance. These representations are typically high dimensional and assume diverse forms. Thus finding a way to transform them into a unified space of lower dimension generally facilitates the underlying tasks, such as object recognition or clustering. We describe an approach that incorporates multiple kernel learning with dimensionality reduction (MKL-DR). While the proposed framework is flexible in simultaneously tackling data in various feature representations, the formulation itself is general in that it is established upon graph embedding. It follows that any dimensionality reduction techniques explainable by graph embedding can be generalized by our method to consider data in multiple feature representations.", "full_text": "Dimensionality Reduction for Data in Multiple\n\nFeature Representations\n\nYen-Yu Lin1,2\n\nTyng-Luh Liu1\n\nChiou-Shann Fuh2\n\n1Institute of Information Science, Academia Sinica, Taipei, Taiwan\n\n{yylin, liutyng}@iis.sinica.edu.tw\n\n2Department of CSIE, National Taiwan University, Taipei, Taiwan\n\nfuh@csie.ntu.edu.tw\n\nAbstract\n\nIn solving complex visual learning tasks, adopting multiple descriptors to more\nprecisely characterize the data has been a feasible way for improving performance.\nThese representations are typically high dimensional and assume diverse forms.\nThus \ufb01nding a way to transform them into a uni\ufb01ed space of lower dimension\ngenerally facilitates the underlying tasks, such as object recognition or cluster-\ning. We describe an approach that incorporates multiple kernel learning with\ndimensionality reduction (MKL-DR). 
While the proposed framework is flexible in simultaneously tackling data in various feature representations, the formulation itself is general in that it is established upon graph embedding. It follows that any dimensionality reduction technique explainable by graph embedding can be generalized by our method to consider data in multiple feature representations.\n\n1 Introduction\n\nThe fact that most visual learning problems deal with high dimensional data has made dimensionality reduction an inherent part of current research. Besides the potential for greater efficiency, working in a space of lower dimension often helps to reveal the intrinsic structures in the data, which benefits various applications, e.g., [3, 7]. Despite their broad applicability, however, existing dimensionality reduction methods suffer from two main restrictions. First, many of them, especially the linear ones, require data to be represented in the form of feature vectors. This limitation may reduce the effectiveness of the overall algorithms when the data of interest could be more precisely characterized in other forms, such as bag-of-features [1, 11] or high order tensors [19]. Second, there is no systematic way of integrating multiple image features for dimensionality reduction. This shortcoming becomes even more evident in applications where no single descriptor can appropriately depict the whole dataset, which is usually the case in complex visual learning tasks [4].\n\nAiming to relax the two above-mentioned restrictions, we introduce an approach called MKL-DR that incorporates multiple kernel learning (MKL) into the training process of dimensionality reduction (DR) algorithms. Our approach is inspired by the work of Kim et al. 
[8], in which learning an optimal kernel over a given convex set of kernels is coupled with kernel Fisher discriminant analysis (KFDA), but their method only considers binary-class data. Free of this restriction, MKL-DR manifests its flexibility in two aspects. First, it works with multiple base kernels, each of which is created based on a specific kind of visual feature, and combines these features in the domain of kernel matrices. Second, the formulation is illustrated with the framework of graph embedding [19], which presents a unified view for a large family of DR methods. The proposed MKL-DR is therefore ready to generalize any DR method that is expressible by graph embedding. Note that these DR methods include supervised, semisupervised and unsupervised ones.\n\n2 Related work\n\nThis section describes some of the key concepts used in the establishment of the proposed approach, including graph embedding and multiple kernel learning.\n\n2.1 Graph embedding\n\nMany dimensionality reduction methods focus on modeling the pairwise relationships among data, and utilize graph-based structures. In particular, the framework of graph embedding [19] provides a unified formulation for a set of DR algorithms. Let Ω = {x_i ∈ R^d}_{i=1}^N be the dataset. A DR scheme accounted for by graph embedding involves a complete graph G whose vertices are over Ω. An affinity matrix W = [w_ij] ∈ R^{N×N} is used to record the edge weights that characterize the similarity relationships between training sample pairs. Then the optimal linear embedding v* ∈ R^d can be obtained by solving\n\nv* = arg min_{v⊤XDX⊤v=1, or v⊤XL′X⊤v=1} v⊤XLX⊤v,   (1)\n\nwhere X = [x_1 x_2 ··· x_N] is the data matrix, and L = diag(W·1) − W is the graph Laplacian of G. 
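For readers implementing graph embedding, the Laplacian construction just defined, L = diag(W·1) − W, is a one-liner; the following minimal numpy sketch uses a toy 4×4 affinity matrix W that is purely illustrative:

```python
import numpy as np

# Hypothetical 4-sample affinity matrix; any symmetric nonnegative W works.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])

L = np.diag(W.sum(axis=1)) - W   # L = diag(W·1) − W

# Laplacian rows sum to zero, and L is symmetric positive semidefinite.
print(L @ np.ones(4))            # → [0. 0. 0. 0.]
```

These two properties (zero row sums, positive semidefiniteness) are what make the quadratic form v⊤XLX⊤v a sum of weighted pairwise squared distances.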
Depending on the property of a problem, one of the two constraints in (1) will be used in the optimization. If the first constraint is chosen, a diagonal matrix D = [d_ij] ∈ R^{N×N} is included for scale normalization. Otherwise another complete graph G′ over Ω is required for the second constraint, where L′ and W′ = [w′_ij] ∈ R^{N×N} are respectively the graph Laplacian and affinity matrix of G′. The meaning of (1) can be better understood with the following equivalent problem:\n\nmin_v Σ_{i,j=1}^N ‖v⊤x_i − v⊤x_j‖² w_ij   (2)\n\nsubject to Σ_{i=1}^N ‖v⊤x_i‖² d_ii = 1, or   (3)\n\nΣ_{i,j=1}^N ‖v⊤x_i − v⊤x_j‖² w′_ij = 1.   (4)\n\nThe constrained optimization problem (2) implies that pairwise distances or distances to the origin of projected data (in the form of v⊤x) are modeled by one or two graphs in the framework. By specifying W and D (or W and W′), Yan et al. [19] show that a set of dimensionality reduction methods, such as PCA, LPP [7], LDA, and MFA [19], can be expressed by (1).\n\n2.2 Multiple kernel learning\n\nMKL refers to the process of learning a kernel machine with multiple kernel functions or kernel matrices. Recent research efforts on MKL, e.g., [9, 14, 16], have shown that learning SVMs with multiple kernels not only increases the accuracy but also enhances the interpretability of the resulting classifier. Our MKL formulation is to find an optimal way to linearly combine the given kernels. Suppose we have a set of base kernel functions {k_m}_{m=1}^M (or base kernel matrices {K_m}_{m=1}^M). An ensemble kernel function k (or an ensemble kernel matrix K) is then defined by\n\nk(x_i, x_j) = Σ_{m=1}^M β_m k_m(x_i, x_j),  β_m ≥ 0,   (5)\n\nK = Σ_{m=1}^M β_m K_m,  β_m ≥ 0.   (6)\n\nConsequently, the learned model from binary-class data {(x_i, y_i ∈ ±1)} will be of the form:\n\nf(x) = Σ_{i=1}^N α_i y_i k(x_i, x) + b = Σ_{i=1}^N α_i y_i Σ_{m=1}^M β_m k_m(x_i, x) + b.   (7)\n\nOptimizing both the coefficients {α_i}_{i=1}^N and {β_m}_{m=1}^M is one particular form of the MKL problems. Our approach leverages such an MKL optimization to yield more flexible dimensionality reduction schemes for data in different feature representations.\n\n3 The MKL-DR framework\n\nTo establish the proposed method, we first discuss the construction of a set of base kernels from multiple features, and then explain how to integrate these kernels for dimensionality reduction. Finally, we design an optimization procedure to learn the projection for dimensionality reduction.\n\n3.1 Kernel as a unified feature representation\n\nConsider a dataset Ω of N samples, and M kinds of descriptors to characterize each sample. Let Ω = {x_i}_{i=1}^N, x_i = {x_{i,m} ∈ X_m}_{m=1}^M, and d_m : X_m × X_m → {0} ∪ R^+ be the distance function for data representation under the mth descriptor. The domains resulting from distinct descriptors, e.g., feature vectors, histograms, or bags of features, are in general different. To eliminate these varieties in representation, we represent data under each descriptor as a kernel matrix. There are several ways to accomplish this goal, such as using an RBF kernel for data in the form of vectors, or the pyramid match kernel [6] for data in the form of bag-of-features. We may also convert pairwise distances between data samples to a kernel matrix [18, 20]. 
By coupling each representation and its corresponding distance function, we obtain a set of M dissimilarity-based kernel matrices {K_m}_{m=1}^M with\n\nK_m(i, j) = k_m(x_i, x_j) = exp{−d_m²(x_{i,m}, x_{j,m})/σ_m²},   (8)\n\nwhere σ_m is a positive constant. As several well-designed descriptors and their associated distance functions have been introduced over the years, the use of dissimilarity-based kernels is convenient in solving visual learning tasks. Nonetheless, care must be taken in that the resulting K_m is not guaranteed to be positive semidefinite. Zhang et al. [20] have suggested a solution to resolve this issue. It follows from (5) and (6) that determining a set of optimal ensemble coefficients {β_1, β_2, . . . , β_M} can be interpreted as finding appropriate weights for best fusing the M feature representations.\n\n3.2 The MKL-DR algorithm\n\nInstead of designing a specific dimensionality reduction algorithm, we choose to describe MKL-DR upon graph embedding. This way we can derive a general framework: if a dimensionality reduction scheme is explained by graph embedding, then it will also be extendible by MKL-DR to handle data in multiple feature representations. In graph embedding (2), there are two possible types of constraints. For ease of presentation, we discuss how to develop MKL-DR subject to constraint (4). However, the derivation can be analogously applied when using constraint (3).\n\nIt has been shown that a set of linear dimensionality reduction methods can be kernelized to nonlinear ones via the kernel trick. The procedure of kernelization in MKL-DR is mostly accomplished in a similar way, but with the key difference of using multiple kernels {K_m}_{m=1}^M. Suppose the ensemble kernel K in MKL-DR is generated by linearly combining the base kernels {K_m}_{m=1}^M as in (6). Let φ : X → F denote the feature mapping induced by K. 
Through φ, the training data can be implicitly mapped to a high dimensional Hilbert space, i.e.,\n\nx_i ↦ φ(x_i), for i = 1, 2, ..., N.   (9)\n\nBy assuming the optimal projection v lies in the span of the training data in the feature space, we have\n\nv = Σ_{n=1}^N α_n φ(x_n).   (10)\n\nTo show that the underlying algorithm can be reformulated in terms of inner products and accomplished in the new feature space F, we observe that plugging each mapped sample φ(x_i) and the projection v into (2) makes them appear exclusively in the form v⊤φ(x_i). Hence, it suffices to show that in MKL-DR, v⊤φ(x_i) can be evaluated via the kernel trick:\n\nv⊤φ(x_i) = Σ_{n=1}^N Σ_{m=1}^M α_n β_m k_m(x_n, x_i) = α⊤K^(i)β,   (11)\n\nwhere α = [α_1 ··· α_N]⊤ ∈ R^N, β = [β_1 ··· β_M]⊤ ∈ R^M, and K^(i) ∈ R^{N×M} is the matrix whose (n, m) entry is K_m(n, i), i.e., K^(i) = [K_1(·, i) K_2(·, i) ··· K_M(·, i)].\n\nWith (2) and (11), we define the constrained optimization problem for 1-D MKL-DR as follows:\n\nmin_{α,β} Σ_{i,j=1}^N ‖α⊤K^(i)β − α⊤K^(j)β‖² w_ij   (12)\n\nsubject to Σ_{i,j=1}^N ‖α⊤K^(i)β − α⊤K^(j)β‖² w′_ij = 1,   (13)\n\nβ_m ≥ 0, m = 1, 2, ..., M.   (14)\n\nThe additional constraints in (14) are included to ensure that the resulting kernel K in MKL-DR is a non-negative combination of the base kernels. 
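As a sanity check on (11), the kernel-trick evaluation v⊤φ(x_i) = α⊤K^(i)β reduces to two small matrix products. The numpy sketch below uses random stand-in base kernels (not the visual descriptors of the paper) and verifies the matrix form against the explicit double sum:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3                        # toy sizes: samples, base kernels

# Stand-in PSD base kernel matrices; in the paper these come from (8).
Ks = [(lambda A: A @ A.T)(rng.standard_normal((N, N))) for _ in range(M)]

alpha = rng.standard_normal(N)     # sample coefficient vector α
beta = rng.random(M)               # nonnegative kernel weight vector β

def K_i(i):
    # K^(i) ∈ R^{N×M}: the i-th column of every base kernel, stacked side by side.
    return np.column_stack([Km[:, i] for Km in Ks])

i = 2
proj = alpha @ K_i(i) @ beta       # v⊤φ(x_i) = α⊤ K^(i) β

# Agreement with the explicit double sum in (11).
direct = sum(alpha[n] * beta[m] * Ks[m][n, i]
             for n in range(N) for m in range(M))
assert np.isclose(proj, direct)
```

The point of the K^(i) arrangement is that once the N matrices K^(i) are precomputed, every projected coordinate is a bilinear form in (α, β), which is what makes the alternating optimization of Section 3.3 possible.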
We leave the details of how to solve (12) until the next section, where using MKL-DR for finding a multi-dimensional projection V is considered.\n\nFigure 1: Four kinds of spaces in MKL-DR: the input space of each feature representation, the RKHS induced by each base kernel, the RKHS by the ensemble kernel, and the projected space.\n\n3.3 Optimization\n\nObserve from (11) that the one-dimensional projection v of MKL-DR is specified by a sample coefficient vector α and a kernel weight vector β. The two vectors respectively account for the relative importance among the samples and the base kernels. To generalize the formulation to uncover a multi-dimensional projection, we consider a set of P sample coefficient vectors, denoted by\n\nA = [α¹ α² ··· α^P].   (15)\n\nWith A and β, each 1-D projection v_i is determined by a specific sample coefficient vector αⁱ and the (shared) kernel weight vector β. The resulting projection V = [v_1 v_2 ··· v_P] will map samples to a P-dimensional space. Analogous to the 1-D case, a projected sample x_i can be written as\n\nV⊤φ(x_i) = A⊤K^(i)β ∈ R^P.   (16)\n\nThe optimization problem (12) can now be extended to accommodate multi-dimensional projections:\n\nmin_{A,β} Σ_{i,j=1}^N ‖A⊤K^(i)β − A⊤K^(j)β‖² w_ij   (17)\n\nsubject to Σ_{i,j=1}^N ‖A⊤K^(i)β − A⊤K^(j)β‖² w′_ij = 1,\n\nβ_m ≥ 0, m = 1, 2, ..., M.\n\nIn Figure 1, we give an illustration of the four kinds of spaces related to MKL-DR, including the input space of each feature representation, the RKHS induced by each base kernel and the ensemble kernel, and the projected Euclidean space.\n\nSince direct optimization of (17) is difficult, we instead adopt an iterative, two-step strategy to alternately optimize A and β. At each iteration, one of A and β is optimized while the other is fixed, and then the roles of A and β are switched. Iterations are repeated until convergence or until a maximum number of iterations is reached.\n\nOn optimizing A: By fixing β, the optimization problem (17) is reduced to\n\nmin_A trace(A⊤S^β_W A) subject to trace(A⊤S^β_{W′} A) = 1,   (18)\n\nwhere\n\nS^β_W = Σ_{i,j=1}^N w_ij (K^(i) − K^(j)) ββ⊤ (K^(i) − K^(j))⊤,   (19)\n\nS^β_{W′} = Σ_{i,j=1}^N w′_ij (K^(i) − K^(j)) ββ⊤ (K^(i) − K^(j))⊤.   (20)\n\nThe problem (18) is a trace ratio problem, i.e., min_A trace(A⊤S^β_W A)/trace(A⊤S^β_{W′} A). A closed-form solution can be obtained by transforming (18) into the corresponding ratio trace problem, i.e., min_A trace[(A⊤S^β_{W′} A)^{−1}(A⊤S^β_W A)]. 
Consequently, the columns of the optimal A* = [α¹ α² ··· α^P] are the eigenvectors corresponding to the P smallest eigenvalues of the generalized eigenvalue problem\n\nS^β_W α = λ S^β_{W′} α.   (21)\n\nAlgorithm 1: MKL-DR\nInput: A DR method specified by two affinity matrices W and W′ (cf. (2)); various visual features expressed by base kernels {K_m}_{m=1}^M (cf. (8));\nOutput: Sample coefficient vectors A = [α¹ α² ··· α^P]; kernel weight vector β;\nMake an initial guess for A or β;\nfor t ← 1, 2, . . . , T do\n  1. Compute S^β_W in (19) and S^β_{W′} in (20);\n  2. Optimize A by solving the generalized eigenvalue problem (21);\n  3. Compute S^A_W in (23) and S^A_{W′} in (24);\n  4. Optimize β by solving the optimization problem (25) via semidefinite programming;\nreturn A and β;\n\nOn optimizing β: By fixing A, the optimization problem (17) becomes\n\nmin_β β⊤S^A_W β subject to β⊤S^A_{W′} β = 1 and β ≥ 0,   (22)\n\nwhere\n\nS^A_W = Σ_{i,j=1}^N w_ij (K^(i) − K^(j))⊤ AA⊤ (K^(i) − K^(j)),   (23)\n\nS^A_{W′} = Σ_{i,j=1}^N w′_ij (K^(i) − K^(j))⊤ AA⊤ (K^(i) − K^(j)).   (24)\n\nBecause of the additional constraint β ≥ 0, the optimization in (22) can no longer be formulated as a generalized eigenvalue problem. Indeed, it now becomes a nonconvex quadratically constrained quadratic programming (QCQP) problem, which is known to be very difficult to solve. 
We instead consider solving its convex relaxation by adding an auxiliary variable B of size M × M:\n\nmin_{β,B} trace(S^A_W B)   (25)\n\nsubject to trace(S^A_{W′} B) = 1,   (26)\n\ne_m⊤β ≥ 0, m = 1, 2, ..., M,   (27)\n\n[1 β⊤; β B] ⪰ 0,   (28)\n\nwhere e_m in (27) is a column vector whose elements are 0 except that its mth element is 1, and the constraint in (28) means that the square matrix is positive semidefinite. The optimization problem (25) is an SDP relaxation of the nonconvex QCQP problem (22), and can be efficiently solved by semidefinite programming (SDP). One can verify the equivalence between the two optimization problems (22) and (25) by replacing the constraint (28) with B = ββ⊤. Since the constraint B = ββ⊤ is nonconvex, it is relaxed to B ⪰ ββ⊤. Applying the Schur complement lemma, B ⪰ ββ⊤ can be equivalently expressed by the constraint in (28). (Refer to [17] for further details.) Note that the numbers of constraints and variables in (25) are respectively linear and quadratic in M, the number of adopted descriptors. In practice the value of M is often small (M = 7 in our experiments). Thus, like most other DR methods, the computational bottleneck of our approach is still in solving the generalized eigenvalue problems.\n\nListed in Algorithm 1, the procedure of MKL-DR requires an initial guess for either A or β in the alternating optimization. We have tried two possibilities: 1) β is initialized by setting all of its elements to 1 to equally weight each base kernel; 2) A is initialized by assuming AA⊤ = I. In our empirical testing, the second initialization strategy gives more stable performance, and is thus adopted in the experiments. 
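The alternating procedure (steps 1–4 of Algorithm 1) can be sketched compactly in numpy/scipy. Note the hedges: the β-step below replaces the SDP relaxation (25) with a much simpler surrogate (the smallest generalized eigenvector of (S^A_W, S^A_{W′}), sign-fixed and clipped to the nonnegative orthant), and all inputs (base kernels, affinity matrices, sizes) are random toys, so this is an illustrative approximation rather than a faithful reimplementation:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
N, M, P = 8, 3, 2          # toy sizes: samples, base kernels, target dimension

# Stand-in inputs: PSD base kernels and symmetric affinity matrices W, W'.
Ks = np.stack([(lambda G: G @ G.T + N * np.eye(N))(rng.standard_normal((N, N)))
               for _ in range(M)])
W = rng.random((N, N)); W = (W + W.T) / 2
Wp = rng.random((N, N)); Wp = (Wp + Wp.T) / 2

K_cols = Ks.transpose(2, 0, 1)   # K_cols[i].T equals K^(i) of (11), shape N×M

def S_beta(Wmat, beta):
    # S^β per (19)/(20): Σ_ij w_ij (K^(i)−K^(j)) ββ^T (K^(i)−K^(j))^T, N×N.
    S = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            d = (K_cols[i] - K_cols[j]).T @ beta
            S += Wmat[i, j] * np.outer(d, d)
    return S

def S_A(Wmat, A):
    # S^A per (23)/(24): Σ_ij w_ij (K^(i)−K^(j))^T A A^T (K^(i)−K^(j)), M×M.
    S = np.zeros((M, M))
    for i in range(N):
        for j in range(N):
            D = A.T @ (K_cols[i] - K_cols[j]).T
            S += Wmat[i, j] * (D.T @ D)
    return S

beta = np.ones(M)                # equal initial kernel weights
for _ in range(5):
    # A-step: generalized eigenproblem (21); keep the P smallest eigenvectors.
    _, evecs = eigh(S_beta(W, beta), S_beta(Wp, beta) + 1e-8 * np.eye(N))
    A = evecs[:, :P]
    # β-step (simplified surrogate for the SDP (25)): smallest generalized
    # eigenvector, sign-fixed, then projected onto the nonnegative orthant.
    _, evecs = eigh(S_A(W, A), S_A(Wp, A) + 1e-8 * np.eye(M))
    b0 = evecs[:, 0]
    b0 = b0 if b0[np.argmax(np.abs(b0))] > 0 else -b0
    beta = np.clip(b0, 0, None)
    beta /= np.linalg.norm(beta)

print(A.shape, beta.shape)       # → (8, 2) (3,)
```

The small diagonal jitter added to the constraint-side matrices keeps the generalized eigensolver well posed; in a real implementation the β-step would be handed to an SDP solver as in (25).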
Pertaining to the convergence of the optimization procedure, since SDP relaxation has been used, the values of the objective function are not guaranteed to decrease monotonically throughout the iterations. Still, the optimization procedure converges rapidly after only a few iterations in all our experiments.\n\nNovel sample embedding. Given a testing sample z, it is projected to the learned space of lower dimension by\n\nz ↦ A⊤K^(z)β, where K^(z) ∈ R^{N×M} and K^(z)(n, m) = k_m(x_n, z).   (29)\n\n4 Experimental results\n\nTo evaluate the effectiveness of MKL-DR, we test the technique on the supervised visual learning task of object category recognition. In this application, two (base) DR methods and a set of descriptors are properly chosen to serve as the input to MKL-DR.\n\n4.1 Dataset\n\nThe Caltech-101 image dataset [4] consists of 101 object categories and one additional class of background images. The total number of categories is 102, and each category contains roughly 40 to 800 images. Although each target object often appears in the central region of an image, the large number of classes and substantial intraclass variations still make the dataset very challenging. Still, the dataset provides a good test bed to demonstrate the advantage of using multiple image descriptors for complex recognition tasks. Since the images in the dataset are not of the same size, we resize them to around 60,000 pixels, without changing their aspect ratio.\n\nTo implement MKL-DR for recognition, we need to select a proper graph-based DR method to be generalized and a set of image descriptors, and then derive (in our case) a pair of affinity matrices and a set of base kernels. 
The details are described as follows.\n\n4.2 Image descriptors\n\nFor the Caltech-101 dataset, we consider seven kinds of image descriptors that result in the seven base kernels (denoted below in bold and in abbreviation):\n\nGB-1/GB-2: From a given image, we randomly sample 300 edge pixels, and apply the geometric blur descriptor [1] to them. With these image features, we adopt the distance function suggested in equation (2) of the work by Zhang et al. [20] to obtain the two dissimilarity-based kernels, each of which is constructed with a specific descriptor radius.\n\nSIFT-Dist: The base kernel is analogously constructed as in GB-2, except that the SIFT descriptor [11] is now used to extract features.\n\nSIFT-Grid: We apply SIFT with three different scales to an evenly sampled grid of each image, and use k-means clustering to generate visual words from the resulting local features of all images. Each image can then be represented by a histogram over the visual words. The χ² distance is used to derive this base kernel via (8).\n\nC2-SWP/C2-ML: Biologically inspired features are also considered here. Specifically, both the C2 features derived by Serre et al. [15] and by Mutch and Lowe [13] have been chosen. For each of the two kinds of C2 features, an RBF kernel is respectively constructed.\n\nPHOG: We adopt the PHOG descriptor [2] to capture image features, and limit the pyramid level up to 2. Together with the χ² distance, the base kernel is established.\n\n4.3 Dimensionality reduction methods\n\nWe consider two supervised DR schemes, namely, linear discriminant analysis (LDA) and local discriminant embedding (LDE) [3], and show how MKL-DR can generalize them. Both LDA and LDE perform discriminant learning on a fully labeled dataset Ω = {(x_i, y_i)}_{i=1}^N, but make different assumptions about the data distribution: LDA assumes data of each class can be modeled by a Gaussian, while LDE assumes they spread as a submanifold. 
Each of the two methods can be specified by a pair of affinity matrices to fit the formulation of graph embedding (2), and the resulting MKL dimensionality reduction schemes are respectively termed MKL-LDA and MKL-LDE.\n\nAffinity matrices for LDA: The two affinity matrices W = [w_ij] and W′ = [w′_ij] are defined as\n\nw_ij = 1/n_{y_i} if y_i = y_j, and 0 otherwise;  w′_ij = 1/N,   (30)\n\nwhere n_{y_i} is the number of data points with label y_i. See [19] for the derivation.\n\nTable 1: Recognition rates (mean ± std %) for Caltech-101 dataset\n\nkernel(s) | method | 102 classes | 101 classes | method | 102 classes | 101 classes\nGB-1 | KFD | 57.3 ± 2.5 | 57.7 ± 0.7 | KLDE | 57.1 ± 1.4 | 57.7 ± 0.8\nGB-2 | KFD | 60.0 ± 1.5 | 60.6 ± 1.5 | KLDE | 60.9 ± 1.4 | 61.3 ± 2.1\nSIFT-Dist | KFD | 53.0 ± 1.4 | 53.2 ± 0.8 | KLDE | 54.2 ± 0.5 | 54.6 ± 1.5\nSIFT-Grid | KFD | 48.8 ± 1.9 | 49.6 ± 0.7 | KLDE | 49.5 ± 1.3 | 50.1 ± 0.3\nC2-SWP | KFD | 30.3 ± 1.2 | 30.7 ± 1.5 | KLDE | 31.1 ± 1.5 | 31.3 ± 0.7\nC2-ML | KFD | 46.0 ± 0.6 | 46.8 ± 0.9 | KLDE | 45.8 ± 0.2 | 46.7 ± 1.5\nPHOG | KFD | 41.8 ± 0.6 | 42.1 ± 1.3 | KLDE | 42.2 ± 0.6 | 42.6 ± 1.3\n- | KFD-Voting | 68.4 ± 1.5 | 68.9 ± 0.3 | KLDE-Voting | 68.4 ± 1.4 | 68.7 ± 0.8\n- | KFD-SAMME | 71.2 ± 1.4 | 72.1 ± 0.7 | KLDE-SAMME | 71.1 ± 1.9 | 71.3 ± 1.2\nAll | MKL-LDA | 74.6 ± 2.2 | 75.3 ± 1.7 | MKL-LDE | 75.3 ± 1.5 | 75.5 ± 1.7\n\nAffinity matrices for LDE: In LDE, not only the data labels but also the neighborhood relationships are simultaneously considered to construct the affinity matrices W = [w_ij] and W′ = [w′_ij]:\n\nw_ij = 1 if y_i = y_j ∧ [i ∈ N_k(j) ∨ j ∈ N_k(i)], and 0 otherwise,   (31)\n\nw′_ij = 1 if y_i ≠ y_j ∧ [i ∈ N_{k′}(j) ∨ j ∈ N_{k′}(i)], and 0 otherwise,   (32)\n\nwhere i ∈ N_k(j) means that sample x_i is one of the k nearest neighbors of sample x_j. The definitions of the affinity matrices are faithful to those in LDE [3]. However, since there are now multiple image descriptors, we need to construct an affinity matrix for data under each descriptor, and average the resulting affinity matrices from all the descriptors.\n\n4.4 Quantitative results\n\nOur experimental setting follows the one described by Zhang et al. [20]. From each of the 102 classes, we randomly pick 30 images, where 15 of them are included for training and the other 15 images are used for testing. To avoid a biased implementation, we redo the whole process of learning by switching the roles of training and testing data. In addition, we also carry out the experiments without using the data from the background class, since such a setting is adopted in some of the related works, e.g., [5]. Via MKL-DR, the data are projected to the learned space, and the recognition task is accomplished there by enforcing the nearest-neighbor rule.\n\nCoupling the seven base kernels with the affinity matrices of LDA and LDE, we can respectively derive MKL-LDA and MKL-LDE using Algorithm 1. Their effectiveness is investigated by comparing with KFD (kernel Fisher discriminant) [12] and KLDE (kernel LDE) [3]. Since KFD considers only one base kernel at a time, we implement two strategies to take account of the classification outcomes from the seven resulting KFD classifiers. The first is named KFD-Voting. It is constructed based on the voting result of the seven KFD classifiers. If there is any ambiguity in the voting result, the next nearest neighbor in each KFD classifier will be considered, and the process is continued until a decision on the class label can be made. The second is termed KFD-SAMME. 
By viewing each KFD classifier as a multi-class weak learner, we boost them with SAMME [21], a multi-class generalization of AdaBoost. Analogously, we also have KLDE-Voting and KLDE-SAMME.\n\nWe report the mean recognition rates and the standard deviations in Table 1. First of all, MKL-LDA achieves a considerable performance gain of 14.6% over the best recognition rate among the seven KFD classifiers. On the other hand, while KFD-Voting and KFD-SAMME try to combine the separately trained KFD classifiers, MKL-LDA jointly integrates the seven kernels into the learning process. The quantitative results show that MKL-LDA can make the most of fusing the various feature descriptors, and improves the recognition rates from 68.4% and 71.2% to 74.6%. Similar improvements can also be observed for MKL-LDE.\n\nThe recognition rates of 74.6% for MKL-LDA and 75.3% for MKL-LDE compare favorably with those of most existing approaches. In [6], Grauman and Darrell report a 50% recognition rate based on the pyramid matching kernel over data in bag-of-features representation. By combining shape and spatial information, SVM-KNN of Zhang et al. [20] achieves 59.05%. In Frome et al. [5], the accuracy rate derived by learning local distances, one for each training sample, is 60.3%. Our related work [10], which performs adaptive feature fusing by locally combining kernel matrices, has a recognition rate of 59.8%. Multiple kernel learning is also used by Varma and Ray [18], and it can yield a top recognition rate of 87.82% by integrating visual cues like shape and color.\n\n5 Conclusions and discussions\n\nThe proposed MKL-DR technique is useful as it has the advantage of learning a unified space of low dimension for data in multiple feature representations. Our approach is general and applicable to most of the graph-based DR methods, and improves their performance. 
Such flexibility allows one to make use of more prior knowledge for effectively analyzing a given dataset, including choosing a proper set of visual features to better characterize the data, and adopting a graph-based DR method to appropriately model the relationships among the data points. On the other hand, by integrating with a suitable DR scheme, MKL-DR can extend the multiple kernel learning framework to address not just supervised learning problems but also unsupervised and semisupervised ones.\n\nAcknowledgements. This work is supported in part by grants 95-2221-E-001-031-MY3 and 97-2221-E-001-019-MY3.\n\nReferences\n[1] A. Berg, T. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondences. In CVPR, 2005.\n[2] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests and ferns. In ICCV, 2007.\n[3] H.-T. Chen, H.-W. Chang, and T.-L. Liu. Local discriminant embedding and its variants. In CVPR, 2005.\n[4] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop on Generative-Model Based Vision, 2004.\n[5] A. Frome, Y. Singer, and J. Malik. Image retrieval and classification using local distance functions. In NIPS, 2006.\n[6] K. Grauman and T. Darrell. The pyramid match kernel: Efficient learning with sets of features. JMLR, 2007.\n[7] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.\n[8] S.-J. Kim, A. Magnani, and S. Boyd. Optimal kernel selection in kernel Fisher discriminant analysis. In ICML, 2006.\n[9] G. Lanckriet, N. Cristianini, P. Bartlett, L. Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 2004.\n[10] Y.-Y. Lin, T.-L. Liu, and C.-S. Fuh. Local ensemble kernel learning for object category recognition. In CVPR, 2007.\n[11] D. 
Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.\n[12] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing, 1999.\n[13] J. Mutch and D. Lowe. Multiclass object recognition with sparse, localized features. In CVPR, 2006.\n[14] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In ICML, 2007.\n[15] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In CVPR, 2005.\n[16] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. JMLR, 2006.\n[17] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 1996.\n[18] M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In ICCV, 2007.\n[19] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: A general framework for dimensionality reduction. PAMI, 2007.\n[20] H. Zhang, A. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In CVPR, 2006.\n[21] J. Zhu, S. Rosset, H. Zou, and T. Hastie. Multi-class AdaBoost. Technical report, Dept. of Statistics, University of Michigan, 2005.\n", "award": [], "sourceid": 153, "authors": [{"given_name": "Yen-yu", "family_name": "Lin", "institution": null}, {"given_name": "Tyng-luh", "family_name": "Liu", "institution": null}, {"given_name": "Chiou-shann", "family_name": "Fuh", "institution": null}]}