{"title": "Feature Selection by Maximum Marginal Diversity", "book": "Advances in Neural Information Processing Systems", "page_first": 1375, "page_last": 1382, "abstract": null, "full_text": "Feature Selection by Maximum Marginal Diversity

Nuno Vasconcelos
Department of Electrical and Computer Engineering
University of California, San Diego
nuno@media.mit.edu

Abstract

We address the question of feature selection in the context of visual recognition. It is shown that, besides being efficient from a computational standpoint, the infomax principle is nearly optimal in the minimum Bayes error sense. The concept of marginal diversity is introduced, leading to a generic principle for feature selection (the principle of maximum marginal diversity) of extreme computational simplicity. The relationships between infomax and the maximization of marginal diversity are identified, uncovering the existence of a family of classification procedures for which near optimal (in the Bayes error sense) feature selection does not require combinatorial search. Examination of this family in light of recent studies on the statistics of natural images suggests that visual recognition problems are a subset of it.

1 Introduction

It has long been recognized that feature extraction and feature selection are important problems in statistical learning. Given a classification or regression task in some observation space X (typically high-dimensional), the goal is to find the best transform T into a feature space Z (typically lower dimensional) where learning is easier, e.g. can be performed with less training data. While in the case of feature extraction there are few constraints on T, for feature selection the transformation is constrained to be a projection, i.e.
the components of a feature vector in Z are a subset of the components of the associated vector in X. Both feature extraction and selection can be formulated as optimization problems where the goal is to find the transform that best satisfies a given criterion for "feature goodness".

In this paper we concentrate on visual recognition, a subset of the classification problem for which various optimality criteria have been proposed throughout the years. In this context, the best feature spaces are those that maximize discrimination, i.e. the separation between the different image classes to recognize. However, classical discriminant criteria such as linear discriminant analysis make very specific assumptions regarding class densities, e.g. Gaussianity, that are unrealistic for most problems involving real data. Recently, various authors have advocated the use of information theoretic measures for feature extraction or selection [15, 3, 9, 11, 1]. These can be seen as instantiations of the infomax principle of neural organization1 proposed by Linsker [7], which also encompasses information theoretic approaches for independent component analysis and blind-source separation [2]. In the classification context, infomax recommends the selection of the feature transform that maximizes the mutual information (MI) between features and class labels.

While searching for the features that preserve the maximum amount of information about the class is, at an intuitive level, an appealing discriminant criterion, the infomax principle does not establish a direct connection to the ultimate measure of classification performance - the probability of error (PE).
By noting that maximizing the MI between features and class labels is the same as minimizing the entropy of labels given features, it is possible to establish a connection through Fano's inequality: that class-posterior entropy (CPE) is a lower bound on the PE [11, 4]. This connection is, however, weak in the sense that there is little insight on how tight the bound is, or if minimizing it has any relationship to minimizing the PE. In fact, among all lower bounds on the PE, it is not clear that CPE is the most relevant. An obvious alternative is the Bayes error (BE), which 1) is the tightest possible classifier-independent lower bound, 2) is an intrinsic measure of the complexity of the discrimination problem and, 3) like CPE, depends on the feature transformation and class labels alone. Minimizing BE has recently been proposed for feature extraction in speech problems [10].

The main contribution of this paper is to show that the two strategies (infomax and minimum BE) are very closely related. In particular, it is shown that 1) CPE is a lower bound on BE and 2) this bound is tight, in the sense that the former is a good approximation to the latter. It follows that infomax solutions are near-optimal in the minimum BE sense. While for feature extraction both infomax and BE appear to be difficult to optimize directly, we show that infomax has clear computational advantages for feature selection, particularly in the context of the sequential procedures that are prevalent in the feature selection literature [6]. The analysis of some simple classification problems reveals that a quantity which plays an important role in infomax solutions is the marginal diversity: the average distance between each of the marginal class-conditional densities and their mean.
This serves as inspiration to a generic principle for feature selection, the principle of maximum marginal diversity (MMD), which only requires marginal density estimates and can therefore be implemented with extreme computational simplicity. While heuristics that are close to the MMD principle have been proposed in the past, very little is known regarding their optimality.

In this paper we summarize the main results of a theoretical characterization of the problems for which the principle is guaranteed to be optimal in the infomax sense (see [13] for further details). This characterization is interesting in two ways. First, it shows that there is a family of classification problems for which a near-optimal solution, in the BE sense, can be achieved with a computational procedure that does not involve combinatorial search. This is a major improvement, from a computational standpoint, over previous solutions for which some guarantee of optimality (branch and bound search) or near optimality (forward or backward search) is available [6]. Second, when combined with recent studies on the statistics of biologically plausible image transformations [8, 5], it suggests that in the context of visual recognition, MMD feature selection will lead to solutions that are optimal in the infomax sense. Given the computational simplicity of the MMD principle, this is quite significant. We present experimental evidence in support of these two properties of MMD.

2 Infomax vs minimum Bayes error

In this section we show that, for classification problems, the infomax principle is closely related to the minimization of Bayes error.
We start by defining these quantities.

1 Under the infomax principle, the optimal organization for a complex multi-layered perceptual system is one where the information that reaches each layer is processed so that the maximum amount of information is preserved for subsequent layers.

Theorem 1 Given a classification problem with M classes in a feature space Z, the decision function which minimizes the probability of classification error is the Bayes classifier g*(z) = argmax_i P_{Y|Z}(i|z), where Y is a random variable that assigns z to one of the M classes, i in {1, ..., M}. Furthermore, the PE is lower bounded by the Bayes error

L* = 1 - E_z[max_i P_{Y|Z}(i|z)],   (1)

where E_z means expectation with respect to P_Z(z).

Proof: All proofs are omitted due to space considerations. They can be obtained by contacting the author.

Principle 1 (infomax) Consider an M-class classification problem with observations drawn from a random variable X, and the set of feature transformations T: X -> Z, with Z = T(X). The best feature space is the one that maximizes the mutual information I(Y; Z) between the class indicator variable Y defined above and the features Z, where

I(Y; Z) = sum_i integral p_{Y,Z}(i, z) log [ p_{Y,Z}(i, z) / (p_Y(i) p_Z(z)) ] dz.

It is straightforward to show that I(Y; Z) = H(Y) - H(Y|Z), where H(Y) = -sum_i p_Y(i) log p_Y(i) is the entropy of Y.
Since the class entropy H(Y) does not depend on the feature transformation T, infomax is equivalent to the minimization of the CPE H(Y|Z). We next derive a bound that plays a central role in the relationship between this quantity and the BE.

Lemma 1 Consider a probability mass function p = {p_1, ..., p_M} such that 0 <= p_i <= 1 and sum_i p_i = 1. Then,

1 - max_i p_i >= H(p) - log(M + 1) + 1,   (2)

where H(p) = -sum_i p_i log p_i and all logarithms are base 2. Furthermore, the bound is tight in the sense that equality holds when

p_i = 2/(M + 1) for one i, and p_j = 1/(M + 1), for all j != i.   (3)

The following theorem follows from this bound.

Theorem 2 The BE of an M-class classification problem with feature space Z and class indicator variable Y is lower bounded by

L* >= H(Y|Z) - log(M + 1) + 1,   (4)

where Z is the random vector from which the features are drawn. When M is large (log(M + 1) ~ log M) this bound reduces to L* >= H(Y|Z) - log M + 1.

It is interesting to note the relationship between (4) and Fano's lower bound on the PE. The two bounds are equal up to an additive constant that quickly decreases to zero with the number of classes M. It follows that, at least when the number of classes is large, Fano's is really a lower bound on BE, not only on PE.
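The bound of Lemma 1, 1 - max_i p_i >= H(p) - log2(M+1) + 1, and its equality case can be checked numerically. A minimal sketch (variable names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_bits(p):
    """Shannon entropy of a pmf, in bits."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

M = 5
# bound: 1 - max_i p_i >= H(p) - log2(M + 1) + 1, for any pmf p
for _ in range(1000):
    p = rng.dirichlet(np.ones(M))
    assert 1.0 - p.max() >= entropy_bits(p) - np.log2(M + 1) + 1.0 - 1e-9

# equality at p = (2/(M+1), 1/(M+1), ..., 1/(M+1))
p_eq = np.array([2.0] + [1.0] * (M - 1)) / (M + 1)
gap = (1.0 - p_eq.max()) - (entropy_bits(p_eq) - np.log2(M + 1) + 1.0)
print(abs(gap) < 1e-12)  # -> True
```

For M = 2 the right-hand side is H(p) - log2(3) + 1, the curve plotted in Figure 1, with equality at p = (2/3, 1/3).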
Besides making this clear, Theorem 2 is a relevant contribution in two ways. First, since constants do not change the location of the bound's extrema, it shows that infomax minimizes a lower bound on the BE. Second, unlike Fano's bound, it sheds considerable insight on the relationship between the extrema of the bound and those of the BE.

In fact, it is clear from the derivation of the theorem that the only reason why the right-hand side (RHS) and left-hand side (LHS) of (4) differ is the application of (2).

Figure 1: Visualization of (2). Left: LHS and RHS versus p_1 for M = 2, p = (p_1, 1 - p_1). Middle: contours of the LHS versus (p_1, p_2) for M = 3, p_3 = 1 - (p_1 + p_2).
Right: same, for RHS.

Figure 2: The LHS of (4) as an approximation to (1) for a two-class Gaussian problem with class-conditional densities of different means and covariances. All plots are shown as functions of the Gaussian parameters. Left: surface plot of (1). Middle: surface plot of the LHS of (4). Right: contour plots of the two functions.

Figure 1 shows plots of the RHS and LHS of (2) for M = 2 and M = 3, illustrating three interesting properties. First, bound (2) is tight in the sense defined in the lemma. Second, the maximum of the LHS is co-located with that of the RHS. Finally, (like the RHS) the LHS is a concave function of p and increasing (decreasing) when the RHS is. Due to these properties, the LHS is a good approximation to the RHS and, consequently, the LHS of (4) is a good approximation to its RHS. It follows that infomax solutions will, in general, be very similar to those that minimize the BE. This is illustrated by a simple example in Figure 2.

3 Feature selection

For feature extraction, both infomax and minimum BE are complicated problems that can only be solved up to approximations [9, 11, 10]. It is therefore not clear which of the two strategies will be more useful in practice.
We now show that the opposite holds for feature selection, where the minimization of CPE is significantly simpler than that of BE. We start by recalling that, because the number of possible feature subsets in a feature selection problem is combinatorial, feature selection techniques rely on sequential search methods [6]. These methods proceed in a sequence of steps, each adding a set of features to the current best subset, with the goal of optimizing a given cost function 2. We denote the current subset by Z_c, the added features by Z_n, and the new subset by Z* = (Z_n, Z_c).

2 These methods are called forward search techniques. There is also an alternative set of backward search techniques, where features are successively removed from an initial set containing all features. We ignore the latter for simplicity, even though all that is said can be applied to them as well.

Theorem 3 Consider an M-class classification problem with observations drawn from a random variable X, and a feature transformation T. Z = T(X) is an infomax feature space if and only if it maximizes

< KL[ p_{Z|Y}(z|i) || p_Z(z) ] >_Y,   (5)

where <.>_Y
denotes expectation with respect to the prior class probabilities p_Y(i), and KL[p || q] = integral p(z) log [ p(z) / q(z) ] dz is the Kullback-Leibler divergence between p and q. Furthermore, if Z* = (Z_n, Z_c), the infomax cost function decouples into two terms according to

< KL[ p_{Z_n,Z_c|Y}(z_n, z_c|i) || p_{Z_n,Z_c}(z_n, z_c) ] >_Y = < KL[ p_{Z_n|Z_c,Y}(z_n|z_c, i) || p_{Z_n|Z_c}(z_n|z_c) ] >_{Z_c,Y} + < KL[ p_{Z_c|Y}(z_c|i) || p_{Z_c}(z_c) ] >_Y.   (6)

Equation (5) exposes the discriminant nature of the infomax criterion. Noting that p_Z(z) = sum_i p_{Z|Y}(z|i) p_Y(i), it clearly favors feature spaces where each class-conditional density is as distant as possible (in the KL sense) from the average among all classes. This is a sensible way to quantify the intuition that optimal discriminant transforms are the ones that best separate the different classes. Equation (6), in turn, leads to an optimal rule for finding the features Z_n to merge with the current optimal solution Z_c: the set which minimizes < KL[ p_{Z_n|Z_c,Y}(z_n|z_c, i) || p_{Z_n|Z_c}(z_n|z_c) ] >_{Z_c,Y}. The equation also leads to a straightforward procedure for updating the optimal cost once this set is determined. On the other hand, when the cost function is BE, the equivalent expression is

1 - E_{z_n,z_c}[ max_i P_{Y|Z_n,Z_c}(i|z_n, z_c) ] = 1 - E_{z_c}{ E_{z_n|z_c}[ max_i P_{Y|Z_n,Z_c}(i|z_n, z_c) ] }.   (7)

Note that the non-linearity introduced by the max operator makes it impossible to express E_{z_n,z_c}[ max_i P_{Y|Z_n,Z_c}(i|z_n, z_c) ] as a function of E_{z_c}[ max_i P_{Y|Z_c}(i|z_c) ]. For this reason, infomax is a better principle for feature selection problems than direct minimization of BE.
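In mutual information terms, the decoupling of Equation (6) is the chain rule I(Y; Z_n, Z_c) = I(Y; Z_n | Z_c) + I(Y; Z_c). It can be verified on a small random discrete joint distribution (a sketch; the alphabet sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
# random joint distribution over (Y, Z_n, Z_c), all discrete: p[y, zn, zc]
p = rng.dirichlet(np.ones(3 * 4 * 5)).reshape(3, 4, 5)

def mi(pxy):
    """I(X;Y) for a 2-D joint probability table, in nats."""
    px = pxy.sum(1, keepdims=True)
    py = pxy.sum(0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log(pxy[m] / (px @ py)[m])).sum())

# I(Y; Zn, Zc): flatten (zn, zc) into a single variable
i_joint = mi(p.reshape(3, 20))
# I(Y; Zc): marginalize out Zn
i_c = mi(p.sum(1))
# I(Y; Zn | Zc) = sum_zc p(zc) I(Y; Zn | Zc = zc)
i_n_given_c = 0.0
for zc in range(5):
    pzc = p[:, :, zc].sum()
    i_n_given_c += pzc * mi(p[:, :, zc] / pzc)

assert abs(i_joint - (i_c + i_n_given_c)) < 1e-10
```

The identity holds for any joint distribution; what fails for the BE cost (7) is precisely this kind of additive decomposition.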
4 Maximum marginal diversity

To gain some intuition for infomax solutions, we next consider the Gaussian problem of Figure 3. Assuming that the two classes have equal prior probabilities, p_Y(1) = p_Y(2) = 1/2, the marginals p_{X1|Y}(x|1) and p_{X1|Y}(x|2) are equal and feature x_1 does not contain any useful information for classification. On the other hand, because the classes are clearly separated along the x_2 axis, feature x_2 contains all the information available for discriminating between them. The different discriminating powers of the two variables are reflected by the infomax costs: while < KL[ p_{X2|Y}(x|i) || p_{X2}(x) ] >_Y is significantly greater than zero, it follows from p_{X1|Y}(x|1) = p_{X1|Y}(x|2) = p_{X1}(x) that < KL[ p_{X1|Y}(x|i) || p_{X1}(x) ] >_Y = 0, and (5) recommends the selection of x_2. This is unlike energy-based criteria, such as principal component analysis, that would select x_1.
The key advantage of infomax is that it emphasizes marginal diversity. The intuition conveyed by the example above can easily be transformed into a generic principle for feature selection.

Definition 1 Consider a classification problem on a feature space Z, and a random vector X = (X_1, ..., X_N) from which feature vectors are drawn. Then,

md(X_k) = < KL[ p_{X_k|Y}(x|i) || p_{X_k}(x) ] >_Y

is the marginal diversity of feature X_k, k in {1, ..., N}.

Principle 2 (Maximum marginal diversity) The best solution for a feature selection problem is to select the subset of features that leads to a set of maximally diverse marginal densities.
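The marginal diversity of Definition 1 can be evaluated by numerical integration for a synthetic two-class problem like the one in Figure 3 (equal priors, one uninformative and one discriminative feature; the specific means and variances below are illustrative, not the paper's):

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 8001)
dx = x[1] - x[0]

def gauss(m, s):
    """Gaussian density with mean m and std s, sampled on the grid x."""
    return np.exp(-((x - m) ** 2) / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

def marginal_diversity(class_marginals, priors):
    """md(X_k) = sum_i p_Y(i) KL(p_{Xk|Y}(x|i) || p_{Xk}(x)), on a grid."""
    mix = sum(w * p for w, p in zip(priors, class_marginals))
    md = 0.0
    for w, p in zip(priors, class_marginals):
        mask = p > 0  # 0 * log 0 contributes nothing
        md += w * np.sum(p[mask] * np.log(p[mask] / mix[mask])) * dx
    return md

# x1: both classes share the same marginal -> md(x1) = 0
md1 = marginal_diversity([gauss(0, 3.0), gauss(0, 3.0)], [0.5, 0.5])
# x2: classes well separated along this axis -> md(x2) close to log 2
md2 = marginal_diversity([gauss(-1, 0.3), gauss(1, 0.3)], [0.5, 0.5])
print(md1, md2)
```

With equal priors and essentially non-overlapping classes, md2 approaches log 2 (about 0.69 nats), since the mixture assigns each class-conditional density half the mass wherever it concentrates; Principle 2 therefore ranks x_2 first, as does (5).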
Figure 3: Gaussian problem with two classes in the two dimensions (x_1, x_2). Left: contours of equal probability. Middle: marginals for x_1. Right: marginals for x_2.

This principle has two attractive properties. First, it is inherently discriminant, recommending the elimination of the dimensions along which the projections of the class densities are most similar.
Second, it is straightforward to implement with the following algorithm.

Algorithm 1 (MMD feature selection) For a classification problem with features X = (X_1, ..., X_N), classes i in {1, ..., M}, and class priors p_Y(i), the following procedure returns the top n MMD features.

- foreach feature k in {1, ..., N}:
  * foreach class i in {1, ..., M}, compute a histogram estimate h_{k,i} of p_{X_k|Y}(x|i),
  * compute h_k = sum_i p_Y(i) h_{k,i},
  * compute the marginal diversity md(X_k) = sum_i p_Y(i) h_{k,i}^T log(h_{k,i} / h_k), where both the log and the division are performed element-wise,
- order the features by decreasing diversity, i.e. find {k_1, ..., k_N} such that md(X_{k_i}) >= md(X_{k_{i+1}}), and return {X_{k_1}, ..., X_{k_n}}.

In general, there are no guarantees that MMD will lead to the infomax solution. In [13] we seek a precise characterization of the problems where MMD is indeed equivalent to infomax. Due to space limitations we present here only the main result of this analysis; see [13] for a detailed derivation.
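Algorithm 1 can be sketched directly with histogram marginal estimates. A minimal illustration (the bin count, smoothing constant and toy data are arbitrary choices, not from the paper):

```python
import numpy as np

def marginal_diversity(X, y, n_bins=32):
    """md(X_k) = sum_i p_Y(i) KL(hist(X_k | Y=i) || hist(X_k)), per feature k."""
    n, d = X.shape
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / n
    md = np.zeros(d)
    for k in range(d):
        edges = np.histogram_bin_edges(X[:, k], bins=n_bins)
        cond = []
        for c in classes:
            h, _ = np.histogram(X[y == c, k], bins=edges)
            cond.append((h + 1e-10) / (h.sum() + 1e-10 * n_bins))  # smoothed h_{k,i}
        cond = np.array(cond)
        marg = priors @ cond  # h_k = sum_i p_Y(i) h_{k,i}
        md[k] = sum(p * (ci * np.log(ci / marg)).sum() for p, ci in zip(priors, cond))
    return md

# toy two-class problem: only feature 1 is discriminative
rng = np.random.default_rng(1)
X0 = rng.normal([0, -1], [3.0, 0.3], size=(500, 2))
X1 = rng.normal([0, +1], [3.0, 0.3], size=(500, 2))
X, y = np.vstack([X0, X1]), np.repeat([0, 1], 500)
md = marginal_diversity(X, y)
order = np.argsort(-md)  # MMD: keep the features with largest marginal diversity
```

Note that the cost is linear in the number of features and requires only one-dimensional density estimates, which is what makes the procedure immune to combinatorial search.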
Theorem 4 Consider a classification problem with class labels drawn from a random variable Y and features drawn from a random vector X = (X_1, ..., X_N), and let X* = (X*_1, ..., X*_n) be the optimal feature subset of size n in the MMD sense. If

I(X*_i; X*_{1,...,i-1}) = I(X*_i; X*_{1,...,i-1} | Y), for all i in {1, ..., n},   (8)

then X* is also the optimal subset of size n in the infomax sense. Furthermore, in this case,

I(Y; X*) = sum_{i=1}^{n} md(X*_i).   (9)

The theorem states that the MMD and infomax solutions will be identical when the mutual information between features is not affected by knowledge of the class label. This is an interesting condition in light of various recent studies that have reported the observation of consistent patterns of dependence between the features of various biologically plausible image transformations [8, 5]. Even though the details of feature dependence will vary from one image class to the next, these studies suggest that the coarse structure of the patterns of dependence between such features follows universal statistical laws that hold for all types of images. The potential implications of this conjecture are quite significant. First, it implies
Figure 4: a) JZ score as a function of sample size for the two-class Gaussian problem discussed in the text, b) classification accuracy on Brodatz as a function of feature space dimension, and c) corresponding curves of cumulative marginal diversity (9). A linear trend was subtracted from all curves in c) to make the differences more visible.

that, in the context of visual processing, (8) will be approximately true and the MMD principle will consequently lead to solutions that are very close to optimal, in the minimum BE sense. Given the simplicity of MMD feature selection, this is quite remarkable. Second, it implies that when combined with such transformations, the marginal diversity is a close predictor of the CPE (and consequently the BE) achievable in a given feature space. This enables quantifying the goodness of the transformation without even having to build the classifier. See [13] for a more extensive discussion of these issues.

5 Experimental results

In this section we present results showing that 1) MMD feature selection outperforms combinatorial search when (8) holds, and 2) in the context of visual recognition problems, marginal diversity is a good predictor of PE. We start by reporting results on a synthetic problem, introduced by Trunk to illustrate the curse of dimensionality [12], and used by Jain and Zongker (JZ) to evaluate various feature selection procedures [6].
It consists of two Gaussian classes of identity covariance and means mu and -mu, with mu_i = 1/sqrt(i), and is an interesting benchmark for feature selection because it has a clear optimal solution for the best subset of k features (the first k) for any k. JZ exploited this property to propose an automated procedure for testing the performance of feature selection algorithms across variations in dimensionality of the feature space and sample size. We repeated their experiments, simply replacing the cost function they used (the Mahalanobis distance - MDist - between the means) by the marginal diversity.

Figure 4 a) presents the JZ score obtained with MMD as a function of the sample size. A comparison with Figure 5 of [6] shows that these results are superior to all those obtained by JZ, including the ones relying on branch and bound. This is remarkable, since branch and bound is guaranteed to find the optimal solution and the MDist is inversely proportional to the PE for Gaussian classes. We believe that the superiority of MMD is due to the fact that it only requires estimates of the marginals, while the MDist requires estimates of joint densities and is therefore much more susceptible to the curse of dimensionality. Unfortunately, because in [6] all results are averaged over dimension, we have not been able to prove this conjecture yet. In any case, this problem is a good example of situations where, because (8) holds, MMD will find the optimal solution at a computational cost that is various orders of magnitude smaller than standard procedures based on combinatorial search (e.g. branch and bound).

Figures 4 b) and c) show that, for problems involving commonly used image transformations, marginal diversity is indeed a good predictor of classification accuracy. The figures
The \ufb01gures\n\n\nd\n\u0001\n\u0084\nd\n\u0001\n\u0002\nd\n\u0001\n?\n:\n\u0013\n\fwas measured on the Brodatz texture database ()\u0019)\n\u0003\u0005\u0004\u0006\u0003\n\ncompare, for each space dimension, the recognition accuracy of a complete texture recog-\nnition system with the predictions provided by marginal diversity. Recognition accuracy\ndimensional\nfeature space consisting of the coef\ufb01cients of a multiresolution decomposition over regions\nof\npixels. Three transformations were considered: the discrete cosine transform, prin-\ncipal component analysis, and a three-level wavelet decomposition (see [14] for detailed\ndescription of the experimental set up). The classi\ufb01er was based on Gauss mixtures and\nmarginal diversity was computed with Algorithm 1. Note that the curves of cumulative\nmarginal diversity are qualitatively very similar to those of recognition accuracy.\n\ntexture classes) and a\n\n\u0002\u0001\n\nReferences\n\n[1] S. Basu, C. Micchelli, and P. Olsen. Maximum Entropy and Maximum Likelihood Criteria for\nFeature Selection from Multivariate Data. In Proc. IEEE International Symposium on Circuits\nand Systems, Geneva, Switzerland,2000.\n\n[2] A. Bell and T. Sejnowski. An Information Maximisation Approach to Blind Separation and\n\nBlind Deconvolution. Neural Computation, 7(6):1129\u20131159, 1995.\n\n[3] B. Bonnlander and A. Weigand. Selecting Input Variables using Mutual Information and Non-\nIn Proc. IEEE International ICSC Symposium on Arti\ufb01cial\n\nparametric Density Estimation.\nNeural Networks, Tainan,Taiwan,1994.\n\n[4] D. Erdogmus and J. Principe.\n\nInformation Transfer Through Classi\ufb01ers and its Relation to\nProbability of Error. In Proc. of the International Joint Conference on Neural Networks, Wash-\nington, 2001.\n\n[5] J. Huang and D. Mumford. Statistics of Natural Images and Models. In IEEE Computer Society\n\nConference on Computer Vision and Pattern Recognition, Fort Collins, Colorado, 1999.\n\n[6] A. 
Jain and D. Zongker. Feature Selection: Evaluation, Application, and Small Sample Performance. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(2):153-158, February 1997.

[7] R. Linsker. Self-Organization in a Perceptual Network. IEEE Computer, 21(3):105-117, March 1988.

[8] J. Portilla and E. Simoncelli. Texture Modeling and Synthesis using Joint Statistics of Complex Wavelet Coefficients. In IEEE Workshop on Statistical and Computational Theories of Vision, Fort Collins, Colorado, 1999.

[9] J. Principe, D. Xu, and J. Fisher. Information-Theoretic Learning. In S. Haykin, editor, Unsupervised Adaptive Filtering, Volume 1: Blind-Source Separation. Wiley, 2000.

[10] G. Saon and M. Padmanabhan. Minimum Bayes Error Feature Selection for Continuous Speech Recognition. In Proc. Neural Information Proc. Systems, Denver, USA, 2000.

[11] K. Torkolla and W. Campbell. Mutual Information in Learning Feature Transforms. In Proc. International Conference on Machine Learning, Stanford, USA, 2000.

[12] G. Trunk. A Problem of Dimensionality: a Simple Example. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1(3):306-307, July 1979.

[13] N. Vasconcelos. Feature Selection by Maximum Marginal Diversity: Optimality and Implications for Visual Recognition. Submitted, 2002.

[14] N. Vasconcelos and G. Carneiro. What is the Role of Independence for Visual Recognition? In Proc. European Conference on Computer Vision, Copenhagen, Denmark, 2002.

[15] H. Yang and J. Moody. Data Visualization and Feature Selection: New Algorithms for Nongaussian Data. In Proc. Neural Information Proc. Systems, Denver, USA, 2000.
", "award": [], "sourceid": 2169, "authors": [{"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}]}