{"title": "Feature Selection in Mixture-Based Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 641, "page_last": 648, "abstract": null, "full_text": "Feature Selection in Mixture-Based Clustering\n\nMartin H. Law, Anil K. Jain\n\nDept. of Computer Science and Eng.\n\nMichigan State University,\nEast Lansing, MI 48824\n\nU.S.A.\n\nM\u00b4ario A. T. Figueiredo\n\nInstituto de Telecomunicac\u00b8\u02dcoes,\n\nInstituto Superior T\u00b4ecnico\n\n1049-001 Lisboa\n\nPortugal\n\nAbstract\n\nThere exist many approaches to clustering, but the important issue of\nfeature selection, i.e., selecting the data attributes that are relevant for\nclustering, is rarely addressed. Feature selection for clustering is dif\ufb01cult\ndue to the absence of class labels. We propose two approaches to feature\nselection in the context of Gaussian mixture-based clustering. In the \ufb01rst\none, instead of making hard selections, we estimate feature saliencies.\nAn expectation-maximization (EM) algorithm is derived for this task.\nThe second approach extends Koller and Sahami\u2019s mutual-information-\nbased feature relevance criterion to the unsupervised case. Feature selec-\ntion is then carried out by a backward search scheme. This scheme can\nbe classi\ufb01ed as a \u201cwrapper\u201d, since it wraps mixture estimation in an outer\nlayer that performs feature selection. Experimental results on synthetic\nand real data show that both methods have promising performance.\n\n1 Introduction\n\nIn partitional clustering, each pattern is represented by a vector of features. However, not\nall the features are useful in constructing the partitions: some features may be just noise,\nthus not contributing to (or even degrading) the clustering process. The task of selecting\nthe \u201cbest\u201d feature subset, known as feature selection (FS), is therefore an important task.\nIn addition, FS may lead to more economical clustering algorithms (both in storage and\ncomputation) and, in many cases, it may contribute to the interpretability of the models.\nFS is particularly relevant for data sets with large numbers of features; e.g., on the order of\nthousands as seen in some molecular biology [22] and text clustering applications [21].\n\nIn supervised learning, FS has been widely studied, with most methods falling into two\nclasses: \ufb01lters, which work independently of the subsequent learning algorithm; wrappers,\nwhich use the learning algorithm to evaluate feature subsets [12]. In contrast, FS has re-\nceived little attention in clustering, mainly because, without class labels, it is unclear how\nto assess feature relevance. The problem is even more dif\ufb01cult when the number of clusters\nis unknown, since the number of clusters and the best feature subset are inter-related [6].\n\nSome approaches to FS in clustering have been proposed. Of course, any method not\n\nEmail addresses: lawhiu@cse.msu.edu, jain@cse.msu.edu, mtf@lx.it.pt\n\nThis work was supported by the U.S. Of\ufb01ce of Naval Research, grant no. 00014-01-1-0266, and by\nthe Portuguese Foundation for Science and Technology, project POSI/33143/SRI/2000.\n\n\frelying on class labels (e.g., [16]) can be used. Dy and Brodley [6] suggested a heuristic to\ncompare feature subsets, using cluster separability. A Bayesian approach for multinomial\nmixtures was proposed in [21]; another Bayesian approach using a shrinkage prior was\nconsidered in [8]. 
Dash and Liu [4] assess the clustering tendency of each feature by an entropy index. A genetic algorithm was used in [11] for FS in k-means clustering. Talavera [19] addressed FS for symbolic data. Finally, Devaney and Ram [5] use a notion of "category utility" for FS in conceptual clustering, and Modha and Scott-Spangler [17] assign weights to feature groups with a score similar to Fisher discrimination.

In this paper, we introduce two new FS approaches for mixture-based clustering [10, 15]. The first is based on a feature saliency measure which is obtained by an EM algorithm; unlike most FS methods, this does not involve any explicit search. The second approach extends the mutual-information-based criterion of [13] to the unsupervised context; it is a wrapper, since FS is wrapped around a basic mixture estimation algorithm.

2 Finite Mixtures and the EM algorithm

Given $N$ i.i.d. samples $\mathcal{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$, the log-likelihood of a $k$-component mixture is

$$\log p(\mathcal{Y} \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \alpha_j \, p(\mathbf{y}_i \mid \theta_j), \qquad (1)$$

where: the $\alpha_j \geq 0$ are the mixing probabilities, with $\sum_{j=1}^{k} \alpha_j = 1$; each $\theta_j$ is the set of parameters of the $j$-th component; and $\theta = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\}$ is the full parameter set. Each $\mathbf{y}_i = [y_{i1}, \ldots, y_{id}]^T$ is a $d$-dimensional feature vector, and all components have the same form (e.g., Gaussian).

Neither maximum likelihood ($\widehat{\theta}_{\mathrm{ML}} = \arg\max_\theta \log p(\mathcal{Y} \mid \theta)$) nor maximum a posteriori ($\widehat{\theta}_{\mathrm{MAP}} = \arg\max_\theta \{\log p(\mathcal{Y} \mid \theta) + \log p(\theta)\}$) estimates can be found analytically. The usual choice is the EM algorithm, which finds local maxima of these criteria. Let $\mathcal{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ be a set of $N$ missing binary label vectors, where $\mathbf{z}_i = [z_{i1}, \ldots, z_{ik}]$, with $z_{ij} = 1$ meaning that $\mathbf{y}_i$ is a sample of $p(\cdot \mid \theta_j)$, and all other entries equal to zero. The complete log-likelihood is

$$\log p(\mathcal{Y}, \mathcal{Z} \mid \theta) = \sum_{i=1}^{N} \sum_{j=1}^{k} z_{ij} \log \left[ \alpha_j \, p(\mathbf{y}_i \mid \theta_j) \right]. \qquad (2)$$

EM produces a sequence of estimates $\{\widehat{\theta}(t), \; t = 0, 1, 2, \ldots\}$ using two alternating steps:
- E-step: Compute $W = E[\mathcal{Z} \mid \mathcal{Y}, \widehat{\theta}(t)]$ and plug it into $\log p(\mathcal{Y}, \mathcal{Z} \mid \theta)$, yielding the $Q$-function $Q(\theta, \widehat{\theta}(t)) = \log p(\mathcal{Y}, W \mid \theta)$. Since the elements of $\mathcal{Z}$ are binary, we have

$$w_{ij} \equiv E\left[ z_{ij} \mid \mathcal{Y}, \widehat{\theta}(t) \right] = \Pr\left[ z_{ij} = 1 \mid \mathbf{y}_i, \widehat{\theta}(t) \right] \propto \widehat{\alpha}_j(t) \, p(\mathbf{y}_i \mid \widehat{\theta}_j(t)), \qquad (3)$$

followed by normalization so that $\sum_{j=1}^{k} w_{ij} = 1$. Notice that $\alpha_j$ is the a priori probability that $z_{ij} = 1$ (i.e., that $\mathbf{y}_i$ belongs to cluster $j$), while $w_{ij}$ is the corresponding a posteriori probability, after observing $\mathbf{y}_i$.

- M-step: Update the parameter estimates, $\widehat{\theta}(t+1) = \arg\max_\theta \{ Q(\theta, \widehat{\theta}(t)) + \log p(\theta) \}$, in the case of MAP estimation, or without $\log p(\theta)$ in the ML case.
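As a concrete illustration of these two steps, the following is a minimal NumPy sketch (not from the paper) of one EM iteration for a Gaussian mixture; it restricts itself to diagonal covariances only for brevity, and the function and variable names are our own choices.

```python
import numpy as np

def em_step(Y, alpha, mu, var):
    """One EM iteration for a Gaussian mixture with diagonal covariances.

    Y     : (N, d) data matrix
    alpha : (k,)   mixing probabilities
    mu    : (k, d) component means
    var   : (k, d) per-feature variances
    Returns updated (alpha, mu, var) and the responsibilities w of shape (N, k).
    """
    N, d = Y.shape
    k = alpha.shape[0]

    # E-step: w[i, j] proportional to alpha_j * p(y_i | theta_j), as in (3)
    log_p = np.zeros((N, k))
    for j in range(k):
        log_p[:, j] = (np.log(alpha[j])
                       - 0.5 * np.sum(np.log(2 * np.pi * var[j]))
                       - 0.5 * np.sum((Y - mu[j]) ** 2 / var[j], axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
    w = np.exp(log_p)
    w /= w.sum(axis=1, keepdims=True)              # rows sum to one

    # M-step: weighted ML updates (no prior term, i.e. the ML case)
    Nj = w.sum(axis=0)                             # effective counts per component
    alpha = Nj / N
    mu = (w.T @ Y) / Nj[:, None]
    var = np.stack([(w[:, j:j + 1] * (Y - mu[j]) ** 2).sum(axis=0) / Nj[j]
                    for j in range(k)])
    return alpha, mu, var, w
```

Iterating such a step until the log-likelihood (1) stabilizes yields a local maximum of the ML criterion.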
3 A Mixture Model with Feature Saliency

In our first approach to FS, we assume conditionally independent features, given the component label (which in the Gaussian case corresponds to diagonal covariance matrices),

$$p(\mathbf{y} \mid \{\alpha_j\}, \{\theta_{jl}\}) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} p(y_l \mid \theta_{jl}), \qquad (4)$$

where $p(y_l \mid \theta_{jl})$ is the pdf of the $l$-th feature in the $j$-th component; in general, this could have any form, although we only consider Gaussian densities. In the sequel, we will use the indices $i$, $j$, and $l$ to run through data points, mixture components, and features, respectively.

Assume now that some features are irrelevant, in the following sense: if feature $l$ is irrelevant, then $p(y_l \mid \theta_{jl}) = q(y_l \mid \lambda_l)$, for $j = 1, \ldots, k$, where $q(y_l \mid \lambda_l)$ is the common (i.e., independent of $j$) density of feature $l$. Let $\Phi = (\phi_1, \ldots, \phi_d)$ be a set of binary parameters, such that $\phi_l = 1$ if feature $l$ is relevant and $\phi_l = 0$ otherwise; then,

$$p(\mathbf{y} \mid \Phi, \{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \left[ p(y_l \mid \theta_{jl}) \right]^{\phi_l} \left[ q(y_l \mid \lambda_l) \right]^{1 - \phi_l}. \qquad (5)$$

Our approach consists of: (i) treating the $\phi_l$'s as missing variables rather than as parameters; (ii) estimating $\rho_l = \Pr(\phi_l = 1)$ from the data, the probability that the $l$-th feature is useful, which we call its saliency. The resulting mixture model (see proof in [14]) is

$$p(\mathbf{y} \mid \theta) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \left[ \rho_l \, p(y_l \mid \theta_{jl}) + (1 - \rho_l) \, q(y_l \mid \lambda_l) \right]. \qquad (6)$$

The form of $q(\cdot \mid \lambda_l)$ reflects our prior knowledge about the distribution of the non-salient features. In principle, it can be any 1-D pdf (e.g., Gaussian or student-t); here we only consider $q(\cdot \mid \lambda_l)$ to be a Gaussian. Equation (6) has a generative interpretation. As in a standard finite mixture, we first select the component label $j$ by sampling from a multinomial distribution with parameters $(\alpha_1, \ldots, \alpha_k)$. Then, for each feature $l = 1, \ldots, d$, we flip a biased coin whose probability of getting a head is $\rho_l$; if we get a head, we use the mixture component $p(y_l \mid \theta_{jl})$ to generate the $l$-th feature; otherwise, the common component $q(y_l \mid \lambda_l)$ is used.

Given a set of observations $\mathcal{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$, the parameters can be estimated by the maximum likelihood criterion,

$$\widehat{\theta} = \arg\max_\theta \sum_{i=1}^{N} \log \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \left[ \rho_l \, p(y_{il} \mid \theta_{jl}) + (1 - \rho_l) \, q(y_{il} \mid \lambda_l) \right]. \qquad (7)$$

In the absence of a closed-form solution, an EM algorithm can be derived by treating both the $\mathbf{z}_i$'s and the $\phi_l$'s as missing data (see [14] for details).
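To make the generative reading of (6) concrete, here is a small sampling sketch under the Gaussian assumptions adopted in the paper; the helper name sample_saliency_mixture and its argument layout are our own, not part of the paper.

```python
import numpy as np

def sample_saliency_mixture(n, alpha, mu, var, rho, mu_common, var_common, rng=None):
    """Draw n points from the feature-saliency mixture of equation (6).

    alpha                 : (k,)   mixing probabilities
    mu, var               : (k, d) Gaussian parameters of p(y_l | theta_jl)
    rho                   : (d,)   feature saliencies
    mu_common, var_common : (d,)   parameters of the common density q(y_l | lambda_l)
    """
    rng = np.random.default_rng() if rng is None else rng
    k, d = mu.shape
    comp = rng.choice(k, size=n, p=alpha)          # component label for each point
    relevant = rng.random((n, d)) < rho            # biased coin, one per feature
    # a feature is drawn from its component density if the coin says "relevant" ...
    from_comp = rng.normal(mu[comp], np.sqrt(var[comp]))
    # ... and from the common density otherwise
    from_common = rng.normal(mu_common, np.sqrt(var_common), size=(n, d))
    return np.where(relevant, from_comp, from_common), comp
```

Setting some entries of rho close to zero makes the corresponding columns behave as pure "noise" features shared by all components, which is exactly the situation the saliency estimates are meant to detect.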
3.1 Model Selection

Standard EM for mixtures exhibits some weaknesses which also affect the EM algorithm just mentioned: it requires knowledge of $k$, and a good initialization is essential for reaching a good local optimum. To overcome these difficulties, we adopt the approach in [9], which is based on the MML criterion [23, 24]. The MML criterion for the proposed model (see details in [14]) consists of minimizing, with respect to $\theta$, the following cost function:

$$-\log p(\mathcal{Y} \mid \theta) + \frac{k + d}{2} \log N + \frac{R}{2} \sum_{j=1}^{k} \sum_{l=1}^{d} \log \left( N \alpha_j \rho_l \right) + \frac{S}{2} \sum_{l=1}^{d} \log \left( N (1 - \rho_l) \right), \qquad (8)$$

where $R$ and $S$ are the number of parameters in $\theta_{jl}$ and $\lambda_l$, respectively. If $p(\cdot \mid \theta_{jl})$ and $q(\cdot \mid \lambda_l)$ are univariate Gaussians (arbitrary mean and variance), $R = S = 2$. From a parameter estimation viewpoint, this is equivalent to a MAP estimate with conjugate (improper) Dirichlet-type priors on the $\alpha_j$'s and $\rho_l$'s (see details in [14]); thus, the EM algorithm undergoes a minor modification in the M-step, which still has a closed form.

The terms in equation (8), in addition to the log-likelihood, have simple interpretations. The term $\frac{k+d}{2} \log N$ is a standard MDL-type parameter code-length corresponding to the $k$ values of $\alpha_j$ and the $d$ values of $\rho_l$. For the $l$-th feature in the $j$-th component, the "effective" number of data points for estimating $\theta_{jl}$ is $N \alpha_j \rho_l$. Since there are $R$ parameters in each $\theta_{jl}$, the corresponding code-length is $\frac{R}{2} \log (N \alpha_j \rho_l)$. Similarly, for the $l$-th feature in the common component, the number of effective data points for estimation is $N (1 - \rho_l)$; thus, there is a term $\frac{S}{2} \log (N (1 - \rho_l))$ in (8) for each feature.

One key property of the EM algorithm for minimizing equation (8) is its pruning behavior, forcing some of the $\alpha_j$ to go to zero and some of the $\rho_l$ to go to zero or one. Worries that the message length in (8) may become invalid at these boundary values can be circumvented by the arguments in [9]. When $\rho_l$ goes to zero, the $l$-th feature is no longer salient, and $\rho_l$ and $\theta_{1l}, \ldots, \theta_{kl}$ are removed; when $\rho_l$ goes to one, the common-component parameters $\lambda_l$ are dropped.

Finally, since the model selection algorithm determines the number of components, it can be initialized with a large value of $k$, thus alleviating the need for a good initialization [9]. Because of this, as in [9], a component-wise version of EM [2] is adopted (see [14]).
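The sketch below evaluates the cost (8), as reconstructed here, for a given parameter estimate; it assumes the negative log-likelihood has already been computed from the model (6), and it omits the guards for the boundary cases discussed above. The function name and argument names are ours.

```python
import numpy as np

def mml_cost(neg_loglik, alpha, rho, N, R=2, S=2):
    """Message-length cost of equation (8) for the current estimate.

    neg_loglik : -log p(Y | theta) for the current parameters
    alpha      : (k,) mixing probabilities (assumed strictly positive here)
    rho        : (d,) feature saliencies (assumed strictly inside (0, 1) here)
    N          : number of data points
    R, S       : parameters per component density p(.|theta_jl) and per common
                 density q(.|lambda_l); both equal 2 for univariate Gaussians
    """
    k, d = alpha.shape[0], rho.shape[0]
    code_len = 0.5 * (k + d) * np.log(N)                            # alpha's and rho's
    code_len += 0.5 * R * np.sum(np.log(N * np.outer(alpha, rho)))  # each theta_jl
    code_len += 0.5 * S * np.sum(np.log(N * (1.0 - rho)))           # each lambda_l
    return neg_loglik + code_len
```

In practice the boundary values alpha_j = 0 and rho_l in {0, 1} produced by the pruning behavior have to be handled separately, as the paper notes via the arguments in [9].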
3.2 Experiments and Results

The first data set considered consists of 800 points from a mixture of 4 equiprobable 2-D Gaussians with distinct mean vectors and identity covariance matrices. Eight "noisy" features (sampled from a $\mathcal{N}(0,1)$ density) were appended to this data, yielding a set of 800 10-D patterns. The proposed algorithm was run 10 times, each run initialized with a large number of components; the common component is initialized to cover all data, and the feature saliencies are initialized at 0.5. In all the 10 runs, the 4 components were always identified. The saliencies of all the ten features, together with their standard deviations (error bars), are shown in Fig. 1. We conclude that, in this case, the algorithm successfully locates the clusters and correctly assigns the feature saliencies. See [14] for more details on this experiment.

Figure 1: Feature saliency for the 10-D, 4-component Gaussian mixture. Only the first two features are relevant. The error bars show ± one standard deviation.

In the next experiment, we consider Trunk's data [20], which has two 20-dimensional Gaussian classes with means $\boldsymbol{\mu}$ and $-\boldsymbol{\mu}$, where $\mu_l = 1/\sqrt{l}$ for $l = 1, \ldots, 20$, and identity covariance matrices. Data is obtained by sampling 5000 points from each of these two Gaussians. Note that these features have a descending order of relevance. As above, the initial $k$ is set to 30. In all the 10 runs performed, two components were always detected. The values of the feature saliencies are shown in Fig. 2. We see the general trend that as the feature number increases, the saliency decreases, following the true characteristics of the data.

Figure 2: Feature saliency for the Trunk data. The smaller the feature number, the more important is the feature.

Feature saliency values were also computed for the "wine" data set (available at the UCI repository at www.ics.uci.edu/~mlearn/MLRepository.html), consisting of 178 13-dimensional points in three classes. After standardizing all features to zero mean and unit variance, we applied the LNKnet supervised feature selection algorithm (available at www.ll.mit.edu/IST/lnknet/). The nine features selected by LNKnet are 7, 13, 1, 5, 10, 2, 12, 6, 9. Our feature saliency algorithm (with no class labels) yielded the values in Table 1.

Table 1: Feature saliency of wine data

  Feature   1     2     3     4     5     6     7     8     9     10    11    12    13
  Saliency  0.94  0.77  0.10  0.59  0.14  0.99  1.00  0.66  0.94  0.85  0.88  1.00  0.83
Ranking the features in descending order of saliency, we get the ordering: 7, 12, 6, 1, 9, 11, 10, 13, 2, 8, 4, 5, 3. The top 5 features (7, 12, 6, 1, 9) are all in the subset selected by LNKnet. If we skip the sixth feature (11), the following three features (10, 13, 2) were also selected by LNKnet. Thus we can see that, for this data set, our algorithm, though totally unsupervised, performs comparably with a supervised feature selection algorithm.

4 A Feature Selection Wrapper

Our second approach is more traditional in the sense that it selects a feature subset, instead of estimating feature saliency. The number of mixture components is assumed known a priori, though no restriction on the covariance of the Gaussian components is imposed.

4.1 Irrelevant Features and Conditional Independence

Assume that the class labels, $z$, and the full feature vector, $\mathbf{y}$, follow some joint probability function $p(z, \mathbf{y})$. In supervised learning [13], a feature subset $\mathbf{y}_{\overline{U}}$ is considered irrelevant if it is conditionally independent of the label $z$, given the remaining features $\mathbf{y}_U$, that is, if $p(z \mid \mathbf{y}_U, \mathbf{y}_{\overline{U}}) = p(z \mid \mathbf{y}_U)$. Here $\mathbf{y}$ is split into two subsets: "useful" features $\mathbf{y}_U$ and "non-useful" features $\mathbf{y}_{\overline{U}}$, where $\overline{U} \subseteq \{1, \ldots, d\}$ is the index set of the non-useful features. It is easy to show that this implies

$$p(z, \mathbf{y}_{\overline{U}} \mid \mathbf{y}_U) = p(z \mid \mathbf{y}_U) \; p(\mathbf{y}_{\overline{U}} \mid \mathbf{y}_U). \qquad (9)$$

To generalize this notion to unsupervised learning, we propose to let the expectations $w_{ij}$ (see (3)), a byproduct of the EM algorithm, play the role of the missing class labels. Recall that the $w_{ij}$ are posterior class probabilities, $\Pr[\text{class } j \mid \mathbf{y}_i, \widehat{\theta}]$. Consider the posterior probabilities based on all the features, and only on the useful features, respectively:

$$w_{ij} \propto \widehat{\alpha}_j \, p(\mathbf{y}_i \mid \widehat{\theta}_j), \qquad u_{ij} \propto \widehat{\alpha}_j \, p(\mathbf{y}_{iU} \mid \widehat{\theta}_{jU}), \qquad (10)$$

where $\mathbf{y}_{iU}$ is the subset of useful features of sample $\mathbf{y}_i$ (of course, the $w_{ij}$ and $u_{ij}$ have to be normalized such that $\sum_j w_{ij} = 1$ and $\sum_j u_{ij} = 1$). If $\mathbf{y}_{\overline{U}}$ is a completely irrelevant feature subset, then $u_{ij}$ equals $w_{ij}$ exactly, because of the conditional independence in (9), applied to (3). In practice, such features rarely exist, though they do exhibit different degrees of irrelevance. So we follow the suggestion in [13], and find the $\mathbf{y}_{\overline{U}}$ that gives $u_{ij}$ as close to $w_{ij}$ as possible. As both $w_{ij}$ and $u_{ij}$ are probabilities, a natural criterion for assessing their closeness is the expected value of the Kullback-Leibler divergence (KLD, [3]), which in our case is computed as a sample mean,

$$\mathrm{KLD}(\mathbf{y}_{\overline{U}}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} w_{ij} \log \frac{w_{ij}}{u_{ij}}. \qquad (11)$$

A low value of $\mathrm{KLD}(\mathbf{y}_{\overline{U}})$ indicates that the features in $\overline{U}$ are "almost" conditionally independent of the expected class labels, given the features in $U$.
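The sketch below computes the criterion (11) for a candidate set of useful features; for brevity it assumes the diagonal-Gaussian components of Section 3, whereas the wrapper itself places no restriction on the covariances. The helper names posteriors and mean_kld are ours.

```python
import numpy as np

def posteriors(Y, alpha, mu, var, features):
    """Posterior class probabilities using only the given feature indices
    (diagonal-Gaussian mixture, as a simplifying assumption)."""
    k = alpha.shape[0]
    log_p = np.zeros((Y.shape[0], k))
    for j in range(k):
        log_p[:, j] = (np.log(alpha[j])
                       - 0.5 * np.sum(np.log(2 * np.pi * var[j, features]))
                       - 0.5 * np.sum((Y[:, features] - mu[j, features]) ** 2
                                      / var[j, features], axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    w = np.exp(log_p)
    return w / w.sum(axis=1, keepdims=True)

def mean_kld(w_full, w_subset, eps=1e-12):
    """Sample-mean Kullback-Leibler divergence of equation (11)."""
    return np.mean(np.sum(w_full * np.log((w_full + eps) / (w_subset + eps)), axis=1))
```

Comparing mean_kld for different candidate subsets is what drives the backward search described next.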
In practice, we start by obtaining reasonable initial estimates of $\{w_{ij}\}$ by running EM using all the features, and set $\overline{U} = \emptyset$. At each stage, we find the feature whose addition to $\overline{U}$ yields the smallest $\mathrm{KLD}(\mathbf{y}_{\overline{U}})$ and add it to $\overline{U}$. EM is then run again, using the features not in $\overline{U}$, to update the posterior probabilities $\{w_{ij}\}$. The process is then repeated until only one feature remains, in what can be considered as a backward search algorithm that yields a sorting of the features by decreasing order of irrelevance.

4.2 The assignment entropy

Given a method to sort the features in order of relevance, we now require a method to measure how good each subset is. Unlike in supervised learning, we cannot resort to classification accuracy. We adopt the criterion that a clustering is good if the clusters are "crisp", i.e., if, for every $i$, $w_{ij} \simeq 1$ for some $j$. A natural way to formalize this is to consider the mean entropy of the $\{w_{ij}\}$; that is, the clustering is considered to be good if

$$H = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} w_{ij} \log w_{ij}$$

is small. In the sequel, we call $H$ "the entropy of the assignment". An important characteristic of the entropy is that it cannot increase when more features are used (because, for any random variables $A$ and $B$, $H(A \mid B) \leq H(A)$, a fundamental inequality of information theory [3]; note that $H$ is a conditional entropy of the assignment given the features used). Moreover, $H$ exhibits a diminishing-returns behavior (decreasing abruptly as the most relevant features are included, but changing little when less relevant features are used). Our empirical results show that $H$ indeed has a strong relationship with the quality of the clusters. Of course, during the backward search, one can also consider picking the next feature whose removal least increases $H$, rather than the one yielding the smallest KLD; both options are explored in the experiments. Finally, we mention that other minimum-entropy-type criteria have been recently used for clustering [7], [18], but not for feature selection.
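To tie the two criteria together, here is a rough sketch of the backward elimination loop; fit_em is an assumed helper that runs EM on a column subset and returns the mixture parameters, and posteriors and mean_kld refer to the sketch after (11). This is an illustration of the search scheme under those assumptions, not the authors' code.

```python
import numpy as np

def assignment_entropy(w, eps=1e-12):
    """Mean entropy H of the posterior assignments w (N, k); small values
    correspond to 'crisp' clusterings."""
    return -np.mean(np.sum(w * np.log(w + eps), axis=1))

def backward_search(Y, k, fit_em):
    """Backward elimination over features, in the spirit of Section 4.1.

    fit_em(Y_sub, k) -> (alpha, mu, var) is an assumed helper running EM on
    the columns Y_sub. Returns the features sorted by decreasing irrelevance.
    """
    useful = list(range(Y.shape[1]))
    removed = []
    while len(useful) > 1:
        alpha, mu, var = fit_em(Y[:, useful], k)       # re-fit on surviving features
        all_idx = list(range(len(useful)))
        w_full = posteriors(Y[:, useful], alpha, mu, var, all_idx)
        # Score each candidate by the KLD of the posteriors computed without it;
        # the entropy heuristic would instead score it by assignment_entropy of
        # those reduced posteriors.
        scores = [mean_kld(w_full,
                           posteriors(Y[:, useful], alpha, mu, var,
                                      [i for i in all_idx if i != c]))
                  for c in all_idx]
        worst = int(np.argmin(scores))                 # least useful feature
        removed.append(useful.pop(worst))
    return removed + useful
```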
4.3 Experiments

We have conducted experiments on data sets commonly used for supervised learning tasks. Since we are doing unsupervised learning, the class labels are, of course, withheld and only used for evaluation. The two heuristics for selecting the next feature to be removed (based on minimum KLD and minimum entropy) are considered in different runs. To assess clustering quality, we assign each data point to the Gaussian component that most likely generated it and then compare this labelling with the ground truth. Table 2 summarizes the characteristics of the data sets for which results are reported here (all available from the UCI repository); we have also performed tests on other data sets, achieving similar results.

Table 2: Some details of the data sets (WBC stands for Wisconsin breast cancer).

  Name              cover type   image segmentation   WBC   wine
  No. points used   2000         1000                 569   178
  No. of features   10           18                   30    13
  No. of classes    4            7                    2     3

The experimental results shown in Fig. 3 reveal that the general trend of the error rate agrees well with $H$. The error rates either have a minimum close to the "knee" of the $H$ curve, or the curve becomes flat. The two heuristics for selecting the feature to be removed perform comparably. For the cover type data set, the KLD heuristic yields lower error rates than the one based on $H$, while the contrary happens for the image segmentation and WBC data sets.
Figure 3: (a) and (b): cover type; (c) and (d): image segmentation; (e) and (f): WBC; (g) and (h): wine. Feature removal by minimum KLD (left column) and minimum $H$ (right column). Solid lines: error rates; dotted lines: $H$. Error bars correspond to ± one standard deviation over 10 runs.

5 Concluding Remarks and Future Work

The two approaches for unsupervised feature selection proposed herein have different advantages and drawbacks. The first approach avoids explicit feature search and does not require a pre-specified number of clusters; however, it assumes that the features are conditionally independent, given the components. The second approach places no restriction on the covariances, but it does assume knowledge of the number of components. We believe that both approaches can be useful in different scenarios, depending on which set of assumptions fits the given data better.

Several issues require further work: weakly relevant features (in the sense of [12]) are not removed by the first algorithm, while the second approach relies on a good initial clustering. Overcoming these problems will make the methods more generally applicable. We also need to investigate the scalability of the proposed algorithms; ideas such as those in [1] can be exploited.

References

[1] P. Bradley, U. Fayyad, and C. Reina. Clustering very large databases using EM mixture models. In Proc. 15th Intern. Conf. on Pattern Recognition, pp. 76-80, 2000.
[2] G. Celeux, S. Chrétien, F. Forbes, and A. Mkhadri. A component-wise EM algorithm for mixtures. Journal of Computational and Graphical Statistics, 10:699-712, 2001.
[3] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[4] M. Dash and H. Liu. Feature selection for clustering. In Proc. of Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 110-121, 2000.
[5] M. Devaney and A. Ram. Efficient feature selection in conceptual clustering. In Proc. ICML'1997, pp. 92-97, 1997.
[6] J. Dy and C. Brodley. Feature subset selection and order identification for unsupervised learning. In Proc. ICML'2000, pp. 247-254, 2000.
[7] E. Gokcay and J. Principe. Information theoretic clustering. IEEE Trans. on PAMI, 24(2):158-171, 2002.
[8] P. Gustafson, P. Carbonetto, N. Thompson, and N. de Freitas. Bayesian feature weighting for unsupervised learning, with application to object recognition. In Proc. of the 9th Intern. Workshop on Artificial Intelligence and Statistics, 2003.
[9] M. Figueiredo and A. Jain. Unsupervised learning of finite mixture models. IEEE Trans. on PAMI, 24(3):381-396, 2002.
[10] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[11] Y. Kim, W. Street, and F. Menczer. Feature selection in unsupervised learning via evolutionary search. In Proc. ACM SIGKDD, pp. 365-369, 2000.
[12] R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324, 1997.
[13] D. Koller and M. Sahami. Toward optimal feature selection. In Proc. ICML'1996, pp. 284-292, 1996.
[14] M. Law, M. Figueiredo, and A. Jain. Feature Saliency in Unsupervised Learning. Tech. Rep., Dept. of Computer Science and Eng., Michigan State Univ., 2002. Available at http://www.cse.msu.edu/~lawhiu/papers/TR02.ps.gz.
[15] G. McLachlan and K. Basford. Mixture Models: Inference and Application to Clustering. Marcel Dekker, New York, 1988.
[16] P. Mitra and C. A. Murthy. Unsupervised feature selection using feature similarity. IEEE Trans. on PAMI, 24(3):301-312, 2002.
[17] D. Modha and W. Scott-Spangler. Feature weighting in k-means clustering. Machine Learning, 2002. To appear.
[18] S. Roberts, C. Holmes, and D. Denison. Minimum-entropy data partitioning using RJ-MCMC. IEEE Trans. on PAMI, 23(8):909-914, 2001.
[19] L. Talavera. Dependency-based feature selection for clustering symbolic data. Intelligent Data Analysis, 4:19-28, 2000.
[20] G. Trunk. A problem of dimensionality: A simple example. IEEE Trans. on PAMI, 1(3):306-307, 1979.
[21] S. Vaithyanathan and B. Dom. Generalized model selection for unsupervised learning in high dimensions. In S. Solla, T. Leen, and K. Muller, eds., Proc. of NIPS'12. MIT Press, 2000.
[22] E. Xing, M. Jordan, and R. Karp. Feature selection for high-dimensional genomic microarray data. In Proc. ICML'2001, pp. 601-608, 2001.
[23] C. Wallace and P. Freeman. Estimation and inference via compact coding. Journal of the Royal Statistical Society (B), 49(3):241-252, 1987.
[24] C. S. Wallace and D. L. Dowe. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Statistics and Computing, 10:73-83, 2000.