{"title": "Variational Information Maximization for Feature Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 487, "page_last": 495, "abstract": "Feature selection is one of the most fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection exists which is based on maximizing mutual information between subsets of features and class labels. Practical methods are forced to rely on approximations due to the difficulty of estimating mutual information. We demonstrate that approximations made by existing methods are based on unrealistic assumptions. We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.", "full_text": "Variational Information Maximization for Feature Selection\n\nShuyang Gao, Greg Ver Steeg, Aram Galstyan\nUniversity of Southern California, Information Sciences Institute\ngaos@usc.edu, gregv@isi.edu, galstyan@isi.edu\n\nAbstract\n\nFeature selection is one of the most fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection exists which is based on maximizing mutual information between subsets of features and class labels. Practical methods are forced to rely on approximations due to the difficulty of estimating mutual information. We demonstrate that approximations made by existing methods are based on unrealistic assumptions. 
We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.\n\n1 Introduction\n\nFeature selection is one of the fundamental problems in machine learning research [1, 2]. A central difficulty is that many datasets contain a large number of features that are either irrelevant or redundant for the task at hand. In these cases, it is often advantageous to pick a smaller subset of features to avoid over-fitting, to speed up computation, or simply to improve the interpretability of the results.\n\nFeature selection approaches are usually categorized into three groups: wrapper, embedded and filter [3, 4, 5]. The first two methods, wrapper and embedded, are considered classifier-dependent, i.e., the selection of features somehow depends on the classifier being used. Filter methods, on the other hand, are classifier-independent and define a scoring function between features and labels in the selection process.\n\nBecause filter methods may be employed in conjunction with a wide variety of classifiers, it is important that the scoring function of these methods is as general as possible. 
Since mutual information (MI) is a general measure of dependence with several unique properties [6], many MI-based scoring functions have been proposed as filter methods [7, 8, 9, 10, 11, 12]; see [5] for an exhaustive list.\n\nOwing to the difficulty of estimating mutual information in high dimensions, most existing MI-based feature selection methods are based on various low-order approximations for mutual information. While those approximations have been successful in certain applications, they are heuristic in nature and lack theoretical guarantees. In fact, as we demonstrate in Sec. 2.2, a large family of approximate methods is based on two assumptions that are mutually inconsistent.\n\nTo address the above shortcomings, in this paper we introduce a novel feature selection method based on a variational lower bound on mutual information; a similar bound was previously studied within the Infomax learning framework [13]. We show that instead of maximizing the mutual information, which is intractable in high dimensions (hence the introduction of many heuristics), we can maximize a lower bound on the MI with the proper choice of tractable variational distributions. We use this lower bound to define an objective function and derive a forward feature selection algorithm. We provide a rigorous proof that the forward feature selection is optimal under tree graphical models by choosing an appropriate variational distribution. This is in contrast with previous information-theoretic feature selection methods which lack any performance guarantees. We also conduct empirical validation on various datasets and demonstrate that the proposed approach outperforms state-of-the-art information-theoretic feature selection methods.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn Sec. 2 we introduce general MI-based feature selection methods and discuss their limitations. Sec. 
3 introduces the variational lower bound on mutual information and proposes two specific variational distributions. In Sec. 4, we report results from our experiments, and compare the proposed approach with existing methods.\n\n2 Information-Theoretic Feature Selection Background\n\n2.1 Mutual Information-Based Feature Selection\n\nConsider a supervised learning scenario where x = {x_1, x_2, ..., x_D} is a D-dimensional input feature vector, and y is the output label. In filter methods, the mutual information-based feature selection task is to select T features x_{S*} = {x_{f_1}, x_{f_2}, ..., x_{f_T}} such that the mutual information between x_{S*} and y is maximized. Formally,\n\nS* = arg max_S I(x_S : y)  s.t. |S| = T    (1)\n\nwhere I(\u00b7) denotes the mutual information [6].\n\nForward Sequential Feature Selection. Maximizing the objective function in Eq. 1 is generally NP-hard. Many MI-based feature selection methods adopt a greedy method, where features are selected incrementally, one feature at a time. Let S_{t-1} = {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}} be the selected feature set after time step t - 1. According to the greedy method, the next feature f_t at step t is selected such that\n\nf_t = arg max_{i \u2209 S_{t-1}} I(x_{S_{t-1} \u222a i} : y)    (2)\n\nwhere x_{S_{t-1} \u222a i} denotes x\u2019s projection into the feature space S_{t-1} \u222a i. As shown in [5], the mutual information term in Eq. 2 can be decomposed as:\n\nI(x_{S_{t-1} \u222a i} : y) = I(x_{S_{t-1}} : y) + I(x_i : y | x_{S_{t-1}})\n= I(x_{S_{t-1}} : y) + I(x_i : y) - I(x_i : x_{S_{t-1}}) + I(x_i : x_{S_{t-1}} | y)\n= I(x_{S_{t-1}} : y) + I(x_i : y) - (H(x_{S_{t-1}}) - H(x_{S_{t-1}} | x_i)) + (H(x_{S_{t-1}} | y) - H(x_{S_{t-1}} | x_i, y))    (3)\n\nwhere H(\u00b7) denotes the entropy [6]. Omitting the terms that do not depend on x_i in Eq. 3, we can rewrite Eq. 
2 as follows:\n\nf_t = arg max_{i \u2209 S_{t-1}} [ I(x_i : y) + H(x_{S_{t-1}} | x_i) - H(x_{S_{t-1}} | x_i, y) ]    (4)\n\nThe greedy learning algorithm has been analyzed in [14].\n\n2.2 Limitations of Previous MI-Based Feature Selection Methods\n\nEstimating high-dimensional information-theoretic quantities is a difficult task. Therefore, most MI-based feature selection methods propose low-order approximations to H(x_{S_{t-1}} | x_i) and H(x_{S_{t-1}} | x_i, y) in Eq. 4. A general family of methods relies on the following approximations [5]:\n\nH(x_{S_{t-1}} | x_i) \u2248 \u2211_{k=1}^{t-1} H(x_{f_k} | x_i)\nH(x_{S_{t-1}} | x_i, y) \u2248 \u2211_{k=1}^{t-1} H(x_{f_k} | x_i, y)    (5)\n\nThe approximations in Eq. 5 become exact under the following two assumptions [5]:\n\nAssumption 1. (Feature Independence Assumption) p(x_{S_{t-1}} | x_i) = \u220f_{k=1}^{t-1} p(x_{f_k} | x_i)\nAssumption 2. (Class-Conditioned Independence Assumption) p(x_{S_{t-1}} | x_i, y) = \u220f_{k=1}^{t-1} p(x_{f_k} | x_i, y)\n\nAssumption 1 and Assumption 2 mean that the selected features are independent and class-conditionally independent, respectively, given the unselected feature x_i under consideration.\n\nFigure 1: The first two graphical models show the assumptions of traditional MI-based feature selection methods. The third graphical model shows a scenario in which both Assumption 1 and Assumption 2 are true. A dashed line indicates that there may or may not be a correlation between two variables. Panel titles: Assumption 1; Assumption 2; Satisfying both Assumption 1 and Assumption 2.\n\nWe now demonstrate that the two assumptions cannot be valid simultaneously unless the data has a very specific (and unrealistic) structure. Indeed, consider the graphical models consistent with either assumption, as illustrated in Fig. 1. If Assumption 1 holds true, then x_i is the only common cause of the previously selected features S_{t-1} = {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}}, so that those features become independent when conditioned on x_i. 
On the other hand, if Assumption 2 holds, then the features depend both on x_i and class label y; therefore, generally speaking, the distribution over those features does not factorize by solely conditioning on x_i, because there will be remnant dependencies due to y. Thus, if Assumption 2 is true, then Assumption 1 cannot be true in general, unless the data is generated according to the very specific model shown in the rightmost panel of Fig. 1. Note, however, that in this case, x_i becomes the most important feature because I(x_i : y) > I(x_{S_{t-1}} : y); then we should have selected x_i at the very first step, contradicting the feature selection process.\n\nAs we mentioned above, most existing methods implicitly or explicitly adopt both assumptions or their stronger versions, as shown in [5], including mutual information maximization (MIM) [15], joint mutual information (JMI) [8], conditional mutual information maximization (CMIM) [9], maximum relevance minimum redundancy (mRMR) [10], conditional Infomax feature extraction (CIFE) [16], etc. Approaches based on global optimization of mutual information, such as quadratic programming feature selection (QPFS) [11] and the state-of-the-art conditional mutual information-based spectral method (SPECCMI) [12], are derived from the previous greedy methods and therefore also implicitly rely on those two assumptions.\n\nIn the next section we address these issues by introducing a novel information-theoretic framework for feature selection. Instead of estimating mutual information and making mutually inconsistent assumptions, our framework formulates a tractable variational lower bound on mutual information, which allows a more flexible and general class of assumptions via appropriate choices of variational distributions.\n\n3 Method\n\n3.1 Variational Mutual Information Lower Bound\n\nLet p(x, y) be the joint distribution of input (x) and output (y) variables. 
Barber & Agakov [13] derived the following lower bound for mutual information I(x : y) by using the non-negativity of the KL-divergence, i.e., \u2211_x p(x|y) log [p(x|y) / q(x|y)] \u2265 0 gives:\n\nI(x : y) \u2265 H(x) + \u27e8ln q(x|y)\u27e9_{p(x,y)}    (6)\n\nwhere angled brackets represent averages and q(x|y) is an arbitrary variational distribution. This bound becomes exact if q(x|y) \u2261 p(x|y).\n\nIt is worthwhile to note that in the context of unsupervised representation learning, p(y|x) and q(x|y) can be viewed as an encoder and a decoder, respectively. In this case, y needs to be learned by maximizing the lower bound in Eq. 6 by iteratively adjusting the parameters of the encoder and decoder, as in [13, 17].\n\n3.2 Variational Information Maximization for Feature Selection\n\nNaturally, in terms of information-theoretic feature selection, we could also try to optimize the variational lower bound in Eq. 6 by choosing a subset of features S* in x, such that\n\nS* = arg max_S { H(x_S) + \u27e8ln q(x_S|y)\u27e9_{p(x_S,y)} }    (7)\n\nHowever, the H(x_S) term on the RHS of Eq. 7 is still intractable when x_S is very high-dimensional.\n\nNonetheless, by noticing that variable y is the class label, which is usually discrete, and hence H(y) is fixed and tractable, by symmetry we switch x and y in Eq. 6 and rewrite the lower bound as follows:\n\nI(x : y) \u2265 H(y) + \u27e8ln q(y|x)\u27e9_{p(x,y)} = \u27e8ln [q(y|x) / p(y)]\u27e9_{p(x,y)}    (8)\n\nThe equality in Eq. 8 is obtained by noticing that H(y) = \u27e8-ln p(y)\u27e9_{p(y)}.\n\nBy using Eq. 8, the lower bound optimal subset S* of x becomes:\n\nS* = arg max_S { \u27e8ln [q(y|x_S) / p(y)]\u27e9_{p(x_S,y)} }    (9)\n\n3.2.1 Choice of Variational Distribution\n\nq(y|x_S) in Eq. 9 can be any distribution as long as it is normalized. We need to choose q(y|x_S) to be as general as possible while still keeping the term \u27e8ln q(y|x_S)\u27e9_{p(x_S,y)} tractable in Eq. 9.\n\nAs a result, we set q(y|x_S) as\n\nq(y|x_S) = q(x_S, y) / q(x_S) = [q(x_S|y) p(y)] / [\u2211_{y'} q(x_S|y') p(y')]    (10)\n\nWe can verify that Eq. 10 is normalized even if q(x_S|y) is not normalized.\n\nIf we further denote\n\nq(x_S) = \u2211_{y'} q(x_S|y') p(y')    (11)\n\nthen by combining Eqs. 9 and 10, we get\n\nI(x_S : y) \u2265 \u27e8ln [q(x_S|y) / q(x_S)]\u27e9_{p(x_S,y)} \u2261 I_LB(x_S : y)    (12)\n\nWe also have the following equation, which shows the gap between I(x_S : y) and I_LB(x_S : y):\n\nI(x_S : y) - I_LB(x_S : y) = \u27e8KL(p(y|x_S) || q(y|x_S))\u27e9_{p(x_S)}    (13)\n\nAuto-Regressive Decomposition. Now that q(y|x_S) is defined, all we need to do is model q(x_S|y) under Eq. 10, and q(x_S) is easy to compute based on q(x_S|y). Here we decompose q(x_S|y) as an auto-regressive distribution assuming T features in S:\n\nq(x_S|y) = q(x_{f_1}|y) \u220f_{t=2}^{T} q(x_{f_t} | x_{f_{<t}}, y)    (14)\n\nwhere x_{f_{<t}} denotes {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}}. The graphical model in Fig. 2 demonstrates this decomposition. The main advantage of this model is that it is well-suited for the forward feature selection procedure where one feature is selected at a time (which we will explain in Sec. 3.2.3). And if q(x_{f_t} | x_{f_{<t}}, y) is tractable, then so is the whole distribution q(x_S|y). Therefore, we would find tractable Q-distributions over q(x_{f_t} | x_{f_{<t}}, y). Below we illustrate two such Q-distributions.\n\nFigure 2: Auto-regressive decomposition for q(x_S|y)\n\nNaive Bayes Q-distribution. A natural idea would be to assume x_t is independent of other variables given y, i.e.,\n\nq(x_{f_t} | x_{f_{<t}}, y) = p(x_{f_t}|y)    (15)\n\nThen the variational distribution q(y|x_S) can be written based on Eqs. 10 and 15 as follows:\n\nq(y|x_S) = [p(y) \u220f_{j \u2208 S} p(x_j|y)] / [\u2211_{y'} p(y') \u220f_{j \u2208 S} p(x_j|y')]    (16)\n\nWe also have the following theorem:\n\nTheorem 3.1 (Exact Naive Bayes). Under Eq. 16, the lower bound in Eq. 8 becomes exact if and only if data is generated by a Naive Bayes model, i.e., p(x, y) = p(y) \u220f_i p(x_i|y).\n\nThe proof for Theorem 3.1 becomes obvious by using the mutual information definition. Note that the most-cited MI-based feature selection method mRMR [10] also assumes conditional independence given the class label y as shown in [5, 18, 19], but it makes additional stronger independence assumptions among only feature variables.\n\nPairwise Q-distribution. We now consider an alternative approach that is more general than the Naive Bayes distribution:\n\nq(x_{f_t} | x_{f_{<t}}, y) = ( \u220f_{i=1}^{t-1} p(x_{f_t} | x_{f_i}, y) )^{1/(t-1)}    (17)\n\nIn Eq. 17, we assume q(x_{f_t} | x_{f_{<t}}, y) to be the geometric mean of the conditional distributions p(x_{f_t} | x_{f_i}, y). This assumption is tractable as well as reasonable because if the data is generated by a Naive Bayes model, the lower bound in Eq. 8 also becomes exact using Eq. 17 due to p(x_{f_t} | x_{f_i}, y) \u2261 p(x_{f_t}|y) in that case.\n\n3.2.2 Estimating Lower Bound From Data\n\nAssuming either Naive Bayes Q-distribution or pairwise Q-distribution, it is convenient to estimate q(x_S|y) and q(x_S) in Eq. 12 by using plug-in probability estimators for discrete data or one/two-dimensional density estimators for continuous data. We also use the sample mean to approximate the expectation term in Eq. 12. Our final estimator for I_LB(x_S : y) is written as follows:\n\n\u00ce_LB(x_S : y) = (1/N) \u2211_{x^{(k)}, y^{(k)}} ln [ q\u0302(x_S^{(k)} | y^{(k)}) / q\u0302(x_S^{(k)}) ]    (18)\n\nwhere x^{(k)}, y^{(k)} are samples from data, and q\u0302(\u00b7) denotes the estimate for q(\u00b7).\n\n3.2.3 Variational Forward Feature Selection Under Auto-Regressive Decomposition\n\nAfter defining q(y|x_S) in Eq. 10 and the auto-regressive decomposition of q(x_S|y) in Eq. 14, we are able to do the forward feature selection previously described in Eq. 2, but replace the mutual information with its lower bound \u00ce_LB. 
Recall that S_{t-1} is the set of selected features after step t - 1; the feature f_t is then selected at step t such that\n\nf_t = arg max_{i \u2209 S_{t-1}} \u00ce_LB(x_{S_{t-1} \u222a i} : y)    (19)\n\nwhere \u00ce_LB(x_{S_{t-1} \u222a i} : y) can be obtained from \u00ce_LB(x_{S_{t-1}} : y) recursively by the auto-regressive decomposition q(x_{S_{t-1} \u222a i}|y) = q(x_{S_{t-1}}|y) q(x_i|x_{S_{t-1}}, y), where q(x_{S_{t-1}}|y) is stored at step t - 1.\n\nThis forward feature selection can be done under the auto-regressive decomposition in Eqs. 10 and 14 for any Q-distribution. However, calculating q(x_i|x_{S_t}, y) may vary according to different Q-distributions. We can verify that it is easy to get q(x_i|x_{S_t}, y) recursively from q(x_i|x_{S_{t-1}}, y) under the Naive Bayes or pairwise Q-distribution. We call our algorithm under these two Q-distributions VMI naive and VMI pairwise, respectively.\n\nIt is worthwhile noting that the lower bound does not always increase at each step. A decrease in the lower bound at step t indicates that the Q-distribution would approximate the underlying distribution worse than it did at the previous step t - 1. In this case, the algorithm would re-maximize the lower bound from zero with only the remaining unselected features. We summarize the concrete implementation of our algorithms in supplementary Sec. A.\n\nTime Complexity. Although our algorithm needs to calculate the distributions at each step, we only need to calculate the probability value at each sample point. For both VMI naive and VMI pairwise, the total computational complexity is O(NDT), where N is the number of samples, D is the total number of features, and T is the number of final selected features. The detailed time analysis is left for the supplementary Sec. A. 
As shown in Table 1, our methods VMI naive and VMI pairwise have the same time complexity as mRMR [10], while the state-of-the-art global optimization method SPECCMI [12] is required to precompute the pairwise mutual information matrix, which gives a time complexity of O(ND^2).\n\nTable 1: Time complexity in number of features D, number of selected features T, and number of samples N\n\nMethod | mRMR | VMI naive | VMI pairwise | SPECCMI\nComplexity | O(NDT) | O(NDT) | O(NDT) | O(ND^2)\n\nOptimality Under Tree Graphical Models. Although our method VMI naive assumes a Naive Bayes model, we can prove that this method is still optimal if the data is generated according to tree graphical models. Indeed, both of our methods, VMI naive and VMI pairwise, will always prioritize the first layer features, as shown in Fig. 3. This optimality is summarized in Theorem B.1 in supplementary Sec. B.\n\n4 Experiments\n\nSynthetic Data. We begin with the experiments on a synthetic model according to the tree structure illustrated in the left part of Fig. 3. The detailed data generating process is shown in supplementary Sec. D. The root node Y is a binary variable, while other variables are continuous. We use VMI naive to optimize the lower bound I_LB(x : y). 5000 samples are used to generate the synthetic data, and variational Q-distributions are estimated by the kernel density estimator. We can see from the plot in the right-hand part of Fig. 3 that our algorithm, VMI naive, selects x_1, x_2, x_3 as the first three features, although x_2 and x_3 are only weakly correlated with y. If we continue to add deeper level features {x_4, ..., x_9}, the lower bound will decrease. For comparison, we also illustrate the mutual information between each single feature x_i and y in Table 2. 
We can see from Table 2 that the maximum relevance criterion [15] would choose x_1, x_4 and x_5 as the top three features.\n\nFigure 3: (Left) This is the generative model used for synthetic experiments. Edge thickness represents the relationship strength. (Right) Optimizing the lower bound by VMI naive. Variables under the blue line denote the features selected at each step. Dotted blue line shows the decreasing lower bound if adding more features. Ground-truth mutual information is obtained using N = 100,000 samples.\n\nfeature i | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | x_9\nI(x_i : y) | 0.111 | 0.052 | 0.022 | 0.058 | 0.058 | 0.025 | 0.029 | 0.012 | 0.013\n\nTable 2: Mutual information between label y and each feature x_i for Fig. 3. I(x_i : y) is estimated using N = 100,000 samples. The three variables with the highest mutual information are x_1, x_4 and x_5.\n\nReal-World Data. We compare our algorithms VMI naive and VMI pairwise with other popular information-theoretic feature selection methods, including mRMR [10], JMI [8], MIM [15], CMIM [9], CIFE [16], and SPECCMI [12]. We use 17 well-known datasets in previous feature selection studies [5, 12] (all data are discretized). The dataset summaries are given in supplementary Sec. C. We use the average cross-validation error rate on the range of 10 to 100 features to compare different algorithms under the same setting as [12]. Tenfold cross-validation is employed for datasets with number of samples N \u2265 100 and leave-one-out cross-validation otherwise. The 3-nearest-neighbor classifier is used for Gisette and Madelon, following [5]. For the remaining datasets, the chosen classifier is Linear SVM, following [11, 12].\n\nThe experimental results can be seen in Table 3 (see footnote 1). The entries with \u21e4 and \u21e4\u21e4 indicate the best performance and the second best performance, respectively (in terms of average error rate). 
We also use the paired t-test at the 5% significance level to test the hypothesis that VMI naive or VMI pairwise performs significantly better than other methods, or vice versa. Overall, we find that both of our methods, VMI naive and VMI pairwise, strongly outperform other methods. This indicates that our variational feature selection framework is a promising addition to the current literature of information-theoretic feature selection.\n\nFigure 4: Number of selected features versus average cross-validation error in datasets Semeion and Gisette.\n\nFootnote 1: We omit the results for MIM and CIFE due to space limitations. The complete results are shown in the supplementary Sec. C.\n\nLeukemia\nLymphoma\n\nTable 3: Average cross-validation error rate comparison of VMI against other methods. The last two lines indicate win (W)/tie (T)/loss (L) for VMI naive and VMI pairwise respectively.\nDataset\nVMI naive VMI pairwise\nSPECCMI\n7.4\u00b1(3.6)\u21e4\nLung\n14.5\u00b1(6.0)\n11.6\u00b1(5.6)\n11.2\u00b1(2.7)\u21e4\n11.9\u00b1(1.7)\u21e4\u21e4\nColon\n16.1\u00b1(2.0)\n0.2\u00b1(0.5)\u21e4\u21e4\n0.0\u00b1(0.1)\u21e4\n1.8\u00b1(1.3)\n5.2\u00b1(3.1)\u21e4\u21e4\n3.7\u00b1(1.9)\u21e4\n12.0\u00b1(6.6)\n13.7\u00b1(0.5)\u21e4\u21e4 13.7\u00b1(0.5)\u21e4\u21e4 13.7\u00b1(0.5)\u21e4\u21e4\n18.8\u00b1(0.8)\u21e4\n18.8\u00b1(1.0)\u21e4\u21e4\n21.0\u00b1(3.5)\n15.9\u00b1(0.6)\u21e4\u21e4 
15.9\u00b1(0.6)\u21e4\u21e4\n15.9\u00b1(0.5)\u21e4\n5.1\u00b1(0.7)\u21e4\u21e4\n5.1\u00b1(0.6)\u21e4\n5.3\u00b1(0.5)\n12.0\u00b1(1.0)\u21e4\n12.7\u00b1(1.9)\u21e4\u21e4\n16.8\u00b1(1.6)\n14.0\u00b1(4.0)\u21e4\n14.5\u00b1(3.9)\u21e4\u21e4\n26.0\u00b1(9.3)\n3.0\u00b1(1.1)\u21e4\n3.5\u00b1(1.1)\u21e4\u21e4\n4.8\u00b1(3.0)\n7.2\u00b1(2.5)\u21e4\n9.2\u00b1(6.0)\n7.6\u00b1(3.6)\n12.6\u00b1(0.5)\u21e4\u21e4\n12.8\u00b1(0.6)\n15.1\u00b1(1.8)\n6.6\u00b1(0.3)\u21e4\n6.6\u00b1(0.3)\u21e4\n9.0\u00b1(2.3)\n20.4\u00b1(3.1)\u21e4\n21.2\u00b1(3.9)\u21e4\u21e4\n24.0\u00b1(3.7)\n4.8\u00b1(0.9)\u21e4\u21e4\n4.2\u00b1(0.8)\u21e4\n7.1\u00b1(1.3)\n15.9\u00b1(2.5)\u21e4\u21e4\n16.7\u00b1(2.7)\n16.6\u00b1(2.9)\n13/2/2\n12/3/2\n\n10.9\u00b1(4.7)\u21e4\u21e4\n11.6\u00b1(4.7)\n17.3\u00b1(3.0)\n19.7\u00b1(2.6)\n0.4\u00b1(0.7)\n1.4\u00b1(1.2)\n6.6\u00b1(2.2)\n5.6\u00b1(2.8)\n13.6\u00b1(0.4)\u21e4 13.7\u00b1(0.5)\u21e4\u21e4\nSplice\n19.5\u00b1(1.2)\n18.9\u00b1(1.0)\nLandsat\n15.9\u00b1(0.5)\u21e4\nWaveform 15.9\u00b1(0.5)\u21e4\n5.1\u00b1(0.7)\u21e4\u21e4\n5.2\u00b1(0.6)\nKrVsKp\n16.6\u00b1(1.6)\n12.8\u00b1(0.9)\nIonosphere\nSemeion\n23.4\u00b1(6.5)\n24.8\u00b1(7.6)\nMultifeat.\n4.0\u00b1(1.6)\n4.0\u00b1(1.6)\nOptdigits\n7.6\u00b1(3.3)\n7.6\u00b1(3.2)\n12.4\u00b1(0.7)\u21e4\n12.8\u00b1(0.7)\nMusk2\n7.0\u00b1(0.8)\n6.9\u00b1(0.7)\n22.4\u00b1(4.0)\n21.5\u00b1(2.8)\n5.5\u00b1(0.9)\n5.9\u00b1(0.7)\n15.3\u00b1(2.6)\u21e4\n30.8\u00b1(3.8)\n10/6/1\n11/4/2\n9/6/2\n9/6/2\n\nCMIM\n11.4\u00b1(3.0)\n18.4\u00b1(2.6)\n1.1\u00b1(2.0)\n8.6\u00b1(3.3)\n14.7\u00b1(0.3)\n19.1\u00b1(1.1)\n16.0\u00b1(0.7)\n5.3\u00b1(0.5)\n13.1\u00b1(0.8)\n16.3\u00b1(4.4)\n3.6\u00b1(1.2)\n7.5\u00b1(3.4)\u21e4\u21e4\n13.0\u00b1(1.0)\n6.8\u00b1(0.7)\u21e4\u21e4\n22.1\u00b1(2.9)\n5.1\u00b1(1.3)\n17.4\u00b1(2.6)\n10/7/0\n13/3/1\n\nmRMR\n\nJMI\n\nSpambase\nPromoter\nGisette\nMadelon\n\n#W1/T1/L1:\n#W2/T2/L2:\n\nWe also plot the average cross-validation error with respect to number of selected features. Fig. 
4 shows the two most distinguishable data sets, Semeion and Gisette. We can see that both of our methods, VMI naive and VMI pairwise, have lower error rates in these two data sets.\n\n5 Related Work\n\nThere has been a significant amount of work on information-theoretic feature selection in the past twenty years: [5, 7, 8, 9, 10, 15, 11, 12, 20], to name a few. Most of these methods are based on combinations of so-called relevant, redundant and complementary information. Such combinations, representing low-order approximations of mutual information, are derived from two assumptions, and it has proved unrealistic to expect both assumptions to be true. Inspired by group testing [21], more scalable feature selection methods have been developed, but those methods also require the calculation of high-dimensional mutual information as a basic scoring function.\n\nEstimating mutual information from data requires a large number of observations, especially when the dimensionality is high. The proposed variational lower bound can be viewed as a way of estimating mutual information between a high-dimensional continuous variable and a discrete variable. Only a few examples exist in the literature [22] under this setting. We hope our method will shed light on new ways to estimate mutual information, similar to estimating divergences in [23].\n\n6 Conclusion\n\nFeature selection has been a significant endeavor over the past decade. Mutual information gives a general basis for quantifying the informativeness of features. Despite the clarity of mutual information, estimating it can be difficult. While a large number of information-theoretic methods exist, they are rather limited and rely on mutually inconsistent assumptions about underlying data distributions. 
We introduced a unifying variational mutual information lower bound to address these issues and showed that by auto-regressive decomposition, feature selection can be done in a forward manner by progressively maximizing the lower bound. We also presented two concrete methods using Naive Bayes and pairwise Q-distributions, which strongly outperform the existing methods. VMI naive only assumes a Naive Bayes model, but even this simple model outperforms the existing information-theoretic methods, indicating the effectiveness of our variational information maximization framework. We hope that our framework will inspire new mathematically rigorous algorithms for information-theoretic feature selection, such as optimizing the variational lower bound globally and developing more powerful variational approaches for capturing complex dependencies.\n\nReferences\n\n[1] Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent data analysis, 1(3):131\u2013156, 1997.\n[2] Huan Liu and Hiroshi Motoda. Feature selection for knowledge discovery and data mining, volume 454. Springer Science & Business Media, 2012.\n[3] Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial intelligence, 97(1):273\u2013324, 1997.\n[4] Isabelle Guyon and Andr\u00e9 Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157\u20131182, 2003.\n[5] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luj\u00e1n. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13(1):27\u201366, 2012.\n[6] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.\n[7] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. Neural Networks, IEEE Transactions on, 5(4):537\u2013550, 1994.\n[8] Howard Hua Yang and John E Moody. Data visualization and feature selection: New algorithms for nongaussian data. In NIPS, volume 99, pages 687\u2013693. Citeseer, 1999.\n[9] Fran\u00e7ois Fleuret. Fast binary feature selection with conditional mutual information. The Journal of Machine Learning Research, 5:1531\u20131555, 2004.\n[10] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226\u20131238, 2005.\n[11] Irene Rodriguez-Lujan, Ramon Huerta, Charles Elkan, and Carlos Santa Cruz. Quadratic programming feature selection. The Journal of Machine Learning Research, 11:1491\u20131516, 2010.\n[12] Xuan Vinh Nguyen, Jeffrey Chan, Simone Romano, and James Bailey. Effective global approaches for mutual information based feature selection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 512\u2013521. ACM, 2014.\n[13] David Barber and Felix Agakov. The im algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, volume 16, page 201. MIT Press, 2004.\n[14] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1057\u20131064, 2011.\n[15] David D Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the workshop on Speech and Natural Language, pages 212\u2013217. Association for Computational Linguistics, 1992.\n[16] Dahua Lin and Xiaoou Tang. Conditional infomax learning: an integrated framework for feature extraction and fusion. In Computer Vision\u2013ECCV 2006, pages 68\u201382. Springer, 2006.\n[17] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2116\u20132124, 2015.\n[18] Kiran S Balagani and Vir V Phoha. On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7):1342\u20131343, 2010.\n[19] Nguyen Xuan Vinh, Shuo Zhou, Jeffrey Chan, and James Bailey. Can high-order dependencies improve mutual information based feature selection? Pattern Recognition, 2015.\n[20] Hongrong Cheng, Zhiguang Qin, Chaosheng Feng, Yong Wang, and Fagen Li. Conditional mutual information-based feature selection analyzing for synergy and redundancy. ETRI Journal, 33(2):210\u2013218, 2011.\n[21] Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q Ngo, Long Nguyen, Christopher R\u00e9, and Venu Govindaraju. Parallel feature selection inspired by group testing. In Advances in Neural Information Processing Systems, pages 3554\u20133562, 2014.\n[22] Brian C Ross. Mutual information between discrete and continuous data sets. PloS one, 9(2):e87357, 2014.\n[23] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56(11):5847\u20135861, 2010.\n[24] Shuyang Gao. Variational feature selection code. http://github.com/BiuBiuBiLL/InfoFeatureSelection.\n[25] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02):185\u2013205, 2005.\n[26] Kevin Bache and Moshe Lichman. Uci machine learning repository, 2013.\n", "award": [], "sourceid": 271, "authors": [{"given_name": "Shuyang", "family_name": "Gao", "institution": "University of Southern California"}, {"given_name": "Greg", "family_name": "Ver Steeg", "institution": "University of Southern California"}, {"given_name": "Aram", "family_name": "Galstyan", "institution": "USC Information Sciences Inst"}]}