{"title": "Variational Information Maximization for Feature Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 487, "page_last": 495, "abstract": "Feature selection is one of the most fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection exists which is based on maximizing mutual information between subsets of features and class labels. Practical methods are forced to rely on approximations due to the difficulty of estimating mutual information. We demonstrate that approximations made by existing methods are based on unrealistic assumptions. We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.", "full_text": "Variational Information Maximization for\n\nFeature Selection\n\nShuyang Gao\nUniversity of Southern California, Information Sciences Institute\ngaos@usc.edu, gregv@isi.edu, galstyan@isi.edu\n\nGreg Ver Steeg\n\nAram Galstyan\n\nAbstract\n\nFeature selection is one of the most fundamental problems in machine learning.\nAn extensive body of work on information-theoretic feature selection exists which\nis based on maximizing mutual information between subsets of features and class\nlabels. Practical methods are forced to rely on approximations due to the dif\ufb01culty\nof estimating mutual information. We demonstrate that approximations made by\nexisting methods are based on unrealistic assumptions. 
We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.\n\n1 Introduction\n\nFeature selection is one of the fundamental problems in machine learning research [1, 2]. Many problems involve a large number of features that are either irrelevant or redundant for the task at hand. In these cases, it is often advantageous to pick a smaller subset of features to avoid over-fitting, to speed up computation, or simply to improve the interpretability of the results.\nFeature selection approaches are usually categorized into three groups: wrapper, embedded and filter [3, 4, 5]. The first two methods, wrapper and embedded, are considered classifier-dependent, i.e., the selection of features somehow depends on the classifier being used. Filter methods, on the other hand, are classifier-independent and define a scoring function between features and labels in the selection process.\nBecause filter methods may be employed in conjunction with a wide variety of classifiers, it is important that the scoring function of these methods is as general as possible. 
Since mutual information (MI) is a general measure of dependence with several unique properties [6], many MI-based scoring functions have been proposed as filter methods [7, 8, 9, 10, 11, 12]; see [5] for an exhaustive list.\nOwing to the difficulty of estimating mutual information in high dimensions, most existing MI-based feature selection methods are based on various low-order approximations for mutual information. While those approximations have been successful in certain applications, they are heuristic in nature and lack theoretical guarantees. In fact, as we demonstrate in Sec. 2.2, a large family of approximate methods are based on two assumptions that are mutually inconsistent.\nTo address the above shortcomings, in this paper we introduce a novel feature selection method based on a variational lower bound on mutual information; a similar bound was previously studied within the Infomax learning framework [13]. We show that instead of maximizing the mutual information, which is intractable in high dimensions (hence the introduction of many heuristics), we can maximize a lower bound on the MI with the proper choice of tractable variational distributions.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe use this lower bound to define an objective function and derive a forward feature selection algorithm. We provide a rigorous proof that the forward feature selection is optimal under tree graphical models by choosing an appropriate variational distribution. This is in contrast with previous information-theoretic feature selection methods which lack any performance guarantees. We also conduct empirical validation on various datasets and demonstrate that the proposed approach outperforms state-of-the-art information-theoretic feature selection methods.\nIn Sec. 2 we introduce general MI-based feature selection methods and discuss their limitations. Sec. 
3 introduces the variational lower bound on mutual information and proposes two specific variational distributions. In Sec. 4, we report results from our experiments, and compare the proposed approach with existing methods.\n\n2 Information-Theoretic Feature Selection Background\n\n2.1 Mutual Information-Based Feature Selection\n\nConsider a supervised learning scenario where x = {x_1, x_2, ..., x_D} is a D-dimensional input feature vector, and y is the output label. In filter methods, the mutual information-based feature selection task is to select T features x_{S*} = {x_{f_1}, x_{f_2}, ..., x_{f_T}} such that the mutual information between x_{S*} and y is maximized. Formally,\n\nS* = argmax_S I(x_S : y)  s.t. |S| = T    (1)\n\nwhere I(·) denotes the mutual information [6].\n\nForward Sequential Feature Selection. Maximizing the objective function in Eq. 1 is generally NP-hard. Many MI-based feature selection methods adopt a greedy method, where features are selected incrementally, one feature at a time. Let S_{t-1} = {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}} be the selected feature set after time step t-1. According to the greedy method, the next feature f_t at step t is selected such that\n\nf_t = argmax_{i ∉ S_{t-1}} I(x_{S_{t-1} ∪ i} : y)    (2)\n\nwhere x_{S_{t-1} ∪ i} denotes x's projection into the feature space S_{t-1} ∪ i. As shown in [5], the mutual information term in Eq. 2 can be decomposed as:\n\nI(x_{S_{t-1} ∪ i} : y) = I(x_{S_{t-1}} : y) + I(x_i : y | x_{S_{t-1}})\n= I(x_{S_{t-1}} : y) + I(x_i : y) − I(x_i : x_{S_{t-1}}) + I(x_i : x_{S_{t-1}} | y)\n= I(x_{S_{t-1}} : y) + I(x_i : y) − (H(x_{S_{t-1}}) − H(x_{S_{t-1}} | x_i)) + (H(x_{S_{t-1}} | y) − H(x_{S_{t-1}} | x_i, y))    (3)\n\nwhere H(·) denotes the entropy [6]. Omitting the terms that do not depend on x_i in Eq. 3, we can rewrite Eq. 
2 as follows:\n\nf_t = argmax_{i ∉ S_{t-1}} I(x_i : y) + H(x_{S_{t-1}} | x_i) − H(x_{S_{t-1}} | x_i, y)    (4)\n\nThe greedy learning algorithm has been analyzed in [14].\n\n2.2 Limitations of Previous MI-Based Feature Selection Methods\n\nEstimating high-dimensional information-theoretic quantities is a difficult task. Therefore, most MI-based feature selection methods propose low-order approximations to H(x_{S_{t-1}} | x_i) and H(x_{S_{t-1}} | x_i, y) in Eq. 4. A general family of methods rely on the following approximations [5]:\n\nH(x_{S_{t-1}} | x_i) ≈ Σ_{k=1}^{t-1} H(x_{f_k} | x_i)\nH(x_{S_{t-1}} | x_i, y) ≈ Σ_{k=1}^{t-1} H(x_{f_k} | x_i, y)    (5)\n\nThe approximations in Eq. 5 become exact under the following two assumptions [5]:\n\nAssumption 1. (Feature Independence Assumption) p(x_{S_{t-1}} | x_i) = Π_{k=1}^{t-1} p(x_{f_k} | x_i)\n\nAssumption 2. (Class-Conditioned Independence Assumption) p(x_{S_{t-1}} | x_i, y) = Π_{k=1}^{t-1} p(x_{f_k} | x_i, y)\n\nAssumption 1 and Assumption 2 mean that the selected features are independent and class-conditionally independent, respectively, given the unselected feature x_i under consideration.\n\nFigure 1 (panels: Assumption 1; Assumption 2; Satisfying both Assumption 1 and Assumption 2): The first two graphical models show the assumptions of traditional MI-based feature selection methods. The third graphical model shows a scenario when both Assumption 1 and Assumption 2 are true. Dashed line indicates there may or may not be a correlation between two variables.\n\nWe now demonstrate that the two assumptions cannot be valid simultaneously unless the data has a very specific (and unrealistic) structure. Indeed, consider the graphical models consistent with either assumption, as illustrated in Fig. 1. If Assumption 1 holds true, then x_i is the only common cause of the previously selected features S_{t-1} = {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}}, so that those features become independent when conditioned on x_i. 
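For concreteness, the greedy criterion of Eq. 4 combined with the approximations of Eq. 5 can be sketched on discrete data using simple plug-in (count-based) entropy estimates. This is a minimal illustration, not the authors' implementation; the function names and the toy dataset are ours.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Empirical joint entropy (in nats) of one or more discrete columns."""
    joint = list(zip(*cols))
    counts = Counter(joint)
    p = np.array([c / len(joint) for c in counts.values()])
    return float(-(p * np.log(p)).sum())

def cond_entropy(a, *given):
    # H(a | given) = H(a, given) - H(given)
    return entropy(a, *given) - entropy(*given)

def mutual_info(a, b):
    return entropy(a) + entropy(b) - entropy(a, b)

def greedy_select(X, y, T):
    """Forward selection scoring each candidate i by Eq. 4 with
    H(x_S|x_i) and H(x_S|x_i,y) replaced by the sums of Eq. 5."""
    selected = []
    for _ in range(T):
        best, best_score = None, -np.inf
        for i in range(X.shape[1]):
            if i in selected:
                continue
            score = mutual_info(X[:, i], y)
            for k in selected:  # low-order terms from Eq. 5
                score += cond_entropy(X[:, k], X[:, i]) \
                       - cond_entropy(X[:, k], X[:, i], y)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Toy data: feature 0 copies the label, feature 1 is uninformative,
# feature 2 is constant; the first greedy pick should be feature 0.
y = np.array([0, 0, 1, 1] * 5)
X = np.stack([y, np.array([0, 1, 0, 1] * 5), np.zeros(20, dtype=int)], axis=1)
sel = greedy_select(X, y, 2)
```

On this toy data the first selected index is 0, since at the first step the score reduces to I(x_i : y) and only feature 0 carries information about y.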
On the other hand, if Assumption 2 holds, then the features depend both on x_i and class label y; therefore, generally speaking, the distribution over those features does not factorize by solely conditioning on x_i: there will be remnant dependencies due to y. Thus, if Assumption 2 is true, then Assumption 1 cannot be true in general, unless the data is generated according to a very specific model shown in the rightmost panel of Fig. 1. Note, however, that in this case x_i becomes the most important feature because I(x_i : y) > I(x_{S_{t-1}} : y); then we should have selected x_i at the very first step, contradicting the feature selection process.\nAs we mentioned above, most existing methods implicitly or explicitly adopt both assumptions or their stronger versions, as shown in [5], including mutual information maximization (MIM) [15], joint mutual information (JMI) [8], conditional mutual information maximization (CMIM) [9], maximum relevance minimum redundancy (mRMR) [10], conditional Infomax feature extraction (CIFE) [16], etc. Approaches based on global optimization of mutual information, such as quadratic programming feature selection (QPFS) [11] and the state-of-the-art conditional mutual information-based spectral method (SPECCMI) [12], are derived from the previous greedy methods and therefore also implicitly rely on those two assumptions.\nIn the next section we address these issues by introducing a novel information-theoretic framework for feature selection. Instead of estimating mutual information and making mutually inconsistent assumptions, our framework formulates a tractable variational lower bound on mutual information, which allows a more flexible and general class of assumptions via appropriate choices of variational distributions.\n\n3 Method\n\n3.1 Variational Mutual Information Lower Bound\n\nLet p(x, y) be the joint distribution of input (x) and output (y) variables. 
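Before deriving the bound, a small numerical check may help build intuition for Eq. 6 and its label-side form in Eq. 8: for any normalized variational q(y|x), H(y) + ⟨ln q(y|x)⟩ never exceeds I(x : y), and is tight when q equals the true posterior. The joint distribution values below are made up purely for illustration.

```python
import numpy as np

# Illustrative joint p(x, y) over 3 x-values (rows) and 2 labels (columns).
p_xy = np.array([[0.30, 0.05],
                 [0.10, 0.20],
                 [0.05, 0.30]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# True mutual information and label entropy (nats).
mi = float((p_xy * np.log(p_xy / np.outer(p_x, p_y))).sum())
h_y = float(-(p_y * np.log(p_y)).sum())

def bound(q_y_given_x):
    """H(y) + <ln q(y|x)>_{p(x,y)}, the label-side variational lower bound."""
    return h_y + float((p_xy * np.log(q_y_given_x)).sum())

p_y_given_x = p_xy / p_x[:, None]                        # exact posterior
q_blur = 0.5 * p_y_given_x + 0.5 * np.array([0.5, 0.5])  # a weaker variational q

assert abs(bound(p_y_given_x) - mi) < 1e-9  # tight at q = p(y|x)
assert bound(q_blur) <= mi + 1e-9           # lower bound for any normalized q
```

The gap between the bound and I(x : y) is exactly the averaged KL divergence between p(y|x) and q(y|x) (cf. Eq. 13), which is why the bound is tight only at the true posterior.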
Barber & Agakov [13] derived the following lower bound for mutual information I(x : y) by using the non-negativity of the KL-divergence, i.e., Σ_x p(x|y) log [p(x|y)/q(x|y)] ≥ 0, which gives:\n\nI(x : y) ≥ H(x) + ⟨ln q(x|y)⟩_{p(x,y)}    (6)\n\nwhere angled brackets represent averages and q(x|y) is an arbitrary variational distribution. This bound becomes exact if q(x|y) ≡ p(x|y).\nIt is worthwhile to note that in the context of unsupervised representation learning, p(y|x) and q(x|y) can be viewed as an encoder and a decoder, respectively. In this case, y needs to be learned by maximizing the lower bound in Eq. 6 by iteratively adjusting the parameters of the encoder and decoder, as in [13, 17].\n\n3.2 Variational Information Maximization for Feature Selection\n\nNaturally, in terms of information-theoretic feature selection, we could also try to optimize the variational lower bound in Eq. 6 by choosing a subset of features S* in x, such that\n\nS* = argmax_S { H(x_S) + ⟨ln q(x_S|y)⟩_{p(x_S,y)} }    (7)\n\nHowever, the H(x_S) term on the RHS of Eq. 7 is still intractable when x_S is very high-dimensional. Nonetheless, noticing that the variable y is the class label, which is usually discrete, so that H(y) is fixed and tractable, by symmetry we switch x and y in Eq. 6 and rewrite the lower bound as follows:\n\nI(x : y) ≥ H(y) + ⟨ln q(y|x)⟩_{p(x,y)} = ⟨ln [q(y|x)/p(y)]⟩_{p(x,y)}    (8)\n\nThe equality in Eq. 8 is obtained by noticing that H(y) = ⟨−ln p(y)⟩_{p(y)}.\nBy using Eq. 8, the lower bound optimal subset S* of x becomes:\n\nS* = argmax_S ⟨ln [q(y|x_S)/p(y)]⟩_{p(x_S,y)}    (9)\n\n3.2.1 Choice of Variational Distribution\n\nq(y|x_S) in Eq. 9 can be any distribution as long as it is normalized. We need to choose q(y|x_S) to be as general as possible while still keeping the term ⟨ln q(y|x_S)⟩_{p(x_S,y)} tractable in Eq. 
9. As a result, we set q(y|x_S) as\n\nq(y|x_S) = q(x_S, y) / q(x_S) = q(x_S|y) p(y) / Σ_{y'} q(x_S|y') p(y')    (10)\n\nWe can verify that Eq. 10 is normalized even if q(x_S|y) is not normalized.\nIf we further denote\n\nq(x_S) = Σ_{y'} q(x_S|y') p(y')    (11)\n\nthen by combining Eqs. 9 and 10, we get\n\nI(x_S : y) ≥ ⟨ln [q(x_S|y)/q(x_S)]⟩_{p(x_S,y)} ≡ I_{LB}(x_S : y)    (12)\n\nWe also have the following equation, which shows the gap between I(x_S : y) and I_{LB}(x_S : y):\n\nI(x_S : y) − I_{LB}(x_S : y) = ⟨KL(p(y|x_S) || q(y|x_S))⟩_{p(x_S)}    (13)\n\nAuto-Regressive Decomposition. Now that q(y|x_S) is defined, all we need to do is model q(x_S|y) under Eq. 10, and q(x_S) is easy to compute based on q(x_S|y). Here we decompose q(x_S|y) as an auto-regressive distribution assuming T features in S:\nq(x_{f_t} | x_{f