{"title": "Parallel Feature Selection Inspired by Group Testing", "book": "Advances in Neural Information Processing Systems", "page_first": 3554, "page_last": 3562, "abstract": "This paper presents a parallel feature selection method for classification that scales up to very high dimensions and large data sizes. Our original method is inspired by group testing theory, under which the feature selection procedure consists of a collection of randomized tests to be performed in parallel. Each test corresponds to a subset of features, for which a scoring function may be applied to measure the relevance of the features in a classification task. We develop a general theory providing sufficient conditions under which true features are guaranteed to be correctly identified. Superior performance of our method is demonstrated on a challenging relation extraction task from a very large data set that have both redundant features and sample size in the order of millions. We present comprehensive comparisons with state-of-the-art feature selection methods on a range of data sets, for which our method exhibits competitive performance in terms of running time and accuracy. 
Moreover, it also yields substantial speedup when used as a pre-processing step for most other existing methods.", "full_text": "Parallel Feature Selection Inspired by Group Testing\n\nYingbo Zhou\u2217 (CSE Department, SUNY at Buffalo, yingbozh@buffalo.edu)\nUtkarsh Porwal\u2217 (CSE Department, SUNY at Buffalo, utkarshp@buffalo.edu)\nCe Zhang (CS Department, University of Wisconsin-Madison, czhang@cs.wisc.edu)\nHung Ngo (CSE Department, SUNY at Buffalo, hungngo@buffalo.edu)\nXuanLong Nguyen (EECS Department, University of Michigan, xuanlong@umich.edu)\nChristopher R\u00e9 (CS Department, Stanford University, chrismre@cs.stanford.edu)\nVenu Govindaraju (CSE Department, SUNY at Buffalo, govind@buffalo.edu)\n\nAbstract\n\nThis paper presents a parallel feature selection method for classification that scales up to very high dimensions and large data sizes. Our original method is inspired by group testing theory, under which the feature selection procedure consists of a collection of randomized tests to be performed in parallel. Each test corresponds to a subset of features, for which a scoring function may be applied to measure the relevance of the features in a classification task. We develop a general theory providing sufficient conditions under which true features are guaranteed to be correctly identified. Superior performance of our method is demonstrated on a challenging relation extraction task from a very large data set that has both redundant features and a sample size on the order of millions. We present comprehensive comparisons with state-of-the-art feature selection methods on a range of data sets, for which our method exhibits competitive performance in terms of running time and accuracy. Moreover, it also yields substantial speedup when used as a pre-processing step for most other existing methods.\n\n1 Introduction\n\nFeature selection (FS) is a fundamental and classic problem in machine learning [10, 4, 12]. 
In classification, FS is the following problem: given a universe U of possible features, identify a subset of features F \u2286 U such that using the features in F one can build a model to best predict the target class. The set F influences not only the model\u2019s accuracy and computational cost, but also the ability of an analyst to understand the resulting model. In applications such as gene selection from micro-array data [10, 4], text categorization [3], and finance [22], U may contain hundreds of thousands of features from which one wants to select only a small handful for F.\n\n\u2217 denotes equal contribution\n\nWhile the overall goal is to have an FS method that is both computationally efficient and statistically sound, natural formulations of the FS problem are known to be NP-hard [2]. For large-scale data, scalability is a crucial criterion, because FS often serves not as an end in itself but as a means to other sophisticated subsequent learning. In reality, practitioners often resort to heuristic methods, which can broadly be categorized into three types: wrapper, embedded, and filter [10, 4, 12]. In the wrapper method, a classifier is used as a black box to test on any subset of features. In filter methods no classifier is used; instead, features are selected based on generic statistical properties of the (labeled) data, such as mutual information and entropy. Embedded methods have built-in mechanisms for FS as an integral part of the classifier training. Devising a mathematically rigorous framework to explain and justify FS heuristics is an emerging research area. Recently Brown et al. [4] considered common FS heuristics using a formulation based on conditional likelihood maximization.\n\nThe primary contribution of this paper is a new framework for parallelizable feature selection, which is inspired by the theory of group testing. 
By exploiting parallelism in our test design we obtain an FS method that is easily scalable to millions of features and samples or more, while preserving useful statistical properties in terms of classification accuracy, stability, and robustness. Recall that group testing is a combinatorial search paradigm [7] in which one wants to identify a small subset of \u201cpositive items\u201d from a large universe of possible items. In the original application, items are blood samples of WWII draftees and an item is positive if it is infected with syphilis. Testing individual blood samples is very expensive; the group testing approach is to distribute samples into pools in a smart way. If a pool tests negative, then all samples in the pool are negative. On the other hand, if a pool tests positive then at least one sample in the pool is positive. We can think of the FS problem in the group testing framework: there is a presumably small, unknown subset F of relevant features in a large universe of N features. Both FS and group testing algorithms perform the same basic operation: apply a \u201ctest\u201d to a subset T of the underlying universe; this test produces a score, s(T), that is designed to measure the quality of the features T (or return positive/negative in the group testing case). From the collection of test scores the relevant features are supposed to be identified. Most existing FS algorithms can be thought of as sequential instantiations in this framework1: we select the set T to test based on the scores of previous tests. For example, let X = (X1, . . . , XN) be a collection of features (variables) and Y be the class label. In the joint mutual information (JMI) method [25], the feature set T is grown sequentially by adding one feature at each iteration. The next feature\u2019s score, s(Xk), is defined relative to the set of features already selected in T: s(Xk) = \u2211_{Xj \u2208 T} I(Xk, Xj; Y). As each such scoring operation takes a non-negligible amount of time, a sequential method may take a long time to complete.\n\nA key insight is that group testing need not be done sequentially. With a good pooling design, all the tests can be performed in parallel: we determine the pooling design without knowing any pool\u2019s test outcome. From the vector of test outcomes, one can identify exactly the collection of positive blood samples. Parallel group testing, commonly called non-adaptive group testing (NAGT), is a natural paradigm and has found numerous applications in many areas of mathematics, computer science, and biology [18]. It is natural to wonder whether a \u201cparallel\u201d FS scheme can be designed for machine learning in the same way NAGT was made possible: all feature sets T are specified in advance, without knowing the scores of any other tests, and from the final collection of scores the features are identified. This paper initiates a mathematical investigation of this possibility.\n\nAt a high level, our parallel feature selection (PFS) scheme has three inter-related components: (1) the test design, which indicates the collection of subsets of features to be tested; (2) the scoring function s : 2^[N] \u2192 R that assigns a score to each test; and (3) the feature identification algorithm that identifies the final selected feature set from the test scores. The design space is thus very large. Every combination of the three components leads to a new PFS scheme.2 We argue that PFS schemes are preferred over sequential FS for two reasons:\n\n1. scalability: the tests in a PFS scheme can be performed in parallel, and thus the scheme can be scaled to large datasets using standard parallel computing techniques, and\n\n2. stability: errors in individual trials do not affect PFS methods as dramatically as sequential methods. 
In fact, we will show in this paper that increasing the number of tests improves the accuracy of our PFS scheme.\n\nWe propose and study one such PFS approach. We show that our approach has comparable (and sometimes better) empirical quality relative to previous heuristic approaches, while providing sound statistical guarantees and substantially improved scalability.\n\nOur technical contributions We propose a simple approach for the first and the third components of a PFS scheme. For the second component, we prove a sufficient condition on the scoring function under which the feature identification algorithm we propose is guaranteed to identify exactly the set of original (true) features. In particular, we introduce a notion called C-separability, which roughly indicates the strength of the scoring function in separating a relevant feature from an irrelevant feature. We show that when s is C-separable and we can estimate s, we are able to guarantee exact recovery of the right set of features with high probability. Moreover, when C > 0, the number of tests can be asymptotically logarithmic in the number of features in U.\n\n1 A notable exception is the MIM method, which is easily parallelizable and can be regarded as a special implementation of our framework.\n\n2 It is important to emphasize that this PFS framework is applicable to both filter and wrapper approaches. In the wrapper approach, the score s(T) might be the training error of some classifier, for instance.\n\nIn theory, we provide sufficient conditions (a Na\u00efve Bayes assumption) according to which one can obtain separable scoring functions, including the KL divergence and mutual information (MI). In practice, we demonstrate that MI is separable even when the sufficient condition does not hold, and moreover, on generated synthetic data sets, our method is shown to recover exactly the relevant features. 
We proceed to provide a comprehensive evaluation of our method on a range of real-world data sets of both large and small sizes. It is on the large-scale data sets that our method exhibits superior performance. In particular, for a huge relation extraction data set (TAC-KBP) that has millions of redundant features and samples, we outperform all existing methods in accuracy and time, in addition to generating plausible features (in fact, many competing methods could not finish execution). For the more familiar NIPS 2003 FS Challenge data, our method is also competitive (best or second best) on the two largest data sets. Since our method hinges on the accuracy of score functions, which is difficult to achieve for small data, our performance is more modest in this regime (staying in the middle of the pack in terms of classification accuracy). Nonetheless, we show that our method can be used as a preprocessing step for other FS methods to eliminate a large portion of the feature space, thereby providing substantial computational speedups while retaining the accuracy of those methods.\n\n2 Parallel Feature Selection\n\nThe general setting Let N be the total number of input features. For each subset T \u2286 [N] := {1, . . . , N}, there is a score s(T), normalized to be in [0, 1], that assesses the \u201cquality\u201d of the features in T. We select a collection of t tests, each of which is a subset T \u2286 [N], such that from the scores of all tests we can identify the unknown subset F of d relevant variables that are most important to the classification task. We encode the collection of t tests with a binary matrix A = (aij) of dimension t \u00d7 N, where aij = 1 iff feature j belongs to test i. Corresponding to each row i of A is a \u201ctest score\u201d si = s({j | aij = 1}) \u2208 [0, 1]. Specifying A is called test design; identifying F from the score vector (si)_{i\u2208[t]} is the job of the feature identification algorithm. 
The scheme is inherently parallel because all the tests are specified in advance and executed in parallel; the features are then selected from all the test outcomes.\n\nTest design and feature identification Our test design and feature identification algorithms are extremely simple. We construct the test matrix A randomly by putting a feature in each test with probability p (to be chosen later). Then, from the test scores we rank the features and select the d top-ranked features. The ranking function is defined as follows. Given a t \u00d7 N test matrix A, let aj denote its jth column. The dot product \u27e8aj, s\u27e9 is the total score of all the tests that feature j participates in. We define \u03c1(j) = \u27e8aj, s\u27e9 to be the rank of feature j with respect to the test matrix A and the score function s.\n\nThe scoring function The crucial piece stitching together the entire scheme is the scoring function. The following theorem explains why the above test design and feature identification strategy makes sense, as long as one can choose a scoring function s that satisfies a natural separability property. Intuitively, separable scoring functions require that adding more hidden features to a test set increase its score.\n\nDefinition 2.1 (Separable scoring function). Let C \u2265 0 be a real number. The score function s : 2^[N] \u2192 [0, 1] is said to be C-separable if the following property holds: for every f \u2208 F and \u02dcf \u2209 F, and for every T \u2286 [N] \u2212 {f, \u02dcf}, we have s(T \u222a {f}) \u2212 s(T \u222a {\u02dcf}) \u2265 C.\n\nIn words, with a separable scoring function, adding a relevant feature should be better than adding an irrelevant feature to a given subset T of features. Due to space limitation, the proofs of the following theorem, propositions, and corollaries can be found in the supplementary materials. 
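The test design and feature identification steps just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the toy scoring function (the fraction of hidden features present in a test) is our own stand-in for a C-separable score.

```python
import numpy as np

def parallel_feature_selection(score_fn, N, d, t, p=0.5, seed=0):
    """One PFS round: draw a random t x N Bernoulli(p) test matrix A,
    score every test (the t scores are independent, so in practice they
    can be computed in parallel), rank features by rho(j) = <a_j, s>,
    and return the d top-ranked features."""
    rng = np.random.default_rng(seed)
    A = (rng.random((t, N)) < p).astype(int)               # test design
    s = np.array([score_fn(np.flatnonzero(A[i])) for i in range(t)])
    rho = A.T @ s                                           # feature ranks
    return set(np.argsort(-rho)[:d].tolist())

# Toy C-separable score (our illustration): the fraction of the hidden
# features F = {0, 1, 2} that appear in the test.
F = {0, 1, 2}
def score(T):
    return len(F & set(T)) / len(F)

selected = parallel_feature_selection(score, N=50, d=3, t=2000)
print(selected)  # recovers {0, 1, 2}
```

With this separable score the hidden features win the ranking with high probability; Theorem 2.2 below quantifies how many tests suffice.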
The essence of the idea is that, when s can separate relevant features from irrelevant features, with high probability a relevant feature will be ranked higher than an irrelevant feature. Hoeffding\u2019s inequality is then used to bound the number of tests.\n\nTheorem 2.2. Let A be the random t \u00d7 N test matrix obtained by setting each entry to be 1 with probability p \u2208 [0, 1] and 0 with probability 1 \u2212 p. If the scoring function s is C-separable, then the expected rank of a feature in F is at least the expected rank of a feature not in F. Furthermore, if C > 0, then for any \u03b4 \u2208 (0, 1), with probability at least 1 \u2212 \u03b4 every feature in F has rank higher than every feature not in F, provided that the number of tests t satisfies\n\nt \u2265 (2 / (C^2 p^2 (1 \u2212 p)^2)) log(d(N \u2212 d)/\u03b4).    (1)\n\nBy setting p = 1/2 in the above theorem, we obtain the following. It is quite remarkable that, assuming we can estimate the scores accurately, we only need about O(log N) tests to identify F.\n\nCorollary 2.3. Let C > 0 be a constant such that there is a C-separable scoring function s. Let d = |F|, where F is the set of hidden features. Let \u03b4 \u2208 (0, 1) be an arbitrary constant. Then, there is a distribution of t \u00d7 N test matrices A with t = O(log(d(N \u2212 d)/\u03b4)) such that, by selecting a test matrix randomly from the distribution, the d top-ranked features are exactly the hidden features with probability at least 1 \u2212 \u03b4.\n\nOf course, in reality estimating the scores accurately is a very difficult problem, both statistically and computationally, depending on what the scoring function is. We elaborate more on this point below. 
But first, we show that separable scoring functions exist, under certain assumptions about the underlying distribution.\n\nSufficient conditions for separable scoring functions We demonstrate the existence of separable scoring functions given some sufficient conditions on the data. In practice, loss functions such as classification error and other surrogate losses may be used as scoring functions. For binary classification, information-theoretic quantities such as the Kullback-Leibler divergence, Hellinger distance, and total variation \u2014 all of which are special cases of f-divergences [5, 1] \u2014 may also be considered. For multi-class classification, mutual information (MI) is a popular choice.\n\nThe data pairs (X, Y) are assumed to be iid samples from a joint distribution P(X, Y). The following result shows that under the so-called \u201cnaive Bayes\u201d condition, i.e., all components of the random vector X are conditionally independent given the label variable Y, the Kullback-Leibler divergence is a separable scoring function in the binary classification setting:\n\nProposition 2.4. Consider the binary classification setting, i.e., Y \u2208 {0, 1}, and assume that the naive Bayes condition holds. Define the score function to be the Kullback-Leibler divergence: s(T) := KL(P(X_T | Y = 0) || P(X_T | Y = 1)). Then s is a separable scoring function. Moreover, s is C-separable, where C := min_{f\u2208F} s({f}).\n\nProposition 2.5. Consider the multi-class classification setting, and assume that the naive Bayes condition holds. 
Moreover, suppose that for any pair f \u2208 F and \u02dcf \u2209 F, the following holds for any T \u2286 [N] \u2212 {f, \u02dcf}:\n\nI(X_f; Y) \u2212 I(X_f; X_T) \u2265 I(X_{\u02dcf}; Y) \u2212 I(X_{\u02dcf}; X_T).\n\nThen, the MI function s(T) := I(X_T; Y) is a separable scoring function.\n\nWe note the naturalness of the condition so required, as the quantity I(X_f; Y) \u2212 I(X_f; X_T) may be viewed as the relevance of feature f with respect to the label Y, subtracted by the redundancy with the other existing features T. If we assume further that X_{\u02dcf} is independent of both X_T and the label Y, and there is a positive constant C such that I(X_f; Y) \u2212 I(X_f; X_T) \u2265 C for any f \u2208 F, then s(T) is obviously a C-separable scoring function. It should be noted that the naive Bayes conditions are sufficient, but not necessary, for a scoring function to be C-separable.\n\nSeparable scoring functions for filters and wrappers. In practice, information-based scoring functions need to be estimated from the data. Consistent estimators of scoring functions such as the KL divergence (more generally, f-divergences) and MI are available (e.g., [20]). This provides the theoretical support for applying our test technique to filter methods: when the number of training data is sufficiently large, a consistent estimate of a separable scoring function must also be a separable scoring function. On the other hand, a wrapper method uses a classification algorithm\u2019s performance as a scoring function for testing. Therefore, the choice of the underlying (surrogate) loss function plays a critical role. The following result provides the existence of loss functions which induce separable scoring functions for the wrapper method:\n\nProposition 2.6. Consider the binary classification setting, and let P_0^T := P(X_T | Y = 0) and P_1^T := P(X_T | Y = 1). Assume that an f-divergence of the form s(T) = \u222b \u03c6(dP_0^T / dP_1^T) dP_1^T is a separable scoring function for some convex function \u03c6 : R+ \u2192 R. Then there exists a surrogate loss function l : R \u00d7 R \u2192 R+ under which the minimum l-risk, R_l(T) := inf_g E[l(Y, g(X_T))], is also a separable scoring function. Here the infimum is taken over all measurable classifier functions g acting on the feature input X_T, and E denotes expectation with respect to the joint distribution of X_T and Y.\n\nThis result follows from Theorem 1 of [19], who established a precise correspondence between f-divergences defined by convex \u03c6 and equivalence classes of surrogate losses l. As a consequence, if the Hellinger distance between P_0^T and P_1^T is separable, then the wrapper method using the AdaBoost classifier corresponds to a separable scoring function. Similarly, a separable Kullback-Leibler divergence implies that of a logistic regression based wrapper, while a separable variational distance implies that of an SVM based wrapper.\n\n3 Experimental results\n\n3.1 Synthetic experiments\n\nIn this section, we synthetically illustrate that separable scoring functions exist and our PFS framework is sound beyond the Na\u00efve Bayes assumption (NBA). We first show that MI is C-separable for large C even when the NBA is violated. The NBA was only needed in Propositions 2.4 and 2.5 in order for the proofs to go through. Then, we show that our framework recovers exactly the relevant features for two common classes of input distributions.\n\nWe generate 1,000 data points from two separated 2-D Gaussians with the same covariance matrix but different means, one centered at (\u22122, \u22122) and the other at (2, 2). We start with the identity covariance matrix, and gradually change the off-diagonal element to \u22120.999, representing highly correlated features. 
Then, we add 1,000-dimensional zero-mean Gaussian noise with the same covariance structure, where the diagonal is 1 and the off-diagonal elements increase gradually from 0 to 0.999. We then calculate the MI between two features and the class label, where the two features are selected in three settings: 1) the two genuine dimensions; 2) one of the genuine features and one of the noisy dimensions; 3) a random pair of the noisy dimensions. The MI that we get in these three settings is shown in Figure 1. It is clear from this figure that MI is a separable scoring function, despite the fact that the NBA is violated.\n\nWe also synthetically evaluated our entire PFS idea, using two multinomials and two Gaussians to generate data for two binary classification tasks. Our PFS scheme is able to recover exactly the relevant features in most cases. Details are in the supplementary material section due to lack of space.\n\n3.2 Real-world data experiment results\n\nThis section evaluates our approach in terms of accuracy, scalability, and robustness across a range of real-world data sets: small, medium, and large. We will show that our PFS scheme works very well on medium and large data sets because, as was shown in Section 3.1, with sufficient data to estimate test scores, we expect our method to work well in terms of accuracy. On the small datasets, our approach is only competitive and does not dominate existing approaches, due to the lack of data to estimate scores well. However, we show that we can still use our PFS scheme as a pre-processing step to filter down the number of dimensions; this step reduces the dimensionality and speeds up existing FS methods by 3-5 times while keeping their accuracy.\n\n3.2.1 The data sets and competing methods\n\nLarge: TAC-KBP is a large data set with the number of samples and dimensions in the millions3; its domain is relation extraction from natural language text. 
Figure 1: Illustration of MI as a separable scoring function for the case of statistically dependent features. The top left point shows the scores for the 1st setting; the middle points show the scores for the 2nd setting; and the bottom points show the scores for the 3rd setting.\n\n3http://nlp.cs.qc.cuny.edu/kbp/2010/\n\nMedium: GISETTE and MADELON are the two largest data sets from the NIPS 2003 feature selection challenge4, with the number of dimensions in the thousands. Small: Colon, Leukemia, Lymph, NCI9, and Lung are chosen from the small micro-array datasets [6], along with the UCI datasets5. These sets typically have a few hundred to a few thousand variables, with only tens of data samples.\n\nWe compared our method with various baseline methods including mutual information maximization [14] (MIM), maximum relevancy minimum redundancy [21] (MRMR), conditional mutual information maximization [9] (CMIM), joint mutual information [25] (JMI), double input symmetrical relevance [16] (DISR), conditional infomax feature extraction [15] (CIFE), interaction capping [11] (ICAP), fast correlation based filter [26] (FCBF), local learning based feature selection [23] (LOGO), and the feature generating machine [24] (FGM).\n\n3.2.2 Accuracy\n\nFigure 2: Results from different methods on the TAC-KBP dataset. (a) Precision/recall of different methods; (b) top-5 keywords appearing in the top-20 features selected by our method. Dotted lines in (a) are FGM (or MIM) with our approach as a pre-processing step.\n\nAccuracy results on the large data set. As shown in Figure 2(a), our method dominates both MIM and FGM. Given the same precision, our method achieves 2-14\u00d7 higher recall than FGM, and 1.2-2.4\u00d7 higher recall than MIM. Other competitors do not finish execution in 12 hours. 
We compare the top features produced by our method and MIM, and find that our method is able to extract features that are strong indicators only when they are combined with other features, while MIM, which tests features individually, ignores this type of combination. We then validate that the features selected by our method make intuitive sense. For each relation, we select the top-20 features and report the keywords in these features.6 As shown in Figure 2(b), these top features selected by our method are good indicators of each relation. We also observe that using our approach as a pre-processing step improves the quality of FGM significantly. In Figure 2(a) (the dotted lines), we run FGM (MIM) on the top-10K features produced by our approach. We see that running FGM with pre-processing achieves up to 10\u00d7 higher recall than running FGM on all 1M features, given the same precision.\n\nAccuracy results on medium data sets Since the focus of the evaluation is to analyze the efficacy of feature selection approaches, we employed the same strategy as Brown et al. [4], i.e., the final classification is done using a k-nearest neighbor classifier with k fixed to three and Euclidean distance7.\n\nWe denote our method by Fk (and Wk), where F denotes the filter method (and W the wrapper method), and k denotes the number of tests (i.e., if N is the dimension of the data, then the total number of tests is kN). We bin each dimension of the data into five equally spaced bins when the data is real-valued; otherwise the data is not processed8. MI is used as the scoring function for the filter method, and log-likelihood is used for scoring the wrapper method. The wrapper we used is logistic regression9. For GISETTE we select up to 500 features and for MADELON we select up to 100 features. 
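As a concrete illustration of this kind of filter scoring, here is a minimal plug-in estimator of s(T) = I(X_T; Y) over binned data. This is our own sketch under the five-bin discretization described above, not the authors' implementation, and the toy data at the end is purely illustrative.

```python
import numpy as np
from collections import Counter

def binned(X, bins=5):
    """Discretize each column of X into `bins` equal-width bins."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / bins, 1.0)
    return np.minimum(((X - lo) / width).astype(int), bins - 1)

def mi_score(Xb, y, T):
    """Plug-in estimate of I(X_T; Y), treating the binned tuple of
    the features in T as a single discrete symbol."""
    n = len(y)
    keys = [tuple(row) for row in Xb[:, list(T)]]
    pxy = Counter(zip(keys, y))
    px, py = Counter(keys), Counter(y)
    return sum(c / n * np.log((c / n) / ((px[k] / n) * (py[v] / n)))
               for (k, v), c in pxy.items())

# Sanity check: feature 0 carries the label, feature 1 is pure noise,
# so a test containing feature 0 should score much higher.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
X = np.column_stack([y + 0.1 * rng.standard_normal(1000),
                     rng.standard_normal(1000)])
Xb = binned(X)
print(mi_score(Xb, y, [0]) > mi_score(Xb, y, [1]))  # True
```

Because each test scores one subset T independently of all others, these s(T) evaluations are exactly the work that can be farmed out in parallel.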
To get the test results, we use the features according to the smallest validation error for each method, and the results on the test set are reported in Table 1.\n\n4http://www.nipsfsc.ecs.soton.ac.uk/datasets/\n5http://archive.ics.uci.edu/ml/\n6Following the syntax used by Mintz et al. [17], if a feature has the form [\u21d1poss wif e \u21d3prop of ], we report the keyword as wife in Figure 2(b).\n7The classifier for FGM is a linear support vector machine (SVM), since it is optimized for the SVM criterion.\n8For the SVM based method, the real-valued data is not processed, and all data is normalized to have unit length.\n9The logistic regressor in the wrapper is used only to get the test scores; the final classification scheme is still k-NN.\n\n[Figure 2 content: (a) P/R curves for Ours, FGM, and MIM; (b) keyword columns for Spouse (wife, pictures, husband, married, widower), MemberOf (leader, member, rebels, commander, iraq), and TopMember (head, executive, chairman, general, leader, president).]\n\nTable 1: Test set balanced error rate (%) from different methods on NIPS datasets\n\nDatasets | Best Perf. | 2nd Best Perf. | 3rd Best Perf. | Median Perf. | Ours (W3) | Ours (W10) | Ours (F3) | Ours (F10)\nGISETTE | 2.15 | 3.06 | 3.09 | 3.86 | 2.72 | 2.89 | 4.85 | 4.69\nMADELON | 10.61 | 11.28 | 12.33 | 25.92 | 10.17 | 10.50 | 22.61 | 18.39\n\nAccuracy results on the small data sets. As expected, due to the lack of data to estimate scores, our accuracy performance is average on these data sets. Numbers can be found in the supplementary materials. However, as suggested by Theorem A.3 (in the supplementary materials), our method can also be used as a preprocessing step for other feature selection methods to eliminate a large portion of the features. 
In this case, we use the filter methods to filter the input down to a fraction e + 0.1 of the features, where e is the desired proportion of the features that one wants to preserve.\n\nFigure 3: Results from real-world datasets: a) the ratio between the errors of various methods applied to the original data and to the filtered data, in which a large portion of the dimensions is filtered out (a value larger than one indicates performance improvement); b) the speedup we get by applying our method as a pre-processing step to various methods across different datasets; the flat dashed line indicates where the speedup is one.\n\nUsing our method as a preprocessing step achieves a 3-5 times speedup compared to the time spent by the original methods, which take multiple passes through the datasets, and keeps or improves performance in most cases (see Figures 3a and 3b). The actual running times can be found in the supplementary materials.\n\n3.2.3 Scalability\n\nFigure 4: Scalability experiment of our approach.\n\nWe validate that our method is able to run on large-scale data sets efficiently, and that the ability to take advantage of parallelism is the key to its scalability.\n\nExperiment Setup Given the TAC-KBP data set, we report the execution time while varying the degree of parallelism, the number of features, and the number of examples. We first produce a series of data sets by sub-sampling the original data set with different numbers of examples ({10^4, 10^5, 10^6}) and numbers of features ({10^4, 10^5, 10^6}). 
We also try different degrees of parallelism by running our approach using a single thread, 4 threads on a 4-core CPU, 32 threads on a single 8-CPU (4-core/CPU) machine, and multiple machines available in the national Open Science Grid (OSG). For each combination of the number of features, number of examples, and degree of parallelism, we estimate the throughput as the number of tests that we can run in 1 second, and estimate the total running time accordingly. We also ran our largest data set (10^6 rows and 10^6 columns) on OSG and report the actual run time.\n\nDegree of Parallelism Figure 4(a) reports the (estimated) run time on the largest data set (10^6 rows and 10^6 columns) with different degrees of parallelism. [Figure 4 axes: time (seconds) versus (a) # cores, (b) # features, (c) # examples; series: OSG runtime, single thread, single 4-core CPU, single 8-CPU machine.] We first observe that running our approach requires a non-trivial amount of computational resources: if we only use a single thread, we need about 400 hours to finish. However, the running time of our approach decreases linearly with the number of cores that we use. If we run our approach on a single machine with 32 cores, it finishes in just 11 hours. 
This linear speed-up behavior allows our approach to scale to very large data sets: when we ran our approach on the national Open Science Grid, it finished in 2.2 hours (0.7 hours of actual execution and 1.5 hours of scheduling overhead).

The Impact of the Number of Features and Number of Examples  Figures 4(b) and 4(c) report the run time with different numbers of features and numbers of examples, respectively. In Figure 4(b) we fix the number of examples to 10^5 and vary the number of features; in Figure 4(c) we fix the number of features to 10^6 and vary the number of examples. As the number of features or examples increases, our approach uses more time; however, the running time never grows super-linearly. This behavior suggests the potential of our approach to scale to even larger data sets.

3.2.4 Stability and robustness

Our method exhibits several robustness properties. In particular, the proof of Theorem 2.2 suggests that performance improves as the number of tests increases, and in this section we evaluate this observation empirically. We picked four datasets from the UCI repository: KRVSKP, Landset, Splice and Waveform, as well as both NIPS datasets.

Figure 5: Change of performance with respect to the number of tests on several UCI datasets with (a) filter and (b) wrapper methods; and on the (c) GISETTE and (d) MADELON datasets.

The trend is clear from Figure 5: the performance of both wrapper and filter methods improves as we increase the number of tests, which can be attributed to increased robustness against inferior estimates of the test scores.
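The test-aggregation idea behind this robustness can be sketched as follows. This is a schematic illustration of a group-testing-style selection loop, not the paper's exact procedure: the uniform subset sampling, the toy scoring function, and the average-score aggregation are all simplifying assumptions made for illustration.

```python
import random

def select_features(n_features, n_tests, subset_size, score_fn, top_k, seed=0):
    """Run `n_tests` randomized tests, each scoring a random subset of
    features via `score_fn`; accumulate each feature's scores and return
    the `top_k` features ranked by average accumulated score."""
    rng = random.Random(seed)
    totals = [0.0] * n_features
    counts = [0] * n_features
    for _ in range(n_tests):
        subset = rng.sample(range(n_features), subset_size)
        s = score_fn(subset)  # relevance of this feature subset
        for f in subset:
            totals[f] += s
            counts[f] += 1
    avg = [totals[f] / counts[f] if counts[f] else 0.0
           for f in range(n_features)]
    return sorted(range(n_features), key=lambda f: avg[f], reverse=True)[:top_k]

# Toy check: features 0 and 1 are the "true" features; a subset's score
# is simply how many true features it contains.
toy_score = lambda subset: sum(1 for f in subset if f < 2)
selected = select_features(n_features=10, n_tests=500, subset_size=3,
                           score_fn=toy_score, top_k=2)
```

Because the tests are independent, they can run in parallel, and averaging over more tests smooths out noisy individual scores, matching the trend seen in Figure 5.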
In addition, apart from the MADELON dataset, the performance converges quickly, typically around k = 10 ~ 15 tests. Additional stability experiments can be found in the supplementary materials, where we evaluate our method and others in terms of the consistency index.

Acknowledgements

CR acknowledges the support of the DARPA XDATA Program under No. FA8750-12-2-0335 and DEFT Program under No. FA8750-13-2-0039, DARPA's MEMEX program, the NSF CAREER Award under No. IIS-1353606 and EarthCube Award under No. ACI-1343760, the Sloan Research Fellowship, the ONR under awards No. N000141210041 and No. N000141310129, the Moore Foundation, American Family Insurance, Google, and Toshiba. HN is supported by NSF grants CNF-1409551, CCF-1319402, and CNF-1409551. XN is supported in part by NSF grants CCF-1115769, ACI-1342076, NSF CAREER award under DMS-1351362, and CNS-1409303. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, AFRL, NSF, ONR, or the U.S. government.