{"title": "Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 121, "page_last": 128, "abstract": "", "full_text": "Fast Feature Selection from Microarray\n\nExpression Data via Multiplicative\n\nLarge Margin Algorithms\n\nClaudio Gentile\n\nDICOM, Universit`a dell\u2019Insubria\nVia Mazzini, 5, 21100 Varese, Italy\n\ngentile@dsi.unimi.it\n\nAbstract\n\nNew feature selection algorithms for linear threshold functions are de-\nscribed which combine backward elimination with an adaptive regular-\nization method. This makes them particularly suitable to the classi\ufb01ca-\ntion of microarray expression data, where the goal is to obtain accurate\nrules depending on few genes only. Our algorithms are fast and easy to\nimplement, since they center on an incremental (large margin) algorithm\nwhich allows us to avoid linear, quadratic or higher-order programming\nmethods. We report on preliminary experiments with \ufb01ve known DNA\nmicroarray datasets. These experiments suggest that multiplicative large\nmargin algorithms tend to outperform additive algorithms (such as SVM)\non feature selection tasks.\n\nIntroduction\n\n1\nMicroarray technology allows researchers to simultaneously measure expression levels as-\nsociated with thousands or ten thousands of genes in a single experiment (e.g., [7]). How-\never, the number of replicates in these experiments is often seriously limited (tipically a\nfew dozen). This gives rise to datasets having a large number of gene expression values\n(numerical components) and a relatively small number of samples. As a popular example,\nin the \u201cLeukemia\u201d dataset from [10] we have only 72 observations of the expression level\nof 7129 genes. 
It is clear that in this extreme scenario machine learning methods related to\nfeature selection play a fundamental role for increasing ef\ufb01ciency and enhancing the com-\nprehensibility of the results. Besides, in biological and medical research \ufb01nding accurate\nclass prediction rules which depend on the level of expression of few genes is important for\na number of activities, ranging from medical diagnostics to drug discovery.\n\nWithin the classi\ufb01cation framework, a regularization method (also called penalty-based\nor feature weighting method) is an indirect route to feature selection. Whereas a (direct)\nfeature selection method searches in the combinatorial space of feature subsets, a regu-\nlarization method constrains the magnitudes of the parameters assigning them a \u201cdegree\nof relevance\u201d during learning, thereby performing feature selection as a by-product of its\nlearning mechanism (see, e.g., [16, 19, 17, 14, 4, 20]). Feature selection is a wide and\nactive \ufb01eld of research; the reader is referred to [15] for a valuable survey. See also, e.g.,\n[3, 6] (and references therein) for speci\ufb01c work on gene expression data.\n\nIn this paper, we introduce novel feature selection algorithms for linear threshold functions,\n\n\fwhose core learning procedure is an incremental large margin algorithm called1 ALMAp\n(Approximate Large Margin Algorithm w.r.t. norm p) [8]. Our ALMAp-based feature selec-\ntion algorithms lie between a direct feature selection method and a regularization method.\nThese algorithms might be considered as a re\ufb01nement on a recently proposed method,\nspeci\ufb01cally tested on microarray expression data, called Recursive Feature Elimination\n(RFE) [13]. RFE uses Support Vector Machines (SVM) as the core learning algorithm,\nand performs backward selection to greedily remove the feature whose associated weight\nis smallest in absolute value until only the desired number of features remain. 
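The RFE scheme just described can be sketched generically. In the sketch below, `train` stands for any linear learner returning one weight per feature; the function names and the one-feature-per-round schedule are illustrative assumptions of ours, not the exact procedure of [13]:

```python
import numpy as np

def rfe(X, y, train, n_keep):
    """Backward selection: repeatedly train a linear model and drop the
    feature whose weight is smallest in absolute value, until n_keep
    features remain.  train(X, y) -> weight vector w, one weight per
    column of X.  Returns the indices of the surviving features."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w = train(X[:, active], y)
        worst = int(np.argmin(np.abs(w)))  # least-relevant surviving feature
        del active[worst]
    return active
```

In practice RFE is often run removing chunks of features per round rather than one at a time; the loop above is the greedy one-at-a-time baseline.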
Our algo-\nrithms operate in a similar fashion, but they allow us to eliminate many features at once by\nexploiting margin information about the current training set. The degree of dimensionality\nreduction is ruled by the norm p in ALMAp. The algorithms start by being aggressive (sim-\nulating a multiplicative algorithm when the number of current features is large) and end by\nbeing gentle (simulating an additive algorithm such as SVM when few features are left).\nFrom a computational standpoint, our algorithms lie somewhere between a 1-norm and a\n2-norm penalization method. However, unlike other regularization approaches speci\ufb01cally\ntailored to feature selection, such as those in [4, 20], we do avoid computationally inten-\nsive linear (or nonlinear) programming methods. This is because we not only solve the\noptimization problem associated to regularization in an approximate way, but also use an\nincremental algorithm having the additional capability to smoothly interpolate between the\ntwo kinds of penalizations.\n\nOur algorithms are simple to implement and turn out to be quite fast. We made preliminary\nexperiments on \ufb01ve known DNA microarray datasets. In these experiments, we compared\nthe margin-based feature selection performed by our multiplicative algorithms to a stan-\ndard correlation-based feature selection method applied to both additive (SVM-like) and\nmultiplicative (Winnow-like) core learning procedures. When possible, we tried to follow\nprevious experimental settings, such as those in [13, 22, 20]. 
The conclusion of our preliminary study is that a multiplicative (large margin) algorithm is often better than an SVM-like algorithm when the goal is to compute linear threshold rules that are both accurate and depend on the value of few components (as is often the case in gene expression datasets).

2 Preliminaries and notation

An example is a pair (x, y), where x is an instance vector lying in R^f and y ∈ {−1, +1} is the binary label associated with x. A training set S is a sequence of examples S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (R^f × {−1, +1})^m. When F ⊆ {1, ..., f} is a set of features and v ∈ R^f, we denote by v|F the subvector of v where the features/dimensions not in F are eliminated. Also, S|F denotes the training set S|F = ((x_1|F, y_1), ..., (x_m|F, y_m)). A weight vector w = (w_1, ..., w_f) ∈ R^f represents a hyperplane passing through the origin. As usual, we associate with w the (zero-threshold) linear threshold function w : x → sign(w · x) = 1 if w · x ≥ 0 and = −1 otherwise. When p ≥ 1 we denote by ||w||_p the p-norm of w, i.e., ||w||_p = (Σ_{i=1..f} |w_i|^p)^{1/p} (also, ||w||_∞ = lim_{p→∞} (Σ_{i=1..f} |w_i|^p)^{1/p} = max_i |w_i|). We say that norm q is dual to norm p if q = p/(p−1). In this paper we assume that p and q are some pair of dual values, with p ≥ 2. We use p-norms for instance vectors and q-norms for weight vectors. For notational brevity, throughout this paper we use normalized instances x̂ = x/||x||_p, where p will be clear from the surrounding context. The (normalized) p-norm margin (or just the margin) of a hyperplane w with ||w||_q ≤ 1 on example (x, y) is defined as y w · x̂. If this margin is positive then w classifies (x, y) correctly. Notice that ||x||_p ≤ f^{1/p} ||x||_∞ for any x ∈ R^f. Hence if p is logarithmic in the number of features/dimensions of x, i.e., p = ln f, we obtain ||x||_{ln f} ≤ e ||x||_∞.

1 Broadly speaking, as the norm parameter p is varied, ALMA_p is able to (approximately) interpolate between Support Vector Machines [5] and (large margin versions of) multiplicative classification algorithms, such as Winnow [16]. Compared to Winnow, ALMA_p is more flexible (since we can adjust the norm parameter p) and requires less tuning. See Section 3 for details.

ALGORITHM ALMA_p(S, α)
Input: Training set S = ((x_1, y_1), ..., (x_m, y_m)); norm parameter p ≥ 2; approximation parameter α ∈ (0, 1].
Initialization: w_1 = 0; k = 1.
For t = 1, 2, ... do:
  Get example (x_t, y_t) and update weights as follows:
  Set: γ_k = √(8/(p−1)) · (1/α) · (1/√k);  η_k = √(2/(p−1)) · (1/√k).
  If y_t w_k · x̂_t ≤ (1 − α) γ_k
  then: w'_k = T^{−1}(T(w_k) + η_k y_t x̂_t),
        w_{k+1} = w'_k / ||w'_k||_q, where q = p/(p−1),
        k ← k + 1.
Output: Final weight vector w_k = (w_{k,1}, ..., w_{k,f}); final margin γ = γ_k.

Figure 1: The approximate large margin algorithm ALMA_p.

Also, ||w||_1 ≤ 1 implies ||w||_q ≤ 1 for any q > 1. Thus if ||w||_1 ≤ 1 the (ln f)-norm margin y w · x/||x||_{ln f} is actually bounded from below by the 1-norm margin y w · x/||x||_∞ divided by some constant. Arguing about the 1-norm margin is convenient when dealing with sparse hyperplanes, i.e., with hyperplanes having only a small number of relevant features (e.g., [14]). We say that a training set S = ((x_1, y_1), ..., (x_m, y_m)) is linearly separable with margin γ > 0 when there exists a hyperplane w with ||w||_q ≤ 1 such that y_t w · x̂_t ≥ γ for t = 1, ..., m. Given α ∈ (0, 1], we say that hyperplane w' is an α-approximation to w (w.r.t. training set S) if ||w'||_q ≤ 1 and y_t w' · x̂_t ≥ (1 − α) γ holds for t = 1, ..., m. In particular, if the underlying margin is a 1-norm margin (and α is not close to 1) then w' tends to share the sparsity properties of w. See also Section 3.

3 The large margin algorithm ALMA_p

ALMA_p is a large margin variant of the p-norm Perceptron algorithm2 introduced by [11] (see also [9]). The version of the algorithm we have used in our experiments is described in Figure 1, where the one-one mapping T = (T_1, ..., T_f) : R^f → R^f is the gradient of the scalar function (1/2)||·||_q² and its inverse T^{−1} = (T_1^{−1}, ..., T_f^{−1}) : R^f → R^f is the gradient of the (Legendre dual) function (1/2)||·||_p². The mapping T depends on the chosen norm p, which we omit for notational brevity. One can immediately see that p = q = 2 gives T = T^{−1} = identity. See [9] for further discussion about the properties of T. The algorithm in Figure 1 takes in input a training set S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (R^f × {−1, +1})^m, a norm value p ≥ 2 and a parameter α ∈ (0, 1], measuring the degree of approximation to the optimal margin hyperplane. Learning proceeds in a sequence of trials. ALMA_p maintains a normalized vector w_k of f weights. It starts from w_1 = 0 and in the generic trial t it processes example (x_t, y_t). If the current weight vector w_k classifies (x_t, y_t) with (normalized) margin not larger than (1 − α) γ_k then the algorithm updates its internal state. The update rule consists of the following: first, the algorithm computes w'_k via a (p-norm) perceptron-like update rule; second, w'_k is normalized w.r.t. the chosen norm q (recall that q is dual to p). The normalized vector w_{k+1} will then be used in the next trial.
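The trial loop of Figure 1 can be rendered in code. Below is a minimal NumPy sketch under assumptions of ours (the function names and a fixed number of epochs as the stopping rule are not from the paper); the mapping T and its inverse are implemented as the dual gradients of (1/2)||·||_q² and (1/2)||·||_p²:

```python
import numpy as np

def grad_half_sq_norm(v, r):
    """Gradient of (1/2)||v||_r^2.  For r = q this is the mapping T,
    for r = p its inverse T^{-1} (the two gradients are dual)."""
    nrm = np.linalg.norm(v, ord=r)
    if nrm == 0.0:
        return np.zeros_like(v)
    return np.sign(v) * np.abs(v) ** (r - 1.0) / nrm ** (r - 2.0)

def alma_p(X, y, p=2.0, alpha=0.9, epochs=50):
    """One possible rendering of ALMA_p (Figure 1).
    X: (m, f) instance matrix, y: (m,) labels in {-1, +1}.
    Returns the final weight vector and the final margin gamma."""
    p = max(p, 2.0)
    q = p / (p - 1.0)                              # dual norm of p
    # normalized instances x_hat = x / ||x||_p
    Xn = X / np.maximum(
        np.linalg.norm(X, ord=p, axis=1, keepdims=True), 1e-12)
    m, f = X.shape
    w = np.zeros(f)
    k = 1
    for _ in range(epochs):
        for t in range(m):
            gamma_k = np.sqrt(8.0 / (p - 1.0)) / (alpha * np.sqrt(k))
            if y[t] * np.dot(w, Xn[t]) <= (1.0 - alpha) * gamma_k:
                eta_k = np.sqrt(2.0 / (p - 1.0)) / np.sqrt(k)
                # w' = T^{-1}(T(w) + eta_k * y_t * x_hat_t)
                theta = grad_half_sq_norm(w, q) + eta_k * y[t] * Xn[t]
                w = grad_half_sq_norm(theta, p)
                w = w / np.linalg.norm(w, ord=q)   # normalize w.r.t. norm q
                k += 1
    return w, np.sqrt(8.0 / (p - 1.0)) / (alpha * np.sqrt(k))
```

For p = q = 2 both gradient mappings reduce to the identity and the update becomes a normalized margin-perceptron step.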
After sweeping (typically more than once) through the training set, the algorithm outputs an f-dimensional vector w_k which represents the linear model the algorithm has learned from the data. The output also includes the final margin γ = γ_k, where k is the total number of updates (plus one) the algorithm took to compute w_k. This margin is a valuable indication of the level of "noise" in the data. In particular, when the training set S is linearly separable, we can use γ to estimate from above the true margin γ* of S (see Theorem 1). In turn, γ* helps us in setting up a reliable feature selection process (see Section 4). Theorem 1 is a convergence result stating two things [8]: 1. ALMA_p(S, α) computes an α-approximation to the maximal p-norm margin hyperplane after a finite number of updates; 2. the margin γ output by ALMA_p(S, α) is an upper bound on the true margin of S.3

2 The p-norm Perceptron algorithm is a generalization of the classical Perceptron algorithm, which is recovered by setting p = 2.

Theorem 1 [8] Let γ* = max_{w ∈ R^f : ||w||_q = 1} min_{t=1,...,m} y_t w · x̂_t > 0. Then the number of updates made by the algorithm in Figure 1 (i.e., the number of trials t such that y_t w_k · x̂_t ≤ (1 − α) γ_k) is upper bounded by (2(p−1)/(γ*)²) (2/α − 1)² + 8/α − 4 = O((p−1)/(α² (γ*)²)). Furthermore, throughout the run of the algorithm we have γ_k ≥ γ ≥ γ*, for k = 1, 2, ... (recall that γ is the last γ_k produced by ALMA_p). Hence the previous bound is also an upper bound on the number of trials t such that y_t w_k · x̂_t ≤ (1 − α) γ.

Recalling Section 2, we notice that setting p = O(ln f) makes ALMA_p useful when learning sparse hyperplanes. In particular, the above theorem gives us the following 1-norm margin upper bound on the number of updates: O(ln f / (α² (γ*)²)), where γ* = max_{w ∈ R^f : ||w||_1 = 1} min_{t=1,...,m} y_t w · x_t / ||x_t||_∞. This is similar to the behavior exhibited by classifiers based on linear programming (e.g., [17, 19, 4] and references therein), as well as to the performance achieved by multiplicative algorithms, such as the zero-threshold Winnow algorithm [11].

4 The multiplicative feature selection algorithms

We now describe two feature selection algorithms based on ALMA_p. The algorithms differ in the way features are eliminated. The first algorithm, called ALMA-FS (ALMA-based Feature Selection), is strongly influenced by its training behavior: if ALMA_p has made many updates during training then arguably this corresponds to a high level of noise in the data (w.r.t. a linear model), and in this case the feature selection mechanism tends to be prudent in eliminating features. On the other hand, if the number of updates is small we can think of the linear model computed by ALMA_p as an accurate one for the training data at hand, so that one can reliably perform a more aggressive feature removal. The second algorithm, called ALMAln-RFE, performs Recursive Feature Elimination (RFE) on the linear model computed by ALMA_p, and might be seen as a simplified version of the first one, where the rate of feature removal is constant and the final number of features is fixed ahead of time.
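Both algorithms share the same elimination primitive: train, sort the |w_i|, and keep the smallest prefix whose q-norm mass reaches 1 − (c(α)γ)^q (rule (1) of Figure 2, with c(α) = α(1 − α), the choice used in our experiments). A sketch of one such stage, assuming the ALMA_p output w is q-normalized and with a function name of our choosing:

```python
import numpy as np

def fs_stage(w, gamma, alpha, p):
    """One ALMA-FS elimination stage.  Keeps the smallest prefix of the
    features, sorted by decreasing |w_i|, whose q-norm mass reaches
    1 - (c(alpha)*gamma)**q, with c(alpha) = alpha*(1 - alpha) and q
    dual to p.  Assumes ||w||_q = 1, as output by ALMA_p."""
    q = p / (p - 1.0)
    c = alpha * (1.0 - alpha)
    order = np.argsort(-np.abs(w))             # decreasing |w_i|
    mass = np.cumsum(np.abs(w[order]) ** q)    # running sum of |w_i|^q
    f_star = min(int(np.searchsorted(mass, 1.0 - (c * gamma) ** q)) + 1,
                 len(w))
    return np.sort(order[:f_star])             # surviving feature indices
```

A larger margin γ (less noise) shrinks the threshold 1 − (c(α)γ)^q and hence allows more features to be dropped in one stage, which is exactly the prudent/aggressive behavior described above.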
ALMA-FS is described in Figure 2. It takes in input a training set S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (R^n × {−1, +1})^m and a parameter α (which is the same as ALMA_p's). The algorithm repeatedly invokes ALMA_p on the same training set while progressively reducing the set F of current features. It starts with F = {1, ..., n}, n being the dimension of the input space. Then, on each repeat-until iteration, the algorithm: sets the norm p to the logarithm4 of the number f of current features, runs ALMA_p for the given values of α and p, gets in output w and γ, and computes the new (smaller) F to be used in the next iteration. Computing the new F amounts to sorting the components of w according to decreasing absolute value and then keeping, among the f features, only the largest ones (thereby eliminating features which are likely to be irrelevant). Here c(α) ∈ [0, 1] is a suitable function whose value will be specified later. We call a repeat-until iteration of this kind a feature selection stage. ALMA-FS terminates when it reaches a local minimum F, where the algorithm is unable to drop any further features.

ALMA-FS uses the output produced by ALMA_p in the most natural way, retaining only the features corresponding to (supposedly) relevant components of w. We point out that here the discrimination between relevant and irrelevant components is based on the margin γ output by ALMA_p. In turn, γ depends on the number of training updates made by ALMA_p, i.e., on the "amount of noise" in the data.

3 A more general statement holds for the nonseparable case (see [8] for details). In this case, the α parameter in ALMA_p(·, α) is similar to the C parameter in SVM [5].
4 In order to prevent p < 2, we actually set p = 2 when ln f < 2.

ALGORITHM ALMA-FS(S, α)
Input: Training set S = ((x_1, y_1), ..., (x_m, y_m)); approx. param. α ∈ (0, 1].
Initialization: F = {1, 2, ..., n}; f := |F| = n.
Repeat
  • Set p := max{2, ln f} and run ALMA_p(S|F, α), getting in output w = (w_1, ..., w_f) ∈ R^f and γ > 0;
  • Sort w_1, ..., w_f according to decreasing |w_i| and let w_{i_1}, ..., w_{i_f} be the sorted sequence; set q = p/(p−1) and compute the smallest f* ≤ f s.t.
      Σ_{j=1..f*} |w_{i_j}|^q ≥ 1 − (c(α) γ)^q;   (1)
  • Set F = {i_1, i_2, ..., i_{f*}}; f := |F| = f*;
Until F does not shrink any more.
Output: Final weight vector w = (w_1, ..., w_f).

Figure 2: ALMA-FS: Feature selection using ALMA_p where p is logarithmic in f.

This criterion can be viewed as a margin-based criterion according to the following fact: if in any given stage ALMA_p has computed an α-approximation to the maximal margin hyperplane for a (linearly separable) training sequence S, then the (smaller) vector computed at the end of that stage will be an (α + c(α))-approximation to the maximal margin hyperplane for the new (linearly separable) sequence where some features have been eliminated. This statement follows directly from (1) and Theorem 1. We omit the details due to space limitations. From this point of view, a reasonable choice of c(α) is one which insures α + c(α) ≤ 1 for α ∈ [0, 1] and the two limiting conditions lim_{α→0} α + c(α) = 0 and lim_{α→1} α + c(α) = 1. The simplest function satisfying the conditions above (the one we used in the experiments) is c(α) = α (1 − α).

ALMA-FS starts with a relatively large value of the norm parameter p (making it fairly aggressive at the beginning), and then progressively reduces this parameter so that the algorithm can focus in later stages on the remaining features. This heuristic approach allows us to keep a good approximation capability (as measured by the margin) while dropping a lot of irrelevant components from the weight vectors computed by ALMA_p.

ALMAln-RFE is a simplified version of ALMA-FS that halves the number of features in each stage, and again uses a norm p logarithmic in the number of current features. The α parameter is replaced by n_f, the desired number of features. ALMAln-RFE(S, n_f) is obtained from the algorithm in Figure 2 upon replacing the definition of f* in (1) by f* = max{⌊f/2⌋, n_f}, so that the number of training stages is always logarithmic in n/n_f.

5 Experiments

We tested ALMA-FS and ALMAln-RFE on a few well-known microarray datasets (see below). For the sake of comparison, we tended to follow previous experimental settings, such as those described in [13, 22, 20]. Our results are summarized in Table 1. For each dataset, we first generated a number of random training/test splits. Since we used on-line algorithms, the output depends on the order of the training sequence; therefore our random splits also included random permutations of the training set. The results shown in Table 1 are averaged over these random splits.

Five datasets have been used in our experiments.
1. The ALL-AML dataset [10] contains 72 samples, each with expression profiles about 7129 genes. The task is to distinguish between the two variants of leukemia, ALL and AML. We call this dataset the "Leukemia" dataset.
We used the first 38 examples as training set and the remaining 34 as test set. This seems to be a standard training/test split (e.g., [10, 21, 13, 22]). The results have been averaged over 1000 random permutations of the training set.
2. The "Colon Cancer" dataset [2] contains 62 expression profiles for tumor and normal samples concerning 2000 genes. Following [20], we randomly split the dataset into a training set of 50 examples and a test set of 12. The random split was performed 1000 times.
3. In the ER+/ER− dataset from [12] the task is to analyze expression profiles of breast cancer and classify breast tumors according to ER (Estrogen Receptor) status. This dataset (which we call the "Breast" dataset) contains 58 expression profiles concerning 3389 genes. We randomly split 1000 times into a training set of size 47 and a test set of size 11.
4. The "Prostate" cancer dataset from [18] contains 102 samples with expression profiles concerning 12600 genes. The task is to separate tumor from normal samples. As in [18], we estimated the test error through a Leave-One-Out Cross Validation (LOOCV)-like estimator. In particular, for this dataset we randomly split 1000 times into a training set of 101 examples and a test set of 1 example, and then averaged the results. (This is roughly equivalent to LOOCV with 10 random permutations of the training set.)
5. In the "Lymphoma" dataset [1] the goal is to separate cancerous and normal tissues in a large B-Cell lymphoma problem. The dataset contains 96 expression profiles concerning 4026 genes; 62 samples are in the classes "DLCL", "FL" and "CLL" (malignant) and the remaining 34 are labelled "otherwise". As in [20], we randomly split the dataset into a training set of size 60 and a test set of size 36. The random split was performed 1000 times.
We made no preprocessing on the data.
All our experiments have been run on a PC with a single AMD Athlon processor running at 1300 MHz. The running times we will be giving are measured on this machine. On these datasets we compared ALMA-FS ("FS" in Table 1) and ALMAln-RFE ("ln-RFE") to three more feature selection algorithms: a fast approximation to Recursive Feature Elimination applied to SVM (called ALMA2-RFE, abbreviated as "2-RFE" in Table 1), and a standard feature selection method based on correlation coefficients (e.g., [10]) applied to both (an approximation to) SVM and ALMA_{ln f}, f being the number of features selected by the correlation method. We call the last two methods ALMA2-CORR ("2-CORR" in Table 1) and ALMAln-CORR ("ln-CORR" in Table 1), respectively. In all cases our base learning algorithm was ALMA_p(·, α), where α ∈ {0.5, 0.6, 0.7, 0.8, 0.9}, and p was either 2 (to approximate SVM) or logarithmic in the number of features the algorithm was operating on (to simulate a multiplicative large margin algorithm). For each combination (algorithm, number of genes), only the best accuracy results (w.r.t. α) are shown. On the "Colon Cancer", the "Breast" and the "Lymphoma" datasets we ran ALMA_p by cycling 50 times over the current training set. On the "Leukemia" and the "Prostate" datasets (which are larger) we cycled 100 times. In Table 1 we give, for each dataset, the average error and the number of features ("# GENES") selected by the algorithms.5 The only algorithm which tries to determine the final number of features as a part of its inference mechanism is ALMA-FS: all the others take this number as an explicit input parameter.
The main goal of this experimental study was to carry out a direct comparison between different feature selection methods combined with different core learning algorithms.
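The correlation-coefficient filter behind 2-CORR and ln-CORR scores each gene individually. A sketch in the style of the signal-to-noise statistic of [10], s_i = (μ_i^+ − μ_i^−)/(σ_i^+ + σ_i^−); whether our runs used exactly this variant is an assumption here, and the function name is ours:

```python
import numpy as np

def snr_select(X, y, n_genes):
    """Rank genes (columns of X) by the Golub-style signal-to-noise score
    s_i = (mu_plus_i - mu_minus_i) / (sigma_plus_i + sigma_minus_i)
    and return the indices of the n_genes genes with largest |s_i|.
    y holds labels in {-1, +1}."""
    pos, neg = X[y == +1], X[y == -1]
    s = (pos.mean(axis=0) - neg.mean(axis=0)) / \
        (pos.std(axis=0) + neg.std(axis=0) + 1e-12)
    return np.argsort(-np.abs(s))[:n_genes]
```

Being a filter, this ranking is computed once, independently of the learner that is trained on the retained genes afterwards.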
Feature selection performed by ALMA-FS, ALMAln-RFE and ALMA2-RFE is margin-based, while feature selection performed by ALMA2-CORR and ALMAln-CORR is correlation-based. According to [15], the former falls within the category of wrapper methods, while the latter is an example of filter methods. The two core learning algorithms we employed are the SVM-like algorithm ALMA2 and the (large margin) Winnow-like algorithm ALMA_p with logarithmic p. The first has been used with ALMA2-RFE and ALMA2-CORR, the second has been used with ALMA-FS, ALMAln-RFE and ALMAln-CORR.

The accuracy results we have obtained are often superior to those reported in the literature, though this should not be considered very significant.6 From our direct comparison, however, a few (more reliable) conclusions can be drawn.

5 Observe that, due to the on-line nature of the algorithms, different sets of genes get selected on different runs. Therefore one could also collect statistics about the gene selection frequency over the runs. Details will be given in the full paper.

Table 1: Experimental results on five microarray datasets. The percentages denote the average fraction of misclassified patterns in the test set, while "# GENES" denotes the average number of genes (features) selected. The results refer to the same training/test splits. Notice that ALMA-FS ("FS") determines automatically the number of genes to select. According to the Wilcoxon signed rank test, a ≥ 0.5% accuracy difference might be considered significant.

[Table 1 body: test-error percentages of FS, 2-RFE, ln-RFE, 2-CORR and ln-CORR on the Leukemia, Colon Cancer, Breast, Prostate and Lymphoma datasets, with # GENES ∈ {20, 40, 60, 100, 200, ALL}; FS selected on average 26.5 (Leukemia), 22.6 (Colon Cancer), 38.5 (Breast), 30.8 (Prostate) and 30.8 (Lymphoma) genes.]
First, on these gene expression datasets a large margin Winnow-like algorithm generally outperforms an SVM-like algorithm. Second, despite the common wisdom [15] according to which wrapper methods tend to be more accurate than filter methods, it is hard to tell here how the two methods compare (see [22] for similar results). Third, knowing the "optimal" number of genes beforehand is valuable side information.

6 In fact, the results on feature selection applied to microarray datasets are not readily comparable across different papers, due to the randomness in the training/test splits (which is a relevant source of variance) and the different preprocessing of the data. That said, we briefly mention a few results reported by other researchers on the same datasets. On the "Leukemia" dataset, [22] report 0% test error for a logistic regression algorithm that chooses the number of features to extract by LOOCV. The same error rate is reported by [21] for a linear SVM using 20 genes. [20] use linear SVM as the underlying learning algorithm. On the "Colon Cancer" dataset, the authors report an average test error of 16.4% without feature selection and an error ranging between 15.0% and 16.9% (depending on the number of genes selected) for the RFE and the AROM (Approximation of the Zero-Norm Minimization) methods. On the "Lymphoma" dataset the same authors report 7.1% average error for linear SVM and 5.9% to 6.8% average error (again depending on the number of genes selected) for the RFE and the AROM methods. On the "Prostate" dataset, [18] use a k-NN classifier and report a LOOCV accuracy comparable to ALMA2-RFE's (but worse than ALMAln-CORR's).
Notice that, unlike many of the methods proposed in the\nliterature, ALMA-FS tries to determine in an automatic way a \u201cgood\u201d number of features to\nselect.7 In fact, due to the scarcity of examples and the large number of vector components,\nthe repeated use of cross-validation on the same validation set might lead to over\ufb01tting.\nALMA-FS seems to do a \ufb01ne job of it on three out of \ufb01ve datasets (on the \u201cBreast\u201d dataset\n\u201cFS\u201d should only be compared to \u201c2-RFE\u201d and \u201cln-RFE\u201d). Finally, we would like to stress\nthat our feature selection algorithms are quite fast. To give an idea, on the \u201cColon Cancer\u201d\nand the \u201cBreast\u201d datasets our algorithms take on average just a few seconds, while on the\n\u201cProstate\u201d dataset they take just a few minutes.\nReferences\n[1] Alizadeh, A., et al. (2000). Distinct types of diffuse large b-cell lymphoma identi\ufb01ed by gene\n\nexpression pro\ufb01ling. Nature, 403, 503\u2013511.\n\n[2] Alon, U., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of\ntumor and normal colon cancer tissues probed by oligonucleotide arrays. Cell Biol., 96, 6745\u2013\n6750.\n\n[3] Ben-Dor, A., et al. (2000). Tissue classi\ufb01cation with gene expression pro\ufb01les. J. Comput. Biol.,\n\n7, 559\u2013584.\n\n[4] Bradley, P., & Mangasarian, O. (1998). Feature selection via concave minimization and support\n\nvector machines. Proc. 15th ICML (pp. 82\u201390).\n\n[5] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273\u2013297.\n[6] Dudoit, S., Fridlyand, J., & Speed T.P. (2002). Comparison of discrimination methods for the\n\nclassi\ufb01cation of tumors using gene expression data. JASA, 97(457), 77\u201387.\n\n[7] Fodor, S. (1997). Massively parallel genomics. Science, 277, 393\u2013395.\n[8] Gentile, C. (2001a). A new approximate maximal margin classi\ufb01cation algorithm. 
JMLR, 2,\n\n213\u2013242.\n\n[9] Gentile, C. (2001b). The robustness of the p-norm algorithms. Machine Learning J., to appear.\n[10] Golub, T., et al. (1999). Molecular classi\ufb01cation of cancer: Class discovery and class prediction\n\nby gene expression. Science, 286, 531\u2013537.\n\n[11] Grove, A., Littlestone, N., & Schuurmans, D. (2001). General convergence results for linear\n\ndiscriminant updates. Machine Learning Journal, 43(3), 173\u2013210.\n\n[12] Gruvberger, S., et al. (2001). Estrogen receptor status in breast cancer is associated with re-\n\nmarkably distinct gene expression patterns. Cancer Res., 61, 5979\u20135984.\n\n[13] Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classi\ufb01cation\n\nusing support vector machines. Machine Learning Journal, 46(1-3), 389\u2013422.\n\n[14] Kivinen, J., Warmuth, M., & Auer, P. (1997). The perceptron algorithm vs. winnow: linear vs.\n\nlogarithmic mistake bounds when few input variables are relevant. AI, 97, 325\u2013343.\n\n[15] Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. AI, 97, 273\u2013324.\n[16] Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-\n\nthreshold algorithm. Machine Learning, 2, 285\u2013318.\n\n[17] Mangasarian, O. (1997). Mathematical programming in data mining. DMKD, 42(1), 183\u2013201.\n[18] Singh, D., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer\n\nCell, 1.\n\n[19] Tibshirani, R. (1995). Regression selection and shrinkage via the lasso. JRSS B, 1, 267\u2013288.\n[20] Weston, J., Elisseeff, A., Scholkopf, B., & Tipping, M. (2002). The use of zero-norm with\n\nlinear models and kernel methods. JMLR, to appear.\n\n[21] Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000). Feature\n\nselection for svms. Proc. NIPS 13.\n\n[22] Xing, E., Jordan, M., & Karp, R. (2001). 
Feature selection for high-dimensional genomic\n\nmicroarray data. Proc. 18th ICML.\n\n7The reader might object that the number of selected features can depend on the value of parameter\n(cid:11) in ALMAp. In practice, however, we observed that (cid:11) does not have a big in\ufb02uence on this number.\n\n\f", "award": [], "sourceid": 2527, "authors": [{"given_name": "Claudio", "family_name": "Gentile", "institution": null}]}